NAGPHORMER: A TOKENIZED GRAPH TRANSFORMER FOR NODE CLASSIFICATION IN LARGE GRAPHS

Abstract

The graph Transformer has emerged as a new architecture and has shown superior performance on various graph mining tasks. In this work, we observe that existing graph Transformers treat nodes as independent tokens and construct a single long sequence composed of all node tokens to train the Transformer model, which makes them hard to scale to large graphs due to the quadratic complexity of self-attention in the number of nodes. To this end, we propose the Neighborhood Aggregation Graph Transformer (NAGphormer), which treats each node as a sequence of tokens constructed by our proposed Hop2Token module. For each node, Hop2Token aggregates the neighborhood features of different hops into distinct representations, thereby producing a sequence of token vectors as one input. In this way, NAGphormer can be trained in a mini-batch manner and thus scales to large graphs. Moreover, we mathematically show that, compared with a category of advanced Graph Neural Networks (GNNs), the decoupled Graph Convolutional Network, NAGphormer learns more informative node representations from multi-hop neighborhoods. Extensive experiments on benchmark datasets ranging from small to large demonstrate that NAGphormer consistently outperforms existing graph Transformers and mainstream GNNs.

1. INTRODUCTION

Graphs, as a powerful data structure, are widely used to represent entities and their relations in a variety of domains, such as social networks in sociology and protein-protein interaction networks in biology. Their complex features (e.g., attribute features and topology features) make graph mining tasks very challenging. Graph Neural Networks (GNNs) (Chen et al., 2020c; Kipf & Welling, 2017; Veličković et al., 2018), owing to the message passing mechanism that aggregates neighborhood information for learning node representations (Gilmer et al., 2017), have been recognized as a type of powerful deep learning technique for graph mining tasks (Xu et al., 2019; Fan et al., 2019; Ying et al., 2018; Zhang & Chen, 2018; Jin et al., 2019) over the last decade. Though effective, message passing-based GNNs suffer from inherent limitations, including over-smoothing (Chen et al., 2020a) and over-squashing (Alon & Yahav, 2021) as model depth increases, which limits their capability for graph representation learning. Although recent efforts (Yang et al., 2020; Lu et al., 2021; Huang et al., 2020; Sun et al., 2022) have been devoted to alleviating the impact of the over-smoothing and over-squashing problems, the negative influence of these inherent limitations cannot be eliminated completely. Transformers (Vaswani et al., 2017), on the other hand, are well-known deep learning architectures that have recently shown superior performance on a variety of data with an underlying Euclidean or grid-like structure, such as natural language (Devlin et al., 2019; Liu et al., 2019) and images (Dosovitskiy et al., 2021; Liu et al., 2021).
Due to their great modeling capability, there is growing interest in generalizing Transformers to non-Euclidean data such as graphs (Dwivedi & Bresson, 2020). In this work, we observe that existing graph Transformers treat nodes as independent tokens and construct a single sequence composed of all node tokens to train the Transformer model, which incurs quadratic complexity in the number of nodes for the self-attention calculation. Training such a model on a large graph requires an amount of GPU memory that is generally unaffordable, since mini-batch training is unsuitable for graph Transformers that take a single long sequence as input. Meanwhile, effective strategies for making GNNs scalable to large graphs, including node sampling (Chen et al., 2018; Zou et al., 2019) and approximation propagation (Chen et al., 2020b; Feng et al., 2022), are not applicable to graph Transformers, which capture global attention over all node pairs and are independent of the message passing mechanism. The current paradigm of graph Transformers thus makes it intractable to generalize to large graphs.

To address this challenge, we propose a novel model dubbed Neighborhood Aggregation Graph Transformer (NAGphormer) for node classification in large graphs. Unlike existing graph Transformers that regard nodes as independent tokens, NAGphormer treats each node as a sequence and constructs its tokens with a novel neighborhood aggregation module called Hop2Token. The key idea behind Hop2Token is to aggregate the neighborhood features of each hop into a single representation, which can be regarded as a token. Hop2Token then constructs a sequence for each node from the tokens of the different hops so as to preserve the neighborhood information. These sequences are then fed into a Transformer-based module to learn the node representations.
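The hop-wise token construction described above can be sketched as a simple propagation loop: the hop-k token of a node is its feature vector after k rounds of normalized neighborhood aggregation. The following is a minimal illustration, not the paper's exact formulation; the function name and the choice of symmetric GCN-style normalization are our assumptions.

```python
import numpy as np

def hop2token(adj, feats, num_hops):
    """Build a token sequence per node from multi-hop aggregated features.

    adj:   (n, n) adjacency matrix (add self-loops beforehand if desired)
    feats: (n, d) node feature matrix
    Returns an array of shape (n, num_hops + 1, d): one token per hop,
    with hop 0 being the node's own features.
    """
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    # Symmetrically normalized propagation matrix D^{-1/2} A D^{-1/2}
    a_norm = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    tokens = [feats]
    h = feats
    for _ in range(num_hops):
        h = a_norm @ h          # aggregate one more hop of neighbors
        tokens.append(h)
    return np.stack(tokens, axis=1)
```

Because this runs once over the whole graph as preprocessing, each node afterwards carries its own fixed-length token sequence, independent of every other node.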
By treating each node as a sequence of tokens, NAGphormer can be trained in a mini-batch manner and hence can handle large graphs even with limited GPU resources. Considering that neighbors in different hops contribute differently to the final node representation, NAGphormer further provides an attention-based readout function that learns the importance of each hop adaptively. Moreover, we provide a theoretical analysis of the relationship between NAGphormer and an advanced category of GNNs, the decoupled Graph Convolutional Network (GCN) (Dong et al., 2021; Klicpera et al., 2019; Wu et al., 2019; Chien et al., 2021), showing that NAGphormer can learn more informative node representations from multi-hop neighborhoods.
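An attention-based readout over hop tokens can be sketched as follows: each hop token is scored against the hop-0 (node) token, the scores are normalized with a softmax, and the weighted hop tokens are combined with the node token. This is a hedged sketch only; the function name, the concatenation-based scoring, and the parameter vector `w` are our illustrative assumptions, not the paper's exact readout.

```python
import numpy as np

def attention_readout(z, w):
    """Adaptively combine per-hop outputs into one node representation.

    z: (K + 1, d) Transformer outputs for one node, z[0] being the hop-0 token
    w: (2 * d,) hypothetical learnable scoring weights
    Returns a (d,) vector.
    """
    z0, hops = z[0], z[1:]
    # Score each hop token jointly with the node token
    pairs = np.concatenate([np.tile(z0, (len(hops), 1)), hops], axis=1)
    scores = pairs @ w
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()            # softmax over hops
    # Node token plus attention-weighted sum of hop tokens
    return z0 + (alpha[:, None] * hops).sum(axis=0)
```

With learned weights, hops that matter more for a given node receive larger coefficients, rather than every hop contributing equally.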

Figure 1: Model framework of NAGphormer. NAGphormer first uses a novel neighborhood aggregation module, Hop2Token, to construct a sequence for each node based on the tokens of different hops of neighbors. Then, NAGphormer learns the node representations using a Transformer backbone, and an attention-based readout function is developed to aggregate neighborhood information of different hops adaptively. An MLP-based module is used in the end for label prediction.
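Because Hop2Token runs once as a preprocessing step, every node's token sequence is an independent training sample, and the pipeline in Figure 1 reduces to standard mini-batching over nodes: memory scales with the batch size rather than the graph size. A hypothetical batching loop (names and shapes are our own illustration) might look like:

```python
import numpy as np

def minibatches(tokens, labels, batch_size, rng=None):
    """Yield shuffled mini-batches of precomputed node token sequences.

    tokens: (n, K + 1, d) array produced by a Hop2Token-style preprocessing
    labels: (n,) node labels
    Each batch is self-contained, so the Transformer backbone never needs
    the full graph in GPU memory at once.
    """
    rng = rng or np.random.default_rng()
    order = rng.permutation(tokens.shape[0])
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield tokens[idx], labels[idx]
```

Each yielded batch would then be passed through the Transformer backbone, the attention-based readout, and the MLP classifier shown in Figure 1.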

