NAGPHORMER: A TOKENIZED GRAPH TRANSFORMER FOR NODE CLASSIFICATION IN LARGE GRAPHS

Abstract

The graph Transformer has emerged as a new architecture and has shown superior performance on various graph mining tasks. In this work, we observe that existing graph Transformers treat nodes as independent tokens and construct a single long sequence composed of all node tokens to train the Transformer model, making it hard to scale to large graphs due to the quadratic complexity of self-attention in the number of nodes. To this end, we propose the Neighborhood Aggregation Graph Transformer (NAGphormer), which treats each node as a sequence of tokens constructed by our proposed Hop2Token module. For each node, Hop2Token aggregates the neighborhood features from different hops into different representations, producing a sequence of token vectors as one input. In this way, NAGphormer can be trained in a mini-batch manner and thus scales to large graphs. Moreover, we mathematically show that, compared to a category of advanced Graph Neural Networks (GNNs), the decoupled Graph Convolutional Network, NAGphormer can learn more informative node representations from multi-hop neighborhoods. Extensive experiments on benchmark datasets ranging from small to large demonstrate that NAGphormer consistently outperforms existing graph Transformers and mainstream GNNs.
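The abstract describes Hop2Token only at a high level. The following minimal sketch (the function name, the symmetric normalization with self-loops, and the dense-matrix implementation are all our assumptions for illustration, not the paper's exact code) shows how aggregated features from hops 0 through K can form a per-node token sequence:

```python
import numpy as np

def hop2token(adj, feats, num_hops):
    """Hop2Token-style tokenization sketch: for each node, collect the
    aggregated features of its 0..K-hop neighborhoods as a token sequence.
    Uses a symmetrically normalized adjacency with self-loops, a common
    GNN propagation choice (an assumption here, not necessarily the
    paper's exact operator)."""
    n = adj.shape[0]
    a_hat = adj + np.eye(n)                 # add self-loops
    d_inv_sqrt = np.diag(a_hat.sum(axis=1) ** -0.5)
    p = d_inv_sqrt @ a_hat @ d_inv_sqrt     # normalized propagation matrix
    tokens = [feats]                        # hop 0: the node's own features
    x = feats
    for _ in range(num_hops):
        x = p @ x                           # aggregate one more hop
        tokens.append(x)
    # (num_nodes, num_hops + 1, feat_dim): one token sequence per node
    return np.stack(tokens, axis=1)
```

Because each node's token sequence has fixed length K + 1 and is precomputed, the Transformer can attend within each short sequence independently, which is what allows mini-batch training over nodes instead of full-graph attention.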

1. INTRODUCTION

Graphs, as a powerful data structure, are widely used to represent entities and their relations in a variety of domains, such as social networks in sociology and protein-protein interaction networks in biology. Their complex features (e.g., attribute features and topology features) make graph mining tasks very challenging. Graph Neural Networks (GNNs) (Chen et al., 2020c; Kipf & Welling, 2017; Veličković et al., 2018), owing to the message passing mechanism that aggregates neighborhood information for learning node representations (Gilmer et al., 2017), have been recognized over the last decade as a powerful class of deep learning techniques for graph mining tasks (Xu et al., 2019; Fan et al., 2019; Ying et al., 2018; Zhang & Chen, 2018; Jin et al., 2019). Though effective, message passing-based GNNs have a number of inherent limitations, including over-smoothing (Chen et al., 2020a) and over-squashing (Alon & Yahav, 2021) as model depth increases, which limit their capability for graph representation learning. Although recent efforts (Yang et al., 2020; Lu et al., 2021; Huang et al., 2020; Sun et al., 2022) have been devoted to alleviating over-smoothing and over-squashing, the negative influence of these inherent limitations cannot be eliminated completely. Transformers (Vaswani et al., 2017), on the other hand, are well-known deep learning architectures that have shown superior performance on a variety of data with an underlying Euclidean or grid-like structure, such as natural language (Devlin et al., 2019; Liu et al., 2019) and images (Dosovitskiy et al., 2021; Liu et al., 2021). Due to their great modeling capability, there is growing interest in generalizing Transformers to non-Euclidean data like graphs (Dwivedi & Bresson, 2020;

