EXPHORMER: SCALING GRAPH TRANSFORMERS WITH EXPANDER GRAPHS

Abstract

Graph transformers have emerged as a promising architecture for a variety of graph learning and representation tasks. Despite their successes, though, it remains challenging to scale graph transformers to large graphs while maintaining accuracy competitive with message-passing networks. In this paper, we introduce EXPHORMER, a framework for building powerful and scalable graph transformers. EXPHORMER consists of a sparse attention mechanism based on expander graphs, whose mathematical characteristics, such as spectral expansion, pseudorandomness, and sparsity, yield graph transformers with complexity only linear in the size of the graph, while allowing us to prove desirable theoretical properties of the resulting transformer models. We show that incorporating EXPHORMER into the recently-proposed GraphGPS framework produces models with competitive empirical results on a wide variety of graph datasets, including state-of-the-art results on three datasets. We also show that EXPHORMER can scale to datasets of larger graphs than shown in previous graph transformer architectures.

1. INTRODUCTION

Graph learning has become an important and popular area of study that has yielded impressive results on a wide variety of graphs and tasks, including molecular graphs, social network graphs, knowledge graphs, and more. While much research around graph learning has focused on graph neural networks (GNNs), which are based on local message passing, a more recent approach that has garnered much interest involves graph transformers (GTs). Graph transformers largely operate by encoding graph structure in the form of a soft inductive bias. They can be viewed as a graph adaptation of the Transformer architecture (Vaswani et al., 2017), which has been highly successful in modeling sequential data in applications such as natural language processing. In contrast to GNNs, graph transformers allow nodes to attend to all other nodes in a graph, enabling direct modeling of long-range interactions. This allows them to avoid several limitations associated with local message-passing GNNs, such as oversmoothing (Oono & Suzuki, 2020), oversquashing (Alon & Yahav, 2021; Topping et al., 2022), and limited expressivity (Morris et al., 2019; Xu et al., 2018). The promise of graph transformers has led to a large number of different graph transformer models proposed in recent years (Dwivedi & Bresson, 2020; Kreuzer et al., 2021; Ying et al., 2021; Mialon et al., 2021). A related issue is the need to identify the location and structure of nodes within the graph, which has led to a number of proposals for positional and structural encodings for graph transformers (Lim et al., 2022).

One major challenge for graph transformers is their poor scalability: the standard global attention mechanism incurs time and memory complexity of O(|V|^2), quadratic in the number of nodes in the graph.
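The quadratic cost is easy to make concrete with a back-of-envelope sketch. The function name and the float32/single-head/single-layer accounting below are illustrative assumptions; the 160K-node figure matches the scale of ogbn-arxiv, discussed later in the paper.

```python
def dense_attention_bytes(num_nodes: int, bytes_per_entry: int = 4) -> int:
    """Memory needed to materialize one n-by-n float32 attention score matrix."""
    return num_nodes * num_nodes * bytes_per_entry

# A 5,000-node graph: the score matrix is ~0.1 GB, so full attention is feasible.
small = dense_attention_bytes(5_000)      # 100,000,000 bytes
# A 160,000-node graph: ~102 GB per head, per layer -- far beyond GPU memory.
large = dense_attention_bytes(160_000)    # 102,400,000,000 bytes
```
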
While this cost is often acceptable for datasets with small graphs (e.g., molecular graphs), it can be prohibitively expensive for datasets containing larger graphs, where graph transformer models often do not fit in memory even on high-memory GPUs, and hence would require much more complex and slower schemes to apply. Moreover, despite the expressivity advantages of graph transformer networks (Kreuzer et al., 2021), graph transformer-based architectures have often lagged behind their message-passing counterparts in accuracy in many practical settings.

A breakthrough came with the recent advent of GraphGPS (Rampásek et al., 2022), a modular framework for constructing graph transformers by mixing and matching various positional and structural encodings with local message passing and a global attention mechanism. To overcome the quadratic complexity of the "dense" full transformer, GraphGPS also incorporates "sparse" attention mechanisms, such as Performer (Choromanski et al., 2021) and BigBird (Zaheer et al., 2020), which aim to improve scalability. This combination of transformers and GNNs achieves state-of-the-art performance on a wide variety of datasets.

Despite these successes, the aforementioned works leave open some important questions. For instance, unlike pure attention-based approaches (e.g., SAN, Graphormer), GraphGPS combines transformers with message passing, which raises the question of how much of the realized accuracy gains are due to the transformers themselves. Indeed, Rampásek et al.'s ablation studies showed that the impact of the transformer component of the model is limited: on a number of datasets, higher accuracies can be achieved in the GraphGPS framework by not using attention at all (rather than, say, BigBird). The question remains, then, of whether transformers in their own right can obtain results on par with message-passing based approaches while scaling to large graphs.
Another major question concerns graph transformers' scalability. While BigBird and Performer are linear attention mechanisms, they still incur computational overhead that dominates the per-epoch computation time for moderately-sized graphs. The GraphGPS work tackles datasets with graphs of up to 5,000 nodes, a regime in which the full-attention transformer is in fact computationally faster than many sparse linear-attention mechanisms. Perhaps more suitable sparse attention mechanisms could enable their framework to operate on even larger graphs. Relatedly, existing sparse attention mechanisms have largely been designed for sequences, which are natural for language tasks but behave quite differently from graphs. Indeed, the ablation studies show that BigBird and Performer are not as effective on graphs as they are in the sequence world. Thus, it is natural to ask whether one can design sparse attention mechanisms more tailored to learning interactions on general graphs.

Our contributions. We design a sparse attention mechanism, EXPHORMER, whose computational cost is linear in the number of nodes and edges. We introduce expander graphs as a powerful primitive in designing scalable graph transformer architectures. Expander graphs have several desirable properties - small diameter, spectral approximation of the complete graph, and good mixing properties - which make them a suitable ingredient in a sparse attention mechanism. As a result, we are able to show that EXPHORMER, which combines expander graphs with global nodes and local neighborhoods, spectrally approximates the full attention mechanism with only a small number of layers, and has universal approximation properties. Furthermore, we show the efficacy of EXPHORMER within the GraphGPS framework: combining EXPHORMER with GNNs helps achieve strong performance on a number of datasets, including state-of-the-art results on CIFAR10, MNIST, and PATTERN.
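The three-ingredient attention pattern described above (local neighborhoods, expander edges, global nodes) can be sketched roughly as follows. This is an illustrative sketch, not the paper's exact construction: the function and parameter names (`expander_attention_edges`, `degree`, `num_global`) are assumptions, and the expander overlay is sampled here as a union of random permutations, a simple stand-in that yields a good expander with high probability.

```python
import random

def expander_attention_edges(num_nodes, local_edges, degree=3, num_global=1, seed=0):
    """Build a sparse set of directed attention pairs from three ingredients:
    (1) the input graph's own edges, (2) a random-regular expander overlay,
    and (3) virtual global nodes connected to every real node."""
    rng = random.Random(seed)
    edges = set()

    # 1) Local attention: each node attends along original graph edges.
    for u, v in local_edges:
        edges.add((u, v))
        edges.add((v, u))

    # 2) Expander attention: union of `degree` random permutations, symmetrized.
    nodes = list(range(num_nodes))
    for _ in range(degree):
        perm = nodes[:]
        rng.shuffle(perm)
        for u, v in zip(nodes, perm):
            if u != v:
                edges.add((u, v))
                edges.add((v, u))

    # 3) Global attention: virtual nodes attend to and from all real nodes.
    for g in range(num_nodes, num_nodes + num_global):
        for u in nodes:
            edges.add((g, u))
            edges.add((u, g))

    return sorted(edges)

pattern = expander_attention_edges(6, [(0, 1), (1, 2)], degree=2, num_global=1)
```

The total number of attention pairs grows as O(|E| + (d + g)|V|), linear in the size of the graph, in contrast to the |V|^2 pairs of dense attention.
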
On many datasets, EXPHORMER is often even more accurate than full attention models, suggesting that our attention scheme provides a good inductive bias for where the model "should look," while being more efficient and less memory-intensive. Furthermore, EXPHORMER can scale to larger graphs than previously shown: we demonstrate that a pure EXPHORMER model can achieve strong results on ogbn-arxiv, a challenging transductive problem on a citation graph with over 160K nodes and a million edges, a setting in which full transformers are prohibitively expensive due to memory constraints.

2. RELATED WORK

Graph Neural Networks (GNNs). Early works in the area of graph learning and GNNs include the development of a number of architectures such as GCN (Defferrard et al., 2016; Kipf & Welling, 2017), GraphSage (Hamilton et al., 2017), GIN (Xu et al., 2018), GAT (Veličković et al., 2018), GatedGCN (Bresson & Laurent, 2017), and more. GNNs are based on a message-passing architecture that generally confines their expressivity to the limits of the 1-Weisfeiler-Lehman (1-WL) isomorphism test (Xu et al., 2018). A number of recent papers have sought to augment GNNs to improve their expressivity. For instance, one approach has been to use additional features that allow nodes to be distinguished - such as a one-hot encoding of the node (Murphy et al., 2019) or a random scalar feature (Sato et al., 2021) - or to encode positional or structural information of the graph - e.g., skip-gram based network embeddings (Qiu et al., 2018), substructure counts (Bouritsas et al., 2020), or Laplacian

