EXPHORMER: SCALING GRAPH TRANSFORMERS WITH EXPANDER GRAPHS

Abstract

Graph transformers have emerged as a promising architecture for a variety of graph learning and representation tasks. Despite their successes, though, it remains challenging to scale graph transformers to large graphs while maintaining accuracy competitive with message-passing networks. In this paper, we introduce EXPHORMER, a framework for building powerful and scalable graph transformers. EXPHORMER consists of a sparse attention mechanism based on expander graphs, whose mathematical characteristics, such as spectral expansion, pseudorandomness, and sparsity, yield graph transformers with complexity only linear in the size of the graph, while allowing us to prove desirable theoretical properties of the resulting transformer models. We show that incorporating EXPHORMER into the recently proposed GraphGPS framework produces models with competitive empirical results on a wide variety of graph datasets, including state-of-the-art results on three datasets. We also show that EXPHORMER can scale to larger graphs than those handled by previous graph transformer architectures.

1. INTRODUCTION

Graph learning has become an important and popular area of study that has yielded impressive results on a wide variety of graphs and tasks, including molecular graphs, social network graphs, knowledge graphs, and more. While much research on graph learning has focused on graph neural networks (GNNs), which are based on local message passing, a more recent approach that has garnered much interest involves graph transformers (GTs). Graph transformers largely operate by encoding graph structure in the form of a soft inductive bias. They can be viewed as an adaptation of the Transformer architecture (Vaswani et al., 2017), which has been highly successful in modeling sequential data in applications such as natural language processing. In contrast to GNNs, graph transformers allow nodes to attend to all other nodes in the graph, enabling direct modeling of long-range interactions. This allows them to avoid several limitations associated with local message-passing GNNs, such as oversmoothing (Oono & Suzuki, 2020), oversquashing (Alon & Yahav, 2021; Topping et al., 2022), and limited expressivity (Morris et al., 2019; Xu et al., 2018). The promise of graph transformers has led to a large number of graph transformer models being proposed in recent years (Dwivedi & Bresson, 2020; Kreuzer et al., 2021; Ying et al., 2021; Mialon et al., 2021). Because graph transformers must identify the location and structure of nodes within the graph, a number of positional and structural encodings have also been proposed for them (Lim et al., 2022).

One major challenge for graph transformers is their poor scalability: the standard global attention mechanism incurs time and memory complexity of O(|V|^2), quadratic in the number of nodes in the graph.
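To make the scaling contrast concrete, the following minimal sketch (not the EXPHORMER implementation, and with untrained identity projections for brevity) computes attention where each node attends only to d neighbors, as in a d-regular expander overlay. It stores n*d attention scores instead of the n*n dense score matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_attention(x: np.ndarray, nbrs: np.ndarray) -> np.ndarray:
    """Attention in which node i attends only to the nodes in nbrs[i].

    x:    (n, h) node features, used as queries, keys, and values here.
    nbrs: (n, d) neighbor indices per node.
    Materializes only n*d scores rather than the dense n*n matrix.
    """
    q, k, v = x, x, x                             # no learned projections
    scores = np.einsum("nh,ndh->nd", q, k[nbrs])  # (n, d) per-edge scores
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)             # softmax over neighbors
    return np.einsum("nd,ndh->nh", w, v[nbrs])    # (n, h) outputs

n, d, h = 1000, 8, 16
x = rng.standard_normal((n, h))
# Random neighbors stand in for an expander here; a true expander would
# come from an explicit d-regular construction with spectral guarantees.
nbrs = rng.integers(0, n, size=(n, d))
out = sparse_attention(x, nbrs)
print(out.shape)           # (1000, 16)
print(n * d, "vs", n * n)  # 8000 sparse scores vs 1000000 dense
```

For fixed degree d, the score storage grows linearly in n, which is the source of the linear complexity claimed for expander-based sparse attention.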
While this cost is often acceptable for datasets with small graphs (e.g., molecular graphs), it can be prohibitively expensive for datasets containing larger graphs, where graph transformer models often do not fit in memory even on high-memory GPUs and would therefore require significantly more complex and slower training schemes. Moreover, despite the expressivity advantages of graph transformer networks (Kreuzer et al., 2021), graph transformer-based architectures have often lagged behind their message-passing counterparts in accuracy in many practical settings. A breakthrough came with the recent advent of GraphGPS (Rampásek et al., 2022), a modular framework for constructing graph transformers by mixing and matching various positional and structural encodings with local message passing and a global attention mechanism. To overcome

