EXPHORMER: SCALING GRAPH TRANSFORMERS WITH EXPANDER GRAPHS

Abstract

Graph transformers have emerged as a promising architecture for a variety of graph learning and representation tasks. Despite their successes, though, it remains challenging to scale graph transformers to large graphs while maintaining accuracy competitive with message-passing networks. In this paper, we introduce EXPHORMER, a framework for building powerful and scalable graph transformers. EXPHORMER consists of a sparse attention mechanism based on expander graphs, whose mathematical characteristics, such as spectral expansion, pseudorandomness, and sparsity, yield graph transformers with complexity only linear in the size of the graph, while allowing us to prove desirable theoretical properties of the resulting transformer models. We show that incorporating EXPHORMER into the recently proposed GraphGPS framework produces models with competitive empirical results on a wide variety of graph datasets, including state-of-the-art results on three datasets. We also show that EXPHORMER can scale to datasets on larger graphs than shown in previous graph transformer architectures.

Under review as a conference paper at ICLR 2023

To overcome the quadratic complexity of the "dense" full transformer, they also incorporate "sparse" attention mechanisms, like Performer (Choromanski et al., 2021) or BigBird (Zaheer et al., 2020). These sparse transformer mechanisms are an attempt at improving the scalability. This combination of Transformers and GNNs achieves state-of-the-art performance on a wide variety of datasets.

Despite great successes, the aforementioned works leave open some important questions. For instance, unlike pure attention-based approaches (e.g., SAN, Graphormer), GraphGPS combines transformers with message passing, which brings into question how much of the realized accuracy gains are due to transformers themselves.
Indeed, Rampásek et al.'s ablation studies showed that the impact of the transformer component of the model is limited: on a number of datasets, higher accuracies can be achieved in the GraphGPS framework by not using attention at all (rather than, say, BigBird). The question remains, then, of whether transformers in their own right can obtain results on par with message-passing based approaches while scaling to large graphs. Another major question concerns graph transformers' scalability. While BigBird and Performer are linear attention mechanisms, they still incur computational overhead that dominates the per-epoch computation time for moderately-sized graphs. The GraphGPS work tackles datasets with graphs of up to 5,000 nodes, a regime in which the full-attention transformer is in fact computationally faster than many sparse linear-attention mechanisms. Perhaps more suitable sparse attention mechanisms could enable their framework to operate on even larger graphs. Relatedly, existing sparse attention mechanisms have largely been designed for sequences, which are natural for language tasks but behave quite differently from graphs. From the ablation studies, BigBird and Performer are not as effective on graphs, unlike in the sequence world. Thus, it is natural to ask whether one can design sparse attention mechanisms more tailored to learning interactions on general graphs.

1. INTRODUCTION

Graph learning has become an important and popular area of study that has yielded impressive results on a wide variety of graphs and tasks, including molecular graphs, social network graphs, knowledge graphs, and more. While much research around graph learning has focused on graph neural networks (GNNs), which are based on local message-passing, a more recent approach to graph learning that has garnered much interest involves the use of graph transformers (GTs). Graph transformers largely operate by encoding graph structure in the form of a soft inductive bias. They can be viewed as a graph adaptation of the Transformer architecture (Vaswani et al., 2017), which has been successful in modeling sequential data in applications such as natural language processing. Graph transformers allow nodes to attend to all other nodes in a graph, allowing for direct modeling of long-range interactions, in contrast to GNNs. This allows them to avoid several limitations associated with local message-passing GNNs, such as oversmoothing (Oono & Suzuki, 2020), oversquashing (Alon & Yahav, 2021; Topping et al., 2022), and limited expressivity (Morris et al., 2019; Xu et al., 2018). The promise of graph transformers has led to a large number of different graph transformer models being proposed in recent years (Dwivedi & Bresson, 2020; Kreuzer et al., 2021; Ying et al., 2021; Mialon et al., 2021). A major issue with graph transformers is the need to identify the location and structure of nodes within the graph, which has also led to a number of proposals for positional and structural encodings for graph transformers (Lim et al., 2022). One major challenge for graph transformers is their poor scalability, as the standard global attention mechanism incurs time and memory complexity of $O(|V|^2)$, quadratic in the number of nodes in the graph.
While this cost is often acceptable for datasets with small graphs (e.g., molecular graphs), it can be prohibitively expensive for datasets containing larger graphs, where graph transformer models often do not fit in memory even for high-memory GPUs, and hence would require much more complex and slower schemes to apply. Moreover, despite the expressivity advantages of graph transformer networks (Kreuzer et al., 2021), graph transformer-based architectures have often lagged message-passing counterparts in accuracy in many practical settings. A breakthrough came about with the recent advent of GraphGPS (Rampásek et al., 2022), a modular framework for constructing graph transformers by mixing and matching various positional and structural encodings with local message passing and a global attention mechanism.

Our contributions. We design a sparse attention mechanism, EXPHORMER, that has computational cost linear in the number of nodes and edges. We introduce expander graphs as a powerful primitive in designing scalable graph transformer architectures. Expander graphs have several desirable properties (small diameter, spectral approximation of a complete graph, and good mixing properties) which make them a suitable ingredient in a sparse attention mechanism. As a result, we are able to show that EXPHORMER, which combines expander graphs with global nodes and local neighborhoods, spectrally approximates the full attention mechanism with only a small number of layers, and has universal approximation properties. Furthermore, we show the efficacy of EXPHORMER within the GraphGPS framework. That is, combining EXPHORMER with GNNs helps achieve good performance on a number of datasets, including state-of-the-art results on CIFAR10, MNIST, and PATTERN.
On many datasets, EXPHORMER is often even more accurate than full attention models, indicating that our attention scheme perhaps provides good inductive bias for places the model "should look," while being more efficient and less memory-intensive. Furthermore, EXPHORMER can scale to larger graphs than previously shown: we demonstrate that a pure EXPHORMER model can achieve strong results on ogbn-arxiv, a challenging transductive problem on a citation graph with over 160K nodes and a million edges, a setting in which full transformers are prohibitively expensive due to memory constraints.

2. RELATED WORK

Graph Neural Networks (GNNs). Early works in the area of graph learning and GNNs include the development of a number of architectures such as GCN (Defferrard et al., 2016; Kipf & Welling, 2017), GraphSage (Hamilton et al., 2017), GIN (Xu et al., 2018), GAT (Veličković et al., 2018), GatedGCN (Bresson & Laurent, 2017), and more. GNNs are based on a message-passing architecture that generally confines their expressivity to the limits of the 1-Weisfeiler-Lehman (1-WL) isomorphism test (Xu et al., 2018). A number of recent papers have sought to augment GNNs to improve their expressivity. For instance, one approach has been to use additional features that allow nodes to be distinguished, such as a one-hot encoding of the node (Murphy et al., 2019) or a random scalar feature (Sato et al., 2021), or to encode positional or structural information of the graph, e.g., skip-gram based network embeddings (Qiu et al., 2018), substructure counts (Bouritsas et al., 2020), or Laplacian eigenvectors (Dwivedi et al., 2021). Another direction has been to modify the message passing rule to allow the network to take further advantage of the graph structure, including the directional graph networks (DGN) of Beaini et al. (2021), which use Laplacian eigenvectors to define directional flows that are used for anisotropic message aggregation. Yet another has been to modify the underlying graph over which message passing occurs, as in higher-order GNNs (Morris et al., 2019) or the use of substructures such as junction trees (Fey et al., 2020) and simplicial complexes (Bodnar et al., 2021).

Graph transformer architectures. Attention mechanisms have been extremely successful in sequence modeling since the seminal work of Vaswani et al. (2017). The GAT architecture (Veličković et al., 2018) proposed using an attention mechanism to determine how a node aggregates information from its neighbors. It does not use a positional encoding for nodes, limiting its ability to exploit global structural information.
GraphBert (Zhang et al., 2020) uses the graph structure to determine an encoding of the nodes, but not for the underlying attention mechanism. Graph transformer models typically operate on a fully-connected graph in which every pair of nodes is connected, regardless of the connectivity structure of the original graph. Spectral Attention Networks (SAN) (Kreuzer et al., 2021) make use of two attention mechanisms, one on the fully-connected graph and one on the original edges of the input graph, while using Laplacian positional encodings for the nodes. Graphormer (Ying et al., 2021) uses a single dense attention mechanism but adds structural features in the form of centrality and spatial encodings. Meanwhile, GraphiT (Mialon et al., 2021) incorporates relative positional encodings based on diffusion kernels. GraphGPS (Rampásek et al., 2022) proposed a general framework for combining message-passing networks with attention mechanisms, while allowing for the mixing and matching of positional and structural embeddings. Specifically, the framework also allows for sparse transformer models like BigBird (Zaheer et al., 2020) and Performer (Choromanski et al., 2021).

Sparse Transformers. Standard (dense) transformers have quadratic complexity in the number of tokens, which limits their scalability to extremely long sequences. By contrast, sparse transformer models improve computational and memory efficiency by restricting the attention pattern, i.e., the pairs of nodes that can interact with each other. In addition to BigBird and Performer, there have been a number of other proposals for sparse transformers; Tay et al. (2020) provide a survey.

3. THE EXPHORMER ATTENTION MECHANISM

This section describes EXPHORMER, our sparse generalized attention mechanism that can be used in individual layers of a graph transformer architecture. We begin by describing graph attention mechanisms in general.

3.1. ATTENTION MECHANISM ON GRAPHS

An attention mechanism on $n$ tokens can be modeled by a directed graph $H$ on $[n] = \{1, 2, \ldots, n\}$, where a directed edge from $i$ to $j$ indicates a direct interaction between tokens $i$ and $j$, i.e., an inner product that will be computed by the attention mechanism. More precisely, a transformer block can be viewed as a function on the $d$-dimensional embeddings for each of $n$ tokens, mapping from $\mathbb{R}^{d \times n}$ to $\mathbb{R}^{d \times n}$. Let $X = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^{d \times n}$. A generalized (dot-product) attention mechanism $\mathrm{ATTN}_H : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ with attention pattern given by $H$ is defined by

$$\mathrm{ATTN}_H(X)_{:,i} = x_i + \sum_{j=1}^{h} W_O^j W_V^j X_{N_H(i)} \cdot \sigma\left( \left( W_K^j X_{N_H(i)} \right)^T W_Q^j x_i \right),$$

where $h$ is the number of heads and $m$ is the head size, while $W_K^j, W_Q^j, W_V^j \in \mathbb{R}^{m \times d}$ and $W_O^j \in \mathbb{R}^{d \times m}$. (The subscript $K$ is for "keys," $Q$ for "queries," $V$ for "values," and $O$ for "output.") Here $X_{N_H(i)}$ denotes the submatrix of $X$ obtained by picking out only those columns corresponding to elements of $N_H(i)$, the neighbors of $i$ in $H$. We can see that the total number of inner product computations for all $i \in [n]$ is given by the number of edges of $H$. A (generalized) transformer block consists of $\mathrm{ATTN}_H$ followed by a feedforward layer:

$$\mathrm{FF}(X) = \mathrm{ATTN}_H(X) + W_2 \cdot \mathrm{ReLU}\left( W_1 \cdot \mathrm{ATTN}_H(X) + b_1 \mathbb{1}_n^T \right) + b_2 \mathbb{1}_n^T,$$

where $W_1 \in \mathbb{R}^{r \times d}$, $W_2 \in \mathbb{R}^{d \times r}$, $b_1 \in \mathbb{R}^{r}$, and $b_2 \in \mathbb{R}^{d}$.

In the standard setting, the $n$ tokens are part of a sequence (e.g., language applications). However, we are concerned with the graph transformer setting in which the tokens are nodes of some underlying graph $G = (V, E)$ with $V = [n]$.
The attention computation is nearly identical, except that one can also optionally augment it with edge features, as is done in SAN (Kreuzer et al., 2021):

$$\mathrm{ATTN}_H(X)_{:,i} = x_i + \sum_{j=1}^{h} W_O^j W_V^j X_{N_H(i)} \cdot \sigma\left( \left( W_E^j E_{N_H(i)} \odot W_K^j X_{N_H(i)} \right)^T W_Q^j x_i \right),$$

where $W_E^j \in \mathbb{R}^{m \times d_E}$, $E_{N_H(i)}$ is the $d_E \times |N_H(i)|$ matrix whose columns are $d_E$-dimensional edge features for the edges connected to node $i$, and $\odot$ denotes element-wise multiplication.

The most typical cases of graph transformers use full (dense) attention, where every node attends to every other node: $H$ is the fully-connected directed graph. As this results in computational complexity $O(n^2)$ for the transformer block, which is prohibitively expensive for large graphs, we wish to replace full attention with a sparse attention mechanism, where $H$ has $o(n^2)$ edges, ideally $O(n)$. A number of sparse attention mechanisms have been proposed to address the aforementioned issue (see Tay et al., 2020), but the vast majority are designed specifically for functions on sequences. EXPHORMER, on the other hand, is a graph-centric sparse attention mechanism that makes use of the underlying structure of the input graph $G$. As we will see, EXPHORMER is perhaps most similar in design to the BigBird architecture (Zaheer et al., 2020) designed for functions on sequences, but takes advantage of the underlying graph structure. We can either use EXPHORMER layers in a pure graph transformer model, or in combination with a message passing network using the GraphGPS framework.
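The generalized attention defined above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes that $\sigma$ is a softmax over the neighbors of each node, omits edge features, and uses the hypothetical names `sparse_attention`, `neighbors`, and `heads`. The explicit loop over nodes makes it visible that one inner product is computed per edge of $H$.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def sparse_attention(X, neighbors, heads):
    """Generalized dot-product attention with pattern H (illustrative only).

    X         : (d, n) array, one column per token/node.
    neighbors : neighbors[i] is the non-empty list of columns in N_H(i).
    heads     : list of (W_Q, W_K, W_V, W_O) tuples with
                W_Q, W_K, W_V of shape (m, d) and W_O of shape (d, m).
    Assumption: sigma is taken to be a softmax over N_H(i).
    """
    d, n = X.shape
    out = X.copy()  # residual term x_i
    for i in range(n):
        X_nbr = X[:, neighbors[i]]  # d x |N_H(i)| submatrix
        for W_Q, W_K, W_V, W_O in heads:
            # One inner product per neighbor, i.e. per edge of H.
            scores = (W_K @ X_nbr).T @ (W_Q @ X[:, i])
            weights = softmax(scores)
            out[:, i] += W_O @ (W_V @ (X_nbr @ weights))
    return out
```

Passing `neighbors[i] = list(range(n))` for every `i` recovers dense attention, while a sparse neighbor list reduces the work to the number of edges of $H$.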

3.2. THE EXPHORMER ARCHITECTURE

We now describe the details of the construction of EXPHORMER, an expander-based sparse attention mechanism for graph transformers with $O(|V| + |E|)$ computation, where $G = (V, E)$ is the underlying input graph. The EXPHORMER architecture constructs an interaction graph $H$ that consists of three main components, as shown in Figure 1. The construction always has bidirectional edges, and so $H$ can be viewed as an undirected graph. The mechanism uses three types of edges:

1. Expander graph attention:

The main building block of our architecture is the use of edges from a random expander graph, described in more detail shortly. These graphs have several useful theoretical properties related to spectral approximation and random walk mixing (see Section 4), which allow propagating information between pairs of nodes that are distant in the input graph $G$ without connecting all pairs of nodes. In particular, we use a regular expander graph of constant degree, which allows the number of edges to be just $O(|V|)$. The process we use to construct a random expander graph is described in Appendix C.

2. Local neighborhood attention:

Another desirable property to capture is locality. Graphs carry much more topological structure than sequences, and the neighborhoods of individual nodes carry a lot of information about connectivity. Thus, we model local interactions by allowing each node v to attend to every other node that is an immediate neighbor of v in G: that is, H includes the input graph edges E as well as their reverses, introducing O(|E|) interaction edges. One generalization would be to allow direct attention within k-hop neighborhoods, but this might introduce a superlinear number of interactions on general graphs.

3. Global attention:

The final component is global attention, whereby a small number of virtual nodes are added to the interaction graph, and each such node is connected to all the non-virtual nodes. These nodes enable a global "storage sink" and help prove universality properties of EXPHORMER. We will generally add a constant number of virtual nodes, in which case the total number of edges due to global attention will be $O(|V|)$. The model uses learnable embeddings for expander and global connection edge features, and for virtual node features. Dataset edge features are used for the local neighborhood edge features.

Remark 3.1 EXPHORMER has some conceptual similarities with BigBird, as mentioned previously. For instance, we also make use of virtual global attention nodes, corresponding to BIGBIRD-ETC. However, our approach departs from that of BigBird in some important ways. While BigBird uses $w$-width "window attention" to capture locality of reference, we use local neighborhood attention to capture locality and graph topology. In particular, the interaction graph due to window attention in BigBird can be viewed as a Cayley graph on $\mathbb{Z}_n$, which is sequence-centric, while EXPHORMER is graph-centric and, therefore, uses the structure of the input graph itself to capture locality. BigBird, as implemented for graphs by Rampásek et al. (2022), instead simply orders the graph nodes in an arbitrary sequence and uses windows within that sequence. Both BigBird and EXPHORMER also make use of a random attention model. While BigBird uses an Erdős-Rényi graph on $|V|$ nodes, our approach is to use a $d$-regular expander for fixed constant $d$. The astute reader may recall that an Erdős-Rényi graph $G(n, p)$ has spectral expansion properties for large enough $p$. However, it is known that $p = \frac{\log n}{n}$ is the connectivity threshold, i.e., for $p < (1 - \epsilon) \frac{\log n}{n}$, $G(n, p)$ is almost surely a disconnected graph.
Therefore, in order to obtain even a connected graph in the Erdős-Rényi model, let alone one with expansion properties, one would need $p = \Omega\left( \frac{\log n}{n} \right)$, giving superlinear complexity for the number of edges. BigBird uses $p = \Theta(1/n)$, keeping a linear number of edges but losing expansion properties. Our expander graphs, by contrast, allow both a linear number of edges and guaranteed spectral expansion properties. We will see in the practical experiments of Section 5 that EXPHORMER-based models often substantially outperform BigBird-based equivalents, with fewer parameters.

Remark 3.2 Previous graph-oriented transformers have, naturally, used the graph structure in their attention mechanisms. The SAN architecture (Kreuzer et al., 2021) uses two attention mechanisms, one based on the input edges $E$ and one based on all the other edges not present in the input graph. By using a single attention mechanism, EXPHORMERs introduce fewer additional parameters, allowing for more compact models that also significantly outperform SAN in our experiments.
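The three edge types of the interaction graph $H$ described in this section can be assembled as in the following sketch. The random graph construction here is an assumption made for illustration (a union of $d/2$ random permutations with self-loops and duplicate edges dropped, so the result has maximum degree at most $d$ rather than being exactly $d$-regular); the procedure actually used is the one in Appendix C. The function names and the dictionary layout are hypothetical.

```python
import random


def random_near_regular_edges(n, d, seed=0):
    """Undirected edges from d/2 random permutations (assumed construction).

    Each permutation pi contributes the edges {i, pi(i)}. Self-loops and
    repeated edges are dropped, so every node has degree at most d.
    """
    if d % 2 != 0:
        raise ValueError("this sketch needs an even degree d")
    rng = random.Random(seed)
    edges = set()
    for _ in range(d // 2):
        perm = list(range(n))
        rng.shuffle(perm)
        for i, j in enumerate(perm):
            if i != j:
                edges.add((min(i, j), max(i, j)))
    return edges


def exphormer_interaction_edges(n, input_edges, d=4, num_virtual=1, seed=0):
    """Return the three undirected edge sets that make up H.

    Real nodes are 0..n-1; virtual nodes are n..n+num_virtual-1.
    """
    expander = random_near_regular_edges(n, d, seed)
    local = {(min(u, v), max(u, v)) for u, v in input_edges if u != v}
    global_edges = {(v, n + k) for k in range(num_virtual) for v in range(n)}
    return {"expander": expander, "local": local, "global": global_edges}
```

With constant `d` and `num_virtual`, the three sets have at most $nd/2$, $|E|$, and $n \cdot$ `num_virtual` edges respectively, which matches the $O(|V| + |E|)$ budget stated above. A practical implementation would presumably also check the spectral gap of the sampled graph and resample if it is poor; that step is omitted here.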

4. THEORETICAL PROPERTIES OF EXPHORMER

EXPHORMER is based on expander graphs, which have a number of properties that make them suitable as a key building block of our approach. In this section, we describe relevant properties of expander graphs along with their implications for EXPHORMERs.

4.1. BASICS OF EXPANDER GRAPHS AND LAPLACIANS

For simplicity, let us consider $d$-regular graphs (where every node has $d$ neighbors). Suppose $G$ is a $d$-regular undirected graph on $n$ vertices. Let $A_G$ be the $n \times n$ adjacency matrix of $G$. It is known (Alon, 1986) that $A_G$ has $n$ real eigenvalues $d = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge -d$. The graph $G$ is said to be an $\epsilon$-expander if $\max\{|\lambda_2|, |\lambda_n|\} \le (1 - \epsilon) d$. Intuitively speaking, expander graphs are sparse approximations of complete graphs. It is known that expanders have several useful properties (viz., edge expansion and mixing properties), which make the graph "well-connected". A higher $\epsilon$ corresponds to better expansion properties. For example, in an expander graph, the diameter, $O(\log_d n)$, is as low as possible for a given number of edges without a bottleneck cut (Alon, 1986), where the absence of a bottleneck cut means that, for a vertex subset $S \subseteq V$ with complement $\bar{S}$,

$$|E \cap (S \times \bar{S})| \ge (1 - \epsilon) \frac{d \, |S| \, |\bar{S}|}{n}. \qquad (1)$$
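As a concrete illustration of the definition, the largest $\epsilon$ for which a given $d$-regular graph is an $\epsilon$-expander can be read off from the adjacency spectrum. The sketch below uses NumPy's symmetric eigenvalue solver; the function name `expansion_epsilon` is hypothetical.

```python
import numpy as np


def expansion_epsilon(A):
    """Largest eps with max(|lambda_2|, |lambda_n|) <= (1 - eps) * d.

    A must be the symmetric 0/1 adjacency matrix of a d-regular graph.
    """
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)
    d = degrees[0]
    if not np.allclose(degrees, d):
        raise ValueError("graph is not regular")
    eigenvalues = np.sort(np.linalg.eigvalsh(A))[::-1]  # descending order
    lam = max(abs(eigenvalues[1]), abs(eigenvalues[-1]))
    return 1.0 - lam / d
```

For the complete graph $K_5$ (4-regular, with eigenvalues $4$ and $-1$) this gives $\epsilon = 0.75$, whereas for the 4-cycle, which is bipartite and has $\lambda_n = -d$, it gives $\epsilon = 0$.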

4.2. EXPANDER GRAPHS AS APPROXIMATORS OF COMPLETE GRAPHS

We now outline some important properties of an expander-based attention mechanism. Roughly speaking, our goal is to replace a densely connected graph, i.e., a complete graph, with Θ(n 2 ) edges by a graph with o(n 2 ), preferably O(n), edges that preserves certain properties of the complete graph. We show that suitable expander graphs, in fact, achieve this.

4.2.1. SPECTRAL PROPERTIES

A useful tool to study expanders is the Laplacian matrix of a graph. Letting $D_G$ denote the $n \times n$ diagonal matrix whose $i$-th diagonal entry is the degree of the $i$-th node, we define $L_G = D_G - A_G$ to be the Laplacian of $G$. It is known that $L_G$ captures several important spectral properties of the graph. The first useful property of complete graphs that expander graphs (approximately) preserve is the spectral decomposition of the Laplacian, per this well-known theorem in spectral graph theory.

Theorem 4.1 A $d$-regular $\epsilon$-expander $G$ on $n$ vertices spectrally approximates the complete graph $K_n$ on $n$ vertices, i.e.,

$$(1 - \epsilon) \frac{1}{n} L_K \preceq \frac{1}{d} L_G \preceq (1 + \epsilon) \frac{1}{n} L_K.$$

Here $A \preceq B$ means that $B - A$ is positive semidefinite. Spectral approximation is known to preserve the cut structure in graphs. As a result, replacing a dense attention mechanism with a sparse attention mechanism along the edges of an expander graph retains spectral properties (viz., cuts, vertex expansion, etc.).

4.2.2. MIXING PROPERTIES

Another property of expanders is that random walks mix well. Let $G = (V, E)$ be a $d$-regular $\epsilon$-expander. Consider a random walk $v_0, v_1, v_2, \ldots$ on $G$, where $v_0$ is chosen according to an initial distribution $\pi^{(0)}$, and then each subsequent $v_{t+1}$ is one of the $d$ neighbors of $v_t$ chosen uniformly at random. Then each $v_t$ is distributed according to the probability distribution $\pi^{(t)} : V \to \mathbb{R}_+$, given recursively by

$$\pi^{(t+1)} = D_G^{-1} A_G \pi^{(t)}.$$

It turns out that after a logarithmic number of steps, a random walk from a starting probability distribution on the vertices is close to uniformly distributed along all nodes of the graph.

Lemma 4.2 (Alon, 1986) Let $G = (V, E)$ be a $d$-regular $\epsilon$-expander graph on $n = |V|$ nodes. For any initial distribution $\pi^{(0)} : V \to \mathbb{R}_+$ and any $\delta > 0$, $\pi^{(t)} : V \to \mathbb{R}_+$ satisfies $\|\pi^{(t)} - \mathbb{1}/n\|_1 \le \delta$, as long as $t = \Omega\left( \frac{1}{\epsilon} \log(n/\delta) \right)$, i.e., the resulting distribution over the nodes of $G$ of a random walk with $t$ steps starting from the distribution $\pi^{(0)}$ is $\delta$-close in $L_1$-norm to the uniform distribution.

In an attention mechanism of a transformer, one can consider the graph of pairwise interactions (i.e., $i$ is connected to $j$ if $i$ and $j$ attend to each other). If the attention mechanism is dense, then each node is connected to every other node and it is trivial to see that every pair of nodes interacts with each other in a single transformer layer. In a sparse attention mechanism, on the other hand, some pairs of nodes are not directly connected, meaning that a single transformer layer will not model interactions between all pairs of nodes. However, if we stack transformer layers on top of each other, the stack will be able to model longer range interactions. In particular, a consequence of the above lemma is that if our sparse attention mechanism is modeled after an $\epsilon$-expander graph, then stacking at least $t = \frac{1}{\epsilon} \log(n/\delta)$ layers will model "most" pairwise interactions between nodes. A related property concerns the diameter of expander graphs.
One can, in fact, show that the diameter of a regular expander graph is logarithmic in the number of nodes, asymptotically.

Theorem 4.3 (Alon, 1986) Suppose $G = (V, E)$ is a $d$-regular $\epsilon$-expander graph on $n$ vertices. Then, for every vertex $v$ and $k \ge 0$, the $k$-hop neighborhood $B(v, k) = \{w \in V : d(v, w) \le k\}$ has $|B(v, k)| \ge \min\{(1 + c)^k, n\}$ for some constant $c > 0$ depending on $d, \epsilon$. In particular, we have that $\mathrm{diam}(G) = O_{d,\epsilon}(\log n)$.

As a consequence, we obtain the following result, which shows that using logarithmically many successive transformer layers allows each node to propagate information to every node.

Corollary 4.4 If a sparse attention mechanism on $n$ nodes is modeled as a $d$-regular $\epsilon$-expander graph, then stacking $O_{d,\epsilon}(\log n)$ transformer layers models all pairwise node interactions.
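Both properties in this subsection can be checked numerically on a small example. The sketch below is illustrative only: it uses the Petersen graph (3-regular on 10 nodes, with adjacency eigenvalues $3$, $1$, and $-2$) as a stand-in expander, iterates the recursion $\pi^{(t+1)} = D_G^{-1} A_G \pi^{(t)}$ from a point mass, and uses breadth-first search to count how many hops (stacked sparse layers) a node needs to reach every other node. All function names are hypothetical.

```python
from collections import deque

import numpy as np


def petersen_adjacency():
    """Adjacency matrix of the Petersen graph (3-regular, 10 nodes)."""
    A = np.zeros((10, 10))
    for i in range(5):
        pairs = [
            (i, (i + 1) % 5),          # outer 5-cycle
            (i, i + 5),                # spokes
            (5 + i, 5 + (i + 2) % 5),  # inner pentagram
        ]
        for u, v in pairs:
            A[u, v] = A[v, u] = 1.0
    return A


def l1_distance_to_uniform(A, start, t):
    """L1 distance from uniform after t random-walk steps from `start`."""
    n = A.shape[0]
    d = A[0].sum()  # common degree of the regular graph
    pi = np.zeros(n)
    pi[start] = 1.0
    for _ in range(t):
        pi = A @ pi / d  # D^{-1} A pi for a d-regular graph
    return float(np.abs(pi - 1.0 / n).sum())


def hops_to_reach_all(A, source):
    """Eccentricity of `source`: hops needed to reach every node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(A[u]):
            v = int(v)
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) < A.shape[0]:
        raise ValueError("graph is disconnected")
    return max(dist.values())
```

On this graph the distance to uniform shrinks geometrically with the number of steps, and every node reaches all others within two hops, which is the small-scale analogue of the logarithmic mixing time and diameter stated above.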

4.3. EXPHORMER-BASED TRANSFORMERS ARE UNIVERSAL APPROXIMATORS

While the expander graph component of EXPHORMER guarantees that $O(\log n)$ graph transformer layers are enough to allow each node to interact with every other node, it may still be desirable to enable node interactions with a smaller number of layers (e.g., $O(\log n)$ can still be infeasible when the number of nodes, $n$, is extremely large). The global attention component allows this by essentially serving as a "short circuit" by which every node can interact with every other node using just two graph transformer layers. The global attention component also helps us to obtain a universal approximation property of EXPHORMER. In particular, continuous functions $f : [0, 1]^{d \times |V|} \to \mathbb{R}^{d \times |V|}$ can be approximated to desired accuracy by an EXPHORMER network as long as there is at least one virtual node. We defer the details to the appendix (see Appendix D).

5. EXPERIMENTS

In this section, we evaluate the empirical performance of graph transformer models based on EXPHORMER on a wide variety of graph datasets with graph prediction and node prediction tasks (Dwivedi et al., 2020; Hu et al., 2020; Freitas et al., 2021). We perform ablation studies on eight benchmark datasets, including image-based graph datasets (CIFAR10, MNIST), a molecular dataset (ogbg-molpcba), synthetic datasets (PATTERN, CLUSTER), and datasets based on code graphs (ogbg-code2, MalNet-Tiny). We also demonstrate the use of EXPHORMER on a large citation network, ogbn-arxiv. A full description of the datasets is found in Appendix A. Experiments for ogbn-arxiv were performed on NVIDIA A100 GPUs, while all other experiments were performed on NVIDIA V100 GPUs.

5.1. COMPARISON OF EXPHORMER-BASED MODELS TO BASELINES

We apply EXPHORMER to the modular GraphGPS framework (Rampásek et al., 2022), which constructs graph transformer models by composing attention mechanisms with message-passing schemes, together with a choice of positional and structural encodings. We also show the results for pure attention EXPHORMER-only models that do not use a message passing network. Table 1 shows results on five datasets from the Benchmarking GNNs collection (Dwivedi et al., 2020), along with MalNet-Tiny (Freitas et al., 2021). Using an EXPHORMER-based graph transformer with message-passing in the GraphGPS framework yields competitive results, including state-of-the-art (SOTA) on three of the datasets. Notably, our EXPHORMER models outperform not only the MPNN baselines but also recent full (dense) attention graph transformer models (i.e., SAN and full transformer models in the GraphGPS paper). Tables 2 and 3 show that the EXPHORMER architectures have significantly fewer training parameters but provide much better accuracy, suggesting that the structured sparsity in EXPHORMER can offer regularization advantages as well as time and memory scalability.

We also report results on two datasets from Hu et al. (2020): ogbg-molpcba and ogbg-code2. Our models are competitive with existing baselines. Our mildly worse performance on ogbg-molpcba is perhaps not surprising, given that this is a molecular dataset consisting of small graphs (an average of 26 nodes). On small graphs, the attention pattern in EXPHORMER is unlikely to be very different from that of the dense attention mechanism; thus the scope for us to improve is limited. Indeed, the greatest improvements from EXPHORMER are on datasets where the average number of nodes is larger, e.g., around 100 or more nodes.

5.2. COMPARISON OF ATTENTION MECHANISMS

We now discuss a series of experiments that compare the EXPHORMER attention mechanism with other attention mechanisms, which helps isolate the effect of the attention mechanism used in the model. For each dataset, we take our EXPHORMER-based models and compare them to models obtained by replacing the EXPHORMER attention mechanism with other dense (full transformer) and sparse (BigBird, Performer) attention mechanisms. As different mechanisms can require drastically different numbers of parameters, we train two models per mechanism: one in which all hyperparameters are kept the same and only the attention mechanism varies; and another in which the hyperparameters are adjusted in order to keep the total number of parameters of the model roughly the same.

5.3. PURE TRANSFORMER ARCHITECTURES

The results presented thus far naturally lead to two questions: (1) how much of the performance gain of EXPHORMER-based GPS models is attributable to the attention component as opposed to the MPNN component, and (2) whether "pure" sparse transformer models can achieve good performance on their own. Indeed, Rampásek et al. (2022) present ablation studies showing that removing the MPNN component from the proposed GraphGPS models results in significantly worse performance. Similarly, Kreuzer et al. (2021) offer two variants of their SAN architecture: the "full" variant that uses dense attention, and a "sparse" variant that allows nodes to attend only to direct neighbors in the input graph. Their ablation results show that the "sparse" variant performs significantly worse than the "full" variant, often underperforming pure MPNNs. In order to address these questions, we also train pure EXPHORMER-based sparse transformer models that use only attention without any message passing. The results for CIFAR10 in Table 2 show that the pure EXPHORMER models (labeled "Exphormer"), in fact, outperform pure transformer GPS models as well as most GPS models that use MPNNs. The main exception is the dense attention "Transformer+MPNN" model; however, our pure EXPHORMER model performs competitively. Similar comparisons on other datasets are shown in Appendix B.3.

5.4. SCALABILITY TO LARGE-SCALE GRAPHS

One of the difficulties with graph transformer architectures has been their poor scalability to larger graphs with thousands of nodes. Dense attention mechanisms, like those of SAN and Graphormer, have quadratic memory and time complexity, which restricts their applicability to datasets on graphs with a small number of nodes. GraphGPS (Rampásek et al., 2022) used sparse attention mechanisms, but their architecture handles graphs of up to about 5,000 nodes, in MalNet-Tiny (Freitas et al., 2021). Again, EXPHORMER-based models provide improved accuracy, as shown in Tables 1 and 10. Our work allows us to extend graph transformer architectures to far larger graphs, with hundreds of thousands of nodes. We show that the EXPHORMER architecture can scale, with competitive accuracy, to ogbn-arxiv (Hu et al., 2020), a transductive dataset consisting of a single large graph of the citation network of over 160K arXiv papers, containing over a million edges. Specifically, we achieve a test accuracy of 0.7196 using the EXPHORMER architectures. At the time of writing, a relevant leaderboard shows 0.7637 as the highest reported test accuracy, based on adaptive graph diffusion networks (Zhao et al., 2021). Table 3 shows the relative performance of EXPHORMER compared to other Transformers. A dense full transformer does not even fit in memory on an NVIDIA A100 GPU (even with only 2 layers and 32 hidden dimensions). The best accuracy for other sparse models was found with networks of 3 hidden layers and 96 hidden dimensions. Notice that BigBird and Performer have significantly longer epoch times and worse performance compared to EXPHORMER with degree 3 expander edges.
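A back-of-the-envelope count helps make the memory gap concrete. The sketch below is only an estimate under stated assumptions: it counts ordered attention pairs for one head in one layer, assumes one 32-bit float per attention score, uses the ogbn-arxiv sizes from Appendix A (169,343 nodes and 1,166,243 edges), a degree-3 expander as in the text above, and a single virtual node (an assumption). The function name `attention_pair_counts` is hypothetical.

```python
def attention_pair_counts(n, num_edges, expander_degree=3, num_virtual=1):
    """Rough count of ordered attention pairs (one head, one layer).

    dense  : every node attends to every node, n * n pairs.
    sparse : input edges and their reverses, about `expander_degree`
             expander neighbors per node, and both directions of each
             node-to-virtual-node link.
    """
    dense = n * n
    sparse = 2 * num_edges + n * expander_degree + 2 * n * num_virtual
    return dense, sparse


# ogbn-arxiv sizes as given in Appendix A.
n_nodes, n_edges = 169_343, 1_166_243
dense_pairs, sparse_pairs = attention_pair_counts(n_nodes, n_edges)

# Assumption: one float32 (4 bytes) per attention score.
dense_gigabytes = dense_pairs * 4 / 1e9
sparse_megabytes = sparse_pairs * 4 / 1e6
print(f"dense:  {dense_pairs:,} pairs, about {dense_gigabytes:.1f} GB")
print(f"sparse: {sparse_pairs:,} pairs, about {sparse_megabytes:.1f} MB")
```

Under these assumptions the dense score matrix alone has roughly 2.9 × 10^10 entries, several thousand times more than the sparse pattern, which is consistent with the observation above that a dense transformer does not fit in GPU memory on this dataset.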

6. CONCLUSION

EXPHORMER is a new sparse graph transformer architecture built on expander graphs. We have shown that the relevant mathematical properties of expander graphs make EXPHORMER a suitable choice for graph learning, with time and memory complexity linear in the size of the graph. Using EXPHORMER in the GraphGPS framework allows us to obtain state-of-the-art empirical results on a number of datasets while also allowing graph transformers to scale to datasets on large graphs, a realm which has proved elusive for graph transformers in the past.

A DATASET DESCRIPTIONS

Below, we provide descriptions of the datasets on which we conduct experiments.

CIFAR10 and MNIST

CIFAR10 and MNIST are the graph equivalents of the eponymous image classification datasets. Each graph is created by constructing the 8-nearest neighbor graph of the SLIC superpixels of the image. Both are 10-class classification problems.

CLUSTER and PATTERN

PATTERN and CLUSTER are node classification problems. Both are synthetic datasets that are sampled from the Stochastic Block Model (SBM), which is a popular way to model communities. In PATTERN, the prediction task is to identify whether a node belongs to one of the 100 possible predetermined subgraph patterns. In CLUSTER, the goal is to identify the cluster to which a node belongs, when the nodes are selected from 6 different clusters with the same distribution.

MalNet-Tiny

MalNet-Tiny is a smaller dataset generated from a larger dataset for identifying malware based on function call graphs. The dataset contains the call graphs from Android APKs. The tiny dataset contains 5,000 graphs, each with up to 5,000 nodes. The task is to classify each graph as benign or as one of four types of malware.

ogbg-code2

The ogbg-code2 dataset consists of the abstract syntax trees (ASTs) of Python methods curated from GitHub. The prediction problem is "code summarization": determining the name of a function given the body of the method.

ogbg-molpcba

ogbg-molpcba is a molecular dataset. It has graphs representing molecules, where the nodes and edges represent the atoms and their chemical bonds, respectively. The node features include atomic number, chirality, etc. The goal is to predict binary labels. The dataset is heavily skewed, as only 1.4% of the data has positive labels.

ogbn-arxiv

The ogbn-arxiv dataset consists of one large directed graph of 169,343 nodes and 1,166,243 edges representing a citation network between all computer science papers on arXiv that were indexed by the Microsoft Academic Graph. Nodes in the graph represent papers, while a directed edge indicates that a paper cites another. Each node has a 128-dimensional feature vector derived from embeddings of words in the title and abstract of the underlying paper. The prediction task is a 40-class node classification problem: to identify the primary category of each arXiv paper, as listed by the authors. The nodes of the citation graph are split into 90K training nodes, 30K validation nodes, and 48K test nodes.

Table 4 shows a summary of the statistics of the aforementioned datasets. (Hu et al., 2020)

B.2 HYPERPARAMETERS

Our hyperparameter choices, including the optimizer, positional encodings, and structural encodings, were guided by the instructions in GraphGPS (Rampásek et al., 2022). There were some cases, however, in which more layers with smaller dimensions gave better results in EXPHORMER. This may be because each node receives fewer inputs in each layer, so EXPHORMER requires more layers in order to propagate information well. Additionally, we observed that Equivstable Laplacian Positional Encoding (ESLapPE) performed better than normal Laplacian Positional Encoding (LapPE), so we use it in place of LapPE throughout our experiments. For the GPS models with Exphormer, we consistently use GatedGCN, except for the ogbn-arxiv dataset, for which we use GCN. Our model introduces some extra hyperparameters: the degree of the expander graph and the number of virtual nodes. For these hyperparameters, we used linear search and found that an expander degree of 3-5 was the most effective. Depending on the graph size, we used 1-5 virtual nodes. To make fair comparisons, we used a parameter budget similar to that of GraphGPS. For PATTERN and CLUSTER, we used a parameter budget of 500K, and for CIFAR and MNIST, we used a parameter budget of around 100K. See details in Table 6 and Table 8.

... mechanisms for CIFAR10 (Table 2) and PATTERN (Table 11). Here, we present similar results for the remaining datasets: MNIST in Table 9; MalNet-Tiny in

and structural encodings are not fully sufficient to capture the structure of the graph and cannot replace the actual edges. Table 14 shows the effect of the degree of the expander graph. Selecting the right expander degree is important; results can vary considerably across expander degrees. Table 15 shows a similar study on the number of virtual nodes.

C DETAILS OF EXPANDER GRAPH CONSTRUCTION

A major component of EXPHORMER is the use of the edges of an expander graph. Thus far, we have not specified which expander graph to use. In this section, we provide details of the specific expander graphs we use and quantify their spectral expansion properties.

C.1 RAMANUJAN GRAPHS

A natural question is how strong the spectral expansion properties of a d-regular graph can be, i.e., for how large an ϵ > 0 a d-regular ϵ-expander exists. The following theorem gives a bound on how large the spectral gap can be. In other words, a d-regular ϵ-expander graph can exist only for ϵ ≤ 1 -

However, we cannot quite apply Theorem D.1 or Theorem D.2, as they hold specifically for full transformers, in which all nodes are pairwise connected in the attention interaction graph. The final ingredient we require is a theorem from Zaheer et al. (2020), which gives a universality result for sparse transformers on sequences.

Theorem D.3 (Zaheer et al. (2020)) Let $1 < p < \infty$ and $\epsilon > 0$. For any graph $H$ on $[n]$ that contains the star graph, if $f : [0, 1]^{n \times d} \to \mathbb{R}^{n \times d}$ is a continuous function, then there exists a sparse transformer network $g$ (with trainable positional encodings) such that $\ell_p(f, g) < \epsilon$.

Now, combining Theorem D.3 with the previous observations, and noting that EXPHORMER with at least one virtual node contains the star graph, we see that EXPHORMER can approximate solutions to the graph isomorphism problem (as in Kreuzer et al. (2021), however, this does not imply a polynomial-time solution to the problem).
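The star-graph condition used in the argument above can be checked mechanically. The following is a minimal sketch, assuming an undirected interaction graph given as a list of node pairs and a single virtual node connected to every original node; the helper names `add_virtual_node` and `contains_star` are hypothetical and chosen only for illustration.

```python
# Minimal sketch: adding one virtual node makes the interaction graph
# contain a star graph. Helper names are illustrative assumptions.

def add_virtual_node(num_nodes, edges):
    """Return (new_num_nodes, edge_set) with one extra hub node.

    Nodes are 0..num_nodes-1; the virtual node gets index num_nodes and
    is connected to every original node. Edges are stored as (min, max)
    pairs so that each undirected edge appears once.
    """
    edge_set = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    hub = num_nodes
    edge_set.update((v, hub) for v in range(num_nodes))
    return num_nodes + 1, edge_set


def contains_star(num_nodes, edge_set):
    """True if some node is adjacent to all other nodes."""
    degree = [0] * num_nodes
    for u, v in edge_set:
        degree[u] += 1
        degree[v] += 1
    return any(deg == num_nodes - 1 for deg in degree)


# A path on 5 nodes has maximum degree 2, so it contains no spanning star.
path_edges = {(i, i + 1) for i in range(4)}
print(contains_star(5, path_edges))

# After adding the virtual node, the hub is adjacent to every other node.
n_aug, aug_edges = add_virtual_node(5, path_edges)
print(contains_star(n_aug, aug_edges))
```

Note that the sketch uses exactly one virtual node; with several virtual nodes, whether a single hub is adjacent to all remaining nodes depends on whether the virtual nodes are also connected to each other.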



Notation: given matrices $A$ and $B$, we say that $A \preceq B$ if $B - A$ is a positive semi-definite matrix.



Figure 1: The components of EXPHORMER: (a) shows local neighborhood attention, i.e., edges of the input graph. (b) shows an expander graph with degree 3. (c) shows global attention with a single virtual node. (d) All of the aforementioned components are combined into a single interaction graph that determines the attention pattern of EXPHORMER.

Theorem C.1 (Alon-Boppana) Let $d > 0$. The eigenvalues of the adjacency matrix of a $d$-regular graph on $n$ nodes satisfy $\max\{|\lambda_2|, |\lambda_n|\} \geq 2\sqrt{d - 1} - o_n(1)$.

Comparison of EXPHORMER with baselines on various datasets. Best results are colored in first, second, third.

Results with varying attention and MPNNs on CIFAR10. EXPHORMER with MPNN provides the highest accuracy. Also, pure transformer models based on EXPHORMER (without the use of MPNNs) are comparable.





Comparison of attention mechanisms on the ogbn-arxiv dataset.

Dataset statistics

B.1 BENCHMARKS ON OGB DATASETS

In Table 5 we show the results for OGB datasets (see the discussion in Section 5.1).

Comparison of EXPHORMER with baselines on various datasets from the Open Graph Benchmark (OGB)

Hyperparameters used for GPS model using EXPHORMER for datasets: ogbg-molpcba, ogbg-code2.

PATTERN in Table 11; and CLUSTER in Table 12.

Ablation studies results for MNIST

B.4 EFFECT OF DIFFERENT COMPONENTS OF THE MODEL

[In the final version, this section will include similar analyses of more datasets.]

Here we analyze the effect of each of the components of the model. Our Exphormer model has three main components: local neighborhood, expander edges, and virtual nodes. In Table 13, we can see that removing each component leads to poorer results. The effect of local neighborhood edges is much more important in Exphormer models that do not include an MPNN, suggesting that local

Ablation studies results for PATTERN

Ablation studies results for CLUSTER

non-isomorphic graphs to distinct values. We would like to apply one of the above theorems to such a function.


Table 14: Effect of the selection of the expander degree on the results. We can see that the correct selection of the expander degree does affect the quality of the final model.

As it turns out, there exist ϵ-expander graphs with ϵ achieving this bound. In fact, a d-regular ϵ-expander achieving it is known as a Ramanujan graph. Ramanujan graphs are essentially the best possible spectral expanders, and several constructions have been considered over the years (Lubotzky et al., 1988; Margulis, 1988).

C.2 RANDOM REGULAR GRAPHS

While there exist deterministic constructions of Ramanujan graphs, they are often algebraic or number-theoretic in nature and therefore exist only for specific choices of $d$ (e.g., the constructions of Lubotzky et al. (1988) and, independently, of Margulis (1988), for which one requires $d \equiv 2 \pmod{4}$ and $d - 1$ to be a prime). Recently, the work of Alon (2021) showed a construction of strongly explicit near-Ramanujan graphs of every degree, but it should be noted that the construction needs the number of nodes to be sufficiently large. It is, therefore, often convenient to use a probabilistic construction of an expander.

A natural choice for an expander graph is a random $d$-regular graph on $n$ vertices, formed by taking $d/2$ independent uniform permutations on $\{1, 2, \ldots, n\}$. Friedman (2003) proved a conjecture of Alon, establishing that random regular graphs are weakly-Ramanujan.

Theorem C.2 (Friedman, 2003) Fix $\epsilon > 0$ and an even integer $d > 0$. Then, suppose $G$ is a random graph generated by taking $d$ independent uniformly random permutations $\pi_1, \pi_2, \ldots, \pi_d$ on $V = \{1, 2, \ldots, n\}$, then choosing the edges as

Then, with probability

In our experiments, we use a random regular graph to instantiate the expander graph component of EXPHORMER. We describe the details below.

Generating a Random Regular Expander

Let $G = (V, E)$ be the original graph, where $V = \{1, 2, \ldots, n\}$. Inspired by the expansion properties of the random graph process analyzed in Friedman (2003) (see Theorem C.2), we generate a random regular graph $G' = (V, E')$ on the same node set $V$ as follows.

• Let $s = (1, 1, \ldots, 1, 2, 2, \ldots, 2, \ldots, n, n, \ldots, n)$, where each value appears $d$ times.
• Pick a random permutation $\pi$ on $\{1, 2, \ldots, nd\}$, chosen uniformly at random from the $(nd)!$ possible permutations.
• Let $E'$ be the multiset $\{(s_i, s_{\pi(i)}) : 1 \leq i \leq nd\}$.
• Remove any self-loops from $E'$; for large $n$, this will happen with probability $o(1)$.

It is easy to see that this procedure is equivalent to sampling $d$ permutations, so Theorem C.2 shows that the graph will be a $2d$-regular expander with high probability.
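The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than our exact implementation: nodes are indexed from 0 instead of 1, the function names `random_regular_expander` and `second_largest_abs_eigenvalue` are hypothetical, and the eigenvalue check uses a dense adjacency matrix, which is only practical for small $n$.

```python
import numpy as np


def random_regular_expander(n, d, seed=None):
    """Sample the edge multiset E' described above (0-indexed nodes).

    Returns an array of shape (num_edges, 2) with self-loops removed.
    """
    rng = np.random.default_rng(seed)
    s = np.repeat(np.arange(n), d)       # each node appears d times
    perm = rng.permutation(n * d)        # uniform permutation of the n*d slots
    src, dst = s, s[perm]                # pairs (s_i, s_pi(i))
    keep = src != dst                    # drop self-loops
    return np.stack([src[keep], dst[keep]], axis=1)


def second_largest_abs_eigenvalue(edges, n):
    """Second-largest absolute eigenvalue of the (multi)graph adjacency matrix."""
    A = np.zeros((n, n))
    # Accumulate edge multiplicities symmetrically (undirected multigraph).
    np.add.at(A, (edges[:, 0], edges[:, 1]), 1.0)
    np.add.at(A, (edges[:, 1], edges[:, 0]), 1.0)
    abs_eigs = np.sort(np.abs(np.linalg.eigvalsh(A)))
    return abs_eigs[-2]


if __name__ == "__main__":
    n, d = 200, 3
    edges = random_regular_expander(n, d, seed=0)
    lam = second_largest_abs_eigenvalue(edges, n)
    # Ramanujan-type reference value for a 2d-regular graph.
    print(lam, 2 * np.sqrt(2 * d - 1))
```

Before self-loops are removed, every node appears exactly $d$ times as a first endpoint and $d$ times as a second endpoint, so each node has degree at most $2d$ in the result. Comparing the printed eigenvalue with $2\sqrt{2d - 1}$ gives an informal sense of how close a particular sample is to the Ramanujan regime.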

D UNIVERSALITY OF EXPHORMER

In this section, we detail the universal approximation properties of EXPHORMER-based graph transformers.

One of the limitations of standard message passing networks is that their expressivity is generally confined by the limitations of the WL hierarchy. In other words, they cannot distinguish pairs of non-isomorphic graphs that cannot be distinguished by a suitable WL test.

On the other hand, transformer architectures have the ability to distinguish any graphs.

The work of Yun et al. (2020a) showed that for sequences, transformers are universal approximators, i.e., they can approximate any permutation equivariant function mapping one sequence to another arbitrarily closely when provided with enough parameters. A function $f : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ is said to be permutation equivariant if $f(XP) = f(X)P$ for every permutation matrix $P$, i.e., if permuting the columns of an input $X \in \mathbb{R}^{d \times n}$ results in the columns of $f(X)$ being permuted the same way.

Theorem D.1 (Yun et al. (2020a)) Let $1 \leq p < \infty$ and $\epsilon > 0$. For any function $f : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ that is permutation equivariant, there exists a transformer network $g$ such that $\ell_p(f, g) < \epsilon$.

The same work shows an extension to all (not necessarily permutation equivariant) sequence-to-sequence functions that are defined on a compact domain, say, $[0, 1]^{d \times n}$, provided that one uses a positional encoding. More specifically, for any transformer $g : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$, one can define a transformer with positional encoding $g_p : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ such that $g_p(X) = g(X + E)$. The following result shows that trainable positional encodings allow a transformer to approximate any sequence-to-sequence continuous function on the compact domain.

Theorem D.2 (Yun et al. (2020a)) Let $1 \leq p < \infty$ and $\epsilon > 0$. For any continuous function $f : [0, 1]^{d \times n} \to \mathbb{R}^{d \times n}$ that is permutation equivariant, there exists a transformer with positional encoding $g_p$ such that $\ell_p(f, g_p) < \epsilon$.

Note that the above theorems hold for full (dense) transformers.
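The permutation-equivariance property $f(XP) = f(X)P$ defined above can be checked numerically for a single dense self-attention layer. The sketch below is only an illustration under stated assumptions: it uses one attention head with a residual connection and random weights, follows the column-per-token convention $X \in \mathbb{R}^{d \times n}$, and the function names are hypothetical. It also shows that adding a fixed positional encoding $E$, as in $g_p(X) = g(X + E)$, generally breaks equivariance.

```python
import numpy as np


def softmax_columns(Z):
    """Numerically stable softmax applied to each column of Z."""
    Z = Z - Z.max(axis=0, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=0, keepdims=True)


def self_attention(X, Wq, Wk, Wv):
    """Single-head dense self-attention with a residual connection.

    X has shape (d, n): each column is one token.
    """
    scores = (Wk @ X).T @ (Wq @ X)   # (n, n); column j holds scores for query j
    A = softmax_columns(scores)
    return X + Wv @ X @ A


rng = np.random.default_rng(0)
d, n = 4, 6
X = rng.normal(size=(d, n))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# A non-trivial permutation matrix (cyclic shift of the columns).
P = np.roll(np.eye(n), 1, axis=1)

# Without positional encodings, permuting the input permutes the output.
equivariant = np.allclose(self_attention(X @ P, Wq, Wk, Wv),
                          self_attention(X, Wq, Wk, Wv) @ P)

# With a fixed positional encoding E, g_p(X) = g(X + E) is generally
# no longer permutation equivariant.
E = rng.normal(size=(d, n))
equivariant_with_pe = np.allclose(self_attention(X @ P + E, Wq, Wk, Wv),
                                  self_attention(X + E, Wq, Wk, Wv) @ P)

print(equivariant, equivariant_with_pe)
```

This loss of equivariance is what allows transformers with positional encodings to represent functions that are not permutation equivariant.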
However, under certain conditions on the sparsity pattern, one can obtain similar universality for sparse attention mechanisms (Yun et al., 2020b).

One important consideration is that the aforementioned results hold for functions on sequences. Since we are concerned with functions on graphs, it is interesting to ask what the implications are for graph transformers.

We follow the approach of Kreuzer et al. (2021): given a graph $G$, we can view a node transformer as a function from $\mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ which uses the padded adjacency matrix of $G$ as a positional encoding. Similarly, an edge transformer takes as input a sequence $((i, j), \sigma_{i,j})$ with $i, j \in [n]$ and $i \leq j$ such that $\sigma_{i,j} = 1$ if $i$ and $j$ are connected in $G$ and $\sigma_{i,j} = 0$ otherwise. Any ordering of these vectors corresponds to the same graph. Moreover, we can capture the relevant functions going from $\mathbb{R}^{N(N-1)/2 \times 2} \to \mathbb{R}^{N(N-1)/2 \times 2}$ with permutation equivariance. Ideally, they can be approximated as closely as desired by suitable transformers on the edge input. Now, simply observe (see Kreuzer et al. (2021)) that one can choose a function (in either the node transformer or edge transformer case) that is (a.) invariant under node permutations and (b.) maps

