FAST GRAPH ATTENTION NETWORKS USING EFFECTIVE RESISTANCE BASED GRAPH SPARSIFICATION

Abstract

The attention mechanism has demonstrated superior performance for inference over nodes in graph neural networks (GNNs); however, it results in a high computational burden during both training and inference. We propose FastGAT, a method to make attention-based GNNs lightweight by using spectral sparsification to generate an optimal pruning of the input graph. This results in a per-epoch time that is almost linear in the number of graph nodes, as opposed to quadratic. We theoretically prove that spectral sparsification preserves the features computed by the GAT model, thereby justifying our FastGAT algorithm. We experimentally evaluate FastGAT on several large real-world graph datasets for node classification tasks, under both inductive and transductive settings. FastGAT can dramatically reduce (up to 10x) the computational time and memory requirements, allowing the usage of attention-based GNNs on large graphs.

1. INTRODUCTION

Graphs are efficient representations of pairwise relations, with many real-world applications including product co-purchasing networks (McAuley et al., 2015) and co-author networks (Hamilton et al., 2017b). Graph neural networks (GNNs) have become popular as a tool for inference from graph-based data. By leveraging the geometric structure of the graph, GNNs learn improved representations of the graph nodes and edges that can lead to better performance in various inference tasks (Kipf & Welling, 2016; Hamilton et al., 2017a; Veličković et al., 2018). More recently, the attention mechanism has demonstrated superior performance for inference over nodes in GNNs (Veličković et al., 2018; Xinyi & Chen, 2019; Thekumparampil et al., 2018; Lee et al., 2020; Bianchi et al., 2019; Knyazev et al., 2019). However, attention-based GNNs suffer from a huge computational cost, which may hinder the applicability of the attention mechanism to large graphs.

GNNs generally rely on graph convolution operations. For a graph G with N nodes, graph convolution with a kernel $g_w : \mathbb{R} \to \mathbb{R}$ is defined as

$$g_w \star h = U\, g_w(\Lambda)\, U^\top h, \qquad (1)$$

where $U$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues of the normalized graph Laplacian matrix, defined as

$$L_{\mathrm{norm}} = I - D^{-1/2} A D^{-1/2}, \qquad (2)$$

with $D$ and $A$ being the degree matrix and the adjacency matrix of the graph, and $g_w$ applied elementwise. Since computing $U$ and $\Lambda$ can be very expensive ($O(N^3)$), most GNNs use an approximation of the graph convolution operator. For example, in graph convolution networks (GCNs) (Kipf & Welling, 2016), node features are updated by computing averages as a first-order approximation of equation 1 over the neighbors of the nodes. A single neural network layer is defined as

$$H^{(l+1)}_{\mathrm{GCN}} = \sigma\!\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right), \qquad (3)$$

where $H^{(l)}$ and $W^{(l)}$ are the activations and the weight matrix at the $l$-th layer, respectively, $\tilde{A} = A + I$, and $\tilde{D}$ is the degree matrix of $\tilde{A}$.
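As a concrete illustration of the first-order approximation in equation 3, the GCN propagation rule can be sketched in a few lines of NumPy. This is a minimal dense-matrix sketch for exposition only (real implementations use sparse matrices), not the authors' code:

```python
import numpy as np

def gcn_layer(A, H, W, act=lambda x: np.maximum(x, 0.0)):
    """One GCN layer: H' = act(D~^{-1/2} A~ D~^{-1/2} H W),
    where A~ = A + I (self-loops) and D~ is the degree matrix of A~.
    `act` defaults to ReLU; any elementwise non-linearity works."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    d = A_tilde.sum(axis=1)                   # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt # normalized propagation matrix
    return act(A_hat @ H @ W)
```

Each node's new feature vector is thus a degree-normalized average over its neighborhood (including itself), projected by the learnable weight matrix W.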
Attention-based GNNs add another layer of complexity: they compute pairwise attention coefficients between all connected nodes. This process can significantly increase the computational burden, especially on large graphs. Approaches to speed up GNNs were proposed in (Chen et al., 2018; Hamilton et al., 2017a). However, these sampling- and aggregation-based methods were designed for simple GCNs and are not applicable to attention-based GNNs. There has also been work on inducing sparsity in attention-based GNNs (Ye & Ji, 2019; Zheng et al., 2020), but it focuses on addressing potential overfitting of attention-based models rather than scalability. In this paper, we propose Fast Graph Attention neTwork (FastGAT), an edge-sampling based method that leverages effective resistances of edges to make attention-based GNNs lightweight. The effective resistance measures the importance of an edge in terms of preserving the graph connectivity. FastGAT uses this measure to prune the input graph and generate a randomized subgraph with far fewer edges. Such a procedure preserves the spectral features of the graph, hence retaining the information that attention-based GNNs need. At the same time, the pruned graph is amenable to more complex but computationally intensive models such as attention GNNs. With the sampled subgraph as their input, attention-based GNNs enjoy a much smaller computational complexity. Note that FastGAT is applicable to all attention-based GNNs. In this paper, we mostly focus on the Graph Attention Network (GAT) model proposed by Veličković et al. (2018). However, we also show that FastGAT generalizes to two other attention-based GNNs, namely the cosine-similarity based approach (Thekumparampil et al., 2018) and Gated Attention Networks (Zhang et al., 2018). We note that graph attention networks can be re-interpreted as convolution-based GNNs; we show this explicitly in the Appendix.
Based on this re-interpretation, we theoretically prove that spectral sparsification preserves the feature representations computed by the GAT model. We believe this interpretation also opens up interesting connections between sparsifying state-transition matrices of random walks and speeding up computations in GNNs. The contributions of our paper are outlined below:

• We propose FastGAT, a method that uses effective-resistance based spectral graph sparsification to accelerate attention GNNs in both inductive and transductive learning tasks. The rapid subsampling and the spectrum-preserving property of FastGAT help attention GNNs retain their accuracy advantages while becoming computationally light.

• We provide a theoretical justification for using spectral sparsification in the context of attention-based GNNs by proving that spectral sparsification preserves the features computed by GNNs.

• FastGAT outperforms state-of-the-art algorithms across a variety of datasets under both transductive and inductive settings in terms of computation, achieving a speedup of up to 10x in training and inference time. On larger datasets such as Reddit, the standard GAT model runs out of memory, whereas FastGAT achieves an F1 score of 0.93 with a per-epoch training time of 7.73 seconds.

• FastGAT generalizes to other attention-based GNNs such as the cosine-similarity based attention (Thekumparampil et al., 2018) and the Gated Attention Network (Zhang et al., 2018).

2. RELATED WORK

Accelerating graph-based inference has drawn increasing interest. Two methods, proposed in (Chen et al., 2018) (FastGCN) and (Huang et al., 2018), speed up GCNs by using importance sampling to sample a subset of nodes per layer during training. Similarly, GraphSAGE (Hamilton et al., 2017a) proposes an edge sampling and aggregation based method for inductive learning tasks. All of the above works use simple aggregation and target simple GCNs, while our work focuses on more recent attention-based GNNs such as (Veličković et al., 2018). We are able to take advantage of the attention mechanism while still being computationally efficient.

Graph sparsification aims to approximate a given graph by a graph with fewer edges for efficient computation. Depending on the final goal, there are cut sparsifiers (Benczúr & Karger, 1996), pairwise-distance preserving sparsifiers (Althöfer et al., 1993), and spectral sparsifiers (Spielman & Teng, 2004; Spielman & Srivastava, 2011), among others (Zhao, 2015; Calandriello et al., 2018; Hübler et al., 2008; Eden et al., 2018; Sadhanala et al., 2016). In this work, we use spectral sparsification to choose a randomized subgraph while preserving spectral properties. Apart from providing the strongest guarantees in preserving graph structure (Chu et al., 2018), spectral sparsifiers align well with GNNs due to their connection to spectral graph convolutions. Graph sparsification for neural networks has been studied recently (Ye & Ji, 2019; Zheng et al., 2020; Ioannidis et al., 2020; Louizos et al., 2017). However, the main goal of these works is to alleviate overfitting in GNNs, not to reduce the training time. They still require learning attention coefficients and binary gate values for all edges in the graph, hence providing no computational or memory benefit. In contrast, FastGAT uses a fast subsampling procedure, resulting in a drastic improvement in training and inference time.
It is also highly stable in terms of training and inference.

3. FASTGAT: ACCELERATING GRAPH ATTENTION NETWORKS VIA EDGE SAMPLING

3.1. THE FASTGAT ALGORITHM

Let G(V, E) be a graph with N nodes and M edges. An attention-based GNN computes attention coefficients $\alpha_{ij}$ for every pair of connected nodes $i, j \in V$ in every layer $\ell$. The $\alpha_{ij}$'s are then used as averaging weights to compute the layer-wise feature updates. In the original GAT formulation, the attention coefficients are

$$\alpha_{ij} = \frac{\exp\!\left(\mathrm{LeakyReLU}\!\left(a^\top [W h_i \,\|\, W h_j]\right)\right)}{\sum_{j' \in \mathcal{N}_i} \exp\!\left(\mathrm{LeakyReLU}\!\left(a^\top [W h_i \,\|\, W h_{j'}]\right)\right)}, \qquad (4)$$

where the $h_i$'s are the input node features to the layer, $W$ and $a$ are learned linear mappings, $\mathcal{N}_i$ denotes the set of neighbors of node $i$, and $\|$ denotes concatenation. With the $\alpha_{ij}$'s defined as above, the output embedding of a GAT layer for node $i$ is

$$h'_i = \sigma\!\left( \sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j \right). \qquad (5)$$

For multi-head attention, the coefficients are computed independently in each attention head with head-dependent matrices $W$ and attention vectors $a$. Note that the computational burden in GATs arises directly from computing the $\alpha_{ij}$'s in every layer, every attention head, and every forward pass during training.

Goal: Our objective is to achieve performance equivalent to that of full graph attention networks (GAT), but with only a fraction of the original computational complexity. This computational saving is achieved by reducing the number of attention computations.
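The attention coefficients of equations 4 and 5 can be sketched directly in NumPy. This is an illustrative single-head sketch (the `neighbors` adjacency-list interface is an assumption for exposition, not the authors' API):

```python
import numpy as np

def gat_attention(H, W, a, neighbors, leaky_slope=0.2):
    """Compute GAT attention coefficients alpha_ij for each node i over its
    neighbor set N_i: alpha_ij = softmax_j(LeakyReLU(a^T [W h_i || W h_j])).
    `neighbors` maps a node index to a list of its neighbor indices."""
    Z = H @ W                                   # projected features W h_i
    alpha = {}
    for i, nbrs in neighbors.items():
        scores = []
        for j in nbrs:
            e = np.concatenate([Z[i], Z[j]]) @ a          # a^T [Wh_i || Wh_j]
            scores.append(e if e > 0 else leaky_slope * e)  # LeakyReLU
        scores = np.array(scores, dtype=float)
        scores -= scores.max()                  # numerical stability
        w = np.exp(scores)
        alpha[i] = w / w.sum()                  # softmax over N_i
    return alpha
```

Every coefficient requires a dot product and a softmax over each neighborhood, which is exactly the per-edge work that FastGAT reduces by shrinking the edge set.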

Idea:

We propose to use edge-sampling functions that sparsify graphs by removing nonessential edges. This leads to a direct reduction in the number of attention coefficients to be computed, hence reducing the burden. Choosing the sampling function is crucial for retaining the graph connectivity. Let EdgeSample(E, A, q) denote a randomized sampling function that, given an edge set E, an adjacency matrix A, and a number of edges to be sampled q, returns a subset of the original edge set $E_s \subset E$ with $|E_s| = q$. Our algorithm uses this function to sparsify the graph in every layer and attention head; the attention coefficients are then computed only for the remaining edges. A more detailed description is given in Algorithm 1. In every layer and attention head, a randomized subgraph with $q \ll M$ edges is generated and the attention coefficients are computed only for this subset of edges. We use a specialized distribution that depends on the contribution of each edge to the graph connectivity; we provide further details in Section 3.2.

Note that in the general description below, the attention coefficients themselves are used as weights for sparsification, and the reweighted attention coefficients are used to compute the feature update. Doing so helps in the theoretical analysis of the algorithm. In practice, however, we replace this expensive procedure with a one-time sampling of the graph with the original edge weights and compute the attention coefficients only for the remaining edges. In particular, we use two simpler variations of FastGAT: (i) FastGAT-const, where the subgraph is kept constant across all layers and attention heads, and (ii) FastGAT-layer, where the subgraph is different in each layer (drawn stochastically from the original edge weights), but the same across all attention heads within a layer.

Algorithm 1: The FastGAT Algorithm
Input: Graph G(V, E); number of layers L; number of attention heads $K^{(\ell)}$, $\ell = 1, \dots, L$; initial weight matrices $W^{(\ell)}$; non-linearity $\sigma$; feature matrix $H \in \mathbb{R}^{N \times D}$; randomized edge-sampling function EdgeSample(·); attention function $\theta(\cdot)$; number of edges sampled q.
for each layer $\ell$ do
    for each attention head $k \in \{1, 2, \dots, K^{(\ell)}\}$ do
        Compute the attention matrix $\Theta_k^{(\ell)} \in \mathbb{R}^{N \times N}$, with $\Theta_k^{(\ell)}(i, j) = \theta_k(h_i^{(\ell)}, h_j^{(\ell)})$
        Sample a subgraph $\hat{\Theta}_k^{(\ell)} = \mathrm{EdgeSample}(\Theta_k^{(\ell)}, A, q)$
        Compute $H_k^{(\ell+1)} = \sigma\big(\hat{\Theta}_k^{(\ell)} H^{(\ell)} W^{(\ell)}\big)$
    $H^{(\ell+1)} = \|_k\, H_k^{(\ell+1)}$    // concatenate the outputs of the attention heads
Compute the loss and update the W's    // gradient-based weight update
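The per-layer loop of Algorithm 1 can be sketched compactly with dense matrices. The `attn_fn` and `edge_sample` arguments are stub interfaces assumed for illustration (they stand in for the attention function and the sampling routine of Section 3.2), not the authors' implementation:

```python
import numpy as np

def fastgat_forward(H, Ws, attn_fn, edge_sample, A, q, act, num_heads):
    """One forward pass of Algorithm 1 (sketch). For each layer and head:
    build the dense attention matrix Theta, sparsify it with edge_sample
    (keeping only q reweighted edges), and propagate features through the
    sampled subgraph only."""
    for W in Ws:                                  # one weight matrix per layer
        head_outputs = []
        for _ in range(num_heads):
            Theta = attn_fn(H, W)                 # attention matrix (N x N)
            Theta_hat = edge_sample(Theta, A, q)  # sparsified, reweighted
            head_outputs.append(act(Theta_hat @ H @ W))
        H = np.concatenate(head_outputs, axis=1)  # concatenate attention heads
    return H
```

In the practical FastGAT-const variant, `edge_sample` would be replaced by a one-time sparsification of the input graph reused across all layers and heads.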

3.2. SAMPLING GRAPH EDGES USING EFFECTIVE RESISTANCES

We use a particular edge-sampling function EdgeSample(·) motivated by the field of spectral graph sparsification. Let L denote the graph Laplacian (defined as $L = D - A$, where D is the degree matrix), let $\lambda_i(L)$ denote the $i$-th eigenvalue of L, and let $L^\dagger$ denote the Moore-Penrose pseudo-inverse of a matrix. Motivated by the fact that GNNs are approximations of spectral graph convolutions (defined in equation 1), we aim to preserve the spectrum (or eigenstructure) of the graph. Formally, let $L_G$ and $L_H$ be the Laplacian matrices of the original graph G and the sparsified graph H. Spectral graph sparsification ensures that the spectral content of H is similar to that of G:

$$(1 - \epsilon)\,\lambda_i(L_G) \le \lambda_i(L_H) \le (1 + \epsilon)\,\lambda_i(L_G) \quad \forall i, \qquad (6)$$

where $\epsilon$ is any desired threshold. Spielman & Srivastava (2011) showed how to achieve this by sampling edges from a distribution proportional to their effective resistances.

Definition 1 (Effective Resistance (Spielman & Srivastava, 2011)). The effective resistance between any two nodes of a graph is the potential difference induced across the two nodes when a unit current is injected at one node and extracted from the other. Mathematically,

$$R_e(u, v) = b_e^\top L^\dagger b_e,$$

where $b_e = \delta_u - \delta_v$ ($\delta_l$ is the standard basis vector with a 1 in the $l$-th position) and $L^\dagger$ is the pseudo-inverse of the graph Laplacian matrix.

The effective resistance measures the importance of an edge to the graph structure; for example, removing an edge with high effective resistance can harm the graph connectivity. The particular EdgeSample function used in FastGAT is described in Algorithm 2.
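Definition 1 can be computed directly with a dense pseudo-inverse. This brute-force sketch is for intuition only; the paper relies on the near-linear-time approximation of Spielman & Srivastava (2011) rather than the $O(N^3)$ computation below:

```python
import numpy as np

def effective_resistance(A, u, v):
    """R_e(u, v) = b_e^T L^+ b_e with b_e = delta_u - delta_v and L = D - A.
    Uses the exact Moore-Penrose pseudo-inverse of the Laplacian."""
    L = np.diag(A.sum(axis=1)) - A      # graph Laplacian L = D - A
    L_pinv = np.linalg.pinv(L)          # Moore-Penrose pseudo-inverse
    b = np.zeros(A.shape[0])
    b[u], b[v] = 1.0, -1.0              # unit current in at u, out at v
    return b @ L_pinv @ b
```

The values match the electrical-network intuition: unit resistances in series add, while parallel paths lower the effective resistance, so edges on many alternative paths matter less to connectivity.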
Algorithm 2: Effective-resistance based EdgeSample function (Spielman & Srivastava, 2011)
Input: Graph $G(V, E_G)$ with edge weights $w_e$ for $e \in E_G$; parameter $0 < \epsilon < 1$.
For each edge $e(u, v)$, compute $R_e(u, v)$ using the fast algorithm in (Spielman & Srivastava, 2011)
Set $q = \min(M, \mathrm{int}(0.16 N \log N / \epsilon^2))$; initialize $H = \mathrm{Graph}(E_H = \emptyset, V)$
for $i = 1$ to $q$ do
    Sample an edge $e_i$ from the distribution $p_e$ proportional to $w_e R_e$
    if $e_i \in E_H$ then
        Add $w_e / (q p_e)$ to its weight    // increase the weight of an existing edge
    else
        Add $e_i$ to $E_H$ with weight $w_e / (q p_e)$    // add the edge for the first time
return $H = \mathrm{Graph}(E_H, V)$

For a desired value of $\epsilon$, the algorithm samples $q = O(N \log N / \epsilon^2)$ edges such that equation 6 is satisfied.

Choosing $\epsilon$. Algorithm 2 requires setting a pruning parameter $\epsilon$, which determines both the quality of the approximation after sparsification and the number of retained edges q. The choice of $\epsilon$ is a design parameter at the discretion of the user. To remove the burden of choosing $\epsilon$, we also provide an adaptive algorithm in Section B.4 in the appendix.

Complexity. The sample complexity q in Algorithm 2 directly determines the final complexity. If $q = O(N \log N / \epsilon^2)$, then the spectral approximation in equation 6 can be achieved (Spielman & Srivastava, 2011). Note that this results in a number of edges that is almost linear in the number of nodes, as opposed to quadratic as in the case of dense graphs. Computing $R_e$ for all edges (up to a constant-factor approximation) takes $O(M \log N)$ time, where M is the number of edges (Spielman & Srivastava, 2011). While we describe the algorithm in detail in the appendix (Section B.3), it uses a combination of fast solvers for Laplacian linear systems and the Johnson-Lindenstrauss lemma.
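The sampling loop of Algorithm 2 can be sketched as follows, assuming the effective resistances have already been computed. The `(u, v)`-tuple edge representation is an assumption for illustration:

```python
import numpy as np

def edge_sample(edges, w, R, q, rng=None):
    """Sketch of Algorithm 2's sampling loop: draw q edges with replacement
    with probability p_e proportional to w_e * R_e; each draw of edge e adds
    w_e / (q * p_e) to its weight, so edge weights of the sparsified graph
    are unbiased in expectation. `edges` is a list of (u, v) tuples; w and R
    are parallel arrays of edge weights and effective resistances."""
    rng = rng or np.random.default_rng(0)
    p = np.asarray(w, dtype=float) * np.asarray(R, dtype=float)
    p = p / p.sum()                              # sampling distribution p_e
    sparsified = {}
    for idx in rng.choice(len(edges), size=q, p=p):
        e = edges[idx]
        sparsified[e] = sparsified.get(e, 0.0) + w[idx] / (q * p[idx])
    return sparsified
```

Because the reweighting divides by $q p_e$, the expected total edge weight of the sampled graph equals that of the original graph, which is what makes the Laplacian estimate unbiased.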
This is almost linear in the number of edges, and hence much smaller than the complexity of computing attention coefficients in every layer and forward pass of a GNN. Another important point is that the computation of the $R_e$'s is a one-time cost: unlike graph attention coefficients, the effective resistances do not need to be recomputed in every training iteration. Hence, once sparsified, the same graph can be used in all subsequent experiments. Further, since each edge is sampled independently, the edge-sampling process itself can be parallelized.

4. THEORETICAL ANALYSIS OF FASTGAT

In this section we provide the theoretical analysis of FastGAT. Although we use the sampling strategy of (Spielman & Srivastava, 2011), their work addresses the preservation of only the eigenvalues of L. We are instead interested in the following question: can preserving the spectral structure of the graph lead to good performance under the GAT model? To answer this question, we give an upper bound on the error between the feature updates computed by a single layer of the GAT model using the full graph and using a sparsified graph produced by FastGAT. Spectral sparsification preserves the spectrum of the underlying graph, which hints that neural network computations utilizing spectral convolutions can be approximated using sparser graphs. We first show that this is true in a layer-wise sense for the GCN model (Kipf & Welling, 2016) and then show a similar result for the GAT model. Below, ReLU denotes the standard Rectified Linear Unit and ELU denotes the Exponential Linear Unit.

Theorem 1. At any layer $l$ of a GCN model with input features $H^{(l)} \in \mathbb{R}^{N \times D}$ and weight matrix $W^{(l)} \in \mathbb{R}^{D \times F}$, if the element-wise non-linearity is either the ReLU or the ELU function, then the features $\hat{H}_f$ and $\hat{H}_s$ computed using equation 3 with the full and a layer-dependent spectrally sparsified graph obey

$$\|\hat{H}_f - \hat{H}_s\|_F \le 8\epsilon\, \|H^{(l)} W^{(l)}\|_F, \qquad (7)$$

where $L_{\mathrm{norm}}$ is as defined in equation 2 and $\|\cdot\|$ denotes the spectral norm.

In our next result, we show a similar upper bound on the features computed with the full and the sparsified graphs under the GAT model.

Theorem 2. At any layer $l$ of a GAT model with input features $H^{(l)} \in \mathbb{R}^{N \times D}$, weight matrix $W^{(l)} \in \mathbb{R}^{D \times F}$, and attention coefficients $\alpha_{ij}$ in that layer, let the non-linearity be either the ReLU or the ELU function. Then the features $\hat{H}_f$ and $\hat{H}_s$ computed using equation 5 with the full and a layer-dependent spectrally sparsified graph obey

$$\|\hat{H}_f - \hat{H}_s\|_F \le 12\epsilon\, \|H^{(l)} W^{(l)}\|_F, \qquad (8)$$

where $\|\cdot\|$ denotes the spectral norm of the matrix.

Theorem 2 shows that our proposed layer-wise spectral sparsification leads to good approximations of the latent embeddings $\hat{H}$ for the GAT model as well. The guarantees given above assume a layer-wise sparsification that is updated based on the attention coefficients. To circumvent the associated computational burden, we use the simpler variants such as FastGAT-const and always use the original weight matrix to sparsify the graph in each layer. In the experiments section, we show that this relaxation to a one-time spectral sparsification does not lead to any degradation in performance.

Approximation of weight matrices. Theorems 1 and 2 provide an upper bound on the feature updates obtained using the full and sparsified graphs. In practice, we observe an even stronger notion of approximation between GAT and FastGAT: the weight matrices of the two models after training are good approximations of each other. We report this observation in Section A.4 in the appendix, where we show that the error between the learned matrices is small and proportional to the value of $\epsilon$ itself.

5. EXPERIMENTS

Datasets. We evaluate FastGAT on large graph datasets using semi-supervised node classification tasks. This is a standard task for evaluating GNNs, as done in (Veličković et al., 2018; Hamilton et al., 2017a; Kipf & Welling, 2016). The datasets are sourced from the DGLGraph library (DGL), and their statistics are provided in Table 1. We evaluate on both transductive and inductive tasks: the PPI dataset serves as a standard benchmark for inductive classification, and the remaining datasets for transductive classification. Further details about the datasets, including the train/validation/test splits, are given in the appendix (Section B.1). We also evaluated on smaller datasets including Cora, Citeseer, and Pubmed, but present their results in the appendix (Section B.2).

Baselines. Transductive learning: We compare FastGAT with the following baseline methods: (1) the original graph attention network (GAT) (Veličković et al., 2018); (2) SparseGAT (Ye & Ji, 2019), which learns edge coefficients to sparsify the graph; (3) random subsampling of edges; and (4) FastGCN (Chen et al., 2018), which is also designed for GNN speedup. Note that we compare against SparseGAT in a transductive setting, whereas the original paper (Ye & Ji, 2019) uses an inductive setting. We thus demonstrate that FastGAT can handle the full input graph, unlike any previous attention-based baseline method. Inductive learning: For this task, we compare with both GAT (Veličković et al., 2018) and GraphSAGE (Hamilton et al., 2017a). More importantly, for both inductive and transductive tasks, we show that a uniform random subsampling of edges results in a drop in performance, whereas FastGAT does not. The evaluation setup and model implementation details are provided in Section B in the appendix.

Q1. FASTGAT PROVIDES FASTER TRAINING WITH STATE-OF-THE-ART ACCURACY.

Our first experiment studies the effect of FastGAT on the accuracy and time performance of attention-based GNNs in node classification. We sample $q = \mathrm{int}(0.16 N \log N / \epsilon^2)$ edges from the distribution $p_e$ with replacement, as described in Section 3.2.

Transductive learning: In this setting, we assume that the features of all nodes in the graph, including train, validation, and test nodes, are available, but only the labels of the training nodes are available during training, similar to (Veličković et al., 2018). First, we provide a direct comparison between FastGAT and the original GAT model and report the results in Table 2. As can be observed, FastGAT achieves the same test accuracy as the full GAT across all datasets while being dramatically faster: we achieve up to a 5x speedup on GPU (10x on CPU). We then compare FastGAT with the following baselines in Table 3: SparseGAT (Ye & Ji, 2019), random subsampling of edges, and FastGCN (Chen et al., 2018). SparseGAT uses the attention mechanism to learn embeddings and a sparsifying mask on the edges. We compare the training time per epoch of the baseline methods against FastGAT in Figure 1. The results show that FastGAT matches state-of-the-art accuracy (F1 score) while being much faster. Random subsampling of edges leads to a model that is as fast as ours, but with a degradation in accuracy. FastGAT is also faster than FastGCN on some large datasets, even though FastGCN does not compute any attention coefficients. Overall, the classification accuracy of FastGAT remains the same as (or sometimes even improves over) standard GAT, while the training time reduces drastically. This is most evident on the Reddit dataset, where the vanilla GAT model runs out of memory on a machine with 128GB RAM and a Tesla P100 GPU when computing attention coefficients over 57 million edges, while FastGAT trains with 10 seconds per epoch.
In Tables 2 and 3, FastGCN-400 denotes that we sample 400 nodes in every forward pass, as described in (Chen et al., 2018) (similarly, FastGCN-800 samples 800 nodes). FastGAT-0.5 denotes that we use $\epsilon = 0.5$. GAT-rand-0.5 uses random subsampling of edges, but keeps the same number of edges as FastGAT-0.5.

Inductive learning. FastGAT can also be applied in the inductive learning framework, where features of only the training nodes are available and training is performed using the subgraph consisting of the training nodes. To show the utility of FastGAT in such a setting, we use the protein-protein interaction (PPI) dataset (Zitnik & Leskovec, 2017). Our model parameters are the same as in (Veličković et al., 2018), but we sparsify each of the 20 training graphs before training on them. The other 4 graphs are used for validation and testing (2 each). We use $\epsilon = 0.25$ and $0.5$ in our experiments, since the PPI dataset is smaller than the other datasets. We report the experimental results in Table 4. FastGAT clearly outperforms baselines such as GraphSAGE and uniform subsampling of edges. While it has the same accuracy as the GAT model (which is expected), it has a much smaller training time, as reported in Table 4.

Finally, we study whether FastGAT is sensitive to the particular formulation of the attention function. Alternative formulations have been proposed to capture pairwise similarity. For example, (Thekumparampil et al., 2018) proposes a cosine-similarity based approach, where the attention coefficient of an edge is defined as

$$\alpha^{(\ell)}_{ij} = \mathrm{softmax}_{j \in \mathcal{N}_i}\!\left( \beta^{(\ell)} \cos(h^{(\ell)}_i, h^{(\ell)}_j) \right), \qquad (9)$$

where $\beta^{(\ell)}$ is a layer-wise learnable parameter and $\cos(x, y) = x^\top y / (\|x\| \|y\|)$.
Another definition is proposed in (Zhang et al., 2018) (GaAN: Gated Attention Networks), which defines attention as

$$\alpha^{(l)}_{ij} = \mathrm{softmax}_{j \in \mathcal{N}_i}\!\left( \langle \mathrm{FC}_{\mathrm{src}}(h_i), \mathrm{FC}_{\mathrm{dst}}(h_j) \rangle \right), \qquad (10)$$

where $\mathrm{FC}_{\mathrm{src}}$ and $\mathrm{FC}_{\mathrm{dst}}$ are 2-layer fully connected neural networks. We performed similar experiments with these attention definitions. Table 5 confirms that FastGAT generalizes to different attention functions. Note that the variability in accuracy across Tables 2 and 5 comes from the different definitions of the attention functions and not from FastGAT. Our goal is to show that, given a specific GAT model, FastGAT achieves similar accuracy to that model but in much less time.
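The cosine-similarity attention of equation 9 can be sketched as follows. The adjacency-list interface mirrors the earlier sketches and is an assumption for illustration, not the formulation's reference code:

```python
import numpy as np

def cosine_attention(H, beta, neighbors):
    """Cosine-similarity attention (equation 9) sketch:
    alpha_ij = softmax_{j in N_i}(beta * cos(h_i, h_j)),
    where beta is a (layer-wise learnable) scalar temperature."""
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    Hn = H / np.maximum(norms, 1e-12)        # unit-normalized features
    alpha = {}
    for i, nbrs in neighbors.items():
        s = beta * (Hn[nbrs] @ Hn[i])        # beta * cos(h_i, h_j)
        s = np.exp(s - s.max())              # stable softmax
        alpha[i] = s / s.sum()
    return alpha
```

Since FastGAT only changes which edges the coefficients are computed over, swapping equation 4 for equation 9 (or equation 10) leaves the sparsification step untouched.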

6. CONCLUSION

In this paper, we introduced FastGAT, a method to make attention-based GNNs lightweight by using spectral sparsification. We theoretically justified our FastGAT algorithm. FastGAT significantly reduces the computational time across multiple large real-world graph datasets while attaining state-of-the-art performance.



https://en.wikipedia.org/wiki/Johnson-Lindenstrauss_lemma



Figure 1: Training time per epoch for our FastGAT model and baselines. FastGAT has a significantly lower training time, while also preserving classification accuracy (Table 3). On the Reddit dataset, both the GAT and SparseGAT models run out of memory in the transductive setting; hence we do not include bars for them in the plot.

Table 1: Dataset statistics for Reddit, Coauthor-Physics, Github, Amazon-Computers, Coauthor-CS, Amazon-Photos, and PPI (inductive task).

Table 2: Comparison of FastGAT with GAT (Veličković et al., 2018). Test scores (mean ± std):

Model               Reddit       Coauth-Phy   Github       Amaz.Comp    Coauth-CS    Amaz.Photos
FastGAT-const-0.5   0.93±0.000   0.94±0.001   0.86±0.001   0.88±0.004   0.88±0.001   0.91±0.002
FastGAT-const-0.9   0.88±0.001   0.94±0.002   0.85±0.002   0.86±0.002   0.88±0.004   0.89±0.002

Transductive learning: While achieving significantly lower training time (Fig. 1), FastGAT still has comparable and sometimes better accuracy than the other models. The computationally intensive SparseGAT model runs out of memory on the Reddit dataset.

Inductive learning on the PPI dataset: The training time per epoch for the full GAT method is around 45.8s, whereas FastGAT requires only about 29.8s when $\epsilon = 0.25$ and 20s when $\epsilon = 0.5$. FastGAT achieves the same accuracy as GAT and outperforms the other baselines, while also being computationally efficient. The F1 score for GraphSAGE is the value reported in Veličković et al. (2018).

Comparison of full and sparsified graphs.

References

Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):i190-i198, 2017.

