FAST GRAPH ATTENTION NETWORKS USING EFFECTIVE RESISTANCE BASED GRAPH SPARSIFICATION

Abstract

The attention mechanism has demonstrated superior performance for inference over nodes in graph neural networks (GNNs); however, it imposes a high computational burden during both training and inference. We propose FastGAT, a method to make attention based GNNs lightweight by using spectral sparsification to generate an optimal pruning of the input graph. This results in a per-epoch time that is almost linear in the number of graph nodes, as opposed to quadratic. We theoretically prove that spectral sparsification preserves the features computed by the GAT model, thereby justifying our FastGAT algorithm. We experimentally evaluate FastGAT on several large real-world graph datasets for node classification tasks under both inductive and transductive settings. FastGAT can dramatically reduce (by up to 10x) the computational time and memory requirements, enabling the use of attention based GNNs on large graphs.

1. INTRODUCTION

Graphs are efficient representations of pairwise relations, with many real-world applications including product co-purchasing networks (McAuley et al., 2015), co-author networks (Hamilton et al., 2017b), etc. Graph neural networks (GNNs) have become popular as a tool for inference from graph based data. By leveraging the geometric structure of the graph, GNNs learn improved representations of the graph nodes and edges that can lead to better performance in various inference tasks (Kipf & Welling, 2016; Hamilton et al., 2017a; Veličković et al., 2018). More recently, the attention mechanism has demonstrated superior performance for inference over nodes in GNNs (Veličković et al., 2018; Xinyi & Chen, 2019; Thekumparampil et al., 2018; Lee et al., 2020; Bianchi et al., 2019; Knyazev et al., 2019). However, attention based GNNs suffer from a huge computational cost, which may hinder the applicability of the attention mechanism to large graphs.

GNNs generally rely on graph convolution operations. For a graph $G$ with $N$ nodes, graph convolution with a kernel $g_w : \mathbb{R} \to \mathbb{R}$ is defined as
$$g_w \star h = U g_w(\Lambda) U^\top h, \qquad (1)$$
where $U$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues of the normalized graph Laplacian $L_{\mathrm{norm}} = I - D^{-1/2} A D^{-1/2}$, with $D$ and $A$ being the degree matrix and the adjacency matrix of the graph, and $g_w$ applied elementwise. Since computing $U$ and $\Lambda$ can be very expensive ($O(N^3)$), most GNNs use an approximation of the graph convolution operator. For example, graph convolution networks (GCNs) (Kipf & Welling, 2016) update node features by computing averages over the neighbors of each node, a first order approximation of Eq. (1). A single neural network layer is defined as
$$H^{(l+1)}_{\mathrm{GCN}} = \sigma\!\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right),$$
where $H^{(l)}$ and $W^{(l)}$ are the activations and the weight matrix at the $l$-th layer respectively, $\tilde{A} = A + I$, and $\tilde{D}$ is the degree matrix of $\tilde{A}$.
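The GCN layer above can be sketched directly from its formula. The following is a minimal dense NumPy illustration (not the authors' implementation); the function name `gcn_layer` and the choice of ReLU for the nonlinearity $\sigma$ are assumptions for illustration:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = sigma(D~^{-1/2} A~ D~^{-1/2} H W),
    where A~ = A + I adds self-loops and D~ is the degree matrix of A~.
    Here sigma is taken to be ReLU."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                    # A~ = A + I (self-loops)
    d = A_tilde.sum(axis=1)                    # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # normalized adjacency
    return np.maximum(A_hat @ H @ W, 0.0)      # ReLU activation
```

A sparse-matrix version would be used in practice, since $\tilde{A}$ is typically very sparse for large graphs.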
Attention based GNNs add another layer of complexity: they compute pairwise attention coefficients between all connected nodes. This process can significantly increase the computational burden, especially on large graphs. Approaches to speed up GNNs were proposed in (Chen et al., 2018; Hamilton et al., 2017a). However, these sampling and aggregation based methods were designed for simple GCNs and are not applicable to attention based GNNs. There has also been work on inducing sparsity in attention based GNNs (Ye & Ji, 2019; Zheng et al., 2020), but it focuses on addressing potential overfitting of attention based models rather than scalability. In this paper, we propose the Fast Graph Attention neTwork (FastGAT), an edge-sampling based method that leverages effective resistances of edges to make attention based GNNs lightweight. The effective resistance measures the importance of an edge in terms of preserving the graph connectivity. FastGAT uses this measure to prune the input graph and generate a randomized subgraph with far fewer edges. Such a procedure preserves the spectral features of the graph, hence retaining the information that attention based GNNs need. At the same time, the pruned graph becomes amenable to more expressive but computationally intensive models such as attention GNNs. With the sampled subgraph as their input, attention based GNNs enjoy a much smaller computational complexity. Note that FastGAT is applicable to all attention based GNNs. In this paper, we mostly focus on the Graph Attention Network (GAT) model proposed by (Veličković et al., 2018). However, we also show that FastGAT generalizes to two other attention based GNNs, namely the cosine similarity based approach (Thekumparampil et al., 2018) and Gated Attention Networks (Zhang et al., 2018). We note that Graph Attention Networks can be re-interpreted as convolution based GNNs. We show this explicitly in the Appendix.
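The core primitive described above, sampling edges with probability proportional to their (weighted) effective resistance and reweighting the survivors, can be sketched as follows. This is a small dense illustration, not the paper's implementation: it computes exact effective resistances via the Laplacian pseudoinverse, which is only feasible for small graphs (a large-scale version would use the Spielman–Srivastava approximation), and the function names `effective_resistances` and `sparsify` are chosen here for illustration:

```python
import numpy as np

def effective_resistances(A):
    """Exact effective resistance R_e = (e_u - e_v)^T L^+ (e_u - e_v)
    for every edge, via the Laplacian pseudoinverse (small graphs only)."""
    N = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A        # combinatorial Laplacian
    L_pinv = np.linalg.pinv(L)
    edges, R = [], []
    for u in range(N):
        for v in range(u + 1, N):
            if A[u, v] > 0:
                chi = np.zeros(N)
                chi[u], chi[v] = 1.0, -1.0
                edges.append((u, v))
                R.append(chi @ L_pinv @ chi)
    return edges, np.array(R)

def sparsify(A, q, seed=0):
    """Draw q edges with probability p_e proportional to w_e * R_e and
    reweight each sampled edge by w_e / (q * p_e), so the sparsifier's
    Laplacian is an unbiased estimate of the original Laplacian."""
    rng = np.random.default_rng(seed)
    edges, R = effective_resistances(A)
    w = np.array([A[u, v] for u, v in edges])
    p = w * R / (w * R).sum()
    H = np.zeros_like(A, dtype=float)
    for idx in rng.choice(len(edges), size=q, p=p):
        u, v = edges[idx]
        H[u, v] += w[idx] / (q * p[idx])
        H[v, u] = H[u, v]
    return H
```

A useful sanity check: in any connected unit-weight graph the effective resistances over all edges sum to $N - 1$ when the graph is a tree (each tree edge has $R_e = 1$), so for a path graph all sampling probabilities are equal.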
Based on this re-interpretation, we theoretically prove that spectral sparsification preserves the feature representations computed by the GAT model. We believe this interpretation also opens up interesting connections between sparsifying state transition matrices of random walks and speeding up computations in GNNs. The contributions of our paper are as follows:

• We propose FastGAT, a method that uses effective resistance based spectral graph sparsification to accelerate attention GNNs in both inductive and transductive learning tasks. The rapid subsampling and the spectrum preserving property of FastGAT help attention GNNs retain their accuracy advantages while becoming computationally light.

• We provide a theoretical justification for using spectral sparsification in the context of attention based GNNs by proving that spectral sparsification preserves the features computed by GNNs.

• FastGAT outperforms state-of-the-art algorithms across a variety of datasets under both transductive and inductive settings in terms of computation, achieving a speedup of up to 10x in training and inference time. On larger datasets such as Reddit, the standard GAT model runs out of memory, whereas FastGAT achieves an F1 score of 0.93 with a per-epoch training time of 7.73 seconds.

• FastGAT generalizes to other attention based GNNs, such as the cosine similarity based attention (Thekumparampil et al., 2018) and the Gated Attention Network (Zhang et al., 2018). We are thus able to take advantage of the attention mechanism while remaining computationally efficient.
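For reference, the per-edge attention computation that FastGAT accelerates is the GAT layer of (Veličković et al., 2018): $e_{ij} = \mathrm{LeakyReLU}(a^\top [W h_i \,\|\, W h_j])$ for each connected pair, followed by a softmax over each node's neighborhood. A minimal single-head dense sketch (illustrative only; a real implementation scores only the surviving edges, which is exactly where pruning pays off):

```python
import numpy as np

def gat_layer(A, H, W, a, slope=0.2):
    """Single-head GAT layer, dense sketch.
    e_ij = LeakyReLU(a^T [W h_i || W h_j]) on edges (plus self-loops),
    alpha_ij = softmax_j(e_ij), output h_i' = sum_j alpha_ij * W h_j.
    The O(|E|) attention-score cost is what edge sparsification reduces."""
    Z = H @ W                                 # transformed features, (N, F')
    N, Fp = Z.shape
    # a^T [z_i || z_j] decomposes as (a_l . z_i) + (a_r . z_j)
    src = Z @ a[:Fp]                          # a_l . z_i, shape (N,)
    dst = Z @ a[Fp:]                          # a_r . z_j, shape (N,)
    e = src[:, None] + dst[None, :]           # raw scores e_ij
    e = np.where(e > 0, e, slope * e)         # LeakyReLU
    mask = (A + np.eye(N)) > 0                # attend over neighbors + self
    e = np.where(mask, e, -np.inf)            # mask non-edges
    e = e - e.max(axis=1, keepdims=True)      # numerically stable softmax
    att = np.exp(e)
    att /= att.sum(axis=1, keepdims=True)     # rows sum to 1
    return att @ Z                            # aggregate neighbor features
```

Because each output row is a convex combination of transformed neighbor features, removing low-importance edges with a spectrum-preserving sampler changes the attention softmax support while keeping the aggregated features close, which is the property the paper's analysis formalizes.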

2. RELATED WORK

Graph sparsification aims to approximate a given graph by a graph with fewer edges for efficient computation. Depending on the final goal, there are cut sparsifiers (Benczúr & Karger, 1996), pairwise distance preserving sparsifiers (Althöfer et al., 1993) and spectral sparsifiers (Spielman & Teng, 2004; Spielman & Srivastava, 2011), among others (Zhao, 2015; Calandriello et al., 2018; Hübler et al., 2008; Eden et al., 2018; Sadhanala et al., 2016). In this work, we use spectral sparsification to choose a randomized subgraph while preserving spectral properties. Apart from providing the strongest guarantees in preserving graph structure (Chu et al., 2018), spectral sparsifiers align well with GNNs due to their connection to spectral graph convolutions.

Accelerating graph based inference has drawn increasing interest. Two methods proposed in (Chen et al., 2018) (FastGCN) and (Huang et al., 2018) speed up GCNs by using importance sampling to sample a subset of nodes per layer during training. Similarly, GraphSAGE (Hamilton et al., 2017a) proposes an edge sampling and aggregation based method for inductive learning tasks. All of the above works use simple aggregation and target simple GCNs, while our work focuses on more recent attention based GNNs such as (Veličković et al., 2018).

