FAST GRAPH ATTENTION NETWORKS USING EFFECTIVE RESISTANCE BASED GRAPH SPARSIFICATION

Abstract

The attention mechanism has demonstrated superior performance for inference over nodes in graph neural networks (GNNs); however, it imposes a high computational burden during both training and inference. We propose FastGAT, a method to make attention-based GNNs lightweight by using spectral sparsification to generate an optimal pruning of the input graph. This results in a per-epoch time that is almost linear in the number of graph nodes, as opposed to quadratic. We theoretically prove that spectral sparsification preserves the features computed by the GAT model, thereby justifying our FastGAT algorithm. We experimentally evaluate FastGAT on several large real-world graph datasets for node classification tasks under both inductive and transductive settings. FastGAT can dramatically reduce (up to 10x) the computational time and memory requirements, allowing the usage of attention-based GNNs on large graphs.

1. INTRODUCTION

Graphs are efficient representations of pairwise relations, with many real-world applications including product co-purchasing networks (McAuley et al., 2015), co-author networks (Hamilton et al., 2017b), etc. Graph neural networks (GNNs) have become popular as a tool for inference from graph-based data. By leveraging the geometric structure of the graph, GNNs learn improved representations of the graph nodes and edges that can lead to better performance in various inference tasks (Kipf & Welling, 2016; Hamilton et al., 2017a; Veličković et al., 2018). More recently, the attention mechanism has demonstrated superior performance for inference over nodes in GNNs (Veličković et al., 2018; Xinyi & Chen, 2019; Thekumparampil et al., 2018; Lee et al., 2020; Bianchi et al., 2019; Knyazev et al., 2019). However, attention-based GNNs suffer from a huge computational cost, which may hinder the applicability of the attention mechanism to large graphs.

GNNs generally rely on graph convolution operations. For a graph $G$ with $N$ nodes, graph convolution with a kernel $g_w : \mathbb{R} \to \mathbb{R}$ is defined as
$$g_w \star h = U g_w(\Lambda) U^\top h, \qquad (1)$$
where $U$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues of the normalized graph Laplacian matrix, defined as $L_{\mathrm{norm}} = I - D^{-1/2} A D^{-1/2}$, with $D$ and $A$ being the degree matrix and the adjacency matrix of the graph, and $g_w$ is applied elementwise. Since computing $U$ and $\Lambda$ can be very expensive ($O(N^3)$), most GNNs use an approximation of the graph convolution operator. For example, in graph convolution networks (GCN) (Kipf & Welling, 2016), node features are updated by computing averages over the neighbors of each node as a first-order approximation of Eq. 1. A single neural network layer is defined as
$$H^{(l+1)}_{\mathrm{GCN}} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right),$$
where $H^{(l)}$ and $W^{(l)}$ are the activations and the weight matrix at the $l$-th layer respectively, $\sigma$ is an elementwise nonlinearity, $\tilde{A} = A + I$, and $\tilde{D}$ is the degree matrix of $\tilde{A}$.
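The GCN propagation rule above can be illustrated with a minimal NumPy sketch. This is an illustrative dense implementation of the symmetric normalization and feature update, not the authors' code; the function and variable names are our own, and the nonlinearity is assumed to be ReLU.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: sigma(D~^{-1/2} A~ D~^{-1/2} H W) with ReLU as sigma."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                    # add self-loops: A~ = A + I
    d = A_tilde.sum(axis=1)                    # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)      # elementwise ReLU

# Toy usage: a 3-node path graph with 2-dimensional node features
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3)[:, :2]
W = np.ones((2, 2))
out = gcn_layer(A, H, W)
print(out.shape)  # (3, 2)
```

Note that this dense form is only for exposition: the averaging over neighbors is what lets GCN avoid the $O(N^3)$ eigendecomposition of Eq. 1, and practical implementations use sparse matrix products.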
Attention based GNNs add another layer of complexity: they compute pairwise attention coefficients between all connected nodes. This process can significantly increase the computational burden,
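The pairwise attention computation can be sketched as follows, in the style of Veličković et al. (2018): a score $e_{ij} = \mathrm{LeakyReLU}(a^\top [W h_i \,\|\, W h_j])$ is computed for every edge and then softmax-normalized over each node's neighborhood. This is a hedged illustrative sketch, not the authors' implementation; the attention vector `a`, the LeakyReLU slope, and all shapes here are assumptions for exposition.

```python
import numpy as np

def attention_coefficients(A, H, W, a, alpha=0.2):
    """GAT-style attention: one score per edge, softmax over neighborhoods."""
    Z = H @ W                                    # transformed features, N x F'
    N = Z.shape[0]
    e = np.full((N, N), -np.inf)                 # -inf masks non-edges
    rows, cols = np.nonzero(A + np.eye(N))       # attend over neighbors + self
    for i, j in zip(rows, cols):                 # one pass per edge: O(|E|)
        s = a @ np.concatenate([Z[i], Z[j]])     # a^T [W h_i || W h_j]
        e[i, j] = np.where(s > 0, s, alpha * s)  # LeakyReLU
    e = e - e.max(axis=1, keepdims=True)         # stabilized row-wise softmax
    exp_e = np.exp(e)
    return exp_e / exp_e.sum(axis=1, keepdims=True)
```

The per-edge loop makes the cost explicit: the number of attention scores grows with the number of edges, which for dense graphs approaches $N^2$ per layer per head, and this is precisely the burden that sparsifying the input graph reduces.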

