Graph Neural Network Acceleration via Matrix Dimension Reduction

Abstract

Graph Neural Networks (GNNs) have become the de facto method for machine learning on graph data (e.g., social networks, protein structures, code ASTs), but they require significant time and resources to train. One alternative is the Graph Neural Tangent Kernel (GNTK), a kernel method that corresponds to infinitely wide multi-layer GNNs. GNTK's parameters can be solved directly in a single step, avoiding time-consuming gradient descent. Today, GNTK is the state-of-the-art method for achieving high training speed without compromising accuracy. Unfortunately, solving for the kernel and searching for parameters can still take hours to days on real-world graphs. The current computation of GNTK has running time $O(N^4)$, where $N$ is the number of nodes in the graph. This prevents GNTK from scaling to datasets that contain large graphs. Theoretically, we present two techniques to speed up GNTK training while preserving the generalization error: (1) We use a novel matrix decoupling method to reduce matrix dimensions during the kernel solving. This reduces the dominant computation bottleneck from $O(N^4)$ to $O(N^3)$. (2) We apply sketching to further reduce the bottleneck to $o(N^\omega)$, where $\omega \approx 2.373$ is the exponent of fast matrix multiplication. Experimentally, we demonstrate that our approaches speed up kernel learning by up to 19× on real-world benchmark datasets.

1. Introduction

Graph Neural Networks (GNNs) have quickly become the de facto method for machine learning on graph data. GNNs have delivered ground-breaking results in many important areas of AI, including social networking Yang et al. (2020a), bio-informatics Zitnik & Leskovec (2017); Yue et al. (2020), recommendation systems Ying et al. (2018), and autonomous driving Weng et al. (2020); Yang et al. (2020b). However, efficient GNN training has become a major challenge with the relentless increase in the complexity of GNN models and dataset sizes, both in terms of the number of graphs in a dataset and the sizes of the graphs.

Recently, a new direction for fast GNN training is to use the Graph Neural Tangent Kernel (GNTK). Solving for the kernel and searching for the parameters in GNTK is equivalent to using gradient descent to train an infinitely wide multi-layer GNN. GNTK is significantly faster than iterative gradient-descent optimization because solving for the parameters in GNTK is a single-step kernel learning process. In addition, GNTK allows GNN training to scale with GNN model sizes, because the training time grows only linearly with the complexity of the GNN model. However, GNTK training can still take hours to days on typical GNN datasets today. Our key observation is that, during the process of solving for parameters in GNTK, most of the training time and resources are spent on multiplications of large matrices. Let $N$ be the maximum number of nodes among the graphs; these matrices can have sizes as large as $N^2 \times N^2$! This means a single matrix multiplication takes at least $N^4$ time, which prevents GNTK from scaling to larger graphs. Thus, in order to speed up GNTK training, we need to reduce matrix dimensions.

Our Contributions. We present two techniques to speed up GNTK: (1) We use a novel matrix decoupling method to reduce matrix dimensions during training without changing the calculation results. This reduces the dominant computation bottleneck from $O(N^4)$ to $O(N^3)$. (2) We propose a sketching method to further reduce the bottleneck to $o(N^\omega)$, where $\omega \approx 2.373$ is the exponent of fast matrix multiplication. We provide theoretical results showing that the resulting randomized GNTK still has a good generalization bound. In experiments, we evaluate our method on standard graph classification benchmarks. Our method improves GNTK training time by up to 19× while maintaining the same level of accuracy.
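To make the dimension-reduction idea concrete, here is a minimal sketch (ours, not the paper's implementation) of why decoupling helps: multiplying an $N^2 \times N^2$ Kronecker-structured matrix against a vector naively costs $O(N^4)$, while the standard identity $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(A X B^\top)$ (for row-major vectorization) performs the same computation with two $N \times N$ products in $O(N^3)$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8  # toy size; at realistic N the N^2 x N^2 matrix cannot even be stored

A = rng.standard_normal((N, N))
B = rng.standard_normal((N, N))
X = rng.standard_normal((N, N))

# Naive route: materialize the N^2 x N^2 Kronecker product -- O(N^4) work.
naive = np.kron(A, B) @ X.reshape(-1)

# Decoupled route: (A kron B) vec(X) = vec(A X B^T) for row-major vec,
# i.e. just two N x N matrix products -- O(N^3) work.
fast = (A @ X @ B.T).reshape(-1)

assert np.allclose(naive, fast)
```

The paper's matrix decoupling operates on the specific structured matrices arising in the GNTK recursion; the identity above only illustrates the generic principle behind this kind of dimension reduction.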

2. Background

Notations. For a positive integer $n$, we define $[n] := \{1, 2, \cdots, n\}$. For two integers $a \leq b$, we define $[a, b] := \{a, a+1, \cdots, b\}$ and $(a, b) := \{a+1, \cdots, b-1\}$; similarly we define $[a, b)$ and $(a, b]$. For a full-rank square matrix $A$, we use $A^{-1}$ to denote its inverse. We define the big-$O$ notation such that $f(n) = O(g(n))$ means there exist $n_0 \in \mathbb{N}_+$ and $M \in \mathbb{R}$ such that $f(n) \leq M \cdot g(n)$ for all $n \geq n_0$. For a matrix $A$, we use $\|A\|$ or $\|A\|_2$ to denote its spectral norm and $\|A\|_F$ to denote its Frobenius norm. We use $A^\top$ to denote the transpose of $A$. For a matrix $A$ and a vector $x$, we define $\|x\|_A := \sqrt{x^\top A x}$. We use $\phi$ to denote the ReLU activation function, i.e., $\phi(z) = \max\{z, 0\}$. For a function $f : \mathbb{R} \rightarrow \mathbb{R}$, we use $f'$ to denote its derivative.
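For concreteness, the matrix and vector norms defined above can be checked numerically (a small illustrative snippet of ours, not part of the paper):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
x = np.array([1.0, 1.0])

spectral = np.linalg.norm(A, 2)       # ||A||_2: largest singular value, here 3
frobenius = np.linalg.norm(A, "fro")  # ||A||_F = sqrt(2^2 + 3^2) = sqrt(13)
x_A = np.sqrt(x @ A @ x)              # ||x||_A = sqrt(x^T A x) = sqrt(2 + 3)

assert np.isclose(spectral, 3.0)
assert np.isclose(frobenius, np.sqrt(13.0))
assert np.isclose(x_A, np.sqrt(5.0))
```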

Graph neural network (GNN).

A GNN has $L$ levels of Aggregate operations, each followed by a Combine operation. Each Combine operation consists of $R$ fully-connected layers with output dimension $m$ and uses ReLU as the non-linearity. At the end, the GNN applies a ReadOut operation, which corresponds to the pooling operation in standard neural networks.





Consider a graph $G = (V, E)$ with $|V| = N$. Each node $u \in V$ has a feature vector $h_u \in \mathbb{R}^d$. In the GNN we use vectors $h_u^{(l,r)}$, where $l$ denotes the level and $r$ denotes the hidden layer within the Combine operation; each hidden vector $h_u^{(l,r)}$ has dimension $m$. For any $l \in [L]$, the Aggregate operation aggregates the information from the previous level:
$$h_u^{(l,0)} = c_u \sum_{v \in \mathcal{N}(u) \cup \{u\}} h_v^{(l-1,R)},$$
where $c_u \in \mathbb{R}$ is a scaling parameter, and $\mathcal{N}(u)$ is the set of neighbor nodes of $u$. The Combine operation then applies $R$ fully-connected layers with ReLU activation: $\forall r \in [R]$,
$$h_u^{(l,r)} = \sqrt{\tfrac{c_\phi}{m}} \cdot \phi\big(W^{(l,r)} h_u^{(l,r-1)}\big),$$
where $c_\phi$ is a normalizing constant. Finally, the output of the GNN on graph $G = (V, E)$ is computed by the ReadOut operation:
$$f(G) = \sum_{u \in V} h_u^{(L,R)}.$$
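The Aggregate, Combine, and ReadOut operations can be sketched as follows (an illustrative NumPy implementation under assumed shapes, with $c_\phi = 2$ for ReLU; the function and variable names are ours, not the paper's):

```python
import numpy as np

def gnn_forward(adj, H, weights, c_u=1.0):
    """Forward pass with L Aggregate levels, each followed by R Combine
    (ReLU) layers, then a sum ReadOut. weights[l][r] is W^{(l,r)}."""
    N = adj.shape[0]
    agg = adj + np.eye(N)              # aggregate over N(u) ∪ {u}
    for level_weights in weights:
        H = c_u * (agg @ H)            # Aggregate: sum neighbor features
        for W in level_weights:        # Combine: R fully-connected ReLU layers
            m = W.shape[0]
            H = np.sqrt(2.0 / m) * np.maximum(W @ H.T, 0.0).T
    return H.sum(axis=0)               # ReadOut: sum over all nodes

rng = np.random.default_rng(0)
N, d, m, L, R = 5, 4, 16, 2, 2
adj = (rng.random((N, N)) < 0.4).astype(float)
adj = np.maximum(adj, adj.T)           # symmetric adjacency for a toy graph
H = rng.standard_normal((N, d))

# W^{(l,r)} maps the previous width to m (d -> m at the first Combine layer).
weights, in_dim = [], d
for l in range(L):
    level = []
    for r in range(R):
        level.append(rng.standard_normal((m, in_dim)))
        in_dim = m
    weights.append(level)

out = gnn_forward(adj, H, weights)
assert out.shape == (m,)
```

Note that the $\sqrt{c_\phi/m}$ scaling keeps activations at a consistent magnitude as the width $m$ grows, which is what makes the infinite-width limit behind GNTK well defined.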

