Graph Neural Network Acceleration via Matrix Dimension Reduction

Abstract

Graph Neural Networks (GNNs) have become the de facto method for machine learning on graph data (e.g., social networks, protein structures, code ASTs), but they require significant time and resources to train. One alternative is the Graph Neural Tangent Kernel (GNTK), a kernel method that corresponds to infinitely wide multi-layer GNNs. GNTK's parameters can be solved for directly in a single step, avoiding time-consuming gradient descent. Today, GNTK is the state-of-the-art method for achieving high training speed without compromising accuracy. Unfortunately, solving for the kernel and searching for parameters can still take hours to days on real-world graphs. The current computation of GNTK has running time O(N^4), where N is the number of nodes in the graph, which prevents GNTK from scaling to datasets that contain large graphs. Theoretically, we present two techniques to speed up GNTK training while preserving the generalization error: (1) we use a novel matrix decoupling method to reduce matrix dimensions during kernel solving, which reduces the dominant computational bottleneck from O(N^4) to O(N^3); (2) we apply sketching to further reduce the bottleneck to o(N^ω), where ω ≈ 2.373 is the exponent of the current best matrix multiplication algorithm. Experimentally, we demonstrate that our approaches speed up kernel learning by up to 19× on real-world benchmark datasets.

1. Introduction

Graph Neural Networks (GNNs) have quickly become the de facto method for machine learning on graph data. GNNs have delivered ground-breaking results in many important areas of AI, including social networking Yang et al. (2020a), bio-informatics Zitnik & Leskovec (2017); Yue et al. (2020), recommendation systems Ying et al. (2018), and autonomous driving Weng et al. (2020); Yang et al. (2020b). However, efficient GNN training has become a major challenge with the relentless growth in the complexity of GNN models and in dataset sizes, both in the number of graphs per dataset and in the sizes of the graphs. Recently, a new direction for fast GNN training is to use the Graph Neural Tangent Kernel (GNTK). Solving for the kernel and searching for the parameters in GNTK is equivalent to training an infinitely wide multi-layer GNN with gradient descent. GNTK is significantly faster than iterative gradient descent optimization because solving for the parameters in GNTK is a single-step kernel learning process. In addition, GNTK allows GNN training to scale with GNN model size, because the training time grows only linearly with the complexity of the GNN model. However, GNTK training can still take hours to days on typical GNN datasets today. Our key observation is that, during the process of solving for the parameters in GNTK, most of the training time and resources are spent on multiplications of large matrices. Let N be the maximum number of nodes in the graphs; these matrices can have sizes as large as N^2 × N^2! This means a single matrix multiplication takes at least N^4 time, which prevents GNTK from scaling to larger graphs. Thus, in order to speed up GNTK training, we need to reduce the matrix dimensions.

Our Contributions. We present two techniques to speed up GNTK: (1) We use a novel matrix decoupling method to reduce matrix dimensions during training without changing the results of the computation. This reduces the dominant computational bottleneck from O(N^4) to O(N^3). (2) We propose a sketching method to further reduce the bottleneck to o(N^ω), where ω ≈ 2.373 is the exponent of the current best matrix multiplication algorithm. We provide theoretical results showing that the resulting randomized GNTK still has a good generalization bound. In experiments, we evaluate our method on standard graph classification benchmarks. Our method improves GNTK training time by up to 19× while maintaining the same level of accuracy.

2. Background

Notations. For a positive integer n, we define [n] := {1, 2, ..., n}. For two integers a ≤ b, we define [a, b] := {a, a + 1, ..., b} and (a, b) := {a + 1, ..., b − 1}; we define [a, b) and (a, b] similarly. For a full-rank square matrix A, we use A^{-1} to denote its inverse. We use big-O notation: f(n) = O(g(n)) means there exist n_0 ∈ N_+ and M ∈ R such that f(n) ≤ M · g(n) for all n ≥ n_0. For a matrix A, we use ‖A‖ or ‖A‖_2 to denote its spectral norm, ‖A‖_F to denote its Frobenius norm, and A^⊤ to denote its transpose. For a matrix A and a vector x, we define ‖x‖_A := √(x^⊤ A x). We use φ to denote the ReLU activation function, i.e., φ(z) = max{z, 0}. For a function f : R → R, we use f′ to denote its derivative.

Graph neural network (GNN).

A GNN has L levels of Aggregate operations, each followed by a Combine operation. A Combine operation consists of R fully-connected layers with output dimension m and ReLU non-linearity. At the end, the GNN has a ReadOut operation that corresponds to the pooling operation of standard neural networks. Consider a graph G = (V, E) with |V| = N. Each node u ∈ V has a feature vector h_u ∈ R^d. In the GNN we use vectors h_u^{(l,r)}, where l indexes the level and r indexes the hidden layer within that level. The sizes are h_u^{(1,0)} ∈ R^d and h_u^{(l,0)} ∈ R^m for all l ∈ [2 : L]. We define the initial vector h_u^{(0,R)} := h_u ∈ R^d for all u ∈ V. For any l ∈ [L], the Aggregate operation aggregates the information from the previous level:

h_u^{(l,0)} := c_u · Σ_{v ∈ N(u) ∪ {u}} h_v^{(l−1,R)},

where c_u ∈ R is a scaling parameter and N(u) is the set of neighbor nodes of u. The Combine operation then applies R fully-connected layers with ReLU activation: for all r ∈ [R],

h_u^{(l,r)} := (c_φ / m)^{1/2} · φ(W^{(l,r)} · h_u^{(l,r−1)}) ∈ R^m,

where c_φ ∈ R is a scaling parameter, W^{(1,1)} ∈ R^{m×d}, and W^{(l,r)} ∈ R^{m×m} for all (l, r) ∈ [L] × [R] \ {(1, 1)}. We let W := {W^{(l,r)}}_{l∈[L], r∈[R]}. Finally, the output of the GNN on graph G = (V, E) is computed by a ReadOut operation:

f_gnn(W, G) := Σ_{u∈V} h_u^{(L,R)} ∈ R^m.

For more details see Appendix Section B.1.

Neural tangent kernel. We briefly review the neural tangent kernel. Let φ denote the non-linear activation function, e.g. the ReLU activation φ(z) = max{z, 0}.

Definition 2.1 (Neural tangent kernel, Jacot et al. (2018)). For any two input data points x, z ∈ R^d, we define the kernel mapping K_ntk : R^d × R^d → R as

K_ntk(x, z) := E_{w ∼ N(0, I_d)}[φ′(w^⊤ x) φ′(w^⊤ z) · x^⊤ z],

where N(0, I_d) is the d-dimensional standard Gaussian distribution and φ′ is the derivative of the activation function φ. If φ is ReLU, then φ′(t) = 1 if t ≥ 0 and φ′(t) = 0 if t < 0.
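To make the GNN forward pass described at the start of this section concrete, the Aggregate/Combine/ReadOut pipeline can be sketched in a few lines of NumPy. This is a minimal toy sketch, not the paper's implementation: the width m, the depths L and R, and the scaling choice c_u = 1/(|N(u)| + 1) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_forward(A, X, L=2, R=2, m=16, c_phi=2.0):
    # A: (N, N) adjacency matrix; X: (N, d) node features h_u.
    N = A.shape[0]
    H = X  # h^{(0,R)}
    c = 1.0 / (A.sum(axis=1) + 1.0)  # scaling c_u (illustrative choice)
    for _ in range(L):
        # Aggregate: h_u <- c_u * sum of h_v over v in N(u) ∪ {u}
        H = c[:, None] * ((A + np.eye(N)) @ H)
        # Combine: R fully-connected ReLU layers of width m
        for _ in range(R):
            W = rng.standard_normal((m, H.shape[1]))
            H = np.sqrt(c_phi / m) * np.maximum(H @ W.T, 0.0)
    return H.sum(axis=0)  # ReadOut: sum of h^{(L,R)} over all nodes, in R^m

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = rng.standard_normal((4, 3))
out = gnn_forward(A, X)  # a vector in R^m
```

Since the last Combine layer ends with a ReLU, every coordinate of the ReadOut vector is non-negative; the GNTK below is the kernel induced by the gradient of exactly this kind of network at random initialization.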
Given x_1, x_2, ..., x_n ∈ R^d, we define the kernel matrix K ∈ R^{n×n} by K_{i,j} = K_ntk(x_i, x_j). The lower bound λ = λ_min(K) on the smallest eigenvalue of the neural tangent kernel matrix K (see Du et al. (2019b); Arora et al. (2019a;b); Song & Yang (2019); Brand et al. (2020)) and the separability δ of the input data points {x_1, x_2, ..., x_n} (see Li & Liang (2018); Allen-Zhu et al. (2019a;b)) play a crucial role in deep learning theory. Due to Oymak & Soltanolkotabi (2020), λ is at least Ω(δ/n^2), which unifies the two lines of research. The above work shows that as long as the neural network is sufficiently wide, m ≥ poly(n, d, 1/δ, 1/λ) (where n is the number of input data points and d is the dimension of each data point), running an (S)GD-type algorithm minimizes the training loss to zero.

Kernel regression and equivalence. The kernel method, or kernel regression, is a fundamental tool in machine learning Avron et al. (2017; 2019); Scholkopf & Smola (2018). Recently, it has been shown that training an infinite-width neural network is equivalent to kernel regression Arora et al. (2019a). Further, the equivalence even holds between regularized neural networks and kernel ridge regression Lee et al. (2020). Consider the following neural network training problem: min_W (1/2) ‖Y − f_nn(W, X)‖_2^2. Training this neural network is equivalent to solving the following neural tangent kernel ridge regression problem: min_β (1/2) ‖Y − f_ntk(β, X)‖_2^2. Here f_ntk(β, x) = Φ(x)^⊤ β ∈ R and f_ntk(β, X) = [f_ntk(β, x_1), ..., f_ntk(β, x_n)]^⊤ ∈ R^n are the predictors, and Φ is the feature map corresponding to the neural tangent kernel (NTK):

K_ntk(x, z) = E_W[⟨∂f_nn(W, x)/∂W, ∂f_nn(W, z)/∂W⟩],

where x, z ∈ R^d are any input data points and w_r ~ N(0, I) i.i.d. for r = 1, ..., m.
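For ReLU, the expectation in Definition 2.1 has the well-known closed form K_ntk(x, z) = x^⊤z · (π − θ)/(2π), where θ is the angle between x and z. The following sanity-check sketch (not part of the paper's pipeline; sample count and tolerance are illustrative) verifies this by Monte Carlo over w ~ N(0, I_d):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
x, z = rng.standard_normal(d), rng.standard_normal(d)

# Monte Carlo estimate of E_w[phi'(w^T x) phi'(w^T z)] * x^T z, w ~ N(0, I_d)
W = rng.standard_normal((200_000, d))
mc = np.mean((W @ x >= 0) & (W @ z >= 0)) * (x @ z)

# Closed form for ReLU: phi'(w^T x) phi'(w^T z) = 1 iff w lies in a wedge of
# angle (pi - theta), so the expectation equals (pi - theta) / (2 pi).
cos = x @ z / (np.linalg.norm(x) * np.linalg.norm(z))
theta = np.arccos(np.clip(cos, -1.0, 1.0))
exact = (x @ z) * (np.pi - theta) / (2 * np.pi)
```

The same arc-cosine expectations appear inside the GNTK recursion of Section 3, which is why the kernel can be evaluated exactly without sampling.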

3. Our GNTK formulation

We present our GNTK formulation in this section. Our formulation builds upon the GNTK formulas of Du et al. (2019a). The description in this section is simplified; see Sections B.2 and B.3 for more details.

3.1. Exact GNTK formulas

We consider a GNN with L Aggregate operations and L Combine operations, where each Combine operation has R fully-connected layers. Let G = (U, E) and H = (V, F) be two graphs with |U| = N and |V| = N. We use A_G and A_H to denote the adjacency matrices of G and H. We give the recursive formula to compute the kernel value K_gntk(G, H) ∈ R induced by this GNN, which is defined as

K_gntk(G, H) := E_{W ∼ N(0,I)}[⟨∂f_gnn(W, G)/∂W, ∂f_gnn(W, H)/∂W⟩],

where N(0, I) is a multivariate Gaussian distribution. Recall that the GNN uses a scaling factor c_u for each node u. We define C_G to be the diagonal matrix such that (C_G)_{u,u} = c_u for any u ∈ U; similarly we define C_H. We will use intermediate matrices K^{(l,r)}(G, H) ∈ R^{N×N} for each l ∈ [0 : L] and r ∈ [0 : R], where l denotes the level of Aggregate and Combine operations and r denotes the level of fully-connected layers inside a Combine operation. Initially we define K^{(0,R)}(G, H) ∈ R^{N×N} by

[K^{(0,R)}(G, H)]_{u,v} := ⟨h_u, h_v⟩ for all u ∈ U, v ∈ V,

where h_u, h_v ∈ R^d are the input features of u and v. We then recursively define K^{(l,r)}(G, H) for l ∈ [L] and r ∈ [R].

Reference                Time
Du et al. (2019a)        O(n^2) · (T_mat(N, N, d) + L · N^4 + LR · N^2)
Thm. 4.1 and 5.1 (ours)  O(n^2) · (T_mat(N, N, d) + L · T_mat(N, N, b) + LR · N^2)

Table 1: When L = O(1) and R = O(1), the dominant term in previous work is O(N^4). We improve it to T_mat(N, N, b).

Exact Aggregate. The Aggregate operation gives the following formula:

[K^{(l,0)}(G, H)]_{u,v} := c_u c_v Σ_{a ∈ N(u) ∪ {u}} Σ_{b ∈ N(v) ∪ {v}} [K^{(l−1,R)}(G, H)]_{a,b}.

In the experiments, the above equation is computed using a Kronecker product:

vec(K^{(l,0)}(G, H)) := ((C_G A_G) ⊗ (C_H A_H)) · vec(K^{(l−1,R)}(G, H)).   (1)

The dominating term of the final running time does not come from the Combine and ReadOut operations, so we defer their details to Section B.2 in the Appendix.

We briefly review the running time of previous work, which computes the kernel matrix over n graphs {G_i = (V_i, E_i)}_{i=1}^n with |V_i| ≤ N. Let d ∈ N_+ be the dimension of the feature vectors. The total running time is O(n^2) · (T_mat(N, N, d) + L · N^4 + LR · N^2). When using GNNs, we usually use a constant number of operations and fully-connected layers, i.e., L = O(1) and R = O(1), and we have d = o(N), while the sizes of the graphs can grow arbitrarily large. Thus the dominant term in the above running time is N^4; the major contribution of this work is to reduce it to o(N^ω), where ω ≈ 2.373 is the exponent of the current best matrix multiplication algorithm.

3.2. Approximate GNTK formulas

We follow the notation of the previous section. The goal now is to compute an approximate kernel value K̃_gntk(G, H) ∈ R such that K̃_gntk(G, H) ≈ K_gntk(G, H). We will use intermediate matrices K̃^{(l,r)}(G, H) ∈ R^{N×N} for each l ∈ [0 : L] and r ∈ [0 : R].

Approximate Aggregate. In the approximate version, we add two random sketching matrices S_G ∈ R^{b×N} and S_H ∈ R^{b′×N}, with b ≤ N and b′ ≤ N, to the Aggregate operation:

K̃^{(l,0)}(G, H) := C_G A_G · (S_G^⊤ S_G) · K̃^{(l−1,R)}(G, H) · (S_H^⊤ S_H) · A_H C_H.   (2)

Note that for the special case S_G^⊤ S_G = S_H^⊤ S_H = I, Eq. (2) degenerates to the following case:

K̃^{(l,0)}(G, H) = C_G A_G · K̃^{(l−1,R)}(G, H) · A_H C_H.

This equation is exactly the same as Eq. (1) of the exact case. See Fact 4.2 for why they are equivalent.

4. Our techniques: running time

The main contribution of our paper is to show that we can accelerate the computation of the GNTK defined in Du et al. (2019a) while maintaining a similar generalization bound. In this section we present the techniques that we use to achieve a faster running time; we provide the generalization bound in the next section. We first present our main running time theorem.

Theorem 4.1 (Main theorem, running time part; Theorem D.1). Consider a GNN with L Aggregate operations and L Combine operations, where each Combine operation has R fully-connected layers. We compute the kernel matrix using n graphs {G_i = (V_i, E_i)}_{i=1}^n with |V_i| ≤ N. Let b = o(N) be the sketch size, and let d ∈ N_+ be the dimension of the feature vectors. The total running time is

O(n^2) · (T_mat(N, N, d) + L · T_mat(N, N, b) + LR · N^2).

Note that we improve the dominating term from N^4 to T_mat(N, N, b). We achieve this improvement using two techniques:

1. We accelerate the multiplication of a Kronecker product with a vector by decoupling it into two matrix multiplications of smaller dimensions. This improves the running time from N^4 to T_mat(N, N, N). We present more details in Section 4.2.
2. We further accelerate the two matrix multiplications by using two sketching matrices. This improves the running time from T_mat(N, N, N) to T_mat(N, N, b). We present more details in Section 4.3.

4.1. Notations and known facts

Fast matrix multiplication. We use T_mat(n, d, m) to denote the time of multiplying an n × d matrix with a d × m matrix. Let ω denote the exponent of matrix multiplication, i.e., T_mat(n, n, n) = n^ω. The first result showing ω < 3 is due to Strassen (1969). The current best exponent is ω ≈ 2.373, due to Williams (2012) and Le Gall (2014). A common belief in the computational complexity community is that ω ≈ 2 (Cohn et al. (2005); Williams (2012); Jiang et al. (2020)). The following fact is well-known in the fast matrix multiplication literature (Coppersmith (1982); Strassen (1991); Bürgisser et al. (1997)): T_mat(a, b, c) = O(T_mat(a, c, b)) = O(T_mat(c, a, b)) for any positive integers a, b, c.

Kronecker product and vectorization. Given two matrices A ∈ R^{n1×d1} and B ∈ R^{n2×d2}, we use ⊗ to denote the Kronecker product, i.e., for C = A ⊗ B ∈ R^{n1 n2 × d1 d2}, the (i1 + (i2 − 1) · n1, j1 + (j2 − 1) · d1)-th entry of C is A_{i1,j1} B_{i2,j2}, for all i1 ∈ [n1], i2 ∈ [n2], j1 ∈ [d1], j2 ∈ [d2]. For any matrix H ∈ R^{d1×d2}, we use h = vec(H) ∈ R^{d1 d2} to denote the vector such that h_{j1 + (j2 − 1) · d1} = H_{j1,j2} for all j1 ∈ [d1], j2 ∈ [d2].
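These index conventions can be checked directly in NumPy. One subtlety worth noting (an observation about NumPy, not a claim from the paper): with the entry convention above, the paper's A ⊗ B corresponds to `np.kron(B, A)`, since `np.kron` blocks by its *first* argument, and vec is column-major stacking (`order='F'`):

```python
import numpy as np

rng = np.random.default_rng(3)
n1, d1, n2, d2 = 2, 3, 4, 5
A = rng.standard_normal((n1, d1))
B = rng.standard_normal((n2, d2))

# With the (i1 + (i2-1) n1, j1 + (j2-1) d1) entry convention above,
# the paper's A ⊗ B is np.kron(B, A) in NumPy's blocking convention.
C = np.kron(B, A)

# vec(H) stacks columns: h_{j1 + (j2-1) d1} = H_{j1,j2}, i.e. order='F'.
H = rng.standard_normal((d1, d2))
h = H.flatten(order="F")
```

Keeping the two conventions straight matters in practice: mixing row-major flattening with this Kronecker ordering silently permutes the entries of Eq. (1).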

4.2. Speedup via Kronecker product equivalence

We make the following observation about the Kronecker product and vectorization; the proof is deferred to Section E.

Fact 4.2 (Equivalence between two matrix products and a Kronecker product followed by matrix-vector multiplication). Given matrices A ∈ R^{n1×d1}, B ∈ R^{n2×d2}, and H ∈ R^{d1×d2}, we have vec(A H B^⊤) = (A ⊗ B) · vec(H).

In GNTK, when computing the l-th Aggregate operation for l ∈ [L], we need to compute a product (A ⊗ B) · vec(H) with A, B, H ∈ R^{N×N}, where A = C_G A_G, B = C_H A_H, and H = K^{(l−1,R)}(G, H) (see Eq. (1) in Section 3.1). Computing this product naively takes O(N^4) time, since the Kronecker product A ⊗ B already has size N^2 × N^2; this is the O(N^4) term of Du et al. (2019a). In our experiments, we instead compute A H B^⊤, which takes O(T_mat(N, N, N)) time. This is how we obtain our first improvement in running time.
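Fact 4.2 is easy to verify numerically. The sketch below (toy sizes; NumPy's Kronecker and vec conventions as noted in Section 4.1) contrasts the naive route, which materializes the N^2 × N^2 Kronecker product, with the decoupled route using two N × N products:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 16
A, B, H = (rng.standard_normal((N, N)) for _ in range(3))

# Naive route: build the N^2 x N^2 Kronecker product, then one matrix-vector
# multiply: O(N^4) time and O(N^4) memory. With column-major vec, the paper's
# (A ⊗ B) vec(H) is np.kron(B, A) @ vec(H) in NumPy's convention.
naive = np.kron(B, A) @ H.flatten(order="F")

# Decoupled route (Fact 4.2): two N x N matrix products, O(T_mat(N, N, N)).
decoupled = (A @ H @ B.T).flatten(order="F")
```

Already at N = 16 the naive route handles a 256 × 256 matrix; at graph sizes in the hundreds, the N^2 × N^2 product is the bottleneck the decoupling removes.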

4.3. Speedup via sketching matrices

The following lemma shows that the sketched version of a matrix product approximates the exact product. This justifies using sketching matrices to speed up the computation.

Lemma 4.3 (Informal version of Lemma 5.4). Given n^2 matrices H_{1,1}, ..., H_{n,n} ∈ R^{N×N} and n matrices A_1, ..., A_n ∈ R^{N×N}, let S_i ∈ R^{b×N} denote a random matrix where each entry is +1/√b or −1/√b, each with probability 1/2. Then with high probability, we have the following guarantee: for all i, j ∈ [n],

A_i S_i^⊤ S_i H_{i,j} S_j^⊤ S_j A_j ≈ A_i H_{i,j} A_j.

Here the sizes of the matrices are A_i, A_j, H_{i,j} ∈ R^{N×N} and S_i, S_j ∈ R^{b×N}; they correspond to A_i = C_{G_i} A_{G_i}, A_j = C_{G_j} A_{G_j}, and H_{i,j} = K̃^{(l−1,R)}(G_i, G_j) (see Eq. (2) in Section 3.2).

Directly computing A_i H_{i,j} A_j takes T_mat(N, N, N) time. After adding the two sketching matrices, using a suitable order of computation, we can avoid the time-consuming step of multiplying two N × N matrices. More specifically, we compute A_i S_i^⊤ S_i H_{i,j} S_j^⊤ S_j A_j in the following order:
• Computing A_i S_i^⊤ and S_j A_j both takes T_mat(N, N, b) time.
• Computing S_i · H_{i,j} takes T_mat(b, N, N) time.
• Computing (S_i H_{i,j}) · S_j^⊤ takes T_mat(b, N, b) time.
• Computing (A_i S_i^⊤) · (S_i H_{i,j} S_j^⊤) takes T_mat(N, b, b) time.
• Computing (A_i S_i^⊤ S_i H_{i,j} S_j^⊤) · (S_j A_j) takes T_mat(N, b, N) time.
Thus, we improve the running time from T_mat(N, N, N) to T_mat(N, N, b).
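The five-step ordering above can be sketched directly. In this toy sketch (sizes are illustrative), the ±1/√b sign sketches follow Lemma 4.3; exactness of the pipeline is checked with S = I, since for a generic small b the result is only an approximation:

```python
import numpy as np

rng = np.random.default_rng(4)
N, b = 200, 64
A_i, A_j, H_ij = (rng.standard_normal((N, N)) for _ in range(3))

def sketched_product(A_i, H_ij, A_j, S_i, S_j):
    # Compute A_i S_i^T S_i H_ij S_j^T S_j A_j without any product of
    # two N x N matrices.
    AiSi  = A_i @ S_i.T           # T_mat(N, N, b)
    SjAj  = S_j @ A_j             # T_mat(b, N, N)
    SiH   = S_i @ H_ij            # T_mat(b, N, N)
    SiHSj = SiH @ S_j.T           # T_mat(b, N, b)
    return (AiSi @ SiHSj) @ SjAj  # T_mat(N, b, b), then T_mat(N, b, N)

# Random sign sketches with entries +1/sqrt(b) or -1/sqrt(b)
S_i = rng.choice([-1.0, 1.0], size=(b, N)) / np.sqrt(b)
S_j = rng.choice([-1.0, 1.0], size=(b, N)) / np.sqrt(b)

approx = sketched_product(A_i, H_ij, A_j, S_i, S_j)
exact  = A_i @ H_ij @ A_j  # the T_mat(N, N, N) product being avoided
```

When S_i = S_j = I (so b = N), the pipeline reproduces the exact product, mirroring the degenerate case of Eq. (2) noted in Section 3.2.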

5. Our techniques: error analysis

In this section, we prove that even though adding the sketching matrices to the GNTK formula introduces some error, this error can be bounded, and we can still prove a generalization bound similar to that of Du et al. (2019a).

Theorem 5.1 (Informal version of Theorem C.4). Suppose the labels {y_i}_{i=1}^n satisfy

y_i = α_1 ⟨Σ_{u∈V} h̄_u, β_1⟩ + Σ_{l=1}^T α_{2l} ⟨Σ_{u∈V} h̄_u, β_{2l}⟩^{2l},

where h̄_u = c_u Σ_{v∈N(u)∪{u}} h_v, α_1, α_2, α_4, ..., α_{2T} ∈ R, and β_1, β_2, β_4, ..., β_{2T} ∈ R^d. Under the assumptions of Assumption C.6, and if we further have 4 · |α_1| · ‖β_1‖_2 + Σ_{l=1}^T 4√π (2l − 1) · |α_{2l}| · ‖β_{2l}‖_2 = o(n) and N = o(√n), then the generalization error of the approximate GNTK can be upper bounded as

L_D(f̃_gntk) = E_{(G,y)∼D}[ℓ(f̃_gntk(G), y)] ≤ O(1/n^c),

for some constant c ∈ (0, 1).

We use a standard generalization bound for kernel methods due to Bartlett & Mendelson (2002) (Theorem C.1), which shows that in order to prove a generalization bound it suffices to upper bound ‖y‖_{K̃^{-1}} and tr[K̃]. We present our bound on ‖y‖_{K̃^{-1}}; the bound on tr[K̃] is simpler. For the full version of the proofs, please see Section C.

Lemma 5.2 (Informal version of the bound on ‖y‖_{K̃^{-1}}). Under Assumption C.6, we have

‖y‖_{K̃^{-1}} ≤ 4|α_1| · ‖β_1‖_2 + Σ_{l=1}^T 4√π (2l − 1) |α_{2l}| · ‖β_{2l}‖_2^{2l}.

We provide a high-level proof sketch here, writing α, β for the coefficients of the linear term for brevity. We first compute all the variables in the approximate GNTK formula to obtain a closed-form formula for K̃. Then, combining with the assumption on the labels y, we show that ‖y‖_{K̃^{-1}} is upper bounded by

‖y‖_{K̃^{-1}} ≤ (4α^2 · β^⊤ H · (H̃^⊤ H̃)^{-1} · H^⊤ β)^{1/2},

where H, H̃ ∈ R^{d×n} are two matrices such that for all i, j ∈ [n],

[H^⊤ H]_{i,j} = 1_{N_i}^⊤ C_{G_i} A_{G_i} H_{G_i}^⊤ · H_{G_j} A_{G_j} C_{G_j} 1_{N_j},
[H̃^⊤ H̃]_{i,j} = 1_{N_i}^⊤ C_{G_i} A_{G_i} (S_{G_i}^⊤ S_{G_i}) H_{G_i}^⊤ · H_{G_j} (S_{G_j}^⊤ S_{G_j}) A_{G_j} C_{G_j} 1_{N_j}.

Next we show that H̃^⊤ H̃ is a PSD approximation of H^⊤ H.

Claim 5.3 (PSD approximation). We have (1 − 1/10) · H^⊤ H ⪯ H̃^⊤ H̃ ⪯ (1 + 1/10) · H^⊤ H.
Note that using this claim, we have

‖y‖_{K̃^{-1}} ≤ (4α^2 · β^⊤ H · (H̃^⊤ H̃)^{-1} · H^⊤ β)^{1/2} ≤ (8α^2 · β^⊤ H · (H^⊤ H)^{-1} · H^⊤ β)^{1/2} ≤ 4 · |α| · ‖β‖_2,

which finishes the proof. It remains to prove the claim. We prove it using the following lemma, which upper bounds the error introduced by the two sketching matrices; the proof of this lemma is deferred to Section E.

Lemma 5.4 (Error bound for adding two sketching matrices). Let R ∈ R^{b1×n}, S ∈ R^{b2×n} be two independent AMS matrices Alon et al. (1999). Let β = O(log^{1.5} n). For any matrix A ∈ R^{n×n} and any vectors g, h ∈ R^n, the following holds with probability 1 − 1/poly(n):

|g^⊤ R^⊤ R A S^⊤ S h − g^⊤ A h| ≤ (β/√b_1) ‖g‖_2 ‖Ah‖_2 + (β/√b_2) ‖g^⊤ A‖_2 ‖h‖_2 + (β^2/√(b_1 b_2)) ‖g‖_2 ‖h‖_2 ‖A‖_F.

Using this lemma and the assumption on the sketch sizes in the lemma statement, we can prove the following coordinate-wise upper bound: |[H̃^⊤ H̃]_{i,j} − [H^⊤ H]_{i,j}| ≤ (1/10) · [H^⊤ H]_{i,j}. Then we can upper bound ‖H̃^⊤ H̃ − H^⊤ H‖_2 ≤ (1/10) · ‖H^⊤ H‖_2, which proves the claim and thus finishes the proof.
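The shape of the bound in Lemma 5.4 can be checked empirically. In the sketch below (toy sizes; a loose constant stands in for the polylog factor β, both assumptions of this illustration), the sketched bilinear form stays inside a lemma-style error envelope:

```python
import numpy as np

rng = np.random.default_rng(5)
n, b1, b2 = 1000, 200, 200
A = rng.standard_normal((n, n)) / np.sqrt(n)
g, h = rng.standard_normal(n), rng.standard_normal(n)

# Two independent AMS (random sign) sketches, entries ±1/sqrt(b)
R = rng.choice([-1.0, 1.0], size=(b1, n)) / np.sqrt(b1)
S = rng.choice([-1.0, 1.0], size=(b2, n)) / np.sqrt(b2)

exact    = g @ A @ h
sketched = (g @ R.T) @ (R @ A @ S.T) @ (S @ h)

# The three terms of the lemma's bound, with beta set to a loose constant
beta = 10.0
envelope = (beta / np.sqrt(b1) * np.linalg.norm(g) * np.linalg.norm(A @ h)
            + beta / np.sqrt(b2) * np.linalg.norm(g @ A) * np.linalg.norm(h)
            + beta**2 / np.sqrt(b1 * b2)
              * np.linalg.norm(g) * np.linalg.norm(h) * np.linalg.norm(A, "fro"))
```

Each entry of H̃^⊤H̃ in Claim 5.3 is exactly a bilinear form of this type, with g and h the (scaled) all-ones vectors hit by the adjacency and feature matrices.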

6. Experiments

In this section, we evaluate our proposed GNTK acceleration algorithm on various graph classification tasks. More details about the experimental setup can be found in Section F of the supplementary material. Datasets. We test our method on 7 benchmark graph classification datasets, including 3 social network datasets (COLLAB, IMDBBINARY, IMDBMULTI) and 4 bioinformatics datasets (PTC, NCI1, MUTAG, and PROTEINS) Yanardag & Vishwanathan (2015). For the bioinformatics datasets, each node has categorical features that serve as the input feature h to the algorithm. For the social network datasets, where nodes have no input features, we use the degree of each node as its feature to better represent its structural information. The dataset statistics are shown in Table 2. Baselines. We compare our results with a number of state-of-the-art baselines for graph classification: (1) state-of-the-art deep graph neural network architectures, including Graph Convolutional Network (GCN) Kipf & Welling (2017), GraphSAGE Hamilton et al. (2017), PATCHY-SAN Niepert et al. (2016), Deep Graph CNN (DGCNN) Zhang et al. (2018a), and Graph Isomorphism Network (GIN) Xu et al. (2018a); (2) kernel-based methods, including the WL subtree kernel Shervashidze et al. (2011), Anonymous Walk Embeddings (AWL) Ivanov & Burnaev (2018), and RetGK Zhang et al. (2018b); (3) the graph neural tangent kernel (GNTK) Du et al. (2019a). For the deep learning methods, GNTK, RetGK, and AWL, we report the accuracy reported in the original papers. For the WL subtree kernel, we report the accuracy of the implementation used in Xu et al. (2018a). Results. We perform 10-fold cross validation and report the mean and standard deviation of the accuracy. We show our performance by comparing with state-of-the-art graph learning methods, including the original GNTK method.
The accuracy results are shown below.

Method      COLLAB      IMDBBINARY  IMDBMULTI   PTC         NCI1        MUTAG       PROTEINS
WL subtree  78.9 ± 1.9  73.8 ± 3.9  50.9 ± 3.8  59.9 ± 4.3  86.0 ± 1.8  90.4 ± 5.7  75.0 ± 3.1
AWL         73.9 ± 1.9  74.5 ± 5.9  51.5 ± 3.6  -           -           87.9 ± 9.8  -
RetGK       81.0 ± 0.3  71.9 ± 1.0  47.7 ± 0.3  62.5 ± 1.6  84.5 ± 0.2  90.3 ± 1.1  75.8 ± 0.6
GNTK        83.6 ± 1.0  76.9 ± 3.6  52.8 ± 4.6  67.9 ± 6.9  84.2 ± 1.5  90.0 ± 8.5  75.6 ± 4.2
Ours        83.6 ± 1.0  76.9 ± 3.6  52.8 ± 4.6  67.9 ± 6.9  84.2 ± 1.5  90.0 ± 8.5  75.6 ± 4.2

Classification accuracy (%) for the graph classification datasets with matrix decoupling. We report the result of our proposed method, optimizing on the original GNTK model.

Table 3: Running time analysis for our matrix decoupling method (in seconds). We report the kernel computation time of the original GNTK method and of our accelerated model.

The running time is shown in Table 3. Our matrix decoupling (MD) method does not change the results of GNTK while significantly accelerating kernel learning. Our proposed method achieves multi-fold speedups on all datasets; in particular, on COLLAB, our method accelerates learning by more than 19×. We observe that the improvement of our method depends on the sizes of the graphs. For a large-scale dataset like COLLAB, we achieve the highest acceleration, because matrix multiplication dominates the overall computation time; for the bioinformatics datasets, where the number of nodes is relatively small, the improvement is less prominent. Note that we only show the running time comparison between our method and the original GNTK method, because other state-of-the-art deep GNN methods take significantly longer to train via gradient descent Du et al. (2019a). An analysis of our sketching method can be found in Section F of the supplementary material.

7. Conclusion

Graph Neural Networks (GNNs) have become the most important method for machine learning on graph data (e.g., social networks, protein structures), but training GNNs efficiently is a major challenge. One alternative is the Graph Neural Tangent Kernel (GNTK), a kernel method that is equivalent to training infinitely wide multi-layer GNNs with gradient descent. GNTK's parameters can be solved for directly in a single step, avoiding time-consuming gradient descent, and GNTK has therefore become the state-of-the-art method for achieving high training speed without compromising accuracy. Unfortunately, GNTK still takes hours to days to train on real-world graphs because it has a computational bottleneck of O(N^4), where N denotes the number of nodes in the graph. We present two techniques to mitigate this bottleneck: (1) we use a novel matrix decoupling method to reduce matrix dimensions during kernel solving, which reduces the dominant computational bottleneck from O(N^4) to O(N^3); (2) we apply sketching to further reduce the bottleneck to o(N^ω), where ω ≈ 2.373 is the exponent of the current best matrix multiplication algorithm. We demonstrate that our approaches speed up kernel learning by up to 19× on real-world benchmark datasets.










