GLOBAL ATTENTION IMPROVES GRAPH NETWORKS GENERALIZATION

Anonymous authors
Paper under double-blind review

Abstract

This paper advocates incorporating a Low-Rank Global Attention (LRGA) module, a computation and memory efficient variant of the dot-product attention (Vaswani et al., 2017) , to Graph Neural Networks (GNNs) for improving their generalization power. To theoretically quantify the generalization properties granted by adding the LRGA module to GNNs, we focus on a specific family of expressive GNNs and show that augmenting it with LRGA provides algorithmic alignment to a powerful graph isomorphism test, namely the 2-Folklore Weisfeiler-Lehman (2-FWL) algorithm. In more detail we: (i) consider the recent Random Graph Neural Network (RGNN) (Sato et al., 2020) framework and prove that it is universal in probability; (ii) show that RGNN augmented with LRGA aligns with 2-FWL update step via polynomial kernels; and (iii) bound the sample complexity of the kernel's feature map when learned with a randomly initialized two-layer MLP. From a practical point of view, augmenting existing GNN layers with LRGA produces state of the art results in current GNN benchmarks. Lastly, we observe that augmenting various GNN architectures with LRGA often closes the performance gap between different models.

1. INTRODUCTION

In many domains, data can be represented as a graph in which entities interact, have meaningful relations, and form a global structure. The need to infer from such data and to understand it better arises in many settings, such as social networks, citations and collaborations, chemoinformatics, and epidemiology. In recent years, along with the major evolution of artificial neural networks, graph learning has gained a powerful new tool: graph neural networks (GNNs). Since they were first introduced as recurrent algorithms (Gori et al., 2005; Scarselli et al., 2009), GNNs have become a central interest and the main tool in graph learning. Perhaps the most commonly used family of GNNs is message-passing neural networks (Gilmer et al., 2017), built by aggregating messages from local neighborhoods at each layer. Since information is only kept at the vertices and propagated via the edges, the complexity of these models scales linearly with |V| + |E|, where |V| and |E| are the number of vertices and edges in the graph, respectively. In a recent analysis of the expressive power of such models, Xu et al. (2019a) and Morris et al. (2018) showed that message-passing neural networks are at most as powerful as the first Weisfeiler-Lehman (WL) test, also known as vertex coloring. The k-WL tests are a hierarchy of algorithms of increasing power and complexity aimed at solving graph isomorphism. This bound on the expressive power of GNNs led to the design of new architectures (Morris et al., 2018; Maron et al., 2019a) mimicking higher orders of the k-WL family, resulting in more powerful, yet more complex, models that scale super-linearly in |V| + |E|, hindering their use on larger graphs. Although bounds on the expressive power of GNNs exist, empirically GNNs are able to fit the training data well on many datasets. This indicates that the expressive power of these models might not be the main roadblock to successful generalization.
Therefore, we focus our efforts in this paper on strengthening GNNs from a generalization point of view. To improve the generalization of GNNs we propose the Low-Rank Global Attention (LRGA) module, which can augment any GNN. Standard dot-product global attention modules (Vaswani et al., 2017) apply a $|V| \times |V|$ attention matrix to node data with $O(|V|^3)$ computational complexity, making them impractical for large graphs. To overcome this barrier, we define a $\kappa$-rank attention matrix, where $\kappa$ is a parameter, that requires $O(\kappa |V|)$ memory and can be applied in $O(\kappa^2 |V|)$ computational complexity.

To theoretically justify LRGA we focus on a GNN model family whose members all possess maximal expressiveness (i.e., are universal) but vary in their generalization properties. Murphy et al. (2019), Loukas (2019; 2020) and Dasoulas et al. (2019) showed that adding node identifiers to GNNs improves their expressiveness, often making them universal. In this work, we prove that even when random features are added to the network's input, as suggested in (Sato et al., 2020), a framework we call Random Graph Neural Network (RGNN), GNN models are universal in probability. The improved generalization properties of LRGA-augmented GNN models are then showcased for the RGNN framework, where we show that augmenting it with LRGA algorithmically aligns with the 2-folklore WL (2-FWL) algorithm; 2-FWL is a strictly more powerful graph isomorphism algorithm than vertex coloring (which bounds message-passing GNNs). To do so, we adopt the notion of algorithmic alignment introduced in (Xu et al., 2019b), which states that a neural network aligns with some algorithm if it can simulate it with simple modules, resulting in provably improved generalization. We opt to use monomials in the role of simple modules and prove the alignment using polynomial kernels. Lastly, we bound the sample complexity of the model when learning the 2-FWL update rule. Although our bound is exponential in the graph size, it nevertheless implies that RGNN augmented with LRGA can provably learn the 2-FWL step when each module is trained independently with a two-layer MLP.

We evaluate our model on a set of benchmark datasets including tasks of graph classification and regression, node labeling and link prediction from (Dwivedi et al., 2020; Hu et al., 2020). LRGA improves state-of-the-art performance on most datasets, often by a significant margin. We further perform an ablation study in the random features framework to support our theoretical propositions.

2. RELATED WORK

Attention mechanisms. The first work to use an attention mechanism in deep learning was (Bahdanau et al., 2015) in the context of natural language processing. Ever since, attention has proven to be a powerful module, even becoming the only component in the transformer architecture (Vaswani et al., 2017) . Intuitively, attention provides an adaptive importance metric for interactions between pairs of elements, e.g., words in a sentence, pixels in an image or nodes in a graph. A natural drawback of classical attention models is the quadratic complexity generated by computing scores among pairs. Methods to reduce the computation complexity were introduced by (Lee et al., 2018b) which introduced the set-transformer and addressed the problem by inducing point methods used in sparse Gaussian processes. Linearized versions of attention were suggested by (Shen et al., 2020) factorizing the attention matrix and normalizing separate components. Concurrently to the first version of this paper (Anonymous, 2020) , Katharopoulos et al. (2020) formulated a linearized attention for sequential data. Attention in graph neural networks. In the field of graph learning, most attention works (Li et al., 2016; Veličković et al., 2018; Abu-El-Haija et al., 2018; Bresson & Laurent, 2017; Lee et al., 2018a) restrict learning the attention scores to the local neighborhoods of the nodes in the graph. Motivated by the fact that local aggregations cannot capture long range relations which may be important when node homophily does not hold, global aggregation in graphs using node embeddings have been suggested by (You et al., 2019; Pei et al., 2020) . An alternative approach for going beyond the local neighborhood aggregation utilizes diffusion methods: (Klicpera et al., 2019) use diffusion in a pre-process to replace the adjacency with a sparsified weighted diffusion matrix, while (Zhuang & Ma, 2018) add the diffusion matrix as an additional aggregation operator. 
LRGA allows global weighted aggregations via an embedding of the nodes in a low-dimensional (i.e., low-rank) space.

Generalization in graph neural networks. Although neural networks are a cornerstone of modern machine learning, their generalization capabilities are still not well understood; see, e.g., (Bartlett et al., 2017; Golowich et al., 2019). Due to the irregular structure of graph data and the weight-sharing nature of GNNs, investigating their generalization capabilities poses an even greater challenge. Despite the nonstandard setting, a few works have constructed generalization bounds for GNNs via VC dimension (Scarselli et al., 2018), uniform stability (Verma & Zhang, 2019), Rademacher complexity (Garg et al., 2020) and the Neural Tangent Kernel (Du et al., 2019).

3. PRELIMINARIES AND NOTATIONS

We denote a graph by $G = (V, E, X)$, where $V$ is the vertex set of size $|V| = n$, $E$ is the edge set with adjacency matrix $A$, and $X = (x_1, \ldots, x_n)^T$ represents the input vertex features. A vertex $v_i \in V$ carries an input feature vector $x_i \in \mathbb{R}^{d_0}$; in turn, $X^l \in \mathbb{R}^{n \times d_l}$ represents the output of the $l$-th layer of a neural network. We denote concatenation along the last dimension with brackets and stacking along a new last dimension with double brackets, i.e., for $W, Z \in \mathbb{R}^{n \times d}$, $[W, Z] \in \mathbb{R}^{n \times 2d}$ and $[[W, Z]] \in \mathbb{R}^{n \times d \times 2}$.

A common way of evaluating GNNs is by their ability to distinguish different graphs, described by graph isomorphism, which is an equivalence relation between graphs. The isomorphism type tensor of a graph $G$ is a tensor $\mathsf{Y} \in \mathbb{R}^{n^2 \times d_{iso}}$ which holds the isomorphism types of all pairs $(i, j) \in [n] \times [n]$. Given a pair $(i, j)$, which represents either an edge or a node of graph $G$, $\mathsf{Y}_{i,j}$ summarizes all the information this pair carries in $G$. More precisely, isomorphism type is an equivalence relation defined by: $(i, j)$ and $(i', j')$ have the same isomorphism type iff the following conditions hold: (i) $i = j \iff i' = j'$; (ii) $x_i = x_{i'}$ and $x_j = x_{j'}$; and (iii) $(i, j) \in E \iff (i', j') \in E$. One way to build an isomorphism type tensor for graph $G$ is $\mathsf{Y} = [[I, 1 \otimes X, X \otimes 1, A]]$, where $I$ is the identity matrix, $(1 \otimes X)_{i,j,:} = x_j$, and similarly (with a slight abuse of notation) $(X \otimes 1)_{i,j,:} = x_i$.
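The construction of the isomorphism type tensor can be illustrated with a minimal sketch in plain Python. The helper name `isomorphism_type_tensor`, the nested-list layout, and the toy path graph are assumptions made for illustration only and do not come from the paper.

```python
def isomorphism_type_tensor(adj, feats):
    """Build Y = [[I, 1 (x) X, X (x) 1, A]] as an n x n x (2 + 2*d0) nested list."""
    n = len(adj)
    tensor = []
    for i in range(n):
        row = []
        for j in range(n):
            identity = [1.0 if i == j else 0.0]  # block I
            col_feat = list(feats[j])            # block 1 (x) X: features of node j
            row_feat = list(feats[i])            # block X (x) 1: features of node i
            edge = [float(adj[i][j])]            # block A
            row.append(identity + col_feat + row_feat + edge)
        tensor.append(row)
    return tensor


# Toy path graph 0 - 1 - 2 with one scalar feature per node (hypothetical values).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[5.0], [7.0], [5.0]]
Y = isomorphism_type_tensor(adj, feats)
```

In this toy graph the pairs (0, 1) and (2, 1) satisfy conditions (i) to (iii) with respect to each other, so their entries of `Y` coincide, while the diagonal pair (1, 1) receives a different entry.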

4. LOW-RANK GLOBAL ATTENTION (LRGA)

We propose the Low-Rank Global Attention (LRGA) module that can augment any graph neural network layer, denoted here generically as GNN, in the following way:

$$X^{l+1} \leftarrow \left[ X^l,\ \mathrm{LRGA}(X^l),\ \mathrm{GNN}(X^l) \right] \qquad (1)$$

where the brackets denote concatenation along the feature dimension. The LRGA module is defined for an input feature matrix $X \in \mathbb{R}^{n \times d_{in}}$ via

$$\mathrm{LRGA}(X) = \left[ \frac{1}{\eta(X)}\, m_1(X) \left( m_2(X)^T m_3(X) \right),\ m_4(X) \right] \qquad (2)$$

where $m_1, m_2, m_3, m_4 : \mathbb{R}^{n \times d_{in}} \to \mathbb{R}^{n \times \kappa}$ are MLPs operating on the feature dimension, that is, $m(X) = [m(x_1), \ldots, m(x_n)]^T$, and $\kappa \in \mathbb{N}_0$ is a parameter representing the rank of the attention module. Lastly, $\eta$ is a normalization factor:

$$\eta(X) = \frac{1}{n}\, \mathbf{1}^T m_1(X) \left( m_2(X)^T \mathbf{1} \right) \qquad (3)$$

where $\mathbf{1} = (1, 1, \ldots, 1)^T \in \mathbb{R}^n$. The matrix $\eta(X)^{-1} m_1(X) m_2(X)^T$ can be thought of as a $\kappa$-rank attention matrix that acts globally on the graph's node features.

Computational complexity. Standard attention models (Vaswani et al., 2017; Luong et al., 2015) require explicitly computing the attention score between all possible pairs in the set, meaning that their memory requirement and computational cost scale as $O(n^2)$. This makes global attention seem impractical for large sets, or large graphs in our case. We address this computational challenge by working with attention matrices of bounded rank (i.e., $\kappa$), and avoid the need to construct the attention matrix in memory by replacing the standard entry-wise normalization (softmax or tanh) with the global normalization $\eta$. In turn, the memory requirement of LRGA is $O(n\kappa)$, and using low-rank matrix-vector multiplications LRGA allows applying global attention at $O(n\kappa^2)$ computational cost.

Permutation equivariance. A common demand from GNN architectures is to respect the symmetries of the graph representation, namely the ordering of nodes (Maron et al., 2019b). As shown in (Lee et al., 2018b), the set attention module is permutation equivariant. LRGA has the same matrix product structure, which makes this module permutation equivariant as well.
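The following is a minimal plain-Python sketch of the LRGA computation and of why the product order matters for cost. It assumes each m_i is a single linear layer followed by ReLU, as in the implementation details reported in the experiments; the helper names (`matmul`, `linear_relu`, `lrga`, `lrga_dense`) and the toy weights are hypothetical and chosen only for illustration. The function `lrga_dense` materializes the n x n attention matrix purely as a reference to compare against.

```python
def matmul(a, b):
    """Matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def linear_relu(x, w):
    """Single-layer MLP applied to every row of x: ReLU(x w)."""
    return [[max(0.0, v) for v in row] for row in matmul(x, w)]


def lrga(x, weights):
    """LRGA(X) = [ m1 (m2^T m3) / eta , m4 ] without forming the n x n matrix."""
    m1, m2, m3, m4 = (linear_relu(x, w) for w in weights)
    n = len(x)
    ones = [[1.0] for _ in range(n)]
    # eta = (1/n) 1^T m1 (m2^T 1); only kappa-sized vectors are formed.
    eta = matmul(matmul(transpose(ones), m1), matmul(transpose(m2), ones))[0][0] / n
    # m2^T m3 is kappa x kappa, so the whole product costs O(n kappa^2).
    attended = matmul(m1, matmul(transpose(m2), m3))
    return [[v / eta for v in a] + list(r) for a, r in zip(attended, m4)]


def lrga_dense(x, weights):
    """Reference version that materializes the n x n attention matrix."""
    m1, m2, m3, m4 = (linear_relu(x, w) for w in weights)
    scores = matmul(m1, transpose(m2))  # n x n, O(n^2) memory
    eta = sum(sum(row) for row in scores) / len(x)
    attended = matmul([[s / eta for s in row] for row in scores], m3)
    return [a + list(r) for a, r in zip(attended, m4)]


# Toy input: n = 3 nodes, d_in = 2 features, rank kappa = 2 (hypothetical values).
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 1.0], [1.0, 0.5]],
    [[1.0, 1.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 1.0]],
]
out_low_rank = lrga(X, W)
out_dense = lrga_dense(X, W)
```

Both functions return the same n x 2κ output up to floating-point error; only the association of the matrix products differs, which is what removes the n x n intermediate.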

5. THEORETICAL ANALYSIS

In this section we establish the theoretical underpinning for LRGA. Since we want to analyse the generalization power added by LRGA, we focus on a family of GNNs with unbounded expressive power in probability (RGNN). Under this model we show the benefit of augmenting GNNs with LRGA in terms of improved generalization via the notion of algorithmic alignment with a powerful graph isomorphism testing algorithm (2-FWL).

5.1. RANDOM GRAPH NEURAL NETWORKS

We analyse LRGA under the framework of Random Graph Neural Networks (RGNNs):

Definition 1 (Random Graph Neural Network). Let $D$ be a probability distribution with zero mean and variance $c$, and $G = (V, E, X)$ a graph. RGNN is a GNN variant with random input features sampled at every forward pass, i.e., the input to the network is $[X, R]$, where $R \in \mathbb{R}^{n \times d}$ has i.i.d. entries sampled from $D$.

RGNN, suggested by Sato et al. (2020), has related variants (Loukas, 2020; 2019; Murphy et al., 2019) that use node identifiers or distinctive features, which can be viewed as constant random features, in order to break symmetry between isomorphic nodes. Such models are proven to be universal but lose their inherent equivariance due to the arbitrary prescription of node identifiers. We choose to work in the seemingly more limited setting of RGNN, which allows the network to distinguish between different nodes but does not overfit to specific identifiers. Our main claims regarding this framework are that RGNN is both universal in probability and equivariant in expectation.

Proposition 1 (Universal). RGNN can approximate an arbitrary continuous graph function given random features sampled from a bounded distribution $D$. Here approximation is in a probabilistic sense: let $\Omega \subset \mathbb{R}^{n \times d_0} \times \mathbb{R}^{n^2}$ be a compact set of graphs, $[X, A] \in \Omega$, where $A \in \mathbb{R}^{n^2}$ is the adjacency matrix. Then, given a continuous graph function $f$ defined over $\Omega$ and arbitrary $\varepsilon, \delta > 0$, there exist network parameters and $d$ so that $P\left( |\mathrm{GNN}([X, R]) - f([X, A])| < \varepsilon \right) > 1 - \delta$ for all graphs $[X, A] \in \Omega$.

Proposition 1 holds for GNN variants with a global attribute block such as (Battaglia et al., 2018). The proof is based on the idea that random features allow the GNN to transfer the graph's connectivity information to the node features. Once all graph information is encapsulated at the nodes, we exploit the universality of set functions (Zaheer et al., 2017) to get universality. The full proof is in Appendix A. To the best of our knowledge this is the first result proving universality under the random feature assumption.

Proposition 2 (Equivariant in expectation). RGNN is permutation equivariant in expectation.

Changing the random features at each forward pass allows RGNN to preserve equivariance in expectation. Indeed, equivariance of GNN implies that $\mathrm{GNN}(P \cdot [X, R]) = P \cdot \mathrm{GNN}([X, R])$ for any permutation matrix $P$ and input $[X, R]$. Taking the expectation of both sides w.r.t. $R \sim D$, noting that $PR \sim R$, and using linearity of expectation, we get equivariance in expectation.
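The equivariance identity used in the argument for Proposition 2 can be checked on a toy layer. The sketch below assumes a simple sum-aggregation layer with a skip term standing in for a generic GNN layer, and a uniform distribution on [-1, 1] standing in for D; the helper names and the toy graph are hypothetical.

```python
import math
import random


def message_passing(adj, h):
    """Sum aggregation with a skip term: out_i = h_i + sum_j adj[i][j] * h_j."""
    n, width = len(h), len(h[0])
    return [
        [math.fsum([h[i][c]] + [adj[i][j] * h[j][c] for j in range(n)]) for c in range(width)]
        for i in range(n)
    ]


def with_random_features(x, d, rng):
    """Build [X, R], drawing a fresh zero-mean R on every call (every forward pass)."""
    return [list(row) + [rng.uniform(-1.0, 1.0) for _ in range(d)] for row in x]


def permute_rows(m, perm):
    return [m[p] for p in perm]


def permute_graph(adj, perm):
    """Apply the same relabelling to rows and columns of the adjacency matrix."""
    return [[adj[p][q] for q in perm] for p in perm]


rng = random.Random(0)
adj = [[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 0], [0, 1, 0, 0]]
x = [[1.0], [2.0], [3.0], [4.0]]
perm = [2, 0, 3, 1]

h = with_random_features(x, d=3, rng=rng)  # one sample of the input [X, R]
out = message_passing(adj, h)
out_permuted = message_passing(permute_graph(adj, perm), permute_rows(h, perm))
```

For a fixed sample of R, permuting the input permutes the output in the same way. Because the rows of R are identically distributed, averaging over fresh samples then yields equivariance in expectation.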

5.2. RGNN AUGMENTED WITH LRGA ALIGNS WITH 2-FWL

In this section we formulate our main theoretical result, Theorem 1, stating that augmenting RGNN with LRGA algorithmically aligns with a powerful graph isomorphism testing algorithm called 2-Folklore Weisfeiler-Lehman (2-FWL) (Grohe & Otto, 2015; Grohe, 2017). We first introduce the notion of algorithmic alignment and the 2-FWL algorithm, then formulate our main theorem, and continue in the next section with a proof.

Algorithmic alignment. The notion of algorithmic alignment was introduced in Xu et al. (2019b) as a framework for exploring effective neural architectures for certain tasks. A neural network $N$ is said to be aligned with an algorithm $\mathcal{A}$ if $N$ can simulate $\mathcal{A}$ by a composition of modules, and each module is "simple", or learnable, i.e., has bounded (hopefully low) sample complexity. For example, message-passing networks can simulate the vertex coloring algorithm (Xu et al., 2019a; Morris et al., 2018) and therefore message passing can be seen as algorithmically aligned with vertex coloring. Intuitively, algorithmic alignment introduces an inductive bias that improves the sample complexity. Our definition of algorithmic alignment is a slightly stricter version:

Definition 2 (Monomial Algorithmic Alignment). A neural network $N$ aligns with algorithm $\mathcal{A}$ if $N$ can simulate $\mathcal{A}$ by learning only monomial functions, i.e., $f(x) = x^\alpha$, where $x \in \mathbb{R}^d$, $\alpha \in \mathbb{N}^d$, and $x^\alpha = x_1^{\alpha_1} \cdots x_d^{\alpha_d}$.

To motivate this choice of monomials as "simple" functions, we note that Arora et al. (2019) and Xu et al. (2019b) show a sample complexity bound for even-power polynomials learned by (two-layer) MLPs, and we extend it to general monomials in the following proposition, proved in Appendix E:

Proposition 3. Let a two-layer MLP trained with gradient descent be denoted as the learning algorithm $\mathcal{A}'$. The monomial $g(x) = x^\alpha$, $x \in \mathbb{R}^d$, of degree $n$, $|\alpha| \le n$, is PAC learnable with $\mathcal{A}'$ with a sample complexity bound

$$C_{\mathcal{A}'}(g, \varepsilon, \delta) = O\!\left( \frac{C_{n,d} + \log(1/\delta)}{\varepsilon^2} \right), \qquad C_{n,d} = \left( n^2 + 1 \right)^{(n+1)/2} c_{n,d},$$

where $\varepsilon > 0$ is the error parameter and $\delta \in (0, 1)$ is the failure probability.

The asymptotic behaviour of $c_{n,d}$ is out of the scope of this paper. Therefore, a monomial algorithmic alignment of $N$ to $\mathcal{A}$ means (under the assumptions and sequential training method of Theorem 3.6 in Xu et al. (2019b)) that $\mathcal{A}$ is learnable by $N$.

2-Folklore Weisfeiler-Lehman (2-FWL) algorithm. 2-FWL is part of the k-WL hierarchy of polynomial-time, iterative (approximate) graph isomorphism algorithms that recolor k-tuples of vertices at each step according to an aggregation over neighborhoods. Upon reaching a stable coloring, the algorithm terminates, and if the histograms of colors of two graphs are not the same then the graphs are deemed not isomorphic. The 2-FWL algorithm is equivalent to 3-WL and strictly stronger than vertex coloring (2-WL), which bounds the expressive power of GNNs.

[Inset figure: the neighborhood of a pair (i, j), consisting of the pairs (i, k) and (k, j) for k = 1, ..., 4, and the resulting multiset of neighborhood colors.]

In more detail, let $\mathsf{Y}^0 \in \mathbb{R}^{n^2 \times d_{iso}}$ represent the isomorphism types of a given graph $G = (V, E, X)$, that is, $\mathsf{Y}^0_{i,j} \in \mathbb{R}^{d_{iso}}$ represents the isomorphism type of the pair $(i, j)$. The 2-FWL algorithm is initialized with $\mathsf{Y}^0$. Let $\mathsf{Y}^l \in \mathbb{R}^{n^2 \times d_l}$ denote the coloring tensor after the $l$-th update step. An update step in the algorithm aggregates information from the multiset of neighborhood colors for each pair. We represent the multiset of neighborhood colors of the tuple $(i, j)$ with a matrix $Z^l_{(i,j)} \in \mathbb{R}^{n \times 2d_l}$; that is, any permutation of the rows of $Z^l_{(i,j)}$ represents the same multiset. The rows of $Z^l_{(i,j)}$, which represent the elements of the multiset, are $z_k = [\mathsf{Y}^l_{i,k}, \mathsf{Y}^l_{k,j}] \in \mathbb{R}^{2d_l}$, $k \in [n]$. See the inset for an illustration.
The 2-FWL update step of a pair $(i, j)$ from $\mathsf{Y}^l$ to $\mathsf{Y}^{l+1}$ concatenates the pair's previous color and an encoding of the multiset of neighborhood colors:

$$\mathsf{Y}^{l+1}_{i,j} = \left[ \mathsf{Y}^l_{i,j},\ \mathrm{ENC}\!\left( Z^l_{(i,j)} \right) \right] \qquad (4)$$

where $\mathrm{ENC} : \mathbb{R}^{n \times 2d_l} \to \mathbb{R}^{d_{enc}}$ is an injective multiset map, invariant to the row order of its input.

Main result. Consider the 2-FWL update rule in equation 4 and let $Y^{l+1} \in \mathbb{R}^{n^2}$ denote an (arbitrary) single feature dimension peeled off the tensor $\mathsf{Y}^{l+1} \in \mathbb{R}^{n^2 \times d_{l+1}}$; we call $Y^{l+1}$ a single head of the update rule. Then,

Theorem 1. LRGA-augmented RGNN algorithmically aligns with a single-head 2-FWL update step.

A corollary of this theorem is:

Corollary 1. Multi-head LRGA-augmented RGNN algorithmically aligns with 2-FWL.

Multi-head LRGA is a module of the form $[X^l, \mathrm{LRGA}_1(X^l), \ldots, \mathrm{LRGA}_k(X^l), \mathrm{GNN}(X^l)]$, which is the equivalent of multi-head self-attention. In practice, we found single-head LRGA to be on par performance-wise with multi-head LRGA, and therefore we focus on the single-head version in the experimental section.
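A single 2-FWL update step can be sketched in plain Python for graphs without node features. This is an illustrative sketch, not the paper's implementation: the sorted tuple of neighbor color pairs plays the role of an injective, order-invariant ENC, colors are kept as nested tuples so they are comparable across graphs, and all function names are hypothetical.

```python
from collections import Counter


def initial_colors(adj):
    """Isomorphism types without node features: 0 = diagonal, 1 = edge, 2 = non-edge."""
    n = len(adj)
    return [[0 if i == j else (1 if adj[i][j] else 2) for j in range(n)] for i in range(n)]


def fwl2_step(colors):
    """New color of (i, j) = (old color, multiset of (color(i, k), color(k, j)) over k)."""
    n = len(colors)
    return [
        [(colors[i][j], tuple(sorted((colors[i][k], colors[k][j]) for k in range(n))))
         for j in range(n)]
        for i in range(n)
    ]


def color_histogram(colors):
    return Counter(c for row in colors for c in row)


def cycle(n):
    return [[1 if (i - j) % n in (1, n - 1) else 0 for j in range(n)] for i in range(n)]


def two_triangles():
    adj = [[0] * 6 for _ in range(6)]
    for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
        adj[a][b] = adj[b][a] = 1
    return adj


def relabel(adj, perm):
    return [[adj[p][q] for q in perm] for p in perm]


hist_cycle = color_histogram(fwl2_step(initial_colors(cycle(6))))
hist_triangles = color_histogram(fwl2_step(initial_colors(two_triangles())))
hist_relabelled = color_histogram(fwl2_step(initial_colors(relabel(cycle(6), [2, 0, 4, 1, 5, 3]))))
```

The 6-cycle and the disjoint union of two triangles are both 2-regular, so their initial pair-color histograms agree. After one update the histograms differ, because only in the triangles does an edge (i, j) have a vertex k adjacent to both endpoints. A relabelled copy of the 6-cycle keeps the same histogram.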

5.3. PROOF OF THEOREM 1

To prove Theorem 1 we need to show that RGNN augmented with LRGA can simulate one head of the 2-FWL update step using only monomials as learnable functions. We achieve this in the following steps: (i) introduce the notion of node factorization to encode $n \times n$ tensor data as node features; (ii) show that RGNN can approximate a node factorization of the graph's isomorphism type tensor with a single GNN layer using learnable monomial functions; (iii) show that the 2-FWL update step can be formulated using matrix multiplication of monomial functions; and (iv) show that LRGA can approximate a single-head 2-FWL update step using learnable monomials.

Part (i). We start with the definition of node feature factorization:

Definition 3 (Node factorization). Let $\mathsf{Y} \in \mathbb{R}^{n^2 \times d}$ be a tensor. $X \in \mathbb{R}^{n \times D}$ is called a node factorization of $\mathsf{Y}$ if there exists a block structure $X = [X^1, \ldots, X^k]$ so that $\mathsf{Y} = [[X^{s_1} (X^{t_1})^T, \ldots, X^{s_d} (X^{t_d})^T]]$, where $(s_1, t_1), \ldots, (s_d, t_d) \in [k] \times [k]$ are index pairs.

Note that for all $i, j \in [n]$ we have $\mathsf{Y}_{i,j} = \left[ \langle x_i^{s_1}, x_j^{t_1} \rangle, \ldots, \langle x_i^{s_d}, x_j^{t_d} \rangle \right] \in \mathbb{R}^d$. Let us illustrate the definition with an example. Let $A \in \{0, 1\}^{n \times n}$ be the adjacency matrix of some graph $G$, and for simplicity assume that there are no node features. Then, the isomorphism type tensor of $G$ is $\mathsf{Y}^0 = [[I, A]] \in \mathbb{R}^{n^2 \times 2}$. One possible way of node factoring $\mathsf{Y}^0$ is to use the singular value decomposition (SVD) of the adjacency matrix $A$. Note that node factorization is not unique.

Part (ii).

Proposition 4. RGNN with skip connection can approximate a node factorization of the isomorphism type tensor $\mathsf{Y}^0$.

Proof. We prove the case of a graph $G = (V, E)$, i.e., with no vertex features; the general case can be found in Appendix D. Let $R \in \mathbb{R}^{n \times d}$ be a random node feature matrix sampled i.i.d. from $D$. A single layer of standard message passing can represent $\mathrm{GNN}(R) = d^{-1/2} [AR, R]$, which requires learning only first-degree (linear) monomials in the GNN's learnable parts. Furthermore, $\mathrm{GNN}(R)$ is an approximate node factorization of $\mathsf{Y}^0$, since $d^{-1} [[R R^T, A R R^T]] \approx [[I, A]] = \mathsf{Y}^0$, where the approximation error in $d^{-1} R R^T \approx I$ can be bounded using the result in Appendix A.

Part (iii). As shown in (Maron et al., 2019a), the encoding function ENC from the 2-FWL update rule (see equation 4) can be expressed as follows (the derivation can be found in Appendix B):

$$\mathsf{Y}^{l+1} = \left[\left[\, \mathsf{Y},\ \mathsf{Y}^\beta \mathsf{Y}^\gamma \ \middle|\ (\beta, \gamma) \in \mathbb{N}_0^{2d},\ |\beta| + |\gamma| \le n \,\right]\right] \qquad (5)$$

where for notational simplicity we denote $\mathsf{Y} = \mathsf{Y}^l$ and $d = d_l$. By $\mathsf{Y}^\beta$ we mean that we apply the multi-power $\beta$ to the feature dimension, i.e., $(\mathsf{Y}^\beta)_{i,j} = \mathsf{Y}_{i,j}^\beta$. Therefore, computing the multiset encoding amounts to calculating the monomials $\mathsf{Y}^\beta$, $\mathsf{Y}^\gamma$ and their matrix products $\mathsf{Y}^\beta \mathsf{Y}^\gamma$.
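The approximation behind Proposition 4 can be checked numerically with a small sketch in plain Python. It assumes entries drawn uniformly from [-sqrt(3), sqrt(3)], one possible bounded distribution with zero mean and unit variance (the paper does not prescribe it), a fixed seed, and a toy path graph; the helper names are hypothetical.

```python
import math
import random


def gram_over_d(r):
    """Return (1/d) R R^T for an n x d nested list R."""
    d = len(r[0])
    return [[sum(a * b for a, b in zip(ri, rj)) / d for rj in r] for ri in r]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


rng = random.Random(0)
n, d = 4, 5000
bound = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)]: zero mean, unit variance
R = [[rng.uniform(-bound, bound) for _ in range(d)] for _ in range(n)]

A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]  # path graph on 4 nodes
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

gram = gram_over_d(R)                              # approximates I
err_identity = max_abs_diff(gram, identity)
err_adjacency = max_abs_diff(matmul(A, gram), A)   # (1/d) A R R^T approximates A
```

With these settings each entry of (1/d) R R^T deviates from the identity by roughly 1/sqrt(d), and the error in (1/d) A R R^T is at most the maximum degree times that deviation.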

Part (iv).

Proposition 5. The node factorization of each head of $\mathsf{Y}^{l+1}$, the result of a 2-FWL update step, can be approximated via an LRGA module applied to a node factorization of $\mathsf{Y} = \mathsf{Y}^l$. The MLPs in the LRGA approximation need to learn only monomial functions.

Proof. Let $X = [X^1, \ldots, X^k] \in \mathbb{R}^{n \times D}$ be a node factorization of $\mathsf{Y} = \mathsf{Y}^l$. The 2-FWL update step requires computing polynomials of the form $\mathsf{Y}^\beta$, as shown in equation 5. Using the node factorization of $\mathsf{Y}$, $\mathsf{Y}_{i,j} = \left[ \langle x_i^{s_1}, x_j^{t_1} \rangle, \ldots, \langle x_i^{s_d}, x_j^{t_d} \rangle \right] \in \mathbb{R}^d$, we can write:

$$\mathsf{Y}^\beta_{i,j} = \prod_{l=1}^{d} \langle x_i^{s_l}, x_j^{t_l} \rangle^{\beta_l} = \prod_{l=1}^{d} \langle \phi_{\beta_l}(x_i^{s_l}), \phi_{\beta_l}(x_j^{t_l}) \rangle = \prod_{l=1}^{d} \langle \phi_{\beta_l}(x_i^{s}), \phi_{\beta_l}(x_j^{t}) \rangle = \langle \phi_\beta(x_i^{s}), \phi_\beta(x_j^{t}) \rangle \qquad (6)$$

where the second equality uses the feature maps $\phi_{\beta_l}$ of the (homogeneous) polynomial kernels $\langle x_1, x_2 \rangle^{\beta_l}$ (Vapnik, 1998); the third equality reformulates the feature maps $\phi_{\beta_l}$ on the vectors $x_i^s = [x_i^{s_1}, \ldots, x_i^{s_d}]$ and $x_i^t = [x_i^{t_1}, \ldots, x_i^{t_d}]$; and the last equality is due to the closure of kernels under multiplication. We denote the final feature map by $\phi_\beta$. Now, let $\psi_\beta(x_i) = \phi_\beta(x_i^s)$ and $\varphi_\beta(x_i) = \phi_\beta(x_i^t)$; then we have $\mathsf{Y}^\beta = \psi_\beta(X) \varphi_\beta(X)^T$, where $\psi_\beta(X)$ means applying $\psi_\beta$ to every row of $X$. Therefore, an arbitrary head of $\mathsf{Y}^{l+1}$, i.e., one of the form $\mathsf{Y}^\beta \mathsf{Y}^\gamma$, can be written directly as a function of $X$ using the feature maps $\varphi_\beta, \psi_\beta, \varphi_\gamma, \psi_\gamma$:

$$\mathsf{Y}^\beta \mathsf{Y}^\gamma = \psi_\beta(X) \varphi_\beta(X)^T \psi_\gamma(X) \varphi_\gamma(X)^T.$$

A node factorization of the head $\mathsf{Y}^\beta \mathsf{Y}^\gamma$ is therefore $\left[ \psi_\beta(X) \left( \varphi_\beta(X)^T \psi_\gamma(X) \right),\ \varphi_\gamma(X) \right]$. Recalling the structure of the LRGA module introduced in equation 2, $\mathrm{LRGA}(X) = \left[ \eta(X)^{-1} m_1(X) \left( m_2(X)^T m_3(X) \right),\ m_4(X) \right]$, to implement the 2-FWL head the MLPs $m_1, m_2, m_3, m_4$ need to learn the polynomial feature maps formulated in equation 7: $m_1 \approx \psi_\beta$, $m_2 \approx \varphi_\beta$, $m_3 \approx \psi_\gamma$, and $m_4 \approx \varphi_\gamma$. Every coordinate of these feature maps is a monomial (a proof of this fact is given in Appendix C).
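The two kernel facts used in the proof, namely that the homogeneous polynomial kernel has an explicit feature map whose coordinates are monomials and that products of kernels are again kernels, can be illustrated in plain Python. The feature map below (all ordered index tuples) is one standard choice and is not necessarily the one used in the paper's Appendix C; the function names are hypothetical.

```python
from itertools import product
from math import prod


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly_feature_map(x, degree):
    """Feature map of the homogeneous kernel <x, y> ** degree.

    Each coordinate is the monomial x[a_1] * ... * x[a_degree] for one ordered index tuple.
    """
    return [prod(x[a] for a in idx) for idx in product(range(len(x)), repeat=degree)]


def tensor_features(u, v):
    """Feature map of a product of two kernels: all pairwise products of coordinates."""
    return [a * b for a in u for b in v]


x, y = [1, 2, 3], [4, 5, 6]
z, w = [2, 1], [3, 5]

# <x, y>^3 equals the inner product of the degree-3 feature maps.
lhs_single = dot(x, y) ** 3
rhs_single = dot(poly_feature_map(x, 3), poly_feature_map(y, 3))

# <x, y>^2 * <z, w>^1 equals the inner product of the combined feature maps.
lhs_product = dot(x, y) ** 2 * dot(z, w)
rhs_product = dot(
    tensor_features(poly_feature_map(x, 2), poly_feature_map(z, 1)),
    tensor_features(poly_feature_map(y, 2), poly_feature_map(w, 1)),
)
```

Integer inputs make both identities hold exactly. The feature dimension grows as the input dimension raised to the degree, which is consistent with the remark that the resulting sample complexity bound is exponential in the graph size.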
Lastly, note that 2-FWL tensors Y l are insensitive to global scaling and therefore the normalization η has no theoretical influence (it is assumed non-zero).

6. EXPERIMENTS

We evaluated our method on various tasks including graph regression, graph classification, node classification and link prediction. The datasets we used are from two benchmarks: (i) benchmarking GNNs (Dwivedi et al., 2020); and (ii) Open Graph Benchmark (OGB) (Hu et al., 2020). Each benchmark has its own evaluation protocol designed for a fair comparison among different models. These protocols define consistent splits of the data into train/val/test sets, set a budget on the size of the models (OGB), define a stopping criterion for reporting test results, and require training with several different initializations to measure the stability of the results. We followed these protocols.

Baselines. We compare performance with the following state-of-the-art baselines: GCN (Kipf & Welling, 2016), GraphSAGE (Hamilton et al., 2017), GIN (Xu et al., 2019a), GAT (Veličković et al., 2018), GatedGCN (Bresson & Laurent, 2017), Node2Vec (Grover & Leskovec, 2016), DeepWalk (Perozzi et al., 2014) and Matrix Factorization (Hu et al., 2020).

Attention ablation. We compared the performance of different versions of global attention modules. The experiment was conducted on the ZINC dataset and compared performance on the GCN, GAT and GatedGCN models.

Random features evaluation. In addition, we conducted a set of experiments with the random feature framework. In this experiment we focused on the PATTERN node classification dataset from (Dwivedi et al., 2020) and evaluated a variety of models under the RGNN framework.

Rank ablation study. In this experiment we examined the relation between the rank parameter $\kappa$, which can limit the expressiveness of the attention module, and the network performance. Results are presented in Appendix G.

Implementation details of LRGA. We implemented the LRGA module according to the description in Section 4 (equations 2, 3) using the PyTorch framework and the DGL (Wang et al., 2019) and PyTorch Geometric (Fey & Lenssen, 2019) libraries. Each LRGA module contains four MLPs $m_1, m_2, m_3, m_4$. Each $m_i : \mathbb{R}^d \to \mathbb{R}^\kappa$ is a single-layer MLP (linear with ReLU activation). The implementation of a layer follows equation 2, where in practice we added another single-layer MLP, $m_5 : \mathbb{R}^{d + 2\kappa + d_{GNN}} \to \mathbb{R}^d$, for the purpose of reducing the feature dimension. On the OGB benchmark datasets we did not use the skip connections (this gave better performance), and as advised in (Wang et al., 2019), we used batch and graph normalization at each layer.

Results. Table 1 summarizes the results of training and evaluating our model according to the evaluation protocol. We observe that LRGA improves GNN performance, often by a large margin, across all models and datasets except GCN on ZINC and GatedGCN on TSP, supporting our claim of improved generalization. We further note that SOTA on all datasets except TSP is achieved with LRGA-augmented GNNs. On some datasets, such as CLUSTER and PATTERN, LRGA reaches top and roughly equivalent performance for all models it augments, which emphasizes the empirical contribution of LRGA independently of the GNN variant.

6.2. LINK PREDICTION DATASETS FROM THE OGB BENCHMARK (HU ET AL., 2020)

Results. Table 2 summarizes the results on the link prediction tasks. Note that the first three rows correspond to node embedding methods, whereas the rest are GNNs. Augmenting GCN with LRGA achieves SOTA results on those datasets, while still using an order of magnitude fewer parameters than the runner-up node embedding method.

6.3. ATTENTION ABLATION

The LRGA model (equation 2) applies the low-rank attention matrix $S = \eta(X)^{-1} m_1(X) m_2(X)^T$ to the node features $m_3(X)$, which, together with $m_4(X)$, align with a node factorization of a 2-FWL head. In this experiment we tested two variations of LRGA: first, removing the $m_4$ component; and second, replacing $S$ with standard, kernel-based attention matrices (Tsai et al., 2019). Results of incorporating the different attention mechanisms into GCN, GAT, and GatedGCN on the ZINC dataset are summarized in Table 3.

First, incorporating $m_4$ explicitly in the LRGA module compares mostly favorably to the LRGA model with no $m_4$. We attribute that mainly to the algorithmic alignment of the full LRGA model with 2-FWL, and in particular to the encoding of 2-FWL neighborhood multisets. Second, as indicated in (Tsai et al., 2019), the attention matrix can be expressed using a kernel function, $S_{i,j} = \left( \sum_{\ell=1}^{n} k(x_i, x_\ell) \right)^{-1} k(x_i, x_j)$. We replace the low-rank attention matrix $S$ in the LRGA module with attention matrices defined via different kernels $k$: a polynomial kernel (of degree 2 and 4), an exponential kernel (which is equivalent to classical self-attention (Vaswani et al., 2017)) and a radial basis function (RBF) kernel. A full definition of the different kernels is provided in Appendix F. Note that the proof of Theorem 1 utilizes a kernel defined by a polynomial feature map to align with the 2-FWL head. As the table shows, with the exception of the exponential kernel on GatedGCN, LRGA achieves superior results across all the models. The major advantage of LRGA over the other kernels is that it does not need to explicitly compute and store the attention matrix in memory, and it exploits the low-rank structure for fast multiplication.

In this experiment we wanted to validate the theoretical analysis presented in Section 5. The dataset for this evaluation is PATTERN, which is originally equipped with random features, but in contrast to the RGNN framework those features are sampled only once, at the dataset creation stage. We evaluated the different models according to the RGNN framework, i.e., we resampled the features at every forward pass. The features were sampled from a zero-mean Gaussian distribution with variance $1/d$, where $d$ is the input feature dimension. The evaluation protocol is the same as the one used in Section 6.1 and we followed the 500K parameter budget. As seen in Table 4, using alternating random features improves performance for all the models. GIN and GraphSAGE do not appear in the main table, but according to (Dwivedi et al., 2020) they achieve 85.39% and 50.49%, respectively. The LRGA-augmented RGNN models maintain their superiority (even presenting a small improvement compared to Table 1) and serve as an empirical validation of our main theorem.
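For reference, the kernel-based attention matrices used in the attention ablation can be sketched as follows in plain Python. The exact kernel definitions are given in the paper's Appendix F, which is not reproduced here, so the polynomial and exponential kernels below are generic textbook forms used as assumptions; the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def kernel_attention(x, kernel):
    """Row-normalized attention: S[i][j] = k(x_i, x_j) / sum_l k(x_i, x_l)."""
    scores = [[kernel(xi, xj) for xj in x] for xi in x]
    return [[s / sum(row) for s in row] for row in scores]


def exp_kernel(u, v):
    """Exponential kernel; row normalization then gives a softmax over dot products."""
    return math.exp(dot(u, v))


def poly_kernel(degree):
    """Homogeneous polynomial kernel of the given degree."""
    return lambda u, v: dot(u, v) ** degree


# Toy node features (hypothetical values), n = 3 and d = 2.
x = [[1, 0], [1, 1], [0, 2]]
S_poly = kernel_attention(x, poly_kernel(2))
S_exp = kernel_attention(x, exp_kernel)
```

Unlike the low-rank matrix of LRGA, this construction stores all n x n scores, because each row has its own normalizer.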

7. CONCLUSIONS

In this work, we set ourself in a path for improving the generalization power of GNNs. To do so, we introduced the LRGA module, a global self attention module, which is a variant of the dot-product self-attention with linear complexity. In order to theoretically evaluate the contribution of LRGA we analyzed our model under the RGNN framework, which is proved to be universal in probability. Under this framework we were able to show that RGNN augmented with LRGA can align with the powerful 2-FWL isomorphism test by learning simple monomial functions, which have a known sample complexity bound. Under certain conditions the latter provides concrete generalization guarantees for RGNN augmented with LRGA. Empirically, we demonstrated augmenting GNN models with LRGA improves their performance significantly, often achieving SOTA performance. f ([P X, P AP T ]) = f ([X, A]) for all permutation matrices P ∈ R n×n . RGNN is defined as RGNN(X) = GNN([X, R]) where R ∈ R n×d are i.i.d. samples from D. To prove universality in probability we need to show that RGNN can approximate f to an arbitrary precision ε with high probability 1 -δ: ∀ε, δ > 0 ∃Θ, d s.t. P (|RGNN(X) -f ([X, A])| < ε) > 1 -δ where Θ are the RGNN network parameters and d is the dimension of the random features of RGNN. In fact, a simple RGNN composed of single message passing layer and a global attribute block, a DeepSets network (Zaheer et al., 2017) , suffices. The message passing layer first transfers the graph structural information to the node features by creating a factorized representation of A. This means that all the graph information is now stored in a set. Then, using the universality of DeepSets network for invariant set functions we can approximate f to an arbitrary precision. Let us denote the output of the message passing layer of RGNN by h 1 . 
The structural information of the graph can be transferred to the node features using the message passing layer by choosing parameters such that $h^1 = [X, R, AR]$. $h^1$ is then fed to the DeepSets network, so we have $\mathrm{RGNN}(X) = \mathrm{DeepSets}([X, R, AR])$. Observing the approximation error:

$$\begin{aligned}
|\mathrm{RGNN}(X) - f([X, A])| &= |\mathrm{DeepSets}([X, R, AR]) - f([X, A])| \\
&= \big|\mathrm{DeepSets}([X, R, AR]) - f([X, \tfrac{1}{d} ARR^T]) + f([X, \tfrac{1}{d} ARR^T]) - f([X, A])\big| \\
&\le \big|\mathrm{DeepSets}([X, R, AR]) - f([X, \tfrac{1}{d} ARR^T])\big| + \big|f([X, \tfrac{1}{d} ARR^T]) - f([X, A])\big|
\end{aligned}$$

We can now bound the two terms in the last inequality above. Since $f$ is defined on the compact set $\Omega$, we first make sure that $\frac{1}{d} ARR^T$ remains bounded (we assume $f$ can be extended continuously to this domain). Since we assume $\mathcal{D}$ is bounded (given $x \sim \mathcal{D}$, $|x| < M/2$), we get:

$$\Big\| \tfrac{1}{d} ARR^T \Big\|_F \le \tfrac{1}{d} \|A\|_F \, \|RR^T\|_F \le \tfrac{1}{d} \|A\|_F \, d \, \tfrac{M^2}{4} \, n.$$

For the second term we can achieve a bound in probability. Since $f$ is a continuous function on a compact set, by the Heine-Cantor theorem it is uniformly continuous, meaning that

$$\forall \varepsilon' > 0 \;\; \exists\, \xi > 0 \;\; \text{s.t.} \;\; \forall Q, S \in \Omega: \quad d_\Omega(Q, S) < \xi \;\Rightarrow\; d_{\mathbb{R}}(f(Q), f(S)) < \varepsilon'.$$

Setting $\varepsilon' = \varepsilon/2$, we can now choose $d$ such that with probability $1 - \delta$ we have $d_\Omega([X, \frac{1}{d} ARR^T], [X, A]) < \xi$. Let $d_\Omega$ be the Euclidean metric; then

$$d_\Omega\big(\tfrac{1}{d} ARR^T, A\big) \le \|A\|_F \cdot \Big\| \tfrac{1}{d} RR^T - I \Big\|_F.$$

Since we assume a graph of fixed size $n$, $\|A\|_F \le n$, and we are left with bounding $\| \frac{1}{d} RR^T - I \|_F$ in probability. Using Hoeffding's inequality we will be able to find $d$ satisfying the conditions. A single entry in $R$ has mean 0 and variance $c$; for simplicity we set $c = 1$. An entry in $RR^T$ is of the form $(RR^T)_{ij} = \sum_{l=1}^{d} R_{il} R_{jl}$. Note that all elements of the sum are statistically independent and bounded. Using Hoeffding's inequality:

$$P\left( \left| \frac{1}{d} \sum_{l=1}^{d} R_{il} R_{jl} - \mathbb{E}\left[ \frac{1}{d} \sum_{l=1}^{d} R_{il} R_{jl} \right] \right| \ge t \right) \le 2 \exp\left( -\frac{2 d t^2}{M^4} \right).$$

For $i \ne j$: $\mathbb{E}\big[ \frac{1}{d} \sum_{l=1}^{d} R_{il} R_{jl} \big] = 0$, and for $i = j$: $\mathbb{E}\big[ \frac{1}{d} \sum_{l=1}^{d} R_{il} R_{jl} \big] = 1$.
Using a union bound over all entries of $\frac{1}{d} RR^T$:

$$P\left( \bigcup_{i,j \in [n]} \left\{ \left| \tfrac{1}{d} (RR^T)_{ij} - I_{ij} \right| \ge t \right\} \right) \le 2 n^2 \exp\left( -\frac{2 d t^2}{M^4} \right).$$

Setting $t = \xi / (n^2 \|A\|_F)$ and requiring $2 n^2 \exp\big( -\frac{2 d t^2}{M^4} \big) < \delta$, we get

$$d = M \, \frac{n^4 \|A\|_F^2}{\xi^2} \log\frac{2 n^2}{\delta},$$

where $M$ accumulates all constant factors. Lastly, $\|A\|_F$ is bounded by $n$, so the $d$ we should take is

$$d = M \, \frac{n^6}{\xi^2} \log\frac{2 n^2}{\delta}.$$

Finally, we have that for large enough $d$, $\| \frac{1}{d} RR^T - I \|_F$ is arbitrarily small with high probability.

For the first term, we note that $f([X, \frac{1}{d} ARR^T]) = F([X, R, AR])$ is a continuous invariant set function over a bounded domain. Therefore the first term can be bounded by invoking the universal approximation theorem of invariant set functions (Zaheer et al., 2017), i.e., there exist a set of parameters and a model size such that the approximation error is less than $\varepsilon/2$. This concludes the proof: we found that there exist a set of network parameters and a dimension $d$ such that the approximation error is arbitrarily small.
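The concentration step of the proof can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the entries of $R$ are drawn from a uniform distribution on $[-\sqrt{3}, \sqrt{3}]$ (zero mean, unit variance, bounded, so $|x| < M/2$ holds with $M = 4$), and the helper names are hypothetical. `hoeffding_dim` simply inverts the union-bound expression used above, $2n^2 \exp(-2dt^2/M^4) \le \delta$.

```python
import math
import random


def sample_bounded(n, d, rng):
    """n x d matrix with i.i.d. entries uniform on [-sqrt(3), sqrt(3)] (mean 0, variance 1)."""
    a = math.sqrt(3.0)
    return [[rng.uniform(-a, a) for _ in range(d)] for _ in range(n)]


def max_deviation_from_identity(R):
    """Return max over (i, j) of |(1/d) (R R^T)_ij - I_ij|."""
    n, d = len(R), len(R[0])
    worst = 0.0
    for i in range(n):
        for j in range(n):
            gram = sum(R[i][l] * R[j][l] for l in range(d)) / d
            target = 1.0 if i == j else 0.0
            worst = max(worst, abs(gram - target))
    return worst


def hoeffding_dim(n, t, delta, M):
    """An integer d for which 2 n^2 exp(-2 d t^2 / M^4) <= delta."""
    return math.ceil(M ** 4 / (2.0 * t ** 2) * math.log(2.0 * n ** 2 / delta))


rng = random.Random(0)
for d in (10, 100, 1000, 10000):
    R = sample_bounded(4, d, rng)
    # The deviation typically shrinks as the random feature dimension d grows.
    print(d, round(max_deviation_from_identity(R), 4))
```

The dimension returned by `hoeffding_dim` is usually far larger than what the simulation needs, which is expected for a worst-case bound of this kind.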

B MULTISET ENCODING

As shown in Maron et al. (2019a), the multiset encoding function, ENC, can be defined using the collection of Power-sum Multi-symmetric Polynomials (PMPs). That is, given a multiset $Z = (z_1, \ldots, z_n)^T \in \mathbb{R}^{n \times 2d}$, the encoding is defined by

$$\mathrm{ENC}(Z) = \left( \sum_{k=1}^{n} z_k^{\alpha} \;\middle|\; \alpha \in \mathbb{N}_0^{2d}, \; |\alpha| \le n \right),$$

where $\alpha = (\alpha_1, \ldots, \alpha_{2d})$ and $z^{\alpha} = z_1^{\alpha_1} \cdots z_{2d}^{\alpha_{2d}}$. Let us focus on computing a single output coordinate $\alpha$ of the ENC function applied to a particular multiset $Z^{(i,j)}$. This can be efficiently computed using matrix multiplication (Maron et al., 2019a): let $\alpha = (\beta, \gamma) \in \mathbb{N}_0^{2d}$, where $\beta, \gamma \in \mathbb{N}_0^{d}$. Then,

$$\mathrm{ENC}_{\alpha}(Z^{(i,j)}) = \sum_{k=1}^{n} z_k^{\alpha} = \sum_{k=1}^{n} Y_{i,k}^{\beta} \, Y_{k,j}^{\gamma} = (Y^{\beta} Y^{\gamma})_{i,j}.$$

By $Y^{\beta}$ we mean that we apply the multi-power $\beta$ to the feature dimension, i.e., $(Y^{\beta})_{i,j} = Y_{i,j}^{\beta}$. This implies that computing the multiset encoding amounts to calculating the monomials $Y^{\beta}, Y^{\gamma}$ and their matrix multiplications $Y^{\beta} Y^{\gamma}$. Thus the 2-FWL update rule, equation 4, can be written in the following matrix form, where for notational simplicity we denote $Y = Y^{l}$:

$$Y^{l+1} = \left[ Y, \; Y^{\beta} Y^{\gamma} \;\middle|\; (\beta, \gamma) \in \mathbb{N}_0^{2d}, \; |\beta| + |\gamma| \le n \right].$$

C 2-FWL VIA POLYNOMIAL KERNELS

In this section, we give a full characterization of the feature maps, $\phi_{\beta}$, of the final polynomial kernel we use to formulate the 2-FWL algorithm. A key tool for the derivation of the final feature map is the multinomial theorem, which we state here in a slightly different form to fit our setting.

Multinomial theorem. Let us define a set of $m$ variables $x_1 y_1, \ldots, x_m y_m$ composed of products of corresponding $x$'s and $y$'s. Then,

$$(x_1 y_1 + \cdots + x_m y_m)^n = \sum_{|\nu| = n} \binom{n}{\nu} \prod_{i=1}^{m} (x_i y_i)^{\nu_i},$$

where $\nu \in \mathbb{N}_0^{m}$, and the notation $\binom{n}{\nu} = \frac{n!}{\nu_1! \cdots \nu_m!}$. The sum is over all possible $\nu$ which sum to $n$, in total $\binom{n+m-1}{m-1}$ elements.
Recall that we wish to compute $Y_{i,j}^{\beta}$ as in equation 6 in the paper:

$$Y_{i,j}^{\beta} = \prod_{l=1}^{d} \big\langle x_i^{s_l}, x_j^{t_l} \big\rangle^{\beta_l} = \prod_{l=1}^{d} \big\langle \phi_{\beta_l}(x_i^{s_l}), \phi_{\beta_l}(x_j^{t_l}) \big\rangle = \prod_{l=1}^{d} \big\langle \phi_{\beta_l}(x_i^{s}), \phi_{\beta_l}(x_j^{t}) \big\rangle = \big\langle \phi_{\beta}(x_i^{s}), \phi_{\beta}(x_j^{t}) \big\rangle.$$

We will now follow the equalities in equation 6 to derive the final feature map. The second equality uses the feature maps $\phi_{\beta_l}$ of the (homogeneous) polynomial kernels (Vapnik, 1998), $\langle x_1, x_2 \rangle^{\beta_l}$, which can be derived from the multinomial theorem. Suppose the dimensions of $X^{s_l}, X^{t_l}$ are $n \times D_l$, where $\sum_{l=1}^{d} 2 D_l = D$. Then, $\phi_{\beta_l}$ consists of monomials of degree $\beta_l$ of the form

$$\phi_{\beta_l}(x)_{\nu} = \sqrt{\binom{\beta_l}{\nu}} \prod_{i=1}^{D_l} x_i^{\nu_i} = \sqrt{\binom{\beta_l}{\nu}} \, x^{\nu}, \qquad |\nu| = \beta_l.$$

In total the size of the feature map $\phi_{\beta_l}$ is $\binom{\beta_l + D_l - 1}{D_l - 1}$. The third equality reformulates the feature maps $\phi_{\beta_l}$ on the vectors $x_i^{s} = [x_i^{s_1}, \ldots, x_i^{s_d}] \in \mathbb{R}^{D/2}$ and $x_i^{t} = [x_i^{t_1}, \ldots, x_i^{t_d}] \in \mathbb{R}^{D/2}$. The last equality is due to the closure of kernels under multiplication. The final feature map, which corresponds to the product kernel, is composed of all possible products of elements of the feature maps, i.e.,

$$\phi_{\beta}(x) = \left( \prod_{l=1}^{d} \sqrt{\binom{\beta_l}{\nu_l}} \, x_l^{\nu_l} \;\middle|\; |\nu_j| = \beta_j, \; \forall j \in [d] \right),$$

where $x = [x_1, x_2, \ldots, x_d] \in \mathbb{R}^{D/2}$ and $x_l \in \mathbb{R}^{D_l}$ for all $l \in [d]$. The size of the final feature map is $\prod_{l=1}^{d} \binom{\beta_l + D_l - 1}{D_l - 1} \le N$, where $N = \binom{n + D}{D}$.
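The feature-map construction above can be checked numerically. The following is a small sketch with hypothetical helper names, assuming the standard feature map of the homogeneous polynomial kernel (monomials weighted by the square root of the multinomial coefficient). It verifies that inner products of feature maps reproduce powers of inner products, and that the product kernel is realized by taking all products of per-block features.

```python
import math
from itertools import product


def multi_indices(dim, degree):
    """All nu in N_0^dim with |nu| = degree."""
    return [nu for nu in product(range(degree + 1), repeat=dim) if sum(nu) == degree]


def multinomial(degree, nu):
    """degree! / (nu_1! * ... * nu_m!)."""
    out = math.factorial(degree)
    for v in nu:
        out //= math.factorial(v)
    return out


def feature_map(x, degree):
    """phi(x)_nu = sqrt(multinomial(degree, nu)) * prod_i x_i ** nu_i, for |nu| = degree."""
    return [
        math.sqrt(multinomial(degree, nu)) * math.prod(xi ** v for xi, v in zip(x, nu))
        for nu in multi_indices(len(x), degree)
    ]


def product_feature_map(blocks, betas):
    """Feature map of prod_l <x_l, y_l> ** beta_l: all products of per-block features."""
    out = [1.0]
    for x_l, beta_l in zip(blocks, betas):
        phi = feature_map(x_l, beta_l)
        out = [a * p for a in out for p in phi]
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Single block: <x, y> ** beta equals <phi(x), phi(y)>.
x, y, beta = [0.5, -1.0, 2.0], [1.5, 0.25, -0.5], 3
print(dot(x, y) ** beta, dot(feature_map(x, beta), feature_map(y, beta)))

# Two blocks: the product of kernels equals the inner product of product features.
xs, ys, betas = [[0.5, -1.0], [2.0]], [[1.5, 0.25], [-0.5]], [2, 1]
lhs = math.prod(dot(a, b) ** p for a, b, p in zip(xs, ys, betas))
print(lhs, dot(product_feature_map(xs, betas), product_feature_map(ys, betas)))
```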

D EXTENSION OF PROPOSITION 4

In this section we extend the proof of Proposition 4 to the case where the graph is equipped with prior node features $X \in \mathbb{R}^{n \times d_0}$, so that the network's input is $[X, R]$. As mentioned in Section 3, the isomorphism type of a graph equipped with node features is $Y = [[I, \mathbf{1} \otimes X, X \otimes \mathbf{1}, A]]$. Following this description, we claim that the node factorization representation of the graph is of the form $R' = [\mathbf{1}, X, R, AR]$, where $\mathbf{1} = (1, 1, \ldots, 1)^T \in \mathbb{R}^{n}$. To build the isomorphism tensor we can use the sequence of outer products $X_1 \mathbf{1}^T, \ldots, X_{d_0} \mathbf{1}^T, \mathbf{1} X_1^T, \ldots, \mathbf{1} X_{d_0}^T$, where $X_l \in \mathbb{R}^{n}$ is the $l$-th column of $X$. This sequence can be represented using the first two components of $R'$. The last two components, $R$ and $AR$, allow us to approximate $A$ and $I$ in probability, as shown in Appendix A, which completes the isomorphism tensor construction and shows that $[\mathbf{1}, X, R, AR]$ is a node factorization representation. Lastly, we have to show that we can construct this structure using RGNN; in fact, it only remains to explain how to add the $\mathbf{1}$ vector to the representation. This can be done using a global attribute block, as used in the proof of Proposition 1.

E SAMPLE COMPLEXITY BOUND OF MONOMIALS

Corollary 6.2 in (Arora et al., 2019) provides a bound on the sample complexity, denoted $C_A(g, \varepsilon, \delta)$, of a polynomial $g : \mathbb{R}^{D} \to \mathbb{R}$ of the form

$$g(x) = \sum_{j} a_j \langle \beta_j, x \rangle^{p_j},$$

where $p_j \in \{1, 2, 4, 6, 8, \ldots\}$, $a_j \in \mathbb{R}$, $\beta_j \in \mathbb{R}^{D}$; $\varepsilon, \delta$ are the relevant PAC learning constants, and $A$ represents an over-parameterized, randomly initialized two-layer MLP trained with gradient descent:

$$C_A(g, \varepsilon, \delta) = O\left( \frac{\sum_{j} p_j |a_j| \, \|\beta_j\|_2^{p_j} + \log(1/\delta)}{\varepsilon^2} \right).$$

It is not immediately clear, however, how to use this theorem to learn an arbitrary monomial $x^{\delta}$, since $g$ has the above particular form. Nevertheless, we show how it can be generalized to this case. Let $B = \{ \beta \in \mathbb{N}_0^{D} \mid |\beta| \le n \}$, and note that there are $N = \binom{n + D}{D}$ elements in $B$. We assume some fixed ordering in $B$ is prescribed. Define the sample matrix (multivariate Vandermonde) $V \in \mathbb{R}^{N \times N}$ by $V_{\alpha, \beta} = \beta^{\alpha}$. Lemma 2.8 in (Wendland, 2004) implies that $V$ is non-singular. Let $c_{n,D} = \|V^{-1}\|_{\infty}$ (i.e., the induced $\infty$ matrix norm); note that $c_{n,D}$ depends only on $n, D$.

Lemma 1. Fix $D, n \in \mathbb{N}$, and let $\delta \in B$ be arbitrary. Then, there exist coefficients $a \in \mathbb{R}^{N}$, $\|a\|_1 \le c_{n,D}$, so that

$$x^{\delta} = \sum_{\beta \in B} a_{\beta} \big( \langle \beta, x \rangle + 1 \big)^{n} \quad \text{for all } x \in \mathbb{R}^{D}.$$

Proof. Using the multinomial theorem we have

$$\big( \langle \beta, x \rangle + 1 \big)^{n} = \sum_{\alpha \in B} d_{\alpha} \, \beta^{\alpha} x^{\alpha},$$

where $d_{\alpha}$ are positive multinomial coefficients. This equation defines a linear relation between the monomial basis $x^{\delta}$ and $(\langle \beta, x \rangle + 1)^{n}$, for $\beta \in B$. The matrix of this system is $V$ multiplied by a positive diagonal matrix with $d_{\alpha}$ on its diagonal. By inverting this matrix and solving this system for $x^{\delta}$, the lemma is proved.

We can use this lemma in the following way. Assume $n$ is even, or otherwise consider $2 \lceil n/2 \rceil$. Further assume that the MLP $m : \mathbb{R}^{D+1} \to \mathbb{R}$ is two-layer, over-parameterized, and of the form $m(x, 1)$ (i.e., we assume there is a constant 1 plugged into an extra $(D+1)$-th coordinate). We consider training $m$ with random initialization and gradient descent using data $(x, x^{\delta}) \in \mathbb{R}^{D} \times \mathbb{R}$, where $x$ is sampled i.i.d. from some distribution $\mathcal{D}$ over $\mathbb{R}^{D}$.
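Lemma 1 can be made concrete in the simplest setting. The sketch below is an illustration only, restricted to the one-dimensional case $D = 1$ (an assumption made here to keep the index set $B = \{0, \ldots, n\}$ and the linear system small); the function names are hypothetical. It solves the linear system from the proof exactly, using rational arithmetic, and returns the coefficients $a_{\beta}$.

```python
from fractions import Fraction
from math import comb


def solve(matrix, rhs):
    """Solve a square, non-singular linear system exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def monomial_coefficients(n, delta):
    """Coefficients a_0..a_n with x**delta == sum_b a_b * (b*x + 1)**n (case D = 1)."""
    # The coefficient of x**alpha in (b*x + 1)**n is comb(n, alpha) * b**alpha,
    # i.e. the Vandermonde entry b**alpha scaled by a positive binomial coefficient.
    system = [[comb(n, alpha) * b ** alpha for b in range(n + 1)] for alpha in range(n + 1)]
    target = [1 if alpha == delta else 0 for alpha in range(n + 1)]
    return solve(system, target)


a = monomial_coefficients(2, 2)
print(a)                       # coefficients expressing x**2
print(sum(abs(v) for v in a))  # their l1 norm, the quantity bounded in Lemma 1
```

For $n = 2$ and $\delta = 2$ this yields $x^2 = \tfrac{1}{2} - (x + 1)^2 + \tfrac{1}{2}(2x + 1)^2$, which can be confirmed by expanding the right-hand side.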
Let $g : \mathbb{R}^{D+1} \to \mathbb{R}$ be defined as $g(x, x_{D+1}) = \sum_{\beta \in B} a_{\beta} \big( \langle \beta, x \rangle + x_{D+1} \big)^{n}$, where $a \in \mathbb{R}^{N}$ is as promised by Lemma 1. Then, the learning setup described above is equivalent to training the MLP $m(x, x_{D+1})$ using data of the form $((x, 1), g(x, 1) = x^{\delta})$, where $(x, 1)$ is sampled i.i.d. from a distribution $\mathcal{D}$ over $\mathbb{R}^{D+1}$ concentrated on the hyperplane $x_{D+1} = 1$. We now apply Corollary 6.2 from (Arora et al., 2019) to $g$, recalling that $B = \{ \beta \in \mathbb{N}_0^{D} \mid |\beta| \le n \}$ and that by Lemma 1 there exists $a$ such that $g(x, 1) = x^{\delta}$. The sample complexity bound given by Corollary 6.2 is therefore:

$$C_A(g, \varepsilon, \delta) = O\left( \frac{\sum_{\beta \in B} n \, |a_{\beta}| \, \|\tilde{\beta}\|_2^{n} + \log(1/\delta)}{\varepsilon^2} \right), \qquad \tilde{\beta} = (\beta, 1).$$

Let us bound the first term in the numerator of the sample complexity expression:

$$\sum_{\beta \in B} n \, |a_{\beta}| \, \|\tilde{\beta}\|_2^{n} = n \cdot \sum_{\beta \in B} |a_{\beta}| \left( \sum_{i=1}^{D} \beta_i^2 + 1 \right)^{n/2} \le n \cdot \big( n^2 + 1 \big)^{n/2} \sum_{\beta \in B} |a_{\beta}| \le \big( n^2 + 1 \big)^{(n+1)/2} c_{n,D}.$$

The first inequality is due to $\| \cdot \|_2 \le \| \cdot \|_1$; the second is by Lemma 1 and absorbing $n$ into the main term. From the above, the bound follows.

F KERNEL DEFINITION

Let $x, y \in \mathbb{R}^{d}$. The kernel functions are defined in the following manner:

(i) Polynomial kernel: $k_m(x, y) = (\langle x, y \rangle + 1)^{m}$

(ii) Exponential kernel: $k(x, y) = \exp\big( \langle x, y \rangle / \sqrt{d} \big)$

(iii) RBF kernel: $k(x, y) = \exp\big( -\|x - y\|^2 / \sqrt{d} \big)$

Let $X_Q, X_K \in \mathbb{R}^{n \times d}$, where $X_Q = \{ (x_Q^1)^T, \ldots, (x_Q^n)^T \}$ and $X_K = \{ (x_K^1)^T, \ldots, (x_K^n)^T \}$ denote the attention Query and Key matrices. For a given kernel function $k$ we define the attention matrix $S \in \mathbb{R}^{n \times n}$ in the following way:

$$S_{i,j} = \frac{k(x_Q^i, x_K^j)}{\sum_{l=1}^{n} k(x_Q^i, x_K^l)}.$$

G RANK ABLATION STUDY

We investigated the effects of the attention's rank $\kappa$ on the performance of GNNs augmented with LRGA on the CLUSTER dataset. The dataset contains graphs of 40 to 190 nodes (117 nodes on average). Our experimental setting included fixing the GNN's hidden dimension size and changing $\kappa$.
Figure 1 shows that accuracy increases with the rank values until it reaches a plateau around κ ≈ 30 (κ/n = 0.25 where n is the average graph size), a fact that could be attributed to saturating the expressiveness of the LRGA module. Moreover, the maximal accuracy is achieved at a value that corresponds to the maximal graph size in the dataset, smaller than what the theory predicts as a function of the graph size n. This rank value is enough to compute any attention function on this graph collection.
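For comparison with the low-rank module studied in this ablation, the explicit kernel attention matrices of Appendix F can be sketched as follows. This is a minimal illustration with hypothetical function names, not the implementation used in the experiments; in particular, the squared Euclidean norm in the RBF kernel is our reading of the definition. Unlike LRGA, this construction materializes the full $n \times n$ matrix $S$.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, m=2):
    """k_m(x, y) = (<x, y> + 1) ** m."""
    return (dot(x, y) + 1.0) ** m


def exponential_kernel(x, y):
    """k(x, y) = exp(<x, y> / sqrt(d)); normalizing it row-wise gives softmax attention."""
    return math.exp(dot(x, y) / math.sqrt(len(x)))


def rbf_kernel(x, y):
    """k(x, y) = exp(-||x - y||^2 / sqrt(d)) (squared norm assumed)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / math.sqrt(len(x)))


def kernel_attention(queries, keys, kernel):
    """S[i][j] = k(q_i, k_j) / sum_l k(q_i, k_l), stored as an explicit matrix."""
    S = []
    for q in queries:
        row = [kernel(q, k) for k in keys]
        total = sum(row)  # assumed non-zero; always positive for the exp-based kernels
        S.append([v / total for v in row])
    return S


Q = [[0.1, -0.2, 0.3], [0.5, 0.0, -0.4], [0.2, 0.2, 0.2]]
K = [[0.2, 0.1, 0.0], [-0.3, 0.4, 0.1], [0.0, 0.0, 1.0]]
for kern in (polynomial_kernel, exponential_kernel, rbf_kernel):
    S = kernel_attention(Q, K, kern)
    print(kern.__name__, [round(sum(row), 6) for row in S])  # each row sums to 1
```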

H IMPLEMENTATION DETAILS

In this section we describe the datasets on which we performed our evaluation. In addition, we specify the hyperparameters for the experiments section in the paper. The rest of the model configurations are determined directly by the evaluation protocols defined by the benchmarks. Unless stated otherwise, our experiments ran on a single Tesla V-100 GPU. We performed our parameter search only on κ and d (except for CIFAR10 and MNIST, where we also searched over different dropout values), since the rest of the parameters were dictated by the evaluation protocol. The model sizes were restricted by the allowed parameter budget.

H.1 BENCHMARKING GRAPH NEURAL NETWORKS (DWIVEDI ET AL., 2020)

Datasets. This benchmark contains 6 main datasets:

(i) ZINC, a molecular graphs dataset with a graph regression task where each node represents an atom and each edge represents a bond. The regression target is a property known as the constrained solubility (with mean absolute error as the evaluation metric). Additionally, the node features represent the atom's type (28 types) and the edge features represent the type of connection (4 types). The result reported for GCN used d = 60 for the 100K budget and d = 90 (network's depth is L = 12) for the 500K budget. For the GAT network we used d = 60 (4 attention heads of dimension 15) for the 100K budget and d = 120 (4 attention heads of dimension 30) with L = 8 for the 500K budget. For the GatedGCN network we used d = 45 for the 100K budget and d = 60 with L = 12 for the 500K budget. All the models used κ = 30.

(ii) MNIST and CIFAR10, the well-known image classification problems, converted to graph classification tasks using a super-pixel representation (Knyazev et al., 2019), which represents small regions of homogeneous intensity as nodes. The edges in the graph are obtained by applying the k-nearest neighbor algorithm to the node coordinates. Node features are a concatenation of the super-pixel intensity (RGB for CIFAR10 and greyscale for MNIST) and its image coordinate. Edge features are the k-nearest distances. The result reported for GCN used d = 60 for the 100K budget and d = 110 with L = 8 for the 500K budget. For the GAT network we used d = 60 (4 attention heads of dimension 15) for the 100K budget and d = 122 (4 attention heads of dimension 28) with L = 8 for the 500K budget. For the GatedGCN network we used d = 45 for the 100K budget and d = 80 with L = 8 for the 500K budget. All the models used κ = 30.

(iii) CLUSTER and PATTERN, node classification tasks which aim to identify embedded node structures in stochastic block model graphs (Abbe, 2017). The goal of the task is to assign each node to the stochastic block it originated from, while the structure of the graph is governed by two probabilities that define the inner-structure and cross-structure edges. In the CLUSTER dataset, a single representative from each block is assigned an initial feature that indicates its block while the rest of the nodes have no features; in the PATTERN dataset, nodes are assigned a random value as an input feature at the creation stage. The result reported for GCN used d = 60 for the 100K budget and d = 100 with L = 8 for the 500K budget (PATTERN, CLUSTER respectively). For the GAT network we used d = 60 (4 attention heads of dimension 15) for the 100K budget and d = 120, 60 (8 attention heads of dimension 15, 4 attention heads of dimension 15) with L = 8, 12 for the 500K budget (PATTERN, CLUSTER respectively). For the GatedGCN network we used d = 45 for the 100K budget and d = 80, 50 with L = 8, 12 for the 500K budget (PATTERN, CLUSTER respectively). All the models used κ = 30.

(iv) TSP, a link prediction task that tries to tackle the classical NP-hard Traveling Salesman Problem (Joshi et al., 2019). Given a 2D Euclidean graph, the goal is to choose the edges that participate in the minimal edge weight tour of the graph. The evaluation metric for the task is the F1 score for the positive class. The result reported for GCN used d = 60. For the GAT network we used d = 60 (4 attention heads of dimension 15). For the GatedGCN network we used d = 45. All the models used κ = 30.

H.2 LINK PREDICTION DATASETS FROM THE OGB BENCHMARK (HU ET AL., 2020)

Datasets. In order to provide a more complete evaluation of our model we also evaluate it on semi-supervised learning tasks of link prediction. We searched over the same hyperparameter range κ ∈ {25, 50, 100}, d ∈ {150, 256} and used κ = 50, d = 256 in all tasks. The three datasets were:

(i) ogbl-ppa, an undirected unweighted graph. Nodes represent types of proteins and the edges signify biological connections between proteins. The initial node feature is a 58-dimensional one-hot vector that indicates the species of origin of the protein. The learning task is to predict new connections between nodes. The train/validation/test split sizes are 21M/6M/3M. The evaluation metric is called Hits@K (Hu et al., 2020).

(ii) ogbl-collab, a graph that represents a network of collaborations between authors. Every author in the network is represented by a node and each collaboration is assigned an edge. Initial node features are obtained by combining word embeddings of papers by that author (128-dimensional vector). Additionally, each collaboration is described by the year of collaboration and the number of collaborations in that year as a weight. The train/validation/test split sizes are 1.1M/60K/46K. Similarly to the previous dataset, the evaluation metric is Hits@K.

(iii) ogbl-ddi, an undirected unweighted graph representing drug-drug interactions. Each node represents an FDA-approved or experimental drug. The edges represent interactions between drugs, i.e., the joint effect of taking both drugs together. The learning task is to predict new drug-drug interactions. The train/validation/test split sizes are 1M/150K/150K. The evaluation metric here is also Hits@K.

Figure 1: Ablation study on the CLUSTER dataset. The X-axis represents the ratio between the rank parameter κ and the average graph size n = 117. The Y-axis represents the network's accuracy.

Performance on the benchmarking GNN datasets. In bold: better performance between LRGA augmented and vanilla models; note the parameter (#) budget. Blue represents best performance with the 100K budget and red with the 500K budget.

Performance on the link prediction tasks from the OGB benchmark

Attention ablation table. Various GNNs augmented with attention variants on the ZINC dataset. Bold represents best performance and blue represents second best.

Random Features Evaluation

