GRAPH NEURAL NETWORK-INSPIRED KERNELS FOR GAUSSIAN PROCESSES IN SEMI-SUPERVISED LEARNING

Abstract

Gaussian processes (GPs) are an attractive class of machine learning models because of their simplicity and flexibility as building blocks of more complex Bayesian models. Meanwhile, graph neural networks (GNNs) emerged recently as a promising class of models for graph-structured data in semi-supervised learning and beyond. Their competitive performance is often attributed to a proper capturing of the graph inductive bias. In this work, we introduce this inductive bias into GPs to improve their predictive performance for graph-structured data. We show that a prominent example of GNNs, the graph convolutional network, is equivalent to some GP when its layers are infinitely wide; and we analyze the kernel universality and the limiting behavior in depth. We further present a programmable procedure to compose covariance kernels inspired by this equivalence and derive example kernels corresponding to several interesting members of the GNN family. We also propose a computationally efficient approximation of the covariance matrix for scalable posterior inference with large-scale data. We demonstrate that these graph-based kernels lead to competitive classification and regression performance, as well as advantages in computation time, compared with the respective GNNs.

1. INTRODUCTION

Gaussian processes (GPs) (Rasmussen & Williams, 2006) are widely used in machine learning, uncertainty quantification, and global optimization. In the Bayesian setting, a GP serves as a prior probability distribution over functions, characterized by a mean (often treated as zero for simplicity) and a covariance. Conditioned on observed data with a Gaussian likelihood, the random function admits a posterior distribution that is also Gaussian, whose mean is used for prediction and whose variance serves as an uncertainty measure. The closed-form posterior allows for exact Bayesian inference, resulting in the great attractiveness and wide usage of GPs. The success of GPs in practice depends on two factors: the observations (training data) and the covariance kernel.

We are interested in semi-supervised learning, where only a small amount of data is labeled while a large amount of unlabeled data can be used together for training (Zhu, 2008). In recent years, graph neural networks (GNNs) (Zhou et al., 2020; Wu et al., 2021) emerged as a promising class of models for this problem, when the labeled and unlabeled data are connected by a graph. The graph structure becomes an important inductive bias that leads to the success of GNNs. This inductive bias inspires us to design a GP model under limited observations by building the graph structure into the covariance kernel.

An intimate relationship between neural networks and GPs is known: a neural network with fully connected layers, equipped with a prior probability distribution on the weights and biases, converges to a GP when each of its layers is infinitely wide (Lee et al., 2018; de G. Matthews et al., 2018). Such a result follows from the central limit theorem (Neal, 1994; Williams, 1996), and the GP covariance can be recursively computed if the weights (and biases) in each layer are iid Gaussian.
Similar results for other architectures, such as convolution layers and residual connections, were subsequently established in the literature (Novak et al., 2019; Garriga-Alonso et al., 2019). One focus of this work is to establish a similar relationship between GNNs and their limiting GPs. We will derive a covariance kernel that incorporates the graph inductive bias as GNNs do. We start with one of the most widely studied GNNs, the graph convolutional network (GCN) (Kipf & Welling, 2017), and analyze the kernel universality as well as the limiting behavior when the depth also tends to infinity. We then derive covariance kernels from other GNNs by using a programmable procedure that maps every building block of a neural network to a kernel operation.

Meanwhile, we design efficient computational procedures for posterior inference (i.e., regression and classification). GPs are notoriously difficult to scale because of the cubic complexity with respect to the number of training data. Benchmark graph datasets used by the GNN literature may contain thousands or even millions of labeled nodes (Hu et al., 2020b). The semi-supervised setting worsens the scenario, as the covariance matrix needs to be (recursively) evaluated in full because of the graph convolution operation. We propose a Nyström-like scheme to perform low-rank approximations and apply the approximation recursively on each layer, to yield a low-rank kernel matrix. Such a matrix can be computed scalably. We demonstrate through numerical experiments that the GP posterior inference is much faster than training a GNN and subsequently performing predictions on the test set.

We summarize the contributions of this work as follows:
1. We derive the GP as a limit of the GCN when the layer widths tend to infinity and study the kernel universality and the limiting behavior in depth.
2. We propose a computational procedure to compute a low-rank approximation of the covariance matrix for practical and scalable posterior inference.
3. We present a programmable procedure to compose covariance kernels and their approximations and show examples corresponding to several interesting members of the GNN family.
4. We conduct comprehensive experiments to demonstrate that the GP model performs favorably compared with GNNs in prediction accuracy while being significantly faster in computation.

2. RELATED WORK

It has long been observed that GPs are limits of standard neural networks with one hidden layer when the layer width tends to infinity (Neal, 1994; Williams, 1996). Recently, renewed interest in the equivalence between GPs and neural networks extended it to deep neural networks (Lee et al., 2018; de G. Matthews et al., 2018) as well as modern architectures, such as convolution layers (Novak et al., 2019), recurrent networks (Yang, 2019), and residual connections (Garriga-Alonso et al., 2019). The term NNGP (neural network Gaussian process) henceforth emerged in the context of Bayesian deep learning. Besides the fact that an infinite neural network defines a kernel, the training of a neural network by gradient descent also defines a kernel, the neural tangent kernel (NTK), which describes the evolution of the network (Jacot et al., 2018; Lee et al., 2019). Python library support was developed to automatically construct the NNGP and NTK kernels by programming the corresponding neural networks (Novak et al., 2020).

GNNs are neural networks that handle graph-structured data (Zhou et al., 2020; Wu et al., 2021). They are a promising class of models for semi-supervised learning. Many GNNs use the message-passing scheme (Gilmer et al., 2017), where neighborhood information is aggregated to update the representation of the center node. Representative examples include GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2018), and GIN (Xu et al., 2019). It is found that the performance of GNNs degrades as they become deep; one approach to mitigating the problem is to insert residual/skip connections, as done by JumpingKnowledge (Xu et al., 2018), APPNP (Gasteiger et al., 2019), and GCNII (Chen et al., 2020).

GP inference is costly because it requires the inversion of the N × N dense kernel matrix.
Scalable approaches include low-rank methods, such as the Nyström approximation (Drineas & Mahoney, 2005), random features (Rahimi & Recht, 2007), and KISS-GP (Wilson & Nickisch, 2015), as well as multi-resolution (Katzfuss, 2017) and hierarchical methods (Chen et al., 2017; Chen & Stein, 2021).

Prior efforts on integrating graphs into GPs exist. Ng et al. (2018) define a GP kernel by combining a base kernel with the adjacency matrix; it is related to a special case of our kernels where the network has only one layer and the output admits a robust-max likelihood for classification. Hu et al. (2020a) explore a route similar to ours by taking the limit of a GCN, but their exploration is less comprehensive: it does not generalize to other GNNs and does not tackle the scalability challenge.

3. GRAPH CONVOLUTIONAL NETWORK AS A GAUSSIAN PROCESS

We start with a few notations used throughout this paper. Let an undirected graph be denoted by G = (V, E) with N = |V| nodes and M = |E| edges. For notational simplicity, we use A ∈ R^{N×N} to denote the original graph adjacency matrix or any modified/normalized version of it. Using d_l to denote the width of the l-th layer, the layer architecture of GCN reads

X^(l) = φ( A X^(l-1) W^(l) + b^(l) ),   (1)

where X^(l-1) ∈ R^{N×d_{l-1}} and X^(l) ∈ R^{N×d_l} are the layer input and output, respectively; W^(l) ∈ R^{d_{l-1}×d_l} and b^(l) ∈ R^{1×d_l} are the weights and the biases, respectively; and φ is the ReLU activation function. The graph convolution operator A is a symmetric normalization of the graph adjacency matrix with self-loops added (Kipf & Welling, 2017).

For ease of exposition, it will be useful to rewrite the matrix notation (1) in element sums and products. To this end, for a node x, let z_i^(l)(x) and x_i^(l)(x) denote the pre- and post-activation values at the i-th coordinate in the l-th layer, respectively. In particular, in an L-layer GCN, x^(0)(x) is the input feature vector and z^(L)(x) is the output vector. The layer architecture of GCN reads

y_i^(l)(x) = Σ_{j=1}^{d_{l-1}} W_{ji}^(l) x_j^(l-1)(x),   z_i^(l)(x) = b_i^(l) + Σ_{v∈V} A_{xv} y_i^(l)(v),   x_i^(l)(x) = φ(z_i^(l)(x)).   (2)
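As a concrete illustration, a single GCN layer of the form (1) can be sketched in a few lines of NumPy. This is our own minimal example, not the paper's code; the function names are illustrative, and A is built by the usual symmetric normalization with self-loops.

```python
import numpy as np

# Minimal sketch of one GCN layer X^(l) = phi(A X^(l-1) W^(l) + b^(l)).
# normalize_adjacency builds the symmetric normalization with self-loops
# used by GCN; both function names are ours, for illustration only.

def normalize_adjacency(adj):
    """D^{-1/2} (adj + I) D^{-1/2}, the GCN graph convolution operator A."""
    a = adj + np.eye(adj.shape[0])
    d = a.sum(axis=1)
    return a / np.sqrt(np.outer(d, d))

def gcn_layer(A, X, W, b):
    """One graph convolution layer with ReLU activation."""
    Z = A @ X @ W + b           # pre-activation z^(l)
    return np.maximum(Z, 0.0)   # post-activation x^(l) = phi(z^(l))

# Tiny usage on a 3-node path graph
rng = np.random.default_rng(0)
adj = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
A = normalize_adjacency(adj)
out = gcn_layer(A, rng.normal(size=(3, 4)), rng.normal(size=(4, 2)), np.zeros((1, 2)))
```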

3.1. LIMIT IN THE WIDTH

The following theorem states that when the weights and biases in each layer are iid zero-mean Gaussians, in the limit of the layer widths, the GCN output z^(L)(x) is a multi-output GP over the index x.

Theorem 1. Let d_1, ..., d_{L-1} tend to infinity in succession and let the bias and weight terms be independent with distributions b_i^(l) ~ N(0, σ_b²) and W_{ij}^(l) ~ N(0, σ_w²/d_{l-1}), l = 1, ..., L. Then, for each i, the collection {z_i^(l)(x)} over all graph nodes x follows the normal distribution N(0, K^(l)), where the covariance matrix K^(l) can be computed recursively by

C^(l) = E_{z_i^(l) ~ N(0, K^(l))}[ φ(z_i^(l)) φ(z_i^(l))^T ],   l = 1, ..., L,   (3)
K^(l+1) = σ_b² 1_{N×N} + σ_w² A C^(l) A^T,   l = 0, ..., L - 1.   (4)

All proofs of this paper are given in the appendix. Note that, different from a usual GP, which is a random function defined over a connected region of Euclidean space, here z^(L) is defined over a discrete set of graph nodes. In the usual use of a graph in machine learning, this set is finite, so that the function distribution degenerates to a multivariate normal distribution. In semi-supervised learning, the dimension of the distribution, N, is fixed when one conducts transductive learning; but it varies in the inductive setting because the graph gains new nodes and edges. One special care of a graph-based GP over a usual GP is that the covariance matrix needs to be recomputed from scratch whenever the graph alters.

Theorem 1 leaves out the base definition C^(0), whose entries denote the covariances between pairs of input nodes. The traditional literature uses the inner product C^(0)(x, x') = (x • x') / d_0 (Lee et al., 2018), but nothing prevents us from using any other positive-definite kernel. For example, we could use the squared exponential kernel C^(0)(x, x') = exp( -(1/2) Σ_{j=1}^{d_0} (x_j - x'_j)² ). Such flexibility in essence performs an implicit feature transformation as preprocessing.
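The recursion (3)-(4) is straightforward to implement when φ is ReLU, since the expectation in (3) then has a closed-form arc-cosine expression. Below is a minimal NumPy sketch of the recursion; the function names and default hyperparameters are ours, for illustration only.

```python
import numpy as np

# Sketch (ours) of the kernel recursion (3)-(4) for ReLU, using the
# closed-form arc-cosine expectation in place of Monte Carlo integration.

def relu_expectation(K):
    """C = E_{z ~ N(0,K)}[phi(z) phi(z)^T] for phi = ReLU (arc-cosine kernel)."""
    s = np.sqrt(np.diag(K))
    theta = np.arccos(np.clip(K / np.outer(s, s), -1.0, 1.0))
    return np.outer(s, s) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def gcn_gp_kernel(A, C0, L, sigma_b2=0.1, sigma_w2=1.0):
    """K^(L) from K^(l+1) = sigma_b^2 1 + sigma_w^2 A C^(l) A^T, with C^(0) = C0."""
    N = A.shape[0]
    K = sigma_b2 * np.ones((N, N)) + sigma_w2 * A @ C0 @ A.T     # K^(1)
    for _ in range(L - 1):
        C = relu_expectation(K)                                  # C^(l), eq. (3)
        K = sigma_b2 * np.ones((N, N)) + sigma_w2 * A @ C @ A.T  # K^(l+1), eq. (4)
    return K

# Tiny usage: 3-node path graph, inner-product base kernel C^(0) = X0 X0^T / d0
adj = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
a = adj + np.eye(3)
d = a.sum(axis=1)
A = a / np.sqrt(np.outer(d, d))
X0 = np.random.default_rng(0).normal(size=(3, 5))
K3 = gcn_gp_kernel(A, X0 @ X0.T / 5, L=3)
```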

3.2. UNIVERSALITY

A covariance kernel is positive definite; hence, the Moore-Aronszajn theorem (Aronszajn, 1950) suggests that it defines a unique Hilbert space for which it is a reproducing kernel. If this space is dense, then the kernel is called universal. One can verify universality by checking whether the kernel matrix is positive definite for any set of distinct points. For the case of graphs, it suffices to verify whether the covariance matrix over all nodes is positive definite. We do this job for the ReLU activation function.

It is known that the kernel E_{w~N(0,I_d)}[ φ(w • x) φ(w • x') ] admits a closed-form expression as a function of the angle between x and x', hence the name arc-cosine kernel (Cho & Saul, 2009). We first establish the following lemma, which states that the kernel is universal over a half-space.

Lemma 2. The arc-cosine kernel is universal on the upper hemisphere S = { x ∈ R^d : ‖x‖₂ = 1, x₁ > 0 } for all d ≥ 2.

It is also known that the expectation in (3) is proportional to the arc-cosine kernel up to a factor √(K^(l)(x, x) K^(l)(x', x')) (Lee et al., 2018). Therefore, we iteratively work on the post-activation covariance (3) and the pre-activation covariance (4) and show that the covariance kernel resulting from the limiting GCN is universal, for any GCN with three or more layers.

Theorem 3. Assume A is irreducible and non-negative and C^(0) does not contain two linearly dependent rows. Then, K^(l) is positive definite for all l ≥ 3.
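As an illustrative sanity check (ours, not from the paper), the closed-form arc-cosine expression can be compared against a Monte Carlo estimate of the expectation E_{w~N(0,I_d)}[ φ(w • x) φ(w • x') ]:

```python
import numpy as np

# Sanity check (ours): for phi = ReLU, E_{w~N(0,I_d)}[phi(w.x) phi(w.x')]
# equals the closed-form arc-cosine kernel of Cho & Saul:
# ||x|| ||x'|| (sin t + (pi - t) cos t) / (2 pi), with t the angle between x, x'.

rng = np.random.default_rng(0)
d = 3
x, xp = rng.normal(size=d), rng.normal(size=d)

# Monte Carlo estimate of the expectation
W = rng.normal(size=(2_000_000, d))
mc = np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ xp, 0.0))

# Closed form
nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
t = np.arccos(np.clip(x @ xp / (nx * nxp), -1.0, 1.0))
closed = nx * nxp * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi)
```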

3.3. LIMIT IN THE DEPTH

The depth of a neural network exhibits interesting behaviors. Deep learning tends to favor deep networks because of their empirically outstanding performance, exemplified by generations of convolutional networks for ImageNet classification (Krizhevsky et al., 2012; Wortsman et al., 2022). Graph neural networks, in contrast, are typically kept shallow because of the over-smoothing and over-squashing phenomena (Li et al., 2018; Topping et al., 2022). For multi-layer perceptrons (networks with fully connected layers), several previous works have noted that the recurrence relation of the covariance kernel across layers leads to convergence to a fixed-point kernel when the depth L → ∞ (see, e.g., Lee et al. (2018); in Appendix B.5, we elaborate on this limit). In what follows, we offer the parallel analysis for GCN.

Theorem 4. Assume A is symmetric, irreducible, aperiodic, and non-negative with Perron-Frobenius eigenvalue λ > 0. The following results hold as l → ∞.
1. When σ_b² = 0, ρ_min(K^(l)) ↑ 1, where ρ_min denotes the minimum correlation between any two nodes x and x'.
2. When σ_w² < 2/λ², a subsequence of K^(l) converges to some matrix.
3. When σ_w² > 2/λ², let c_l = (σ_w² λ² / 2)^l; then K^(l) / c_l → vv^T, where v is an eigenvector corresponding to λ.

A few remarks follow. The first case implies that the correlation matrix converges monotonically to a matrix of all ones. As a consequence, up to some scaling c_l that may depend on l, the scaled covariance matrix K^(l) / c_l converges to a rank-1 matrix. The third case shares a similar result, with the limit explicitly spelled out, but note that the eigenvector v may not be normalized. The second case is challenging to analyze. Based on empirical verification, we speculate that a stronger result, the convergence of K^(l) to a unique fixed point, may hold.

4. SCALABLE COMPUTATION THROUGH LOW-RANK APPROXIMATION

The computation of the covariance matrix K^(L) through the recursion (3)-(4) is the main computational bottleneck for GP posterior inference. We start the exposition with the mean prediction. We compute the posterior mean y_* = K^(L)_{*b} (K^(L)_{bb} + I)^{-1} y_b, where the subscripts b and * denote the labeled (training) and test nodes, respectively. The task requires only the (N_b + N_*) × N_b submatrix of K^(L), but the recursion (4) requires the full C^(L-1) in the presence of A, and hence all the full C^(l)'s and K^(l)'s.

To reduce the computational costs, we resort to a low-rank approximation of C^(l), from which we easily see that K^(l+1) is also low-rank. Before deriving the approximation recursion, we note (again) that for the ReLU activation φ, C^(l) in (3) is the arc-cosine kernel with a closed-form expression:

C^(l)_{xx'} = (1 / 2π) √(K^(l)_{xx} K^(l)_{x'x'}) [ sin θ^(l)_{xx'} + (π - θ^(l)_{xx'}) cos θ^(l)_{xx'} ],   where   θ^(l)_{xx'} = arccos( K^(l)_{xx'} / √(K^(l)_{xx} K^(l)_{x'x'}) ).   (5)

Hence, the main idea is: starting with a low-rank approximation of K^(l), compute an approximation of C^(l) by using (5), then obtain an approximation of K^(l+1) based on (4); then, repeat.

To derive the approximation, we use the subscript a to denote a set of landmark nodes with cardinality N_a. The Nyström approximation (Drineas & Mahoney, 2005) of K^(0) is K^(0)_{:a} (K^(0)_{aa})^{-1} K^(0)_{a:}, where the subscript : denotes retaining all rows/columns. We rewrite this approximation in the Cholesky style as Q^(0) Q^(0)T, where Q^(0) = K^(0)_{:a} (K^(0)_{aa})^{-1/2} has size N × N_a.

We proceed with induction. Let K^(l) be approximated by K̃^(l) = Q^(l) Q^(l)T, where Q^(l) has size N × (N_a + 1). We apply (5) to compute an approximation C̃^(l)_{:a} to C^(l)_{:a} by using K̃^(l)_{:a}. Then, (4) leads to a Cholesky-style approximation of K^(l+1):

K̃^(l+1) = σ_b² 1_{N×N} + σ_w² A C̃^(l) A^T ≡ Q^(l+1) Q^(l+1)T,   where   Q^(l+1) = [ σ_w A C̃^(l)_{:a} (C̃^(l)_{aa})^{-1/2}   σ_b 1_{N×1} ].

Clearly, Q^(l+1) has size N × (N_a + 1), completing the induction. In summary, K^(L) is approximated by a rank-(N_a + 1) matrix K̃^(L) = Q^(L) Q^(L)T. The computation of Q^(L) is summarized in Algorithm 1. Once it is formed, the posterior mean is computed as

y_* ≈ K̃^(L)_{*b} (K̃^(L)_{bb} + I)^{-1} y_b = Q^(L)_{*:} ( Q^(L)T_{b:} Q^(L)_{b:} + I )^{-1} Q^(L)T_{b:} y_b,   (6)

where note that the matrix to be inverted has size (N_a + 1) × (N_a + 1), which is assumed to be significantly smaller than N_b × N_b. Similarly, the posterior variance is

K̃^(L)_{**} - K̃^(L)_{*b} (K̃^(L)_{bb} + I)^{-1} K̃^(L)_{b*} = Q^(L)_{*:} ( Q^(L)T_{b:} Q^(L)_{b:} + I )^{-1} Q^(L)T_{*:}.   (7)

The computational costs of Q^(L) and the posterior inference (6)-(7) are summarized in Table 1.

Table 1: Computational costs.
  Computation of Q^(L):   computation O(L M N_a + L N N_a² + L N_a³),   storage O(N N_a)
  Posterior mean (6):     computation O(N_* N_a + N_b N_a² + N_a³),     storage O((N_b + N_*) N_a)
  Posterior variance (7): computation O(N_* N_a + N_b N_a² + N_a³),     storage O((N_b + N_*) N_a)

Algorithm 1 Computing K^(L) ≈ K̃^(L) = Q^(L) Q^(L)T
Require: Q^(0) such that K^(0) ≈ Q^(0) Q^(0)T
1: for l = 0, ..., L - 1 do
2:   Compute K̃^(l)_{:a} = Q^(l) Q^(l)T_{a:}
3:   Compute C̃^(l)_{:a} by (5), where the C^(l) (resp. K^(l)) entries are replaced by C̃^(l) (resp. K̃^(l)) entries
4:   Compute Q^(l+1) = [ σ_w A C̃^(l)_{:a} (C̃^(l)_{aa})^{-1/2}   σ_b 1_{N×1} ]
5: end for

Table 2: Building blocks of a GNN and the corresponding kernel and low-rank operations.
  Input:                X ← X^(0);         K ← C^(0);                  Q ← Chol(C^(0))
  Bias term:            X ← X + b;         K ← K + σ_b² 1_{N×N};       Q ← [Q  σ_b 1_{N×1}]
  Weight term:          X ← XW;            K ← σ_w² K;                 Q ← σ_w Q
  Mixed weight term:    X ← X(αI + βW);    K ← (α² + β²σ_w²) K;        Q ← √(α² + β²σ_w²) Q
  Graph convolution:    X ← AX;            K ← AKA^T;                  Q ← AQ
  Activation:           X ← φ(X);          K ← g(K);                   Q ← Chol(g(QQ^T))
  Independent addition: X ← X₁ + X₂;       K ← K₁ + K₂;                Q ← [Q₁  Q₂]
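The low-rank recursion above can be sketched as follows. This is our own illustrative NumPy implementation, not the paper's code; the names and the jitter safeguard are ours, and for simplicity the arc-cosine formula is applied at every layer, including l = 0.

```python
import numpy as np

# Sketch (ours) of Algorithm 1: propagate a Nystrom-style factor Q^(l)
# through the layers so that K^(l) is approximated by Q^(l) Q^(l)^T.

def relu_cross(K_na, K_diag_n, K_diag_a):
    """Arc-cosine formula applied to the N x N_a block C_{:a}."""
    s = np.sqrt(np.outer(K_diag_n, K_diag_a))
    theta = np.arccos(np.clip(K_na / s, -1.0, 1.0))
    return s * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def inv_sqrt_psd(M, jitter=1e-8):
    """Symmetric inverse square root of a PSD matrix, with a small jitter."""
    w, V = np.linalg.eigh(M + jitter * np.eye(M.shape[0]))
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, jitter))) @ V.T

def low_rank_kernel(A, Q0, landmarks, L, sigma_b=0.3, sigma_w=1.0):
    """Return Q^(L) with K^(L) approximated by Q^(L) Q^(L)^T."""
    Q = Q0
    for _ in range(L):
        K_na = Q @ Q[landmarks].T          # block of \tilde K^(l) against landmarks
        K_diag = np.sum(Q * Q, axis=1)     # diagonal of \tilde K^(l)
        C_na = relu_cross(K_na, K_diag, K_diag[landmarks])
        Q = np.hstack([sigma_w * A @ C_na @ inv_sqrt_psd(C_na[landmarks]),
                       sigma_b * np.ones((A.shape[0], 1))])
    return Q

# Tiny usage: 5-node cycle graph, 2 landmark nodes
rng = np.random.default_rng(0)
adj = np.roll(np.eye(5), 1, axis=0) + np.roll(np.eye(5), -1, axis=0)
a = adj + np.eye(5)
deg = a.sum(axis=1)
A = a / np.sqrt(np.outer(deg, deg))
X0 = rng.normal(size=(5, 4))
K0 = X0 @ X0.T / 4
landmarks = np.array([0, 2])
Q0 = K0[:, landmarks] @ inv_sqrt_psd(K0[np.ix_(landmarks, landmarks)])
Q = low_rank_kernel(A, Q0, landmarks, L=2)   # shape (5, 3): rank N_a + 1
```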

5. COMPOSING GRAPH NEURAL NETWORK-INSPIRED KERNELS

Theorem 1, together with its proof, suggests that the covariance matrix of the limiting GP can be computed in a composable manner. Moreover, the derivation of Algorithm 1 indicates that the low-rank approximation of the covariance matrix can be similarly composed. Altogether, this property allows one to easily derive the corresponding covariance matrix and its approximation for a new GNN architecture, like writing a program and obtaining a transformation of it automatically through operator overloading (Novak et al., 2020): the covariance matrix is a transformation of the GNN, and the composition of the former proceeds in exactly the same manner and order as that of the latter. We call such covariance matrices programmable.

For example, we write a GCN layer as X ← Aφ(X)W + b, where for notational simplicity, X denotes the pre-activation rather than the post-activation as in the earlier sections. The activation φ on X results in a transformation of the kernel matrix K into g(K), defined as

g(K) := C = E_{z~N(0,K)}[ φ(z) φ(z)^T ],   (8)

due to (3). Moreover, if K admits a low-rank approximation QQ^T, then g(K) admits a low-rank approximation PP^T, where P = Chol(g(K)) with Chol(C) := C_{:a} C_{aa}^{-1/2}. The next operation, graph convolution, multiplies A to the left of the post-activation; correspondingly, the covariance matrix K is transformed to AKA^T and the low-rank factor Q is transformed to AQ. Then, multiplying the weight matrix W on the right transforms K to σ_w² K and Q to σ_w Q. Finally, adding the bias b transforms K to K + σ_b² 1_{N×N} and Q to [Q  σ_b 1_{N×1}]. Altogether, we have obtained the following updates per layer:

GCN:  X ← Aφ(X)W + b
      K ← σ_w² A g(K) A^T + σ_b² 1_{N×N}
      Q ← [ σ_w A Chol(g(QQ^T))   σ_b 1_{N×1} ].

One may verify the K update against (3)-(4) and the Q update against Algorithm 1. Both updates can be automatically derived from the update of X.
We summarize the building blocks of a GNN and the corresponding kernel/low-rank operations in Table 2. The independent-addition building block is applicable to skip/residual connections. For example, here is the composition for the GCNII layer (Chen et al., 2020) without a bias term, where a skip connection with X^(0) occurs:

GCNII:  X ← ( (1-α) A φ(X) + α X^(0) ) ( (1-β) I + β W )
        K ← ( (1-α)² A g(K) A^T + α² K^(0) ) ( (1-β)² + β² σ_w² )
        Q ← [ (1-α) A Chol(g(QQ^T))   α Q^(0) ] √( (1-β)² + β² σ_w² ).

For another example of the composability, we consider the popular GIN layer (Xu et al., 2019), which we assume uses a 2-layer MLP after the neighborhood aggregation:

GIN:  X ← φ( A φ(X) W + b ) W + b
      K ← σ_w² g(B) + σ_b² 1_{N×N},   where B = σ_w² A g(K) A^T + σ_b² 1_{N×N}
      Q ← [ σ_w Chol(g(PP^T))   σ_b 1_{N×1} ],   where P = [ σ_w A Chol(g(QQ^T))   σ_b 1_{N×1} ].

Additionally, the updates for a GraphSAGE layer (Hamilton et al., 2017) are given in Appendix C.
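To make the composition concrete, here is a small NumPy sketch (ours; the function names are illustrative) of the K-updates in Table 2, composed into per-layer kernel updates for GCN and GCNII. The Q-updates compose analogously.

```python
import numpy as np

# Sketch (ours) of the programmable K-updates in Table 2, composed into
# per-layer kernel updates for GCN and GCNII.

def g(K):
    """Activation map K -> E_{z~N(0,K)}[phi(z) phi(z)^T] for ReLU (arc-cosine)."""
    s = np.sqrt(np.diag(K))
    theta = np.arccos(np.clip(K / np.outer(s, s), -1.0, 1.0))
    return np.outer(s, s) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def bias(K, sigma_b):         return K + sigma_b ** 2   # K + sigma_b^2 1 (broadcast)
def weight(K, sigma_w):       return sigma_w ** 2 * K   # X <- X W
def graph_conv(K, A):         return A @ K @ A.T        # X <- A X
def independent_add(K1, K2):  return K1 + K2            # X <- X1 + X2

def gcn_layer_kernel(K, A, sigma_w, sigma_b):
    """K-update for X <- A phi(X) W + b, composed in the same order."""
    return bias(weight(graph_conv(g(K), A), sigma_w), sigma_b)

def gcnii_layer_kernel(K, K0, A, alpha, beta, sigma_w):
    """K-update for X <- ((1-a) A phi(X) + a X0)((1-b) I + b W)."""
    inner = independent_add((1 - alpha) ** 2 * graph_conv(g(K), A),
                            alpha ** 2 * K0)
    return ((1 - beta) ** 2 + beta ** 2 * sigma_w ** 2) * inner  # mixed weight term

# Tiny usage on a 3-node path graph
adj = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
a = adj + np.eye(3)
deg = a.sum(axis=1)
A = a / np.sqrt(np.outer(deg, deg))
K0 = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.4], [0.2, 0.4, 1.0]])
K_gcn = gcn_layer_kernel(K0, A, sigma_w=1.0, sigma_b=0.3)
K_gcnii = gcnii_layer_kernel(K0, K0, A, alpha=0.1, beta=0.5, sigma_w=1.0)
```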

6. EXPERIMENTS

In this section, we conduct a comprehensive set of experiments to evaluate the performance of the GP kernels derived by taking limits on the layer width of GCN and other GNNs. We demonstrate that these GPs are comparable with GNNs in prediction performance, while being significantly faster to compute. We also show that the low-rank version scales favorably, suitable for practical use.

Datasets.

The experiments are conducted on several benchmark datasets of varying sizes, covering both classification and regression. They include predicting the topic of scientific papers organized in a citation network (Cora, Citeseer, PubMed, and ArXiv); predicting the community of online posts based on user comments (Reddit); and predicting the average daily traffic of Wikipedia pages using hyperlinks among them (Chameleon, Squirrel, and Crocodile). Details of the datasets (including sources and preprocessing) are given in Appendix D, as are the experiment environment, training details, and hyperparameters.

Prediction Performance: GCN-based comparison. We first conduct the semi-supervised learning tasks on all datasets by using GCN and GPs with different kernels. These kernels include the one equivalent to the limiting GCN (GCNGP), a usual squared-exponential kernel (RBF), and the GGP kernel proposed by Ng et al. (2018). Each of these kernels has a low-rank version (suffixed with -X); RBF-X and GGP-X use the Nyström approximation, consistent with GCNGP-X. GPs are by nature suitable for regression. For classification tasks, we use the one-hot representation of labels to set up a multi-output regression and then take the coordinate with the largest output as the class prediction. Such an ad hoc treatment is widely used in the literature, as other more principled approaches (such as using the Laplace approximation on the non-Gaussian posterior) are too time-consuming for large datasets while producing no noticeable gain in accuracy.

Table 3 summarizes the accuracy for classification and the coefficient of determination, R², for regression. Whenever randomness is involved, the performance is reported as an average over five runs. The results of the two tasks show different patterns.
For classification, GCNGP(-X) is slightly better than GCN and GGP(-X), while RBF(-X) is significantly worse than all the others; moreover, the low-rank version is outperformed by the full kernel matrix. On the other hand, for regression, GCNGP(-X) significantly outperforms GCN, RBF(-X), and GGP(-X), and the low-rank version becomes better. The less competitive performance of RBF(-X) is expected, as it does not leverage the graph inductive bias. It is attractive that GCNGP(-X) is competitive with GCN.

Prediction Performance: Comparison with other GNNs. In addition to GCN, we conduct experiments with several popularly used GNN architectures (GCNII, GIN, and GraphSAGE) and GPs with the corresponding kernels. We test with the three largest datasets: PubMed, ArXiv, and Reddit; for the latter two, a low-rank version of the GPs is used for computational feasibility. Table 4 summarizes the results. The observations on the other GNNs extend those on GCN. In particular, on PubMed the GPs noticeably improve over the corresponding GNNs, while on ArXiv and Reddit the two families perform rather similarly. An exception is GIN on ArXiv, which significantly underperforms its GP counterpart, as well as the other GNNs; it may improve with extensive hyperparameter tuning.

Running time. We compare the running time of the methods covered by Table 3. Different from usual neural networks, the training and inference of GNNs do not decouple in full-batch training. Moreover, there is no universally agreed split between the training and inference steps in GPs. Hence, we compare the total time for each method. Figure 1 plots the timing results, normalized against the GCN time for ease of comparison. It suggests that GCNGP(-X) is generally faster than GCN. Note that the vertical axis is in logarithmic scale; hence, for some of the datasets, the speedup is as much as one to two orders of magnitude.
Scalability. For graphs, especially under the semi-supervised learning setting, the computational cost of a GP is much more complex than that of a usual GP (which can be simply described as "cubic in the training set size"). One sees in Table 1 the many factors that determine the cost of our graph-based low-rank kernels. To explore the practicality of the proposed method, we use the timings gathered for Figure 1 to obtain an empirical scaling with respect to the graph size, M + N. Figure 2 fits the running times, plotted in log-log scale, with a straight line. Neither GCN nor GCNGP(-X) has an actual running time that closely follows a polynomial complexity; interestingly, however, the least-squares fittings all lead to a slope of approximately 1, which agrees with a linear cost. Theoretically, only GCNGP-X and GCN are approximately linear with respect to M + N, while GCNGP is cubic.

Analysis on the depth. The performance of GCN deteriorates with more layers, a phenomenon known as over-smoothing. Adding residual/skip connections mitigates the problem, as in GCNII. A natural question is whether the corresponding GP kernels behave similarly. Figure 3 shows that the trends of GCN and GCNII are indeed as expected. Interestingly, their GP counterparts both remain stable for depth L as large as 12. Our depth analysis (Theorem 4) suggests that in the limit, the GPs may perform less well because the kernel matrix may degenerate to rank 1. This empirical result indicates that the drop in performance has not yet started.

Analysis on the landmark set. The number of landmark nodes, N_a, controls the approximation quality of the low-rank kernels and hence the prediction accuracy. On the other hand, the computational costs summarized in Table 1 indicate a dependency on N_a as high as the third power. It is thus important to develop an empirical understanding of the accuracy-time trade-off. Figure 4 clearly shows that as N_a becomes larger, the running-time increase is not linear, while the gain in accuracy diminishes as the landmark set approaches the training set. It is remarkable that using only 1/800 of the training set as landmark nodes already achieves an accuracy surpassing that of GCN, in a tiny fraction of the time otherwise needed to gain an additional 1% of accuracy.

7. CONCLUSIONS

We have presented a GP approach for semi-supervised learning on graph-structured data, where the covariance kernel incorporates the graph inductive bias by exploiting the relationship between a GP and a GNN with infinitely wide layers. Similar to other neural networks investigated previously, one can work out the equivalent GP (in particular, the covariance kernel) for GCN; inspired by this equivalence, we formulate a procedure to compose covariance kernels corresponding to many other members of the GNN family. Moreover, every building block in the procedure has a low-rank counterpart, which allows for building a low-rank approximation of the covariance matrix that facilitates scalable posterior inference. We demonstrate the effectiveness of the derived kernels for semi-supervised learning and show their advantages in computation time over GNNs.

A CODE

Code is available at https://github.com/niuzehao/gnn-gp.

B PROOFS AND ADDITIONAL THEOREMS

B.1 PROOF OF THEOREM 1

We use mathematical induction. First, write the matrix form of (2) as follows:

Y^(l) = X^(l-1) W^(l),   Z^(l) = 1_{N×1} b^(l) + A Y^(l),   X^(l) = φ(Z^(l)),

where Y^(l), Z^(l), X^(l) ∈ R^{N×d_l}, X^(l-1) ∈ R^{N×d_{l-1}}, W^(l) ∈ R^{d_{l-1}×d_l}, and b^(l) ∈ R^{1×d_l}.

Assume each column of Z^(l-1) is iid; then X^(l-1) also has iid columns. Hence, y_i^(l)(x) is a sum of iid terms and thus normal. Also, (y_i^(l)(x_1), ..., y_i^(l)(x_N)) is jointly multivariate normal and identically distributed for different i; so these columns form a GP. Moreover, terms like y_i^(l)(x) and y_{i'}^(l)(x') with i ≠ i' are jointly Gaussian with zero covariance:

Cov( y_i^(l)(x), y_{i'}^(l)(x') ) = Σ_{j=1}^{d_{l-1}} Σ_{j'=1}^{d_{l-1}} Cov( W_{ji}^(l) x_j^(l-1)(x), W_{j'i'}^(l) x_{j'}^(l-1)(x') ) = 0.

Thus, they are independent, despite the fact that they may share the same X^(l-1) terms. In conclusion, each column of Y^(l) is an iid GP. Hence, each column of Z^(l) is also an iid GP N(0, K^(l)).

The covariance K^(l) can be computed recursively. Similar to the case of a fully connected network, the covariance of Y^(l) is

E[ y_i^(l)(x) y_i^(l)(x') ] = σ_w² E[ x_j^(l-1)(x) x_j^(l-1)(x') ] = σ_w² C^(l-1)(x, x'),

where C^(l-1)(x, x') = E_{z~N(0,K^(l-1))}[ φ(z_j(x)) φ(z_j(x')) ] for any j. In matrix form, Cov(Y_j^(l)) = σ_w² E[ X_j^(l-1) X_j^(l-1)T ] = σ_w² C^(l-1). Then, because Z_i^(l) = A Y_i^(l) + 1_{N×1} b_i^(l), we obtain

K^(l) = σ_b² 1_{N×N} + A Cov(Y_i^(l)) A^T = σ_b² 1_{N×N} + σ_w² A C^(l-1) A^T,

which concludes the proof.
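As a numerical illustration of this proof (our own check, not part of the paper), one can sample random finite-width GCNs and compare the empirical covariance of the output with K^(2) from the recursion. The agreement is close even at moderate width, because z^(1) is exactly Gaussian at any width and each hidden unit contributes an iid term.

```python
import numpy as np

# Empirical check (ours) of Theorem 1: the output covariance of random
# two-layer GCNs matches K^(2) computed from the recursion (3)-(4).
rng = np.random.default_rng(0)
N, d0, width, n_samples = 3, 4, 500, 4000
sigma_b2, sigma_w2 = 0.1, 1.0

adj = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # 3-node path
a = adj + np.eye(N)
deg = a.sum(axis=1)
A = a / np.sqrt(np.outer(deg, deg))  # symmetric normalization with self-loops
X0 = rng.normal(size=(N, d0))

# Analytic K^(2), with C^(0) = X0 X0^T / d0 and the arc-cosine closed form.
C0 = X0 @ X0.T / d0
K1 = sigma_b2 + sigma_w2 * A @ C0 @ A.T
s = np.sqrt(np.diag(K1))
th = np.arccos(np.clip(K1 / np.outer(s, s), -1, 1))
C1 = np.outer(s, s) * (np.sin(th) + (np.pi - th) * np.cos(th)) / (2 * np.pi)
K2 = sigma_b2 + sigma_w2 * A @ C1 @ A.T

# Monte Carlo: sample random GCNs and estimate the output covariance.
samples = []
for _ in range(n_samples):
    W1 = rng.normal(0, np.sqrt(sigma_w2 / d0), size=(d0, width))
    b1 = rng.normal(0, np.sqrt(sigma_b2), size=(1, width))
    W2 = rng.normal(0, np.sqrt(sigma_w2 / width), size=(width, 1))
    b2 = rng.normal(0, np.sqrt(sigma_b2), size=(1, 1))
    X1 = np.maximum(A @ X0 @ W1 + b1, 0.0)      # hidden layer x^(1)
    samples.append((A @ X1 @ W2 + b2).ravel())  # output z^(2)
emp = np.cov(np.array(samples).T, bias=True)    # empirical covariance of z^(2)
```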

B.2 PROOF OF LEMMA 2

We know that φ(w • x) is the feature map of the arc-cosine kernel. It suffices to show that, for distinct x_1, ..., x_m ∈ S, the functions φ(w • x_i) are linearly independent. We prove this by contradiction. Assume there exist c_1, ..., c_m, not simultaneously zero, such that

Σ_{i=1}^m c_i φ(w • x_i) ≡ 0.   (9)

WLOG, we further assume that m is the smallest integer satisfying (9). Let e_1 = (1, 0, ..., 0); then x • e_1 > 0 for all x ∈ S. Assume x_m • e_1 = min_{1≤i≤m} { x_i • e_1 } > 0. Then, for 1 ≤ i ≤ m - 1,

(x_i • x_m) / (x_i • e_1) < 1 / (x_m • e_1) = (x_m • x_m) / (x_m • e_1).

Therefore, there exists t such that

(x_i • x_m) / (x_i • e_1) < t < (x_m • x_m) / (x_m • e_1),   1 ≤ i ≤ m - 1.

Let w = x_m - t e_1; then w • x_m > 0 and w • x_i < 0 for 1 ≤ i ≤ m - 1. Using (9), we know c_m (w • x_m) = 0; so c_m = 0. Thus, c_1, ..., c_{m-1} and φ(w • x_1), ..., φ(w • x_{m-1}) also satisfy (9) with a smaller number m, which forms a contradiction.

Published as a conference paper at ICLR 2023

B.3 PROOF OF THEOREM 3

From the recursive relation (3)-(4), we know C^(0) is element-wise non-negative. Hence, K^(1) is element-wise non-negative, C^(1) is element-wise positive, and K^(2) is element-wise positive.

Assume the Cholesky decomposition K^(2) = LL^T and denote L = (l_1, ..., l_N)^T. Let r_i = ‖l_i‖₂ > 0 and x_i = l_i / r_i. Then x_1 = ±e_1, and WLOG we assume x_1 = e_1. For each x_i, we know

x_i • e_1 = (l_i • l_1) / (r_i r_1) = K^(2)(i, 1) / (r_i r_1) > 0,

and each ‖x_i‖₂ = 1. Therefore, x_i ∈ S, with S defined in Lemma 2. By the assumption that C^(0) does not contain two linearly dependent rows, neither do K^(1) and K^(2). Thus, no two l_i's are linearly dependent, and hence the x_i's are pairwise distinct. Using the universality of the arc-cosine kernel (Lemma 2), we know C^(2) = g(K^(2)) is positive definite. Hence, K^(3) is positive definite. An induction argument on the layer index completes the proof.

B.4 PROOF OF THEOREM 4

For a covariance matrix $K$, denote its standard deviation and correlation by
$$\sigma_K(x) = \sqrt{K(x, x)}, \qquad \rho_K(x, x') = \frac{K(x, x')}{\sqrt{K(x, x)\, K(x', x')}}.$$
From the closed-form formula (5) for the arc-cosine kernel, we have
$$\rho_{C^{(l)}}(x, x') = \frac{1}{\pi}\Big(\sin\theta^{(l)}_{x,x'} + (\pi - \theta^{(l)}_{x,x'})\cos\theta^{(l)}_{x,x'}\Big), \qquad \rho_{K^{(l)}}(x, x') = \cos\theta^{(l)}_{x,x'}.$$
We define the correlation mapping from $\rho_{K^{(l)}}$ to $\rho_{C^{(l)}}$:
$$f : \cos\theta \mapsto \frac{1}{\pi}\big(\sin\theta + (\pi - \theta)\cos\theta\big), \qquad \theta \in [0, \pi].$$
We first establish a few properties of $f$. A pictorial illustration of $f$ is given in Figure 5.

Lemma 5. The following facts hold.
1. $f$ is increasing and it is a contraction mapping on $[-1, 1]$.
2. $f(\rho) \ge \rho$ and the equality holds only when $\rho = 1$.

Proof. For part 1, denote $\rho = \cos\theta \in [-1, 1]$. Then,
$$f'(\rho) = \frac{\partial f(\rho)}{\partial \theta} \frac{\partial \theta}{\partial \rho} = -\frac{1}{\pi}(\pi - \theta)\sin\theta \cdot \frac{-1}{\sin\theta} = \frac{\pi - \theta}{\pi} \in [0, 1].$$
Thus, $f$ is increasing and it is a contraction mapping on $[-1, 1]$. For part 2, we know that $f(1) = 1$. Using the contraction property, $1 - f(\rho) = |f(1) - f(\rho)| \le |1 - \rho| = 1 - \rho$, so $f(\rho) \ge \rho$ for $\rho \in [-1, 1]$. Clearly, the equality holds only when $\rho = 1$.

After the correlation mapping $f$, we further consider the covariance mapping $g : K^{(l)} \mapsto C^{(l)}$ defined in (3) (and equivalently in (8)). We establish a few properties of $g$ below.

Lemma 6. For positive semi-definite matrices $B$, the following facts hold.
1. $g(B) = \frac{1}{2} D^{1/2} f(D^{-1/2} B D^{-1/2}) D^{1/2}$, where $D = \operatorname{diag}\{B\} \in \mathbb{R}^{N\times N}$ and $f$ is applied element-wise.

2. $g(B) \succeq \frac{1}{2} B$ and $\operatorname{diag}\{g(B)\} = \frac{1}{2}\operatorname{diag}\{B\}$, where $\succeq$ means element-wise greater than or equal.
3. $\operatorname{tr}\big(A\, g(B)\, A^T\big) \le \frac{1}{2}\|A\|_2^2 \operatorname{tr}(B)$, where $\|A\|_2$ is the spectral norm of matrix $A$.

Proof. Part 1 is immediate from (5). For part 2, using the properties of $f$ we straightforwardly obtain
$$g(B) = \frac{1}{2} D^{1/2} f(D^{-1/2} B D^{-1/2}) D^{1/2} \succeq \frac{1}{2} D^{1/2} (D^{-1/2} B D^{-1/2}) D^{1/2} = \frac{1}{2} B.$$
For part 3, note that $g(B)$ is positive semi-definite. Using von Neumann's trace inequality,
$$\operatorname{tr}\big(A\, g(B)\, A^T\big) = \operatorname{tr}\big(A^T A\, g(B)\big) \le \sum_{i=1}^{N} \lambda_i(A^T A)\, \lambda_i(g(B)) \le \|A^T A\|_2 \sum_{i=1}^{N} \lambda_i(g(B)),$$
where the last term is equal to $\|A\|_2^2 \operatorname{tr}(g(B)) = \frac{1}{2}\|A\|_2^2 \operatorname{tr}(B)$.

Now, we are ready to prove Theorem 4.

Proof of Theorem 4, part 1. We focus on uniform convergence over all pairwise correlations. Denote $\rho_{\min}(K) = \min_{x, x'} \rho_K(x, x') \in [-1, 1]$. To simplify notation, we temporarily let $C = C^{(l)}$. Then, because the entries of $A$ are non-negative,
$$K^{(l+1)}(x, x') = \sigma_w^2 \sum_{v, v'} A_{xv} A_{x'v'}\, C(v, v') = \sigma_w^2 \sum_{v, v'} A_{xv} A_{x'v'}\, \sigma_C(v)\sigma_C(v')\rho_C(v, v')$$
$$\ge \rho_{\min}(C) \cdot \sigma_w^2 \sum_{v, v'} A_{xv} A_{x'v'}\, \sigma_C(v)\sigma_C(v') = \rho_{\min}(C) \cdot \sigma_w^2 \Big(\sum_{v} A_{xv}\sigma_C(v)\Big)\Big(\sum_{v'} A_{x'v'}\sigma_C(v')\Big).$$
Meanwhile,
$$K^{(l+1)}(x, x) = \sigma_w^2 \sum_{v, v'} A_{xv} A_{xv'}\, C(v, v') \le \sigma_w^2 \sum_{v, v'} A_{xv} A_{xv'}\, \sigma_C(v)\sigma_C(v') = \sigma_w^2 \Big(\sum_{v} A_{xv}\sigma_C(v)\Big)^2,$$
and similarly $K^{(l+1)}(x', x') \le \sigma_w^2 \big(\sum_{v'} A_{x'v'}\sigma_C(v')\big)^2$. Combining the above inequalities, we obtain
$$\rho_{K^{(l+1)}}(x, x') = \frac{K^{(l+1)}(x, x')}{\sqrt{K^{(l+1)}(x, x)\, K^{(l+1)}(x', x')}} \ge \rho_{\min}(C^{(l)}), \qquad \forall\, x, x',$$
and thus $\rho_{\min}(K^{(l+1)}) \ge \rho_{\min}(C^{(l)})$. From the correlation mapping, we know $\rho_{\min}(C^{(l)}) = f(\rho_{\min}(K^{(l)}))$. Therefore, by the properties of $f$,
$$\rho_{\min}(K^{(l+1)}) \ge f\big(\rho_{\min}(K^{(l)})\big) \ge \rho_{\min}(K^{(l)}).$$
Thus, $\rho_{\min}(K^{(l)})$ increases monotonically to some $\rho_\infty \in [-1, 1]$. Because $\rho_\infty = f(\rho_\infty)$, we conclude that $\rho_\infty = 1$.

Proof of Theorem 4, part 2. Consider the map $h : K^{(l)} \mapsto K^{(l+1)}$:
$$K^{(l+1)} = \sigma_b^2 \mathbf{1}_{N\times N} + \sigma_w^2\, A C^{(l)} A^T = \sigma_b^2 \mathbf{1}_{N\times N} + \sigma_w^2\, A\, g(K^{(l)})\, A^T =: h(K^{(l)}).$$
Let $\delta = \frac{1}{2}\sigma_w^2 \lambda^2 < 1$.
Using part 3 of Lemma 6,
$$\operatorname{tr} K^{(l+1)} = \operatorname{tr}\big(\sigma_b^2 \mathbf{1}_{N\times N}\big) + \sigma_w^2 \operatorname{tr}\big(A\, g(K^{(l)})\, A^T\big) \le N\sigma_b^2 + \tfrac{1}{2}\sigma_w^2\lambda^2 \operatorname{tr} K^{(l)} = N\sigma_b^2 + \delta \cdot \operatorname{tr} K^{(l)}.$$
Therefore, for sufficiently large $l$,
$$\operatorname{tr} K^{(l)} \le \frac{N\sigma_b^2}{1 - \delta} + 1.$$
Because each $K^{(l)}$ is positive semi-definite, we know $K^{(l)}$ is bounded. Thus, a subsequence of $K^{(l)}$ converges to some matrix.

Proof of Theorem 4, part 3. Define $\kappa^{(l)} := K^{(l)}/c^l$. We have the following recursive relationship between $\kappa^{(l)}$ and $\kappa^{(l+1)}$:
$$\kappa^{(l+1)} = \frac{1}{c^{l+1}}\Big(\sigma_b^2 \mathbf{1}_{N\times N} + \sigma_w^2\, A\, g(K^{(l)})\, A^T\Big) = \frac{\sigma_b^2}{c^{l+1}} \mathbf{1}_{N\times N} + \frac{2}{\lambda^2}\, A\, g(\kappa^{(l)})\, A^T.$$

Dataset preprocessing. For each graph, we treat the edges as undirected and construct a binary, symmetric adjacency matrix $A$. Then, we apply the normalization proposed by Kipf & Welling (2017) to define the matrix $A$ used in (1). Specifically, $A = (I_N + D)^{-1/2}(I_N + A)(I_N + D)^{-1/2}$, where $D = \operatorname{diag}\{\sum_j A_{ij}\}$. For GraphSAGE and GGP, we instead use the row-normalized version $A = (I_N + D)^{-1}(I_N + A)$. No feature preprocessing is done, except that for running the GPs on ArXiv we apply centering to the input features.

Environment. All experiments are conducted on an Nvidia Quadro GV100 GPU with 32GB of HBM2 memory. The code is written in Python 3.10.4 as distributed with Ubuntu 22.04 LTS. We use PyTorch 1.11.0 and PyTorch Geometric 2.1.0 with CUDA 11.3.

Training. For all datasets except Reddit, we train GCN using the Adam optimizer in full batch for 100 epochs. Reddit is too large for GPU computation, and hence we conduct mini-batch training (with a batch size of 10240) for 10 epochs.

Hyperparameters. For the classification tasks in Table 3, the hyperparameters are set to $\sigma_b = 0.0$, $\sigma_w = 1.0$, $L = 2$, hidden = 256, and dropout = 0.5. GCN is trained with learning rate 0.01. For the regression tasks, they are set to $\sigma_b = \sqrt{0.1}$, $\sigma_w = 1.0$, $L = 2$, hidden = 256, and dropout = 0.5.
GCN is trained with learning rate $\sqrt{0.1}$. We choose $C^{(0)}$ to be the inner-product kernel. The exception is the low-rank versions, for which we apply PCA to the input features to ensure that the number of features is smaller than the number of landmark nodes. For Table 4: GCNII uses a 2-layer architecture and sets $\alpha = 0.1$, $\lambda = 0.5$, and $\beta_l = \log(\lambda/l + 1)$. GIN uses $L = 2$ and $\epsilon = 0.0$. GraphSAGE uses a 2-layer architecture and sets $\sigma_{w1} = \sqrt{0.1}$ for Reddit, $\sigma_{w1} = 0.0$ for the other datasets, and $\sigma_{w2} = 1.0$. The nugget is chosen using a grid search over $[10^{-3}, 10^{1}]$. For GCNGP-X, GNNGP-X, and RBF-X, the landmark points are chosen to be the training set for the small datasets, while for the large datasets (ArXiv and Reddit) the landmark points are a random 1/50 of the training set.

Kernels. For the RBF kernel, $K(x, x') = \exp(-\gamma\|x - x'\|_2^2)$, where $\gamma$ is chosen using a grid search over $[10^{-2}, 10^{2}]$. For the GGP kernel, $K = A K_0 A^T$ with $K_0(x, x') = (x^T x' + c)^d$, where $c = 5.0$ and $d = 3.0$. Note that we use only the kernel, but not the robust-max likelihood nor the ELBO training proposed by Ng et al. (2018), because their code, written in Python 2, is outdated and cannot be used. Note also that GGP-X in Ng et al. (2018) has a different meaning (training GGP by using additionally the 500 nodes in the validation set), while we use GGP-X to denote the Nyström approximation of GGP.
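The two graph normalizations described in the dataset-preprocessing paragraph can be written compactly. Below is a small sketch of our own (dense numpy for clarity; the actual experiments would use sparse matrices; the helper names are ours):

```python
import numpy as np

def normalize_sym(A):
    """Kipf & Welling normalization: (I_N + D)^{-1/2} (I_N + A) (I_N + D)^{-1/2}."""
    deg = A.sum(axis=1)
    d = 1.0 / np.sqrt(1.0 + deg)
    return d[:, None] * (np.eye(len(A)) + A) * d[None, :]

def normalize_row(A):
    """Row-normalized version (I_N + D)^{-1} (I_N + A), used for GraphSAGE and GGP."""
    deg = A.sum(axis=1)
    return (np.eye(len(A)) + A) / (1.0 + deg)[:, None]
```

Each row of the row-normalized matrix sums to one, while the symmetric version stays symmetric (with spectrum in $(-1, 1]$) for a symmetric binary $A$.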



Here, we abuse the notation and use $x$ in place of $x^{(0)}(x)$ in the inner product. Note the conventional confusion in terminology between functions and matrices: a kernel function is positive definite (resp. strictly positive definite) if the corresponding kernel matrix is positive semi-definite (resp. positive definite) for any collection of distinct points. We apply only the kernel, but not the likelihood nor the variational inference used in Ng et al. (2018), for reasons given in Appendix D. GGP-X in our notation is the Nyström approximation of the GGP kernel, different from a technique under the same name in Ng et al. (2018), which uses additionally the validation set to compute the prediction loss.



$$\bar{y}_* = K_{*b}\,(K_{bb} + \epsilon I)^{-1}\, y_b,$$
where the subscripts $b$ and $*$ denote the training set and the prediction set, respectively; and $\epsilon$, called the nugget, is the noise variance of the training data. Let there be $N_b$ training nodes and $N_*$ prediction nodes. It is tempting to compute only the
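In practice, the posterior mean above is computed with a Cholesky factorization rather than by forming the inverse explicitly. A minimal sketch of our own (the helper name `gp_posterior_mean` is hypothetical):

```python
import numpy as np

def gp_posterior_mean(K_star_b, K_bb, y_b, nugget):
    """Compute K_{*b} (K_{bb} + nugget * I)^{-1} y_b without an explicit inverse."""
    L = np.linalg.cholesky(K_bb + nugget * np.eye(K_bb.shape[0]))
    # Two triangular solves replace the O(N_b^3) inverse-then-multiply
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_b))
    return K_star_b @ alpha
```

The nugget keeps the factorization well conditioned even when $K_{bb}$ is rank-deficient, which is exactly the situation with low-rank (landmark-based) kernel approximations.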

Figure 1: Timing comparison. For each dataset, the times are normalized against that of GCN.

Figure 2: Scaling of the running time with respect to the graph size, M + N .

Figure 3: Performance of GNNs and GNNGPs as depth L increases. Dataset: Pubmed.

Figure 4: Performance of GCNGP-X as number of landmark nodes increases. Dataset: Reddit.

Figure 5: Correlation mapping f.

Then,
$$\operatorname{tr} \kappa^{(l+1)} = \operatorname{tr}\Big(\frac{\sigma_b^2}{c^{l+1}} \mathbf{1}_{N\times N}\Big) + \frac{2}{\lambda^2} \operatorname{tr}\big(A\, g(\kappa^{(l)})\, A^T\big) \le \frac{N\sigma_b^2}{c^{l+1}} + \operatorname{tr} \kappa^{(l)},$$

Computational costs. $M$: number of edges; $N$: number of nodes; $N_b$: number of training nodes; $N_*$: number of prediction nodes; $N_a$: number of landmark nodes; $L$: number of layers. Assume $N_b \ge N_a$. For the posterior variance, assume only the diagonal is needed.

Neural network building blocks, kernel operations, and the low-rank counterpart.

Performance of GCNGP, in comparison with GCN and typical GP kernels. The Micro-F1 score is reported for classification tasks and $R^2$ is reported for regression tasks.

between GNNs and the corresponding GP kernels.

ACKNOWLEDGMENTS

Mihai Anitescu was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) under Contract DE-AC02-06CH11347. Jie Chen acknowledges support from the MIT-IBM Watson AI Lab.


where the inequality results from part 3 of Lemma 6. Therefore, $\kappa^{(l)}$ is bounded. Using the Perron-Frobenius theorem, we know the eigenspace $V$ corresponding to $\lambda$ is one-dimensional, and there exists $v \in V$ such that $\|v\|_2 = 1$ and $v \ge 0$ element-wise. It follows that $v^T \kappa^{(l)} v$ has a limit $c > 0$, and that $P \kappa^{(l)} P^T \to 0$, where $P$ denotes the projection onto the orthogonal complement of $V$. Combining the results, we have $\kappa^{(l)} \to c\, vv^T$.
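The properties of $f$ and $g$ used throughout the proof of Theorem 4 are easy to check numerically. The sketch below is our own; $f$ and $g$ are written out from the closed form (5), and the test matrix $B$ and tolerances are arbitrary choices:

```python
import numpy as np

# Correlation mapping f(cos t) = (sin t + (pi - t) cos t) / pi  (Lemma 5)
theta = np.linspace(0.0, np.pi, 2001)
rho = np.cos(theta)
f = (np.sin(theta) + (np.pi - theta) * rho) / np.pi
fprime = (np.pi - theta) / np.pi          # f'(rho) = (pi - theta) / pi in [0, 1]

# Covariance mapping g: element-wise arc-cosine expectation  (Lemma 6)
def g(B):
    d = np.sqrt(np.diag(B))
    outer = np.outer(d, d)
    cos_t = np.clip(B / outer, -1.0, 1.0)
    t = np.arccos(cos_t)
    return outer * (np.sin(t) + (np.pi - t) * cos_t) / (2 * np.pi)

rng = np.random.default_rng(1)
N = 5
M = rng.standard_normal((N, N))
B = M @ M.T + 0.5 * np.eye(N)             # a random positive definite B
C = g(B)
A = rng.standard_normal((N, N))
lhs = np.trace(A @ C @ A.T)               # part 3: trace bound
rhs = 0.5 * np.linalg.norm(A, 2) ** 2 * np.trace(B)
```

The checks confirm that $f$ is an increasing contraction with $f(\rho) \ge \rho$, that $g(B) \succeq B/2$ with equality on the diagonal, and that the trace bound of Lemma 6, part 3 holds.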

B.5 RESULTS FOR MULTI-LAYER PERCEPTRONS

Theorem 4 has parallel results for fully connected layers.

Theorem 7. Let $A$ be the identity matrix. The following results hold as $l \to \infty$.

Proof of Theorem 7, part 1. Starting with the diagonal elements, we obtain their limit $q$; the off-diagonal elements then satisfy $K^{(l)}(x, x') \to q$ as well. In conclusion, $K^{(l)} \to q \mathbf{1}_{N\times N}$.

Proof of Theorem 7, part 2. Again, we start with the diagonal elements. Define $\kappa^{(l)} := K^{(l)}/c^l$. When $\sigma_w^2 > 2$, the diagonal elements of $\kappa^{(l)}$ converge; for the off-diagonal elements, $\kappa^{(l)}(x, x')$ has a limit $c$, obtained by taking limits on both sides of the recursion.
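Part 1 of Theorem 7 can be illustrated numerically: with $A = I$ and $\sigma_w^2 < 2$, iterating the recursion drives $K^{(l)}$ to a constant matrix $q\mathbf{1}_{N\times N}$. The sketch below is our own illustration, including the fixed-point value $q = \sigma_b^2 / (1 - \sigma_w^2/2)$, which we derive from the diagonal recursion $q = \sigma_b^2 + \sigma_w^2 q/2$:

```python
import numpy as np

def g(K):
    """Element-wise arc-cosine expectation (ReLU closed form (5))."""
    d = np.sqrt(np.diag(K))
    outer = np.outer(d, d)
    cos_t = np.clip(K / outer, -1.0, 1.0)
    t = np.arccos(cos_t)
    return outer * (np.sin(t) + (np.pi - t) * cos_t) / (2 * np.pi)

sigma_b2, sigma_w2 = 0.5, 1.0             # sigma_w^2 < 2: the bounded regime
q = sigma_b2 / (1.0 - sigma_w2 / 2.0)     # predicted limit value (q = 1 here)

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 3))
K = X @ X.T + 0.1 * np.eye(4)             # arbitrary positive definite start
for _ in range(200):                       # A = I: fully connected layers
    K = sigma_b2 * np.ones((4, 4)) + sigma_w2 * g(K)
```

After a few hundred iterations all entries of `K` agree with `q` to high precision, matching the degenerate limit $K^{(l)} \to q\mathbf{1}_{N\times N}$.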

C COMPOSING GRAPHSAGE

As before, $X$ denotes pre-activation rather than post-activation. A GraphSAGE layer has two weight matrices, $W_1$ and $W_2$. Hence, we use two variance parameters, $\sigma_{w1}$ and $\sigma_{w2}$, for them, respectively. For simplicity, we focus on the mean-aggregation version of GraphSAGE, where $A$ is effectively the random-walk matrix. The updates apply $W_1$ to the node's own features and $W_2$ to the aggregated neighbor features. Note that we do not use a bias term. If one is desired, it can easily be added to the updates following the example of GCN.
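For concreteness, here is a sketch of the resulting kernel composition. It relies on our own reading of the mean-aggregation description above, namely that the layer update is $Z^{(l)} = X^{(l-1)} W_1^{(l)} + A X^{(l-1)} W_2^{(l)}$, which would give the recursion $K^{(l)} = \sigma_{w1}^2 C^{(l-1)} + \sigma_{w2}^2 A C^{(l-1)} A^T$; the helper names are hypothetical:

```python
import numpy as np

def arccos_expectation(K):
    """E[phi(z) phi(z')] for ReLU phi under z ~ N(0, K) (closed form (5))."""
    d = np.sqrt(np.diag(K))
    outer = np.outer(d, d)
    cos_t = np.clip(K / outer, -1.0, 1.0)
    t = np.arccos(cos_t)
    return outer * (np.sin(t) + (np.pi - t) * cos_t) / (2 * np.pi)

def sage_gp_kernel(A_rw, C0, L, s_w1_sq, s_w2_sq):
    """Assumed GraphSAGE-GP recursion (no bias term):
    K^(l) = s_w1^2 * C^(l-1) + s_w2^2 * A C^(l-1) A^T, with A the random-walk matrix."""
    C, K = C0, None
    for _ in range(L):
        K = s_w1_sq * C + s_w2_sq * A_rw @ C @ A_rw.T
        C = arccos_expectation(K)
    return K
```

The two variance parameters play the roles of $\sigma_{w1}^2$ and $\sigma_{w2}^2$ described above; setting `s_w1_sq = 0` recovers a purely aggregation-based layer.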

D EXPERIMENT DETAILS

Datasets. A summary is given in Table 5. The datasets Cora/Citeseer/PubMed/Reddit, with predefined training/validation/test splits, are downloaded from the PyTorch Geometric library (Fey & Lenssen, 2019) and used as is. The dataset ArXiv comes from the Open Graph Benchmark (Hu et al., 2020b). The datasets Chameleon/Squirrel/Crocodile come from MUSAE (Rozemberczki et al., 2021). The training/validation/test splits of the former two sets of datasets come from Geom-GCN (Pei et al., 2020), in accordance with the PyTorch Geometric library. The split for Crocodile is not available, so we conduct a random split with the same 0.48/0.32/0.20 proportion as that used for Chameleon and Squirrel.

