GRAPH NEURAL NETWORK-INSPIRED KERNELS FOR GAUSSIAN PROCESSES IN SEMI-SUPERVISED LEARNING

Abstract

Gaussian processes (GPs) are an attractive class of machine learning models because of their simplicity and flexibility as building blocks of more complex Bayesian models. Meanwhile, graph neural networks (GNNs) emerged recently as a promising class of models for graph-structured data in semi-supervised learning and beyond. Their competitive performance is often attributed to a proper capturing of the graph inductive bias. In this work, we introduce this inductive bias into GPs to improve their predictive performance for graph-structured data. We show that a prominent example of GNNs, the graph convolutional network, is equivalent to some GP when its layers are infinitely wide; and we analyze the kernel universality and the limiting behavior in depth. We further present a programmable procedure to compose covariance kernels inspired by this equivalence and derive example kernels corresponding to several interesting members of the GNN family. We also propose a computationally efficient approximation of the covariance matrix for scalable posterior inference with large-scale data. We demonstrate that these graph-based kernels lead to competitive classification and regression performance, as well as advantages in computation time, compared with the respective GNNs.

1. INTRODUCTION

Gaussian processes (GPs) (Rasmussen & Williams, 2006) are widely used in machine learning, uncertainty quantification, and global optimization. In the Bayesian setting, a GP serves as a prior probability distribution over functions, characterized by a mean (often taken to be zero for simplicity) and a covariance. Conditioned on observed data with a Gaussian likelihood, the random function admits a posterior distribution that is also Gaussian; its mean is used for prediction and its variance serves as an uncertainty measure. The closed-form posterior allows for exact Bayesian inference, which accounts for much of the appeal and wide usage of GPs. The success of GPs in practice depends on two factors: the observations (training data) and the covariance kernel.

We are interested in semi-supervised learning, where only a small amount of data is labeled while a large amount of unlabeled data can be used jointly for training (Zhu, 2008). In recent years, graph neural networks (GNNs) (Zhou et al., 2020; Wu et al., 2021) emerged as a promising class of models for this problem, when the labeled and unlabeled data are connected by a graph. The graph structure becomes an important inductive bias that leads to the success of GNNs. This inductive bias inspires us to design a GP model under limited observations by building the graph structure into the covariance kernel.

An intimate relationship between neural networks and GPs is known: a neural network with fully connected layers, equipped with a prior probability distribution on the weights and biases, converges to a GP when each of its layers is infinitely wide (Lee et al., 2018; de G. Matthews et al., 2018). This result follows from the central limit theorem (Neal, 1994; Williams, 1996), and the GP covariance can be computed recursively if the weights (and biases) in each layer are iid Gaussian. Similar results for other architectures, such as convolution layers and residual connections, were subsequently established in the literature (Novak et al., 2019; Garriga-Alonso et al., 2019).

One focus of this work is to establish a similar relationship between GNNs and their limiting GPs. We derive the covariance kernel that incorporates the graph inductive bias in the same way GNNs do. We start with one of the most widely studied GNNs, the graph convolutional network (GCN) (Kipf & Welling, 2017), and analyze the kernel universality as well as the limiting behavior when the depth also tends to infinity. We then derive covariance kernels from other GNNs by using a programmable procedure that maps every building block of a neural network to a kernel operation.

Meanwhile, we design efficient computational procedures for posterior inference (i.e., regression and classification). GPs are notoriously difficult to scale because of the cubic complexity with respect to the number of training data. Benchmark graph datasets used in the GNN literature may contain thousands or even millions of labeled nodes (Hu et al., 2020b). The semi-supervised setting worsens the scenario, as the covariance matrix needs to be (recursively) evaluated in full because of the graph convolution operation. We propose a Nyström-like scheme that applies a low-rank approximation recursively at each layer, yielding a low-rank kernel matrix that can be computed scalably. We demonstrate through numerical experiments that the GP posterior inference is much faster than training a GNN and subsequently performing predictions on the test set.
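To illustrate the general low-rank idea behind such schemes (not the layer-wise recursive construction developed in later sections), the following minimal sketch shows a standard Nyström approximation of a kernel matrix. The kernel function, the number of landmarks, and the jitter value are illustrative assumptions made for this sketch only.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    # Squared-exponential kernel; a stand-in for any covariance kernel.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def nystrom_factor(X, m=100, jitter=1e-6, seed=None):
    """Return a factor L such that K ≈ L @ L.T, using m random landmark points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(m, len(X)), replace=False)
    K_nm = rbf_kernel(X, X[idx])              # N x m cross-covariance
    K_mm = rbf_kernel(X[idx], X[idx])         # m x m landmark covariance
    K_mm += jitter * np.eye(len(idx))         # numerical stabilization
    # K ≈ K_nm K_mm^{-1} K_mn = (K_nm K_mm^{-1/2}) (K_nm K_mm^{-1/2})^T
    U, s, _ = np.linalg.svd(K_mm)
    K_mm_inv_sqrt = U @ np.diag(1.0 / np.sqrt(s)) @ U.T
    return K_nm @ K_mm_inv_sqrt               # N x m factor L

# Usage: posterior solves involve (L L^T + sigma^2 I), which the Woodbury
# identity reduces to O(N m^2) work instead of O(N^3) for the dense kernel.
X = np.random.default_rng(0).standard_normal((500, 8))
L = nystrom_factor(X, m=50, seed=0)
```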
We summarize the contributions of this work as follows:

1. We derive the GP as a limit of the GCN when the layer widths tend to infinity and study the kernel universality and the limiting behavior in depth (the GCN propagation rule is recalled below for reference).
2. We propose a computational procedure to compute a low-rank approximation of the covariance matrix for practical and scalable posterior inference.
3. We present a programmable procedure to compose covariance kernels and their approximations and show examples corresponding to several interesting members of the GNN family.
4. We conduct comprehensive experiments to demonstrate that the GP model performs favorably compared to GNNs in prediction accuracy while being significantly faster in computation.
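For reference, the GCN of Kipf & Welling (2017) propagates node features layer by layer as

$$
H^{(\ell+1)} = \sigma\!\left(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(\ell)} W^{(\ell)}\right),
\qquad \hat{A} = A + I, \quad \hat{D}_{ii} = \sum_j \hat{A}_{ij},
$$

where $H^{(0)}$ collects the input node features, $W^{(\ell)}$ are the layer weights, and $\sigma(\cdot)$ is an elementwise activation (the notation here is ours and may differ from that used in later sections).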

2. RELATED WORK

It has long been observed that GPs are limits of standard neural networks with one hidden layer when the layer width tends to infinity (Neal, 1994; Williams, 1996). Recently, the equivalence between GPs and neural networks attracted renewed interest and was extended to deep neural networks (Lee et al., 2018; de G. Matthews et al., 2018) as well as modern architectures, such as convolution layers (Novak et al., 2019), recurrent networks (Yang, 2019), and residual connections (Garriga-Alonso et al., 2019). The term NNGP (neural network Gaussian process) henceforth emerged in the context of Bayesian deep learning. Besides the fact that an infinite neural network defines a kernel, training a neural network with gradient descent also defines a kernel, the neural tangent kernel (NTK), which describes the evolution of the network (Jacot et al., 2018; Lee et al., 2019). Python libraries were developed to automatically construct the NNGP and NTK kernels by programming the corresponding neural networks (Novak et al., 2020).

GNNs are neural networks that handle graph-structured data (Zhou et al., 2020; Wu et al., 2021). They are a promising class of models for semi-supervised learning. Many GNNs use the message-passing scheme (Gilmer et al., 2017), where neighborhood information is aggregated to update the representation of the center node. Representative examples include GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2018), and GIN (Xu et al., 2019). It has been found that the performance of GNNs degrades as they become deep; one approach to mitigating the problem is to insert residual/skip connections, as done by JumpingKnowledge (Xu et al., 2018), APPNP (Gasteiger et al., 2019), and GCNII (Chen et al., 2020).

Exact GP inference is costly because it requires inverting the dense N × N kernel matrix. Scalable approaches include low-rank methods, such as the Nyström approximation (Drineas & Mahoney, 2005), random features (Rahimi & Recht, 2007), and KISS-GP (Wilson & Nickisch, 2015), as well as multi-resolution (Katzfuss, 2017) and hierarchical methods (Chen et al., 2017; Chen & Stein, 2021).
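To make the NNGP construction concrete, for a fully connected network with activation $\phi$ and iid Gaussian weights and biases (variances $\sigma_w^2$ and $\sigma_b^2$), the limiting covariance follows the standard recursion of Lee et al. (2018), written here in our own notation:

$$
K^{(\ell)}(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{f \sim \mathcal{GP}\left(0,\, K^{(\ell-1)}\right)}\!\left[\phi\big(f(x)\big)\, \phi\big(f(x')\big)\right],
\qquad
K^{(0)}(x, x') = \sigma_b^2 + \sigma_w^2\, \frac{x^\top x'}{d},
$$

for $d$-dimensional inputs. The graph-based kernels derived in this work adapt this recursive pattern, with the graph convolution entering the covariance at each layer.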

