INDUCING GAUSSIAN PROCESS NETWORKS

Abstract

Gaussian processes (GPs) are powerful but computationally expensive machine learning models, requiring an estimate of the kernel covariance matrix for every prediction. In large and complex domains, such as graphs, sets, or images, the choice of a suitable kernel can also be non-trivial, providing an additional obstacle to the learning task. Over the last decade, these challenges have resulted in significant advances being made in terms of scalability and expressivity, exemplified by, e.g., the use of inducing points and neural network kernel approximations. In this paper, we propose inducing Gaussian process networks (IGN), a simple framework for simultaneously learning the feature space as well as the inducing points. The inducing points, in particular, are learned directly in the feature space, enabling a seamless representation of complex structured domains while also facilitating scalable gradient-based learning methods. We consider both regression and (binary) classification tasks and report on experimental results for real-world data sets showing that IGNs provide significant advances over state-of-the-art methods. We also demonstrate how IGNs can be used to effectively model complex domains using neural network architectures.

1. INTRODUCTION

Gaussian processes are powerful and attractive machine learning models, in particular in situations where uncertainty estimation is critical for performance, such as for medical diagnosis (Dusenberry et al., 2020). Whereas the original Gaussian process formulation was limited in terms of scalability, there has been significant progress in scalable solutions, with Quiñonero-Candela & Rasmussen (2005) providing an early unified framework based on inducing points as a representative proxy of the training data. The framework by Quiñonero-Candela & Rasmussen (2005) has also been extended to variational settings (Titsias, 2009; Wilson et al., 2016b; Bauer, 2016), further enabling a probabilistic basis for reasoning about the number of inducing points (Uhrenholt et al., 2021). In terms of computational scalability, methods for leveraging the available computational resources have recently been considered (Nguyen et al., 2019b; Wang et al., 2019), with Chen et al. (2020) also providing insights into the theoretical underpinnings of gradient descent-based solutions in correlated settings (as is the case for GPs). Common to most of the inducing-point-based approaches to scalability is that the inducing points live in the same space as the training points (see, e.g., Snelson & Ghahramani, 2006; Titsias, 2009; Hensman et al., 2013; Damianou & Lawrence, 2013). However, learning inducing points in the input space can be challenging for complex domains (e.g. over graphs), domains with high dimensionality (e.g. images), or domains with varying cardinality (e.g. text or point clouds) (Lee et al., 2019; Aitchison et al., 2021).
More recently, methods for reasoning about the inducing points directly in an embedding space have also been considered, but these methods often constrain the positions of the inducing points (Wilson et al., 2016a) or the structure of the embedding space (Lázaro-Gredilla & Figueiras-Vidal, 2009; Bradshaw et al., 2017), or they rely on complex inference/learning procedures (Aitchison et al., 2021). In this paper, we propose inducing Gaussian process networks (IGN) as a simple and scalable framework for jointly learning the inducing points and the (deep) kernel (Wilson et al., 2016a). Key to the framework is that the inducing points live in an unconstrained feature space rather than in the input space. By defining the inducing points in the feature space together with an amortized pseudo-label function, we are able to represent the data distribution with a simple base kernel (such as the RBF and dot-product kernels), relying on the expressiveness of the learned features for capturing complex interactions. For learning IGNs, we rely on a maximum likelihood-based learning objective that is optimized using mini-batch gradient descent (Chen et al., 2020). This setup allows the method to scale to large data sets, as demonstrated in the experimental results.

[Figure 1: To the left, four MNIST digits are embedded in the feature space by the neural network g. To the right, the feature space in which both features and inducing points exist. The observations associated with the inducing points z_1, z_2, and z_3 are given by the pseudo-label function r, while the predictions associated with g(x_1), g(x_2), g(x_3), and g(x_4) are estimated using the GP posterior.]
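The prediction step depicted in Figure 1 can be sketched as ordinary GP regression carried out in feature space, with the inducing points and their pseudo-labels playing the role of (pseudo-)training data. The exact predictive equations used by IGN are not reproduced in this excerpt, so the following NumPy sketch simply applies the standard GP posterior with an RBF base kernel; the function `ign_predict`, its arguments, and the noise level are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """RBF base kernel between rows of A and rows of B (feature-space vectors)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def ign_predict(phi_test, Z, r, sigma_eps=0.1):
    """GP posterior at embedded test points phi_test = g(x_test), treating
    the inducing points Z with pseudo-labels r as the training data.
    (Standard GP regression form -- an assumption for illustration.)"""
    Kzz = rbf(Z, Z) + sigma_eps**2 * np.eye(len(Z))
    Ksz = rbf(phi_test, Z)
    alpha = np.linalg.solve(Kzz, r)
    mean = Ksz @ alpha
    # Predictive variance: prior variance minus the explained part.
    var = rbf(phi_test, phi_test).diagonal() - np.einsum(
        "ij,ji->i", Ksz, np.linalg.solve(Kzz, Ksz.T))
    return mean, var
```

Because Z and the pseudo-labels r are free parameters, both the kernel evaluations and this posterior are differentiable, so the whole pipeline can be trained end-to-end with mini-batch gradient descent.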
Furthermore, by having the inducing points defined only in feature space, we can seamlessly employ standard gradient-based techniques for learning the inducing points (even when the input space is defined over complex discrete/hybrid objects) without the practical difficulties sometimes encountered when learning deep neural network structures. We evaluate the performance of the proposed framework on several well-known data sets and show significant improvements compared to state-of-the-art methods. We provide a qualitative analysis of the framework using a two-class version of the MNIST data set. This is complemented by a more detailed quantitative analysis using the full MNIST and CIFAR10 data sets. Lastly, to demonstrate the versatility of the framework, we also provide sentiment analysis results for both a text-based and a graph-based data set derived from the IMDB movie review data set.

2. THE INDUCING GAUSSIAN PROCESS NETWORKS (IGN) FRAMEWORK

We start by considering regression problems, defined over an input space $\mathcal{X}$ and a label space of observations $\mathbb{R}$, modeled by a Gaussian process:

$$f \sim \mathcal{GP}(0, k(\cdot, \cdot)), \qquad y = f(x) + \epsilon, \qquad x \in \mathcal{X},\ y \in \mathbb{R}, \tag{1}$$

where $k(\cdot, \cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ denotes the kernel describing the prior covariance, and $\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)$ is the noise associated with the observations. We assume access to a set of data points $D = \{(x_i, y_i)\}_{i=1}^{n}$ generated from the model in Equation 1, and we seek to learn the parameters that define $k$ and $\sigma_\epsilon$ in order to predict outputs for new points $x_*$ in $\mathcal{X}$. In what follows we shall use $X$ and $y$ to denote $(x_1, \ldots, x_n)^T$ and $(y_1, \ldots, y_n)^T$, respectively.

Firstly, we propose to embed the input points using a neural network $g_{\theta_g} : \mathcal{X} \to \mathbb{R}^d$ parameterized by $\theta_g$. Secondly, for modeling the kernel function $k$, we introduce a set of $m$ inducing points $Z = (z_1, \ldots, z_m)^T$, $z_i \in \mathbb{R}^d$, together with a (linear) pseudo-label function $r_{\theta_r} : \mathbb{R}^d \to \mathbb{R}$ parameterized by $\theta_r$. We will use $r = (r_{\theta_r}(z_1), \ldots, r_{\theta_r}(z_m))^T$ to denote the evaluation of $r$ on $Z$, where $r$ will play a rôle similar to that of inducing variables (Quiñonero-Candela & Rasmussen, 2005). In the remainder of this paper, we will sometimes drop the parameter subscripts $\theta_g$ and $\theta_r$ from $g$ and $r$ for ease of presentation. An illustration of the model and the relationship between the training data and the inducing points can be seen in Figure 1.

We finally define $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ as the kernel between pairs of vectors in $\mathbb{R}^d$. In particular, we denote

$$(K_{XX})_{ij} = k(g(x_i), g(x_j)), \quad (K_{ZX})_{ij} = k(z_i, g(x_j)), \quad (K_{XZ})_{ij} = k(g(x_i), z_j), \quad (K_{ZZ})_{ij} = k(z_i, z_j),$$
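As a concrete illustration of these kernel blocks, the NumPy sketch below builds $K_{XX}$, $K_{ZX}$, $K_{XZ}$, and $K_{ZZ}$ with an RBF base kernel, one of the simple base kernels mentioned above. A random tanh map stands in for the learned embedding network $g$, and the toy dimensions are arbitrary; note that the raw inputs $X$ must be passed through $g$, while the inducing points $Z$ already live in the feature space $\mathbb{R}^d$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, input_dim = 5, 3, 4, 7   # toy sizes: data, inducing points, feature dim, input dim

W = rng.standard_normal((input_dim, d))
def g(X):
    """Stand-in for the embedding network g: X -> R^d (here a fixed tanh map)."""
    return np.tanh(X @ W)

def k(A, B):
    """RBF base kernel between rows of A and rows of B (feature-space vectors)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq)

X = rng.standard_normal((n, input_dim))   # raw inputs, embedded via g below
Z = rng.standard_normal((m, d))           # inducing points, defined directly in feature space

K_XX = k(g(X), g(X))   # (n, n): covariance among embedded training points
K_ZX = k(Z, g(X))      # (m, n): inducing points vs. embedded training points
K_XZ = k(g(X), Z)      # (n, m): transpose of K_ZX
K_ZZ = k(Z, Z)         # (m, m): covariance among inducing points
```

In the full model, $W$ would be replaced by the parameters $\theta_g$ of a deep network, and $Z$ would be a free parameter matrix updated by gradient descent alongside $\theta_g$ and $\theta_r$.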

