A SCALABLE AND EXACT GAUSSIAN PROCESS SAMPLER VIA KERNEL PACKETS

Abstract

In view of the widespread use of Gaussian processes (GPs) in machine learning models, generating random sample paths of GPs is crucial for many machine learning applications. Sampling from a GP essentially requires generating high-dimensional Gaussian random vectors, which is computationally challenging if a direct method, such as one based on the Cholesky decomposition, is used. We develop a scalable algorithm to sample random realizations of the prior and the posterior of GP models with Matérn correlation functions. Unlike existing scalable sampling algorithms, the proposed approach draws samples from the theoretical distributions exactly. The algorithm exploits a novel structure called the kernel packet (KP), which gives an exact sparse representation of the dense covariance matrices. The proposed method is applicable to one-dimensional GPs, and to multi-dimensional GPs under some conditions such as separable kernels with full grid designs. Via a series of experiments and comparisons with other recent works, we demonstrate the efficiency and accuracy of the proposed method.

1. INTRODUCTION

Gaussian processes (GPs) have been widely used in statistical and machine learning applications (Rasmussen, 2003; Cressie, 2015; Santner et al., 2003). The relevant areas and topics include regression (O'Hagan, 1978; Bishop et al., 1995; Rasmussen, 2003; MacKay et al., 2003), classification (Kuss et al., 2005; Nickisch & Rasmussen, 2008; Hensman et al., 2015), Bayesian networks (Neal, 2012), optimization (Srinivas et al., 2009), and so on. GP modeling proceeds by imposing a GP as the prior of an underlying continuous function, which provides a flexible nonparametric framework for prediction and inference problems. When the sample size is large, the basic framework for GP regression suffers from the computational challenge of inverting large covariance matrices, and much work has been done to address this issue. Recent advances in scalable GP regression include Nyström approximation (Quinonero-Candela & Rasmussen, 2005; Titsias, 2009; Hensman et al., 2013), random Fourier features (Rahimi & Recht, 2007), local approximation (Gramacy & Apley, 2015), structured kernel interpolation (Wilson & Nickisch, 2015), state-space formulations (Grigorievskiy et al., 2017; Nickisch et al., 2018), Vecchia approximation (Katzfuss & Guinness, 2021), sparse representations (Chen et al., 2022; Ding et al., 2021), etc. In this article, we focus on sampling random GP realizations. Such GPs can be either the prior stochastic processes or the posterior processes in GP regression. Generating random sample paths of the GP prior or the posterior of the GP regression is crucial in machine learning areas such as Bayesian optimization (Snoek et al., 2012; Frazier, 2018a;b), reinforcement learning (Kuss & Rasmussen, 2003; Engel et al., 2005; Grande et al., 2014), and inverse problems in uncertainty quantification (Murray-Smith & Pearlmutter, 2004; Marzouk & Najm, 2009; Teckentrup, 2020).
To generate the function of a random GP sample, a common practice is to discretize the input space, which reduces the problem to sampling a high-dimensional multivariate normal vector. Sampling high-dimensional multivariate normal vectors, however, is computationally challenging as well, since we need to factorize the large covariance matrices. Despite the vast literature on scalable GP regression, the sampling methodologies are still underdeveloped, and existing scalable sampling algorithms for GPs are scarce. A recent prominent work is by Wilson et al. (2020), who proposed an efficient approach called decoupled sampling by exploiting Matheron's rule and combining Nyström approximation with random Fourier features. They later generalized it to pathwise conditioning (Wilson et al., 2021) based on Matheron's update, which only requires sampling from GP priors and is a powerful tool for both reasoning about and working with GPs. Motivated by this line of work, Maddox et al. (2021) extended Matheron's rule to multi-task GPs and applied it to Bayesian optimization, and Nguyen et al. (2021) proposed the first use of such bounds to improve Gaussian process posterior sampling. It is worth noting that each of the above methods enforces a certain approximation scheme to facilitate rapid computation, i.e., none of these methods draws random samples from the theoretical Gaussian distributions exactly. In this paper, we propose algorithms for sampling exactly from GP priors and GP regression posteriors for one-dimensional Matérn kernels with half-integer smoothness $\nu$, and then extend them to noiseless data in multiple dimensions. We introduce the kernel packet (KP) (Chen et al., 2022) as the major tool for sampling and reduce the time complexity to $O(\nu^3 n)$. This yields a linear-time exact sampler when $\nu$ is not too large.
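For reference, the direct Cholesky-based sampler that the paper contrasts with can be sketched as follows. This is a generic baseline, not the proposed KP algorithm; the function name and the jitter value are our own illustrative choices. Its cost is dominated by the $O(n^3)$ factorization of the dense covariance matrix:

```python
import numpy as np

def sample_gp_prior_cholesky(mu, K, n_samples=1, jitter=1e-10, rng=None):
    """Draw samples from N(mu, K) by Cholesky factorization.

    This is the direct O(n^3) baseline: factorizing the dense
    n x n covariance matrix K dominates the computation.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(mu)
    # Factor K = L L^T; a small jitter on the diagonal guards
    # against numerical loss of positive definiteness.
    L = np.linalg.cholesky(K + jitter * np.eye(n))
    # If z ~ N(0, I), then mu + L z ~ N(mu, K).
    z = rng.standard_normal((n, n_samples))
    return mu[:, None] + L @ z
```

Each additional sample after the factorization costs only an $O(n^2)$ matrix-vector product, but the one-time $O(n^3)$ factorization is what makes this approach infeasible for large $n$.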
Specifically, our work makes the following contributions:
• We propose an exact sampling method for Gaussian processes with (product) Matérn correlations on one-dimensional or multi-dimensional grid points. The computational time grows linearly in the number of grid points.
• We propose an exact and scalable sampler for the posterior Gaussian processes based on one-dimensional data or multi-dimensional data on full grid designs.
• We demonstrate the value of the proposed algorithm in Bayesian optimization and dynamical system problems.

2. BACKGROUND

This section covers the background related to the proposed method. Sections 2.1 and 2.2 introduce GPs, GP regression, and the basic method of GP sampling. Section 2.3 reviews a newly introduced covariance matrix representation called the kernel packet (KP) (Chen et al., 2022), which will help expedite the GP sampling.

2.1. GPS AND GP REGRESSION

A GP is a stochastic process whose finite-dimensional distributions are multivariate normal. The probability law of a GP is uniquely determined by its mean function $\mu(\cdot)$ and covariance function $K(\cdot,\cdot)$, and we denote this GP as $\mathcal{GP}(\mu(\cdot), K(\cdot,\cdot))$. Let $f: \mathcal{X} \to \mathbb{R}$ be an unknown function, $\mathcal{X} \subseteq \mathbb{R}^d$. Suppose the training set consists of $n$ Gaussian observations $y_i = f(x_i) + \epsilon_i$ with noise $\epsilon_i \sim \mathcal{N}(0, \sigma_\epsilon^2)$. In GP regression, we impose the prior $f \sim \mathcal{GP}(\mu(\cdot), K(\cdot,\cdot))$.

Suppose that we have observed $y = (y_1, \ldots, y_n)^T$ on $n$ distinct points $X = \{x_i\}_{i=1}^n$. The posterior of $f$ given the data is also a GP. Specifically, the posterior evaluation at $m$ untried inputs $X_* = \{x_i^*\}_{i=1}^m$ follows $f_* \mid y \sim \mathcal{N}(\mu_{*|n}, K_{*,*|n})$ with (Williams & Rasmussen, 2006):
$$\mu_{*|n} = \mu_* + K_{*,n}\left(K_{n,n} + \frac{\sigma_\epsilon^2}{\sigma^2} I_n\right)^{-1} (y - \mu_n),$$
$$K_{*,*|n} = \sigma^2 \left[ K_{*,*} - K_{*,n}\left(K_{n,n} + \frac{\sigma_\epsilon^2}{\sigma^2} I_n\right)^{-1} K_{n,*} \right],$$
where $\sigma^2 > 0$ is the variance of the GP, $I_n$ is the $n \times n$ identity matrix, $K_{*,n} = K(X_*, X) = K(X, X_*)^T = K_{n,*}^T = [K(x_i^*, x_s)]_{i=1,\ldots,m}^{s=1,\ldots,n}$, $K_{n,n} = [K(x_i, x_s)]_{i,s=1}^n$, $K_{*,*} = [K(x_i^*, x_s^*)]_{i,s=1}^m$, $\mu_n = (\mu(x_1), \ldots, \mu(x_n))^T$, and $\mu_* = (\mu(x_1^*), \ldots, \mu(x_m^*))^T$. In this work, we focus on Matérn correlation functions. One-dimensional Matérn correlation functions (Stein, 1999) are defined as
$$K(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, |x - x'|}{\omega} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\, |x - x'|}{\omega} \right)$$
for any $x, x' \in \mathbb{R}$, where $\nu > 0$ is the smoothness parameter, $\omega > 0$ is the lengthscale, and $K_\nu$ is the modified Bessel function of the second kind. GPs with Matérn correlations

