A SCALABLE AND EXACT GAUSSIAN PROCESS SAMPLER VIA KERNEL PACKETS

Abstract

In view of the widespread use of Gaussian processes (GPs) in machine learning models, generating random sample paths of GPs is crucial for many machine learning applications. Sampling from a GP essentially requires generating high-dimensional Gaussian random vectors, which is computationally challenging if a direct method, such as one based on the Cholesky decomposition, is used. We develop a scalable algorithm for sampling random realizations of the prior and posterior of GP models with Matérn correlation functions. Unlike existing scalable sampling algorithms, the proposed approach draws samples from the theoretical distributions exactly. The algorithm exploits a novel structure called the kernel packet (KP), which gives an exact sparse representation of the dense covariance matrices. The proposed method is applicable to one-dimensional GPs, and to multi-dimensional GPs under some conditions, such as separable kernels with full grid designs. Via a series of experiments and comparisons with other recent works, we demonstrate the efficiency and accuracy of the proposed method.

1. INTRODUCTION

Gaussian processes (GPs) have been widely used in statistical and machine learning applications (Rasmussen, 2003; Cressie, 2015; Santner et al., 2003) . The relevant areas and topics include regression (O'Hagan, 1978; Bishop et al., 1995; Rasmussen, 2003; MacKay et al., 2003) , classification (Kuss et al., 2005; Nickisch & Rasmussen, 2008; Hensman et al., 2015) , Bayesian networks (Neal, 2012) , optimization (Srinivas et al., 2009) , and so on. GP modeling proceeds by imposing a GP as the prior of an underlying continuous function, which provides a flexible nonparametric framework for prediction and inference problems. When the sample size is large, the basic framework for GP regression suffers from the computational challenge of inverting large covariance matrices. A lot of work has been done to address this issue. Recent advances in scalable GP regression include Nyström approximation (Quinonero-Candela & Rasmussen, 2005; Titsias, 2009; Hensman et al., 2013) , random Fourier features (Rahimi & Recht, 2007) , local approximation (Gramacy & Apley, 2015) , structured kernel interpolation (Wilson & Nickisch, 2015) , state-space formulation (Grigorievskiy et al., 2017; Nickisch et al., 2018) , Vecchia approximation (Katzfuss & Guinness, 2021) , sparse representation (Chen et al., 2022; Ding et al., 2021), etc. In this article, we focus on the sampling of random GP realizations. Such GPs can be either the prior stochastic processes, or the posterior processes in GP regression. Generating random sample paths of the GP prior or the posterior of the GP regression is crucial in machine learning areas such as Bayesian Optimization (Snoek et al., 2012; Frazier, 2018a; b) , reinforcement learning (Kuss & Rasmussen, 2003; Engel et al., 2005; Grande et al., 2014) , and inverse problems in uncertainty quantification (Murray-Smith & Pearlmutter, 2004; Marzouk & Najm, 2009; Teckentrup, 2020) . 
To generate the function of a random GP sample, a common practice is to discretize the input space, so that the problem becomes the sampling of a high-dimensional multivariate normal vector. Sampling high-dimensional multivariate normal vectors, however, is computationally challenging as well, since we need to factorize the large covariance matrices. Despite the vast literature on scalable GP regression, the sampling methodologies are still underdeveloped, and existing scalable sampling algorithms for GPs are scarce. A recent prominent work is by Wilson et al. (2020), who proposed an efficient sampling approach called decoupled sampling by exploiting Matheron's rule and combining the Nyström approximation with random Fourier features. They also generalized it to pathwise conditioning (Wilson et al., 2021) based on Matheron's update, which only needs sampling from GP priors and is a powerful tool for both reasoning about and working with GPs. Motivated by those works, Maddox et al. (2021) extended Matheron's rule to multi-task GPs and applied it to Bayesian optimization, and Nguyen et al. (2021) proposed the first use of such bounds to improve Gaussian process posterior sampling. It is worth noting that each of the above methods enforces a certain approximation scheme to facilitate rapid computation, i.e., none of these methods draws random samples from the theoretical Gaussian distributions exactly. In this paper, we propose algorithms for sampling from GP priors and GP regression posteriors exactly for one-dimensional Matérn kernels with half-integer smoothness ν, and then extend them to noiseless data in multiple dimensions. We introduce the kernel packet (KP) (Chen et al., 2022) as the major tool for sampling and reduce the time complexity to O(ν³n). This yields a linear-time exact sampler when ν is not too large.
Specifically, our work makes the following contributions:
• We propose an exact sampling method for Gaussian processes with (product) Matérn correlations on one-dimensional or multi-dimensional grid points. The computational time grows linearly in the number of grid points.
• We propose an exact and scalable sampler for posterior Gaussian processes based on one-dimensional data or multi-dimensional data on full grid designs.
• We demonstrate the value of the proposed algorithm on Bayesian optimization and dynamical system problems.

2. BACKGROUND

This section covers the related background of the proposed method. Sections 2.1 and 2.2 introduce GPs, GP regression, and the basic method of GP sampling. In section 2.3, we review a newly introduced covariance matrix representation called the Kernel Packet (KP) (Chen et al., 2022) , which will help expedite the GP sampling.

2.1. GPS AND GP REGRESSION

A GP is a stochastic process whose finite-dimensional distributions are multivariate normal. The probability law of a GP is uniquely determined by its mean function µ(·) and covariance function K(·, ·), and we denote this GP as GP(µ(·), K(·, ·)). Let f : X → R be an unknown function, X ⊆ R^d. Suppose the training set consists of n Gaussian observations y_i = f(x_i) + ε_i with noise ε_i ∼ N(0, σ_ε²). In GP regression, we impose the prior f ∼ GP(µ(·), K(·, ·)).

Suppose that we have observed y = (y_1, ..., y_n)^T on n distinct points X = {x_i}_{i=1}^n. The posterior of f given the data is also a GP. Specifically, the posterior evaluation at m untried inputs X_* = {x_i^*}_{i=1}^m follows f_*|y ∼ N(µ_{*|n}, K_{*,*|n}) with (Williams & Rasmussen, 2006):

µ_{*|n} = µ_* + K_{*,n} (K_{n,n} + (σ_ε²/σ²) I_n)^{-1} (y − µ_n),    (1)

K_{*,*|n} = σ² ( K_{*,*} − K_{*,n} (K_{n,n} + (σ_ε²/σ²) I_n)^{-1} K_{n,*} ),    (2)

where σ² > 0 is the variance of the GP, I_n is the n × n identity matrix, K_{*,n} = K(X_*, X) = K(X, X_*)^T = K_{n,*}^T, K_{n,n} = (K(x_i, x_s))_{i,s=1}^n, K_{*,*} = (K(x_i^*, x_s^*))_{i,s=1}^m, µ_n = (µ(x_1), ..., µ(x_n))^T, and µ_* = (µ(x_1^*), ..., µ(x_m^*))^T. In this work, we focus on Matérn correlation functions. One-dimensional Matérn correlation functions (Stein, 1999) are defined as

K(x, x') = (2^{1−ν}/Γ(ν)) (√(2ν)|x − x'|/ω)^ν K_ν(√(2ν)|x − x'|/ω),    (3)

for any x, x' ∈ R, where ν > 0 is the smoothness parameter, ω > 0 is the lengthscale, and K_ν is the modified Bessel function of the second kind. GPs with Matérn correlations form a rich family with finite smoothness; their sample paths are ⌈ν − 1⌉ times differentiable (Santner et al., 2003). By virtue of its flexibility, the Matérn family is a popular choice of correlation function in spatial statistics (Diggle et al., 2003), geostatistics (Curriero, 2006; Pardo-Iguzquiza & Chica-Olmo, 2008), image analysis (Zafari et al., 2020; Okhrin et al., 2020), and other applications. A common choice of multi-dimensional correlation structure is the "separable" or "product" correlation given by

K(x, x') = ∏_{j=1}^d K_j(x_j, x'_j),    (4)

for any x, x' ∈ R^d, where K_j is a one-dimensional Matérn correlation function for each j.
Although the product of Matérn correlations does not have the same smoothness properties as the multi-dimensional Matérn correlation, the assumption of separability is used extensively in spatiotemporal statistics (Gneiting et al., 2006; Genton, 2007; Constantinou et al., 2017) because it allows for a simple construction of valid space-time parametric models and facilitates the computational procedures for large datasets in inference and parameter estimation.

2.2. SAMPLING

The goal is to sample f(·) ∼ GP(µ(·), K(·, ·)). To achieve a finite representation, we discretize the input space and consider the function values over a set of grid points Z = {z_i}_{i=1}^p; the objective is then to generate samples f_p = (f(z_1), ..., f(z_p))^T from the multivariate normal distribution N(µ(Z), K(Z, Z)) = N(µ_p, K_{p,p}). The standard sampling method for a multivariate normal distribution is as follows: 1) generate a vector f_{p,0} whose entries are independent and identically distributed standard normal; 2) employ the Cholesky decomposition (Golub & Van Loan, 2013) to factorize the covariance matrix K_{p,p} as C_p C_p^T = K_{p,p}; 3) generate the output sample f_p as

f_p ← C_p f_{p,0} + µ_p.    (5)

Sampling a posterior GP can be done in a similar manner. Suppose we have observations y = (y_1, ..., y_n)^T on n distinct points X = {x_i}_{i=1}^n, where y_i = f(x_i) + ε_i with noise ε_i ∼ N(0, σ_ε²) and f ∼ GP(µ(·), K(·, ·)). The goal is to generate posterior samples f_*|y at m test points X_*. Let C_{*|n} denote the Cholesky factor of the posterior covariance K_{*,*|n} in (2); then

f_*|y ← C_{*|n} f_{m,0} + µ_{*|n},    (6)

where f_{m,0} ∼ N(0, I_m).
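As a point of reference for the later sections, the standard cubic-time sampler described above can be sketched as follows. This is our own minimal illustration, not the paper's code; the Matérn 3/2 kernel, the grid, and the jitter level are arbitrary choices.

```python
import numpy as np

def matern32(x1, x2, omega=1.0):
    """Matérn 3/2 correlation matrix between two 1-D point sets."""
    d = np.abs(x1[:, None] - x2[None, :])
    s = np.sqrt(3.0) * d / omega
    return (1.0 + s) * np.exp(-s)

def sample_prior_cholesky(z, mu=None, omega=1.0, jitter=1e-8, rng=None):
    """Draw one GP prior sample on grid z via dense Cholesky: O(p^3) time."""
    rng = np.random.default_rng(rng)
    p = len(z)
    mu = np.zeros(p) if mu is None else mu
    K = matern32(z, z, omega) + jitter * np.eye(p)  # small jitter for stability
    C = np.linalg.cholesky(K)                       # K = C C^T
    f0 = rng.standard_normal(p)                     # f_{p,0} ~ N(0, I_p)
    return C @ f0 + mu                              # f_p = C f_{p,0} + mu_p

z = np.linspace(0.0, 10.0, 200)
f = sample_prior_cholesky(z, rng=0)
```

Both the factorization and the matrix-vector product here are dense, which is precisely the bottleneck the KP approach removes.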

2.3. KERNEL PACKETS

In this section, we review the theory and methods of kernel packets by Chen et al. (2022). Suppose K(·, ·) is a one-dimensional Matérn kernel as defined in (3), and the one-dimensional input points X are ordered and distinct. Let K = span{K(·, x_j)}_{j=1}^n. Then there exists a collection of linearly independent functions {ϕ_i}_{i=1}^n ⊂ K, such that each ϕ_i(x) = Σ_{j=1}^n A_j^{(i)} K(x, x_j) has compact support. The covariance matrix K_{n,n} = K(X, X) is then connected to a sparse matrix ϕ(X) in the following way:

K_{n,n} A_X = ϕ(X),    (7)

where both A_X and ϕ(X) are banded matrices and the (l, i)-th entry of ϕ(X) is ϕ_i(x_l). The matrix A_X consists of the coefficients used to construct the KPs; specifically, A_j^{(i)} is the (j, i)-th entry of A_X. In view of the sparse representation and the compact supportedness of ϕ_i, the bandwidth of A_X is (k − 1)/2 and the bandwidth of ϕ(X) is (k − 3)/2, where k := 2ν + 2 and ν is the smoothness parameter in (3). We defer the detailed algorithm to establish the factorization (7) to Appendix A.1 and the connections to the state-space GP to Appendix A.3. Here we only emphasize that the algorithm to find A_X and ϕ(X) takes only O(ν³n) operations and O(νn) storage. Based on (7), we can substitute A_X and ϕ(·) for K(·, X) in (1) and (2) and rewrite the equations as follows:

µ_{*|n} = µ_* + ϕ^T(X_*) (ϕ(X) + (σ_ε²/σ²) A_X)^{-1} (y − µ_n),    (8)

K_{*,*|n} = σ² ( K_{*,*} − ϕ^T(X_*) (A_X^T ϕ(X) + (σ_ε²/σ²) A_X^T A_X)^{-1} ϕ(X_*) ).    (9)

Because ϕ(X) and A_X are both banded matrices, the sums ϕ(X) + (σ_ε²/σ²) A_X and A_X^T ϕ(X) + (σ_ε²/σ²) A_X^T A_X are also banded. Therefore, the time complexity of matrix inversion via KPs is only O(ν³n).
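To make the factorization K_{n,n} A_X = ϕ(X) concrete, the sketch below constructs the coefficient matrix A_X for the simplest case ν = 1/2 (the exponential kernel, k = 3), where each intermediate KP is supported on three consecutive points and ϕ(X) is diagonal (bandwidth (k − 3)/2 = 0). This is our own specialization of the construction in Appendix A.1, not the authors' code.

```python
import numpy as np

def kp_factor_matern12(x, omega=1.0):
    """Build A_X such that K(X, X) @ A_X is banded (here: diagonal) for the
    Matérn 1/2 kernel K(x, x') = exp(-|x - x'| / omega)."""
    n, c = len(x), 1.0 / omega
    A = np.zeros((n, n))
    # boundary (one-sided) KPs on two points each
    A[0, 0] = 1.0
    A[1, 0] = -np.exp(c * (x[0] - x[1]))        # phi_1 vanishes for t >= x[1]
    A[n - 1, n - 1] = 1.0
    A[n - 2, n - 1] = -np.exp(c * (x[n - 2] - x[n - 1]))  # vanishes for t <= x[n-2]
    # intermediate KPs on three consecutive points: coefficients solve
    # sum_j A_j exp(+c x_j) = 0 and sum_j A_j exp(-c x_j) = 0
    for j in range(1, n - 1):
        M = np.array([[np.exp(c * x[j - 1]), np.exp(c * x[j + 1])],
                      [np.exp(-c * x[j - 1]), np.exp(-c * x[j + 1])]])
        rhs = -np.array([np.exp(c * x[j]), np.exp(-c * x[j])])
        a_out = np.linalg.solve(M, rhs)         # fix the middle coefficient at 1
        A[j - 1, j], A[j, j], A[j + 1, j] = a_out[0], 1.0, a_out[1]
    return A

x = np.linspace(0.0, 5.0, 8)
K = np.exp(-np.abs(x[:, None] - x[None, :]))    # omega = 1
Phi = K @ kp_factor_matern12(x)                 # numerically diagonal
```

Each column of A_X spans a one-dimensional null space, so it is determined only up to a scalar (cf. Appendix A.4); here the scale is fixed by setting the middle coefficient to one.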

3. SAMPLING WITH KERNEL PACKETS

In this section, we present the main algorithms for sampling from Gaussian process priors and posteriors based on the KP technique. We introduce the algorithms of sampling from one-dimensional GP and multi-dimensional GP respectively in sections 3.1 and 3.2.

3.1. SAMPLING FROM ONE-DIMENSIONAL GPS

Consider one-dimensional GPs with Matérn correlations of half-integer smoothness ν.

Sampling from the prior distribution. For a set of grid points Z = {z_i}_{i=1}^p, we first compute the sparse matrices A_Z and ϕ(Z) in (7). Now, instead of a direct Cholesky decomposition of the covariance matrix K_{p,p} = K(Z, Z), we consider the Cholesky decomposition of the symmetric positive definite matrix R_Z := A_Z^T ϕ(Z) = A_Z^T K_{p,p} A_Z. This shift makes a significant difference: K_{p,p} is a dense matrix, so its Cholesky decomposition takes O(p³) time. In contrast, R_Z, as the product of two banded matrices, is also a banded matrix with bandwidth at most 2ν. It is well known that the Cholesky factor of a banded matrix is also banded, and the computation can be done in O(ν³p) time. Suppose Q_Z Q_Z^T = R_Z; then we multiply (A_Z^T)^{-1} Q_Z by a vector f_{p,0} drawn from the standard multivariate normal distribution. It is not hard to show that (A_Z^T)^{-1} Q_Z f_{p,0} + µ_p ∼ N(µ_p, K_{p,p}). Hence, we obtain samples with the same distribution as f_p. In practical calculation, we may first compute the product Q_Z f_{p,0} and then apply a banded solve to obtain (A_Z^T)^{-1}(Q_Z f_{p,0}), since both A_Z and Q_Z are banded matrices. The entire algorithm costs only O(ν³p) time and O(νp) storage. A summary is given in Algorithm 1.

Algorithm 1 Sampling from one-dimensional GP priors.
Input: Ordered input dataset Z defined in section 2.1
1: Compute banded matrices A_Z and ϕ(Z) with respect to the input data Z;
2: Obtain the banded matrix R_Z := A_Z^T ϕ(Z);
3: Apply the Cholesky decomposition to R_Z to get the lower triangular matrix Q_Z satisfying Q_Z Q_Z^T = R_Z;
4: Generate samples f_{p,0} ∼ N(0, I_p);
5: Compute f_kp := (A_Z^T)^{-1} Q_Z f_{p,0} + µ_p.
Output: f_kp

Sampling from one-dimensional GP regression posteriors. This can be done by combining the KP technique with Matheron's rule.
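Steps 2-5 of Algorithm 1 amount to a banded Cholesky factorization followed by banded solves. The sketch below illustrates that pipeline with small stand-in matrices: a toy banded A and a toy banded SPD R play the roles of A_Z and R_Z (in the real algorithm both come from the KP construction), and the resulting sample has covariance A^{-T} R A^{-1}. This is our own illustration, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import cholesky_banded

rng = np.random.default_rng(0)
p = 6
mu = np.zeros(p)

# Stand-ins for the KP matrices (illustrative only): a banded invertible A
# and a banded SPD matrix playing the role of R_Z = A_Z^T phi(Z).
A = np.eye(p) + np.diag(-0.4 * np.ones(p - 1), k=-1)        # lower bidiagonal
R = (np.diag(2.0 * np.ones(p))
     + np.diag(-0.8 * np.ones(p - 1), 1)
     + np.diag(-0.8 * np.ones(p - 1), -1))                  # tridiagonal SPD

# Step 3: banded Cholesky of R, O(bandwidth^2 * p) instead of O(p^3).
ab = np.zeros((2, p))                  # lower banded storage, bandwidth 1
ab[0] = np.diag(R)
ab[1, :-1] = np.diag(R, -1)
cb = cholesky_banded(ab, lower=True)   # banded factor with Q Q^T = R
Q = np.diag(cb[0]) + np.diag(cb[1, :-1], -1)

# Steps 4-5: f = (A^T)^{-1} Q f0 + mu, a draw from N(mu, A^{-T} R A^{-1}).
f0 = rng.standard_normal(p)
f = np.linalg.solve(A.T, Q @ f0) + mu  # banded triangular solves in practice
```

The same algebra shows exactness: the covariance of (A^T)^{-1} Q f0 is A^{-T} Q Q^T A^{-1} = A^{-T} R A^{-1}, which equals K_{p,p} when R = A^T K_{p,p} A.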
Matheron's rule was popularized in the geostatistics field by Journel & Huijbregts (1976). Recently, Wilson et al. (2020) rediscovered and exploited it to develop a GP posterior sampling approach. Matheron's rule is stated as follows. Let a and b be jointly Gaussian random vectors. Then the random vector a, conditional on b = β, is equal in distribution to

(a | b = β) =_d a + Cov(a, b) Cov(b, b)^{-1} (β − b),    (10)

where Cov(a, b) is the covariance of (a, b). By (10), exact GP posteriors can be sampled from two jointly Gaussian random vectors, and we obtain

f_*|y =_d f_* + K_{*,n} (K_{n,n} + (σ_ε²/σ²) I_n)^{-1} (y − f_n − ε),    (11)

where f_* and f_n are jointly Gaussian random vectors drawn from the prior distribution and the noise variate is ε ∼ N(0, σ_ε² I_n). Clearly, (f_n, f_*) follows the multivariate normal distribution

(f_n, f_*) ∼ N( (µ_n, µ_*), [[K_{n,n}, K_{n,*}], [K_{*,n}, K_{*,*}]] ).    (12)

We may apply KP to (11) and get the following corollary:

f_*|y =_d f_* + ϕ^T(X_*) (ϕ(X) + (σ_ε²/σ²) A_X)^{-1} (y − f_n − ε).    (13)

Given that Algorithm 1 requires distinct and ordered data points, it is reasonable to assume that the training set X does not coincide with the test set X_*. We can then rearrange the combined training set X and test set X_* into an ordered set and record the reordering index. Next, we use Algorithm 1 to draw a reordered vector in R^{n+m} from the GP prior and obtain joint samples f_n and f_* by restoring the vector to its original order. Finally, we plug f_n and f_* into formula (13) to compute the posterior samples. The time complexity of this approach is also O(ν³n), owing to the sparsity of the matrices to be Cholesky-decomposed and inverted.
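The distributional identity behind Matheron's rule can be verified by propagating means and covariances through the affine map: the covariance of the transformed variable must equal the exact posterior covariance. The sketch below checks this with an arbitrary SPD joint covariance (our own construction, not the paper's code).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, s2 = 5, 3, 0.1            # train size, test size, noise variance

# an arbitrary SPD joint covariance over (f_n, f_*)
L = rng.standard_normal((n + m, n + m))
S = L @ L.T + (n + m) * np.eye(n + m)
Knn, Kns = S[:n, :n], S[:n, n:]
Ksn, Kss = S[n:, :n], S[n:, n:]

# Matheron map: f_*|y =_d f_* + B (y - f_n - eps), B = K_{*,n}(K_{n,n} + s2 I)^{-1}
B = Ksn @ np.linalg.inv(Knn + s2 * np.eye(n))

# covariance of the affine map T(f_n, f_*, eps): the y-term is deterministic
cov_T = Kss + B @ (Knn + s2 * np.eye(n)) @ B.T - B @ Kns - Ksn @ B.T
# exact posterior covariance for comparison
cov_posterior = Kss - Ksn @ np.linalg.inv(Knn + s2 * np.eye(n)) @ Kns
```

Since cov_T equals cov_posterior (and the means match by the same computation), sampling the prior jointly and applying the update reproduces the exact posterior law, which is why the KP-based sampler remains exact.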

3.2. SAMPLING FROM MULTI-DIMENSIONAL GPS

For multi-dimensional GPs, we suppose that the points X are given by a full grid, defined as X_FG = ×_{j=1}^d X^{(j)}, where each X^{(j)} is a set of one-dimensional data points and A × B denotes the Cartesian product of sets A and B. Based on the separable structure defined in (4) and a full grid X_FG, we can transform the multi-dimensional problem into a one-dimensional problem, since the correlation matrix K(X_FG, X_FG) can be represented by Kronecker products (Henderson et al., 1983; Saatçi, 2012; Wilson & Nickisch, 2015) of matrices over each input dimension j:

K(X_FG, X_FG) = ⊗_{j=1}^d K_j(X^{(j)}, X^{(j)}).    (14)

Prior To sample from GP priors over a full grid design Z_FG = ×_{j=1}^d Z^{(j)}, it is easy to verify that the tensor product ⊗_{j=1}^d A_{Z^{(j)}}^{-T} Q_{Z^{(j)}} is a square-root factor of the correlation matrix on the full grid Z_FG. For each dimension j, we generate the matrices A_{Z^{(j)}}, ϕ(Z^{(j)}), and the Cholesky factor Q_{Z^{(j)}} with respect to the one-dimensional dataset Z^{(j)}. To obtain GP prior samples over the full grid, we apply the tensor product ⊗_{j=1}^d A_{Z^{(j)}}^{-T} Q_{Z^{(j)}} to f_{p,0}^FG. More specifically, the samples from the GP prior can be computed via

f_kp^FG := (⊗_{j=1}^d A_{Z^{(j)}}^{-T} Q_{Z^{(j)}}) f_{p,0}^FG + µ_p^FG ∼ N(µ_p^FG, K(Z_FG, Z_FG)).    (15)

Accordingly, on full grids we only need to loop over the dimensions, multiply banded matrices in each dimension, and compute a tensor product. The total cost is O(ν³ ∏_{j=1}^d n_j) because we are again dealing with sparse matrices; here n_j is the number of data points in X^{(j)}.

Posterior With regard to GP posteriors, we cannot employ the prior sampling scheme to draw jointly distributed samples (f_n^FG, f_*^FG), since a sequence of multi-dimensional data points cannot be ordered. We instead directly Cholesky factorize the posterior correlation matrix K_{*,*|n}^FG to obtain C_{*|n}^FG, and then use (6) to generate samples. The calculation of the posterior mean µ_{*|n}^FG and posterior covariance K_{*,*|n}^FG costs only O(ν³ m ∏_{j=1}^d n_j) time by using equations (25) and (26) in Appendix A.2. Therefore, the entire scheme requires O(ν³ m³ + m ∏_{j=1}^d n_j) time.
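The Kronecker identity behind (15) — per-dimension square-root factors combine into a square root of the full-grid kernel — can be sketched as follows, using dense Cholesky factors in place of the banded KP factors A^{-T}Q for simplicity (our own toy example).

```python
import numpy as np

def matern32(x, omega=1.0):
    """Matérn 3/2 correlation matrix on a 1-D point set."""
    d = np.abs(x[:, None] - x[None, :])
    s = np.sqrt(3.0) * d / omega
    return (1.0 + s) * np.exp(-s)

x1 = np.linspace(0.0, 1.0, 4)   # grid for dimension 1
x2 = np.linspace(0.0, 2.0, 3)   # grid for dimension 2
K1, K2 = matern32(x1), matern32(x2)

C1 = np.linalg.cholesky(K1 + 1e-12 * np.eye(4))
C2 = np.linalg.cholesky(K2 + 1e-12 * np.eye(3))

# kron(C1, C2) is a square root of the separable kernel on the full grid,
# since kron(C1, C2) @ kron(C1, C2).T = kron(K1, K2)
C = np.kron(C1, C2)
K_fg = np.kron(K1, K2)

rng = np.random.default_rng(2)
f = C @ rng.standard_normal(12)  # one prior draw on the 4 x 3 full grid
```

In practice one never forms the Kronecker product explicitly; the factors are applied dimension by dimension, which is what keeps the full-grid sampler scalable.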

4. EXPERIMENTS

In this section, we demonstrate the computational efficiency and accuracy of the proposed sampling algorithm. We first generate samples from GP priors and posteriors in a one-dimensional space in section 4.1 and on two-dimensional full grid designs in section 4.2. Then we consider the same applications as in (Wilson et al., 2020) and apply our approach to two real problems in section 4.3. For prior sampling, we use random Fourier features (RFF) with 1024 features (Rahimi & Recht, 2007) and the Cholesky decomposition method as benchmarks. For posterior sampling, the decoupled algorithm (Wilson et al., 2020) with exact Matheron's update and the Cholesky decomposition method are used as benchmarks. We consider Matérn correlations in (3) with lengthscale ω = √(2ν) and smoothness ν = 3/2 and 5/2. In two-dimensional problems, we choose the "separable" Matérn correlations mentioned in Section 2.1 with the same correlations and the same parameter ω = √(2ν) in each dimension. We set the variance as σ² = 1 for all experiments. We set the seed value to 99 and perform 1000 replications for each experiment. All plots for ν = 3/2 are given in the main article, and those for ν = 5/2 are deferred to Appendix A.5.

4.1. ONE-DIMENSIONAL EXAMPLES

Prior Sampling We generate one-dimensional prior samples on uniformly distributed points Z = {z_i}_{i=1}^p over the interval [0, 10] with p = 10, 50, 100, 500, 1000, 5000, 10000. The left plots in Figure 1 and Figure 7 show the time taken by the different sampling algorithms for different numbers of points p. We observe that the KP algorithm costs much less than the other algorithms for both Matérn 3/2 and Matérn 5/2 correlations, especially when p = 5000, 10000. The curves of the Cholesky decomposition are incomplete due to the limit of storage. To test the accuracy, we select three subsets of size ten, {z_i}_{i=1}^{10}, {z_i}_{i=250}^{259}, and {z_i}_{i=491}^{500}, when p = 500, and compute the 2-Wasserstein distances between the empirical priors and the true distributions over these three subsets (called Cases 1-3 hereafter). The 2-Wasserstein distance measures the similarity of two distributions. Let f_1 ∼ N(µ_1, Σ_1) and f_2 ∼ N(µ_2, Σ_2); the 2-Wasserstein distance between the Gaussian distributions f_1, f_2 in the L² norm is given by (Dowson & Landau, 1982; Gelbrich, 1990; Mallasto & Feragen, 2017)

W_2(f_1, f_2) := ( ||µ_1 − µ_2||² + tr(Σ_1 + Σ_2 − 2 (Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}) )^{1/2}.    (16)

For the empirical distributions, the parameters are estimated from the replicated samples. From the right plots in Figure 1 and Figure 7, we observe that the KP algorithm greatly outperforms the RFF and Cholesky methods in Case 1 and Case 3, which cover boundary points of Z. Numerical precision may explain why the Cholesky method attains lower 2-Wasserstein distances in Case 2.
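For reference, the 2-Wasserstein distance (16) between two Gaussians can be computed directly; this small helper (ours, using scipy's matrix square root) is one way to implement it.

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(mu1, S1, mu2, S2):
    """2-Wasserstein distance between N(mu1, S1) and N(mu2, S2), eq. (16)."""
    r1 = sqrtm(S1)
    cross = sqrtm(r1 @ S2 @ r1)          # (S1^{1/2} S2 S1^{1/2})^{1/2}
    val = np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return np.sqrt(max(np.real(val), 0.0))  # clip tiny negative round-off
```

For equal covariances the distance reduces to the Euclidean distance between the means, which gives a convenient sanity check.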

4.2. TWO-DIMENSIONAL EXAMPLES

Prior Sampling We generate prior samples over the level-η full grid design Z_η^FG = ×_{j=1}^d {−5 + 10·2^{−η}, −5 + 2·10·2^{−η}, ..., 5 − 10·2^{−η}} with η = 3, 4, 5, 6 and d = 2. Likewise, we use the 2-Wasserstein distance defined in (16) to evaluate the accuracy of the prior samples in three cases, ×_{j=1}^d {z_i}_{i=1}^3, ×_{j=1}^d {z_i}_{i=15}^{17}, and ×_{j=1}^d {z_i}_{i=29}^{31}, where {z_i}_{i=1}^{31} = {−5 + 10·2^{−5}, ..., 5 − 10·2^{−5}}. The left plots in Figure 3 and Figure 9 in Appendix A.5 show that KP takes less time than RFF and the Cholesky method, especially for 2^{12} grid points. The right plots in Figure 3 and Figure 9 show that KP has higher accuracy than RFF for both Matérn 3/2 and 5/2 correlations, but lower accuracy than the Cholesky method for Matérn 5/2 correlations.

Posterior Sampling We choose the Griewank function (Griewank, 1981),

f(x) = Σ_{i=1}^d x_i²/4000 − ∏_{i=1}^d cos(x_i/√i) + 1,  x ∈ (−5, 5)^d,

as our test function, and the level-η full grid design X_η^FG = ×_{j=1}^d {−5 + 10·2^{−η}, −5 + 2·10·2^{−η}, ..., 5 − 10·2^{−η}} with η = 3, 4, ..., 9 and d = 2 as our design of experiment. We then investigate the average computational time and the MSE over m = 1024 random test points for each sampling method. Figure 4 and Figure 10 in Appendix A.5 illustrate the performance of the different sampling strategies. We observe that both the direct Cholesky decomposition and the decoupled algorithm can only generate posterior samples from at most 2^{12} observations due to the limit of storage; the KP algorithm, however, can sample from 2^{18} observations because the space complexity of the KP-based computation is only O(νn). Although the accuracy of the KP method is lower than that of the direct Cholesky decomposition and the decoupled algorithm, it is still highly accurate and runs in a much shorter time than the other algorithms.

4.3. APPLICATIONS

Thompson Sampling Thompson Sampling (TS) (Thompson, 1933) is a classical strategy for decision-making that selects actions x ∈ X to minimize a black-box function f : X → R. In round t, TS selects x_{t+1} ∈ argmin_{x∈X} (f|y)(x), where y is the observation set. Upon finding the minimizer, we obtain y_{t+1} by evaluating f at x_{t+1}, and then add (x_{t+1}, y_{t+1}) to the training set. In this experiment, we consider a one-dimensional cut of the Ackley function (Ackley, 2012):

f(x) = −20 exp{−0.2 √(0.5 x²)} − exp{0.5 (cos(2πx) + 1)} + exp{1} + 20.    (17)

The goal of this experiment is to find the global minimizer of the function in (17). We start with k = 2ν + 2 samples before the optimization; then, at each round of TS, we draw a posterior sample f|y on 1000 uniformly distributed points over the interval [−5, 5] given the observation set. Next, we pick the smallest posterior sample at this round, add it to the training set, and repeat the process. After some steps, we approach the global minimum. In Figure 5, we compare the logarithm of the total regret of the different sampling algorithms; both the proposed approach (KP) and the decoupled method find the global minimum within 15 rounds, outperforming the direct Cholesky factorization sampling scheme.

Simulating Dynamical Systems Gaussian process posteriors are also commonly used in dynamical systems when sufficient data are not available. Consider a one-dimensional ordinary differential equation

x′(s) = f(s, x(s)).    (18)

We can discretize the system's equation (18) into a difference equation by the Euler method (Butcher, 2016):

y_t = x_{t+1} − x_t = τ f(s_t, x_t),    (19)

where τ is the fixed step size. We aim to simulate the state trajectories of this dynamical system.
First, we set an initial point x_0; then, at iteration t, we draw a posterior GP sample from the conditional distribution p(y_t | D_{t−1}), where D_{t−1} denotes the set of the data {(x_i, y_i)}_{i=1}^n and the current trajectory {(x_j, y_j)}_{j=1}^{t−1}. In our implementation, we choose the model as f(s, x(s)) = 0.5x −

Figure 6 demonstrates the state trajectories of each algorithm for T = 480 steps, the time cost at each iteration, and the logarithm of the MSE between the x_t obtained from the GP-based simulations and the x_t obtained by directly performing the Euler method in (19) at each iteration t. The left plot in Figure 6 shows that the KP algorithm accurately characterizes the state trajectories of this dynamical system. The middle and right plots in Figure 6 indicate that the KP algorithm takes much less time and yields high accuracy at each step. It achieves nearly the same performance as the decoupled method and outperforms the Cholesky decomposition method.
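The Euler recursion (19) used for the ground-truth trajectories can be sketched as follows; since the drift model is only partially reproduced above, the code uses a hypothetical drift f(s, x) = −x purely for illustration.

```python
import numpy as np

def euler_trajectory(f, x0, tau, T):
    """Simulate x_{t+1} = x_t + tau * f(s_t, x_t) for T steps (eq. 19)."""
    x = np.empty(T + 1)
    x[0] = x0
    for t in range(T):
        x[t + 1] = x[t] + tau * f(t * tau, x[t])
    return x

# hypothetical linear-decay drift, x' = -x (not the paper's model)
traj = euler_trajectory(lambda s, x: -x, x0=1.0, tau=0.1, T=480)
```

In the GP-based simulation, the deterministic increment tau * f(s_t, x_t) at each step is replaced by a posterior draw of y_t given the data and the trajectory so far.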

5. DISCUSSION

In this work, we propose a scalable and exact algorithm for one-dimensional Gaussian process sampling with Matérn correlations of half-integer smoothness ν, which requires only O(ν³n) time and O(νn) space. The proposed method can be extended to some multi-dimensional problems, such as noiseless full grid designs, by using tensor product techniques. If the design is not grid-based, the proposed algorithm is not applicable; in that case, we may use the method in (Ding et al., 2020) to devise approximate sampling algorithms. While the proposed method is a theoretically exact and scalable algorithm, we observe some numerical stability issues (see Appendix A.4) in our experiments, which explains why the proposed method is sometimes not as accurate as the Cholesky decomposition method. Improvements and extensions of the proposed algorithm to overcome the stability issues and accommodate general multi-dimensional observations will be considered in future work.

A APPENDIX A.1 CONSTRUCTION OF BANDED MATRICES IN (7)

Intermediate KPs Let a = (a_1, ..., a_k)^T be a vector with a_1 < ··· < a_k. An intermediate KP is defined as ϕ_a(x) := Σ_{j=1}^k A_j K(x, a_j), where the coefficients A_j can be obtained by solving the (k − 1) × k linear system

Σ_{j=1}^k A_j a_j^l exp{δ c a_j} = 0,  l = 0, ..., (k − 3)/2,  δ = ±1.    (21)

One-sided KPs As before, let a = (a_1, ..., a_s)^T be a vector with a_1 < ··· < a_s. A one-sided KP is given by ϕ_a(x) := Σ_{j=1}^s A_j K(x, a_j), with (k + 1)/2 ≤ s ≤ k − 1. For right-sided KPs, we obtain the coefficients A_j by solving

Σ_{j=1}^s A_j a_j^l exp{−c a_j} = 0,  Σ_{j=1}^s A_j a_j^r exp{c a_j} = 0,    (23)

where l = 0, ..., (k − 3)/2 and the second set comprises auxiliary equations for the case s ≥ (k + 3)/2 with r = 0, ..., s − (k + 3)/2. Similar to equation 21, equation 23 is an (s − 1) × s linear system. Left-sided KPs are constructed similarly by solving

Σ_{j=1}^s A_j a_j^l exp{c a_j} = 0,  Σ_{j=1}^s A_j a_j^r exp{−c a_j} = 0,    (24)

where l = 0, ..., (k − 3)/2 and the second set comprises auxiliary equations for the case s ≥ (k + 3)/2 with r = 0, ..., s − (k + 3)/2.

KP Basis Let X = {x_i}_{i=1}^n be the one-dimensional input data satisfying x_1 < ··· < x_n, and let K be a Matérn correlation function with half-integer smoothness ν. Suppose n ≥ k := 2ν + 2. We can construct n such functions {ϕ_j}_{j=1}^n as a subset of the linear space K := span{K(·, x_j)}_{j=1}^n. The functions {ϕ_j}_{j=1}^n are linearly independent in K; together with the fact that the dimension of K is n, this implies that {ϕ_j}_{j=1}^n forms a basis for K, referred to as the KP basis. In equation 7, the (l, j)-th entry of ϕ(X) is ϕ_j(x_l). In view of the compact supportedness of ϕ_j, ϕ(X) is a banded matrix with bandwidth (k − 3)/2: its j-th column has nonzero entries ϕ_j(x_l) only for |l − j| ≤ (k − 3)/2.
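Equation (21) can be solved numerically as a null-space problem. The sketch below does so for ν = 3/2 (k = 5) and checks that the resulting intermediate KP vanishes outside [a_1, a_k]; the rate c = √(2ν)/ω and all variable names are our reading of the construction, not the authors' code.

```python
import numpy as np
from scipy.linalg import null_space

nu, omega = 1.5, 1.0
k = int(2 * nu + 2)               # k = 5 points per intermediate KP
c = np.sqrt(2 * nu) / omega       # assumed exponential rate of the kernel
a = np.linspace(0.0, 4.0, k)      # a_1 < ... < a_k

# (k-1) x k system: sum_j A_j a_j^l exp(delta c a_j) = 0,
# with l = 0..(k-3)/2 and delta = +/-1
rows = [a**l * np.exp(delta * c * a)
        for delta in (1.0, -1.0)
        for l in range(int((k - 3) / 2) + 1)]
A = null_space(np.array(rows))[:, 0]   # one-dimensional null space

def matern32(x, y):
    s = np.sqrt(3.0) * np.abs(x - y) / omega
    return (1.0 + s) * np.exp(-s)

def phi(x):
    """The kernel packet phi_a(x) = sum_j A_j K(x, a_j)."""
    return sum(A[j] * matern32(x, a[j]) for j in range(k))
```

Outside [a_1, a_k] the kernel tails factor as polynomials times exp(±cx), so the conditions in (21) force exact cancellation, while inside the support the packet is genuinely nonzero.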
The matrix A_X consists of the coefficients used to construct the KPs. A_X is a banded matrix with bandwidth (k − 1)/2: its j-th column has nonzero entries A_{l,j} only for |l − j| ≤ (k − 1)/2.

A.2 SAMPLING FROM MULTI-DIMENSIONAL GP POSTERIORS

Combining (1) and (2) with the properties of the Kronecker product, the posterior mean and covariance for a training set {X_FG, y_FG} with full grid design X_FG = ×_{j=1}^d X^{(j)} can be computed via the following equations:

µ_{*|n}^FG = µ_*^FG + K_{*,n} (⊗_{j=1}^d A_{X^{(j)}} ϕ(X^{(j)})^{-1}) (y_FG − µ_n^FG),    (25)

K_{*,*|n}^FG = σ² ( K_{*,*} − K_{*,n} (⊗_{j=1}^d A_{X^{(j)}} ϕ(X^{(j)})^{-1}) K_{n,*} ).    (26)

A.3 KERNEL PACKETS AND STATE-SPACE

Although both kernel packets (Chen et al., 2022) and the state-space formulation (Grigorievskiy et al., 2017) can train one-dimensional Gaussian process regression with Matérn correlations exactly in linear time, there is no obvious way to establish a mathematical connection between the two methods. It is worth noting that the state-space approach is a sequential algorithm, which requires significantly more effort to parallelize; in contrast, the kernel packets approach can be parallelized trivially. Besides, like most other supervised learning algorithms, the kernel packets method is able to separate the learning task into two independent steps, training and prediction, and the computational time for each prediction point is only O(1) when the inputs are uniformly spaced. The state-space formulation, however, cannot make this separation (Grigorievskiy et al., 2017): users must either provide the prediction points during the training step or pay O(n) time for each prediction point.

A.4 NUMERICAL INSTABILITY

In real computation, the matrix R_Z in Algorithm 1 may not be numerically symmetric positive-definite when the distances between the input points Z are very small. Since each column of the matrix A_Z is obtained by solving for a null space, each column of A_Z is determined only up to a scalar; more work needs to be done to enhance the numerical stability of the algorithm.

A.5 FIGURES




Figure 1: Time and accuracy of different algorithms for sampling from one-dimensional GP priors with Matérn 3/2. We denote our approach (KP) by blue dots, random Fourier features (RFF) by orange stars, and direct Cholesky decomposition by green crosses. Left: Logarithm of time taken to generate a draw from GP priors; the x-axis is the number of grid points p. Right: 2-Wasserstein distances between empirical priors and true distributions over ten points for three cases {z_i}_{i=1}^{10}, {z_i}_{i=250}^{259}, {z_i}_{i=491}^{500} when p = 500.

Figure 2: Time and accuracy of different algorithms for sampling from one-dimensional GP posteriors with Matérn 3/2. We denote decoupled method by red triangles. Left: Logarithm of time taken to generate a draw from GP posteriors over m = 1000 points, x-axis is the number of observations n. Right: Logarithm of MSE over m = 1000 points.

Figure 3: Time and accuracy of different algorithms for sampling from GP priors over full grids with Matérn 3/2. Left: Logarithm of time taken to generate a draw from GP priors. Right: 2-Wasserstein distances between empirical priors and true distributions over nine points for three cases ×_{j=1}^d {z_i}_{i=1}^3, ×_{j=1}^d {z_i}_{i=15}^{17}, ×_{j=1}^d {z_i}_{i=29}^{31}, where {z_i}_{i=1}^{31} = {−5 + 10·2^{−5}, ..., 5 − 10·2^{−5}}.

Figure 4: Time and accuracy of different algorithms for sampling from GP posteriors over full grids with Matérn 3/2. Left: Logarithm of time taken to generate a draw from GP posteriors over 1024 points. Right: Logarithm of MSE over 1024 points.

Figure 5: Logarithm of regret of Thompson sampling methods when optimizing Ackley function with Matérn 3/2 (left) and Matérn 5/2 (right).

Figure 6: Simulations of an ordinary differential equation. Left: Trajectories generated via different algorithms. Middle: Time cost at each iteration. Right: Logarithm of MSE between simulations and ground truth state trajectories at each iteration.

Figure 7: Time and accuracy of different algorithms for sampling from one-dimensional GP priors with Matérn 5/2. Left: Logarithm of time taken to generate a draw from GP priors; the x-axis is the number of grid points p. Right: 2-Wasserstein distances between empirical priors and true distributions over ten points for three cases {z_i}_{i=1}^{10}, {z_i}_{i=250}^{259}, {z_i}_{i=491}^{500} when p = 500.

Figure 8: Time and accuracy of different algorithms for sampling from one-dimensional GP posteriors with Matérn 5/2. Left: Logarithm of time taken to generate a draw from GP posteriors over m = 1000 points, x-axis is the number of observations n. Right: Logarithm of MSE over m = 1000 points.

Figure 7 and Figure 8 show the performance of one-dimensional examples with Matérn 5/2 in section 4.1, Figure 9 and Figure 10 show the performance of two-dimensional examples with Matérn 5/2 in section 4.2.

Figure 9: Time and accuracy of different algorithms for sampling from GP priors over full grids with Matérn 5/2. Left: Logarithm of time taken to generate a draw from GP priors. Right: 2-Wasserstein distances between empirical priors and true distributions over nine points for three cases ×_{j=1}^d {z_i}_{i=1}^3, ×_{j=1}^d {z_i}_{i=15}^{17}, ×_{j=1}^d {z_i}_{i=29}^{31}, where {z_i}_{i=1}^{31} = {−5 + 10·2^{−5}, ..., 5 − 10·2^{−5}}.

Figure 10: Time and accuracy of different algorithms for sampling from GP posteriors over full grids with Matérn 5/2. Left: Logarithm of time taken to generate a draw from GP posteriors over 1024 points. Right: Logarithm of MSE over 1024 points.


