SPARSE GAUSSIAN PROCESS VARIATIONAL AUTOENCODERS

Abstract

Large, multi-dimensional spatio-temporal datasets are omnipresent in modern science and engineering. An effective framework for handling such data is Gaussian process deep generative models (GP-DGMs), which employ GP priors over the latent variables of DGMs. Existing approaches for performing inference in GP-DGMs do not support sparse GP approximations based on inducing points, which are essential for the computational efficiency of GPs, nor do they handle missing data (a natural occurrence in many spatio-temporal datasets) in a principled manner. We address these shortcomings with the development of the sparse Gaussian process variational autoencoder (SGP-VAE), characterised by the use of partial inference networks for parameterising sparse GP approximations. Leveraging the benefits of amortised variational inference, the SGP-VAE enables inference in multi-output sparse GPs on previously unobserved data with no additional training. The SGP-VAE is evaluated in a variety of experiments, where it outperforms alternative approaches including multi-output GPs and structured VAEs.

1. INTRODUCTION

Increasing amounts of large, multi-dimensional datasets that exhibit strong spatio-temporal dependencies are arising from a wealth of domains, including the earth, social and environmental sciences (Atluri et al., 2018). For example, consider modelling daily atmospheric measurements taken by weather stations situated across the globe. Such data are (1) large in number; (2) subject to strong spatio-temporal dependencies; (3) multi-dimensional; and (4) non-Gaussian with complex dependencies across outputs. There exist two venerable approaches for handling these characteristics: Gaussian process (GP) regression and deep generative models (DGMs). GPs provide a framework for encoding high-level assumptions about latent processes, such as smoothness or periodicity, making them effective in handling spatio-temporal dependencies. Yet existing approaches do not support the use of the flexible likelihoods necessary for modelling complex multi-dimensional outputs. In contrast, DGMs support the use of flexible likelihoods; however, they do not provide a natural route through which spatio-temporal dependencies can be encoded. The amalgamation of GPs and DGMs, GP-DGMs, uses latent functions drawn independently from GPs, which are then passed through a DGM at each input location. GP-DGMs combine the complementary strengths of both approaches, making them naturally suited to modelling spatio-temporal datasets. Intrinsic to the application of many spatio-temporal datasets is the notion of tasks. For instance: medicine has individual patients; each trial in a scientific experiment produces an individual dataset; and, in the case of a single large dataset, it is often convenient to split it into separate tasks to improve computational efficiency.
GP-DGMs support the presence of multiple tasks in a memory-efficient way through the use of amortisation, giving rise to the Gaussian process variational autoencoder (GP-VAE), a model that has recently gained considerable attention from the research community (Pearce, 2020; Fortuin et al., 2020; Casale et al., 2018; Campbell & Liò, 2020; Ramchandran et al., 2020). However, previous work does not support sparse GP approximations based on inducing points, a necessity for modelling even moderately sized datasets. Furthermore, many spatio-temporal datasets contain an abundance of missing data: weather measurements are often absent due to sensor failure, and in medicine only single measurements are taken at any instance. Handling partial observations in a principled manner is essential for modelling spatio-temporal data, but is yet to be considered. Our key technical contributions are as follows: i) we develop the sparse GP-VAE (SGP-VAE), which uses inference networks to parameterise multi-output sparse GP approximations; ii) we employ a suite of partial inference networks for handling missing data in the SGP-VAE; and iii) we conduct a rigorous evaluation of the SGP-VAE in a variety of experiments, demonstrating excellent performance relative to existing multi-output GPs and structured VAEs.

2. A FAMILY OF SPATIO-TEMPORAL VARIATIONAL AUTOENCODERS

Consider the multi-task regression problem in which we wish to model T datasets D = {D^(t)}_{t=1}^T, each of which comprises input/output pairs D^(t) = {x_n^(t), y_n^(t)}_{n=1}^{N_t}, with x_n^(t) ∈ R^D and y_n^(t) ∈ R^P. Further, let any possible permutation of observed values be potentially missing: O_n^(t) denotes the index set of observed dimensions of y_n^(t), and y_n^{o(t)} the corresponding observed values. We model each observation, conditioned on a corresponding latent variable f_n^(t) ∈ R^K, as a fully-factorised Gaussian distribution parameterised by passing f_n^(t) through a decoder deep neural network (DNN) with parameters θ_2. The elements of f_n^(t) correspond to the evaluation of a K-dimensional latent function f^(t) = (f_1^(t), f_2^(t), ..., f_K^(t)) at input x_n^(t); that is, f_n^(t) = f^(t)(x_n^(t)). Each latent function f^(t) is modelled as a draw from K independent GP priors with hyper-parameters θ_1 = {θ_{1,k}}_{k=1}^K, giving rise to the complete probabilistic model:

\[ f^{(t)} \sim \prod_{k=1}^{K} \mathcal{GP}\big(0,\, k_{\theta_{1,k}}(x, x')\big), \qquad y^{(t)} \mid f^{(t)} \sim \prod_{n=1}^{N_t} \mathcal{N}\big(y_n^{o(t)};\, \mu^{o}_{\theta_2}(f_n^{(t)}),\, \operatorname{diag} \sigma^{o\,2}_{\theta_2}(f_n^{(t)})\big) \]   (1)

where the first set of factors is p_{θ_1}(f_k^(t)), the second is p_{θ_2}(y_n^{o(t)} | f^(t), x_n^(t), O_n^(t)), and μ^o_{θ_2}(f_n^(t)) and σ^{o2}_{θ_2}(f_n^(t)) are the outputs of the decoder indexed by O_n^(t). We shall refer to the set θ = {θ_1, θ_2} as the model parameters, which are shared across tasks. The probabilistic model in equation 1 explicitly accounts for dependencies between latent variables through the GP prior. The motive for the latent structure is twofold: to discover a simpler representation of each observation, and to capture the dependencies between observations at different input locations.
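As a concrete illustration, the generative process in equation 1 can be sketched in a few lines of NumPy. The single-layer `decoder` and the kernel hyper-parameters below are illustrative stand-ins for the decoder DNN and GP priors, not the architectures used in the paper.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') = v * exp(-|x - x'|^2 / (2 l^2))."""
    d2 = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def decoder(f, W, b):
    """Single-layer stand-in for the decoder DNN with parameters theta_2."""
    return np.tanh(f @ W) + b

def sample_gp_dgm(X, K=2, P=5, noise=0.1, seed=0):
    """Draw from the generative model of equation 1: K independent GP draws
    over the inputs X, decoded pointwise into the mean of a Gaussian likelihood."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    L = np.linalg.cholesky(rbf_kernel(X, X) + 1e-6 * np.eye(N))
    f = L @ rng.standard_normal((N, K))        # latent functions, one per column
    W, b = rng.standard_normal((K, P)), rng.standard_normal(P)
    y = decoder(f, W, b) + noise * rng.standard_normal((N, P))
    return f, y

X = np.linspace(0.0, 5.0, 50)[:, None]
f, y = sample_gp_dgm(X)
```

Each of the P output dimensions depends on all K latent GPs through the decoder, which is what induces the complex cross-output dependencies discussed above.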

2.1. MOTIVATION FOR SPARSE APPROXIMATIONS AND AMORTISED INFERENCE

The use of amortised inference in DGMs and sparse approximations in GPs enables inference in these respective models to scale to large quantities of data. To ensure the same for the GP-DGM described in equation 1, we require both techniques. In particular, amortised inference is necessary to prevent the number of variational parameters scaling with O(Σ_{t=1}^T N^(t)). Further, the inference network can be used to condition on previously unobserved data without needing to learn new variational parameters. Similarly, sparse approximations are necessary to prevent the computational complexity increasing cubically with the size of each task, O(Σ_{t=1}^T (N^(t))³). Unfortunately, it is far from straightforward to combine sparse approximations and amortised inference in a computationally efficient way. To see this, consider the standard form for the sparse GP approximate posterior, q(f) = p_{θ_1}(f_{\u} | u) q(u), where q(u) = N(u; m, S), with m, S and the inducing point locations Z being the variational parameters. q(u) does not decompose into a product over N^(t) factors and is therefore not amenable to per-datapoint amortisation; that is, m and S must be optimised as free-form variational parameters. A naïve approach to achieving per-datapoint amortisation is to decompose q(u) into the prior p_{θ_1}(u) multiplied by a product of approximate likelihoods, one for each inducing point. Each approximate likelihood is itself the product of per-datapoint approximate likelihoods, which depend on both the observation y_n^o and the distance of the input x_n to that of the inducing point. An inference network which takes these two values as inputs can be used to obtain the parameters of the approximate likelihood factors. Whilst we found that this approach works, it is somewhat unprincipled. Moreover, it requires passing each datapoint/inducing-point pair through an inference network, which scales very poorly.
In the following section, we introduce a theoretically principled decomposition of q(f), which we term the sparse structured approximate posterior, that enables efficient amortisation.

2.2. THE SPARSE STRUCTURED APPROXIMATE POSTERIOR

By simultaneously leveraging amortised inference and sparse GP approximations, we can perform efficient and scalable approximate inference. We specify the sparse structured approximate posterior, q(f^(t)), which approximates the intractable true posterior for task t:

\[ p_\theta(f^{(t)} \mid y^{(t)}, X^{(t)}) = \frac{1}{Z_p}\, p_{\theta_1}(f^{(t)}) \prod_{n=1}^{N_t} p_{\theta_2}(y_n^{o(t)} \mid f^{(t)}, x_n^{(t)}, O_n^{(t)}) \approx \frac{1}{Z_q}\, p_{\theta_1}(f^{(t)}) \prod_{n=1}^{N_t} l_{\phi_l}(u; y_n^{o(t)}, x_n^{(t)}, Z) = q(f^{(t)}). \]   (2)

Analogous to its presence in the true posterior, the approximate posterior retains the GP prior, yet replaces each non-conjugate likelihood factor with an approximate likelihood, l_{φ_l}(u; y_n^{o(t)}, x_n^(t), Z), over a set of KM 'inducing points', u = ∪_{k=1}^K ∪_{m=1}^M u_{mk}, at 'inducing locations' Z = ∪_{k=1}^K ∪_{m=1}^M z_{mk}. For tractability, we restrict the approximate likelihoods to be Gaussians factorised across each latent dimension, parameterised by passing each observation through a partial inference network:

\[ l_{\phi_l}(u_k; y_n^{o(t)}, x_n^{(t)}, Z_k) = \mathcal{N}\big( \mu_{\phi_l,k}(y_n^{o(t)});\; k_{f_{nk}^{(t)} u_k} K_{u_k u_k}^{-1} u_k,\; \sigma^2_{\phi_l,k}(y_n^{o(t)}) \big) \]

where φ_l denotes the weights and biases of the partial inference network, whose outputs are μ_{φ_l,k}(·) and σ²_{φ_l,k}(·). This form is motivated by the work of Bui et al. (2017), who demonstrate the optimality of approximate likelihoods of the form N(g_n; k_{f_{nk}^(t) u_k} K_{u_k u_k}^{-1} u_k, v_n), a result we prove in Appendix A.1. Whilst, in general, the optimal free-form values of g_n and v_n depend on all of the data points, we make the simplifying assumption that they depend only on y_n^{o(t)}. For GP regression with Gaussian noise, this assumption holds exactly, as g_n = y_n and v_n = σ_y² (Bui et al., 2017).
The resulting approximate posterior can be interpreted as the exact posterior induced by a surrogate regression problem, in which 'pseudo-observations' g_n are produced from a linear transformation of the inducing points with additive 'pseudo-noise' of variance v_n:

\[ g_n = k_{f_{nk}^{(t)} u_k} K_{u_k u_k}^{-1} u_k + \sqrt{v_n}\, \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, 1). \]

The inference network learns to construct this surrogate regression problem such that it results in a posterior close to our target posterior. By sharing the variational parameters φ = {φ_l, Z} across tasks, inference is amortised across both datapoints and tasks. The approximate posterior for a single task corresponds to the product of K independent GPs, with mean and covariance functions

\[ m_k^{(t)}(x) = k_{f_k^{(t)} u_k} \Phi_k^{(t)} K_{u_k f_k^{(t)}} \big(\Sigma^{(t)}_{\phi_l,k}\big)^{-1} \mu^{(t)}_{\phi_l,k} \]
\[ k_k^{(t)}(x, x') = k_{f_k^{(t)} f_k'^{(t)}} - k_{f_k^{(t)} u_k} K_{u_k u_k}^{-1} k_{u_k f_k'^{(t)}} + k_{f_k^{(t)} u_k} \Phi_k^{(t)} k_{u_k f_k'^{(t)}} \]

where (Φ_k^(t))^{-1} = K_{u_k u_k} + K_{u_k f_k^(t)} (Σ^(t)_{φ_l,k})^{-1} K_{f_k^(t) u_k}, [μ^(t)_{φ_l,k}]_i = μ_{φ_l,k}(y_i^{o(t)}) and [Σ^(t)_{φ_l,k}]_{ij} = δ_{ij} σ²_{φ_l,k}(y_i^{o(t)}). See Appendix A.2 for a complete derivation. The computational complexity associated with evaluating the mean and covariance functions is O(KM²N^(t)), a significant improvement over the O(P³N^(t)³) cost associated with exact multi-output GPs when KM² ≪ P³N^(t)². We refer to the combination of the aforementioned probabilistic model and sparse structured approximate posterior as the SGP-VAE. The SGP-VAE addresses three major shortcomings of existing sparse GP frameworks. First, the inference network can be used to condition on previously unobserved data without needing to learn new variational parameters. Suppose we use the standard sparse GP variational approximation q(f) = p_{θ_1}(f_{\u} | u) q(u), where q(u) = N(u; m, S). If more data are observed, m and S have to be re-optimised.
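To make the construction concrete, the following sketch computes the mean and covariance above for a single latent dimension, taking the kernel matrices and the inference-network outputs (mu, sigma2) as given. All names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def sparse_structured_posterior(Kuu, Kuf, Ksu, Kss, mu, sigma2, jitter=1e-8):
    """Mean and covariance of one latent GP under the sparse structured
    approximate posterior, given the kernel matrices and the approximate-
    likelihood parameters (mu, sigma2) that a partial inference network
    would produce. Shapes: Kuu (M, M), Kuf (M, N), Ksu (S, M), Kss (S, S),
    mu (N,), sigma2 (N,)."""
    M = Kuu.shape[0]
    Sinv = np.diag(1.0 / sigma2)                               # Sigma^{-1}, diagonal
    Phi = np.linalg.inv(Kuu + Kuf @ Sinv @ Kuf.T + jitter * np.eye(M))
    Kuu_inv = np.linalg.inv(Kuu + jitter * np.eye(M))
    mean = Ksu @ Phi @ Kuf @ Sinv @ mu                         # m(x)
    cov = Kss - Ksu @ Kuu_inv @ Ksu.T + Ksu @ Phi @ Ksu.T      # k(x, x')
    return mean, cov
```

A convenient correctness check: with the inducing points placed at the data (u = f), the expressions collapse to the standard GP regression posterior with heteroscedastic noise Σ. The cost is dominated by the M × N matrix products, i.e. O(M²N) per latent dimension.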
When an inference network is used to parameterise q(u), the approximate posterior is 'automatically' updated by mapping from the new observations to their corresponding approximate likelihood terms. Second, the complexity of the approximate posterior can be modified as desired with no changes to the inference network, or additional training, necessary: any changes in the morphology of inducing points corresponds to a deterministic transformation of the inference network outputs. Third, if the inducing point locations are fixed, then the number of variational parameters does not depend on the size of the dataset, even as more inducing points are added. This contrasts with the standard approach, in which new variational parameters are appended to m and S.
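The first point can be illustrated numerically: because q(u) is the prior multiplied by per-datapoint approximate-likelihood sites, conditioning on a new observation only appends one site (a rank-1 precision update) whose parameters would come from the inference network; nothing is re-optimised. The prior and site values below are arbitrary synthetic stand-ins.

```python
import numpy as np

def posterior_from_sites(Kuu, A, g, v):
    """Gaussian q(u) from the prior N(0, Kuu) and sites N(g[n]; A[n] @ u, v[n]):
    site precisions and precision-weighted means simply add."""
    prec = np.linalg.inv(Kuu) + (A.T / v) @ A
    cov = np.linalg.inv(prec)
    mean = cov @ (A.T @ (g / v))
    return mean, cov

# arbitrary synthetic prior and sites
rng = np.random.default_rng(0)
M = 4
L = rng.standard_normal((M, M))
Kuu = L @ L.T + M * np.eye(M)
A = rng.standard_normal((11, M))
g = rng.standard_normal(11)
v = np.full(11, 0.5)

m_old, S_old = posterior_from_sites(Kuu, A[:10], g[:10], v[:10])
m_new, S_new = posterior_from_sites(Kuu, A, g, v)   # one extra site, no re-optimisation
```

The update from ten to eleven observations is a deterministic function of the new site's parameters, which is exactly why no new variational parameters are needed.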

2.3. TRAINING THE SGP-VAE

Learning and inference in the SGP-VAE are concerned with determining the model parameters θ and variational parameters φ. Both objectives can be attained simultaneously by maximising the evidence lower bound (ELBO), given by L_ELBO = Σ_{t=1}^T L^(t)_ELBO where

\[ \mathcal{L}^{(t)}_{\mathrm{ELBO}} = \mathbb{E}_{q(f^{(t)})}\left[ \log \frac{p_\theta(y^{(t)}, f^{(t)})}{q(f^{(t)})} \right] = \mathbb{E}_{q(f^{(t)})}\left[ \log p_\theta(y^{(t)} \mid f^{(t)}) \right] - \mathrm{KL}\big( q^{(t)}(u) \,\|\, p_{\theta_1}(u) \big) \]

and q^(t)(u) ∝ p_{θ_1}(u) ∏_{n=1}^{N_t} l_{φ_l}(u; y_n^{o(t)}, x_n^(t), Z). Fortunately, since both q^(t)(u) and p_{θ_1}(u) are multivariate Gaussians, the final term and its gradients have analytic solutions. The first term amounts to propagating a Gaussian through a non-linear DNN, so must be approximated using a Monte Carlo estimate. We employ the reparameterisation trick (Kingma & Welling, 2014) to account for the dependency of the sampling procedure on both θ and φ when estimating its gradients. We mini-batch over tasks, such that only a single L^(t)_ELBO is computed per update. Importantly, in combination with the inference network, this means that we avoid having to retain the O(TM²) terms associated with the T Cholesky factors required were we to use a free-form q(u) for each task. Instead, the memory requirement is dominated by the O(KM² + KNM + |φ_l|) terms associated with storing K_{u_k u_k}, K_{u_k f_k^(t)} and the inference network parameters.
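A minimal sketch of the two ELBO terms, assuming the per-task posterior over latents has been reduced to marginal means and variances at the observations, and with `loglik` standing in for the decoder log-likelihood. This illustrates the analytic-KL-plus-reparameterised-Monte-Carlo split, not the paper's implementation.

```python
import numpy as np

def gauss_kl(m_q, S_q, m_p, S_p):
    """Closed-form KL(N(m_q, S_q) || N(m_p, S_p)) between multivariate Gaussians."""
    M = m_q.shape[0]
    Sp_inv = np.linalg.inv(S_p)
    return 0.5 * (np.trace(Sp_inv @ S_q) + (m_p - m_q) @ Sp_inv @ (m_p - m_q)
                  - M + np.linalg.slogdet(S_p)[1] - np.linalg.slogdet(S_q)[1])

def elbo_estimate(y, m_f, v_f, loglik, kl, n_samples=8, seed=0):
    """Reparameterised Monte Carlo estimate of the expected log-likelihood,
    minus the (already computed) KL term. m_f, v_f are the marginal means and
    variances of q(f) at the observations; loglik is the decoder log-likelihood."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples,) + m_f.shape)
    f = m_f + np.sqrt(v_f) * eps               # f = m + sqrt(v) * eps
    return np.mean([loglik(y, fi) for fi in f]) - kl
```

Writing the sample as m + √v·ε makes the estimate a differentiable function of the variational parameters, which is the point of the reparameterisation trick.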

2.4. PARTIAL INFERENCE NETWORKS

Partially observed data are regularly encountered in spatio-temporal datasets, making it necessary to handle them in a principled manner. Missing data are naturally handled by Bayesian inference; however, models that use inference networks require special treatment. One approach to handling partially observed data is to impute missing values with zeros (Nazabal et al., 2020; Fortuin et al., 2020). Whilst simple to implement, zero imputation is theoretically unappealing, as the inference network can no longer distinguish between a missing value and a true zero. Instead, we turn towards the ideas of Deep Sets (Zaheer et al., 2017). By coupling each observed value with its dimension index, we may reinterpret each partial observation as a permutation-invariant set. We define a family of permutation-invariant partial inference networks as

\[ \big( \mu_\phi(y_n^o),\, \log \sigma^2_\phi(y_n^o) \big) = \rho_{\phi_2}\Big( \sum_{p \in O_n} h_{\phi_1}(s_{np}) \Big) \]   (6)

where h_{φ_1}: R² → R^R and ρ_{φ_2}: R^R → R^{2K} are DNN mappings with parameters φ_1 and φ_2, respectively, and s_{np} denotes the couple of observed value y_{np} and corresponding dimension index p. The formulation in equation 6 is identical to the partial variational autoencoder (VAE) framework established by Ma et al. (2019). A number of partial inference networks conform to this general framework, three of which we consider here. PointNet Inspired by the PointNet approach of Qi et al. (2017), later developed by Ma et al. (2019) for use in partial VAEs, the PointNet specification uses the concatenation of dimension index with observed value: s_{np} = (p, y_{np}). This specification treats the dimension indices as continuous variables, so an implicit assumption of PointNet is smoothness between the values of neighbouring dimensions. Although valid in computer vision applications, it is ill-suited to tasks in which the indexing of dimensions is arbitrary.
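A minimal NumPy sketch of equation 6, with single-layer maps standing in for the DNNs h and ρ; the dimensions R and K below are arbitrary. Because the per-dimension embeddings are summed, the output is invariant to the order in which the observed dimensions are presented.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def partial_inference_net(y, observed, params):
    """Permutation-invariant partial inference network (equation 6): embed each
    (index, value) couple s_np with h, sum the embeddings, then map the sum
    through rho to a Gaussian mean and log-variance over the K latent
    dimensions. h and rho are single-layer maps here; real networks are deeper."""
    W_h, b_h, W_rho, b_rho = params
    s = np.stack([[float(p), y[p]] for p in observed])   # s_np couples, (|O_n|, 2)
    e = relu(s @ W_h + b_h).sum(axis=0)                  # permutation-invariant sum
    out = e @ W_rho + b_rho
    K = out.size // 2
    return out[:K], out[K:]                              # mu, log sigma^2

# illustrative parameters: R = 8 embedding units, K = 2 latent dimensions
rng = np.random.default_rng(0)
R, K, P = 8, 2, 5
params = (rng.standard_normal((2, R)), rng.standard_normal(R),
          rng.standard_normal((R, 2 * K)), rng.standard_normal(2 * K))
y = rng.standard_normal(P)
mu, log_s2 = partial_inference_net(y, [0, 2, 4], params)
```

Note that the missing dimensions simply never enter the sum, so no imputed placeholder values are needed.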
IndexNet Alternatively, one may use the dimension index to select the first DNN mapping: h_{φ_1}(s_{np}) = h_{φ_1,p}(y_{np}). Whereas PointNet treats dimension indices as points in space, this specification retains their role as indices. We refer to it as the IndexNet specification.

FactorNet A special case of IndexNet, first proposed by Vedantam et al. (2017), uses a separate inference network for each observation dimension. The approximate likelihood is factorised into a product of Gaussians, one for each output dimension:

\[ l_{\phi_l}(u_k; y_n^o, x_n, Z_k) = \prod_{p \in O_n} \mathcal{N}\big( \mu_{\phi_l,pk}(y_{np});\; k_{f_{nk} u_k} K_{u_k u_k}^{-1} u_k,\; \sigma^2_{\phi_l,pk}(y_{np}) \big). \]

We term this approach FactorNet. See Appendix G for the corresponding computational graphs. Note that FactorNet is equivalent to IndexNet with ρ_{φ_2} defined by the deterministic transformation of the natural parameters of Gaussian distributions. Since IndexNet allows this transformation to be learnt, we might expect it always to produce a better partial inference network for the task at hand. However, it is important to consider the ability of inference networks to generalise: although more complex inference networks will better approximate the optimal non-amortised approximate posterior on training data, they may produce poor approximations to it on held-out data. In particular, FactorNet is constrained to consider the individual contribution of each observation dimension, whereas the others are not. Doing so is necessary for generalising to different quantities and patterns of missingness, hence we anticipate FactorNet performing better in such settings.
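The natural-parameter transformation mentioned above can be made explicit for a single latent dimension: multiplying the per-dimension Gaussian factors amounts to adding precisions and precision-weighting the means. This is a sketch; in FactorNet the per-dimension means and variances would come from separate inference networks.

```python
import numpy as np

def combine_factors(mus, sigma2s):
    """Product of one-dimensional Gaussian factors N(mu_p, sigma2_p), the
    deterministic natural-parameter combination underlying FactorNet:
    precisions add, and the combined mean is precision-weighted."""
    prec = sum(1.0 / s2 for s2 in sigma2s)
    mu = sum(m / s2 for m, s2 in zip(mus, sigma2s)) / prec
    return mu, 1.0 / prec

# e.g. two observed dimensions each contribute a factor over the same latent
mu, var = combine_factors([0.0, 2.0], [1.0, 1.0])
print(mu, var)  # 1.0 0.5
```

Because each factor's contribution is explicit, dropping a missing dimension simply removes its term from the two sums, which is why FactorNet copes naturally with varying missingness.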

3. RELATED WORK

We focus our comparison on approximate inference techniques; however, Appendix D presents a unifying view of GP-DGMs.

Structured Variational Autoencoder

Only recently has the use of structured latent variable priors in VAEs been considered. In their seminal work, Johnson et al. (2016) investigate the combination of probabilistic graphical models with neural networks to learn structured latent variable representations. The authors consider a two-stage iterative procedure, whereby the optimum of a surrogate objective function, containing approximate likelihoods in place of true likelihoods, is found and substituted into the original ELBO. The resultant structured VAE (SVAE) objective is then optimised. In the case of fixed model parameters θ, the SVAE objective is equivalent to optimising the ELBO using the structured approximate posterior over latent variables q(z) ∝ p_θ(z) l_φ(z|y). Accordingly, the SGP-VAE can be viewed as an instance of the SVAE. Lin et al. (2018) build upon the SVAE, proposing a structured approximate posterior of the form q(z) ∝ q_φ(z) l_φ(z|y). The authors refer to the approximate posterior as the structured inference network (SIN). Rather than using the latent prior p_θ(z), SIN incorporates the model's latent structure through q_φ(z). The core advantage of SIN is its extension to more complex latent priors containing non-conjugate factors. For GP priors, as shown by Opper & Archambeau (2009), the structured approximate posterior is identical in form to the optimal Gaussian approximation to the true posterior. Most similar to ours is the approach of Pearce (2020), who considers the structured approximate posterior

\[ q(f) = \frac{1}{Z_q}\, p_{\theta_1}(f) \prod_{n=1}^{N} l_{\phi_l}(f_n; y_n). \]

We refer to this as the GP-VAE. Pearce's approach is a special case of the SGP-VAE with u = f and no missing data. Moreover, Pearce only considers the application to modelling pixel dynamics and the comparison to the standard VAE. See Appendix B for further details.

4. EXPERIMENTS

We investigate the performance of the SGP-VAE in illustrative bouncing-ball experiments, followed by experiments in the small and large data regimes. The first bouncing-ball experiment provides a visualisation of the mechanics of the SGP-VAE and a quantitative comparison to other structured VAEs. The subsequent small-scale experiments demonstrate the utility of the GP-VAE and show that amortisation, especially in the presence of partially observed data, is not at the expense of predictive performance. In the final two experiments, we showcase the efficacy of the SGP-VAE on large, multi-output spatio-temporal datasets for which the use of amortisation is necessary. Full experimental details are provided in Appendix E.

Using a two-dimensional latent space with periodic kernels, Figure 1b compares the posterior latent GPs and the mean predictive distribution with the ground truth for a single image sequence. Observe that the SGP-VAE has 'disentangled' the dynamics of each ball, using a single latent dimension to model each. The SGP-VAE reproduces the image sequences with impressive precision, owing in equal measure to (1) the ability of the GP priors to model the latent dynamics and (2) the flexibility of the likelihood function to map to the high-dimensional observations.

4.2. SMALL-SCALE EXPERIMENTS

EEG Adopting the experimental procedure laid out by Requeima et al. (2019), we consider an EEG dataset consisting of N = 256 measurements taken over a one-second period. Each measurement comprises voltage readings taken by seven electrodes, FZ and F1-F6, positioned on the patient's scalp (x_n ∈ R^1, y_n ∈ R^7). The goal is to predict the final 100 samples for electrodes FZ, F1 and F2, having observed the first 156 samples as well as all 256 samples for electrodes F3-F6.

Jura The Jura dataset is a geospatial dataset comprising N = 359 measurements of the topsoil concentrations of three heavy metals (Cadmium, Nickel and Zinc) collected from a 14.5 km² region of the Swiss Jura (x_n ∈ R^2, y_n ∈ R^3) (Goovaerts, 1997). Adopting the experimental procedure laid out by others (Goovaerts, 1997; Álvarez & Lawrence, 2011; Requeima et al., 2019), the dataset is divided into a training set consisting of Nickel and Zinc measurements for all 359 locations and Cadmium measurements for just 259 locations. Conditioned on the observed training set, the goal is to predict the Cadmium measurements at the remaining 100 locations.

Table 1 compares the performance of the GP-VAE using the three partial inference networks presented in Section 2.4, as well as zero imputation (ZI), with independent GPs (IGP) and the GP autoregressive regression model (GPAR), which, to our knowledge, has the strongest published performance on these datasets. We also give the results for the best-performing GP-VAE using a non-amortised, or 'free-form' (FF), approximate posterior, with model parameters θ kept fixed to the optimum found by the amortised GP-VAE and variational parameters initialised to the output of the optimised inference network. All GP-VAE models use a two- and three-dimensional latent space for EEG and Jura, respectively, with squared exponential (SE) kernels.
The results highlight the poor performance of independent GPs relative to multi-output GPs, demonstrating the importance of modelling output dependencies. The GP-VAE achieves impressive SMSE and MAE on the EEG and Jura datasets using all partial inference networks except PointNet. In Appendix F we demonstrate the superior performance of the GP-VAE relative to the GPPVAE, which can be attributed to the use of the structured approximate posterior in place of the mean-field approximate posterior used by the GPPVAE. Importantly, the negligible difference between the results using free-form and amortised approximate posteriors indicates that amortisation is not at the expense of predictive performance. Whilst GPAR performs as strongly as the GP-VAE in the small-scale experiments above, it has two key limitations which severely restrict the types of applications in which it can be used. First, it can only be used with specific patterns of missing data, not when the pattern of missingness is arbitrary. Second, it is not scalable, and would require further development to handle the large datasets considered in this paper. In contrast, the SGP-VAE is far more flexible: it handles arbitrary patterns of missingness, and scales to large numbers of datapoints and tasks. A distinct advantage of the SGP-VAE is that it models P outputs with just K latent GPs. This differs from GPAR, which uses P GPs. Whilst this is not an issue for the small-scale experiments, it quickly becomes computationally burdensome when P becomes large. The true efficacy of the SGP-VAE is demonstrated in the following three experiments, where the number of datapoints and tasks is large, and the patterns of missingness are random.

4.3. LARGE-SCALE EEG EXPERIMENT

We consider an alternative setting to the original small-scale EEG experiment, in which the datasets are formed from T = 60 recordings of length N = 256, each with 64 observed voltage readings (y_n ∈ R^64). For each recording, we simulate electrode 'blackouts' by removing consecutive samples at random. We consider two experiments: in the first, we remove 50% of data from both the training and test datasets; in the second, we remove 10% of data from the training dataset and 50% from the test dataset. Both experiments require the partial inference network to generalise to different patterns of missingness, with the latter also requiring generalisation to different quantities of missingness. Each model is trained on 30 recordings, with predictive performance assessed on the remaining 30 recordings. Figure 2 compares the performance of the SGP-VAE with that of independent GPs as the number of inducing points varies, with M = 256 representing use of the GP-VAE. In each case, we use a 10-dimensional latent space with SE kernels. The SGP-VAE using PointNet results in substantially worse performance than the other partial inference networks, achieving an average SMSE and NLL of 1.30 and 4.05 on the first experiment for M = 256. Similarly, using a standard VAE results in poor performance, achieving an average SMSE and NLL of 1.62 and 3.48. These results are excluded from Figure 2 for the sake of readability. For all partial inference networks, the SGP-VAE achieves a significantly better SMSE than independent GPs in both experiments, owing to its ability to model both input and output dependencies. For the first experiment, the performance using FactorNet is noticeably better than that using either IndexNet or zero imputation; however, this comes at the cost of the greater computational complexity associated with learning an inference network for each output dimension.
Whereas the performance of the SGP-VAE using IndexNet and zero imputation significantly worsens on the second experiment, the performance using FactorNet is comparable to the first experiment. This suggests it is the only partial inference network able to accurately quantify the contribution of each output dimension to the latent posterior, enabling it to generalise to different quantities of missing data. The advantages of using a sparse approximation are clear: using M = 128 inducing points results in a slightly worse average SMSE and NLL at significantly less computational cost.

4.4. JAPANESE WEATHER EXPERIMENT

Finally, we consider a dataset comprising 731 daily climate reports from 156 Japanese weather stations throughout 1980 and 1981, a total of 114,036 multi-dimensional observations. Weather reports consist of a date and location, including elevation, alongside the day's maximum, minimum and average temperature, precipitation and snow depth (x_n^(t) ∈ R^4, y_n^(t) ∈ R^5), any number of which are potentially missing. We treat each week as a single task, resulting in T = 105 tasks with N = 1092 data points each. The goal is to predict the average temperature for all stations on the middle five days, as illustrated in Figure 3. Each model is trained on all the data available from 1980. For evaluation, we use data from both 1980 and 1981 with additional artificial missingness: the average temperature for the middle five days and a random 25% of minimum and maximum temperature measurements. Similar to the second large-scale EEG experiment, the test datasets have more missing data than the training datasets. Table 2 compares the performance of the SGP-VAE using 100 inducing points to that of a standard VAE using FactorNet and a baseline of mean imputation (MI). All models use a three-dimensional latent space with SE kernels. All models significantly outperform the mean imputation baseline and are able to generalise inference to the unseen 1981 dataset without any loss in predictive performance. The SGP-VAE achieves better predictive performance than both the standard VAE and independent GPs, showcasing its effectiveness in modelling large spatio-temporal datasets. The SGP-VAE using FactorNet achieves the best predictive performance on both datasets. The results indicate that FactorNet is the only partial inference network capable of generalising to different quantities and patterns of missingness, supporting the hypothesis made in Section 2.4.

A MATHEMATICAL DERIVATIONS

A.1 OPTIMALITY OF APPROXIMATE LIKELIHOODS

To simplify notation, we shall consider the case P = 1 and K = 1. Opper & Archambeau (2009) considered the problem of performing variational inference in a GP with a non-Gaussian likelihood. Using a multivariate Gaussian approximate posterior, they demonstrate that the optimal approximate posterior takes the form

\[ q(f) = \frac{1}{Z}\, p(f) \prod_{n=1}^{N} \mathcal{N}(f_n; g_n, v_n), \]

requiring a total of 2N variational parameters, {g_n, v_n}_{n=1}^N. In this section, we derive a result that generalises this to inducing-point approximations, showing that for fixed M the optimal approximate posterior can be represented by min(M(M+1)/2 + M, 2N) variational parameters. Following Titsias (2009), we consider an approximate posterior of the form

\[ q(f) = q(u)\, p(f_{\setminus u} \mid u) \]   (8)

where q(u) = N(u; m̂_u, K̂_uu) is constrained to be a multivariate Gaussian with mean m̂_u and covariance K̂_uu. The ELBO is given by

\[ \mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q(f)}[\log p(y \mid f)] - \mathrm{KL}(q(u) \,\|\, p(u)) = \sum_{n=1}^{N} \mathbb{E}_{q(u)} \mathbb{E}_{\mathcal{N}(f_n;\, A_n u + a_n,\, K_{f_n \mid u})}[\log p(y_n \mid f_n)] - \mathrm{KL}(q(u) \,\|\, p(u)) \]   (9)

where

\[ A_n = K_{f_n u} K_{uu}^{-1} \]   (10)
\[ a_n = m_{f_n} - K_{f_n u} K_{uu}^{-1} m_u. \]   (11)

Recall that for a twice-differentiable scalar function h,

\[ \nabla_\Sigma\, \mathbb{E}_{\mathcal{N}(u; \mu, \Sigma)}[h(u)] = \tfrac{1}{2}\, \mathbb{E}_{\mathcal{N}(u; \mu, \Sigma)}[H_h(u)] \]   (12)

where H_h(u) is the Hessian of h at u. Thus, the gradient of the ELBO with respect to K̂_uu can be rewritten as

\[ \nabla_{\hat{K}_{uu}} \mathcal{L}_{\mathrm{ELBO}} = \frac{1}{2} \sum_{n=1}^{N} \mathbb{E}_{\mathcal{N}(u; \hat{m}_u, \hat{K}_{uu})}[H_{h_n}(u)] - \frac{1}{2} K_{uu}^{-1} + \frac{1}{2} \hat{K}_{uu}^{-1} \]   (13)

where h_n(u) = E_{N(f_n; A_n u + a_n, K_{f_n|u})}[log p(y_n | f_n)]. To determine an expression for H_{h_n}, we first consider the gradients of h_n. Let

\[ \alpha_n(\beta_n) = \mathbb{E}_{\mathcal{N}(f_n; \beta_n, K_{f_n \mid u})}[\log p(y_n \mid f_n)] \]   (14)
\[ \beta_n(u) = A_n u + a_n. \]   (15)

The partial derivative of h_n with respect to the j-th element of u can be expressed as

\[ \frac{\partial h_n}{\partial u_j}(u) = \frac{\partial \alpha_n}{\partial \beta_n}(\beta_n(u))\, \frac{\partial \beta_n}{\partial u_j}(u). \]   (16)

Taking derivatives with respect to the i-th element of u gives

\[ \frac{\partial^2 h_n}{\partial u_j \partial u_i}(u) = \frac{\partial^2 \alpha_n}{\partial \beta_n^2}(\beta_n(u))\, \frac{\partial \beta_n}{\partial u_j}(u)\, \frac{\partial \beta_n}{\partial u_i}(u) + \frac{\partial \alpha_n}{\partial \beta_n}(\beta_n(u))\, \frac{\partial^2 \beta_n}{\partial u_j \partial u_i}(u). \]   (17)

Thus, the Hessian is given by

\[ H_{h_n}(u) = \frac{\partial^2 \alpha_n}{\partial \beta_n^2}(\beta_n(u)) \underbrace{\nabla \beta_n(u)}_{M \times 1} \underbrace{[\nabla \beta_n(u)]^\top}_{1 \times M} + \frac{\partial \alpha_n}{\partial \beta_n}(\beta_n(u)) \underbrace{H_{\beta_n}(u)}_{M \times M}. \]   (18)

Since β_n(u) = A_n u + a_n, we have ∇β_n(u) = A_n^⊤ and H_{β_n}(u) = 0. This allows us to write ∇_{K̂_uu} L_ELBO as

\[ \nabla_{\hat{K}_{uu}} \mathcal{L}_{\mathrm{ELBO}} = \frac{1}{2} \sum_{n=1}^{N} \mathbb{E}_{\mathcal{N}(u; \hat{m}_u, \hat{K}_{uu})}\!\left[ \frac{\partial^2 \alpha_n}{\partial \beta_n^2}(\beta_n(u)) \right] A_n^\top A_n - \frac{1}{2} K_{uu}^{-1} + \frac{1}{2} \hat{K}_{uu}^{-1}. \]   (19)

The optimal covariance therefore satisfies

\[ \hat{K}_{uu}^{-1} = K_{uu}^{-1} - \sum_{n=1}^{N} \mathbb{E}_{\mathcal{N}(u; \hat{m}_u, \hat{K}_{uu})}\!\left[ \frac{\partial^2 \alpha_n}{\partial \beta_n^2}(\beta_n(u)) \right] A_n^\top A_n. \]   (20)

Similarly, the gradient of the ELBO with respect to m̂_u can be written as

\[ \nabla_{\hat{m}_u} \mathcal{L}_{\mathrm{ELBO}} = \sum_{n=1}^{N} \nabla_{\hat{m}_u} \mathbb{E}_{\mathcal{N}(u; \hat{m}_u, \hat{K}_{uu})}[h_n(u)] - K_{uu}^{-1}(\hat{m}_u - m_u) = \sum_{n=1}^{N} \mathbb{E}_{\mathcal{N}(u; \hat{m}_u, \hat{K}_{uu})}[\nabla h_n(u)] - K_{uu}^{-1}(\hat{m}_u - m_u) \]   (21)

where we have used the fact that for a differentiable scalar function g,

\[ \nabla_\mu\, \mathbb{E}_{\mathcal{N}(u; \mu, \Sigma)}[g(u)] = \mathbb{E}_{\mathcal{N}(u; \mu, \Sigma)}[\nabla g(u)]. \]   (22)

Using equation 16 and β_n(u) = A_n u + a_n, we get

\[ \nabla h_n(u) = \frac{\partial \alpha_n}{\partial \beta_n}(\beta_n(u))\, A_n^\top \]   (23)

giving

\[ \nabla_{\hat{m}_u} \mathcal{L}_{\mathrm{ELBO}} = \sum_{n=1}^{N} \mathbb{E}_{\mathcal{N}(u; \hat{m}_u, \hat{K}_{uu})}\!\left[ \frac{\partial \alpha_n}{\partial \beta_n}(\beta_n(u)) \right] A_n^\top - K_{uu}^{-1}(\hat{m}_u - m_u). \]   (24)

The optimal mean is therefore

\[ \hat{m}_u = m_u + \sum_{n=1}^{N} \mathbb{E}_{\mathcal{N}(u; \hat{m}_u, \hat{K}_{uu})}\!\left[ \frac{\partial \alpha_n}{\partial \beta_n}(\beta_n(u)) \right] K_{uu} A_n^\top. \]   (25)

Equation 20 and equation 25 show that each n-th observation contributes only a rank-1 term to the optimal approximate posterior precision matrix, corresponding to an optimal approximate posterior of the form

\[ q(f) \propto p(f) \prod_{n=1}^{N} \mathcal{N}\big( K_{f_n u} K_{uu}^{-1} u;\; g_n,\; v_n \big) \]   (26)

where

\[ g_n = v_n\, \mathbb{E}_{\mathcal{N}(u; \hat{m}_u, \hat{K}_{uu})}\!\left[ \frac{\partial \alpha_n}{\partial \beta_n}(\beta_n(u)) \right] + A_n \hat{m}_u \]   (27)
\[ 1/v_n = -\mathbb{E}_{\mathcal{N}(u; \hat{m}_u, \hat{K}_{uu})}\!\left[ \frac{\partial^2 \alpha_n}{\partial \beta_n^2}(\beta_n(u)) \right]. \]   (28)

As a sanity check, for a Gaussian likelihood p(y_n | f_n) = N(y_n; f_n, σ_y²) these expressions give v_n = σ_y² and g_n = y_n, in agreement with Section 2.2. For general likelihoods, these expressions cannot be solved exactly, so g_n and v_n are freely optimised as variational parameters. When N = M and the inducing points are located at the observations, A_n^⊤ A_n is zero everywhere except for the n-th element of its diagonal, and we recover the result of Opper & Archambeau (2009). Note the key role of the linearity of each β_n in this result: without it, H_{β_n} would not necessarily be zero everywhere and the contribution of each n-th term could have arbitrary rank.
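The Gaussian special case can also be checked numerically: with sites N(g_n; A_n u, v_n), g_n = y_n and v_n = σ², the site-based construction of q(u) should coincide with the closed-form optimal sparse variational posterior of Titsias (2009). The kernel and data below are arbitrary test values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, s2 = 30, 7, 0.1
X = np.sort(rng.uniform(0.0, 5.0, N))[:, None]
Z = np.linspace(0.0, 5.0, M)[:, None]
k = lambda P, Q: np.exp(-0.5 * (P - Q.T) ** 2)   # unit squared-exponential kernel
Kuu = k(Z, Z) + 1e-8 * np.eye(M)
Kuf = k(Z, X)
y = rng.standard_normal(N)

# closed-form optimal q(u) for a Gaussian likelihood (Titsias, 2009)
Phi = np.linalg.inv(Kuu + Kuf @ Kuf.T / s2)
m_opt = Kuu @ Phi @ Kuf @ y / s2
S_opt = Kuu @ Phi @ Kuu

# site-based construction: q(u) from p(u) and sites N(y_n; A_n u, s2),
# with A_n = K_{f_n u} Kuu^{-1} the rows of A below
A = np.linalg.solve(Kuu, Kuf).T
prec = np.linalg.inv(Kuu) + A.T @ A / s2
S_site = np.linalg.inv(prec)
m_site = S_site @ (A.T @ y / s2)
print(np.allclose(m_opt, m_site), np.allclose(S_opt, S_site))
```

The agreement of the two constructions is exactly the statement that, for Gaussian noise, 2N site parameters suffice to represent the optimal q(u).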

A.2 POSTERIOR GAUSSIAN PROCESS

For the sake of notational convenience, we shall assume K = 1. First, the mean and covariance of q(u) = N(u; m̂_u, K̂_uu) ∝ p_θ1(u) ∏_{n=1}^{N_t} l_φl(u; y^o_n, x_n, Z) are given by

m̂_u = K_uu Φ K_uf Σ_φl^{-1} μ_φl,   K̂_uu = K_uu Φ K_uu,   (29)

where Φ = (K_uu + K_uf Σ_φl^{-1} K_fu)^{-1}. The approximate posterior over some latent function value f* is obtained by marginalisation of the joint distribution:

q(f*) = ∫ p_θ1(f* | u) q(u) du
      = ∫ N(f*; k_{f*u} K_uu^{-1} u, k_{f*f*} − k_{f*u} K_uu^{-1} k_{uf*}) N(u; m̂_u, K̂_uu) du
      = N(f*; k_{f*u} K_uu^{-1} m̂_u, k_{f*f*} − k_{f*u} K_uu^{-1} k_{uf*} + k_{f*u} K_uu^{-1} K̂_uu K_uu^{-1} k_{uf*}).

Substituting in equation 29 results in a mean and covariance function of the form

m(x) = k_{fu} Φ K_uf Σ_φl^{-1} μ_φl,
k(x, x') = k_{ff'} − k_{fu} K_uu^{-1} k_{uf'} + k_{fu} Φ k_{uf'}.

B THE GP-VAE

As discussed in Section 3, the GP-VAE is described by the structured approximate posterior

q(f) = (1/Z_q(θ, φ)) p_θ1(f) ∏_{n=1}^N l_φl(f_n; y^o_n),   (32)

where l_φl(f_n; y^o_n) = ∏_{k=1}^K N(f_{nk}; μ_φl,k(y^o_n), σ²_φl,k(y^o_n)), and corresponding ELBO

L_ELBO = E_{q(f)}[log (p_θ1(f) p_θ2(y|f)) / ((1/Z_q(θ, φ)) p_θ1(f) l_φl(f; y))]
       = E_{q(f)}[log (p_θ2(y|f) / l_φl(f; y))] + log Z_q(θ, φ).   (33)
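The rearranged ELBO can be checked numerically in the fully conjugate case, where every term has a closed form. The sketch below uses a Gaussian observation model and made-up approximate-likelihood parameters (hypothetical stand-ins for inference network outputs), and verifies that the rearranged form equals the standard ELBO E_q[log p(f)] + E_q[log p(y|f)] − E_q[log q(f)].

```python
import numpy as np

def gauss_xent(m, C, m0, C0):
    """E_{N(f; m, C)}[log N(f; m0, C0)] for Gaussians (C may be zero)."""
    d = len(m)
    C0_inv = np.linalg.inv(C0)
    _, logdet = np.linalg.slogdet(C0)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + np.trace(C0_inv @ C) + (m - m0) @ C0_inv @ (m - m0))

rng = np.random.default_rng(3)
N, sigma2 = 5, 0.4
X = rng.uniform(-2.0, 2.0, N)
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2) + 1e-8 * np.eye(N)
y = rng.normal(size=N)

# Made-up approximate-likelihood parameters (stand-ins for network outputs).
mu = rng.normal(size=N)
S = np.diag(rng.uniform(0.2, 1.0, N))

# q(f) = p(f) l(f; y) / Z_q in closed form, with l(f; y) = N(f; mu, S).
Kq = np.linalg.inv(np.linalg.inv(K) + np.linalg.inv(S))
mq = Kq @ np.linalg.inv(S) @ mu
log_Zq = gauss_xent(mu, np.zeros((N, N)), np.zeros(N), K + S)  # log N(mu; 0, K+S)

# Standard ELBO with Gaussian observation model p(y|f) = N(y; f, sigma2 I).
Ell = gauss_xent(mq, Kq, y, sigma2 * np.eye(N))
elbo_std = gauss_xent(mq, Kq, np.zeros(N), K) + Ell - gauss_xent(mq, Kq, mq, Kq)

# Rearranged form: E_q[log p(y|f) - log l(f; y)] + log Z_q.
elbo_alt = Ell - gauss_xent(mq, Kq, mu, S) + log_Zq

assert np.isclose(elbo_std, elbo_alt)
```

The equality holds because log q(f) = log p(f) + log l(f; y) − log Z_q pointwise whenever q is the exactly normalised product of the prior and the approximate likelihood.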

B.1 TRAINING THE GP-VAE

The final term in equation 33 has the closed-form expression

log Z_q(θ, φ) = ∑_{k=1}^K log N(μ_φl,k; 0, K_{f_k f_k} + Σ_φl,k),   (34)

with each summand equal to log Z_{q_k}(θ, φ). This can be derived by noting that each Z_{q_k}(θ, φ) corresponds to the convolution between two multivariate Gaussians:

Z_{q_k}(θ, φ) = ∫ N(f_k; 0, K_{f_k f_k}) N(μ_φl,k − f_k; 0, Σ_φl,k) df_k.

Similarly, a closed-form expression for E_{q(f)}[log l_φl(f; y)] exists:

E_{q(f)}[log l_φl(f; y)] = ∑_{k=1}^K ∑_{n=1}^N E_{q(f_nk)}[log l_φl(f_nk; y^o_n)]
= ∑_k ∑_n E_{q(f_nk)}[ −(f_nk − μ_φl,k(y^o_n))² / (2σ²_φl,k(y^o_n)) − (1/2) log(2πσ²_φl,k(y^o_n)) ]
= ∑_k ∑_n [ −(Σ̂^k_nn + (μ̂_{k,n} − μ_φl,k(y^o_n))²) / (2σ²_φl,k(y^o_n)) − (1/2) log(2πσ²_φl,k(y^o_n)) ]
= ∑_k ∑_n [ log N(μ̂_{k,n}; μ_φl,k(y^o_n), σ²_φl,k(y^o_n)) − Σ̂^k_nn / (2σ²_φl,k(y^o_n)) ]
= ∑_k [ log N(μ̂_k; μ_φl,k, Σ_φl,k) − ∑_n Σ̂^k_nn / (2σ²_φl,k(y^o_n)) ],

where Σ̂_k = k̂_k(X, X) and μ̂_k = m̂_k(X), with

m̂_k(x) = k_{x f_k}(K_{f_k f_k} + Σ_φl,k)^{-1} μ_φl,k,
k̂_k(x, x') = k(x, x') − k_{x f_k}(K_{f_k f_k} + Σ_φl,k)^{-1} k_{f_k x'}.

E_{q(f)}[log p_θ2(y|f)] is intractable, hence must be approximated by a Monte Carlo estimate. Together with the closed-form expressions for the other two terms, we can form an unbiased estimate of the ELBO, the gradients of which can be estimated using the reparameterisation trick (Kingma & Welling, 2014).
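The closed-form expression for E_q[log l] can be checked against one-dimensional numerical integration, since this term depends on q only through its marginals. A sketch with made-up values for the approximate-likelihood parameters (a single latent dimension, K = 1):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 6
X = rng.uniform(-2.0, 2.0, N)
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2) + 1e-8 * np.eye(N)
mu = rng.normal(size=N)                 # mu_{phi_l}(y_n), made up
s2 = rng.uniform(0.2, 1.0, N)           # sigma^2_{phi_l}(y_n), made up

# q(f) = N(mhat, Shat) from combining the prior with the approximate likelihood.
Shat = np.linalg.inv(np.linalg.inv(K) + np.diag(1.0 / s2))
mhat = Shat @ (mu / s2)

# Closed form: sum_n log N(mhat_n; mu_n, s2_n) - Shat_nn / (2 s2_n).
closed = np.sum(-0.5 * np.log(2 * np.pi * s2) - (mhat - mu) ** 2 / (2 * s2)
                - np.diag(Shat) / (2 * s2))

# Direct numerical integration of E_{q(f_n)}[log N(f_n; mu_n, s2_n)] per point.
direct = 0.0
for n in range(N):
    sd = np.sqrt(Shat[n, n])
    f = np.linspace(mhat[n] - 8 * sd, mhat[n] + 8 * sd, 4001)
    qn = np.exp(-0.5 * ((f - mhat[n]) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    log_ln = -0.5 * np.log(2 * np.pi * s2[n]) - (f - mu[n]) ** 2 / (2 * s2[n])
    g = qn * log_ln
    direct += 0.5 * (f[1] - f[0]) * np.sum(g[:-1] + g[1:])  # trapezoidal rule

assert np.isclose(direct, closed, atol=1e-3)
```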

B.2 AN ALTERNATIVE SPARSE APPROXIMATION

An alternative approach to introducing a sparse GP approximation is to directly parameterise the structured approximate posterior at the inducing points u:

q(f) = (1/Z_q(θ, φ)) p_θ1(f) ∏_{n=1}^N l_φl(u; y^o_n, x_n, Z),

where l_φl(u; y^o_n, x_n, Z), the approximate likelihood, is a fully-factorised Gaussian distribution parameterised by a partial inference network:

l_φl(u; y^o_n, x_n, Z) = ∏_{k=1}^K ∏_{m=1}^M N(u_mk; μ_φ,k(y^o_n, x_n, z_mk), σ²_φ,k(y^o_n, x_n, z_mk)).

In general, each factor l_φl(u_mk; y^o_n, z_mk, x_n) conditions on data at locations different from that of the inducing point. The strength of the dependence between these values is determined by the two input locations themselves. To account for this, we introduce an inference network that, for each observation/inducing point pair (u_mk, y_n), maps from (z_mk, x_n, y^o_n) to the parameters of the approximate likelihood factor. Whilst this approach has the same first-order computational complexity as that used by the SGP-VAE, having to make KNM forward and backward passes through the inference network renders it significantly more computationally expensive for even moderately sized datasets. Whereas the approach adopted by the SGP-VAE employs a deterministic transformation of the outputs of the inference network based on the covariance function, this approach can be interpreted as learning an appropriate dependency between input locations. In practice, we found this approach to result in worse predictive performance.

C MEMORY REQUIREMENTS

Assuming input locations and inducing point locations are shared across tasks, we require storing {K_{u_k f_k^{(t)}}, K_{u_k u_k}}_{k=1}^K and K_{f_k^{(t)} f_k^{(t)}} in memory, which is O(KMN + KM² + N²). For the SGP-VAE, we also require storing φ and instantiating {μ^{(t)}_φl,k, Σ^{(t)}_φl,k}_{k=1}^K, which is O(|φ_l| + KMD + 2KN). Collectively, this results in the memory requirement O(KNM + KM² + N² + |φ_l| + KMD + 2KN). If we were to employ the same sparse structured approximate posterior, but replace the output of the inference network with free-form variational parameters, the memory requirement would be O(KNM + KM² + N² + KMD + 2TKN). Alternatively, if we were to let q(u) be parameterised by free-form Cholesky factors and means, the memory requirement would be O(KNM + KM² + N² + KMD + TKM(M+1)/2 + TKM). Table 3 compares the first-order approximations. Importantly, the use of amortisation across tasks stops the memory from scaling with the number of tasks. Table 3: A comparison between the memory requirements of approximate posteriors.

q(u)                 Amortised?   Memory requirement
p(u) ∏_n l_n(u)      Yes          O(KNM + KM² + N² + |φ_l|)
p(u) ∏_n l_n(u)      No           O(KNM + KM² + N² + TKN)
q(u)                 No           O(KNM + TKM²)

D MULTI-OUTPUT GAUSSIAN PROCESSES

Through consideration of the interchange of input dependencies and likelihood functions, we can shed light on the relationship between the probabilistic model employed by the SGP-VAE and other multi-output GP models. These relationships are summarised in Figure 4.

Figure 4: A unifying perspective on multi-output GPs. The models shown are: SGP-VAE, f_k ~ GP(0, k(x, x')), y_n | f_n ~ N(μ(f_n), Σ(f_n)); VAE, f_n ~ N(0, I), y_n | f_n ~ N(μ(f_n), Σ(f_n)); GP-FA, f_k ~ GP(0, k(x, x')), y_n | f_n ~ N(W f_n, Σ); factor analysis, f_n ~ N(0, I), y_n | f_n ~ N(W f_n, Σ); GP-LVM, f_n ~ N(0, I), y_p | f ~ GP(0, k(f, f')); DGP, f_k ~ GP(0, k(x, x')), y_p | f ~ GP(0, k(f, f')).

Linear Multi-Output Gaussian Processes. Replacing the likelihood with a linear likelihood function characterises a family of linear multi-output GPs, defined by a linear transformation of K independent latent GPs:

f ~ ∏_{k=1}^K GP(0, k_θ1,k(x, x')),   y | f ~ ∏_{n=1}^N N(y_n; W f_n, Σ).   (40)

The family includes Teh et al.'s (2005) semiparametric latent factor model, Yu et al.'s (2009) GP factor analysis (GP-FA) and Bonilla et al.'s (2008) class of multi-task GPs. Notably, removing input dependencies by choosing k_θ1,k(x, x') = δ(x, x') recovers factor analysis, or equivalently, probabilistic principal component analysis (Tipping & Bishop, 1999) when Σ = σ²I. Akin to the relationship between factor analysis and linear multi-output GPs, the probabilistic model employed by standard VAEs can be viewed as a special, instantaneous case of the SGP-VAE's.

Deep Gaussian Processes. Single hidden layer deep GPs (DGPs) (Damianou & Lawrence, 2013) are characterised by the use of a GP likelihood function, giving rise to the probabilistic model

f ~ ∏_{k=1}^K GP(0, k_θ1,k(x, x')),   y | f ~ ∏_{p=1}^P GP(0, k_θ2,p(f(x), f(x'))),   (41)

where y_n = y(x_n). The GP latent variable model (GP-LVM) (Lawrence & Moore, 2007) is the special, instantaneous case of single layered DGPs. Multi-layered DGPs are recovered using a hierarchical latent space with conditional GP priors between each layer.
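The factor analysis special case can be made concrete: with a delta kernel, the marginal covariance of the stacked outputs is block-diagonal, with each block equal to the factor analysis covariance WW^T + Σ. A small sketch with arbitrary, made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(5)
N, Klat, P = 4, 2, 3
X = rng.uniform(-2.0, 2.0, N)        # distinct input locations
W = rng.normal(size=(P, Klat))       # mixing weights
Sigma = np.diag(rng.uniform(0.1, 0.5, P))

def marginal_cov(kern):
    # Covariance of the stacked outputs (y_1, ..., y_N) under
    # y_n = W f_n + eps_n with f_k ~ GP(0, kern), independent across k.
    Kx = kern(X[:, None] - X[None, :])
    return np.kron(Kx, W @ W.T) + np.kron(np.eye(N), Sigma)

# Delta kernel: no input dependencies.
C_delta = marginal_cov(lambda d: (d == 0).astype(float))

# Every diagonal block is the factor analysis covariance W W^T + Sigma,
# and observations at different inputs are independent.
fa_block = W @ W.T + Sigma
assert np.allclose(C_delta, np.kron(np.eye(N), fa_block))
```

Swapping the delta kernel for a smooth one (e.g. an SE kernel) re-introduces the off-diagonal blocks, recovering the input dependencies that distinguish linear multi-output GPs from factor analysis.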

E EXPERIMENTAL DETAILS

Whilst the theory outlined in Section 2 describes a general decoder parameterising both the mean and variance of the likelihood, we experienced difficulty training SGP-VAEs using a learnt variance, especially for high-dimensional observations. Thus, for the experiments detailed in this paper we use a shared variance across all observations. We use the Adam optimiser (Kingma & Ba, 2014) with a constant learning rate of 0.001. Unless stated otherwise, we estimate the gradients of the ELBO using a single sample and the ELBO itself using 100 samples. The predictive distributions are approximated as Gaussian with means and variances estimated by propagating samples from q(f) through the decoder. For each experiment, we normalise the observations using the means and standard deviations of the data in the training set.

The computational complexity of performing variational inference (VI) in the full GP-VAE, per update, is dominated by the O(KN³) cost associated with inverting the set of K N×N matrices, {K_{f_k f_k} + Σ_φl,k}_{k=1}^K. This can quickly become burdensome for even moderately sized datasets. A pragmatic workaround is to use a biased estimate of the ELBO based on Ñ < N data points:

L̃_ELBO = (N/Ñ) E_{q(f̃)}[log (p_θ2(ỹ | f̃) / l_φl(f̃; ỹ))] + log Z̃_q(θ, φ),

where ỹ and f̃ denote the mini-batch of Ñ observations and their corresponding latent variables, respectively. The bias is introduced by the normalisation constant, which does not satisfy (N/Ñ) E[log Z̃_q(θ, φ)] = E[log Z_q(θ, φ)]. Nevertheless, the mini-batch estimator will be a reasonable approximation to the full estimator provided the lengthscale of the GP prior is not too large. Mini-batching cannot be used to reduce the O(KN³) cost of performing inference at test time, hence sparse approximations are necessary for large datasets.
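The bias can be illustrated in an extreme case: an infinitely long lengthscale, for which the prior covariance is a constant matrix and both the full and mini-batch normalising constants are available in closed form. The sketch below (with a made-up pseudo-observation vector and noise standing in for μ_φl, Σ_φl) averages the scaled mini-batch normaliser over all batches and shows it is far from log Z_q.

```python
import numpy as np
from itertools import combinations

def log_npdf(x, cov):
    """log N(x; 0, cov)."""
    d = len(x)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(cov, x))

N, Nb, s2 = 6, 3, 0.1
mu = np.ones(N)              # stand-in for the inference network means
K = np.ones((N, N))          # infinitely long lengthscale: fully correlated prior

full = log_npdf(mu, K + s2 * np.eye(N))  # log Z_q on the full dataset

# Exact expectation of the scaled mini-batch normaliser over uniform batches.
scaled = np.mean([(N / Nb) * log_npdf(mu[list(i)],
                                      K[np.ix_(list(i), list(i))] + s2 * np.eye(Nb))
                  for i in combinations(range(N), Nb)])

# With strong prior correlations the scaled mini-batch term is badly biased.
assert abs(scaled - full) > 1.0
```

With weak prior correlations (a delta-like kernel) log Z_q decomposes across data points and the scaling would be exact, which is why the mini-batch estimator is reasonable when the lengthscale is not too large.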

E.1 SMALL-SCALE EEG

For all GP-VAE models, we use a three-dimensional latent space, each using squared exponential (SE) kernels with lengthscales and scales initialised to 0.1 and 1, respectively. All DNNs, except for those in PointNet and IndexNet, use two hidden layers of 20 units and ReLU activation functions. PointNet and IndexNet employ DNNs with a single hidden layer of 20 units and a 20-dimensional intermediate representation. Each model is trained for 3000 epochs using a batch size of 100, with the procedure repeated 15 times. Following Requeima et al. (2019), the performance of each model is evaluated using the standardised mean squared error (SMSE) and negative log-likelihood (NLL). The mean ± standard deviation of the performance metrics for the 10 iterations with the highest ELBO is reported.

E.2 JURA

We use a two-dimensional latent space for all GP-VAE models, with SE kernels whose lengthscales and scales are initialised to 1. This permits a fair comparison with other multi-output GP methods which also use two latent dimensions with SE kernels. For all DNNs except those in IndexNet, we use two hidden layers of 20 units and ReLU activation functions. IndexNet uses DNNs with a single hidden layer of 20 units and a 20-dimensional intermediate representation. Following Goovaerts (1997) and Lawrence (2004), the performance of each model is evaluated using the mean absolute error (MAE) averaged across 10 different initialisations. The 10 different initialisations are identified from a body of 15 as those with the highest training set ELBO. For each initialisation the GP-VAE models are trained for 3000 epochs using a batch size of 100.

E.3 LARGE-SCALE EEG

In both experiments, for each trial in the test set we simulate simultaneous electrode 'blackouts' by removing any 4-sample period at random with 25% probability. Additionally, we simulate individual electrode 'blackouts' by removing any 16-sample period at random with 50% probability from the training set. For the first experiment, we also remove any 16-sample period at random with 50% probability from the test set. For the second experiment, we remove any 16-sample period at random with 10% probability. All models are trained for 100 epochs, with the procedure repeated five times, and use a 10-dimensional latent space with SE kernels, with scales and lengthscales initialised to 1 and 0.1, respectively. All DNNs, except for those in PointNet and IndexNet, use four hidden layers of 50 units and ReLU activation functions. PointNet and IndexNet employ DNNs with two hidden layers of 50 units and a 50-dimensional intermediate representation.

E.4 BOUNCING BALL

To ensure a fair comparison with the SVAE and SIN, we adopt an architecture for the inference network and decoder identical to that used in the original experiment. In particular, we use DNNs with two hidden layers of 50 units and hyperbolic tangent activation functions. Whilst both Johnson et al. and Lin et al. use eight-dimensional latent spaces, we consider a GP-VAE with a one-dimensional latent space and periodic GP kernel. For the more complex experiment, we use an SGP-VAE with fixed inducing points placed every 50 samples. We also increase the number of hidden units in each layer of the DNNs to 256 and use a two-dimensional latent space: one for each ball.

E.5 WEATHER STATION

The spatial location of each weather station is determined by its latitude, longitude and elevation above sea level. The rates of missingness in the dataset vary, with 6.3%, 14.0%, 18.9%, 47.3% and 93.2% of values missing for each of the five weather variables, respectively. Alongside the average temperature for the middle five days, we simulate additional missingness in the test datasets by removing 25% of the minimum and maximum temperature values. Each model is trained on the data from 1980 using a single group per update for 50 epochs, with the performance evaluated on the data from both 1980 and 1981 using the root mean squared error (RMSE) and NLL averaged across five runs. We use a three-dimensional latent space with SE kernels and lengthscales initialised to 1. All DNNs, except for those in PointNet and IndexNet, use four hidden layers of 20 units and ReLU activation functions. PointNet and IndexNet employ DNNs with two hidden layers of 20 units and a 20-dimensional intermediate representation. Inducing point locations are initialised using k-means clustering, and are shared across latent dimensions and groups. The VAE uses FactorNet. We consider independent GPs modelling the seven-point time series for each variable and each station, with model parameters shared across groups. No comparison to other sparse GP approaches is made, as there is no existing framework for performing approximate inference in sparse GP models conditioned on previously unobserved data.

F FURTHER EXPERIMENTATION

F.1 SMALL-SCALE EXPERIMENTS

Table 4 compares the performance of the GP-VAE to that of the GPPVAE. In all cases, FactorNet is used to handle missing data. We emphasise that the GP-VAE and GPPVAE employ identical probabilistic models, with the only difference being the form of the approximate posterior.
The superior predictive performance of the GP-VAE can therefore be attributed to the use of the structured approximate posterior, as opposed to the mean-field approximate posterior used by the GPPVAE.

F.2 SYNTHETIC BOUNCING BALL EXPERIMENT

The original dataset consists of 80 12-dimensional image sequences, each of length 50, with the task being to predict the trajectory of the ball given a prefix of a longer sequence. The image sequences are generated at random by uniformly sampling the starting position of the ball whilst keeping the bouncing frequency fixed. Figure 5 compares the posterior latent GP and the mean of the posterior predictive distribution with the ground truth for a single image sequence, using just a single latent dimension. As demonstrated in the more complex experiment, the GP-VAE is able to recover the ground truth with almost exact precision. Following Lin et al. (2018), Figure 1a evaluates the τ-steps-ahead predictive performance of the GP-VAE using the mean absolute error, defined as



CONCLUSION

The SGP-VAE is a scalable approach to training GP-DGMs which combines sparse inducing point methods for GPs and amortisation for DGMs. The approach is ideally suited to spatio-temporal data with missing observations, where it outperforms VAEs and multi-output GPs. Future research directions include generalising the framework to leverage state-space GP formulations for additional scalability, and applications to streaming multi-output data.

Footnotes:
- This assumes input locations are shared across tasks, which is true for all experiments we considered.
- Whilst the formulation in equation 6 can describe any permutation-invariant set function, there is a caveat: both h_φ1 and ρ_φ2 can be infinitely complex, hence linear complexity is not guaranteed.
- This kind of 'overfitting' is different to the classical notion of overfitting model parameters. It refers to how well optimal non-amortised approximate posteriors are approximated on the training versus test data.
- I.e. using IndexNet for EEG and FactorNet for Jura. The two different performance metrics are adopted to enable a comparison to the results of Requeima et al.
- This prevents the model from using solely output dependencies to impute the missing temperature readings.
- Note we only require evaluating a single K_{f_k^{(t)} f_k^{(t)}} at each update.
- In which case the off-diagonal terms in the covariance matrix will be large, making the approximation of p_θ1(f) by p_θ1(f̃) extremely crude.
- We found that the GP-VAE occasionally got stuck in very poor local optima. Since the ELBO is calculated on the training set alone, the experimental procedure is still valid.



denoting the index set of observed values. For each task, we model the distribution of each observation y (t)

and φ l , as instantiating µ (t) φ l ,k and Σ (t) φ l ,k involves only O (KN ) terms. 1 This corresponds to a considerable reduction in memory. See Appendix C for a thorough comparison of memory requirements.

4.1 SYNTHETIC BOUNCING BALL EXPERIMENT

The bouncing ball experiment, first introduced by Johnson et al. (2016) for evaluating the SVAE and later considered by Lin et al. (2018) for evaluating SIN, considers a sequence of one-dimensional images of height 10 representing a ball bouncing under linear dynamics (x^(t)_n ∈ R, y^(t)_n ∈ R^10). The GP-VAE is able to significantly outperform both the SVAE and SIN in the original experiment, as shown in Figure 1a. To showcase the versatility of the SGP-VAE, we extend the complexity of the original experiment to consider a sequence of images of height 100, y^(t)_n ∈ R^100, representing two bouncing balls: one under linear dynamics and another under gravity. Furthermore, the images are corrupted by removing 25% of the pixels at random. The dataset consists of T = 80 noisy image sequences, each of length N = 500, with the goal being to predict the trajectory of the balls given a prefix of a longer sequence.

Figure 1: a) Comparing the GP-VAE's predictive performance to that of the SVAE, SIN, an LDS and independent GPs (IGP). b) Top: sequence of images representing two bouncing balls. Middle: mean of the SGP-VAE's predictive distribution conditioned on partial observations up to the red line. Bottom: latent approximate GPs posterior, alongside the inducing point locations (crosses).

Figure 2: Variation in performance of the SGP-VAE on the large-scale EEG experiment as the number of inducing points, M , varies.

Figure 3: An illustration of the Japanese weather experiment. The dotted red lines highlight the missing data, with the SGP-VAE's predictive mean shown below.

Figure 5: A comparison between the mean of the GP-VAE's posterior predictive distribution (middle) and the ground truth (top) conditioned on noisy observations up to the red line. The latent approximate GP posterior is also shown (bottom).

MAE(τ) = 1/(N_test (T − τ) d) ∑_{n=1}^{N_test} ∑_{t=1}^{T−τ} ‖ y*_{n,t+τ} − E_{q(y_{n,t+τ} | y_{n,1:t})}[y_{n,t+τ}] ‖_1,   (43)

where N_test is the number of test image sequences, each with T time steps, and y*_{n,t+τ} denotes the noiseless observation at time step t + τ.
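A sketch of this metric, assuming a hypothetical array layout in which `y_pred[n, t]` stores the predictive mean of y_{n,t+τ} given the prefix y_{n,1:t}:

```python
import numpy as np

def tau_step_mae(y_true, y_pred, tau):
    """tau-steps-ahead MAE, following equation 43.

    y_true: (N_test, T, d) noiseless observations y*.
    y_pred: (N_test, T, d) predictive means, where y_pred[n, t] approximates
        E[y_{n, t+tau} | y_{n, 1:t}] (hypothetical layout).
    """
    n_test, T, d = y_true.shape
    err = 0.0
    for t in range(T - tau):
        err += np.abs(y_true[:, t + tau] - y_pred[:, t]).sum()
    return err / (n_test * (T - tau) * d)
```

For example, a constant absolute prediction error of 0.5 in every pixel yields a MAE of exactly 0.5 regardless of τ.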

Figure 6: An illustration of the three different partial inference network specifications discussed in Section 2.4. η denotes the vector of natural parameters of the multivariate Gaussian being parameterised.

can replace them with their nearest conjugate approximations whilst retaining a similar latent structure. Although the frameworks proposed by Johnson et al. and Lin et al. are more general than ours, the authors only consider Gaussian mixture model and linear dynamical system (LDS) latent priors. Fortuin et al. use a Gaussian approximate posterior N(m, Λ^{-1}), where m and Λ are parameterised by an inference network. Whilst this permits computational efficiency, the parameterisation is only appropriate for regularly spaced temporal data and neglects rigorous treatment of long-term dependencies. Campbell & Liò (2020) employ an equivalent sparsely structured variational posterior to that used by Fortuin et al., extending the framework to handle more general spatio-temporal data. Their method is similarly restricted to regularly spaced spatio-temporal data. A fundamental difference between our framework and that of Fortuin et al. and Campbell & Liò is the inclusion of the GP prior in the approximate posterior. As shown by Opper & Archambeau

A comparison between multi-output GP models on the EEG and Jura experiments.

A comparison between model performance on the Japanese weather experiment.


