SPARSE GAUSSIAN PROCESS VARIATIONAL AUTOENCODERS

Abstract

Large, multi-dimensional spatio-temporal datasets are omnipresent in modern science and engineering. An effective framework for handling such data is the Gaussian process deep generative model (GP-DGM), which employs GP priors over the latent variables of a DGM. Existing approaches for performing inference in GP-DGMs do not support sparse GP approximations based on inducing points, which are essential for the computational efficiency of GPs, nor do they handle missing data, a natural occurrence in many spatio-temporal datasets, in a principled manner. We address these shortcomings with the development of the sparse Gaussian process variational autoencoder (SGP-VAE), characterised by the use of partial inference networks for parameterising sparse GP approximations. Leveraging the benefits of amortised variational inference, the SGP-VAE enables inference in multi-output sparse GPs on previously unobserved data with no additional training. The SGP-VAE is evaluated in a variety of experiments, in which it outperforms alternative approaches including multi-output GPs and structured VAEs.

1. INTRODUCTION

Increasing amounts of large, multi-dimensional datasets that exhibit strong spatio-temporal dependencies are arising from a wealth of domains, including earth, social and environmental sciences (Atluri et al., 2018). For example, consider modelling daily atmospheric measurements taken by weather stations situated across the globe. Such data are (1) large in number; (2) subject to strong spatio-temporal dependencies; (3) multi-dimensional; and (4) non-Gaussian with complex dependencies across outputs. There exist two venerable approaches for handling these characteristics: Gaussian process (GP) regression and deep generative models (DGMs). GPs provide a framework for encoding high-level assumptions about latent processes, such as smoothness or periodicity, making them effective in handling spatio-temporal dependencies. Yet, existing approaches do not support the use of flexible likelihoods necessary for modelling complex multi-dimensional outputs. In contrast, DGMs support the use of flexible likelihoods; however, they do not provide a natural route through which spatio-temporal dependencies can be encoded. The amalgamation of GPs and DGMs, GP-DGMs, uses latent functions drawn independently from GPs, which are then passed through a DGM at each input location. GP-DGMs combine the complementary strengths of both approaches, making them naturally suited for modelling spatio-temporal datasets.

Intrinsic to many spatio-temporal applications is the notion of tasks. For instance: medicine has individual patients; each trial in a scientific experiment produces an individual dataset; and, in the case of a single large dataset, it is often convenient to split it into separate tasks to improve computational efficiency.
GP-DGMs support the presence of multiple tasks in a memory efficient way through the use of amortisation, giving rise to the Gaussian process variational autoencoder (GP-VAE), a model that has recently gained considerable attention from the research community (Pearce, 2020; Fortuin et al., 2020; Casale et al., 2018; Campbell & Liò, 2020; Ramchandran et al., 2020). However, previous work does not support sparse GP approximations based on inducing points, a necessity for modelling even moderately sized datasets. Furthermore, many spatio-temporal datasets contain an abundance of missing data: weather measurements are often absent due to sensor failure, and in medicine only single measurements are taken at any instance. Handling partial observations in a principled manner is essential for modelling spatio-temporal data, but is yet to be considered. Our work makes the following contributions:

i) We develop the sparse GP-VAE (SGP-VAE), which uses inference networks to parameterise multi-output sparse GP approximations.

ii) We employ a suite of partial inference networks for handling missing data in the SGP-VAE.

iii) We conduct a rigorous evaluation of the SGP-VAE in a variety of experiments, demonstrating excellent performance relative to existing multi-output GPs and structured VAEs.
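To make contribution (ii) concrete, one way to build a partial inference network is a PointNet-style encoder: embed each observed (dimension index, value) pair, sum the embeddings, and map the sum to per-datapoint approximate-likelihood parameters. The sketch below is hypothetical (the weights `W_embed` and `W_out` and the single-layer architecture are illustrative assumptions, not the paper's exact specification); its purpose is to show how missing entries can simply be dropped rather than imputed.

```python
import numpy as np

def partial_inference_net(y, mask, W_embed, W_out):
    """PointNet-style partial inference network (illustrative sketch).

    y:    (N, P) observations, with arbitrary entries missing.
    mask: (N, P) boolean, True where a value is observed.
    Embeds each observed (index, value) pair, sums the embeddings over
    observed dimensions, and maps the sum to Gaussian parameters."""
    N, P = y.shape
    # Feature per output dimension: one-hot dimension index plus its value.
    one_hot = np.broadcast_to(np.eye(P), (N, P, P))
    feats = np.concatenate([one_hot, y[:, :, None]], axis=-1)   # (N, P, P+1)
    h = np.maximum(feats @ W_embed, 0.0)                        # (N, P, H)
    h = np.where(mask[:, :, None], h, 0.0)                      # drop missing
    pooled = h.sum(axis=1)                                      # (N, H)
    out = pooled @ W_out                                        # (N, 2K)
    mu, log_sigma = np.split(out, 2, axis=-1)
    return mu, np.exp(log_sigma)
```

Because unobserved dimensions are masked out before pooling, the output is invariant to whatever values happen to sit in the missing slots, so any permutation of missingness is handled without special casing.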

2. A FAMILY OF SPATIO-TEMPORAL VARIATIONAL AUTOENCODERS

Consider the multi-task regression problem in which we wish to model $T$ datasets $\mathcal{D} = \{\mathcal{D}^{(t)}\}_{t=1}^{T}$, each of which comprises input/output pairs $\mathcal{D}^{(t)} = \{x_n^{(t)}, y_n^{(t)}\}_{n=1}^{N_t}$, $x_n^{(t)} \in \mathbb{R}^D$ and $y_n^{(t)} \in \mathbb{R}^P$. Further, let any possible permutation of observed values be potentially missing, with $O_n^{(t)}$ denoting the index set of observed values. For each task, we model the distribution of each observation $y_n^{(t)}$, conditioned on a corresponding latent variable $f_n^{(t)} \in \mathbb{R}^K$, as a fully-factorised Gaussian distribution parameterised by passing $f_n^{(t)}$ through a decoder deep neural network (DNN) with parameters $\theta_2$. The elements of $f_n^{(t)}$ correspond to the evaluation of a $K$-dimensional latent function $f^{(t)} = (f_1^{(t)}, f_2^{(t)}, \ldots, f_K^{(t)})$ at input $x_n^{(t)}$; that is, $f_n^{(t)} = f^{(t)}(x_n^{(t)})$. Each latent function $f_k^{(t)}$ is modelled as being drawn from one of $K$ independent GP priors with hyper-parameters $\theta_1 = \{\theta_{1,k}\}_{k=1}^{K}$, giving rise to the complete probabilistic model:

$$\underbrace{f^{(t)} \sim \prod_{k=1}^{K} \mathcal{GP}\big(0,\, k_{\theta_{1,k}}(x, x')\big)}_{p_{\theta_1}(f^{(t)})} \qquad \underbrace{y^{(t)} \mid f^{(t)} \sim \prod_{n=1}^{N_t} \mathcal{N}\Big(\mu^{o}_{\theta_2}(f_n^{(t)}),\, \operatorname{diag} \sigma^{o\,2}_{\theta_2}(f_n^{(t)})\Big)}_{p_{\theta_2}(y_n^{o\,(t)} \mid f^{(t)},\, x_n^{(t)},\, O_n^{(t)})} \tag{1}$$

where $\mu^{o}_{\theta_2}(f_n^{(t)})$ and $\sigma^{o\,2}_{\theta_2}(f_n^{(t)})$ are the outputs of the decoder indexed by $O_n^{(t)}$. We shall refer to the set $\theta = \{\theta_1, \theta_2\}$ as the model parameters, which are shared across tasks. The probabilistic model in equation 1 explicitly accounts for dependencies between latent variables through the GP prior. The motive of the latent structure is twofold: to discover a simpler representation of each observation, and to capture the dependencies between observations at different input locations.
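The generative process in equation 1 can be sketched as follows: draw $K$ independent GP latent functions over the inputs, then pass each $f_n^{(t)}$ pointwise through a decoder that yields the Gaussian likelihood parameters. The RBF kernel and the linear `toy_decoder` below are illustrative stand-ins; the model places no such restriction on $k_{\theta_{1,k}}$ or the decoder DNN.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance, a common choice for k_{theta_1,k}."""
    sq_dists = np.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

def sample_gp_dgm(x, decoder, K=2, jitter=1e-6, seed=0):
    """Draw one sample from the generative model of equation 1:
    K independent zero-mean GP draws over the N inputs, then a decoder
    mapping each latent f_n to per-point Gaussian likelihood parameters."""
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    Kxx = rbf_kernel(x, x) + jitter * np.eye(N)
    L = np.linalg.cholesky(Kxx)
    # f has shape (N, K): column k is a draw from GP(0, k_{theta_1,k}).
    f = L @ rng.standard_normal((N, K))
    mu, sigma = decoder(f)                      # likelihood parameters
    y = mu + sigma * rng.standard_normal(mu.shape)
    return f, y

def toy_decoder(f, P=3):
    """Hypothetical linear stand-in for the decoder DNN with parameters theta_2."""
    W = np.ones((f.shape[1], P))
    return f @ W, 0.1 * np.ones((f.shape[0], P))
```

Sampling with `x = np.linspace(0, 1, 10)[:, None]` yields latent draws `f` of shape `(10, K)` and observations `y` of shape `(10, P)`, with smoothness across nearby inputs inherited from the GP prior.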

2.1. MOTIVATION FOR SPARSE APPROXIMATIONS AND AMORTISED INFERENCE

The use of amortised inference in DGMs and sparse approximations in GPs enables inference in these respective models to scale to large quantities of data. To ensure the same for the GP-DGM described in equation 1, we require the use of both techniques. In particular, amortised inference is necessary to prevent the number of variational parameters scaling with $\mathcal{O}\big(\sum_{t=1}^{T} N^{(t)}\big)$. Further, the inference network can be used to condition on previously unobserved data without needing to learn new variational parameters. Similarly, sparse approximations are necessary to prevent the computational complexity increasing cubically with the size of each task, $\mathcal{O}\big(\sum_{t=1}^{T} (N^{(t)})^3\big)$. Unfortunately, it is far from straightforward to combine sparse approximations and amortised inference in a computationally efficient way. To see this, consider the standard form for the sparse GP approximate posterior, $q(f) = p_{\theta_1}(f_{\neq u} \mid u)\, q(u)$, where $q(u) = \mathcal{N}(u; m, S)$, with $m$, $S$ and the inducing point locations $Z$ being the variational parameters. $q(u)$ does not decompose into a product over $N^{(t)}$ factors and is therefore not amenable to per-datapoint amortisation; that is, $m$ and $S$ must be optimised as free-form variational parameters. A naïve approach to achieving per-datapoint amortisation is to decompose $q(u)$ into the prior $p_{\theta_1}(u)$ multiplied by a product of approximate likelihoods, one for each inducing point. Each approximate likelihood is itself equal to the product of per-datapoint approximate likelihoods, which depend on both the observation $y^o_n$ and the distance of the input $x_n$ to that of the inducing point. An inference network which takes these two values as inputs can be used to obtain the parameters of the approximate likelihood factors. Whilst we found that this approach worked, it is somewhat unprincipled. Moreover, it requires passing each datapoint/inducing-point pair through an inference network, which scales very poorly. In the following section, we present a more principled and efficient alternative.
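For reference, the standard sparse approximate posterior discussed above, $q(f) = p_{\theta_1}(f_{\neq u} \mid u)\, q(u)$ with $q(u) = \mathcal{N}(u; m, S)$, yields the usual predictive equations: mean $K_{*u} K_{uu}^{-1} m$ and covariance $K_{**} - K_{*u} K_{uu}^{-1} (K_{uu} - S) K_{uu}^{-1} K_{u*}$. The following is a minimal sketch with free-form $m$ and $S$ and an arbitrary kernel function, not the amortised parameterisation developed in this paper.

```python
import numpy as np

def sparse_gp_predict(x_star, Z, m, S, kernel, jitter=1e-6):
    """Predictive mean and covariance at test inputs x_star under the
    standard sparse posterior q(f) = p(f_{!=u} | u) q(u), q(u) = N(m, S).

    Z:      (M, D) inducing point locations.
    m, S:   free-form variational mean (M,) and covariance (M, M).
    kernel: any covariance function k(A, B) -> (|A|, |B|) matrix."""
    Kuu = kernel(Z, Z) + jitter * np.eye(Z.shape[0])
    Ksu = kernel(x_star, Z)
    Kss = kernel(x_star, x_star)
    A = np.linalg.solve(Kuu, Ksu.T).T            # K_{*u} K_{uu}^{-1}
    mean = A @ m
    cov = Kss - A @ (Kuu - S) @ A.T
    return mean, cov
```

A sanity check on the formulas: setting $m = 0$ and $S = K_{uu}$ recovers the prior, since the correction term $K_{uu} - S$ vanishes. Note that $m$ and $S$ here are exactly the free-form parameters that, as argued above, resist per-datapoint amortisation.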




