NON-GAUSSIAN PROCESS REGRESSION

Abstract

Standard GPs offer a flexible modelling tool for well-behaved processes. However, deviations from Gaussianity are expected in real-world datasets, where structural outliers and shocks are routinely observed. In these cases GPs can fail to model uncertainty adequately and may over-smooth inferences. Here we extend the GP framework to a new class of time-changed GPs that allow straightforward modelling of heavy-tailed non-Gaussian behaviours, while retaining a tractable conditional GP structure through a representation as an infinite mixture of non-homogeneous GPs. The conditional GP structure is obtained by conditioning the observations on a latent transformed input space, and the random evolution of the latent transformation is modelled by a Lévy process, which allows Bayesian inference on both the posterior predictive density and the latent transformation function. We present Markov chain Monte Carlo inference procedures for this model and demonstrate its potential benefits over a standard GP.

1. INTRODUCTION

Gaussian processes (GPs) are stochastic processes widely used in nonparametric regression and classification problems to represent probability distributions over functions (Rasmussen & Williams (2006)). They allow Bayesian inference in a space of functions, so that consistent uncertainty measures over predictions are obtained rather than point estimates alone. In its simplest form a GP defines a distribution over functions through its mean and covariance (kernel) functions, which determine the smoothness, stationarity and periodicity of a random realisation in the function space. As a prior distribution in Bayesian inference, a zero-mean GP reflects a lack of information about the values and trend of the function. In this case the covariance function, which defines the similarity between any two points in the input space, fully characterises the properties of the random function space. The design of kernel functions that can represent a wide range of characteristics and make consistent generalisations is a fundamental area of research. Recent work in this area includes modelling the kernel via spectral densities that are scale-location mixtures of Gaussians (Wilson & Adams (2013)), and similarly placing Lévy process priors over adaptive basis expansions for the spectral density (Jang et al. (2017)). Furthermore, extensions to the standard GP model can be made by directly modelling the covariance matrix as a stochastic process (Wilson & Ghahramani (2011)), assuming heteroscedastic noise on the observations and carrying out variational inference (Lázaro-Gredilla & Titsias (2011)), or learning nonlinear transformations of the observations such that the latent transformed observations are modelled well by a GP (Snelson et al. (2003); Lázaro-Gredilla (2012)).
Nonstationarity in the measurement process can be expressed as a product of multiple GPs (Adams & Stegle (2008)), and heavy-tailed observations may be modelled through the Student-t process (Shah et al. (2014)). Particularly relevant extensions of GP models are presented in Rasmussen & Ghahramani (2001), where the input space is locally modelled by separate GPs, and in string GPs (Samo & Roberts (2016)), which introduce link functions between local GPs such that the global process is still a GP, providing efficient inference methods on large data sets. In Schmidt & O'Hagan (2003) and Snoek et al. (2014), a latent space is defined between the inputs and observations through a separate GP and a class of bounded functions on [0, 1], respectively. By designing expressive covariance functions or stacking multiple GPs in structured arrangements, the GP framework produces accurate predictive models in numerous application domains. However, these models are limited by their Gaussianity assumption: the local patterns they learn depend strongly on particular observations rather than capturing the overall dynamics of the data-generating system. A more natural and interpretable way to define complex relationships may be to assume that the underlying random function is non-Gaussian, which yields sparser representations (Unser & Tafti (2014)), as discussed in Section 4. In this work, we present a novel approach to modelling non-Gaussian dynamics by constructing a non-Gaussian process (NGP) such that the observations form a conditional GP, conditioned on a latent input transformation function that is separately modelled as a Lévy process. Building on the definition of a stationary kernel, the latent layer between the input and output spaces represents the random distances between any two points in the input space.
In order to define the distribution of random distances without referring to a specific origin, and to maintain monotonicity of the input-space transformation, the latent space of transformation functions is modelled by a special class of Lévy process called a subordinator, which is non-negative and non-decreasing. Such a process is characterised by the distribution of its stationary and independent increments, which thus defines a probability distribution over the distance between any two input values. Making random monotonic transformations of the input values allows the kernel to adapt to the local characteristics of the input space, in other words to the varying rate of change in the observations, and the learned subordinator provides uncertainty estimates over the variation of the observed process everywhere on its domain. In this paper we focus principally on 1-dimensional GPs for the sake of brevity, but we emphasise that our approach can be readily extended to multiple dimensions, as described throughout the text and illustrated in the experimental results. NGPs are related to the continuous-time stochastic volatility models studied in the mathematical finance literature, which describe stochastic processes with randomly distributed variance (Ghysels et al. (1996)). The time-change operation defined for continuous-time stochastic processes is a standard approach to building stochastic volatility models. A common example is time-changed Brownian motion, where the time-change is chosen to be a subordinator and the time-changed motion produces a Lévy process (Veraart & Winkel (2010)). In such a model, the process is conditionally a Brownian motion, i.e. the integral of a white-noise GP. Similarly, our construction of a stationary NGP follows a GP conditioned on the latent values of a subordinator; thus it is a time-changed GP.
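The time-change construction above can be illustrated with a minimal simulation sketch. This is not the paper's inference procedure: the gamma subordinator, its parameters, and the grid-based discretisation are illustrative assumptions. The subordinator supplies non-decreasing random times, and evaluating Brownian motion at those times yields a conditionally Gaussian but marginally heavy-tailed process.

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_subordinator(t_grid, shape=2.0, rate=2.0):
    """Sample a gamma subordinator T(t) on a grid: independent,
    non-negative gamma increments accumulated into a non-decreasing path."""
    dt = np.diff(t_grid)                         # positive grid spacings
    inc = rng.gamma(shape * dt, 1.0 / rate)      # stationary independent increments
    return np.concatenate([[0.0], np.cumsum(inc)])

def time_changed_brownian_motion(t_grid, sigma=1.0):
    """Brownian motion evaluated at the random times T(t): conditionally
    Gaussian increments with variance sigma^2 * (T(t_{i+1}) - T(t_i))."""
    T = gamma_subordinator(t_grid)
    dT = np.diff(T)
    dW = rng.normal(0.0, sigma * np.sqrt(dT))    # Gaussian given the subordinator
    return np.concatenate([[0.0], np.cumsum(dW)])

t = np.linspace(0.0, 1.0, 500)
path = time_changed_brownian_motion(t)           # one heavy-tailed sample path
```

Conditionally on the subordinator path the process is an ordinary (inhomogeneously scaled) Brownian motion, mirroring the conditional GP structure of the NGP.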
Particular non-Gaussian behaviour can be expressed through the characterisation of the subordinator; examples include the stable law, normal-tempered stable, and generalised hyperbolic (including Student-t) processes. Hence, NGPs provide a flexible and expressive probabilistic framework for nonparametric learning of functions. In Section 2 we briefly review the GP regression framework, introduce the time-change operation and define NGPs. An inference method for NGP regression is presented in Section 3, following an introduction to shot-noise simulation methods for Lévy processes. In Section 4, we present the results of applying NGP regression to representative synthetically generated non-Gaussian data sets to visually highlight their dynamics, and compare the results to alternative GPs. Furthermore, a multidimensional example using a data set available in TensorFlow is presented.

2. MODELS

In this section, we briefly present the standard GP regression framework and introduce the time-change operation, which results in a non-Gaussian process (NGP). The series representation of a Lévy process (Rosiński (2001)) reviewed in Section 2.2 is central to the inference methodology studied in Section 3.

2.1. GAUSSIAN PROCESS REGRESSION

A stochastic process {f(x) ∈ R; x ∈ X} is defined by the probability distribution of all possible finite subsets of its values, where X ⊆ R^d is a d-dimensional input space. In the case of GPs, for any finite set of inputs {x_i}_{i=1}^n the corresponding function values {f(x_i)}_{i=1}^n have a multivariate Gaussian distribution (MacKay (2003)), characterised by the mean function m(x) = E[f(x)] and the covariance (kernel) function K(x', x) = Cov(f(x'), f(x)), where x', x ∈ X. Given a set of inputs {x_i}, the mean function forms a vector m and the kernel function forms a positive-definite covariance matrix Σ. The resulting multivariate Gaussian distribution can be extended to any input x* ∈ X follow-
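The finite-dimensional construction just described (a mean vector m and a positive-definite covariance matrix Σ built from the kernel) can be sketched as follows. The squared-exponential kernel, its hyperparameters, and the diagonal jitter term are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

def sq_exp_kernel(x1, x2, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel K(x', x) = variance * exp(-(x' - x)^2 / (2 l^2))
    evaluated for all pairs of 1-d inputs."""
    diff = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)            # finite set of inputs {x_i}
m = np.zeros_like(x)                      # zero mean function -> mean vector m
K = sq_exp_kernel(x, x)                   # positive-definite covariance matrix
K_jitter = K + 1e-8 * np.eye(len(x))      # small jitter for numerical stability
f = rng.multivariate_normal(m, K_jitter)  # one draw of {f(x_i)} from the GP prior
```

Each draw `f` is one realisation of the function values at the chosen inputs; repeating the draw traces out the GP's distribution over functions.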

