NEURAL DIFFUSION PROCESSES

Abstract

Gaussian processes provide an elegant framework for specifying prior and posterior distributions over functions. They are, however, also computationally expensive, and limited by the expressivity of their covariance function. We propose Neural Diffusion Processes (NDPs), a novel approach based upon diffusion models, that learns to sample from distributions over functions. Using a novel attention block we are able to incorporate properties of stochastic processes, such as exchangeability, directly into the NDP's architecture. We empirically show that NDPs are able to capture functional distributions that are close to the true Bayesian posterior. This enables a variety of downstream tasks, including hyperparameter marginalisation, non-Gaussian posteriors and global optimisation.

1. INTRODUCTION

Gaussian processes (GPs) offer a powerful framework for defining distributions over functions [26]. It is an appealing framework because Bayes' rule allows one to reason consistently about the predictive distribution, allowing the model to be data efficient. However, for many problems GPs are not an appropriate prior. Consider, for example, a function that has a discontinuity at some unknown location. This cannot be expressed in terms of a GP, because it is impossible to capture such behaviour through the first two moments of a multivariate normal distribution [23]. One popular approach to these problems is to abandon GPs in favour of neural network (NN)-based generative models. Successful methods include the meta-learning approaches of Neural Processes (NPs) [8; 12; 2; 21] and VAE-based models [22; 6]. By leveraging a large number of small datasets during training, they are able to transfer knowledge across datasets at prediction time. Using NNs is appealing since most of the computational effort is expended during training, while prediction usually becomes more straightforward. A further major advantage of NN-based approaches is that they are not restricted by the Gaussian assumption. We seek to improve upon these methods by extending an existing state-of-the-art NN-based generative model. In terms of sample quality, the denoising diffusion probabilistic model [31; 32; 10] has recently been shown to outperform existing methods on tasks such as image [24; 25], molecular structure [40; 11], point cloud [20] and audio signal [14] generation. However, the Bayesian inference of functions poses a fundamentally different challenge, one which has not previously been tackled by diffusion models.

Contributions

We propose a novel model, the Neural Diffusion Process (NDP), which extends the use case of diffusion models to Stochastic Processes (SPs) and is able to describe a rich distribution over functions. NDPs generalise diffusion models to infinite-dimensional function spaces by indexing the random variables on which the model diffuses. We take particular care to incorporate known symmetries and properties of SPs, such as exchangeability and marginal consistency, into the model, which facilitates training. These properties are enforced with the help of a novel attention block, the bi-dimensional attention block, which guarantees equivariance over the ordering of both (1) the input dimensions and (2) the sequence (i.e., the datapoints). From the experiments we draw the following conclusions: firstly, NDPs are a clear improvement over existing NN-based generative models for functions, such as Neural Processes (NPs). Secondly, NDPs are an attractive alternative to GPs for specifying appropriate (i.e., non-Gaussian) priors over functions. Finally, we present a novel global optimisation method using NDPs.
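The sequence-equivariance property that the bi-dimensional attention block enforces can be illustrated with ordinary self-attention. The following is a toy NumPy sketch, not the paper's architecture: because every datapoint attends to all others with shared weights, permuting the datapoints permutes the outputs in exactly the same way.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Plain single-head self-attention over the sequence axis.

    Permutation-equivariant: each row of the output depends on the set of
    datapoints, not on their order, via shared projection weights.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return A @ V

n, d = 5, 4
X = rng.standard_normal((n, d))                      # n datapoints, d features
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)

# Permuting the datapoints permutes the outputs identically.
perm = rng.permutation(n)
out_perm = self_attention(X[perm], Wq, Wk, Wv)
```

The bi-dimensional block in the paper additionally enforces the analogous equivariance over the input-dimension axis; this sketch only demonstrates the sequence axis.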

2. BACKGROUND

The aim of this section is to provide an overview of the key concepts used throughout the manuscript.

2.1. GAUSSIAN PROCESSES

A Gaussian process (GP) $f : \mathbb{R}^D \to \mathbb{R}$ is a stochastic process such that, for any finite collection of points $x_1, \ldots, x_n \in \mathbb{R}^D$, the random vector $(f_1, \ldots, f_n)$ with $f_i = f(x_i)$ follows a multivariate normal distribution [26]. In the case of regression, GPs offer an analytically tractable Bayesian posterior which provides accurate estimates of uncertainty, as shown in Fig. 1a, where the predictive variance (dashed black line) elegantly shrinks in the presence of data (black dots). GPs satisfy the Kolmogorov Extension Theorem (KET), which states that all finite-dimensional marginal distributions $p$ are consistent with each other under permutation (exchangeability) and marginalisation. Let $\pi$ be a permutation of $\{1, \ldots, n\}$; then the following holds for the GP's joint distribution:
$$p(f_1, \ldots, f_n) = p(f_{\pi(1)}, \ldots, f_{\pi(n)}) \quad \text{and} \quad p(f_1) = \int p(f_1, f_2, \ldots, f_n) \, \mathrm{d}f_2 \cdots \mathrm{d}f_n. \tag{1}$$
Despite their favourable properties, GPs are plagued by two limitations. Firstly, encoding prior assumptions through analytical covariance functions can be extremely difficult, especially in higher dimensions [39; 29; 17]. Secondly, by definition, GPs assume a multivariate Gaussian distribution for each finite collection of predictions, limiting the set of functions they can model [23; 36; 38; 13; 5; 27]. We will revisit these limitations in the context of our experiments.

2.2. NEURAL PROCESSES AND THE META-LEARNING OF FUNCTIONS

Neural Process (NP) models [7; 8] have recently been introduced as an alternative to GPs that addresses the aforementioned problems. NPs utilise deep neural networks to define a rich probability distribution over functions. During training, NPs leverage a large number of small datasets, so that knowledge can be transferred across datasets during inference. Despite being a promising research direction, NPs are severely limited by the fact that they do not produce consistent samples out-of-the-box. Broadly speaking, NPs have dealt with consistency in two ways: (1) by introducing an additional latent variable per function draw [8; 12], or (2) by only modelling the marginals [7]. The former leads to consistent samples, but the likelihood for these models is not analytically tractable, which necessitates crude approximations and ultimately limits their performance [21]. We observe this behaviour in Fig. 1b. The latter does not allow for sampling coherent functions at all, as no covariance information is available and all function values are modelled independently. We briefly summarise the family of NP models in Appendix D.4.

2.3. DIFFUSION MODELS

Our method relies on denoising diffusion probabilistic models (DPMs) [31], which we briefly summarise here. Diffusion models depend on two procedures: a forward and a reverse process, as illustrated in Fig. 2. The forward process consists of a Markov chain which incrementally adds random noise to the data. The reverse process is tasked with inverting this chain in order to construct desired data samples from random noise alone.
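As a minimal NumPy sketch of the forward process (using an arbitrary linear noise schedule for illustration, not necessarily the paper's choice): the Markov chain of small Gaussian noising steps collapses into a single closed-form Gaussian, $q(y_t \mid y_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, y_0, (1 - \bar{\alpha}_t) I)$, so a noised sample at any step $t$ can be drawn directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear variance schedule beta_1..beta_T (an illustrative choice).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)   # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(y0, t):
    """Draw y_t ~ q(y_t | y_0) = N(sqrt(abar_t) * y0, (1 - abar_t) * I).

    Thanks to the Gaussian Markov structure, step t is reachable in one
    draw instead of t sequential noising steps.
    """
    noise = rng.standard_normal(y0.shape)
    return np.sqrt(alpha_bars[t]) * y0 + np.sqrt(1.0 - alpha_bars[t]) * noise

y0 = np.sin(np.linspace(0, 2 * np.pi, 50))  # a clean "function" sample
y_mid = q_sample(y0, T // 2)                # partially noised
y_end = q_sample(y0, T - 1)                 # close to pure Gaussian noise
```

The reverse process then learns to invert these steps, mapping `y_end`-like noise back towards data; that denoising network is the learned component of the model.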



Figure 1: Conditional samples: The blue curves are posterior samples conditioned on the context dataset (black dots) from different probabilistic models.

