MARTINGALE POSTERIOR NEURAL PROCESSES

Abstract

A Neural Process (NP) estimates a stochastic process implicitly defined with neural networks given a stream of data, rather than pre-specifying a known prior such as a Gaussian process. An ideal NP would learn everything from data without any inductive biases, but in practice we often restrict the class of stochastic processes for ease of estimation. One such restriction is the use of a finite-dimensional latent variable accounting for the uncertainty in the functions drawn from NPs. Some recent works show that this can be improved with a more "data-driven" source of uncertainty, such as bootstrapping. In this work, we take a different approach based on the martingale posterior, a recently developed alternative to Bayesian inference. For the martingale posterior, instead of specifying a prior-likelihood pair, one specifies a predictive distribution for future data. Under specific conditions on the predictive distribution, it can be shown that the uncertainty in the generated future data actually corresponds to the uncertainty of implicitly defined Bayesian posteriors. Based on this result, instead of assuming any particular form for the latent variables, we equip an NP with a predictive distribution implicitly defined with neural networks and use the corresponding martingale posteriors as the source of uncertainty. The resulting model, which we name the Martingale Posterior Neural Process (MPNP), is demonstrated to outperform baselines on various tasks.

1. INTRODUCTION

A Neural Process (NP) (Garnelo et al., 2018a;b) meta-learns a stochastic process describing the relationship between inputs and outputs in a given data stream, where each task in the stream consists of a meta-training set of input-output pairs and a meta-validation set. The NP then defines an implicit stochastic process whose functional form is determined by a neural network taking the meta-training set as an input, and the parameters of the network are optimized to maximize the predictive likelihood of the meta-validation set. This approach differs philosophically from the traditional learning pipeline, where one first elicits a stochastic process from a known class of models (e.g., Gaussian Processes (GPs)) and hopes that it describes the data well. An ideal NP would assume minimal inductive biases and learn as much as possible from the data; in this regard, NPs can be framed as a "data-driven" way of choosing proper stochastic processes.

An important design choice for an NP model is how to capture the uncertainty in the random functions drawn from the stochastic process. When mapping the meta-training set into a function, one might employ a deterministic mapping as in Garnelo et al. (2018a). However, it is more natural to assume that multiple plausible functions might have generated the given data, and thus to encode this functional (epistemic) uncertainty as part of the NP model. Garnelo et al. (2018b) later proposed to map the meta-training set into a fixed-dimensional global latent variable with a Gaussian posterior approximation. While this improves upon the vanilla model without such a latent variable (Le et al., 2018), expressing the functional uncertainty only through the Gaussian-approximated latent variable has been reported to be a bottleneck (Louizos et al., 2019). To this end, Lee et al. (2020) and Lee et al. (2022) propose to apply the bootstrap to the meta-training set, using the uncertainty arising from the population distribution as a source of functional uncertainty.

In this paper, we take a rather different approach to defining the functional uncertainty for NPs. Specifically, we utilize the martingale posterior distribution (Fong et al., 2021), a recently developed alternative to conventional Bayesian inference. In the martingale posterior framework, instead of eliciting a likelihood-prior pair and inferring the Bayesian posterior, we elicit a joint predictive distribution over future data given observed data. Under suitable conditions on this predictive distribution, it can be shown that the uncertainty due to the generated future data indeed corresponds to the uncertainty of the Bayesian posterior. Following this, we endow an NP with a joint predictive distribution defined through neural networks and derive the functional uncertainty as the uncertainty arising when mapping the randomly generated future data to functions. Compared to the previous approaches of either explicitly positing a finite-dimensional latent variable encoding the functional uncertainty or deriving it from a population distribution, our method makes minimal assumptions about the predictive distribution and gives the model more freedom to choose the proper form of uncertainty solely from the data. By the theory of martingale posteriors, our model guarantees the existence of a martingale posterior corresponding to a valid Bayesian posterior over an implicitly defined parameter. Furthermore, working in the space of future observations allows us to combine the latent functional-uncertainty path with the deterministic path in a more natural manner. We name our extension of NPs with joint predictive generative models the Martingale Posterior Neural Process (MPNP).
We propose an efficient neural network architecture for the generative model that is easy to implement, flexible, and yet guarantees the existence of the martingale posterior. We also propose a training scheme that stably learns the parameters of MPNPs. On various synthetic and real-world regression tasks, we demonstrate that MPNP significantly outperforms previous NP variants in predictive performance.
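To make the martingale-posterior mechanism concrete before the formal setup, the following is a minimal numerical sketch of predictive resampling on a toy problem: given observed data, we repeatedly sample "future" observations from a one-step-ahead predictive distribution and recompute a statistic of the completed dataset; the spread of that statistic across resampling runs is the martingale posterior. The Gaussian running-mean predictive and all function names here are our illustrative choices, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def martingale_posterior_samples(data, n_future=500, n_draws=200, sigma=1.0):
    """Toy predictive resampling for the mean of a Gaussian.

    The one-step-ahead predictive is N(current running mean, sigma^2).
    Forward-sampling n_future observations and recomputing the mean of the
    completed dataset yields one draw from the implicit martingale posterior
    over the mean parameter; repeating gives n_draws posterior samples.
    """
    draws = []
    for _ in range(n_draws):
        total, count = float(np.sum(data)), len(data)
        for _ in range(n_future):
            x_new = rng.normal(total / count, sigma)  # sample from predictive
            total += x_new                            # predictive is updated
            count += 1                                # as data accumulates
        draws.append(total / count)  # statistic of the completed dataset
    return np.array(draws)

obs = rng.normal(2.0, 1.0, size=20)
post = martingale_posterior_samples(obs)
```

Because the running mean is a martingale under this predictive, the draws concentrate around the observed-data mean, with spread shrinking as the observed sample grows, mirroring a Bayesian posterior without ever specifying a prior.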

2.1. SETTINGS AND NOTATIONS

Let X = R^{d_in} be an input space and Y = R^{d_out} be an output space. We are given a set of tasks drawn i.i.d. from an (unknown) task distribution, τ_1, τ_2, ... ~ p_task(τ). A task τ consists of a dataset Z and an index set c, where Z = {z_i}_{i=1}^n and each z_i = (x_i, y_i) ∈ X × Y is an input-output pair. We assume the elements of Z are i.i.d. conditioned on some function f. The index set c ⊊ [n], where [n] := {1, ..., n}, defines the context set Z_c = {z_i}_{i∈c}. The target set Z_t is defined similarly with the index set t := [n] \ c.
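The setup above can be sketched as a small data-generation helper: sample a dataset Z for one task, draw a random index set c, and split Z into context Z_c and target Z_t. The sinusoidal latent function, noise level, and helper name are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(n=50, d_in=1, d_out=1, n_context=10):
    """Sample a toy task Z = {(x_i, y_i)}_{i=1}^n and split it into a
    context set Z_c and target set Z_t via a random index set c."""
    X = rng.uniform(-2.0, 2.0, size=(n, d_in))
    f = lambda x: np.sin(3.0 * x)                    # latent function for this task
    Y = f(X) + 0.1 * rng.normal(size=(n, d_out))     # i.i.d. outputs given f
    c = rng.choice(n, size=n_context, replace=False) # index set c ⊊ [n]
    t = np.setdiff1d(np.arange(n), c)                # t = [n] \ c
    return (X[c], Y[c]), (X[t], Y[t])

(Xc, Yc), (Xt, Yt) = make_task()
```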

2.2. NEURAL PROCESS FAMILIES

Our goal is to learn a class of random functions f : X → Y that can effectively describe the relationship between inputs and outputs across a set of tasks. Viewing this as a meta-learning problem, for each task τ we treat the context Z_c as a meta-training set and the target Z_t as a meta-validation set. We wish to meta-learn a mapping from the context Z_c to a random function f that recovers the given context Z_c (minimizing meta-training error) and predicts Z_t well (minimizing meta-validation error). Instead of directly estimating the infinite-dimensional f, we learn a mapping from Z_c to a predictive distribution for finite-dimensional observations,

p(Y | X, Z_c) = ∫ ∏_{i∈c} p(y_i | f, x_i) ∏_{i∈t} p(y_i | f, x_i) p(f | Z_c) df, (1)

where we assume the outputs Y are independent given f and X. We further restrict ourselves to simple heteroscedastic Gaussian measurement noise,

p(y | f, x) = N(y | μ_θ(x), σ²_θ(x) I_{d_out}), (2)

where μ_θ : X → Y and σ²_θ : X → R_+ map an input to a mean function value and the corresponding variance, respectively. Here θ ∈ R^h is a parameter indexing the function f, so the predictive distribution above can be written as

p(Y | X, Z_c) = ∫ ∏_{i∈[n]} N(y_i | μ_θ(x_i), σ²_θ(x_i) I_{d_out}) p(θ | Z_c) dθ. (3)

An NP is a parametric model that constructs the mapping from Z_c to θ with a neural network. The simplest version, the Conditional Neural Process (CNP) (Garnelo et al., 2018a), assumes a deterministic mapping from Z_c to θ,

p(θ | Z_c) = δ_{r_c}(θ), r_c = f_enc(Z_c; ϕ_enc), (4)
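A minimal sketch of the deterministic CNP-style forward pass described above: per-pair embeddings of the context are mean-pooled into a representation r_c (a delta mass on θ), and each target input is decoded into Gaussian parameters (μ_θ(x), σ²_θ(x)). The single-layer encoder/decoder and the fixed, untrained weights are illustrative stand-ins for f_enc(·; ϕ_enc) and the decoder, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

H = 16  # width of the context representation r_c (illustrative)

# Hypothetical fixed (untrained) weights standing in for the learned
# encoder parameters ϕ_enc and the decoder; d_in = d_out = 1 here.
W_enc = rng.normal(scale=0.5, size=(2, H))      # embeds (x, y) pairs
W_mu = rng.normal(scale=0.5, size=(H + 1, 1))   # decodes to μ_θ(x)
W_logvar = rng.normal(scale=0.5, size=(H + 1, 1))  # decodes to log σ²_θ(x)

def cnp_predict(Xc, Yc, Xt):
    """Deterministic CNP-style pass: r_c = mean-pooled context embedding
    (eq. 4), then per-target Gaussian parameters (eq. 2)."""
    pairs = np.concatenate([Xc, Yc], axis=-1)        # (n_c, 2) context pairs
    r_c = np.tanh(pairs @ W_enc).mean(axis=0)        # (H,) permutation-invariant
    feats = np.concatenate(
        [np.repeat(r_c[None, :], len(Xt), axis=0), Xt], axis=-1)  # (n_t, H+1)
    mu = feats @ W_mu                                # predictive mean μ_θ(x)
    var = np.exp(feats @ W_logvar)                   # predictive variance > 0
    return mu, var
```

Note that mean pooling makes r_c invariant to permutations of the context set, a property required of any mapping from the (unordered) set Z_c to θ.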

