AUTOREGRESSIVE CONDITIONAL NEURAL PROCESSES

Abstract

Conditional neural processes (CNPs; Garnelo et al., 2018a) are attractive meta-learning models which produce well-calibrated predictions and are trainable via a simple maximum likelihood procedure. Although CNPs have many advantages, they are unable to model dependencies in their predictions. Various works propose solutions to this, but these come at the cost of either requiring approximations or being limited to Gaussian predictions. In this work, we instead propose to change how CNPs are deployed at test time, without any modifications to the model or training procedure. Instead of making predictions independently for every target point, we autoregressively define a joint predictive distribution using the chain rule of probability, taking inspiration from the neural autoregressive density estimator (NADE) literature. We show that this simple procedure allows factorised Gaussian CNPs to model highly dependent, non-Gaussian predictive distributions. Perhaps surprisingly, in an extensive range of tasks with synthetic and real data, we show that CNPs in autoregressive (AR) mode not only significantly outperform non-AR CNPs, but are also competitive with more sophisticated models that are significantly more expensive and challenging to train. This performance is remarkable since AR CNPs are not trained to model joint dependencies. Our work provides an example of how ideas from neural distribution estimation can benefit neural processes, motivating research into the AR deployment of other neural process models.

1. INTRODUCTION

Conditional neural processes (CNPs; Garnelo et al., 2018a) are a family of meta-learning models which combine the flexibility of deep learning with the uncertainty awareness of probabilistic models. They are trained to produce well-calibrated predictions via a simple maximum-likelihood procedure, and naturally handle off-the-grid and missing data, making them ideally suited for tasks in climate science and healthcare. Since their introduction, attentive (ACNP; Kim et al., 2019) and convolutional (ConvCNP; Gordon et al., 2020) variants have also been proposed. Unfortunately, existing CNPs do not model statistical dependencies (Figure 1; left). This harms their predictive performance and makes it impossible to draw coherent function samples, which are necessary in downstream estimation tasks (Markou et al., 2022).

Figure 1: A ConvCNP trained on random sawtooth functions and applied in standard mode (left) and in our proposed autoregressive (AR) mode (right). The black crosses denote observed data points, the blue lines show model samples, and the bottom plots show the marginal predictive distributions at the locations marked by the dashed vertical lines. In standard mode, the CNP models each output with an independent Gaussian (left). However, when run in AR mode, the same CNP can produce coherent samples and model multimodality (right).

Various approaches have been proposed to address this. Garnelo et al. (2018b) introduced the latent neural process (LNP), which uses a latent variable to induce dependencies and model non-Gaussianity. However, this renders the likelihood intractable, necessitating approximate inference. Another approach is the fully convolutional Gaussian neural process (FullConvGNP; Bruinsma et al., 2021), which maintains tractability at the cost of only allowing Gaussian predictions. It uses a neural network to define the mean and covariance function of a predictive Gaussian process (GP) that models dependencies. However, it uses a much more complex architecture and is only practically applicable to problems with one-dimensional inputs, limiting its adoption compared to the more lightweight CNP. Recently, Markou et al. (2022) proposed the Gaussian neural process (GNP), which is considerably simpler but sacrifices performance relative to the FullConvGNP.

In this paper we propose a much simpler method for modelling dependencies with neural processes that has been largely overlooked: autoregressive (AR) sampling. AR sampling requires no changes to the architecture or training procedure. Instead, we change how the CNP is deployed at test time, extracting predictive information that would ordinarily be ignored. Instead of making predictions at all target points simultaneously, we autoregressively feed samples back into the model. AR CNPs trade the fundamental property of consistency under marginalisation and permutation, which is foundational to many neural process models, for non-Gaussian and correlated predictions. In Table 1 we place AR CNPs within the framework of other neural process models.

Table 1: Comparison of various classes of neural processes. Shows whether a class produces consistent predictions, models dependencies, can produce non-Gaussian predictions, and can be trained without approximating the objective function. For CNPs, even though the presentation by Garnelo et al. (2018a) assumes Gaussian predictions, it is simple to relax this Gaussianity assumption; this is not the case for GNPs.

Our key contributions are:

• We show that CNPs used in AR mode capture rich, non-Gaussian predictive distributions and produce coherent samples (Figure 1). This is remarkable, since these CNPs have Gaussian likelihoods, are not trained to model joint dependencies or non-Gaussianity, and are significantly cheaper to train than LNPs and FullConvGNPs (Figure 2).

• We prove that, given sufficient data and model capacity, the performance of AR CNPs is at least as good as that of GNPs, which explicitly model correlations in their predictions.

• Viewing AR CNPs as a type of neural density estimator (Uria et al., 2016), we highlight their connections to a range of existing methods in the deep generative modelling literature.

• In an extensive range of Gaussian and non-Gaussian regression tasks, we show that AR CNPs are consistently competitive with, and often significantly outperform, all other neural process models in terms of predictive log-likelihood.

• We deploy AR CNPs on a range of tasks involving real-world climate data. To handle the high-resolution data in a computationally tractable manner, we introduce a novel multiscale architecture for ConvCNPs. We also combine AR ConvCNPs with a beta-categorical mixture likelihood, producing strong results compared to other neural processes.

Our work represents a promising first application of this procedure to the simplest class of neural processes, and motivates future work on applications of AR sampling to other neural process models.
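The AR joint predictive described above is the chain-rule factorisation of the model's one-point predictives. As a sketch of the idea (the symbols $p_\theta$ for the CNP predictive and $\mathcal{D}_c$ for the context set are our labels, not notation fixed by this excerpt):

```latex
p_\theta(y_{1:n} \mid x_{1:n}, \mathcal{D}_c)
  = \prod_{i=1}^{n} p_\theta\big(y_i \,\big|\, x_i,\; \mathcal{D}_c \cup \{(x_j, y_j)\}_{j < i}\big)
```

Each factor is an ordinary (Gaussian) CNP marginal at a single target input, conditioned on the original context augmented with the previously sampled target points. Because later factors condition on earlier sampled values, the resulting joint, and its marginals once the intermediate $y_j$ are integrated out, can be dependent and non-Gaussian even though every factor is Gaussian.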
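In code, the test-time rollout can be sketched as follows. Here `cnp_predict` is a hypothetical stand-in for a trained CNP's per-point Gaussian predictive (a trivial nearest-neighbour rule is used purely so the sketch runs); it is not the paper's implementation.

```python
import numpy as np

def cnp_predict(x_ctx, y_ctx, x_tgt):
    # Stand-in for a trained CNP: returns an independent Gaussian
    # (mean, std) per target input. A real CNP would be a neural
    # network; a nearest-neighbour mean is used here only so the
    # sketch is self-contained and runnable.
    mean = np.array([y_ctx[np.argmin(np.abs(x_ctx - x))] for x in x_tgt])
    std = np.ones_like(mean)
    return mean, std

def ar_sample(x_ctx, y_ctx, x_tgt, rng):
    """Draw one joint sample at x_tgt by autoregressive rollout:
    predict a single target point, sample it, append the sample to
    the context as if it were observed data, and repeat."""
    x_ctx = np.asarray(x_ctx, dtype=float)
    y_ctx = np.asarray(y_ctx, dtype=float)
    sample = []
    for x in x_tgt:
        mean, std = cnp_predict(x_ctx, y_ctx, np.array([x]))
        y = rng.normal(mean[0], std[0])
        sample.append(y)
        # Feed the sampled value back into the model's context.
        x_ctx = np.append(x_ctx, x)
        y_ctx = np.append(y_ctx, y)
    return np.array(sample)

rng = np.random.default_rng(0)
s = ar_sample([0.0, 1.0], [0.0, 1.0], np.linspace(0.0, 1.0, 5), rng)
```

Note that, as the paper's Table 1 indicates, the resulting joint depends on the order in which the target points are visited; this is the consistency that AR CNPs trade away for correlated, non-Gaussian predictions.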

