AUTOREGRESSIVE CONDITIONAL NEURAL PROCESSES

Abstract

Conditional neural processes (CNPs; Garnelo et al., 2018a) are attractive metalearning models which produce well-calibrated predictions and are trainable via a simple maximum likelihood procedure. Although CNPs have many advantages, they are unable to model dependencies in their predictions. Various works propose solutions to this, but these come at the cost of either requiring approximations or being limited to Gaussian predictions. In this work, we instead propose to change how CNPs are deployed at test time, without any modifications to the model or training procedure. Instead of making predictions independently for every target point, we autoregressively define a joint predictive distribution using the chain rule of probability, taking inspiration from the neural autoregressive density estimator (NADE) literature. We show that this simple procedure allows factorised Gaussian CNPs to model highly dependent, non-Gaussian predictive distributions. Perhaps surprisingly, in an extensive range of tasks with synthetic and real data, we show that CNPs in autoregressive (AR) mode not only significantly outperform non-AR CNPs, but are also competitive with more sophisticated models that are significantly more expensive and challenging to train. This performance is remarkable since AR CNPs are not trained to model joint dependencies. Our work provides an example of how ideas from neural distribution estimation can benefit neural processes, motivating research into the AR deployment of other neural process models.

1. INTRODUCTION

Conditional neural processes (CNPs; Garnelo et al., 2018a) are a family of meta-learning models which combine the flexibility of deep learning with the uncertainty awareness of probabilistic models. They are trained to produce well-calibrated predictions via a simple maximum-likelihood procedure, and naturally handle off-the-grid and missing data, making them ideally suited for tasks in climate science and healthcare. Since their introduction, attentive (ACNP; Kim et al., 2019) and convolutional (ConvCNP; Gordon et al., 2020) variants have also been proposed. Unfortunately, existing CNPs do not model statistical dependencies (Figure 1; left). This harms their predictive performance and makes it impossible to draw coherent function samples, which are necessary in downstream estimation tasks (Markou et al., 2022).

Figure 1: A ConvCNP trained on random sawtooth functions and applied in standard mode (left) and in our proposed autoregressive (AR) mode (right). The black crosses denote observed data points, the blue lines show model samples, and the bottom plots show the marginal predictive distributions at the locations marked by the dashed vertical lines. In standard mode, the CNP models each output with an independent Gaussian (left). However, when run in AR mode, the same CNP can produce coherent samples and model multimodality (right).

Class                          Consistent   Dependencies   Non-Gaussian   Exact Training
AR CNPs (this work)            ✗            ✓              ✓              ✓
CNPs (Garnelo et al., 2018a)   ✓            ✗              ✓              ✓
GNPs (Markou et al., 2022)     ✓            ✓              ✗              ✓
LNPs (Garnelo et al., 2018b)   ✓            ✓              ✓              ✗

Table 1: Comparison of various classes of neural processes. The table shows whether a class produces consistent predictions, models dependencies, can produce non-Gaussian predictions, and can be trained without approximating the objective function. For CNPs, even though the presentation by Garnelo et al. (2018a) assumes Gaussian predictions, it is simple to relax this Gaussianity assumption; this is not the case for GNPs.

Various approaches have been proposed to address this. Garnelo et al. (2018b) introduced the latent neural process (LNP), which uses a latent variable to induce dependencies and model non-Gaussianity. However, this renders the likelihood intractable, necessitating approximate inference. Another approach is the fully convolutional Gaussian neural process (FullConvGNP; Bruinsma et al., 2021), which maintains tractability at the cost of only allowing Gaussian predictions. It uses a neural network to define the mean and covariance function of a predictive Gaussian process (GP) that models dependencies. However, it uses a much more complex architecture and is only practically applicable to problems with one-dimensional inputs, limiting its adoption compared to the more lightweight CNP. Recently, Markou et al. (2022) proposed the Gaussian neural process (GNP), which is considerably simpler but sacrifices performance relative to the FullConvGNP. In this paper we propose a much simpler method for modelling dependencies with neural processes that has been largely overlooked: autoregressive (AR) sampling.
AR sampling requires no changes to the architecture or training procedure. Instead, we change how the CNP is deployed at test time, extracting predictive information that would ordinarily be ignored. Instead of making predictions at all target points simultaneously, we autoregressively feed samples back into the model. AR CNPs trade the fundamental property of consistency under marginalisation and permutation, which is foundational to many neural process models, for non-Gaussian and correlated predictions. In Table 1 we place AR CNPs within the framework of other neural process models. Our key contributions are:

• We show that CNPs used in AR mode capture rich, non-Gaussian predictive distributions and produce coherent samples (Figure 1). This is remarkable, since these CNPs have Gaussian likelihoods, are not trained to model joint dependencies or non-Gaussianity, and are significantly cheaper to train than LNPs and FullConvGNPs (Figure 2).
• We prove that, given sufficient data and model capacity, the performance of AR CNPs is at least as good as that of GNPs, which explicitly model correlations in their predictions.
• Viewing AR CNPs as a type of neural density estimator (Uria et al., 2016), we highlight their connections to a range of existing methods in the deep generative modelling literature.
• In an extensive range of Gaussian and non-Gaussian regression tasks, we show that AR CNPs are consistently competitive with, and often significantly outperform, all other neural process models in terms of predictive log-likelihood.
• We deploy AR CNPs on a range of tasks involving real-world climate data. To handle the high-resolution data in a computationally tractable manner, we introduce a novel multiscale architecture for ConvCNPs. We also combine AR ConvCNPs with a beta-categorical mixture likelihood, producing strong results compared to other neural processes.

Our work represents a promising first application of this procedure to the simplest class of neural processes, and motivates future work on applications of AR sampling to other neural process models.

2. AUTOREGRESSIVE CONDITIONAL NEURAL PROCESSES

Meta-learning. We first define the problem setup. Let $\mathcal{X}$ be a compact input space and let $\mathcal{Y}$ be the output space. Let $\mathcal{D}_N = (\mathcal{X} \times \mathcal{Y})^N$ be the collection of all sets of $N$ input-output pairs, and let $\mathcal{D} = \bigcup_{N=0}^{\infty} \mathcal{D}_N$. We call elements $D \in \mathcal{D}$ data sets and denote $D = (x, y)$ where $x \in \mathcal{X}^N$, $y \in \mathcal{Y}^N$ are the inputs and outputs respectively. In meta-learning we are given a collection of data sets $(D_m)_{m=1}^M$, called a meta-data set, with the individual data sets $D_m$ called tasks (Vinyals et al., 2016). Every task $D_m$ is split up $D_m = D_m^{(c)} \cup D_m^{(t)}$ into a context set $D_m^{(c)} = (x_m^{(c)}, y_m^{(c)})$ and a target set $D_m^{(t)} = (x_m^{(t)}, y_m^{(t)})$. Here $x_m^{(c)}$ are called the context inputs, $y_m^{(c)}$ the context outputs, $x_m^{(t)}$ the target inputs, and $y_m^{(t)}$ the target outputs. Our goal is to devise an algorithm which takes in a context set $D_m^{(c)}$ and produces the best possible prediction for the target outputs $y_m^{(t)}$ given target inputs $x_m^{(t)}$.

[Figure 2: the horizontal axis shows wallclock training time (min.), with ticks from $10^0$ to $10^3$.]

Neural processes. Let $\mathcal{P}$ be the set of all $\mathcal{Y}$-valued stochastic processes on $\mathcal{X}$. Neural processes (NPs) directly and flexibly parametrise a map $\pi_\theta \colon \mathcal{D} \to \mathcal{Q}$ where $\mathcal{Q} \subseteq \mathcal{P}$ and where $\theta$ are learnable parameters. CNPs set $\mathcal{Q}$ to be the collection of GPs $f$ such that $\operatorname{cov}(f(x), f(y)) = 0$ for $x \neq y$. GNPs let $\mathcal{Q}$ be the collection of continuous GPs. Latent NPs (LNPs; Garnelo et al., 2018b) let $\mathcal{Q}$ be a collection of non-Gaussian processes by making use of a latent variable. Let $P_{x_m^{(t)}} \pi(D_m^{(c)})$ denote the finite-dimensional distribution of the process $\pi(D_m^{(c)})$ evaluated at inputs $x_m^{(t)}$, and denote its density by $q_\theta(\,\cdot \mid x_m^{(t)}, D_m^{(c)})$. To learn the parameters $\theta$, NPs seek to maximise

$$\mathcal{L}_M(\pi) = \frac{1}{M} \sum_{m=1}^{M} \log q_\theta(y_m^{(t)} \mid x_m^{(t)}, D_m^{(c)}). \tag{1}$$

For CNPs and GNPs, $\mathcal{L}_M$ can be computed exactly, since $q_\theta$ is Gaussian. However, for LNPs, $\mathcal{L}_M$ must be approximated (Garnelo et al., 2018b; Foong et al., 2020), typically impacting performance.

Autoregressive CNPs.
Our proposal is to take an existing CNP and run it in an autoregressive fashion, feeding predictions for earlier outputs back into the model. Inspired by the product rule, we define the joint predictive as a product of conditionals, modelling each conditional with a CNP. For example, in the case of three target points, the joint predictive is

$$q_\theta(y_3^{(t)} \mid y_{1:2}^{(t)}, D_m^{(c)})\, q_\theta(y_2^{(t)} \mid y_1^{(t)}, D_m^{(c)})\, q_\theta(y_1^{(t)} \mid D_m^{(c)}).$$

To enable a theoretical analysis of this procedure, we now proceed to set up more formal notation. Suppose that $\pi_\theta \colon \mathcal{D} \to \mathcal{Q}$ is an NP, and we wish to predict at some target inputs $x^{(t)}$ given a context set $D^{(c)}$. Standard NPs would output the predictive $P_{x^{(t)}} \pi_\theta(D^{(c)})$ which, for CNPs, would be a factorised Gaussian. We propose to instead roll out the NP autoregressively, as described in Proc. 2.1.

Procedure 2.1 (Autoregressive application of neural process). For a neural process $\pi_\theta$, context set $D^{(c)} = (x^{(c)}, y^{(c)})$, and target inputs $x^{(t)}$, let $\mathrm{AR}_{x^{(t)}}(\pi_\theta, D^{(c)})$ be the distribution defined as follows: for $i = 1, \ldots, N$,

$$y_i^{(t)} \sim P_{x_i^{(t)}} \pi_\theta\big(x^{(c)} \oplus x_{1:(i-1)}^{(t)},\; y^{(c)} \oplus y_{1:(i-1)}^{(t)}\big),$$

where $a \oplus b$ concatenates two vectors $a$ and $b$. See Figure 7 in Appendix C for an illustration.

Since earlier samples $y_i^{(t)}$ feed back into later applications of $\pi_\theta$, the whole sample $y^{(t)}$ is correlated, even if $\pi_\theta$ does not model dependencies between target outputs, as with CNPs. At test time, when evaluating the corresponding density $q_\theta^{(\mathrm{AR})}$ of $\mathrm{AR}_{x^{(t)}}(\pi_\theta, D^{(c)})$ at $y^{(t)}$, we use the formula

$$\log q_\theta^{(\mathrm{AR})}(y^{(t)} \mid x^{(t)}, D^{(c)}) = \sum_{i=1}^{N} \log q_\theta\big(y_i^{(t)} \mid x_i^{(t)},\; D^{(c)} \oplus (x_{1:(i-1)}^{(t)}, y_{1:(i-1)}^{(t)})\big).$$

Whilst any NP can be used in AR mode, we focus on CNPs as they are the computationally cheapest class.

Understanding the infinite data limit. To better understand why AR CNPs successfully model dependencies, we analyse the idealised case of infinite data and model capacity.
Let $p(f)$ be the law of the data-generating stochastic process, and let $p(\varepsilon)$ be the law of a stochastic process representing observation noise, defined by letting $\varepsilon(x)$ be a vector of i.i.d. noise variables for all $x$. We assume $y_m^{(c)} = y_m(x_m^{(c)})$ and $y_m^{(t)} = y_m(x_m^{(t)})$ where

$$y_m(\,\cdot\,) = f_m(\,\cdot\,) + \varepsilon_m(\,\cdot\,), \tag{4}$$

$(f_m)_{m=1}^M$ are i.i.d. draws from $p(f)$, and $(\varepsilon_m)_{m=1}^M$ are i.i.d. draws from $p(\varepsilon)$. Define the prediction map $\pi_y \colon \mathcal{D} \to \mathcal{P}$ as the mapping from a data set to the posterior over $y$, $\pi_y(D) = p(y \mid D)$. Then $\mathcal{L}_M$ is a Monte Carlo approximation of the following infinite-sample objective (Foong et al., 2020):

$$\mathcal{L}_\infty(\pi) = -\mathbb{E}_{p(D^{(c)})\, p(x^{(t)})}\big[\mathrm{KL}\big(P_{x^{(t)}} \pi_y(D^{(c)}),\; P_{x^{(t)}} \pi(D^{(c)})\big)\big] + \text{const}.$$

Under appropriate regularity assumptions, $\mathcal{L}_\infty(\pi)$ is maximised over all $\pi$ when the expected KL divergence term is zero, which occurs if and only if $\pi = \pi_y$. In practice, NPs do not maximise $\mathcal{L}_\infty(\pi)$ over all $\pi$, but (i) use a finite-sized meta-data set and (ii) restrict $\mathcal{Q} \subseteq \mathcal{P}$:

$$\underbrace{\pi_M \in \operatorname*{arg\,max}_{\text{all } \pi \colon \mathcal{D} \to \mathcal{Q}} \mathcal{L}_M(\pi)}_{\text{what we compute in practice}} \;\xrightarrow[M \to \infty]{\text{(i)}}\; \underbrace{\pi_\infty \in \operatorname*{arg\,max}_{\text{all } \pi \colon \mathcal{D} \to \mathcal{Q}} \mathcal{L}_\infty(\pi)}_{\text{ideal NP}} \;\xrightarrow[\mathcal{Q} \to \mathcal{P}]{\text{(ii)}}\; \underbrace{\pi_y = \operatorname*{arg\,max}_{\text{all } \pi \colon \mathcal{D} \to \mathcal{P}} \mathcal{L}_\infty(\pi)}_{\text{exact prediction map}} \tag{6}$$

Here $\pi_M$ is an NP trained on the practical objective (1), which, in the limit of infinite data, approximates the so-called ideal NP $\pi_\infty$. The ideal NP depends on the choice of $\mathcal{Q}$, i.e. the class of NPs under consideration, and, in turn, approximates $\pi_y$. For CNPs and GNPs, using the fact that minimising $\mathrm{KL}(p, q)$ over $q$ matches moments (Minka, 2001), we can readily determine and even practically compute the ideal NP for these two classes of NPs. The ideal CNP predicts a diagonal-covariance GP whose mean function and marginal variance function match $\pi_y$: $\pi_\infty(D) = \mathcal{GP}(m, k)$ where $m(x) = \mathbb{E}[y(x) \mid D]$, and $k(x, x') = \mathbb{V}[y(x) \mid D]$ if $x = x'$ and $k(x, x') = 0$ otherwise.
On the other hand, the ideal GNP predicts a GP whose mean function and full covariance function match $\pi_y$: $\pi_\infty(D) = \mathcal{GP}(m, k)$ where $m(x) = \mathbb{E}[y(x) \mid D]$, $k(x, x') = \operatorname{cov}(y(x), y(x') \mid D)$. The main result of this subsection is that the ideal CNP, despite not modelling correlations, performs at least as well as the ideal GNP when deployed in AR mode:

Proposition 2.1 (Advantage of AR CNPs over GNPs). Assume appropriate regularity conditions on $y$. Let $\pi_C$ be the ideal CNP and let $\pi_G$ be the ideal GNP. Then, for all inputs $x$ and data sets $D \in \mathcal{D}$,

$$\mathrm{KL}\big(P_x \pi_y(D),\; \mathrm{AR}_x(\pi_C, D)\big) \le \mathrm{KL}\big(P_x \pi_y(D),\; P_x \pi_G(D)\big).$$

We provide a proof in Appendix A. Intuitively, the advantage of AR CNPs comes from their ability to model non-Gaussian dependencies. Proposition 2.1 shows that to outperform the GNP, it suffices to train a CNP to model the marginals of $\pi_y$, and rely on the AR procedure to induce dependencies. A visualisation of the ideal CNP and the ideal CNP applied autoregressively can be seen in Figure 3.

Consistency and the AR design space. As shown in Table 1, AR CNPs give up the fundamental property of consistency, since the distributions $\{\mathrm{AR}_x(\pi_\theta, D_m^{(c)}) : x \in \mathcal{X}^N, N \in \mathbb{N}\}$ are not consistent under permutation or marginalisation: permuting $x$ and introducing or marginalising target points can change the distribution. This violates the conditions of the Kolmogorov extension theorem (Oksendal, 2013), preventing the distributions from defining a consistent stochastic process. There is thus a large design space involved when deploying AR CNPs, where choices that have no effect on the predictions of other NPs can now significantly affect performance. One such choice is how many points to sample at a time. Sampling one at a time induces dependencies between all points, but requires $N$ forward passes. Alternatively, we could divide the $N$ inputs in $x^{(t)}$ into blocks of $K$ points each, and sample each block with a single CNP forward pass.
This requires $N/K$ forward passes, with points in the same block conditionally independent. If $K = N$, this is the standard CNP prediction; and if $K = 1$, we recover Procedure 2.1. This provides a knob for practitioners to trade off between faster, consistent, but less expressive standard CNP predictions; and slower, less consistent, but more expressive AR predictions. In this paper, we use full AR mode with $K = 1$, and leave an investigation of block AR sampling to future work.

Obtaining smooth samples. Due to the lack of consistency in AR mode, the spacing chosen between target points can significantly affect performance. For example, care must be taken so the number of target points is not much greater than the size of the context sets seen during training, to avoid confronting the model with an out-of-distribution context set size at test time. This raises the question of how to sample functions on a very fine grid. Furthermore, since CNPs do not differentiate between epistemic and aleatoric uncertainty, it is not clear how to obtain smooth, noiseless samples, that is, samples for $f$ uncorrupted by the i.i.d. noise $\varepsilon$ in (4). The following proposition shows that, for a smooth sample corrupted by additive noise, the smooth component can be approximated with the predictive mean conditioned on noisy observations:

Proposition 2.2 (Recovery of smooth samples). Let $\mathcal{X} \subseteq \mathbb{R}$ be compact, and let $f$ be a stochastic process with surely continuous sample paths and $\sup_{x \in \mathcal{X}} \|f(x)\|_{L^2} < \infty$. Let $(\varepsilon_n)_{n \ge 0}$ be i.i.d. (potentially non-Gaussian) random variables such that $\mathbb{E}[\varepsilon_0] = 0$ and $\mathbb{V}[\varepsilon_0] < \infty$. Consider any sequence $(x_n)_{n \ge 1} \subseteq \mathcal{X}$, and let $x^* \in \mathcal{X}$ be a limit point of $(x_n)_{n \ge 1}$. If $y(x^*) = f(x^*) + \varepsilon_0$ and $y_n = f(x_n) + \varepsilon_n$ are noisy observations of $f$, then

$$\lim_{n \to \infty} \mathbb{E}[y(x^*) \mid y_1, \ldots, y_n] = f(x^*) \quad \text{almost surely.} \tag{8}$$
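As a numerical illustration of Proposition 2.2, the following minimal sketch considers only the Gaussian special case: $f$ is assumed to be a zero-mean GP with an exponentiated-quadratic kernel and the noise is assumed Gaussian. In that case $\mathbb{E}[y(x^*) \mid y_{1:n}]$ equals the GP posterior mean of $f(x^*)$, and its mean-square error is the posterior variance, which can be computed without sampling. The kernel, length scale, noise variance, the sequence $x_n = x^* + 1/n$, and all function names are illustrative assumptions, and the check concerns mean-square rather than almost-sure convergence.

```python
import numpy as np


def eq_kernel(a, b, lengthscale=1.0):
    """Exponentiated-quadratic kernel matrix between 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def posterior_variance_at(x_star, x_obs, noise_var=0.25):
    """Var[f(x_star) | y_1, ..., y_n] for a zero-mean GP f with EQ kernel.

    In the Gaussian case this equals the mean-square error of the
    conditional mean E[y(x_star) | y_1, ..., y_n] as an estimate of f(x_star).
    """
    k_oo = eq_kernel(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    k_so = eq_kernel(np.array([x_star]), x_obs)
    return float(1.0 - (k_so @ np.linalg.solve(k_oo, k_so.T))[0, 0])


# Observation locations x_n = x_star + 1/n accumulate at x_star.
x_star = 0.0
variances = {
    n: posterior_variance_at(x_star, x_star + 1.0 / np.arange(1, n + 1))
    for n in (5, 50, 500)
}
```

The posterior variance shrinks as more noisy observations accumulate near $x^*$, which is the mechanism that the denoising step relies on.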
We provide a proof in Appendix B. Equation (8) suggests the following two-step procedure for obtaining smooth samples from AR CNPs.

Step 1: Let $x_{1:n}$ be a number of target inputs that does not exceed the number of points seen during training. Sample $y_{1:n} \sim \mathrm{AR}_{x_{1:n}}(\pi_\theta, D_m^{(c)})$. This sample includes observation noise.

Step 2: Remove the noise from the sample by passing it through the model once more: $\mathcal{N}(\mu_{1:n}, D) = P_{x_{1:n}} \pi_\theta(D_m^{(c)} \oplus (x_{1:n}, y_{1:n}))$, where $D$ here denotes the predictive covariance rather than a data set. The predictive mean $\mu_{1:n}$ forms the noiseless sample.

To produce a sample at arbitrarily many inputs, one may also evaluate $\mathcal{N}(\mu'_{1:n}, D) = P_{x'_{1:n}} \pi_\theta(D_m^{(c)} \oplus (x_{1:n}, y_{1:n}))$ where $x'_{1:n}$ is arbitrary. The result of this procedure is illustrated in Figure 4, and was used to generate the noiseless samples shown in Figure 1 (right). Figure 7 in Appendix C also illustrates this two-step procedure in a pictorial step-by-step fashion.

Figure 4: Comparison of noiseless (left) and noisy (right) samples from an AR ConvCNP trained on data sampled from a GP with an exponentiated-quadratic kernel, and the ground truth GP. The noiseless AR samples were generated from the noisy AR samples using the procedure suggested by Proposition 2.2.
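Procedure 2.1, the AR log-density, the block variant with block size $K$, and the two-step denoising procedure can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used in the experiments: the stand-in model `marginal_predictor` is an assumption that returns the exact GP posterior marginals (an "ideal CNP" for GP data) where a trained network would normally be called, and every function name, kernel choice, and noise level is hypothetical.

```python
import numpy as np


def eq_kernel(a, b, lengthscale=0.5):
    """Exponentiated-quadratic kernel matrix between 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def marginal_predictor(xc, yc, xt, noise=0.05):
    """Stand-in for a trained CNP: independent Gaussian marginals at xt.

    Returns the exact GP posterior mean and marginal variance of the noisy
    outputs. A trained network would replace this function.
    """
    prior_var = np.full(len(xt), 1.0 + noise)
    if len(xc) == 0:
        return np.zeros(len(xt)), prior_var
    k_cc = eq_kernel(xc, xc) + noise * np.eye(len(xc))
    k_tc = eq_kernel(xt, xc)
    mean = k_tc @ np.linalg.solve(k_cc, yc)
    var = prior_var - np.sum(k_tc * np.linalg.solve(k_cc, k_tc.T).T, axis=1)
    return mean, var


def ar_sample(model, xc, yc, xt, rng, block_size=1):
    """Sample targets block by block, appending each sample to the context.

    block_size=1 is Procedure 2.1; block_size=len(xt) is a standard
    (non-AR) CNP sample with independent target points.
    """
    samples = []
    for start in range(0, len(xt), block_size):
        xb = xt[start:start + block_size]
        mean, var = model(xc, yc, xb)
        yb = mean + np.sqrt(var) * rng.standard_normal(len(xb))
        xc, yc = np.concatenate([xc, xb]), np.concatenate([yc, yb])
        samples.append(yb)
    return np.concatenate(samples)


def ar_logpdf(model, xc, yc, xt, yt):
    """Sum of one-point Gaussian log-densities with a growing context."""
    total = 0.0
    for i in range(len(xt)):
        mean, var = model(np.concatenate([xc, xt[:i]]),
                          np.concatenate([yc, yt[:i]]), xt[i:i + 1])
        total += -0.5 * (np.log(2 * np.pi * var[0])
                         + (yt[i] - mean[0]) ** 2 / var[0])
    return total


def denoise(model, xc, yc, xt, yt, x_query):
    """Step 2: predictive mean after conditioning on the noisy AR sample."""
    mean, _ = model(np.concatenate([xc, xt]), np.concatenate([yc, yt]), x_query)
    return mean


rng = np.random.default_rng(0)
xc, yc = np.array([-1.0, 0.3]), np.array([0.5, -0.2])
xt = np.linspace(-2.0, 2.0, 8)
yt = ar_sample(marginal_predictor, xc, yc, xt, rng)             # Step 1
logp = ar_logpdf(marginal_predictor, xc, yc, xt, yt)
smooth = denoise(marginal_predictor, xc, yc, xt, yt,
                 np.linspace(-2.0, 2.0, 50))                    # Step 2
```

Because the stand-in model returns exact conditionals of a Gaussian process, the AR log-density here coincides with the joint Gaussian log-density by the chain rule, which gives a simple correctness check. For a trained CNP the conditionals are only approximate, so the result will generally depend on the ordering of the target points.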

3. CONNECTIONS TO OTHER NEURAL DISTRIBUTION ESTIMATORS

Various paradigms have been developed for neural distribution estimators (NDEs): normalising flows (Dinh et al., 2015) , generative adversarial networks (GANs; Goodfellow et al., 2014) , variational autoencoders (VAEs; Kingma & Welling, 2014) , autoregressive models (Uria et al., 2016) , and diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) . Figure 5 visualises the landscape of NDEs. We argue that NPs and AR CNPs should be viewed as neural distribution estimators (NDEs) and be placed in this landscape. AR CNPs inherit the strengths of AR models, such as the ability to model complex dependencies with a tractable likelihood, but also some of their weaknesses, most notably slow test-time sampling. Slow sampling is the main drawback of AR CNPs, though it may be possible to adapt techniques for speeding up AR models (Ramachandran et al., 2017) . One major difference between AR CNPs and existing AR models is that AR CNPs decompose the joint distribution of an uncountably infinite set of variables, allowing querying at arbitrary input locations (Section 2). Like DEformer (Alcorn & Nguyen, 2021) , EoNADE (Uria et al., 2014) , and XLnet (Yang et al., 2019) , AR CNPs are trained to not prefer any particular order of decomposing the joint distribution into conditionals. To achieve this goal, the AR CNP shares design choices with other AR models: (i) a shared architecture is used to produce each conditional distribution, similar to WaveNet (Oord et al., 2016a) and PixelCNN (Oord et al., 2016b) ; (ii) the data point index is given as input to the network as in the DEformer model (Alcorn & Nguyen, 2021) ; and (iii) training maximises a log-likelihood including all decompositions of the joint distribution, similar to EoNADE (Uria et al., 2014) and XLnet (Yang et al., 2019) . Figure 5 also shows the connections between NPs, VAEs and normalising flows (NFs). 
Like VAEs, LNPs use decoders that parametrise a factorised distribution and rely on the latent variable to induce dependencies. Again, the key difference is that LNPs model a distribution over an uncountable set of variables. Models like conditional BRUNO (Korshunova et al., 2020) and copula GNPs (Markou et al., 2022) combine ideas from NPs and NFs, transforming a stochastic process with an invertible transformation. Finally, GAN models such as Spatial GAN (Jetchev et al., 2016) and ∞-GAN (Lu et al., 2020) model countable numbers of variables, such as images of arbitrary size. Inspecting Figure 5 , we see that GANs are the only class of models depicted that do not currently have an NP version: a version that models an uncountable number of variables. This suggests adversarial training of NPs as an interesting avenue for future investigation. In recent work, Nguyen & Grover (2022) proposed the Transformer NP (TNPs), which uses a causally-masked transformer architecture with an autoregressive likelihood. In contrast, rather than proposing a new AR architecture, our work focuses on running existing CNPs in AR mode to obtain coherent samples and improved likelihoods, without modifying the architecture or training procedure. In prior work, Volpp et al. (2021) used AR sampling in order to visualise samples from CNPs. However, their work focuses on proposing a novel context aggregation mechanism for NPs, and they do not evaluate the likelihood of CNPs autoregressively or investigate any performance gains.

4. THE PERFORMANCE OF AUTOREGRESSIVE NEURAL PROCESSES

In this section we investigate the performance of AR CNPs on synthetic and real data. Across a wide range of tasks, the AR CNP is competitive with much more sophisticated approaches. Throughout, we train LNPs with both the ELBO (Garnelo et al., 2018b) and ML objective (Foong et al., 2020). Code for implementations of NPs and reproducing our experiments can be found at https://github.com/wesselb/neuralprocesses. For all experiments, we use a random ordering of the target points in Proc. 2.1; see App. D for a justification.

4.1. SYNTHETICALLY GENERATED GAUSSIAN AND NON-GAUSSIAN DATA

Synthetic experiment setup. We evaluate an extensive collection of NP models on a wide range of Gaussian and non-Gaussian synthetic regression tasks. We consider tasks with functions drawn from (i) different GPs; (ii) a non-Gaussian sawtooth process (as in Figure 1); (iii) a non-Gaussian mixture task, where, with equal probability, we sample the function from one of three possible GPs or the sawtooth process. We also consider various versions of the tasks for different input and output dimensions $d_x$, $d_y$, with dependencies across the output channels. To ensure a fair comparison, we configure the architectures to make the parameter counts comparable between all models.

Results. Table 2 highlights the best performing models on some representative tasks; for further results across all twenty synthetic tasks and further experimental details, see Appendix H. The AR procedure dramatically improves the performance of the ConvCNP, with the AR ConvCNP being the best performing model for most tasks, except on the Gaussian EQ task where it performs marginally worse than the FullConvGNP. In particular, the AR ConvCNP outperforms the FullConvGNP and ConvGNP on non-Gaussian tasks, in agreement with Proposition 2.1, while having a faster training time than the other convolutional models (Figure 2).
For the sawtooth task, Figure 11 in Appendix H.2 illustrates that predictions by the AR ConvCNP can be multi-modal and non-Gaussian, even when using a Gaussian likelihood. Finally, we note that in tasks with $d_x = 2$, where the FullConvGNP cannot be used (as discussed in Section 1), the AR ConvCNP far outperforms all competing approaches.

4.2. SIM-TO-REAL TRANSFER

[...] can be easily generated to train them. We consider the hare-lynx data set, which is a population time series of snowshoe hares and Canadian lynx (MacLulich, 1937). To generate simulated data, we use a stochastic version of the Lotka-Volterra equations (Lotka, 1910; Volterra, 1926):

$$dX_t = \alpha X_t \, dt - \beta Y_t X_t \, dt + \sigma X_t^{\nu} \, dW_t^{(1)}, \qquad dY_t = -\gamma Y_t \, dt + \delta Y_t X_t \, dt + \sigma Y_t^{\nu} \, dW_t^{(2)}. \tag{9}$$

Under these equations, the prey population $X_t$ grows exponentially with rate $\alpha$, the predator population $Y_t$ decays exponentially with rate $\gamma$, and the predators hunt the prey. $W^{(1)}$ and $W^{(2)}$ are independent Brownian motions introducing noisy behaviour. These equations generate non-Gaussian data with both within-channel as well as cross-channel dependencies. We simulate the Lotka-Volterra equations on a dense grid, and use the simulations to generate meta-data sets in three different ways. Interpolation: we randomly divide the data into context and target sets. Forecasting: we choose a random time, before which all data are contexts, and all future data are targets. Reconstruction: we randomly choose between $X_t$ and $Y_t$, split the chosen series as in forecasting, and append the other series to the context. In training, for every batch, we choose one of these tasks uniformly at random.

Results. Table 3 shows the results of the best performing models. The AR ConvCNP performs best both on the simulated as well as the real data, again demonstrating that running CNPs in AR mode improves performance and can even outperform strong GNP and LNP baselines. For full experimental details and additional results see Appendix I.
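The simulation and the three task types described above can be sketched as follows, using an Euler-Maruyama discretisation of the stochastic Lotka-Volterra equations. This is a sketch under stated assumptions, not the authors' data pipeline: all parameter values, the step size, the initial populations, the clipping at zero, the 50/50 interpolation split, and the names `simulate_lotka_volterra` and `make_task` are illustrative choices that do not come from the paper (the settings actually used are in Appendix I).

```python
import numpy as np


def simulate_lotka_volterra(rng, t_max=50.0, dt=0.01, x0=5.0, y0=10.0,
                            alpha=0.5, beta=0.05, gamma=0.5, delta=0.05,
                            sigma=0.1, nu=1.0 / 6.0):
    """Euler-Maruyama simulation of prey X_t and predators Y_t.

    All default values are placeholders. Populations are clipped at zero
    (an assumption) so that the fractional power stays well defined.
    """
    n = int(round(t_max / dt))
    x, y = np.empty(n + 1), np.empty(n + 1)
    x[0], y[0] = x0, y0
    for i in range(n):
        # Independent Brownian increments for the two channels.
        dw1, dw2 = np.sqrt(dt) * rng.standard_normal(2)
        x_new = (x[i] + (alpha * x[i] - beta * y[i] * x[i]) * dt
                 + sigma * x[i] ** nu * dw1)
        y_new = (y[i] + (-gamma * y[i] + delta * y[i] * x[i]) * dt
                 + sigma * y[i] ** nu * dw2)
        x[i + 1], y[i + 1] = max(x_new, 0.0), max(y_new, 0.0)
    return np.linspace(0.0, t_max, n + 1), x, y


def make_task(rng, n, kind):
    """Boolean context/target masks of shape (2, n): rows are X_t and Y_t."""
    context = np.ones((2, n), dtype=bool)
    if kind == "interpolation":
        context = rng.random((2, n)) < 0.5        # assumed 50/50 split
    elif kind == "forecasting":
        context[:, rng.integers(1, n):] = False   # future of both series
    elif kind == "reconstruction":
        # Future of one series is the target; the other is fully context.
        context[rng.integers(2), rng.integers(1, n):] = False
    else:
        raise ValueError(f"unknown task kind: {kind}")
    return context, ~context


rng = np.random.default_rng(0)
t, x, y = simulate_lotka_volterra(rng)
kind = rng.choice(["interpolation", "forecasting", "reconstruction"])
context_mask, target_mask = make_task(rng, len(t), kind)
```

Drawing `kind` uniformly for each batch mirrors the training setup described in the text. Subsampling the dense grid to a manageable number of points is omitted here.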

4.3. ELECTROENCEPHALOGRAM EXPERIMENTS

Electroencephalogram data. We next trained various NPs on real time series data consisting of electroencephalogram (EEG) measurements (Zhang et al., 1995), following Markou et al. (2022). Each time series consists of 256 regularly spaced measurements across 7 correlated channels. For each channel, we select a random number of the 256 points, chosen uniformly at random, to be target points, and use the remaining ones as context points, independently across the channels. Results. After training, we test the models on this interpolation task and also on a reconstruction task, where we set a part of a channel as target and the remainder as context. In Table 4, we observe that the AR ConvCNP is competitive with the FullConvGNP, despite having significantly shorter training times and fewer parameters. Both the AR ConvCNP and the FullConvGNP outperform the ConvCNP and the ConvLNP. Full experimental details are in Appendix J.
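The per-channel context/target split for the interpolation task can be sketched as below. The text does not state the range from which the number of target points is drawn, so the range 1 to 255 used here is an assumption, as is the function name `split_eeg_task`.

```python
import numpy as np


def split_eeg_task(rng, n_channels=7, n_points=256):
    """Return boolean (context, target) masks of shape (n_channels, n_points).

    For each channel independently, a random number of time points
    (assumed to lie between 1 and n_points - 1) is drawn without
    replacement and marked as targets; the rest are context points.
    """
    target = np.zeros((n_channels, n_points), dtype=bool)
    n_target = rng.integers(1, n_points, size=n_channels)  # upper bound exclusive
    for channel in range(n_channels):
        idx = rng.choice(n_points, size=int(n_target[channel]), replace=False)
        target[channel, idx] = True
    return ~target, target


context_mask, target_mask = split_eeg_task(np.random.default_rng(0))
```

Because the masks differ between channels, a point that is a target in one channel can still be observed in another, which is what lets a model exploit cross-channel correlations.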

4.4. ENVIRONMENTAL MODELLING

Environmental datasets bring a range of modelling challenges. One example is fusing spatiotemporal data from disparate sources (Chang & Bai, 2018; Lahat et al., 2015) , which arises in diverse environmental sciences applications from climate monitoring to hydrology (Gettelman et al., 2022; Ferrer-Cid et al., 2020; Robinson et al., 2021; Lu et al., 2010; Hosseini & Kerachian, 2017) . Another challenge involves estimating the probability of events of interest, such as the compound risk of both low wind speeds at an offshore wind farm and high cloud cover over a solar panel farm. To obtain robust uncertainty estimates for such events, it is essential to model correlations as well as non-Gaussian marginals (such as cloud cover). Current GAN-based approaches (e.g. Ravuri et al. 2021) can capture both joint and non-Gaussian statistics, but they cannot perform data fusion or predict at arbitrary off-grid locations. The AR ConvCNP can fuse data of on-grid and off-grid modalities and make predictions at arbitrary locations while modelling arbitrary non-Gaussian likelihoods and capturing statistical dependencies, thus achieving all the desiderata and filling a gap in the environmental modelling toolbox. Here, we assess the AR ConvCNP on two common environmental modelling tasks, namely data assimilation and statistical downscaling. Data assimilation. Data assimilation is the task of combining observations of the Earth system to produce predictions on a regular grid, called a reanalysis. Reanalyses are typically generated by fitting the trajectories of physics-based climate models to observations (Hersbach et al., 2020; Gettelman et al., 2022) , but the potential for improving data assimilation with ML has drawn increasing attention in recent years (Geer, 2021) . To explore the AR ConvCNP's data assimilation abilities for a non-Gaussian variable, we train convolutional NP models to predict simulated daily-average cloud cover fraction over Antarctica. 
We use reanalysis data from ECMWF ERA5 (Hersbach et al., 2020) as ground truth. Cloud cover takes values in the interval [0, 1], with observations frequently taking values of 0 or 1 (Figure 14 ). We evaluate the performance of NPs using either a Gaussian likelihood or a beta-categorical mixture model with three components: two discrete delta components for values of exactly 0 or 1, and a beta distribution capturing continuous values in (0, 1). This provides a robust way of handling 0 and 1 values, unlike the existing copula GNP model (Markou et al., 2022) which can have its output constrained in (0, 1) but places zero density at the endpoints. Data assimilation results. In Table 5 we see that the AR ConvCNP significantly outperforms competing NPs for both the Gaussian and beta-categorical likelihoods. Figure 6 shows samples drawn from the models, after observing context points on half of the space. The AR ConvCNP displays remarkable ability to extrapolate rich, non-stationary, multi-scale structure, such as sudden changes in cloud cover over the Ross Ice Shelf coastline at the bottom of the figure. By comparison, the ConvLNP and ConvGNP produce blurry, lower frequency samples. Unlike GPs, convolutional NP models have a fixed receptive field induced by the CNN architecture used for the encoder, which is computationally expensive to increase. Away from the context points on the left, samples from the non-AR models will be independent of the observations, reverting to some mean representation of the data (Fig. 6c-e ). This highlights a further benefit of AR CNPs: successive AR applications increase the receptive field, enabling rich, conditional sample structure to extrapolate far away from observed data. See Appendix K for further commentary, sample figures, and details. Environmental downscaling. 
The spatial resolutions of physics-based reanalyses are limited by their vast computational demands, making them unsuitable for capturing local and extreme events (Stocker et al., 2013; Maraun et al., 2017) . Statistical downscaling addresses this issue by leveraging additional information to produce fine-grained predictions (Maraun & Widmann, 2018) . Recently, NPs have been shown to outperform a large ensemble of existing climate downscaling approaches (Vaughan et al., 2022) . We compare the AR ConvCNP to the MLP ConvCNP of Vaughan et al. and the MLP ConvGNP of (Markou et al., 2022) in a temperature downscaling task over Germany. In this task, the context data consist of low-resolution ERA-Interim reanalysis data and high-resolution topography, and the target data consist of weather station observations from the ECA&D dataset. We also consider a second setup where we reveal some station observations to aid the downscaling process. As Appendix L.2 explains, the MLP ConvCNP and MLP ConvGNP cannot be extended to include these station observations. We therefore introduce a novel multiscale architecture, which we use to run the ConvCNP in AR mode. See Appendix L for full experimental details. Environmental downscaling results. In Table 6 we observe that the AR ConvCNP matches the performance of the ConvGNP, which is remarkable as the latter has been previously demonstrated to outperform a range of state-of-the-art downscaling approaches (Markou et al., 2022; Vaughan et al., 2022) . When additional observations from weather stations are revealed, the AR ConvCNP significantly outperforms the MLP ConvGNP in both metrics. 
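The beta-categorical mixture likelihood used in the data assimilation experiment, with point masses at exactly 0 and 1 and a beta density on (0, 1), can be written down directly. The sketch below evaluates its log-density for a single observation; the parameter names `p0`, `p1`, `a`, `b` and the function name are hypothetical, and in an NP these four quantities would be produced by the decoder for every target point, a step not shown here.

```python
import math


def beta_categorical_logpdf(y, p0, p1, a, b):
    """Log-density of a three-component mixture on [0, 1].

    p0, p1: probabilities of the point masses at exactly 0 and exactly 1.
    a, b:   shape parameters of the beta component on the open interval.
    The density is taken with respect to Lebesgue measure on (0, 1) plus
    counting measure on {0, 1}.
    """
    if not (p0 > 0 and p1 > 0 and p0 + p1 < 1 and a > 0 and b > 0):
        raise ValueError("need p0, p1 > 0, p0 + p1 < 1 and a, b > 0")
    if y == 0.0:
        return math.log(p0)
    if y == 1.0:
        return math.log(p1)
    if not 0.0 < y < 1.0:
        return -math.inf
    log_beta_density = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                        + (a - 1.0) * math.log(y)
                        + (b - 1.0) * math.log1p(-y))
    return math.log(1.0 - p0 - p1) + log_beta_density


clear_sky = beta_categorical_logpdf(0.0, p0=0.2, p1=0.3, a=2.0, b=3.0)
partly_cloudy = beta_categorical_logpdf(0.4, p0=0.2, p1=0.3, a=2.0, b=3.0)
```

Unlike a density that is constrained to (0, 1), this mixture assigns positive probability to the endpoints, so observations of exactly 0 or 1 have finite log-likelihood.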

5. LIMITATIONS AND CONCLUSION

We have shown that the AR procedure can be readily applied to improve the performance of CNPs, producing coherent samples and dramatically improved likelihoods. Surprisingly, in an extensive range of experiments, this simple approach often outperforms more complicated methods which rely on latent variables or which explicitly model correlations. We demonstrate the effectiveness of our approach on data sets of real-world interest by applying AR CNPs on climate data fusion tasks, modelling [0, 1]-constrained data with a beta-categorical likelihood and introducing a novel multiscale architecture. Notably, AR CNPs fill a gap in the climate modelling toolbox by enabling joint, non-Gaussian predictives, which could be used to better estimate the magnitude of compound risks. We also position AR CNPs within the larger neural density estimator literature, showing the fruitfulness of combining NPs with other modelling paradigms. More generally, AR CNPs equip the NP framework with a new knob where modelling complexity and computational expense at training time can be traded for computational expense at test time. In particular, the higher quality samples and better likelihoods obtained by applying NPs autoregressively come with the additional cost of performing a forward pass for every element in the target set. This can be prohibitively expensive for large target sets, and constitutes the primary practical drawback of using AR CNPs. In addition, since AR CNPs do not define a consistent stochastic process, design choices for the AR procedure may affect the quality of the results. Thus practitioners need to avoid choosing target sets that lead to pathological behaviour, such as when the spatial density of the target inputs is too high. However, the flexibility of this design space also presents an opportunity: as an example, in Appendix M we show that auxiliary target points can be used to further improve predictions. 
Finally, promising avenues for future work include applying the AR procedure to other NPs besides CNPs, and investigating the efficacy of the block sampling scheme presented in Section 2.

6. REPRODUCIBILITY STATEMENT

All our experiments are carried out using either synthetic or publicly available datasets. The EEG data set is available through the UCI database, and the environmental data are also publicly available through the European Climate Data Service. We make publicly available all code necessary to reproduce our experiments, as well as instructions for downloading, preprocessing, and modelling the Antarctic cloud cover data. Proofs for Propositions 2.1 and 2.2 are given in Appendix A and Appendix B respectively. Details on the model architectures and the experimental setup can be found in Appendices F to H for the synthetic datasets, Appendix I for the sim-to-real transfer experiments, Appendix J for the EEG experiments, Appendix K for the data assimilation experiment, and Appendix L for the downscaling experiment.

7. ETHICS STATEMENT

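Deploying CNPs autoregressively improves their performance dramatically, but we do not foresee adverse societal impacts as a result of this work. That being said, capturing the statistical trends present in a dataset must be done with care, especially in safety-critical applications, where incorrect yet confident predictions can have severe consequences. We view the AR procedure as a useful tool, rather than a panacea, for capturing such behaviours, and hope this work encourages further research into building effective but reliable models to this end. We also note that while training CNPs is computationally cheaper than alternative NP models, AR sampling itself incurs a substantial computational cost, and thus energy cost, at test time. Running AR sampling on a large scale could lead to greater power demands for these models, resulting in larger carbon footprints which are undesirable. However, we believe the potential benefits for environmental modelling could outweigh this cost, while leveraging methods to make AR CNPs more computationally efficient should help alleviate this issue.

8. ACKNOWLEDGEMENTS

A PROOF OF PROPOSITION 2.1

The "appropriate regularity conditions". Let $\mathcal{P}^N_\lambda$ be the collection of distributions on $\mathbb{R}^N$ that (a) have a density with respect to the Lebesgue measure and (b) have a covariance matrix which is strictly positive definite. Let $\mathcal{P}^N_{\lambda,\mathrm{G}} \subseteq \mathcal{P}^N_\lambda$ be the subcollection of distributions which are Gaussian. Then, by Corollary B.1 of Bruinsma et al. (2021), for all $\mu \in \mathcal{P}^N_\lambda$ such that $\inf_{\nu \in \mathcal{P}^N_{\lambda,\mathrm{G}}} \mathrm{KL}(\mu, \nu) < \infty$,

$$\operatorname*{arg\,min}_{\nu \in \mathcal{P}^N_{\lambda,\mathrm{G}}} \mathrm{KL}(\mu, \nu) = \mathcal{N}(\mu) \tag{10}$$

where $\mathcal{N}(\mu)$ denotes the Gaussian distribution with mean vector and covariance matrix equal to those of $\mu$. In the proposition, by appropriate regularity conditions on $y$, we mean the assumption that, for all inputs $\mathbf{x}$ and $D \in \mathcal{D}$, $P_{\mathbf{x}} \pi_y(D)$ is in $\mathcal{P}^{|\mathbf{x}|}_\lambda$ and such that $\inf_{\nu \in \mathcal{P}^{|\mathbf{x}|}_{\lambda,\mathrm{G}}} \mathrm{KL}(P_{\mathbf{x}} \pi_y(D), \nu) < \infty$.

Assume the appropriate regularity conditions on $y$. We now list three technical observations.

(1) Note that $P_{\mathbf{x}_1 \mid \mathbf{x}_2} \pi_y(D)$ is the distribution of $y(\mathbf{x}_1) \mid D, (\mathbf{x}_2, \mathbf{y}_2)$, so we have the identity $P_{\mathbf{x}_1 \mid \mathbf{x}_2} \pi_y(D) = P_{\mathbf{x}_1} \pi_y(D \oplus (\mathbf{x}_2, \mathbf{y}_2))$. Therefore, for all inputs $\mathbf{x}_1$, inputs $\mathbf{x}_2$, and $D \in \mathcal{D}$, $P_{\mathbf{x}_1 \mid \mathbf{x}_2} \pi_y(D)$ is in $\mathcal{P}^{|\mathbf{x}_1|}_\lambda$ and such that $\inf_{\nu \in \mathcal{P}^{|\mathbf{x}_1|}_{\lambda,\mathrm{G}}} \mathrm{KL}(P_{\mathbf{x}_1 \mid \mathbf{x}_2} \pi_y(D), \nu) < \infty$.

(2) The ideal CNP $\pi_{\mathrm{C}}$ matches the means and marginal variances of the true posterior predictives (Section 2). Hence, for all $x \in \mathcal{X}$ and $D \in \mathcal{D}$, $P_x \pi_{\mathrm{C}}(D)$ is in $\mathcal{P}^1_{\lambda,\mathrm{G}}$.

(3) The ideal GNP $\pi_{\mathrm{G}}$ matches the mean vectors and covariance matrices of the true posterior predictives (Section 2). Hence, for all inputs $\mathbf{x}$ and $D \in \mathcal{D}$, $P_{\mathbf{x}} \pi_{\mathrm{G}}(D)$ is in $\mathcal{P}^{|\mathbf{x}|}_{\lambda,\mathrm{G}}$; which means that, for all $x_1 \in \mathcal{X}$, inputs $\mathbf{x}_2$, and $D \in \mathcal{D}$, $P_{x_1 \mid \mathbf{x}_2} \pi_{\mathrm{G}}(D)$ is in $\mathcal{P}^1_{\lambda,\mathrm{G}}$.

In the proof, to apply and use (10), we implicitly use these observations.

Proposition 2.1 (Advantage of AR CNPs over GNPs). Assume appropriate regularity conditions on $y$. Let $\pi_{\mathrm{C}}$ be the ideal CNP and let $\pi_{\mathrm{G}}$ be the ideal GNP. Then, for all inputs $\mathbf{x}$ and data sets $D \in \mathcal{D}$,

$$\mathrm{KL}(P_{\mathbf{x}} \pi_y(D), \mathrm{AR}_{\mathbf{x}}(\pi_{\mathrm{C}}, D)) \le \mathrm{KL}(P_{\mathbf{x}} \pi_y(D), P_{\mathbf{x}} \pi_{\mathrm{G}}(D)).$$

Proof of Proposition 2.1. Let $\mathbf{x}$ be some inputs and let $D \in \mathcal{D}$ be some data set. We will argue that, for all $n = 1, \ldots, |\mathbf{x}|$,

$$\mathrm{KL}\big(P_{x_n \mid \mathbf{x}_{1:(n-1)}} \pi_y(D),\; P_{x_n} \pi_{\mathrm{C}}(D \oplus (\mathbf{x}_{1:(n-1)}, \mathbf{y}_{1:(n-1)}))\big) \le \mathrm{KL}\big(P_{x_n \mid \mathbf{x}_{1:(n-1)}} \pi_y(D),\; P_{x_n \mid \mathbf{x}_{1:(n-1)}} \pi_{\mathrm{G}}(D)\big).$$

Assuming this inequality, the result follows directly from the chain rule for the KL divergence in combination with the definition of $\mathrm{AR}_{\mathbf{x}}$ (Procedure 2.1):

\begin{align*}
\mathrm{KL}(P_{\mathbf{x}} \pi_y(D), \mathrm{AR}_{\mathbf{x}}(\pi_{\mathrm{C}}, D))
&= \sum_{n=1}^{|\mathbf{x}|} \mathbb{E}_{\mathbf{y}_{1:(n-1)}}\big[\mathrm{KL}\big(P_{x_n \mid \mathbf{x}_{1:(n-1)}} \pi_y(D),\; P_{x_n} \pi_{\mathrm{C}}(D \oplus (\mathbf{x}_{1:(n-1)}, \mathbf{y}_{1:(n-1)}))\big)\big] \\
&\le \sum_{n=1}^{|\mathbf{x}|} \mathbb{E}_{\mathbf{y}_{1:(n-1)}}\big[\mathrm{KL}\big(P_{x_n \mid \mathbf{x}_{1:(n-1)}} \pi_y(D),\; P_{x_n \mid \mathbf{x}_{1:(n-1)}} \pi_{\mathrm{G}}(D)\big)\big] \tag{13} \\
&= \mathrm{KL}(P_{\mathbf{x}} \pi_y(D), P_{\mathbf{x}} \pi_{\mathrm{G}}(D))
\end{align*}

where the expectations are over $\mathbf{y}_{1:(n-1)} \sim P_{\mathbf{x}_{1:(n-1)}} \pi_y(D)$. To prove the inequality, note that, conditional on $\mathbf{y}_{1:(n-1)}$, using (10),

$$\operatorname*{arg\,min}_{\nu \in \mathcal{P}^1_{\lambda,\mathrm{G}}} \mathrm{KL}\big(P_{x_n \mid \mathbf{x}_{1:(n-1)}} \pi_y(D), \nu\big) = \mathcal{N}\big(P_{x_n \mid \mathbf{x}_{1:(n-1)}} \pi_y(D)\big).$$

By the property of $\pi_{\mathrm{C}}$ that it matches the mean and marginal variance of the true posterior (Section 2),

\begin{align*}
\mathcal{N}\big(P_{x_n \mid \mathbf{x}_{1:(n-1)}} \pi_y(D)\big)
&= \mathcal{N}\big(P_{x_n} \pi_y(D \oplus (\mathbf{x}_{1:(n-1)}, \mathbf{y}_{1:(n-1)}))\big) \\
&= P_{x_n} \pi_{\mathrm{C}}(D \oplus (\mathbf{x}_{1:(n-1)}, \mathbf{y}_{1:(n-1)})). \tag{17}
\end{align*}

Therefore,

$$\operatorname*{arg\,min}_{\nu \in \mathcal{P}^1_{\lambda,\mathrm{G}}} \mathrm{KL}\big(P_{x_n \mid \mathbf{x}_{1:(n-1)}} \pi_y(D), \nu\big) = P_{x_n} \pi_{\mathrm{C}}(D \oplus (\mathbf{x}_{1:(n-1)}, \mathbf{y}_{1:(n-1)})).$$

Noting that $P_{x_n \mid \mathbf{x}_{1:(n-1)}} \pi_{\mathrm{G}}(D) \in \mathcal{P}^1_{\lambda,\mathrm{G}}$, we obtain the desired inequality.

B PROOF OF PROPOSITION 2.2

Proposition 2.2 (Recovery of smooth samples). Let $\mathcal{X} \subseteq \mathbb{R}$ be compact, and let $f$ be a stochastic process with surely continuous sample paths and $\sup_{x \in \mathcal{X}} \|f(x)\|_{L^2} < \infty$. Let $(\varepsilon_n)_{n \ge 0}$ be i.i.d. (potentially non-Gaussian) random variables such that $\mathbb{E}[\varepsilon_0] = 0$ and $\mathbb{V}[\varepsilon_0] < \infty$. Consider any sequence $(x_n)_{n \ge 1} \subseteq \mathcal{X}$, and let $x^* \in \mathcal{X}$ be a limit point of $(x_n)_{n \ge 1}$. If $y(x^*) = f(x^*) + \varepsilon_0$ and $y_n = f(x_n) + \varepsilon_n$ are noisy observations of $f$, then

$$\lim_{n \to \infty} \mathbb{E}[y(x^*) \mid y_1, \ldots, y_n] = f(x^*) \quad \text{almost surely.}$$

Proof of Proposition 2.2. Consider the increasing filtration $\mathcal{F}_n = \sigma(y_1, \ldots, y_n)$ with limit $\mathcal{F}_\infty = \sigma(\bigcup_{n=1}^\infty \mathcal{F}_n)$. Also let $\mathcal{T}_n = \sigma(\varepsilon_{n+1}, \varepsilon_{n+2}, \ldots)$ and consider the tail $\sigma$-algebra $\mathcal{T} = \bigcap_{n=1}^\infty \mathcal{T}_n$. Let $(x_{n_i})_{i=1}^\infty$ be a subsequence of $(x_n)_{n=1}^\infty$ such that $x_{n_i} \to x^*$. Let $g_n = \frac{1}{n} \sum_{i=1}^n y_{n_i}$. Since $g_n$ is a function of finitely many of the observations $y_1, y_2, \ldots$, it is $\mathcal{F}_\infty$-measurable. Note that

$$g_n = \frac{1}{n} \sum_{i=1}^n f(x_{n_i}) + \frac{1}{n} \sum_{i=1}^n \varepsilon_{n_i}.$$

By sure continuity of $f$, the first term converges to $f(x^*)$ surely. By the strong law of large numbers (Example 5.6.1; Durrett, 2010), the second term converges to zero on a tail event $A \in \mathcal{T}$ of probability one. We conclude that $\mathbb{1}_A f(x^*)$ is $\sigma(\mathcal{F}_\infty, \mathcal{T})$-measurable. Therefore, by almost sure convergence of $L^2$-bounded martingales (Theorem 5.4.5; Durrett, 2010),

\begin{align*}
\lim_{n \to \infty} \mathbb{E}[y(x^*) \mid y_1, \ldots, y_n]
&= \lim_{n \to \infty} \mathbb{E}[f(x^*) \mid y_1, \ldots, y_n] && (\mathbb{E}[\varepsilon_0] = 0) \tag{20} \\
&= \lim_{n \to \infty} \mathbb{E}[f(x^*) \mid \mathcal{F}_n] && (\text{definition of } \mathcal{F}_n) \tag{21} \\
&= \lim_{n \to \infty} \mathbb{E}[f(x^*) \mid \mathcal{F}_n, \mathcal{T}] && (\sigma(f(x^*), \mathcal{F}_n) \perp \mathcal{T}) \tag{22} \\
&= \lim_{n \to \infty} \mathbb{E}[\mathbb{1}_A f(x^*) \mid \mathcal{F}_n, \mathcal{T}] && (\mathbb{P}(A) = 1) \tag{23} \\
&= \mathbb{E}[\mathbb{1}_A f(x^*) \mid \mathcal{F}_\infty, \mathcal{T}] && (L^2\text{-mart. convergence}) \tag{24} \\
&= \mathbb{1}_A f(x^*) && (\mathbb{1}_A f(x^*) \in \sigma(\mathcal{F}_\infty, \mathcal{T})) \tag{25} \\
&= f(x^*), && (\mathbb{P}(A) = 1)
\end{align*}

where all equalities hold almost surely.

C ILLUSTRATION OF THE AR PROCEDURE

Figure 7 depicts the AR sampling procedure (Procedure 2.1) and the procedure to produce smooth samples (Proposition 2.2) using the ConvCNP trained on the EQ data process from Section 4.1.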

[Figure 7. Panel titles: Model fit; Step 1: Draw noisy samples using AR sampling (Procedure 2.1); . . .]

D NUMBER AND ORDER OF TARGET POINTS

When deploying a conditional neural process (CNP) autoregressively (AR; Procedure 2.1), the number and ordering of the target points matter. In this appendix, we describe our observations of the effects of the number and ordering of the target points on the quality of the predictions. In short, our recommendation is to choose a different random ordering for every sample, and not to let the number or density of target points exceed that at training time.

D.1 EFFECTS OF THE NUMBER OF TARGET POINTS

During the AR sampling procedure, the AR CNP is evaluated at context sets of increasing size. Our experience is that, as long as the sizes of these context sets do not exceed the sizes seen at training time, the predictions should not be significantly affected by changes in the number of target points. However, if the AR sampling procedure evaluates the model at context sets of larger sizes than seen during training time, then that presents the model with an out-of-distribution situation. What happens then comes down to how well the neural networks generalise. Our experience is that the predictions quickly start to break down. A notable exception to this rule of thumb is the class of convolutional-deep-set-based models, such as the Convolutional Conditional Neural Process (ConvCNP; Gordon et al., 2020). For these models, the magnitude of the density channel is what determines whether the model generalises or not. This means that it is not the total number of points that matters, but rather the density of the points. Therefore, the AR ConvCNP can be evaluated at arbitrarily many target points, as long as the density of these points does not significantly exceed the density of context points seen at training time. Once the density exceeds the density of the training data, the model is presented with an out-of-distribution situation, and what happens then again comes down to how well neural networks generalise. Although the number or density of points in the AR sampling procedure should not exceed that at training time, AR CNPs can still produce high-quality samples at arbitrarily many target points by following the trick outlined at the end of the two-step procedure below Proposition 2.2.
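The density rule of thumb above can be turned into a rough budget for the number of AR target points. The following is a minimal sketch under strong assumptions that are ours, not the paper's: one-dimensional inputs, points spread roughly uniformly, and a hypothetical helper `max_ar_points` that compares average densities only.

```python
import math


def max_ar_points(train_max_points, train_width, eval_width, n_context=0):
    """Rough upper bound on the number of AR target points (hypothetical helper).

    Assumes the densest situation seen in training was `train_max_points`
    points spread over an interval of length `train_width`, and that the
    evaluation points are spread over an interval of length `eval_width`.
    The existing context points count towards the density budget, because
    AR sampling appends every sampled target to the context set.
    """
    train_density = train_max_points / train_width
    budget = math.floor(train_density * eval_width) - n_context
    return max(budget, 0)


# Example: at most 30 points on an interval of length 4 during training,
# 10 context points already observed on an interval of the same length.
print(max_ar_points(30, 4.0, 4.0, n_context=10))
```

This only bounds the average density; clusters of nearby target inputs can still exceed the training density locally.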

D.2 EFFECTS OF THE ORDERING OF TARGET POINTS

Our experience is that, as long as the number of target points (or density) does not exceed that at training time, the ordering of the target points has little effect. Appendix D.1 also demonstrates this. When the density of the target points does not exceed that of the training data (50 and 100 points), sampling randomly or left to right makes little difference. However, once the density of the target points comes close to or exceeds that of the training data (200, 500, and 1000 points), we observe a difference in performance between sampling randomly and sampling left to right. Across all numbers of target points, a random ordering seems to perform most robustly. Our recommendation is therefore to choose a different random ordering of the target points for every sample.
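The recommendation above can be sketched as follows. This is a minimal illustration, not our implementation: `toy_predict` is a hypothetical stand-in for a CNP forward pass that returns a Gaussian mean and variance, and `ar_sample` draws one AR sample using a fresh random ordering of the target inputs.

```python
import math
import random


def toy_predict(context, x):
    """Hypothetical stand-in for a CNP forward pass (not a trained model).

    Returns the mean and variance of a Gaussian prediction at input `x`,
    using a kernel-weighted average of the context outputs.
    """
    if not context:
        return 0.0, 1.0
    weights = [math.exp(-0.5 * ((x - xc) / 0.5) ** 2) for xc, _ in context]
    total = sum(weights)
    mean = sum(w * yc for w, (_, yc) in zip(weights, context)) / (total + 1e-3)
    var = 1.0 / (1.0 + total) + 0.05
    return mean, var


def ar_sample(predict, context, targets, rng):
    """Draw one AR sample at `targets` using a fresh random ordering."""
    order = list(range(len(targets)))
    rng.shuffle(order)            # new random ordering for every sample
    running = list(context)       # copy, so the caller's context is untouched
    ys = [None] * len(targets)
    for i in order:
        mean, var = predict(running, targets[i])
        y = rng.gauss(mean, math.sqrt(var))
        ys[i] = y
        running.append((targets[i], y))  # feed the sample back as context
    return ys


rng = random.Random(0)
samples = [ar_sample(toy_predict, [(0.0, 1.0)], [-1.0, 0.5, 1.0], rng) for _ in range(3)]
```

Each call to `ar_sample` costs one forward pass per target point, which is the test-time expense discussed in the main text.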

D.3 ANALYSIS OF AR CNPS FOR CNPS WITH GAUSSIAN MARGINALS

In this subsection, we argue that, for CNPs with Gaussian marginals, predictions in the first few AR steps might be poor, but predictions in later AR steps tend to be more accurate. Choosing a different random ordering for every sample therefore "averages out" the effects from these first few AR steps. When evaluating a CNP with Gaussian marginals in AR mode, every conditional prediction in the AR process is Gaussian. Let us consider the process of producing an AR sample. For the first target input x 1 , we run the CNP forward to obtain a distribution for the corresponding target output y 1 . In reality, the true posterior most likely is non-Gaussian, which means that the prediction for the first target point may be poor. Nevertheless, we sample this Gaussian, append the sample (x 1 , y 1 ) to the context set, and run the CNP forward again. Because we now feed the earlier sample y 1 through the non-linear network, the marginal predictive for the next target output y 2 (having integrated out y 1 ) is non-Gaussian. As we perform more AR steps, the marginal predictions of later points become increasingly non-Gaussian, increasing the model's flexibility. We see that, for a given ordering of the target inputs, the prediction for the first target input is likely poor (because it is Gaussian), and (in the best case) the predictions become more and more accurate as we take more AR steps (because they become more and more non-Gaussian). This is exactly what is happening in Figure 3 : the left prediction is Gaussian and therefore a poor approximation, and, as we go to the right and take more and more AR steps, the prediction becomes more and more non-Gaussian and therefore more accurate. If we were to feed the target inputs in right to left, then the same phenomenon would happen. The right prediction would be a Gaussian and a very poor approximation, and, as we go to the left and take more AR steps, the prediction would become more non-Gaussian and therefore more accurate. 
More generally, for a given ordering of the target points, the ordering will produce high quality predictions if the conditional distributions of the AR factorisation match the corresponding conditional distributions of the true posterior. Since the conditionals of the AR CNP are typically Gaussian by design, this means that the ordering is "good" if the corresponding conditionals of the true posterior are close to Gaussian. So when is a conditional of the posterior close to Gaussian? Let us assume that the true underlying process is a sum of a non-Gaussian process (constituting epistemic uncertainty) and independent Gaussian noise (constituting aleatoric uncertainty). Generally, a conditional will have both epistemic and aleatoric uncertainty, so a Gaussian will be a bad fit. However, as we condition the conditionals of the true generative process on more and more data, the underlying function will be pinned down more and more accurately, meaning that the conditional will consist mostly of aleatoric uncertainty, which is Gaussian. Therefore, as we condition on more and more data, we expect the conditionals to become more and more Gaussian. This again suggests that the samples in the first few AR steps might be a poor fit (because the corresponding conditionals of the true posterior are not yet Gaussian), but that samples in later AR steps should be a better fit (because the corresponding conditionals are then close to Gaussian). To summarise, an ordering of the target points is "good" if the corresponding conditionals of the true posterior are also close to Gaussian. Under the assumption that the ground-truth process is a non-Gaussian process with additive Gaussian noise, conditionals tend to be close to Gaussian if they are conditioned on many data points. As a consequence, the earlier conditionals in the AR factorisation tend to be poor fits to the ground-truth posterior, whereas later conditionals tend to produce better fits. 
Choosing a different random ordering for every sample therefore "averages out" the effects from the first few AR steps.
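To make the argument above concrete, consider two AR steps. Writing $\mu_\theta$ and $\sigma^2_\theta$ for the mean and variance output by the CNP (notation introduced here for illustration only), the marginal predictive for $y_2$ after integrating out $y_1$ is

$$p(y_2 \mid D) = \int \mathcal{N}\big(y_2;\, \mu_\theta(x_2; D \oplus (x_1, y_1)),\, \sigma^2_\theta(x_2; D \oplus (x_1, y_1))\big)\, \mathcal{N}\big(y_1;\, \mu_\theta(x_1; D),\, \sigma^2_\theta(x_1; D)\big)\, \mathrm{d}y_1.$$

This is a continuous mixture of Gaussians indexed by $y_1$. It is itself Gaussian if $\mu_\theta(x_2; D \oplus (x_1, y_1))$ is affine in $y_1$ and $\sigma^2_\theta(x_2; D \oplus (x_1, y_1))$ does not depend on $y_1$; for a non-linear network neither generally holds, so the marginal for $y_2$ is in general non-Gaussian even though every conditional is Gaussian.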

D.4 EFFECT OF THE RANDOM ORDERING ON THE SPREAD OF THE LOG-LIKELIHOOD

We have thus far argued for the benefit of using random ordering in AR, due to the robustness it provides. However, one issue with random orderings is that, since different random orderings do not in general give rise to the same predictive distribution, we may obtain different predictive log-likelihoods in practice, depending on the exact random ordering that we sample. Ideally, we would like not only the mean predictive log-likelihood (averaged out over orderings) to be high, but also the standard deviation of the log-likelihood (due to, again, different random orderings) to be small. In other words, we would like the model to perform well regardless of the random ordering which we happen to sample. At this point, note that if the true underlying process is Gaussian, then a sufficiently well-trained AR CNP with Gaussian marginals would have a small such spread in the log-likelihood, because all conditional predictions of the model will be close to the ground truth conditional predictions. Consequently the order with which we make predictions will have a small effect on the log-likelihood, resulting in a small spread of predictive log-likelihood values. Consider for example the case where the conditionals of the CNP exactly match the conditionals of the true process. In this case, there will be zero variance in the predictive log-likelihood of the process under different orderings. However, the situation is different when the ground truth is non-Gaussian. In this case, as we explained in the previous section, the conditionals of the first few target points may be highly non-Gaussian under the true process, while those of the AR CNP are Gaussian. In this case, we may get different log-likelihoods depending on the random order that we happen to sample. Figure 9 provides a quantitative illustration of the above point. 
In this figure, we show the standard deviation in the per datapoint predictive log-likelihood of an AR CNP (due to different random orderings) on two variants of a task with sawtooth data. On the first variant, we always pass an empty context set to the model (blue), and on the other task, we pass non-empty context sets with randomly sampled number of context points, uniformly distributed between 0 and 100 (red). We observe that for empty contexts (blue), we get a relatively large standard deviation in predictive log-likelihood for the first few target points. This likely happens because, initially, the model may randomly pick a target input where the conditional of the true process is highly non-Gaussian (making a poor prediction), or it might pick a target input where the true conditional is Gaussian (making a good prediction). This results in a larger variance in performance for the first few target points. However, as more target points are introduced, the standard deviation shrinks. This is because the conditionals of the true process become increasingly Gaussian, which means that no matter which target input is picked next, the model will approximate the true conditional accurately using a Gaussian, thereby reducing the impact of the ordering of subsequent points on the variance of the log likelihood. Further, introducing a relatively modest number of initial context points (red) in a second variant of the task, substantially reduces the spread in the predictive log-likelihoods. This is again because conditioning on a context set means that the conditionals of the true process are better approximated by Gaussians, reducing the impact that different random orderings have on the spread of the log-likelihood. In practice, in our experiments, we have found the variance in the log-likelihood to be near-zero for Gaussian or Gaussian-like ground truth processes, and larger, but acceptable, for non-Gaussian tasks.
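The spread discussed above can be estimated by scoring the same target set under several random orderings. The sketch below is illustrative only: `independent_predict` is a hypothetical predictor that ignores the context, chosen because its conditionals are trivially consistent across orderings, so the spread should vanish up to floating-point error. In practice `predict` would be replaced by a trained CNP.

```python
import math
import random
import statistics


def gaussian_logpdf(y, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)


def ar_loglik(predict, context, xs, ys, order):
    """AR log-likelihood of targets (xs, ys) for one given ordering."""
    running = list(context)
    total = 0.0
    for i in order:
        mean, var = predict(running, xs[i])
        total += gaussian_logpdf(ys[i], mean, var)
        running.append((xs[i], ys[i]))  # condition on the true target value
    return total


def loglik_spread(predict, context, xs, ys, n_orderings, rng):
    """Mean and standard deviation of the per-datapoint AR log-likelihood
    over `n_orderings` random orderings (requires n_orderings >= 2)."""
    values = []
    for _ in range(n_orderings):
        order = list(range(len(xs)))
        rng.shuffle(order)
        values.append(ar_loglik(predict, context, xs, ys, order) / len(xs))
    return statistics.mean(values), statistics.stdev(values)


def independent_predict(context, x):
    """Hypothetical predictor that ignores the context entirely."""
    return 0.0, 1.0


mean, std = loglik_spread(
    independent_predict, [], [0.0, 1.0, 2.0], [0.1, -0.2, 0.3], 20, random.Random(0)
)
```

For a predictor whose conditionals do depend on the ordering, `std` quantifies how sensitive the reported log-likelihood is to the ordering that happens to be sampled.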

E DETAILS FOR FIGURE 3

The generative process visualised in the top panel of Figure 3 is defined by the following mixture distribution:

$$p_{\text{true}}(y \mid x) = a_1 \mathcal{N}(f_1(x), 1) + a_2 \mathcal{N}(f_2(x), 1) + a_3 \mathcal{N}(f_3(x), 1).$$

Given this mixture distribution, the (Gaussian) ideal CNP can be computed in closed form by computing the first two moments of $p_{\text{true}}$: $p_{\text{CNP}}(y \mid x) = \mathcal{N}(\mu(x), \sigma^2(x))$ where

$$\mu(x) = \sum_{i=1}^3 a_i f_i(x), \qquad \sigma^2(x) = \sum_{i=1}^3 a_i \big(1 + f_i(x)^2\big) - \Big(\sum_{i=1}^3 a_i f_i(x)\Big)^2.$$

The updated mixture weights for the posterior distribution $p_{\text{true}}(y \mid x, D^{(c)})$ given a context set $D^{(c)}$ can be computed via Bayes' rule, and $p_{\text{CNP}}(y \mid x, D^{(c)})$ can be computed given the updated mixture weights. Note that in Figure 3 the prior mixture weights are $a_1 = a_3 = 0.25$ and $a_2 = 0.5$, the means are given by $f_1(x) = x^2 + 1$, $f_2(x) = x$, and $f_3(x) = -x$, and the target locations are $x = 1, 2, 4,$ and $6$. The bottom four panels of Figure 3 show kernel density estimates (Gaussian kernel) of 30 000 samples drawn from the generative distribution $p_{\text{true}}(y_1, y$ […]

F DESCRIPTION OF MODELS

The architectures follow the descriptions from the respective papers in which they are introduced. Although these descriptions are for one-dimensional inputs and outputs, the architectures are readily generalised to multidimensional inputs and outputs; we will explicitly mention wherever that generalisation requires extra care. All architectures use ReLU activation functions. All GNPs, in addition to a covariance matrix over the target points, also output heterogeneous observation noise along the marginal means; the total covariance over the target points is thus the sum of the covariance produced by the model and a diagonal matrix formed from these observation noises.

Conditional neural process (CNP; Garnelo et al., 2018a). Set the dimensionality of the encoding to $K = 256$. Parametrise the encoder with a three-hidden-layer multi-layer perceptron (MLP) of width 256; and parametrise the decoder with a six-hidden-layer MLP of width 256.
For multidimensional outputs, let the decoder have width 512. For multidimensional outputs where outputs can have context points at different inputs, produce a separate encoding for every output and concatenate these into one big encoding. These encoders may or may not share parameters. In our experiments, for two-dimensional outputs, parametrise separate encoders; for higher-dimensional outputs, apply the same encoder.

Gaussian neural process (GNP; Markou et al., 2022). Use the same choices for $K$, the encoder, and the decoder as the CNP. Set the rank of the kernel map to $R = 64$. As mentioned in the introduction, let the decoder produce one extra dimension which forms heterogeneous observation noise. For multidimensional outputs, the same caveats as for the CNP apply.

Latent neural process (LNP; Garnelo et al., 2018b). The LNP builds off the CNP. Call the existing encoder the deterministic encoder. The LNP adds one more encoder called the stochastic encoder. The stochastic encoder mimics the deterministic encoder, but outputs a $K$-dimensional vector of means and a $K$-dimensional vector of marginal variances. These are used to sample a $K$-dimensional Gaussian latent variable (the stochastic encoding). The decoder now additionally takes in the stochastic encoding. For multidimensional outputs, the same caveats as for the CNP apply.

Attentive conditional neural process (ACNP; Kim et al., 2019). The ACNP builds off the CNP. It replaces the deterministic encoder $\mathrm{enc}_\theta : \mathcal{D} \to \mathbb{R}^K$ with an eight-head attentive encoder $\mathrm{enc}^{(\text{att})}_\theta : \mathcal{D} \times \mathcal{X} \to \mathbb{R}^K$ (Vaswani et al., 2017). Unlike the original deterministic encoder $\mathrm{enc}_\theta$, the new attentive encoder $\mathrm{enc}^{(\text{att})}_\theta$ also takes in the target input. Let $D^{(c)} = (\mathbf{x}^{(c)}, \mathbf{y}^{(c)}) \in \mathcal{D}$ be a context set of size $N$ and let $x^{(t)} \in \mathcal{X}$ be a target input. We now describe the computation of $\mathrm{enc}^{(\text{att})}_\theta(D^{(c)}, x^{(t)})$. Parametrise $\phi_x : \mathcal{X} \to (\mathbb{R}^{32})^8$ and $\phi_{xy} : \mathcal{X} \times \mathcal{Y} \to (\mathbb{R}^{32})^8$ both with three-hidden-layer MLPs of width 256.
Compute

the keys: $(k_{h,n})_{h=1}^8 = \phi_x(x^{(c)}_n)$ for $n = 1, \ldots, N$,
the values: $(v_{h,n})_{h=1}^8 = \phi_{xy}(x^{(c)}_n, y^{(c)}_n)$ for $n = 1, \ldots, N$,
the query: $(q_h)_{h=1}^8 = \phi_x(x^{(t)})$.

Then compute

$$v^{(q)}_h = \sum_{n=1}^N \frac{e^{\langle q_h, k_{h,n} \rangle}}{\sum_{n'=1}^N e^{\langle q_h, k_{h,n'} \rangle}}\, v_{h,n} \in \mathbb{R}^{32}.$$

Concatenate $v^{(q)} = (v^{(q)}_1, \ldots, v^{(q)}_8) \in \mathbb{R}^{256}$ and $q = (q_1, \ldots, q_8) \in \mathbb{R}^{256}$. Let $L : \mathbb{R}^{256} \to \mathbb{R}^{256}$ be a linear layer; let $\phi^{(\text{res})} : \mathbb{R}^{256} \to \mathbb{R}^{256}$ be a one-hidden-layer MLP of width 256; and let $\mathrm{norm}_1$ and $\mathrm{norm}_2$ be two layer normalisation layers with learned pointwise transformations (Ba et al., 2016). Then

$$\mathrm{enc}^{(\text{att})}_\theta(D^{(c)}, x^{(t)}) = \mathrm{norm}_2(z + \phi^{(\text{res})}(z)) \quad \text{where} \quad z = \mathrm{norm}_1(v^{(q)} + Lq). \tag{39}$$

For multidimensional outputs, the same caveats as for the CNP apply.

Attentive Gaussian neural process (AGNP). The AGNP builds off the GNP. It replaces the deterministic encoder with the same eight-head attentive deterministic encoder of the ACNP.

Attentive neural process (ALNP; Kim et al., 2019). The ALNP builds off the LNP. It replaces the deterministic encoder with the same eight-head attentive deterministic encoder of the ACNP.

Convolutional Conditional Neural Process (ConvCNP; Gordon et al., 2020). Set the discretisation to an evenly spaced grid at a certain density (the points per unit) spanning a bit more (the margin) than the most extremal context and target inputs. The points per unit and margin are specified separately for every experiment. Initialise the length scales of all Gaussian kernels to twice the interpoint spacing of the discretisation. Divide the data channel by the density channel. Parametrise $\mathrm{dec}_\theta$ with a U-Net (Ronneberger et al., 2015). Before the U-turn, let the U-Net have six convolutional layers with kernel size five, stride two, and 64 output channels; and six more such layers, but using transposed convolutions, after the U-turn.
The layers after the U-turn additionally take in the outputs of the layers before the U-turn in reversed order; this is the U-Net structure (Figure 1; Ronneberger et al., 2015). For multidimensional outputs where outputs can have context points at different inputs, produce a separate data and density channel for every output and concatenate these into one big encoding; use separate length scales for every application of $\mathrm{enc}_\theta$.

Convolutional Gaussian neural process (ConvGNP; Markou et al., 2022). Use the same choices for the discretisation, length scales, and CNN architecture as for the ConvCNP. Set the rank of the kernel map to $R = 64$. As mentioned in the introduction, let the decoder produce one extra channel which forms heterogeneous observation noise. For multidimensional outputs, the same caveat as for the ConvCNP applies.

Fully convolutional Gaussian neural process (FullConvGNP; Bruinsma et al., 2021). For the mean architecture and the kernel architecture, use the same choices for the discretisation, length scales, and CNN architecture as for the ConvCNP. Implement the source channel with the identity matrix and apply the matrix transform $Z \mapsto Z Z^\top$ to ensure positive definiteness. Let the decoder produce one extra channel which forms heterogeneous observation noise. For multidimensional outputs, in addition to the caveat for the ConvCNP, two additional caveats apply. First, for $D_o$-dimensional outputs, let the decoder produce $D_o^2$ channels rather than just one. These channels should be interpreted as all covariance and cross-covariance matrices between all outputs. Second, when applying the matrix transform $Z \mapsto Z Z^\top$, these channels should first be assembled into one total covariance matrix.

Convolutional latent neural process (ConvLNP; Foong et al., 2020). The ConvLNP builds off the ConvCNP. The ConvLNP replaces the CNN architecture by two copies of this architecture placed in sequence.
In between the two architectures, there is a sampling step: the first architecture outputs 32 channels, comprising 16 means and 16 marginal variances, which are used to sample a 16-dimensional Gaussian latent variable; and the second architecture then takes in this sample.

Autoregressive Conditional Neural Processes (AR CNPs). The AR CNP, AR ACNP, and AR ConvCNP use the architectures described above. Rolling out an AR CNP according to Procedure 2.1 requires an ordering of the target points. In all experiments, we choose a random ordering of the target points.
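For the attentive encoder of the ACNP described earlier in this appendix, the per-head aggregation of the values can be sketched as below. This is a minimal illustration in plain Python, not our implementation: the function name `attend` and the nested-list layout are assumptions made for the example, and the keys, values, and queries are taken as given rather than computed by MLPs. As in the description, the attention weights are a softmax over unscaled inner products.

```python
import math


def attend(queries, keys, values):
    """Per-head softmax attention followed by concatenation (hypothetical helper).

    queries[h]   : query vector of head h
    keys[h][n]   : key vector of head h for context point n
    values[h][n] : value vector of head h for context point n
    Returns the concatenation of the attended values over all heads.
    """
    out = []
    for q, ks, vs in zip(queries, keys, values):
        scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        head = [sum(w * v[d] for w, v in zip(weights, vs)) for d in range(len(vs[0]))]
        out.extend(head)
    return out


# Two heads, two context points, two-dimensional values per head.
v_q = attend(
    queries=[[1.0, 0.0], [0.0, 1.0]],
    keys=[[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]],
    values=[[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]]],
)
```

With eight heads and 32-dimensional values per head, the returned vector has 256 entries, matching the encoding dimensionality.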

G TRAINING, CROSS-VALIDATION, AND EVALUATION PROTOCOLS

The following description applies to the synthetic experiments (Section 4.1), the predator-prey experiments (Section 4.2), the EEG experiments (Section 4.3), and the environmental downscaling experiments (Section 4.4). For the environmental data assimilation experiments, a different protocol was used; we refer the reader to Appendix K for full details of the environmental data assimilation experiments. A task consists of a context set and target set. How precisely the context and target sets are generated is specific to an experiment. To train a model, we consider batches of 16 tasks at a time, compute an objective function value, and update the model parameters using ADAM (Kingma & Ba, 2015). The learning rate is specified separately for every experiment. We define an epoch to consist of $2^{14} \approx 16\text{k}$ tasks. We typically train a model for between 100 and 1000 epochs. For an experiment, we split up the meta-data set into a training set, a cross-validation set, and an evaluation set. The model is trained on the training set. During training, after every epoch, the model is cross-validated on the cross-validation set. Cross-validation uses $2^{12}$ tasks. These $2^{12}$ tasks are fixed, which means that cross-validation always happens with exactly the same data. The cross-validation objective is a confidence bound computed from the model objective. Suppose that the model objective over all $2^{12}$ cross-validation tasks has empirical mean $\hat{\mu}$ and empirical variance $\hat{\sigma}^2$. If a higher model objective is better, then the cross-validation objective is given by $\hat{\mu} - 1.96 \cdot \hat{\sigma} / \sqrt{2^{12}}$. The model with the best cross-validation objective is selected and used for evaluation. Evaluation is performed with the evaluation set and also uses $2^{12}$ tasks. Conditional neural processes and Gaussian neural processes are trained, cross-validated, and evaluated with the neural process objective proposed by Garnelo et al. (2018a).
We normalise the terms in the neural process objective by the target set sizes. Latent-variable neural processes (LNPs) are trained, cross-validated, and evaluated with the ELBO objective proposed by Garnelo et al. (2018b) using five samples, also normalised by the target set size. When training LNPs with the ELBO objective, but not when cross-validating and evaluating, the context set is subsumed in the target set. Additionally, LNPs are trained, cross-validated, and evaluated with the ML objective proposed by Foong et al. (2020), again normalised by the target set size. When training and cross-validating LNPs with the ML objective, we use twenty samples; and when evaluating, we use 512 samples. For completeness, LNPs trained with the ELBO objective are also evaluated with the ML objective using 512 samples. To stabilise the numerics for GNPs, we increase the regularisation of covariance matrices for one epoch. To encourage LNPs to fit, we fix the variance of the observation noise of the decoder to $10^{-4}$ for the first three epochs.
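The cross-validation objective described above is a lower confidence bound on the mean per-task objective. A minimal sketch follows, assuming a higher objective is better and taking the empirical variance to be the population variance of the per-task values (the text does not specify the normalisation); the function names are hypothetical.

```python
import math
import statistics


def cross_validation_objective(task_objectives):
    """Lower confidence bound: mean - 1.96 * std / sqrt(number of tasks)."""
    n = len(task_objectives)
    mean = statistics.fmean(task_objectives)
    std = statistics.pstdev(task_objectives)  # assumption: population variance
    return mean - 1.96 * std / math.sqrt(n)


def select_best_epoch(per_epoch_objectives):
    """Index of the epoch whose cross-validation objective is highest."""
    scores = [cross_validation_objective(objs) for objs in per_epoch_objectives]
    return max(range(len(scores)), key=scores.__getitem__)


# Two epochs with the same mean: the bound prefers the less variable one.
best = select_best_epoch([[1.0, 3.0], [2.0, 2.0]])
```

Using the bound rather than the plain mean penalises checkpoints whose average is high only because of a few tasks.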

H DETAILS OF SYNTHETIC EXPERIMENTS

H.1 DESCRIPTION OF EXPERIMENTS

We synthetically generate data sets by randomly sampling from five different choices for the ground-truth stochastic process $f$. Let the inputs be $d_x$-dimensional. Then define the following stochastic processes:

EQ: a Gaussian process with an exponentiated quadratic (EQ) kernel: $f \sim \mathcal{GP}(0, \exp(-\frac{1}{2\ell^2} \|x - x'\|_2^2))$ where $\ell > 0$ is a length scale;

Matérn-5/2: a Gaussian process with a Matérn-5/2 kernel: $f \sim \mathcal{GP}(0, k(\frac{1}{\ell} \|x - x'\|_2))$ where $k(r) = (1 + \sqrt{5} r + \frac{5}{3} r^2) e^{-\sqrt{5} r}$ and $\ell > 0$ is a length scale;

weakly periodic: a Gaussian process with a weakly periodic kernel: $f \sim \mathcal{GP}(0, \exp(-\frac{1}{2\ell_d^2} \|x - x'\|_2^2 - \frac{2}{\ell_p^2} \|\sin(\frac{\pi}{p}(x - x'))\|_2^2))$ where $\ell_d > 0$ is a length scale specifying how quickly the periodic pattern changes, $\ell_p > 0$ a length scale of the periodic pattern, and $p > 0$ the period; and where the application of $\sin$ is elementwise;

sawtooth: a sawtooth process with a random frequency, direction, and phase: $f(x) = (\omega \langle x, u \rangle_2 + \phi) \bmod 1$ where $\omega \sim \mathrm{Unif}(\Omega)$ is the frequency of the sawtooth wave, $u \sim \mathrm{Unif}(\{x \in \mathbb{R}^{d_x} : \|x\|_2 = 1\})$ the direction, and $\phi \sim \mathrm{Unif}([0, 1])$ the phase;

mixture: with equal probability, sample $f$ from the EQ process, Matérn-5/2 process, weakly periodic process, or sawtooth process.

We will call these stochastic processes the data processes. The data processes are stochastic processes with $d_x$-dimensional inputs and one-dimensional outputs. We will turn them into processes with $d_y$-dimensional outputs according to the following procedure: sample from the one-dimensional-output prior $d_y$ times; and, for these $d_y$ samples, take $d_y$ different linear combinations. We choose the parameters of the data processes based on the input dimensionality $d_x$:

$$\ell = \tfrac{1}{4} c, \quad \ell_d = \tfrac{1}{2} c, \quad \ell_p = c, \quad p = \tfrac{1}{4} c, \quad \Omega = [2 c^{-1}, 4 c^{-1}] \quad \text{with} \quad c = \sqrt{d_x}.$$

Scaling with the input dimensionality aims to roughly ensure that data with one-dimensional inputs and data with two-dimensional inputs are equally difficult.
Figure 10 illustrates the sawtooth data process in all four configurations. We construct data sets by sampling inputs uniformly at random from $\mathcal{X} = [-2, 2]^{d_x}$ and then sampling outputs from one of the data processes. We will colloquially call $\mathcal{X}$ the training range. For the EQ, Matérn-5/2, and weakly periodic processes, but not for the sawtooth process (footnote 6), we also add independent Gaussian noise with variance 0.05.

The numbers of context and target points are as follows. For the EQ, Matérn-5/2, and weakly periodic processes, the number of context points is chosen uniformly at random from $\{0, \ldots, 30 \cdot d_x\}$ and the number of target points is fixed to $50 \cdot d_x$. For the sawtooth and mixture processes, the number of context points is chosen uniformly at random from $\{0, \ldots, 30\}$ if $d_x = 1$ and $\{0, \ldots, 75 \cdot d_x\}$ otherwise; and the number of target points is fixed to $100 \cdot d_x$. In the case of a multidimensional-output data process, we separately sample the number and positions of the context and target inputs for every output dimension.

For every data process and each of the four configurations, we evaluate every model in three different ways. First, we evaluate the model on data generated exactly like the training data. This task is called interpolation and abbreviated "int." in the tables of results. The interpolation task measures how well a model fits the data and is the primary measure of performance. Second, we evaluate the model on data with inputs sampled from $[2, 6]^{d_x}$. This task is called out-of-input-distribution (OOID) interpolation and abbreviated "OOID" in the tables of results. OOID interpolation measures how well a model generalises to data sampled from other regions of the input space. Third, we evaluate the model on data with context inputs sampled from $[-2, 2]^{d_x}$ and target inputs sampled from $[2, 6]^{d_x}$. This task is called extrapolation and abbreviated "ext." in the tables of results.
The extrapolation task measures how well predictions based on data in the training range generalise to other regions of the input space.

For this experiment, the learning rate is $3 \cdot 10^{-4}$, the margin is 0.1, and the points per unit is 64. We trained the models for 100 epochs. Due to an error in the cross-validation procedure, we did not use cross-validation, but used the model at epoch 100. For the kernel architecture of the FullConvGNP, we reduce the points per unit and the number of channels in the U-Net by a factor of two. For the ConvLNP with two-dimensional inputs, we reduce the number of output channels in the U-Net by a factor of $\sqrt{2}$; and, for training and cross-validation, we reduce the number of samples of the ELBO objective to one and the number of samples for the ML objective to five.
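The sampling of context and target inputs for the three evaluation tasks can be sketched as below. This is a minimal illustration under stated assumptions: the names `sample_sizes`, `sample_inputs`, and the task keys are hypothetical, a single output dimension is assumed, and target inputs for the interpolation and OOID tasks are assumed to be drawn from the same range as the context inputs.

```python
import numpy as np

# (context range, target range) per task; key names are illustrative.
INPUT_RANGES = {
    "int": ((-2.0, 2.0), (-2.0, 2.0)),
    "ooid": ((2.0, 6.0), (2.0, 6.0)),
    "ext": ((-2.0, 2.0), (2.0, 6.0)),
}


def sample_sizes(rng, process, d_x):
    """Sample the number of context points and return the fixed target size."""
    if process in ("eq", "matern52", "weakly_periodic"):
        max_context, n_target = 30 * d_x, 50 * d_x
    elif process in ("sawtooth", "mixture"):
        max_context = 30 if d_x == 1 else 75 * d_x
        n_target = 100 * d_x
    else:
        raise ValueError(f"unknown process: {process}")
    # rng.integers excludes the upper bound, so add one to include it.
    n_context = int(rng.integers(0, max_context + 1))
    return n_context, n_target


def sample_inputs(rng, process, d_x, task="int"):
    """Draw context and target inputs uniformly from the ranges of a task."""
    n_context, n_target = sample_sizes(rng, process, d_x)
    (c_lo, c_hi), (t_lo, t_hi) = INPUT_RANGES[task]
    x_context = rng.uniform(c_lo, c_hi, size=(n_context, d_x))
    x_target = rng.uniform(t_lo, t_hi, size=(n_target, d_x))
    return x_context, x_target
```

With multidimensional outputs, one would call `sample_inputs` once per output dimension, since the numbers and positions of the inputs are sampled separately for every output.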

H.2 MULTI-MODALITY OF PREDICTIONS BY AR CONVCNP

Figure 11 demonstrates multi-modality of predictions by the AR ConvCNP trained on the sawtooth process. Note that the prediction is bimodal for one and two observations, and collapses to a single mode upon observing the third observation.
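The mechanism that lets Gaussian marginals produce dependent predictions is the AR rollout, in which every sampled output is appended to the context before the next target is predicted. The sketch below illustrates only this mechanism. It does not use a ConvCNP: as a stand-in, `predict_marginals` (a hypothetical name) returns the exact posterior marginals of a Gaussian process with an EQ kernel, and the length scale and noise level are arbitrary choices. With this stand-in the samples are dependent but remain Gaussian, so the sketch does not reproduce the bimodality of Figure 11.

```python
import numpy as np


def eq_kernel(x1, x2, ell=0.25):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell**2)


def predict_marginals(xc, yc, xt, noise=1e-4):
    """Stand-in for a CNP: Gaussian marginal means and variances at xt."""
    if len(xc) == 0:
        return np.zeros(len(xt)), np.ones(len(xt)) + noise
    k_cc = eq_kernel(xc, xc) + noise * np.eye(len(xc))
    k_tc = eq_kernel(xt, xc)
    mean = k_tc @ np.linalg.solve(k_cc, yc)
    var = 1.0 - np.sum(k_tc * np.linalg.solve(k_cc, k_tc.T).T, axis=1) + noise
    return mean, np.maximum(var, 1e-12)


def sample_ar(rng, xc, yc, xt):
    """AR mode: sample targets one at a time, feeding samples back as context."""
    xc, yc = np.array(xc, dtype=float), np.array(yc, dtype=float)
    samples = []
    for x in xt:
        mean, var = predict_marginals(xc, yc, np.array([x]))
        y = mean[0] + np.sqrt(var[0]) * rng.standard_normal()
        samples.append(y)
        xc, yc = np.append(xc, x), np.append(yc, y)
    return np.array(samples)


def sample_independent(rng, xc, yc, xt):
    """Standard mode: sample every target independently from its marginal."""
    xc, yc = np.array(xc, dtype=float), np.array(yc, dtype=float)
    mean, var = predict_marginals(xc, yc, np.asarray(xt, dtype=float))
    return mean + np.sqrt(var) * rng.standard_normal(len(xt))
```

Drawing many samples at two nearby targets shows the difference: the AR samples are strongly correlated, whereas the independently drawn samples are not.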

H.3 FULL RESULTS

We show the full results for all data sets and tasks in Tables 7 to 18. The AR ConvCNP consistently shows very strong performance compared to other NP models. Note that the FullConvGNP takes much longer to train than the ConvCNP (Figure 2), and cannot be applied to tasks with two-dimensional input spaces.

Table 8: For the non-Gaussian experiments, average log-likelihoods normalised by the number of target points. Shows for one-dimensional inputs (1D; $d_x = 1$) and two-dimensional inputs (2D; $d_x = 2$) the performance for interpolation within the range $[-2, 2]^{d_x}$ where the models were trained ("Int."); interpolation within the range $[2, 6]^{d_x}$ which the models have never seen before ("OOID"); and extrapolation from the range $[-2, 2]^{d_x}$ to the range $[2, 6]^{d_x}$ ("Ext."). Models are ordered by interpolation performance for one-dimensional inputs. The latent variable models are trained and evaluated with the ELBO objective (E); trained and evaluated with the ML objective (M); and trained with the ELBO objective and evaluated with the ML objective (E-M). Trivial refers to predicting the empirical mean and standard deviation of the test data. Errors indicate the central 95%-confidence interval. Numbers which are significantly best ($p < 0.05$) are boldfaced. Numbers which are very large are marked as failed with "F". Numbers which are missing could not be run.
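Two quantities that recur in the captions can be sketched as follows: the "Trivial" baseline, which scores the test outputs under a Gaussian with their own empirical mean and standard deviation, and an error bar for an average over tasks. The function names are hypothetical, and the normal-approximation interval (1.96 standard errors) is an assumption, since the captions do not state how the confidence intervals were computed.

```python
import numpy as np


def trivial_loglik(y):
    """Average Gaussian log-density of y under its empirical mean and std."""
    mean, std = np.mean(y), np.std(y)
    log_norm = -0.5 * np.log(2 * np.pi * std**2)
    return np.mean(log_norm - 0.5 * ((y - mean) / std) ** 2)


def mean_with_ci(values):
    """Mean and half-width of a 95% interval (normal approximation, assumed)."""
    values = np.asarray(values, dtype=float)
    half_width = 1.96 * np.std(values, ddof=1) / np.sqrt(len(values))
    return np.mean(values), half_width
```

Because the standardised residuals have unit mean square by construction, `trivial_loglik(y)` equals $-\tfrac{1}{2}\log(2\pi e \hat\sigma^2)$, where $\hat\sigma^2$ is the empirical variance of `y`.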

Table 9: For the EQ synthetic experiments with one-dimensional inputs, average Kullback-Leibler divergences of the posterior prediction map $\pi_y$ with respect to the model normalised by the number of target points. Shows for one-dimensional outputs ($d_y = 1$) and two-dimensional outputs ($d_y = 2$) the performance for interpolation within the range $[-2, 2]$ where the models were trained ("Int."); interpolation within the range $[2, 6]$ which the models have never seen before ("OOID"); and extrapolation from the range $[-2, 2]$ to the range $[2, 6]$ ("Ext."). Models are ordered by interpolation performance. The latent variable models are trained and evaluated with the ELBO objective (E); trained and evaluated with the ML objective (M); and trained with the ELBO objective and evaluated with the ML objective (E-M). Diagonal GP refers to predictions by the ground-truth Gaussian processes without correlations.

Table 10: For the EQ synthetic experiments with two-dimensional inputs, average Kullback-Leibler divergences of the posterior prediction map $\pi_y$ with respect to the model normalised by the number of target points. Shows for one-dimensional outputs ($d_y = 1$) and two-dimensional outputs ($d_y = 2$) the performance for interpolation within the range $[-2, 2]^2$ where the models were trained ("Int."); interpolation within the range $[2, 6]^2$ which the models have never seen before ("OOID"); and extrapolation from the range $[-2, 2]^2$ to the range $[2, 6]^2$ ("Ext."). Models are ordered by interpolation performance. The latent variable models are trained and evaluated with the ELBO objective (E); trained and evaluated with the ML objective (M); and trained with the ELBO objective and evaluated with the ML objective (E-M). Diagonal GP refers to predictions by the ground-truth Gaussian processes without correlations.

Table 11: For the Matérn-5/2 synthetic experiments with one-dimensional inputs, average Kullback-Leibler divergences of the posterior prediction map $\pi_y$ with respect to the model normalised by the number of target points. Shows for one-dimensional outputs ($d_y = 1$) and two-dimensional outputs ($d_y = 2$) the performance for interpolation within the range $[-2, 2]$ where the models were trained ("Int."); interpolation within the range $[2, 6]$ which the models have never seen before ("OOID"); and extrapolation from the range $[-2, 2]$ to the range $[2, 6]$ ("Ext."). Models are ordered by interpolation performance. The latent variable models are trained and evaluated with the ELBO objective (E); trained and evaluated with the ML objective (M); and trained with the ELBO objective and evaluated with the ML objective (E-M). Diagonal GP refers to predictions by the ground-truth Gaussian processes without correlations. Trivial refers to predicting the empirical mean and standard deviation of the test data. Errors indicate the central 95%-confidence interval. Numbers which are significantly best ($p < 0.05$) are boldfaced. Numbers which are very large are marked as failed with "F". Numbers which are missing could not be run.
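For the Gaussian data processes, the metric in these tables is a Kullback-Leibler divergence normalised by the number of target points. When both the ground-truth posterior and the model prediction are multivariate Gaussians over the targets, this has a closed form, sketched below. The function names are hypothetical, and the direction KL(ground truth ‖ model) is our reading of "with respect to the model". The Diagonal GP baseline corresponds to replacing the model covariance by the diagonal of the ground-truth covariance.

```python
import numpy as np


def kl_gaussian(mean_p, cov_p, mean_q, cov_q):
    """KL(p || q) between two multivariate Gaussians."""
    n = len(mean_p)
    diff = mean_q - mean_p
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
    maha_term = diff @ np.linalg.solve(cov_q, diff)
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (trace_term + maha_term - n + logdet_q - logdet_p)


def normalised_kl(mean_p, cov_p, mean_q, cov_q):
    """KL divergence divided by the number of target points."""
    return kl_gaussian(mean_p, cov_p, mean_q, cov_q) / len(mean_p)


def diagonal_gp_kl(mean_p, cov_p):
    """Normalised KL of the ground truth to its own correlation-free version."""
    return normalised_kl(mean_p, cov_p, mean_p, np.diag(np.diag(cov_p)))
```

The value returned by `diagonal_gp_kl` is never negative and is zero only when the ground-truth covariance is already diagonal, so it quantifies how much is lost by ignoring dependencies between targets.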

Matérn-5/2 ($d_x = 1$, $d_y = 1$)

Model          Int.          OOID          Ext.
FullConvGNP    0.00 ±0.00    0.00 ±0.00    0.00 ±0.00
ConvCNP (AR)   0.00 ±0.00    0.00 ±0.00    0.00

Table 12: For the Matérn-5/2 synthetic experiments with two-dimensional inputs, average Kullback-Leibler divergences of the posterior prediction map $\pi_y$ with respect to the model normalised by the number of target points. Shows for one-dimensional outputs ($d_y = 1$) and two-dimensional outputs ($d_y = 2$) the performance for interpolation within the range $[-2, 2]^2$ where the models were trained ("Int."); interpolation within the range $[2, 6]^2$ which the models have never seen before ("OOID"); and extrapolation from the range $[-2, 2]^2$ to the range $[2, 6]^2$ ("Ext."). Models are ordered by interpolation performance. Diagonal GP refers to predictions by the ground-truth Gaussian processes without correlations. Trivial refers to predicting the empirical mean and standard deviation of the test data. Errors indicate the central 95%-confidence interval. Numbers which are significantly best ($p < 0.05$) are boldfaced. Numbers which are very large are marked as failed with "F". Numbers which are missing could not be run.

Table 13: For the weakly periodic synthetic experiments with one-dimensional inputs, average Kullback-Leibler divergences of the posterior prediction map $\pi_y$ with respect to the model normalised by the number of target points. Shows for one-dimensional outputs ($d_y = 1$) and two-dimensional outputs ($d_y = 2$) the performance for interpolation within the range $[-2, 2]$ where the models were trained ("Int."); interpolation within the range $[2, 6]$ which the models have never seen before ("OOID"); and extrapolation from the range $[-2, 2]$ to the range $[2, 6]$ ("Ext."). Models are ordered by interpolation performance. The latent variable models are trained and evaluated with the ELBO objective (E); trained and evaluated with the ML objective (M); and trained with the ELBO objective and evaluated with the ML objective (E-M). Diagonal GP refers to predictions by the ground-truth Gaussian processes without correlations. Trivial refers to predicting the empirical mean and standard deviation of the test data. Errors indicate the central 95%-confidence interval. Numbers which are significantly best ($p < 0.05$) are boldfaced. Numbers which are very large are marked as failed with "F". Numbers which are missing could not be run.

Weakly Periodic ($d_x = 1$, $d_y = 1$)

Model          Int.          OOID          Ext.
FullConvGNP    0.02 ±0.00    0.02 ±0.00    0.00 ±0.00
ConvCNP (AR)   0.05 ±0.00    0.05 ±0.00    0.04 ±0.00
ConvGNP        0.05 ±0.00    0.05 ±0.00    0.56

Weakly Periodic ($d_x = 1$, $d_y = 2$)

Model          Int.          OOID          Ext.
FullConvGNP    0.03 ±0.00    0.03 ±0.00    0.00 ±0.00
ConvCNP (AR)   0.09 ±0.00    0.09 ±0.00    0.06 ±0.00
ConvGNP        0.12 ±0.00    0.12 ±0.00    0.72 ±0.01
AGNP           0.25 ±0.00    2.17

Table 14: For the weakly periodic synthetic experiments with two-dimensional inputs, average Kullback-Leibler divergences of the posterior prediction map $\pi_y$ with respect to the model normalised by the number of target points. Shows for one-dimensional outputs ($d_y = 1$) and two-dimensional outputs ($d_y = 2$) the performance for interpolation within the range $[-2, 2]^2$ where the models were trained ("Int."); interpolation within the range $[2, 6]^2$ which the models have never seen before ("OOID"); and extrapolation from the range $[-2, 2]^2$ to the range $[2, 6]^2$ ("Ext."). Models are ordered by interpolation performance. The latent variable models are trained and evaluated with the ELBO objective (E); trained and evaluated with the ML objective (M); and trained with the ELBO objective and evaluated with the ML objective (E-M). Diagonal GP refers to predictions by the ground-truth Gaussian processes without correlations. Trivial refers to predicting the empirical mean and standard deviation of the test data. Errors indicate the central 95%-confidence interval. Numbers which are significantly best ($p < 0.05$) are boldfaced. Numbers which are very large are marked as failed with "F". Numbers which are missing could not be run.

Weakly Periodic ($d_x = 2$, $d_y = 1$)

Model          Int.          OOID          Ext.
ConvCNP (AR)   0.05 ±0.00    0.05 ±0.00    0.03 ±0.00
ConvGNP        0.10 ±0.00    0.10 ±0.00    0.19 ±0.00
ConvLNP (M)    0.18 ±0.00    0.18 ±0.00    0.21

Table 15: For the sawtooth synthetic experiments with one-dimensional inputs, average log-likelihoods normalised by the number of target points. Shows for one-dimensional outputs ($d_y = 1$) and two-dimensional outputs ($d_y = 2$) the performance for interpolation within the range $[-2, 2]$ where the models were trained ("Int."); interpolation within the range $[2, 6]$ which the models have never seen before ("OOID"); and extrapolation from the range $[-2, 2]$ to the range $[2, 6]$ ("Ext."). Models are ordered by interpolation performance. The latent variable models are trained and evaluated with the ELBO objective (E); trained and evaluated with the ML objective (M); and trained with the ELBO objective and evaluated with the ML objective (E-M). Trivial refers to predicting the empirical mean and standard deviation of the test data. Errors indicate the central 95%-confidence interval. Numbers which are significantly best ($p < 0.05$) are boldfaced. Numbers which are very large are marked as failed with "F". Numbers which are missing could not be run.

Sawtooth

Model           Int.           OOID           Ext.
[...]           -0.29 ±0.00    -1.96 ±0.19    -1.43 ±0.06
CNP             -0.30 ±0.00    -0.51 ±0.01    -0.34 ±0.00
LNP             -0.32 ±0.00    -0.32 ±0.00    -0.32 ±0.00
Trivial         -0.33 ±0.00    -0.33 ±0.00    -0.33 ±0.00
ConvLNP (E-M)   -2.98 ±0.10    -2.98 ±0.10    -6.74 ±0.04

Table 16: For the sawtooth synthetic experiments with two-dimensional inputs, average log-likelihoods normalised by the number of target points. Shows for one-dimensional outputs ($d_y = 1$) and two-dimensional outputs ($d_y = 2$) the performance for interpolation within the range $[-2, 2]^2$ where the models were trained ("Int."); interpolation within the range $[2, 6]^2$ which the models have never seen before ("OOID"); and extrapolation from the range $[-2, 2]^2$ to the range $[2, 6]^2$ ("Ext."). Models are ordered by interpolation performance. The latent variable models are trained and evaluated with the ELBO objective (E); trained and evaluated with the ML objective (M); and trained with the ELBO objective and evaluated with the ML objective (E-M). Trivial refers to predicting the empirical mean and standard deviation of the test data. Errors indicate the central 95%-confidence interval. Numbers which are significantly best ($p < 0.05$) are boldfaced. Numbers which are very large are marked as failed with "F". Numbers which are missing could not be run.

Model           Int.           OOID           Ext.
[...]           -0.18 ±0.00    -0.18 ±0.00    -0.18 ±0.00
CNP             -0.18 ±0.00    -0.18 ±0.00    -0.18 ±0.00
GNP             -0.18 ±0.00    -0.18 ±0.00    -0.18 ±0.00
LNP (M)         -0.18 ±0.00    -0.18 ±0.00    -0.18 ±0.00
AGNP            -0.18 ±0.00    -0.18 ±0.00    -0.18 ±0.00
ALNP (M)        -0.18 ±0.00    -0.18 ±0.00    -0.18 ±0.00
ACNP (AR)       -0.18 ±0.00    -0.18 ±0.00    -0.18 ±0.00
ACNP            -0.18 ±0.00    -0.18

Table 17: For the mixture synthetic experiments with one-dimensional inputs, average log-likelihoods normalised by the number of target points. Shows for one-dimensional outputs ($d_y = 1$) and two-dimensional outputs ($d_y = 2$) the performance for interpolation within the range $[-2, 2]$ where the models were trained ("Int."); interpolation within the range $[2, 6]$ which the models have never seen before ("OOID"); and extrapolation from the range $[-2, 2]$ to the range $[2, 6]$ ("Ext."). Models are ordered by interpolation performance. The latent variable models are trained and evaluated with the ELBO objective (E); trained and evaluated with the ML objective (M); and trained with the ELBO objective and evaluated with the ML objective (E-M). Trivial refers to predicting the empirical mean and standard deviation of the test data. Errors indicate the central 95%-confidence interval. Numbers which are significantly best ($p < 0.05$) are boldfaced. Numbers which are very large are marked as failed with "F". Numbers which are missing could not be run.

Mixture

Our goal will be to make predictions for the famous hare-lynx data set. The hare-lynx data set is a time series from 1845 to 1935 recording yearly population counts of a population of Snowshoe hares and a population of Canadian lynx (MacLulich, 1937). A digital version extracted from the original graph by MacLulich (1937) is provided by Hundley (footnote 7). Hundley, the author of this digital source, says that other authors caution that the hare-lynx data is actually a composition of multiple time series, and presents the data with that caution. We, therefore, also present the data with this caution. Figure 12a visualises the hare-lynx data set.

To make predictions for the hare-lynx data set, we use the Lotka-Volterra equations (Lotka, 1910; Volterra, 1926), also called the predator-prey equations. The Lotka-Volterra equations are an idealised mathematical model for the population counts of a prey population and a predator population:

prey population: $x'(t) = \alpha x(t) - \beta x(t) y(t)$,
predator population: $y'(t) = -\delta y(t) + \gamma x(t) y(t)$.

These differential equations say that the prey population naturally grows exponentially with rate $\alpha$, and that the predator population naturally decays exponentially with rate $\delta$. In addition, the predators hunt the prey. The resulting additional growth in the predator population and the resulting additional decrease in the prey population are both proportional to the product of the densities. In this idealised mathematical form, the population counts converge to a smooth, noiseless limit cycle and then perfectly track this limit cycle ever after. This is unlike real-world predator-prey population counts, which exhibit noisy behaviour and imperfect cycles.
We therefore consider a stochastic version of the Lotka-Volterra equations, given by the following two coupled stochastic differential equations:

$$dX_t = \alpha X_t \, dt - \beta Y_t X_t \, dt + \sigma X_t^\nu \, dW_t^{(1)}, \tag{47}$$
$$dY_t = -\delta Y_t \, dt + \gamma Y_t X_t \, dt + \sigma Y_t^\nu \, dW_t^{(2)}, \tag{48}$$

where $W^{(1)}$ and $W^{(2)}$ are two independent Brownian motions. Compared to the Lotka-Volterra equations, (47) and (48) have two additional terms, $\sigma X_t^\nu \, dW_t^{(1)}$ and $\sigma Y_t^\nu \, dW_t^{(2)}$, which introduce noisy behaviour. In these terms, multiplying by $X_t^\nu$ and $Y_t^\nu$ makes the noise go to zero when $X_t$ and $Y_t$ become small, ensuring that $X_t$ and $Y_t$ remain positive. In addition, we multiply by a parameter $\sigma > 0$ to control the magnitude of the noise, and we raise $X_t$ and $Y_t$ to a power $\nu > 0$ to control how quickly the noise grows as $X_t$ and $Y_t$ grow. Namely, $X_t$ naturally grows exponentially, so, by adding noise of magnitude proportional to $X_t$, we risk large spikes in the prey population. To moderate this behaviour, we choose $\nu$ to be strictly less than one. Finally, to obtain a variety of magnitudes of population counts, we multiply the realisation with a scale $\eta$.

After simulating from (47) and (48) a few times, we settle on $\nu = \tfrac{1}{6}$. For the remainder of the parameters, we manually experiment with (47) and (48), settle on parameter ranges that look reasonable, and randomly sample parameters from those intervals. Table 19 summarises the sampling distributions for all parameters of (47) and (48). Figure 12b shows four samples from the proposed stochastic model.

To generate a meta-data set, we simulate (47) and (48) on a dense grid spanning 110 years, throw away the first 10 years, and retain between 150 and 250 data points for $X_t$ and $Y_t$. The numbers of data points and the locations of the data points are sampled separately for $X_t$ and $Y_t$.
Hence, whereas the hare-lynx data is regularly spaced and the populations are always simultaneously observed, our simulator generates data at arbitrary and nonsimultaneous points in time.

We split these data sets into context and target sets in three different ways. To train the models, for every batch, we choose one of the interpolation, forecasting, or reconstruction tasks uniformly at random. We will also perform these tasks on the real hare-lynx data; in that case, for interpolation, we let the number of target points per output be between one and fifteen. The tasks on simulated and real data are similar, but differ slightly in the number of context and target points.

To deal with the positivity of population counts, we transform the marginals of all models to distributions on $(0, \infty)$ by pushing the marginals through $x \mapsto \log(1 + x)$.
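The simulator described above can be sketched with an Euler-Maruyama discretisation. This is an illustrative sketch rather than the simulator used for the experiments: the function names, the step size, and the example parameter values are assumptions (Table 19 is not reproduced here), the drift follows the parameterisation of the deterministic equations (decay rate $\delta$, coupling $\gamma$), and the clipping at zero is added because a discretised step can overshoot below zero.

```python
import numpy as np


def simulate_lotka_volterra(rng, alpha, beta, delta, gamma, sigma, nu=1 / 6,
                            x0=1.0, y0=1.0, years=110.0, dt=0.01, burn_in=10.0):
    """Euler-Maruyama simulation of the stochastic Lotka-Volterra equations."""
    n_steps = int(round(years / dt))
    t = np.arange(n_steps + 1) * dt
    xy = np.empty((n_steps + 1, 2))
    xy[0] = x0, y0
    for i in range(n_steps):
        x, y = xy[i]
        dw = np.sqrt(dt) * rng.standard_normal(2)  # independent Brownian increments
        x_new = x + (alpha * x - beta * x * y) * dt + sigma * x**nu * dw[0]
        y_new = y + (-delta * y + gamma * x * y) * dt + sigma * y**nu * dw[1]
        # Clip at zero: an assumption to keep the discretised path nonnegative.
        xy[i + 1] = max(x_new, 0.0), max(y_new, 0.0)
    keep = t >= burn_in  # throw away the first years
    return t[keep] - burn_in, xy[keep]


def subsample(rng, t, values, low=150, high=250):
    """Retain a random number of points at random, sorted time locations."""
    n = int(rng.integers(low, high + 1))
    idx = np.sort(rng.choice(len(t), size=n, replace=False))
    return t[idx], values[idx]
```

Calling `subsample` once for `xy[:, 0]` and once for `xy[:, 1]` gives prey and predator observations at different, irregularly spaced times, which is the property that distinguishes the simulated tasks from the regularly spaced hare-lynx data. Multiplying the retained values by a scale $\eta$ would complete the procedure.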

I.2 FULL RESULTS

In Table 20 , we present the full results for the sim-to-real experiments.

K DETAILS OF ENVIRONMENTAL DATA ASSIMILATION EXPERIMENT

In this section we provide further details on the Antarctic cloud cover data assimilation experiment described in Section 4.4.

K.1 DATA CONSIDERATIONS

Data sources. Daily-averaged cloud cover reanalysis data was obtained from ERA5 (Hersbach et al., 2020). An Antarctic land mask and elevation field was obtained from the BedMachine dataset (Morlighem, 2020). The input coordinates $x$ were normalised from metres to take values in $[-1, 1]$.

K.2 MODEL CONSIDERATIONS

Here we provide details on the training procedure and architectures for each of the convolutional NP models in the Antarctic data assimilation experiment.

Generating the training, validation, and test tasks. Following meta-learning principles, we collect data from day $\tau$ into a task $D_\tau$. Each task $D_\tau$ was generated by first drawing the integer number of simulated cloud cover context points $N^{(c)} \sim \mathrm{Unif}\{1, 2, \ldots, 500\}$. Letting $N^{(c)}$ vary encourages the model to learn to deal with both data-sparse and data-rich scenarios. The number of target points $N^{(t)}$ was fixed to a value of 2,000. Next, the input locations $x^{(c)}_\tau$ and $x^{(t)}_\tau$ were sampled uniformly at random across the entire $280 \times 280$ input space and the corresponding $y_\tau$ values were sampled without observation noise. For the training dates, the random seed used for generating $D_\tau$ is updated every epoch, allowing for an infinitely growing simulated training data set. In contrast, for the validation and test dates, the random seeds were held fixed so that metrics computed over the validation and test sets are not stochastic.

Training procedure. Each model was trained for 150 epochs on 14 years of data from 2000-2013. An Adam optimiser was used with a learning rate of $5 \times 10^{-5}$ and a batch size of 2. For the loss functions, we use a negative log-likelihood loss function for the ConvCNP and ConvGNP. For the ConvLNP we use the ELBO objective and fix the variance of the observation noise to 0.01 for the first four epochs. Validation data from 2014-2017 was used for checkpointing the model weights using the per-datapoint predictive log-likelihood. The two-year period of 2018-2019 data was reserved for the test set. The time taken to train each model on a Tesla V100 GPU is as follows: [...]

[...] $\to 128 \xrightarrow{\text{u.s.}} 64 \xrightarrow{\text{u.s.}} 32 \xrightarrow{\text{u.s.}} 16$. We use bilinear resize operators for the upsampling layers to fix checkerboard artifacts that we encountered when using standard zero-padding upsampling (Odena et al., 2016).
We use a margin of 0.1 and 150 points per unit for the encoder's internal discretisation. The length scales of the Gaussian kernels for both the encoder and decoder SetConv layers are set to 1/150 and held fixed during training. These architecture choices result in a receptive field of 0.433 in normalised input space, or roughly 1,500 km in raw input space, spanning around 20% of the region in Figure 6 in either dimension. For the ConvGNP we use 128 basis functions for the low-rank covariance parameterisation described in Markou et al. (2021). For the ConvLNP we use an 8-dimensional latent variable and evaluate the ML objective (Foong et al., 2020) using 8 latent samples. The number of learnable parameters for each model is as follows:

• ConvCNP: 618 k,
• ConvGNP: 621 k,
• ConvLNP: 1,234 k (increase due to the second U-Net architecture after the latent variable).

The difference in parameters from switching to a beta-categorical likelihood from a Gaussian likelihood is negligible.

Input data. Each model receives two context sets as input. The first contains observations of the simulated ERA5 daily-average cloud cover. The second contains a set of 6 gridded auxiliary variables. These are elevation, land mask, $\cos(2\pi \times \text{day of year}/365)$, $\sin(2\pi \times \text{day of year}/365)$, $x_1$, and $x_2$. The elevation and land mask auxiliary fields allow the models to predict spatial non-stationarity. For example, the convolutional filters of the model's encoder could learn how cloud cover around the Antarctic coastline behaves differently to the centre of the continent. The cos and sin variables inform the model what time of year $D_\tau$ corresponds to, helping with learning seasonal variations in the data. The $x_1$ and $x_2$ inputs again help with breaking translation equivariance in the convolutional filters by informing the model where in input space the data corresponds to.
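The task generation and the seasonal auxiliary inputs described in this section can be sketched as follows. The sketch rests on several assumptions that are not in the text: the function names are hypothetical, locations are drawn as grid cells with replacement, grid indices are mapped linearly to $[-1, 1]$, and the seeding scheme (a day index combined with an epoch tag) is just one way to redraw training tasks every epoch while keeping validation and test tasks fixed.

```python
import numpy as np


def seasonal_features(day_of_year):
    """Cosine and sine encodings of the time of year."""
    angle = 2 * np.pi * day_of_year / 365
    return np.cos(angle), np.sin(angle)


def task_rng(day_index, epoch=None):
    """Random generator for one task.

    Training: pass the epoch so that the task is redrawn every epoch.
    Validation and test: leave epoch as None so the task is always the same.
    """
    tag = 0 if epoch is None else epoch + 1
    return np.random.default_rng([day_index, tag])


def make_task(rng, field, n_target=2000, max_context=500):
    """Sample context and target sets from one daily field of shape (280, 280)."""
    height, width = field.shape
    n_context = int(rng.integers(1, max_context + 1))  # Unif{1, ..., 500}

    def draw(n):
        rows = rng.integers(0, height, size=n)
        cols = rng.integers(0, width, size=n)
        # Map grid indices to normalised coordinates in [-1, 1].
        x = np.stack(
            [2 * rows / (height - 1) - 1, 2 * cols / (width - 1) - 1], axis=-1
        )
        return x, field[rows, cols]  # outputs carry no observation noise

    return draw(n_context), draw(n_target)
```

Calling `make_task(task_rng(day, epoch), field)` during training and `make_task(task_rng(day), field)` during evaluation reproduces the distinction between stochastic training tasks and deterministic validation and test tasks.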

K.3 ANTARCTIC CLOUD COVER MODEL SAMPLES

Figure 15 gives a detailed breakdown of sample extrapolation ability, showing seven samples from the four Antarctic cloud cover models. The AR ConvCNP samples display remarkable structure and variation while still closely interpolating the context points. The samples also provide interesting scenarios in the gaps between the context points on the left-hand side. In contrast, the ConvCNP samples are incoherent, underestimating the probability of joint events.

The ConvLNP samples were generated by sampling from the latent variable and then computing the mean of the marginal distributions. As is visible in Figure 15, the ConvLNP displays low sample variance with respect to the latent variable. However, to the best of our knowledge we used a faithful reproduction of the original ConvLNP model, so we leave a more rigorous treatment of this undesirable behaviour to future work.

For all the non-AR models, the limited receptive field size leads to samples on the right-hand side becoming independent of the context observations on the left-hand side after roughly 750 km of distance from them. This results in the models defaulting to some mean representation of the data. It is interesting to see that all the non-AR models display similar marginal mean structure, with greater cloud cover towards the centre of the continent and lower cloud cover towards the coastline, followed by increased cloud levels over the Southern Ocean.

The AR samples were drawn on a sparse $70 \times 70$ grid spanning the entire input space to save compute time. The ConvCNP model was then conditioned on these AR samples and the predictive mean was computed over the dense $280 \times 280$ target space. It took 14 minutes to generate these AR ConvCNP samples on a Tesla V100 GPU.

The MLP ConvGNP (Markou et al., 2022) can be used to successfully model dependencies between outputs in a statistical downscaling task, improving log-likelihoods over the MLP ConvCNP (Vaughan et al., 2022) and enabling coherent samples.
In this experiment, we demonstrate that the AR ConvCNP can also be used for this purpose.

The goal of this experiment is to estimate the maximum daily temperature at 589 weather stations around Germany. To estimate these temperatures, we follow Vaughan et al. (2022) and use 25 coarse-grained ERA-Interim reanalysis variables (Dee et al., 2011) in combination with 1 km-resolution elevation data (Earth Resources Observation and Science Center, U.S. Geological Survey, U.S. Department of the Interior, 1997). We also consider a second setup where we reveal some of the weather station observations. These revealed weather station observations can be used by the models to aid downscaling performance.

The ERA-Interim reanalysis variables considered are tabulated in Table 21. In contrast to previous downscaling work, which degrades reanalysis data to between 2° and 2.5°, we opt to use the ERA-Interim reanalysis data at the native 0.75° resolution, consistent with the latest high-resolution climate models with horizontal resolution ranging from 0.5° to 1.0°. All variables are spatially subset to between 6° and 16° longitude and 47° and 55° latitude, covering Germany.

The weather station data are a subselection of the European Climate Assessment & Dataset (Tank et al., 2002) and are available at https://www.ecad.eu; we use the blended data. As for the ERA-Interim reanalysis variables, we take the weather stations located within the aforementioned region. The locations of the weather stations around Germany are visualised in Figure 16.

The 1 km-resolution elevation data is taken from the United States Geological Survey GTOPO30 elevation data set available at https://doi.org/10.5066/F7DF6PQS. This provides global elevation data at 30-arc-second resolution, which is approximately 1 km.

Following the VALUE framework (Maraun et al., 2015), we consider all days of the years 1979-2008 and split these years into five folds. We use the first four folds (spanning 1979-2002) for training, holding out the last 1000 days for cross-validation; and use the fifth fold (spanning 2003-2008) for evaluation.
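The data split can be sketched with the standard library alone. The sketch assumes five contiguous folds of six calendar years each, which places the boundary between the fourth and fifth fold at the end of 2002; the function names are hypothetical.

```python
from datetime import date, timedelta


def year_folds(first_year=1979, last_year=2008, n_folds=5):
    """Split a range of years into contiguous folds of equal length."""
    years = list(range(first_year, last_year + 1))
    size = len(years) // n_folds
    return [years[i * size:(i + 1) * size] for i in range(n_folds)]


def days_in_years(years):
    """All calendar days from 1 January of the first year to 31 December of the last."""
    day, end = date(years[0], 1, 1), date(years[-1], 12, 31)
    days = []
    while day <= end:
        days.append(day)
        day += timedelta(days=1)
    return days


folds = year_folds()
# First four folds: training, with the last 1000 days held out for validation.
train_val_days = days_in_years(folds[0] + folds[1] + folds[2] + folds[3])
train_days, val_days = train_val_days[:-1000], train_val_days[-1000:]
# Fifth fold: evaluation.
test_days = days_in_years(folds[4])
```

Under these assumptions, the held-out 1000 days are the final stretch of the fourth fold, so the validation period directly precedes the evaluation period.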

L.2 MULTISCALE CONVOLUTIONAL ARCHITECTURE

Deploying the AR ConvCNP in this downscaling experiment comes with a significant challenge. Because the elevation data has a fine resolution of 1 km, we expect that predictions by the AR ConvCNP will also vary roughly on this length scale. In the autoregressive sampling procedure (Procedure 2.1), samples from the model will be fed back into the model. Therefore, the AR ConvCNP must handle context data which varies on a 1 km spatial scale, which means that the discretisation of the AR ConvCNP must roughly be a 1 km-resolution grid. Unfortunately, making the discretisation this fine is prohibitively expensive and imposes prohibitive memory requirements. It is this limitation that prevents us from extending Vaughan et al. (2022)'s MLP ConvCNP and Markou et al. (2022)'s MLP ConvGNP to include additional weather station observations. We must therefore innovate on the AR ConvCNP design to come up with a convolutional architecture that can handle such a fine discretisation at reasonable computational expense.

The architecture that we propose is a multiscale architecture operating on multiple spatial length scales. Let us divide the context set $D = D_{\text{lr}} \cup D_{\text{mr}} \cup D_{\text{hr}}$ into a low-resolution component $D_{\text{lr}}$, a medium-resolution component $D_{\text{mr}}$, and a high-resolution component $D_{\text{hr}}$. Let the low-resolution component $D_{\text{lr}}$ consist of features of the context set that vary on a long spatial length scale, the medium-resolution component $D_{\text{mr}}$ of features that vary on a medium-range spatial length scale, and the high-resolution component $D_{\text{hr}}$ of features that vary on a short spatial length scale. The central assumption of the architecture is that predictions for target points depend on precise short-length-scale details $D_{\text{hr}}$ nearby, but that this dependence weakens as we move away from the target point, starting to depend more on broad-stroke long-length-scale components $D_{\text{lr}}$.
For example, predictions might depend on detailed orographic information nearby, but more on general orographic shapes farther away.

The key to the computational efficiency of this architecture is that we construct the high-resolution discretisation only locally to the target points: a small square covering 0.25° more than the most extremal target points. If the target points are confined to a small region, then the high-resolution grid will also be small, covering only 0.25° more than that region. Crucially, the high-resolution [...]

[...] different context sizes in Tables 23 and 24. For example, the sawtooth data results in Table 24 show that, for context sets of size 16, the AR method provides no benefits. The increased flexibility of this AR method is not needed in this case: a Gaussian predictive models the marginal sufficiently well. For context set sizes of 4 and 8, on the other hand, we see significant improvements using the AR method. Similarly, for the synthetic audio data results in Table 23, we see improvements using the ConvCNP (AuxAR) for all context set sizes except 16.

The autoregressive sampling with auxiliary data method shown here shows promise for improving the modelling of processes with multi-modal and heavy-tailed marginal distributions, all with no changes to the training procedure. The scenarios where this method is most useful are highly contingent upon the context set size, because of its impact on how well the marginals are modelled by Gaussians.



1. Unless otherwise specified, we assume CNPs use Gaussian likelihoods, as in Garnelo et al. (2018a). However, it is straightforward to modify them to use non-Gaussian likelihoods, as we do in Section 4.4.
2. https://kdd.ics.uci.edu/databases/eeg/eeg.data.html.
3. https://cds.climate.copernicus.eu/#!/home.
4. https://github.com/wesselb/neuralprocesses.
5. https://github.com/tom-andersson/iclr2023-antarctic-arconvcnp.
6. The sawtooth process is already challenging enough.
7. See http://people.whitman.edu/~hundledr/courses/M250F03/LynxHare.txt.
8. A trajectory length of 0 is equivalent to the standard test-time procedure.
9. The Dirac comb is defined as $\mathrm{Comb}_T(t) := \sum_{k=-\infty}^{\infty} \delta(t - kT)$ for a given period $T$.



Figure 2: Negative log-likelihoods on non-Gaussian sawtooth data. Deploying the ConvCNP in AR mode dramatically improves performance, and outperforms state-of-the-art NPs with Gaussian (FullConvGNP) and non-Gaussian (ConvLNP) predictive distributions, at a fraction of the training cost.

Figure 3: Top: generative process: mixture model of three deterministic functions with additive Gaussian noise. Bottom: at the four target locations indicated by dashed lines, the panes show the true distribution and predictions by the ideal CNP and the ideal CNP applied autoregressively at the targets from left to right. See Appendices D.3 and E.

Figure 5: Conceptual diagram showing the relationships between AR CNPs and various neural distribution estimators. The vertical axis denotes whether the model learns a distribution over a finite number of random variables, a countably infinite number, or an uncountably infinite number. The axis into the page denotes whether the architecture is MLP-based, or uses attention or convolutions. From left to right, we show different modelling paradigms. Fruitful exchanges occur when NPs (highlighted in green) are introduced into other modelling paradigms. Our proposed AR CNPs can be viewed as introducing NPs to the AR modelling paradigm.

Figure 6: (a) Ground truth simulated cloud cover fraction on 25/06/2018. (b-e), Sample draws from the AR ConvCNP, ConvCNP, ConvLNP and ConvGNP with context points denoted by red dots. Context points were removed from the right hand side of the 2D space to test the models' abilities to extrapolate coherent function samples far away from observations. The ConvCNP and ConvLNP models used a beta-categorical likelihood while the ConvGNP uses a low-rank Gaussian likelihood.

Figure Appendix D.1 illustrates this observation. When the density of the target points does not exceed that of the training data (50 and 100 points), the predictions look calibrated. However, once the density of the target points comes close to or exceeds that of the training data (200, 500, and 1000 points), bias starts to creep into the predictions.

Figure 8: Samples and predictions for an AR ConvCNP with various numbers of target points ordered randomly (left column) and ordered left to right (right column). When the density of the target points does not exceed the training data (50 and 100 points), ordering the target points randomly or left to right does not matter. When the density of the target points comes close to the training data or exceeds it (200, 500, and 1000 points), bias creeps into the predictions. The random ordering appears to perform more robustly than left to right. The data is sampled from the EQ data process from the synthetic experiments (Section 4.1), and the trained model is also taken from the synthetic experiments. The predictions by the model are shown in solid blue and the marginals by the ground-truth EQ process are shown in dot-dashed purple.

Figure 9: Plot of the standard deviation, due to different random orderings, of the per-data-point predictive log-likelihood (in nats) of an AR ConvCNP on one-dimensional sawtooth data, as a function of the target set size. For each point in the plot, we have used $2^{10}$ randomly sampled and fixed tasks, on each of which we apply the AR ConvCNP with 100 different randomly sampled orderings.

Figure 10: Samples from the sawtooth data process with one and two-dimensional inputs (dx = 1 and dx = 2) and one and two-dimensional outputs (dy = 1 and dy = 2)

Figure 11: Multi-modality of predictions by the AR ConvCNP. Shows four observations sampled from the sawtooth process. In the four rows, these four observations are revealed one data point at a time. Every row also shows a kernel density estimate of the prediction at x = 0. Filled regions are central 95%-credible regions.

Figure 14 shows an empirical density of the ERA5 cloud cover values calculated over the models' training period of 2000-2013. The spikes at 0 and 1 correspond largely to values of exactly 0 and exactly 1. This motivates the beta-categorical likelihood described in Section 4.4.

Figure 14: Empirical density of ERA5 cloud cover fraction computed over the period 2000-2013.

• ConvCNP: 25.0 hours,
• ConvGNP: 27.5 hours,
• ConvLNP: 43.6 hours.

Architectures. For each model, the U-Net component of the encoder uses 5 × 5 convolutional kernels with the following sequence of channel numbers (d.s. = 2 × 2 downsample layer, u.s. = 2 × 2 upsample layer):

Figure 15: Seven samples from each model in the Antarctic cloud cover sample extrapolation task for 25/06/2018.

Figure 16: Locations of the 589 weather stations around Germany in the downscaling experiments.

Figure 17 depicts the multiscale architecture. The architecture is a cascade of three convolutional deep sets, parametrised by three CNNs; please see the caption. The low-resolution CNN handles the context data $D_{\text{lr}}$ with a long spatial length scale. Because these features have a long spatial length scale, the CNN can get away with a low-resolution discretisation. The output of the low-resolution CNN then feeds into a medium-resolution CNN. The medium-resolution CNN handles the context data $D_{\text{mr}}$ with a medium spatial length scale and has a medium-resolution discretisation. Finally, the output of the medium-resolution CNN feeds into a high-resolution CNN. This CNN handles the context data $D_{\text{hr}}$ with a short spatial length scale and has a high-resolution discretisation.

) -1.05±0.11 -0.63±0.23 -0.29±0.17

Log-likelihood values for varying context sizes using the ConvCNP and ConvCNP (AuxAR) on the function mixture data generator. Column headers indicate the context set size. Log-likelihoods are shown in bold when they are significantly best (p < 0.05). Errors indicate the central 95% confidence interval.

WITH THE LOTKA-VOLTERRA EQUATIONS

Predator-prey data. We next investigate sim-to-real transfer, where the models are trained on simulated data and tested on real data. NPs are well-suited to this setting, since a large meta-data set

Performance of NPs trained on the GP EQ task, sawtooth task, and mixture task. Diagonal GP denotes the exact GP predictive, but with correlations removed. Trivial denotes a model that predicts a Gaussian distribution with the empirical mean and standard deviation of the context outputs. Significantly best models in bold. Note that the FullConvGNP cannot be run on tasks where dx > 1.

Normalised log-likelihoods in the predator-prey experiments, showing interpolation (int.), forecasting (for.), and reconstruction (rec.) on simulated (S) and real (R) data. Significantly best results in bold.

Normalised log-likelihoods on the EEG experiments. Significantly best results in bold.





Additional notation. If $y_1 \oplus y_2 \sim P_{x_1 \oplus x_2} \pi(D)$, then denote the distribution of $y_1 \mid y_2$ by $P_{x_1 \mid x_2} \pi(D)$. Note that $P_{x_1 \mid x_2} \pi(D)$ depends on $y_2$, because it is the distribution of $y_1 \mid y_2$, even though the notation does not make this dependence explicit.

For the Gaussian experiments, average Kullback-Leibler divergences of the posterior prediction map πy with respect to the model normalised by the number of target points. Shows for one-dimensional inputs (1D; dx = 1) and two-dimensional inputs (2D; dx = 2) the performance for interpolation within the range $[-2, 2]^{d_x}$ where the models were trained ("Int."); interpolation within the range $[2, 6]^{d_x}$ which the models have never seen before ("OOID"); and extrapolation from the range $[-2, 2]^{d_x}$ to the range $[2, 6]^{d_x}$ ("Ext."). Models are ordered by interpolation performance for one-dimensional inputs. The latent variable models are trained and evaluated with the ELBO objective (E); trained and evaluated with the ML objective (M); and trained with the ELBO objective and evaluated with the ML objective (E-M). Diagonal GP refers to predictions by the ground-truth Gaussian processes without correlations. Trivial refers to predicting the empirical means and standard deviation of the test data. Errors indicate the central 95%-confidence interval. Numbers which are significantly best (p < 0.05) are boldfaced. Numbers which are very large are marked as failed with "F". Numbers which are missing could not be run.

±0.00 0.03 ±0.00 0.02 ±0.00 0.03 ±0.00 0.03 ±0.00 0.02 ±0.00
ConvGNP 0.04 ±0.00 0.04 ±0.00 1.75 ±0.12 0.12 ±0.00 0.12 ±0.00 0.71 ±0.03
AGNP 0.10 ±0.00 4.34 ±0.17 5.45 ±0.23 0.17 ±0.00 0.62 ±0.01 0.39 ±0.01
±0.02 0.43 ±0.01 1.16 ±0.45 0.52 ±0.01
Trivial 1.08 ±0.01 1.08 ±0.01 0.85 ±0.01 0.57 ±0.01 0.57 ±0.01 0.40 ±0.00
ConvLNP (E-M) 2.01 ±0.11 2.01 ±0.11 5.95 ±0.16 0.44 ±0.01 0.44 ±0.01 0.47 ±0.01

±0.04 1.53 ±0.04 1.32 ±0.04 0.56 ±0.03 0.56 ±0.03 0.29 ±0.03 ConvLNP (E) 1.40 ±0.05 1.40 ±0.05 0.82 ±0.05 0.06 ±0.03 0.06 ±0.03 -0.62 ±0.04 ConvLNP (M) 1.08 ±0.06 1.08 ±0.06 -0.36 ±0.03 0.26 ±0.04 0.26 ±0.04 -0.70 ±0.02

Trivial refers to predicting the empirical means and standard deviation of the test data. Errors indicate the central 95%-confidence interval. Numbers which are significantly best (p < 0.05) are boldfaced. Numbers which are very large are marked as failed with "F". Numbers which are missing could not be run.

FullConvGNP 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00
ConvGNP 0.01 ±0.00 0.01 ±0.00 3.46 ±0.08
ConvCNP (AR) 0.01 ±0.00 0.01 ±0.00 0.01 ±0.00
FullConvGNP 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00
ConvGNP 0.01 ±0.00 0.01 ±0.00 1.73 ±0.05
ConvCNP (AR) 0.01 ±0.00 0.01 ±0.00 0.01 ±0.00
±0.00 1.31 ±0.00 1.07 ±0.00
ConvLNP (E-M) 2.14 ±0.06 2.14 ±0.06 9.30 ±0.08

Trivial refers to predicting the empirical means and standard deviation of the test data. Errors indicate the central 95%-confidence interval. Numbers which are significantly best (p < 0.05) are boldfaced. Numbers which are very large are marked as failed with "F". Numbers which are missing could not be run. For the Matérn-5

±0.00 1.87 ±0.02 1.30 ±0.01
CNP 0.54 ±0.01 1.47 ±0.02 1.11 ±0.01
Trivial 1.08 ±0.00 1.08 ±0.00 0.85 ±0.00
ConvLNP (E-M) 1.37 ±0.04 1.36 ±0.04 4.30 ±0.06
FullConvGNP 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00
ConvCNP (AR) 0.01 ±0.00 0.01 ±0.00 0.01 ±0.00
±0.00 2.33 ±0.02 1.23 ±0.00
CNP 0.65 ±0.00 7.72 ±0.69 1.23 ±0.00
Trivial 1.18 ±0.00 1.18 ±0.00 0.94 ±0.00
ConvLNP (E-M) 3.07 ±0.06 3.06 ±0.06 9.83 ±0.09

ConvCNP (AR) 0.01 ±0.00 0.01 ±0.00 0.00 ±0.00
ACNP (AR) 0.05 ±0.00 0.54 ±0.00 0.38 ±0.00
AGNP 0.08 ±0.00 0.83 ±0.01 0.37 ±0.00
ConvCNP (AR) 0.01 ±0.00 0.01 ±0.00 0.01 ±0.00
ACNP (AR) 0.06 ±0.00 0.58 ±0.00 0.41 ±0.00
ConvGNP 0.14 ±0.00 0.14 ±0.00 0.64 ±0.01
AGNP 0.17 ±0.00 0.58 ±0.00 0.40 ±0.00
±0.00 0.86 ±0.17 0.59 ±0.00
ConvLNP (E-M) 0.49 ±0.00 0.49 ±0.00 0.61 ±0.00
Trivial 0.58 ±0.00 0.58 ±0.00 0.41 ±0.00
FullConvGNP

±0.01 1.45 ±0.03 0.68 ±0.01
Trivial 0.82 ±0.00 0.82 ±0.00 0.61 ±0.00
ConvLNP (E-M) 1.58 ±0.03 1.57 ±0.03 2.85 ±0.04

±0.00 1.27 ±0.01 0.77 ±0.01
Trivial 0.89 ±0.00 0.89 ±0.00 0.67 ±0.00
ConvLNP (E-M) 3.02 ±0.03 3.02 ±0.03 5.40 ±0.04





For the mixture synthetic experiments with two-dimensional inputs, average log-likelihoods normalised by the number of target points. Shows for one-dimensional outputs (dy = 1) and two-dimensional outputs (dy = 2) the performance for interpolation within the range $[-2, 2]^2$ where the models were trained ("Int."); interpolation within the range $[2, 6]^2$ which the models have never seen before ("OOID"); and extrapolation from the range $[-2, 2]^2$ to the range $[2, 6]^2$ ("Ext."). Models are ordered by interpolation performance. The latent variable models are trained and evaluated with the ELBO objective (E); trained and evaluated with the ML objective (M); and trained with the ELBO objective and evaluated with the ML objective (E-M). Trivial refers to predicting the empirical means and standard deviation of the test data. Errors indicate the central 95%-confidence interval. Numbers which are significantly best (p < 0.05) are boldfaced. Numbers which are very large are marked as failed with "F". Numbers which are missing could not be run.

ConvCNP (AR) -0.62 ±0.01 -0.62 ±0.01 -0.79 ±0.01
ConvGNP -0.74 ±0.01 -0.74 ±0.01 -1.43 ±0.02
ConvLNP (M) -0.78 ±0.02 -0.79 ±0.02 -1.25 ±0.02

±0.02 -3.46 ±0.02 -3.79 ±0.02 -4.16 ±0.04 -4.28 ±0.04 -4.45 ±0.00
ConvCNP (AR) -3.30 ±0.02 -3.47 ±0.02 -3.60 ±0.02 -4.10 ±0.03 -4.27 ±0.03 -4.32 ±0.01
ConvNP (ML) -3.41 ±0.02 -3.84 ±0.02 -4.44 ±0.02 -4.13 ±0.04 -4.45 ±0.05 -4.54 ±0.01
ConvGNP -3.47 ±0.02 -3.65 ±0.02 -4.15 ±0.02 -4.21 ±0.05 -4.82 ±0.13 -4.61 ±0.01
ConvCNP -3.47 ±0.02 -4.06 ±0.02 -4.85 ±0.02 -4.17 ±0.04 -4.70 ±0.06 -4.97 ±0.01
ConvNP (ELBO) -3.77 ±0.02 -3.83 ±0.02 -4.12 ±0.02 -5.45 ±0.05 -5.47 ±0.07 -6.39 ±0.05
ANP (ML) -4.09 ±0.02 -4.32 ±0.02 -4.55 ±0.02 -4.31 ±0.03 -4.43 ±0.04 -4.49 ±0.01
ANP (ELBO-ML) -4.22 ±0.02 -4.54 ±0.02 -4.80 ±0.02 -4.58 ±0.11 -4.58 ±0.04 -4.68 ±0.01
ACNP (AR) -4.23 ±0.02 -4.44 ±0.02 -4.58 ±0.02 -4.40 ±0.03 -4.55 ±0.04 -4.59 ±0.02
ANP (ELBO) -4.32 ±0.03 -4.58 ±0.02 -4.82 ±0.02 -4.71 ±0.15 -4.63 ±0.05 -4.70 ±0.01
ACNP -4.34 ±0.02 -4.65 ±0.02 -4.88 ±0.02 -4.43 ±0.04 -4.58 ±0.04 -4.74 ±0.00
ConvNP (E.-M.) -6.71 ±0.05 -8.44 ±0.11

Normalised log-likelihoods in the predator-prey experiments. Shows the performance for interpolation ("Int."), forecasting ("For."), and reconstruction ("Rec.") on simulated ("S") and real ("R") data. Models are ordered by interpolation performance on simulated data. The latent variable models are trained and evaluated with the ELBO objective (E); trained and evaluated with the ML objective (M); and trained with the ELBO objective and evaluated with the ML objective (E-M). Errors indicate the central 95%-confidence interval. Numbers which are significantly best (p < 0.05) are boldfaced. Numbers which are very large are marked as failed with "F". Numbers which are missing could not be run.

ERA-Interim reanalysis predictors.


(47) and (48). The parameters of (47) and (48) are sampled according to Table 19.

Unif([0.5, 10])   η   Unif([1, 5])

Table 19: Sampling distributions for the parameters of the stochastic version of the Lotka-Volterra equations (47) and (48). These equations are simulated on a dense grid spanning [-10, 100]. The table also shows the distribution for the initial conditions at t = -10. To not depend too heavily on these initial conditions, the simulation results on [-10, 0] are discarded.

For this experiment, the learning rate is $1 \cdot 10^{-4}$, the margin is 1, and the points per unit is 4. We trained the models for 200 epochs.

The convolutional models use a U-Net architecture with seven layers instead of six where, in the first layer, the stride is one instead of two. For the kernel architecture of the FullConvGNP, we reduce the points per unit and the number of channels in the U-Net by a factor of two.

J DETAILS OF ELECTROENCEPHALOGRAPHY EXPERIMENTS

We explore an electroencephalography data set collected from 122 subjects (Begleiter, 2022). There are two groups of subjects: alcoholic and control. Every subject was exposed to a single stimulus or two stimuli, and their response was measured with 64 electrodes placed on the subject's scalp. These measurements are organised in trials, where a trial consists of 256 samples of the electrodes spanning one second. The data set contains up to 120 trials for each subject. The data is available at https://archive.ics.uci.edu/ml/datasets/eeg+database and the collection is described in detail by Zhang et al. (1995). In this experiment, we focus only on seven frontal electrodes: FZ, F1, F2, F3, F4, F5, and F6. Figure 13 illustrates a trial of a subject, showing the samples for these seven electrodes.

We randomly split all subjects into three sets: an evaluation set consisting of ten subjects, a cross-validation set consisting of ten other subjects, and a training set consisting of all remaining subjects. For each of these sets, we create a meta-data set by aggregating the trials for all subjects. We split every trial into a context and target set in the same three ways as for the predator-prey experiment. First, for all seven electrodes separately, randomly designate between 50 and 256 points to be the target points and let the remainder (between 0 and 206) be the context points. This task is called interpolation and is the primary measure of performance. Additionally, randomly choose one of the seven electrodes and, for that choice, split the data in two exactly like for forecasting. For all other electrodes, append all data to the context set. This task is called reconstruction and measures the model's ability to infer a signal for one electrode from the others.
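The interpolation split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `interpolation_split` is hypothetical, the trial is filled with random placeholder values rather than real EEG samples, and the number of target points is assumed to be drawn uniformly from {50, ..., 256}, which the text does not state explicitly.

```python
import numpy as np

def interpolation_split(rng, trial, min_targets=50):
    """Split each electrode of one trial into context and target indices.

    trial: array of shape (n_electrodes, n_samples), e.g. (7, 256).
    Assumption: the number of targets is uniform on {min_targets, ..., n_samples}.
    """
    n_electrodes, n_samples = trial.shape
    splits = []
    for _ in range(n_electrodes):
        # The upper bound of rng.integers is exclusive, hence n_samples + 1.
        n_tgt = int(rng.integers(min_targets, n_samples + 1))
        perm = rng.permutation(n_samples)
        tgt_idx = np.sort(perm[:n_tgt])   # target sample indices
        ctx_idx = np.sort(perm[n_tgt:])   # remaining indices form the context
        splits.append((ctx_idx, tgt_idx))
    return splits

rng = np.random.default_rng(0)
trial = rng.standard_normal((7, 256))  # placeholder for seven electrodes, 256 samples
splits = interpolation_split(rng, trial)
```

Each electrode receives its own split, so the context set for one electrode can be empty while another electrode keeps up to 206 context points.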
We train all models on the interpolation task, and evaluate the models on the interpolation and reconstruction tasks.

For this experiment, the learning rate is $5 \cdot 10^{-5}$, the margin is 0.1, and the points per unit is 256. We trained the models for 1000 epochs. For the FullConvGNP, the learning rate is $2 \cdot 10^{-4}$. The training run for the FullConvGNP was terminated after 84 hours, reaching epoch 127.

The convolutional models use a U-Net architecture where, in the first layer, the stride is one instead of two. In addition, the number of channels is adjusted as follows: the ConvCNP and ConvGNP use 128 channels, the ConvLNP uses 96 channels, and the FullConvGNP uses 64 channels. The length scales of the Gaussian kernels of the convolutional models are initialised to 0.77/256. To scale to seven outputs, the deep set-based and attentive models reuse the same encoder for every output dimension.

Published as a conference paper at ICLR 2023

2020). The variables $z_{\text{lr}}(\cdot)$, $z_{\text{mr}}(\cdot)$, and $z_{\text{hr}}(\cdot)$ represent intermediate representations as continuous functions, and the maps $\text{CNN}_{\text{lr}}$, $\text{CNN}_{\text{mr}}$, and $\text{CNN}_{\text{hr}}$ are translation-equivariant maps between functions on $\mathcal{X}$. Following the construction of the ConvCNP (Gordon et al., 2020), these maps are all implemented with convolutional neural networks (CNN) using a discretisation. For $\text{CNN}_{\text{lr}}$, the internal discretisation is the 0.75°-resolution grid corresponding to the 25 coarse-grained ERA-Interim reanalysis variables. For $\text{CNN}_{\text{mr}}$, the internal discretisation is a 0.1°-resolution grid spanning 5° more than the most extremal target inputs; the discretisation does not depend on the context set. For $\text{CNN}_{\text{hr}}$, the internal discretisation is a 0.01°-resolution grid spanning 0.25° more than the most extremal target inputs; the discretisation also does not depend on the context set.
grid will not be constructed over all of Germany, like it would if we were to more naively apply the ConvCNP with a high-resolution discretisation, incurring prohibitive computational cost. Even though the high-resolution grid is only constructed locally to the target points, the model can still capture long-range dependencies via the medium-resolution and low-resolution grids. Namely, the medium-resolution grid is a square covering 5° more than the most extremal target points, and the low-resolution grid covers all of Germany; see Figure 17. To utilise this computational gain, the target points must be confined to a small region. This perfectly synergises with the autoregressive sampling procedure (Procedure 2.1), because this procedure evaluates the model one target point at a time. The training procedure, however, must be adapted. During training, we subsample the target points to ensure that the target set is always confined to a small square, which is described in Appendix L.4.

During the autoregressive sampling procedure, the AR ConvCNP takes in earlier AR samples from the model. In the architectures of the MLP ConvCNP and MLP ConvGNP, there is no natural context data to which these samples can be appended. Therefore, in addition to the ERA-Interim reanalysis variables and the elevation data, we also let the AR ConvCNP take in observed weather stations as context data. We will append the earlier AR samples to these weather station context data. To have the model make appropriate use of the weather station context set, we must randomly divide the weather station observations over the context and target set, which we describe in Appendix L.4. We let the low-resolution context data $D_{\text{lr}}$ consist of the 25 coarse-grained ERA-Interim reanalysis variables, and let the medium-resolution $D_{\text{mr}}$ and high-resolution context data $D_{\text{hr}}$ both consist of the weather station observations (and earlier AR samples) and the 1 km-resolution elevation data.
When the 1 km-resolution data is fed to the medium-resolution CNN, the data loses some detail, because the internal discretisation of the medium-resolution CNN is coarser than the data; however, when it is fed to the high-resolution CNN, the data retains its detail. The same holds for the weather station observations (and earlier AR samples).
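To make the cost argument concrete, the sketch below builds a grid that covers the target points plus a margin at a given resolution, using the medium-resolution (0.1°, margin 5°) and high-resolution (0.01°, margin 0.25°) settings quoted in this appendix. The function `local_grid` and the target coordinates are hypothetical, and the grid here is a bounding box around the targets rather than necessarily a square; it is an illustration of the idea, not the implementation used in the experiments.

```python
import numpy as np

def local_grid(targets, resolution, margin):
    """Grid axes covering the bounding box of `targets` plus `margin` on every side.

    targets: array of shape (n_points, n_dims), in degrees.
    Returns one 1D coordinate array per input dimension.
    """
    lo = targets.min(axis=0) - margin
    hi = targets.max(axis=0) + margin
    # Enough points per axis to reach (or slightly pass) the upper bound.
    n = np.ceil((hi - lo) / resolution).astype(int) + 1
    return [lo[d] + resolution * np.arange(n[d]) for d in range(targets.shape[1])]

# Made-up target locations confined to a 0.1° x 0.1° region.
targets = np.array([[10.00, 50.00], [10.10, 50.10]])
hr_axes = local_grid(targets, resolution=0.01, margin=0.25)
mr_axes = local_grid(targets, resolution=0.1, margin=5.0)
hr_points = len(hr_axes[0]) * len(hr_axes[1])
mr_points = len(mr_axes[0]) * len(mr_axes[1])
```

Because the grid size depends only on the spread of the target points, evaluating one target point at a time keeps the high-resolution grid at its minimum size.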

L.3 ARCHITECTURES

MLP ConvCNP and MLP ConvGNP (Vaughan et al., 2022; Markou et al., 2022). The MLP ConvCNP and MLP ConvGNP are respectively a ConvCNP and a ConvGNP where the decoder $\text{dec}_\theta = \text{fuse}_\theta \circ \text{dec}'_\theta$ is decomposed into a convolutional architecture $\text{dec}'_\theta$ followed by a pointwise MLP $\text{fuse}_\theta$:

In this architecture, the ERA-Interim variables are incorporated via the convolutional architecture, producing the encoding $z(\cdot)$. On the other hand, as (49) shows, the 1 km-resolution elevation data is included via the pointwise MLP $\text{fuse}_\theta$.

Parametrise $\text{dec}'_\theta$ with a seven-layer residual convolutional neural network (He et al., 2016). Every residual layer involves one depthwise-separable convolutional filter (Chollet, 2017) with kernel size three followed by a pointwise MLP. Every layer has 128 channels, and the network also outputs 128 channels. The discretisation for $\text{dec}'_\theta$ is the grid of the ERA-Interim reanalysis variables. Parametrise $\text{fuse}_\theta$ with a three-hidden-layer MLP of width 128.

AR ConvCNP. The AR ConvCNP does not use the pointwise MLP $\text{fuse}_\theta$ to incorporate the 1 km-resolution elevation data. Instead, it is a normal ConvCNP where the convolutional architecture is implemented by the multi-scale architecture described in Figure 17.

Parametrise $\text{CNN}_{\text{lr}}$ with a depthwise-separable residual convolutional neural network like in the MLP ConvCNP and MLP ConvGNP, but use six layers instead of seven. Let $\text{CNN}_{\text{lr}}$ output 64 channels. The discretisation for $\text{CNN}_{\text{lr}}$ is the grid of the ERA-Interim reanalysis variables. Parametrise $\text{CNN}_{\text{mr}}$ with a U-Net (Ronneberger et al., 2015) using an architecture similar to what we have been using. Before the U-turn, let the U-Net have five convolutional layers with kernel size five, stride one for the first layer and stride two afterwards, 64 output channels for the first three layers and 128 output channels afterwards.
After the U-turn, instead of using transposed convolutions, use regular convolutions combined with an upsampling layer using bilinear interpolation. Let $\text{CNN}_{\text{mr}}$ output 64 channels. The receptive field of $\text{CNN}_{\text{mr}}$ is approximately 10°. The discretisation for $\text{CNN}_{\text{mr}}$ is centred around the target points with margin 5°. Parametrise $\text{CNN}_{\text{hr}}$ with a U-Net like for $\text{CNN}_{\text{mr}}$, but with four convolutional layers before the U-turn. The receptive field of $\text{CNN}_{\text{hr}}$ is approximately 0.5°. The discretisation for $\text{CNN}_{\text{hr}}$ is centred around the target points with margin 0.25°.

L.4 TRAINING DETAILS

MLP ConvCNP and MLP ConvGNP. The MLP ConvCNP and MLP ConvGNP are trained with learning rate $2.5 \cdot 10^{-5}$ for 500 epochs. For the MLP ConvGNP, to encourage the covariance to fit, we fix the variance of the decoder to $10^{-4} I$ for the first ten epochs.

AR ConvCNP. The AR ConvCNP is trained with learning rate $1 \cdot 10^{-5}$ for 500 epochs. During training and cross-validation, the target points are subsampled to lie in a 3° × 3° square. For training, the number of target points is ensured to be at least ten; and for cross-validation, at least one. The size of the cross-validation set is increased tenfold.

Sampling of data. For the MLP ConvCNP and MLP ConvGNP, since these architectures cannot take in weather station observations, all weather stations are used as context data. For the AR ConvCNP, a data set is split into a context and target set by randomly selecting n points as context points and letting the remainder be target points. Specifically, the number of context points n is sampled from $p(n) \propto e^{-0.01 n}$. This splitting is done after subsampling the 3° × 3° square.
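The context/target split for the AR ConvCNP can be sketched as below. The text gives only the unnormalised form of the distribution over n; the support {0, ..., N}, with N the number of available points, is an assumption made here for illustration, as are the function names and the placeholder station data.

```python
import numpy as np

def sample_num_context(rng, n_max, rate=0.01):
    """Draw n from p(n) proportional to exp(-rate * n) on {0, ..., n_max} (assumed support)."""
    n = np.arange(n_max + 1)
    p = np.exp(-rate * n)
    p /= p.sum()  # normalise the truncated distribution
    return int(rng.choice(n, p=p))

def split_context_target(rng, x, y, rate=0.01):
    """Randomly select n points as context; the remainder become targets."""
    n_ctx = sample_num_context(rng, len(x), rate)
    perm = rng.permutation(len(x))
    ctx, tgt = perm[:n_ctx], perm[n_ctx:]
    return (x[ctx], y[ctx]), (x[tgt], y[tgt])

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 3.0, size=(40, 2))  # placeholder station locations in a 3° x 3° square
y = rng.standard_normal(40)              # placeholder station observations
(ctx_x, ctx_y), (tgt_x, tgt_y) = split_context_target(rng, x, y)
```

The exponential weighting favours small context sets, so the model frequently sees tasks with few station observations during training.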

M ALTERNATE AR PROCEDURE WITH AUXILIARY DATA

We propose an additional procedure which uses autoregressive sampling with auxiliary data to generate more expressive marginal predictives. The input points of the auxiliary data are chosen randomly, and then sampled autoregressively before sampling the target points. Finally, we discard the sampled values for the auxiliary data, but retain the samples for the target points. Adding auxiliary points in this way allows the model to roll out autoregressively with more steps, even if the target set is small (or just a single point). We describe the procedure below:

Procedure M.1 (Autoregressive application of neural process with auxiliary data). For a neural process $\pi_\theta$, context set $D^{(c)} = (x^{(c)}, y^{(c)})$, a target input $x^{(t)}$, a distribution $r$ over $\mathcal{X}$, a number of auxiliary data points $R \in \mathbb{N}$, and a number of trajectories $M \in \mathbb{N}$, let $\text{AuxAR}_{x^{(t)}}(\pi_\theta, D^{(c)}, r, R, M)$ be the distribution defined as follows. We first autoregressively sample the auxiliary data trajectories $(x^{(\text{aux},j)}, y^{(\text{aux},j)})$ at random locations sampled from $r$, for $j = 1, \ldots, M$. Next, conditioned on the auxiliary data, we sample the target point of interest to make predictions. We then marginalise out the auxiliary data by averaging over the $M$ trajectories:

$$\text{AuxAR}_{x^{(t)}}(\pi_\theta, D^{(c)}, r, R, M) = \frac{1}{M} \sum_{j=1}^{M} P_{x^{(t)}} \pi_\theta(x^{(c)} \oplus x^{(\text{aux},j)}, y^{(c)} \oplus y^{(\text{aux},j)}).$$

This procedure introduces three hyperparameters: the distribution $r$ from which to draw inputs, the length of trajectories $R$, and the number of trajectories to sample $M$.

In the following experiments, we set the distribution $r$ to be uniform over the training domain with no dependence on the context set or target point of interest: $r = \text{Uniform}([b, h])$, where $b$ and $h$ are the lower and upper bounds of the training domain, respectively. One could experiment with other choices for the distribution $r$. The trajectory length $R$ is chosen between 0 (see footnote 8) and 8, and the number of trajectories $M$ is chosen between 1 and 128.
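A minimal sketch of the auxiliary-data procedure follows, assuming one-dimensional inputs and a model with Gaussian marginals. The function `toy_predictive` is a hypothetical stand-in for a trained CNP (a kernel-weighted average with a fixed length scale), chosen only so that the example runs; `aux_ar` and `mixture_logpdf` are likewise illustrative names, not part of any library.

```python
import numpy as np

def toy_predictive(x_ctx, y_ctx, x):
    """Hypothetical stand-in for a CNP: Gaussian mean and std at a single input x."""
    if len(x_ctx) == 0:
        return 0.0, 1.0
    w = np.exp(-0.5 * ((x - x_ctx) / 0.5) ** 2)  # kernel weights of context points
    total = w.sum()
    mean = (w * y_ctx).sum() / (total + 1.0)     # shrinks towards a zero prior mean
    std = 1.0 / np.sqrt(1.0 + total)             # uncertainty falls near context points
    return mean, std

def aux_ar(predict, x_ctx, y_ctx, x_target, low, high, R, M, rng):
    """Return the M Gaussian components (means, stds) of the prediction at x_target."""
    means, stds = np.empty(M), np.empty(M)
    for j in range(M):
        xs, ys = np.array(x_ctx, float), np.array(y_ctx, float)
        for _ in range(R):
            x_aux = rng.uniform(low, high)       # auxiliary input drawn from r
            m, s = predict(xs, ys, x_aux)
            y_aux = rng.normal(m, s)             # sample and feed back as context
            xs, ys = np.append(xs, x_aux), np.append(ys, y_aux)
        means[j], stds[j] = predict(xs, ys, x_target)
    return means, stds

def mixture_logpdf(y, means, stds):
    """Log-density of the equally weighted Gaussian mixture over trajectories."""
    log_comp = -0.5 * np.log(2 * np.pi * stds**2) - 0.5 * ((y - means) / stds) ** 2
    return np.logaddexp.reduce(log_comp) - np.log(len(means))

rng = np.random.default_rng(0)
x_ctx, y_ctx = np.array([-1.0, 1.0]), np.array([0.5, -0.5])
means, stds = aux_ar(toy_predictive, x_ctx, y_ctx, 0.0, -2.0, 2.0, R=8, M=16, rng=rng)
log_density = mixture_logpdf(0.1, means, stds)
```

The auxiliary samples are discarded after use; only the resulting mixture at the target input is kept. With R = 0 every component equals the standard prediction, so the mixture collapses to a single Gaussian.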

M.1 GENERATED DATA

We create three data generating processes for our experiments: a mixture of functions, random sawtooth functions, and random audio-like functions. The first two experiments have multi-modal true marginals, whereas the last has heavy-tailed marginals.

Function mixture. The function mixture data are generated by choosing one of the following three functions, the first with a probability of 1/4, the second with a probability of 1/2, and the third with a probability of 1/4:

Sawtooth. The sawtooth data are generated from the following function:

We sample the frequency $\omega \sim \text{Unif}([3, 5])$, the direction $d$ as either -1 or 1 with equal probability, and the shift as $\phi \sim \text{Unif}([1/\omega, 1])$.

Synthetic Audio. Synthetic audio data are generated by convolving a Dirac comb (see footnote 9) with a truncated decaying sum of sinusoids:

where

We truncate the waves up to the period length, because otherwise the convolution with the Dirac comb would lead to increasing amplitude, resulting in a non-stationary process.
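The sawtooth parameter sampling can be sketched as follows. The parameter distributions are taken from the text, but the sawtooth formula itself did not survive in this section, so the form `(omega * (d * x - phi)) mod 1` used below is an assumption made only to give a runnable example; `sample_sawtooth` is a hypothetical name.

```python
import numpy as np

def sample_sawtooth(rng):
    """Sample sawtooth parameters and return a function together with its parameters."""
    omega = rng.uniform(3.0, 5.0)            # frequency, Unif([3, 5])
    d = rng.choice([-1.0, 1.0])              # direction, -1 or 1 with equal probability
    phi = rng.uniform(1.0 / omega, 1.0)      # shift, Unif([1/omega, 1])

    def f(x):
        # ASSUMED functional form; the original equation is missing from this section.
        return np.mod(omega * (d * np.asarray(x, float) - phi), 1.0)

    return f, (omega, d, phi)

rng = np.random.default_rng(0)
f, (omega, d, phi) = sample_sawtooth(rng)
x = np.linspace(-2.0, 2.0, 200)
y = f(x)
```

Since the frequency, direction, and shift are all random, the marginal at any fixed input is spread over the whole output range before any data is observed.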

M.2 TRAINING

For the function mixture experiment, no training is required because we use the analytically derived ideal CNP $\pi_\infty$ as our model. See Section 2.

For the sawtooth and synthetic audio experiments, we train ConvCNP models. We train each model for 100 epochs using 1024 batches per epoch with a batch size of 16. We discretise the encoder by evaluating 64 points per unit. We use a margin of 0.1, and a stride length of 2 for each of the 6 layers of the U-Net. Each layer has 64 channels. The receptive field size from this combination of parameters is 6.953.

During training, we sample the number of context points uniformly at random from {0, ..., 75}, and we sample exactly 100 target points. The context points and target points are sampled uniformly from [-2, 2]. We use the Adam optimizer with a learning rate of $3 \times 10^{-4}$.
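The training task sampling described above can be sketched as below. The constants come from the text; the function `sample_task` is a hypothetical helper, and `np.sin` is used purely as a placeholder for a sampled function from the data-generating process.

```python
import numpy as np

EPOCHS, BATCHES_PER_EPOCH, BATCH_SIZE = 100, 1024, 16
# Total number of tasks seen during training: 100 * 1024 * 16 = 1,638,400.
TOTAL_TASKS = EPOCHS * BATCHES_PER_EPOCH * BATCH_SIZE

def sample_task(rng, f, max_context=75, num_target=100, low=-2.0, high=2.0):
    """Sample one training task: a random-size context set and exactly 100 targets."""
    n_ctx = int(rng.integers(0, max_context + 1))   # uniform on {0, ..., 75}
    x_ctx = rng.uniform(low, high, size=n_ctx)
    x_tgt = rng.uniform(low, high, size=num_target)
    return (x_ctx, f(x_ctx)), (x_tgt, f(x_tgt))

rng = np.random.default_rng(0)
(ctx_x, ctx_y), (tgt_x, tgt_y) = sample_task(rng, np.sin)  # np.sin is a placeholder
```

Because the context size ranges from 0 to 75, the trained model is exposed both to tasks with no observations and to tasks where the context is dense.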

M.3 RESULTS

Model 0 1 2 4 8 16
ConvCNP -0.47±0.07 -0.45±0.18 -0.24±0.12 -0.12±0.12 0.36±0.15 1.07±0.18
ConvCNP (AuxAR) 0.02±0.11 0.19±0.11 0.24±0.12 0.31±0.11 0.74±0.10 1.27±0.12

In Table 22, we see that using this procedure improves the held-out log-likelihoods for all of the experiments. We can better understand the utility of this method by observing the performance for

