BAYESIAN CONTEXT AGGREGATION FOR NEURAL PROCESSES

Abstract

Formulating scalable probabilistic regression models with reliable uncertainty estimates has been a long-standing challenge in machine learning research. Recently, casting probabilistic regression as a multi-task learning problem in terms of conditional latent variable (CLV) models such as the Neural Process (NP) has shown promising results. In this paper, we focus on context aggregation, a central component of such architectures, which fuses information from multiple context data points. So far, this aggregation operation has been treated separately from the inference of a latent representation of the target function in CLV models. Our key contribution is to combine these steps into one holistic mechanism by phrasing context aggregation as a Bayesian inference problem. The resulting Bayesian Aggregation (BA) mechanism enables principled handling of task ambiguity, which is key for efficiently processing context information. We demonstrate on a range of challenging experiments that BA consistently improves upon the performance of traditional mean aggregation while remaining computationally efficient and fully compatible with existing NP-based models.

1. INTRODUCTION

Estimating statistical relationships between physical quantities from measured data is of central importance in all branches of science and engineering, and devising powerful regression models for this purpose forms a major field of study in statistics and machine learning. When judging representative power, neural networks (NNs) are arguably the most prominent member of the regression toolbox. NNs cope well with large amounts of training data and are computationally efficient at test time. On the downside, standard NN variants do not provide uncertainty estimates over their predictions and tend to overfit on small datasets. Gaussian processes (GPs) may be viewed as complementary to NNs: they provide reliable uncertainty estimates, but in their basic formulation they scale cubically (quadratically) with the number of context data points at training (test) time, which limits their application to tasks with large amounts of data or to high-dimensional problems. Recently, combinations of aspects of NNs and GPs have attracted considerable interest in the scientific community. Indeed, a prominent formulation of probabilistic regression is as a multi-task learning problem formalized in terms of amortized inference in conditional latent variable (CLV) models, which results in NN-based architectures that learn a distribution over target functions. Notable variants are given by the Neural Process (NP) (Garnelo et al., 2018b) and the work of Gordon et al. (2019), which presents a unifying view on a range of related approaches in the language of CLV models. Inspired by this research, we study context aggregation, a central component of such models, and propose a new, fully Bayesian, aggregation mechanism for CLV-based probabilistic regression models.
To transform the information contained in the context data into a latent representation of the target function, current approaches typically employ a mean aggregator and feed the output of this aggregator into a NN to predict a distribution over global latent parameters of the function. Hence, aggregation and latent parameter inference have so far been treated as separate parts of the learning pipeline. Moreover, when using a mean aggregator, every context sample is assumed to carry the same amount of information. Yet, in practice, different input locations have different task ambiguity and, therefore, samples should be assigned different importance in the aggregation process. In contrast, our Bayesian aggregation mechanism treats context aggregation and latent parameter inference as one holistic mechanism, i.e., the aggregation directly yields the distribution over the latent parameters of the target function. Indeed, we formulate context aggregation as Bayesian inference of latent parameters using Gaussian conditioning in the latent space. Compared to existing methods, the resulting aggregator improves the handling of task ambiguity, as it can assign different variance levels to the context samples. This mechanism improves predictive performance, while it remains conceptually simple and introduces only negligible computational overhead. Moreover, our Bayesian aggregator can also be applied to deterministic model variants like the Conditional NP (CNP) (Garnelo et al., 2018a) . In summary, our contributions are (i) a novel Bayesian Aggregation (BA) mechanism for context aggregation in NP-based models for probabilistic regression, (ii) its application to existing CLV architectures as well as to deterministic variants like the CNP, and (iii) an exhaustive experimental evaluation, demonstrating BA's superiority over traditional mean aggregation.

2. RELATED WORK

Prominent approaches to probabilistic regression are Bayesian linear regression and its kernelized counterpart, the Gaussian process (GP) (Rasmussen and Williams, 2005) . The formal correspondence of GPs with infinite-width Bayesian NNs (BNNs) has been established in Neal (1996) and Williams (1996) . A broad range of research aims to overcome the cubic scaling behaviour of GPs with the number of context points, e.g., through sparse GP approximations (Smola and Bartlett, 2001; Lawrence et al., 2002; Snelson and Ghahramani, 2005; Quiñonero-Candela and Rasmussen, 2005) , by deep kernel learning (Wilson et al., 2016) , by approximating the posterior distribution of BNNs (MacKay, 1992; Hinton and van Camp, 1993; Gal and Ghahramani, 2016; Louizos and Welling, 2017) , or, by adaptive Bayesian linear regression, i.e., by performing inference over the last layer of a NN which introduces sparsity through linear combinations of finitely many learned basis functions (Lazaro-Gredilla and Figueiras-Vidal, 2010; Hinton and Salakhutdinov, 2008; Snoek et al., 2012; Calandra et al., 2016) . An in a sense complementary approach aims to increase the data-efficiency of deep architectures by a fully Bayesian treatment of hierarchical latent variable models ("DeepGPs") (Damianou and Lawrence, 2013) . A parallel line of research studies probabilistic regression in the multi-task setting. Here, the goal is to formulate models which are data-efficient on an unseen target task by training them on data from a set of related source tasks. Bardenet et al. (2013) ; Yogatama and Mann (2014), and Golovin et al. (2017) study multi-task formulations of GP-based models. 
More general approaches of this kind employ the meta-learning framework (Schmidhuber, 1987; Thrun and Pratt, 1998; Vilalta and Drissi, 2005) , where a model's training procedure is formulated in a way which incentivizes it to learn how to solve unseen tasks rapidly with only a few context examples ("learning to learn", "few-shot learning" (Fei-Fei et al., 2006; Lake et al., 2011)) . A range of such methods trains a meta-learner to learn how to adjust the parameters of the learner's model (Bengio et al., 1991; Schmidhuber, 1992) , an approach which has recently been applied to few-shot image classification (Ravi and Larochelle, 2017) , or to learning data-efficient optimization algorithms (Hochreiter et al., 2001; Li and Malik, 2016; Andrychowicz et al., 2016; Chen et al., 2017; Perrone et al., 2018; Volpp et al., 2019) . Other branches of meta-learning research aim to learn similarity metrics to determine the relevance of context samples for the target task (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2017) , or explore the application of memory-augmented neural networks for meta-learning (Santoro et al., 2016) . Finn et al. (2017) propose model-agnostic meta-learning (MAML), a general framework for fast parameter adaptation in gradient-based learning methods. A successful formulation of probabilistic regression as a few-shot learning problem in a multi-task setting is enabled by recent advances in the area of probabilistic meta-learning methods which allow a quantitative treatment of the uncertainty arising due to task ambiguity, a feature particularly relevant for few-shot learning problems. One line of work specifically studies probabilistic extensions of MAML (Grant et al., 2018; Ravi and Larochelle, 2017; Rusu et al., 2018; Finn et al., 2018; Kim et al., 2018) . 
Further important approaches are based on amortized inference in multi-task CLV models (Heskes, 2000; Bakker and Heskes, 2003; Kingma and Welling, 2013; Rezende et al., 2014; Sohn et al., 2015) , which forms the basis of the Neural Statistician proposed by Edwards and Storkey (2017) and of the NP model family (Garnelo et al., 2018b; Kim et al., 2019; Louizos et al., 2019) . Gordon et al. (2019) present a unifying view on many of the aforementioned probabilistic architectures. Building on the conditional NPs (CNPs) proposed by Garnelo et al. (2018a) , a range of NP-based architectures, such as Garnelo et al. (2018b) and Kim et al. (2019) , consider combinations of deterministic and CLV model architectures. Recently, Gordon et al. (2020) extended CNPs to include translation equivariance in the input space, yielding state-of-the-art predictive performance. In this paper, we also employ a formulation of probabilistic regression in terms of a multi-task CLV model. However, while in previous work the context aggregation mechanism (Zaheer et al., 2017; Wagstaff et al., 2019) was merely viewed as a necessity to consume context sets of variable size, we take inspiration from Becker et al. (2019) and emphasize the fundamental connection of latent parameter inference with context aggregation and, hence, base our model on a novel Bayesian aggregation mechanism.

3. PRELIMINARIES

We present the standard multi-task CLV model which forms the basis for our discussion, along with traditional mean context aggregation (MA), the variational inference (VI) likelihood approximation as employed by the NP model family (Garnelo et al., 2018a; Kim et al., 2019), and an alternative Monte Carlo (MC)-based approximation.

Problem Statement. We frame probabilistic regression as a multi-task learning problem. Let $\mathcal{F}$ denote a family of functions $f_\ell : \mathbb{R}^{d_x} \to \mathbb{R}^{d_y}$ with some form of shared statistical structure. We assume that data sets $\mathcal{D}_\ell \equiv \{(x_{\ell,i}, y_{\ell,i})\}_i$ of evaluations $y_{\ell,i} \equiv f_\ell(x_{\ell,i}) + \varepsilon$ are available from a subset of functions ("tasks") $\{f_\ell\}_{\ell=1}^{L} \subset \mathcal{F}$ with additive Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$. From this data, we aim to learn the posterior predictive distribution $p(y_\ell \mid x_\ell, \mathcal{D}^c_\ell)$ over a (set of) $y_\ell$, given the corresponding (set of) inputs $x_\ell$ as well as a context set $\mathcal{D}^c_\ell \subset \mathcal{D}_\ell$.

The Multi-Task CLV Model. We formalize the multi-task learning problem in terms of a CLV model (Heskes, 2000; Gordon et al., 2019) as shown in Fig. 1. The model employs task-specific global latent variables $z_\ell \in \mathbb{R}^{d_z}$, as well as a task-independent latent variable $\theta$, capturing the statistical structure shared between tasks. To learn $\theta$, we split the data into context sets $\mathcal{D}^c_\ell \equiv \{(x^c_{\ell,n}, y^c_{\ell,n})\}_{n=1}^{N_\ell}$ and target sets $\mathcal{D}^t_\ell \equiv \{(x^t_{\ell,m}, y^t_{\ell,m})\}_{m=1}^{M_\ell}$ and maximize the posterior predictive likelihood function

$$\prod_{\ell=1}^{L} p\left(y^t_{\ell,1:M_\ell} \mid x^t_{\ell,1:M_\ell}, \mathcal{D}^c_\ell, \theta\right) = \prod_{\ell=1}^{L} \int p\left(z_\ell \mid \mathcal{D}^c_\ell, \theta\right) \prod_{m=1}^{M_\ell} p\left(y^t_{\ell,m} \mid z_\ell, x^t_{\ell,m}, \theta\right) \mathrm{d}z_\ell \tag{1}$$

w.r.t. $\theta$. In what follows, we omit task indices $\ell$ to avoid clutter.

Likelihood Approximation. Marginalizing over the task-specific latent variables $z$ is intractable for reasonably complex models, so one has to employ some form of approximation. The NP-family of models (Garnelo et al., 2018b; Kim et al., 2019) uses an approximation of the form

$$\log p\left(y^t_{1:M} \mid x^t_{1:M}, \mathcal{D}^c, \theta\right) \gtrapprox \mathbb{E}_{q_\phi(z \mid \mathcal{D}^c \cup \mathcal{D}^t)} \left[ \sum_{m=1}^{M} \log p\left(y^t_m \mid z, x^t_m, \theta\right) + \log \frac{q_\phi(z \mid \mathcal{D}^c)}{q_\phi(z \mid \mathcal{D}^c \cup \mathcal{D}^t)} \right]. \tag{2}$$

Being derived using a variational approach, this approximation utilizes an approximate posterior distribution $q_\phi(z \mid \mathcal{D}^c) \approx p(z \mid \mathcal{D}^c, \theta)$. Note, however, that it does not constitute a proper evidence lower bound for the posterior predictive likelihood, since the intractable latent posterior $p(z \mid \mathcal{D}^c, \theta)$ has been replaced by $q_\phi(z \mid \mathcal{D}^c)$ in the numerator of the rightmost term (Le et al., 2018). An alternative approximation, employed for instance in Gordon et al. (2019), also replaces the intractable latent posterior distribution by an approximate distribution $q_\phi(z \mid \mathcal{D}^c) \approx p(z \mid \mathcal{D}^c, \theta)$ and uses a Monte Carlo (MC) approximation of the resulting integral based on $K$ latent samples, i.e.,

$$\log p\left(y^t_{1:M} \mid x^t_{1:M}, \mathcal{D}^c, \theta\right) \approx -\log K + \log \sum_{k=1}^{K} \prod_{m=1}^{M} p\left(y^t_m \mid z_k, x^t_m, \theta\right), \quad z_k \sim q_\phi(z \mid \mathcal{D}^c). \tag{3}$$

Note that both approaches employ approximations $q_\phi(z \mid \mathcal{D}^c)$ of the latent posterior distribution $p(z \mid \mathcal{D}^c, \theta)$ and, as indicated by the notation, amortize inference in the sense that one single set of parameters $\phi$ is shared between all context data points. This enables efficient inference at test time, as no per-data-point optimization loops are required. As is standard in the literature (Garnelo et al., 2018b; Kim et al., 2019), we represent $q_\phi(z \mid \mathcal{D}^c)$ and $p(y^t_m \mid z, x^t_m, \theta)$ by NNs and refer to them as the encoder (enc, parameters $\phi$) and decoder (dec, parameters $\theta$) networks, respectively. These networks set the means and variances of factorized Gaussian distributions, i.e.,

$$q_\phi(z \mid \mathcal{D}^c) = \mathcal{N}\left(z \mid \mu_z, \mathrm{diag}(\sigma_z^2)\right), \quad \mu_z = \mathrm{enc}_{\mu_z,\phi}(\mathcal{D}^c), \quad \sigma_z^2 = \mathrm{enc}_{\sigma_z^2,\phi}(\mathcal{D}^c), \tag{4}$$

$$p\left(y^t_m \mid z, x^t_m, \theta\right) = \mathcal{N}\left(y^t_m \mid \mu_y, \mathrm{diag}(\sigma_y^2)\right), \quad \mu_y = \mathrm{dec}_{\mu_y,\theta}(z, x^t_m), \quad \sigma_y^2 = \mathrm{dec}_{\sigma_y^2,\theta}(z, x^t_m). \tag{5}$$

To aggregate the context set, each context tuple is first mapped onto a latent observation $r_n = \mathrm{enc}_{r,\phi}(x^c_n, y^c_n) \in \mathbb{R}^{d_r}$. Then, a permutation-invariant operation is applied to the set $\{r_n\}_{n=1}^{N}$ to obtain an aggregated latent observation $\bar{r}$. One prominent choice, employed for instance in Garnelo et al. (2018a), Kim et al. (2019), and Gordon et al. (2019), is to take the mean, i.e.,

$$\bar{r} = \frac{1}{N} \sum_{n=1}^{N} r_n. \tag{6}$$

Subsequently, $\bar{r}$ is mapped onto the parameters $\mu_z$ and $\sigma_z^2$ of the approximate posterior distribution $q_\phi(z \mid \mathcal{D}^c)$ using additional encoder networks, i.e., $\mu_z = \mathrm{enc}_{\mu_z,\phi}(\bar{r})$ and $\sigma_z^2 = \mathrm{enc}_{\sigma_z^2,\phi}(\bar{r})$. Note that three encoder networks are employed here: (i) $\mathrm{enc}_{r,\phi}$ to map from the context pairs to $r_n$, (ii) $\mathrm{enc}_{\mu_z,\phi}$ to compute $\mu_z$ from the aggregated mean $\bar{r}$, and (iii) $\mathrm{enc}_{\sigma_z^2,\phi}$ to compute the variance $\sigma_z^2$ from $\bar{r}$. In what follows, we refer to this aggregation mechanism as mean aggregation (MA) and to the networks $\mathrm{enc}_{\mu_z,\phi}$ and $\mathrm{enc}_{\sigma_z^2,\phi}$ collectively as "r-to-z-networks".
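The MC approximation Eq. (3) is a log-mean-exp over the K latent samples, which is usually evaluated with the log-sum-exp trick to avoid underflow. The following is a minimal sketch under stated assumptions: the function name `mc_log_likelihood` is hypothetical, and the per-target log-likelihoods log p(y_m | z_k, x_m, θ) are assumed to have been computed beforehand (e.g., by a decoder network) and are passed in as a nested list.

```python
import math
from typing import Sequence


def mc_log_likelihood(log_p: Sequence[Sequence[float]]) -> float:
    """Monte Carlo estimate of the log posterior predictive likelihood.

    log_p[k][m] is assumed to hold log p(y_m | z_k, x_m, theta) for the
    k-th latent sample z_k ~ q(z | D^c) and the m-th target point.
    """
    # Sum over targets m: log of the product over m, one value per sample k.
    totals = [sum(row) for row in log_p]
    # Log-sum-exp over samples k, shifted by the maximum for stability.
    shift = max(totals)
    log_sum = shift + math.log(sum(math.exp(t - shift) for t in totals))
    # Subtract log K to turn the sum over samples into an average.
    return -math.log(len(totals)) + log_sum


# Usage: K = 2 latent samples, M = 3 target points (illustrative numbers).
estimate = mc_log_likelihood([[-1.2, -0.7, -2.0], [-0.9, -1.1, -1.5]])
```

Shifting by the maximum keeps the exponentials in a representable range even when the joint log-likelihood of a target set is strongly negative.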

4. BAYESIAN CONTEXT AGGREGATION

We propose Bayesian Aggregation (BA), a novel context data aggregation technique for CLV models which avoids the detour via an aggregated latent observation r and directly treats the object of interest, namely the latent variable z, as the aggregated quantity. This reflects a central observation for CLV models with global latent variables: context data aggregation and hidden parameter inference are fundamentally the same mechanism. Our key insight is to define a probabilistic observation model p(r|z) for r which depends on z. Given a new latent observation r n = enc r,φ (x c n , y c n ), we can update p(z) by computing the posterior p(z|r n ) = p(r n |z)p(z)/p(r n ). Hence, by formulating context data aggregation as a Bayesian inference problem, we aggregate the information contained in D c directly into the statistical description of z based on first principles.

4.1. BAYESIAN CONTEXT AGGREGATION VIA GAUSSIAN CONDITIONING

Figure 2: Comparison of aggregation mechanisms in CLV models. Dashed lines correspond to learned components of the posterior approximation $q_\phi(z \mid \mathcal{D}^c)$. BA avoids the detour via a mean-aggregated latent observation $\bar{r}$ and aggregates $\mathcal{D}^c$ directly in the statistical description of $z$. This allows incorporating a quantification of the information content of each context tuple $(x^c_n, y^c_n)$ as well as of $z$ into the inference in a principled manner, while MA assigns the same weight to each context tuple.

BA can easily be implemented using a factorized Gaussian observation model of the form

$$p(r_n \mid z) = \mathcal{N}\left(r_n \mid z, \mathrm{diag}(\sigma_{r_n}^2)\right), \quad r_n = \mathrm{enc}_{r,\phi}(x^c_n, y^c_n), \quad \sigma_{r_n}^2 = \mathrm{enc}_{\sigma_r^2,\phi}(x^c_n, y^c_n). \tag{7}$$

Note that, in contrast to standard variational auto-encoders (VAEs) (Kingma and Welling, 2013), we do not learn the mean and variance of a Gaussian distribution, but we learn the latent observation $r_n$ (which can be considered as a sample of $p(z)$) together with the variance $\sigma_{r_n}^2$ of this observation. This architecture allows the application of Gaussian conditioning, while this is difficult for VAEs. Indeed, we impose a factorized Gaussian prior $p_0(z) \equiv \mathcal{N}\left(z \mid \mu_{z,0}, \mathrm{diag}(\sigma_{z,0}^2)\right)$ and arrive at a Gaussian aggregation model which allows deriving the parameters of the posterior distribution $q_\phi(z \mid \mathcal{D}^c)$ in closed form¹ (cf. App. 7.1):

$$\sigma_z^2 = \left[ \left(\sigma_{z,0}^2\right)^{\ominus} + \sum_{n=1}^{N} \left(\sigma_{r_n}^2\right)^{\ominus} \right]^{\ominus}, \quad \mu_z = \mu_{z,0} + \sigma_z^2 \odot \sum_{n=1}^{N} \left(r_n - \mu_{z,0}\right) \oslash \sigma_{r_n}^2. \tag{8}$$

Here, $\ominus$, $\odot$, and $\oslash$ denote element-wise inversion, product, and division, respectively. These equations naturally lend themselves to efficient incremental updates as new context data $(x^c_n, y^c_n)$ arrives, by using the current posterior parameters $\mu_{z,\mathrm{old}}$ and $\sigma_{z,\mathrm{old}}^2$ in place of the prior parameters, i.e.,

$$\sigma_{z,\mathrm{new}}^2 = \left[ \left(\sigma_{z,\mathrm{old}}^2\right)^{\ominus} + \left(\sigma_{r_n}^2\right)^{\ominus} \right]^{\ominus}, \quad \mu_{z,\mathrm{new}} = \mu_{z,\mathrm{old}} + \sigma_{z,\mathrm{new}}^2 \odot \left(r_n - \mu_{z,\mathrm{old}}\right) \oslash \sigma_{r_n}^2. \tag{9}$$

BA employs two encoder networks, $\mathrm{enc}_{r,\phi}$ and $\mathrm{enc}_{\sigma_r^2,\phi}$, mapping context tuples to latent observations and their variances, respectively.
In contrast to MA, it does not require r-to-z-networks, because the set $\{r_n\}_{n=1}^{N}$ is aggregated directly into the statistical description of $z$ by means of Eq. (8), cf. Fig. 2(b). Note that our factorization assumptions avoid the expensive matrix inversions that typically occur in Gaussian conditioning and which are difficult to backpropagate. Using factorized distributions renders BA cheap to evaluate with only marginal computational overhead in comparison to MA. Furthermore, we can easily backpropagate through BA to compute gradients to optimize the parameters of the encoder and decoder networks. As the latent space $z$ is shaped by the encoder network, the factorization assumptions are valid because the network will find a space where these assumptions work well. Note further that BA represents a permutation-invariant operation on $\mathcal{D}^c$.

Discussion. BA includes MA as a special case. Indeed, Eq. (8) reduces to the mean-aggregated latent observation Eq. (6) if we impose a non-informative prior and uniform observation variances $\sigma_{r_n}^2 \equiv 1$.² This observation sheds light on the benefits of a Bayesian treatment of aggregation. MA assigns the same weight $1/N$ to each latent observation $r_n$, independent of the amount of information contained in the corresponding context data tuple $(x^c_n, y^c_n)$, as well as independent of the uncertainty about the current estimation of $z$. Bayesian aggregation remedies both of these limitations: the influence of $r_n$ on the parameters $\mu_{z,\mathrm{old}}$ and $\sigma_{z,\mathrm{old}}^2$ describing the current aggregated state is determined by the relative magnitude of the observation variance $\sigma_{r_n}^2$ and the latent variance $\sigma_{z,\mathrm{old}}^2$, cf. Eq. (9). This emphasizes the central role of the learned observation variances $\sigma_{r_n}^2$: they quantify the amount of information contained in each latent observation $r_n$. BA can therefore handle task ambiguity more efficiently than MA, as the architecture can learn to assign little weight (by predicting high observation variances $\sigma_{r_n}^2$) to context points $(x^c_n, y^c_n)$ located in areas with high task ambiguity, i.e., to points which could have been generated by many of the functions in $\mathcal{F}$. Conversely, in areas with little task ambiguity, i.e., if $(x^c_n, y^c_n)$ contains a lot of information about the underlying function, BA can induce a strong influence on the posterior latent distribution. In contrast, MA has to find ways to propagate such information through the aggregation mechanism by encoding it in the mean-aggregated latent observation $\bar{r}$.

¹ Note that an extended observation model of the form $p(r_n \mid z) = \mathcal{N}\left(r_n \mid z + \mu_{r_n}, \mathrm{diag}(\sigma_{r_n}^2)\right)$, with $\mu_{r_n}$ given by a third encoder output, does not lead to a more expressive aggregation mechanism. Indeed, the resulting posterior variances would stay unchanged and the posterior mean would read $\mu_z = \mu_{z,0} + \sigma_z^2 \odot \sum_{n=1}^{N} \left(r_n - \mu_{r_n} - \mu_{z,0}\right) \oslash \sigma_{r_n}^2$. Therefore, we would just subtract two distinct encoder outputs computed from the same inputs, resulting in exactly the same expressivity, which is why we set $\mu_{r_n} \equiv 0$.

² As motivated above, we consider $\bar{r}$ as the aggregated quantity of MA and the distribution over $z$, described by $\mu_z$ and $\sigma_z^2$, as the aggregated quantity of BA. Note that Eq. (8) does not necessarily generalize $\mu_z$ and $\sigma_z^2$ after nonlinear r-to-z-networks.
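The update rules Eq. (8) and Eq. (9) can be sketched in a few lines. This is an illustrative, dependency-free sketch rather than the published implementation: the function names are hypothetical, vectors are plain Python lists, and the latent observations and their variances are assumed to be given (in the model they would be produced by the two encoder networks).

```python
from typing import List, Sequence, Tuple


def bayesian_aggregation(
    mu_0: Sequence[float],
    var_0: Sequence[float],
    r: Sequence[Sequence[float]],
    var_r: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Batch update: factorized Gaussian conditioning, element-wise per latent dim."""
    mu_z, var_z = [], []
    for j in range(len(mu_0)):
        # Posterior precision = prior precision + sum of observation precisions.
        precision = 1.0 / var_0[j] + sum(1.0 / v_n[j] for v_n in var_r)
        var_j = 1.0 / precision
        # Precision-weighted residuals of the latent observations w.r.t. the prior mean.
        weighted = sum((r_n[j] - mu_0[j]) / v_n[j] for r_n, v_n in zip(r, var_r))
        mu_z.append(mu_0[j] + var_j * weighted)
        var_z.append(var_j)
    return mu_z, var_z


def incremental_update(
    mu_old: Sequence[float],
    var_old: Sequence[float],
    r_n: Sequence[float],
    var_r_n: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Incremental update: the current posterior plays the role of the prior."""
    return bayesian_aggregation(mu_old, var_old, [r_n], [var_r_n])


# Usage with illustrative numbers: d_z = 2, N = 2 latent observations.
prior_mu, prior_var = [0.0, 1.0], [1.0, 2.0]
obs = [[1.0, 0.0], [3.0, 2.0]]
obs_var = [[1.0, 0.5], [0.5, 4.0]]
batch_mu, batch_var = bayesian_aggregation(prior_mu, prior_var, obs, obs_var)

# Processing the same observations one at a time gives the same posterior.
inc_mu, inc_var = prior_mu, prior_var
for r_n, v_n in zip(obs, obs_var):
    inc_mu, inc_var = incremental_update(inc_mu, inc_var, r_n, v_n)
```

With an empty context set the function returns the prior unchanged, and with a very broad prior and unit observation variances the posterior mean approaches the plain average of the latent observations, which is the MA special case discussed above.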

4.2. LIKELIHOOD APPROXIMATION WITH BAYESIAN CONTEXT AGGREGATION

We show that BA is versatile in the sense that it can replace traditional MA in various CLV-based NP architectures as proposed, e.g., in Garnelo et al. (2018b) and Gordon et al. (2019), which employ samples from the approximate latent posterior $q_\phi(z \mid \mathcal{D}^c)$ to approximate the likelihood (as discussed in Sec. 3), as well as in deterministic variants like the CNP (Garnelo et al., 2018a).

Sampling-Based Likelihood Approximations. BA is naturally compatible with both the VI and MC likelihood approximations for CLV models. Indeed, BA defines a Gaussian latent distribution from which we can easily obtain samples $z$ in order to evaluate Eq. (2) or Eq. (3) using the decoder parametrization Eq. (5).

Bayesian Context Aggregation for Conditional Neural Processes. BA motivates a novel, alternative method to approximate the posterior predictive likelihood Eq. (1), resulting in a deterministic loss function which can be efficiently optimized for $\theta$ and $\phi$ in an end-to-end fashion. To this end, we employ a Gaussian approximation of the posterior predictive likelihood of the form

$$p\left(y^t_{1:M} \mid x^t_{1:M}, \mathcal{D}^c, \theta\right) \approx \mathcal{N}\left(y^t_{1:M} \mid \mu_y, \Sigma_y\right). \tag{10}$$

This is inspired by GPs, which also define a Gaussian likelihood. Maximizing this expression yields the optimal solution $\mu_y = \tilde{\mu}_y$, $\Sigma_y = \tilde{\Sigma}_y$, with $\tilde{\mu}_y$ and $\tilde{\Sigma}_y$ being the first and second moments of the true posterior predictive distribution. This result is well known as moment matching, a popular variant of deterministic approximate inference used, e.g., in Deisenroth and Rasmussen (2011) and Becker et al. (2019). $\tilde{\mu}_y$ and $\tilde{\Sigma}_y$ are functions of the moments $\mu_z$ and $\sigma_z^2$ of the latent posterior $p(z \mid \mathcal{D}^c, \theta)$, which motivates the following decoder parametrization:

$$\mu_y = \mathrm{dec}_{\mu_y,\theta}\left(\mu_z, \sigma_z^2, x^t_m\right), \quad \sigma_y^2 = \mathrm{dec}_{\sigma_y^2,\theta}\left(\mu_z, \sigma_z^2, x^t_m\right), \quad \Sigma_y = \mathrm{diag}\left(\sigma_y^2\right).$$

Here, $\mu_z$ and $\sigma_z^2$ are given by the BA Eqs. (8). Note that we define the Gaussian approximation to be factorized w.r.t. individual $y^t_m$, an assumption which simplifies the architecture but could be dropped if a more expressive model was required. This decoder can be interpreted as a "moment matching network", computing the moments of $y$ given the moments of $z$. Indeed, in contrast to decoder networks of CLV-based NP architectures as defined in Eq. (5), it operates on the moments $\mu_z$ and $\sigma_z^2$ of the latent distribution instead of on samples $z$, which allows evaluating this approximation in a deterministic manner. In this sense, the resulting model is akin to the CNP, which defines a deterministic, conditional model with a decoder operating on the mean-aggregated latent observation $\bar{r}$. However, BA-based models trained in this deterministic manner still benefit from BA's ability to accurately quantify latent parameter uncertainty, which yields significantly improved predictive likelihoods. In what follows, we refer to this approximation scheme as direct parameter-based (PB) likelihood optimization.

Discussion. The concrete choice of likelihood approximation or, equivalently, model architecture depends mainly on the intended use-case. Sampling-based models are generally more expressive as they can represent complex, i.e., structured, non-Gaussian, posterior predictive distributions. Moreover, they yield true function samples, while deterministic models only allow approximate function samples through auto-regressive (AR) sampling schemes. Nevertheless, deterministic models exhibit several computational advantages. They yield direct probabilistic predictions in a single forward pass, while the predictions of sampling-based methods are only defined through averages over multiple function samples and hence require multiple forward passes. Likewise, evaluating the MC-based likelihood approximation Eq. (3) during training requires drawing multiple ($K$) latent samples $z$. While the VI likelihood approximation Eq. (2) can be optimized on a single function sample per training step through stochastic gradient descent (Bishop, 2006), it has the disadvantage that it requires feeding target sets $\mathcal{D}^t$ through the encoder, which can impede the training for small context sets $\mathcal{D}^c$, as discussed in detail in App. 7.2.
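Because the PB approximation is a factorized Gaussian over the individual targets, the resulting training objective is a plain sum of univariate Gaussian log-densities. The sketch below illustrates this under stated assumptions: `pb_log_likelihood` is a hypothetical name, and the predicted means and variances are passed in as lists standing in for the outputs of the two decoder networks evaluated on the latent moments and the target inputs.

```python
import math
from typing import Sequence


def pb_log_likelihood(
    y: Sequence[Sequence[float]],
    mu_y: Sequence[Sequence[float]],
    var_y: Sequence[Sequence[float]],
) -> float:
    """Factorized Gaussian log-likelihood, summed over M targets and d_y output dims.

    mu_y[m] and var_y[m] are assumed to be the decoder outputs for target input x_m;
    no latent sampling is involved, so the value is deterministic.
    """
    total = 0.0
    for y_m, mu_m, var_m in zip(y, mu_y, var_y):
        for y_j, mu_j, var_j in zip(y_m, mu_m, var_m):
            total += -0.5 * (math.log(2.0 * math.pi * var_j) + (y_j - mu_j) ** 2 / var_j)
    return total


# Usage: M = 2 targets with d_y = 1 (illustrative numbers).
value = pb_log_likelihood([[0.1], [0.4]], [[0.0], [0.5]], [[0.2], [0.3]])
```

Maximizing this quantity with respect to the encoder and decoder parameters needs a single forward pass per target set, in contrast to the sampling-based estimators.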

5. EXPERIMENTS

We present experiments to compare the performances of BA and of MA in NP-based models. To provide a complete picture, we evaluate all combinations of likelihood approximations (PB/deterministic Eq. (10), VI Eq. (2), MC Eq. (3)) and aggregation methods (BA Eq. (8), MA Eq. (6)), resulting in six different model architectures, cf. Fig. 4 in App. 7.5.2. Two of these architectures correspond to existing members of the NP family: MA + deterministic is equivalent to the CNP (Garnelo et al., 2018a), and MA + VI corresponds to the Latent-Path NP (LP-NP) (Garnelo et al., 2018b), i.e., the NP without a deterministic path. We further evaluate the Attentive Neural Process (ANP) (Kim et al., 2019), which employs a hybrid approach, combining LP-NP with a cross-attention mechanism in a parallel deterministic path, as well as an NP-architecture using MA with a self-attentive (SA) encoder network. Note that BA can also be used in hybrid models like ANP or in combination with SA, an idea we leave for future research. In App. 7.4 we discuss NP-based regression in relation to other methods for (scalable) probabilistic regression.

The performance of NP-based models depends heavily on the encoder and decoder network architectures as well as on the latent space dimensionality $d_z$. To assess the influence of the aggregation mechanism independently from all other confounding factors, we consistently optimize the encoder and decoder network architectures, the latent-space dimensionality $d_z$, as well as the learning rate of the Adam optimizer (Kingma and Ba, 2015), independently for all model architectures and for all experiments using the Optuna (Akiba et al., 2019) framework, cf. App. 7.5.3. If not stated differently, we report performance in terms of the mean posterior predictive log-likelihood over 256 test tasks with 256 data points each, conditioned on context sets containing $N \in \{0, 1, \dots, N_{\max}\}$ data points (cf. App. 7.5.4).
For sampling-based methods (VI, MC, ANP), we report the joint log-likelihood over the test sets using a Monte Carlo approximation with 25 latent samples, cf. App. 7.5.4. We average the resulting log-likelihood values over 10 training runs with different random seeds and report 95% confidence intervals. We publish source code to reproduce the experimental results online.

GP Samples. We evaluate the architectures on synthetic functions drawn from GP priors with different kernels (RBF, weakly periodic, Matern-5/2), as proposed by Gordon et al. (2020), cf. App. 7.5.1. We generate a new batch of functions for each training epoch. The results (Tab. 1) show that BA consistently outperforms MA, independent of the model architecture. In[...] particularly poorly for small context sets, reflecting the intricacies discussed in Sec. 4.2. As expected, the MC approximation yields the best results in terms of predictive performance, as it is more expressive than the deterministic approaches and does not share the problems of the VI approach. As shown in Tab. 2 and Tab. 9, App. 7.6, our proposed PB likelihood approximation is much cheaper to evaluate compared to both sampling-based approaches, which require multiple forward passes per prediction. We further observe that BA tends to require smaller encoder and decoder networks, as it is more efficient at propagating context information to the latent state, as discussed in Sec. 4.1. The hybrid ANP approach is competitive only on the Matern-5/2 function class. Yet, we refer the reader to Tab. 10, App. 7.6, demonstrating that the attention mechanism greatly improves performance in terms of MSE. On the 1D task, all likelihood approximations perform approximately on-par in combination with BA, while MC outperforms both on the more complex 3D task. Fig. 3 compares prediction qualities.

Dynamics of a Furuta Pendulum.
We study BA on a realistic dataset given by the simulated dynamics of a rotary inverted pendulum, better known as the Furuta pendulum (Furuta et al., 1992) , which is a highly non-linear dynamical system, consisting of an actuated arm rotating in the horizontal plane with an attached pendulum rotating freely in the vertical plane, parametrized by two masses, three lengths, and two damping constants. The regression task is defined as the one-step-ahead prediction of the four-dimensional system state with a step-size of ∆t = 0.1 s, as detailed in App. 7.5.1. The results (Tab. 4) show that BA improves predictive performance also on complex, non-synthetic regression tasks with higher-dimensional input-and output spaces. Further, they are consistent with our previous findings regarding the likelihood approximations, with MC being strongest in terms of predictive likelihood, followed by our efficient deterministic alternative PB. 2D Image Completion. We consider a 2D image completion experiment where the inputs x are pixel locations in images showing handwritten digits, and we regress onto the corresponding pixel intensities y, cf. App. 7.6. Interestingly, we found that architectures without deterministic paths were not able to solve this task reliably which is why we only report results for deterministic models. As shown in Tab. 5, BA improves performance in comparison to MA by a large margin. This highlights that BA's ability to quantify the information content of a context tuple is particularly beneficial on this task, as, e.g., pixels in the middle area of the images typically convey more information about the identity of the digit than pixels located near the borders. Self-attentive Encoders. Another interesting baseline for BA is MA, combined with a self-attention (SA) mechanism in the encoder. 
Indeed, similar to BA, SA yields non-uniform weights for the latent observations r n , where a given weight is computed from some form of pairwise spatial relationship with all other latent observations in the context set (cf. App. 7.3 for a detailed discussion). As BA's weight for r n only depends on (x n , y n ) itself, BA is computationally more efficient: SA scales like O(N 2 ) in the number N of context tuples while BA scales like O(N ), and, furthermore, SA does not allow for efficient incremental updates while this is possible for BA, cf. Eq. ( 9). Tab. 6 shows a comparison of BA with MA in combination with various different SA mechanisms in the encoder. We emphasize that we compare against BA in its vanilla form, i.e., BA does not use SA in the encoder. The results show that Laplace SA and dot-product SA do not improve predictive performance compared to vanilla MA, while multihead SA yields significantly better results. Nevertheless, vanilla BA still performs better or at least on-par and is computationally more efficient. While being out of the scope of this work, according to these results, a combination of BA with SA seems promising if computational disadvantages can be accepted in favour of increased predictive performance, cf. App. 7.3.

6. CONCLUSION AND OUTLOOK

We proposed a novel Bayesian Aggregation (BA) method for NP-based models, combining context aggregation and hidden parameter inference in one holistic mechanism which enables efficient handling of task ambiguity. BA is conceptually simple, compatible with existing NP-based model architectures, and consistently improves performance compared to traditional mean aggregation. It introduces only marginal computational overhead, simplifies the architectures in comparison to existing CLV models (no r-to-z-networks), and tends to require less complex encoder and decoder network architectures. Our experiments further demonstrate that the VI likelihood approximation traditionally used to train NP-based models should be abandoned in favor of an MC-based approach, and that our proposed PB likelihood approximation represents an efficient deterministic alternative with strong predictive performance. We believe that a range of existing models, e.g., the ANP or NPs with self-attentive encoders, can benefit from BA, especially when a reliable quantification of uncertainty is crucial. Also, more complex Bayesian aggregation models are conceivable, opening interesting avenues for future research.

We present the derivation of the Bayesian aggregation update equations (Eqs. (8), (9)) in more detail. To foster reproducibility, we describe all experimental settings as well as the hyperparameter optimization procedure used to obtain the results reported in Sec. 5, and publish the source code online. We further provide additional experimental results and visualizations of the predictions of the compared architectures.

7.1. DERIVATION OF THE BAYESIAN AGGREGATION UPDATE EQUATIONS

We derive the full Bayesian aggregation update equations without making any factorization assumptions. We start from a Gaussian observation model of the form

$$p(r_n \mid z) \equiv \mathcal{N}(r_n \mid z, \Sigma_{r_n}), \quad r_n = \mathrm{enc}_{r,\phi}(x_n^c, y_n^c), \quad \Sigma_{r_n} = \mathrm{enc}_{\Sigma_r,\phi}(x_n^c, y_n^c),$$

where $r_n$ and $\Sigma_{r_n}$ are learned by the encoder network. If we impose a Gaussian prior in the latent space, i.e.,

$$p(z) \equiv \mathcal{N}(z \mid \mu_{z,0}, \Sigma_{z,0}), \qquad (13)$$

we arrive at a Gaussian aggregation model which allows us to derive the parameters of the posterior distribution, i.e., of

$$q_\phi(z \mid D^c) = \mathcal{N}(z \mid \mu_z, \Sigma_z), \qquad (14)$$

in closed form using standard Gaussian conditioning (Bishop, 2006):

$$\Sigma_z = \left[ (\Sigma_{z,0})^{-1} + \sum_{n=1}^{N} (\Sigma_{r_n})^{-1} \right]^{-1}, \qquad \mu_z = \mu_{z,0} + \Sigma_z \sum_{n=1}^{N} (\Sigma_{r_n})^{-1} (r_n - \mu_{z,0}).$$

As the latent space $z$ is shaped by the encoder network, the encoder will find a space where the following factorization assumptions work well (given $d_z$ is large enough):

$$\Sigma_{r_n} = \mathrm{diag}(\sigma_{r_n}^2), \quad \sigma_{r_n}^2 = \mathrm{enc}_{\sigma_r^2,\phi}(x_n^c, y_n^c), \quad \Sigma_{z,0} = \mathrm{diag}(\sigma_{z,0}^2).$$

This yields a factorized posterior, i.e., $q_\phi(z \mid D^c) = \mathcal{N}(z \mid \mu_z, \mathrm{diag}(\sigma_z^2))$, with

$$\sigma_z^2 = \left[ (\sigma_{z,0}^2)^{\ominus} + \sum_{n=1}^{N} (\sigma_{r_n}^2)^{\ominus} \right]^{\ominus}, \qquad \mu_z = \mu_{z,0} + \sigma_z^2 \odot \sum_{n=1}^{N} (r_n - \mu_{z,0}) \oslash \sigma_{r_n}^2.$$

Here $\ominus$, $\odot$, and $\oslash$ denote element-wise inversion, product, and division, respectively. This is the result Eq. (8) from the main part of this paper.
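The factorized update can be illustrated with a few lines of plain Python. This is a minimal sketch, not the published implementation: the function names `bayesian_aggregation` and `incremental_update` are hypothetical, vectors are plain lists, and the sequential form is the standard one-observation Gaussian update, written here under the assumption that it corresponds to the incremental rule referenced as Eq. (9).

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def bayesian_aggregation(
    r: Sequence[Sequence[float]],
    var_r: Sequence[Sequence[float]],
    mu_0: Sequence[float],
    var_0: Sequence[float],
) -> Tuple[Vector, Vector]:
    """Factorized Gaussian posterior over z from all latent observations at once."""
    d_z = len(mu_0)
    # Posterior precision: prior precision plus the sum of observation precisions.
    precision = [1.0 / var_0[k] + sum(1.0 / v[k] for v in var_r) for k in range(d_z)]
    var_z = [1.0 / p for p in precision]
    # Posterior mean: prior mean plus the precision-weighted residuals.
    mu_z = [
        mu_0[k] + var_z[k] * sum((r_n[k] - mu_0[k]) / v_n[k] for r_n, v_n in zip(r, var_r))
        for k in range(d_z)
    ]
    return mu_z, var_z


def incremental_update(
    mu: Sequence[float],
    var: Sequence[float],
    r_n: Sequence[float],
    var_n: Sequence[float],
) -> Tuple[Vector, Vector]:
    """Fold a single latent observation into the current posterior (assumed form)."""
    new_var = [1.0 / (1.0 / v + 1.0 / vn) for v, vn in zip(var, var_n)]
    new_mu = [m + nv * (rn - m) / vn for m, nv, rn, vn in zip(mu, new_var, r_n, var_n)]
    return new_mu, new_var


# Two context points in a two-dimensional latent space (illustrative numbers).
r = [[1.0, 0.0], [3.0, 2.0]]
var_r = [[1.0, 4.0], [1.0, 4.0]]
mu_z, var_z = bayesian_aggregation(r, var_r, mu_0=[0.0, 0.0], var_0=[1.0, 1.0])
```

Observations with a large variance (second dimension above) pull the mean less strongly away from the prior, which is how the aggregation accounts for the information content of each context tuple. Processing the same observations one at a time with `incremental_update` gives the same posterior as the batch call.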

7.2. DISCUSSION OF VI LIKELIHOOD APPROXIMATION

To highlight the limitations of the VI approximation, we note that decoder networks of models employing the PB or the MC likelihood approximation are provided with the same context information at training and test time: the latent variable (which is passed on to the decoder in the form of latent samples $z$ (for MC) or in the form of parameters $\mu_z$, $\sigma_z^2$ describing the latent distribution (for PB)) is in both cases conditioned only on the context set $D^c$. In contrast, in the variational approximation Eq. (2), the expectation is w.r.t. $q_\phi$, conditioned on the union of the context set $D^c$ and the target set $D^t$. As $D^t$ is not available at test time, this introduces a mismatch between how the model is trained and how it is used at test time. Indeed, the decoder is trained on samples from $q_\phi(z \mid D^c \cup D^t)$ but evaluated on samples from $q_\phi(z \mid D^c)$. This is not a serious problem when the model is evaluated on context sets with sizes large enough to allow accurate approximations of the true latent posterior distribution. Small context sets, however, usually contain too little information to infer $z$ reliably. Consequently, the distributions $q_\phi(z \mid D^c)$ and $q_\phi(z \mid D^c \cup D^t)$ typically differ significantly in this regime. Hence, incentivizing the decoder to yield meaningful predictions on small context sets requires intricate and potentially expensive additional sampling procedures to choose suitable target sets $D^t$ during training. As a corner case, we point out that it is not possible to train the decoder on samples from the latent prior, because the right hand side of Eq. (2) vanishes for $D^c = D^t = \emptyset$.

7.3. SELF-ATTENTIVE ENCODER ARCHITECTURES

Kim et al. (2019) propose to use attention mechanisms to improve the quality of NP-based regression. In general, given a set of key-value pairs $\{(x_n, y_n)\}_{n=1}^{N}$, $x_n \in \mathbb{R}^{d_x}$, $y_n \in \mathbb{R}^{d_y}$, and a query $x^* \in \mathbb{R}^{d_x}$, an attention mechanism $A$ produces a weighted sum of the values, with the weights being computed from the keys and the query:

$$A\left(\{(x_n, y_n)\}_{n=1}^{N}, x^*\right) = \sum_{n=1}^{N} w(x_n, x^*)\, y_n.$$

There are several types of attention mechanisms proposed in the literature (Vaswani et al., 2017), each defining a specific form of the weights. Laplace attention adjusts the weights according to the spatial distance of keys and query:

$$w_L(x_n, x^*) \propto \exp\left(-\|x_n - x^*\|_1\right).$$

Similarly, dot-product attention computes

$$w_{DP}(x_n, x^*) \propto \exp\left(x_n^T x^* / d_x\right).$$

A more complex mechanism is multihead attention, which employs a set of $3H$ learned linear mappings $\{L_h^K\}_{h=1}^{H}$, $\{L_h^V\}_{h=1}^{H}$, $\{L_h^Q\}_{h=1}^{H}$, where $H$ is a hyperparameter. For each $h$, these mappings are applied to keys, values, and queries, respectively. Subsequently, dot-product attention is applied to the set of transformed key-value pairs and the transformed query. The resulting $H$ values are then again combined by a further learned linear mapping $L^O$ to obtain the final result.

Self-attention (SA) is defined by setting the set of queries equal to the set of keys. Therefore, SA produces again a set of $N$ weighted values. Combining SA with an NP-encoder, i.e., applying SA to the set $\{f_x(x_n), r_n\}_{n=1}^{N}$ of inputs $x_n$ and corresponding latent observations $r_n$ (where we also consider a possible nonlinear transformation $f_x$ of the inputs) and subsequently applying MA yields an interesting baseline for our proposed BA. Indeed, similar to BA, SA computes a weighted sum of the latent observations $r_n$. Note, however, that SA weighs each latent observation according to some form of spatial relationship of the corresponding input with all other latent observations in the context set. In contrast, BA's weight for a given latent observation is based only on features computed from the context tuple corresponding to this very latent observation, and it allows incorporating an estimate of the amount of information contained in the context tuple into the aggregation (cf. Sec. 4.1).
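The attention weights and the self-attention baseline described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions: all function names are hypothetical, the proportionality in the weight definitions is resolved by normalizing the weights to sum to one over the keys, and the dot-product score is scaled as written in the text (the scaled dot-product attention of Vaswani et al. (2017) divides by the square root of the key dimension instead).

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
ScoreFn = Callable[[Sequence[Vector], Vector], List[float]]


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize exp(score) over all keys (assumed normalization of the weights)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def laplace_scores(keys: Sequence[Vector], query: Vector) -> List[float]:
    """Negative L1 distance between each key and the query."""
    return [-sum(abs(k - q) for k, q in zip(key, query)) for key in keys]


def dot_product_scores(keys: Sequence[Vector], query: Vector) -> List[float]:
    """Inner product of each key with the query, scaled as in the text."""
    d_x = len(query)
    return [sum(k * q for k, q in zip(key, query)) / d_x for key in keys]


def attend(keys: Sequence[Vector], values: Sequence[Vector], query: Vector,
           score_fn: ScoreFn) -> List[float]:
    """Weighted sum of the values for a single query."""
    weights = softmax(score_fn(keys, query))
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]


def self_attention_then_mean(keys: Sequence[Vector], values: Sequence[Vector],
                             score_fn: ScoreFn) -> List[float]:
    """Self-attention (queries = keys) followed by mean aggregation."""
    attended = [attend(keys, values, q, score_fn) for q in keys]  # N x N score evaluations
    n = len(attended)
    return [sum(a[j] for a in attended) / n for j in range(len(attended[0]))]


keys = [[0.0], [1.0]]
values = [[1.0], [3.0]]
single = attend(keys, values, [0.0], laplace_scores)
aggregated = self_attention_then_mean(keys, values, laplace_scores)
```

The nested evaluation in `self_attention_then_mean` touches every pair of context points, which is the source of the quadratic cost in $N$.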
BA's per-tuple weighting leads to several computational advantages over SA: (i) SA scales quadratically in the number $N$ of context tuples, as it has to be evaluated on all $N^2$ pairs of context tuples. In contrast, BA scales linearly with $N$. (ii) BA allows for efficient incremental updates when context data arrives sequentially (cf. Eq. (9)), while SA does not provide this possibility: it requires storing and encoding the whole context set $D^c$ at once and subsequently aggregating the whole set of resulting (SA-weighted) latent observations. The results in Tab. 6, Sec. 5 show that multihead SA leads to significant improvements in predictive performance compared to vanilla MA. Therefore, a combination of BA with self-attentive encoders seems promising in situations where computational disadvantages can be accepted in favour of increased predictive performance. Note that BA relies on a second encoder output $\sigma_{r_n}^2$ (in addition to the latent observation $r_n$) which assesses the information content in each context tuple $(x_n, y_n)$. As each SA-weighted $r_n$ is informed by the other latent observations in the context set, one would also have to process the set of $\sigma_{r_n}^2$ in a manner consistent with the SA-weighting. We leave such a combination of SA and BA for future research.

Table 7: Comparison of the predictive log-likelihood of NP-based architectures with two simple GP-based baselines, (i) Vanilla GP (optimizes the hyperparameters individually on each target task and ignores the source data) (ii) Multi-task GP (optimizes one set of hyperparameters on all source tasks and uses them without further adaptation on the target tasks). Both GP implementations use RBF kernels. As in the main text, we average performance over context sets with sizes N ∈ {0, ..., 64} for RBF GP and N ∈ {0, ..., 20} for the other experiments. Multi-task GP constitutes the optimal model (assuming it fits the hyperparameters perfectly) for the RBF GP experiment, which explains its superior performance. On the Quadratic 1D experiment, Multi-task GP still performs better than the other methods as this function class shows a relatively low degree of variability. In contrast, on more complex experiments like Quadratic 3D and the Furuta dynamics, none of the GP variants is able to produce meaningful results given the small budget of at most 20 context points, while NP-based methods produce predictions of high quality as they incorporate the source data more efficiently.

We discuss in more detail how NP-based models relate to other existing methods for (scalable) probabilistic regression, such as (multi-task) GPs (Rasmussen and Williams, 2005; Bardenet et al., 2013; Yogatama and Mann, 2014; Golovin et al., 2017), Bayesian neural networks (BNNs) (MacKay, 1992; Gal and Ghahramani, 2016), and DeepGPs (Damianou and Lawrence, 2013). NPs are motivated in Garnelo et al. (2018a; b), Kim et al. (2019), as well as in our Sec. 1, as models which combine the computational efficiency of neural networks with well-calibrated uncertainty estimates (like those of GPs). Indeed, NPs scale linearly in the number $N$ of context and $M$ of target data points, i.e., like $O(N + M)$, while GPs scale like $O(N^3 + M^2)$. Furthermore, NPs are shown to exhibit well-calibrated uncertainty estimates. In this sense, NPs can be counted as members of the family of scalable probabilistic regression methods. A central aspect of NP training which distinguishes NPs from a range of standard methods is that they are trained in a multi-task fashion (cf. Sec. 3). This means that NPs rely on data from a set of related source tasks from which they automatically learn powerful priors and the ability to adapt quickly to unseen target tasks.
This multi-task training procedure of NPs scales linearly in the number L of source tasks, which makes it possible to train these architectures on large amounts of source data. Applying GPs in such a multi-task setting can be challenging, especially for large numbers of source tasks. Similarly, BNNs as well as DeepGPs are in their vanilla forms specifically designed for the single-task setting. Therefore, GPs, BNNs, and DeepGPs are not directly applicable in the NP multi-task setting, which is why they are typically not considered as baselines for NP-based models, as discussed in Kim et al. (2019). The experiments presented in Garnelo et al. (2018a; b) and Kim et al. (2019) focus mainly on evaluating NPs in the context of few-shot probabilistic regression, i.e., on demonstrating the data efficiency of NPs on the target task after training on data from a range of source tasks. In contrast, the application of NPs in situations with large (> 1000) numbers of context/target points per task has to the best of our knowledge not yet been investigated in detail in the literature. Furthermore, it has not been studied how to apply NPs in situations where only a single or very few source tasks are available. The focus of our paper is a clear-cut comparison of the performance of our BA with traditional MA in the context of NP-based models. Therefore, we also consider experiments similar to those presented in Garnelo et al. (2018a; b) and Kim et al. (2019), and leave further comparisons with existing methods for (multi-task) probabilistic regression for future work.
Nevertheless, to illustrate this discussion, we provide two simple GP-based baseline methods: (i) a vanilla GP, which optimizes the hyperparameters on each target task individually and does not use the source data, and (ii) a naive but easily interpretable example of a multi-task GP, which optimizes one set of hyperparameters on all source tasks and uses it for predictions on the target tasks without further adaptation. The results in Tab. 7 show that those GP-based models can only compete with NPs on function classes where the inductive bias given by the kernel functions fits the data well (RBF GP), or on function classes which exhibit a relatively low degree of variability (Quadratic 1D). On more complex function classes, NPs produce predictions of much better quality, as they incorporate the source data more efficiently.

7.5. EXPERIMENTAL DETAILS

We provide details about the data sets as well as about the experimental setup used in our experiments in Sec. 5.

7.5.1. DATA GENERATION

In our experiments, we use several classes of functions to evaluate the architectures under consideration. To generate training data from these function classes, we sample $L$ random tasks (as described in Sec. 5), and $N_{tot}$ random input locations $x$ for each task. For each minibatch of training tasks, we uniformly sample a context set size $N \in \{n_{min}, \ldots, n_{max}\}$ and use a random subset of $N$ data points from each task as context sets $D^c$. The remaining $M = N_{tot} - N$ data points are used as the target sets $D^t$ (cf. App. 7.5.3 for the special case of the VI likelihood approximation). Tab. 8 provides details about the data generation process.

GP Samples. We sample one-dimensional functions $f: \mathbb{R} \to \mathbb{R}$ from GP priors with three different stationary kernel functions as proposed by Gordon et al. (2020). A radial basis function (RBF) kernel with lengthscale $l = 1.0$:

$$k_{RBF}(r) \equiv \exp\left(-0.5 r^2\right).$$

A weakly periodic kernel:

$$k_{WP}(r) \equiv \exp\left(-2 \sin(0.5 r)^2 - 0.125 r^2\right).$$

A Matern-5/2 kernel with lengthscale $l = 0.25$:

$$k_{M5/2}(r) \equiv \left(1 + \frac{\sqrt{5}\, r}{0.25} + \frac{5 r^2}{3 \cdot 0.25^2}\right) \exp\left(-\frac{\sqrt{5}\, r}{0.25}\right).$$

Quadratic Functions. We consider two classes of quadratic functions. The first class $f_{Q,1D}: \mathbb{R} \to \mathbb{R}$ is defined on a one-dimensional domain and parametrized by three parameters $a, b, c \in \mathbb{R}$:

$$f_{Q,1D}(x) \equiv a^2 (x + b)^2 + c.$$

The second class $f_{Q,3D}: \mathbb{R}^3 \to \mathbb{R}$ is defined on a three-dimensional domain and also parametrized by three parameters $a, b, c \in \mathbb{R}$:

$$f_{Q,3D}(x_1, x_2, x_3) \equiv 0.5 a \left(x_1^2 + x_2^2 + x_3^2\right) + b (x_1 + x_2 + x_3) + 3c.$$

This function class was proposed in Perrone et al. (2018). For both function classes we add Gaussian noise with standard deviation $\sigma_n$ to the evaluations, cf. Tab. 8.

Furuta Pendulum Dynamics. We consider a function class obtained by integrating the non-linear equations of motion governing the dynamics of a Furuta pendulum (Furuta et al., 1992; Cazzolato and Prime, 2011) for a time span of $\Delta t = 0.1$ s.
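The kernel and function definitions above translate directly into code. The following is a minimal sketch under stated assumptions: all function names are hypothetical, $r$ is taken to be the non-negative distance between two inputs, and the context/target split is one plausible reading of the sampling procedure described at the start of this section.

```python
import math
import random
from typing import List, Tuple


def k_rbf(r: float) -> float:
    """RBF kernel with lengthscale 1.0."""
    return math.exp(-0.5 * r ** 2)


def k_weakly_periodic(r: float) -> float:
    """Weakly periodic kernel."""
    return math.exp(-2.0 * math.sin(0.5 * r) ** 2 - 0.125 * r ** 2)


def k_matern52(r: float, lengthscale: float = 0.25) -> float:
    """Matern-5/2 kernel; s**2 / 3 equals 5 r^2 / (3 l^2)."""
    s = math.sqrt(5.0) * r / lengthscale
    return (1.0 + s + s ** 2 / 3.0) * math.exp(-s)


def f_q1d(x: float, a: float, b: float, c: float) -> float:
    """One-dimensional quadratic function class (noise-free)."""
    return a ** 2 * (x + b) ** 2 + c


def f_q3d(x1: float, x2: float, x3: float, a: float, b: float, c: float) -> float:
    """Three-dimensional quadratic function class (noise-free)."""
    return 0.5 * a * (x1 ** 2 + x2 ** 2 + x3 ** 2) + b * (x1 + x2 + x3) + 3.0 * c


def split_context_target(n_tot: int, n_min: int, n_max: int,
                         rng: random.Random) -> Tuple[List[int], List[int]]:
    """Draw N uniformly from {n_min, ..., n_max}; return context and target indices."""
    n = rng.randint(n_min, n_max)  # inclusive on both ends
    indices = list(range(n_tot))
    rng.shuffle(indices)
    return indices[:n], indices[n:]


context_idx, target_idx = split_context_target(n_tot=20, n_min=3, n_max=10,
                                               rng=random.Random(0))
```

All three kernels evaluate to one at $r = 0$ and decay with distance, with the Matern-5/2 kernel decaying fastest because of its short lengthscale.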
More concretely, we consider the mapping

$$\Theta(t) \to \Theta(t + \Delta t) - \Theta(t),$$

where $\Theta = \left[\theta_{arm}(t), \theta_{pend}(t), \dot{\theta}_{arm}(t), \dot{\theta}_{pend}(t)\right]^T$ denotes the four-dimensional vector describing the dynamical state of the Furuta pendulum. The Furuta pendulum is parametrized by seven parameters (two masses, three lengths, two damping constants) as detailed in Tab. 8. During training, we provide $L = 64$ tasks, corresponding to 64 different parameter configurations. We consider the free system and generate noise by applying random torques at each integration time step ($\Delta t_{Euler} = 0.001$ s) to the joints of the arm and pendulum drawn from Gaussian distributions with standard deviations $\sigma_{\tau,arm}$, $\sigma_{\tau,pend}$, respectively.
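The structure of this data generation step (Euler integration over $\Delta t$ with random torques at every step, followed by taking the state difference) can be sketched as follows. This is only an illustration: `toy_dynamics` is a hypothetical stand-in made of two uncoupled damped pendula and is NOT the Furuta equations of motion, which are not reproduced in this appendix; the damping value and function names are likewise assumptions.

```python
import math
import random
from typing import Callable, List, Sequence

State = List[float]
Dynamics = Callable[[Sequence[float], Sequence[float]], State]


def toy_dynamics(state: Sequence[float], torques: Sequence[float]) -> State:
    """Stand-in time derivative of [th_arm, th_pend, om_arm, om_pend] (not Furuta)."""
    th_arm, th_pend, om_arm, om_pend = state
    damping = 0.1  # hypothetical value, for illustration only
    return [
        om_arm,
        om_pend,
        -math.sin(th_arm) - damping * om_arm + torques[0],
        -math.sin(th_pend) - damping * om_pend + torques[1],
    ]


def state_difference(state: Sequence[float], dynamics: Dynamics, rng: random.Random,
                     sigma_tau_arm: float, sigma_tau_pend: float,
                     dt: float = 0.1, dt_euler: float = 0.001) -> State:
    """Integrate for dt with explicit Euler steps and return Theta(t + dt) - Theta(t)."""
    n_steps = round(dt / dt_euler)
    current = list(state)
    for _ in range(n_steps):
        # Fresh random torques on both joints at every integration step.
        torques = (rng.gauss(0.0, sigma_tau_arm), rng.gauss(0.0, sigma_tau_pend))
        derivative = dynamics(current, torques)
        current = [c + dt_euler * d for c, d in zip(current, derivative)]
    return [c - s for c, s in zip(current, state)]


delta = state_difference([0.5, -0.2, 0.0, 0.0], toy_dynamics, random.Random(0),
                         sigma_tau_arm=0.5, sigma_tau_pend=0.5)
```

Swapping `toy_dynamics` for the actual equations of motion, parametrized by the seven physical parameters of a task, would yield one input-output pair of the regression problem per call.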

7.5.2. MODEL ARCHITECTURES

We provide the detailed architectures used for the experiments in Sec. 5 in Fig. 4. For ANP we use multihead cross attention and refer the reader to Kim et al. (2019) for details about the architecture.

7.5.3. HYPERPARAMETERS AND HYPERPARAMETER OPTIMIZATION

To arrive at a fair comparison of our BA with MA, it is imperative to use optimal model architectures for each aggregation method and likelihood approximation under consideration. Therefore, we optimize the number of hidden layers and the number of hidden units per layer of each encoder and decoder MLP (as shown in Fig. 4), individually for each model architecture and each experiment. For the ANP, we also optimize the multihead attention MLPs. We further optimize the latent space dimensionality $d_z$ and the learning rate of the Adam optimizer. For this hyperparameter optimization, we use the Optuna framework (Akiba et al., 2019) with TPE Sampler and Hyperband pruner (Li et al., 2017). We consistently use a minibatch size of 16. Further, we use $S = 10$ latent samples to evaluate the MC likelihood approximation during training. To evaluate the VI likelihood approximation, we sample target set sizes between $N_{tot}$ and $N$ in each training epoch, cf. Tab. 8.

7.5.4. EVALUATION PROCEDURE

To evaluate the performance of the various model architectures we generate $L = 256$ unseen test tasks with target sets $D_\ell^t$ consisting of $M = 256$ data points each and compute the average posterior predictive log-likelihood

$$\frac{1}{L} \frac{1}{M} \sum_{\ell=1}^{L} \log p\left(y_{\ell,1:M}^t \mid x_{\ell,1:M}^t, D_\ell^c, \theta\right),$$

given context sets $D_\ell^c$ of size $N$. Depending on the architecture, we approximate the posterior predictive log-likelihood according to:

• For BA + PB likelihood approximation:

$$\frac{1}{L} \frac{1}{M} \sum_{\ell=1}^{L} \sum_{m=1}^{M} \log p\left(y_{\ell,m}^t \mid x_{\ell,m}^t, \mu_{z,\ell}, \sigma_{z,\ell}^2, \theta\right).$$

• For MA + deterministic loss (= CNP):

$$\frac{1}{L} \frac{1}{M} \sum_{\ell=1}^{L} \sum_{m=1}^{M} \log p\left(y_{\ell,m}^t \mid x_{\ell,m}^t, \bar{r}_\ell, \theta\right).$$

• For architectures employing sampling-based likelihood approximations (VI, MC-LL) we report the joint log-likelihood over all data points in a test set, i.e.,

$$\frac{1}{L} \frac{1}{M} \sum_{\ell=1}^{L} \log \int q_\phi\left(z_\ell \mid D_\ell^c\right) \prod_{m=1}^{M} p\left(y_{\ell,m}^t \mid x_{\ell,m}^t, z_\ell, \theta\right) dz_\ell \qquad (30)$$

$$\approx \frac{1}{L} \frac{1}{M} \sum_{\ell=1}^{L} \log \frac{1}{S} \sum_{s=1}^{S} \prod_{m=1}^{M} p\left(y_{\ell,m}^t \mid x_{\ell,m}^t, z_{\ell,s}, \theta\right)$$

$$= -\frac{1}{M} \log S + \frac{1}{L} \frac{1}{M} \sum_{\ell=1}^{L} \operatorname*{logsumexp}_{s=1}^{S} \left( \sum_{m=1}^{M} \log p\left(y_{\ell,m}^t \mid x_{\ell,m}^t, z_{\ell,s}, \theta\right) \right),$$

where $z_{\ell,s} \sim q_\phi(z \mid D_\ell^c)$. We employ $S = 25$ latent samples.

To compute the log-likelihood values given in tables, we additionally average over various context set sizes $N$ as detailed in the main part of this paper. We report the mean posterior predictive log-likelihood computed in this way w.r.t. 10 training runs with different random seeds together with 95% confidence intervals.

Table 9: Relative evaluation runtimes and numbers of parameters of the optimized network architectures on the GP tasks. The deterministic methods (PB, det.) are much more efficient regarding evaluation runtime, as they require only one forward pass per prediction, while the sampling-based approaches (VI, MC) require multiple forward passes (each corresponding to one latent sample) to compute their predictions. We use S = 25 latent samples, as described in App. 7.5.4. Furthermore, BA tends to require less complex encoder and decoder network architectures compared to MA, because it represents a more efficient mechanism to propagate information from the context set to the latent state.
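The sampling-based estimate in Eq. (30) is usually computed in log space, since the product over $M$ target points underflows quickly. The following is a minimal sketch of that computation; the function names and the nested-list layout `log_p[l][s][m]` (task, latent sample, target point) are assumptions made for illustration.

```python
import math
from typing import Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) via subtraction of the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mc_log_likelihood(log_p: Sequence[Sequence[Sequence[float]]]) -> float:
    """Average joint predictive log-likelihood per target point.

    log_p[l][s][m] holds log p(y_{l,m} | x_{l,m}, z_{l,s}, theta).
    """
    n_tasks = len(log_p)
    n_samples = len(log_p[0])
    n_targets = len(log_p[0][0])
    total = 0.0
    for task in log_p:
        # Joint log-likelihood of all target points under each latent sample.
        joint = [sum(per_point) for per_point in task]
        total += logsumexp(joint)
    return -math.log(n_samples) / n_targets + total / (n_tasks * n_targets)


# One task, two latent samples, two target points (illustrative numbers).
value = mc_log_likelihood([[[-1.0, -2.0], [-1.0, -2.0]]])
```

When all latent samples assign the same likelihood, the estimate reduces to the plain per-point average, here -1.5, which is a convenient sanity check for an implementation.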

7.6. ADDITIONAL EXPERIMENTAL RESULTS

We provide additional experimental results accompanying the experiments presented in Sec. 5:
• Results for relative evaluation runtimes and numbers of parameters of the optimized network architectures on the full GP suite of experiments, cf. Tab. 9.
• The posterior predictive mean squared error on all experiments, cf. Tab. 10.
• The context-size dependent results for the predictive posterior log-likelihood for the 1D and 3D Quadratic experiments, the Furuta dynamics experiment, as well as the 2D image completion experiment, cf.



For ANP, we use original code from https://github.com/deepmind/neural-processes
Source code: https://github.com/boschresearch/bayesian-context-aggregation



Figure 1: Multi-task CLV model with task-specific global latent variables z and a task-independent variable θ describing statistical structure shared between tasks.

Quadratic Functions. We further seek to study the performance of BA with very limited amounts of training data. To this end, we consider two quadratic function classes, each parametrized by three real parameters from which we generate limited numbers L of training tasks. The first function class is defined on a one-dimensional domain, i.e., x ∈ R, and we choose L = 64, while the second function class, as proposed by Perrone et al. (2018), is defined on x ∈ R^3 with L = 128, cf. App. 7.5.1. As shown in Tab. 3, BA again consistently outperforms MA, often by considerable margins, underlining the efficiency of our Bayesian approach to aggregation in the regime of little training data.

Symbol    Description                          Distribution
          Input angular velocities             U(-2π rad/0.5 s, 2π rad/0.5 s)
m_arm     Mass arm                             U(6.0 · 10^-2 kg, 6.0 · 10^-1 kg)
m_pend    Mass pendulum                        U(1.5 · 10^-2 kg, 1.5 · 10^-1 kg)
l_arm     Length arm                           U(5.6 · 10^-2 m, 5.6 · 10^-1 m)
L_arm     Distance joint arm - mass arm        U(1.0 · 10^-1 m, 3.0 · 10^-1 m)
L_pend    Distance joint pend. - mass pend.    U(1.0 · 10^-1 m, 3.0 · 10^-1 m)
b_arm     Damping constant arm                 U(2.0 · 10^-5 Nms, 2.0 · 10^-3 Nms)
b_pend    Damping constant pendulum            U(5.6 · 10^-5 Nms, 5.

Figure 4: Model architectures used for our experiments in Sec. 5 (panels: BA + VI, BA + MC; MA + VI (LP-NP), MA + MC). For the ANP architecture we refer the reader to Kim et al. (2019). Orange rectangles denote MLPs. Blue rectangles denote aggregation operations. Variables in green rectangles are sampled from normal distributions with parameters given by the incoming nodes. To arrive at a fair comparison, we optimize all MLP architectures, the latent space dimensionality d_z, as well as the Adam learning rate, individually for all model architectures and all experiments, cf. App. 7.5.3.

Figure 5: Posterior predictive log-likelihood in dependence of the context set size N for the 1D and 3D Quadratic experiments, the Furuta dynamics experiment as well as the 2D image completion experiment.

Figure 9: Predictions on two instances (dashed lines) of the RBF GP function class, given N = 60 context data points (circles). We plot mean and standard deviation (solid line, shaded area) predictions together with 10 function samples (for deterministic methods we employ AR sampling).

Figure 10: Predictions on two instances (dashed lines) of the Weakly Periodic GP function class, given N = 20 context data points (circles). We plot mean and standard deviation (solid line, shaded area) predictions together with 10 function samples (for deterministic methods we employ AR sampling).

Figure 11: Predictions on two instances (dashed lines) of the Weakly Periodic GP function class, given N = 60 context data points (circles). We plot mean and standard deviation (solid line, shaded area) predictions together with 10 function samples (for deterministic methods we employ AR sampling).

The latent variable z is global in the sense that it depends on the whole context set D^c. Therefore, some form of aggregation mechanism is required to enable the encoder to consume context sets D^c of variable size. To represent a meaningful operation on sets, such an aggregation mechanism has to be invariant to permutations of the context data points. Zaheer et al.

Posterior predictive log-likelihood on functions drawn from GP priors with RBF, weakly periodic, and Matern-5/2 kernels, averaged over context sets with N ∈ {0, 1, . . . , 64} points (table) and in dependence of N (figure). BA consistently outperforms MA, independent of the likelihood approximation, with MC being the most expressive choice. PB represents an efficient, deterministic alternative, while the VI approximation tends to perform worst, in particular for small N .

Posterior predictive log-likelihood on 1D and 3D quadratic functions with limited numbers L of training tasks, averaged over context sets with N ∈ {0, 1, . . . , 20} data points. BA outperforms MA by considerable margins in this regime of little training data.

Predictions on two instances (dashed lines) of the 1D quadratic function class, given N = 3 context data points (circles). We show mean and standard deviation predictions (solid line, shaded area), and 10 function samples (AR samples for deterministic methods). Cf. also App. 7.6.

Interestingly, despite employing a factorized Gaussian approximation, our deterministic PB approximation performs at least on par with the traditional VI approximation which tends to perform

Table 2: Relative evaluation runtimes and #parameters of the optimized network architectures on RBF GP. Also cf. Tab. 9.

Posterior predictive log-likelihood on the dynamics of a Furuta pendulum, averaged over context sets with N ∈ {0, 1, . . . , 20} state transitions. BA performs favorably on this real-world task.

Predictive log-likelihood on a 2D image completion task on MNIST, averaged over N ∈ {0, 1, . . . , 392} context pixels.

Input spaces and parameters used to generate data for training and testing the architectures discussed in the main part of this paper. U (a, b) denotes the uniform distribution on the interval [a, b], and, likewise U {a, a + n} denotes the uniform distribution on the set {a, a + 1, . . . , a + n}.

Posterior predictive mean squared error (MSE) on all experiments presented in this paper. We average over the same context set sizes as used to compute the posterior predictive log-likelihood, cf. Sec. 5, and again use S = 25 latent samples to compute the mean prediction of sampling-based methods. Our BA consistently improves predictive performance compared to MA not only in terms of likelihood (as shown in Sec. 5), but also in terms of MSE. Furthermore, while ANP tends to perform poorly in terms of likelihood (cf. Sec. 5), its MSE is improved greatly by the attention mechanism.

L = 64 0.1447 ± 0.0095 0.1513 ± 0.0091 0.1757 ± 0.0128 0.1833 ± 0.0154 0.1473 ± 0.0107 0.1636 ± 0.0082 0.1330 ± 0.0037

ACKNOWLEDGMENTS

We thank Philipp Becker, Stefan Falkner, and the anonymous reviewers for valuable remarks and discussions which greatly improved this paper.


We plot mean and standard deviation (solid line, shaded area) predictions together with 10 function samples (for deterministic methods we employ AR sampling).

Figure 8: Predictions on two instances (dashed lines) of the RBF GP function class, given N = 20 context data points (circles). We plot mean and standard deviation (solid line, shaded area) predictions together with 10 function samples (for deterministic methods we employ AR sampling).

Figure 12: Predictions on two instances (dashed lines) of the Matern-5/2 GP function class, given N = 20 context data points (circles). We plot mean and standard deviation (solid line, shaded area) predictions together with 10 function samples (for deterministic methods we employ AR sampling).

Figure 13: Predictions on two instances (dashed lines) of the Matern-5/2 GP function class, given N = 60 context data points (circles). We plot mean and standard deviation (solid line, shaded area) predictions together with 10 function samples (for deterministic methods we employ AR sampling).

