ACCURATE BAYESIAN META-LEARNING BY ACCURATE TASK POSTERIOR INFERENCE

Abstract

Bayesian meta-learning (BML) enables fitting expressive generative models to small datasets by incorporating inductive priors learned from a set of related tasks. The Neural Process (NP) is a prominent deep neural network-based BML architecture, which has shown remarkable results in recent years. In its standard formulation, the NP encodes epistemic uncertainty in an amortized, factorized, Gaussian variational (VI) approximation to the BML task posterior (TP), using reparametrized gradients. Prior work studies a range of architectural modifications to boost performance, such as attentive computation paths or improved context aggregation schemes, while the influence of the VI scheme remains under-explored. We aim to bridge this gap by introducing GMM-NP, a novel BML model, which builds on recent work that enables highly accurate, full-covariance Gaussian mixture (GMM) TP approximations by combining VI with natural gradients and trust regions. We show that GMM-NP yields tighter evidence lower bounds, which increases the efficiency of marginal likelihood optimization, leading to improved epistemic uncertainty estimation and accuracy. GMM-NP does not require complex architectural modifications, resulting in a powerful, yet conceptually simple BML model, which outperforms the state of the art on a range of challenging experiments, highlighting its applicability to settings where data is scarce.

1. INTRODUCTION

Driven by algorithmic advances in the field of deep learning (DL) and the availability of increasingly powerful GPU-assisted hardware, the field of machine learning achieved a plethora of impressive results in recent years (Parmar et al., 2018; Radford et al., 2019; Mnih et al., 2015). These were enabled to a large extent by the availability of huge datasets, which enables training expressive deep neural network (DNN) models. In practice, e.g., in industrial settings, such datasets are unfortunately rarely available, rendering standard DL approaches futile. Nevertheless, it is often the case that similar tasks arise repeatedly, such that the number of context examples on a novel target task is typically relatively small, but the joint meta-dataset of examples from all tasks accumulated over time can be massive, s.t. powerful inductive biases can be extracted using meta-learning (Hospedales et al., 2022). While these inductive biases allow restricting predictions to only those compatible with the meta-data, there typically remains epistemic uncertainty due to task ambiguity, as the context data is often not informative enough to identify the target task exactly. Bayesian meta-learning (BML) aims at an accurate quantification of this uncertainty, which is crucial for applications like active learning, Bayesian optimization (Shahriari et al., 2016), model-based reinforcement learning (Chua et al., 2018), robotics (Deisenroth et al., 2011), and in safety-critical scenarios.

Figure 1: Visualization of our GMM-NP model for a $d_z = 2$ dimensional latent space, trained on a meta-dataset of sinusoidal functions with varying amplitudes and phases, after having observed a single context example (red cross, right panel) from an unseen task (black dots, right panel). Left panel: unnormalized task posterior (TP) distribution (contours) and GMM TP approximation with $K = 3$ components (ellipses, mixture weights in %). Right panel: corresponding function samples from our model (blue lines). A single context example leaves much task ambiguity, reflected in a highly correlated, multi-modal TP. Our GMM approximation correctly captures this: predictions are in accordance with (i) the observed data (all samples pass close to the red context example), and with (ii) the learned inductive biases (all samples are sinusoidal), cf. also Fig. 12 in App. A.5.5.

A prominent BML approach is the Neural Process (NP) (Garnelo et al., 2018b), which employs a DNN-based conditional latent variable (CLV) model, in which the Bayesian belief about the target task is encoded in a factorized Gaussian task posterior (TP) approximation, and inference is amortized over tasks using set encoders (Zaheer et al., 2017). This architecture can be optimized efficiently using variational inference (VI) with standard, reparametrized gradients (Kingma & Welling, 2014). A range of modifications, such as adding deterministic, attentive computation paths (Kim et al., 2019), or Bayesian set encoders (Volpp et al., 2021), have been proposed in recent years to improve predictive performance. Interestingly, the VI scheme with an amortized, factorized Gaussian TP, optimized using standard gradients, remains largely unaltered. Yet, it is well known that (i) the factorized Gaussian assumption rarely holds in Bayesian learning (MacKay, 2003; Wilson & Izmailov, 2020), (ii) amortized inference can yield suboptimal posterior approximations (Cremer et al., 2018), and (iii) natural gradients are superior to standard gradients for VI in terms of optimization efficiency and robustness (Khan & Nielsen, 2018).
Building on these insights and on recent advances in VI (Lin et al., 2020; Arenz et al., 2022), we propose GMM-NP, a novel NP-based BML algorithm that employs (i) a full-covariance Gaussian mixture (GMM) TP approximation, optimized in a (ii) non-amortized fashion, using (iii) robust and efficient trust region natural gradient (TRNG)-VI. We demonstrate through extensive empirical evaluations and ablations that our approach yields tighter evidence lower bounds, more efficient model optimization, and, thus, markedly improved predictive performance, outperforming the state-of-the-art both in terms of epistemic uncertainty quantification and accuracy. Notably, GMM-NP does not require complex architectural modifications, which shows that accurate TP inference is crucial for accurate BML, an insight we believe will be valuable for future research.

2. RELATED WORK

Multi-task learning aims to leverage inductive biases learned on a meta-dataset of similar tasks for improved data efficiency on unseen target tasks of similar structure. Notable variants include transfer learning (Zhuang et al., 2020), which refines and combines pre-trained models (Golovin et al., 2017; Krizhevsky et al., 2012), and meta-learning (Schmidhuber, 1987; Thrun & Pratt, 1998; Vilalta & Drissi, 2005; Hospedales et al., 2022), which makes the multi-task setting explicit in the model design by formulating fast adaptation mechanisms in order to learn how to solve tasks with little context data ("few-shot learning"). A plethora of architectures were studied in the literature, including learner networks that adapt model parameters (Bengio et al., 1991; Schmidhuber, 1992; Ravi & Larochelle, 2017), memory-augmented DNNs (Santoro et al., 2016), early instances of Bayesian meta-models (Edwards & Storkey, 2017; Hewitt et al., 2018), and algorithms that make use of learned measures of task similarity (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017). [...] not explicitly model the epistemic uncertainty arising naturally in few-shot settings due to task ambiguity, both model families were extended to fully Bayesian meta-learning (BML) algorithms that explicitly infer the TP based on a CLV formulation (Heskes, 2000; Bakker & Heskes, 2003). Important representatives are Probabilistic MAML (Grant et al., 2018; Finn et al., 2018) and Bayesian MAML (Kim et al., 2018), as well as several NP-based BML approaches that inspire our work.
These include the Standard NP (Garnelo et al., 2018b) , which was extended by attentive computation paths to avoid underfitting (Kim et al., 2019) , or by Bayesian set encoders (Zaheer et al., 2017; Wagstaff et al., 2019; Volpp et al., 2020) for improved handling of task ambiguity, as well as by hierarchical (Wang & Van Hoof, 2020) , bootstrapped (Lee et al., 2020) , or graph-based (Louizos et al., 2019) latent distributions. While the original NP formulation employs an amortized, reparametrized, stochastic gradient VI objective (Kingma & Welling, 2014; Rezende et al., 2014) , Monte-Carlo (MC)-based objective functions were also studied (Gordon et al., 2019; Volpp et al., 2021) . From a more general perspective, VI emerged as a central tool in many areas of probabilistic machine learning, which require tractable approximations of intractable probability distributions, typically arising as the posterior in Bayesian models (Gelman et al., 2004; Koller & Friedman, 2009; Neal, 1996; Wilson & Izmailov, 2020) . While early approaches (Attias, 2000) allow analytic updates, more complex algorithms employ stochastic gradients w.r.t. the variational parameters (Ranganath et al., 2014; Kingma & Welling, 2014; Blundell et al., 2015) . Such approaches are straightforward to implement and computationally efficient for factorized Gaussian variational distributions, but ignore the information geometry of the loss landscape, leading to suboptimal convergence rates (Khan & Nielsen, 2018) . Natural gradient (NG)-VI (Amari, 1998) alleviates this problem and recent work (Hoffman et al., 2013; Winn & Bishop, 2005; Khan & Nielsen, 2018; Khan et al., 2018) successfully applies this idea at scale to complex models, requiring only first-order gradient information (Lin et al., 2019) . 
Further extensions enable NG-VI for structured variational distributions such as mixture models by decomposing the NG update into individual updates per mixture component (Arenz et al., 2018; Lin et al., 2020) which, in combination with trust region (TR) step size control (Abdolmaleki et al., 2015; Arenz et al., 2022) , yields robust and efficient VI algorithms for versatile and highly expressive variational distributions such as Gaussian mixture models (GMMs).

3. PRELIMINARIES

We now briefly recap the TRNG-VI algorithm (Lin et al., 2020; Arenz et al., 2022) as well as the NP model (Garnelo et al., 2018b) , which form the central building blocks of our GMM-NP model.

3.1. TRUST REGION NATURAL GRADIENT VI WITH GAUSSIAN MIXTURE MODELS

Variational Inference. We consider a probability distribution $p(z)$ over a random variable $z \in \mathbb{R}^{d_z}$, which is intractable in the sense that we know it only up to some normalization constant $Z$, i.e., $p(z) = \tilde{p}(z) / Z$ with $Z = \int \tilde{p}(z) \, dz$ and tractable $\tilde{p}(z)$. We seek to approximate $p(z)$ by a tractable distribution $q_\phi(z)$, parametrized by $\phi$. Variational inference (VI) frames this task as the minimization w.r.t. $\phi$ of the reverse Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951)

$$\mathrm{KL}[q_\phi \,\|\, p] \equiv -\mathbb{E}_{q_\phi(z)}\left[\log \frac{\tilde{p}(z)}{q_\phi(z)}\right] + \log Z \equiv -\mathcal{L}(\phi) + \log Z, \tag{1}$$

where we introduced the evidence lower bound (ELBO) $\mathcal{L}(\phi)$. As $Z$ is independent of $\phi$, minimizing the KL divergence is equivalent to maximizing the ELBO.

Natural Gradients. A standard approach employs stochastic, reparametrized gradients w.r.t. $\phi$ (Kingma & Welling, 2014) for optimization. While this is computationally efficient, it ignores the geometry of the statistical manifold defined by the set of probability distributions $q_\phi$, which can lead to suboptimal convergence rates (Khan & Nielsen, 2018). A more efficient solution is to perform updates in the natural gradient (NG) direction, i.e., the direction of steepest ascent w.r.t. the Fisher information metric (Amari, 1998). State-of-the-art approaches estimate the NG from first-order gradients of $\tilde{p}(z)$ by virtue of Stein's lemma (Lin et al., 2019), yielding efficient NG-VI algorithms that scale to complex problems (Khan et al., 2018; Lin et al., 2020; Arenz et al., 2022).

Trust Regions. Selecting appropriate step sizes for updates in $\phi$ can be intricate, which is why Abdolmaleki et al. (2015) propose a (zero-order) algorithm that incorporates a trust region constraint of the form $\mathrm{KL}[q_\phi \,\|\, q_{\phi_{\mathrm{old}}}] \leq \varepsilon$, which restricts the updates in distribution space and can be enforced with manageable computational overhead (a scalar, convex optimization problem in the Lagrangian parameter for the constraint). As shown by Arenz et al. (2022), such trust regions can easily be combined with gradient information, and allow more aggressive updates in comparison to setting the step size directly, while still ensuring robust convergence.

VI with Gaussian Mixture Models. The quality of the approximation depends on the expressiveness of the distribution family $q_\phi$. In settings where $p$ corresponds to the Bayesian posterior of complex latent variable models (MacKay, 2003; Wilson & Izmailov, 2020), simple Gaussian approximations do not yield satisfactory results, as $p$ typically is multimodal. In such cases, Gaussian mixture models (GMMs) are an appealing choice, as they provide cheap sampling, evaluation, and marginalization while allowing expressive approximations (Arenz et al., 2018). However, a naive application of VI is futile because gradients are coupled between GMM components, leading to computationally intractable updates. Fortunately, Arenz et al. (2018) and Lin et al. (2020) show that updating the components and weights individually is possible, while preventing a collapse of the approximation onto a single posterior mode. This leads to two state-of-the-art algorithms for NG-VI with variational GMMs that differ most notably in the way the step sizes for the updates are controlled: iBayes-GMM (Lin et al., 2020), which directly sets step sizes for the updates, and TRNG-VI (Arenz et al., 2022), which employs trust regions for more efficient and robust convergence.

3.2. BAYESIAN META-LEARNING WITH NEURAL PROCESSES

[Figure: graphical model of the multi-task CLV model, with task-global parameters $\theta$, latent task descriptors $z_\ell$, inputs $x_{\ell,n}$, and outputs $y_{\ell,n}$, and plates over $N$ and $L$.]

The Multi-Task Latent Variable Model. We aim to fit a generative model to a meta-dataset $D = D_{1:L}$, consisting of regression tasks $D_\ell = \{x_{\ell,1:N}, y_{\ell,1:N}\}$ with inputs $x_{\ell,n} \in \mathbb{R}^{d_x}$ and corresponding evaluations $y_{\ell,n} \in \mathbb{R}^{d_y}$ of unknown functions $f_\ell$, i.e., $y_{\ell,n} = f_\ell(x_{\ell,n}) + \varepsilon_n$, where $\varepsilon_n$ denotes (possibly heteroskedastic) noise.
Tasks are assumed to share statistical structure as formalized in the multi-task CLV model shown to the right, defining the joint probability distribution

$$p_\theta(y_{1:L,1:N}, z_{1:L} \,|\, x_{1:L,1:N}) = \prod_{\ell=1}^{L} \left[ \prod_{n=1}^{N} p_\theta(y_{\ell,n} \,|\, x_{\ell,n}, z_\ell) \right] p(z_\ell), \tag{2}$$

where $z_\ell \in \mathbb{R}^{d_z}$ denote latent task descriptors and $\theta$ denotes task-global parameters that capture shared statistical structure. Having observed context data $D^c$, [...] where the decoder $\mathrm{dec}^\mu_\theta$ is a DNN with weights $\theta$, and observation noise variance $\sigma_n^2$. As the TP is intractable for this likelihood choice, NP computes a factorized Gaussian approximation $q_\phi(z_* \,|\, D^c_*) \equiv \mathcal{N}(z_* \,|\, \mathrm{enc}^\mu_\phi(D^c_*), \mathrm{diag}(\mathrm{enc}^\sigma_\phi(D^c_*)))$ with deep set encoders (Zaheer et al., 2017; Wagstaff et al., 2019) $\mathrm{enc}^\mu_\phi$, $\mathrm{enc}^\sigma_\phi$, parametrized by $\phi$. The parameters $\Phi \equiv (\theta, \phi)$ are optimized jointly on the meta-data by stochastic gradient ascent on the ELBO $\sum_{\ell=1}^{L} \mathcal{L}_\ell(\Phi)$ w.r.t. the approximate log marginal predictive likelihood defined by

$$\log q_\Phi(y_{\ell,1:N} \,|\, x_{\ell,1:N}, D^c_\ell) \equiv \log \int \prod_n p_\theta(y_{\ell,n} \,|\, x_{\ell,n}, z_\ell) \, q_\phi(z_\ell \,|\, D^c_\ell) \, dz_\ell \tag{4}$$

$$\geq \mathbb{E}_{q_\phi(z_\ell | D_\ell)}\left[ \sum_n \log p_\theta(y_{\ell,n} \,|\, x_{\ell,n}, z_\ell) + \log \frac{q_\phi(z_\ell \,|\, D^c_\ell)}{q_\phi(z_\ell \,|\, D_\ell)} \right] \equiv \mathcal{L}_\ell(\Phi), \tag{5}$$

where $D^c_\ell \subset D_\ell$, and stochastic gradients w.r.t. $\phi$ are estimated using the reparametrization trick (Kingma & Welling, 2014). Note that NP amortizes inference (the variational parameters $\phi$ are shared across tasks) and that it re-uses $q_\phi(z_\ell \,|\, \cdot)$ to compute the variational distribution $q_\phi(z_\ell \,|\, D_\ell)$, taking advantage of its deep set encoder, which allows conditioning it on datasets of arbitrary size.
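The relation between the KL divergence, the ELBO, and the log normalization constant from Sec. 3.1 can be checked numerically on a toy problem. The following is a minimal sketch, not the model used in this work: it assumes a one-dimensional conjugate model (standard normal prior, a single Gaussian observation) so that the posterior and $\log Z$ are available in closed form. All function names are illustrative.

```python
import math

def log_marginal(y, noise_var):
    """log Z for z ~ N(0, 1), y | z ~ N(z, noise_var), i.e. y ~ N(0, 1 + noise_var)."""
    var = 1.0 + noise_var
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * y * y / var

def exact_posterior(y, noise_var):
    """Mean and variance of the exact Gaussian posterior p(z | y)."""
    post_var = 1.0 / (1.0 + 1.0 / noise_var)
    return post_var * y / noise_var, post_var

def elbo(y, noise_var, m, v):
    """Closed-form ELBO for the variational distribution q(z) = N(m, v)."""
    exp_log_lik = (-0.5 * math.log(2 * math.pi * noise_var)
                   - ((y - m) ** 2 + v) / (2 * noise_var))
    exp_log_prior = -0.5 * math.log(2 * math.pi) - (m * m + v) / 2
    entropy = 0.5 * math.log(2 * math.pi * math.e * v)
    return exp_log_lik + exp_log_prior + entropy

def kl_gauss(m_q, v_q, m_p, v_p):
    """KL[N(m_q, v_q) || N(m_p, v_p)] for one-dimensional Gaussians."""
    return 0.5 * (math.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / v_p - 1.0)

y, noise_var = 1.2, 0.3
post_mean, post_var = exact_posterior(y, noise_var)

# A deliberately poor approximation: the ELBO is loose by exactly the KL term.
gap = kl_gauss(0.0, 1.0, post_mean, post_var)
print(log_marginal(y, noise_var), elbo(y, noise_var, 0.0, 1.0) + gap)
```

In this toy setting the ELBO becomes tight exactly when $q$ equals the posterior, which is the mechanism behind the argument that a more accurate TP approximation yields a tighter bound.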

4. BAYESIAN META-LEARNING WITH GMM TASK POSTERIORS

Motivation. Our work is motivated by the observation that the current state-of-the-art approach for training NP-based BML models is suboptimal. Concretely, we identify three interrelated issues with the optimization objective Eq. (5): (I1) Expressivity of the Variational Distribution. $q_\phi$ is a (i) factorized, (ii) unimodal Gaussian distribution, (iii) amortized over tasks. In effect, this parametrization only allows crude approximations of the TP distribution (MacKay, 2003; Cremer et al., 2018). (I2) Optimization of the Variational Parameters. (i) Naive gradients of Eq. (5), ignoring the information geometry of $q_\phi$, with (ii) direct step size control are employed for optimization, yielding brittle convergence at suboptimal rates (Khan & Nielsen, 2018; Arenz et al., 2022). (I3) Optimization of the Model Parameters. Due to the suboptimal VI scheme (I1, I2), the TP approximation is poor, resulting in a loose ELBO Eq. (5). In effect, optimization w.r.t. the model parameters $\theta$ is inefficient, cf. App. A.1.4 for a detailed discussion. Armed with these insights, we develop a novel BML algorithm that is close in spirit to the NP but solves (I1-I3) through TRNG-VI with GMM TP approximations.

Model. Our algorithm builds on the standard multi-task CLV architecture Eq. (2) and retains the likelihood parametrization using a decoder DNN, $\mathrm{dec}^\mu_\theta(x, z)$, as this allows for expressive BML models. Under this parametrization, the log marginal likelihood for a single task reads

$$\log p_\theta(y_{\ell,1:N} \,|\, x_{\ell,1:N}) = \log \int \prod_n p_\theta(y_{\ell,n} \,|\, x_{\ell,n}, z_\ell) \, p(z_\ell) \, dz_\ell \equiv \log Z_\ell(\theta), \tag{6}$$

where $Z_\ell(\theta)$ is the normalization constant of the TP $p_\theta(z_\ell \,|\, D_\ell) = \tilde{p}_\ell(z_\ell) / Z_\ell(\theta)$ with $\tilde{p}_\ell(z_\ell) \equiv \prod_n p_\theta(y_{\ell,n} \,|\, x_{\ell,n}, z_\ell) \, p(z_\ell)$. In contrast to Eq. (4), we do not condition the left hand side on a context set $D^c_\ell$, which yields a tractable integrand $\tilde{p}_\ell(z_\ell)$ that does not require further approximation.
To tackle (I1), we approximate $p_\theta(z_\ell \,|\, D_\ell)$ by an expressive variational GMM of the form

$$q_{\phi_\ell}(z_\ell) \equiv \sum_k w_{\ell,k} \, q_{\phi_\ell}(z_\ell \,|\, k) \equiv \sum_k w_{\ell,k} \, \mathcal{N}(z_\ell \,|\, \mu_{\ell,k}, \Sigma_{\ell,k}), \quad \sum_k w_{\ell,k} = 1, \tag{7}$$

where we train individual GMMs with parameters $\phi_\ell \equiv \{w_{\ell,k}, \mu_{\ell,k}, \Sigma_{\ell,k}\}$, $k \in \{1, \ldots, K\}$ for each task $\ell$, to not impair approximation quality by introducing inaccuracies through amortization.

Update Equations for the Variational Parameters. To ensure efficient and robust optimization of $\phi_\ell$ (I2), we employ TRNG-VI as proposed by Arenz et al. (2022), with the update equations

$$\Sigma_{\ell,k,\mathrm{new}} = \left[ \frac{\eta}{\eta + 1} \Sigma^{-1}_{\ell,k,\mathrm{old}} - \frac{1}{\eta + 1} R_{\ell,k} \right]^{-1},$$

$$\mu_{\ell,k,\mathrm{new}} = \Sigma_{\ell,k,\mathrm{new}} \left[ \frac{\eta}{\eta + 1} \Sigma^{-1}_{\ell,k,\mathrm{old}} \mu_{\ell,k,\mathrm{old}} + \frac{1}{\eta + 1} \left( r_{\ell,k} - R_{\ell,k} \mu_{\ell,k,\mathrm{old}} \right) \right],$$

$$w_{\ell,k,\mathrm{new}} \propto \exp(\rho_{\ell,k}), \tag{8}$$

where $R_{\ell,k}$, $r_{\ell,k}$, and $\rho_{\ell,k}$ are defined as expectations that can be approximated from per-component samples using MC and require at most first-order gradients of $\tilde{p}_\ell(z_\ell)$, which are readily available using standard automatic differentiation software (Abadi et al., 2015; Paszke et al., 2019). Due to space constraints, we move details to App. A.1.1. The optimal value for the trust region parameter $\eta \geq 0$ is defined by a scalar convex optimization problem that can be solved efficiently by a bracketing search, which also ensures positive definiteness of the new covariance matrix $\Sigma_{\ell,k,\mathrm{new}}$.

Figure 2 (caption, beginning missing in the source): [...] from a sinusoidal instance (black). GMM-NP outperforms the baselines, as it accurately quantifies epistemic uncertainty through diverse samples. BA-NP also shows variability in its samples, but does not achieve competitive performance due to its inaccurate TP approximation. ANP and BANP produce essentially deterministic predictions that fail to give reasonable estimates of the predictive distribution. Cf. also Figs. 10, 11 in App. A.5.4.

Updates for the Model Parameters. To optimize the model parameters $\theta$, we decompose the log marginal likelihood $\log Z_\ell(\theta)$ according to Eq. (1) as

$$\log Z_\ell(\theta) = \mathbb{E}_{q_{\phi_\ell}(z_\ell)}\left[ \sum_n \log p_\theta(y_{\ell,n} \,|\, x_{\ell,n}, z_\ell) + \log \frac{p(z_\ell)}{q_{\phi_\ell}(z_\ell \,|\, D_\ell)} \right] + \mathrm{KL}[q_{\phi_\ell}(\cdot) \,\|\, p_\theta(\cdot \,|\, D_\ell)],$$

where the first term on the right hand side is the ELBO w.r.t. $\log Z_\ell(\theta)$, which we denote by $\mathcal{L}(\theta)$. We expect $\mathcal{L}(\theta)$ to be comparably tight, as our inference scheme allows accurate GMM TP approximations $q_{\phi_\ell}$, s.t. the KL term will be small. Consequently, maximization of $Z_\ell(\theta)$ w.r.t. $\theta$ can be performed efficiently by maximization of $\mathcal{L}(\theta)$ (I3), cf. also App. A.1.4. As is standard, we use the Adam optimizer (Kingma & Ba, 2015) to perform updates in $\theta$, with MC gradient estimates from samples $z_{\ell,s} \sim q_{\phi_\ell}(z_\ell)$:

$$\nabla_\theta \mathcal{L}(\theta) \propto \sum_{s,n} \nabla_\theta \log p_\theta(y_{\ell,n} \,|\, x_{\ell,n}, z_{\ell,s}).$$

Meta-Training. The goal of any BML algorithm is to compute accurate predictions with well-calibrated uncertainty estimates according to Eq. (4), based on samples from the approximate TP $q_{\phi_*}(z_* \,|\, D^c_*) \approx p_\theta(z_* \,|\, D^c_*)$, [...] suffice for accurate predictions. To find versatile solutions that work for variable context set sizes, it is necessary to emulate this during meta-training by evaluating gradients for $\theta$ on samples $z_{\ell,s}$ from approximate TPs informed by a range of context set sizes. Standard NPs achieve this by sampling a minibatch of auxiliary subtasks, with a random number of datapoints, from $D_{1:L}$ for each step in the parameters $\Phi$ (cf. Sec. A.3.2). Our algorithm uses a similar approach: starting from a fixed set of randomly initialized variational GMMs $\phi_\ell$, and a randomly initialized model $\theta$, we iterate through the meta-data in minibatches of auxiliary subtasks, and perform one update step in $\phi_\ell$ for all subtasks in the minibatch, according to Eqs. (8), followed by one gradient step in $\theta$. Thus, variational and model parameters evolve jointly in a similar fashion as for standard NP, resulting in a meta-training stage with comparable computational complexity, cf. App. A.5.6. As this approach retains a fixed set of variational GMMs over the whole course of meta-training (one for each auxiliary subtask), we accordingly sample a fixed set of auxiliary subtasks at the beginning of meta-training. We summarize our algorithm in App. A.1, Alg. 1.

Predictions. As our architecture does not amortize inference over tasks and, thus, does not learn a set encoder architecture, the variational GMMs learned during meta-training are not required for predictions on test tasks and can be discarded. To make predictions, we fix the model parameters $\theta$ and fit a new variational GMM $q_{\phi_*}$ to $D^c_*$ by iterating Eqs. (8) until convergence. Afterwards, we can cheaply generate arbitrarily many samples $z_{*,s} \sim q_{\phi_*}(z_*)$, and generate corresponding function samples, evaluated at arbitrary input locations $x_*$, by a single forward pass through the decoder DNN to approximate the predictive distribution according to Eq. (4), cf. App. A.1, Alg. 2.
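The mean and covariance updates in Eqs. (8) can be illustrated for a single one-dimensional component. The sketch below rests on assumptions that are not taken from this text: it reads $r$ and $R$ as the expected gradient and expected Hessian of the log unnormalized target under the current component (the exact definitions are deferred to App. A.1.1), it uses a Gaussian toy target so that both are available in closed form, and it sets $\eta$ by hand instead of solving the trust region problem. Function names are illustrative.

```python
import math

def trng_component_step(mu_old, var_old, r, R, eta):
    """One update of a 1D Gaussian component, following the form of Eqs. (8).

    r, R: assumed expected gradient / Hessian of the log unnormalized target.
    eta >= 0: trust region parameter (larger values give more cautious steps).
    """
    prec_new = eta / (eta + 1.0) / var_old - R / (eta + 1.0)
    if prec_new <= 0.0:
        raise ValueError("eta too small: new variance would not be positive")
    var_new = 1.0 / prec_new
    mu_new = var_new * (eta / (eta + 1.0) * mu_old / var_old
                        + (r - R * mu_old) / (eta + 1.0))
    return mu_new, var_new

def gaussian_target_terms(mu, target_mean, target_var):
    """Expected gradient and Hessian of log N(z | target_mean, target_var) under q."""
    r = -(mu - target_mean) / target_var
    R = -1.0 / target_var
    return r, R

def kl_gauss(m_q, v_q, m_p, v_p):
    """KL[N(m_q, v_q) || N(m_p, v_p)], used to measure the size of a step."""
    return 0.5 * (math.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / v_p - 1.0)

def fit(mu, var, target_mean, target_var, eta, n_steps):
    """Iterate the update with a fixed, hand-chosen eta."""
    for _ in range(n_steps):
        r, R = gaussian_target_terms(mu, target_mean, target_var)
        mu, var = trng_component_step(mu, var, r, R, eta)
    return mu, var

print(fit(-3.0, 4.0, 2.0, 0.5, eta=1.0, n_steps=60))
```

For this toy target, $\eta = 0$ reduces to a single Newton-like step that lands on the target, while larger $\eta$ keeps the new component closer to the old one in KL divergence, which is the quantity the trust region constrains.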

5. EMPIRICAL EVALUATION

Our empirical evaluation aims to study the effect on the predictive performance of (i) our improved TRNG-VI approach as well as of (ii) expressive variational GMM TP approximations in NP-based BML, in (iii) comparison to the state-of-the-art on (iv) a range of practically relevant meta-learning tasks. To this end, we evaluate our GMM-NP architecture on a diverse set of BML experiments, and present comparisons to state-of-the-art BML algorithms, namely the original NP with mean context aggregation (MA-NP) (Garnelo et al., 2018b), the NP with Bayesian context aggregation (BA-NP) (Volpp et al., 2021), the Attentive NP (ANP) (Kim et al., 2019), as well as the Bootstrapping (Attentive) NP (B(A)NP) (Lee et al., 2020). Tab. 1 in App. A.2 gives an overview of the architectural differences of these algorithms. We move details on data generation to App. A.4, and on the baseline implementations to App. A.2. For a fair comparison, we employ a fixed experimental protocol for all datasets and models: we first perform a Bayesian hyperparameter search (HPO) to determine optimal algorithm settings, individually for each model-dataset combination. We then retrain the best model with 8 different random seeds and report the median log marginal predictive likelihood (LMLHD) as well as the median mean squared error (MSE), both in dependence of the context set size. To foster reproducibility, we provide further details on our experimental protocol in App. A.3, the resulting hyperparameters and architecture sizes in App. A.5.7, and publish our source code. Lastly, we include a detailed discussion of limitations and computational resources in App. A.5.6.

5.1. SYNTHETIC DATASETS

We first study two synthetic function classes (Finn et al., 2017; 2018) on which predictions can be easily visualized: (i) sinusoidal functions with varying amplitudes and phases, as well as (ii) a mix of these sinusoidal functions with affine functions with varying slopes and intercepts. Fig. 2 shows that our GMM-NP outperforms all baselines by a large margin over the whole range of context sizes, both in terms of LMLHD and MSE. This indicates that GMM-NP's improved TP approximation indeed yields improved epistemic uncertainty estimation (higher LMLHD). Interestingly, GMM-NP also shows improved accuracy (lower MSE) and, notably, achieves this without any additional architectural modifications like parallel deterministic paths with attention modules. This is particularly pleasing, as the results show that such deterministic paths indeed improve accuracy, but degrade epistemic uncertainty estimation massively: (B)ANP performs worst in terms of LMLHD. This is further substantiated by (i) observing that MA-NP and BA-NP, both of which do not employ deterministic paths, are among the best baselines w.r.t. LMLHD, and (ii) by visualizing model predictions (Figs. 2, 10, 11), demonstrating that (B)ANP compute essentially deterministic function samples that fail to correctly estimate the predictive distribution, while our GMM-NP estimates uncertainty well through variable samples. BNP does not achieve competitive performance, presumably because the bootstrapping approach does not work well for small context sets.

5.2. ABLATION: TASK POSTERIOR INFERENCE

We now demonstrate that GMM-NP's improved performance can indeed be explained by the improved TRNG-VI algorithm with accurate GMM TP approximation. To this end, we compare: (i) BA-NP, i.e., amortized VI with reparametrized gradients and unimodal, factorized Gaussian TP (SGD-VI, diag, $K = 1$), (ii) our GMM-NP, i.e., non-amortized TRNG-VI and full-covariance GMM TP (TRNG-VI, full, $K > 1$), as well as two models employing TRNG-VI, but a unimodal Gaussian TP with (iii) full, and (iv) diagonal covariance. The results are shown in Fig. 3. In addition, we compare (v) an architecture with full-covariance GMM TP, but trained with iBayes-GMM (Lin et al., 2020), i.e., with direct step size control instead of trust regions (Fig. 6, App. A.5.1).

VI Algorithm. Considering the LMLHD metric, we observe a significant performance boost when keeping the traditional factorized Gaussian approximation, but switching from SGD-VI to TRNG-VI, indicating that the standard SGD-VI approach is indeed suboptimal for BML. To study this further, we estimate the looseness of the ELBO (cf. App. A.3.3), i.e., the median (over tasks) value of the KL divergence $\mathrm{KL}[q_{\phi_\ell}(\cdot \,|\, D_\ell) \,\|\, p_\theta(\cdot \,|\, D_\ell)]$ between the approximate and true TPs. We observe that TRNG-VI provides ELBOs that are tighter by at least one order of magnitude in comparison to SGD-VI. As discussed above, this allows for more efficient optimization of the model parameters $\theta$, explaining the performance gain. Lastly, we find that trust regions yield tighter ELBOs than direct step size control and, consequently, improve predictive performance, cf. Fig. 6, App. A.5.1.

Posterior Expressivity. We now study the effect of increasing the expressiveness of the TP approximation. This discussion is supplemented by Fig. 1, where we visualize the TP and its approximation for a $d_z = 2$ dimensional latent space.
First, we observe tighter ELBOs and improved performance when considering full-covariance (but still unimodal) Gaussian TP approximations, and this effect is particularly pronounced for small context sets. This is intuitive, as small context sizes leave a lot of task ambiguity, leading to highly correlated latent dimensions (Fig. 1). If we now switch to multimodal TP approximations, i.e., our full GMM-NP architecture with $K > 1$ components ($K$ optimized by HPO), we observe a further increase in performance, as the multimodality of the true TP can be captured more accurately (Figs. 1, 12). This effect is especially pronounced for the affine-sinusoidal mix, but also present for the purely sinusoidal function class. As more complex function classes exhibit stronger task ambiguity, the TP will likely exhibit multimodal, correlated structure over wider ranges of context sizes, s.t. an accurate TP approximation will be even more important.

5.3. BAYESIAN OPTIMIZATION

One important application area for probabilistic regression models is as the surrogate model of Bayesian optimization (BO), a global black-box optimization algorithm well-known for its sampling efficiency (Shahriari et al., 2016). BO serves as an interesting experiment to benchmark Bayesian models, as it relies on well-calibrated uncertainty estimates in order to trade off exploration against exploitation, which is crucial for efficient optimization. As proposed by Garnelo et al. (2018b), we use Thompson sampling (Russo et al., 2018) as the BO acquisition function and present results on four function classes: (i) 1D functions sampled from Gaussian process (GP) priors with RBF kernels with varying lengthscales and signal variances (Kim et al., 2019), and parametrized versions of the global optimization benchmark functions (ii) Forrester (1D) (Forrester et al., 2008), (iii) Branin (2D) (Picheny et al., 2013), and (iv) Hartmann-3 (3D) (Szego & Dixon, 1978), as proposed by Volpp et al. (2020). In Figs. 4a, 4b, 7, we report the median simple regret, i.e., the difference of the current incumbent value to the function's minimum, over BO iterations. We observe that our GMM-NP model represents a more powerful BO surrogate compared to the baselines, providing further evidence that TRNG-VI with GMM TP approximations yields superior epistemic uncertainty estimates. We provide further results in App. A.5.2, Figs. 7, 8.
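The interplay of Thompson sampling and the simple regret metric can be sketched on a toy problem. This is an illustration under loudly stated assumptions, not the experimental setup: the surrogate here is a hypothetical finite set of candidate functions with exact Bayesian weights rather than an NP model, the objective is noise-free, and all names are illustrative.

```python
import math
import random

def simple_regret(observed_values, f_min):
    """Difference between the incumbent (best value so far) and the true minimum."""
    regrets, incumbent = [], math.inf
    for y in observed_values:
        incumbent = min(incumbent, y)
        regrets.append(incumbent - f_min)
    return regrets

def thompson_sampling(objective, hypotheses, candidates, n_iter, noise_std, rng):
    """Minimize `objective` with a toy surrogate: a posterior over a finite
    hypothesis set (a stand-in for sampling functions from a learned model)."""
    log_w = [0.0] * len(hypotheses)  # uniform prior over hypotheses
    values = []
    for _ in range(n_iter):
        top = max(log_w)
        weights = [math.exp(lw - top) for lw in log_w]
        # Thompson sampling: draw one function sample from the posterior ...
        sampled_f = rng.choices(hypotheses, weights=weights, k=1)[0]
        # ... and evaluate the objective at that sample's minimizer.
        x = min(candidates, key=sampled_f)
        y = objective(x)
        values.append(y)
        # Gaussian-likelihood posterior update with the new observation.
        for i, f in enumerate(hypotheses):
            log_w[i] -= 0.5 * ((y - f(x)) / noise_std) ** 2
    return values

def make_sinusoid(amplitude, phase):
    return lambda x: amplitude * math.sin(x + phase)

hypotheses = [make_sinusoid(a, p) for a in (0.5, 1.0, 1.5) for p in (0.0, 1.0, 2.0)]
objective = hypotheses[4]  # the unknown target is one member of the class
candidates = [0.1 * i for i in range(63)]
f_min = min(objective(x) for x in candidates)

values = thompson_sampling(objective, hypotheses, candidates, 10, 0.1, random.Random(0))
regrets = simple_regret(values, f_min)
print(regrets)
```

The exploration behaviour comes entirely from the spread of the sampled functions, which is why a surrogate with poorly calibrated epistemic uncertainty tends to perform worse in this setting.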

5.4. DYNAMICS MODELING

We further investigate a challenging dynamics modeling problem on a function class obtained by simulating a Furuta pendulum (Furuta et al., 1992), a highly non-linear 4D dynamical system, as proposed by Volpp et al. (2021). The task is to predict the difference of the next system state $x_{\mathrm{next}} \in \mathbb{R}^4$ to the current system state $x \in \mathbb{R}^4$, i.e., we study one-step ahead dynamics predictions $x \mapsto y = \Delta x \equiv x_{\mathrm{next}} - x \in \mathbb{R}^4$. The function class is generated by simulating $L = 64$ episodes of $N = 64$ timesteps each ($\Delta t = 0.1$ s), where for each episode we randomly sample the 7 physical parameters of the pendulum (3 lengths, 2 masses, 2 friction coefficients). The results (Fig. 4c) show that GMM-NP outperforms the baselines in terms of LMLHD by a large margin, demonstrating its applicability to complex dynamics prediction tasks where reliable uncertainty estimates are required, e.g., in robotics applications (Deisenroth et al., 2011). Interestingly, while neither ANP nor BNP can reliably solve this task, BANP performs strongly, reaching GMM-NP's asymptotic performance in terms of LMLHD and yielding even slightly better MSE for small context sets.
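The construction of one-step-ahead regression targets from a simulated episode can be sketched as follows. This is a minimal illustration with a hypothetical helper name; it assumes an episode is given as a list of state vectors and does not reproduce the pendulum simulator or the exact episode layout used in the experiments.

```python
def one_step_dataset(trajectory):
    """Turn a state trajectory into (input, target) pairs with target = x_next - x.

    trajectory: list of states, each a list of floats (4 entries for the 4D system).
    Returns the inputs (all states but the last) and the state differences.
    """
    inputs = trajectory[:-1]
    targets = [
        [nxt - cur for cur, nxt in zip(x, x_next)]
        for x, x_next in zip(trajectory[:-1], trajectory[1:])
    ]
    return inputs, targets

# Tiny made-up trajectory of three 4D states.
episode = [[0, 0, 0, 0], [1, 2, 3, 4], [1.5, 2, 2, 4]]
xs, ys = one_step_dataset(episode)
print(xs, ys)
```

Predicting the state difference rather than the next state keeps the targets centred near zero for small time steps, a common choice in learned dynamics models.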

5.5. IMAGE COMPLETION

To show that our architecture scales to large meta-datasets, we provide results on a 2D image completion experiment on the MNIST database of handwritten digits (LeCun & Cortes, 2010), as proposed by Garnelo et al. (2018b). The task is to predict pixel intensities $y \in \mathbb{R}$ at 2D pixel locations $x \in \mathbb{R}^2$, given a set of context pixels. To obtain a realistic regression task, we add Gaussian noise to each context pixel. The meta-dataset consists of $L = 60000$ images with $N = 784$ pixels each. The results (Fig. 5) are consistent with our previous findings: GMM-NP yields markedly improved performance, outperforming the baselines over the whole range of context sizes. The architectures with deterministic paths ((B)ANP) fail at properly estimating epistemic uncertainties, leading to low LMLHD values, in particular for large context sizes. Figs. 5b, 5c, 9 explain why this is the case: GMM-NP (and also, to some extent, BA-NP) generate meaningful images of high variability, corresponding to well-calibrated uncertainty estimates. In contrast, (B)ANP produce essentially deterministic samples that overfit the noise in the context data. While these samples might appear less blurry than those of GMM-NP and BA-NP, they represent inferior solutions of the regression problem.
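Casting an image as a regression task of the kind described above can be sketched in a few lines. The helper names, the normalization of pixel coordinates to the unit square, and the noise level are assumptions made for illustration; they are not taken from the experimental protocol.

```python
import random

def image_to_regression_task(image):
    """Flatten an H x W image (list of rows) into 2D locations and intensities.

    Pixel coordinates are scaled to [0, 1]^2 (an assumption of this sketch).
    """
    h, w = len(image), len(image[0])
    xs = [(row / (h - 1), col / (w - 1)) for row in range(h) for col in range(w)]
    ys = [float(image[row][col]) for row in range(h) for col in range(w)]
    return xs, ys

def sample_noisy_context(xs, ys, n_context, noise_std, rng):
    """Pick n_context distinct pixels and add Gaussian noise to their intensities."""
    indices = rng.sample(range(len(xs)), n_context)
    return [(xs[i], ys[i] + rng.gauss(0.0, noise_std)) for i in indices]

# A blank 28 x 28 placeholder image stands in for an MNIST digit.
image = [[0.0] * 28 for _ in range(28)]
xs, ys = image_to_regression_task(image)
context = sample_noisy_context(xs, ys, n_context=50, noise_std=0.1, rng=random.Random(0))
print(len(xs), len(context))
```

A 28 x 28 image yields 784 input-output pairs, matching the number of pixels per task stated above.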

6. CONCLUSION AND OUTLOOK

We proposed GMM-NP, a novel BML algorithm inspired by the NP model architecture. Our approach focuses on accurate task posterior inference, a central algorithmic building block that until now has been treated by amortized inference with set encoders optimized using standard, reparametrized gradients. We demonstrate that this approach leads to suboptimal task posterior approximations and, thus, inefficient optimization of model parameters. We apply modern TRNG-VI techniques that enable expressive variational GMMs, which yields tight ELBOs, efficient optimization, and markedly improved predictive performance in terms of both epistemic uncertainty estimation and accuracy. Despite its simplicity, GMM-NP outperforms the state-of-the-art on a range of experiments and demonstrates its applicability in practical settings, i.p., when meta and context data is scarce. This demonstrates that complex architectural extensions, like Bayesian set encoders or deterministic, attentive computation paths are not required -in fact, we observe that deterministic modules degrade epistemic uncertainty estimation. Therefore, we hope that our work inspires further research on accurate task posterior inference as this turns out to suffice for accurate BML.

REPRODUCIBILITY STATEMENT

We took great care to present a fair comparison of our GMM-NP algorithm with the baseline models, with statistically reliable results that can be easily reproduced. In particular, we • clearly state the hyperparameter settings and hyperparameter optimization procedure we used (Sec. A.3.1), • clearly state the generating process for the datasets on which we evaluated our algorithm (Sec. A.4), • concisely define the evaluation metrics we reported (Sec. A.3.3), • made sure to evaluate these metrics on large test and sample sets, as well as on multiple (8) random seeds, s.t., our results carry statistical significance (Sec. A.3), • use source code from the original authors for all baselines (Sec. A.2), • make the source code for our proposed algorithm available online (Sec. A.2).

ETHICS STATEMENT

We do not expect any negative ethical or societal impact of our work. I.p., we did not use sensitive/personal data in our experiments.

A APPENDIX

This appendix provides further details that supplement the main part of our paper.

A.1 ALGORITHMIC DETAILS

In this section we lay out the full set of variational update equations and provide pseudocode for our GMM-NP algorithm.

A.1.1 VARIATIONAL UPDATE EQUATIONS

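We provide the full set of equations required to compute the TRNG update for the variational parameters $\phi_\ell \equiv \{w_{\ell,k}, \mu_{\ell,k}, \Sigma_{\ell,k}\}$, $k \in \{1, \dots, K\}$, parametrizing our GMM TP approximation as

$$q_{\phi_\ell}(z_\ell) \equiv \sum_k w_{\ell,k}\, q_{\phi_\ell}(z_\ell | k) \equiv \sum_k w_{\ell,k}\, \mathcal{N}(z_\ell | \mu_{\ell,k}, \Sigma_{\ell,k}), \qquad \sum_k w_{\ell,k} = 1.$$

The TRNG-VI update equations, as proposed by Arenz et al. (2022), read

$$\Sigma_{\ell,k,\text{new}} = \left[\frac{\eta}{\eta+1}\, \Sigma^{-1}_{\ell,k,\text{old}} - \frac{1}{\eta+1}\, R_{\ell,k}\right]^{-1},$$

$$\mu_{\ell,k,\text{new}} = \Sigma_{\ell,k,\text{new}} \left[\frac{\eta}{\eta+1}\, \Sigma^{-1}_{\ell,k,\text{old}}\, \mu_{\ell,k,\text{old}} + \frac{1}{\eta+1} \left(r_{\ell,k} - R_{\ell,k}\, \mu_{\ell,k,\text{old}}\right)\right],$$

$$w_{\ell,k,\text{new}} \propto \exp\left(\rho_{\ell,k}\right),$$

where $R_{\ell,k}$, $r_{\ell,k}$, and $\rho_{\ell,k}$ are defined as expectations that can be approximated from per-component samples using MC:

$$R_{\ell,k} = \mathbb{E}_{q_{\phi_{\ell,\text{old}}}(z_\ell | k)}\left[\Sigma^{-1}_{\ell,k,\text{old}} \left(z_\ell - \mu_{\ell,k,\text{old}}\right) \nabla^T_{z_\ell} h_{\ell,k}(z_\ell)\right],$$

$$r_{\ell,k} = \mathbb{E}_{q_{\phi_{\ell,\text{old}}}(z_\ell | k)}\left[\nabla_{z_\ell} h_{\ell,k}(z_\ell)\right],$$

$$\rho_{\ell,k} = \mathbb{E}_{q_{\phi_{\ell,\text{old}}}(z_\ell | k)}\left[h_{\ell,k}(z_\ell) - \log q_{\phi_{\ell,\text{old}}}(z_\ell | k)\right].$$

Here, we defined

$$h_{\ell,k}(z_\ell) \equiv \log \tilde{p}_\ell(z_\ell) + \log q_{\phi_{\ell,\text{old}}}(z_\ell | k) - \log q_{\phi_{\ell,\text{old}}}(z_\ell).$$

The optimal value for the Lagrangean parameter $\eta \geq 0$ that enforces the trust region constraint $\mathrm{KL}\left[q_\phi \,\|\, q_{\phi_\text{old}}\right] \leq \varepsilon$ is defined by a scalar convex optimization problem that can be solved efficiently by a bracketing search, which also ensures positive definiteness of the new covariance matrix $\Sigma_{\ell,k,\text{new}}$.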
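To make the structure of the mean and covariance update concrete, the following NumPy sketch implements the update of a single component for a given value of η, together with MC estimators of R and r. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, η is passed in rather than found by the bracketing search, the weight update is omitted, and the symmetrization of the MC estimate of R is an added convenience that the text does not prescribe.

```python
import numpy as np


def estimate_R_r(z, grad_h, mu_old, Sigma_old):
    """MC estimates of R and r from S samples of one component.

    z, grad_h: arrays of shape (S, d) with samples z_s ~ q_old(z | k)
    and the gradients of h evaluated at those samples.
    """
    prec_old = np.linalg.inv(Sigma_old)
    # R = E[ Sigma_old^{-1} (z - mu_old) grad_h(z)^T ]
    R = prec_old @ ((z - mu_old).T @ grad_h) / z.shape[0]
    R = 0.5 * (R + R.T)  # assumption: symmetrize to suppress MC noise
    r = grad_h.mean(axis=0)  # r = E[ grad_h(z) ]
    return R, r


def trng_component_update(mu_old, Sigma_old, R, r, eta):
    """Mean/covariance update of one GMM component for a fixed eta >= 0."""
    prec_old = np.linalg.inv(Sigma_old)
    a = eta / (eta + 1.0)
    b = 1.0 / (eta + 1.0)
    prec_new = a * prec_old - b * R
    Sigma_new = np.linalg.inv(prec_new)
    mu_new = Sigma_new @ (a * prec_old @ mu_old + b * (r - R @ mu_old))
    return mu_new, Sigma_new
```

As a sanity check of the signs: for a single component (K = 1) and a Gaussian target with mean m and precision P, we have h = log p̃ up to a constant, so R = -P and r = -P(µ_old - m). With η = 0 (no trust region) the update then returns exactly µ_new = m and Σ_new = P⁻¹, while η → ∞ leaves the component unchanged.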

A.1.2 GMM INITIALIZATION

We provide details on the initialization of the variational GMMs before meta-training and testing. As we use the same procedure for each task, we drop task indices ℓ to avoid clutter. Given a number K of components for the GMM task posterior (TP) $q_\phi(z)$, Eq. (7), we use a prior $p(z)$ with K components. To initialize the means $\mu_k$, covariances $\Sigma_k$, and mixture weights $w_k$ for $k \in \{1, \dots, K\}$, we use the same simple heuristic as Arenz et al. (2022):
• The means $\mu_k$ are drawn from a $d_z$-dimensional standard Normal distribution,
• The covariances $\Sigma_k$ are initialized as diagonal matrices (with 1 on the diagonal),
• The weights are initialized uniformly as $w_k = 1/K$.
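The initialization heuristic can be sketched in a few lines of NumPy. The function names `init_gmm` and `sample_gmm` are hypothetical and only serve the illustration; the sampler is included to show how the initialized mixture would be used to draw latent samples.

```python
import numpy as np


def init_gmm(num_components, latent_dim, rng):
    """Initialize GMM parameters: N(0, I) means, identity covariances, uniform weights."""
    means = rng.standard_normal((num_components, latent_dim))
    covs = np.tile(np.eye(latent_dim), (num_components, 1, 1))
    weights = np.full(num_components, 1.0 / num_components)
    return weights, means, covs


def sample_gmm(weights, means, covs, num_samples, rng):
    """Draw samples by first picking a component, then sampling its Gaussian."""
    ks = rng.choice(len(weights), size=num_samples, p=weights)
    chol = np.linalg.cholesky(covs)  # batched Cholesky, shape (K, d, d)
    eps = rng.standard_normal((num_samples, means.shape[1]))
    z = means[ks] + np.einsum("sij,sj->si", chol[ks], eps)
    return z, ks
```

For example, `init_gmm(3, 2, np.random.default_rng(0))` yields the starting point for a K = 3 mixture in a two-dimensional latent space, as used in the latent-space visualizations.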

A.1.3 ALGORITHM SUMMARY

We provide pseudocode for the meta-training stage of our GMM-NP algorithm in Alg. 1 and for the prediction stage in Alg. 2.

A.1.4 DISCUSSION OF CONVERGENCE PROPERTIES

Convergence of the ELBO. Our algorithm inherits the convergence guarantee of the variational Bayes algorithm as discussed, e.g., in Bishop (2006). In general, convergence of variational Bayes is independent of the concrete optimization strategy used for (ϕ, θ): as long as both the E-step (step in ϕ) and the M-step (step in θ) increase the ELBO objective (first term in Eq. (9)), the algorithm is guaranteed to converge to a local optimum of the ELBO. While in standard, reparametrized, variational Bayes (as employed by the baseline methods studied in Sec. 5) (ϕ, θ) are optimized jointly using, e.g., Adam (Kingma & Ba, 2015), our method alternates between a step in ϕ using TRNG-VI (Arenz et al., 2022) and a step in θ using Adam. Nevertheless, both steps increase the ELBO, so our algorithm will converge.

Convergence of the Marginal Likelihood. As discussed in Sec. 4, our GMM-NP algorithm is designed to improve the convergence behaviour w.r.t. the marginal likelihood Eq. (9) in comparison to existing NP-based BML approaches. Recall that the convergence guarantee of the classical expectation-maximization (EM) algorithm w.r.t. the marginal likelihood is lost as soon as the E-step becomes intractable, i.e., as soon as the posterior distribution cannot be computed exactly, and, thus, has to be approximated by a variational distribution, cf., e.g., Bishop (2006). This is the case for most models of reasonable complexity, e.g., for the variational autoencoder (Kingma & Welling, 2013) or the NP model family (Garnelo et al., 2018b). Our GMM-NP model is no exception here, as we build on the NP model for which the TP distribution cannot be computed analytically. Convergence of the marginal likelihood when using the ELBO (first term in Eq. (9)) as a surrogate objective is guaranteed if the ELBO is tight after the E-step, which is the setting of the aforementioned EM algorithm and only the case for a perfect TP approximation, i.e., if $\mathrm{KL}(q_\phi(z) \,\|\, p_\theta(z | \mathcal{D}^c)) = 0$, cf. also App. A.3.3. For imperfect approximations, the tightness of the bound is controlled by the variational gap $\mathrm{KL}(q_\phi(z) \,\|\, p_\theta(z | \mathcal{D}^c)) > 0$. A better approximate posterior $q_\phi(z)$ yields a tighter ELBO, which in turn brings us closer to the EM setting, i.e., typically improves convergence. Our GMM-NP algorithm builds exactly on this insight: we use an expressive TP approximation by a full-covariance GMM and a powerful optimizer for ϕ (TRNG-VI; Arenz et al., 2022) to obtain a tighter ELBO than existing BML approaches in order to achieve optimization of the model parameters in a way that efficiently maximizes the marginal likelihood.
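The role of the variational gap in this argument follows from the standard decomposition of the log marginal likelihood. The worked identity below is a sketch in simplified notation: it assumes a generic dataset $\mathcal{D}$, a prior $p(z)$, and a likelihood $p_\theta(\mathcal{D} | z)$, and it holds for any variational distribution $q_\phi(z)$.

```latex
\begin{aligned}
\log p_\theta(\mathcal{D})
&= \mathbb{E}_{q_\phi(z)}\!\left[\log \frac{p_\theta(\mathcal{D}\,|\,z)\,p(z)}{q_\phi(z)}\right]
 + \mathbb{E}_{q_\phi(z)}\!\left[\log \frac{q_\phi(z)}{p_\theta(z\,|\,\mathcal{D})}\right] \\
&= \underbrace{\mathcal{L}(\phi, \theta)}_{\text{ELBO}}
 + \underbrace{\mathrm{KL}\!\left(q_\phi(z)\,\|\,p_\theta(z\,|\,\mathcal{D})\right)}_{\text{variational gap}\;\geq\;0}
\end{aligned}
```

The first line uses Bayes' rule, $p_\theta(\mathcal{D} | z)\,p(z) = p_\theta(z | \mathcal{D})\,p_\theta(\mathcal{D})$, so the two logarithms sum to $\log p_\theta(\mathcal{D})$ for every z. Since the left-hand side does not depend on ϕ, an E-step that increases the ELBO at fixed θ decreases the gap by the same amount, and a step in θ then optimizes a surrogate that is closer to the marginal likelihood.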

A.2 BASELINE ALGORITHMS

Tab. 1 gives an overview of the architectural differences of the BML approaches we compared in our empirical evaluation (Sec. 5).

Table 1: Comparison of state-of-the-art approaches for Bayesian meta-learning (TRNGD = trust region natural gradient descent, RSGD = reparametrized stochastic gradient descent, SGD = stochastic gradient descent, SE = set encoder, MA = mean aggregation, BA = Bayesian aggregation, SA = self attention, CA = cross attention).

| Model | TP Approx. | VI Approach | Amortization | Det. Path |
| --- | --- | --- | --- | --- |
| GMM-NP (ours) | Full-cov. GMM | TRNGD | none | none |
| MA-NP (Garnelo et al., 2018b) | Diag. Gaussian | RSGD | SE + MA | none |
| BA-NP (Volpp et al., 2021) | Diag. Gaussian | RSGD | SE + BA | none |
| BNP (Lee et al., 2020) | Non-parametric | SGD | SE + MA | none |
| ANP (Kim et al., 2019) | Diag. Gaussian | RSGD | SA + SE + MA | CA |
| BANP (Lee et al., 2020) | Non-parametric | SGD | SA + SE + MA | CA |

To compute our results, we consistently use code by the original authors. We also provide source code for our proposed GMM-NP algorithm.

A.3 EXPERIMENTAL PROTOCOL

To foster reproducibility, we provide details on our experimental protocol.

A.3.1 MODEL HYPERPARAMETERS

To arrive at a fair comparison of our GMM-NP model with the baseline approaches, we optimize model hyperparameters individually for each model-dataset combination presented in Sec. 5. Concretely, we perform a Bayesian hyperparameter sweep with 256 trials for each model-dataset combination over the parameters detailed below. For the image completion experiment on MNIST, we employ a grid search with fewer trials to keep the computational effort manageable. For hyperparameters not mentioned below, we consistently use standard settings proposed by the original authors. To implement the hyperparameter search, we use the wandb sweep functionality (Biewald, 2020).

Observation Noise Parametrization. As detailed in Sec. 3.2, all compared models (including our GMM-NP) employ a Gaussian likelihood of the form

$$p_\theta(y | x, z) \equiv \mathcal{N}\left(y \,\middle|\, \mathrm{dec}^\mu_\theta(x, z),\, \mathrm{diag}\left(\sigma_n^2\right)\right),$$

where the mean is computed by a decoder DNN $\mathrm{dec}^\mu_\theta$ receiving the input location x and a latent sample z. However, different parametrizations of the observation noise variance $\sigma_n^2$ are used in the literature. As it is not clear which setting is fairest, we also treat the observation noise parametrization as a hyperparameter. Concretely, for each model-dataset combination, we test the following settings for the observation noise (with individual hyperparameter sweeps) and report the best performing one:
1. $\sigma_n^2 = \sigma^2_{n,\text{true}}$ with $\sigma^2_{n,\text{true}}$ being the true noise variance on the data,
2. $\sigma_n^2 \in \mathbb{R}$ is a single float value, optimized jointly with θ,
3. $\sigma_n^2 = \mathrm{dec}^\sigma_\theta(x)$, i.e., observation noise is parametrized by a second decoder network, optimized jointly with $\mathrm{dec}^\mu_\theta$, but receiving only the input location,
4. $\sigma_n^2 = \mathrm{dec}^\sigma_\theta(x, z)$, i.e., observation noise is parametrized by a second decoder network, optimized jointly with $\mathrm{dec}^\mu_\theta$, and receiving both the input location and the latent sample.

For all compared models, and regardless of the parametrization, we bound the observation noise from below using a softplus transformation s.t. $\sigma_n \geq \sigma_{n,\min} = 0.1$, as proposed by Garnelo et al. (2018b), Kim et al. (2019), and Lee et al. (2020).

DNN Architectures. For all experiments and all baseline models, we use encoder and decoder DNNs with two hidden layers. Likewise, our GMM-NP model uses a decoder DNN with two hidden layers. We optimize the number of hidden units per layer within the bounds {8, . . . , 64}.

Latent Dimensionalities. For baseline models with parametric latent distributions (all except B(A)NP), we optimize the latent dimension $d_z$ within the bounds {1, . . . , 64}. As our GMM-NP algorithm employs full covariance matrices, we restrict the bounds for $d_z$ to {1, . . . , 8} for a fair comparison.

Number of GMM components. For our GMM-NP algorithm, as well as for iBayes-GMM (Lin et al., 2020) used for the comparison in Sec. A.5.1, we optimize the number of GMM components within the bounds {1, . . . , 10}.

Learning Rates and Trust Region Bounds. All algorithms use the Adam optimizer with standard settings to update DNN weights. We optimize the corresponding learning rates on a log-uniform scale within the bounds $[10^{-5}, 10^{-1}]$. We use the same settings to optimize the step size for the GMM updates of the variational parameters of the iBayes-GMM algorithm (Lin et al., 2020) used for the comparison in Sec. A.5.1. As proposed by Arenz et al. (2022), we optimize the Lagrangean parameter η of our GMM-NP algorithm using a bracketing search on the interval $[10^{-3}, 10^{-1}]$.
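The softplus lower bound on the observation noise can be illustrated as follows. This is a minimal sketch under an assumption: it uses the additive form σ_n = σ_min + softplus(raw), which is one common way to realize such a bound; the exact scaling used by the individual baselines may differ, and the names `softplus` and `bounded_noise_std` are hypothetical.

```python
import numpy as np

SIGMA_MIN = 0.1  # lower bound on the observation noise standard deviation


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x))."""
    return np.logaddexp(0.0, x)


def bounded_noise_std(raw, sigma_min=SIGMA_MIN):
    """Map an unconstrained value (learned scalar or decoder output) to sigma_n >= sigma_min."""
    return sigma_min + softplus(raw)
```

Because softplus is non-negative, the returned standard deviation never falls below `sigma_min`, regardless of whether `raw` is a single learned float (setting 2) or the output of a second decoder network (settings 3 and 4).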

A.3.2 AUXILIARY SUBTASK GENERATION FOR META-TRAINING

We describe the procedure to sample auxiliary subtasks during meta-training in more detail, cf. Sec. 4.

Nomenclature. Recall from Sec. 3.2 that we define a meta-task as the set of all available (noisy) evaluations $\mathcal{D}_\ell$, $\ell \in \{1, \dots, L\}$, from an unknown function $f_\ell$ and that each meta-task contains N examples. Thus, a meta-task $\mathcal{D}_\ell$ is all data a BML algorithm has available to learn about $f_\ell$ during meta-training. A subtask of meta-task $\mathcal{D}_\ell$ is defined as an arbitrary subset of $\mathcal{D}_\ell$.

Auxiliary Subtask Sampling. As described in Sec. 4, standard NP meta-training samples auxiliary subtasks from the meta-data for each minibatch step in order to provide the decoder with samples from task posterior approximations informed by a range of context sizes. We use the following standard procedure (Garnelo et al., 2018b; Kim et al., 2019; Lee et al., 2020) to sample auxiliary subtasks to evaluate the optimization objectives of the baseline approaches (e.g., Eq. 5 for standard NP). Given a minibatch $I \subset \{1, \dots, L\}$ of meta-tasks $\mathcal{D}_\ell$, $\ell \in I$, we first sample auxiliary subtasks $\tilde{\mathcal{D}}_\ell$ with a size $\tilde{N}$ drawn uniformly from $\tilde{N} \in \{N_{\min} + 1, \dots, N_{\max}\}$ with $N_{\min} \geq 1$ and $N_{\max} \leq N$. Then, we sample context sets $\tilde{\mathcal{D}}^c_\ell \subset \tilde{\mathcal{D}}_\ell$ of size M, drawn uniformly from $M \in \{1, \dots, \tilde{N}\}$. $\tilde{\mathcal{D}}^c_\ell$ and $\tilde{\mathcal{D}}_\ell$ are then used in Eq. 5 to compute the ELBO objective for the current minibatch.

As described in Sec. 4, our GMM-NP algorithm uses a similar approach: we employ auxiliary subtasks with sizes $\tilde{N}$ drawn uniformly from $\tilde{N} \in \{N_{\min}, \dots, N_{\max}\}$ to evaluate the updates for the variational GMM parameters and the model parameters. Note that our algorithm does not require sampling context sets from the auxiliary subtasks during meta-training. Furthermore, recall that we train one variational GMM for each auxiliary subtask and retain those GMMs over the whole course of meta-training, so we fix a set of $\tilde{L}$ auxiliary subtasks at the beginning of meta-training (in contrast to standard NPs, which sample new subtasks for each minibatch). We use the following settings for $N_{\min}$, $N_{\max}$ in our experiments: $N_{\min} = 1$, $N_{\max} = N$, except for MNIST image completion where we use $N_{\max} = N/2$. Further, we use $\tilde{L} = 32L$, except for MNIST image completion where we use $\tilde{L} = 8$.
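The two sampling schemes described in this section can be sketched with NumPy as follows. All function names are hypothetical, and the way the fixed subtasks are distributed over the meta-tasks (a constant number `per_task` for each meta-task) is an assumption made for illustration; the text only specifies their total number.

```python
import numpy as np


def sample_subtask(x, y, n_min, n_max, rng):
    """Draw a subtask whose size is uniform on {n_min, ..., n_max}, without replacement."""
    size = int(rng.integers(n_min, n_max + 1))  # upper bound is exclusive
    idx = rng.choice(x.shape[0], size=size, replace=False)
    return x[idx], y[idx]


def sample_context(x_sub, y_sub, rng):
    """Baseline-style context set with size uniform on {1, ..., len(subtask)}."""
    m = int(rng.integers(1, x_sub.shape[0] + 1))
    idx = rng.choice(x_sub.shape[0], size=m, replace=False)
    return x_sub[idx], y_sub[idx]


def fix_subtasks(xs, ys, per_task, n_min, n_max, rng):
    """GMM-NP-style: draw all auxiliary subtasks once, before meta-training.

    xs, ys: arrays of shape (L, N, d_x) and (L, N, d_y).
    Assumption: each meta-task contributes `per_task` subtasks.
    """
    subtasks = []
    for x, y in zip(xs, ys):
        for _ in range(per_task):
            subtasks.append(sample_subtask(x, y, n_min, n_max, rng))
    return subtasks
```

In the baseline scheme, `sample_subtask` and `sample_context` would be called anew for every minibatch, whereas `fix_subtasks` is called once so that each subtask can keep its own variational GMM throughout meta-training.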

A.3.3 METRICS

For each model-dataset combination, we retrain the best hyperparameter setting determined according to Sec. A.3.1 with 8 different random seeds used for model initialization, and report the median value together with (5%, 95%) percentiles of the metrics computed according to the formulae provided below. For all experiments (except the MNIST image completion experiment), we evaluate all metrics on L = 256 unseen test tasks $\mathcal{D}_{1:L}$ with $\mathcal{D}_\ell = \{y_{\ell,1:N}, x_{\ell,1:N}\}$ and N = 64, from which we sample context sets $\mathcal{D}^c_\ell \subset \mathcal{D}_\ell$. For the image completion experiment we use L = 1024 and N = 784 (the number of pixels per image). We report the results in dependence of the context set size.

Log Marginal Predictive Likelihood (LMLHD). For a given task ℓ the LMLHD is defined by Eq. (4), which we restate here for convenience:

$$\log q_\theta\left(y_{\ell,1:N} | x_{\ell,1:N}, \mathcal{D}^c_\ell\right) \equiv \log \int \prod_n p_\theta\left(y_{\ell,n} | x_{\ell,n}, z_\ell\right) q\left(z_\ell | \mathcal{D}^c_\ell\right) \mathrm{d}z_\ell.$$

Here, we use the generic notation $q(z_\ell | \mathcal{D}^c_\ell)$ to denote the task posterior (TP) approximation, the concrete form of which depends on the BML model under consideration. As the integral is analytically intractable, we resort to an MC approximation. To this end, we draw S = 1024 samples $z_{\ell,s} \sim q(z_\ell | \mathcal{D}^c_\ell)$ for each task in the test set and compute (Volpp et al., 2021)

$$\log q_\theta\left(y_{\ell,1:N} | x_{\ell,1:N}, \mathcal{D}^c_\ell\right) \equiv \log \int \prod_{n=1}^{N} p_\theta\left(y_{\ell,n} | x_{\ell,n}, z_\ell\right) q\left(z_\ell | \mathcal{D}^c_\ell\right) \mathrm{d}z_\ell \tag{17}$$

$$\approx \log \frac{1}{S} \sum_{s=1}^{S} \prod_{n=1}^{N} p_\theta\left(y_{\ell,n} | x_{\ell,n}, z_{\ell,s}\right) = -\log S + \operatorname*{logsumexp}_{s=1}^{S} \left(\sum_{n=1}^{N} \log p_\theta\left(y_{\ell,n} | x_{\ell,n}, z_{\ell,s}\right)\right),$$

where logsumexp denotes the numerically stable implementation of the function $\log \sum_s \exp(x_s)$, available in any scientific computing framework. We then compute the median of this expression over all tasks of the test set.

Mean Squared Error (MSE). We report the MSE w.r.t. the mean prediction. That is, for a given task ℓ, we again draw S = 1024 samples $z_{\ell,s} \sim q(z_\ell | \mathcal{D}^c_\ell)$ and compute

$$\mathrm{MSE}\left(y_{\ell,1:N}, x_{\ell,1:N}\right) \equiv \frac{1}{N} \sum_{n=1}^{N} \left\lVert \frac{1}{S} \sum_{s=1}^{S} \mathrm{dec}^\mu_\theta\left(x_{\ell,n}, z_{\ell,s}\right) - y_{\ell,n} \right\rVert^2.$$

We then compute the median of this expression over all tasks of the test set.

ELBO Looseness. For a given task ℓ, we define the ELBO looseness as the KL-divergence between the approximate and true task posteriors. According to Eq. (4), this decomposes as

$$\mathrm{KL}\left[q\left(z_\ell | \mathcal{D}_\ell\right) \,\|\, p_\theta\left(z_\ell | \mathcal{D}_\ell\right)\right] = \log q_\theta\left(y_{\ell,1:N} | x_{\ell,1:N}, \mathcal{D}^c_\ell\right) - \mathbb{E}_{q(z_\ell | \mathcal{D}_\ell)}\left[\sum_{n=1}^{N} \log p_\theta\left(y_{\ell,n} | x_{\ell,n}, z_\ell\right) + \log \frac{q\left(z_\ell | \mathcal{D}^c_\ell\right)}{q\left(z_\ell | \mathcal{D}_\ell\right)}\right],$$

with $\log q_\theta(y_{\ell,1:N} | x_{\ell,1:N}, \mathcal{D}^c_\ell)$ defined by Eq. (16). The second term is the ELBO,

$$\mathcal{L}\left(\theta, \mathcal{D}^c_\ell, \mathcal{D}_\ell\right) \equiv \mathbb{E}_{q(z_\ell | \mathcal{D}_\ell)}\left[\sum_{n=1}^{N} \log p_\theta\left(y_{\ell,n} | x_{\ell,n}, z_\ell\right) + \log \frac{q\left(z_\ell | \mathcal{D}^c_\ell\right)}{q\left(z_\ell | \mathcal{D}_\ell\right)}\right],$$

where we made its dependence on both the test set $\mathcal{D}_\ell$ and the context set $\mathcal{D}^c_\ell \subset \mathcal{D}_\ell$ explicit (in contrast to our notation in the main part of this paper). We say the ELBO is tight if its looseness is zero. Then, $\log q_\theta(y_{\ell,1:N} | x_{\ell,1:N}, \mathcal{D}^c_\ell) = \mathcal{L}(\theta, \mathcal{D}^c_\ell, \mathcal{D}_\ell)$, and optimization of the ELBO w.r.t. θ is equivalent to optimization of the LMLHD. For our ablation study (Sec. 5.2), we estimate the looseness of the ELBO by computing the difference of an importance-weighted MC estimate with proposal distribution $q(z_\ell | \mathcal{D}_\ell)$ of the LMLHD and an MC estimate of the ELBO Eq. (23) with S = 1024 samples $z_{\ell,s} \sim q(z_\ell | \mathcal{D}_\ell)$.

A.4 DATA GENERATION

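We provide details on the meta-datasets we use to train the models we compare in Sec. 5. Concretely, we provide
• the dimension $d_x$ of inputs $x_{\ell,n} \in \mathbb{R}^{d_x}$,
• the domain $C \subset \mathbb{R}^{d_x}$ from which we uniformly sample $x_{\ell,n}$,
• the dimension $d_y$ of targets $y_{\ell,n} \in \mathbb{R}^{d_y}$,
• an expression for the function $f_\ell : \mathbb{R}^{d_x} \to \mathbb{R}^{d_y}$, s.t., $y_{\ell,n} = f_\ell(x_{\ell,n}) + \varepsilon_n$,
• the noise standard deviation σ, s.t., $\varepsilon_n \sim \mathcal{N}(0, \sigma^2)$,
• the number L of meta-tasks and the number N of datapoints for each meta-task.
We denote the uniform distribution on $(a, b)^d \subset \mathbb{R}^d$ by $U(a, b)^d$.

Sinusoidal Functions.
• $d_x = 1$
• $C = [-5.0, 5.0]$
• $d_y = 1$
• $f_\ell(x) = A_\ell \sin(x - \phi_\ell)$, $A_\ell \sim U(0.1, 5.0)$, $\phi_\ell \sim U(0.0, \pi)$
• σ = 0.25
• L = 64, N = 16

Mix of Affine and Sinusoidal Functions.
• $d_x = 1$
• $C = [-5.0, 5.0]$
• $d_y = 1$
• $f^1_\ell(x) = a_\ell x + b_\ell$, $a_\ell \sim U(-3.0, 3.0)$, $b_\ell \sim U(-3.0, 3.0)$, $f^2_\ell(x) = A_\ell \sin(x - \phi_\ell)$, $A_\ell \sim U(0.1, 5.0)$, $\phi_\ell \sim U(0.0, \pi)$. $f_\ell$ is given either by $f^1_\ell$ or $f^2_\ell$ with probability 0.5.
• σ = 0.25
• L = 64, N = 16

RBF-GP samples.
• $d_x = 1$
• $C = [-2.0, 2.0]$
• $d_y = 1$
• $f_\ell$ is drawn from a Gaussian process prior with RBF kernel with lengthscale $l_\ell \sim U(0.5, 1.0)$ and signal variance $s_\ell \sim U(0.5, 1.0)$.
• σ = 0.1
• L = 64, N = 16

Forrester 1D.
• $d_x = 1$
• $d_y = 1$
• We use the parametrized Forrester function (Forrester et al., 2008) as defined on https://www.sfu.ca/~ssurjano/forretal08.html.
• σ = 0.25
• L = 64, N = 16

Branin 2D.
• $d_x = 2$
• $d_y = 1$
• We use the definition given on https://www.sfu.ca/~ssurjano/branin.html and apply translations $\tau_\ell \sim U(-0.25, 0.25)^2$ to x, and scale the function values by $s_\ell \sim U(0.75, 1.25)$.
• σ = 0.25
• L = 64, N = 16

Hartmann 3D.
• $d_x = 3$
• $d_y = 1$
• We use the definition given on https://www.sfu.ca/~ssurjano/hart3.html and apply translations $\tau_\ell \sim U(-0.25, 0.25)^3$ to x, and scale the function values by $s_\ell \sim U(0.75, 1.25)$.
• σ = 0.1
• L = 64, N = 16

4D Furuta Dynamics Prediction.
• $d_x = 4$
• $d_y = 4$
• We use the dynamics equations given in Cazzolato & Prime (2011) to simulate episodes, starting from the pendulum balancing in the upright position. The input is the current system state $x \in \mathbb{R}^4$, the target is the difference to the next system state $x_{\text{next}} \in \mathbb{R}^4$, i.e., $y = \Delta x \equiv x_{\text{next}} - x \in \mathbb{R}^4$.
• Noise is generated by random actions on the joints.
• L = 64, N = 64

2D MNIST Image Completion.
• $d_x = 2$
• $d_y = 1$
• We use the MNIST handwritten image database (LeCun & Cortes, 2010). Each image corresponds to one task. The input x is the pixel location, the target y is the pixel intensity.
• σ = 0.25
• L = 60000, N = 784

A.5 FURTHER EXPERIMENTAL RESULTS

We provide further experimental results for the experiments presented in Sec. 5.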

A.5.1 ABLATION: TRUST REGIONS

In Fig. 6 we compare two methods for step size control for natural gradient VI, namely direct step size control as proposed by Lin et al. (2020) and trust region step size control (Arenz et al., 2022), as used by our GMM-NP algorithm. We observe that trust regions lead to more robust optimization of the variational parameters, and, thus, to tighter ELBOs. This allows more efficient optimization of the model parameters, leading to improved predictive performance.

Figure 11: This figure shows the same data as Fig. 10, but for each function sample we also show a band of ±1 standard deviation of the observation noise, as computed by the decoder DNN. GMM-NP quantifies epistemic uncertainty correctly through its task posterior approximation, and thus does not have to rely on the decoder DNN to quantify epistemic uncertainty through the observation noise. In contrast, ANP and BANP fail to produce variable function samples, and have to make up for that by quantifying epistemic uncertainty through the observation noise, which is ineffective. Note also that BANP does not provide predictions for empty context sets.
Figure 12 : Visualization of our GMM-NP model for a d z = 2 dimensional latent space, trained on sinusoidal functions with varying amplitudes and phases, cf. Sec. 5.1. Left panels: unnormalized task posterior distribution (contours) and variational GMM approximation with K = 3 components (ellipses, mixture weights in %). Right panels: corresponding samples from our model (blue lines), when having observed a context data set (red crosses), together with unobserved ground truth data (black dots). The visualizations show that (i) the true task posterior distribution can be highly correlated and multimodal, i.p., for small context sets (panels a,b), (ii) our variational task posterior approximation correctly approximates this distribution, which (iii) leads to expressive predictive distributions that incorporate both the inductive priors learned from the meta-dataset (all samples are sinusoidal in shape) and the additional information contained in the context set (all samples pass close to the context data).

A.5.7 ANALYSIS OF HPO RESULTS

As discussed in Secs. 5 and A.3, we optimized architectural hyperparameters individually for each model-dataset combination presented in our empirical evaluation, in order to arrive at a fair comparison of our GMM-NP with the baseline methods. In Tab. 2, we provide the resulting settings for the latent dimensionality $d_z$, and the number of parameters of the BML models compared in Sec. 5.1. While the number of variational parameters during meta-training is naturally comparatively high for non-amortizing methods such as GMM-NP, we observe that the expressive GMM-NP TP approximation allows comparatively lightweight decoders and small latent dimensions. This is intuitive, as simple TP approximations require (i) large latent dimensions to encode relevant information in the latent space, together with (ii) expressive decoder architectures to transform the simple latent distribution into an expressive predictive distribution. Note further that the variational parameters belonging to different tasks are not coupled in non-amortizing architectures such as ours, which allows trivial parallelization of the variational optimization between tasks, explaining why the computational cost is easily manageable, cf. Sec. A.5.6. Note also that the number of variational parameters one has to store and adapt for GMM-NP to make predictions on unseen test tasks is comparatively small because the variational GMMs learned during meta-training can be discarded as they are not required for predictions at test time.

Table 2: Results of our hyperparameter optimization on the sinusoidal function class and on the mix of affine and sinusoidal functions. We provide the settings for the latent dimensionality $d_z$ and the number of parameters of the BML models compared in Sec. 5.1 (i.e., the number of decoder parameters |θ| as well as the number of encoder / variational parameters |ϕ|). If attentive modules are present, their parameters are counted as being part of the encoder. For our GMM-NP, we also provide the number of GMM components K. Furthermore, as GMM-NP does not amortize TP inference but learns separate variational GMMs for each subtask generated from the meta-dataset (cf. Secs. 4 and A.3.2), we also provide the total number of variational GMM parameters during meta-training. Note that these variational GMMs are decoupled and can be optimized in parallel. Furthermore, they are not required for predictions at test time and can be discarded after meta-training.



https://github.com/ALRhub/gmm_np

Published as a conference paper at ICLR 2023

A.5.5 VISUALIZATION OF LATENT SPACE STRUCTURE

We provide further visualizations similar to Fig. 1 of the task posterior approximation and corresponding function samples of our GMM-NP, when trained on the sinusoidal function class.



Figure 1: Visualization of our GMM-NP model for a $d_z = 2$ dimensional latent space, trained on a meta-dataset of sinusoidal functions with varying amplitudes and phases, after having observed a single context example (red cross, right panel) from an unseen task (black dots, right panel). Left panel: unnormalized task posterior (TP) distribution (contours) and GMM TP approximation with K = 3 components (ellipses, mixture weights in %). Right panel: corresponding function samples from our model (blue lines). A single context example leaves much task ambiguity, reflected in a highly correlated, multi-modal TP. Our GMM approximation correctly captures this: predictions are in accordance with (i) the observed data (all samples pass close to the red context example), and with (ii) the learned inductive biases (all samples are sinusoidal), cf. also Fig. 12 in App. A.5.5.

Figure 2: Panels (a), (b): LMLHD and MSE on two synthetic function classes. Panels (c)-(f): function samples of models trained on the affine-sinusoidal class (b), given one context example (red) from a sinusoidal instance (black). GMM-NP outperforms the baselines, as it accurately quantifies epistemic uncertainty through diverse samples. BA-NP also shows variability in its samples, but does not achieve competitive performance due to its inaccurate TP approximation. ANP and BANP produce essentially deterministic predictions that fail to give reasonable estimates of the predictive distribution. Cf. also Figs. 10, 11 in App. A.5.4.

conditioned on a context set D c * from a target task. During a metatraining stage on meta-data D 1:L , we aim to encode inductive biases in the model parameters θ, s.t. small (few-shot) context sets D c

Mix of affine and sinusoidal functions.

Figure 3: LMLHD and ELBO looseness over context set size for different versions of our algorithm (blue). Our improved TRNG-VI inference scheme yields tighter ELBOs than standard SGD-VI (orange) and, thus, improved performance (cf. text and App. A.1.4 for details).

Figure 4: Panels (a), (b): simple regret over iteration, when using BML models as Bayesian optimization (BO) surrogates (further results in App. A.5.2). As BO relies on well-calibrated uncertainty predictions, the results demonstrate that GMM-NP provides superior uncertainty estimates. Panel (c): log marginal likelihood (LMLHD) and MSE on one-step ahead predictions of 4D Furuta pendulum dynamics. While GMM-NP generally performs best, BANP also shows strong results.

Log marginal likelihood (LMLHD) and MSE. (b) GMM-NP (ours).(c) ANP.

Figure 5: Results on 2D image completion on MNIST. Panels (b), (c) visualize predictions on an unseen task showing the digit "6". The first row shows the context pixels, the remaining rows show five corresponding samples. The results are consistent with earlier observations (e.g., Fig. 2): our GMM-NP model shows highly variable samples for small context sets, yielding an accurate estimate of epistemic uncertainty, and contracts properly around the ground truth when more context information is available. ANP yields crisp predictions but massively overfits to the noise, explaining bad LMLHD and MSE scores. We provide further results in App. A.5.3, Fig. 9.

a ℓ ∼ U (-3.0, 3.0), b ℓ ∼ U (-3.0, 3.0), f 2 ℓ (x) = A ℓ sin (x -ϕ ℓ ), A ℓ ∼ U (0.1, 5.0), ϕ ℓ ∼ U (0.0, π) f ℓ is given either by f 1 ℓ or f 2 ℓ with probability 0.5. • σ = 0.25 • L = 64, N = 16

drawn from a Gaussian process prior with RBF kernel with lengthscale l ℓ ∼ U (0.5, 1.0) and signal variance s ℓ ∼ U (0.5, 1.0). • σ = 0.1 • L = 64, N = 16 Forrester 1D. • d x = 1 • d y = 1 • We use the parametrized Forrester function (Forrester et al., 2008) as defined on https://www.sfu.ca/~ssurjano/forretal08.html. • σ = 0.25 • L = 64, N = 16 Branin 2D. • d x = 2 • d y = 1 • We use the definition given on https://www.sfu.ca/~ssurjano/branin.html and apply translations τ ℓ ∼ U (-0.25, 0.25) 2 to x, and scale the function values by s ℓ ∼ U (0.75, 1.25).

Mix of affine and sinusoidal functions.

Figure 6: Log marginal predictive likelihood (LMLHD) and ELBO looseness over context size for our trust region natural gradient VI (TRNG-VI)-based (Arenz et al., 2022) GMM-NP algorithm in comparison to iBayes-GMM (Lin et al., 2020) that uses direct step size control instead of trust regions (NG-VI). Trust regions improve variational optimization, leading to tighter ELBOs, and, consequently, to improved predictive performance.

Figure 7: Simple regret over optimization iteration, when using BML models as Bayesian Optimization (BO) surrogates on various function classes. As BO relies on well-calibrated uncertainty predictions, the results demonstrate that GMM-NP provides superior uncertainty estimates.

Figure 8: Log marginal predictive likelihood (LMLHD) and mean squared error (MSE) over context size on various function classes. GMM-NP generally performs favorably, showing accurate predictions with well-calibrated uncertainties.

Figure 9: Predictions on an unseen instance of the MNIST 2D image completion task, showing the digit "6". The first row of each panel shows the context pixels (ranging from zero pixels in the left column to the full image in the right column). The remaining rows show five samples from the BML models, conditioned on the context pixels shown in the first row. The results are consistent with observations from the other experiments (e.g., Fig. 10): our GMM-NP model shows highly variable samples for small context sets, yielding an accurate estimate of epistemic uncertainty, and contracts properly around the ground truth when more context information is available. BA-NP also shows variable samples, albeit of lower quality. ANP and BANP yield crisp predictions but massively overfit to the noise, explaining their low LMLHD scores. Note also that BANP does not allow predictions for empty context sets.

Figure 10: Function samples computed by various BML models (columns), trained on a function class consisting of a mix of affine and sinusoidal functions (cf. Sec. 5.1), when provided with increasing amounts of context examples (red crosses, rows) from an unseen sinusoidal representative function. We observe that our GMM-NP model accurately quantifies epistemic uncertainty through the variability of its function samples. BA-NP also shows variable samples, but does not achieve the same predictive performance due to its inaccurate approximation of the task posterior distribution. ANP and BANP, both of which employ deterministic computation paths with attention modules, produce essentially deterministic predictions that massively overfit the context data and fail to give a reasonable estimate of the predictive distribution. Therefore, these models have to quantify epistemic uncertainty through the likelihood noise variance, which is ineffective, cf. Fig. 11. Note also that BANP does not provide predictions for empty context sets.

A small context set (a single example, indicated by the red cross) yields a highly correlated, multi-modal task posterior distribution. Our GMM approximation correctly captures this, s.t. the amplitudes and phases of the predicted sinusoidal functions are in accordance with the observed context data point.

A second example on another instance of the sinusoidal function class, where the task posterior shows pronounced multimodality, which translates into a bimodal predictive distribution.

Larger context sizes (three examples, red crosses) leave less task ambiguity, resulting in a unimodal and nearly isotropic task posterior distribution. Our GMM approximation again correctly captures this distribution, making use of only two of the K = 3 mixture components (the mixture weight of the orange component is close to zero, so no samples from this component are observed).
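The link between mixture weights and observed samples can be made concrete with a minimal sketch of ancestral sampling from a full-covariance GMM in a two-dimensional latent space. All numbers below are illustrative assumptions, not the fitted parameters from the figure, and the third weight is set to exactly zero (rather than merely close to zero) so that the outcome is deterministic.

```python
import math
import random


def cholesky_2x2(cov):
    """Lower-triangular Cholesky factor of a 2x2 symmetric positive-definite matrix."""
    (a, b), (_, c) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    return ((l11, 0.0), (l21, l22))


def sample_gmm(weights, means, covs, num_samples, rng):
    """Ancestral sampling: draw component k ~ weights, then z ~ N(mu_k, Sigma_k)."""
    chols = [cholesky_2x2(c) for c in covs]
    samples, components = [], []
    for _ in range(num_samples):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        chol = chols[k]
        # z = mu_k + L_k @ eps with eps ~ N(0, I)
        z = (
            means[k][0] + chol[0][0] * e1,
            means[k][1] + chol[1][0] * e1 + chol[1][1] * e2,
        )
        samples.append(z)
        components.append(k)
    return samples, components


# Hypothetical K = 3 mixture; the third component carries no weight.
weights = [0.6, 0.4, 0.0]
means = [(-1.0, 0.0), (1.0, 0.5), (0.0, 3.0)]
covs = [
    ((1.0, 0.8), (0.8, 1.0)),    # strongly correlated component
    ((0.5, 0.0), (0.0, 0.5)),    # isotropic component
    ((1.0, 0.0), (0.0, 1.0)),    # unused component
]
samples, components = sample_gmm(weights, means, covs, 1000, random.Random(0))
counts = [components.count(k) for k in range(3)]
```

Here `counts[2]` stays at zero, mirroring the caption's observation that a component with vanishing mixture weight contributes no samples.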

ACKNOWLEDGMENTS

This work was performed on the HoreKa supercomputer funded by the Ministry of Science, Research and the Arts Baden-Württemberg and by the Federal Ministry of Education and Research. The authors further acknowledge support by the state of Baden-Württemberg through bwHPC.

AVAILABILITY

• Source code for our GMM-NP algorithm: https://github.com/ALRhub/gmm_np
• MA-NP, ANP: https://github.com/deepmind/neural-processes
• BA-NP: https://github.com/boschresearch/bayesian-context-aggregation
• BNP, BANP: https://github.com/juho-lee/bnp

ANNEX

Algorithm 1 GMM-NP (Meta-Training)
Require: Meta-data $D_\ell = \{x_{\ell,1:N}, y_{\ell,1:N}\}$, $\ell \in 1{:}L$
  Sample variably-sized auxiliary tasks $D_l = \{x_{l,1:N_l}, y_{l,1:N_l}\}$, $l \in 1{:}L$, cf. Sec. A.3.2
  Initialize variational parameters $\phi_{1:L} = \{w_{1:L,1:K}, \mu_{1:L,1:K}, \Sigma_{1:L,1:K}\}$
  Initialize model parameters $\theta$
  while not converged do
    for each minibatch of tasks $I \subset \{1, \dots, L\}$ do
      Sample $z_{\ell,k,s} \sim q_{\phi_\ell}(z_\ell \mid k)$ for $\ell \in I$, $k \in 1{:}K$, $s \in 1{:}S$
      Evaluate $h_{\ell,k}$ on $z_{\ell,k,s}$ and $D_\ell$ for $\ell \in I$, $k \in 1{:}K$, $s \in 1{:}S$, Eq. (13)
      Update variational parameters $\phi_\ell$ for $\ell \in I$, Eq. (8)
      Sample $z_{\ell,s} \sim q_{\phi_\ell}(z_\ell)$ for $\ell \in I$, $s \in 1{:}S$
      Estimate gradient of ELBO, Eq. (9)
      Perform step in $\theta$ using Adam
    end for
  end while
  return Model parameters $\theta$

Algorithm 2 GMM-NP (Prediction)

As the meta-training stage of GMM-NP requires computational effort comparable to standard NP (cf. Sec. 4), the only computational overhead of our algorithm occurs at test time, due to the optimization loop required to fit a variational GMM to $D^c_*$. While this can be trivially parallelized for multiple test tasks, it incurs a higher computational burden in comparison to the single forward pass through NP's set encoder (we provide an evaluation of the runtime of our algorithm on the synthetic tasks studied in Sec. 5.1 below). We leave a detailed examination for future work, but mention two possible remedies: (i) for problems where test data arrives sequentially, we expect that a few update steps in $\phi_*$ suffice to reach convergence, and (ii) it might be possible to find amortized approximations to Eqs. (8), similar in spirit to standard NP, that retain the advantages of TRNG-VI.

Meta-Training. Fig. 13 shows the learning curves for meta-training corresponding to the results presented in the main part of this paper. As discussed in Sec. 4, GMM-NP incurs a computational cost comparable to the baseline methods.
For GMM-NP, we show the loss for the decoder parameters θ; for the other methods, we show the joint loss for the encoder and decoder parameters (ϕ, θ). Note that for GMM-NP, convergence of θ implies convergence of the variational parameters ϕ.

Test-time Adaptation. As discussed in Sec. 4, GMM-NP does not amortize TP inference, i.e., it does not learn a set encoder architecture, but adapts new variational GMMs at test time. Naturally, this incurs a higher computational cost in comparison to amortized architectures, which compute predictions on test tasks in a single forward pass through their set-encoder-decoder architecture. In Fig. 14, we show the learning curves for fitting variational GMMs (by iterating Eqs. (8)) to the test tasks and for the range of context sizes used to compute the results presented in Sec. 5.1. GMM-NP's TRNG-VI optimization converges in approx. 0.1 s to 1 s per test task (depending on the context set size).

Published as a conference paper at ICLR 2023

(Arenz et al., 2022), as used by our GMM-NP (Sec. 4), on the synthetic datasets (Sec. 5.1). The quantity labelled "Loss (adapt)" is the expected negative log density of the unnormalized TP under the GMM TP approximation. Note that this is not the loss function optimized by iterating Eqs. (8), but it serves as a proxy to judge convergence. We show results in terms of wall clock time per test task (left panels) and in terms of TRNG-VI steps (right panels), for the range of context sizes used to compute the results in the main text.
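The convergence proxy described above, the expected negative log density of the unnormalized TP under the variational approximation, can be estimated by plain Monte Carlo sampling. The sketch below is a hedged illustration rather than the authors' implementation: `adapt_loss_proxy`, `sample_q`, and `log_unnormalized_tp` are hypothetical names, and a one-dimensional Gaussian stands in for the GMM so that the result can be checked analytically.

```python
import random


def adapt_loss_proxy(sample_q, log_unnormalized_tp, num_samples):
    """Monte Carlo estimate of E_q[-log p~(z)] using samples z ~ q.

    sample_q:            callable returning one sample from the approximation q
    log_unnormalized_tp: callable returning the log of the unnormalized TP at z
    """
    total = 0.0
    for _ in range(num_samples):
        z = sample_q()
        total -= log_unnormalized_tp(z)
    return total / num_samples


# Toy check (illustrative assumption): q = N(0, 1) and log p~(z) = -0.5 * z^2,
# for which the proxy equals E[0.5 * z^2] = 0.5 analytically.
rng = random.Random(0)
estimate = adapt_loss_proxy(
    sample_q=lambda: rng.gauss(0.0, 1.0),
    log_unnormalized_tp=lambda z: -0.5 * z * z,
    num_samples=20000,
)
```

Tracking such an estimate across optimization steps gives a curve that flattens once the approximation stops changing, which is how a proxy of this kind can indicate convergence even though it is not the optimized objective.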

