META-LEARNING BAYESIAN NEURAL NETWORK PRIORS BASED ON PAC-BAYESIAN THEORY

Abstract

Bayesian deep learning is a promising approach towards improved uncertainty quantification and sample efficiency. Due to their complex parameter space, choosing informative priors for Bayesian Neural Networks (BNNs) is challenging. Thus, often a naive, zero-centered Gaussian is used, resulting both in bad generalization and poor uncertainty estimates when training data is scarce. In contrast, meta-learning aims to extract such prior knowledge from a set of related learning tasks. We propose a principled and scalable algorithm for meta-learning BNN priors based on PAC-Bayesian bounds. Whereas previous approaches require optimizing the prior and multiple variational posteriors in an interdependent manner, our method does not rely on difficult nested optimization problems, and moreover, it is agnostic to the variational inference method in use. Our experiments show that the proposed method is not only computationally more efficient but also yields better predictions and uncertainty estimates when compared to previous meta-learning methods and BNNs with standard priors.

1. INTRODUCTION

Bayesian Neural Networks (BNNs) offer a probabilistic interpretation of deep learning by inferring distributions over the model's weights (Neal, 1996). With the potential of combining the scalability and performance of neural networks (NNs) with a framework for uncertainty quantification, BNNs have lately received increased attention (Blundell et al., 2015; Gal & Ghahramani, 2016). In particular, their ability to express epistemic uncertainty makes them highly relevant for applications such as active learning (Hernández-Lobato & Adams, 2015) and reinforcement learning (Riquelme et al., 2018). However, BNNs face two major issues: 1) the intractability of posterior inference and 2) the difficulty of choosing good Bayesian priors. While the former has been addressed in an extensive body of literature on variational inference (e.g. Blundell et al., 2015; Blei et al., 2016; Mishkin et al., 2018; Liu & Wang, 2016), the latter has only received limited attention (Vladimirova et al., 2019; Ghosh & Doshi-Velez, 2017). Choosing an informative prior for BNNs is particularly difficult due to the high-dimensional and hardly interpretable parameter space of NNs. Due to the lack of good alternatives, often a zero-centered, isotropic Gaussian is used, reflecting (almost) no a priori knowledge about the problem at hand. This not only leads to poor generalization when data is scarce, but also renders the Bayesian uncertainty estimates poorly calibrated (Kuleshov et al., 2018). Meta-learning (Schmidhuber, 1987; Thrun & Pratt, 1998) acquires inductive bias in a data-driven way, thus constituting an alternative route for addressing this issue. In particular, meta-learners attempt to extract shared (prior) knowledge from a set of related learning tasks (i.e., datasets), aiming to facilitate learning on a new, related task. Our work develops a principled and scalable algorithm for meta-learning BNN priors.
We build on the PAC-Bayesian framework (McAllester, 1999), a methodology from statistical learning theory for deriving generalization bounds. Previous PAC-Bayesian bounds for meta-learners (Pentina & Lampert, 2014; Amit & Meir, 2018) require solving a difficult optimization problem, involving the optimization of the prior as well as multiple variational posteriors in a nested manner. Aiming to overcome this issue, we present a PAC-Bayesian bound that does not rely on nested optimization and, unlike Rothfuss et al. (2020), can be tractably optimized for BNNs. This makes the resulting meta-learner, referred to as PACOH-NN, not only much more computationally efficient and scalable than previous approaches for meta-learning BNN priors (Amit & Meir, 2018), but also agnostic to the choice of approximate posterior inference method, which allows us to combine it freely with recent advances in MCMC (e.g. Chen et al., 2014) or variational inference (e.g. Wang et al., 2019). Our experiments demonstrate that the computational advantages of PACOH-NN do not result in degraded predictive performance. In fact, across several regression and classification environments, PACOH-NN achieves a comparable or better predictive accuracy than several popular meta-learning approaches, while improving the quality of the uncertainty estimates. Finally, we showcase how meta-learned PACOH-NN priors can be used in a real-world bandit task concerning the development of vaccines, suggesting that many other challenging real-world problems may benefit from our approach.

2. RELATED WORK

Bayesian Neural Networks. The majority of research on BNNs focuses on approximating the intractable posterior distribution (Graves, 2011; Blundell et al., 2015; Liu & Wang, 2016; Wang et al., 2019). In particular, we employ the approximate inference method of Liu & Wang (2016). Another crucial question is how to select a good BNN prior (Vladimirova et al., 2019). While the majority of work (e.g. Louizos & Welling, 2016; Huang et al., 2020) employs a simple zero-centered, isotropic Gaussian, Ghosh & Doshi-Velez (2017) and Pearce et al. (2020) have proposed other prior distributions for BNNs. In contrast, we go the alternative route of choosing priors in a data-driven way. Meta-learning. A range of popular methods in meta-learning attempt to learn the "learning program" in the form of a recurrent model (Hochreiter et al., 2001; Andrychowicz et al., 2016; Chen et al., 2017), learn an embedding space shared across tasks (Snell et al., 2017; Vinyals et al., 2016), or learn the initialization of a NN such that it can be quickly adapted to new tasks (Finn et al., 2017; Nichol et al., 2018; Rothfuss et al., 2019b). A group of recent methods uses probabilistic modeling to also allow for uncertainty quantification (Kim et al., 2018; Finn et al., 2018; Garnelo et al., 2018). Although the mentioned approaches are able to learn complex inference patterns, they rely on settings where meta-training tasks are abundant, and they fall short of performance guarantees. In contrast, we provide a formal assessment of the generalization properties of our algorithm. Moreover, PACOH-NN allows for principled uncertainty quantification, including separate treatment of epistemic and aleatoric uncertainty. This makes it particularly useful for sequential decision algorithms (Lattimore & Szepesvari, 2020). PAC-Bayesian theory.
Previous work presents generalization bounds for randomized predictors, assuming the prior to be given exogenously (McAllester, 1999; Catoni, 2007; Germain et al., 2016; Alquier et al., 2016). More recent work explores data-dependent priors (Parrado-Hernandez et al., 2012; Dziugaite & Roy, 2016) or extends previous bounds to the scenario where priors are meta-learned (Pentina & Lampert, 2014; Amit & Meir, 2018). However, these meta-generalization bounds are hard to minimize, as they leave both the hyper-posterior and the posteriors unspecified, which leads to nested optimization problems. Our work builds on the results of Rothfuss et al. (2020), who introduce the methodology to derive the closed-form solution of the PAC-Bayesian meta-learning problem. However, unlike ours, their approach suffers from (asymptotically) non-vanishing terms in the bounds and relies on a closed-form solution of the marginal log-likelihood. By contributing a numerically stable score estimator for the generalized marginal log-likelihood, we are able to overcome these limitations, making PAC-Bayesian meta-learning both tractable and scalable for a much larger array of models, including BNNs.

3. BACKGROUND: THE PAC-BAYESIAN FRAMEWORK

Bayesian Neural Networks. Consider a supervised learning task with data $S = \{(x_j, y_j)\}_{j=1}^m$ drawn from an unknown distribution $\mathcal{D}$. Here, $\mathbf{X} = \{x_j\}_{j=1}^m \in \mathcal{X}^m$ denotes the training inputs and $\mathbf{Y} = \{y_j\}_{j=1}^m \in \mathcal{Y}^m$ the targets. For brevity, we also write $z_j := (x_j, y_j) \in \mathcal{Z}$. Let $h_\theta : \mathcal{X} \to \mathcal{Y}$ be a function parametrized by a NN with weights $\theta \in \Theta$. Using the NN mapping, we define a conditional distribution $p(y|x, \theta)$. For regression, we set $p(y|x, \theta) = \mathcal{N}(y|h_\theta(x), \sigma^2)$, where $\sigma^2$ is the observation noise variance. For classification, we choose $p(y|x, \theta) = \mathrm{Categorical}(\mathrm{softmax}(h_\theta(x)))$. For Bayesian inference, one presumes a prior distribution $p(\theta)$ over the model parameters, which is combined with the training data $S$ into a posterior distribution $p(\theta|\mathbf{X}, \mathbf{Y}) \propto p(\theta)\, p(\mathbf{Y}|\mathbf{X}, \theta)$. For an unseen test point $x^*$, we form the predictive distribution $p(y^*|x^*, \mathbf{X}, \mathbf{Y}) = \int p(y^*|x^*, \theta)\, p(\theta|\mathbf{X}, \mathbf{Y})\, d\theta$. The Bayesian framework presumes partial knowledge of the data-generating process in the form of a prior distribution. However, due to the practical difficulties of choosing an appropriate BNN prior, the prior is typically strongly misspecified (Syring & Martin, 2018). As a result, modulating the influence of the prior relative to the likelihood during inference typically improves the empirical performance of BNNs and is thus common practice (Wenzel et al., 2020). Using such a "tempered" posterior $p^\tau(\theta|\mathbf{X}, \mathbf{Y}) \propto p(\theta)\, p(\mathbf{Y}|\mathbf{X}, \theta)^\tau$ with $\tau > 0$ is also referred to as generalized Bayesian learning (Guedj, 2019). The PAC-Bayesian Framework. In the following, we introduce the most relevant concepts of PAC-Bayesian learning theory. For more details, we refer to Guedj (2019). Given a loss function $\ell : \Theta \times \mathcal{Z} \to \mathbb{R}$, we typically want to minimize the generalization error $\mathcal{L}(\theta, \mathcal{D}) = \mathbb{E}_{z^* \sim \mathcal{D}}[\ell(\theta, z^*)]$. Since $\mathcal{D}$ is unknown, the empirical error $\hat{\mathcal{L}}(\theta, S) = \frac{1}{m} \sum_{i=1}^m \ell(\theta, z_i)$ is used instead.
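To make the tempered posterior concrete, here is a minimal sketch of ours (not the paper's code) in a 1-d conjugate Gaussian model: with a Gaussian prior and Gaussian likelihood, the tempered posterior $p^\tau(\theta|y) \propto p(\theta)\,p(y|\theta)^\tau$ stays Gaussian, and $\tau < 1$ down-weights the likelihood, pulling the posterior back toward the prior.

```python
import numpy as np

def tempered_gaussian_posterior(y, sigma2=1.0, prior_mu=0.0, prior_var=1.0, tau=1.0):
    """Tempered posterior p_tau(theta | y) ∝ p(theta) * p(y | theta)^tau for a
    1-d Gaussian prior/likelihood; tau = 1 recovers the standard Bayes posterior."""
    m = len(y)
    post_prec = 1.0 / prior_var + tau * m / sigma2
    post_mu = (prior_mu / prior_var + tau * np.sum(y) / sigma2) / post_prec
    return post_mu, 1.0 / post_prec

y = np.array([0.9, 1.1, 1.0, 0.8, 1.2])
mu_bayes, var_bayes = tempered_gaussian_posterior(y, tau=1.0)
mu_cold, var_cold = tempered_gaussian_posterior(y, tau=0.2)
# down-weighting the likelihood keeps the posterior closer to the N(0, 1) prior
assert var_cold > var_bayes and abs(mu_cold) < abs(mu_bayes)
```

With neural network likelihoods the posterior has no such closed form, which is why approximate inference is needed; the conjugate case merely illustrates what the exponent $\tau$ does.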
In the PAC-Bayesian framework, we are concerned with randomized predictors, i.e., probability measures on the parameter space $\Theta$, allowing us to reason about epistemic uncertainty. In particular, we consider two such probability measures, a prior $P \in \mathcal{M}(\Theta)$ and a posterior $Q \in \mathcal{M}(\Theta)$, where $\mathcal{M}(\Theta)$ denotes the set of all probability measures on $\Theta$. While in Bayesian inference the prior and posterior are tightly coupled through Bayes' theorem, the PAC-Bayesian framework only requires the prior to be independent of the data $S$. Using the definitions above, the so-called Gibbs error of a randomized predictor $Q$ is defined as $\mathcal{L}(Q, \mathcal{D}) = \mathbb{E}_{\theta \sim Q}[\mathcal{L}(\theta, \mathcal{D})]$; similarly, its empirical counterpart is $\hat{\mathcal{L}}(Q, S) = \mathbb{E}_{\theta \sim Q}[\hat{\mathcal{L}}(\theta, S)]$. The PAC-Bayesian framework provides upper bounds on the unknown Gibbs error of the following form:

Theorem 1. (Alquier et al., 2016) Given a data distribution $\mathcal{D}$, a prior $P \in \mathcal{M}(\Theta)$ and a confidence level $\delta \in (0, 1]$, with probability at least $1 - \delta$ over samples $S \sim \mathcal{D}^m$, we have
$$\forall Q \in \mathcal{M}(\Theta): \quad \mathcal{L}(Q, \mathcal{D}) \le \hat{\mathcal{L}}(Q, S) + \frac{1}{\sqrt{m}} \left[ D_{KL}(Q\|P) + \ln \tfrac{1}{\delta} + \Psi(\sqrt{m}) \right] \quad (1)$$
where $\Psi(\sqrt{m}) = \ln \mathbb{E}_{\theta \sim P}\, \mathbb{E}_{S \sim \mathcal{D}^m} \exp\!\left( \sqrt{m}\, \big( \mathcal{L}(\theta, \mathcal{D}) - \hat{\mathcal{L}}(\theta, S) \big) \right)$.

Here, $\Psi(\sqrt{m})$ is a log moment generating function that quantifies how strongly the empirical error deviates from the Gibbs error. By making additional assumptions about the loss function $\ell$, we can bound $\Psi(\sqrt{m})$ and thereby obtain tractable bounds. For instance, if $\ell(\theta, z)$ is bounded in $[a, b]$, Hoeffding's lemma yields $\Psi(\sqrt{m}) \le (b - a)^2 / 8$. For unbounded loss functions, it is common to assume bounded moments. For instance, a loss is considered sub-gamma with variance factor $s^2$ and scale parameter $c$, under a prior $P$ and data distribution $\mathcal{D}$, if its deviation from the mean, i.e. the random variable $V := \mathcal{L}(\theta, \mathcal{D}) - \ell(\theta, z)$, has a moment generating function upper bounded by that of a Gamma distribution $\Gamma(s, c)$ (Boucheron et al., 2013). In this case, we obtain $\Psi(\sqrt{m}) \le s^2 / \big( 2(1 - c/\sqrt{m}) \big)$.
Connecting the PAC-Bayesian framework and generalized Bayesian learning. In PAC-Bayesian learning, we aim to find the posterior that minimizes the bound in (1), which is in general a challenging optimization problem over the space of measures $\mathcal{M}(\Theta)$. However, to our benefit, it can be shown that the Gibbs posterior is the probability measure that minimizes (1); for details, we refer to Lemma 2 in the Appendix, or to Catoni (2007) and Germain et al. (2016). In particular, this gives us
$$Q^*(\theta) := \arg\min_{Q \in \mathcal{M}(\Theta)} \; \sqrt{m}\, \hat{\mathcal{L}}(Q, S) + D_{KL}(Q\|P) \;=\; P(\theta)\, e^{-\sqrt{m}\, \hat{\mathcal{L}}(\theta, S)} / Z(S, P),$$
where $Z(S, P)$ is a normalization constant. In a probabilistic setting, our loss function is the negative log-likelihood, i.e. $\ell(\theta, z_i) := -\log p(z_i|\theta)$. In this case, the optimal Gibbs posterior coincides with the generalized Bayesian posterior $Q^*(\theta; P, S) = P(\theta)\, p(S|\theta)^{1/\sqrt{m}} / Z(S, P)$, where $Z(S, P) = \int_\Theta P(\theta)\, p(S|\theta)^{1/\sqrt{m}}\, d\theta$ is the generalized marginal likelihood of the data sample $S$.
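The Gibbs posterior's optimality can be sanity-checked numerically on a finite hypothesis set, where distributions over $\Theta$ are just probability vectors. The following sketch is our own toy illustration (function names and numbers are not from the paper):

```python
import numpy as np

def gibbs_posterior(prior, emp_losses, m):
    """Gibbs posterior Q*(theta) ∝ P(theta) * exp(-sqrt(m) * L_hat(theta, S))
    over a finite set of hypotheses."""
    log_q = np.log(prior) - np.sqrt(m) * emp_losses
    log_q -= log_q.max()                      # shift for numerical stability
    q = np.exp(log_q)
    return q / q.sum()

def pac_bayes_objective(q, prior, emp_losses, m):
    """sqrt(m) * E_Q[L_hat] + KL(Q || P): the quantity that Q* minimizes."""
    kl = np.sum(q * (np.log(q) - np.log(prior)))
    return np.sqrt(m) * np.dot(q, emp_losses) + kl

rng = np.random.default_rng(0)
prior = np.full(5, 0.2)                       # uniform prior over 5 hypotheses
losses = rng.uniform(0.0, 1.0, size=5)        # empirical errors L_hat(theta_i, S)
m = 100

q_star = gibbs_posterior(prior, losses, m)
# Q* attains a lower objective value than, e.g., the prior itself
assert pac_bayes_objective(q_star, prior, losses, m) <= pac_bayes_objective(prior, prior, losses, m)
```

Because the objective is an expected loss plus a KL term, its minimizer is the softmax-like reweighting of the prior computed above, mirroring the closed form $Q^*(\theta) \propto P(\theta)\, e^{-\sqrt{m}\, \hat{\mathcal{L}}(\theta, S)}$.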

4. PAC-BAYESIAN BOUNDS FOR META-LEARNING

This section describes the PAC-Bayesian meta-learning setup and discusses how we can obtain generalization bounds that can be transformed into practically useful meta-learning objectives. In doing so, we draw on the methodology of Rothfuss et al. (2020), which allows us to derive a closed-form solution of the PAC-Bayesian meta-learning problem. In the standard supervised learning setup (see Sec. 3), the learner has prior knowledge in the form of an exogenously given prior distribution $P$ over the parameter space $\Theta$. When the learner faces a new task, it uses the evidence, provided as a dataset $S$, to update the prior into a posterior distribution $Q$. We formalize such a base learner as a mapping $Q : \mathcal{Z}^m \times \mathcal{M}(\Theta) \to \mathcal{M}(\Theta)$ that takes a dataset and a prior and outputs a posterior $Q(S, P)$. In contrast, in meta-learning we aim to acquire such a prior $P$ in a data-driven manner, that is, by consulting a set of $n$ statistically related learning tasks $\{\tau_1, ..., \tau_n\}$. We follow the setup of Baxter (2000), in which all tasks $\tau_i := (\mathcal{D}_i, S_i)$ share the same data domain $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$, parameter space $\Theta$ and loss function $\ell(\theta, z)$, but may differ in their (unknown) data distributions $\mathcal{D}_i$ and the number of points $m_i$ in the corresponding dataset $S_i \sim \mathcal{D}_i^{m_i}$. For our theoretical exposition, we assume w.l.o.g. that $m_i = m$ for all $i$. Further, each task $\tau_i \sim \mathcal{T}$ is drawn i.i.d. from an environment $\mathcal{T}$, a probability distribution over data distributions and datasets. The goal is to extract knowledge from the observed datasets that can then be used as a prior for learning on new target tasks $\tau \sim \mathcal{T}$ (Amit & Meir, 2018). To extend the PAC-Bayesian analysis to the meta-learning setting, we again consider probability distributions over hypotheses/parameters. However, while the object of learning was previously the NN parameters $\theta$, it is now the prior distribution $P \in \mathcal{M}(\Theta)$. Accordingly, the meta-learner presumes a hyper-prior $\mathcal{P} \in \mathcal{M}(\mathcal{M}(\Theta))$, i.e., a distribution over priors $P$.
Combining the hyper-prior $\mathcal{P}$ with the datasets $S_1, ..., S_n$ from multiple tasks, the meta-learner then outputs a hyper-posterior $\mathcal{Q}$ over priors. The hyper-posterior's quality is measured by the expected Gibbs error when sampling priors $P$ from $\mathcal{Q}$ and applying the base learner, the so-called transfer error
$$\mathcal{L}(\mathcal{Q}, \mathcal{T}) := \mathbb{E}_{P \sim \mathcal{Q}}\, \mathbb{E}_{(\mathcal{D}, m) \sim \mathcal{T}}\, \mathbb{E}_{S \sim \mathcal{D}^m} \big[ \mathcal{L}(Q(S, P), \mathcal{D}) \big].$$
While the transfer error is unknown in practice, we can estimate it by the empirical multi-task error
$$\hat{\mathcal{L}}(\mathcal{Q}, S_1, ..., S_n) := \mathbb{E}_{P \sim \mathcal{Q}} \Big[ \frac{1}{n} \sum_{i=1}^n \hat{\mathcal{L}}(Q(S_i, P), S_i) \Big].$$
Similar to the PAC-Bayesian guarantees for single-task learning, we can bound the transfer error by its empirical counterpart plus several tractable complexity terms:

Theorem 2. Let $Q : \mathcal{Z}^m \times \mathcal{M}(\Theta) \to \mathcal{M}(\Theta)$ be a base learner and $\mathcal{P} \in \mathcal{M}(\mathcal{M}(\Theta))$ some fixed hyper-prior. For all hyper-posteriors $\mathcal{Q} \in \mathcal{M}(\mathcal{M}(\Theta))$ and $\delta \in (0, 1]$,
$$\mathcal{L}(\mathcal{Q}, \mathcal{T}) \le \hat{\mathcal{L}}(\mathcal{Q}, S_1, ..., S_n) + \Big( \frac{1}{n\sqrt{m}} + \frac{1}{\sqrt{n}} \Big) D_{KL}(\mathcal{Q}\|\mathcal{P}) + \frac{1}{n} \sum_{i=1}^n \frac{1}{\sqrt{m}}\, \mathbb{E}_{P \sim \mathcal{Q}} \big[ D_{KL}(Q(S_i, P)\|P) \big] + C(\delta, n, m) \quad (3)$$
holds with probability at least $1 - \delta$. If the loss function is bounded, i.e. $\ell : \Theta \times \mathcal{Z} \to [a, b]$, the inequality holds with $C(\delta, n, m) = \frac{(b-a)^2}{8\sqrt{m}} + \frac{(b-a)^2}{8\sqrt{n}} - \frac{1}{\sqrt{n}} \ln \delta$. If the loss function is sub-gamma with variance factor $s_I^2$ and scale parameter $c_I$ for the data distributions $\mathcal{D}_i$, and $s_{II}^2$, $c_{II}$ for the task distribution $\mathcal{T}$, the inequality holds with $C(\delta, n, m) = \frac{s_I^2}{2(\sqrt{m} - c_I)} + \frac{s_{II}^2}{2(\sqrt{n} - c_{II})} - \frac{1}{\sqrt{n}} \ln \delta$.

Under the bounded-loss assumption, Theorem 2 provides a structurally similar but tighter bound than Pentina & Lampert (2014) and Rothfuss et al. (2020). In particular, an improved proof methodology allows us to forgo a union-bound argument, reducing the negative influence of the confidence parameter $\delta$. Compared to Rothfuss et al. (2020), the $D_{KL}(\mathcal{Q}\|\mathcal{P})$ term has an improved decay rate of $1/(n\sqrt{m}) + 1/\sqrt{n}$ as opposed to $1/\sqrt{m} + 1/\sqrt{n}$. Importantly, the bound in (3) is consistent, i.e., $C(\delta, n, m) \to 0$ as $n, m \to \infty$.
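As a quick numerical sanity check (our own illustrative snippet, with arbitrary example numbers), the bounded-loss complexity term $C(\delta, n, m)$ of Theorem 2 can be evaluated directly, and shrinks as the number of tasks $n$ and samples per task $m$ grow:

```python
import math

def complexity_term(delta, n, m, a=0.0, b=1.0):
    """C(delta, n, m) for losses bounded in [a, b] (Theorem 2, bounded case):
    (b-a)^2/(8 sqrt(m)) + (b-a)^2/(8 sqrt(n)) - ln(delta)/sqrt(n)."""
    r2 = (b - a) ** 2
    return r2 / (8 * math.sqrt(m)) + r2 / (8 * math.sqrt(n)) - math.log(delta) / math.sqrt(n)

# consistency: the term vanishes as n and m grow jointly
vals = [complexity_term(0.05, n, m) for n, m in [(5, 20), (50, 200), (5000, 20000)]]
assert vals[0] > vals[1] > vals[2]
```

Note that the $-\ln(\delta)/\sqrt{n}$ part only decays with the number of tasks $n$, so more tasks (not just more samples per task) are needed to drive the meta-level confidence cost to zero.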
Unlike Pentina & Lampert (2014) and Amit & Meir (2018), the theorem also provides guarantees for unbounded loss functions under moment constraints (see Appendix A.1 for details). This makes Theorem 2 particularly relevant for probabilistic models such as BNNs, in which the loss function coincides with the inherently unbounded negative log-likelihood. Amit & Meir (2018) propose to meta-learn NN priors by directly minimizing a bound similar to (3). However, posterior inference for BNNs, i.e. obtaining $Q_i = Q(S_i, P)$, is a stochastic optimization problem in itself whose solution in turn depends on $P$. Hence, minimizing such a meta-level bound w.r.t. $P$ constitutes a computationally infeasible two-level optimization problem. To circumvent this issue, they jointly optimize $P$ and $n$ approximate posteriors $Q_i$ corresponding to the different datasets $S_i$, leading to an unstable and poorly scalable meta-learning algorithm. To overcome these issues, we employ the methodology of Rothfuss et al. (2020) and assume the Gibbs posterior $Q^*(S_i, P)$ as the base learner. As discussed in Section 3, $Q^*(S_i, P)$ not only constitutes a generalized Bayesian posterior but also minimizes the PAC-Bayesian bound. Thus, the resulting bound in Corollary 1 is tighter than (3). More importantly, the bound can be stated in terms of the partition function $Z(S_i, P)$, which allows us to forgo the explicit reliance on the task posteriors $Q_i$. This makes the bound much easier to optimize as a meta-learning objective than previous bounds (e.g. Pentina & Lampert, 2014; Amit & Meir, 2018), since it no longer constitutes a two-level optimization problem. Moreover, it renders the corresponding meta-learner agnostic to the choice of approximate inference method used to approach the Gibbs/Bayes posterior $Q^*(S, P)$ when performing inference on a new task. Corollary 1.
When choosing the Gibbs posterior $Q^*(S_i, P) := P(\theta) \exp(-\sqrt{m}\, \hat{\mathcal{L}}(\theta, S_i)) / Z(S_i, P)$ as the base learner, under the same assumptions as in Theorem 2, it holds with probability at least $1 - \delta$ that
$$\mathcal{L}(\mathcal{Q}, \mathcal{T}) \le -\frac{1}{n} \sum_{i=1}^n \frac{1}{\sqrt{m}}\, \mathbb{E}_{P \sim \mathcal{Q}} \big[ \ln Z(S_i, P) \big] + \Big( \frac{1}{\sqrt{n}} + \frac{1}{n\sqrt{m}} \Big) D_{KL}(\mathcal{Q}\|\mathcal{P}) + C(\delta, n, m), \quad (4)$$
wherein $Z(S_i, P) = \mathbb{E}_{\theta \sim P} \big[ \exp(-\sqrt{m}\, \hat{\mathcal{L}}(\theta, S_i)) \big]$ is the generalized marginal likelihood.

A natural way to obtain a PAC-Bayesian meta-learning algorithm would be to minimize (4) w.r.t. $\mathcal{Q}$. In general, though, this is a hard problem, since it would require a minimization over $\mathcal{M}(\mathcal{M}(\Theta))$, the space of all probability measures over priors. Following Rothfuss et al. (2020), we once more exploit the insight that the minimizer of (4) can be written as a Gibbs distribution (cf. Lemma 2), allowing us to derive the minimizing hyper-posterior $\mathcal{Q}^*$, i.e. the PACOH, in closed form:

Proposition 1. (PAC-Optimal Hyper-Posterior) Given a hyper-prior $\mathcal{P}$ and datasets $S_1, ..., S_n$, the hyper-posterior that minimizes the PAC-Bayesian meta-learning bound in (4) is given by
$$\mathcal{Q}^*(P) = \frac{\mathcal{P}(P)\, \exp\big( \frac{1}{\sqrt{nm}+1} \sum_{i=1}^n \ln Z(S_i, P) \big)}{Z_{II}(S_1, ..., S_n, \mathcal{P})}$$
with the partition function $Z_{II} = \mathbb{E}_{P \sim \mathcal{P}} \big[ \exp\big( \frac{1}{\sqrt{nm}+1} \sum_{i=1}^n \ln Z(S_i, P) \big) \big]$. We refer to $\mathcal{Q}^*(P)$ as PAC-optimal since, among all hyper-posteriors, it gives us the best possible PAC-Bayesian guarantee induced by Theorem 2.

5. META-LEARNING BNN PRIORS

We now discuss our main contribution, that is, how to translate the PACOH (Prop. 1) into a practical algorithm for meta-learning BNN priors. To this end, we first specify the various components of the generic meta-learning setup presented in Sec. 4 and then discuss how to obtain a tractable approximation of $\mathcal{Q}^*$. The setup. First, we define a family of priors $\{P_\phi : \phi \in \Phi\}$ over the NN parameters $\theta$. For computational convenience, we employ Gaussian priors with diagonal covariance matrices, i.e. $P_\phi = \mathcal{N}(\mu_P, \mathrm{diag}(\sigma_P^2))$ with $\phi := (\mu_P, \log \sigma_P)$. Note that we represent $\sigma_P$ in log-space to avoid additional positivity constraints. In fact, any parametric distribution that allows for re-parametrized sampling and has a tractable log-density, such as normalizing flows (Rezende & Mohamed, 2015), could be used. As is typical in the Bayesian framework, our loss function is the negative log-likelihood, i.e.
$\ell(\theta, z) = -\ln p(y|x, \theta)$, for which we assume an additive Gaussian noise model $p(y|x, \theta) = \mathcal{N}(y; h_\theta(x), \sigma_y^2)$ in regression and a categorical softmax distribution in the case of classification. Moreover, we use a zero-centered, spherical Gaussian $\mathcal{P} := \mathcal{N}(0, \sigma_\mathcal{P}^2 I)$ as a hyper-prior over the parameters $\phi$ that specify the prior. In our setup, the hyper-prior acts as a form of meta-level regularization that penalizes complex priors. Approximating the hyper-posterior. Given the hyper-prior and the (level-I) log-partition function $\ln Z(S_i, P)$, we can compute the PACOH $\mathcal{Q}^*$ up to the normalization constant $Z_{II}$. Such a setup lends itself to approximate inference methods (Blei et al., 2016). In particular, we employ Stein Variational Gradient Descent (SVGD) (Liu & Wang, 2016), which approximates $\mathcal{Q}^*$ as a set of particles $\hat{\mathcal{Q}} = \{\phi_1, ..., \phi_K\}$. Initially, the particles $\phi_k \sim \mathcal{P}$ (i.e., the priors' parameters) are sampled randomly from the hyper-prior. Subsequently, the method iteratively transports the set of particles to match $\mathcal{Q}^*$ by applying a form of functional gradient descent that minimizes $D_{KL}(\hat{\mathcal{Q}}\|\mathcal{Q}^*)$ in the reproducing kernel Hilbert space induced by a kernel function $k(\cdot, \cdot)$. In each iteration, the particles are updated by $\phi_k \leftarrow \phi_k + \eta\, \psi^*(\phi_k)$ with step size $\eta$ and
$$\psi^*(\phi) = \frac{1}{K} \sum_{k'=1}^K \Big[ k(\phi_{k'}, \phi)\, \nabla_{\phi_{k'}} \ln \mathcal{Q}^*(\phi_{k'}) + \nabla_{\phi_{k'}} k(\phi_{k'}, \phi) \Big]. \quad (6)$$
Here, $\nabla_{\phi} \ln \mathcal{Q}^*(\phi) = \nabla_{\phi} \ln \mathcal{P}(\phi) + \frac{1}{\sqrt{nm}+1} \sum_{i=1}^n \nabla_{\phi} \ln Z(S_i, P_\phi)$ is the score of $\mathcal{Q}^*$. Approximating the generalized marginal log-likelihood. The last remaining issue towards a viable meta-learning algorithm is the intractable generalized marginal likelihood, i.e. $\ln Z(S_i, P_\phi) = \ln \mathbb{E}_{\theta \sim P_\phi} \big[ e^{-\sqrt{m_i}\, \hat{\mathcal{L}}(\theta, S_i)} \big]$. Estimating and optimizing $\ln Z(S_i, P_\phi)$ is challenging, not only due to the high-dimensional expectation over $\Theta$, but also due to the numerical instabilities inherent in computing $e^{-\sqrt{m_i}\, \hat{\mathcal{L}}(\theta, S_i)}$ when $m_i$ is large.
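The SVGD update in (6) can be illustrated on a toy target whose score is available in closed form. In the sketch below (our own illustrative code), a standard normal stands in for $\mathcal{Q}^*$, so the score is simply $-\phi$; in PACOH-NN the score would instead combine the hyper-prior score with the weighted log-partition gradients:

```python
import numpy as np

def rbf_kernel(X, h=1.0):
    """RBF kernel matrix k(x_i, x_j) and its gradients w.r.t. the first argument."""
    diffs = X[:, None, :] - X[None, :, :]            # (K, K, d)
    Kmat = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * h ** 2))
    grad = -(Kmat[:, :, None] * diffs) / h ** 2      # grad_{x_i} k(x_i, x_j)
    return Kmat, grad

def svgd_step(particles, score_fn, eta=0.05, h=1.0):
    """One SVGD update: phi_k <- phi_k + eta * psi(phi_k), where psi averages
    kernel-weighted scores (attraction) and kernel gradients (repulsion)."""
    K = particles.shape[0]
    scores = np.stack([score_fn(p) for p in particles])
    Kmat, Kgrad = rbf_kernel(particles, h)
    psi = (Kmat @ scores + Kgrad.sum(axis=0)) / K
    return particles + eta * psi

score = lambda p: -p                                  # score of a standard Gaussian target
rng = np.random.default_rng(0)
parts = rng.normal(loc=5.0, scale=0.5, size=(20, 2))  # start far from the target
for _ in range(200):
    parts = svgd_step(parts, score)
final_mean = parts.mean(axis=0)                       # drifts toward the target mean (0, 0)
```

With a fixed kernel bandwidth, the particles both drift toward the target and repel each other, so they approximate the distribution rather than collapsing onto its mode.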
Aiming to overcome these issues, we compute numerically stable Monte Carlo estimates of $\nabla_\phi \ln Z(S_i, P_\phi)$ by combining the LogSumExp (LSE) operator with the re-parametrization trick (Kingma & Welling, 2014). In particular, we draw $L$ samples $\theta_l := f(\phi, \epsilon_l) = \mu_P + \sigma_P \odot \epsilon_l$, $\epsilon_l \sim \mathcal{N}(0, I)$, and compute the forward pass as
$$\ln \tilde{Z}(S_i, P_\phi) := \ln \frac{1}{L} \sum_{l=1}^L e^{-\sqrt{m_i}\, \hat{\mathcal{L}}(\theta_l, S_i)} = \mathrm{LSE}_{l=1}^L\big( -\sqrt{m_i}\, \hat{\mathcal{L}}(\theta_l, S_i) \big) - \ln L, \quad \theta_l \sim P_\phi. \quad (7)$$
The corresponding gradients follow as a softmax-weighted average of score gradients:
$$\nabla_\phi \ln \tilde{Z}(S_i, P_\phi) = -\sqrt{m_i} \sum_{l=1}^L \underbrace{\frac{e^{-\sqrt{m_i}\, \hat{\mathcal{L}}(\theta_l, S_i)}}{\sum_{l'=1}^L e^{-\sqrt{m_i}\, \hat{\mathcal{L}}(\theta_{l'}, S_i)}}}_{\text{softmax}} \; \underbrace{\nabla_\phi f(\phi, \epsilon_l)^\top}_{\text{re-param. Jacobian}} \; \underbrace{\nabla_{\theta_l} \hat{\mathcal{L}}(\theta_l, S_i)}_{\text{score}}. \quad (8)$$
Note that $\ln \tilde{Z}(S_i, P_\phi)$ is a consistent but not an unbiased estimator of $\ln Z(S_i, P_\phi)$. The following proposition ensures that we still minimize a valid bound (see Appx. B.3 for details).

Proposition 2. In expectation, replacing $\ln Z(S_i, P_\phi)$ in (4) by the estimate $\ln \tilde{Z}(S_i, P_\phi)$ still yields a valid upper bound on the transfer error $\mathcal{L}(\mathcal{Q}, \mathcal{T})$ for any $L \in \mathbb{N}$.

Moreover, by the law of large numbers, $\ln \tilde{Z}(S_i, P_\phi) \xrightarrow{a.s.} \ln Z(S_i, P_\phi)$ as $L \to \infty$, that is, for large sample sizes $L$ we recover the original PAC-Bayesian bound in (4). In the opposite edge case, $L = 1$, the boundaries between tasks vanish, meaning that the meta-training data $\{S_1, ..., S_n\}$ is treated as if it were one large dataset $\bigcup_i S_i$ (see Appx. B.3 for further discussion). The algorithm. Algorithm 1 in Appendix B summarizes the proposed meta-learning method, which we henceforth refer to as PACOH-NN. To estimate the score $\nabla_{\phi_k} \ln \mathcal{Q}^*(\phi_k)$ in (6), we can even use mini-batching at the task level. This mini-batched version, outlined in Algorithm 2, maintains $K$ particles to approximate the hyper-posterior and, in each forward step, samples $L$ NN parameters (of dimensionality $|\Theta|$) per prior particle, which are deployed on a mini-batch of $n_{bs}$ tasks to estimate the score of $\mathcal{Q}^*$.
As a result, the total space complexity is of order $O(|\Theta|(K + L))$ and the computational complexity of a single update (cf. (6)) is $O(K^2 + K L n_{bs})$. A key advantage of PACOH-NN over previous methods for meta-learning BNN priors (e.g. Pentina & Lampert, 2014; Amit & Meir, 2018) is that it turns the previously nested optimization problem into a much simpler standard stochastic optimization problem. This makes meta-learning not only much more stable but also more scalable. In particular, we neither need to explicitly compute nor maintain the task posteriors $Q_i$, and we can mini-batch over tasks. As a result, the space and compute complexity of our method do not depend on the number of tasks $n$. In contrast, MLAP (Amit & Meir, 2018) has a memory footprint of $O(|\Theta| n)$, making meta-learning prohibitive for more than 50 tasks. A central feature of PACOH-NN is that it comes with principled meta-level regularization in the form of the hyper-prior $\mathcal{P}$, which combats overfitting to the meta-training tasks (Qin et al., 2018). As we show in our experiments, this allows us to successfully perform meta-learning with as few as 5 tasks. This is unlike the majority of popular meta-learners (e.g. Finn et al., 2017; Kim et al., 2018; Garnelo et al., 2018), which rely on a large number of tasks to generalize well on the meta-level (Qin et al., 2018; Rothfuss et al., 2020).
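To illustrate why the LSE form of the estimator in (7) matters numerically, here is a minimal sketch of ours in which a simple quadratic loss stands in for a BNN's empirical error (all names and numbers are illustrative, not from the paper's implementation):

```python
import numpy as np

def logsumexp(a):
    """Stable log(sum(exp(a)))."""
    a_max = np.max(a)
    return a_max + np.log(np.sum(np.exp(a - a_max)))

def log_Z_tilde(mu, log_sigma, emp_loss, m, L=100, seed=0):
    """MC estimate ln Z~(S, P_phi) = LSE_l(-sqrt(m) * L_hat(theta_l, S)) - ln L
    with re-parametrized samples theta_l = mu + sigma * eps_l (cf. eq. (7))."""
    rng = np.random.default_rng(seed)
    thetas = mu + np.exp(log_sigma) * rng.standard_normal((L,) + mu.shape)
    log_terms = -np.sqrt(m) * np.array([emp_loss(t) for t in thetas])
    return logsumexp(log_terms) - np.log(len(log_terms))

# toy stand-in for the empirical error: squared distance to a "good" parameter
target = np.array([1.0, -2.0])
emp_loss = lambda th: float(np.mean((th - target) ** 2))
mu, log_sigma = np.zeros(2), np.zeros(2)

small = log_Z_tilde(mu, log_sigma, emp_loss, m=100)    # a naive average would also work here
huge = log_Z_tilde(mu, log_sigma, emp_loss, m=10**6)   # naive exp(-1000 * loss) mostly underflows
assert np.isfinite(small) and np.isfinite(huge)
```

For large $m$, the individual terms $e^{-\sqrt{m}\,\hat{\mathcal{L}}}$ can fall below floating-point range, while the LSE form only ever exponentiates differences to the maximum and therefore stays finite.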

6. EXPERIMENTS

We empirically evaluate the method introduced in Section 5, in particular two variants of the algorithm: PACOH-NN-SVGD with K = 5 prior particles, and the edge case K = 1, which coincides with a maximum a posteriori (MAP) approximation of Q* and is thus referred to as PACOH-NN-MAP. To evaluate the quality of the meta-learned prior, i.e. for meta-testing, we need to approximate the BNN posterior Q*(S, P), for which we also use SVGD with 5 particles. Comparing against a range of NN-based meta-learning approaches on several regression and classification environments, we demonstrate that PACOH-NN (i) outperforms previous meta-learning algorithms in terms of predictive accuracy, (ii) improves the quality of uncertainty estimates, and (iii) is much more scalable than previous PAC-Bayesian meta-learners. Finally, we showcase how meta-learned PACOH-NN priors can be harnessed in a real-world bandit task concerning peptide-based vaccine development.

6.1. META-LEARNING BNN PRIORS FOR REGRESSION AND CLASSIFICATION

Figure 1 illustrates BNN predictions on a sinusoidal regression task with a standard Gaussian prior as well as with a PACOH-NN prior meta-learned on sinusoidal functions of varying amplitude, phase and slope (details can be found in Appendix C.1). In Figure 1a, we can see that the standard Gaussian prior provides poor inductive bias, leading not only to bad mean predictions away from the test points but also to poor 95% confidence intervals (blue shaded areas). In contrast, the meta-learned PACOH-NN prior encodes useful inductive bias towards sinusoidal function shapes, leading to good predictions and uncertainty estimates even in the face of minimal training data (i.e., a single training point). Meta-learning benchmark. In the following, we present a comprehensive benchmark study. First, we use a BNN with a zero-centered, spherical Gaussian prior and SVGD posterior inference (Liu & Wang, 2016) as a baseline. Second, we compare our proposed approach against various popular meta-learning algorithms, including model-agnostic meta-learning (MAML) (Finn et al., 2017), its first-order version (FOMAML) (Nichol et al., 2018), Bayesian MAML (BMAML) (Kim et al., 2018), and two variants of the PAC-Bayesian approach by Amit & Meir (2018) (MLAP). For experiments with regression tasks, we also include neural processes (NPs) (Garnelo et al., 2018) and the GP-based meta-learner of Rothfuss et al. (2020) (PACOH-GP) in our comparison. The latter is similar to our method in that it also approximates a form of the PAC-optimal hyper-posterior with SVGD. However, unlike PACOH-NN, it uses Gaussian processes (GPs) as base learners and relies on a closed-form marginal log-likelihood. Among all baselines, MLAP is the most similar to our approach, as it is neural-network-based and minimizes PAC-Bayesian bounds on the transfer error. However, unlike PACOH-NN, it relies on a nested optimization of the task posteriors Q_i and the hyper-posterior Q. Regression environments.
In our experiments, we consider one synthetic and four real-world meta-learning environments for regression. As a synthetic environment, we follow Nichol et al. (2018).

Table 3: Comparison of meta-learning algorithms in terms of test accuracy and calibration error on the Omniglot environment with 2-shot and 5-shot 5-way classification tasks.

Our real-world environments include medical time series of intensive care patients (Silva et al., 2012), in particular the Glasgow Coma Scale (GCS) and the hematocrit value (HCT); here, the different tasks correspond to different patients. Moreover, we employ the Intel Berkeley Research Lab temperature sensor dataset (Berkeley-Sensor) (Madden, 2004), where the tasks require auto-regressive prediction of temperature measurements from sensors installed in different locations of the building. Further details can be found in Appendix C.1. Table 1 reports the results of our benchmark study in terms of the root mean squared error (RMSE) on unseen test tasks. Among the approaches, PACOH-NN consistently performs best or is among the best two methods, demonstrating that the introduced meta-learning framework is not only sound but also yields an algorithm that works well in practice. Only for low-dimensional and small-scale regression environments like Physionet do we find that the GP-based meta-learner PACOH-GP, which is built on a similar theoretical foundation as our method, works better than PACOH-NN. Further, we hypothesize that by acquiring the prior in a principled, data-driven manner (e.g., with PACOH-NN), we can improve the quality of the BNN's uncertainty estimates. To investigate the effect of meta-learned priors on the uncertainty estimates of the BNN, we compute the probabilistic predictors' calibration error, reported in Table 2.
The calibration error measures the discrepancy between predicted confidence regions and the actual frequencies of test data in the respective regions (Kuleshov et al., 2018). Note that, since MAML only produces point predictions, the concept of calibration does not apply to it. First, we observe that meta-learning priors with PACOH-NN consistently improves the standard BNN's uncertainty estimates. For meta-learning environments with high task similarity, i.e. SwissFel and Berkeley-Sensor, the improvement is substantial. Surprisingly, while improving upon the standard BNN in terms of RMSE, NPs consistently yield worse-calibrated predictive distributions than the BNN without meta-learning. This may be due to meta-level overfitting, as NPs lack any form of meta-level regularization (cf. Qin et al., 2018; Rothfuss et al., 2020). Classification environments. We conduct experiments with the multi-task classification environment Omniglot (Lake et al., 2015), consisting of handwritten letters across 50 alphabets. Unlike previous work (e.g. Finn et al., 2017), we do not perform data augmentation and do not re-combine letters of different alphabets, preserving the data's original structure. In particular, one task corresponds to 5-way classification of letters within an alphabet. This leaves us with far fewer tasks (i.e., 30 train and 20 test tasks), making the environment more challenging and more interesting for uncertainty quantification. It also allows us to include MLAP in the experiment, which hardly scales to more than 50 tasks. In Table 3, we report both the accuracy and the calibration error for 2-shot and 5-shot classification on test tasks. Again, PACOH-NN yields the most accurate classification results and the lowest calibration error. Note that MAML fails to improve upon the standard BNN, i.e., it demonstrates negative transfer.
This is consistent with previous work (Qin et al., 2018; Rothfuss et al., 2020) raising concerns about overfitting to the meta-training tasks and observing that MAML requires a large number of tasks to generalize well. In contrast, by its very construction on meta-generalization bounds, PACOH-NN is able to achieve positive transfer even when the meta-training tasks are diverse and small in number. Scalability. Unlike MLAP (Amit & Meir, 2018), PACOH-NN does not need to maintain posteriors Q_i for the meta-training tasks and can use mini-batching at the task level. As a result, it is computationally much faster and more scalable than previous PAC-Bayesian meta-learners. This is reflected in its computation and memory complexity, discussed in Section 5. Figure 2 showcases this computational advantage during meta-training with PACOH-NN-MAP and MLAP-S in the Sinusoids environment with a varying number of tasks, reporting the maximum memory requirements as well as the training time. While MLAP's memory consumption and compute time grow linearly and become prohibitively large even for fewer than 100 tasks, PACOH-NN maintains a constant memory and compute load as the number of tasks grows.

6.2. META-LEARNING FOR BANDITS -VACCINE DEVELOPMENT

We showcase how a relevant real-world application such as vaccine design can benefit from our proposed method. In particular, the goal is to discover peptide sequences which bind to major histocompatibility complex class-I molecules (MHC-I). MHC-I molecules present fragments of proteins from within a cell to T-cells, allowing the immune system to distinguish between healthy and infected cells. Following the bandit setup of Krause & Ong (2011), each task corresponds to searching for maximally binding peptides, a vital step in the design of peptide-based vaccines. The tasks differ in their targeted MHC-I allele, i.e., they correspond to different genetic variants of the MHC-I protein. We use data from Widmer et al. (2010), which contains the binding affinities (IC50 values) of many peptide candidates to the MHC-I alleles. The peptide candidates are encoded as 45-dimensional feature vectors, and the binding affinities were standardized. We use 5 alleles (tasks) to meta-learn a BNN prior with PACOH-NN and leave the most genetically dissimilar allele (A-6901) for our bandit task. In each iteration, the experimenter (i.e., the bandit algorithm) chooses to test one peptide among the pool of more than 800 candidates and receives its binding affinity as reward feedback. In particular, we employ UCB (Lattimore & Szepesvari, 2020) and Thompson sampling (TS) (Thompson, 1933) as bandit algorithms, comparing the BNN-based bandits with meta-learned prior (PACOH-UCB/TS) against a zero-centered Gaussian BNN prior (BNN-UCB/TS) and a Gaussian process (GP-UCB) (Srinivas et al., 2009). Figure 3 reports the respective average regret and simple regret over 50 iterations. Unlike the bandit algorithms with standard BNN/GP priors, PACOH-UCB/TS reaches near-optimal regret within less than 10 iterations and, after 50 iterations, still maintains a significant performance advantage.
This highlights the importance of transfer (learning) for solving real-world problems and demonstrates the effectiveness of PACOH-NN to this end. While the majority of meta-learning methods rely on a large number of meta-training tasks (Qin et al., 2018), PACOH-NN allows us to achieve promising positive transfer even in complex real-world scenarios with only a handful of tasks (in this case, 5).
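To make the bandit loop concrete, the following is a minimal sketch of the two selection rules used above, assuming the BNN posterior is represented by an ensemble of particles whose predictions over the candidate pool are collected in a matrix; the array names and toy sizes are our own illustrative placeholders, not part of the original experiment code.

```python
import numpy as np

def ucb_action(mean, std, beta=2.0):
    """UCB rule: pick the candidate maximizing mean + beta * std."""
    return int(np.argmax(mean + beta * std))

def thompson_action(posterior_samples, rng):
    """Thompson sampling: draw one posterior function sample, act greedily on it."""
    idx = rng.integers(posterior_samples.shape[0])
    return int(np.argmax(posterior_samples[idx]))

# Toy illustration: a hypothetical ensemble of 5 posterior particles scoring
# 8 candidate peptides (rows: particles, columns: candidates).
rng = np.random.default_rng(0)
posterior_samples = rng.normal(size=(5, 8))
mean = posterior_samples.mean(axis=0)
std = posterior_samples.std(axis=0)
a_ucb = ucb_action(mean, std)
a_ts = thompson_action(posterior_samples, rng)
```

With a meta-learned prior, the posterior particles concentrate faster around plausible reward functions, which is what drives the regret improvement reported in Figure 3.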

7. CONCLUSION

Based on PAC-Bayesian theory, we present a novel, scalable algorithm for meta-learning BNN priors that overcomes previous issues of nested optimization by employing the closed-form solution of the PAC-Bayesian meta-learning problem. Experiments show that our method, PACOH-NN, not only comes with computational advantages, but also achieves comparable or better predictive accuracy than several popular meta-learning approaches, while improving the quality of the uncertainty estimates, a key aspect of our approach. The benefits of our principled treatment of uncertainty, showcased in the real-world vaccine-development bandit task, make PACOH-NN particularly attractive for interactive machine learning systems. This renders the integration of PACOH-NN with Bayesian optimization and reinforcement learning a promising avenue to pursue. While our experiments are limited to diagonal Gaussian priors and SVGD as approximate inference method, we hope that future work will build on the added flexibility of our framework to explore more recent approaches in variational inference (e.g., Wang et al., 2019) or consider more expressive priors such as normalizing flows (Rezende & Mohamed, 2015).

APPENDIX A PROOFS AND DERIVATIONS

A.1 PROOF OF THEOREM 2

Lemma 1 (Change of measure inequality). Let $f$ be a random variable taking values in a set $A$ and let $X_1, \dots, X_l$ be independent random variables, where $X_k$ has distribution $\mu_k$. For functions $g_k : A \times A \to \mathbb{R}$, $k = 1, \dots, l$, let $\xi_k(f) = \mathbb{E}_{X_k \sim \mu_k}[g_k(f, X_k)]$ denote the expectation of $g_k$ under $\mu_k$ for any fixed $f \in A$. Then, for any fixed distributions $\pi, \rho \in \mathcal{M}(A)$ and any $\lambda > 0$, we have
$$\mathbb{E}_{f \sim \rho}\Big[\sum_{k=1}^{l} \xi_k(f) - g_k(f, X_k)\Big] \le \frac{1}{\lambda}\Big( D_{KL}(\rho\|\pi) + \ln \mathbb{E}_{f \sim \pi}\, e^{\lambda \sum_{k=1}^{l} \xi_k(f) - g_k(f, X_k)} \Big)\,.$$

Proof of Theorem 2. To prove the theorem, we need to bound the difference between the transfer error $\mathcal{L}(\mathcal{Q}, \mathcal{T})$ and the empirical multi-task error $\hat{\mathcal{L}}(\mathcal{Q}, S_1, \dots, S_n)$. To this end, we introduce an intermediate quantity, the expected multi-task error:
$$\tilde{\mathcal{L}}(\mathcal{Q}, \mathcal{D}_1, \dots, \mathcal{D}_n) = \mathbb{E}_{P \sim \mathcal{Q}}\Big[\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{S_i \sim \mathcal{D}_i^{m_i}}\big[\mathcal{L}(Q(S_i, P), \mathcal{D}_i)\big]\Big]\,.$$
In the following, we invoke Lemma 1 twice. First, in step 1, we bound the difference between $\tilde{\mathcal{L}}(\mathcal{Q}, \mathcal{D}_1, \dots, \mathcal{D}_n)$ and $\hat{\mathcal{L}}(\mathcal{Q}, S_1, \dots, S_n)$; then, in step 2, the difference between $\mathcal{L}(\mathcal{Q}, \mathcal{T})$ and $\tilde{\mathcal{L}}(\mathcal{Q}, \mathcal{D}_1, \dots, \mathcal{D}_n)$. Finally, in step 3, we use a union-bound argument to combine both results.

Step 1 (Task-specific generalization). First, we bound the generalization error of the observed tasks $\tau_i = (\mathcal{D}_i, m_i)$, $i = 1, \dots, n$, when using a learning algorithm $Q : \mathcal{M}(\mathcal{H}) \times \mathcal{Z}^{m_i} \to \mathcal{M}(\mathcal{H})$ which outputs a posterior distribution $Q_i = Q(S_i, P)$ over hypotheses $\theta$, given a prior distribution $P$ and a dataset $S_i \sim \mathcal{D}_i^{m_i}$ of size $m_i$. Here, we define $\tilde{m} := \big(\frac{1}{n}\sum_{i=1}^{n} m_i^{-1}\big)^{-1}$ as the harmonic mean of the sample sizes. In particular, we apply Lemma 1 to the union of all training sets $S = \bigcup_{i=1}^{n} S_i$ with $l = \sum_{i=1}^{n} m_i$. Hence, each $X_k$ corresponds to one data point, i.e., $X_k = z_{ij}$ with $\mu_k = \mathcal{D}_i$. Further, we set $f = (P, h_1, \dots, h_n)$ to be a tuple of one prior and $n$ base hypotheses. This can be understood as a two-level hypothesis, wherein $P$ constitutes a hypothesis for the meta-learning problem and $h_i$ a hypothesis for solving the supervised task $\tau_i$.
Correspondingly, we take $\pi = (\mathcal{P}, P^n) = \mathcal{P} \cdot \prod_{i=1}^{n} P$ and $\rho = (\mathcal{Q}, Q^n) = \mathcal{Q} \cdot \prod_{i=1}^{n} Q_i$ as joint two-level distributions and $g_k(f, X_k) = \frac{1}{n m_i}\, l(h_i, z_{ij})$ as the summands of the empirical multi-task error. We can now invoke Lemma 1 to obtain
$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P\sim\mathcal{Q}}[\mathcal{L}(Q_i, \mathcal{D}_i)] \le \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P\sim\mathcal{Q}}[\hat{\mathcal{L}}(Q_i, S_i)] + \frac{1}{\lambda}\Big(D_{KL}[(\mathcal{Q}, Q^n)\|(\mathcal{P}, P^n)] + \ln \mathbb{E}_{P\sim\mathcal{P}}\mathbb{E}_{h\sim P}\, e^{\frac{\lambda}{n}\sum_{i=1}^{n}(\mathcal{L}(h,\mathcal{D}_i) - \hat{\mathcal{L}}(h, S_i))}\Big)\,. \quad (11)$$
Using the above definitions, the KL-divergence term can be re-written in the following way:
$$D_{KL}[(\mathcal{Q}, Q^n)\|(\mathcal{P}, P^n)] = \mathbb{E}_{P\sim\mathcal{Q}}\,\mathbb{E}_{h_i\sim Q_i}\Big[\ln\frac{\mathcal{Q}(P)\prod_{i=1}^{n} Q_i(h_i)}{\mathcal{P}(P)\prod_{i=1}^{n} P(h_i)}\Big] \quad (12)$$
$$= \mathbb{E}_{P\sim\mathcal{Q}}\Big[\ln\frac{\mathcal{Q}(P)}{\mathcal{P}(P)}\Big] + \sum_{i=1}^{n}\mathbb{E}_{P\sim\mathcal{Q}}\,\mathbb{E}_{h\sim Q_i}\Big[\ln\frac{Q_i(h)}{P(h)}\Big] \quad (13)$$
$$= D_{KL}(\mathcal{Q}\|\mathcal{P}) + \sum_{i=1}^{n}\mathbb{E}_{P\sim\mathcal{Q}}[D_{KL}(Q_i\|P)]\,. \quad (14)$$
Using (11) and (14), we can bound the expected multi-task error as follows:
$$\tilde{\mathcal{L}}(\mathcal{Q}, \mathcal{D}_1, \dots, \mathcal{D}_n) \le \hat{\mathcal{L}}(\mathcal{Q}, S_1, \dots, S_n) + \frac{1}{\lambda}D_{KL}(\mathcal{Q}\|\mathcal{P}) + \frac{1}{\lambda}\sum_{i=1}^{n}\mathbb{E}_{P\sim\mathcal{Q}}[D_{KL}(Q_i\|P)] + \Upsilon_I(\lambda)\,, \quad (15)$$
wherein $\Upsilon_I(\lambda) = \frac{1}{\lambda}\ln\mathbb{E}_{P\sim\mathcal{P}}\mathbb{E}_{h\sim P}\, e^{\frac{\lambda}{n}\sum_{i=1}^{n}(\mathcal{L}(h,\mathcal{D}_i) - \hat{\mathcal{L}}(h, S_i))}$.

Step 2 (Task-environment generalization). Now, we apply Lemma 1 on the meta-level. For that, we treat each task as a random variable and instantiate the components as $X_k = \tau_i$, $l = n$, and $\mu_k = \mathcal{T}$. Furthermore, we set $\rho = \mathcal{Q}$, $\pi = \mathcal{P}$, $f = P$, and $g_k(f, X_k) = \frac{1}{n}\mathcal{L}(Q_i, \mathcal{D}_i)$.
This allows us to bound the transfer error as
$$\mathcal{L}(\mathcal{Q}, \mathcal{T}) \le \tilde{\mathcal{L}}(\mathcal{Q}, \mathcal{D}_1, \dots, \mathcal{D}_n) + \frac{1}{\beta}D_{KL}(\mathcal{Q}\|\mathcal{P}) + \Upsilon_{II}(\beta)\,. \quad (16)$$
Combining (15) with (16), we obtain
$$\mathcal{L}(\mathcal{Q}, \mathcal{T}) \le \hat{\mathcal{L}}(\mathcal{Q}, S_1, \dots, S_n) + \Big(\frac{1}{\beta} + \frac{1}{\lambda}\Big)D_{KL}(\mathcal{Q}\|\mathcal{P}) + \frac{1}{\lambda}\sum_{i=1}^{n}\mathbb{E}_{P\sim\mathcal{Q}}[D_{KL}(Q_i\|P)] + \Upsilon_I(\lambda) + \Upsilon_{II}(\beta)\,. \quad (17)$$

Step 3 (Bounding the moment-generating functions).

Case I: bounded loss. If the loss function $l(h, z)$ is bounded in $[a, b]$, we can apply Hoeffding's lemma to each factor in (18), obtaining
$$e^{\Upsilon_I(\lambda) + \Upsilon_{II}(\beta)} \le \Big(\mathbb{E}_{P\sim\mathcal{P}}\, e^{\frac{\beta^2}{8n}(b-a)^2}\Big)^{1/\beta}\cdot\Big(\mathbb{E}_{P\sim\mathcal{P}}\mathbb{E}_{h\sim P}\, e^{\frac{\lambda^2}{8n\tilde{m}}(b-a)^2}\Big)^{1/\lambda} \quad (19)$$
$$= e^{\big(\frac{\beta}{8n} + \frac{\lambda}{8n\tilde{m}}\big)(b-a)^2}\,. \quad (20)$$
Next, we factor out $\sqrt{n}$ from $\lambda$ and $\beta$, obtaining
$$e^{\Upsilon_I(\lambda) + \Upsilon_{II}(\beta)} = \Big(e^{\Upsilon_I(\lambda\sqrt{n}) + \Upsilon_{II}(\beta\sqrt{n})}\Big)^{\frac{1}{\sqrt{n}}}\,. \quad (21)$$
Using
$$\mathbb{E}_{\mathcal{T}}\,\mathbb{E}_{\mathcal{D}_1}\cdots\mathbb{E}_{\mathcal{D}_n}\Big[e^{\Upsilon_I(\lambda\sqrt{n}) + \Upsilon_{II}(\beta\sqrt{n})}\Big] \le e^{\big(\frac{\beta}{8\sqrt{n}} + \frac{\lambda}{8\sqrt{n}\,\tilde{m}}\big)(b-a)^2}\,, \quad (22)$$
we can apply Markov's inequality w.r.t. the expectations over the task distribution $\mathcal{T}$ and the data distributions $\mathcal{D}_i$ to obtain that
$$\Upsilon_I(\lambda) + \Upsilon_{II}(\beta) \le \frac{\beta}{8n}(b-a)^2 + \frac{\lambda}{8n\tilde{m}}(b-a)^2 - \frac{1}{\sqrt{n}}\ln\delta$$
with probability at least $1-\delta$.

Case II: sub-gamma loss. First, we assume that, for all $i = 1, \dots, n$, the random variables $V^I_i := \mathcal{L}(h, \mathcal{D}_i) - l(h, z_{ij})$ are sub-gamma with variance factor $s_I^2$ and scale parameter $c_I$ under the two-level prior $(\mathcal{P}, P)$ and the respective data distribution $\mathcal{D}_i$. That is, their moment-generating function can be bounded by that of a Gamma distribution $\Gamma(s_I^2, c_I)$:
$$\mathbb{E}_{z\sim\mathcal{D}_i}\mathbb{E}_{P\sim\mathcal{P}}\mathbb{E}_{h\sim P}\, e^{\gamma(\mathcal{L}(h,\mathcal{D}_i) - l(h,z))} \le \exp\Big(\frac{\gamma^2 s_I^2}{2(1 - c_I\gamma)}\Big) \quad \forall \gamma\in(0, 1/c_I)\,.$$
Second, we assume that the random variable $V^{II} := \mathbb{E}_{(\mathcal{D},S)\sim\mathcal{T}}[\mathcal{L}(Q(P,S), \mathcal{D})] - \mathcal{L}(Q(P, S_i), \mathcal{D}_i)$ is sub-gamma with variance factor $s_{II}^2$ and scale parameter $c_{II}$ under the hyper-prior $\mathcal{P}$ and the task distribution $\mathcal{T}$.
That is, its moment-generating function can be bounded by that of a Gamma distribution $\Gamma(s_{II}^2, c_{II})$:
$$\mathbb{E}_{(\mathcal{D},S)\sim\mathcal{T}}\,\mathbb{E}_{P\sim\mathcal{P}}\, e^{\gamma\big(\mathbb{E}_{(\mathcal{D}',S')\sim\mathcal{T}}[\mathcal{L}(Q(P,S'),\mathcal{D}')] - \mathcal{L}(Q(P,S),\mathcal{D})\big)} \le \exp\Big(\frac{\gamma^2 s_{II}^2}{2(1 - c_{II}\gamma)}\Big) \quad \forall\gamma\in(0, 1/c_{II})\,.$$
These two assumptions allow us to bound the expectation of (18) as follows:
$$\mathbb{E}\Big[e^{\Upsilon_I(\lambda) + \Upsilon_{II}(\beta)}\Big] \le \exp\Big(\frac{\lambda s_I^2}{2n\tilde{m}\,(1 - c_I\lambda/(n\tilde{m}))}\Big)\cdot\exp\Big(\frac{\beta s_{II}^2}{2n\,(1 - c_{II}\beta/n)}\Big)\,.$$
Next, we factor out $\sqrt{n}$ from $\lambda$ and $\beta$, obtaining
$$e^{\Upsilon_I(\lambda) + \Upsilon_{II}(\beta)} = \Big(e^{\Upsilon_I(\lambda\sqrt{n}) + \Upsilon_{II}(\beta\sqrt{n})}\Big)^{\frac{1}{\sqrt{n}}}\,.$$
Finally, by using Markov's inequality, we obtain that
$$\Upsilon_I(\lambda) + \Upsilon_{II}(\beta) \le \frac{\lambda s_I^2}{2n\tilde{m}\,(1 - c_I\lambda/(n\tilde{m}))} + \frac{\beta s_{II}^2}{2n\,(1 - c_{II}\beta/n)} - \frac{1}{\sqrt{n}}\ln\delta$$
with probability at least $1 - \delta$. To conclude the proof, we choose $\lambda = n\sqrt{\tilde{m}}$ and $\beta = \sqrt{n}$. $\square$
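As a quick sanity check (our own, not part of the original derivation), substituting the final choices $\lambda = n\sqrt{\tilde{m}}$ and $\beta = \sqrt{n}$ into the bounded-loss slack makes the convergence rates explicit:

```latex
\Upsilon_I(n\sqrt{\tilde m}) + \Upsilon_{II}(\sqrt n)
  \;\le\; \frac{\sqrt n}{8n}(b-a)^2 + \frac{n\sqrt{\tilde m}}{8n\tilde m}(b-a)^2 - \frac{\ln\delta}{\sqrt n}
  \;=\; \Big(\frac{1}{8\sqrt n} + \frac{1}{8\sqrt{\tilde m}}\Big)(b-a)^2 - \frac{\ln\delta}{\sqrt n}.
```

Both slack terms thus decay at rates $O(1/\sqrt{n})$ and $O(1/\sqrt{\tilde m})$, so the bound tightens as the number of tasks and the (harmonic mean) number of samples per task grow.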

A.2 PROOF OF COROLLARY 1

Proof of Corollary 1. When we choose the posterior $Q_i$ as the optimal Gibbs posterior $Q^*_i := Q^*(S_i, P)$, it follows that
$$\hat{\mathcal{L}}(\mathcal{Q}, S_1, \dots, S_n) + \frac{1}{n}\sum_{i=1}^{n}\frac{1}{\sqrt{m_i}}\,\mathbb{E}_{P\sim\mathcal{Q}}[D_{KL}(Q^*_i\|P)] \quad (31)$$
$$= \frac{1}{n}\sum_{i=1}^{n}\Big(\mathbb{E}_{P\sim\mathcal{Q}}\mathbb{E}_{h\sim Q^*_i}[\hat{\mathcal{L}}(h, S_i)] + \frac{1}{\sqrt{m_i}}\,\mathbb{E}_{P\sim\mathcal{Q}}[D_{KL}(Q^*_i\|P)]\Big) \quad (32)$$
$$= \frac{1}{n}\sum_{i=1}^{n}\frac{1}{\sqrt{m_i}}\,\mathbb{E}_{P\sim\mathcal{Q}}\mathbb{E}_{h\sim Q^*_i}\Big[\sqrt{m_i}\,\hat{\mathcal{L}}(h, S_i) + \ln\frac{Q^*_i(h)}{P(h)}\Big] \quad (33)$$
$$= \frac{1}{n}\sum_{i=1}^{n}\frac{1}{\sqrt{m_i}}\,\mathbb{E}_{P\sim\mathcal{Q}}\mathbb{E}_{h\sim Q^*_i}\Big[\sqrt{m_i}\,\hat{\mathcal{L}}(h, S_i) + \ln\frac{P(h)\,e^{-\sqrt{m_i}\hat{\mathcal{L}}(h, S_i)}}{P(h)\,Z(S_i, P)}\Big] \quad (34)$$
$$= \frac{1}{n}\sum_{i=1}^{n}\frac{1}{\sqrt{m_i}}\big(-\mathbb{E}_{P\sim\mathcal{Q}}[\ln Z(S_i, P)]\big)\,. \quad (35)$$
This allows us to write the inequality in (3) as
$$\mathcal{L}(\mathcal{Q}, \mathcal{T}) \le -\frac{1}{n}\sum_{i=1}^{n}\frac{1}{\sqrt{m_i}}\,\mathbb{E}_{P\sim\mathcal{Q}}[\ln Z(S_i, P)] + \Big(\frac{1}{\sqrt{n}} + \frac{1}{n\sqrt{\tilde{m}}}\Big)D_{KL}(\mathcal{Q}\|\mathcal{P}) + C(\delta, n, \tilde{m})\,. \quad (36)$$
According to Lemma 2 (Catoni, 2007), the Gibbs posterior $Q^*(S_i, P)$ is the minimizer of (33); in particular, $\forall P\in\mathcal{M}(\mathcal{H})$, $\forall i = 1, \dots, n$:
$$Q^*(S_i, P) = \frac{P(h)\,e^{-\sqrt{m_i}\,\hat{\mathcal{L}}(h, S_i)}}{Z(S_i, P)} = \arg\min_{Q\in\mathcal{M}(\mathcal{H})}\;\mathbb{E}_{h\sim Q}[\hat{\mathcal{L}}(h, S_i)] + \frac{1}{\sqrt{m_i}}\,D_{KL}(Q\|P)\,. \quad (37)$$
Hence, we can write
$$\mathcal{L}(\mathcal{Q}, \mathcal{T}) \le -\frac{1}{n}\sum_{i=1}^{n}\frac{1}{\sqrt{m_i}}\,\mathbb{E}_{P\sim\mathcal{Q}}[\ln Z(S_i, P)] + \Big(\frac{1}{\sqrt{n}} + \frac{1}{n\sqrt{\tilde{m}}}\Big)D_{KL}(\mathcal{Q}\|\mathcal{P}) + C(\delta, n, \tilde{m}) \quad (38)$$
$$= \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P\sim\mathcal{Q}}\Big[\min_{Q\in\mathcal{M}(\mathcal{H})}\hat{\mathcal{L}}(Q, S_i) + \frac{1}{\sqrt{m_i}}\,D_{KL}(Q\|P)\Big] + \Big(\frac{1}{\sqrt{n}} + \frac{1}{n\sqrt{\tilde{m}}}\Big)D_{KL}(\mathcal{Q}\|\mathcal{P}) + C(\delta, n, \tilde{m}) \quad (40)$$
$$\le \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P\sim\mathcal{Q}}\Big[\hat{\mathcal{L}}(Q_i, S_i) + \frac{1}{\sqrt{m_i}}\,D_{KL}(Q_i\|P)\Big] + \Big(\frac{1}{\sqrt{n}} + \frac{1}{n\sqrt{\tilde{m}}}\Big)D_{KL}(\mathcal{Q}\|\mathcal{P}) + C(\delta, n, \tilde{m}) \quad (42)$$
$$= \hat{\mathcal{L}}(\mathcal{Q}, S_1, \dots, S_n) + \Big(\frac{1}{\sqrt{n}} + \frac{1}{n\sqrt{\tilde{m}}}\Big)D_{KL}(\mathcal{Q}\|\mathcal{P}) + \frac{1}{n}\sum_{i=1}^{n}\frac{1}{\sqrt{m_i}}\,\mathbb{E}_{P\sim\mathcal{Q}}[D_{KL}(Q_i\|P)] + C(\delta, n, \tilde{m})\,, \quad (43)$$
which proves that the bound for Gibbs-optimal base learners in (36) and (4) is tighter than the bound in Theorem 2, which holds uniformly for all $Q\in\mathcal{M}(\mathcal{H})$. $\square$

The SVGD update steps of Algorithm 3 read: for $l = 1, \dots, L$ do
$$\nabla_{\theta^k_l}\ln Q^*(\theta^k_l) \leftarrow \nabla_{\theta^k_l}\ln P_{\phi_k}(\theta^k_l) - \sqrt{m}\,\nabla_{\theta^k_l}\hat{\mathcal{L}}(\theta^k_l, S) \quad \text{// compute score}$$
$$\theta^k_l \leftarrow \theta^k_l + \frac{\nu}{L}\sum_{l'=1}^{L}\Big[k(\theta^k_{l'}, \theta^k_l)\,\nabla_{\theta^k_{l'}}\ln Q^*(\theta^k_{l'}) + \nabla_{\theta^k_{l'}}k(\theta^k_{l'}, \theta^k_l)\Big]\;\;\forall l\in[L] \quad \text{// SVGD update}$$
The partition function is estimated via the log-sum-exp (LSE) of Monte Carlo samples:
$$\ln\tilde{Z}(S_i, P_\phi) := \ln\frac{1}{L}\sum_{l=1}^{L}e^{-\sqrt{m_i}\,\hat{\mathcal{L}}(\theta_l, S_i)} = \mathrm{LSE}_{l=1}^{L}\big(-\sqrt{m_i}\,\hat{\mathcal{L}}(\theta_l, S_i)\big) - \ln L\,, \quad \theta_l\sim P_\phi\,.$$
Since the Monte Carlo estimator involves approximating an expectation of an exponential, it is not unbiased.
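The LSE estimator above takes only a few lines; in the sketch below, `losses` stands for the empirical losses of the L prior samples (the function and argument names are ours), and the max-shift makes the computation numerically stable:

```python
import numpy as np

def log_Z_tilde(losses, m):
    """LSE Monte Carlo estimate of ln Z(S_i, P):
    ln (1/L) * sum_l exp(-sqrt(m) * L_hat(theta_l, S_i)),
    where `losses` has shape (L,) with losses of L prior samples theta_l ~ P."""
    a = -np.sqrt(m) * np.asarray(losses, dtype=float)
    a_max = a.max()  # shift for numerical stability (log-sum-exp trick)
    return float(a_max + np.log(np.exp(a - a_max).mean()))
```

With L = 1 the estimator reduces to the naive estimate -sqrt(m) * loss, matching the edge case discussed in Appendix B; by Jensen's inequality its expectation lower-bounds ln Z, so the negated estimate still yields a valid upper bound.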
However, we can show that, replacing $\ln Z(S_i, P_\phi)$ by the estimator $\ln\tilde{Z}(S_i, P_\phi)$, we still minimize a valid upper bound on the transfer error (see Proposition 3).

Proposition 3. In expectation, replacing $\ln Z(S_i, P_\phi)$ in (4) by the Monte Carlo estimate $\ln\tilde{Z}(S_i, P) := \ln\frac{1}{L}\sum_{l=1}^{L}e^{-\sqrt{m_i}\,\hat{\mathcal{L}}(\theta_l, S_i)}$, $\theta_l \sim P$, still yields a valid upper bound on the transfer error. In particular, it holds that
$$\mathcal{L}(\mathcal{Q}, \mathcal{T}) \le -\frac{1}{n}\sum_{i=1}^{n}\frac{1}{\sqrt{m_i}}\,\mathbb{E}_{P\sim\mathcal{Q}}[\ln Z(S_i, P)] + \Big(\frac{1}{\sqrt{n}} + \frac{1}{n\sqrt{\tilde{m}}}\Big)D_{KL}(\mathcal{Q}\|\mathcal{P}) + C(\delta, n, \tilde{m}) \quad (51)$$
$$\le -\frac{1}{n}\sum_{i=1}^{n}\frac{1}{\sqrt{m_i}}\,\mathbb{E}_{P\sim\mathcal{Q}}\,\mathbb{E}_{\theta_1,\dots,\theta_L\sim P}\big[\ln\tilde{Z}(S_i, P)\big] + \Big(\frac{1}{\sqrt{n}} + \frac{1}{n\sqrt{\tilde{m}}}\Big)D_{KL}(\mathcal{Q}\|\mathcal{P}) + C(\delta, n, \tilde{m})\,. \quad (52)$$

Proof. First, we show that
$$\mathbb{E}_{\theta_1,\dots,\theta_L\sim P}\big[\ln\tilde{Z}(S_i, P)\big] = \mathbb{E}_{\theta_1,\dots,\theta_L\sim P}\Big[\ln\frac{1}{L}\sum_{l=1}^{L}e^{-\sqrt{m_i}\,\hat{\mathcal{L}}(\theta_l, S_i)}\Big] \le \ln\frac{1}{L}\sum_{l=1}^{L}\mathbb{E}_{\theta_l\sim P}\Big[e^{-\sqrt{m_i}\,\hat{\mathcal{L}}(\theta_l, S_i)}\Big] = \ln\mathbb{E}_{\theta\sim P}\Big[e^{-\sqrt{m_i}\,\hat{\mathcal{L}}(\theta, S_i)}\Big] = \ln Z(S_i, P)\,, \quad (53)$$
which follows directly from Jensen's inequality and the concavity of the logarithm. Proposition 3 then follows directly from (53). $\square$

C.1.4 PHYSIONET

In this work, we treat each patient as a separate task and the different clinical variables as different environments. We use the Glasgow Coma Scale (GCS) and the hematocrit value (HCT) as environments for our study, since they are among the most frequently measured variables in this dataset. From the dataset, we remove all patients for whom fewer than four measurements of GCS (and HCT, respectively) are available. From the remaining patients, we use 100 patients for meta-training and 500 patients each for meta-validation and meta-testing. Here, each patient corresponds to a task. Since the number of available measurements differs across patients, the number of training points $m_i$ ranges between 4 and 24.

C.1.5 BERKELEY-SENSOR

We use data from 46 sensors deployed in different locations at the Intel Research lab in Berkeley (Madden, 2004). The dataset contains 4 days of data, sampled at 10-minute intervals. Each task corresponds to one of the 46 sensors and requires auto-regressive prediction, in particular, predicting the next temperature measurement given the last 10 measurement values. In that, 36 sensors (tasks) with data for the first two days are used for meta-training, whereas the remaining 10 sensors with data for the last two days are employed for meta-testing. Note that we separate meta-training and meta-testing data both temporally and spatially, since the data is non-i.i.d. For meta-testing, we use the 3rd day as context data, i.e., for target training, and the remaining data for target testing.
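The auto-regressive task construction described above (predict the next measurement from the last 10) amounts to a sliding-window transformation; a minimal sketch, with function and argument names of our choosing:

```python
import numpy as np

def make_autoregressive_task(series, context_len=10):
    """Turn a univariate temperature series into (x, y) pairs where each
    x holds the last `context_len` measurements and y the next one."""
    series = np.asarray(series, dtype=float)
    x = np.stack([series[i:i + context_len]
                  for i in range(len(series) - context_len)])
    y = series[context_len:]
    return x, y

# Example: a toy series of 15 readings yields 5 regression samples.
x, y = make_autoregressive_task(np.arange(15.0), context_len=10)
```

Applying this per sensor yields one regression dataset per task, matching the environment layout in Table S1.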

C.2 EXPERIMENTAL METHODOLOGY

In the following, we describe our experimental methodology and provide details on how the empirical results reported in Section 6 were generated. Overall, evaluating a meta-learner consists of two phases, meta-training and meta-testing, outlined in Appendix B. The latter can be further sub-divided into target training and target testing. As performance metrics, we report the root mean squared error (RMSE) and the calibration error (see Appendix C.2.1). Note that, unlike e.g. Rothfuss et al. (2019a) who report the test log-likelihood, we aim to measure the quality of the mean predictions and the quality of the uncertainty estimates separately, thus reporting both RMSE and calibration error. The described meta-training and meta-testing procedure is repeated for five random seeds that influence both the initialization and the gradient estimates of the concerned algorithms. The reported averages and standard deviations are based on the results obtained for the different seeds.

C.2.1 CALIBRATION ERROR

The concept of calibration applies to probabilistic predictors that, given a new target input $x_j$, produce a probability distribution $\hat{p}(y_j|x_j)$ over predicted target values $y_j$. Corresponding to the predictive density, we denote the predictor's cumulative density function (CDF) as $\hat{F}(y_j|x_j) = \int_{-\infty}^{y_j}\hat{p}(y|x_j)\,dy$. For confidence levels $0 \le q_1 < \dots < q_H \le 1$, we can compute the corresponding empirical frequencies
$$\hat{q}_h = \frac{\big|\{y_j \,\big|\, \hat{F}(y_j|x_j) \le q_h,\; j = 1, \dots, m\}\big|}{m}\,,$$
based on a dataset $S = \{(x_j, y_j)\}_{j=1}^{m}$ of $m$ samples. If the predictions are calibrated, we would expect that $\hat{q}_h \to q_h$ as $m \to \infty$. Similar to Kuleshov et al. (2018), we define the calibration error as a function of the residuals $\hat{q}_h - q_h$, in particular,
$$\text{calib-err} = \frac{1}{H}\sum_{h=1}^{H}|\hat{q}_h - q_h|\,. \quad (64)$$
Note that, while Kuleshov et al. (2018) report the average of squared residuals $|\hat{q}_h - q_h|^2$, we report the average of absolute residuals $|\hat{q}_h - q_h|$ in order to preserve the units and keep the calibration error easier to interpret. In our experiments, we compute (64) with $H = 20$ equally spaced confidence levels between 0 and 1.
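Given the predictive CDF values $\hat{F}(y_j|x_j)$ of the test points, the calibration error in (64) reduces to a comparison of empirical frequencies against the confidence levels; a minimal sketch (how the CDF values are obtained depends on the predictor):

```python
import numpy as np

def calibration_error(cdf_values, num_levels=20):
    """Average absolute gap between confidence levels q_h and empirical
    frequencies q_hat_h = fraction of points with F(y_j | x_j) <= q_h."""
    cdf_values = np.asarray(cdf_values, dtype=float)
    levels = np.linspace(0.0, 1.0, num_levels)
    emp_freq = np.array([(cdf_values <= q).mean() for q in levels])
    return float(np.abs(emp_freq - levels).mean())
```

A perfectly calibrated predictor yields CDF values uniformly distributed on [0, 1], so the error approaches 0 as the number of test points grows; a predictor whose CDF values pile up at 0 or 1 is maximally miscalibrated.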

C.3 HYPER-PARAMETER SELECTION

For each of the meta-environments and algorithms, we ran a separate hyper-parameter search to select the hyper-parameters. In particular, we use the hyperopt package (Bergstra et al., 2013), which performs Bayesian optimization based on the Tree-structured Parzen Estimator. As optimization metric, we employ the average log-likelihood, evaluated on a separate validation set of tasks. The scripts for reproducing the hyper-parameter search are included in our code repository. For the reported results, we provide the selected hyper-parameters and detailed evaluation results under [Link will be added upon acceptance].

C.4 META-LEARNING FOR BANDITS - VACCINE DEVELOPMENT

In this section, we provide additional details on the experiment in Section 6.2. We use data from Widmer et al. (2010), which contains the binding affinities (IC50 values) of many peptide candidates to seven different MHC-I alleles. Peptides with IC50 > 500nM are considered non-binders, all others binders. Following Krause & Ong (2011), we convert the IC50 values into negative log-scale and normalize them such that 500nM corresponds to zero, i.e., $r := -\log_{10}(\text{IC}_{50}) + \log_{10}(500)$, which is used as the reward signal of our bandit.
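The reward transformation above can be sketched in one line; note that it maps the 500nM binder/non-binder threshold to exactly zero, and stronger binders (lower IC50) to larger rewards:

```python
import math

def reward(ic50_nm):
    """Negative log-scale binding affinity, normalized so that
    IC50 = 500nM (the binder threshold) maps to r = 0."""
    return -math.log10(ic50_nm) + math.log10(500.0)
```

For instance, a strong binder with IC50 = 50nM obtains reward 1, while a non-binder with IC50 = 5000nM obtains reward -1.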



http://hyperopt.github.io/hyperopt/
[Link will be added upon acceptance]



Figure 1: BNN posterior predictions with (a) standard Gaussian prior vs. (b) meta-learned prior. Meta-learning with PACOH-NN-SVGD was conducted on the Sinusoids environment.

Figure 2: Scalability comparison of PACOH-NN and MLAP-S in memory footprint and compute time, as the number of meta-training tasks grows.

Figure 3: MHC-I peptide design task: Regret for different priors and bandit algorithms. A metalearned PACOH-NN prior substantially improves the regret, compared to a standard BNN/GP prior.

wherein $\Upsilon_{II}(\beta) = \frac{1}{\beta}\ln\mathbb{E}_{P\sim\mathcal{P}}\, e^{\frac{\beta}{n}\sum_{i=1}^{n}\mathbb{E}_{(\mathcal{D},S)\sim\mathcal{T}}[\mathcal{L}(Q(P,S),\mathcal{D})] - \mathcal{L}(Q(P,S_i),\mathcal{D}_i)}$.


Lemma 2 (Catoni, 2007). Let $A$ be a set, $g : A \to \mathbb{R}$ a function, and $\rho, \pi \in \mathcal{M}(A)$ probability densities over $A$. Then, for any $\beta > 0$, the density
$$\rho^*(a) := \frac{\pi(a)\, e^{-\beta g(a)}}{Z} = \frac{\pi(a)\, e^{-\beta g(a)}}{\mathbb{E}_{a'\sim\pi}\big[e^{-\beta g(a')}\big]} \quad (29)$$
is the minimizing probability density
$$\arg\min_{\rho \in \mathcal{M}(A)}\; \beta\, \mathbb{E}_{a\sim\rho}[g(a)] + D_{KL}(\rho\|\pi)\,. \quad (30)$$

probability distribution given as $\hat{p}(y^*|x^*, S) = \frac{1}{KL}\sum_{k=1}^{K}\sum_{l=1}^{L} p\big(y^*\,\big|\,h_{\theta^k_l}(x^*)\big)$. We evaluate the quality of the predictions on a held-out test dataset $S^* \sim \mathcal{D}$ from the same task, in a target-testing phase (see Appendix C.2).

Algorithm 3 PACOH-NN-SVGD: meta-testing
Input: set of priors $\{P_{\phi_1}, \dots, P_{\phi_K}\}$, target training dataset $S$, evaluation point $x^*$
Input: kernel function $k(\cdot,\cdot)$, SVGD step size $\nu$, number of particles $L$
for $k = 1, \dots, K$ do
    $\{\theta^k_1, \dots, \theta^k_L\} \sim P_{\phi_k}$ // initialize NN posterior particles from the $k$-th prior
    while not converged do
        for $l = 1, \dots, L$ do
            $\nabla_{\theta^k_l}\ln Q^*(\theta^k_l) \leftarrow \nabla_{\theta^k_l}\ln P_{\phi_k}(\theta^k_l) - \sqrt{m}\,\nabla_{\theta^k_l}\hat{\mathcal{L}}(\theta^k_l, S)$ // compute score
        $\theta^k_l \leftarrow \theta^k_l + \frac{\nu}{L}\sum_{l'=1}^{L}\big[k(\theta^k_{l'}, \theta^k_l)\,\nabla_{\theta^k_{l'}}\ln Q^*(\theta^k_{l'}) + \nabla_{\theta^k_{l'}}k(\theta^k_{l'}, \theta^k_l)\big]\;\;\forall l\in[L]$ // SVGD update
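The SVGD update above can be sketched for a generic particle set as follows; the RBF kernel, its bandwidth, and the `score_fn` placeholder (standing in for the gradient of the log-posterior, i.e., prior score minus scaled loss gradient) are our own illustrative choices:

```python
import numpy as np

def rbf_kernel(particles, bandwidth=1.0):
    """RBF kernel matrix K and the repulsion term sum_l' grad_{theta_l'} k(theta_l', theta_l)."""
    diffs = particles[:, None, :] - particles[None, :, :]   # (L, L, d)
    sq_dists = (diffs ** 2).sum(-1)                         # (L, L)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    # gradient w.r.t. the first argument, summed over l': shape (L, d)
    grad_K = -(K[:, :, None] * diffs / bandwidth ** 2).sum(axis=0)
    return K, grad_K

def svgd_step(particles, score_fn, step_size=1e-2, bandwidth=1.0):
    """One SVGD step: theta_l += nu/L * sum_l' [k(l', l) score(l') + grad k(l', l)]."""
    L = particles.shape[0]
    K, grad_K = rbf_kernel(particles, bandwidth)
    scores = np.stack([score_fn(p) for p in particles])     # (L, d)
    phi = (K @ scores + grad_K) / L                         # driving + repulsive force
    return particles + step_size * phi

# One step with a standard-normal score drives distant particles toward zero.
new = svgd_step(np.array([[10.0], [-10.0]]), lambda p: -p)
```

The kernel term pushes particles apart (maintaining diversity), while the score term pulls them toward high-posterior regions, which is exactly the trade-off exploited in target training.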

Figure S3: Accelerator of the Swiss Free-Electron Laser (SwissFEL).

Figure S1 illustrates these different stages of our PAC-Bayesian meta-learning framework. The outcome of the training procedure is an approximation of the generalized Bayesian posterior $Q^*(S, P)$ (see Appendix B), pertaining to an unseen task $\tau = (\mathcal{D}, m) \sim \mathcal{T}$ from which we observe a dataset $S \sim \mathcal{D}^m$. In target testing, we evaluate its predictions on a held-out test dataset $S^* \sim \mathcal{D}$ from the same task. For PACOH-NN, NPs, and MLAP, the respective predictor outputs a probability distribution $\hat{p}(y^*|x^*, S)$ for the $x^*$ in $S^*$. The respective mean prediction corresponds to the expectation of $\hat{p}$, that is, $\hat{y} = \mathbb{E}_{\hat{p}}[y^*|x^*, S]$. In the case of MAML, only a mean prediction is available. Based on the mean predictions, we compute the root mean squared error (RMSE):
$$\text{RMSE} = \sqrt{\frac{1}{|S^*|}\sum_{(x^*, y^*)\in S^*}(y^* - \hat{y})^2}\,. \quad (62)$$
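The ensemble mean prediction and the RMSE in (62) can be sketched as follows, assuming `preds` collects the per-particle mean predictions of the K x L networks (array names are ours):

```python
import numpy as np

def ensemble_rmse(preds, y_true):
    """RMSE of the ensemble mean prediction.

    preds: array (num_particles, num_test_points) of per-network predictions,
           whose average approximates y_hat = E[y* | x*, S].
    y_true: array (num_test_points,) of held-out test targets.
    """
    y_hat = np.asarray(preds, dtype=float).mean(axis=0)
    return float(np.sqrt(np.mean((np.asarray(y_true, dtype=float) - y_hat) ** 2)))
```

Note that the RMSE only scores the mean prediction; the spread of the particle predictions is assessed separately via the calibration error (Appendix C.2.1).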

Table 1: Comparison of standard and meta-learning algorithms in terms of test RMSE in 5 meta-learning environments for regression. Reported are mean and standard deviation across 5 seeds.

Table 2: Comparison of standard and meta-learning algorithms in terms of test calibration error in 5 meta-learning environments for regression. Reported are mean and standard deviation across 5 seeds.

[Figure S1: Overview of the framework: a meta-learner extracts prior knowledge from the datasets $S_i$ of tasks drawn from the task environment $\mathcal{T}$, which a base learner then uses for posterior inference on a new task.]
A.3 PROOF OF PROPOSITION 1: PAC-OPTIMAL HYPER-POSTERIOR

An objective function corresponding to (4) reads as
$$J(\mathcal{Q}) = -\frac{1}{\sqrt{nm}+1}\sum_{i=1}^{n}\mathbb{E}_{P\sim\mathcal{Q}}\big[\ln Z(S_i, P)\big] + D_{KL}(\mathcal{Q}\|\mathcal{P})\,.$$
To obtain $J(\mathcal{Q})$, we omit all additive terms from (4) that do not depend on $\mathcal{Q}$ and multiply by the scaling factor $\frac{n\sqrt{m}}{\sqrt{nm}+1}$. Since the described transformations are monotone, the minimizing distribution of $J(\mathcal{Q})$, that is,
$$\mathcal{Q}^* = \arg\min_{\mathcal{Q}\in\mathcal{M}(\mathcal{M}(\mathcal{H}))} J(\mathcal{Q})\,,$$
is also the minimizer of (4). More importantly, $J(\mathcal{Q})$ is structurally similar to the generic minimization problem in (30). Hence, we can invoke Lemma 2 with $A = \mathcal{M}(\mathcal{H})$, $\beta = 1$, and $g(P) = -\frac{1}{\sqrt{nm}+1}\sum_{i=1}^{n}\ln Z(S_i, P)$, which yields the PAC-optimal hyper-posterior
$$\mathcal{Q}^*(P) = \frac{\mathcal{P}(P)\,\exp\big(\frac{1}{\sqrt{nm}+1}\sum_{i=1}^{n}\ln Z(S_i, P)\big)}{Z_{II}(S_1, \dots, S_n, \mathcal{P})}\,.$$
Technically, this concludes the proof of Proposition 1. However, we want to remark the following result: if we choose $\mathcal{Q} = \mathcal{Q}^*$, the PAC-Bayes bound in (4) can be expressed in terms of the meta-level partition function $Z_{II}$ (cf. (48)). We omit a detailed derivation of (48), since it is similar to the one for Corollary 1.

B PACOH-NN: A SCALABLE ALGORITHM FOR LEARNING BNN PRIORS

In this section, we summarize and further discuss our proposed meta-learning algorithm, PACOH-NN. An overview of the proposed framework is illustrated in Figure S1. Overall, it consists of two stages, meta-training and meta-testing, which we explain in more detail in the following.

B.1 META-TRAINING

The hyper-posterior distribution $\mathcal{Q}$ that minimizes the upper bound on the transfer error is the PAC-optimal hyper-posterior $\mathcal{Q}^*$ given in Proposition 1. Provided with a set of datasets $S_1, \dots, S_n$, the meta-learner minimizes the respective meta-objective, in the case of PACOH-SVGD, by performing SVGD on $\mathcal{Q}^*$. Algorithm 1 outlines the required steps in more detail. The meta-learned prior knowledge is then deployed by a base learner. The base learner is given a training dataset $S \sim \mathcal{D}^m$ pertaining to an unseen task $\tau = (\mathcal{D}, m) \sim \mathcal{T}$. With the purpose of approximating the generalized Bayesian posterior $Q^*(S, P)$, the base learner performs (normal) posterior inference. Algorithm 3 details the steps of this approximation procedure, referred to as target training, when performed via SVGD. For a data point $x^*$, the respective predictor outputs a predictive probability distribution $\hat{p}(y^*|x^*, S)$.

In fact, by the law of large numbers, it is straightforward to show that, as $L \to \infty$, $\ln\tilde{Z}(S_i, P) \xrightarrow{a.s.} \ln Z(S_i, P)$; that is, the estimator becomes asymptotically unbiased and we recover the original PAC-Bayesian bound (i.e., (52) $\xrightarrow{a.s.}$ (51)). It is also noteworthy that the bound in (52), obtained with our estimator, is in expectation tighter than the upper bound resulting from the naive estimator $\ln\hat{Z}(S_i, P) := -\sqrt{m_i}\,\frac{1}{L}\sum_{l=1}^{L}\hat{\mathcal{L}}(\theta_l, S_i)$, which can be obtained by applying Jensen's inequality to $\ln\mathbb{E}_{\theta\sim P_\phi}\,e^{-\sqrt{m_i}\,\hat{\mathcal{L}}(\theta, S_i)}$. In the edge case $L = 1$, our LSE estimator $\ln\tilde{Z}(S_i, P)$ falls back to this naive estimator and coincides in expectation with $\mathbb{E}[\ln\hat{Z}(S_i, P)] = -\sqrt{m_i}\,\mathbb{E}_{\theta\sim P}[\hat{\mathcal{L}}(\theta, S_i)]$. As a result, we effectively minimize the looser upper bound in (55). As we can see from (55), the boundaries between the tasks vanish in the edge case of $L = 1$; that is, all data points are treated as if they belonged to one dataset. This suggests that $L$ should be chosen greater than one. In our experiments, we used $L = 5$ and found the corresponding approximation to be sufficient.

C EXPERIMENTS C.1 META-LEARNING ENVIRONMENTS

In this section, we provide further details on the meta-learning environments used in Section 6. Information about the numbers of tasks and samples in the respective environments can be found in Table S1.

The synthetic Sinusoids environment is based on functions $f_{a,b,c,\beta}$ which, in essence, consist of an affine as well as a sinusoid component. Tasks differ in the function parameters $(a, b, c, \beta)$, which are sampled from the task environment $\mathcal{T}$ according to (57). Figure S2a depicts functions $f_{a,b,c,\beta}$ with parameters sampled according to (57). To draw training samples from each task, we draw $x$ uniformly from $\mathcal{U}(-5, 5)$ and add Gaussian noise with standard deviation 0.1 to the function values $f(x)$:
$$x \sim \mathcal{U}(-5, 5)\,, \quad y \sim \mathcal{N}\big(f_{a,b,c,\beta}(x),\, 0.1^2\big)\,.$$

In the second synthetic environment, the GP kernel $k$ has lengthscale $l = 0.2$, and the mean function $m(x)$ is an (unnormalized) mixture of Cauchy densities with $\mu_1 = (-1, -1)$ and $\mu_2 = (2, 2)$. Functions from the task environment are sampled as
$$f(x) = m(x) + g(x)\,, \quad g \sim \mathcal{GP}\big(0, k(x, x')\big)\,. \quad (60)$$
Figure S2b depicts a one-dimensional projection of functions sampled according to (60). To draw training samples from each task, we draw $x$ from a truncated normal distribution and add Gaussian noise with standard deviation 0.05 to the function values $f(x)$:
$$x := \min\{\max\{\tilde{x}, -3\}, 2\}\,, \quad \tilde{x} \sim \mathcal{N}(0, 2.5^2)\,, \quad y \sim \mathcal{N}\big(f(x), 0.05^2\big)\,.$$

C.1.3 SWISSFEL

Free-electron lasers (FELs) accelerate electrons to very high speed in order to generate shortly pulsed laser beams with wavelengths in the X-ray spectrum. These X-ray pulses can be used to map nanometer-scale structures, thus facilitating experiments in molecular biology and material science. The accelerator and the electron beam line of a FEL consist of multiple magnets and other adjustable components, each of which has several parameters that experts adjust to maximize the pulse energy (Kirschner et al., 2019a). Due to different operational modes, parameter drift, and changing (latent) conditions, the laser's pulse-energy function, in response to its parameters, changes across time.
As a result, optimizing the laser's parameters is a recurrent task. Overall, our meta-learning environment consists of different parameter optimization runs (i.e., tasks) on the SwissFEL, an 800-meter-long laser located in Switzerland (Milne et al., 2017). A picture of the SwissFEL is shown in Figure S3. The input space, corresponding to the laser's parameters, has 12 dimensions, whereas the regression target, the pulse energy, is 1-dimensional. For details on the individual parameters, we refer to Kirschner et al. (2019b). For each run, we have around 2000 data points. Since these data points are generated with online optimization methods, the data are non-i.i.d. and become successively less diverse throughout the optimization. As we are concerned with meta-learning with limited data and want to avoid issues with highly dependent data points, we only take the first 400 data points per run and split them into training and test subsets of size 200 each.

C.1.4 MHC-I

Table S2: MHC-I alleles used for meta-training and their corresponding number of meta-training samples m_i.

We use 5 alleles to meta-learn a BNN prior. The alleles and the corresponding number of data points available for meta-training are listed in Table S2. The most genetically dissimilar allele (A-6901) is used for our bandit task. In each iteration, the experimenter (i.e., the bandit algorithm) chooses to test one peptide among the pool of 813 candidates and receives r as reward feedback. Hence, we are concerned with an 813-arm bandit wherein the action a_t ∈ {1, ..., 813} = A in iteration t corresponds to testing the a_t-th peptide candidate. In response, the algorithm receives the respective negative log-IC50 as reward r(a_t). As metrics, we report the average regret. To ensure a fair comparison, the prior parameters of the GP for GP-UCB and GP-TS are meta-learned by minimizing the GP's marginal log-likelihood on the five meta-training tasks.
For the prior, we use a constant mean function and experimented with various kernel functions (linear, SE, Matérn). Given the 45-dimensional feature space, we found the linear kernel to work best. Overall, the constant mean and the variance parameter of the linear kernel are thus meta-learned.
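The GP baseline's meta-learning step, fitting the constant mean and the linear-kernel variance by minimizing the negative log marginal likelihood summed over the meta-training tasks, can be sketched as follows. The five synthetic datasets are hypothetical stand-ins for the real peptide features; `gp_nll` and all constants are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def gp_nll(params, X, y, noise_var=0.1):
    """Negative log marginal likelihood of a GP with constant mean c and a
    linear kernel k(x, x') = s2 * <x, x'> (s2 parameterized on the log scale)."""
    c, log_s2 = params
    K = np.exp(log_s2) * (X @ X.T) + noise_var * np.eye(len(y))
    r = y - c
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * (r @ np.linalg.solve(K, r) + logdet + len(y) * np.log(2 * np.pi))

# Hypothetical meta-training tasks standing in for the five MHC-I alleles:
# small linear-regression datasets sharing a constant offset.
rng = np.random.default_rng(0)
tasks = []
for _ in range(5):
    X = rng.normal(size=(30, 4))
    y = X @ rng.normal(scale=1.5, size=4) + 2.0 + 0.3 * rng.normal(size=30)
    tasks.append((X, y))

# "Meta-learning" the GP prior: minimize the NLL summed over all tasks jointly.
res = minimize(lambda p: sum(gp_nll(p, X, y) for X, y in tasks),
               x0=np.array([0.0, 0.0]), method="L-BFGS-B")
c_hat, s2_hat = res.x[0], float(np.exp(res.x[1]))
```

Summing the per-task NLLs and optimizing once is what distinguishes this from ordinary single-task hyperparameter fitting: the learned (c_hat, s2_hat) then serve as the shared prior for the unseen test allele.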

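As a minimal sketch of the reported average-regret metric, assuming regret is measured against the best fixed arm (the paper's exact formula is not reproduced above), with hypothetical rewards in place of the peptides' negative log-IC50 values:

```python
import numpy as np

def average_regret(rewards, chosen, t=None):
    """Average regret after t steps: mean gap between the best arm's reward
    and the rewards of the arms actually chosen (t=None uses all steps)."""
    best = rewards.max()
    return float(np.mean(best - rewards[chosen[:t]]))

# Hypothetical stand-in for the 813 peptide candidates' rewards r(a).
rng = np.random.default_rng(0)
rewards = rng.normal(size=813)
actions = rng.integers(0, 813, size=100)  # a uniformly random bandit policy
print(average_regret(rewards, actions))   # large gap: random play is far from optimal
```

A good bandit algorithm drives this quantity toward zero as it concentrates on near-optimal peptides, which is why it is a natural headline metric for the experiment.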
