INFORMATION THEORETIC META LEARNING WITH GAUSSIAN PROCESSES

Abstract

We formulate meta learning using information theoretic concepts such as mutual information and the information bottleneck. The idea is to learn a stochastic representation or encoding of the task description, given by a training or support set, that is highly informative about predicting the validation set. By making use of variational approximations to the mutual information, we derive a general and tractable framework for meta learning. In particular, we develop new memory-based meta learning algorithms based on Gaussian processes and derive extensions that combine memory and gradient-based meta learning. We demonstrate our method on few-shot regression and classification using standard benchmarks such as Omniglot, mini-Imagenet and Augmented Omniglot.

1. INTRODUCTION

Meta learning (Ravi & Larochelle, 2017; Vinyals et al., 2016; Edwards & Storkey, 2017; Finn et al., 2017; Lacoste et al., 2019; Nichol et al., 2018) and few-shot learning (Li et al., 2006; Lake et al., 2011) aim to derive data-efficient learning algorithms that can rapidly adapt to new tasks. Such systems require training deep neural networks on a set of tasks drawn from a common distribution, where each task is described by a small amount of experience, typically divided into a training or support set and a validation set. By sharing information across tasks the neural network can learn to rapidly adapt to new tasks and generalize from few examples at test time. Several few-shot learning algorithms use memory-based (Vinyals et al., 2016; Ravi & Larochelle, 2017) or gradient-based procedures (Finn et al., 2017; Nichol et al., 2018), with the gradient-based model-agnostic meta learning algorithm (MAML) of Finn et al. (2017) being particularly influential in the literature. Despite the success of specific schemes, a fundamental issue in meta learning is to derive unified principles that relate different approaches and suggest new ones. While there exist probabilistic interpretations of existing methods, such as the approximate Bayesian inference approach (Grant et al., 2018; Finn et al., 2018; Yoon et al., 2018) and the related conditional probability modelling approach (Garnelo et al., 2018; Gordon et al., 2019), meta learning still lacks a general and tractable learning principle that can improve our understanding of existing algorithms and yield new methods. To this end, the main contribution of this paper is to introduce an information theoretic view of meta learning, utilizing tools such as mutual information and the information bottleneck (Cover & Thomas, 2006; Tishby et al., 1999).
Given that each task consists of a support or training set and a target or validation set, we consider the information bottleneck principle, introduced by Tishby et al. (1999), which can learn a stochastic encoding of the support set that is highly informative about predicting the validation set. This stochastic encoding is optimized through the difference between two mutual informations, so that the encoding compresses the training set into a representation that can predict the validation set well. By exploiting recent variational approximations to the information bottleneck (Alemi et al., 2017; Chalk et al., 2016; Achille & Soatto, 2016) that make use of variational lower bounds on the mutual information (Barber & Agakov, 2003), we derive a general and tractable framework for meta learning. This framework allows us to re-interpret gradient-based algorithms, such as MAML, and also to derive new methods. Based on the variational information bottleneck (VIB) framework (Alemi et al., 2017; Chalk et al., 2016; Achille & Soatto, 2016), we introduce a new memory-based algorithm for supervised few-shot learning (right panel in Figure 1) based on Gaussian processes (Rasmussen & Williams, 2006) and deep neural kernels (Wilson et al., 2016) that offers a kernel-based Bayesian view of a memory system. With Gaussian processes, the underlying encoding takes the form of a non-parametric function that follows a stochastic process amortized by the training set. Further, we show that VIB gives rise to gradient-based meta learning methods, such as MAML, when combined with parametric encodings corresponding to model parameters or weights, and based on this we derive a stochastic MAML algorithm. In an additional scheme, we show that our framework naturally allows for combinations of memory and gradient-based meta learning by constructing suitable encodings, and we derive such an algorithm that combines Gaussian processes with MAML.
We demonstrate our methods on few-shot regression and classification by using standard benchmarks such as Omniglot, mini-Imagenet and Augmented Omniglot.

2. META LEARNING WITH INFORMATION BOTTLENECK

Suppose we wish to learn from a distribution of tasks. During training, for each task we observe a pair consisting of a task description, represented by the support or training set $D^t$, and a task validation part, represented by the target or validation set $D^v$. At test time only $D^t$ will be given and the learning algorithm should rapidly adapt to form predictions on $D^v$ or on further test data. We wish to formulate meta learning using information theoretic concepts such as mutual information and the information bottleneck (Tishby et al., 1999). The idea is to learn a stochastic representation or encoding of the task description $D^t$ that is highly informative about predicting $D^v$. We introduce a random variable $Z$, associated with this encoding, drawn from a distribution $q_w(Z|D^t)$ parametrized by $w$. Given this encoding, the full joint distribution is written as
$$q_w(D^v, D^t, Z) = q_w(Z|D^t)\, p(D^v, D^t), \quad (1)$$
where $p(D^v, D^t)$ denotes the unknown data distribution over $D^t$ and $D^v$. In equation 1 and throughout the paper we use the convention that the full joint, as well as any marginal or conditional that depends on $Z$, is denoted by $q_w$ (emphasizing the dependence on the parametrized encoder), while corresponding quantities over the data $D^t, D^v$ are denoted by $p$. E.g., from the above we can express a $Z$-dependent marginal such as $q_w(Z, D^v) = \int q_w(Z|D^t)\, p(D^v, D^t)\, dD^t$. To tune $w$ we would like to maximize the mutual information between $Z$ and the target set $D^v$, denoted by $I(Z, D^v)$. A trivial way to obtain a maximally informative representation is to set $Z = D^t$, which does not provide a useful representation. Thus, the information bottleneck (IB) principle (Tishby et al., 1999) adds a model complexity penalty to the maximization of $I(Z, D^v)$ which promotes an encoding $Z$ that is highly compressive of $D^t$, i.e. for which $I(Z, D^t)$ is minimized. This leads to the IB objective
$$\mathcal{L}_{IB}(w) = I(Z, D^v) - \beta I(Z, D^t),$$
where $\beta \geq 0$ is a hyperparameter.
Nevertheless, in order to use IB for meta learning we need to approximate the mutual information terms $I(Z, D^v)$ and $I(Z, D^t)$, which are both intractable since they depend on the unknown data distribution $p(D^v, D^t)$. To overcome this, we consider variational approximations by following similar arguments to the variational IB approach (Alemi et al., 2017), which was introduced for supervised learning of a single task. This allows us to express a tractable lower bound on $\mathcal{L}_{IB}(w)$ by lower bounding $I(Z, D^v)$ and upper bounding $I(Z, D^t)$.

2.1. VARIATIONAL INFORMATION BOTTLENECK (VIB) FOR META LEARNING

To construct a bound $\mathcal{F} \leq \mathcal{L}_{IB}(w)$ we first need to lower bound $I(Z, D^v)$, which is written as
$$I(Z, D^v) = \mathrm{KL}\left[q_w(Z, D^v)\,\|\,q_w(Z)p(D^v)\right] = \mathbb{E}_{q_w(Z, D^v)}\left[\log \frac{q_w(D^v|Z)}{p(D^v)}\right],$$
where KL denotes the Kullback-Leibler divergence and
$$q_w(D^v|Z) = \frac{\int q_w(Z|D^t)\, p(D^v, D^t)\, dD^t}{\int\!\!\int q_w(Z|D^t)\, p(D^v, D^t)\, dD^t\, dD^v}$$
is intractable since we do not know the analytic form of the data distribution $p(D^v, D^t)$. To lower bound $I(Z, D^v)$, we follow Barber & Agakov (2003) (see Appendix A.1) by introducing a decoder model $p_\theta(D^v|Z)$ to approximate the intractable $q_w(D^v|Z)$, where $\theta$ are additional parameters:
$$I(Z, D^v) \geq \mathbb{E}_{q_w(Z, D^v)}\left[\log \frac{p_\theta(D^v|Z)}{p(D^v)}\right] = \mathbb{E}_{q_w(Z, D^v)}\left[\log p_\theta(D^v|Z)\right] + H(D^v),$$
where the entropy $H(D^v)$ is a constant that does not depend on the tunable parameters $(\theta, w)$. Furthermore, to deal with the second intractable mutual information, $I(Z, D^t)$, and so maintain a lower bound on $\mathcal{L}_{IB}(w)$, we need to upper bound this term. Note that
$$I(Z, D^t) = \mathbb{E}_{q_w(Z, D^t)}\left[\log \frac{q_w(Z|D^t)}{q_w(Z)}\right],$$
where $q_w(Z) = \int q_w(Z|D^t)\, p(D^t)\, dD^t$ is intractable since it involves the unknown data distribution $p(D^t)$. Working similarly as before, we can approximate $q_w(Z)$ by a tractable prior model distribution $p_\theta(Z)$, which leads to the following upper bound on the mutual information:
$$I(Z, D^t) \leq \mathbb{E}_{q_w(Z, D^t)}\left[\log \frac{q_w(Z|D^t)}{p_\theta(Z)}\right].$$
Then, by combining the two bounds we obtain the overall bound $\mathcal{F}(\theta, w) + H(D^v) \leq \mathcal{L}_{IB}(w)$:
$$\mathcal{F}(\theta, w) = \mathbb{E}_{q_w(Z, D^v)}\left[\log p_\theta(D^v|Z)\right] - \beta\, \mathbb{E}_{q_w(Z, D^t)}\left[\log \frac{q_w(Z|D^t)}{p_\theta(Z)}\right],$$
where the constant $H(D^v)$ is dropped from the objective function.
Given a set of task pairs $\{D^t_i, D^v_i\}_{i=1}^b$, where each $(D^t_i, D^v_i) \sim p(D^v, D^t)$, during meta-training the objective for learning $(\theta, w)$ reduces to the maximization of the empirical average $\frac{1}{b}\sum_i \mathcal{F}_i(\theta, w)$, where each $\mathcal{F}_i(\theta, w)$ is an unbiased estimate of $\mathcal{F}(\theta, w)$ (see Appendix A.2) and is given by
$$\mathcal{F}_i(\theta, w) = \mathbb{E}_{q_w(Z_i|D^t_i)}\left[\log p_\theta(D^v_i|Z_i)\right] - \beta\, \mathrm{KL}\left[q_w(Z_i|D^t_i)\,\|\,p_\theta(Z_i)\right]. \quad (6)$$
The meta-training procedure is carried out in episodes, where at each step we receive a minibatch of task pairs and perform a stochastic gradient maximization step. The objective in equation 6 is similar to variational inference objectives for meta learning (Ravi & Beatson, 2019). In particular, it can be viewed as an evidence lower bound (ELBO) on the validation set log marginal likelihood, $\log \int p_\theta(D^v_i|Z_i)\, p_\theta(Z_i)\, dZ_i$, with the differences that: (i) there is the hyperparameter $\beta$ in front of the KL term, and (ii) the variational distribution $q_w(Z_i|D^t_i)$ in equation 6 is more restricted than in standard variational inference, since $q_w(Z_i|D^t_i)$ now acts as a stochastic bottleneck that encodes the support set $D^t_i$ (i.e. it is amortized by $D^t_i$) and, via the term $\mathbb{E}_{q_w(Z_i|D^t_i)}[\log p_\theta(D^v_i|Z_i)]$, it is optimized to reconstruct the validation set.
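As a concrete illustration, the per-task estimate above can be sketched with a diagonal Gaussian encoder and a standard normal prior. This is a minimal sketch, not the paper's implementation: the `encode` and `log_lik` callables are hypothetical stand-ins for the amortized encoder q_w(Z|D^t) and the decoder log-likelihood log p_θ(D^v|Z).

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    # KL[N(mu_q, diag(var_q)) || N(mu_p, diag(var_p))] for diagonal Gaussians.
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def per_task_objective(encode, log_lik, D_train, D_val, beta, n_samples=8):
    """Monte Carlo estimate of F_i = E_q[log p(D_val|Z)] - beta * KL[q(Z|D_train) || p(Z)].

    `encode` maps the support set to (mu, var) of a diagonal Gaussian q(Z|D_train);
    `log_lik(z, D)` is the decoder log-likelihood log p(D|z). The prior is N(0, I).
    """
    mu, var = encode(D_train)
    expected_ll = 0.0
    for _ in range(n_samples):
        z = mu + np.sqrt(var) * rng.standard_normal(mu.shape)  # reparametrization trick
        expected_ll += log_lik(z, D_val) / n_samples
    kl = gaussian_kl(mu, var, np.zeros_like(mu), np.ones_like(var))
    return expected_ll - beta * kl
```

Averaging such per-task values over a minibatch of tasks and ascending the gradient w.r.t. the encoder/decoder parameters gives the episodic training loop described above.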

2.2. INFORMATION THEORETIC VIEW OF MAML-TYPE METHODS

MAML (Finn et al., 2017) is a special case of our framework. To see this, suppose that the task encoding variable $Z_i$ for the $i$-th task coincides with a vector of task-specific model parameters or neural network weights $\psi_i$, so that $p_\theta(D^v_i|Z_i) \equiv p(D^v_i|\psi_i)$ and $p_\theta(Z_i)$ reduces to a prior $p_\theta(\psi_i)$ over these parameters. The MAML approach (Finn et al., 2017) tries to find a shared initial parameter value $\theta$ such that a few gradient steps on the support set objective, $\log p(D^t_i|\theta)$, lead to a task-specific parameter value $\psi_i$ with good generalization on the validation set. MAML estimates the task parameters by $\psi_i = \theta + \Delta(\theta, D^t_i)$, where $\Delta(\theta, D^t_i)$ denotes the inner loop adaptation steps, which for a single gradient step is just $\rho \nabla_\theta \log p(D^t_i|\theta)$ with step size $\rho$. By setting $\beta = 0$ and using a deterministic Dirac delta encoder, $\delta(\psi_i - \theta - \Delta(\theta, D^t_i))$, the VIB objective from equation 6 reduces to the standard MAML objective, $\mathcal{F}_i(\theta) = \log p(D^v_i|\theta + \Delta(\theta, D^t_i))$. We can construct a generalization of MAML by making the encoder stochastic, e.g. $q_{\theta,s}(\psi_i|D^t_i) = N(\psi_i|\theta + \Delta(\theta, D^t_i), sI)$, where $s$ is a scalar variance parameter. Then the objective becomes
$$\mathcal{F}_i(\theta, s) = \mathbb{E}_{N(\epsilon|0,I)}\left[\log p(D^v_i|\theta + \Delta(\theta, D^t_i) + \sqrt{s}\,\epsilon)\right] - \beta\, \mathrm{KL}\left[q_{\theta,s}(\psi_i|D^t_i)\,\|\,p_\theta(\psi_i)\right],$$
where we have reparametrized the expectation, suggesting the use of the reparametrization trick (Kingma & Welling, 2013) for stochastic optimization of the meta parameters $(\theta, s)$. In the experiments we investigate an instance of the above where $p_\theta(\psi_i)$ follows the hierarchical Gaussian form $N(\psi_i|\theta, sI)$, which views each task-specific parameter $\psi_i$ as a randomized version of $\theta$, with $s$ being the same variance used by the encoder. For such a prior the KL divergence term reduces to $\frac{1}{2s}\|\Delta(\theta, D^t_i)\|^2$, which penalizes large values of the inner adaptation steps and small values of $s$.
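A toy sketch of this stochastic MAML objective (single inner gradient step, hierarchical Gaussian prior so that the KL collapses to ||Δ||²/(2s)) might look as follows. The `log_lik` and `grad_log_lik` callables are hypothetical stand-ins for the task likelihood and its gradient, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_maml_objective(theta, s, task_train, task_val, log_lik, grad_log_lik,
                              rho=0.01, beta=1.0, n_samples=8):
    """Sketch of the stochastic MAML objective of Section 2.2 (one inner step).

    Inner loop:  psi_mean = theta + rho * grad log p(D_train | theta).
    Encoder:     q(psi | D_train) = N(psi_mean, s I);  prior: N(theta, s I),
    so the KL reduces analytically to ||Delta||^2 / (2 s).
    """
    delta = rho * grad_log_lik(theta, task_train)   # one inner adaptation step Delta
    psi_mean = theta + delta
    expected_ll = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(theta.shape)      # reparametrization: psi = mean + sqrt(s) eps
        expected_ll += log_lik(psi_mean + np.sqrt(s) * eps, task_val) / n_samples
    kl = np.sum(delta ** 2) / (2.0 * s)             # analytic KL for this prior choice
    return expected_ll - beta * kl
```

Setting s → 0 and beta = 0 recovers the deterministic MAML objective of the surrounding text.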

2.3. VIB FOR SUPERVISED META LEARNING

Here, we explain how to adapt the VIB principle to supervised meta learning. Suppose the learning problem involves few-shot supervised learning where, for each task, we wish to predict outputs or labels given corresponding inputs. Let us denote the task support set as $D^t = (Y^t, X^t)$, where $Y^t = \{y^t_j\}_{j=1}^{n^t}$ and $X^t = \{x^t_j\}_{j=1}^{n^t}$ denote the output and input observations; similarly, we write the validation set as $D^v = (Y^v, X^v)$. This suggests that we can construct a task encoder distribution of the form $q_w(Z|Y^t, X)$ that depends on the training outputs $Y^t$ and generally on all inputs $X = (X^t, X^v)$. We would like to train this encoder so that $Z$ becomes highly predictive about the validation outputs $Y^v$ and simultaneously compressive about $Y^t$. Then, a suitable VIB objective can be based on approximating the input-conditioned information bottleneck, $I(Z, Y^v|X) - \beta I(Z, Y^t|X)$, where both $I(Z, Y^v|X)$ and $I(Z, Y^t|X)$ are conditional mutual informations (see Appendix A.3). By following similar arguments to those in Section 2.1, we can lower bound this objective and finally approximate it by an unbiased empirical average, $\frac{1}{b}\sum_{i=1}^b \mathcal{F}_i(\theta, w)$, where
$$\mathcal{F}_i(\theta, w) = \mathbb{E}_{q_w(Z_i|Y^t_i, X_i)}\left[\log p_\theta(Y^v_i|Z_i, X_i)\right] - \beta\, \mathrm{KL}\left[q_w(Z_i|Y^t_i, X_i)\,\|\,p_\theta(Z_i|X_i)\right], \quad (7)$$
and where $p_\theta(Y^v_i|Z_i, X_i)$ and $p_\theta(Z_i|X_i)$ are the decoder and prior model distributions introduced when applying the variational approximation. A detailed derivation can be found in Appendix A.3. Equation 7 provides the general form of the VIB objective suitable for supervised meta learning. The supervised version of MAML is expressed as a special case by following the arguments of Section 2.2. In Section 3, we particularize the above by combining it with a Gaussian process model.

3. SUPERVISED META LEARNING WITH GAUSSIAN PROCESSES

In this section we introduce VIB-based meta learning algorithms using Gaussian processes (GPs), which are suitable for few-shot supervised learning. In Section 3.1 we introduce a memory-based system, while in Section 3.2 we further combine it with gradient-based meta learning.

3.1. A GAUSSIAN PROCESS MEMORY-BASED METHOD

To use the VIB framework for few-shot supervised learning, as described in Section 2.3, for each $i$-th task we need to specify the encoding variable $Z_i$ together with the encoder $q_w(Z_i|Y^t_i, X_i)$, the decoder over the validation outputs $p_\theta(Y^v_i|Z_i, X_i)$ and the prior model $p_\theta(Z_i|X_i)$. Here, we construct these quantities using a GP model (Rasmussen & Williams, 2006). We denote the unknown task-specific function that solves the $i$-th task by $f_i(x)$ and assume that a priori (before observing any task data) this function is a draw from a GP, i.e. $f_i(x) \sim GP(0, k_\theta(x, x'))$, where $k_\theta$ denotes the kernel function. Without loss of generality we use a deep kernel function where $f_i(x)$ is a linear function of a deep neural network feature vector $\phi(x; \theta)$ with task-specific Gaussian weights, i.e.
$$f_i(x) = \phi(x; \theta)^\top \theta^{out}_i, \quad \theta^{out}_i \sim N(\theta^{out}_i|0, \sigma^2_f I).$$
Such a function can be equivalently viewed as a GP sample:
$$f_i(x) \sim GP(0, k_\theta(x, x')), \quad k_\theta(x, x') = \sigma^2_f\, \phi(x; \theta)^\top \phi(x'; \theta).$$
In this function-space view the task-specific output weights $\theta^{out}_i$ have been marginalized out and we are left with the feature vector parameters $\theta$ shared across tasks. Suppose now that we observe the task data, i.e. the support $D^t_i = (Y^t_i, X^t_i)$ and validation $D^v_i = (Y^v_i, X^v_i)$ sets, so that we can evaluate the task function on all task inputs $X_i = (X^t_i, X^v_i)$. Let $f^v_{i,j} \equiv f(x^v_{i,j})$ denote the function value at the validation input $x^v_{i,j}$, associated with output $y^v_{i,j}$, while the vector of all such values is denoted by $f^v_i = \{f^v_{i,j}\}_{j=1}^{n^v}$. Similarly, the vector of function values at the training inputs $X^t_i$ is $f^t_i$.
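The weight-space/function-space duality above is easy to check numerically. The sketch below is an assumption-laden stand-in: a one-layer tanh network plays the role of the learned feature map φ(x; θ), and we simply build the implied linear deep kernel.

```python
import numpy as np

def features(X, W):
    # Stand-in for the learned feature map phi(x; theta); here a one-layer tanh network.
    return np.tanh(X @ W)

def linear_deep_kernel(X1, X2, W, sigma_f=1.0):
    """k(x, x') = sigma_f^2 * phi(x)^T phi(x'): marginalizing the Gaussian output
    weights theta_out ~ N(0, sigma_f^2 I) of a linear model on learned features
    yields exactly this GP covariance between function values."""
    return sigma_f ** 2 * features(X1, W) @ features(X2, W).T
```

Any kernel matrix produced this way is symmetric positive semi-definite, which is what makes the conditional GP computations in the next subsection well defined.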
In the VIB framework, we specify the task encoding variable $Z_i$ to be the full set of function values, $Z_i \equiv (f^v_i, f^t_i)$, and we further choose the prior model for this encoding to be the GP prior, $p_\theta(Z_i|X_i) \equiv p(f^v_i, f^t_i|X_i)$, where
$$p(f^v_i, f^t_i|X_i) = p(f^v_i|f^t_i, X_i) \times p(f^t_i|X^t_i) = N\!\left(f^v_i \,\middle|\, K^{vt}_i [K^t_i]^{-1} f^t_i,\; K^v_i - K^{vt}_i [K^t_i]^{-1} [K^{vt}_i]^\top\right) \times N(f^t_i|0, K^t_i). \quad (8)$$
Here, $K^t_i$, $K^v_i$ are square kernel matrices of size $n^t \times n^t$ and $n^v \times n^v$ on the training inputs $X^t_i$ and validation inputs $X^v_i$, while $K^{vt}_i$ is the $n^v \times n^t$ cross kernel matrix between these two sets of inputs. Note that the encoding is non-parametric since its size grows with the number of task data points. To continue with the specification of the VIB objective, the second quantity we need to set is the decoder model $p_\theta(Y^v_i|f^v_i, f^t_i, X_i)$, which is chosen to be a standard GP likelihood. Specifically, for i.i.d. observations, $Y^v_i$ given $f^v_i$ becomes independent of $f^t_i$ and $X_i$ and the likelihood factorizes across data points, i.e. $p(Y^v_i|f^v_i) = \prod_{j=1}^{n^v} p(y^v_{i,j}|f^v_{i,j})$, where each $p(y^v_{i,j}|f^v_{i,j})$ is a standard likelihood model, such as a Gaussian density $p(y^v_{i,j}|f^v_{i,j}) = N(y^v_{i,j}|f^v_{i,j}, \sigma^2)$ suitable for standard regression problems, or a categorical/softmax likelihood for few-shot classification; see Appendix C.4. Finally, we specify the encoder distribution $q_w(Z_i|Y^t_i, X_i) \equiv q(f^v_i, f^t_i|Y^t_i, X_i)$ as follows:
$$q(f^v_i, f^t_i|Y^t_i, X_i) = p(f^v_i|f^t_i, X_i)\, q(f^t_i|D^t_i), \quad (10)$$
where $p(f^v_i|f^t_i, X_i)$ is the same conditional GP prior from equation 8, while $q(f^t_i|D^t_i)$ is an encoder of the training set that takes the form of a Gaussian distribution specified by a very general amortization procedure; see Appendix C.1.
Equation 10 shares a similar structure with a standard posterior Gaussian process: we first observe the training set, then compute the (approximate) posterior distribution $q(f^t_i|D^t_i)$, and finally extrapolate/predict the validation set function values at inputs $X^v_i$ based on the conditional GP prior $p(f^v_i|f^t_i, X_i)$. The above assumptions yield (see Appendix C.2) the following VIB objective for a single task:
$$\sum_{j=1}^{n^v} \mathbb{E}_{q(f^v_{i,j})}\left[\log p(y^v_{i,j}|f^v_{i,j})\right] - \beta\, \mathrm{KL}\left[q(f^t_i|D^t_i)\,\|\,p(f^t_i|X^t_i)\right], \quad (11)$$
where $q(f^v_{i,j}) = \int p(f^v_{i,j}|f^t_i, x^v_{i,j}, X^t_i)\, q(f^t_i|D^t_i)\, df^t_i$ is a univariate Gaussian over an individual validation function value $f^v_{i,j}$. Here, $q(f^v_{i,j})$ depends on the training set and the single validation input $x^v_{i,j}$, so that from the training set and the corresponding function values $f^t_i$ we extrapolate (through the univariate conditional GP $p(f^v_{i,j}|f^t_i, x^v_{i,j}, X^t_i)$) to the input $x^v_{i,j}$ in order to predict its value $f^v_{i,j}$.
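Under this linear-Gaussian structure, pushing the Gaussian training-set encoder through the conditional GP prior is a closed-form computation. The following is a minimal sketch (not the paper's implementation) of that marginalization; note that setting q(f^t) equal to the GP prior on f^t recovers the prior on f^v, a useful sanity check.

```python
import numpy as np

def gp_conditional(K_t, K_vt, K_v, m_t, S_t, jitter=1e-6):
    """Push a Gaussian q(f_t) = N(m_t, S_t) through the conditional GP prior
    p(f_v | f_t) = N(A f_t, K_v - A K_vt^T), with A = K_vt K_t^{-1}, to obtain
    the marginal q(f_v) = N(m_v, S_v):

        m_v = A m_t
        S_v = K_v - A (K_t - S_t) A^T
    """
    Kt = K_t + jitter * np.eye(K_t.shape[0])       # numerical stabilization
    A = np.linalg.solve(Kt, K_vt.T).T              # A = K_vt K_t^{-1}
    m_v = A @ m_t
    S_v = K_v - A @ (Kt - S_t) @ A.T
    return m_v, S_v
```

The diagonal of S_v gives the univariate variances of each q(f^v_{i,j}) used in the expected log-likelihood term of the objective.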

3.2. COMBINATION WITH GRADIENT-BASED META LEARNING

In this section, we combine the GP memory-based meta learning method with a gradient-based approach, such as MAML (Finn et al., 2017). Based on the VIB principle we need to specify an encoding that allows us to combine a memory-based with a gradient-based approach. As discussed in Section 2.2, gradient-based meta learning is associated with a parametric encoding that corresponds to a fixed-size task model parameter $\psi_i$. In contrast, as seen in Section 3.1, a memory system is associated with a function-space or non-parametric encoding that consists of the function values $(f^t_i, f^v_i)$ at all task input points. Therefore, a way to combine these techniques is to concatenate the encodings, i.e. $Z_i \equiv (\psi_i, f^t_i, f^v_i)$. Here, $\psi_i$ are the task-specific parameters of the GP kernel function $k_{\psi_i}(x, x')$ (and possibly of the likelihood $p(y|f)$), which in our implementation are the parameters of the feature vector $\phi(x; \psi_i)$. Intuitively, a combination of the GP memory-based method with MAML will apply a short inner adaptation loop in order to adjust an initial feature vector $\phi(x; \theta)$ to obtain a final $\phi(x; \psi_i)$ that can better solve the task. For the overall encoding $(\psi_i, f^t_i, f^v_i)$, the encoder distribution takes the general form
$$q(f^t_i, f^v_i, \psi_i|Y^t_i, X_i) = p(f^v_i|f^t_i, \psi_i, X_i)\, q(f^t_i|\psi_i, D^t_i)\, q(\psi_i|D^t_i),$$
where $p(f^v_i|f^t_i, \psi_i, X_i)$ is the conditional GP prior and $q(f^t_i|\psi_i, D^t_i)$ is the amortized encoder (see Appendix C.1), where we have emphasized their dependence on $\psi_i$. The VIB objective becomes
$$\mathbb{E}_{q(\psi_i|D^t_i)}\left[\sum_{j=1}^{n^v} \mathbb{E}_{q(f^v_{i,j}|\psi_i)}\left[\log p(y^v_{i,j}|f^v_{i,j})\right] - \beta\, \mathrm{KL}\left[q(f^t_i|\psi_i, D^t_i)\,\|\,p(f^t_i|\psi_i, X^t_i)\right]\right] - \beta\, \mathrm{KL}\left[q(\psi_i|D^t_i)\,\|\,p(\psi_i)\right].$$
In practice, we can relax this objective and use different hyperparameters $\beta_f$ and $\beta_\psi$ in front of the two KL terms.
This is convenient as we would prefer to set $\beta_\psi = 0$ and use a deterministic MAML w.r.t. $\psi_i$ rather than a stochastic MAML; see Section 2.2. This simplification avoids the need to specify a prior $p(\psi_i)$ over the task-specific neural network parameters, and at the same time it reduces the encoder $q(\psi_i|D^t_i)$ to a Dirac delta, which simplifies the objective as follows:
$$\sum_{j=1}^{n^v} \mathbb{E}_{q(f^v_{i,j}|\psi_i)}\left[\log p(y^v_{i,j}|f^v_{i,j})\right] - \beta_f\, \mathrm{KL}\left[q(f^t_i|\psi_i, D^t_i)\,\|\,p(f^t_i|\psi_i, X^t_i)\right],$$
where $\psi_i = \theta + \Delta(\theta, D^t_i)$. The inner loop adaptation term can be defined by an objective function on the support set $D^t_i$. In our case a suitable objective is the VIB for the GP memory-based method obtained by setting the validation set equal to the training set, i.e. $D^v_i = D^t_i$ in equation 11, which gives
$$\sum_{j=1}^{n^t} \mathbb{E}_{q(f^t_{i,j}|\theta)}\left[\log p(y^t_{i,j}|f^t_{i,j})\right] - \beta_f\, \mathrm{KL}\left[q(f^t_i|\theta, D^t_i)\,\|\,p(f^t_i|\theta, X^t_i)\right].$$
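The deterministic inner loop can be sketched generically as follows. Here `support_objective` is a hypothetical callable implementing the support-set VIB objective above, and the finite-difference gradient merely keeps the sketch framework-free (in practice one would use automatic differentiation through the kernel parameters).

```python
import numpy as np

def adapt_kernel_params(theta, support_objective, rho=0.1, n_steps=1, h=1e-5):
    """Deterministic inner loop of the GP+MAML scheme (the beta_psi = 0 case):
    psi_i = theta + Delta(theta, D_t), where Delta accumulates ascent steps on
    the support-set VIB objective (the GP objective with D_v set equal to D_t)."""
    psi = np.array(theta, dtype=float)
    for _ in range(n_steps):
        grad = np.zeros_like(psi)
        for k in range(psi.size):          # central finite-difference gradient
            e = np.zeros_like(psi)
            e[k] = h
            grad[k] = (support_objective(psi + e) - support_objective(psi - e)) / (2 * h)
        psi = psi + rho * grad             # one MAML-style adaptation step
    return psi
```

The adapted psi then parametrizes the kernel used in the outer (validation-set) objective, and the meta-gradient w.r.t. theta flows back through these steps.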

4. RELATED WORK

In this work, the VIB principle was used to formulate meta learning. VIB has been used before for different purposes, such as for regularization of single-task supervised learning (Alemi et al., 2017), sparse coding (Chalk et al., 2016), re-interpretation of β-VAEs and Dropout (Burgess et al., 2018; Achille & Soatto, 2016) and compression of deep neural networks (Dai et al., 2018). A meta learning method that connects with the information bottleneck was recently proposed by Hu et al. (2020). There the information bottleneck was used to analyze the generalization of a variational Bayesian inference objective suitable for transductive supervised few-shot learning. Note that the information bottleneck derived in Hu et al. (2020) (Theorem 1) is not the same as the information bottleneck objective used here (i.e. the objective in equation 7 for the supervised learning case), since they differ in the second term. In addition, our framework expresses a general information bottleneck principle for meta learning applicable beyond transductive supervised learning. Given the probabilistic nature of our framework, we can relate it to other probabilistic or Bayesian approaches, particularly those that: (i) probabilistically re-interpret and extend gradient-based methods (Grant et al., 2018; Finn et al., 2018; Yoon et al., 2018; Nguyen et al., 2019; Gordon et al., 2019; Chen et al., 2019), and (ii) derive amortized conditional probability models (Garnelo et al., 2018; Gordon et al., 2019). The underlying learning principle in both (i) and (ii) is to construct and maximize a predictive distribution (or conditional marginal likelihood) of the validation points given the training points, which, e.g. in supervised few-shot learning, is written as
$$p_\theta(Y^v|X^v, X^t, Y^t) = \int p(Y^v|X^v, \psi_i)\, p_\theta(\psi_i|X^t, Y^t)\, d\psi_i = \frac{p_\theta(Y^v, Y^t|X^v, X^t)}{p_\theta(Y^t|X^t)}.$$
Here, $p_\theta(\psi_i|X^t, Y^t)$ is a posterior distribution over the task parameters $\psi_i$ after observing the training points, and $\theta$ is a meta parameter which for simplicity we assume to be found by point estimation. However, this objective is very hard to rigorously approximate. Unlike the marginal likelihood on all task outputs, $p_\theta(Y^v, Y^t|X^v, X^t)$, for which we can easily compute a lower bound, there is no tractable lower bound on the predictive conditional $p_\theta(Y^v|X^v, X^t, Y^t)$. This inherent difficulty with computing the predictive distribution has led to several approximations, i.e. methods of category (i) above, ranging from MAP, Laplace and variational inference procedures (without rigorous bounds on the predictive conditional) to Stein variational gradient descent (Grant et al., 2018; Finn et al., 2018; Yoon et al., 2018; Nguyen et al., 2019; Gordon et al., 2019; Chen et al., 2019). The conditional probability modelling approaches (Garnelo et al., 2018; Gordon et al., 2019) try to directly model $p_\theta(Y^v|X^v, X^t, Y^t)$ without considering this as an approximation to an initial joint Bayesian model. Our VIB framework differs significantly from the predictive distribution principle since VIB has an information theoretic motivation and rigorously bounds an information bottleneck objective. VIB is also a fully tractable objective, avoiding the need to choose a particular approximate inference method and allowing us to focus instead on setting up the encoding procedure, as we did in the GP example in Section 3. Regarding related work on GPs in meta learning, the ALPaCA method (Harrison et al., 2018) applied GPs and Bayesian linear regression to standard regression tasks with Gaussian likelihoods, while Tossou et al. (2019) used kernel-based methods (from a regularization rather than a Bayesian perspective), again for standard regression. Closer to us, Patacchiola et al. (2019) and Snell & Zemel (2020) used GPs to perform few-shot classification together with deep neural kernels. However, our usage of GPs differs from these latter approaches: e.g. our general encoder amortization strategy can potentially deal with arbitrary likelihood functions and task output observations, while Patacchiola et al. (2019) assumes a Gaussian likelihood for the binary class labels and Snell & Zemel (2020) consider the Pólya-Gamma augmentation, which is very tailored to classification problems.

5. EXPERIMENTS

To evaluate the proposed algorithms, we consider a standard set of meta-learning benchmarks which include sinusoid regression and few-shot classification. As a baseline for comparison we consider MAML (Finn et al., 2017) and use exactly the same neural architecture for all methods and benchmarks as in Finn et al. (2017). The new methods we implemented are the following: (i) Stochastic MAML (S. MAML), as defined in Section 2.2, where the difference from MAML is the injected task-specific noise added to the outer loop update rule, together with the regularization term that comes from the VIB objective. The scalar noise parameter s is learned alongside the other parameters. (ii) The memory-based GP method (referred to as GP) trained by the VIB objective, as defined in Section 3, where the kernel feature vector φ(x; θ) is obtained from the last hidden layer of the same neural architecture as used in MAML. We use cosine and linear kernels (see Appendix D). For few-shot classification we report results separately for both kernels, while for sinusoid regression we only consider the linear kernel (the cosine kernel has similar performance). (iii) A combination of a memory-based and gradient-based approach, referred to as GP+MAML, where a MAML meta update rule is applied to the feature vector parameters θ; see Section 3.2.

Sinusoid regression. We first evaluate the methods on sinusoid regression as described by Finn et al. (2017), following their protocol. We train GP using VIB and evaluate its K-shot mean squared error performance. From Table 1, we observe that GP significantly outperforms MAML, especially as K grows. Similarly, Figure 3 in the Appendix illustrates this few-shot predictive ability of the GP, where the posterior GP uncertainty reduces as K grows. Finally, from Figure 2 (left), we observe that GP improves drastically when more data is available and is more data efficient than MAML, achieving near-optimal performance for K = 5.
See Appendix D for more results and details. Table 1: Few-shot sinusoid regression results. We report mean values and 95% confidence intervals over 10 repeats for all methods (GP, GP+MAML, MAML and S. MAML) for K = 5, 10, 20. GP-based methods use the linear kernel.

Model      K=5           K=10          K=20
MAML       0.280±0.013   0.096±0.005   0.043±0.003
S. MAML    0.317±0.34    0.116±0.012   0.054±0.004
GP         0.02±0.014    0.002±0.001   0.001±0.001
GP+MAML    0.058±0.054   0.002±0.001   0.002±0.002

Few-shot classification. The second domain is a standard few-shot meta-learning benchmark based on three datasets: Omniglot (Lake et al., 2011), mini-Imagenet (Ravi & Larochelle, 2017) and Augmented Omniglot (Flennerhag et al., 2018). For both Omniglot and mini-Imagenet we follow the experimental protocol proposed by Finn et al. (2017). For Augmented Omniglot, following Flennerhag et al. (2018), at each predictive density GP update we sample a mini-batch from the fixed N × K support set, apply random transformations to this mini-batch and then use it to perform the actual update. The mini-batch size was 20 and, based on this data augmentation process, we grow the amount of data (x-axis) processed by each method from 20 up to 2000. Finally, to create all plots we average performance over 10 repeats, where in each repeat the systems are meta-trained from scratch and then evaluated on a large number of meta-testing tasks. (Figure 2 panels: Omniglot 5-way, mini-Imagenet 5-way, Augmented Omniglot 20-way.) At meta-testing time the MAML-based methods apply 100 steps of adaptation (resulting in 2000 data points seen by the model, where each step processes a minibatch of 20 points), while they are meta-trained by applying 20 adaptation steps (i.e. 400 training points seen per task). Both GP methods are meta-trained by memorizing the full N × K = 20 × 15 = 300 support points without further data augmentation, while during meta-testing we allow the GP methods to see up to 2000 points. See Appendix D for more details.
                   Omniglot 5-way             mini-Imagenet 5-way        Augmented Omniglot 20-way
Model              K=1          K=5           K=1          K=5           K=15
Stochastic MAML    0.031±0.001  0.008±0.001   1.27±0.008   0.925±0.013   0.673±0.025
GP (linear)        0.036±0.002  0.012±0.001   1.267±0.008  0.904±0.005   0.676±0.034
GP (cos)           0.045±0.001  0.019±0.001   1.262±0.006  0.921±0.006   0.662±0.027
GP+MAML (linear)   0.036±0.003  0.010±0.001*  1.246±0.007* 0.900±0.009   0.671±0.024
GP+MAML (cos)      0.045±0.001  0.019±0.001   1.274±0.009  0.902±0.005*  0.616±0.027
MAML (our)         0.032±0.001  0.008±0.001   1.279±0.006  0.926±0.011   0.694±0.02

Classification accuracy for all methods is given in Table 2, while the corresponding negative log likelihood (NLL) scores are given in Table 3. We observe that GP-based architectures outperform MAML and S. MAML in more complex scenarios such as Augmented Omniglot. From the NLL scores, which depend on how well the predicted class probabilities are calibrated, we observe that the GP methods perform significantly better in all cases where uncertainty matters, i.e. on mini-Imagenet and Augmented Omniglot. On top of the described standard implementations of GP and GP+MAML, we implement variants with additional previously used architectures and tricks to improve results (see Appendix D.4). We observe that GP+MAML generally performs better than GP. Further, we notice that S. MAML has similar performance to MAML. One reason why S. MAML does not always outperform MAML could be related to hyperparameters and to the higher variance of the gradients caused by the reparametrization trick used to maximize the VIB objective in equation 7. This could imply that training a stochastic architecture requires longer training time, an issue deserving further investigation. We also found that the qualitative behaviour of GP and GP+MAML is quite different, as shown in Figure 4 in Appendix D.2.
In Appendix D.3 we provide an ablative analysis of the impact of β on the performance of our architecture; the main result is that a large range of β values gives similar performance in practice.

Data efficiency in meta-testing of gradient-based vs GP-based meta learning. Having performed an expensive meta-training phase, the ultimate goal of a meta-learning system is to be deployed in practice and solve new tasks. In a realistic meta-testing scenario the system is likely to operate in a regime where the training data for a given task arrives sequentially in mini-batches, and the system should continuously update itself. With this in mind, it is interesting to study the effect of K shots (or, more generally, the effect of the amount of processed data) on meta-testing performance. In Figure 2, we carry out an ablation study comparing MAML and GP on sinusoid regression, mini-Imagenet and Augmented Omniglot, varying either K (for sinusoid regression and mini-Imagenet) or the amount of data augmentation (for Augmented Omniglot). We found that GP can be much more data efficient than MAML. This is because the GP predictive updates are based on Bayesian updating of sufficient statistics (for the linear kernels used here), similar to Bayesian linear regression, as described in detail in Appendix C.3. Such updates are exchangeable (i.e. the order of the data does not affect the final prediction) and do not depend on learning rates. In contrast, a gradient-based method such as MAML can be less data efficient, since stochastic gradient updates depend on the learning rate, mini-batch size, data order and the size of the inner loop.
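The exchangeability argument can be illustrated with a minimal Bayesian linear-regression sketch (our own toy illustration, not the paper's implementation; the unit-Gaussian prior and noise level σ² = 0.1 are arbitrary choices): the posterior mean depends only on accumulated sufficient statistics, so the order of the mini-batches is irrelevant.

```python
import numpy as np

def posterior_mean(batches, s2=0.1):
    """Posterior mean of Bayesian linear regression weights, accumulated
    from mini-batches via sufficient statistics A = I + sum Phi^T Phi / s2
    and b = sum Phi^T y / s2 (standard normal prior on the weights)."""
    M = batches[0][0].shape[1]
    A = np.eye(M)                 # prior precision (identity, an assumption)
    b = np.zeros(M)
    for Phi, y in batches:
        A = A + Phi.T @ Phi / s2  # order of accumulation is irrelevant
        b = b + Phi.T @ y / s2
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
batches = [(rng.normal(size=(5, 3)), rng.normal(size=5)) for _ in range(4)]
w1 = posterior_mean(batches)
w2 = posterior_mean(list(reversed(batches)))  # same data, reversed order
assert np.allclose(w1, w2)                    # exchangeable updates
```

A sequence of SGD steps over the same mini-batches would in general not satisfy this property, which is the contrast drawn above.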

6. CONCLUSION

We introduced an information theoretic framework for meta learning by using a variational approximation (Alemi et al., 2017; Chalk et al., 2016; Achille & Soatto, 2016) to the information bottleneck (Tishby et al., 1999). We derived a novel memory-based meta learning method with GPs, a stochastic MAML and a combination of memory and gradient-based meta learning. While we have demonstrated our method on few-shot regression and classification, we believe that the scope of the information bottleneck for meta learning is much broader. For instance, an interesting topic for future research is to consider applications in reinforcement learning.

APPENDIX

Here, we provide additional details regarding our method. Appendix A gives the full derivation of the variational information bottleneck (VIB) objective for meta learning. Appendix B explains how the transductive and non-transductive settings, often used in few-shot image classification, can be viewed as particular cases of the VIB framework under suitably defined encodings. Appendix C provides full details of the GP meta learning method proposed in the main paper. Finally, Appendix D discusses experimental settings and presents additional experimental results and ablation studies.

A.1 BOUNDS ON THE MUTUAL INFORMATION

Here, we review the standard variational bounds on the mutual information from Barber & Agakov (2003). Recall the definition of the mutual information,
$$ I(x, y) = \int q(x, y) \log \frac{q(x, y)}{q(x) q(y)} \, dx\,dy = \int q(x, y) \log \frac{q(x|y)}{q(x)} \, dx\,dy. $$
By introducing a distribution $p(x|y)$ that approximates $q(x|y)$ we get
$$ I(x, y) = \int q(x, y) \log \frac{p(x|y)\, q(x|y)}{p(x|y)\, q(x)} \, dx\,dy = \int q(x, y) \log \frac{p(x|y)}{q(x)} \, dx\,dy + \int q(y)\, \mathrm{KL}[q(x|y) \| p(x|y)] \, dy, $$
which shows that
$$ I(x, y) \ge \int q(x, y) \log \frac{p(x|y)}{q(x)} \, dx\,dy, $$
since $\int q(y)\, \mathrm{KL}[q(x|y) \| p(x|y)]\, dy$ is non-negative. An upper bound is obtained similarly. Suppose $p(x)$ approximates $q(x)$; then
$$ I(x, y) = \int q(x, y) \log \frac{p(x)\, q(x|y)}{p(x)\, q(x)} \, dx\,dy = \int q(x, y) \log \frac{q(x|y)}{p(x)} \, dx\,dy - \mathrm{KL}[q(x) \| p(x)], $$
which shows that
$$ I(x, y) \le \int q(x, y) \log \frac{q(x|y)}{p(x)} \, dx\,dy. $$

A.2 THE GENERAL VIB META LEARNING CASE

Consider the general case, where we work with the unconditional mutual information and wish to approximate the IB objective $I(Z, D^v) - \beta I(Z, D^t)$. Recall that the joint distribution is written as $q_w(D^v, D^t, Z) = q_w(Z|D^t)\, p(D^v, D^t)$, from which we can express any marginal or conditional. In particular, observe that $q_w(Z, D^v) = \int q_w(Z|D^t)\, p(D^v, D^t)\, dD^t$. If we have a function $f(Z, D^v)$ and wish to approximate the expectation,
$$ \int q_w(Z, D^v) f(Z, D^v)\, dZ\,dD^v = \int q_w(Z|D^t)\, p(D^v, D^t)\, f(Z, D^v)\, dZ\,dD^v\,dD^t, \quad (15) $$
then, given a sampled task pair $(D^v_i, D^t_i) \sim p(D^v, D^t)$, we obtain the following unbiased estimate of this expectation:
$$ \int q_w(Z|D^t_i)\, f(Z, D^v_i)\, dZ. \quad (16) $$
We make use of equation 15 and equation 16 in the derivation below. To compute the variational approximation to the IB, we lower bound $I(Z, D^v)$ as
$$ I(Z, D^v) = \int q_w(Z, D^v) \log \frac{q_w(Z, D^v)}{q_w(Z)\, p(D^v)}\, dZ\,dD^v = \int q_w(Z, D^v) \log \frac{q_w(D^v|Z)}{p(D^v)}\, dZ\,dD^v $$
$$ \ge \int q_w(Z, D^v) \log \frac{p_\theta(D^v|Z)}{p(D^v)}\, dZ\,dD^v, \quad \text{by using equation 12,} $$
$$ = \int q_w(Z, D^v) \log p_\theta(D^v|Z)\, dZ\,dD^v + H(D^v), $$
where the entropy $H(D^v)$ is just a constant.
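For jointly Gaussian variables the Barber–Agakov lower bound can be checked numerically; the following is a small Monte Carlo sketch of our own (not from the paper), where the variational $p(x|y)$ is chosen equal to the exact conditional, so the lower bound is tight and matches the analytic mutual information $-\tfrac{1}{2}\log(1-\rho^2)$.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.8, 200_000

# Sample (x, y) jointly Gaussian with correlation rho.
y = rng.normal(size=n)
x = rho * y + np.sqrt(1 - rho**2) * rng.normal(size=n)

def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

# Barber-Agakov lower bound E[log p(x|y) - log q(x)], here with the exact
# conditional p(x|y) = N(rho*y, 1 - rho^2) and marginal q(x) = N(0, 1).
bound = np.mean(log_normal(x, rho * y, 1 - rho**2) - log_normal(x, 0.0, 1.0))
exact = -0.5 * np.log(1 - rho**2)   # analytic mutual information

assert abs(bound - exact) < 0.01    # tight, up to Monte Carlo error
```

Using any other conditional density in place of the exact one would push the estimate below `exact`, which is the inequality used throughout this appendix.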
Subsequently, we upper bound $I(Z, D^t)$ as follows:
$$ I(Z, D^t) = \int q_w(Z, D^t) \log \frac{q_w(Z, D^t)}{q_w(Z)\, p(D^t)}\, dZ\,dD^t = \int q_w(Z|D^t)\, p(D^t) \log \frac{q_w(Z|D^t)}{q_w(Z)}\, dZ\,dD^t $$
$$ \le \int q_w(Z|D^t)\, p(D^t) \log \frac{q_w(Z|D^t)}{p_\theta(Z)}\, dZ\,dD^t, \quad \text{by using equation 13.} $$
Then we obtain the overall loss $\mathcal{F}(\theta, w) \le \mathcal{L}_{IB}(w)$:
$$ \mathcal{F}(\theta, w) = \int q_w(Z, D^v) \log p_\theta(D^v|Z)\, dZ\,dD^v - \beta \int q_w(Z|D^t)\, p(D^t) \log \frac{q_w(Z|D^t)}{p_\theta(Z)}\, dZ\,dD^t, $$
where we dropped the constant entropic term $H(D^v)$. Therefore, given a set of task pairs $\{D^t_i, D^v_i\}_{i=1}^b$, where each $(D^t_i, D^v_i) \sim p(D^v, D^t)$, the objective function for learning $(\theta, w)$ becomes the empirical average $\frac{1}{b} \sum_{i=1}^b \mathcal{F}_i(\theta, w)$, where
$$ \mathcal{F}_i(\theta, w) = \int q_w(Z_i|D^t_i) \log p_\theta(D^v_i|Z_i)\, dZ_i - \beta \int q_w(Z_i|D^t_i) \log \frac{q_w(Z_i|D^t_i)}{p_\theta(Z_i)}\, dZ_i, $$
and for the first term we made use of equation 15 and equation 16 with $f(D^v, Z) = \log p_\theta(D^v|Z)$.

A.3 THE SUPERVISED META LEARNING VIB CASE

For the supervised meta learning case the joint density can be written as
$$ q_w(D^v, D^t, Z) = q_w(Z|Y^t, X^t, X^v)\, p(Y^t, Y^v|X^t, X^v)\, p(X^v, X^t) = q_w(Z|Y^t, X)\, p(Y^t, Y^v|X)\, p(X), $$
where $X = (X^t, X^v)$ and the encoding distribution $q_w(Z|Y^t, X)$ may depend on all inputs $X$ but only on the training outputs $Y^t$. The derivation of the VIB objective is similar to the general case, with the difference that now we approximate the conditional information bottleneck $I(Z, Y^v|X) - \beta I(Z, Y^t|X)$, where we condition on the inputs $X$. In other words, both $I(Z, Y^v|X)$ and $I(Z, Y^t|X)$ are conditional mutual informations, i.e. they have the form
$$ I(z, y|x) = \int q(x) \left[ \int q(z, y|x) \log \frac{q(z, y|x)}{q(z|x)\, q(y|x)}\, dz\,dy \right] dx = \int q(z, y, x) \log \frac{q(z, y|x)}{q(z|x)\, q(y|x)}\, dz\,dy\,dx. $$
We can lower bound $I(Z, Y^v|X)$ as follows:
$$ I(Z, Y^v|X) = \int p(X) \int q_w(Z, Y^v|X) \log \frac{q_w(Z, Y^v|X)}{q_w(Z|X)\, p(Y^v|X)}\, dZ\,dY^v\,dX = \int p(X) \int q_w(Z, Y^v|X) \log \frac{q_w(Y^v|Z, X)}{p(Y^v|X)}\, dZ\,dY^v\,dX, $$
where $q_w(Z|X)$ cancels,
$$ \ge \int p(X) \int q_w(Z, Y^v|X) \log \frac{p_\theta(Y^v|Z, X)}{p(Y^v|X)}\, dZ\,dY^v\,dX, \quad \text{by using equation 12,} $$
$$ = \int q_w(Z, Y^v, X) \log \frac{p_\theta(Y^v|Z, X)}{p(Y^v|X)}\, dZ\,dY^v\,dX = \int q_w(Z, Y^v, X) \log p_\theta(Y^v|Z, X)\, dZ\,dY^v\,dX - \int p(Y^v, X) \log p(Y^v|X)\, dY^v\,dX. \quad (19) $$
Note that $-\int p(Y^v, X) \log p(Y^v|X)\, dY^v\,dX$ is just a constant that does not depend on tunable parameters. Also $q_w(Z, Y^v, X) = \int q_w(Z|Y^t, X)\, p(Y^t, Y^v|X)\, p(X)\, dY^t$, so that if we have a task sample $(Y^t_i, Y^v_i, X_i) \sim p(Y^t, Y^v|X)\, p(X)$, an unbiased estimate of the expectation $\int q_w(Z, Y^v, X) \log p_\theta(Y^v|Z, X)\, dZ\,dY^v\,dX$ is given by
$$ \int q_w(Z|Y^t_i, X_i) \log p_\theta(Y^v_i|Z, X_i)\, dZ. \quad (21) $$
We upper bound $I(Z, Y^t|X)$ as follows:
$$ I(Z, Y^t|X) = \int p(X) \int q_w(Z, Y^t|X) \log \frac{q_w(Z, Y^t|X)}{q_w(Z|X)\, p(Y^t|X)}\, dZ\,dY^t\,dX = \int p(X) \int q_w(Z, Y^t|X) \log \frac{q_w(Z|Y^t, X)}{q_w(Z|X)}\, dZ\,dY^t\,dX, $$
where $p(Y^t|X)$ cancels,
$$ \le \int p(X) \int q_w(Z, Y^t|X) \log \frac{q_w(Z|Y^t, X)}{p_\theta(Z|X)}\, dZ\,dY^t\,dX, \quad \text{by using equation 13,} $$
$$ = \int q_w(Z|Y^t, X)\, p(Y^t, X) \log \frac{q_w(Z|Y^t, X)}{p_\theta(Z|X)}\, dZ\,dY^t\,dX. $$
Then we obtain the overall objective,
$$ \mathcal{F}(\theta, w) = \int q_w(Z, Y^v, X) \log p_\theta(Y^v|Z, X)\, dZ\,dY^v\,dX - \beta \int q_w(Z|Y^t, X)\, p(Y^t, X) \log \frac{q_w(Z|Y^t, X)}{p_\theta(Z|X)}\, dZ\,dY^t\,dX, $$
where we dropped the constant term. Therefore, given a set of task pairs, the objective becomes the empirical average $\frac{1}{b} \sum_{i=1}^b \mathcal{F}_i(\theta, w)$, where
$$ \mathcal{F}_i(\theta, w) = \int q_w(Z|Y^t_i, X_i) \log p_\theta(Y^v_i|Z, X_i)\, dZ - \beta \int q_w(Z|Y^t_i, X_i) \log \frac{q_w(Z|Y^t_i, X_i)}{p_\theta(Z|X_i)}\, dZ, $$
and we made use of equation 21.

A.4 CONNECTION WITH VARIATIONAL INFERENCE

As mentioned in the main paper, the VIB for meta learning (considering for simplicity the general case from A.2) is similar to applying approximate variational inference to a certain joint model over the validation set, $p_\theta(D^v|Z)\, p_\theta(Z)$, where $p_\theta(D^v|Z)$ is the decoder model, $p_\theta(Z)$ is a prior model over the latent variables, and the corresponding marginal likelihood is $p(D^v) = \int p_\theta(D^v|Z)\, p_\theta(Z)\, dZ$. We can lower bound the log marginal likelihood with a variational distribution $q_w(Z|D^t)$ that depends on the training set $D^t$:
$$ \mathcal{F}_{\beta=1}(w, \theta) = \int q_w(Z|D^t) \log p_\theta(D^v|Z)\, dZ - \int q_w(Z|D^t) \log \frac{q_w(Z|D^t)}{p_\theta(Z)}\, dZ, $$
which corresponds to the VIB objective with $\beta = 1$.
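The second term of this objective is the standard KL regularizer of an ELBO. A quick numeric sketch (scalar Gaussians of our own choosing, not the paper's networks) compares the closed-form Gaussian KL with the reparametrized Monte Carlo estimate that would be used when training such an objective:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy encoder q_w(z|D^t) = N(mu, sig^2) and prior p_theta(z) = N(0, 1).
mu, sig = 0.7, 0.5

# Closed-form KL[q || p] for univariate Gaussians.
kl_exact = np.log(1.0 / sig) + (sig**2 + mu**2) / 2.0 - 0.5

# Monte Carlo estimate via the reparametrization z = mu + sig * eps,
# the same trick used to maximize the VIB objective.
eps = rng.normal(size=500_000)
z = mu + sig * eps
log_q = -0.5 * (np.log(2 * np.pi * sig**2) + ((z - mu) / sig) ** 2)
log_p = -0.5 * (np.log(2 * np.pi) + z**2)
kl_mc = np.mean(log_q - log_p)

assert abs(kl_mc - kl_exact) < 0.01
```

The same reparametrized estimator, with a decoder log-likelihood term added, gives an unbiased gradient of $\mathcal{F}_{\beta=1}$.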

B TRANSDUCTIVE AND NON-TRANSDUCTIVE META LEARNING

Here, we discuss how the transductive and non-transductive settings that appear in few-shot image classification (Bronskill et al., 2020; Finn et al., 2017; Nichol et al., 2018), due to the use of batch-normalization, can be interpreted under our VIB framework by defining suitable encodings. We shall use MAML as an example, but the discussion is more generally relevant. The transductive case occurs when the concatenated support and validation/test inputs $X = (X^t, X^v)$ of a single task (we ignore the task index $i$ to keep the notation uncluttered) are used to compute batch-norm statistics (possibly at different stages) shared by all validation/test points when predicting those points. For MAML this implies a deterministic parametric encoding, i.e. common to all individual validation inputs $x^v_j \in X^v$, obtained by a sequence of two steps: (i) Obtain the task-specific parameter $\psi$ in the usual way from the support loss, i.e. $\psi = \theta + \Delta(\theta, D^t)$. If batch-normalization is used here, then the statistics are computed only from $X^t$. (ii) Compute the validation loss by applying batch-normalization on $X^v$ or on the union $X = X^t \cup X^v$ (the union seems to be the better choice, but it is not often used in practice for computational reasons; e.g. Finn et al. (2017) and Nichol et al. (2018) prefer to use only $X^v$). In both cases, the underlying encoder is parametric over the final effective task parameter $\tilde{\psi} = BN(\psi, X)$, where $BN$ denotes the final batch-norm operation that outputs a parameter vector which predicts all validation points, and the encoder is a deterministic delta measure. In contrast, the non-transductive setting occurs when each individual validation input $x^v_j$ is concatenated with the support inputs $X^t$ to form the sets $x^v_j \cup X^t$, $j = 1, \ldots, n^v$. Then, each set $x^v_j \cup X^t$ is used to compute point-specific batch-norm statistics when predicting the corresponding validation output $y^v_j$.
Under the VIB framework this corresponds to a non-parametric encoding, which grows with the size of the validation set. The first deterministic step of this encoder is the same step (i) from the transductive case above, but the second step differs: now we get a validation point-specific task parameter $\tilde{\psi}_j = BN(\psi, x^v_j \cup X^t)$ by computing the statistics using the set $x^v_j \cup X^t$. For MAML, the encoding becomes $Z \equiv \{\tilde{\psi}_j\}_{j=1}^{n^v}$, and the encoder distribution is a product of delta measures, i.e. $p(\{\tilde{\psi}_j\}_{j=1}^{n^v}|Y^t, X) \equiv \prod_{j=1}^{n^v} \delta_{\tilde{\psi}_j,\, BN(\theta + \Delta(\theta, D^t),\, x^v_j \cup X^t)}$. Finally, note that under the VIB perspective it does not make much sense to meta-train transductively and meta-test non-transductively, or vice versa, since this changes the encoding. That is, whatever we do in meta-training we should do the same in meta-testing.

C FURTHER DETAILS ABOUT THE GAUSSIAN PROCESS METHOD

For simplicity, in the following we ignore the task index $i$ to keep the notation uncluttered, and write for example $f^t_i$ as $f^t$, etc.

C.1 AMORTIZATION OF THE GP ENCODER $q(f^t|D^t)$

A suitable choice of $q(f^t|D^t)$ is to set it equal to the exact posterior distribution over $f^t$ given the training set, i.e.
$$ p(f^t|D^t) \propto \prod_{j=1}^{n^t} p(y^t_j|f^t_j)\, N(f^t|0, K^t). $$
Interestingly, such a setting does not require introducing any extra variational parameters $w$, and it will depend only on the model parameters $\theta$ that appear in the kernel function and possibly also in the likelihood. For standard regression problems where the likelihood is Gaussian, i.e. $p(y^t_j|f^t_j) = N(y^t_j|f^t_j, \sigma^2)$, the exact posterior has the analytic form
$$ p(f^t|D^t) = N\big(f^t \,\big|\, K^t (K^t + \sigma^2 I)^{-1} Y^t,\; K^t - K^t (K^t + \sigma^2 I)^{-1} K^t\big), $$
and thus we can set $q(f^t|D^t) = p(f^t|D^t)$. For all other cases, where the likelihood is not Gaussian, we need to construct an amortized encoding distribution by approximating each non-Gaussian likelihood term $p(y^t_j|f^t_j)$ with a Gaussian term, similarly to how variational Bayes or Expectation-Propagation Gaussian approximations to a GP model are often parametrized (Hensman et al., 2014; Opper & Archambeau, 2009; Rasmussen & Williams, 2006), i.e.

$$ p(y^t_j|f^t_j) \approx N(m^t_j|f^t_j, s^t_j). $$

$\ldots (\Phi^{t\top} [S^t]^{-1} \Phi^t + I)^{-1} \Phi^{t\top} [S^t]^{-1}$ (based on the identity $(AB + I)^{-1} A = A (BA + I)^{-1}$). Now observe that the $M$-dimensional vector $b^t = \Phi^{t\top} [S^t]^{-1} m^t = \sum_{j=1}^{n^t} \phi(x^t_j; \theta) \frac{m^t_j}{s^t_j}$ can grow incrementally, without memorizing the feature vectors $\phi(x^t_j; \theta)$, based on the recursion $b^t \leftarrow b^t + \phi(x^t_j; \theta) \frac{m^t_j}{s^t_j}$ (with the initialization $b^t = 0$) as individual data points (or, similarly, mini-batches) are added to the support set: $D^t \leftarrow D^t \cup (x^t_j, y^t_j)$. Similarly, the $M \times M$ matrix $A^t = \Phi^{t\top} [S^t]^{-1} \Phi^t = \sum_{j=1}^{n^t} \frac{1}{s^t_j} \phi(x^t_j; \theta)\, \phi(x^t_j; \theta)^\top$ can also be computed recursively with constant $O(M^2)$ memory. Finally, note that constant memory during meta-testing is only possible when the feature parameters $\theta$ remain fixed, which means that it is not applicable to the GP+MAML combination.
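The constant-memory recursions can be checked with a toy sketch (random features of arbitrary dimension, our own illustration): the incremental updates of $b^t$ and $A^t$ reproduce the batch computation exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
M, n = 4, 50
Phi = rng.normal(size=(n, M))        # rows are feature vectors phi(x_j; theta)
m = rng.normal(size=n)               # amortized means m_j
s = rng.uniform(0.5, 2.0, size=n)    # amortized variances s_j

# Batch computation: b = Phi^T S^{-1} m,  A = Phi^T S^{-1} Phi.
b_batch = Phi.T @ (m / s)
A_batch = Phi.T @ (Phi / s[:, None])

# Constant-memory recursion: add one support point at a time, never
# storing past feature vectors (O(M^2) memory overall).
b = np.zeros(M)
A = np.zeros((M, M))
for j in range(n):
    b += Phi[j] * m[j] / s[j]
    A += np.outer(Phi[j], Phi[j]) / s[j]

assert np.allclose(b, b_batch) and np.allclose(A, A_batch)
```

As noted above, this only works while the feature map stays fixed; if $\theta$ is adapted per task (as in GP+MAML), the accumulated statistics become stale.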

C.4 MULTI-CLASS CLASSIFICATION

For multi-class classification meta learning problems we need to introduce as many latent functions as classes. For instance, when the number of classes per task is $N$, we need $N$ latent functions $f_n(x)$, all independent draws from the same GP. The marginal GP prior on the training and validation function values for a certain task factorizes as $\prod_{n=1}^N p(f^v_n|f^t_n, X^v, X^t)\, p(f^t_n|X^t)$. We assume a factorized encoding distribution of the form $\prod_{n=1}^N p(f^v_n|f^t_n, X^v, X^t)\, q(f^t_n|D^t)$, where each
$$ q(f^t_n|D^t) = N\big(f^t_n \,\big|\, K^t (K^t + S^t)^{-1} m^t_n,\; K^t - K^t (K^t + S^t)^{-1} K^t\big). $$
Here, $m^t_n = Y^t_n \circ m^t$, where $Y^t_n$ is a vector taking the value 1 for each data point $x^t_j$ that belongs to class $n$ and $-1$ otherwise. Note that the encoding distributions share the covariance matrix and differ only in their mean vectors. This representation of $m^t_n$ makes the full encoding distribution permutation invariant to the values of the class labels. Since we are also using shared (i.e. independent of the class labels) amortized functions $m_w(x)$ and $s_w(x)$, the terms $(S^t, m^t)$ are common to all $N$ factors. This allows us to compute the VIB objective very efficiently (in a way that is fully scalable w.r.t. the number of classes $N$), requiring only a single Cholesky decomposition of the matrix $K^t + S^t$. Specifically, by working similarly to C.2 we obtain the VIB objective for a single task,
$$ \sum_{j=1}^{n^v} E_{q(\{f^v_{n,j}\}_{n=1}^N)}\big[\log p(y^v_j|\{f^v_{n,j}\}_{n=1}^N)\big] - \beta \sum_{n=1}^N \sum_{j=1}^{n^t} E_{q(f^t_{n,j})}\big[\log N(m^t_{n,j}|f^t_{n,j}, s^t_j)\big] + \beta \sum_{n=1}^N \log N(m^t_n|0, K^t + S^t), $$
where $q(\{f^v_{n,j}\}_{n=1}^N) = \prod_{n=1}^N q(f^v_{n,j})$ and each univariate Gaussian $q(f^v_{n,j})$ is given by the same expression as in C.3. The last two terms of the bound (i.e. the ones multiplied by the hyperparameter $\beta$) are clearly computed analytically, while the first term involves the expectation of a log softmax, since the likelihood is
$$ p(y^v_j = n|\{f^v_{n',j}\}_{n'=1}^N) = \frac{e^{f^v_{n,j}}}{\sum_{n'=1}^N e^{f^v_{n',j}}}. $$
To evaluate this expectation we first apply the reparametrization trick to move all tunable parameters of $q(\{f^v_{n,j}\}_{n=1}^N)$ inside the log-likelihood (so that we get a new expectation under a product of $N$ univariate standard normals) and then apply Monte Carlo by drawing 200 samples. Finally, note that to compute the predictive density we need to evaluate $q(y_*) = E_{q(\{f_{n,*}\}_{n=1}^N)}\big[p(y_*|\{f_{n,*}\}_{n=1}^N)\big]$, which is again done by Monte Carlo with 200 samples from $q(\{f_{n,*}\}_{n=1}^N)$. This is precisely how the negative log-likelihood test performance was computed for the GP models in Table 3. To decide the classification label based on the maximum class predictive probability (in order to compute e.g. accuracy scores), we take advantage of the fact that all $N$ univariate predictive Gaussians $q(f_{n,*})$ have the same variance but different means; thus the predicted class can equivalently be obtained by taking the argmax of the means of these $N$ distributions.
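The shared-Cholesky trick can be sketched numerically (toy dimensions and a stand-in linear kernel of our own choosing): because the $N$ class encoders differ only in their mean vectors, one factorization of $K^t + S^t$ serves all classes.

```python
import numpy as np

rng = np.random.default_rng(4)
n_t, N, d = 8, 5, 3
Xf = rng.normal(size=(n_t, d))
K = Xf @ Xf.T / d                              # stand-in linear kernel K^t
S = np.diag(rng.uniform(0.5, 2.0, size=n_t))   # diagonal S^t from s_w(x)
m_t = rng.normal(size=n_t)                     # shared amortized means m^t
labels = rng.integers(0, N, size=n_t)
Y = np.where(labels[None, :] == np.arange(N)[:, None], 1.0, -1.0)
M_n = Y * m_t                                  # class means m^t_n = Y^t_n o m^t

# One Cholesky factor of (K + S), reused for every class.
L = np.linalg.cholesky(K + S)
solve = lambda v: np.linalg.solve(L.T, np.linalg.solve(L, v))
means_shared = np.stack([K @ solve(M_n[n]) for n in range(N)])

# Direct per-class solves for comparison.
means_direct = np.stack([K @ np.linalg.solve(K + S, M_n[n]) for n in range(N)])
assert np.allclose(means_shared, means_direct)
```

The covariance $K^t - K^t (K^t + S^t)^{-1} K^t$ is likewise computed once, since it does not depend on the class index.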

C.5 SPECIFIC GP IMPLEMENTATION AND AMORTIZATION FOR FEW-SHOT CLASSIFICATION

For all few-shot multi-class classification experiments, in order to implement the GP and GP+MAML methods we need to specify the feature vector φ(x; θ) and the amortized variational functions m_w(x) and s_w(x). The feature vector is specified to have exactly the same neural architecture used in previous works for all datasets, Omniglot, mini-Imagenet and Augmented Omniglot; see for example Chen et al. (2019) for details on these architectures for all three datasets. Note that when computing the GP kernel function, the feature vector φ(x; θ) is also augmented with the value 1 to automatically account for a bias term in the kernel. Regarding the two amortized variational functions needed to obtain the encoder, we consider a shared (with the GP functions) representation by adding two heads to the same feature vector φ(x; θ): the first head is a linear output function m_w(x), and the second applies a softplus activation s_w(x) = log(1 + exp(a(x))) (since s_w(x) represents a variance), where the pre-activation a(x) is a linear function of the feature vector. For numerical stability we also apply a final clipping, bounding these functions so that m_w(x) ∈ [-20, 20] and s_w(x) ∈ [0.001, 20]. The bounds -20 and 20 are almost never reached during optimization, so they are not crucial; in contrast, the lower bound 0.001 on s_w(x) is rather crucial for numerical stability, since it ensures that the minimum eigenvalue of the matrix K^t + S^t (i.e. the matrix we need to decompose using Cholesky) is bounded below by 0.001.
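A minimal sketch of the two amortized heads (the linear weights here are random placeholders, not trained parameters; the feature dimension 65 follows the Omniglot setup described in D.1):

```python
import numpy as np

rng = np.random.default_rng(5)
M = 65                                   # e.g. 64 conv features + 1 bias entry
W_m = rng.normal(size=M) * 0.1           # hypothetical mean-head weights
W_a = rng.normal(size=M) * 0.1           # hypothetical variance-head weights

def heads(phi):
    """Two linear heads on the shared feature vector phi(x; theta):
    mean m_w(x), clipped to [-20, 20], and softplus variance s_w(x),
    clipped to [0.001, 20] for numerical stability."""
    m = np.clip(W_m @ phi, -20.0, 20.0)
    s = np.clip(np.log1p(np.exp(W_a @ phi)), 0.001, 20.0)
    return m, s

phi = rng.normal(size=M)
m, s = heads(phi)
assert -20.0 <= m <= 20.0
assert 0.001 <= s <= 20.0
# The floor on s_w(x) keeps the smallest eigenvalue of K^t + S^t >= 0.001,
# so the Cholesky factorization stays well-conditioned.
```

In the actual method these heads are trained jointly with the kernel features; the sketch only illustrates the activation and clipping choices.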

D.1 EXPERIMENTAL SETTINGS AND HYPERPARAMETERS

The memory-based GP method (referred to in the tables as GP) is trained with the VIB objective, as defined in Section 3, where the kernel feature vector φ(x; θ) is obtained from the last hidden layer of the same neural architecture as used in MAML. Based on this M-dimensional feature vector φ(x; θ) ∈ R^M we consider two kernel functions: the standard linear kernel $k_\theta(x, x') = \frac{1}{M} \phi(x; \theta)^\top \phi(x'; \theta)$ and the cosine kernel $k_\theta(x, x') = \frac{\phi(x; \theta)^\top \phi(x'; \theta)}{||\phi(x; \theta)||\, ||\phi(x'; \theta)||}$. The Augmented Omniglot benchmark is a modified version of Omniglot which necessitates long-horizon adaptation and is often considered a many-shot problem (Chen et al., 2019). For each alphabet, 20 characters are sampled to define an N = 20-class classification problem with K = 15 shots. Further, both train and validation images are randomly augmented by applying transformations, which makes the benchmark even more challenging. Following Flennerhag et al. (2018) and Chen et al. (2019), during meta-testing both MAML and Stochastic MAML perform 100 steps of adaptation (resulting in 2000 data points seen in total by the model, where each step processes a mini-batch of 20 points), while they are meta-trained by applying 20 adaptation steps (i.e. 400 training points seen per task). Both GP methods are meta-trained by memorizing the full N × K = 20 × 15 = 300 support points without further data augmentation, while during meta-testing we allow the GP methods to see up to 2000 points. For all methods we perform a hyperparameter search using the train and validation subsets of all three benchmarks, as detailed below. For sinusoid regression, Omniglot and mini-Imagenet, meta-training of all methods consisted of 60000 iterations or episodes, where in each episode a learning update is performed based on a mini-batch of tasks. For sinusoid regression the meta batch-size was 5 tasks, and for each task we have K = 10 examples.
For N-way, K-shot classification on Omniglot and mini-Imagenet (with N = 5 and K = 1, 5, as in Table 2 in the main paper), the meta batch-size was 32 tasks for Omniglot and 4 tasks for mini-Imagenet. For these two datasets, Stochastic MAML uses one gradient step (in the inner loop) for both meta-training and meta-testing on Omniglot, while on mini-Imagenet it uses 5 and 10 steps respectively, i.e. exactly as MAML was applied to these datasets by Finn et al. (2017). As mentioned in the main paper, GP+MAML uses one gradient adaptation step on Omniglot and 5 on mini-Imagenet (for both meta-training and meta-testing). The neural architectures of all these experiments are the same as in Finn et al. (2017). Specifically, for Omniglot the architecture is from Vinyals et al. (2016), which has 4 modules with 3 × 3 convolutions and 64 filters, followed by batch normalization, a ReLU nonlinearity and 2 × 2 max-pooling. The Omniglot images are downsampled to 28 × 28, so the dimensionality of the last hidden layer is 64. For Omniglot, strided convolutions are used instead of max-pooling. For the GP methods this yields a 65-dimensional feature vector φ(x; θ) (64 plus one for the bias term included in φ(x; θ)). For mini-Imagenet, the network uses 32 filters per layer and the final feature vector is obtained by flattening, so that φ(x; θ) is 801-dimensional. For Augmented Omniglot, meta-training of all methods consisted of 5000 iterations. Additional hyperparameter details are included in Tables 4, 5 and 6.
The ranges for the hyperparameter search are:
• Outer learning rate α: [0.0001, 0.00025, 0.0005, 0.00075, 0.001, 0.002, 0.0025, 0.005, 0.0075, 0.01, 0.025, 0.05, 0.075, 0.1]
• Inner learning rate ρ: [0.0001, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.5, 1.0, 1.5, 2.0]
• Bottleneck coefficient β: [1e-07, 1e-06, 1e-05, 5e-05, 0.0001, 0.00025, 0.0005, 0.00075, 0.001, 0.0025, 0.005, 0.0075, 0.01, 0.1, 0.5, 1.0]

D.2 ADDITIONAL EXPERIMENTAL RESULTS

Table 7 provides detailed few-shot sinusoid regression results, while Figure 3 illustrates how the GP model, after having been meta-trained on the sinusoid regression tasks, predicts a test task as the number K of training examples increases. We also found that the qualitative behaviour of GP and GP+MAML is quite different. In Figure 4, we report the performance of GP and GP+MAML on mini-Imagenet tasks as a function of the number of inner steps executed at test time by GP+MAML. Note that GP effectively performs a single update over all the test-time data at once, since it has no inner loop. Intuitively, we would expect GP+MAML to have similar performance to GP at the beginning of the inner loop, but this is not what happens. Instead, it starts quite low, peaks at exactly the number of inner steps used during meta-training, and then the performance starts to deteriorate.

D.3 ABLATIVE STUDIES

Bottleneck cost β ablation. To understand the impact of β on learning, we provide an ablation analysis in Figure 5 on sinusoid regression and two few-shot classification tasks. For the regression task, Figure 5(a), we observe that a large β = 1.0 is beneficial and provides the best mean squared error (MSE). For regression we report MSE (lower is better); for the few-shot classification tasks we report accuracy (higher is better).

D.4 IMPROVED VERSIONS FOR THE GP-BASED METHODS

For the GP-based methods we consider additional popular tricks used in the community to improve performance. For the Augmented Omniglot dataset (Flennerhag et al., 2018), we add 4 additional layers to the network architecture similarly to Flennerhag et al. (2020), where each layer is added after a convolutional layer. These additional layers are simple convolutions, as considered in Flennerhag et al. (2020). In addition, we add batch normalization. Simply using more layers in the GP methods significantly improves performance on this dataset. Note that these results are achieved without the special warp architecture of Flennerhag et al. (2020). In Table 8 we show the results with and without batch normalization (BN). In addition, we run further experiments on mini-ImageNet (Ravi & Larochelle, 2017) and tiered-ImageNet (Ren et al., 2018) using pre-trained embeddings from the LEO paper (Rusu et al., 2019) as features. The embeddings are taken from their GitHub repository. On top of these features we construct an MLP with two hidden layers and ReLU activations, where each layer has 128 hidden dimensions. This defines the feature vector φ(x; θ) used by the GP methods in all these experiments. Moreover, we apply dropout to the LEO embeddings before passing them to the MLP. The results are given in Table 9. The hyperparameters are tuned in the same way as in the previous experiments, and are given in Table 10. Notice that the GP methods are comparable
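A sketch of this feature pipeline (our own illustration: the embedding size, dropout rate and random weights are placeholders; only the two 128-unit ReLU layers and the dropout-on-embeddings ordering follow the description above):

```python
import numpy as np

rng = np.random.default_rng(6)

def mlp_features(emb, W1, W2, drop_rate=0.2, train=True):
    """Two-hidden-layer ReLU MLP on top of frozen pre-trained embeddings,
    with (inverted) dropout applied to the embeddings first. The output
    plays the role of the feature vector phi(x; theta)."""
    if train:
        mask = rng.random(emb.shape) >= drop_rate
        emb = emb * mask / (1.0 - drop_rate)
    h = np.maximum(0.0, emb @ W1)
    return np.maximum(0.0, h @ W2)

D = 640                                   # assumed embedding dimension
W1 = rng.normal(size=(D, 128)) * 0.05     # placeholder weights
W2 = rng.normal(size=(128, 128)) * 0.05
emb = rng.normal(size=(4, D))             # a mini-batch of 4 embeddings
phi = mlp_features(emb, W1, W2, train=False)
assert phi.shape == (4, 128)
```

The resulting 128-dimensional features would then feed the linear or cosine kernel exactly as in the main experiments.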



Footnotes:
1. The lower bounds are valid even when the parameters w of the encoder q_w(Z|D^t) and θ of the decoder p_θ(D^v|Z) (and prior p_θ(Z)) have shared components, e.g. are parameters of the same neural architecture.
2. Dependence on all inputs can explain both transductive and non-transductive settings for meta learning (Bronskill et al., 2020; Finn et al., 2017; Nichol et al., 2018) as special cases; see Appendix B.
3. The kernel variance parameter σ²_f (if learnable) is also considered to be part of the full set of parameters θ.
4. To obtain such a bound we would either need access to the intractable posterior p_θ(ψ_i|X^t, Y^t), or to upper bound the marginal likelihood on the training points p_θ(Y^t|X^t), which is very hard.



Figure 1: (left) Meta learning with the information bottleneck. $Z_i \sim q_w(Z_i|D^t_i)$ is the encoder we wish to optimize so as to compress the task training set $D^t_i$ (minimize the mutual information $I(Z_i, D^t_i)$) and predict well the validation set $D^v_i$ (maximize $I(Z_i, D^v_i)$); see Section 2. (right) Specialization to supervised few-shot learning, where for each task we have input-output data. Gradient-based algorithms, such as MAML, and Gaussian process memory-based methods (proposed in this paper) are instances of the framework; see Sections 2.2 and 3.

Figure 2: (a) Sinusoid regression with GP and MAML in meta-testing as the number of shots K (x-axis) increases. On the y-axis we report the MSE. For MAML, we report the performance with different numbers of inner loop steps (just SGD steps, since this is meta-testing), specified in the legend. (b)-(c) Meta-testing classification accuracy (y-axis) on mini-Imagenet, where each system has been meta-trained with either N = 5, K = 1 or N = 5, K = 5, as the number K of observed examples per class (while always N = 5) grows from 1 to 20; e.g. for K = 20 each system sees N × K = 100 support examples. For MAML we show the performance for different inner loop sizes, where in meta-testing each SGD step uses a random mini-batch of 10 data points from the support set of size N × K. (d) Similarly to (b)-(c) for Augmented Omniglot, where instead of growing K we increase the amount of data augmentation on a pre-specified, fixed N = 20, K = 15 initial support set. This means that in each SGD update of MAML, or each predictive density GP update, we sample a mini-batch from the fixed N × K support set, apply random transformations to this mini-batch and then use it to perform the actual update. The mini-batch size was 20, and based on this data augmentation process the amount of data (x-axis) processed by each method grows from 20 up to 2000. Finally, to create all plots we average performance over 10 repeats, where in each repeat the systems are meta-trained from scratch and then evaluated on a large number of meta-testing tasks.

The standard linear kernel is $k_\theta(x, x') = \frac{1}{M} \phi(x; \theta)^\top \phi(x'; \theta)$ (where the kernel variance $\sigma^2_f$ is fixed to $1/M$) and the cosine kernel is $k_\theta(x, x') = \frac{\phi(x; \theta)^\top \phi(x'; \theta)}{||\phi(x; \theta)||\, ||\phi(x'; \theta)||}$. These kernels are used in the experiments. The Omniglot dataset consists of 20 instances of 1623 characters from 50 different alphabets, where each instance was drawn by a different person. It is augmented by creating new characters which are rotations of each of the existing characters by 0, 90, 180 or 270 degrees. In the Omniglot experiment, MAML, Stochastic MAML and GP+MAML run by applying one adaptation step for both meta-training and meta-testing. mini-Imagenet involves 64 training classes, 12 validation classes and 24 test classes. Following previous work on mini-Imagenet, we meta-train Stochastic MAML with 5 adaptation steps, while 10 steps are used for meta-testing. GP+MAML uses 5 steps for both meta-training and meta-testing. For both Omniglot and mini-Imagenet we follow the experimental protocol proposed by Finn et al. (2017), which involves fast learning of N = 5-way classification with K = 1 or K = 5 shots. The problem of N-way classification is set up as follows: select N unseen classes, provide the model with K different instances of each class, and evaluate the model's ability to classify new instances within the N classes.
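The two kernels can be written compactly as follows (a small sketch of our own; the rows of the feature matrix are the vectors φ(x; θ)):

```python
import numpy as np

def linear_kernel(Phi):
    """k(x, x') = phi(x)^T phi(x') / M, with sigma_f^2 fixed to 1/M."""
    M = Phi.shape[1]
    return Phi @ Phi.T / M

def cosine_kernel(Phi):
    """k(x, x') = phi(x)^T phi(x') / (||phi(x)|| ||phi(x')||)."""
    U = Phi / np.linalg.norm(Phi, axis=1, keepdims=True)
    return U @ U.T

rng = np.random.default_rng(7)
Phi = rng.normal(size=(6, 65))            # 6 inputs, 65-dim features (64 + bias)
K_lin = linear_kernel(Phi)
K_cos = cosine_kernel(Phi)
assert np.allclose(np.diag(K_cos), 1.0)   # cosine kernel has unit variance
assert np.allclose(K_lin, K_lin.T)        # both Gram matrices are symmetric
```

The cosine kernel normalizes away feature magnitude, which is why its diagonal is constant while the linear kernel's is not.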

For each alphabet, 20 characters are sampled to define a 20-class classification problem with K = 15 data points per class. Furthermore, both train and test images are randomly augmented by applying transformations. Following Flennerhag et al. (2018) and Chen et al. (2019), we use a 4-layer convnet, and during meta-testing MAML and Stochastic MAML perform 100 steps of adaptation (resulting in 2000 data points, where each step processes a mini-batch of 20 data points), while they are meta-trained by applying 20 adaptation steps. GP+MAML uses 5 adaptation steps for both meta-training and meta-testing.

Figure 3: The red curve represents a ground truth sinusoid function, whereas the light blue one is the mean prediction of the GP. As K increases, the uncertainty (shown by the shaded area) of the GP decreases and its mean predictions become closer to the ground truth. Already with K = 4 points, the mean prediction matches the sinusoid function well.

Figure 4: (panel titles) mini-Imagenet (K = 1), mini-Imagenet (K = 5).

Figure 5: Ablation analysis for different β values on sinusoid regression and few-shot classification tasks. For regression we report mean squared error (MSE): lower is better. For few-shot classification tasks, we report the accuracy: higher is better.


Table 2: Classification test accuracy on Omniglot, mini-Imagenet and Augmented Omniglot. For all methods, mean performance with 95% confidence intervals is reported after repeating the experiments 10 times; each experiment performs meta-testing on 1000 tasks. Best performance is in bold, while * indicates statistically significantly better performance than MAML on Augmented Omniglot. For Augmented Omniglot we run MAML ourselves (see the Appendix for the hyperparameters), since this dataset was not included in Finn et al. (2017).

Table 3: Classification negative log likelihood (NLL) test performance on Omniglot, mini-Imagenet and Augmented Omniglot. To obtain these scores for MAML we re-run MAML, since only classification accuracy is reported in Finn et al. (2017). Again, * indicates statistically significantly better performance than MAML.

Hyperparameters for Stochastic MAML.

Hyperparameters for MAML in sinusoid regression and Augmented Omniglot.

Hyperparameters for the GP methods.

Detailed few-shot sinusoid regression results. We report results of the GP model for K = 5, 10, 20 and for MAML with different numbers of gradient adaptation steps (including a result reported in the original MAML paper by Finn et al. (2017)).

APPENDIX

where $m_j^t \equiv m_w(y_j^t, x_j^t) \in \mathbb{R}$ and $s_j^t \equiv s_w(x_j^t) \in \mathbb{R}^+$ are neural-network amortized functions that depend on tunable parameters $w$ and receive as input an individual data point $(y_j^t, x_j^t)$ associated with the latent variable $f_j^t$. We make the simplification that the output may influence only the real-valued mean $m_w(y_j^t, x_j^t)$, while the variance $s_w(x_j^t)$ depends only on the input. Based on the above, the amortized encoder is a fully dependent multivariate Gaussian distribution of the form
$$q(f^t | D^t) = \mathcal{N}\big(f^t \,\big|\, K^t (K^t + S^t)^{-1} m^t,\; K^t - K^t (K^t + S^t)^{-1} K^t\big),$$
where $S^t$ is a diagonal covariance matrix with the vector $(s_1^t, \ldots, s_{n_t}^t)$ on the diagonal and $m^t$ is the vector of values $(m_1^t, \ldots, m_{n_t}^t)$. This allows us to re-write the VIB objective in equation 11 in the following computationally more convenient form (see Appendix C.2),
$$\sum_{j=1}^{n_v} \mathbb{E}_{q(f_j^v)}\big[\log p(y_j^v | f_j^v)\big] + \beta \Big( \log \mathcal{N}(m^t | 0, K^t + S^t) - \sum_{j=1}^{n_t} \mathbb{E}_{q(f_j^t)}\big[\log \mathcal{N}(m_j^t | f_j^t, s_j^t)\big] \Big),$$
where each marginal Gaussian distribution $q(f_j)$, with $x_j$ taken either from the validation or the training set (or any further test set), is computed by the same expression,
$$q(f_j) = \mathcal{N}\big(f_j \,\big|\, k_j^t (K^t + S^t)^{-1} m^t,\; k_j - k_j^t (K^t + S^t)^{-1} (k_j^t)^\top\big),$$
where $k_j^t \equiv k(x_j, X^t)$ is the $n_t$-dimensional row vector of kernel values between $x_j$ and the training inputs $X^t$, and $k_j \equiv k(x_j, x_j)$.

Classification. Here, we discuss how the above general amortization procedure can be particularized to classification problems, which is the standard application in few-shot learning. For notational simplicity we focus on binary classification; multi-class classification is fully covered in Appendix C.4. Consider a meta-learning problem where each task is a binary classification problem with class labels encoded in $\{-1, 1\}$. To apply the method we simply need to specify the form of the amortized mean function $m_w(y_j^t, x_j^t)$ (recall that $s_w(x_j^t)$ is independent of the output $y_j^t$), which is chosen to be $m(y_j^t, x_j^t) = y_j^t \times m_w(x_j^t)$, where $m_w(x_j^t)$ is a real-valued function given by the neural network.
Notice that the dependence on the output label $y_j^t \in \{-1, 1\}$ simply changes the sign of $m_w(x_j^t)$. The latter acts as a discriminative function that should tend towards positive values for data from the positive class and negative values for data from the negative class, so that the product $y_j^t \times m_w(x_j^t)$ tends towards positive values. This amortization of the mean function is invariant to class re-labeling: if we swap the roles of the two labels in $\{-1, 1\}$, the amortization remains valid and requires no change. The multi-class case can be dealt with similarly, by introducing as many latent functions as classes, as discussed fully in Appendix C.4.
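As a toy illustration of the sign-flip amortization $m(y_j^t, x_j^t) = y_j^t \times m_w(x_j^t)$ and its invariance to class re-labeling, the following sketch uses a hypothetical linear feature map with random weights standing in for the neural network, and shows that flipping all labels simply flips the sign of the marginal mean of $q(f_j)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def m_w(x, W):
    # Hypothetical amortized discriminative function (a linear map here,
    # standing in for the paper's neural network).
    return x @ W

def encoder_marginal_mean(X_t, y_t, x_j, W, s=0.5):
    # Mean of the marginal q(f_j) = N(k_j^t (K^t + S^t)^{-1} m^t, ...),
    # with a linear kernel phi(x) = x and constant variances s.
    Phi = X_t
    K = Phi @ Phi.T
    S = s * np.eye(len(y_t))
    m_t = y_t * m_w(X_t, W)          # amortized mean m(y, x) = y * m_w(x)
    k_j = x_j @ Phi.T
    return k_j @ np.linalg.solve(K + S, m_t)

X_t = rng.normal(size=(6, 3))
y_t = np.array([1., -1., 1., 1., -1., -1.])
W = rng.normal(size=3)
x_j = rng.normal(size=3)

mu = encoder_marginal_mean(X_t, y_t, x_j, W)
mu_flipped = encoder_marginal_mean(X_t, -y_t, x_j, W)
print(mu, mu_flipped)   # re-labeling just flips the sign of the mean
```

Because $m^t$ is linear in the labels, swapping the roles of the two classes negates $m^t$ and hence negates every marginal mean, leaving the induced decision rule unchanged.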

C.2 DERIVATION OF THE VIB BOUND

The VIB objective for a single task, from equation 22 in the main paper, is computed as follows, where
$$q(f_j^v) = \int p(f_j^v | f^t, x_j^v, X^t)\, q(f^t | D^t)\, df^t$$
is a marginal Gaussian over an individual validation function value $f_j^v$, as also explained in the main paper. Specifically, $q(f_j^v)$ depends on the training set $(Y^t, X^t)$ and the single validation input $x_j^v$; intuitively, from the training set and the corresponding function values $f^t$ we extrapolate (through the conditional GP $p(f_j^v | f^t, x_j^v, X^t)$) to the input $x_j^v$ in order to predict its function value $f_j^v$.

Given the specific amortization of $q(f^t | D^t)$,
$$q(f^t | D^t) \propto p(f^t) \prod_{j=1}^{n_t} \mathcal{N}(m_j^t | f_j^t, s_j^t),$$
the VIB objective, by using the middle part of equation 28, can be written in the following form,
$$\sum_{j=1}^{n_v} \mathbb{E}_{q(f_j^v)}\big[\log p(y_j^v | f_j^v)\big] + \beta \Big( \log \mathcal{N}(m^t | 0, K^t + S^t) - \sum_{j=1}^{n_t} \mathbb{E}_{q(f_j^t)}\big[\log \mathcal{N}(m_j^t | f_j^t, s_j^t)\big] \Big),$$
which is convenient from a computational and programming point of view. Specifically, to compute this we need to perform a single Cholesky decomposition of $K^t + S^t$, which scales as $O((n_t)^3)$, i.e. cubically w.r.t. the size $n_t$ of the support set. This is fine for small support sets (the standard case in few-shot learning), but it can become too expensive when $n_t$ becomes very large. However, given that the kernel has the linear form $k_\theta(x, x') = \phi(x; \theta)^\top \phi(x'; \theta)$ (ignoring any kernel variance $\sigma_f^2$ for notational simplicity), where $\phi(x_i; \theta)$ is $M$-dimensional with $M \ll n_t$, we can instead carry out the computations based on a Cholesky decomposition of a matrix of size $M \times M$. This is because $K^t = \Phi^t (\Phi^t)^\top$, where $\Phi^t$ is an $n_t \times M$ matrix storing as rows the feature vectors of the support inputs $X^t$, and therefore we can apply the standard matrix inversion and determinant lemmas to the matrix $\Phi^t (\Phi^t)^\top + S^t$ when computing $\log \mathcal{N}(m^t | 0, K^t + S^t)$. Such $O(M^3)$ computations also give us the quantities $q(f_j^v)$ and $q(f_j^t)$, as explained next.
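The equivalence between the direct $O((n_t)^3)$ Cholesky evaluation of $\log \mathcal{N}(m^t | 0, K^t + S^t)$ and the $O(M^3)$ evaluation via the matrix inversion and determinant lemmas can be checked numerically. A self-contained sketch with random features (the sizes and values are illustrative, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
n_t, M = 200, 5                      # large support set, small feature dim
Phi = rng.normal(size=(n_t, M))      # feature vectors phi(x_i) as rows
s = rng.uniform(0.5, 2.0, size=n_t)  # diagonal of S^t
m = rng.normal(size=n_t)             # amortized means m^t

# Direct O(n_t^3) evaluation of log N(m^t | 0, K^t + S^t), K^t = Phi Phi^T.
C = Phi @ Phi.T + np.diag(s)
L = np.linalg.cholesky(C)
alpha = np.linalg.solve(L, m)
logp_direct = -0.5 * (alpha @ alpha) - np.log(np.diag(L)).sum() \
              - 0.5 * n_t * np.log(2 * np.pi)

# O(M^3) evaluation: Woodbury identity for the quadratic term and the
# determinant lemma det(Phi Phi^T + S) = det(S) det(I + Phi^T S^-1 Phi).
A = np.eye(M) + (Phi / s[:, None]).T @ Phi          # I + Phi^T S^-1 Phi
quad = m @ (m / s) - (m / s) @ Phi @ np.linalg.solve(A, Phi.T @ (m / s))
logdet = np.linalg.slogdet(A)[1] + np.log(s).sum()
logp_features = -0.5 * quad - 0.5 * logdet - 0.5 * n_t * np.log(2 * np.pi)

print(logp_direct, logp_features)    # the two routes agree
```

Only the $M \times M$ matrix $A$ ever needs to be factorized in the second route, which is the source of the $O(M^3)$ cost mentioned above.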

C.3 DATA EFFICIENT GP META-TESTING PREDICTION WITH CONSTANT MEMORY

Once we have trained the GP meta-learning system, we can consider meta-testing, where a fresh task is provided with a support set $D^{t_*} = (Y^{t_*}, X^{t_*})$, based on which we predict at any arbitrary validation/test input $x_*$. This requires computing quantities (such as the mean value $\mathbb{E}[y_*]$) associated with the predictive density
$$q(y_*) = \int p(y_* | f_*)\, q(f_*)\, df_*,$$
where $q(f_*)$ is a univariate Gaussian given by
$$q(f_*) = \mathcal{N}\big(f_* \,\big|\, k_*^{t_*} (K^{t_*} + S^{t_*})^{-1} m^{t_*},\; k_{**} - k_*^{t_*} (K^{t_*} + S^{t_*})^{-1} (k_*^{t_*})^\top\big), \quad K^{t_*} = \Phi^{t_*} (\Phi^{t_*})^\top .$$
Here, $\Phi^{t_*}$ is an $n_{t_*} \times M$ matrix storing as rows the feature vectors of the support inputs $X^{t_*}$. Note that if we wish to evaluate $q(y_*)$ at a certain value of $y_*$, and given that the likelihood $p(y_* | f_*)$ is not a standard Gaussian, we can use 1-D Gaussian quadrature or Monte Carlo by sampling from $q(f_*)$.

An interesting property of the above predictive density is that when the support set $D^{t_*}$ grows incrementally, e.g. when individual data points or mini-batches are added sequentially, the predictive density can be implemented with constant memory, without explicitly memorizing the points in the support set. The reason is that the feature parameters $\theta$ remain constant at meta-test time and the kernel function is linear, so we can apply standard tricks to update sufficient statistics as in Bayesian linear regression. More precisely, what we need to show is that we can sequentially update the mean and variance of $q(f_*)$ with constant memory. By applying the matrix inversion lemma backwards, $q(f_*)$ can be written as
$$q(f_*) = \mathcal{N}\big(f_* \,\big|\, \phi_*^\top A^{-1} b,\; \phi_*^\top A^{-1} \phi_*\big), \quad A = I + (\Phi^{t_*})^\top (S^{t_*})^{-1} \Phi^{t_*}, \quad b = (\Phi^{t_*})^\top (S^{t_*})^{-1} m^{t_*},$$
where $\phi_* \equiv \phi(x_*; \theta)$. Since $S^{t_*}$ is diagonal, the $M \times M$ matrix $A$ and the $M$-dimensional vector $b$ are sums over support points and can therefore be updated in constant memory as new points arrive.

Following Flennerhag et al. (2020), we also report results with and without batch normalization on mini-ImageNet 5-way and tiered-ImageNet 5-way; the comparison with LEO shows no significant difference. We would like to point out that LEO uses a set of additional tricks: dropout, label smoothing, $\ell_2$ regularization and an orthogonality penalty (Rusu et al., 2019), while we have only considered dropout.

[Hyperparameter table fragment; columns GP (cos), GP (linear), GP+MAML (cos), GP+MAML (linear). Outer l.r. $\alpha$: $5 \times 10^{-5}$, $2.5 \times 10^{-5}$, $0.00025$, $7.5 \times 10^{-5}$, $5 \times 10^{-5}$, $7.5 \times 10^{-5}$, $2.5 \times 10^{-5}$, $0.0001$. Inner l.r.: values truncated in source.]

