INFORMATION THEORETIC META LEARNING WITH GAUSSIAN PROCESSES

Abstract

We formulate meta learning using information theoretic concepts such as mutual information and the information bottleneck. The idea is to learn a stochastic representation or encoding of the task description, given by a training or support set, that is highly informative about predicting the validation set. By making use of variational approximations to the mutual information, we derive a general and tractable framework for meta learning. In particular, we develop new memory-based meta learning algorithms based on Gaussian processes and derive extensions that combine memory- and gradient-based meta learning. We demonstrate our method on few-shot regression and classification using standard benchmarks such as Omniglot, mini-Imagenet and Augmented Omniglot.

1. INTRODUCTION

Meta learning (Ravi & Larochelle, 2017; Vinyals et al., 2016; Edwards & Storkey, 2017; Finn et al., 2017; Lacoste et al., 2019; Nichol et al., 2018) and few-shot learning (Li et al., 2006; Lake et al., 2011) aim to derive data-efficient learning algorithms that can rapidly adapt to new tasks. Such systems require training deep neural networks on a set of tasks drawn from a common distribution, where each task is described by a small amount of experience, typically divided into a training or support set and a validation set. By sharing information across tasks, the neural network can learn to rapidly adapt to new tasks and generalize from few examples at test time. Several few-shot learning algorithms use memory-based (Vinyals et al., 2016; Ravi & Larochelle, 2017) or gradient-based procedures (Finn et al., 2017; Nichol et al., 2018), with the gradient-based model-agnostic meta learning algorithm (MAML) of Finn et al. (2017) being particularly influential. Despite the success of specific schemes, a fundamental issue in meta learning is the derivation of unified principles that can relate different approaches and suggest new ones. While there exist probabilistic interpretations of existing methods, such as the approximate Bayesian inference approach (Grant et al., 2018; Finn et al., 2018; Yoon et al., 2018) and the related conditional probability modelling approach (Garnelo et al., 2018; Gordon et al., 2019), meta learning still lacks a general and tractable learning principle that can provide a better understanding of existing algorithms and yield new methods. To this end, the main contribution of this paper is to introduce an information theoretic view of meta learning, utilizing tools such as mutual information and the information bottleneck (Cover & Thomas, 2006; Tishby et al., 1999).
Given that each task consists of a support or training set and a target or validation set, we consider the information bottleneck principle, introduced by Tishby et al. (1999), which learns a stochastic encoding of the support set that is highly informative about predicting the validation set. This stochastic encoding is optimized through the difference of two mutual information terms, so that the encoding compresses the training set into a representation that predicts the validation set well. By exploiting recent variational approximations to the information bottleneck (Alemi et al., 2017; Chalk et al., 2016; Achille & Soatto, 2016) that make use of variational lower bounds on the mutual information (Barber & Agakov, 2003), we derive a general and tractable framework for meta learning. This framework allows us to re-interpret gradient-based algorithms, such as MAML, and also to derive new methods. Based on the variational information bottleneck (VIB) framework (Alemi et al., 2017; Chalk et al., 2016; Achille & Soatto, 2016), we introduce a new memory-based algorithm for supervised few-shot learning (right panel in Figure 1) based on Gaussian processes (Rasmussen & Williams, 2006) and deep neural kernels (Wilson et al., 2016) that offers a kernel-based Bayesian view of a memory system. With Gaussian processes, the underlying encoding takes the form of a non-parametric function that follows a stochastic process amortized by the training set. Further, we show that VIB gives rise to gradient-based meta learning methods, such as MAML, when combined with parametric encodings corresponding to model parameters or weights, and based on this we derive a stochastic MAML algorithm. In an additional scheme, we show that our framework naturally allows for combinations of memory- and gradient-based meta learning by constructing suitable encodings, and we derive such an algorithm that combines Gaussian processes with MAML.
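To make the "non-parametric encoding amortized by the training set" concrete, the following is a minimal NumPy sketch of a deep-kernel GP predictor: inputs pass through a learned feature map (a shared neural network in practice) before a standard RBF kernel, and the GP posterior is conditioned on the task support set. Function names such as `deep_kernel` and `gp_predict` are illustrative, not the paper's API.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    # Squared-exponential kernel between rows of feature matrices a and b.
    d2 = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(d2, 0.0) / lengthscale**2)

def deep_kernel(x1, x2, feature_fn, lengthscale=1.0):
    # "Deep kernel" (Wilson et al., 2016): map inputs through a learned
    # feature extractor, then apply a standard base kernel on the features.
    return rbf_kernel(feature_fn(x1), feature_fn(x2), lengthscale)

def gp_predict(x_train, y_train, x_test, feature_fn, noise=0.1):
    # GP posterior mean/variance at test inputs, conditioned on the support
    # set: the encoding is a random function amortized by (x_train, y_train).
    k_tt = deep_kernel(x_train, x_train, feature_fn)
    k_st = deep_kernel(x_test, x_train, feature_fn)
    k_ss = deep_kernel(x_test, x_test, feature_fn)
    gram = k_tt + noise**2 * np.eye(len(x_train))
    mean = k_st @ np.linalg.solve(gram, y_train)
    var = np.diag(k_ss - k_st @ np.linalg.solve(gram, k_st.T))
    return mean, var
```

With `feature_fn = lambda x: x` this reduces to an ordinary RBF GP; in the meta learning setting the feature map would be trained across tasks while each task's support set only enters through the kernel matrices.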
We demonstrate our methods on few-shot regression and classification by using standard benchmarks such as Omniglot, mini-Imagenet and Augmented Omniglot.

2. META LEARNING WITH INFORMATION BOTTLENECK

Suppose we wish to learn from a distribution of tasks. During training, for each task we observe a pair consisting of a task description, represented by the support or training set D^t, and a task validation, represented by the target or validation set D^v. At test time only D^t will be given and the learning algorithm should rapidly adapt to form predictions on D^v or on further test data. We wish to formulate meta learning using information theoretic concepts such as mutual information and the information bottleneck (Tishby et al., 1999). The idea is to learn a stochastic representation or encoding of the task description D^t that is highly informative about predicting D^v. We introduce a random variable Z, associated with this encoding, drawn from a distribution q_w(Z|D^t) parametrized by w. Given this encoding, the full joint distribution is written as

q_w(D^v, D^t, Z) = q_w(Z|D^t) p(D^v, D^t),   (1)

where p(D^v, D^t) denotes the unknown data distribution over D^t and D^v. In equation 1 and throughout the paper we use the convention that the full joint, as well as any marginal or conditional that depends on Z, is denoted by q_w (emphasizing the dependence on the parametrized encoder), while the corresponding quantities over the data D^t, D^v are denoted by p. E.g., from the above we can express a Z-dependent marginal such as q_w(Z, D^v) = ∫ q_w(Z|D^t) p(D^v, D^t) dD^t. To tune w we would like to maximize the mutual information between Z and the target set D^v, denoted by I(Z, D^v). A trivial way to obtain a maximally informative representation is to set Z = D^t, which does not provide a useful representation. Thus, the information bottleneck (IB) principle (Tishby et al., 1999) adds a model complexity penalty to the maximization of I(Z, D^v) which promotes an encoding Z that is highly compressive of D^t, i.e. for which I(Z, D^t) is minimized. This leads to the IB objective:

L_IB(w) = I(Z, D^v) − β I(Z, D^t),   (2)

where β ≥ 0 is a hyperparameter.
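As a minimal sketch of how a VIB-style surrogate of equation 2 is typically estimated in practice (following Alemi et al., 2017): a decoder likelihood term bounds I(Z, D^v) from below, a KL term to a fixed marginal r(Z) bounds I(Z, D^t) from above, and maximizing equation 2 corresponds to minimizing the resulting loss. The encoder here is taken to be a diagonal Gaussian with r(Z) = N(0, I); names such as `vib_objective` are illustrative, not the paper's API.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q):
    # Closed-form KL( N(mu_q, diag(sigma_q^2)) || N(0, I) ); this plays the
    # role of a tractable upper bound on the compression term I(Z, D^t).
    return 0.5 * np.sum(sigma_q**2 + mu_q**2 - 1.0 - 2.0 * np.log(sigma_q))

def vib_objective(mu_z, sigma_z, log_lik, beta, rng, n_samples=16):
    # Monte Carlo estimate of a VIB-style surrogate loss:
    #   -E_{q_w(Z|D^t)}[ log q(D^v | Z) ] + beta * KL( q_w(Z|D^t) || r(Z) ),
    # where (mu_z, sigma_z) = encoder(D^t) and log_lik(z) scores D^v given z.
    eps = rng.standard_normal((n_samples, mu_z.size))
    z = mu_z + sigma_z * eps            # reparameterization trick
    nll = -np.mean([log_lik(zi) for zi in z])
    return nll + beta * gaussian_kl(mu_z, sigma_z)
```

In a full meta learning loop, `(mu_z, sigma_z)` would be produced by a network reading the support set and `log_lik` by a decoder over the validation set, with the loss averaged over tasks.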
Nevertheless, in order to use IB for meta learning we need to approximate the mutual information terms I(Z, D^v) and I(Z, D^t), which are both intractable since they depend on the unknown data distribution p(D^v, D^t). To overcome this, we consider variational approximations following arguments similar to the variational IB approach (Alemi et al., 2017; Chalk et al., 2016; Achille & Soatto, 2016).



Figure 1: (left) Meta learning with the information bottleneck. Z_i ∼ q_w(Z_i|D_i^t) is the encoder we wish to optimize to compress the task training set D_i^t (minimize the mutual information I(Z_i, D_i^t)) and predict well the validation set D_i^v (maximize I(Z_i, D_i^v)); see Section 2. (right) Specialization to supervised few-shot learning where for each task we have input-output data. Gradient-based algorithms, such as MAML, and Gaussian process memory-based methods (proposed in this paper) are instances of the framework; see Sections 2.2 and 3.

