META-LEARNING WITH IMPLICIT PROCESSES

Abstract

This paper presents a novel implicit process-based meta-learning (IPML) algorithm that, in contrast to existing works, explicitly represents each task as a continuous latent vector and models its probabilistic belief within the highly expressive IP framework. Unfortunately, meta-training in IPML is computationally challenging due to its need to perform intractable exact IP inference in task adaptation. To resolve this, we propose a novel expectation-maximization algorithm based on the stochastic gradient Hamiltonian Monte Carlo sampling method to perform meta-training. Our careful design of the neural network architecture for meta-training in IPML allows competitive meta-learning performance to be achieved. Unlike existing works, IPML offers the benefits of being amenable to the characterization of a principled distance measure between tasks using the maximum mean discrepancy, active task selection without needing the assumption of known task contexts, and synthetic task generation by modeling task-dependent input distributions. Empirical evaluation on benchmark datasets shows that IPML outperforms existing Bayesian meta-learning algorithms. We have also empirically demonstrated on an e-commerce company's real-world dataset that IPML outperforms the baselines and identifies "outlier" tasks which can potentially degrade meta-testing performance.

1. INTRODUCTION

Few-shot learning (also known as meta-learning) is a defining characteristic of human intelligence. Its goal is to leverage the experience from previous tasks to form a model (represented by meta-parameters) that can rapidly adapt to a new task using only a limited quantity of its training data. A number of meta-learning algorithms (Finn et al., 2018; Jerfel et al., 2019; Ravi & Beatson, 2018; Rusu et al., 2019; Yoon et al., 2018) have recently adopted a probabilistic perspective to characterize the uncertainty in the predictions via a Bayesian treatment of the meta-parameters. Though they can consequently represent different tasks with different values of meta-parameters, it is not clear how or whether they are naturally amenable to (a) the characterization of a principled similarity/distance measure between tasks (e.g., for identifying outlier tasks that can potentially hurt training for the new task, procuring the tasks/datasets most valuable/similar to the new task, or detecting task distribution shift, among others), (b) active task selection given a limited budget of expensive task queries (see Appendix A.2.3 for an example of a real-world use case), and (c) synthetic task/dataset generation, whether in privacy-aware applications without revealing the real data or for augmenting a limited number of previous tasks to improve generalization performance.

To tackle the above challenge, this paper presents a novel implicit process-based meta-learning (IPML) algorithm (Sec. 3) that, in contrast to existing works, explicitly represents each task as a continuous latent vector and models its probabilistic belief within the highly expressive IP framework (Sec. 2). Unfortunately, meta-training in IPML is computationally challenging due to its need to perform intractable exact IP inference in task adaptation. To resolve this, we propose a novel expectation-maximization (EM) algorithm to perform meta-training (Sec. 3.1): In the E step, we perform task adaptation using the stochastic gradient Hamiltonian Monte Carlo (SGHMC) sampling method (Chen et al., 2014) to draw samples from the IP posterior beliefs for all meta-training tasks, which eliminates the need to learn a latent encoder (Garnelo et al., 2018). In the M step, we optimize the meta-learning objective w.r.t. the meta-parameters using these samples. Our careful design of the neural network architecture for meta-training in IPML allows competitive meta-learning performance to be achieved (Sec. 3.2). Our IPML algorithm offers the benefits of being amenable to (a) the characterization of a principled distance measure between tasks using the maximum mean discrepancy (Gretton et al., 2012), (b) active task selection without needing the assumption of known task contexts made in (Kaddour et al., 2020), and (c) synthetic task generation by modeling task-dependent input distributions (Sec. 3.3).
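The E step described above admits a compact sketch. Below is a minimal, self-contained illustration of the SGHMC update of Chen et al. (2014) applied to drawing samples of a latent task vector; the toy one-dimensional target, the step size `eta`, and the friction `alpha` are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def sghmc_sample(grad_U, z0, n_steps=3000, eta=1e-2, alpha=0.1):
    """SGHMC sketch: simulate noisy Hamiltonian dynamics whose stationary
    distribution is approximately exp(-U(z)). grad_U is the gradient of the
    negative log-posterior of the latent task vector z."""
    z = z0.astype(float).copy()
    v = np.zeros_like(z)  # auxiliary momentum variable
    samples = []
    for _ in range(n_steps):
        # Friction term (1 - alpha) * v plus injected noise of matching scale
        # keeps the dynamics sampling rather than optimizing.
        noise = np.sqrt(2.0 * alpha * eta) * rng.normal(size=z.shape)
        v = (1.0 - alpha) * v - eta * grad_U(z) + noise
        z = z + v
        samples.append(z.copy())
    return np.array(samples)

# Toy check: U(z) = 0.5 * (z - mu)^2, i.e., posterior N(mu, 1),
# so samples should concentrate around mu after burn-in.
mu = np.array([2.0])
grad_U = lambda z: z - mu
samples = sghmc_sample(grad_U, z0=np.zeros(1))
post_mean = samples[1500:].mean()  # discard burn-in
```

In IPML, `grad_U` would instead be the (stochastic) gradient of the negative log of the unnormalized posterior of z given a task's support set, with one such chain run per meta-training task.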

2. BACKGROUND AND NOTATIONS

For simplicity, the inputs (outputs) for all tasks are assumed to belong to the same input (output) space. Consider meta-learning on probabilistic regression tasks: Each task is generated from a task distribution and associated with a dataset $(X, \mathbf{y}_X)$ where the set $X$ and the vector $\mathbf{y}_X \triangleq (y_x)^\top_{x \in X}$ denote, respectively, the input vectors and the corresponding noisy outputs

$$y_x \triangleq f(x) + \epsilon(x) \quad (1)$$

which are outputs of an unknown underlying function $f$ corrupted by i.i.d. Gaussian noise $\epsilon(x) \sim \mathcal{N}(0, \sigma^2)$ with variance $\sigma^2$. Let $f$ be distributed by an implicit process (IP), as follows:

Definition 1 (Implicit process for meta-learning). Let the collection of random variables $f(\cdot)$ denote an IP parameterized by meta-parameters $\theta$, that is, every finite collection $\{f(x)\}_{x \in X}$ has a joint prior distribution $p(\mathbf{f}_X \triangleq (f(x))^\top_{x \in X})$ implicitly defined by the following generative model:

$$\mathbf{z} \sim p(\mathbf{z}), \quad f(x) \triangleq g_\theta(x, \mathbf{z}) \quad \text{for all } x \in X \quad (2)$$

where $\mathbf{z}$ is a latent task vector to be explained below and the generator $g_\theta$ can be an arbitrary model (e.g., a deep neural network) parameterized by meta-parameters $\theta$.

Definition 1 defines valid stochastic processes if $\mathbf{z}$ is finite-dimensional (Ma et al., 2019). Though, in reality, a task may follow an unknown distribution, we assume the existence of an unknown function that maps each task to a latent task vector $\mathbf{z}$ satisfying the desired known distribution $p(\mathbf{z})$, as in (Kaddour et al., 2020). Using $p(\mathbf{y}_X \mid \mathbf{f}_X) = \mathcal{N}(\mathbf{f}_X, \sigma^2 I)$ from (1) and the IP prior belief $p(\mathbf{f}_X)$ from Def. 1, we can derive the marginal likelihood $p(\mathbf{y}_X)$ by marginalizing out $\mathbf{f}_X$.

Remark 1. Two sources of uncertainty exist in $p(\mathbf{y}_X)$: Aleatoric uncertainty in $p(\mathbf{y}_X \mid \mathbf{f}_X)$ reflects the noise (i.e., modeled in (1)) inherent in the dataset, while epistemic uncertainty in the IP prior belief $p(\mathbf{f}_X)$ reflects the model uncertainty arising from the latent task prior belief $p(\mathbf{z})$ in (2).

Let the sets $T$ and $T^*$ denote the meta-training and meta-testing tasks, respectively.
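To make the generative model of Def. 1 concrete, the sketch below samples one task from an IP prior: draw $\mathbf{z}$ once, then evaluate $g_\theta(\cdot, \mathbf{z})$ at every input and add Gaussian observation noise. The generator architecture, layer sizes, and noise level are illustrative assumptions, not the architecture of Sec. 3.2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generator g_theta: a small two-layer network mapping an input
# x concatenated with a latent task vector z to a function value f(x).
D_X, D_Z, D_H = 1, 2, 32
W1 = rng.normal(size=(D_X + D_Z, D_H))
b1 = np.zeros(D_H)
W2 = rng.normal(size=(D_H, 1))
b2 = np.zeros(1)

def g_theta(x, z):
    """f(x) = g_theta(x, z): one sampled z defines one task's function."""
    h = np.tanh(np.concatenate([x, z]) @ W1 + b1)
    return (h @ W2 + b2)[0]

def sample_ip_prior(X, sigma=0.1):
    """Generative model of Def. 1: z ~ p(z) = N(0, I), f(x) = g_theta(x, z),
    then noisy outputs y_x = f(x) + eps(x), eps ~ N(0, sigma^2) as in (1)."""
    z = rng.normal(size=D_Z)                        # latent task vector
    f_X = np.array([g_theta(x, z) for x in X])      # f_X from the IP prior
    y_X = f_X + sigma * rng.normal(size=len(X))     # aleatoric noise
    return f_X, y_X

X = [np.array([x]) for x in np.linspace(-1.0, 1.0, 5)]
f_X, y_X = sample_ip_prior(X)
```

Repeating `sample_ip_prior` with fresh draws of `z` yields different tasks from the same IP, which is exactly the sense in which the latent task vector indexes tasks.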
Following the convention in (Finn et al., 2018; Gordon et al., 2019; Ravi & Beatson, 2018; Yoon et al., 2018), for each meta-training task $t \in T$, we consider a support-query (or train-test) split of its dataset $(X_t, \mathbf{y}_{X_t})$ into the support set (or training dataset) $(X_t^s, \mathbf{y}_{X_t^s})$ and the query set (or test/evaluation dataset) $(X_t^q, \mathbf{y}_{X_t^q})$ where $X_t = X_t^s \cup X_t^q$ and $X_t^s \cap X_t^q = \emptyset$. Specifically, for an $N$-way $K$-shot classification problem, the support set has $K$ examples per class and $N$ classes in total. Meta-learning can be defined as an optimization problem (Finn et al., 2017; 2018) whose goal is to learn meta-parameters $\theta$ that maximize the following objective defined over all meta-training tasks:

$$J_{\text{meta}} \triangleq \log \prod_{t \in T} p\big(\mathbf{y}_{X_t^q} \mid \mathbf{y}_{X_t^s}\big) = \sum_{t \in T} \log \int p\big(\mathbf{y}_{X_t^q} \mid \mathbf{f}_{X_t^q}\big)\, p\big(\mathbf{f}_{X_t^q} \mid \mathbf{y}_{X_t^s}\big)\, d\mathbf{f}_{X_t^q}\,.$$

Task adaptation $p(\mathbf{f}_{X_t^q} \mid \mathbf{y}_{X_t^s})$ is performed via IP inference after observing the support set:

$$p\big(\mathbf{f}_{X_t^q} \mid \mathbf{y}_{X_t^s}\big) = \int p\big(\mathbf{f}_{X_t^q} \mid \mathbf{z}\big)\, p\big(\mathbf{z} \mid \mathbf{y}_{X_t^s}\big)\, d\mathbf{z}\,.$$
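Given posterior samples $\mathbf{z}^{(s)}$ of the latent task vector, each task's term $\log p(\mathbf{y}_{X_t^q} \mid \mathbf{y}_{X_t^s})$ can be estimated by Monte Carlo: map each sample to query-set function values $\mathbf{f}_{X_t^q}^{(s)}$ and average the Gaussian likelihoods, computed stably in log space. The sketch below is illustrative (function name, sample counts, and toy data are assumptions), not the paper's implementation.

```python
import numpy as np

def log_predictive(y_q, f_q_samples, sigma=0.1):
    """Monte Carlo estimate of log p(y_q | y_s):
    log (1/S) * sum_s N(y_q | f_q^(s), sigma^2 I),
    where each row of f_q_samples is f_q^(s) obtained by pushing one
    posterior sample z^(s) through the generator. Uses log-sum-exp to
    avoid underflow when the Gaussian likelihoods are tiny."""
    S, n = f_q_samples.shape
    # Per-sample Gaussian log-likelihood of the query outputs.
    ll = (-0.5 * ((y_q - f_q_samples) ** 2).sum(axis=1) / sigma**2
          - 0.5 * n * np.log(2.0 * np.pi * sigma**2))
    m = ll.max()
    return m + np.log(np.exp(ll - m).mean())

# Toy check: posterior samples centred on the true query outputs should
# score higher than samples that are far off.
rng = np.random.default_rng(0)
y_q = np.zeros(3)
good = rng.normal(scale=0.05, size=(50, 3))       # f samples near y_q
bad = 5.0 + rng.normal(scale=0.05, size=(50, 3))  # f samples far from y_q
lp_good = log_predictive(y_q, good)
lp_bad = log_predictive(y_q, bad)
```

The M step would then ascend the gradient of the sum of such estimates over all meta-training tasks with respect to the meta-parameters $\theta$.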



Footnotes:

1. An IP (Ma et al., 2019) is a stochastic process such that every finite collection of random variables has an implicitly defined joint prior distribution. Typical examples of IPs include Gaussian processes, Bayesian neural networks, and neural processes (Garnelo et al., 2018), among others. An IP is formally defined in Def. 1.
2. The work of Ma et al. (2019) uses the well-studied Gaussian process as the variational family to perform variational inference in general applications of IPs, which sacrifices the flexibility and expressivity of the IP by constraining the distributions of the function outputs to be Gaussian. Such a straightforward application of IPs to meta-learning has not yielded satisfactory results in our experiments (see Appendix A.4).
3. We defer the discussion of meta-learning on probabilistic classification tasks using the robust-max likelihood (Hernández-Lobato et al., 2011) to Appendix A.1.
4. $p(\mathbf{z})$ is often assumed to be a simple distribution such as the multivariate Gaussian $\mathcal{N}(\mathbf{0}, I)$ (Garnelo et al., 2018).
5. Our work here considers a point estimate of the meta-parameters $\theta$ instead of a Bayesian treatment of $\theta$ (Finn et al., 2018; Yoon et al., 2018), which allows us to interpret the epistemic uncertainty in $p(\mathbf{f}_X)$ via $p(\mathbf{z})$ directly.

