META-LEARNING WITH IMPLICIT PROCESSES

Abstract

This paper presents a novel implicit process-based meta-learning (IPML) algorithm that, in contrast to existing works, explicitly represents each task as a continuous latent vector and models its probabilistic belief within the highly expressive implicit process (IP) framework. Unfortunately, meta-training in IPML is computationally challenging because task adaptation requires exact IP inference, which is intractable. To resolve this, we propose a novel expectation-maximization algorithm based on the stochastic gradient Hamiltonian Monte Carlo sampling method to perform meta-training. Our careful design of the neural network architecture for meta-training in IPML allows competitive meta-learning performance to be achieved. Unlike existing works, IPML offers the benefits of being amenable to the characterization of a principled distance measure between tasks using the maximum mean discrepancy, active task selection without needing the assumption of known task contexts, and synthetic task generation by modeling task-dependent input distributions. Empirical evaluation on benchmark datasets shows that IPML outperforms existing Bayesian meta-learning algorithms. We have also empirically demonstrated on an e-commerce company's real-world dataset that IPML outperforms the baselines and identifies "outlier" tasks which can potentially degrade meta-testing performance.

1. INTRODUCTION

Few-shot learning (also known as meta-learning) is a defining characteristic of human intelligence. Its goal is to leverage the experiences from previous tasks to form a model (represented by meta-parameters) that can rapidly adapt to a new task using only a limited quantity of its training data. A number of meta-learning algorithms (Finn et al., 2018; Jerfel et al., 2019; Ravi & Beatson, 2018; Rusu et al., 2019; Yoon et al., 2018) have recently adopted a probabilistic perspective to characterize the uncertainty in the predictions via a Bayesian treatment of the meta-parameters. Though they can consequently represent different tasks with different values of meta-parameters, it is not clear how or whether they are naturally amenable to (a) the characterization of a principled similarity/distance measure between tasks (e.g., for identifying outlier tasks that can potentially hurt training for the new task, procuring the most valuable/similar tasks/datasets to the new task, detecting task distribution shift, among others), (b) active task selection given a limited budget of expensive task queries (see Appendix A.2.3 for an example of a real-world use case), and (c) synthetic task/dataset generation in privacy-aware applications without revealing the real data or for augmenting a limited number of previous tasks to improve generalization performance.

To tackle the above challenges, this paper presents a novel implicit process-based meta-learning (IPML) algorithm (Sec. 3) that, in contrast to existing works, explicitly represents each task as a continuous latent vector and models its probabilistic belief within the highly expressive IP [fn 1] framework (Sec. 2). Unfortunately, meta-training in IPML is computationally challenging because task adaptation requires exact IP inference, which is intractable [fn 2]. To resolve this, we propose a novel expectation-maximization (EM) algorithm to perform meta-training (Sec. 3.1): In the E step, we perform task adaptation using the stochastic gradient Hamiltonian Monte Carlo sampling method (Chen et al., 2014) to draw samples from IP posterior beliefs for all meta-training tasks, which eliminates the need to learn a latent encoder (Garnelo et al., 2018). In the M step, we optimize the meta-learning objective w.r.t. the meta-parameters using these samples. Our careful design of the neural network architecture for meta-training in IPML allows competitive meta-learning performance to be achieved (Sec. 3.2). Our IPML algorithm offers the benefits of being amenable to (a) the characterization of a principled distance measure between tasks using maximum mean discrepancy (Gretton et al., 2012), (b) active task selection without needing the assumption of known task contexts in (Kaddour et al., 2020), and (c) synthetic task generation by modeling task-dependent input distributions (Sec. 3.3).

2. BACKGROUND AND NOTATIONS

For simplicity, the inputs (outputs) for all tasks are assumed to belong to the same input (output) space. Consider meta-learning on probabilistic regression tasks [fn 3]: Each task is generated from a task distribution and associated with a dataset $(\mathcal{X}, \mathbf{y}_{\mathcal{X}})$ where the set $\mathcal{X}$ and the vector $\mathbf{y}_{\mathcal{X}} \triangleq (y_{\mathbf{x}})^{\top}_{\mathbf{x}\in\mathcal{X}}$ denote, respectively, the input vectors and the corresponding noisy outputs

$$y_{\mathbf{x}} \triangleq f(\mathbf{x}) + \epsilon(\mathbf{x}) \tag{1}$$

which are outputs of an unknown underlying function $f$ corrupted by i.i.d. Gaussian noise $\epsilon(\mathbf{x}) \sim \mathcal{N}(0, \sigma^2)$ with variance $\sigma^2$. Let $f$ be distributed by an implicit process (IP), as follows:

Definition 1 (Implicit process for meta-learning). Let the collection of random variables $f(\cdot)$ denote an IP parameterized by meta-parameters $\theta$, that is, every finite collection $\{f(\mathbf{x})\}_{\mathbf{x}\in\mathcal{X}}$ has a joint prior distribution $p(\mathbf{f}_{\mathcal{X}} \triangleq (f(\mathbf{x}))^{\top}_{\mathbf{x}\in\mathcal{X}})$ implicitly defined by the following generative model:

$$\mathbf{z} \sim p(\mathbf{z}), \qquad f(\mathbf{x}) \triangleq g_{\theta}(\mathbf{x}, \mathbf{z}) \tag{2}$$

for all $\mathbf{x} \in \mathcal{X}$ where $\mathbf{z}$ is a latent task vector to be explained below and generator $g_{\theta}$ can be an arbitrary model (e.g., deep neural network) parameterized by meta-parameters $\theta$.

Definition 1 defines valid stochastic processes if $\mathbf{z}$ is finite dimensional (Ma et al., 2019). Though, in reality, a task may follow an unknown distribution, we assume the existence of an unknown function that maps each task to a latent task vector $\mathbf{z}$ satisfying the desired known distribution $p(\mathbf{z})$, like in (Kaddour et al., 2020) [fn 4]. Using $p(\mathbf{y}_{\mathcal{X}}|\mathbf{f}_{\mathcal{X}}) = \mathcal{N}(\mathbf{f}_{\mathcal{X}}, \sigma^2 I)$ from (1) and the IP prior belief $p(\mathbf{f}_{\mathcal{X}})$ from Def. 1, we can derive the marginal likelihood $p(\mathbf{y}_{\mathcal{X}})$ by marginalizing out $\mathbf{f}_{\mathcal{X}}$.

Remark 1. Two sources of uncertainty exist in $p(\mathbf{y}_{\mathcal{X}})$: Aleatoric uncertainty in $p(\mathbf{y}_{\mathcal{X}}|\mathbf{f}_{\mathcal{X}})$ reflects the noise (i.e., modeled in (1)) inherent in the dataset, while epistemic uncertainty in the IP prior belief $p(\mathbf{f}_{\mathcal{X}})$ reflects the model uncertainty arising from the latent task prior belief $p(\mathbf{z})$ in (2) [fn 5].

Let the sets $\mathcal{T}$ and $\mathcal{T}^{*}$ denote the meta-training and meta-testing tasks, respectively.
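To make the generative model of Def. 1 concrete, the following is a minimal NumPy sketch of drawing one task from an IP prior. It is an illustration under stated assumptions, not the paper's implementation: the two-layer MLP, its sizes, the prior $p(\mathbf{z}) = \mathcal{N}(0, I)$, the noise level `sigma`, and the names `init_generator`, `g_theta`, and `sample_task` are all hypothetical, and $\mathbf{z}$ is coupled to $\mathbf{x}$ by plain concatenation purely for brevity.

```python
import numpy as np

def init_generator(rng, x_dim, z_dim, hidden=32):
    """Hypothetical meta-parameters theta of a small MLP generator g_theta."""
    in_dim = x_dim + z_dim
    return {
        "W1": rng.normal(size=(in_dim, hidden)) / np.sqrt(in_dim),
        "b1": np.zeros(hidden),
        "W2": rng.normal(size=(hidden, 1)) / np.sqrt(hidden),
        "b2": np.zeros(1),
    }

def g_theta(theta, X, z):
    """f(x) = g_theta(x, z) for every row x of X; z is shared by the whole task."""
    Z = np.broadcast_to(z, (X.shape[0], z.shape[0]))
    h = np.tanh(np.concatenate([X, Z], axis=1) @ theta["W1"] + theta["b1"])
    return (h @ theta["W2"] + theta["b2"]).ravel()

def sample_task(rng, theta, X, z_dim, sigma=0.1):
    """z ~ p(z) = N(0, I) (assumed), f_X = g_theta(X, z), y_X = f_X + Gaussian noise."""
    z = rng.normal(size=z_dim)
    f_X = g_theta(theta, X, z)
    y_X = f_X + sigma * rng.normal(size=f_X.shape)
    return z, f_X, y_X

rng = np.random.default_rng(0)
theta = init_generator(rng, x_dim=1, z_dim=2)
X = np.linspace(-5.0, 5.0, 10).reshape(-1, 1)
z, f_X, y_X = sample_task(rng, theta, X, z_dim=2)
```

In this sketch, all randomness in `f_X` comes from `z` (the epistemic part), while the difference `y_X - f_X` is the aleatoric noise of (1).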
Following the convention in (Finn et al., 2018; Gordon et al., 2019; Ravi & Beatson, 2018; Yoon et al., 2018), for each meta-training task $t \in \mathcal{T}$, we consider a support-query (or train-test) split of its dataset $(\mathcal{X}_t, \mathbf{y}_{\mathcal{X}_t})$ into the support set (or training dataset) $(\mathcal{X}^s_t, \mathbf{y}_{\mathcal{X}^s_t})$ and query set (or test/evaluation dataset) $(\mathcal{X}^q_t, \mathbf{y}_{\mathcal{X}^q_t})$ where $\mathcal{X}_t = \mathcal{X}^s_t \cup \mathcal{X}^q_t$ and $\mathcal{X}^s_t \cap \mathcal{X}^q_t = \emptyset$. Specifically, for an $N$-way $K$-shot classification problem, the support set has $K$ examples per class and $N$ classes in total. Meta-learning can be defined as an optimization problem (Finn et al., 2017; 2018) and its goal is to learn meta-parameters $\theta$ that maximize the following objective defined over all meta-training tasks:

$$J_{\text{meta}} \triangleq \log \prod_{t\in\mathcal{T}} p\big(\mathbf{y}_{\mathcal{X}^q_t} \,\big|\, \mathbf{y}_{\mathcal{X}^s_t}\big) = \sum_{t\in\mathcal{T}} \log \int p\big(\mathbf{y}_{\mathcal{X}^q_t} \,\big|\, \mathbf{f}_{\mathcal{X}^q_t}\big)\, p\big(\mathbf{f}_{\mathcal{X}^q_t} \,\big|\, \mathbf{y}_{\mathcal{X}^s_t}\big)\, \mathrm{d}\mathbf{f}_{\mathcal{X}^q_t} . \tag{3}$$

Task adaptation $p(\mathbf{f}_{\mathcal{X}^q_t}|\mathbf{y}_{\mathcal{X}^s_t})$ is performed via IP inference after observing the support set:

$$p\big(\mathbf{f}_{\mathcal{X}^q_t} \,\big|\, \mathbf{y}_{\mathcal{X}^s_t}\big) = \int_{\mathbf{z}} p\big(\mathbf{f}_{\mathcal{X}^q_t} \,\big|\, \mathbf{z}\big)\, p\big(\mathbf{z} \,\big|\, \mathbf{y}_{\mathcal{X}^s_t}\big)\, \mathrm{d}\mathbf{z} . \tag{4}$$

The objective $J_{\text{meta}}$ (3) is the "test" likelihood on the query set, which reflects the idea of "learning to learn" by assessing the effectiveness of "learning on the support set" through the query set. An alternative interpretation views $p(\mathbf{f}_{\mathcal{X}^q_t}|\mathbf{y}_{\mathcal{X}^s_t})$ as an "informative prior" after observing the support set. The objective $J_{\text{meta}}$ (3) is also known as the Bayesian held-out likelihood (Gordon et al., 2019). In a meta-testing task, adaptation is also performed via IP inference after observing its support set and evaluated on its query set. As with a Gaussian process (GP) or any other stochastic process, the input vectors of the dataset are assumed to be known/fixed beforehand. We will relax this assumption by allowing them to be unknown when our IPML algorithm is exploited for synthetic task generation (Sec. 3.3).

3. IMPLICIT PROCESS-BASED META-LEARNING (IPML)

3.1. EXPECTATION MAXIMIZATION (EM) ALGORITHM FOR IPML

Recall that task adaptation requires evaluating $p(\mathbf{f}_{\mathcal{X}^q_t}|\mathbf{y}_{\mathcal{X}^s_t})$ (4). From Def. 1, if generator $g_{\theta}$ (2) can be an arbitrary model (e.g., deep neural network), then $p(\mathbf{f}_{\mathcal{X}^q_t}|\mathbf{y}_{\mathcal{X}^s_t})$ and $p(\mathbf{f}_{\mathcal{X}^q_t})$ cannot be evaluated in closed form and have to be approximated by samples. Inspired by the Monte Carlo EM algorithm (Wei & Tanner, 1990) which utilizes posterior samples to obtain a maximum likelihood estimate of some hyperparameters, we propose an EM algorithm for IPML: The E step uses the stochastic gradient Hamiltonian Monte Carlo (SGHMC) sampling method to draw samples from $p(\mathbf{f}_{\mathcal{X}^q_t}|\mathbf{y}_{\mathcal{X}^s_t})$ (4), while the M step maximizes the meta-learning objective $J_{\text{meta}}$ (3) w.r.t. meta-parameters $\theta$.

Expectation (E) step. Note that since $\mathbf{f}_{\mathcal{X}^q_t} = (g_{\theta}(\mathbf{x}, \mathbf{z}))^{\top}_{\mathbf{x}\in\mathcal{X}^q_t}$ (2), no uncertainty exists in $p(\mathbf{f}_{\mathcal{X}^q_t}|\mathbf{z})$ in (4). So, $p(\mathbf{f}_{\mathcal{X}^q_t}|\mathbf{y}_{\mathcal{X}^s_t})$ can be evaluated using the same generator $g_{\theta}$ (2) and the latent task posterior belief $p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})$.

Remark 2. Drawing samples from $p(\mathbf{f}_{\mathcal{X}^q_t}|\mathbf{y}_{\mathcal{X}^s_t})$ is thus equivalent to first drawing samples of $\mathbf{z}$ from $p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})$ and then passing them as inputs to generator $g_{\theta}$ to obtain samples of $\mathbf{f}_{\mathcal{X}^q_t}$. Hence, given a task $t$, adaptation $p(\mathbf{f}_{\mathcal{X}^q_t}|\mathbf{y}_{\mathcal{X}^s_t})$ (4) essentially reduces to a task identification problem by performing IP inference to obtain the latent task posterior belief $p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})$. This is a direct consequence of epistemic uncertainty arising from $p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})$ and $p(\mathbf{z})$ (Remark 1).

In general, $p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})$ also cannot be evaluated in closed form. Instead of using variational inference (VI) and approximating $p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})$ with a potentially restrictive variational distribution (Garnelo et al., 2018; Kaddour et al., 2020; Ma et al., 2019), we draw samples from $p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})$ using SGHMC (Chen et al., 2014).
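SGHMC introduces an auxiliary random vector $\mathbf{r}$ and samples from a joint distribution $p(\mathbf{z}, \mathbf{r}|\mathbf{y}_{\mathcal{X}^s_t})$ following the Hamiltonian dynamics (Brooks et al., 2011; Neal, 1993):

$$p(\mathbf{z}, \mathbf{r}|\mathbf{y}_{\mathcal{X}^s_t}) \propto \exp\big(-U(\mathbf{z}) - 0.5\,\mathbf{r}^{\top} M^{-1} \mathbf{r}\big)$$

where the negative log-probability $U(\mathbf{z}) \triangleq -\log p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})$ resembles the potential energy and $\mathbf{r}$ resembles the momentum. SGHMC updates $\mathbf{z}$ and $\mathbf{r}$, as follows:

$$\Delta\mathbf{z} = \alpha M^{-1}\mathbf{r}, \qquad \Delta\mathbf{r} = -\alpha\nabla_{\mathbf{z}} U(\mathbf{z}) - \alpha C M^{-1}\mathbf{r} + \mathcal{N}\big(0, 2\alpha(C - B)\big)$$

where $\alpha$, $C$, $M$, and $B$ are the step size, friction term, mass matrix, and Fisher information matrix, respectively [fn 6]. Note that

$$\nabla_{\mathbf{z}} U(\mathbf{z}) = -\nabla_{\mathbf{z}} \log p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t}) = -\nabla_{\mathbf{z}} \log p(\mathbf{z}, \mathbf{y}_{\mathcal{X}^s_t}) = -\nabla_{\mathbf{z}} \big[\log p\big(\mathbf{y}_{\mathcal{X}^s_t} \,\big|\, \mathbf{f}_{\mathcal{X}^s_t} = (g_{\theta}(\mathbf{x}, \mathbf{z}))^{\top}_{\mathbf{x}\in\mathcal{X}^s_t}\big) + \log p(\mathbf{z})\big]$$

can be evaluated tractably, since the normalizing constant $p(\mathbf{y}_{\mathcal{X}^s_t})$ does not depend on $\mathbf{z}$.

Maximization (M) step. We optimize $J_{\text{meta}}$ (3) w.r.t. $\theta$ using samples of $\mathbf{z}$. The original objective

$$J_{\text{meta}} = \sum_{t\in\mathcal{T}} \log\Big(\mathbb{E}_{p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})}\big[p\big(\mathbf{y}_{\mathcal{X}^q_t} \,\big|\, \mathbf{f}_{\mathcal{X}^q_t} = (g_{\theta}(\mathbf{x}, \mathbf{z}))^{\top}_{\mathbf{x}\in\mathcal{X}^q_t}\big)\big]\Big)$$

is not amenable to stochastic optimization with data minibatches, which is usually not an issue in a few-shot learning setting. When a huge number of data points and samples of $\mathbf{z}$ are considered, we can resort to optimizing the lower bound $J_{\text{s-meta}}$ of $J_{\text{meta}}$ obtained by applying Jensen's inequality:

$$J_{\text{meta}} \geq J_{\text{s-meta}} \triangleq \sum_{t\in\mathcal{T}} \mathbb{E}_{p(\mathbf{f}_{\mathcal{X}^q_t}|\mathbf{y}_{\mathcal{X}^s_t})}\big[\log p(\mathbf{y}_{\mathcal{X}^q_t}|\mathbf{f}_{\mathcal{X}^q_t})\big] = \sum_{t\in\mathcal{T}} \mathbb{E}_{p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})}\big[\log p(\mathbf{y}_{\mathcal{X}^q_t}|\mathbf{f}_{\mathcal{X}^q_t})\big] .$$

[Figure 1: architecture diagrams, panels (a), (b), and (c), showing the generator $g_{\theta}$ with $\mathbf{z}$ applied as a point-wise mask ($\odot$) and the X-Net $h$ taking $\mathbf{z}$ and $\omega$ by concatenation ($\oplus$).]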
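The E and M steps above can be sketched on a toy task as follows. This is a hedged illustration rather than the paper's implementation: it assumes an identity mass matrix $M = I$, a scalar friction $C$, $B = 0$, full (not minibatch) gradients on the small support set, a prior $p(\mathbf{z}) = \mathcal{N}(0, I)$, and a hypothetical linear generator $g(\mathbf{x}, \mathbf{z}) = z_1 \sin x + z_2 \cos x$, chosen only because its posterior over $\mathbf{z}$ is Gaussian and can be checked analytically. The names `sghmc`, `features`, `grad_U`, and `log_lik` are illustrative.

```python
import numpy as np

def sghmc(grad_U, z0, n_steps, alpha=0.05, C=1.0, B=0.0, seed=0):
    """SGHMC updates with M = I (assumed): returns the visited values of z."""
    rng = np.random.default_rng(seed)
    z = np.array(z0, dtype=float)
    r = np.zeros_like(z)
    noise_std = np.sqrt(2.0 * alpha * (C - B))
    samples = np.empty((n_steps, z.size))
    for i in range(n_steps):
        z = z + alpha * r  # Delta z = alpha * M^{-1} r
        # Delta r = -alpha * grad U(z) - alpha * C * M^{-1} r + N(0, 2 alpha (C - B))
        r = r - alpha * grad_U(z) - alpha * C * r + noise_std * rng.normal(size=z.shape)
        samples[i] = z
    return samples

def features(x):
    """Hypothetical linear generator: g(x, z) = features(x) @ z."""
    return np.stack([np.sin(x), np.cos(x)], axis=1)

rng = np.random.default_rng(1)
sigma = 0.5
z_true = np.array([1.0, -0.5])
x_s, x_q = np.linspace(-3.0, 3.0, 10), np.linspace(-2.5, 2.5, 5)  # support / query inputs
y_s = features(x_s) @ z_true + sigma * rng.normal(size=x_s.size)
y_q = features(x_q) @ z_true + sigma * rng.normal(size=x_q.size)

def grad_U(z):
    """Gradient of U(z) = -[log p(y_s | f = g(x_s, z)) + log p(z)] with p(z) = N(0, I)."""
    Phi = features(x_s)
    return -(Phi.T @ (y_s - Phi @ z)) / sigma**2 + z

def log_lik(z, x, y):
    """Gaussian log-likelihood log p(y | f = g(x, z))."""
    resid = y - features(x) @ z
    return -0.5 * np.sum(resid**2) / sigma**2 - 0.5 * y.size * np.log(2.0 * np.pi * sigma**2)

# E step: posterior samples of z given the support set (first 1000 steps discarded).
samples = sghmc(grad_U, z0=np.zeros(2), n_steps=6000)[1000:]

# M-step objective for this one task: Monte Carlo estimate of the J_s-meta term.
j_s_meta = np.mean([log_lik(z, x_q, y_q) for z in samples])

# Analytic posterior mean, available only because this toy generator is linear in z.
Phi = features(x_s)
post_mean = np.linalg.solve(Phi.T @ Phi / sigma**2 + np.eye(2), Phi.T @ y_s / sigma**2)
```

In an actual M step, `j_s_meta` would be summed over tasks and differentiated w.r.t. the generator's meta-parameters; the toy generator here has none, so only the estimate itself is shown.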

3.2. ARCHITECTURE DESIGN FOR META-TRAINING

Our generator $g_{\theta}$ is implemented using a deep neural network (DNN) parameterized by meta-parameters $\theta$. Under this setup, we have empirically observed that the design of the coupling of $\mathbf{z}$ with the DNN $g_{\theta}(\mathbf{x}, \cdot)$ is crucial to achieving competitive performance of our IPML algorithm. A naive design that concatenates $\mathbf{z}$ with $\mathbf{x}$ (or higher-level abstractions of $\mathbf{x}$) as a contextual input during forward passes has not worked well, as the resulting gradients w.r.t. $\mathbf{z}$ may not have provided enough guidance for SGHMC to learn a sufficiently useful representation of $\mathbf{z}$ in meta-training. To this end, inspired by the attention mechanism (Vaswani et al., 2017) and dropout method (Srivastava et al., 2014), we introduce a design of the coupling by applying $\mathbf{z}$ as a mask to the last DNN layer's parameters: The last DNN layer's parameters are first masked by $\mathbf{z}$ (i.e., point-wise product with $\mathbf{z}$), as illustrated in Figs. 1a and 1b. Different tasks can now be distinguished by different masks, hence resembling different attentions on the last DNN layer's connections during forward propagation. We adopt soft masks [fn 7] (i.e., continuous values) instead of hard masks (i.e., either 0 or 1). Such a design of the coupling is empirically demonstrated to be effective in our experiments (Appendix A.4.3).
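The masking idea can be illustrated with a small NumPy sketch. Several details here are assumptions not stated in the text: the mask $\mathbf{z}$ is taken to have the same shape as the last layer's weight matrix, the bias is left unmasked, and the body network is a single hypothetical tanh layer. The soft mask is drawn from $\mathcal{N}(1, I)$, so that a mask of all ones recovers the unmasked network.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, hidden, out_dim = 3, 8, 2

# Hypothetical meta-parameters theta: one hidden layer plus the last (masked) layer.
W1 = rng.normal(size=(x_dim, hidden)) / np.sqrt(x_dim)
b1 = np.zeros(hidden)
W_last = rng.normal(size=(hidden, out_dim)) / np.sqrt(hidden)
b_last = np.zeros(out_dim)

def g_theta(X, z):
    """Forward pass where the last layer's weights are point-wise multiplied by z."""
    h = np.tanh(X @ W1 + b1)           # task-independent features
    return h @ (W_last * z) + b_last   # soft mask z selects/re-weights connections

X = rng.normal(size=(5, x_dim))
z_a = 1.0 + rng.normal(size=W_last.shape)  # soft mask for task a, z ~ N(1, I)
z_b = 1.0 + rng.normal(size=W_last.shape)  # soft mask for task b
out_a, out_b = g_theta(X, z_a), g_theta(X, z_b)
out_unmasked = g_theta(X, np.ones_like(W_last))
```

Because $\mathbf{z}$ multiplies the weights directly, the gradient of the output w.r.t. each entry of $\mathbf{z}$ is the corresponding feature-times-weight product, which is one plausible reading of why this coupling gives SGHMC a stronger signal than concatenation.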

3.3. ARCHITECTURE DESIGN FOR SYNTHETIC TASK GENERATION

Recall the assumption of known/fixed input vectors in $\mathcal{X}_t$ in the last paragraph of Sec. 2 [fn 8], which we will have to relax here. Synthetic task generation can be performed by the following procedure if $\mathbf{x}$ is task-independent (e.g., $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x})p(\mathbf{z})$): After meta-training is completed (Sec. 2), draw a sample of latent task vector $\mathbf{z} \sim p(\mathbf{z})$, draw samples of $\mathbf{x} \sim p(\mathbf{x})$ to form $\mathcal{X}_t$, and then generate noisy outputs $\mathbf{y}_{\mathcal{X}_t} = (g_{\theta}(\mathbf{x}, \mathbf{z}) + \epsilon(\mathbf{x}))^{\top}_{\mathbf{x}\in\mathcal{X}_t}$ to obtain the dataset $(\mathcal{X}_t, \mathbf{y}_{\mathcal{X}_t})$ for task $t$. When $\mathbf{x}$ is task-dependent (e.g., for image classifications of different objects, $p(\mathbf{x}, \mathbf{z}) \neq p(\mathbf{x})p(\mathbf{z})$), not modeling $p(\mathbf{x}|\mathbf{z})$ limits the ability to generate a $t$-dependent $\mathcal{X}_t$. To resolve this, our IPML algorithm includes an X-generative network (X-Net): $\mathbf{x} \triangleq h_{\phi}(\mathbf{z}, \omega)$ that learns to generate an input vector $\mathbf{x}$ given samples of the latent task vector $\mathbf{z}$ and random vector $\omega \sim p(\omega) = \mathcal{N}(0, I)$ where $\omega$ models the diversity of the input distribution given a fixed task represented by the sample of $\mathbf{z}$.

There are several options to implement X-Net: Note that during the training of X-Net, both $\mathcal{X}_t$ and the samples of $\mathbf{z} \sim p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})$ for all meta-training tasks $t \in \mathcal{T}$ are available. So, generative models such as the conditional variational autoencoder (CVAE) (Sohn et al., 2015) or conditional generative adversarial networks (Mirza & Osindero, 2014) are suitable for X-Net as they can utilize $\mathbf{z}$ as the contextual information. Our work here uses (the decoder of) CVAE to implement X-Net. Figs. 1c and 1d illustrate such a design. We have empirically observed that a simple concatenation with $\mathbf{z}$ suffices here, as our careful architecture design for meta-training (Sec. 3.2) can yield a representation of $\mathbf{z}$ that is useful enough to train X-Net well. Further details and a method to ensure balanced data generation are given in Appendix A.5.
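The training objective for synthetic task generation is the empirical lower bound (Sohn et al., 2015) of VI on $p(\omega|\mathbf{x}, \mathbf{z})$:

$$J_{\mathcal{X}} \triangleq \sum_{t\in\mathcal{T}} \mathbb{E}_{\mathbf{z}\sim p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})}\Big[\,|\mathcal{X}_t|^{-1} \sum_{\mathbf{x}\in\mathcal{X}_t} \mathbb{E}_{q_{\psi}(\omega|\mathbf{x},\mathbf{z})}\big[\log p_{\phi}(\mathbf{x}|\mathbf{z}, \omega)\big] - D_{\text{KL}}\big[q_{\psi}(\omega|\mathbf{x}, \mathbf{z}) \,\|\, p(\omega)\big]\Big]$$

where $\phi$ and $\psi$ are, respectively, the parameters of X-Net (decoder neural network) and the encoder neural network, and $D_{\text{KL}}$ denotes the KL divergence. In the training of X-Net, we sample one $\mathbf{z}$ per update. We also sample one $\omega$ per update and train with the reparameterization trick. Algorithms 1 and 2 describe meta-training (with training of X-Net) and synthetic task generation, respectively.

Algorithm 2: Synthetic task generation
    for $i = 1, \ldots,$ final size of $\mathcal{X}_t$ do
        Sample $\omega \sim \mathcal{N}(0, I)$
        Compute $\mathbf{x} = h_{\phi}(\mathbf{z}, \omega)$
        Compute $y_{\mathbf{x}} = g_{\theta}(\mathbf{x}, \mathbf{z}) + \epsilon(\mathbf{x})$
        $(\mathcal{X}_t, \mathbf{y}_{\mathcal{X}_t}) \leftarrow (\mathcal{X}_t \cup \{\mathbf{x}\}, \mathbf{y}_{\mathcal{X}_t \cup \{\mathbf{x}\}})$
    return $(\mathcal{X}_t, \mathbf{y}_{\mathcal{X}_t})$ for task $t$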
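The generation loop can be sketched in NumPy as below. This is an illustration under assumptions: `h_phi` and `g_theta` are untrained, hypothetical stand-ins for the trained X-Net decoder and generator, the latent task vector is drawn from an assumed prior $\mathcal{N}(\texttt{z\_mean}, I)$ before the loop (as described in the text of this section), and the noise level `sigma` is arbitrary.

```python
import numpy as np

def generate_task(rng, h_phi, g_theta, z_dim, omega_dim, n_points, sigma=0.1, z_mean=0.0):
    """Generate one synthetic dataset (X_t, y_Xt) for a freshly sampled task."""
    z = z_mean + rng.normal(size=z_dim)          # z ~ p(z), assumed Gaussian prior
    X_t, y_Xt = [], []
    for _ in range(n_points):
        omega = rng.normal(size=omega_dim)       # omega ~ N(0, I)
        x = h_phi(z, omega)                      # task-dependent input
        y_x = g_theta(x, z) + sigma * rng.normal()  # noisy output as in (1)
        X_t.append(x)
        y_Xt.append(y_x)
    return z, np.stack(X_t), np.array(y_Xt)

rng = np.random.default_rng(0)
z_dim, omega_dim, x_dim = 2, 3, 4
A = rng.normal(size=(x_dim, z_dim + omega_dim))  # hypothetical decoder weights phi
w = rng.normal(size=x_dim)                       # hypothetical generator weights theta

def h_phi(z, omega):
    """Stand-in X-Net decoder: z enters by simple concatenation with omega."""
    return np.tanh(A @ np.concatenate([z, omega]))

def g_theta(x, z):
    """Stand-in generator returning a scalar output for input x under task z."""
    return float(np.sin(x @ w) * z[0] + z[1])

z, X_t, y_Xt = generate_task(rng, h_phi, g_theta, z_dim, omega_dim, n_points=6)
```

Holding `z` fixed across the loop while resampling `omega` is what makes all generated inputs belong to the same task while still varying among themselves.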

4. EXPERIMENTS AND DISCUSSION

Benchmark datasets: sinusoid regression and few-shot image classification. We first empirically evaluate the performance of our IPML algorithm against that of several Bayesian meta-learning baselines like the neural process (NP) (Garnelo et al., 2018), Bayesian model-agnostic meta-learning (BMAML) (Yoon et al., 2018), PLATIPUS (Finn et al., 2018), and amortized Bayesian meta-learning (ABML) (Ravi & Beatson, 2018) on benchmark meta-learning datasets. For few-shot image classification, we also empirically compare IPML with a strong baseline: prototypical network (PN) (Snell et al., 2017). We run experiments on three datasets: sinusoid, Omniglot (Lake et al., 2011), and mini-ImageNet (Ravi & Larochelle, 2017). Sinusoid is a regression task of sine waves with uniformly sampled amplitude in [...]. For Omniglot and mini-ImageNet, our implementation and baselines all use the same data pre-processing, train-test split, and data augmentation as in (Finn et al., 2017). The generator of IPML and the baseline classifiers are convolutional neural networks with 4 modules of 3 × 3 convolutions and 64 filters, followed by batch normalization, ReLU nonlinearities, and strided convolutions (Omniglot) or 2 × 2 max-pooling (mini-ImageNet). More details of the experimental settings can be found in Appendix A.2.2.

For sinusoid regression (Table 1), IPML outperforms MAML and BMAML by a fair margin. For Omniglot (Table 2), IPML is competitive with MAML and PN. For mini-ImageNet (Table 3), IPML outperforms MAML and all tested Bayesian meta-learning algorithms [fn 9], while being competitive with PN. PN achieves a higher classification accuracy for 1-shot 20-way Omniglot and 5-shot 5-way mini-ImageNet because PN utilizes more information from the extra classes during training (Snell et al., 2017).
Specifically, though meta-testing involves $N$-way classification for all tested algorithms, the training of PN requires more than $N$ classes, that is, 60-way classification, which is also the setting adopted in (Snell et al., 2017). It is therefore reasonable to expect that PN achieves a higher classification accuracy at times. Overall, IPML is effective for benchmark datasets. For both sinusoid regression (Table 1) and Omniglot (Table 2), NP performs unsatisfactorily as compared to IPML, likely because (a) it performs amortized variational inference of $\mathbf{z}$ through a heavily parameterized encoder which may introduce optimization difficulties and overfitting during meta-training, and (b) the encoder of NP takes in the simple concatenation of $(\mathbf{x}, y_{\mathbf{x}})$ and thus does not explicitly capture the $\mathbf{x} \to y_{\mathbf{x}}$ relationship in the support set [fn 10].

Active task selection. We can evaluate the effectiveness of the uncertainty measure arising from the latent task posterior belief $p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})$ by performing active task selection. Unlike previous works (Yoon et al., 2018; Finn et al., 2018) that can only perform active learning by querying data points, IPML can perform active learning by querying tasks and does not need the assumption of known task contexts in (Kaddour et al., 2020). In every iteration, a set of tasks is proposed with only the support set $(\mathcal{X}^s_t, \mathbf{y}_{\mathcal{X}^s_t})$ given; in image classification, it is usually one-shot. IPML will select among them the task with the maximum variance in $p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})$ (with samples from the E step/SGHMC): $\arg\max_t \mathrm{Var}(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})$, and request its query set to perform meta-training. This corresponds to a variance-based active task selection criterion. We test on both sinusoid regression and mini-ImageNet classification. Fig. 2 shows that the performance of IPML with active task selection improves over that of both MAML and IPML without active task selection, that is, it reaches a given MSE/accuracy with fewer training tasks. This shows that the uncertainty measure arising from $p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})$ can be exploited to benefit meta-training.

Measuring distance between tasks using latent task representation. An interesting question yet to be answered is the following: Does IPML learn a useful latent task representation? IPML learns to model the task through $\mathbf{z}$. If IPML learns the correct representation, then it can reflect patterns of the task distribution in the latent space. While a solid criterion for assessing the correctness of the learned latent task representation is hard to define, we can resort to an oracle (e.g., human expert with prior knowledge in designing the tasks). Our visualization of the latent task representation and quantitative evaluation of the distance measure between tasks using maximum mean discrepancy (MMD) (Gretton et al., 2012) provide ways to assess the correctness of the learned task representation. We denote the set of samples from $p(\mathbf{z}|\mathbf{y}_{\mathcal{X}^s_t})$ as $Z_t$. The MMD between tasks $t_1$ and $t_2$ can be calculated using

$$\text{MMD}[\mathcal{H}, t_1, t_2] \triangleq \sup_{\kappa\in\mathcal{H}} \Big( |Z_{t_1}|^{-1} \sum_{\mathbf{z}\in Z_{t_1}} \kappa(\mathbf{z}) - |Z_{t_2}|^{-1} \sum_{\mathbf{z}\in Z_{t_2}} \kappa(\mathbf{z}) \Big)$$

where $\mathcal{H}$ is a unit ball in the reproducing kernel Hilbert space with a radial basis function kernel. We conduct experiments with the following 5-way 1-shot settings.
Setting A: For subsampled Omniglot, we applied one rotation out of 4 possibilities $(0, \pi/2, \pi, 3\pi/2)$ uniformly across all the input images for each sampled task [fn 11].
Setting B: For subsampled mini-ImageNet, a random artistic filter (normal, brighten, or darken) is applied for each sampled task.
Setting C: For subsampled mini-ImageNet, a random artistic filter (3 different types of hue) is applied for each sampled task.
Setting D: For subsampled mini-ImageNet, a random zooming (no zooming, zooming 3 times, or zooming 10 times) is applied for each sampled task.
Setting E: For subsampled mini-ImageNet, a random artistic filter (normal, low contrast, or high contrast) is applied for each sampled task.

Setting A has 4 types of tasks while settings B to E result in 3 types of tasks. For each setting mentioned above, we first train our models in IPML to convergence, and then draw samples from the latent task posterior beliefs of the tasks (i.e., one sample of $\mathbf{z}$ per task). Finally, we visualize their latent task embeddings in the 2D space using t-SNE (van der Maaten & Hinton, 2008). Furthermore, for setting A, we evaluate the distance measure between tasks using the well-known MMD metric with radial basis function kernels on the $\mathbf{z}$ samples. It can be observed from Fig. 3 and Table 4 that IPML successfully distinguishes the 4 types of rotations for Omniglot. Both Fig. 3 and Table 4 consistently show that tasks related by flipping upside down (i.e., either $0$ and $\pi$ in the right half of the embedding, or $\pi/2$ and $3\pi/2$ in the left half) are reckoned to be closer than tasks related by a rotation of $\pi/2$, thus revealing that our visualization and evaluation of the distance measure between tasks are in accordance. From Fig. 3B to Fig. 3D, IPML successfully distinguishes different types of transformations on the tasks while revealing interesting facts: for example, tasks of high brightness are more isolated from those of low or normal brightness. Fig. 3E shows that tasks of low contrast are more distinct from those of normal or high contrast. The values of the MMD metric for settings B to E and more details of the experiments are provided in Appendix A.6. Overall, both the visualization and the evaluation of the distance measure between tasks reveal that IPML successfully learns useful latent task representations and even provides interesting insights.

Synthetic task generation for Omniglot.
We assess the usefulness of latent task representation $\mathbf{z}$ by performing synthetic task generation. The training tasks we consider are three types of subsampled binary classifications: classification of characters A vs. B, B vs. C, and C vs. A, as in Fig. 4a. During meta-learning, we train an X-Net concurrently to learn to generate task-related input images (Sec. 3.3). The CVAE implementation of X-Net contains a decoder neural network with 3 hidden layers of size [128, 128, 256] and ReLU nonlinearity, and a symmetric design of the encoder. After meta-training is completed, we continue to train the X-Net to convergence. In this experiment, the dimension of $\mathbf{z}$ is set to 2, which further allows walking through such a latent space/embedding to visualize how the generated tasks map to their latent representations. Fig. 4b shows the latent embedding of real tasks. Fig. 4c shows the sampled synthetic tasks obtained by walking through the latent space. It can be observed that X-Net successfully captures the task-dependent input distributions and can generate high-quality data of task types 1, 2, and 3 when sampled from their corresponding latent clusters (see samples of task types 1, 2, and 3 in the colored bounding boxes in Fig. 4c). We further evaluate the quality of the generated tasks by training on them. We hold out half of the images for each character during meta-training to construct the meta-testing tasks. The results are presented in Table 5. When training on both real and generated tasks, we first train on the generated tasks to convergence and then train on the real tasks for another 30 iterations. It can be observed that training on generated tasks alone achieves a higher accuracy than training on real tasks alone. When training on both real and generated tasks, a huge boost in accuracy is observed.
We conjecture that due to their diversity, generated tasks (i.e., sometimes containing more ambiguous tasks) alleviate overfitting and provide a promising direction for meta-task augmentation.

Real-world risk detection. We perform experiments on a real-world risk detection dataset provided by an anonymous e-commerce company. The task is to classify whether an item in the online shop has risks (e.g., fraud, pornography, contraband). Such risks appear in different forms and in different categories (of items). It is hard to detect risks in different categories by training models separately for each category because some categories have only very limited amounts of black samples (i.e., < 50). The similarities of the detected risks in different categories, if discovered, can help improve the performance. Meta-learning is thus a suitable approach due to its ability to perform (a) detection of risks across different categories of items and (b) adaptation to new categories. The input $\mathbf{x}$ of the dataset is the text (title and descriptions) embedding obtained from self-supervised learning, while its label is a binary variable indicating whether it contains risks (i.e., $y_{\mathbf{x}} = 1$ for black samples and $y_{\mathbf{x}} = 0$ for white samples). The data are separated by categories of items to yield 47 categories in total. Initially, we hold out 10 categories for meta-testing [fn 12] while the rest are used for meta-training. Table 6 shows results comparing the performance of IPML vs. a multi-task learning baseline [fn 13]. It can be observed that IPML outperforms multi-task learning, which indicates its stronger ability to generalize to unseen categories. Fig. 5 visualizes the latent task embedding of the 10 meta-testing categories for analysis. IPML learns useful latent task representations: For example, from Fig. 5a, gaming-related categories with IDs 46 and 47 are mapped closely in the latent task space/embedding.
The individual meta-testing performance on the 10 meta-testing categories, which is given in Appendix A.3, can be further examined: For the five categories with IDs 19, 21, 23, 36, and 44 covered by the shaded light green zone in Fig. 5b, IPML outperforms multi-task learning by a large margin. They are mapped to the center of the latent task space (Fig. 5b), which may imply that IPML's adaptations to them can largely build on previous experiences of the meta-training categories and IPML's exploitation of such similarities allows their performance to improve over multi-task learning. For the three categories with IDs 3, 9, and 39 covered by the shaded light orange zone, IPML does not have a performance advantage over multi-task learning. For the two categories with IDs 46 and 47 covered by the shaded light pink zone, both IPML and multi-task learning perform unsatisfactorily. In fact, for IPML, the categories with unsatisfactory performance (i.e., covered by either the shaded light orange or pink zone) are all mapped some distance away from the center, which indicates that they are likely considered by IPML as "outlier"/dissimilar tasks. We further compare meta-learning on (A) the same setting as before by holding out the 10 meta-testing categories vs. (B) training on all categories in setting A as well as the dissimilar ones with IDs 3, 9, 39, 46, and 47. Table 7 shows results on the desired categories with IDs 19, 21, 23, 36, and 44. It can be observed that when a meta-learning model is trained to perform well (during meta-testing) on the desired categories/tasks, training alongside dissimilar ones can compromise its performance. More details of the experimental settings and data preparation, experimental results, and analysis are given in Appendix A.3. We have also empirically compared the time efficiency of IPML against that of several meta-learning baselines and reported the results in Appendix A.7.
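Returning to the two uses of the latent posterior samples $Z_t$ described earlier in this section, the following NumPy sketch shows how the MMD between two tasks and the variance-based selection criterion might be computed. It is an illustration under assumptions: the MMD is computed with the standard biased kernel-mean estimator, for which the supremum over the unit ball has a closed form, the RBF `lengthscale` is arbitrary, and the variance of the vector $\mathbf{z}$ is summarized by the sum of its per-dimension sample variances, a choice the text does not specify. The function names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """RBF kernel matrix between rows of A and rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

def mmd(Z1, Z2, lengthscale=1.0):
    """Biased MMD estimate: norm of the difference of kernel mean embeddings."""
    m2 = (rbf_kernel(Z1, Z1, lengthscale).mean()
          + rbf_kernel(Z2, Z2, lengthscale).mean()
          - 2.0 * rbf_kernel(Z1, Z2, lengthscale).mean())
    return float(np.sqrt(max(m2, 0.0)))

def select_task(z_samples_per_task):
    """Index of the candidate task whose posterior samples of z have maximum total variance."""
    scores = [np.var(Z, axis=0).sum() for Z in z_samples_per_task]
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
Z_a = rng.normal(0.0, 0.1, size=(200, 2))   # samples of z for task a
Z_b = rng.normal(0.5, 0.1, size=(200, 2))   # a nearby task
Z_c = rng.normal(3.0, 0.1, size=(200, 2))   # a distant task
d_ab, d_ac = mmd(Z_a, Z_b), mmd(Z_a, Z_c)

candidates = [rng.normal(0.0, s, size=(200, 2)) for s in (0.1, 1.0, 0.5)]
chosen = select_task(candidates)
```

In this toy setup the distant task yields the larger MMD, and the candidate with the widest posterior is the one selected for querying.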

5. RELATED WORK

A number of meta-learning algorithms (Finn et al., 2018; Ravi & Beatson, 2018; Yoon et al., 2018) have proposed a Bayesian extension of the MAML framework (Finn et al., 2017). They differ from IPML in that they model the uncertainty in the predictions with a set of particles (Yoon et al., 2018) or a variational distribution (Finn et al., 2018; Ravi & Beatson, 2018), which does not allow latent task modeling. The work of Rusu et al. (2019) introduces a generative model that decodes latent vectors into the meta-parameters, but does not scale well in the dimension of meta-parameters. In comparison, IPML explicitly represents each task as a latent continuous vector and models its probabilistic belief and is hence scalable in the dimension of meta-parameters. Moreover, MAML-based algorithms usually require evaluating computationally intensive second-order derivatives of the meta-parameters during meta-training because they approximate the Bayesian inference through an inner loop of gradient descent. Although this issue can be addressed by methods such as first-order approximations (e.g., first-order MAML (Finn et al., 2017), Reptile (Nichol et al., 2018)) or implicit MAML (Rajeswaran et al., 2019) using implicit gradient, these works are not Bayesian. In contrast, our IPML algorithm naturally utilizes Bayes' rule to perform sampling during Bayesian inference and does not need second-order derivatives. The work of Kaddour et al. (2020) uses latent information to perform active task selection, but assumes a known task descriptor (task context), which is usually unavailable. The work of Garnelo et al. (2018) introduces the first use of stochastic processes (i.e., neural processes) in meta-learning and learns a heavily parameterized encoder to encode a dataset into its latent representation, which might introduce optimization difficulties and overfitting and can only output Gaussian posterior beliefs.
In comparison, our IPML algorithm is the first to consider SGHMC in task adaptation/inference of meta-learning, which can capture a non-Gaussian posterior belief to achieve a better performance (Appendix A.4). Our IPML algorithm is also the first to explicitly model task-dependent input distributions, which is lacking in the literature. Such a modeling enables synthetic task generation of complex image classification tasks for the first time.

6. CONCLUSION

This paper describes a novel IPML algorithm that, in contrast to existing works, explicitly represents each task as a continuous latent vector and models its probabilistic belief within the highly expressive IP framework. Unlike existing works, IPML offers the benefits of being amenable to (a) the characterization of a principled distance measure between tasks using MMD, (b) active task selection without needing the assumption of known task contexts in (Kaddour et al., 2020) , and (c) synthetic task generation of complicated image classifications via modeling of task-dependent input distributions using our X-Net. Empirical evaluation on benchmark datasets shows that IPML outperforms existing Bayesian meta-learning algorithms. We have also empirically demonstrated on an anonymous e-commerce company's real-world dataset that IPML outperforms the multi-task learning baseline and identifies "outlier"/dissimilar tasks which can degrade meta-testing performance.



An IP (Ma et al., 2019) is a stochastic process such that every finite collection of random variables has an implicitly defined joint prior distribution. Some typical examples of IP include Gaussian processes, Bayesian neural networks, neural processes (Garnelo et al., 2018), among others. An IP is formally defined in Def. 1. The work of Ma et al. (2019) uses the well-studied Gaussian process as the variational family to perform variational inference in general applications of IP, which sacrifices the flexibility and expressivity of IP by constraining the distributions of the function outputs to be Gaussian. Such a straightforward application of IP to meta-learning has not yielded satisfactory results in our experiments (see Appendix A.4). We defer the discussion of meta-learning on probabilistic classification tasks using the robust-max likelihood (Hernández-Lobato et al., 2011) to Appendix A.1. p(z) is often assumed to be a simple distribution like multivariate Gaussian N(0, I) (Garnelo et al., 2018). Our work here considers a point estimate of meta-parameters θ instead of a Bayesian treatment of θ (Finn et al., 2018; Yoon et al., 2018). This allows us to interpret the epistemic uncertainty in p(f_X) via p(z) directly. The sampler hyperparameters α, C, M, and B are set according to the auto-tuning method of Springenberg et al. (2016) which has been verified to work well in our experiments; more details are given in Appendix A.2.1. The latent task prior belief p(z) is thus assumed to be a multivariate Gaussian N(1, I). This assumption is reasonable for meta-training since only p(y_X) (and not p(x)) needs to be modeled. Some of the results are taken from (Finn et al., 2018; Nguyen et al., 2020; Yoon et al., 2018). The 5-shot 5-way results for PLATIPUS and ABML are missing because there are no publicly available implementations. An ablation study of the limitations of NP can be found in Appendix A.8.
In the previous experiment, the Omniglot dataset is augmented with rotations, but is random across the classes in a single task. Their category names and IDs are given in Fig. 5. When testing on an unseen category, multi-task learning performs adaptation by randomly initializing its untied parameters for retraining on the few-shot support data.



Figure 1: (a) Graphical model corresponding to IPML. (b) DNN implementation of generator g_θ where θ ≜ (θ_a, θ_b) and θ_a can be convolutions to obtain high-level representations of the input vector, while θ_b is the last DNN layer's parameters which are masked by z during the forward passes. (c) Graphical model corresponding to input generation by X-Net. (d) CVAE implementation of X-Net (i.e., decoder neural network with its own parameters).
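To make the masking described in panel (b) concrete, the following is a minimal sketch of one forward pass in which a latent task vector z masks the last-layer weights. The single hidden layer, the element-wise product used as the mask, and all function and variable names are illustrative assumptions, not the architecture used in the experiments.

```python
import random

def relu(values):
    return [max(0.0, v) for v in values]

def features(x, theta_a):
    """Hidden representation of a scalar input x; theta_a = (weights, biases)."""
    weights, biases = theta_a
    return relu([w * x + b for w, b in zip(weights, biases)])

def generator(x, z, theta_a, theta_b):
    """Sketch of g_theta(x, z): last-layer weights theta_b are masked
    element-wise by the latent task vector z (assumed form of the mask)."""
    h = features(x, theta_a)
    masked = [w * zi for w, zi in zip(theta_b, z)]
    return sum(m * hi for m, hi in zip(masked, h))

rng = random.Random(0)
hidden = 4  # hypothetical hidden size, also the dimension of z here
theta_a = ([rng.gauss(0, 1) for _ in range(hidden)],
           [rng.gauss(0, 1) for _ in range(hidden)])
theta_b = [rng.gauss(0, 1) for _ in range(hidden)]
z_task = [rng.gauss(1.0, 1.0) for _ in range(hidden)]  # one draw from N(1, I)
y = generator(0.5, z_task, theta_a, theta_b)
```

Under this assumed multiplicative mask, setting z to the all-ones vector recovers the unmasked network, which is one way to read a prior centered at 1.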

[0.1, 5.0], phase in [0, π], and input x in [-5, 5]. The generator of IPML and the baseline regressors are neural networks with 2 hidden layers of size 40 with ReLU nonlinearities. The Omniglot dataset consists of 20 instances of 1623 characters from 50 different alphabets. The mini-ImageNet dataset involves 64 training classes, 12 validation classes, and 24 test classes.
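As a rough illustration of the sinusoid setup above, the sketch below samples one few-shot regression task. It assumes that the range [0.1, 5.0] refers to the sinusoid amplitude and that targets take the form y = A sin(x - phase), as in the commonly used few-shot sinusoid benchmark; the helper name and its arguments are hypothetical.

```python
import math
import random

def sample_sinusoid_task(num_shots, rng):
    """Sample one few-shot sinusoid regression task (hypothetical helper)."""
    amplitude = rng.uniform(0.1, 5.0)  # assumed to be the amplitude range
    phase = rng.uniform(0.0, math.pi)  # phase in [0, pi]
    xs = [rng.uniform(-5.0, 5.0) for _ in range(num_shots)]  # inputs in [-5, 5]
    ys = [amplitude * math.sin(x - phase) for x in xs]       # assumed target form
    return xs, ys, (amplitude, phase)

rng = random.Random(0)
xs, ys, (amplitude, phase) = sample_sinusoid_task(num_shots=5, rng=rng)
```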

Figure 2: Results of active task selection on (a) 5-shot sinusoid and (b) 1-shot 5-way mini-ImageNet.

Figure 3: Visualization of latent task embeddings from settings A to E.

Figure 4: (a) t-SNE visualization of (samples of) 3 types of binary classification tasks; images of black/white background are black/white samples (y_x = 1 / y_x = 0). (b) Visualization of latent embedding of real tasks in (normalized) z space [-2, 2]^2. (c) Sampled generated task data by walking through the (normalized) z space [-2, 2]^2; note that the inversion of color is only for visualization to distinguish black and white samples. In training, no images are inverted.

Figure 5: (a) t-SNE visualization of latent task embedding of 10 meta-testing categories and (b) their analysis (see main text). Legend shows IDs and names of categories.

Algorithm 1: IPML: Meta-Training
while not converged do
    Sample task t from T
    E step: Sample {z_1, ..., z_n} with SGHMC
    M step: Sample z from {z_1, ..., z_n}
    M step: θ ← θ + η ∇_θ J_meta
    Update X-Net with z and X_t: gradient-ascent step of size η on J_X for each set of X-Net parameters
return θ and the X-Net parameters

Initialize synthetic task t and X_t = ∅
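The E and M steps of the meta-training loop can be sketched on a toy problem as follows. This is only an illustration under strong assumptions: a scalar latent z with prior N(1, 1), a scalar meta-parameter θ, a toy generator g(x, z) = θ z x with Gaussian noise, full-batch gradients, and a simplified SGHMC update with fixed step size and friction (no auto-tuning). The X-Net update is omitted, and all names are hypothetical.

```python
import math
import random

NOISE_VAR = 0.1  # assumed observation noise variance of the toy model

def grad_neg_log_post(z, theta, xs, ys):
    """Gradient w.r.t. z of U(z) = -log p(ys | xs, z, theta) - log p(z)."""
    grad_lik = sum((theta * z * x - y) * theta * x for x, y in zip(xs, ys)) / NOISE_VAR
    grad_prior = z - 1.0  # from the prior z ~ N(1, 1)
    return grad_lik + grad_prior

def sghmc_samples(theta, xs, ys, rng, n_samples=20, burn_in=200, thin=10,
                  lr=1e-3, friction=0.1):
    """E step: draw approximate posterior samples of z with simplified SGHMC."""
    z, v, samples = 1.0, 0.0, []
    for step in range(burn_in + n_samples * thin):
        z += v
        noise = rng.gauss(0.0, math.sqrt(2.0 * friction * lr))
        v += -lr * grad_neg_log_post(z, theta, xs, ys) - friction * v + noise
        if step >= burn_in and (step - burn_in) % thin == 0:
            samples.append(z)
    return samples

def meta_train(tasks, rng, n_iters=200, meta_lr=2e-4, theta=0.5):
    """Alternate SGHMC sampling of z (E step) and gradient ascent on theta (M step)."""
    for _ in range(n_iters):
        xs, ys = rng.choice(tasks)              # sample a task
        zs = sghmc_samples(theta, xs, ys, rng)  # E step
        z = rng.choice(zs)                      # M step: pick one posterior sample
        # Gradient of the Gaussian log-likelihood w.r.t. theta at the sampled z.
        grad = -sum((theta * z * x - y) * z * x for x, y in zip(xs, ys)) / NOISE_VAR
        theta += meta_lr * grad                 # M step: ascent on theta
    return theta

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
tasks = [(inputs, [slope * x for x in inputs]) for slope in (1.5, 2.0)]
theta_hat = meta_train(tasks, random.Random(0))
```

Because the E step only needs gradients of the log posterior with respect to z, the outer update of θ in this sketch never differentiates through the sampler, which is consistent with the claim that no second-order derivatives are required.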

Mean square error (MSE) on few-shot sinusoid regression.

Few-shot classification accuracy (%) on held-out Omniglot characters.

Few-shot classification accuracy (%) on mini-Imagenet test set.

Values of MMD metric between 4 types of tasks for Omniglot (setting A). Larger value means larger dissimilarity.
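For reference, a squared MMD between two sets of latent task samples can be estimated as sketched below. This is a generic biased estimator with an RBF kernel; the kernel, the bandwidth, and the function names are assumptions and not necessarily the configuration behind the reported values.

```python
import math

def rbf_kernel(a, b, bandwidth=1.0):
    """RBF kernel between two vectors given as lists of floats."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(samples_p, samples_q, bandwidth=1.0):
    """Biased (V-statistic) estimate of MMD^2 between two sample sets."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))
    return (mean_kernel(samples_p, samples_p)
            + mean_kernel(samples_q, samples_q)
            - 2.0 * mean_kernel(samples_p, samples_q))

# Hypothetical latent samples for two tasks.
task_a = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
task_b = [[3.0, 3.0], [3.1, 3.0], [3.0, 3.1]]
distance = mmd_squared(task_a, task_b)
```

As in the table, a larger value indicates that the two sets of latent samples, and hence the two tasks, are more dissimilar.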

Results of meta-testing for training with real and generated tasks.

Averaged meta-testing performance on 10 meta-testing categories.

Averaged meta-testing performance on 5 desired categories (IDs 19, 21, 23, 36, 44).

