HIERARCHICAL GAUSSIAN MIXTURE BASED TASK GENERATIVE MODEL FOR ROBUST META-LEARNING

Abstract

Meta-learning enables quick adaptation of machine learning models to new tasks with limited data. While tasks could come from varying distributions in reality, most of the existing meta-learning methods consider both training and testing tasks as from the same uni-component distribution, overlooking two critical needs of a practical solution: (1) the various sources of tasks may compose a multicomponent mixture distribution, and (2) novel tasks may come from a distribution that is unseen during meta-training. In this paper, we demonstrate these two challenges can be solved jointly by modeling the density of task instances. We develop a meta-training framework underlain by a novel Hierarchical Gaussian Mixture based Task Generative Model (HTGM). HTGM extends the widely used empirical process of sampling tasks to a theoretical model, which learns task embeddings, fits the mixture distribution of tasks, and enables density-based scoring of novel tasks. The framework is agnostic to the encoder and scales well with large backbone networks. The model parameters are learned end-to-end by maximum likelihood estimation via an Expectation-Maximization algorithm. Extensive experiments on benchmark datasets indicate the effectiveness of our method for both sample classification and novel task detection.

1. INTRODUCTION

Training models in small data regimes is of fundamental importance. It demands a model's ability to quickly adapt to new environments and tasks. To compensate for the lack of training data for each task, meta-learning (a.k.a. learning to learn) has become an essential paradigm for model training by generalizing meta-knowledge across tasks (Snell et al., 2017; Finn et al., 2017) . While most existing meta-learning approaches were built upon an assumption that all training/testing tasks are sampled from the same distribution, a more realistic scenario should accommodate training tasks that lie in a mixture of distributions, and testing tasks that may belong to or deviate from the learned distributions. For example, in recent medical research, a global model is typically trained on the historical medical records of a certain set of patients in the database (Shukla & Marlin, 2019; Wu et al., 2021) . However, due to the uniqueness of individuals (e.g., gender, age, genetics), patients' data have a substantial discrepancy, and the pre-trained model may demonstrate significant demographic or geographical biases when testing on a new patient (Purushotham et al., 2017) . This issue can be mitigated by personalized medicine approaches (Chan & Ginsburg, 2011; Ni et al., 2022) where each patient is regarded as a task, and the pre-trained model is fine-tuned (i.e., personalized) on a support set of a few records collected in a short period (e.g., a few weeks) from every patient for adaptation. In this case, the training tasks (i.e., patients) could be sampled from a mixture of distributions (e.g., different age groups), and a testing task may or may not belong to any of the observed groups. As such, a meta-training strategy that is able to fit a mixture of task distributions and identify novel tasks is desirable for making meta-learning a practical solution. 
One way to tackle the mixture distributions of tasks is to tailor the transferable knowledge to each task by learning a task-specific representation (Oreshkin et al., 2018; Vuorio et al., 2018; Lee & Choi, 2018) , but as discussed in (Yao et al., 2019a) , the over-customized knowledge prevents its generalization among closely related tasks (e.g., tasks from the same distribution). The more recent methods try to balance the generalization and customization of the meta-knowledge by promoting local generalization either among a cluster of related tasks (Yao et al., 2019a) , or within a neighborhood of a meta-knowledge graph of tasks (Yao et al., 2019b) . Neither of them explicitly learns the underlying distribution from which the tasks are generated, rendering them infeasible for detecting novel tasks that are out-of-distribution. However, detecting novel tasks is crucial in high-stake domains, such as medicine and finance, which provides users (e.g., physicians) confidence on whether to trust the results of a testing task or not, and facilitates the downstream decision-making. In (Lee et al., 2019a) , a task-specific tuning variable was introduced to modulate the initial parameters learned by MAML (Finn et al., 2017) , so that the impacts of the meta-knowledge on different tasks are adjusted differently, e.g., novel tasks receive less impact than known tasks do. Whereas, this method focuses on improving model performance on different tasks (either known or novel), but neglects the critical mission of detecting which tasks are novel. In practice, providing an unreliable accuracy on a novel task, without differentiating it from other tasks may be meaningless and risky. Since the aforementioned methods cannot simultaneously handle the mixture distribution of tasks and novel tasks, a practical solution is in demand. 
In this work, we consider tasks as instances, and demonstrate that the dual problems of modeling the mixture of task distributions and detecting novel tasks are two sides of the same coin, i.e., density estimation on task instances. To this end, we propose a new Hierarchical Gaussian Mixture based Task Generative Model (HTGM) to explicitly model the generative process of task instances. Our contributions are summarized as follows.
• For the first time, the widely used empirical process of generating a task is theoretically extended to and specified by a hierarchy of Gaussian mixture (GM) distributions. HTGM generates a task embedding from a task-level GM, and uses it to define the task-conditioned mixture probabilities for a class-level GM, from which samples are drawn to instantiate the generated task. To allow realistic classes per task, a new Gibbs distribution is proposed to underlie the class-level GM.
• HTGM is an encoder-agnostic framework, and is thus flexible across domains. It inherits from metric-based meta-learning methods and introduces only a small overhead to an encoder for parameterizing its distributions, so it is efficient and can use large-scale backbone networks. The model parameters are learned end-to-end by maximum likelihood estimation via a principled Expectation-Maximization (EM) algorithm. The bounds of our likelihood function are theoretically analyzed.
• In the experiments, we evaluated HTGM on benchmark image datasets to validate its ability to take advantage of large backbone networks, its effectiveness in modeling the mixture distribution of tasks, and its usefulness in identifying novel tasks. The results demonstrate that HTGM outperforms the state-of-the-art (SOTA) baselines with significant improvements in most cases.

2. RELATED WORK

To the best of our knowledge, this is the first work to explicitly model the generative process of task instances from a mixture of distributions for meta-learning with novel task detection. Meta-learning aims to handle the few-shot learning problem, and has given rise to memory-based (Mishra et al., 2018), optimization-based (Finn et al., 2017; Li et al., 2017), and metric-based methods (Vinyals et al., 2016; Snell et al., 2017), which often consider an artificial scenario where training/test tasks are sampled from the same distribution. To enable more varying tasks, task-adaptive methods facilitate the customization of meta-knowledge by learning task-specific parameters (Rusu et al., 2018; Lee & Choi, 2018), temperature scaling parameters (Oreshkin et al., 2018), and task-specific modulation on model initialization (Vuorio et al., 2018; Yao et al., 2019a;b; Lee et al., 2019a). Among them, there are methods tackling the mixture distribution of tasks by clustering tasks (Yao et al., 2019a) or learning task graphs (Yao et al., 2019b), and a method relocating the initial parameters for different tasks so that they use the meta-knowledge differently (Lee et al., 2019a). As discussed before, none of these methods jointly handles the mixture of task distributions and the detection of novel tasks. Our model is built upon metric-based methods, and learns task embeddings for modeling task distributions. Achille et al. (2019) also proposed to learn embeddings for tasks and introduced a meta-learning method, but not for few-shot learning. Its embeddings are from a pre-specified set of tasks (rather than episode-wise sampling), and the meta-learning framework is for model selection. The model in (Yao et al., 2019a) has an augmented encoder for task embedding, but it does not explicitly model task generation, and is not designed for novel task detection (empirical comparison in Sec. 4.1).
Conventional novelty detection aims to identify and reject samples from unseen classes (Cheng & Vasconcelos, 2021). It relates to open-set recognition (Vaze et al., 2022), which aims to simultaneously identify unknown samples and classify samples from known classes. Out-of-distribution (OOD) detection (Liang et al., 2018; Liu et al., 2020) can be seen as a special case of novelty detection where novel samples are from other problem domains or datasets, and are thus considered easier to detect than novelties (Cheng & Vasconcelos, 2021). These methods are for large-scale training. In contrast, we want to detect novel tasks, which is a new problem in the small data regime. Hierarchical Gaussian Mixture (HGM) models have appeared in some traditional works (Goldberger & Roweis, 2005; Olech & Paradowski, 2016; Athey et al., 2019) for hierarchical clustering by applying GM agglomeratively or divisively. These works do not pre-train models for meta-learning and differ markedly from the topic of this paper. The differences are elaborated in Appendix B.1. Moreover, we discuss the relevant multi-task learning methods with task grouping in Appendix B.2.

3. HIERARCHICAL GAUSSIAN MIXTURE BASED TASK GENERATIVE MODEL

Meta-learning methods typically use an episodic learning strategy, where the meta-training set $\mathcal{D}^{tr}$ consists of a batch of episodes. Each episode samples a task $\tau$ from a distribution $p(\tau)$. Task $\tau$ has a support set $\mathcal{D}^s_\tau = \{(x^s_i, y^s_i)\}_{i=1}^{n_s}$ for training, and a query set $\mathcal{D}^q_\tau = \{(x^q_i, y^q_i)\}_{i=1}^{n_q}$ for testing, where $n_s$ is a small number, denoting a few training samples. In particular, in a commonly used $N$-way $K$-shot $Q$-query task (Vinyals et al., 2016), $\mathcal{D}^s_\tau$ and $\mathcal{D}^q_\tau$ contain $N$ classes, with $K$ and $Q$ samples per class respectively, i.e., $n_s = NK$ and $n_q = NQ$. Let $f_\theta(x^*_i) \to y^*_i$ be a base model ($*$ denotes $s$ or $q$), and $f_\theta(\cdot; \mathcal{D}^s_\tau)$ be the adapted model on $\mathcal{D}^s_\tau$. The training objective on $\tau$ is to minimize the average test error of the adapted model, i.e., $\mathbb{E}_{(x^q_i, y^q_i) \in \mathcal{D}^q_\tau}\, \ell(y^q_i, f_\theta(x^q_i; \mathcal{D}^s_\tau))$, where $\ell(\cdot, \cdot)$ is a loss function (e.g., cross-entropy loss), and the meta-training process aims to find the parameter $\theta$ that minimizes this error over all episodes in $\mathcal{D}^{tr}$. Then, $f_\theta$ is evaluated on every episode of a meta-test set $\mathcal{D}^{te}$ that samples a task from the same distribution $p(\tau)$. Usually, $p(\tau)$ is a simple distribution (Finn et al., 2017; Lee et al., 2019a). In this work, $p(\tau)$ is generalized to a mixture distribution consisting of multiple components $p_1(\tau), \dots, p_r(\tau)$, and a test episode may sample a task either in or out of any component of $p(\tau)$. As such, given the training tasks in $\mathcal{D}^{tr}$, our goal is to estimate the underlying density of $p(\tau)$, so that once a test task is given, we can (1) identify if it is a novel task, and (2) adapt $f_\theta$ to it with optimal accuracy. Specifically, the base model $f_\theta$ can be written as a combination of an encoder $g_{\theta_e}$ and a predictor $h_{\theta_p}$, i.e., $f_\theta(x^*_i) = h_{\theta_p}(g_{\theta_e}(x^*_i))$ (Tian et al., 2020).
In this work, we focus on a metric-based non-parametric learner, i.e., $\theta_p = \emptyset$ (e.g., prototypical networks (Snell et al., 2017)), not only because metric-based classifiers were confirmed as more effective than probabilistic classifiers for novelty detection (Jeong et al., 2021), but also because their training is more efficient, and thus better suited to large-scale backbone networks, than the costly nested-loop training of optimization-based methods (Tian et al., 2020). Formally, our goal is to find the model parameter $\theta$ that maximizes the likelihood of observing a task $\tau$. In other words, letting $f_\theta(x^*_i) = e^*_i \in \mathbb{R}^d$ be the sample embedding, we want to maximize the likelihood of the joint distribution $p_\theta(e^*_i, y^*_i)$ on the observed data in $\mathcal{D}_\tau = \{\mathcal{D}^s_\tau, \mathcal{D}^q_\tau\}$. We consider each task $\tau$ as an instance, with a representation $v_\tau \in \mathbb{R}^d$ in the embedding space (the method to infer $v_\tau$ is described in Sec. 3.2). To model the unobserved mixture component, we associate every task with a latent variable $z_\tau$ to indicate to which component it belongs. Suppose there are $r$ possible components, and let $n = n_s + n_q$ be the total number of samples in $\mathcal{D}_\tau$. The log-likelihood to maximize can be written by hierarchically factorizing it on $y^*_i$ and marginalizing out $v_\tau$ and $z_\tau$:

$$
\begin{aligned}
\ell(\mathcal{D}_\tau; \theta) &= \frac{1}{n}\sum_{i=1}^{n} \log\left[p_\theta(e^*_i, y^*_i)\right] = \frac{1}{n}\sum_{i=1}^{n} \log\left[p_\theta(e^*_i \mid y^*_i)\, p(y^*_i)\right] \\
&= \frac{1}{n}\sum_{i=1}^{n} \log\left[p_\theta(e^*_i \mid y^*_i) \int_{v_\tau} p(y^*_i \mid v_\tau)\, p(v_\tau)\, dv_\tau\right] \\
&= \frac{1}{n}\sum_{i=1}^{n} \log\left[p_\theta(e^*_i \mid y^*_i) \int_{v_\tau} p(y^*_i \mid v_\tau) \left[\sum_{z_\tau=1}^{r} p(v_\tau \mid z_\tau)\, p(z_\tau)\right] dv_\tau\right]
\end{aligned}
\tag{1}
$$

where $p_\theta(e^*_i \mid y^*_i)$ specifies the probability of sampling $e^*_i$ from the $y^*_i$-th class, $p(y^*_i \mid v_\tau)$ is the probability of sampling the $y^*_i$-th class for task $\tau$, and $p(v_\tau \mid z_\tau)$ indicates the probability of generating a task $\tau$ from the $z_\tau$-th mixture component. $p(z_\tau)$ is a prior on the $z_\tau$-th component. Hence, Eq. (1) implies a generative process of task $\tau$: $z_\tau \to v_\tau \to y^*_i \to e^*_i$.
Next, we define each of the aforementioned distributions and propose our HTGM method.
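For concreteness, the episodic N-way K-shot Q-query sampling described at the start of this section, with tasks drawn from a mixture of sources, might look like the following minimal sketch. The data layout (`sources` mapping a source name to a dictionary of classes) and the function name `sample_episode` are illustrative assumptions, not part of the paper.

```python
import random


def sample_episode(sources, n_way, k_shot, q_query, rng):
    """Sample one N-way K-shot Q-query episode from a mixture of task sources.

    `sources` maps a source name to {class_name: [sample, ...]} (hypothetical layout).
    """
    # Pick one mixture component (task source) uniformly at random.
    source_name = rng.choice(sorted(sources))
    classes = sources[source_name]
    # Choose N classes from that source, then K + Q distinct samples per class.
    chosen = rng.sample(sorted(classes), n_way)
    support, query = [], []
    for label, cls in enumerate(chosen):
        items = rng.sample(classes[cls], k_shot + q_query)
        support += [(x, label) for x in items[:k_shot]]   # n_s = N * K
        query += [(x, label) for x in items[k_shot:]]     # n_q = N * Q
    return source_name, support, query


# Toy data: 2 sources, 6 classes each, 10 samples per class.
toy_sources = {
    src: {f"{src}_class{c}": [f"{src}_c{c}_x{i}" for i in range(10)] for c in range(6)}
    for src in ("bird", "texture")
}
name, support_set, query_set = sample_episode(toy_sources, 5, 1, 5, random.Random(0))
```

Here the source is chosen uniformly; a non-uniform prior over sources would only change the first sampling step.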

3.1. MODEL SPECIFICATION AND PARAMETERIZATION

In Eq. (1), the class-conditional distribution $p_\theta(e^*_i \mid y^*_i)$, the task-conditional distribution $p(y^*_i \mid v_\tau)$, and the mixture distribution of tasks defined by $\{p(v_\tau \mid z_\tau), p(z_\tau)\}$ are not specified. To make Eq. (1) optimizable, we introduce our HTGM that models the generative process of tasks. Because $\mathcal{D}^s_\tau$ and $\mathcal{D}^q_\tau$ follow the same distribution, in the following, we omit the superscript $*$ for simplicity.

Class-Conditional Distribution. First, similar to (Lee et al., 2018; 2019b), we use a Gaussian distribution to model the embeddings $e_i$ in every class. Let $\mu^c_{y_i}$ and $\Sigma^c_{y_i}$ be the mean and variance of the distribution of the $y_i$-th class, then $p_\theta(e_i \mid y_i) = \mathcal{N}(e_i \mid \mu^c_{y_i}, \Sigma^c_{y_i})$. In fact, the samples in all of the classes of task $\tau$ comprise a Gaussian mixture distribution, where $p(y_i)$ is the mixture probability of the $y_i$-th class. In Eq. (1), $p(y_i)$ is factorized to be task-specific, i.e., $p(y_i \mid v_\tau)$, which resorts to another mixture distribution $p(v_\tau)$ of tasks, and establishes a structure of hierarchical mixture.

Task-Conditional Distribution. A straightforward definition of $p(y_i \mid v_\tau)$ is the density at $\mu^c_{y_i}$ in a Gaussian distribution with $v_\tau$ as the mean, where $\mu^c_{y_i}$ is the mean (or prototype) of the $y_i$-th class. However, doing so exposes two problems: (1) the density function of a Gaussian distribution is log-concave with one global maximum. Given the mean and variance, maximizing its log-likelihood tends to collapse the prototypes $\mu^c_{y_i}$ of all classes in $\tau$, making them indistinguishable and impairing classification; (2) given $v_\tau$, this method tends to sample classes with small $D_{v_\tau}(\mu^c_{y_i})$, where $D_{v_\tau}(\cdot)$ measures the Mahalanobis distance between a data point and the Gaussian distribution centered at $v_\tau$. However, in most of the existing works, classes are often uniformly sampled from a domain without any prior on distances (Finn et al., 2017).
Fitting the distance function with such "uniform" classes naively leads to an ill-posed learning problem with degenerate solutions. In light of these issues, we seek to define $p(y_i \mid v_\tau)$ as a (parameterized) density function with at least $N$ global optima so that it can distinguish the $N$ different class prototypes of $N$-way tasks. The $N$ equal (global) optima also allow it to fit $N$ classes uniformly sampled from a domain. To this end, letting $\mu^c_k$ be the surrogate embedding of the $k$-th class, we propose a Gibbs distribution $\pi(\mu^c_k \mid v_\tau, \omega)$ defined by $v_\tau$ and trainable parameters $\omega$ with an energy function. Then we write $p(y_i = k \mid v_\tau)$ as

$$
p_\omega(y_i = k \mid v_\tau) = \pi(\mu^c_k \mid v_\tau, \omega) = \frac{\exp\left[-E_\omega(\mu^c_k; v_\tau)\right]}{\int_{\mu^c_k} \exp\left[-E_\omega(\mu^c_k; v_\tau)\right] d\mu^c_k}
\tag{2}
$$

where $E_\omega(\mu^c_k; v_\tau) = \min\left(\{\|\mu^c_k - W_j v_\tau\|_2^2\}_{j=1}^{N}\right)$ is our energy function, and the denominator in Eq. (2) is a normalizing constant (with respect to $\mu^c_k$), a.k.a. the partition function in an energy-based model (EBM) (LeCun et al., 2006). $\omega = \{W_1, \dots, W_N\}$ are trainable parameters, with $W_i \in \mathbb{R}^{d \times d}$. Given $\omega$ and $v_\tau$, Eq. (2) has $N$ global maxima at $\mu^c_k = W_1 v_\tau, \dots, \mu^c_k = W_N v_\tau$. More interpretations of the proposed task-conditional distribution can be found in Appendix B.3.

Mixture Distribution of Tasks. In Eq. (1), the task distribution $p(v_\tau)$ is factorized as a mixture of $p(v_\tau \mid z_\tau = 1), \dots, p(v_\tau \mid z_\tau = r)$, weighted by their respective mixture probability $p(z_\tau)$. Thus we specify $p(v_\tau)$ as a Gaussian mixture distribution, and introduce $\mu^t_{z_\tau}$ and $\Sigma^t_{z_\tau}$ as the mean and variance for each component, i.e., $p(v_\tau \mid z_\tau) = \mathcal{N}(v_\tau \mid \mu^t_{z_\tau}, \Sigma^t_{z_\tau})$. Then the generation of $v_\tau$ involves two steps: (1) draw a latent variable $z_\tau$ from a categorical distribution on $[p(z_\tau = 1), \dots, p(z_\tau = r)]$, which can be $\text{Uniform}(r)$, and (2) draw $v_\tau$ from $\mathcal{N}(\mu^t_{z_\tau}, \Sigma^t_{z_\tau})$ (Bishop, 2006). As such, our HTGM generative process of an $N$-way $K$-shot $Q$-query task $\tau$ can be summarized as

1. Draw a latent task variable $z_\tau \sim \text{Categorical}([p(z_\tau = 1), \dots, p(z_\tau = r)])$
2. Draw a task embedding $v_\tau \sim \mathcal{N}(\mu^t_{z_\tau}, \Sigma^t_{z_\tau})$
3. For $k = 1, \dots, N$:
   (a) Draw a class prototype $\mu^c_k \sim \pi(\mu^c_k \mid v_\tau, \omega)$ from the proposed Gibbs distribution in Eq. (2)
   (b) For $i = 1, \dots, K + Q$:
       i. Set $y_i = k$, draw a sample embedding $e_i \sim \mathcal{N}(e_i \mid \mu^c_{y_i}, \Sigma^c_{y_i})$
       ii. Allocate $(e_i, y_i)$ to the support set $\mathcal{D}^s_\tau$ if $i \le K$; else allocate $(e_i, y_i)$ to the query set $\mathcal{D}^q_\tau$

To reduce complexity, we investigate the feasibility of using an isotropic Gaussian with tied variance, i.e., $\Sigma^c_1 = \dots = \Sigma^c_N = \sigma^2 I$, for class distributions, which turned out to be efficient in our experiments. Here, $I$ is the identity matrix and $\sigma$ is a hyperparameter. Tied variance is also a commonly used trick in Gaussian discriminant analysis (GDA) for generative classifiers (Lee et al., 2018). For task distributions, the variances $\Sigma^t_1, \dots, \Sigma^t_r$ can be automatically inferred by our algorithm in Sec. 3.2.

[Figure 1: (a) the inference network: support-set embeddings from $f_\theta$ are class-pooled and task-pooled into a task embedding; (b) model adaptation: the embedding $v_{\tau'}$ of a test task is projected to $W_1 v_{\tau'}, \dots, W_N v_{\tau'}$ for query-set classification and compared against the training task embeddings for novelty detection.]

Finally, substituting $p_\theta(e_i \mid y_i) = \mathcal{N}(e_i \mid \mu^c_{y_i}, \sigma^2 I)$, $p_\omega(y_i \mid v_\tau) = \pi(\mu^c_{y_i} \mid v_\tau, \omega)$ ($y_i = k$), $p(v_\tau \mid z_\tau) = \mathcal{N}(v_\tau \mid \mu^t_{z_\tau}, \Sigma^t_{z_\tau})$ and $p(z_\tau) = \text{Uniform}(r)$ in Eq. (1), whose probabilities are specified and parameterized, we get our HTGM induced loss $\ell_{\text{HTGM}}(\mathcal{D}_\tau; \theta, \omega)$. The class means $\mu^c_{y_i}$, task means $\mu^t_{z_\tau}$ and variances $\Sigma^t_{z_\tau}$ are inferred in the E-step of our EM algorithm (details are in Sec. 3.2 and A.4).
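The generative process above can be simulated directly, which may help in checking one's reading of the three steps. The sketch below is an illustration under stated assumptions, not the authors' implementation: it works in embedding space only, takes $p(z_\tau)$ to be uniform, and draws from the Gibbs distribution by rejection sampling with a uniform mixture of Gaussians $\mathcal{N}(W_j v_\tau, I/2)$ as the proposal (a choice made here because the proposal density is proportional to $\sum_j \exp(-\|\mu - W_j v_\tau\|_2^2)$, which dominates $\exp(-E_\omega)$). All function and variable names are hypothetical.

```python
import numpy as np


def sample_gibbs_prototype(v, W, rng):
    """Draw mu with density proportional to exp(-min_j ||mu - W_j v||^2)."""
    modes = W @ v                                  # (N, d): the N global maxima W_j v
    n_modes, dim = modes.shape
    while True:
        # Proposal: pick a mode uniformly, add N(0, I/2) noise.
        j = rng.integers(n_modes)
        mu = modes[j] + rng.normal(scale=np.sqrt(0.5), size=dim)
        neg_energy = -np.sum((mu - modes) ** 2, axis=1)
        # Accept with probability max_j(.) / sum_j(.), which lies in [1/N, 1].
        accept = 1.0 / np.sum(np.exp(neg_energy - neg_energy.max()))
        if rng.uniform() < accept:
            return mu


def generate_task(task_means, task_covs, W, sigma, k_shot, q_query, rng):
    """Simulate steps 1-3 of the generative process in embedding space."""
    z = rng.integers(len(task_means))                          # step 1 (uniform prior)
    v = rng.multivariate_normal(task_means[z], task_covs[z])   # step 2
    support, query = [], []
    for k in range(W.shape[0]):                                # step 3
        proto = sample_gibbs_prototype(v, W, rng)              # step 3(a)
        # Step 3(b): K + Q embeddings from N(proto, sigma^2 I).
        e = proto + sigma * rng.normal(size=(k_shot + q_query, len(v)))
        support += [(e_i, k) for e_i in e[:k_shot]]
        query += [(e_i, k) for e_i in e[k_shot:]]
    return z, v, support, query


# Toy parameters (illustrative values only).
rng = np.random.default_rng(0)
d, N, r = 4, 5, 3
W = rng.normal(size=(N, d, d))
task_means = rng.normal(size=(r, d))
task_covs = np.stack([np.eye(d)] * r)
z, v, S, Q = generate_task(task_means, task_covs, W, sigma=1.0, k_shot=1, q_query=5, rng=rng)
```

Because each prototype is drawn independently in step 3(a), two classes may land near the same mode in this simulation.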

3.2. MODEL OPTIMIZATION

It is hard to directly optimize $\ell_{\text{HTGM}}(\mathcal{D}_\tau; \theta, \omega)$, because the exact posterior inference is intractable (due to the integration over $v_\tau$). To solve it, we resort to variational methods, and introduce an approximated posterior $q_\phi(v_\tau \mid \mathcal{D}^s_\tau)$, which is defined by an inference network $\phi$, and implies we want to infer $v_\tau$ from its observed support set $\mathcal{D}^s_\tau$. The query set $\mathcal{D}^q_\tau$ is not included because it is unavailable during model testing. Then we propose to maximize a lower bound of Eq. (1), which is derived as (the details are in Appendix A.1)

$$
\begin{aligned}
\ell_{\text{HTGM}}(\mathcal{D}_\tau; \theta, \omega) \ge \ell_{\text{HTGM-ELBO}}(\mathcal{D}_\tau; \theta, \omega) &= \frac{1}{n}\sum_{i=1}^{n} \log p_{\theta,\omega}(e_i \mid y_i) \\
&\quad + \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{v_\tau \sim q_\phi(v_\tau \mid \mathcal{D}^s_\tau)}\left[\log p_\omega(y_i \mid v_\tau) + \log\left(\sum_{z_\tau=1}^{r} p(v_\tau \mid z_\tau)\, p(z_\tau)\right)\right] \\
&\quad + H\left(q_\phi(v_\tau \mid \mathcal{D}^s_\tau)\right)
\end{aligned}
\tag{3}
$$

where $H(q_\phi(v_\tau \mid \mathcal{D}^s_\tau)) = -\int_{v_\tau} q_\phi(v_\tau \mid \mathcal{D}^s_\tau) \log q_\phi(v_\tau \mid \mathcal{D}^s_\tau)\, dv_\tau$ is the entropy function. Similar to VAE (Kingma & Welling, 2013), Eq. (3) estimates the expectation (in the second term) by sampling $v_\tau$ from $q_\phi(v_\tau \mid \mathcal{D}^s_\tau)$, instead of the integration in Eq. (1), hence facilitates computation. Next, we elaborate on the inference network, the challenges of maximizing Eq. (3), and our workarounds.

Inference Network. Similar to VAE, $q_\phi(v_\tau \mid \mathcal{D}^s_\tau)$ is defined as a Gaussian distribution $\mathcal{N}(\mu^a_{z_\tau}, \bar{\sigma}^2 I)$, where $\mu^a_{z_\tau}$ is the output of the inference network, which approximates $\mu^t_{z_\tau}$ in Step 2 of the generative process, and $\bar{\sigma}$ is a hyperparameter for the corresponding variance. As illustrated by Fig. 1(a), the inference network is built upon the base model $f_\theta(\cdot)$ with two non-parametric aggregation (i.e., mean pooling) functions, thus $\phi = \theta$. The first function aggregates class-wise embeddings to prototypes $\mu^c_{y_i}$, similar to prototypical networks (Snell et al., 2017). The second, in contrast, aggregates all prototypes to $\mu^a_{z_\tau}$. During model training, we use the reparameterization trick (Kingma & Welling, 2013) to sample $v_\tau$ from $\mathcal{N}(\mu^a_{z_\tau}, \bar{\sigma}^2 I)$. It is noteworthy that $H(q_\phi(v_\tau \mid \mathcal{D}^s_\tau))$ in Eq. (3) becomes a constant now because $\bar{\sigma}^2$ is a constant.

Challenge 1: [...] the class prototypes $\mu^c_1, \dots, \mu^c_N$ in task $\tau$ could collide, drawing all sample embeddings to the same spot. To avoid such a trivial solution and improve the stability of optimization, we apply negative sampling (Mikolov et al., 2013)

$$
\ell_{\text{neg}}(\mathcal{D}_\tau; y_i, \theta, \omega) = -\log \mathbb{E}_{e_j \sim \mathcal{D}_\tau}\left[\exp\left(-\frac{1}{2\sigma^{2d}} \|e_j - \mu^c_{y_i}\|_2^2\right)\right]
\tag{4}
$$

where $e_j$ is a negative sample embedding from any class in the support set, and $\mu^c_{y_i}$ is the mean of the positive class. In practice, we found it is beneficial to integrate $\ell_{\text{neg}}$ with our likelihood $\ell_{\text{HTGM}}$ in Eq. (1) during training, i.e., $\ell_{\text{HTGM}} + \frac{1}{n}\sum_{i=1}^{n} \ell_{\text{neg}}$. Correspondingly, from Eq. (3) we have

$$
\ell(\mathcal{D}_\tau; \theta, \omega) = \ell_{\text{HTGM-ELBO}}(\mathcal{D}_\tau; \theta, \omega) + \frac{1}{n}\sum_{i=1}^{n} \ell_{\text{neg}}(\mathcal{D}_\tau; y_i, \theta, \omega)
\tag{5}
$$

which not only serves as a robust training loss, but also helps solve the next challenge.

Challenge 2: The Partition Function in Eq. (2). The second term $p_\omega(y_i \mid v_\tau)$ in Eq. (3) involves computing the partition function in Eq. (2) (i.e., the denominator), which is intractable because of the integration over all possible $\mu^c_k$. To solve it, we propose an upper bound of the partition function, $\int_{\mu^c_k} \exp\left[-E_\omega(\mu^c_k; v_\tau)\right] d\mu^c_k \le N\sqrt{2^{d-1}\pi^d}$ (the derivation is in Appendix A.2), which is a constant for a specific $N$. By replacing the partition function in Eq. (2) with $N\sqrt{2^{d-1}\pi^d}$, we get a lower bound of $p_\omega(y_i \mid v_\tau)$, which in turn relaxes the lower bound in Eq. (3). The following theorem (the proof is in Appendix A.3) states that the tightness of the relaxed bound is controllable.

Theorem 1. Among the $N$ global maxima $W_1 v_\tau, \dots, W_N v_\tau$ of Eq. (2), let $W_h v_\tau$ and $W_l v_\tau$ ($1 \le h, l \le N$) be the pair with the smallest Euclidean distance $D(W_h v_\tau, W_l v_\tau)$. Then we have

$$
\lim_{D(W_h v_\tau, W_l v_\tau) \to \infty} \int_{\mu^c_k} \exp\left[-E_\omega(\mu^c_k; v_\tau)\right] d\mu^c_k = N\sqrt{2^{d-1}\pi^d}
$$

This theorem indicates that the partition function approaches $N\sqrt{2^{d-1}\pi^d}$ when all pairs of the global maxima are far apart.
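As a sanity check of the bound and of Theorem 1 in the simplest case, the sketch below numerically integrates $\exp[-E_\omega]$ for $d = 1$, where the constant $N\sqrt{2^{d-1}\pi^d}$ reduces to $N\sqrt{\pi}$. The mode locations stand in for $W_1 v_\tau, \dots, W_N v_\tau$ and are illustrative values chosen here; the function name and the trapezoidal integration grid are likewise assumptions of this example.

```python
import numpy as np


def partition_1d(modes, half_width=30.0, num=200_001):
    """Trapezoidal estimate of the integral of exp(-min_j (x - m_j)^2) over the real line."""
    modes = np.asarray(modes, dtype=float)
    x = np.linspace(modes.min() - half_width, modes.max() + half_width, num)
    # Energy: squared distance to the closest mode.
    energy = np.min((x[:, None] - modes[None, :]) ** 2, axis=1)
    y = np.exp(-energy)
    dx = x[1] - x[0]
    return float(np.sum(0.5 * (y[1:] + y[:-1])) * dx)


bound = 3 * np.sqrt(np.pi)                      # N * sqrt(pi) with N = 3, d = 1
close_modes = partition_1d([0.0, 0.5, 1.0])     # overlapping modes: well below the bound
far_modes = partition_1d([0.0, 10.0, 20.0])     # well-separated modes: close to the bound
```

With overlapping modes the integral falls clearly below the bound, while with well-separated modes it nearly attains it, which is the one-dimensional behavior the theorem describes.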
It is noteworthy that during training (i.e., maximizing the likelihood) we fit $W_1 v_\tau, \dots, W_N v_\tau$ to the different class prototypes $\mu^c_1, \dots, \mu^c_N$ in $N$-way tasks. Because $\ell_{\text{neg}}$ in Eq. (4) tends to maximize the distances between different prototypes through the negative samples, maximizing the joint loss $\ell$ in Eq. (5) tends to separate $W_1 v_\tau, \dots, W_N v_\tau$, and thus tightens the relaxed bound obtained with $N\sqrt{2^{d-1}\pi^d}$, according to Theorem 1. This is another benefit of negative sampling.

Optimization via Expectation-Maximization. In the third term of $\ell_{\text{HTGM-ELBO}}$ in Eq. (3), we need to estimate the mixture distribution $p(z_\tau)$. Similar to optimizing Gaussian mixture models, we alternately infer $p(z_\tau)$ and solve the model parameters $\{\theta, \omega\}$ through an Expectation-Maximization algorithm. In the E-step, we infer $p(z_\tau)$ while fixing the model parameters. In the M-step, with $p(z_\tau)$ fixed, $\{\theta, \omega\}$ can be efficiently solved by optimizing Eq. (5) with stochastic gradient descent (SGD). The formula to infer $p(z_\tau)$ and the detailed training algorithm of HTGM can be found in Appendix A.4.

3.3. MODEL ADAPTATION

[...] (1) the class prototypes $\mu^c_1, \dots, \mu^c_N$ (similar to prototypical networks), and (2) the distribution $q_\phi(v_{\tau'} \mid \mathcal{D}^s_{\tau'})$, from which we draw the average task embedding $v_{\tau'} = \mu^a_{z_{\tau'}}$. Recall that the inference network is the base model $f_\theta(\cdot)$ with class-pooling and task-pooling layers, as illustrated in Fig. 1(b), and $\phi = \theta$. Then $v_{\tau'}$ is projected to $W_1 v_{\tau'}, \dots, W_N v_{\tau'}$, which represent the $N$ optimal choices of class prototypes for task $\tau'$ as learned by the Gibbs distribution in Eq. (2) from the training tasks. They are used to adapt $\mu^c_1, \dots, \mu^c_N$ so that the adapted prototypes are drawn towards the closest classes from the mixture component that task $\tau'$ belongs to. The adaptation is performed by selecting the closest optimum for each prototype, i.e., $\hat{\mu}^c_j = \alpha \mu^c_j + (1 - \alpha) W_{l^*} v_{\tau'}$, where $l^* = \arg\min_{1 \le l \le N} D(\mu^c_j, W_l v_{\tau'})$ using the Euclidean distance $D(\cdot, \cdot)$, and $\alpha$ is a hyperparameter.

Finally, we (1) assess if $\tau'$ is a novelty by computing the likelihood of $v_{\tau'}$ in a pre-fitted GMM on the embeddings $v_\tau$ of the training tasks in $\mathcal{D}^{tr}$, and (2) perform classification on each sample $x'_i$ in the query set $\mathcal{D}^q_{\tau'}$ using the adapted prototypes by

$$
p(y'_i = j' \mid x'_i) = \frac{\exp\left(-D(f_\theta(x'_i), \hat{\mu}^c_{j'})\right)}{\sum_{j=1}^{N} \exp\left(-D(f_\theta(x'_i), \hat{\mu}^c_j)\right)}.
$$
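The adaptation and classification steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' code: it operates on precomputed embeddings (the encoder $f_\theta$ is omitted), uses the mean task embedding as $v_{\tau'}$, and the function names and toy values are hypothetical.

```python
import numpy as np


def pool_support(embeddings, labels, n_way):
    """Class-pooling, then task-pooling (both mean pooling) of support embeddings."""
    protos = np.stack([embeddings[labels == k].mean(axis=0) for k in range(n_way)])
    return protos, protos.mean(axis=0)


def adapt_prototypes(protos, v, W, alpha):
    """Move each prototype towards its closest projected optimum W_l v."""
    modes = W @ v                                              # (N, d)
    dist = np.linalg.norm(protos[:, None, :] - modes[None, :, :], axis=-1)
    nearest = modes[dist.argmin(axis=1)]                       # W_{l*} v per prototype
    return alpha * protos + (1.0 - alpha) * nearest


def classify(query_emb, protos):
    """Softmax over negative Euclidean distances to the prototypes."""
    dist = np.linalg.norm(query_emb[:, None, :] - protos[None, :, :], axis=-1)
    logits = -dist
    logits = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


# Toy 3-way task with well-separated classes in a 2-d embedding space.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
labels = np.repeat(np.arange(3), 5)
support = centers[labels] + 0.1 * rng.normal(size=(15, 2))
W = rng.normal(size=(3, 2, 2))                                 # stands in for learned W_1..W_N
protos, v = pool_support(support, labels, n_way=3)
adapted = adapt_prototypes(protos, v, W, alpha=0.5)
probs = classify(support, adapted)
```

Setting `alpha=1.0` recovers plain prototype-based classification, and `alpha=0.0` replaces every prototype by its closest projected optimum.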

4. EXPERIMENTS

In this section, we evaluate HTGM's effectiveness on few-shot classification and novel task detection on benchmark datasets, and compare it with SOTA methods. Datasets. The first is the Plain-Multi benchmark proposed in (Yao et al., 2019a) . It includes four fine-grained image classification datasets, i.e., CUB-200-2011 (Bird), Describable Textures Dataset (Texture), FGVC of Aircraft (Aircraft), and FGVCx-Fungi (Fungi). In each episode, a task samples classes from one of the four datasets, so that different tasks are from a mixture of the four domains. The second is the Art-Multi benchmark from (Yao et al., 2019b) , whose distribution is more complex than Plain-Multi. Similar to (Jerfel et al., 2019) , each image in Plain-Multi was applied with two filters, i.e., blur filter and pencil filter, respectively, to simulate a changing distribution of few-shot tasks. Afterward, together with the original four datasets, a total of 12 datasets comprise Art-Multi, and each task is sampled from one of them. Both benchmarks were divided into the meta-training, meta-validation, and meta-test sets by following their corresponding papers. Baselines. We compare HTGM with the most relevant SOTA methods on meta-learning, including (1) optimization-based methods: MAML (Finn et al., 2017) and Meta-SGD (Li et al., 2017) learn globally shared initialization among tasks. MUMOMAML (Vuorio et al., 2018 ) is a task-specific method. TAML (Lee et al., 2019a) handles imbalanced tasks. HSML (Yao et al., 2019a) and ARML (Yao et al., 2019b) learn locally shared initial parameters in clusters of tasks and neighborhoods of a meta-graph of tasks, respectively; and (2) Metric-based methods: ProtoNet (Snell et al., 2017) learns prototypes with distance-based classifier. MetaOptNet (Lee et al., 2019c) uses an SVM classifier with kernel metrics. 
ProtoNet-Aug (Su et al., 2020), FEATS (Ye et al., 2020) and NCA (Laenen & Bertinetto, 2021) were built upon ProtoNet by augmenting images (e.g., rotation, jigsaw), adding a prototype aggregator (e.g., Transformer), and using a contrastive training loss (instead of a prototype-based loss), respectively. The detailed setup of these methods is deferred to Appendix C.1.

Implementation. Following (Tian et al., 2020), optimization-based baselines use the standard four-block convolutional layers as the base learner, and metric-based methods use ResNet-12 as the base learner. The output dimension of these networks is 640 (MetaOptNet uses 16000 as in its paper). In our experiments, we observed that the optimization-based methods have out-of-memory issues when using ResNet-12, indicating their limitation in using large backbone networks. [...] on ResNet-12, we followed the ANIL method (Raghu et al., 2020) [...] The learning rates for the inner and outer loops of optimization-based methods are 1e-3 and 1e-4. The weight decay was 1e-4. For HTGM, we set $\sigma = 1.0$, $\bar{\sigma} = 0.1$, and $\alpha = 0.5$ ($0.9$) for 1-shot (5-shot) tasks. The number of mixture components $r$ varies across datasets, and was grid-searched within $\{2, 4, 8, 16, 32\}$. All hyperparameters were set according to the meta-validation sets.

Table 2: Results (accuracy ± 95% confidence) of the compared methods on the Art-Multi dataset.

4.1. EXPERIMENTAL RESULTS

Few-shot classification. Following (Tian et al., 2020), we report the mean accuracy and 95% confidence interval of 1000 random tasks with 5-way 1-shot/5-shot, 5/25-query tests. Following (Yao et al., 2019b), we report the accuracy of each domain (Bird, Texture, Aircraft and Fungi) and the overall average accuracy for Plain-Multi, and report the accuracy of each image filtering strategy and the overall average accuracy for Art-Multi. Tables 1 and 2 summarize the results. From the tables, we have several observations. First, metric-based methods generally outperform optimization-based methods. This is because of the efficiency of metric-based methods, enabling them to fit a larger backbone network, which is consistent with the results in (Tian et al., 2020). Built upon the metric-based method, HTGM only introduces a few distribution-related parameters and thus has the flexibility to scale with the encoder size. Second, baselines designed for dealing with mixture distributions of tasks, i.e., HSML and ARML, outperform their counterparts without such design, demonstrating the importance of considering the mixture task distribution in practice. Finally, HTGM outperforms the SOTA baselines in most cases by large margins, suggesting its effectiveness in modeling the generative process of task instances.

Novel task detection. We also evaluate HTGM on the task of detecting novel $N$-way $K$-shot tasks ($N = 5$, $K = 1$) that are drawn out of the training task distributions. To this end, we train each compared model in the Original domain of the Art-Multi dataset, test the model on tasks drawn from either the Original domain (i.e., known tasks) or the {Blur, Pencil} domains (i.e., novel tasks), and evaluate whether the model can tell if a testing task is known or novel. For comparison, since none of the baselines detects novel tasks, we adapt them as follows.
For metric-based methods, since they use a fixed encoder for all training/testing tasks, we averaged the sample embeddings in each task to represent the task. Then a separate GMM model was built upon the training task embeddings, and its likelihood was adapted to score the novelty of testing tasks (some details of the setup are in Appendix C.2). However, optimization-based models perform gradient descent on the support set of each task, leading to varying encoders per task. As such, sample embeddings of different tasks are not comparable, and we cannot obtain task embeddings in the same way as before. Among them, only HSML has an augmented task-level encoder for task embedding, allowing us to include it for comparison. For a fair comparison, our HTGM also trains a GMM on its task embeddings for detecting novel tasks. Moreover, two HTGM variants were included for ablation analysis to understand some design choices: (1) HTGM-Gaussian replaces the Gibbs distribution in Eq. (2) with a Gaussian distribution; (2) HTGM w/o GMM removes the task-level GM, i.e., the third term in Eq. (3). The classification results of the ablation variants are in Appendix D.4. Following (Cheng & Vasconcelos, 2021; Vaze et al., 2022; Sharma et al., 2021), we report Area Under ROC (AUROC), Average Precision (AP), and Max-F1 for performance evaluation.

Table 3: Comparison between HTGM and its variants and the applicable baselines on novel task detection.

Table 3 summarizes the results, from which we observe that HTGM outperforms all baselines over all evaluation metrics, indicating the superior quality of task embeddings learned by our model. The embeddings follow the specified mixture distribution of tasks $p(v_\tau)$ as described in Sec. 3.1, which fits the mixture data well and hence allows detecting novel tasks that are close to the boundary. Since the baselines learn embeddings without explicit constraint, they do not even fit the post-hoc GMM very well.
Moreover, HTGM outperforms HTGM w/o GMM, which is even worse than some other baselines. This further validates the necessity of introducing the regularization of the task-level mixture distribution $p(v_\tau)$. Also, the drops in AUROC and AP of HTGM-Gaussian demonstrate the importance of our design of the Gibbs distribution for the task-conditional distribution in Eq. (2). Similar to (Vaze et al., 2022), in Fig. 2 we visualize the normalized likelihood histograms of known and novel tasks for HSML, MetaOptNet (the best baseline), ProtoNet-Aug (the near-best baseline), and HTGM. The figures indicate that the likelihoods (i.e., novelty scores) of HTGM separate known and novel tasks more clearly than those of the baselines. We also analyzed the hyperparameters of HTGM; the results are in Appendices D.1, D.2, and D.3.
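To make the scoring protocol concrete, the following is a minimal sketch of how a pre-fitted GMM can turn task embeddings into novelty scores and how AUROC can be computed from those scores. It is an illustration under stated assumptions, not the authors' implementation: the isotropic covariance, the mixture parameters, the toy embeddings, and the function names `gmm_log_likelihood` and `auroc` are all hypothetical.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) under an isotropic Gaussian mixture (assumed form, one variance per component)."""
    d = len(x)
    logs = []
    for w, mu, var in zip(weights, means, variances):
        sq = sum((a - b) ** 2 for a, b in zip(x, mu))
        logs.append(math.log(w) - 0.5 * d * math.log(2 * math.pi * var) - sq / (2 * var))
    m = max(logs)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(l - m) for l in logs))

def auroc(scores_known, scores_novel):
    """Probability that a novel task receives a higher novelty score than a known task."""
    wins = 0.0
    for s_n in scores_novel:
        for s_k in scores_known:
            if s_n > s_k:
                wins += 1.0
            elif s_n == s_k:
                wins += 0.5  # ties count as half
    return wins / (len(scores_known) * len(scores_novel))

# Hypothetical 2-component GMM fitted on training task embeddings (toy values).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [1.0, 1.0]

known_tasks = [[0.1, -0.2], [3.9, 4.2]]    # embeddings close to a mixture component
novel_tasks = [[10.0, -10.0], [2.0, 2.0]]  # embeddings far from both components

# Novelty score = negative log-likelihood: a lower likelihood means a more novel task.
known_scores = [-gmm_log_likelihood(v, weights, means, variances) for v in known_tasks]
novel_scores = [-gmm_log_likelihood(v, weights, means, variances) for v in novel_tasks]
print(auroc(known_scores, novel_scores))
```

The pairwise definition of AUROC used here is quadratic in the number of tasks; it is chosen for readability rather than efficiency.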

5. CONCLUSION

In this paper, we propose a novel Hierarchical Gaussian Mixture based Task Generative Model (HTGM). HTGM models the generative process of task instances, and performs maximum likelihood estimation to learn task embeddings, which can help adjust prototypes acquired by the feature extractor and thus achieve better performance. Moreover, by explicitly modeling the mixture distribution of tasks in the embedding space, HTGM can effectively detect the tasks that are drawn from distributions unseen in the meta-training stage. The extensive experimental results indicate the advantage of the proposed method on both few-shot classification and novel task detection.

Therefore, we have the following derivation from Eq. (9).

$$
\begin{aligned}
\int_{\mu^c_k \in \bigcup_{m=1}^{N} B_m} \exp\left[-E_\omega(\mu^c_k; v_\tau)\right] d\mu^c_k
&= \sum_{m=1}^{N} \int_{\mu^c_k \in B_m} \exp\left[-E_\omega(\mu^c_k; v_\tau)\right] d\mu^c_k \\
&= \sum_{m=1}^{N} \int_{\mu^c_k \in B_m} \exp\left[-\|\mu^c_k - W_m v_\tau\|_2^2\right] d\mu^c_k \\
&= N \int_{\mu^c_k \in B_m} \exp\left[-\|\mu^c_k - W_m v_\tau\|_2^2\right] d\mu^c_k
\end{aligned}
\qquad (11)
$$

Meanwhile, since $\bigcup_{m=1}^{N} B_m$ is a sub-area of the entire $\mathbb{R}^d$ space, we have

$$
\int_{\mu^c_k \in \bigcup_{m=1}^{N} B_m} \exp\left[-E_\omega(\mu^c_k; v_\tau)\right] d\mu^c_k \le \int_{\mu^c_k} \exp\left[-E_\omega(\mu^c_k; v_\tau)\right] d\mu^c_k \qquad (12)
$$

According to the multidimensional Gaussian integral, we have

$$
\lim_{D(W_h v_\tau, W_l v_\tau) \to \infty} \int_{\mu^c_k \in B_m} \exp\left[-E_\omega(\mu^c_k; v_\tau)\right] d\mu^c_k = \sqrt{2^{d-1}\pi^{d}} \qquad (13)
$$

Therefore,

$$
\lim_{D(W_h v_\tau, W_l v_\tau) \to \infty} \int_{\mu^c_k} \exp\left[-E_\omega(\mu^c_k; v_\tau)\right] d\mu^c_k \ge N\sqrt{2^{d-1}\pi^{d}} \qquad (14)
$$

Since $N\sqrt{2^{d-1}\pi^{d}}$ is its upper bound, based on the squeeze theorem, we have

$$
\lim_{D(W_h v_\tau, W_l v_\tau) \to \infty} \int_{\mu^c_k} \exp\left[-E_\omega(\mu^c_k; v_\tau)\right] d\mu^c_k = N\sqrt{2^{d-1}\pi^{d}} \qquad (15)
$$

which completes the proof of Theorem 1.

A.4 THE TRAINING ALGORITHM OF HTGM

The training algorithm of HTGM is summarized in Algorithm 1.

B APPENDIX FOR FURTHER DISCUSSION

To the best of our knowledge, the Hierarchical Gaussian Mixture (HGM) model has appeared in traditional works (Goldberger & Roweis, 2005; Olech & Paradowski, 2016; Athey et al., 2019) for hierarchical clustering, where a Gaussian Mixture model is applied agglomeratively or divisively to the input samples. They are unsupervised methods that infer clusters of samples, but do not pre-train embedding models (or parameter initializations) that could be fine-tuned for adaptation to new tasks in meta-learning. Therefore, these methods are remarkably different from meta-learning methods, and we think it is a non-trivial problem to adapt the concept of HGM to solve the meta-learning problem. To this end, we need to (1) identify the motivation; and (2) solve the new technical challenges. For (1), we found that the hierarchical structure of mixture distributions naturally appears when we want to model the generative process of tasks from a mixture of distributions, where each task contains another mixture distribution of classes (as suggested by Eq. (1)). In other words, the motivation of our method lies more in meta-learning than in HGM. However, we think drawing such a connection between meta-learning and HGM is a novel contribution. For (2), our method is different from traditional HGM in (a) its generative process of tasks (Sec. 3.1), which is a theoretical extension of the widely used empirical process of generating tasks in meta-learning; (b) its Gibbs-style task-conditional distribution (Eq. (2)) for fitting uniformly sampled classes; (c) the metric-based end-to-end meta-learning framework (Fig. 1

The modeling of the clustering/grouping structure of tasks or the mixture of distributions of tasks has been studied in multi-task learning (MTL).
In (Xue et al., 2007; Jacob et al., 2008), tasks are assumed to have a clustering structure, and the model parameters of the tasks in the same cluster are drawn toward each other by optimizing their L2 distances. In (Kang et al., 2011), a subspace-based regularization framework was proposed for grouping task-specific model parameters, where the tasks in the same group are assumed to lie in the same low-dimensional subspace for parameter sharing. The method in (Kumar & Daumé III, 2012) also uses subspace-based sharing of task parameters, but allows two tasks from different groups to overlap by having one or more bases in common. The method in (Passos et al., 2012) introduces a generative model for task-specific model parameters that encourages parameter sharing by modeling the latent mixture distribution of the parameters via the Dirichlet process and Beta process.

The key difference between these methods and our method HTGM lies in the difference between MTL and meta-learning. In an MTL method, all tasks are known a priori, i.e., the testing tasks are from the set of training tasks, and the model is non-inductive at the task level (but it is inductive at the sample level). In HTGM, testing tasks can be disjoint from the set of training tasks, thus the model is inductive at the task level. In particular, we aim to allow testing tasks that are not from the distribution of the training tasks by enabling the detection of novel tasks, which is an extension of the task-level inductive model. The second difference lies in the generative process. The method in (Passos et al., 2012) models the generative process of the task-specific model parameters (e.g., the weights in a regressor).
In contrast, HTGM models the generative process of each task by hierarchically generating the classes in it and the samples in the classes, i.e., the (x, y)'s (in Eq. (1) and Sec. 3.1). In this process, we allow our model to fit uniformly sampled classes given a task (without specifying a prior on the distance function on classes) by the proposed Gibbs distribution in Eq. (2). Other remarkable differences from the aforementioned MTL methods include the inference network (Fig. 1(b)), which allows inductive inference on task embeddings and class prototypes; the optimization algorithm (Sec. 3.2) for our specific loss function in Eq. (3), which is derived from the likelihood in Eq. (1); and the model adaptation algorithm (Sec. 3.3) for performing predictions in a testing task and detecting novel tasks. As such, the MTL methods cannot be trivially applied to solve our problem.

B.3 FURTHER INTERPRETATION OF THE TASK-CONDITIONAL DISTRIBUTION

The task-conditional class distribution $p_\omega(y_i = k \mid v_\tau)$ in Eq. (2) is defined through an energy function $E_\omega(\mu^c_k; v_\tau) = \min(\{\|\mu^c_k - W_j v_\tau\|_2^2\}_{j=1}^{N})$ with trainable parameters $\omega = \{W_1, ..., W_N\}$, to allow uniformly sampled classes per task. The conditional distribution $p(y_i \mid v_\tau)$ represents how classes are distributed for a given task $\tau$. The reason for its definition in Eq. (2) is as follows. If it is a Gaussian distribution with $v_\tau$ (i.e., the task embedding) as the mean, $p(y_i = k \mid v_\tau)$ can be interpreted as the density at the representation of the k-th class in this Gaussian distribution, i.e., the density at $\mu^c_k$, which is the mean/surrogate embedding of the k-th class. One problem of this Gaussian $p(y_i \mid v_\tau)$ is that different classes, i.e., different $\mu^c_{y_i}$'s, are not uniformly distributed, contradicting the practice that, given a dataset (e.g., images), classes are often uniformly sampled to constitute a task in empirical studies. Using a uniformly sampled set of classes to fit the Gaussian distribution $p(y_i \mid v_\tau)$ leads to an ill-posed learning problem, as described in Sec. 3.1. To solve it, we introduced $\omega = \{W_1, ..., W_N\}$ in the energy function $E_\omega(\mu^c_k; v_\tau)$ in Eq. (2). $W_j \in \mathbb{R}^{d \times d}$ ($1 \le j \le N$) can be interpreted as projecting $v_\tau$ to the j-th space spanned by the basis (i.e., columns) of $W_j$. There are N different spaces for j = 1, ..., N. Thus, the N projected task means $W_1 v_\tau, ..., W_N v_\tau$ are in N different spaces. Fitting the energy function $E_\omega(\mu^c_k; v_\tau)$ to N uniformly sampled classes $\mu^c_1, ..., \mu^c_N$, which tend to be far from each other because they are uniformly random, tends to learn $W_1, ..., W_N$ that project $v_\tau$ to N far-apart spaces, each of which fits one of $\mu^c_1, ..., \mu^c_N$ by closeness, due to the min-pooling operation. This mitigates the aforementioned ill-posed learning problem.
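The min-pooling in the energy function can be illustrated with a small sketch. The code below evaluates the energy for two class means against N = 2 projected task means; the matrices, vectors, and helper names (`matvec`, `sq_dist`, `energy`) are toy assumptions for illustration only, not learned parameters or the authors' code.

```python
def matvec(W, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def energy(mu, v, Ws):
    """Min-pooled energy: the smallest squared distance from mu to any W_j v.

    Returns the energy together with the index of the closest projection.
    """
    dists = [sq_dist(mu, matvec(W, v)) for W in Ws]
    j_star = min(range(len(dists)), key=dists.__getitem__)
    return dists[j_star], j_star

# Toy setting with d = 2 and N = 2 (all values are made up).
v_tau = [1.0, 0.0]
Ws = [
    [[1.0, 0.0], [0.0, 1.0]],  # W_1 v_tau = (1, 0)
    [[0.0, 0.0], [3.0, 0.0]],  # W_2 v_tau = (0, 3)
]
mu_a = [1.0, 0.5]  # class mean near W_1 v_tau
mu_b = [0.0, 2.0]  # class mean near W_2 v_tau

e_a, j_a = energy(mu_a, v_tau, Ws)
e_b, j_b = energy(mu_b, v_tau, Ws)
print(e_a, j_a, e_b, j_b)
```

In this toy setting each class mean has low energy because it only needs to be close to one of the projected task means, even though the two class means are far from each other.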

C APPENDIX FOR IMPLEMENTATION DETAILS

C.1 THE SETUP OF THE COMPARED MODELS

Encoder of Metric-based Meta-Learning. For fairness, for all metric-based methods, including ProtoNet (Snell et al., 2017), MetaOptNet (Lee et al., 2019c), ProtoNet-Aug (Su et al., 2020), FEATS (Ye et al., 2020) and NCA (Laenen & Bertinetto, 2021), we follow (Tian et al., 2020; Lee et al., 2019c) and apply ResNet-12 as the encoder. ResNet-12 has 4 residual blocks, each with 3 convolutional layers with a kernel size of 3 × 3. It uses DropBlock as a regularizer, and its number of filters is (60, 160, 320, 640). For MetaOptNet, following its paper (Lee et al., 2019c), we flattened the output of the last convolutional layer to acquire a 16000-dimensional feature as the image embedding. For the other baselines, following (Tian et al., 2020), we used a global average-pooling layer on top of the last residual block to acquire a 640-dimensional feature as the image embedding.

Further Details. Following (Snell et al., 2017), ProtoNet, ProtoNet-Aug, and NCA use the Adam optimizer with β1 = 0.9 and β2 = 0.99. We grid-searched the initial learning rate of Adam within {1e-2, 1e-3, 1e-4} and selected 1e-3, which is the same as in the official implementation provided by the authors. For FEATS, we chose the transformer as the set-to-set function based on the results reported by (Ye et al., 2020). When pre-training the encoder in FEATS, following its paper (Ye et al., 2020), we applied the same setting as ProtoNet, i.e., the Adam optimizer with an initial learning rate of 1e-3, β1 = 0.9 and β2 = 0.99. When training its aggregation function, we grid-searched the initial learning rate in {1e-4, 5e-4, 1e-5}, since a larger learning rate leads to invalid results on our datasets. The optimal choice is 1e-4.
For MetaOptNet, following its paper (Lee et al., 2019c) , we used SGD with Nesterov momentum of 0.9, an initial learning rate of 0.1 and a scheduler to optimize it, and applied the quadratic programming solver OptNet (Amos & Kolter, 2017) for the SVM solution in it.

C.2 THE DETAILS OF THE SETUP FOR NOVEL TASK DETECTION

In the experiments on novel task detection in Sec. 4.1, the number of in-distribution tasks (from the Original domain) in the test set is 4000 (1000 per task cluster), and the number of novel tasks (from the Blur and Pencil domains) in the test set is 8000 (4000 for Blur and 4000 for Pencil).

We summarize the classification performance of the two ablation variants, HTGM w/o GMM and HTGM-Gaussian, in Table 7. As we can see, our designs improve the novel task detection performance without significantly decreasing the classification performance.



Figure 1: An illustration of HTGM: (a) the training process, and (b) the testing process. In (a), ①, ②, and ③ are the three parts of the training loss in Eq. (3). In (b), the training task embeddings contain the embeddings of all training tasks, i.e., the outputs of the task-pooling in (a).

Trivial Solution. In Eq. (3), since the first term log p θ,ω (e i |y i ) = -1 2σ 2d ∥e i -µ c yi ∥ 2 2 (constants are ignored) only penalizes the distance between a sample e i and its own class mean µ c yi (i.e., intra-class distances) without considering inter-class relationships, different class means µ c 1 , ..., µ c

Fig. 1(b) illustrates the adaptation process of HTGM. Given a new N-way task $\tau'$ from the meta-test set $D^{te}$, its support set $D^s_{\tau'}$ is fed to the inference network to generate (1) class prototypes $\mu^c_1, ..., \mu^c_N$ (similar to prototypical networks), and (2) the distribution $q_\phi(v_{\tau'} \mid D^s_{\tau'})$, from which we draw the average task embedding $v_{\tau'} = \mu^a_{z_{\tau'}}$. Recall that the inference network is the base model $f_\theta(\cdot)$ with class-pooling and task-pooling layers, as illustrated in Fig. 1(b), and $\phi = \theta$. Then $v_{\tau'}$ is projected to $W_1 v_{\tau'}, ..., W_N v_{\tau'}$, which represent the N optimal choices of class prototypes for task $\tau'$ as learned by the Gibbs distribution in Eq. (2) from the training tasks. They are used to adapt $\mu^c_1, ..., \mu^c_N$ so that the adapted prototypes are drawn towards the closest classes from the mixture component that task $\tau'$ belongs to. The adaptation is performed by selecting the closest optimum for each prototype, i.e., $\hat{\mu}^c_j = \alpha \mu^c_j + (1 - \alpha) W_{l^*} v_{\tau'}$ where $l^* = \arg\min_{1 \le l \le N} D(\mu^c_j, W_l v_{\tau'})$, using the Euclidean distance $D(\cdot, \cdot)$, and $\alpha$ is a hyperparameter. Finally, we (1) assess if $\tau'$ is a novelty by computing the likelihood of $v_{\tau'}$ in a GMM pre-fitted on the embeddings $v_\tau$'s of the training tasks in $D^{tr}$, and (2) perform classification on each sample $x'_i$ in the query set $D^q_{\tau'}$ using the adapted prototypes by $p(y'_i = j' \mid x'_i) =$
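The prototype adaptation step described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name `adapt_prototypes`, the helper functions, and all numeric values (including α = 0.5) are assumptions chosen for a toy example with d = 2 and N = 2.

```python
def matvec(W, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sq_dist(a, b):
    """Squared Euclidean distance; it has the same argmin as the Euclidean distance D."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def adapt_prototypes(prototypes, v_task, Ws, alpha):
    """Pull each prototype toward its closest projected task embedding W_l v.

    Implements: adapted_j = alpha * mu_j + (1 - alpha) * W_{l*} v,
    with l* = argmin_l D(mu_j, W_l v).
    """
    projected = [matvec(W, v_task) for W in Ws]
    adapted = []
    for mu in prototypes:
        closest = min(projected, key=lambda p: sq_dist(mu, p))
        adapted.append([alpha * m + (1 - alpha) * c for m, c in zip(mu, closest)])
    return adapted

# Toy values (hypothetical): task embedding, projection matrices, and support-set prototypes.
v_task = [1.0, 0.0]
Ws = [
    [[1.0, 0.0], [0.0, 1.0]],  # W_1 v = (1, 0)
    [[0.0, 0.0], [3.0, 0.0]],  # W_2 v = (0, 3)
]
prototypes = [[1.0, 0.5], [0.0, 2.0]]

adapted = adapt_prototypes(prototypes, v_task, Ws, alpha=0.5)
print(adapted)
```

With α = 1 the prototypes are left unchanged, and with α = 0 each prototype is replaced by its closest projected task embedding, so α controls how strongly the task-level information overrides the support-set estimate.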

by pre-training ResNet-12 via ProtoNet, freezing the encoder, and fine-tuning the last fully-connected layer. In this case, HSML and ARML cannot work properly as they require joint training of the encoder and other layers. The details are in Appendix D.5. For training, the Adam optimizer was used. Each batch contains 4 tasks. Each model was trained with 20000 episodes. The learning rate of metric-based methods is 1e-3.

Figure 2: The frequency of tasks w.r.t. the normalized likelihood for (a) HSML, (b) MetaOptNet, (c) ProtoNet-Aug, and (d) HTGM. The x-axis ranges vary because only the 95% of tasks with the top scores were preserved.

To test them

B.1 DISCUSSION ABOUT THE RELATIONSHIP BETWEEN HTGM AND HGM MODEL

) (note that the traditional HGM is not for learning embeddings); (d) the non-trivial derivation of the optimization algorithm in Sec. 3.2 and Alg. 1; and (e) the novel model adaptation process in Sec. 3.3. Solving the technical challenges in the new generative model is another novel contribution of the proposed method.

Analysis of different σ. Table 4 reports the effect of different σ on the classification performance (5-way-1-shot classification on the Plain-Multi dataset). As shown in the table, although too low or too high a setting of this hyperparameter hurts the performance, in general the model is robust to the setting of σ.

Analysis of different σ. Table 5 summarizes how different σ influence the classification performance (5-way-1-shot classification on the Plain-Multi dataset). In general, different settings of σ influence the model performance only marginally, indicating our model's robustness to this hyperparameter.

Analysis on the number of mixture components. Different choices of the number of mixture components do not significantly influence the classification performance of the model. However, the clustering quality may vary with the number of components. Here, we report the Silhouette score (Shahapure & Nicholas, 2020; Sharma et al., 2021) w.r.t. the number of components in Table 6. From Table 6, we can see that selecting a component number close to the ground-truth component number of the distribution benefits the clustering quality.
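For readers unfamiliar with the Silhouette score, the sketch below computes its standard definition on a toy set of points. It only illustrates the metric under assumed data: the points, the labelings, and the function name `silhouette_score` are hypothetical and unrelated to the task embeddings or results in Table 6.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (requires at least two clusters)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data: two well-separated groups of points.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the true groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

A labeling that matches the true grouping yields a score close to 1, while a mismatched labeling yields a lower (here negative) score, which is the behavior the analysis above relies on.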

A APPENDIX FOR DETAILS OF DERIVING HTGM

A.1 THE LOWER-BOUND OF THE LIKELIHOOD FUNCTION

In this section, we provide the details of the lower-bound in Eq. (3). By introducing the approximated posterior $q_\phi(v_\tau \mid D^s_\tau)$, the likelihood in Eq. (1) becomes (the superscript * is neglected for clarity)

where the fourth step uses Jensen's inequality. This completes the derivation of Eq. (3).

A.2 THE UPPER-BOUND OF THE PARTITION FUNCTION

In Sec. 3.2, we apply an upper bound on the partition function in Eq. (2) for solving the challenging 2. The derivation of the upper bound is as follows.

where the last equation is from the multidimensional Gaussian integral. This completes the derivation of the upper bound of the partition function.

A.3 THE PROOF OF THEOREM 1

Proof. Let $B_j$ denote a ball in $\mathbb{R}^d$. Its center is at $W_j v_\tau$ and its radius is, for any pair of balls $B_j$ and $B_m$ we have

In other words, there is no overlap between any pair of balls. Therefore, if we compute the integral over the union of all balls, we have

Also, because there is no overlap between any pair of balls, for each point $\mu^c_k \in B_m$, we have

Algorithm 1: Hierarchical Gaussian Mixture based Task Generative Model (HTGM)
Input: encoder $f_\theta$, training dataset $D^{tr}$, hyperparameters r, σ, σ
Output: model parameters {θ, ω}
Pre-train the encoder $f_\theta$ via ProtoNet with augmentations.
Pre-train the energy function in Eq. (2) by maximizing $\frac{1}{n}\sum_{i=1}^{n}\left[\log p_{\theta,\omega}(e_i \mid y_i) + \log p_\omega(y_i \mid v_\tau)\right]$
// embeddings of the support set
Sample a task embedding $v_\tau$ from $q_\phi(v_\tau$
represents the labeling of the $v_\tau$'s in V
/* M-step */

Table 8: More results (accuracy ± 95% confidence) of the optimization-based methods.

D.5 ABLATION ANALYSIS OF OPTIMIZATION-BASED METHODS

Table 8 summarizes the performance of MAML, HSML and ARML trained with the ANIL method (Raghu et al., 2020), i.e., we pre-trained ResNet-12 by ProtoNet, froze the encoder, and fine-tuned the last fully-connected layers using MAML, HSML and ARML on the Plain-Multi dataset. From Table 8, the performance of ANIL-MAML is better than that of MAML in Table 1, similar to the observation in (Raghu et al., 2020), indicating the effectiveness of the ANIL method. However, ANIL-HSML and ANIL-ARML perform similarly to ANIL-MAML, losing the advantage from modeling the mixture distribution of tasks that they achieve when implemented without ANIL as in Table 1 (up to 5.6% average improvement). This is because the cluster layer in HSML and the graph layer in ARML both affect the embeddings learned through backpropagation, i.e., they were designed for joint training with the encoder. When the encoder is frozen, they cannot work properly. For this reason, to be consistent with the existing research (Yao et al., 2019a; b)

In the case when the task distribution is not a mixture, our model would degenerate to, and perform similarly to, the general metric-based meta-learning methods, e.g., ProtoNet, which only consider a uni-component distribution. To confirm this, we added an experiment that compares our model with ProtoNet-Aug on Mini-Imagenet (Vinyals et al., 2016), which does not have the same explicit mixture distributions as the Plain-Multi and Art-Multi datasets in Section 4. The results are summarized in Table 9. From the table, we observe that our method performs comparably to ProtoNet, which supports the aforementioned expectation. Meanwhile, together with the results in Table 1 and Table 2, the proposed method can be considered a generalization of the metric-based methods to mixtures of task distributions.

