META-GMVAE: MIXTURE OF GAUSSIAN VAES FOR UNSUPERVISED META-LEARNING

Abstract

Unsupervised learning aims to learn meaningful representations from unlabeled data that capture its intrinsic structure and can be transferred to downstream tasks. Meta-learning, whose objective is to learn to generalize across tasks such that the learned model can rapidly adapt to a novel task, shares the spirit of unsupervised learning in that both seek to learn a more effective and efficient learning procedure than learning from scratch. The fundamental difference between the two is that most meta-learning approaches are supervised, assuming full access to the labels. However, acquiring a labeled dataset for meta-training is not only costly, as it requires human effort in labeling, but also limits the applicability of meta-learning to pre-defined task distributions. In this paper, we propose a principled unsupervised meta-learning model, namely Meta-GMVAE, based on the Variational Autoencoder (VAE) and set-level variational inference. Moreover, we introduce a mixture of Gaussians (GMM) prior, assuming that each modality represents a class concept in a randomly sampled episode, which we optimize with Expectation-Maximization (EM). The learned model can then be used for downstream few-shot classification tasks, where we obtain task-specific parameters by performing semi-supervised EM on the latent representations of the support and query set, and predict labels of the query set by computing aggregated posteriors. We validate our model on the Omniglot and Mini-ImageNet datasets by evaluating its performance on downstream few-shot classification tasks. The results show that our model obtains impressive performance gains over existing unsupervised meta-learning baselines, even outperforming supervised MAML in a certain setting.

1. INTRODUCTION

Unsupervised learning is one of the most fundamental and challenging problems in machine learning, due to the absence of target labels to guide the learning process. Thanks to enormous research efforts, there now exist many unsupervised learning methods that have shown promising results on real-world domains, including image recognition (Le, 2013) and natural language understanding (Ramachandran et al., 2017). The essential goal of unsupervised learning is obtaining meaningful feature representations that best characterize the data, which can later be utilized to improve the performance of downstream tasks, by training a supervised task-specific model on top of the learned representations (Reed et al., 2014; Cheung et al., 2015; Chen et al., 2016) or by fine-tuning the entire pre-trained model (Erhan et al., 2010). Meta-learning, whose objective is to learn general knowledge across diverse tasks such that the learned model can rapidly adapt to novel tasks, shares the spirit of unsupervised learning in that both seek a more efficient and effective learning procedure than learning from scratch. However, the essential difference between the two is that most meta-learning approaches have been built on the supervised learning scheme, and require human-crafted task distributions to be applicable to few-shot classification. Acquiring a labeled dataset for meta-training may require a massive amount of human effort, and, more importantly, it limits the applications of meta-learning to pre-defined task distributions (e.g. classification of a specific set of classes). Two recent works have proposed unsupervised meta-learning approaches that bridge the gap between unsupervised learning and meta-learning by focusing on constructing supervised tasks with pseudo-labels from the unlabeled data.
To do so, CACTUs (Hsu et al., 2019) clusters data in an embedding space learned with several unsupervised learning methods, while UMTRA (Khodadadeh et al., 2019) assumes that each randomly drawn sample represents a different class and augments each pseudo-class with data augmentation (Cubuk et al., 2018). After constructing the meta-training dataset with such heuristics, they simply apply supervised meta-learning algorithms as usual. Despite the success of existing unsupervised meta-learning methods, they are fundamentally limited, since 1) they only consider unsupervised learning for heuristic pseudo-labeling of unlabeled data, and 2) the two-stage approach makes it impossible to recover from incorrect pseudo-class assignments when learning the unsupervised representation space. In this paper, we propose a principled unsupervised meta-learning model based on the Variational Autoencoder (VAE) (Kingma & Welling, 2014) and set-level variational inference using self-attention (Vaswani et al., 2017). Moreover, we introduce a multi-modal prior distribution, a mixture of Gaussians (GMM), assuming that each modality represents a class concept in any given task. The parameters of the GMM are then optimized by running Expectation-Maximization (EM) on the observations sampled from the set-dependent variational posterior. In this framework, however, there is no guarantee that each modality obtained from the EM algorithm corresponds to a label. To realize each modality as a label, we deploy semi-supervised EM at meta-test time, considering the support set and query set as labeled and unlabeled observations, respectively. We refer to our method as Meta-Gaussian Mixture Variational Autoencoders (Meta-GMVAE) (see Figure 1 for the high-level concept). While our method can be used as a full generative model for generating samples (images), the ability to generate samples may not be necessary for capturing the meta-knowledge for non-generative downstream tasks.
Thus, we propose another version of Meta-GMVAE that reconstructs high-level features learned by unsupervised representation learning approaches (e.g. Chen et al. (2020)). To investigate the effectiveness of our framework, we run experiments on two benchmark few-shot image classification datasets, namely Omniglot (Lake et al., 2011) and Mini-ImageNet (Ravi & Larochelle, 2017). The experimental results show that our Meta-GMVAE obtains impressive performance gains over the relevant unsupervised meta-learning baselines on both datasets, obtaining even better accuracy than fully supervised MAML (Finn et al., 2017) while utilizing as little as 0.1% of the labeled data in the one-shot setting on the Omniglot dataset. Moreover, our model can generalize to classification tasks with different numbers of ways (classes) without loss of accuracy. Our contribution is threefold:

• We propose a novel unsupervised meta-learning model, namely Meta-GMVAE, which meta-learns the set-conditioned prior and posterior network for a VAE. Our Meta-GMVAE is a principled unsupervised meta-learning method, unlike existing unsupervised meta-learning methods that combine heuristic pseudo-labeling with supervised meta-learning.

• We propose to learn the multi-modal structure of a given dataset with a Gaussian mixture prior, such that it can adapt to a novel dataset via the EM algorithm. This flexible adaptation to a new task is not possible with existing methods that propose VAEs with Gaussian mixture priors for single-task learning.

• We show that Meta-GMVAE largely outperforms relevant unsupervised meta-learning baselines on two benchmark datasets, while obtaining even better performance than a supervised meta-learning model under a specific setting. We further show that Meta-GMVAE can generalize to classification tasks with different numbers of ways (classes).

2. RELATED WORK

Unsupervised learning  Many prior unsupervised learning methods have developed proxy objectives based on reconstruction (Vincent et al., 2010; Higgins et al., 2017), adversarially obtained image fidelity (Radford et al., 2016; Salimans et al., 2016; Donahue et al., 2017; Dumoulin et al., 2017), disentanglement (Bengio et al., 2013; Reed et al., 2014; Cheung et al., 2015; Chen et al., 2016; Mathieu et al., 2016; Denton & Birodkar, 2017; Kim & Mnih, 2018; Ding et al., 2020), clustering (Coates & Ng, 2012; Krähenbühl et al., 2016; Bojanowski & Joulin, 2017; Caron et al., 2018), or contrastive learning (Chen et al., 2020). In the unsupervised learning literature, the works most relevant to ours are methods that use Gaussian mixture priors for variational autoencoders. Dilokthanakul et al. (2016) and Jiang et al. (2017) consider single-task learning; the learned prior parameters are therefore fixed after training and cannot adapt to new tasks. CURL (Rao et al., 2019) learns a network that outputs Gaussian mixture priors over a sequence of tasks for unsupervised continual learning. However, CURL cannot adapt to a new task without training on it, while our framework can generalize to a new task without any training, via amortized inference with a dataset (task) encoder. Also, our model does not learn Gaussian mixture priors but rather obtains them on the fly using the Expectation-Maximization algorithm.

Meta-learning  Meta-learning (Thrun & Pratt, 1998) shares the intuition of unsupervised learning in that it aims to improve the model performance on an unseen task by leveraging prior knowledge, rather than learning from scratch. While the literature on meta-learning is vast, we only discuss existing works relevant to few-shot image classification.
Metric-based meta-learning (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Oreshkin et al., 2018; Mishra et al., 2018) is one of the most popular approaches, which learns to embed data instances of the same class closer together in a shared embedding space. One can measure the distance in the embedding space by cosine similarity (Vinyals et al., 2016) or Euclidean distance (Snell et al., 2017). On the other hand, gradient-based meta-learning (Finn et al., 2017; 2018; Li et al., 2017; Lee & Choi, 2018; Ravi & Beatson, 2019; Flennerhag et al., 2020) aims at learning a global initialization of parameters which can rapidly adapt to a novel task with only a few gradient steps. Moreover, some previous works (Hewitt et al., 2018; Edwards & Storkey, 2017; Garnelo et al., 2018) tackle meta-learning by modeling the set-dependent variational posterior with a single global latent variable, whereas we model the variational posterior conditioned on each data instance. Moreover, while all of these works assume supervised learning scenarios with access to full labels in the meta-training stage, we focus on the unsupervised setting in this paper.

Unsupervised meta-learning  One of the main limitations of conventional meta-learning methods is that their application is strictly limited to tasks from a pre-defined task distribution. A few works (Hsu et al., 2019; Khodadadeh et al., 2019) have been proposed to resolve this issue by combining unsupervised learning with meta-learning. The main idea is to construct a meta-training dataset in an unsupervised manner and then leverage existing supervised meta-learning models. CACTUs (Hsu et al., 2019) deploys several unsupervised embedding learning methods (Berthelot et al., 2019; Donahue et al., 2017; Caron et al., 2018; Chen et al., 2016) to episodically cluster the unlabeled dataset, and then trains MAML (Finn et al., 2017) and Prototypical Networks (Snell et al., 2017) on the constructed data.
UMTRA (Khodadadeh et al., 2019) assumes that each randomly drawn sample is from a different class from the others, and uses data augmentation (Cubuk et al., 2018) to construct a synthetic task distribution for meta-training. Instead of only deploying unsupervised learning to construct meta-training task distributions, we propose an unsupervised meta-learning model that meta-learns a set-level variational posterior by matching it with a multi-modal prior distribution representing latent classes.

3. UNSUPERVISED META-LEARNING WITH META-GMVAES

In this section, we describe our problem setting for unsupervised meta-learning, and then present our approach. The graphical illustration of our model for unsupervised meta-training and supervised meta-test is depicted in Figure 2.

3.1. PROBLEM STATEMENT

Our goal is to learn unsupervised feature representations that can be transferred to a wide range of downstream few-shot classification tasks. As suggested by Hsu et al. (2019) and Khodadadeh et al. (2019), we only assume an unlabeled dataset $\mathcal{D}^u = \{x_u\}_{u=1}^{U}$ in the meta-training stage. We aim to apply the knowledge learned during the unsupervised meta-training stage to novel tasks in the meta-test stage, each of which comes with a modest amount of labeled data (or as little as a single example per class). As with most meta-learning methods, we further assume that the labeled data are drawn from the same distribution as the unlabeled data, but with a different set of classes. Specifically, the goal of a K-way S-shot classification task $\mathcal{T}$ is to correctly predict the labels of query data points $\mathcal{Q} = \{x_q\}_{q=1}^{Q}$, using S support data points and labels $\mathcal{S} = \{(x_s, y_s)\}_{s=1}^{S}$ per class, where S is relatively small (i.e. between 1 and 50).
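For concreteness, the episode construction described above might be sketched as follows; the function and variable names are our own illustrative choices, not part of the released implementation:

```python
import random


def sample_unlabeled_episode(unlabeled_data, m):
    """Meta-training episode: M datapoints drawn at random, no labels."""
    return random.sample(unlabeled_data, m)


def sample_few_shot_task(data_by_class, k_way, s_shot, q_query):
    """Meta-test task: K classes, S support and Q query examples per class."""
    classes = random.sample(list(data_by_class), k_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = random.sample(data_by_class[cls], s_shot + q_query)
        support += [(x, label) for x in examples[:s_shot]]
        query += [(x, label) for x in examples[s_shot:]]
    return support, query
```

Note that the meta-training episodes carry no labels at all, which is exactly what separates this setting from supervised episodic meta-learning.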

3.2. META-LEVEL GAUSSIAN MIXTURE VAE

Unsupervised meta-training  We now describe the meta-learning framework for learning unsupervised latent representations that can be transferred to human-designed few-shot image classification tasks. In particular, we aim to learn multi-modal latent spaces for a Variational Autoencoder (VAE) in an episodic manner. We use a Gaussian mixture for the prior distribution, $p_\psi(z) = \sum_{k=1}^{K} p_\psi(y = k)\, p_\psi(z \mid y = k)$, where $\psi$ is the parameter of the prior network. The generative process can then be described as follows:

• $y \sim p_\psi(y)$, where y corresponds to the categorical latent variable selecting a single mode.
• $z \sim p_\psi(z \mid y)$, where z corresponds to the Gaussian latent variable responsible for data generation.
• $x \sim p_\theta(x \mid z)$, where $\theta$ is the parameter of the generative model.

The above generative process is similar to those of previous works (Dilokthanakul et al., 2016; Jiang et al., 2017) on modeling the VAE prior with Gaussian mixtures. However, they target single-task learning, and the parameters of the prior network are fixed after training (see equation 1c in Dilokthanakul et al. (2016) and equation 5 in Jiang et al. (2017)), which is suboptimal since a meta-learning model should be able to adapt and generalize to a novel task. To learn the set-dependent multi-modalities, we further assume that there exists a parameter $\psi_i$ for each episodic dataset $\mathcal{D}_i = \{x_j\}_{j=1}^{M}$, which is randomly drawn from the unlabeled dataset $\mathcal{D}^u$. We then derive the variational lower bound for the marginal log-likelihood of $\mathcal{D}_i$ as follows:

$$\log p_\theta(\mathcal{D}_i) = \sum_{j=1}^{M} \log p_\theta(x_j) = \sum_{j=1}^{M} \log \int \frac{p_\theta(x_j \mid z_j)\, p_{\psi_i}(z_j)}{q_\phi(z_j \mid x_j, \mathcal{D}_i)}\, q_\phi(z_j \mid x_j, \mathcal{D}_i)\, dz_j \quad (1)$$
$$\geq \sum_{j=1}^{M} \mathbb{E}_{z_j \sim q_\phi(z_j \mid x_j, \mathcal{D}_i)} \left[ \log p_\theta(x_j \mid z_j) + \log p_{\psi_i}(z_j) - \log q_\phi(z_j \mid x_j, \mathcal{D}_i) \right] \quad (2)$$
$$\approx \sum_{j=1}^{M} \frac{1}{N} \sum_{n=1}^{N} \left[ \log p_\theta(x_j \mid z_j^{(n)}) + \log p_{\psi_i}(z_j^{(n)}) - \log q_\phi(z_j^{(n)} \mid x_j, \mathcal{D}_i) \right] \quad (3)$$
$$=: \mathcal{L}(\theta, \phi, \psi_i, \mathcal{D}_i), \qquad z_j^{(n)} \overset{\text{i.i.d.}}{\sim} q_\phi(z_j \mid x_j, \mathcal{D}_i). \quad (4)$$
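As a concrete illustration, the Monte Carlo estimate in Eq 3 for a single datapoint can be sketched in NumPy as follows. The decoder log-likelihood is passed in as a function, and all shapes and helper names are illustrative assumptions rather than our exact implementation:

```python
import numpy as np


def gaussian_log_pdf(z, mu, var):
    # log N(z; mu, diag(var)) for a single vector z
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var)


def gmm_log_pdf(z, pis, mus):
    # log p_psi(z) for a mixture of unit-variance Gaussians, via log-sum-exp
    comps = [np.log(pi) + gaussian_log_pdf(z, mu, np.ones_like(mu))
             for pi, mu in zip(pis, mus)]
    m = max(comps)
    return m + np.log(sum(np.exp(c - m) for c in comps))


def elbo_estimate(x, mu_q, var_q, decoder_log_lik, pis, mus, n_samples=10, seed=0):
    # Monte Carlo estimate of Eq 3 for one datapoint x:
    # E_q[log p(x|z) + log p_psi(z) - log q(z|x, D)]
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        # reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu_q + np.sqrt(var_q) * rng.standard_normal(mu_q.shape)
        total += (decoder_log_lik(x, z) + gmm_log_pdf(z, pis, mus)
                  - gaussian_log_pdf(z, mu_q, var_q))
    return total / n_samples
```

The reparameterized sampling in the loop is what allows gradients to flow through the posterior parameters in the actual model.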
The lower bound for each datapoint is approximated by Monte Carlo estimation with sample size N. Following the convention of the VAE literature, we assume that the variational posterior $q_\phi(z_j \mid x_j, \mathcal{D}_i)$ follows an isotropic Gaussian distribution.

Algorithm 1 Unsupervised meta-training
Require: an unlabeled dataset $\mathcal{D}^u$
1: while not converged do
2:   Sample B episode datasets $\{\mathcal{D}_i\}_{i=1}^{B}$ from $\mathcal{D}^u$
3:   for all $i \in [1, B]$ do
4:     Draw N MC samples from $q_\phi(z_j \mid x_j, \mathcal{D}_i)$
5:     Initialize $\pi_k$ as $1/K$ and randomly choose K different points for $\mu_k$
6:     Compute the optimal parameter $\psi_i^*$ using Eq 7
7:   end for
8:   Update $\theta, \phi$ using $\mathcal{L}(\theta, \phi, \{\mathcal{D}_i\}_{i=1}^{B})$ in Eq 9
9: end while

Algorithm 2 Meta-test for an episode
Require: a test task $\mathcal{T} = \mathcal{S} \cup \mathcal{Q}$
1: Set $\mathcal{D} = \{x_s\}_{s=1}^{S} \cup \{x_q\}_{q=1}^{Q}$
2: Draw N MC samples from $q_\phi(z_j \mid x_j, \mathcal{D})$
3: Initialize $\mu_k = \frac{\sum_{s,n} \mathbb{1}[y_s^{(n)} = k]\, z_s^{(n)}}{\sum_{s,n} \mathbb{1}[y_s^{(n)} = k]}$ and $\sigma_k^2 = I$
4: Compute the optimal parameter $\psi^*$ using Eq 10
5: Compute $p(y_q \mid x_q, \mathcal{D})$ using Eq 11
6: Infer the label $\hat{y}_q = \arg\max_k p(y_q = k \mid x_q, \mathcal{D})$

Set-dependent variational posterior  Our derivation of the evidence lower bound in Eq 4 is similar to that of hierarchical VAE frameworks, such as equation 3 in Edwards & Storkey (2017) and equation 4 in Hewitt et al. (2018), in that we use the i.i.d. assumption that the log-likelihood of a dataset equals the sum of the log-likelihoods of its individual data points. However, previous works assume that each input set consists of data instances from a single concept (e.g. a class), and therefore encode the dataset into a single global latent variable (e.g. $q_\phi(z \mid \mathcal{D})$). This is not appropriate for unsupervised meta-learning, where labels are unavailable. Thus we learn a set-conditioned variational posterior $q_\phi(z_j \mid x_j, \mathcal{D}_i)$, which encodes each data point $x_j$ within the given dataset $\mathcal{D}_i$ into the latent space.
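To make the set-conditioning concrete, the sketch below implements a minimal single-head self-attention posterior in NumPy. It is a simplified stand-in for the multi-head TransformerEncoder of Vaswani et al. (2017) used in our implementation; the weight shapes, scaling, and residual connection are illustrative assumptions:

```python
import numpy as np


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


class SetPosterior:
    """Single-head self-attention stand-in for TransformerEncoder: every
    instance attends over the whole episode, then affine heads produce a
    per-instance Gaussian posterior q(z_j | x_j, D)."""

    def __init__(self, d_in, d_z, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1  # illustrative init scale
        self.Wq, self.Wk, self.Wv = (s * rng.standard_normal((d_in, d_in))
                                     for _ in range(3))
        self.W_mu = s * rng.standard_normal((d_in, d_z)); self.b_mu = np.zeros(d_z)
        self.W_lv = s * rng.standard_normal((d_in, d_z)); self.b_lv = np.zeros(d_z)

    def __call__(self, F):
        # F: (M, d_in) episode features f(D_i)
        Q, K, V = F @ self.Wq, F @ self.Wk, F @ self.Wv
        A = softmax(Q @ K.T / np.sqrt(F.shape[1]))  # (M, M) attention over the set
        H = F + A @ V                               # residual set-conditioned features
        mu = H @ self.W_mu + self.b_mu
        var = np.exp(H @ self.W_lv + self.b_lv)     # exp(.) keeps the variance positive
        return mu, var
```

Because the attention matrix mixes information across the whole episode, each instance's posterior depends on the other members of $\mathcal{D}_i$, which is the property the single-global-latent-variable encoders above lack.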
Specifically, we model the variational posterior $q_\phi(z_j \mid x_j, \mathcal{D}_i)$ using the self-attention mechanism (Vaswani et al., 2017) as follows:

$$H = \mathrm{TransformerEncoder}(f(\mathcal{D}_i)), \qquad \mu_j = W_\mu H_j + b_\mu, \qquad \sigma_j^2 = \exp(W_{\sigma^2} H_j + b_{\sigma^2}),$$
$$q_\phi(z_j \mid x_j, \mathcal{D}_i) = \mathcal{N}(z_j; \mu_j, \sigma_j^2). \quad (5)$$

Here we deploy $\mathrm{TransformerEncoder}(\cdot)$, a neural network based on the multi-head self-attention mechanism proposed by Vaswani et al. (2017), to model the dependencies between data instances, and f is a convolutional neural network (or the identity function for Mini-ImageNet) applied to each data point in $\mathcal{D}_i$. Moreover, we use the reparameterization trick (Kingma & Welling, 2014) to train the model with backpropagation, since the stochastic sampling process $z_j^{(n)} \overset{\text{i.i.d.}}{\sim} q_\phi(z_j \mid x_j, \mathcal{D}_i)$ is non-differentiable.

Expectation-Maximization  As discussed before, we assume that the parameter $\psi_i$ of the Gaussian mixture prior is task-specific and characterizes the given dataset $\mathcal{D}_i$. To obtain the task-specific parameter that optimally explains the given dataset, we propose to locally maximize the lower bound in Eq 4 with respect to the prior parameter $\psi_i$. We can obtain the optimal parameter $\psi_i^*$ by solving the following optimization problem:

$$\psi_i^* = \arg\max_{\psi_i} \mathcal{L}(\theta, \phi, \psi_i, \mathcal{D}_i) = \arg\max_{\psi_i} \sum_{j=1}^{M} \sum_{n=1}^{N} \log p_{\psi_i}(z_j^{(n)}), \qquad z_j^{(n)} \overset{\text{i.i.d.}}{\sim} q_\phi(z_j \mid x_j, \mathcal{D}_i), \quad (6)$$

where we only keep the term involving the task-specific parameter $\psi_i$ and drop the normalization factor $\frac{1}{N}$, since it does not change the solution of the optimization problem. The above formula implies that the optimal parameter maximizes the log-likelihood of the observations drawn from the variational posterior distribution. However, there is no analytic solution for the Maximum Likelihood Estimation (MLE) of a GMM; the most prevalent approach for estimating the parameters of a mixture of Gaussians is the Expectation-Maximization (EM) algorithm.
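A minimal NumPy sketch of this EM procedure for the identity-covariance mixture prior, assuming the sampled latents are stacked into a single matrix `Z` (an illustrative sketch, not our exact implementation):

```python
import numpy as np


def em_identity_gmm(Z, K, n_iter=10, seed=0):
    """EM for the task-specific GMM prior: identity covariances,
    pi_k initialized to 1/K, means initialized to K random points of Z."""
    rng = np.random.default_rng(seed)
    mus = Z[rng.choice(len(Z), K, replace=False)]
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities under unit-variance Gaussians
        logits = (np.log(pis)[None, :]
                  - 0.5 * ((Z[:, None, :] - mus[None, :, :]) ** 2).sum(-1))
        logits -= logits.max(1, keepdims=True)
        R = np.exp(logits); R /= R.sum(1, keepdims=True)
        # M-step: re-estimate means and mixing weights
        Nk = R.sum(0)
        mus = (R.T @ Z) / Nk[:, None]
        pis = Nk / Nk.sum()
    return pis, mus
```

With the covariances fixed to the identity, only the means and mixing weights are updated, so each EM iteration is cheap enough to run inside every meta-training episode.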
To this end, we optimize the task-specific parameters of the GMM prior distribution using the EM algorithm as follows:

(E-step) $\quad Q_{j,n}(k) := p(y_j^{(n)} = k \mid z_j^{(n)}) = \frac{\pi_k\, \mathcal{N}(z_j^{(n)}; \mu_k, I)}{\sum_{k'} \pi_{k'}\, \mathcal{N}(z_j^{(n)}; \mu_{k'}, I)}$

(M-step) $\quad \mu_k := \frac{\sum_{j,n} Q_{j,n}(k)\, z_j^{(n)}}{\sum_{j,n} Q_{j,n}(k)}, \qquad \pi_k := \frac{\sum_{j,n} Q_{j,n}(k)}{\sum_{k'=1}^{K} \sum_{j,n} Q_{j,n}(k')}, \qquad \psi_i := \{(\mu_k, I, \pi_k)\}_{k=1}^{K}, \quad (7)$

where $\pi_k$, $\mu_k$, and $\mathcal{N}(\cdot)$ denote the mixing probability of the k-th component, its mean parameter, and the normal distribution, respectively. We fix the covariance matrix of each Gaussian to the identity matrix I, following the assumption of the original VAE on the prior distribution. We initialize $\{\pi_k\}_{k=1}^{K}$ to $\frac{1}{K}$ and $\{\mu_k\}_{k=1}^{K}$ to K randomly drawn distinct points. We can obtain the MLE solution for the GMM parameters by iteratively performing the E-step and M-step until the log-likelihood converges. We found that using a fixed number of EM iterations does not degrade the performance, and treat it as a hyperparameter of our framework.

Training objective  Note that we want to maximize the variational lower bound of the marginal log-likelihood over all the episode datasets $\mathcal{D}_i$ that can be sampled from $\mathcal{D}^u$. We use stochastic gradient ascent with respect to the variational parameter $\phi$ and the generative parameter $\theta$ to maximize the following objective:

$$\mathcal{L}(\theta, \phi, \{\mathcal{D}_i\}_{i=1}^{B}) := \frac{1}{B} \sum_{i=1}^{B} \max_{\psi_i} \mathcal{L}(\theta, \phi, \psi_i, \mathcal{D}_i) \quad (8)$$
$$= \frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{M} \frac{1}{N} \sum_{n=1}^{N} \left[ \log p_\theta(x_j \mid z_j^{(n)}) + \log p_{\psi_i^*}(z_j^{(n)}) - \log q_\phi(z_j^{(n)} \mid x_j, \mathcal{D}_i) \right]. \quad (9)$$

Here we use a mini-batch of B episode datasets, where each dataset consists of M datapoints. The task-specific parameter $\psi_i^*$ for each episode dataset $\mathcal{D}_i$ is obtained by the EM algorithm in Eq 7.

Supervised meta-test  By introducing the multi-modal prior distribution into a generative learning framework, our model learns pseudo-class concepts by clustering latent features with the EM algorithm.
However, there is no guarantee that each modality obtained by the EM algorithm corresponds to the label we are interested in at the meta-test stage. To realize each modality as a label in downstream few-shot image classification tasks, we deploy a semi-supervised EM algorithm instead. Given a task $\mathcal{T}$ consisting of a support set $\mathcal{S} = \{(x_s, y_s)\}_{s=1}^{S}$ and a query set $\mathcal{Q} = \{x_q\}_{q=1}^{Q}$, we use both the support set and query set as an episode dataset $\mathcal{D} = \{x_s\}_{s=1}^{S} \cup \{x_q\}_{q=1}^{Q}$ and draw latent variables from the variational posterior $q_\phi(z_j \mid x_j, \mathcal{D})$. Note that we drop the index i, since we consider a single task for now. We then perform the semi-supervised EM algorithm as follows:

(E-step) $\quad Q_{q,n}(k) := p(y_q^{(n)} = k \mid z_q^{(n)}) = \frac{\mathcal{N}(z_q^{(n)}; \mu_k, \sigma_k^2)}{\sum_{k'} \mathcal{N}(z_q^{(n)}; \mu_{k'}, \sigma_{k'}^2)}$

(M-step) $\quad \mu_k := \frac{\sum_{s,n} \mathbb{1}[y_s^{(n)} = k]\, z_s^{(n)} + \sum_{q,n} Q_{q,n}(k)\, z_q^{(n)}}{\sum_{s,n} \mathbb{1}[y_s^{(n)} = k] + \sum_{q,n} Q_{q,n}(k)},$
$$\sigma_k^2 := \frac{\sum_{s,n} \mathbb{1}[y_s^{(n)} = k]\, (z_s^{(n)} - \mu_k)^2 + \sum_{q,n} Q_{q,n}(k)\, (z_q^{(n)} - \mu_k)^2}{\sum_{s,n} \mathbb{1}[y_s^{(n)} = k] + \sum_{q,n} Q_{q,n}(k)}, \qquad \psi := \{(\mu_k, \sigma_k^2, \tfrac{1}{K})\}_{k=1}^{K}, \quad (10)$$

where $\mathbb{1}[\cdot]$ denotes the indicator function. We fix the mixing probability to $\frac{1}{K}$, since the labels in each task $\mathcal{T}$ are uniformly distributed. Moreover, we use diagonal covariances $\sigma_k^2$ to obtain more accurate statistics for the inference. We initialize $\mu_k$ to the average of the support latent representations of class k, and $\sigma_k^2$ to the identity matrix I. Similar to the meta-training stage, we obtain the MLE solution for the GMM parameters by performing the E-step and M-step for a fixed number of iterations. Finally, we compute the conditional probability $p(y_q \mid x_q, \mathcal{D})$ using the obtained parameters $\psi^*$ as follows:

$$p(y_q \mid x_q, \mathcal{D}) = \mathbb{E}_{q_\phi(z_q \mid x_q, \mathcal{D})} \left[ p_{\psi^*}(y_q \mid z_q) \right] \approx \frac{1}{N} \sum_{n=1}^{N} p_{\psi^*}(y_q \mid z_q^{(n)}), \qquad z_q^{(n)} \overset{\text{i.i.d.}}{\sim} q_\phi(z_q \mid x_q, \mathcal{D}). \quad (11)$$
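A minimal NumPy sketch of these semi-supervised EM updates and the resulting query posteriors, assuming a single Monte Carlo sample per datapoint (N = 1); the small variance floor `eps` is our own addition for numerical stability:

```python
import numpy as np


def semi_supervised_em(Zs, ys, Zq, K, n_iter=10, eps=1e-6):
    """Semi-supervised EM at meta-test time: support latents (Zs, ys) act as
    labeled observations with hard responsibilities, query latents Zq are
    unlabeled; mixing weights are fixed to 1/K and covariances are diagonal."""
    S_resp = np.eye(K)[ys]                          # (S, K) one-hot support labels
    mus = (S_resp.T @ Zs) / S_resp.sum(0)[:, None]  # class-wise support means
    vars_ = np.ones((K, Zs.shape[1]))               # sigma^2_k initialized to I

    def query_resp(mus, vars_):
        # E-step on queries: with uniform mixing, responsibilities = p(y|z)
        logits = -0.5 * (np.log(vars_)[None]
                         + (Zq[:, None, :] - mus[None]) ** 2 / vars_[None]).sum(-1)
        logits -= logits.max(1, keepdims=True)
        R = np.exp(logits)
        return R / R.sum(1, keepdims=True)

    for _ in range(n_iter):
        Rq = query_resp(mus, vars_)
        # M-step pools hard support statistics with soft query statistics
        Nk = S_resp.sum(0) + Rq.sum(0)
        mus = (S_resp.T @ Zs + Rq.T @ Zq) / Nk[:, None]
        vars_ = (np.einsum('sk,skd->kd', S_resp, (Zs[:, None, :] - mus[None]) ** 2)
                 + np.einsum('qk,qkd->kd', Rq, (Zq[:, None, :] - mus[None]) ** 2))
        vars_ = np.maximum(vars_ / Nk[:, None], eps)
    return mus, vars_, query_resp(mus, vars_)
```

Because the mixing weights are uniform, the final query responsibilities returned here coincide with the Bayes-rule posteriors $p_{\psi^*}(y_q \mid z_q)$ averaged in Eq 11, so the predicted label is simply their argmax.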
We compute $p_{\psi^*}(y_q \mid z_q^{(n)})$ with Bayes' rule, reusing the N Monte Carlo samples drawn for Eq 10, and the prediction for a query is $\hat{y}_q = \arg\max_k p(y_q = k \mid x_q, \mathcal{D})$. We present the pseudo-code for training and inference of Meta-GMVAE in Algorithms 1 and 2.

Visual feature reconstruction  While our method is a generative model that can generate samples from the output distribution, the ability to generate samples may not be necessary for discriminative downstream tasks (Chen et al., 2020). Moreover, we found that VAEs largely fail to learn on the Mini-ImageNet dataset under the architectural constraints conventional in the meta-learning literature. Thus, for the Mini-ImageNet dataset, we instead propose a high-level feature reconstruction objective. We experimentally find that the recently proposed contrastive learning framework SimCLR (Chen et al., 2020) is the most effective in our setting. Specifically, SimCLR learns high-level representations by performing a contrastive prediction task on pairs of augmented examples derived from a minibatch. We train SimCLR on the unlabeled dataset $\mathcal{D}^u = \{x_u\}_{u=1}^{U}$ and use the high-level features extracted by SimCLR as input to our framework.

4. EXPERIMENT

In this section, we validate the effectiveness of our Meta-GMVAE on several downstream few-shot classification tasks. The source code is available at https://github.com/db-Lee/Meta-GMVAE.

4.1. EXPERIMENTAL SETUPS

Baselines and ours  We now describe the two supervised meta-learning approaches that we consider as "oracles", the unsupervised meta-learning baselines, and the proposed Meta-GMVAE.
1) MAML (oracle): Model-Agnostic Meta-Learning by Finn et al. (2017). We compare against its performance as reported in Hsu et al. (2019).
2) ProtoNets (oracle): the Euclidean distance-based meta-learning approach by Snell et al. (2017). We also compare against its performance as reported in Hsu et al. (2019).
3) CACTUs: Clustering to Automatically Construct Tasks for Unsupervised meta-learning by Hsu et al. (2019). It automatically constructs tasks by clustering the unlabeled dataset in embedding spaces learned by ACAI (Berthelot et al., 2019), BiGAN (Donahue et al., 2017), and DeepCluster (Caron et al., 2018), and then trains either MAML or ProtoNets using the cluster indices as pseudo-labels.
4) UMTRA: Unsupervised Meta-learning with Tasks constructed by Random sampling and Augmentation by Khodadadeh et al. (2019). To construct a K-way 1-shot task, it randomly samples K datapoints from the unlabeled dataset and augments each datapoint; MAML is then trained on the constructed tasks.
5) Meta-GMVAE: our proposed meta-level Gaussian mixture VAE. It learns a latent representation by matching a set-level amortized variational posterior with a task-specific multi-modal prior optimized by the EM algorithm.

Datasets

We validate all the models on two benchmark datasets for few-shot classification. 1) Omniglot: a collection of 28 × 28 gray-scale images of hand-written characters covering 1623 different characters, each with 20 instances. Following the experimental setup of Hsu et al. (2019), we use 1200 classes for unsupervised meta-training, 100 classes for meta-validation, and the remaining 323 classes for meta-test. We further augment each class by rotating the images by 90, 180, and 270 degrees, such that the total number of classes is 1623 × 4, following the convention. 2) Mini-ImageNet: a subset of ILSVRC-2012 (Deng et al., 2009) introduced by Ravi & Larochelle (2017), consisting of 100 classes with 600 images of size 84 × 84 each. We use 64 classes for unsupervised meta-training, 16 classes for meta-validation, and the remaining 20 classes for meta-test, following the standard protocol.

Implementation details

We now introduce the specific implementation details of Meta-GMVAE on the two benchmark datasets. 1) Variational posterior network $q_\phi(z \mid x, \mathcal{D}_i)$: we use the standard Conv4 architecture on the Omniglot dataset for a fair comparison against the relevant baselines. On top of the Conv4 architecture, we stack two TransformerEncoder layers and an affine transformation layer to predict the mean and log-variance of the Gaussian distribution. For the Mini-ImageNet dataset, we only use two TransformerEncoder layers and an affine transformation layer, since the input for Mini-ImageNet is already a high-level visual representation extracted from the Conv5 architecture trained with SimCLR. For both datasets, we set the dimensionality of the latent variable to 64. 2) Generative network $p_\theta(x \mid z)$: for the Omniglot dataset, the architecture of the generative network is symmetric to the Conv4 architecture of the variational posterior network, and the last layer outputs the parameters of a Bernoulli output distribution. For the Mini-ImageNet dataset, we use a 3-layer MLP with ReLU activations to predict the mean of a Gaussian output distribution. 3) Other details: we use the Adam optimizer (Kingma & Ba, 2015) with a constant learning rate of 0.001 for the Omniglot experiments and 0.0001 for the Mini-ImageNet experiments. We set the number of EM iterations to 10 for all experiments. For more details, please see the Appendix.

4.2. EXPERIMENTAL RESULTS

Few-shot classification  Table 1 shows the few-shot classification results obtained by the supervised meta-learning baselines (oracles), the two unsupervised meta-learning baselines, and our Meta-GMVAE. On the Omniglot dataset, Meta-GMVAE outperforms all the baselines that only utilize unsupervised learning for constructing meta-training tasks, except for UMTRA on the 20-way 5-shot classification. Meta-GMVAE also outperforms the baselines in the Mini-ImageNet 1-shot and 5-shot settings, which are the most widely used, while it matches the performance of the baselines in the 20-shot and 50-shot settings. This shows that meta-learning the posterior network to capture the multi-modal distribution of any given task, as Meta-GMVAE does, is indeed more effective than the unsupervised meta-learning baselines, which simply train supervised meta-learning models with pseudo-labels obtained from unlabeled data. Moreover, our Meta-GMVAE obtains better performance than supervised MAML on Omniglot 5-way 1-shot classification, while utilizing as little as 0.1% of the labeled data. This matches the observation in Chen et al. (2020) that well-calibrated unsupervised learning approaches with a modest amount of labels can obtain performance comparable to or even better than supervised approaches.

Visualization  To better understand how our Meta-GMVAE learns and realizes class concepts in few-shot classification tasks, we visualize the actual samples in an episode classified by Meta-GMVAE, and samples generated by the generative network $p_\theta(x \mid z)$, during unsupervised meta-training and supervised meta-test. Actual and generated samples that share a modality are shown in the same row. In Figure 3-a, b, we observe that our Meta-GMVAE captures similar visual structure within each modality during meta-training, although the modalities are not class concepts. However, as shown in Figure 3-c, d, our Meta-GMVAE easily realizes each modality as a class concept at meta-test time.
Ablation study  Furthermore, we compare the performance of our model variants, each obtained by eliminating one of the most important components of our model. The variants include LR (SimCLR), which performs logistic regression using the support set on top of features pretrained by SimCLR; a variant that replaces the set-level variational posterior (i.e. $q_\phi(z \mid x, \mathcal{D}_i)$) with an instance-wise posterior (i.e. $q_\phi(z \mid x)$); and $\sigma^2$, in which the semi-supervised EM algorithm uses a diagonal covariance matrix rather than fixing it to the identity matrix I. Table 2-Left shows that all the components we consider are critical for performance on the few-shot classification tasks, as expected. The largest performance gain comes from Ep, which supports our proposal of meta-learning the set-level variational posterior by matching it with the multi-modal prior, where the task-specific parameter is obtained with EM.

Cross-way classification  We then evaluate our Meta-GMVAE by varying the number of ways (between 2 and 50) while fixing the number of shots to 1. In particular, we set the number of components K to the test way at meta-test time and perform the semi-supervised EM algorithm in Eq 10. Table 2-Right shows that a mismatch between the number of ways used for training and test does not significantly affect performance, which demonstrates the robustness of Meta-GMVAE to a varying number of ways. We also visualize the latent space for the cross-way experiment using t-SNE (Rauber et al., 2016) in Figure 4, which shows that Meta-GMVAE trained with 20-way tasks can cluster a 5-way meta-test task.

5. CONCLUSION

We proposed a novel unsupervised meta-learning model, namely Meta-GMVAE, which can generate a task-dependent posterior for a given unseen task with a multi-modal Gaussian mixture prior. Given a random episode consisting of samples from diverse classes, we optimize the task-specific parameters of the Gaussian mixture prior with the Expectation-Maximization algorithm, such that each mode can capture an intrinsic grouping in the given data. We meta-train the variational posterior network against such data-driven priors obtained over a large number of episodes. Then, at the meta-test step, we realize each modality as a label by deploying a semi-supervised EM algorithm on both the support and the query set. We validate our method on two few-shot image classification benchmark datasets and show that Meta-GMVAE largely outperforms the relevant unsupervised meta-learning baselines, even achieving better performance than supervised MAML on the Omniglot 5-way 1-shot experiments.



Figure 1: During meta-training, Meta-GMVAE learns a multi-modal latent space that best explains the unlabeled data using the EM algorithm. At meta-test time, we use semi-supervised EM to map both the support set (labeled data) and the queries (unlabeled data) to the modes learned during meta-training.

Figure 2: The graphical illustration of Meta-GMVAE. The dotted lines denote either variational inference or Expectation-Maximization. (a): We introduce the multi-modal distribution p_ψ(z) as the prior, and its optimal task-specific parameter ψ*_i is obtained by EM in an episodic manner. (b): At meta-test time, we obtain the task-specific parameter ψ*_i by semi-supervised EM using x_s, y_s, and x_q.

Figure 3: Samples obtained and generated for each mode at the unsupervised meta-training and supervised meta-test steps of Meta-GMVAE. Samples in each row belong to the same modality obtained by EM.

Table 2: Left: The results of the ablation study on Meta-GMVAE (O: 20-way 1-shot classification on Omniglot, M: 5-way 1-shot classification on Mini-ImageNet). Right: The results of the cross-way 1-shot experiments on Omniglot. The values in parentheses indicate the (way, shot) setting on which a model was trained.

Figure 4: The visualization of the latent space for the cross-way generalization experiment.





Acknowledgements This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)), Samsung Research Funding Center of Samsung Electronics (No. SRFC-IT1502-51), Samsung Electronics (IO201214-08145-01), and the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921). We sincerely thank the anonymous reviewers for their constructive comments which helped us significantly improve our paper during the rebuttal period. We also appreciate D. Khuê Lê-Huu for the valuable discussion on Rao et al. (2019) .

A OMNIGLOT EXPERIMENTS

A.1 TRAINING PROCEDURE

Omniglot is a collection of 28 × 28 gray-scale hand-written character images covering 1623 different characters, each of which has 20 instances. Following the experimental setup of Hsu et al. (2019), we use 1200 classes for unsupervised meta-training, 100 classes for meta-validation, and the remaining 323 classes for meta-test. We further augment each class by rotating the images by 90, 180, and 270 degrees, such that the total number of classes is 1623 × 4, following the convention. We evaluate the trained model using 1000 randomly selected tasks from the test set. During evaluation, K × S data instances are used as support inputs and K × 15 data instances are used as query inputs. We use the Adam (Kingma & Ba, 2015) optimizer with a constant learning rate of 0.001 to train all models. All models are trained for 60,000 iterations. For the 5-way experiments (i.e., K = 5), we set the mini-batch size, the number of datapoints, and the Monte Carlo sample size as 4, 200, and 32, respectively (i.e., B = 4, M = 200, and N = 32). For the 20-way experiments (i.e., K = 20), we set them as 4, 300, and 32 (i.e., B = 4, M = 300, and N = 32). We set the number of EM iterations as 10.
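For concreteness, the K-way evaluation episodes described above (K × S support and K × 15 query instances) can be sampled as follows. This is an illustrative helper with hypothetical names, assuming the dataset is stored as a dict mapping each class id to its list of images:

```python
import random

def sample_episode(data_by_class, K=5, S=1, Q=15, rng=random):
    """Sample one K-way episode: K*S support and K*Q query instances.

    data_by_class: dict mapping class id -> list of images (e.g. arrays).
    Returns (support, support_labels, query, query_labels) with
    episode-local labels 0..K-1.
    """
    classes = rng.sample(sorted(data_by_class), K)
    support, sup_y, query, qry_y = [], [], [], []
    for label, c in enumerate(classes):
        imgs = rng.sample(data_by_class[c], S + Q)  # draw without replacement
        support += imgs[:S]
        sup_y += [label] * S
        query += imgs[S:]
        qry_y += [label] * Q
    return support, sup_y, query, qry_y
```

A mini-batch of B such episodes would then be fed to the model at each training iteration.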

A.2 NETWORK ARCHITECTURE

We summarize the network architecture in the following Tables 3 and 4. We assume that the output follows a Bernoulli distribution; therefore, the output of the generative network p_θ(x|z) is the mean parameter.

B MINI-IMAGENET EXPERIMENTS

B.1 TRAINING PROCEDURE

We train the models for 5K, 10K, 15K, 25K, and 30K iterations for the 1-, 5-, 20-, and 50-shot experiments, respectively. We set the number of EM iterations as 10.

B.2 NETWORK ARCHITECTURE

We summarize the network architecture in the following Tables 6, 7, and 8. We assume that the output follows a Gaussian distribution; therefore, the output of the generative network p_θ(x|z) is the mean parameter. Moreover, the variance of the output Gaussian distribution is obtained as suggested in Rybkin et al. (2020).
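The calibrated-decoder approach of Rybkin et al. (2020) admits a simple closed form: the shared output variance that maximizes the Gaussian likelihood of a batch is the reconstruction MSE itself. A minimal sketch of this idea (the function name is ours; the original σ-VAE applies it inside full VAE training):

```python
import numpy as np

def optimal_sigma_nll(x, x_hat):
    """Gaussian decoder NLL with the analytically optimal shared variance.

    The per-batch variance maximizing the Gaussian likelihood is the MSE:
        sigma*^2 = mean((x - x_hat)^2)
    Plugging it back in gives the calibrated reconstruction loss below,
    reported per sample and up to the constant 0.5 * d * log(2*pi) term.
    """
    mse = np.mean((x - x_hat) ** 2)
    sigma2 = max(mse, 1e-8)        # clamp for numerical stability
    d = x.size / x.shape[0]        # data dimensionality per sample
    return 0.5 * d * (np.log(sigma2) + 1.0)
```

This removes the need to hand-tune a fixed decoder variance, which otherwise acts as an implicit weight between the reconstruction and KL terms.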

Feature Extractor for SimCLR (table columns: Output Size, Layers)

B.4 ADDITIONAL COMPARISON USING SIMCLR

To further understand where the improvement of Meta-GMVAE on the Mini-ImageNet dataset comes from, we ran experiments on the baselines with SimCLR-pretrained features. For CACTUs, we cluster in the embedding space pretrained by SimCLR. For UMTRA, we follow the exact procedure for generating training episodes proposed by the authors. Moreover, we fix the pretrained SimCLR features for both baselines, matching the setting of Meta-GMVAE. Table 10 shows that Meta-GMVAE outperforms the baselines even when they use SimCLR features, which supports the effectiveness of Meta-GMVAE combined with SimCLR-pretrained features.

Table 10 settings: Mini-ImageNet (5,1), (5,5), (20,1), (20,5).
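The frozen-feature comparison above amounts to fitting a linear classifier on support features extracted by a fixed encoder. A minimal numpy version of such a logistic-regression head (our own sketch, not the exact baseline implementation):

```python
import numpy as np

def fit_logistic_regression(feats, labels, n_classes, lr=0.1, steps=1000):
    """Multinomial logistic regression on frozen (e.g. SimCLR) features.

    Fits a linear softmax classifier by full-batch gradient descent on the
    support features, as in an 'LR on pretrained features' baseline.
    """
    W = np.zeros((feats.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - Y) / len(feats)                  # softmax CE gradient
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(W, b, feats):
    return (feats @ W + b).argmax(axis=1)
```

The query features are then classified with `predict`, keeping the encoder itself untouched throughout.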

