UNSUPERVISED META-LEARNING VIA FEW-SHOT PSEUDO-SUPERVISED CONTRASTIVE LEARNING

Abstract

Unsupervised meta-learning aims to learn generalizable knowledge across a distribution of tasks constructed from unlabeled data. Here, the main challenge is how to construct diverse tasks for meta-learning without label information; recent works have proposed, e.g., pseudo-labeling via pretrained representations or creating synthetic samples via generative models. However, such task construction strategies are fundamentally limited due to their heavy reliance on immutable pseudo-labels during meta-learning and on the quality of the representations or the generated samples. To overcome these limitations, we propose a simple yet effective unsupervised meta-learning framework, coined Pseudo-supervised Contrast (PsCo), for few-shot classification. Inspired by the recent self-supervised learning literature, PsCo utilizes a momentum network and a queue of previous batches to improve pseudo-labeling and construct diverse tasks in a progressive manner. Our extensive experiments demonstrate that PsCo outperforms existing unsupervised meta-learning methods under various in-domain and cross-domain few-shot classification benchmarks. We also validate that PsCo is easily scalable to a large-scale benchmark, while recent prior-art meta-schemes are not.

1. INTRODUCTION

Learning to learn (Thrun & Pratt, 1998), also known as meta-learning, aims to learn general knowledge about how to solve unseen, yet relevant tasks from prior experiences solving diverse tasks. In recent years, the concept of meta-learning has found various applications, e.g., few-shot classification (Snell et al., 2017; Finn et al., 2017), reinforcement learning (Duan et al., 2017; Houthooft et al., 2018; Alet et al., 2020), hyperparameter optimization (Franceschi et al., 2018), and so on. Among them, few-shot classification is arguably the most popular one; its goal is to learn knowledge that allows classifying test samples of classes unseen during (meta-)training from only a few labeled samples. The common approach is to construct a distribution of few-shot classification (i.e., N-way K-shot) tasks and optimize a model to generalize across tasks (sampled from the distribution) so that it can rapidly adapt to new tasks. This approach has shown remarkable performance in various few-shot classification tasks but suffers from limited scalability as the task construction phase typically requires a large number of human-annotated labels.

To mitigate the issue, there have been several recent attempts to apply meta-learning to unlabeled data, i.e., unsupervised meta-learning (UML) (Hsu et al., 2019; Khodadadeh et al., 2019; 2021; Lee et al., 2021; Kong et al., 2021). To perform meta-learning without labels, the authors have suggested various ways to construct synthetic tasks. For example, pioneering works (Hsu et al., 2019; Khodadadeh et al., 2019) assigned pseudo-labels via data augmentations or clustering based on pretrained representations. In contrast, recent approaches (Khodadadeh et al., 2021; Lee et al., 2021; Kong et al., 2021) utilized generative models to generate synthetic (in-class) samples or learn unknown labels via categorical latent variables.
They have achieved moderate performance in few-shot learning benchmarks, but are fundamentally limited as: (a) the pseudo-labeling strategies are fixed during meta-learning, so mislabeled samples cannot be corrected; (b) the generative approaches heavily rely on the quality of generated samples and are cumbersome to scale to large-scale setups.

To overcome the limitations of the existing UML approaches, in this paper, we ask whether one can (a) progressively improve a pseudo-labeling strategy during meta-learning, and (b) construct more diverse tasks without generative models. We draw inspiration from recent advances in the self-supervised learning literature (He et al., 2020; Khosla et al., 2020), which has shown remarkable success in representation learning without labeled data. In particular, we utilize (a) a momentum network to improve pseudo-labeling progressively via temporal ensemble; and (b) a momentum queue to construct diverse tasks using previous mini-batches in an online manner.

Formally, we propose Pseudo-supervised Contrast (PsCo), a novel and effective unsupervised meta-learning framework, for few-shot classification. Our key idea is to construct few-shot classification tasks using the current and previous mini-batches based on the momentum network and the momentum queue.
Specifically, given a random mini-batch of N unlabeled samples, we treat them as N queries (i.e., test samples) of N different labels, and then select K shots (i.e., training samples) for each label from the queue of previous mini-batches based on representations extracted by the momentum network. To further improve the selection procedure, we utilize top-K sampling after applying a matching algorithm, Sinkhorn-Knopp (Cuturi, 2013). Finally, we optimize our model via supervised contrastive learning (Khosla et al., 2020) for solving the N-way K-shot task. Remark that our few-shot task construction relies on not only the current mini-batch but also the momentum network and the queue of previous mini-batches. Therefore, our task construction (i.e., pseudo-labeling) strategy (a) is progressively improved during meta-learning with the momentum network, and (b) constructs diverse tasks since the shots can be selected from the entire dataset. Our framework is illustrated in Figure 1.

Through extensive experiments, we demonstrate the effectiveness of the proposed framework, PsCo, under various few-shot classification benchmarks. First, PsCo achieves state-of-the-art performance under both Omniglot (Lake et al., 2011) and miniImageNet (Ravi & Larochelle, 2017) few-shot benchmarks; its performance is even competitive with supervised meta-learning methods. Next, PsCo also shows superiority under cross-domain few-shot learning scenarios. Finally, we demonstrate that PsCo is scalable to a large-scale benchmark, ImageNet (Deng et al., 2009).

We summarize our contributions as follows:
• We propose PsCo, an effective unsupervised meta-learning (UML) framework for few-shot classification, which constructs diverse few-shot pseudo-tasks without labels utilizing the momentum network and the queue of previous batches in a progressive manner.
• We achieve state-of-the-art performance on few-shot classification benchmarks, Omniglot (Lake et al., 2011) and miniImageNet (Ravi & Larochelle, 2017). For example, PsCo outperforms the prior art of UML, Meta-SVEBM (Kong et al., 2021), by a 5% accuracy gain (58.03→63.26) for 5-way 5-shot tasks of miniImageNet (see Table 1).
• We show that PsCo achieves comparable performance with supervised meta-learning methods in various few-shot classification benchmarks. For example, PsCo achieves 44.01% accuracy for 5-way 5-shot tasks of an unseen domain, Cars (Krause et al., 2013), while supervised MAML (Finn et al., 2017) achieves 41.17% (see Table 2).
• We validate that PsCo is also applicable to a large-scale dataset: e.g., we improve PsCo by a 5.78% accuracy gain (47.67→53.45) for 5-way 5-shot tasks of Cars using large-scale unlabeled data, ImageNet (Deng et al., 2009) (see Table 3).

2. PRELIMINARIES

2.1. PROBLEM STATEMENT: UNSUPERVISED FEW-SHOT LEARNING

The problem of interest in this paper is unsupervised few-shot learning, one of the popular unsupervised meta-learning applications. This aims to learn generalizable knowledge without human annotations for quickly adapting to unseen but relevant few-shot tasks. Following the meta-learning literature, we refer to the learning phase as meta-training and the adaptation phase as meta-test. Formally, we are only able to utilize an unlabeled dataset $D^{\mathrm{meta}}_{\mathrm{train}} := \{x_i\}$ during meta-training our model. At the meta-test phase, we transfer the model to new few-shot tasks $\{T_i\} \sim D^{\mathrm{meta}}_{\mathrm{test}}$ where each task $T_i$ aims to classify query samples $\{x_q\}$ among N labels using support (i.e., training) samples $S = \{(x_s, y_s)\}_{s=1}^{NK}$. We here assume the task $T_i$ consists of K support samples for each label $y \in \{1, \ldots, N\}$, which is referred to as N-way K-shot classification. Note that $D^{\mathrm{meta}}_{\mathrm{train}}$ and $D^{\mathrm{meta}}_{\mathrm{test}}$ can come from the same domain (i.e., the standard in-domain setting) or different domains (i.e., cross-domain) as suggested by Chen et al. (2019).
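To make the N-way K-shot protocol concrete, the following is a minimal NumPy sketch of how a single meta-test task could be drawn from a labeled pool. The function name `sample_episode`, the `n_query` parameter, and the toy label array are illustrative assumptions rather than part of our implementation; labels are indexed from 0 instead of 1 for convenience.

```python
import numpy as np

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot task from an array of integer class labels.

    Returns dataset indices for support and query samples together with
    episode-local labels in {0, ..., n_way - 1}. (Hypothetical helper.)
    """
    labels = np.asarray(labels)
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support_idx, support_y, query_idx, query_y = [], [], [], []
    for new_label, cls in enumerate(classes):
        # Shuffle this class's samples, then split them disjointly into
        # K support samples and n_query query samples.
        idx = rng.permutation(np.flatnonzero(labels == cls))[: k_shot + n_query]
        support_idx.extend(idx[:k_shot])
        support_y.extend([new_label] * k_shot)
        query_idx.extend(idx[k_shot:])
        query_y.extend([new_label] * n_query)
    return (np.array(support_idx), np.array(support_y),
            np.array(query_idx), np.array(query_y))

rng = np.random.default_rng(0)
toy_labels = np.repeat(np.arange(10), 20)  # toy pool: 10 classes, 20 samples each
s_idx, s_y, q_idx, q_y = sample_episode(toy_labels, n_way=5, k_shot=1,
                                        n_query=15, rng=rng)
```

Each call yields one task $T_i$; evaluation then averages accuracy over many such tasks.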

2.2. CONTRASTIVE LEARNING

Contrastive learning (Oord et al., 2018; Chen et al., 2020a; He et al., 2020; Khosla et al., 2020) aims to learn meaningful representations by maximizing the similarity between similar (i.e., positive) samples, and minimizing the similarity between dissimilar (i.e., negative) samples on the representation space. We first describe a general form of contrastive learning objectives based on the temperature-normalized cross entropy (Chen et al., 2020a; He et al., 2020) and its variant for multiple positives (Khosla et al., 2020) as follows:

$$\mathcal{L}_{\mathrm{Contrast}}(\{q_i\}_{i=1}^{N}, \{k_j\}_{j=1}^{M}, A; \tau) := -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sum_{j} A_{i,j}} \sum_{j=1}^{M} A_{i,j} \log \frac{\exp(q_i^\top k_j / \tau)}{\sum_{j'=1}^{M} \exp(q_i^\top k_{j'} / \tau)}, \tag{1}$$

where $\{q_i\}$ and $\{k_j\}$ are $\ell_2$-normalized query and key representations, respectively, $A \in \{0, 1\}^{N \times M}$ represents whether $q_i$ and $k_j$ are positive ($A_{i,j} = 1$) or negative ($A_{i,j} = 0$), and τ is a hyperparameter for temperature scaling.

Based on the recent observations in the self-supervised learning literature, we also describe a general scheme to construct the query and key representations using data augmentations and a momentum network. Formally, given a random mini-batch $\{x_i\}$, the representations can be obtained as follows:

$$q_i = \mathrm{Normalize}(h_\theta \circ g_\theta \circ f_\theta(t_{i,1}(x_i))), \qquad k_i = \mathrm{Normalize}(g_\phi \circ f_\phi(t_{i,2}(x_i))), \tag{2}$$

where Normalize(·) is $\ell_2$ normalization, $t_{i,1} \sim \mathcal{A}_1$ and $t_{i,2} \sim \mathcal{A}_2$ are random data augmentations, f is a backbone feature extractor like ResNet (He et al., 2016), g and h are projection and prediction MLPs,¹ respectively, and ϕ is an exponential moving average (i.e., momentum) of the model parameter θ.² Since a large number of negative samples plays a crucial role in contrastive learning, one can re-use the key representations of previous mini-batches by maintaining a queue (He et al., 2020). Note that the above forms (1) and (2) can be instantiated as various contrastive learning frameworks.
For example, SimCLR (Chen et al., 2020a) is a special case with no momentum ϕ and no predictor h. In addition, self-supervised contrastive learning methods (Chen et al., 2020a; He et al., 2020) often assume that $k_i$ is the only positive key of $q_i$, i.e., $A_{i,j} = 1$ if and only if $i = j$, while supervised contrastive learning (Khosla et al., 2020) directly uses labels for A.

Algorithm 1 Pseudo-supervised Contrast (PsCo): PyTorch-like Pseudocode

    # f, g, h: backbone, projector, and predictor
    # {f,g}_ema: momentum backbone and projector
    # queue: momentum queue (Mxd)
    # mm: matrix multiplication, mul: element-wise multiplication
    def PsCo(x):  # x: a mini-batch of N samples
        x1, x2 = aug1(x), aug2(x)           # two augmented views of x
        q = h(g(f(x1)))                     # (Nxd) N query representations
        z = g_ema(f_ema(x2))                # (Nxd) N query momentum representations
        sim = mm(z, queue.T)                # (NxM) similarity matrix
        A_tilde = sinkhorn(sim)             # (NxM) soft pseudo-label assignment matrix
        s, A = select_topK(queue, A_tilde)  # (NKxd) s: support momentum representations
                                            # (NxNK) A: pseudo-label assignment matrix
        logits = mm(q, s.T) / temperature
        loss = logits.logsumexp(dim=1) - mul(logits, A).sum(dim=1) / K
        return loss.mean()
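The general objective (1) can be sketched in a few lines of NumPy. This is an illustrative sketch only, not our training code: the helper names `l2_normalize` and `contrastive_loss` are assumptions, and the sketch assumes every row of A contains at least one positive.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit l2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def contrastive_loss(q, k, A, tau):
    """Multi-positive contrastive loss of form (1).

    q: (N, d) queries, k: (M, d) keys, A: binary (N, M) positive mask.
    Assumes every row of A has at least one positive entry.
    """
    logits = q @ k.T / tau                                   # (N, M)
    logits = logits - logits.max(axis=1, keepdims=True)      # stability only
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_query = -(A * log_prob).sum(axis=1) / A.sum(axis=1)  # mean over positives
    return per_query.mean()

# Self-supervised special case: the i-th key is the only positive of the
# i-th query, i.e., A is the identity matrix.
rng = np.random.default_rng(0)
q_demo = l2_normalize(rng.normal(size=(4, 16)))
k_demo = l2_normalize(rng.normal(size=(4, 16)))
loss_demo = contrastive_loss(q_demo, k_demo, np.eye(4), tau=0.2)
```

Passing a label-derived A with several ones per row gives the supervised variant, and a pseudo-label matrix gives the pseudo-supervised case considered in this paper.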

3. METHOD: PSEUDO-SUPERVISED CONTRASTIVE META-LEARNING

In this section, we introduce Pseudo-supervised Contrast (PsCo), a novel and effective framework for unsupervised few-shot learning. Our key idea is to construct few-shot classification pseudotasks using the current and previous mini-batches with the momentum network and the momentum queue. We then employ supervised contrastive learning (Khosla et al., 2020) for learning the pseudotasks. The detailed implementations of our task construction, meta-training objective, and meta-test scheme for unsupervised few-shot learning are described in Section 3.1, 3.2, and 3.3, respectively. Our framework is illustrated in Figure 1 and its pseudo-code is provided in Algorithm 1. Note that we use the same notations described in Section 2 for consistency.

3.1. ONLINE PSEUDO-TASK CONSTRUCTION

We here describe how to construct a few-shot pseudo-task using unlabeled data $D^{\mathrm{meta}}_{\mathrm{train}} = \{x_i\}$. To this end, we maintain a queue of previous mini-batches. Then, we treat the previous and current mini-batch samples as training (i.e., shots) and test (i.e., queries) samples, respectively, for our few-shot pseudo-task. Formally, let $B := \{x_i\}_{i=1}^{N}$ be the current mini-batch randomly sampled from $D^{\mathrm{meta}}_{\mathrm{train}}$, and $Q := \{x_j\}_{j=1}^{M}$ be the queue of previous mini-batch samples. Now, we treat $B = \{x_i\}_{i=1}^{N}$ as queries of N different pseudo-labels and find K (appropriate) shots for each pseudo-label from the queue Q. Remark that utilizing the previous mini-batches in this way allows us to construct more diverse tasks.

To find the shots efficiently, we utilize the momentum network and the momentum queue described in Section 2.2. For the current mini-batch samples, we compute the momentum query representations with data augmentations $t_{i,2} \sim \mathcal{A}_2$, i.e., $z_i := \mathrm{Normalize}(g_\phi \circ f_\phi(t_{i,2}(x_i)))$. Following He et al. (2020), we store only the momentum representations of the previous mini-batch samples instead of raw data in the queue $Q_z$, i.e., $Q_z := \{z_j\}_{j=1}^{M}$. Remark that the use of the momentum network is not only for efficiency but also for improving our task construction strategy because the momentum network is consistent and progressively improved during training. Following He et al. (2020), we randomly initialize the queue $Q_z$ at the beginning of training.

Now, the remaining question is as follows: how can we find K appropriate shots from the queue Q for each pseudo-label using the momentum representations? Before introducing our algorithm, we first discuss two requirements for constructing semantically meaningful few-shot tasks: (i) shots and queries of the same label should be semantically similar, and (ii) all shots should be different.
Based on these requirements, we formulate our assignment problem as follows:

$$\max_{\tilde{A} \in \{0,1\}^{N \times M}} \sum_{i=1}^{N} \sum_{j=1}^{M} \tilde{A}_{ij} \cdot z_i^\top z_j \quad \text{such that} \quad \sum_{j} \tilde{A}_{ij} = K, \quad \sum_{i} \tilde{A}_{ij} \le 1. \tag{3}$$

Obtaining the exact optimal solution to the above assignment problem for each training iteration might be too expensive for our purpose (Ramshaw & Tarjan, 2012). Instead, we use an approximate algorithm: we first apply a fast version (Cuturi, 2013) of the Sinkhorn-Knopp algorithm to solve the following problem:

$$\max_{\tilde{A} \in [0,1]^{N \times M}} \sum_{i=1}^{N} \sum_{j=1}^{M} \tilde{A}_{ij} \cdot z_i^\top z_j + \epsilon H(\tilde{A}) \quad \text{such that} \quad \sum_{j} \tilde{A}_{ij} = 1/N, \quad \sum_{i} \tilde{A}_{ij} = 1/M, \tag{4}$$

which is an entropy-regularized optimal transport problem (Cuturi, 2013). Its optimal solution $\tilde{A}^*$ can be obtained efficiently and can be considered as a soft assignment matrix between the current mini-batch $\{z_i\}_{i=1}^{N}$ and the queue $Q_z = \{z_j\}_{j=1}^{M}$. Hence, we select the top-K elements for each row of the assignment matrix $\tilde{A}^*$ and finally construct an N-way K-shot pseudo-task consisting of (a) query samples $B = \{x_i\}_{i=1}^{N}$, (b) the support representations $S_z := \{z_s\}_{s=1}^{NK}$, and (c) the pseudo-label assignment matrix $A \in \{0, 1\}^{N \times NK}$. Note that Figure 1 shows an example of a 5-way 2-shot task. We empirically observe that our task construction strategy satisfies the above requirements (i) and (ii) (see Section 4.3).
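The two-stage construction above (Sinkhorn-Knopp followed by top-K selection) can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: the regularization strength `eps=0.05` and the iteration count `n_iters=50` are placeholder values, and the function names merely mirror `sinkhorn` and `select_topK` in Algorithm 1.

```python
import numpy as np

def sinkhorn(sim, eps=0.05, n_iters=50):
    """Approximately solve the entropy-regularized problem (4).

    sim: (N, M) similarities. Alternately rescales columns and rows so the
    plan has column sums 1/M and row sums 1/N (row sums are exact on exit).
    """
    n, m = sim.shape
    P = np.exp((sim - sim.max()) / eps)  # the shift only rescales P globally
    for _ in range(n_iters):
        P *= (1.0 / m) / P.sum(axis=0, keepdims=True)  # column marginals 1/M
        P *= (1.0 / n) / P.sum(axis=1, keepdims=True)  # row marginals 1/N
    return P

def select_top_k(queue, A_tilde, k):
    """Take the K highest-scoring queue entries in each row as shots."""
    n = A_tilde.shape[0]
    top = np.argsort(-A_tilde, axis=1)[:, :k]  # (N, K) indices into the queue
    support = queue[top.reshape(-1)]           # (N*K, d), grouped by query
    A = np.zeros((n, n * k))
    for i in range(n):
        A[i, i * k:(i + 1) * k] = 1.0          # shots i*K .. (i+1)*K-1 <-> query i
    return support, A

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
z /= np.linalg.norm(z, axis=1, keepdims=True)          # N = 4 momentum queries
queue = rng.normal(size=(32, 8))
queue /= np.linalg.norm(queue, axis=1, keepdims=True)  # M = 32 queue entries
P = sinkhorn(z @ queue.T)
support, A = select_top_k(queue, P, k=3)               # a 4-way 3-shot pseudo-task
```

Row-wise top-K selection does not by itself enforce requirement (ii); the balanced column marginals of the Sinkhorn-Knopp plan are what discourage the same queue entry from being picked for several pseudo-labels.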

3.2. META-TRAINING: SUPERVISED CONTRASTIVE LEARNING WITH PSEUDO TASKS

We now describe our meta-learning objective $\mathcal{L}_{\mathrm{PsCo}}$ for learning our few-shot pseudo-tasks. We here use our model θ to obtain query representations: $q_i := \mathrm{Normalize}(h_\theta \circ g_\theta \circ f_\theta(t_{i,1}(x_i)))$ where $t_{i,1} \sim \mathcal{A}_1$ is a random data augmentation for each i. Then, our objective $\mathcal{L}_{\mathrm{PsCo}}$ is defined as follows:

$$\mathcal{L}_{\mathrm{PsCo}} := \mathcal{L}_{\mathrm{Contrast}}(\{q_i\}_{i=1}^{N}, S_z, A; \tau_{\mathrm{PsCo}}), \tag{5}$$

where $S_z := \{z_s\}_{s=1}^{NK}$ is the set of support representations and $A \in \{0, 1\}^{N \times NK}$ is the pseudo-label assignment matrix, both of which are constructed by our task construction strategy described in Section 3.1.

Since our framework PsCo uses the same architectural components as a self-supervised learning framework, MoCo (He et al., 2020), the MoCo objective $\mathcal{L}_{\mathrm{MoCo}}$ can be incorporated into our PsCo without additional computation costs. Note that the MoCo objective can be written as follows:

$$\mathcal{L}_{\mathrm{MoCo}} := \mathcal{L}_{\mathrm{Contrast}}(\{q_i\}_{i=1}^{N}, \{z_i\}_{i=1}^{N} \cup Q_z, A_{\mathrm{MoCo}}; \tau_{\mathrm{MoCo}}), \tag{6}$$

where $(A_{\mathrm{MoCo}})_{i,j} = 1$ if and only if $i = j$, and $z_i := \mathrm{Normalize}(g_\phi \circ f_\phi(t_{i,2}(x_i)))$ as described in Section 3.1. We optimize our model θ via all the objectives, i.e., $\mathcal{L}_{\mathrm{total}} := \mathcal{L}_{\mathrm{PsCo}} + \mathcal{L}_{\mathrm{MoCo}}$. Remark again that ϕ is updated by exponential moving average (EMA), i.e., $\phi \leftarrow m\phi + (1 - m)\theta$.

Weak augmentation for momentum representations. To successfully find the pseudo-label assignment matrix A, we apply weak augmentations for the momentum representations (i.e., $\mathcal{A}_2$ is weaker than $\mathcal{A}_1$) as Zheng et al. (2021) did. This reduces the noise in the representations and consequently enhances the performance of our PsCo as A becomes more accurate (see Section 4.3).
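The bookkeeping that accompanies each training step, namely the EMA update of ϕ and the refresh of the momentum queue, can be sketched as below. This is a hedged illustration: the parameter-dictionary representation, the momentum value `m=0.99`, and the first-in-first-out replacement policy (in the style of He et al. (2020)) are assumptions made for the example, and `ema_update` and `enqueue` are hypothetical helper names.

```python
import numpy as np

def ema_update(phi, theta, m=0.99):
    """Return phi <- m * phi + (1 - m) * theta, applied per parameter."""
    return {name: m * phi[name] + (1.0 - m) * theta[name] for name in phi}

def enqueue(queue, z):
    """Drop the oldest len(z) rows of the (M, d) queue and append new keys."""
    return np.concatenate([queue[len(z):], z], axis=0)

# Toy usage with a single two-dimensional parameter and a queue of M = 6 keys.
theta = {"w": np.ones(2)}
phi = {"w": np.zeros(2)}
phi = ema_update(phi, theta, m=0.9)     # phi["w"] moves 10% of the way to theta

queue_demo = np.arange(12, dtype=float).reshape(6, 2)
new_keys = -np.ones((2, 2))             # momentum representations of a new batch
queue_next = enqueue(queue_demo, new_keys)
```

Because only ϕ produces the keys that enter the queue, a large m keeps the stored representations consistent across iterations.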

3.3. META-TEST

At the meta-test stage, we have an N-way K-shot task T consisting of query samples $\{x_q\}$ and support samples $S = \{(x_s, y_s)\}_{s=1}^{NK}$.³ We here discard the momentum network ϕ and use only the online network θ. To predict labels, we first compute the query representation $q_q := \mathrm{Normalize}(h_\theta \circ g_\theta \circ f_\theta(x_q))$ and the support representations $z_s := \mathrm{Normalize}(g_\theta \circ f_\theta(x_s))$. Then we predict a label by the following classification rule: $\hat{y} := \arg\max_y q_q^\top c_y$ where $c_y := \mathrm{Normalize}(\sum_s \mathbb{1}_{y_s = y} \cdot z_s)$ is the prototype vector. This is inspired by our $\mathcal{L}_{\mathrm{PsCo}}$, which can be interpreted as minimizing the distance from the mean (i.e., prototype) of the shot representations.⁴

Further adaptation for cross-domain few-shot classification. Under cross-domain few-shot classification scenarios, the model θ should further adapt to the meta-test domain due to the dissimilarity from meta-training. We here suggest an efficient adaptation scheme using only a few labeled samples. Our idea is to consider the support samples as queries. To be specific, we compute the query representation $q_s := \mathrm{Normalize}(h_\theta \circ g_\theta \circ f_\theta(x_s))$ for each support sample $x_s$, and construct the label assignment matrix $A'$ as $A'_{s,s'} = 1$ if and only if $y_s = y_{s'}$. Then we simply optimize only $g_\theta$ and $h_\theta$ via contrastive learning, i.e., $\mathcal{L}_{\mathrm{Contrast}}(\{q_s\}, \{z_s\}, A'; \tau_{\mathrm{PsCo}})$, for a few iterations. We empirically observe that this adaptation scheme is effective under cross-domain settings (see Section 4.3).

³ Note that N and K for meta-training and meta-test could be different. We use a large N (e.g., N = 256) during meta-training to fully utilize computational resources like standard deep learning, and a small N (e.g., N = 5) during meta-test following the meta-learning literature.

⁴ $\mathcal{L}_{\mathrm{PsCo}} = -\frac{1}{N} \sum_{i} \frac{1}{\tau_{\mathrm{PsCo}}} q_i^\top \left( \frac{1}{K} \sum_{j} A_{i,j} z_j \right) + \text{term not depending on } A$.

4. EXPERIMENTS

[...] (Ravi & Larochelle, 2017), respectively, for the backbone feature extractor $f_\theta$. For the number of shots during meta-learning, we use K = 1 for Omniglot and K = 4 for miniImageNet (see Table 6 for the sensitivity of K). Other details are fully described in Appendix A. We omit the confidence intervals in this section for clarity, and the full results with them are provided in Appendix F.
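For reference when reading the results that follow, the prototype-based classification rule of Section 3.3 can be sketched in NumPy as below. The sketch operates on pre-computed feature vectors and uses 0-indexed labels; `prototype_predict` and the toy inputs are illustrative assumptions, not our evaluation code.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit l2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def prototype_predict(q, z_support, y_support, n_way):
    """Assign each query to the class whose normalized prototype is closest.

    q: (Q, d) query features, z_support: (N*K, d) support features,
    y_support: integer labels in [0, n_way).
    """
    q, z_support = l2_normalize(q), l2_normalize(z_support)
    one_hot = np.eye(n_way)[np.asarray(y_support)]    # (N*K, N)
    prototypes = l2_normalize(one_hot.T @ z_support)  # (N, d) normalized class sums
    return (q @ prototypes.T).argmax(axis=1)          # argmax_y of q^T c_y

# Toy 2-way 2-shot task in two dimensions.
z_s = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y_s = np.array([0, 0, 1, 1])
queries = np.array([[0.8, 0.2], [0.2, 0.8]])
preds = prototype_predict(queries, z_s, y_s, n_way=2)
```

Since the prototypes are re-normalized, using the class sum or the class mean of the shot representations gives the same prediction.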

4.1. STANDARD FEW-SHOT BENCHMARKS

Setup. We here evaluate PsCo on the standard few-shot benchmarks of unsupervised meta-learning: Omniglot (Lake et al., 2011) and miniImageNet (Ravi & Larochelle, 2017). We compare PsCo's performance with unsupervised meta-learning methods (Hsu et al., 2019; Khodadadeh et al., 2019; 2021; Lee et al., 2021; Kong et al., 2021), self-supervised learning methods (Chen et al., 2020a; b; Caron et al., 2020), and supervised meta-learning methods (Finn et al., 2017; Snell et al., 2017) on the benchmarks. The details of the benchmarks and the baselines are described in Appendix D.

Few-shot classification results. Table 1 shows the results of the few-shot classification with various (way, shot) tasks of Omniglot and miniImageNet. PsCo achieves state-of-the-art performance on both Omniglot and miniImageNet benchmarks under the unsupervised setting. For example, we obtain a 5% accuracy gain (67.07 → 72.22) on miniImageNet 5-way 20-shot tasks. Moreover, the performance is even competitive with the supervised meta-learning methods, ProtoNets (Snell et al., 2017) and MAML (Finn et al., 2017).

4.2. CROSS-DOMAIN FEW-SHOT BENCHMARKS

Setup. We evaluate PsCo on cross-domain few-shot classification benchmarks following Oh et al. (2022). To be specific, we use (a) benchmarks with large similarity to ImageNet: CUB (Wah et al., 2011), Cars (Krause et al., 2013), Places (Zhou et al., 2018) [...] test the previous state-of-the-art unsupervised meta-learning (Lee et al., 2021; Kong et al., 2021), self-supervised learning (Chen et al., 2020a; b; Caron et al., 2020), and supervised meta-learning (Finn et al., 2017; Snell et al., 2017). We here use our adaptation scheme (Section 3.3) with 50 iterations. The details of the benchmarks and implementations are described in Appendix E.

Small-scale cross-domain few-shot classification results. We here evaluate various Conv5 models meta-trained on miniImageNet as used in Section 4.1. Table 2 shows that PsCo outperforms all the baselines across all the benchmarks, except ChestX, which is too different from the distribution of miniImageNet (Oh et al., 2022). Somewhat interestingly, PsCo is competitive with supervised learning under these benchmarks, e.g., PsCo achieves 88% accuracy on CropDiseases 5-way 5-shot tasks, whereas MAML gets 77%. This implies that our unsupervised method, PsCo, generalizes to more diverse tasks than supervised learning, which is specialized to in-domain tasks.

Large-scale cross-domain few-shot classification results. We also validate that our meta-learning framework is applicable to the large-scale benchmark, ImageNet (Deng et al., 2009). Remark that the recent unsupervised meta-learning methods (Lee et al., 2021; Kong et al., 2021; Khodadadeh et al., 2021) rely on generative models, so they are not easily applicable to such a large-scale benchmark. For example, we observe that PsCo is 2.7 times faster than the best baseline, Meta-SVEBM (Kong et al., 2021), even though Meta-SVEBM uses low-dimensional representations instead of full images during training.
Hence, we compare PsCo with (a) self-supervised methods, MoCo v2 (Chen et al., 2020b) and BYOL (Grill et al., 2020), and (b) the publicly available supervised learning baseline. We here use the ResNet-50 (He et al., 2016) architecture. The training details are described in Appendix E.4 and we also provide ResNet-18 results in Appendix F.

Table 3: 5-way 5-shot classification accuracy (%) on cross-domain few-shot benchmarks. We transfer ImageNet-trained ResNet-50 models to each benchmark. We report the average accuracy over 600 few-shot tasks.

Table 3 shows that (i) PsCo consistently improves both MoCo and BYOL under this setup (e.g., 67% → 82% in CUB), and (ii) PsCo benefits from the large-scale dataset, as we obtain large performance gains on the benchmarks with large similarity to ImageNet: CUB, Cars, Places, and Plantae. Consequently, we achieve comparable performance with the supervised learning baseline, except on Cars, which shows that our PsCo is applicable to large-scale unlabeled datasets.

4.3. ABLATION STUDY

Component analysis. In Table 4, we demonstrate the necessity of each component in PsCo by removing the components one by one: momentum encoder ϕ, prediction head h, Sinkhorn-Knopp algorithm, top-K sampling for sampling support samples, and the MoCo objective, $\mathcal{L}_{\mathrm{MoCo}}$ (6). We found that the momentum network ϕ and the prediction head h are critical architectural components in our framework, as in recent self-supervised learning frameworks (Grill et al., 2020; Chen et al., 2021). In addition, Table 4 shows that training with only our objective, $\mathcal{L}_{\mathrm{PsCo}}$ (5), achieves meaningful performance, but incorporating it into MoCo is more beneficial. To further validate that our task construction is progressively improved during meta-learning, we evaluate whether a query and a corresponding support sample have the same true label. Figure 2a shows that our task construction is progressively improved, i.e., the task requirement (i) described in Section 3.1 is satisfied.

Table 4 also verifies the contribution of the Sinkhorn-Knopp algorithm and top-K sampling to the performance of PsCo. We further analyze the effect of the Sinkhorn-Knopp algorithm by measuring the overlap ratio of selected supports between different pseudo-labels. As shown in Figure 2b, there are almost zero overlaps when using the Sinkhorn-Knopp algorithm, which means the constructed task is a valid few-shot task, satisfying the task requirement (ii) described in Section 3.1.

Adaptation effect on cross-domain. To validate the effect of our adaptation scheme (Section 3.3), we evaluate the few-shot classification accuracy during the adaptation process on miniImageNet (i.e., in-domain) and CropDiseases (i.e., cross-domain) benchmarks. As shown in Figure 2d, we found that the adaptation scheme is more useful in cross-domain benchmarks than in-domain ones. Based on these results, we apply the scheme to only the cross-domain scenarios.
We also found that our adaptation does not cause over-fitting since we only optimize the projection and prediction heads $g_\theta$ and $h_\theta$. The results for the adaptation effect on all benchmarks are presented in Appendix C.

Augmentations. We here confirm that weak augmentation for the momentum network (i.e., $\mathcal{A}_2$) is more effective than strong augmentation, unlike in other self-supervised learning literature (Chen et al., 2020a; He et al., 2020). We denote the standard augmentation consisting of both geometric and color transformations by Strong, and a weaker augmentation consisting of only geometric transformations by Weak (see details in Appendix A). As shown in Table 5, utilizing the weak augmentation for $\mathcal{A}_2$ is much more beneficial since it helps to find an accurate pseudo-label assignment matrix A.

Training K. We also look at the effect of the training K, i.e., the number of shots sampled online. We conduct the experiment with K ∈ {1, 4, 16, 64}. We observe that PsCo performs consistently well regardless of the choice of K, as shown in Table 6. Still, a properly chosen K gives the best-performing models, e.g., K = 4 for miniImageNet and K = 1 for Omniglot are the best.

5. RELATED WORKS

Unsupervised meta-learning. Unsupervised meta-learning (Hsu et al., 2019; Khodadadeh et al., 2019; Lee et al., 2021; Kong et al., 2021; Khodadadeh et al., 2021) links meta-learning and unsupervised learning by constructing synthetic tasks and extracting meaningful information from unlabeled data. For example, CACTUs (Hsu et al., 2019) clusters the data on the pretrained representations at the beginning of meta-learning to assign pseudo-labels. Instead of pseudo-labeling, UMTRA (Khodadadeh et al., 2019) and LASIUM (Khodadadeh et al., 2021) generate synthetic samples using data augmentations or pretrained generative networks like BigBiGAN (Donahue & Simonyan, 2019). Meta-GMVAE (Lee et al., 2021) and Meta-SVEBM (Kong et al., 2021) represent unknown labels via categorical latent variables using variational autoencoders (Kingma & Welling, 2014) and energy-based models (Teh et al., 2003), respectively. In this paper, we suggest a novel online pseudo-labeling strategy to construct diverse tasks without help from any pretrained network or generative model. As a result, our method is easily applicable to large-scale datasets.

Self-supervised learning. Self-supervised learning (SSL) (Doersch et al., 2015) has shown remarkable success for unsupervised representation learning across various domains, including vision (He et al., 2020; Chen et al., 2020a), speech (Oord et al., 2018), and reinforcement learning (Laskin et al., 2020). Among SSL objectives, contrastive learning (Oord et al., 2018; Chen et al., 2020a; He et al., 2020) is arguably the most popular for learning meaningful representations. In addition, recent advances have been made with the development of various architectural components: e.g., Siamese networks (Doersch et al., 2015), momentum networks (He et al., 2020), and asymmetric architectures (Grill et al., 2020; Chen & He, 2021). In this paper, we utilize the SSL components to construct diverse few-shot tasks in an unsupervised manner.

6. CONCLUSION

Although unsupervised meta-learning (UML) and self-supervised learning (SSL) share the same purpose of learning generalizable knowledge to unseen tasks by utilizing unlabeled data, there still exists a gap between UML and SSL literature. In this paper, we bridge the gap as we tailor various SSL components to UML, especially for few-shot classification, and we achieve superior performance under various few-shot classification scenarios. We believe our research could bring many future research directions in both the UML and SSL communities.

ETHICS STATEMENT

Unsupervised learning, especially self-supervised learning, often requires a large number of training samples, a huge model, and a high computational cost for training the model on large-scale data to obtain meaningful representations because of the absence of human annotations. Furthermore, finetuning the model for solving a new task is also time-consuming and memory-inefficient. Hence, it could raise environmental issues such as carbon generation, which could bring an abnormal climate and accelerate global warming. In that sense, meta-learning should be considered as a solution since its purpose is to learn generalizable knowledge that can be quickly adapted to unseen tasks. In particular, unsupervised meta-learning, which benefits from both meta-learning and unsupervised learning, would be an important research direction. We believe that our work could be a useful step toward learning easily-generalizable knowledge from unlabeled data.

REPRODUCIBILITY STATEMENT

We provide all the details to reproduce our experimental results in Appendix A, D, and E. The code is available at https://github.com/alinlab/PsCo. In our experiments, we mainly use NVIDIA RTX 3090 GPUs.

C EFFECT OF ADAPTATION

We measure the performance with and without our adaptation scheme on various domains using miniImageNet-pretrained PsCo. Table 11 shows that our adaptation scheme helps the model adapt to each domain. In particular, the adaptation scheme is recommended for cross-domain few-shot classification scenarios.

D SETUP FOR STANDARD FEW-SHOT BENCHMARKS

We here describe details of benchmarks and baselines in Section D.1 and D.2, respectively, for the standard few-shot classification experiments (Section 4.1).

D.1 DATASETS

Omniglot (Lake et al., 2011) is a 28 × 28 gray-scale dataset of 1623 characters with 20 samples each. We follow the setup of unsupervised meta-learning approaches (Hsu et al., 2019). We split the dataset into 1200, 100, and 323 classes for meta-training, meta-validation, and meta-test, respectively. In addition, the 0, 90, 180, and 270 degrees rotated views of each class are treated as different categories. Thus, we have a total of 4800, 400, and 1292 classes for meta-training, meta-validation, and meta-test, respectively.

MiniImageNet (Ravi & Larochelle, 2017) is an 84 × 84 resized subset of ILSVRC-2012 (Deng et al., 2009) with 600 samples per class. We split the dataset into 64, 16, and 20 classes for meta-training, meta-validation, and meta-test, respectively, as introduced in Ravi & Larochelle (2017).
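The rotation-based class expansion used for Omniglot can be sketched as follows. This is an illustrative NumPy helper (`rotation_classes` is a hypothetical name), assuming square images stacked in an array of shape (num_samples, height, width) and integer labels starting at 0; `np.rot90` rotates counterclockwise.

```python
import numpy as np

def rotation_classes(images, labels):
    """Treat each 90-degree rotation of a class as a separate class.

    images: (S, H, W) with H == W, labels: (S,) integers in [0, C).
    Returns 4*S images and labels in [0, 4*C).
    """
    n_classes = int(labels.max()) + 1
    out_x, out_y = [], []
    for r in range(4):  # 0, 90, 180, and 270 degrees
        out_x.append(np.rot90(images, k=r, axes=(1, 2)))
        out_y.append(labels + r * n_classes)  # shift labels into a fresh range
    return np.concatenate(out_x), np.concatenate(out_y)

# Toy usage: 3 classes with 2 samples each become 12 classes.
toy_images = np.arange(6 * 28 * 28, dtype=float).reshape(6, 28, 28)
toy_classes = np.array([0, 0, 1, 1, 2, 2])
rot_images, rot_classes = rotation_classes(toy_images, toy_classes)
```

Applying the expansion to each split separately quadruples its number of classes while keeping the splits disjoint.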

D.2 BASELINES

We compare our performance with unsupervised meta-learning, self-supervised learning, and supervised meta-learning methods. To be specific, (a) for unsupervised meta-learning, we use CACTUs (Hsu et al., 2019) with its best options (ACAI clustering for Omniglot and DeepCluster for miniImageNet), UMTRA (Khodadadeh et al., 2019), LASIUM (Khodadadeh et al., 2021) with its best options (LASIUM-RO-GAN for Omniglot and LASIUM-N-GAN for miniImageNet), Meta-GMVAE (Lee et al., 2021), and Meta-SVEBM (Kong et al., 2021); (b) for self-supervised learning, we use SimCLR (Chen et al., 2020a), MoCo v2 (Chen et al., 2020b), and SwAV (Caron et al., 2020); (c) for supervised meta-learning, we use the results of MAML (Finn et al., 2017) and ProtoNets (Snell et al., 2017) reported in Hsu et al. (2019). For training the self-supervised learning methods in our experimental setups, we use the same architecture and hyperparameters. For the temperature hyperparameter, we use the value provided in each paper: τ_SimCLR = 0.5 for SimCLR, τ_MoCo = 0.2 for MoCo v2, and τ_SwAV = 0.1 for SwAV. For evaluation, we use K-Nearest Neighbors (K-NN) for the self-supervised learning methods since their classification rules are not specified.
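As a rough illustration of the K-NN evaluation mentioned above, the following sketch classifies a query embedding by a majority vote over its nearest support embeddings. The similarity measure (cosine), the vote rule, and all names here are assumptions made for the example; the text does not specify them.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_predict(query, support, labels, k=1):
    """Majority vote over the k support embeddings most similar to the query."""
    ranked = sorted(range(len(support)),
                    key=lambda i: cosine(query, support[i]),
                    reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    # Ties are broken in favor of the label seen first, i.e. the closer one.
    return votes.most_common(1)[0][0]


# Toy 2-way 2-shot task with hand-made 2-D "embeddings".
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
pred_a = knn_predict([0.2, 0.8], support, labels, k=3)
pred_b = knn_predict([1.0, 0.05], support, labels, k=1)
```

In practice the embeddings would come from the frozen backbone of each self-supervised model rather than from hand-written lists.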



1 The prediction MLPs have been utilized in the recent SSL literature (Grill et al., 2020; Chen et al., 2021).

2 ϕ is updated by ϕ ← mϕ + (1 − m)θ at each training iteration, where m is a momentum hyperparameter.
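The momentum update ϕ ← mϕ + (1 − m)θ can be written in a few lines. The sketch below treats the parameters as plain lists of floats for readability; the function name `ema_update` and the toy values are ours, and a real implementation would iterate over network tensors instead.

```python
def ema_update(phi, theta, m=0.99):
    """Return phi <- m * phi + (1 - m) * theta, applied element-wise."""
    return [m * p + (1.0 - m) * t for p, t in zip(phi, theta)]


# Toy parameters: phi slowly tracks theta over three training iterations.
phi = [0.0, 1.0]
theta = [1.0, 1.0]
for _ in range(3):
    phi = ema_update(phi, theta)
print(phi)
```

With m close to 1, ϕ changes slowly: after t steps with a fixed θ, the gap between ϕ and θ shrinks by a factor of m^t.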



Figure 1: An overview of the proposed Pseudo-supervised Contrast (PsCo). PsCo constructs an N-way K-shot few-shot classification task using the current mini-batch {x_i} and the queue of previous mini-batches, and then learns the task via contrastive learning. Here, A is a label assignment matrix found by the Sinkhorn-Knopp algorithm (Cuturi, 2013), 𝒜 is a pre-defined augmentation distribution, f is a backbone feature extractor, g and h are projection and prediction MLPs, respectively, and ϕ is an exponential moving average (EMA) of the model parameter θ.

Figure 2: (a) Pseudo-label quality, measuring the agreement between pseudo-labels and true labels, and (b) shot overlap ratio, measuring whether the shots for each pseudo-label are disjoint, during meta-training. (c, d) Performance during adaptation on in-domain (miniImageNet) and cross-domain (CropDiseases) benchmarks, respectively. We obtain these results from 100 random batches.

Few-shot classification accuracy (%) on Omniglot and miniImageNet benchmarks. We report the average accuracy over 2000 few-shot tasks for PsCo and self-supervised learning methods. Other reported numbers are borrowed from Khodadadeh et al. (2021); Kong et al. (2021). Bold entries indicate the best for each task configuration, among unsupervised and self-supervised methods.

Few-shot classification accuracy (%) on cross-domain few-shot classification benchmarks. We transfer Conv5 trained on miniImageNet to each benchmark. We report the average accuracy over 2000 few-shot tasks for all methods, except Meta-SVEBM as it is evaluated over 200 tasks due to the long evaluation time. Bold entries indicate the best for each task configuration, among unsupervised and self-supervised methods.

Component ablation studies on Omniglot.

The ablation study with varying augmentation choices for 𝒜_1 and 𝒜_2 on miniImageNet.

The ablation study with varying K on miniImageNet.

Before and after adaptation of PsCo in few-shot classification.

ACKNOWLEDGMENTS AND DISCLOSURE OF FUNDING

This work was mainly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST); No.2022-0-00713, Meta-learning applicable to real-world problems; No.2022-0-00959, Few-shot Learning of Causal Inference in Vision and Language for Decision Making).

A IMPLEMENTATION DETAILS

We train our models via stochastic gradient descent (SGD) with a batch size of N = 256 for 400 epochs. Following Chen et al. (2020b); Chen & He (2021), we use an initial learning rate of 0.03 with the cosine learning schedule, τ_MoCo = 0.2, and a weight decay of 5 × 10^-4. We use a queue size of M = 16384 since Omniglot (Lake et al., 2011) and miniImageNet (Ravi & Larochelle, 2017) have roughly 100k meta-training samples. Following Lee et al. (2021), we use Conv4 and Conv5 for Omniglot and miniImageNet, respectively, as the backbone feature extractor f_θ. We describe the detailed architectures in Table 7. For the projection and prediction MLPs, g_θ and h_θ, we use 2-layer MLPs with a hidden size of 2048 and an output dimension of 128. For the hyperparameters of PsCo, we use τ_PsCo = 1 and a momentum parameter of m = 0.99 (see Appendix B for the hyperparameter sensitivity). For the number of shots during meta-learning, we use K = 1 for Omniglot and K = 4 for miniImageNet (see Table 6 for the sensitivity of K). We use the last-epoch model for evaluation without any guidance from the meta-validation dataset.

Augmentations. We describe the augmentations for Omniglot and miniImageNet in Table 8. For Omniglot, because it is difficult to apply many augmentations to gray-scale images, we use the same rule for weak and strong augmentations. For miniImageNet, we use only geometric transformations for the weak augmentation, following Zheng et al. (2021).

Training procedures. To obtain reliable performance estimates for PsCo and the self-supervised learning models, we train three models independently with different random seeds and report their average performance.
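For concreteness, the cosine learning schedule mentioned in the implementation details can be sketched as follows. The text only states the initial learning rate (0.03) and the number of epochs (400); annealing to zero, updating once per epoch, and the absence of warmup are assumptions of this sketch, and `cosine_lr` is a hypothetical helper name.

```python
import math


def cosine_lr(epoch, total_epochs=400, base_lr=0.03):
    """Half-cosine decay from base_lr at epoch 0 to 0 at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


# Learning rate at the start, the midpoint, and the end of training.
schedule = [cosine_lr(e) for e in (0, 200, 400)]
print(schedule)
```

Under these assumptions the learning rate is halved at the midpoint of training and decays smoothly to zero by the final epoch.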

B ANALYSIS ON HYPERPARAMETER SENSITIVITY

For the small-scale experiments, we use a momentum of m = 0.99 and a temperature of τ_PsCo = 1. Here, we provide additional ablation experiments that vary the hyperparameters m and τ_PsCo. Tables 9 and 10 show the sensitivity to these hyperparameters on the miniImageNet dataset. We observe that PsCo achieves good performance even with non-optimal hyperparameters.

E SETUP FOR CROSS-DOMAIN FEW-SHOT BENCHMARKS

We now describe the setup for cross-domain few-shot benchmarks, including detailed information on datasets, baseline experiments, implementation details, and the setup for large-scale experiments.

E.1 DATASETS

For the cross-domain few-shot benchmarks, we use eight different datasets. We describe the dataset information in Table 12. We use the dataset split described in Tseng et al. (2020) for the high-similarity benchmark and the split described in Guo et al. (2020) for the low-similarity benchmark. Because we do not perform meta-training on the datasets of the cross-domain benchmarks, we only utilize their meta-test splits. We use 84 × 84 resized samples for evaluation in the small-scale experiments.

E.2 BASELINES

We compare our performance with previous in-domain state-of-the-art unsupervised meta-learning methods, self-supervised learning models, and supervised meta-learning models.

Unsupervised meta-learning models. We use the previous in-domain state-of-the-art unsupervised meta-learning models, Meta-GMVAE (Lee et al., 2021) and Meta-SVEBM (Kong et al., 2021). We use the miniImageNet-pretrained parameters provided by the respective papers and follow the meta-test procedure of each model to evaluate the performance.

Self-supervised learning models. We use miniImageNet-pretrained SimCLR (Chen et al., 2020a), MoCo v2 (Chen et al., 2020b), and SwAV (Caron et al., 2020) as our baselines. Because these models are pretrained on miniImageNet, we additionally fine-tune them with a linear classifier so that they can adapt to each domain. Following the setting provided in Guo et al. (2020); Oh et al. (2022), we detach the head g_θ of each model and attach a linear classifier c_ψ. We freeze the base network f_θ during fine-tuning, so only c_ψ is learned. We fine-tune the models via SGD with an initial learning rate of 0.01, a momentum of 0.9, a weight decay of 0.001, and a batch size of N = 4 for 100 epochs.

Supervised meta-learning models. We use MAML (Finn et al., 2017) and ProtoNets (Snell et al., 2017) with the Conv5 architecture, trained on miniImageNet. Following the procedure of Snell et al. (2017), we train the models via Adam (Kingma & Ba, 2015) with a learning rate of 0.001 and halve the learning rate every 2000 training episodes. We train them for 60K episodes and use the model with the best validation accuracy. We train them on 5-way 5-shot tasks, and the remaining hyperparameters follow their respective papers. We observe that their performance is similar to that reported in Table 1.
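The step schedule used for the supervised meta-learning baselines (a learning rate of 0.001, halved every 2000 episodes) can be sketched as below. Whether the halving takes effect exactly at episodes 2000, 4000, and so on is an assumption of this sketch, and `halving_lr` is a hypothetical helper name.

```python
def halving_lr(episode, base_lr=0.001, period=2000):
    """Learning rate after halving once per completed block of `period` episodes."""
    return base_lr * 0.5 ** (episode // period)


# Learning rate just before and just after the first two decay points.
rates = {ep: halving_lr(ep) for ep in (0, 1999, 2000, 4000)}
print(rates)
```

Over the full 60K episodes this amounts to 30 halvings, so most of the effective learning happens in the early episodes, which is one reason to select the model by validation accuracy.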

E.3 EVALUATION DETAILS

To evaluate our method, we apply our adaptation scheme. Following Section 3.3, we freeze the base network f_θ and train only the projection head g_θ and the prediction head h_θ via SGD with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.001, the same settings used to fine-tune the self-supervised learning models. We apply only 50 iterations of our adaptation scheme when reporting performance.

E.4 LARGE-SCALE SETUP

Here, we describe the setup for large-scale experiments. For evaluation, we use the same protocol as in the small-scale experiments, except that the image resolution is 224 × 224.

Augmentations. For large-scale experiments, we use 224 × 224-scaled data. Thus, we use augmentation schemes that are similar to, yet slightly different from, those of the small-scale experiments. Following the strong augmentation used in Chen et al. (2020a;b), we additionally apply GaussianBlur as a random augmentation. We use the same configuration for the weak augmentation. For evaluation, we resize the images to 256 × 256 and then apply CenterCrop to obtain 224 × 224 images, following Guo et al. (2020).

ImageNet pretraining. We pretrain MoCo v2 (Chen et al., 2020b), BYOL (Grill et al., 2020), and our PsCo with ResNet-18/50 (He et al., 2016) via SGD with a batch size of N = 256 for 200 epochs. Following Chen et al. (2020b); Chen & He (2021), we use an initial learning rate of 0.03 with the cosine learning schedule, τ_MoCo = 0.2, and a weight decay of 0.0001. We use a queue size of M = 65536 and a momentum of m = 0.999. For the parameters of PsCo, we use τ_PsCo = 0.2 and K = 16, as the queue is 4 times bigger. For supervised pretraining, we use the model checkpoint officially provided by torchvision (Paszke et al., 2019).
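The evaluation-time preprocessing described above (resize to 256 × 256, then take a 224 × 224 center crop) reduces to a simple index computation, sketched here without any image library. The helper `center_crop_box` is hypothetical, and the integer-division rounding is an assumption; it coincides with other rounding choices whenever the margin is even, as it is for 256 and 224.

```python
def center_crop_box(height, width, crop=224):
    """Return (top, left, bottom, right) of a centered crop x crop window."""
    top = (height - crop) // 2
    left = (width - crop) // 2
    return top, left, top + crop, left + crop


def center_crop(image, crop=224):
    """Crop a 2-D list of pixel rows around its center."""
    top, left, bottom, right = center_crop_box(len(image), len(image[0]), crop)
    return [row[left:right] for row in image[top:bottom]]


box = center_crop_box(256, 256)            # margin of 16 pixels on each side
dummy = [[0] * 256 for _ in range(256)]    # stand-in for a resized image
cropped = center_crop(dummy)
print(box, len(cropped), len(cropped[0]))
```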

