SELF-SUPERVISED SET REPRESENTATION LEARNING FOR UNSUPERVISED META-LEARNING

Abstract

Unsupervised meta-learning (UML) shares the spirit of self-supervised learning (SSL) in that both aim to learn models without any human supervision so that the models can be adapted to downstream tasks. Further, the learning objective of self-supervised learning, which pulls positive pairs closer and repels negative pairs, also resembles metric-based meta-learning. Metric-based meta-learning is one of the most successful meta-learning methods, which learns to minimize the distance between representations from the same class. One notable aspect of metric-based meta-learning, however, is that it is widely interpreted as a set-level problem, since the inference of discriminative class prototypes (or set representations) from few examples is crucial for the performance of downstream tasks. Motivated by this, we propose Set-SimCLR, a novel self-supervised set representation learning framework targeting the UML problem. Specifically, Set-SimCLR learns a set encoder on top of instance representations to maximize the agreement between two sets of augmented samples, which are generated by applying stochastic augmentations to a given image. We theoretically analyze how the proposed set representation learning can potentially improve generalization performance at meta-test time. We also empirically validate its effectiveness on various benchmark datasets, showing that Set-SimCLR largely outperforms both UML and instance-level self-supervised learning baselines.

1. INTRODUCTION

One of the most challenging and long-standing problems in machine learning is unsupervised learning, which aims to learn generalizable representations without human supervision that can be transferred to diverse downstream tasks. Meta-learning (Finn et al., 2017; Snell et al., 2017) is a popular framework for learning models that quickly adapt to novel tasks on the fly with few examples, and thus shares the spirit of unsupervised learning in that it seeks more efficient and effective learning procedures than learning from scratch. However, the essential difference between unsupervised learning and meta-learning is that most meta-learning approaches are built on a supervised learning scheme and require human-crafted task distributions. To tackle this limitation, several previous works (Hsu et al., 2019; Khodadadeh et al., 2019; 2021; Lee et al., 2021) have proposed unsupervised meta-learning (UML) frameworks, which combine unsupervised learning and meta-learning. They train a model with unlabeled data such that the model can adapt to unseen tasks with few labels.

Meanwhile, self-supervised learning (SSL) (Chen et al., 2020a; b; He et al., 2020; Chen et al., 2020c; 2021; Grill et al., 2020; Zbontar et al., 2021) is rising as a promising paradigm for learning transferable representations from unlabeled data in a task-agnostic manner. These methods rely on pretext tasks generated from data, and a popular pretext task is to maximize the agreement between different views of the same image in the latent space. The different views are easily obtained by sequentially applying pre-defined stochastic augmentations to an image. The main applications of these SSL methods essentially resemble the problem scenarios of UML, where we aim to transfer the learned representations to various downstream tasks.
Further, the learning objective of SSL is also closely related to metric-based meta-learning (Ni et al., 2022), which is one of the most successful meta-learning methods. Metric-based meta-learning (Snell et al., 2017) learns to minimize the distance between representations from the same class, while SSL pulls positive pairs closer and repels negative pairs. This motivates us to design an SSL method for addressing the UML problem.

Most SSL methods have focused on learning meaningful instance-level visual features. The importance of instance features for generalization to unseen tasks with few examples is clear. However, a meta-learning problem is often interpreted as a set-level problem in the literature on metric-based meta-learning. It has been widely shown that inference of discriminative class prototypes (or set representations) from few examples is crucial for the performance of downstream tasks. For example, Snell et al. (2017) take the average of the features belonging to the same class as a prototype (set representation). Similarly, Gordon et al. (2019) and Iakovleva et al. (2020) propose a Bayesian framework that learns stochastic prototypes using a multi-layer perceptron and properly reflects the uncertainty originating from few examples. Further, Triantafillou et al. (2019) propose to fine-tune the prototype with a supervised loss.

Inspired by the successes of set representations in few-shot learning, we propose a self-supervised set representation learning framework for UML. The underlying assumption of SSL is that two different views of an image share most visual semantics. Building on this idea, we construct two sets, each consisting of different views of the same image, and maximize the agreement between them. Concretely, we apply stochastic augmentations to each image of the mini-batch multiple times and construct a set consisting of the augmented images.
Then we divide the set in half into two sets, which are considered to be a positive pair of sets. Given a positive set pair, similar to Chen et al. (2020a), the other sets within the mini-batch are considered negative sets. We use an attention-based set encoder (Vaswani et al., 2017; Lee et al., 2019) to obtain set representations. The set encoder is trained to reduce the distance between positive sets and increase the distance between negative sets. We dub our framework Set-SimCLR. At meta-test, we initialize each row of the weight of a linear classifier with the learned representation of the set composed of instances belonging to the same class, and the classifier is then optimized with a supervised loss.

We motivate the algorithmic design of Set-SimCLR with theoretical analysis. Specifically, we study how our set representation can potentially improve the final performance and why we use set representations as the initialization of the classifier weights. We then empirically validate Set-SimCLR by comparing it against four UML methods and four instance-level SSL methods. We find that our method outperforms the baselines on six benchmark datasets: Mini-ImageNet (Ravi & Larochelle, 2017), Tiny-ImageNet (Le & Yang, 2015), CIFAR100 (Krizhevsky et al., 2009), Aircraft (Maji et al., 2013), Stanford Cars (Krause et al., 2013) and CUB (Wah et al., 2011).

We summarize our contributions as follows:

• We introduce the Set-SimCLR framework for solving the unsupervised meta-learning problem, which learns both instance and set representations for downstream tasks.
• We provide a theoretical motivation for Set-SimCLR and study how the set representation potentially improves few-shot classification performance.
• The proposed Set-SimCLR outperforms the previous UML baselines and self-supervised learning baselines by significant margins on all the datasets we consider.

2. RELATED WORK

Unsupervised Meta-Learning (UML). To tackle the limitation of supervised meta-learning, several UML works have been proposed to construct pseudo-tasks for meta-training by clustering data in an unsupervised embedding space (Hsu et al., 2019), by data augmentation (Khodadadeh et al., 2019), or by harvesting synthetic data from the latent space of generative models (Khodadadeh et al., 2021). In contrast to the works focusing on generating pseudo-tasks, Meta-GMVAE (Lee et al., 2021) introduces a mixture-of-Gaussians prior by performing Expectation-Maximization during meta-training and meta-test. To our knowledge, none of the existing works have proposed to tackle UML with self-supervised set representation learning, although Lee et al. (2021) and Ericsson et al. (2021) use a backbone network pretrained with an instance-level SSL objective.

Set Representation. DeepSets (Zaheer et al., 2017) independently processes elements and aggregates them with a min, max, mean or sum operation to obtain a permutation-invariant set encoding. To tackle the lack of expressiveness of DeepSets, Set Transformer (Lee et al., 2019) utilizes self-attention to model the pairwise interactions of elements in a set. Instead of designing a more expressive neural architecture for set encoding, several methods learn set representations by minimizing the distance between an input set and a trainable reference set using bipartite matching (Skianis et al., 2020), optimal transport (Mialon et al., 2020; dan Guo et al., 2022), or Wasserstein embedding (Kolouri et al., 2020). Note that our self-supervised set representation learning framework is agnostic to the set encoding, and any of these methods can be used within it.

Self-supervised Learning. Recently, a large volume of work has proposed self-supervised learning methods. The core idea is that the representations of differently augmented views of the same image should be similar.
Note that we introduce only the few methods that we consider as baselines in our experiments. SimCLR (Chen et al., 2020a; b) is one of the most representative contrastive frameworks, where two views of the same image are pulled together while negative pairs are pushed apart. MOCO (He et al., 2020; Chen et al., 2020c; 2021) builds a dynamic feature dictionary using a queue and a momentum encoder and learns to minimize a contrastive loss over the dictionary. Meanwhile, several works show that non-contrastive approaches can learn meaningful representations without latent feature collapse. For example, BYOL (Grill et al., 2020) leverages two identical networks, one of which is a momentum encoder, to encode different views of images and minimizes the distance between positive pairs. Barlow Twins (Zbontar et al., 2021) computes a cross-correlation matrix between different views of images and optimizes it to be close to the identity matrix. Recently, MAE (He et al., 2022) masks an image and reconstructs the masked input to learn a meaningful representation of images. In this work, we exploit the effectiveness of self-supervised learning for UML, especially when combined with our proposed set representation learning.

3. METHOD

In this section, we describe the problem setting of unsupervised meta-learning (UML) and our self-supervised set representation learning framework, Set-SimCLR. We depict an overview of our method in Figure 1.

3.1. PROBLEM STATEMENT

For the UML problem, we only have access to an unlabeled dataset $D^u = \{x_i\}_{i=1}^{U}$ for meta-training. As in most existing meta-learning works (Finn et al., 2017; Snell et al., 2017), we assume that the meta-test data follow the same data distribution as the unlabeled dataset $D^u$ while having a different set of classes. At meta-test time, we are given a set of $N$-way $S$-shot classification tasks, and each task consists of a support set $D^s = \{(x_i^s, y_i^s)\}_{i=1}^{N \times S}$ and a query set $D^q = \{x_i^q\}_{i=1}^{N \times Q}$. The final goal is to leverage the model trained on the unlabeled data to predict the labels of the query set with the help of the support set.
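To make the task structure concrete, the following is a minimal sketch of how one N-way S-shot episode could be drawn from a labeled meta-test pool. The function name `sample_episode`, its arguments, and the default of 15 query examples per class are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def sample_episode(labels, n_way=5, n_shot=1, n_query=15, rng=None):
    """Return (support_idx, support_y, query_idx, query_y) for one N-way S-shot task.

    `labels` holds the class label of every example in the meta-test pool.
    Episode labels are re-indexed to 0..n_way-1. All names are hypothetical.
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    # Pick N distinct classes for this task.
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support_idx, support_y, query_idx, query_y = [], [], [], []
    for new_label, c in enumerate(classes):
        # Shuffle the examples of class c and keep S support + Q query examples.
        idx = rng.permutation(np.flatnonzero(labels == c))[: n_shot + n_query]
        support_idx.extend(idx[:n_shot])
        query_idx.extend(idx[n_shot:])
        support_y.extend([new_label] * n_shot)
        query_y.extend([new_label] * (len(idx) - n_shot))
    return (np.array(support_idx), np.array(support_y),
            np.array(query_idx), np.array(query_y))
```

The support and query indices are disjoint by construction, so the query set can serve as held-out data for the task.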

3.2. SELF-SUPERVISED CONTRASTIVE LEARNING

Before introducing our method, we first describe one of the most successful self-supervised learning methods, SimCLR (Chen et al., 2020a). SimCLR is a contrastive learning framework that maximizes agreement between differently augmented views of the same instance in the latent space. Specifically, it first randomly samples a mini-batch of $M$ images $\{x_m\}_{m=1}^{M}$ and obtains two different views of each image using stochastic data augmentation, resulting in $2M$ instances $\{(x_{m,1}, x_{m,2})\}_{m=1}^{M}$. There are two components: 1) a base encoder $f$ extracting feature representations and 2) a projection head $g$ mapping the representations to the latent space where the contrastive loss is applied. With the encoder and projection head, the latent representation of each image is obtained as $z_{m,j} = g(f(x_{m,j}))$. Then, the contrastive loss for the mini-batch of $M$ images is defined as

$$L_{\mathrm{SimCLR}}\big(\{(z_{m,1}, z_{m,2})\}_{m=1}^{M}\big) = -\frac{1}{2M} \sum_{m=1}^{M} \left[ \log \frac{\exp(\mathrm{sim}(z_{m,1}, z_{m,2})/\tau)}{\sum_{j,k} \mathbb{1}_{[z_{k,j} \neq z_{m,1}]} \exp(\mathrm{sim}(z_{m,1}, z_{k,j})/\tau)} + \log \frac{\exp(\mathrm{sim}(z_{m,2}, z_{m,1})/\tau)}{\sum_{j,k} \mathbb{1}_{[z_{k,j} \neq z_{m,2}]} \exp(\mathrm{sim}(z_{m,2}, z_{k,j})/\tau)} \right], \tag{1}$$

where $\mathrm{sim}$ is a measure of similarity (e.g., cosine similarity) and $\mathbb{1}_{[z_{k,j} \neq z_{m,1}]} \in \{0, 1\}$ is an indicator function. The temperature $\tau > 0$ is a hyperparameter controlling the sharpness of the distribution.
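The contrastive loss in Eq. 1 can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions rather than the authors' code: cosine similarity is used for sim, the function name `simclr_loss` is hypothetical, and the default temperature of 0.5 is a placeholder rather than a value reported in the paper.

```python
import numpy as np

def simclr_loss(z1, z2, tau=0.5):
    """SimCLR loss of Eq. 1 for M positive pairs; z1 and z2 have shape (M, d)."""
    m = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)               # (2M, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm, so z @ z.T is cosine similarity
    sim = z @ z.T / tau                                # (2M, 2M) scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # indicator: exclude each instance itself
    # Row i's positive is the other view of the same image.
    pos = np.concatenate([np.arange(m, 2 * m), np.arange(m)])
    # Numerically stable log-softmax over all other instances in the batch.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    # Average of -log p(positive) over the 2M anchors, i.e. the 1/(2M) factor in Eq. 1.
    return -log_prob[np.arange(2 * m), pos].mean()
```

Because the function only needs two aligned batches of latent vectors, the same routine can be applied to projected instance representations or to projected set representations.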

3.3. SELF-SUPERVISED SET REPRESENTATION LEARNING WITH SIMCLR

Existing self-supervised learning has focused on instance-level visual features. The importance of instance-level features for generalization to unseen tasks is clear. However, a meta-learning problem is often interpreted as a set-level problem rather than an instance-level one. For example, Snell et al. (2017) take the average of features belonging to the same class as a prototype (or a set representation), and Gordon et al. (2019) and Iakovleva et al. (2020) propose a Bayesian framework to learn stochastic prototypes using a multi-layer perceptron. Further, Triantafillou et al. (2019) propose to fine-tune a prototype with a supervised loss. Inspired by these successes, we propose a self-supervised contrastive learning framework for learning set representations to more effectively address UML problems.

[Figure 1: Overview of Set-SimCLR. The base encoder f maps augmented views to instance representations h, which are split into two sets H_{m,1} and H_{m,2} and encoded into set representations s_{m,1} and s_{m,2}. The instance-level loss minimizes -g(h)^T g(h') / (|g(h)| |g(h')|) and the set-level loss minimizes -g(s)^T g(s') / (|g(s)| |g(s')|).]

Set Representation

The underlying assumption of self-supervised learning methods is that two different views of an image share most of the visual semantics. We extend this idea to set-level representations by constructing two sets, each consisting of multiple different views of the same image, and maximizing the agreement between the two sets. Specifically, we apply stochastic augmentations to each image of a mini-batch $V$ times and construct a set $\{x_{m,v}\}_{v=1}^{V}$, where $V$ is an even number. Then we independently encode each augmented image with the base encoder $f$ to obtain instance-level feature representations $h_{m,v} = f(x_{m,v}) \in \mathbb{R}^{d}$ for $m = 1, \dots, M$ and $v = 1, \dots, V$, where $d$ is the dimension of the feature representation. This results in $M$ different sets of instance-level representations $H_m = \{h_{m,v}\}_{v=1}^{V}$ for $m = 1, \dots, M$. Then we divide each set $H_m$ in half to obtain a positive pair of sets, i.e., $H_{m,1} = \{h_{m,v}\}_{v=1}^{V/2}$ and $H_{m,2} = \{h_{m,v}\}_{v=V/2+1}^{V}$, and get a set representation by applying a set encoder to each set. Any permutation-invariant set encoder that takes a set of vectors as input and outputs a vector can be employed. Here, we design a set encoder with self-attention for better representation:

$$T_{m,j} = \mathrm{TransformerEncoder}(H_{m,j}) \in \mathbb{R}^{V/2 \times d}, \qquad s_{m,j} = \mathrm{MLP}\big(\mathrm{concat}(\mathrm{mean}(T_{m,j});\, \mathrm{std}(T_{m,j});\, \max(T_{m,j});\, \min(T_{m,j}))\big) \in \mathbb{R}^{d}, \tag{2}$$

and we define the set encoding function $\phi : H_{m,j} \in \mathbb{R}^{V/2 \times d} \mapsto s_{m,j} \in \mathbb{R}^{d}$. For TransformerEncoder, we use the multi-head self-attention mechanism proposed by Vaswani et al. (2017) and Lee et al. (2019). Please see Appendix A for more detail. We compute the mean, standard deviation, maximum and minimum (denoted as mean, std, max and min) over the $V/2$ rows of the output $T_{m,j} \in \mathbb{R}^{V/2 \times d}$ of TransformerEncoder, each of which yields a $d$-dimensional vector.
Then we concatenate them (denoted as concat) into a $4d$-dimensional vector and feed it into a multi-layer perceptron (MLP) to obtain the final set representation $s_{m,j} \in \mathbb{R}^{d}$.
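The set construction and the pooling step of Eq. 2 can be illustrated with a short NumPy sketch. It is a simplification under stated assumptions: the TransformerEncoder is replaced by the identity (the zero-layer case), the MLP is a two-layer ReLU network with externally supplied weights `w1` and `w2`, and all function names are hypothetical.

```python
import numpy as np

def split_views(h):
    """Split instance features of shape (M, V, d) into two sets of V/2 views each."""
    v = h.shape[1]
    assert v % 2 == 0, "V must be even"
    return h[:, : v // 2], h[:, v // 2:]

def pool_set(t):
    """Concatenate mean, std, max and min over the set axis: (M, n, d) -> (M, 4d)."""
    return np.concatenate([t.mean(axis=1), t.std(axis=1), t.max(axis=1), t.min(axis=1)], axis=1)

def set_encoder(h_set, w1, w2):
    """Simplified stand-in for the set encoder of Eq. 2.

    Assumption: identity in place of TransformerEncoder, followed by a
    two-layer ReLU MLP with weights w1 of shape (4d, d) and w2 of shape (d, d).
    """
    pooled = pool_set(h_set)                       # (M, 4d)
    return np.maximum(pooled @ w1, 0.0) @ w2       # (M, d)
```

Since every pooling statistic is symmetric in the set elements, the output does not depend on the order of the views within a set, which is the permutation invariance the text requires of a set encoder.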

Contrastive Loss for Set Representation Learning

We now obtain a positive pair of set representations $s_{m,1}$ and $s_{m,2}$ by applying the set encoder in Eq. 2 to each of the sets $H_{m,1}$ and $H_{m,2}$. Following the self-supervised literature, we project the set representations into the latent space with the same head $g$ used for instance-level feature learning. Finally, we compute the set-level contrastive loss by plugging the projected set representations into Eq. 1, i.e., $L_{\mathrm{SimCLR}}\big(\{(g(s_{m,1}), g(s_{m,2}))\}_{m=1}^{M}\big)$. The difference between our loss and the SimCLR loss is that, instead of instance-level representations, we pull positive pairs of set-level representations together and push negative set pairs apart. Further, we introduce a cross loss that regularizes the instance- and set-level representations to share a subspace in the latent space: $L_{\mathrm{SimCLR}}\big(\{(g(s_{m,1}), g(h_{m,2}))\}_{m=1}^{M}\big)$. The final loss combines the instance-level, set-level and cross SimCLR losses:

$$\underbrace{L_{\mathrm{SimCLR}}\big(\{(g(h_{m,1}), g(h_{m,2}))\}_{m=1}^{M}\big)}_{\text{Instance-level Loss}} + \underbrace{L_{\mathrm{SimCLR}}\big(\{(g(s_{m,1}), g(s_{m,2}))\}_{m=1}^{M}\big)}_{\text{Set-level Loss}} + \underbrace{L_{\mathrm{SimCLR}}\big(\{(g(s_{m,1}), g(h_{m,2}))\}_{m=1}^{M}\big)}_{\text{Cross Loss}} \tag{3}$$

Linear Evaluation for Downstream Tasks. We now describe how we utilize the learned instance-level and set-level representations on downstream few-shot classification tasks. For an $N$-way $S$-shot task at meta-test time, we are given the support set $D^s = \{(x_i^s, y_i^s)\}_{i=1}^{N \times S}$ and are supposed to predict the labels of the query set $D^q = \{x_i^q\}_{i=1}^{N \times Q}$. We first apply the base encoder $f$ to obtain instance feature representations of the support set $\{(h_i^s, y_i^s)\}_{i=1}^{N \times S}$ and the query set $\{h_i^q\}_{i=1}^{N \times Q}$. For each class $c = 1, \dots, N$, we encode $H_c^s = \{h_i^s \mid y_i^s = c\}$, the set of instances belonging to class $c$, with the mapping $\phi$ described in Eq. 2. Let the set representation of the $c$-th class be $s_c = \phi(H_c^s) \in \mathbb{R}^{d}$. We then initialize the weight of a linear classifier with the set representations $s_1, \dots, s_N$ and train the classifier with the support set while freezing the base encoder $f$, which is similar to the linear evaluation of self-supervised learning (Chen et al., 2020a). We find this more suitable for our few-shot setting than fine-tuning the full model, since it reduces the risk of overfitting to the few available examples. Specifically, we initialize the weight $W$ of the classifier by stacking the learned set representations $s_c$ as row vectors, denoted as $W_0$, and optimize it by minimizing the cross-entropy loss with weight decay as follows:

$$\min_{W} L_{\mathrm{CE}}(W; D^s) \ \text{via algorithm } A \text{ as } \ W^{*} = A(L_{\mathrm{CE}}; W_0, D^s), \qquad L_{\mathrm{CE}}(W; D^s) = \frac{1}{|D^s|} \sum_{(x_i^s, y_i^s) \in D^s} \ell\big(W f(x_i^s), y_i^s\big),$$

where $W_0 = [s_1 \cdots s_N]^{\top} \in \mathbb{R}^{N \times d}$, $\ell(q, y) = -\log\big(\exp(q_y) / \sum_{k=1}^{N} \exp(q_k)\big)$, and $A(L_{\mathrm{CE}}; W_0, D^s)$ denotes an iterative optimization algorithm with weight decay and the initialization $W_0$. After the optimization, we predict a label for each instance in the query set $D^q$ as $y_i^q = \arg\max_c p_c^{(i)}$, where $(p_1^{(i)}, \dots, p_N^{(i)})^{\top} = W^{*} f(x_i^q)$. We provide pseudo-code for our meta-training (self-supervised learning) and meta-test in Appendix B.

Connection to Meta-Learning. We further discuss why our set representation boosts generalization performance from the viewpoint of meta-learning. One of the most effective approaches in the meta-learning literature for tackling few-shot learning problems is to learn an initialization and adapt it to meta-test tasks. For instance, ANIL (Raghu et al., 2019) learns a feature extractor and an initialization of a linear classifier such that the learned linear classifier can rapidly adapt to the target task while the feature extractor is frozen. ANIL has shown that meta-learning the initialization of the linear classifier is crucial for improving generalization performance on meta-test tasks.
From this point of view, Set-SimCLR meta-learns set representations with the set-level contrastive loss, using pseudo-tasks constructed by data augmentation, where different views of an image belong to the same pseudo-class. The meta-learned set representations are then used as an initialization, which leads to better generalization performance on meta-test tasks. We provide theoretical motivation for how our set representation can improve generalization in the next section, and a detailed discussion of the relationship to meta-learning in Appendix D.
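The linear evaluation procedure described in this section, initializing the classifier weight from per-class set representations and then minimizing the cross-entropy with weight decay, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: plain gradient descent stands in for the iterative algorithm A, the set encoder `phi` is passed in as a function (mean pooling in the usage note), and all names and hyperparameter defaults are hypothetical.

```python
import numpy as np

def log_softmax(q):
    """Row-wise log-softmax with the usual max-subtraction for stability."""
    q = q - q.max(axis=1, keepdims=True)
    return q - np.log(np.exp(q).sum(axis=1, keepdims=True))

def fit_linear_head(h_support, y_support, n_way, phi, steps=100, lr=0.1, weight_decay=1e-3):
    """Initialize W0 = [s_1 ... s_N]^T with s_c = phi(H^s_c), then minimize cross-entropy.

    Assumption: gradient descent with an L2 penalty replaces the optimizer A.
    h_support has shape (N*S, d) and contains frozen base-encoder features.
    """
    # One row per class, computed from the support instances of that class.
    w = np.stack([phi(h_support[y_support == c]) for c in range(n_way)])  # (N, d)
    onehot = np.eye(n_way)[y_support]
    for _ in range(steps):
        p = np.exp(log_softmax(h_support @ w.T))                  # class probabilities
        grad = (p - onehot).T @ h_support / len(h_support)        # gradient of the cross-entropy
        w -= lr * (grad + weight_decay * w)                       # weight decay as an L2 penalty
    return w

def predict(w, h_query):
    """Predict the class with the largest logit for each query feature."""
    return (h_query @ w.T).argmax(axis=1)
```

As a usage example, `fit_linear_head(h_s, y_s, n_way=5, phi=lambda H: H.mean(axis=0))` starts from class-mean prototypes. Swapping in a learned set encoder for `phi` changes only the initialization, which is the quantity the analysis in the next section focuses on.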

3.4. THEORETICAL MOTIVATION

In this section, we provide theoretical motivation for our algorithmic design. In Appendix C.1, we show that the proposed method is equivalent to metric-based inference with fine-tuning of the class prototypes $s_c$, where the initial class prototypes are obtained from the set representation as $s_c = \phi(H_c^s)$ and each input $x$ is represented by its instance-level representation $f(x)$. Thus, in the following, we discuss how such metric-based inference behaves with respect to the supervised loss in the downstream task.

To obtain theoretical insights, this section focuses on binary classification without the head $g$ and considers the following abstract data-generating process: each of the unknown labels $y^{+}$ and $y^{-}$ is drawn independently from a uniform distribution $U$ on $\{1, 2\}$; then each of the unlabeled positive examples $x^{+}$ and $x^{++}$ is drawn from the conditional distribution $D_{y^{+}}$ given the label $y^{+}$, while the negative example $x^{-}$ is drawn from the conditional distribution $D_{y^{-}}$. Accordingly, this hidden process forms the joint distribution $D(x, y) = D_y(x) U(y)$. In this setting, we can write the contrastive unsupervised loss $L_{\mathrm{SimCLR}}$ of the representation $f$ and the corresponding supervised loss $L_s^t$ of our classifier $W_t f$ as

$$L_{\mathrm{SimCLR}}(f) = \mathbb{E}_{y^{+} \sim U,\, y^{-} \sim U}\, \mathbb{E}_{x^{+}, x^{++} \sim D_{y^{+}}^{2},\, x^{-} \sim D_{y^{-}}} \left[ -\log \frac{\exp\big(f(x^{++})^{\top} f(x^{+})\big)}{\exp\big(f(x^{++})^{\top} f(x^{+})\big) + \exp\big(f(x^{++})^{\top} f(x^{-})\big)} \right]$$

and

$$L_s^t(f) = \mathbb{E}_{(x, y) \sim D}\big[\ell(W_t f(x), y)\big],$$

where $\ell(q, y) = -\log\big(\exp(q_y) / \sum_{k=1}^{2} \exp(q_k)\big)$ and the matrix $W_t \in \mathbb{R}^{2 \times d}$ is defined by $W_t = [\phi(H_1^s) + (\vec{\Delta}_t)_1,\ \phi(H_2^s) + (\vec{\Delta}_t)_2]^{\top}$. Here, $\vec{\Delta}_t = [(\vec{\Delta}_t)_1, (\vec{\Delta}_t)_2]^{\top} = W_t - W_0 \in \mathbb{R}^{2 \times d}$ is the update added during training with the support set. Importantly, $y^{+}$ and $y^{-}$ can coincide, i.e., $y^{+} = y^{-}$, since we do not know the true labels in the unsupervised loss. This is reflected by the fact that they are sampled from the same (unknown) probability measure $U$ on labels.

We define the training loss $\hat{L}_s^t(f) = \frac{1}{|D^s|} \sum_{(x_i^s, y_i^s) \in D^s} \ell(W_t f(x_i^s), y_i^s)$ and the corresponding training loss with average pooling (instead of our set representation) by $\hat{L}_s^A(f) = \frac{1}{|D^s|} \sum_{(x_i^s, y_i^s) \in D^s} \ell(A f(x_i^s), y_i^s)$, where $A = [\mathbb{E}_{x \sim D_1}[f(x)],\ \mathbb{E}_{x \sim D_2}[f(x)]]^{\top} \in \mathbb{R}^{2 \times d}$. Define the probability of $y^{+}$ and $y^{-}$ being the same by $P(y^{+} = y^{-}) = \mathbb{E}_{y^{+}, y^{-} \sim U^{2}}[\mathbb{1}\{y^{+} = y^{-}\}]$. Similarly, $P(y^{+} \neq y^{-}) = \mathbb{E}_{y^{+}, y^{-} \sim U^{2}}[\mathbb{1}\{y^{+} \neq y^{-}\}]$. We define $c = P(y^{+} \neq y^{-})^{-1}$ and $\zeta = c \cdot P(y^{+} = y^{-}) \log(2)$. Let $L_{\ell}$ be the Lipschitz constant of $\ell$ with respect to its first argument. Let $C_{\ell}$ be an upper bound on $\ell$. Define $C_f = \mathbb{E}_x[\|f(x)\|_2^2]$. The following theorem provides an upper bound on the expected supervised loss $L_s^t(f)$ in the downstream task:

Theorem 1. Let $\Delta_t \in \mathbb{R}_{\geq 0}$ and suppose that $W_t$ satisfies $\|W_t - W_0\|_F \leq \Delta_t$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,

$$L_s^t(f) \leq c\, L_{\mathrm{SimCLR}}(f) - \zeta \log(2) - \hat{\gamma}_t + \Delta_t \sqrt{\frac{16 L_{\ell}^{2} C_f}{|D^s|}} + 2 C_{\ell} \sqrt{\frac{\ln(2/\delta)}{2 |D^s|}},$$

where $\hat{\gamma}_t = \hat{L}_s^A(f) - \hat{L}_s^t(f)$ and $\|\cdot\|_F$ denotes the Frobenius norm.

The proof is deferred to Appendix C.2. Theorem 1 shows that we can reduce the upper bound on the expected supervised loss $L_s^t(f)$ by minimizing the contrastive loss $L_{\mathrm{SimCLR}}(f)$ and the training loss $\hat{L}_s^t(f)$. As we increase $t \in \mathbb{N}_0$, the value of $\hat{\gamma}_t = \hat{L}_s^A(f) - \hat{L}_s^t(f)$ increases, since $\hat{L}_s^t(f)$ decreases in $t$ while $\hat{L}_s^A(f)$ is constant in $t$. However, increasing $t$ can also increase $\Delta_t$ in Theorem 1. Thus, there is a trade-off between $\hat{\gamma}_t$ and $\Delta_t$. At $t = 0$, we have $\Delta_t = 0$. As we increase $t$, both $\hat{\gamma}_t$ and $\Delta_t$ tend to increase. If $|D^s|$ is very large, then an optimal strategy would be to increase $t$ towards infinity, because the term involving $\Delta_t$ is $O(\Delta_t / \sqrt{|D^s|})$. However, when $|D^s|$ is small, we do not want to increase $\Delta_t$ too much. Thus, Theorem 1 predicts that we should conduct fine-tuning to control the trade-off between $\hat{\gamma}_t$ and $\Delta_t$, starting from the initialization obtained through the unsupervised meta-learning step.
We can see from the definition of $\Delta_t$ that the initialization matters for keeping $\Delta_t$ small while increasing $\hat{\gamma}_t$.
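As a small worked example of the constants in Theorem 1, the computation below evaluates c and ζ for the binary setting of this section. It assumes that c is read as the reciprocal of P(y+ ≠ y-) and uses only the stated assumption that y+ and y- are drawn independently and uniformly from {1, 2}.

```latex
\begin{aligned}
P(y^{+} = y^{-}) &= \sum_{k=1}^{2} U(k)^{2} = \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2}, \\
P(y^{+} \neq y^{-}) &= 1 - \tfrac{1}{2} = \tfrac{1}{2}, \\
c &= P(y^{+} \neq y^{-})^{-1} = 2, \\
\zeta &= c \cdot P(y^{+} = y^{-}) \log(2) = 2 \cdot \tfrac{1}{2} \cdot \log(2) = \log(2) \approx 0.693 .
\end{aligned}
```

Under this reading, the contrastive loss enters the bound of Theorem 1 with coefficient c = 2.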

4. EXPERIMENT

In this section, we empirically validate the effectiveness of our set representation learning framework on several downstream few-shot classification tasks, and compare Set-SimCLR against UML baselines and instance-level self-supervised baselines in Sections 4.1 and 4.2, respectively.

4.1. COMPARISON TO UNSUPERVISED META-LEARNING

Dataset. We use the Mini-ImageNet dataset introduced by Ravi & Larochelle (2017), which is a subset of ILSVRC-2012 (Deng et al., 2009). It consists of 100 classes, and each class contains 600 images. We use a resolution of 3 × 84 × 84, which is widely used in the meta-learning literature. We use 64 classes for unsupervised meta-training, 16 classes for meta-validation, and the remaining 20 classes for meta-test. Following the standard protocol of unsupervised meta-learning, we evaluate our method on 1000 randomly sampled tasks from the meta-test set.

Baselines. We compare Set-SimCLR with four UML methods as baselines: 1) CACTUs (Hsu et al., 2019), 2) UMTRA (Khodadadeh et al., 2019), 3) LASIUM (Khodadadeh et al., 2021) and 4) Meta-GMVAE (Lee et al., 2021). In addition, we provide the performance of two supervised meta-learning methods as "oracles": MAML (oracle) (Finn et al., 2017) and ProtoNets (oracle) (Snell et al., 2017). A detailed explanation of the baselines is given in Appendix F. We report the mean and standard deviation of accuracy evaluated on 1000 episodes over 5 different runs for our method. Note that we take the accuracy of the baselines from previous works (Khodadadeh et al., 2021; Lee et al., 2021).

[Table 1: Results of 5-way 1-, 5-, 20- and 50-shot classification on Mini-ImageNet for the baselines and Set-SimCLR.]

Implementation Details. We use the Conv5 architecture as the base encoder for a fair comparison. We provide the details of the neural architectures for the base encoder, set encoder and head in Appendix H. We follow SimCLR (Chen et al., 2020a; b) for random augmentation, which is detailed in Appendix J. We apply the composed augmentations eight times to each of the 64 images in a mini-batch (i.e., M = 64, V = 8), resulting in 4 elements in each set. We optimize the base encoder, set encoder and head network for 400 epochs using the Adam optimizer (Kingma & Ba, 2015) with default settings (i.e., β1 = 0.9 and β2 = 0.999). We use a constant learning rate of 0.001. For downstream tasks, we use the L-BFGS algorithm (Liu & Nocedal, 1989) implemented in the scikit-learn package (Pedregosa et al., 2011) to optimize a linear classifier.

Results. Table 1 shows the performance of the baselines and Set-SimCLR for 5-way 1-, 5-, 20- and 50-shot classification on the Mini-ImageNet dataset, where Set-SimCLR outperforms all the baselines by considerable margins. For instance, it achieves improvements of +0.54%, +2.95%, +3.95%, and +4.27% over the best-performing baseline in the 1-, 5-, 20- and 50-shot settings, respectively. Notably, the performance gain of Set-SimCLR over the baselines grows as we increase the number of instances in the support set, i.e., the number of shots. We observe a similar pattern when comparing the MAML and ProtoNets variants within the baselines, e.g., CACTUs-MAML vs. CACTUs-ProtoNets and LASIUM-MAML-RO/N vs. LASIUM-ProtoNets-RO/N. This is because adaptation with the support set at meta-test becomes more effective, since the model is less likely to overfit when more shots are available.

4.2. COMPARISON TO SELF-SUPERVISED LEARNING (SSL)

Dataset. We use the Mini-ImageNet dataset for training and evaluating models. Further, to verify the effectiveness of the proposed method in transfer learning scenarios, we evaluate the models trained on Mini-ImageNet on the conventional meta-test splits of the Tiny-ImageNet (Le & Yang, 2015), CIFAR100 (Krizhevsky et al., 2009), Aircraft (Maji et al., 2013), Stanford Cars (Krause et al., 2013) and CUB (Wah et al., 2011) datasets. See Appendix E for the number of classes in the meta-splits of each dataset. Since all the models are trained on 84 × 84 images from the source dataset Mini-ImageNet, we resize the images to 84 × 84 resolution for all the target datasets. Following the UML literature, we evaluate our method on 1,000 randomly sampled tasks from the meta-test set.

Baselines. Although there are a vast number of SSL methods, in this work we want to show the effectiveness of self-supervised set representation learning compared to learning instance representations alone. Thus, we choose the following four representative SSL baselines: 1) SimCLR (Chen et al., 2020a; b), 2) MOCO (He et al., 2020; Chen et al., 2020c; 2021), 3) BYOL (Grill et al., 2020), and 4) Barlow Twins (Zbontar et al., 2021). All the details are deferred to Appendix F. Note that we have also tried a very recent SSL method, MAE (He et al., 2022); however, it fails to achieve performance comparable to ours and the baselines. Please see the details in Appendix G.

Implementation Details. For the base encoder f, we use the ResNet-18 architecture (He et al., 2016), which is widely used for evaluating self-supervised learning methods. For a fair comparison, we use the same architecture for the head network g for all SSL methods except MOCO, since MOCO does not use the head.
For our method Set-SimCLR, we apply the augmentations (defined in Appendix J) 8 times to a mini-batch of 64 images (i.e., M = 64, V = 8), resulting in 4 elements in each set, while for the baselines we apply the same augmentations twice to a mini-batch of 256 images (i.e., M = 256, V = 2). Following the SSL literature, we train a linear classifier for downstream tasks using the scikit-learn package with default settings. We provide more implementation details in Appendix I.

Results. Figure 2a shows the 5-way 5-shot results of all the models on the Mini-ImageNet, Tiny-ImageNet, CIFAR100, Aircraft, Stanford Cars and CUB datasets. Set-SimCLR outperforms all the SSL baselines, with margins ranging from +0.17% to +2.71%. The results of Set-SimCLR in the transfer learning scenario, one of the important goals of self-supervised learning methods, are particularly remarkable. We further evaluate our method and the baselines over various shots, namely 1-, 10-, 20- and 50-shot, on Mini-ImageNet. As shown in Figure 2b, Set-SimCLR obtains performance gains of 7.31%, 2.71%, 2.02% and 1.96% over the best-performing baselines in the 1-, 10-, 20- and 50-shot settings, respectively. Notably, the performance gain of Set-SimCLR is much larger in the 1-shot setting than in the other settings. This shows that the SSL baselines are vulnerable to overfitting to the single shot. In contrast, the classifiers obtained by Set-SimCLR are much more robust in the 1-shot setting due to the initialization with learned set representations.

4.3. ABLATION STUDY AND ANALYSIS

In this subsection, we conduct ablation studies to verify the necessity of each component. We further provide an analysis of Set-SimCLR in comparison to the SSL baselines.

Set Encoder Architecture. We replace the architecture of the set encoder described in Eq. 2 with mean pooling, Deep Set (Zaheer et al., 2017), Rep the Set (Skianis et al., 2020), or Set Transformer (Lee et al., 2019). Figure 3a shows the 5-way 5-shot test accuracy of the different set encoder architectures on the Mini-ImageNet dataset. We find that the Rep the Set architecture works well in the 1-shot setting, and our set encoder φ in Eq. 2 shows slightly better performance than the others in the 5-, 20- and 50-shot settings. Note that Set-SimCLR is a set representation learning framework that is agnostic to the choice of set encoder architecture. Furthermore, even with the simplest architecture (mean pooling), it still shows slightly better performance than the best-performing self-supervised baseline (SimCLR), which is denoted by dotted lines.

The Depth of Set Encoder. Another important component of our model is the number of TransformerEncoder layers in Eq. 2. We start without a TransformerEncoder layer, i.e., with the identity function, and then increase the depth of the set encoder. Figure 3b shows the 5-way 5-shot test accuracy on the Mini-ImageNet dataset for varying numbers of layers. We find that the set encoder with a single layer is the most effective overall once the computational cost of extra layers is taken into account. Note that all of our models with different numbers of layers outperform the best-performing self-supervised baseline (SimCLR), which is denoted by dotted lines.

Cardinality of Set

To study the effect of the number of set elements on Set-SimCLR, we plot the 5-way 5-shot test accuracy improvement over SimCLR, denoted as ∆ Test Acc., as a function of the cardinality of the set. As shown in Figure 3c, the downstream performance is not sensitive to the size of the sets, and the improvement over SimCLR is consistent across all the cardinalities we consider.

Training Budgets Analysis

It takes approximately twice as long to train our Set-SimCLR as the baselines, since it requires multiple stochastic augmentations to construct a set (please see the wall-clock time in Appendix K). One may then wonder whether the baselines become comparable or even better if we train them with a computational budget similar to ours. To address this question, we train the self-supervised learning baselines for 800 epochs, twice as many as before, and observe the test accuracy over the course of training. Figure 4a shows the 5-way 5-shot test accuracy of the self-supervised learning baselines on the Mini-ImageNet dataset. We find that our Set-SimCLR evaluated at 400 epochs largely outperforms the self-supervised baselines for all the training budgets we consider.

Qualitative Analysis on Adaptation of Set Representation

We now qualitatively analyze the adaptation process of our set representation at meta-test time. To do so, we visualize the set representations before and after the adaptation (i.e., each row of the classifier weights W_0 and W^*), together with instances from the support and query sets. We normalize all the examples to unit length and project them to 2D space with t-SNE (Van der Maaten & Hinton, 2008). Figure 4b shows the instance representations from the query set, the instance representations from the support set, and the set representations, denoted as circles, crosses, and stars, respectively. Arrows represent the adaptation process of the set representations, and colors indicate the classes. We find that the set representation is not very discriminative at the beginning; however, it represents each class very well after the adaptation. This shows the necessity of our proposed adaptation process of the set representation for achieving better performance on the downstream tasks.

Accuracy with Various Ways

Finally, we conduct experiments to show the performance of each model when varying the number of ways of the meta-test tasks. Figure 4c shows the 2-, 5-, 10-, and 20-way 5-shot test accuracy of the self-supervised learning baselines and our method on the Mini-ImageNet dataset. We find that our Set-SimCLR consistently outperforms the baselines for all the numbers of ways we consider.

5. CONCLUSION

In this paper, we proposed a self-supervised set representation learning framework for unsupervised meta-learning (UML). Our Set-SimCLR learns set representations by maximizing the agreement between positive sets in the latent space, where the positive sets are constructed with repeated stochastic augmentations of an image. Based on our theoretical analysis, we studied how the learned set representation can improve generalization and why it makes sense to initialize the weight of the linear classifier with the learned set representation for downstream tasks. We further validated the empirical efficacy of the proposed Set-SimCLR and compared it against UML and self-supervised baselines on several benchmark few-shot classification datasets. Note that our main idea of minimizing the distance between semantically similar sets constructed with repeated augmentations is not limited to the SimCLR framework. Based on this, we plan to extend our framework to various self-supervised learning methods to exploit their potential merits.

A TRANSFORMER ENCODER

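We describe the TransformerEncoder from Eq. 2 in more detail. Let $X \in \mathbb{R}^{n \times d}$ be a stack of $n$ $d$-dimensional row vectors. Let $W^Q_j, W^K_j, W^V_j \in \mathbb{R}^{d \times d_k}$ be weight matrices for self-attention and let $b^Q_j, b^K_j, b^V_j \in \mathbb{R}^{d_k}$ be bias vectors for $j = 1, \ldots, 4$. For encoding the input $X$, we compute self-attention as follows:

$$
\begin{aligned}
Q^{(j)} &= X W^Q_j + \mathbf{1} (b^Q_j)^\top \in \mathbb{R}^{n \times d_k} \\
K^{(j)} &= X W^K_j + \mathbf{1} (b^K_j)^\top \in \mathbb{R}^{n \times d_k} \\
V^{(j)} &= X W^V_j + \mathbf{1} (b^V_j)^\top \in \mathbb{R}^{n \times d_k} \\
A^{(j)}(X) &= \mathrm{LayerNorm}\left( Q^{(j)} + \mathrm{softmax}\left( Q^{(j)} (K^{(j)})^\top / \sqrt{d_k} \right) V^{(j)} \right) \in \mathbb{R}^{n \times d_k} \\
O(X) &= \mathrm{Concat}\left( A^{(1)}(X), \ldots, A^{(4)}(X) \right) \in \mathbb{R}^{n \times d}
\end{aligned}
\tag{6}
$$

where $\mathbf{1} = (1, \ldots, 1)^\top \in \mathbb{R}^n$ is a vector of ones, $d = 4 d_k$, and softmax is applied to each row. After self-attention, we add a skip connection with layer normalization (Ba et al., 2016) as follows:

$$
\mathrm{TransformerEncoder}(X) = \mathrm{LayerNorm}\left( O(X) + \mathrm{ReLU}\left( O(X) W + \mathbf{1} b^\top \right) \right)
$$

where $W \in \mathbb{R}^{d \times d}$ and $b \in \mathbb{R}^d$.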
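As a rough illustration of the layer described above, the following pure-Python sketch implements four attention heads followed by the feed-forward skip connection. It makes several assumptions that are not stated in the text: LayerNorm is applied without learnable scale and shift, the feed-forward term is computed as O(X)W so that the shapes match, and the weights are drawn at random for the demo. All function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def add_bias(m, bias):
    """Add the same bias row to every row (the 1 b^T term)."""
    return [[x + b for x, b in zip(row, bias)] for row in m]


def softmax_rows(m):
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def layer_norm(m, eps=1e-5):
    """Row-wise layer normalization without learnable parameters (assumption)."""
    out = []
    for row in m:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out


def attention_head(x, wq, bq, wk, bk, wv, bv):
    q = add_bias(matmul(x, wq), bq)
    k = add_bias(matmul(x, wk), bk)
    v = add_bias(matmul(x, wv), bv)
    k_t = [list(col) for col in zip(*k)]
    scale = math.sqrt(len(bq))  # sqrt(d_k)
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    return layer_norm(add(q, matmul(softmax_rows(scores), v)))


def transformer_encoder(x, heads, w, b):
    head_outputs = [attention_head(x, *params) for params in heads]
    # Concatenate the head outputs along the feature dimension.
    o = [sum((h[i] for h in head_outputs), []) for i in range(len(x))]
    ff = [[max(0.0, z) for z in row] for row in add_bias(matmul(o, w), b)]
    return layer_norm(add(o, ff))


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d, num_heads = 4, 8, 4
d_k = d // num_heads
x = rand_matrix(n, d, rng)
heads = [
    (rand_matrix(d, d_k, rng), [0.0] * d_k,
     rand_matrix(d, d_k, rng), [0.0] * d_k,
     rand_matrix(d, d_k, rng), [0.0] * d_k)
    for _ in range(num_heads)
]
w, b = rand_matrix(d, d, rng), [0.0] * d
out = transformer_encoder(x, heads, w, b)
out_reversed = transformer_encoder(x[::-1], heads, w, b)
print(len(out), len(out[0]))
```

Because no positional information is used, reordering the input rows only reorders the output rows, which is the property that makes this layer suitable for encoding sets.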

B ALGORITHM

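We provide the pseudo-code for Set-SimCLR described in Section 3.

Algorithm 1 Meta-Training for Set-SimCLR
Input: Batch size $M$, constant $\tau$, the number of augmentations $V$, augmentation $\mathcal{T}$, and unlabeled dataset $D_u$
while not converged do
    Sample a mini-batch $\{x_m\}_{m=1}^{M}$ from $D_u$
    for $m \leftarrow 1, \ldots, M$ do
        for $v \leftarrow 1, \ldots, V$ do
            Sample an augmentation function $t \sim \mathcal{T}$
            $x_{m,v} \leftarrow t(x_m)$
            $h_{m,v} \leftarrow f(x_{m,v})$
        end for
        $H_{m,1} \leftarrow \{h_{m,v}\}_{v=1}^{V/2}$
        $H_{m,2} \leftarrow \{h_{m,v}\}_{v=V/2+1}^{V}$
        $s_{m,1} \leftarrow \varphi(H_{m,1})$, $s_{m,2} \leftarrow \varphi(H_{m,2})$
    end for
    $L \leftarrow L_{\mathrm{SimCLR}}\left(\{(g(h_{m,1}), g(h_{m,2}))\}_{m=1}^{M}\right)$
    $L \mathrel{+}= L_{\mathrm{SimCLR}}\left(\{(g(s_{m,1}), g(s_{m,2}))\}_{m=1}^{M}\right)$
    $L \mathrel{+}= L_{\mathrm{SimCLR}}\left(\{(g(s_{m,1}), g(h_{m,2}))\}_{m=1}^{M}\right)$
    Perform gradient descent on the loss $L$ w.r.t. the parameters of $f$, $g$, and $\varphi$
end while
Output: $f$, $\varphi$

Algorithm 2 Meta-Test for Set-SimCLR
Input: Support set $D^s = \{(x^s_i, y^s_i)\}_{i=1}^{N \times S}$, query set $D^q = \{x^q_i\}_{i=1}^{N \times Q}$, a pretrained encoder $f$, and a pretrained set encoder $\varphi$
$W \leftarrow 0 \in \mathbb{R}^{N \times d}$
$U \leftarrow \{h^s_i = f(x^s_i) \in \mathbb{R}^d\}_{i=1}^{N \times S}$
$B \leftarrow |D^s|$
for $c \leftarrow 1, \ldots, N$ do
    $H^s_c \leftarrow \{h^s_i \in U \mid y^s_i = c\}$
    $s_c \leftarrow \varphi(H^s_c)$
    $W[c, :] \leftarrow s_c$
end for
while not converged do
    $L \leftarrow \frac{1}{B} \sum_{i=1}^{B} -\log \frac{\exp\left((W h^s_i)_{y^s_i}\right)}{\sum_{k=1}^{N} \exp\left((W h^s_i)_k\right)}$
    Update $W$ to minimize $L$ with L-BFGS (Liu & Nocedal, 1989)
end while
for $i \leftarrow 1, \ldots, N \times Q$ do
    $y^q_i \leftarrow \arg\max_c \left(W f(x^q_i)\right)_c$
end for
Output: $\{y^q_i\}_{i=1}^{N \times Q}$

THE RELATIONSHIP WITH METRIC-BASED INFERENCE

The metric-based inference using the instance-level representation $h$ of $x$ with the class prototypes $s_c$ can be written as $\hat{y}(x) = \arg\min_c d(s_c, h)$. By choosing the metric to be the negative dot product, $d(s_c, h) = -s_c^\top h$, we can write

$$
\begin{aligned}
\hat{y}(x) &= \arg\min_c \, -s_c^\top h \\
&= \arg\min_c \, -\log \frac{\exp(s_c^\top h)}{\sum_k \exp(s_k^\top h)} \\
&= \arg\min_c \, \ell\left(\mathrm{softmax}(W_0 f(x)), c\right),
\end{aligned}
$$

where the second line follows from the fact that the output of $\arg\min_c$ does not change when we add a term that is constant in $c$. In other words, the prediction of the metric-based inference with $d(s_c, h) = -s_c^\top h$ (without further fine-tuning) is equivalent to that of the proposed method at the initialization $W_0$.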
Thus, the proposed method can be understood as metric-based inference with fine-tuning of the class prototypes $s_c$ on the support set, where the initial class prototypes $s_c$ are obtained from the set representation and each input $x$ is represented by its instance-level representation $h$. In this view, a naive approach to fine-tuning the class prototypes $s_c$ is to fine-tune the parameters of $\varphi$ to minimize $-\log \frac{\exp(s_c^\top h)}{\sum_k \exp(s_k^\top h)}$ on the support set, where $s_c = \varphi(H_c)$. However, since $\varphi$ has many parameters, changing $\varphi$ allows $s_c = \varphi(H_c)$ to change freely, without restrictions on the space of $s_c$. Thus, instead of fine-tuning the parameters of $\varphi$, we can directly optimize the values of $s_c$ by initializing $s_c = \varphi(H_c)$ and untying $s_c$ from $\varphi(H_c)$. This is what is done in the proposed algorithm. It results in faster computation and a well-behaved convex optimization problem compared to fine-tuning the parameters of $\varphi$.
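The equivalence and the fine-tuning step discussed above can be checked numerically on a toy task. The sketch below is not the authors' implementation: mean pooling stands in for the learned set encoder φ, plain gradient descent replaces L-BFGS, and the 2-way support features are made up for illustration. All names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(vectors):
    """Stand-in for the set encoder (assumption: the paper learns this)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def init_prototypes(features, labels, num_classes, set_encoder=mean_pool):
    """Row c of W_0 is the set representation of the class-c support features."""
    return [
        set_encoder([h for h, y in zip(features, labels) if y == c])
        for c in range(num_classes)
    ]


def predict_metric(W, h):
    """argmin_c d(s_c, h) with d(s_c, h) = -s_c^T h."""
    return min(range(len(W)), key=lambda c: -dot(W[c], h))


def predict_softmax(W, h):
    probs = softmax([dot(s, h) for s in W])
    return max(range(len(W)), key=lambda c: probs[c])


def cross_entropy(W, features, labels):
    losses = [-math.log(softmax([dot(s, h) for s in W])[y])
              for h, y in zip(features, labels)]
    return sum(losses) / len(losses)


def fine_tune(W, features, labels, steps=100, lr=0.5):
    """Gradient descent on the rows of W only; the set encoder is untied."""
    W = [row[:] for row in W]
    for _ in range(steps):
        grad = [[0.0] * len(W[0]) for _ in W]
        for h, y in zip(features, labels):
            probs = softmax([dot(s, h) for s in W])
            for c in range(len(W)):
                coeff = (probs[c] - (1.0 if c == y else 0.0)) / len(features)
                for j, h_j in enumerate(h):
                    grad[c][j] += coeff * h_j
        W = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W, grad)]
    return W


# Toy 2-way 2-shot support set with 2-dimensional features (made-up values).
features = [[1.0, 0.2], [0.9, 0.1], [0.1, 1.0], [0.2, 0.8]]
labels = [0, 0, 1, 1]

W0 = init_prototypes(features, labels, num_classes=2)
Wt = fine_tune(W0, features, labels)
loss_before = cross_entropy(W0, features, labels)
loss_after = cross_entropy(Wt, features, labels)
print(loss_before, loss_after)
```

Since only the rows of W are optimized, the objective is the usual convex softmax regression loss, so each sufficiently small gradient step cannot increase it.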

C.2 PROOF OF THEOREM 1

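Proof. We define the performance difference between the average pooling and our set representation in terms of the expected loss by $\gamma = \mathcal{L}^A_s(f) - \mathcal{L}^t_s(f)$. Define the function $\ell$ by $\ell(q) = \log(1 + \exp(-q))$. Then,

$$
\begin{aligned}
\ell\left(f(x^{++})^\top (f(x^+) - f(x^-))\right)
&= \log\left(1 + \exp\left(-f(x^{++})^\top (f(x^+) - f(x^-))\right)\right) \\
&= \log\left(\left[1 + \exp\left(-f(x^{++})^\top f(x^+) + f(x^{++})^\top f(x^-)\right)\right] \times \frac{\exp(f(x^{++})^\top f(x^+))}{\exp(f(x^{++})^\top f(x^+))}\right) \\
&= \log \frac{\exp(f(x^{++})^\top f(x^+)) + \exp(f(x^{++})^\top f(x^-))}{\exp(f(x^{++})^\top f(x^+))} \\
&= -\log \frac{\exp(f(x^{++})^\top f(x^+))}{\exp(f(x^{++})^\top f(x^+)) + \exp(f(x^{++})^\top f(x^-))}.
\end{aligned}
$$

Thus, we have that

$$
\mathcal{L}_{\mathrm{SimCLR}}(f) = \mathbb{E}_{y^+ \sim U,\, y^- \sim U} \, \mathbb{E}_{x^+, x^{++} \sim \mathcal{D}^2_{y^+},\, x^- \sim \mathcal{D}_{y^-}} \left[ \ell\left(f(x^{++})^\top (f(x^+) - f(x^-))\right) \right].
$$

Then, from the convexity of $\ell$, Jensen's inequality, and the linearity of the expectation, we have that

$$
\mathcal{L}_{\mathrm{SimCLR}}(f) \ge \mathbb{E}_{y^+ \sim U,\, y^- \sim U} \, \mathbb{E}_{x^{++} \sim \mathcal{D}_{y^+}} \left[ \ell\left(f(x^{++})^\top \left(\mathbb{E}_{x^+ \sim \mathcal{D}_{y^+}}[f(x^+)] - \mathbb{E}_{x^- \sim \mathcal{D}_{y^-}}[f(x^-)]\right)\right) \right].
$$

By decomposing the expectation into a sum of conditional expectations, conditioned on the event $y^+ = y^-$ and its complement $y^+ \neq y^-$,

$$
\mathcal{L}_{\mathrm{SimCLR}}(f) \ge P(y^+ = y^-)\,\kappa_1 + P(y^+ \neq y^-)\,\kappa_2 \tag{8}
$$

where

$$
\kappa_1 = \mathbb{E}_{y^+ \sim U,\, y^- \sim U} \, \mathbb{E}_{x^{++} \sim \mathcal{D}_{y^+}} \left[ \ell\left(f(x^{++})^\top \left(\mathbb{E}_{x^+ \sim \mathcal{D}_{y^+}}[f(x^+)] - \mathbb{E}_{x^- \sim \mathcal{D}_{y^-}}[f(x^-)]\right)\right) \,\middle|\, y^+ = y^- \right]
$$

and

$$
\kappa_2 = \mathbb{E}_{y^+ \sim U,\, y^- \sim U} \, \mathbb{E}_{x^{++} \sim \mathcal{D}_{y^+}} \left[ \ell\left(f(x^{++})^\top \left(\mathbb{E}_{x^+ \sim \mathcal{D}_{y^+}}[f(x^+)] - \mathbb{E}_{x^- \sim \mathcal{D}_{y^-}}[f(x^-)]\right)\right) \,\middle|\, y^+ \neq y^- \right].
$$

For the first term, since $y^+ = y^-$ inside the loss $\ell$, we have that

$$
\kappa_1 = \ell(0) = \log(2). \tag{9}
$$

For the second term,

$$
\begin{aligned}
\kappa_2 &= \mathbb{E}_{y^+ \sim U} \, \mathbb{E}_{x^{++} \sim \mathcal{D}_{y^+}} \left[ \ell\left(f(x^{++})^\top \left(\mathbb{E}_{x^+ \sim \mathcal{D}_{y^+}}[f(x^+)] - \mathbb{E}_{x^- \sim \mathcal{D}_{\sigma(y^+)}}[f(x^-)]\right)\right) \right] \\
&= \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \ell\left(g_\varphi(x)_y - g_\varphi(x)_{\sigma(y)}\right) \right] + \gamma
\end{aligned}
\tag{10}
$$

where $g_\varphi(x) = W_t f(x)$, $g_A(x) = A f(x)$, $\gamma = \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\ell\left(g_A(x)_y - g_A(x)_{\sigma(y)}\right) - \ell\left(g_\varphi(x)_y - g_\varphi(x)_{\sigma(y)}\right)\right]$, and $\sigma$ is defined as

$$
\sigma(y) = \begin{cases} 1 & \text{if } y = 2 \\ 2 & \text{if } y = 1. \end{cases}
$$

We have that

$$
\begin{aligned}
\ell\left(g_\varphi(x)_y - g_\varphi(x)_{\sigma(y)}\right)
&= \log\left(\left[1 + \exp\left(-g_\varphi(x)_y + g_\varphi(x)_{\sigma(y)}\right)\right] \times \frac{\exp(g_\varphi(x)_y)}{\exp(g_\varphi(x)_y)}\right) \\
&= -\log \frac{\exp(g_\varphi(x)_y)}{\exp(g_\varphi(x)_y) + \exp(g_\varphi(x)_{\sigma(y)})} \\
&= -\log \frac{\exp(g_\varphi(x)_y)}{\sum_{k=1}^{2} \exp(g_\varphi(x)_k)}.
\end{aligned}
\tag{11}
$$

Similarly,

$$
\ell\left(g_A(x)_y - g_A(x)_{\sigma(y)}\right) = -\log \frac{\exp(g_A(x)_y)}{\sum_{k=1}^{2} \exp(g_A(x)_k)}. \tag{12}
$$

By combining equation 10, equation 11, and equation 12,

$$
\kappa_2 = \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ -\log \frac{\exp(g_\varphi(x)_y)}{\sum_{k=1}^{2} \exp(g_\varphi(x)_k)} \right] + \gamma. \tag{13}
$$

By combining equation 8, equation 9, and equation 13, we have

$$
\mathcal{L}_{\mathrm{SimCLR}}(f) \ge P(y^+ \neq y^-) \left( \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ -\log \frac{\exp(g_\varphi(x)_y)}{\sum_{k=1}^{2} \exp(g_\varphi(x)_k)} \right] + \gamma \right) + P(y^+ = y^-) \log(2).
$$

This implies that

$$
\mathcal{L}^t_s(f) \le c\,\mathcal{L}_{\mathrm{SimCLR}}(f) - \zeta \log(2) + \left(\mathcal{L}^t_s(f) - \mathcal{L}^A_s(f)\right).
$$

By using Hoeffding's inequality,

$$
P\left(\hat{\mathcal{L}}^A_s(f) - \mathcal{L}^A_s(f) \ge t\right) \le \exp\left(-\frac{2t^2}{\sum_{(x^s_i, y^s_i) \in D^s} \left(|D^s|^{-1} C_\ell\right)^2}\right) = \exp\left(-\frac{2t^2 |D^s|}{C_\ell^2}\right)
$$

for all $t > 0$. Note that $\mathbb{E}[\hat{\mathcal{L}}^A_s(f)] = \mathcal{L}^A_s(f)$. Let $\delta := \exp(-2t^2 |D^s| / C_\ell^2)$. Then we get $t = C_\ell \sqrt{\ln(1/\delta)\,(2|D^s|)^{-1}}$. In other words, for any $\delta > 0$, with probability at least $1 - \delta$,

$$
\hat{\mathcal{L}}^A_s(f) - \mathcal{L}^A_s(f) \le C_\ell \sqrt{\frac{\ln(1/\delta)}{2|D^s|}}.
$$

Thus, for any $\delta > 0$, with probability at least $1 - \delta$,

$$
\mathcal{L}^t_s(f) \le c\,\mathcal{L}_{\mathrm{SimCLR}}(f) - \zeta \log(2) + \left(\mathcal{L}^t_s(f) - \hat{\mathcal{L}}^A_s(f)\right) + C_\ell \sqrt{\frac{\ln(1/\delta)}{2|D^s|}}. \tag{14}
$$

Let $\mathcal{W}_t = \{W \in \mathbb{R}^{2 \times d} : \|W - W_0\|_F \le \Delta_t\}$. Then, since $W_t \in \mathcal{W}_t$ from the assumption on $W_t$, by using Lemma 4 of (Pham et al., 2021), for any $\delta > 0$, with probability at least $1 - \delta$, the following holds:

$$
\mathcal{L}^t_s(f) \le \hat{\mathcal{L}}^t_s(f) + 2\mathcal{R}_n(\mathcal{W}_t) + C_\ell \sqrt{\frac{\ln(1/\delta)}{2n}}, \tag{15}
$$

where $\mathcal{R}_n(\mathcal{W}_t) = \mathbb{E}_{s,\xi}\left[\sup_{W \in \mathcal{W}_t} \frac{1}{n} \sum_{i=1}^{n} \xi_i \ell(W f(x_i), y_i)\right]$, $s = ((x_i, y_i))_{i=1}^{n}$, $n = |D^s|$, and $\xi_1, \ldots, \xi_n$ are independent uniform random variables taking values in $\{-1, 1\}$. Given a matrix $M \in \mathbb{R}^{m \times m'}$, let $\mathrm{vec}[M] \in \mathbb{R}^{mm'}$ be the vectorization of $M$. By using Corollary 4 of (Maurer, 2016),

$$
\begin{aligned}
\mathcal{R}_n(\mathcal{W}_t)
&\le \frac{\sqrt{2} L_\ell}{n} \, \mathbb{E}_{s,\xi} \sup_{W \in \mathcal{W}_t} \sum_{i=1}^{n} \sum_{k=1}^{2} \xi_{ik} W_k f(x_i) \\
&= \frac{\sqrt{2} L_\ell}{n} \, \mathbb{E}_{s,\xi} \sup_{W \in \mathcal{W}_t} \sum_{k=1}^{2} W_k \sum_{i=1}^{n} \xi_{ik} f(x_i) \\
&= \frac{\sqrt{2} L_\ell}{n} \, \mathbb{E}_{s,\xi} \sup_{W \in \mathcal{W}_t} w^\top h
\end{aligned}
$$

where $W_k$ is the $k$-th row of $W$, $w = \mathrm{vec}[W^\top] \in \mathbb{R}^{2d}$, $\xi_{ik}$ are independent uniform random variables taking values in $\{-1, 1\}$, $h = \mathrm{vec}[H] \in \mathbb{R}^{2d}$, and $H \in \mathbb{R}^{d \times 2}$ with $H_{jk} = \sum_{i=1}^{n} \xi_{ik} f(x_i)_j$. Define $w_0 = \mathrm{vec}[W_0^\top]$. Since $\mathbb{E}_{s,\xi}[w_0^\top h] = w_0^\top \mathbb{E}_{s,\xi}[h] = 0$, we have

$$
\begin{aligned}
\mathcal{R}_n(\mathcal{W}_t)
&\le \frac{\sqrt{2} L_\ell}{n} \, \mathbb{E}_{s,\xi} \sup_{W \in \mathcal{W}_t} w^\top h \\
&= \frac{\sqrt{2} L_\ell}{n} \, \mathbb{E}_{s,\xi} \sup_{W \in \mathcal{W}_t} w^\top h - \frac{\sqrt{2} L_\ell}{n} \, \mathbb{E}_{s,\xi}\left[w_0^\top h\right] \\
&= \frac{\sqrt{2} L_\ell}{n} \, \mathbb{E}_{s,\xi} \sup_{W \in \mathcal{W}_t} (w - w_0)^\top h.
\end{aligned}
$$

Thus,

$$
\mathcal{R}_n(\mathcal{W}_t) \le \frac{\sqrt{2} L_\ell}{n} \, \mathbb{E}_{s,\xi} \sup_{W \in \mathcal{W}_t} \|w - w_0\|_2 \|h\|_2 = \frac{\sqrt{2} L_\ell \Delta_t}{n} \, \mathbb{E}_{s,\xi}\left[\|h\|_2\right].
$$

Here,

$$
\begin{aligned}
\mathbb{E}_{s,\xi}\left[\|h\|_2\right]
&= \mathbb{E}_{s,\xi} \sqrt{\sum_{j=1}^{d} \sum_{k=1}^{2} \left(\sum_{i=1}^{n} \xi_{ik} f(x_i)_j\right)^2}
\le \sqrt{\sum_{j=1}^{d} \sum_{k=1}^{2} \mathbb{E}_{s,\xi} \left(\sum_{i=1}^{n} \xi_{ik} f(x_i)_j\right)^2} \\
&= \sqrt{\sum_{j=1}^{d} \sum_{k=1}^{2} \mathbb{E}_{s} \sum_{i=1}^{n} \left(f(x_i)_j\right)^2}
= \sqrt{\sum_{k=1}^{2} \sum_{i=1}^{n} \mathbb{E}_{s} \sum_{j=1}^{d} \left(f(x_i)_j\right)^2} \\
&= \sqrt{\sum_{k=1}^{2} \sum_{i=1}^{n} \mathbb{E}_{s} \|f(x_i)\|_2^2}
\le \sqrt{2 C_f n}.
\end{aligned}
$$

Combining these, we have

$$
\mathcal{R}_n(\mathcal{W}_t) \le \frac{L_\ell \sqrt{4 C_f}\, \Delta_t}{\sqrt{n}}. \tag{16}
$$

Combining equation 14, equation 15, and equation 16 with union bounds, we have that for any $\delta > 0$, with probability at least $1 - \delta$,

$$
\mathcal{L}^t_s(f) \le c\,\mathcal{L}_{\mathrm{SimCLR}}(f) - \zeta \log(2) - \hat{\gamma}_t + \Delta_t \sqrt{\frac{16 L_\ell^2 C_f}{|D^s|}} + 2 C_\ell \sqrt{\frac{\ln(2/\delta)}{2|D^s|}}, \tag{17}
$$

where $\hat{\gamma}_t = \hat{\mathcal{L}}^A_s(f) - \hat{\mathcal{L}}^t_s(f)$.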
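A few of the algebraic steps in the proof are easy to sanity-check numerically. The sketch below verifies the identity between the logistic loss and the two-way negative log-softmax, the inversion of the Hoeffding tail bound, and the relation between the Rademacher bound and the corresponding term of the final inequality. The function names and the numeric values for the loss bound, Lipschitz constant, feature bound, and radius are arbitrary choices for illustration, not values from the paper.

```python
import math


def logistic_loss(q):
    """l(q) = log(1 + exp(-q)), as defined at the start of the proof."""
    return math.log(1.0 + math.exp(-q))


def neg_log_softmax_two_way(a, b):
    """-log(exp(a) / (exp(a) + exp(b)))."""
    return -math.log(math.exp(a) / (math.exp(a) + math.exp(b)))


def hoeffding_tail(t, loss_bound, n):
    """exp(-2 t^2 n / C^2): bound on the probability of a deviation of at least t."""
    return math.exp(-2.0 * t ** 2 * n / loss_bound ** 2)


def hoeffding_deviation(loss_bound, delta, n):
    """t = C * sqrt(ln(1/delta) / (2 n)), obtained by solving the tail bound for t."""
    return loss_bound * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def rademacher_bound(lipschitz, feature_bound, radius, n):
    """L * sqrt(4 C_f) * Delta_t / sqrt(n)."""
    return lipschitz * math.sqrt(4.0 * feature_bound) * radius / math.sqrt(n)


# Arbitrary illustrative values (assumptions, not taken from the paper).
a, b = 1.3, -0.4
C_l, L_l, C_f, Delta_t, n, delta = 2.0, 1.0, 1.0, 0.5, 25, 0.05

identity_gap = abs(logistic_loss(a - b) - neg_log_softmax_two_way(a, b))
t = hoeffding_deviation(C_l, delta, n)
recovered_delta = hoeffding_tail(t, C_l, n)
complexity_term = Delta_t * math.sqrt(16.0 * L_l ** 2 * C_f / n)

print(identity_gap, t, recovered_delta, complexity_term)
```

The last quantity equals twice the Rademacher bound, which is how the factor 16 under the square root in the final inequality arises.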

C.3 NUMERICAL EXPERIMENTS

This subsection aims to provide numerical evidence to support the assertion that the value of $\hat{\gamma}_t = \hat{\mathcal{L}}^A_s(f) - \hat{\mathcal{L}}^t_s(f)$ increases as we increase the value of $t \in \mathbb{N}_0$. Our experimental results, illustrated in Figure 5, show that this claim holds, and that the value of $\hat{\gamma}_t$ becomes positive and remains steady after a few iterations ($t = 10$) of optimizing $W_t$ on the support set $D^s$.

D CONNECTION TO META-LEARNING

Here we discuss the connection between our Set-SimCLR and meta-learning to clarify why our method can be seen as an unsupervised meta-learning method:

• First, we leverage data augmentation to construct pseudo-meta-tasks, where different views of an image belong to the same pseudo-class, and meta-learn the set encoder of Set-SimCLR. The set encoder minimizes the distance between positive pairs of set representations and repels negative pairs, where the set representation of the pseudo-class is considered to be a class prototype. In other words, the set encoder enlarges the inter-class distance so that the set representation of each class eventually leads to a good initialization of a linear classifier at meta-test time.

• There is a vast amount of existing meta-learning work proposing to meta-learn the initialization of linear classifiers (Raghu et al., 2019) or amortized neural networks that predict the weights of linear classifiers (Gordon et al., 2019; Iakovleva et al., 2020) by constructing meta-tasks and simulating the exact scenarios of meta-test. Similarly, we construct pseudo-meta-tasks and learn the initialization of linear classifiers by simulating meta-test; thus, the set encoder of our Set-SimCLR is indeed a meta-learner.

• Moreover, Ni et al. (2022) have already highlighted the close relationship between metric-based meta-learning (e.g., Prototypical Networks (Snell et al., 2017)) and contrastive self-supervised learning (Chen et al., 2020a). They claim that we can consider contrastive self-supervised learning as meta-learning, since sampling a mini-batch corresponds to sampling a meta-task and contrastive learning with a mini-batch is a B-way 1-shot classification problem, where B is the mini-batch size. Thus, our feature extractor f, which learns through instance- and set-level contrastive learning, is also a meta-learner.

In Table 2, we provide the number of classes for each meta-split of all the datasets we consider in this paper. Note that we only use the meta-test split of the Tiny-ImageNet, CIFAR100, Aircraft, Stanford Cars, and CUB datasets for the evaluation in Section 4.2.
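The B-way 1-shot view of contrastive learning discussed in Appendix D can be made concrete with a small sketch. It treats one view (or set representation) of each image as the single "shot" that defines a class prototype and the other view as the query, and computes a cross-entropy loss over the B candidates. This is a simplified one-directional InfoNCE loss, not the full SimCLR loss, which also uses negatives from within the same view; mean pooling again stands in for the learned set encoder, and the function names, temperature, and toy features are assumptions.

```python
import math


def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def b_way_one_shot_loss(queries, prototypes, temperature=0.5):
    """Cross-entropy of classifying query i into class i among B prototypes."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, p) / temperature for p in prototypes]
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]
    return total / len(queries)


def mean_pool(vectors):
    """Stand-in for the learned set encoder (assumption)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


# Two images (B = 2), four augmented views each, 2-dimensional toy features.
views = [
    [[1.0, 0.1], [0.9, 0.0], [1.1, 0.2], [0.8, 0.1]],
    [[0.1, 1.0], [0.0, 0.9], [0.2, 1.1], [0.1, 0.8]],
]
set_1 = [mean_pool(v[:2]) for v in views]  # first half of the views
set_2 = [mean_pool(v[2:]) for v in views]  # second half of the views

aligned_loss = b_way_one_shot_loss(set_1, set_2)
swapped_loss = b_way_one_shot_loss(set_1, set_2[::-1])
orthogonal_loss = b_way_one_shot_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[1.0, 0.0], [0.0, 1.0]])
print(aligned_loss, swapped_loss, orthogonal_loss)
```

With B = 2 and orthogonal unit vectors, the loss reduces to log(1 + exp(-1/τ)), i.e., the logistic loss used in the proof of Theorem 1 evaluated at 1/τ.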

F DETAILS OF BASELINES

In this section, we detail the supervised meta-learning, unsupervised meta-learning, and instance-level self-supervised learning baselines. We first introduce two supervised meta-learning approaches, which we consider as "oracles", and four different unsupervised meta-learning baselines:

1) MAML (oracle) (Finn et al., 2017): Model-Agnostic Meta-Learning, which learns an initialization of the model parameters such that a few steps of gradient descent on a support set lead to generalization on a query set. We compare against its performance reported in Hsu et al. (2019).

2) ProtoNets (oracle) (Snell et al., 2017): A Euclidean distance-based meta-learning framework. It learns a metric embedding space in which prediction is performed by computing the distance between class prototypes and instances from the query set. We also compare against its performance reported in Hsu et al. (2019).

3) CACTUs (Hsu et al., 2019): Clustering to Automatically Construct Tasks for Unsupervised meta-learning. It automatically constructs tasks by clustering the unlabeled dataset in an embedding space learned by ACAI (Berthelot et al., 2019), BiGAN (Donahue et al., 2017), or DeepCluster (Caron et al., 2018). It then trains either MAML or ProtoNets using the cluster indices as pseudo-labels.

4) UMTRA (Khodadadeh et al., 2019): Unsupervised Meta-learning with Tasks constructed by Random sampling and Augmentation. To construct a K-way 1-shot task, it randomly samples K data points from the unlabeled dataset and augments each data point. MAML is then trained on the constructed tasks.

5) LASIUM (Khodadadeh et al., 2021): It trains generative models on the given unlabeled data and samples N different latent vectors such that each pairwise distance is greater than a predefined threshold. Each latent vector is fed into the generative model and decoded into a training instance belonging to a distinct class. It then adds noise to each latent vector to generate S examples, which are labeled with the class of the original latent vector. Finally, it trains MAML or ProtoNets on the synthetic N-way S-shot task.

6) Meta-GMVAE (Lee et al., 2021): Meta-level Gaussian Mixture Variational AutoEncoder. It learns a latent representation by matching a set-level amortized variational posterior and a task-specific multi-modal prior optimized by the EM algorithm.

We then present the four representative self-supervised baselines used in our experiments:

1) SimCLR (Chen et al., 2020a; b): A contrastive learning framework that learns by maximizing the agreement between differently augmented views of the same data example in the latent space.

2) MOCO (He et al., 2020; Chen et al., 2020c; 2021): It builds a dynamic feature dictionary using a queue and a momentum encoder and learns by minimizing a contrastive loss over the dictionary.

3) BYOL (Grill et al., 2020): Given a pair of views of an image, it learns visual representations by matching the output of a momentum encoder, which is an exponential moving average of the encoder.

4) Barlow Twins (Zbontar et al., 2021): This method measures the cross-correlation matrix between the feature representations of two different views and learns by making it close to the identity matrix.
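The momentum encoder mentioned for MOCO and BYOL above is maintained as an exponential moving average of the online encoder's parameters. The following is a minimal sketch of that update on a flat list of parameters; the function name `ema_update` and the momentum value are illustrative assumptions, not settings taken from this paper.

```python
def ema_update(target_params, online_params, momentum=0.99):
    """Move each target parameter a small step toward the online parameter."""
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_params, online_params)
    ]


# Toy example: the target starts at zero and tracks a fixed online encoder.
online = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]

one_step = ema_update(target, online, momentum=0.9)

for _ in range(500):
    target = ema_update(target, online, momentum=0.9)

print(one_step, target)
```

A momentum close to 1 makes the target change slowly, which is the intended effect: the target encoder provides a more stable output for the online encoder to match.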

G MASKED AUTOENCODERS

Table 3: The hyperparameters of MAE, which yield a similar number of parameters to ResNet-18 (VIT: 12,782,080; ResNet-18: 11,176,512). The names of the hyperparameters follow the huggingface transformers library (Wolf et al., 2020).

MAE (He et al., 2022) is a recent self-supervised learning method based on a masked auto-encoding objective. We tried to use MAE as a baseline with the following experimental setup. It assumes VIT (Dosovitskiy et al., 2021) as the base encoder; therefore, we use the hyperparameters in Table 3, which yield a similar number of parameters to ResNet-18 (i.e., VIT: 12,782,080 and ResNet-18: 11,176,512). We use the huggingface transformers library (Wolf et al., 2020) for the implementation. Following the original implementation of MAE, we optimize MAE using AdamW (Loshchilov & Hutter, 2019) with a weight decay of 0.05 for 400 epochs. The mini-batch size is set to 512. We search for an adequate learning rate in {0.002, 0.001, 0.0005} using the meta-validation split. We use a cosine learning rate scheduler with 40 warm-up epochs. We use ResizedCrop and HorizontalFlip for augmentations. Table 4 shows the mean accuracy of our method, the self-supervised learning baselines, and MAE on the Mini-ImageNet 5-way few-shot classification tasks. We found that MAE fails to achieve comparable performance in our UML setting; therefore, we exclude it from the main text.

To understand the performance gain of Set-SimCLR step by step, we conduct an additional ablation study comparing the full Set-SimCLR model against SimCLR and against Set-SimCLR without the initialization of the classifier weight using set representations. Table 15 shows that Set-SimCLR without set initialization improves the generalization performance of the model trained with only the SimCLR loss by 1.44% ∼ 3.54%. Thus, this performance gain is a consequence of introducing the set-level loss. If we use the learned set representation to initialize the weight W (Set-SimCLR with set), we can further boost the performance of Set-SimCLR without set by 0.42% ∼ 4.2%. We further observe that the performance gain becomes larger for fewer shots. Therefore, learning a set representation with our proposed set-level loss is crucial for better generalization performance.

Table 16: 5-way N-shot Mini-ImageNet classification results of Set-SimCLR with the parameters W_t at different optimization steps (t = 0, 20, 100). The base encoder is ResNet-18. We report the mean and standard deviation of 5 runs with different random seeds.

In Table 16, we provide the performance on 5-way N-shot Mini-ImageNet with the parameters W_t at different optimization steps (t = 0, 20, 100) of fine-tuning. Although Set-SimCLR does not perform well at t = 0, it rapidly adapts to the support sets and reaches near-best accuracy at t = 20.



Figure 1: (Left): A conceptual illustration of Set-SimCLR with three images. We first encode each augmented image into instance representation using the base encoder f . Then we partition the set of V augmented images into two sets and obtain set representations with the set encoder φ. We finally compute set-and instance-level loss. We additionally minimize the cross loss in Eq. 3, which is abbreviated in this figure. (Right): At meta-test, we use set representation of each class as an initialization of linear classifier weight.

Figure 2: (a): 5-way 5-shot classification results on six datasets. (b): 5-way 1, 5, 20, 50-shot classification results on the Mini-ImageNet dataset. The base encoder is ResNet-18. We report the mean and standard deviation of accuracy evaluated on 1000 episodes with 5 different runs. See Appendix L for the results in tabular format.

Figure 3: The results of the ablation study on 5-way 1, 5, 20, 50-shot classification using Mini-ImageNet. We study the effectiveness of (a): different set encoder architectures, (b): the depth of TransformerEncoder layers, and (c): the number of set elements w.r.t. SimCLR. We report the results over 3 different random seeds.

Figure 4: Analysis of the proposed Set-SimCLR on the Mini-ImageNet dataset. (a): 5-way 5-shot test accuracy of baselines as a function of training epochs. (b): T-SNE visualization of our adaptation process on a 5-way 5-shot task. (c): 5-shot test accuracy of different ways. We report the results over 3 different runs.

Algorithm 1 Meta-Training for Set-SimCLR. Input: Batch size M, constant τ, the number of augmentations V, augmentation T, and unlabeled dataset D_u. While not converged, sample a mini-batch {x_m}_{m=1}^{M} from D_u.

THE RELATIONSHIP WITH METRIC-BASED INFERENCE The metric-based inference using the instance-level representation h of x with the class prototypes s c can be written by ŷ(x) = arg min c d(s c , h).

Figure 5: Plot of the value of $\hat{\gamma}_t = \hat{\mathcal{L}}^A_s(f) - \hat{\mathcal{L}}^t_s(f)$ as we update the weight $W_t$ with the support set $D^s$.

Results for 5-way S-shot classification on Mini-ImageNet. The base encoder is either Conv4 or Conv5.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pp. 38-45, 2020.

The number of classes for meta-split of all datasets.

Results for 5-way S-shot classification on Mini-ImageNet. We report the mean and standard deviation of accuracy evaluated on 1000 episodes with 5 different runs, except for MAE. For MAE, we report the mean accuracy of one run.

±0.31 67.08 ±0.26 76.51 ±0.23 80.14 ±0.45
MOCO 43.96 ±0.35 62.64 ±0.14 72.21 ±0.35 78.02 ±0.20
BYOL 45.59 ±1.57 64.19 ±1.29 73.97 ±1.26 76.55 ±1.63
Barlow Twins 45.12 ±0.19 63.44 ±0.27 72.13 ±0.27 75.92 ±0.25
Set-SimCLR (ours) 53.54 ±0.66 69.79 ±0.28 78.53 ±0.26 82.10 ±0.47
MAE with lr = 0.002 VIT in Table 3

±1.53 57.34 ±1.24 59.78 ±1.63 60.17 ±1.51
Set-SimCLR 20 52.94 ±0.42 69.22 ±0.24 78.27 ±0.24 81.90 ±0.22
Set-SimCLR 100 53.54 ±0.66 69.79 ±0.28 78.53 ±0.26 82.10 ±0.47

ACKNOWLEDGEMENTS

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(MSIT) (No.2019-0-00075, Artificial Intelligence Graduate School Program(KAIST)), Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(MSIT) (No.2022-0-00713), KAIST-NAVER Hypercreative AI Center, the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921), and Samsung Electronics (IO201214-08145-01).

REPRODUCIBILITY STATEMENT

We clearly specify implementation details for reproducibility, including the data split, baselines for comparison, neural architecture, training process, and augmentation, in Appendix E, F, H, I, and J. In the Supplementary File, we further provide the code for reproducing the main experimental results in Table 1 and Figure 2. Note that all the numerical results are based on more than three runs. Lastly, we will release our full code and model checkpoints publicly after acceptance.

H IMPLEMENTATION DETAILS OF SECTION 4.1

Output size | Layers
64 × 42 × 42 | Conv2d(3 × 3, stride = 1, pad = 1), BatchNorm2D, ReLU, Maxpool(2 × 2, stride = 2)
64 × 21 × 21 | Conv2d(3 × 3, stride = 1, pad = 1), BatchNorm2D, ReLU, Maxpool(2 × 2, stride = 2)
64 × 10 × 10 | Conv2d(3 × 3

We provide pytorch-like architecture implementations of the base encoder f, set encoder φ, and head g in Tables 5, 6, and 7, respectively. We follow SimCLR (Chen et al., 2020a; b) for random augmentation, which is detailed in Appendix J. We apply the composed augmentations to a mini-batch of 64 images eight times (i.e., M = 64, V = 8), resulting in 4 elements in each set. We optimize the base encoder, set encoder, and head network for 400 epochs using the Adam optimizer (Kingma & Ba, 2015) with default settings (i.e., β_1 = 0.9 and β_2 = 0.999). We use a constant learning rate of 0.001. For downstream tasks, we use the scikit-learn (Pedregosa et al., 2011) package with default settings to optimize a linear classifier.

I IMPLEMENTATION DETAILS OF SECTION 4.2

For the base encoder f, we use the ResNet-18 architecture. Please see the original paper (He et al., 2016) for implementation details. We provide pytorch-like architecture implementations of the set encoder φ and head g in Tables 8 and 9, respectively. For a fair comparison, we use the same architecture of the head network g in Table 9 for all self-supervised learning methods except MOCO. MOCO does not use the head, as first proposed in the original paper. We use the same random augmentations described in Appendix J. For our method, Set-SimCLR, we apply the augmentations 8 times to a mini-batch of 64 images (i.e., M = 64, V = 8), resulting in 4 elements in each set, while for the other baselines we apply the same augmentations twice to a mini-batch of 256 images (i.e., M = 256, V = 2). For all the methods, we optimize the models for 400 epochs using the Adam optimizer (Kingma & Ba, 2015) with default settings (i.e., β_1 = 0.9 and β_2 = 0.999). We do not use learning rate scheduling, since it was not effective for any method in our experiments. We search for an adequate learning rate in {0.001, 0.0005, 0.0001} for the baselines and our method using the meta-validation split. We provide the selected learning rate of each method in Table 10. We use the scikit-learn (Pedregosa et al., 2011) package with default settings to optimize classifiers for downstream tasks.

For random augmentation, we compose ResizedCrop, HorizontalFlip, ColorJitter, GrayScale, and GaussianBlur. The application probability and hyperparameters of each augmentation are shown in Table 11. Note that we perform ResizedCrop on images at a larger resolution of 224 × 224 than the target resolution of 84 × 84, which we found to be more effective. We implement the augmentations using the Kornia framework (Riba et al., 2020), which allows faster augmentation on the GPU.

Published as a conference paper at ICLR 2023

