A THEORY OF SELF-SUPERVISED FRAMEWORK FOR FEW-SHOT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Recently, self-supervised learning (SSL) algorithms have been applied to few-shot learning (FSL). FSL aims to distill transferable knowledge from existing classes with large-scale labeled data in order to cope with novel classes for which only a few labeled samples are available. Because the number of novel samples is limited, the initial embedding network becomes an essential component and can largely affect performance in practice. However, almost no work analyzes in theory why a pre-trained embedding network obtained with self-supervised training can provide good representations for downstream FSL tasks. In this paper, we first summarize supervised FSL methods and explain why SSL is suitable for FSL. We then analyze the main difference between supervised and self-supervised training for FSL and derive a bound on the gap between the self-supervised loss and the supervised loss. Finally, we propose potential ways to improve test accuracy under the self-supervised FSL setting.

1. INTRODUCTION

Recently, self-supervised learning (SSL) algorithms have been applied to few-shot learning (FSL). The purpose of FSL is to extract transferable knowledge from existing classes with large-scale labeled data in order to deal with novel classes for which only a few labeled samples are available. Because the number of novel samples is limited, the initial embedding network becomes an essential component and greatly affects performance. In practice, SSL greatly enhances the generalization of FSL methods and increases their potential for industrial application: once SSL and FSL are combined, we only need to collect a large amount of related unlabeled data and a few labeled samples on the new task to obtain a model that generalizes well to that task. In theory, however, it is difficult to analyze the performance of self-supervised pre-trained models on arbitrary downstream tasks, because a downstream task may itself involve a large amount of data whose distribution differs from the pre-training distribution (e.g., multi-view SSL). Moreover, downstream tasks and self-supervised pretext tasks may be quite different, such as classification versus segmentation, which further increases the difficulty of theoretical analysis. Returning to the purpose of SSL, which is to learn a good pre-trained model that can be transferred to different tasks, we find that FSL shares the same purpose: to obtain an initialization that achieves good results with a few samples on a new task using a simple classifier (such as a mean classifier). Thus, FSL tasks are well suited for evaluating the effect of SSL. The main existing line of research on when and why self-supervised methods improve FSL compares the performance of different self-supervised methods experimentally; almost no work analyzes in theory why a pre-trained embedding network obtained with self-supervised training can provide good representations for downstream FSL tasks. We believe such theoretical analysis is necessary.
For example, MoCo uses a momentum update to greatly expand the size of the key dictionary, thereby improving performance. But why does the dictionary need to be so large? Is a bigger batch size always better? SimCLR introduces a projection head for computing the contrastive loss instead of comparing representations directly. Why is this effective? Although self-supervised learning researchers have made great empirical progress, the analysis of why SSL works has stalled at experimental and empirical conclusions due to the lack of theoretical analysis. We therefore think it is necessary and useful to analyze self-supervised learning theoretically. We analyze the self-supervised training process via the specific application scenario of FSL. Under this setting, we avoid the complexity of general downstream tasks and can judge the quality of self-supervised learning directly by the performance on new few-shot tasks. Our main idea is to quantify the gap between self-supervised and supervised training on FSL tasks by constructing a supervised metric corresponding to the self-supervised task. We find that the self-supervised training loss is actually an upper bound on the supervised metric loss (Theorem 1). This means that if we can make the self-supervised loss small enough, we can control the model's supervised loss on the training data. Since FSL methods generalize well to similar downstream tasks, we conclude that self-supervised training can also generalize well to similar tasks, even when the categories of training and test tasks are different. Unfortunately, it is often difficult to fully minimize the self-supervised training loss. Contrastive SSL methods take different augmentations of the same sample as the query and positive key, and augmentations of other samples as negative keys; some of these "negative" keys in fact share the query's class and are false negatives.
The part of the training loss introduced by these false negative samples limits performance. We separate the negative samples in self-supervised training into true negatives and false negatives. For true negatives, we assume the loss can be made small enough with suitable models and optimizers. For false negatives, we bound the loss by the intra-class deviation; this is also the difference between self-supervised and supervised learning (Theorem 2). According to Theorem 2, we should control the intra-class deviation of these false negative samples during training. Finally, we discuss potential ways to improve test accuracy under the self-supervised FSL setting. First, a larger batch size helps, but only within a certain range. Second, increasing the number of support samples reduces the intra-class variance of false negative samples, which benefits test performance; technically, we use different augmentations of the same input as support samples of the same class. Third, we should choose unsupervised training data with a large number of categories, because more categories reduce the probability of sampling false negatives. We also discuss the limitations of our theory. Ideally, one would like to know whether a simple contrastive self-supervised framework can give representations competitive with those learned by supervised methods. We show that, under our two assumptions, one can obtain test performance close to that of supervised training. Experiments on Omniglot also support our theoretical analysis. For instance, the self-supervised framework reaches 98.23% accuracy for 5-way 5-shot classification on Omniglot, which is quite competitive with the 98.83% achieved by supervised MAML.

2.1. SUMMARY OF SUPERVISED FSL METHODS

In the typical few-shot scenario introduced by Vinyals et al. (2016), the model is presented with episodes composed of a support set and a query set. The support set contains examples of the categories into which we want to classify the queries. Typically, models are given five categories (5-way) and one (one-shot) or five (five-shot) images per category. During training, the model is fed these episodes and must learn to correctly label the query set given the support set. The category sets seen during training, validation, and testing are all disjoint; this way, we know the model is learning to adapt to new data rather than memorizing samples from the training set. Although most algorithms use episodes, different algorithm families differ in how the episodes are used to train the model. Recently, transfer learning approaches have become the new state of the art for few-shot classification. Methods like Gidaris & Komodakis (2018) pre-train a feature extractor and linear classifier in a first stage, remove the last FC layer, then freeze the feature extractor and train a new linear classifier on the novel samples in a fine-tuning stage. Due to their success and simplicity, transfer learning approaches have been named "Baseline" in two recent papers Chen et al. (2019); Dhillon et al. (2019). We denote the feature extractor of the first stage as $f_q$; the linear classifier of Chen et al. (2019) on novel samples in the fine-tuning stage is $y = f_q(x)^T W$, with $W = [w_1, w_2, \ldots, w_c] \in \mathbb{R}^{d \times c}$. The classifier in Chen et al. (2020b) is a mean classifier whose weights are the centroids of features from the same class: letting $S_c$ denote the few-shot support samples in class $c$, they have $w_c = \frac{1}{|S_c|}\sum_{x \in S_c} f_q(x)$. In this paper, we adopt a generalized mean classifier with $w_c = \frac{1}{|S_c|}\sum_{x \in S_c} f_k(x)$. Because of the arbitrariness and complexity of $f_k$, this generalized mean classifier is nearly an arbitrary linear classifier.
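As a concrete illustration, the generalized mean classifier above can be sketched in a few lines of Python (a toy sketch: `f_q` and `f_k` stand in for the actual embedding networks, and plain lists stand in for feature vectors):

```python
def mean_classifier_weights(support, f_k):
    # w_c = (1/|S_c|) * sum_{x in S_c} f_k(x): one prototype per class.
    weights = {}
    for c, samples in support.items():
        feats = [f_k(x) for x in samples]
        dim = len(feats[0])
        weights[c] = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    return weights

def classify(x, f_q, weights):
    # Predict the class whose prototype has the largest dot product with f_q(x).
    q = f_q(x)
    return max(weights, key=lambda c: sum(a * b for a, b in zip(q, weights[c])))
```

With $f_k = f_q$ this reduces to the ProtoNets-style mean classifier; with a more expressive $f_k$ it approaches an arbitrary linear classifier, as noted above.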

2.2. ASSUMPTIONS IN SELF-SUPERVISED FSL

Assumption 1: the mean classifier is good enough for evaluation. The effectiveness of self-supervision is usually analyzed via results on downstream tasks Liu et al. (2020); Jing & Tian (2020). But an additional, different downstream task usually makes theoretical analysis harder, and a complex classifier usually outperforms a mean classifier on most tasks Liu et al. (2020); Jing & Tian (2020), e.g., classification on ImageNet. We therefore consider a simpler way to analyze self-supervised learning: we assume that only a mean classifier is used for classification during the test phase. When the $f_k$ in the mean classifier is complex enough, it is equivalent to the linear classifier used in Baseline++. Even if $f_k$ is simply the same as $f_q$, this mean classifier is consistent with the classifier in metric-based FSL methods such as ProtoNets Snell et al. (2017). The mean classifier suits self-supervised FSL because of its good performance, simple form, convenient measurement, and wide applicability. We assume the mean classifier performs well enough during testing provided the supervised training loss is decreased sufficiently. This hypothesis has been verified on Baseline++ Chen et al. (2019), ProtoNet Snell et al. (2017), and other transfer-based FSL methods Dhillon et al. (2019); Li et al. (2019a); Hu et al. (2020). Our paper mainly focuses on analyzing the difference between the supervised and self-supervised training losses of the feature extractor $f_q$ and the mean classifier based on $f_k$, regardless of how the model generalizes from the training set to the test set; we simply assume that generalization continues to hold, based on existing supervised FSL methods.

Assumption 2

The training data is balanced across classes. The main work of this paper is to analyze the self-supervised loss. We assume each training sample is drawn from a certain data distribution, and we regard the sampling process as two steps: first randomly sample categories, then select samples from those categories for training. To analyze the difference between the self-supervised loss and the supervised loss, we mark these samples with their labels in the supervised dataset and then select one sample per class to construct a dynamic $N$-way 1-shot supervised training task. When Assumption 2 is satisfied, i.e., the number of samples in each class is the same, our theoretical results are relatively simple and clear. In practice, the assumption is also acceptable, because we can collect data by specifying keywords to ensure the data is roughly balanced. Note that whether this assumption is necessary is still worth further analysis: different self-supervised pretext tasks may come with different annotations, and in FSL the categories of the training and test sets do not overlap, so the balanced-data assumption may well be removable. This paper takes the assumption as given and does not analyze its necessity further. The above two assumptions are the basis of our subsequent analysis. Our theoretical analysis also applies to other scenarios satisfying the two assumptions, not just FSL; we chose self-supervised FSL for our experiments because these two assumptions are standard in FSL.
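The two-step sampling procedure described above (classes first, then samples) can be sketched as follows. This is a minimal illustration; `data_by_class` is a hypothetical dictionary mapping each label to its samples, and under Assumption 2 every class holds the same number of samples:

```python
import random

def sample_episode(data_by_class, n_way, k_shot, seed=None):
    # Step 1: draw N classes; Step 2: draw K samples from each drawn class.
    rng = random.Random(seed)
    classes = rng.sample(sorted(data_by_class), n_way)
    return {c: rng.sample(data_by_class[c], k_shot) for c in classes}
```

Setting `k_shot=1` yields the dynamic $N$-way 1-shot supervised task used in the analysis.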

3.1. CONTRASTIVE SELF-SUPERVISED TRAINING FRAMEWORK

Given an unlabeled dataset $A$, a self-supervised training method such as MoCo He et al. (2019) creates many synthetic query-key matching tasks on the fly by randomly sampling $N_K$ data at a time and augmenting them. The basic assumption is that two synthetic data points $x^+ = \mathrm{Aug}(x)$ and $x_q = \mathrm{Aug}'(x)$, augmented from the same ancestor $x$, share the same class label. One of the $N_K$ data points is randomly selected as the positive datum, and its two augmentations are taken as the query and the positive sample, respectively, while the synthetic data augmented from the remaining $N_K - 1$ data points are treated as negative samples. The query encoder $f_q$ then maps the query to $q = f_q(x_q)$, and the key encoder $f_k$ maps the positive sample to $k^+ = f_k(x^+)$ and the negative samples to $k_i^- = f_k(x_i^-)$, $i = 1, \ldots, N_K - 1$. The output vectors $q, k^+, k_i^-$ are normalized by their L2 norms, followed by the metric loss
$$L = \log\Big(1 + \sum_{i=1}^{N_K-1} \exp(\mu q^T k_i^- - \mu q^T k^+)\Big), \qquad (1)$$
where $\mu$ is a learnable metric scaling scalar Oreshkin et al. (2018) intended to facilitate metric training. In our experiments, the metric loss $L$ is evaluated over multiple query data in a mini-batch manner; please refer to Appendix C for more details. In this paper, we assume that queries from the same class have the same distribution, and likewise for keys. The query and keys depend on the specific self-supervised pretext task; the inputs $x_q$ and $x_k$ can be images Hadsell et al. (2006).
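A minimal sketch of the metric loss in Eq. (1), assuming plain Python lists as embeddings and treating the scaling $\mu$ as a fixed hyperparameter rather than a learned scalar:

```python
import math

def l2_normalize(v):
    # Project a vector onto the unit L2 sphere, as the framework requires.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_loss(q, k_pos, k_negs, mu=10.0):
    # L = log(1 + sum_i exp(mu*q.k_i^- - mu*q.k^+)) on L2-normalized vectors.
    q, k_pos = l2_normalize(q), l2_normalize(k_pos)
    pos = sum(a * b for a, b in zip(q, k_pos))
    s = 0.0
    for k in k_negs:
        k = l2_normalize(k)
        neg = sum(a * b for a, b in zip(q, k))
        s += math.exp(mu * (neg - pos))
    return math.log(1.0 + s)
```

When the query is close to the positive key and far from all negatives, the loss approaches zero; when a negative is as similar as the positive, its term contributes $\exp(0)=1$.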

3.2. SUPERVISED AND SELF-SUPERVISED TRAINING LOSS.

Assume the dataset $A$ were labeled; then the same framework could be trained in a supervised few-shot learning manner on $A$. We will show that the self-supervised loss is an upper bound on the supervised evaluation metric, and prove in Section 4 that minimizing the unsupervised loss therefore makes sense.

Self-Supervised Metric (SSM) for representations. The SSM accesses a stream of unsupervised tasks $T_t$ containing augmented data $\{x_q, x^+, x_1^-, \ldots, x_{N_K-1}^-\}$. We mark their ground-truth labels as $C_U = \{c_q, c^+, c_1^-, \ldots, c_{N_K-1}^-\}$, respectively. Note that $x_q$ and $x^+$ are drawn from the same data distribution $D_{c^+}$ (since $c_q = c^+$), while the negatives $x_i^-$ are drawn from $D_{c_i^-}$. Let $I = \{1, \ldots, N_K-1\}$ be the index set of the negative data; the unsupervised loss in Eq. (1) can be rewritten as
$$L_U = \mathbb{E}_{q, k^+, k_i^-} \log\Big(1 + \sum_{i \in I} \exp(\mu q^T k_i^- - \mu q^T k^+)\Big). \qquad (2)$$

Supervised metric for representations. The quality of the representation functions $f_q, f_k$ is evaluated by their performance on a multi-class classification task using a linear classifier. Let $C$ denote the set of class labels with prior distribution $\rho$. Assume the augmented data $x \in X$ is drawn from the data distribution $D_c$, where $c \sim \rho$. Consider an $N$-way task $T_{sup}$ on $N$ different classes $C_{sup} = \{c_1, \ldots, c_N\}$, with multi-class classifier $g: X \to \mathbb{R}^N$. The softmax-based cross-entropy loss on a data pair $(x, y)$ can be rewritten as
$$L_{sup}(g, x, y) = \log\Big(1 + \sum_{y' \neq y} \exp(g(x)_{y'} - g(x)_y)\Big), \qquad (3)$$
where $g(x)_y$ is the $y$-th element of the vector $g(x)$. The intuition is that the data pair $(x, y)$ with the correct label $y$ should receive higher confidence than any pair $(x, y')$ with a wrong label $y'$. We choose the linear classifier to be the mean classifier $g(x)_c = \mu q^T p_c$, where $\mu$ is a scaling scalar, $q = f_q(x)$ is the query representation, and $p_c = \mathbb{E}_{x \sim D_c}[f_k(x)]$ is the mean of the representations of inputs with label $c$.
That is, $f_q$ performs feature extraction, and $p_c$ serves as the weight of the linear classifier. The expected supervised metric loss of $f_q, f_k$ on $N$-way tasks is then
$$L_{sup} = \mathbb{E}_{c \sim \rho,\, x \sim D_c} \log\Big(1 + \sum_{c' \neq c} \exp(\mu q^T p_{c'} - \mu q^T p_c)\Big). \qquad (4)$$
Our goal is to find an unsupervised training method that decreases the value of this evaluation metric $L_{sup}$. Previous supervised FSL methods handle the evaluation metric in different ways; ProtoNets conducts episodic training by using the evaluation metric as the supervised loss function but replacing scaled dot products with Euclidean distance.
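The per-sample supervised metric with the mean classifier can be sketched as follows (a toy snippet: the prototypes $p_c$ are passed in directly instead of being computed as expectations over $D_c$, and $\mu$ is a fixed hyperparameter):

```python
import math

def supervised_metric_loss(q, label, prototypes, mu=10.0):
    # One-sample term of L_sup: log(1 + sum_{c' != c} exp(mu*q.p_c' - mu*q.p_c)).
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    pos = dot(q, prototypes[label])
    s = sum(math.exp(mu * (dot(q, p) - pos))
            for c, p in prototypes.items() if c != label)
    return math.log(1.0 + s)
```

The loss is small when the query aligns with its own class prototype and large when it aligns with another class's prototype, mirroring Eq. (4).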

4. GAPS BETWEEN SUPERVISED AND SELF-SUPERVISED TRAINING LOSS

It can be seen that Eq. (2) has a similar form to Eq. (4). As a first step, we show that $L_U$ bounds $L_{sup}$ in an ideal situation. This conclusion indicates that minimizing the unsupervised SSM loss $L_U$ makes sense. Theorem 1 gives the upper bound for the supervised loss $L_{sup}$.

Theorem 1. $\forall f_q, f_k \in F$,
$$L_{sup} \le \gamma_0 L_U + \delta, \qquad (5)$$
where $\gamma_0, \delta$ are constants depending on the class distribution $\rho$. When $\rho$ is uniform and $|C| \to \infty$, then $\gamma_0 \to 1$ and $\delta \to 0$.

Proof. The key step is Jensen's inequality, since $\ell(v) = \log(1 + \sum_i \exp(v_i))$, $\forall v \in \mathbb{R}^{N_K-1}$, is a convex function ($v_i$ is the $i$-th element of $v$):
$$L_U = \mathbb{E}_q \mathbb{E}_{k^+, k_i^-} \log\Big(1 + \sum_{i \in I} \exp(\mu q^T k_i^- - \mu q^T k^+)\Big) \ge \mathbb{E}_{q, c^+, c_i^-} \log\Big(1 + \sum_{i \in I} \exp(\mu q^T p_{c_i^-} - \mu q^T p_{c^+})\Big). \qquad (6)$$
However, the unsupervised label set $C_U$ may contain duplicate classes and even false negative classes. We divide $I$ into two disjoint subsets: the true negative index set $I^- = \{i \in I \mid c_i^- \neq c^+\}$ and the false negative index set $I^+ = \{i \in I \mid c_i^- = c^+\}$. We define $C_{uni}$ as the label set after de-duplicating the class labels in $C_U$, $C_{uni} \subseteq C_U$. Since $\ell(\{v_i\}_{i \in I_1 \cup I_2}) := \log(1 + \sum_{i \in I_1 \cup I_2} \exp(v_i)) \ge \ell(\{v_i\}_{i \in I_1})$, $\forall I_1, I_2 \subseteq I$, we can decompose Eq. (6) into
$$\mathbb{E}_{q, c^+, c_i^-} \log\Big(1 + \sum_{i \in I} \exp(\mu q^T p_{c_i^-} - \mu q^T p_{c^+})\Big) \ge P(I^+ = \emptyset)\, \mathbb{E}_{q, c^+}\big[\ell(\{\mu q^T p_c - \mu q^T p_{c^+}\}_{c \in C_{uni} \setminus c^+}) \mid I^+ = \emptyset\big] + P(I^+ \neq \emptyset)\, \mathbb{E}_{c^+}\big[\log(1 + |I^+|) \mid I^+ \neq \emptyset\big]. \qquad (7)$$
The first expectation in Eq. (7) is exactly the supervised loss $L_{sup}$ in Eq. (4) with $C_{sup} := C_{uni}$. Combining this with Eqs. (6) and (7), we obtain the inequality of Theorem 1 with
$$\gamma_0 = \frac{1}{P(I^+ = \emptyset)}, \qquad \delta = -\frac{P(I^+ \neq \emptyset)}{P(I^+ = \emptyset)}\, \mathbb{E}_{c^+}\big[\log(1 + |I^+|) \mid I^+ \neq \emptyset\big].$$
When $\rho$ is uniform and $|C| \to \infty$, then $P(I^+ \neq \emptyset) \to 0$. Please refer to Appendix B.1 for more details.

The supervised loss here is built from the unsupervised training procedure in the following way: sample the class set $\{c^+, c_1^-, \ldots, c_{N_K-1}^-\}$ from the class distribution $\rho$, randomly select one class as the positive class, restrict to episodes in which all negative classes differ from the positive class ($I^+ = \emptyset$), and sample one datum from each negative class independently to obtain a $|C_{sup}|$-way 1-shot supervised loss. Since this loss concerns separating $c^+$ from the whole class set, we compute the probability by a symmetrization argument under the assumption that the training dataset is balanced. We have thus derived an encouraging relationship between the SSM loss and the supervised loss: we can decrease the supervised loss by minimizing the unsupervised SSM loss $L_U$. However, $L_U$ cannot be made small enough when false negative keys are present in some tasks and the proportion of false negative data is considerable, especially in scenarios whose label space $|C|$ is small (e.g., miniImageNet). In that case, minimizing $L_U$ hits a theoretical bottleneck, since the loss on the false negative data cannot be decreased enough. To explore the effect of inputs that share the positive datum's class sneaking into the negatives, we further decompose the SSM loss $L_U$ into two terms: (1) $L_U^-$, the loss on all true negative data $x_i^-$, $i \in I^-$, which have labels different from the positive datum; and (2) $L_U^+$, the loss on all false negative data $x_i^-$, $i \in I^+$, which have the same label as the positive datum. Theorem 2 reveals the underlying factors behind the explicit gap between $L_{sup}$ and $L_U$. We define the intra-class deviation as
$$s(f_k) := |\mu|\, \mathbb{E}_c\big[\mathbb{E}_{x \sim D_c} \|f_k(x) - p_c\|_2^2\big]^{1/2}.$$

Theorem 2. $\forall f_q, f_k \in F$,
$$L_{sup} \le \gamma_0 L_U^- + \gamma_1 s(f_k), \qquad (8)$$
where $\gamma_0, \gamma_1$ are constants depending on the class distribution $\rho$. When $\rho$ is uniform and $|C| \to \infty$, then $\gamma_0 \to 1$ and $\gamma_1 \to 0$.

Proof.
The key step is that $\ell(\{v_i\}_{i \in I_1 \cup I_2}) \le \ell(\{v_i\}_{i \in I_1}) + \ell(\{v_i\}_{i \in I_2})$, $\forall I_1, I_2 \subseteq I$, so
$$L_U \le \mathbb{E}_{q, k^+, k_i^-} \log\Big(1 + \sum_{i \in I^-} \exp(\mu q^T k_i^- - \mu q^T k^+)\Big) + \mathbb{E}_{q, k^+, k_i^-} \log\Big(1 + \sum_{i \in I^+} \exp(\mu q^T k_i^- - \mu q^T k^+)\Big) := L_U^- + L_U^+, \qquad (9)$$
where the first expectation is $L_U^-$ and the second is $L_U^+$ (imagining that true and false negative data have been separated during training). Using the inequalities $\ell(\{v_i\}_{i \in I_1}) \le \log(1 + |I_1|) + \max\{\max\{v_i\}_{i \in I_1}, 0\}$ and $\max\{v_i\}_{i \in I_1} \le |\max\{v_i\}_{i \in I_1}| \le \max\{|v_i|\}_{i \in I_1} \le \sum_{i \in I_1} |v_i|$, we obtain
$$L_U^+ \le \mathbb{E}_{q, k^+, k_i^-}\Big[\log(1 + |I^+|) + \sum_{i \in I^+} |\mu q^T k_i^- - \mu q^T k^+|\Big] = P(I^+ \neq \emptyset)\, \mathbb{E}_{c^+}\big[\log(1 + |I^+|) \mid I^+ \neq \emptyset\big] + \mathbb{E}_{q, k^+, k_i^-}\Big[\sum_{i \in I^+} |\mu q^T k_i^- - \mu q^T k^+|\Big]. \qquad (10)$$
Combining Eq. (5) with Eqs. (9) and (10) gives
$$L_{sup} \le \gamma_0 (L_U^- + L_U^+) + \delta \le \gamma_0 L_U^- + \gamma_0\, \mathbb{E}_{q, k^+, k_i^-}\Big[\sum_{i \in I^+} |\mu q^T k_i^- - \mu q^T k^+|\Big] = \gamma_0 L_U^- + \gamma_0\, \mathbb{E}_{c^+} \mathbb{E}_{q, k^+, k_i^- \sim c^+}\Big[\sum_{i \in I^+} |\mu q^T k_i^- - \mu q^T k^+|\Big]. \qquad (11)$$
When the class distribution $\rho$ is uniform, we have $\mathbb{E}|I^+| = (N_K - 1)/|C|$ for any class. Since the representations are normalized so that $\|q\|_2 = 1$, the right-hand expectation in Eq. (11) can be bounded by $s(f_k)$:
$$\mathbb{E}_{c^+} \mathbb{E}_{q, k^+, k_i^- \sim c^+}\Big[\sum_{i \in I^+} |\mu q^T k_i^- - \mu q^T k^+|\Big] \le \sqrt{2}\, \mathbb{E}|I^+| \cdot s(f_k). \qquad (12)$$
Combining Eqs. (11) and (12) proves Theorem 2, with $\gamma_1 = \frac{\sqrt{2}\, \gamma_0 (N_K - 1)}{|C|}$ and $\gamma_0 = \frac{1}{P(I^+ = \emptyset)}$. When $\rho$ is uniform and $|C| \to \infty$, then $P(I^+ \neq \emptyset) \to 0$, $\gamma_0 \to 1$, $\gamma_1 \to 0$. Please refer to Appendix B.2 for more details.

Compared to Theorem 1, Theorem 2 indicates that the supervised loss $L_{sup}$ is bounded by two explicit parts. The first, $L_U^-$, measures the similarity of the query to the positive datum versus the true negative data. It is somewhat like dynamic $N$-way $M$-shot supervised training, except that the values of $N$ and $M$ may change with the number of true negative data.
The second, $s(f_k)$, acts as a penalty on representation ability by measuring the intra-class representation deviation. Moreover, $\gamma_1 s(f_k)$ is an explicit gap separating the unsupervised SSM loss from the supervised loss. Theoretically, given an unsupervised set with infinitely many classes and data, the performance achieved by SSM can be very close to that of supervised training. In supervised FSL methods, performance increases as $N$ or $M$ increases. So how can we increase $N$ and $M$ in self-supervised training, and what is the resulting improvement? We can increase $N$ by increasing the total number of negative samples $N_K$; but $\gamma_1$, the coefficient of the gap, also increases. We therefore suggest that self-supervised FSL does not need a particularly large $N_K$. Experiments in Table 1 show that the best $N_K$ is 2048 on Omniglot and 512 on miniImageNet.
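The trade-off in $N_K$ can be made concrete with a small numeric sketch. Note one labeled assumption: the closed form $P(I^+ = \emptyset) = (1 - 1/|C|)^{N_K - 1}$ below presumes the $N_K - 1$ negative classes are drawn i.i.d. uniformly with replacement, which is an illustrative simplification rather than the paper's exact sampling scheme:

```python
import math

def gap_coefficient(n_k, num_classes):
    # Assumed model: P(no false negative) = (1 - 1/|C|)^(N_K - 1)
    # under i.i.d. uniform sampling of the N_K - 1 negative classes.
    p_empty = (1.0 - 1.0 / num_classes) ** (n_k - 1)
    gamma0 = 1.0 / p_empty                               # gamma_0 = 1 / P(I+ = empty)
    gamma1 = math.sqrt(2.0) * gamma0 * (n_k - 1) / num_classes  # Theorem 2
    return gamma0, gamma1
```

Increasing $N_K$ monotonically inflates $\gamma_1$, while increasing $|C|$ drives $\gamma_0 \to 1$ and $\gamma_1 \to 0$, matching the limits stated in Theorems 1 and 2.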

5. DISCUSSION ON SELF-SUPERVISED FSL

Impact of the number of support keys M. Most self-supervised methods take one positive key per query, while FSL usually uses $M$ support samples as positive keys. ProtoCLR Medina et al. (2020) proposed that original images $x$ serve as class prototypes around which their $Q$ augmentations should cluster, with $Q = 3$ performing best in their experiments. Suppose we replace each key representation with the average of $M$ key representations obtained from different augmentations of the same input $x$. The new intra-class deviation $s(\tilde{f}_k) := s(f_k)/\sqrt{M}$ reduces the gap between self-supervised and supervised training.

Impact of the class number |C|. The class number affects the coefficient of the gap. If the number of classes is infinite, the probability of sampling false negative data is zero, so $\gamma_0 \to 1$ and $\gamma_1 \to 0$. Concretely, the unsupervised set $A$ of miniImageNet contains only 64 classes, while that of Omniglot involves up to 4800 classes, leading to a smaller $\gamma_1$ for Omniglot (larger $|C|$ implies larger $P(I^+ = \emptyset)$, which implies smaller $\gamma_0$; larger $|C|$ and smaller $\gamma_0$ together imply smaller $\gamma_1$). Assuming there is no significant difference in the representation ability of the backbone on the two datasets (even though the intra-class deviation on Omniglot is intuitively smaller than on miniImageNet, since Omniglot is easier), the gap $\gamma_1 s(f_k)$ on Omniglot is smaller than on miniImageNet. Because more categories introduce more information, it is difficult to design a fair comparative experiment; still, according to the theory, we believe self-supervised methods are well suited to data with a large number of categories.

Impact of imbalanced data. We assume the training data is balanced in order to obtain the simple, clear coefficients $\gamma_0, \gamma_1$ in the gap. From Eq. (12), the intra-class deviations of different classes are weighted by the expected ratio of false negative samples among all negative samples.
Moreover, generalization performance may also be affected by training on imbalanced data, because the weight of each sample in the supervised loss varies across categories.

Limitations of the theory. (1) For simplicity, we only consider the mean classifier, and the mean classifier in the supervised framework may become a bottleneck. Not all FSL methods use a mean or linear classifier and two-stage training. We do not discuss the relationship between other FSL methods and the supervised loss, which may limit the theory's generalization to those methods. (2) We have not carried out extensive experiments to verify the theory, and many obstacles remain in guiding experiments with theory. We expect that collecting unsupervised data with many categories, increasing the number of negative samples within a certain range, and replacing the original keys with averages of multiple augmentations will help reduce the gap between supervised and self-supervised training. (3) The supervised loss is actually an evaluation metric, slightly different from the training losses used in FSL methods; the convergence and generalization of the framework therefore need further verification.
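Returning to the support-key averaging discussed at the start of this section, the $1/\sqrt{M}$ reduction of the intra-class deviation can be spot-checked with a tiny Monte Carlo simulation (a toy one-dimensional stand-in for key embeddings, not the actual framework):

```python
import random
import statistics

def intra_class_deviation(m, n_trials=20000, seed=0):
    # Std. dev. of the mean of M noisy 1-D "key embeddings" of the same input.
    # Averaging M i.i.d. draws should shrink the deviation by about 1/sqrt(M).
    rng = random.Random(seed)
    means = [statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(m))
             for _ in range(n_trials)]
    return statistics.pstdev(means)
```

With unit-variance noise, `intra_class_deviation(1)` is close to 1 and `intra_class_deviation(4)` close to 0.5, consistent with $s(\tilde{f}_k) = s(f_k)/\sqrt{M}$.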

6. RELATED WORK

Few-Shot Learning. Compared to the common machine-learning paradigm involving large-scale labeled training data, the development of FSL has been slow due to its intrinsic difficulty. Early works on FSL were based on generative models seeking to build a Bayesian probabilistic framework Fei-Fei et al. (2006); Lake et al. (2015). More recent works take the view of meta-learning and can be roughly grouped into five sub-categories: learn-to-measure (e.g., MatchNets Vinyals et al. (2016), ProtoNets Snell et al. (2017), RelationNets Yang et al. (2018)), learn-to-finetune (e.g., Meta-Learner LSTM Ravi & Larochelle (2017), MAML Finn et al. (2017)), learn-to-remember (e.g., MANN Santoro et al. (2016), SNAIL Mishra et al. (2018)), learn-to-adjust (e.g., MetaNets Munkhdalai & Yu (2017), CSN Munkhdalai et al. (2018)), and learn-to-parameterize (e.g., DynamicNets Gidaris & Komodakis (2018), Acts2Params Qiao et al. (2018)). These methods all need to create meta-training tasks on a large-scale supervised auxiliary set to obtain an FSL model that can transfer across tasks.

Contrastive self-supervised representation learning. Our analysis is based on contrastive representation learning, a classic machine-learning topic Barlow (1989) that aims to acquire a pre-trained representation space from unsupervised data serving as an embedding for downstream supervised learning tasks. Tracing back to Hadsell et al. (2006), these methods learn representations by comparing positive and negative pairs. Wu et al. (2018) suggests using a memory bank to store instance-class representation vectors, a method adopted and extended in recent papers. Other studies explore using in-batch samples instead of a memory bank for negative sampling Ji et al. (2019); He et al. (2019). Recent literature attempts to link the success of these methods to the maximization of mutual information between latent representations Oord et al.
(2018); Hénaff et al. (2019). However, it is unclear whether the success of these methods is determined by mutual information or by the specific form of the contrastive loss Tschannen et al. (2019).

Self-Supervised Methods for FSL. CACTUs Hsu et al. (2019) first developed a two-stage strategy: constructing meta-training tasks on an unsupervised set with clustering algorithms and then running MAML or ProtoNets on the constructed tasks. Comparably, UMTRA Khodadadeh et al. (2019), AAL Antoniou & Storkey (2019), and ULDA Qin et al. (2020) construct meta-training tasks by augmenting unsupervised data and training with a ready-made ProtoNet or MAML model. These unsupervised methods for FSL focus on allocating pseudo-labels to unsupervised data, so that existing supervised FSL models can work with the pseudo-labels without modification. Three recent works, LST Li et al. (2019b), ProtoTransfer Medina et al. (2020), and CC Gidaris et al. (2019), introduce self-supervised techniques Jing & Tian (2020) to achieve superior performance on FSL tasks. LST is a semi-supervised meta-learning method that meta-learns how to pick and label unsupervised data to further improve FSL performance. ProtoTransfer is a state-of-the-art self-supervised FSL method. CC considers two self-supervised tasks in the FSL setting: CC-Rot, predicting the rotation applied to an image Gidaris et al. (2018), and CC-Loc, predicting the relative location of two patches from the same image Doersch et al. (2015). CC can deal with unsupervised tasks while LST cannot. In this work, we reconsider FSL theoretically and experimentally from the perspective of self-supervised learning.

7. CONCLUSION

In this work, we focus on the theory of self-supervised FSL. We first give two assumptions and explain why self-supervised methods are suitable for few-shot learning. We then decompose the self-supervised loss into a supervised loss plus a gap bounded by the intra-class representation deviation. Experimental results on two FSL benchmarks, Omniglot and miniImageNet, verify our theory, and we propose potential ways to improve test performance.

A APPENDIX

B PROOFS

Assume that we have access to an episodic unsupervised task $T_t$ with augmented data $X = \{x_q, x^+, x_1^-, \ldots, x_{N_K-1}^-\}$. Please refer to Algorithm 1 in our paper for details. We mark their ground-truth labels by $C_U = \{c_q, c^+, c_1^-, \ldots, c_{N_K-1}^-\}$, respectively. Note that $x_q, x^+$ are drawn from the same data distribution $D_{c^+}$ (since $c_q = c^+$), while the negatives $x_i^-$ are drawn from the distributions $D_{c_i^-}$. Let $I = \{1, \ldots, N_K-1\}$ be the index set of the negative data; the unsupervised loss function is as shown in Eq. (2), where $q, k^+, k_i^-$ are the representations of $x_q, x^+, x_i^-$, respectively.

B.1 PROOF OF THEOREM 1. We first leverage the convexity of $\ell$ to get a lower bound on the unsupervised loss $L_U$ in Eq. (16). We then decompose the lower bound into a supervised loss $L_{sup}$ plus a degenerate term in Eq. (20).

Step 1. Convexity. $\ell(v) = \log(1 + \sum_i e^{v_i})$, $\forall v \in \mathbb{R}^{N_K-1}$, is a convex function. Indeed, for any $t \in \mathbb{R}$ and $z, v \in \mathbb{R}^{N_K-1}$, let
$$g(t) = \ell(z + tv) = \log\Big(1 + \sum_i e^{z_i + t v_i}\Big),$$
$$g'(t) = \frac{\sum_i v_i e^{z_i + t v_i}}{1 + \sum_i e^{z_i + t v_i}}, \qquad g''(t) = \frac{\big(\sum_i v_i^2 e^{z_i + t v_i}\big)\big(1 + \sum_i e^{z_i + t v_i}\big) - \big(\sum_i v_i e^{z_i + t v_i}\big)^2}{\big(1 + \sum_i e^{z_i + t v_i}\big)^2},$$
where $z_i, v_i$ are the $i$-th components of $z, v$. We have $\sum_i v_i^2 e^{z_i + t v_i} \ge 0$ and, by the Cauchy–Schwarz inequality,
$$\Big(\sum_i v_i^2 e^{z_i + t v_i}\Big)\Big(\sum_i e^{z_i + t v_i}\Big) \ge \Big(\sum_i v_i e^{z_i + t v_i}\Big)^2.$$
Thus $g''(t)$ is always non-negative, and $\ell(v)$ is a convex function.

Step 2. Jensen's inequality.
The key point in the proof is Jensen's inequality, since $\ell(v) = \log(1 + \sum_i \exp(v_i))$, $\forall v \in \mathbb{R}^{N_K-1}$, is a convex function:
$$L_U = \mathbb{E}_{q,k^+,k_i^-} \log\Big(1 + \sum_{i\in I} \exp(\mu q^T k_i^- - \mu q^T k^+)\Big) = \mathbb{E}_{q,c^+,c_i^-} \mathbb{E}_{k^+ \sim c^+,\, k_i^- \sim c_i^-} \log\Big(1 + \sum_{i\in I} \exp(\mu q^T k_i^- - \mu q^T k^+)\Big) \ge \mathbb{E}_{q,c^+,c_i^-} \log\Big(1 + \sum_{i\in I} \exp\big(\mathbb{E}_{k^+ \sim c^+,\, k_i^- \sim c_i^-}(\mu q^T k_i^- - \mu q^T k^+)\big)\Big) = \mathbb{E}_{q,c^+,c_i^-} \log\Big(1 + \sum_{i\in I} \exp(\mu q^T p_{c_i^-} - \mu q^T p_{c^+})\Big), \qquad (16)$$
where $p_c$ is a class-wise prototype, the mean of all representations with class label $c$; that is, $p_{c_i^-}$ is the prototype of class $c_i^-$ and $p_{c^+}$ that of class $c^+$. There are subtle differences between our class-wise prototype and the episodic prototype in ProtoNets: our $p_c$ is a global prototype for each class, whereas the $p_c$ in ProtoNets is re-computed from the support data in each episode. Note that in this lower bound of the unsupervised loss, the random class labels $c_i^-$ may be true negative or false negative classes.

Step 3. Decompose the lower bound. We divide the negative classes $c_i^-$ into two disjoint subsets, true negative classes and false negative classes. Accordingly, we divide the set of negative data indices $I$ into two disjoint subsets: true negative indices $I^- = \{i \in I \mid c_i^- \neq c^+\}$ and false negative indices $I^+ = \{i \in I \mid c_i^- = c^+\}$. We have
$$\mathbb{E}_{q,c^+,c_i^-} \log\Big(1 + \sum_{i\in I} \exp(\mu q^T p_{c_i^-} - \mu q^T p_{c^+})\Big) = \mathbb{E}_{q,c^+,c_i^-}\, \ell\big(\{\mu q^T p_{c_i^-} - \mu q^T p_{c^+}\}_{i\in I}\big) = P(I^+ = \emptyset)\, \mathbb{E}_{q,c^+,c_i^-}\big[\ell\big(\{\mu q^T p_{c_i^-} - \mu q^T p_{c^+}\}_{i\in I}\big) \mid I^+ = \emptyset\big] + P(I^+ \neq \emptyset)\, \mathbb{E}_{q,c^+,c_i^-}\big[\ell\big(\{\mu q^T p_{c_i^-} - \mu q^T p_{c^+}\}_{i\in I}\big) \mid I^+ \neq \emptyset\big]. \qquad (17)$$
We define $C_{uni}$ as the label set after de-duplicating the class labels in $C_U$ (including the positive class $c^+$ and the negative classes $c_i^-$). Since $\ell(\{v_i\}_{i\in I_1\cup I_2}) := \log(1 + \sum_{i\in I_1\cup I_2}\exp(v_i)) \ge \ell(\{v_i\}_{i\in I_1})$, $\forall I_1, I_2 \subseteq I$, we can decompose the above quantity to handle repeated classes. If $I^+ = \emptyset$, then $I = I^-$, and we can choose all de-duplicated negative class indices as $I_{uni}$.
Thus $I_{uni} \subseteq I^- = I$ and $\ell(\{v_i\}_{i\in I}) \ge \ell(\{v_i\}_{i\in I_{uni}})$. That is,
$$
\begin{aligned}
\mathbb{E}_{q,c^+,c^-_i}\big[\ell(\{\mu q^T p_{c^-_i} - \mu q^T p_{c^+}\}_{i\in I}) \,\big|\, I^+ = \emptyset\big]
&\ge \mathbb{E}_{q,c^+,c^-_i}\big[\ell(\{\mu q^T p_{c^-_i} - \mu q^T p_{c^+}\}_{i\in I_{uni}}) \,\big|\, I^+ = \emptyset\big] \\
&= \mathbb{E}_{q,c^+}\big[\ell(\{\mu q^T p_c - \mu q^T p_{c^+}\}_{c\in C_{uni}\setminus c^+}) \,\big|\, I^+ = \emptyset\big].
\end{aligned}
$$
(18)
Observe that the last expectation in Eq. (18) is actually the supervised loss $L_{sup}$ by regarding $C_{sup} := C_{uni}$. The supervised loss is based on the positive class and the de-duplicated negative classes of each episodic data in our SSM. It is akin to a dynamic $N$-way 1-shot supervised loss, where $N$ is the number of unique classes among the sampled classes $C_U$.

If $I^+ \ne \emptyset$, we can choose all false negative class indices $I^+$. All these false negative data have the same class label $c^+$; thus $c^-_i = c^+$ and $p_{c^-_i} = p_{c^+}$ for all $i \in I^+$. Since $I^+ \subseteq I$, $\ell(\{v_i\}_{i\in I}) \ge \ell(\{v_i\}_{i\in I^+})$. That is,
$$
\begin{aligned}
\mathbb{E}_{q,c^+,c^-_i}\big[\ell(\{\mu q^T p_{c^-_i} - \mu q^T p_{c^+}\}_{i\in I}) \,\big|\, I^+ \ne \emptyset\big]
&\ge \mathbb{E}_{q,c^+,c^-_i}\big[\ell(\{\mu q^T p_{c^-_i} - \mu q^T p_{c^+}\}_{i\in I^+}) \,\big|\, I^+ \ne \emptyset\big] \\
&= \mathbb{E}_{q,c^+,c^-_i}\big[\ell(\{0\}_{i\in I^+}) \,\big|\, I^+ \ne \emptyset\big] \\
&= \mathbb{E}_{q,c^+,c^-_i}\big[\log(1 + |I^+|) \,\big|\, I^+ \ne \emptyset\big] \\
&= \mathbb{E}_{c^+}\big[\log(1 + |I^+|) \,\big|\, I^+ \ne \emptyset\big],
\end{aligned}
$$
(19)
where $|I^+|$ denotes the number of elements in the set $I^+$. From Eq. (17), Eq. (18) and Eq. (19), we get
$$
\mathbb{E}_{q,c^+,c^-_i} \log\Big(1 + \sum_{i\in I}\exp(\mu q^T p_{c^-_i} - \mu q^T p_{c^+})\Big) \ge P(I^+ = \emptyset)\, L_{sup} + P(I^+ \ne \emptyset)\, \mathbb{E}_{c^+}\big[\log(1 + |I^+|) \,\big|\, I^+ \ne \emptyset\big].
$$
(20)
Combining Eq. (20) with Eq. (16), we have
$$
L_U \ge P(I^+ = \emptyset)\, L_{sup} + P(I^+ \ne \emptyset)\, \mathbb{E}_{c^+}\big[\log(1 + |I^+|) \,\big|\, I^+ \ne \emptyset\big].
$$
(21)
Thus we have proved Theorem 1, using the fact that
$$
\gamma_0 = \frac{1}{P(I^+ = \emptyset)}, \qquad \delta = -\frac{P(I^+ \ne \emptyset)}{P(I^+ = \emptyset)}\, \mathbb{E}_{c^+}\big[\log(1 + |I^+|) \,\big|\, I^+ \ne \emptyset\big].
$$

B.2 PROOF OF THEOREM 2

First, we decompose the unsupervised loss into a true negative data loss $L^-_U$ and a false negative data loss $L^+_U$ by the property in Eq. (22), and further give an upper bound for the false negative data loss $L^+_U$ by the property in Eq. (25). Finally, we get a bound for our SSM in Eq. (29) and prove Theorem 2.

Step 1. Inequality 1 of $\ell$.
We note that the function $\ell$ satisfies the following inequality:
$$
\ell(\{v_i\}_{i\in I_1\cup I_2}) \le \ell(\{v_i\}_{i\in I_1}) + \ell(\{v_i\}_{i\in I_2}),\ \forall I_1, I_2 \subseteq I,
$$
(22)
because
$$
\begin{aligned}
\ell(\{v_i\}_{i\in I_1\cup I_2}) &= \log\Big(1 + \sum_{i\in I_1\cup I_2} e^{v_i}\Big) \le \log\Big(1 + \sum_{i\in I_1} e^{v_i} + \sum_{i\in I_2} e^{v_i}\Big) \\
&\le \log\Big[\Big(1 + \sum_{i\in I_1} e^{v_i}\Big)\Big(1 + \sum_{i\in I_2} e^{v_i}\Big)\Big] \\
&= \log\Big(1 + \sum_{i\in I_1} e^{v_i}\Big) + \log\Big(1 + \sum_{i\in I_2} e^{v_i}\Big) = \ell(\{v_i\}_{i\in I_1}) + \ell(\{v_i\}_{i\in I_2}).
\end{aligned}
$$

Step 2. Decompose the unsupervised loss. We have already divided all negative classes into two disjoint subsets and obtained their index sets $I^-$ and $I^+$: $I^-$ indexes the true negative data while $I^+$ indexes the false negative data. According to these index sets and the property in Eq. (22), we have
$$
\begin{aligned}
L_U &= \mathbb{E}_{q,k^+,k^-_i} \log\Big(1 + \sum_{i\in I}\exp(\mu q^T k^-_i - \mu q^T k^+)\Big) = \mathbb{E}_{q,k^+,k^-_i}\, \ell(\{\mu q^T k^-_i - \mu q^T k^+\}_{i\in I}) \\
&\le \mathbb{E}_{q,k^+,k^-_i}\big[\ell(\{\mu q^T k^-_i - \mu q^T k^+\}_{i\in I^-}) + \ell(\{\mu q^T k^-_i - \mu q^T k^+\}_{i\in I^+})\big] \\
&= \mathbb{E}_{q,k^+,k^-_i}\, \ell(\{\mu q^T k^-_i - \mu q^T k^+\}_{i\in I^-}) + \mathbb{E}_{q,k^+,k^-_i}\, \ell(\{\mu q^T k^-_i - \mu q^T k^+\}_{i\in I^+}) := L^-_U + L^+_U,
\end{aligned}
$$
(23)
where the first expectation is $L^-_U$ and the second is $L^+_U$ (imagining that the true negative data and false negative data have been separated during training).

Step 3. Inequality 2 of $\ell$. We define $v_{max} \in \mathbb{R}$ as the maximum component with indices in $I_1$, that is, $v_{max} := \max\{v_i\}_{i\in I_1}$. If $v_{max} > 0$, we have
$$
\ell(\{v_i\}_{i\in I_1}) = \log\Big(1 + \sum_{i\in I_1} e^{v_i}\Big) \le \log(1 + |I_1| e^{v_{max}}) = \log(1 + |I_1|) + \log\Big(e^{v_{max}} + \frac{1 - e^{v_{max}}}{1 + |I_1|}\Big) \le \log(1 + |I_1|) + v_{max}.
$$
(24)
Otherwise, $v_i \le v_{max} \le 0$ for all $i \in I_1$, and we have $\ell(\{v_i\}_{i\in I_1}) = \log(1 + \sum_{i\in I_1} e^{v_i}) \le \log(1 + |I_1|)$. Thus we get the inequality
$$
\ell(\{v_i\}_{i\in I_1}) \le \log(1 + |I_1|) + \max\{v_{max}, 0\} \le \log(1 + |I_1|) + \sum_{i\in I_1} |v_i|.
$$
(25)

Step 4. The upper bound of $L^+_U$. Using the property of the function $\ell$ in Eq. (25), we can get an upper bound for $L^+_U$:
$$
L^+_U = \mathbb{E}_{q,k^+,k^-_i}\, \ell(\{\mu q^T k^-_i - \mu q^T k^+\}_{i\in I^+}) \le \mathbb{E}_{q,k^+,k^-_i}\Big[\log(1 + |I^+|) + \sum_{i\in I^+} \big|\mu q^T k^-_i - \mu q^T k^+\big|\Big],
$$
(26)
where the second term acts as a penalty on representation ability by measuring the intra-class representation deviation.
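As a sanity check, the function $\ell$ and the episodic loss of Eq. (2) can be written down directly, and the properties used in the proofs (convexity from Step 1 of B.1, and the inequalities of Eq. (22) and Eq. (25)) spot-checked numerically. A minimal plain-Python sketch; the helper names are ours, not from the paper's code:

```python
import math
import random

def ell(v):
    """l(v) = log(1 + sum_i exp(v_i)), the function analyzed in the proofs."""
    return math.log(1.0 + sum(math.exp(x) for x in v))

def episodic_loss(q, k_pos, k_negs, mu=1.0):
    """Eq. (2) for one episode: l({mu*q.k-_i - mu*q.k+}_{i in I})."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return ell([mu * (dot(q, kn) - dot(q, k_pos)) for kn in k_negs])

random.seed(0)
for _ in range(200):
    z = [random.uniform(-4, 4) for _ in range(5)]
    w = [random.uniform(-4, 4) for _ in range(5)]
    lam = random.random()
    mix = [lam * a + (1 - lam) * b for a, b in zip(z, w)]
    # Step 1 of B.1: convexity of l.
    assert ell(mix) <= lam * ell(z) + (1 - lam) * ell(w) + 1e-9
    # Eq. (22): subadditivity over disjoint index sets (modeled by concatenation).
    assert ell(z + w) <= ell(z) + ell(w) + 1e-9
    # Eq. (25): l({v_i}) <= log(1 + |I_1|) + max{v_max, 0}.
    assert ell(z) <= math.log(1 + len(z)) + max(max(z), 0.0) + 1e-9
```

Such numerical checks do not replace the proofs, but they make it easy to see why $\ell$ behaves like a soft maximum bounded between $\log(1+|I_1|)$ and $\log(1+|I_1|) + v_{max}$.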
$$
\begin{aligned}
\mathbb{E}_{q,k^+,k^-_i} \sum_{i\in I^+} \big|\mu q^T k^-_i - \mu q^T k^+\big|
&= \mathbb{E}_{c^+}\Big[|I^+|\; \mathbb{E}_{q,k^+,k^-_i\sim c^+,\, i\in I^+} \big|\mu q^T k^-_i - \mu q^T k^+\big|\Big] \\
&\le \mathbb{E}_{c^+}\Big[|I^+|\; \sqrt{\mathbb{E}_{q,k^+,k^-_i\sim c^+,\, i\in I^+} \big|\mu q^T k^-_i - \mu q^T k^+\big|^2}\Big] \\
&\le |\mu|\; \mathbb{E}_{c^+}\Big[|I^+|\; \sqrt{\mathbb{E}_{q,k^+,k^-_i\sim c^+,\, i\in I^+} \|q\|_2^2\, \|k^-_i - k^+\|_2^2}\Big] \\
&= |\mu|\; \mathbb{E}_{c^+}\Big[|I^+|\; \sqrt{\mathbb{E}_{k^+,k^-_i\sim c^+,\, i\in I^+} \|k^-_i - k^+\|_2^2}\Big] \qquad (\text{since } \|q\|_2^2 = 1) \\
&= |\mu|\; \mathbb{E}_{c^+}\Big[|I^+|\; \sqrt{\mathbb{E}_{k^+,k^-_i\sim c^+,\, i\in I^+} \|k^-_i - p_{c^+} + p_{c^+} - k^+\|_2^2}\Big] \\
&= |\mu|\; \mathbb{E}_{c^+}\Big[|I^+|\; \sqrt{2\, \mathbb{E}_{k^+\sim c^+} \|p_{c^+} - k^+\|_2^2}\Big].
\end{aligned}
$$
(27)
All data with indices in $I^+$ have the same label $c^+$, and the expectation in Eq. (27) shows the intra-class representation deviation. Mark the deviation as $s(f_k) = |\mu|\, \mathbb{E}_{c^+} \sqrt{\mathbb{E}_{k^+\sim c^+} \|p_{c^+} - k^+\|_2^2}$. We have a uniform class distribution; thus the $(N_K - 1)$ negative data can be drawn from any class with equal probability $1/|C|$. Then $\mathbb{E}|I^+| = (N_K - 1)/|C|$ for any positive class $c^+$. Thus the right-hand expectation in Eq. (27) can be bounded by $s(f_k)$, that is,
$$
\mathbb{E}_{q,k^+,k^-_i} \sum_{i\in I^+} \big|\mu q^T k^-_i - \mu q^T k^+\big| \le \sqrt{2}\; \mathbb{E}|I^+|\; s(f_k).
$$
(28)
From Eq. (28), Eq. (26) and Eq. (23), we have
$$
L_U \le L^-_U + \sqrt{2}\; \mathbb{E}|I^+|\; s(f_k) + \mathbb{E}_{q,k^+,k^-_i}\big[\log(1 + |I^+|)\big],
$$
(29)
and we have
$$
\mathbb{E}_{q,k^+,k^-_i}\big[\log(1+|I^+|)\big] = P(I^+ \ne \emptyset)\, \mathbb{E}_{q,k^+,k^-_i}\big[\log(1+|I^+|) \,\big|\, I^+ \ne \emptyset\big] = P(I^+ \ne \emptyset)\, \mathbb{E}_{c^+}\big[\log(1+|I^+|) \,\big|\, I^+ \ne \emptyset\big].
$$
(30)
Combining Eq. (29), Eq. (30) and Theorem 1, we have proved Theorem 2 by setting $\gamma_1 = \sqrt{2}\,\gamma_0 (N_K - 1)/|C|$.

Algorithm 1 (continued):
4: randomly select $x_j$ as the positive datum, $1 \le j \le N_K$
5: augment $x_j$ into $Aug(x_j)$ and $Aug'(x_j)$
6: augment $x_i$ into $Aug(x_i)$, $\forall i \in \{1, \dots, N_K\}\setminus j$
7: positive key: $x^+ \leftarrow Aug(x_j)$; negative keys: $\{x^-_1, \dots, x^-_{N_K-1}\} \leftarrow \{Aug(x_i)\}_{i \ne j}$; query: $x^q \leftarrow Aug'(x_j)$
8: representations: $k^+ = f_k(x^+)$, $k^-_i = f_k(x^-_i)$, $\forall i \in \{1, \dots, N_K-1\}$, and $q = f_q(x^q)$
9: evaluate the task-specific metric loss $L$ of Eq. (1)
10: back-propagation update: $(\theta_q, \mu) \leftarrow (\theta_q, \mu) - lr \cdot \nabla_{(\theta_q,\mu)} L$

Setup. For fairness, the architecture of the encoders $f_q, f_k$ is kept aligned with that used by CACTUs and UMTRA, as well as by the supervised methods MAML and ProtoNets.
It is comprised of 4 convolutional blocks, each of which is a sequential combination of a 64-channel 3×3 convolution, batch normalization, ReLU and 2×2 max-pooling. The last block is followed by flattening and a normalization to form the feature representation, which leads to 64-/1600-dimensional representations for the images from Omniglot/miniImageNet, respectively. We set $lr = 0.005$ and $\beta = 0.999$ by monitoring validation performance. For the augmentation $Aug(\cdot)$, we keep consistent with UMTRA: for Omniglot, random pixel zeroing and random shifting; for miniImageNet, the ready-made AutoAugment model of Cubuk et al. (2018), trained on CIFAR. Different augmentation methods do matter according to AAL Antoniou & Storkey (2019), CC-Rot Gidaris et al. (2019) and CC-Loc Gidaris et al. (2019); thus we keep the same augmentation setting as the baseline methods, although we could likely do better by tuning the augmentation. The SGD optimizer is used for the back-propagation update of $f_q$.

Compared Methods. The compared methods are divided into an unsupervised group, a supervised group and an ablation group. The unsupervised group includes not only the cutting-edge CACTUs and UMTRA, but also several alternative algorithms that can work on the unsupervised auxiliary set $A$ (see Hsu et al. (2019) for more details about them). We also compare to supervised MAML Finn et al. (2017) and ProtoNets Snell et al. (2017), which are considered the ceiling of the unsupervised methods. The ablation group explores the effect of the number of negatives $(N_K - 1)$ and the number $M$ of positive keys in SSM: (1) baseline, the algorithm in Algorithm 1; (2) $N_K/4$, replacing the best $N_K$ (2048 for Omniglot, 512 for miniImageNet) with $N_K/4$; (3) $M$, replacing all positive or negative keys with the mean of three different keys obtained from three augmentations of the same data.
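The quoted 64-/1600-dimensional feature sizes follow from tracking the spatial shape through the four blocks, assuming each 3×3 convolution preserves spatial size (padding 1, the usual choice in this encoder) and each 2×2 max-pool floors it. A small sketch:

```python
def conv4_feature_dim(h, w, channels=64, blocks=4):
    """Flattened feature size of the 4-block encoder described above.
    Assumption: the 3x3 convs use padding 1 (spatial size preserved), so
    only the 2x2 max-pool changes the resolution (floor division by 2)."""
    for _ in range(blocks):
        h, w = h // 2, w // 2  # one 2x2 max-pool per block
    return channels * h * w

assert conv4_feature_dim(28, 28) == 64     # Omniglot: 28->14->7->3->1
assert conv4_feature_dim(84, 84) == 1600   # miniImageNet: 84->42->21->10->5
```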
Results on Omniglot. The contrast results on Omniglot in Table 3 demonstrate that SSM completely surpasses CACTUs-MAML, CACTUs-ProtoNets, UMTRA and the other alternative unsupervised methods, yielding dramatic improvements regardless of the $(N, K)$ setting. Another noticeable observation is the much smaller performance gap between SSM and the supervised MAML and ProtoNets. For the (5,5) setting especially, the SSM baseline achieves 98.09% accuracy, very close to the 98.83% of supervised MAML, even though supervised MAML uses (4800×20+5×5) labeled images whereas our SSM relies on only 5×5 labeled images for each (5,5) classification task.

Results on miniImageNet. Table 4 contrasts SSM with other methods on miniImageNet. Compared to Omniglot, the underlying complexity and ambiguity of the real-world image objects in miniImageNet cause relatively lower classification accuracy. Although SSM is not finetuned on the 50×5 support data, it still reaches the second-best result for the (5,50) setting and beats many finetuned unsupervised methods.



; Wu et al. (2018); Ye et al. (2019), or a context consisting of a set of patches Oord et al. (2018). The networks $f_q$ and $f_k$ can be identical Hadsell et al. (2006); Ye et al. (2019); Chen et al. (2020a), partially shared Oord et al. (2018); Hjelm et al. (2018); Rezaabad & Vishwanath (2019), or different Tian et al. (2019).

Algorithm 1 (continued):
11: momentum update: $\theta_k \leftarrow \beta \cdot \theta_k + (1-\beta) \cdot \theta_q$
12: end while

For miniImageNet, 64 classes are for training, 16 classes for validation and 20 classes for testing. These splits are kept the same as those used by CACTUs Hsu et al. (2019) and UMTRA Khodadadeh et al. (2019) for comparison fairness. Certainly, the labels of the data in the training classes are stripped to form the unsupervised auxiliary set $A$.
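The momentum update of step 11 is an exponential moving average of the query encoder's weights into the key encoder (assuming MoCo-style roles for the two encoders, which matches the query/key setup of Algorithm 1). A minimal sketch with parameters flattened to plain lists:

```python
def momentum_update(theta_k, theta_q, beta=0.999):
    """Step 11 of Algorithm 1: theta_k <- beta*theta_k + (1-beta)*theta_q.
    Parameters are illustrated as flat lists of floats."""
    return [beta * k + (1.0 - beta) * q for k, q in zip(theta_k, theta_q)]

theta_k, theta_q = [0.0, 2.0], [1.0, 2.0]
theta_k = momentum_update(theta_k, theta_q, beta=0.5)
assert theta_k == [0.5, 2.0]
```

With $\beta$ close to 1 (0.999 in the setup above), the key encoder evolves slowly, which keeps the keys consistent across episodes.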

episodic training on $L_{sup}$ with proper pretraining outperforms most methods. The core of FSL is to generate a stream of supervised $N$-way $M$-shot tasks and to train with variants of $L_{sup}$ on each task. For the generality and flexibility of our theories, we will prove that contrastive self-supervised training with the loss function of Eq. (2) essentially reduces $L_{sup}$ on the training data. We analyze the evaluation metric with a class-wise prototype $p_c$ rather than an episodic mean of support samples; this is proved to be a better choice of loss function because it explicitly reduces intra-class variations Chen et al. (2019).
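The global class-wise prototype $p_c$ used here (the mean over all representations of class $c$, as opposed to ProtoNets' per-episode support mean) can be sketched as follows; the helper name is ours:

```python
def global_prototypes(reps, labels):
    """p_c: mean of ALL representations carrying label c (global prototype),
    not a per-episode mean of support samples as in ProtoNets."""
    sums, counts = {}, {}
    for r, c in zip(reps, labels):
        acc = sums.setdefault(c, [0.0] * len(r))
        for j, x in enumerate(r):
            acc[j] += x
        counts[c] = counts.get(c, 0) + 1
    return {c: [x / counts[c] for x in acc] for c, acc in sums.items()}

protos = global_prototypes([[0.0, 2.0], [2.0, 0.0], [5.0, 5.0]], ["a", "a", "b"])
assert protos == {"a": [1.0, 1.0], "b": [5.0, 5.0]}
```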

and show that $s(f_k)$ can bound $L^+_U$. Then we obtain a new, tighter bound in Theorem 2.
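The intra-class deviation $s(f_k)$ appearing in this bound, and the expected false-negative count $\mathbb{E}|I^+| = (N_K - 1)/|C|$ under a uniform class prior, can be sketched in plain Python; the function names are ours, not from the paper's code:

```python
import math

def intra_class_deviation(class_reps, mu=1.0):
    """s(f_k) = |mu| * E_c sqrt(E_{k~c} ||p_c - k||^2): for each class, the
    root-mean-square distance of its representations to its prototype p_c,
    averaged over classes. class_reps maps a label to a list of vectors."""
    per_class = []
    for reps in class_reps.values():
        d = len(reps[0])
        proto = [sum(r[j] for r in reps) / len(reps) for j in range(d)]
        msd = sum(sum((r[j] - proto[j]) ** 2 for j in range(d))
                  for r in reps) / len(reps)
        per_class.append(math.sqrt(msd))
    return abs(mu) * sum(per_class) / len(per_class)

def expected_false_negatives(n_k, num_classes):
    """E|I+|: each of the N_K - 1 negatives hits the positive class w.p. 1/|C|."""
    return (n_k - 1) / num_classes

assert intra_class_deviation({"a": [[0.0, 0.0], [2.0, 0.0]], "b": [[1.0, 1.0]]}) == 0.5
assert expected_false_negatives(512, 64) == 511 / 64
```

A smaller $s(f_k)$ tightens the bound, which is consistent with the paper's argument that reducing intra-class representation deviation improves the transfer from self-supervised to supervised loss.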

Impact of the values $N$, $M$ in dynamic $N$-way $M$-shot. In Section 4, we regard $L^-_U$ as dynamic $N$-way $M$-shot supervised training. It depends on the sampled self-supervised true negative training data.

Accuracy (%) averaged over 1000 random 5-way 5-shot test tasks.

Datasets. We evaluate SSM on two FSL benchmark datasets, Omniglot Lake et al. (2015) and miniImageNet Vinyals et al. (2016). Omniglot is a character image dataset containing 1623 handwritten characters from 50 alphabets; each character has 20 grayscale images drawn by different writers. We resize the raw images to 28×28 and rotate each character by 0°, 90°, 180° and 270°.

Algorithm 1 Self-Supervised Algorithm
Require: unsupervised auxiliary set $A = \{\dots, x_i, \dots\}$, number of keys per matching task $N_K$, random augmentation function $Aug(\cdot)$, momentum rate $\beta$, SGD learning rate $lr$.
1: randomly initialize the encoder parameters $\theta_q$, $\theta_k$ of $f_q$, $f_k$ and the metric scaling scalar $\mu$
2: while not done do
3: sample $N_K$ data $\{x_1, \dots, x_{N_K}\}$ from $A$
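The episode-construction steps of Algorithm 1 (sample $N_K$ data, pick one as the positive, augment into query/keys) can be sketched in plain Python, treating `aug` and `aug_prime` as arbitrary callables standing in for $Aug(\cdot)$ and $Aug'(\cdot)$:

```python
import random

def build_episode(pool, n_k, aug, aug_prime):
    """Steps 3-7 of Algorithm 1 (sketch): sample N_K data from A, pick x_j
    as the positive, and derive query / positive key / negative keys."""
    batch = random.sample(pool, n_k)                          # step 3
    j = random.randrange(n_k)                                 # step 4
    x_q = aug_prime(batch[j])                                 # query x^q <- Aug'(x_j)
    x_pos = aug(batch[j])                                     # positive key x^+ <- Aug(x_j)
    x_negs = [aug(x) for i, x in enumerate(batch) if i != j]  # N_K - 1 negative keys
    return x_q, x_pos, x_negs

random.seed(3)
q, pos, negs = build_episode(list(range(100)), 8, aug=lambda x: x, aug_prime=lambda x: x)
assert len(negs) == 7 and q == pos
```

With identity augmentations the query and positive key coincide, which makes the false-negative analysis above concrete: a negative key drawn from the same underlying class as $x_j$ is exactly an element of $I^+$.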



This is a convincing proof of the effectiveness of the representation space learned by SSM.

