A THEORY OF SELF-SUPERVISED FRAMEWORK FOR FEW-SHOT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Recently, self-supervised learning (SSL) algorithms have been applied to few-shot learning (FSL). FSL aims to distill transferable knowledge from existing classes with large-scale labeled data in order to cope with novel classes for which only a few labeled examples are available. Because the number of novel examples is limited, the initial embedding network becomes an essential component and can largely affect performance in practice. Yet there is almost no theoretical analysis of why an embedding network pre-trained with self-supervision can provide useful representations for downstream FSL tasks. In this paper, we first summarize supervised FSL methods and explain why SSL is suitable for FSL. We then analyze the main difference between supervised and self-supervised training on FSL and bound the gap between the self-supervised loss and the supervised loss. Finally, we propose potential ways to improve test accuracy in the self-supervised FSL setting.

1. INTRODUCTION

Recently, self-supervised learning (SSL) algorithms have been applied to FSL. The purpose of FSL is to extract transferable knowledge from existing classes with large-scale labeled data and use it to deal with novel classes for which only a few labeled examples are available. Because the number of novel examples is limited, the initial embedding network becomes an essential component and greatly affects performance. In practice, SSL greatly enhances the generalization of FSL methods and increases their potential for industrial application: once SSL and FSL are combined, we only need to collect a large amount of related unlabeled data and a few examples of the new task to obtain a model that generalizes well to that task.

In theory, however, it is difficult to analyze the performance of self-supervised pre-trained models on arbitrary downstream tasks. The downstream task may itself involve a large amount of data whose distribution differs from the pre-training distribution, as in multi-view SSL. Moreover, downstream tasks and self-supervised pretext tasks may be quite different (e.g., classification versus segmentation), which further complicates theoretical analysis. Returning to the purpose of SSL, which is to learn a good pre-trained model that transfers to different tasks, we find that FSL pursues the same goal: an initialization that achieves good results on a new task from a few examples with a simple classifier (such as a mean classifier). FSL tasks are therefore well suited for evaluating the effect of SSL. The main line of research on when and why self-supervised methods improve FSL compares different self-supervised methods experimentally; almost no one analyzes theoretically why an embedding network pre-trained with self-supervision can provide representations for downstream FSL tasks. We believe such theoretical analysis is necessary.
For example, MoCo uses a momentum update to greatly expand the size of its key-value dictionary, thereby improving performance. But why does the dictionary need to be so large? Is a larger batch size always better? SimCLR computes the contrastive loss through a projection head rather than directly on the representations. Why is this effective? Although self-supervised learning has made great empirical progress, the analysis of why SSL works has stalled at experimental and empirical conclusions for lack of theoretical analysis. We therefore believe it is necessary and useful to analyze self-supervised learning theoretically. We analyze the self-supervised training process through the specific application scenario of FSL. Under this setting, we avoid the complexity of general downstream tasks and can judge the quality of self-supervised learning directly by performance on new few-shot tasks. Our main idea is to quantify the gap between self-supervised and supervised training on FSL tasks by constructing supervised metrics that correspond to the self-supervised tasks. We find that the self-supervised training loss is in fact an upper bound on the supervised metric loss (Theorem 1): if we can make the self-supervised loss small enough, we control the model's supervised loss on the training data. Because FSL methods generalize well to similar downstream tasks, we conclude that self-supervised training can also generalize well to similar tasks, even when the categories of the training and test tasks differ. Unfortunately, it is often difficult to minimize the self-supervised training loss. Contrastive SSL methods sample differently augmented views of an image as the query and positive examples, and treat other samples as negatives; some of these "false negatives" in fact share the query's class.
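The contrastive objectives discussed above (MoCo, SimCLR) share the InfoNCE form: a softmax over the similarity of the query to one positive and many negatives. A minimal NumPy sketch (function and variable names are ours, not from any cited method) also illustrates the false-negative problem described below:

```python
import numpy as np

def info_nce_loss(query, positive, negatives, tau=0.1):
    """InfoNCE loss for one query: -log exp(q.k+/tau) / sum_k exp(q.k/tau).

    query:     (d,)   embedding of the anchor view
    positive:  (d,)   embedding of another augmentation of the same image
    negatives: (n, d) embeddings of other images (these may contain false
                      negatives that share the query's latent class)
    """
    # Normalize so dot products are cosine similarities.
    q = query / np.linalg.norm(query)
    k_pos = positive / np.linalg.norm(positive)
    k_neg = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)

    logits = np.concatenate([[q @ k_pos], k_neg @ q]) / tau
    logits = logits - logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=8)

# Random (true) negatives: the loss can be driven low.
loss_easy = info_nce_loss(q, q + 0.01 * rng.normal(size=8),
                          rng.normal(size=(16, 8)))

# One false negative (a near-copy of the query, i.e. same class) keeps
# the loss bounded away from zero, no matter how good the encoder is.
loss_hard = info_nce_loss(q, q + 0.01 * rng.normal(size=8),
                          np.vstack([q + 0.01 * rng.normal(size=8),
                                     rng.normal(size=(15, 8))]))
```

Running this, `loss_hard` exceeds `loss_easy`, which is the irreducible term that Theorem 2 bounds by the intra-class deviation.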
The part of the training loss introduced by these false negatives limits performance. We therefore separate the negative samples in self-supervised training into true negatives and false negatives. For true negatives, we assume the loss can be made small enough by suitable models and optimizers. For false negatives, we bound the loss by the intra-class deviation; this term is also the difference between self-supervised and supervised learning (Theorem 2). According to Theorem 2, we should control the intra-class deviation of these false negatives during training. Finally, we discuss potential ways to improve test accuracy in the self-supervised FSL setting. First, we suggest that a larger batch size is better, but only within a certain range. Second, increasing the number of support samples helps reduce the within-class variance of false negatives, which benefits test performance; technically, we treat the different augmented versions of a sample as support samples from the same class. Third, we should choose unsupervised training data with a large number of categories, because more categories reduce the probability of sampling false negatives. We also discuss the limitations of our theory. Ideally, one would like to know whether a simple contrastive self-supervised framework can yield representations competitive with those learned by supervised methods. We show that, under two assumptions, one can achieve test performance close to that of supervised training. Experiments on Omniglot support our theoretical analysis: for instance, the self-supervised framework reaches 98.23% accuracy on 5-way 5-shot classification on Omniglot, which is competitive with the 98.83% achieved by supervised MAML. We denote the feature extractor learned in the first stage by $f_q$. The linear classifier of Chen et al. (2019), trained on new samples in the fine-tuning stage, is $y = f_q(x)^\top W$ with $W = [w_1, w_2, \dots, w_c] \in \mathbb{R}^{d \times c}$. The classifier of Chen et al. (2020b) is a mean classifier whose weights are the centroids of the class features.
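The two classifiers above can be sketched together: the linear read-out $y = f_q(x)^\top W$, with $W$ either trained or, in the mean-classifier case, set column-wise to the class centroids. A toy NumPy sketch of the mean-classifier variant (names and the toy data are ours):

```python
import numpy as np

def mean_classifier_weights(features, labels, num_classes):
    """Build W in R^{d x c} whose c-th column is the centroid of the
    support features of class c (the mean classifier of Chen et al. 2020b)."""
    d = features.shape[1]
    W = np.zeros((d, num_classes))
    for c in range(num_classes):
        W[:, c] = features[labels == c].mean(axis=0)
    return W

def predict(f_q_x, W):
    """Linear read-out y = f_q(x)^T W; the prediction is the argmax logit."""
    return np.argmax(f_q_x @ W, axis=-1)

# Toy 2-way 2-shot episode with 3-dimensional features f_q(x).
support = np.array([[1.0, 0.0, 0.1],
                    [0.9, 0.1, 0.0],   # class 0
                    [0.0, 1.0, 0.1],
                    [0.1, 0.9, 0.0]])  # class 1
support_y = np.array([0, 0, 1, 1])
W = mean_classifier_weights(support, support_y, num_classes=2)
pred = predict(np.array([[0.8, 0.2, 0.0]]), W)  # query near class 0's centroid
```

Note that no gradient step is needed at test time in this variant; adapting to a new episode is a single averaging pass over the support features.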



2.1 SUMMARY OF SUPERVISED FSL METHODS

In the typical few-shot scenario introduced by Vinyals et al. (2016), the model is presented with episodes composed of a support set and a query set. The support set contains examples of the categories into which we want to classify the queries. Typically, models are given five categories (5-way) and one (one-shot) or five (five-shot) images per category. During training, the model is fed these episodes and must learn to correctly label the query set given the support set. The category sets seen during training, validation, and testing are all disjoint, which ensures that the model learns to adapt to new data rather than memorizing samples from the training set. Although most algorithms use episodes, different algorithm families differ in how they use these episodes to train the model. Recently, transfer-learning approaches have become the new state of the art for few-shot classification. Methods such as Gidaris & Komodakis (2018) pre-train a feature extractor and linear classifier in a first stage, remove the last FC layer, then fix the feature extractor and train a new linear classifier on new samples in the fine-tuning stage. Due to their success and simplicity, transfer-learning approaches have been named the "Baseline" in two recent papers (Chen et al., 2019; Dhillon et al., 2019).
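The episodic N-way K-shot construction described above can be sketched as follows (the dataset layout and all names here are our assumptions, not from any cited implementation):

```python
import numpy as np

def sample_episode(data_by_class, n_way=5, k_shot=5, q_queries=15, rng=None):
    """Sample one few-shot episode: n_way classes, with k_shot support and
    q_queries query examples per class; support and query items are disjoint.
    Class labels are re-indexed to 0..n_way-1 within the episode."""
    rng = rng or np.random.default_rng()
    classes = rng.choice(len(data_by_class), size=n_way, replace=False)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        idx = rng.permutation(len(data_by_class[c]))[:k_shot + q_queries]
        items = [data_by_class[c][i] for i in idx]
        support += [(x, episode_label) for x in items[:k_shot]]
        query += [(x, episode_label) for x in items[k_shot:]]
    return support, query

# Toy dataset: 20 classes, 30 examples each (examples are just (class, id) tags).
data = [[(c, i) for i in range(30)] for c in range(20)]
support, query = sample_episode(data, n_way=5, k_shot=1, q_queries=15,
                                rng=np.random.default_rng(0))
```

Disjoint train/validation/test category sets are then obtained by partitioning `data_by_class` itself before any episode is drawn, so no test-time class is ever seen during training.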

