A THEORY OF SELF-SUPERVISED FRAMEWORK FOR FEW-SHOT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Recently, self-supervised learning (SSL) algorithms have been applied to few-shot learning (FSL). FSL aims to distill transferable knowledge from existing classes with large-scale labeled data in order to cope with novel classes for which only a few labeled examples are available. Because the novel classes provide so little data, the initial embedding network becomes an essential component and can largely determine performance in practice. Yet there is almost no theoretical analysis of why an embedding network pre-trained with self-supervision provides a useful representation for downstream FSL tasks. In this paper, we first summarize supervised FSL methods and explain why SSL is well suited to FSL. We then analyze the main difference between supervised and self-supervised training for FSL and derive a bound on the gap between the self-supervised loss and the supervised loss. Finally, we propose potential ways to improve test accuracy in the self-supervised FSL setting.

1. INTRODUCTION

Recently, self-supervised learning (SSL) algorithms have been applied to few-shot learning (FSL). The purpose of FSL is to extract transferable knowledge from existing classes with large-scale labeled data in order to deal with novel classes that have only a few labeled examples. Because the novel classes provide so little data, the initial embedding network becomes an essential component and greatly affects performance. In practice, SSL substantially improves the generalization of FSL methods and increases their potential for industrial application: once SSL and FSL are combined, we only need a large amount of related unlabeled data plus a few examples from the new task to obtain a model that generalizes well to that task.

In theory, however, it is difficult to analyze the performance of self-supervised pre-trained models on downstream tasks in general. A downstream task may involve a large amount of data whose distribution differs from the pre-training distribution, as in multi-view SSL. Moreover, the downstream task and the self-supervised task may be quite different in nature (e.g., classification versus segmentation), which further complicates theoretical analysis. Returning to the purpose of SSL, which is to learn a pre-trained model that transfers to different tasks, we observe that FSL pursues the same goal: an initialization that achieves good results on a new task from a few examples using a simple classifier (such as a mean classifier). FSL tasks are therefore well suited for evaluating the effect of SSL.

The main line of research on when and why self-supervised methods improve FSL compares different self-supervised methods experimentally. Almost no work analyzes, in theory, why an embedding network pre-trained with self-supervision provides a useful representation for downstream FSL tasks. We believe that theoretical analysis is necessary.
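The mean classifier mentioned above can be sketched concretely. The following is a minimal NumPy illustration (not taken from the paper): each class prototype is the mean of that class's support embeddings, and a query is assigned to the nearest prototype. The function name and interface are hypothetical.

```python
import numpy as np

def mean_classifier(support_embeddings, support_labels, query_embeddings):
    """Nearest-centroid ("mean") classifier over frozen embeddings.

    Each class prototype is the mean of its support embeddings; each
    query is assigned to the closest prototype in Euclidean distance.
    """
    classes = np.unique(support_labels)
    prototypes = np.stack([
        support_embeddings[support_labels == c].mean(axis=0) for c in classes
    ])
    # Pairwise squared Euclidean distances, shape (n_query, n_classes)
    dists = ((query_embeddings[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return classes[dists.argmin(axis=1)]
```

This is exactly the kind of simple downstream classifier that makes FSL a clean testbed for a pre-trained embedding: all learning capacity lives in the embedding network, and the classifier adds none of its own.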
For example, MoCo uses a momentum update to greatly expand the size of its key dictionary, which improves performance, but it is unclear why the dictionary needs to be so large; is a bigger batch size always better? SimCLR computes the contrastive loss on the output of a projection head rather than directly on the representations; why is this effective? Although self-supervised learning researchers have made great progress, the analysis of why SSL works has stalled at experimental and empirical conclusions due to the lack of theoretical study. We therefore believe it is both necessary and useful to analyze self-supervised learning theoretically.
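To make the SimCLR-style objective concrete, here is a minimal NumPy sketch of the NT-Xent contrastive loss (the loss SimCLR applies to projection-head outputs). This is an illustrative assumption-laden sketch, not code from the paper: it takes the already-projected embeddings `z1` and `z2` of two augmented views of the same batch, treats matching rows as positives, and treats every other row as a negative.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.

    z1, z2: (n, d) projection-head outputs for two augmented views of
    the same n images. Row i of z1 and row i of z2 form a positive pair;
    all other rows in the combined batch serve as negatives.
    """
    z = np.concatenate([z1, z2], axis=0)              # (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity prep
    sim = z @ z.T / temperature                       # (2n, 2n) similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    n = z1.shape[0]
    # Index of each row's positive: its counterpart in the other view
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```

The temperature and batch size both shape the denominator's set of negatives, which is one way to see why questions such as "how large must the dictionary or batch be?" arise naturally from this objective.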

