ON THE ROLE OF PRE-TRAINING FOR META FEW-SHOT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Few-shot learning aims to classify examples from unknown classes given only a few new examples per class. There are two key routes for few-shot learning. One is to (pre-)train a classifier with examples from known classes, and then transfer the pre-trained classifier to unknown classes using the new examples. The other, called meta few-shot learning, is to couple pre-training with episodic training, which contains episodes of few-shot learning tasks simulated from the known classes. Pre-training is known to play a crucial role for the transfer route, but its role for the episodic route is less clear. In this work, we study the role of pre-training for the episodic route. We find that pre-training plays a major role in disentangling the representations of known classes, which makes the resulting learning tasks easier for episodic training. This finding allows us to shift the huge simulation burden of episodic training to a simpler pre-training stage. We justify the benefit of this shift by designing a new disentanglement-based pre-training model, which helps episodic training achieve competitive performance more efficiently.

1. INTRODUCTION

In recent years, deep learning methods have outperformed most traditional methods in supervised learning, especially in image classification. However, deep learning methods generally require lots of labeled data to achieve decent performance, and some applications do not have the luxury of obtaining such data. For instance, for bird classification, an ornithologist typically can only obtain a few pictures per bird species to update the classifier. Such needs to build classifiers from limited labeled data inspire several research problems, including the few-shot learning problem (Finn et al., 2017; Snell et al., 2017; Rajeswaran et al., 2019; Oreshkin et al., 2018; Vinyals et al., 2016; Lee et al., 2019). In particular, few-shot learning starts with a training dataset that consists of data points for "seen" classes, and is required to accurately classify "unseen" classes in the testing phase based on limited labeled data points from the unseen classes. Currently, two main frameworks deal with the few-shot learning problem: meta-learning (Finn et al., 2017; Snell et al., 2017; Chen et al., 2019) and transfer learning (Dhillon et al., 2020). For transfer learning, the main idea is to train a traditional classifier on the meta-train dataset; in the testing phase, these methods finetune the model on the limited labeled data points of the novel classes. For meta-learning frameworks, the main concept is episodic training (Vinyals et al., 2016). In the testing phase of few-shot learning, the learning method is given N novel classes, each containing K labeled data points for fine-tuning and Q query data points for evaluation. Unlike transfer-learning algorithms, episodic training simulates this testing scenario in the training phase by sampling episodes from the training dataset.
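The N-way K-shot episode structure just described can be sketched as follows. This is a minimal illustration; the dictionary-based `dataset` format and the function name are our own assumptions rather than any established API:

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=15):
    """Sample one N-way K-shot episode from a {class_label: [examples]} dict.

    Returns (support, query) lists of (example, label) pairs: K labeled
    examples per class for fine-tuning and Q per class for evaluation.
    """
    classes = random.sample(sorted(dataset), n_way)
    support, query = [], []
    for label in classes:
        examples = random.sample(dataset[label], k_shot + q_queries)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# Toy meta-train set: 10 "seen" classes with 20 examples each.
data = {c: list(range(20)) for c in range(10)}
support, query = sample_episode(data, n_way=5, k_shot=1, q_queries=15)
```

During meta-training, many such episodes are sampled so that the training condition mimics the N-way K-shot testing condition.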
In the past two years, some transfer-learning methods (Dhillon et al., 2020) with sophisticated designs in the finetuning part have achieved performance competitive with meta-learning approaches. Moreover, researchers (Lee et al., 2019; Sun et al., 2019; Chen et al., 2019; Oreshkin et al., 2018) have pointed out that combining the global classifier (pre-training part) of the transfer-learning framework with the episodic-training concept of the meta-learning framework can lead to better performance. Yet most of the attention is currently on the episodic-training part (Vinyals et al., 2016; Finn et al., 2017; Snell et al., 2017; Oreshkin et al., 2018; Sun et al., 2019; Lee et al., 2019), and the role of pre-training is still vague. Meta-learning and pre-training have both improved a lot in the past few years, but most works focus on accuracy instead of efficiency. For meta-learning, an intuitive way to make the process more efficient is to reduce the number of episodes. Currently, only limited research (Sun et al., 2019) works on reducing the number of episodes. One approach (Chen et al., 2019; Lee et al., 2019) is to apply a better weight-initialization method, namely the one obtained from pre-training, instead of random initialization. Another (Sun et al., 2019) is to mimic how people learn: for example, when learning dynamic programming, we learn much more from solving a knapsack problem with a strong constraint than from solving one with a simple constraint. Sun et al. (2019) followed the latter idea and crafted hard episodes to decrease the number of necessary episodes.
In this work, we study the role of pre-training in meta few-shot learning through the disentanglement of representations. Disentanglement measures the degree to which data points from different classes are separated rather than mixed together in a representation. Frosst et al. (2019) pointed out that, except for the last layer of the model, the representations after all other layers are entangled: the last layer performs the classification, while the rest capture globally shared information. By analyzing the disentanglement of representations during episodic training, we find that although pre-training gives a better representation that benefits episodic training, the representation becomes even more disentangled after episodic training. That is to say, episodic training spends some of its effort on making the representation more disentangled. Benefiting from this understanding, we design a pre-training method that yields more disentangled representations and helps episodic training achieve competitive performance more efficiently. With our pre-training loss, the classical meta-learning algorithm ProtoNet (Snell et al., 2017) achieves performance competitive with other methods. Our study not only benefits episodic training but also points out another direction to sharpen and speed up episodic training. To sum up, this work makes three main contributions:
1. A brief study of the role of pre-training in episodic training.
2. A simple regularization loss that sharpens classical meta-learning algorithms.
3. A new aspect of reducing the number of necessary training episodes.
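ProtoNet, mentioned above, classifies each query by the distance between its embedding and per-class prototypes, where a prototype is the mean of a class's support embeddings. A minimal sketch of this classification step, assuming pre-computed embeddings (the encoder is omitted, and the function name is ours):

```python
import numpy as np

def prototypical_predict(support_x, support_y, query_x):
    """Classify query embeddings by squared Euclidean distance to class
    prototypes (mean support embeddings), in the spirit of ProtoNet
    (Snell et al., 2017)."""
    classes = np.unique(support_y)
    protos = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
    # Pairwise squared distances, shape (num_queries, num_classes).
    dists = ((query_x[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

# Toy 2-way episode with 2-D "embeddings".
support_x = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
support_y = np.array([0, 0, 1, 1])
query_x = np.array([[0., 0.5], [10., 10.5]])
preds = prototypical_predict(support_x, support_y, query_x)  # → [0, 1]
```

Because the prototype computation involves no learned classifier weights for the novel classes, the quality of the embedding, and hence of the pre-training that shapes it, directly determines performance.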

2. RELATED WORK

Few-shot learning tries to mimic the human ability to generalize to novel classes with limited data points. In the following, we briefly introduce recent progress in the transfer-learning framework and two categories of the meta-learning framework. Afterward, we give a brief introduction to the less-studied episode-efficiency problem.

2.1. TRANSFER-LEARNING FRAMEWORK

In the training phase, the transfer-learning framework trains a classifier on the general classification task across all base classes instead of utilizing episodic training. In the testing phase, transfer-learning methods finetune the model with the limited labeled data. Several techniques have been proposed. Qi et al. (2018) appended the mean embedding of each given class as a final layer of the classifier. Qiao et al. (2018) used the parameters of the last activation output to dynamically predict the classifier for novel classes. Gidaris & Komodakis (2018) proposed a similar concept to Qiao et al. (2018): they also embedded the weights of base classes during novel-class prediction, and introduced an attention mechanism instead of directly averaging over the parameters of each shot. Beyond embedding base-class weights into the final classifier, Dhillon et al. (2020) utilized label propagation based on the uncertainty of individual predictions to prevent overfitting in the finetuning stage, making the procedure quite similar to classical classification tasks.

2.2. META-LEARNING FRAMEWORK

For meta-learning frameworks, the main concepts are learning to learn and episodic training (Vinyals et al., 2016). Learning to learn refers to learning from many tasks to benefit the learning of a new task. To prevent confusion, the original train and test phases are referred to as "meta-train" and "meta-test"; the terms "train" and "test" refer to the phases within each small task. Episodic training simulates the meta-test setting by sampling N-way K-shot tasks from the meta-train classes.
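The imprinting-style idea of Qi et al. (2018) described above, where the mean support embedding of a novel class becomes that class's weight in the final classifier, can be sketched as follows. This is our own illustrative sketch of the idea, not the authors' implementation; the function and variable names are hypothetical:

```python
import numpy as np

def imprint_weights(embeddings, labels, n_novel):
    """Build final-layer weights for novel classes by imprinting: each
    weight row is the L2-normalized mean of that class's support
    embeddings."""
    dim = embeddings.shape[1]
    weights = np.zeros((n_novel, dim))
    for c in range(n_novel):
        mean = embeddings[labels == c].mean(axis=0)
        weights[c] = mean / np.linalg.norm(mean)
    return weights  # rows would be appended to the pre-trained classifier

# One support embedding per novel class, for illustration.
support_emb = np.array([[3., 4.], [0., 2.]])
support_lab = np.array([0, 1])
novel_w = imprint_weights(support_emb, support_lab, n_novel=2)
```

Normalizing the imprinted rows keeps their scale comparable to the base-class weights, so novel classes can be scored with the same inner-product classifier without retraining.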

