LEARNING FLEXIBLE CLASSIFIERS WITH SHOT-CONDITIONAL EPISODIC (SCONE) TRAINING

Abstract

Early few-shot classification work advocates for episodic training, i.e. training over learning episodes, each posing a few-shot classification task. However, the role of this training regime remains poorly understood, and its usefulness is still debated. Standard classification training ("pre-training") followed by episodic fine-tuning has recently achieved strong results. This work aims to understand the role of this episodic fine-tuning phase through an exploration of the effect of the "shot" setting (number of examples per class) that is used during fine-tuning. We discover that fine-tuning on episodes of a particular shot can specialize the pre-trained model to solving episodes of that shot at the expense of performance on other shots, in agreement with a trade-off recently observed in the context of end-to-end episodic training. To address this, we propose a shot-conditional form of episodic fine-tuning, inspired by recent work that trains a single model on a distribution of losses. Our investigation shows that this improves overall performance without suffering disproportionately on any shot. We also examine the usefulness of this approach on the large-scale Meta-Dataset benchmark, where test episodes exhibit varying shots and imbalanced classes. We find that our flexible model improves performance in that challenging environment.

1. INTRODUCTION

Few-shot classification is the problem of learning a classifier using only a few examples. Specifically, the aim is to use a training dataset to obtain a flexible model that has the ability to 'quickly' learn about new classes from few examples. Success is evaluated on a number of test episodes, each posing a classification task between previously-unseen test classes. In each such episode, we are given a few examples, or "shots", of each new class that can be used to adapt this model to the task at hand, and the objective is to correctly classify a held-out set of examples of the new classes. A simple approach to this problem is to learn a classifier over the training classes, parameterized as a neural network feature extractor followed by a classification layer. While the classification layer is not useful at test time due to the class shift, the embedding weights that are learned during this "pre-training" phase evidently constitute a strong representation that can be used to tackle test tasks when paired with a simple "inference algorithm" (e.g. nearest-neighbour, logistic regression) that makes predictions for each example in the test episode given the episode's small training set. Alternatively, early influential works on few-shot classification (Vinyals et al., 2016) advocate for episodic training, a regime where the training objective is expressed in terms of performance on a number of training episodes of the same structure as the test episodes, but with the classes sampled from the training set. It was hypothesized that this episodic approach captures a more appropriate inductive bias for the problem of few-shot classification and would thus lead to better generalization. However, there is an ongoing debate about whether episodic training is in fact required for obtaining the best few-shot classification performance.
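The pre-train-then-infer pipeline described above can be sketched minimally as follows, assuming a nearest-centroid inference algorithm on top of a frozen feature extractor; the `embed` function below is a hypothetical stand-in for the pre-trained embedding network:

```python
import numpy as np

def nearest_centroid_predict(embed, support_x, support_y, query_x):
    """Classify query examples by distance to per-class centroids of
    embedded support examples (one simple 'inference algorithm')."""
    z_support = embed(support_x)   # (n_support, d) embedded support set
    z_query = embed(query_x)       # (n_query, d) embedded query set
    classes = np.unique(support_y)
    # One centroid per class: the mean embedded support example.
    centroids = np.stack(
        [z_support[support_y == c].mean(axis=0) for c in classes]
    )
    # Squared Euclidean distance from each query to each centroid.
    d2 = ((z_query[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[d2.argmin(axis=1)]
```

Logistic regression fit on the embedded support set would be an equally valid inference algorithm; the representation, not the inference rule, carries most of the weight in these baselines.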
Notably, recent work (Chen et al., 2019; Dhillon et al., 2020) proposed strong "pre-training" baselines that leverage common best practices for supervised training (e.g. normalization schemes, data augmentation) to obtain a powerful representation that works well for this task. Interestingly, other recent work combines the pre-training of a single classifier with episodic fine-tuning by removing the classification head and continuing to train the embedding network using the episodic inference algorithm that will be applied at test time (Triantafillou et al., 2020; Chen et al., 2020). The success of this hybrid approach suggests that perhaps the two regimes have complementary strengths, but the role of this episodic fine-tuning is poorly understood: what is the nature of the modification it induces in the pre-trained solution? Under which conditions is it required in order to achieve the best performance? As a step towards answering those questions, we investigate the effect of the shot used during episodic fine-tuning on the resulting model's performance on test tasks of a range of shots. We are particularly interested in understanding whether the shot of the training episodes constitutes a source of information that the model can leverage to improve its few-shot classification performance on episodes of that shot at test time. Our analysis reveals that a particular functionality this fine-tuning phase may indeed serve is to specialize a pre-trained model to solving tasks of a particular shot, which is accomplished by performing the fine-tuning on episodes of that shot. However, perhaps unsurprisingly, we find that specializing to a given shot comes at the expense of hurting performance for other shots, in agreement with the theoretical finding of Cao et al. (2020) in the context of Prototypical Networks (Snell et al., 2017), where inferior performance was reported when the shot at training time did not match the shot at test time.
Given those trade-offs, how can our newfound understanding of episodic fine-tuning as shot specialization help us in practice? It is unrealistic to assume that we will always have the same number of labeled examples for every new class we hope to learn at test time, so we are interested in approaches that operate well on tasks of a range of shots. However, it is impractical to fine-tune a separate episodic model for every shot, and intuitively that seems wasteful, as we expect that tasks of similar shots should require similar models. Motivated by this, we propose to train a single shot-conditional model for specializing the pre-trained solution to a wide spectrum of shots without suffering trade-offs. This leads to a compact but flexible model that can be conditioned to be made appropriate for the shot appearing in each test episode. In what follows, we provide some background on few-shot classification and episodic models, and then introduce our proposed shot-conditioning approach and related work. We then present our experimental analysis of the effect of the shot chosen for episodic fine-tuning, and we observe that our shot-conditional training approach is beneficial for obtaining a general, flexible model that does not suffer the trade-offs inherent in naively specializing to any particular shot. Finally, we experiment with our proposed shot-conditional approach on the large-scale Meta-Dataset benchmark for few-shot classification, and demonstrate its effectiveness in that challenging environment.

2. BACKGROUND

Problem definition Few-shot classification aims to classify test examples of unseen classes from a small labeled training set. The standard evaluation procedure involves sampling classification episodes by picking N classes at random from a test set of classes C_test and sampling two disjoint sets of examples from the N chosen classes: a support set (or training set) of k labeled examples per class, and a query set (or test set) of unlabeled examples, forming N-way, k-shot episodes. The model is allowed to use the support set, in addition to knowledge acquired while training on a disjoint set of classes C_train, to make a prediction for examples in the query set, and is evaluated on its query-set accuracy averaged over multiple test episodes.

Episodic training Early few-shot classification approaches (Vinyals et al., 2016) operate under the assumption that obtaining a model capable of few-shot classification requires training it on (mini-batches of) learning episodes, instead of (mini-batches of) individual examples as in standard supervised learning. These learning episodes are sampled in the same way as described above for test episodes, but with classes sampled from C_train this time. In other words, the model is trained to minimize a loss of the form:

$$\mathbb{E}_{S,Q \sim P^{N,k}_{\mathrm{train}}}\left[\frac{1}{|Q|}\sum_{(x^*, y^*) \in Q} -\log p_\theta(y^* \mid x^*, S)\right],$$

where S and Q are support and query sets sampled from the distribution $P^{N,k}_{\mathrm{train}}$ of N-way, k-shot training episodes induced by C_train, and θ represents the model's parameters. This training regime is often characterized as meta-learning or learning to learn, i.e. learning over many episodes how to learn within an episode (from few labeled examples). Episodic models differ by their "inference algorithm".
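The episode-sampling procedure and the episodic loss above can be sketched as follows. The prototype-softmax form of p_θ(y* | x*, S) used here is one possible instantiation (in the spirit of Prototypical Networks), not the only choice, and `embed` is again a hypothetical stand-in for the learned feature extractor:

```python
import numpy as np

def sample_episode(rng, data_by_class, n_way, k_shot, q_queries):
    """Sample one N-way, k-shot episode from a dict {class: array of examples}.
    Returns (support, query) as lists of (example, episode_label) pairs."""
    classes = rng.choice(sorted(data_by_class), size=n_way, replace=False)
    support, query = [], []
    for label, c in enumerate(classes):
        idx = rng.permutation(len(data_by_class[c]))
        xs = data_by_class[c][idx]
        support += [(x, label) for x in xs[:k_shot]]
        query += [(x, label) for x in xs[k_shot:k_shot + q_queries]]
    return support, query

def episode_loss(embed, support, query):
    """Mean query-set -log p(y* | x*, S), with p given by a softmax over
    negative squared distances to class prototypes (support-set means)."""
    z_s = np.stack([embed(x) for x, _ in support])
    y_s = np.array([y for _, y in support])
    protos = np.stack([z_s[y_s == c].mean(axis=0) for c in np.unique(y_s)])
    loss = 0.0
    for x, y in query:
        logits = -((embed(x)[None, :] - protos) ** 2).sum(axis=-1)
        # Numerically stable log-softmax: logits - logsumexp(logits).
        m = logits.max()
        log_p = logits - (m + np.log(np.exp(logits - m).sum()))
        loss += -log_p[y]
    return loss / len(query)
```

Minimizing the average of `episode_loss` over many sampled training episodes (by gradient descent on the parameters of `embed`) is exactly the expectation in the loss above, approximated by Monte Carlo sampling of episodes.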

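To make "conditioning a single model on the shot" concrete, the hypothetical sketch below encodes the shot k into a dense code and uses it to scale and shift feature channels, FiLM-style. Both the sinusoidal encoding of log k and the choice of modulation site are illustrative assumptions, not the paper's exact mechanism:

```python
import numpy as np

def shot_embedding(k, dim=8):
    """Map an episode's shot k to a dense code (hypothetical encoding:
    sinusoidal features of log k, so nearby shots get nearby codes)."""
    freqs = 2.0 ** np.arange(dim // 2)
    t = np.log(float(k))
    return np.concatenate([np.sin(freqs * t), np.cos(freqs * t)])

def film_condition(features, gamma_w, beta_w, k):
    """FiLM-style modulation: scale and shift feature channels using
    learned affine functions (gamma_w, beta_w) of the shot code."""
    code = shot_embedding(k, dim=gamma_w.shape[0])
    gamma = 1.0 + code @ gamma_w   # per-channel scale, near 1 at init
    beta = code @ beta_w           # per-channel shift
    return features * gamma + beta
```

Because the conditioning parameters `gamma_w` and `beta_w` are shared across all shots, a single network can behave differently on 1-shot and 20-shot episodes while tasks of similar shots receive similar models, which is the intuition motivating the shot-conditional approach.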
