LEARNING FLEXIBLE CLASSIFIERS WITH SHOT-CONDITIONAL EPISODIC (SCONE) TRAINING

Abstract

Early few-shot classification work advocates for episodic training, i.e. training over learning episodes, each posing a few-shot classification task. However, the role of this training regime remains poorly understood, and its usefulness is still debated. Standard classification training methods ("pre-training") followed by episodic fine-tuning have recently achieved strong results. This work aims to understand the role of this episodic fine-tuning phase through an exploration of the effect of the "shot" setting (number of examples per class) that is used during fine-tuning. We discover that fine-tuning on episodes of a particular shot can specialize the pre-trained model to solving episodes of that shot at the expense of performance on other shots, in agreement with a trade-off recently observed in the context of end-to-end episodic training. To address this, we propose a shot-conditional form of episodic fine-tuning, inspired by recent work that trains a single model on a distribution of losses. Our investigation shows that this improves overall performance, without suffering disproportionately on any shot. We also examine the usefulness of this approach on the large-scale Meta-Dataset benchmark, where test episodes exhibit varying shots and imbalanced classes. We find that our flexible model improves performance in that challenging environment.

1. INTRODUCTION

Few-shot classification is the problem of learning a classifier using only a few examples. Specifically, the aim is to leverage a training dataset to obtain a flexible model that can 'quickly' learn about new classes from few examples. Success is evaluated on a number of test episodes, each posing a classification task between previously-unseen test classes. In each such episode, we are given a few examples, or "shots", of each new class that can be used to adapt this model to the task at hand, and the objective is to correctly classify a held-out set of examples of the new classes. A simple approach to this problem is to learn a classifier over the training classes, parameterized as a neural network feature extractor followed by a classification layer. While the classification layer is not useful at test time due to the class shift, the embedding weights that are learned during this "pre-training" phase evidently constitute a strong representation that can be used to tackle test tasks when paired with a simple "inference algorithm" (e.g. nearest-neighbour, logistic regression) to make predictions for each example in the test episode given the episode's small training set. Alternatively, early influential works on few-shot classification (Vinyals et al., 2016) advocate for episodic training, a regime where the training objective is expressed in terms of performance on a number of training episodes of the same structure as the test episodes, but with the classes sampled from the training set. It was hypothesized that this episodic approach captures a more appropriate inductive bias for the problem of few-shot classification and would thus lead to better generalization. However, there is an ongoing debate about whether episodic training is in fact required for obtaining the best few-shot classification performance.
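As a concrete illustration of the episode structure and the "pre-training plus inference algorithm" recipe described above, the sketch below samples an N-way, k-shot episode from a pool of pre-computed embeddings and classifies its query set with a nearest-centroid rule. The function names are our own, and nearest-centroid is only one instance of the simple inference algorithms mentioned (nearest-neighbour, logistic regression); this is an assumed, minimal rendering rather than the exact procedure of any cited work.

```python
import numpy as np

def sample_episode(features, labels, n_way=5, n_shot=1, n_query=10, rng=None):
    """Sample an N-way, k-shot episode: pick n_way classes, then for each class
    a support set of n_shot examples and a disjoint query set of n_query examples.
    Classes are relabeled 0..n_way-1 within the episode."""
    rng = np.random.default_rng(rng)
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support_x, support_y, query_x, query_y = [], [], [], []
    for episode_label, c in enumerate(classes):
        idx = rng.permutation(np.flatnonzero(labels == c))
        support_x.append(features[idx[:n_shot]])
        query_x.append(features[idx[n_shot:n_shot + n_query]])
        support_y += [episode_label] * n_shot
        query_y += [episode_label] * n_query
    return (np.concatenate(support_x), np.array(support_y),
            np.concatenate(query_x), np.array(query_y))

def nearest_centroid_predict(support_embeddings, support_labels, query_embeddings):
    """Classify each query example by the nearest per-class centroid computed
    from the episode's small support set (a simple inference algorithm applied
    on top of features from a pre-trained extractor)."""
    classes = np.unique(support_labels)
    # One centroid per episode class: the mean of that class's support embeddings.
    centroids = np.stack([
        support_embeddings[support_labels == c].mean(axis=0) for c in classes
    ])
    # Squared Euclidean distance from every query point to every centroid.
    dists = ((query_embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return classes[dists.argmin(axis=1)]
```

In an end-to-end evaluation, `features` would be the output of the pre-trained embedding network on the test classes; the episode's support set adapts nothing but the centroids, which is what makes this inference step cheap.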
Notably, recent work (Chen et al., 2019; Dhillon et al., 2020) proposed strong "pre-training" baselines that leverage common best practices for supervised training (e.g. normalization schemes, data augmentation) to obtain a powerful representation that works well for this task. Interestingly, other recent work combines the pre-training of a single classifier with episodic fine-tuning by removing the classification head and continuing to train the embedding network using the episodic inference algorithm that will be applied at test time (Triantafillou et al., 2020; Chen et al., 2020). The success of this hybrid approach suggests that perhaps the two regimes

