SELF-TRAINING FOR FEW-SHOT TRANSFER ACROSS EXTREME TASK DIFFERENCES

Abstract

Most few-shot learning techniques are pre-trained on a large, labeled "base dataset". In problem domains where such large labeled datasets are not available for pre-training (e.g., X-ray, satellite images), one must resort to pre-training in a different "source" problem domain (e.g., ImageNet), which can be very different from the desired target task. Traditional few-shot and transfer learning techniques fail in the presence of such extreme differences between the source and target tasks. In this paper, we present a simple and effective solution to tackle this extreme domain gap: self-training a source domain representation on unlabeled data from the target domain. We show that this improves one-shot performance on the target domain by 2.9 points on average on the challenging BSCD-FSL benchmark consisting of datasets from multiple domains. Our code is available at https://github.com/cpphoo/STARTUP.

1. INTRODUCTION

Despite progress in visual recognition, training recognition systems for new classes in novel domains still requires thousands of labeled training images per class. For example, to train a system that identifies crop types in satellite images, one would have to send annotators to different locations on Earth to label thousands of satellite images. The high cost of collecting annotations precludes many downstream applications.

This issue has motivated research on few-shot learners: systems that can rapidly learn novel classes from a few examples. However, most few-shot learners are trained on a large base dataset of classes from the same domain. This is a problem in many domains (such as medical imagery or satellite images) where no large labeled dataset of base classes exists. The only alternative is to train the few-shot learner on a different domain (a common choice is ImageNet). Unfortunately, few-shot learning techniques often assume that novel and base classes share modes of variation (Wang et al., 2018), class-distinctive features (Snell et al., 2017), or other inductive biases. These assumptions break when the difference between base and novel classes is as extreme as the difference between object classification in internet photos and pneumonia detection in X-ray images. Indeed, recent work has found that all few-shot learners fail in the face of such extreme task/domain differences, underperforming even naive transfer learning from ImageNet (Guo et al., 2020).

Another alternative comes to light when one considers that many of these problem domains have unlabeled data (e.g., undiagnosed X-ray images or unlabeled satellite images). This suggests the possibility of using self-supervised techniques on this unlabeled data to produce a good feature representation, which can then be used to train linear classifiers for the target classification task using just a few labeled examples.
Indeed, recent work has explored self-supervised learning on a variety of domains (Wallace & Hariharan, 2020). However, self-supervised learning starts tabula rasa and, as such, requires extremely large amounts of unlabeled data (on the order of millions of images). With more practical unlabeled datasets, self-supervised techniques still struggle to outcompete naive ImageNet transfer (Wallace & Hariharan, 2020). We are thus faced with a conundrum: on the one hand, few-shot learning techniques fail to bridge the extreme differences between ImageNet and domains such as X-rays; on the other hand, self-supervised techniques fail when they ignore inductive biases from ImageNet. A sweet spot in the middle, if it exists, is elusive.

Figure 1: Problem setup. In the representation learning phase (left), the learner has access to a large labeled "base dataset" in the source domain, and some unlabeled data in the target domain, on which to pre-train its representation. The learner must then rapidly learn/adapt to few-shot tasks in the target domain in the evaluation phase (right).

In this paper, we solve this conundrum by presenting a strategy that adapts feature representations trained on source tasks to extremely different target domains, so that target task classifiers can then be trained on the adapted representation with very little labeled data. Our key insight is that a pre-trained base classifier from the source domain, when applied to the target domain, induces a grouping of images in the target domain. This grouping captures what the pre-trained classifier considers similar or dissimilar in the target domain. Even though the classes of the pre-trained classifier are themselves irrelevant to the target domain, the induced notions of similarity and dissimilarity may still be relevant and informative.
This induced notion of similarity contrasts with current self-supervised techniques, which often function by treating each image as its own class, dissimilar from every other image in the dataset (Wu et al., 2018; Chen et al., 2020). We propose to train feature representations on the novel target domain to replicate this induced grouping. This approach produces a feature representation that is (a) adapted to the target domain, while (b) maintaining prior knowledge from the source task to the extent that it is relevant. A discerning reader might observe the similarity of this approach to self-training, except that our goal is to adapt the feature representation to the target domain rather than to improve the base classifier itself. We call our approach "Self Training to Adapt Representations To Unseen Problems", or STARTUP. On the recently released BSCD-FSL benchmark, which consists of datasets from extremely different domains (Guo et al., 2020), we show that STARTUP provides significant gains (up to 2.9 points on average) over state-of-the-art few-shot learning, transfer learning, and self-supervision techniques. To the best of our knowledge, ours is the first attempt to bridge such large task/domain gaps and to consistently outperform naive transfer in cross-domain few-shot learning.
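The self-training objective described above can be sketched concretely. The snippet below is a minimal NumPy illustration under our own assumptions, not the paper's implementation: a frozen "teacher" (the pre-trained source classifier) produces soft pseudo-labels over its source classes for unlabeled target images, and a "student" representation is trained to reproduce this induced grouping by minimizing the cross-entropy against those soft labels, which equals the KL divergence from teacher to student up to the teacher's entropy. The function names and the use of raw logits are illustrative.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def induced_soft_labels(teacher_logits):
    # The frozen source classifier's softmax over source classes,
    # evaluated on unlabeled target images: the "induced grouping".
    return np.exp(log_softmax(teacher_logits))

def self_training_loss(student_logits, teacher_logits):
    # Cross-entropy of the student against the teacher's soft labels,
    # averaged over the unlabeled target batch. Minimized when the
    # student reproduces the teacher's induced grouping exactly.
    p = induced_soft_labels(teacher_logits)
    return -(p * log_softmax(student_logits)).sum(axis=-1).mean()
```

In practice the teacher logits would come from the pre-trained source network applied to unlabeled target images, and this loss would be optimized jointly with the supervised loss on the labeled base dataset.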

2. PROBLEM SETUP

Our goal is to build learners for novel domains that can be quickly trained to recognize new classes when presented with very few labeled data points ("few-shot"). Formally, the target domain is defined by a set of data points (e.g., images) $X_N$, an unknown set of classes (or label space) $Y_N$, and a distribution $D_N$ over $X_N \times Y_N$. A "few-shot learning task" in this domain consists of a set of classes $Y \subset Y_N$, a very small training set ("support") $S = \{(x_i, y_i)\}_{i=1}^{n} \sim D_N^n$ with $y_i \in Y$, and a small test set ("query") $Q = \{x_i\}_{i=1}^{m} \sim D_N^m$. When presented with such a few-shot learning task, the learner must rapidly learn the presented classes and accurately classify the query images.
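For concreteness, an episode of this kind can be built by sampling, for each of `n_way` classes, `n_shot` support examples and `n_query` query examples from target-domain data. The helper below is a hypothetical sketch (the function name and signature are our own, not from the paper); query labels are kept alongside the images only so that accuracy can be scored afterwards.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way, n_shot, n_query, seed=None):
    """dataset: list of (x, y) pairs from the target domain.
    Returns a support set S with n_shot examples per class and a
    query set Q with n_query examples per class, over n_way classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    # Choose the episode's label space Y, a subset of the target classes.
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for y in classes:
        xs = rng.sample(by_class[y], n_shot + n_query)
        support += [(x, y) for x in xs[:n_shot]]
        query += [(x, y) for x in xs[n_shot:]]
    return support, query
```

A learner is then trained on `support` alone and evaluated on the images in `query`, with the query labels held out for scoring.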

