LEARNING EFFICIENT MODELS FROM FEW LABELS BY DISTILLATION FROM MULTIPLE TASKS

Abstract

We address the challenge of building efficient yet accurate recognition systems that can be trained with limited labels. Many specialized applications of computer vision (e.g. analyzing X-rays or satellite images) have severe resource constraints both during training and inference. While transfer learning is an effective solution for training on small labeled datasets, it still often requires a large base model for fine-tuning. In this paper we present a weighted multi-source distillation method: we distill multiple (diverse) source models trained on different domains, weighted by their relevance for the target task, into a single efficient model using limited labeled data. When the goal is accurate recognition under computational constraints, our approach outperforms both transfer learning from strong ImageNet initializations and state-of-the-art semi-supervised techniques such as FixMatch. Averaged over 8 diverse target tasks, our method outperforms these baselines by 5.6 and 4.5 percentage points, respectively.

1. INTRODUCTION

With recent advances in recognition, there is increasing interest in deploying deep networks in a variety of downstream applications, be it analyzing X-rays, skin conditions, or satellite images. However, in contrast to the increasingly massive training datasets and large networks that power advances in deep learning, many of these downstream applications have severe resource constraints. In particular, labeled training data is expensive. In addition, we often require fast and cheap inference, as privacy and availability concerns, along with practical limitations, often require deployment on local devices. These constraints immediately rule out the standard approach of training a large neural network on massive amounts of training data. A key research question for these applications is thus: how do we get efficient but accurate recognition systems that we can train with limited labels?

A common approach for dealing with small labeled datasets is transfer learning. Here, one pre-trains a base model on a different problem where large labeled datasets are available, and then fine-tunes this model on the application of interest. This transferred model will often be large, but can then be distilled into a smaller model for deployment (Ba & Caruana, 2014; Hinton et al., 2015). While in principle this approach can be effective, in practice it relies heavily on having a good source model that is relevant to the target task. There are many ways to compare and choose the best base model (Achille et al., 2019; Kornblith et al., 2019b; Recht et al., 2019; Bolya et al., 2021). But this assumes that a single optimal base model exists. What if no single source model matches the target task, as is likely going to be the case for new problem domains?
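To make the core idea of weighted multi-source distillation concrete, the following is a minimal sketch of one plausible objective: the student is trained against a relevance-weighted mixture of the teachers' softened predictions. The function name, the use of a simple cross-entropy to the mixed soft labels, and the temperature value are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def weighted_distillation_loss(student_logits, teacher_logits_list, weights, T=2.0):
    """Hypothetical sketch: cross-entropy between a relevance-weighted
    mixture of teacher soft labels and the student's softened output."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize per-teacher relevance weights
    # Mixture of teacher distributions, weighted by relevance to the target task.
    target = sum(wi * softmax(t, T) for wi, t in zip(w, teacher_logits_list))
    student = softmax(student_logits, T)
    # Cross-entropy H(target, student); small epsilon avoids log(0).
    return float(-(target * np.log(student + 1e-12)).sum(axis=-1).mean())
```

A student whose predictions agree with the most relevant teacher incurs a lower loss than one that disagrees, which is the behavior the weighting is meant to induce.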



Figure 1: Overview of our paper and method.

