LEARNING EFFICIENT MODELS FROM FEW LABELS BY DISTILLATION FROM MULTIPLE TASKS

Abstract

We address the challenge of obtaining efficient yet accurate recognition systems that can be trained with limited labels. Many specialized applications of computer vision (e.g. analyzing X-rays or satellite images) have severe resource constraints both during training and inference. While transfer learning is an effective solution for training on small labeled datasets, it still often requires a large base model for fine-tuning. In this paper we present a weighted multi-source distillation method: we distill multiple (diverse) source models trained on different domains, weighted by their relevance for the target task, into a single efficient model using limited labeled data. When the goal is accurate recognition under computational constraints, our approach outperforms both transfer learning from strong ImageNet initializations and state-of-the-art semi-supervised techniques such as FixMatch. Averaged over 8 diverse target tasks, our method outperforms these baselines by 5.6%-points and 4.5%-points, respectively.

1. INTRODUCTION

With recent advances in recognition, there is increasing interest in deploying deep networks in a variety of downstream applications, be it analyzing X-rays, skin conditions, or satellite images. However, in contrast to the increasingly massive training datasets and large networks that power advances in deep learning, many of these downstream applications have severe resource constraints. In particular, labeled training data is expensive. In addition, we often require fast and cheap inference, as privacy and availability concerns, as well as practical limitations, often require deployment on local devices. These constraints immediately rule out the standard approach of training a large neural network on massive amounts of training data. A key research question for these applications is thus: how do we get efficient but accurate recognition systems that we can train with limited labels?

A common approach for dealing with small labeled datasets is transfer learning. Here, one pre-trains a base model on a different problem where large labeled datasets are available, and then fine-tunes this model on the application of interest. The transferred model will often be large, but can then be distilled into a smaller model for deployment (Ba & Caruana, 2014; Hinton et al., 2015). While in principle this approach can be effective, in practice it relies heavily on having a good source model that is relevant to the target task. There are many ways to compare and choose the best base model (Achille et al., 2019; Kornblith et al., 2019b; Recht et al., 2019; Bolya et al., 2021), but this assumes that a single optimal base model exists. What if no single source model matches the target task, as is likely to be the case for new problem domains?


Figure 2: While common multi-source distillation usually weighs a set of S source models, M_s = h_s ∘ φ_s, equally for every target task, we propose to weight the source models using task similarity metrics, which estimate the alignment of each source model with the particular target task from a small subset of labeled data, D_τ^p. Since the task similarity metrics are independent of feature dimension, we can utilize source models of any architecture and from any source task. We show that choosing the weights α_1, ..., α_S this way improves performance over transfer from ImageNet and training with FixMatch (see e.g. Table 1 and Figure 3).

This challenge motivates our approach. We propose to produce efficient yet accurate models for new tasks with few labels by task-similarity-weighted multi-source distillation (see Figure 2). That is, we distill multiple (diverse) source models trained on different domains, weighted by their relevance for the target task. All this is done without access to any data other than that of the target task. Specifically, we first rank a diverse set (both in architecture and task) of source models for a particular target task using a task similarity metric. This ranking is then used to select and weight the most relevant source models, which are distilled into a target model of some suitable target architecture in a semi-supervised learning setting using multi-source distillation (see Figure 1).

Contributions. We summarize our contributions as follows: 1) By analyzing over 200 distilled models, we extensively verify that, for single-source cross-domain distillation, the choice of source model is important for the predictive performance of the target model. 2) We show that task similarity metrics can be used to select and weight source models for single- and multi-source distillation without access to any source data. 3) We show that our approach yields the best accuracy on multiple target tasks under compute and data constraints. We compare our task-similarity-weighted multi-source distillation to two baselines, classical transfer learning and FixMatch, as well as to the special case of multi-source distillation with equal weighting. Averaged over 8 diverse datasets, our method outperforms the baselines by at least 4.5%-points, and by 17.5%-points on CUB200.
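To make the weighting step concrete, the following is a minimal sketch of how task-similarity scores could be turned into source weights α_1, ..., α_S and used to form a combined distillation target. The excerpt does not specify the exact weighting formula, so the softmax over similarity scores and all function names here are illustrative assumptions, not the paper's definitive implementation.

```python
import math

def softmax(xs, temperature=1.0):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def source_weights(similarities, temperature=1.0):
    # Assumption: weights alpha_1..alpha_S are a softmax of the
    # per-source task-similarity scores computed on the labeled subset.
    return softmax(similarities, temperature)

def weighted_soft_target(teacher_probs, alphas):
    # Combine S teachers' class-probability vectors into one
    # soft target for the student, weighted by alpha_s.
    n_classes = len(teacher_probs[0])
    return [sum(a * p[c] for a, p in zip(alphas, teacher_probs))
            for c in range(n_classes)]

# Toy example: 3 source models, 2 target classes.
alphas = source_weights([0.9, 0.2, 0.1])
target = weighted_soft_target([[0.8, 0.2], [0.5, 0.5], [0.3, 0.7]], alphas)
```

Since each α_s depends only on a scalar similarity score, the sketch is agnostic to the teachers' architectures and feature dimensions, matching the property highlighted in the Figure 2 caption.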

2. RELATED WORK

We review several research areas relevant to our problem setup and approach.

Knowledge Distillation. One key aspect of our problem is how to compress multiple models into an efficient target model. A common approach is knowledge distillation (Ba & Caruana, 2014; Hinton et al., 2015), where an efficient student model is trained to mimic the output of a larger teacher model. However, most research on single-teacher (Adriana et al., 2015; Mirzadeh et al., 2019; Park et al., 2019; Cho & Hariharan, 2019; Borup & Andersen, 2021) or multi-teacher knowledge distillation (You et al., 2017; Fukuda et al., 2017; Tan et al., 2019; Liu et al., 2020) focuses on the closed-set setup, where the teacher(s) and the student tackle the same task. To the best of our knowledge, compressing multiple models specializing in various tasks different from the target task has rarely been explored in the literature. Our paper explores this setup and illustrates that carefully distilling multiple source models can yield efficient yet accurate models.



Figure 1: Overview of our paper and method.

