WHICH MODEL TO TRANSFER? FINDING THE NEEDLE IN THE GROWING HAYSTACK

Abstract

Transfer learning has recently been popularized as a data-efficient alternative to training models from scratch, in particular in vision and NLP, where it provides a remarkably solid baseline. The emergence of rich model repositories, such as TensorFlow Hub, enables practitioners and researchers to unleash the potential of these models across a wide range of downstream tasks. As these repositories keep growing exponentially, efficiently selecting a good model for the task at hand becomes paramount. We provide a formalization of this problem through a familiar notion of regret and introduce the predominant strategies, namely task-agnostic (e.g. picking the highest scoring ImageNet model) and task-aware search strategies (such as linear or kNN evaluation). We conduct a large-scale empirical study and show that both task-agnostic and task-aware methods can yield high regret. We then propose a simple and computationally efficient hybrid search strategy which outperforms the existing approaches. We highlight the practical benefits of the proposed solution on a set of 19 diverse vision tasks.

1. INTRODUCTION

Services such as TensorFlow Hub or PyTorch Hub [1] offer a plethora of pre-trained models that often achieve state-of-the-art performance on specific tasks in the vision and NLP domains. The predominant approach, namely choosing a pre-trained model and fine-tuning it to the downstream task, remains a very strong and data-efficient baseline. This approach succeeds not only when the pre-training task is similar to the target task, but also across tasks with seemingly different characteristics, such as applying an ImageNet pre-trained model to medical applications like diabetic retinopathy classification (Oquab et al., 2014). Fine-tuning often entails adding several layers to the pre-trained deep network and tuning all the parameters using a limited amount of downstream data. Because all parameters are updated, this process can be extremely costly in terms of compute (Zhai et al., 2019), and fine-tuning every candidate model to find the best performing one is rapidly becoming computationally infeasible. A more efficient alternative is to simply train a linear classifier or a k-nearest neighbour (kNN) classifier on top of the learned representation (e.g. the pre-logits). However, the performance gap with respect to fine-tuning can be rather large (Kolesnikov et al., 2019; Kornblith et al., 2019).

In this paper we study computationally efficient methods for determining which model(s) one should fine-tune for a given task at hand. We divide existing methods into two groups: (a) task-agnostic model search strategies, which rank pre-trained models independently of the downstream task (e.g. sort by ImageNet accuracy, if available), and (b) task-aware model search strategies, which make use of the provided downstream dataset in order to rank models (e.g. kNN classifier accuracy as a proxy for fine-tuning accuracy) (Kornblith et al., 2019; Meiseles & Rokach, 2020; Puigcerver et al., 2020).
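A task-aware proxy of this kind can be sketched in a few lines. The sketch below assumes that frozen features (e.g. pre-logits) have already been extracted from each candidate model on the downstream data; the function names and the NumPy-only kNN implementation are illustrative, not the paper's exact protocol.

```python
import numpy as np

def knn_proxy_accuracy(train_x, train_y, val_x, val_y, k=5):
    """kNN classification accuracy on frozen features: a cheap proxy
    for how well a pre-trained representation transfers."""
    # Pairwise squared Euclidean distances between validation and training features.
    d2 = ((val_x[:, None, :] - train_x[None, :, :]) ** 2).sum(-1)
    # Indices of the k nearest training points for each validation point.
    nn = np.argsort(d2, axis=1)[:, :k]
    # Majority vote over the neighbours' labels.
    preds = np.array([np.bincount(train_y[row]).argmax() for row in nn])
    return (preds == val_y).mean()

def rank_models(pool_features, train_y, val_y, k=5):
    """Rank candidate models by kNN proxy accuracy.
    `pool_features` maps model name -> (train_features, val_features)
    extracted with that model's frozen representation."""
    scores = {name: knn_proxy_accuracy(tr, train_y, va, val_y, k)
              for name, (tr, va) in pool_features.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The top-ranked model (or the top few) would then be fine-tuned in full; replacing the kNN with a linear classifier trained on the same frozen features yields the linear-evaluation variant.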
Clearly, the performance of these strategies depends on the set of models considered and on the computational constraints of the practitioner (e.g. memory footprint, desired inference time, etc.). To this end, we define several model pools and study the performance and generalization of each strategy across different pools. In particular, we make sure that these pools contain both "generalist" models (e.g. models trained on ImageNet) and "expert" models (e.g. models trained on domain-specific datasets, such as flowers, animals, etc.).

Our contributions. (i) We formally define and motivate the model search problem through a notion of regret, and conduct the first study of this problem in a realistic setting focusing on heterogeneous model pools. (ii) We perform a large-scale experimental study by fine-tuning 46 models from a heterogeneous set of pre-trained models, split into 5 meaningful and representative pools, on 19 downstream tasks. (iii) We highlight how the performance of each strategy depends on the model pool under consideration, and show that, perhaps surprisingly, both task-aware and task-agnostic proxies fail (i.e. suffer a large regret) on a significant fraction of downstream tasks. (iv) Finally, we develop a hybrid approach which generalizes across model pools as a practical alternative.
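To make the notion of regret concrete: a search strategy is penalized by how far the fine-tuning accuracy of its chosen model falls short of the best fine-tuning accuracy achievable within the pool. The snippet below shows one plausible normalized instantiation; the paper's exact definition may differ (e.g. an absolute rather than relative gap), and the accuracy numbers are made up for illustration.

```python
def regret(pool_accuracies, chosen):
    """Relative regret of selecting `chosen` from a pool, measured
    against the best fine-tuning accuracy achievable in that pool.
    NOTE: this normalized form is an illustrative assumption; other
    instantiations (e.g. the absolute accuracy gap) are possible."""
    best = max(pool_accuracies.values())
    return (best - pool_accuracies[chosen]) / best

# Hypothetical fine-tuning accuracies on a flowers classification task.
accs = {"resnet50-imagenet": 0.84, "flowers-expert": 0.91, "mobilenet": 0.78}
regret(accs, "resnet50-imagenet")  # ≈ 0.077: the task-agnostic pick misses the expert
```

A strategy that always selects the best model in the pool has regret 0 on every task; "failing" on a task means incurring a large regret there.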

2. BACKGROUND AND RELATED WORK

We now introduce the main concepts behind the considered transfer learning approach, in which a pre-trained model is adapted to the target task by learning a mapping from the intermediate representation to the target labels (Pan & Yang, 2009; Tan et al., 2018; Wang, 2018; Weiss et al., 2016), as illustrated in Figure 1.

(I) Upstream models. Upstream training, or pre-training, refers simply to a procedure which trains a model on a given task. Given the variety of data sources, losses, neural architectures, and other design decisions, the set of upstream models provides a diverse collection of learned representations which can be used for a downstream task. In general, the user is provided with these models, but can neither control any of these dimensions nor access the upstream training data. Previous work calls the models in these pools specialist models (Ngiam et al., 2018) or experts (Puigcerver et al., 2020).

(II) Model search. Given no limits on computation, the problem is trivial: exhaustively fine-tune each model and pick the best performing one. In practice, however, one is often faced with stringent computational requirements, and more efficient strategies are needed. The aim of the second stage in Figure 1 is therefore to select a small number of models from the pool, so that only these are fine-tuned in the last step. The central research question of this paper is precisely how to choose the most promising models for the task at hand. We divide existing related work and the methods we focus on later in this work into three categories, as illustrated in Figure 2.

(II A) Task-agnostic search strategies. These strategies rank models without looking at the data of the downstream task (Kornblith et al., 2019). As a consequence, given a fixed pool of candidates, the same model is recommended for every task.
We focus on the following popular task-agnostic approach: (i) pick the model with the highest ImageNet test accuracy (if such a model is in the pool); otherwise (ii) pick the one trained on the largest dataset. If there is a tie, pick the biggest model in the pool (in terms of the number of parameters).
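This rule can be written as a short selection function. The metadata fields assumed below (`imagenet_acc`, `upstream_size`, `num_params`) are illustrative stand-ins for whatever model-card information a repository exposes.

```python
def task_agnostic_pick(pool):
    """Pick a model from `pool` without looking at the downstream task.
    `pool` maps model name -> metadata dict with optional keys:
    'imagenet_acc', 'upstream_size' (training-set size), 'num_params'."""
    # (i) Prefer the model with the highest ImageNet test accuracy, if any
    #     model in the pool reports one (ties broken by parameter count).
    with_inet = {n: m for n, m in pool.items()
                 if m.get("imagenet_acc") is not None}
    if with_inet:
        return max(with_inet, key=lambda n: (with_inet[n]["imagenet_acc"],
                                             with_inet[n].get("num_params", 0)))
    # (ii) Otherwise: largest upstream dataset, ties broken by parameter count.
    return max(pool, key=lambda n: (pool[n].get("upstream_size", 0),
                                    pool[n].get("num_params", 0)))
```

Note that the recommendation depends only on the pool: every downstream task receives the same model, which is exactly why such strategies can suffer high regret when the pool contains a relevant expert.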



[1] https://tfhub.dev and https://pytorch.org/hub



Figure 1: Transfer learning setup: (1) Upstream models: pre-training of models from randomly initialized weights on (large) upstream datasets; (2) Model search: either independent of the downstream task, or by running a proxy task, i.e. fixing the weights of all but the last layer and training a linear classifier or deploying a kNN classifier on the downstream dataset; (3) Downstream training: unfreezing all the weights and optimizing both the pre-trained ones and a new linear classification layer on the downstream dataset.

