WHICH MODEL TO TRANSFER? FINDING THE NEEDLE IN THE GROWING HAYSTACK

Abstract

Transfer learning has recently been popularized as a data-efficient alternative to training models from scratch, in particular in vision and NLP, where it provides a remarkably solid baseline. The emergence of rich model repositories, such as TensorFlow Hub, enables practitioners and researchers to unleash the potential of these models across a wide range of downstream tasks. As these repositories keep growing exponentially, efficiently selecting a good model for the task at hand becomes paramount. We provide a formalization of this problem through a familiar notion of regret and introduce the predominant strategies, namely task-agnostic (e.g. picking the highest scoring ImageNet model) and task-aware search strategies (such as linear or kNN evaluation). We conduct a large-scale empirical study and show that both task-agnostic and task-aware methods can yield high regret. We then propose a simple and computationally efficient hybrid search strategy which outperforms the existing approaches. We highlight the practical benefits of the proposed solution on a set of 19 diverse vision tasks.

1. INTRODUCTION

Services such as TensorFlow Hub or PyTorch Hub^1 offer a plethora of pre-trained models that often achieve state-of-the-art performance on specific tasks in the vision and NLP domains. The predominant approach, namely choosing a pre-trained model and fine-tuning it on the downstream task, remains a very strong and data-efficient baseline. This approach succeeds not only when the pre-training task is similar to the target task, but also across tasks with seemingly differing characteristics, such as applying an ImageNet pre-trained model to medical applications like diabetic retinopathy classification (Oquab et al., 2014). Fine-tuning often entails adding several layers to the pre-trained deep network and tuning all the parameters using a limited amount of downstream data. Because all parameters are updated, this process is extremely compute-intensive (Zhai et al., 2019), and fine-tuning every candidate model to find the best-performing one is rapidly becoming computationally infeasible. A more efficient alternative is to simply train a linear classifier or a k-nearest-neighbour (kNN) classifier on top of the learned representation (e.g. the pre-logits). However, the performance gap with respect to fine-tuning can be rather large (Kolesnikov et al., 2019; Kornblith et al., 2019).

In this paper we study computationally efficient methods for determining which model(s) one should fine-tune for a given task. We divide existing methods into two groups: (a) task-agnostic model search strategies, which rank pre-trained models independently of the downstream task (e.g. sorting by ImageNet accuracy, if available), and (b) task-aware model search strategies, which use the provided downstream dataset to rank models (e.g. kNN classifier accuracy as a proxy for fine-tuning accuracy) (Kornblith et al., 2019; Meiseles & Rokach, 2020; Puigcerver et al., 2020).
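The task-aware proxies above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes each backbone's pre-logit features for the downstream data have already been extracted, and ranks models by the validation accuracy of a cheap linear or kNN classifier trained on those frozen features.

```python
# Hypothetical sketch of task-aware model search: rank pre-trained models
# by the accuracy of a cheap proxy classifier (linear or kNN) fit on frozen
# pre-logit features, instead of fine-tuning every candidate model.
# All names here are illustrative, not from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

def proxy_score(train_feats, train_labels, val_feats, val_labels, proxy="linear"):
    """Validation accuracy of a cheap classifier on frozen features."""
    if proxy == "linear":
        clf = LogisticRegression(max_iter=1000)
    else:  # "knn"
        clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(train_feats, train_labels)
    return clf.score(val_feats, val_labels)

def rank_models(model_feats, proxy="linear"):
    """Sort model names by proxy accuracy, best first.

    `model_feats` maps a model name to a tuple
    (train_feats, train_labels, val_feats, val_labels)
    of features extracted by that model's frozen backbone.
    """
    scores = {name: proxy_score(*feats, proxy=proxy)
              for name, feats in model_feats.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

As the paper notes, proxy accuracy and fine-tuning accuracy can diverge, so the top-ranked model here is only a candidate for fine-tuning, not a guaranteed winner.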
Clearly, the performance of these strategies depends on the set of models considered and the computational constraints of the practitioner (e.g. the memory footprint, desired inference time, etc.). To this end, we define several model pools and study the performance and generalization of each strategy across different pools. In particular, we make sure that these pools contain both "generalist" models (e.g. models trained on ImageNet) and "expert" models (e.g. models trained on domain-specific datasets, such as flowers, animals, etc.).
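The regret notion from the abstract can be made concrete with a small sketch (the paper's exact definition appears later; model names and accuracies below are hypothetical): given the fine-tuning accuracy of every model in a pool, a strategy's regret is the gap between the best achievable accuracy and the accuracy of the model it selects.

```python
# Hedged sketch of the regret of a model-search strategy.
# `finetune_acc` maps each model in the pool to its (expensive-to-obtain)
# fine-tuning accuracy on the downstream task; a perfect strategy picks
# the argmax and incurs zero regret. Illustrative only.
def regret(finetune_acc, picked_model):
    """Gap between the best model in the pool and the one picked."""
    best = max(finetune_acc.values())
    return best - finetune_acc[picked_model]
```

Under this view, a task-agnostic strategy that always picks the strongest ImageNet model incurs regret on any pool where a domain-specific expert fine-tunes better, which is exactly why the composition of the pool matters.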



^1 https://tfhub.dev and https://pytorch.org/hub

