DEEP ENSEMBLES FOR LOW-DATA TRANSFER LEARNING

Anonymous

Abstract

In the low-data regime, it is difficult to train good supervised models from scratch. Instead, practitioners turn to pre-trained models, leveraging transfer learning. Ensembling is an empirically and theoretically appealing way to construct powerful predictive models, but the predominant approach of training multiple deep networks with different random initialisations collides with the need for transfer via pre-trained weights. In this work, we study different ways of creating ensembles from pre-trained models. We show that the nature of pre-training itself is a performant source of diversity, and propose a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset. The approach is simple: use nearest-neighbour accuracy to rank pre-trained models, fine-tune the best ones with a small hyperparameter sweep, and greedily construct an ensemble to minimise validation cross-entropy. When evaluated together with strong baselines on 19 different downstream tasks (the Visual Task Adaptation Benchmark), this achieves state-of-the-art performance at a much lower inference budget, even when selecting from over 2,000 pre-trained models. We also assess our ensembles on ImageNet variants and show improved robustness to distribution shift.

1. INTRODUCTION

There are many ways to construct models with minimal data. It has been shown that fine-tuning pre-trained deep models is a compellingly simple and performant approach (Dhillon et al., 2020; Kolesnikov et al., 2019), and this is the paradigm in which our work operates. It is common to use networks pre-trained on ImageNet (Deng et al., 2009), but recent works show considerable improvements from careful, task-specific pre-trained model selection (Ngiam et al., 2018; Puigcerver et al., 2020).

Ensembling multiple models is a powerful idea that often leads to better predictive performance. Its strength comes from combining diverse predictions. The sources of diversity for deep networks have been studied (Fort et al., 2019; Wenzel et al., 2020), though not thoroughly in the low-data regime. Two of the most common approaches involve training independent models from scratch with (a) different random initialisations, or (b) different random subsets of the training data. Neither of these is directly applicable downstream with minimal data, as we require a pre-trained initialisation to train competitive models (see Appendix C.1 for an illustration of the importance of pre-trained models in the low-data regime), and data scarcity makes further data fragmentation impractical. We study several ways of encouraging model diversity in a supervised transfer-learning setup, but fundamentally argue that the nature of pre-training is itself an easily accessible and valuable form of diversity.

Previous works consider the construction of ensembles from a set of candidate models (Caruana et al., 2004). Services such as TensorFlow Hub (Google, 2018) and PyTorch Hub (FAIR, 2019) contain hundreds of pre-trained models for computer vision; these could all be fine-tuned on a new task to generate candidates. Factoring in the cost of hyperparameter search, this may be prohibitively expensive. We would like to know how suited a pre-trained model is to a given task before training it. This need has given rise to cheap proxy metrics that assess this suitability (Puigcerver et al., 2020). We use such metrics, leave-one-out nearest-neighbour (kNN) accuracy in particular, as a way of selecting a subset of pre-trained models suitable for creating diverse ensembles of task-specific experts. We show that our approach is capable of quickly narrowing large pools of candidate pre-trained models (up to 2,000) down to manageable task-specific sets (15 models), yielding a practical algorithm in the common setting where many pre-trained models are available.

We propose an algorithm that exploits diversity in a large pool of pre-trained models, using leave-one-out k-nearest-neighbour (kNN) accuracy to select a subset to form the ensemble. We first experiment with sources of downstream diversity (induced only by hyperparameterisation, augmentation or random data ordering), giving significant performance boosts over single models. Using our algorithm on different pools of candidate pre-trained models, we show that various forms of upstream diversity produce ensembles that are more accurate and robust to domain shift than downstream diversity alone. Figure 1 illustrates the different approaches studied in our work. Ultimately, this new form of diversity improves on the Visual Task Adaptation Benchmark (Zhai et al., 2019) SOTA by 1.8%.

The contributions of this paper can be summarised as follows:

• We study ensembling in the context of transfer learning in the low-data regime and propose a number of ways to induce advantageous ensemble diversity which best leverage pre-trained models.
• We show that diversity from upstream pre-training achieves better accuracy than diversity from the downstream fine-tuning stage (+1.2 absolute points on average across the 19 downstream classification VTAB tasks), and that it is more robust to distribution shift (+2.2 absolute average accuracy increase on distribution-shifted ImageNet variants).

• We show that these ensembles also surpass the accuracy of large SOTA models (76.2% vs. 77.6%) at a much lower inference cost, and achieve equal performance with less than a sixth of the FLOPs.

• We extend the work of Puigcerver et al. (2020) and demonstrate the efficacy of kNN accuracy as a cheap proxy metric for selecting a subset of candidate pre-trained models; a minimal sketch of the resulting selection procedure is given below.
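The sketch below is a minimal illustration of the selection pipeline outlined above: rank candidate pre-trained models by leave-one-out kNN accuracy on frozen features, keep the best few for fine-tuning, and then greedily build an ensemble that minimises validation cross-entropy (in the spirit of Caruana et al., 2004). The function names, the use of NumPy/scikit-learn, and the data structures (a dictionary of per-model embeddings, and a dictionary of per-model validation probabilities) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch, assuming frozen-feature embeddings and validation-set
# softmax outputs are already computed for each candidate model.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict, LeaveOneOut


def knn_proxy_score(embeddings, labels):
    """Leave-one-out 1-NN accuracy of frozen features on the downstream data."""
    preds = cross_val_predict(
        KNeighborsClassifier(n_neighbors=1), embeddings, labels, cv=LeaveOneOut()
    )
    return float(np.mean(preds == labels))


def rank_pretrained_models(embeddings_per_model, labels, keep=15):
    """Rank candidate pre-trained models by kNN proxy accuracy; keep the best few."""
    scores = {name: knn_proxy_score(e, labels) for name, e in embeddings_per_model.items()}
    return sorted(scores, key=scores.get, reverse=True)[:keep]


def greedy_ensemble(val_probs, val_labels, max_size=5):
    """Greedily add fine-tuned candidates (with replacement) so that the averaged
    predictive distribution minimises validation cross-entropy."""
    def xent(probs):
        return -np.mean(np.log(probs[np.arange(len(val_labels)), val_labels] + 1e-12))

    members, current = [], None
    for _ in range(max_size):
        best_name, best_loss, best_mix = None, np.inf, None
        for name, p in val_probs.items():
            mixed = p if current is None else (current * len(members) + p) / (len(members) + 1)
            loss = xent(mixed)
            if loss < best_loss:
                best_name, best_loss, best_mix = name, loss, mixed
        members.append(best_name)
        current = best_mix
    return members

# Shapes (for illustration): embeddings_per_model maps model name -> (N, D) features,
# val_probs maps fine-tuned model name -> (N_val, num_classes) softmax outputs.
```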

2. CREATING ENSEMBLES FROM PRE-TRAINED MODELS

We first formally introduce the technical problem we address in this paper. Next, we discuss baseline approaches which use a single pre-trained model, and then we present our method, which exploits multiple pre-trained models as a source of diversity.

2.1. THE LEARNING SETUP: UPSTREAM, MODEL SELECTION, DOWNSTREAM

Transfer learning studies how models trained in one context boost learning in a different one. The most common approach pre-trains a single model on a large dataset such as ImageNet, and then fine-tunes the model weights on a downstream task. Despite its algorithmic simplicity, this idea has been very successful.
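As a concrete, if generic, illustration of this baseline, the sketch below fine-tunes a single ImageNet-pre-trained network on a downstream classification task. The choice of PyTorch/torchvision, the ResNet-50 backbone, and the hyperparameters are placeholders for exposition; the paper's actual upstream models, fine-tuning schedules and hyperparameter sweeps differ.

```python
# A generic fine-tuning sketch of the single-pre-trained-model baseline
# (not the paper's exact setup).
import torch
import torch.nn as nn
from torchvision import models


def finetune(train_loader, num_classes, steps=1000, lr=1e-3, device="cpu"):
    # Start from ImageNet-pre-trained weights and replace the classification head.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    model.to(device).train()

    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    # Fine-tune all weights on the (small) downstream training set.
    step = 0
    while step < steps:
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            opt.zero_grad()
            loss_fn(model(images), labels).backward()
            opt.step()
            step += 1
            if step >= steps:
                break
    return model
```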






Figure 1: Overview of the different ways of constructing diverse ensembles studied in this work.

