DEEP ENSEMBLES FOR LOW-DATA TRANSFER LEARNING

Anonymous

Abstract

In the low-data regime, it is difficult to train good supervised models from scratch. Instead practitioners turn to pre-trained models, leveraging transfer learning. Ensembling is an empirically and theoretically appealing way to construct powerful predictive models, but the predominant approach of training multiple deep networks with different random initialisations collides with the need for transfer via pre-trained weights. In this work, we study different ways of creating ensembles from pre-trained models. We show that the nature of pre-training itself is a performant source of diversity, and propose a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset. The approach is simple: Use nearest-neighbour accuracy to rank pre-trained models, fine-tune the best ones with a small hyperparameter sweep, and greedily construct an ensemble to minimise validation cross-entropy. When evaluated together with strong baselines on 19 different downstream tasks (the Visual Task Adaptation Benchmark), this achieves state-of-the-art performance at a much lower inference budget, even when selecting from over 2,000 pre-trained models. We also assess our ensembles on ImageNet variants and show improved robustness to distribution shift.

1. INTRODUCTION

There are many ways to construct models with minimal data. Fine-tuning pre-trained deep models has been shown to be a compellingly simple and performant approach (Dhillon et al., 2020; Kolesnikov et al., 2019), and this is the paradigm our work operates in. It is common to use networks pre-trained on ImageNet (Deng et al., 2009), but recent works show considerable improvements from careful, task-specific pre-trained model selection (Ngiam et al., 2018; Puigcerver et al., 2020).

Ensembling multiple models is a powerful idea that often leads to better predictive performance. Its success relies on combining diverse predictions. The sources of diversity for deep networks have been studied (Fort et al., 2019; Wenzel et al., 2020), though not thoroughly in the low-data regime. Two of the most common approaches involve training independent models from scratch with (a) different random initialisations, or (b) different random subsets of the training data. Neither is directly applicable downstream with minimal data: we require a pre-trained initialisation to train competitive models¹, and data scarcity makes further data fragmentation impractical. We study several ways of encouraging model diversity in a supervised transfer-learning setup, but fundamentally argue that the nature of pre-training is itself an easily accessible and valuable form of diversity.

Previous works consider the construction of ensembles from a set of candidate models (Caruana et al., 2004). Services such as TensorFlow Hub (Google, 2018) and PyTorch Hub (FAIR, 2019) contain hundreds of pre-trained models for computer vision; these could all be fine-tuned on a new task to generate candidates. Factoring in the cost of hyperparameter search, however, this may be prohibitively expensive. We would like to know how suited a pre-trained model is to a given task before training it. This need has given rise to cheap proxy metrics that assess this suitability (Puigcerver et al., 2020).
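The greedy ensemble construction referenced above (Caruana et al., 2004) can be sketched as follows. This is an illustrative implementation, not the paper's code: candidate models' validation predictions are assumed to be pre-computed, and at each step we add (with replacement) the model whose inclusion in the prediction average most reduces validation cross-entropy.

```python
import numpy as np


def cross_entropy(probs: np.ndarray, labels: np.ndarray) -> float:
    """Mean negative log-likelihood of the true labels."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))


def greedy_ensemble(val_probs: list, labels: np.ndarray, max_size: int = 5) -> list:
    """Caruana-style forward selection. `val_probs[i]` holds model i's
    class probabilities on the validation set, shape (n_examples, n_classes).
    Returns the indices of the selected models (repeats allowed)."""
    selected = []
    running = np.zeros_like(val_probs[0])  # running sum of selected predictions
    for _ in range(max_size):
        best_i, best_loss = None, np.inf
        for i, p in enumerate(val_probs):
            # Loss of the ensemble average if model i were added next.
            loss = cross_entropy((running + p) / (len(selected) + 1), labels)
            if loss < best_loss:
                best_i, best_loss = i, loss
        selected.append(best_i)
        running += val_probs[best_i]
    return selected
```

Allowing repeats effectively re-weights strong models; in practice one would also stop early once the validation loss no longer improves.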
We use one such metric, leave-one-out nearest-neighbour (kNN) accuracy, to select a subset of pre-trained models suitable for creating diverse ensembles of task-specific experts. We show that our approach quickly narrows large pools of candidate pre-trained models (up to 2,000) down to manageable task-specific sets (15 models), yielding a practical algorithm in the increasingly common setting where many pre-trained models are available.
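The leave-one-out kNN proxy can be sketched as below, assuming frozen embeddings of the downstream training set have already been extracted from each candidate pre-trained model (the `features` array is a hypothetical stand-in for those embeddings); each example is classified by its nearest neighbour among the remaining examples, with no training required.

```python
import numpy as np


def loo_knn_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    """Leave-one-out 1-NN accuracy over frozen embeddings.
    `features` has shape (n_examples, dim), `labels` shape (n_examples,)."""
    # Pairwise squared Euclidean distances via (a - b)^2 = a^2 + b^2 - 2ab.
    sq = np.sum(features ** 2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dists, np.inf)  # leave-one-out: exclude each point itself
    nearest = np.argmin(dists, axis=1)
    return float(np.mean(labels[nearest] == labels))


# Candidate models could then be ranked by this score, e.g.:
# scores = {name: loo_knn_accuracy(embed(name, x_train), y_train)
#           for name in candidate_models}
```

Because the score only requires a forward pass per model, it scales to thousands of candidates far more cheaply than fine-tuning each one.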



¹ For an illustration of the importance of using pre-trained models in the low-data regime, see Appendix C.1.

