SCALABLE TRANSFER LEARNING WITH EXPERT MODELS

Abstract

Transfer of pre-trained representations can improve sample efficiency and reduce computational requirements for new tasks. However, representations used for transfer are usually generic, and are not tailored to a particular distribution of downstream tasks. We explore the use of expert representations for transfer with a simple, yet effective, strategy. We train a diverse set of experts by exploiting existing label structures, and use cheap-to-compute performance proxies to select the relevant expert for each target task. This strategy scales the process of transferring to new tasks, since it does not revisit the pre-training data during transfer. Accordingly, it requires little extra compute per target task, and results in a speed-up of 2-3 orders of magnitude compared to competing approaches. Further, we provide an adapter-based architecture able to compress many experts into a single model. We evaluate our approach on two different data sources and demonstrate that it outperforms baselines on over 20 diverse vision tasks in both cases.

1. INTRODUCTION

Deep learning has been successful on many computer vision tasks. Unfortunately, this success often requires a large amount of per-task data and compute. To scale deep learning to new vision tasks, practitioners often turn to transfer learning. Transfer learning involves re-using models trained on a large source task, and tuning them on the target task. This can improve both convergence rates (Ben-David et al., 2007; 2010; Blitzer et al., 2008; Du et al., 2017; Kuzborskij & Orabona, 2013; Mansour et al., 2009) and empirical performance (Dai et al., 2007; Donahue et al., 2014; Oquab et al., 2014; Tan et al., 2018). Transfer learning reduces per-task data or compute requirements, given a large one-off pre-training cost. In practice, this one-off down payment need not be made by the practitioner, since pre-trained networks are made available through platforms like PyTorch and TensorFlow Hub. For instance, ImageNet pre-training is popular since it is freely available and works well for many tasks (Donahue et al., 2014; Oquab et al., 2014; Sharif Razavian et al., 2014).

In contrast to generic homogeneous models (e.g. most pre-trained ImageNet networks), Mixture of Experts (MoE) models include multiple heterogeneous sub-models ("experts") that specialize to sub-problems of the full task. MoEs have been studied for decades (Eigen et al., 2013; Jacobs & Jordan, 1993), and have also been successful in deep learning (Shazeer et al., 2017). Yet, the application of experts to deep transfer learning has been less explored. We study visual transfer with experts, and present a simple, scalable, yet effective strategy. Transfer of specialist models has been studied before. However, existing approaches either require expensive retraining on the source dataset for every target task (Ngiam et al., 2018; Yan et al., 2020), or operate at a small scale where all experts can be applied simultaneously (Dvornik et al., 2020).
Further, most of them are tested only on a limited suite of natural single-object classification tasks. We lift these constraints, and present a practical approach that scales to hundreds of large experts, while requiring relatively little compute per target task. Our strategy consists of four stages (fig. 1). (1) Unconditional pre-training. A single baseline model is trained on the entire upstream data. (2) Experts training. Multiple experts are pre-trained by exploiting the label hierarchy present in many large-scale image datasets, such as ImageNet and JFT. In addition to entire expert networks, we explore residual adapters that allow all of the expertise to be packed into a single model that can be loaded into memory. These two stages may be expensive, but are done only once. (3) Expert selection. Applying all experts to each task does not scale well; some sort of sparsification is required. We focus on inexpensive model selection that can be applied to hundreds or thousands of experts. (4) Downstream fine-tuning. We take the output of the model selection phase and tune it on the target task. Importantly, this phase does not require revisiting the source dataset, which may be unavailable or expensive to train on. We show that this approach yields remarkably strong performance on many diverse tasks. We evaluate not only on classic vision tasks, but also on the diverse VTAB benchmark of 19 tasks (Zhai et al., 2019).
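The expert-selection stage scores each expert by the kNN accuracy of its frozen representations on the target labels. Below is a minimal numpy sketch of that selection proxy; the leave-one-out evaluation protocol, k=1 default, and function names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def knn_accuracy(features, labels, k=1):
    """Leave-one-out kNN accuracy of `labels` under frozen `features`."""
    # Pairwise squared Euclidean distances between all examples.
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # a point cannot be its own neighbor
    correct = 0
    for i in range(len(features)):
        neighbors = np.argsort(d2[i])[:k]
        votes = np.bincount(labels[neighbors], minlength=labels.max() + 1)
        correct += int(votes.argmax() == labels[i])
    return correct / len(features)

def select_expert(expert_reprs, labels, k=1):
    """Pick the expert whose representation scores highest under kNN.

    expert_reprs: dict mapping expert name -> (n, d) array M_e(X_T).
    """
    scores = {name: knn_accuracy(feats, labels, k)
              for name, feats in expert_reprs.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Because the proxy only needs one forward pass per expert plus a cheap nearest-neighbor evaluation, it avoids fine-tuning every candidate, which is what makes the routing scale to hundreds of experts.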
Our contributions can be summarized as follows.

• We propose a transfer learning algorithm with a large number of experts based on per-task routing via nearest-neighbors selection. Once the pre-training cost is amortized, this algorithm requires little compute per target task, achieving a speed-up of 500×-1000× compared to competing strategies. Also, it can be easily replicated with any large upstream multilabel dataset.

• We achieve a mean accuracy improvement of 3.6% over the state-of-the-art performance on 19 VTAB datasets using ResNet50 networks. Our algorithm offers improvements on every group of tasks: natural, specialized, and structured. Figure 2 summarizes these results.

• We explore using sub-networks as experts via residual adapters, allowing all experts to be packed into a single model. Surprisingly, these perform almost as well as their full-network counterparts.
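The residual adapters mentioned in the last contribution attach a small expert-specific bottleneck to a shared backbone, so that many experts fit in one model. The following is a minimal numpy sketch of one such adapter layer; the dimensions, initialization scheme, and class name are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

class ResidualAdapter:
    """Bottleneck residual adapter: h -> h + W_up @ relu(W_down @ h).

    Only these two small matrices are expert-specific; the backbone
    weights they wrap are shared across all experts.
    """
    def __init__(self, dim, bottleneck, rng):
        # Down-projection gets a small random init; the up-projection
        # starts at zero so the adapter is initially the identity map
        # and leaves the shared backbone's function unchanged.
        self.w_down = rng.normal(0.0, 0.01, (bottleneck, dim))
        self.w_up = np.zeros((dim, bottleneck))

    def __call__(self, h):
        return h + self.w_up @ np.maximum(self.w_down @ h, 0.0)
```

Since the adapter adds only O(dim × bottleneck) parameters per expert, hundreds of experts can share one backbone at a small memory cost relative to storing full expert networks.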

2. RELATED WORK

Transfer Learning. Tasks with little training data can benefit from other larger datasets, often from a similar domain. Transfer learning concerns the link between the source and target dataset (Pan & Yang, 2009; Weiss et al., 2016; Tan et al., 2018; Wang, 2018). One family of methods creates a single training dataset, where source instances are re-weighted according to their relevance (Dai et al., 2007; Pardoe & Stone, 2010; Wan et al., 2011; Xu et al., 2017). A popular method consists of fine-tuning a model that was pre-trained on the source data (Donahue et al., 2014; Oquab et al., 2014; Sharif Razavian et al., 2014). Some transfer learning algorithms condition the initial source model on the target dataset itself (Ngiam et al., 2018; Xie et al., 2019; Yalniz et al., 2019), while



Figure 1: Transfer Learning with Per-Task Routing of Experts. Step 1. A single baseline model B is trained on the entire upstream dataset. Step 2. The upstream data is divided in semantic subsets (possibly overlapping). One expert is trained on each subset using the weights from B as initialization. Step 3. Given a new downstream task D T = (X T , Y T ), we compute the image representations M e (X T ) from each expert e. We use kNN to compute the accuracy on the supervised problem D T,e = (M e (X T ), Y T ), and select the expert e * with highest accuracy. Step 4. We add a new head to e * and fine-tune its whole network with the downstream data, leading to the final model.


* Equal contribution. Order decided by a coin toss. † Work done while interning at Google Research.

