SURGICAL FINE-TUNING IMPROVES ADAPTATION TO DISTRIBUTION SHIFTS

Abstract

A common approach to transfer learning under distribution shift is to fine-tune the last few layers of a pre-trained model, thereby preserving learned features while adapting to the new task. This paper shows that in such settings, selectively fine-tuning a subset of layers (which we term surgical fine-tuning) matches or outperforms commonly used fine-tuning approaches. Moreover, the type of distribution shift influences which subset is more effective to tune: for example, for image corruptions, fine-tuning only the first few layers works best. We validate our findings systematically across seven real-world data tasks spanning three types of distribution shifts. Theoretically, we prove that for two-layer neural networks in an idealized setting, first-layer tuning can outperform fine-tuning all layers. Intuitively, fine-tuning more parameters on a small target dataset can cause information learned during pre-training to be forgotten, and which information is relevant depends on the type of shift.

1. INTRODUCTION

While deep neural networks have achieved impressive results in many domains, they are often brittle to even small distribution shifts between the source and target domains (Recht et al., 2019; Hendrycks & Dietterich, 2019; Koh et al., 2021). While many approaches to robustness attempt to generalize directly to the target distribution after training on source data (Peters et al., 2016; Arjovsky et al., 2019), an alternative is to fine-tune on a small amount of labeled target data. Collecting such small labeled datasets can improve downstream performance in a cost-effective manner while substantially outperforming domain generalization and unsupervised adaptation methods (Rosenfeld et al., 2022; Kirichenko et al., 2022). We therefore focus on settings where we first train a model on a relatively large source dataset and then fine-tune the pre-trained model on a small target dataset, as a means of adapting to distribution shifts.

The motivation behind existing fine-tuning methods is to fit the new data while also preserving the information obtained during the pre-training phase. Such information preservation is critical for successful transfer learning, especially when the source and target distributions share substantial information despite the distribution shift. To reduce overfitting during fine-tuning, existing works have proposed using a smaller learning rate than during pre-training (Kornblith et al., 2019; Li et al., 2020), freezing the early backbone layers and gradually unfreezing them (Howard & Ruder, 2018; Mukherjee & Awadallah, 2019; Romero et al., 2020), or using a different learning rate for each layer (Ro & Choi, 2021; Shen et al., 2021). In this paper, we show that preserving information in a non-standard way can result in better performance.
Contrary to the conventional wisdom that one should fine-tune the last few layers to reuse the learned features, we observe that fine-tuning only the early layers of the network results in better performance on image corruption datasets such as CIFAR-10-C (Hendrycks & Dietterich, 2019). More specifically, as an initial finding, when transferring a model pre-trained on CIFAR-10 to CIFAR-10-C by fine-tuning on a small number of labeled corrupted images, fine-tuning only the first block of layers while freezing the rest outperforms full fine-tuning of all parameters by almost 3% on average on unseen corrupted images.
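In practice, this kind of layer-selective (surgical) fine-tuning amounts to freezing every parameter outside the chosen block before optimization. Below is a minimal PyTorch sketch, not taken from the paper's code: the helper name and the prefix-matching convention for identifying the "first block" are our own illustrative assumptions.

```python
import torch
import torch.nn as nn

def select_surgical_params(model: nn.Module, tune_prefixes):
    """Freeze every parameter except those whose names start with one of
    the given prefixes; return the names of the parameters left trainable."""
    tuned = []
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in tune_prefixes)
        if param.requires_grad:
            tuned.append(name)
    return tuned

# Toy stand-in for a pre-trained network: tune only the first layer ("0.").
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
tuned = select_surgical_params(model, tune_prefixes=["0."])

# Hand only the still-trainable parameters to the optimizer.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```

For the CIFAR-10-C setting above, the prefixes would instead select the first convolutional block of the pre-trained backbone; the rest of the fine-tuning loop is unchanged.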

