SURGICAL FINE-TUNING IMPROVES ADAPTATION TO DISTRIBUTION SHIFTS

Abstract

A common approach to transfer learning under distribution shift is to fine-tune the last few layers of a pre-trained model, preserving learned features while also adapting to the new task. This paper shows that in such settings, selectively fine-tuning a subset of layers (which we term surgical fine-tuning) matches or outperforms commonly used fine-tuning approaches. Moreover, the type of distribution shift influences which subset is more effective to tune: for example, for image corruptions, fine-tuning only the first few layers works best. We validate our findings systematically across seven real-world data tasks spanning three types of distribution shifts. Theoretically, we prove that for two-layer neural networks in an idealized setting, first-layer tuning can outperform fine-tuning all layers. Intuitively, fine-tuning more parameters on a small target dataset can cause information learned during pre-training to be forgotten, and the relevant information depends on the type of shift.

1. INTRODUCTION

While deep neural networks have achieved impressive results in many domains, they are often brittle to even small distribution shifts between the source and target domains (Recht et al., 2019; Hendrycks & Dietterich, 2019; Koh et al., 2021). While many approaches to robustness attempt to directly generalize to the target distribution after training on source data (Peters et al., 2016; Arjovsky et al., 2019), an alternative approach is to fine-tune on a small amount of labeled target datapoints. Collecting such small labeled datasets can improve downstream performance in a cost-effective manner while substantially outperforming domain generalization and unsupervised adaptation methods (Rosenfeld et al., 2022; Kirichenko et al., 2022). We therefore focus on settings where we first train a model on a relatively large source dataset and then fine-tune the pre-trained model on a small target dataset, as a means of adapting to distribution shifts. The motivation behind existing fine-tuning methods is to fit the new data while also preserving the information obtained during the pre-training phase. Such information preservation is critical for successful transfer learning, especially in scenarios where the source and target distributions share a lot of information despite the distribution shift. To reduce overfitting during fine-tuning, existing works have proposed using a smaller learning rate compared to initial pre-training (Kornblith et al., 2019; Li et al., 2020), freezing the early backbone layers and gradually unfreezing (Howard & Ruder, 2018; Mukherjee & Awadallah, 2019; Romero et al., 2020), or using a different learning rate for each layer (Ro & Choi, 2021; Shen et al., 2021). We present a result in which preserving information in a non-standard way results in better performance.
Contrary to conventional wisdom that one should fine-tune the last few layers to re-use the learned features, we observe that fine-tuning only the early layers of the network results in better performance on image corruption datasets such as CIFAR-10-C (Hendrycks & Dietterich, 2019). More specifically, as an initial finding, when transferring a model pretrained on CIFAR-10 to CIFAR-10-C by fine-tuning on a small amount of labeled corrupted images, fine-tuning only the first block of layers and freezing the others outperforms full fine-tuning on all parameters by almost 3% on average on unseen corrupted images.


Figure 1: Surgical fine-tuning, where we tune only one block of parameters and freeze the remaining parameters, outperforms full fine-tuning on a range of distribution shifts. Moreover, we find that tuning different blocks performs best for different types of distribution shifts. Fine-tuning the first block works best for input-level shifts such as CIFAR-C (image corruption), later blocks work best for feature-level shifts such as Entity-30 (shift in entity subgroup), and tuning the last layer works best for output-level shifts such as CelebA (spurious correlation between gender and hair color).

To better understand this counterintuitive result, we study a general class of fine-tuning algorithms which we call surgical fine-tuning, defined as fine-tuning only a small contiguous subset of all layers in the pre-trained neural network. Equivalently, we could define surgical fine-tuning as freezing all but a few layers during fine-tuning. Parameter freezing can be beneficial because, depending on the relationship between the source and target tasks, some layer parameters trained on the source task may already be close to a minimum for the target distribution. Therefore, freezing these layers can facilitate generalization to the target distribution. We evaluate the performance of surgical fine-tuning with various layer choices on 7 different distribution shift scenarios, which we categorize into input-level, feature-level, and output-level shifts. As shown in Figure 1, fine-tuning only the first block of layers, the middle block, or the last layer can perform best in different distribution shift conditions, with the best such subset consistently outperforming fine-tuning all parameters. To support our empirical results, we theoretically analyze why different types of distribution shifts require fine-tuning different layers.
For two-layer neural networks, we show why fine-tuning the first layer is better for input perturbations but fine-tuning the last layer is better for label perturbations. We then present a setting where surgical fine-tuning on the first layer provably outperforms fine-tuning all parameters. If the target distribution contains only a few new "directions" (inputs outside the span of the source distribution), we show that tuning only the first layer can learn these new directions with very few target examples, while preserving all the information learned from the source distribution. However, we show that full fine-tuning forgets information learned from the source distribution: the last layer changes to accommodate the new target directions, but now performs poorly on examples outside the span of the training data. Motivated by the theoretical insight that freezing some layers can help generalization, we empirically analyze two criteria for automatically selecting layers to tune based on loss gradients. Tuning the layers selected by such criteria can also outperform full fine-tuning, though this procedure does not outperform manually choosing the best layers to tune. Our main contribution is the empirical observation that fine-tuning only a small contiguous subset of layers can outperform full fine-tuning on a range of distribution shifts. Intriguingly, the best layers to tune differ for different distribution shift types (Figure 1). This finding is validated empirically across seven real-world datasets and three types of distribution shifts, and theoretically in an idealized two-layer neural network setup. We additionally empirically analyze two criteria for automatically selecting which layers to tune and find that fine-tuning only the layers with higher relative gradient norm outperforms full fine-tuning.

2. SURGICAL FINE-TUNING: FREEZING PARAMETERS DURING ADAPTATION

Our problem setting assumes two datasets from different distributions: a large dataset following the source distribution P_src, and a relatively smaller dataset following the target distribution P_tgt. The objective is to achieve high accuracy on target data by leveraging the different but closely related source distribution, a common scenario in real-world applications that require adaptation. For example, the source dataset can be the 50,000 training images in CIFAR-10 (Krizhevsky et al., 2009) while the target dataset is a smaller set of 1000 corrupted CIFAR datapoints with the same image corruption (Hendrycks & Dietterich, 2019); see Figure 1 for more examples of source-target dataset pairs that we consider. To achieve high performance on the target distribution, a model should broadly fit the large source dataset and make minor adjustments based on the smaller target dataset. We empirically evaluate transfer learning performance with a two-stage training procedure consisting of pre-training and fine-tuning. First, we pre-train a network to minimize the loss on the source dataset to obtain f_src, which has high accuracy on the source distribution. The fine-tuning stage starts from the pre-trained model parameters and minimizes the loss on the labeled target data, resulting in the model f_tgt. We evaluate two fine-tuning settings in this section: supervised fine-tuning (Section 2.1) and unsupervised adaptation (Section 2.2). In all experiments, we perform early stopping on held-out target data according to the fine-tuning loss. Finally, we evaluate the performance of the fine-tuned model on held-out data from the target distribution, i.e. L_tgt(f_tgt) = E_{(x,y)∼P_tgt}[ℓ(f_tgt(x), y)].

Our main focus is analyzing surgical fine-tuning, in which we fine-tune only a subset of layers of the pre-trained model while keeping the others frozen. Denote the pre-trained model as f = f_n ∘ ... ∘ f_1(x), where each layer f_i has parameters θ_i, and the empirical target loss as L̂_tgt. Formally, surgical fine-tuning with respect to a subset S ⊆ {1, ..., n} of layers is defined as solving the optimization problem

argmin_{θ_i : i ∈ S} L̂_tgt(f(θ_1, ..., θ_n)),

where all non-surgery parameters (θ_i for i ∉ S) are fixed to their pre-trained values. Typical choices of parameters to optimize are fine-tuning all layers (S = {1, ..., n}), the last layer (S = {n}), or the last few layers (S = {n−k, ..., n}). The main novelty of the surgical fine-tuning framework is that it additionally considers tuning earlier layers while keeping later layers frozen. For example, surgical fine-tuning on the first layer (S = {1}) updates only θ_1, resulting in the fine-tuned model f_tgt(x) = f_n^src ∘ ... ∘ f_2^src ∘ f_1^tgt(x). Intuitively, surgical fine-tuning can outperform full fine-tuning when some layers in f_src are already near-optimal for the target distribution. As a hypothetical example, consider a scenario where there exist first-layer parameters θ*_1 such that changing only the first layer of f_src to θ*_1 achieves zero target loss. Here, first-layer fine-tuning (S = {1}) can find θ*_1 with a small amount of target data, while full fine-tuning (S = {1, ..., n}) may needlessly update the other layers and thus underperform on held-out target data due to overfitting. We note that the efficacy of parameter freezing is a consequence of having limited target data; choosing a bigger S will be beneficial in settings where target data is plentiful. Now that we have introduced the problem set-up, we will next empirically investigate how surgical fine-tuning with different choices of S performs on real datasets.
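As a concrete illustration, the definition above amounts to a gradient update that skips every layer outside the surgery set S. The sketch below is a minimal, framework-agnostic toy (the layer and gradient values are made up, not from the paper's experiments):

```python
# Surgical fine-tuning sketch: a "model" is a list of per-layer parameter
# lists; one update step touches only the layers whose index is in S.

def surgical_update(layers, grads, S, lr=0.1):
    """One gradient step that updates only the layers in the surgery set S."""
    return [
        [w - lr * g for w, g in zip(layer, grad)] if i in S else layer
        for i, (layer, grad) in enumerate(zip(layers, grads))
    ]

layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # three toy "layers"
grads  = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]   # their target-loss gradients
S = {0}  # first-layer surgical fine-tuning, i.e. S = {1} in the paper's indexing

updated = surgical_update(layers, grads, S)
# Only layer 0 changed; layers 1 and 2 keep their pre-trained values.
```

In a deep-learning framework the same effect is typically achieved by marking the frozen layers' parameters as not requiring gradients, or by passing only the surgery layers' parameters to the optimizer.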

2.1. SURGICAL FINE-TUNING: EXPERIMENTS ON REAL DATA

In this subsection, we aim to empirically answer the following question: how does surgical parameter fine-tuning compare to full fine-tuning in terms of sample efficiency and performance on real-world datasets?

Datasets. We run experiments on nine real-world distribution shifts, categorized into input-level, feature-level, output-level, and natural shifts, with examples shown in Figure 1. For more details about these datasets, see Appendix B.3.
• Input-level shift: (1) CIFAR-C (Hendrycks & Dietterich, 2019), (2) ImageNet-C (Kar et al., 2022). The source distributions correspond to the original CIFAR-10 and ImageNet datasets (Krizhevsky et al., 2009; Deng et al., 2009), respectively. The task is to classify images from the target datasets, which consist of corrupted images.
• Feature-level shift: (3) Living-17 and (4) Entity-30 (Santurkar et al., 2020). While the source and target distributions consist of the same classes, they contain different subpopulations of those classes. For example, in Entity-30, for the class "vegetables", P_src and P_tgt will contain different subclasses of vegetables.
• Output-level shift: (5) CIFAR-Flip, (6) Waterbirds, and (7) CelebA (Sagawa et al., 2019). CIFAR-Flip is a synthetic task where P_src consists of the original CIFAR-10 dataset and the target distribution is the same dataset where each label y has been flipped to 9 − y, e.g. the label 0 is now label 9 and vice versa. For Waterbirds and CelebA, the task labels are spuriously correlated with an attribute. The source distribution P_src is the training set while the target distribution P_tgt is a balanced subset with equal amounts of each of the four (label, spurious attribute) groups.
• Natural shift: (8) Camelyon17 and (9) FMoW, where the source and target data come from different hospitals and different geographic regions, respectively.

Models. We use ResNet-26 or ResNet-50 backbones for the above datasets, except for Camelyon17 and FMoW, which use a pre-trained CLIP ViT-B/16. After initialization, we pre-train on the source domain P_src and then fine-tune on a small amount of data from the target domain. We fine-tune with the Adam optimizer, sweeping over 3 learning rates.
We choose the best hyperparameters and early stop based on accuracy on held-out target data. We report results across 3 seeds for all experiments. See Appendix B.4 for more fine-tuning details.


Surgical fine-tuning. The models used consist of three (for ResNet-26) or four (for ResNet-50) convolutional blocks followed by a final fully connected layer. We denote these blocks as "Block 1", "Block 2", etc., in the order that they process the input, and the fully connected layer as "Last Layer". CLIP ViT-B/16 has 1 embedding layer, 11 attention blocks, and a linear classifier as the last layer. For each experimental setting, we report the relative target distribution accuracy and standard error across three runs after surgical fine-tuning on each block of the network, fine-tuning only that block while freezing all other parameters. We compare against full fine-tuning, i.e. tuning all parameters to minimize target loss.

Experimental results. Results in Figure 2 show that on every domain, surgically fine-tuning one block of the network outperforms tuning all parameters on the target distribution. We note that even matching full fine-tuning performance with surgical fine-tuning would indicate that ignoring some gradients is harmless; these results show that ignoring some gradients has a positive effect. Furthermore, we find that the best block to fine-tune differs across settings, depending on the nature of the distribution shift between source and target data. Datasets with an input-level shift are best handled by tuning the first network block, and similarly for feature-level shifts with a middle block and output-level shifts with the last layer. We see a similar phenomenon in the natural shifts: Camelyon17 is closer to an input-level shift due to the lighting difference in different hospitals, while the shift between different regions in FMoW can be seen as close to a feature-level shift because building shape and spacing is most salient in satellite imagery. Quantitative results in Table 1 are in agreement with this intuition, where tuning the earliest embedding layer works best for Camelyon17 and tuning later attention blocks works best for FMoW.
Following Kumar et al. (2022b), we also evaluate fine-tuning performance with the AdamW optimizer; results in Table 8 show a similar tendency but with smaller performance gaps. In Figure 4, we find that on CIFAR-C, fine-tuning the first block matches and even outperforms full fine-tuning as well as tuning other individual blocks when given varying amounts of data for tuning, although the gap between Block 1 and All decreases as the number of training points increases.

Intuitively, why might surgical fine-tuning match or even outperform full fine-tuning on distribution shifts? For each type of shift we consider (input-level, feature-level, and output-level), there is a sense in which one aspect of the distribution changes while everything else is kept the same, therefore requiring modification of only a small part of the information learned during pre-training. For example, in image corruptions (categorized as an input-level shift), pixel-wise local features are shifted while the underlying structure of the data is the same in the source and target distributions. On the other hand, in a label shift (categorized as an output-level shift), the pixel-wise features remain the same in the source and target distributions while the mapping from final features to labels is shifted. This intuition is also in line with the independent causal mechanisms (ICM) principle (Schölkopf et al., 2012; Peters et al., 2017), which states that the causal generative process of a system's variables is composed of autonomous modules that do not inform or influence one another. From this viewpoint, distribution shifts should correspond to local changes in the causal generative process. Because discriminative models learn to invert the generative process from label to datapoint, it suffices to fine-tune only the region of the network that corresponds to the change in the causal process. We formalize this intuition more concretely in our theoretical analysis in Section 3.

2.2. UNSUPERVISED ADAPTATION WITH PARAMETER FREEZING

In this subsection, we aim to validate whether the findings from Section 2.1 hold in the unsupervised test-time adaptation setting, where we adapt a model trained on source data to the target distribution using only unlabeled target data. We experiment with variants of a representative state-of-the-art unsupervised adaptation method, MEMO (Zhang et al., 2021a), which averages predictions across augmentations of a single test image and minimizes the entropy of that marginal prediction. We consider two settings: online, where the model retains updates from past test images, and episodic, where we reset the model back to the pre-trained weights after every test image. Results in Table 2 and Table 3 show that the highest accuracy is achieved by adapting the first two layers for CIFAR-10-C and the first block for ImageNet-C in the online setting, and doing so outperforms fine-tuning all parameters. With full fine-tuning, online MEMO performance deteriorates as the test set size increases due to distortion of pre-trained features, as shown graphically in Figure 4. In contrast, surgical fine-tuning mitigates this effect. These results are consistent with the supervised learning experiments in Section 2.1, where adapting the early parameters was best for image corruption datasets. We show detailed results in Appendix B.6.

3. THEORETICAL ANALYSIS

Theoretical setup. We focus on regression, where our goal is to map inputs x ∈ R^d to outputs y ∈ R, and ℓ(y, ŷ) = (y − ŷ)^2 is the squared loss. We consider two-layer networks f_{v,B}(x) = v^⊤ φ(Bx) where v ∈ R^k, B ∈ R^{k×d}, and φ is an elementwise activation function such as ReLU. Let x_src, y_src ∼ P_src and x_trg, y_trg ∼ P_trg be the inputs and outputs in the source and target distributions. We assume y_src = f_{v_src,B_src}(x_src) for some v_src, B_src. Note that x_src, y_src, x_trg, y_trg are all random variables, and expectations are taken over all random variables if not specified otherwise.
We define the population losses for source and target as L_src(v, B) = E[ℓ(f_{v,B}(x_src), y_src)] and L_trg(v, B) = E[ℓ(f_{v,B}(x_trg), y_trg)].

3.1. LAYER CHOICE AND EXPRESSIVITY: WHY FINE-TUNING THE RIGHT LAYER MATTERS

First note that for two-layer neural networks, we have two choices for surgical fine-tuning: the first layer and the last layer. We show by construction that if the distribution shift is closer to the input then first-layer tuning is better, but if the shift is closer to the output then last-layer tuning is better. In this section, we assume that φ is the elementwise ReLU function: φ(x)_i = max(x_i, 0). Recall that we first train on lots of source data; suppose this gives us pretrained parameters v̂_src, B̂_src which achieve minimum source loss: L_src(v̂_src, B̂_src) = 0.

Input perturbation. Suppose that the target input is a "perturbed" or "corrupted" version of the source input: x_trg = A x_src for some invertible matrix A ∈ R^{d×d}, where the corresponding label is unchanged: y_trg = y_src. We note that this simplified perturbation class includes some common image corruptions such as brightness shift and Gaussian blur as special cases, while others such as pixelation are similarly linear projections but non-invertible. Proposition 1 shows that for this distribution shift, tuning only the first layer can minimize the target loss but only changing the last layer may not.

Proposition 1. For all A, P_src, P_trg with x_trg = A x_src for invertible A and y_trg = y_src, there exists a first layer B that can minimize the target loss: min_B L_trg(v̂_src, B) = 0. However, changing the last layer may not be sufficient: there exist such A, P_src, P_trg for which the target loss is non-zero for any choice of last layer v: min_v L_trg(v, B̂_src) > 0.

Intuitively, the first layer can learn to "undo" the perturbation by selecting B = B̂_src A^{−1}.
However, if we freeze the first layer then the representations φ(B̂_src x_trg) may miss important input directions in the target, so no last layer v can produce the correct output. For a full statement and proof, see Appendix B.1.

Figure 5: In this setting, we construct distribution shifts such that a particular block of parameters is substantially more suited for adaptation. We find that tuning only the subset of parameters that is responsible for the shift performs better than tuning any other block of parameters or all parameters. Darker blue indicates higher accuracy while darker red indicates lower accuracy.

Fine-tuning data. Suppose we have n target datapoints sampled from P_trg: (x^(1)_trg, y^(1)_trg), ..., (x^(n)_trg, y^(n)_trg). The empirical fine-tuning loss is given by: L̂_trg(v, B) = Σ_{i=1}^{n} ℓ(f_{v,B}(x^(i)_trg), y^(i)_trg).

Fine-tuning algorithms. We study two gradient flows, each corresponding to a fine-tuning method: first-layer tuning (fl) and full fine-tuning (ft).

∂_t B_fl(t) = −∇_B L̂_trg(v_fl(t), B_fl(t)),   ∂_t v_fl(t) = 0,
∂_t B_ft(t) = −∇_B L̂_trg(v_ft(t), B_ft(t)),   ∂_t v_ft(t) = −∇_v L̂_trg(v_ft(t), B_ft(t)),

with initial conditions v_fl(0) = v_ft(0) = v̂_src and B_fl(0) = B_ft(0) = B̂_src. We denote the limit points of these gradient flows as v_fl^∞ = lim_{t→∞} v_fl(t), etc. The following theorem shows that there exists a shift where, if we have a small target dataset, full fine-tuning does worse than first-layer tuning.

Theorem 1. For any δ > 0, there exist d, k, P_src, P_trg, n such that with probability at least 1 − δ, first-layer tuning gets 0 loss at convergence, but full fine-tuning gets higher (non-zero) loss throughout the fine-tuning trajectory: L_trg(v_ft(t), B_ft(t)) > L_trg(v_fl^∞, B_fl^∞) = 0 for all t.

Intuitively, if P_trg contains a few additional directions in the input that are not present in P_src, then first-layer tuning can quickly learn those new directions.
Full fine-tuning changes both the head v and the feature extractor B to fit these new directions. However, because the head v has changed, it may be incompatible with B in some directions not seen in the finite training set, thus "forgetting" some knowledge present in the source data. The full proof is in Appendix B.2.
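The "undo" construction from Proposition 1 can be checked numerically. The sketch below is illustrative only (the matrices, dimensions, and values are made up): it builds a tiny ReLU network, corrupts the input with an invertible A, and verifies that replacing the first layer with B̂_src A^{−1} recovers the source output exactly.

```python
# Numeric check of Proposition 1's first-layer "undo" construction.

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def relu(z):
    return [max(zi, 0.0) for zi in z]

def f(v, B, x):
    """Two-layer network f_{v,B}(x) = v^T relu(Bx)."""
    return sum(vi * hi for vi, hi in zip(v, relu(matvec(B, x))))

# Hypothetical pre-trained parameters.
B_src = [[1.0, 2.0], [3.0, 4.0]]
v_src = [1.0, 1.0]

# Input perturbation x_trg = A x_src with an invertible diagonal A.
A     = [[2.0, 0.0], [0.0, 0.5]]
A_inv = [[0.5, 0.0], [0.0, 2.0]]

x_src = [1.0, 1.0]
x_trg = matvec(A, x_src)

# First-layer surgery: B = B_src A^{-1} undoes the corruption,
# so the network's output on x_trg matches its output on x_src.
B_new = matmul(B_src, A_inv)
assert abs(f(v_src, B_new, x_trg) - f(v_src, B_src, x_src)) < 1e-9
```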

3.3. SURGICAL FINE-TUNING ON SYNTHETIC DISTRIBUTION SHIFTS

To better illustrate how specific subsets of parameters are better suited depending on the distribution shift, we model distribution shifts by adding noise to individual blocks of layers. More specifically, we initialize with a ResNet-26 model pretrained on the CIFAR-10 (Krizhevsky et al., 2009) training dataset. We then add noise to each of the three blocks or the last layer, simulating distribution shifts localized to those parameters, and tune each of the different blocks of the network while freezing all other parameters on the CIFAR-10 test dataset. In Figure 5, we find that tuning only the subset of parameters that is responsible for the shift performs better than tuning any other subset of parameters and even outperforms tuning all layers, indicating that tuning parameters beyond those responsible for the shift may actually hurt performance.

4. AUTOMATICALLY SELECTING WHICH LAYERS TO TUNE

In this section, we investigate three criteria for automatically finding an adequate subset of layers to perform surgical fine-tuning on. We evaluate their fine-tuning performance against full fine-tuning and prior regularized fine-tuning methods on the 7 real-data domains introduced in Section 2.1. We also analyze performance on the synthetic distribution shifts introduced in Section 3.3.

4.1. CRITERIA FOR SELECTING LAYERS

We consider three criteria for automatically choosing which layers to tune.

Cross-Val. After running surgical fine-tuning for all blocks, we select the best block based on a held-out validation set from the target distribution. While quite effective, this method requires as many fine-tuning runs as there are blocks inside the network.

Relative Gradient Norm (Auto-RGN). Within each layer, we measure the ratio of gradient norm to parameter norm, and select layers that have relatively larger gradients. Intuitively, our hypothesis is that layers with large gradient magnitudes may carry more information about the target task than others and can therefore be more useful. Formally, denote the gradients at layer i as g_i. We define the relative gradient norm of this layer as RGN(θ_i) = ||g_i|| / ||θ_i||. Then, to alter fine-tuning with this criterion, at each epoch we normalize the RGNs across layers to between 0 and 1 and multiply the learning rate for each layer by its RGN. Using this criterion for fine-tuning requires no additional hyperparameters over tuning all layers and only one fine-tuning run.

Signal-to-Noise Ratio (Auto-SNR). For each layer i, this criterion is defined as SNR(g_i) = Avg(g_i)^2 / Var(g_i), with the average and variance computed across (target) datapoints. Intuitively, SNR measures how noisy the gradient of each layer is and thus how much it may contribute to distorting the function learned during pre-training.

Table 4: We report the average accuracy and standard error achieved on the target distribution on 7 real-data tasks. Cross-Val, which requires a surgical fine-tuning run for each block, performs the best, but we find that Auto-RGN performs the best out of all methods that require only 1 fine-tuning run, outperforming Full Fine-tuning, Gradual Unfreezing, L1 Regularize, and Auto-SNR. The best overall method for each shift is underlined, and the best among methods that use 1 fine-tuning run is bolded.
This gradient-based criterion has been shown to be useful for early stopping (Mahsereci et al., 2017) . During fine-tuning, we normalize the SNR for each layer between 0 and 1 and then freeze all layers that have SNR under a threshold that is tuned as an additional hyperparameter. As points of comparison, we compare the three criteria for layer selection above to existing methods for regularizing fine-tuning. We consider two variations of gradual unfreezing: Gradual Unfreeze (First → Last) and Gradual Unfreeze (Last → First) (Howard & Ruder, 2018; Romero et al., 2020; Kumar et al., 2022a) , in addition to L 1 Regularize (Xuhong et al., 2018) . These methods are similar in spirit to surgical fine-tuning in that they aim to minimize changes to parameters. We additionally experimented with regularizing the L 2 norm, but found that L 1 consistently performs better.
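The two gradient-based criteria can be sketched in a few lines. The snippet below is a minimal illustration with made-up per-layer gradients and parameters, not the paper's implementation; it computes per-layer RGN values, min-max normalizes them as Auto-RGN does each epoch, and uses them as learning-rate multipliers, alongside a per-layer SNR computed over per-example gradient values.

```python
import math

def rgn(grad, theta):
    """Auto-RGN criterion: ||g_i|| / ||theta_i|| (L2 norms assumed)."""
    g = math.sqrt(sum(x * x for x in grad))
    p = math.sqrt(sum(x * x for x in theta))
    return g / p

def snr(per_example_grads):
    """Auto-SNR criterion: Avg(g_i)^2 / Var(g_i) across datapoints."""
    mean = sum(per_example_grads) / len(per_example_grads)
    var = sum((x - mean) ** 2 for x in per_example_grads) / len(per_example_grads)
    return mean ** 2 / var

def minmax(vals):
    """Normalize values to [0, 1], as done per epoch for Auto-RGN."""
    lo, hi = min(vals), max(vals)
    return [(v - lo) / (hi - lo) for v in vals]

# Hypothetical per-layer gradients and parameters for a 3-layer model.
grads  = [[0.4, 0.3], [0.1, 0.0], [0.2, 0.2]]
params = [[1.0, 1.0], [2.0, 2.0], [1.0, 1.0]]

# Auto-RGN: normalized RGNs become per-layer learning-rate multipliers.
weights = minmax([rgn(g, p) for g, p in zip(grads, params)])
layer_lrs = [1e-3 * w for w in weights]

# Auto-SNR: a hypothetical layer's scalar gradient on three target examples.
noise_level = snr([0.5, 0.7, 0.6])
```

Under Auto-SNR, layers whose normalized SNR falls below a tuned threshold would then be frozen; under Auto-RGN, `layer_lrs` would be passed to the optimizer as layer-wise learning rates.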

4.2. RESULTS ON REAL WORLD DATASETS

In Table 4, we compare Cross-Val with 6 methods (Full Fine-Tuning, Gradual Unfreeze (First → Last), Gradual Unfreeze (Last → First), L 1 Regularize, Auto-SNR, and Auto-RGN) that require only 1 fine-tuning run. We find that auto-tuning with relative gradient norm (Auto-RGN) matches or outperforms fine-tuning all parameters on all domains and is the most competitive method that requires only 1 fine-tuning run, although it does not quite match the performance of Cross-Val. We find that Cross-Val corresponds in performance to the best surgical fine-tuning result for each dataset, which is expected, as the validation and test sets are both held-out subsets of the same target distribution. Auto-SNR struggles to extract the most effective layers for tuning and hence does worse on most shifts than All and Auto-RGN. While Gradual Unfreeze fails to consistently outperform full fine-tuning, the directionality of results is consistent with surgical fine-tuning: unfreezing first layers is best in input-level shifts and unfreezing last layers is best in output-level shifts. L 1 Regularize performs slightly better than fine-tuning all, but performs worse than Auto-RGN on all datasets except CIFAR-C. All methods outperform no adaptation.

4.3. AUTOMATIC SELECTIVE FINE-TUNING IN SYNTHETIC DISTRIBUTION SHIFTS

As Auto-RGN is the best performing method that requires only one fine-tuning run, and in particular outperforms fine-tuning all layers, we further analyze which layers Auto-RGN chooses to fine-tune and to what extent they correlate with our experiments in Section 2. To do so, we evaluate Auto-RGN on the synthetic distribution shifts introduced in Section 3.3, where we model distribution shifts by adding noise to blocks of parameters, and plot the weights that Auto-RGN gives to the layers. We find that Auto-RGN is able to ascertain which parameters may be responsible for the shift and weights the learning rates of those parameters higher than the others, resulting in an informative signal that matches the performance of tuning only the noisy subset of parameters and outperforms full fine-tuning, as seen in Table 5. Figure 6 shows the accumulated weights given by Auto-RGN over the course of training for each layer, colored by block. The weights for the layers responsible for the distribution shifts are higher than the weights for the other layers.

5. RELATED WORK

Parameter freezing. Freezing parameters to preserve previously learned information has been shown to be an effective strategy in a diverse set of domains: domain adaptation (Sener et al., 2016; Long et al., 2016), early stopping (Mahsereci et al., 2017), generative models (Mo et al., 2020), and gradient-based meta-learning (Zintgraf et al., 2019; Raghu et al., 2019; Triantafillou et al., 2021). A highly effective approach to fast adaptation of large language models is prompt tuning (Li & Liang, 2021; Hambardzumyan et al., 2021; Lester et al., 2021; Wei et al., 2021), which can similarly be seen as an extreme special case of freezing where we only fine-tune the inputs to the neural network. Our surgical fine-tuning framework contains many such previous works as special cases, and our experiments highlight the value of carefully choosing the subset of parameters to freeze.

Transfer learning. Prior works in transfer learning have studied how fine-tuning may be used to adapt pretrained features to a target distribution (Oquab et al., 2014; Yosinski et al., 2014; Sharif Razavian et al., 2014). To preserve information obtained during pre-training, many works propose methods of regularizing the fine-tuning process (Zhang et al., 2020; Xuhong et al., 2018; Lee et al., 2019a; Jiang et al., 2019; Li et al., 2020; Aghajanyan et al., 2020; Gouk et al., 2021; Shen et al., 2021; Karani et al., 2021). In particular, many works show that freezing some parameters in the pre-trained model can reduce overfitting during fine-tuning (Kirkpatrick et al., 2017; Lee et al., 2019b; Guo et al., 2019; Ramasesh et al., 2020; Liu et al., 2021b; Royer & Lampert, 2020; Eastwood et al., 2021; Evci et al., 2022; Eastwood et al., 2022; Cohen et al., 2022; Touvron et al., 2022), and we build on such observations.
Module criticality (Zhang et al., 2019; Chatterji et al., 2019; Neyshabur et al., 2020), which independently examines each layer's loss surface, is also closely related to our analysis. In contrast to existing works, we make the counterintuitive observation that freezing the later layers, or equivalently performing surgical fine-tuning on the early layers, can perform best in some settings. Furthermore, we study the relationship between the best subset of layers to tune and the nature of the distribution shift between the source and target distributions.

Distribution shifts. Many existing works have studied adaptation and robustness to various distribution shifts (Tzeng et al., 2014; Byrd & Lipton, 2019; Hendrycks et al., 2019; Arjovsky et al., 2019; Salman et al., 2020; Liu et al., 2021a; Wiles et al., 2021; Andreassen et al., 2021; Miller et al., 2021; Creager et al., 2021; Lee et al., 2022; Kumar et al., 2022a). Such works typically frame robustness to distribution shift as a zero-shot generalization problem, where the model is trained on source and evaluated on target. We consider a different problem setting where the model is allowed to adapt using some available labeled target data. Some recent works have proposed methods for model adaptation at test time (Sun et al., 2020; Varsavsky et al., 2020; Iwasawa & Matsuo, 2021; Wang et al., 2020; Zhang et al., 2021a; b; Gandelsman et al., 2022). Recent works (Rosenfeld et al., 2022; Kirichenko et al., 2022) study a problem setting close to ours, showing that fine-tuning the last layer is sufficient for adapting to datasets with a spuriously correlated attribute. Our experiments in Section 2 confirm these results, and we further evaluate on a broader set of distribution shifts including image corruptions and shifts at the level of intermediate features.
We find that fine-tuning different subsets of layers performs best for different types of distribution shifts, and also present theoretical analysis on the relationship between surgical fine-tuning and the type of distribution shift.
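As a concrete illustration of what surgical fine-tuning means in practice, the following PyTorch sketch freezes every parameter and then unfreezes only one chosen block; the toy architecture, block indices, and hyperparameters are ours, purely for illustration, not the paper's released code.

```python
import torch
import torch.nn as nn

# Toy stand-in network; the paper's experiments use ResNets and ViTs.
model = nn.Sequential(
    nn.Linear(32, 64),   # "first" block
    nn.ReLU(),
    nn.Linear(64, 64),   # "middle" block
    nn.ReLU(),
    nn.Linear(64, 10),   # "last" block (head)
)

def surgical_freeze(model, tunable_indices):
    """Freeze everything, then unfreeze only the chosen submodules."""
    for p in model.parameters():
        p.requires_grad = False
    for i in tunable_indices:
        for p in model[i].parameters():
            p.requires_grad = True

# For an input-level shift (e.g. image corruption), tune only the first block.
surgical_freeze(model, tunable_indices=[0])
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
```

Freezing the complementary subset (e.g. only the head) recovers standard last-layer fine-tuning as a special case.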

A EXTENDED DISCUSSION

In this paper, we empirically find that when fine-tuning on a new target distribution, it is often best to perform surgical fine-tuning, i.e., to adapt only a small contiguous subset of parameters. More importantly, which subset is most effective to tune depends on the type of distribution shift: for example, on input-level shifts like image corruptions, tuning only earlier layers can outperform fine-tuning all layers or only later layers. These results support our intuition from the independent causal mechanisms (ICM) principle: many distribution shifts can be explained by a shift in one module of the prediction mechanism and can thus be adapted to by tuning only a small subset of the network. Our empirical findings are supported by theoretical results, which show by construction that first-layer tuning may outperform full fine-tuning in an idealized two-layer neural network setting. Additionally, manually choosing which layers to freeze within the framework of surgical fine-tuning requires more fine-tuning runs than fine-tuning all layers, so we analyze two criteria for automatically selecting which layers to tune. While Auto-RGN consistently improves over full fine-tuning, its performance does not match the best surgical fine-tuning approach. Future work may close this gap by investigating more effective criteria for automatic selection. More generally, a potentially fruitful direction for future work is better understanding when a distribution shift favors tuning a particular layer, potentially shedding light on the nature of different distribution shifts.
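The Auto-RGN criterion mentioned above can be sketched as follows. This is our interpretation: rank each parameter tensor by its relative gradient norm (gradient norm divided by parameter norm) and scale its learning rate accordingly; the toy model, batch, and max-normalization are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

# Toy model and batch; the criterion itself is architecture-agnostic.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
x, y = torch.randn(32, 8), torch.randn(32, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Relative gradient norm per parameter tensor: ||grad|| / ||param||.
base_lr = 1e-3
rgn = {name: (p.grad.norm() / (p.norm() + 1e-12)).item()
       for name, p in model.named_parameters()}

# Scale each tensor's learning rate by its normalized RGN, so layers whose
# gradients are large relative to their weights are tuned more aggressively.
max_rgn = max(rgn.values())
param_groups = [{"params": [p], "lr": base_lr * rgn[name] / max_rgn}
                for name, p in model.named_parameters()]
optimizer = torch.optim.SGD(param_groups)
```

In this sketch the ratios are computed from a single batch; averaging them over several batches before setting the per-layer rates would be a natural variant.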

B APPENDIX

B.1 PROOFS FOR SECTION 3.1

Proposition 1. For all $A, P_{\mathrm{src}}, P_{\mathrm{trg}}$ with $x_{\mathrm{trg}} = A x_{\mathrm{src}}$ for invertible $A$ and $y_{\mathrm{trg}} = y_{\mathrm{src}}$, there exists a first layer $B$ that can minimize the target loss: $\min_B L_{\mathrm{trg}}(v_{\mathrm{src}}, B) = 0$. However, changing the last layer may not be sufficient: there exist such $A, P_{\mathrm{src}}, P_{\mathrm{trg}}$ for which the target loss is non-zero for any choice of last layer $v$: $\min_v L_{\mathrm{trg}}(v, B_{\mathrm{src}}) > 0$.

Proof. Let $B_{\mathrm{src}}, v_{\mathrm{src}}$ be minimum-loss solutions, so that $y_{\mathrm{src}} = v_{\mathrm{src}}^\top \phi(B_{\mathrm{src}} x_{\mathrm{src}})$ for all $(x_{\mathrm{src}}, y_{\mathrm{src}})$. Setting $B = B_{\mathrm{src}} A^{-1}$, we have for all $x_{\mathrm{trg}}$:

$$v_{\mathrm{src}}^\top \phi(B x_{\mathrm{trg}}) = v_{\mathrm{src}}^\top \phi(B_{\mathrm{src}} A^{-1} A x_{\mathrm{src}}) = v_{\mathrm{src}}^\top \phi(B_{\mathrm{src}} x_{\mathrm{src}}) = y_{\mathrm{src}} = y_{\mathrm{trg}}. \quad (3)$$

Therefore, this pair of parameters $(v_{\mathrm{src}}, B)$ achieves $L_{\mathrm{trg}}(v_{\mathrm{src}}, B) = 0$.

We construct a counterexample showing the impossibility of last-layer tuning as follows. Recall that $\phi(\cdot)$ is the elementwise ReLU function. Let $A = -I$, an invertible diagonal matrix with all entries $-1$. Let the source distribution be such that $B_{\mathrm{src}} x_{\mathrm{src}}$ has only positive entries for all $x_{\mathrm{src}}$ in its support. Then for any $v$, we have $v^\top \phi(B_{\mathrm{src}} x_{\mathrm{trg}}) = v^\top \phi(-B_{\mathrm{src}} x_{\mathrm{src}}) = 0$, so the expected loss is positive. Therefore, $\min_v L_{\mathrm{trg}}(v, B_{\mathrm{src}}) > 0$.

Proposition 2. For all $t, P_{\mathrm{src}}, P_{\mathrm{trg}}$ with $x_{\mathrm{trg}} = x_{\mathrm{src}}$ and $y_{\mathrm{trg}} = t\, y_{\mathrm{src}}$, there exists a last layer $v$ that can minimize the target loss: $\min_v L_{\mathrm{trg}}(v, B_{\mathrm{src}}) = 0$. However, changing the first layer may not be sufficient: there exist such $t, P_{\mathrm{src}}, P_{\mathrm{trg}}$ for which the target loss is non-zero for any choice of first layer $B$: $\min_B L_{\mathrm{trg}}(v_{\mathrm{src}}, B) > 0$.

Proof. Let $B_{\mathrm{src}}, v_{\mathrm{src}}$ be minimum-loss solutions, so that $y_{\mathrm{src}} = v_{\mathrm{src}}^\top \phi(B_{\mathrm{src}} x_{\mathrm{src}})$ for all $(x_{\mathrm{src}}, y_{\mathrm{src}})$. Let $v = t\, v_{\mathrm{src}}$. Then for all $x_{\mathrm{trg}}$:

$$v^\top \phi(B_{\mathrm{src}} x_{\mathrm{trg}}) = t\, v_{\mathrm{src}}^\top \phi(B_{\mathrm{src}} x_{\mathrm{src}}) = t\, y_{\mathrm{src}} = y_{\mathrm{trg}}. \quad (4)$$

Therefore, this pair of parameters $(v, B_{\mathrm{src}})$ achieves $L_{\mathrm{trg}}(v, B_{\mathrm{src}}) = 0$. We next construct a counterexample showing that tuning only the first layer may not be sufficient. Recall that $\phi(\cdot)$ is the elementwise ReLU function. Let $t = -1$.
Let the source distribution be such that both $B_{\mathrm{src}} x_{\mathrm{src}}$ and $v_{\mathrm{src}}$ consist only of positive entries for all $x_{\mathrm{src}}$ in its support. Then for any $B$, both $v_{\mathrm{src}}$ and $\phi(B x_{\mathrm{trg}})$ consist only of non-negative entries, so $v_{\mathrm{src}}^\top \phi(B x_{\mathrm{trg}}) \ge 0$ cannot express $y_{\mathrm{trg}} = -y_{\mathrm{src}} < 0$, and the expected loss is positive for any $B$. Therefore, $\min_B L_{\mathrm{trg}}(v_{\mathrm{src}}, B) > 0$.
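These constructions can be sanity-checked numerically. Below is a small NumPy check of Proposition 1's construction; the dimensions and the positive uniform input distribution are illustrative choices of ours, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)

d, k = 4, 6                                     # input dim, hidden width
B_src = rng.uniform(0.5, 1.0, size=(k, d))      # positive first layer
v_src = rng.uniform(0.5, 1.0, size=k)           # positive head
x_src = rng.uniform(0.5, 1.0, size=(d, 100))    # positive inputs: B_src @ x_src > 0
y_src = v_src @ relu(B_src @ x_src)             # labels realized by (v_src, B_src)

A = -np.eye(d)                                  # input-level shift x_trg = A x_src
x_trg = A @ x_src
y_trg = y_src

# Last-layer tuning fails: pre-activations B_src @ x_trg are all negative,
# so the ReLU features are identically zero for every choice of head v.
feats = relu(B_src @ x_trg)
print(np.allclose(feats, 0))                    # True

# First-layer tuning succeeds: B = B_src @ A^{-1} undoes the shift exactly.
B_new = B_src @ np.linalg.inv(A)
print(np.allclose(v_src @ relu(B_new @ x_trg), y_trg))   # True
```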

B.2 PROOF OF THEOREM 1 IN SECTION 3.2

We introduce some additional setup and prove two key lemmas, which bound the loss of first-layer tuning and full fine-tuning respectively; together, these immediately imply Theorem 1.

Defining the label distribution. We assume that $P(y \mid x)$ is the same in both the source and the target. That is, we assume there exist some $v_*, B_*$ such that $y = v_*^\top B_* x$ for both $P_{\mathrm{src}}$ and $P_{\mathrm{trg}}$. Let $w_* = B_*^\top v_*$.

Defining the covariate distributions. Now, we define the distribution over the inputs $x$. Let the source distribution $P_{\mathrm{src}}$ have density on a $d_{\mathrm{src}}$-dimensional subspace, where $d_{\mathrm{src}} < d$ (recall that the input dimension is $d$). Formally, this means that there exist some $S_{\mathrm{src}} \in \mathbb{R}^{d \times d_{\mathrm{src}}}$ with linearly independent columns and some distribution $P^{(z)}_{\mathrm{src}}$ with density on $\mathbb{R}^{d_{\mathrm{src}}}$, such that $P_{\mathrm{src}}$ is the distribution of $S_{\mathrm{src}} z$ where $z \sim P^{(z)}_{\mathrm{src}}$. We assume a non-degeneracy condition: the optimal model $w_*$ does not map non-zero source examples to 0, i.e., for all source examples $x \in \mathrm{colspace}(S_{\mathrm{src}})$ with $x \neq 0$, we have $w_*^\top x \neq 0$. If $w_*$ were random or had some noise, then this would hold with probability 1. Suppose we have an orthogonal distribution $P_{\mathrm{orth}}$ which has density on a $d_{\mathrm{orth}}$-dimensional subspace. Formally, this means that there exist some $S_{\mathrm{orth}} \in \mathbb{R}^{d \times d_{\mathrm{orth}}}$ with linearly independent columns and some distribution $P^{(z)}_{\mathrm{orth}}$ with density on $\mathbb{R}^{d_{\mathrm{orth}}}$, such that $P_{\mathrm{orth}}$ is the distribution of $S_{\mathrm{orth}} z$ where $z \sim P^{(z)}_{\mathrm{orth}}$. We assume that the supports of $P_{\mathrm{orth}}$ and $P_{\mathrm{src}}$ are orthogonal, that is, the columns of $S_{\mathrm{orth}}$ and the columns of $S_{\mathrm{src}}$ are all orthogonal. The target distribution $P_{\mathrm{trg}}$ is an equal mixture of the source distribution $P_{\mathrm{src}}$ and the orthogonal distribution $P_{\mathrm{orth}}$: $P_{\mathrm{trg}} = \frac{1}{2}(P_{\mathrm{src}} + P_{\mathrm{orth}})$. This means that to sample from $P_{\mathrm{trg}}$, with probability 0.5 we pick a sample from $P_{\mathrm{src}}$ and with probability 0.5 we pick a sample from $P_{\mathrm{orth}}$.

First-layer tuning gets 0 target loss.
We first show that first-layer tuning gets 0 loss on the target distribution:

Lemma 1. For any $\delta > 0$, suppose $n > 10\, d_{\mathrm{orth}} \log \frac{2}{\delta}$. Then with probability at least $1 - \delta$, first-layer tuning gets 0 loss at convergence: $L_{\mathrm{trg}}(v^{\infty}_{\mathrm{fl}}, B^{\infty}_{\mathrm{fl}}) = 0$.

Proof. We first note that first-layer tuning does not update the head $v$, so we have $v^{\infty}_{\mathrm{fl}} = v_{\mathrm{src}}$.

Convex, so converges. Note that the training loss $\hat{L}_{\mathrm{trg}}$ is convex in $B$. To see this, note that we can write:

$$\hat{L}_{\mathrm{trg}}(v, B) = \sum_{i=1}^{n} \left(v^\top B x^{(i)}_{\mathrm{trg}} - y^{(i)}_{\mathrm{trg}}\right)^2 = \sum_{i=1}^{n} \left(\mathrm{Tr}\left(x^{(i)}_{\mathrm{trg}} v^\top B\right) - y^{(i)}_{\mathrm{trg}}\right)^2.$$

$\mathrm{Tr}(x^{(i)}_{\mathrm{trg}} v^\top B)$ is a linear function of $B$, so this is simply a least-squares regression problem and is convex. This means that gradient flow converges to a minimizer of the train loss: $\hat{L}_{\mathrm{trg}}(v_{\mathrm{src}}, B^{\infty}_{\mathrm{fl}}) \le \hat{L}_{\mathrm{trg}}(v_{\mathrm{src}}, B)$ for all $B$. However, since $v_{\mathrm{src}} \neq 0$, there exists $B$ such that $\hat{L}_{\mathrm{trg}}(v_{\mathrm{src}}, B) = 0$, and since the loss is non-negative, this implies that $\hat{L}_{\mathrm{trg}}(v_{\mathrm{src}}, B^{\infty}_{\mathrm{fl}}) = 0$.

Define ID and orthogonal training examples. We note that every example $x$ sampled from $P_{\mathrm{trg}}$ comes from exactly one of $P_{\mathrm{orth}}$ or $P_{\mathrm{src}}$.* We group examples based on which distribution they come from. Let $X_{\mathrm{src}}$ and $X_{\mathrm{orth}}$ denote the source and orthogonal examples respectively:

$$X_{\mathrm{src}} = \{x^{(i)}_{\mathrm{trg}} : x^{(i)}_{\mathrm{trg}} \in \mathrm{colspace}(S_{\mathrm{src}})\}, \quad X_{\mathrm{orth}} = \{x^{(i)}_{\mathrm{trg}} : x^{(i)}_{\mathrm{trg}} \in \mathrm{colspace}(S_{\mathrm{orth}})\}.$$

Enough to get a basis correct. Since we are working with linear models, it suffices to get all examples in a basis correct to get the entire subspace correct. Stated formally, if $v^\top B x = v_*^\top B_* x$ and $v^\top B x' = v_*^\top B_* x'$, then the equality holds for any linear combination as well: $v^\top B(\alpha x + \beta x') = v_*^\top B_*(\alpha x + \beta x')$ for all $\alpha, \beta$. So to show that $L_{\mathrm{trg}}(v_{\mathrm{src}}, B^{\infty}_{\mathrm{fl}}) = 0$, it suffices to show that $v_{\mathrm{src}}^\top B^{\infty}_{\mathrm{fl}} x = v_*^\top B_* x$ for some set of $x$ that spans the supports of $P_{\mathrm{src}}$ and $P_{\mathrm{orth}}$.

$X_{\mathrm{orth}}$ spans the orthogonal subspace.
A standard application of Hoeffding's inequality gives that with probability $\ge 1 - \delta/2$, we have at least $d_{\mathrm{orth}}$ examples from the orthogonal distribution: $|X_{\mathrm{orth}}| \ge d_{\mathrm{orth}}$. Since $P_{\mathrm{orth}}$ has density on a $d_{\mathrm{orth}}$-dimensional subspace, by e.g. Lemma 3 in Xie et al. (2021) these examples span the orthogonal subspace almost surely: $\mathrm{span}(X_{\mathrm{orth}}) = \mathrm{colspace}(S_{\mathrm{orth}})$. Intuitively, since $P_{\mathrm{orth}}$ has density, we will sample points in different directions.

Get all examples in the orthogonal subspace correct. Since we have 0 training loss and $X_{\mathrm{orth}}$ spans the support of $P_{\mathrm{orth}}$, we get all examples in the orthogonal subspace correct; that is, for all $x$ in the support of $P_{\mathrm{orth}}$, we have $v_{\mathrm{src}}^\top B^{\infty}_{\mathrm{fl}} x = v_*^\top B_* x$.

Get all examples in the source subspace correct. For the source subspace, we split into two cases. First, we define the region of the source subspace $\mathrm{colspace}(S_{\mathrm{src}})$ that is orthogonal to all source training examples $X_{\mathrm{src}}$:

$$X^{\perp}_{\mathrm{src}} = \{x \in \mathrm{colspace}(S_{\mathrm{src}}) : x \perp x' \text{ for all } x' \in X_{\mathrm{src}}\}.$$

$P_{\mathrm{trg}}$ is a mixture of $P_{\mathrm{src}}$ and $P_{\mathrm{orth}}$. Since we get 0 loss on the support of $P_{\mathrm{src}}$ and the support of $P_{\mathrm{orth}}$, and $P_{\mathrm{trg}}$ is a mixture of the two, we get 0 loss on $P_{\mathrm{trg}}$ as well: $L_{\mathrm{trg}}(v_{\mathrm{src}}, B^{\infty}_{\mathrm{fl}}) = 0$.

Lemma 2. Suppose the representation dimension is 1 ($k = 1$) and $d_{\mathrm{src}} > n$. Then, with probability 1 over the random initialization, full fine-tuning has non-zero loss on the target distribution: $L_{\mathrm{trg}}(v_{\mathrm{ft}}(t), B_{\mathrm{ft}}(t)) > 0$ for all times $t$.

From Lemma A.4 in Kumar et al. (2022a) (see also Theorem 2.2 in Du et al. (2018)), we have:

$$v_{\mathrm{src}}^2 - \hat{b}_{\mathrm{src}}^\top \hat{b}_{\mathrm{src}} = v_{\mathrm{ft}}(t)^2 - b(t)^\top b(t).$$

Since $v_{\mathrm{src}} = v_{\mathrm{ft}}(t)$, this gives us:

$$\hat{b}_{\mathrm{src}}^\top \hat{b}_{\mathrm{src}} = b(t)^\top b(t). \quad (15)$$

Let $R = \mathrm{colspace}(S_{\mathrm{src}})$ be the source subspace. Since $P_{\mathrm{src}}$ is a subset of $P_{\mathrm{trg}}$, $(v_{\mathrm{ft}}(t), b(t))$ also gets 0 loss on the source distribution, and we have $\Pi_R(v_* b_*) = \Pi_R(v_{\mathrm{src}} \hat{b}_{\mathrm{src}}) = \Pi_R(v_{\mathrm{ft}}(t) b(t))$. Since $v_{\mathrm{src}} = v_{\mathrm{ft}}(t)$, we have $\Pi_R(v_{\mathrm{ft}}(t) \hat{b}_{\mathrm{src}}) = \Pi_R(v_{\mathrm{ft}}(t) b(t))$, and since $v_{\mathrm{ft}}(t) \neq 0$ (otherwise we would get source examples wrong):

$$\Pi_R(\hat{b}_{\mathrm{src}}) = \Pi_R(b(t)). \quad (18)$$

Let $T = \mathrm{colspace}(S_{\mathrm{orth}})$ be the orthogonal subspace.
From Equations 15 and 18, we have:

$$\|\Pi_T(\hat{b}_{\mathrm{src}})\|_2^2 = \|\Pi_T(b(t))\|_2^2. \quad (19)$$

But to get 0 loss on $T$, we must have $\Pi_T(v_{\mathrm{ft}}(t)\, b(t)) = \Pi_T(v_* b_*)$, which implies:

$$\|\Pi_T(b(t))\|_2^2 = \frac{v_*^2}{v_{\mathrm{ft}}(t)^2} \|\Pi_T(b_*)\|_2^2. \quad (20)$$

From Equation 19 and since $v_{\mathrm{src}} = v_{\mathrm{ft}}(t)$, we have:

$$\|\Pi_T(\hat{b}_{\mathrm{src}})\|_2^2 = \frac{v_*^2}{v_{\mathrm{ft}}(t)^2} \|\Pi_T(b_*)\|_2^2. \quad (21)$$

Recall that we obtained $\hat{b}_{\mathrm{src}}$ by initializing $b_0 = B_0^\top \sim N(0, \sigma_B^2 I_d)$ and then running gradient descent on the source distribution. By Lemma A.3 in Kumar et al. (2022a), this does not change the projection onto components orthogonal to the source distribution; in other words, $\|\Pi_T(\hat{b}_{\mathrm{src}})\|_2^2 = \|\Pi_T(b_0)\|_2^2$. However, this is a random variable with density, so the probability that it exactly equals the right-hand side of Equation 21, which is a fixed number, is 0. This is a contradiction.

Wrap-up. Either way, whether $v_{\mathrm{ft}}(t) = v_{\mathrm{src}}$ or $v_{\mathrm{ft}}(t) \neq v_{\mathrm{src}}$, we have $L_{\mathrm{trg}}(v_{\mathrm{ft}}(t), B_{\mathrm{ft}}(t)) > 0$.

Proof of Theorem 1. The result follows directly: Lemma 2 shows that full fine-tuning gets positive (non-zero) target loss, while Lemma 1 shows that first-layer tuning gets zero loss.
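The contrast between Lemma 1 and Lemma 2 can be illustrated numerically. The following sketch is ours: dimensions, scales, and the simplified "pretrained" solution (correct on the source subspace, random init elsewhere, as gradient descent would leave it) are illustrative assumptions, not the paper's construction verbatim. It fine-tunes a two-layer linear network with a scalar head ($k = 1$) on $n < d_{\mathrm{src}}$ target samples, tuning either only $b$ (first layer) or both $(v, b)$.

```python
import numpy as np

rng = np.random.default_rng(0)

d, d_src, d_orth = 10, 8, 2          # source subspace: coords 0..7; orthogonal: 8..9
w_star = np.ones(d)                  # ground truth: y = w_star @ x
n_src, n_orth = 3, 3                 # n = 6 < d_src: target data underdetermines b

# Simplified "pretrained" solution: correct on the source subspace, random on
# the orthogonal coordinates (gradient descent from random init leaves weights
# orthogonal to the source data at their initialization).
v_src = 1.0
b_src = np.concatenate([w_star[:d_src], 2.0 * rng.normal(size=d_orth)])

def sample(ns, no):
    X = np.zeros((ns + no, d))
    X[:ns, :d_src] = rng.normal(size=(ns, d_src))      # source-subspace samples
    X[ns:, d_src:] = rng.normal(size=(no, d_orth))     # orthogonal-subspace samples
    return X, X @ w_star

X, y = sample(n_src, n_orth)         # small target training set
X_te, y_te = sample(1000, 1000)      # fresh samples to estimate population loss

def finetune(tune_v, steps=100000, lr=0.01):
    v, b = v_src, b_src.copy()
    for _ in range(steps):
        r = v * (X @ b) - y                  # residuals
        gb = 2 * v * (X.T @ r) / len(y)      # dL/db
        gv = 2 * (r @ (X @ b)) / len(y)      # dL/dv
        b = b - lr * gb
        if tune_v:
            v = v - lr * gv
    return v, b

def pop_loss(v, b):
    return np.mean((v * (X_te @ b) - y_te) ** 2)

loss_first = pop_loss(*finetune(tune_v=False))   # first-layer tuning (b only)
loss_full = pop_loss(*finetune(tune_v=True))     # full fine-tuning (v and b)
```

With these settings, first-layer tuning drives the population target loss to (near) zero, while full fine-tuning retains a larger loss because the head $v$ drifts away from $v_{\mathrm{src}}$, corrupting predictions on source directions unseen in the small target set, mirroring Lemmas 1 and 2.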

B.3 ADDITIONAL DATASET DETAILS

Below, we provide additional information on the datasets used in our experiments.

• CIFAR-10 → CIFAR-10-C (Krizhevsky et al., 2009; Hendrycks & Dietterich, 2019): The task is to classify images into 10 classes, where the target distribution contains severely corrupted images. We run experiments over 14 of the corruptions (frost, gaussian blur, gaussian noise, glass blur, impulse noise, jpeg compression, motion blur, pixelate, saturate, shot noise, snow, spatter, speckle noise, and zoom blur). For the main experiments, we tune on 1000 images from CIFAR-10-C and evaluate on corrupted images from each of the corruptions. We use the data loading code from Croce et al. (2020), which has 5 levels of severity, and we evaluate with the most severe level. In our main experiments, we report the accuracies averaged across all corruptions and the average standard error over all corruptions.
• ImageNet → ImageNet-C (Deng et al., 2009; Hendrycks & Dietterich, 2019): The task is to classify images into 1000 classes, where the target distribution contains severely corrupted images. We run experiments over 15 of the corruptions (brightness, contrast, defocus blur, elastic transform, fog, frost, gaussian noise, glass blur, impulse noise, jpeg compression, motion blur, pixelate, shot noise, snow, and zoom blur). For the main experiments, we tune on 5000 images from ImageNet-C, evenly split between classes, giving 5 corrupted images per class, and evaluate on corrupted images from each of the corruptions. Similar to CIFAR-10-C, we evaluate with the most severe level. We also report the accuracies averaged across all corruptions and the average standard error over all corruptions.
• Living-17 and Entity-30 (Santurkar et al., 2020): The task is to classify images into one of 17 animal categories or one of 30 entities. These datasets present subpopulation shifts: while the ID and OOD distributions have the same overall classes, they contain different subpopulations of those classes.
For Living-17, we tune on 850 images from the target distribution, evenly split between the 17 classes, giving 50 images per class. For Entity-30, we tune on 1500 images from the target distribution, evenly split between the 30 classes, giving 50 images per class.
• Waterbirds (Sagawa et al., 2019): The task is to classify images as being a "waterbird" or "landbird". The label is spuriously correlated with the image background, which is either "land" or "water". The source distribution is the training set, while the target distribution is a balanced subset with equal amounts of each bird on each background. In the training data, 95% of the waterbirds appear on water backgrounds and 95% of the landbirds appear on land backgrounds, so the minority groups contain far fewer examples than the majority groups. We tune on 400 images from the target distribution, evenly split between the 4 (bird, background) groups, giving 100 images per group.
• CelebA (Sagawa et al., 2019): The task is to classify the hair color in images as "blond" or "not blond", and the label is spuriously correlated with the Male attribute. The source distribution is the training set, while the target distribution is a balanced subset with equal amounts of each of the four (hair color, gender) groups. We tune on 400 images from the target distribution, evenly split between the 4 (hair color, gender) groups, giving 100 images per group.
• Camelyon17 (Bandi et al., 2018): This dataset is part of the WILDS (Koh et al., 2021) benchmark and contains roughly 450,000 images of size 96 × 96 in the source distribution (Train) and 84,000 images in the target distribution (OOD test). It comprises medical images collected from 5 hospitals, where differences in devices and data processing across hospitals produce a natural distribution shift.
We pre-train on the 450,000 images of the source distribution and use 100 label-balanced images (50 per class) from the target distribution for fine-tuning. We use another 100 label-balanced target distribution images for tuning hyper-parameters, and report the performance of the fine-tuned model on the rest of the images of the target distribution.
• FMoW (Christie et al., 2018): This dataset is also part of the WILDS (Koh et al., 2021) benchmark, and its source distribution contains 520,000 satellite images of size 224 × 224 from 5 geographic regions. The task is to classify one of 62 building or land use types. For the target distribution, we use the Africa test split, i.e., the subset of the OOD test data from the Africa region, which has roughly 2,500 images. We use 62 label-balanced images from the target distribution for fine-tuning and report the accuracy on the rest of the target distribution images.
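The label-balanced target splits described above (e.g., 50 images per class for Living-17) can be constructed with a simple per-class sampler. This sketch is ours, not the paper's released code, and the toy label list is only for illustration.

```python
import random
from collections import defaultdict

def balanced_subset(labels, k_per_class, seed=0):
    """Return indices of a label-balanced subset: exactly k_per_class per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    chosen = []
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)               # random choice within each class
        chosen.extend(idxs[:k_per_class])
    return chosen

# e.g. a Living-17-style split: 17 classes x 50 images = 850 fine-tuning images
labels = [i % 17 for i in range(3400)]
subset = balanced_subset(labels, k_per_class=50)
print(len(subset))   # 850
```

A disjoint subset drawn the same way (with a different seed and the chosen indices removed) can serve as the held-out target validation split.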

B.4 ADDITIONAL DETAILS FOR SUPERVISED TRANSFER LEARNING EXPERIMENTS

Below, we provide additional details for our experiments on real data, including tuning details. For all datasets and experiments, we early stop according to the best accuracy on a held-out validation subset of the labeled target data.

• CIFAR-10 → CIFAR-10-C (Krizhevsky et al., 2009; Hendrycks & Dietterich, 2019) and CIFAR-Flip: We use the Standard pre-trained model from Croce et al. (2020), which is trained on the source CIFAR-10 distribution. We fine-tune on the labeled target data for 15 total epochs. We tune over the 3 learning rates {1e-3, 1e-4, 1e-5} for all methods except last-layer fine-tuning, where we tune over {1e-1, 1e-2, 1e-3}, and we use a weight decay of 0.0001 for all methods.
• ImageNet → ImageNet-C (Deng et al., 2009; Hendrycks & Dietterich, 2019): We use the Standard pre-trained model from Croce et al. (2020), which is trained on the source ImageNet distribution. We then fine-tune on the labeled target data for 10 total epochs. We tune over the 3 learning rates {1e-3, 1e-4, 1e-5} for all methods, and we use a weight decay of 0.0001 for all methods.
• Living-17 and Entity-30 (Santurkar et al., 2020): We first train on the source data for 5 epochs, tuning only the head for 3 epochs and then fine-tuning all layers for 2 more epochs, following LP-FT (Kumar et al., 2022a) and using the Adam optimizer. We then fine-tune on the labeled target data for 15 epochs. We tune over the 3 learning rates {0.0005, 0.0001, 0.00001} for all methods and do not use any weight decay.
• Waterbirds (Sagawa et al., 2019): We first start with a ResNet-50 pretrained on ImageNet and train on the source distribution for 300 epochs, taking the best checkpoint based on early stopping and using the Adam optimizer. We then fine-tune on the labeled target data for 100 total epochs. We tune over the 3 learning rates {0.005, 0.001, 0.0005} for all methods and use a weight decay of 0.0001.
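The tuning protocol above, a grid over a few learning rates with early stopping on held-out target validation accuracy, can be sketched as follows. `make_model`, `train_one_epoch`, and `evaluate` are hypothetical stand-ins (here replaced by toy closures), not the paper's code.

```python
import copy

def tune_and_early_stop(make_model, lr_grid, epochs, train_one_epoch, evaluate):
    """For each learning rate, fine-tune for `epochs` epochs and keep the
    checkpoint with the best held-out target validation accuracy."""
    best = {"acc": float("-inf"), "lr": None, "model": None}
    for lr in lr_grid:
        model = make_model()
        for _ in range(epochs):
            train_one_epoch(model, lr)
            acc = evaluate(model)            # held-out labeled target data
            if acc > best["acc"]:
                best = {"acc": acc, "lr": lr, "model": copy.deepcopy(model)}
    return best

# Toy stand-ins to show the control flow (not real training code).
make_model = lambda: {"w": 0.0}
def train_one_epoch(model, lr):
    model["w"] += lr
evaluate = lambda model: -abs(model["w"] - 0.003)    # score peaks at w = 0.003
best = tune_and_early_stop(make_model, [1e-3, 1e-4, 1e-5], 15, train_one_epoch, evaluate)
print(best["lr"])   # 0.001
```

Because the best checkpoint is snapshotted inside the epoch loop, later epochs that overfit the small target set cannot overwrite it, which is the point of the early-stopping rule described above.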

Method                                              mCE (%)
Vanilla ResNet-50 (Hendrycks & Dietterich, 2019)    76.7
Cross-Val                                           61.7
Full Fine-tuning                                    62.8
Gradual (First → Last)                              62.4
Gradual (Last → First)                              63.6
L1 Regularize (Xuhong et al., 2018)                 64.7
Auto-SNR                                            62.9
Auto-RGN                                            61.9

Table 6: For completeness, we report the mean corruption error (mCE) on ImageNet-C, which weights the target distribution error by the difficulty of the corruption. We report the average accuracies on the target distribution in Figure 2 and Table 4, and we see that the two metrics are correlated for our experiments, as the best-performing methods according to average accuracy are also the best for mCE.

• CelebA (Sagawa et al., 2019): We first start with a ResNet-50 pretrained on ImageNet and train on the source distribution for 50 epochs, taking the best checkpoint based on early stopping and using the Adam optimizer. We then fine-tune on the labeled target data for 50 total epochs. We tune over the 3 learning rates {0.001, 0.0005, 0.0001} for all methods and use a weight decay of 0.0001.
• Camelyon17 (Bandi et al., 2018; Koh et al., 2021): We start with a vision transformer, CLIP ViT-B/16 (Radford et al., 2021), pre-trained on the CLIP datasets. Next, we fine-tune the model on the Camelyon17 train dataset using an SGD optimizer with initial learning rate 0.0001 for 3 epochs. We use a cosine annealing learning rate scheduler (Loshchilov & Hutter, 2017) and batch size 32. Finally, we fine-tune on the labeled target data for 10 total epochs with the same setting as before, except we tune over learning rates $\{10^{-4}, 10^{-5}, 3 \times 10^{-5}, 7 \times 10^{-5}, 10^{-6}, 3 \times 10^{-6}, 10^{-7}, 10^{-8}\}$.
• FMoW (Christie et al., 2018; Koh et al., 2021): Similar to Camelyon17, we start with a vision transformer, CLIP ViT-B/16 (Radford et al., 2021), pre-trained on the CLIP datasets. Next, we fine-tune the model on the FMoW train dataset using an SGD optimizer with initial learning rate 0.0003 for 5 epochs.
We use a cosine annealing learning rate scheduler (Loshchilov & Hutter, 2017) and batch size 32. Finally, we fine-tune on the labeled target data for 10 total epochs with the same setting as before, except we tune over learning rates $\{10^{-6}, 10^{-5}, 0.0003, 0.0001, 0.001, 0.01, 0.1, 0.25\}$. In Table 6, for completeness, we additionally report the mean corruption error (mCE) on ImageNet-C, which weights the target distribution error by the difficulty of the corruption, since it is a common metric for this dataset. We find that these results give similar conclusions as the average accuracies reported in Table 4, with Cross-Val and Auto-RGN performing the best. We additionally include an ablation where we use a CLIP ViT-B/16 (vision transformer) as our initial model, pretrained on the WebImageText dataset. For surgical fine-tuning, we partition this model into 2 first layers, 4 transformer blocks, and 2 last layers. We analyze surgical fine-tuning with this model architecture on the Living-17 dataset, which has a feature-level shift. In Table 7, we find that our results from Section 2 hold similarly with this vision transformer, as tuning only a middle block outperforms full fine-tuning or tuning any other block of layers.
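The mCE metric reported in Table 6 normalizes each corruption's error by a baseline model's error on the same corruption before averaging (AlexNet in the original ImageNet-C protocol, which also sums errors over the 5 severity levels; we evaluate a single severity). A simplified single-severity sketch with illustrative numbers:

```python
def mean_corruption_error(model_err, baseline_err):
    """mCE: average of per-corruption errors, each normalized by the
    baseline model's error on the same corruption, times 100."""
    ces = [model_err[c] / baseline_err[c] for c in model_err]
    return 100.0 * sum(ces) / len(ces)

# Illustrative numbers only, not results from the paper.
model_err = {"gaussian_noise": 0.40, "fog": 0.30}
baseline_err = {"gaussian_noise": 0.80, "fog": 0.60}
print(mean_corruption_error(model_err, baseline_err))   # 50.0
```

The normalization is what makes mCE weight errors by corruption difficulty: an error on a corruption that the baseline also finds hard contributes less than the same error on an easy corruption.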

B.5 MORE LARGE VISION TRANSFORMER EXPERIMENTS

Prior work has shown that when fine-tuning vision transformers, task accuracy on held-out data is substantially higher when using the AdamW optimizer rather than SGD (Kumar et al., 2022b) . We evaluate the performance of surgical fine-tuning on two large pre-trained vision transformer models (CLIP ViT-B/16 and ViT-L/14) while fine-tuning with AdamW. We follow the experimental setting of Kumar et al. (2022b) as closely as possible:



3. ANALYSIS OF SURGICAL FINE-TUNING

We now present a theoretical and empirical analysis on idealized examples of distribution shifts, to better understand the role of surgical parameter tuning in our previous experimental results. In Section 3.1, we present a setting with two-layer neural networks where tuning only the first layer can obtain zero loss on the target task while tuning only the last layer cannot, and vice versa. Then, in Section 3.2, we study a setting in which tuning only the first layer provably achieves zero loss while full fine-tuning overfits and gets non-zero loss due to limited data. Finally, in Section 3.3, to support this theoretical analysis, we construct distribution shifts where localized subsets of parameters are substantially better suited for adaptation than tuning all or other parameters.

* Since the distributions have density on some subspace, x is almost surely non-zero.





Since we have 0 training loss, we get all examples in the span of the source training examples correct: for all $x \in \mathrm{span}(X_{\mathrm{src}})$, we have $v_{\mathrm{src}}^\top B^{\infty}_{\mathrm{fl}} x = v_*^\top B_* x$. For all $x \in X^{\perp}_{\mathrm{src}}$, from Lemma A.3 in Kumar et al. (2022a) we have that $B^{\infty}_{\mathrm{fl}} x = B_{\mathrm{ft}}(0) x = B_{\mathrm{src}} x$, which means that $v_{\mathrm{src}}^\top B^{\infty}_{\mathrm{fl}} x = v_{\mathrm{src}}^\top B_{\mathrm{src}} x$. Since we assumed that pretraining achieves 0 loss on the source, we have $v_{\mathrm{src}}^\top B_{\mathrm{src}} x = v_*^\top B_* x$. Combining these two cases, we get all examples in the support of $P_{\mathrm{src}}$ correct.

Proof. First, we note that since $n < d_{\mathrm{src}}$ and we only have $n$ training examples, our training examples do not span the source distribution. That is, there exists some source example $x_s \in \mathrm{colspace}(S_{\mathrm{src}})$ in the support of $P_{\mathrm{src}}$, with $x_s \neq 0$, which is orthogonal to all the training examples: $x_s \perp x^{(i)}_{\mathrm{trg}}$ for all $1 \le i \le n$. Choose such an $x_s$. We note that since $k = 1$ (the representation dimension is 1), $v_{\mathrm{ft}}(t) \in \mathbb{R}$ is a scalar, and $B_{\mathrm{ft}}(t), B_* \in \mathbb{R}^{1 \times d}$ are row vectors. For notational convenience, let $b(t) = B_{\mathrm{ft}}(t)^\top$, $b_* = B_*^\top$, and $\hat{b}_{\mathrm{src}} = B_{\mathrm{src}}^\top$, so that the model output is, for example, $v_{\mathrm{ft}}(t)\, b(t)^\top x$.

$v_{\mathrm{ft}}(t)$ cannot change for zero loss. First, we show that if $v_{\mathrm{ft}}(t) \neq v_{\mathrm{src}}$, then $L_{\mathrm{trg}}(v_{\mathrm{ft}}(t), B_{\mathrm{ft}}(t)) > 0$. Since $x_s$ is orthogonal to the training examples, from Lemma A.3 in Kumar et al. (2022a), we have $b(t)^\top x_s = \hat{b}_{\mathrm{src}}^\top x_s$. Since pretraining gave us 0 loss on the source distribution, we have $v_{\mathrm{src}} \hat{b}_{\mathrm{src}}^\top x_s = v_* b_*^\top x_s$. Recall that we assumed $w_*^\top x_s \neq 0$ if $x_s \neq 0$, which implies that $\hat{b}_{\mathrm{src}}^\top x_s \neq 0$, since pretraining gets all source examples right and the ground-truth label for all non-zero source examples is non-zero. But then if $v_{\mathrm{ft}}(t) \neq v_{\mathrm{src}}$, we have $v_{\mathrm{ft}}(t)\, b(t)^\top x_s \neq v_{\mathrm{src}} \hat{b}_{\mathrm{src}}^\top x_s = v_* b_*^\top x_s$. Since $P_{\mathrm{src}}$ has density, we can construct a small ball $\mathcal{B}$ of non-zero probability around $x_s$ such that for all $x \in \mathcal{B}$, $v_{\mathrm{ft}}(t)\, b(t)^\top x \neq v_* b_*^\top x$. This implies that $L_{\mathrm{trg}}(v_{\mathrm{ft}}(t), B_{\mathrm{ft}}(t)) > 0$.

$v_{\mathrm{ft}}(t)$ must change for zero loss. Next, suppose that $L_{\mathrm{trg}}(v_{\mathrm{ft}}(t), B_{\mathrm{ft}}(t)) = 0$. We will show $v_{\mathrm{ft}}(t) \neq v_{\mathrm{src}}$. Suppose for the sake of contradiction that $v_{\mathrm{ft}}(t) = v_{\mathrm{src}}$. From Lemma A.4 in Kumar et al. (2022a) (see also Theorem 2.2 in Du et al. (2018)), we then have the balancedness identity $v_{\mathrm{src}}^2 - \hat{b}_{\mathrm{src}}^\top \hat{b}_{\mathrm{src}} = v_{\mathrm{ft}}(t)^2 - b(t)^\top b(t)$, and the argument proceeds as in Equations 15 to 21 above, yielding a contradiction.

Table 7: Surgical fine-tuning results on Living-17 (% accuracy), initialized with a CLIP ViT-B/16. We find that, similar to the results using a ResNet-50 architecture, fine-tuning a single parameter block with this vision transformer architecture outperforms full fine-tuning, and in particular, a middle block still performs best for this feature-level shift.


Label perturbation. Now suppose that the source and target inputs are the same, $x_{\mathrm{trg}} = x_{\mathrm{src}}$, but the target output is perturbed from the source output: $y_{\mathrm{trg}} = t\, y_{\mathrm{src}}$ for some $t$. Proposition 2 shows that tuning only the last layer achieves zero target loss for this distribution shift, while tuning only the first layer may not.

Proposition 2. For all $t, P_{\mathrm{src}}, P_{\mathrm{trg}}$ with $x_{\mathrm{trg}} = x_{\mathrm{src}}$ and $y_{\mathrm{trg}} = t\, y_{\mathrm{src}}$, there exists a last layer $v$ that can minimize the target loss: $\min_v L_{\mathrm{trg}}(v, B_{\mathrm{src}}) = 0$. However, changing the first layer may not be sufficient: there exist such $t, P_{\mathrm{src}}, P_{\mathrm{trg}}$ for which the target loss is non-zero for any choice of first layer $B$: $\min_B L_{\mathrm{trg}}(v_{\mathrm{src}}, B) > 0$.

Similarly to Proposition 1, the last layer can adapt to the label shift by "reversing" the multiplication by $t$. In contrast, when the last layer is frozen and only the first layer is tuned, we may lack expressivity due to the information destroyed by the ReLU activation $\phi(\cdot)$. For a full statement and proof, see Appendix B.1.
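Proposition 2's construction can likewise be checked numerically; the dimensions and the positive uniform distributions below are illustrative choices of ours, not part of the statement.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)

d, k = 4, 6
B_src = rng.uniform(0.5, 1.0, size=(k, d))
v_src = rng.uniform(0.5, 1.0, size=k)           # positive head, so y_src > 0
x = rng.uniform(0.5, 1.0, size=(d, 100))        # x_trg = x_src
y_src = v_src @ relu(B_src @ x)

t = -1.0
y_trg = t * y_src                               # output-level shift

# Last-layer tuning succeeds: v = t * v_src reverses the scaling exactly.
print(np.allclose((t * v_src) @ relu(B_src @ x), y_trg))   # True

# First-layer tuning fails: v_src >= 0 and relu(.) >= 0, so predictions
# are nonnegative for any B, while y_trg < 0.
B_try = rng.normal(size=(k, d))
pred = v_src @ relu(B_try @ x)
print(bool(np.all(pred >= 0) and np.all(y_trg < 0)))        # True
```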

3.2. CAN SURGICAL FINE-TUNING OUTPERFORM FULL FINE-TUNING?

In this section, we show that first-layer fine-tuning can provably outperform full fine-tuning when we have an insufficient amount of target data. We show that this can happen even when tuning two-layer linear networks (Kumar et al., 2022a), where $\phi$ is the identity map: $\phi(x)_i = x_i$ for all $i$. Our analysis suggests perhaps a more general principle underlying the benefits of surgical fine-tuning over full fine-tuning: by fine-tuning more parameters than necessary, the model can overfit to the small target dataset while forgetting relevant information learned during pre-training.

Pretraining. We first start with $v_0, B_0$, which are initialized randomly, e.g. $(B_0)_{ij} \sim N(0, \sigma_B^2)$ for all $i, j$. We then run gradient descent on $L_{\mathrm{src}}(v, B)$ to obtain $v_{\mathrm{src}}, B_{\mathrm{src}}$, which we assume minimizes the source loss: $L_{\mathrm{src}}(v_{\mathrm{src}}, B_{\mathrm{src}}) = 0$.

• Camelyon17 (Bandi et al., 2018; Koh et al., 2021): We first train a pre-trained CLIP ViT-B/16 (Radford et al., 2021) model on the train split of the Camelyon17 dataset using an AdamW optimizer with initial learning rate $10^{-6}$ for 3 epochs. We use a cosine annealing learning rate scheduler (Loshchilov & Hutter, 2017) and batch size 32. Finally, we fine-tune on the labeled target data for 10 total epochs with the same setting as before, except we tune over learning rates $\{10^{-4}, 10^{-5}, 3 \times 10^{-5}, 7 \times 10^{-5}, 10^{-6}, 3 \times 10^{-6}, 10^{-7}, 10^{-8}\}$.


• FMoW (Christie et al., 2018; Koh et al., 2021): We similarly train a pre-trained CLIP ViT-B/16 or ViT-L/14 model on the train split of the FMoW dataset with initial learning rate $10^{-5}$ for 5 epochs. We use a cosine annealing learning rate scheduler (Loshchilov & Hutter, 2017) and batch size 32. For the ViT-L/14 setting, we use larger 336 × 336 images. Finally, we fine-tune on the labeled target data for 10 total epochs with the same setting as before, tuning over learning rates $\{10^{-4}, 10^{-5}, 3 \times 10^{-5}, 7 \times 10^{-5}, 10^{-6}, 3 \times 10^{-6}, 10^{-7}, 10^{-8}\}$.

We show AdamW fine-tuning results in Table 8. While surgically fine-tuning the right layer continues to improve over no fine-tuning, the relative advantage compared to fine-tuning all layers is smaller than what we observed with SGD in Table 7. We observe instability when fine-tuning all layers of a ViT-B/16 network for FMoW target distributions. Fine-tuning later layers seems to consistently improve performance without running into such instability issues. We leave further investigation of such properties of fine-tuning ViT models to future work.

B.6 COMPLETE UNSUPERVISED ADAPTATION RESULTS

Method. We experiment with MEMO (Zhang et al., 2021a) as our unsupervised adaptation method. Given a test image x, MEMO first takes an "adapt" step, where it minimizes the marginal entropy over standard augmentations of x; it then takes a "test" step, where the network predicts a label for x. Note that MEMO tests a single image at a time, i.e., the test batch size is 1. We also consider the two following variations.

• Episodic: This version is discussed in the original work. Here, after predicting the label, we reset the weights of the network to the pre-trained ones, i.e., we undo the "adapt" step.
• Online: We also consider the online variation of MEMO, where we do not reset the weights after each test image, i.e., we accumulate the "adaptation" changes over test images.

We have also experimented with TENT (Wang et al., 2020), but since TENT only updates the batch-norm modules (whereas MEMO updates all parameters), freezing parameters with TENT did not produce the expected results, and we did not pursue it further.

Dataset and Network. We use the CIFAR-10-C and ImageNet-C corruption datasets for our experiments. For CIFAR-10-C, we use the same ResNet-26 (He et al., 2015) pre-trained model used by MEMO (Zhang et al., 2021a), which is available in their GitHub repository. For ImageNet-C, we use the RVT*-small architecture and pre-trained weights used by Zhang et al. (2021a).

Hyper-parameters. For CIFAR-10-C, we use 1000 corrupted test images for hyper-parameter tuning and report the test accuracy on the held-out 9000 examples. We consider the following hyper-parameter grid:
• Learning rate: $10^{-3}$, $10^{-4}$, $10^{-5}$, and $10^{-6}$, then 2.5x, 5x, and 0.5x of the best learning rate from before.
• Steps: 1, 2.
• Weight decay: 0, $10^{-3}$, $10^{-2}$, $10^{-1}$.

• Number of augmentations per image: 32

For ImageNet-C, we do not do any hyper-parameter tuning and simply use the best hyper-parameters described by Zhang et al. (2021a). Finally, in practice, we saw that the AdamW and SGD optimizers work better for the episodic and online settings, respectively.

Layers. We use the following naming convention for the layers of ResNet-26:
• First: only the first conv layer.
• First 2 layers: the first conv layer of the entire network, and the first conv layer within the first block.
• First 2 blocks: the first conv layer, and the first block.
• Last: the last fully-connected (FC) layer.

For RVT*-small:
• First layer: the first conv layer inside the first transformer block.
• First block: the first transformer block.
• Last: the head, i.e., the final fully-connected layer.
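MEMO's "adapt" step described above can be sketched as follows. This is our simplified reading (the released implementation differs in augmentation choice, sampling, and optimizer details); `augment` and the toy linear classifier are stand-ins, not the actual models used.

```python
import torch
import torch.nn.functional as F

def memo_adapt_step(model, optimizer, x, augment, n_aug=32):
    """One MEMO 'adapt' step: minimize the entropy of the marginal
    prediction averaged over n_aug augmentations of a single test image."""
    views = torch.stack([augment(x) for _ in range(n_aug)])  # (n_aug, ...)
    probs = F.softmax(model(views), dim=-1)                  # per-view class probabilities
    marginal = probs.mean(dim=0)                             # marginal over augmentations
    entropy = -(marginal * marginal.clamp_min(1e-12).log()).sum()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()

# Toy stand-ins: a linear classifier and Gaussian-noise "augmentations".
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(3, 8, 8)
augment = lambda img: img + 0.1 * torch.randn_like(img)
entropy = memo_adapt_step(model, optimizer, x, augment)
```

In the episodic variant, the pre-adaptation `state_dict` is restored after each test image; in the online variant, the adapted weights are carried over to the next image.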

