HOW TO PREPARE YOUR TASK HEAD FOR FINETUNING

Abstract

In deep learning, transferring information from a pretrained network to a downstream task by finetuning has many benefits. The choice of task head plays an important role in finetuning, as the pretrained and downstream tasks usually differ. Although many different finetuning designs exist, a full understanding of when and why these algorithms work has been elusive. We analyze how the choice of task head controls feature adaptation and hence influences downstream performance. By decomposing the learning dynamics of adaptation, we find that the key quantity is the training accuracy and loss at the beginning of finetuning, which determines the "energy" available for feature adaptation. We identify a significant trend in how this initial energy shapes the features obtained after finetuning: as the energy increases, the Euclidean and cosine distances between the resulting and original features increase, while their dot products (and the resulting features' norms) first increase and then decrease. Inspired by this, we give several practical principles that lead to better downstream performance. We analytically prove this trend in an overparameterized linear setting and verify its applicability to different experimental settings. Throughout, we do not consider the details of the pretraining procedure; we assume only that well-trained checkpoints are available for a particular dataset or task.

1. INTRODUCTION

In the era of deep learning, pretraining a model on a large dataset and adapting it to downstream tasks is a popular workflow. With the help of large amounts of data and huge computing resources, the pretrained model can usually provide beneficial features for downstream tasks. This framework has proven efficient and effective in many domains and tasks, e.g. natural language processing (Kenton & Toutanova, 2019), computer vision (Chen et al., 2020b), graph-based learning (Liu et al., 2022), and so on. Although many variants of pretraining and finetuning (FT) are widely applied, including direct finetuning, finetuning after linear probing (Kumar et al., 2022), side-tuning (Zhang et al., 2020a), using different learning rates for different layers (Zhang et al., 2021), and more, a detailed understanding of how features are adapted during finetuning under different settings remains elusive.

Our work builds on the analysis of Kumar et al. (2022), who study the interactions between the "task head" (the final layer of the network, usually randomly initialized) and the "backbone" (usually copied from the pretrained model). Kumar et al. claim that the standard finetuning method, randomly initializing a task head and then updating all parameters of the whole network, can distort the pretrained features and hence hurt generalization if (as they assume) the backbone features were already optimal for the downstream task. By analyzing an overparameterized linear model, they prove that linear probing (i.e., updating only the parameters of the task head) followed by finetuning the whole network leads to better performance in their setting. In this work, we make less stringent assumptions and study more practical settings from a different perspective.
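To make the two-stage recipe concrete, here is a minimal numerical sketch of linear probing followed by full finetuning in an overparameterized linear model, in the spirit of Kumar et al.'s setting but not reproducing it; the shapes, learning rate, identity backbone, and step counts below are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overparameterized linear setting: input X, linear "backbone" B,
# linear task head w. All values here are illustrative, not from the paper.
n, d = 20, 50                       # fewer samples than feature dimensions
X = rng.normal(size=(n, d))         # inputs
y = rng.normal(size=n)              # downstream targets
B = np.eye(d)                       # "pretrained" backbone (identity for simplicity)
w = rng.normal(size=d) * 0.01       # randomly initialized linear task head

def mse(B, w):
    return np.mean((X @ B @ w - y) ** 2)

lr = 0.01
loss_before = mse(B, w)

# Stage 1: linear probing -- only the head w is updated, backbone frozen.
for _ in range(1000):
    r = X @ B @ w - y                       # residuals
    w -= lr * 2 / n * (B.T @ X.T @ r)       # d(mse)/dw

# Stage 2: finetuning -- both backbone and head are updated.
for _ in range(1000):
    r = X @ B @ w - y
    B -= lr * 2 / n * np.outer(X.T @ r, w)  # d(mse)/dB
    w -= lr * 2 / n * (B.T @ X.T @ r)       # d(mse)/dw

print(f"training MSE: {loss_before:.3f} -> {mse(B, w):.3f}")
```

Direct finetuning corresponds to skipping Stage 1; the point of the two-stage schedule is that Stage 2 starts from a head that already fits the frozen features reasonably well.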
First, we consider scenarios where the pretrained features are not optimal for the downstream task, so feature adaptation is indeed beneficial. Unlike the two extreme cases studied by Kumar et al. (2022), i.e. finetuning with a fully random initialization and with fully-pretrained parameters, we consider intermediate cases where features are mildly adapted by stopping the linear probing procedure earlier, before convergence. To better understand the features' behavior, we decompose the learning dynamics of the feature vector during finetuning into the "energy" and "direction" of the learning. We discover a non-trivial trend in how this "energy" affects the way features change from their initialization, which inspires the design of an appropriate finetuning procedure. Under this framework, we demonstrate that the "unchanged feature" assumption of Kumar et al. (2022) is hard to achieve.

Figure 1: Left: a general example of the pretraining (PT), head probing (HP), and finetuning (FT) procedure (DS is short for downstream). Right: an example showing that neither probing the head to convergence nor skipping probing is optimal (pretrained on ImageNet-1K and finetuned on STL10).

Second, our task heads are not necessarily linear. Inspired by the illustrations of Olah et al. (2020), it is reasonable to preserve only the lower layers of the pretrained model and reinitialize the top layers, assuming that low-level features are common across tasks. That is, the probed task head is non-linear, and we refer to this more general process as "head probing" (HP) rather than linear probing. Our analysis also helps to explain feature behavior in this setting. Finally, following our analysis, we provide a user guide summarizing when and why specific methods should be considered. Specifically, we have one basic method: stop head probing early, before convergence; and three advanced tricks: 1) use label smoothing during head probing; 2) use a more complex task head design; 3) merge and reinitialize some later layers of the backbone and attach them to the task head.

In summary, in this work: • we formalize and explain feature adaptation by decomposing the learning dynamics; • we find a non-trivial trend in feature adaptation and verify it in many cases; • and we show how controlling feature adaptation can improve downstream performance.
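Our reading of the "energy" idea can be sketched numerically. With a linear head and a squared loss (a simplifying assumption, not the paper's exact setup), the gradient that would flow back into a feature vector is the residual multiplied by the transposed head weights, so probing the head toward zero training loss also drains the energy available for feature adaptation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the "energy" intuition (our own illustration with assumed
# shapes): for logits H @ W and squared loss, the gradient reaching the
# features H is proportional to R @ W.T where R is the residual. As head
# probing drives R toward zero, this "energy" vanishes, and a subsequent
# finetuning stage would barely move the features.
n, d, C = 12, 16, 5                     # overparameterized: n < d
H = rng.normal(size=(n, d))             # frozen pretrained features
Y = np.eye(C)[rng.integers(0, C, n)]    # one-hot downstream labels
W = rng.normal(size=(d, C))             # randomly initialized head

lr, energies = 0.2, []
for step in range(5001):
    R = H @ W - Y                       # residual: drives all gradients
    if step % 1000 == 0:
        # "energy": mean norm of the gradient that would reach the features
        energies.append(np.linalg.norm(R @ W.T, axis=1).mean())
        print(f"HP step {step:4d}  feature-gradient energy: {energies[-1]:.4f}")
    W -= lr * H.T @ R / n               # head probing: features stay frozen
```

Stopping the probing early therefore leaves a controllable amount of energy for the finetuning stage, which is the lever the tricks above (early stopping, label smoothing, head design) all pull on.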

2. MOTIVATION

Pretrain-then-finetune is a popular workflow for many tasks in deep learning. One common practice is to 1) randomly initialize a task head, 2) attach it to a pretrained backbone, then 3) finetune the whole network together (Li et al., 2020). However, the untrained task head may distort the pretrained features during finetuning. To address this problem, Kumar et al. (2022) propose training the head to full convergence before finetuning. But if we train the head long enough that its training accuracy (HP-train-acc) converges to 100%, the loss, and hence the gradient flowing back into the backbone, becomes nearly zero, so the features won't change during the finetuning stage. Hence neither probing the head to convergence nor skipping probing is optimal, since the pretraining and downstream tasks (or datasets) are usually distinct. To verify this argument, we run HP for various numbers of epochs before finetuning and record the corresponding validation accuracy after finetuning (FT-valid-acc for short); the results are shown in Figure 1. It is surprising to see that stopping the head training early (before HP-train-acc converges) brings more improvement. As the only variable among these experiments is the parameters of the head before finetuning, the following two questions emerge: How does the task head influence the pretrained features during finetuning? How does the feature adaptation influence the generalization performance after finetuning?
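A toy version of this sweep, with a linear backbone and squared loss standing in for the actual ImageNet-to-STL10 experiment, illustrates how the number of head-probing steps controls how far the features subsequently move; all shapes and rates below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy analogue of the Figure 1 sweep (illustrative only): probe a linear
# head for k steps, then finetune head and backbone jointly, and measure
# how far the features move from their pretrained values.
n, d, C = 8, 16, 5
X = rng.normal(size=(n, d))
Y = np.eye(C)[rng.integers(0, C, n)]
B0 = np.eye(d)                        # "pretrained" backbone (identity for simplicity)
W0 = rng.normal(size=(d, C)) * 0.1    # one shared random head initialization

dists = {}
for hp_steps in [0, 10, 100, 1000]:
    B, W = B0.copy(), W0.copy()
    for _ in range(hp_steps):         # head probing: only W is updated
        R = X @ B @ W - Y
        W -= 0.05 * (X @ B).T @ R / n
    for _ in range(2000):             # finetuning: B and W are updated
        R = X @ B @ W - Y
        gW = (X @ B).T @ R / n
        gB = X.T @ R @ W.T / n
        W -= 0.01 * gW
        B -= 0.01 * gB
    dists[hp_steps] = np.linalg.norm(X @ B - X @ B0)   # Euclidean feature movement
    print(f"HP steps {hp_steps:4d}  feature movement after FT: {dists[hp_steps]:.4f}")
```

In this sketch, the longer the head is probed, the smaller the residual at the start of finetuning, and the less the features move; the toy model does not, of course, capture the downstream validation accuracy of the real experiment.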

