A KERNEL-BASED VIEW OF LANGUAGE MODEL FINE-TUNING
Anonymous authors
Paper under double-blind review

Abstract

It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of this empirical success, e.g., why fine-tuning a model with 10^8 or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK), which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization, describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of the NTK on computer vision tasks (Wei et al., 2022). We also extend the NTK formalism to fine-tuning with Adam. We present extensive experiments showing that, once the downstream task is formulated as a language modeling problem through prompting, the NTK lens can often reasonably describe the model updates during fine-tuning with both SGD and Adam. This kernel view also suggests an explanation for the success of parameter-efficient subspace-based fine-tuning methods. Finally, we suggest a path toward a formal explanation of our findings via Tensor Programs (Yang, 2020b).

1. INTRODUCTION

It is now customary to solve most supervised natural language processing (NLP) tasks, such as topic classification and textual entailment, by fine-tuning a pre-trained language model (e.g., Devlin et al., 2019; Liu et al., 2020b; Clark et al., 2020; Raffel et al., 2020; Joshi et al., 2020). We lack theoretical understanding of this fine-tuning paradigm. Why do we not see overfitting when fine-tuning a very large language model on a couple dozen instances of the supervised task? Why is fine-tuning so sensitive to details such as whether or not we include a prompt (e.g., appending "It was [great/terrible]" for sentiment analysis (Schick & Schütze, 2021; Gao et al., 2021))? Why does restricting optimization to a low-rank subspace of model parameters (Hu et al., 2021; Li et al., 2018; Aghajanyan et al., 2021) still result in performance comparable to full fine-tuning? Answering such questions requires understanding how the sequence of parameter updates changes in various scenarios, e.g., the addition of a prompt or the introduction of randomly initialized parameters. The current theory of deep learning, at first sight, seems too primitive to address such questions, especially since fine-tuning has to start from a parameter initialization inherited from pre-training. Recently, Wei et al. (2022) suggested replacing fine-tuning with the Neural Tangent Kernel (NTK), an idea invented for the study of infinite-width deep neural networks (Jacot et al., 2018; Du et al., 2019a) and previously applied to solving vision tasks with infinitely wide ConvNets (Arora et al., 2019b). They note that an NTK can be defined for any neural model f and any initialization θ0 by representing an input ξ by the gradient it induces, ∇f(ξ; θ0), which yields a kernel matrix:

K(ξ, ξ′) = ⟨∇f(ξ; θ0), ∇f(ξ′; θ0)⟩.  (1)

This kernel is well-defined for any parameter vector θ0.
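Concretely, the empirical NTK above is just the Gram matrix of per-example gradients taken at the (pre-trained) initialization. The following is a minimal sketch of that computation, assuming a toy two-layer scalar-output model and numerical gradients; the model and all names here are hypothetical stand-ins, not the paper's setup:

```python
import numpy as np

def model(theta, x):
    # Toy scalar-output network: f(x; theta) = v . tanh(W x),
    # with theta packing W (h x d) and v (h). Purely illustrative.
    h, d = 4, 3
    W = theta[: h * d].reshape(h, d)
    v = theta[h * d :]
    return v @ np.tanh(W @ x)

def grad_f(theta, x, eps=1e-5):
    # Numerical gradient of f with respect to theta (central differences).
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (model(theta + e, x) - model(theta - e, x)) / (2 * eps)
    return g

def entk_gram(theta0, X):
    # Empirical NTK: K[i, j] = <grad f(x_i; theta0), grad f(x_j; theta0)>.
    G = np.stack([grad_f(theta0, x) for x in X])
    return G @ G.T

rng = np.random.default_rng(0)
theta0 = rng.normal(size=4 * 3 + 4)   # stands in for a pre-trained init
X = rng.normal(size=(5, 3))           # five toy inputs
K = entk_gram(theta0, X)
assert K.shape == (5, 5)
assert np.allclose(K, K.T)            # Gram matrices are symmetric PSD
```

For a real LM one would use per-example gradients from automatic differentiation rather than finite differences, but the kernel itself is defined the same way.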
However, for an infinite-width network initialized with θ0 sampled from suitably scaled Gaussians, it can be shown that the kernel matrix is unchanged during gradient descent, which turns the classification task into a form of kernel regression with respect to this kernel (Jacot et al., 2018). In the fine-tuning setting, however, the initialization θ0 is inherited from the pre-trained network, not sampled from a Gaussian distribution. Nevertheless, Wei et al. (2022) found that kernel regression using this "empirical NTK" (eNTK) defined with the inherited θ0 performs well, achieving classification accuracy within 6% (absolute) of actual fine-tuning on several image recognition tasks. In other words, their work hints that a mathematical understanding of the fine-tuning phenomenon (e.g., its sample efficiency) could go via the theory of kernel classifiers.

The current paper furthers an empirical and theoretical understanding of the pre-training (PT) and fine-tuning (FT) paradigm for NLP tasks. Our contributions are:

1. We formally extend the standard NTK theory developed for gradient descent to characterize kernel-based dynamics when training with Adam (Section 4). We propose, and rigorously prove the correctness of, a new kernel formula relying on the sign of the gradient to describe early-stage training (e.g., fine-tuning) with Adam.

2. We perform an extensive empirical analysis on 12 diverse NLP tasks to reveal when and to what extent fine-tuning exhibits kernel behavior (Section 5). We find that using a prompt is crucial for the eNTK to achieve good performance, suggesting that prompting induces a well-characterized optimization benefit for fine-tuning. Further experiments reveal that the trajectory of prompt-based FT can often be described by kernel-based dynamics when the eNTK succeeds. The eNTK often achieves performance comparable to FT in the prompt-based setting but struggles to solve multi-class and entailment tasks.

3. We apply the kernel view of FT dynamics to formally analyze the success of fine-tuning methods that update only a low-rank subspace of model parameters (e.g., LoRA; Hu et al., 2021). These results in Section 6 highlight how a kernel-based understanding of FT can aid in the practical design and theoretical analysis of efficient variants.

4. We formally extend infinite-width analysis to account for a pre-trained initialization and characterize conditions under which fine-tuning can exhibit kernel behavior. Using insights into the importance of prompting, we prove the existence of a rigorous mechanism through which prompt-based FT of complex architectures (e.g., Transformers) can exhibit kernel behavior (Section 7). The analysis proceeds in the context of networks whose widths go to infinity (i.e., through the Tensor Programs framework) but, unlike standard infinite-width NTK theory, allows a non-random initialization (i.e., one that results from pre-training).
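To illustrate the sign-of-gradient idea in contribution 1: early-stage Adam behaves roughly like sign gradient descent, so one natural sign-based analogue of the SGD kernel pairs one example's gradient with the elementwise sign of the other's. The sketch below, assuming per-example gradients are already stacked as rows of a matrix G, shows this form and how it differs from the ordinary eNTK; the exact kernel and its derivation are the subject of Section 4, and this is only an illustrative guess at its shape:

```python
import numpy as np

def sign_kernel(G):
    # Sign-based analogue of the eNTK for sign-GD-like (early-Adam) updates:
    # A[i, j] = <grad f(x_i), sign(grad f(x_j))>. Note A need not be symmetric.
    return G @ np.sign(G).T

rng = np.random.default_rng(1)
G = rng.normal(size=(5, 8))   # rows: per-example gradients at theta0 (toy data)
A = sign_kernel(G)
K = G @ G.T                   # ordinary (SGD) eNTK for comparison
assert A.shape == K.shape == (5, 5)
# Diagonal of A is the L1 norm of each gradient, since g . sign(g) = ||g||_1,
# whereas the diagonal of K is the squared L2 norm.
assert np.allclose(np.diag(A), np.abs(G).sum(axis=1))
```

The asymmetry is the key structural difference from the SGD kernel: a sign-dependent update rule weights all coordinates of the training gradient equally, regardless of magnitude.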

2. RELATED WORK

Kernel view of training. The infinite-width limit is a well-studied theoretical model of deep network optimization. Jacot et al. (2018) introduced the NTK to capture training a deep and infinitely wide neural network from a random initialization. Subsequent experiments showed that the resulting kernels underperform on standard tasks (Arora et al., 2019b) but perform well on small datasets (i.e., hundreds of examples) (Arora et al., 2020). Many works (Allen-Zhu et al., 2019a;b; Arora et al., 2019a; Du et al., 2019b;a; Li & Liang, 2018; Zou et al., 2018; Cao & Gu, 2019) have since applied this lens to understand the optimization and generalization behavior of deep networks. However, these analyses do not directly apply to the pre-training and fine-tuning framework because (1) the network trained during FT is inherited and non-random, and (2) LMs are often trained with Adam, whereas the NTK formula only describes training an infinitely wide network with SGD. In this work, we handle the issue of a non-random (i.e., pre-trained) initialization by assuming that the pre-training task is sufficiently related to the downstream task (Definition 7.3), and we derive new kernels to model early-stage training with Adam (Section 4).

Theory of self-supervised learning and transfer learning. Existing theoretical works on transfer learning focus on linear probing and use the learned representation to provide guarantees on various tasks related to the original training data (Du et al., 2021; Tripuraneni et al., 2020; Wu et al., 2020). Additionally, Saunshi et al. (2021) studied autoregressive language models to rigorously characterize why prompting can improve zero-shot task performance, but their analysis precludes an investigation of FT.
We focus on the masked language modeling pre-training objective, but it is worth noting that many works (Saunshi et al., 2019; Tosh et al., 2021a;b; Lee et al., 2021; Tsai et al., 2021) study transfer when pre-training with a contrastive objective. However, experiments on language modeling (Abnar et al., 2021) and contrastive learning (Saunshi et al., 2022) have recently demonstrated that properties of transfer between self-supervised pre-training and supervised FT cannot be fully captured by model-agnostic analyses that directly relate the pre-training and downstream task errors. Kernel theory provides a principled optimization- and architecture-aware framework for analyzing FT.

