A KERNEL-BASED VIEW OF LANGUAGE MODEL FINE-TUNING

Anonymous authors
Paper under double-blind review

Abstract

It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of this empirical success, e.g., why fine-tuning a model with 10^8 or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK), which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization, describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of the NTK for computer vision tasks (Wei et al., 2022). We also extend the NTK formalism to fine-tuning with Adam. We present extensive experiments showing that once the downstream task is formulated as a language modeling problem through prompting, the NTK lens can often reasonably describe the model updates during fine-tuning with both SGD and Adam. This kernel view also suggests an explanation for the success of parameter-efficient subspace-based fine-tuning methods. Finally, we suggest a path toward a formal explanation for our findings via Tensor Programs (Yang, 2020b).

Under review as a conference paper at ICLR 2023

Wei et al. (2022) report that kernel regression with the NTK attains accuracy within 6% absolute of actual fine-tuning on several image recognition tasks. In other words, their work hints that a mathematical understanding of the fine-tuning phenomenon (e.g., its sample efficiency) could go via the theory of kernel classifiers. The current paper furthers an empirical and theoretical understanding of the pre-training (PT) and fine-tuning (FT) paradigm for NLP tasks. Our contributions are:

1. We formally extend the standard NTK theory developed for gradient descent to characterize kernel-based dynamics when training with Adam (Section 4). We propose and rigorously prove the correctness of a new kernel formula relying on the sign of the gradient to describe early-stage training (e.g., fine-tuning) with Adam.

2.
We perform an extensive empirical analysis on 12 diverse NLP tasks to reveal when and to what extent fine-tuning exhibits kernel behavior (Section 5). We find that using a prompt is crucial for the eNTK to achieve good performance, suggesting that prompting induces a well-characterized optimization benefit for fine-tuning. Further experiments reveal that the trajectory of prompt-based FT can often be described by kernel-based dynamics when the eNTK succeeds. The eNTK often achieves comparable performance to FT in the prompt-based setting but struggles to solve multi-class and entailment tasks.

3. We straightforwardly apply the kernel view of FT dynamics to formally analyze the success of fine-tuning methods that update in a low-rank subspace of model parameters (e.g., LoRA, Hu et al. (2021)). These results in Section 6 highlight how a kernel-based understanding of FT can aid in the practical design and theoretical analysis of efficient variants.

4. We formally extend infinite-width analysis to account for a pre-trained initialization and characterize conditions under which fine-tuning can exhibit kernel behavior. Using insights into the importance of prompting, we formally prove the existence of a rigorous mechanism through which prompt-based FT of complex architectures (e.g., Transformers) can exhibit kernel behavior (Section 7). The analysis proceeds in the context of networks whose widths go to infinity (i.e., through the Tensor Programs framework), but unlike standard infinite-width NTK theory, it allows a non-random initialization (i.e., one that results from pre-training).

2. RELATED WORK

Kernel view of training. The infinite-width limit is a well-studied theoretical model for deep network optimization. Jacot et al. (2018) introduced the NTK to capture training a deep and infinitely wide neural network from a random initialization.
Subsequent experiments showed that the kernels underperformed for standard tasks (Arora et al., 2019b) but performed well on small datasets (i.e., hundreds of examples) (Arora et al.,

1. INTRODUCTION

It is now customary to solve most supervised natural language processing (NLP) tasks such as topic classification and textual entailment by fine-tuning a pre-trained language model (e.g., Devlin et al., 2019; Liu et al., 2020b; Clark et al., 2020; Raffel et al., 2020; Joshi et al., 2020). We lack theoretical understanding of this fine-tuning paradigm. Why do we not see over-fitting when fine-tuning a very large language model using a couple dozen instances of the supervised task? Why is fine-tuning so sensitive to details such as whether or not we include a prompt (e.g., adding "It was [great/terrible]" for sentiment analysis (Schick & Schütze, 2021; Gao et al., 2021))? Why does restricting optimization to a low-rank subspace of model parameters (Hu et al., 2021; Li et al., 2018; Aghajanyan et al., 2021) still result in performance comparable to full fine-tuning? Answering such questions requires understanding how the sequence of parameter updates changes in various scenarios, e.g., the addition of a prompt, or the introduction of randomly initialized parameters. The current theory of deep learning, at first sight, seems too primitive to address such questions, especially since fine-tuning has to start from a parameter initialization inherited from pre-training.

Recently, Wei et al. (2022) suggested replacing fine-tuning with the Neural Tangent Kernel (NTK), an idea invented for the study of infinite-width deep neural networks (Jacot et al., 2018; Du et al., 2019a) and previously applied to solving vision tasks with infinitely wide ConvNets (Arora et al., 2019b). They note that the NTK can be defined for any neural model f and any initialization θ_0 by representing an input ξ by the gradient it induces, ∇f(ξ; θ_0), which yields a kernel matrix:

K(ξ, ξ′) = ⟨∇f(ξ; θ_0), ∇f(ξ′; θ_0)⟩.  (1)

This kernel is well-defined for any parameter vector θ_0.
However, for an infinite-width network initialized with θ_0 sampled from a suitably-scaled Gaussian, it can be shown that the kernel matrix is unchanged during gradient descent, which turns the classification task into a form of kernel regression with respect to this kernel (Jacot et al., 2018). In the fine-tuning setting, however, the initialization θ_0 is inherited from the pre-trained network and not sampled from the Gaussian distribution. Nevertheless, Wei et al. (2022) found that kernel regression using this "empirical NTK" (eNTK) defined with the inherited θ_0 performs well, achieving classification accuracy within 6% absolute of fine-tuning.

Optimization of transformers. Several works (Zhang et al., 2020; Liu et al., 2020a; Li et al., 2022) have documented issues with optimizing Transformer-based architectures with SGD instead of Adam. To study the unique properties of optimizing transformers with Adam, we derive a new kernel formula (Theorem 4.3) to capture early-stage training with Adam. We show results with this kernel and FT with Adam and SGD in Table 1.

Variants of fine-tuning methods. A standard way of fine-tuning pre-trained LMs, introduced in Radford et al. (2018); Devlin et al. (2019), is to add a linear classifier on top of a PT encoder and update all the parameters together. Subsequent work (Schick & Schütze, 2021; Gao et al., 2021) formulated downstream tasks as a language modeling problem (i.e., prompt-based FT) and demonstrated empirical success in low-data scenarios (see Liu et al. (2022) for a comprehensive survey). Another line of research studies parameter-efficient fine-tuning methods in which only a subset of model parameters are updated (Lester et al., 2021; Ben Zaken et al., 2022; Li & Liang, 2021) or the parameter updates are restricted to a low-dimensional subspace (Hu et al., 2021; Aghajanyan et al., 2021).
We show in Section 5 that good eNTK performance arises only for prompt-based FT (Figure 1), and we later show in Section 6 that subspace-based fine-tuning methods such as LoRA (Hu et al., 2021) have a simple interpretation through the kernel.

3. PRELIMINARIES

3.1 KERNEL BEHAVIOR

It has been mathematically proven that training infinitely wide deep networks (with large Gaussian initialization) on small datasets can cause deep learning to turn into kernel-based learning. We are interested in identifying kernel behavior arising when training from an arbitrary initialization. Below, we adapt the definition of the lazy regime (Woodworth et al., 2020) to an arbitrary initialization.

Definition 3.1 (Kernel Behavior). Consider a neural network f(ξ; θ) that takes input ξ and computes a scalar output using θ as the parameters. Let θ_t be the parameters after t steps of training by a gradient-based optimization algorithm. We say this training process of the network demonstrates kernel behavior if the following properties are satisfied.

1. Linearization: The change of the network can be approximated by its first-order Taylor expansion, i.e., f(ξ; θ_t) − f(ξ; θ_{t−1}) ≈ ⟨∇f(ξ; θ_{t−1}), θ_t − θ_{t−1}⟩;
2. Fixed Features: The gradient at step t is approximately the same as before training, i.e., ∇f(ξ; θ_t) ≈ ∇f(ξ; θ_0).

Here, ∇f denotes the gradient of f w.r.t. θ. "Closeness to kernel behavior" is quantified using the difference in the quantities on the two sides of the ≈ symbol.

Definition 3.2 (Kernel Analog). Suppose optimization of the parameters θ of a model f using the gradient-based update algorithm A to minimize a loss ℓ : R^2 → R exhibits kernel behavior (Definition 3.1). Then, we say that a kernel K^(A) is the kernel analog of the optimization algorithm A if

f(ξ; θ_t) − f(ξ; θ_{t−1}) ≈ −η χ(ξ_t, θ_{t−1}) K^(A)(ξ, ξ_t), ∀t ≥ 0  (2)

where ξ_t is the training input of step t, θ_t is the parameter at step t, χ(ξ, θ) = ∂ℓ(f(ξ; θ), y(ξ))/∂f is the derivative of the loss with respect to the model output, and y(ξ) is the label of ξ.

We illustrate how the dynamics of an optimization algorithm that demonstrates kernel behavior relate to the kernel analog. Let A be stochastic gradient descent (SGD).
If SGD exhibits kernel behavior, then we can write how the function changes for a chosen input ξ as

f(ξ; θ_{t+1}) − f(ξ; θ_t) ≈ ⟨∇f(ξ; θ_t), θ_{t+1} − θ_t⟩ = ⟨∇f(ξ; θ_t), −η χ_t ∇f(ξ_t; θ_t)⟩ ≈ −η χ_t K^(SGD)(ξ, ξ_t),

where the approximations follow from the Linearization and Fixed Features properties, respectively. This construction immediately yields the standard neural tangent kernel (NTK) formula for K^(SGD) derived in Jacot et al. (2018), which represents an input ξ as the resulting gradient ∇f(ξ; θ_0).

Definition 3.3 (Neural Tangent Kernel K^(SGD)). K^(SGD)(ξ, ξ′) = ⟨∇f(ξ; θ_0), ∇f(ξ′; θ_0)⟩.

Given a kernel K, one can solve the classification problem by learning kernel coefficients α_i to minimize the empirical risk of Σ_i α_i K(·, ξ_i), where {ξ_i} is the training data (see Appendix A). In Section 4, we derive the kernel analog for SignGD (i.e., an early-stage approximation of Adam), and in Section 5, we compare its eNTK performance against Adam FT.

The eNTK computation relies on two design choices for the setting: (1) what the model output f(ξ; θ) is, and (2) which optimization algorithm A is being studied. For a given setting, the eNTK can be computed directly using the kernel analog (Definition 3.2) of A. We run experiments choosing A as SGD or Adam and choosing f based on the fine-tuning setting.
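To make Definition 3.3 concrete, the eNTK of any finite model can be computed by stacking per-example gradients and taking their Gram matrix. The sketch below does this for a toy two-layer scalar-output network in NumPy; the architecture and all sizes are hypothetical stand-ins for an actual LM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network f(x) = w2 . tanh(W1 x); widths are hypothetical.
W1 = rng.normal(size=(8, 4)) / np.sqrt(4)
w2 = rng.normal(size=8) / np.sqrt(8)

def f(x, W1, w2):
    return w2 @ np.tanh(W1 @ x)

def grad_f(x, W1, w2):
    """Gradient of the scalar output w.r.t. all parameters, flattened."""
    h = np.tanh(W1 @ x)
    dW1 = np.outer(w2 * (1 - h ** 2), x)  # d f / d W1
    dw2 = h                                # d f / d w2
    return np.concatenate([dW1.ravel(), dw2])

def entk(xs, W1, w2):
    """K(ξ, ξ') = <∇f(ξ; θ0), ∇f(ξ'; θ0)> for all pairs of inputs."""
    G = np.stack([grad_f(x, W1, w2) for x in xs])  # one gradient per row
    return G @ G.T

xs = rng.normal(size=(5, 4))
K = entk(xs, W1, w2)
```

For a real Transformer, ∇f(ξ; θ_0) would be the gradient of the (prompted) output logit with respect to all pre-trained parameters; only the Gram-matrix step is unchanged.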

3.2. PRE-TRAINING AND FINE-TUNING PARADIGM

We focus our attention on masked language models (MLMs), such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2020b), which are trained to minimize the cross-entropy loss on independently predicting masked tokens (i.e., a |V|-way classification task, where V is the vocabulary). Given a text input s of length T from the pre-training distribution S_PT, replace a small percentage (e.g., 15%) of tokens with [MASK] tokens. This masked input is then fed into the representation function h : S_PT → R^{T×n} (e.g., a Transformer encoder) to produce a low-dimensional contextual embedding for each position in the input. The contextual embeddings are independently multiplied by a classifier head (i.e., word embeddings) Φ ∈ R^{n×|V|} to produce logits that will be used to compute the probability of a token filling each masked position.

Using a PT model to solve downstream tasks effectively has been a highly active area of research. We focus on fine-tuning (FT) methods, which adapt the pre-trained model to a new input distribution S_FT using additional training on the C-way downstream classification task.

1. Standard FT (Devlin et al., 2019; Liu et al., 2020b): To solve a C-way downstream classification task, initialize and learn a new classifier head Γ : R^n → R^C on top of the contextual embedding of the [CLS] token, denoted h_[CLS]. In this case, the choice of f : S_FT → R^C for the eNTK construction is f(s) = Γ(h_[CLS](s)).
2. Prompt-based FT (Schick & Schütze, 2021; Gao et al., 2021): Append a natural language prompt (e.g., "This is [MASK].") to the task input, and use the pre-trained MLM to fill in the masked token. Compute the logits over task-relevant words (e.g., "great" and "terrible") using the corresponding columns of Φ, denoted Φ_C ∈ R^{n×C}. These logits serve as surrogates to solve the downstream task. In this case, the choice of f : S_FT → R^C for the eNTK construction is f(s) = Φ_C^⊤ h_[MASK](s).
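The two output functions above can be contrasted in a short sketch. All dimensions, label-word ids, and variable names are illustrative, not the paper's code; the point is only the structural difference: standard FT uses a fresh random head Γ, while prompt-based FT reuses pre-trained columns of Φ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, V, C = 16, 100, 2            # embedding dim, vocab size, classes (toy)

Phi = rng.normal(size=(n, V))   # stands in for pre-trained MLM word embeddings
label_word_ids = [7, 42]        # e.g. ids of "great" / "terrible" (hypothetical)

def f_standard(h_cls, Gamma):
    """Standard FT: a freshly initialized head on the [CLS] embedding."""
    return Gamma @ h_cls        # Gamma carries no pre-trained knowledge

def f_prompt(h_mask):
    """Prompt-based FT: reuse the pre-trained columns of Phi for label words."""
    Phi_C = Phi[:, label_word_ids]  # n x C slice of the MLM head
    return Phi_C.T @ h_mask

h = rng.normal(size=n)                  # stand-in contextual embedding
Gamma = rng.normal(size=(C, n)) * 0.01  # new, randomly initialized head
logits_std = f_standard(h, Gamma)
logits_prompt = f_prompt(h)
```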

4. KERNEL DERIVATION FOR ADAM

Computing the eNTK requires using the kernel analog (Definition 3.2) of the chosen optimization algorithm A. However, it is difficult to construct a long-term kernel analog for Adam, because the adaptivity causes each update to depend on the entire gradient history. Previous work has shown that in the early stages of training, full-batch (Ma et al., 2022) and mini-batch (Malladi et al., 2022) Adam updates can be approximated by using the sign of the gradient. In particular, the moving averages for the moment estimates are computed in a small neighborhood when the learning rate is small, so the Adam update is similar to performing coordinate-wise normalization on the gradient. This gradient-based optimization algorithm is called SignGD, defined below.

Definition 4.1 (SignGD). SignGD is a gradient-based optimization algorithm that updates the parameters as θ_{t+1} = θ_t − η sign(∇f(ξ_t; θ_t)), where sign is applied element-wise.

We define the sign-based kernel below and prove that it is the correct kernel analog for SignGD.

Definition 4.2 (Asymmetric SignGD Kernel). K^(A-SignGD)(ξ, ξ′) = ⟨∇f(ξ; θ_0), sign(∇f(ξ′; θ_0))⟩.

Theorem 4.3 (Informal version of Theorem C.4). If a network is trained with SignGD (Definition 4.1) and the training exhibits kernel behavior (Definition 3.1), then the training dynamics can be characterized as f(ξ; θ_t) − f(ξ; θ_{t−1}) ≈ −η χ_t K^(A-SignGD)(ξ, ξ_t), where χ_t is the derivative of the loss with respect to f at step t.

Proof sketch. By the Linearization property in Definition 3.1, f(ξ; θ_t) − f(ξ; θ_{t−1}) ≈ ⟨∇f(ξ; θ_t), θ_t − θ_{t−1}⟩ = −η χ_t ⟨∇f(ξ; θ_t), sign(∇f(ξ_t; θ_{t−1}))⟩. Then, by the Fixed Features property in Definition 3.1, ⟨∇f(ξ; θ_t), sign(∇f(ξ_t; θ_{t−1}))⟩ ≈ ⟨∇f(ξ; θ_0), sign(∇f(ξ_t; θ_0))⟩ = K^(A-SignGD)(ξ, ξ_t).

We solve the asymmetric kernel regression by building an augmented system modified from He et al. (2022b) (Appendix A.3), but the difficulties of solving the kernel regression problem with an asymmetric kernel motivate us to also use the symmetric SignGD kernel, though it is not as theoretically sound as the asymmetric one.

Definition 4.4 (SignGD Kernel). K^(SignGD)(ξ, ξ′) = ⟨sign(∇f(ξ; θ_0)), sign(∇f(ξ′; θ_0))⟩.

The kernel analog for Adam differs from the standard NTK formula for SGD because the sign function is agnostic to the relative scales of the gradients.
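Definitions 3.3, 4.2, and 4.4 translate directly into code once the per-example gradients are stacked into a matrix G whose rows are ∇f(ξ_i; θ_0); the toy gradients below are random stand-ins for real ones.

```python
import numpy as np

def k_sgd(G):
    """Standard NTK (Definition 3.3): K(i, j) = <g_i, g_j>."""
    return G @ G.T

def k_asign(G):
    """Asymmetric SignGD kernel (Definition 4.2): K(i, j) = <g_i, sign(g_j)>."""
    return G @ np.sign(G).T

def k_sign(G):
    """Symmetric SignGD kernel (Definition 4.4): K(i, j) = <sign(g_i), sign(g_j)>."""
    S = np.sign(G)
    return S @ S.T

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 10))  # rows: toy per-example gradients ∇f(ξ_i; θ0)
```

Note that K^(A-SignGD) is asymmetric in general, while K^(SGD) and K^(SignGD) are symmetric Gram matrices; this is what forces the augmented-system solver of Appendix A.3.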

5. EXPERIMENTS

We compute the eNTK as described in Section 3 for different optimization algorithms and FT settings. eNTK performance being comparable to FT performance is a necessary but not sufficient condition for FT to exhibit kernel behavior (Definition 3.1), so we also directly measure if the Linearization and Fixed Features properties hold (Section 5.3). Overall, we find that only prompt-based FT exhibits kernel behavior, although the eNTK still struggles with multi-class classification and exhibits anomalous behavior on entailment tasks.

5.1. SETUP

Our experiments follow the few-shot setting from Gao et al. (2021) and use their manual prompt templates. We consider 12 NLP tasks, divided equally into 6 single-sentence and 6 sentence-pair datasets, which cover: sentiment analysis (SST-2, MR, CR); classifying an opinion's polarity (MPQA), subjectivity (Subj), or question type (TREC); natural language inference (MNLI, SNLI, QNLI, RTE); and paraphrase detection (MRPC, QQP). For each task, we randomly sample 5 k-shot datasets with k training examples for each label. We use a pre-trained RoBERTa-base (Liu et al., 2020b) for FT and eNTK. Appendix A contains further details on datasets and implementation.

5.2. KERNEL PERFORMANCE ON DOWNSTREAM TASKS

Prompting is critical for eNTK to match FT performance. We measure the eNTK performance in the standard and prompt-based FT settings across SST-2, MR, QNLI, and QQP (Figure 1). In the standard FT setting, K^(SGD) and SGD FT demonstrate a gap of up to 15% absolute on tasks that exhibit only a 1% gap in the prompt-based setting.

SGD performs comparably to Adam in prompt-based FT. We focus on the prompt-based FT setting (Table 1). We note that when doing prompt-based FT, Adam and SGD perform within 4% absolute of each other, furthering the discussion around the optimization of transformers with Adam versus SGD (Li et al., 2022; Zhang et al., 2020; Liu et al., 2020a).

Figure 1: Comparing SGD-FT and K^(SGD) performance in the standard and the prompt-based FT settings (Section 3) suggests that kernel behavior (Definition 3.1) can only arise when using a prompt.

Prompt-based eNTK matches FT in most tasks. To study if the eNTK can solve a given task in the prompt-based FT setting, we compare SGD-FT to K^(SGD) and Adam-FT to K^(A-SignGD) and K^(SignGD) in Table 1. We observe that for 8 out of 12 tasks, K^(SGD) can achieve accuracy within 5% absolute of SGD FT for k = 16 and k = 64. Similarly, K^(SignGD) or K^(A-SignGD) can achieve accuracy comparable to Adam FT for 7 out of the 12 tasks. The difference between K^(SignGD) and K^(A-SignGD) is negligible on most tasks. We suggest the asymmetry of the latter may cause K^(A-SignGD) to sometimes perform worse than K^(SignGD) despite being the theoretically sound kernel analog (Theorem 4.3).

eNTK struggles with multi-class tasks. The eNTK performs much worse than FT on all of the multi-class tasks (i.e., TREC, MNLI, and SNLI), which we believe warrants further investigation. We conjecture that the eNTK cannot solve MNLI and SNLI despite solving other entailment tasks (i.e., QNLI and RTE) because the prompt is less natural when considering the label word "Maybe".
One explanation of why the kernel analog sometimes outperforms FT is that certain batches may induce anomalous gradients that disrupt the FT trajectory, the effect of which the kernel can mitigate by downweighting these examples.

5.3. MEASURING KERNEL BEHAVIOR

The eNTK matches the performance of prompt-based FT for many tasks (Table 1), suggesting that these tasks may induce kernel behavior (Definition 3.1). However, the kernel's success may just be a coincidence. We take additional measurements to provide further empirical evidence that FT could be modeled as kernel behavior.

Measuring the Linearization Property of Kernel Behavior

If FT exhibits kernel behavior (Definition 3.1), then the function output after FT should be close to the first-order Taylor expansion around the pre-trained model:

f(ξ; θ_FT) ≈ f(ξ; θ_PT) + ⟨∇f(ξ; θ_PT), θ_FT − θ_PT⟩,

where θ_PT denotes the model parameters after pre-training, θ_FT the model parameters after fine-tuning on the downstream task, and ξ is sampled from the test set. Figure 2 summarizes the results. Pre-trained models perform significantly better than random on many single-sentence downstream tasks (e.g., SST-2, MR, and CR) but close to random on most sentence-pair tasks (e.g., QNLI, RTE, MRPC, and QQP). Subj, MNLI, and SNLI are outliers to this trend. The linearized model recovers a substantial amount of FT performance for SST-2, MR, CR, Subj, RTE, and QQP, all of which the eNTK could solve (Table 1). Although pre-trained models perform much better than random on MNLI and SNLI, we find that the eNTK cannot solve these tasks very well (Table 1). Similarly, although the pre-trained model demonstrates near-random performance on QNLI and RTE, we find that the eNTK can solve these tasks. Moreover, although QNLI and RTE could be solved by the eNTK, the results suggest they do not induce the Linearization property of kernel behavior very strongly. Altogether, these findings suggest a deeper mystery around entailment tasks in particular.
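The Linearization check above compares fine-tuned outputs to the first-order Taylor expansion around θ_PT. A minimal sketch follows, using a toy model that is linear in θ (so the expansion is exact; a real network would only match approximately):

```python
import numpy as np

def linearized_output(f, grad_f, theta_pt, theta_ft, xs):
    """First-order Taylor expansion of f around the pre-trained parameters."""
    d = theta_ft - theta_pt
    return np.array([f(x, theta_pt) + grad_f(x, theta_pt) @ d for x in xs])

# Toy scalar model f(x; θ) = θ · tanh(x), a hypothetical stand-in for an LM.
f = lambda x, th: th @ np.tanh(x)
grad_f = lambda x, th: np.tanh(x)

rng = np.random.default_rng(0)
theta_pt = rng.normal(size=6)
theta_ft = theta_pt + 1e-3 * rng.normal(size=6)  # a small FT update

xs = rng.normal(size=(8, 6))
lin = linearized_output(f, grad_f, theta_pt, theta_ft, xs)
true = np.array([f(x, theta_ft) for x in xs])
# For this θ-linear toy model, lin == true; closeness of the two for a real
# network is what the Linearization measurement quantifies.
```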

Measuring the Fixed Features Property of Kernel Behavior

We also empirically test if the Fixed Features property (Definition 3.1) holds for tasks that the eNTK can solve. We measure the relative distance between K^(SGD) computed before and after FT and record the average element-wise distance in Table 5. A smaller distance indicates that the Fixed Features property is more likely to hold. We see that tasks that the eNTK can solve exhibit relatively low (i.e., less than 1) distances.

Table 1: Performance achieved by prompt-based FT and prompt-based eNTKs with different formulas on the LM-BFF test set (Gao et al., 2021). The eNTK performs comparably to the analogous FT on many tasks but fails on multi-class tasks (i.e., TREC, SNLI, and MNLI). Performance is measured by average test accuracy over 5 k-shot splits for all tasks except MRPC and QQP, where it is F1.

Figure 2 (caption, partial): … and fine-tuned model (FT). Tasks that induce the Linearization property of kernel behavior (Definition 3.1) will show that Lin. performance recovers a substantial amount of the FT performance. For each k, we plot the median and range of the test accuracies across 5 seeds and data splits.
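The Fixed Features measurement reduces to a scalar statistic on two kernel matrices. Below is a minimal sketch of an average element-wise relative distance; the exact normalization used for Table 5 is an assumption here.

```python
import numpy as np

def relative_kernel_distance(K_before, K_after, eps=1e-12):
    """Average element-wise relative change between eNTKs computed before and
    after FT; a small value supports the Fixed Features property."""
    return np.mean(np.abs(K_after - K_before) / (np.abs(K_before) + eps))

K0 = np.ones((3, 3))
d_same = relative_kernel_distance(K0, K0)        # identical kernels -> 0
d_scaled = relative_kernel_distance(K0, 2 * K0)  # doubled kernel -> 1
```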

6. EFFICACY OF SUBSPACE-BASED FINE-TUNING METHODS

We study parameter-efficient fine-tuning methods that roughly preserve performance but reduce the overhead of fine-tuning and saving a large language model (He et al., 2022a). One such method is LoRA (Hu et al., 2021), which restricts fine-tuning updates to be low-rank, as defined below.

Definition 6.1 (A-LoRA FT (Hu et al., 2021)). Let A be a gradient-based optimization algorithm. For every weight matrix W ∈ R^{m×n}, choose k ≪ m and initialize A ∈ R^{m×k} with i.i.d. mean-0 Gaussian values and B ∈ R^{k×n} as 0. Set the weight to be W + AB. To fine-tune, fix W at its pre-trained value and train only A and B using A.

We also consider a fine-tuning variant that projects the parameter vector to a low-dimensional subspace. This method was originally proposed to characterize the difficulty of downstream tasks (Li et al., 2018), and recent LM experiments in Aghajanyan et al. (2021) have shown that fine-tuning the projected parameters can recover most of the performance of standard fine-tuning.

Definition 6.2 (A-IntrinsicDimension FT (Li et al., 2018; Aghajanyan et al., 2021)). Fix a random projection Π ∈ R^{M×k}, where M is the number of parameters in a model f. To fine-tune using a loss ℓ on a downstream task, replace the gradient in the update formula of A with Π^⊤ ∇ℓ(ξ; θ).

Although a theoretical characterization of these methods seems complex, the kernel view admits a simple interpretation. We straightforwardly apply the classical Johnson-Lindenstrauss (JL) lemma (Johnson, 1984), which guarantees inner product preservation under random projections, to show that these methods approximately preserve the SGD kernel (Definition 3.3).

Theorem 6.3 (LoRA and IntrinsicDimension FT preserve K^(SGD), informal version of Theorem D.5). Let K^(SGD) be the kernel analog (Definition 3.2) of SGD FT, K^(SGD)_LoRA be the kernel analog of SGD-LoRA FT (Definition 6.1), and K^(SGD)_ID be the kernel analog of SGD-IntrinsicDimension FT on a downstream task Ξ.
Then, with high probability, (K^(SGD)_LoRA(i, j) − K^(SGD)(i, j))/K^(SGD)(i, j) ≈ 0 and K^(SGD)_ID(i, j) ≈ K^(SGD)(i, j) for all i, j ∈ [N].

Proof sketch. Consider an individual layer in the network and inputs ξ, ξ′ ∈ Ξ to the downstream task. LoRA randomly projects ∇_B f(ξ; θ) and ∇_B f(ξ′; θ), where ∇_B denotes the gradient with respect to B, and does not modify the gradient to A, since B is initialized to zero. The rest of the proof for LoRA and the proof for IntrinsicDimension FT follow from applying JL to all such pairs ξ, ξ′ to show that the inner product, which determines the kernel entry, is preserved.

Remark 6.4. Theorem 6.3 states that the kernel analog of SGD FT is unchanged by LoRA in both prompt-based and standard FT. Therefore, the theorem only applies when A FT exhibits kernel behavior, which we find to occur only in the prompt-based setting (Figure 1). Experiments in Table 7 verify that prompted SGD FT and prompted LoRA-SGD FT achieve similar performance on several downstream tasks, and K^(SGD)_LoRA achieves performance similar to K^(SGD). We leave it to future work to account for the success of these methods when FT does not exhibit kernel behavior.
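The proof idea rests on the JL property that a random projection approximately preserves inner products, and hence kernel entries. A numerical sketch (all dimensions are toy; the 1/√k scaling makes the projected inner product an unbiased estimate of the original):

```python
import numpy as np

rng = np.random.default_rng(0)
M, k = 10000, 1000  # full parameter count and projected dimension (toy)

# Stand-ins for gradient features ∇f(ξ; θ0); correlated so <g1, g2> is not ~0.
g1 = rng.normal(size=M)
g2 = rng.normal(size=M) + 0.5 * g1

Pi = rng.normal(size=(M, k)) / np.sqrt(k)  # JL random projection

K_full = g1 @ g2                    # entry of K^(SGD)
K_proj = (Pi.T @ g1) @ (Pi.T @ g2)  # kernel entry after projecting gradients
rel_err = abs(K_proj - K_full) / abs(K_full)  # small with high probability
```

The relative error shrinks roughly as 1/√k, which is why restricting updates to a moderately sized random subspace leaves the SGD kernel nearly unchanged.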

7. THEORY: PROMPT-BASED FINE-TUNING CAN EXHIBIT KERNEL BEHAVIOR

We give a plausible mechanism for how prompt-based FT can exhibit kernel behavior (Definition 3.1) as the network width grows large. We start by defining a pre-training scheme, which formalizes how changing the architecture width impacts pre-training.

Definition 7.1 (Pre-Training Scheme). A pre-training scheme (X, A, F_n) with width n contains the dataset X, the optimizer A and its hyperparameters, and a model architecture F_n. Let f_n ∼ (X, A, F_n) denote a model resulting from training the architecture F_n on the dataset X with optimizer A.

Remark 7.2. The concrete reliance of the architecture on the width is given by Tensor Programs: for example, in a Transformer, increasing n corresponds to increasing the embedding dimension.

We now connect pre-training to the downstream task. Analogous to Saunshi et al. (2021), we reason that prompting transforms the downstream task into a fill-in-the-blank problem, and thus the downstream task can be viewed as a subcase of the pre-training task. We then assume that a wider pre-trained network will be better at filling in masked tokens and that an infinitely wide pre-trained network can solve the downstream task perfectly when using a suitable prompt.

Definition 7.3 (Natural Task in the Infinite-Width Limit). We say that a downstream task Ξ is natural with respect to a pre-training scheme (X, A, F_n) if, for any f_n ∼ (X, A, F_n) and any ξ ∈ Ξ,

lim_{n→∞} χ(ξ, f_n(ξ)) = 0.  (3)

Remark 7.4. Note that a task may only be natural in the infinite-width limit when using a prompt, since standard FT will always require training a randomly initialized head (i.e., χ will not vanish at infinite width).

Although Definition 7.3 is asymptotic, we design a cheap empirical test. We require access to two models of different widths resulting from otherwise identical pre-training schemes: f_{n1} ∼ (X, A, F_{n1}) and f_{n2} ∼ (X, A, F_{n2}).
Then, we can check if χ decreases with width by measuring χ(ξ, f_{n1}(ξ)) and χ(ξ, f_{n2}(ξ)), with n_1 ≠ n_2, without making any gradient updates (Table 8).

To study the behavior of fine-tuning, one also needs to make assumptions about the parameters that resulted from pre-training. In particular, we assume that the network can be written as a Tensor Program (Yang, 2019; 2020a; b), which is sufficiently general to allow our theory to describe many complex architectures (e.g., Transformers). To allow the analysis to proceed by way of Tensor Programs, we require that the network is (1) stable: its output does not grow with width (i.e., the infinite-width limit is meaningful), and (2) non-trivial: its output can be updated during fine-tuning (i.e., learning can occur).

Theorem 7.5 (Informal version of Theorem C.5). Assume the downstream task Ξ is natural in the infinite-width limit with a pre-trained model f, and f is stable, non-trivial, and can be written as a Tensor Program. Then prompt-based FT of f will exhibit the Linearization and Fixed Features properties of kernel behavior (Definition 3.1).

The proof of the theorem formalizes the intuition that if the pre-trained network is already decent at solving the downstream task, the network needs to only mildly adapt to solve the downstream task. Notably, we extend standard NTK theory to account for an arbitrary initialization and to characterize early-stage training with Adam (see Section 4 for the kernel).
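For the cross-entropy loss, χ is simply the softmax output minus the one-hot label, so the naturalness test above requires only forward passes. A sketch (the two-logit values are illustrative, not measured):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def chi(logits, label):
    """χ = ∂ℓ/∂f for cross-entropy: softmax(logits) minus the one-hot label.
    The naturalness test checks whether ||χ|| shrinks as width grows."""
    p = softmax(logits)
    y = np.zeros_like(p)
    y[label] = 1.0
    return p - y

# A model that already solves the task (confident, correct logits) has tiny χ;
# an uninformative model (e.g., a random head) does not.
chi_good = chi(np.array([5.0, -5.0]), 0)
chi_bad = chi(np.array([0.0, 0.0]), 0)
```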

8. CONCLUSION

We use NTKs to mathematically formalize the general intuition that fine-tuning pre-trained language models to solve downstream tasks requires only a "small change." Extensive experiments on 12 NLU tasks demonstrate that prompt-based FT is much more likely to exhibit kernel behavior (Definition 3.1) than standard FT (Figure 1). Further experiments in the prompt-based FT setting using a newly derived kernel for Adam (Definition 4.2; see Theorem 4.3) demonstrate that the eNTK can match the performance of FT on many tasks. On the tasks that the eNTK can solve, measurements in Section 5.3 suggest that prompt-based FT does exhibit kernel behavior. We demonstrate one possible use of the kernel view to explain empirical phenomena by applying it to understand subspace-based fine-tuning methods (Section 6), and we note that the kernel has many mathematically useful properties that can aid the design and study of parameter-efficient fine-tuning methods. Moreover, one can use the kernel to study the inductive bias of FT, as was done for gradient descent from a random initialization in the past (Allen-Zhu et al., 2019b; a; Li & Liang, 2018). We provide a first-cut theoretical analysis in Section 7 as to why prompt-based fine-tuning can exhibit kernel behavior.

A.1 DATASETS

… & Liu, 2004), MPQA (Wiebe et al., 2005), Subj (Pang & Lee, 2004), and TREC (Voorhees & Tice, 2000)), and 6 sentence-pair datasets (MNLI (Williams et al., 2018), SNLI (Bowman et al., 2015), QNLI (Rajpurkar et al., 2016), RTE (Dagan et al., 2005; Bar Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), MRPC (Dolan & Brockett, 2005)). Unless otherwise stated, we run experiments over 5 seeds of few-shot datasets. We directly use the 'manual' prompt templates and label words proposed by Gao et al. (2021), which are reproduced in Table 2.

A.2 COMPUTING THE KERNEL

We adapt the functorch implementation of Novak et al. (2022) to compute the eNTK for a large model, using a mix of backward-mode auto-differentiation to compute the Jacobians and forward-mode auto-differentiation to compute Jacobian-vector products. Note that K^(SignGD) cannot be computed using Jacobian-vector products and thus requires significantly more memory and run-time in practice.
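The memory saving comes from never materializing the full Jacobian: a Jacobian-vector product needs only directional evaluations of the model. The NumPy sketch below mimics forward-mode AD with central differences (the toy model and all sizes are hypothetical, and real forward-mode AD is exact rather than a finite-difference approximation):

```python
import numpy as np

def f(theta, x):
    """Toy scalar-output model standing in for the LM output on input x."""
    return np.tanh(theta @ x)

def jvp(f, theta, x, v, eps=1e-6):
    """Jacobian-vector product <∇_θ f, v> via central differences, mimicking
    forward-mode AD without ever building the full Jacobian."""
    return (f(theta + eps * v, x) - f(theta - eps * v, x)) / (2 * eps)

rng = np.random.default_rng(0)
theta = rng.normal(size=50) / np.sqrt(50)
x = rng.normal(size=50)
v = rng.normal(size=50)

# Closed-form check: ∇_θ tanh(θ·x) = (1 - tanh(θ·x)^2) x
exact = ((1 - np.tanh(theta @ x) ** 2) * x) @ v
approx = jvp(f, theta, x, v)
```

The sign kernel breaks this trick because sign(∇f) is not a linear functional of the gradient, so the full per-example gradients must be stored.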

A.3 SOLVING THE KERNEL

In the standard NTK setting, the initial output of the model f(·; θ_0) contains no information about the task, because θ_0 is a random initialization. In the prompt-based FT setting, however, we expect the pre-trained model to solve the downstream task reasonably well even before any fine-tuning occurs (see Table 4). So, we add the pre-trained model's output to the output of the kernel. Furthermore, we run a grid search over scaling the labels in order to take advantage of any pre-existing knowledge the model has about the downstream task. In particular, the kernel regression is based on the ℓ2 distance to the ground-truth one-hot vector, but the pre-trained model outputs logits intended for the cross-entropy loss. Scaling the one-hot vector by f_0 helps align its scale with that of the logits. Our hyperparameter grid for f_0 can be found in Table 3, where ∞ corresponds to not using the pre-trained model logits when solving the kernel.

Solving Multi-Class Tasks There are several options for how to solve C-way classification tasks (C > 2). We perform the most general one, which scales with C^2. Each logit is treated as an independent output of the network, essentially scaling the size N of the original dataset by a factor of C. With CN examples, the kernel now has shape CN × CN. The labels are also scaled up to treat the multi-class problem as many binary classification problems. Solving the multi-class task this way allows the kernel regression model to capture relationships between different logits.

Symmetric Kernel Given a symmetric kernel K ∈ R^{N×N}, we solve the kernel regression problem. In particular, by the representer theorem, the empirical risk minimizer can be expressed as a linear combination of the kernel features computed on the train set:

h*(·) = argmin_{h ∈ H_K} (1/N) Σ_{i=1}^{N} ℓ(h(x_i), y_i)   ⟺   h*(·) = Σ_{i=1}^{N} α_i K(·, x_i)

for a given loss function ℓ. For the symmetric SignGD and SGD kernels, we train the coefficients α_i via gradient descent to minimize a regularized logistic loss on the downstream task. We search over a grid of regularization strengths chosen proportional to ∥K∥_op; see Table 3. For a test input x, the kernel outputs the prediction h(x) = Σ_i α_i K(x, x_i).

Asymmetric Kernel Here we describe how to solve the kernel regression problem with an asymmetric kernel, following He et al. (2022b). Consider the augmented linear system

[ I/γ   H  ] [α]   [1]
[ H^⊤  I/γ ] [β] = [1]

where H_ij = y_i φ_s(x_i)^⊤ φ_t(x_j) y_j, with φ_s and φ_t the two different feature maps and y_i the label of the i-th example. Define ω* and ν* as

ω* = Σ_i β*_i y_i φ_t(x_i),    ν* = Σ_i α*_i y_i φ_s(x_i).

Solving this system yields two discriminant functions:

f_s(x) = K(x, X)(β* ⊙ Y),    f_t(x) = K(X, x)(α* ⊙ Y),

where K(x_i, x_j) = ⟨φ_s(x_i), φ_t(x_j)⟩. We can thus form a single discriminant function c f_s(x) + (1 − c) f_t(x), where c ∈ [0, 1] is a hyperparameter. When φ_s = φ_t, we have f_s = f_t and we reduce to the standard kernel problem (though with repeated equations). Note that per He et al. (2022b), this system is only meaningful in terms of stationary points when training α and β with the least-squares loss. We now leverage some specific knowledge about the NTK setting.
In particular, we know that we should use only f_s as the predictor in order to correctly represent a new test input in the kernel analog for SignGD.
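As a toy sketch of the symmetric-kernel solver described above, with the representer parametrization h(x) = Σ_i α_i K(x, x_i) and α trained by gradient descent on a regularized logistic loss (the data, the stand-in RBF kernel, and the hyperparameters are illustrative; the paper uses the eNTK):

```python
import numpy as np

# Kernel "solver" sketch: given a Gram matrix K on the train set, learn the
# representer coefficients alpha by gradient descent on a regularized
# logistic loss, with regularization proportional to ||K||_op.

rng = np.random.default_rng(1)
N = 20
X = rng.standard_normal((N, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # labels in {-1, +1}

def k(a, b):                        # illustrative stand-in kernel
    return np.exp(-np.sum((a - b) ** 2))

K = np.array([[k(a, b) for b in X] for a in X])
lam = 0.01 * np.linalg.norm(K, 2)   # regularization strength ~ ||K||_op

alpha = np.zeros(N)
for _ in range(500):
    margins = y * (K @ alpha)
    # gradient of mean log-loss + (lam/2) * alpha^T K alpha
    grad = -(K @ (y * (1.0 / (1.0 + np.exp(margins))))) / N + lam * (K @ alpha)
    alpha -= 0.5 * grad

train_acc = np.mean(np.sign(K @ alpha) == y)
```

A test point x is then scored with h(x) = Σ_i α_i K(x, x_i); in the paper's setup the pre-trained model's (scaled) logits would additionally be added to this output.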

Hyperparameters and Implementation

We follow Gao et al. (2021) in using the few-shot validation set to search over hyperparameters, selecting the best hyperparameters for each few-shot dataset. We use the value ranges given by Gao et al. (2021) and Hu et al. (2021), and search over a wider range of values for SGD. Table 3 shows the hyperparameter grids for fine-tuning and the kernel method.

Table 7: Accuracies of prompt-based SGD FT and prompt-based SGD-LoRA FT, along with their kernel analogs K^(SGD) and K^(SGD)_LoRA, on a subset of tasks. SGD FT and SGD-LoRA FT achieve comparable performance, and K^(SGD) and K^(SGD)_LoRA also achieve comparable performance to each other. These experiments support Theorem 6.3.

We measure χ (Definition 3.2) in the prompt-based FT setting for RoBERTa-base and RoBERTa-large. A decrease in the χ value when going from RoBERTa-base to RoBERTa-large indicates the task may be solvable in the infinite-width limit (Definition 7.3). We find that for most tasks the eNTK can solve (Table 1), χ decreases as model width grows.

Yang & Hu (2021) show how the initialization variance, multiplier, and learning rate of each parameter can move training from kernel behavior to feature-learning behavior. They further developed the Maximal Update Parametrization (abbreviated MUP or µP), in which every parameter is updated maximally (in terms of scaling with width) while keeping the network stable. Yang et al. (2022) then extended µP to Transformers with Adam optimization, and showed empirically that when pre-training large language models with µP, the optimal hyperparameters remain the same as width increases. This allows comprehensive hyperparameter searches on a smaller model and direct transfer of the resulting optimal hyperparameters to the larger model, thus improving pre-training performance. In this section, we formally describe the right parametrization for kernel behavior in various fine-tuning settings. In general, we consider the overparameterized setting in which the width of the network goes to infinity. Additionally, we assume that when initializing a weight matrix of the model, each entry of the matrix is drawn from an i.i.d. Gaussian distribution. We use Tensor Programs (Yang, 2020b) as the framework for this setting.

C.1 PRELIMINARIES

Notations Let ξ ∈ R^{d_in} be the input of the network. Let n be the hidden dimension of the network and d_out be the output dimension of the network. We define the network as a function of the following form: f(ξ; {U_i}_i, {W_j}_j, V) = V^⊤ h(ξ; {U_i}_i, {W_j}_j), where ξ is the input, U_i ∈ R^{n×d_in} are the input weight matrices, W_j ∈ R^{n×n} are hidden weight matrices, V ∈ R^{n×d_out} is the output weight matrix, and h(ξ; {U_i}_i, {W_j}_j) ∈ R^n is the input of the last (readout) layer. We write M for the set of weight matrices, i.e., M = {U_i}_i ∪ {W_j}_j ∪ {V}. For M ∈ M, let ∇_M f(ξ) be the gradient of f w.r.t. M at input ξ. For simplicity of notation, we assume d_in = 1 in this section; any non-trivial extension of the results below to d_in > 1 will be noted. For any weight matrix M ∈ M, let γ_M be the multiplier of M, such that M is multiplied by γ_M before performing matrix multiplication. Let η_M be the learning rate of the weight M. Let σ²_M be the variance of the entries of M at initialization, so each entry of M is drawn independently from N(0, σ²_M). Because we are considering the infinite-width limit, f(ξ; {U_i}_i, {W_j}_j, V) actually represents a series of networks {f_n(ξ; {U_{i,n}}_i, {W_{j,n}}_j, V_n)}_{n>0} of the same architecture, where f_n has hidden dimension n. When we say "model f", it includes not only the architecture but also γ_M, η_M, σ_M for every weight matrix M in f, as well as the training optimizer of f. Let M_t be the weight matrix at time step t of training. If the network is pre-trained, we let M_{−1} be the weight matrix before pre-training and M_0 the parameters right after pre-training. Let ΔM_t = M_t − M_{t−1} be the change at each training step. Let f_t be the network at step t, so that f_t(ξ) = f(ξ; {U_{i,t}}_i, {W_{j,t}}_j, V_t). Let ξ_t, y_t be the training input and target at step t, and let the loss function at step t be L_t(f_t(ξ_t)) = L(f_t(ξ_t), y_t).
Let χ_t = L′_t(f_t(ξ_t)) be the derivative of the loss function at step t.

Big-O Notation For a series of scalar random variables c = {c_n}_{n>0} and a function e : N → R, we say c = Θ(e(n)) if there exist constants A, B > 0 such that for sufficiently large n, |c_n| ∈ [A e(n), B e(n)] almost surely. For a series of vector random variables x = {x_n}_{n>0}, we say that x is coordinate-wise Θ(e(n)), or x = Θ(e(n)), if the series of scalar random variables {∥x_n∥_2/√n}_{n>0} is Θ(e(n)). The notations O(e(n)), Ω(e(n)), and o(e(n)) are defined similarly. For convenience, we assume every e(n) in this section is equal to n^a for some a.

Tensor Programs We refer the reader to Section 7 of Yang & Hu (2021) for a detailed explanation and the full definition of Tensor Programs. Here, we provide a brief overview:

Definition C.1 (Definition 7.1 of Yang & Hu (2021)). A Tensor Program is a sequence of R^n-vectors and R-scalars inductively generated in one of the following ways from an initial set C of random scalars, an initial set V of random R^n vectors, and a set W of random R^{n×n} matrices.
MatMul Given W ∈ R^{n×n} and x ∈ R^n, we can generate Wx ∈ R^n or W^⊤x ∈ R^n.
Nonlin Given ϕ : R^k × R^l → R, previously generated scalars θ_1, …, θ_l ∈ R and vectors x^1, …, x^k ∈ R^n, we can generate a new vector ϕ(x^1, …, x^k; θ_1, …, θ_l) ∈ R^n, where ϕ(−; θ_1, …, θ_l) applies coordinate-wise to each "α-slice" (x^1_α, …, x^k_α).
Moment Given the same setup as above, we can also generate a new scalar (1/n) Σ_{α=1}^{n} ϕ(x^1_α, …, x^k_α; θ_1, …, θ_l) ∈ R.

Yang (2019; 2020a), Yang & Littwin (2021), and Yang et al. (2022) show that Tensor Programs can express the computation, the SGD/Adam optimization, and the kernel of practically any architecture. The key result of Tensor Programs is that we can represent the coordinates of any vector x in the Tensor Program with a random variable Z^x, and represent any scalar θ with a deterministic scalar θ̊.
There is a way to define all θ̊ and Z^x corresponding to the Tensor Program (cf. Definition 7.3 in Yang & Hu (2021)), and the Master Theorem of Tensor Programs shows that θ → θ̊ as n → ∞ (cf. Theorem 7.4 in Yang & Hu (2021)). Although it is in general hard to compute Z^x and θ̊, this machinery allows us to reason about the scales of vectors during the training of a network.

Assumptions Related to Tensor Programs Since we are studying the infinite-width limit and using Tensor Programs as our framework, we need some mild assumptions in order to apply Tensor Programs and the results in Yang & Hu (2021).

Assumption C.2. We assume the network f satisfies the following:
a) The forward pass of f in the infinite-width limit can be written as a Tensor Program.
b) The hidden vectors have Θ(1) coordinates at initialization.
c) The hidden vectors have O(1) coordinates during training.
d) For any training scheme and any constant t and any input ξ, f_t(ξ) = O(1).
e) There exist a training scheme, a constant t, and an input ξ such that f_t(ξ) − f_0(ξ) = Θ(1).
f) The activation function of f is tanh or σ-gelu for a small enough σ (so that it approximates ReLU), where σ-gelu(x) = (1/2) x erf(σ⁻¹x) + (σ/(2√π)) e^{−σ⁻²x²} + x/2.
Furthermore, we have one assumption on SignGD:
g) SignGD is approximated by replacing the sign function with ϵ-sign for small enough ϵ when updating parameters, where ϵ-sign(x) = x/(|x| + ϵ) is a smoothed version of sign. We allow a different ϵ when computing the sign of each ∇_M f, so that the ϵ for ∇_M f matches the maximum scale of ∇_M f.

b), c), d) and e) in Assumption C.2 together recover the definition of a nontrivial stable network in Yang & Hu (2021). b) and c) ensure that the pre-activations in the network are not too large, so that activation functions like tanh are not trivialized to always output ±1. b) also ensures that the pre-activations in the network are not too small at initialization, so the activation function is not trivialized to its first-order Taylor expansion. d) ensures the network output is bounded. e) ensures that the network is not frozen during training. f) and g) ensure that all non-linear functions appearing in the Tensor Program are pseudo-Lipschitz, which is required for the Master Theorem of Tensor Programs. g) also ensures that ϵ-sign does not trivialize to 0 or to sign when ∇_M f ≠ Θ(1).
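As a minimal illustration of assumption g), the smoothed sign can be written directly; the ϵ and the test values below are arbitrary:

```python
import numpy as np

# eps-sign from Assumption C.2(g): eps_sign(x) = x / (|x| + eps).
# For |x| >> eps it approaches sign(x); for |x| << eps it behaves like x/eps.
# Its derivative is bounded by 1/eps everywhere, so it is pseudo-Lipschitz.

def eps_sign(x, eps):
    return x / (np.abs(x) + eps)

x = np.array([-2.0, -1e-6, 0.0, 1e-6, 2.0])
out = eps_sign(x, eps=1e-3)
# large |x| gives values close to +/-1; tiny |x| gives values close to x/eps
```

The Lipschitz bound |ϵ-sign(x) − ϵ-sign(y)| ≤ |x − y|/ϵ used later in the SignGD kernel proof follows from the derivative bound noted in the comment.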

C.2 SIGNGD KERNEL DERIVATION

Definition C.3 (Formal Definition of Kernel Behavior). We say that a network's training process demonstrates kernel behavior if the following properties are satisfied.
1. Linearization: the change of the network can be approximated by its first-order Taylor expansion, i.e.,

lim_{n→∞} (f_t(ξ) − f_{t−1}(ξ))/χ_t = lim_{n→∞} Σ_{M∈M} ⟨∇_M f_{t−1}(ξ), ΔM_t⟩/χ_t;

2. Fixed Features: the gradients at step t are approximately the same as before training, i.e., for all M ∈ M,

lim_{n→∞} ∥∇_M f_t(ξ) − ∇_M f_0(ξ)∥²_2 / max_{ξ′} ∥∇_M f_0(ξ′)∥²_2 = 0.

Note that we define Linearization with both the LHS and the RHS divided by χ_t so that it is meaningful in the case χ_t = o(1). We do the same in the following theorem.

Theorem C.4 (SignGD Kernel). If SignGD training of f demonstrates kernel behavior, then under Assumption C.2,

lim_{n→∞} (f_t(ξ) − f_{t−1}(ξ))/χ_t = lim_{n→∞} Σ_{M ∈ {U_i}_i ∪ {W_j}_j ∪ {V}} −η_M ⟨∇_M f_0(ξ), ϵ-sign(∇_M f_0(ξ_t))⟩.

Note that if η_M = η for all M, the RHS above equals −η⟨∇f_0(ξ), ϵ-sign(∇f_0(ξ_t))⟩ ≈ −η K^(A-SignGD)(ξ, ξ_t), where the approximation comes from the difference between ϵ-sign and sign.

Proof. By the update rule of SignGD, ΔM_t/χ_t = −η_M ϵ-sign(∇_M f_{t−1}). It suffices to prove that, as n → ∞, η_M ⟨∇_M f_t(ξ), ϵ-sign(∇_M f_t(ξ_t))⟩ → η_M ⟨∇_M f_0(ξ), ϵ-sign(∇_M f_0(ξ_t))⟩. Since

η_M ⟨∇_M f_t(ξ), ϵ-sign(∇_M f_t(ξ_t))⟩ − η_M ⟨∇_M f_0(ξ), ϵ-sign(∇_M f_0(ξ_t))⟩
= η_M ⟨∇_M f_t(ξ) − ∇_M f_0(ξ), ϵ-sign(∇_M f_0(ξ_t))⟩   (4)
+ η_M ⟨∇_M f_0(ξ), ϵ-sign(∇_M f_t(ξ_t)) − ϵ-sign(∇_M f_0(ξ_t))⟩   (5)
+ η_M ⟨∇_M f_t(ξ) − ∇_M f_0(ξ), ϵ-sign(∇_M f_t(ξ_t)) − ϵ-sign(∇_M f_0(ξ_t))⟩,   (6)

we only need to prove that Equations (4) to (6) all vanish as n → ∞. Let ξ* = argmax_{ξ′} ∥∇_M f_0(ξ′)∥²_2 be the input with maximum gradient scale; then by Fixed Features, ∥∇_M f_t(ξ) − ∇_M f_0(ξ)∥_2 / ∥∇_M f_0(ξ*)∥_2 = o(1). Since |ϵ-sign(x) − ϵ-sign(y)| ≤ |x − y|/ϵ, we have ∥ϵ-sign(∇_M f_t(ξ)) − ϵ-sign(∇_M f_0(ξ))∥_2 ≤ ∥∇_M f_t(ξ) − ∇_M f_0(ξ)∥_2 / ϵ.
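The Linearization property can be checked numerically on a toy model (this is an illustration, not a Transformer experiment): with a small learning rate, one SGD step changes the output by approximately the first-order term ⟨∇_θ f_0(ξ), Δθ⟩.

```python
import numpy as np

# Numerical sanity check of Linearization (Definition C.3) for a toy
# two-layer net f(W) = v^T tanh(W x): after one small SGD step dW, the
# change f(W + dW) - f(W) should match the first-order term <grad_W f, dW>.

rng = np.random.default_rng(2)
d, h = 4, 512
x = rng.standard_normal(d)
W = rng.standard_normal((h, d))
v = rng.standard_normal(h) / np.sqrt(h)

def f(W):
    return v @ np.tanh(W @ x)

def grad(W):                        # grad_W f = (v * sech^2(Wx)) outer x
    pre = W @ x
    return np.outer(v * (1.0 - np.tanh(pre) ** 2), x)

eta, chi = 1e-3, 1.0                # small step keeps us in the linear regime
dW = -eta * chi * grad(W)           # one SGD step on input x
actual = f(W + dW) - f(W)
linear = np.sum(grad(W) * dW)       # <grad f(W), dW> = -eta * chi * K(x, x)
rel_err = abs(actual - linear) / abs(linear)
```

The relative error is tiny here because the step is small; the content of the theory is that for suitably parametrized infinite-width networks this approximation holds even for the step sizes actually used in fine-tuning.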
Proof (of Theorem C.5). We first prove the theorem under the assumption that the network is a multilayer perceptron and the optimizer is SGD, which is the same setting as Yang & Hu (2021); we then extend to more general cases. Consider the following L-hidden-layer perceptron:

h^1(ξ) = U ξ,   x^l(ξ) = ϕ(h^l(ξ)),   h^{l+1}(ξ) = W^{l+1} x^l(ξ)  for l = 1, …, L−1,   and   f(ξ) = V x^L(ξ).

Following Yang & Hu (2021), we let the learning rate of every parameter equal η n^{−c}. Let W^1 = U and W^{L+1} = V, and for l = 1, …, L+1 we parametrize W^l as W^l = γ_l w^l for the actual trainable parameter w^l, initializing each coordinate of w^l i.i.d. from N(0, σ²_l). By Lemma C.6, this setting covers all parametrizations. For convenience, we assume γ_l = n^{−a_l} and σ_l = n^{−b_l}. Without loss of generality, we further assume that χ_t = Θ(n^{−d}) (rather than merely χ_t = o(1)). By Theorem 3.3 of Yang & Hu (2021), stability of the network implies

r ≜ min(a_{L+1} + b_{L+1}, 2a_{L+1} + c) + c − 1 + min_{l=1}^{L} [2a_l + I(l = 1)] ≥ 0.

Also by Theorem 3.8 of Yang & Hu (2021), for a nontrivial stable network (included in Assumption C.2), if r > 0 then there exists a kernel K such that f_{t+1}(ξ) = f_t(ξ) − η χ_t K(ξ, ξ_t), which is very close to our definition of kernel behavior; in fact, we will prove that they are equivalent in the fine-tuning case. Since χ_t = Θ(n^{−d}) during fine-tuning, this is equivalent to setting the learning rate to η n^{−c−d} and replacing χ_t with χ̃_t = n^d χ_t. Formally, we consider the following training scheme: at the pre-training stage, r ≥ 0 (so the network may demonstrate feature learning or kernel behavior); at the fine-tuning stage, c is increased to c′ ≜ c + d > c, and thus the corresponding r increases to be strictly greater than 0. Therefore, fine-tuning exhibits kernel behavior, with the following caveats.

Caveat 1: does pre-training affect the result? The answer is effectively NO. First of all, the scales of the updates to W^l, h^l, x^l and f are all multiplied by n^{−d} when switching from the pre-training stage (learning rate η n^{−c}) to the fine-tuning stage (learning rate η n^{−c−d}). The scales are exactly the same as when training from scratch with learning rate η n^{−c−d}, except that b_{L+1} must be changed to b′_{L+1} ≜ min(b_{L+1}, a_{L+1} + c).
Note that this change to b_{L+1} does not affect the fact that r is updated to r′ ≜ r + d > 0.

Caveat 2: does r′ > 0 formally imply our definition of kernel behavior (Definition C.3)? The answer is YES. We first prove Fixed Features in Definition C.3. The gradient of the matrix W^l equals the outer product of ∇_{h^l} f (the gradient w.r.t. h^l) and x^{l−1}. Let dh^l_t be the normalized gradient w.r.t. h^l at step t (so dh^l_t = Θ(1)), and let x^l_t be x^l at step t (x^l_t = Θ(1) without normalization). It suffices to prove dh^l_t − dh^l_0 = O(1) and x^l_t − x^l_0 = o(1). The latter was proved by Proposition H.27 of Yang & Hu (2021). To prove dh^l_t − dh^l_0 = O(1), we let dx^l_t be the normalized gradient w.r.t. x^l at step t, and compute the scale of dh^l_t − dh^l_{t−1} and dx^l_t − dx^l_{t−1} inductively from l = L down to l = 1. We obtain that both have the scale

n^{−min(2a_{L+1} + c − a_{L+1} − b′_{L+1},  a_{L+1} + b_{L+1} + c′ − 1 + min_{m=l+1}^{L} 2a_m)} ≤ n^{−min(0, r′)} = 1,

where the inequality holds because b′_{L+1} ≤ a_{L+1} + c and r′ ≤ a_{L+1} + b_{L+1} + c′ − 1 + min_{m=l+1}^{L} 2a_m.

Second, we prove Linearization in Definition C.3. We first make a very slight modification to the Tensor Program in Yang & Hu (2021): we change the computation of f_t(ξ) − f_{t−1}(ξ) to n^{−d}(f_t(ξ) − f_{t−1}(ξ)). By Theorem H.32 of Yang & Hu (2021) and its definition of Σ, we can show that

lim_{n→∞} n^{−d}(f_t(ξ) − f_{t−1}(ξ)) = lim_{n→∞} Σ_{l=1}^{L+1} η n^{−c−d} ⟨∇_{W^l} f_{t−1}(ξ), ∇_{W^l} f_{t−1}(ξ_t)⟩ = lim_{n→∞} Σ_{l=1}^{L+1} ⟨∇_{W^l} f_{t−1}(ξ), ΔW^l_t⟩ n^{−d}.

From SGD to SignGD Since sign(xy) = sign(x) sign(y), the update of the matrix W^l can still be written as the outer product of two vectors, i.e., ΔW^l_t = η n^{−c} χ_t sign(∇_{h^l} f_{t−1}) ⊗ sign(x^{l−1}_{t−1}). After applying sign, the scales of the vectors change; under the same parametrization, the scales of vectors using SignGD will differ from those using SGD. This can easily be resolved by changing the learning rate of each parameter so that the scaling change brought by sign is corrected. Furthermore, as mentioned in Assumption C.2, we approximate sign by the smoothed version ϵ-sign so that the Master Theorem of Tensor Programs still stands.

Extension to universal architectures The theorem holds for any network whose first forward pass can be written as a Tensor Program. Given this condition, the forward pass, backward pass and kernel at any step can be written as Tensor Programs (Yang, 2020a;b). Analyzing the scaling of the Tensor Program requires the following steps: 1. Extension to general computation graphs. We can still inductively reason about the scales of pre-activations and activations in the topological order of the computation graph, and similarly reason about the gradients in reverse topological order. 2. Extension to weight sharing. A weight may be used multiple times in a forward pass. The pre-activations, activations and their gradients are unaffected; only the update of a weight is now a sum of several vector outer products, one per occurrence of the weight.
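The outer-product identity used in the SignGD extension, namely that sign applied to a rank-one gradient χ · dh ⊗ x factors into signed vectors, can be sketched as follows (shapes and values are arbitrary):

```python
import numpy as np

# SignGD update sketch: since sign(a*b) = sign(a)*sign(b), the sign of a
# rank-one gradient chi * (dh outer x) is itself an outer product of
# sign(dh) and sign(x), up to the sign of the loss derivative chi.

rng = np.random.default_rng(3)
dh = rng.standard_normal(5)       # gradient w.r.t. pre-activation h
x = rng.standard_normal(7)        # layer input
chi = -0.8                        # loss derivative L'

grad_W = chi * np.outer(dh, x)            # grad_W L = chi * dh outer x
update = -np.sign(grad_W)                 # SignGD step with eta = 1
factored = -np.sign(chi) * np.outer(np.sign(dh), np.sign(x))
assert np.array_equal(update, factored)   # the rank-one structure survives
```

This is why the SignGD analysis can reuse the Tensor Program machinery for rank-one updates: only the coordinate-wise scales of the two factors change, which is then compensated by rescaling the per-parameter learning rates.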

C.4 µP FOR SGD AND SIGNGD

In the following subsections, we provide more intuition for Theorem C.5 and describe some other situations where kernel behavior arises. Although we care about all types of pre-trained models, we are mostly interested in models with feature-learning behavior. For pre-trained models with kernel behavior, it is obvious that fine-tuning with the same settings as pre-training (corresponding to prompt-based FT) leads to kernel behavior. Furthermore, Theorem H.17 of Yang & Hu (2021) proves that if the last layer is replaced with a freshly initialized layer (corresponding to standard FT), fine-tuning a pre-trained model with kernel behavior is the same as training on the downstream task from scratch. Among all models with feature-learning behavior, µP is the most special one, in that every parameter itself (except the last layer) pushes the model to learn features. Therefore, we use µP as an example to convey an intuitive understanding. The formulation of µP contains three sets of hyperparameters for each M ∈ {U_i}_i ∪ {W_j}_j ∪ {V}: the initial variance of M, the multiplier of M, and the learning rate of M. However, even if we restrict these three hyperparameters to the form n^α, µP is not unique, because there is one degree of freedom for each weight according to the following lemma.

Lemma C.6 (Lemma J.1 of Yang et al. (2022)). Consider a weight matrix M with learning rate C, initialized as M ∼ N(0, B²), and with a multiplier A. Then for any γ > 0, f_t(ξ) stays fixed for all t and ξ if we set
• A ← Aγ, B ← B/γ, C ← C/γ² when training with SGD;
• A ← Aγ, B ← B/γ, C ← C/γ when training with Adam.

Note that the conclusion about Adam in Lemma C.6 also extends to SignGD. With Lemma C.6, we can always set the multiplier of any weight matrix M to 1, which leaves only the initialization variance σ²_M and the learning rate η_M. Furthermore, in terms of the scale at initialization and the scale of the updates, µP for SGD and SignGD are entirely the same. The only difference is the learning rate. We provide details in Table 9 (recall that M_{−1} is the weight M at initialization of pre-training, and ΔM_0 = M_0 − M_{−1} is the overall change of the weight during pre-training). Since the learning rates η_M differ across weights, the kernel we care about is defined as

K(ξ, ξ′) = Σ_{M∈M} η_M ⟨∇_M f(ξ), ϕ(∇_M f(ξ′))⟩,

where ϕ is the identity if the algorithm is SGD, and ϕ = sign if the algorithm is SignGD. We want to prove that the dynamics of the network follow (f_t(ξ) − f_{t−1}(ξ))/χ_t → −K(ξ, ξ_t) as n → ∞.
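The per-matrix-weighted kernel defined above can be written generically; the gradients and per-matrix learning rates below are illustrative placeholders rather than values from any actual model:

```python
import numpy as np

# K(xi, xi') = sum_M eta_M * <grad_M f(xi), phi(grad_M f(xi'))>,
# with phi = identity for SGD and phi = sign for SignGD.
# grads1/grads2 map each weight-matrix name to the gradient of f at the
# two inputs; lrs maps each name to its per-matrix learning rate.

def weighted_kernel(grads1, grads2, lrs, phi):
    return sum(lrs[m] * np.sum(grads1[m] * phi(grads2[m])) for m in lrs)

rng = np.random.default_rng(4)
g1 = {"U": rng.standard_normal((3, 2)), "V": rng.standard_normal(3)}
g2 = {"U": rng.standard_normal((3, 2)), "V": rng.standard_normal(3)}
lrs = {"U": 0.1, "V": 0.01}

k_sgd = weighted_kernel(g1, g2, lrs, lambda g: g)      # SGD kernel
k_sign = weighted_kernel(g1, g2, lrs, np.sign)         # SignGD (asymmetric)
k_sgd_sym = weighted_kernel(g1, g1, lrs, lambda g: g)  # K(xi, xi) >= 0
assert k_sgd_sym > 0
```

Note that with ϕ = sign the kernel is asymmetric in its two arguments, which is exactly why the asymmetric kernel regression machinery of Appendix A.3 is needed for K^(A-SignGD).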

C.5 PROMPT-BASED FINE-TUNING: A LINEAR EXAMPLE

As an intuitive example, we consider a three-layer linear network f(ξ; U, W, V) = V^⊤ W U ξ. For simplicity, we train the network with SGD and freeze V, so η_V = 0. Then we have ∇_U f = W^⊤ V ξ^⊤ and ∇_W f = V (Uξ)^⊤. We assume |⟨ξ, ξ′⟩| > 0 for any ξ, ξ′. The relevant coordinate-wise scales (Table 9) are:

coordinate-wise scale   | M = U_i | M = W_j  | M = V
M_{−1}                  | Θ(1)    | Θ(1/√n)  | Θ(1/n)
ΔM_0                    | Θ(1)    | Θ(1/n)   | Θ(1/n)
η_M for SGD             | Θ(n)    | Θ(1)     | Θ(1/n)
η_M for SignGD/Adam     | Θ(1)    | Θ(1/n)   | Θ(1/n)

Zeroth step (pre-training) We model the pre-training of f as one step of training with χ_0 = Θ(1). Then we have ΔU_0 = −η_U χ_0 W^⊤_{−1} V ξ^⊤_0 and ΔW_0 = −η_W χ_0 V (U_{−1} ξ_0)^⊤. Since W^⊤_{−1} is independent of V, we have W^⊤_{−1} V = Θ(1/n), and thus ΔU_0 = Θ(1), matching Table 9. On the other hand, it is obvious that ΔW_0 = Θ(1/n) because V = Θ(1/n) and U = Θ(1), also matching Table 9. The function is now

f_0(ξ) = V^⊤ (W_{−1} + ΔW_0)(U_{−1} + ΔU_0) ξ
       = V^⊤ (W_{−1} − η_W χ_0 V (U_{−1}ξ_0)^⊤)(U_{−1}ξ − η_U χ_0 W^⊤_{−1} V ⟨ξ_0, ξ⟩)
       = V^⊤ W_{−1} U_{−1} ξ − η_U χ_0 ∥W^⊤_{−1} V∥²_2 ⟨ξ_0, ξ⟩ − η_W χ_0 ∥V∥² ⟨U_{−1}ξ_0, U_{−1}ξ⟩ + η_W η_U χ²_0 ∥V∥² ⟨ξ_0, ξ⟩ V^⊤ W_{−1} U_{−1} ξ_0.

It is not difficult to see that η_U χ_0 ∥W^⊤_{−1} V∥²_2 ⟨ξ_0, ξ⟩, η_W χ_0 ∥V∥² ⟨U_{−1}ξ_0, U_{−1}ξ⟩, and η_W η_U χ²_0 ∥V∥² ⟨ξ_0, ξ⟩ are all Θ(1). Unfortunately, here V^⊤ W_{−1} U_{−1} ξ_0 = 0 in the infinite-width limit, but if we train one more step, it is easy to see that all four terms of f_0 are Θ(1).

First step At the first step of fine-tuning, we have ΔU_1 = −η_U χ_1 W^⊤_0 V ξ^⊤_1 and ΔW_1 = −η_W χ_1 V (U_0 ξ_1)^⊤. The function is now f_1(ξ) = V^⊤ (W_0 + ΔW_1)(U_0 + ΔU_1) ξ, and

f_1(ξ) − f_0(ξ) = V^⊤ ΔW_1 U_0 ξ + V^⊤ W_0 ΔU_1 ξ + V^⊤ ΔW_1 ΔU_1 ξ.   (8)

Note that the sum of the first and second terms is exactly −χ_1 K(ξ, ξ_1). Plugging ΔW_1 = −η_W χ_1 V (U_0 ξ_1)^⊤ into the first term of eq. (8),

V^⊤ ΔW_1 U_0 ξ = −η_W χ_1 V^⊤ V (U_0 ξ_1)^⊤ U_0 ξ = Θ(χ_1),

because

(U_0 ξ_1)^⊤ U_0 ξ = (U_{−1}ξ_1 + ΔU_0 ξ_1)^⊤ (U_{−1}ξ + ΔU_0 ξ) = ⟨U_{−1}ξ_1, U_{−1}ξ⟩ − η_U χ_0 ⟨ξ_1, ξ_0⟩ f_{−1}(ξ) − η_U χ_0 ⟨ξ, ξ_0⟩ f_{−1}(ξ_1) + ∥ΔU_0∥² ⟨ξ_1, ξ⟩ = Θ(n).

Plugging ΔU_1 = −η_U χ_1 W^⊤_0 V ξ^⊤_1 into the second term of eq. (8), we have

V^⊤ W_0 ΔU_1 ξ = −η_U χ_1 V^⊤ W_0 W^⊤_0 V ξ^⊤_1 ξ = Θ(χ_1),

because

V^⊤ W_0 W^⊤_0 V = ∥(W_{−1} + ΔW_0)^⊤ V∥²_2 = ∥W^⊤_{−1} V∥²_2 + η²_W χ²_0 ∥V∥⁴_2 ∥U_{−1}ξ_0∥²_2 − 2 η_W χ_0 ∥V∥²_2 f_{−1}(ξ_0) = Θ(1/n).

The third term of eq. (8) equals

V^⊤ ΔW_1 ΔU_1 ξ = η_U η_W χ²_1 V^⊤ V (U_0 ξ_1)^⊤ W^⊤_0 V ξ^⊤_1 ξ = η_U η_W χ²_1 ∥V∥² ⟨ξ_1, ξ⟩ f_0(ξ_1) = Θ(χ²_1).

Therefore, (f_1(ξ) − f_0(ξ))/χ_1 → −K(ξ, ξ_1).

Second step At the second step of fine-tuning, we have ΔU_2 = −η_U χ_2 W^⊤_1 V ξ^⊤_2 and ΔW_2 = −η_W χ_2 V (U_1 ξ_2)^⊤, and

f_2(ξ) − f_1(ξ) = V^⊤ ΔW_2 U_1 ξ + V^⊤ W_1 ΔU_2 ξ + V^⊤ ΔW_2 ΔU_2 ξ.   (9)

Assuming χ_2 and χ_1 share the same order, then as n → ∞,

(f_2(ξ) − f_1(ξ))/χ_2 = V^⊤ ΔW_2 U_1 ξ/χ_2 + V^⊤ W_1 ΔU_2 ξ/χ_2 = −η_W V^⊤ V (U_1 ξ_2)^⊤ U_1 ξ − η_U V^⊤ W_1 W^⊤_1 V ξ^⊤_2 ξ = −η_W V^⊤ V (U_0 ξ_2)^⊤ U_0 ξ − η_U V^⊤ W_0 W^⊤_0 V ξ^⊤_2 ξ = −K(ξ, ξ_2).

t-th step Same as the second step, noting that ΔU_t and ΔW_t always have smaller order than ΔU_0 and ΔW_0.
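The scaling fact driving the zeroth step, that W^⊤_{−1}V has Θ(1/n) coordinates when W_{−1} is independent of V, can be checked empirically (widths and the seed are arbitrary):

```python
import numpy as np

# With W entries Theta(1/sqrt(n)) independent of V, whose entries are
# Theta(1/n), the product u = W^T V has Theta(1/n) coordinates:
# each u_i is a sum of n independent terms of size (1/sqrt(n))*(1/n).

def coord_scale(n, rng):
    W = rng.standard_normal((n, n)) / np.sqrt(n)   # Theta(1/sqrt(n)) entries
    V = rng.standard_normal(n) / n                 # Theta(1/n) entries
    u = W.T @ V
    return np.linalg.norm(u) / np.sqrt(n)          # coordinate-wise scale

rng = np.random.default_rng(5)
s1, s2 = coord_scale(256, rng), coord_scale(1024, rng)
ratio = s1 / s2   # quadrupling n should shrink the scale by roughly 4x
```

The observed ratio concentrating near 4 when the width is quadrupled is the empirical signature of the Θ(1/n) coordinate scale used throughout the linear example.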

C.6 STANDARD FINE-TUNING

In standard fine-tuning, V is replaced with a randomly initialized matrix V′. That is, the model at step t of fine-tuning is f_t(ξ) = f(ξ; {U_{i,t}}_i, {W_{j,t}}_j, V′_t). We set V′_0 ∼ N(0, σ²_n I_{n×d_out}), which has a larger scale than V in µP. In this section, we prove that this standard fine-tuning has kernel behavior. We again consider a three-layer linear network f(ξ; U, W, V′) = V′^⊤ W U ξ, where V′ ∈ R^{n×1} (i.e., d_out = 1), and we freeze V′ during fine-tuning so it is not trained. Then we have ∇_U f = W^⊤ V′ ξ^⊤ and ∇_W f = V′ (Uξ)^⊤. The relevant coordinate-wise scales (Table 10) are:

coordinate-wise scale   | M = U_i  | M = W_j    | M = V  | M = V′
M_{−1}                  | Θ(1)     | Θ(1/√n)    | Θ(1/n) | −
ΔM_0                    | Θ(1)     | Θ(1/n)     | Θ(1/n) | Θ(1/√n)
ΔM_t                    | Θ(1/√n)  | Θ(1/(n√n)) | −      | Θ(1/n)
η_M for SGD             | Θ(1)     | Θ(1/n)     | −      | Θ(1/n)
η_M for SignGD/Adam     | Θ(1/√n)  | Θ(1/(n√n)) | −      | Θ(1/n)



Note that for C-way classification, f is a vector in R^C. We say f exhibits kernel behavior if the Linearization and Fixed Features properties hold for every component of f. The subsequent definition of a kernel analog also generalizes to a vector output component-wise.

In our experiments, Standard FT corresponds to initializing Γ at the linear probing solution (i.e., training Γ on the downstream task while freezing all other layers) and then performing FT. We do this because when FT exhibits kernel behavior (Definition 3.1), it finds solutions close to the initialization, and we hypothesize that the Γ learned during FT is closer to the linear probing solution than to a random initialization.

The QQP data is available at https://www.quora.com/q/quoradata/. We are able to describe Transformers (without weight tying) in the definition; biases can be regarded as input weights by assuming a coordinate of ξ that is always 1. A training scheme is a sequence of training examples {(ξ_t, y(ξ_t))} together with loss functions ℓ(f_t(ξ_t), y(ξ_t)).



Figure 2: Accuracies of the zero-shot pre-trained model (PT), the linearized model (Lin., see Definition 3.1), and the fine-tuned model (FT). Tasks that induce the Linearization property of kernel behavior (Definition 3.1) will show that Lin. performance recovers a substantial amount of the FT performance. For each k, we plot the median and range of the test accuracies across 5 seeds and data splits.






The statistics and prompts of the datasets used in our experiments. The prompts are from Gao et al. (2021) and include a template and a set of label words that are used to fill in the [MASK] token. <S1> and <S2> refer to the first and the second (if any) input sentences.

Gao et al. (2021) train for 1000 steps in the 16-shot setting and validate the performance every 100 steps to take the best checkpoint. As we consider varying values of k, we instead train for 32kC steps and validate every 4kC steps, where C is the number of classes in the dataset. This gives a comparable number of training and validation steps to the binary tasks in the 16-shot setting.

Average element-wise relative distance of K^(SGD) computed on the pre-trained and best fine-tuned model. A smaller value indicates a higher likelihood that the Fixed Features property of kernel behavior (Definition 3.1) holds during fine-tuning. Distances are averaged across 5 seeds for each value of k and measured on the LM-BFF test set (Gao et al., 2021).

Fine-tuning performance in the standard fine-tuning setting, where the contextual embedding of the [CLS] token is used for classification, and the prompt-based fine-tuning setting, where a prompt is added and the embedding of the [MASK] token is used (see Section 3). This table relates to Figure 1 by comparing the SGD fine-tuning results to the more common fine-tuning with Adam.


Scales of initialization, update and learning rate for µP in pre-training.

Scales for standard fine-tuning w.r.t. n


Hyperparameter grid (Table 3, excerpt): SGD FT batch size {2, 4, 8}; learning rate {1e-4, 5e-4, 1e-3, 5e-3, …}.

Accuracies of the zero-shot pre-trained model (PT), linearized model (Lin.), and fine-tuned model (FT). Tasks that exhibit the Linearization property of kernel behavior (Definition 3.1) during fine-tuning will show that Lin. performance recovers a substantial amount of the gain in performance achieved by performing fine-tuning. Accuracies are averaged across 5 fine-tuning seeds for each value of k and measured on the test set. This table corresponds to the bar chart in Figure 2.

The kernel approximation describes training in the infinite-width limit, but it is not believed to be the full answer to neural networks in practice. In particular, kernel behavior implies that the features of the neural network remain unchanged in the overparameterized setting, which is not true in practical pre-training of large models.

Combined with ∥∇ … by eq. (7). By d) in Assumption C.2, and considering the training scheme that sets ξ_1 = ξ* and the loss function ℓ so that χ_1 = Θ(1), then …, and it is easy to see that … has the same scale. Now it suffices to prove that Equations (4) to (6), divided by …, vanish. Similarly for Equation (5), …, and for Equation (6), ….

C.3 PROMPT-BASED FINE-TUNING

In this section, we prove that prompt-based fine-tuning exhibits kernel behavior under the assumption that χ_t = o(1). Prompt-based fine-tuning trains the pre-trained network directly, without substituting or adding any parameters. Without our assumption, it is obvious that the behavior of fine-tuning and pre-training is the same from the perspective of Tensor Programs.

Theorem C.5. If the downstream task is solvable for the network f, that is, χ_t = o(1) for all t, then under Assumption C.2, the fine-tuning of f exhibits kernel behavior (Definition C.3).

Below we provide a proof that relies heavily on Tensor Programs and the analysis in Yang & Hu (2021). For readers who are not familiar with Tensor Programs, we provide intuitive examples in the next few subsections, where we focus on a three-layer linear network parametrized with µP.

First step At the first step of fine-tuning, we have … and …. Note that the first-order and second-order terms are exactly −χ_1 K(ξ, ξ_1). Plugging … into the first term of eq. (10), …; plugging … into the second term of eq. (10), we have … (assuming |ξ^⊤_1 ξ| > 0). It is easy to verify that the third term is O(1/√n). Therefore, f_1(ξ) − f_0(ξ) converges to its first-order Taylor expansion as n → ∞.

Second step At the second step of fine-tuning, we have …. Removing all o(1) terms, ….

t-th step Same as the second step, noting that ΔU_t and ΔW_t always have smaller order than ΔU_0 and ΔW_0.

C.7 PROMPT-BASED FINE-TUNING WITH PROJECTION

If χ = Θ(1) in prompt-based FT, we need some modification in order to push the model into the kernel regime; otherwise, the network will stay in the feature-learning regime, or it will not learn (if we decrease the learning rate). In particular, we add a random projection matrix before the last layer, so the function is now f_t(ξ) = V′^⊤_t h(ξ; {U_{i,t}}_i, {W_{j,t}}_j), where V′_t ≜ W′^⊤_t V_t and W′_0 ∼ N(0, σ² I_{n×n}). Now we consider the linear example again, where f(ξ; U, W, V) = V′^⊤ W U ξ with V′ = W′^⊤ V. If we freeze both V and W′ during fine-tuning and let d_out = 1, then this is equivalent to the linear example in Appendix C.6, with W′^⊤_0 V_0 playing the role of the random V′_0 there (its coordinates are Θ(1/√n)).

D SUBSPACE-BASED FINE-TUNING METHODS

We start by restating the Johnson–Lindenstrauss lemma, which shows that inner products are approximately preserved under random projection.

Lemma D.1 (Johnson–Lindenstrauss). Let u, v ∈ R^d be such that ∥u∥ ≤ 1 and ∥v∥ ≤ 1. Choose k = 20 log N / ϵ², where N is the number of datapoints. Let h(x) = (1/√k) Ax, where A ∈ R^{k×d} with each entry sampled i.i.d. from N(0, 1) or U(−1, 1). Then, with high probability, |⟨h(u), h(v)⟩ − ⟨u, v⟩| ≤ ϵ.

Lemma D.2 (Norm preservation (Johnson–Lindenstrauss)). Let x ∈ R^n and assume the entries of A ∈ R^{k×n} are sampled i.i.d. from N(0, 1). Then Pr[∥(1/√k)Ax∥² ≥ (1 + ϵ)∥x∥²] ≤ exp(−(ϵ² − ϵ³)k/4). The other side of the double-sided bound can be derived analogously.

We can now analyze LoRA for a simple fully connected layer.

Lemma D.4. LoRA FT with SGD yields a coordinate-wise scaled kernel K^(SGD)_LoRA with K^(SGD)_LoRA(i, j) = ⟨dh(i), dh(j)⟩⟨Ax_i, Ax_j⟩, whereas full FT with SGD yields the kernel K^(SGD) = dH dH^⊤ ⊙ (XX^⊤), where dH ∈ R^{N×m} has dh(x_i) in the ith row and X ∈ R^{N×d} has x_i in the ith row.

Proof. We start by noting the well-known fact that dW = dh ⊗ x, where dh is the gradient with respect to h and ⊗ is the outer product. Thus, K^(SGD) = dH dH^⊤ ⊙ (XX^⊤). In the LoRA setting, dA = 0 and dB = dh ⊗ Ax. Because we are in the kernel regime, B = 0, and thus dA = 0, throughout training. So K^(SGD)_LoRA(i, j) = ⟨dB(i), dB(j)⟩ = ⟨dh(i), dh(j)⟩⟨Ax_i, Ax_j⟩, where dB(a) denotes the gradient to B on example a. Analogous reasoning yields K^(SGD)(i, j) = ⟨dh(i), dh(j)⟩⟨x_i, x_j⟩.

Theorem D.5 (K^(SGD)_LoRA is likely not far from K^(SGD)). Let K^(SGD)_LoRA ∈ R^{N×N} and K^(SGD) ∈ R^{N×N} be defined as in Lemma D.4. Additionally, assume that |⟨x(ξ), x(ξ′)⟩| ≥ c∥x(ξ)∥∥x(ξ′)∥ for some c > 0 for all pairs ξ, ξ′ in the downstream dataset (i.e., no two inputs are orthogonal). Then, for any i, j ∈ [N],

Pr[ |K^(SGD)_LoRA(i, j) − K^(SGD)(i, j)| / |K^(SGD)(i, j)| ≥ ϵ/c ] ≤ exp(−(ϵ² − ϵ³)k/4).

Proof. Note that

(K^(SGD)_LoRA(i, j) − K^(SGD)(i, j)) / K^(SGD)(i, j) = ⟨dh(i), dh(j)⟩(⟨Ax_i, Ax_j⟩ − ⟨x_i, x_j⟩) / (⟨dh(i), dh(j)⟩⟨x_i, x_j⟩) = (⟨Ax_i, Ax_j⟩ − ⟨x_i, x_j⟩) / ⟨x_i, x_j⟩.

The rest of the proof follows from Lemma D.3.

The statement for IntrinsicDimension FT can be derived by applying Johnson–Lindenstrauss directly to the gradient vectors.
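The theorem can be illustrated numerically. The sketch below (our own illustration; the dataset size, dimensions, Gaussian inputs, and the 1/√k JL scaling folded into A are assumptions chosen for demonstration) builds the full-FT kernel K^(SGD) = dH dH^⊤ ⊙ (XX^⊤) and its LoRA counterpart with a frozen random A, using the JL dimension k = 20 log N / ϵ² from Lemma D.1, and measures how far apart the two kernels are entrywise.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d, m = 40, 500, 64             # examples, input dim, output-gradient dim
eps = 0.25
k = int(20 * np.log(N) / eps**2)  # JL dimension, as in Lemma D.1

# Hypothetical unit-scale inputs x_i and output-side gradients dh(x_i).
X = rng.normal(0, 1, (N, d)) / np.sqrt(d)
dH = rng.normal(0, 1, (N, m)) / np.sqrt(m)

# Full fine-tuning kernel: K = (dH dH^T) elementwise-times (X X^T).
K_full = (dH @ dH.T) * (X @ X.T)

# LoRA kernel: x_i is replaced by the JL projection (1/sqrt(k)) A x_i.
A = rng.normal(0, 1, (k, d))
XA = X @ A.T / np.sqrt(k)
K_lora = (dH @ dH.T) * (XA @ XA.T)

rel_err = np.abs(K_lora - K_full).max() / np.abs(K_full).max()
print(k, rel_err)                 # deviation shrinks as k grows
```

With k on the order of a thousand, the worst-case entrywise deviation remains a small fraction of the kernel's largest entry, consistent with the exp(−(ϵ² − ϵ³)k/4) tail.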

