APPLYING SECOND-ORDER OPTIMIZATION TO DEEP TRANSFORMERS WITH PARAMETER-EFFICIENT TUNING

Abstract

Despite their theoretical superiority in convergence rate, second-order optimizers are generally not among the top choices for training large-scale neural networks due to their high computation and memory costs. Nevertheless, recent progress in parameter-efficient tuning has introduced a new paradigm in which large-scale pre-trained models (PTMs) can be adapted to specific tasks by optimizing only a tiny proportion of parameters, which might hopefully change the game. We associate this new paradigm with the computational tractability of second-order optimizers and succeed in applying them to large PTMs ranging from hundreds of millions to billions of parameters. Beyond verifying their tractability, we further investigate the factors that influence the stability of the optimization process and accordingly propose a Newton-step clipping approach in which we clip the update tensors rather than the gradients. This approach stabilizes convergence by gating the magnitude of Newton steps along the optimization trajectories through the rugged landscapes of deep transformers. We conduct extensive experiments across different downstream tasks, demonstrating that, when equipped with Newton-step clipping, second-order optimizers, especially Kronecker-factored approximate curvature (K-FAC), can attain results comparable and even superior to state-of-the-art baselines implemented with AdamW, with faster convergence. Furthermore, we scale the model up to 3 billion parameters and validate the tractability and effectiveness of our method. This work is not only the first successful application of second-order optimization to such enormous models but also paves the road towards the design and analysis of second-order optimizers for the downstream adaptation of large-scale PTMs.

1. INTRODUCTION

Pre-trained models (PTMs) (Bommasani et al., 2021; Han et al., 2021) based on deep transformers (Vaswani et al., 2017; Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020) yield remarkable performance on a wide range of tasks thanks to the tremendous capacity brought by their numerous parameters. A prevalent paradigm is to first pre-train such models on large-scale corpora in a self-supervised manner and then adapt them to specific datasets (typically with supervision). Such adaptations are usually implemented with first-order gradient-based optimization. However, recent applications have witnessed an approaching bottleneck for first-order training: it is becoming neither faster nor easier to attain higher scores (Pascanu et al., 2013; Goodfellow et al., 2016). On the other hand, second-order optimizers, which enjoy better convergence properties, would in theory be an ideal alternative (Nocedal & Wright, 1999; Boyd et al., 2004). They are, nevertheless, almost unexplored for the downstream adaptation of PTMs in practice. This is because they require quadratic storage and cubic computation time for each update, which is especially prohibitive given the enormous model scale, even though many simplified counterparts have been devised (Byrd et al., 1995; Martens & Grosse, 2015; Botev et al., 2017; Anil et al., 2020; Tang et al., 2021). Fortunately, recent advances in PTMs have brought a twist to the situation. Studies show that full-parameter optimization may not be necessary for task-specific adaptations, as updating a small portion (0.05%∼1%) of the parameters can achieve non-trivial performance on many datasets (Houlsby et al., 2019; Li & Liang, 2021; Lester et al., 2021; Hu et al., 2021). With the new paradigm largely shrinking the trainable parameter space, we find in experiments that second-order optimization is now tractable on large-scale PTMs with up to billions of parameters.
This successful implementation signals that the design and analysis of second-order optimization can start marching from relatively toy or medium-sized models to pre-trained deep transformers that are formidably large in scale. Further triggered is a sequence of interesting research questions, including: How tractable are second-order optimizers on deep transformers? Are they capable of converging faster and more steadily and yielding better results? If not, what auxiliary techniques are needed? And how does the ratio of trainable parameters influence the relative performance of second-order optimization? In this application-oriented paper, we answer all the aforementioned questions through theoretical justification and experimental verification. Taking K-FAC as the major example of a second-order optimizer, we first experimentally verify that its training time and memory cost on extremely large PTMs are entirely affordable. In addition, we point out that clipping the gradients before or after Hessian (or Fisher) preconditioning makes a non-negligible difference. We accordingly propose a Newton-step clipping strategy that is indispensable for second-order training because of its superior stabilizing effect. We then present comprehensive results to illustrate that with the assistance of our Newton-step clipping strategy, second-order optimization outperforms baseline first-order optimizers as well as its non-clipping and traditional gradient-clipping counterparts in terms of both convergence speed and final test scores. Moreover, we scale the model up to 3 billion parameters to evaluate the tractability and effectiveness. Observations on the relation between the tunable ratio and convergence speed are also presented.

2. SECOND-ORDER OPTIMIZERS FOR LARGE-SCALE TRAINING

Following the introductory part, we recap the essential background of second-order optimization. While first-order optimizers such as stochastic gradient descent (SGD) and Adam (Kingma & Ba, 2014) have been more than ubiquitous in deep learning, second-order optimizers remain relatively under-explored in this field. Unlike their first-order counterparts, which take only first-order derivatives into account, second-order optimizers will in addition incorporate the loss function's second-order features, or, in other words, the curvature information.

2.1. NEWTON'S METHOD AND ITS VARIANTS

Newton's method and its variants are perhaps the most typical second-order optimization techniques. To be specific, in Newton's method, we approximate the loss function with its Taylor expansion up to order 2: $L(\theta + \delta\theta) \approx L(\theta) + \nabla_\theta L(\theta)^\top \delta\theta + \frac{1}{2}\delta\theta^\top \nabla^2_\theta L(\theta)\,\delta\theta \triangleq L(\theta) + g(\theta)^\top \delta\theta + \frac{1}{2}\delta\theta^\top H(\theta)\,\delta\theta$ (1), in which $g(\theta)$ and $H(\theta)$ are the gradient and Hessian matrix of $L$ at $\theta$, respectively. We then hope to move in the direction that minimizes the quadratic function on the right-hand side of Equation 1. This condition implies that the standard Newton's method should proceed as $\theta_{t+1} \leftarrow \theta_t - \eta_t H(\theta_t)^{-1} g(\theta_t)$, where $\eta_t$ is the learning rate for the $t$-th step. Here the Hessian $H$ is assumed to be invertible. The quantity $u = H^{-1}g$ is usually called a Newton step. Despite the theory that fewer steps are required for convergence, the original Newton's method has long been criticized for its high computation and storage costs in calculating the Hessian matrix, inverting it, and performing other additional matrix manipulations (Nocedal & Wright, 1999; Boyd et al., 2004). To remedy this deficiency, numerous improvements to Newton's method have been proposed over the past decades. These include methods that reduce the cost by approximating the Hessian matrix (Levenberg, 1944; Marquardt, 1963; Botev et al., 2017), and quasi-Newton optimizers such as BFGS (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970) and its limited-memory version L-BFGS (Byrd et al., 1995), which approximate the inverse of the Hessian matrix directly. These methods can partly alleviate the intractability of the full Newton's method on deep neural networks.
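As a concrete illustration of the update rule above, the following sketch applies a single Newton step to a toy quadratic loss (the matrices `A`, `b` and the starting point are illustrative, not from the paper):

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 theta^T A theta - b^T theta,
# so g(theta) = A theta - b and H(theta) = A for all theta.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive-definite Hessian
b = np.array([1.0, 1.0])

def grad(theta):
    return A @ theta - b                 # g(theta)

theta = np.array([5.0, -3.0])            # arbitrary starting point
newton_step = np.linalg.solve(A, grad(theta))   # u = H^{-1} g
theta = theta - 1.0 * newton_step        # eta_t = 1 (full Newton step)

# For a quadratic, one full Newton step lands exactly on the minimizer A^{-1} b.
print(np.allclose(theta, np.linalg.solve(A, b)))  # True
```

For a genuinely quadratic loss a single step suffices, which is the intuition behind the "fewer steps to converge" claim; on nonconvex deep-learning losses the step only follows the local quadratic model.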

2.2. NATURAL GRADIENT DESCENT AND K-FAC

In a common supervised learning setting, a class of parametric conditional distributions $\{p(y \mid x, \theta) \mid \theta \in \Theta\}$ is assigned to the model to fit the underlying distribution $q(y \mid x)$ of the observed data. The Kullback-Leibler divergence (Kullback & Leibler, 1951) $L(\theta) = D_{KL}[q(x, y) \,\|\, p(x, y \mid \theta)]$ between the two joint distributions, or, up to an additive constant, the negative log-likelihood, is usually selected as the loss function to be minimized. Natural gradient descent suggests an update direction satisfying $u(\theta) = \lim_{\varepsilon \to 0^+} \frac{1}{\varepsilon} \operatorname{arg\,min}_{\delta\theta:\, D_{KL}[p(x,y\mid\theta)\,\|\,p(x,y\mid\theta+\delta\theta)] \le \varepsilon^2} L(\theta + \delta\theta)$ (3) $= -\kappa F(\theta)^{-1} \nabla L(\theta)$ (4), where $\kappa$ is a positive constant and $F(\theta) = \mathbb{E}_{p(x,y\mid\theta)}[\nabla \log p(x, y \mid \theta)\, \nabla \log p(x, y \mid \theta)^\top]$ is known as the Fisher information matrix (FIM) of $p(x, y \mid \theta)$. Since the true FIM is generally not accessible, it is common practice to evaluate the empirical FIM (Martens, 2020) instead, that is, $F(\theta) \triangleq \mathbb{E}_{q(x,y)}[\nabla_\theta \log p(x, y \mid \theta)\, \nabla_\theta \log p(x, y \mid \theta)^\top]$, in which $q$ is the empirical distribution of the observed samples. Interestingly, when the current distribution $p(\cdot \mid \theta)$ and the target distribution $q$ are close to each other, we have $\nabla^2_\theta L(\theta) \approx F(\theta)$. This relation sheds light on the connection between Newton's method and natural gradient descent, in the sense that natural gradient descent can be conceived as a special Newton's method with an approximate Hessian matrix. For this reason, we do not distinguish natural gradient descent from Newton's method in the rest of the paper and also call the update in Equation 4 a Newton step or Newton update. Further discussion on the connections and distinctions among the FIM, the empirical FIM, and the Hessian matrix is provided in Kunstner et al. (2019). Vanilla natural gradient descent suffers from the same inefficiency as the full Newton's method.
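The empirical FIM and the resulting natural-gradient step can be sketched in a few lines (shapes, the damping constant, and all variable names are illustrative, not from the paper):

```python
import numpy as np

# Sketch of the empirical FIM and one natural-gradient step.
# Per-sample log-likelihood gradients are stacked row-wise; the empirical
# FIM is their second moment, lightly damped so that it is invertible.
rng = np.random.default_rng(2)
n, d = 64, 5
per_sample_grads = rng.standard_normal((n, d))   # rows: grad log p per sample
F = per_sample_grads.T @ per_sample_grads / n    # empirical FIM
F += 1e-3 * np.eye(d)                            # damping (assumed, common trick)
g = per_sample_grads.mean(axis=0)                # mini-batch gradient
u = np.linalg.solve(F, g)                        # natural-gradient ("Newton") step
theta = np.zeros(d)
theta -= 0.1 * u                                 # one preconditioned update
print(u.shape)
```

Note that materializing and inverting $F$ costs $O(N^2)$ memory and $O(N^3)$ time in the number of parameters $N$, which is exactly the intractability that K-FAC addresses next.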
Compared to the numerous improvements of Newton's method, few improvements to natural gradient descent have been devised, among which is the enlightening work on Kronecker-factored approximate curvature (K-FAC) (Martens & Grosse, 2015). K-FAC specializes in optimizing weights that appear only once in a linear mapping, including the weights of fully-connected linear projections and convolutional operations (Grosse & Martens, 2016; Martens et al., 2018; Ba et al., 2016). The core idea of K-FAC is the approximate evaluation of the empirical FIM $F(\theta)$ by taking advantage of the properties of Kronecker products. To be specific, it approximates the expectation of a Kronecker product with the Kronecker product of expectations. In this way, the computation and storage burden is greatly relieved without oversimplifying the second-order structure. K-FAC has proved to be a promising optimizer across various deep learning models and tasks (Martens & Grosse, 2015; Grosse & Martens, 2016; Martens et al., 2018; Wu et al., 2017; Osawa et al., 2019), and it is the major second-order optimizer we implement and analyze in this paper. More details about natural gradient descent and K-FAC can be found in Appendix A.

3. PARAMETER-EFFICIENT TUNING ENABLES SECOND-ORDER TRAINING

In the foregoing section, we introduced typical second-order optimizers together with incremental methods that ameliorate their efficiency in large-scale settings. In spite of that amelioration, second-order optimization still generally requires storage on the order of $N^2$ and computation on the order of $N^3$ for every full parameter block with $N$ parameters, and is therefore intractable on extremely large models, especially foundation language models based on deep transformers. For example, it is hardly possible to fine-tune T5-XL with either L-BFGS or K-FAC directly. This sobering fact tells us that, beyond making second-order optimizers lighter and faster, modifications should also be made to the way we train. While hardly any further improvements can be made to the optimizers themselves, cutting down the volume of trainable parameters might be another way out. This straightforward thought coincides with a new paradigm in the spotlight: parameter-efficient tuning, including adapters, LoRA, and other useful methods, which we explain in detail in the following passage.

3.1. BACKGROUND OF PARAMETER-EFFICIENT TUNING

Parameter-efficient tuning aims to adapt PTMs, especially large ones, by optimizing a tiny proportion of parameters. This intuition can also be found in previous works under other scenarios (Tajbakhsh et al., 2016; Guo et al., 2020; 2019). The most practical advantage of such a paradigm is that we do not need to update all the parameters and produce separate fine-tuned instances for every downstream task. By training such lightweight parameters, we are able to flesh out the abstract ability of large-scale models to solve specific problems. Houlsby et al. (2019) inject small neural modules into each layer of the Transformer model and only optimize these adapters during training. Subsequently, a series of variants of adapters have emerged (Pfeiffer et al., 2020; Sung et al., 2021; Mahabadi et al., 2021; Sung et al., 2022). Prefix and prompt tuning (Li & Liang, 2021; Lester et al., 2021) prepend tunable parameters to the input layers. LoRA (Hu et al., 2021) injects low-rank trainable decomposition matrices into the weights and has been successfully applied to GPT-3 (Brown et al., 2020) with 175 billion parameters. Apart from introducing additional parameters, experiments show that optimizing a designated proportion of the inherent parameters produces a similar effect (Zhao et al., 2020; Zaken et al., 2021; Guo et al., 2021). This line of work implies that after massive pre-training, the adaptation of large-scale PTMs may be a "simple" process and is worth further exploration (He et al., 2022; Ding et al., 2022). Adapter. The adapter method (Houlsby et al., 2019) inserts lightweight neural modules (adapter layers) into each layer of the Transformer model. Given the input hidden state $h_{in} \in \mathbb{R}^d$, each adapter layer comprises a down-projection $D \in \mathbb{R}^{d \times r}$, a non-linear activation function $\sigma(\cdot)$, and an up-projection $U \in \mathbb{R}^{r \times d}$. There is also a residual connection from the input to the output of an adapter layer: $h_{out} \leftarrow \sigma(h_{in} D) U + h_{in}$. The position of the adapter layers is optional.
Houlsby et al. (2019) add them after the multi-head attention layer and the feed-forward layer of each Transformer block. There are also variants that add adapter modules at other positions, such as after the LayerNorm layer (Pfeiffer et al., 2020). Depending on the choice of the bottleneck dimension $r$, the tunable parameters of the adapter approach account for roughly 0.5%∼8% of the total number of parameters. LoRA. The LoRA method (Hu et al., 2021) assumes that the change of each weight matrix is intrinsically low-rank, and thus injects tunable low-rank decomposition matrices $D \in \mathbb{R}^{d \times r}$, $U \in \mathbb{R}^{r \times d}$ to estimate the weight change $\Delta W = DU$. Hence, the output hidden state is $h_{out} \leftarrow h_{in}(W + DU)$, (8) where $W$ is the initial weight matrix. Depending on the intrinsic rank $r$, the optimizable parameters of LoRA normally amount to less than 1% of the total parameters.
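A minimal numpy sketch of the LoRA forward pass above (dimensions, initialization scales, and names are illustrative; real implementations keep `D`, `U` as trainable framework tensors while `W` stays frozen):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 16, 4                               # hidden size, intrinsic rank
W = rng.standard_normal((d, d))            # frozen pre-trained weight
D = rng.standard_normal((d, r)) * 0.01     # down-projection (trainable)
U = np.zeros((r, d))                       # up-projection, zero-init so DU = 0

def lora_forward(h_in):
    # h_out = h_in (W + DU), computed without materializing W + DU
    return h_in @ W + (h_in @ D) @ U

h = rng.standard_normal((2, d))
# With U zero-initialized, the LoRA branch is inactive at the start,
# so the output equals that of the frozen layer.
print(np.allclose(lora_forward(h), h @ W))  # True
```

Only the $2dr$ entries of `D` and `U` are trained, which is the shrunken parameter space that makes second-order preconditioning affordable.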

3.2. PARAMETER-EFFICIENT TUNING ENABLES SECOND-ORDER TRAINING

Having introduced the background of second-order optimization and parameter-efficient tuning and clarified our motivation for integrating them, we move on to verify through experiments that parameter-efficient tuning can indeed enable second-order optimization on large-scale PTMs. As shown in Table 1, PTMs with different architectures and scales can be tractably adapted with a single NVIDIA A100 GPU. It is worth noting that the excess memory consumption resulting from second-order optimization is relatively low compared to the overall memory consumption. As the model scales up, the absolute number of parameters to be fine-tuned increases, and the memory footprint of the second-order optimization grows considerably. For example, on T5-XL, second-order optimization takes up 13.06 GB more GPU memory than first-order optimization. However, this is still acceptable in practice. We report an example of the GPU memory variation in Figure 2, which has a jagged shape over the training time. At the low points, all the data is sent to the CPU for processing, leaving only the parameters of the model itself and the optimizer states on the GPU. The gap between the first- and second-order optimizers' memory usage at the low points reflects the difference in the volume of optimizer states. We also conduct an analysis of time efficiency in Appendix C.3.

4. NEWTON-STEP CLIPPING STABILIZES SECOND-ORDER TRAINING

The combination of second-order optimization and parameter-efficient tuning is not sufficient for smooth training (illustrated in Figure 4 in Section 5.2). We reveal both in theory and by experiments that one additional auxiliary technique, which we call Newton-step clipping, has to be implemented to achieve satisfactory results. In this section, we first introduce a more traditional stabilizing strategy named gradient clipping. We then clarify the divergence between gradient clipping and Newton-step clipping and justify our initiative of devising and applying this new approach.

4.1. GRADIENT CLIPPING

One crucial point for smoother training is to manage the step-sizes (norms of update tensors) wisely. However, unlike in toy tasks, it is inefficient to apply delicate step-size schedules, such as line search with the Wolfe (Wolfe, 1969) and Armijo-Goldstein (Armijo, 1966) conditions, to large models such as deep-transformer-based PTMs. Such inefficiency is partially ascribed to the high computational cost and, more substantially, to the nonconvex landscape that invalidates the theoretical benefits of those schedules. The approach mostly adopted in deep learning is to set up a fixed learning rate schedule for all layers at the beginning of a training process. However, this solution may not be flexible enough and may suffer from the gradient explosion problem, since the optimal step-sizes for different layers can typically differ. Two strategies, adaptive gradients and gradient clipping, have been devised in previous works to tackle this issue. AdaGrad (Duchi et al., 2011), RMSProp (Hinton et al., 2012), Adam (Kingma & Ba, 2014), and other optimizers adopting the adaptive gradient strategy divide the first-order term by the square root of the second-order term to ensure that the magnitude of each layer's step-size automatically remains in a rather stable range largely independent of the gradient norm. Compared to adaptive gradient methods, gradient clipping (Pascanu et al., 2013; Goodfellow et al., 2016) is a more straightforward approach to counteract the influence of extremely rugged landscapes. In this approach, a gradient vector is clipped whenever its norm exceeds a fixed threshold $\tau$, that is, $g_{clp} \leftarrow \min\left(1, \frac{\tau}{\|g\|}\right) g$. This approach is designed mostly by intuition to conquer the high-curvature landscape in deep learning, and is especially indispensable for some NLP-specialized models (Pascanu et al., 2013).
Apart from the stabilizing effect, more of its advantages were subsequently explored and proved, such as faster convergence (Zhang et al., 2019) and the prevention of convergence to stationary points (Chen et al., 2020).
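The clipping rule above can be sketched in a few lines (a generic norm-based clip, not tied to any particular library):

```python
import numpy as np

def clip_by_norm(g, tau):
    # Rescale g to norm tau only when ||g|| exceeds tau; otherwise pass through.
    norm = np.linalg.norm(g)
    return min(1.0, tau / norm) * g if norm > 0 else g

g = np.array([3.0, 4.0])                  # ||g|| = 5
clipped = clip_by_norm(g, tau=1.0)        # rescaled down to the threshold
print(np.linalg.norm(clipped))            # ~1.0
small = clip_by_norm(np.array([0.1, 0.1]), tau=1.0)  # below threshold: unchanged
```

Framework routines such as PyTorch's `clip_grad_norm_` implement the same rule over all parameter gradients jointly; the sketch shows the single-tensor case.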

4.2. NEWTON-STEP CLIPPING

In fact, the term "gradient clipping" can be misleading. We claim that the underlying purpose of performing "gradient clipping" is to restrict the scale of the update tensor, NOT the gradient as is literally suggested. It is noteworthy that gradient clipping is usually applied to first-order methods, especially SGD and its variants. For these optimizers, the step-size is proportional to the gradient norm, and hence clipping the gradient is equivalent to restricting the scale of the updates. Nevertheless, for second-order optimizers, this rule may not hold. For instance, in natural gradient descent, the update is chosen to be $u = \mathbb{E}[gg^\top]^{-1}\,\mathbb{E}[g]$, so its magnitude can scale as $\|u\| = O(1/\|\mathbb{E}[g]\|)$. This relation indicates that K-FAC can take an arbitrarily large step at a low-curvature point; therefore, it is problematic to associate a large gradient norm with a large step size in natural gradient descent. A similar problem persists among general second-order optimizers due to the preconditioning of gradients by either Hessian matrices or FIMs. In addition, though traditional gradient clipping (that is, clipping applied before preconditioning) indeed shrinks the updates of a second-order optimizer to a certain extent, the preconditioned updates remain poorly controlled, since the eigenvalues of Hessian matrices or FIMs can be arbitrarily small.
For the above reasons, gradient clipping, though intuitively correct, may fail to improve stability in second-order optimization. To address this issue, we propose Newton-step clipping, in which we clip the Newton update $u = H^{-1}g$ (or $u = F^{-1}g$) rather than the original gradient $g$, that is, $u_{clp} \leftarrow \min\left(1, \frac{\tau}{\|u\|}\right) u$, where $\tau$, standing for the maximum update norm, is a hyperparameter. This approach previously appeared as an exclusive trick for K-FAC training (Ba et al., 2016), but it can hopefully work for a wider range of second-order optimizers. We examine this approach on K-FAC, for which we find that $\tau = 0.1$ generally works well.
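A minimal sketch of Newton-step clipping (toy two-dimensional Hessian with one low-curvature direction; values are illustrative, with the $\tau = 0.1$ setting suggested above):

```python
import numpy as np

def newton_step_clipped(H, g, tau=0.1):
    # Precondition first, then clip the resulting update, not the gradient.
    u = np.linalg.solve(H, g)             # Newton step u = H^{-1} g
    norm = np.linalg.norm(u)
    return min(1.0, tau / norm) * u if norm > 0 else u

# A low-curvature direction blows up the raw Newton step ...
H = np.diag([1e-4, 1.0])
g = np.array([0.5, 0.5])
raw = np.linalg.solve(H, g)               # first component is 5000: far too large
# ... while the clipped step is gated to the maximum update norm tau.
clipped = newton_step_clipped(H, g, tau=0.1)
print(np.linalg.norm(raw), np.linalg.norm(clipped))
```

Note that the clip preserves the Newton direction and only gates its magnitude, which is exactly the stabilizing behavior argued for above.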

4.3. GENERALIZATIONS: PRE-CLIPPING AND POST-CLIPPING

We can generalize our approach as follows. In gradient-based optimization, the gradient plays a major role. However, the weights (or parameters) are seldom updated directly by their corresponding gradients; most optimizers further transform the gradient into a final update tensor. Newton's method is a case in point: it left-multiplies the gradient by the inverse of the Hessian to yield the final Newton step. A divergence then emerges over when the appropriate time for clipping is. The vanilla gradient clipping strategy clips the gradients as soon as gradient backpropagation is completed. This sounds reasonable for SGD, feasible for Adam, but problematic for K-FAC. A different measure is not to clip anything until the final updates have been transformed from the original gradients, as in our Newton-step clipping. According to their time of clipping, we name the two aforementioned approaches pre-clipping and post-clipping for convenience. The procedures of pre-clipping and post-clipping are summarized in Algorithm 1. In short, we have explained above that the choices of when to clip and what to clip can lead to entirely different training results. Although second-order optimizers favor later clipping upon the Newton step, for most other optimizers that have been proposed, the choice between pre-clipping and post-clipping remains a mystery to be explored in the future.
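The contrast between the two procedures can be sketched as follows (toy diagonal Hessian and illustrative values; the real Algorithm 1 operates per layer on K-FAC factors):

```python
import numpy as np

def clip(v, tau):
    n = np.linalg.norm(v)
    return min(1.0, tau / n) * v if n > 0 else v

# Pre-clipping: clip the gradient, then precondition (vanilla gradient clipping).
def pre_clip_update(H, g, tau):
    return np.linalg.solve(H, clip(g, tau))

# Post-clipping: precondition, then clip the final update (Newton-step clipping).
def post_clip_update(H, g, tau):
    return clip(np.linalg.solve(H, g), tau)

H = np.diag([1e-4, 1.0])                  # one low-curvature direction
g = np.array([10.0, 10.0])
tau = 0.1
# Pre-clipping leaves the update unbounded at low-curvature points,
# while post-clipping bounds the update norm by tau regardless of curvature.
print(np.linalg.norm(pre_clip_update(H, g, tau)) > tau)    # True
print(np.linalg.norm(post_clip_update(H, g, tau)) <= tau)  # True
```

The example makes the point of Section 4.2 concrete: pre-clipping controls $\|g\|$ but not $\|H^{-1}g\|$, whereas post-clipping directly controls the quantity that actually moves the parameters.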

5. EVALUATION

In this section, we evaluate and analyze our approach against corresponding baselines across widely used natural language understanding (NLU) tasks and different backbone PTMs.

5.1. EXPERIMENTAL SETTINGS

Datasets. The benchmarks evaluated in our experiments include SST-2 (Socher et al., 2013) for sentiment analysis, MRPC (Dolan & Brockett, 2005) for paraphrase detection, CoLA (Warstadt et al., 2019) for linguistic acceptability, RTE (Dagan et al., 2005) for inference, QNLI (Rajpurkar et al., 2016) for inference, STS-B (Cer et al., 2017) for textual similarity, Choice of Plausible Alternatives (COPA) (Gordon et al., 2012) for commonsense causal reasoning, CommitmentBank (CB) (Marneffe et al., 2019) for inference, and the Winograd Schema Challenge (WSC) (Levesque, 2011) for commonsense reasoning. Setup. We adopt RoBERTa-large (Liu et al., 2019) with 350 million parameters for QNLI, SST-2, RTE, CoLA, MRPC, and STS-B, and T5-3B (Raffel et al., 2020) with 3 billion parameters for COPA, CB, RTE, and WSC. The project and the second-order optimizers are implemented in PyTorch (Paszke et al., 2019), and the models are loaded from the Huggingface Transformers (Wolf et al., 2019) library. We use NVIDIA Tesla A100 GPUs with 40 GB memory for all experiments. Most datasets use accuracy as the evaluation metric, except for MRPC, CoLA, and STS-B: MRPC uses the F1 score, CoLA uses the Matthews correlation coefficient (Matthews, 1975), and Spearman and Pearson correlation scores are reported on STS-B. More experimental details are reported in Appendix B.

5.2. EFFECT OF NEWTON-STEP CLIPPING

We first directly assess the effect of the Newton-step clipping method in the second-order training of deep transformers. When straightforwardly applying K-FAC to deep transformers without any clipping or with traditional pre-clipping, the training process is exceedingly unstable. The series of experimental results shown in Figure 4 verifies that for large-scale second-order optimization, the Newton-step clipping strategy significantly stabilizes the training process and outperforms its non-clipping counterpart, while pre-clipping cannot deliver similar effectiveness. The results indicate the indispensability of appropriate clipping when applying second-order optimization in the parameter-efficient paradigm, which is consistent with the discussion in Section 4.2.

5.3. EXPERIMENTAL RESULTS

Experimental results on natural language understanding are reported in Table 2. We mainly compare our approach with AdamW (Loshchilov & Hutter, 2017), which is broadly considered the most powerful optimization method for deep transformers. Most of the parameter-efficient adaptations achieve on-par performance with full-parameter fine-tuning. With both the adapter and LoRA approaches, second-order optimization with our Newton-step clipping considerably outperforms the AdamW counterparts. Specifically, in direct comparisons with the adapter method, the average performance of Newton-step clipping across six tasks exceeds AdamW by 1.06%, and with the LoRA method the absolute improvement is 1.25%. We also observe that the performance gap is mainly reflected in the RTE, CoLA, and MRPC datasets, which are generally considered more difficult natural language understanding tasks than the other three. As illustrated in Figure 5, equipped with Newton-step clipping, K-FAC demonstrates faster and more stable convergence than the first-order optimizer AdamW. Despite this effectiveness, the hyperparameter tuning of K-FAC is slightly more costly than that of AdamW, since the second-order optimizer itself is more sensitive to the learning rate and also involves more hyperparameters, such as the clipping scale. We carry out an ablation study of the hyperparameters in Appendix C.1. We will investigate adaptive techniques for Newton-step clipping in future work to better deploy second-order optimization for PTM adaptations.

5.4. IMPACT OF THE NUMBER OF TRAINABLE PARAMETERS

To explore the impact of the number of trainable parameters on our method, we train RoBERTa-large + LoRA with different LoRA intrinsic ranks (i.e., the bottleneck dimension r of D, U in equation 8) on the MRPC dataset. When linearly choosing r in {8, 12, 16, 20, 24, 28, 32}, the number of trainable parameters becomes {0.8M, 1.1M, 1.5M, 1.9M, 2.3M, 2.6M, 3.0M, 3.4M}, respectively (we conduct 3 runs with different random seeds for each LoRA rank). As illustrated in Figure 6 and Figure 7, the convergence speed is observably inversely related to the number of trainable parameters, i.e., the smaller the number of trainable parameters, the faster the convergence. Meanwhile, we observe that the test performance does not change significantly as the number of trainable parameters changes. For a PTM and a specific task, the adaptation process is "simple" in that it can be accomplished by optimizing very few parameters, but it is difficult to make a leap in adaptation performance by changing the number of trainable parameters. In other words, the pre-training, the model structure, and the scale of the model itself seem to determine the upper limit of practical adaptations. The fact that using fewer parameters leads to faster convergence is also a testament to the effectiveness of our Newton-step clipping approach. We scale our experimentation to T5-XL, a sequence-to-sequence model with 3 billion parameters. The datasets evaluated in this part are relatively small due to the high training cost. As shown in Table 3, K-FAC with the proposed Newton-step clipping achieves performance comparable to AdamW. For some datasets like RTE and WSC, our method even outperforms its first-order counterpart by a considerable margin. We also empirically find that larger models tend to favor larger maximum norms for Newton-step clipping due to their vast capacity.
The success of K-FAC on T5 XL further demonstrates its tractability under the parameter-efficient tuning paradigm and its potential for efficiently steering large pre-trained models.
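As a cross-check on the trainable-parameter counts reported in the rank ablation above, the size of a LoRA branch can be computed directly. The sketch below assumes, for illustration, that LoRA with rank r is applied to two weight matrices per layer across the 24 layers of RoBERTa large with hidden size 1024; under these assumptions, r = 8 yields 786,432 ≈ 0.8M parameters, matching the first entry of the list.

```python
def lora_param_count(rank, hidden=1024, layers=24, matrices_per_layer=2):
    """Trainable parameters of LoRA branches h = W x + U (D x):
    each adapted hidden x hidden matrix adds a down-projection D
    (rank x hidden) and an up-projection U (hidden x rank)."""
    return layers * matrices_per_layer * rank * (hidden + hidden)

for r in (8, 12, 16, 20, 24, 28, 32):
    print(r, lora_param_count(r))  # r=8 -> 786432, i.e. about 0.8M
```

The linear dependence on r explains why the reported counts grow by a roughly constant increment per step of 4 in rank.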

6. DISCUSSION

This paper presents theoretical analysis and experimental results illustrating that the parameter-efficient paradigm can vivify second-order optimization on extremely large-scale PTMs with the assistance of the Newton-step clipping strategy. Although the application of second-order optimization to enormous PTMs is promising, the exploration is far from closed, in the sense that pieces of dark clouds still hang over this topic. (1) To begin with, as we observe in our experiments, second-order optimizers exhibit higher sensitivity to the choice of hyper-parameters than their first-order counterparts. Second-order optimizers also tend to introduce more hyper-parameters, and many of these newly added hyper-parameters are more obscure in mathematical meaning and more elusive in their experimental influence. It remains unclear whether there are theories and methods that can make hyper-parameter tuning for second-order optimization no longer a sort of dark art. (2) Another open question is whether the design of architecture-specific optimizers is feasible. We notice in the current work that both the adapter and LoRA methods add fully-connected feed-forward branches to the original model, which coincides with K-FAC's strength. For other parameter-efficient methods, such as the per-head modifications of attention outputs in Prefix Tuning, no similar conclusion has yet been drawn. (3) Moreover, it appears promising to study combinations of first-order and second-order optimization instead of sticking to merely one. A rule of thumb observed in our experiments is that first-order optimizers, though with slower loss descent and lower test scores, enjoy better numerical and convergence stability. A natural deduction is to run first-order steps as a warm-up period for second-order optimization.
Though seemingly trivial, designing such first- and second-order compounds is bound to be an arduous journey of detailed technical implementation.
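The warm-up idea above, running first-order steps before handing over to a second-order optimizer, can be sketched in a few lines. All names here are illustrative, not an implementation from this paper; the two steppers are stand-ins for, e.g., AdamW and K-FAC updates.

```python
# Hypothetical compound schedule: first-order steps for `warmup_steps`
# iterations, second-order steps afterwards.
class CompoundOptimizer:
    def __init__(self, first_order_step, second_order_step, warmup_steps):
        self.first = first_order_step    # callable: (params, grads) -> params
        self.second = second_order_step  # callable: (params, grads) -> params
        self.warmup_steps = warmup_steps
        self.t = 0

    def step(self, params, grads):
        # First-order steps stabilize early training; second-order steps
        # take over once the iterate reaches a better-behaved region.
        stepper = self.first if self.t < self.warmup_steps else self.second
        self.t += 1
        return stepper(params, grads)

# Toy usage on the 1-D quadratic loss L(w) = 0.5 * w**2 (gradient = w).
sgd = lambda w, g: w - 0.1 * g
newton = lambda w, g: w - g          # exact Newton step for this toy loss
opt = CompoundOptimizer(sgd, newton, warmup_steps=3)
w = 1.0
for _ in range(4):
    w = opt.step(w, w)               # the gradient of 0.5*w^2 is w
```

On this toy loss the three SGD warm-up steps shrink the iterate geometrically, and the single Newton step that follows lands exactly at the minimum.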

A MORE ABOUT NATURAL GRADIENT DESCENT AND K-FAC

A.1 CONNECTIONS BETWEEN NEWTON'S METHOD AND NATURAL GRADIENT DESCENT

The Hessian matrix of the Kullback-Leibler divergence (the loss function mentioned in Section 2.2) reads

∇²_θ L(θ) = ∇²_θ D_KL[q(x, y) ∥ p(x, y | θ)]    (12)
          = −E_{q(x,y)}[∇²_θ log p(x, y | θ)]    (13)
          ≈ E_{q(x,y)}[∇_θ log p(x, y | θ) ∇_θ log p(x, y | θ)^⊤].

When the current density p(· | θ) is close to the target density q, the last approximation holds, and we can see that ∇²_θ L ≈ F.
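The identity behind this derivation, that the Fisher matrix equals the negative expected Hessian of the log-density, can be checked numerically on a toy model. The sketch below uses a small categorical model p(k | θ) = softmax(θ); this toy model is for illustration only and is not the transformer loss used in the paper.

```python
import numpy as np

theta = np.array([0.3, -1.2, 0.7])
p = np.exp(theta) / np.exp(theta).sum()          # softmax probabilities

# For this model: grad_theta log p(k) = e_k - p, and
# hess_theta log p(k) = -(diag(p) - p p^T), the same for every class k.
eye = np.eye(3)
fisher = sum(p[k] * np.outer(eye[k] - p, eye[k] - p) for k in range(3))
neg_exp_hessian = np.diag(p) - np.outer(p, p)    # -E_p[hess log p]

# Fisher matrix E_p[grad log p grad log p^T] matches -E_p[hess log p].
assert np.allclose(fisher, neg_exp_hessian)
```

For this exponential-family toy case the two quantities agree exactly under p itself; the approximation in the derivation above enters only because the expectation is taken under q rather than p(· | θ).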

A.2 UPDATE RULES OF K-FAC

In accordance with the symbols adopted in Martens & Grosse (2015), a layer of linear mapping can be written as

s_i = W_i a_{i−1}.    (15)

Define g_i = ∂L/∂s_i; then the W_i-related sub-block of the FIM can be expressed as

F_{i,i} = E_q[vec(∇_{W_i} L) vec(∇_{W_i} L)^⊤] = E_q[(a_{i−1} ⊗ g_i)(a_{i−1} ⊗ g_i)^⊤] = E_q[a_{i−1} a_{i−1}^⊤ ⊗ g_i g_i^⊤].    (16)

The core idea of K-FAC is to approximate the expectation of the Kronecker product in equation 16 with the Kronecker product of expectations, that is,

F_{i,i} ≈ E_q[a_{i−1} a_{i−1}^⊤] ⊗ E_q[g_i g_i^⊤] ≜ A_{i−1} ⊗ G_i.

Furthermore, in light of the properties that (B ⊗ C)^{−1} = B^{−1} ⊗ C^{−1} and that (B ⊗ C) vec(V) = vec(C V B^⊤), the natural gradient update in K-FAC can be formulated in tensors as

W_i ← W_i − η · G_i^{−1} (∇_{W_i} L) A_{i−1}^{−1}.

B ADDITIONAL EXPERIMENTAL DETAILS

Hyper-parameters. For the experiments in Table 2, we perform a hyper-parameter grid search for both the AdamW and K-FAC optimizers to select better-performing models. For AdamW, the search space for the learning rate is {0.01, 0.001, 0.0001, 0.00003} and for the maximum gradient-norm clipping scale is {0.1, 1.0, 10}. For K-FAC, in addition to the learning rate and the Newton-step clipping scale (τ), we also set a damping factor to stabilize the matrix inversion (Levenberg, 1944; Marquardt, 1963). The search spaces for the learning rate, Newton-step clipping scale, and damping factor are {0.01, 0.05, 0.1, 0.5}, {0.1, 1.0, 1.5, 2.0}, and {1e-2, 1e-3, 1e-4, 1e-5, 1e-6}, respectively. The search space for K-FAC is larger than that for AdamW for two reasons. (1) K-FAC has more hyper-parameters than AdamW, and we find that the damping factor has a considerable impact on training. (2) Using AdamW to optimize deep transformers is extensively practiced in the community and in our previous empirical studies, so we choose a set of reasonable learning rates and simply use commonly adopted values for other hyper-parameters such as weight decay.
However, there is little empirical evidence to suggest a reasonable search space for K-FAC on large language models. The second-order update interval for K-FAC is set to 500 in our experiments, which simultaneously yields promising performance and time efficiency. The batch size is 128 for Table 2 and Figure 5, and the training epochs and steps for each dataset are shown in Table 4. The reported results in Table 2 and Table 3 use the hyper-parameter settings in Table 5 and Table 6. For the results in Figure 4, we adopt K-FAC as the optimizer, set the learning rate to 0.01, and set the damping factor to 1e-4 for both experiments. For all experiments, the weight decay is set to 1e-4, and a linear scheduler with warm-up is adopted.

When studying the effect of one hyper-parameter, we fix the other two hyper-parameters. We run an ablation study on the STS-B and RTE datasets, illustrating the training loss in Figure 8 and the corresponding test-set performance in Table 7. It can be seen that with a small learning rate and a small maximum norm value, the model may fail to converge within a reasonable number of training steps. This indicates that, generally, the second-order optimizer is compatible with relatively large update values, which is in line with its theoretical precision in loss-landscape estimation. In terms of performance, results are relatively stable on STS-B but show larger variation on the RTE dataset, probably because RTE is much smaller in scale and the performance is thus more vulnerable to small perturbations in model parameters.

We also investigate how the proposed clipping technique affects the training procedure. Figure 10 shows how many times Newton-step clipping actually occurs over a whole training run. Clipping happens in almost every step in the early training stage, testifying to the importance of clipping.
After a considerable number of training iterations, some steps obtain a steady update magnitude without Newton-step clipping. This phenomenon persists across datasets. It should also be noted that clipping is not equivalent to shrinking the learning rate. The effect of clipping depends on the scale of the update norm, so the rescaling is adjusted dynamically according to the current update. Learning-rate adjustment, in contrast, is applied in a pre-defined manner, independent of the current update. Designing a suitable learning-rate schedule for a second-order optimizer would require knowledge of how the update norm changes, and is thus extremely difficult given the rugged loss landscape of deep neural networks. Figure 9 further supports this point: we reduce the learning rate for K-FAC without Newton-step clipping and find that none of the runs converges steadily, showing that simply decreasing the learning rate without the clipping technique does not achieve satisfactory training.

We also take a closer look at the training process by prolonging the iterations. Specifically, we extend the training procedure to 500 epochs on the STS-B dataset; the training loss and validation metric curves are shown in Figure 11. With both optimizers, the training loss keeps decreasing, while K-FAC reaches an even smaller loss scale and converges faster, especially in the later training stage. At the final stage of training, both optimizers reach very small training losses (on the order of 10^-4 and 10^-6 for AdamW and K-FAC, respectively). The validation metric curves, with comparable performance, also show no overfitting issues for either optimizer, which can be regarded as an advantageous characteristic of parameter-efficient tuning.
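The tensorized K-FAC update rule of Appendix A.2 can be verified numerically against the flattened natural-gradient step. The sketch below uses random symmetric positive-definite matrices as stand-ins for the K-FAC factors A and G (these are illustrative, not statistics from a trained model) and assumes the column-major vec convention, under which (B ⊗ C) vec(V) = vec(C V B^⊤).

```python
import numpy as np

rng = np.random.default_rng(0)
din, dout = 4, 3

# Random SPD stand-ins for the K-FAC factors A = E[a a^T], G = E[g g^T].
M = rng.normal(size=(din, din)); A = M @ M.T + din * np.eye(din)
N = rng.normal(size=(dout, dout)); G = N @ N.T + dout * np.eye(dout)
grad_W = rng.normal(size=(dout, din))

# Tensorized K-FAC step direction: G^{-1} (grad W) A^{-1}.
step_tensor = np.linalg.solve(G, grad_W) @ np.linalg.inv(A)

# Equivalent flattened step with F approximated as A kron G, acting on the
# column-major vec of grad W via (B kron C) vec(V) = vec(C V B^T).
F = np.kron(A, G)
step_vec = np.linalg.solve(F, grad_W.flatten(order="F"))
assert np.allclose(step_tensor.flatten(order="F"), step_vec)
```

The tensor form only ever inverts the small din x din and dout x dout factors rather than the (din*dout) x (din*dout) matrix F, which is what makes K-FAC tractable for the parameter-efficient modules studied here.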



Figure 3: An illustration of the Newton-step clipping strategy.

Figure 4: Training loss curve of K-FAC optimizer on RoBERTa large + Adapter with and without our clipping approaches. All settings and hyperparameters are the same except for clipping strategy.

Figure 5: Training loss and validation metric curves of RoBERTa large + Adapter on NLU tasks with AdamW optimizer and K-FAC optimizer equipped with our Newton-step clipping strategy.

Figure 6: Training loss curve of models with different LoRA ranks. Experiments are conducted on MRPC with RoBERTa large and batch size is 128.

Figure 9: Comparison of training loss curve between small learning rate without post clipping and large learning rate with post clipping. Experiments are conducted on MRPC dataset with RoBERTa large + Adapter and K-FAC. Weight decay is 1e-4, and linear scheduler with warm-up is applied. The solid line denotes the training process with Newton-step clipping and dotted lines denote processes without Newton-step clipping across different learning rates.

Figure 11: Training loss and validation metric curves on STS-B dataset with RoBERTa large over 500 training epochs. The hyper-parameter settings follow Table 5 and Table 6.



Algorithm 1: Generalized procedures of pre-clipping and post-clipping

Data: total iterations T, current iteration t, current learning rate η_t
while t ≤ T do
    Compute and store the gradient g_t and other required intermediate quantities h_t;
    if pre-clipping then g_t ← min(1, τ/∥g_t∥) · g_t;
    Compute other required quantities r_t;
    Transform the gradient into the final update u_t ← T(g_t, h_t), where T is a given transform;
    if post-clipping then u_t ← min(1, τ/∥u_t∥) · u_t;
    Update the parameter θ ← θ − η_t u_t.

Hence, we may expect that approximately ∥u∥
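The post-clipping branch of Algorithm 1, the Newton-step clipping used throughout the paper, amounts to a single rescaling of the update tensor. The following is a minimal sketch with illustrative names; a real implementation would operate on framework tensors rather than Python lists.

```python
import math

def newton_step_clip(update, tau):
    """Post-clipping from Algorithm 1: rescale the update tensor so its
    norm never exceeds tau, leaving small updates untouched."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, tau / norm) if norm > 0 else 1.0
    return [scale * x for x in update]

# A large Newton step is rescaled down to norm tau; a small one is unchanged.
clipped = newton_step_clip([3.0, 4.0], tau=1.0)   # norm 5 -> norm 1
passed = newton_step_clip([0.3, 0.4], tau=1.0)    # norm 0.5, untouched
```

Note that the same function implements pre-clipping when applied to g_t instead of u_t; the paper's point is that applying it after the transform T bounds the actual Newton step taken through the loss landscape.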

Results on NLU tasks. † indicates the results are from Hu et al. (2021), and a, b, and c indicate different amounts of trainable parameters. Blue and orange represent the best and second-best performances of each column. [Table body truncated in extraction; the recoverable rows: a baseline row whose label is lost ending ±0.6, 95.4±0.7, 82.5±0.9, 69.1±2.7, 91.1±0.2, 91.6±0.7, average 87.30, and the NewtonClip row with 0.8M (0.23%) trainable parameters reading 94.4±0.1, 96.2±0.3, 85.7±1.7, 70.4±1.8, 92.7±0.3, 91.9±0.3, average 88.55.]

Results on COPA, CB, WiC, and WSC datasets of T5 XL + Adapter.

The training hyper-parameters for AdamW.

Impact of different hyper-parameters on training loss with STS-B and RTE datasets. Experiments are conducted with RoBERTa large + Adapter and K-FAC + NewtonClip, and are trained with batch size of 128 for 50 epochs.

Wall clock time of AdamW and K-FAC + NewtonClip on the RTE dataset with RoBERTa large . Values in the parentheses denote the relative time over AdamW.


Datasets. All datasets are loaded with huggingface datasets (Lhoest et al., 2021). Since labels are not accessible for the test sets, we manually split part of the data for testing. For small datasets (#samples < 10K), we randomly divide the validation set into halves, using one half as the test set and the other as the validation set. For larger datasets (#samples > 10K), we randomly take 1K samples from the original training set for validation and use the rest as the training set, keeping the original validation set as the test set. All models are trained on the training set and evaluated every 200 steps on the validation set; the checkpoint with the best validation performance is kept for evaluation on the test set. When dealing with the MRPC, RTE, and STS-B datasets, some works use the best model checkpoint on the MNLI dataset for initialization to boost performance. In our empirical study, we do NOT use this strategy; we use the usual initializations for all models. In our experimentation, the main parameters of the PTMs are frozen, and we only optimize the trainable modules.

GPU Memory. The GPU memory statistics in Table 1 and Figure 2 come from the PyTorch API and NVIDIA, respectively, so there might be a small inconsistency caused by NVIDIA's extra accounting of cache memory.
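The splitting protocol above can be sketched as a small helper. This is an illustrative reconstruction under the stated thresholds, not code from the paper; the function name and list-based datasets are assumptions for the example.

```python
import random

def split_dataset(train, val, seed=0):
    """Sketch of the splitting protocol: small datasets (<10K training
    samples) halve the original validation set into new validation/test
    sets; larger ones carve 1K validation samples out of the training set
    and test on the original validation set."""
    rng = random.Random(seed)
    if len(train) < 10_000:
        val = val[:]
        rng.shuffle(val)
        half = len(val) // 2
        return train, val[:half], val[half:]          # train, val, test
    train = train[:]
    rng.shuffle(train)
    return train[1000:], train[:1000], val            # train, val, test

# Small-dataset case: 400 validation samples become 200 val / 200 test.
tr, va, te = split_dataset(list(range(2000)), list(range(400)))
```

Fixing the shuffling seed keeps the held-out test set identical across the 3 runs per setting reported in the experiments.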

C ADDITIONAL EXPERIMENTAL RESULTS

This section reports experimental results supplementary to Section 5. We study the effect of hyper-parameters, time efficiency, and clipping occurrences in the training process. We find that training is more sensitive on small datasets. Hence, to take a closer look at the training procedure, we choose a larger dataset (STS-B or MRPC, depending on the experiment) and a small dataset (RTE) for the analyses in Appendix C.

C.3 TIME EFFICIENCY

After analyzing the memory tractability of second-order optimization in Section 3.2, we find that second-order optimization is tractable for large language models under the parameter-efficient paradigm. In this section, we further conduct experiments on time efficiency. We run on the MRPC and RTE datasets with a single NVIDIA A100 GPU and compute the average wall-clock time over 100 epochs. For second-order optimization, the value of the second-order update interval particularly affects the training time, so we conduct runs with intervals in {1, 50, 200, 500}. The results are reported in Table 8, where we observe that K-FAC reaches comparable speed to AdamW when the update interval is set to 50, since there are few trainable parameters under the parameter-efficient paradigm. When the interval is 200 or above, K-FAC is faster than AdamW while retaining satisfying performance (we did not fine-tune the hyper-parameters for each setting in this section). In the spirit of making second-order optimization as practical as possible on large models, we set the interval to 500 in all our other experiments.
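The cost model behind the update interval is simple amortization: the expensive factor inversions are recomputed only every `interval` steps and cached in between. The sketch below is illustrative (the class and its `invert` callback are hypothetical, not the paper's implementation).

```python
class AmortizedFactorCache:
    """Illustrative sketch of the second-order update interval: invert the
    curvature factors only every `interval` steps and reuse the cached
    inverses in between, amortizing the cubic inversion cost."""
    def __init__(self, interval, invert):
        self.interval = interval
        self.invert = invert          # expensive routine, e.g. matrix inversion
        self.cache = None
        self.inversions = 0

    def get(self, step, factors):
        if self.cache is None or step % self.interval == 0:
            self.cache = self.invert(factors)
            self.inversions += 1
        return self.cache

# With interval 500 over 1000 steps, only 2 inversions are performed,
# while interval 1 would perform one per step.
cache = AmortizedFactorCache(500, invert=lambda f: f)
for t in range(1000):
    _ = cache.get(t, factors=t)
```

This is why the wall-clock gap between K-FAC and AdamW in Table 8 shrinks as the interval grows: the per-step cost approaches that of a first-order update plus a vanishing amortized share of the inversion.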

