APPLYING SECOND-ORDER OPTIMIZATION TO DEEP TRANSFORMERS WITH PARAMETER-EFFICIENT TUNING

Abstract

Despite their theoretically superior convergence rate, second-order optimizers are generally not among the top choices for training large-scale neural networks due to their high computation and memory costs. Nevertheless, recent progress in parameter-efficient tuning has introduced a new paradigm in which large-scale pre-trained models (PTMs) are adapted to specific tasks by optimizing only a tiny proportion of parameters, which may change the game. We connect this new paradigm with the computational tractability of second-order optimizers and succeed in applying them to large PTMs ranging from hundreds of millions to billions of parameters. Beyond verifying their tractability, we further investigate the factors influencing stability in the optimization process and accordingly propose a Newton-step clipping approach in which we clip the update tensors rather than the gradients. This approach stabilizes convergence by gating the magnitude of Newton steps along the optimization trajectories through the rugged loss landscapes of deep transformers. We conduct extensive experiments across different downstream tasks, demonstrating that, when equipped with Newton-step clipping, second-order optimizers, especially Kronecker-factored approximate curvature (K-FAC), can attain results comparable and even superior to state-of-the-art baselines implemented with AdamW, with faster convergence. Furthermore, we scale the model up to 3 billion parameters and validate the tractability and effectiveness of our method. This work is not only the first successful application of second-order optimization to models of this scale but also paves the road towards the design and analysis of second-order optimizers for the downstream adaptation of large-scale PTMs.

1. INTRODUCTION

Pre-trained models (PTMs) (Bommasani et al., 2021; Han et al., 2021) based on deep transformers (Vaswani et al., 2017; Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020) yield remarkable performance on a wide range of tasks thanks to the tremendous capacity brought by their numerous parameters. A prevalent paradigm is to first pre-train such models on large-scale corpora in a self-supervised manner and then adapt them to specific datasets (typically with supervision). Such adaptations are usually implemented with first-order gradient-based optimization. However, recent applications have witnessed an approaching bottleneck for first-order training: it is unlikely to become faster, and attaining higher scores is increasingly difficult (Pascanu et al., 2013; Goodfellow et al., 2016). On the other hand, second-order optimizers, which enjoy better convergence properties, would be an ideal alternative in theory (Nocedal & Wright, 1999; Boyd et al., 2004). They are, nevertheless, almost unattended in the downstream adaptation of PTMs in practice. This is because they require quadratic storage and cubic computation time for each update, which is especially prohibitive given the enormous model scale, even though a host of simplified variants have been devised (Byrd et al., 1995; Martens & Grosse, 2015; Botev et al., 2017; Anil et al., 2020; Tang et al., 2021). Fortunately, recent advances in PTMs have brought a twist to the situation. Studies show that full-parameter optimization may not be necessary for task-specific adaptation, as updating a small portion (0.05%∼1%) of parameters can achieve non-trivial performance on many datasets (Houlsby et al., 2019; Li & Liang, 2021; Lester et al., 2021; Hu et al., 2021). With this new paradigm largely shrinking the trainable parameter space, we find in experiments that second-order optimization is now tractable on large-scale PTMs with up to billions of parameters.
This successful implementation signals that the design and analysis of second-order optimization can start marching from relatively toy or medium-sized models to pre-trained deep transformers that are formidably large in scale. It in turn raises a sequence of interesting research questions, including: How tractable are second-order optimizers on deep transformers? Are they capable of converging faster and more steadily and yielding better results? If not, what auxiliary techniques are needed? And how does the ratio of trainable parameters influence the relative performance of second-order optimization? In this application-oriented paper, we answer all the aforementioned questions through theoretical justification and experimental verification. Taking K-FAC as the major example of a second-order optimizer, we first experimentally verify that its training time and memory cost on extremely large PTMs are affordable. In addition, we point out that clipping the gradients before or after Hessian (or Fisher) preconditioning makes a non-negligible difference. We accordingly propose a Newton-step clipping strategy that is indispensable for second-order training because of its superior stabilizing effect. We then present comprehensive results illustrating that, with the assistance of our Newton-step clipping strategy, second-order optimization outperforms baseline first-order optimizers as well as its non-clipping and traditional gradient-clipping counterparts in terms of both convergence speed and final test scores. Moreover, we scale the model up to 3 billion parameters to evaluate tractability and effectiveness. Observations on the relation between the tunable ratio and convergence speed are also presented.
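The distinction drawn above, clipping the preconditioned update tensor rather than the raw gradient, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names, the Frobenius-norm choice, and the threshold value are all illustrative assumptions.

```python
import numpy as np

def clip_by_norm(t, max_norm):
    """Rescale tensor t so that its norm is at most max_norm."""
    norm = np.linalg.norm(t)
    return t if norm <= max_norm else t * (max_norm / norm)

def newton_step_clipped(grad, hess_inv, lr, max_norm=1.0):
    """Precondition first, THEN clip the resulting Newton step
    (as opposed to clipping the raw gradient before preconditioning)."""
    step = hess_inv @ grad               # Newton step u = H^{-1} g
    step = clip_by_norm(step, max_norm)  # gate the magnitude of the update tensor
    return -lr * step
```

Clipping after preconditioning bounds the actual parameter displacement, which can blow up when the curvature estimate is nearly singular even if the raw gradient is small; clipping before preconditioning provides no such guarantee.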

2. SECOND-ORDER OPTIMIZERS FOR LARGE-SCALE TRAINING

Following the introduction, we recap the essential background of second-order optimization. While first-order optimizers such as stochastic gradient descent (SGD) and Adam (Kingma & Ba, 2014) are ubiquitous in deep learning, second-order optimizers remain relatively under-explored in this field. Unlike their first-order counterparts, which take only first-order derivatives into account, second-order optimizers additionally incorporate the loss function's second-order features, in other words, the curvature information.
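As a toy illustration of why curvature information helps, assuming nothing beyond the standard definitions above, compare gradient descent with a curvature-preconditioned step on an ill-conditioned quadratic (the matrix and step counts below are arbitrary choices for illustration):

```python
import numpy as np

# Ill-conditioned quadratic: L(theta) = 0.5 * theta^T H theta, minimum at 0
H = np.diag([100.0, 1.0])
theta_gd = np.array([1.0, 1.0])
theta_newton = np.array([1.0, 1.0])

# Gradient descent: the step size is limited by the largest curvature,
# so progress along the flat direction is slow
for _ in range(50):
    theta_gd = theta_gd - 0.01 * (H @ theta_gd)

# Preconditioning by H^{-1} rescales all directions equally:
# one step reaches the minimum exactly on a quadratic
theta_newton = theta_newton - np.linalg.solve(H, H @ theta_newton)
```

After 50 gradient-descent steps the flat direction has barely moved (0.99^50 ≈ 0.6), while the preconditioned update lands on the minimum in a single step.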

2.1. NEWTON'S METHOD AND ITS VARIANTS

Newton's method and its variants are perhaps the most typical second-order optimization techniques. Specifically, in Newton's method, we approximate the loss function with its Taylor expansion up to order 2:

$$\mathcal{L}(\theta + \delta\theta) \approx \mathcal{L}(\theta) + \nabla_{\theta}\mathcal{L}(\theta)^{\top}\delta\theta + \tfrac{1}{2}\,\delta\theta^{\top}\nabla^{2}_{\theta}\mathcal{L}(\theta)\,\delta\theta \triangleq \mathcal{L}(\theta) + g(\theta)^{\top}\delta\theta + \tfrac{1}{2}\,\delta\theta^{\top}H(\theta)\,\delta\theta, \quad (1)$$

in which $g(\theta)$ and $H(\theta)$ are the gradient and Hessian matrix of $\mathcal{L}$ at $\theta$, respectively. We then move in the direction that minimizes the quadratic function on the right-hand side of Equation 1. This condition implies that the standard Newton's method should proceed as

$$\theta_{t+1} \leftarrow \theta_{t} - \eta_{t}\,H(\theta_{t})^{-1}g(\theta_{t}),$$

where $\eta_{t}$ is the learning rate for the $t$-th step. Here the Hessian $H$ is assumed to be invertible. The quantity $u = H^{-1}g$ is usually called a Newton step. Despite requiring fewer steps to converge in theory, the original Newton's method has long been criticized for its high computation and storage costs in calculating the Hessian matrix, inverting it, and performing other matrix manipulations (Nocedal & Wright, 1999; Boyd et al., 2004). To remedy this deficiency, numerous refinements of Newton's method have been proposed over the past decades. These include methods that reduce the cost by approximating the Hessian matrix (Levenberg, 1944; Marquardt, 1963; Botev et al., 2017), and quasi-Newton optimizers such as BFGS (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970) and its limited-memory version L-BFGS (Byrd et al., 1995), which approximate the inverse of the Hessian matrix directly. These methods can partly alleviate the intractability of the full Newton's method on deep neural networks.
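The standard Newton update above can be written compactly in code. This is a generic sketch on a small quadratic, not the paper's optimizer; it uses a linear solve rather than an explicit matrix inverse, which is the usual numerically preferred formulation.

```python
import numpy as np

def newton_update(theta, grad_fn, hess_fn, lr=1.0):
    """One Newton update: theta <- theta - lr * H(theta)^{-1} g(theta)."""
    g = grad_fn(theta)
    H = hess_fn(theta)
    u = np.linalg.solve(H, g)  # Newton step u = H^{-1} g, without forming H^{-1}
    return theta - lr * u

# Quadratic loss L(theta) = 0.5 * theta^T A theta - b^T theta,
# whose gradient is A theta - b and whose Hessian is the constant A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_fn = lambda th: A @ th - b
hess_fn = lambda th: A

theta = newton_update(np.zeros(2), grad_fn, hess_fn)
# For a quadratic loss, a single full Newton step (lr=1) lands on the
# exact minimizer, i.e. the solution of A theta = b
```

For non-quadratic losses the expansion in Equation 1 is only a local approximation, which is precisely why step-size control (and, in this paper's setting, clipping) matters.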

2.2. NATURAL GRADIENT DESCENT AND K-FAC

In a common situation of supervised learning, a class of parametric conditional distribution {p(y | x, θ) | θ ∈ Θ} is assigned to the model to fit the underlying distribution q(y | x) of the ob-

