APPLYING SECOND-ORDER OPTIMIZATION TO DEEP TRANSFORMERS WITH PARAMETER-EFFICIENT TUNING

Abstract

Despite their theoretically superior convergence rates, second-order optimizers are generally not among the top choices for training large-scale neural networks due to their high computation and memory costs. Nevertheless, recent progress in parameter-efficient tuning has introduced a new paradigm in which large-scale pre-trained models (PTMs) can be adapted to specific tasks by optimizing only a tiny proportion of their parameters, which may change the game. We connect this new paradigm with the computational tractability of second-order optimizers and succeed in applying them to large PTMs ranging from hundreds of millions to billions of parameters. Beyond verifying their tractability, we further investigate the factors that influence stability during optimization and accordingly propose a Newton-step clipping approach in which we clip the update tensors rather than the gradients. This approach stabilizes convergence by gating the magnitude of Newton steps along the optimization trajectories through the rugged loss landscapes of deep transformers. We conduct extensive experiments across different downstream tasks, demonstrating that, when equipped with Newton-step clipping, second-order optimizers, especially Kronecker-factored approximate curvature (K-FAC), can attain results comparable to, and even surpassing, state-of-the-art baselines trained with AdamW, with faster convergence. Furthermore, we scale the model up to 3 billion parameters and validate the tractability and effectiveness of our method. This work is not only the first successful application of second-order optimization to models of this size but also paves the way toward the design and analysis of second-order optimizers for the downstream adaptation of large-scale PTMs.
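The core of the proposed stabilization is to threshold the magnitude of each second-order update itself, rather than the raw gradient as in conventional gradient clipping. Below is a minimal sketch of that idea; the function name `clip_newton_step` and the `max_norm` threshold are illustrative assumptions, not the paper's actual implementation, and a flat list of floats stands in for one update tensor.

```python
import math

def clip_newton_step(step, max_norm):
    """Rescale a Newton-step update so its L2 norm does not exceed max_norm.

    This mirrors norm-based gradient clipping, but is applied to the
    second-order update tensor (the Newton step) instead of the gradient.
    `step` is a flat list of floats standing in for one update tensor.
    """
    norm = math.sqrt(sum(x * x for x in step))
    if norm <= max_norm:
        # Step is already small enough; leave it untouched.
        return step
    # Shrink the step onto the max_norm ball, preserving its direction.
    scale = max_norm / norm
    return [x * scale for x in step]
```

For example, `clip_newton_step([3.0, 4.0], 1.0)` returns `[0.6, 0.8]` (the same direction, rescaled to unit norm), while a step already within the threshold is passed through unchanged.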

1. INTRODUCTION

Pre-trained models (PTMs) (Bommasani et al., 2021; Han et al., 2021) based on deep transformers (Vaswani et al., 2017; Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020) yield remarkable performance on a wide range of tasks thanks to the tremendous capacity afforded by their numerous parameters. A prevalent paradigm is to first pre-train such models on large-scale corpora in a self-supervised manner and then adapt them to specific datasets (typically with supervision). Such adaptations are usually implemented with first-order gradient-based optimization. However, recent applications suggest that first-order training is approaching a bottleneck: it is unlikely to become faster, and attaining higher scores is increasingly difficult (Pascanu et al., 2013; Goodfellow et al., 2016). On the other hand, second-order optimizers, which enjoy better convergence properties, are an appealing alternative in theory (Nocedal & Wright, 1999; Boyd et al., 2004). In practice, however, they remain almost entirely unexplored for the downstream adaptation of PTMs. This is because they require quadratic storage and cubic computation time for each update, which is especially prohibitive given the enormous model scale, even though a variety of simplified variants have been devised (Byrd et al., 1995; Martens & Grosse, 2015; Botev et al., 2017; Anil et al., 2020; Tang et al., 2021). Fortunately, recent advances in PTMs have brought a slight twist to this situation. Studies show that full-parameter optimization may not be necessary for task-specific adaptation, as updating a small portion (0.05%∼1%) of parameters can achieve non-trivial performance on many datasets (Houlsby et al., 2019; Li & Liang, 2021; Lester et al., 2021; Hu et al., 2021). With this new paradigm largely shrinking the trainable parameter space, we find in experiments that second-order optimization is

