EVA: PRACTICAL SECOND-ORDER OPTIMIZATION WITH KRONECKER-VECTORIZED APPROXIMATION

Abstract

Second-order optimization algorithms exhibit excellent convergence properties for training deep learning models, but often incur significant computation and memory overheads, which can result in lower training efficiency than first-order counterparts such as stochastic gradient descent (SGD). In this work, we present a memory- and time-efficient second-order algorithm named Eva with two novel techniques: 1) we construct the second-order information with the Kronecker factorization of small stochastic vectors over a mini-batch of training data to reduce memory consumption, and 2) we derive an efficient update formula that avoids explicitly computing the inverse of matrices by using the Sherman-Morrison formula. We further provide a theoretical interpretation of Eva from a trust-region optimization point of view to understand how it works. Extensive experimental results on different models and datasets show that Eva reduces end-to-end training time by up to 2.05× and 2.42× compared to first-order SGD and second-order algorithms (K-FAC and Shampoo), respectively.

1. INTRODUCTION

While first-order optimizers such as stochastic gradient descent (SGD) (Bottou et al., 1998) and Adam (Kingma & Ba, 2015) have been widely used in training deep learning models (Krizhevsky et al., 2012; He et al., 2016; Devlin et al., 2019), these methods exploit only the first-order gradient to update the model parameters and thus require a large number of iterations to converge (Bottou et al., 2018). To overcome this inefficiency, second-order optimizers have been considered for their potential to accelerate training by converging in far fewer iterations (Osawa et al., 2019; 2020; Pauloski et al., 2020; 2021). For example, our experimental results illustrate that second-order optimizers, e.g., K-FAC (Martens & Grosse, 2015), require ∼50% fewer iterations than SGD to reach the target top-1 validation accuracy of 93.5% when training a ResNet-110 (He et al., 2016) model on the CIFAR-10 dataset (Krizhevsky, 2009) (more results are shown in Table 2). The fast convergence of second-order algorithms comes from preconditioning the gradient with the inverse of a matrix C of curvature information. Different second-order optimizers construct C by approximating different second-order information, e.g., the Hessian, Gauss-Newton, or Fisher information (Amari, 1998) matrices, to help improve the convergence rate (Dennis & Schnabel, 1983). However, classical second-order optimizers incur significant computation and memory overheads when training deep neural networks (DNNs), which typically have a large number of model parameters: storing C requires memory quadratic in the number of parameters, and inverting C requires cubic time. For example, a ResNet-50 (He et al., 2016) model with 25.6M parameters would have to store more than 650T elements of C using the full Hessian, which is not affordable on current devices, e.g., an Nvidia A100 GPU with 80GB of memory.
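The memory and time bottleneck described above can be made concrete with a minimal sketch of the generic preconditioned update. All names (`d`, `C`, `damping`, `lr`) and the damping placement are illustrative, not the paper's formulation; the point is only that the naive approach stores an O(d²) matrix and pays an O(d³) solve per step.

```python
import numpy as np

# Toy setting: a model with d parameters, a stochastic gradient g, and a
# curvature matrix C (Hessian / Gauss-Newton / Fisher approximation).
d = 4
rng = np.random.default_rng(0)
g = rng.standard_normal(d)
C = rng.standard_normal((d, d))
C = C @ C.T + np.eye(d)              # make C symmetric positive definite

lr, damping = 0.1, 1e-2
# Generic second-order update: precondition the gradient with (C + lambda*I)^{-1}.
# Storing C costs O(d^2) memory; the solve costs O(d^3) time. For d in the
# millions, neither is affordable, which motivates the approximations below.
update = lr * np.linalg.solve(C + damping * np.eye(d), g)
```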
To make second-order optimizers practical in deep learning, approximation techniques have been proposed to estimate C with smaller matrices. For example, the K-FAC algorithm (Martens & Grosse, 2015) uses the Kronecker factorization of two smaller matrices to approximate the Fisher information matrix (FIM) of each DNN layer; thus, K-FAC only needs to store and invert these small matrices, namely Kronecker factors (KFs), which reduces the computing and memory overheads.

Table 1: Time and memory complexity comparison of different second-order algorithms. d is the dimension of a hidden layer, L is the number of layers, and m is the number of gradient copies.

  Complexity          Newton      K-FAC      Shampoo              M-FAC            Eva
  Time                O(d^6 L^3)  O(2d^3 L)  O(2d^3 L)            O(md^2 L)        O(d^2 L)
  Memory              O(d^4 L^2)  O(2d^2 L)  O(2d^2 L)            O(md^2 L)        O(2dL)
  Second-order Info.  Hessian     KFs        Gradient Statistics  Gradient Copies  KVs

However, even so, the additional cost of each second-order update remains significant, making K-FAC slower than first-order SGD per iteration. In our experiment, the iteration time of K-FAC is 2.5× that of SGD in training ResNet-50 (see Table 4), and the memory consumption of storing the KFs and their inverses is 12× larger than that of storing the gradient. Despite the reduced number of iterations, existing second-order algorithms, including K-FAC (Martens & Grosse, 2015), Shampoo (Gupta et al., 2018), and M-FAC (Frantar et al., 2021), are not time- and memory-efficient, as shown in Table 1. One limitation of K-FAC and Shampoo is that they typically require dedicated system optimizations and tuning of the second-order update interval to outperform their first-order counterparts (Osawa et al., 2019; Pauloski et al., 2020; Anil et al., 2021; Shi et al., 2021). To address the above limitations, we propose a novel second-order training algorithm, called Eva, which introduces a matrix-free approximation of the second-order matrix to precondition the gradient.
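The Kronecker-factored idea behind K-FAC can be sketched as follows for one toy layer. The sizes, the mini-batch statistics, and the per-factor damping are illustrative simplifications (K-FAC's actual damping scheme is more involved): the approximation replaces the layer's (d_in·d_out)² FIM with two small factors, A from the inputs and G from the back-propagated output gradients, so only these small matrices are stored and inverted.

```python
import numpy as np

rng = np.random.default_rng(0)
din, dout, batch = 3, 2, 8                 # toy layer sizes (illustrative)
a = rng.standard_normal((batch, din))      # layer inputs (activations)
g = rng.standard_normal((batch, dout))     # back-propagated output gradients
dW = rng.standard_normal((dout, din))      # gradient of the layer's weight

# Kronecker factors: two small matrices instead of one (din*dout)^2 FIM.
A = a.T @ a / batch                        # din x din
G = g.T @ g / batch                        # dout x dout

damping = 1e-2                             # simplified per-factor damping
A_inv = np.linalg.inv(A + damping * np.eye(din))
G_inv = np.linalg.inv(G + damping * np.eye(dout))

# (A kron G)^{-1} vec(dW) corresponds to G^{-1} dW A^{-1} under a
# column-major vec convention, so the full Kronecker product is never formed.
precond = G_inv @ dW @ A_inv
```

Even in this small form the cost pattern is visible: the factors are d×d per layer, so storage is O(2d²L) and each inversion is O(d³), matching the K-FAC column of Table 1.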
Eva not only requires much less memory to estimate the second-order information, but also does not need to explicitly compute the inverse of the second-order matrix, thus eliminating the intensive computations required by existing methods. Specifically, we propose two novel techniques in Eva. First, for each DNN layer, we exploit the Kronecker factorization of two small stochastic vectors, called Kronecker vectors (KVs), over a mini-batch of training data to construct a rank-one matrix as the second-order matrix C for preconditioning. Note that our constructed second-order matrix is different from the average outer product of gradients (i.e., the Fisher information (Amari, 1998)) used in existing K-FAC-related algorithms. Second, we derive a new update formula that preconditions the gradient by implicitly computing the inverse of the constructed Kronecker factorization using the Sherman-Morrison formula (Sherman & Morrison, 1950). The new update formula takes only linear time, which makes Eva much more time-efficient than existing second-order optimizers, which normally take superlinear time to invert matrices (see Table 1). Finally, we provide a theoretical interpretation of Eva from a trust-region optimization point of view to understand how it preserves the fast convergence property of second-order optimization (Asi & Duchi, 2019; Bae et al., 2022). We conduct extensive experiments to illustrate the effectiveness and efficiency of Eva compared to widely used first-order (SGD, Adagrad, and Adam) and second-order (K-FAC, Shampoo, and M-FAC) optimizers on multiple deep models and datasets.
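The computational core of the second technique can be sketched in a few lines. The variable names and the exact placement of the damping term are illustrative rather than the paper's precise formulation, but the mechanism is the one described above: if the curvature is a damped rank-one matrix built from a Kronecker-factored vector, Sherman-Morrison gives its inverse in closed form, so preconditioning needs only dot products (linear time) and the matrix is never materialized.

```python
import numpy as np

rng = np.random.default_rng(0)
din, dout = 3, 2
a = rng.standard_normal(din)       # small stochastic vector from layer inputs
g = rng.standard_normal(dout)      # small stochastic vector from output grads
v = np.kron(a, g)                  # Kronecker vector (KV), length din*dout
grad = rng.standard_normal(din * dout)

damping = 0.1
# Sherman-Morrison for a damped rank-one curvature lambda*I + v v^T:
#   (lambda*I + v v^T)^{-1} grad = (grad - v * (v.grad) / (lambda + v.v)) / lambda
# Only O(d) dot products are needed; the d x d matrix is never formed.
precond = (grad - v * (v @ grad) / (damping + v @ v)) / damping
```

Note that only the two small vectors (O(2d) memory per layer) are kept, which is where the O(2dL) memory and O(d²L) time entries for Eva in Table 1 come from.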
The experimental results show that 1) Eva outperforms first-order optimizers, achieving higher accuracy under the same number of iterations or reaching the same accuracy with fewer iterations, and 2) Eva generalizes very closely to other second-order algorithms such as K-FAC while requiring much less iteration time and memory. Specifically, in terms of per-iteration time, Eva requires an average of only 1.14× the wall-clock time of first-order SGD, while K-FAC requires 3.47× in each second-order update. In terms of memory consumption, Eva requires almost the same memory as first-order SGD, which is up to 31% and 45% smaller than second-order Shampoo and K-FAC, respectively. In terms of end-to-end training performance, Eva reduces the training time on different benchmarks by up to 2.05×, 1.58×, and 2.42× compared to SGD, K-FAC, and Shampoo, respectively. In summary, our contributions are as follows: (1) We propose Eva, a novel efficient second-order optimizer via Kronecker-vectorized approximation, which uses the Kronecker factorization of two small vectors as the second-order information, so that Eva has a sublinear memory complexity and requires almost the same memory footprint as first-order algorithms like SGD. (2) We derive a new update formula that computes the inverse implicitly during preconditioning by exploiting the Sherman-Morrison formula, eliminating the expensive explicit inverse computation and reducing each second-order update to linear time complexity. (3) We conduct extensive experiments to validate that Eva converges faster than SGD and is more system-efficient than K-FAC and Shampoo, and is therefore capable of improving end-to-end training performance.

