EVA: PRACTICAL SECOND-ORDER OPTIMIZATION WITH KRONECKER-VECTORIZED APPROXIMATION

Abstract

Second-order optimization algorithms exhibit excellent convergence properties for training deep learning models, but they often incur significant computation and memory overheads, which can result in lower training efficiency than first-order counterparts such as stochastic gradient descent (SGD). In this work, we present a memory- and time-efficient second-order algorithm named Eva with two novel techniques: 1) we construct the second-order information with the Kronecker factorization of small stochastic vectors over a mini-batch of training data to reduce memory consumption, and 2) we derive an efficient update formula that avoids explicitly inverting matrices by using the Sherman-Morrison formula. We further provide a theoretical interpretation of Eva from a trust-region optimization point of view to explain how it works. Extensive experimental results on different models and datasets show that Eva reduces end-to-end training time by up to 2.05× and 2.42× compared to the first-order SGD and second-order algorithms (K-FAC and Shampoo), respectively.

1. INTRODUCTION

While first-order optimizers such as stochastic gradient descent (SGD) (Bottou et al., 1998) and Adam (Kingma & Ba, 2015) have been widely used in training deep learning models (Krizhevsky et al., 2012; He et al., 2016; Devlin et al., 2019), they exploit only the first-order gradient to update the model parameters and thus require a large number of iterations to converge (Bottou et al., 2018). To overcome this inefficiency, second-order optimizers have been explored for their potential to accelerate training with far fewer iterations (Osawa et al., 2019; 2020; Pauloski et al., 2020; 2021). For example, our experimental results illustrate that second-order optimizers, e.g., K-FAC (Martens & Grosse, 2015), require ∼50% fewer iterations than SGD to reach a target top-1 validation accuracy of 93.5% when training a ResNet-110 (He et al., 2016) model on the CIFAR-10 dataset (Krizhevsky, 2009) (more results are shown in Table 2). The fast convergence of second-order algorithms stems from preconditioning the gradient with the inverse of a matrix C of curvature information. Different second-order optimizers construct C by approximating different second-order quantities, e.g., the Hessian, Gauss-Newton, or Fisher information matrix (Amari, 1998), to improve the convergence rate (Dennis & Schnabel, 1983). However, classical second-order optimizers incur significant computation and memory overheads when training deep neural networks (DNNs), which typically have a large number of model parameters: storing C requires memory quadratic in the number of parameters, and inverting C requires time cubic in it. For example, a ResNet-50 (He et al., 2016) model with 25.6M parameters would need to store more than 650T elements in C for the full Hessian, which is not affordable on current devices, e.g., an Nvidia A100 GPU with 80GB of memory.
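The quadratic memory cost above can be checked with a quick back-of-the-envelope calculation; the sketch below only reproduces the arithmetic implied by the text (25.6M parameters, 80GB GPU memory), not any part of the actual algorithm.

```python
# Back-of-the-envelope check of the quadratic memory cost of storing a
# full curvature matrix C for a ResNet-50-sized model.
params = 25.6e6                   # number of model parameters
elements = params ** 2            # C is params x params
print(f"{elements:.2e} elements") # ~6.55e+14, i.e. >650T elements

# Even at 4 bytes per float32 element, this dwarfs an A100's 80GB:
bytes_needed = elements * 4
print(f"{bytes_needed / 80e9:.0f}x an A100's memory")
```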
To make second-order optimizers practical in deep learning, approximation techniques have been proposed to estimate C with smaller matrices. For example, the K-FAC algorithm (Martens & Grosse, 2015) approximates the Fisher information matrix (FIM) of each DNN layer by the Kronecker product of two much smaller matrices; K-FAC then only needs to store and invert these small matrices, called Kronecker factors (KFs), which reduces the computation and memory overheads.
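The savings from this factorization can be illustrated with a minimal NumPy sketch for a single fully-connected layer. The layer sizes, damping value, and variable names below are hypothetical, chosen only to show the idea: the two Kronecker factors are built from layer inputs and back-propagated output gradients, and the standard identity (A ⊗ G)⁻¹ vec(dW) = vec(G⁻¹ dW A⁻¹) lets us precondition the gradient while inverting only the small factors.

```python
import numpy as np

# Hypothetical sizes for one fully-connected layer with m inputs, n outputs.
rng = np.random.default_rng(0)
m, n, batch = 8, 4, 32

a = rng.standard_normal((batch, m))   # layer inputs over a mini-batch
g = rng.standard_normal((batch, n))   # back-propagated output gradients

# Kronecker factors: A is m x m, G is n x n.
A = a.T @ a / batch
G = g.T @ g / batch

# The full FIM block for this layer would be (m*n) x (m*n):
full_size = (m * n) ** 2          # elements without factorization
kf_size = m * m + n * n           # elements with Kronecker factors
print(full_size, kf_size)         # 1024 vs 80

# Precondition the weight gradient dW (n x m) via the Kronecker identity,
# so only the small m x m and n x n factors are ever inverted.
# Damping (a common practical choice) keeps the inverses well-defined.
damping = 1e-2
dW = rng.standard_normal((n, m))
precond = np.linalg.solve(G + damping * np.eye(n),
                          dW @ np.linalg.inv(A + damping * np.eye(m)))
print(precond.shape)              # (4, 8)
```

Here the factored representation stores 80 elements instead of 1024, and the gap widens quadratically with layer size, which is exactly why K-FAC becomes tractable where the full FIM is not.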

