HESSCALE: SCALABLE COMPUTATION OF HESSIAN DIAGONALS

Abstract

Second-order optimization uses curvature information about the objective function, which can lead to faster convergence. However, such methods typically require expensive computation of the Hessian matrix, preventing their use at scale. The absence of efficient ways to compute second-order information has driven the most widely used methods toward first-order approximations that do not capture curvature. In this paper, we develop HesScale, a scalable approach to approximating the diagonal of the Hessian matrix, to incorporate second-order information in a computationally efficient manner. We show that HesScale has the same computational complexity as backpropagation. Our results on supervised classification show that HesScale achieves high approximation accuracy, allowing for scalable and efficient second-order optimization.¹

1. INTRODUCTION

First-order optimization offers a cheap and efficient way of making local progress in optimization problems by using gradient information. However, first-order methods suffer from instability or slow progress on ill-conditioned landscapes. This problem arises because first-order methods do not capture curvature information, which causes two interrelated issues. First, first-order updates have incorrect units (Duchi et al. 2011), which creates a scaling issue. Second, first-order methods lack parameterization invariance (Martens 2020), in contrast to second-order methods such as natural gradient (Amari 1998) or Newton-Raphson methods. Accordingly, some first-order normalization methods were developed to address the invariance problem (Ba et al. 2016, Ioffe & Szegedy 2015, Salimans & Kingma 2016). On the other hand, some recent adaptive step-size methods try to alleviate the scaling issue by using gradient information as a first-order curvature approximation (Luo et al. 2019, Duchi et al. 2011, Zeiler 2012, Reddi et al. 2018, Kingma & Ba 2015, Tran & Phong 2019, Tieleman et al. 2012). Specifically, such methods use the empirical Fisher diagonals heuristic: they maintain a moving average of the squared gradients to approximate the diagonal of the Fisher information matrix. Despite the wide adoption of these methods due to their scalability, their approximations are inaccurate. Kunstner et al. (2019) showed that the empirical Fisher does not generally capture curvature information and might have undesirable effects. They argued that the empirical Fisher approximates the Fisher or the Hessian matrix only under strong assumptions that are unlikely to be met in practice. Moreover, Wilson et al. (2017) presented a counterexample in which adaptive step-size methods are unable to reduce the error compared to non-adaptive counterparts such as stochastic gradient descent.
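As a minimal sketch of the empirical Fisher diagonals heuristic described above (with hypothetical names and synthetic gradient samples, not any cited method's implementation), the diagonal estimate is an exponential moving average of squared gradients, as in RMSProp- or Adam-style methods:

```python
import numpy as np

def update_second_moment(v, grad, beta=0.999):
    """EMA of squared gradients; v approximates the empirical Fisher diagonal."""
    return beta * v + (1.0 - beta) * grad ** 2

rng = np.random.default_rng(0)
v = np.zeros(3)
T = 5000
for t in range(1, T + 1):
    # stand-in stochastic gradient samples with E[g] = 1.0, Var[g] = 0.25
    grad = rng.normal(loc=1.0, scale=0.5, size=3)
    v = update_second_moment(v, grad)

# Adam-style bias correction for the zero initialization of v
v_hat = v / (1.0 - 0.999 ** T)
print(v_hat)  # each entry near E[g^2] = 1.0^2 + 0.5^2 = 1.25
```

Note that v_hat estimates E[g²], the second moment of the *stochastic* gradient, which is exactly why this heuristic can diverge from the true Fisher or Hessian diagonal, as Kunstner et al. (2019) argue.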
Although second-order optimization can speed up training by using the geometry of the landscape, its adoption is minimal compared to first-order methods. The exact natural-gradient or Newton-Raphson methods require the computation, storage, and inversion of the Fisher information or Hessian matrices, making them computationally prohibitive in large-scale tasks. Accordingly, many popular second-order methods attempt to approximate these matrices less expensively. For example, a type of truncated-Newton method called Hessian-free methods (Martens 2010) exploits the fact that the Hessian-vector product is cheap to compute (Bekas et al. 2007) and uses the iterative conjugate-gradient method to perform an update. However, such methods might require many iterations per update or additional tricks to achieve stability, adding computational overhead (Martens & Sutskever 2011). Some variations approximate only the diagonal of the Hessian matrix using stochastic estimation with matrix-free computations (Chapelle & Erhan 2011, Martens et al. 2012, Yao et al. 2021). Other methods impose probabilistic modeling assumptions and estimate a block-diagonal Fisher information matrix (Martens & Grosse 2015, Botev et al. 2017). Such methods are invariant to reparametrization but are computationally expensive since they need to perform a matrix inversion for each block. Deterministic diagonal approximations to the Hessian (LeCun et al. 1990, Becker & LeCun 1989) provide some curvature information and are efficient to compute; specifically, they can be implemented to be as efficient as first-order methods. We view this category of approximation methods as scalable second-order methods. In neural networks, curvature backpropagation (Becker & LeCun 1989) can be used to backpropagate the curvature vector. We distinguish this efficient method from other, more expensive methods (e.g., Mizutani & Dreyfus 2008, Botev et al. 2017) that backpropagate full Hessian matrices.
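To illustrate why the Hessian-vector product is cheap, the following sketch computes Hv by finite differences of gradients, without ever forming H. This is a toy quadratic objective for illustration only (Hessian-free methods typically use exact directional derivatives, e.g. Pearlmutter's trick, rather than finite differences):

```python
import numpy as np

def grad(psi, A, b):
    # gradient of f(psi) = 0.5 psi^T A psi - b^T psi, whose Hessian is exactly A
    return A @ psi - b

def hvp(psi, v, A, b, eps=1e-5):
    # Hv ~= (grad(psi + eps*v) - grad(psi)) / eps : two gradient evaluations,
    # so the cost is comparable to backpropagation, independent of n^2 storage
    return (grad(psi + eps * v, A, b) - grad(psi, A, b)) / eps

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
A = A @ A.T                      # symmetric PSD "Hessian"
b = rng.normal(size=4)
psi = rng.normal(size=4)
v = rng.normal(size=4)

print(np.allclose(hvp(psi, v, A, b), A @ v, atol=1e-3))  # → True
```

Conjugate-gradient-based Hessian-free updates repeatedly call such a product, which is why they can need many products per update.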
Although these diagonal methods point toward scalable second-order optimization, their approximation quality is sometimes poor with objectives such as cross-entropy (Martens et al. 2012). A scalable second-order method with high approximation quality is still needed. In this paper, we present HesScale, a high-quality approximation method for the Hessian diagonals. Our method is scalable, has low memory requirements, and has linear computational complexity while maintaining high approximation accuracy.

2. BACKGROUND

In this section, we describe the Hessian matrix for neural networks and some existing methods for estimating it. Generally, a Hessian matrix can be computed for any scalar-valued function that is twice differentiable. If f : R^n → R is such a function, then for its argument ψ ∈ R^n, the Hessian matrix H ∈ R^{n×n} of f with respect to ψ is given by H_{i,j} = ∂²f(ψ)/∂ψ_i ∂ψ_j. Here, the ith element of a vector v is denoted by v_i, and the element at the ith row and jth column of a matrix M is denoted by M_{i,j}. When the need for computing the Hessian matrix arises for optimization in deep learning, the function f is typically the objective function, and the vector ψ is commonly the weight vector of a neural network. Computing and storing an n × n matrix, where n is the number of weights in a neural network, is expensive. Therefore, many methods approximate the Hessian matrix or parts of it with a smaller memory footprint, lower computational requirements, or both. A common technique is to utilize the structure of the function to reduce the computation needed. For example, assuming that connections from a certain layer do not affect other layers in a neural network allows one to approximate a block-diagonal Hessian. The computation simplifies further with piece-wise linear activation functions (e.g., ReLU), which result in a Generalized Gauss-Newton (GGN) (Schraudolph 2002) approximation that is equivalent to the block-diagonal Hessian matrix with linear activation functions. The GGN matrix is favored in second-order optimization since it is positive semi-definite. However, computing a block-diagonal matrix is still demanding, and many approximation methods were developed to reduce the storage and computation requirements of the GGN matrix.
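As a concrete illustration of the definition H_{i,j} = ∂²f(ψ)/∂ψ_i ∂ψ_j (using a toy function chosen for this sketch, not one from the paper), the Hessian can be computed numerically by central finite differences and checked against the analytic second derivatives. The nested loop also makes the quadratic cost in n visible:

```python
import numpy as np

def f(psi):
    # toy twice-differentiable scalar function of two parameters
    return psi[0] ** 2 * psi[1] + np.sin(psi[1])

def hessian(f, psi, eps=1e-4):
    # central-difference estimate of H_{i,j}; note the O(n^2) loop,
    # which is exactly what makes exact Hessians expensive at scale
    n = psi.size
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(psi + eps * I[i] + eps * I[j])
                       - f(psi + eps * I[i] - eps * I[j])
                       - f(psi - eps * I[i] + eps * I[j])
                       + f(psi - eps * I[i] - eps * I[j])) / (4 * eps ** 2)
    return H

psi = np.array([1.0, 2.0])
H = hessian(f, psi)
# analytic Hessian: [[2*psi_1, 2*psi_0], [2*psi_0, -sin(psi_1)]]
print(H)
```

The symmetry H_{i,j} = H_{j,i} visible in the output follows from the equality of mixed partial derivatives.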
For example, under probabilistic modeling assumptions, the Kronecker-factored Approximate Curvature (KFAC) method (Martens & Grosse 2015) writes the GGN matrix G as a Kronecker product of two smaller matrices: G = A ⊗ B, where A = E[hh^⊤], B = E[gg^⊤], h is the activation output vector, and g is the gradient of the loss with respect to the activation input vector. The matrices A and B can be estimated by Monte Carlo sampling and an exponential moving average. KFAC is relatively efficient when used in optimization since it requires inverting only the small matrices, using the Kronecker-product property (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}. However, KFAC is still expensive due to the storage of the block-diagonal matrices and the computation of Kronecker products, which prevents it from being used as a scalable method.

Computing the Hessian diagonals can provide some curvature information with relatively little computation. However, it has been shown that exact computation of the Hessian diagonals typically has quadratic complexity, and algorithms that compute the exact diagonals in less than quadratic time are unlikely to exist (Martens et al. 2012). Some stochastic methods provide a way to compute unbiased estimates of the exact Hessian diagonals. For example, the AdaHessian (Yao et al. 2021) algorithm uses Hutchinson's estimator diag(H) = E[z • (Hz)], where z is a multivariate random variable with a Rademacher distribution and the expectation can be estimated by averaging over samples of z.
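A minimal sketch of Hutchinson's diagonal estimator follows, using a small explicit synthetic matrix so the estimate can be checked against the true diagonal; in practice Hz would come from a matrix-free Hessian-vector product rather than an explicit H:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
H = M @ M.T  # symmetric stand-in "Hessian"

# diag(H) = E[z • (Hz)] with Rademacher z: each z_i is +1 or -1 with
# probability 1/2, so E[z_i z_j] = 0 for i != j and the off-diagonal
# terms of z • (Hz) vanish in expectation, leaving the diagonal.
num_samples = 20_000
Z = rng.choice([-1.0, 1.0], size=(num_samples, 5))  # Rademacher samples
est = (Z * (Z @ H)).mean(axis=0)                    # Z @ H gives Hz per row

print(np.round(est, 2))
print(np.round(np.diag(H), 2))
```

The estimator is unbiased, but its per-entry variance depends on the off-diagonal mass of H, so a finite number of samples yields only an approximation of the true diagonal.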



¹ Code will be available.

