HESSCALE: SCALABLE COMPUTATION OF HESSIAN DIAGONALS

Abstract

Second-order optimization uses curvature information about the objective function, which can speed up convergence. However, such methods typically require expensive computation of the Hessian matrix, which prevents their use at scale. The absence of efficient ways to compute this information has driven the most widely used methods to rely on first-order approximations that do not capture curvature. In this paper, we develop HesScale, a scalable approach to approximating the diagonal of the Hessian matrix, to incorporate second-order information in a computationally efficient manner. We show that HesScale has the same computational complexity as backpropagation. Our results on supervised classification show that HesScale achieves high approximation accuracy, allowing for scalable and efficient second-order optimization.¹

1. INTRODUCTION

First-order optimization offers a cheap and efficient way of making local progress in optimization problems by using gradient information. However, the performance of first-order methods suffers from instability or slow progress in ill-conditioned landscapes. This problem arises because first-order methods do not capture curvature information, which causes two interrelated issues. First, first-order updates have incorrect units (Duchi et al. 2011), which creates a scaling issue. Second, first-order methods lack parameterization invariance (Martens 2020), in contrast to second-order methods such as natural gradient (Amari 1998) or Newton-Raphson methods. Therefore, some first-order normalization methods were developed to address the invariance problem (Ba et al. 2016, Ioffe & Szegedy 2015, Salimans & Kingma 2016). On the other hand, some recent adaptive step-size methods try to alleviate the scaling issue by using gradient information for first-order curvature approximation (Luo et al. 2019, Duchi et al. 2011, Zeiler 2012, Reddi et al. 2018, Kingma & Ba 2015, Tran & Phong 2019, Tieleman et al. 2012). Specifically, such methods use the empirical Fisher diagonal heuristic: they maintain a moving average of the squared gradients to approximate the diagonal of the Fisher information matrix. Although such methods are widely adopted because of their scalability, the approximations they use are inaccurate. Kunstner et al. (2019) showed that the empirical Fisher does not generally capture curvature information and might have undesirable effects. They argued that the empirical Fisher approximates the Fisher or the Hessian matrices only under strong assumptions that are unlikely to be met in practice. Moreover, Wilson et al. (2017) presented a counterexample where adaptive step-size methods are unable to reduce the error compared to non-adaptive counterparts such as stochastic gradient descent.
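The moving-average heuristic described above can be illustrated with a minimal sketch. It is written in the style of an RMSProp update and is not taken from any of the cited papers. The function names, the hyperparameter values (`lr`, `beta`, `eps`), and the two-parameter quadratic objective are assumptions chosen purely for illustration.

```python
import math

def ema_squared_grad_step(params, grads, v, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp-style step (illustrative; hyperparameters are assumptions).

    v holds an exponential moving average of squared gradients, the quantity
    used as a diagonal "empirical Fisher" estimate by adaptive methods.
    """
    new_params, new_v = [], []
    for p, g, v_i in zip(params, grads, v):
        v_i = beta * v_i + (1.0 - beta) * g * g  # moving average of g^2
        new_v.append(v_i)
        new_params.append(p - lr * g / (math.sqrt(v_i) + eps))
    return new_params, new_v

def quadratic_grad(params, curvatures):
    # Gradient of the toy objective f(w) = 0.5 * sum_i h_i * w_i^2.
    return [h * w for h, w in zip(curvatures, params)]

# Ill-conditioned toy problem: the true Hessian diagonal is (100, 1).
curvatures = [100.0, 1.0]
params = [1.0, 1.0]
v = [0.0, 0.0]
for _ in range(5):
    grads = quadratic_grad(params, curvatures)
    params, v = ema_squared_grad_step(params, grads, v)
```

On this toy objective the squared gradient equals h_i^2 w_i^2, so the running average in `v` changes with the current iterate, while the true Hessian diagonal stays fixed at (100, 1). This is one simple way to see that the heuristic rescales the update without measuring curvature directly.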
Although second-order optimization can speed up the training process by using the geometry of the landscape, its adoption is minimal compared to first-order methods. The exact natural gradient or Newton-Raphson methods require the computation, storage, and inversion of the Fisher information or the Hessian matrices, making them computationally prohibitive in large-scale tasks. Accordingly, many popular second-order methods attempt to approximate these matrices less expensively. For example, a type of truncated-Newton method called Hessian-free methods (Martens 2010) exploits the fact that the Hessian-vector product is cheap (Bekas et al. 2007) and uses the iterative conjugate gradient method to perform an update. However, such methods might require many iterations per update or some tricks to achieve stability, adding computational overhead (Martens & Sutskever 2011).

¹ Code will be available.
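The idea behind Hessian-free methods can be sketched in a few lines, under simplifying assumptions that are not part of the cited work: the Hessian-vector product is approximated by a central difference of gradients, the objective is a small hand-written quadratic, and none of the stabilizing tricks mentioned above are included. The names `grad`, `hvp`, and `newton_step_cg` are hypothetical.

```python
def grad(w):
    # Gradient of the toy objective
    #   f(w) = 0.5 * (3*w0^2 + 2*w0*w1 + 2*w1^2) - w0 - w1,
    # whose Hessian is [[3, 1], [1, 2]]. The Hessian is never formed below.
    return [3.0 * w[0] + w[1] - 1.0, w[0] + 2.0 * w[1] - 1.0]

def hvp(grad_fn, w, v, eps=1e-4):
    """Approximate H(w) v by a central difference of two gradient evaluations."""
    g_plus = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    g_minus = grad_fn([wi - eps * vi for wi, vi in zip(w, v)])
    return [(a - b) / (2.0 * eps) for a, b in zip(g_plus, g_minus)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def newton_step_cg(grad_fn, w, max_iters=10, tol=1e-10):
    """Solve H d = -g with conjugate gradient, using only Hessian-vector products."""
    g = grad_fn(w)
    d = [0.0] * len(w)
    r = [-gi for gi in g]  # residual of H d = -g at the initial guess d = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iters):
        if rs < tol:  # squared residual norm is small enough
            break
        hp = hvp(grad_fn, w, p)
        alpha = rs / dot(p, hp)
        d = [di + alpha * pi for di, pi in zip(d, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return d

w0 = [0.0, 0.0]
step = newton_step_cg(grad, w0)
w1 = [wi + di for wi, di in zip(w0, step)]
```

Each conjugate gradient iteration costs one Hessian-vector product, which here means two extra gradient evaluations. This per-iteration cost is the source of the overhead noted above when many iterations are needed per update.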

