HESSCALE: SCALABLE COMPUTATION OF HESSIAN DIAGONALS

Abstract

Second-order optimization uses curvature information about the objective function, which can lead to faster convergence. However, such methods typically require expensive computation of the Hessian matrix, preventing their use at scale. The absence of efficient ways to compute this matrix drove the most widely used methods to rely on first-order approximations that do not capture curvature information. In this paper, we develop HesScale, a scalable approach to approximating the diagonal of the Hessian matrix, to incorporate second-order information in a computationally efficient manner. We show that HesScale has the same computational complexity as backpropagation. Our results on supervised classification show that HesScale achieves high approximation accuracy, allowing for scalable and efficient second-order optimization.

1. INTRODUCTION

First-order optimization offers a cheap and efficient way of making local progress in optimization problems by using gradient information. However, its performance suffers from instability or slow progress in ill-conditioned landscapes. This problem arises because first-order methods do not capture curvature information, which causes two interrelated issues. First, first-order updates have incorrect units (Duchi et al. 2011), which creates a scaling issue. Second, first-order methods lack parameterization invariance (Martens 2020), in contrast to second-order methods such as natural gradient (Amari 1998) or Newton-Raphson methods. Accordingly, some first-order normalization methods were developed to address the invariance problem (Ba et al. 2016, Ioffe & Szegedy 2015, Salimans & Kingma 2016). On the other hand, some recent adaptive step-size methods try to alleviate the scaling issue by using gradient information for first-order curvature approximation (Luo et al. 2019, Duchi et al. 2011, Zeiler 2012, Reddi et al. 2018, Kingma & Ba 2015, Tran & Phong 2019, Tieleman et al. 2012). Specifically, such methods use the empirical Fisher diagonals heuristic, maintaining a moving average of the squared gradients to approximate the diagonal of the Fisher information matrix. Despite the wide adoption of such methods due to their scalability, they use inaccurate approximations. Kunstner et al. (2019) showed that the empirical Fisher does not generally capture curvature information and might have undesirable effects. They argued that the empirical Fisher approximates the Fisher or the Hessian matrices only under strong assumptions that are unlikely to be met in practice. Moreover, Wilson et al. (2017) presented a counterexample in which adaptive step-size methods are unable to reduce the error compared to non-adaptive counterparts such as stochastic gradient descent.
Although second-order optimization can speed up training by using the geometry of the landscape, its adoption is minimal compared to first-order methods. The exact natural-gradient or Newton-Raphson methods require the computation, storage, and inversion of the Fisher information or Hessian matrices, making them computationally prohibitive in large-scale tasks. Accordingly, many popular second-order methods attempt to approximate these matrices less expensively. For example, a type of truncated-Newton method called the Hessian-free method (Martens 2010) exploits the fact that the Hessian-vector product is cheap (Bekas et al. 2007) and uses the iterative conjugate-gradient method to perform an update. However, such methods might require many iterations per update, or some tricks to achieve stability, adding computational overhead (Martens & Sutskever 2011). Some variations approximate only the diagonal of the Hessian matrix using stochastic estimation with matrix-free computations (Chapelle & Erhan 2011, Martens et al. 2012, Yao et al. 2021). Other methods impose probabilistic modeling assumptions and estimate a block-diagonal Fisher information matrix (Martens & Grosse 2015, Botev et al. 2017). Such methods are invariant to reparametrization but are computationally expensive, since they need to perform a matrix inversion for each block. Deterministic diagonal approximations to the Hessian (LeCun et al. 1990, Becker & Lecun 1989) provide some curvature information and are efficient to compute; specifically, they can be implemented to be as efficient as first-order methods. We view this category of approximation methods as scalable second-order methods. In neural networks, curvature backpropagation (Becker & Lecun 1989) can be used to backpropagate the curvature vector. We distinguish this efficient method from other, expensive methods (e.g., Mizutani & Dreyfus 2008, Botev et al. 2017) that backpropagate full Hessian matrices.
Although these diagonal methods show a promising direction for scalable second-order optimization, the approximation quality is sometimes poor with objectives such as cross-entropy (Martens et al. 2012). A scalable second-order method with high-quality approximation is still needed. In this paper, we present HesScale, a high-quality approximation method for the Hessian diagonals. Our method is scalable, with low memory requirements and linear computational complexity, while maintaining high approximation accuracy.

2. BACKGROUND

In this section, we describe the Hessian matrix for neural networks and some existing methods for estimating it. Generally, a Hessian matrix can be computed for any scalar-valued function that is twice differentiable. If $f : \mathbb{R}^n \to \mathbb{R}$ is such a function, then for its argument $\psi \in \mathbb{R}^n$, the Hessian matrix $H \in \mathbb{R}^{n \times n}$ of $f$ with respect to $\psi$ is given by $H_{i,j} = \partial^2 f(\psi) / \partial\psi_i \partial\psi_j$. Here, the $i$th element of a vector $v$ is denoted by $v_i$, and the element at the $i$th row and $j$th column of a matrix $M$ is denoted by $M_{i,j}$. When the need for computing the Hessian matrix arises in deep-learning optimization, the function $f$ is typically the objective function, and the vector $\psi$ is commonly the weight vector of a neural network. Computing and storing an $n \times n$ matrix, where $n$ is the number of weights in a neural network, is expensive. Therefore, many methods approximate the Hessian matrix, or parts of it, with a smaller memory footprint, a smaller computational requirement, or both. A common technique is to exploit the structure of the function to reduce the computation needed. For example, assuming that connections from a certain layer do not affect other layers in a neural network allows one to approximate a block-diagonal Hessian. The computation simplifies further with piece-wise linear activation functions (e.g., ReLU), for which the Generalized Gauss-Newton (GGN) approximation (Schraudolph 2002) is equivalent to the block-diagonal Hessian matrix with linear activation functions. The GGN matrix is favored in second-order optimization since it is positive semi-definite. However, computing a block-diagonal matrix is still demanding, so many approximation methods were developed to reduce the storage and computation requirements of the GGN matrix.
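To make the cost argument concrete, the definition of the diagonal entries can be checked numerically: each $H_{i,i}$ is a second derivative along one coordinate, so a central-difference estimate needs two extra function evaluations per entry, giving $O(n)$ evaluations overall ($O(n^2)$ work when evaluating $f$ itself costs $O(n)$). The sketch below is our own illustration, not part of the paper's method; all names in it are ours.

```python
import numpy as np

def hessian_diagonal_fd(f, psi, h=1e-4):
    """Central-difference estimate of diag(H) of a scalar function f:
    d[i] = (f(psi + h e_i) - 2 f(psi) + f(psi - h e_i)) / h^2."""
    n = psi.size
    d = np.empty(n)
    f0 = f(psi)
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        d[i] = (f(psi + e) - 2.0 * f0 + f(psi - e)) / h**2
    return d

# For the quadratic f(psi) = 0.5 psi^T A psi, the Hessian is exactly A,
# so the finite-difference diagonal should match diag(A).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = A + A.T  # symmetrize
f = lambda psi: 0.5 * psi @ A @ psi
diag_fd = hessian_diagonal_fd(f, rng.standard_normal(5))
```

For a quadratic, the central difference is exact up to floating-point rounding, which makes this a convenient ground-truth check for the approximations discussed below.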
For example, under probabilistic modeling assumptions, the Kronecker-factored Approximate Curvature (KFAC) method (Martens & Grosse 2015) writes the GGN matrix $G$ as a Kronecker product of two smaller matrices: $G = A \otimes B$, where $A = \mathbb{E}[h h^\top]$, $B = \mathbb{E}[g g^\top]$, $h$ is the activation output vector, and $g$ is the gradient of the loss with respect to the activation input vector. The $A$ and $B$ matrices can be estimated by Monte Carlo sampling and an exponential moving average. KFAC is efficient in optimization because it inverts only the small matrices, using the Kronecker-product property $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$. However, KFAC is still expensive due to the storage of the block-diagonal matrices and the computation of Kronecker products, which prevents it from being used as a scalable method. Computing the Hessian diagonal can provide some curvature information with relatively less computation. However, it has been shown that exact computation of the Hessian diagonal typically has quadratic complexity, and algorithms that compute the exact diagonal with less than quadratic complexity are unlikely to exist (Martens et al. 2012). Some stochastic methods provide unbiased estimates of the exact Hessian diagonal. For example, the AdaHessian (Yao et al. 2021) algorithm uses Hutchinson's estimator $\mathrm{diag}(H) = \mathbb{E}[z \odot (Hz)]$, where $z$ is a multivariate random variable with a Rademacher distribution and the expectation is estimated by Monte Carlo sampling with an exponential moving average. Similarly, the GGN-MC method (Dangel et al. 2020) uses the relationship between the Fisher information matrix and the Hessian matrix under probabilistic modeling assumptions to form an MC approximation of the diagonal of the GGN matrix.
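Hutchinson's estimator is easy to state in code. The following sketch (our own illustration; it uses an explicit matrix where a real implementation would use a matrix-free Hessian-vector product) averages $z \odot (Hz)$ over Rademacher samples:

```python
import numpy as np

def hutchinson_diag(H, num_samples, rng):
    """Monte Carlo estimate of diag(H) from Hutchinson's identity
    diag(H) = E[z * (H z)] with Rademacher-distributed z. Only
    Hessian-vector products are needed; `H @ z` stands in for a
    matrix-free HVP."""
    n = H.shape[0]
    est = np.zeros(n)
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=n)
        est += z * (H @ z)
    return est / num_samples

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
H = A + A.T  # a symmetric stand-in "Hessian"
est = hutchinson_diag(H, 5000, rng)
```

The estimator is unbiased, but its per-entry variance is the sum of squared off-diagonal entries in that row, so many samples are needed for a tight estimate; this is the approximation-quality issue raised below.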
Although these stochastic approximation methods are scalable due to their linear, $O(n)$, computational and memory complexity, they suffer from low approximation quality; improving it requires many samples and hence factors of additional computation.

3. THE PROPOSED HESSCALE METHOD

In this section, we present our method for approximating the diagonal of the Hessian at each layer in feed-forward networks, where a backpropagation rule is used to utilize the Hessian of previous layers. We present the derivation of the backpropagation rule for fully connected and convolutional neural networks in supervised learning. A similar derivation for fully connected networks with the mean squared error has been presented before (LeCun et al. 1990, Becker & Lecun 1989). However, we use the exact diagonal of the Hessian at the last layer, with non-linear and non-element-wise output activations such as softmax, and show that it can still be computed with linear computational complexity. We show the derivation for fully connected networks in the following and provide the derivation for convolutional neural networks in Appendix B. We use the supervised classification setting with a collection of data examples. These data examples are generated from some target function $f^*$ mapping the input $x$ to the output $y$, where the $k$-th input-output pair is $(x_k, y_k)$. In this task, the learner is required to predict the output class $y \in \{1, 2, ..., m\}$ given the input vector $x \in \mathbb{R}^d$ by estimating the target function $f^*$. The performance is measured by the cross-entropy loss $\mathcal{L}(p, q) = -\sum_{i=1}^{m} p_i \log q_i$, where $p \in \mathbb{R}^m$ is the one-hot encoded target class and $q \in \mathbb{R}^m$ is the predicted output. The learner is required to reduce the cross-entropy by matching the target class. Consider a neural network with $L$ layers that outputs the predicted output $q$. The neural network is parametrized by the set of weights $\{W_1, ..., W_L\}$, where $W_l$ is the weight matrix at the $l$-th layer, and its element at the $i$th row and $j$th column is denoted by $W_{l,i,j}$. During learning, the parameters of the neural network are changed to reduce the loss.
At each layer $l$, we get the activation output $h_l$ by applying the activation function $\sigma$ to the activation input $a_l$: $h_l = \sigma(a_l)$. We simplify notation by defining $h_0 \doteq x$. The activation output $h_l$ is then multiplied by the weight matrix $W_{l+1}$ of layer $l+1$ to produce the next activation input: $a_{l+1,i} = \sum_{j=1}^{|h_l|} W_{l+1,i,j} h_{l,j}$. We assume here that the activation function is element-wise for all layers except the final layer $L$, where it is the softmax function. The backpropagation equations for the described network (Rumelhart et al. 1986) are given as follows:

$$\frac{\partial \mathcal{L}}{\partial a_{l,i}} = \sum_{k=1}^{|a_{l+1}|} \frac{\partial \mathcal{L}}{\partial a_{l+1,k}} \frac{\partial a_{l+1,k}}{\partial h_{l,i}} \frac{\partial h_{l,i}}{\partial a_{l,i}} = \sigma'(a_{l,i}) \sum_{k=1}^{|a_{l+1}|} \frac{\partial \mathcal{L}}{\partial a_{l+1,k}} W_{l+1,k,i}, \qquad (1)$$

$$\frac{\partial \mathcal{L}}{\partial W_{l,i,j}} = \frac{\partial \mathcal{L}}{\partial a_{l,i}} \frac{\partial a_{l,i}}{\partial W_{l,i,j}} = \frac{\partial \mathcal{L}}{\partial a_{l,i}} h_{l-1,j}. \qquad (2)$$

In the following, we write the equations for the exact Hessian diagonals with respect to the weights, $\partial^2 \mathcal{L} / \partial W_{l,i,j}^2$, which requires the calculation of $\partial^2 \mathcal{L} / \partial a_{l,i}^2$ first:

$$\frac{\partial^2 \mathcal{L}}{\partial a_{l,i}^2} = \frac{\partial}{\partial a_{l,i}} \left( \sigma'(a_{l,i}) \sum_{k=1}^{|a_{l+1}|} \frac{\partial \mathcal{L}}{\partial a_{l+1,k}} W_{l+1,k,i} \right) = \sigma'(a_{l,i})^2 \sum_{k=1}^{|a_{l+1}|} \sum_{p=1}^{|a_{l+1}|} \frac{\partial^2 \mathcal{L}}{\partial a_{l+1,k} \partial a_{l+1,p}} W_{l+1,p,i} W_{l+1,k,i} + \sigma''(a_{l,i}) \sum_{k=1}^{|a_{l+1}|} \frac{\partial \mathcal{L}}{\partial a_{l+1,k}} W_{l+1,k,i}, \qquad (3)$$

$$\frac{\partial^2 \mathcal{L}}{\partial W_{l,i,j}^2} = \frac{\partial}{\partial W_{l,i,j}} \left( \frac{\partial \mathcal{L}}{\partial a_{l,i}} h_{l-1,j} \right) = \frac{\partial^2 \mathcal{L}}{\partial a_{l,i}^2} h_{l-1,j}^2.$$

Since the calculation of $\partial^2 \mathcal{L} / \partial a_{l,i}^2$ depends on the off-diagonal terms, the computational complexity becomes quadratic. Following Becker and Lecun (1989), we approximate the Hessian diagonals by ignoring the off-diagonal terms, which leads to a backpropagation rule with linear computational complexity for our estimates $\widehat{\partial^2 \mathcal{L} / \partial a_{l,i}^2}$ and $\widehat{\partial^2 \mathcal{L} / \partial W_{l,i,j}^2}$:

$$\widehat{\frac{\partial^2 \mathcal{L}}{\partial a_{l,i}^2}} \doteq \sigma'(a_{l,i})^2 \sum_{k=1}^{|a_{l+1}|} \widehat{\frac{\partial^2 \mathcal{L}}{\partial a_{l+1,k}^2}} W_{l+1,k,i}^2 + \sigma''(a_{l,i}) \sum_{k=1}^{|a_{l+1}|} \frac{\partial \mathcal{L}}{\partial a_{l+1,k}} W_{l+1,k,i}, \qquad (4)$$

$$\widehat{\frac{\partial^2 \mathcal{L}}{\partial W_{l,i,j}^2}} \doteq \widehat{\frac{\partial^2 \mathcal{L}}{\partial a_{l,i}^2}} h_{l-1,j}^2. \qquad (5)$$
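To make the recursion concrete, the following sketch re-implements the forward pass and the HesScale backward pass for a small fully connected network with tanh hidden layers, a softmax output, and the cross-entropy loss. This is our own illustrative re-implementation under our own naming, not the authors' code.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def hesscale_mlp(weights, x, p):
    """Return per-layer gradients dL/dW_l and HesScale diagonal Hessian
    estimates for an MLP: h_0 = x, a_l = W_l h_{l-1}, h_l = tanh(a_l),
    softmax + cross-entropy at the last layer."""
    sig_d = lambda a: 1.0 - np.tanh(a) ** 2                    # sigma'
    sig_dd = lambda a: -2.0 * np.tanh(a) * (1.0 - np.tanh(a) ** 2)  # sigma''
    # forward pass, storing activation inputs and outputs
    hs, avs = [x], []
    for W in weights[:-1]:
        a = W @ hs[-1]
        avs.append(a)
        hs.append(np.tanh(a))
    avs.append(weights[-1] @ hs[-1])
    q = softmax(avs[-1])
    # last layer: exact quantities for softmax + cross-entropy
    g_a = q - p          # dL/da_L
    s_a = q - q * q      # exact Hessian diagonal w.r.t. a_L
    g_W = [None] * len(weights)
    s_W = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        g_W[l] = np.outer(g_a, hs[l])        # Eq. 2
        s_W[l] = np.outer(s_a, hs[l] ** 2)   # Eq. 5
        if l > 0:
            W, a = weights[l], avs[l - 1]
            back_g = W.T @ g_a
            s_a = sig_d(a) ** 2 * ((W.T ** 2) @ s_a) + sig_dd(a) * back_g  # Eq. 4
            g_a = sig_d(a) * back_g                                        # Eq. 1
    return g_W, s_W
```

Because the last-layer pre-activations are linear in the last-layer weights, the returned diagonal for the last layer is exact, while hidden-layer diagonals drop the off-diagonal terms as in Eq. 4.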
However, for the last layer, we use the exact Hessian diagonals $\widehat{\partial^2 \mathcal{L} / \partial a_{L,i}^2} \doteq \partial^2 \mathcal{L} / \partial a_{L,i}^2$, since they can be computed in $O(n)$ for the softmax activation function and the cross-entropy loss. More precisely, the exact Hessian diagonal for the cross-entropy loss with softmax is simply $q - q \odot q$, where $q$ is the predicted probability vector and $\odot$ denotes element-wise multiplication. We found empirically that this small change makes a large difference in the approximation quality, as shown in Fig. 1a. Hence, unlike Becker and Lecun (1989), who approximate the Hessian diagonals of the last layer by Eq. 4, we use the exact values directly to achieve higher approximation accuracy. We call this method for Hessian diagonal approximation HesScale and provide its pseudocode for supervised classification in Algorithm 1.

Algorithm 1 HesScale: Computing Hessian diagonals of a neural network layer in classification
Require: Neural network $f$ and a layer number $l$
Require: First-order and second-order information $\partial \mathcal{L}/\partial a_{l+1}$ and $\widehat{\partial^2 \mathcal{L}/\partial a_{l+1}^2}$, unless $l = L$
Require: Input-output pair $(x, y)$
  Set loss function $\mathcal{L}$ to the cross-entropy loss
  Compute preference vector $a_L \leftarrow f(x)$ and target one-hot-encoded vector $p \leftarrow \mathrm{onehot}(y)$
  Compute the predicted probability vector $q \leftarrow \sigma(a_L)$ using the softmax function $\sigma$
  Compute the loss $\mathcal{L}(p, q)$
  if $l = L$ then ▷ Compute Hessian diagonals exactly for the last layer
    $\partial \mathcal{L}/\partial a_L \leftarrow q - p$ ▷ $\partial \mathcal{L}/\partial a_L$ consists of elements $\partial \mathcal{L}/\partial a_{L,i}$
    Compute $\partial \mathcal{L}/\partial W_L$ using Eq. 2
    $\widehat{\partial^2 \mathcal{L}/\partial a_L^2} \leftarrow q - q \odot q$
    Compute $\widehat{\partial^2 \mathcal{L}/\partial W_L^2}$ using Eq. 5
  else
    Compute $\partial \mathcal{L}/\partial a_l$ and $\partial \mathcal{L}/\partial W_l$ using Eq. 1 and Eq. 2
    Compute $\widehat{\partial^2 \mathcal{L}/\partial a_l^2}$ and $\widehat{\partial^2 \mathcal{L}/\partial W_l^2}$ using Eq. 4 and Eq. 5
  end if
  return $\partial \mathcal{L}/\partial W_l$, $\widehat{\partial^2 \mathcal{L}/\partial W_l^2}$, $\partial \mathcal{L}/\partial a_l$, and $\widehat{\partial^2 \mathcal{L}/\partial a_l^2}$

HesScale is not specific to the cross-entropy loss, as the exact Hessian diagonals can be calculated in $O(n)$ for some other widely used loss functions as well. We show this property for the negative log-likelihood function with Gaussian and softmax distributions in Appendix A. The computations can be reduced further by using a linear approximation for the activation functions (dropping the second term in Eq. 4), which corresponds to an approximation of the GGN matrix. We call this variation of our method HesScaleGN. Based on HesScale, we build a stable optimizer, which we call AdaHesScale, given in Algorithm 2. We follow the same style introduced in Adam (Kingma & Ba 2015), using the squared diagonal approximation instead of the squared gradients to update the second-moment moving average. Moreover, we introduce another optimizer based on HesScaleGN, which we call AdaHesScaleGN. We refer the reader to the convergence proof for methods with Hessian diagonals presented by Yao et al. (2021).

Algorithm 2 AdaHesScale
Require: Step size $\alpha$, decay rates $\beta_1, \beta_2$, small constant $\epsilon$, network $f$ with weights $\{W_1, ..., W_L\}$, and dataset $D$
  $t \leftarrow 0$
  for $l$ in $\{1, ..., L\}$ do
    $M_l \leftarrow 0$; $V_l \leftarrow 0$ ▷ Same size as $W_l$
  end for
  for $(x, y)$ in $D$ do
    $t \leftarrow t + 1$
    $r_{L+1} \leftarrow s_{L+1} \leftarrow \emptyset$ ▷ $r_l$ and $s_l$ stand for $\partial \mathcal{L}/\partial a_l$ and $\widehat{\partial^2 \mathcal{L}/\partial a_l^2}$, respectively
    for $l$ in $\{L, L-1, ..., 1\}$ do
      $F_l, S_l, r_l, s_l \leftarrow \mathrm{HesScale}(f, x, y, l, r_{l+1}, s_{l+1})$ ▷ Check Algorithm 1; $F_l$ and $S_l$ stand for $\partial \mathcal{L}/\partial W_l$ and $\widehat{\partial^2 \mathcal{L}/\partial W_l^2}$
      $M_l \leftarrow \beta_1 M_l + (1 - \beta_1) F_l$
      $V_l \leftarrow \beta_2 V_l + (1 - \beta_2) S_l^2$
      $\hat{M}_l \leftarrow M_l / (1 - \beta_1^t)$ ▷ Bias-corrected estimate for $F_l$
      $\hat{V}_l \leftarrow V_l / (1 - \beta_2^t)$ ▷ Bias-corrected estimate for $S_l$
      $W_l \leftarrow W_l - \alpha \hat{M}_l \oslash (\hat{V}_l + \epsilon)^{\frac{1}{2}}$ ▷ $\oslash$ is element-wise division; $A^{\frac{1}{2}}$ is the element-wise square root of $A$
    end for
  end for
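A single AdaHesScale-style update can be sketched as follows. This is our own minimal version of the update rule; the hyperparameter defaults and the placement of $\epsilon$ inside the square root are our assumptions for illustration.

```python
import numpy as np

def adahesscale_step(W, M, V, grad, hess_diag, t,
                     alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaHesScale-style update: Adam's machinery, with the squared
    Hessian-diagonal estimate replacing the squared gradient in the
    second-moment accumulator."""
    M = beta1 * M + (1.0 - beta1) * grad
    V = beta2 * V + (1.0 - beta2) * hess_diag ** 2
    M_hat = M / (1.0 - beta1 ** t)      # bias correction
    V_hat = V / (1.0 - beta2 ** t)
    W = W - alpha * M_hat / np.sqrt(V_hat + eps)
    return W, M, V

# Toy quadratic L(w) = 0.5 * sum(c * w^2): gradient = c * w,
# exact Hessian diagonal = c (so "HesScale" is exact here).
c = np.array([1.0, 10.0, 0.1])
w = np.ones(3)
M = np.zeros(3)
V = np.zeros(3)
for t in range(1, 201):
    w, M, V = adahesscale_step(w, M, V, c * w, c, t)
```

On this quadratic, normalizing by the Hessian diagonal makes the effective per-coordinate step size roughly equal across the very differently curved coordinates, which is the motivation for the update.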

4. APPROXIMATION QUALITY & SCALABILITY OF HESSCALE

In this section, we evaluate HesScale's approximation quality and computational cost and compare them with those of other methods. These measures constitute the criteria we look for in scalable and efficient methods. For our experiments, we implemented HesScale using the BackPACK framework (Dangel et al. 2020), which allows easy implementation of backpropagation of statistics other than the gradient. We start by studying the approximation quality of Hessian diagonals compared to the true values. To measure the approximation quality for different methods, we use the $L_1$ distance between the exact Hessian diagonal and its approximation. Our task here is supervised classification, and data examples are randomly generated. We used a network of three hidden layers with tanh activations, each containing 16 units. The network weights and biases are initialized randomly. The network has six inputs and ten outputs. For each example pair, we compute the exact Hessian diagonal for each layer and its approximation by each method. The errors of all layers are summed and averaged over 1000 data examples for each method. In this experiment, we used 40 different initializations of the network weights, shown as colored dots in Fig. 1a; each point represents the error summed over network layers and averaged over 1000 examples for one initialization. In this figure, we show the average error incurred by each method normalized by the average error incurred by HesScale: any approximation with a normalized error above 1 is worse than HesScale, and any with a normalized error below 1 is better. Moreover, we show the layer-wise error for each method in Fig. 1b. Different Hessian diagonal approximations are considered for comparison with HesScale, including several deterministic and stochastic approximations.
We also include the approximation of the Fisher information matrix obtained by squaring the gradients, denoted by $g^2$, which is widely adopted by first-order methods (e.g., Kingma & Ba 2015). We compare HesScale with three stochastic approximation methods: AdaHessian (Yao et al. 2021), Kronecker-factored Approximate Curvature (KFAC) (Martens & Grosse 2015), and the Monte Carlo (MC) estimate of the GGN matrix (GGN-MC) (Dangel et al. 2020). We also compare HesScale with two deterministic approximation methods: the diagonal of the exact GGN matrix (diag(G)) (Schraudolph 2002) and the diagonal approximation by Becker and Lecun (1989) (BL89). HesScale provides a better approximation than the other deterministic and stochastic methods. For the stochastic methods, we use many MC samples to improve their approximation; their approximation quality is nevertheless still poor. Methods approximating the GGN diagonal do not capture the complete Hessian information, since the GGN and Hessian matrices differ when the activation functions are not piece-wise linear. Although these methods approximate only the GGN diagonal, their approximation is significantly better than AdaHessian's. Among the methods approximating the GGN diagonal, HesScaleGN performs best and is close to the exact GGN diagonal. This experiment clearly shows that HesScale achieves the best approximation quality among the compared stochastic and deterministic methods. Next, we perform another experiment to evaluate the computational cost of our optimizers. Our Hessian approximation methods and the corresponding optimizers have linear computational complexity, as can be seen from Eq. 4 and Eq. 5. However, computing second-order information still incurs extra computation compared to first-order optimizers, which may affect how the total computation scales with the number of parameters.
Hence, we compare the computational cost of our optimizers with that of others for various numbers of parameters. More specifically, we measure the update time of each optimizer: the time needed to backpropagate first-order and second-order information and to update the parameters. We designed two experiments to study the computational cost of first-order and second-order optimizers. In the first experiment, we used a neural network with a single hidden layer. The network has 64 inputs and 512 hidden units with tanh activations. We study the increase in computational time as the number of outputs grows exponentially, which roughly doubles the number of parameters; the set of values we use for the number of outputs is $\{2^4, 2^5, 2^6, 2^7, 2^8, 2^9\}$. The results of this experiment are shown in Fig. 2a. In the second experiment, we used a multi-layer neural network, each layer containing 512 hidden units with tanh activations. The network has 64 inputs and 100 outputs. We study the increase in computational time as the number of layers grows exponentially, which also roughly doubles the number of parameters; the set of values we use for the number of layers is $\{1, 2, 4, 8, 16, 32, 64, 128\}$. The results are shown in Fig. 2b. The points in Fig. 2a and Fig. 2b are averaged over 30 updates, and the standard errors of the means are smaller than the width of each line. On average, the costs of AdaHessian, AdaHesScale, and AdaHesScaleGN are three, two, and 1.25 times the cost of Adam, respectively. Our methods are thus among the most computationally efficient approximation methods for Hessian diagonals. (The GGN curve overlaps with that of H in (a).)

5. EMPIRICAL PERFORMANCE OF HESSCALE IN OPTIMIZATION

In this section, we compare the performance of our optimizers, AdaHesScale and AdaHesScaleGN, with three second-order optimizers: BL89, GGN-MC, and AdaHessian. We also include comparisons to two first-order methods: Adam and SGD. We exclude KFAC and the exact diagonal of the GGN matrix from our comparisons due to their prohibitive computation. Our optimizers are evaluated on the supervised classification problem in a series of experiments using different architectures and three datasets: MNIST, CIFAR-10, and CIFAR-100. Instead of attempting to achieve state-of-the-art performance with specialized techniques and architectures, we follow the DeepOBS benchmarking work (Schneider et al. 2019) and compare the optimizers in their generic form using relatively simple networks. This allows a fairer comparison that does not rely extensively on specialized knowledge of a particular task. In the first experiment, we use the MNIST-MLP task from DeepOBS. The images are flattened and used as inputs to a network of three fully connected layers (1000, 500, and 100 units) with tanh activations. We train each method for 100 epochs with a batch size of 128. We show the training plots in Fig. 7a and the corresponding sensitivity plots in Appendix D, Fig. 9a. In the second experiment, we use the CIFAR10-3C3D task from the DeepOBS benchmarking tasks. The network consists of three convolutional layers with tanh activations, each followed by max pooling, and then two fully connected layers (512 and 256 units) with tanh activations. We train each method for 100 epochs with a batch size of 128. We show the training plots in Fig. 7b and the corresponding sensitivity plots in Fig. 9b. In the third experiment, we use the CIFAR100-3C3D task from DeepOBS. The network is the same as in the second task, except that the activations are ELU. We train each method for 200 epochs with a batch size of 128. We show the training plots in Fig. 8b and the corresponding sensitivity plots in Fig. 10b. In the fourth experiment, we use the CIFAR100-ALL-CNN task from DeepOBS with the ALL-CNN-C network, which consists of 9 convolutional layers (Springenberg et al. 2015) with ELU activations. We use tanh and ELU instead of ReLU, which is used in DeepOBS, to differentiate between the performance of AdaHesScale and AdaHesScaleGN. We show the training plots in Fig. 8a and the corresponding sensitivity plots in Fig. 10a. In the MNIST-MLP and CIFAR10-3C3D experiments, we performed a hyperparameter search for each method to determine the best combination of $\beta_1$, $\beta_2$, and $\alpha$. The range of $\beta_2$ is $\{0.99, 0.999, 0.9999\}$, and the range of $\beta_1$ is $\{0.0, 0.9\}$. The range of step sizes is selected for each method so as to create a convex sensitivity curve. Our criterion was to find the hyperparameter configuration for each method that minimizes the area under the validation-loss curve. The performance of each method was averaged over 30 independent runs, each using the same initial weights for all algorithms in an experiment. Using each method's best hyperparameter configuration on the validation set, we show the performance of each method against the time in seconds needed to complete the required number of epochs, which better depicts the computational efficiency of the methods. Fig. 3a and Fig. 3b show these results on the MNIST-MLP and CIFAR10-3C3D tasks. Moreover, we show the sensitivity of each method to the step size in Fig. 5a and Fig. 5b. We show the time taken by each algorithm in seconds (left) and the learning curves in number of epochs (right). The performance of each method is averaged over 30 independent runs, and the shaded area represents the standard error. In the CIFAR100-ALL-CNN and CIFAR100-3C3D experiments, we used the $\beta_1$ and $\beta_2$ that achieved the best robustness in the previous two tasks, namely 0.9 and 0.999, respectively.
We did a hyperparameter search for each method to determine the best step size with the specified $\beta_1$ and $\beta_2$; the rest of the experimental details are the same as in the first two experiments. Using each method's best hyperparameter configuration on the validation set, we show the performance of each method against the time in seconds needed to complete the required number of epochs. Fig. 4a and Fig. 4b show these results on the CIFAR100-ALL-CNN and CIFAR100-3C3D tasks, and we summarize the results in Appendix E. Our results show that all optimizers except BL89 performed well on the MNIST-MLP task. However, on CIFAR10-3C3D, CIFAR100-3C3D, and CIFAR100-ALL-CNN, we notice that AdaHessian performed worse than all methods except BL89. This result is aligned with AdaHessian's inability to accurately approximate the Hessian diagonals, as shown in Fig. 1. Moreover, AdaHessian required more computational time than all other methods, which is also reflected in Fig. 2. While being time-efficient, AdaHesScaleGN consistently outperformed all methods on CIFAR10-3C3D and CIFAR100-3C3D, and it outperformed all methods except AdaHesScale on CIFAR100-ALL-CNN. This result is aligned with our methods' accurate approximation of the Hessian diagonals. Our experiments indicate that incorporating the HesScale and HesScaleGN approximations in optimization methods can yield significant advantages in both computation and accuracy. AdaHesScale and AdaHesScaleGN outperformed the other optimizers likely due to their accurate approximation of the diagonals of the Hessian and the GGN, respectively.

6. CONCLUSION

HesScale is a scalable and efficient second-order method for approximating the diagonal of the Hessian at every network layer, building on the earlier work of Becker and Lecun (1989). We performed a series of experiments to evaluate HesScale against other scalable algorithms in terms of computational cost and approximation accuracy, and we demonstrated how HesScale can be used to build efficient second-order optimization methods. Our results showed that our methods provide a more accurate approximation while requiring little additional computation.

7. BROADER IMPACT

Second-order information is used in domains other than optimization. For example, some works that alleviate catastrophic forgetting use a utility measure for the network's connections to protect them: typically, an auxiliary loss penalizes deviation of connections from their old values, weighting each by its corresponding importance. Such methods (LeCun et al. 1990, Hassibi & Stork 1993, Dong et al. 2017, Kirkpatrick et al. 2017, Schwarz et al. 2018, Ritter et al. 2018) use the diagonal of the Fisher information matrix or of the Hessian matrix as a utility measure for each weight, and their quality depends heavily on the accuracy of the second-order approximation. Second-order information can also be used in neural-network pruning: Molchanov et al. (2019) showed that a second-order approximation with the exact Hessian diagonal can closely represent the true utility of each weight. An accurate and efficient approximation of the layer-wise Hessian diagonal thus enables HesScale to be used in many important lines of research. Using this second-order information provides a reliable measure of connection utility, so HesScale can potentially improve the performance of neural-network pruning methods and of regularization-based methods for preventing catastrophic forgetting.

A HESSIAN DIAGONALS OF THE LOG-LIKELIHOOD FUNCTION FOR TWO COMMON DISTRIBUTIONS

Here, we provide the diagonals of the Hessian matrix of functions involving the log-likelihood of two common distributions: a normal distribution and a categorical distribution with probabilities represented by a softmax function, which we refer to as a softmax distribution. We show that the exact diagonals can be computed with linear complexity, since computing the diagonal elements does not depend on the off-diagonals in these cases. In the following, we consider the softmax and normal distributions and write the exact Hessian diagonals in both cases.

A.1 SOFTMAX DISTRIBUTION

Consider a cross-entropy function for a discrete probability distribution, $f \doteq -\sum_{i=1}^{|q|} p_i \log q_i(\theta)$, where $q$ is a probability vector that depends on a parameter vector $\theta$, and $p$ is a one-hot vector for the target class. For softmax distributions, $q$ is parametrized by a softmax function: $q_i \doteq e^{\theta_i} / \sum_{j=1}^{|q|} e^{\theta_j}$. In this case, we can write the gradient of the cross-entropy function with respect to $\theta$ as $\nabla_\theta f(\theta) = q - p$. Next, we write the exact diagonal elements of the Hessian matrix as follows:

$$\mathrm{diag}(H_\theta) = \mathrm{diag}(\nabla_\theta (q - p)) = q - q^2,$$

where $q^2$ denotes element-wise squaring of $q$, and the $\nabla$ operator applied to a vector denotes the Jacobian. Computing the exact diagonal of the Hessian matrix depends only on vector operations, which means we can compute it in $O(n)$. The cross-entropy loss is used with the softmax distribution in many important tasks, such as supervised classification and discrete reinforcement-learning control with parameterized policies (Chan et al. 2022).
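The identity $\mathrm{diag}(H_\theta) = q - q^2$ can be checked numerically against finite differences of the cross-entropy. The snippet below is a quick check we add for illustration; the example values are arbitrary.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def cross_entropy(theta, p):
    """f(theta) = -sum_i p_i log q_i(theta) with q = softmax(theta)."""
    return -np.sum(p * np.log(softmax(theta)))

# Arbitrary example values (ours, for illustration only).
theta = np.array([0.3, -1.2, 0.8, 0.1])
p = np.array([0.0, 1.0, 0.0, 0.0])   # one-hot target
q = softmax(theta)
diag_claim = q - q**2                # claimed Hessian diagonal; note it is
                                     # independent of the target p
```

Note that the diagonal depends only on $q$, not on the target $p$: the full Hessian of the cross-entropy with respect to the logits is $\mathrm{diag}(q) - qq^\top$.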

A.2 MULTIVARIATE NORMAL DISTRIBUTION WITH DIAGONAL COVARIANCE

For a multivariate normal distribution with diagonal covariance, the parameter vector $\theta$ is determined by the mean-variance vector pair: $\theta \doteq (\mu, \sigma^2)$. The log-likelihood of a random vector $x$ drawn from this distribution can be written as

$$\log q(x; \mu, \sigma^2) = -\frac{1}{2} (x - \mu)^\top D(\sigma^2)^{-1} (x - \mu) - \frac{1}{2} \log(|D(\sigma^2)|) + c = -\frac{1}{2} (x - \mu)^\top D(\sigma^2)^{-1} (x - \mu) - \frac{1}{2} \log\Big( \prod_{i=1}^{|\sigma|} \sigma_i^2 \Big) + c,$$

where $D(\sigma^2)$ gives a diagonal matrix with $\sigma^2$ on its diagonal, $|M|$ is the determinant of a matrix $M$, and $c$ is a constant. We can write the gradients of the log-likelihood function with respect to $\mu$ and $\sigma^2$ as follows:

$$\nabla_\mu \log q(x; \mu, \sigma^2) = D(\sigma^2)^{-1} (x - \mu) = (x - \mu) \oslash \sigma^2,$$

$$\nabla_{\sigma^2} \log q(x; \mu, \sigma^2) = \frac{1}{2} \big[ (x - \mu)^2 \oslash \sigma^2 - \mathbf{1} \big] \oslash \sigma^2,$$

where $\mathbf{1}$ is an all-ones vector and $\oslash$ denotes element-wise division. Finally, we write the exact diagonals of the Hessian matrix as

$$\mathrm{diag}(H_\mu) = \mathrm{diag}\big(\nabla_\mu [(x - \mu) \oslash \sigma^2]\big) = -\mathbf{1} \oslash \sigma^2,$$

$$\mathrm{diag}(H_{\sigma^2}) = \mathrm{diag}\Big(\nabla_{\sigma^2} \frac{1}{2} \big[ (x - \mu)^2 \oslash \sigma^2 - \mathbf{1} \big] \oslash \sigma^2 \Big) = \big[ 0.5 \cdot \mathbf{1} - (x - \mu)^2 \oslash \sigma^2 \big] \oslash \sigma^4.$$

Clearly, the gradient and the exact Hessian diagonals can be computed in $O(n)$. Log-likelihood functions for normal distributions are used in many important problems, such as variational inference and continuous reinforcement-learning control.
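Both diagonal expressions can be verified against finite differences of the log-likelihood. The snippet below is our own illustrative check with arbitrary example values.

```python
import numpy as np

def log_q(x, mu, var):
    """Log-density of N(mu, diag(var)), dropping the additive constant c."""
    return -0.5 * np.sum((x - mu) ** 2 / var) - 0.5 * np.sum(np.log(var))

# Arbitrary example values (ours, for illustration only).
x = np.array([0.5, -1.0, 2.0])
mu = np.array([0.0, 0.3, 1.0])
var = np.array([0.7, 1.5, 2.0])      # sigma^2

diag_H_mu = -1.0 / var                               # claimed diag(H_mu)
diag_H_var = (0.5 - (x - mu) ** 2 / var) / var ** 2  # claimed diag(H_sigma^2)
```

Here `var ** 2` is $\sigma^4$, so the second line is exactly the $\big[0.5 \cdot \mathbf{1} - (x-\mu)^2 \oslash \sigma^2\big] \oslash \sigma^4$ expression above.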

B HESSCALE WITH CONVOLUTIONAL NEURAL NETWORKS

Here, we derive the Hessian propagation for convolutional neural networks (CNNs). Consider a CNN with L - 1 layers followed by a fully connected layer that outputs the predicted output q. The CNN filters are parameterized by $\{W_1, \dots, W_L\}$, where $W_l$ is the filter matrix at the l-th layer with dimensions $k_{l,1} \times k_{l,2}$, and its element at the i-th row and the j-th column is denoted by $W_{l,i,j}$. For simplicity, we assume that the number of filters at each layer is one; the proof can be extended easily to the general case. The learning algorithm learns the target function $f^*$ by optimizing the loss $L$. During learning, the parameters of the neural network are changed to reduce the loss. At layer l, we get the activation output matrix $H_l$ by applying the activation function $\sigma$ to the activation input $A_l$: $H_l = \sigma(A_l)$. We assume here that the activation function is element-wise for all layers except for the final layer L, where it becomes the softmax function. We simplify notation by defining $H_0 \doteq X$, where $X$ is the input sample. The activation output $H_l$ is then convolved with the weight matrix $W_{l+1}$ of layer l + 1 to produce the next activation input:

$$A_{l+1,i,j} = \sum_{m=0}^{k_{l+1,1}-1} \sum_{n=0}^{k_{l+1,2}-1} W_{l+1,m,n} H_{l,(i+m),(j+n)}.$$

We denote the size of the activation output at the l-th layer by $h_l \times w_l$. The backpropagation equations for the described network are given following Rumelhart et al. (1986):

$$\frac{\partial L}{\partial A_{l,i,j}} = \sum_{m=0}^{k_{l+1,1}-1} \sum_{n=0}^{k_{l+1,2}-1} \frac{\partial L}{\partial A_{l+1,(i-m),(j-n)}} \frac{\partial A_{l+1,(i-m),(j-n)}}{\partial A_{l,i,j}}$$
$$= \sum_{m=0}^{k_{l+1,1}-1} \sum_{n=0}^{k_{l+1,2}-1} \frac{\partial L}{\partial A_{l+1,(i-m),(j-n)}} \sum_{m'=0}^{k_{l+1,1}-1} \sum_{n'=0}^{k_{l+1,2}-1} W_{l+1,m',n'} \frac{\partial H_{l,(i-m+m'),(j-n+n')}}{\partial A_{l,i,j}}$$
$$= \sigma'(A_{l,i,j}) \sum_{m=0}^{k_{l+1,1}-1} \sum_{n=0}^{k_{l+1,2}-1} \frac{\partial L}{\partial A_{l+1,(i-m),(j-n)}} W_{l+1,m,n},$$

$$\frac{\partial L}{\partial W_{l,i,j}} = \sum_{m=0}^{h_l-k_{l,1}} \sum_{n=0}^{w_l-k_{l,2}} \frac{\partial L}{\partial A_{l,m,n}} \frac{\partial A_{l,m,n}}{\partial W_{l,i,j}} = \sum_{m=0}^{h_l-k_{l,1}} \sum_{n=0}^{w_l-k_{l,2}} \frac{\partial L}{\partial A_{l,m,n}} H_{l-1,(i+m),(j+n)}.$$
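The filter-gradient rule can be checked numerically. The sketch below (our construction; the toy loss $L = \tfrac{1}{2}\sum A^2$ and all names are ours, not from the paper) implements the valid cross-correlation and the $\partial L / \partial W$ rule for a single 5x5 input and 3x3 filter, then verifies it against central finite differences:

```python
import numpy as np

def conv_valid(H, W):
    # A[i, j] = sum_{m,n} W[m, n] * H[i+m, j+n]  (valid cross-correlation)
    kh, kw = W.shape
    oh, ow = H.shape[0] - kh + 1, H.shape[1] - kw + 1
    A = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            A[i, j] = np.sum(W * H[i:i+kh, j:j+kw])
    return A

rng = np.random.default_rng(2)
H0 = rng.normal(size=(5, 5))
W = rng.normal(size=(3, 3))

# Toy loss L = 0.5 * sum(A^2), so dL/dA = A.
A = conv_valid(H0, W)
dL_dA = A

# Backprop rule: dL/dW[i,j] = sum_{m,n} dL/dA[m,n] * H0[i+m, j+n].
oh, ow = dL_dA.shape
dL_dW = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        dL_dW[i, j] = np.sum(dL_dA * H0[i:i+oh, j:j+ow])

# Central finite-difference check on each filter weight.
h = 1e-6
fd = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        fd[i, j] = (0.5*np.sum(conv_valid(H0, Wp)**2)
                    - 0.5*np.sum(conv_valid(H0, Wm)**2)) / (2*h)
assert np.allclose(dL_dW, fd, atol=1e-5)
```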
In the following, we write the equations for the exact Hessian diagonals with respect to the weights, $\partial^2 L / \partial W_{l,i,j}^2$, which requires calculating $\partial^2 L / \partial A_{l,i,j}^2$ first:

$$\frac{\partial^2 L}{\partial A_{l,i,j}^2} = \frac{\partial}{\partial A_{l,i,j}} \left[ \sigma'(A_{l,i,j}) \sum_{m=0}^{k_{l+1,1}-1} \sum_{n=0}^{k_{l+1,2}-1} \frac{\partial L}{\partial A_{l+1,(i-m),(j-n)}} W_{l+1,m,n} \right]$$
$$= \sigma'(A_{l,i,j}) \sum_{m,p=0}^{k_{l+1,1}-1} \sum_{n,q=0}^{k_{l+1,2}-1} \frac{\partial^2 L}{\partial A_{l+1,(i-m),(j-n)} \partial A_{l+1,(i-p),(j-q)}} \frac{\partial A_{l+1,(i-p),(j-q)}}{\partial A_{l,i,j}} W_{l+1,m,n} + \sigma''(A_{l,i,j}) \sum_{m=0}^{k_{l+1,1}-1} \sum_{n=0}^{k_{l+1,2}-1} \frac{\partial L}{\partial A_{l+1,(i-m),(j-n)}} W_{l+1,m,n},$$

$$\frac{\partial^2 L}{\partial W_{l,i,j}^2} = \frac{\partial}{\partial W_{l,i,j}} \sum_{m=0}^{h_l-k_{l,1}} \sum_{n=0}^{w_l-k_{l,2}} \frac{\partial L}{\partial A_{l,m,n}} H_{l-1,(i+m),(j+n)} = \sum_{m,p=0}^{h_l-k_{l,1}} \sum_{n,q=0}^{w_l-k_{l,2}} \frac{\partial^2 L}{\partial A_{l,m,n} \partial A_{l,p,q}} \frac{\partial A_{l,p,q}}{\partial W_{l,i,j}} H_{l-1,(i+m),(j+n)}.$$

Since the calculations of $\partial^2 L / \partial A_{l,i,j}^2$ and $\partial^2 L / \partial W_{l,i,j}^2$ depend on the off-diagonal terms, the computational complexity becomes quadratic. Following Becker & LeCun (1989), we approximate the Hessian diagonals by ignoring the off-diagonal terms, which leads to a backpropagation rule with linear computational complexity for our estimates $\widehat{\partial^2 L / \partial W_{l,i,j}^2}$ and $\widehat{\partial^2 L / \partial A_{l,i,j}^2}$:

$$\widehat{\frac{\partial^2 L}{\partial A_{l,i,j}^2}} \doteq \sigma'(A_{l,i,j})^2 \sum_{m=0}^{k_{l+1,1}-1} \sum_{n=0}^{k_{l+1,2}-1} \widehat{\frac{\partial^2 L}{\partial A_{l+1,(i-m),(j-n)}^2}} W_{l+1,m,n}^2 + \sigma''(A_{l,i,j}) \sum_{m=0}^{k_{l+1,1}-1} \sum_{n=0}^{k_{l+1,2}-1} \frac{\partial L}{\partial A_{l+1,(i-m),(j-n)}} W_{l+1,m,n},$$
$$\widehat{\frac{\partial^2 L}{\partial W_{l,i,j}^2}} \doteq \sum_{m=0}^{h_l-k_{l,1}} \sum_{n=0}^{w_l-k_{l,2}} \widehat{\frac{\partial^2 L}{\partial A_{l,m,n}^2}} H_{l-1,(i+m),(j+n)}^2.$$

C APPROXIMATION QUALITY WITH MNIST DATA

We repeat the experiment shown in Fig. 1 with MNIST data points instead of random data points. The experimental details are the same except for two changes. First, we used a larger network, changing the number of units in each hidden layer from 16 to 32. Second, we performed an optimization update with SGD at each data point. The results, shown in Fig. 6, are similar to those in Fig. 1: HesScale gives a better approximation quality than the other methods. This experiment shows that our results hold in realistic settings where learning is involved.
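As a minimal end-to-end check of the Appendix B approximation rules, consider a toy 1-D convolution followed by tanh and a squared error (our construction, not an experiment from the paper). Because this loss is separable across outputs, the off-diagonal terms that the approximation drops are exactly zero, so the diagonal estimate should match finite differences exactly:

```python
import numpy as np

def forward(x, w):
    # 1-D valid cross-correlation followed by tanh.
    n = len(x) - len(w) + 1
    a = np.array([np.dot(w, x[i:i+len(w)]) for i in range(n)])
    return a, np.tanh(a)

rng = np.random.default_rng(4)
x, w = rng.normal(size=6), rng.normal(size=3)
y = rng.normal(size=4)

a, h = forward(x, w)
d1 = 1 - h**2            # tanh'
d2 = -2 * h * d1         # tanh''

# Diagonal propagation rule: hA = sig'^2 * hH + sig'' * gH, where for
# L = 0.5*sum((h - y)^2) we have hH = d2L/dh^2 = 1 and gH = dL/dh = h - y.
hA = d1**2 * 1.0 + d2 * (h - y)

# Filter rule: hW[i] = sum_m hA[m] * x[i+m]^2.
hW = np.array([np.dot(hA, x[i:i+len(hA)]**2) for i in range(len(w))])

# Finite-difference diagonal of the exact Hessian in w.
def loss(w_):
    _, h_ = forward(x, w_)
    return 0.5 * np.sum((h_ - y)**2)

eps = 1e-4
fd = np.array([
    (loss(w + eps*e) - 2*loss(w) + loss(w - eps*e)) / eps**2
    for e in np.eye(3)
])
assert np.allclose(hW, fd, atol=1e-3)
```

In deeper networks the dropped off-diagonals are generally nonzero, which is exactly the approximation error that the MNIST experiment in Appendix C quantifies.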

D OPTIMIZATION PLOTS IN THE NUMBER OF EPOCHS

We give the training loss, training accuracy, validation loss, validation accuracy, test loss, and test accuracy for each of the methods included in our comparison in Fig. 7 and Fig. 8. Moreover, we give the sensitivity plots for β1, β2, and α for each method in Fig. 9 and Fig. 10. The range of step sizes is $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}\}$. We choose β1 to be 0.9 and β2 to be 0.999. Each point for each algorithm represents the average test loss given a set of parameters.

E SUMMARY OF OPTIMIZATION RESULTS

We summarize the final performance of AdaHesScale and AdaHesScaleGN against other optimizers on the train and test sets in Table 1 and Table 2, respectively.



Code will be available.



Figure 1: The averaged error for each method is normalized by the averaged error incurred by HesScale. We show 40 initialization points with the same colors across all methods. The norm of the vector of Hessian diagonals |diag(H)| is shown as a reference.

Figure 2: The average computation time for each step of an update is shown for different optimizers. The computed update time is the time needed by each optimizer to backpropagate gradients or second-order information and to update the parameters of the network. GGN overlaps with H in (a).

Figure 3: MNIST-MLP and CIFAR-10 3C3D classification tasks. Each method is trained for 100 epochs. We show the time taken by each algorithm in seconds (left) and the learning curves in the number of epochs (right). The performance of each method is averaged over 30 independent runs. The shaded area represents the standard error.

Figure 5: Sensitivity of the step size for each method on MNIST-MLP and CIFAR-10 3C3D tasks. We select the best values of β 1 and β 2 for each step size α.

Figure 6: The averaged error for each method is normalized by the averaged error incurred by HesScale for data points coming from MNIST. We show 40 initialization points with the same colors across all methods. The norm of the vector of Hessian diagonals |diag(H)| is shown as a reference.

Figure 7: Learning curves of each algorithm on two tasks, MNIST-MLP and CIFAR-10 3C3D, for 100 epochs. We show the best configuration for each algorithm on the validation set. The best parameter configuration for each algorithm is selected based on the area under the curve for the validation loss.

Figure 8: Learning curves of each algorithm on CIFAR-100 with All-CNN and 3C3D architectures, for 100 epochs. We show the best configuration for each algorithm on the validation set. The best parameter configuration for each algorithm is selected based on the area under the curve for the validation loss.

Figure 9: Parameter sensitivity study for each algorithm on two data sets, MNIST and CIFAR-10. The range of β 2 is {0.99, 0.999, 0.9999} and the range of β 1 is {0.0, 0.9}. Each point for each algorithm represents the average test loss given a set of parameters.

Algorithm 2 AdaHesScale for optimization
Require: Neural network f with weights {W 1 , ..., W L } and a dataset D
Require: Small number ϵ ← 10^-8
Require: Exponential decay rates β 1 , β 2 ∈ [0, 1)
Require: Step size α
Require: Initialize {W 1 , ..., W L }
Initialize time step t ← 0
for l in {L, L-1, ..., 1} do ▷ Set exponential moving averages at time step 0 to zero
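A single-step sketch of the update is given below. This is our reading of the algorithm, Adam-style first and second moments where the second moment tracks the squared HesScale diagonal estimate instead of the squared gradient; the function and argument names are ours, not from the paper:

```python
import numpy as np

def adahesscale_step(w, g, hess_diag, m, v, t, alpha=1e-3,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam-style parameter update; hess_diag is the HesScale estimate
    # of the Hessian diagonal for the weights w.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * hess_diag**2
    m_hat = m / (1 - beta1**t)          # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# One scalar step as a usage example.
w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
w, m, v = adahesscale_step(w, np.array([0.5]), np.array([0.25]), m, v, t=1)
```

Relative to Adam, only the quantity accumulated in v changes, so the per-step cost stays within a constant factor of backpropagation.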

Table 1: Performance of optimization methods on the train sets of different problems.

Table 2: Performance of optimization methods on the test sets of different problems.

