

Abstract

The curvature of the loss provides rich information on the geometry underlying neural networks, with applications in second order optimisation and Bayesian deep learning. However, accessing curvature information remains a daunting engineering challenge, inaccessible to most practitioners. We therefore provide a software package, the Deep Curvature Suite, which allows easy curvature evaluation for large modern neural networks. Beyond the calculation of a highly accurate moment-matched approximation of the Hessian spectrum using Lanczos, our package provides: extensive loss surface visualisation, calculation of the Hessian variance, and stochastic second order optimisers. We further address and disprove common misconceptions in the literature about the Lanczos algorithm, namely that it learns eigenvalues from the top down. We prove using high dimensional concentration inequalities that, for specific matrices, a single random vector is sufficient for accurate spectral estimation, informing our spectral visualisation method. We showcase our package's practical utility on a series of examples based on realistic modern neural networks, such as the VGG-16 and preactivated ResNets on the CIFAR-10/100 datasets. We further detail three specific use cases enabled by our software: research in stochastic second order optimisation for deep learning, learning rate scheduling using known optimality formulae for convex surfaces, and empirical verification of deep learning theory by comparing empirical and theoretically implied spectra.

1. INTRODUCTION

The success of deep neural networks trained with gradient based optimisers in speech and object recognition (LeCun et al., 2015) has led to an explosion of easy to use, high performance software implementations. Automatic differentiation packages such as TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2017) have become widely adopted. Higher level packages, such as Keras (Chollet, 2015), allow practitioners to state their model, dataset and optimiser in a few lines of code, effortlessly achieving state of the art performance. However, software for extracting second order information, representing the curvature of the loss at a point in weight space, has not kept abreast. Researchers aspiring to evaluate curvature information need to implement their own libraries, which are rarely shared or kept up to date. Naive implementations, which rely on full eigendecomposition (cubic cost in the parameter count), are computationally intractable for all but the smallest of models. Hence, researchers typically ignore curvature information or use highly optimistic approximations. Examples in the literature include the diagonal elements of the matrix or of a surrogate matrix (Chaudhari et al., 2016; Dangel et al., 2019), which we show in Appendix E can be very misleading.
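The key to avoiding the cubic cost of full eigendecomposition is that curvature can be accessed through Hessian-vector products, each costing only a small multiple of a gradient evaluation. As a minimal illustration (not the package's implementation), the sketch below approximates Hv by central differences of the gradient on a toy quadratic; in an autodiff framework one would instead use Pearlmutter's trick, but the cost profile is the same: two gradient calls, no P x P matrix.

```python
import numpy as np

def hvp_finite_diff(grad_fn, w, v, eps=1e-5):
    """Approximate the Hessian-vector product H @ v via central
    differences of the gradient: two gradient evaluations, and the
    P x P Hessian is never formed."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

# Toy quadratic loss L(w) = 0.5 w^T A w, whose Hessian is exactly A.
rng = np.random.default_rng(0)
G = rng.standard_normal((50, 50))
A = G @ G.T / 50                      # symmetric positive semi-definite
grad = lambda w: A @ w

w = rng.standard_normal(50)
v = rng.standard_normal(50)
print(np.allclose(hvp_finite_diff(grad, w, v), A @ v, atol=1e-6))  # True
```

Iterative methods such as Lanczos build their entire spectral approximation out of products like this, which is what makes curvature estimation feasible at modern network scale.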

2. MOTIVATION

The curvature of the loss informs us about the local conditioning of the problem, i.e. the ratio of the largest to smallest Hessian eigenvalues, λ1/λP. This determines the rate of convergence of first order methods and informs us about the optimal learning and momentum rates (Nesterov, 2013). Hence, easily accessible curvature information could allow practitioners to scale their learning rates optimally throughout training, instead of relying on expert scheduling; we investigate this using our software in Section 5.2. The research areas where curvature information features most prominently are analyses of the loss surface and Newton-type optimisation methods.
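For a quadratic with extreme eigenvalues λ1 and λP, the classical results referenced above give the optimal fixed step size 2/(λ1 + λP) and the resulting contraction factor (κ - 1)/(κ + 1), where κ = λ1/λP. The sketch below, on a hypothetical spectrum, verifies numerically that this step size minimises the worst-case per-step contraction |1 - η λ| of gradient descent:

```python
import numpy as np

# Eigenvalues of a hypothetical positive definite Hessian.
lam = np.linspace(0.1, 10.0, 100)
lam_max, lam_min = lam.max(), lam.min()

# Classical results for gradient descent on a quadratic (Nesterov, 2013):
lr_opt = 2.0 / (lam_max + lam_min)   # optimal fixed learning rate
kappa = lam_max / lam_min            # condition number lambda_1 / lambda_P
rate = (kappa - 1) / (kappa + 1)     # optimal per-step contraction factor

# Worst-case contraction of the error over the spectrum for step size lr:
worst = lambda lr: np.abs(1.0 - lr * lam).max()
print(np.isclose(worst(lr_opt), rate))  # True: lr_opt achieves the bound
```

This is exactly the kind of formula that becomes actionable once λ1 and λP can be estimated cheaply during training.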

2.1. LOSS SURFACES

Recent analyses of neural network loss surfaces using full eigendecomposition (Sagun et al., 2016; 2017) have been limited to toy examples with fewer than five thousand parameters. Hence, loss surface visualisations of deep neural networks have often focused on two dimensional slices spanned by random vectors (Li et al., 2017), or on the changes in the loss when traversing a set of random vectors drawn from the d-dimensional Gaussian distribution (Izmailov et al., 2018). It is not clear that the loss surfaces of modern expressive neural networks, containing millions or billions of dimensions, can be well captured in this manner. Small-scale experiments have shown that neural network Hessians have a large rank degeneracy (Sagun et al., 2016) with a small number of large outlier eigenvalues. However, high dimensional concentration theorems (Vershynin, 2018) guarantee that even a large number of randomly sampled vectors are unlikely to encounter such outliers, and hence have limited ability to discern the geometry between various solutions. Other works that try to distinguish between flat and sharp minima have used the diagonal of the Fisher information matrix (Chaudhari et al., 2016), an assumption we challenge in this paper: specifically, we show in Appendix E that diagonal approximations do not capture key properties in synthetic and real neural network examples. From a practical perspective, specific properties of the loss surface are not captured by the aforementioned approaches. Examples include the flatness as specified by the trace, Frobenius and spectral norms. These measures have been extensively used to characterise the generalisation of a solution found by SGD (Wu et al., 2018; Izmailov et al., 2018; He et al., 2019; Jastrzębski et al., 2017; 2018; Keskar et al., 2016). Under a Bayesian and minimum description length argument (Hochreiter and Schmidhuber, 1997), flatter minima should generalise better than sharper minima.
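The concentration argument above can be made concrete with a toy check: in P dimensions, a random Gaussian direction has overlap of order 1/sqrt(P) with any fixed outlier eigenvector, so random slices of the loss almost never "see" the few sharp outlier directions. A minimal sketch, using a coordinate vector as a stand-in for an outlier eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)
P = 1_000_000                 # parameter count of a modest modern network
e1 = np.zeros(P)
e1[0] = 1.0                   # stand-in for an outlier eigenvector

# A random Gaussian direction, normalised to unit length.
v = rng.standard_normal(P)
v /= np.linalg.norm(v)

# The overlap |<v, e1>| concentrates around 1/sqrt(P) ~ 0.001.
print(abs(v @ e1) < 5 / np.sqrt(P))  # True with overwhelming probability
```

Hence a two dimensional random slice through a million-dimensional surface projects out almost all of the curvature carried by the outliers.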
The magnitude of these outliers has been linked to poor generalisation performance (Keskar et al., 2016) and, as a consequence, to the generalisation benefits of large learning rate SGD (Wu et al., 2018; Jastrzębski et al., 2017). These properties are, in principle, extremely easy to estimate, at a computational cost of a small multiple of a gradient evaluation. However, their calculation is not typically included in standard deep learning frameworks, which limits the ability of researchers to undertake such analysis. Other important areas of loss surface investigation include understanding the effectiveness of batch normalisation (Ioffe and Szegedy, 2015). Recent convergence proofs (Santurkar et al., 2018) bound the maximal eigenvalue of the Hessian with respect to the activations, and give bounds with respect to the weights on a per layer basis. Bounds on a per layer basis do not imply anything about bounds on the entire Hessian, and furthermore it has been argued that the full spectrum must be calculated to give insights on the alteration of the landscape (Kohler et al., 2018).
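As an example of how cheaply such flatness measures come once matrix-vector products are available, the Hutchinson estimator recovers the trace of the Hessian from a handful of Hessian-vector products, since E[zᵀHz] = tr(H) for z with i.i.d. Rademacher entries. A minimal sketch on an explicit matrix (in a network, `matvec` would be a Hessian-vector product routine):

```python
import numpy as np

def hutchinson_trace(matvec, dim, n_samples=100, seed=0):
    """Estimate tr(H) using only matrix-vector products:
    E[z^T H z] = tr(H) for z with i.i.d. Rademacher (+/-1) entries."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=dim)
        total += z @ matvec(z)
    return total / n_samples

# Check against the exact trace of an explicit symmetric matrix.
rng = np.random.default_rng(1)
G = rng.standard_normal((200, 200))
A = G @ G.T
approx = hutchinson_trace(lambda v: A @ v, 200, n_samples=500)
print(abs(approx - np.trace(A)) / np.trace(A) < 0.1)  # within 10%
```

Each sample costs one matrix-vector product, so for a network the whole estimate is a small multiple of a gradient evaluation, consistent with the cost claim above.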

2.2. SECOND ORDER OPTIMISATION METHODS

Second order optimisation methods solve the minimisation problem for the loss L(w) ∈ R, associated with parameters w ∈ R^{P×1} and a perturbation δw ∈ R^{P×1}, to second order in the Taylor expansion:

δw* = argmin_{δw} [ L(w) + ∇L(w)^T δw + (1/2) δw^T H δw ] = -H^{-1} ∇L(w),   (1)

where H = ∇²L(w) ∈ R^{P×P} is the Hessian. Note that the objective in Equation 1 is not lower bounded unless H is positive semi-definite. Sometimes, such as in deep neural networks, H is not positive definite and a positive definite surrogate is used. Often either a multiple of the identity (known as damping) is added to the Hessian, H → H + γI (Dauphin et al., 2014), or a surrogate positive definite approximation to the Hessian, such as the Generalised Gauss-Newton matrix (GGN) (Martens, 2010; Martens and Sutskever, 2012), is employed. To derive the GGN, we express the loss L(w) = σ(f(w)) in terms of the activation σ of the output of the final layer f(w). The elements of the Hessian can then be written as

H(w)_{ij} = Σ_{k=1}^{d_y} Σ_{l=1}^{d_y} [∂²σ(f(w)) / ∂f_l(w) ∂f_k(w)] [∂f_l(w)/∂w_j] [∂f_k(w)/∂w_i] + Σ_{k=1}^{d_y} [∂σ(f(w))/∂f_k(w)] [∂²f_k(w)/∂w_j ∂w_i].   (2)

The first term on the RHS of Equation 2 is known as the Generalised Gauss-Newton matrix. Despite the success of second order optimisation methods using the GGN for difficult problems on which SGD is known to stall, such as recurrent neural networks (Martens and Sutskever, 2012), or
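The effect of damping can be seen in a three-dimensional toy example: when the Hessian has a negative eigenvalue, the raw Newton step can point uphill along that direction, while adding γI restores a descent direction. This is only an illustrative sketch with a hand-picked diagonal Hessian, not the package's optimiser:

```python
import numpy as np

# Toy indefinite Hessian: one negative eigenvalue makes the raw
# Newton step -H^{-1} g ascend along that direction.
H = np.diag([10.0, 1.0, -0.5])      # indefinite toy Hessian
g = np.array([1.0, 1.0, 1.0])       # gradient at the current point
gamma = 1.0                         # damping strength

step_raw = -np.linalg.solve(H, g)
step_damped = -np.linalg.solve(H + gamma * np.eye(3), g)

# A descent direction satisfies g^T (step) < 0.
print(g @ step_raw < 0)     # False: raw Newton step is not a descent direction
print(g @ step_damped < 0)  # True: the damped step is
```

In practice γ must be large enough to dominate the most negative eigenvalue, which is one reason access to the Hessian spectrum is useful for tuning stochastic second order optimisers.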

