Flatness is a False Friend

Abstract

Hessian-based measures of flatness, such as the trace, Frobenius and spectral norms, have been argued, used and shown to relate to generalisation. In this paper we demonstrate that, for feed-forward neural networks under the cross-entropy loss, low-loss solutions with large neural network weights have small Hessian-based measures of flatness. This implies that solutions obtained without L2 regularisation should be less sharp than those obtained with it, despite generalising worse. We show this to be true for logistic regression, multilayer perceptrons, simple convolutional, pre-activated and wide residual networks on the MNIST and CIFAR-100 datasets. Furthermore, we show that adaptive optimisation algorithms using iterate averaging, on the VGG-16 network and CIFAR-100 dataset, achieve superior generalisation to SGD but are 30× sharper. These theoretical and experimental results further advocate the need to use flatness in conjunction with the scale of the weights to measure generalisation (Neyshabur et al., 2017; Dziugaite and Roy, 2017).

1. Introduction

Deep Neural Networks (DNNs), with more parameters than data points and trained with many passes over the same data, still manage to perform exceptionally well on test data. The reasons for this remain largely unsolved (Neyshabur et al., 2017). However, DNNs are not completely immune to the classical problem of over-fitting. Zhang et al. (2016) show that DNNs can perfectly fit random labels. Schedules with initially low or sharply decaying learning rates lead to identical training but much higher test error (Berrada et al., 2018; Granziol et al., 2020a; Jastrzebski et al., 2020). In Wilson et al. (2017) the authors argue that certain adaptive gradient optimisers lead to solutions which do not generalise. This has led to significant development of partially adaptive algorithms (Chen and Gu, 2018; Keskar and Socher, 2017). Given the importance of accurate predictions on unseen data, understanding exactly what helps deep networks generalise has been a fundamental area of research. A key concept which has taken a foothold in the community, allowing for the comparison of different training loss minima using only the training data, is that of flatness. From both a Bayesian and a minimum description length framework, flatter minima should generalise better than sharp minima (Hochreiter and Schmidhuber, 1997). Sharpness is usually measured by properties of the second derivative of the loss, the Hessian H = ∇²L(w) (Keskar et al., 2016; Jastrzebski et al., 2017b; Chaudhari et al., 2016; Wu et al., 2017; 2018), such as the spectral norm or trace. The assumption is that, due to finite numerical precision (Hochreiter and Schmidhuber, 1997) or from a Bayesian perspective (MacKay, 2003), the test surface is shifted relative to the training surface. The difference between train and test loss for a shift ∆w is given by

$$L(w^* + \Delta w) - L(w^*) \approx \Delta w^T H \Delta w + \dots \approx \sum_{i=1}^{P} \lambda_i |\phi_i^T \Delta w|^2 \approx \frac{\mathrm{Tr}(H)}{P}\,\|\Delta w\|^2 \leq \lambda_1 \|\Delta w\|^2, \quad (1)$$

in which w* is the final training point and [λ_i, φ_i] are the eigenvalue/eigenvector pairs of H ∈ R^{P×P}. We have dropped the terms beyond second order by assuming that the gradient at the end of training is small. In general we have no a priori reason to assume that the shift should preferentially lie along any of the Hessian eigenvectors; hence, by taking a maximum entropy prior (MacKay, 2003; Jaynes, 1982), we expect strong high-dimensional concentration results (Vershynin, 2018) to hold, so that |φ_i^T ∆ŵ|² ≈ 1/P, where ∆ŵ is the normalised version of ∆w. This justifies the trace as a measure of sharpness. In the worst case the shift is completely aligned with the eigenvector corresponding to the largest eigenvalue λ_1, i.e. ∆w^T φ_1 = 1. Hence the spectral norm λ_1 of H serves as a local[foot_0] upper bound on the loss change. The idea of a shift between the training and testing loss surfaces is prolific in the literature and regularly related to generalisation (He et al., 2019; Izmailov et al., 2018; Maddox et al., 2019). Alternative, yet closely related, measures of flatness are also used. Keskar et al. (2016) define a sharp minimiser as one "with a significant number of large positive eigenvalues"; in fact, as can be seen by the Rayleigh-Ritz theorem, the metric they propose, shown in Equation 2, is proportional to the largest eigenvalue:

$$\phi_{w,L}(\epsilon, A) := \frac{\max_{y \in C_\epsilon} L(w + Ay) - L(w)}{1 + L(w)} \leq \kappa(\epsilon)\,\lambda_1, \quad (2)$$

where C_ε is the constraint box as defined in Keskar et al. (2016) and ε controls the box size. As shown by Dinh et al. (2017), this definition of sharpness is approximately given by λ_1 ε²/(2(1 + L(w))), proportional to the largest eigenvalue. This result can be explained intuitively: within a small vicinity of w, the largest change in loss is along the leading eigenvector and is proportional to the largest eigenvalue. Wu et al.
(2017) consider the logarithm of the product of the top k eigenvalues as a proxy measure for the volume of the minimum (a truncated log-determinant). In this paper we exclusively consider the Hessian trace, spectral norm and Frobenius norm as measures of sharpness.

Motivation: There have been numerous positive empirical results relating sharpness and generalisation. Keskar et al. (2016); Rangamani et al. (2019) consider how large-batch vs small-batch stochastic gradient descent (SGD) alters the sharpness of solutions, with smaller batches converging to flatter solutions which generalise better. Jastrzebski et al. (2017a) look at the importance of the ratio of learning rate to batch size for generalisation, finding that large ratios lead to flatter minima (as measured by the spectral norm) and better generalisation. Yao et al. (2018) investigated flat regions of weight space (small spectral norm), showing them to be more robust under adversarial attack. Zhang et al. (2018) show that SGD concentrates in probability on flat minima. Certain algorithmic design choices, such as Entropy-SGD (Chaudhari et al., 2016) and the use of Polyak averaging (Izmailov et al., 2018), have been motivated by considerations of flatness. However, Dinh et al. (2017) show that, by exploiting the positive homogeneity property f(αx) = αf(x) of ReLUs (Rectified Linear Units), any flat minimum can be mapped into a sharp minimum without altering the loss. Since these measures can be arbitrarily distorted, this implies they serve little value as generalisation measures. However, such transformations alter other properties, such as the weight norm. In practice the use of L2 regularisation, which penalises the weight norm, means that optimisers are unlikely to converge to such a solution. It can even be shown that unregularised SGD converges to the minimum-norm solution for simple problems (Wilson et al., 2017), further limiting the practical relevance of such reparameterisation arguments.
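The concentration step in Equation (1), namely that a direction-agnostic unit shift satisfies ∆wᵀH∆w ≈ Tr(H)/P with λ₁ as the worst case, can be checked numerically. The following sketch is our illustration, not code from the paper; the random Wishart-style H is an arbitrary stand-in for a Hessian:

```python
import numpy as np

rng = np.random.default_rng(0)
P = 2000

# Random positive semi-definite "Hessian" stand-in.
A = rng.standard_normal((P, P)) / np.sqrt(P)
H = A @ A.T
eigvals = np.linalg.eigvalsh(H)
lam_max, trace = eigvals[-1], eigvals.sum()

# A direction-agnostic (maximum-entropy) unit shift.
dw = rng.standard_normal(P)
dw /= np.linalg.norm(dw)

quad = dw @ H @ dw  # second-order loss change for a unit shift
print(quad, trace / P, lam_max)
```

For P in the thousands, the quadratic form typically lands within a few percent of Tr(H)/P, consistent with the 1/√P concentration rate, and by the Rayleigh quotient it can never exceed λ₁.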
The question which remains and warrants investigation is: are Hessian-based sharpness metrics at the end of training meaningful metrics for generalisation? We demonstrate both theoretically and experimentally that the answer to this question is a resounding no.

Contributions: To the best of our knowledge, this is the first work which demonstrates theoretically motivated empirical results contrary to purely flatness-based generalisation measures. For the fully connected feed-forward network with ReLU activations and cross-entropy loss, we demonstrate that in the limit of 0 training loss, the spectral norm and trace of the Hessian also go to 0. The key insight is that in order for the loss to go to 0, the weight vector components w_c must tend to infinity. Conversely, this implies that methods which reduce the weight magnitudes, extensively used to aid generalisation (Bishop, 2006; Krogh and Hertz, 1992), make solutions sharper. We present the counter-intuitive result that adding L2 regularisation increases both sharpness and generalisation for Logistic Regression, an MLP, a simple CNN, PreResNet-164 and WideResNet-28×10 on the MNIST and CIFAR-100 datasets. We also present and discuss various amendments to the Hessian which are robust against the arguments presented here and those of Dinh et al. (2017).

Related work: Empirically negative results on flatness and its effect on generalisation have been observed previously. Neyshabur et al. (2017) show that it captures generalisation for large but not small networks. Golatkar et al. (2019) show that the maximum of the trace of the Fisher information correlates better with generalisation than its final value. Jastrzebski et al. (2018) show that it is possible to optimise faster and attain better generalisation performance whilst finding a final sharper region. For small networks, those trained on random labels (with no generalisation) are less sharp than those trained on the true labels.
However, this does not rule out that, for the same network trained on true labels, flatter solutions generalise better. Instead, the main focus has centred around the Hessian's lack of reparameterisation invariance (Neyshabur et al., 2017; Tsuzuku et al., 2019; Rangamani et al., 2019). This has been a primary motivator for normalised definitions of flatness (Tsuzuku et al., 2019; Rangamani et al., 2019), often in a PAC-Bayesian framework. In Ballard et al. (2017); Mehta et al. (2018), it was shown that adding L2 regularisation on the weights, including the bias weights, removed singular modes of the Hessian matrix for a feed-forward artificial neural network with one hidden layer and tanh activation function, employed to fit the XOR data. In Mehta et al. (2018), with the help of an algebraic-geometry interpretation of the loss landscape of deep linear networks, it was proven that a generalised L2 regularisation is guaranteed to remove all singular solutions, leaving the Hessian matrix strictly non-singular at every critical point.

2. Gedanken Experiment: why the Hessian won't do

For a simple illustration let us consider the deep linear model with exponential loss. The deep linear model is often employed as a theoretical tool for its analytical tractability (Kawaguchi, 2016; Lu and Kawaguchi, 2017). In Section 3 we formalise the results for the fully connected feed-forward network with cross-entropy loss. Intuitively, we can think of a feed-forward network as a sum of deep linear networks and the cross-entropy as an approximation to the exponential loss. For 3 parameters and a single datum X, the loss is given by L = exp(w_1 w_2 w_3 X). The Hessian H, its trace Tr(H) and its spectral norm λ_1(H) are given by

$$H = \begin{pmatrix} w_2^2 w_3^2 & w_1 w_2 w_3^2 & w_2^2 w_1 w_3 \\ w_1 w_2 w_3^2 & w_1^2 w_3^2 & w_1^2 w_2 w_3 \\ w_2^2 w_1 w_3 & w_1^2 w_2 w_3 & w_1^2 w_2^2 \end{pmatrix} X^2 \exp(w_1 w_2 w_3 X), \quad (3)$$

$$\mathrm{Tr}(H) = \lambda_1(H) = (w_2^2 w_3^2 + w_2^2 w_1^2 + w_1^2 w_3^2)\, X^2 \exp(w_1 w_2 w_3 X). \quad (4)$$

Smaller losses imply flatter Hessians: Equation (4) shows that under this model the trace and maximum eigenvalue are products of a polynomial in the weights and an exponential in the weights. As the optimiser drives the loss L → 0, we expect the exponential to dominate the polynomial[foot_1]. This implies that methods which reduce the weight magnitude, such as L2 regularisation, which has been extensively shown to aid generalisation (Krogh and Hertz, 1992; Bishop, 2006), should increase Hessian measures of sharpness. We show that this is the case experimentally in Section 4.
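The domination of the exponential over the polynomial can be illustrated directly from Equation (4). A minimal sketch (ours, not from the paper; we take a single datum X = 1 and scale the weights so that w₁w₂w₃ → −∞, driving the loss to 0):

```python
import math

def loss(w1, w2, w3, X=1.0):
    # Exponential loss of the 3-parameter deep linear model.
    return math.exp(w1 * w2 * w3 * X)

def hessian_trace(w1, w2, w3, X=1.0):
    # Tr(H) = lambda_1(H) from Equation (4): polynomial times exponential.
    poly = (w2 * w3) ** 2 + (w1 * w2) ** 2 + (w1 * w3) ** 2
    return poly * X ** 2 * math.exp(w1 * w2 * w3 * X)

# Scaling the weights up grows the polynomial factor, yet the exponential
# factor shrinks faster: both the loss and the trace tend to 0.
for s in [1.0, 2.0, 4.0]:
    print(s, loss(-s, s, s), hessian_trace(-s, s, s))
```

At scale s the trace is 3s⁴ exp(−s³): the quartic growth is crushed by the exponential decay, exactly the mechanism the section describes.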

3. Theoretical Framework

In this section we extend the intuition developed under the deep linear network with exponential loss to more realistic scenarios. Similar to the prior work of Choromanska et al. (2015); Milne (2019), we consider a neural network with a d_x-dimensional input x. Our network has H − 1 hidden layers; we refer to the output as the H'th layer and the input as the 0'th layer. We denote the ReLU activation function as f(x) = max(0, x). Let W_i be the matrix of weights between the (i − 1)'th and i'th layers. For a d_y-dimensional output, the q'th component of the output can be written as

$$z(x_i; w)_q = f(W_H^T f(W_{H-1}^T \cdots f(W_1^T x))) = \sum_{i=1}^{d_x} \sum_{j=1}^{\gamma} x_i A_{i,j} \prod_{k=1}^{H} w_{i,j}^{(k)}, \quad (5)$$

where the indices i, j denote the sum over network inputs and paths respectively, and γ is the number of paths. A_{i,j} ∈ {0, 1} denotes whether the path is active or not, and w_{i,j}^{(q)} denotes the weight of the path segment which connects node i in layer q − 1 with node j in layer q. Layer i has n_i nodes and γ = Π_{q=1}^{H−1} n_q. We formalise our intuition from Section 2 with the following theorem:

Theorem 1. For any feed-forward neural network with ReLU activation functions f(x) = max(0, x), coupled with a softmax output in the final layer and cross-entropy loss, in the limit that the training loss L(w) → 0, the spectral norm λ_1 of the empirical Hessian H = ∇∇L(w) ∈ R^{P×P} also tends to 0.

Proof. The cross-entropy loss ℓ(h(x_i; w), y_i) of a single sample x_i is defined by

$$\ell(h(x_i; w), y_i) = -\sum_{c}^{d_y} (1 - \mathbb{1}[c \neq y_i]) \log h(x_i; w)_c, \quad (6)$$

where d_y is the number of classes and 1[·] is the indicator function, which takes the value 1 for the incorrect classes and 0 for the correct class; z(x_i; w) is the softmax input.
The softmax output h(x_i; w)_c for class c is given by

$$h(x_i; w)_c = \sigma(z)_c = \frac{\exp z(x_i; w)_c}{\sum_{k=1}^{d_y} \exp z(x_i; w)_k} = \Big(1 + \sum_{k \neq c}^{d_y} \exp(z(x_i; w)_k - z(x_i; w)_c)\Big)^{-1}. \quad (7)$$

Hence, combining Equations 6 and 7, the loss per sample can be written as

$$\ell(h(x_i; w), y_i) = \log\Big(1 + \sum_{k \neq c(i)}^{d_y} \exp(z_{k,c(i)})\Big), \quad (8)$$

in which c(i) denotes the correct class for the data point x_i and z_{k,c(i)} = z(x_i; w)_k − z(x_i; w)_{c(i)}. Note that for the per-sample loss ℓ(h(x_i; w), y_i) → 0, we need exp(z_{k,c(i)}) → 0 for all k ≠ c(i). Using the chain rule,

$$\frac{\partial^2 \ell(h(x_i; w), y_i)}{\partial w_l \partial w_m} = \frac{\sum_{k \neq c(i)} \exp(z_{k,c(i)}) \big[\tfrac{\partial^2 z_{k,c(i)}}{\partial w_l \partial w_m} + \tfrac{\partial z_{k,c(i)}}{\partial w_l} \tfrac{\partial z_{k,c(i)}}{\partial w_m}\big]}{1 + \sum_{k \neq c(i)} \exp(z_{k,c(i)})} - \frac{\sum_{k \neq c(i),\, u \neq c(i)} \exp(z_{k,c(i)}) \tfrac{\partial z_{k,c(i)}}{\partial w_l} \exp(z_{u,c(i)}) \tfrac{\partial z_{u,c(i)}}{\partial w_m}}{\big(1 + \sum_{k \neq c(i)} \exp(z_{k,c(i)})\big)^2}. \quad (9)$$

As shown in Milne (2019), the network is differentiable at the majority of points in weight space: specifically, the zero set of non-zero real analytic functions (the network is piecewise analytic in the weights) has Lebesgue measure zero. Hence all we need to show is that the output derivatives tend to ∞ more slowly than the loss tends to 0. To do this we consider

$$\frac{\partial^m}{\partial w^m_{\mu_m,\phi_m}} \sum_{i=1}^{d_x} \sum_{j=1}^{\gamma} x_i A_{i,j} \prod_{k=1}^{H} w_{i,j}^{(k)} = \sum_{i=1}^{d_x} x_i A_{i,j} \sum_{j=1}^{\gamma/\prod_m n_m} \prod_{k \neq m} w_{i,j}^{(k)} \big(1 - \mathbb{1}[w^m_{\mu_m,\phi_m}]\big). \quad (10)$$

We are interested in the limit where the output for the correct class z(x_i; w)_{c(i)} → ∞. Although any individual weight segment w_{i,j}^{(k)} may be zero, this simply deactivates that path segment and reduces the contributing sum. Hence we can absorb such zero weights into the A_{i,j} and, without loss of generality, assume that the weights are bounded below by a minimum absolute value |w_{i,j}^{(k)}| ≥ ε > 0. Note from Equation 10 that only paths containing the weight segment corresponding to the differentiated nodes contribute, hence the bracket on the RHS of this equation is also upper bounded by 1.
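The rewriting of the per-sample cross-entropy in Equation (8) is a purely algebraic identity and can be verified numerically; a minimal check (ours, with arbitrary example logits):

```python
import math

def ce_from_softmax(z, c):
    # Standard cross-entropy: -log softmax(z)_c.
    norm = sum(math.exp(v) for v in z)
    return -math.log(math.exp(z[c]) / norm)

def ce_logit_gap_form(z, c):
    # Equation (8): log(1 + sum_{k != c} exp(z_k - z_c)).
    return math.log(1.0 + sum(math.exp(z[k] - z[c])
                              for k in range(len(z)) if k != c))

z = [2.0, -1.0, 0.5]
a = ce_from_softmax(z, 0)
b = ce_logit_gap_form(z, 0)
print(a, b)
```

The two forms agree to floating-point precision, and growing the correct-class logit drives every gap z_k − z_c to −∞ and the loss to 0, as the proof requires.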
We can upper bound the norm of the differentiated output:

$$\Big\|\frac{\partial^m}{\partial w^m_{\mu_m,\phi_m}} \sum_{i=1}^{d_x} \sum_{j=1}^{\gamma} x_i A_{i,j} \prod_{k=1}^{H} w_{i,j}^{(k)}\Big\| \leq \Big|\sum_{i=1}^{d_x} \sum_{j=1}^{\gamma} x_i A_{i,j} \prod_{k=1}^{H} w_{i,j}^{(k)}\Big|\, \frac{\prod_m n_m}{\epsilon^m}. \quad (11)$$

Hence

$$\Big\|\frac{\partial^m z(x_i; w)_q}{\partial w^m_{\mu_m,\phi_m}}\Big\| \exp(z(x_i; w)_q) \leq |z(x_i; w)_q|\, \frac{\prod_m n_m}{\epsilon^m}\, \exp(z(x_i; w)_q), \quad (12)$$

and hence in the limit ℓ(h(x_i; w), y_i) → 0, all terms in Equation 9 have norms tending to zero. Hence ∂²ℓ(h(x_i; w), y_i)/∂w_l∂w_m → 0. Taking l = m and summing over m, the trace tends to 0; using the Frobenius norm identity, i.e. taking the sum of squares over l, m, we have Σ_{i=1}^P λ_i² → 0 and hence λ_1 → 0.

Remark. By writing the loss in terms of the activation σ at the output of the final layer f(w), i.e. L(w) = σ(f(w)), the Hessian may be expressed using the chain rule as

$$H(w)_{jk} = \frac{1}{N} \sum_{n=1}^{N} \Big[ \sum_{c=0}^{d_y} \sum_{l=0}^{d_y} \frac{\partial^2 \sigma(f(w))}{\partial f_l(w) \partial f_c(w)} \frac{\partial f_l(w)}{\partial w_j} \frac{\partial f_c(w)}{\partial w_k} + \sum_{c=0}^{d_y} \frac{\partial \sigma(f(w))}{\partial f_c(w)} \frac{\partial^2 f_c(w)}{\partial w_j \partial w_k} \Big], \quad (13)$$

where, for the cross-entropy loss and softmax output at exactly 0 loss, ∂σ(f(w))/∂f_c(w) = 0 and ∂²σ(f(w))/∂f_l(w)∂f_c(w) = 0. However, in practice, since the weights are finite, we never have exactly 0 loss. Hence, unlike our proof, which shows that the Hessian is given by a product of a polynomial and an exponential in the weights, which we expect to go to 0 in the limit of large weights and low loss, this simple result does not yield information prior to the loss being exactly 0.
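The path-sum representation of the network output used in the proof can be sanity-checked on a toy one-hidden-layer ReLU network, where each path runs from an input, through a hidden unit gated by its activity indicator, to the output. A minimal sketch (ours; we use a linear output layer for simplicity, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, n_hidden = 3, 4
W1 = rng.standard_normal((d_x, n_hidden))  # layer-1 weights
w2 = rng.standard_normal(n_hidden)         # layer-2 weights (scalar output)
x = rng.standard_normal(d_x)

# Forward pass: z = w2^T relu(W1^T x).
pre = W1.T @ x
z_forward = w2 @ np.maximum(pre, 0.0)

# Path decomposition: sum over input i and hidden unit j of
# x_i * A_j * W1[i, j] * w2[j], with A_j = 1 iff the ReLU is active.
A = (pre > 0).astype(float)
z_paths = sum(x[i] * A[j] * W1[i, j] * w2[j]
              for i in range(d_x) for j in range(n_hidden))
print(z_forward, z_paths)
```

The two computations agree exactly: for a fixed activation pattern the ReLU network is a sum of linear paths, which is precisely what lets the proof treat each path as a product of weight segments.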

4. Weight Decay and Sharpness

Section 3 formalises what was already hypothesised in Section 2: larger weights, required to drive the loss to very low values, are expected to give small Hessian-based measures of sharpness, despite potentially wildly over-fitting the data and generalising poorly. To show that this is relevant in practice, we evaluate the effect of weight-norm-reducing techniques, such as L2 regularisation, on spectral sharpness. L2 is regularly used to help generalisation, having been shown to reduce the effect of static noise on the target (Krogh and Hertz, 1992).

Experimental Setup: We use the deep visualisation suite (Granziol et al., 2019) to visualise the spectrum of the Hessian and calculate the largest eigenvalues. We train all networks using SGD with momentum ρ = 0.9 and varying levels of L2 regularisation (γ/2)||w||², γ ∈ [0, 0.0001, 0.0005]. For further experimental details, such as the learning rate schedule (we use a linear decay schedule with a terminal learning rate of 0.01× the initial) and the finer details of the spectral visualisation method, see Appendix A. Since adding L2 regularisation naturally adds γ to each eigenvalue, as H → H + γI, in our results we do not calculate the Hessian on the regularised loss. For simplicity we focus on the spectral norm, but include results on the Frobenius norm and trace in the Appendix. Furthermore, since for certain simple architectures the inclusion of L2 regularisation also improves optimisation performance, wherever this is the case we relate Hessian-based measures of sharpness to the generalisation gap, for which we simply report the difference between validation and training accuracy. For more modern networks using batch normalisation (Ioffe and Szegedy, 2015), L2 regularisation reduces convergence speed but negligibly alters optimisation performance, so we can relate measures of sharpness to the validation performance directly[foot_2].
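The shift H → H + γI motivating the choice to evaluate the Hessian on the unregularised loss can be checked with finite differences on a toy quadratic loss. This is our sketch, not the paper's code; the matrix A is an arbitrary stand-in for a raw loss Hessian, and central differences are exact for quadratics up to rounding:

```python
import numpy as np

rng = np.random.default_rng(2)
P, gamma = 5, 5e-4
A = rng.standard_normal((P, P))
A = 0.5 * (A + A.T)  # fixed symmetric "Hessian" of the raw loss

def loss(w, reg):
    l = 0.5 * w @ A @ w
    return l + 0.5 * gamma * (w @ w) if reg else l

def fd_hessian(f, w, h=1e-4):
    # Central finite-difference Hessian.
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                       - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * h * h)
    return H

w = rng.standard_normal(P)
H_raw = fd_hessian(lambda v: loss(v, reg=False), w)
H_reg = fd_hessian(lambda v: loss(v, reg=True), w)
print(np.max(np.abs(H_reg - (H_raw + gamma * np.eye(P)))))
```

Every eigenvalue of the regularised Hessian is shifted up by exactly γ, so sharpness measured on the regularised loss would trivially grow with γ; measuring on the raw loss removes that confound.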

Logistic Regression on MNIST:

The simplest neural network model, corresponding to a 0-hidden-layer feed-forward neural network, is the Softmax Regressor[foot_3]. By the diagonal dominance theorem (Cover and Thomas, 2012), the Hessian of the Softmax Regressor is positive semi-definite, so the loss surface is convex. For a convex objective all local minima are global, hence we do not have the complication of different minima. We run this model on the MNIST dataset (LeCun, 1998), splitting the training set into 45,000 training and 5,000 validation samples. The total parameter count is 7,850. We run for 1000 epochs (we specifically use an abnormally large number of epochs to make sure that convergence is not an issue) with learning rate 0.01. The validation accuracy increases incrementally with increased weight decay: [93.48, 94, 94.08]. We plot the spectra of the final solutions in Figure 1. We note that for increasing weight decay coefficient, which corresponds to higher-performing testing solutions, the spectral norm increases: λ_1 = [12.26, 14.17, 15.84]. This shows that greater Hessian-based measures of sharpness occur for solutions with improved generalisation.

MLP on MNIST: We now consider a single-hidden-layer MLP on the MNIST dataset, with a hidden layer of 100 units, parameter count 9960, trained for 50 epochs with an identical schedule and a learning rate of 0.01. We similarly find that the addition of weight decay increases the generalisation accuracy (from 94.4 → 96.46 → 96.7 as we increase the regularisation coefficient γ from 0 → 0.0001 → 0.0005). As in the Softmax example, this also increases the spectral norm, as shown in Figure 2. The training accuracy increases slightly with the introduction of regularisation, but decreases relative to the unregularised network when the regularisation is increased to 0.0005. The generalisation gap for the various levels of regularisation is [0.48%, 0.51%, 0.04%], where the smallest generalisation gap corresponds to the largest spectral norm.
CNN on CIFAR-100: We consider a 9-layer simple convolutional neural network on the CIFAR-100 dataset (Dangel et al., 2019), with parameter count 1,387,108 and a learning rate of α = 0.01 for 300 epochs. We again observe that adding weight decay increases the spectral norm, as shown in Figure 3. For this network, training is also improved by the addition of a little L2 regularisation, but performance decreases with over-regularisation: as the weight decay parameter increases over [0, 10⁻⁴, 5×10⁻⁴], the training performance is [86.3%, 87.9%, 86.0%]. In this particular example the training accuracy is quite low, but there is still a generalisation gap [32.4%, 32.5%, 31.1%]. Furthermore, the generalisation gap is smaller for the network with the largest spectral norm, seen by comparing Figure 3c to Figures 3a and 3b.

WideResNet-28×10: We use a wide residual network on the CIFAR-100 dataset, with parameter count 36,546,980. We observe that the training accuracy remains roughly constant [99.984%, 99.984%, 99.982%] as we increase the regularisation over [0, 10⁻⁴, 5×10⁻⁴]. We are now in the regime where the optimisation benefit of regularisation is negligible, but the generalisation benefit is significant. As shown in Figure 5, the Hessian spectral norm continues to increase with the increased regularisation coefficient γ. The validation set accuracies are [75.2%, 79.5%, 80.6%], and again we see the spectral norm increases as the generalisation gap decreases.

Note on the relationship between learning rate and spectral norm: Under a stability analysis, Wu et al. (2018); Lewkowycz et al. (2020) argue that gradient descent must find λ_1 ≤ 2/α, whereas SGD satisfies a more restrictive condition, i.e. the minima must be even flatter due to the condition of non-uniformity, which involves λ_1(Var(H)), the largest eigenvalue of the variance of the Hessian. Although Wu et al. (2018) remark that this bound is tight in deep learning experiments, we note for our ResNet experiments, with initial learning rate α = 0.1, that as we increase the weight decay even the GD bound of 20 is exceeded. Whilst the learning rate decrease during training by a factor of 100 potentially brings the optimiser to a new region of the loss surface, the new bound of 2000 is certainly not tight.

5. Sharpness and Adaptive optimisation

Given that all high-performing solutions use some form of weight regularisation, we consider whether sharpness can be a useful indicator in the wild for the same neural network trained on the same dataset, but with alternative optimisers and schedules. Since out-of-the-box adaptive gradient methods perform more poorly on the validation/test sets than SGD (Wilson et al., 2017), we compare the sharpness of solutions of the Adam optimiser when combined with decoupled weight decay (Loshchilov and Hutter, 2018) and iterate averaging (Granziol et al., 2020b), which has been shown to generalise better than SGD. We use VGG-16 with batch normalisation on the CIFAR-10/100 datasets. We use a decoupled weight decay of 0.35/0.25 and a learning rate of α = 0.0005. For SGD we use a weight decay γ of 3×10⁻⁴/5×10⁻⁴ and a learning rate of α = 0.1. We plot the validation accuracy curve for CIFAR-100 in Figure 6c, where we see that Adam clearly generalises better than SGD. As shown in Figures 6a and 6b, the spectral norm of the better-performing Adam solution is almost 40× larger than that of the SGD solution; the Frobenius norm of Adam is 0.02 as opposed to 0.0001 for SGD. Both solutions give similar training performance, with Adam at 99.81 and SGD at 99.64. For CIFAR-10, although the generalisation gap is smaller, we see a similar picture, as shown in Figure 7. Investigating the fragility of other commonly employed metrics in this practical scenario, the Frobenius norm of the Adam solution is 1.55×10⁻³ as opposed to 1.43×10⁻⁵ for SGD. Furthermore, we note from Figures 6c and 7c that not only are these solutions sharper, but they also have higher weight norms.

6. Conclusion

In this paper we show that the Hessian can be written as the product of a polynomial and an exponential in the weights, which in the limit of large weights gives rise to flat minima that may overfit and generalise poorly. We show that adding L2 regularisation significantly increases the spectral norm whilst improving generalisation. We also show that certain heuristics used to improve the generalisation of adaptive optimisers give sharp, high-weight-norm solutions which generalise better than the corresponding flatter, lower-norm SGD solutions. This shows that, whilst intuitive, purely Hessian-based flatness measures are not relevant for generalisation: solutions found in practical schedules which generalise significantly better can be much sharper.

A.1 Image Classification Experiments

Hyperparameter Tuning: For SGD and Gadam, we set the momentum parameter to 0.9, whereas for Adam we set (β_1, β_2) = (0.9, 0.999) and ε = 10⁻⁸, their default values. For SGD, we grid-search initial learning rates in the range [0.01, 0.03, 0.1] for all experiments with a fixed weight decay; for Adam and all its variants, we grid-search initial learning rates in the range [10⁻⁴, 3×10⁻⁴, 10⁻³]. After the best learning rate has been identified, we conduct a further search over the weight decay, which we find often leads to a trade-off between convergence speed and final performance. For CIFAR experiments we search in the range [10⁻⁴, 10⁻³], whereas for ImageNet experiments we search in the range [10⁻⁶, 10⁻⁵]. For decoupled weight decay, we search the same range with the weight decay scaled by the initial learning rate.
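The two-stage procedure described above (first grid-search the learning rate at a fixed weight decay, then tune the weight decay at the chosen rate) can be sketched as follows. This is our illustration only; `train_eval` and the mock objective are hypothetical stand-ins for an actual training run:

```python
def two_stage_search(train_eval, lrs, decays, default_decay):
    """Two-stage grid search: pick the best initial learning rate at a fixed
    weight decay, then tune the weight decay at that learning rate.
    `train_eval(lr, wd)` returns a validation accuracy."""
    best_lr = max(lrs, key=lambda lr: train_eval(lr, default_decay))
    best_wd = max(decays, key=lambda wd: train_eval(best_lr, wd))
    return best_lr, best_wd

# Mock objective standing in for training a network (illustration only).
mock = lambda lr, wd: -(lr - 0.03) ** 2 - (wd - 1e-3) ** 2
lr, wd = two_stage_search(mock, [0.01, 0.03, 0.1], [1e-4, 1e-3], 1e-4)
print(lr, wd)
```

Staging the search this way keeps the number of training runs linear in the grid sizes rather than quadratic, at the cost of ignoring learning-rate/weight-decay interactions.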

A.2 Experimental Details

For all experiments with SGD, we use the following schedule for the learning rate at the t'th epoch, similar to Izmailov et al. (2018):

$$\alpha_t = \begin{cases} \alpha_0, & \text{if } \frac{t}{T} \leq 0.5 \\ \alpha_0 \Big[1 - \frac{(1 - r)(\frac{t}{T} - 0.5)}{0.4}\Big], & \text{if } 0.5 < \frac{t}{T} \leq 0.9 \\ \alpha_0 r, & \text{otherwise,} \end{cases}$$

where α_0 is the initial learning rate. In the motivating logistic regression experiments on MNIST we used T = 50; T = 300 is the total number of epochs budgeted for all CIFAR experiments. We set r = 0.01 for all experiments. For experiments with iterate averaging, we use the following learning rate schedule instead:

$$\alpha_t = \begin{cases} \alpha_0, & \text{if } \frac{t}{T_{avg}} \leq 0.5 \\ \alpha_0 \Big[1 - \frac{(1 - \frac{\alpha_{avg}}{\alpha_0})(\frac{t}{T_{avg}} - 0.5)}{0.4}\Big], & \text{if } 0.5 < \frac{t}{T_{avg}} \leq 0.9 \\ \alpha_{avg}, & \text{otherwise,} \end{cases}$$

where α_avg refers to the (constant) learning rate after iterate averaging activation; in this paper we set α_avg = ½α_0. T_avg is the epoch after which iterate averaging is activated, and the method to determine T_avg was described in the main text. This schedule allows us to adjust the learning rate smoothly in the epochs leading up to iterate averaging activation, through a linear decay mechanism similar to that of the experiments without iterate averaging described above.
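The first schedule can be implemented in a few lines; a sketch (ours), using the CIFAR settings of T = 300, α₀ = 0.1 and r = 0.01 stated above:

```python
def lr_schedule(t, T, alpha_0, r=0.01):
    """Linear-decay schedule: constant for the first half of training,
    linear decay between 0.5T and 0.9T, then constant at alpha_0 * r."""
    frac = t / T
    if frac <= 0.5:
        return alpha_0
    if frac <= 0.9:
        return alpha_0 * (1 - (1 - r) * (frac - 0.5) / 0.4)
    return alpha_0 * r

# Example: the CIFAR settings (T = 300, initial rate 0.1).
print([round(lr_schedule(t, 300, 0.1), 4) for t in (0, 150, 210, 270, 300)])
```

Note the branch boundaries line up: at t/T = 0.9 the linear term equals α₀r, so the decay joins the terminal plateau continuously.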

B Lanczos algorithm

In order to empirically analyse properties of modern neural network spectra with tens of millions of parameters, P = O(10⁷), we use the Lanczos algorithm (Meurant and Strakoš, 2006), provided for deep learning by Granziol et al. (2019). It requires Hessian-vector products, for which we use the Pearlmutter trick (Pearlmutter, 1994) with computational cost O(NP), where N is the dataset size and P is the number of parameters. Hence for m steps the total computational complexity, including re-orthogonalisation, is O(NPm), with a memory cost of O(Pm). In order to obtain accurate spectral density estimates we re-orthogonalise at every step (Meurant and Strakoš, 2006). We exploit the relationship between the Lanczos method and Gaussian quadrature, using random vectors to allow us to learn a discrete approximation of the spectral density. A quadrature rule is a relation of the form

$$\int_a^b f(\lambda)\, d\mu(\lambda) = \sum_{j=1}^{M} \rho_j f(t_j) + R[f] \quad (16)$$

for a function f such that its Riemann-Stieltjes integral and all the moments exist on the measure dμ(λ) on the interval [a, b], and where R[f] denotes the unknown remainder. The nodes t_j of the Gauss quadrature rule are given by the Ritz values, and the weights (or mass) ρ_j by the squares of the first elements of the normalised eigenvectors of the Lanczos tridiagonal matrix (Golub and Meurant, 1994). The main properties of the Lanczos algorithm are summarised in Theorems 2 and 3.

Theorem 2. Let H ∈ R^{N×N} be a symmetric matrix with eigenvalues λ_1 ≥ … ≥ λ_n and corresponding orthonormal eigenvectors z_1, …, z_n. If θ_1 ≥ … ≥ θ_m are the eigenvalues of the matrix T_m obtained after m Lanczos steps and q_1, …, q_k the corresponding Ritz eigenvectors, then

$$\lambda_1 \geq \theta_1 \geq \lambda_1 - \frac{(\lambda_1 - \lambda_n)\tan^2(\theta_1)}{(c_{k-1}(1 + 2\rho_1))^2}, \qquad \lambda_n \leq \theta_k \leq \lambda_n + \frac{(\lambda_1 - \lambda_n)\tan^2(\theta_1)}{(c_{k-1}(1 + 2\rho_1))^2}, \quad (17)$$

where c_k is the Chebyshev polynomial of order k. Proof: see (Golub and Van Loan, 2012).

Theorem 3.
The eigenvalues of T_k are the nodes t_j of the Gauss quadrature rule, and the weights w_j are the squares of the first elements of the normalised eigenvectors of T_k. Proof: see (Golub and Meurant, 1994).

The first term on the RHS of Equation 16, using Theorem 3, can be seen as a discrete approximation to the spectral density matching the first m moments v^T H^m v (Golub and Meurant, 1994; Golub and Van Loan, 2012), where v is the initial seed vector. Using the expectation of quadratic forms for zero-mean, unit-variance random vectors, together with the linearity of trace and expectation,

$$\mathbb{E}_v[v^T H^m v] = \mathrm{Tr}\,\mathbb{E}_v[vv^T H^m] = \mathrm{Tr}(H^m) = \sum_{i=1}^{N} \lambda_i^m = N \int_{\lambda \in D} \lambda^m\, d\mu(\lambda). \quad (18)$$

The error between the expectation over the set of all zero-mean, unit-variance vectors v and the Monte Carlo sum used in practice can be bounded (Hutchinson, 1990; Roosta-Khorasani and Ascher, 2015). However, in the high-dimensional regime N → ∞, we expect the squared overlap of each random vector with an eigenvector of H to satisfy |v^T φ_i|² ≈ 1/N for all i, with high probability. This result can be seen by computing the moments of the overlap for Rademacher vectors, whose elements satisfy P(v_j = ±1) = 0.5. Further analytical results for Gaussian vectors have also been obtained (Cai et al., 2013).
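The trace estimator implied by the quadratic-form identity above, Hutchinson's method with Rademacher probe vectors, can be sketched as follows. This is our illustration; the small dense symmetric matrix stands in for the implicit Hessian-vector product used in practice:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50
A = rng.standard_normal((N, N))
H = 0.5 * (A + A.T)  # symmetric stand-in for the Hessian

def hutchinson_trace(matvec, dim, n_samples, rng):
    """Estimate Tr(H) from matrix-vector products with Rademacher probes,
    using E[v^T H v] = Tr(H) for zero-mean, unit-variance v."""
    total = 0.0
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=dim)
        total += v @ matvec(v)  # one quadratic form v^T H v
    return total / n_samples

est = hutchinson_trace(lambda v: H @ v, N, 5000, rng)
exact = np.trace(H)
print(est, exact)
```

Only matrix-vector products are required, which is what makes the approach viable when H is a P × P neural network Hessian accessed via the Pearlmutter trick rather than stored explicitly.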

C Mathematical Preliminaries

For an input/output pair [x ∈ R^{d_x}, y ∈ R^{d_y}] and a given model h(•; •): R^{d_x} × R^P → R^{d_y}, without loss of generality we consider the family of model functions parameterised by the weight vector w, i.e. H := {h(•; w): w ∈ R^P}, with a given loss ℓ(h(x; w), y): R^{d_y} × R^{d_y} → R. The empirical risk (often denoted the loss in deep learning), its gradient and its Hessian are given by

$$R_{emp}(w) = \frac{1}{N} \sum_{i=1}^{N} \ell(h(x_i; w), y_i), \quad g_{emp}(w) = \nabla R_{emp}, \quad H_{emp}(w) = \nabla^2 R_{emp}. \quad (19)$$

The Hessian describes the curvature at the point w in weight space, and hence the risk surface can be studied through the Hessian. By the spectral theorem, we can rewrite H_emp(w) = Σ_{i=1}^P λ_i φ_i φ_i^T in terms of its eigenvalue/eigenvector pairs [λ_i, φ_i]. In order to characterise H_emp(w) by a single value, authors typically consider the spectral norm, which is given by the largest eigenvalue of H_emp(w), or the normalised trace, which gives the mean eigenvalue. The Hessian contains P² elements, so it cannot be stored or eigendecomposed for all but the simplest of models. Stochastic Lanczos Quadrature (Meurant and Strakoš, 2006) can be used, with computational complexity O(P), to give tight bounds on the extremal eigenvalues and good estimates of Tr(H) and Tr(H²), along with a moment matched



Footnotes:
foot_0: We use the word "local" here because the largest eigenvalue/eigenvector pair may change along the path taken.
foot_1: Consider the function x^n exp(−x^m) → 0 for all n, m as x → ∞.
foot_2: Since we are subtracting essentially the same amount.
foot_3: The multi-class equivalent of Logistic Regression.



Figure 1: Hessian spectrum for Softmax regression after 1000 epochs of SGD on the MNIST dataset, for various L2 regularisation coefficients γ


Figure 2: Hessian spectrum for MLP after 50 epochs of SGD on the MNIST dataset, for various L2 regularisation coefficients γ


(Figure 3c panel: Val = 54.8, γ = 5 × 10⁻⁴)

Figure 3: Hessian spectrum for CNN after 300 epochs of SGD on the CIFAR-100 dataset, for various L2 regularisation coefficients γ

PreResNet-164: We use a pre-activated residual network on the CIFAR-100 dataset with parameter count 1,726,388. Our training performance decreases with increased levels of regularisation [0, 10⁻⁴, 5×10⁻⁴], from [99.987%, 99.985%, 99.87%], but our testing performance increases significantly: [72.78%, 75.56%, 76.76%]. The generalisation gap decreases as we increase the regularisation, yet the Hessian spectral norm continues to increase.

Figure 4: Hessian spectrum for PreResNet-164 after 300 epochs of SGD on the CIFAR-100 dataset, for various L2 regularisation coefficients γ


Figure 5: Hessian spectrum for WideResNet28×10 after 300 epochs of SGD on the CIFAR-100 dataset, for various L2 regularisation coefficients γ, Batch Norm Train mode

Figure 6: Hessian spectrum for VGG-16BN after 300 epochs on the CIFAR-100 dataset, for various optimisation algorithms [SGD, Adam], batch norm train mode

(a) Val = 94.3, SGD

Figure 7: Hessian spectrum for VGG-16BN after 300 epochs of SGD on the CIFAR-10 dataset, for various optimisation algorithms [SGD, Adam], batch norm train mode


approximation of the spectrum. We use the deep learning implementation provided by Granziol et al. (2019). DNNs are typically trained using stochastic gradient descent with momentum, where we iteratively update the weights as w_{t+1} = w_t + ρ(w_t − w_{t−1}) − α g_B(w_t), with ρ the momentum parameter. The gradient is usually taken on a randomly selected sub-sample of size B ≪ N. An epoch is defined as a full training pass over the data, so it comprises ≈ N/B iterations. Often L2 regularisation (also termed weight decay) is added to the loss, which corresponds to R_emp(w) → R_emp(w) + (µ/2)||w||².

How does batch normalisation affect curvature? During training, both the mean and variance of the batch normalisation layers are adapted to the specific batch, whereas at evaluation they are fixed (to their exponential moving averages). This is done so that the transforms can function even if the prediction set contains only 1 sample. Previous works investigating neural network Hessians (Papyan, 2018; Ghorbani et al., 2019) do not consider this free parameter in batch normalisation and its effect on the spectrum. From a sharpness and generalisation perspective, we argue that it is the model making the predictions that we should evaluate. Changing batch normalisation to evaluation mode, we find a somewhat different curvature profile, as shown in Figure 8. In this case the sharpness of the regularised, better-generalising solution in terms of the spectral norm is nearly 1000 times larger than that of the unregularised solution. The Frobenius norm for the regularised solution is 4.9 × 10⁻⁵ as opposed to 9.8 × 10⁻¹², so O(10⁷) larger.

