Flatness is a False Friend

Abstract

Hessian-based measures of flatness, such as the trace, Frobenius and spectral norms, have been argued, used and shown to relate to generalisation. In this paper we demonstrate that, for feed-forward neural networks under the cross-entropy loss, low-loss solutions with large neural network weights have small Hessian-based measures of flatness. This implies that solutions obtained without L2 regularisation should be less sharp than those obtained with it, despite generalising worse. We show this to be true for logistic regression, multilayer perceptrons, simple convolutional, pre-activated and wide residual networks on the MNIST and CIFAR-100 datasets. Furthermore, we show that adaptive optimisation algorithms using iterate averaging, on the VGG-16 network and CIFAR-100 dataset, achieve superior generalisation to SGD but are 30× sharper. These theoretical and experimental results further advocate the need to use flatness in conjunction with the scale of the weights to measure generalisation (Neyshabur et al., 2017; Dziugaite and Roy, 2017).

1. Introduction

Deep Neural Networks (DNNs), with more parameters than data-points, trained with many passes over the same data, still manage to perform exceptionally on test data. The reasons for this remain largely unresolved (Neyshabur et al., 2017). However, DNNs are not completely immune to the classical problem of over-fitting. Zhang et al. (2016) show that DNNs can perfectly fit random labels. Schedules with initially low or sharply decaying learning rates lead to identical training but much higher test error (Berrada et al., 2018; Granziol et al., 2020a; Jastrzebski et al., 2020). In Wilson et al. (2017) the authors argue that certain adaptive gradient optimisers lead to solutions which do not generalise. This has led to significant development of partially adaptive algorithms (Chen and Gu, 2018; Keskar and Socher, 2017). Given the importance of accurate predictions on unseen data, understanding exactly what helps deep networks generalise has been a fundamental area of research. A key concept which has taken a foothold in the community, allowing for the comparison of different training-loss minima using only the training data, is that of flatness. From both a Bayesian and a minimum description length framework, flatter minima should generalise better than sharp minima (Hochreiter and Schmidhuber, 1997). Sharpness is usually measured by properties of the second derivative of the loss, the Hessian H = ∇²L(w) (Keskar et al., 2016; Jastrzebski et al., 2017b; Chaudhari et al., 2016; Wu et al., 2017; 2018), such as the spectral norm or trace. The assumption is that, due to finite numerical precision (Hochreiter and Schmidhuber, 1997) or from a Bayesian perspective (MacKay, 2003), the test surface is shifted from the training surface. The difference between train and test loss for a shift Δw is then

L(w* + Δw) − L(w*) ≈ (1/2) Δw^T H Δw = (1/2) Σ_i λ_i |φ_i^T Δw|^2 ≈ (Tr(H)/2P) ||Δw||^2 ≤ (λ_1/2) ||Δw||^2    (1)

in which w* is the final training point and [λ_i, φ_i] are the eigenvalue/eigenvector pairs of H ∈ R^{P×P}. We have dropped the terms beyond second order by assuming that the gradient at the end of training is small. In general we have no a priori reason to assume that the shift should preferentially lie along any of the Hessian eigenvectors; hence, taking a maximum-entropy prior (MacKay, 2003; Jaynes, 1982), we expect strong high-dimensional concentration results (Vershynin, 2018) to hold, so that |φ_i^T Δŵ|^2 ≈ 1/P, where Δŵ is the normalised version of Δw. This justifies the trace as a measure of sharpness. In the worst-case scenario the shift is completely aligned with the eigenvector corresponding to the largest eigenvalue λ_1, i.e. Δŵ^T φ_1 = 1. Hence the spectral norm λ_1 of H serves as a local* upper bound on the loss change. The idea of a shift between the training and testing loss surfaces is prolific in the literature and regularly related to generalisation (He et al., 2019; Izmailov et al., 2018; Maddox et al., 2019). Alternative, yet closely related, measures of flatness are also used. Keskar et al. (2016) define a sharp minimiser as one "with a significant number of large positive eigenvalues"; in fact, as can be seen from the Rayleigh–Ritz theorem, the metric which they propose, shown in Equation 2, is proportional to the largest eigenvalue:

φ_{w,L}(ε, A) := (max_{y ∈ C_ε} L(w + Ay) − L(w)) / (1 + L(w)) ≤ κ(ε) λ_1    (2)

where C_ε is the constraint box as defined in Keskar et al. (2016) and ε controls the box size. As shown by Dinh et al. (2017), this definition of sharpness is approximately λ_1 ε^2 / (2(1 + L(w))), proportional to the largest eigenvalue. This result can be understood intuitively: within a small vicinity of w, the largest change in loss is along the leading eigenvector and is proportional to the largest eigenvalue. Wu et al.
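The concentration argument behind Equation 1 can be checked numerically. Below is a minimal NumPy sketch, with a random positive semi-definite matrix standing in for the Hessian and an isotropic unit-norm shift playing the role of Δw; all sizes and names are illustrative, not from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
P = 500  # parameter dimension (illustrative)

# Random positive semi-definite stand-in for the Hessian: H = A A^T / P
A = rng.standard_normal((P, P))
H = A @ A.T / P

lam_max = np.linalg.eigvalsh(H)[-1]  # spectral norm (largest eigenvalue)
trace = np.trace(H)

# Isotropic unit-norm shift: the maximum-entropy prior over directions
dw = rng.standard_normal(P)
dw /= np.linalg.norm(dw)

quad = dw @ H @ dw     # Δw^T H Δw for the random shift
trace_est = trace / P  # Tr(H)/P · ||Δw||^2, with ||Δw|| = 1

# quad concentrates around trace_est and never exceeds lam_max
print(f"quadratic form: {quad:.3f}  trace estimate: {trace_est:.3f}  spectral bound: {lam_max:.3f}")
```

For a random direction the quadratic form lands close to Tr(H)/P and well below the worst-case spectral bound, which is the high-dimensional concentration the text appeals to.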
(2017) consider the logarithm of the product of the top k eigenvalues as a proxy measure for the volume of the minimum (a truncated log-determinant). In this paper we will exclusively consider the Hessian trace, spectral and Frobenius norms as measures of sharpness. However, Dinh et al. (2017) show that, by exploiting the positive homogeneity property of ReLUs (Rectified Linear Units), f(αx) = αf(x) for α > 0, any flat minimum can be mapped to a sharp minimum without altering the loss. Since these measures can be arbitrarily distorted, this suggests that they hold little value as generalisation measures. However, such transformations alter other properties, such as the weight norm. In practice, the use of L2 regularisation, which penalises the weight norm, means that optimisers are unlikely to converge to such solutions. It can even be shown that unregularised SGD converges to the minimum-norm solution for simple problems (Wilson et al., 2017), further limiting the practical relevance of such reparameterisation arguments. The question which remains, and warrants investigation, is: are Hessian-based sharpness metrics at the end of training meaningful metrics for generalisation? We demonstrate both theoretically and experimentally that the answer is a resounding no. Contributions: To the best of our knowledge, this is the first work which provides theoretically motivated empirical results contrary to purely flatness-based generalisation measures. For the fully connected feed-forward network with ReLU activations and cross-entropy loss, we demonstrate that, in the limit of 0 training loss, the spectral norm and trace of the Hessian also go to 0. The key insight is that in order for the loss to go to 0, the weight-vector components w_c must tend to infinity. Conversely, this implies that methods which reduce the weight magnitudes, extensively used to aid generalisation (Bishop, 2006; Krogh and Hertz, 1992), make solutions sharper.
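The reparameterisation of Dinh et al. (2017) can be reproduced in a few lines. A minimal sketch on a tiny random two-layer ReLU network (sizes and weights are arbitrary, purely for illustration), showing that an (α, 1/α) layer rescaling leaves the function, and hence the loss, unchanged while inflating the weight norm that L2 regularisation would penalise:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

# Tiny two-layer ReLU network f(x) = W2 @ relu(W1 @ x)
W1 = rng.standard_normal((16, 8))
W2 = rng.standard_normal((4, 16))
x = rng.standard_normal(8)

# ReLU positive homogeneity, relu(a*z) = a*relu(z) for a > 0, means the
# rescaled pair (a*W1, W2/a) computes exactly the same function.
alpha = 100.0
f_orig = W2 @ relu(W1 @ x)
f_scaled = (W2 / alpha) @ relu((alpha * W1) @ x)

same_function = np.allclose(f_orig, f_scaled)
norm_ratio = np.linalg.norm(alpha * W1) / np.linalg.norm(W1)

print(same_function)  # the function (and so the loss) is unchanged
print(norm_ratio)     # but the first-layer weight norm grows by a factor of alpha
```

Because the loss surface along the rescaled parameters is stretched and compressed by α, Hessian-based sharpness measures at the minimum are distorted accordingly, even though nothing about the predictor has changed.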
We present the counter-intuitive result that adding L2 regularisation increases both sharpness and generalisation, for logistic regression, an MLP, a simple CNN, PreResNet-164 and WideResNet-28×10 on the MNIST and CIFAR-100 datasets.
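The logistic-regression case gives the cleanest view of the theoretical claim. The Hessian is H = (1/n) X^T diag(p(1−p)) X, so on separable data, scaling a separating weight vector drives the cross-entropy loss towards 0 while the factors p_i(1−p_i) → 0 drag Tr(H) to 0 with it. A minimal sketch on synthetic separable data (all names and sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Linearly separable binary data: labels determined by the first feature
X = rng.standard_normal((200, 5))
y = (X[:, 0] > 0).astype(float)

def loss_and_hessian_trace(w, X, y):
    """Cross-entropy loss and Tr(H) for logistic regression."""
    z = np.clip(X @ w, -30.0, 30.0)  # avoid overflow; sigma(+/-30) is effectively 1/0
    p = 1.0 / (1.0 + np.exp(-z))
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # H = (1/n) X^T diag(p(1-p)) X  =>  Tr(H) = (1/n) sum_i p_i(1-p_i) ||x_i||^2
    tr = np.mean(p * (1 - p) * np.sum(X**2, axis=1))
    return loss, tr

w_dir = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # a separating direction
losses, traces = [], []
for scale in [1.0, 10.0, 100.0]:
    loss, tr = loss_and_hessian_trace(scale * w_dir, X, y)
    losses.append(loss)
    traces.append(tr)
    print(f"scale={scale:6.1f}  loss={loss:.4e}  Tr(H)={tr:.4e}")
```

Both columns shrink monotonically as the weight scale grows: the flattest solutions (in the Hessian-trace sense) are exactly the large-norm ones that L2 regularisation forbids.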



*We use the word "local" here because the largest eigenvalue/eigenvector pair may change along the path taken.



There have been numerous positive empirical results relating sharpness and generalisation. Keskar et al. (2016) and Rangamani et al. (2019) consider how large-batch versus small-batch stochastic gradient descent (SGD) alters the sharpness of solutions, finding that smaller batches converge to flatter solutions which generalise better. Jastrzebski et al. (2017a) examine the importance of the ratio of learning rate to batch size for generalisation, finding that large ratios lead to flatter minima (as measured by the spectral norm) and better generalisation. Yao et al. (2018) investigated flat regions of weight space (those with small spectral norm), showing them to be more robust under adversarial attack. Zhang et al. (2018) show that SGD concentrates in probability on flat minima. Certain algorithmic design choices, such as Entropy-SGD (Chaudhari et al., 2016) and the use of Polyak averaging (Izmailov et al., 2018), have been motivated by considerations of flatness.

