Flatness is a False Friend

Abstract

Hessian-based measures of flatness, such as the trace, Frobenius and spectral norms, have been argued and shown to relate to generalisation. In this paper we demonstrate that, for feed-forward neural networks under the cross-entropy loss, low-loss solutions with large neural network weights have small Hessian-based measures of flatness. This implies that solutions obtained without L2 regularisation should be less sharp than those obtained with it, despite generalising worse. We show this to be true for logistic regression, multilayer perceptrons, simple convolutional, pre-activated and wide residual networks on the MNIST and CIFAR-100 datasets. Furthermore, we show that adaptive optimisation algorithms using iterate averaging, on the VGG-16 network and CIFAR-100 dataset, achieve superior generalisation to SGD but are 30× sharper. These theoretical and experimental results further advocate the need to use flatness in conjunction with the scale of the weights to measure generalisation (Neyshabur et al., 2017; Dziugaite and Roy, 2017).

1. Introduction

Deep Neural Networks (DNNs), with more parameters than data-points and trained with many passes over the same data, still manage to perform exceptionally well on test data. The reasons for this remain largely unresolved (Neyshabur et al., 2017). However, DNNs are not completely immune to the classical problem of over-fitting. Zhang et al. (2016) show that DNNs can perfectly fit random labels. Schedules with initially low or sharply decaying learning rates lead to identical training but much higher test error (Berrada et al., 2018; Granziol et al., 2020a; Jastrzebski et al., 2020). Wilson et al. (2017) argue that specific adaptive gradient optimisers lead to solutions which do not generalise. This has led to significant development of partially adaptive algorithms (Chen and Gu, 2018; Keskar and Socher, 2017). Given the importance of accurate predictions on unseen data, understanding exactly what helps deep networks generalise has been a fundamental area of research. A key concept which has taken a foothold in the community, allowing for the comparison of different training-loss minima using only the training data, is that of flatness. From both a Bayesian and a minimum description length perspective, flatter minima should generalise better than sharp minima (Hochreiter and Schmidhuber, 1997). Sharpness is usually measured by properties of the second derivative of the loss, the Hessian H = ∇²L(w) (Keskar et al., 2016; Jastrzebski et al., 2017b; Chaudhari et al., 2016; Wu et al., 2017; 2018), such as its spectral norm or trace. The assumption is that, due to finite numerical precision (Hochreiter and Schmidhuber, 1997) or from a Bayesian perspective (MacKay, 2003), the test surface is shifted from the training surface. The difference in loss for a shift ∆w from the final training point w* is then

L(w* + ∆w) − L(w*) ≈ ∆wᵀH∆w ≈ Σᵢ λᵢ |φᵢᵀ∆w|² ≈ (Tr(H)/P) ‖∆w‖² ≤ λ₁ ‖∆w‖²,   (1)

where [λᵢ, φᵢ] are the eigenvalue/eigenvector pairs of H ∈ ℝ^{P×P}. We have dropped the terms beyond second order by assuming that the gradient at the end of training is small. In general we have no a priori reason to assume that the shift preferentially lies along any of the Hessian eigenvectors, hence by taking a maximum entropy prior over its direction (MacKay, 2003; Jaynes, 1982) we expect strong high-dimensional concentration results.
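Both steps above can be checked numerically. The following numpy sketch, under an assumed toy logistic-regression setup (the dataset, seed and names such as `w_sep` and `loss_and_hessian` are illustrative, not from the paper), shows two things: that scaling up a separating weight vector under cross-entropy lowers the loss while shrinking Tr(H) and λ₁, i.e. large weights look "flat"; and that for an isotropic unit perturbation ∆w, the mean of ∆wᵀH∆w matches Tr(H)/P, the maximum-entropy step in Eq. (1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_sep = rng.normal(size=d)
y = (X @ w_sep > 0).astype(float)            # linearly separable labels

def loss_and_hessian(w):
    """Cross-entropy loss and exact Hessian H = (1/n) X^T diag(p(1-p)) X."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    H = (X * (p * (1 - p))[:, None]).T @ X / n
    return loss, H

# Scaling the separating direction up lowers the loss yet shrinks every
# Hessian-based flatness measure at the same time.
for alpha in [1.0, 5.0, 25.0]:
    loss, H = loss_and_hessian(alpha * w_sep)
    print(f"alpha={alpha:5.1f}  loss={loss:.4f}  "
          f"Tr(H)={np.trace(H):.5f}  lam_1={np.linalg.eigvalsh(H)[-1]:.5f}")

# Maximum-entropy step of Eq. (1): for isotropic unit perturbations,
# the average of dw^T H dw is Tr(H)/P (here P = d).
_, H = loss_and_hessian(w_sep)
dw = rng.normal(size=(20000, d))
dw /= np.linalg.norm(dw, axis=1, keepdims=True)
quad = np.einsum('ni,ij,nj->n', dw, H, dw)
print(f"mean dw^T H dw = {quad.mean():.5f}   Tr(H)/P = {np.trace(H) / d:.5f}")
```

The Hessian shrinkage follows directly from the functional form: H depends on the weights only through p(1 − p), which vanishes as the logits grow, regardless of how well the model generalises.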

