Entropic gradient descent algorithms and wide flat minima

Abstract

The properties of flat minima in the empirical risk landscape of neural networks have been debated for some time. Increasing evidence suggests they possess better generalization capabilities than sharp ones. In this work we first discuss the relationship between two alternative measures of flatness: the local entropy, which is useful for analysis and algorithm development, and the local energy, which is easier to compute and was shown empirically, in extensive tests on state-of-the-art networks, to be the best predictor of generalization capabilities. We show semi-analytically, in simple controlled scenarios, that these two measures correlate strongly with each other and with generalization. We then extend the analysis to the deep learning scenario through extensive numerical validation. We study two algorithms, Entropy-SGD and Replicated-SGD, that explicitly include the local entropy in the optimization objective. We devise a training schedule by which we consistently find flatter minima (by both flatness measures) and improve the generalization error of common architectures (e.g. ResNet, EfficientNet).

1. Introduction

The geometrical structure of the loss landscape of neural networks has been a key topic of study for several decades (Hochreiter & Schmidhuber, 1997; Keskar et al., 2016). One area of ongoing research is the connection between the flatness of the minima found by optimization algorithms like stochastic gradient descent (SGD) and the generalization performance of the network (Baldassi et al., 2020; Keskar et al., 2016). There are open conceptual problems in this context: on the one hand, there is accumulating evidence that flatness is a good predictor of generalization (Jiang et al., 2019); on the other hand, the outputs of modern deep networks with ReLU activations are invariant with respect to rescalings of the weights in different layers (Dinh et al., 2017), which complicates the mathematical picture. General results are lacking. Some initial progress has been made in connecting PAC-Bayes bounds for the generalization gap with flatness (Dziugaite & Roy, 2018). The purpose of this work is to shed light on the connection between flatness and generalization by using methods and algorithms from the statistical physics of disordered systems, and to corroborate the results with a performance study on state-of-the-art deep architectures. Methods from statistical physics have led to several results in recent years. First, wide flat minima have been shown to be a structural property of shallow networks: they exist even when training on random data and are accessible to relatively simple algorithms, even though they coexist with exponentially more numerous minima (Baldassi et al., 2015; 2016a; 2020). We believe this to be an overlooked property of neural networks, one that makes them particularly well suited for learning. In analytically tractable settings, it has been shown that flatness depends on the choice of the loss and activation functions, and that it correlates with generalization (Baldassi et al., 2020; 2019).
In the above-mentioned works, the notion of flatness used was the so-called local entropy (Baldassi et al., 2015; 2016a). It measures the low-loss volume in weight space around a minimizer, as a function of the distance (roughly speaking, it measures the amount of "good" configurations around a given one). This framework is not only useful for analytical calculations; it has also been used to introduce a variety of efficient learning algorithms that focus their search on flat regions (Baldassi et al., 2016a; Chaudhari et al., 2019; 2017). In this paper we call them entropic algorithms. A different notion of flatness, which we refer to as local energy, measures the average profile of the training loss around a minimizer, as a function of the distance (i.e. it measures the typical increase in the training error when moving away from the minimizer). This quantity is intuitively appealing and rather easy to estimate via sampling, even in large systems. In Jiang et al. (2019), several candidate predictors of generalization performance were tested in an extensive numerical study on an array of different networks and tasks, and the local energy was found to be among the best and most consistent. The two notions, local entropy and local energy, are distinct: in a given region of a complex landscape, the local entropy measures the size of the lowest valleys, whereas the local energy measures the average height. In principle, therefore, the two quantities could vary independently. It seems reasonable, however, to conjecture that they are highly correlated under mild assumptions on the roughness of the landscape (which is another way of saying that both are reasonable ways to express the intuitive notion of "flatness").
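To make the sampling-based estimate of the local energy concrete, the following minimal sketch (our own illustration, not code from the works cited above; the toy quadratic loss and all function names are hypothetical) perturbs a minimizer with random directions of fixed norm and averages the resulting increase in the training loss:

```python
import numpy as np

def local_energy_profile(loss_fn, w_star, distances, n_samples=100, seed=0):
    """Estimate the local energy of a minimizer w_star: the average
    increase in loss at a given Euclidean distance, obtained by
    sampling random perturbation directions of fixed norm."""
    rng = np.random.default_rng(seed)
    base = loss_fn(w_star)
    profile = []
    for d in distances:
        deltas = []
        for _ in range(n_samples):
            # draw a random direction and rescale it onto the sphere of radius d
            u = rng.standard_normal(w_star.shape)
            u *= d / np.linalg.norm(u)
            deltas.append(loss_fn(w_star + u) - base)
        profile.append(np.mean(deltas))
    return np.array(profile)

# Toy usage: for the quadratic loss 0.5*||w||^2 around its minimizer w*=0,
# a perturbation of norm d raises the loss by exactly 0.5*d^2.
loss = lambda w: 0.5 * np.sum(w ** 2)
w_star = np.zeros(10)
profile = local_energy_profile(loss, w_star, distances=[0.1, 0.5, 1.0])
```

A monotonically and slowly increasing profile corresponds to the intuitive notion of a flat (low local energy) minimizer; in a real network, `loss_fn` would evaluate the training loss of the perturbed weights.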
In this paper, we first show that for simple systems in controlled conditions, where all relevant quantities can be estimated accurately with the Belief Propagation (BP) algorithm (Mezard & Montanari, 2009), the two notions of flatness are strongly correlated: regions of high local entropy have low local energy, and vice versa. We also confirm that both are correlated with generalization. This justifies the expectation that, even for more complex architectures and datasets, algorithms driven towards high-local-entropy regions will also minimize the local energy, and thus (based on the findings in Jiang et al. (2019)) find minimizers that generalize well. Indeed, we systematically applied two entropic algorithms, Entropy-SGD (eSGD) and Replicated-SGD (rSGD), to state-of-the-art deep architectures, and found that we could improve the generalization performance, at the same computational cost, compared to the original papers in which those architectures were introduced. We believe these results to be an important addition to the current state of knowledge: in Baldassi et al. (2016b), rSGD was applied only to shallow networks with binary weights trained on random patterns, so the present work is the first study of rSGD in a realistic deep neural network setting. Together with the first reported consistent improvement of eSGD over SGD on image classification, these results point to a very promising direction for further research. While we hope to foster the adoption of entropic algorithms by publishing code that can easily be adapted to new architectures, we also believe that the numerical results are important for theoretical research, since they are rooted in a well-defined geometric interpretation of the loss landscape. We also confirmed numerically that the minimizers found in this way have a lower local energy profile, as expected.
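As an illustration of the replicated idea behind rSGD, the following sketch (under our own simplifying assumptions, not the exact algorithm of Baldassi et al. (2016b) nor the schedule used here; full-batch gradients, hypothetical names and toy loss) runs several replicas of the system in parallel, each following its own loss gradient plus an elastic pull toward the replicas' barycenter, which biases the search toward regions where many low-loss configurations lie close together:

```python
import numpy as np

def replicated_gd_step(replicas, grad_fn, lr=0.1, gamma=0.5):
    """One step of a replicated gradient scheme (sketch): each replica
    descends its own gradient plus an elastic attraction of strength
    gamma toward the barycenter of all replicas."""
    center = np.mean(replicas, axis=0)
    return [w - lr * (grad_fn(w) + gamma * (w - center)) for w in replicas]

# Toy usage: three replicas descending the quadratic loss 0.5*||w||^2
# while remaining coupled through their barycenter.
rng = np.random.default_rng(0)
replicas = [rng.standard_normal(5) for _ in range(3)]
grad = lambda w: w  # gradient of 0.5*||w||^2
for _ in range(200):
    replicas = replicated_gd_step(replicas, grad)
```

In practice the gradient would come from mini-batches and the coupling strength would be increased over training, so that the replicas first explore wide regions and then collapse onto a single flat minimizer.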
Remarkably, these results even surpass those of the papers in which the eSGD and rSGD algorithms were originally introduced, thanks to a general improvement in the learning protocol, which we also discuss; apart from that, we used little to no hyper-parameter tuning.



We note, in passing, that an appropriate framework for theoretical studies would be to consider networks with binary weights, for which most such ambiguities are absent.

