Entropic gradient descent algorithms and wide flat minima

Abstract

The properties of flat minima in the empirical risk landscape of neural networks have been debated for some time. Increasing evidence suggests that they generalize better than sharp minima. In this work we first discuss the relationship between two alternative measures of flatness: the local entropy, which is useful for analysis and algorithm development, and the local energy, which is easier to compute and was shown empirically, in extensive tests on state-of-the-art networks, to be the best predictor of generalization capabilities. We show semi-analytically in simple controlled scenarios that these two measures correlate strongly with each other and with generalization. We then extend the analysis to the deep learning scenario through extensive numerical validation. We study two algorithms, Entropy-SGD and Replicated-SGD, that explicitly include the local entropy in the optimization objective. We devise a training schedule with which we consistently find flatter minima (according to both flatness measures) and reduce the generalization error for common architectures (e.g., ResNet, EfficientNet).

1. Introduction

The geometrical structure of the loss landscape of neural networks has been a key topic of study for several decades (Hochreiter & Schmidhuber, 1997; Keskar et al., 2016). One area of ongoing research is the connection between the flatness of the minima found by optimization algorithms such as stochastic gradient descent (SGD) and the generalization performance of the network (Baldassi et al., 2020; Keskar et al., 2016). There are open conceptual problems in this context: on the one hand, there is accumulating evidence that flatness is a good predictor of generalization (Jiang et al., 2019); on the other hand, modern deep networks using ReLU activations are invariant in their outputs with respect to rescalings of the weights in different layers (Dinh et al., 2017), which complicates the mathematical picture¹. General results are lacking. Some initial progress has been made in connecting PAC-Bayes bounds for the generalization gap with flatness (Dziugaite & Roy, 2018). The purpose of this work is to shed light on the connection between flatness and generalization by using methods and algorithms from the statistical physics of disordered systems, and to corroborate the results with a performance study on state-of-the-art deep architectures.

Methods from statistical physics have led to several results in recent years. Firstly, wide flat minima have been shown to be a structural property of shallow networks: they exist even when training on random data and are accessible by relatively simple algorithms, even though they coexist with exponentially more numerous minima (Baldassi et al., 2015; 2016a;
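To make the notion of a flatness measure concrete, the following is a minimal illustrative sketch, not the paper's exact protocol: it estimates a local-energy-style quantity, the average increase in loss under isotropic Gaussian perturbations of the weights around a minimum. The toy quadratic loss, the function names, and the perturbation scale `sigma` are all illustrative assumptions; in a wide (flat) minimum the estimate is small, in a sharp one it is large.

```python
import numpy as np

# Illustrative sketch (not the paper's protocol): estimate a local-energy-style
# flatness measure, E[L(w* + sigma * z)] - L(w*) with z ~ N(0, I), by Monte Carlo.

rng = np.random.default_rng(0)

# Toy quadratic loss L(w) = 0.5 * w^T H w with a known Hessian H, so that
# the minimum is w* = 0 and flatness is controlled by H's spectrum.
def make_loss(hessian):
    return lambda w: 0.5 * w @ hessian @ w

def local_energy(loss, w_star, sigma, n_samples=10_000, rng=rng):
    """Monte-Carlo estimate of the average loss increase around w_star."""
    d = w_star.shape[0]
    deltas = rng.normal(scale=sigma, size=(n_samples, d))
    perturbed = np.array([loss(w_star + dz) for dz in deltas])
    return perturbed.mean() - loss(w_star)

d = 10
flat_H = np.eye(d) * 0.1    # small curvature -> wide, flat minimum
sharp_H = np.eye(d) * 10.0  # large curvature -> sharp minimum

w_star = np.zeros(d)
e_flat = local_energy(make_loss(flat_H), w_star, sigma=0.1)
e_sharp = local_energy(make_loss(sharp_H), w_star, sigma=0.1)
assert e_flat < e_sharp  # the flatter minimum has lower local energy
```

For this quadratic toy model the estimate can be checked analytically: the expected loss increase equals (sigma² / 2) · tr(H), so the sharp minimum's local energy is 100 times larger, illustrating why the measure discriminates between the two kinds of minima.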



¹ We note, in passing, that an appropriate framework for theoretical studies would be to consider networks with binary weights, for which most ambiguities are absent.

