A DIFFUSION THEORY FOR DEEP LEARNING DYNAMICS: STOCHASTIC GRADIENT DESCENT EXPONENTIALLY FAVORS FLAT MINIMA

Abstract

Stochastic Gradient Descent (SGD) and its variants are the mainstream methods for training deep networks in practice. SGD is known to find flat minima, which often generalize well. However, it has been mathematically unclear how deep learning selects a flat minimum among the many available minima. To answer this question quantitatively, we develop a density diffusion theory that reveals how minima selection depends on minima sharpness and the hyperparameters. To the best of our knowledge, we are the first to prove, theoretically and empirically, that, thanks to the Hessian-dependent covariance of stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima, while Gradient Descent (GD) with injected white noise favors flat minima only polynomially more than sharp minima. We also reveal that, with either a small learning rate or large-batch training, escaping from minima requires a number of iterations that is exponential in the ratio of the batch size to the learning rate. Thus, large-batch training cannot search for flat minima efficiently in realistic computational time.
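As a schematic illustration of this comparison, consider a one-dimensional Kramers-type escape-time scaling (a simplified sketch with illustrative constants, not the exact theorems developed later in the paper):

\[
\tau_{\mathrm{SGD}} \;\propto\; \exp\!\left(\frac{2B\,\Delta L}{\eta\,H_a}\right),
\qquad
\tau_{\mathrm{GD+noise}} \;\propto\; \frac{1}{\sqrt{H_a}}\,\exp\!\left(\frac{\Delta L}{T}\right),
\]

where \(\Delta L\) is the loss barrier height, \(H_a\) the Hessian (sharpness) at the minimum \(a\), \(B\) the batch size, \(\eta\) the learning rate, and \(T\) the temperature of the injected isotropic noise. Because SGD's Hessian-dependent noise places \(H_a\) inside the exponent, escaping a flat minimum (small \(H_a\)) takes exponentially longer, and the escape time is exponential in \(B/\eta\); with isotropic noise, sharpness enters only through the polynomial prefactor.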

1. INTRODUCTION

In recent years, deep learning (LeCun et al., 2015) has achieved great empirical success in various application areas. Due to over-parametrization and the highly complex loss landscapes of deep networks, optimizing deep networks is a difficult task. Stochastic Gradient Descent (SGD) and its variants are the mainstream methods for training them. Empirically, SGD can usually find flat minima among a large number of sharp minima and local minima (Hochreiter & Schmidhuber, 1995; 1997). Many works have reported that the flatness of learned minima relates closely to generalization (Hardt et al., 2016; Zhang et al., 2017a; Arpit et al., 2017; Hoffer et al., 2017; Dinh et al., 2017; Neyshabur et al., 2017; Wu et al., 2017; Dziugaite & Roy, 2017; Kleinberg et al., 2018). Some researchers study flatness itself: they measure flatness (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017; Sagun et al., 2017; Yao et al., 2018), rescale flatness (Tsuzuku et al., 2019; Xie et al., 2020b), and seek flatter minima (Hoffer et al., 2017; Chaudhari et al., 2017; He et al., 2019b; Xie et al., 2020a). However, we still lack a quantitative theory that explains why deep learning dynamics selects a flat minimum.

The diffusion theory is an important theoretical tool for understanding how deep learning dynamics works. It models the diffusion of the probability density of the parameters rather than the parameters themselves. The density diffusion process of Stochastic Gradient Langevin Dynamics (SGLD) under injected isotropic noise has been discussed by Sato & Nakagawa (2014), Raginsky et al. (2017), Zhang et al. (2017b), and Xu et al. (2018). Zhu et al. (2019) revealed that the anisotropic diffusion of SGD often leads to flatter minima than isotropic diffusion. A few papers have quantitatively studied the diffusion process of SGD under the isotropic gradient noise assumption.
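To make the contrast between the two noise sources concrete, the following is a minimal sketch (our illustration; the toy problem, names, and constants are assumptions, not taken from the cited works) comparing minibatch SGD, whose noise arises from data subsampling and hence depends on the local landscape, with a Langevin-style update that injects isotropic Gaussian noise at a fixed temperature:

import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: scalar linear regression with loss L(w) = mean((x_i*w - y_i)^2) / 2.
n, w_true = 1000, 2.0
x = rng.normal(size=n)
y = w_true * x + 0.5 * rng.normal(size=n)

def full_gradient(w):
    return np.mean((x * w - y) * x)

def sgd_step(w, lr=0.1, batch_size=32):
    # Minibatch SGD: the gradient noise comes from subsampling the data,
    # so its covariance depends on the local loss landscape
    # (approximately Hessian-dependent near a minimum).
    idx = rng.choice(n, size=batch_size, replace=False)
    grad = np.mean((x[idx] * w - y[idx]) * x[idx])
    return w - lr * grad

def langevin_step(w, lr=0.1, temperature=0.01):
    # GD with injected isotropic noise (SGLD-style):
    # w <- w - lr * grad + sqrt(2 * lr * T) * eps, with eps ~ N(0, 1).
    # The noise covariance is fixed and landscape-independent.
    grad = full_gradient(w)
    return w - lr * grad + np.sqrt(2.0 * lr * temperature) * rng.normal()

w_sgd, w_gld = 0.0, 0.0
for _ in range(500):
    w_sgd = sgd_step(w_sgd)
    w_gld = langevin_step(w_gld)
print(f"SGD:                  w = {w_sgd:.3f}")
print(f"GD + isotropic noise: w = {w_gld:.3f}")

In the theory developed below, it is precisely this landscape-dependent (anisotropic, Hessian-dependent) covariance of SGD's noise that yields the exponential preference for flat minima.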

Jastrzębski et al. (2017) first studied the minima selection probability of SGD. Smith & Le (2018) presented a Bayesian perspective on the generalization of SGD. Wu et al. (2018) studied the escape problems of SGD from minima.

