A DIFFUSION THEORY FOR DEEP LEARNING DYNAMICS: STOCHASTIC GRADIENT DESCENT EXPONENTIALLY FAVORS FLAT MINIMA

Abstract

Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training deep networks in practice. SGD is known to find flat minima that often generalize well. However, it is mathematically unclear how deep learning can select a flat minimum among so many minima. To answer this question quantitatively, we develop a density diffusion theory that reveals how minima selection quantitatively depends on the sharpness of minima and the hyperparameters. To the best of our knowledge, we are the first to theoretically and empirically prove that, benefiting from the Hessian-dependent covariance of stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima, while Gradient Descent (GD) with injected white noise favors flat minima only polynomially more than sharp minima. We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima, in terms of the ratio of the batch size to the learning rate. Thus, large-batch training cannot search flat minima efficiently in a realistic computational time.

1. INTRODUCTION

In recent years, deep learning (LeCun et al., 2015) has achieved great empirical success in various application areas. Due to over-parametrization and the highly complex loss landscapes of deep networks, optimizing deep networks is a difficult task. Stochastic Gradient Descent (SGD) and its variants are the mainstream methods for training deep networks. Empirically, SGD can usually find flat minima among a large number of sharp minima and local minima (Hochreiter & Schmidhuber, 1995; 1997). Many papers have reported that flat minima closely relate to generalization (Hardt et al., 2016; Zhang et al., 2017a; Arpit et al., 2017; Hoffer et al., 2017; Dinh et al., 2017; Neyshabur et al., 2017; Wu et al., 2017; Dziugaite & Roy, 2017; Kleinberg et al., 2018). Some researchers study flatness itself: they try to measure flatness (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017; Sagun et al., 2017; Yao et al., 2018), rescale flatness (Tsuzuku et al., 2019; Xie et al., 2020b), and find flatter minima (Hoffer et al., 2017; Chaudhari et al., 2017; He et al., 2019b; Xie et al., 2020a). However, we still lack a quantitative theory that answers why deep learning dynamics selects a flat minimum.

The diffusion theory is an important theoretical tool for understanding how deep learning dynamics works. It models the diffusion process of the probability density of the parameters instead of the model parameters themselves. The density diffusion process of Stochastic Gradient Langevin Dynamics (SGLD) under injected isotropic noise has been discussed by Sato & Nakagawa (2014); Raginsky et al. (2017); Zhang et al. (2017b); Xu et al. (2018). Zhu et al. (2019) revealed that the anisotropic diffusion of SGD often leads to flatter minima than isotropic diffusion. A few papers have quantitatively studied the diffusion process of SGD under the isotropic gradient noise assumption.

However, the related papers mainly analyzed the diffusion process under parameter-independent and isotropic gradient noise, while stochastic gradient noise (SGN) is highly parameter-dependent and anisotropic in deep learning dynamics. Thus, they failed to quantitatively formulate how SGD selects flat minima, which closely depends on the Hessian-dependent structure of SGN. We try to bridge the gap between this qualitative knowledge and a quantitative theory for SGD in the presence of parameter-dependent and anisotropic SGN. Mainly based on Theorem 3.2, we make four contributions:

• The proposed theory formulates the fundamental roles of gradient noise, batch size, the learning rate, and the Hessian in minima selection.
• The SGN covariance is approximately proportional to the Hessian and inversely proportional to the batch size.
• Either a small learning rate or large-batch training requires exponentially many iterations to escape from minima, in terms of the ratio of batch size to learning rate.
• To the best of our knowledge, we are the first to theoretically and empirically reveal that SGD favors flat minima exponentially more than sharp minima.
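As a toy illustration of these claims (a sketch of ours, not the paper's experiments), the following script simulates SGD escaping from a one-dimensional quadratic well of curvature h, with Gaussian gradient noise whose variance is taken proportional to h and inverse to the batch size, as the second contribution above suggests. All names and parameter values (`mean_escape_steps`, `barrier`, etc.) are our own assumptions.

```python
import numpy as np

def mean_escape_steps(h, lr, batch, barrier=0.15, trials=500, max_steps=20000, seed=0):
    """Mean number of SGD steps until the loss 0.5*h*theta^2 exceeds `barrier`.

    Noise model (assumption): per-step SGN ~ N(0, h/batch), i.e. covariance
    proportional to the Hessian h and inverse to the batch size.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(trials)
    escaped_at = np.full(trials, max_steps, dtype=float)
    active = np.ones(trials, dtype=bool)
    for step in range(1, max_steps + 1):
        noise = rng.standard_normal(active.sum())
        # SGD update: theta <- theta - lr * grad + lr * C^{1/2} * zeta
        theta[active] += -lr * h * theta[active] + lr * np.sqrt(h / batch) * noise
        newly = active & (0.5 * h * theta**2 > barrier)
        escaped_at[newly] = step
        active &= ~newly
        if not active.any():
            break
    return escaped_at.mean()

flat = mean_escape_steps(h=1.0, lr=0.1, batch=1)       # flat minimum
sharp = mean_escape_steps(h=4.0, lr=0.1, batch=1)      # sharp minimum
big_batch = mean_escape_steps(h=1.0, lr=0.1, batch=2)  # flat minimum, larger batch
print(flat, sharp, big_batch)
```

Under this noise model, the sharp well is abandoned far sooner than the flat one, and doubling the batch size sharply inflates the escape time from the same well, qualitatively matching the exponential dependence on batch size and curvature described above.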

2. STOCHASTIC GRADIENT NOISE AND SGD DYNAMICS

We mainly introduce the necessary foundations for the proposed diffusion theory in this section. We denote the data samples by $\{x_j\}_{j=1}^{m}$, the model parameters by $\theta$, and the loss function over a data sample $x$ by $L(\theta, x)$. For simplicity, we denote the training loss by $L(\theta)$. Following Mandt et al. (2017), we may write the SGD dynamics as
$$\theta_{t+1} = \theta_t - \eta \frac{\partial \hat{L}(\theta_t)}{\partial \theta_t} = \theta_t - \eta \frac{\partial L(\theta_t)}{\partial \theta_t} + \eta C(\theta_t)^{\frac{1}{2}} \zeta_t,$$
where $\hat{L}(\theta)$ is the loss of one minibatch, $\zeta_t \sim \mathcal{N}(0, I)$, and $C(\theta)$ represents the covariance matrix of the gradient noise. The classic approach is to model SGN by Gaussian noise, $\mathcal{N}(0, C(\theta))$ (Mandt et al., 2017; Smith & Le, 2018; Chaudhari & Soatto, 2018).

Stochastic Gradient Noise Analysis. We first note that the SGN we study is introduced by minibatch training,
$$C(\theta_t)^{\frac{1}{2}} \zeta_t = \frac{\partial L(\theta_t)}{\partial \theta_t} - \frac{\partial \hat{L}(\theta_t)}{\partial \theta_t},$$
which is the difference between gradient descent and stochastic gradient descent. According to the Generalized Central Limit Theorem (Gnedenko et al., 1954), the mean of many infinite-variance random variables converges to a stable distribution, while the mean of many finite-variance random variables converges to a Gaussian distribution. As SGN has finite variance in practice, we believe the Gaussian approximation of SGN is reasonable.

Simsekli et al. (2019) argued that SGN is Lévy noise (stable variables) rather than Gaussian noise. They presented empirical evidence showing that SGN seems heavy-tailed, and that the heavy-tailed distribution looks closer to a stable distribution than to a Gaussian distribution. However, this research line (Simsekli et al., 2019; Nguyen et al., 2019) relies on a hidden strict assumption that SGN must be isotropic and obey the same distribution across dimensions. Simsekli et al. (2019) computed "SGN" across the $n$ model parameters and regarded "SGN" as $n$ samples drawn from a single univariate distribution. This is why a single tail-index for all parameters was studied in Simsekli et al. (2019). Their arguments do not necessarily hold for parameter-dependent and anisotropic Gaussian noise. In our paper, SGN computed over different minibatches obeys an $n$-variate Gaussian distribution, which can be parameter-dependent and anisotropic. In Figure 1, we empirically verify that SGN is highly similar to Gaussian noise rather than heavy-tailed Lévy noise. We reproduce the experiment of Simsekli et al. (2019) to show that gradient noise appears to be Lévy noise only if it is computed across parameters. Figure 1 actually suggests that the apparent heavy tails reflect differing noise scales across parameters rather than heavy-tailed noise within each parameter.
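The across-parameters versus across-minibatches distinction can be illustrated on synthetic data (a sketch of ours, not the paper's Figure 1 experiment; the linear-regression setup and all names are assumptions). When per-parameter noise scales differ, pooling one minibatch's SGN across parameters looks heavy-tailed even though each coordinate is nearly Gaussian across minibatches:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, B, K = 10000, 500, 100, 500  # samples, parameters, batch size, minibatches

# Linear regression with widely varying feature scales, so the SGN variance
# differs strongly across parameter dimensions (anisotropic noise).
scales = rng.lognormal(mean=0.0, sigma=0.75, size=d)
X = rng.standard_normal((m, d)) * scales
y = rng.standard_normal(m)

def grad(idx):
    # Gradient of 0.5 * mean((X @ theta - y)**2) at theta = 0 on rows `idx`.
    return X[idx].T @ (-y[idx]) / len(idx)

g_full = grad(np.arange(m))
sgn = np.stack([g_full - grad(rng.choice(m, size=B, replace=False))
                for _ in range(K)])  # SGN samples, shape (K, d)

def excess_kurtosis(v):
    v = v - v.mean()
    return (v ** 4).mean() / (v ** 2).mean() ** 2 - 3.0  # 0 for a Gaussian

per_dim = excess_kurtosis(sgn[:, 0])  # one parameter, across K minibatches
across = excess_kurtosis(sgn[0, :])   # one minibatch, across d parameters
print(per_dim, across)
```

The per-dimension excess kurtosis stays near zero (Gaussian, by the CLT over the minibatch), while the across-parameters kurtosis is large: a scale mixture of Gaussians masquerading as a heavy-tailed distribution.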



Jastrzębski et al. (2017) first studied the minima selection probability of SGD. Smith & Le (2018) presented a Bayesian perspective on the generalization of SGD. Wu et al. (2018) studied the escape problems of SGD from a dynamical perspective and obtained qualitative conclusions on the effects of batch size, learning rate, and sharpness. Hu et al. (2019) quantitatively showed that the mean escape time of SGD depends exponentially on the inverse learning rate. Achille & Soatto (2019) also obtained a related proposition that describes the mean escape time in terms of a free energy that depends on the Fisher Information. Li et al. (2017) analyzed the Stochastic Differential Equation (SDE) approximation of adaptive gradient methods. Nguyen et al. (2019) mainly contributed to closing the theoretical gap between continuous-time dynamics and discrete-time dynamics under isotropic heavy-tailed noise.
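These exponential dependences can be summarized by a generic Kramers-type escape-time sketch (our schematic notation, consistent with the results cited above but not a verbatim statement of any theorem; $a$ is the minimum, $b$ the barrier, $H_a$ the Hessian at $a$, and $\Delta L = L(b) - L(a)$):

```latex
% With Gaussian SGN of covariance C(\theta) \approx H(\theta)/B, the
% effective temperature of SGD near the minimum a is
%   T \approx \eta H_a / (2B),
% and a Kramers-type first-exit analysis gives a mean escape time
\tau \;\propto\; \exp\!\left(\frac{\Delta L}{T}\right)
     \;=\; \exp\!\left(\frac{2B\,\Delta L}{\eta\, H_a}\right).
% The escape time is exponential in B/\eta (small learning rates and large
% batches trap SGD in minima) and exponential in 1/H_a, so flat minima
% (small H_a) hold SGD exponentially longer than sharp minima. Injected
% white noise instead has T independent of H_a, so curvature then enters
% only through the polynomial Kramers prefactor.
```

This sketch makes explicit why Hessian-dependent noise yields the exponential flat-minima preference, while isotropic injected noise yields only a polynomial one.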

