DYNAMIC OF STOCHASTIC GRADIENT DESCENT WITH STATE-DEPENDENT NOISE

Anonymous

Abstract

Stochastic gradient descent (SGD) and its variants are the mainstream methods for training deep neural networks. Since neural networks are non-convex, a growing body of work studies the dynamic behavior of SGD and its impact on generalization, especially the efficiency of escaping from local minima. However, these works make the over-simplified assumption that the distribution of gradient noise is state-independent, although it is in fact state-dependent. In this work, we propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD. We prove that the stationary distribution of the power-law dynamic is heavy-tailed, which matches existing empirical observations. Next, we study the efficiency with which the power-law dynamic escapes from local minima and prove that the mean escaping time is polynomial in the barrier height of the basin, much faster than the exponential order obtained for previous dynamics. This indicates that SGD can escape deep sharp minima efficiently and tends to stop at flat minima, which have lower generalization error. Finally, we conduct experiments to compare SGD with the power-law dynamic, and the results verify our theoretical findings.

1. INTRODUCTION

Deep learning has achieved great success in various AI applications, such as computer vision, natural language processing, and speech recognition (He et al., 2016b; Vaswani et al., 2017; He et al., 2016a). Stochastic gradient descent (SGD) and its variants are the mainstream methods for training deep neural networks, since they can deal with the computational bottleneck of training over large-scale datasets (Bottou & Bousquet, 2008). Although SGD converges to the minimum in convex optimization (Rakhlin et al., 2012), neural networks are highly non-convex. To understand the behavior of SGD on non-convex optimization landscapes, on one hand, researchers are investigating the loss surface of neural networks with various architectures (Choromanska et al., 2015; Li et al., 2018b; He et al., 2019b; Draxler et al., 2018; Li et al., 2018a); on the other hand, researchers have shown that the noise in stochastic algorithms may help them escape from local minima (Keskar et al., 2016; He et al., 2019a; Zhu et al., 2019; Wu et al., 2019a; HaoChen et al., 2020). Whether stochastic algorithms can escape from poor local minima and finally stop at a minimum with low generalization error is clearly crucial to their test performance.

In this work, we focus on the dynamic of SGD and its impact on generalization, especially the efficiency of escaping from local minima. To study the dynamic behavior of SGD, most works regard SGD as the discretization of a continuous-time dynamic system and investigate its dynamic properties. There are two typical types of models that approximate the dynamic of SGD.
Several works (Li et al., 2017; Zhou et al., 2019; Liu et al., 2018; Chaudhari & Soatto, 2018; He et al., 2019a; Zhu et al., 2019; Hu et al., 2019; Xie et al., 2020) approximate the dynamic of SGD by a Langevin dynamic with a constant diffusion coefficient and prove its escaping efficiency from local minima. These works make the over-simplified assumption that the covariance matrix of the gradient noise is constant, although it is state-dependent in general. This simplification makes the proposed dynamic unable to explain the empirical observation that the distribution of parameters trained by SGD is heavy-tailed (Mahoney & Martin, 2019). To model the heavy-tailed phenomenon, Simsekli et al. (2019); Şimşekli et al. (2019) point out that the variance of the stochastic gradient may be infinite, and they propose to approximate SGD by a dynamic driven by an α-stable process under the strong infinite-variance condition. However, as shown in (Xie et al., 2020; Mandt et al., 2017), the gradient noise follows a Gaussian distribution and the infinite-variance condition is not satisfied. Therefore, a suitable theoretical explanation of the implicit regularization induced by the dynamic of SGD is still lacking.

In this work, we conduct a formal study of the (state-dependent) noise structure of SGD and its dynamic behavior. First, we show that the covariance of the noise of SGD in the quadratic basin surrounding a local minimum is a quadratic function of the state (i.e., the model parameter). Thus, we propose approximating the dynamic of SGD near the local minimum by a stochastic differential equation whose diffusion coefficient is a quadratic function of the state. We call the new dynamic the power-law dynamic. We prove that its stationary distribution is the power-law κ distribution, where κ is the signal-to-noise ratio of the second-order derivatives at the local minimum. Compared with the Gaussian distribution, the power-law κ distribution is heavy-tailed with tail-index κ.
This matches the empirical observation that the distribution of parameters becomes heavy-tailed after SGD training, without assuming infinite variance of the stochastic gradient as in (Simsekli et al., 2019).

Second, we analyze the efficiency with which the power-law dynamic escapes from local minima and its relation to generalization. Using random perturbation theory for diffused dynamic systems, we analyze the mean escaping time for the power-law dynamic. Our results show that: (1) the power-law dynamic can escape from sharp minima faster than from flat minima; (2) the mean escaping time for the power-law dynamic is only polynomial in the barrier height, much faster than the exponential order for dynamics with a constant diffusion coefficient. Furthermore, we provide a PAC-Bayes generalization bound and show that the power-law dynamic can generalize better than dynamics with a constant diffusion coefficient. Therefore, our results indicate that the state-dependent noise helps SGD escape from sharp minima quickly and implicitly learn well-generalized models.

Finally, we corroborate our theory with experiments. We investigate the distributions of parameters trained by SGD on various types of deep neural networks and show that they are well fitted by the power-law κ distribution. We then compare the escaping efficiency of dynamics with constant or state-dependent diffusion to that of SGD. The results show that the behavior of the power-law dynamic is more consistent with SGD.

Our contributions are summarized as follows: (1) We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD, based on both theoretical derivation and empirical evidence. The power-law dynamic can explain the heavy-tailed phenomenon of parameters trained by SGD without assuming infinite variance of the gradient noise.
(2) We analyze the mean escaping time and a PAC-Bayes generalization bound for the power-law dynamic; the results show that the power-law dynamic escapes sharp local minima faster and generalizes better than dynamics with constant diffusion. Our experimental results support these theoretical findings.
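To give a concrete feel for the gap between constant and state-dependent diffusion, the minimal 1-D simulation below measures first-passage times over a loss barrier with an Euler-Maruyama discretization. This is an illustrative sketch, not the paper's derivation: the double-well loss, learning rate, diffusion coefficients, and the clamp are made-up choices; the state-dependent schedule merely mimics diffusion that grows quadratically with the distance from the minimum.

```python
import math
import random

random.seed(1)


def grad(w):
    # Double-well loss L(w) = (w^2 - 1)^2: minima at w = -1 and w = +1,
    # separated by a barrier of height 1 at w = 0.
    return 4.0 * w * (w * w - 1.0)


def escape_steps(sigma, eta=0.01, max_steps=200_000):
    """Euler-Maruyama steps from the left minimum until w crosses the barrier."""
    w = -1.0
    for t in range(max_steps):
        w += -eta * grad(w) + math.sqrt(eta) * sigma(w) * random.gauss(0.0, 1.0)
        w = max(-3.0, min(3.0, w))  # clamp: keeps the explicit Euler step stable
        if w > 0.0:
            return t + 1
    return max_steps


def constant_diffusion(w):
    return 0.8


def state_dependent_diffusion(w):
    # Grows quadratically with the distance from the minimum at w = -1.
    return 0.8 * (1.0 + 2.0 * (w + 1.0) ** 2)


runs = 50
mean_const = sum(escape_steps(constant_diffusion) for _ in range(runs)) / runs
mean_state = sum(escape_steps(state_dependent_diffusion) for _ in range(runs)) / runs
print(mean_const, mean_state)
```

With diffusion that is larger away from the minimum, trajectories cross the barrier in far fewer steps on average, which is the qualitative effect behind the polynomial-versus-exponential escaping-time comparison above.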

2. BACKGROUND

In the empirical risk minimization problem, the objective is L(w) = (1/n) Σ_{i=1}^n ℓ(x_i, w), where x_i, i = 1, …, n are n i.i.d. training samples, w ∈ R^d is the model parameter, and ℓ is the loss function. Stochastic gradient descent (SGD) is a popular optimization algorithm for minimizing L(w). The update rule is w_{t+1} = w_t − η · g̃(w_t), where g̃(w_t) = (1/b) Σ_{x∈S_b} ∇_w ℓ(x, w_t) is the minibatch gradient calculated over a randomly sampled minibatch S_b of size b, and η is the learning rate. The minibatch gradient g̃(w_t) is an unbiased estimator of the full gradient g(w_t) = ∇L(w_t), and the term g̃(w_t) − g(w_t) is called the gradient noise of SGD.

Langevin Dynamic. In (He et al., 2019a; Zhu et al., 2019), the gradient noise is assumed to be drawn from a Gaussian distribution according to the central limit theorem (CLT), i.e., g̃(w) − g(w) ∼ N(0, C), where the covariance matrix C is a constant matrix for all w. SGD can then be regarded as the numerical discretization of the following Langevin dynamic, dw_t = −g(w_t)dt + √η C^{1/2} dB_t, where B_t is a standard Brownian motion in R^d and √η C^{1/2} dB_t is called the diffusion term.

α-stable Process. Simsekli et al. (2019) assume the variance of the gradient noise is unbounded. By the generalized CLT, the gradient noise follows an α-stable distribution S(α, σ), where σ is the α-th moment of the gradient noise for a given α ∈ (0, 2]. Under this assumption, SGD is approximated by a stochastic differential equation (SDE) driven by an α-stable process.
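As a quick sanity check on the state-dependence of the gradient noise g̃(w) − g(w), the sketch below estimates its variance at two parameter values for a toy 1-D least-squares problem. The data, the sample size n, the minibatch size b, and the noise scale are all illustrative choices, not taken from the paper.

```python
import random

random.seed(2)

# Toy 1-D least squares: loss_i(w) = 0.5 * (w * x_i - y_i)^2,
# so grad_i(w) = (w * x_i - y_i) * x_i and the minimizer is near w* = 2.
n = 1000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [2.0 * x + random.gauss(0.0, 0.5) for x in xs]


def full_grad(w):
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / n


def minibatch_grad(w, b=16):
    batch = random.sample(range(n), b)
    return sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / b


def noise_var(w, trials=2000):
    # Empirical variance of the gradient noise at a fixed parameter value w.
    g = full_grad(w)
    return sum((minibatch_grad(w) - g) ** 2 for _ in range(trials)) / trials


v_near = noise_var(2.0)  # at the minimum
v_far = noise_var(5.0)   # three units away from the minimum
print(v_near, v_far)
```

For this loss the per-sample gradient is ((w − 2)x_i − ε_i)x_i, so the noise variance grows quadratically in (w − 2) and is far larger at w = 5 than at the minimum. This is exactly the kind of quadratic state-dependence that the constant-covariance assumption above ignores.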

