DYNAMIC OF STOCHASTIC GRADIENT DESCENT WITH STATE-DEPENDENT NOISE

Anonymous

Abstract

Stochastic gradient descent (SGD) and its variants are the mainstream methods for training deep neural networks. Since neural networks are non-convex, a growing body of work studies the dynamic behavior of SGD and its impact on generalization, especially the efficiency of escaping from local minima. However, these works make the over-simplified assumption that the distribution of the gradient noise is state-independent, although it is state-dependent in practice. In this work, we propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamics of SGD. We prove that the stationary distribution of the power-law dynamic is heavy-tailed, which matches existing empirical observations. Next, we study the efficiency of escaping from local minima under the power-law dynamic and prove that the mean escaping time is polynomial in the barrier height of the basin, much faster than the exponential dependence of previously studied dynamics. This indicates that SGD can escape deep sharp minima efficiently and tends to settle at flat minima, which have lower generalization error. Finally, we conduct experiments comparing SGD with the power-law dynamic, and the results verify our theoretical findings.

1. INTRODUCTION

Deep learning has achieved great success in various AI applications, such as computer vision, natural language processing, and speech recognition (He et al., 2016b; Vaswani et al., 2017; He et al., 2016a). Stochastic gradient descent (SGD) and its variants are the mainstream methods for training deep neural networks, since they can deal with the computational bottleneck of training over large-scale datasets (Bottou & Bousquet, 2008). Although SGD converges to the minimum in convex optimization (Rakhlin et al., 2012), neural networks are highly non-convex. To understand the behavior of SGD on non-convex optimization landscapes, researchers have, on one hand, investigated the loss surfaces of neural networks with various architectures (Choromanska et al., 2015; Li et al., 2018b; He et al., 2019b; Draxler et al., 2018; Li et al., 2018a); on the other hand, they have shown that the noise in stochastic algorithms may help them escape from local minima (Keskar et al., 2016; He et al., 2019a; Zhu et al., 2019; Wu et al., 2019a; HaoChen et al., 2020). Whether a stochastic algorithm can escape from poor local minima and finally stop at a minimum with low generalization error is crucial to its test performance. In this work, we focus on the dynamics of SGD and their impact on generalization, especially the efficiency of escaping from local minima. To study the dynamic behavior of SGD, most works consider SGD as the discretization of a continuous-time dynamic system and investigate its dynamic properties. There are two typical types of models used to approximate the dynamics of SGD.
Several works (Li et al., 2017; Zhou et al., 2019; Liu et al., 2018; Chaudhari & Soatto, 2018; He et al., 2019a; Zhu et al., 2019; Hu et al., 2019; Xie et al., 2020) approximate the dynamics of SGD by a Langevin dynamic with constant diffusion coefficient and prove results on its efficiency of escaping from local minima. These works make the over-simplified assumption that the covariance matrix of the gradient noise is constant, although it is state-dependent in general. This simplification makes the proposed dynamic unable to explain the empirical observation that the distribution of parameters trained by SGD is heavy-tailed (Mahoney & Martin, 2019). To model the heavy-tailed phenomenon, Simsekli et al. (2019); Şimşekli et al. (2019) point out that the variance of the stochastic gradient may be infinite, and they propose to approximate SGD by a dynamic driven by an α-stable process under this strong infinite-variance condition. However, as shown in (Xie et al., 2020; Mandt et al., 2017), the gradient noise follows a Gaussian distribution and the infinite-variance condition is not satisfied. Therefore, a suitable theoretical explanation of the implicit regularization of the dynamics of SGD is still lacking.

In this work, we conduct a formal study of the (state-dependent) noise structure of SGD and its dynamic behavior. First, we show that the covariance of the noise of SGD in the quadratic basin surrounding a local minimum is a quadratic function of the state (i.e., the model parameter). Thus, we propose to approximate the dynamics of SGD near the local minimum by a stochastic differential equation whose diffusion coefficient is a quadratic function of the state. We call the new dynamic the power-law dynamic. We prove that its stationary distribution is the power-law κ distribution, where κ is the signal-to-noise ratio of the second-order derivatives at the local minimum. Compared with the Gaussian distribution, the power-law κ distribution is heavy-tailed with tail-index κ.
This matches the empirical observation that the distribution of parameters becomes heavy-tailed after SGD training, without assuming infinite variance of the stochastic gradient as in (Simsekli et al., 2019). Second, we analyze the efficiency of the power-law dynamic in escaping from local minima and its relation to generalization. Using the random perturbation theory for diffused dynamic systems, we derive the mean escaping time of the power-law dynamic. Our results show that: (1) the power-law dynamic escapes from sharp minima faster than from flat minima; (2) the mean escaping time of the power-law dynamic is only polynomial in the barrier height, much faster than the exponential dependence of dynamics with constant diffusion coefficient. Furthermore, we provide a PAC-Bayes generalization bound and show that the power-law dynamic generalizes better than dynamics with constant diffusion coefficient. Therefore, our results indicate that state-dependent noise helps SGD escape from sharp minima quickly and implicitly learn well-generalized models.

Finally, we corroborate our theory with experiments. We investigate the distributions of parameters trained by SGD on various types of deep neural networks and show that they are well fitted by the power-law κ distribution. Then, we compare the escaping efficiency of dynamics with constant or state-dependent diffusion to that of SGD; the results show that the behavior of the power-law dynamic is more consistent with SGD. Our contributions are summarized as follows: (1) We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamics of SGD, based on both theoretical derivation and empirical evidence. The power-law dynamic explains the heavy-tailed phenomenon of parameters trained by SGD without assuming infinite variance of the gradient noise.
(2) We analyze the mean escaping time and a PAC-Bayes generalization bound for the power-law dynamic; the results show that the power-law dynamic escapes sharp local minima faster and generalizes better compared with dynamics with constant diffusion. Our experimental results support these theoretical findings.

2. BACKGROUND

In the empirical risk minimization problem, the objective is $L(w) = \frac{1}{n}\sum_{i=1}^{n}\ell(x_i, w)$, where $x_i, i = 1, \cdots, n$ are $n$ i.i.d. training samples, $w \in \mathbb{R}^d$ is the model parameter, and $\ell$ is the loss function. Stochastic gradient descent (SGD) is a popular optimization algorithm to minimize $L(w)$. The update rule is $w_{t+1} = w_t - \eta \cdot \tilde{g}(w_t)$, where $\tilde{g}(w_t) = \frac{1}{b}\sum_{x \in S_b}\nabla_w \ell(x, w_t)$ is the minibatch gradient calculated on a randomly sampled minibatch $S_b$ of size $b$ and $\eta$ is the learning rate. The minibatch gradient $\tilde{g}(w_t)$ is an unbiased estimator of the full gradient $g(w_t) = \nabla L(w_t)$, and the term $\tilde{g}(w_t) - g(w_t)$ is called the gradient noise of SGD.

Langevin Dynamic. In (He et al., 2019a; Zhu et al., 2019), the gradient noise is assumed to be drawn from a Gaussian distribution according to the central limit theorem (CLT), i.e., $\tilde{g}(w) - g(w) \sim \mathcal{N}(0, C)$, where the covariance matrix $C$ is constant for all $w$. Then SGD can be regarded as the numerical discretization of the following Langevin dynamic, $dw_t = -g(w_t)dt + \sqrt{\eta}C^{1/2}dB_t$, where $B_t$ is a standard Brownian motion in $\mathbb{R}^d$ and $\sqrt{\eta}C^{1/2}dB_t$ is called the diffusion term.

α-stable Process. Simsekli et al. (2019) assume the variance of the gradient noise is unbounded. By the generalized CLT, the gradient noise follows an α-stable distribution $S(\alpha, \sigma)$, where $\sigma$ is the $\alpha$-th moment of the gradient noise for given $\alpha \in (0, 2]$. Under this assumption, SGD is approximated by a stochastic differential equation (SDE) driven by an α-stable process.
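As a concrete illustration of these quantities (a toy sketch of our own, not the paper's experiments: per-sample loss $\ell(x, w) = \frac{1}{2}h(w - c)^2$ with $x = (h, c)$, so the per-sample gradient is $h(w - c)$), the snippet below estimates the variance of the minibatch gradient noise $\tilde{g}(w) - g(w)$ at several parameter values and shows that it is state-dependent:

```python
import numpy as np

rng = np.random.default_rng(0)
n, b = 2000, 16
h = 1.0 + 0.5 * rng.random(n)      # per-sample curvatures (illustrative)
c = 0.1 * rng.normal(size=n)       # per-sample minimizers (illustrative)

def full_grad(w):
    # full gradient g(w) of L(w) = (1/n) sum 0.5 * h_i * (w - c_i)^2
    return np.mean(h * (w - c))

def noise_var(w, trials=4000):
    # empirical variance of the minibatch gradient noise at parameter w
    idx = rng.integers(0, n, size=(trials, b))          # random minibatches
    g_tilde = np.mean(h[idx] * (w - c[idx]), axis=1)
    return np.var(g_tilde - full_grad(w))

ws = np.linspace(-1.0, 1.0, 9)
vs = np.array([noise_var(w) for w in ws])
coef = np.polyfit(ws, vs, 2)   # the noise variance is well fit by a quadratic in w
```

The fitted quadratic coefficient is close to $\mathrm{Var}(h)/b$, the fluctuation of the per-sample curvature divided by the batch size: the noise variance is smallest near the minimizer and grows quadratically away from it, previewing the state-dependent noise structure studied in Section 3.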

2.1. RELATED WORK

Many works approximate SGD by a Langevin dynamic, and most of the theoretical results are obtained for Langevin dynamics with constant diffusion coefficient. From the optimization perspective, the convergence rate of SGD and its optimal hyper-parameters have been studied in (Li et al., 2017; He et al., 2018; Liu et al., 2018) via optimal control theory. From the generalization perspective, Chaudhari & Soatto (2018); Zhang et al. (2018); Smith & Le (2017) show that SGD implicitly regularizes the negative entropy of the learned distribution. Recently, the efficiency of the Langevin dynamic in escaping from local minima has been studied (Zhu et al., 2019; Hu et al., 2019; Xie et al., 2020). He et al. (2019a) analyze the PAC-Bayes generalization error of the Langevin dynamic to explain the generalization of SGD. The solution of a Langevin dynamic with constant diffusion coefficient is a Gaussian process, which does not match the empirical observation that the distribution of parameters trained by SGD is heavy-tailed (Mahoney & Martin, 2019; Hodgkinson & Mahoney, 2020; Gurbuzbalaban et al., 2020). Simsekli et al. (2019); Şimşekli et al. (2019) assume the variance of the stochastic gradient is infinite and regard SGD as the discretization of a stochastic differential equation (SDE) driven by an α-stable process; the escaping efficiency of this SDE is also analyzed in (Simsekli et al., 2019). However, these theoretical results are derived for dynamics with constant diffusion terms, although the gradient noise in SGD is state-dependent. Some related works analyze state-dependent noise structures in SGD, such as label noise in (HaoChen et al., 2020) and multiplicative noise in (Wu et al., 2019b). These works propose new algorithms motivated by the noise structure, but they do not analyze the escaping behavior of the dynamics of SGD and its impact on generalization. Wu et al. (2018) analyze the escaping behavior of SGD considering the fluctuations of the second-order derivatives and propose the concept of linear stability. In our work, we propose the power-law dynamic to approximate SGD and analyze its stationary distribution and mean escaping time.

3. APPROXIMATING SGD BY POWER-LAW DYNAMIC

In this section, we study the (state-dependent) noise structure of SGD (Section 3.1) and propose the power-law dynamic to approximate the dynamics of SGD. We first study the 1-dimensional power-law dynamic in Section 3.2 and extend it to the high-dimensional case in Section 3.3.

3.1. NOISE STRUCTURE OF STOCHASTIC GRADIENT DESCENT

For non-convex optimization, we investigate the noise structure of SGD around local minima so that we can analyze the efficiency of escaping from them. We first describe the quadratic basin where the local minimum is located. Suppose $w^*$ is a local minimum of the training loss $L(w)$ with $g(w^*) = 0$. We call the $\epsilon$-ball $B(w^*, \epsilon)$ with center $w^*$ and radius $\epsilon$ a quadratic basin if the loss function for $w \in B(w^*, \epsilon)$ equals its second-order Taylor expansion, $L(w) = L(w^*) + \frac{1}{2}(w - w^*)^T H(w^*)(w - w^*)$. Here, $H(w^*)$ is the Hessian matrix of the loss at $w^*$, which is positive (semi-)definite.

We then analyze the gradient noise of SGD. The full gradient of the training loss is $g(w) = H(w^*)(w - w^*)$. By Taylor expansion, the stochastic gradient is $\tilde{g}(w) = \tilde{g}(w^*) + \tilde{H}(w^*)(w - w^*)$, where $\tilde{g}(\cdot)$ and $\tilde{H}(\cdot)$ are the stochastic versions of the gradient and Hessian calculated on the minibatch. The randomness of the gradient noise comes from two sources, $\tilde{g}(w^*)$ and $\tilde{H}(w^*)$, which reflect the fluctuations of the first-order and second-order derivatives of the model at $w^*$ over different minibatches, respectively. The following proposition gives the variance of the gradient noise.

Proposition 1 For $w \in B(w^*, \epsilon) \subset \mathbb{R}$, the variance of the gradient noise is
$$\sigma(\tilde{g}(w) - g(w)) = \sigma(\tilde{g}(w^*)) + 2\rho(\tilde{g}(w^*), \tilde{H}(w^*))(w - w^*) + \sigma(\tilde{H}(w^*))(w - w^*)^2,$$
where $\sigma(\cdot)$ and $\rho(\cdot, \cdot)$ denote the variance and covariance with respect to the minibatch.

From Proposition 1, we can conclude the following. (1) The variance of the noise is finite if $\tilde{g}(w^*)$ and $\tilde{H}(w^*)$ have finite variance, because $\rho(\tilde{g}(w^*), \tilde{H}(w^*)) \le \sqrt{\sigma(\tilde{g}(w^*)) \cdot \sigma(\tilde{H}(w^*))}$ by the Cauchy-Schwarz inequality. For fixed $w^*$, a sufficient condition for $\tilde{g}(w^*)$ and $\tilde{H}(w^*)$ to have finite variance is that the training data $x$ are sampled from a bounded domain. This condition is easy to satisfy because the training data are usually normalized to a bounded domain before training.
In this case, the infinite-variance assumption on the stochastic gradient in the α-stable process is not satisfied. (2) The variance of the noise is state-dependent, which contradicts the assumption in the Langevin dynamic.

Notations: For ease of presentation, we use $C(w)$, $\sigma_g$, $\sigma_H$, $\rho_{g,H}$ to denote $\sigma(\tilde{g}(w) - g(w))$, $\sigma(\tilde{g}(w^*))$, $\sigma(\tilde{H}(w^*))$, $\rho(\tilde{g}(w^*), \tilde{H}(w^*))$ in the following, respectively.

3.2. POWER-LAW DYNAMIC

According to the CLT, the gradient noise follows a Gaussian distribution if it has finite variance, i.e., $\tilde{g}(w) - g(w) \rightarrow_d \mathcal{N}(0, C(w))$ as $b \rightarrow \infty$, where $\rightarrow_d$ means "converges in distribution". Modeling the gradient noise in SGD as Gaussian, the update rule of SGD can be written as
$$w_{t+1} = w_t - \eta g(w_t) + \eta \xi_t, \quad \xi_t \sim \mathcal{N}(0, C(w_t)). \quad (3)$$
Eq. 3 can be treated as the discretization of the following SDE, which we call the power-law dynamic:
$$dw_t = -g(w_t)dt + \sqrt{\eta C(w_t)}\,dB_t. \quad (4)$$
The power-law dynamic characterizes how the distribution of $w$ changes over time. The distribution density of the parameter $w$ at time $t$, i.e., $p(w, t)$, is determined by the Fokker-Planck equation (of Zwanzig's type (Guo & Du, 2014)):
$$\frac{\partial}{\partial t}p(w, t) = \nabla\big(p(w, t)g(w)\big) + \frac{\eta}{2}\cdot\nabla\big(C(w)\cdot\nabla p(w, t)\big).$$
The stationary distribution of the power-law dynamic is obtained by setting the left-hand side of the Fokker-Planck equation to zero. The following theorem gives the analytic form of the stationary distribution, which is heavy-tailed: the tail of the density decays polynomially in $w - w^*$. This is why we call the SDE in Eq. 4 the power-law dynamic.

Theorem 2 The stationary distribution density of the 1-dimensional power-law dynamic (Eq. 4) is
$$p(w) = \frac{1}{Z}\,(C(w))^{-\frac{H}{\eta\sigma_H}}\exp\left[\frac{4\rho_{g,H}H}{\eta\sigma_H\sqrt{4\sigma_H\sigma_g - 4\rho_{g,H}^2}}\arctan\left(\frac{C'(w)}{\sqrt{4\sigma_H\sigma_g - 4\rho_{g,H}^2}}\right)\right], \quad (6)$$
where $C(w) = \sigma_g + 2\rho_{g,H}(w - w^*) + \sigma_H(w - w^*)^2$, $Z$ is the normalization constant, and $\arctan(\cdot)$ is the arctangent function.

We now discuss the properties of $p(w)$.
The rate at which $p(w)$ decreases as $w$ moves away from the center $w^*$ is mainly determined by the term $C(w)^{-\frac{H}{\eta\sigma_H}}$ (because $\arctan(\cdot)$ is bounded), which is a polynomial function of $w - w^*$. Compared with the Gaussian distribution, whose density decreases at an exponential rate, the power-law distribution is less concentrated in the quadratic basin $B(w^*, \epsilon)$ and is heavy-tailed. We call $\frac{H}{\eta\sigma_H}$ the tail-index of $p(w)$ and denote it by $\kappa$ in the following. We conclude that state-dependent noise results in a heavy-tailed distribution of parameters, which matches the observations in (Mahoney & Martin, 2019). The Langevin dynamic with constant diffusion can be regarded as a special case of the power-law dynamic with $\rho_{g,H} = 0$ and $\sigma_H = 0$; in this case, $p(w)$ degenerates to a Gaussian distribution. Compared with the α-stable process, we do not assume infinite variance of the gradient noise, and we demonstrate another mechanism that produces a heavy-tailed distribution of parameters.

Empirically (Figure 1), we observe that: (1) the noise variance curves of the networks can be well approximated by quadratic curves, which supports Proposition 1; (2) the minimum of the quadratic curve is located near the local minimum $w^*$, which indicates that the coefficient of the first-order term $\rho_{g,H} \approx 0$. Based on the fact that $\rho_{g,H}$ does not determine the tail of the distribution in Eq. 6 and the observations in Figure 1, we consider the simplified form $C(w) = \sigma_g + \sigma_H(w - w^*)^2$.

Corollary 3 If $C(w) = \sigma_g + \sigma_H(w - w^*)^2$, the stationary distribution of the 1-dimensional power-law dynamic (Eq. 4) is
$$p(w) = \frac{1}{Z}\big(1 + \sigma_H\sigma_g^{-1}(w - w^*)^2\big)^{-\kappa}, \quad (7)$$
where $Z$ is the normalization constant and $\kappa = \frac{H}{\eta\sigma_H}$ is the tail-index.

The distribution density in Eq. 7 is known as the power-law κ distribution (Zhou & Du, 2014); it is also called the q-Gaussian distribution in (Tsallis & Bukman, 1996). As $\kappa \rightarrow \infty$, the density tends to a Gaussian, i.e., $p(w) \propto \exp\big(-\frac{H(w - w^*)^2}{\eta\sigma_g}\big)$.
The power-law κ distribution becomes more heavy-tailed as κ becomes smaller, assigning higher probability to values far from the center $w^*$. Intuitively, a smaller κ helps the dynamic escape from local minima faster. In the approximation of the dynamics of SGD, κ equals the signal-to-noise ratio of the second-order derivative at $w^*$, i.e., the ratio of the signal $H(w^*)$ to the noise $\eta\sigma_H$, and κ is linked to three factors: (1) the curvature $H(w^*)$; (2) the fluctuation of the curvature over the training data; (3) the hyper-parameters, including η and the minibatch size $b$. Note that $\sigma_H$ decreases linearly as the batch size $b$ increases.
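To make the tail behavior concrete, the following numerical check (our own sketch, with illustrative values $\sigma_g = \sigma_H = 1$ and $w^* = 0$) normalizes the density of Corollary 3 on a grid and measures how much probability mass lies far from the center for different κ:

```python
import numpy as np

def powerlaw_density(w, kappa, sg=1.0, sH=1.0, wstar=0.0):
    # un-normalized power-law kappa density from Corollary 3
    return (1.0 + (sH / sg) * (w - wstar) ** 2) ** (-kappa)

w = np.linspace(-50, 50, 200001)

def tail_mass(kappa, thresh=5.0):
    # probability mass at distance > thresh from the center (needs kappa > 1/2)
    p = powerlaw_density(w, kappa)
    p /= np.trapz(p, w)                 # normalize numerically on the grid
    right = w >= thresh
    return 2.0 * np.trapz(p[right], w[right])   # the density is symmetric

# smaller kappa -> heavier tail -> more mass far from the center w*
masses = [tail_mass(k) for k in (1.0, 2.0, 10.0)]
```

For κ = 1 the density is Cauchy-like and keeps over a tenth of its mass beyond five units from the center, while for κ = 10 the tail mass is negligible, matching the intuition that small κ makes escapes easier.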

3.3. MULTIVARIATE POWER-LAW DYNAMIC

In this section, we extend the power-law dynamic to the $d$-dimensional case. We first describe the covariance matrix $C(w)$ of the gradient noise in SGD, using subscripts to denote the elements of a vector or matrix. We use $\Sigma_g$ to denote the covariance matrix of $\tilde{g}(w^*)$ and assume that $\Sigma_g$ is isotropic (i.e., $\Sigma_g = \sigma_g \cdot I$). We also assume that $\mathrm{Cov}(\tilde{H}_i(w^*), \tilde{H}_j(w^*))$ are equal for all $i, j$. It can then be shown that $C(w) = \Sigma_g\big(1 + (w - w^*)^T\Sigma_H\Sigma_g^{-1}(w - w^*)\big)$. As in the 1-dimensional case, we omit the first-order term in $C(w)$; readers can refer to Proposition 11 in Appendix 7.2 for the detailed derivation. We suppose that the signal-to-noise ratio of $\tilde{H}(w^*)$ can be characterized by a scalar κ, i.e., $\eta\Sigma_H = \frac{1}{\kappa}\cdot H(w^*)$. Then $C(w)$ can be written as
$$C(w) = \Sigma_g\Big(1 + \frac{1}{\eta\kappa}(w - w^*)^T H(w^*)\Sigma_g^{-1}(w - w^*)\Big). \quad (8)$$

Theorem 4 Suppose $w \in \mathbb{R}^d$ and $C(w)$ has the form of Eq. 8 for $w \in B(w^*, \epsilon)$. The stationary distribution density of the power-law dynamic is
$$p(w) = \frac{1}{Z}\Big[1 + \frac{1}{\eta\kappa}(w - w^*)^T H(w^*)\Sigma_g^{-1}(w - w^*)\Big]^{-\kappa} \quad (9)$$
for $w \in B(w^*, \epsilon)$, where $Z$ is the normalization constant and κ satisfies $\eta\Sigma_H = \frac{1}{\kappa}\cdot H(w^*)$.

Remark: The multivariate power-law κ distribution (Eq. 9) is a natural extension of the 1-dimensional case. In fact, the assumptions on $\Sigma_g$ and κ can be replaced by just assuming that $\Sigma_g$, $H(w^*)$, and $\Sigma_H$ are codiagonalizable; readers can refer to Proposition 12 in Appendix 7.2 for the derivation.

Definition 5 Suppose $w_t$ starts at the local minimum $a$. The time for $w_t$ to first reach the saddle point $b$ is $\inf\{t > 0 \mid w_0 = a, w_t = b\}$. The mean escaping time τ is defined as

4. ESCAPING EFFICIENCY OF POWER-LAW DYNAMIC

$$\tau = \mathbb{E}_{w_t}\big[\inf\{t > 0 \mid w_0 = a, w_t = b\}\big].$$

We first give the mean escaping time for the 1-dimensional case in Lemma 6 and then for the high-dimensional power-law dynamic in Theorem 7. To analyze the mean escaping time, we make the following assumptions.

Assumption 1: The loss function around a critical point $w^*$ can be written as $L(w) = L(w^*) + \frac{1}{2}(w - w^*)^T H(w^*)(w - w^*)$.

Assumption 2: The system is in equilibrium near minima, i.e., $\frac{\partial p(w,t)}{\partial t} = 0$.

Assumption 3: (Low-temperature assumption) The gradient noise is small, i.e., $\eta\sigma_g \ll \Delta L$.

These three assumptions are commonly used in analyzing the escaping time of a dynamic (Xie et al., 2020; Zhou & Du, 2014). Because both $a$ and $b$ are critical points, we can apply Assumption 1 to obtain the loss surface around them. We provide more discussion of the assumptions in Appendix 7.3.2. We suppose the basin $a$ is quadratic and the variance of the noise has the form $C(w) = \sigma_{g_a} + \sigma_{H_a}(w - a)^2$, which can also be written as $C(w) = \sigma_{g_a} + \frac{2\sigma_{H_a}}{H_a}(L(w) - L(a))$. Furthermore, we suppose that $C(w) = \sigma_{g_a} + \frac{2\sigma_{H_a}}{H_a}(L(w) - L(a))$ holds on the whole escaping path from $a$ to $b$ (not just near the local minimum $a$); this means that the variance of the gradient noise becomes larger as the loss becomes larger. The following lemma gives the mean escaping time of the power-law dynamic in the 1-dimensional case.

Lemma 6 Suppose that Assumptions 1-3 are satisfied and $C(w) = \sigma_{g_a} + \frac{2\sigma_{H_a}}{H_a}(L(w) - L(a))$ on the whole escaping path from $a$ to $b$. The mean escaping time of the 1-dimensional power-law dynamic is
$$\tau = \frac{2\pi}{(1 - \frac{1}{2\kappa})\sqrt{H_a|H_b|}}\Big(1 + \frac{2}{\kappa\eta\sigma_{g_a}}\Delta L\Big)^{\kappa - \frac{1}{2}},$$
where $\kappa = \frac{H_a}{\eta\sigma_{H_a}} > \frac{1}{2}$, $\Delta L = L(b) - L(a)$ is the barrier height, and $H_a$ and $H_b$ are the second-order derivatives of the training loss at the local minimum $a$ and at the saddle point $b$, respectively.

The proof of Lemma 6 is based on the results in (Zhou & Du, 2014); we provide a full proof in Appendix 7.3.1.
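As a numeric illustration of Lemma 6 (with illustrative constants of our own choosing; the classical Kramers escaping time for constant diffusion is evaluated only as a reference point), the sketch below contrasts the polynomial growth of the power-law escaping time with the exponential growth of the constant-diffusion case, and checks that sharper minima (larger $H_a$) are escaped faster:

```python
import numpy as np

def tau_powerlaw(dL, kappa=5.0, eta=0.1, sigma_g=1.0, Ha=1.0, Hb=1.0):
    # Lemma 6: polynomial in the barrier height dL, exponent kappa - 1/2
    pref = 2 * np.pi / ((1 - 1 / (2 * kappa)) * np.sqrt(Ha * abs(Hb)))
    return pref * (1 + 2 * dL / (kappa * eta * sigma_g)) ** (kappa - 0.5)

def tau_langevin(dL, eta=0.1, sigma=1.0, Ha=1.0, Hb=1.0):
    # Kramers-type escaping time for constant diffusion: exponential in dL
    return (1 / np.sqrt(Ha * abs(Hb))) * np.exp(2 * dL / (eta * sigma))

barriers = np.array([0.5, 1.0, 2.0])
# the exponential/polynomial ratio explodes as the barrier height grows
ratio = tau_langevin(barriers) / tau_powerlaw(barriers)

# sharper minimum (larger Ha) -> smaller escaping time under Lemma 6
t_sharp = tau_powerlaw(1.0, Ha=4.0)
t_flat = tau_powerlaw(1.0, Ha=1.0)
```

Even for moderate barriers the constant-diffusion time exceeds the power-law time by many orders of magnitude, which is the quantitative content of the "polynomial versus exponential" comparison below.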
For the dynamic near the saddle point, we simply assume that it is the same as the dynamic near the local minimum. This assumption is not necessary, and we discuss the extension to more complex dynamics in Appendix 7.3.3. We summarize the mean escaping time of the power-law dynamic and of the dynamics in previous works in Table 1. Based on these results, we make the following comparisons with other dynamics. (1) Both the power-law dynamic and the Langevin dynamic escape sharp minima faster than flat minima, where sharpness is measured by $H_a$: a larger $H_a$ corresponds to a sharper minimum. Compared with the Langevin dynamic, the power-law dynamic improves the dependence on the barrier height $\Delta L$ from exponential to polynomial, which implies that SGD escapes from deep basins more efficiently. (2) The mean escaping time of the α-stable process is independent of the barrier height, but it is polynomial in the width of the basin (width $= |b - a|$).

Table 1: Summary of related works and ours. For ease of presentation, we only show the 1-dimensional escaping time for all three dynamics.

Noise distribution | Dynamic | Stationary solution | Escaping time
$\mathcal{N}(0, \sigma)$ | Langevin | Gaussian | $O\big(\frac{1}{\sqrt{H_a|H_b|}}\exp\big(\frac{2\Delta L}{\eta\sigma}\big)\big)$
$S(\alpha, \sigma)$ | α-stable | Heavy-tailed | $O\big(\eta^{\alpha}\cdot\big(\frac{|b-a|}{\eta\sigma}\big)^{\alpha}\big)$
$\mathcal{N}(0, \sigma_g + \sigma_H(w - w^*)^2)$ | Power-law (ours) | Heavy-tailed | $O\big(\frac{1}{\sqrt{H_a|H_b|}}\big(1 + \frac{2\Delta L}{\kappa\eta\sigma_g}\big)^{\kappa - \frac{1}{2}}\big)$

Compared with the α-stable process, the result for the power-law dynamic is superior in the sense that it is also polynomial in the width (if $\Delta L \approx O(|b - a|^2)$), and the power-law dynamic does not rely on the infinite-variance assumption. Based on Lemma 6, we analyze the mean escaping time in the $d$-dimensional case. Under the low-temperature condition, the probability density concentrates only along the most possible escaping paths in the high-dimensional landscape; for a rigorous definition of most possible escaping paths, readers can refer to Section 3 of (Xie et al., 2020). For simplicity, we consider the case where there is only one most possible escaping path between basin $a$ and basin $c$: the Hessian at the saddle point $b$ has only one negative eigenvalue, and the most possible escaping direction is the direction of the corresponding eigenvector.

Theorem 7 Suppose that Assumptions 1-3 are satisfied. For $w \in \mathbb{R}^d$, suppose $C(w) = \Sigma_{g_a} + \frac{2}{\eta\kappa}(L(w) - L(a))$ on the whole escaping path from $a$ to $b$ and there is only one most possible path between basin $a$ and basin $c$. The mean escaping time of the power-law dynamic escaping from basin $a$ to basin $c$ is
$$\tau = \frac{2\pi}{(1 - \frac{d}{2\kappa})}\sqrt{\frac{-\det(H_b)}{\det(H_a)}}\,\frac{1}{|H_{be}|}\Big(1 + \frac{1}{\eta\kappa\sigma_e}\Delta L\Big)^{\kappa - \frac{1}{2}},$$
where $e$ indicates the most possible escaping direction, $H_{be}$ is the only negative eigenvalue of $H_b$, $\sigma_e$ is the eigenvalue of $\Sigma_{g_a}$ corresponding to the escaping direction, $\Delta L = L(b) - L(a)$, and $\det(\cdot)$ is the determinant of a matrix.

Remark: In the $d$-dimensional case, flatness is measured by $\det(H_a)$.
If $H_a$ has zero eigenvalues, we can replace $H_a$ in the above theorem by $H_a^+$, obtained by projecting $H_a$ onto the subspace spanned by the eigenvectors corresponding to its positive eigenvalues. This is because, by Taylor expansion, the loss $L(w)$ depends only on the positive eigenvalues of $H_a$ and the corresponding eigenvectors, i.e., $L(w) = L(a) + \frac{1}{2}(w - a)^T H_a(w - a) = L(a) + \frac{1}{2}(P(w - a))^T\Lambda_{H_a^+}P(w - a)$, where $\Lambda_{H_a^+}$ is the diagonal matrix composed of the non-zero eigenvalues of $H_a$ and the operator $P(\cdot)$ projects a vector onto the subspace corresponding to the non-zero eigenvalues of $H_a$. Therefore, the dimension $d$ in Theorem 7 can be regarded as the dimension of the subspace spanned by the directions with large eigenvalues. It has been observed that most of the eigenvalues of $H$ are very small (Sagun et al., 2016); therefore $d$ is not large, and the multi-dimensional power-law dynamic inherits the benefits of the 1-dimensional case over the Langevin dynamic and the α-stable process.

The next theorem gives an upper bound on the generalization error of the stationary distribution of the power-law dynamic, showing that a flatter minimum has smaller generalization error.

Theorem 8 Suppose that $w \in \mathbb{R}^d$ and $\kappa > \frac{d}{2}$. For $\delta > 0$, with probability at least $1 - \delta$, the stationary distribution of the power-law dynamic has the following generalization error bound:
$$\mathbb{E}_{w\sim p(w), x\sim P(x)}\,\ell(w, x) \le \mathbb{E}_{w\sim p(w)}L(w) + \sqrt{\frac{KL(p\|p') + \log\frac{1}{\delta} + \log n + 2}{n - 1}},$$
where $KL(p\|p') \le \frac{1}{2}\log\frac{\det(H)}{\det(\Sigma_g)} + \frac{Tr(\eta\Sigma_g H^{-1}) - 2d}{4\big(1 - \frac{1}{\kappa}(\frac{d}{2} - 1)\big)} + \frac{d}{2}\log\frac{2}{\eta}$, $p(w)$ is the stationary distribution of the $d$-dimensional power-law dynamic, $p'(w)$ is a prior distribution selected to be the standard Gaussian, $P(x)$ is the underlying distribution of the data $x$, and $\det(\cdot)$ and $Tr(\cdot)$ are the determinant and trace of a matrix, respectively. We make the following observations on the results in Theorem 8.
For the 1-dimensional case: if $H > \frac{\eta}{2(1 + \frac{1}{2\kappa})}$, the KL divergence decreases as $H$ decreases. For $d > 1$ and fixed $Tr(\Sigma_g H^{-1})$ and $\det(\Sigma_g)$, the generalization error (i.e., $\mathbb{E}_{w\sim p(w), x\sim P(x)}\,\ell(w, x) - \mathbb{E}_{w\sim p(w)}L(w)$) decreases as $\det(H)$ decreases, which indicates that flatter minima have smaller generalization error. Moreover, if $2d > Tr(\eta\Sigma_g H^{-1})$, the generalization error decreases as κ increases; as $\kappa \rightarrow \infty$, it tends to that of the Langevin dynamic. Combining the mean escaping time with the generalization error bound, we conclude that state-dependent noise makes SGD escape from sharp minima faster and implicitly biases it toward flatter models, which generalize better.
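To make the bound concrete, the sketch below (illustrative $H$, $\Sigma_g$, and η of our own choosing, not the paper's experiments) evaluates the KL upper bound of Theorem 8 and checks that, as κ grows, it approaches the constant-diffusion (Langevin) limit obtained by dropping the $\frac{1}{\kappa}$ term:

```python
import numpy as np

def kl_upper_bound(H, Sg, eta, kappa):
    # KL(p || p') upper bound from Theorem 8 (requires kappa > d/2)
    d = H.shape[0]
    t1 = 0.5 * np.log(np.linalg.det(H) / np.linalg.det(Sg))
    t2 = (np.trace(eta * Sg @ np.linalg.inv(H)) - 2 * d) / (
        4 * (1 - (d / 2 - 1) / kappa))
    t3 = (d / 2) * np.log(2 / eta)
    return t1 + t2 + t3

d, eta = 4, 0.1
H = np.diag([1.0, 2.0, 3.0, 4.0])   # illustrative Hessian at the minimum
Sg = np.eye(d)                       # illustrative isotropic noise covariance

# Langevin limit: the same bound with the 1/kappa correction dropped
langevin_limit = (0.5 * np.log(np.linalg.det(H))
                  + (np.trace(eta * np.linalg.inv(H)) - 2 * d) / 4
                  + (d / 2) * np.log(2 / eta))
gaps = [abs(kl_upper_bound(H, Sg, eta, k) - langevin_limit)
        for k in (4.0, 40.0, 400.0)]
```

The gap to the Langevin limit shrinks roughly like $\frac{1}{\kappa}$, consistent with the remark that the power-law bound recovers the constant-diffusion case as $\kappa \rightarrow \infty$.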

5. EXPERIMENTS

In this section, we conduct experiments to verify the theoretical results. We first study how well the power-law κ distribution fits the distribution of parameters trained by SGD. Then we compare the escaping behaviors of the power-law dynamic, the Langevin dynamic, and SGD.

5.1. FITTING PARAMETER DISTRIBUTION USING POWER-LAW DISTRIBUTION

We investigate the distribution of parameters trained by SGD on deep neural networks and fit it with the power-law κ distribution. We first use SGD to train various types of deep neural networks until convergence. For each network, we run SGD with minibatch sizes in {64, 256, 1024}; for the settings of the other hyper-parameters, readers can refer to Appendix 7.5.2. We plot the distribution of the model parameters at the same layer using a histogram. Next, we fit the distribution of the parameters with the power-law κ distribution and estimate the value of κ via the built-in function TsallisQGaussianDistribution[] in Mathematica. We show results for LeNet-5 on the MNIST dataset and ResNet-18 on the CIFAR10 dataset (LeCun et al., 2015; He et al., 2016b) in this section, and put results for other network architectures in Appendix 7.5.2. In Figure 3, we report the generalization error (i.e., test error minus training error) and the values of κ that best fit the histograms. We have the following observations: (1) The distribution of the parameters trained by SGD is well fitted by the power-law κ distribution (blue curve). (2) As the minibatch size becomes larger, κ becomes larger; this is because the noise σ_H decreases linearly as the minibatch size grows and κ = H/(ησ_H). (3) As κ becomes smaller, the generalization error becomes lower, which indicates that κ also serves as an indicator of generalization. These results are consistent with the theory in Section 4.
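Our fitting procedure uses Mathematica; an equivalent open-source sketch (our own assumption, using the fact that the power-law κ density $p(w) \propto (1 + \frac{\sigma_H}{\sigma_g}w^2)^{-\kappa}$ is a rescaled Student-t with $\nu = 2\kappa - 1$ degrees of freedom) recovers the tail-index with a SciPy maximum-likelihood fit:

```python
import numpy as np
from scipy import stats

# The power-law kappa density is a rescaled Student-t with nu = 2*kappa - 1
# degrees of freedom, so the tail-index of a weight histogram can be
# estimated with a Student-t maximum-likelihood fit.
kappa_true = 3.0
samples = stats.t.rvs(df=2 * kappa_true - 1, scale=0.05, size=200_000,
                      random_state=np.random.default_rng(0))

df_hat, loc_hat, scale_hat = stats.t.fit(samples)
kappa_hat = (df_hat + 1) / 2   # invert nu = 2*kappa - 1
```

On synthetic heavy-tailed samples the fit recovers the true tail-index closely; applied to a layer's weights, the same three-parameter fit yields κ, the center, and the scale of the histogram.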

5.2. COMPARISON ON ESCAPING EFFICIENCY

We use a 2-dimensional model to simulate the efficiency of escaping from minima for the power-law dynamic, the Langevin dynamic, and SGD. We design a non-convex 2-dimensional function $L(w) = \frac{1}{n}\sum_{i=1}^{n}\ell(w - x_i)$, where $\ell(w) = 15\sum_{j=1}^{2}|w_j - 1|^{2.5}\cdot|w_j + 1|^3$ and the training data $x_i \sim \mathcal{N}(0, 0.01 I_2)$. We regard the following optimization iterates as the numerical discretization of the power-law dynamic:
$$w_{t+1} = w_t - \eta g(w_t) + \eta\lambda_2\sqrt{1 + \lambda_1(w_t - w^*)^2}\odot\xi, \quad \xi \sim \mathcal{N}(0, I_2),$$
where $\lambda_1, \lambda_2$ are two hyper-parameters and $\odot$ stands for the Hadamard product. Note that setting $\lambda_1 = 0$ recovers a discretization of the Langevin dynamic. We set the learning rate $\eta = 0.025$ and take 500 iterations in each run. To match the trace of the covariance matrix of the stochastic gradient at the minimum $w^*$ across methods, $\lambda_2$ is chosen to satisfy $Tr(\mathrm{Cov}(\lambda_2\xi)) = Tr(\mathrm{Cov}(\tilde{g}(w^*)))$. We compare the success rate of escaping for the power-law dynamic, the Langevin dynamic, and SGD by repeating each experiment 100 times. To analyze the noise term $\lambda_1$, we evaluate the success rate of escaping for different values of $\lambda_1$, as shown in Figure 4(c). The results show that: (1) there is a positive correlation between $\lambda_1$ and the success rate of escaping; (2) the power-law dynamic can mimic the escaping efficiency of SGD, while the Langevin dynamic cannot. We then scale the loss function by 0.9 to make the minima flatter and repeat all algorithms under the same setting. The success rates for the scaled loss function are shown in Figure 4(d); all dynamics escape flatter minima more slowly.
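A simplified 1-dimensional version of this experiment can be sketched as follows (our own parameters, not the paper's: the per-coordinate loss $\ell(w) = |w - 1|^{2.5}|w + 1|^3$ without the factor 15 for a stable step size, no data noise, $\lambda_2 = 3$, and $\lambda_1 \in \{0, 100\}$). It contrasts constant and state-dependent diffusion around the minimum $w^* = 1$, whose basin is bounded by the barrier at $w = 1/11$:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(u):
    # derivative of l(w) = |w-1|**2.5 * |w+1|**3 (rescaled per-coordinate loss)
    return (2.5 * np.abs(u - 1) ** 1.5 * np.sign(u - 1) * np.abs(u + 1) ** 3
            + 3 * np.abs(u - 1) ** 2.5 * np.abs(u + 1) ** 2 * np.sign(u + 1))

def escape_rate(lam1, lam2=3.0, eta=0.025, steps=500, trials=200):
    """Fraction of runs that leave the basin around w* = 1."""
    escapes = 0
    for _ in range(trials):
        w = 1.0
        for _ in range(steps):
            noise = eta * lam2 * np.sqrt(1 + lam1 * (w - 1.0) ** 2) * rng.normal()
            w = w - eta * grad(w) + noise
            if abs(w - 1.0) > 10 / 11:   # crossed the barrier distance
                escapes += 1
                break
    return escapes / trials

rate_langevin = escape_rate(lam1=0.0)    # constant diffusion
rate_powerlaw = escape_rate(lam1=100.0)  # state-dependent diffusion
```

With these settings the state-dependent noise amplifies excursions away from the minimum and escapes far more often than the constant-diffusion baseline, mirroring the qualitative gap reported in Figure 4(c).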

6. CONCLUSION

In this work, we study the dynamics of SGD by investigating the state-dependent variance of the stochastic gradient. We propose the power-law dynamic with state-dependent diffusion to approximate the dynamics of SGD. We analyze the efficiency of escaping from local minima and the PAC-Bayes generalization error bound for the power-law dynamic. The results indicate that state-dependent noise helps SGD escape from poor local minima faster and generalize better, and we present direct empirical evidence to support these theoretical findings. This work may motivate many interesting research topics, for example, non-Gaussian state-dependent noise, new types of state-dependent regularization tricks in deep learning algorithms, and more accurate characterizations of the loss surfaces of deep neural networks. We will investigate these topics in future work.

7. APPENDIX

7.1 POWER-LAW DYNAMIC AND STATIONARY DISTRIBUTION

Theorem 9 (Theorem 2 in the main paper) The stationary distribution density of the 1-dimensional power-law dynamic (Eq. 4) is
$$p(w) = \frac{1}{Z}\,(C(w))^{-\frac{H}{\eta\sigma_H}}\exp\left[\frac{4\rho_{g,H}H}{\eta\sigma_H\sqrt{4\sigma_H\sigma_g - 4\rho_{g,H}^2}}\arctan\left(\frac{C'(w)}{\sqrt{4\sigma_H\sigma_g - 4\rho_{g,H}^2}}\right)\right],$$
where $C(w) = \sigma_g + 2\rho_{g,H}(w - w^*) + \sigma_H(w - w^*)^2$, $Z$ is the normalization constant, and $\arctan(\cdot)$ is the arctangent function.

Proof: Denote $h(w) = \frac{4\rho_{g,H}H}{\eta\sigma_H\sqrt{4\sigma_H\sigma_g - 4\rho_{g,H}^2}}\arctan\left(\frac{C'(w)}{\sqrt{4\sigma_H\sigma_g - 4\rho_{g,H}^2}}\right)$. According to the Fokker-Planck equation, $p(w)$ satisfies
$$0 = \nabla\big(p(w)g(w)\big) + \frac{\eta}{2}\cdot\nabla\cdot\big(C(w)\nabla p(w)\big) = \nabla\cdot\Big(p(w)\nabla L(w) + \frac{\eta}{2}C(w)\nabla p(w)\Big) = \nabla\cdot\Big[\frac{\eta}{2}C(w)^{-\frac{H}{\eta\sigma_H}+1}e^{h(w)}\,\nabla\Big(C(w)^{\frac{H}{\eta\sigma_H}}e^{-h(w)}p(w)\Big)\Big].$$
Readers can check the third equality by expanding $\nabla\big(C(w)^{\frac{H}{\eta\sigma_H}}e^{-h(w)}p(w)\big)$ with $C(w) = \sigma_g + 2\rho_{g,H}(w - w^*) + \sigma_H(w - w^*)^2$. Because the left-hand side equals zero, $C(w)^{\frac{H}{\eta\sigma_H}}e^{-h(w)}p(w)$ equals a constant, so $p(w) \propto C(w)^{-\frac{H}{\eta\sigma_H}}e^{h(w)}$, which gives the conclusion of the theorem.

Theorem 10 (Corollary 3 in the main paper) If $C(w) = \sigma_g + \sigma_H(w - w^*)^2$, the stationary distribution density of the power-law dynamic is
$$p(w) = \frac{1}{Z}\big(1 + \sigma_H\sigma_g^{-1}(w - w^*)^2\big)^{-\kappa},$$
where $Z = \int_w\big(1 + \sigma_H\sigma_g^{-1}(w - w^*)^2\big)^{-\kappa}dw$ is the normalization constant and $\kappa = \frac{H}{\eta\sigma_H}$ is the tail-index.

Proof: According to the Fokker-Planck equation, $p(w)$ satisfies
$$0 = \nabla\big(p(w)g(w)\big) + \frac{\eta}{2}\cdot\nabla\cdot\big(C(w)\nabla p(w)\big) = \nabla\big(p(w)\cdot\nabla L(w)\big) + \frac{\eta}{2}\nabla\cdot\Big(\Big(\sigma_g + \frac{2\sigma_H}{H}(L(w) - L(w^*))\Big)\nabla p(w)\Big) = \nabla\cdot\Big[\frac{\eta}{2}C(w)\Big(1 + \frac{2\sigma_H}{H\sigma_g}(L(w) - L(w^*))\Big)^{-\frac{H}{\eta\sigma_H}}\nabla\Big(\Big(1 + \frac{2\sigma_H}{H\sigma_g}(L(w) - L(w^*))\Big)^{\frac{H}{\eta\sigma_H}}p(w)\Big)\Big].$$
Because the left-hand side equals zero, $\big(1 + \frac{2\sigma_H}{H\sigma_g}(L(w) - L(w^*))\big)^{\frac{H}{\eta\sigma_H}}p(w)$ equals a constant, so $p(w) \propto \big(1 + \frac{2\sigma_H}{H\sigma_g}(L(w) - L(w^*))\big)^{-\frac{H}{\eta\sigma_H}}$, which gives the conclusion of the theorem.

We plot the un-normalized distribution density of 1-dimensional power-law dynamics with different κ in Figure 5. For the four curves, we set β = 10.
We set $\kappa = 1, 0.5, 0.1, 0$ for the four curves, plotted in different colors. Actually, for any given time $t$, the distribution $p(w,t)$ of $w_t$ under the power-law dynamic has an analytic form, i.e., $p(w,t)\propto\big(1+\frac{H}{\eta\kappa\sigma(t)}(w-w(t))^2\big)^{-\kappa}$, where $w(t) = w^* + (w_0-w^*)e^{-Ht}$ and $\sigma(t)$ is a function of $\sigma_g$ and $t$. Readers can refer to Eq. 18 - Eq. 23 in (Tsallis & Bukman, 1995) for the detailed expression.
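As a sanity check on Theorem 10, the following numpy sketch verifies numerically that the claimed density makes the probability current of the 1-dimensional Fokker-Planck equation vanish. The parameter values for $H$, $\eta$, $\sigma_g$, $\sigma_H$ are illustrative choices, not taken from the paper.

```python
import numpy as np

# Check that the claimed stationary density of the 1-d power-law dynamic gives
# zero probability current
#   J(w) = grad L(w) * p(w) + (eta/2) * C(w) * p'(w)
# for the quadratic loss L(w) = (H/2)(w - w*)^2 and C(w) = sigma_g + sigma_H (w - w*)^2.
H, eta, sigma_g, sigma_H, w_star = 2.0, 0.1, 0.5, 1.0, 0.0
kappa = H / (eta * sigma_H)          # tail-index from Corollary 3

def p(w):                            # un-normalized stationary density
    return (1.0 + sigma_H / sigma_g * (w - w_star) ** 2) ** (-kappa)

def current(w, h=1e-5):
    grad_L = H * (w - w_star)        # gradient of the quadratic loss
    C = sigma_g + sigma_H * (w - w_star) ** 2
    dp = (p(w + h) - p(w - h)) / (2 * h)   # central-difference p'(w)
    return grad_L * p(w) + 0.5 * eta * C * dp

ws = np.linspace(-2.0, 2.0, 101)
print(max(abs(current(w)) for w in ws))    # ~0 up to finite-difference error
```

The current vanishes identically for this $p(w)$ because the drift term $Hw\,p(w)$ exactly cancels the diffusion term when $\kappa = H/(\eta\sigma_H)$, which is the content of the proof above.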

7.2. SGD AND MULTIVARIATE POWER-LAW DYNAMIC

The following proposition gives the covariance of the stochastic gradient of SGD in the $d$-dimensional case. We use subscripts to denote the elements of a vector or a matrix.

Proposition 11 For $w\in\mathbb{R}^d$, we use $C(w)$ to denote the covariance matrix of the stochastic gradient $g(w) = g(w^*) + H(w-w^*)$ and $\Sigma$ to denote the covariance matrix of $g(w^*)$. If $Cov(g_i(w^*), H_{jk}) = 0$ for all $i,j,k$, we have
$$C_{ij}(w) = \Sigma_{ij} + (w-w^*)^T A^{(ij)} (w-w^*),$$
where $\Sigma_{ij} = Cov(g_i(w^*), g_j(w^*))$ and $A^{(ij)}$ is a $d\times d$ matrix with elements $A^{(ij)}_{ab} = Cov(H_{ia}, H_{jb})$ with $a\in[d], b\in[d]$. Eq. 13 can be obtained by directly calculating the covariance of $g_i(w)$ and $g_j(w)$, where $g_i(w) = g_i(w^*) + \sum_{a=1}^d H_{ia}(w_a-w^*_a)$ and $g_j(w) = g_j(w^*) + \sum_{b=1}^d H_{jb}(w_b-w^*_b)$.

In order to get an analytically tractable form of $C(w)$, we make the following assumptions: (1) if $\Sigma_{ij}=0$, $A^{(ij)}$ is a zero matrix; (2) for $\Sigma_{ij}\neq 0$, the ratios $\frac{A^{(ij)}}{\Sigma_{ij}}$ are equal for all $i\in[d], j\in[d]$. The first assumption is reasonable because both $\Sigma_{ij}$ and $A^{(ij)}$ reflect the dependence of the derivatives along the $i$-th and $j$-th directions. Let $\Sigma_H = \frac{A^{(ij)}}{\Sigma_{ij}}$; then $C(w)$ can be written as $C(w) = \Sigma_g\big(1+(w-w^*)^T\Sigma_H(w-w^*)\big)$. The $d$-dimensional power-law dynamic is written as
$$dw_t = -H(w-w^*)dt + \sqrt{\eta C(w)}\,dB_t,$$
where $C(w) = \Sigma_g\big(1+(w-w^*)^T\Sigma_H(w-w^*)\big)$ is a symmetric positive definite matrix, so that $C(w)^{1/2}$ exists. The following proposition gives the stationary distribution of the $d$-dimensional power-law dynamic.

Proposition 12 Suppose $\Sigma_g, \Sigma_H, H$ are codiagonalizable, i.e., there exist an orthogonal matrix $Q$ and diagonal matrices $\Lambda, \Gamma, \Pi$ that satisfy $\Sigma_g = Q^T\Lambda Q$, $\Sigma_H = Q^T\Gamma Q$, $H = Q^T\Pi Q$. Then the stationary distribution of the power-law dynamic is
$$p(w) = \frac{1}{Z}\big(1+(w-w^*)^T\Sigma_H(w-w^*)\big)^{-\kappa},$$
where $Z$ is the normalization constant and $\kappa = \frac{Tr(H)}{\eta Tr(\Sigma_H\Sigma_g)}$.

Proof: Under the codiagonalization assumption on $\Sigma_g, \Sigma_H, H$, Eq. 15 can be rewritten as $dv_t = -\Pi v_t dt + \sqrt{\eta\Lambda(1+v_t^T\Gamma v_t)}\,dB_t$ if we let $v_t = Q(w_t-w^*)$.
Let $\phi(v) = \frac{\eta C(v)}{2} = \frac{\eta}{2}\Lambda(1+v^T\Gamma v)$. The stationary probability density $p(v)$ satisfies the Smoluchowski equation:
$$0 = \sum_{i=1}^d\frac{\partial}{\partial v_i}\big(\Pi_iv_i\,p(v)\big) + \sum_{i=1}^d\frac{\partial}{\partial v_i}\Big(\phi_i(v)\frac{\partial}{\partial v_i}p(v)\Big) = \sum_{i=1}^d\frac{\partial}{\partial v_i}\big(\Pi_iv_i\,p(v)\big) + \sum_{i=1}^d\frac{\partial}{\partial v_i}\Big(\frac{\eta\Lambda_i}{2}(1+v^T\Gamma v)\frac{\partial}{\partial v_i}p(v)\Big).$$
According to the result for the 1-dimensional case, $p(v)$ has the form $p(v)\propto(1+v^T\Gamma v)^{-\kappa}$. To determine the value of $\kappa$, we put $p(v)$ into the Smoluchowski equation to obtain
$$\sum_{i=1}^d\Pi_i\,p(v) - 2\kappa\sum_{i=1}^d\Pi_iv_i\cdot\Gamma_iv_i\,(1+v^T\Gamma v)^{-\kappa-1} = \sum_{i=1}^d\frac{\partial}{\partial v_i}\Big(\eta\Lambda_i\kappa(1+v^T\Gamma v)^{-\kappa}\,\Gamma_iv_i\Big) = \sum_{i=1}^d\eta\Lambda_i\kappa(1+v^T\Gamma v)^{-\kappa}\Gamma_i - 2\sum_{i=1}^d\eta\Lambda_i\kappa^2(1+v^T\Gamma v)^{-\kappa-1}(\Gamma_iv_i)^2.$$
Then we have $\sum_{i=1}^d\Pi_i = \eta\kappa\sum_{i=1}^d\Lambda_i\Gamma_i$, so $\kappa = \frac{Tr(H)}{\eta Tr(\Sigma_H\Sigma_g)}$.

According to Proposition 11, we can also consider another set of assumptions on $\Sigma_g, \Sigma_H, H$ without assuming their codiagonalization. Instead, we assume: (1) if $\Sigma_{ij}=0$, $A^{(ij)}$ is a zero matrix; (2) for $\Sigma_{ij}\neq 0$, the $A^{(ij)}$ are equal for all $i\in[d], j\in[d]$, and we denote $A^{(ij)} = \tilde\Sigma_H$, where we suppose $\tilde\Sigma_H = \frac{1}{\eta\kappa}H$; (3) $\Sigma_g = \sigma_g\cdot I_d$, which is isotropic. Under these assumptions, we can get the following theorem.

Theorem 13 (Theorem 4 in main paper) If $w$ is $d$-dimensional and $C(w)$ has the form in Eq. (8), the stationary distribution density of the multivariate power-law dynamic is
$$p(w) = \frac{1}{Z}\Big[1+\frac{1}{\eta\kappa}(w-w^*)^TH\Sigma_g^{-1}(w-w^*)\Big]^{-\kappa},$$
where $Z = \int_{-\infty}^{\infty}\big[1+\frac{1}{\eta\kappa}(w-w^*)^TH\Sigma_g^{-1}(w-w^*)\big]^{-\kappa}dw$ is the normalization constant.

The proof of Theorem 13 is similar to that of Proposition 12; readers can check that $p(w)$ satisfies the Smoluchowski equation.

An example to illustrate why $C(w)$ is diagonally dominant. In Theorem 13, $C(w)$ is assumed to be diagonally dominant, which means that the variance of each dimension of $g(w)$ is significantly larger than the covariance between two different dimensions of $g(w)$. Consider a two-layer fully-connected linear neural network $f_{w,v}(x) = wvx$, where $w\in\mathbb{R}^{1\times m}$, $v\in\mathbb{R}^{m\times d}$ and $x\in\mathbb{R}^d$.
We consider the regression loss $\ell(w,v) = \frac{1}{2}(y-f_{w,v}(x))^2$. The gradients with respect to $w_i$ and $v_{jk}$ can be written as
$$\frac{\partial\ell(w,v)}{\partial w_i} = (f_{w,v}(x)-y)\cdot v_ix,\qquad \frac{\partial\ell(w,v)}{\partial v_{jk}} = (f_{w,v}(x)-y)\cdot w_jx_k,$$
where $v_i$ denotes the $i$-th row of matrix $v$. Suppose that $w$ and $v$ are initialized as $w_i\overset{i.i.d.}{\sim}N(0,\delta_1)$ and $v_{ij}\overset{i.i.d.}{\sim}N(0,\delta_2)$. We also assume that $\mathbb{E}x_i = \mathbb{E}x_j = 0$ and that $x_i, x_j$ are independent of each other for $i\neq j$, where $x_i$ is the $i$-th dimension of $x$. We have
$$\mathbb{E}_{w,v}\Big[\frac{\partial\ell(w,v)}{\partial w_i}\frac{\partial\ell(w,v)}{\partial w_j}\Big] = \mathbb{E}_{w,v}\big[(f_{w,v}(x)-y)^2\cdot v_ix\cdot v_jx\big] = \mathbb{E}_{w,v}\big[y^2\cdot v_ix\cdot v_jx\big] + \mathbb{E}_{w,v}\Big[\Big(\sum_{i'=1}^m w_{i'}v_{i'}x\Big)^2\cdot v_ix\cdot v_jx\Big] - 2\,\mathbb{E}_{w,v}\Big[\Big(\sum_{i'=1}^m y\,w_{i'}v_{i'}x\Big)\cdot v_ix\cdot v_jx\Big].$$
Because $v_i$ and $v_j$ are independent with zero mean, we obtain $\mathbb{E}_{w,v}\big[\frac{\partial\ell(w,v)}{\partial w_i}\frac{\partial\ell(w,v)}{\partial w_j}\big] = 0$ for $i\neq j$. Similarly, we can get $\mathbb{E}_{w,v}\big[\frac{\partial\ell(w,v)}{\partial w_i}\frac{\partial\ell(w,v)}{\partial v_{jk}}\big] = 0$ and $\mathbb{E}_{w,v}\big[\frac{\partial\ell(w,v)}{\partial v_{j'k'}}\frac{\partial\ell(w,v)}{\partial v_{jk}}\big] = 0$ for $(j,k)\neq(j',k')$. The above analysis shows that the gradients for different dimensions are independent at initialization. It has been observed that many weights stay close to random during training because of over-parameterization (Balduzzi et al., 2017). So the diagonal dominance of $C(w)$ is reasonable.
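Proposition 11 can also be checked by simulation. The sketch below uses a toy noise model that is our own assumption for illustration: a random Hessian $H = \bar H + sZ$ with i.i.d. Gaussian entries of $Z$ independent of $g(w^*)$, for which $A^{(ii)} = s^2I$, so the variance of each gradient coordinate should grow quadratically with $\|w-w^*\|$.

```python
import numpy as np

# Monte-Carlo sketch of Proposition 11: for stochastic gradients of the form
#   g(w) = g(w*) + H (w - w*),  with Cov(g_i(w*), H_jk) = 0,
# the covariance is C_ij(w) = Sigma_ij + (w - w*)^T A^{(ij)} (w - w*).
# Toy noise model (an assumption for illustration): H = Hbar + s*Z with
# i.i.d. standard-normal Z independent of g(w*), so A^{(ii)} = s^2 * I.
rng = np.random.default_rng(0)
d, n, s = 3, 200_000, 0.3
Sigma = np.diag([1.0, 2.0, 0.5])            # covariance of g(w*)
Hbar = np.diag([2.0, 1.0, 3.0])             # mean Hessian
delta = np.array([0.5, -0.2, 0.1])          # w - w*

g_star = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
Z = rng.standard_normal((n, d, d))
g = g_star + (Hbar + s * Z) @ delta          # n stochastic gradients at w

emp_var = g[:, 0].var()                      # empirical C_11(w)
pred_var = Sigma[0, 0] + s**2 * delta @ delta  # Sigma_11 + delta^T A^{(11)} delta
print(emp_var, pred_var)
```

The empirical and predicted variances agree up to Monte-Carlo error, illustrating the quadratic state dependence of the gradient-noise covariance.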

7.3. SUPPLEMENTARY MATERIALS FOR RESULTS IN SECTION 4

7.3.1. PROOF FOR MEAN ESCAPING TIME

Lemma 14 (Lemma 6 in main paper) We suppose $C(w) = \sigma_{g_a} + \frac{2\sigma_{H_a}}{H_a}(L(w)-L(a))$ on the whole escaping path from $a$ to $b$. The mean escaping time of the 1-dimensional power-law dynamic is
$$\tau = \frac{2\pi}{\big(1-\frac{1}{2\kappa}\big)\sqrt{H_a|H_b|}}\Big(1+\frac{2}{\kappa\eta\sigma_g}\Delta L\Big)^{\kappa-\frac12},$$
where $\kappa = \frac{H_a}{\eta\sigma_{H_a}}$, and $H_a, H_b$ are the second-order derivatives of the training loss at local minimum $a$ and saddle point $b$.

Proof: According to (Van Kampen, 1992), the mean escaping time $\tau$ is expressed as $\tau = \frac{P(w\in V_a)}{\int_\Omega J\,d\Omega}$, where $V_a$ is the volume of basin $a$ and $J$ is the probability current, which satisfies
$$-\nabla J(w,t) = \frac{\partial}{\partial w}\big(g(w)\,p(w,t)\big) + \frac{\partial}{\partial w}\Big(\phi(w)\frac{\partial p(w,t)}{\partial w}\Big) = \frac{\partial}{\partial w}\left(\phi(w)\Big(1+\frac{\mu}{\sigma_g}\Delta L(w)\Big)^{-\kappa}\frac{\partial}{\partial w}\Big[\Big(1+\frac{\mu}{\sigma_g}\Delta L(w)\Big)^{\kappa}p(w,t)\Big]\right),$$
where $\phi(w) = \frac{\eta}{2}C(w)$, $\mu = \frac{2\sigma_{H_a}}{H_a}$, $\sigma_g = \sigma_{g_a}$ and $\Delta L(w) = L(w)-L(a)$. Integrating both sides, we obtain
$$J(w) = -\phi(w)\Big(1+\frac{\mu}{\sigma_g}\Delta L(w)\Big)^{-\kappa}\frac{\partial}{\partial w}\Big[\Big(1+\frac{\mu}{\sigma_g}\Delta L(w)\Big)^{\kappa}p(w,t)\Big].$$
Because there is no field source on the escape path, $J(w)$ is a fixed constant on the escape path. Multiplying $\phi(w)^{-1}\big(1+\frac{\mu}{\sigma_g}\Delta L(w)\big)^{\kappa}$ on both sides and integrating from $a$ to $c$, we have
$$J\cdot\int_a^c\phi(w)^{-1}\Big(1+\frac{\mu}{\sigma_g}\Delta L(w)\Big)^{\kappa}dw = -\int_a^c\frac{\partial}{\partial w}\Big[\Big(1+\frac{\mu}{\sigma_g}\Delta L(w)\Big)^{\kappa}p(w,t)\Big]dw = -0 + p(a).$$
Then we get $J = \frac{p(a)}{\int_a^c\phi(w)^{-1}(1+\frac{\mu}{\sigma_g}\Delta L(w))^{\kappa}dw}$. As for the term $\int_a^c\phi(w)^{-1}\big(1+\frac{\mu}{\sigma_g}\Delta L(w)\big)^{\kappa}dw$, we have
$$\int_a^c\phi(w)^{-1}\Big(1+\frac{\mu}{\sigma_g}\Delta L(w)\Big)^{\kappa}dw = \frac{2}{\eta\sigma_g}\int_a^c\Big(1+\frac{\mu}{\sigma_g}\Delta L(w)\Big)^{\kappa-1}dw \approx \frac{2}{\eta\sigma_g}\Big(1+\frac{\mu}{\sigma_g}\Delta L(b)\Big)^{\kappa-1}\int\Bigg(1-\frac{\frac{\mu}{2\sigma_g}|H_b|(w-b)^2}{1+\frac{\mu}{\sigma_g}\Delta L(b)}\Bigg)^{\kappa-1}dw = \frac{2}{\eta\sigma_g}\Big(1+\frac{\mu}{\sigma_g}\Delta L(b)\Big)^{\kappa-\frac12}\sqrt{\frac{2\sigma_g}{\mu|H_b|}}\,B\Big(\frac12,\kappa\Big),$$
where the approximation uses the second-order Taylor expansion of $L(w)$ around the saddle point $b$ (valid under the low temperature assumption), and the last step substitutes $y = \frac{\mu|H_b|(w-b)^2}{2\sigma_g(1+\frac{\mu}{\sigma_g}\Delta L(b))}$ to produce the Beta integral $\int_0^1y^{-\frac12}(1-y)^{\kappa-1}dy = B(\frac12,\kappa)$. As for the term $P(w\in V_a)$, we have
$$P(w\in V_a) = \int_{V_a}p(w)\,dV = \int_{w\in V_a}p(a)\Big(1+\frac{\mu}{\sigma_g}\Delta L(w)\Big)^{-\kappa}dw = p(a)\sqrt{\frac{2\sigma_g}{\mu H_a}}\,B\Big(\frac12,\kappa-\frac12\Big),$$
where we use the Taylor expansion of $L(w)$ near the local minimum $a$. Then $\tau = \frac{P(w\in V_a)}{\int_\Omega J\,d\Omega} = \frac{P(w\in V_a)}{J}$ because $J$ is a constant. Combining all the results, we get the conclusion in the lemma.
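To see the point of the lemma numerically, the snippet below compares the growth orders of the two escape-time formulas with common prefactors dropped (an illustrative simplification): the power-law time grows polynomially in $\Delta L$, while the Langevin time $\exp(2\Delta L/(\eta\sigma_g))$ grows exponentially.

```python
import math

# Numeric sketch of Lemma 14's message: the power-law dynamic escapes a basin
# in time polynomial in the barrier height dL, while Langevin dynamics
# (constant diffusion sigma_g) takes time exponential in dL.  Prefactors are
# set to 1; only the growth orders are compared.
eta, sigma_g, kappa = 0.1, 1.0, 10.0

def tau_power_law(dL):
    # (1 + 2 dL / (kappa eta sigma_g))^(kappa - 1/2): polynomial order of dL
    return (1.0 + 2.0 * dL / (kappa * eta * sigma_g)) ** (kappa - 0.5)

def tau_langevin(dL):
    # classical Kramers rate for constant diffusion: exponential order of dL
    return math.exp(2.0 * dL / (eta * sigma_g))

for dL in [0.5, 1.0, 2.0, 4.0]:
    print(dL, tau_power_law(dL), tau_langevin(dL))
```

Note that as $\kappa\to\infty$ the power-law expression $(1+2\Delta L/(\kappa\eta\sigma_g))^{\kappa-1/2}$ converges to $\exp(2\Delta L/(\eta\sigma_g))$, i.e., the Langevin case is recovered in the heavy-tail-free limit.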
Theorem 15 (Theorem 7 in main paper) Suppose $w\in\mathbb{R}^d$ and there is only one most possible escape path between basin $a$ and the outside of basin $a$. The mean escaping time for the power-law dynamic escaping from basin $a$ to the outside of basin $a$ is
$$\tau = \frac{2\pi}{\big(1-\frac{d}{2\kappa}\big)|H_{be}|}\sqrt{\frac{-\det(H_b)}{\det(H_a)}}\Big(1+\frac{1}{\eta\kappa\sigma_e}\Delta L\Big)^{\kappa-\frac12},$$
where $e$ indicates the most possible escape direction, $H_{be}$ is the only negative eigenvalue of $H_b$, $\sigma_e$ is the eigenvalue of $\Sigma_{g_a}$ corresponding to the escape direction, and $\Delta L = L(b)-L(a)$.

Proof: According to (Van Kampen, 1992), the mean escaping time $\tau$ is expressed as $\tau = \frac{P(w\in V_a)}{\int_\Omega J\,d\Omega}$, where $V_a$ is the volume of basin $a$ and $J$ is the probability current, which satisfies $-\nabla\cdot J(w,t) = \frac{\partial p(w,t)}{\partial t}$. Under the low temperature assumption, the probability current $J$ concentrates along the direction corresponding to the negative eigenvalue $H_{be}$, and the probability flux of the other directions can be ignored. Then we have
$$\int_\Omega J\,d\Omega = J_e\cdot\int_\Omega\Big(1+\frac{1}{\eta\kappa}(w-b)^T\big(H_b\Sigma_g^{-1}\big)_{\perp e}(w-b)\Big)^{-\kappa+\frac12}d\Omega,$$
where $J_e = \frac{p(a)\,\eta\sqrt{\mu\sigma_e|H_{be}|}}{2\sqrt{2}\,B(\frac12,\kappa)}\big(1+\frac{\mu}{\sigma_e}\Delta L(b)\big)^{-\kappa+\frac12}$ is obtained by the calculation of $J$ for the 1-dimensional case in the proof of Lemma 14, and $(\cdot)_{\perp e}$ denotes the directions perpendicular to the escape direction $e$. Suppose $H_b\Sigma_g^{-1}$ is a symmetric matrix. Then there exist an orthogonal matrix $Q$ and a diagonal matrix $\Lambda = diag(\lambda_1,\cdots,\lambda_d)$ that satisfy $H_b\Sigma_g^{-1} = Q^T\Lambda Q$. We denote $v = Q(w-b)$ and define the sequence $T_k = 1+\frac{1}{\eta\kappa}\sum_{j=k}^d\lambda_jv_j^2$ for $k=1,\cdots,d$. As for the integral above, integrating out one coordinate at a time gives
$$\int_\Omega\Big(1+\frac{1}{\eta\kappa}(w-b)^T\big(H_b\Sigma_g^{-1}\big)_{\perp e}(w-b)\Big)^{-\kappa+\frac12}d\Omega = \int\Big(1+\frac{1}{\eta\kappa}\sum_{j\neq e}\lambda_jv_j^2\Big)^{-\kappa+\frac12}dv = \prod_{j\neq e}\big((\eta\kappa)^{-1}\lambda_j\big)^{-\frac12}\prod_{j=1}^{d-1}B\Big(\frac12,\kappa-\frac{j+1}{2}\Big) = \sqrt{\frac{(\eta\kappa\pi)^{d-1}}{\det\big((H_b\Sigma_g^{-1})_{\perp e}\big)}}\cdot\frac{\Gamma(\kappa-\frac{d}{2})}{\Gamma(\kappa-\frac12)},$$
where the last step uses $B(\frac12,s) = \frac{\sqrt{\pi}\,\Gamma(s)}{\Gamma(s+\frac12)}$ and the telescoping of the resulting Gamma ratios.
As for the term P (w ∈ V a ), we have P (w ∈ V a ) = Va p(w)dV = p(a) w∈Va 1 + (w -w * ) T H a Σ -1 g (w -w * ) dw (27) =p(a) • (ηκπ) d • Γ(κ -d 2 ) Γ(κ) det((H a Σ -1 g )) where we use Taylor expansion of L(w) near local minimum a. Combined the results for P (w ∈ V a ) and J, we can get the result.

7.3.2. FURTHER EXPLANATION ABOUT ASSUMPTION 1-3

We adopt the assumptions that are commonly used to analyze the mean escaping time of dynamical systems (Xie et al., 2020; Smith & Le, 2017; Zhou & Du, 2014). Assumption 2 can be replaced by the weaker assumption that the system is in quasi-equilibrium, which is adopted in (Xie et al., 2020); for the differences between quasi-equilibrium and equilibrium, readers can refer to (Xie et al., 2020) for a detailed discussion. Assumption 3 is also commonly used (Xie et al., 2020; Zhou & Du, 2014). Under Assumption 3, the probability density concentrates around the minima and the most possible paths, which makes the second-order Taylor approximation more reasonable.

7.3.3. EXTENSION TO MORE COMPLEX DYNAMIC ON THE ESCAPING PATH

In Lemma 6, we assume that $C(w) = \sigma_{g_a} + \frac{2\sigma_{H_a}}{H_a}(L(w)-L(a))$ on the whole escaping path from $a$ to $b$ for ease of comparison and presentation. This assumption is not necessary, and we can assume a different dynamic near the saddle point $b$. Specifically, let the point $z$ be the midpoint on the most possible path between $a$ and $b$, where $L(z) = (1-c)L(a)+cL(b)$ for some $c\in(0,1)$. The dynamic with $C(w) = \sigma_{g_a} + \frac{2\sigma_{H_a}}{H_a}(L(w)-L(a))$ dominates the path $a\to z$, and the dynamic with $C(w) = \sigma_{g_b} + \frac{2\sigma_{H_b}}{H_b}(L(b)-L(w))$ dominates the path $z\to b$. Then only two things change in the proof of Lemma 6. First, we need to change the stationary distribution near the saddle point according to its own dynamic in Eq. 20. Second, we need to change the integral of the probability density over the whole path into the sum of the integrals over these two sub-paths. Similar proof techniques are adopted for analyzing the escaping time of Langevin dynamics in the proof of Theorem 4.1 of Xie et al. (2020). Since the proof is analogous, we omit the details here.

7.4. PAC-BAYES GENERALIZATION BOUND

We briefly introduce the basic setting of the PAC-Bayes generalization error. The expected risk is defined as $\mathbb{E}_{x\sim P(x)}\ell(w,x)$. Suppose the parameter follows a distribution with density $p(w)$; the expected risk in terms of $p(w)$ is defined as $\mathbb{E}_{w\sim p(w),x\sim P(x)}\ell(w,x)$, and the empirical risk in terms of $p(w)$ is defined as $\mathbb{E}_{w\sim p(w)}L(w) = \mathbb{E}_{w\sim p(w)}\frac{1}{n}\sum_{i=1}^n\ell(w,x_i)$. Suppose the prior distribution over the parameter space is $p'(w)$, and $p(w)$ is the distribution on the parameter space expressing the learned hypothesis. For the power-law dynamic, $p(w)$ is its stationary distribution, and we choose $p'(w)$ to be a Gaussian distribution with center $w^*$ and covariance matrix $I$. Then we can get the following theorem.

Theorem 16 (Theorem 8 in main paper) For $w\in\mathbb{R}^d$, we select the prior distribution $p'(w)$ to be the standard Gaussian distribution. For $\delta>0$, with probability at least $1-\delta$, the stationary distribution of the power-law dynamic has the following generalization error bound:
$$\mathbb{E}_{w\sim p(w),x\sim P(x)}\ell(w,x) \le \mathbb{E}_{w\sim p(w)}L(w) + \sqrt{\frac{KL(p\|p') + \log\frac{1}{\delta} + \log n + 2}{n-1}},$$
where $KL(p\|p') \le \frac12\log\frac{\det(H)}{\det(\Sigma_g)} + \frac{Tr(\eta\Sigma_gH^{-1})-2d}{4\big(1-\frac{1}{\kappa}(\frac{d}{2}-1)\big)} + \frac{d}{2}\log\frac{2}{\eta}$ and $P(x)$ is the underlying distribution of the data $x$.

Proof: Eq. (29) directly follows the results in (McAllester, 1999). Here we calculate the Kullback-Leibler (KL) divergence between the prior distribution and the stationary distribution of the power-law dynamic. The prior is the standard Gaussian density $p'(w) = \frac{1}{\sqrt{(2\pi)^d\det(I)}}\exp\{-\frac12(w-w^*)^TI(w-w^*)\}$. The posterior density is the stationary distribution of the power-law dynamic, i.e., $p(w) = \frac{1}{Z}\big(1+\frac{1}{\eta\kappa}(w-w^*)^TH\Sigma_g^{-1}(w-w^*)\big)^{-\kappa}$. Suppose $H\Sigma_g^{-1}$ is a symmetric matrix. Then there exist an orthogonal matrix $Q$ and a diagonal matrix $\Lambda = diag(\lambda_1,\cdots,\lambda_d)$ that satisfy $H\Sigma_g^{-1} = Q^T\Lambda Q$. We also denote $v = Q(w-w^*)$.
We have
$$\log\frac{p(w)}{p'(w)} = -\kappa\log\Big(1+\frac{1}{\eta\kappa}(w-w^*)^TH\Sigma_g^{-1}(w-w^*)\Big) - \log Z + \frac12(w-w^*)^TI(w-w^*) + \frac{d}{2}\log 2\pi.$$
The KL-divergence is defined as $KL(p\|p') = \int_wp(w)\log\frac{p(w)}{p'(w)}dw$. Putting $v = Q(w-w^*)$ into the integral and using the approximation $\log(1+x)\approx x$ for the first term, we have
$$KL(p\|p') = \frac{d}{2}\log 2\pi - \log Z + \frac{1}{2Z}\int_vv^Tv\Big(1+\frac{1}{\eta\kappa}v^T\Lambda v\Big)^{-\kappa}dv - \frac{1}{Z\eta}\int_vv^T\Lambda v\Big(1+\frac{1}{\eta\kappa}v^T\Lambda v\Big)^{-\kappa}dv.\quad(30)$$
We define the sequence $T_k = 1+\frac{1}{\eta\kappa}\sum_{j=k}^d\lambda_jv_j^2$ for $k=1,\cdots,d$. We first calculate the normalization constant $Z$:
$$Z = \int\Big(1+\frac{1}{\eta\kappa}v^T\Lambda v\Big)^{-\kappa}dv = \int\Big(1+\frac{1}{\eta\kappa}\sum_{j=1}^d\lambda_jv_j^2\Big)^{-\kappa}dv = \big((\eta\kappa)^{-1}\lambda_1\big)^{-\frac12}\int T_2^{-\kappa+\frac12}B\Big(\frac12,\kappa-\frac12\Big)dv_2\cdots dv_d = \prod_{j=1}^d\big((\eta\kappa)^{-1}\lambda_j\big)^{-\frac12}\prod_{j=1}^dB\Big(\frac12,\kappa-\frac{j}{2}\Big) = \prod_{j=1}^d\big((\eta\kappa)^{-1}\lambda_j\big)^{-\frac12}\cdot\frac{\sqrt{\pi}^d\,\Gamma(\kappa-\frac{d}{2})}{\Gamma(\kappa)}.$$
We define $Z_j = \big((\eta\kappa)^{-1}\lambda_j\big)^{-\frac12}B\big(\frac12,\kappa-\frac{j}{2}\big)$. For the third term in Eq. (30), denote $III = \int_vv^Tv\big(1+\frac{1}{\eta\kappa}v^T\Lambda v\big)^{-\kappa}dv$. Integrating out $v_1$ first, we have
$$III = \int_{v_2,\cdots,v_d}\Big[\int_{v_1}v_1^2\Big(1+\frac{1}{\eta\kappa}v^T\Lambda v\Big)^{-\kappa}dv_1 + Z_1\sum_{j=2}^dv_j^2\,T_2^{-\kappa+\frac12}\Big]dv_2\cdots dv_d = \int_{v_2,\cdots,v_d}\Big[\big((\eta\kappa)^{-1}\lambda_1\big)^{-\frac32}T_2^{-\kappa+\frac32}B\Big(\frac32,\kappa-\frac32\Big) + Z_1\sum_{j=2}^dv_j^2\,T_2^{-\kappa+\frac12}\Big]dv_2\cdots dv_d.$$
For the term $\int_{v_2,\cdots,v_d}T_2^{-\kappa+\frac32}dv_2\cdots dv_d$ in the above equation, integrating out one coordinate at a time gives
$$\int_{v_2,\cdots,v_d}T_2^{-\kappa+\frac32}dv_2\cdots dv_d = \prod_{j=2}^d\big((\eta\kappa)^{-1}\lambda_j\big)^{-\frac12}\prod_{j=2}^dB\Big(\frac12,\kappa-\big(\tfrac{j}{2}+1\big)\Big),$$
and the remaining term $\int_{v_2,\cdots,v_d}Z_1\sum_{j=2}^dv_j^2\,T_2^{-\kappa+\frac12}dv_2\cdots dv_d$ has the same structure as $III$ with one dimension fewer, which yields a recursion over the dimensions.

We also observe the parameter distribution on many pretrained models. Details of the pre-trained models can be found at https://pytorch.org/docs/stable/torchvision/models.html.

For the escaping experiment on corrupted FashionMNIST, we follow the settings in (Zhu et al., 2019); for the convenience of the readers, we give the details of this setting again. We use a corrupted FashionMNIST dataset, which contains 1000 images with correct labels and another 200 images with random labels, as training data. A small LeNet-like network with 11,330 parameters is used. First we run full gradient descent to reach a parameter $w^*$ near the global minimum. Then we continue training using both Langevin dynamics (GLD) and the power-law dynamic (PLD). Following Zhu's setting, the learning rates for GD, GLD and PLD are $\eta_{GD}=0.1$, $\eta_{GLD}=0.07$ and $\eta_{PLD}=0.07$, respectively. For GLD, the noise std is $\sigma=10^{-4}$, as tuned by Zhu et al. For our PLD, $w_{t+1} = w_t - \eta\nabla L(w_t) + \eta\cdot\alpha\nabla L(w_t)\odot\sqrt{1+\beta(w_t-w^*)^2}\odot\xi$, where $\alpha,\beta$ are hyperparameters, $\xi\sim N(0,I)$, and $\odot$ stands for the Hadamard product. Here we select $\alpha=2.4$, $\beta=2$ after a grid search. Expected sharpness is measured as $\mathbb{E}_{\nu\sim N(0,\delta^2I)}[L(w+\nu)]-L(w)$ with $\delta=0.01$, where the expectation is computed by averaging over 1000 samples. In Figure 9, the numbers in the first column of the legend show the test accuracy and the numbers in brackets show the sharpness of the models trained by the three algorithms. From Figure 9, we can conclude that PLD generalizes better than GLD and GD; moreover, PLD finds flatter critical points than GLD and GD.

For the 1-dimensional escaping experiment, the power-law dynamic (PLD) is $w_{t+1} = w_t - \eta\nabla L(w_t) + \eta\lambda_2\sqrt{1+\lambda_1(w_t-w^*)^2}\odot\xi$, where $\lambda_1,\lambda_2$ are hyperparameters, $\xi\sim N(0,I)$, and $\odot$ stands for the Hadamard product. Here we let $\lambda_1=1$, $\lambda_2=4$. For Langevin dynamics (GLD), we set the noise std $\sigma=4$ in consistence with PLD. The learning rate is $\eta=0.1$ for both methods. We initialize $w_0=w^*$ and apply both methods to $L(w)$ with different barrier heights.
Then we record the number of iterations $t$ at which $w_t$ first escapes over the barrier. We repeat this procedure for 100 rounds for each method and each barrier height and use the average to estimate the mean escaping time; the results are shown in Figure 10(b). From Figure 10(b), the mean escaping time of GLD grows much faster than that of PLD as the barrier height increases, which validates that the power-law dynamic improves the order of the barrier height compared with Langevin dynamics.



In the following, we assume $\sigma_g$ is a positive number. The training errors under the six settings are almost zero.



We empirically observe the covariance matrix around the local minimum of the training loss for deep neural networks. The results are shown in Figure 1; readers can find more details in Appendix 7.1. We have the following observations: (1) The traces of covariance matrices for the deep neural

Figure 1: Trace of the covariance matrix of the gradient noise in a region around the local minimum w*. w* is selected by running gradient descent with a small learning rate until it converges. The numbers on the horizontal axis show the distance of the point from w*. (a), (b): Results for the plain CNN. (c), (d): Results for ResNet-18.

Figure 2

Figure 3: Approximating the distribution of parameters (trained by SGD) by the power-law dynamic. The training batch size, generalization error (i.e., test error minus training error) and approximated tail-index κ are shown in the title of each plot. (a): Results for LeNet-5. (b): Results for ResNet-18.

Figure 4: (a): Loss surface of L(w) for the 2-D model. (b): Trace of the covariance matrix around the minimum (1, 1). (c)/(d): Success rate of escaping from the basin of L(w) / 0.9L(w) over 100 repeated runs.

Figure 5: Probability density for power-law dynamic.



Figure 8: Comparison between Q-Q plots of network parameters versus normal distribution and power-law distribution. (upper): Q-Q plots of parameters versus normal distribution. (bottom): Q-Q plots of parameters versus power-law distribution.

Figure 9: Escaping experiment on corrupted FashionMNIST. Test accuracy versus iterations after pretraining by GD. The model is pretrained by GD before the vertical dashed line, and training is then continued by GD, GLD and PLD (ours). Numbers in brackets are the expected sharpness after the models converge.

Figure 10: (a): Loss curve L(w) for 1-D model. (b): Mean escaping time versus different barrier heights. Mean escaping time is computed by average on 100 rounds, in which we record the number of iterations when firstly escaping from the saddle point.

Zhou, Yanjun, & Du, Jiulin. 2014. Kramers escape rate in overdamped systems with the power-law distribution. Physica A: Statistical Mechanics and its Applications, 402, 299-305.

Zhu, Zhanxing, Wu, Jingfeng, Yu, Bing, Wu, Lei, & Ma, Jinwen. 2019. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. Pages 7654-7663 of: Proceedings of International Conference on Machine Learning.


Let $A_j = \big((\eta\kappa)^{-1}\lambda_j\big)^{-\frac32}B\big(\frac32,\kappa-(\frac{j}{2}+1)\big)$. According to the above two equations, we can get the recursion for the third term. Similarly, for the fourth term in Eq. (30), we have $IV = \frac{\kappa d}{2(\kappa-\frac{d}{2}-1)}$. Combining all the results together, we can get the bound on $KL(p\|p')$.

7.5. IMPLEMENTATION DETAILS OF THE EXPERIMENTS

7.5.1. OBSERVATIONS ON THE COVARIANCE MATRIX

In this section, we introduce the settings for the experiments on the quadratic approximation of the covariance of the stochastic gradient for a plain convolutional neural network (CNN) and a ResNet. For each model, we use gradient descent with a small constant learning rate to train the network until it converges. The converged point can be regarded as a local minimum, denoted as w*.

As for the detailed settings of the CNN model, it consists of two convolutional layers (Conv1 and Conv2) followed by two fully connected layers (fc1 and fc2). Conv1 and Conv2 use 5 × 5 kernels with 10 channels and no padding. The dimensions of fc1 and fc2 are 1600 × 50 and 50 × 10, respectively. We randomly sample 1000 images from the FashionMNIST (Xiao et al., 2017) dataset as the training set. The initialization method is the Kaiming initialization (He et al., 2015) in PyTorch. The learning rate of gradient descent is set to 0.1. After 3000 iterations, GD converges with almost 100% training accuracy and a training loss of 1e-3.

As for the ResNet, we use the ResNet-18 model (He et al., 2016b) and randomly sample 1000 images from Kaggle's dogs-vs-cats dataset as the training set. The initialization method is the Kaiming initialization (He et al., 2015) in PyTorch. The learning rate of gradient descent is set to 0.001. After 10000 iterations, GD converges with 100% training accuracy and a training loss of 1e-3.

We then calculate the covariance matrix of the stochastic gradient at points in the local region around w*. The points are selected according to the formula w*_layerL ± (i × Scale), where w*_layerL denotes the parameters at layer L, and i × Scale, i ∈ [N], determines the distance from w*_layerL. When we select points by changing the parameters at layer L according to this formula, we fix the parameters at the other layers. For both the CNN model and the ResNet-18 model, we select 20 points by setting i = 1, ..., 10.
For example, for the CNN model, we choose the 20 points by changing the parameters at the Conv1 layer with Scale = 0.001 and at the Conv2 layer with Scale = 0.0001, respectively. For ResNet-18, we choose the 20 points by changing the parameters of a convolutional layer at the first residual block with Scale = 0.0001 and at the second residual block with Scale = 0.0001, respectively.

The results are shown in Figure 1. The x-axis denotes the distance of the point from the local minimum, and the y-axis shows the trace of the covariance matrix at each point. The results show that the covariance of the noise in SGD is indeed not constant and that it can be well approximated by a quadratic function of the state (the blue line in the figures), which is consistent with our theoretical results in Section 3.1.

For Figure 3(a), we train LeNet-5 on the MNIST dataset using SGD with a constant learning rate η = 0.03 for each batch size until it converges. The parameters are conv2.weight in LeNet-5. For Figure 3(b), we train ResNet-18 on CIFAR-10 using SGD with momentum. We apply a RandomCrop to the training set, scaling to 32 × 32 with padding = 4, followed by a RandomHorizontalFlip. In training, the momentum is set to 0.9 and the weight decay is set to 5e-4. The initial learning rate of SGD is set to 0.1, and we use a learning rate decay of 0.1 at the 150-th and 250-th epochs, respectively. We train the model until it converges after 250 epochs. The parameters are layer1.1.conv2.weight in ResNet-18.
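The measurement above can be sketched in a few lines. Here least-squares linear regression on synthetic data stands in for the CNN/ResNet settings (the model, data, perturbation direction and Scale value are illustrative assumptions), and per-sample gradients play the role of the stochastic gradients.

```python
import numpy as np

# Hedged sketch of the measurement in this subsection: estimate the trace of
# the gradient-noise covariance Tr(C(w)) at points w* + i*Scale around a
# converged parameter w*, here for least-squares linear regression on
# synthetic data rather than the CNN/ResNet models.
rng = np.random.default_rng(0)
n, d, scale = 1000, 5, 0.05
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]   # "local minimum" of the loss

def trace_cov(w):
    # per-sample gradients of 0.5*(x@w - y)^2, one row per training sample
    grads = (X @ w - y)[:, None] * X
    return np.trace(np.cov(grads, rowvar=False))

traces = [trace_cov(w_star + i * scale * np.ones(d)) for i in range(11)]
print(traces)   # grows roughly quadratically with the distance from w*
```

Because the per-sample residual is linear in the offset from w*, the covariance trace grows quadratically with the distance, which is the same qualitative behavior as the blue quadratic fits in Figure 1.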

