DYNAMIC OF STOCHASTIC GRADIENT DESCENT WITH STATE-DEPENDENT NOISE

Anonymous

Abstract

Stochastic gradient descent (SGD) and its variants are the mainstream methods for training deep neural networks. Since neural network training is non-convex, a growing body of work studies the dynamic behavior of SGD and its impact on generalization, especially its escaping efficiency from local minima. However, these works make the over-simplified assumption that the distribution of gradient noise is state-independent, although it is in fact state-dependent. In this work, we propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD. We prove that the stationary distribution of the power-law dynamic is heavy-tailed, which matches existing empirical observations. Next, we study the escaping efficiency of the power-law dynamic from local minima and prove that the mean escaping time is polynomial in the barrier height of the basin, much faster than the exponential order given by previous dynamics. This indicates that SGD can escape deep sharp minima efficiently and tends to stop at flat minima, which have lower generalization error. Finally, we conduct experiments to compare SGD and the power-law dynamic, and the results verify our theoretical findings.

1. INTRODUCTION

Deep learning has achieved great success in various AI applications, such as computer vision, natural language processing, and speech recognition (He et al., 2016b; Vaswani et al., 2017; He et al., 2016a). Stochastic gradient descent (SGD) and its variants are the mainstream methods for training deep neural networks, since they handle the computational bottleneck of training over large-scale datasets (Bottou & Bousquet, 2008). Although SGD converges to the minimum in convex optimization (Rakhlin et al., 2012), neural networks are highly non-convex. To understand the behavior of SGD on non-convex optimization landscapes, researchers have, on one hand, investigated the loss surfaces of neural networks with various architectures (Choromanska et al., 2015; Li et al., 2018b; He et al., 2019b; Draxler et al., 2018; Li et al., 2018a); on the other hand, they have shown that the noise in stochastic algorithms may help them escape from local minima (Keskar et al., 2016; He et al., 2019a; Zhu et al., 2019; Wu et al., 2019a; HaoChen et al., 2020). Whether a stochastic algorithm can escape from poor local minima and finally stop at a minimum with low generalization error is clearly crucial to its test performance. In this work, we focus on the dynamic of SGD and its impact on generalization, especially the escaping efficiency from local minima. To study the dynamic behavior of SGD, most works regard SGD as the discretization of a continuous-time dynamic system and investigate its dynamic properties. There are two typical types of models used to approximate the dynamic of SGD.
The first type (Li et al., 2017; Zhou et al., 2019; Liu et al., 2018; Chaudhari & Soatto, 2018; He et al., 2019a; Zhu et al., 2019; Hu et al., 2019; Xie et al., 2020) approximates the dynamic of SGD by a Langevin dynamic with a constant diffusion coefficient and proves results on its escaping efficiency from local minima. These works make the over-simplified assumption that the covariance matrix of the gradient noise is constant, although it is state-dependent in general. This simplifying assumption leaves the proposed dynamics unable to explain the empirical observation that the distribution of parameters trained by SGD is heavy-tailed (Mahoney & Martin, 2019). To model the heavy-tailed phenomenon, Simsekli et al. (2019); Şimşekli et al. (2019) point out that the variance of the stochastic gradient may be infinite, and they propose to approximate SGD by a dynamic driven by an α-stable process under the strong infinite-variance condition. However, as shown in the work (Xie
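The state-dependence of the gradient noise is easy to observe even on a toy problem. The following sketch (ours, a hypothetical illustration, not from the paper) empirically estimates the minibatch gradient-noise variance of a 1-D least-squares loss at two parameter values; the variance measured far from the minimizer is much larger than the variance measured near it, so a constant-covariance assumption cannot hold in general.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: 1-D least squares, L(w) = (1/2n) * sum_i (x_i * w - y_i)^2,
# whose minimizer is close to w* = 2.
n = 1000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

def full_grad(w):
    return np.mean((x * w - y) * x)

def noise_var(w, batch_size=10, trials=2000):
    """Empirical variance of the minibatch gradient noise g_batch - g_full at w."""
    g = full_grad(w)
    noise = []
    for _ in range(trials):
        idx = rng.choice(n, batch_size, replace=False)
        noise.append(np.mean((x[idx] * w - y[idx]) * x[idx]) - g)
    return np.var(noise)

v_near = noise_var(2.0)   # near the minimizer
v_far = noise_var(5.0)    # far from the minimizer
print(v_near, v_far)      # the noise variance grows away from the minimizer
```

For this loss the per-sample gradient at w is (x_i * w - y_i) * x_i, whose variance over the data grows with the residuals, so minibatch noise inevitably depends on the current parameter state.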

