RETHINKING THE STRUCTURE OF STOCHASTIC GRADIENTS: EMPIRICAL AND STATISTICAL EVIDENCE

Anonymous authors
Paper under double-blind review

Abstract

It is well known that stochastic gradients significantly improve both the optimization and the generalization of deep neural networks (DNNs). Some works attribute the success of stochastic optimization for deep learning to the arguably heavy-tailed properties of gradient noise, while other works present theoretical and empirical evidence against the heavy-tail hypothesis on gradient noise. Unfortunately, formal statistical tests for analyzing the structure and heavy tails of stochastic gradients in deep learning are still under-explored. In this paper, we make two main contributions. First, we conduct formal statistical tests on the distribution of stochastic gradients and of gradient noise across both parameters and iterations. Our statistical tests reveal that dimension-wise gradients usually exhibit power-law heavy tails, whereas iteration-wise gradients and the stochastic gradient noise caused by minibatch training usually do not. Second, we further discover that the covariance spectra of stochastic gradients have power-law structures in deep learning. While previous papers recognized that the anisotropic structure of stochastic gradients matters to deep learning, they did not anticipate that the gradient covariance could have such an elegant mathematical structure. Our work challenges the existing belief, provides novel insights into the structure of stochastic gradients, and may help explain the success of stochastic optimization for deep learning.

1. INTRODUCTION

Stochastic optimization methods, such as Stochastic Gradient Descent (SGD), have been highly successful and even necessary in the training of deep neural networks (LeCun et al., 2015). It is widely believed that stochastic gradients as well as stochastic gradient noise (SGN) significantly improve both the optimization and the generalization of deep neural networks (DNNs) (Hochreiter & Schmidhuber, 1995; 1997; Hardt et al., 2016; Wu et al., 2021; Smith et al., 2020; Wu et al., 2020; Sekhari et al., 2021; Amir et al., 2021). SGN, defined as the difference between the full-batch gradient and the stochastic gradient, has attracted much attention in recent years. Prior work has studied its type (Simsekli et al., 2019; Panigrahi et al., 2019; Hodgkinson & Mahoney, 2021; Li et al., 2021), its magnitude (Mandt et al., 2017; Liu et al., 2021), its structure (Daneshmand et al., 2018; Zhu et al., 2019; Chmiel et al., 2020; Xie et al., 2020; Wen et al., 2020), and its manipulation (Xie et al., 2021). Among these, the noise type and the noise covariance structure are two core research topics.

Topic 1. The arguments on the type and the heavy-tailed property of SGN. Recently, a line of research (Simsekli et al., 2019; Panigrahi et al., 2019; Gurbuzbalaban et al., 2021; Hodgkinson & Mahoney, 2021) argued that SGN is heavy-tailed due to the Generalized Central Limit Theorem (Gnedenko et al., 1954). Simsekli et al. (2019) presented statistical evidence that SGN looks closer to an α-stable distribution, which has power-law heavy tails, than to a Gaussian distribution. Panigrahi et al. (2019) also presented Gaussianity tests. However, these statistical tests were not actually applied to the true SGN caused by minibatch sampling: in this line of research, the abused notation "SGN" refers to the stochastic gradient at some iteration rather than to the difference between the full-batch gradient and the stochastic gradient.
Another line of research (Xie et al., 2020; 2022b; Li et al., 2021) pointed out this issue and argued that the conclusions in Simsekli et al. (2019) rely on a hidden strict assumption, namely that SGN must be isotropic, and do not hold for parameter-dependent and anisotropic Gaussian noise. This is why a single tail-index shared by all parameters was studied in Simsekli et al. (2019). In contrast, SGN can be well approximated by a multivariate Gaussian distribution in experiments, at least when the batch size is not too small, such as B ≥ 128 (Xie et al., 2020; Panigrahi et al., 2019). Another work (Li et al., 2021) further provided theoretical evidence supporting the anisotropic Gaussian approximation of SGN. Nevertheless, none of these works conducted statistical tests on the Gaussianity or heavy tails of the true SGN.

Contribution 1. To our knowledge, we are the first to conduct formal statistical tests on the distribution of stochastic gradients/SGN across parameters and iterations. Our statistical tests reveal that dimension-wise gradients (due to anisotropy) exhibit power-law heavy tails, while iteration-wise gradient noise (the true SGN due to minibatch sampling) often has Gaussian-like light tails. Our statistical tests and notations help reconcile recent conflicting arguments on Topic 1.

Topic 2. The covariance structure of stochastic gradients/SGN. A number of works (Zhu et al., 2019; Xie et al., 2020; HaoChen et al., 2021; Liu et al., 2021; Ziyin et al., 2022) demonstrated that the anisotropic structure and sharpness-dependent magnitude of SGN can help escape sharp minima efficiently. Moreover, some works theoretically demonstrated (Jastrzkebski et al., 2017; Zhu et al., 2019) and empirically verified (Xie et al., 2020; 2022b; Daneshmand et al., 2018) that the covariance of SGN is approximately equivalent to the Hessian near minima.
However, this approximation holds only near minima and along flat directions corresponding to nearly-zero Hessian eigenvalues. The covariance structure of stochastic gradients thus remains a fundamental open issue in deep learning.

Contribution 2. We discover that the covariance of stochastic gradients has power-law spectra in deep learning, while full-batch gradients exhibit no such property. While previous papers recognized that the anisotropic structure of stochastic gradients matters to deep learning, they did not anticipate that the gradient covariance could have such an elegant power-law structure. The power-law covariance may help understand the success of stochastic optimization for deep learning.

Organization. In Section 2, we introduce prerequisites and the statistical test methods. In Section 3, we reconcile the conflicting arguments on the noise type and heavy tails in SGD. In Section 4, we discover the power-law covariance structure. In Section 5, we present extensive empirical results that explore the covariance structure in depth. In Section 6, we conclude with our main contributions.
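As a concrete illustration of the power-law spectrum claim in Contribution 2, one can compute the eigenvalues of the empirical gradient covariance and fit a line on a log-log scale. The sketch below is our illustration, not the paper's exact procedure; the gradients are synthetic, with a planted power-law covariance so that the recovery can be checked:

```python
import numpy as np

def covariance_spectrum(grads):
    """Eigenvalues (descending) of the empirical covariance of stochastic gradients.

    grads: array of shape (T, n), one stochastic gradient per row.
    """
    centered = grads - grads.mean(axis=0, keepdims=True)  # subtract mean gradient
    cov = centered.T @ centered / grads.shape[0]
    return np.linalg.eigvalsh(cov)[::-1]

def powerlaw_exponent(eigvals, k=50):
    """Fit lambda_i ~ c * i^(-s) over the top-k eigenvalues by log-log least squares."""
    idx = np.arange(1, k + 1)
    slope, _ = np.polyfit(np.log(idx), np.log(eigvals[:k]), 1)
    return -slope  # the power-law exponent s

# Sanity check on synthetic gradients with a planted power-law covariance:
# coordinate i has variance i^(-1.5), so the spectrum should decay like i^(-1.5).
rng = np.random.default_rng(0)
n, T, true_s = 100, 5000, 1.5
grads = rng.normal(size=(T, n)) * np.arange(1, n + 1) ** (-true_s / 2)
s_hat = powerlaw_exponent(covariance_spectrum(grads))
print(f"estimated exponent: {s_hat:.2f}")  # should be close to 1.5
```

In the paper's setting, `grads` would hold stochastic gradients collected over resampled minibatches at a fixed set of parameters rather than synthetic draws.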

2. METHODOLOGY: NOTATIONS AND GOODNESS-OF-FIT TESTS

Notations. Suppose a neural network $f_{\theta}$ has $n$ model parameters $\theta$. We denote the training dataset as $\{(x_j, y_j)\}_{j=1}^{N}$ drawn from the data distribution $S$, and the loss over one data sample $(x_j, y_j)$ as $l(\theta, (x_j, y_j))$. We denote the training loss as $L(\theta) = \frac{1}{N} \sum_{j=1}^{N} l(\theta, (x_j, y_j))$. We compute the gradients of the training loss with batch size $B$ and learning rate $\eta$ for $T$ iterations. We let $g^{(t)}$ denote the stochastic gradient at the $t$-th iteration. We define the Gradient History Matrix $G = [g^{(1)}, g^{(2)}, \dots, g^{(T)}]$, an $n \times T$ matrix, where the column vector $G_{\cdot,t}$ represents the dimension-wise gradients $g^{(t)}$ over the $n$ model parameters, the row vector $G_{i,\cdot}$ represents the iteration-wise gradients $g_i^{(\cdot)}$ over the $T$ iterations, and the element $G_{i,t} = g_i^{(t)}$ is the gradient for parameter $\theta_i$ at the $t$-th iteration. We analyze $G$ for a given model without updating the model parameters $\theta$.

The Gradient History Matrix $G$ plays a key role in reconciling the conflicting arguments on Topic 1: the dimension-wise SGN defined here (due to anisotropy) is the abused "SGN" studied in one line of research (Simsekli et al., 2019; Panigrahi et al., 2019; Gurbuzbalaban et al., 2021; Hodgkinson & Mahoney, 2021), while the iteration-wise SGN (due to minibatch sampling) is the true SGN that another line of research (Xie et al., 2020; 2022b; Li et al., 2021) suggested. Our notation thus disambiguates the abused term "SGN". We further denote the second moment of stochastic gradients as $C_m = \mathbb{E}[g g^{\top}]$ and the covariance of SGN as $C = \mathbb{E}[(g - \bar{g})(g - \bar{g})^{\top}]$, where $\bar{g} = \mathbb{E}[g]$ is the full-batch gradient. We denote the eigenvalues of a matrix, such as the Hessian $H$ or the covariance $C$, in descending order as $\{\lambda_1, \lambda_2, \dots, \lambda_n\}$, and denote the corresponding spectral density function as $p(\lambda)$.

Goodness-of-Fit Test. In statistics, various goodness-of-fit tests have been proposed for measuring how well empirical data fit a given distribution.
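The dimension-wise/iteration-wise distinction and the moment definitions above can be made concrete with a small NumPy sketch (the gradients here are synthetic and the variable names are ours; shapes follow the definitions of G, C_m, and C):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 10, 1000  # number of parameters, number of sampled iterations

# Gradient History Matrix G (n x T): column G[:, t] is the stochastic
# gradient g^(t); row G[i, :] is parameter theta_i's gradient history.
G = rng.normal(size=(n, T))

g_dim = G[:, 0]   # dimension-wise view: one iteration, all n parameters
g_itr = G[3, :]   # iteration-wise view: one parameter, all T iterations

# Full-batch gradient estimate and the true SGN from minibatch sampling:
g_bar = G.mean(axis=1, keepdims=True)  # estimate of E[g]
sgn = G - g_bar                        # iteration-wise SGN in each column

# Second moment C_m = E[g g^T] and SGN covariance C = E[(g - g_bar)(g - g_bar)^T]:
C_m = G @ G.T / T
C = sgn @ sgn.T / T
print(g_dim.shape, g_itr.shape, C_m.shape, C.shape)
```

In practice the columns of G would come from backpropagation over freshly sampled minibatches at a fixed parameter vector; note also the identity C = C_m − g_bar g_bar^T, which the empirical estimates above satisfy exactly.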
In this subsection, we introduce how we conduct the Kolmogorov-Smirnov (KS) test (Massey Jr, 1951; Goldstein et al., 2004) for measuring the goodness of fit to a power-law distribution and Pearson's χ² test (Plackett, 1983) for measuring the goodness of fit to a Gaussian distribution. We present more details in Appendix B.
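As a hedged sketch of how such tests can be run in practice with off-the-shelf SciPy routines (this illustrates the general recipe of fitting a candidate distribution and then testing, not necessarily the exact procedure of Appendix B; the data below are synthetic placeholders):

```python
import numpy as np
from scipy import stats

def ks_powerlaw_pvalue(x):
    """KS test of positive data against a fitted Pareto (power-law) distribution."""
    b, loc, scale = stats.pareto.fit(x, floc=0.0)  # fix loc, fit tail index and scale
    return stats.kstest(x, "pareto", args=(b, loc, scale)).pvalue

def chi2_gauss_pvalue(x, bins=20):
    """Pearson's chi-squared test against a fitted Gaussian, with bins chosen
    to be equiprobable under the fitted Gaussian (equal expected counts)."""
    mu, sigma = x.mean(), x.std(ddof=1)
    inner = stats.norm.ppf(np.linspace(0, 1, bins + 1)[1:-1], mu, sigma)
    observed = np.bincount(np.searchsorted(inner, x), minlength=bins)
    expected = np.full(bins, len(x) / bins)
    return stats.chisquare(observed, expected, ddof=2).pvalue  # 2 fitted params

rng = np.random.default_rng(0)
gaussian = rng.normal(size=2000)                           # light-tailed sample
heavy = stats.pareto.rvs(b=1.2, size=2000, random_state=0)  # power-law sample

print(f"Gaussian data vs Gaussian fit:     p = {chi2_gauss_pvalue(gaussian):.3f}")
print(f"heavy-tailed data vs Gaussian fit: p = {chi2_gauss_pvalue(heavy):.3g}")
print(f"heavy-tailed data vs power law:    p = {ks_powerlaw_pvalue(heavy):.3f}")
```

In the paper's setting, `x` would be a row (iteration-wise) or column (dimension-wise) of the Gradient History Matrix; a small p-value rejects the corresponding null distribution.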

