RETHINKING THE STRUCTURE OF STOCHASTIC GRADIENTS: EMPIRICAL AND STATISTICAL EVIDENCE

Anonymous authors
Paper under double-blind review

Abstract

It is well known that stochastic gradients significantly improve both optimization and generalization of deep neural networks (DNNs). Some works attempted to explain the success of stochastic optimization for deep learning by the purported heavy-tail properties of gradient noise, while other works presented theoretical and empirical evidence against the heavy-tail hypothesis. Unfortunately, formal statistical tests for analyzing the structure and heavy tails of stochastic gradients in deep learning remain under-explored. In this paper, we make two main contributions. First, we conduct formal statistical tests on the distribution of stochastic gradients and gradient noise across both parameters and iterations. Our tests reveal that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradients and the stochastic gradient noise caused by minibatch training usually do not. Second, we discover that the covariance spectra of stochastic gradients exhibit power-law structures in deep learning. While previous papers believed that the anisotropic structure of stochastic gradients matters to deep learning, they did not anticipate that the gradient covariance could have such an elegant mathematical structure. Our work challenges the existing belief and provides novel insights into the structure of stochastic gradients, which may help explain the success of stochastic optimization for deep learning.

1. INTRODUCTION

Stochastic optimization methods, such as Stochastic Gradient Descent (SGD), have been highly successful and even necessary in the training of deep neural networks (LeCun et al., 2015). It is widely believed that stochastic gradients as well as stochastic gradient noise (SGN) significantly improve both optimization and generalization of deep neural networks (DNNs) (Hochreiter & Schmidhuber, 1995; 1997; Hardt et al., 2016; Wu et al., 2021; Smith et al., 2020; Wu et al., 2020; Sekhari et al., 2021; Amir et al., 2021). SGN, defined as the difference between the full-batch gradient and the stochastic gradient, has attracted much attention in recent years. Prior work has studied its type (Simsekli et al., 2019; Panigrahi et al., 2019; Hodgkinson & Mahoney, 2021; Li et al., 2021), its magnitude (Mandt et al., 2017; Liu et al., 2021), its structure (Daneshmand et al., 2018; Zhu et al., 2019; Chmiel et al., 2020; Xie et al., 2020; Wen et al., 2020), and its manipulation (Xie et al., 2021). Among these, the noise type and the noise covariance structure are two core research topics.

Topic 1. The arguments on the type and the heavy-tailed property of SGN. Recently, a line of research (Simsekli et al., 2019; Panigrahi et al., 2019; Gurbuzbalaban et al., 2021; Hodgkinson & Mahoney, 2021) argued that SGN is heavy-tailed due to the Generalized Central Limit Theorem (Gnedenko et al., 1954). Simsekli et al. (2019) presented statistical evidence suggesting that SGN looks closer to an α-stable distribution, which has power-law heavy tails, than to a Gaussian distribution. Panigrahi et al. (2019) also presented Gaussianity tests. However, their statistical tests were not actually applied to the true SGN caused by minibatch sampling: in this line of research, the notation "SGN" is abused to denote the stochastic gradient at some iteration rather than the difference between the full-batch gradient and the stochastic gradient.
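To make the distinction concrete, the sketch below (a hypothetical toy setup, not the paper's actual experiments) computes the true SGN as the full-batch gradient minus a minibatch gradient, and estimates a power-law tail index with the classical Hill estimator. The array `per_example_grads`, the function `minibatch_grad`, and the Gaussian surrogate data are all illustrative assumptions standing in for gradients obtained by backpropagation on a real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for per-example gradients of one parameter block;
# in a real experiment these would come from backprop on single examples.
n_examples, dim = 10_000, 50
per_example_grads = rng.standard_normal((n_examples, dim))

full_batch_grad = per_example_grads.mean(axis=0)

def minibatch_grad(batch_size):
    """Gradient averaged over a random minibatch (sampled without replacement)."""
    idx = rng.choice(n_examples, size=batch_size, replace=False)
    return per_example_grads[idx].mean(axis=0)

# True SGN as defined in the text: full-batch gradient minus stochastic gradient.
sgn = full_batch_grad - minibatch_grad(batch_size=128)

# Pool SGN components over many minibatch draws to get enough tail samples.
sgn_samples = np.concatenate(
    [full_batch_grad - minibatch_grad(batch_size=128) for _ in range(100)]
)

def hill_tail_index(samples, k):
    """Hill estimator of the power-law tail index from the k largest magnitudes.

    A small estimate (roughly < 2) is consistent with heavy, infinite-variance
    tails; a large estimate is consistent with light (e.g. Gaussian-like) tails.
    """
    x = np.sort(np.abs(samples))[::-1]  # descending order statistics
    return k / np.sum(np.log(x[:k] / x[k]))
```

Applying `hill_tail_index` to dimension-wise gradients versus to pooled minibatch SGN separates the two quantities that the literature above conflates; testing the stochastic gradient at a single iteration says nothing by itself about the tails of the minibatch noise.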
Another line of research (Xie et al., 2020; 2022b; Li et al., 2021) pointed out this issue and argued that the claims in Simsekli et al. (2019) rely on a hidden, strict assumption that SGN must be isotropic, which does not hold for parameter-dependent and anisotropic Gaussian noise. This is why one tail-index for all parameters

