THE HEAVY-TAIL PHENOMENON IN SGD

Abstract

In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the 'flatness' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize η to the batch size b, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the 'tail-index', which measures the heaviness of the tails of the network weights at convergence. In this paper, we argue that these three seemingly unrelated perspectives for generalization are deeply linked to each other. We claim that depending on the structure of the Hessian of the loss at the minimum, and the choices of the algorithm parameters η and b, the SGD iterates will converge to a heavy-tailed stationary distribution. We rigorously prove this claim in the setting of quadratic optimization: we show that even in a simple linear regression problem with independent and identically distributed Gaussian data, the iterates can be heavy-tailed with infinite variance. We further characterize the behavior of the tails with respect to algorithm parameters, the dimension, and the curvature. We then translate our results into insights about the behavior of SGD in deep learning. We finally support our theory with experiments conducted on both synthetic data and fully connected neural networks.

1. INTRODUCTION

The learning problem in neural networks can be expressed as an instance of the well-known population risk minimization problem in statistics:

\min_{x \in \mathbb{R}^d} F(x) := \mathbb{E}_{z \sim \mathcal{D}}[f(x, z)],   (1.1)

where z \in \mathbb{R}^p denotes a random data point, \mathcal{D} is a probability distribution on \mathbb{R}^p that denotes the law of the data points, x \in \mathbb{R}^d denotes the parameters of the neural network to be optimized, and f : \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}_+ denotes a measurable cost function, which is often non-convex in x. While this problem cannot be attacked directly since \mathcal{D} is typically unknown, if we have access to a training dataset S = \{z_1, \ldots, z_n\} with n independent and identically distributed (i.i.d.) observations, i.e., z_i \sim_{\text{i.i.d.}} \mathcal{D} for i = 1, \ldots, n, we can use the empirical risk minimization strategy, which aims at solving the following optimization problem (Shalev-Shwartz & Ben-David, 2014):

\min_{x \in \mathbb{R}^d} \hat{f}(x) := \hat{f}(x, S) := \frac{1}{n} \sum_{i=1}^{n} f^{(i)}(x),   (1.2)

where f^{(i)} denotes the cost induced by the data point z_i. The stochastic gradient descent (SGD) algorithm has been one of the most popular algorithms for addressing this problem:

x_k = x_{k-1} - \eta \nabla \tilde{f}_k(x_{k-1}), \quad \text{where} \quad \nabla \tilde{f}_k(x) := \frac{1}{b} \sum_{i \in \Omega_k} \nabla f^{(i)}(x).   (1.3)

Here, k denotes the iteration counter, \eta > 0 is the stepsize (also called the learning rate), \nabla \tilde{f}_k is the stochastic gradient, b is the batch size, and \Omega_k \subset \{1, \ldots, n\} is a random subset with |\Omega_k| = b for all k. Even though the practical success of SGD has been proven in many domains, the theory of its generalization properties is still in an early phase. Among others, one peculiar property of SGD that has not been theoretically well-grounded is that, depending on the choice of \eta and b, the algorithm can exhibit significantly different behaviors in terms of performance on unseen test data.
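As a concrete illustration, the recursion in (1.3) can be instantiated for a simple least-squares problem. The sketch below uses hypothetical synthetic data and arbitrary parameter choices (n, d, η, b, and the iteration budget are illustrative assumptions, not settings from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n i.i.d. data points z_i = (a_i, y_i) with the
# per-sample cost f^{(i)}(x) = 0.5 * (a_i^T x - y_i)^2.
n, d = 1000, 10
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def sgd(eta, b, num_iters=500):
    """Run the recursion x_k = x_{k-1} - eta * grad_fk(x_{k-1}) from (1.3)."""
    x = np.zeros(d)
    for _ in range(num_iters):
        # Random subset Omega_k of {1, ..., n} with |Omega_k| = b.
        idx = rng.choice(n, size=b, replace=False)
        # Stochastic gradient: (1/b) * sum of per-sample gradients over Omega_k.
        grad = A[idx].T @ (A[idx] @ x - y[idx]) / b
        x = x - eta * grad
    return x

x_hat = sgd(eta=0.01, b=32)
```

With a small stepsize and a well-conditioned quadratic, the iterates settle near the empirical risk minimizer; Section 3 examines what happens when η/b is pushed higher.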
A common perspective on this phenomenon is based on the 'flat minima' argument that dates back to Hochreiter & Schmidhuber (1997), and associates the performance with the 'sharpness' or 'flatness' of the minimizers found by SGD, where these notions are often characterized by the magnitude of the eigenvalues of the Hessian, with larger values corresponding to sharper local minima (Keskar et al., 2016). Recently, Jastrzębski et al. (2017) focused on this phenomenon as well and empirically illustrated that the performance of SGD on unseen test data is mainly determined by the stepsize \eta and the batch size b, i.e., a larger ratio \eta/b yields better generalization. Revisiting the flat-minima argument, they concluded that the ratio \eta/b determines the flatness of the minima found by SGD, hence the difference in generalization. In the same context, Şimşekli et al. (2019b) focused on the statistical properties of the gradient noise (\nabla \tilde{f}_k(x) - \nabla f(x)) and illustrated that under an isotropic model, the gradient noise exhibits a heavy-tailed behavior, which was also confirmed in follow-up studies (Zhang et al., 2019). Based on this observation and a metastability argument (Pavlyukevich, 2007), they showed that SGD will 'prefer' wider basins under the heavy-tailed noise assumption, without an explicit mention of the cause of the heavy-tailed behavior. In another recent study, Martin & Mahoney (2019) introduced a new approach for investigating the generalization properties of deep neural networks by invoking results from heavy-tailed random matrix theory. They empirically showed that the eigenvalues of the weight matrices in different layers exhibit a heavy-tailed behavior, which is an indication that the weight matrices themselves exhibit heavy tails as well (Ben Arous & Guionnet, 2008). Accordingly, they fitted a power-law distribution to the empirical spectral density of individual layers and illustrated that heavier-tailed weight matrices indicate better generalization.
Very recently, Şimşekli et al. (2020) formalized this argument in a mathematically rigorous framework and showed that such a heavy-tailed behavior diminishes the 'effective dimension' of the problem, which in turn results in improved generalization. While these studies form an important initial step towards establishing the connection between heavy tails and generalization, the originating cause of the observed heavy-tailed behavior is yet to be understood. In this paper, we argue that these three seemingly unrelated perspectives on generalization are deeply linked to each other. We claim that, depending on the choice of the algorithm parameters \eta and b, the dimension d, and the curvature of f (to be made precise in Section 3), SGD exhibits a 'heavy-tail phenomenon', meaning that the law of the iterates converges to a heavy-tailed distribution. We rigorously prove that this phenomenon is not specific to deep learning and can in fact be observed even in surprisingly simple settings: we show that when f is chosen as a simple quadratic function and the data points are i.i.d. from an isotropic Gaussian distribution, the iterates can still converge to a heavy-tailed distribution with arbitrarily heavy tails, hence with infinite variance. We summarize our contributions as follows:

1. When f is a quadratic, we prove that: (i) the tails become monotonically heavier for increasing curvature, increasing \eta, or decreasing b, hence relating the heavy tails to the ratio \eta/b and the curvature; (ii) the law of the iterates converges exponentially fast to the stationary distribution in the Wasserstein metric; (iii) there exists a higher-order moment (e.g., the variance) of the iterates that diverges at most polynomially fast, depending on the heaviness of the tails at stationarity.

2. We support our theory with experiments conducted on both synthetic data and neural networks.
Our experimental results confirm our theory on synthetic setups and illustrate that the heavy-tail phenomenon also arises in fully connected multi-layer neural networks. To the best of our knowledge, these results are the first of their kind to rigorously characterize the empirically observed heavy-tailed behavior of SGD with respect to \eta, b, d, and the curvature, with explicit convergence rates.¹
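To give a flavor of the phenomenon, the sketch below simulates one-dimensional SGD on the quadratic setting described above (i.i.d. Gaussian data, batch size b = 1) and compares Hill estimates of the tail index for a small and a large stepsize. All numerical choices here (the \eta values, number of chains, iteration budget, and the estimator's cutoff k) are illustrative assumptions, not the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd_quadratic_samples(eta, b, num_chains=2000, num_iters=2000):
    """Approximate stationary samples of 1-d SGD on f(x, z) = 0.5*(a*x - y)^2
    with a, y ~ N(0, 1): the iteration is the random linear recursion
    x_k = (1 - eta * mean(a_i^2)) * x_{k-1} + eta * mean(a_i * y_i)."""
    x = np.zeros(num_chains)
    for _ in range(num_iters):
        a = rng.standard_normal((num_chains, b))
        y = rng.standard_normal((num_chains, b))
        m = 1.0 - eta * np.mean(a * a, axis=1)   # random multiplicative factor
        q = eta * np.mean(a * y, axis=1)         # additive noise term
        x = m * x + q
    return x

def hill_tail_index(samples, k=200):
    """Hill estimator of the tail index from the k largest |samples|."""
    s = np.sort(np.abs(samples))[::-1]
    return 1.0 / np.mean(np.log(s[:k] / s[k]))

alpha_small_eta = hill_tail_index(sgd_quadratic_samples(eta=0.3, b=1))
alpha_large_eta = hill_tail_index(sgd_quadratic_samples(eta=1.2, b=1))
# A larger eta/b ratio should yield a smaller (i.e., heavier-tail) estimate.
```

The larger stepsize remains in the regime where the iterates converge to a stationary distribution, yet that distribution has a tail index below 2, i.e., infinite variance, which is exactly the behavior Theorem 1 characterizes.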

2. TECHNICAL BACKGROUND

Heavy-tailed distributions with a power-law decay. In probability theory, a real-valued random variable X is said to be heavy-tailed if the right tail or the left tail of the distribution decays slower than that of an exponential distribution.

¹ We note that in a concurrent work, which very recently appeared on arXiv, Hodgkinson & Mahoney (2020) showed that heavy tails with power laws arise in more general Lipschitz stochastic optimization algorithms that are contracting on average for strongly convex objectives near infinity with positive probability. Our Theorem 1 and Lemma 14 are more refined, as we focus on the special case of SGD with Gaussian data, where we are able to provide constants that explicitly determine the tail index as an expectation over the data and the SGD parameters (see also eqn. (3.6)). Due to the generality of their framework, (Hodgkinson & Mahoney, 2020, Thm 1) is more implicit and cannot provide such a characterization of these constants; however, it can be applied to other algorithms beyond SGD. All our other results (including Theorem 2, the monotonicity of the tail index, and Corollary 9, a central limit theorem for the ergodic averages) are specific to SGD and cannot be obtained under the framework of Hodgkinson & Mahoney (2020). We encourage the readers to refer to (Hodgkinson & Mahoney, 2020) for the treatment of more general stochastic recursions.
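The power-law notion above can be made concrete with a small numerical sketch. It contrasts a Pareto distribution, whose survival function P(X > t) = t^{-\alpha} decays polynomially, with an exponential distribution; the chosen \alpha, sample size, and evaluation points are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# A power-law (Pareto-type) right tail satisfies P(X > t) ~ c * t^(-alpha);
# any such tail is heavy, since t^(-alpha) decays slower than e^(-lambda*t).
alpha = 1.5
pareto = rng.pareto(alpha, size=200_000) + 1.0  # P(X > t) = t^(-alpha), t >= 1
expo = rng.exponential(scale=1.0, size=200_000)

def survival(samples, t):
    """Empirical tail probability P(X > t)."""
    return float(np.mean(samples > t))

# On a log-log scale the power-law survival function is linear with slope
# -alpha, so the tail index can be read off from two tail probabilities.
slope = np.log(survival(pareto, 2.0) / survival(pareto, 8.0)) / np.log(8.0 / 2.0)
```

The recovered slope approximates the tail index \alpha, while the exponential tail probability at the same threshold is orders of magnitude smaller; this slope-reading idea underlies the power-law fits of Martin & Mahoney (2019) mentioned in the introduction.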

