THE HEAVY-TAIL PHENOMENON IN SGD

Abstract

In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the 'flatness' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize η to the batch size b, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the 'tail-index', which measures the heaviness of the tails of the network weights at convergence. In this paper, we argue that these three seemingly unrelated perspectives for generalization are deeply linked to each other. We claim that depending on the structure of the Hessian of the loss at the minimum, and the choices of the algorithm parameters η and b, the SGD iterates will converge to a heavy-tailed stationary distribution. We rigorously prove this claim in the setting of quadratic optimization: we show that even in a simple linear regression problem with independent and identically distributed Gaussian data, the iterates can be heavy-tailed with infinite variance. We further characterize the behavior of the tails with respect to algorithm parameters, the dimension, and the curvature. We then translate our results into insights about the behavior of SGD in deep learning. We finally support our theory with experiments conducted on both synthetic data and fully connected neural networks.

1. INTRODUCTION

The learning problem in neural networks can be expressed as an instance of the well-known population risk minimization problem in statistics:

min_{x∈R^d} F(x) := E_{z∼D}[f(x, z)],   (1.1)

where z ∈ R^p denotes a random data point, D is a probability distribution on R^p that denotes the law of the data points, x ∈ R^d denotes the parameters of the neural network to be optimized, and f : R^d × R^p → R_+ denotes a measurable cost function, which is often non-convex in x. While this problem cannot be attacked directly since D is typically unknown, if we have access to a training dataset S = {z_1, ..., z_n} of n independent and identically distributed (i.i.d.) observations, i.e., z_i ∼ D i.i.d. for i = 1, ..., n, we can use the empirical risk minimization strategy, which aims at solving the following optimization problem (Shalev-Shwartz & Ben-David, 2014):

min_{x∈R^d} f̂(x) := f̂(x, S) := (1/n) Σ_{i=1}^n f^{(i)}(x),   (1.2)

where f^{(i)} denotes the cost induced by the data point z_i. The stochastic gradient descent (SGD) algorithm has been one of the most popular algorithms for addressing this problem:

x_k = x_{k−1} − η ∇f̂_k(x_{k−1}),  where  ∇f̂_k(x) := (1/b) Σ_{i∈Ω_k} ∇f^{(i)}(x).   (1.3)

Here, k denotes the iteration counter, η > 0 is the stepsize (also called the learning rate), ∇f̂_k is the stochastic gradient, b is the batch size, and Ω_k ⊂ {1, ..., n} is a random subset with |Ω_k| = b for all k. Even though the practical success of SGD has been demonstrated in many domains, the theory of its generalization properties is still in an early phase. Among others, one peculiar property of SGD that has not been theoretically well-grounded is that, depending on the choice of η and b, the algorithm can exhibit significantly different behaviors in terms of performance on unseen test data.
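In code, the recursion (1.3) is only a few lines. The following minimal sketch (our own illustration; the function and variable names are not from the paper) runs mini-batch SGD on a least-squares objective, where the per-sample gradient is ∇f^{(i)}(x) = a_i(a_i^T x − y_i):

```python
import numpy as np

def sgd(grad_fn, x0, data, eta, b, n_iters, seed=0):
    """Mini-batch SGD, eq. (1.3): x_k = x_{k-1} - eta * (1/b) sum_{i in Omega_k} grad f^(i)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    n = data[0].shape[0]
    for _ in range(n_iters):
        omega = rng.choice(n, size=b, replace=False)  # random subset Omega_k with |Omega_k| = b
        x = x - eta * grad_fn(x, omega, data)
    return x

def lsq_grad(x, omega, data):
    """Averaged gradient of f^(i)(x) = 0.5 * (a_i^T x - y_i)^2 over the batch."""
    A, y = data
    return A[omega].T @ (A[omega] @ x - y[omega]) / len(omega)

# Toy problem: recover x_true from noisy linear measurements.
rng = np.random.default_rng(1)
d, n = 5, 500
x_true = rng.normal(size=d)
A = rng.normal(size=(n, d))
y = A @ x_true + 0.01 * rng.normal(size=n)
x_hat = sgd(lsq_grad, np.zeros(d), (A, y), eta=0.05, b=16, n_iters=5000)
print(np.linalg.norm(x_hat - x_true))  # small: this (eta, b) pair is well inside the stable regime
```

For this well-conditioned toy problem the iterates settle near the minimizer; the point of the paper is that for other (η, b) choices the same recursion can behave very differently.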
A common perspective on this phenomenon is based on the 'flat minima' argument that dates back to Hochreiter & Schmidhuber (1997), which associates performance with the 'sharpness' or 'flatness' of the minimizers found by SGD, where these notions are often characterized by the magnitude of the eigenvalues of the Hessian, larger values corresponding to sharper local minima (Keskar et al., 2016). Recently, Jastrzębski et al. (2017) focused on this phenomenon as well and empirically illustrated that the performance of SGD on unseen test data is mainly determined by the stepsize η and the batch size b; i.e., larger η/b yields better generalization. Revisiting the flat-minima argument, they concluded that the ratio η/b determines the flatness of the minima found by SGD, hence the difference in generalization. In the same context, Şimşekli et al. (2019b) focused on the statistical properties of the gradient noise ∇f̂_k(x) − ∇f̂(x) and illustrated that, under an isotropic model, the gradient noise exhibits a heavy-tailed behavior, which was also confirmed in follow-up studies (Zhang et al., 2019). Based on this observation and a metastability argument (Pavlyukevich, 2007), they showed that SGD will 'prefer' wider basins under the heavy-tailed noise assumption, without an explicit mention of the cause of the heavy-tailed behavior. In another recent study, Martin & Mahoney (2019) introduced a new approach for investigating the generalization properties of deep neural networks by invoking results from heavy-tailed random matrix theory. They empirically showed that the eigenvalues of the weight matrices in different layers exhibit heavy-tailed behavior, which is an indication that the weight matrices themselves exhibit heavy tails as well (Ben Arous & Guionnet, 2008). Accordingly, they fitted a power-law distribution to the empirical spectral density of individual layers and illustrated that heavier-tailed weight matrices indicate better generalization.
Very recently, Şimşekli et al. (2020) formalized this argument in a mathematically rigorous framework and showed that such heavy-tailed behavior diminishes the 'effective dimension' of the problem, which in turn results in improved generalization. While these studies form an important initial step towards establishing the connection between heavy tails and generalization, the originating cause of the observed heavy-tailed behavior is yet to be understood. In this paper, we argue that these three seemingly unrelated perspectives on generalization are deeply linked to each other. We claim that, depending on the choice of the algorithm parameters η and b, the dimension d, and the curvature of f̂ (to be made precise in Section 3), SGD exhibits a 'heavy-tail phenomenon', meaning that the law of the iterates converges to a heavy-tailed distribution. We rigorously prove that this phenomenon is not specific to deep learning and can in fact be observed even in surprisingly simple settings: we show that when f̂ is a simple quadratic function and the data points are i.i.d. from an isotropic Gaussian distribution, the iterates can still converge to a distribution with arbitrarily heavy tails, hence in particular with infinite variance. We summarize our contributions as follows: 1. When f̂ is a quadratic, we prove that: (i) the tails become monotonically heavier for increasing curvature, increasing η, or decreasing b, hence relating the heavy tails to the ratio η/b and the curvature; (ii) the law of the iterates converges exponentially fast towards the stationary distribution in the Wasserstein metric; (iii) there exist higher-order moments of the iterates (e.g., the variance) that diverge, but at most polynomially fast, depending on the heaviness of the tails at stationarity. 2. We support our theory with experiments conducted on both synthetic data and neural networks.
Our experimental results confirm our theory in synthetic setups and further illustrate that the heavy-tail phenomenon also arises in fully connected multi-layer neural networks. To the best of our knowledge, these results are the first of their kind to rigorously characterize the empirically observed heavy-tailed behavior of SGD with respect to η, b, d, and the curvature, with explicit convergence rates.¹

2. TECHNICAL BACKGROUND

Heavy-tailed distributions with a power-law decay. In probability theory, a real-valued random variable X is said to be heavy-tailed if the right tail or the left tail of its distribution decays slower than any exponential distribution. We say X has a heavy (right) tail if lim_{x→∞} P(X ≥ x) e^{cx} = ∞ for any c > 0. Similarly, an R^d-valued random vector X has heavy tails if u^T X has a heavy right tail for some vector u ∈ S^{d−1}, where S^{d−1} := {u ∈ R^d : ||u|| = 1} is the unit sphere in R^d. Heavy-tailed distributions include the α-stable distributions, the Pareto distribution, the log-normal distribution, and the Weibull distribution. One important class of heavy-tailed distributions is the class of distributions with power-law decay, which is the focus of our paper.

¹ We note that in a concurrent work, which very recently appeared on arXiv, Hodgkinson & Mahoney (2020) showed that heavy tails with power laws arise in more general Lipschitz stochastic optimization algorithms that are contracting on average for strongly convex objectives near infinity with positive probability. Our Theorem 1 and Lemma 14 are more refined, as we focus on the special case of SGD with Gaussian data, where we are able to provide constants which explicitly determine the tail-index as an expectation over the data and the SGD parameters (see also eqn. (3.6)). Due to the generality of their framework, (Hodgkinson & Mahoney, 2020, Thm 1) is more implicit and cannot provide such a characterization of these constants; however, it can be applied to other algorithms beyond SGD. All our other results (including Theorem 2, monotonicity of the tail-index, and Corollary 9, a central limit theorem for the ergodic averages) are specific to SGD and cannot be obtained under the framework of Hodgkinson & Mahoney (2020). We encourage the reader to refer to (Hodgkinson & Mahoney, 2020) for the treatment of more general stochastic recursions.
A distribution has power-law (right-tail) decay with tail-index α if P(X ≥ x) ∼ c_0 x^{−α} as x → ∞ for some c_0 > 0 and α > 0, where the tail-index α determines the tail thickness of the distribution. Similarly, we say that a random vector X has power-law decay with tail-index α if, for some u ∈ S^{d−1}, we have P(u^T X ≥ x) ∼ c_0 x^{−α} for some c_0, α > 0.

Stable distributions. The α-stable distributions are an important subclass of heavy-tailed distributions with power-law decay, and they appear as the limiting distributions in the generalized CLT for sums of i.i.d. random variables with infinite variance (Lévy, 1937). A random variable X follows a symmetric α-stable distribution, denoted X ∼ SαS(σ), if its characteristic function takes the form E[e^{itX}] = exp(−σ^α |t|^α), t ∈ R, where σ > 0 is the scale parameter that measures the spread of X around 0, and α ∈ (0, 2] is the tail-index; SαS becomes heavier-tailed as α gets smaller. The probability density function of a symmetric α-stable distribution does not admit a closed-form expression in general, except for a few special cases: when α = 1 and α = 2, SαS reduces to the Cauchy and the Gaussian distribution, respectively. When 0 < α < 2, α-stable distributions have finite moments only up to order α, in the sense that E[|X|^p] < ∞ if and only if p < α, which implies infinite variance.

Wasserstein metric. For any p ≥ 1, define P_p(R^d) as the space of Borel probability measures ν on R^d with finite p-th moment (based on the Euclidean norm). For any two Borel probability measures ν_1, ν_2 ∈ P_p(R^d), the standard p-Wasserstein metric (Villani, 2009) is defined as W_p(ν_1, ν_2) := (inf E[||Z_1 − Z_2||^p])^{1/p}, where the infimum is taken over all joint distributions of random variables (Z_1, Z_2) with marginal distributions ν_1 and ν_2, respectively.
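SαS samples can be drawn with the standard Chambers–Mallows–Stuck transform, which makes the two properties above (power-law survival function, moments finite only below α) easy to see empirically. The sketch below is our own illustration; the helper name is ours:

```python
import numpy as np

def sample_sas(alpha, size, rng):
    """Draw symmetric alpha-stable SaS(1) samples via the Chambers-Mallows-Stuck method."""
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    W = rng.exponential(1.0, size)                 # unit-rate exponential
    if alpha == 1.0:
        return np.tan(V)                           # alpha = 1: standard Cauchy
    return (np.sin(alpha * V) / np.cos(V) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * V) / W) ** ((1.0 - alpha) / alpha))

rng = np.random.default_rng(0)
x = sample_sas(1.5, 10**6, rng)

# E|X|^p < infinity iff p < alpha: the empirical first moment stabilizes,
# while the empirical second moment is dominated by a few extreme samples.
print(np.mean(np.abs(x)), np.mean(x**2))

# Power-law tail P(X > t) ~ c0 t^{-alpha}: comparing the survival function at
# two thresholds recovers the tail-index up to sampling noise.
p1, p2 = np.mean(x > 10.0), np.mean(x > 100.0)
print(np.log(p1 / p2) / np.log(10.0))  # roughly 1.5
```

The same threshold-ratio trick is a crude tail-index diagnostic for any sample suspected of power-law decay.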

3. SETUP AND MAIN THEORETICAL RESULTS

Before stating our theoretical results in detail, let us informally motivate our main method of analysis. Suppose that the initial SGD iterate x_0 is in the domain of attraction of a local minimum x⋆ of f̂, and that f̂ is smooth and well-approximated by a quadratic function in this basin. Under this assumption, a first-order Taylor approximation of ∇f^{(i)}(x) around x⋆ gives ∇f^{(i)}(x) ≈ ∇f^{(i)}(x⋆) + ∇²f^{(i)}(x⋆)(x − x⋆). Using this approximation, we can approximate the SGD recursion (1.3) as

x_k ≈ x_{k−1} − (η/b) Σ_{i∈Ω_k} ∇²f^{(i)}(x⋆) x_{k−1} + (η/b) Σ_{i∈Ω_k} [∇²f^{(i)}(x⋆) x⋆ − ∇f^{(i)}(x⋆)] =: (I − (η/b) H_k) x_{k−1} + q_k,   (3.1)

where I denotes the identity matrix of appropriate size and H_k := Σ_{i∈Ω_k} ∇²f^{(i)}(x⋆). Here, our main observation is that the SGD recursion can be approximated by a linear stochastic recursion, which gives us access to the tools of implicit renewal theory for investigating its statistical properties (Kesten, 1973; Goldie, 1991). In a renewal-theoretic context, the object of interest is the matrix (I − (η/b)H_k), whose statistical properties determine the behavior of x_k: depending on the moments of this matrix, x_k can have heavy or light tails, or might even diverge. In this study, we focus on the tail behavior of the SGD dynamics by analyzing it through the lens of implicit renewal theory. As the recursion (3.1) is obtained by a quadratic approximation of the component functions f^{(i)}, which arises naturally in linear regression, we will consider a simplified setting and rigorously study this dynamics in the case of linear regression. We would like to underline that the Taylor approximation (3.1) is not crucial in our analysis. Indeed, we could extend our theory to more general non-linear recursions by imposing strict statistical assumptions on the loss function (which can be chosen non-convex) and the data distribution (Mirek, 2011).
Unfortunately, such assumptions would either be trivially false for deep learning problems, e.g., (Mirek, 2011), or cannot be verified in practice, e.g., (Hodgkinson & Mahoney, 2020). In order to provide explicit results with clear assumptions, we limit our scope to quadratic optimization, which already turns out to be fairly technical. We leave the analysis of the general case as a natural next step of our work. We now consider the case where f is a quadratic, which arises in linear regression:

min_{x∈R^d} F(x) := (1/2) E_{(a,y)∼D} [(a^T x − y)²],   (3.2)

where the data (a, y) come from an unknown distribution D with support in R^d × R. Assume we have access to i.i.d. samples (a_i, y_i) from the distribution D, so that ∇f^{(i)}(x) = a_i(a_i^T x − y_i) is an unbiased estimator of the true gradient ∇F(x). The curvature of this objective around a minimum, i.e., the matrix of second partial derivatives, is determined by the Hessian E[aa^T], which depends on the distribution of a. In this setting, SGD with batch size b leads to the iterations

x_k = M_k x_{k−1} + q_k,  with  M_k := I − (η/b) H_k,  H_k := Σ_{i∈Ω_k} a_i a_i^T,  q_k := (η/b) Σ_{i∈Ω_k} a_i y_i,   (3.3)

where Ω_k := {b(k−1)+1, b(k−1)+2, ..., bk} with |Ω_k| = b. Here, for simplicity, we assume that we are in the one-pass regime (also called the streaming setting (Frostig et al., 2015; Jain et al., 2017)), where each sample is used only once and never recycled. Our purpose in this paper is to show that heavy tails can arise in SGD even in simple settings, such as when the input data a_i are Gaussian, without requiring heavy-tailed input data. Consequently, we make the following assumptions on the data throughout the paper: (A1) a_i ∼ N(0, σ² I_d) are i.i.d.; (A2) the y_i are i.i.d. with a continuous density whose support is R, with all moments finite. Assumption (A2) is satisfied in many cases, for instance when y_i is normally distributed on R.
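The recursion (3.3) is straightforward to simulate, and heavy tails show up without any heavy-tailed input. The sketch below (our own illustration; the regime labels use the threshold η_crit from Proposition 3, stated later in this section) runs many independent copies of (3.3) under (A1)–(A2) and compares a stepsize below η_crit against one above it:

```python
import numpy as np

d, b, sigma = 10, 1, 1.0
eta_crit = 2 * b / (sigma**2 * (d + b + 1))  # threshold of Proposition 3

def run_sgd(eta, rng, n_runs=2000, n_iters=1000):
    """Run n_runs independent copies of x_k = (I - (eta/b) H_k) x_{k-1} + q_k
    with Gaussian data: a_i ~ N(0, sigma^2 I_d), y_i ~ N(0, 1)."""
    X = np.zeros((n_runs, d))
    for _ in range(n_iters):
        A = sigma * rng.normal(size=(n_runs, b, d))
        y = rng.normal(size=(n_runs, b))
        Hx = np.einsum('rbd,rbe,re->rd', A, A, X)   # H_k x_{k-1} = sum_i a_i a_i^T x_{k-1}
        q = np.einsum('rbd,rb->rd', A, y)           # sum_i a_i y_i
        X = X - (eta / b) * Hx + (eta / b) * q
    return X

rng = np.random.default_rng(0)
light = run_sgd(0.5 * eta_crit, rng)  # eta < eta_crit: alpha > 2, finite variance expected
heavy = run_sgd(1.2 * eta_crit, rng)  # eta > eta_crit: alpha < 2 expected (heavy tails)
print(np.max(np.abs(light)), np.max(np.abs(heavy)))
```

In the heavy regime, the extreme iterates across runs are typically orders of magnitude larger than in the light regime, even though every input variable is Gaussian.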
Note that by Assumption (A1), the matrices M_k = I − (η/b)H_k are i.i.d., and the Hessian of the objective (3.2) satisfies E[aa^T] = σ² I_d, where the value of σ² determines the curvature around a minimum: smaller (larger) σ² implies the objective grows slower (faster) around the minimum, i.e., the minimum is flatter (sharper) (see, e.g., Dinh et al. (2017)). We introduce

h(s) := lim_{k→∞} (E ||M_k M_{k−1} ⋯ M_1||^s)^{1/k},   (3.4)

which arises in the study of stochastic matrix recursions (see, e.g., Buraczewski et al. (2014)), where ||·|| denotes the matrix 2-norm (i.e., the largest singular value of a matrix). Since E||M_k||^s < ∞ for all k and s > 0, we have h(s) < ∞. Let us also define Π_k := M_k M_{k−1} ⋯ M_1 and

ρ := lim_{k→∞} (2k)^{−1} log λ_max(Π_k^T Π_k),   (3.5)

where λ_max denotes the largest eigenvalue. The latter quantity is called the top Lyapunov exponent of the stochastic recursion (3.3). Furthermore, if ρ exists and is negative, it can be shown that a stationary distribution of the recursion (3.3) exists. In the Appendix (see Lemma 14), we show that under our assumptions,

ρ = E log ||(I − (η/b)H) e_1||,   h(s) = E[||(I − (η/b)H) e_1||^s]  for ρ < 0,   (3.6)

where H is a matrix with the same distribution as the H_k, and e_1 is the first standard basis vector. In the following, we show that the limiting density has a polynomial tail with a tail-index given precisely by α, the unique critical value such that h(α) = 1. The result builds on adapting the techniques developed for stochastic matrix recursions (Alsmeyer & Mentemeier, 2012; Buraczewski et al., 2016) to our setting. Our result shows that even in the simplest setting, where the input data is Gaussian without any heavy tails, SGD iterates can lead to a heavy-tailed stationary distribution with infinite variance. To our knowledge, this is the first time such a phenomenon has been proven in the linear regression setting.

Theorem 1. Consider the SGD iterations (3.3).
If ρ < 0, then the SGD iterations admit a unique stationary distribution x_∞ satisfying

lim_{t→∞} t^α P(u^T x_∞ > t) = e_α(u),  u ∈ S^{d−1},   (3.7)

for some positive and continuous function e_α on the unit sphere S^{d−1}, where α is the unique positive value such that h(α) = 1.

As Martin & Mahoney (2019) and Şimşekli et al. (2020) provide both numerical and theoretical evidence that the tail-index of the density of the network weights is closely related to the generalization performance, with a smaller tail-index indicating better generalization, a natural question of practical importance is how the tail-index depends on the parameters of the problem, including the batch size, the dimension, and the stepsize. We prove that larger batch sizes lead to lighter tails (i.e., larger α), which links the heavy tails to the observation that smaller b yields improved generalization in a variety of settings in deep learning (Keskar et al., 2016; Panigrahi et al., 2019; Martin & Mahoney, 2019). We also prove that smaller stepsizes lead to larger α, hence lighter tails, which agrees with the fact that the existing literature on linear regression often chooses η small enough to guarantee that the variance of the iterates stays bounded (Dieuleveut et al., 2017b; Jain et al., 2017).

Theorem 2. The tail-index α is strictly increasing in the batch size b and strictly decreasing in the stepsize η and the variance σ², provided that α ≥ 1. Moreover, the tail-index α is strictly decreasing in the dimension d.

The next result characterizes the tail-index α depending on the choice of the batch size b, the variance σ² (which determines the curvature around the minimum), and the stepsize; in particular, we show that if the stepsize exceeds an explicit threshold, the stationary distribution becomes heavy-tailed with infinite variance.

Proposition 3. Let η_crit := 2b / (σ²(d + b + 1)). The following holds: (i) There exists η_max > η_crit such that for any η_crit < η < η_max, Theorem 1 holds with tail-index 0 < α < 2.
(ii) If η = η_crit, Theorem 1 holds with tail-index α = 2. (iii) If η ∈ (0, η_crit), then Theorem 1 holds with tail-index α > 2.

Relation to first exit times. Proposition 3 implies that, for fixed η and b, the tail-index α decreases with increasing σ. Combined with the first-exit-time analyses of Şimşekli et al. (2019b) and Nguyen et al. (2019), which state that the escape probability from a basin becomes higher for smaller α, our result implies that the probability of SGD escaping from a basin grows with increasing curvature, hence providing an alternative view of the argument that SGD prefers flat minima.

Three regimes for the stepsize. Theorems 1-2 and Proposition 3 identify three regimes: (I) convergence to a limit with finite variance if ρ < 0 and α > 2; (II) convergence to a heavy-tailed limit with infinite variance if ρ < 0 and α < 2; (III) ρ > 0, in which case convergence cannot be guaranteed. For Gaussian input, if the stepsize is small enough (smaller than η_crit), then by Proposition 3 we have ρ < 0 and α > 2, so regime (I) applies. As we increase the stepsize past the critical level η_crit, we obtain α < 2 as long as η < η_max, where η_max is the maximum allowed stepsize for ensuring convergence (corresponding to ρ = 0). A similar behavior with three stepsize (learning-rate) regimes was reported by Lewkowycz et al. (2020) and derived analytically for one-hidden-layer linear networks of large width. The large stepsize choices that avoid divergence, the so-called catapult phase, empirically yielded the best generalization performance, driving the iterates to flatter minima in practice. We suspect that the catapult phase in Lewkowycz et al. (2020) corresponds to regime (II) in our case, where the iterates are heavy-tailed, which might cause convergence to flatter minima as the first-exit-time discussion suggests (Şimşekli et al., 2019a).

Moment Bounds and Convergence Speed.
Theorem 1 is asymptotic in nature: it characterizes the stationary distribution x_∞ of the SGD iterations with a tail-index α. Next, we provide non-asymptotic moment bounds for each iterate x_k, and also for the limit x_∞.

Theorem 4. (i) If the tail-index α ≤ 1, then for any p ∈ (0, α), we have h(p) < 1 and

E||x_k||^p ≤ (h(p))^k E||x_0||^p + [(1 − (h(p))^k) / (1 − h(p))] E||q_1||^p.   (3.8)

(ii) If the tail-index α > 1, then for any p ∈ (1, α), we have h(p) < 1, and for any ε > 0 such that (1+ε)h(p) < 1, we have

E||x_k||^p ≤ ((1+ε)h(p))^k E||x_0||^p + [(1 − ((1+ε)h(p))^k) / (1 − (1+ε)h(p))] · [((1+ε)^{p/(p−1)} − (1+ε)) / ((1+ε)^{1/(p−1)} − 1)^p] E||q_1||^p.   (3.9)

Theorem 4 shows that when p < α, the upper bound on the p-th moment of the iterates converges exponentially fast to the p-th moment of q_1 when α ≤ 1, and to a neighborhood of the p-th moment of q_1 when α > 1, where q_1 is defined in (3.3). By letting k → ∞ and applying Fatou's lemma, we can also characterize the moments of the stationary distribution.

Corollary 5. (i) If the tail-index α ≤ 1, then for any p ∈ (0, α), E||x_∞||^p ≤ E||q_1||^p / (1 − h(p)), where h(p) < 1. (ii) If the tail-index α > 1, then for any p ∈ (1, α), we have h(p) < 1, and for any ε > 0 such that (1+ε)h(p) < 1, we have E||x_∞||^p ≤ [1 / (1 − (1+ε)h(p))] · [((1+ε)^{p/(p−1)} − (1+ε)) / ((1+ε)^{1/(p−1)} − 1)^p] E||q_1||^p.

Next, we study the speed of convergence of the k-th iterate x_k to its stationary distribution x_∞ in the Wasserstein metric W_p for any 1 ≤ p < α.

Theorem 6. Assume α > 1, and let ν_k, ν_∞ denote the probability laws of x_k and x_∞, respectively. Then W_p(ν_k, ν_∞) ≤ (h(p))^{k/p} W_p(ν_0, ν_∞) for any 1 ≤ p < α, where the convergence rate satisfies (h(p))^{1/p} ∈ (0, 1).

Theorem 6 shows that, even in the case α < 2, convergence to the heavy-tailed stationary distribution occurs relatively fast, i.e., linearly in the p-Wasserstein metric. We can also characterize the constant h(p) in Theorem 6, which controls the convergence rate, as follows:

Corollary 7.
When η < η_crit = 2b/(σ²(d + b + 1)), the tail-index satisfies α > 2 and

W_2(ν_k, ν_∞) ≤ (1 − 2ησ²(1 − η/η_crit))^{k/2} W_2(ν_0, ν_∞).   (3.10)

Theorem 6 works for any p < α. At the critical value p = α, Theorem 1 indicates that E||x_∞||^α = ∞, and therefore E||x_k||^α → ∞ as k → ∞, which serves as evidence that the tail gets heavier as the number of iterations k increases. By adapting the proof of Theorem 4, we obtain the following result, stating that the moments of the iterates of order α go to infinity, but only polynomially fast.

Proposition 8. Given the tail-index α, we have E||x_∞||^α = ∞. Moreover, E||x_k||^α = O(k) if α ≤ 1, and E||x_k||^α = O(k^α) if α > 1.

It may be possible to leverage recent results on the concentration of products of i.i.d. random matrices (Huang et al., 2020; Henriksen & Ward, 2020) to study the tail of x_k for finite k, which can be a future research direction.

Extension to non-Gaussian data. Our main purpose in Theorem 1 is to show that heavy tails can arise even in the simplest setting, where the input is Gaussian. However, Theorem 1 extends naturally when the input a_i is not necessarily Gaussian. For example, if we assume that the distribution of a_i has support R^d and a finite second moment, it can be checked that our proof technique for Theorem 1 is still applicable and Theorem 1 holds with h(s) defined by (3.4). The only difference is that when the input is not Gaussian, the explicit formula (3.6) for h(s) no longer holds as an equality but becomes an inequality, i.e.,

h(s) ≤ h̄(s) := E[||(I − (η/b)H) e_1||^s],   (3.11)

where h(s) is defined by (3.4). This inequality is simply a consequence of the sub-multiplicativity of the matrix products appearing in (3.4). If ᾱ is such that h̄(ᾱ) = 1, then by (3.11), ᾱ is a lower bound on the tail-index α that satisfies h(α) = 1, where h is defined as in (3.4). In other words, when the input is not Gaussian, we have ᾱ ≤ α, and therefore ᾱ serves as a lower bound on the tail-index.
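Since h̄(s) in (3.6)/(3.11) is an ordinary expectation, both it and the tail-index can be approximated numerically: estimate h̄(s) by Monte Carlo and solve h̄(α) = 1 by bisection (as a function of s, h̄ is convex with h̄(0) = 1, so the second crossing of 1 is unique when ρ < 0). The following sketch is our own sanity check under assumption (A1), where h = h̄; the function names and sample sizes are our choices:

```python
import numpy as np

def h_bar(s, eta, b, d, sigma=1.0, n_mc=100_000, seed=0):
    """Monte Carlo estimate of h(s) = E ||(I - (eta/b) H) e_1||^s  (eq. 3.6),
    with H = sum_{i=1}^b a_i a_i^T and a_i ~ N(0, sigma^2 I_d).
    A fixed seed gives common random numbers across s, so estimates vary smoothly in s."""
    rng = np.random.default_rng(seed)
    A = sigma * rng.normal(size=(n_mc, b, d))
    v = -(eta / b) * np.einsum('nb,nbd->nd', A[:, :, 0], A)  # -(eta/b) H e_1
    v[:, 0] += 1.0                                           # + e_1
    return np.mean(np.linalg.norm(v, axis=1) ** s)

def tail_index(eta, b, d, sigma=1.0, lo=1e-3, hi=20.0):
    """Bisect for the unique alpha > 0 with h(alpha) = 1 (valid when rho < 0)."""
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if h_bar(mid, eta, b, d, sigma) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

d, b, sigma = 10, 1, 1.0
eta_crit = 2 * b / (sigma**2 * (d + b + 1))
print(tail_index(eta_crit, b, d))        # Proposition 3 predicts alpha = 2 here (up to MC error)
print(tail_index(0.5 * eta_crit, b, d))  # halving the stepsize yields a much lighter tail
```

At η = η_crit the closed-form moment E||(I − η aa^T)e_1||² equals exactly 1, so the bisection should land near α = 2, matching Proposition 3(ii).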
Furthermore, Theorem 2 also applies in the sense that ᾱ is strictly increasing in the batch size b and strictly decreasing in the stepsize η and the variance σ², provided that ᾱ ≥ 1.

Extension to non-quadratic optimization. We note that extending our results beyond quadratic optimization is possible if the gradients have asymptotically linear growth. For example, consider the cost F(x) = E[ℓ(a^T x − y)] with a loss function ℓ. The choice ℓ(z) = z²/2 is the standard linear regression setting, where the gradient of F is an affine function of x; in this case, ∇F(x) − Σx is bounded if we choose Σ = E[aa^T]. Theorem 1 holds as long as there exists a matrix Σ such that ∇F(x) − Σx stays bounded, even if the function is not quadratic; the proof is straightforward and based on verifying that the conditions of (Mirek, 2011, Theorem 1.4) hold in the setting of Theorem 1. Optimization problems of this type arise, for instance, in robust regression, where the objective is F(x) = E[(a^T x − b)²] + λ g(x), with a penalty function g(x) whose gradient is bounded and a tunable parameter λ. The boundedness of the gradient of g(x) results in at most linear growth of g(x) and allows robustness to outliers, where the parameter λ can be used to adjust the desired robustness level. Examples for the choice of g(x) include the smoothly clipped absolute deviation (SCAD) penalty (Loh & Wainwright, 2015), the Huber loss (Huber, 1992), and the exponential squared loss g(x) = 1 − exp(−x²/c), where c is a tuning parameter.

Generalized Central Limit Theorem for Ergodic Averages. When α > 2, by Corollary 5, the second moment of the iterates x_k is finite, in which case the central limit theorem (CLT) states that if the cumulative sum S_K := Σ_{k=1}^K x_k is scaled properly, the resulting limit distribution is Gaussian.
In the case α < 2, the variance of the iterates is not finite; however, we derive the following generalized CLT (GCLT), which states that if the iterates are properly scaled, the limit is an α-stable distribution. This is stated more precisely as follows.

Corollary 9. Assume ρ < 0, so that Theorem 1 holds. Then we have the following: (i) If α ∈ (0, 1) ∪ (1, 2), then there is a sequence d_K = d_K(α) and a function C_α : S^{d−1} → C such that, as K → ∞, the random variables K^{−1/α}(S_K − d_K) converge in law to the α-stable random variable with characteristic function Υ_α(tv) = exp(t^α C_α(v)) for t > 0 and v ∈ S^{d−1}. (ii) If α = 1, then there are functions ξ, τ : (0, ∞) → R and C_1 : S^{d−1} → C such that, as K → ∞, the random variables K^{−1}(S_K − K ξ(K^{−1})) converge in law to the random variable with characteristic function Υ_1(tv) = exp(t C_1(v) + it⟨v, τ(t)⟩) for t > 0 and v ∈ S^{d−1}. (iii) If α = 2, then there is a sequence d_K = d_K(2) and a function C_2 : S^{d−1} → R such that, as K → ∞, the random variables (K log K)^{−1/2}(S_K − d_K) converge in law to the random variable with characteristic function Υ_2(tv) = exp(t² C_2(v)) for t > 0 and v ∈ S^{d−1}. (iv) If α ∈ (0, 1), then d_K = 0; if α ∈ (1, 2], then d_K = K x̄, where x̄ := ∫_{R^d} x ν_∞(dx).

In addition to its evident theoretical interest, Corollary 9 also has an important practical implication: estimating the tail-index of a generic heavy-tailed distribution is a challenging problem (see, e.g., Clauset et al. (2009); Goldstein et al. (2004); Bauke (2007)); however, for the specific case of α-stable distributions, accurate and computationally efficient estimators, which do not require knowledge of the functions C_α, τ, and ξ, have been proposed (Mohammadi et al., 2015). Thanks to Corollary 9, we are able to use such estimators in our numerical experiments, as we detail in the next section.
We finally note that the gradient noise in SGD is in fact both multiplicative and additive (Dieuleveut et al., 2017b;a), a fact that is often discarded to simplify the mathematical analysis. In the linear regression setting, we have shown that the multiplicative noise M_k is the main source of heavy tails: a deterministic M_k would not lead to heavy tails. In the light of our theory, in Appendix A we discuss in detail the recently proposed stochastic differential equation (SDE) representations of SGD in continuous time, and argue that, compared to classical SDEs driven by a Brownian motion (e.g., Jastrzębski et al. (2017); Cheng et al. (2019)), SDEs driven by heavy-tailed α-stable Lévy processes (e.g., Şimşekli et al. (2019b)) are more adequate when α < 2.

4. EXPERIMENTS

In this section, we present our experimental results on both synthetic and real data, in order to illustrate that our theory also holds in finite-sum problems (besides the streaming setting). Our main goal is to illustrate the tail behavior of SGD as the algorithm parameters vary: depending on the choice of the stepsize η and the batch size b, the iterates do converge to a heavy-tailed distribution (Theorem 1), and the behavior of the tail-index obeys Theorem 2.

Synthetic experiments. In our first setting, we consider a simple synthetic setup in which the data points follow a Gaussian distribution. We illustrate that the SGD iterates can become heavy-tailed even in this simplistic setting, where the problem is a simple linear regression with all variables Gaussian. More precisely, we consider the following model:

x_0 ∼ N(0, σ_x² I),  a_i ∼ N(0, σ² I),  y_i | a_i, x_0 ∼ N(a_i^T x_0, σ_y²),   (4.1)

where x_0, a_i ∈ R^d and y_i ∈ R for i = 1, ..., n, and σ, σ_x, σ_y > 0. In our experiments, we need to estimate the tail-index α of the stationary distribution ν_∞. Even though several tail-index estimators have been proposed for generic heavy-tailed distributions in the literature (Paulauskas & Vaičiulis, 2011), we observed that, even for small d, these estimators can yield inaccurate estimates and require tuning hyper-parameters, which is non-trivial. We circumvent this issue thanks to the GCLT in Corollary 9: since the average of the iterates is guaranteed to converge to a multivariate α-stable random variable, we can use tail-index estimators that are specifically designed for stable distributions. Following Tzagkarakis et al. (2018) and Şimşekli et al. (2019b), we use the estimator proposed by Mohammadi et al. (2015), which is fortunately agnostic to the scaling function C_α. The details of this estimator are given in Appendix B.
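We emphasize that the code below is our own simplified sketch, not the estimator of Mohammadi et al. (2015) itself; it exploits the same stability property (a sum of K i.i.d. SαS variables is distributed as K^{1/α} times a single one), which turns expected log-magnitudes of block sums into a linear equation for 1/α:

```python
import numpy as np

def stable_tail_index(x, K=100):
    """Estimate alpha from (approximately) SaS samples via block sums:
    if the X_i are i.i.d. SaS, then a sum of K of them =d K^{1/alpha} X_1, so
    E log|block sum| - E log|X| = (1/alpha) log K."""
    x = np.asarray(x, dtype=float)
    M = len(x) // K
    blocks = x[:M * K].reshape(M, K).sum(axis=1)
    inv_alpha = (np.mean(np.log(np.abs(blocks)))
                 - np.mean(np.log(np.abs(x[:M * K])))) / np.log(K)
    return 1.0 / inv_alpha

# Sanity checks on exactly stable samples.
rng = np.random.default_rng(0)
print(stable_tail_index(rng.normal(size=10**6)))           # Gaussian = SaS with alpha = 2
print(stable_tail_index(rng.standard_cauchy(size=10**6)))  # Cauchy = SaS with alpha = 1
```

Like the estimator used in the paper, this sketch needs no knowledge of the scale function C_α; it is only reliable when the input is close to exactly stable, which is precisely what Corollary 9 guarantees for the averaged iterates.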
To benefit from the GCLT, we compute the average of the 'centered' iterates,

(1/(K − K_0)) Σ_{k=K−K_0+1}^{K} (x_k − x̄),

where K_0 is a 'burn-in' period aiming to discard the initial phase of SGD, and the mean of ν_∞ is given by x̄ = ∫_{R^d} x ν_∞(dx) = (A^T A)^{−1} A^T y as long as α > 1, where the i-th row of A ∈ R^{n×d} contains a_i and y = [y_1, ..., y_n]^T ∈ R^n. We then repeat this procedure 1600 times for different initial points and obtain 1600 random vectors, whose distributions are supposedly close to an α-stable distribution. Finally, we run the tail-index estimator of Mohammadi et al. (2015) on these random vectors to estimate α.

In our first experiment, we investigate the tail-index α of the stationary measure ν_∞ for varying stepsize η and batch size b. We set d = 100, fix the variances σ = 1, σ_x = σ_y = 3, and generate {a_i, y_i}_{i=1}^n by simulating the statistical model. Then, keeping this dataset fixed, we run the SGD recursion (3.3) for a large number of iterations, varying η from 0.02 to 0.2 and b from 1 to 20 (b ∈ {1, 2, 3, 4, 5, 10, 20}). We set K = 1000 and K_0 = 500. Figure 1(a) illustrates the results. We observe that increasing η and decreasing b both decrease α, and the tail-index can become prohibitively small (i.e., α < 1, so that even the mean of ν_∞ is not defined) for large η. Moreover, the tail-index correlates strongly with the ratio η/b.

In our second experiment, we investigate the effect of d and σ on α. In Figure 1(b) (left), we set d = 100, η = 0.1, and b = 5, and vary σ from 0.8 to 2. For each value of σ, we simulate a new dataset using the generative model and run SGD with the same K and K_0, again repeating each experiment 1600 times. We follow a similar route for Figure 1(b) (right): we fix σ = 1.75 and repeat the previous procedure for each value of d ranging from 5 to 50.
The results confirm our theory: α decreases with increasing σ and d, and we observe that for fixed b and η the change in d can abruptly alter α. In our final synthetic data experiment, we investigate how the tails behave under adaptive optimization algorithms. We replicate the setting of our first experiment, with the only difference that we replace SGD with RMSProp (Hinton et al., 2012). As shown in Figure 1(c), the 'clipping' effect of RMSProp, as reported in Zhang et al. (2019), prevents the iterates from becoming heavy-tailed, and the vast majority of the estimated tail-indices are around 2, indicating a Gaussian behavior. On the other hand, we repeated the same experiment with the variance-reduced optimization algorithm SVRG (Johnson & Zhang, 2013), and observed that for almost all choices of η and b the algorithm converges near the minimizer (with an error on the order of 10⁻⁶); hence the stationary distribution ν_∞ appears to be a degenerate distribution, which does not admit a heavy-tailed behavior. Regarding the observed link between heavy tails and generalization (Martin & Mahoney, 2019; Şimşekli et al., 2020), this behavior of RMSProp and SVRG might be related to their ineffective generalization performance, as reported in Keskar & Socher (2017) and Defazio & Bottou (2019).

Experiments on fully connected neural networks. In the second set of experiments, we investigate the applicability of our theory beyond quadratic optimization problems. Here, we follow the setup of Şimşekli et al. (2019a) and consider a fully connected neural network with the cross-entropy loss and ReLU activation functions on the MNIST and CIFAR10 datasets. We train the models by using SGD for 10K iterations, and we vary η from 10⁻⁴ to 10⁻¹ and b from 1 to 10.
Since it would be computationally infeasible to repeat each run thousands of times as we did in the synthetic data experiments, in this setting we follow a different approach, based on (i) Şimşekli et al. (2019a), who suggest that the tail behavior can differ across the layers of a neural network, and (ii) De Bortoli et al. (2020), who show that in the infinite-width limit, the different components of a given layer of a two-layer fully connected network (FCN) become independent. Accordingly, we first compute the average of the last 1K SGD iterates, whose distribution should be close to an α-stable distribution by the GCLT. We then treat each layer as a collection of i.i.d. α-stable random variables and measure the tail-index of each individual layer separately by using the estimator from Mohammadi et al. (2015). Figure 2 shows the results for a three-layer network (with 128 hidden units at each layer); we obtained very similar results with a two-layer network as well. We observe that, while the dependence of α on η/b differs from layer to layer, in each layer the measured α correlates very well with the ratio η/b on both datasets.

Experiments on VGG networks. In our last set of experiments, we evaluate our theory on VGG networks (Simonyan & Zisserman, 2015) with 11 layers (10 convolutional layers with max-pooling and ReLU units, followed by a final linear layer), containing 10M parameters. We follow the same procedure as for the fully connected networks, where we vary η from 10⁻⁴ to 1.7 × 10⁻³ and b from 1 to 10. The results are shown in Figure 3. Similar to the previous experiments, we observe that α depends on the layer. For the intermediate layers (Layers 2-8), the tail-index correlates well with the ratio η/b, whereas the first layer and the last two convolutional layers (Layers 9 and 10) exhibit a Gaussian behavior (α ≈ 2). On the other hand, the tail-index of the last layer (which is linear) does not correlate with either η or b.
These observations provide further support for our theory and show that the heavy-tail phenomenon also occurs in neural networks, although α may be related to η and b in a more complicated way.

5. CONCLUSION AND FUTURE DIRECTIONS

We studied the tail behavior of SGD in a quadratic optimization problem and showed that, depending on the curvature, η, and b, the iterates can converge to a heavy-tailed random variable. We further supported our theory with experiments conducted on fully connected neural networks and illustrated that our results also apply to more general settings, hence providing new insights into the behavior of SGD in deep learning. This study also brings up a number of future directions. (i) Our proof techniques are for the streaming setting, where each sample is used only once. However, in practice SGD is typically applied to the finite-sum problem (1.2) with multiple passes over the data. Extending our results to this scenario and investigating the effects of the finite sample size on the tail-index and generalization would be an interesting future research direction. (ii) We suspect that the tail-index of the SGD iterates may have an impact on the time required to escape a saddle point, which can be investigated further as another future research direction.

A A NOTE ON STOCHASTIC DIFFERENTIAL EQUATION REPRESENTATIONS FOR SGD

In recent years, a popular approach for analyzing the behavior of SGD has been viewing it as a discretization of a continuous-time stochastic process that can be represented via a stochastic differential equation (SDE) (Mandt et al., 2016; Jastrzębski et al., 2017; Li et al., 2017; Hu et al., 2019; Zhu et al., 2018; Chaudhari & Soatto, 2018; Şimşekli et al., 2019b). While these SDEs have been useful for understanding different properties of SGD, their differences and respective domains of validity have not been clearly understood. In this section, in light of our theoretical results, we discuss in which situations each choice is more appropriate. We restrict ourselves to the case where f(x) is a quadratic function; however, the discussion can be extended to more general f. The SDE approximations are often motivated by first rewriting the SGD recursion as follows: x_{k+1} = x_k − η∇f̂_{k+1}(x_k) = x_k − η∇f(x_k) + ηU_{k+1}(x_k), (A.1) where U_k(x) := ∇f(x) − ∇f̂_k(x) denotes the stochastic gradient noise. The simplest approach is to assume that the noise is state-independent, isotropic, and Gaussian, i.e., U_k(x) ≈ σ_z Z_k with Z_k ∼ N(0, I); under this assumption, (A.1) coincides with the Euler-Maruyama discretization (with stepsize η) of the SDE dx_t = −∇f(x_t)dt + √η σ_z dB_t, (A.2) where B_t denotes the d-dimensional standard Brownian motion. This process is called the Ornstein-Uhlenbeck (OU) process (see e.g. Øksendal (2013)), whose invariant measure is a Gaussian distribution. We argue that this process can be a good proxy for (3.3) only when α ≥ 2, since otherwise the SGD iterates exhibit heavy tails, whose behavior cannot be captured by a Gaussian distribution. As we illustrated in Section 4, to obtain a large α, the stepsize η needs to be small and/or the batch-size b needs to be large. However, this approximation falls short when the system exhibits heavy tails, i.e., α < 2. Therefore, in the large-η/b regime, which appears to be the more interesting one since it often yields improved test performance (Jastrzębski et al., 2017), this approximation would be inaccurate for understanding the behavior of SGD.
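For the quadratic case, the Euler-Maruyama discretization of the OU approximation can be sketched as follows (a toy Hessian and an assumed noise scale sigma_z, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Euler-Maruyama discretization of the OU approximation (A.2):
#   dx_t = -grad f(x_t) dt + sqrt(eta) * sigma_z dB_t,
# for the quadratic f(x) = 0.5 * x^T H x  (H and sigma_z are illustrative).
d = 4
H = np.diag(np.linspace(0.5, 2.0, d))      # toy positive-definite Hessian
eta, sigma_z, dt, K = 0.1, 1.0, 0.01, 5000

x = np.ones(d)
traj = np.empty((K, d))
for k in range(K):
    # drift step plus a Gaussian increment of std sqrt(eta)*sigma_z*sqrt(dt)
    x = x - dt * (H @ x) + np.sqrt(eta) * sigma_z * np.sqrt(dt) * rng.normal(size=d)
    traj[k] = x
```

Since the invariant measure of this process is Gaussian, every moment of the simulated trajectory is finite, which is precisely why (A.2) cannot reproduce the α < 2 regime.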
This problem mainly stems from the fact that the additive isotropic noise assumption results in a deterministic matrix M_k for all k. Since there is no multiplicative noise term, this representation cannot capture a potential heavy-tailed behavior. A natural extension of the state-independent Gaussian noise assumption is to incorporate the covariance structure of U_k. In our linear regression problem, we can easily see that the covariance matrix of the gradient noise has the following form: Σ_U(x) = Cov(U_k | x) = (σ²/b) diag(x ∘ x), (A.3) where ∘ denotes element-wise multiplication and σ² is the variance of the data points. Therefore, we can extend the previous assumption by assuming Z_k | x ∼ N(0, ηΣ_U(x)). It has been observed that this approximation yields a more accurate representation (Cheng et al., 2019; Ali et al., 2020; Jastrzębski et al., 2017). Using this assumption in (A.1), the SGD recursion coincides with the Euler-Maruyama discretization of the following SDE: dx_t = −∇f(x_t)dt + (ηΣ_U(x_t))^{1/2} dB_t ᵈ= −(AᵀA x_t − Aᵀy)dt + √(ησ²/b) diag(x_t) dB_t, (A.4) where ᵈ= denotes equality in distribution. The stochasticity in such SDEs is often called multiplicative. Let us illustrate this property by discretizing the process: using the definitions of the gradient and the covariance matrix, we observe that (with N_k ∼ N(0, I)) x_{k+1} = x_k − η(AᵀA x_k − Aᵀy) + √(σ²η²/b) diag(x_k) N_{k+1} = (I − ηAᵀA + √(σ²η²/b) diag(N_{k+1})) x_k + ηAᵀy, (A.5) where we can clearly see the multiplicative effect of the noise, as its name indicates. On the other hand, we can observe that, thanks to the multiplicative structure, this process would be able to capture the potential heavy-tailed structure of SGD. However, there are two caveats. The first one is that, in the case of linear regression, the process is called a geometric (or modified) Ornstein-Uhlenbeck process, which is an extension of geometric Brownian motion.
One can show that the distribution of such a process at any time t has lognormal tails; hence it will be accurate only when the tails of the SGD iterates are close to those of a lognormal distribution. The second caveat is that, for a more general cost function f, the covariance matrix is more complicated, and the invariant measure of the process cannot be found analytically; hence analyzing these processes for a general f can be as challenging as directly analyzing the behavior of SGD. The third way of modeling the gradient noise is based on assuming that it is heavy-tailed. In particular, we can assume that ηU_k ≈ η^{1/α} L_k where [L_k]_i ∼ SαS(σ_L η^{(α−1)/α}) for all i = 1, ..., d. Under this assumption, the SGD recursion coincides with the Euler discretization of the following Lévy-driven SDE (Şimşekli et al., 2019b): dx_t = −∇f(x_t)dt + σ_L η^{(α−1)/α} dL_t^α, (A.6) where L_t^α denotes the α-stable Lévy process with independent components (see Section A.1 for technical background on Lévy processes and in particular α-stable Lévy processes). In the case of linear regression, this process is called a fractional OU process (Fink & Klüppelberg, 2011), whose invariant measure is also an α-stable distribution with the same tail-index α. Hence, even though it is based on an isotropic, state-independent noise assumption, in the large-η/b regime this approach can mimic the heavy-tailed behavior of the system with the exact tail-index α. Furthermore, Buraczewski et al. (2016) (Theorems 1.7 and 1.16) showed that if U_k is assumed to be heavy-tailed with index α (not necessarily SαS), then the process x_k will inherit the same tails and the ergodic averages will still converge to an SαS random variable, hence generalizing the conclusions of the SαS assumption to the case where U_k follows an arbitrary heavy-tailed distribution.
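A scalar caricature of the multiplicative recursion (A.5) can be sketched as follows (the values of h, σ², b and q are illustrative; increasing ησ/√b makes the multiplicative factor more volatile and pushes the stationary law toward the heavy-tailed Kesten-type regime of Theorem 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Scalar analogue of (A.5): x_{k+1} = (1 - eta*h + c*N_{k+1}) * x_k + eta*q,
# where c = sqrt(sigma^2 * eta^2 / b) plays the role of the multiplicative noise.
eta, h, sigma2, b, q = 0.1, 1.0, 4.0, 1, 0.5
c = np.sqrt(sigma2 * eta**2 / b)

def final_iterate(K, rng):
    """One draw from (approximately) the stationary law of the recursion."""
    x = 0.0
    for _ in range(K):
        x = (1.0 - eta * h + c * rng.normal()) * x + eta * q
    return x

samples = np.array([final_iterate(1000, rng) for _ in range(400)])
```

In contrast with the additive OU case, the stationary distribution here is of Kesten type: its tail-index α solves E|1 − ηh + cN|^α = 1, with N ∼ N(0, 1).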
A.1 TECHNICAL BACKGROUND: LÉVY PROCESSES

Lévy motions (processes) are stochastic processes with independent and stationary increments, which include Brownian motion as a special case and in general may have heavy-tailed distributions (see e.g. Bertoin (1996) for a survey). Symmetric α-stable Lévy motion is a Lévy motion whose time increments are symmetric α-stable distributed. We define L_t^α, a d-dimensional symmetric α-stable Lévy motion, as follows. Each component of L_t^α is an independent scalar α-stable Lévy process defined by:
(i) L_0^α = 0 almost surely;
(ii) For any t_0 < t_1 < ... < t_N, the increments L_{t_n}^α − L_{t_{n−1}}^α are independent, n = 1, 2, ..., N;
(iii) The difference L_t^α − L_s^α and L_{t−s}^α have the same distribution, namely SαS((t − s)^{1/α}) for s < t;
(iv) L_t^α has stochastically continuous sample paths, i.e., for any δ > 0 and s ≥ 0, P(|L_t^α − L_s^α| > δ) → 0 as t → s.
When α = 2, we obtain a scaled Brownian motion as a special case, i.e., L_t^α = √2 B_t, so that the difference L_t^α − L_s^α follows a Gaussian distribution N(0, 2(t − s)).
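The SαS increments needed to simulate such a process (e.g., an Euler scheme for the Lévy-driven SDE (A.6)) can be generated with the Chambers-Mallows-Stuck method; the sketch below is illustrative, with a 1-d quadratic drift:

```python
import numpy as np

def sas(alpha, size, rng):
    """Standard symmetric alpha-stable samples (Chambers-Mallows-Stuck)."""
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * U) / np.cos(U) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * U) / W) ** ((1.0 - alpha) / alpha))

# Euler scheme for the 1-d Levy-driven SDE dx_t = -x_t dt + dL^alpha_t:
# the increment of L^alpha over a step of length dt scales as dt^(1/alpha),
# consistent with property (iii) above.
rng = np.random.default_rng(0)
alpha, dt, K = 1.5, 0.01, 20000
incs = dt ** (1.0 / alpha) * sas(alpha, K, rng)
x, traj = 0.0, np.empty(K)
for k in range(K):
    x = x - dt * x + incs[k]
    traj[k] = x
```

For α = 2 the sampler reduces to 2 sin(U)√W, i.e., a N(0, 2) variable, matching the Brownian special case L_t^α = √2 B_t.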

B TAIL-INDEX ESTIMATION

In this study, we follow Tzagkarakis et al. (2018) and Şimşekli et al. (2019b), and make use of the recent estimator proposed by Mohammadi et al. (2015).

Theorem 10 (Mohammadi et al. (2015), Corollary 2.4). Let {X_i}_{i=1}^K be a collection of strictly stable random variables in R^d with tail-index α ∈ (0, 2] and K = K_1 × K_2. Define Y_i = Σ_{j=1}^{K_1} X_{j+(i−1)K_1} for i = 1, ..., K_2. Then, the estimator (1/log K_1) ( (1/K_2) Σ_{i=1}^{K_2} log ||Y_i|| − (1/K) Σ_{i=1}^{K} log ||X_i|| ) (B.1) converges to 1/α almost surely, as K_2 → ∞.

As this estimator requires a hyperparameter K_1, at each tail-index estimation we used several values of K_1 and took the median of the estimates obtained with the different values of K_1. We provide the code in the supplement, where the implementation details can be found. For the neural network experiments, we used the same setup as provided in the repository of Şimşekli et al. (2019b).
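A direct implementation of the estimator (B.1) can be sketched as follows (a minimal sketch; the function name and the Gaussian sanity check are ours). For Gaussian data (α = 2), ||Y_i|| is distributed as √K_1 ||X_i||, so the estimator returns log(√K_1)/log(K_1) = 1/2:

```python
import numpy as np

def estimate_inv_alpha(X, K1):
    """Estimator (B.1): X is a (K, d) array of i.i.d. strictly stable vectors,
    with K = K1 * K2. Returns the estimate of 1/alpha."""
    K, d = X.shape
    K2 = K // K1
    X = X[: K1 * K2]
    # Y_i is the sum of the i-th group of K1 consecutive X's
    Y = X.reshape(K2, K1, d).sum(axis=1)
    return (np.log(np.linalg.norm(Y, axis=1)).mean()
            - np.log(np.linalg.norm(X, axis=1)).mean()) / np.log(K1)

# Sanity check on Gaussian vectors, which are 2-stable:
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 3))
inv_alpha = estimate_inv_alpha(X, K1=100)   # should be close to 1/2
```

In practice the median of the estimates over several values of `K1` is reported, as described above.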

C PROOFS OF MAIN RESULTS

C.1 PROOF OF THEOREM 1

Proof of Theorem 1. The proof follows from (Buraczewski et al., 2016, Thm 4.4.15), which goes back to (Alsmeyer & Mentemeier, 2012, Theorem 1.1) and Kesten (Kesten, 1973, Theorem 6). See also (Goldie, 1991; Buraczewski et al., 2015). We recall that we have the stochastic recursion: x_k = M_k x_{k−1} + q_k, (C.1) where the sequence (M_k, q_k) is i.i.d., distributed as (M, q), and for each k, (M_k, q_k) is independent of x_{k−1}. To apply (Buraczewski et al., 2016, Thm 4.4.15), it suffices to have the following conditions satisfied:
1. M is invertible with probability 1.
2. The matrix M has a continuous Lebesgue density that is positive in a neighborhood of the identity matrix.
3. ρ < 0 and h(α) = 1.
4. P(Mx + q = x) < 1 for every x.
5. E[||M||^α (log⁺||M|| + log⁺||M⁻¹||)] < ∞.
6. 0 < E||q||^α < ∞.
All the conditions are satisfied under our assumptions. In particular, Condition 1 and Condition 5 are proved in Lemma 18, and Condition 2 and Condition 4 follow from the fact that M and q have continuous distributions. If ρ < 0, then by Lemma 15 we have h(0) = 1, h'(0) = ρ < 0 and h(s) is convex in s; moreover, by Lemma 16, we have lim inf_{s→∞} h(s) > 1. Therefore, there exists some α ∈ (0, ∞) such that h(α) = 1, which gives Condition 3. Finally, Condition 6 is satisfied by the definition of q and by Assumptions (A1)-(A2).

C.2 PROOF OF THEOREM 2

Proof of Theorem 2. We will split the proof of Theorem 2 into three parts: (I) We will show that the tail-index α is strictly decreasing in the stepsize η and the variance σ², provided that α ≥ 1. (II) We will show that the tail-index α is strictly increasing in the batch-size b, provided that α ≥ 1. (III) We will show that the tail-index α is strictly decreasing in the dimension d.

First, let us prove (I). Let a := ησ² > 0 be given. Consider the tail-index α as a function of a, i.e., α(a) := min{s : h(a, s) = 1}, where h(a, s) = h(s) with emphasis on the dependence on a. By assumption, α(a) ≥ 1. The function h(a, s) is a convex function of a for s ≥ 1 (see Lemma 19) and a strictly convex function of s for s ≥ 0. Furthermore, it satisfies h(a, 0) = 1 for every a ≥ 0 and h(0, s) = 1 for every s ≥ 0. We consider the curve C := {(a, s) ∈ (0, ∞) × [1, ∞) : h(a, s) = 1}. This is the set of choices of a that lead to a tail-index s with s ≥ 1. Since h is smooth in both a and s, we can represent s as a smooth function of a on the curve, i.e., h(a, s(a)) = 1, where s(a) is a smooth function of a. We will show that s'(a) < 0; i.e., if we increase a, the tail-index s(a) will drop. Pick any (a*, s*) ∈ C; it satisfies h(a*, s*) = 1. We have the following facts: (i) The function h(a, s) = 1 for either a = 0 or s = 0. This is illustrated in Figure 4 with a blue marker. (ii) h(a*, s) < 1 for 0 < s < s*. This follows from the convexity of the function h(a*, s) and the facts that h(a*, 0) = 1 and h(a*, s*) = 1. From here, we see that the function h(a*, s) is increasing at s = s* and its derivative satisfies ∂h/∂s(a*, s*) > 0. (iii) The function h(a, s*) is convex as a function of a by Lemma 19, and it satisfies h(0, s*) = h(a*, s*) = 1. Therefore, by convexity, h(a, s*) < 1 for a ∈ (0, a*); otherwise the function h(a, s*) would be a constant function. We therefore necessarily have ∂h/∂a(a*, s*) > 0.
By convexity of the function h(a, s*), we also have h(a, s*) ≥ h(a*, s*) + ∂h/∂a(a*, s*)(a − a*) > h(a*, s*) = 1 for a > a*. Therefore, h(a, s*) > 1 for a > a*. It then also follows that h(a, s) > 1 for a > a* and s > s* (otherwise, if h(a, s) ≤ 1, we get a contradiction, because h(0, s) = 1, h(a*, s) > 1 and h(a, s) ≤ 1 are impossible together due to convexity). This is illustrated in Figure 4, where we mark this region as a rectangular box where h > 1. (iv) By similar arguments, we can show that h(a, s) < 1 if (a, s) ∈ (0, a*) × [1, s*). Indeed, if h(a, s) ≥ 1 for some (a, s) ∈ (0, a*) × [1, s*), this would contradict the facts that h(0, s) = 1 and h(a*, s) < 1 proven in part (ii). This is illustrated in Figure 4, where inside the rectangular box on the left-hand side we have h < 1. These facts imply that the curve C is decreasing, i.e., s'(a) < 0, which proves (I).

Next, let us prove (II). By Lemma 14, h(b, s) = E ||(I − (η/b) Σ_{i=1}^b a_i a_iᵀ) e_1||^s, where we write h(b, s) to emphasize the dependence on b. Note that (1/b) Σ_{i=1}^b (I − (η/(b−1)) Σ_{j≠i} a_j a_jᵀ) = I − (η/b) Σ_{j=1}^b a_j a_jᵀ. When s ≥ 1, the function x ↦ x^s is convex, and by Jensen's inequality, we get for any integer b ≥ 2: h(b, s) = E ||(1/b) Σ_{i=1}^b (I − (η/(b−1)) Σ_{j≠i} a_j a_jᵀ) e_1||^s ≤ E[(1/b) Σ_{i=1}^b ||(I − (η/(b−1)) Σ_{j≠i} a_j a_jᵀ) e_1||^s] = (1/b) Σ_{i=1}^b E ||(I − (η/(b−1)) Σ_{j≠i} a_j a_jᵀ) e_1||^s = h(b−1, s), where we used the fact that the a_i are i.i.d. Indeed, from the condition for equality in Jensen's inequality and the fact that the a_i are i.i.d. random, the inequality above is a strict inequality. Hence, for any d ∈ N and s ≥ 1, h(b, s) is strictly decreasing in b. By following the same argument as in the proof of (I), we conclude that the tail-index α is strictly increasing in the batch-size b.

Finally, let us prove (III), i.e., that the tail-index α is strictly decreasing in the dimension d. Since the a_i are i.i.d. with a_i ∼ N(0, σ² I_d), by Lemma 25, h(s) = E[(1 − (2a/b) X + (a²/b²) X² + (a²/b²) XY)^{s/2}], (C.4) with a = ησ², where X, Y are independent chi-square random variables with b and d − 1 degrees of freedom, respectively.
Notice that h(s) is strictly increasing in d, since the only dependence of h(s) on d is through Y, which is a chi-square random variable with d − 1 degrees of freedom: writing Y = Z_1² + ... + Z_{d−1}² with Z_i ∼ N(0, 1) i.i.d., Y is increasing in d, and it follows that h(s) is strictly increasing in d. Hence, by a similar argument to (I), we conclude that α is strictly decreasing in the dimension d.

Remark 11. When d = 1 and the a_i are i.i.d. N(0, σ²), we can provide an alternative proof that the tail-index α is strictly increasing in the batch-size b. It suffices to show that, for any s ≥ 1, h(s) is strictly decreasing in the batch-size b. By Lemma 25, when d = 1, h(b, s) = E[(1 − (2ησ²/b) X + (η²σ⁴/b²) X² + (η²σ⁴/b²) XY)^{s/2}], (C.5) where h(b, s) is as in (C.3) and X, Y are independent chi-square random variables with b and d − 1 degrees of freedom, respectively. When d = 1, we have Y ≡ 0, and h(b, s) = E[(1 − (2ησ²/b) X + (η²σ⁴/b²) X²)^{s/2}] = E|1 − (ησ²/b) X|^s. (C.6) Since X is a chi-square random variable with b degrees of freedom, we have h(b, s) = E|1 − (ησ²/b) Σ_{i=1}^b Z_i²|^s, (C.7) where the Z_i are i.i.d. N(0, 1) random variables. When s ≥ 1, the function x ↦ |x|^s is convex, and by Jensen's inequality, we get for any integer b ≥ 2: h(b, s) = E|(1/b) Σ_{i=1}^b (1 − (ησ²/(b−1)) Σ_{j≠i} Z_j²)|^s ≤ E[(1/b) Σ_{i=1}^b |1 − (ησ²/(b−1)) Σ_{j≠i} Z_j²|^s] = (1/b) Σ_{i=1}^b E|1 − (ησ²/(b−1)) Σ_{j≠i} Z_j²|^s = h(b−1, s), where we used the fact that the Z_i are i.i.d. Indeed, from the condition for equality in Jensen's inequality and the fact that the Z_i are i.i.d. N(0, 1) distributed, the inequality above is a strict inequality. Hence, when d = 1, for any s ≥ 1, h(b, s) is strictly decreasing in b.
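The chi-square representation (C.4)-(C.5) is convenient numerically: a Monte Carlo sketch (our helper `h_mc`, with illustrative parameter values) makes the monotonicity in d visible, and for s = 2 the result can be checked against the closed form h(2) = 1 − 2a + a²(d + b + 1)/b with a = ησ²:

```python
import numpy as np

def h_mc(a, b, d, s, rng, N=400_000):
    """Monte Carlo estimate of h(s) from the chi-square representation:
    h(s) = E[(1 - 2a/b X + a^2/b^2 X^2 + a^2/b^2 X Y)^(s/2)],
    with X ~ chi2(b), Y ~ chi2(d-1) and a = eta * sigma^2."""
    X = rng.chisquare(b, size=N)
    Y = rng.chisquare(d - 1, size=N) if d > 1 else np.zeros(N)
    base = 1.0 - 2.0 * a / b * X + (a / b) ** 2 * X**2 + (a / b) ** 2 * X * Y
    # base = (1 - aX/b)^2 + (a/b)^2 X Y >= 0, so fractional powers are safe
    return np.mean(base ** (s / 2.0))

rng = np.random.default_rng(0)
h_d5 = h_mc(0.1, 5, 5, 2.0, rng)     # small dimension
h_d50 = h_mc(0.1, 5, 50, 2.0, rng)   # large dimension: h increases, alpha drops
```

Since h is increasing in d, the root of h(α) = 1 (the tail-index) moves left as the dimension grows, in line with part (III).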

C.3 PROOF OF PROPOSITION 3

Proof of Proposition 3. We next prove (i). When η = η crit = 2b σ 2 (d+b+1) , that is ησ 2 (d+b+1) = 2b, it follows from the proof of Proposition 23 that ρ ≤ 1 2 log E   1 - 2ησ 2 b b i=1 z 2 i1 + η 2 σ 4 b 2 b i=1 b j=1 (z i1 z j1 + • • • + z id z jd )z i1 z j1   = 0. (C.8) Note that since 1 -2ησ 2 b b i=1 z 2 i1 + η 2 σ 4 b 2 b i=1 b j=1 (z i1 z j1 + • • • + z id z jd )z i1 z j1 is random, the inequality above is a strict inequality from Jensen's inequality. Thus, when η = η crit , i.e. ησ 2 (d + b + 1) = 2b, ρ < 0. By continuity, there exists some δ > 0 such that for any 2b < ησ 2 (d + b + 1) < 2b + δ, i.e. η crit < η < η max , where η max := η crit + δ σ 2 (d+b+1) , we have ρ < 0. Moreover, when ησ 2 (d + b + 1) > 2b, i.e. η > η crit , we have h(2) = E   1 - 2ησ 2 b b i=1 z 2 i1 + η 2 σ 4 b 2 b i=1 b j=1 (z i1 z j1 + • • • + z id z jd )z i1 z j1   = 1 -2ησ 2 + η 2 σ 4 b (d + b + 1) ≥ 1, which implies that there exists some 0 < α < 2 such that h(α) = 1. Finally, let us prove (ii) and (iii). When ησ 2 (d + b + 1) ≤ 2b, i.e. η ≤ η crit , we have h(2) ≤ 1, which implies that α > 2. In particular, when ησ 2 (d + b + 1) = 2b, i.e. η = η crit , the tail-index α = 2.
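The closed form of h(2) used in the proof above can be checked directly (a small sketch; `h2` is our helper name): h(2) = 1 − 2ησ² + (η²σ⁴/b)(d + b + 1) equals 1 exactly at η_crit = 2b/(σ²(d + b + 1)), exceeds 1 for η > η_crit (so α < 2), and is below 1 for η < η_crit (so α > 2).

```python
def h2(eta, sigma2, b, d):
    """Closed form of h(2) from the proof of Proposition 3 (sigma2 = sigma^2)."""
    return 1.0 - 2.0 * eta * sigma2 + (eta**2 * sigma2**2 / b) * (d + b + 1)

# critical stepsize separating the alpha > 2 and alpha < 2 regimes
sigma2, b, d = 1.0, 5, 100
eta_crit = 2.0 * b / (sigma2 * (d + b + 1))
```

This is the quantitative version of the statement that increasing η (or d) past the critical threshold drives the tail-index below 2.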

C.4 PROOF OF THEOREM 4 AND COROLLARY 5

Proof of Theorem 4. We recall that x k = M k x k-1 + q k , (C.9) which implies that x k ≤ M k x k-1 + q k . (C.10) (i) If the tail-index α ≤ 1, then for any 0 < p < α, we have h(p) = E M k e 1 p < 1 and moreover by Lemma 20, x k p ≤ M k x k-1 p + q k p . (C.11) Due to spherical symmetry of the isotropic Gaussian distribution, the distribution of M k x x does not depend on the choice of x ∈ R d \{0}. Therefore, M k x k-1 x k-1 and x k-1 are independent, and M k x k-1 x k-1 has the same distribution as M k e 1 , where e 1 is the first basis vector. It follows that E x k p ≤ E M k e 1 p E x k-1 p + E q k p , (C.12) so that E x k p ≤ h(p)E x k-1 p + E q 1 p , (C.13 ) where h(p) ∈ (0, 1). By iterating over k, we get E x k p ≤ (h(p)) k E x 0 p + 1 -(h(p)) k 1 -h(p) E q 1 p . (C.14) (ii) If the tail-index α > 1, then for any 1 < p < α, by Lemma 20, for any > 0, we have x k p ≤ (1 + ) M k x k-1 p + (1 + ) p p-1 -(1 + ) (1 + ) 1 p-1 -1 p q k p , (C.15) which (similar as in (i)) implies that E x k p ≤ (1 + )E M k e 1 p E x k-1 p + (1 + ) p p-1 -(1 + ) (1 + ) 1 p-1 -1 p E q k p , (C.16) so that E x k p ≤ (1 + )h(p)E x k-1 p + (1 + ) p p-1 -(1 + ) (1 + ) 1 p-1 -1 p E q 1 p . (C.17) We choose > 0 so that (1 + )h(p) < 1. By iterating over k, we get E x k p ≤ ((1 + )h(p)) k E x 0 p + 1 -((1 + )h(p)) k 1 -(1 + )h(p) (1 + ) p p-1 -(1 + ) (1 + ) 1 p-1 -1 p E q 1 p . (C.18) The proof is complete. Remark 12. In general, there is no closed-form expression for E q 1 p in Theorem 4. We provide an upper bound as follows. When p > 1, by Jensen's inequality, we can compute that E q 1 p = η p E 1 b b i=1 a i y i p ≤ η p b b i=1 E a i y i p = η p E [|y 1 | p a 1 p ] , (C.19) and when p ≤ 1, by Lemma 20, we can compute that E q 1 p = η p b p E b i=1 a i y i p ≤ η p b p E b i=1 a i y i p ≤ η p b p b i=1 E a i y i p = η p E [|y 1 | p a 1 p ] . (C.20) Proof of Corollary 5. It follows from Theorem 4 by letting k → ∞ and applying Fatou's lemma. 
C.5 PROOF OF THEOREM 6, COROLLARY 7, PROPOSITION 8 AND COROLLARY 9 Proof of Theorem 6. For any ν 0 , ν0 ∈ P p (R d ), there exists a couple x 0 ∼ ν 0 and x0 ∼ ν0 independent of (M k , q k ) k∈N and W p p (ν 0 , ν0 ) = E x 0 -x0 p . We define x k and xk starting from x 0 and x0 respectively, via the iterates x k = M k x k-1 + q k , (C.21) xk = M k xk-1 + q k , (C.22) and let ν k and νk denote the probability laws of x k and xk respectively. For any p < α, since E M k α = 1 and E q k α < ∞, we have ν k , νk ∈ P p (R d ) for any k. Moreover, we have x k -xk = M k (x k-1 -xk-1 ), (C.23) Due to spherical symmetry of the isotropic Gaussian distribution, the distribution of M k x x does not depend on the choice of x ∈ R d \{0}. Therefore, M k (x k-1 -x k-1 ) x k-1 -x k-1 and x k-1 -xk-1 are independent, and M k (x k-1 -x k-1 ) x k-1 -x k-1 has the same distribution as M k e 1 , where e 1 is the first basis vector. It follows from (C.23) that E x k -xk p ≤ E [ M k (x k-1 -xk-1 ) p ] = E [ M k e 1 p ] E [ x k-1 -xk-1 p ] = h(p)E [ x k-1 -xk-1 p ] , which by iterating implies that W p p (ν k , νk ) ≤ E x k -xk p ≤ (h(p)) k E x 0 -x0 p = (h(p)) k W p p (ν 0 , ν0 ). (C.24) By letting ν0 = ν ∞ , the probability law of the stationary distribution x ∞ , we conclude that W p (ν k , ν ∞ ) ≤ (h(p)) 1/q k W p (ν 0 , ν ∞ ). (C.25) Finally, notice that 1 ≤ p < α, and therefore h(p) < 1. The proof is complete. Proof of Corollary 7. When ησ 2 < 2b d+b+1 , by Proposition 3, the tail-index α > 2, by taking p = 2, and using h (2) = 1 -2ησ 2 + η 2 σ 4 b (d + b + 1) < 1 (see Proposition 3), it follows from Theorem 6 that W 2 (ν k , ν ∞ ) ≤ 1 -2ησ 2 1 - ησ 2 2b (d + b + 1) k/2 W 2 (ν 0 , ν ∞ ). (C.26) Remark 13. Consider the case a i are i.i.d. N (0, σ 2 I d ). In Theorem 4, Corollary 5 and Theorem 6, the key quantity is h(p) ∈ (0, 1), where p < α. 
We recall that h(p) = E 1 - 2a b X + a 2 b 2 X 2 + a 2 b 2 XY p/2 , (C.27) where a = ησ 2 , X, Y are independent chi-square random variables with degree of freedom b and d -1 respectively. The first-order approximation of h(p) is given by h(p) ∼ 1 + p 2 E - 2a b X + a 2 b 2 X 2 + a 2 b 2 XY = 1 + p 2 -2a + a 2 b (b + 2) + a 2 b (d -1) < 1, (C.28) provided that a = ησ 2 < 2b d+b+1 which occurs if and only if α > 2. In other words, when ησ 2 < 2b d+b+1 , α > 2 and h(p) ∼ 1 -pησ 2 1 - ησ 2 (b + d + 1) 2b < 1. (C.29) On the other hand, when ησ 2 ≥ 2b d+b+1 , p < α ≤ 2, and the second-order approximation of h(p) is given by h(p) ∼ 1 + p 2 E - 2a b X + a 2 b 2 X 2 + a 2 b 2 XY + p 2 ( p 2 -1) 2 E - 2a b X + a 2 b 2 X 2 + a 2 b 2 XY 2 = 1 + qa a(b + d + 1) 2b -1 - 2 -p 8 E - 2a b X + a 2 b 2 X 2 + a 2 b 2 XY 2 , and we computed before in (E.55) that for small a = ησ 2 and large d, E - 2a b X + a 2 b 2 X 2 + a 2 b 2 XY 2 ∼ 4a 2 b (b + 2) + a 4 b 3 (b + 2)d 2 - 4a 3 b 2 (b + 2)d, (C.30) and therefore with a = ησ 2 , h(p) ∼ 1 -pa -a(b + d + 1) 2b + 1 + (2 -p)a(b + 2) 2qb 1 + a 2 4b 2 d 2 - a b d < 1, (C.31) provided that 1 ≤ a(b+d+1) 2b < 1 + (2-p)a(b+2) 2qb 1 + a 2 4b 2 d 2 -a b d . Proof of Proposition 8. First, we notice that it follows from Theorem 1 that E x ∞ α = ∞. To see this, notice that lim t→∞ t α P(e T 1 x ∞ > t) = e α (e 1 ), where e 1 is the first basis vector in R d , and P( x ∞ ≥ t) ≥ P(e T 1 x ∞ ≥ t), and thus E x ∞ α = ∞ 0 tP( x ∞ α ≥ t)dt = ∞ 0 tP( x ∞ ≥ t 1/α )dt = ∞. (C.32) By following the proof of Theorem 4 by letting q = α in the proof, one can show the following. (i) If the tail-index α ≤ 1, then we have E x ∞ α ≤ E x 0 α + kE q 1 α , (C.33 ) which grows linearly in k. (ii) If the tail-index α > 1, then for any > 0, we have E x k α ≤ (1 + ) k E x 0 α + (1 + ) k -1 (1 + ) α α-1 -(1 + ) (1 + ) 1 α-1 -1 α E q 1 α = O(k), (C.34) which grows exponentially in k for any fixed > 0. 
By letting ε → 0, we have E||x_k||^α = (1 + ε)^k E||x_0||^α + (1 + O(ε))((1 + ε)^k − 1)((α − 1)/ε)^{α−1} E||q_1||^α. Therefore, it holds for any sufficiently small ε > 0 that E||x_k||^α ≤ (1 + ε)^k ε^{−α} (E||x_0||^α + (α − 1)^{α−1} E||q_1||^α). We can optimize (1 + ε)^k ε^{−α} over the choice of ε > 0: by choosing ε = α/(k − α), which goes to zero as k goes to ∞, we have (1 + ε)^k ε^{−α} = (1 + α/(k − α))^k ((k − α)/α)^α = O(k^α).

Proof of Corollary 9. The result is obtained by a direct application of (Mirek, 2011, Theorem 1.15) to the recursion (3.3), where it can be checked in a straightforward manner that the conditions of that theorem hold.

D SUPPORTING LEMMAS

In this section, we present a few supporting lemmas that are used in the proofs of the main results of the paper as well as the additional results in the Appendix. First, we recall that the iterates are given by x k = M k x k-1 + q k , where (M k , q k ) are i.i.d. and M k is distributed as I -η b H, where H = b i=1 a i a T i and q k is distributed as η b b i=1 a i y i , where a i and y i are i.i.d. satisfying the Assumptions (A1)-(A2). We can compute ρ and h(s) as follows where ρ and h(s) are defined by (3.5) and (3.4). Lemma 14. ρ can be characterized as: ρ = E log I - η b H e 1 , (D.1) and h(s) can be characterized as: h(s) = E I - η b H e 1 s , (D.2) provided that ρ < 0. Proof. It is known that the Lyapunov exponent defined in (3.5) admits the alternative representation ρ := lim k→∞ 1 k log xk , (D.3) where xk := Π k x0 with Π k := M k M k-1 . . . M 1 and x0 := x 0 (see (Newman, 1986, eqn. (2))). We will compute the limit on the right-hand side of (D.3). First, we observe that due to spherical symmetry of the isotropic Gaussian distribution, the distribution of M k x x does not depend on the choice of x ∈ R d \{0} and is i.i.d. over k with the expectation E( M e 1 ) = E( I -η b H e 1 ) where we chose x = e 1 . Therefore, 1 k log xk - 1 k log x0 = 1 k k i=1 log xi xi-1 = 1 k k i=1 log M i xi-1 xi-1 is an average of i.i.d. random variables and by the law of large numbers we obtain ρ = lim k→∞ 1 k log xk = E I - η b H e 1 . From (D.3), we conclude that this proves (D.1). It remains to prove (D.2). We consider the function h(s) = lim k→∞ E xk s x0 s 1/k , where the initial point x0 = x 0 is deterministic. In the rest of the proof, we will show that for ρ < 0, h(s) = h(s) where h(s) is given by (3.4) and h(s) is equal to the right-hand side of (D.2); our proof is inspired by the approach of Newman ( 1986). We will first compute h(s) and show that it is equal to the right-hand side of (D.2). Note that we can write x k s x 0 s = k i=1 M i x i-1 s x i-1 s . 
This is a product of i.i.d. random variables with the same distribution as that of M e 1 s due to the spherical symmetry of the input a i . Therefore, we can write h(s) = lim k→∞ E x k s x 0 s 1/k = lim k→∞ E k i=1 M i e 1 s 1/k = E [ M e 1 s ] = E I - η b H e 1 s , (D.4) where we used the fact that M i e 1 are i.i.d. over i. It remains to show that h(s) = h(s) for ρ < 0. Note that xk s x0 s ≤ Π k s , and therefore from the definition of h(s) and h(s), we have immediately h(s) ≥ h(s) (D.5) for any s > 0. We will show that h(s) ≤ h(s) when ρ < 0. We assume ρ < 0. Then, Theorem 1 is applicable and there exists a stationary distribution x ∞ with a tail index α such that h(α) = 1. We will show that h(α) = 1. First, the tail density admits the characterization (3.7), and therefore x ∞ ∈ L s for s < α, i.e. the s-th moment of x ∞ is finite. Similarly due to (3.7), x ∞ / ∈ L s for s > α. Since h(α) = 1, it follows from (D.5) that we have h(α) ≤ 1. However if h(α) < 1, then by the continuity of the h function there exists ε such that h(s) < 1 for every s ∈ (α -ε, α + ε) ⊂ (0, 1). From the definition of h(s) then this would imply that E( x k s ) → 0 for every s ∈ (α -ε, α + ε). On the other hand, by following a similar argument to the proof technique of Corollary 5, it can be shown that the s-th moment of x ∞ has to be bounded,foot_6 which would be a contradiction with the fact that x ∞ ∈ L s for s > α. Therefore, h(α) ≥ 1. Since h(α) = 1, (D.5) leads to h(α) = h(α) = 1. (D.6) We observe that the function h is homogeneous in the sense that if the iterations matrices M i are replaced by cM i where c > 0 is a real scalar, h(s) will be replaced by h c (s) := c s h(s). In other words, the function h c (s) := lim k→∞ (E (cM k )(cM k-1 ) . . . (cM 1 ) s ) 1/k (D.7) clearly satisfies h c (s) = c s h(s) by definition. A similar homogeneity property holds for h(s): If the iterations matrices M i are replaced by cM i , then h(s) will be replaced by hc (s) := c sh (s). 
We will show that this homogeneity property combined with the fact that h(α) = h(α) = 1 will force h(s) = h(s) for any s > 0. For this purpose, given s > 0, we choose c = 1/ s h(s). Then, by considering input matrix cM i instead of M i and by following a similar argument which led to the identity (D.6), we can show that h c (s) = c s h(s) = 1. Therefore, hc (s) = hc (s) = 1. This implies directly h(s) = h(s). Next, we show the following property for the function h. Lemma 15. We have h(0) = 1, h (0) = ρ and h(s) is strictly convex in s. Proof. By the expression of h(s) from Lemma 14, it is easy to check that h(0) = 1. Moreover, we can compute that which implies that h(s) is strictly convex in s. h (s) = E log I - η b H e 1 I - η b H e 1 s , In the next result, we show that lim inf s→∞ h(s) > 1. This property, together with Lemma 15 implies that if ρ < 0, then there exists some α ∈ (0, ∞) such that h(α) = 1. Indeed, in the proof of Lemma 16, we will show that lim inf s→∞ h(s) = ∞. Lemma 16. We have lim inf s→∞ h(s) > 1. Proof. We recall from Lemma 14 that D.11) where e 1 is the first basis vector in R d and H = b i=1 a i a T i , and a i = (a i1 , . . . , a id ) are i.i.d. distributed as N (0, σ 2 I d ). We can compute that h(s) = E I - η b H e 1 s , E I - η b H e 1 s = E I - η b H e 1 2 s/2 = E   e T 1 I - η b b i=1 a i a T i I - η b b i=1 a i a T i e 1 s/2   = E   1 - 2η b e T 1 b i=1 a i a T i e 1 + η 2 b 2 e T 1 b i=1 a i a T i b i=1 a i a T i e 1 s/2   = E      1 - 2η b b i=1 a 2 i1 + η 2 b 2 b i=1 b j=1 (a i1 a j1 + • • • + a id a jd )a i1 a j1   s/2    = E      1 - η b b i=1 a 2 i1 2 + η 2 b 2 b i=1 b j=1 (a i2 a j2 + • • • + a id a jd )a i1 a j1   s/2    ≥ E 2 s/2 1 η 2 b 2 b i=1 b j=1 (ai2aj2+•••+a id a jd )ai1aj1≥2 = 2 s/2 P   η 2 b 2 b i=1 b j=1 (a i2 a j2 + • • • + a id a jd )a i1 a j1 ≥ 2   → ∞, as s → ∞. 
In the next result, we show that the inverse of M exists with probability 1, and we provide an upper bound that will be used to prove Lemma 18.

Lemma 17. M^{−1} exists with probability 1. Moreover, we have E( log⁺‖M^{−1}‖ )² ≤ 8.

Proof. Note that M is a continuous random matrix, by the assumption on the distribution of the a_i. Therefore,

P( M^{−1} does not exist ) = P( det M = 0 ) = 0. (D.12)

Note that the singular values of M^{−1} are of the form |1 − (η/b)σ_H|^{−1}, where σ_H is a singular value of H, and we have

( log⁺‖M^{−1}‖ )² = 0 if (η/b)H ⪰ 2I, and ( log‖(I − (η/b)H)^{−1}‖ )² if 0 ⪯ (η/b)H ≺ 2I. (D.13)

We consider the two cases 0 ⪯ (η/b)H ≺ I and I ⪯ (η/b)H ≺ 2I, and compute the conditional expectations for each case:

E[ ( log⁺‖M^{−1}‖ )² | 0 ⪯ (η/b)H ≺ I ] = E[ ( log‖(I − (η/b)H)^{−1}‖ )² | 0 ⪯ (η/b)H ≺ I ] (D.14)
≤ E[ ‖(2η/b)H‖² | 0 ⪯ (η/b)H ≺ I ] (D.15)
≤ 4, (D.16)

where in the first inequality we used the fact that

log‖(I − X)^{−1}‖ ≤ ‖2X‖ (D.17)

for a symmetric positive semi-definite matrix X satisfying 0 ⪯ X ≺ I (the proof of this fact is analogous to the proof of the scalar inequality log(1/(1 − x)) ≤ 2x for 0 ≤ x < 1). By a similar computation,

E[ ( log⁺‖M^{−1}‖ )² | I ⪯ (η/b)H ≺ 2I ] = E[ ( log‖(I − (η/b)H)^{−1}‖ )² | I ⪯ (η/b)H ≺ 2I ]
= E[ ( log‖ ((η/b)H)^{−1} ( I − ((η/b)H)^{−1} )^{−1} ‖ )² | I ⪯ (η/b)H ≺ 2I ]
≤ E[ ( log( ‖((η/b)H)^{−1}‖ · ‖( I − ((η/b)H)^{−1} )^{−1}‖ ) )² | I ⪯ (η/b)H ≺ 2I ]
≤ E[ ( log‖( I − ((η/b)H)^{−1} )^{−1}‖ )² | I ⪯ (η/b)H ≺ 2I ]
= E[ ( log‖( I − ((η/b)H)^{−1} )^{−1}‖ )² | ½I ≺ ((η/b)H)^{−1} ⪯ I ],

where in the last inequality we used the fact that ‖((η/b)H)^{−1}‖ ≤ 1 for I ⪯ (η/b)H ≺ 2I. Applying the inequality (D.17) with the choice X = ((η/b)H)^{−1}, we obtain

E[ ( log‖( I − ((η/b)H)^{−1} )^{−1}‖ )² | ½I ≺ ((η/b)H)^{−1} ⪯ I ] ≤ E[ ‖2((η/b)H)^{−1}‖² | ½I ≺ ((η/b)H)^{−1} ⪯ I ] ≤ 4. (D.18)

Combining (D.16) and (D.18), it follows from (D.13) that E( log⁺‖M^{−1}‖ )² ≤ 8.

In the next result, we show that a certain expected value, involving the moments and logarithm of M and the logarithm of M^{−1}, is finite; it is used in the proof of Theorem 1.

Lemma 18. E[ ‖M‖^α ( log⁺‖M‖ + log⁺‖M^{−1}‖ ) ] < ∞.

Proof. Note that M = I − (η/b)H, where H = Σ_{i=1}^b a_i a_i^T in distribution.
Therefore, for any s > 0,

E‖M‖^s = E‖ I − (η/b) Σ_{i=1}^b a_i a_i^T ‖^s ≤ E( 1 + (η/b) Σ_{i=1}^b ‖a_i‖² )^s < ∞, (D.19)

since all the moments of the a_i are finite by Assumption (A1). This implies that E[ ‖M‖^α log⁺‖M‖ ] < ∞. By the Cauchy–Schwarz inequality,

E[ ‖M‖^α log⁺‖M^{−1}‖ ] ≤ ( E‖M‖^{2α} · E( log⁺‖M^{−1}‖ )² )^{1/2} < ∞,

where we used Lemma 17. In the next result, we show a convexity property, which is used in the proof of Theorem 2 to show that the tail-index α is strictly decreasing in the stepsize η and the variance σ².

Lemma 19. For any fixed positive semi-definite symmetric matrix H, the function G_H(a) := ‖(I − aH) e_1‖^s is convex in a for s ≥ 1.

Proof. We consider the case s ≥ 1 and the function G_H : [0, ∞) → R defined as G_H(a) := ‖(I − aH) e_1‖, and show that it is convex for H ⪰ 0 and strongly convex for H ≻ 0 over the interval [0, ∞). Let a_1, a_2 ∈ [0, ∞) be distinct points, i.e. a_1 ≠ a_2. It follows from the subadditivity of the norm that

G_H( (a_1 + a_2)/2 ) = ‖ ( I − ((a_1 + a_2)/2) H ) e_1 ‖ ≤ ‖ ( I/2 − (a_1/2) H ) e_1 ‖ + ‖ ( I/2 − (a_2/2) H ) e_1 ‖ = ½ G_H(a_1) + ½ G_H(a_2),

which implies that G_H(a) is a convex function. On the other hand, the function g(x) = x^s is convex and non-decreasing on the positive real axis for s ≥ 1; therefore the composition g(G_H(a)) is also convex for any fixed H. Since the expectation of random convex functions is also convex, we conclude that h(a, s) = E‖(I − aH) e_1‖^s is also convex in a.

The next result is used in the proof of Theorem 4 to bound the moments of the iterates.

Lemma 20. (i) Given 0 < p ≤ 1, for any x, y ≥ 0,

(x + y)^p ≤ x^p + y^p. (D.21)

(ii) Given p > 1, for any x, y ≥ 0 and any ε > 0,

(x + y)^p ≤ (1 + ε) x^p + [ ( (1 + ε)^{p/(p−1)} − (1 + ε) ) / ( (1 + ε)^{1/(p−1)} − 1 )^p ] y^p. (D.22)

Proof. (i) If y = 0, then (x + y)^p ≤ x^p + y^p trivially holds. If y > 0, it is equivalent to show that

(x/y + 1)^p ≤ (x/y)^p + 1, (D.23)

which in turn is equivalent to showing that

(x + 1)^p ≤ x^p + 1, for any x ≥ 0. (D.24)

Let F(x) := (x + 1)^p − x^p − 1. Then F(0) = 0 and F′(x) = p(x + 1)^{p−1} − p x^{p−1} ≤ 0 since p ≤ 1, which shows that F(x) ≤ 0 for every x ≥ 0.
(ii) If y = 0, then the inequality trivially holds. If y > 0, by the rescaling x → x/y, y → 1, it is equivalent to show that for any x ≥ 0,

(1 + x)^p ≤ (1 + ε) x^p + ( (1 + ε)^{p/(p−1)} − (1 + ε) ) / ( (1 + ε)^{1/(p−1)} − 1 )^p. (D.25)

To show this, we define

F(x) := (1 + x)^p − (1 + ε) x^p, x ≥ 0. (D.26)

Then F′(x) = p(1 + x)^{p−1} − p(1 + ε) x^{p−1}, so that F′(x) ≥ 0 if x ≤ ( (1 + ε)^{1/(p−1)} − 1 )^{−1}, and F′(x) ≤ 0 if x ≥ ( (1 + ε)^{1/(p−1)} − 1 )^{−1}. Thus,

max_{x≥0} F(x) = F( 1 / ( (1 + ε)^{1/(p−1)} − 1 ) ) = ( (1 + ε)^{p/(p−1)} − (1 + ε) ) / ( (1 + ε)^{1/(p−1)} − 1 )^p. (D.27)

The proof is complete.
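The two elementary inequalities of Lemma 20 can be spot-checked numerically. The sketch below draws random (x, y, p, ε) from arbitrary ranges (an illustrative choice, not from the paper) and verifies both bounds, including the explicit constant from part (ii).

```python
import numpy as np

# Randomized spot-check of the two elementary inequalities in Lemma 20.
def check_lemma20(trials=5000, seed=1):
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.uniform(0.0, 10.0, size=2)
        # Part (i): (x + y)^p <= x^p + y^p for 0 < p <= 1.
        p = rng.uniform(0.01, 1.0)
        assert (x + y) ** p <= x ** p + y ** p + 1e-10
        # Part (ii): (x + y)^p <= (1 + eps) x^p + C(eps, p) y^p for p > 1, with
        # C(eps, p) = ((1+eps)^(p/(p-1)) - (1+eps)) / ((1+eps)^(1/(p-1)) - 1)^p.
        p = rng.uniform(1.01, 4.0)
        eps = rng.uniform(0.01, 2.0)
        C = (((1 + eps) ** (p / (p - 1)) - (1 + eps))
             / ((1 + eps) ** (1 / (p - 1)) - 1) ** p)
        assert (x + y) ** p <= ((1 + eps) * x ** p + C * y ** p) * (1 + 1e-10) + 1e-10
    return True

print(check_lemma20())  # True
```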

E ADDITIONAL TECHNICAL RESULTS

We recall that the iterates are given by x k = M k x k-1 + q k , where (M k , q k ) are i.i.d. copies of (M, q) where M = I -η b H with H = b i=1 a i a T i in distribution and q = η b b i=1 a i y i , where a i and y i are i.i.d. satisfying the Assumptions (A1)-(A2). We first obtain more explicit expressions for ρ and h(s) under the Assumption (A1). Proposition 21. We have ρ = 1 2 E log   1 - 2ησ 2 b b i=1 z 2 i1 + η 2 σ 4 b 2 b i=1 b j=1 (z i1 z j1 + • • • + z id z jd )z i1 z j1   , (E.1) and for any s ≥ 0, h(s) = E      1 - 2ησ 2 b b i=1 z 2 i1 + η 2 σ 4 b 2 b i=1 b j=1 (z i1 z j1 + • • • + z id z jd )z i1 z j1   s/2    , (E.2) where z i := (z i1 , z i2 , . . . , z id ) ∼ N (0, I d ), 1 ≤ i ≤ b are i.i.d. Proof. By the expression of ρ from Lemma 14, we can compute that ρ = E log I - η b H e 1 = 1 2 E log I - η b H e 1 2 = 1 2 E log e T 1 I - η b b i=1 a i a T i I - η b b i=1 a i a T i e 1 = 1 2 E log 1 - 2η b e T 1 b i=1 a i a T i e 1 + η 2 b 2 e T 1 b i=1 a i a T i b i=1 a i a T i e 1 = 1 2 E log   1 - 2ησ 2 b b i=1 z 2 i1 + η 2 σ 4 b 2 b i=1 b j=1 (z i1 z j1 + • • • + z id z jd )z i1 z j1   , where z i = (z i1 , z i2 , . . . , z id ) ∼ N (0, I d ), 1 ≤ i ≤ b are i.i.d. Similarly, by the expression of h(s) from Lemma 14, we have h(s) = E I - η b H e 1 s = E      1 - 2ησ 2 b b i=1 z 2 i1 + η 2 σ 4 b 2 b i=1 b j=1 (z i1 z j1 + • • • + z id z jd )z i1 z j1   s/2    . Remark 22. It follows from Proposition 21 that ρ, h(s) and hence the tail-index α depends on η and σ 2 only via its product ησ 2 . We have seen in Theorem 1 that the iterates converge to a heavy-tailed distribution with tail-index α ∈ (0, ∞) provided that ρ < 0. When the data a i are i.i.d. in general, it is not easy to check whether ρ < 0 holds. For the Gaussian data (Assumption (A1)), it is possible to characterise the region of the parameters η, b, d, σ 2 in which ρ < 0. 
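Proposition 21's standardized representation lends itself to a direct Monte Carlo check. The sketch below (illustrative parameter values, not the paper's) estimates ρ = E log‖(I − (ησ²/b) Σ_i z_i z_i^T) e_1‖ and verifies two claims numerically: Remark 22's observation that ρ depends on (η, σ) only through the product ησ², and the fact that ρ < 0 for parameters satisfying the sufficient condition ησ²(d + b + 1) < 2b stated in Proposition 23 below.

```python
import numpy as np

def rho_mc(eta, sigma, b, d, n=20000, seed=0):
    """Monte Carlo estimate of rho = E log ||(I - (eta*sigma^2/b) sum_i z_i z_i^T) e_1||,
    with z_i ~ N(0, I_d) i.i.d. (the standardized form from Proposition 21)."""
    rng = np.random.default_rng(seed)
    c = eta * sigma ** 2 / b
    e1 = np.zeros(d)
    e1[0] = 1.0
    logs = np.empty(n)
    for k in range(n):
        Z = rng.standard_normal((b, d))
        v = e1 - c * (Z.T @ (Z @ e1))  # (I - c * sum_i z_i z_i^T) e_1
        logs[k] = np.log(np.linalg.norm(v))
    return logs.mean()

b, d = 5, 10
# Remark 22: rho depends on (eta, sigma) only through eta * sigma^2.
r1 = rho_mc(eta=0.1, sigma=1.0, b=b, d=d)
r2 = rho_mc(eta=0.025, sigma=2.0, b=b, d=d)  # same product eta * sigma^2 = 0.1
print(abs(r1 - r2) < 1e-12)
# Here eta*sigma^2*(d+b+1) = 1.6 < 2b = 10, so rho < 0 is expected.
print(r1 < 0)
```

With a fixed seed, the two estimates coincide because the computation only ever uses the product ησ², which mirrors the analytical statement.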
We first state the following result, which provides a sufficient (but not necessary) condition for ρ < 0. Proposition 23. ρ < 0 provided that ησ 2 (d + b + 1) < 2b. (E.3) Proof. We recall from Proposition 21 that ρ = 1 2 E log   1 - 2ησ 2 b b i=1 z 2 i1 + η 2 σ 4 b 2 b i=1 b j=1 (z i1 z j1 + • • • + z id z jd )z i1 z j1   , (E.4) where z i = (z i1 , . . . , z id ) ∼ N (0, I d ), 1 ≤ i ≤ b. Note that the function x → log x is concave, and by Jensen's inequality, we have ρ ≤ 1 2 log E   1 - 2ησ 2 b b i=1 z 2 i1 + η 2 σ 4 b 2 b i=1 b j=1 (z i1 z j1 + • • • + z id z jd )z i1 z j1   . (E.5) We can compute that E b i=1 z 2 i1 = bE[z 2 11 ] = b, (E.6) and E   b i=1 b j=1 (z i1 z j1 + • • • + z id z jd )z i1 z j1   = E b i=1 (z 2 i1 + • • • + z 2 id )z 2 i1 + E   1≤i =j≤b (z i1 z j1 + • • • + z id z jd )z i1 z j1   = bE (z 2 11 + • • • + z 2 1d )z 2 11 + b(b -1)E [(z 11 z 21 + • • • + z 1d z 2d )z 11 z 21 ] = bE[z 4 11 ] + b(d -1) E[z 2 11 ] 2 + b(b -1)E[z 2 11 z 2 21 ] = 3b + b(d -1) + b(b -1) = b(d + b + 1), where we used the property that z i = (z ij , 1 ≤ j ≤ d) are i.i.d. and z ij are i.i.d. N (0, 1) and E[z 2 11 ] = 1, E[z 4 11 ] = 3. (E.7) Hence, we conclude that ρ < 0 provided that 1 -2ησ 2 + η 2 σ 4 b (d + b + 1) < 1, (E.8) which is equivalent to ησ 2 (d + b + 1) < 2b. (E.9) The proof is complete. Remark 24. It is worth pointing out that ρ < 0 does not hold for arbitrary model parameters. In particular, we can compute that ρ = 1 2 E log   1 - ησ 2 b b i=1 z 2 i1 2 + η 2 σ 4 b 2 b i=1 b j=1 z i1 z j1 d k=2 z ik z jk   = 1 2 E log   1 - ησ 2 b b i=1 z 2 i1 2 + η 2 σ 4 b 2 d k=2 b i=1 z i1 z ik 2   ≥ 1 2 E log   η 2 σ 4 b 2 d k=2 b i=1 z i1 z ik 2   = log ησ 2 b + 1 2 E log   d k=2 b i=1 z i1 z ik 2   . Note that conditional on z i1 , 1 ≤ i ≤ b, b i=1 z i1 z ik ∼ N 0, b i=1 z 2 i1 , (E.10) are i.i.d. for k = 2, . . . , d. 
Therefore, 1 2 E log   d k=2 b i=1 z i1 z ik 2   = 1 2 E log [XY ] = 1 2 E log X + 1 2 log E log Y, (E.11) where X is a chi-square random variable with degree of freedom b and Y is a chi-square random variable with degree of freedom (d -1). Therefore, we can compute that 1 2 E log X = 1 2 ∞ 0 log(x) x b 2 -1 e -x 2 2 b 2 Γ( b 2 ) dx ≥ 1 2 b 2 +1 Γ( b 2 ) 1 0 log(x)x b 2 -1 e -x 2 dx ≥ 1 2 b 2 +1 Γ( b 2 ) 1 0 log(x)x b 2 -1 dx = -1 2 b 2 -1 b 2 Γ( b 2 ) . Similarly, we can show that 1 2 E log Y ≥ -1 2 d-1 2 -1 (d-1) 2 Γ( d-1 2 ) . Hence, we conclude that ρ ≥ 0 provided that ησ 2 ≥ b exp 1 2 b 2 -1 b 2 Γ( b 2 ) + 1 2 d-1 2 -1 (d -1) 2 Γ( d-1 2 ) . (E.12) Next, we provide alternative formulas for h(s) and ρ for the Gaussian data (Assumption (A1)) which is used for some technical proofs. Lemma 25. For any s > 0, h(s) = E   1 - ησ 2 b X 2 + η 2 σ 4 b 2 XY s/2   , and ρ = 1 2 E log 1 - ησ 2 b X 2 + η 2 σ 4 b 2 XY , where X, Y are independent and X is chi-square random variable with degree of freedom b and Y is a chi-square random variable with degree of freedom (d -1). Proof. We can compute that h(s) = E      1 - 2ησ 2 b b i=1 z 2 i1 + η 2 σ 4 b 2 b i=1 b j=1 (z i1 z j1 + • • • + z id z jd )z i1 z j1   s/2    = E      1 - 2ησ 2 b b i=1 z 2 i1 + η 2 σ 4 b 2 b i=1 b j=1 z 2 i1 z 2 j1 + z i1 z j1 d k=2 z ik z jk   s/2    = E      1 - 2ησ 2 b b i=1 z 2 i1 + η 2 σ 4 b 2 b i=1 z 2 i1 2 + η 2 σ 4 b 2 d k=2 b i=1 z i1 z ik 2   s/2    = E      1 - ησ 2 b b i=1 z 2 i1 2 + η 2 σ 4 b 2 d k=2 b i=1 z i1 z ik 2   s/2    . b i=1 z i1 z ik ∼ N 0, b i=1 z 2 i1 , (E.13) are i.i.d. for k = 2, . . . , d. Therefore, we have h(s) = E      1 - ησ 2 b b i=1 z 2 i1 2 + η 2 σ 4 b 2 d k=2 b i=1 z i1 z ik 2   s/2    = E      1 - ησ 2 b b i=1 z 2 i1 2 + η 2 σ 4 b 2 b i=1 z 2 i1 d k=2 x 2 k   s/2    , where x k are i.i.d. N (0, 1) independent of z i1 , i = 1, . . . , b. 
Hence, we have h(s) = E      1 - ησ 2 b b i=1 z 2 i1 2 + η 2 σ 4 b 2 b i=1 z 2 i1 d k=2 x 2 k   s/2    = E   1 - ησ 2 b X 2 + η 2 σ 4 b 2 XY s/2   , where X, Y are independent and X is chi-square random variable with degree of freedom b and Y is a chi-square random variable with degree of freedom (d -1). Similarly, we can compute that ρ = 1 2 E   log   1 - ησ 2 b b i=1 z 2 i1 2 + η 2 σ b 2 b i=1 b j=1 z i1 z j1 d k=2 z ik z jk     = 1 2 E   log   1 - ησ 2 b b i=1 z 2 i1 2 + η 2 σ 4 b 2 d k=2 b i=1 z i1 z ik 2     = 1 2 E log 1 - ησ 2 b X 2 + η 2 σ 4 b 2 XY , where X, Y are independent and X is chi-square random variable with degree of freedom b and Y is a chi-square random variable with degree of freedom (d -1). The proof is complete. In Theorem 1, we showed the existence of the tail-index α. For the Gaussian data (Assumption (A1)), we can provide some explicit bound on the tail-index α provided some explicit technical conditions hold. Next, we provide a technical condition under which the tail-index α ∈ (0, 4]. (See also Proposition 3 for a technical condition under which the tail-index α ∈ (0, 2].) Proposition 26. There exists some 0 < α ≤ 4 such that h(α) = 1 provided that 2b > ησ 2 (d + b + 1) ≥ 2b -2ησ 2 (b + 2) + 2η 2 σ 4 b (b + 2)(b + d + 3) - 1 2 η 3 σ 6 b 2 (b + 2)(b + 4)(b + 6) - 1 2 η 3 σ 6 b 2 (b + 2)(d 2 -1) - η 3 σ 6 b 2 (b + 2)(b + 4)(d -1). Proof. It follows from Lemma 25 that h(4) = E   1 - ησ 2 b X 2 + η 2 σ 4 b 2 XY 2   , where X, Y are independent and X is chi-square random variable with degree of freedom b and Y is a chi-square random variable with degree of freedom (d -1). We can further compute that h(4) = E 1 - ησ 2 b X 4 + E η 4 σ 8 b 4 X 2 Y 2 + 2E 1 - ησ 2 b X 2 η 2 σ 4 b 2 XY . (E.14) First, we can compute that E 1 - ησ 2 b X 4 = 1 - 4ησ 2 b E[X] + 6η 2 σ 4 b 2 E[X 2 ] - 4η 3 σ 6 b 3 E[X 3 ] + η 4 σ 8 b 4 E[X 4 ]. 
(E.15) We recall the formula for the m-th moment of a chi-square distribution with degree of freedom k given by 2 m Γ(m + k 2 )/Γ( k 2 ), which is 2 m k 2 k 2 + 1 • • • k 2 + (m -1) when m is a positive integer. Since X is chi-square distributed with degree of freedom b, we have E[X] = 2 • b 2 = b, E[X 2 ] = 2 2 • b 2 b 2 + 1 = b(b + 2), E[X 3 ] = 2 3 • b 2 b 2 + 1 b 2 + 2 = b(b + 2)(b + 4), E[X 4 ] = 2 4 • b 2 b 2 + 1 b 2 + 2 b 2 + 3 = b(b + 2)(b + 4)(b + 6), which implies that E 1 - ησ 2 b X 4 = 1-4ησ 2 + 6η 2 σ 4 b (b+2)- 4η 3 σ 6 b 2 (b+2)(b+4)+ η 4 σ 8 b 3 (b+2)(b+4)(b+6). (E.16) Second, we can compute that E η 4 σ 8 b 4 X 2 Y 2 = η 4 σ 8 b 4 E X 2 E[Y 2 ] = η 4 σ 8 b 4 b(b + 2)(d -1)(d + 1) = η 4 σ 8 b 3 (b + 2)(d 2 -1) , (E.17) where we used the fact that Y is independent of X and Y is chi-square distributed with degree of freedom d -1 so that E[Y 2 ] = (d -1)(d + 1). Third, we can compute that 2E 1 - ησ 2 b X 2 η 2 σ 4 b 2 XY = 2E η 2 σ 4 b 2 XY + 2E η 4 σ 8 b 4 X 3 Y -4E η 3 σ 6 b 3 X 2 Y = 2 η 2 σ 4 b (d -1) + 2 η 4 σ 8 b 3 (b + 2)(b + 4)(d -1) -4 η 3 σ 6 b 2 (b + 2)(d -1). 
Putting everything together, we have h(4) = 1 -4ησ 2 + 6η 2 σ 4 b (b + 2) - 4η 3 σ 6 b 2 (b + 2)(b + 4) + η 4 σ 8 b 3 (b + 2)(b + 4)(b + 6) + η 4 σ 8 b 3 (b + 2)(d 2 -1) + 2 η 2 σ 4 b (d -1) + 2 η 4 σ 8 b 3 (b + 2)(b + 4)(d -1) -4 η 3 σ 6 b 2 (b + 2)(d -1), and h(4) ≥ 1 if and only if 1 -4ησ 2 + 6η 2 σ 4 b (b + 2) - 4η 3 σ 6 b 2 (b + 2)(b + 4) + η 4 σ 8 b 3 (b + 2)(b + 4)(b + 6) + η 4 σ 8 b 3 (b + 2)(d 2 -1) + 2 η 2 σ 4 b (d -1) + 2 η 4 σ 8 b 3 (b + 2)(b + 4)(d -1) -4 η 3 σ 6 b 2 (b + 2)(d -1) ≥ 1, which is equivalent to -2b + ησ 2 (3b + 6) + ησ 2 (d -1) - 2η 2 σ 4 b (b + 2)(b + 4) + 1 2 η 3 σ 6 b 2 (b + 2)(b + 4)(b + 6) + 1 2 η 3 σ 6 b 2 (b + 2)(d 2 -1) + η 3 σ 6 b 2 (b + 2)(b + 4)(d -1) -2 η 2 σ 4 b (b + 2)(d -1) ≥ 0, which is equivalent to -2b + ησ 2 (d + b + 1) ≥ -2ησ 2 (b + 2) + 2η 2 σ 4 b (b + 2)(b + 4) - 1 2 η 3 σ 6 b 2 (b + 2)(b + 4)(b + 6) - 1 η 3 σ 6 b 2 (b + 2)(d 2 -1) - η 3 σ 6 b 2 (b + 2)(b + 4)(d -1) + 2 η 2 σ 4 b (b + 2)(d -1). Finally, recall that ρ < 0 if 2b > ησ 2 (d + b + 1) and h(4) ≥ 1 implies there exists some 0 < α ≤ 4 such that h(α) = 1. The proof is complete.  a c d + b + 1 ≤ ησ 2 b < 2 d + b + 1 , (E.18) for some a c ∈ (0, 2) such that F (a c ) = 0 and F (a) ≤ 0 for any a c ≤ a ≤ 2, where F (a) := 2 -a - 2a d + b + 1 (b + 2) + 2a 2 (d + b + 1) 2 (b + 2)(b + d + 3) - 1 2 a 3 (d + b + 1) 3 (b + 2)(b + 4)(b + 6) - 1 2 a 3 (d + b + 1) 3 (b + 2)(d 2 -1) - a 3 (d + b + 1) 3 (b + 2)(b + 4)(d -1). Proof. We aim to show that there exist some η, σ such that 2 > ησ 2 b (d + b + 1) ≥ 2 - 2ησ 2 b (b + 2) + 2η 2 σ 4 b 2 (b + 2)(b + d + 3) - 1 2 η 3 σ 6 b 3 (b + 2)(b + 4)(b + 6) - 1 2 η 3 σ 6 b 3 (b + 2)(d 2 -1) - η 3 σ 6 b 3 (b + 2)(b + 4)(d -1). Let ησ 2 b (d + b + 1) = a. Then, it suffices to show that there exists some a < 2 such that a ≥ 2 - 2a d + b + 1 (b + 2) + 2a 2 (d + b + 1) 2 (b + 2)(b + d + 3) - 1 2 a 3 (d + b + 1) 3 (b + 2)(b + 4)(b + 6) - 1 2 a 3 (d + b + 1) 3 (b + 2)(d 2 -1) - a 3 (d + b + 1) 3 (b + 2)(b + 4)(d -1). 
Let us define F (a) := 2 -a - 2a d + b + 1 (b + 2) + 2a 2 (d + b + 1) 2 (b + 2)(b + d + 3) - 1 2 a 3 (d + b + 1) 3 (b + 2)(b + 4)(b + 6) - 1 2 a 3 (d + b + 1) 3 (b + 2)(d 2 -1) - a 3 (d + b + 1) 3 (b + 2)(b + 4)(d -1). Then, we can check that F (0) = 2 > 0 and F (2) = - 4 d + b + 1 (b + 2) + 8 (d + b + 1) 2 (b + 2)(b + d + 3) - 4 (d + b + 1) 3 (b + 2)(b + 4)(b + 6) - 4 (d + b + 1) 3 (b + 2)(d 2 -1) - 8 (d + b + 1) 3 (b + 2)(b + 4)(d -1), so that (d + b + 1) 3 4(b + 2) F (2) = -(d + b + 1) 2 + 2(d + b + 1)(b + d + 3) -(b + 4)(b + 6) -(d 2 -1) -2(b + 4)(d -1) = d 2 + b 2 + 1 + 2d + 2b + 2bd + 4d + 4b + 4 -b 2 -10b -24 -d 2 + 1 -2bd -8d + 2b + 8 = -2d -2b -10 < 0. Thus, F (2) < 0. Hence, we conclude that there exists some 0 < a c < 2 such that F (a c ) = 0 and F (a) ≤ 0 for any a c ≤ a ≤ 2. Then, for any  a c ≤ ησ 2 b (d + b + 1) < 2, (E.19) we have (η, σ) ∈ D 4,b . The proof is complete. Recall that D 2m,b consists of (η, σ) such that j η 2(m-k)+j σ 4(m-k)+2j b 2(m-k)+j (-1) j 2 j+2(m-k) • Γ(j + m -k + b 2 ) Γ( b 2 ) Γ(m -k + d-1 2 ) Γ( d-1 2 ) ≥ 1. Proof. By applying Lemma 25, we can compute that h(2m) = E 1 - ησ 2 b X 2 + η 2 σ 4 b 2 XY m = m k=0 m k η 2(m-k) σ 4(m-k) b 2(m-k) E 1 - ησ 2 b X 2k X m-k Y m-k = m k=0 m k η 2(m-k) σ 4(m-k) b 2(m-k) 2k j=0 2k j (-1) j η j σ 2j b j E[X j+m-k Y m-k ] = m k=0 m k η 2(m-k) σ 4(m-k) b 2(m-k) 2k j=0 2k j (-1) j η j σ 2j b j 2 j+m-k • Γ(j + m -k + b 2 ) Γ( b 2 ) 2 m-k Γ(m -k + d-1 2 ) Γ( d-1 2 ) = m k=0 2k j=0 m k 2k j η 2(m-k)+j σ 4(m-k)+2j b 2(m-k)+j • (-1) j 2 j+2(m-k) Γ(j + m -k + b 2 ) Γ( b 2 ) Γ(m -k + d-1 2 ) Γ( d-1 2 ) , where we used the formula for the moments of chi-square distribution, that is, the m-th moment of a chi-square distribution with degree freedom k is given by 2 m Γ(m + k 2 )/Γ( k 2 ). Previously, we have provided some upper bounds on the tail-index α under some technical conditions on the model parameters. 
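Lemma 25's chi-square representation makes h(s) cheap to evaluate numerically, which in turn allows the tail-index α (the root of h(α) = 1) to be located by bisection, and the closed form of h(4) assembled in the proof of Proposition 26 to be cross-checked. The following sketch uses illustrative parameters b = 5, d = 10 (not values from the paper); at a = ησ² = a_c := 2b/(d + b + 1), Proposition 3 gives α = 2, which the bisection should recover up to Monte Carlo error.

```python
import numpy as np

def h_chi2(s, a, b, d, n=400000, seed=0):
    """Monte Carlo estimate of h(s) = E[((1 - (a/b)X)^2 + (a/b)^2 X Y)^(s/2)],
    a = eta*sigma^2, X ~ chi^2_b, Y ~ chi^2_{d-1} independent (Lemma 25)."""
    rng = np.random.default_rng(seed)
    X = rng.chisquare(b, n)
    Y = rng.chisquare(d - 1, n)
    base = (1.0 - (a / b) * X) ** 2 + (a / b) ** 2 * X * Y
    return np.mean(base ** (s / 2.0))

def h4_closed(a, b, d):
    """Closed form of h(4) assembled in the proof of Proposition 26."""
    return (1 - 4 * a + 6 * a ** 2 / b * (b + 2)
            - 4 * a ** 3 / b ** 2 * (b + 2) * (b + 4)
            + a ** 4 / b ** 3 * (b + 2) * (b + 4) * (b + 6)
            + a ** 4 / b ** 3 * (b + 2) * (d ** 2 - 1)
            + 2 * a ** 2 / b * (d - 1)
            + 2 * a ** 4 / b ** 3 * (b + 2) * (b + 4) * (d - 1)
            - 4 * a ** 3 / b ** 2 * (b + 2) * (d - 1))

def tail_index(a, b, d, s_hi=20.0, tol=1e-3):
    """Bisection for the root alpha of h(alpha) = 1, assuming rho < 0
    (so h < 1 just to the right of 0, and h is convex with h(0) = 1)."""
    lo, hi = 0.05, s_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h_chi2(mid, a, b, d) < 1.0 else (lo, mid)
    return 0.5 * (lo + hi)

b, d = 5, 10
a_c = 2 * b / (d + b + 1)  # Proposition 3: tail-index equals 2 at a = a_c
print(round(h4_closed(0.3, b, d), 3), round(h_chi2(4.0, 0.3, b, d), 3))  # should agree
print(tail_index(a_c, b, d))  # close to 2, up to Monte Carlo error
```

The fixed seed makes h_chi2 a deterministic, monotone-in-s function of one common sample, so the bisection behaves well even though each evaluation is a Monte Carlo average.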
Next, let us provide an upper bound for the tail-index α without relying on any additional technical conditions. Proposition 29. The tail-index α is upper bounded by: α ≤ max        2, P -2ησ 2 b X + η 2 σ 4 b 2 X 2 + η 2 σ 4 b 2 XY ≤ 0 E -2ησ 2 b X + η 2 σ 4 b 2 X 2 + η 2 σ 4 b 2 XY +        , (E.20) where X, Y are independent and X is a chi-square random variable with degree of freedom b and Y is a chi-square random variable with degree of freedom (d -1). Proof. We recall that 1 = h(α) = E 1 - 2ησ 2 b X + η 2 σ 4 b 2 X 2 + η 2 σ 4 b 2 XY α 2 , (E.21) where X, Y are independent and X is a chi-square random variable with degree of freedom b and Y is a chi-square random variable with degree of freedom (d -1). Note that for any x ≥ 0 and α ≥ 2, (1 + x) α 2 ≥ 1 + α 2 x. Therefore, 1 ≥ E 1 - 2ησ 2 b X + η 2 σ 4 b 2 X 2 + η 2 σ 4 b 2 XY α 2 1 -2ησ 2 b X+ η 2 σ 4 b 2 X 2 + η 2 σ 4 b 2 XY ≥0 ≥ E 1 + α 2 - 2ησ 2 b X + η 2 σ 4 b 2 X 2 + η 2 σ 4 b 2 XY 1 -2ησ 2 b X+ η 2 σ 4 b 2 X 2 + η 2 σ 4 b 2 XY ≥0 = P - 2ησ 2 b X + η 2 σ 4 b 2 X 2 + η 2 σ 4 b 2 XY ≥ 0 + α 2 E - 2ησ 2 b X + η 2 σ 4 b 2 X 2 + η 2 σ 4 b 2 XY + , which yields the desired result. The proof is complete. Remark 30. (i) Note that it follows from Lemma 25 that we have ρ = 1 2 E log 1 - ησ 2 b X 2 + η 2 σ 4 b 2 XY , where X, Y are independent and X is chi-square random variable with degree of freedom b and Y is a chi-square random variable with degree of freedom (d -1). Therefore, we have ρ = 1 2 ∞ 0 ∞ 0 log 1 - ησ 2 b x 2 + η 2 σ 4 b 2 xy x b 2 -1 e -x 2 2 b 2 Γ( b 2 ) y d-1 2 -1 e -y 2 2 d-1 2 Γ( d-1 2 ) dxdy. (E.22) In particular, when d = 1, we have Y ≡ 0 and ρ = ∞ 0 log 1 - ησ 2 b x x b 2 -1 e -x 2 2 b 2 Γ( b 2 ) dx. (E.23) (ii) Note that it follows from Lemma 25 that we have h(s) = E   1 - ησ 2 b X 2 + η 2 σ 4 b 2 XY s/2   , where X, Y are independent and X is chi-square random variable with degree of freedom b and Y is a chi-square random variable with degree of freedom (d -1). 
Therefore, we have h(s) = ∞ 0 ∞ 0 1 - ησ 2 b x 2 + η 2 σ 4 b 2 xy s 2 x b 2 e -x 2 2 b 2 Γ( b 2 ) y d-1 2 -1 e -y 2 2 d-1 2 Γ( d-1 2 ) dxdy. In particular, when d = 1, we have Y ≡ 0 and h(s) = ∞ 0 1 - ησ 2 b x s x b 2 -1 e -x 2 2 b 2 Γ( b 2 ) dx. (E.24) So far, we have studied various properties of the tail-index α, including the monotonicity on stepsize, noise variance, batch size and the dimension, as well as some quantitative bounds. In general, there is no simple closed-form formula for the tail-index α. Next, we will obtain some approximations for the tail-index α in various asymptotic regimes. First, we provide a rigorous first-order approximation for the tail-index α when it is less than and close to 2. Proposition 31. Let a := ησ 2 and a c := 2b d+b+1 . Then the tail-index satisfies: α ∼ 2 - 4 F c (a -a c ), (E.25) for any a ↓ a c , where F c := E 1 - 2a c b X + a 2 c b 2 X 2 + a 2 c b 2 XY log 1 - 2a c b X + a 2 c b 2 X 2 + a 2 c b 2 XY > 0, (E.26) where X, Y are independent and X is a chi-square random variable with degree of freedom b and Y is a chi-square random variable with degree of freedom (d -1). Proof of Proposition 31. Let us define a = ησ 2 . In Proposition 3, we showed that there exists some δ > 0 such that for any 2b ≤ a(d + b + 1) < 2b + δ, ρ < 0 and there exists some 0 < α ≤ 2, such that h(α) = 1. In particular, when a = a c := 2b d+b+1 , the tail-index α = 2. Consider the tail-index α = α(a) as a function of a. Then, we have 1 = h(α) = E 1 - 2a b X + a 2 b 2 X 2 + a 2 b 2 XY α 2 , (E.27) Moreover, we can compute that -2E - 2 b X + 2a c b 2 X 2 + 2a c b 2 XY = -2 -2 + 2a c b (b + 2) + 2a c b (d -1) = -4. Hence, we conclude that α ∼ 2 - 4 F c (a -a c ), (E.29) for any a ↓ a c , where F c := E 1 - 2a c b X + a 2 c b 2 X 2 + a 2 c b 2 XY log 1 - 2a c b X + a 2 c b 2 X 2 + a 2 c b 2 XY > 0. (E.30) Next, we derive an approximation for the tail-index α when it is close to zero. Proposition 32. 
Let a := ησ 2 and ρ = ρ(a) emphasizing the dependence on a. Define a * := inf{a > 0 : ρ(a) = 0}. Then, we have α ∼ c * (a * -a), (E.31) as a ↑ a * , where c * := 4E -2 b X + 2a * b 2 X 2 + 2a * b 2 XY 1 -2a * b X + a 2 * b 2 X 2 + a 2 * b 2 XY • E log 1 - 2a * b X + a 2 * b 2 X 2 + a 2 * b 2 XY 2 -1 , (E.32) where X, Y are defined in Proposition 31. Proof of Proposition 32. The tail-index α is uniquely determined by 1 = h(α) = E 1 - 2a b X + a 2 b 2 X 2 + a 2 b 2 XY α/2 , (E.33) where a = ησfoot_7 and X, Y are independent and X is chi-square random variable with degree of freedom b and Y is a chi-square random variable with degree of freedom (d -1). It is clear that α depends on η and σ only via a := ησ 2 . In Proposition 3, we showed that there exists some δ > 0 such that for any 2b ≤ a(d + b + 1) < 2b + δ, ρ < 0 and there exists some 0 < α ≤ 2, such that h(α) = 1. Let ρ = ρ(a) with emphasis on the dependence of ρ on a = ησ 2 . In Proposition 23, we showed that ρ < 0 provided that a(d + b + 1) < 2b. On the other hand, we showed in Remark 24 that ρ ≥ 0 for any a ≥ b exp 1 2 b 2 -1 b 2 Γ( b 2 ) + 1 2 d-1 2 -1 (d-1) 2 Γ( d-1 2 ) . Therefore, there exists some critical value a * > 0 such that ρ(a) < 0 for every a < a * and ρ(a * ) = 0. It is clear that as a → a * , α → 0. We are interested in studying the tail-index α when α is close to zero. By differentiating (E.33) w.r.t. a, we get 0 = E 1 2 ∂α ∂a log 1 - 2a b X + a 2 b 2 X 2 + a 2 b 2 XY Theorem 33. When the dimension d is large, the tail-index satisfies: α ∼ 2 - d 3 4b(b + 2) ησ 2 - 2b d + b + 1 , (E.36) for any 2b d+b+1 ≤ ησ 2 < 2b d+b+1 + 8b(b+2) d 3 . Note that in Theorem 33 the approximation 2 -d 3 4b(b+2) ησ 2 -2b d+b+1 decreases from 2 to 0 as ησ 2 increases from 2b d+b+1 to 2b d+b+1 + 8b(b+2) d 3 , which is an approximation of a * . Moreover, the approximation 2 -d 3 4b(b+2) ησ 2 -2b d+b+1 is strictly decreasing in d, η, and σ 2 and strictly increasing in b. 
This is consistent with the monotonicity results we have shown before.

Continuing the proof of Corollary 34: using log(1 + x) ∼ x − x²/2, so that (1 + x) log(1 + x) ∼ (1 + x)x − x²/2, we have

F_c ∼ E[ ( 1 − (2a_c/b)X + (a_c²/b²)X² + (a_c²/b²)XY ) ( −(2a_c/b)X + (a_c²/b²)X² + (a_c²/b²)XY ) ] − ½ E[ ( −(2a_c/b)X + (a_c²/b²)X² + (a_c²/b²)XY )² ] = ½ E[ ( −(2a_c/b)X + (a_c²/b²)X² + (a_c²/b²)XY )² ].

Next, we can compute that

E[ ( −(2a_c/b)X + (a_c²/b²)X² + (a_c²/b²)XY )² ] = (4a_c²/b²) E[X²] + (a_c⁴/b⁴) E[X⁴] + (a_c⁴/b⁴) E[X²Y²] − ⋯

Similarly, for Corollary 35,

α ∼ [ 4 E( −(2/b)X + (2a*/b²)X² + (2a*/b²)XY ) / E( ( −(2a*/b)X + (a*²/b²)X² + (a*²/b²)XY )² ) ] ( a* − ησ² )
∼ [ 4 E( −(2/b)X + (2a_c/b²)X² + (2a_c/b²)XY ) / E( ( −(2a_c/b)X + (a_c²/b²)X² + (a_c²/b²)XY )² ) ] ( a_c + 8b(b + 2)/d³ − ησ² ).

We recall that … (E.50)

Remark 36. We have already obtained an approximation of α when α lies between 0 and 2 and the dimension d is large (see Theorem 33). Fix the dimension d and the batch size b; when a = ησ² → 0, the tail-index α → ∞. Let us derive an approximation for the tail-index α in this asymptotic regime. We recall that the tail-index α is uniquely determined by

1 = h(α) = E[ ( 1 − (2a/b)X + (a²/b²)X² + (a²/b²)XY )^{α/2} ], (E.51)

where a = ησ² and X, Y are independent, X is a chi-square random variable with b degrees of freedom, and Y is a chi-square random variable with d − 1 degrees of freedom. We apply the approximation (1 + x)^{α/2} ∼ 1 + (α/2)x + ( (α/2)((α/2) − 1)/2 ) x² to get:

1 ∼ 1 + (α/2) E[ −(2ησ²/b)X + (η²σ⁴/b²)X² + (η²σ⁴/b²)XY ] + ( (α/2)((α/2) − 1)/2 ) E[ ( −(2ησ²/b)X + (η²σ⁴/b²)X² + (η²σ⁴/b²)XY )² ].

Assuming that ησ² is small and ignoring the higher-order terms, we get

α/2 ∼ 1 + ( 2 E[(2ησ²/b)X] − 2 E[(η²σ⁴/b²)X² + (η²σ⁴/b²)XY] ) / E[ ( (2ησ²/b)X )² ]
= 1 + ( 4ησ² − 2(η²σ⁴/b²)(b² + 2b) − 2(η²σ⁴/b²) b(d − 1) ) / ( 4(η²σ⁴/b²)(b² + 2b) )
= b/(ησ²(b + 2)) + 1/2 − (d − 1)/(2(b + 2)).

Hence, we conclude that as ησ² → 0, the tail-index satisfies

α ∼ 2b/(ησ²(b + 2)) + 1 − (d − 1)/(b + 2), (E.52)

which is consistent with Proposition 3.

Remark 37. The approximation in (E.52) is good if ησ² is small and every other model parameter is fixed.
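The two closed-form approximations of the tail-index, the large-d formula of Theorem 33 and the small-ησ² formula (E.52), admit exact endpoint checks. The sketch below evaluates both; the parameter values b = 5, d = 10 are arbitrary illustrations, and the formulas are transcribed from the statements above.

```python
def alpha_large_d(a, b, d):
    """Theorem 33: alpha ~ 2 - d^3/(4 b (b+2)) * (a - 2b/(d+b+1)), a = eta*sigma^2."""
    return 2.0 - d ** 3 / (4.0 * b * (b + 2)) * (a - 2.0 * b / (d + b + 1))

def alpha_small_a(a, b, d):
    """(E.52): alpha ~ 2b/(a (b+2)) + 1 - (d-1)/(b+2), as a = eta*sigma^2 -> 0."""
    return 2.0 * b / (a * (b + 2)) + 1.0 - (d - 1.0) / (b + 2.0)

b, d = 5, 10
a_c = 2.0 * b / (d + b + 1)                # critical value where alpha = 2
a_star = a_c + 8.0 * b * (b + 2) / d ** 3  # large-d approximation of a*

print(round(alpha_large_d(a_c, b, d), 9))          # 2.0: alpha = 2 at a = a_c
print(round(abs(alpha_large_d(a_star, b, d)), 9))  # 0.0: alpha vanishes at a ~ a*
print(round(alpha_small_a(a_c, b, d), 9))          # 2.0: consistent with Prop. 3
```

The last line reproduces the consistency check stated in the text: plugging ησ² = 2b/(d + b + 1) into the right-hand side of (E.52) gives exactly 2. One can also verify that alpha_small_a is strictly decreasing in a, matching the monotonicity of the tail-index in ησ².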
When ησ² is small and the dimension d is large, a finer approximation is given by

α/2 ∼ 1 + ( 2 E[(2ησ²/b)X] − 2 E[(η²σ⁴/b²)X² + (η²σ⁴/b²)XY] ) / E[ ( −(2ησ²/b)X + (η²σ⁴/b²)X² + (η²σ⁴/b²)XY )² ],



Footnotes:

A real-valued random variable X has a heavy (left) tail if lim_{x→∞} P(X ≤ −x) e^{c|x|} = ∞ for every c > 0.

We say that x_0 is in the domain of attraction of a local minimum x*, if gradient descent iterations to minimize f started at x_0 with a sufficiently small stepsize converge to x* as the number of iterations goes to infinity.

Note that if the input data is heavy-tailed, the stationary distribution of SGD automatically becomes heavy-tailed; see Buraczewski et al. (2012) for details. In our context, the challenge is to identify the occurrence of heavy tails when the distribution of the input data is light-tailed, such as a simple Gaussian distribution.

Otherwise, one can construct a subsequence x_{n_k} that is bounded in the space L^α and converges to x_∞, which would be a contradiction.

E.g., if M_k is deterministic and q_k is Gaussian, then x_k is Gaussian for all k, and so is x_∞ if the limit exists.

The form of x* can be verified by noticing that E[x_k] converges to the minimizer of the problem, by the law of total expectation.

Besides, our GCLT requires the sum of the iterates to be normalized by 1/(K − K_0)^{1/α}; however, for a finite K, normalizing by 1/(K − K_0) results in a scale difference, to which our tail-index estimator is agnostic.

Note that the proof of Corollary 5 first establishes that x_∞ has a bounded s-th moment provided that h(s) = E[‖M e_1‖^s] < 1, and then cites Lemma 14 regarding the equivalence h(s) = h̃(s).

When the dimension d is large, we can use Proposition 31 and Proposition 32 to obtain a more explicit approximation for the tail-index α when it is between 0 and 2.



Figure 1: Behavior of α with (a) varying stepsize η and batch-size b, (b) d and σ, (c) under RMSProp.

Figure 2: Results on FCNs. Different markers represent different initializations with the same η, b.

Figure 3: Results on VGG networks. The values of α that exceeded 2 are truncated to 2 for visualization purposes.

Figure 4: The curve h(a, s) = 1 in the (a, s) plane.

Geometrically, we see from Figure 4 that the curve s(a), as a function of a, is sandwiched between two rectangular boxes and necessarily has s′(a) < 0. This can also be obtained rigorously from the implicit function theorem: if we differentiate the implicit equation h(a, s(a)) = 1 with respect to a, we obtain

(∂h/∂a)(a*, s*) + (∂h/∂s)(a*, s*) s′(a*) = 0.

From parts (ii)–(iii), we have (∂h/∂a)(a*, s*) > 0 and (∂h/∂s)(a*, s*) > 0. Therefore, we have

s′(a*) = −(∂h/∂a)(a*, s*) / (∂h/∂s)(a*, s*) < 0.

and hence E‖x_k‖^α = O(k^α), (C.35) which grows at most polynomially in k. The proof is complete.

G_H(a) := ‖(I − aH) e_1‖^s is convex for s ≥ 1. It follows that for given b and d, with H := (1/b) Σ_{i=1}^b a_i a_i^T, the function h(a, s) := E[G_H(a)] = E‖(I − aH) e_1‖^s (D.20) is a convex function of a for a fixed s ≥ 1.

Let us define D_{s,b} as the set consisting of (η, σ) such that

D_{s,b} := { (η, σ) : ησ²(d + b + 1) < 2b and h(s) ≥ 1 }.

We have shown in Proposition 23 that ρ < 0 provided that ησ²(d + b + 1) < 2b. Therefore, if (η, σ) ∈ D_{s,b}, then the tail-index α ∈ (0, s]. In Proposition 26, we characterised the set D_{4,b}. In the next proposition, we show that D_{4,b} is non-trivial, i.e., D_{4,b} is not an empty set.

Proposition 27. D_{4,b} is not an empty set. In particular, it includes the pairs (η, σ) such that

D_{2m,b} = { (η, σ) : ησ²(d + b + 1) < 2b and h(2m) ≥ 1 }. Since D_{4,b} ≠ ∅ and D_{4,b} ⊆ D_{2m,b} for any m ≥ 2, the sets D_{2m,b} are non-empty as well. Indeed, we can characterise the set D_{2m,b} in the following proposition. Proposition 28. Given any m ∈ N, there exists some 0 < α ≤ 2m such that h(α) = 1 provided that ησ²(d + b + 1) < 2b and

Proof of Theorem 33. When ησ² increases from a_c = 2b/(d + b + 1) to a*, where a* = inf{ a > 0 : ρ(a) = 0 }, the tail-index decreases from 2 to 0. When d ↑ ∞, a* → 0 and a_c → 0, and hence it suffices to show that

α ∼ 2 − ( d³/(4b(b + 2)) ) ( ησ² − 2b/(d + b + 1) ), (E.37)

as d ↑ ∞ and ησ² ↓ 2b/(d + b + 1), that a* ∼ 2b/(d + b + 1) + 8b(b + 2)/d³, and that

α ∼ 2 − ( d³/(4b(b + 2)) ) ( ησ² − 2b/(d + b + 1) ), (E.38)

as d ↑ ∞ and ησ² ↑ 2b/(d + b + 1) + 8b(b + 2)/d³. We will prove (E.37) in Corollary 34, which is a corollary of Proposition 31 when d ↑ ∞, and we will prove (E.38) in Corollary 35, which is a corollary of Proposition 32 when d ↑ ∞.

When the dimension d is large, we have the following result as a corollary of Proposition 31.

Corollary 34. The tail-index satisfies

α ∼ 2 − ( d³/(4b(b + 2)) ) ( ησ² − 2b/(d + b + 1) ), (E.39)

as d ↑ ∞ and ησ² ↓ 2b/(d + b + 1).

Proof of Corollary 34. As d ↑ ∞, a_c ↓ 0, we have −(2a_c/b)X + (a_c²/b²)X² + (a_c²/b²)XY → 0 in probability, and using log(1 + x) ∼ x − x²/2,

= ( 4a_c²(b + 2)/b ) · ( 1/(d + b + 1)² ) [ (d + b + 1)² + (b + 4)(b + 6) + d² − 1 − 2(b + 4)(d + b + 1) + 2(b + 4)(d − 1) − 2(d + b + 1)(d − 1) ] ∼ 32b(b + 2)/d³, as d ↑ ∞, where we used the formulas for the moments of chi-square random variables, and a_c ∼ 2b/d for d ↑ ∞. Therefore, it follows from Proposition 31 that α ∼ 2 − ( d³/(4b(b + 2)) )( ησ² − 2b/(d + b + 1) ) as d ↑ ∞ and ησ² ↓ 2b/(d + b + 1). The proof is complete.

When the dimension d is large, we have the following result as a corollary of Proposition 32.

Corollary 35. When d ↑ ∞ and ησ² ↑ a* ∼ 2b/(d + b + 1) + 8b(b + 2)/d³, the tail-index satisfies α ∼ ( d³/(4b(b + 2)) )( a* − ησ² ).

Proof of Corollary 35. Note that, by the definition of a*, when d is large, i.e. d ↑ ∞, we have Y ↑ ∞ in probability, and thus a* → 0. This implies that −(2a*/b)X + (a*²/b²)X² → 0 in probability, and hence we must have

(2a*/b) E[X] ∼ E[ (a*²/b²) X² + (a*²/b²) XY ], (E.43)

which implies that as d ↑ ∞,

a* ∼ a_c := 2b/(d + b + 1). (E.44)

as d ↑ ∞, where we used the formulas for the moments of chi-square random variables, and a_c ∼ 2b/d for d ↑ ∞. Hence, we conclude that when d ↑ ∞ and ησ² ↑ 2b/(d + b + 1) + 8b(b + 2)/d³, the proof is complete by noticing that

( 8d³/(32b(b + 2)) ) ( 2b/(d + b + 1) + 8b(b + 2)/d³ − ησ² ) = 2 − ( d³/(4b(b + 2)) ) ( ησ² − 2b/(d + b + 1) ).

The approximation 2b/(ησ²(b + 2)) + 1 − (d − 1)/(b + 2) is strictly increasing in b, and strictly decreasing in η, σ² and d, which is consistent with what we have shown before. If ησ² = 2b/(d + b + 1), we know from Proposition 3 that the tail-index α = 2. Indeed, by plugging ησ² = 2b/(d + b + 1) into the right-hand side of (E.52), we get 2.

( 4a²/b² ) b(b + 2) + ( a⁴/b⁴ ) b(b + 2)(b + 4)(b + 6) + ( a⁴/b⁴ ) b(b + 2)(d² − 1) − ( 4a³/b³ ) b(b + 2)(b + 4) + ( 2a⁴/b⁴ ) b(b + 2)(b + 4)(d − 1) − ( 4a³/b³ ) b(b + 2)(d − 1) ∼ ( 4a²/b² ) b(b + 2) + ( a⁴/b⁴ ) b(b + 2) d² − ( 4a³/b³ ) b(b + 2) d, (E.55)



where a = ησ². Hence, we obtain the approximation which yields that the tail-index α can be approximated as in (E.57).

