SHAPE MATTERS: UNDERSTANDING THE IMPLICIT BIAS OF THE NOISE COVARIANCE

Anonymous authors
Paper under double-blind review

Abstract

The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies demonstrate that parameter-dependent noise, induced by minibatches or label perturbation, is far more effective than Gaussian noise. This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et al. and Woodworth et al. We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground truth with an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.

1. INTRODUCTION

One central mystery of deep artificial neural networks is their capability to generalize despite having far more learnable parameters than training examples (Zhang et al., 2016). To add to the mystery, deep nets can obtain reasonable performance even in the absence of any explicit regularization. This has motivated recent work to study the regularization effect of the optimization procedure (rather than the objective function), also known as implicit bias or implicit regularization (Gunasekar et al., 2017; 2018a;b; Soudry et al., 2018; Arora et al., 2019). The implicit bias is induced by, and depends on, many factors, such as the learning rate and batch size (Smith et al., 2017; Goyal et al., 2017; Keskar et al., 2016; Li et al., 2019b; Hoffer et al., 2017), initialization and momentum (Sutskever et al., 2013), adaptive step sizes (Kingma and Ba, 2014; Neyshabur et al., 2015; Wilson et al., 2017), batch normalization (Ioffe and Szegedy, 2015), and dropout (Srivastava et al., 2014). Among these sources of implicit regularization, the SGD noise is believed to be a vital one (LeCun et al., 2012; Keskar et al., 2016). Previous theoretical works (e.g., Li et al. (2019b)) have studied the implicit regularization effect of the scale of the noise, which is directly influenced by the learning rate and batch size. However, it has been empirically observed that the shape of the noise also has a strong (if not stronger) implicit bias. For example, prior works show that mini-batch noise or label noise (label smoothing), i.e., noise in the parameter updates arising from perturbing the training labels, is far more effective than adding spherical Gaussian noise (e.g., see (Shallue et al., 2018, Section 4.6); Szegedy et al. (2016); Wen et al. (2019)). We also confirm this phenomenon in Figure 1 (left). Thus, understanding the implicit bias of the noise shape is crucial.
Such an understanding may also apply to distributed training, because synthetically adding noise may help generalization when parallelism reduces the amount of mini-batch noise (Shallue et al., 2018). In this paper, we theoretically study the effect of the shape of the noise, demonstrating that it can provably determine generalization performance at convergence. Our analysis is based on a nonlinear quadratically-parameterized model introduced by Woodworth et al. (2020); Vaskevicius et al. (2019), which is rich enough to exhibit empirical phenomena similar to those of deep networks. Indeed, Figure 1 (right) empirically shows that SGD with mini-batch noise or label noise can generalize from an arbitrary initialization without explicit regularization, whereas GD or SGD with spherical Gaussian noise cannot. We aim to analyze the implicit bias of label noise and Gaussian noise in the quadratically-parameterized model and explain these empirical observations.

Figure 1: The effect of the noise covariance in neural networks and quadratically-parameterized models. We demonstrate that label noise induces a stronger regularization effect than Gaussian noise. On both real and synthetic data, adding label noise to large-batch (or full-batch) SGD updates can recover small-batch generalization performance, whereas adding Gaussian noise with optimally-tuned variance σ² cannot. Left: training and validation errors on CIFAR100 for VGG19. Adding Gaussian noise to large-batch updates gives little improvement (around 2%), whereas adding label noise recovers the small-batch baseline (around 15% improvement). Right: training and test errors on a 100-dimensional quadratically-parameterized model defined in Section 2. As with deep models, label noise or mini-batch noise leads to better solutions than optimally-tuned spherical Gaussian noise.
Moreover, Gaussian noise causes the parameter to diverge after sufficient mixing, as suggested by our negative result for Langevin dynamics (Theorem 2.2); more experimental details are in Section A. We choose to study label noise because it replicates the regularization effect of mini-batch noise on both real and synthetic data (Figure 1) and has been used to regularize large-batch parallel training (Shallue et al., 2018). Moreover, label noise is less sensitive to the initialization and the optimization history than mini-batch noise, which makes it more amenable to theoretical analysis. For example, in an extreme case, if we happen to initialize at (or reach) a solution that exactly overfits the data, then mini-batch SGD will stay there forever because both the gradient and the noise vanish (Vaswani et al., 2019). In contrast, label noise never accidentally vanishes, so the analysis is more tractable. Understanding label noise may lead to an understanding of mini-batch noise, or to replacing it with other, more robust choices. In our setting, we prove that with a proper learning rate schedule, SGD with label noise recovers a sparse ground-truth classifier and generalizes well, whereas SGD with spherical Gaussian noise generalizes poorly. Concretely, SGD with label noise biases the parameter towards the sparse regime and exactly recovers the sparse ground truth, even when the initialization is arbitrarily large (Theorem 2.1). In this same regime, noise-free gradient descent quickly overfits because it trains in the NTK regime (Jacot et al., 2018; Chizat and Bach, 2018). Adding Gaussian noise is insufficient to fix this: the algorithm would end up sampling from a Gibbs distribution with an infinite partition function and fail to converge to the ground truth (Theorem 2.2). In summary, provided the learning rate and noise level are not too small, label noise suffices to bias the parameter towards sparse solutions without relying on a small initialization, whereas Gaussian noise does not.
Our analysis suggests that the fundamental difference between label or mini-batch noise and Gaussian noise is that the former is parameter-dependent, and therefore introduces stronger biases than the latter. The conceptual message highlighted by our analysis is that there are two possible implicit biases induced by the noise: (1) as prior work (Keskar et al., 2016) shows, by escaping sharp local minima, noisy gradient descent biases the parameter towards more robust solutions (i.e., solutions with low curvature, or "flat" minima); and (2) when the noise covariance varies across the parameter space, there is another (potentially stronger) implicit bias towards parameters where the noise covariance is smaller. Label or mini-batch noise benefits from both biases, whereas Gaussian noise is independent of the parameter, so it benefits from the first bias but not the second. For the quadratically-parameterized model, the first bias alone is not sufficient for finding solutions with good generalization, because there is a large set of overfitting global minima of the training loss with reasonable curvature. In contrast, the covariance of label noise is proportional to the scale of the parameter, inducing a much stronger bias towards low-norm solutions, which generalize well.

1.1. ADDITIONAL RELATED WORKS

Closely related to our work, Blanc et al. (2019) and Zhu et al. (2019) also theoretically studied implicit regularization effects that arise from the shape, rather than the scale, of the noise. However, they only considered the local effect of the noise near a local minimum of the loss, whereas our work analyzes its global effect. For a more detailed comparison with Blanc et al. (2019), see Section 2.2. Woodworth et al. (2020); Vaskevicius et al. (2019) analyze the effect of initialization for the same model that we study, showing that large initialization trains in the NTK regime (shown to generalize poorly (Wei et al., 2019; Ghorbani et al., 2019)) whereas small initialization does not. We show that when the initialization is large, adding noise helps avoid the NTK regime (Li and Liang, 2018; Jacot et al., 2018; Du et al., 2018b; Woodworth et al., 2020). The regularization effect of dropout noise has also been studied (Mianjy et al., 2018; Mianjy and Arora, 2019; Wei et al., 2020; Arora et al., 2020); in particular, Wei et al. (2020) showed that there also exists an implicit bias induced by dropout noise. Langevin dynamics, and the closely-related stochastic gradient descent with spherical Gaussian noise, has been studied in previous works (Welling and Teh, 2011). Several works have theoretically analyzed other types of implicit biases in simplified settings (Soudry et al., 2018; Gunasekar et al., 2018b; Ji and Telgarsky, 2018a). Gunasekar et al. (2017) and Li et al. (2017) showed that gradient descent finds low-rank solutions in matrix completion. Gradient descent has also been shown to maximize the margin in linear and homogeneous models (Soudry et al., 2018; Ji and Telgarsky, 2018b; Nacson et al., 2018; Lyu and Li, 2019; Gunasekar et al., 2018a; Nacson et al., 2019; Poggio et al., 2017). Du et al. (2018a) showed that gradient descent implicitly balances the layers of deep homogeneous models.
Other works showed that it may not always be possible to characterize implicit biases in terms of norms (Arora et al., 2019; Razin and Cohen, 2020). Gissin et al. (2019) showed that gradient descent dynamics exhibit different implicit biases depending on depth. Li et al. (2019b) studied the implicit regularization effect of a large initial learning rate. Guo et al. (2018) studied the notion of "elimination singularities" in RBF networks, where optimization runs into a regime with small weights and small gradients, so that training can slow down. Our paper also involves a training trajectory with small weight norm, but instead focuses on its influence on generalization performance.

2.1. SETUP AND BACKGROUNDS

Parameterization. We focus on the nonlinear model parameterization f_v(x) ≜ ⟨v^{⊙2}, x⟩, where v ∈ R^d is the parameter of the model, x ∈ R^d is the data, and v^{⊙2} denotes the element-wise square of v. Prior works (Woodworth et al., 2020; Vaskevicius et al., 2019; Li et al., 2017) have studied this model because it is an interesting and informative simplification of nonlinear models. As SGD noise exhibits many of the same empirical behaviors in this simplified model as in deep networks, we use this model as a testbed to develop a mathematical understanding of various sources of implicit bias. As shown in Figure 1, both SGD with mini-batch noise and SGD with label noise generalize better than GD or SGD with spherical Gaussian noise.

Data distribution assumptions and overparameterization. We assume that there exists a ground-truth parameter v⋆ ∈ R^d that generates the label y = ⟨v⋆^{⊙2}, x⟩ for a data point x, which is assumed to be generated from N(0, I_{d×d}). A dataset D = {(x^{(i)}, y^{(i)})}_{i=1}^n of n i.i.d. data points is generated from this distribution. The implicit bias is only needed in an over-parameterized regime, and therefore we assume that n ≪ d. To make the ground-truth vector information-theoretically recoverable, we assume that v⋆ is r-sparse. Here r is much smaller than d, and casual readers can treat it as a constant. Because the element-wise square in the model parameterization is invariant to sign flips, we assume v⋆ is non-negative without loss of generality. For simplicity, we also assume it only takes values in {0, 1}. We use S ⊂ [d] with |S| = r to denote the support of v⋆ throughout the paper. We remark that we can recover v⋆ by re-parameterizing u = v^{⊙2} and applying LASSO (Tibshirani, 1996) in the u-space when n ≥ O(r), which is minimax optimal (Raskutti et al., 2012).
However, the main goal of the paper, similar to several prior works (Woodworth et al., 2020; Vaskevicius et al., 2019; Li et al., 2017), is to prove that the implicit biases of non-convex optimization can recover the ground truth without explicit regularization in the over-parameterized regime when n = poly(r) ≪ d. We also assume throughout the paper that n and d are larger than some sufficiently large universal constant.

Loss function. We use the mean-squared loss ℓ^{(i)}(v) ≜ (1/4)(f_v(x^{(i)}) - y^{(i)})² for the i-th example. The empirical loss is written as L(v) ≜ (1/n) Σ_{i=1}^n ℓ^{(i)}(v).

Initialization. We use a large initialization of the form v^{[0]} = τ · 1, where 1 denotes the all-1's vector and τ is allowed to be arbitrarily large (but polynomial in d).

Algorithm 1 Stochastic Gradient Descent with Label Noise
Require: number of iterations T, a sequence of step sizes η^{[0:T]}, noise level δ, initialization v^{[0]}
1: for t = 0 to T - 1 do
2:   Sample an index i_t ∼ [n] uniformly and noise s_t ∼ {±δ} for the label y^{(i_t)}.
3:   Let ℓ̃^{(i_t)}(v) = (1/4)(f_v(x^{(i_t)}) - y^{(i_t)} - s_t)².
4:   v^{[t+1]} ← v^{[t]} - η^{[t]} ∇ℓ̃^{(i_t)}(v^{[t]})    ▷ update with label noise

SGD with label noise. We study SGD with label noise as shown in Algorithm 1. We sample an example, add label noise sampled from {±δ} to the label, and apply the gradient update. Computing the gradient, we obtain the update rule written explicitly as:

v^{[t+1]} ← v^{[t]} - η^{[t]} ((v^{[t]⊙2} - v⋆^{⊙2})^⊤ x^{(i_t)}) x^{(i_t)} ⊙ v^{[t]} + η^{[t]} s_t x^{(i_t)} ⊙ v^{[t]}.   (1)

Langevin dynamics/diffusion. We compare SGD with label noise to Langevin dynamics, which adds spherical Gaussian noise to gradient descent (Neal et al., 2011):

v^{[t+1]} ← v^{[t]} - η ∇L(v^{[t]}) + √(2η/λ) · ξ,   (2)

where the noise ξ ∼ N(0, I_{d×d}) and λ > 0 controls the scale of the noise.
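As a concrete illustration, the model and the label-noise update (Algorithm 1) can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's experimental code; the function names and the small problem sizes in the usage example are our own choices.

```python
import numpy as np

def f(v, x):
    """Quadratically-parameterized model: f_v(x) = <v ⊙ v, x>."""
    return np.dot(v * v, x)

def label_noise_step(v, x, y, eta, s):
    """One step of Algorithm 1 on the perturbed loss (1/4)(f_v(x) - y - s)^2.
    Its gradient w.r.t. v is (f_v(x) - y - s) * (x * v), so the update splits
    into the usual gradient term plus the mean-zero term eta * s * (x * v)."""
    residual = f(v, x) - y - s
    return v - eta * residual * (x * v)

def sgd_with_label_noise(X, y, v0, eta, delta, T, seed=0):
    """Run Algorithm 1 with a constant step size: at each iteration, sample
    an example and a +/-delta perturbation of its label, then take a step."""
    rng = np.random.default_rng(seed)
    v = np.asarray(v0, dtype=float).copy()
    for _ in range(T):
        i = rng.integers(len(y))
        s = delta * rng.choice([-1.0, 1.0])
        v = label_noise_step(v, X[i], y[i], eta, s)
    return v
```

For a fixed step size this is exactly the explicit update rule displayed above; the three-phase learning rate schedule of Theorem 2.1 would correspond to calling `sgd_with_label_noise` three times with decreasing `eta`.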
Langevin dynamics (LD), or its more computationally-efficient variant, stochastic gradient Langevin dynamics (SGLD), is known to converge to the Gibbs distribution µ(v) ∝ e^{-λL(v)} under various settings with a sufficiently small learning rate (Roberts et al., 1996; Dalalyan, 2017; Bubeck et al., 2018; Raginsky et al., 2017). In our negative result about Langevin dynamics/diffusion, we directly analyze the Gibbs distribution in order to disentangle convergence from generalization.

Notations. Unless otherwise specified, we use O(·), Ω(·), Θ(·) to hide absolute multiplicative factors and Õ(·), Ω̃(·), Θ̃(·) to hide poly-logarithmic factors in problem parameters such as d and τ. For example, every occurrence of Õ(x) is a placeholder for a quantity f(x) that satisfies |f(x)| ≤ c₁|x| · log^{c₂}(dτ) for some absolute constants c₁, c₂ > 0.
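For comparison, the Langevin dynamics update above is equally short to sketch. The strongly convex test loss in the usage note is our own choice, used only to illustrate the case where a Gibbs stationary distribution does exist; nothing here is from the paper's code.

```python
import numpy as np

def langevin_step(v, grad_L, eta, lam, rng):
    """One step of Langevin dynamics: gradient descent on the empirical loss
    plus spherical Gaussian noise with covariance (2 * eta / lam) * I."""
    xi = rng.standard_normal(v.shape)
    return v - eta * grad_L(v) + np.sqrt(2.0 * eta / lam) * xi
```

On a strongly convex loss such as L(v) = ∥v∥²₂/2, the iterates approximately sample the Gibbs distribution, with per-coordinate variance close to 1/λ; the point of Theorem 2.2 in Section 2.2 is that on the over-parameterized quadratic model no such stationary distribution exists.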

2.2. MAIN RESULTS

Our main result can be summarized by the following theorem, which shows that stochastic gradient descent with label noise converges to the ground truth despite a potentially large initialization.

Theorem 2.1. In the setting of Section 2.1, let ϵ > 0 be a target error and suppose we have n ≥ Θ̃(r²) samples. For any label noise level δ ≥ Θ̃(τ²d²), we run SGD with label noise (Algorithm 1) with the following learning rate schedule:
1. learning rate η₀ = Θ̃(1/δ) for T₀ = Θ̃(1) iterations,
2. learning rate η₁ = Θ̃(1/δ²) for T₁ = Θ̃(1/η₁) iterations,
3. learning rate η₂ = Θ̃(ϵ²/δ²) for T₂ = Θ̃(1/η₂) iterations.
Then, with probability at least 0.9, the final iterate v^{[T]} at time T = T₀ + T₁ + T₂ satisfies

∥v^{[T]} - v⋆∥_∞ ≤ ϵ.   (3)

Here Θ̃(·) omits poly-logarithmic dependencies on 1/ϵ, d, and τ.

In other words, with an arbitrarily large initialization scale τ, we can choose a large label noise level and a learning rate schedule so that SGD with label noise succeeds in recovering the ground truth. In contrast, when τ is large, gradient flow without noise trains in the "kernel" regime, as shown by Woodworth et al. (2020); Chizat and Bach (2018). The solution in this kernel regime minimizes the RKHS distance to the initialization, which in our setting means finding a zero-error solution with minimum ∥v^{⊙2} - v^{[0]⊙2}∥₂. Such a solution can be arbitrarily far from v⋆ when the initialization scale τ is large, and therefore generalize poorly. Figure 1 (right) confirms that GD performs poorly with large initialization whereas SGD with mini-batch or label noise works. We outline the proof of Theorem 2.1 in Section 3. Blanc et al. (2019) also study the implicit bias of label noise. For our setting, their result implies that when the iterate stays near a global minimum for a sufficiently long time, it will locally move in the direction that reduces the ℓ₂-norm of v by a small distance (larger than the random fluctuation).
However, their result does not imply global convergence to a well-generalizing solution from a large (or any) initialization, which is what we prove in Theorem 2.1. Moreover, our analysis captures the effect of a large noise level or a large learning rate: we require the ratio between the noise and the gradient, captured by the quantity ηδ², to be sufficiently large. This is consistent with the empirical observation that good generalization requires a sufficiently large learning rate or a small batch size (Goyal et al., 2017). On the other hand, the following negative result for Langevin dynamics demonstrates that adding Gaussian noise fails to recover the ground truth even when v⋆ = 0. This suggests that spherical Gaussian noise does not induce a strong enough implicit bias towards low-norm solutions.

Theorem 2.2. Assume, in addition to the setting of Section 2.1, that the ground truth v⋆ = 0. When n ≤ d/3, with probability at least 0.9 over the randomness of the data, for any λ > 0, the Gibbs distribution is not well-defined because the partition function explodes:

∫_{R^d} e^{-λL(v)} dv = ∞.   (4)

As a consequence, Langevin diffusion does not converge to a proper stationary distribution.

Theorem 2.2 helps explain the behavior in Figure 1, where adding Gaussian noise generalizes poorly on both synthetic and real data. In particular, in Figure 1 (right), adding Gaussian noise causes the parameter to diverge on synthetic data, and Theorem 2.2 explains this observation. A priori, the intuition regarding Langevin dynamics is as follows: as λ → +∞, the Gibbs distribution (if it exists) should concentrate on the manifold of global minima with zero loss. The measure on this manifold should be determined by the geometry of L(·), and in particular the curvature around each global minimum. As λ → +∞, the mass should concentrate at the flattest global minimum (according to some measure of flatness), which intuitively is v⋆ = 0 in this case.
However, our main intuition is that when n < d, even though the global minimum at v⋆ is the flattest, there are also many bad global minima with only slightly sharper curvature. The vast volume of bad global minima dominates the flatness of the global minimum at v⋆ = 0 for any λ, and hence the partition function blows up and the Gibbs distribution does not exist. The proof of Theorem 2.2 can be found in Section F.

3. ANALYSIS OVERVIEW OF SGD WITH LABEL NOISE (THEOREM 2.1)

3.1. WARM-UP: UPDATES WITH ONLY PARAMETER-DEPENDENT NOISE

Towards building intuition and tools for analyzing parameter-dependent noise, in this subsection we start by studying an extremely simplified random walk in one-dimensional space. The random walk is purely driven by mean-zero noisy updates and does not involve any gradient updates:

v ← v + ηξ · v, where ξ ∼ {±1}.   (5)

Indeed, attentive readers can verify that when the dimension d = 1, the sample size n = 1, and v⋆ = 0, equation (1) degenerates to the above random walk if we omit the gradient update term (the second-to-last term in equation (1)). We compare it with the standard Brownian motion, which is the analog of gradient descent with spherical Gaussian noise under this extreme simplification:

v ← v + ηξ, where ξ ∼ N(0, 1).   (6)

We initialize at v = 1. Both random walks have mean-zero updates, so the mean is preserved: E[v] = 1. The variances of both random walks also grow, because any mean-zero update increases the variance. Moreover, the Brownian motion diverges: it has a Gaussian marginal with variance growing linearly in t, and there is no limiting stationary distribution. However, the parameter-dependent random walk (5) behaves dramatically differently when η < 1: the random variable v eventually converges to 0 with high probability (even though the variance grows and the mean remains at 1). This is because the variance of the noise depends on the scale of v: the smaller v is, the smaller the noise variance, so the random walk tends to get "trapped" around 0. This claim has the following informal but simple proof, which does not strongly rely on the exact form of the noise and can be extended to more general high-dimensional cases. Consider an increasing concave potential function ϕ: R_{≥0} → R_{≥0} with ϕ'' < 0 (e.g., ϕ(v) = √v works). Note that when η < 1, the random variable v stays nonnegative.
We can show that the expected potential decreases after any update:

E[ϕ(v + ηξv)] ≈ E[ϕ(v) + ϕ'(v)ηξv + ½ϕ''(v)η²ξ²v²]   (by Taylor expansion)
            = E[ϕ(v)] + ½E[ϕ''(v)η²v²]
            < E[ϕ(v)]   (by ϕ''(v) < 0 and E[ξ] = 0).

With a more detailed analysis, we can formalize the Taylor expansion, control the decrease of the potential function, and conclude that E[ϕ(v)] converges to zero. Then, by Markov's inequality, with high probability ϕ(v) is tiny, and so is v.

From the 1-D case to the high-dimensional case. In one dimension, the bias is introduced by the varying scale of the noise (i.e., the norm of the covariance). In the high-dimensional case, however, the shape of the covariance also matters. For example, if we generalize the random walk (5) to high dimensions by running d copies of the random walk in parallel, we will observe the same phenomenon, but the noise variances in different dimensions are not identical: they depend on the current scales of the coordinates. (Precisely, the noise variance for dimension k is η²v_k².) However, suppose we instead add noise of the same variance to all dimensions. Even if this variance depends on the norm of v (say, η²∥v∥₂²), the implicit bias will be diminished, as the smaller coordinates will receive relatively outsized noise and the larger coordinates relatively insufficient noise.

Outline of the rest of the subsections. We will give a proof sketch of Theorem 2.1 that consists of three stages. We first show that in the initial stage of training, label noise effectively decreases the parameter in all dimensions, bringing the training from a large initialization to a small-initialization regime where better generalization is possible (Section 3.2).
Then, we show in Section 3.3 that when the parameter is decently small, with label noise and a decayed learning rate, the algorithm increases the magnitude of the dimensions in the support of v⋆ while continuing to decrease the magnitude of the remaining dimensions. Finally, with one more decay of the learning rate, the algorithm recovers the ground truth.
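The contrast between the two warm-up walks above is easy to verify numerically. The following toy simulation uses our own choices of η, horizon, and number of runs; it is an illustration, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, T, runs = 0.5, 400, 2000

# Parameter-dependent walk (5): v <- v + eta * xi * v with xi ~ {-1, +1}.
# Since eta < 1, every multiplicative factor (1 +/- eta) is positive,
# so v stays positive throughout.
v_mult = np.ones(runs)
# Additive walk: v <- v + eta * xi with Gaussian xi (Brownian-motion analog).
v_add = np.ones(runs)
for _ in range(T):
    xi = rng.choice([-1.0, 1.0], size=runs)
    v_mult = v_mult + eta * xi * v_mult
    v_add = v_add + eta * rng.standard_normal(runs)

# Nearly every multiplicative run collapses towards 0 (even though E[v] = 1),
# while the additive walk spreads out with std about eta * sqrt(T).
frac_collapsed = np.mean(v_mult < 1e-3)
spread = np.std(v_add)
```

Here `frac_collapsed` comes out essentially equal to 1 while `spread` is around 10, matching the claim that the parameter-dependent walk gets trapped near 0 and the Brownian motion diverges.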

3.2. STAGE 0: LABEL NOISE WITH LARGE LEARNING RATE REDUCES THE PARAMETER NORM

We first analyze the initial phase, where we use a relatively large learning rate. When the initialization is of a decent size, GD quickly overfits to a bad global minimum nearest to the initialization. In contrast, we prove that SGD with label noise biases the iterate towards the small-norm region, for a similar reason as the random walk example with parameter-dependent noise in Section 3.1.

Theorem 3.1. In the setting of Theorem 2.1, recall that we initialize with v^{[0]} = τ · 1, and assume n ≥ Θ̃(log d). Suppose we run SGD with label noise with noise level δ ≥ Θ̃(τ²d²) and learning rate η₀ ∈ [Θ̃(τ²d²/δ²), Θ̃(1/δ)] for T₀ = Θ̃(1/(η₀²δ²)) iterations. Then, with probability at least 0.99 over the randomness of the algorithm,

∥v^{[T₀]}∥_∞ ≤ 1/d.   (7)

Moreover, the minimum entry of v^{[T₀]} is bounded below by exp(-O((η₀δ)^{-1})).

We remark that our requirement that the learning rate be large is consistent with the empirical observation that a large initial learning rate helps generalization (Goyal et al., 2017; Li et al., 2019b). We provide intuition and a proof sketch of the theorem in the rest of this subsection and defer the full proof to Section B. Our proof is based on the construction of a concave potential function Φ similar to Section 3.1. We will show that, at every step, the noise has a second-order effect on the potential function and decreases it by a quantity on the order of η²δ² (omitting the dependency on d). On the other hand, the gradient step may increase the potential by a quantity at most on the order of η (again omitting the d dependency). Therefore, when η²δ² ≳ η, we expect the algorithm to decrease the potential and hence the parameter norm. In particular, we define Φ(v) ≜ Σ_{k=1}^d ϕ(v_k) = Σ_{k=1}^d √(v_k). By the update rule (1), the update for a coordinate k ∈ [d] can be written as

v^{[t+1]}_k ← v^{[t]}_k - η s_t x^{(i_t)}_k v^{[t]}_k - η ((v^{[t]⊙2} - v⋆^{⊙2})^⊤ x^{(i_t)}) x^{(i_t)}_k v^{[t]}_k,   (8)

where s_t is sampled from {-δ, δ} and i_t is sampled from [n].
Let g^{(i_t)}_k ≜ ((v^{[t]⊙2} - v⋆^{⊙2})^⊤ x^{(i_t)}) x^{(i_t)}_k denote the component coming from the stochastic gradient. Using the fact that ϕ(ab) = ϕ(a)ϕ(b) for any a, b > 0, we can evaluate the potential function at time t + 1:

E[ϕ(v^{[t+1]}_k)] = E[ϕ(v^{[t]}_k) ϕ(1 - η s_t x^{(i_t)}_k - η g^{(i_t)}_k)] = ϕ(v^{[t]}_k) E[ϕ(1 - η s_t x^{(i_t)}_k - η g^{(i_t)}_k)].   (9)

Here the expectation is over s_t and i_t. We Taylor-expand the term ϕ(1 - η s_t x^{(i_t)}_k - η g^{(i_t)}_k) to deal with the non-linearity and use the fact that η s_t x^{(i_t)}_k is mean-zero:

E[ϕ(1 - η s_t x^{(i_t)}_k - η g^{(i_t)}_k)] ≈ ϕ(1) - ϕ'(1) η E[g^{(i_t)}_k] + ½ ϕ''(1) E[(η s_t x^{(i_t)}_k + η g^{(i_t)}_k)²]
    ≤ ϕ(1) - ϕ'(1) η E[g^{(i_t)}_k] + ½ ϕ''(1) E[(η s_t x^{(i_t)}_k)²]
    ≤ ϕ(1) - ϕ'(1) η E[g^{(i_t)}_k] - Ω̃(η²δ²).   (10)

In the second line we used ϕ''(1) < 0 from the concavity and the fact that the cross term vanishes because E[η s_t x^{(i_t)}_k] = 0 (s_t is mean-zero and independent of i_t), and the third line uses the fact that s_t ∼ {±δ} and E[(x^{(i_t)}_k)²] ≈ 1 (by the data assumption). The rest of the proof consists of bounding the second term in equation (10) from above to show that the potential function contracts. We first note that for every i_t, it holds that

|g^{(i_t)}_k| ≤ ∥v^{[t]⊙2} - v⋆^{⊙2}∥₁ ∥x^{(i_t)}∥²_∞ ≤ (∥v^{[t]}∥₂² + r) ∥x^{(i_t)}∥²_∞.

Furthermore, we can bound the ℓ₂ norm of v^{[t]} with the following lemma.

Lemma 3.2. In the setting of Theorem 3.1, for some failure probability ρ > 0, let b₀ ≜ 6τd/ρ. Then, with probability at least 1 - ρ/3, we have ∥v^{[t]}∥₂ ≤ b₀ for all t ≤ T₀.

Note that v^{[0]} has ℓ₂ norm τ√d, and here we prove that the norm does not exceed the order of τd with high probability. At first glance, the lemma appears to be mostly auxiliary, but we note that it distinguishes label noise from Gaussian noise, which empirically causes the parameter to blow up, as shown in Figure 1. The formal proof is deferred to Section B.
By Lemma 3.2 and the bound on |g^{(i_t)}_k| in terms of ∥v^{[t]}∥₂, we have |g^{(i_t)}_k| ≤ (b₀² + r)∥x^{(i_t)}∥²_∞ ≤ Õ(b₀² + r), with b₀ defined in Lemma 3.2 (up to logarithmic factors). Here we use again that each entry of the data is drawn from N(0, 1). Plugging these bounds into equation (10), we obtain

E[ϕ(1 - η s_t x^{(i_t)}_k - η g^{(i_t)}_k)] ≤ 1 + η Õ(b₀² + r) - Ω̃(η²δ²) < 1 - Ω̃(η²δ²),

where in the last inequality we use the lower bound on η to conclude η²δ² ≳ η Õ(b₀² + r). Therefore, summing equation (9) over all dimensions shows that the potential function decreases exponentially fast:

E[Φ(v^{[t+1]})] < (1 - Ω̃(η²δ²)) Φ(v^{[t]}).

After T ≈ log(d)/(η²δ²) iterations, v^{[T]} will already converge to a point where E[Φ(v^{[T]})] ≲ 1/d, which implies ∥v^{[T]}∥_∞ ≲ 1/d with probability at least 1 - ρ and finishes the proof.

3.3. STAGE 1: GETTING CLOSER TO v⋆ WITH ANNEALED LEARNING RATE

Theorem 3.1 shows that the noise decreases the ℓ∞-norm of v to 1/d. This means that the ℓ₁- or ℓ₂-norm of v is comparable to or smaller than that of v⋆ when r is constant, and we are in a small-norm region where overfitting is less likely to happen. In the next stage, we anneal the learning rate to slightly reduce the bias of the label noise and increase the contribution of the signal. Recall that v⋆ is a sparse vector with support S ⊂ [d]. The following theorem shows that, after annealing the learning rate (from the order of 1/δ to 1/δ²), SGD with label noise simultaneously increases the entries in v_S and decreases the entries in v_{S̄} (where S̄ denotes the complement of S), provided that the initialization has ℓ∞-norm bounded by 1/d. (For simplicity and self-containedness of the statement, we reset the time step to 0.)

Theorem 3.3. In the setting of Section 2.1, given a target error bound ϵ₁ > 0, we assume that n ≥ Θ̃(r² log²(1/ϵ₁)). We run SGD with label noise (Algorithm 1) from an initialization v^{[0]} whose entries are all in [ϵ_min, 1/d], where ϵ_min ≥ exp(-Õ(1)).
Let the noise level δ ≥ Θ̃(log(1/ϵ₁)), the learning rate η = Θ̃(1/δ²), and the number of iterations T = Θ̃(log(1/ϵ₁)/η). Then, with probability at least 0.99, after T iterations we have

∥v^{[T]}_S - v⋆_S∥_∞ ≤ 0.1 and ∥v^{[T]}_{S̄} - v⋆_{S̄}∥₁ ≤ ϵ₁.   (11)

We remark that even though the initialization is relatively small in this stage, the label noise still helps alleviate the reliance on a small initialization. Li et al. (2017); Vaskevicius et al. (2019) showed that GD converges to the ground truth with a sufficiently small initialization, which is required to be smaller than the target error ϵ₁. In contrast, our result shows that with label noise, the initialization does not need to depend on the target error; it only needs an ℓ∞-norm bound on the order of 1/d. In other words, v gets closer to v⋆ on both S and S̄ in our case, whereas in Li et al. (2017); Vaskevicius et al. (2019) the entries in v_{S̄} grow slowly. The proof of this theorem balances the contribution of the gradient against that of the noise on S and S̄. On S, the gradient provides a stronger signal than the label noise, whereas on S̄, the implicit bias of the noise, similar to the effect in Section 3.2, outweighs the gradient and reduces the entries towards zero. The analysis is more involved than that of Theorem 3.1, and we defer the full proof to Section C.

Stage 2: convergence to the ground truth v⋆. The conclusion of Theorem 3.3 still allows a constant error on the support, namely ∥v_S - v⋆_S∥_∞ ≤ 1/10. In Theorem D.1, we show that further annealing the learning rate lets the algorithm fully converge to v⋆ with any target error ϵ.

Proof of Theorem 2.1. In Section E of the Appendix, we combine Theorem 3.1, Theorem 3.3, and Theorem D.1 to prove our main Theorem 2.1.
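The second-order mechanism driving all three stages, namely that a mean-zero multiplicative kick of size ±ηδ shrinks the concave potential ϕ(u) = √u by a factor of 1 - Θ(η²δ²), can be checked numerically. This is a toy sketch ignoring the gradient term g; the function name is ours.

```python
import math

def noise_contraction(a):
    """E[phi(1 + a*s)] for phi(u) = sqrt(u) and s uniform on {-1, +1}:
    the average of sqrt(1 - a) and sqrt(1 + a). Taylor expansion predicts
    this is about 1 - a^2 / 8 < 1 for 0 < a < 1, where a plays the role
    of eta * delta in the analysis above."""
    return 0.5 * (math.sqrt(1.0 - a) + math.sqrt(1.0 + a))
```

For instance, `noise_contraction(0.1)` is about 0.99875, matching 1 - 0.1²/8; iterating such a per-step factor roughly log(d)/(η²δ²) times drives the potential, and hence the parameter norm, below any fixed threshold.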

4. CONCLUSION

In this work, we study the implicit bias effect induced by noise. For a quadratically-parameterized model, we theoretically show that the parameter-dependent noise has a strong implicit bias, which can help recover the sparse ground-truth from limited data. In comparison, our negative result shows that such a bias cannot be induced by spherical Gaussian noise. Our result explains the empirical observation that replacing mini-batch noise or label noise with Gaussian noise usually leads to degradation in the generalization performance of deep models.

A EXPERIMENTAL DETAILS

A.1 EXPERIMENTAL DETAILS FOR THE QUADRATICALLY-PARAMETERIZED MODEL

In the experiment on our quadratically-parameterized model, we use a 100-dimensional model with n = 40 data points randomly sampled from N(0, I₁₀₀×₁₀₀). We set the first 5 dimensions of the ground-truth v⋆ to 1 and the remaining dimensions to 0. We always initialize with v^[0] = 1. We use a constant learning rate of 0.01 for all experiments except label noise. For label noise, we start from 0.01 and then decay the learning rate by a factor of 10 after 1×10⁵ and 2×10⁵ iterations. For the "full batch" experiment, we run full-batch gradient descent without noise. For the "small batch" experiment, in order to fully disentangle the effect of the learning rate from that of mini-batch SGD noise (i.e., to avoid implicit biases from a large learning rate rather than from the noise), we add small-batch noise to the full gradient via the following sampling method: at each iteration, we randomly sample two data points i and j from [n] and add δ(∇ℓ^(i)(v) − ∇ℓ^(j)(v)) to the full gradient (we set δ = 1.0 in our experiment). For label noise, we randomly sample i ∈ [n] and s ∈ {δ, −δ} (we set δ = 1.0 in our experiment), and add the noise ∇ℓ̃^(i)(v) − ∇ℓ^(i)(v) to the full gradient, where ℓ̃^(i)(v) ≜ ¼(f_v(x^(i)) − y^(i) − s)². For the Gaussian noise experiments, we add noise ξ ∼ N(0, σ²I_{d×d}) to the full gradient at every iteration, where the values of σ are shown in Figure 1. For all experiments except Gaussian noise, we train for a total of 3×10⁵ iterations. For a more generous comparison, we run all the Gaussian noise experiments 4 times longer (i.e., 1.2×10⁶ iterations) while plotting them in the same figure after scaling the x-axis by a factor of 4. The test error is measured by the squared ℓ₂ distance between v^⊙2 and v⋆^⊙2, which equals the expected loss on a freshly sampled data point. The training and test errors are plotted in Figure 1.
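As an illustration of the label-noise construction above, the following minimal pure-Python sketch (function names are ours, not the paper's code) computes the per-example gradient of ℓ^(i)(v) = ¼(f_v(x^(i)) − y^(i))² for f_v(x) = Σ_k v_k² x_k, and forms the label-noise perturbation ∇ℓ̃^(i)(v) − ∇ℓ^(i)(v):

```python
import random

def f(v, x):
    # quadratically-parameterized model: f_v(x) = sum_k v_k^2 * x_k
    return sum(vk * vk * xk for vk, xk in zip(v, x))

def grad_loss(v, x, y, s=0.0):
    # gradient of (1/4) * (f_v(x) - y - s)^2 w.r.t. v:
    # d/dv_k = (f_v(x) - y - s) * v_k * x_k
    resid = f(v, x) - y - s
    return [resid * vk * xk for vk, xk in zip(v, x)]

def label_noise_perturbation(v, x, y, delta):
    # gradient of the label-shifted loss minus gradient of the clean loss;
    # analytically this equals [-s * v_k * x_k for each k]
    s = random.choice([delta, -delta])
    g_noisy = grad_loss(v, x, y, s=s)
    g_clean = grad_loss(v, x, y)
    return [gn - gc for gn, gc in zip(g_noisy, g_clean)]
```

Averaging the perturbation over both signs of s gives exactly zero, so the injected noise is unbiased; its magnitude in coordinate k scales with v_k, which is the parameter dependence the analysis exploits.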

A.2 EXPERIMENTAL DETAILS FOR DEEP NEURAL NETWORKS ON CIFAR100

We train a VGG19 model (Simonyan and Zisserman, 2014) on CIFAR100, with a small-batch and a large-batch baseline. We also experiment with adding Gaussian noise to the parameters after every gradient update, as well as with adding label noise in the following manner: with some probability that depends on the current iteration count, we replace the original label with a randomly chosen one. To add additional mean-zero noise to the gradient, which simulates the effect of label noise in the regression setting, we compute a noisy gradient of the cross-entropy loss ℓ_ce with respect to the model output f(x) as follows:

∇̃_{f(x)} ℓ_ce(f(x), y) = ∇_{f(x)} ℓ_ce(f(x), y) + σ_ln z,

where z is a 100-dimensional vector (one coordinate per class) distributed according to N(0, I₁₀₀×₁₀₀), and y is the (possibly flipped) label. We backpropagate using this noisy gradient when computing the gradient of the loss w.r.t. the parameters for the updates. After tuning, we choose the initial label-flipping probability as 0.1 and reduce it by a factor of 0.5 every time the learning rate is annealed. We choose σ_ln such that σ_ln E[∥z∥₂²] = 0.1, and also decrease σ_ln by a factor of 0.5 every time the learning rate is annealed. To add spherical Gaussian noise to the parameters at every update, we simply set W ← W + σz after every gradient update, where z is a mean-zero Gaussian whose coordinates are drawn independently from N(0, 1). We tune this σ over the values shown in Figure 1. We turn off weight decay and BatchNorm to isolate the regularization effect of the noise alone. Standard data augmentation is still present in our runs. Our small-batch baseline uses a batch size of 26, and our large-batch baseline uses a batch size of 256. In runs where we add noise, the batch size is always 256. For all runs, we use an initial learning rate of 0.004. We train for 410550 iterations (i.e., minibatches), annealing the learning rate by a factor of 0.1 at the 175950-th and 293250-th iterations.
Our models take around 20 hours to train on a single NVIDIA TitanXp GPU when the batch size is 256. The final performance gap between label-noise or small-minibatch training vs. large-batch or Gaussian-noise training is around 13% accuracy.
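The noisy logit-gradient described above can be sketched as follows (a minimal pure-Python illustration; the softmax cross-entropy gradient p − e_y is standard, while the function names and the handling of flipping and σ_ln are our assumptions, not the paper's code):

```python
import math
import random

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def noisy_ce_grad_wrt_logits(logits, label, sigma_ln, flip_prob=0.0, num_classes=100):
    # optionally flip the label to a uniformly random class
    if random.random() < flip_prob:
        label = random.randrange(num_classes)
    # exact gradient of cross-entropy w.r.t. logits: softmax(logits) - one_hot(label)
    grad = softmax(logits)
    grad[label] -= 1.0
    # add mean-zero Gaussian noise sigma_ln * z with z ~ N(0, I)
    return [g + sigma_ln * random.gauss(0.0, 1.0) for g in grad]
```

One then backpropagates this perturbed logit gradient through the network in place of the exact one; annealing flip_prob and sigma_ln alongside the learning rate matches the schedule described above.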

A.3 ADDITIONAL PLOTS

Here we show empirical evidence that training with Gaussian noise fails to converge to a stationary distribution. We train a VGG19 network on CIFAR100 and plot the norm of the model weights along the training trajectory. As shown in Figure 2, the weight norms of large batch (LB) and large batch with label noise (LB+LN) both converge to some finite value, while the weight norm of large batch with Gaussian noise (LB+GN) keeps increasing and fails to converge.

B PROOF OF THEOREM 3.1

In this section, we first prove several lemmas on which the proof of Theorem 3.1 is built, and then provide the proof of Theorem 3.1.

Definition B.1 (b-bounded coupling). Let v^[0], v^[1], ..., v^[T] be a trajectory of label noise gradient descent with initialization v^[0]. We call the following random sequence ṽ^[t] a b-bounded coupling of v^[t]: starting from ṽ^[0] = v^[0], for each time t < T, if ∥ṽ^[t]∥₁ ≤ b, we let ṽ^[t+1] ≜ v^[t+1]; otherwise, if ∥ṽ^[t]∥₁ > b, we do not update, i.e., ṽ^[t+1] ≜ ṽ^[t].

Lemma B.2. In the setting of Theorem 3.1, assume ∥x^(i)∥_∞ ≤ b_x for all i ∈ [n]. Let η ≤ ρ/(6T b_x²(b₀² + r)), where b₀ = 6τd/ρ. Let ṽ^[t] be the b₀-bounded coupling of v^[t]. If ṽ^[t] is always positive in each dimension, then with probability at least 1 − ρ/3, there is ∥ṽ^[T]∥₁ ≤ b₀.

Proof of Lemma B.2. Recall the update at the t-th iteration:

v^[t+1] = v^[t] − η((v^[t]⊙2 − v⋆⊙2)^⊤ x^(i_t)) x^(i_t) ⊙ v^[t] − η s_t x^(i_t) ⊙ v^[t].

We first bound the increase of ∥ṽ^[t]∥₁ in expectation. When ∥ṽ^[t]∥₁ ≤ b₀, there is

E[ṽ^[t+1]_k] = ṽ^[t]_k − η E[((ṽ^[t]⊙2 − v⋆⊙2)^⊤ x^(i)) x^(i)_k ṽ^[t]_k]    (15)
≤ ṽ^[t]_k + η(∥ṽ^[t]⊙2∥₁ + ∥v⋆⊙2∥₁) b_x² ṽ^[t]_k    (16)
≤ ṽ^[t]_k + η(b₀² + r) b_x² ṽ^[t]_k,    (17)

where the first inequality holds because we can separate the last term into a v^[t]⊙2 part and a v⋆⊙2 part and bound them by ∥v^[t]⊙2∥₁ and ∥v⋆⊙2∥₁ respectively, and the second inequality uses ∥v^[t]∥₂² ≤ ∥v^[t]∥₁² and the sparsity of v⋆.
So, summing over all dimensions, we have E∥ṽ^[t+1]∥₁ ≤ ∥ṽ^[t]∥₁ + ηb₀b_x²(b₀² + r). This bound is obviously also true when ∥ṽ^[t]∥₁ > b₀, in which case ṽ^[t+1] = ṽ^[t]. We then bound the probability of ∥ṽ^[T]∥₁ being too large:

Pr[∥ṽ^[T]∥₁ > b₀] ≤ E∥ṽ^[T]∥₁ / b₀    (18)
≤ (τd + Tηb₀b_x²(b₀² + r)) / b₀    (19)
≤ ρ/3,    (20)

where the first inequality is Markov's inequality, the second is by the previous bound, and the third is by the assumption on η and the definition of b₀.

Proof of Lemma 3.2. Notice that when ∥ṽ^[T]∥₁ ≤ b₀, there is v^[T] = ṽ^[T].

Definition B.3 (b-bounded potential function). For a vector v that is positive in each dimension, we define the b-bounded potential function Φ(v) as follows: if ∥v∥₁ ≤ b, we let Φ(v) ≜ Σ_{k=1}^d √(v_k); otherwise Φ(v) ≜ 0.

Lemma B.4. In the setting of Theorem 3.1, let ϵ₀ = 1/d. Assume ∥x^(i)∥_∞ ≤ b_x for all i ∈ [n] with some b_x > 0, and E_i[(x^(i)_k)²] ≥ 2/3 for all k ∈ [d]. Let b₀ = 6τd/ρ. Assume ηδb_x + η(b₀² + r)b_x² ≤ 1/16, ηδ² ≥ 32(b₀² + r)b_x², and T = ⌈(32/(η²δ²)) log(3d√τ/(ρ√ϵ₀))⌉. Let ṽ^[t] be the b₀-bounded coupling of v^[t], and let Φ(·) be the b₀-bounded potential function. If ṽ^[t] is always positive in each dimension, then with probability at least 1 − ρ/3, there is Φ(ṽ^[T]) ≤ √ϵ₀.

Proof of Lemma B.4. We first show that Φ(ṽ^[t]) decreases exponentially in expectation. If ∥ṽ^[t]∥₁ ≤ b₀, we have:

E[Φ(ṽ^[t+1])] ≤ Σ_{k=1}^d E√(ṽ^[t+1]_k)    (22)
= Σ_{k=1}^d E_{s_t,i_t} √(ṽ^[t]_k − η s_t x^(i_t)_k ṽ^[t]_k − η((ṽ^[t]⊙2 − v⋆⊙2)^⊤ x^(i_t)) x^(i_t)_k ṽ^[t]_k)    (23)
≤ Σ_{k=1}^d √(ṽ^[t]_k) E_{s_t,i_t} √(1 + η s_t x^(i_t)_k + η(b₀² + r)b_x²),    (24)

where the second inequality is because ∥ṽ^[t]∥₂² ≤ ∥ṽ^[t]∥₁² ≤ b₀². Toward bounding the expectation, note that by Taylor's theorem, for the function g(x) = √x, there is

g(1 + x) ≤ g(1) + g′(1)x + ½g″(1)x² + (M/6)|x|³,    (26)

where M is an upper bound on |g‴(1 + x′)| for x′ between 0 and x, which is less than 3 if |x| ≤ 1/2. So if Δ ≜ η s_t x^(i_t)_k + η(b₀² + r)b_x² ∈ [−1/2, 1/2], we have

√(1 + Δ) ≤ 1 + ½Δ − ⅛Δ² + ½|Δ|³.
Also, since E_{s_t,i_t}[Δ] = η(b₀² + r)b_x² and E_{s_t,i_t}[Δ²] ≥ η²δ² E_{i_t}[(x^(i_t)_k)²] ≥ (2/3)η²δ², we have that when |Δ| ≤ 1/16 and ηδ² ≥ 32(b₀² + r)b_x²,

E_{s_t,i_t}[√(1 + Δ)] ≤ E_{s_t,i_t}[1 − (1/16)Δ²] ≤ 1 − (1/32)η²δ².

So E[Φ(ṽ^[t+1])] ≤ (1 − (1/32)η²δ²)Φ(ṽ^[t]). Also notice that when ∥ṽ^[t]∥₁ > b₀, there is Φ(ṽ^[t+1]) = Φ(ṽ^[t]) = 0, so E[Φ(ṽ^[t+1])] ≤ (1 − (1/32)η²δ²)Φ(ṽ^[t]) always holds. Next we prove that Φ(ṽ^[T]) ≤ √ϵ₀ with probability more than 1 − ρ/3. This is because:

Pr[Φ(ṽ^[T]) > √ϵ₀] ≤ E[Φ(ṽ^[T])] / √ϵ₀    (29)
≤ (1 − (1/32)η²δ²)^T d√τ / √ϵ₀    (30)
≤ ρ/3,    (31)

where the first inequality is by Markov's inequality, the second is by the previous inequality, and the last is because T = ⌈(32/(η²δ²)) log(3d√τ/(ρ√ϵ₀))⌉.

Proof of Theorem 3.1. Let ρ = 0.01 and ϵ₀ = 1/d. By Lemma G.1 and Lemma G.2, when n ≥ Θ(log d), with probability at least 1 − ρ/3 there is ∥x^(i)∥_∞ ≤ b_x for all i ∈ [n] with some b_x = Θ(√(log(nd))), and E_i[(x^(i)_k)²] ≥ 2/3 for all k ∈ [d]. Let b₀ = 6τd/ρ. We define η and δ such that when T = ⌈(32/(η²δ²)) log(3d√τ/(ρ√ϵ₀))⌉, the assumptions η ≤ ρ/(6T b_x²(b₀² + r)) and ṽ^[t] always being positive in Lemma B.2, and the assumptions ηδb_x + η(b₀² + r)b_x² ≤ 1/16 and ηδ² ≥ 32(b₀² + r)b_x² in Lemma B.4, are all satisfied. Assume δ ≥ (6·32² b_x³(b₀² + r)/ρ) log(3d√τ/(ρ√ϵ₀)); then we only need η ∈ [(6·32 b_x²(b₀² + r)/(ρδ²)) log(3d√τ/(ρ√ϵ₀)), 1/(32δb_x)], and all the above assumptions hold. Let ṽ^[t] be the b₀-bounded coupling of v^[t]. According to Lemma B.4, with probability at least 1 − ρ/3, Φ(ṽ^[T]) ≤ √ϵ₀, which means that either Σ_{k=1}^d √(ṽ^[T]_k) ≤ √ϵ₀ or ∥ṽ^[T]∥₁ > b₀. According to Lemma B.2, with probability at most ρ/3, ∥ṽ^[T]∥₁ > b₀. Combining these two statements, with probability at least 1 − 2ρ/3, ∥ṽ^[T]∥₁ ≤ b₀ and Σ_{k=1}^d √(ṽ^[T]_k) ≤ √ϵ₀.
Notice that ∥ṽ^[T]∥₁ ≤ b₀ implies v^[T] = ṽ^[T], while Σ_k √(ṽ^[T]_k) ≤ √ϵ₀ implies ṽ^[T]_k ≤ ϵ₀ for every dimension k, so we have finished the proof of the upper bound. We then give a lower bound for each dimension of ṽ^[T]. We can bound the decrease of any dimension k at time t:

ṽ^[t+1]_k ≥ (1 − ηδ − η(b₀² + r))ṽ^[t]_k    (32)
≥ (1 − 2ηδ)ṽ^[t]_k,    (33)

where the first inequality is by the update rule and the second is because δ > b₀² + r. Plugging in the value of T, we have

ṽ^[T]_k ≥ (1 − 2ηδ)^T τ    (34)
> τ exp(−(64/(ηδ)) log(3d√τ/(ρ√ϵ₀))).    (35)

C PROOF OF THEOREM 3.3

In this section, we first prove several lemmas on which the proof of Theorem 3.3 is built, and then provide the proof of Theorem 3.3.

Definition C.1 ((b, ϵ)-bounded coupling). Let v^[0], v^[1], ..., v^[T] be a trajectory of label noise gradient descent with initialization v^[0]. Recall that S ⊂ [d] is the support set of v⋆; we denote by ṽ^[t]_S the r-dimensional vector composed of the dimensions of ṽ^[t] in S, and by ṽ^[t]_S̄ the other d − r dimensions. We call the following random sequence ṽ^[t] a (b, ϵ)-bounded coupling of v^[t]: starting from ṽ^[0] = v^[0], for each time t < T, if ∥ṽ^[t]_S̄∥₁ ≤ ϵ and ∥ṽ^[t]_S∥_∞ ≤ b, we let ṽ^[t+1] ≜ v^[t+1]; otherwise ṽ^[t+1] ≜ ṽ^[t].

Lemma C.2. In the setting of Theorem 3.3, let ρ ≜ 1/100, c₁ ≜ 1/10, ε̃₁ ≜ 12/ρ, and C_x ≜ max_{j≠k} |E_i[x^(i)_j x^(i)_k]|. Assume ∥x^(i)∥_∞ ≤ b_x for all i ∈ [n] for some b_x > 0, and E_i[(x^(i)_k)²] ≥ 2/3 for all k ∈ [d]. Let ṽ^[t] be a (1 + c₁, ε̃₁)-bounded coupling of v^[t]. Assume c₁²/(8ηδ²b_x²) ≥ log(6rT²/ρ), (ε̃₁² + r)C_x b_x² ≤ c₁/20, and δ ≥ b_x(ε̃₁² + r). Then, with probability at least 1 − ρ/6, there is ∥ṽ^[T]_S∥_∞ ≤ 1 + c₁.

Proof of Lemma C.2. For any fixed 1 ≤ t₁ < t₂ ≤ T and dimension k ∈ S, we consider the event that ṽ^[t₁]_k ∈ [1 + c₁/3, 1 + c₁/2], and that t₂ is the first time in the trajectory at which ṽ^[t₂]_k > 1 + c₁.
We first bound the probability of this event happening, i.e., the following quantity:

Pr[ṽ^[t₂]_k > 1 + c₁ ∧ ṽ^[t₁]_k ≤ 1 + c₁/2 ∧ ṽ^[t₁:t₂]_k ∈ [1 + c₁/3, 1 + c₁]],    (36)

where ṽ^[t₁:t₂]_k ∈ [1 + c₁/3, 1 + c₁] means that for all t with t₁ ≤ t < t₂, there is 1 + c₁/3 ≤ ṽ^[t]_k ≤ 1 + c₁.

Notice that when ∥ṽ^[t]_S̄∥₁ ≤ ε̃₁ and ∥ṽ^[t]_S∥_∞ ≤ 1 + c₁ and ṽ^[t₁:t+1]_k ∈ [1 + c₁/3, 1 + c₁], there is

E[ṽ^[t+1]_k − 1] = E_{s_t,i_t}[(1 + η s_t x^(i_t)_k − η((ṽ^[t]⊙2 − v⋆⊙2)^⊤ x^(i_t)) x^(i_t)_k) ṽ^[t]_k − 1]
≤ (ṽ^[t]_k − 1) − (2/3)η ṽ^[t]_k (ṽ^[t]_k + 1)(ṽ^[t]_k − 1) + η(ε̃₁²C_x + rC_x) b_x² ṽ^[t]_k    (38)
≤ (1 − η)(ṽ^[t]_k − 1),    (39)

where the first inequality is because ∥ṽ^[t]_S̄∥₂² ≤ ε̃₁² and E_{i_t}[(x^(i_t)_k)²] ≥ 2/3, and the second inequality is because (ε̃₁² + r)b_x²C_x ≤ c₁/20. Also, we can bound the variance of this martingale as

Var[ṽ^[t+1]_k − 1 | ṽ^[t]] = Var[η s_t x^(i_t)_k ṽ^[t]_k] + Var[η((ṽ^[t]⊙2 − v⋆⊙2)^⊤ x^(i_t)) x^(i_t)_k ṽ^[t]_k]    (40)
≤ (ηδb_x(1 + c₁))² + η²(ε̃₁² + r)²b_x⁴(1 + c₁)²    (41)
≤ 4η²δ²b_x²,    (42)

where the equality is because η s_t x^(i_t)_k ṽ^[t]_k is mean-zero, the first inequality is by ∥x^(i)∥_∞ ≤ b_x, and the second inequality is by δ ≥ b_x(ε̃₁² + r). By Lemma G.3, we have

Pr[ṽ^[t₂]_k − 1 > c₁]    (43)
≤ exp(−c₁² / (8η²δ²b_x² Σ_{t=0}^{t₂−t₁−1}(1 − η)^{2t}))    (44)
≤ exp(−c₁²/(8ηδ²b_x²)),    (45)

where the first inequality is by Lemma G.3 and the second follows by summing the geometric series in the denominator. Finally, we finish the proof with a union bound. If ∥ṽ^[T]_S∥_∞ > 1 + c₁, the event in Equation (36) has to happen for some k ∈ S and 1 ≤ t₁ < t₂ ≤ T, so we have

Pr[∥ṽ^[T]_S∥_∞ > 1 + c₁]    (46)
≤ Σ_{k∈S} Σ_{1≤t₁<t₂≤T} Pr[ṽ^[t₂]_k > 1 + c₁ ∧ ṽ^[t₁]_k ≤ 1 + c₁/2 ∧ ṽ^[t₁:t₂]_k ∈ [1 + c₁/3, 1 + c₁]]    (47)
≤ rT² exp(−c₁²/(8ηδ²b_x²))    (48)
≤ ρ/6,    (49)

where the last inequality is by assumption.

Lemma C.3. In the setting of Lemma C.2, assume (ε̃₁² + r)C_x ≤ ρ/(12Tηb_x²). Then, with probability at least 1 − ρ/6, there is ∥ṽ^[T]_S̄∥₁ ≤ ε̃₁.

Proof of Lemma C.3. We first bound the increase of ∥ṽ^[t]_S̄∥₁ in expectation. When ∥ṽ^[t]_S∥_∞ ≤ 1 + c₁ and ∥ṽ^[t]_S̄∥₁ ≤ ε̃₁, for any k ∉ S, there is:

E[ṽ^[t+1]_k] = ṽ^[t]_k − η E_i[((ṽ^[t]⊙2 − v⋆⊙2)^⊤ x^(i)) x^(i)_k ṽ^[t]_k]    (50)
≤ ṽ^[t]_k + η(ε̃₁² + r)C_x b_x² ṽ^[t]_k,    (51)

because we can bound the contributions of the dimensions in S and those not in S separately.
So, summing over all dimensions not in S, we have E∥ṽ^[t+1]_S̄∥₁ ≤ ∥ṽ^[t]_S̄∥₁ + η(ε̃₁² + r)C_x b_x² ε̃₁. This bound is obviously also true when ∥ṽ^[t]_S∥_∞ > 1 + c₁ or ∥ṽ^[t]_S̄∥₁ > ε̃₁, in which case ṽ^[t+1] = ṽ^[t]. We then bound the probability of ∥ṽ^[T]_S̄∥₁ being too large:

Pr[∥ṽ^[T]_S̄∥₁ > ε̃₁] ≤ E∥ṽ^[T]_S̄∥₁ / ε̃₁    (52)
≤ (1 + Tηε̃₁(ε̃₁² + r)C_x b_x²) / ε̃₁    (53)
≤ ρ/6,    (54)

where the first inequality is Markov's inequality, the second is because ∥ṽ^[0]_S̄∥₁ ≤ 1 (every dimension is less than 1/d), and the third is because ε̃₁ = 12/ρ and (ε̃₁² + r)C_x ≤ ρ/(12Tηb_x²).

Lemma C.4. In the setting of Lemma C.2, assume (ε̃₁² + r)C_x b_x² < c₁/12 − c₁²/4, ηδ² ≤ c₁/8, Tη ≥ (16/c₁) log(1/ϵ_min), and Tδ² ≥ (2⁹/c₁²) log(6r/ρ). Then, for any k ∈ S, with probability at least 1 − ρ/(6r), either max_{t≤T} ṽ^[t]_k ≥ 1 − c₁/2, or ∥ṽ^[T]_S∥_∞ > 1 + c₁, or ∥ṽ^[T]_S̄∥₁ > ε̃₁.

Proof of Lemma C.4. Fix k ∈ S. Let v̄^[t] be the following coupling of ṽ^[t]: starting from v̄^[0] = ṽ^[0], for each time t < T, if ∥ṽ^[t]_S̄∥₁ ≤ ε̃₁ and ∥ṽ^[t]_S∥_∞ ≤ 1 + c₁ and ṽ^[t]_k ≤ 1 − c₁/2, we let v̄^[t+1] ≜ ṽ^[t+1]; otherwise v̄^[t+1] ≜ (1 + (c₁/2)η)v̄^[t]. Intuitively, whenever ṽ^[t] exceeds the proper range, we only multiply v̄^[t] by 1 + (c₁/2)η afterwards; otherwise we let it be the same as ṽ^[t].

We first show that E[log v̄^[t+1]_k | v̄^[t]] ≥ log(1 + (c₁/4)η) + log v̄^[t]_k, so that −t log(1 + (c₁/4)η) + log v̄^[t]_k is a submartingale. This is obviously true if ∥ṽ^[t]_S̄∥₁ > ε̃₁ or ∥ṽ^[t]_S∥_∞ > 1 + c₁ or ṽ^[t]_k > 1 − c₁/2. Otherwise, there is

E[log v̄^[t+1]_k | v̄^[t]] = E[log ṽ^[t+1]_k | ṽ^[t]]    (55)
= E_{s_t,i_t}[log(1 + η s_t x^(i_t)_k − η((ṽ^[t]⊙2 − v⋆⊙2)^⊤ x^(i_t)) x^(i_t)_k)] + log ṽ^[t]_k    (56)
≥ E_{s_t}[log(1 + η s_t + (2/3)η(1 − (ṽ^[t]_k)²) − η(ε̃₁² + r)C_x b_x²)] + log ṽ^[t]_k    (57)
≥ log(1 + (c₁/4)η) + log ṽ^[t]_k,    (58)

where the first inequality is by the update rule, and the second is because (ε̃₁² + r)C_x b_x² < c₁/12 − c₁²/4, 4ηδ² ≤ c₁/2, and δ ≥ ε̃₁² + r.
So by Azuma's inequality, we have

Pr[v̄^[T]_k < 1 − c₁/2]    (59)
≤ exp(−2(T log(1 + (c₁/4)η) + log ϵ_min − log(1 − c₁))² / (T(2ηδ)²))    (60)
≤ exp(−((1/2)T log(1 + (c₁/4)η))² / (2Tη²δ²))    (61)
≤ exp(−Tc₁²/(2⁹δ²))    (62)
≤ ρ/(6r),    (63)

where the first inequality is by Azuma's inequality and Var[log v̄^[t+1]_k | v̄^[t]] ≤ (2ηδ)², the second inequality is because T log(1 + (c₁/4)η) ≥ 2 log(1/ϵ_min), which is true because Tη ≥ (16/c₁) log(1/ϵ_min), the third inequality is because log(1 + (c₁/4)η) ≥ (c₁/8)η, and the last inequality is because Tδ² ≥ (2⁹/c₁²) log(6r/ρ).

Lemma C.5. In the setting of Lemma C.2, assume (ε̃₁² + r)C_x b_x² ≤ c₁/20 and c₁²/(8ηδ²) ≥ log(6rT²/ρ). Then, for any k ∈ S, with probability at least 1 − ρ/(6r), either max_{t<T} ṽ^[t]_k < 1 − c₁/2 or ṽ^[T]_k ≥ 1 − c₁.

Proof of Lemma C.5. For any fixed 1 ≤ t₁ < t₂ ≤ T and dimension k ∈ S, we consider the event that ṽ^[t₁]_k ∈ [1 − c₁/2, 1 − c₁/3], and that t₂ is the first time in the trajectory at which ṽ^[t₂]_k < 1 − c₁. We first bound the probability of this event happening, i.e., the following quantity:

Pr[ṽ^[t₂]_k < 1 − c₁ ∧ ṽ^[t₁]_k ≥ 1 − c₁/2 ∧ ṽ^[t₁:t₂]_k ∈ [1 − c₁, 1 − c₁/3]],    (64)

where ṽ^[t₁:t₂]_k ∈ [1 − c₁, 1 − c₁/3] means that for all t with t₁ ≤ t < t₂, there is 1 − c₁ ≤ ṽ^[t]_k ≤ 1 − c₁/3. Notice that when ∥ṽ^[t]_S̄∥₁ ≤ ε̃₁ and ∥ṽ^[t]_S∥_∞ ≤ 1 + c₁ and ṽ^[t₁:t+1]_k ∈ [1 − c₁, 1 − c₁/3], there is

E[1 − ṽ^[t+1]_k] = E_{s_t,i_t}[1 − (1 + η s_t x^(i_t)_k − η((ṽ^[t]⊙2 − v⋆⊙2)^⊤ x^(i_t)) x^(i_t)_k) ṽ^[t]_k]    (65)
≤ (1 − ṽ^[t]_k) − (2/3)η ṽ^[t]_k (ṽ^[t]_k + 1)(1 − ṽ^[t]_k) + η(ε̃₁² + r)C_x b_x² ṽ^[t]_k    (66)
≤ (1 − η)(1 − ṽ^[t]_k),    (67)

where the first inequality is because ∥ṽ^[t]_S̄∥₂² ≤ ε̃₁², and the second is because (ε̃₁² + r)C_x b_x² ≤ c₁/20. Also, we can bound the variance of this martingale as

Var[1 − ṽ^[t+1]_k | ṽ^[t]_k] ≤ (2ηδ)².    (68)

By Lemma G.3, we have

Pr[1 − ṽ^[t₂]_k > c₁]    (69)
≤ exp(−c₁² / (8η²δ² Σ_{t=0}^{t₂−t₁−1}(1 − η)^{2t}))    (70)
≤ exp(−c₁²/(8ηδ²)),    (71)

where the first inequality is by Lemma G.3 and the second follows by summing the geometric series in the denominator. Finally, we finish the proof with a union bound.
If max_{t<T} ṽ^[t]_k > 1 − c₁/2 but ṽ^[T]_k < 1 − c₁, the event in Equation (64) has to happen for some 1 ≤ t₁ < t₂ ≤ T, so we have

Pr[max_{t<T} ṽ^[t]_k > 1 − c₁/2 ∧ ṽ^[T]_k < 1 − c₁]    (72)
≤ Σ_{1≤t₁<t₂≤T} Pr[ṽ^[t₂]_k < 1 − c₁ ∧ ṽ^[t₁]_k ≥ 1 − c₁/2 ∧ ṽ^[t₁:t₂]_k ∈ [1 − c₁, 1 − c₁/3]]    (73)
≤ T² exp(−c₁²/(8ηδ²))    (74)
≤ ρ/(6r).    (75)

Definition C.6 ((b, ϵ)-bounded potential function). For a vector v that is positive in each dimension, we define the (b, ϵ)-bounded potential function Φ(v) as follows: if ∥v_S̄∥₁ ≤ ϵ and ∥v_S∥_∞ ≤ b, we let Φ(v) ≜ Σ_{k∉S} √(v_k); otherwise Φ(v) ≜ 0.

Lemma C.7. In the setting of Lemma C.2, assume (2/3)ηδ² > 32(ε̃₁² + r)C_x b_x² and Tη²δ² ≥ 16 log(6√d/(ρ√ϵ₁)). Then, with probability at least 1 − ρ/6, there is Φ(ṽ^[T]) ≤ √ϵ₁.

Proof of Lemma C.7. We first show that Φ(ṽ^[t]) decreases exponentially in expectation. For any 0 ≤ t ≤ T, if ∥ṽ^[t]_S∥_∞ ≤ 1 + c₁ and ∥ṽ^[t]_S̄∥₁ ≤ ε̃₁, we have:

E[Φ(ṽ^[t+1])] ≤ Σ_{k∉S} E√(ṽ^[t+1]_k)    (76)
= Σ_{k∉S} E_{s_t,i_t} √(ṽ^[t]_k + η s_t x^(i_t)_k ṽ^[t]_k − η((ṽ^[t]⊙2 − v⋆⊙2)^⊤ x^(i_t)) x^(i_t)_k ṽ^[t]_k)    (77)
≤ Σ_{k∉S} √(ṽ^[t]_k) E_{s_t,i_t} √(1 + η s_t x^(i_t)_k + η(ε̃₁² + r)C_x b_x²)    (78)
≤ (1 − (1/16)η²δ²)Φ(ṽ^[t]),    (79)

where the second inequality is because ∥ṽ^[t]_S̄∥₂² ≤ ε̃₁², and the last inequality is by Taylor expansion and (2/3)ηδ² > 32(ε̃₁² + r)C_x b_x². Also notice that when ∥ṽ^[t]_S∥_∞ > 1 + c₁ or ∥ṽ^[t]_S̄∥₁ > ε̃₁, there is Φ(ṽ^[t+1]) = Φ(ṽ^[t]) = 0, so E[Φ(ṽ^[t+1])] ≤ (1 − (1/16)η²δ²)Φ(ṽ^[t]) always holds. Next we bound the probability of Φ(ṽ^[T]) > √ϵ₁:

Pr[Φ(ṽ^[T]) > √ϵ₁] ≤ E[Φ(ṽ^[T])]/√ϵ₁    (80)
≤ (1 − (1/16)η²δ²)^T √d/√ϵ₁    (81)
≤ e^{−(1/16)Tη²δ²} √d/√ϵ₁    (82)
≤ ρ/6,    (83)

where the first inequality is by Markov's inequality, the second is by the previous inequality and Φ(v^[0]) ≤ √d initially, the third is by 1 − x ≤ e^{−x} for any x ∈ R, and the last is by Tη²δ² ≥ 16 log(6√d/(ρ√ϵ₁)).

Proof of Theorem 3.3.
Let ρ = 0.01, c₁ = 0.1, ε̃₁ ≜ 12/ρ, and C_x ≜ max_{j≠k} |E_i[x^(i)_j x^(i)_k]|. Let b_x = 2√(log(30d²/ρ)) = Θ(1). According to Lemma G.1, when n ≤ d, with probability at least 1 − ρ/15 we have ∥x^(i)∥_∞ ≤ b_x for all i ∈ [n]. Let δ be a positive number such that (16/δ²) log(6√d/(ρϵ_min√ϵ₁)) ≤ 1 and δ ≥ b_x(ε̃₁² + r). (Since ϵ_min ≥ exp(−O(1)), this means δ ≥ Θ(r + log(1/ϵ₁)).) Let P = c₁²/(32δ²b_x² log(6r/ρ)), Q = 2 log(1/P), η = min{P/Q, 16/δ²} = Θ(1/δ²), and T = (16/(η²δ²)) log(6√d/(ρϵ_min√ϵ₁)) = Θ(log(1/ϵ₁)/η). Assume C_x b_x²(ε̃₁² + r) ≤ min{ηδ²/48, ρ/(12Tη)} = Θ(ρ/log(1/ϵ₁)). (This means C_x ≤ Θ(ρ/(r log(1/ϵ₁))).) We now show that the assumptions in the previous lemmas are all satisfied. The assumption c₁²/(8ηδ²b_x²) ≥ log(6rT²/ρ) in Lemma C.2 is satisfied because

c₁²/(8ηδ²b_x²) ≥ log(6rT²/ρ)    (84)
⇐ c₁²/(8ηδ²b_x²) ≥ log(6r/ρ) + 4 log(1/η)    (85)
⇐ η log(1/η) ≤ c₁²/(32δ²b_x² log(6r/ρ)) = P,    (86)

where the first implication is by T ≤ 1/η², the second is by log(6r/ρ) + 4 log(1/η) ≤ 4 log(6r/ρ) log(1/η), and the last line holds because

η log(1/η) ≤ (P/Q) log(Q/P)    (87)
= P(log Q/Q + log(1/P)/Q)    (88)
≤ P.    (89)

The assumption δ ≥ b_x(ε̃₁² + r) in Lemma C.2 is satisfied by the definition of δ. The assumption (ε̃₁² + r)C_x b_x² ≤ c₁/20 in Lemma C.2 is satisfied because

C_x b_x²(ε̃₁² + r) ≤ ηδ²/48 ≤ c₁²/96 ≤ c₁/20,    (90)

where we use ηδ² ≤ δ²P/Q ≤ c₁²/2.    (91)

The assumption (ε̃₁² + r)C_x b_x² ≤ ρ/(12Tη) in Lemma C.3 is satisfied by the assumption on C_x. The assumption Tδ² ≥ (2⁹/c₁²) log(6r/ρ) in Lemma C.4 is satisfied because

Tδ² ≥ (16/(η²δ²)) log(6r/ρ) ≥ (2⁶/c₁⁴) log(6r/ρ) ≥ (2⁹/c₁²) log(6r/ρ).    (92)

The other two assumptions in Lemma C.4, (ε̃₁² + r)C_x b_x² < c₁/12 − c₁²/4 and ηδ² ≤ c₁/8, follow from C_x b_x²(ε̃₁² + r) < ηδ²/48 and ηδ² ≤ c₁²/2. The assumption Tη ≥ (16/c₁) log(1/ϵ_min) in Lemma C.4 is satisfied by the definition of T.
The assumptions (ε̃₁² + r)C_x b_x² ≤ c₁/20 and c₁²/(8ηδ²) ≥ log(6rT²/ρ) in Lemma C.5 are satisfied for the same reason as in Lemma C.2. The assumption Tη²δ² ≥ 16 log(6√d/(ρ√ϵ₁)) in Lemma C.7 is satisfied by the definition of T, and the assumption (2/3)ηδ² > 32(ε̃₁² + r)C_x b_x² in Lemma C.7 is satisfied by the assumption on C_x. Since the data are randomly sampled from N(0, I), with n ≥ Θ((r log(1/ϵ₁)/ρ)²) data points, with probability at least 1 − ρ/18 there is C_x b_x²(ε̃₁² + r) ≤ min{ηδ²/48, ρ/(12Tη)}. Meanwhile, according to Lemma G.2, with n ≥ Θ(1) data points, with probability at least 1 − ρ/18 there is E_i[(x^(i)_k)²] ≥ 2/3 for all k ∈ [d]. By the definition of b_x, when n ≤ d, with probability at least 1 − ρ/18 there is also ∥x^(i)∥_∞ ≤ b_x for all i ∈ [n]. In summary, with d ≥ n ≥ Θ((r log(1/ϵ₁)/ρ)²) data points, with probability at least 1 − ρ/6 we have C_x b_x²(ε̃₁² + r) ≤ min{ηδ²/48, ρ/(12Tη)}, E_i[(x^(i)_k)²] ≥ 2/3 for all k ∈ [d], and ∥x^(i)∥_∞ ≤ b_x for all i ∈ [n]. Now we use these lemmas to finish the proof of the theorem. Let ṽ^[t] be a (1 + c₁, ε̃₁)-bounded coupling of v^[t]; we only need to prove that with probability at least 1 − ρ, there is ∥ṽ^[T]_S − 1∥_∞ ≤ c₁ and ∥ṽ^[T]_S̄∥₁ ≤ ϵ₁, which follows from a union bound over the previous propositions. In particular, Lemma C.2 and Lemma C.3 tell us that the probability of ∥ṽ^[T]_S∥_∞ > 1 + c₁ or ∥ṽ^[T]_S̄∥₁ > ε̃₁ is at most ρ/3. Lemma C.4 and Lemma C.5 tell us that for any k ∈ S, the probability of ṽ^[T]_k < 1 − c₁ and ∥ṽ^[T]_S∥_∞ ≤ 1 + c₁ and ∥ṽ^[T]_S̄∥₁ ≤ ε̃₁ is at most ρ/(3r). Lemma C.7 tells us that the probability of ∥ṽ^[T]_S̄∥₁ > ϵ₁ and ∥ṽ^[T]_S∥_∞ ≤ 1 + c₁ and ∥ṽ^[T]_S̄∥₁ ≤ ε̃₁ is at most ρ/6. Combining these tells us that the probability of ∥ṽ^[T]_S − 1∥_∞ > c₁ or ∥ṽ^[T]_S̄∥₁ > ϵ₁ is at most ρ.

D PROOF OF CONVERGENCE TO GROUND TRUTH

The conclusion of Theorem 3.3 still allows constant error on the support, namely ∥v_S − v⋆_S∥_∞ ≤ 1/10. The following theorem shows that further annealing the learning rate lets the algorithm fully converge to v⋆ with any target error. The end-to-end proof of convergence to the ground truth can be found in the next section (Section E).

Theorem D.1. Let s ≥ 2 be the index of the current round of bootstrapping, and let the constant c₀ = 1/10. In the setting of Theorem 2.1, assume v^[0] is an initial parameter satisfying ∥v^[0]_S − v⋆_S∥_∞ ≤ c_{s−1} and ∥v^[0]_S̄ − v⋆_S̄∥₁ ≤ ϵ_{s−1}, where 0 < ϵ_{s−1} ≤ c_{s−1} ≤ c₀. Given a failure rate ρ > 0, assume n ≥ Θ(r²). Suppose we run SGD with label noise with noise level δ ≥ 0 and learning rate η ≤ Θ(c_s²/(δ² + r²)) for T = log(4/c₀)/η iterations. Then, with probability at least 1 − ρ over the randomness of the algorithm and the data, there is ∥v^[T]_S − v⋆_S∥_∞ ≤ c_s ≜ c_{s−1}c₀ and ∥v^[T]_S̄ − v⋆_S̄∥₁ ≤ ϵ_s ≜ (4/c₀)^{2c_{s−1}} ϵ_{s−1}. Here Θ(·) omits poly-logarithmic dependency on ρ.

In the rest of this section, we first prove several lemmas on which the proof of Theorem D.1 is built, and then provide the proof of Theorem D.1.

Definition D.2 ((b, ϵ)-to-v⋆ coupling). Let v^[0], v^[1], ..., v^[T] be a trajectory of label noise gradient descent with initialization v^[0]. Recall that S ⊂ [d] is the support set of v⋆. We call the following random sequence ṽ^[t] a (b, ϵ)-to-v⋆ coupling of v^[t]: starting from ṽ^[0] = v^[0], for each time t < T, if ∥ṽ^[t]_S̄∥₁ ≤ ϵ and ∥ṽ^[t]_S − 1∥_∞ ≤ b, we let ṽ^[t+1] ≜ v^[t+1]; otherwise ṽ^[t+1] ≜ ṽ^[t].

Lemma D.3. In the setting of Theorem D.1, let C_x ≜ max_{j≠k} |E_i[x^(i)_j x^(i)_k]|. Assume ∥x^(i)∥_∞ ≤ b_x for all i ∈ [n] for some b_x > 0, and E_i[(x^(i)_k)²] ≥ 2/3 for all k ∈ [d]. Let ṽ^[t] be a (2c_{s−1}, ϵ_s)-to-v⋆ coupling of v^[t]. Assume c_{s−1}²/(2ηb_x²(δ² + b_x²(ϵ_s² + r)²)) ≥ log(10rT²/ρ) and (ϵ_s² + 4rc_{s−1})C_x b_x² ≤ c_{s−1}/10.
Then, with probability at least 1 − ρ/5, there is ∥ṽ^[T]_S − 1∥_∞ ≤ 2c_{s−1}.

Proof of Lemma D.3. For any fixed 1 ≤ t₁ < t₂ ≤ T and dimension k ∈ S, we consider the event that ṽ^[t₁]_k ∈ [1 + (2/3)c_{s−1}, 1 + c_{s−1}], and that t₂ is the first time in the trajectory at which ṽ^[t₂]_k > 1 + 2c_{s−1}. We first bound the probability of this event happening, i.e., the following quantity:

Pr[ṽ^[t₂]_k − 1 > 2c_{s−1} ∧ ṽ^[t₁]_k − 1 ≤ c_{s−1} ∧ ṽ^[t₁:t₂]_k ∈ [1 + (2/3)c_{s−1}, 1 + 2c_{s−1}]],    (93)

where ṽ^[t₁:t₂]_k ∈ [1 + (2/3)c_{s−1}, 1 + 2c_{s−1}] means that for all t with t₁ ≤ t < t₂, there is 1 + (2/3)c_{s−1} ≤ ṽ^[t]_k ≤ 1 + 2c_{s−1}. Notice that when ∥ṽ^[t]_S̄∥₁ ≤ ϵ_s and ∥ṽ^[t]_S − 1∥_∞ ≤ 2c_{s−1} and ṽ^[t₁:t+1]_k ∈ [1 + (2/3)c_{s−1}, 1 + 2c_{s−1}], there is

E[ṽ^[t+1]_k − 1] = E_{s_t,i_t}[(1 + η s_t x^(i_t)_k − η((ṽ^[t]⊙2 − v⋆⊙2)^⊤ x^(i_t)) x^(i_t)_k) ṽ^[t]_k − 1]    (94)
≤ (ṽ^[t]_k − 1) − (2/3)η ṽ^[t]_k (ṽ^[t]_k + 1)(ṽ^[t]_k − 1) + η(ϵ_s² + 4rc_{s−1})C_x b_x² ṽ^[t]_k    (95)
≤ (1 − η)(ṽ^[t]_k − 1),    (96)

where the first inequality is because ∥ṽ^[t]_S̄∥₂² ≤ ϵ_s² and the properties of the data, and the second is because (ϵ_s² + 4rc_{s−1})b_x²C_x ≤ c_{s−1}/10. Also, we can bound the variance of this martingale as

Var[ṽ^[t+1]_k − 1 | ṽ^[t]] = Var[η s_t x^(i_t)_k ṽ^[t]_k] + Var[η((ṽ^[t]⊙2 − v⋆⊙2)^⊤ x^(i_t)) x^(i_t)_k ṽ^[t]_k]    (97)
≤ (ηδb_x(1 + 2c_{s−1}))² + η²(ϵ_s² + r)²b_x⁴(1 + 2c_{s−1})²    (98)
≤ 4η²b_x²(δ² + b_x²(ϵ_s² + r)²).    (99)

By Lemma G.3, we have

Pr[ṽ^[t₂]_k − 1 > 2c_{s−1} ∧ ṽ^[t₁]_k − 1 ≤ c_{s−1} ∧ ṽ^[t₁:t₂]_k ∈ [1 + (2/3)c_{s−1}, 1 + 2c_{s−1}]]    (100)
≤ exp(−c_{s−1}² / (2η²b_x²(δ² + b_x²(ϵ_s² + r)²) Σ_{t=0}^{t₂−t₁−1}(1 − η)^{2t}))    (101)
≤ exp(−c_{s−1}²/(2ηb_x²(δ² + b_x²(ϵ_s² + r)²))),    (102)

where the first inequality is by Lemma G.3 and the second follows by summing the geometric series in the denominator. Similarly, we bound

Pr[1 − ṽ^[t₂]_k > 2c_{s−1} ∧ 1 − ṽ^[t₁]_k ≤ c_{s−1} ∧ ṽ^[t₁:t₂]_k ∈ [1 − 2c_{s−1}, 1 − (2/3)c_{s−1}]].    (103)
Notice that when ∥ṽ^[t]_S̄∥₁ ≤ ϵ_s and ∥ṽ^[t]_S − 1∥_∞ ≤ 2c_{s−1} and ṽ^[t₁:t+1]_k ∈ [1 − 2c_{s−1}, 1 − (2/3)c_{s−1}], there is

E[1 − ṽ^[t+1]_k] = E_{s_t,i_t}[1 − (1 + η s_t x^(i_t)_k − η((ṽ^[t]⊙2 − v⋆⊙2)^⊤ x^(i_t)) x^(i_t)_k) ṽ^[t]_k]    (104)
≤ (1 − ṽ^[t]_k) − (2/3)η ṽ^[t]_k (ṽ^[t]_k + 1)(1 − ṽ^[t]_k) + η(ϵ_s² + 4rc_{s−1})C_x b_x² ṽ^[t]_k    (105)
≤ (1 − η)(1 − ṽ^[t]_k),    (106)

where the first inequality is because ∥ṽ^[t]_S̄∥₂² ≤ ϵ_s² and the properties of the data, and the second is because (ϵ_s² + 4rc_{s−1})C_x b_x² ≤ c_{s−1}/10. So

Pr[1 − ṽ^[t₂]_k > 2c_{s−1} ∧ 1 − ṽ^[t₁]_k ≤ c_{s−1} ∧ ṽ^[t₁:t₂]_k ∈ [1 − 2c_{s−1}, 1 − (2/3)c_{s−1}]]    (107)
≤ exp(−c_{s−1}² / (2η²b_x²(δ² + b_x²(ϵ_s² + r)²) Σ_{t=0}^{t₂−t₁−1}(1 − η)^{2t}))    (108)
≤ exp(−c_{s−1}²/(2ηb_x²(δ² + b_x²(ϵ_s² + r)²))),    (109)

where the first inequality is by Lemma G.3 and the second follows by summing the geometric series in the denominator. Finally, we finish the proof with a union bound. If ∥ṽ^[T]_S − 1∥_∞ > 2c_{s−1}, either the event in Equation (93) or the event in Equation (103) has to happen for some k ∈ S and 1 ≤ t₁ < t₂ ≤ T, so we have

Pr[∥ṽ^[T]_S − 1∥_∞ > 2c_{s−1}]    (110)
≤ Σ_{k∈S} Σ_{1≤t₁<t₂≤T} Pr[ṽ^[t₂]_k − 1 > 2c_{s−1} ∧ ṽ^[t₁]_k − 1 ≤ c_{s−1} ∧ ṽ^[t₁:t₂]_k ∈ [1 + (2/3)c_{s−1}, 1 + 2c_{s−1}]]    (111)
+ Σ_{k∈S} Σ_{1≤t₁<t₂≤T} Pr[1 − ṽ^[t₂]_k > 2c_{s−1} ∧ 1 − ṽ^[t₁]_k ≤ c_{s−1} ∧ ṽ^[t₁:t₂]_k ∈ [1 − 2c_{s−1}, 1 − (2/3)c_{s−1}]]    (112)
≤ 2rT² exp(−c_{s−1}²/(2ηb_x²(δ² + b_x²(ϵ_s² + r)²)))    (113)
≤ ρ/5,    (114)

where the first inequality is by the union bound, the second is by the previous results, and the third is by the assumption of this lemma.

Lemma D.4. In the setting of Lemma D.3, assume (ϵ_s² + 4c_{s−1}r)C_x b_x² ≤ c_{s−1}, ϵ_s > (1 + ηc_{s−1})^T ϵ_{s−1}, and ((1 + ηc_{s−1})^{−T}ϵ_s − ϵ_{s−1})² / (2Tη²b_x²(δ² + b_x²(ϵ_s² + r)²)ϵ_s²) ≥ log(5/ρ). Then, with probability at least 1 − ρ/5, there is ∥ṽ^[T]_S̄∥₁ ≤ ϵ_s.

Proof of Lemma D.4.
When ∥ṽ^[t]_S − 1∥_∞ ≤ 2c_{s−1} and ∥ṽ^[t]_S̄∥₁ ≤ ϵ_s, for any k ∉ S, there is:

E[ṽ^[t+1]_k] = ṽ^[t]_k − η E_{i_t}[((ṽ^[t]⊙2 − v⋆⊙2)^⊤ x^(i_t)) x^(i_t)_k ṽ^[t]_k]    (115)
≤ ṽ^[t]_k + η(∥ṽ^[t]_S̄⊙2∥₁ + 4c_{s−1}r)C_x b_x² ṽ^[t]_k    (116)
≤ ṽ^[t]_k + η(ϵ_s² + 4c_{s−1}r)C_x b_x² ṽ^[t]_k    (117)
≤ (1 + ηc_{s−1})ṽ^[t]_k,    (118)

where the first inequality is because we can bound the dimensions in S and those not in S by 4c_{s−1}rC_x b_x² and ∥ṽ^[t]_S̄⊙2∥₁C_x b_x² respectively, the second inequality is by ∥ṽ^[t]_S̄∥₂² ≤ ∥ṽ^[t]_S̄∥₁², and the third is because (ϵ_s² + 4c_{s−1}r)C_x b_x² ≤ c_{s−1}. Summing over all k ∉ S, we have E[∥ṽ^[t+1]_S̄∥₁] ≤ (1 + ηc_{s−1})∥ṽ^[t]_S̄∥₁. This bound is obviously also true when ∥ṽ^[t]_S − 1∥_∞ > 2c_{s−1} or ∥ṽ^[t]_S̄∥₁ > ϵ_s, in which case ṽ^[t+1] = ṽ^[t]. Therefore, (1 + ηc_{s−1})^{−t}∥ṽ^[t]_S̄∥₁ is a supermartingale. Also notice |∥ṽ^[t+1]_S̄∥₁ − E[∥ṽ^[t+1]_S̄∥₁]| ≤ ηδϵ_s; by Azuma's inequality,

Pr[∥ṽ^[T]_S̄∥₁ > ϵ_s]    (119)
≤ exp(−((1 + ηc_{s−1})^{−T}ϵ_s − ϵ_{s−1})² / (2Tη²b_x²(δ² + b_x²(ϵ_s² + r)²)ϵ_s²))    (120)
≤ ρ/5,    (121)

where we use ϵ_s > (1 + ηc_{s−1})^T ϵ_{s−1} by assumption, and the last step is also by assumption.

Lemma D.5. In the setting of Lemma D.3, assume (1 − η)^T 2c_{s−1} < c_s/2, c_{s−1}²/(2ηδ²) ≥ log(5r/ρ), and (ϵ_s² + 4c_{s−1}r)C_x b_x² ≤ c_s/10. Then, for any k ∈ S, with probability at least 1 − ρ/(5r), either min_{t≤T} |ṽ^[t]_k − 1| ≤ c_s/2, or ∥ṽ^[T]_S − 1∥_∞ > 2c_{s−1}, or ∥ṽ^[T]_S̄∥₁ > ϵ_s.

Proof of Lemma D.5. We first consider the case ṽ^[t]_k ∈ [1 + c_s/2, 1 + 2c_{s−1}]. For some t < T, if ∥ṽ^[t]_S − 1∥_∞ ≤ 2c_{s−1} and ∥ṽ^[t]_S̄∥₁ ≤ ϵ_s, there is

E[ṽ^[t+1]_k − 1]    (122)
= ṽ^[t]_k − η E_{i_t}[((ṽ^[t]⊙2 − v⋆⊙2)^⊤ x^(i_t)) x^(i_t)_k] ṽ^[t]_k − 1    (123)
≤ (ṽ^[t]_k − 1) − (2/3)η ṽ^[t]_k (ṽ^[t]_k + 1)(ṽ^[t]_k − 1) + η(ϵ_s² + 4c_{s−1}r)C_x b_x² ṽ^[t]_k    (124)
≤ (1 − η)(ṽ^[t]_k − 1).    (125)

Here the first inequality is by assumption, and the second is because (ϵ_s² + 4c_{s−1}r)C_x b_x² ≤ c_s/10 and c_{s−1} ≤ 1/10. We define the event E_t as ∥ṽ^[t]_S − 1∥_∞ > 2c_{s−1} or ∥ṽ^[t]_S̄∥₁ > ϵ_s.
Since (1 − η)^T 2c_{s−1} < c_s/2 by assumption, if ṽ^[0]_k ∈ [1 + c_s/2, 1 + 2c_{s−1}], by Lemma G.4 we know:

Pr[min_{t≤T} ṽ^[t]_k > 1 + c_s/2 ∧ ∥ṽ^[T]_S − 1∥_∞ ≤ 2c_{s−1} ∧ ∥ṽ^[T]_S̄∥₁ ≤ ϵ_s]    (126)
≤ exp(−((1/2)c_s(1 − η)^{−T} − c_{s−1})² / (2ηb_x²(δ² + b_x²(ϵ_s² + r)²)))    (127)
≤ exp(−c_{s−1}²/(2ηb_x²(δ² + b_x²(ϵ_s² + r)²)))    (128)
≤ ρ/(5r),    (129)

where the second inequality is by assumption. Similarly, when ṽ^[0]_k ∈ [1 − 2c_{s−1}, 1 − c_s/2], there is

Pr[max_{t≤T} ṽ^[t]_k < 1 − c_s/2 ∧ ∥ṽ^[T]_S − 1∥_∞ ≤ 2c_{s−1} ∧ ∥ṽ^[T]_S̄∥₁ ≤ ϵ_s]    (130)
≤ exp(−c_{s−1}²/(2ηb_x²(δ² + b_x²(ϵ_s² + r)²)))    (131)
≤ ρ/(5r).    (132)

Since |ṽ^[0]_k − 1| ≤ c_{s−1} by the assumption of Theorem D.1, bounding the probability for ṽ^[0]_k ∈ [1 + c_s/2, 1 + 2c_{s−1}] and for ṽ^[0]_k ∈ [1 − 2c_{s−1}, 1 − c_s/2] respectively finishes the proof.

Lemma D.6. In the setting of Lemma D.3, assume c_s²/(8ηb_x²(δ² + b_x²(ϵ_s² + r)²)) ≥ log(5rT²/ρ) and (ϵ_s² + 2rc_s)C_x b_x² ≤ c_s/20. Then, for any dimension k ∈ S, with probability at most ρ/(5r), there is min_{t≤T} |ṽ^[t]_k − 1| ≤ c_s/2 and |ṽ^[T]_k − 1| > c_s and ∥ṽ^[T]_S − 1∥_∞ ≤ 2c_{s−1} and ∥ṽ^[T]_S̄∥₁ ≤ ϵ_s.

Proof of Lemma D.6. For any fixed 1 ≤ t₁ < t₂ ≤ T, we consider the event that ṽ^[t₁]_k ∈ [1 + c_s/3, 1 + c_s/2], and that t₂ is the first time in the trajectory at which ṽ^[t₂]_k > 1 + c_s. We first bound the probability of this event happening, i.e., the following quantity:

Pr[ṽ^[t₂]_k − 1 > c_s ∧ ṽ^[t₁]_k − 1 ≤ c_s/2 ∧ ṽ^[t₁:t₂]_k ∈ [1 + c_s/3, 1 + c_s]].    (133)

Notice that when ∥ṽ^[t]_S̄∥₁ ≤ ϵ_s and ∥ṽ^[t]_S − 1∥_∞ ≤ 2c_{s−1} and ṽ^[t₁:t+1]_k ∈ [1 + c_s/3, 1 + c_s], there is

E[ṽ^[t+1]_k − 1] = E_{s_t,i_t}[(1 + η s_t x^(i_t)_k − η((ṽ^[t]⊙2 − v⋆⊙2)^⊤ x^(i_t)) x^(i_t)_k) ṽ^[t]_k − 1]    (134)
≤ (ṽ^[t]_k − 1) − (2/3)η ṽ^[t]_k (ṽ^[t]_k + 1)(ṽ^[t]_k − 1) + η(ϵ_s² + 2rc_s)C_x b_x² ṽ^[t]_k    (135)
≤ (1 − η)(ṽ^[t]_k − 1),    (136)

where the first inequality is because ∥ṽ^[t]_S̄∥₂² ≤ ϵ_s², and the second is because (ϵ_s² + 2rc_s)C_x b_x² ≤ c_s/20.
Also, we can bound the variance of this martingale as
$$\mathrm{Var}\big[\tilde v^{[t+1]}_k - 1 \,\big|\, \tilde v^{[t]}\big] = \mathrm{Var}\big[\eta s_t x^{(i_t)}_k \tilde v^{[t]}_k\big] + \mathrm{Var}\big[\eta((\tilde v^{[t]\odot 2} - v^{\star\odot 2})^\top x^{(i_t)}) x^{(i_t)}_k \tilde v^{[t]}_k\big] \quad (137)$$
$$\le \big(\eta\delta b_x(1 + c_s)\big)^2 + \eta^2(\epsilon_s^2 + r)^2 b_x^4 (1 + c_s)^2 \quad (138)$$
$$\le 4\eta^2 b_x^2\big(\delta^2 + b_x^2(\epsilon_s^2 + r)^2\big). \quad (139)$$
By Lemma G.3, we have
$$\Pr\Big[\tilde v^{[t_2]}_k - 1 > c_s \;\wedge\; \tilde v^{[t_1]}_k - 1 \le \tfrac{1}{2}c_s \;\wedge\; \tilde v^{[t_1:t_2]}_k \in [1 + \tfrac{1}{3}c_s,\, 1 + c_s]\Big] \quad (140)$$
$$\le \exp\left(-\frac{c_s^2}{8\eta^2 b_x^2\big(\delta^2 + b_x^2(\epsilon_s^2 + r)^2\big)\sum_{t=0}^{T-1}(1-\eta)^{2t}}\right) \quad (141)$$
$$\le \exp\left(-\frac{c_s^2}{8\eta b_x^2\big(\delta^2 + b_x^2(\epsilon_s^2 + r)^2\big)}\right), \quad (142)$$
where the first inequality is by Lemma G.3 and the second follows by summing the geometric series in the denominator, $\sum_{t=0}^{T-1}(1-\eta)^{2t} \le \frac{1}{\eta}$. Similarly, we can bound
$$\Pr\Big[1 - \tilde v^{[t_2]}_k > c_s \;\wedge\; 1 - \tilde v^{[t_1]}_k \le \tfrac{1}{2}c_s \;\wedge\; \tilde v^{[t_1:t_2]}_k \in [1 - c_s,\, 1 - \tfrac{1}{3}c_s]\Big] \le \exp\left(-\frac{c_s^2}{8\eta b_x^2\big(\delta^2 + b_x^2(\epsilon_s^2 + r)^2\big)}\right). \quad (143\text{--}144)$$
Finally, we finish the proof with a union bound:
$$\Pr\Big[\min_{t\le T}|\tilde v^{[t]}_k - 1| \le \tfrac{1}{2}c_s \;\wedge\; |\tilde v^{[T]}_k - 1| > c_s \;\wedge\; \|\tilde v^{[T]}_S - 1\|_\infty \le 2c_{s-1} \;\wedge\; \|\tilde v^{[T]}_{\bar S}\|_1 \le \epsilon_s\Big] \quad (145)$$
$$\le \sum_{1\le t_1 < t_2 \le T} \Pr\Big[\tilde v^{[t_2]}_k - 1 > c_s \;\wedge\; \tilde v^{[t_1]}_k - 1 \le \tfrac{1}{2}c_s \;\wedge\; \tilde v^{[t_1:t_2]}_k \in [1 + \tfrac{1}{3}c_s,\, 1 + c_s]\Big]$$
$$\quad + \sum_{1\le t_1 < t_2 \le T} \Pr\Big[1 - \tilde v^{[t_2]}_k > c_s \;\wedge\; 1 - \tilde v^{[t_1]}_k \le \tfrac{1}{2}c_s \;\wedge\; \tilde v^{[t_1:t_2]}_k \in [1 - c_s,\, 1 - \tfrac{1}{3}c_s]\Big] \quad (146\text{--}147)$$
$$\le T^2 \exp\left(-\frac{c_s^2}{8\eta b_x^2\big(\delta^2 + b_x^2(\epsilon_s^2 + r)^2\big)}\right) \quad (148)$$
$$\le \frac{\rho}{5r}, \quad (149)$$
where the first inequality is a union bound, the second uses the bounds above, and the third is by the assumption of this lemma.

Proof of Theorem D.1. Let $C_x \triangleq \max_{j\ne k} \big|\mathbb{E}_i[x^{(i)}_j x^{(i)}_k]\big|$ and $b_x = \sqrt{2\log\frac{30d^2}{\rho}} = \Theta(1)$. According to Lemma G.1, when $n \le d$, with probability at least $1 - \frac{\rho}{15}$ we have $\|x^{(i)}\|_\infty \le b_x$ for all $i \in [n]$. Set $\eta$ small enough that $\frac{c_s^2}{8\eta b_x^2(\delta^2 + b_x^2(\epsilon_s^2+r)^2)} \ge \log\frac{10rT^2}{\rho}$; clearly $\eta \le \Theta\big(\frac{c_s^2}{\delta^2 + r^2}\big)$ suffices, where $\Theta(\cdot)$ omits polylogarithmic dependencies on $d$ and $\rho$. Assume $(\epsilon_s^2 + 4c_{s-1}r) C_x b_x^2 \le \frac{c_s}{10}$, which can be rewritten as $C_x \le \Theta(\frac{1}{r})$. Recall $T = \frac{1}{\eta}\log\frac{4}{c_0}$ and $\epsilon_s = e^{2c_{s-1}\log\frac{4}{c_0}}\epsilon_{s-1}$.

We first show that the additional assumptions in the previous lemmas are satisfied. We have
$$(1 + \eta c_{s-1})^T = (1 + \eta c_{s-1})^{\frac{1}{\eta}\log\frac{4}{c_0}} \le e^{c_{s-1}\log\frac{4}{c_0}} \triangleq P. \quad (150\text{--}151)$$
The assumption $\epsilon_s > (1 + \eta c_{s-1})^T \epsilon_{s-1}$ in Lemma D.4 is therefore satisfied by the definition of $\epsilon_s$, since $\epsilon_s = P^2 \epsilon_{s-1}$. The assumption $\frac{((1+\eta c_{s-1})^{-T}\epsilon_s - \epsilon_{s-1})^2}{2T\eta^2 b_x^2(\delta^2 + b_x^2(\epsilon_s^2+r)^2)\epsilon_s^2} \ge \log\frac{5}{\rho}$ in Lemma D.4 is satisfied because
$$\frac{\big((1+\eta c_{s-1})^{-T}\epsilon_s - \epsilon_{s-1}\big)^2}{2T\eta^2 b_x^2\big(\delta^2 + b_x^2(\epsilon_s^2+r)^2\big)\epsilon_s^2} \ge \frac{\epsilon_s^2(P^{-1} - P^{-2})^2}{2T\eta^2 b_x^2\big(\delta^2 + b_x^2(\epsilon_s^2+r)^2\big)\epsilon_s^2} \quad (152)$$
$$\ge \frac{(P-1)^2}{2T\eta^2 b_x^2\big(\delta^2 + b_x^2(\epsilon_s^2+r)^2\big)} \quad (153)$$
$$\ge \frac{c_{s-1}^2}{2\eta b_x^2\big(\delta^2 + b_x^2(\epsilon_s^2+r)^2\big)}, \quad (154)$$
which is larger than $\log\frac{5}{\rho}$ by the definition of $\eta$. The assumption $(1-\eta)^T 2c_{s-1} \le \frac{c_s}{2}$ in Lemma D.5 is satisfied because $(1-\eta)^T 2c_{s-1} \le (\frac{1}{e})^{\log\frac{4}{c_0}}\, 2c_{s-1} = \frac{c_s}{2}$. All the other assumptions in Lemmas D.3, D.4, D.5 and D.6 follow naturally from the definition of $\eta$ and the requirement on $C_x$.

Since the data are drawn from $\mathcal{N}(0, I)$, with $n \ge \Theta(r^2)$ samples, with probability at least $1 - \frac{\rho}{15}$ we have $(\epsilon_s^2 + 4c_{s-1}r) C_x b_x^2 \le \frac{c_s}{10}$. Meanwhile, according to Lemma G.2, with $n \ge \Theta(1)$ samples, with probability at least $1 - \frac{\rho}{15}$ we have $\mathbb{E}_i[(x^{(i)}_k)^2] \ge \frac{2}{3}$ for all $k \in [d]$. By the definition of $b_x$, when $n \le d$, with probability at least $1 - \frac{\rho}{15}$ we also have $\|x^{(i)}\|_\infty \le b_x$ for all $i \in [n]$.
In summary, with $d \ge n \ge \Theta(r^2)$ samples, with probability at least $1 - \frac{\rho}{5}$ we have $(\epsilon_s^2 + 4c_{s-1}r) C_x b_x^2 \le \frac{c_s}{10}$, $\mathbb{E}_i[(x^{(i)}_k)^2] \ge \frac{2}{3}$ for all $k \in [d]$, and $\|x^{(i)}\|_\infty \le b_x$ for all $i \in [n]$.

Now we finish the proof with the above lemmas. Lemma D.3 and Lemma D.4 together tell us that with probability at least $1 - \frac{2\rho}{5}$, we have $\|\tilde v^{[T]}_S - 1\|_\infty \le 2c_{s-1}$ and $\|\tilde v^{[T]}_{\bar S}\|_1 \le \epsilon_s$, in which case $v^{[T]} = \tilde v^{[T]}$ by the definition of $\tilde v^{[T]}$. Lemma D.5 and Lemma D.6 together tell us that the probability of $\|\tilde v^{[T]}_S - 1\|_\infty \le 2c_{s-1}$ and $\|\tilde v^{[T]}_{\bar S}\|_1 \le \epsilon_s$ and $\|\tilde v^{[T]}_S - 1\|_\infty > c_s$ is at most $\frac{2\rho}{5}$. So altogether, with probability at least $1 - \rho$, we have $\|v^{[T]}_S - 1\|_\infty \le c_s$ and $\|v^{[T]}_{\bar S}\|_1 \le \epsilon_s$.

E PROOF OF THEOREM 2.1

Proof of Theorem 2.1. Starting from the initialization $\tau \cdot \mathbf{1}$, by Theorem 3.1, running SGD with label noise with noise level $\delta > O\big(\frac{\tau^2 d^2}{\rho^3}\big)$ and $\eta_0 = \Theta(\frac{1}{\delta})$ for $T_0 = \Theta(1)$ iterations gives that, with probability at least 0.99, $v^{[T_0]}$ satisfies the initial condition of Theorem 3.3: $\epsilon_{\min} \le v^{[T_0]}_k \le \frac{1}{d}$, where $\epsilon_{\min} = \exp(-\Theta(1))$.

Recall the final target precision is $\epsilon$; set $\epsilon_1 = \frac{1}{40^3}\epsilon$. By Theorem 3.3, with $n \ge \Theta(r^2\log^2(1/\epsilon))$ samples, after running SGD with label noise with learning rate $\eta_1 = \Theta(\frac{1}{\delta^2})$ for $T_1 = \Theta\big(\frac{\log(1/\epsilon)}{\eta_1}\big)$ iterations, with probability at least 0.99 we have
$$\|v^{[T_0+T_1]}_S - v^{\star}_S\|_\infty \le c_1 \triangleq \tfrac{1}{10} \quad (155)$$
$$\text{and}\quad \|v^{[T_0+T_1]}_{\bar S} - v^{\star}_{\bar S}\|_1 \le \epsilon_1. \quad (156)$$
So $v^{[T_0+T_1]}$ satisfies the initial condition of Theorem D.1. Finally, set $\rho = 0.01/\lceil\log_{10}(1/\epsilon)\rceil$ and apply Theorem D.1 for $n_s = \lceil\log_{10}(1/\epsilon)\rceil = \Theta(1)$ rounds. Since $c_s$ shrinks by a factor of $10$ in each round, the final $c_{n_s}$ satisfies $\frac{1}{10}\epsilon \le c_{n_s} \le \epsilon$. Since the requirement on $\eta$ for round $s$ is $\eta \le \Theta\big(\frac{c_s^2}{\delta^2 + r^2}\big)$, we can set $\eta_2 \le \Theta\big(\frac{\epsilon^2}{\delta^2}\big)$ to satisfy all rounds simultaneously. Let $T_2$ be the total number of iterations over all of these rounds; clearly $T_2 = \Theta(\frac{1}{\eta_2})$.
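For intuition, the overall pipeline used above (small positive initialization, then SGD with fresh label noise at every step) can be mimicked in a few lines. The following is a toy sketch, not the procedure with the theorem's constants: the dimensions, noise level, learning rate, and iteration count below are illustrative choices, and the recovery they exhibit is qualitative rather than at the precision the theorem guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 50, 40, 3                       # overparameterized: n < d; sparse support of size r
v_star = np.zeros(d)
v_star[:r] = 1.0                          # ground-truth nonzero entries equal 1, as in the analysis
X = rng.standard_normal((n, d))           # x^(i) ~ N(0, I)
y = X @ (v_star ** 2)

def sgd_label_noise(v, eta, delta, steps):
    """SGD on the quadratically parameterized loss 0.5*((v^{.2})^T x - y - s)^2,
    where the label perturbation s is resampled fresh at every step."""
    for _ in range(steps):
        i = rng.integers(n)
        s = delta * rng.choice([-1.0, 1.0])          # symmetric label noise
        resid = X[i] @ (v ** 2) - y[i] - s
        v = v - eta * resid * X[i] * v               # gradient is resid * x ⊙ 2v; the 2 is folded into eta
        v = np.clip(v, 1e-8, None)                   # keep the iterate positive, as assumed in the proofs
    return v

v = sgd_label_noise(np.full(d, 0.1), eta=2e-3, delta=1.0, steps=20000)
u = v ** 2
print("on-support entries:", np.round(u[:r], 2))
print("off-support l1 mass:", round(float(np.abs(u[r:]).sum()), 3))
```

With settings in this range, the on-support entries of $u = v^{\odot 2}$ should approach 1 while the off-support coordinates remain small, which is the qualitative behavior the staged analysis makes quantitative.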
Noticing that $\epsilon_s \le e^{\sum_{s=2}^{\infty} 2c_{s-1}\log\frac{4}{c_0}}\,\epsilon_1 \le 40^3\epsilon_1 = \epsilon$, we have with probability at least 0.99,
$$\|v^{[T_0+T_1+T_2]}_S - v^{\star}_S\|_\infty \le c_{n_s} \le \epsilon \quad\text{and}\quad \|v^{[T_0+T_1+T_2]}_{\bar S} - v^{\star}_{\bar S}\|_1 \le \epsilon. \quad (157)$$
The total failure probability of the above three stages is 0.03, so with probability at least 0.97 we have $\|v^{[T_0+T_1+T_2]} - v^{\star}\|_\infty \le \epsilon$, which finishes the proof.

F PROOF OF THEOREM 2.2

Lemma F.1. Assume $n \le \frac{d}{2} - 9\sqrt{d}$. Let $C \subset \mathbb{R}^d$ be the convex cone in which every coordinate is positive, and let $K$ be a random subspace of dimension $d - n$. Then with probability at least 0.999, $K \cap C \ne \{0\}$.

Proof of Lemma F.1. By Theorem 1 of Amelunxen et al. (2014), we only need to prove
$$\delta(C) + \delta(K) \ge d + 9\sqrt{d}, \quad (158)$$
where $\delta(\cdot)$ is the statistical dimension of a set. By equation (2.1) of Amelunxen et al. (2014),
$$\delta(K) = d - n. \quad (159)$$
To calculate $\delta(C)$, we use Proposition 2.4 of Amelunxen et al. (2014): $\delta(C) = \mathbb{E}[\|\Pi_C(g)\|^2]$, where $g$ is a standard Gaussian vector, $\Pi_C$ is the projection onto $C$, and the expectation is over $g$. Since $C$ is the set of all points with element-wise positive coordinates, $\Pi_C(g)$ simply sets all negative coordinates of $g$ to $0$ and keeps the positive ones. Therefore,
$$\delta(C) = \mathbb{E}[\|\Pi_C(g)\|^2] = \frac{d}{2}. \quad (160)$$
Hence
$$\delta(C) + \delta(K) = \frac{3}{2}d - n \ge d + 9\sqrt{d}. \quad (161)$$

Proof of Theorem 2.2. Let $\mathcal{X}^\perp$ be the subspace orthogonal to the subspace $\mathcal{X}$ spanned by the data. Since the data are random, with probability 1 the subspace $\mathcal{X}$ has dimension $n$. Therefore, by the lemma above, with probability at least 0.999, $\mathcal{X}^\perp \cap C \ne \{0\}$, where $C$ is the coordinate-wise positive cone. Let $\mu \in \mathcal{X}^\perp$ be a vector such that $\mu_i > 0$ for all $i \in [d]$, scaled so that $\|\mu\|_2 = 1$. We can construct an orthonormal matrix
$$A = [a_1, \cdots, a_d] \in \mathbb{R}^{d\times d} \quad (162)$$
such that $\mathrm{span}\{a_1, \cdots, a_n\} = \mathcal{X}$ and $a_{n+1} = \mu$.
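The identity $\delta(C) = \mathbb{E}[\|\Pi_C(g)\|^2] = d/2$ used above is easy to sanity-check numerically, since projecting onto the positive orthant just zeroes out the negative coordinates. A small Monte Carlo sketch (the dimension and trial count are arbitrary choices):

```python
import numpy as np

# Monte Carlo estimate of the statistical dimension of the positive orthant
# C = {u : u_i >= 0}, using delta(C) = E[||Pi_C(g)||^2] for g ~ N(0, I_d).
rng = np.random.default_rng(1)
d, trials = 100, 20000
g = rng.standard_normal((trials, d))
proj = np.maximum(g, 0.0)                 # projection onto C: zero out negative coordinates
est = (proj ** 2).sum(axis=1).mean()
print(est)                                # concentrates around d/2 = 50
```

Each coordinate contributes $\mathbb{E}[\max(g_i, 0)^2] = \frac{1}{2}$ by symmetry, so the estimate concentrates tightly around $d/2$.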
Consider the transformation $A\tilde u = u = v^{\odot 2}$. Since only the projection of $u$ onto the span of the data influences $L(v)$, we can write $L(v) = L(\tilde u_{1:n})$ as a function of the first $n$ coordinates of $\tilde u$. We can then lower bound the partition function, where the inner integral runs over the last $d - n$ coordinates of $\tilde u$ restricted to the set on which $A\tilde u$ is coordinate-wise positive. We now prove that for each $\tilde u_{1:n}$ such that $S = \{\tilde u_{n+1:d} \mid A\tilde u > 0\}$ is non-empty, the inner integral is $+\infty$. The last step holds because $n < d/2$, so the integrand is essentially a polynomial in $z$ of degree larger than $-1$, and integrating it over all positive $z$ gives $+\infty$. This finishes the proof that $\int_{v\in\mathbb{R}^d} e^{-\lambda L(v)}\,dv = +\infty$.

Here the first inequality is because of Azuma's inequality and $\mathrm{Var}\big[(1-\gamma)^{-(t+1)}\hat A^{[t+1]} \,\big|\, \hat A^{[t]}\big] \le (1-\gamma)^{-2(t+1)}a$, and the second inequality is because $\hat A^{[0]} \le c_2$. Since the event in Equation (179) only happens when $\hat A^{[T]} > c$, this finishes the proof of Lemma G.3.

Lemma G.4. Let $0 < c_1 < c_2$ be real constants. Let $A^{[0]}, A^{[1]}, \cdots, A^{[T]}$ be a sequence of random variables such that, given $A^{[0]}, \cdots, A^{[t]}$ for some $t < T$ with $A^{[t]} \in [c_1, c_2]$, either the event $E_t$ happens, or $\mathbb{E}[A^{[t+1]}] \le (1-\gamma)A^{[t]}$ with variance $\mathrm{Var}[A^{[t+1]} \mid A^{[0]}, \cdots, A^{[t]}] \le a$. Then when $A^{[0]} \in [c_1, c_2]$ and $(1-\gamma)^T c_2 < c_1$, we have
$$\Pr\Big[\min_{t\le T} A^{[t]} > c_1 \;\wedge\; \max_{t\le T} A^{[t]} \le c_2 \;\wedge\; \neg E^{[0:T]}\Big] \le \exp\left(-\frac{2\big(c_1(1-\gamma)^{-T} - A^{[0]}\big)^2}{\frac{1}{\gamma}a}\right), \quad (185)$$
where $\neg E^{[0:T]}$ means that $E_t$ does not happen for any $0 \le t < T$.

Proof of Lemma G.4. Let $\hat A^{[t]}$ be the following coupling of $A^{[t]}$: starting from $\hat A^{[0]} = A^{[0]}$, for each time $t < T$, if there exists $t' \le t$ such that $E_{t'}$ happens or $A^{[t']} \notin [c_1, c_2]$, we let $\hat A^{[t+1]} \triangleq (1-\gamma)\hat A^{[t]}$; otherwise $\hat A^{[t+1]} = A^{[t+1]}$. Intuitively, whenever $A^{[t]}$ leaves the proper range, we only multiply $\hat A^{[t]}$ by $1-\gamma$ afterwards; otherwise we let it equal $A^{[t]}$. Notice that if the event in Equation (185) happens, then necessarily $\hat A^{[T]} = A^{[T]}$ (otherwise $E_t$ happens at some time or $A^{[t]} \notin [c_1, c_2]$, contradicting the event). So we only need to bound $\Pr[\hat A^{[T]} > c_1]$. We notice that $(1-\gamma)^{-t}\hat A^{[t]}$ for $t = 0, \cdots, T$ is a supermartingale, i.e., given the history, $\mathbb{E}[\hat A^{[t+1]} \mid \hat A^{[t]}] \le (1-\gamma)\hat A^{[t]}$.



In contrast, the implicit bias of the noise would not show up in a simpler linear regression model. Our analysis extends straightforwardly to $v^\star$ with other non-zero values. We also remark that it is common to obtain only sub-optimal sample complexity guarantees in the sparsity parameter with non-convex optimization methods (Li et al., 2017; Ge et al., 2016; Vaskevicius et al., 2019; Chi et al., 2019) due to technical limitations. It also appears difficult to extend the local analysis directly to a global one: once the iterate leaves the local minimum, the local tools no longer apply, and it is unclear whether the iterate will converge to a new local minimum or get stuck in some region. In fact, one can show that if this phenomenon happens for some $\lambda > 0$, then it happens for every other $\lambda$. The same proof strategy fails for Brownian motion because $v$ is not always nonnegative, and there is no concave potential function over the reals that is bounded from below. In general, any mean-zero noise has a second-order effect on any potential function; therefore, when the noise level is fixed, the effect of the noise diminishes as $\eta \to 0$. This is why a lower bound on the learning rate is necessary for the noise to play a role.
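The bias toward minima with smaller noise variance can be seen in a one-dimensional caricature (not the paper's model): take a completely flat loss, so that every point is a global minimum, and compare parameter-dependent noise of standard deviation $|v|$ against constant (spherical) noise. The parameter-dependent update is a martingale, yet $\log|v|$ drifts downward at rate $\Theta(\eta^2)$ per step, so the iterate collapses to the zero-noise-variance point, while the constant-noise iterate just diffuses.

```python
import numpy as np

rng = np.random.default_rng(2)
eta, steps, chains = 0.05, 4000, 200

# Parameter-dependent noise: std proportional to |v| (vanishes at v = 0).
v = np.ones(chains)
for _ in range(steps):
    v = v + eta * np.abs(v) * rng.standard_normal(chains)

# Spherical (parameter-independent) noise of constant std.
w = np.ones(chains)
for _ in range(steps):
    w = w + eta * rng.standard_normal(chains)

print("median |v| (parameter-dependent noise):", np.median(np.abs(v)))
print("median |w| (spherical noise):          ", np.median(np.abs(w)))
```

The second-order nature of the effect also shows up here: halving $\eta$ (with the number of steps fixed) halves the diffusion scale but quarters the per-step downward drift of $\log|v|$, matching the remark that the noise effect vanishes as $\eta \to 0$ unless the horizon grows accordingly.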



; Teh et al. (2016); Raginsky et al. (2017); Zhang et al. (2017); Mou et al. (2017); Roberts et al. (1996); Ge et al. (2015); Negrea et al. (2019); Neelakantan et al. (2015); Mou et al. (2018). In particular, Raginsky et al. (2017) and Li et al. (2019a) provided generalization bounds for SGLD using algorithmic stability.

Figure 2: The norm of the model weights along the training trajectory. We demonstrate that the model weights fail to converge when training with Gaussian noise; in contrast, the weight norm converges when training with label noise or without noise.

Lemma 3.2 follows naturally from Lemma B.2. Definition B.3 ($b$-bounded potential function). For a vector $v$ that is positive in every coordinate, we define the $b$-bounded potential function $\Phi(v)$ as follows

$\tilde u_{1:n}$, let $\tilde u^*_{n+1:d}$ be one possible solution such that $u^* = A\tilde u^* > 0$. Define the constant $c = \min_{i\in[1:d]} \max_{j\in[n+2:d]} \frac{a^{n+1}_i}{(d-n-1)|a}$

This is obviously true when $\hat A^{[t+1]} = (1-\gamma)\hat A^{[t]}$, and also true otherwise by the assumption of the lemma. So we have
$$\Pr\big[\hat A^{[T]} > c_1\big] = \Pr\big[(1-\gamma)^{-T}\hat A^{[T]} > (1-\gamma)^{-T}c_1\big] \le \exp\left(-\frac{2\big(c_1(1-\gamma)^{-T} - c_2\big)^2}{\sum_{t=0}^{T-1}(1-\gamma)^{-2t}\,a}\right), \quad (186\text{--}187)$$
where the inequality is because of Azuma's inequality, $\mathrm{Var}\big[(1-\gamma)^{-(t+1)}\hat A^{[t+1]} \,\big|\, \hat A^{[t]}\big] \le (1-\gamma)^{-2(t+1)}a$, and $\hat A^{[0]} \le c_2$. Since the event in Equation (185) only happens when $\hat A^{[T]} > c_1$, we have finished the proof.

without explicit regularization. Several previous works have studied generalization bounds and training dynamics of SGD with state-dependent noise for more general models. Hardt et al. (2015) derived stability-based generalization bounds for mini-batch SGD based on training speed. Cheng et al. (2019) proved that the iterate distribution of SGD with state-dependent noise is close to that of the corresponding continuous stochastic differential equation with the same noise covariance. Meng et al. (2020) and Xie et al. (2020) showed that SGD with state-dependent noise escapes local minima faster than SGD with spherical Gaussian noise. There has been a line of work empirically studying how noise influences generalization. Keskar et al. (2016) argued that large-batch training converges to "sharp" local minima which do not generalize well. Hoffer et al. (2017) argued that a large batch size does not hurt generalization much if training runs long enough and additional noise is added with a larger learning rate. Goyal et al.

$$S' = \big\{\tilde u_{n+1:d} \,\big|\, \tilde u_{n+1} \ge \tilde u^*_{n+1} \;\wedge\; |\tilde u_j - \tilde u^*_j| \le c(\tilde u_{n+1} - \tilde u^*_{n+1}),\ \forall j \in [n+2:d]\big\}. \quad (170)$$
In other words, $S'$ is a convex cone in which the constraint on $\tilde u_j$ is linear in $\tilde u_{n+1}$ for $j \in [n+2:d]$. By the definition of $c$, it is easy to verify that $S'$ is a subset of $S$. Also, for every $\tilde u_{n+1:d} \in S'$, $u_i$ is upper bounded; here the inequality follows from the definition of $c$.

G EXTRA LEMMAS

Lemma G.1. Suppose $x^{(i)} \sim \mathcal{N}(0, I_{d\times d})$, $i \in [n]$, are random data points. Then with probability at least $1 - \rho$, for every $i \in [n]$ we have $\|x^{(i)}\|_\infty \le \sqrt{2\log\frac{2nd}{\rho}}$.

Proof. By the Gaussian tail bound, $\Pr\big[|x^{(i)}_k| > t\big] \le 2e^{-t^2/2}$ for each coordinate $k$. Taking $t = \sqrt{2\log\frac{2nd}{\rho}}$ and applying a union bound over all $nd$ coordinates finishes the proof.

Lemma G.2. Suppose $x^{(i)} \sim \mathcal{N}(0, I_{d\times d})$, $i \in [n]$, are random data points. Then when $n > 24\log\frac{d}{\rho}$, with probability at least $1 - \rho$, for every $k \in [d]$ we have $\mathbb{E}_i\big[(x^{(i)}_k)^2\big] \ge \frac{2}{3}$.

Proof. For each fixed $k$, by the Hoeffding inequality we have $\Pr\big[\mathbb{E}_i[(x^{(i)}_k)^2] < \frac{2}{3}\big] \le e^{-n/24}$ (178). Therefore, when $n \ge 24\log\frac{d}{\rho}$, a union bound over $k \in [d]$ finishes the proof.

Lemma G.3. Let $A^{[0]}, A^{[1]}, \cdots, A^{[T]}$ be a sequence of random variables such that, given the history, the sequence either stops updating or contracts in expectation with bounded variance, in the same sense as in Lemma G.4; the conclusion is a tail bound of the same form.

Proof of Lemma G.3. We only need to consider the case $A^{[0]} \le c_2$. Let $\hat A^{[t]}$ be the following coupling of $A^{[t]}$: starting from $\hat A^{[0]} = A^{[0]}$, for each time $t < T$, if $A^{[t']}$ has stopped updating or left the proper range for some $t' \le t$, we let $\hat A^{[t+1]} \triangleq (1-\gamma)\hat A^{[t]}$; otherwise $\hat A^{[t+1]} = A^{[t+1]}$. Intuitively, whenever $A^{[t]}$ stops updating or exceeds the proper range, we only multiply $\hat A^{[t]}$ by $1-\gamma$ afterwards; otherwise we let it equal $A^{[t]}$. Notice that if the event in Equation (179) happens, there has to be $\hat A^{[T]} = A^{[T]}$ (otherwise $A^{[t]}$ stops updating or exceeds the range at some time, contradicting the event). So we only need to bound $\Pr[\hat A^{[T]} > c]$. We notice that $(1-\gamma)^{-t}\hat A^{[t]}$ is a supermartingale: this is obviously true when $\hat A^{[t+1]} = (1-\gamma)\hat A^{[t]}$, and also true otherwise by the assumption of the lemma, so we obtain the bound in (184).
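Both Gaussian concentration claims are easy to probe empirically. The sketch below checks, for one random draw, that all entries stay below a $\sqrt{2\log(2nd/\rho)}$-style sup-norm threshold (the standard union-bound level in this regime) and that every coordinate's empirical second moment clears the $\frac{2}{3}$ bar of Lemma G.2 when $n \ge 24\log(d/\rho)$; this is a single-sample illustration, not a proof.

```python
import numpy as np

rng = np.random.default_rng(3)
d, rho = 1000, 0.01
n = int(24 * np.log(d / rho)) + 1         # the sample-size regime of Lemma G.2
X = rng.standard_normal((n, d))           # x^(i) ~ N(0, I_d)

b = np.sqrt(2 * np.log(2 * n * d / rho))  # sup-norm threshold in the style of Lemma G.1
max_entry = float(np.abs(X).max())        # largest entry across all n*d coordinates

second_moments = (X ** 2).mean(axis=0)    # empirical E_i[(x_k^(i))^2], one value per coordinate k
print("max entry:", round(max_entry, 3), " threshold:", round(float(b), 3))
print("min second moment:", round(float(second_moments.min()), 3))
```

The union bound is over $nd$ Gaussian entries in the first check and over $d$ coordinates in the second, mirroring the two proofs above.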

