SGD WITH LARGE STEP SIZES LEARNS SPARSE FEATURES

Abstract

We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) lead the iterates to jump from one side of a valley to the other, causing loss stabilization, and (ii) induce, through this stabilization, a hidden stochastic dynamics that implicitly biases the iterates toward simple predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used, so the regularization effect comes solely from the SGD dynamics influenced by the step size schedule. These observations therefore unveil how, through the step size schedule, gradient and noise jointly drive the SGD dynamics through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models as well as qualitative arguments inspired by stochastic processes. Finally, this analysis allows us to shed new light on some common practices and observed phenomena when training deep networks.

1. INTRODUCTION

Deep neural networks have achieved remarkable results on a wide variety of tasks. Yet, the understanding of their effectiveness remains incomplete. From an optimization perspective, stochastic training procedures challenge many insights drawn from convex models. For example, large step-size schedules used in practice lead to unexpected patterns of stabilization and sudden drops in the training loss, see e.g. He et al. (2016). From a generalization perspective, overparametrized deep nets generalize well while perfectly fitting the data and without any explicit regularizers (Zhang et al., 2017). This suggests that optimization and generalization are tightly intertwined: neural networks find solutions that generalize well thanks to the optimization procedure used to train them. This property, known as implicit bias or algorithmic regularization, has been studied recently both for regression (Li et al., 2018; Woodworth et al., 2020) and classification (Soudry et al., 2018; Lyu and Li, 2020; Chizat and Bach, 2020). However, for all these theoretical results, it is also shown that the typical timescales needed to enter the beneficial feature learning regimes are prohibitively long (Woodworth et al., 2020; Moroshko et al., 2020). In this paper, we aim at staying closer to experimental practice and consider the SGD schedules from the ResNet paper (He et al., 2016), where the large step size is first kept constant and then decayed, potentially multiple times. We illustrate this behavior in Fig. 1, where we reproduce a minimal setting without data augmentation or momentum, and with only one step size decrease. We draw attention to two key observations regarding the large step-size phase: (a) quickly after the start of training, the loss remains approximately constant on average, and (b) despite no progress on the training loss, running this phase for longer leads to better generalization. We refer to such a large step-size phase as loss stabilization.
The better generalization hints at some hidden dynamics in the parameter space not captured by the loss curves in Fig. 1. Our main contribution is to unveil the hidden dynamics behind this phase: loss stabilization helps to amplify the noise of SGD, which drives the network towards a solution with sparser features (see Appendix, Figure 7 for a 2D visualization).

1.1. OUR CONTRIBUTIONS

The effective dynamics behind loss stabilization. We characterize two main components of the SGD dynamics with large step sizes: (i) a fast movement determined by the bouncing directions causing loss stabilization, (ii) a slow dynamics driven by the combination of the gradient and the multiplicative noise, which is non-vanishing due to the loss stabilization.

Figure 1: A typical training dynamics for a ResNet-18 trained on CIFAR-10. We use weight decay but no momentum or data augmentation for this experiment. We see a substantial difference in generalization (as large as 12% vs. 35% test error) depending on the step size η and its schedule. When the training loss stabilizes, there is a hidden progress occurring which we aim to characterize. (Plot: test error over training; legend: η = 0.007; η = 1.5, decay at 10% epochs; η = 1.5, decay at 30% epochs; η = 1.5, decay at 50% epochs.)

SDE model and sparse feature learning. We model the effective slow dynamics during loss stabilization by a stochastic differential equation (SDE) whose multiplicative noise is related to the neural tangent kernel features, and validate this modeling experimentally. Building on the existing theory on diagonal linear networks, which shows that this noise structure leads to sparse predictors, we conjecture a similar "sparsifying" effect on the features of more complex architectures. We experimentally confirm this on neural networks of increasing complexity.

Insights from our understanding. We draw a clear general picture: the hidden optimization dynamics induced by large step sizes and loss stabilization enable the transition to a sparse feature learning regime. We argue that after a short initial phase of training, SGD first identifies sparse features of the training data and eventually fits the data when the step size is decreased. Finally, we discuss informally how many deep learning regularization methods (weight decay, BatchNorm, SAM) may also fit into the same picture.

1.2. RELATED WORK

He et al. (2016) popularized the piecewise constant step-size schedule which often exhibits a clear loss stabilization pattern. However, they did not provide any explanations for such training dynamics and its implicit regularization effect. Non-monotonic patterns of the training loss have been explored in recent works. However, the loss stabilization regime we consider is different (i) from the catapult mechanism (Lewkowycz et al., 2020), where the training loss shows only one spike at the start of training and then monotonically converges without stabilization, and (ii) from the edge of stability regime of full-batch GD (Cohen et al., 2021), where the training loss shows many regular spikes after some point in training but again without stabilization. Past works conjectured that large step sizes induce the minimization of some hidden complexity measures related to flatness of minima (Keskar et al., 2016; Smith and Le, 2018). Notably, Xing et al. (2018) point out that SGD moves through the loss landscape bouncing between the walls of a valley, where the role of the step size is to guide the noisy iterates of SGD towards a flatter minimum. However, many typically used flatness definitions are questionable for this purpose since (1) they are not invariant under reparametrizations that lead to an equivalent neural network (Dinh et al., 2017), and (2) even for naturally trained networks, full-batch gradient descent with large step sizes (unlike SGD) can lead to flat solutions which are not well-generalizing (Kaur et al., 2022). Note that it is possible to bridge the gap between GD and SGD by using explicit regularization as in Geiping et al. (2022). We instead focus on the implicit regularization of SGD, which remains the most practical approach for training deep networks. The importance of large step sizes has been investigated with diverse motivations.
However, we believe that existing approaches do not sufficiently capture the hidden stochastic dynamics behind the loss stabilization phenomenon observed for deep networks. Attempts to explain it on strongly convex models (Nakkiran, 2020; Wu et al., 2021; Beugnot et al., 2022) are inherently incomplete since it is a phenomenon related to the existence of many zero-loss solutions with very different generalization properties. Li et al. (2019b) analyzed the role of loss stabilization for a synthetic distribution containing different patterns, but it is not clear how this analysis can be extended to general problems. Works based on stability analysis characterize the properties of the minimum that SGD or GD can potentially converge to, depending on the step size (Wu et al., 2018; Mulayoff et al., 2021; Ma and Ying, 2021; Nacson et al., 2022). However, these approaches do not capture the entire training dynamics, such as the large step size phase that we consider, where SGD converges only after the step size is decayed. SGD with label noise has been studied because of its beneficial regularization effect and its resemblance to SGD's standard noise. Its implicit bias was first characterized by Blanc et al. (2020) and extended by Li et al. (2022). However, their analysis only holds in the final phase of the training, close to a zero-loss manifold. Our work is instead closer in spirit to Pillaud-Vivien et al. (2022), where the label noise dynamics is analyzed in the central phase of the training, i.e., when the training loss is still substantially above zero.

2. THE EFFECTIVE DYNAMICS OF SGD WITH LARGE STEP-SIZE: SPARSE FEATURE LEARNING

In this section, we show that large step sizes lead the loss to stabilize by making SGD bounce above a valley. We then unveil the effective dynamics induced by this loss stabilization. To clarify our exposition, we showcase our results for the mean square error, but other losses like the cross-entropy carry the same key properties in terms of the noise covariance (Wojtowytsch, 2021b, Lemma 2.14). We consider a generic parameterized family of prediction functions $\mathcal{H} := \{x \mapsto h_\theta(x),\ \theta \in \mathbb{R}^p\}$, a setting which encompasses neural networks. In this case, the training loss on input/output samples $(x_i, y_i)_{1 \le i \le n} \in \mathbb{R}^d \times \mathbb{R}$ reads

$$L(\theta) := \frac{1}{2n} \sum_{i=1}^{n} \big(h_\theta(x_i) - y_i\big)^2. \tag{1}$$

We consider the overparameterized setting, i.e., $p \gg n$; hence, there exist many parameters $\theta^*$ that lead to zero loss, i.e., that perfectly interpolate the dataset. Therefore, the question of which interpolator the algorithm converges to is of paramount importance in terms of generalization. We focus on the SGD recursion with step size $\eta > 0$, initialized at $\theta_0 \in \mathbb{R}^p$: for all $t \in \mathbb{N}$,

$$\theta_{t+1} = \theta_t - \eta \big(h_{\theta_t}(x_{i_t}) - y_{i_t}\big) \nabla_\theta h_{\theta_t}(x_{i_t}), \tag{2}$$

where $i_t \sim \mathcal{U}(\{1, \dots, n\})$ is sampled uniformly from the sample indices. In the following, note that SGD with mini-batches of size $B > 1$ would lead to a similar analysis but with $\eta/B$ instead of $\eta$.
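The recursion of Eq. (2) can be sketched in a few lines of NumPy. The function names (`mse_loss`, `sgd_step`, `run_sgd`) and the linear model used for the demo are illustrative assumptions, not part of the paper; the small step size is chosen here only so that this toy run converges.

```python
import numpy as np

def mse_loss(theta, X, y, h):
    """Training loss of Eq. (1): 1/(2n) * sum_i (h_theta(x_i) - y_i)^2."""
    residuals = np.array([h(theta, x) for x in X]) - y
    return 0.5 * np.mean(residuals ** 2)

def sgd_step(theta, x_i, y_i, h, grad_h, eta):
    """One single-sample SGD step, as in Eq. (2)."""
    return theta - eta * (h(theta, x_i) - y_i) * grad_h(theta, x_i)

def run_sgd(theta0, X, y, h, grad_h, eta, n_steps, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for _ in range(n_steps):
        i = rng.integers(len(y))  # index sampled uniformly at each step
        theta = sgd_step(theta, X[i], y[i], h, grad_h, eta)
    return theta

# Hypothetical overparameterized example (p = 20 > n = 5) with a linear model.
h_lin = lambda theta, x: theta @ x
grad_h_lin = lambda theta, x: x

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(5, 20))
y_demo = rng.normal(size=5)
theta_final = run_sgd(np.zeros(20), X_demo, y_demo, h_lin, grad_h_lin,
                      eta=0.02, n_steps=5000)
final_loss = mse_loss(theta_final, X_demo, y_demo, h_lin)
```

In this small-step regime the iterates simply converge to an interpolator; the large-step regime studied in the rest of the section behaves differently.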

2.1. BACKGROUND: SGD IS GD WITH SPECIFIC LABEL NOISE

To emphasize the combined roles of gradient and noise, we highlight the connection between the SGD dynamics and that of full-batch GD plus a specific label noise. Such a reformulation of the dynamics has already been used in previous works attempting to understand the specificity of the SGD noise (HaoChen et al., 2021; Ziyin et al., 2022). We formalize it in the following proposition.

Proposition 1. Let $(\theta_t)_{t \ge 0}$ follow the SGD dynamics Eq. (2) with sampling function $(i_t)_{t \ge 0}$. Let $\mathbf{1}_{i = i_t}$ denote the indicator function and define, for $t \ge 0$, the random vector $\xi_t \in \mathbb{R}^n$ such that for all $i \in \{1, \dots, n\}$,

$$[\xi_t]_i := \big(h_{\theta_t}(x_i) - y_i\big)\big(1 - n \mathbf{1}_{i = i_t}\big).$$

Then $(\theta_t)_{t \ge 0}$ follows the full-batch gradient dynamics on $L$ with label noise $(\xi_t)_{t \ge 0}$, that is,

$$\theta_{t+1} = \theta_t - \frac{\eta}{n} \sum_{i=1}^{n} \big(h_{\theta_t}(x_i) - y_i^t\big) \nabla_\theta h_{\theta_t}(x_i), \tag{3}$$

where we define the random labels $y^t := y + \xi_t$. Furthermore, $\xi_t$ is a mean-zero random vector with variance such that $\frac{1}{n(n-1)} \mathbb{E}\|\xi_t\|^2 = 2L(\theta_t)$.

This reformulation shows two crucial aspects of the SGD noise: (i) the noisy part at state $\theta$ always belongs to the linear space spanned by $\{\nabla_\theta h_\theta(x_1), \dots, \nabla_\theta h_\theta(x_n)\}$, and (ii) it scales as the training loss. Concerning (ii), we highlight in the following section that the loss can stabilize because of large step sizes, which makes the effective scale of the label noise constant. These features are of paramount importance when modelling the effective dynamics that take place during loss stabilization.
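Proposition 1 can be checked numerically on a small example. The sketch below assumes a linear model $h_\theta(x) = \langle \theta, x \rangle$ purely for convenience (so that $\nabla_\theta h_\theta(x_i) = x_i$); the variable names are illustrative. It compares the SGD update with the label-noise GD update for every possible index $i_t$, and computes the exact mean and second moment of $\xi_t$ over the uniform choice of $i_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eta = 6, 10, 0.1
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
theta = rng.normal(size=p)

residuals = X @ theta - y              # h_theta(x_i) - y_i for the linear model
loss = 0.5 * np.mean(residuals ** 2)   # L(theta)

def sgd_update(i_t):
    """Single-sample SGD step, Eq. (2)."""
    return theta - eta * residuals[i_t] * X[i_t]

def label_noise_gd_update(i_t):
    """Full-batch GD step on the noisy labels y^t = y + xi_t (Proposition 1)."""
    indicator = np.zeros(n)
    indicator[i_t] = 1.0
    xi = residuals * (1.0 - n * indicator)
    noisy_residuals = X @ theta - (y + xi)
    return theta - (eta / n) * X.T @ noisy_residuals, xi

# Exact moments of xi_t over the uniform choice of i_t.
xis = np.array([label_noise_gd_update(i)[1] for i in range(n)])
mean_xi = xis.mean(axis=0)
second_moment = np.mean(np.sum(xis ** 2, axis=1))
```

Under these assumptions, the two updates coincide for every index, `mean_xi` vanishes, and `second_moment / (n * (n - 1))` matches `2 * loss`.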

2.2. THE EFFECTIVE DYNAMICS BEHIND LOSS STABILIZATION

On loss stabilization. For generic quadratic costs, e.g., $F(\beta) := \|X\beta - y\|^2$, gradient descent with step size $\eta$ is convergent for $\eta < 2/\lambda_{\max}$, divergent for $\eta > 2/\lambda_{\max}$, and converges to a bouncing 2-periodic dynamics for $\eta = 2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of the Hessian. However, the practitioner is not likely to hit this unstable step size exactly and, almost surely, the dynamics will either converge or diverge. Yet, non-quadratic costs add a particular complexity to this picture: it has been shown that, even for non-convex toy models, there exists an open interval of step sizes for which gradient descent neither converges nor diverges (Ma et al., 2022; Chen and Bruna, 2022). As we are interested in SGD, we complement this result by presenting a toy example in which loss stabilization occurs almost surely in the case of stochastic updates. Indeed, consider a regression problem with quadratic parameterization on one-dimensional inputs $x_i$, coming from a distribution $\rho$, and outputs generated by the linear model $y_i = x_i \theta_*^2$. The loss writes

$$F(\theta) := \frac{1}{4} \mathbb{E}_\rho \big(y - x\theta^2\big)^2,$$

and the SGD iterates with step size $\eta > 0$ follow, for any $t \in \mathbb{N}$,

$$\theta_{t+1} = \theta_t + \eta\, \theta_t\, x_{i_t} \big(y_{i_t} - x_{i_t} \theta_t^2\big), \quad \text{where } x_{i_t} \sim \rho.$$

For the sake of concreteness and clarity, suppose that $\theta_* = 1$ and $\mathrm{supp}(\rho) = [a, b]$. We then have the following proposition (a more general result can be found in Proposition 3 of the Appendix).

Proposition 2. For any $\eta \in (a^{-2}, 1.25 \cdot b^{-2})$ and initialization $\theta_0 \in (0, 1)$, for $t > 0$,

$$\delta_1 < F(\theta_t) < \delta_2 \quad \text{almost surely, and} \tag{6}$$

$$\exists T > 0,\ \forall k > T,\quad \theta_{t+2k} < 1 < \theta_{t+2k+1} \quad \text{almost surely,} \tag{7}$$

where $\delta_1, \delta_2, T > 0$ are constants given in the Appendix.

The proposition is divided in two parts: if the step size is large enough, Eq. (6) shows that the loss stabilizes between the level sets $\delta_1$ and $\delta_2$, and Eq. (7) shows that after some initial phase, the iterates bounce from one side of the loss valley to the other.
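The toy model above is easy to simulate. The sketch below assumes that $\rho$ is the uniform distribution on $[a, b]$ with $a = 0.95$, $b = 1$, and takes $\eta = 1.2$, which lies in $(a^{-2}, 1.25 \cdot b^{-2}) \approx (1.108, 1.25)$; these concrete values and the function name `simulate_toy_sgd` are illustrative choices, not taken from the paper.

```python
import numpy as np

def simulate_toy_sgd(eta, a, b, theta0, n_steps, seed=0):
    """SGD on F(theta) = 1/4 E (y - x theta^2)^2 with y = x (theta_* = 1), x ~ U[a, b]."""
    rng = np.random.default_rng(seed)
    second_moment = (a * a + a * b + b * b) / 3.0  # E[x^2] for x ~ U[a, b]
    thetas = np.empty(n_steps + 1)
    thetas[0] = theta0
    for t in range(n_steps):
        x = rng.uniform(a, b)
        y = x  # labels generated with theta_* = 1
        thetas[t + 1] = thetas[t] + eta * thetas[t] * x * (y - x * thetas[t] ** 2)
    # Population loss along the trajectory: F(theta) = 1/4 E[x^2] (1 - theta^2)^2.
    losses = 0.25 * second_moment * (1.0 - thetas ** 2) ** 2
    return thetas, losses

thetas, losses = simulate_toy_sgd(eta=1.2, a=0.95, b=1.0, theta0=0.5, n_steps=2000)
tail_thetas, tail_losses = thetas[100:], losses[100:]
# After a short initial phase, consecutive iterates lie on opposite sides of theta_* = 1.
alternates = np.all((tail_thetas[:-1] - 1.0) * (tail_thetas[1:] - 1.0) < 0)
```

In this run the loss neither converges to zero nor diverges, and `alternates` records the bouncing pattern of Eq. (7).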
Note that despite the stochasticity of the procedure, the results hold almost surely.

The effective dynamics. As observed in the prototypical SGD training dynamics of Fig. 1 and proved in the non-convex toy model of Proposition 2, large step sizes lead the loss to stabilize around some level set. To further understand the effect of this loss stabilization in parameter space, we shall assume perfect stabilization. Then, from Proposition 1, we conjecture the following behaviour:

During loss stabilization, SGD is well modelled by GD with constant label noise.

Label noise dynamics have been studied recently (Blanc et al., 2020; Damian et al., 2021; Li et al., 2022) thanks to their connection with Stochastic Differential Equations (SDEs). To properly write an SDE model, the drift should match the gradient descent and the noise should have the correct covariance structure (Li et al., 2019a; Wojtowytsch, 2021a). Proposition 1 implies that the noise at state $\theta$ is spanned by the gradient vectors $\{\nabla_\theta h_\theta(x_1), \dots, \nabla_\theta h_\theta(x_n)\}$ and has a constant intensity corresponding to the loss stabilization at a level $\delta > 0$. Hence, we propose the following SDE model

$$d\theta_t = -\nabla_\theta L(\theta_t)\, dt + \sqrt{\eta\delta}\; \phi_{\theta_t}(X)^\top dB_t, \tag{8}$$

where $(B_t)_{t \ge 0}$ is a standard Brownian motion in $\mathbb{R}^n$ and $\phi_\theta(X) := [\nabla_\theta h_\theta(x_i)^\top]_{i=1}^{n} \in \mathbb{R}^{n \times p}$ is referred to as the Neural Tangent Kernel (NTK) feature matrix (Jacot et al., 2018). This SDE can be seen as the effective slow dynamics that drives the iterates while they bounce rapidly in some directions at the level set $\delta$ (fast dynamics). It highlights the combination of the deterministic part of the full-batch gradient and the noise induced by SGD at level set $\delta$, which depends on the step size of SGD. We confirm the validity of this SDE modeling empirically in Sec. C, showing that the SDE captures the dynamics of large step size SGD even for non-linear networks. In the next section, we leverage the SDE (8) to understand the implicit bias of such learning dynamics.
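A minimal way to simulate the SDE (8) is an Euler-Maruyama discretization. This is a standard numerical scheme and only a sketch: the function name, the time step `dt`, and the linear model used in the demo (for which $\phi_\theta(X) = X$) are assumptions made for illustration, not the discretization used in the paper's experiments.

```python
import numpy as np

def euler_maruyama_step(theta, grad_loss, features, eta, delta, dt, rng):
    """One Euler-Maruyama step of d(theta) = -grad L dt + sqrt(eta*delta) phi^T dB."""
    phi = features(theta)                              # NTK feature matrix, shape (n, p)
    dB = np.sqrt(dt) * rng.normal(size=phi.shape[0])   # Brownian increment in R^n
    return theta - grad_loss(theta) * dt + np.sqrt(eta * delta) * phi.T @ dB

# Hypothetical demo with a linear model h_theta(x) = <theta, x>, so phi_theta(X) = X.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3, 8))
y_demo = rng.normal(size=3)
grad_loss = lambda th: X_demo.T @ (X_demo @ th - y_demo) / len(y_demo)
features = lambda th: X_demo

theta0 = rng.normal(size=8)
dt = 1e-2
drift_only = theta0 - grad_loss(theta0) * dt
new_theta = euler_maruyama_step(theta0, grad_loss, features,
                                eta=0.5, delta=0.1, dt=dt, rng=rng)
noise_part = new_theta - drift_only
```

By construction, `noise_part` lies in the span of the rows of `X_demo`, i.e., in the span of the gradients $\nabla_\theta h_\theta(x_i)$, which is property (i) of Proposition 1.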

2.3. SPARSE FEATURE LEARNING

In this section, we give insights on the effective dynamics given by Eq.( 8). We begin with a simple model of diagonal linear networks that showcase a sparsity inducing dynamics and further disclose our general message about the overall implicit bias promoted by the effective dynamics.

2.3.1. A WARM-UP: DIAGONAL LINEAR NETWORKS

An appealing example of simple non-linear networks that helps in forging an intuition for more complicated architectures is diagonal linear networks (Vaskevicius et al., 2019; Woodworth et al., 2020; HaoChen et al., 2021; Pesme et al., 2021). They are two-layer linear networks with only diagonal connections: the prediction function writes $h_{u,v}(x) = \langle u, v \odot x \rangle = \langle u \odot v, x \rangle$, where $\odot$ denotes elementwise multiplication. Even though the loss is convex in the associated linear predictor $\beta := u \odot v \in \mathbb{R}^d$, it is not convex in $(u, v)$; hence the training of such simple models already exhibits a rich non-convex dynamics. In this case, $\nabla_u h_{u,v}(x) = v \odot x$, and the SDE model Eq. (8) writes

$$du_t = -\frac{1}{n} \big[X^\top \big(X(u_t \odot v_t) - y\big)\big] \odot v_t\, dt + \sqrt{\eta\delta}\; v_t \odot [X^\top dB_t], \tag{9}$$

where $(B_t)_{t \ge 0}$ is a standard Brownian motion in $\mathbb{R}^n$. The equations are symmetric for $(v_t)_{t \ge 0}$. What is the behaviour of this effective dynamics? Pillaud-Vivien et al. (2022) answered this question by analyzing a similar stochastic dynamics and unveiled the sparse nature of the resulting solutions. Indeed, under sparse recovery assumptions, denoting by $\beta^*$ the sparsest linear predictor that interpolates the data, it is shown that the associated linear predictor $\beta_t = u_t \odot v_t$: (i) converges exponentially fast to zero outside of the support of $\beta^*$, and (ii) is with high probability in an $O(\sqrt{\eta\delta})$ neighborhood of $\beta^*$ on its support after a time $O(\delta^{-1})$.

Overall conclusion on the model. During a first phase, SGD with large step sizes $\eta$ decreases the training loss until stabilization at some level set $\delta > 0$. During this loss stabilization, an effective noise-driven dynamics takes place. It shrinks the coordinates outside of the support of the sparsest signal and oscillates in parameter space at level $O(\sqrt{\eta\delta})$ on its support. Hence, decreasing the step size later leads to perfect recovery of the sparsest predictor. This behaviour is illustrated in our experiments in Figure 2.
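As a sanity check, Eq. (9) can be recovered from Eq. (8) by restricting it to the block of parameters $u$. The short derivation below only uses the definitions above; the symbol $\phi^{(u)}$ for the $u$-block of the NTK feature matrix is notation introduced here for illustration.

```latex
% u-block of the NTK feature matrix: its i-th row is (v \odot x_i)^\top
\phi^{(u)}_{u,v}(X) = X \,\mathrm{diag}(v) \in \mathbb{R}^{n \times d},
\qquad
\phi^{(u)}_{u,v}(X)^\top dB_t = \mathrm{diag}(v)\, X^\top dB_t = v \odot [X^\top dB_t].

% Gradient of L(u, v) = \frac{1}{2n} \| X (u \odot v) - y \|^2 with respect to u
\nabla_u L(u, v) = \frac{1}{n} \big[ X^\top \big( X (u \odot v) - y \big) \big] \odot v.

% Exchanging the roles of u and v gives the companion equation, driven by the same B_t
dv_t = -\frac{1}{n} \big[ X^\top \big( X (u_t \odot v_t) - y \big) \big] \odot u_t \, dt
       + \sqrt{\eta\delta}\; u_t \odot [X^\top dB_t].
```

Plugging the first two lines into Eq. (8) gives Eq. (9); the noise on each coordinate of $u$ is multiplied by the corresponding coordinate of $v$, and vice versa.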

2.3.2. THE SPARSE FEATURE LEARNING CONJECTURE FOR MORE GENERAL MODELS

Results for diagonal linear nets recalled in the previous paragraph show that the noisy dynamics (9) induce a sparsity bias. As emphasized in HaoChen et al. (2021), this effect is largely due to the multiplicative structure of the noise $v \odot [X^\top dB_t]$ that, in this case, has a shrinking effect on the coordinates (because of the coordinate-wise multiplication with $v$). In the general case, we see, thanks to Eq. (8), that the same multiplicative structure of the noise still arises, but this time with respect to the NTK feature matrix $\phi_\theta(X)$. Hence, this suggests that, similarly to the diagonal linear network case, the implicit bias of the noise can lead to a shrinkage effect applied to $\phi_\theta(X)$, which depends on the noise intensity $\delta$ and the step size of SGD. Indeed, an interesting property of Brownian motion is that, for $v \in \mathbb{R}^n$, $\langle v, B_t \rangle = \|v\|_2 W_t$, where the equality holds in law and $(W_t)_{t \ge 0}$ is a one-dimensional Brownian motion. Hence, the process Eq. (8) is equivalent to a process whose $i$-th coordinate is driven by a noise proportional to $\|\phi_i\|\, dW_t^i$, where $\phi_i$ is the $i$-th column of $\phi_\theta(X)$ and $(W_t^i)_{t \ge 0}$ is a one-dimensional Brownian motion. This SDE structure, similar to the geometric Brownian motion, is expected to induce the shrinkage of each multiplicative factor (Oksendal, 2013, Section 5.1), i.e., in our case $(\|\nabla_\theta h(x_i)\|)_{i=1}^{n}$. Thus, we conjecture:

The noise part of Eq. (8) seeks to minimize the $\ell_2$-norm of the columns of $\phi_\theta(X)$.

Note that the fitting part of the dynamics prevents the NTK feature matrix from collapsing totally to zero, but as soon as columns are not needed to fit the signal, they can be reduced to zero. Remarkably, from a stability perspective, Blanc et al. (2020) showed a similar bias: locally around a minimum, the SGD dynamics implicitly tries to minimize the Frobenius norm $\|\phi_\theta(X)\|_F = \big(\sum_{i=1}^{n} \|\nabla_\theta h_\theta(x_i)\|^2\big)^{1/2}$.
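The shrinkage mechanism invoked above can be made explicit on the driftless scalar geometric Brownian motion, which is a standard textbook computation. The symbols $Z_t$ and $\sigma$ below are illustrative and are not part of the model of Eq. (8).

```latex
% Driftless geometric Brownian motion and its explicit solution (Ito's formula)
dZ_t = \sigma Z_t \, dW_t, \quad Z_0 > 0
\;\Longrightarrow\;
Z_t = Z_0 \exp\!\Big( \sigma W_t - \tfrac{\sigma^2}{2}\, t \Big).

% Since W_t / t \to 0 almost surely as t \to \infty, the exponent tends to -\infty:
Z_t \xrightarrow[t \to \infty]{} 0 \quad \text{almost surely},
\qquad \text{while} \quad \mathbb{E}[Z_t] = Z_0 \ \text{for all } t \ge 0.
```

Purely multiplicative noise therefore drives typical trajectories to zero even though it has no drift, and it does so faster for larger $\sigma$; this is the intuition behind the conjectured shrinkage of the multiplicative factors.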
We provide below a specification of this implicit bias for different architectures:

• Diagonal linear networks: For $h_{u,v}(x) = \langle u \odot v, x \rangle$, we have $\nabla_{u,v} h_{u,v}(x) = [v \odot x, u \odot x]$. Thus, for a generic data matrix $X$, minimizing the norm of each column of $\phi_{u,v}(X)$ amounts to setting as many coordinates as possible to zero, and hence to minimizing $\|u \odot v\|_0$.

• ReLU networks: We take the prototypical one-hidden-layer network to exhibit the sparsification effect. Let $h_{a,W}(x) = \langle a, \sigma(Wx) \rangle$; then $\nabla_a h_{a,W}(x) = \sigma(Wx)$ and $\nabla_{w_j} h_{a,W}(x) = a_j\, x\, \mathbf{1}_{\langle w_j, x \rangle > 0}$. Note that the $\ell_2$-norm of the column corresponding to a neuron is reduced when the neuron is activated at a minimal number of training points; hence the implicit bias enables the learning of sparse data-active features. Finally, when some directions are needed to fit the data, similarly activated neurons align to fit, allowing the rank of $\phi_\theta(X)$ to also be a good proxy for this feature sparsity.

Overall, fully understanding theoretically the structural implications of the implicit bias described above remains an exciting avenue for future work. We show next that the conjectured sparsity is indeed observed empirically for a variety of models, and that the rank reduction of $\phi_\theta(X)$ can be used as a good proxy for the hidden progress of the loss stabilization phase. This is confirmed both for SGD and its SDE modeling (the latter we show in App. C).
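For the one-hidden-layer ReLU network, the NTK feature matrix can be assembled directly from the two gradient formulas above. The sketch below is illustrative: the function names and the parameter ordering `[a, W.ravel()]` are our own conventions, and the demo deliberately includes one neuron that is inactive on all inputs.

```python
import numpy as np

def relu_net(a, W, x):
    """h_{a,W}(x) = <a, relu(W x)> with W of shape (m, d)."""
    return a @ np.maximum(W @ x, 0.0)

def ntk_features(a, W, X):
    """Feature matrix phi of shape (n, m + m*d), parameters ordered as [a, W.ravel()]."""
    pre = X @ W.T                                        # pre-activations <w_j, x_i>, shape (n, m)
    grad_a = np.maximum(pre, 0.0)                        # d h / d a_j = relu(<w_j, x_i>)
    active = (pre > 0).astype(float)                     # indicator 1[<w_j, x_i> > 0]
    grad_W = (active * a)[:, :, None] * X[:, None, :]    # d h / d w_j = a_j x_i 1[...], shape (n, m, d)
    return np.concatenate([grad_a, grad_W.reshape(len(X), -1)], axis=1)

# Hypothetical demo: positive inputs, and a last neuron with negative weights (never active).
rng = np.random.default_rng(0)
m, d = 4, 3
X_demo = np.abs(rng.normal(size=(5, d))) + 0.1
a_demo = rng.normal(size=m)
W_demo = rng.normal(size=(m, d))
W_demo[-1] = -1.0
phi_demo = ntk_features(a_demo, W_demo, X_demo)
```

In this demo, all columns of `phi_demo` associated with the inactive neuron are exactly zero, which is the kind of column sparsity that the conjecture refers to.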

3. EMPIRICAL EVIDENCE OF SPARSE FEATURE LEARNING DRIVEN BY SGD

Here we present empirical results for neural networks of increasing complexity: from diagonal linear networks to deep residual networks on CIFAR-10 and CIFAR-100. We make the following common observations for all these networks trained using SGD schedules with large step sizes: (O1) Loss stabilization: the training loss stabilizes around a high level set until the step size is decayed, (O2) Generalization benefit: longer loss stabilization leads to better generalization, (O3) Sparse feature learning: longer loss stabilization leads to sparser features. Importantly, we use no explicit regularization in our experiments, so that the training dynamics is driven purely by SGD and the step size schedule. Additionally, in some cases, we cannot find a single large step size that would lead to loss stabilization. In such cases, whenever explicitly mentioned, we use a warmup step size schedule (i.e., increasing step sizes according to some schedule) to make sure that the training loss stabilizes around some level set. Such warmup schedules are commonly used in practice (He et al., 2016; Devlin et al., 2018). Warmup is often motivated purely from the optimization perspective as a way to accelerate training (Agarwal et al., 2021), but we suggest that, more importantly, it is also a way to amplify the regularization effect of the SGD noise, which is proportional to the step size.

Measuring sparse feature learning. Our main insight is that the NTK feature matrix is significantly simplified in the loss stabilization phase, and that the rank of $\phi_\theta(X)$ (i.e., the sparsity of its singular values) is a good proxy to track this dynamics. We compute it over iterations for each model (except deep networks, where it is not feasible) by using a fixed threshold on the singular values of $\phi_\theta(X)$ normalized by the largest singular value. In this way, we ensure that the difference in the rank that we detect is not simply due to a different scale of $\phi_\theta(X)$.
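The thresholded rank described above can be sketched as follows. The function name and the relative threshold `rel_tol=1e-3` are placeholders chosen for illustration; the threshold actually used in the experiments is not stated in this section.

```python
import numpy as np

def thresholded_rank(phi, rel_tol=1e-3):
    """Number of singular values of phi exceeding rel_tol times the largest one."""
    singular_values = np.linalg.svd(phi, compute_uv=False)  # sorted in descending order
    if singular_values[0] == 0:
        return 0
    return int(np.sum(singular_values / singular_values[0] > rel_tol))

# Hypothetical check on a matrix of known rank 3.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 30))
rank_original = thresholded_rank(low_rank)
rank_rescaled = thresholded_rank(1e6 * low_rank)
```

Because the singular values are normalized by the largest one, rescaling the matrix leaves the result unchanged, which is the scale invariance motivated in the text.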
Moreover, we always compute $\phi_\theta(X)$ on a number of fresh samples equal to the number of parameters $|\theta|$ to make sure that rank deficiency is not coming from $n \ll |\theta|$, which is the case in the overparametrized settings we consider. Furthermore, we also want to track a more direct and interpretable notion of feature sparsity. This motivates us to count the average number of distinct (i.e., counting a group of highly correlated activations as one), non-zero activations at some layer over the training set, which we refer to as the feature sparsity coefficient. We count a pair of activations $i$ and $j$ as highly correlated if their Pearson's correlation coefficient is at least 0.95. Unlike $\mathrm{rank}(\phi_\theta(X))$, the feature sparsity coefficient scales to deep networks and has an easy-to-grasp meaning.

3.1. SPARSE FEATURE LEARNING IN DIAGONAL LINEAR NETWORKS

We consider four different SGD runs (started from $u_i = 0.1$, $v_i = 0$ for each $i$): one with a small step size and three others with an initial large step size decayed after 10%, 30%, 50% of iterations, respectively.

Observations. We show the results in Fig. 2 and note that (O1)-(O3) hold even in this simple model trained with vanilla SGD without any explicit regularization or layer normalization schemes. We observe that the training loss stabilizes around $10^{-1.5}$, the test loss improves for longer schedules, and both $\mathrm{rank}(\phi_\theta(X))$ and $\|u \odot v\|_0$ decrease during the loss stabilization phase, leading to a sparse final predictor. While the training loss has seemingly converged to $10^{-1.5}$, a hidden dynamics suggested by Eq. (9) occurs, which slowly drifts the iterates to a sparse solution. This implicit sparsification explains the dependence of the final test loss on the time when the large step size is decayed, similarly to what has been observed for deep networks in Fig. 1. Interestingly, we also note that SGD with large step-size schedules encounters saddle points after we decay the step size (see the training loss curves in Fig. 2), which resembles the saddle-to-saddle regime described in Jacot et al. (2021) and does not occur in the large-initialization lazy training regime.

SGD and GD have different implicit biases. Since we observe from Fig. 2 that, for loss stabilization, stochasticity alone does not suffice and large step sizes are necessary, one may wonder if, conversely, large step sizes alone can be sufficient to have a sparsifying effect. Even if special instances can be found for which large step sizes are sufficient (such as for non-centered input features as in Nacson et al. (2022)), we answer this negatively, showing that gradient descent in general does not go to the sparsest solution, as demonstrated in Fig. 10 in the Appendix. Moreover, in Fig. 3, we visualize the difference in trajectory between the two methods taken with large step sizes over a 2D subspace spanned by $w^\star - w_{\mathrm{init}}$ and $w_{\mathrm{flow}} - w_{\mathrm{init}}$, where $w^\star$ is the ground truth, $w_{\mathrm{flow}}$ is the result of gradient flow, and $w_{\mathrm{init}}$ is the initialization.
This example provides an important intuition that loss stabilization alone is not sufficient for sparsification and that the role of noise described earlier is crucial.

3.2. SPARSE FEATURE LEARNING IN SIMPLE RELU NETWORKS

Two-layer ReLU network in 1D. We consider the one-dimensional regression task from Blanc et al. (2020) with 12 points, where label noise SGD has been shown to learn a simple model. We show that similar results can be achieved with large-step-size SGD via loss stabilization. We train a ReLU network with 100 neurons with SGD with a linear warmup (otherwise, we were unable to achieve approximate loss stabilization), directly followed by a step-size decay. The two plots correspond to a warmup/decay transition at 2% and 50% of iterations, respectively. The results shown in Fig. 4 confirm that (O1)-(O3) hold: the training loss stabilizes around $10^{-0.5}$, the predictor becomes much simpler and is expected to generalize better, and both $\mathrm{rank}(\phi_\theta(X))$ and the feature sparsity coefficient substantially decrease during the loss stabilization phase. For this one-dimensional task, we can directly observe the final predictor, which is sparse in terms of the number of distinct ReLU kinks, as captured by the feature sparsity coefficient and the rank of the NTK feature matrix. Interestingly, we also observed overregularization for even larger step sizes, in which case we cannot fit all the training points (see Fig. 11 in Appendix). This phenomenon clearly illustrates how the capacity control is induced by the optimization algorithm: the function class over which we optimize depends on the step size schedule. Additionally, Fig. 12 in Appendix shows the evolution of the predictor over iterations. The general picture is confirmed: the model is first simplified during the loss stabilization phase and only then fits the training data.

Deeper ReLU networks. We use a teacher-student setup with a random three-layer teacher ReLU network having 2 neurons on each hidden layer. The student network is overparametrized with 10 neurons on each layer and is trained on 50 examples.
Such a teacher-student setup is useful since we know that the student network can implement the ground truth function but might not find it due to the small sample size. We train models using SGD with a medium constant step size and a large step size with warmup decayed after 10%, 30%, 50% of iterations, respectively. The results shown in Fig. 5 confirm that (O1)-(O3) hold: the training loss stabilizes around $10^{-1.5}$, the test loss is smaller for longer schedules, and both $\mathrm{rank}(\phi_\theta(X))$ and the feature sparsity coefficient substantially decrease during the loss stabilization phase. All methods have the same value of the training loss ($10^{-3}$) after $10^4$ iterations but different generalization. Moreover, we see that the feature sparsity coefficient decreases on each layer, which makes this metric a promising one to consider for deeper networks.
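One possible reading of the feature sparsity coefficient used throughout this section is sketched below. Several details are assumptions on our side rather than the paper's specification: correlated units are grouped greedily against a representative unit, a group counts as active on an example if any of its units is non-zero there, and the average count is normalized by the number of units so that the result is a fraction.

```python
import numpy as np

def feature_sparsity_coefficient(activations, corr_threshold=0.95):
    """Average fraction of distinct non-zero activations per example (illustrative)."""
    A = np.asarray(activations, dtype=float)   # shape (n_samples, n_units)
    n_units = A.shape[1]
    centered = A - A.mean(axis=0)
    norms = np.linalg.norm(centered, axis=0)
    safe_norms = np.where(norms > 0, norms, 1.0)   # constant units get correlation 0
    Z = centered / safe_norms
    corr = Z.T @ Z                                  # Pearson correlations between units

    # Greedy grouping: a unit joins the first group whose representative it correlates with.
    representatives = []
    group_of = np.empty(n_units, dtype=int)
    for j in range(n_units):
        for g, r in enumerate(representatives):
            if corr[j, r] >= corr_threshold:
                group_of[j] = g
                break
        else:
            group_of[j] = len(representatives)
            representatives.append(j)

    nonzero = A != 0
    group_active = np.stack(
        [nonzero[:, group_of == g].any(axis=1) for g in range(len(representatives))],
        axis=1,
    )
    return group_active.sum(axis=1).mean() / n_units

# Hypothetical activations: unit 1 duplicates unit 0 up to scale, unit 3 is dead.
u0 = np.array([1.0, 0.0, 2.0, 0.0, 3.0, 0.0])
u2 = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 2.0])
A_demo = np.column_stack([u0, 2.0 * u0, u2, np.zeros(6)])
coefficient = feature_sparsity_coefficient(A_demo)
```

In this demo, five of the six examples activate exactly one distinct group out of four units, so the coefficient equals 5/24.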

3.3. SPARSE FEATURE LEARNING IN DEEP RELU NETWORKS

Setup. We consider here an image classification task and train a ResNet-18 and a ResNet-34 on CIFAR-10 and CIFAR-100 using SGD with batch size 256 and different step size schedules. We use an exponentially increasing warmup schedule with exponent 1.05 to stabilize the training loss. We cannot measure the rank of $\phi(X)$ here since this matrix is too large ($\approx 50\,000 \times 20\,000\,000$), so we measure only the feature sparsity coefficient taken at two layers of the ResNets: at the end of super-block 3 (i.e., in the middle of the network) and at the end of super-block 4 (i.e., right before global average pooling at the end of the network). We test two settings: a basic setting without explicit regularizers and a state-of-the-art setting with weight decay, momentum, and standard augmentations.

Observations. The results on CIFAR-10 shown in Fig. 6 confirm that our main findings also hold in this setting: the training loss stabilizes either slightly below $10^{-1}$ or above $10^{-1}$, and both the test error and the feature sparsity coefficient become progressively better for longer schedules. Small step sizes lead to bad generalization, especially without explicit regularization: 35% test error compared to 15% for large step sizes. This poor performance confirms that it is crucial to leverage the implicit bias of large step sizes. The difference in the feature sparsity coefficient is also substantial, with the final model having 70% instead of 24% at block 4 without explicit regularization. The observations are similar for the state-of-the-art setting, where even with explicit regularization we still see a noticeable difference in generalization and feature sparsity depending on the step size and schedule.
We further note that the feature sparsity coefficient is gradually minimized over iterations in this case (similarly to Figures 2, 4, 5), while without explicit regularization we observe a different pattern: a very quick drop down to almost zero in the very first epoch and then a gradual increase. We show the results with similar findings on CIFAR-100 in Fig. 15 in Appendix. Additionally, Fig. 14 illustrates that for small step sizes, the early and middle layers stay very close to their random initialization, which indicates the absence of feature learning, similarly to what is suggested by the neuron movement plot in Fig. 9 in the Appendix for a two-layer network in a teacher-student setup.
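The kind of schedule used in this setup (exponential warmup followed by a single decay) can be sketched as below. Only the warmup factor 1.05 comes from the text; the per-epoch granularity, the cap `eta_max`, the decay factor 0.1, and the function name are assumptions made for illustration.

```python
def step_size(epoch, eta_init, eta_max, decay_epoch,
              warmup_factor=1.05, decay_factor=0.1):
    """Step size at a given epoch: exponential warmup capped at eta_max, then one decay."""
    if epoch < decay_epoch:
        # Warmup phase: multiply the step size by warmup_factor at each epoch.
        return min(eta_init * warmup_factor ** epoch, eta_max)
    # After the decay point: scale down the value reached at the end of the large-step phase.
    reached = min(eta_init * warmup_factor ** decay_epoch, eta_max)
    return reached * decay_factor

# Hypothetical schedule: start at 0.1, cap at 1.0, decay after 100 epochs.
schedule = [step_size(e, eta_init=0.1, eta_max=1.0, decay_epoch=100) for e in range(120)]
```

Moving `decay_epoch` later lengthens the large-step-size phase, which is the quantity varied across the runs compared in this section.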

4. INSIGHTS FROM OUR UNDERSTANDING OF THE TRAINING DYNAMICS

Here we provide an extended discussion on the implications of our theoretical and empirical findings.

The multiple stages of the SGD training dynamics. As analyzed and shown empirically, the training dynamics we considered can be split into three distinct phases: (i) an initial phase of reducing the loss down to some level where stabilization can occur, (ii) a loss stabilization phase where noise and gradient directions combine to find architecture-dependent sparse representations of the data, and (iii) a final phase in which the step size is decreased to fit the training data. This typology makes it possible to clearly isolate the effect of the stabilization phase (ii), which relies on the implicit bias of SGD to simplify the model. Note that phases (ii) and (iii) can be repeated a few times until final convergence (He et al., 2016). Moreover, in some training schedules, phase (ii) does not occur explicitly, and the effects of loss stabilization (ii) and data fitting (iii) can occur simultaneously (Nakkiran et al., 2019).

From lazy training to feature learning. Similar sparse implicit biases have been shown for regression with infinitely small initialization (Boursier et al., 2022) and for classification (Chizat and Bach, 2020; Lyu and Li, 2020). However, neither approach is practical from a computational point of view since (i) the origin is a saddle point for regression, leading to the vanishing gradient problem (especially for deep networks), and (ii) the max-margin bias for classification is only expected to appear in the asymptotic phase (Moroshko et al., 2020). In contrast, large step sizes make it possible to initialize far from the origin while transitioning efficiently from a regime close to the lazy NTK regime (Jacot et al., 2018) to the rich feature learning regime.

Common patterns in the existing techniques. Tuning the step size to obtain loss stabilization can be difficult. To prevent early divergence caused by too large step sizes, we sometimes had to rely on an increasing step size schedule (known as warmup). Interpreting such schedules as a tool to favor implicit regularization provides a new explanation for their success and popularity. Additionally, schemes such as batch normalization or weight decay, beyond carrying their own implicit or explicit regularization properties, can be analyzed through a similar lens: they allow the use of larger step sizes, which further boost the implicit bias effect of SGD while preventing divergence (Bjorck et al., 2018; Zhang et al., 2018). Note also that we derived our analysis with batch size equal to one for the sake of clarity, but an arbitrary batch size B would simply be equivalent to replacing γ ← γ/B. As with large step sizes, preferring smaller batch sizes (Keskar et al., 2016) while avoiding divergence seems key to benefiting from the implicit bias of SGD. Finally, the effect of large step sizes or small batches is often connected to measures of flatness of the loss surface via stability analysis (Wu et al., 2018), and some methods such as Hessian regularization (Damian et al., 2021) or SAM (Foret et al., 2021) explicitly optimize it. Such methods resemble the implicit bias of SGD with loss stabilization implied by the label noise equation (Eq.(8)), where the matrix ϕ_θ(X) is the key component of the Hessian. However, an important practical difference is that the regularization strength in these methods is explicit and decoupled from the step size schedule, which may be harder to tune properly since it is simultaneously responsible for optimization and generalization.

APPENDIX

In Section A, we prove Proposition 1 on the equivalence between SGD and GD with added noise. In Section B, we provide the proof that loss stabilization occurs, as stated in Proposition 2. In Section C, we show experimentally that the proposed SDE model matches the SGD dynamics well. Finally, we present additional experiments in Section D.

Figure 7: Three-dimensional visualization of the SGD dynamics in a non-convex loss landscape. The SGD iterates (blue points) bounce from side to side around the bottom of the valley (the dotted green line). A slow movement pushes the iterates in the direction given by the green arrows.

To begin this appendix, we provide in Figure 7 a toy visualization of typical SGD dynamics when loss stabilization occurs. We run SGD on the diagonal linear network with one sample in two dimensions (n = 1, d = 2), adding label noise of the form given by Eq.(9), with balanced layers u = v. The blue points correspond to the iterates of the dynamics (linked by the orange dotted lines). The green line corresponds to the global minimum of the loss, which can be called the "bottom of the valley". We hope this helps the reader form a visual intuition for (i) the side-to-side bouncing around the bottom of the valley (in green), and (ii) the slow stochastic movement (in the direction of the green arrows).

A SGD AND LABEL NOISE GD

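For the sake of clarity, we recall below the statement of Proposition 1, which we prove in this section.

Proposition 1. Let $(\theta_t)_{t \ge 0}$ follow the SGD dynamics Eq.(2) with sampling function $(i_t)_{t \ge 0}$. Let $\mathbf{1}_{i = i_t}$ be the indicator function and define, for $t \ge 0$, the random vector $\xi_t \in \mathbb{R}^n$ such that for all $i \in \{1, \dots, n\}$,

$$[\xi_t]_i := (h_{\theta_t}(x_i) - y_i)(1 - n \mathbf{1}_{i = i_t}). \tag{10}$$

Then $(\theta_t)_{t \ge 0}$ follows the full-batch gradient dynamics on $L$ with label noise $(\xi_t)_{t \ge 0}$, that is

$$\theta_{t+1} = \theta_t - \frac{\eta}{n} \sum_{i=1}^{n} (h_{\theta_t}(x_i) - y_i^t) \nabla_\theta h_{\theta_t}(x_i),$$

where we define the random labels $y^t := y + \xi_t$. Furthermore, $\xi_t$ is a mean zero random vector with variance such that $\frac{1}{n(n-1)} \mathbb{E}\|\xi_t\|^2 = 2 L(\theta_t)$.

Proof. Note that

$$\frac{1}{n} \sum_{i=1}^{n} (h_{\theta_t}(x_i) - y_i^t) \nabla_\theta h_{\theta_t}(x_i) = \frac{1}{n} \sum_{i=1}^{n} (h_{\theta_t}(x_i) - y_i - [\xi_t]_i) \nabla_\theta h_{\theta_t}(x_i). \tag{12}$$

Using $[\xi_t]_i := (h_{\theta_t}(x_i) - y_i)(1 - n \mathbf{1}_{i = i_t})$, the right-hand side equals

$$\frac{1}{n} \sum_{i=1}^{n} \big(h_{\theta_t}(x_i) - y_i - (h_{\theta_t}(x_i) - y_i)(1 - n \mathbf{1}_{i = i_t})\big) \nabla_\theta h_{\theta_t}(x_i) = \sum_{i=1}^{n} \mathbf{1}_{i = i_t} (h_{\theta_t}(x_i) - y_i) \nabla_\theta h_{\theta_t}(x_i) = (h_{\theta_t}(x_{i_t}) - y_{i_t}) \nabla_\theta h_{\theta_t}(x_{i_t}),$$

which is exactly the stochastic gradient with respect to the sample $(x_{i_t}, y_{i_t})$.

Now we prove the second part of the proposition regarding the scale of the noise. Recall that, for all $i \le n$, we have $[\xi_t]_i = (h_{\theta_t}(x_i) - y_i)(1 - n \mathbf{1}_{i = i_t})$, where $i_t \sim \mathcal{U}(\{1, \dots, n\})$. Taking the expectation,

$$\mathbb{E}[\xi_t]_i = \mathbb{E}\big[(h_{\theta_t}(x_i) - y_i)(1 - n \mathbf{1}_{i = i_t})\big] = (h_{\theta_t}(x_i) - y_i)\big(1 - n \mathbb{E}[\mathbf{1}_{i = i_t}]\big) = 0,$$

as $\mathbb{E}[\mathbf{1}_{i = i_t}] = 1/n$. Coming to the variance,

\begin{align}
\mathbb{E}\|\xi_t\|^2 = \mathbb{E} \sum_{i=1}^{n} [\xi_t]_i^2 &= \sum_{i=1}^{n} \mathbb{E}[\xi_t]_i^2 \tag{16} \\
&= \sum_{i=1}^{n} (h_{\theta_t}(x_i) - y_i)^2 \, \mathbb{E}(1 - n \mathbf{1}_{i = i_t})^2 \tag{17} \\
&= \sum_{i=1}^{n} (h_{\theta_t}(x_i) - y_i)^2 \, \mathbb{E}(1 - 2n \mathbf{1}_{i = i_t} + n^2 \mathbf{1}_{i = i_t}) \tag{18} \\
&= \sum_{i=1}^{n} (h_{\theta_t}(x_i) - y_i)^2 (1 - 2 + n) \tag{19} \\
&= (n - 1) \sum_{i=1}^{n} (h_{\theta_t}(x_i) - y_i)^2 = 2n(n-1) L(\theta_t), \notag
\end{align}

where the last equality uses $L(\theta) = \frac{1}{2n} \sum_{i=1}^{n} (h_\theta(x_i) - y_i)^2$. This concludes the proof of the proposition.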
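The identities of Proposition 1 can be checked numerically by enumerating all possible sampled indices. The sketch below does so for a linear model $h_\theta(x) = \langle \theta, x \rangle$ on random data; the linear model, the data, and all function names are assumptions made for the illustration, and the squared loss $L(\theta) = \frac{1}{2n}\sum_i (h_\theta(x_i) - y_i)^2$ is the one implied by the variance identity.

```python
import random


def predict(theta, x):
    # Linear model h_theta(x) = <theta, x>, so grad_theta h_theta(x) = x.
    return sum(t * xj for t, xj in zip(theta, x))


def residuals(theta, X, y):
    return [predict(theta, x) - yi for x, yi in zip(X, y)]


def label_noise(res, i_t):
    # [xi_t]_i = (h(x_i) - y_i) * (1 - n * 1{i = i_t}), as in Eq.(10).
    n = len(res)
    return [r * (1 - n * (i == i_t)) for i, r in enumerate(res)]


def sgd_direction(theta, X, y, i_t):
    # Stochastic gradient on the sampled point (x_{i_t}, y_{i_t}).
    r = predict(theta, X[i_t]) - y[i_t]
    return [r * xj for xj in X[i_t]]


def noisy_gd_direction(theta, X, y, i_t):
    # Full-batch gradient computed with the random labels y^t = y + xi_t.
    n, d = len(y), len(theta)
    res = residuals(theta, X, y)
    xi = label_noise(res, i_t)
    noisy = [r - z for r, z in zip(res, xi)]  # h(x_i) - (y_i + [xi_t]_i)
    return [sum(noisy[i] * X[i][j] for i in range(n)) / n for j in range(d)]


rng = random.Random(0)
n, d = 5, 3
X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]
theta = [rng.gauss(0, 1) for _ in range(d)]

res = residuals(theta, X, y)
loss = sum(r * r for r in res) / (2 * n)

# Largest coordinate-wise gap between the two update directions over all i_t.
max_gap = max(
    abs(a - b)
    for i_t in range(n)
    for a, b in zip(sgd_direction(theta, X, y, i_t),
                    noisy_gd_direction(theta, X, y, i_t))
)
# Exact expectation over i_t ~ Uniform{0, ..., n-1}.
mean_xi = [sum(label_noise(res, i_t)[i] for i_t in range(n)) / n for i in range(n)]
second_moment = sum(sum(z * z for z in label_noise(res, i_t)) for i_t in range(n)) / n
```

Because the expectation is taken by enumeration rather than by sampling, `max_gap`, `mean_xi`, and `second_moment / (n * (n - 1)) - 2 * loss` should all vanish up to floating-point error.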

B QUADRATIC PARAMETERIZATION IN ONE DIMENSION

Again, for the Appendix to be self-contained, we recall the setup of Proposition 2 on loss stabilization. We consider a regression problem with quadratic parameterization on one-dimensional inputs $x_i$, drawn from a distribution $\rho$, and outputs generated by the linear model $y_i = x_i \theta_*^2$. The loss reads $F(\theta) := \frac{1}{4} \mathbb{E}_\rho (y - x\theta^2)^2$, and the SGD iterates with step size $\eta > 0$ follow, for any $t \in \mathbb{N}$,

$$\theta_{t+1} = \theta_t + \eta \, \theta_t x_{i_t} (y_{i_t} - x_{i_t} \theta_t^2), \quad \text{where } x_{i_t} \sim \rho. \tag{21}$$

We restate the proposition here.

Proposition 3 (Extended version of Proposition 2). Assume there exist $x_{\min}, x_{\max} > 0$ such that $\mathrm{supp}(\rho) \subset [x_{\min}, x_{\max}]$. Then for any $\eta \in ((\theta_* x_{\min})^{-2}, 1.25 (\theta_* x_{\max})^{-2})$, any initialization $\theta_0 \in (0, \theta_*)$, and $t \in \mathbb{N}$, we have almost surely

$$F(\theta_t) \in (\epsilon_0^2 \theta_*^2, \; 0.17 \, \theta_*^2),$$

where $\epsilon_0 = \min\{(\eta (\theta_* x_{\min})^2 - 1)/3, \; 0.02\}$. Also, almost surely, there exist $t, k > 0$ such that $\theta_{t+2k} \in (0.65 \, \theta_*, (1 - \epsilon_0) \theta_*)$ and $\theta_{t+2k+1} \in ((1 + \epsilon_0) \theta_*, 1.162 \, \theta_*)$.

Proof. Consider the SGD recursion Eq.(21) and note that $y = x \theta_*^2$:

$$\theta_{t+1} = \theta_t + \eta \, \theta_t x (x \theta_*^2 - x \theta_t^2) = \theta_t + \eta \, \theta_t x^2 (\theta_*^2 - \theta_t^2).$$

For clarity of exposition, we consider the rescaled version of the original SGD recursion,

$$\theta_{t+1}/\theta_* = \theta_t/\theta_* + \eta \, \theta_*^2 x^2 \, (\theta_t/\theta_*) \big(1 - (\theta_t/\theta_*)^2\big),$$

and, by making the benign change $\theta_t \leftarrow \theta_t/\theta_*$, we focus instead on the stochastic recursion

$$\theta_{t+1} = \theta_t + \gamma \theta_t (1 - \theta_t^2), \tag{26}$$

where $\gamma \sim \rho_\gamma$, the pushforward of $\rho$ under the map $z \mapsto \eta \, \theta_*^2 z^2$. Let $\Gamma := \mathrm{supp}(\rho_\gamma)$ be the support of the distribution of $\gamma$. From the range of $\eta$, it can be verified that $\Gamma \subseteq (1, 1.25)$. The proof of the proposition now follows from Lemma 5.

Lemma 4 (Bounded Region). Consider the recursion Eq.(26) with $\Gamma \subseteq (1, 1.25)$ and $0 < \theta_0 < 1$. Then for all $t > 0$, $\theta_t \in (0, 1.162)$.

Proof. Consider a single step of Eq.(26) for some $\gamma \in (1, 1.25)$,

$$\theta_+ = \theta + \gamma \theta (1 - \theta^2).$$

The aim is to show that $\theta_+$ stays in the interval $(0, 1.162)$. To show this, we proceed case by case.

For $\theta \in (0, 1]$: Since $0 < \theta \le 1$, we have $\theta_+ \ge \theta > 0$. To prove the upper bound, consider the quantity

$$\theta_{\max} = \max_{\gamma \in (1, 1.25)} \; \max_{\theta \in (0, 1]} \; \theta + \gamma \theta (1 - \theta^2). \tag{27}$$

Let $h_\gamma(\theta) = \theta + \gamma \theta (1 - \theta^2)$, and note that $h'_\gamma(\theta) = 1 + \gamma - 3\gamma\theta^2$ and $h''_\gamma(\theta) = -6\gamma\theta < 0$. Hence, for any $\gamma$ in our domain, the maximum is attained at $\theta_\gamma = \frac{1}{\sqrt{3}} \sqrt{\frac{1}{\gamma} + 1}$, with $h_\gamma(\theta_\gamma) = \frac{2 (1 + \gamma)^{3/2}}{3 \sqrt{3\gamma}}$. Therefore,

$$\max_{\gamma \in (1, 1.25)} \; \max_{\theta \in (0, 1]} \; \theta + \gamma \theta (1 - \theta^2) = \max_{\gamma \in (1, 1.25)} \frac{2 (1 + \gamma)^{3/2}}{3 \sqrt{3\gamma}}.$$

It can be verified that $\frac{2 (1 + \gamma)^{3/2}}{3 \sqrt{3\gamma}}$ is increasing in $\gamma$ on the interval $(1, 1.25)$. Hence,

$$\max_{\gamma \in (1, 1.25)} \frac{2 (1 + \gamma)^{3/2}}{3 \sqrt{3\gamma}} \le \left. \frac{2 (1 + \gamma)^{3/2}}{3 \sqrt{3\gamma}} \right|_{\gamma = 1.25} < 1.162.$$

Combining these, we get

$$\theta_+ \le \max_{\gamma \in (1, 1.25)} \; \max_{\theta \in (0, 1]} \; \theta + \gamma \theta (1 - \theta^2) < 1.162.$$

For $\theta \in (1, 1.162)$: Since $\theta > 1$, we have $\theta_+ < \theta < 1.162$. For the lower bound, note that for $\theta_+$ to be less than 0, we need $1 + \gamma - \gamma\theta^2 < 0$. But for $\gamma \in (1, 1.25)$ and $\theta \in (1, 1.162)$,

$$\gamma (\theta^2 - 1) < 1.25 \big((1.162)^2 - 1\big) < 1. \tag{31}$$

Hence, $\theta_+$ never goes below 0.

Lemma 5. Consider the recursion Eq.(26) with $\Gamma \subseteq (1, 1.25)$ and $\theta_0$ initialized uniformly in $(0, 1)$. Then there exists $\epsilon_0 > 0$ such that for all $\epsilon < \epsilon_0$ there exists $t > 0$ such that for any $k > 0$,

$$\theta_{t+2k} \in (0.65, 1 - \epsilon) \quad \text{and} \quad \theta_{t+2k+1} \in (1 + \epsilon, 1.162) \tag{32}$$

almost surely.

Proof. Define $\gamma_{\min} > 1$ as the infimum of the support $\Gamma$. Let $\epsilon_0 = \min\{(\gamma_{\min} - 1)/3, \; 0.02\}$. Note that $\epsilon_0 > 0$ as $\gamma_{\min} > 1$. Now for any $0 < \epsilon < \epsilon_0$, we have $\gamma_{\min} (2 - \epsilon)(1 - \epsilon) > 2$. Divide the interval $(0, 1.162)$ into 4 regions,

$$I_0 = (0, 0.65], \quad I_1 = (0.65, 1 - \epsilon), \quad I_2 = [1 - \epsilon, 1), \quad I_3 = (1, 1.162).$$

The strategy of the proof is to show that the iterates eventually end up in $I_1$ and that, once in $I_1$, they come back to $I_1$ every 2 steps. Let $\theta_0$ be initialized uniformly at random in $(0, 1)$. Consider the sequence $(\theta_t)_{t \ge 0}$ generated by

$$\theta_{t+1} = h_{\gamma_t}(\theta_t) := \theta_t + \gamma_t \theta_t (1 - \theta_t^2), \quad \text{where } \gamma_t \sim \rho_\gamma. \tag{33}$$

We prove the following facts (P1)-(P4):

(P1) There exists $t \ge 0$ such that $\theta_t \in I_1 \cup I_2 \cup I_3$.
(P2) Let $\theta_t \in I_3$; then $\theta_{t+1} \in I_1 \cup I_2$.
(P3) Let $\theta_t \in I_2$; there exists $k > 0$ such that for $k' < k$, $\theta_{t+2k'} \in I_2$ and $\theta_{t+2k} \in I_1$.
(P4) When $\theta_t \in I_1$, then for all $k \ge 0$, $\theta_{t+2k} \in I_1$ and $\theta_{t+2k+1} \in (1 + \epsilon, 1.162)$.

Proof of (P1)-(P4): Let $t \in \mathbb{N}$ and note first that the event $\{\theta_t = 1\} = \bigcup_{k \le t} \{\theta_k = 1 \mid \theta_{k-1} \ne 1\}$ is a finite union of zero measure sets. Hence $\{\theta_t = 1\}$ is a zero measure set, and we do not consider it below. For any other sequence, the lemma follows from the above four properties.

Proof of P1: Assume that until time $t > 0$ the iterates are all in $I_0$. Then we have

$$\theta_t = \theta_{t-1} \big(1 + \gamma (1 - \theta_{t-1}^2)\big) \ge \theta_{t-1} (2 - \theta_{t-1}^2) > 1.5 \, \theta_{t-1} > 1.5^t \, \theta_0. \tag{34}$$

Hence, the sequence eventually exits $I_0$. We know from Lemma 4 that it stays bounded, hence it ends up in $I_1 \cup I_2 \cup I_3$.

Proof of P2: For any $\theta_t \in (1, 1.162)$ and $1 < \gamma < 1.25$, since $h_\gamma(\cdot)$ is decreasing on $(1, 1.162)$, we have $h_\gamma(1.162) < h_\gamma(\theta_t) < h_\gamma(1)$. Also, $h_\gamma(\theta)$ is linear in $\gamma$ with a negative coefficient for $\theta > 1$, hence it decreases as $\gamma$ increases. Using this,

$$0.65 < h_{1.25}(1.162) < h_\gamma(1.162) < h_\gamma(\theta_t) < h_\gamma(1) = 1.$$

Hence, $\theta_{t+1} \in I_1 \cup I_2$.

Proof of P3: This follows from Lemma 7.

Proof of P4: This follows from Lemma 10.

Lemma 6. For any $\theta \in I_1 \cup I_2$ and any $a, b \in \Gamma$, $h_a(h_b(\theta)) \in I_1 \cup I_2$ and

$$h_{\gamma_{\max}}(h_{\gamma_{\max}}(\theta)) \le h_a(h_b(\theta)) \le h_{\gamma_{\min}}(h_{\gamma_{\min}}(\theta)). \tag{36}$$

Proof. For any $\gamma \in \Gamma$, recall that

$$h_\gamma(\theta) = \theta + \gamma \theta (1 - \theta^2) = 1 + (1 - \theta)\big(\gamma \theta (1 + \theta) - 1\big).$$

Note that for $\theta \in I_1 \cup I_2$, $\theta (1 + \theta) > 1$, hence $\gamma \theta (1 + \theta) > 1$. This gives $h_\gamma(\theta) > 1$. We now track where $\theta \in I_1 \cup I_2$ can end up after two stochastic gradient steps.

• For any $b \in \Gamma$, as $\theta \in I_1 \cup I_2$, we have $h_{\gamma_{\max}}(\theta) \ge h_b(\theta) \ge h_{\gamma_{\min}}(\theta) > 1$, where the first two inequalities hold since $\theta < 1$.
• For any $a \in \Gamma$ and $x > 1$, $h_a(x)$ is a decreasing function of $x$. Hence $h_a(h_{\gamma_{\max}}(\theta)) \le h_a(h_b(\theta)) \le h_a(h_{\gamma_{\min}}(\theta))$.

Using $\gamma_{\min} \le a$, we have $h_a(h_{\gamma_{\min}}(\theta)) \le h_{\gamma_{\min}}(h_{\gamma_{\min}}(\theta))$. Similarly, using $\gamma_{\max} \ge a$, we have $h_{\gamma_{\max}}(h_{\gamma_{\max}}(\theta)) \le h_a(h_{\gamma_{\max}}(\theta))$. Combining them, we get

$$h_{\gamma_{\max}}(h_{\gamma_{\max}}(\theta)) \le h_a(h_b(\theta)) \le h_{\gamma_{\min}}(h_{\gamma_{\min}}(\theta)).$$

A similar argument extends this to

$$h_{1.25}(h_{1.25}(\theta)) < h_a(h_b(\theta)) < h_1(h_1(\theta)). \tag{39}$$

Lemma 7. Let $\theta_t \in I_2$; there exists $k > 0$ such that $\theta_{t+2k} \in I_1$.
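The constants appearing in Lemma 4 can be sanity-checked numerically. The following sketch evaluates $h_\gamma$ on a grid over the closure of the parameter range and compares the grid maximum with the closed-form peak $2(1+\gamma)^{3/2} / (3\sqrt{3\gamma})$; the grid resolution and the helper names `h` and `peak` are assumptions made for the illustration, and a grid check is of course not a substitute for the proof.

```python
import math


def h(gamma, theta):
    # One step of the rescaled recursion Eq.(26).
    return theta + gamma * theta * (1.0 - theta ** 2)


def peak(gamma):
    # Maximizer of h_gamma on (0, 1] and the corresponding closed-form maximum.
    theta_gamma = math.sqrt((1.0 + gamma) / (3.0 * gamma))
    closed_form = 2.0 * (1.0 + gamma) ** 1.5 / (3.0 * math.sqrt(3.0 * gamma))
    return theta_gamma, closed_form


gammas = [1.0 + 0.25 * k / 200 for k in range(201)]  # grid on [1, 1.25]
thetas = [k / 2000 for k in range(1, 2001)]          # grid on (0, 1]

# Case theta in (0, 1]: the image stays below 1.162.
grid_max = max(h(g, th) for g in gammas for th in thetas)

# Case theta in (1, 1.162): the image stays positive and below theta.
upper_thetas = [1.0 + 0.162 * k / 500 for k in range(1, 500)]
second_case_ok = all(0.0 < h(g, th) < th for g in gammas for th in upper_thetas)

# The closed-form peak is increasing in gamma on the grid.
peaks = [peak(g)[1] for g in gammas]
peaks_increasing = all(b > a for a, b in zip(peaks, peaks[1:]))
```

The bound is tight: the closed-form peak at $\gamma = 1.25$ is only marginally below 1.162, which explains the particular constant used in the lemma.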

D ADDITIONAL EXPERIMENTAL RESULTS

This section of the appendix presents additional experiments complementing the ones presented in the main text.

Illustration of neuron dynamics. We illustrate the change of neurons during training of two-layer ReLU networks in the teacher-student setup of Chizat et al. (2019) (see Fig. 1 therein) using a large initialization scale for which small step sizes of GD or SGD lead to lazy training. We postpone the illustration of (O1)-(O3) to Fig. 13 in the Appendix, as our interest here is in showing the neuron dynamics (Fig. 9). We see that for SGD with a small step size, the neurons w_i stay close to their initialization, while for a large step size, there is a clear clustering of the directions w_i along the teacher directions w⋆_i. The overall picture is very similar to Fig. 1 of Chizat et al. (2019), where the same feature learning effect is achieved via gradient flow from a small initialization, which is, however, much more computationally expensive due to the saddle point at zero. Finally, we note that the clustering phenomenon of the neurons w_i motivates the removal of highly correlated activations in the feature sparsity coefficient: although the corresponding activations are often non-zero, many of them in fact implement the same feature and thus should be counted only once.

Further results. We give a short overview of additional figures referred to in the main text. More details can be found in the captions.
• Figure 10 shows that even if loss stabilization occurs for full-batch GD in diagonal linear networks, its implicit bias towards sparsity is much weaker than that of SGD and generalization is poor.
• Figures 11 and 12 demonstrate that the implicit bias resulting from high-loss stabilization makes the neural nets learn a simple model first and only eventually fit the data.
• Figure 13 presents the sparsifying effect corresponding to the neurons' movements exhibited in Figure 9.
• Figure 14 showcases the feature learning induced by large step sizes for different layers of ResNet-18 when trained on CIFAR-10.
• Figure 15 exhibits the feature sparsity in ResNet architectures on CIFAR-100 without any regularization (plain SGD) and in the state-of-the-art setup.

Figure 14: Visualization of four sets of convolutional filters taken from different layers of ResNet-18 trained on CIFAR-10 with small vs. large step size η (the 50% decay schedule). For small step sizes, the early and middle layers stay very close to their random initialization, which indicates the absence of feature learning.

[Figure legend: SGD with η = 0.28 and step size decay at 10%, 30%, and 50% of iterations]

Figure 2: Diagonal linear networks. We observe loss stabilization, better generalization for longer schedules, minimization of the rank of ϕ_θ(X), and sparsity of the predictor u ⊙ v.

Figure 4: Two-layer ReLU networks for 1D regression. We observe loss stabilization, simplification of the model trained with a longer schedule, lower rank of ϕ_θ(X), and much sparser features.

Figure 3: GD and SGD take different trajectories.

Figure 6: ResNet-18 trained on CIFAR-10. Both without explicit regularization and in the state-of-the-art setting, the training loss stabilizes, the test loss noticeably depends on the length of the schedule, and the feature sparsity coefficient is minimized over iterations.

Figure 8: Empirical validation of the SDE modeling. In all cases, the dynamics of the SDE discretization qualitatively matches the dynamics of the corresponding SGD run. Moreover, the gradient flow discretization exhibits no rank minimization or feature sparsity, which suggests that the presence of the noise plays a key role in learning sparse features.

Figure 9: Only for a large step size do the neurons w_i cluster along the teacher neurons w⋆_i, leading to a model that uses a sparse set of features.

[Figure legend: GD with η = 3.53 and step size decay at 10%, 30%, and 50% of iterations; SGD with η = 0.27 and step size decay at 50% of iterations]

Figure 10: Diagonal linear networks. Loss stabilization also occurs for full-batch gradient descent but does not lead to a similar level of sparsity as SGD and also does not improve the test loss.

Figure 11: Two-layer ReLU networks for 1D regression. Unlike for Fig. 4, here we use a larger warmup coefficient (500× vs. 400×), which leads to overregularization such that the 50%-schedule run fails to fit all the training points and gets stuck at too high a value of the training loss (≈ $10^{-0.5}$).

Figure 5: Three-layer ReLU networks in a teacher-student setup. We observe loss stabilization, lower rank of the NTK feature matrix, and a lower feature sparsity coefficient on both hidden layers.

Figure 13: Two-layer ReLU networks in a teacher-student setup. Loss stabilization for two-layer ReLU nets in the teacher-student setup with input dimension d = 2. We observe loss stabilization, better test loss for longer schedules, and sparser features due to simplification of ϕ(X).


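Proof of Lemma 7. For any $\gamma \in \Gamma$, let $\theta_+ = h_\gamma(\theta)$. Then we have

$$\theta_+ = \theta \big(1 + \gamma (1 - \theta^2)\big).$$

Furthermore,

$$1 - \theta_+ = (1 - \theta)\big(1 - \gamma \theta (1 + \theta)\big), \qquad 1 + \theta_+ = (1 + \theta)\big(1 + \gamma \theta (1 - \theta)\big).$$

Multiplying the above three terms and adding $\theta (1 - \theta^2)$, we get

$$\frac{h_\gamma(h_\gamma(\theta)) - \theta}{\gamma} = \theta (1 - \theta^2) + \theta_+ (1 - \theta_+^2) = \theta (1 - \theta^2) \Big[ 1 + \big(1 + \gamma (1 - \theta^2)\big)\big(1 - \gamma \theta (1 + \theta)\big)\big(1 + \gamma \theta (1 - \theta)\big) \Big].$$

For $\theta \in I_2$, using $\gamma_{\min} (2 - \epsilon)(1 - \epsilon) > 2$, we have the inequalities

$$1 + \gamma (1 - \theta^2) > 1, \qquad 1 + \gamma \theta (1 - \theta) > 1, \qquad 1 - \gamma \theta (1 + \theta) \le 1 - \gamma_{\min} (1 - \epsilon)(2 - \epsilon) < -1.$$

Hence,

$$1 + \big(1 + \gamma (1 - \theta^2)\big)\big(1 - \gamma \theta (1 + \theta)\big)\big(1 + \gamma \theta (1 - \theta)\big) < 0.$$

Therefore, for $\theta \in [1 - \epsilon, 1)$ and any $\gamma \in \Gamma$, $h_\gamma(h_\gamma(\theta)) < \theta$. Hence, for any two stochastic gradient steps with $a, b \in \Gamma$, from Eq.(36),

$$h_a(h_b(\theta)) \le h_{\gamma_{\min}}(h_{\gamma_{\min}}(\theta)) < \theta.$$

Intuitively, this means that in two gradient steps the iterates move further away from 1 until they eventually leave the interval $I_2$, as the sequence $\{\theta_{t+2k}\}_{k \ge 0}$ is strictly decreasing with no limit point in $I_2$. From Lemma 9, we know that in two steps the iterates never leave $I_1 \cup I_2$. Hence they eventually leave $I_2$ and end up in $I_1$.

Property 8. Define $g_\gamma(\theta) := h_\gamma(h_\gamma(\theta))$ for the sake of brevity. The following properties hold for $\theta \in I_1 \cup I_2$, $\gamma \in \Gamma$, and $\theta_\gamma$ the root of $h'_\gamma(\theta)$:

Q1 $g_\gamma(\theta) \ge g_\gamma(\theta_\gamma)$.
Q2 The function $g_\gamma(\cdot)$ is decreasing on $[0.65, \theta_\gamma)$ and increasing on $(\theta_\gamma, 1]$.

Proof. Note that $h'_\gamma(\theta) = 1 + \gamma - 3\gamma\theta^2$ has at most one root $\theta_\gamma \in (0, 1)$. Note that for all $\gamma \in \Gamma$,

$$g'_\gamma(\theta) = h'_\gamma(h_\gamma(\theta)) \, h'_\gamma(\theta), \qquad h'_\gamma(h_\gamma(\theta)) < 0 \text{ since } h_\gamma(\theta) > 1.$$

Therefore, $g_\gamma(\cdot)$ attains its minimum at $\theta_\gamma$, and this shows the desired properties.

Lemma 9. For any $\theta \in I_1 \cup I_2$ and any $a, b \in \Gamma$, $h_a(h_b(\theta)) \in I_1 \cup I_2$.

Proof. Lower bound: From Eq.(39), we know that

$$h_a(h_b(\theta)) > g_{1.25}(\theta).$$

We know from property Q1 that $g_\gamma(\theta) \ge g_\gamma(\theta_\gamma)$. Hence

$$h_a(h_b(\theta)) > g_{1.25}(\theta_{1.25}).$$

It can be quickly checked that $0.65 < g_{1.25}(\theta_{1.25})$. Hence the lower bound holds.

Upper bound: From Eq.(39), we know that

$$h_a(h_b(\theta)) < g_1(\theta).$$

We know from property Q2 that $g_1(\theta) \le \max\{g_1(1), g_1(0.65)\}$. It can be easily verified that $g_1(0.65) < 0.98$. Hence $g_1(\theta) < 1$.

Lemma 10. For any $\theta \in I_1$ and any $a, b \in \Gamma$, $h_a(h_b(\theta)) \in I_1$ and $h_a(\theta) \in (1 + \epsilon, 1.162)$.

Proof. The lower bound in Lemma 9 holds here. For the upper bound, from Eq.(36),

$$h_a(h_b(\theta)) \le g_{\gamma_{\min}}(\theta).$$

Using property Q2,

$$g_{\gamma_{\min}}(\theta) \le \max\{g_{\gamma_{\min}}(0.65), \; g_{\gamma_{\min}}(1 - \epsilon)\}.$$

From Eq.(48), $g_{\gamma_{\min}}(1 - \epsilon) < 1 - \epsilon$. From Eq.(39), $g_{\gamma_{\min}}(0.65) < g_1(0.65) < 0.98 < 1 - \epsilon$. Hence $h_a(h_b(\theta)) < 1 - \epsilon$.

On $I_1$, the function $h_a(\cdot)$ first increases, reaches its maximum, and then decreases. Hence for $\theta \in I_1$,

$$h_a(\theta) \ge \min\{h_a(0.65), \; h_a(1 - \epsilon)\}, \qquad h_a(1 - \epsilon) = 1 + \epsilon \big(a (1 - \epsilon)(2 - \epsilon) - 1\big) > 1 + \epsilon.$$

Also, $h_a(0.65) > h_1(0.65) > 1.02 > 1 + \epsilon$. Therefore $h_a(\theta) > 1 + \epsilon$, and this completes the proof.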
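The alternating behavior established by Lemmas 4-10 can be observed by simulating the rescaled recursion Eq.(26) directly. The sketch below is an illustration under stated assumptions: $\gamma_t$ is drawn uniformly from $[1.05, 1.2] \subset (1, 1.25)$, so that $\epsilon_0 = \min\{0.05/3, 0.02\}$ and $\epsilon = 0.01 < \epsilon_0$ is admissible; the initialization $\theta_0 = 0.3$, the seed, and the burn-in length are arbitrary choices.

```python
import random


def h(gamma, theta):
    # One step of the rescaled recursion Eq.(26).
    return theta + gamma * theta * (1.0 - theta ** 2)


def simulate(theta0, num_steps, gamma_low=1.05, gamma_high=1.2, seed=0):
    # gamma_low and gamma_high are illustrative; any support inside (1, 1.25) qualifies.
    rng = random.Random(seed)
    thetas = [theta0]
    for _ in range(num_steps):
        gamma = rng.uniform(gamma_low, gamma_high)
        thetas.append(h(gamma, thetas[-1]))
    return thetas


thetas = simulate(theta0=0.3, num_steps=2000)

# Discard a burn-in, then split the tail by parity of the iteration index.
tail = thetas[1000:]
even, odd = tail[0::2], tail[1::2]
lows, highs = (even, odd) if sum(even) < sum(odd) else (odd, even)

eps = 0.01
alternates = (all(0.65 < th < 1 - eps for th in lows)
              and all(1 + eps < th < 1.162 for th in highs))
```

After the burn-in, the iterates of one parity lie below $1 - \epsilon$ and those of the other parity lie above $1 + \epsilon$: the rescaled iterate never settles at the minimizer $\theta = 1$, which is the loss stabilization stated in Proposition 3.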

C EMPIRICAL VALIDATION OF THE SDE MODELING

In this section, we experimentally check the validity of the SDE modeling of SGD in Eq.(8) in terms of the key metrics: training loss, test loss, rank of the NTK feature matrix, and feature sparsity.

SDE discretization. Let $\gamma_t$ be the SDE discretization step size, $\eta_t$ the step size of the corresponding SGD run that we aim to validate, $\delta_t$ the noise intensity level, and $Z_t \sim \mathcal{N}(0, I_n)$. We discretize the SDE from Eq.(8) with these quantities and refer to the resulting scheme as Eq.(58). To approximate continuous time, we use a small discretization step size $\gamma_t := \eta_t / 10$ and run the discretization for 10× longer than the corresponding SGD run. We use $\eta_t := \eta^{\mathrm{SGD}}_{\lfloor t/10 \rfloor}$ and $\delta_t := c \cdot L(\theta^{\mathrm{SGD}}_{\lfloor t/10 \rfloor})$, where $c$ is a constant that we select for each setting separately to match the training dynamics of the corresponding SGD run. In addition, we also evaluate a discretization of gradient flow (i.e., Eq.(58) without the noise term), which helps to draw conclusions about the role of the noise term.

Experimental results. We present the discretization results in Fig. 8 for all models considered in the paper except deep networks, for which computing the NTK matrix $\phi_{\theta_t}$ at each iteration of the SDE discretization is too costly. In all cases, the dynamics of the SDE discretization qualitatively matches the dynamics of the corresponding SGD run. In particular, we observe similar levels of decrease in the rank of the NTK matrix and in the feature sparsity coefficient. We note that the match between the SDE and SGD curves is not expected to be precise due to the inherent randomness of the process. Finally, we observe that the gradient flow discretization exhibits no rank minimization or feature sparsity, which suggests that the presence of the noise (either from the original SGD or from its SDE discretization) plays a key role in learning sparse features.

Figure 15: ResNet-34 trained on CIFAR-100. Both without explicit regularization and in the state-of-the-art setting, the training loss stabilizes, the test loss significantly depends on the length of the schedule, and feature sparsity is minimized over iterations. However, differently from the plots on CIFAR-10, here without explicit regularization we observe oscillating behavior after the step size decay (although at a very low level, between $10^{-4}$ and $10^{-2}$).
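The bookkeeping that ties the SDE discretization to a recorded SGD run, namely $\gamma_t = \eta_t / 10$, $\eta_t = \eta^{\mathrm{SGD}}_{\lfloor t/10 \rfloor}$, and $\delta_t = c \cdot L(\theta^{\mathrm{SGD}}_{\lfloor t/10 \rfloor})$, can be sketched as below. The function name, its arguments, and the toy inputs are assumptions made for the illustration; the sketch only builds the per-step quantities and deliberately does not implement the update of Eq.(58) itself.

```python
def sde_discretization_schedule(eta_sgd, loss_sgd, c, refine=10):
    """Map a recorded SGD run to per-step (gamma_t, eta_t, delta_t) triples.

    eta_sgd[k] and loss_sgd[k] are the step size and training loss at SGD
    iteration k; refine=10 matches the 10x finer time grid described in the text.
    """
    if len(eta_sgd) != len(loss_sgd):
        raise ValueError("eta_sgd and loss_sgd must have the same length")
    schedule = []
    for t in range(refine * len(eta_sgd)):
        k = t // refine              # index of the corresponding SGD iteration
        eta_t = eta_sgd[k]
        gamma_t = eta_t / refine     # small discretization step size
        delta_t = c * loss_sgd[k]    # noise intensity proportional to the SGD loss
        schedule.append((gamma_t, eta_t, delta_t))
    return schedule


# Toy inputs: two SGD iterations with a step size decay in between.
schedule = sde_discretization_schedule([0.2, 0.02], [1.0, 0.5], c=2.0)
```

With this construction, the total simulated time $\sum_t \gamma_t$ equals $\sum_k \eta^{\mathrm{SGD}}_k$, so the discretization covers the same time horizon as the SGD run on a 10× finer grid.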

