SGD WITH LARGE STEP SIZES LEARNS SPARSE FEATURES

Abstract

We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that, with commonly used large step sizes, (i) the iterates jump from one side of a valley to the other, causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics that implicitly biases the iterates toward simple predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used, so the regularization effect comes solely from the SGD dynamics influenced by the step-size schedule. These observations thus unveil how, through the step-size schedule, gradient and noise jointly drive the SGD dynamics through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models as well as qualitative arguments inspired by stochastic processes. Finally, this analysis allows us to shed new light on some common practices and observed phenomena when training deep networks.

1. INTRODUCTION

Deep neural networks have accomplished remarkable achievements on a wide variety of tasks. Yet, our understanding of their effectiveness remains incomplete. From an optimization perspective, stochastic training procedures challenge many insights drawn from convex models. For example, large step-size schedules used in practice lead to unexpected patterns of stabilizations and sudden drops in the training loss, see e.g. He et al. (2016). From a generalization perspective, overparametrized deep nets generalize well while fitting the data perfectly and without any explicit regularizers (Zhang et al., 2017). This suggests that optimization and generalization are tightly intertwined: neural networks find solutions that generalize well thanks to the optimization procedure used to train them. This property, known as implicit bias or algorithmic regularization, has been studied recently both for regression (Li et al., 2018; Woodworth et al., 2020) and classification (Soudry et al., 2018; Lyu and Li, 2020; Chizat and Bach, 2020). However, for all these theoretical results, it is also shown that the typical timescales needed to enter the beneficial feature learning regimes are prohibitively long (Woodworth et al., 2020; Moroshko et al., 2020). In this paper, we aim to stay closer to experimental practice and consider the SGD schedules from the ResNet paper (He et al., 2016), where the large step size is first kept constant and then decayed, potentially multiple times. We illustrate this behavior in Fig. 1, where we reproduce a minimal setting without data augmentation or momentum, and with only one step-size decrease. We draw attention to two key observations regarding the large step-size phase: (a) quickly after the start of training, the loss remains approximately constant on average and (b) despite no progress on the training loss, running this phase for longer leads to better generalization. We refer to such a large step-size phase as loss stabilization.
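The two observations (a loss plateau under a large constant step size, then a sudden drop once it is decayed) can be reproduced in a minimal one-dimensional least-squares sketch. The data, constants, and schedule below are illustrative choices, not the paper's ResNet setting: with a large step size, SGD bounces around the minimizer and the averaged loss stabilizes at a noise floor proportional to the step size; after one decay, the loss drops.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
a = rng.uniform(0.5, 1.5, size=n)
b = 1.5 * a + 0.5 * rng.normal(size=n)   # noisy linear targets: no zero-loss solution

def loss(x):
    return 0.5 * np.mean((a * x - b) ** 2)

x, eta = 0.0, 0.8                        # large initial step size
trace = []
for t in range(4000):
    if t == 2000:
        eta *= 0.1                       # single step-size decay, mimicking the schedule
    i = rng.integers(n)
    x -= eta * a[i] * (a[i] * x - b[i])  # SGD step on one sample
    trace.append(loss(x))

plateau = float(np.mean(trace[1000:2000]))  # averaged loss during the large-step phase
after = float(np.mean(trace[3500:]))        # averaged loss after the decay
```

In this convex toy problem the plateau comes purely from gradient noise, so it captures observation (a) but not the generalization benefit (b), which requires the overparametrized, non-convex setting studied in the paper.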
The better generalization hints at some hidden dynamics in the parameter space not captured by the loss curves in Fig. 1 . Our main contribution is to unveil the hidden dynamics behind this phase: loss stabilization helps to amplify the noise of SGD that drives the network towards a solution with sparser features (see Appendix, Figure 7 for a 2D-visualization).

1.1. OUR CONTRIBUTIONS

The effective dynamics behind loss stabilization. We characterize two main components of the SGD dynamics with large step sizes: (i) a fast movement along the bouncing directions, which causes the loss stabilization, and (ii) a slow dynamics driven by the combination of the gradient and the multiplicative noise, which is non-vanishing due to the loss stabilization.

SDE model and sparse feature learning. We model the effective slow dynamics during loss stabilization by a stochastic differential equation (SDE) whose multiplicative noise is related to the neural tangent kernel features, and validate this modeling experimentally. Building on the existing theory on diagonal linear networks, which shows that this noise structure leads to sparse predictors, we conjecture a similar "sparsifying" effect on the features of more complex architectures. We experimentally confirm this on neural networks of increasing complexity.

Insights from our understanding. We draw a clear general picture: the hidden optimization dynamics induced by large step sizes and loss stabilization enable the transition to a sparse feature learning regime. We argue that after a short initial phase of training, SGD first identifies sparse features of the training data and eventually fits the data once the step size is decreased. Finally, we discuss informally how many deep learning regularization methods (weight decay, BatchNorm, SAM) may also fit into the same picture.
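As a toy illustration of the sparsifying effect (a sketch, not the paper's exact experiments), one can run SGD on a diagonal linear network with the w = u ⊙ u − v ⊙ v parametrization used in the diagonal-linear-network literature, on an overparametrized sparse regression problem. All dimensions, step sizes, and the decay point below are illustrative assumptions: a noisier "large-step" phase followed by a decayed phase that fits the data on the identified sparse support.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 20, 15                              # more parameters than samples
w_star = np.zeros(d)
w_star[[3, 11]] = 1.0                      # sparse ground truth: 2 active features
X = rng.normal(size=(n, d))
y = X @ w_star                             # many zero-loss solutions exist

# Diagonal linear network w = u*u - v*v with small initialization
alpha = 0.1
u = alpha * np.ones(d)
v = alpha * np.ones(d)

def loss():
    w = u * u - v * v
    return 0.5 * np.mean((X @ w - y) ** 2)

loss0 = loss()
eta = 0.02                                 # noise-dominated phase
for t in range(20000):
    if t == 10000:
        eta *= 0.1                         # decay: fit the data on the sparse support
    i = rng.integers(n)
    r = X[i] @ (u * u - v * v) - y[i]      # residual on one sample
    g = r * X[i]                           # gradient w.r.t. the linear predictor w
    u -= eta * 2 * u * g                   # chain rule through w = u^2 - v^2
    v += eta * 2 * v * g

w_hat = u * u - v * v
top2 = set(np.argsort(np.abs(w_hat))[-2:])  # largest two coordinates of the predictor
```

With small initialization, the recovered predictor concentrates its mass on the true support rather than on the dense minimum-norm solution, consistent with the sparse implicit bias discussed above; the multiplicative form of the updates (u is rescaled by a residual-dependent factor) is the noise structure the SDE model captures.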

1.2. RELATED WORK

He et al. (2016) popularized the piecewise constant step-size schedule which often exhibits a clear loss stabilization pattern. However, they did not provide any explanations for such training dynamics and its implicit regularization effect. Non-monotonic patterns of the training loss have been explored in recent works. However, the loss stabilization regime we consider is different (i) from the catapult mechanism (Lewkowycz et al., 2020), where the training loss shows only one spike at the start of training and then monotonically converges without stabilization, and (ii) from the edge of stability regime of full-batch GD (Cohen et al., 2021), where the training loss shows many regular spikes after some point in training but again without stabilization. Past works conjectured that large step sizes induce the minimization of some hidden complexity measures related to flatness of minima (Keskar et al., 2016; Smith and Le, 2018). Notably, Xing et al. (2018) point out that SGD moves through the loss landscape bouncing between the walls of a valley, where the role of the step size is to guide the noisy iterates of SGD towards a flatter minimum. However, many typically used flatness definitions are questionable for this purpose since (1) they are not invariant under reparametrizations that lead to an equivalent neural network (Dinh et al., 2017), and (2) even for naturally trained networks, full-batch gradient descent with large step sizes (unlike SGD) can lead to flat solutions which are not well-generalizing (Kaur et al., 2022). Note that it is possible to bridge the gap between GD and SGD by using explicit regularization as in Geiping et al. (2022). We instead focus on the implicit regularization of SGD, which remains the most practical approach for training deep networks. The importance of large step sizes has been investigated with diverse motivations.
However, we believe that existing approaches do not sufficiently capture the hidden stochastic dynamics behind the loss stabilization phenomenon observed for deep networks. Attempts to explain it on strongly convex models (Nakkiran, 2020; Wu et al., 2021; Beugnot et al., 2022) are inherently incomplete since it is a phenomenon related to the existence of many zero-loss solutions with very different generalization properties.



Figure 1: A typical training dynamics for a ResNet-18 trained on CIFAR-10. We use weight decay but no momentum or data augmentation for this experiment. We see a substantial difference in generalization (as large as 12% vs. 35% test error) depending on the step size η and its schedule. When the training loss stabilizes, there is hidden progress occurring, which we aim to characterize.

