SGD WITH LARGE STEP SIZES LEARNS SPARSE FEATURES

Abstract

We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) lead the iterates to jump from one side of a valley to the other, causing loss stabilization, and (ii) induce, through this stabilization, a hidden stochastic dynamics that implicitly biases SGD toward simple predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used, so the regularization effect comes solely from the SGD dynamics influenced by the step size schedule. These observations therefore unveil how, through the step size schedule, gradient and noise jointly drive the SGD dynamics through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models as well as qualitative arguments inspired by stochastic processes. Finally, this analysis allows us to shed new light on some common practices and observed phenomena when training deep networks.

1. INTRODUCTION

Deep neural networks have accomplished remarkable achievements on a wide variety of tasks. Yet, the understanding of their remarkable effectiveness remains incomplete. From an optimization perspective, stochastic training procedures challenge many insights drawn from convex models. E.g., large step-size schedules used in practice lead to unexpected patterns of stabilizations and sudden drops in the training loss, see e.g. He et al. (2016). From a generalization perspective, overparametrized deep nets generalize well while perfectly fitting the data and without any explicit regularizers (Zhang et al., 2017). This suggests that optimization and generalization are tightly intertwined: neural networks find solutions that generalize well thanks to the optimization procedure used to train them. This property, known as implicit bias or algorithmic regularization, has been studied recently both for regression (Li et al., 2018; Woodworth et al., 2020) and classification (Soudry et al., 2018; Lyu and Li, 2020; Chizat and Bach, 2020). However, for all these theoretical results, it is also shown that the typical timescales needed to enter the beneficial feature learning regimes are prohibitively long (Woodworth et al., 2020; Moroshko et al., 2020). In this paper, we aim to stay closer to experimental practice and consider the SGD schedules from the ResNet paper (He et al., 2016), where the large step size is first kept constant and then decayed, potentially multiple times. We illustrate this behavior in Fig. 1, where we reproduce a minimal setting without data augmentation or momentum, and with only one step size decrease. We draw attention to two key observations regarding the large step-size phase: (a) quickly after the start of training, the loss remains approximately constant on average, and (b) despite no progress on the training loss, running this phase for longer leads to better generalization. We refer to such a large-step-size phase as loss stabilization.
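The bouncing-then-drop pattern described above can be reproduced in a toy setting. The following sketch (not from the paper; all names and the 1D quadratic loss are illustrative assumptions) runs SGD with additive gradient noise on L(w) = w²/2, first with a large step size, then with a decayed one, mimicking a step-decay schedule: the loss plateaus during the large-step phase and drops sharply after the decay.

```python
import numpy as np

def sgd_quadratic(etas, steps_per_phase, sigma=0.1, w0=1.0, seed=0):
    """SGD on the toy loss L(w) = 0.5 * w**2 with additive gradient noise
    of scale `sigma`, using one constant step size per phase.
    Returns the trajectory of instantaneous losses."""
    rng = np.random.default_rng(seed)
    w, losses = w0, []
    for eta in etas:
        for _ in range(steps_per_phase):
            g = w + sigma * rng.standard_normal()  # noisy gradient of L
            w -= eta * g
            losses.append(0.5 * w ** 2)
    return np.array(losses)

# Phase 1: large step size -> iterates bounce across the valley, loss plateaus.
# Phase 2: decayed step size -> sudden drop, as in step-decay schedules.
losses = sgd_quadratic(etas=[1.9, 0.1], steps_per_phase=500)
plateau = losses[250:500].mean()  # average loss during the large-step phase
after = losses[-100:].mean()      # average loss after the decay
print(f"plateau: {plateau:.3f}, after decay: {after:.5f}")
```

For a quadratic with curvature 1, any step size below 2 still contracts on average, but a step size near 2 keeps the iterates oscillating at a noise-determined loss level; decaying the step size shrinks that level, producing the sudden drop seen in the training curves.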
The better generalization hints at some hidden dynamics in the parameter space not captured by the loss curves in Fig. 1. Our main contribution is to unveil the hidden dynamics behind this phase: loss stabilization helps to amplify the noise of SGD, which drives the network towards a solution with sparser features (see Figure 7 in the Appendix for a 2D visualization).

1.1. OUR CONTRIBUTIONS

The effective dynamics behind loss stabilization. We characterize two main components of the SGD dynamics with large step sizes: (i) a fast movement determined by the bouncing directions

