DIRECTION MATTERS: ON THE IMPLICIT BIAS OF STOCHASTIC GRADIENT DESCENT WITH MODERATE LEARNING RATE

Abstract

Understanding the algorithmic bias of stochastic gradient descent (SGD) is one of the key challenges in modern machine learning and deep learning theory. Most existing works, however, focus on the very small or even infinitesimal learning rate regime, and fail to cover practical scenarios where the learning rate is moderate and annealing. In this paper, we make an initial attempt to characterize the particular regularization effect of SGD in the moderate learning rate regime by studying its behavior for optimizing an overparameterized linear regression problem. In this case, SGD and GD are known to converge to the unique minimum-norm solution; however, with a moderate and annealing learning rate, we show that they exhibit different directional biases: SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions. Furthermore, we show that such directional bias does matter when early stopping is adopted, where the SGD output is nearly optimal but the GD output is suboptimal. Finally, our theory explains several folk arts used in practice for SGD hyperparameter tuning, such as (1) linearly scaling the initial learning rate with the batch size; and (2) overrunning SGD with a high learning rate even when the loss stops decreasing.

1. INTRODUCTION

Stochastic gradient descent (SGD) and its variants play a key role in training deep learning models. From the optimization perspective, SGD is favorable in many aspects, e.g., scalability to large-scale models (He et al., 2016), parallelizability with big training data (Goyal et al., 2017), and a rich theory for its convergence (Ghadimi & Lan, 2013; Gower et al., 2019). From the learning perspective, more surprisingly, overparameterized deep nets trained by SGD usually generalize well, even in the absence of explicit regularizers (Zhang et al., 2016; Keskar et al., 2016). This suggests that SGD favors certain "good" solutions among the numerous global optima of the overparameterized model. This phenomenon is attributed to the implicit bias of SGD. It remains one of the key theoretical challenges to characterize the algorithmic bias of SGD, especially with the moderate and annealing learning rates typically used in practice (He et al., 2016; Keskar et al., 2016).

In the small learning rate regime, the regularization effect of SGD is relatively well understood, thanks to recent advances on the implicit bias of gradient descent (GD) (Gunasekar et al., 2017; 2018a;b; Soudry et al., 2018; Ma et al., 2018; Li et al., 2018; Ji & Telgarsky, 2019b;a; Ji et al., 2020; Nacson et al., 2019a; Ali et al., 2019; Arora et al., 2019; Moroshko et al., 2020; Chizat & Bach, 2020). According to classical stochastic approximation theory (Kushner & Yin, 2003), with a sufficiently small learning rate the randomness in SGD is negligible (it scales with the learning rate), and as a consequence SGD behaves very similarly to its deterministic counterpart, i.e., GD. Based on this fact, the regularization effect of SGD with a small learning rate can be understood through that of GD.
Taking linear models as an example, GD has been shown to be biased towards max-margin/minimum-norm solutions, depending on the problem setup (Soudry et al., 2018; Gunasekar et al., 2018a; Ali et al., 2019); correspondingly, follow-up works show that SGD with a small learning rate has the same bias, up to a small uncertainty governed by the learning rate (Nacson et al., 2019b; Gunasekar et al., 2018a; Ali et al., 2020). The analogy between SGD and GD in the small learning rate regime is also demonstrated in Figures 1(a) and 3. However, the regularization theory for SGD with a small learning rate cannot explain the benefits of SGD in the moderate learning rate regime, where the initial learning rate is moderate and followed by annealing (Li et al., 2019; Nakkiran, 2020; Leclerc & Madry, 2020; Jastrzebski et al., 2019). In particular, empirical studies show that, in the moderate learning rate regime, (small-batch) SGD generalizes much better than GD/large-batch SGD (Keskar et al., 2016; Jastrzębski et al., 2017; Zhu et al., 2019; Wu et al., 2020) (see Figure 3). This observation implies that, instead of imitating the bias of GD as in the small learning rate regime, SGD in the moderate learning rate regime admits a bias superior to that of GD; this calls for a dedicated characterization of the implicit regularization effect of SGD with moderate learning rate.

In this paper, we reveal a particular regularization effect of SGD with moderate learning rate that concerns the convergence direction. Specifically, we consider an overparameterized linear regression model learned by SGD/GD. In this setting, SGD and GD are known to converge to the unique minimum-norm solution (Zhang et al., 2016; Gunasekar et al., 2018a) (see also Section 2.1). However, with a moderate and annealing learning rate, we show that SGD and GD favor different convergence directions: SGD converges along the large eigenvalue directions of the data matrix; in contrast, GD goes after the small eigenvalue directions.
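The minimum-norm convergence fact quoted above is easy to check numerically. The sketch below is our own minimal illustration (not an experiment from the paper): for an overparameterized least-squares problem, GD started from zero keeps its iterates in the row space of the data matrix and therefore converges to the minimum-norm interpolating solution, i.e., the pseudoinverse solution.

```python
import numpy as np

# Overparameterized least squares: n = 5 samples, d = 20 features (d > n),
# so there are infinitely many interpolating solutions.
rng = np.random.default_rng(0)
n, d = 5, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Full-batch gradient descent on f(w) = (1/2n) ||X w - y||^2, started from
# w = 0 so that every iterate stays in the row space of X.
H = X.T @ X / n                        # Hessian of f
lr = 1.0 / np.linalg.eigvalsh(H)[-1]   # step size 1/L, L = largest eigenvalue
w = np.zeros(d)
for _ in range(5000):
    w -= lr * (X.T @ (X @ w - y)) / n

# The minimum-norm interpolating solution, via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y
```

Among all solutions with X w = y, only the one in the row space of X has minimal norm, which is why the zero initialization matters here; with a nonzero initialization, GD would instead converge to the interpolator closest to the starting point.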
The phenomenon is illustrated in Figure 1(b). To sum up, we make the following contributions in this work:

1. For an overparameterized linear regression model, we show that SGD with moderate learning rate converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions. To our knowledge, this result initiates the regularization theory for SGD in the moderate learning rate regime, and complements existing results for the small learning rate regime.

2. Furthermore, we show that the particular directional bias of SGD with moderate learning rate benefits generalization when early stopping is used. This is because converging along the large eigenvalue directions (SGD) leads to nearly optimal solutions, while converging along the small eigenvalue directions (GD) can only give suboptimal solutions.

3. Finally, our results explain several folk arts for tuning SGD hyperparameters, such as (1) linearly scaling the initial learning rate with the batch size (Goyal et al., 2017); and (2) overrunning SGD with a high learning rate even when the loss stops decreasing (He et al., 2016).



Figure 1: Illustration of the 2-D example studied in Section 3. Here κ = 4 and w0 = (0.6, 0.6). (a): Small learning rate regime. The small learning rate is 0.1/κ. In this regime SGD and GD behave similarly, and both converge along e2. (b): Moderate learning rate regime. The initial moderate learning rate is η = 1.1/κ and the decayed learning rate is η = 0.1/κ. In this regime GD still converges along e2, but SGD converges along e1, the larger eigenvalue direction of the data matrix. Please refer to Section 3 for further discussion.
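The mechanism behind Figure 1(b) can be reproduced in a few lines. The sketch below is our own instantiation of a κ = 4 problem, not necessarily the exact data of Section 3: two axis-aligned data points give the average Hessian diag(κ, 1), while each individual example carries n = 2 times its share of the curvature. With the moderate rate η = 1.1/κ, full-batch GD contracts the large-eigenvalue component e1 fastest (factor |1 − 1.1| = 0.1 per step) and thus approaches the optimum along e2; single-sample SGD steps are instead expansive along e1 (|1 − 2ηκ| = 1.2 > 1), so after annealing to 0.1/κ the surviving residual of SGD points along e1.

```python
import numpy as np

rng = np.random.default_rng(1)
kappa = 4.0
# Two axis-aligned data points; the average Hessian X^T X / n = diag(kappa, 1),
# so e1 is the large eigenvalue direction of the data matrix.
X = np.array([[np.sqrt(2 * kappa), 0.0],
              [0.0, np.sqrt(2.0)]])
n = 2
w0 = np.array([0.6, 0.6])
# Targets are zero, so the minimizer is w* = 0 and the residual w_t - w*
# is the iterate w_t itself.

def gd(w, lr, steps):
    # Full-batch GD on f(w) = (1/2n) ||X w||^2.
    for _ in range(steps):
        w = w - lr * (X.T @ (X @ w)) / n
    return w

def sgd(w, lr, steps):
    # One-sample SGD: gradient of the per-example loss (1/2)(x_i . w)^2.
    for _ in range(steps):
        i = rng.integers(n)
        w = w - lr * X[i] * (X[i] @ w)
    return w

# Moderate learning rate 1.1/kappa for 30 steps, then annealed to 0.1/kappa.
w_gd = gd(gd(w0, 1.1 / kappa, 30), 0.1 / kappa, 50)
w_sgd = sgd(sgd(w0, 1.1 / kappa, 30), 0.1 / kappa, 50)

# Unit-normalized residual directions: GD's residual aligns with e2
# (small eigenvalue direction), SGD's with e1 (large eigenvalue direction).
dir_gd = np.abs(w_gd) / np.linalg.norm(w_gd)
dir_sgd = np.abs(w_sgd) / np.linalg.norm(w_sgd)
```

The directional split comes from the two different stability thresholds: GD is governed by the average curvature κ, for which η = 1.1/κ is (barely) contractive, whereas single-sample steps see the per-example curvature 2κ, for which the same η is expansive along e1.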

