DIRECTION MATTERS: ON THE IMPLICIT BIAS OF STOCHASTIC GRADIENT DESCENT WITH MODERATE LEARNING RATE

Abstract

Understanding the algorithmic bias of stochastic gradient descent (SGD) is one of the key challenges in modern machine learning and deep learning theory. Most existing works, however, focus on the very small or even infinitesimal learning rate regime, and fail to cover practical scenarios where the learning rate is moderate and annealing. In this paper, we make an initial attempt to characterize the particular regularization effect of SGD in the moderate learning rate regime by studying its behavior on an overparameterized linear regression problem. In this case, SGD and GD are known to converge to the unique minimum-norm solution; however, with a moderate and annealing learning rate, we show that they exhibit different directional biases: SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions. Furthermore, we show that such directional bias does matter when early stopping is adopted: the SGD output is nearly optimal but the GD output is suboptimal. Finally, our theory explains several folk arts used in practice for SGD hyperparameter tuning, such as (1) linearly scaling the initial learning rate with batch size; and (2) overrunning SGD with a high learning rate even when the loss stops decreasing.

1. INTRODUCTION

Stochastic gradient descent (SGD) and its variants play a key role in training deep learning models. From the optimization perspective, SGD is favorable in many aspects, e.g., scalability to large-scale models (He et al., 2016), parallelizability with big training data (Goyal et al., 2017), and a rich theory for its convergence (Ghadimi & Lan, 2013; Gower et al., 2019). From the learning perspective, more surprisingly, overparameterized deep nets trained by SGD usually generalize well, even in the absence of explicit regularizers (Zhang et al., 2016; Keskar et al., 2016). This suggests that SGD favors certain "good" solutions among the numerous global optima of the overparameterized model; this phenomenon is attributed to the implicit bias of SGD. It remains one of the key theoretical challenges to characterize the algorithmic bias of SGD, especially with the moderate and annealing learning rate typically used in practice (He et al., 2016; Keskar et al., 2016).

In the small learning rate regime, the regularization effect of SGD is relatively well understood, thanks to recent advances on the implicit bias of gradient descent (GD) (Gunasekar et al., 2017; 2018a;b; Soudry et al., 2018; Ma et al., 2018; Li et al., 2018; Ji & Telgarsky, 2019b;a; Ji et al., 2020; Nacson et al., 2019a; Ali et al., 2019; Arora et al., 2019; Moroshko et al., 2020; Chizat & Bach, 2020). According to classical stochastic approximation theory (Kushner & Yin, 2003), with a sufficiently small learning rate the randomness in SGD is negligible (it scales with the learning rate), and as a consequence SGD behaves highly similarly to its deterministic counterpart, i.e., GD. Based on this fact, the regularization effect of SGD with small learning rate can be understood through that of GD. Taking linear models as an example, GD has been shown to be biased towards max-margin/minimum-norm solutions depending on the problem setup (Soudry et al., 2018; Gunasekar et al., 2018a; Ali et al., 2019); correspondingly, follow-up works show that SGD with small learning rate has the same bias, up to a small uncertainty governed by the learning rate (Nacson et al., 2019b; Gunasekar et al., 2018a; Ali et al., 2020). The analogy between SGD and GD in the small learning rate regime is also demonstrated in Figures 1(a) and 3.

However, the regularization theory for SGD with small learning rate cannot explain the benefits of SGD in the moderate learning rate regime, where the initial learning rate is moderate and followed by annealing (Li et al., 2019; Nakkiran, 2020; Leclerc & Madry, 2020; Jastrzebski et al., 2019). In particular, empirical studies show that, in the moderate learning rate regime, (small batch) SGD generalizes much better than GD/large-batch SGD (Keskar et al., 2016; Jastrzębski et al., 2017; Zhu et al., 2019; Wu et al., 2020) (see Figure 3). This observation implies that, instead of imitating the bias of GD as in the small learning rate regime, SGD in the moderate learning rate regime admits a bias superior to that of GD, and this calls for a dedicated characterization of the implicit regularization effect of SGD with moderate learning rate.

[Figure 1(b) caption: Moderate learning rate regime. The initial moderate learning rate is η = 1.1/κ and the decayed learning rate is 0.1/κ. In this regime GD converges along e_2 but SGD converges along e_1, the larger eigenvalue direction of the data matrix. See Section 3 for further discussion.]
In this paper, we reveal a particular regularization effect of SGD with moderate learning rate that concerns the convergence direction. Specifically, we consider an overparameterized linear regression model learned by SGD/GD. In this setting, SGD and GD are known to converge to the unique minimum-norm solution (Zhang et al., 2016; Gunasekar et al., 2018a) (see also Section 2.1). However, with a moderate and annealing learning rate, we show that SGD and GD favor different convergence directions: SGD converges along the large eigenvalue directions of the data matrix; in contrast, GD goes after the small eigenvalue directions. The phenomenon is illustrated in Figure 1(b). To sum up, we make the following contributions in this work:

1. For an overparameterized linear regression model, we show that SGD with moderate learning rate converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions. To our knowledge, this result initiates the regularization theory for SGD in the moderate learning rate regime, and complements existing results for the small learning rate regime.

2. Furthermore, we show that the particular directional bias of SGD with moderate learning rate benefits generalization when early stopping is used. This is because converging along the large eigenvalue directions (SGD) leads to nearly optimal solutions, while converging along the small eigenvalue directions (GD) can only give suboptimal solutions.

3. Finally, our results explain several folk arts for tuning SGD hyperparameters, such as (1) linearly scaling the initial learning rate with batch size (Goyal et al., 2017); and (2) overrunning SGD with a high learning rate even when the loss stops decreasing (He et al., 2016).

2. PRELIMINARY

Let $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ be a pair of a $d$-dimensional feature vector and a $1$-dimensional label. We consider a linear regression problem with square loss $\ell(x, y; w) := (w^\top x - y)^2$, where $w \in \mathbb{R}^d$ is the model parameter. Let $\mathcal{D}$ be the population distribution over $(x, y)$; then the test loss is $L_{\mathcal{D}}(w) := \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(x, y; w)]$. Let $S := \{(x_i, y_i)\}_{i=1}^n$ be a training set of $n$ data points drawn i.i.d. from the population distribution $\mathcal{D}$. The training/empirical loss is defined as the average of the individual losses over all training data points, $L_S(w) := \frac{1}{n}\sum_{i=1}^n \ell_i(w)$, where $\ell_i(w) := \ell(x_i, y_i; w) = (w^\top x_i - y_i)^2$. We use $\{\eta_k\}$ to denote a learning rate scheme (LR). Gradient descent (GD) iteratively performs the update
$$w_{k+1} = w_k - \eta_k \nabla L_S(w_k) = w_k - \frac{2\eta_k}{n}\sum_{i=1}^n x_i (x_i^\top w_k - y_i). \tag{GD}$$
Stochastic gradient descent (SGD) proceeds epoch by epoch: at the $k$-th epoch it draws a uniformly random partition $\{B_1^k, \ldots, B_m^k\}$ of $[n]$ into $m = n/b$ mini-batches of size $b$, and performs
$$w_{k,j+1} = w_{k,j} - \frac{\eta_k}{b}\sum_{i\in B_j^k} \nabla \ell_i(w_{k,j}) = w_{k,j} - \frac{2\eta_k}{b}\sum_{i\in B_j^k} x_i (x_i^\top w_{k,j} - y_i), \quad j = 1, \ldots, m. \tag{SGD}$$
We also write $w_{k+1} = w_{k,m+1}$ and $w_k = w_{k,1}$ to be consistent with the notation in (GD).
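To make the two update rules concrete, here is a minimal NumPy sketch of (GD) and (SGD); the column-wise storage of $X = (x_1, \ldots, x_n) \in \mathbb{R}^{d\times n}$ follows the notation above, but the function names and the code itself are ours, not from the paper.

```python
import numpy as np

def gd_step(w, X, Y, lr):
    # One (GD) step on L_S(w) = (1/n) * ||X^T w - Y||^2; X is d x n.
    n = X.shape[1]
    return w - (2 * lr / n) * X @ (X.T @ w - Y)

def sgd_epoch(w, X, Y, lr, b, rng):
    # One (SGD) epoch: a uniformly random partition of [n] into n/b batches.
    n = X.shape[1]
    perm = rng.permutation(n)
    for start in range(0, n, b):
        idx = perm[start:start + b]
        Xb, Yb = X[:, idx], Y[idx]
        w = w - (2 * lr / b) * Xb @ (Xb.T @ w - Yb)
    return w
```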

2.1. THE MINIMUM-NORM BIAS

Before presenting our results on the directional bias, let us first recap the well-known minimum-norm bias of SGD/GD for linear regression (Zhang et al., 2016; Gunasekar et al., 2018a; Belkin et al., 2019; Bartlett et al., 2020). We rewrite the training loss as $L_S(w) = \frac{1}{n}\|X^\top w - Y\|_2^2$, where $X = (x_1, \ldots, x_n) \in \mathbb{R}^{d\times n}$ and $Y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$. Its global minima are given by
$$\mathcal{W}^* := \big\{w \in \mathbb{R}^d : Pw = w^*\big\}, \qquad w^* := X(X^\top X)^{-1}Y,$$
where $P$ is the projection operator onto the data manifold, i.e., the column space of $X$. We focus on overparameterized cases where $\mathcal{W}^*$ contains multiple elements. Notice that every gradient $\nabla \ell_i(w) = 2x_i(x_i^\top w - y_i)$ lies in the data manifold, so (GD) and (SGD) can never move along any direction orthogonal to the data manifold. In other words, (GD) and (SGD) implicitly admit the following hypothesis class:
$$\mathcal{H}_S = \big\{w \in \mathbb{R}^d : P_\perp w = P_\perp w_0\big\}, \tag{1}$$
where $w_0$ is the initialization and $P_\perp = I - P$ is the projection operator onto the orthogonal complement of the column space of $X$. Putting things together, for any global optimum $w \in \mathcal{W}^*$ (hence $Pw = w^*$), we have
$$\|w - w_0\|_2^2 = \|Pw - Pw_0\|_2^2 + \|P_\perp w - P_\perp w_0\|_2^2 = \|w^* - Pw_0\|_2^2 + \|P_\perp w - P_\perp w_0\|_2^2,$$
where the right hand side is minimized when $P_\perp w = P_\perp w_0$, i.e., $w \in \mathcal{H}_S$; this is the solution found by SGD/GD in the non-degenerate cases (when the learning rate is set properly so that the algorithms find a global optimum). In sum, SGD/GD is biased to find the global optimum that is closest to the initialization, which is referred to as the "minimum-norm" bias in the literature, since the initialization is usually set to zero.
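The minimum-norm bias is easy to check numerically. The following sketch (ours; the Gaussian design and step count are arbitrary choices for the check) runs GD from $w_0 = 0$ and verifies that the iterate stays in the data manifold and converges to $w^* = X(X^\top X)^{-1}Y$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 10                               # overparameterized: d > n
X = rng.standard_normal((d, n))
Y = rng.standard_normal(n)

# Minimum-norm interpolant w* = X (X^T X)^{-1} Y and projector P onto col(X).
w_star = X @ np.linalg.solve(X.T @ X, Y)
P = X @ np.linalg.solve(X.T @ X, X.T)

w = np.zeros(d)                             # w_0 = 0
for _ in range(20000):
    w -= (2 * 1e-2 / 10) * X @ (X.T @ w - Y)

print(np.linalg.norm(w - w_star))           # ~0: GD finds the min-norm solution
print(np.linalg.norm((np.eye(d) - P) @ w))  # ~0: iterates never leave the data manifold
```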

3. WARMING UP: A 2-DIMENSIONAL CASE STUDY

In this section we conduct a 2-dimensional case study to motivate our understanding of the directional bias of SGD in the moderate learning rate regime. Consider a training set consisting of two orthogonal points, $S = \{(x_1, y_1 = 0), (x_2, y_2 = 0)\}$, where $x_1 = \sqrt{\kappa}\cdot e_1 = (\sqrt{\kappa}, 0)^\top$, $x_2 = e_2 = (0, 1)^\top$, and $\kappa > 2$. Clearly $w^* = 0$ is the unique minimum of $L_S(w)$. The Hessian of the empirical loss is $\nabla^2 L_S(w) = x_1x_1^\top + x_2x_2^\top = \mathrm{diag}(\kappa, 1)$, which has two eigenvalues: the smaller one, $1$, is contributed by the data point $x_2$, and the larger one, $\kappa$, by $x_1$. Hence $L_S(w)$ is $\kappa$-smooth. Similarly, the Hessians of the individual losses are $\nabla^2\ell_1(w) = 2x_1x_1^\top = \mathrm{diag}(2\kappa, 0)$ and $\nabla^2\ell_2(w) = 2x_2x_2^\top = \mathrm{diag}(0, 2)$. Thus $\ell_2(w)$ is $2$-smooth, but $\ell_1(w)$, the individual loss for data $x_1$, is only $2\kappa$-smooth, i.e., more ill-conditioned than $L_S(w)$ and $\ell_2(w)$.

Next we consider a moderate initial learning rate $\eta \in \big(\frac{1}{\kappa}, \frac{2}{1+\kappa}\big)$. According to convex optimization theory (Boyd et al., 2004), a gradient step with such a learning rate is convergent for $L_S(w)$ and $\ell_2(w)$, but oscillating for $\ell_1(w)$. In other words, (GD) is convergent, and (SGD) is convergent along direction $x_2$ (or $e_2$) but oscillating along direction $x_1$ (or $e_1$). We can also see this by solving (GD) and (SGD) analytically for this example:
$$w_k^{gd} = \begin{pmatrix}(1-\eta\kappa)^k & 0\\ 0 & (1-\eta)^k\end{pmatrix}w_0, \qquad w_k^{sgd} = \begin{pmatrix}(1-2\eta\kappa)^k & 0\\ 0 & (1-2\eta)^k\end{pmatrix}w_0, \tag{2}$$
where $|1-\eta\kappa| < |1-\eta| < 1$ and $|1-2\eta| < 1 < |1-2\eta\kappa|$.

By Eq. (2), with a moderate learning rate GD is convergent along both directions $e_1$ and $e_2$. Moreover, GD fits $e_1$ faster since its contraction factor is smaller, i.e., $|1-\eta\kappa| < |1-\eta| < 1$. Thus, observing the entire optimization path, GD approaches the minimum $w^* = 0$ along $e_2$, which corresponds to the smaller eigenvalue direction of $\nabla^2 L_S(w)$. This is verified by the blue dots in Figure 1(b). We note this directional bias of GD also holds in the small learning rate regime, as shown in Figure 1(a).

As for SGD, in the initial phase where the learning rate is moderate, Eq. (2) shows that it converges along $e_2$ but oscillates along $e_1$, since $|1-2\eta| < 1 < |1-2\eta\kappa|$. In other words, SGD cannot fit $e_1$ before the learning rate decays; but by the time the decay happens, $e_2$ is already well fitted. Overall, SGD fits $e_2$ first and then $e_1$, i.e., SGD converges to the minimum $w^* = 0$ along $e_1$, which corresponds to the larger eigenvalue direction of $\nabla^2 L_S(w)$. This is verified by the red dots in Figure 1(b). We note this particular directional bias of SGD is dedicated to the moderate learning rate regime; in the small learning rate regime, as discussed before, SGD behaves similarly to GD and thus goes after the smaller eigenvalue direction, as illustrated in Figure 1(a).

The above idea carries over to more general cases: the training loss usually has a relatively smooth curvature because of the empirical averaging, yet some individual losses can possess bad smoothness conditions, corresponding to the data points that contribute the large eigenvalues of the Hessian/data matrix. Then, with a moderate learning rate, GD is convergent, while SGD is convergent for the smooth individual losses but oscillating for the ill-conditioned ones. Thus SGD can only fit the latter after the learning rate anneals. Therefore, in the moderate learning rate regime, SGD tends to converge along the large eigenvalue directions while GD tends to go after the small eigenvalue directions. We rigorously justify these intuitions in the following section.
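The following few lines (ours) simulate this example with $\kappa = 10$ and the moderate learning rate $\eta = 1.1/\kappa$ from Figure 1(b); they reproduce the contraction factors in Eq. (2): GD shrinks both coordinates (the $e_1$ coordinate faster), while SGD shrinks the $e_2$ coordinate but oscillates with growing magnitude along $e_1$ until the learning rate is decayed.

```python
import numpy as np

kappa = 10.0
eta = 1.1 / kappa                      # moderate LR: 1/kappa < eta < 2/(1+kappa)
x1 = np.array([np.sqrt(kappa), 0.0])
x2 = np.array([0.0, 1.0])
w_gd, w_sgd = np.ones(2), np.ones(2)

for k in range(30):
    # (GD): here (2/n) * sum_i x_i x_i^T = diag(kappa, 1) since n = 2.
    w_gd = w_gd - eta * np.array([kappa * w_gd[0], w_gd[1]])
    # (SGD) with batch size 1: one gradient step per individual loss.
    for x in (x1, x2):
        w_sgd = w_sgd - 2 * eta * x * (x @ w_sgd)

print(w_gd)   # e1 coord ~ (-0.1)^30, e2 ~ 0.89^30: GD approaches 0 along e2
print(w_sgd)  # e2 coord ~ 0.78^30, e1 ~ (-1.2)^30: SGD oscillates along e1
```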

4. MAIN RESULTS

In this section we present our main theoretical results; the proofs are deferred to Appendix B. We specify the population distribution of $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ as follows. (1) The feature vector is $x = \zeta\cdot\xi$, where $\zeta$ and $\xi$ are two independent random variables representing the magnitude and angle of $x$, respectively: $\zeta \in \mathbb{R}$ is bounded in $(0, 1]$, and $\xi \in \mathbb{R}^d$ follows the uniform distribution on the sphere, $\mathcal{U}(\mathbb{S}^{d-1})$. (2) We consider a realizable setting where the label is given by $y = \langle w^*, x\rangle$, i.e., there exists a true parameter $w^* \in \mathbb{R}^d$ that generates the label from the feature. The empirical and individual losses can then be rewritten as
$$L_S(w) = \frac{1}{n}(w - w^*)^\top XX^\top(w - w^*), \qquad \ell_i(w) = (w - w^*)^\top x_ix_i^\top(w - w^*), \quad i = 1, \ldots, n,$$
where $X = (x_1, \ldots, x_n)$. We denote by $P$ the projection operator onto the column space of $X$ (the data manifold). For $i \in [n]$, we denote $\lambda_i := \|x_i\|_2^2 = \zeta_i^2 \in (0, 1]$. Without loss of generality, we assume $\{\lambda_i\}_{i\in[n]}$ are sorted in descending order, i.e., $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. With these preparations, we are ready to state our main theorems.
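For concreteness, the data model above can be sampled as follows (a sketch of ours; the text only requires $\zeta$ to be bounded in $(0, 1]$, so the uniform magnitude below is one arbitrary choice):

```python
import numpy as np

def sample_dataset(n, d, w_star, rng):
    # x = zeta * xi: xi uniform on the sphere S^{d-1}, zeta a magnitude in (0, 1].
    xi = rng.standard_normal((d, n))
    xi /= np.linalg.norm(xi, axis=0, keepdims=True)  # normalize columns to the sphere
    zeta = rng.uniform(0.0, 1.0, size=n)             # any distribution on (0, 1] works
    X = zeta * xi                                    # d x n; ||x_i||^2 = zeta_i^2 = lambda_i
    Y = X.T @ w_star                                 # realizable labels y_i = <w*, x_i>
    return X, Y
```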

4.1. THE DIRECTIONAL BIAS OF SGD

We first present Theorems 1 and 2, which characterize the different directional biases of SGD and GD in the moderate learning rate regime.

Theorem 1 (The directional bias of SGD with moderate LR, informal). Suppose $d \ge \mathrm{poly}(n)$. Denote $\nu = n/\sqrt{d}$ (which is small). Then with high probability it holds that
$$\lambda_1 > \lambda_2 + \Theta(\nu), \qquad \lambda_{n-1} > \lambda_n + \Theta(\nu), \qquad \lambda_n > \Theta(\nu).$$
Suppose the initialization is set such that $x_i^\top(w_0 - w^*) \neq 0$ for every $i \in [n]$. Consider (SGD) with the following moderate learning rate scheme:
$$\eta_k = \begin{cases}\eta \in \big(\frac{b}{\lambda_1 - \Theta(\nu)}, \frac{b}{\lambda_2 + \Theta(\nu)}\big), & k = 1, \ldots, k_1;\\[2pt] \eta' \in \big(0, \frac{b}{2\lambda_1}\big), & k = k_1 + 1, \ldots, k_2.\end{cases} \tag{3}$$
Then for $\epsilon$ such that $\mathrm{poly}(\epsilon) > \nu$, there exist $k_1 = O\big(\log\frac{1}{\epsilon}\big)$ and $k_2 > 0$ such that with high probability the output of SGD, $w^{sgd} := w_{k_2}$, satisfies
$$(1-\epsilon)\cdot\gamma_1 \le \frac{\big(P(w^{sgd} - w^*)\big)^\top XX^\top\big(P(w^{sgd} - w^*)\big)}{\|P(w^{sgd} - w^*)\|_2^2} \le \gamma_1, \tag{4}$$
where $\gamma_1$ is the largest eigenvalue of the data matrix $XX^\top$.

Theorem 2 (The directional bias of GD with moderate or small LR, informal). Under the same conditions as Theorem 1, consider (GD) with the following moderate or small learning rate scheme:
$$\eta_k \in \Big(0, \frac{n}{2\lambda_1 + \Theta(\nu)}\Big), \quad k = 1, \ldots, k_2. \tag{5}$$
Then for any $\epsilon > 0$, if $k_2 > O\big(\log\frac{1}{\epsilon}\big)$, with high probability the output of GD, $w^{gd} := w_{k_2}$, satisfies
$$\gamma_n \le \frac{\big(P(w^{gd} - w^*)\big)^\top XX^\top\big(P(w^{gd} - w^*)\big)}{\|P(w^{gd} - w^*)\|_2^2} \le (1+\epsilon)\cdot\gamma_n, \tag{6}$$
where $\gamma_n$ is the smallest eigenvalue of the data matrix $XX^\top$ restricted to the column space of $X$.

Remark 1. As the Rayleigh quotient (4) (resp. (6)) converges to its maximum (resp. minimum), the vector gets closer to the eigenvector of the largest (resp. smallest) eigenvalue (Trefethen & Bau III, 1997). Thus Theorems 1 and 2 suggest that, when projected onto the data manifold, SGD and GD converge to the optimum along the largest and smallest eigenvalue directions, respectively. Here we are only interested in the projection onto the data manifold, since SGD/GD cannot move along any direction orthogonal to the data manifold, as discussed in Section 2.1.

Remark 2. In Theorem 1 we use the gap between $\lambda_1$ and $\lambda_2$ to exhibit a learning rate scheme such that SGD converges along the largest eigenvalue direction. This can be extended by considering the gap between $\lambda_r$ and $\lambda_{r+1}$ and a learning rate scheme defined similarly; SGD then converges within the subspace spanned by the eigenvectors of the top $r$ eigenvalues. A similar extension applies to Theorem 2 as well.

Remark 3. We legitimately assume $b < n/2 - \Theta(\nu)$, since it is not very meaningful to discuss SGD that uses more than (roughly) half of the training set as a mini-batch. Then the learning rate schedule in (3) intersects with that in (5), i.e., (5) covers both moderate and small learning rate schemes. Their intersection determines a moderate learning rate scheme where SGD converges along the large eigenvalue directions while GD goes after the small eigenvalue directions. This justifies the regularization effect of SGD with moderate learning rate.

Remark 4. Technically, in Theorem 1 one can set $b = n$ to include GD as a special case, so that GD also follows the large eigenvalue directions. However, the initial learning rate in (3) would then need to be at least $\frac{n}{\lambda_1} \ge n$, ignoring the small-order term, which is way too large to even be numerically stable in practical big-data circumstances. This observation is also directly supported by Theorem 2, where (5) specifies the range of learning rates such that GD converges along the small eigenvalue directions.
Note that the upper bound in (5) scales linearly with $n$ and is large enough to include every learning rate that can be adopted in practice. Thus one cannot choose a legitimate learning rate for GD so that it converges along the large eigenvalue directions as SGD does with moderate learning rate. Note also that GD with small learning rate converges along the small eigenvalue directions as well, since (5) covers the small learning rate scheme. In complement, the following Theorem 3 shows that in the small learning rate regime, SGD imitates GD and also converges along the small eigenvalue directions. Theorems 1, 2 and 3 together show that converging along the large eigenvalue directions is a distinct regularization effect unique to SGD with moderate learning rate.

Theorem 3 (The directional bias of SGD with small LR, informal). Theorem 2 applies to (SGD) with the following small learning rate scheme:
$$\eta_k = \eta' \in \Big(0, \frac{b}{2\lambda_1 + \Theta(\nu)}\Big), \quad k = 1, \ldots, k_2. \tag{7}$$

Experiment. Figure 2 shows two experiments verifying Theorems 1, 2 and 3; the details of the experimental setup are deferred to Appendix D. In Figure 2(a), we run SGD and GD in both the moderate and the small learning rate regimes and directly compare their Rayleigh quotients as defined in (4) and (6). The Rayleigh quotient reaches its maximum for SGD with moderate learning rate, and its minimum for both GD and SGD with small learning rate, which verifies Theorems 1, 2 and 3. In Figure 2(b), we run the algorithms to optimize a neural network on a subset of the FashionMNIST dataset. Since the neural network is non-convex and has multiple local minima, we compare the relative Rayleigh quotients, i.e., the Rayleigh quotients of the convergence directions divided by the maximum absolute eigenvalue of the Hessian (see Appendix D.3). Figure 2(b) shows that SGD with moderate learning rate converges along relatively large eigenvalue directions while GD/SGD with small learning rate converges along relatively small eigenvalue directions. This distinguishes the directional biases of SGD and GD in the moderate learning rate regime and provides evidence from neural network training in support of our theory.
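The Rayleigh quotient diagnostic in (4) and (6) can be computed directly from the final iterate. A small helper in the spirit of Figure 2(a) (ours, not the paper's experiment code):

```python
import numpy as np

def rayleigh_quotient(w_out, w_star, X):
    # Rayleigh quotient of P(w_out - w*) w.r.t. XX^T, as in Eqs. (4) and (6).
    P = X @ np.linalg.solve(X.T @ X, X.T)   # projector onto the data manifold
    v = P @ (w_out - w_star)
    return (v @ (X @ (X.T @ v))) / (v @ v)

# Compare against the spectrum of XX^T (same nonzero eigenvalues as X^T X):
# gammas = np.linalg.eigvalsh(X.T @ X); gamma_n, gamma_1 = gammas[0], gammas[-1]
```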

4.2. EFFECTS OF THE DIRECTIONAL BIAS

Next we justify the benefit of the particular directional bias of SGD with moderate learning rate. Recall the hypothesis class $\mathcal{H}_S$ (Eq. (1)) for SGD and GD. For an algorithm output $w^{alg}$, we have the following generalization error decomposition (Shalev-Shwartz & Ben-David, 2014):
$$L_{\mathcal{D}}(w^{alg}) - \inf_w L_{\mathcal{D}}(w) = \underbrace{L_{\mathcal{D}}(w^{alg}) - \inf_{w'\in\mathcal{H}_S} L_{\mathcal{D}}(w')}_{\Delta(w^{alg}),\ \text{estimation error}} + \underbrace{\inf_{w'\in\mathcal{H}_S} L_{\mathcal{D}}(w') - \inf_w L_{\mathcal{D}}(w)}_{\text{approximation error}}.$$
The approximation error is an intrinsic error determined by the hypothesis class, and is not improvable unless the hypothesis class is enlarged. In contrast, the estimation error $\Delta(w^{alg})$ is determined by the algorithm as well as its hyperparameters. Thus, in the following theorem, we use the estimation error to compare the generalization performance of the SGD and GD outputs in the different learning rate regimes.

Theorem 4 (Effects of the directional bias, informal). Let $\mathcal{W}_\alpha := \{w \in \mathcal{H}_S : L_S(w) = \alpha\}$ be an $\alpha$-level set of the training loss $L_S(w)$, and let $\Delta^*_\alpha := \inf_{w\in\mathcal{W}_\alpha}\Delta(w)$ be the minimum estimation error within the $\alpha$-level set $\mathcal{W}_\alpha$. Under the same conditions as Theorems 1-3, and assuming that $b < n/2 - \Theta(\nu)$, the following holds with high probability:
- The output of (SGD) with moderate LR (3) in Theorem 1 satisfies $\Delta(w^{sgd}) < (1+\epsilon)\cdot\Delta^*_\alpha$, where $\alpha$ is the training loss of $w^{sgd}$ and $\epsilon$ is a small constant;
- The output of (GD) with moderate or small LR (5) in Theorem 2 satisfies $\Delta(w^{gd}) > M\cdot\Delta^*_\alpha$, where $\alpha$ is the training loss of $w^{gd}$ and $M = (1-\epsilon)\cdot\gamma_1/\gamma_n$ is a constant larger than $1+\epsilon$;
- The output of (SGD) with small LR (7) in Theorem 3 satisfies $\Delta(w^{sgd}) > M\cdot\Delta^*_\alpha$, where $\alpha$ is the training loss of $w^{sgd}$ and $M = (1-\epsilon)\cdot\gamma_1/\gamma_n$ is a constant larger than $1+\epsilon$.

[Figure 2 caption: (a) A linear regression example. The small learning rate scheme is specified by $(\eta', k_2) = (0.2, 10^4)$, and the moderate learning rate scheme by $(\eta, \eta', k_1, k_2) = (1.05, 0.1, 2\times 10^3, 3\times 10^3)$. Numerical results show that the Rayleigh quotient converges to its maximum for SGD with moderate learning rate, and to its minimum for GD and SGD with small learning rate, verifying Theorems 1, 2 and 3. (b) A neural network example; plots averaged over 10 runs. We randomly draw 2,000 samples from FashionMNIST as the training set; the model is a 5-layer convolutional neural network. The small learning rate scheme is specified by $(\eta', k_2) = (10^{-3}, 10^4)$, and the moderate learning rate scheme by $(\eta, \eta', k_1, k_2) = (10^{-2}, 10^{-3}, 2.5\times 10^3, 10^4)$. Since the neural network is non-convex, we compare the relative Rayleigh quotients, i.e., the Rayleigh quotients of the convergence directions divided by the maximum absolute eigenvalue of the Hessian (see Appendix D.3).]

Theorem 4 suggests that: (1) in the moderate learning rate regime, there is a separation between the test error of SGD and that of GD: early stopped SGD finds a nearly optimal solution thanks to its particular directional bias, while early stopped GD can only find a suboptimal one; and (2) in the small learning rate regime, SGD no longer admits the dedicated directional bias of the moderate learning rate regime; instead it behaves similarly to GD, and hence outputs suboptimal solutions when early stopping is adopted.

Remark 5. In practice it is usually intractable and unnecessary to reach an exact global minimum of the training loss; instead one often stops the algorithm early once the training loss is small enough, i.e., upon reaching an $\alpha$-level set.
In this spirit, Theorem 4 compares the generalization ability of SGD with moderate learning rate vs. GD/SGD with small learning rate within a level set.

Remark 6. We note that the second conclusion in Theorem 4 is also obtained by Nakkiran (2020) for a different purpose. Specifically, Nakkiran (2020) shows a separation between the test error of GD with "large" and annealing learning rate and that of GD with small learning rate. However, the "large" learning rate for GD in that analysis is linear in the training sample size and is not practical, as discussed in Remark 4. In contrast, we show that under the practically used moderate learning rate, there is a separation between the generalization abilities of SGD and GD. To our knowledge, our work gives the first theoretical justification of the phenomenon that SGD outperforms GD when the learning rate is moderate.

Experiment. Figure 3 shows the generalization performance of a neural network trained by SGD and GD in both the moderate and the small learning rate regimes; the setup details are included in Appendix D. We observe that (1) SGD with moderate learning rate generalizes the best, and (2) GD and SGD with small learning rate perform similarly, but both are worse than SGD with moderate learning rate. The empirical results suggest that SGD with moderate learning rate has a certain benign regularization effect, which is explained by the distinct directional bias of SGD with moderate learning rate shown in the previous theorems.

5. RELATED WORK

In this section, we review related works and discuss their similarities and differences to our work.

The implicit bias of GD. The implicit bias of GD has been extensively studied in recent years; we summarize several representative results as follows. For homogeneous models with exponentially-tailed losses, GD converges along the max-margin direction (Gunasekar et al., 2017; 2018a;b; Soudry et al., 2018; Ma et al., 2018; Ji & Telgarsky, 2019a; Ji et al., 2020; Nacson et al., 2019a). For the least squares problem and its variants, GD is biased towards the minimum-norm solution (Gunasekar et al., 2018a; Ali et al., 2019; Suggala et al., 2018); note that this is also the foundation for the learning theory in the interpolation regime, such as double descent (Belkin et al., 2019). Chizat & Bach (2020) show GD finds a max-margin classifier in a functional space. The regularization theory for GD is fruitful; while some of it applies to SGD with small learning rate, as discussed (Nacson et al., 2019b; Gunasekar et al., 2018a; Ali et al., 2020), none of it carries over to SGD in the moderate learning rate regime. As far as we know, our paper is the first to study the regularization effect of SGD with moderate learning rate.

The stability of SGD. Stability is another approach to justifying the generalization ability of SGD: a stable algorithm is guaranteed to have small generalization error (Bousquet & Elisseeff, 2002). Along this line, several works show that SGD is stable under certain assumptions and therefore generalizes well (Hardt et al., 2016; Kuzborskij & Lampert, 2018; Charles & Papailiopoulos, 2018; Bassily et al., 2020). More interestingly, Charles & Papailiopoulos (2018) show a simple example where SGD is stable but GD is not, which partly explains the empirically superior performance of SGD. We also aim to justify the benefits of SGD, but we take a different approach, via algorithmic regularization.

The escaping behavior of SGD. A popular theory (Jastrzębski et al., 2017; Zhu et al., 2019; Simsekli et al., 2019) attributes the regularization effect of SGD to its behavior of escaping from sharp minima. These works are built upon continuous approximations of SGD via stochastic differential equations (Li et al., 2017; Hu et al., 2017; Xu et al., 2018; Simsekli et al., 2019), which require a small learning rate for the approximation to hold. In contrast, our main interest is in the moderate learning rate regime. Another related work in this line is Wu et al. (2018), which studies the dynamical stability of minima and how SGD chooses among them, but does not show the directional bias introduced in our work.

Non-small learning rate. The regularization effect of non-small learning rates has received increasing attention recently. Theoretically, Li et al. (2019) and HaoChen et al. (2020) study the generalization of certain stochastic dynamics equipped with an annealing learning rate scheme; however, their results do not cover vanilla SGD. Lewkowycz et al. (2020) study the role of a large initial learning rate for infinite-width networks. Our work is motivated by Nakkiran (2020), which shows that a large initial learning rate helps GD generalize; this result can be recovered from our theorems, and more importantly, we characterize the directional bias of SGD. Empirically, Jastrzebski et al. (2019) and Leclerc & Madry (2020) investigate the impact of two-phase learning rates for training with SGD, but they do not provide a theoretical analysis.

6. DISCUSSIONS

Linear scaling rule. Linearly enlarging the initial learning rate according to the batch size, i.e., the linear scaling rule (Goyal et al., 2017), is an important folk art for parallelizing SGD with large batch sizes while maintaining good generalization performance. Interestingly, the linear scaling rule arises naturally from Theorem 1, where LR (3) requires the initial learning rate $\eta$ to scale linearly with the batch size $b$ to guarantee the desired directional bias. Our theory thus partly explains the mechanism of the linear scaling rule.

Overrunning SGD with high learning rate. Besides the moderate initial learning rate, another key ingredient in Theorem 1 is a sufficiently large $k_1$: SGD needs to run with the moderate learning rate for a sufficiently long time (to fit the small eigenvalue directions). This requirement coincides with the folk art of overrunning SGD with a high learning rate. For example, see Figure 4 in He et al. (2016): from roughly the $2\times 10^5$-th to the $3\times 10^5$-th iteration, even though the training error seems to make no progress, practitioners let SGD run with a relatively high learning rate for nearly $10^5$ iterations. Our theory sheds light on the hidden benefit of this "overrunning" phase: the loss is indeed not decreasing, since SGD with high learning rate cannot fit the large eigenvalue directions, but the overrunning lets SGD fit the small eigenvalue directions better, which in the end leads to the directional bias whereby SGD converges along the large eigenvalue directions, according to Theorem 1. Thus overrunning SGD with a high learning rate is helpful.
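As a concrete reading of LR (3), the sketch below (ours; the midpoint choice and the decay factor are arbitrary illustrations, not prescriptions from the theory) builds a two-phase schedule whose endpoints scale linearly with the batch size $b$, combining both folk arts: a long moderate phase up to epoch $k_1$, then a small decayed learning rate.

```python
def moderate_lr_schedule(k, b, lam1, lam2, k1, decay=0.1):
    # Phase 1 (k <= k1): a moderate LR inside (b/lam1, b/lam2), taken here as the
    # midpoint of the window; phase 2: a small LR below b/(2*lam1). Both endpoints
    # scale linearly with the batch size b -- the linear scaling rule.
    eta_moderate = 0.5 * (b / lam1 + b / lam2)
    eta_small = decay * b / (2.0 * lam1)
    return eta_moderate if k <= k1 else eta_small
```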

Key technical challenges

With a non-small learning rate, analyzing SGD is usually hard, since measure concentration becomes vacuous and, as a result, one cannot relate the SGD iterates to those of GD. Instead, we control the SGD iterates epoch by epoch and then bound their composition to characterize the long-run behavior of SGD. However, controlling the epoch-wise update of SGD is highly non-trivial: the sequence of stochastic gradient steps varies from epoch to epoch, and the corresponding update matrices do not commute. To overcome this difficulty, we adopt techniques from matrix perturbation theory (Horn & Johnson, 2012) and conduct our analysis in the overparameterized regime. We believe these techniques are of independent interest.

7. CONCLUSION AND FUTURE WORK

We characterize a distinct directional regularization effect of SGD with moderate learning rate: SGD converges along the large eigenvalue directions of the data matrix. In contrast, neither GD nor SGD with small learning rate achieves this effect. Moreover, we show that this directional bias benefits generalization when early stopping is adopted. Finally, our theory explains several folk arts used in practice for SGD hyperparameter tuning. As an initial attempt, our results are limited to overparameterized linear models, and we ignore other factors that may contribute to the good generalization of SGD for training neural networks, e.g., network structures and explicit regularization. Extending our results to nonlinear/nonconvex models (with explicit regularization) is left as future work.

A PRELIMINARIES

A.1 ADDITIONAL NOTATIONS

We adopt the notations and settings of the main text, and in addition introduce the following. For a vector $x \in \mathbb{R}^d$, denote its direction by $\bar{x} := \frac{x}{\|x\|_2}$. For simplicity, assume the training data $\{x_1, \ldots, x_n\}$ are linearly independent. For training data $x_i$, $i \in [n]$, we denote $\lambda_i = \|x_i\|_2^2$; by construction, $\lambda_i \in (0, 1]$. Without loss of generality let $\lambda_1 \ge \cdots \ge \lambda_n$. We define
$$X := (x_1, \ldots, x_n) \in \mathbb{R}^{d\times n}, \qquad X_{-1} := (x_2, x_3, \ldots, x_n) \in \mathbb{R}^{d\times(n-1)}.$$
Based on the above definitions, we define the following two projection operators:
$$P = X(X^\top X)^{-1}X^\top, \qquad P_\perp = I - P.$$
Clearly, for any $v \in \mathbb{R}^d$, $Pv$ projects $v$ onto the subspace $\mathrm{span}\{x_1, \ldots, x_n\}$, while $P_\perp v$ projects $v$ onto the orthogonal complement of $\mathrm{span}\{x_1, \ldots, x_n\}$. Furthermore, we introduce two more projection operators:
$$P_{-1} = X_{-1}(X_{-1}^\top X_{-1})^{-1}X_{-1}^\top, \qquad P_1 = P - P_{-1} = I - P_\perp - P_{-1}.$$
For any $v \in \mathbb{R}^d$, $P_{-1}v$ projects $v$ onto the subspace $\mathrm{span}\{x_2, \ldots, x_n\}$, while $P_1v$ projects $v$ onto the orthogonal complement of $\mathrm{span}\{x_2, \ldots, x_n\}$ within $\mathrm{span}\{x_1, \ldots, x_n\}$. In the following, we often refer to the column space of $P$, i.e., $\{Pv : v \in \mathbb{R}^d\}$, and similarly for $P_{-1}$, $P_1$ and $P_\perp$. Clearly, the column space of $P$ is $\mathrm{span}\{x_1, \ldots, x_n\}$, and the column space of $P_{-1}$ is $\mathrm{span}\{x_2, \ldots, x_n\}$. We highlight that the total space $\mathbb{R}^d$ decomposes as the direct sum of the column spaces of $P_{-1}$, $P_1$ and $P_\perp$, i.e., $I = P_{-1} + P_1 + P_\perp$. By definition, it is easy to verify that
$$P_\perp X = 0, \qquad PX = X, \qquad P_1X = (P_1x_1, 0, \ldots, 0), \qquad P_{-1}X = (P_{-1}x_1, x_2, \ldots, x_n).$$
We then define the following matrices, which will be used repeatedly in the subsequent proofs:
$$H := XX^\top, \qquad H_{-1} := (P_{-1}X)(P_{-1}X)^\top, \qquad H_1 := (P_1X)(P_1X)^\top,$$
$$H_c := (P_{-1}x_1)(P_1x_1)^\top + (P_1x_1)(P_{-1}x_1)^\top.$$
Based on the above definitions, it is easy to show that
$$H = (P_1X + P_{-1}X)(P_1X + P_{-1}X)^\top = H_{-1} + H_1 + (P_1X)(P_{-1}X)^\top + (P_{-1}X)(P_1X)^\top = H_{-1} + H_1 + H_c.$$

A.2 LEMMAS

We present the following theorems and lemmas in preparation for our analysis.

Theorem (Gershgorin circle theorem, restated for symmetric matrices). Let $A \in \mathbb{R}^{n\times n}$ be a symmetric matrix, with $A_{ij}$ the entry in the $i$-th row and $j$-th column. Let $R_i(A) := \sum_{j\neq i}|A_{ij}|$, $i = 1, \ldots, n$, and consider the $n$ Gershgorin discs
$$D_i(A) := \{z \in \mathbb{R} : |z - A_{ii}| \le R_i(A)\}, \quad i = 1, \ldots, n.$$
The eigenvalues of $A$ lie in the union of the Gershgorin discs, $G(A) := \bigcup_{i=1}^n D_i(A)$. Furthermore, if the union of $k$ of the $n$ discs comprising $G(A)$ forms a set $G_k(A)$ that is disjoint from the remaining $n-k$ discs, then $G_k(A)$ contains exactly $k$ eigenvalues of $A$, counted according to their algebraic multiplicities.

Proof. See, e.g., Horn & Johnson (2012), Chapter 6.1, Theorem 6.1.1.

Theorem (Hoffman-Wielandt theorem, restated for symmetric matrices). Let $A, E \in \mathbb{R}^{n\times n}$ be symmetric. Let $\lambda_1, \ldots, \lambda_n$ be the eigenvalues of $A$, arranged in decreasing order, and let $\tilde{\lambda}_1, \ldots, \tilde{\lambda}_n$ be the eigenvalues of $A + E$, arranged in decreasing order. Then
$$\sum_{i=1}^n \big(\tilde{\lambda}_i - \lambda_i\big)^2 \le \|E\|_F^2.$$
Proof. See, e.g., Horn & Johnson (2012), Chapter 6.3, Theorem 6.3.5 and Corollary 6.3.8.

Lemma 1. Let $d \ge 4\log(2n^2/\delta)$ for some $\delta \in (0, 1)$. Then with probability at least $1-\delta$, we have
$$|\langle\bar{x}_i, \bar{x}_j\rangle| < \iota := O\Big(\frac{1}{\sqrt{d}}\Big), \quad i \neq j.$$
Proof. See Section C.1.

By Lemma 1 we can assume $d \ge \mathrm{poly}(n)$ so that $n\iota$ is sufficiently small, as required. The following two lemmas characterize the projections of each training data point onto the column spaces of $P_1$, $P_{-1}$, and $P_\perp$.

Lemma 2. For $x_j \neq x_1$, we have: $P_{-1}x_j = x_j$; $P_1x_j = 0$; $P_\perp x_j = 0$.

Proof. These follow from the construction of the projection operators.

Lemma 3. Assume $\sqrt{n}\iota \le 1/4$. With probability at least $1-\delta$, we have:
- $0 \le \|P_{-1}\bar{x}_1\|_2 \le 2\sqrt{n}\iota$;
- $\sqrt{1 - 4n\iota^2} \le \|P_1\bar{x}_1\|_2 \le 1$;
- $P_\perp x_1 = 0$.

Proof. See Section C.2.

The following four lemmas characterize the spectra of the matrices $H$, $H_{-1}$, $H_1$ and $H_c$.

Lemma 4. Let $\gamma_1, \ldots, \gamma_n$ be the $n$ non-zero eigenvalues of $H := XX^\top$ in decreasing order. Then
$$\lambda_n - n\iota \le \gamma_1, \ldots, \gamma_n \le \lambda_1 + n\iota.$$
Furthermore, if there exist $\lambda_r$ and $\lambda_{r+1}$ such that $\lambda_r > \lambda_{r+1} + 2n\iota$, then
$$\lambda_n - n\iota \le \gamma_{r+1}, \ldots, \gamma_n \le \lambda_{r+1} + n\iota < \lambda_r - n\iota \le \gamma_1, \ldots, \gamma_r \le \lambda_1 + n\iota.$$
Proof. See Section C.3.

Lemma 5. Assume $\lambda_n \ge 3n\iota$. Consider the symmetric matrix $H_{-1} := (P_{-1}X)(P_{-1}X)^\top \in \mathbb{R}^{d\times d}$.
- $0$ is an eigenvalue of $H_{-1}$ with algebraic multiplicity $d - n + 1$, and its corresponding eigenspace is the column space of $P_1 + P_\perp$.
- Restricted to the column space of $P_{-1}$, the $n-1$ eigenvalues of $H_{-1}$ belong to $(\lambda_n - n\iota, \lambda_2 + n\iota)$.

Proof. See Section C.4.

Lemma 6. Consider the matrix $H_1 := (P_1X)(P_1X)^\top \in \mathbb{R}^{d\times d}$. Then $H_1$ has only one non-zero eigenvalue, which belongs to $[\lambda_1(1 - 4n\iota^2), \lambda_1]$. Moreover, the corresponding eigenspace is the column space of $P_1$, which is one-dimensional.

Proof. Clearly $H_1$ is rank-1, since the column space of $P_1$ is one-dimensional. Thus it has only one non-zero eigenvalue, which is given by
$$\mathrm{tr}(H_1) = \sum_{i=1}^n \|P_1x_i\|_2^2 = \|P_1x_1\|_2^2 \in [\lambda_1(1 - 4n\iota^2), \lambda_1],$$
where the inclusion follows from Lemma 3.

Lemma 7. Consider the matrix $H_c := (P_{-1}x_1)(P_1x_1)^\top + (P_1x_1)(P_{-1}x_1)^\top \in \mathbb{R}^{d\times d}$. Then
$$\|H_c\|_2 \le 2\lambda_1\|P_{-1}\bar{x}_1\|_2 \le 4\sqrt{n}\iota.$$
Proof. $\|H_c\|_2 \le 2\|P_{-1}x_1\|_2\cdot\|P_1x_1\|_2 \le 2\lambda_1\|P_{-1}\bar{x}_1\|_2 \le 4\sqrt{n}\iota$, where the last two inequalities follow from Lemma 3 and $\lambda_1 \le 1$.
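As a quick sanity check of the two matrix-analysis tools above, the snippet below (ours) verifies the Gershgorin disc containment numerically on a random symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2                                    # symmetric test matrix
radii = np.abs(A).sum(axis=1) - np.abs(np.diag(A))   # R_i(A) = sum_{j != i} |A_ij|
for lam in np.linalg.eigvalsh(A):
    # Every eigenvalue must lie in some Gershgorin disc.
    assert any(abs(lam - A[i, i]) <= radii[i] for i in range(5))
```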

B MISSING PROOFS FOR SECTION 4

B.1 THE DIRECTIONAL BIAS OF SGD WITH MODERATE LEARNING RATE

Recall that at the $k$-th epoch, SGD performs
$$w_{k,j+1} = w_{k,j} - \frac{2\eta_k}{b}\sum_{i\in B_j^k} x_ix_i^\top(w_{k,j} - w^*), \quad j = 1, \ldots, m,$$
where we assume the learning rate is fixed within each epoch. Note that the partition $\pi_k$ is chosen independently and uniformly at random at each epoch. For simplicity we often omit the epoch indicator $k$ and write the uniform partition as $\pi := \{B_1, \ldots, B_m\}$; it is clear from the context that $\pi$ is random over epochs. For a mini-batch $B_j \in \pi$, we denote $H(B_j) := \sum_{i\in B_j} x_ix_i^\top$. Translating the variable by $v = w - w^*$, we can reformulate the SGD update rule as
$$v_{k,j+1} = v_{k,j} - \frac{2\eta_k}{b}H(B_j)v_{k,j} = \Big(I - \frac{2\eta_k}{b}H(B_j)\Big)v_{k,j}, \quad j = 1, \ldots, m. \tag{8}$$
Let
$$M_\pi := \prod_{j=1}^m\Big(I - \frac{2\eta_k}{b}H(B_j)\Big) := \Big(I - \frac{2\eta_k}{b}H(B_m)\Big)\Big(I - \frac{2\eta_k}{b}H(B_{m-1})\Big)\cdots\Big(I - \frac{2\eta_k}{b}H(B_1)\Big).$$
Here the matrix product over a sequence of matrices $\{M_j \in \mathbb{R}^{d\times d}\}_{j=1}^m$ is taken from left to right with descending index, $\prod_{j=1}^m M_j := M_m\times M_{m-1}\times\cdots\times M_1$. Let $v_{k+1} = v_{k,m+1}$ and $v_k = v_{k,1}$. Then we can further reformulate Eq. (8) and obtain the epoch-wise update of SGD:
$$v_{k+1} = \Big(I - \frac{2\eta_k}{b}H(B_m)\Big)\Big(I - \frac{2\eta_k}{b}H(B_{m-1})\Big)\cdots\Big(I - \frac{2\eta_k}{b}H(B_1)\Big)v_k = M_\pi v_k. \tag{9}$$

In light of the notation $v$, the following lemma restates the notations of loss functions, hypothesis class, level set, and estimation error defined in Sections 2 and 4.

Lemma 8 (Reloading SGD notations). Under the reparameterization $v = w - w^*$, we can reload the following notations:
- The empirical and population losses are
$$L_S(v) = \frac{1}{n}(P_1v)^\top H_1(P_1v) + \frac{1}{n}(P_{-1}v)^\top H_{-1}(P_{-1}v) + \frac{1}{n}(Pv)^\top H_c(Pv), \qquad L_D(v) = \mu\|v\|_2^2.$$
- The hypothesis class is $\mathcal{H}_S = \{v \in \mathbb{R}^d : P_\perp v = P_\perp v_0\}$.
- The $\alpha$-level set is $\mathcal{V} = \{v \in \mathcal{H}_S : L_S(v) = \alpha\}$.
- For $v \in \mathcal{H}_S$, the estimation error is $\Delta(v) = \mu\|Pv\|_2^2$. Moreover, $\Delta^* = \inf_{v\in\mathcal{V}}\Delta(v) = \frac{\mu n\alpha}{\gamma_1}$.

Proof. See Section C.5.

Based on the above definitions, the following lemma characterizes the one-step update of SGD.

Lemma 9 (One-step SGD update). Consider the $j$-th SGD update at the $k$-th epoch as given by Eq. (8), and let the learning rate be a constant $\eta$ during that epoch. Then for $j = 1, \ldots, m$ we have
$$\begin{pmatrix}P_1v_{k,j+1}\\ P_{-1}v_{k,j+1}\end{pmatrix} = \begin{pmatrix}I - \frac{2\eta}{b}P_1H(B_j)P_1 & -\frac{2\eta}{b}P_1H(B_j)P_{-1}\\ -\frac{2\eta}{b}P_{-1}H(B_j)P_1 & I - \frac{2\eta}{b}P_{-1}H(B_j)P_{-1}\end{pmatrix}\begin{pmatrix}P_1v_{k,j}\\ P_{-1}v_{k,j}\end{pmatrix}.$$
Moreover, if $1 \notin B_j$, i.e., $x_1$ is not used in the $j$-th step, then
$$\begin{pmatrix}P_1v_{k,j+1}\\ P_{-1}v_{k,j+1}\end{pmatrix} = \begin{pmatrix}I & 0\\ 0 & I - \frac{2\eta}{b}P_{-1}H(B_j)P_{-1}\end{pmatrix}\begin{pmatrix}P_1v_{k,j}\\ P_{-1}v_{k,j}\end{pmatrix}.$$
Proof. See Section C.6.

Eq. (9) indicates that the key to analyzing the convergence of $v_{k+1}$ is to characterize the spectrum of the matrix $M_\pi$. In particular, the following lemma bounds the spectrum of $M_\pi$ when projected onto the column space of $P_{-1}$.

Lemma 10. Suppose $3n\iota < \lambda_n$ and $0 < \eta < \frac{b}{\lambda_2 + 3n\iota}$. Let $\pi := \{B_1, \ldots, B_m\}$ be a uniform $m$-partition of the index set $[n]$, where $n = mb$. Consider the following $d\times d$ matrix:
$$M_{-1} := \prod_{j=1}^m\Big(I - \frac{2\eta}{b}P_{-1}H(B_j)P_{-1}\Big) \in \mathbb{R}^{d\times d}.$$
Then for the spectrum of $M_{-1}^\top M_{-1}$ we have:
- $1$ is an eigenvalue of $M_{-1}^\top M_{-1}$ with multiplicity $d - n + 1$; moreover, the corresponding eigenspace is the column space of $P_1 + P_\perp$.
- Restricted to the column space of $P_{-1}$, the eigenvalues of $M_{-1}^\top M_{-1}$ are upper bounded by $(q_{-1}(\eta))^2 < 1$, where
$$q_{-1}(\eta) := \max\Big\{\Big|1 - \frac{2\eta}{b}(\lambda_2 + n\iota)\Big|, \Big|1 - \frac{2\eta}{b}(\lambda_n - n\iota)\Big|\Big\} + \frac{3n\eta\iota}{b} < 1.$$
Proof. See Section C.7.
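Lemma 10 can be checked numerically by forming the epoch operator $M_\pi$ explicitly. A sketch of ours (the data and parameters are placeholders) for computing $M_\pi$ as in Eq. (9):

```python
import numpy as np

def epoch_operator(X, eta, b, rng):
    # M_pi = prod_{j=1}^{m} (I - (2*eta/b) * H(B_j)) over a uniform partition of [n],
    # applied right-to-left as in Eq. (9), with H(B_j) = sum_{i in B_j} x_i x_i^T.
    d, n = X.shape
    perm = rng.permutation(n)
    M = np.eye(d)
    for start in range(0, n, b):
        Xb = X[:, perm[start:start + b]]
        M = (np.eye(d) - (2 * eta / b) * Xb @ Xb.T) @ M  # later batches act last
    return M

# v_{k+1} = epoch_operator(X, eta, b, rng) @ v_k gives one epoch of (SGD) on v = w - w*.
```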
Consider the projections of $v_k$ onto the column spaces of $P_{-1}$ and $P_1$. For simplicity let
$$A_k := \|P_{-1}v_k\|_2, \qquad B_k := \|P_1v_k\|_2.$$
The following lemma controls the updates of $A_k$ and $B_k$.

Lemma 11 (Update rules for $A_k$ and $B_k$). Suppose $3n\iota < \lambda_n$ and $0 < \eta < \frac{b}{\lambda_2 + 3n\iota}$. Consider the $k$-th epoch of SGD iterates given by Eq. (9), and set the learning rate in this epoch to the constant $\eta$. Denote
$$\xi(\eta) := \frac{4\eta\sqrt{n}\iota}{b}, \qquad q_1(\eta) := \Big|1 - \frac{2\eta\lambda_1}{b}\|P_1\bar{x}_1\|_2^2\Big|,$$
$$q_{-1}(\eta) := \max\Big\{\Big|1 - \frac{2\eta}{b}(\lambda_2 + n\iota)\Big|, \Big|1 - \frac{2\eta}{b}(\lambda_n - n\iota)\Big|\Big\} + \frac{3n\eta\iota}{b} < 1.$$
Then we have the following:
- $A_{k+1} \le q_{-1}(\eta)\cdot A_k + \xi(\eta)\cdot B_k$;
- $B_{k+1} \le q_1(\eta)\cdot B_k + \xi(\eta)\cdot A_k$;
- $B_{k+1} \ge q_1(\eta)\cdot B_k - \xi(\eta)\cdot A_k$.

Proof. See Section C.8.

Note that we can rephrase the update rules for $A_k$ and $B_k$ as
$$\begin{pmatrix}A_{k+1}\\ B_{k+1}\end{pmatrix} \le \begin{pmatrix}q_{-1}(\eta) & \xi(\eta)\\ \xi(\eta) & q_1(\eta)\end{pmatrix}\begin{pmatrix}A_k\\ B_k\end{pmatrix},$$
where "$\le$" means "entry-wise smaller than". The following two lemmas characterize the long-run behaviors of $A_k$ and $B_k$ under different learning rates.

Lemma 12 (The long-run behavior of SGD with moderate LR). Suppose $3n\iota < \lambda_n$ and $\lambda_2 + 4n\iota < \lambda_1$. Suppose $v_0$ is far away from $0$. Consider the first $k_1$ epochs of SGD iterates given by Eq. (9), with the learning rate during this stage set to a constant, i.e., $\eta_k = \eta$ for $0 \le k < k_1$, where
$$\frac{b}{\lambda_1 - 3\sqrt{n}\iota} < \eta < \frac{b}{\lambda_2 + 3n\iota}.$$
Then for $0 < \epsilon < 1$ and $0 < \beta < \beta_0 < B_0$ satisfying $\sqrt{n}\iota \le \mathrm{poly}(\epsilon\beta)$, there exists $k_1 \ge O\big(\log\frac{1}{\epsilon\beta}\big)$ such that:
- $A_{k_1} \le \epsilon\cdot\beta$;
- $B_{k_1} \le \|Pv_0\|_2\cdot\rho_1^{k_1} + \epsilon^2\cdot\beta = \mathrm{poly}\big(\frac{1}{\epsilon\beta}\big)$;
- for all $k = 0, 1, \ldots, k_1$, $B_k > \beta_0$.

Proof. See Section C.9.

Lemma 13 (The long-run behavior of SGD with small LR). Suppose $3n\iota < \lambda_n$ and $\lambda_2 + 4n\iota < \lambda_1$. Suppose $v_0$ is far away from $0$. Consider another $k_2 - k_1$ epochs of SGD iterates given by Eq. (9), with the learning rate set to a constant during these updates, i.e., $\eta_k = \eta'$ for $k_1 \le k < k_2$, where $0 < \eta' < \frac{b}{2\lambda_1}$. Consider the $\epsilon$ and $\beta$ given in Lemma 12. Then for $k \ge k_1$ we have:
- $A_k \le \epsilon\cdot\beta$;
- $B_k \le \begin{cases}q\cdot B_{k-1}, & B_{k-1} > \beta,\\ \beta, & B_{k-1} \le \beta,\end{cases}$ where $q \in (0, 1)$ is a constant.

Proof. See Section C.10.

Theorem 5 (Theorem 1, formal version). Suppose $3n\iota < \lambda_n$ and $\lambda_2 + 4n\iota < \lambda_1$. Suppose $v_0$ is away from $0$. Consider the SGD iterates given by Eq. (9) with the following moderate learning rate scheme:
$$\eta_k = \begin{cases}\eta \in \big(\frac{b}{\lambda_1 - 3\sqrt{n}\iota}, \frac{b}{\lambda_2 + 3n\iota}\big), & k = 1, \ldots, k_1;\\[2pt] \eta' \in \big(0, \frac{b}{2\lambda_1}\big), & k = k_1 + 1, \ldots, k_2.\end{cases}$$
Then for $0 < \epsilon < 1$ such that $\sqrt{n}\iota \le \mathrm{poly}(\epsilon)$, there exist $k_1 > O\big(\log\frac{1}{\epsilon}\big)$ and $k_2$ such that
$$(1-\epsilon)\cdot\gamma_1 \le \frac{v_{k_2}^\top Hv_{k_2}}{\|Pv_{k_2}\|_2^2} \le \gamma_1.$$
Proof. We choose $k_1$ and $k_2$ as in Lemmas 12 and 13 with $\beta$ set to a small constant; then we are guaranteed to have $A_{k_2} \le \epsilon\cdot\beta \le \epsilon\cdot B_{k_2}$, from which
$$\frac{\|P_1v_{k_2}\|_2^2}{\|Pv_{k_2}\|_2^2} = \frac{B_{k_2}^2}{A_{k_2}^2 + B_{k_2}^2} \ge \frac{1}{1+\epsilon^2} \ge 1 - \epsilon^2.$$
Then we have
$$\frac{v_{k_2}^\top Hv_{k_2}}{\|Pv_{k_2}\|_2^2} = \frac{(P_1v_{k_2})^\top H_1(P_1v_{k_2})}{\|Pv_{k_2}\|_2^2} + \frac{(P_{-1}v_{k_2})^\top H_{-1}(P_{-1}v_{k_2})}{\|Pv_{k_2}\|_2^2} + \frac{(Pv_{k_2})^\top H_c(Pv_{k_2})}{\|Pv_{k_2}\|_2^2}$$
$$\ge \lambda_1(1 - 4n\iota^2)\cdot\frac{\|P_1v_{k_2}\|_2^2}{\|Pv_{k_2}\|_2^2} + 0 - 4\sqrt{n}\iota \ge \lambda_1(1 - 4n\iota^2)(1 - \epsilon^2) - 4\sqrt{n}\iota$$
$$\ge (\gamma_1 - n\iota)(1 - 4n\iota^2)(1 - \epsilon^2) - 4\sqrt{n}\iota \qquad (\text{since } \gamma_1 \le \lambda_1 + n\iota \text{ by Lemma 4})$$
$$= \gamma_1(1 - 4n\iota^2)(1 - \epsilon^2) - n\iota(1 - 4n\iota^2)(1 - \epsilon^2) - 4\sqrt{n}\iota$$
$$\ge \gamma_1(1 - 0.5\epsilon) - 0.5\epsilon\gamma_1 \qquad (\text{since } \sqrt{n}\iota \le \mathrm{poly}(\epsilon))$$
$$= \gamma_1(1 - \epsilon).$$

Theorem 6 (Theorem 4, first part, formal version). Suppose $3n\iota < \lambda_n$ and $\lambda_2 + 4n\iota < \lambda_1$. Suppose $v_0$ is away from $0$. Consider the SGD iterates given by Eq. (9) with the following moderate learning rate schedule:
$$\eta_k = \begin{cases}\eta \in \big(\frac{b}{\lambda_1 - 3\sqrt{n}\iota}, \frac{b}{\lambda_2 + 3n\iota}\big), & k = 1, \ldots, k_1;\\[2pt] \eta' \in \big(0, \frac{b}{2\lambda_1}\big), & k = k_1 + 1, \ldots, k_2.\end{cases}$$
Then for $0 < \epsilon < 1$ satisfying $\sqrt{n}\iota \le \mathrm{poly}(\epsilon)$, there exist $k_1$ and $k_2$ such that SGD outputs an $\epsilon$-optimal solution.

Proof. We set $\beta = \sqrt{\frac{n\alpha}{\gamma_1}}$ and $\beta_0 = \sqrt{\frac{n\alpha}{\gamma_n}} > \beta$, and apply Lemma 12 to choose a $k_1$ such that
$$\|P_{-1}v_{k_1}\|_2 \le \epsilon\cdot\beta = \epsilon\cdot\sqrt{\frac{n\alpha}{\gamma_1}}; \qquad \|P_1v_k\|_2 \ge \beta_0 = \sqrt{\frac{n\alpha}{\gamma_n}}, \quad \forall\, 0 \le k \le k_1. \tag{13}$$
Thus for all $0 \le k \le k_1$,
$$L_S(v_k) = \frac{1}{n}(Pv_k)^\top XX^\top(Pv_k) \ge \frac{\gamma_n}{n}\|Pv_k\|_2^2 \ge \frac{\gamma_n}{n}\|P_1v_k\|_2^2 > \alpha,$$
where the first inequality holds since $\gamma_n$ is the smallest eigenvalue of $XX^\top$ restricted to the column space of $P$, and the last inequality follows from Eq. (13). This implies SGD cannot reach the $\alpha$-level set during the first stage, i.e., SGD does not terminate in this stage. We thus consider the second stage. From Lemma 13 we know that $\|P_1v_k\|_2$ keeps decreasing until it falls below $\beta$, while $\|P_{-1}v_k\|_2$ stays small during this period; i.e., SGD fits $P_1v$ while at the same time not messing up $P_{-1}v$. Mathematically speaking, there exist $k_2$ and $\alpha$ such that
$$A_{k_2} := \|P_{-1}v_{k_2}\|_2 \le \epsilon\cdot\beta = \epsilon\cdot\sqrt{\frac{n\alpha}{\gamma_1}}, \qquad L_S(v_{k_2}) = \alpha,$$
which implies SGD terminates at the $k_2$-th epoch. Then by Lemmas 3 and 8, we have
$$n\alpha = nL_S(v_{k_2}) = (P_1v_{k_2})^\top H_1(P_1v_{k_2}) + (P_{-1}v_{k_2})^\top H_{-1}(P_{-1}v_{k_2}) + (Pv_{k_2})^\top H_c(Pv_{k_2})$$
$$\ge (P_1v_{k_2})^\top H_1(P_1v_{k_2}) - \|P_{-1}\bar{x}_1\|_2^2\cdot\|Pv_{k_2}\|_2^2 \ge (\lambda_1 - n\iota)B_{k_2}^2 - 4n\iota^2(A_{k_2}^2 + B_{k_2}^2) \ge (\gamma_1 - 3n\iota)B_{k_2}^2 - 4n\iota^2A_{k_2}^2,$$
which yields
$$B_{k_2}^2 \le \frac{n\alpha + 4n\iota^2A_{k_2}^2}{\gamma_1 - 3n\iota} \le (1 + \epsilon^2)\cdot\frac{n\alpha}{\gamma_1}.$$
Then we can bound the estimation error as
$$\Delta(v_{k_2}) = \mu\|Pv_{k_2}\|_2^2 = \mu(B_{k_2}^2 + A_{k_2}^2) \le (1 + \epsilon^2)\cdot\frac{\mu n\alpha}{\gamma_1} + \epsilon^2\cdot\frac{\mu n\alpha}{\gamma_1} \le (1 + \epsilon)\cdot\frac{\mu n\alpha}{\gamma_1} = (1 + \epsilon)\cdot\Delta^*,$$
where we use the fact that $\Delta^* = \mu n\alpha/\gamma_1$ from Lemma 8. Hence SGD is $\epsilon$-near optimal.

B.2 THE DIRECTIONAL BIAS OF GD WITH MODERATE OR SMALL LEARNING RATE

Reloading notations. Denote the eigenvalue decomposition of $XX^\top$ as
$$XX^\top = G\Gamma G^\top, \qquad \Gamma := \mathrm{diag}(\gamma_1, \ldots, \gamma_n, 0, \ldots, 0), \qquad G = (g_1, \ldots, g_n, \ldots, g_d),$$
where $G \in \mathbb{R}^{d\times d}$ is orthonormal and $\gamma_1, \ldots, \gamma_n$ are given by Lemma 4. Clearly, $\mathrm{span}\{g_1, \ldots, g_n\} = \mathrm{span}\{x_1, \ldots, x_n\}$. Let $G_{1:n} = (g_1, \ldots, g_n)$ and $G_\perp = (g_{n+1}, \ldots, g_d)$; then $P = G_{1:n}G_{1:n}^\top$ and $P_\perp = G_\perp G_\perp^\top$. Recall the GD iterate at the $k$-th epoch:
$$w_{k+1} = w_k - \frac{2\eta_k}{n}XX^\top(w_k - w^*).$$
Translating and then rotating the variable as $u = G^\top(w - w^*)$, we can reformulate the GD iterates as
$$u_{k+1} = u_k - \frac{2\eta_k}{n}\Gamma u_k = \Big(I - \frac{2\eta_k}{n}\Gamma\Big)u_k. \tag{14}$$
We present the following lemma to reload the related notations under the parameterization $u = G^\top(w - w^*)$.

Lemma 14 (Reloading GD notations). Under the reparameterization $u = G^\top(w - w^*)$, we can reload the following notations: the empirical and population losses are
$$L_S(u) = \frac{1}{n}\sum_{i=1}^n \gamma_i\big(u^{(i)}\big)^2, \qquad L_D(u) = \mu\|u\|_2^2.$$

By the assumption that
$$k > \frac{1}{2}\cdot\frac{\log\frac{\epsilon\gamma_n(u_0^{(n)})^2}{\gamma_1 n\sum_{i=1}^n(u_0^{(i)})^2}}{\log q} = O\Big(\log\frac{1}{\epsilon}\Big),$$
we have
$$\frac{\sum_{i=1}^n(u_k^{(i)})^2}{(u_k^{(n)})^2} = 1 + \frac{\sum_{i=1}^{n-1}(u_k^{(i)})^2}{(u_k^{(n)})^2} = 1 + \frac{\sum_{i=1}^{n-1}\prod_{t=0}^{k-1}q_i(\eta_t)^2\cdot(u_0^{(i)})^2}{\prod_{t=0}^{k-1}q_n(\eta_t)^2\cdot(u_0^{(n)})^2} \qquad (\text{by Eq. (17)})$$
$$\le 1 + \frac{\sum_{i=1}^n(u_0^{(i)})^2}{(u_0^{(n)})^2}\cdot\sum_{i=1}^{n-1}\prod_{t=0}^{k-1}\frac{q_i(\eta_t)^2}{q_n(\eta_t)^2} \le 1 + \frac{\sum_{i=1}^n(u_0^{(i)})^2}{(u_0^{(n)})^2}\cdot n\cdot\prod_{t=0}^{k-1}\frac{q_{n-1}(\eta_t)^2}{q_n(\eta_t)^2} \qquad (\text{by Eq. (16)})$$
$$\le 1 + \frac{\sum_{i=1}^n(u_0^{(i)})^2}{(u_0^{(n)})^2}\cdot n\cdot q^{2k} \le 1 + \frac{\epsilon\gamma_n}{\gamma_1} \qquad (\text{by Eq. (18)}),$$
which further yields
$$1 \ge \frac{(u_k^{(n)})^2}{\sum_{i=1}^n(u_k^{(i)})^2} \ge \frac{1}{1 + \frac{\epsilon\gamma_n}{\gamma_1}} \ge 1 - \frac{\epsilon\gamma_n}{\gamma_1}. \tag{19}$$
By the above inequalities we have
$$\frac{u_k^\top\Gamma u_k}{\sum_{i=1}^n(u_k^{(i)})^2} = \sum_{i=1}^n\frac{(u_k^{(i)})^2}{\sum_{i=1}^n(u_k^{(i)})^2}\cdot\gamma_i = \frac{(u_k^{(n)})^2}{\sum_{i=1}^n(u_k^{(i)})^2}\cdot\gamma_n + \sum_{i=1}^{n-1}\frac{(u_k^{(i)})^2}{\sum_{i=1}^n(u_k^{(i)})^2}\cdot\gamma_i$$
$$\le \gamma_n + \sum_{i=1}^{n-1}\frac{(u_k^{(i)})^2}{\sum_{i=1}^n(u_k^{(i)})^2}\cdot\gamma_1 \qquad (\text{by Eq. (15)})$$
$$= \gamma_n + \Bigg(1 - \frac{(u_k^{(n)})^2}{\sum_{i=1}^n(u_k^{(i)})^2}\Bigg)\cdot\gamma_1 \le \gamma_n + \frac{\epsilon\gamma_n}{\gamma_1}\cdot\gamma_1 \qquad (\text{by Eq. (19)}) = \gamma_n\cdot(1 + \epsilon).$$
Finally, we note that $\frac{u_k^\top\Gamma u_k}{\sum_{i=1}^n(u_k^{(i)})^2} \ge \gamma_n$, since $\gamma_n$ is the smallest among $\{\gamma_i\}_{i=1}^n$.

Theorem 8 (Theorem 4, second part, formal version). Suppose $\lambda_n + 2n\iota < \lambda_{n-1}$. Suppose $u_0$ is away from $0$. Consider the GD iterates given by Eq. (14) with learning rate scheme
$$\eta_k \in \Big(0, \frac{n}{2\lambda_1 + 2n\iota}\Big).$$
Then for $\epsilon \in (0, 1)$, if $k \ge O\big(\log\frac{1}{\epsilon}\big)$, GD outputs an $M$-suboptimal solution, where $M = \frac{\gamma_1}{\gamma_n}(1 - \epsilon) > 1$ is a constant.

Proof. Consider an $\alpha$-level set where
$$\alpha = L_S(u_k) = \frac{1}{n}u_k^\top\Gamma u_k. \tag{20}$$
From Lemma 15 we know $L_S(u_k)$ is monotonically decreasing, so GD cannot terminate before the $k$-th epoch, i.e., the output of GD is $u_k$. Thus
$$\frac{\Delta(u)}{\Delta^*} = \frac{\gamma_1\sum_{i=1}^n(u_k^{(i)})^2}{n\alpha} \quad (\text{by Lemma 14}) \quad = \gamma_1\cdot\frac{\sum_{i=1}^n(u_k^{(i)})^2}{u_k^\top\Gamma u_k} \quad (\text{by Eq. (20)})$$
$$\ge \gamma_1\cdot\frac{1}{(1 + \epsilon)\gamma_n} \quad (\text{by Theorem 7}) \quad \ge \frac{\gamma_1}{\gamma_n}(1 - \epsilon) =: M,$$
where $M > 1$ by letting $\epsilon < 1 - \frac{\gamma_n}{\gamma_1}$.

B.3 THE DIRECTIONAL BIAS OF SGD WITH SMALL LEARNING RATE

We analyze SGD with small learning rate by repeating the arguments of the previous two subsections. Let us denote $X_{-n} := (x_1, x_2, \ldots, x_{n-1})$ and
$$P_{-n} = X_{-n}(X_{-n}^\top X_{-n})^{-1}X_{-n}^\top, \qquad P_n = P - P_{-n}.$$
That is, $P_{-n}$ is the projection onto the column space of $X_{-n}$, and $P_n$ is the projection onto the orthogonal complement of the column space of $X_{-n}$ within the column space of $X$.

Let us reload
$$H := XX^\top, \qquad H_{-n} := (P_{-n}X)(P_{-n}X)^\top, \qquad H_n := (P_nX)(P_nX)^\top,$$
$$H_c := (P_{-n}x_n)(P_nx_n)^\top + (P_nx_n)(P_{-n}x_n)^\top.$$
Then $H = H_{-n} + H_n + H_c$. Following a routine check, we can reload the following lemmas.

Lemma 16 (Variant of Lemma 10). Suppose $3n\iota < \lambda_n$ and $0 < \eta < \frac{b}{\lambda_1 + 3n\iota}$. Let $\pi := \{B_1, \ldots, B_m\}$ be a uniform $m$-partition of the index set $[n]$, where $n = mb$. Consider the $d\times d$ matrix
$$M_{-n} := \prod_{j=1}^m\Big(I - \frac{2\eta}{b}P_{-n}H(B_j)P_{-n}\Big) \in \mathbb{R}^{d\times d}.$$
Then for the spectrum of $M_{-n}^\top M_{-n}$ we have:
- $1$ is an eigenvalue of $M_{-n}^\top M_{-n}$ with multiplicity $d - n + 1$; moreover, the corresponding eigenspace is the column space of $P_n + P_\perp$.
- Restricted to the column space of $P_{-n}$, the eigenvalues of $M_{-n}^\top M_{-n}$ are upper bounded by $(q_{-n}(\eta))^2 < 1$, where
$$q_{-n}(\eta) := \max\Big\{\Big|1 - \frac{2\eta}{b}(\lambda_1 + n\iota)\Big|, \Big|1 - \frac{2\eta}{b}(\lambda_{n-1} - n\iota)\Big|\Big\} + \frac{3n\eta\iota}{b} < 1.$$
Proof. This follows from a routine check of the proof of Lemma 10.

Consider the projections of $v_k$ onto the column spaces of $P_{-n}$ and $P_n$. For simplicity we reload the notations
$$A_k := \|P_{-n}v_k\|_2, \qquad B_k := \|P_nv_k\|_2.$$
The following lemma controls the updates of $A_k$ and $B_k$.

Lemma 17 (Variant of Lemma 11). Suppose $3n\iota < \lambda_n$ and $0 < \eta < \frac{b}{\lambda_1 + 3n\iota}$. Consider the $k$-th epoch of SGD iterates given by Eq. (9), and set the learning rate in this epoch to the constant $\eta$. Denote
$$\xi(\eta) := \frac{4\eta\sqrt{n}\iota}{b}, \qquad q_n(\eta) := \Big|1 - \frac{2\eta\lambda_n}{b}\|P_n\bar{x}_n\|_2^2\Big|,$$
$$q_{-n}(\eta) := \max\Big\{\Big|1 - \frac{2\eta}{b}(\lambda_1 + n\iota)\Big|, \Big|1 - \frac{2\eta}{b}(\lambda_{n-1} - n\iota)\Big|\Big\} + \frac{3n\eta\iota}{b} < 1.$$
Then we have the following:
- $A_{k+1} \le q_{-n}(\eta)\cdot A_k + \xi(\eta)\cdot B_k$;
- $B_{k+1} \le q_n(\eta)\cdot B_k + \xi(\eta)\cdot A_k$;
- $B_{k+1} \ge q_n(\eta)\cdot B_k - \xi(\eta)\cdot A_k$.

Proof. This follows from a routine check of the proof of Lemma 11.

Lemma 18 (Variant of Lemma 13). Suppose $3n\iota < \lambda_n$ and $\lambda_n + 4n\iota < \lambda_{n-1}$. Consider the SGD iterates given by Eq. (9) with the following small learning rate scheme:
$$\eta_k = \eta' \in \Big(0, \frac{b}{2\lambda_1 + 2n\iota}\Big), \quad k = 1, \ldots, k_2.$$
Then for $0 < \epsilon < 1$ satisfying $\sqrt{n}\iota \le \mathrm{poly}(\epsilon)$, if $k_2 \ge O\big(\log\frac{1}{\epsilon}\big)$, then $\frac{A_{k_2}}{B_{k_2}} \le \epsilon$.

Proof. From the assumption we have $\eta' < \frac{b}{2(\lambda_1 + n\iota)}$ and $\eta' < \frac{b}{2\lambda_n}$, thus
$$\xi := \xi(\eta') = \frac{4\eta'\sqrt{n}\iota}{b},$$
$$q_n := q_n(\eta') = 1 - \frac{2\eta'\lambda_n}{b}\|P_n\bar{x}_n\|_2^2 \le 1 - \frac{2\lambda_n(1 - 4n\iota^2)}{b}\eta' < 1$$
(since $\|P_n\bar{x}_n\|_2^2 \ge 1 - 4n\iota^2$ by reloading Lemma 3), and
$$q_{-n} := q_{-n}(\eta') = 1 - \frac{2\eta'}{b}(\lambda_{n-1} - n\iota) + \frac{3n\eta'\iota}{b} = 1 - \frac{2(\lambda_{n-1} - n\iota) - 3n\iota}{b}\eta' \in (0, 1).$$
Moreover, by the gap assumption $\lambda_n + 4n\iota < \lambda_{n-1}$ we have
$$q_n - q_{-n} \ge \eta'\Big(\frac{2(\lambda_{n-1} - n\iota) - 3n\iota}{b} - \frac{2\lambda_n(1 - 4n\iota^2)}{b}\Big) \ge \frac{2\eta'}{b}(\lambda_{n-1} - \lambda_n - 3n\iota) > 0.$$
Therefore $0 < q_{-n} < q_n < 1$. Thus we can set $\xi = \frac{4\eta'\sqrt{n}\iota}{b}$ small enough such that
$$0 < q := \frac{q_{-n}}{q_n - \xi\cdot A_0/B_0} < 1. \tag{21}$$
Moreover, since $\sqrt{n}\iota \le \mathrm{poly}(\epsilon)$ and $\xi = \frac{4\eta'\sqrt{n}\iota}{b}$, we have
$$\frac{\xi}{q_n - \xi\cdot A_0/B_0} \le \frac{(1 - q)\epsilon}{2}. \tag{22}$$
Now we show recursively that $\frac{A_k}{B_k} \le \frac{A_0}{B_0}$. Clearly it holds for $k = 0$. Supposing $\frac{A_k}{B_k} \le \frac{A_0}{B_0}$, we consider $\frac{A_{k+1}}{B_{k+1}}$:
$$\frac{A_{k+1}}{B_{k+1}} \le \frac{q_{-n}\cdot A_k + \xi\cdot B_k}{q_n\cdot B_k - \xi\cdot A_k} \quad (\text{by Lemma 17}) \quad = \frac{q_{-n}\cdot\frac{A_k}{B_k} + \xi}{q_n - \xi\cdot\frac{A_k}{B_k}} \le \frac{q_{-n}\cdot\frac{A_k}{B_k} + \xi}{q_n - \xi\cdot A_0/B_0} \quad (\text{by the inductive hypothesis})$$
$$= \frac{q_{-n}}{q_n - \xi\cdot A_0/B_0}\cdot\frac{A_k}{B_k} + \frac{\xi}{q_n - \xi\cdot A_0/B_0} \le q\cdot\frac{A_k}{B_k} + \frac{(1 - q)\epsilon}{2} \quad (\text{by Eqs. (21) and (22)})$$
$$\le q\cdot\frac{A_0}{B_0} + \frac{(1 - q)\epsilon}{2} \le \frac{A_0}{B_0},$$
where in the last inequality we assume $\frac{\epsilon}{2} < \frac{A_0}{B_0}$. Moreover, from the above we have
$$\frac{A_{k+1}}{B_{k+1}} \le q\cdot\frac{A_k}{B_k} + \frac{(1 - q)\epsilon}{2},$$
which implies
$$\frac{A_{k_2}}{B_{k_2}} \le q^{k_2}\cdot\frac{A_0}{B_0} + \frac{1}{1 - q}\cdot\frac{(1 - q)\epsilon}{2} \le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon,$$
where we set $k_2 \ge O\big(\log\frac{1}{\epsilon}\big)$.

Next we prove the directional bias of SGD with small learning rate.

Theorem 9 (Theorem 3, formal version). Suppose $3n\iota < \lambda_n$ and $\lambda_n + 4n\iota < \lambda_{n-1}$. Suppose $v_0$ is away from $0$. Consider the SGD iterates given by Eq. (9) with the following small learning rate scheme:
$$\eta_k = \eta' \in \Big(0, \frac{b}{2\lambda_1 + 2n\iota}\Big), \quad k = 1, \ldots, k_2.$$
Then for $0 < \epsilon < 1$ satisfying $\sqrt{n}\iota \le \mathrm{poly}(\epsilon)$, if $k_2 \ge O\big(\log\frac{1}{\epsilon}\big)$, then
$$\gamma_n \le \frac{v_{k_2}^\top Hv_{k_2}}{\|Pv_{k_2}\|_2^2} \le (1 + \epsilon)\cdot\gamma_n.$$
Proof. First, by Lemma 18 we have
$$\frac{B_{k_2}^2}{A_{k_2}^2 + B_{k_2}^2} = \frac{1}{A_{k_2}^2/B_{k_2}^2 + 1} \ge \frac{1}{\epsilon^2 + 1} \ge 1 - \epsilon^2. \tag{23}$$
Next, by $H = H_n + H_{-n} + H_c$ we obtain
$$\frac{v_{k_2}^\top Hv_{k_2}}{\|Pv_{k_2}\|_2^2} = \frac{(P_nv_{k_2})^\top H_n(P_nv_{k_2})}{\|Pv_{k_2}\|_2^2} + \frac{(P_{-n}v_{k_2})^\top H_{-n}(P_{-n}v_{k_2})}{\|Pv_{k_2}\|_2^2} + \frac{(Pv_{k_2})^\top H_c(Pv_{k_2})}{\|Pv_{k_2}\|_2^2}$$
$$\le \lambda_n\cdot\frac{\|P_nv_{k_2}\|_2^2}{\|Pv_{k_2}\|_2^2} + (\lambda_1 + n\iota)\cdot\frac{\|P_{-n}v_{k_2}\|_2^2}{\|Pv_{k_2}\|_2^2} + 4\sqrt{n}\iota \qquad (\text{by reloading Lemmas 5, 6 and 7})$$
$$\le \lambda_n + (\lambda_1 + n\iota)\cdot\frac{A_{k_2}^2}{A_{k_2}^2 + B_{k_2}^2} + 4\sqrt{n}\iota \le \gamma_n + n\iota + (\lambda_1 + n\iota)\cdot\epsilon^2 + 4\sqrt{n}\iota \qquad (\text{by reloading Lemma 4 and Eq. (23)})$$
$$\le \gamma_n + \gamma_n\cdot\epsilon \qquad (\text{since } \sqrt{n}\iota \le \mathrm{poly}(\epsilon)).$$
Finally, we note that $\frac{v_{k_2}^\top Hv_{k_2}}{\|Pv_{k_2}\|_2^2} \ge \gamma_n$, since $\gamma_n$ is the smallest eigenvalue of $H$ restricted to the column space of $P$.

Theorem 10 (Theorem 4, third part, formal version). Suppose $3n\iota < \lambda_n$ and $\lambda_n + 4n\iota < \lambda_{n-1}$. Suppose $v_0$ is away from $0$. Consider the SGD iterates given by Eq. (9) with the following small learning rate scheme:
$$\eta_k = \eta' \in \Big(0, \frac{b}{2\lambda_1 + 2n\iota}\Big), \quad k = 1, \ldots, k_2.$$
Then for $0 < \epsilon < 1$ such that $\sqrt{n}\iota \le \mathrm{poly}(\epsilon)$, if $k_2 \ge O\big(\log\frac{1}{\epsilon}\big)$, then SGD outputs an $M$-suboptimal solution, where $M = \frac{\gamma_1}{\gamma_n}(1 - \epsilon) > 1$ is a constant.

Proof. From Eq. (9) and $\eta' < \frac{b}{2\lambda_1}$ we know that, restricted to the column space of $P$, the eigenvalues of $M_\pi$ are smaller than $1$; thus $v_k$ indeed converges to $0$. Consider an $\alpha$-level set where
$$\alpha = L_S(v_{k_2}) = \frac{1}{n}v_{k_2}^\top Hv_{k_2}. \tag{24}$$
Then
$$\frac{\Delta(v_{k_2})}{\Delta^*} = \frac{\gamma_1\|Pv_{k_2}\|_2^2}{n\alpha} \quad (\text{by Lemma 8}) \quad = \gamma_1\cdot\frac{\|Pv_{k_2}\|_2^2}{v_{k_2}^\top Hv_{k_2}} \quad (\text{by Eq. (24)}) \quad \ge \gamma_1\cdot\frac{1}{(1 + \epsilon)\gamma_n} \quad (\text{by Theorem 9}) \quad \ge \frac{\gamma_1}{\gamma_n}(1 - \epsilon) =: M,$$
where $M > 1$ by letting $\epsilon < 1 - \frac{\gamma_n}{\gamma_1}$.

C PROOF OF AUXILIARY LEMMAS IN SECTIONS A AND B

C.1 PROOF OF LEMMA 1

Proof of Lemma 1. Note that $\bar{x}_i$ follows the uniform distribution on the sphere $\mathbb{S}^{d-1}$. Therefore, letting $\xi_i$ be independent random variables following the $\chi^2_d$ distribution and defining $z_i = \sqrt{\xi_i}\cdot\bar{x}_i$, we have that $z_i$ follows the standard normal distribution in the $d$-dimensional space. It then suffices to prove that $|\langle z_i, z_j\rangle|/(\|z_i\|_2\|z_j\|_2) \le \iota$ for all $i \neq j$.

First we bound the inner product $\langle z_i, z_j\rangle$. Since each entry of $z_i$ is $1$-subgaussian, and
$$\langle z_i, z_j\rangle = \sum_{k=1}^d z_i^{(k)}z_j^{(k)} = \sum_{k=1}^d\Bigg[\bigg(\frac{z_i^{(k)} + z_j^{(k)}}{2}\bigg)^2 - \bigg(\frac{z_i^{(k)} - z_j^{(k)}}{2}\bigg)^2\Bigg],$$
where $z_i^{(k)}$ denotes the $k$-th entry of the vector $z_i$, it can be directly deduced that $\langle z_i, z_j\rangle$ is $d$-subexponential. It then follows that
$$\mathbb{P}\big(|\langle z_i, z_j\rangle| \ge t\big) \le 2\exp\Big(-\frac{t^2}{d}\Big).$$
Next we lower bound $\|z_i\|_2$. Note that $\|z_i\|_2^2 - d = \sum_{k=1}^d\big((z_i^{(k)})^2 - 1\big)$. Since $z_i^{(k)}$ is $1$-subgaussian, $\|z_i\|_2^2 - d$ is $d$-subexponential, whence
$$\mathbb{P}\big(\big|\|z_i\|_2^2 - d\big| \ge t\big) \le 2\exp\Big(-\frac{t^2}{d}\Big).$$
Finally, applying the union bound over all $i, j \in [n]$, with probability at least $1 - \delta$ the following holds for all $i \neq j$:
$$|\langle z_i, z_j\rangle| \le \sqrt{d\log\frac{2n^2}{\delta}}, \qquad \|z_i\|_2^2 \ge d - \sqrt{d\log\frac{2n^2}{\delta}}.$$
Assuming $d \ge 4\log(2n^2/\delta)$, we have $\|z_i\|_2^2 \ge d/2$. It then follows that
$$|\langle\bar{x}_i, \bar{x}_j\rangle| = \frac{|\langle z_i, z_j\rangle|}{\|z_i\|_2\|z_j\|_2} < 2\sqrt{\frac{1}{d}\log\frac{2n^2}{\delta}} =: \iota.$$
This completes the proof.

C.2 PROOF OF LEMMA 3

Proof of Lemma 3. Similarly to the proof of Lemma 1, we translate $x_1, \ldots, x_n$ into $z_1, \ldots, z_n$ by introducing $\chi^2_d$ random variables. Let $Z_{-1} = (z_2, \ldots, z_n) \in \mathbb{R}^{d\times(n-1)}$, each entry of which is i.i.d. generated from the Gaussian distribution $\mathcal{N}(0, 1)$. Then we have
$$P_{-1}\bar{x}_1 = X_{-1}(X_{-1}^\top X_{-1})^{-1}X_{-1}^\top\bar{x}_1 = Z_{-1}(Z_{-1}^\top Z_{-1})^{-1}Z_{-1}^\top\bar{x}_1.$$
Conditioned on $\bar{x}_1$, each entry of $Z_{-1}^\top\bar{x}_1$ i.i.d. follows $\mathcal{N}(0, 1)$, so $\|Z_{-1}^\top\bar{x}_1\|_2^2$ follows the $\chi^2_{n-1}$ distribution, implying that with probability at least $1 - \delta'$ we have
$$\|Z_{-1}^\top\bar{x}_1\|_2^2 \le (n - 1) + (n - 1)\log(2/\delta').$$
Then, by Corollary 5.35 in Vershynin (2010), we know that with probability at least $1 - \delta'$ it holds that
$$\sqrt{d} - \sqrt{n - 1} - \sqrt{2\log(2/\delta')} \le \sigma_{\min}(Z_{-1}) \le \sigma_{\max}(Z_{-1}) \le \sqrt{d} + \sqrt{n - 1} + \sqrt{2\log(2/\delta')}.$$
Therefore, assuming $\sqrt{n - 1} + \sqrt{2\log(2/\delta')} \le \sqrt{d}/8$, we have with probability at least $1 - \delta'$ that
$$\big\|Z_{-1}(Z_{-1}^\top Z_{-1})^{-1}\big\|_2 \le \frac{\sqrt{d} + \sqrt{n - 1} + \sqrt{2\log(2/\delta')}}{\big(\sqrt{d} - \sqrt{n - 1} - \sqrt{2\log(2/\delta')}\big)^2} \le \frac{1}{\sqrt{d}}\Bigg(1 + 4\sqrt{\frac{n - 1}{d}} + 4\sqrt{\frac{2\log(2/\delta')}{d}}\Bigg).$$
Combining with the upper bound on $\|Z_{-1}^\top\bar{x}_1\|_2$ and setting $\delta' = \delta/2$, we have with probability at least $1 - \delta$ that
$$\|P_{-1}\bar{x}_1\|_2 \le \big\|Z_{-1}(Z_{-1}^\top Z_{-1})^{-1}\big\|_2\cdot\big\|Z_{-1}^\top\bar{x}_1\big\|_2 \le \Bigg(1 + 4\sqrt{\frac{n - 1}{d}} + 4\sqrt{\frac{2\log(4/\delta)}{d}}\Bigg)\cdot\Bigg(\sqrt{\frac{n - 1}{d}} + \sqrt{\frac{(n - 1)\log(4/\delta)}{d}}\Bigg) \le (1 + 4\sqrt{n}\iota)\cdot\sqrt{n}\iota,$$
where the last inequality follows from the definition of $\iota$. Assuming $\sqrt{n}\iota \le 1/4$, this completes the proof of the first argument. Noting that $\|P_1\bar{x}_1\|_2^2 + \|P_{-1}\bar{x}_1\|_2^2 = \|\bar{x}_1\|_2^2 = 1$, we have
$$\|P_1\bar{x}_1\|_2 = \sqrt{1 - \|P_{-1}\bar{x}_1\|_2^2} \ge \sqrt{1 - 4n\iota^2},$$
which completes the proof of the second argument. The third argument holds trivially by the construction of $P_\perp$.

C.3 PROOF OF LEMMA 4

Proof of Lemma 4. Clearly $XX^\top \in \mathbb{R}^{d\times d}$ is of rank $n$ and symmetric, so $XX^\top$ has $n$ real, non-zero (possibly repeated) eigenvalues, denoted $\gamma_1, \ldots, \gamma_n$ in non-increasing order. Moreover, $\gamma_1, \ldots, \gamma_n$ are also the eigenvalues of $X^\top X \in \mathbb{R}^{n\times n}$, so it suffices to locate the eigenvalues of $X^\top X$, where $(X^\top X)_{ij} = x_i^\top x_j$. We first calculate the diagonal entries: $(X^\top X)_{ii} = x_i^\top x_i = \lambda_i$. Then we bound the off-diagonal entries: for $j \neq i$,
$$(X^\top X)_{ij} = x_i^\top x_j = \sqrt{\lambda_i\lambda_j}\,\langle\bar{x}_i, \bar{x}_j\rangle \in (-\iota, \iota),$$
where we use $0 < \lambda_1, \ldots, \lambda_n \le 1$. Thus we have
$$R_i(X^\top X) = \sum_{j\neq i}\big|(X^\top X)_{ij}\big| \le n\iota, \quad i = 1, \ldots, n.$$
Finally, our conclusions hold by applying the Gershgorin circle theorem.

C.4 PROOF OF LEMMA 5

Proof of Lemma 5. The first conclusion is clear since, by construction, $P_{-1}P_1 = P_{-1}P_\perp = 0$. Note that $H_{-1}$ is a rank-$(n-1)$ symmetric matrix. Let $\tau_2, \dots, \tau_n$ be the $n-1$ nonzero eigenvalues of $H_{-1}$. Clearly $\tau_2, \dots, \tau_n$, together with $\tau_1 := 0$, give the spectrum of $\tilde H_{-1} := (P_{-1}X)^\top P_{-1}X \in \mathbb R^{n\times n}$. We bound $\tau_2, \dots, \tau_n$ by analyzing $\tilde H_{-1}$. From Lemma 3 we have $\|P_{-1}\bar x_1\|_2 \le 2\sqrt n\iota$. From Lemma 2 we have
$$P_{-1}X = (P_{-1}x_1, P_{-1}x_2, \dots, P_{-1}x_n) = (P_{-1}x_1, x_2, \dots, x_n).$$
We first calculate the diagonal entries:
$$(\tilde H_{-1})_{ii} = \begin{cases} \|P_{-1}x_1\|_2^2 \le \lambda_1 \cdot 4n\iota^2 \le 4n\iota^2, & i = 1;\\ \|x_i\|_2^2 = \lambda_i, & i \neq 1.\end{cases}$$
Then we bound the off-diagonal entries. Let $j \neq i$; then at least one of them is not $1$. Without loss of generality let $i \neq 1$, which yields $x_i = P_{-1}x_i$ by Lemma 2. Thus $x_i^\top P_1 x_j = \langle P_{-1}x_i, P_1 x_j\rangle = 0$, and therefore
$$(\tilde H_{-1})_{ij} = (P_{-1}x_i)^\top P_{-1}x_j = x_i^\top P_{-1}x_j = x_i^\top x_j - x_i^\top P_1 x_j = x_i^\top x_j = \sqrt{\lambda_i\lambda_j}\,\langle\bar x_i, \bar x_j\rangle \in (-\iota, \iota).$$
Thus the Gershgorin radii satisfy $R_i(\tilde H_{-1}) = \sum_{j\neq i}|(\tilde H_{-1})_{ij}| \le n\iota$ for $i = 1, \dots, n$. Finally, we require $4n\iota^2 + 2n\iota < \lambda_n$, so that the first Gershgorin disc does not intersect the others; the Gershgorin circle theorem then gives our second conclusion.
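A quick empirical companion to Lemma 5 (our sketch, not from the paper): with $d$ taken large enough that $n\iota$ is small, the spectrum of $(P_{-1}X)^\top P_{-1}X$ indeed splits into one near-zero eigenvalue and $n-1$ eigenvalues inside $(\lambda_n - n\iota, \lambda_2 + n\iota)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 200_000          # d very large so that n * iota is small

Z = rng.standard_normal((n, d))
Xbar = Z / np.linalg.norm(Z, axis=1, keepdims=True)
lam = np.sort(rng.uniform(0.5, 1.0, size=n))[::-1]   # lambda_1 >= ... >= lambda_n
X = (np.sqrt(lam)[:, None] * Xbar).T                 # d x n

Q, _ = np.linalg.qr(X[:, 1:])   # orthonormal basis of span{x_2, ..., x_n} = col(P_{-1})
W = Q.T @ X                     # (n-1) x n; note (P_{-1} X)^T (P_{-1} X) = W^T W
taus = np.sort(np.linalg.eigvalsh(W.T @ W))

iota = np.abs(Xbar @ Xbar.T - np.eye(n)).max()       # empirical coherence
print("tau_1 (should be close to 0):", taus[0])
print("remaining eigenvalues inside (lam_n - n*iota, lam_2 + n*iota)?",
      bool(np.all(taus[1:] > lam[-1] - n * iota) and np.all(taus[1:] < lam[1] + n * iota)))
```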

C.5 PROOF OF LEMMA 8

Proof of Lemma 8. For the empirical loss, it is clear that
$$L_S(v) = \frac1n (w - w^*)^\top XX^\top (w - w^*) = \frac1n v^\top XX^\top v = \frac1n v^\top H v = \frac1n (Pv)^\top H (Pv) = \frac1n (P_1 v)^\top H_1 (P_1 v) + \frac1n (P_{-1}v)^\top H_{-1}(P_{-1}v) + \frac1n (Pv)^\top H_c (Pv),$$
where we use Lemma 4, Lemma 5, and Lemma 6. For the population loss,
$$L_D(v) = \mu\|w - w^*\|_2^2 = \mu\|v\|_2^2.$$
For the hypothesis class $\mathcal H_S = \{w \in \mathbb R^d : P_\perp w = P_\perp w_0\}$, applying $w - w^* = v$ and $w_0 - w^* = v_0$, we obtain $\mathcal H_S = \{v \in \mathbb R^d : P_\perp v = P_\perp v_0\}$. For the $\alpha$-level set, we note the optimal training loss is $L_S^* = \inf_{v\in\mathcal H_S} L_S(v) = 0$. As for the estimation error, we note that
$$\inf_{v\in\mathcal H_S} L_D(v) = \inf_{P_\perp v = P_\perp v_0} \mu\|v\|_2^2 = \mu\|P_\perp v_0\|_2^2,$$
thus for $v \in \mathcal H_S$ we have
$$\Delta(v) = L_D(v) - \inf_{v'\in\mathcal H_S} L_D(v') = \mu\|v\|_2^2 - \mu\|P_\perp v_0\|_2^2 = \mu\|Pv\|_2^2.$$
Finally, consider $v \in \mathcal V$, i.e., $n\alpha = v^\top XX^\top v$; then
$$\Delta^* = \inf_{v\in\mathcal V}\Delta(v) = \inf_{n\alpha = v^\top XX^\top v}\mu\|Pv\|_2^2 = \frac{\mu n\alpha}{\gamma_1},$$
where $\gamma_1$ is the largest eigenvalue of the matrix $XX^\top$ and the infimum is attained by setting $v$ parallel to the leading eigenvector of $XX^\top$.
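The closed form $\Delta^* = \mu n\alpha/\gamma_1$ can be checked numerically. The sketch below is ours and uses a stand-in Gaussian data matrix rather than the paper's spherical construction; it builds the minimizer parallel to the leading eigenvector and verifies both the level-set constraint and the estimation error.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, alpha, mu = 30, 2_000, 0.1, 0.05

X = rng.standard_normal((d, n)) / np.sqrt(d)    # stand-in d x n data matrix
w, V = np.linalg.eigh(X.T @ X)                  # Gram spectrum = nonzero spectrum of X X^T
gamma1 = w[-1]
u1 = X @ V[:, -1]
u1 /= np.linalg.norm(u1)                        # leading eigenvector of X X^T (in col(X), so P u1 = u1)

v = np.sqrt(n * alpha / gamma1) * u1            # candidate minimizer
print("v^T X X^T v == n*alpha?", np.isclose(np.sum((X.T @ v) ** 2), n * alpha))
print("Delta(v) =", mu * (v @ v), " vs  mu*n*alpha/gamma1 =", mu * n * alpha / gamma1)
```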

C.6 PROOF OF LEMMA 9

Proof of Lemma 9. From Eq. (8) we have
$$v_{k,j+1} = \left(I - \frac{2\eta}{b} H(B_j)\right) v_{k,j}, \quad j = 1, \dots, m. \quad (25)$$
Recall the following properties of the projection operators:
$$P_1 = P_1P_1, \quad P_{-1} = P_{-1}P_{-1}, \quad 0 = P_1P_{-1} = P_{-1}P_1.$$
Moreover, since $x_i^\top P_\perp v = 0$, we have $H(B_j)P_\perp v = \sum_{i\in B_j} x_i x_i^\top P_\perp v = 0$. Applying $P_1$ to Eq. (25) we have
$$P_1 v_{k,j+1} = P_1\Big(I - \frac{2\eta}b H(B_j)\Big)(P_1 v_{k,j} + P_{-1}v_{k,j} + P_\perp v_{k,j}) = \Big(I - \frac{2\eta}b P_1 H(B_j)P_1\Big) P_1 v_{k,j} - \frac{2\eta}b P_1 H(B_j)P_{-1}\cdot P_{-1}v_{k,j}.$$
Similarly, applying $P_{-1}$ to Eq. (25) we have
$$P_{-1} v_{k,j+1} = -\frac{2\eta}b P_{-1}H(B_j)P_1 \cdot P_1 v_{k,j} + \Big(I - \frac{2\eta}b P_{-1}H(B_j)P_{-1}\Big)P_{-1}v_{k,j}.$$
To sum up, we have
$$\begin{pmatrix} P_1 v_{k,j+1} \\ P_{-1} v_{k,j+1} \end{pmatrix} = \begin{pmatrix} I - \frac{2\eta}b P_1 H(B_j) P_1 & -\frac{2\eta}b P_1 H(B_j) P_{-1} \\ -\frac{2\eta}b P_{-1} H(B_j) P_1 & I - \frac{2\eta}b P_{-1} H(B_j) P_{-1} \end{pmatrix} \begin{pmatrix} P_1 v_{k,j} \\ P_{-1} v_{k,j} \end{pmatrix}.$$
Notice that if $1 \notin B_j$, i.e., $x_1$ is not used in the $j$-th step, then $P_1 H(B_j) = H(B_j) P_1 = 0$, since $H(B_j) = \sum_{i\in B_j} x_i x_i^\top$ is composed of data belonging to the column space of $P_{-1}$. Therefore, if $1 \notin B_j$ we have
$$\begin{pmatrix} P_1 v_{k,j+1} \\ P_{-1} v_{k,j+1} \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0 & I - \frac{2\eta}b P_{-1} H(B_j) P_{-1} \end{pmatrix} \begin{pmatrix} P_1 v_{k,j} \\ P_{-1} v_{k,j} \end{pmatrix}.$$

C.7 PROOF OF LEMMA 10

Proof of Lemma 10. Clearly, for each factor in the product, the column space of $P_1 + P_\perp$, which is $(d - n + 1)$-dimensional, belongs to its eigenspace of eigenvalue $1$, which yields the first claim. In what follows we restrict ourselves to the column space of $P_{-1}$. Let us expand $M_{-1}$:
$$M_{-1} = \prod_{j=1}^m \Big(I - \frac{2\eta}b P_{-1}H(B_j)P_{-1}\Big) = \Big(I - \frac{2\eta}b P_{-1}H(B_m)P_{-1}\Big)\cdots\Big(I - \frac{2\eta}b P_{-1}H(B_1)P_{-1}\Big) = I - \frac{2\eta}b \underbrace{\sum_{j=1}^m P_{-1}H(B_j)P_{-1}}_{H_{-1}} + \underbrace{\Big(\frac{2\eta}b\Big)^2 \sum_{1\le i<j\le m} P_{-1}H(B_j)P_{-1}H(B_i)P_{-1} + \cdots}_{C}.$$
We first analyze the matrix $H_{-1}$. Since $H(B_j) = \sum_{i\in B_j} x_i x_i^\top$ and $\pi = \{B_1, \dots, B_m\}$ is a partition of the index set $[n]$, we have
$$H_{-1} = \sum_{j=1}^m P_{-1}H(B_j)P_{-1} = P_{-1}\sum_{i=1}^n x_i x_i^\top P_{-1} = P_{-1}XX^\top P_{-1},$$
which is exactly the matrix studied in Lemma 5, whence $H_{-1}$ has eigenvalue zero (with multiplicity $d - n + 1$) on the column space of $P_1 + P_\perp$, and, restricted to the column space of $P_{-1}$, the eigenvalues of $H_{-1}$ belong to $(\lambda_n - n\iota, \lambda_2 + n\iota)$.
We then analyze the matrix $C$. For the second-degree terms,
$$P_{-1}H(B_j)P_{-1}H(B_i)P_{-1} = \Big(P_{-1}\sum_{j'\in B_j} x_{j'}x_{j'}^\top P_{-1}\Big)\Big(P_{-1}\sum_{i'\in B_i} x_{i'}x_{i'}^\top P_{-1}\Big) = \sum_{i'\in B_i}\sum_{j'\in B_j} (P_{-1}x_{j'})\,\langle P_{-1}x_{j'}, P_{-1}x_{i'}\rangle\,(P_{-1}x_{i'})^\top. \quad (26)$$
Remember that $B_i \cap B_j = \emptyset$ for $i \neq j$, thus $x_{i'} \neq x_{j'}$ for $i' \in B_i$ and $j' \in B_j$. Then from Lemma 1 we have
$$|\langle P_{-1}x_{i'}, P_{-1}x_{j'}\rangle| \le |\langle x_{i'}, x_{j'}\rangle| \le \sqrt{\lambda_{i'}\lambda_{j'}}\cdot\iota \le \iota.$$
Inserting this into Eq. (26), we obtain
$$\|P_{-1}H(B_j)P_{-1}H(B_i)P_{-1}\|_F \le b^2 \cdot \max_{i', j'}|\langle P_{-1}x_{i'}, P_{-1}x_{j'}\rangle|^2 \le b^2\iota^2.$$
We can bound the Frobenius norm of the higher-degree terms in $C$ in a similar manner; in sum, for the Frobenius norm of $C$ we have
$$\|C\|_F \le \sum_{s=2}^m \Big(\frac{2\eta}b\Big)^s b^s \iota^s \binom ms = \sum_{s=2}^m (2\eta\iota)^s\binom ms = \sum_{s=0}^m (2\eta\iota)^s\binom ms - 1 - 2m\eta\iota = (1 + 2\eta\iota)^m - 1 - 2m\eta\iota \le 1 + 2m\eta\iota + \frac{m^2\sqrt e}{2}(2\eta\iota)^2 - 1 - 2m\eta\iota \le 4m^2\eta^2\iota^2, \quad (27)$$
where the second-to-last inequality holds for $2\eta\iota < \frac1{2m}$: for $f(t) = (1+t)^m$ and $t \in [0, \frac1{2m}]$, we have $f''(t) = m(m-1)(1+t)^{m-2} \le m(m-1)(1 + \frac1{2m})^{m-2} \le m(m-1)\sqrt e$, which implies $f$ is $(m^2\sqrt e)$-smooth on $[0, \frac1{2m}]$. Moreover, by the assumptions $3n\iota < \lambda_n$ and $\eta < \frac b{\lambda_n + 3n\iota}$, we can indeed verify that
$$2\eta\iota < \frac{2b\iota}{\lambda_n + 3n\iota} \le \frac{2b\iota}{6n\iota} \le \frac1{2m}.$$
Now we rewrite $M_{-1}^\top M_{-1}$ as
$$M_{-1}^\top M_{-1} = \Big(I - \frac{2\eta}b H_{-1} + C\Big)^\top \Big(I - \frac{2\eta}b H_{-1} + C\Big) = \Big(I - \frac{2\eta}b H_{-1}\Big)^2 + \underbrace{C^\top\Big(I - \frac{2\eta}b H_{-1}\Big) + \Big(I - \frac{2\eta}b H_{-1}\Big)C + C^\top C}_{D}. \quad (28)$$
Restricted to the column space of $P_{-1}$, the eigenvalues of $H_{-1}$ belong to $(\lambda_n - n\iota, \lambda_2 + n\iota)$, thus the eigenvalues of $(I - \frac{2\eta}b H_{-1})^2$ are upper bounded by
$$\max\Big\{\Big(1 - \frac{2\eta}b(\lambda_2 + n\iota)\Big)^2,\ \Big(1 - \frac{2\eta}b(\lambda_n - n\iota)\Big)^2\Big\} < 1, \quad (29)$$
where the last inequality is guaranteed by our assumptions on $\eta$ and $\iota$; for simplicity we defer the verification to the end of the proof. Consider the eigendecomposition
$$I - \frac{2\eta}b H_{-1} = U \operatorname{diag}(\mu_1, \dots, \mu_{n-1}, 1, \dots, 1)\, U^\top,$$
where $\mu_1, \dots, \mu_{n-1} \in (-1, 1)$ by Eq. (29). Then we have
$$\Big\|\Big(I - \frac{2\eta}b H_{-1}\Big)C\Big\|_F = \big\|\operatorname{diag}(\mu_1, \dots, \mu_{n-1}, 1, \dots, 1)\, U^\top C U\big\|_F \le \|U^\top C U\|_F = \|C\|_F.$$
Therefore we can bound the Frobenius norm of $D$ by
$$\|D\|_F \le 2\Big\|\Big(I - \frac{2\eta}b H_{-1}\Big)C\Big\|_F + \|C\|_F^2 \le 2\|C\|_F + \|C\|_F^2 \le 8m^2\eta^2\iota^2 + 16m^4\eta^4\iota^4 \le 9m^2\eta^2\iota^2, \quad (30)$$
where the last inequality follows from $2\eta\iota \le 1/(2m)$, proved in Eq. (27). Finally, applying the Hoffman-Wielandt theorem with Eqs. (28), (29) and (30), we conclude that, restricted to the column space of $P_{-1}$, the eigenvalues of $M_{-1}^\top M_{-1}$ are upper bounded by
$$\max\Big\{\Big(1 - \frac{2\eta}b(\lambda_2 + n\iota)\Big)^2 + 9m^2\eta^2\iota^2,\ \Big(1 - \frac{2\eta}b(\lambda_n - n\iota)\Big)^2 + 9m^2\eta^2\iota^2\Big\} \le \max\Big\{\Big(\Big|1 - \frac{2\eta}b(\lambda_2 + n\iota)\Big| + 3m\eta\iota\Big)^2,\ \Big(\Big|1 - \frac{2\eta}b(\lambda_n - n\iota)\Big| + 3m\eta\iota\Big)^2\Big\} =: (q_{-1}(\eta))^2, \quad (31)$$
where $3m\eta\iota = \frac{3n\eta\iota}b$ since $n = mb$. At this point we are left to verify Eq. (29) and
$$q_{-1}(\eta) := \max\Big\{\Big|1 - \frac{2\eta}b(\lambda_2 + n\iota)\Big| + \frac{3n\eta\iota}b,\ \Big|1 - \frac{2\eta}b(\lambda_n - n\iota)\Big| + \frac{3n\eta\iota}b\Big\} < 1. \quad (32)$$
Clearly it suffices to verify Eq. (32). Indeed,
$$\Big|1 - \frac{2\eta}b(\lambda_2 + n\iota)\Big| + \frac{3n\eta\iota}b < 1 \Leftrightarrow \frac{3n\iota}b\eta - 1 < 1 - \frac{2(\lambda_2 + n\iota)}b\eta < 1 - \frac{3n\iota}b\eta \Leftrightarrow \begin{cases}\frac{2\lambda_2 - n\iota}b\eta > 0\\[2pt] \frac{2\lambda_2 + 5n\iota}b\eta < 2\end{cases} \Leftarrow \begin{cases}\eta > 0\\ 2\lambda_2 - n\iota > 0\\ \eta < \frac b{\lambda_2 + 2.5 n\iota}\end{cases} \Leftarrow \begin{cases}3n\iota < \lambda_n\ (\text{since } \lambda_2 \ge \lambda_n)\\ 0 < \eta < \frac b{\lambda_2 + 3n\iota}\end{cases}$$
Similarly, we verify that
$$\Big|1 - \frac{2\eta}b(\lambda_n - n\iota)\Big| + \frac{3n\eta\iota}b < 1 \Leftrightarrow \frac{3n\iota}b\eta - 1 < 1 - \frac{2(\lambda_n - n\iota)}b\eta < 1 - \frac{3n\iota}b\eta \Leftrightarrow \begin{cases}\frac{2\lambda_n - 5n\iota}b\eta > 0\\[2pt] \frac{2\lambda_n + n\iota}b\eta < 2\end{cases} \Leftarrow \begin{cases}\eta > 0\\ 2\lambda_n - 5n\iota > 0\\ \eta < \frac b{\lambda_n + 0.5 n\iota}\end{cases} \Leftarrow \begin{cases}3n\iota < \lambda_n\\ 0 < \eta < \frac b{\lambda_2 + 3n\iota}\ (\text{since } \lambda_2 \ge \lambda_n)\end{cases}$$
This completes the proof.
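As an empirical companion to Lemma 10 (our sketch, not from the paper), one can form the one-epoch product $M_{-1}$ restricted to the column space of $P_{-1}$ and compare its top singular value with $q_{-1}(\eta)$; $d$ is taken very large so that $n\iota$ is small and the lemma's assumptions hold.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, b = 20, 200_000, 4
m = n // b

Z = rng.standard_normal((n, d))
Xbar = Z / np.linalg.norm(Z, axis=1, keepdims=True)
lam = np.sort(rng.uniform(0.5, 1.0, size=n))[::-1]
X = (np.sqrt(lam)[:, None] * Xbar).T                    # d x n

iota = np.abs(Xbar @ Xbar.T - np.eye(n)).max()
eta = 0.9 * b / (lam[1] + 3 * n * iota)                 # satisfies eta < b / (lambda_2 + 3 n iota)

Q, _ = np.linalg.qr(X[:, 1:])                           # basis of col(P_{-1})
perm = rng.permutation(n)
M = np.eye(n - 1)
for j in range(m):                                      # one epoch: product over the mini-batches
    G = Q.T @ X[:, perm[j * b:(j + 1) * b]]             # restriction of H(B_j) to col(P_{-1}) is G G^T
    M = (np.eye(n - 1) - (2 * eta / b) * (G @ G.T)) @ M

q = max(abs(1 - 2 * eta / b * (lam[1] + n * iota)) + 3 * n * eta * iota / b,
        abs(1 - 2 * eta / b * (lam[-1] - n * iota)) + 3 * n * eta * iota / b)
print("top singular value of M_{-1} on col(P_{-1}):", np.linalg.svd(M, compute_uv=False)[0])
print("q_{-1}(eta):", q)
```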

C.8 PROOF OF LEMMA 11

Proof of Lemma 11. Note that during one epoch of SGD updates, $x_1$ is used exactly once. Without loss of generality, assume SGD uses $x_1$ at the $l$-th step, i.e., $1 \in B_l$ and $1 \notin B_j$ for $j \neq l$. Recursively applying Lemma 9, we have
$$\begin{pmatrix} P_1 v_{k,m+1} \\ P_{-1} v_{k,m+1} \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0 & \prod_{j=l+1}^m \big(I - \frac{2\eta}b P_{-1}H(B_j)P_{-1}\big) \end{pmatrix} \begin{pmatrix} I - \frac{2\eta}b P_1 H(B_l)P_1 & -\frac{2\eta}b P_1 H(B_l)P_{-1} \\ -\frac{2\eta}b P_{-1}H(B_l)P_1 & I - \frac{2\eta}b P_{-1}H(B_l)P_{-1} \end{pmatrix} \begin{pmatrix} I & 0 \\ 0 & \prod_{j=1}^{l-1}\big(I - \frac{2\eta}b P_{-1}H(B_j)P_{-1}\big) \end{pmatrix} \begin{pmatrix} P_1 v_{k,1} \\ P_{-1} v_{k,1} \end{pmatrix}.$$
Let $v_{k+1} = v_{k,m+1}$, $v_k = v_{k,1}$, and define
$$M_l := I - \frac{2\eta}b P_{-1}H(B_l)P_{-1}, \quad M_{>l} := \prod_{j=l+1}^m \Big(I - \frac{2\eta}b P_{-1}H(B_j)P_{-1}\Big), \quad M_{<l} := \prod_{j=1}^{l-1}\Big(I - \frac{2\eta}b P_{-1}H(B_j)P_{-1}\Big), \quad M_{-1} := M_{>l} M_l M_{<l} = \prod_{j=1}^m\Big(I - \frac{2\eta}b P_{-1}H(B_j)P_{-1}\Big);$$
then we have
$$\begin{pmatrix} P_1 v_{k+1} \\ P_{-1}v_{k+1} \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0 & M_{>l} \end{pmatrix} \begin{pmatrix} I - \frac{2\eta}b P_1 H(B_l)P_1 & -\frac{2\eta}b P_1 H(B_l)P_{-1} \\ -\frac{2\eta}b P_{-1}H(B_l)P_1 & M_l \end{pmatrix} \begin{pmatrix} I & 0 \\ 0 & M_{<l} \end{pmatrix} \begin{pmatrix} P_1 v_k \\ P_{-1}v_k \end{pmatrix} = \begin{pmatrix} I - \frac{2\eta}b P_1 H(B_l)P_1 & -\frac{2\eta}b P_1 H(B_l)P_{-1}M_{<l} \\ -M_{>l}\frac{2\eta}b P_{-1}H(B_l)P_1 & M_{-1} \end{pmatrix}\begin{pmatrix} P_1 v_k \\ P_{-1}v_k \end{pmatrix}. \quad (33)$$
In the following we bound the norm of each entry of the above coefficient matrix. According to Lemma 5, the eigenvalues of $P_{-1}H(B_j)P_{-1}$ are upper bounded by $\lambda_2 + n\iota$. Thus the assumption $\eta < \frac b{\lambda_2 + 2n\iota}$ yields $\|I - \frac{2\eta}b P_{-1}H(B_j)P_{-1}\|_2 \le 1$, which further yields
$$\|M_{>l}\|_2 \le 1, \quad \|M_{<l}\|_2 \le 1. \quad (34)$$
On the other hand, notice that $P_1 x_i = 0$ for $i \neq 1$, thus
$$P_1 H(B_l)P_{-1} = P_1 \sum_{i\in B_l} x_i x_i^\top P_{-1} = P_1 x_1 x_1^\top P_{-1}, \qquad P_{-1}H(B_l)P_1 = P_{-1}\sum_{i\in B_l} x_i x_i^\top P_1 = P_{-1}x_1 x_1^\top P_1,$$
which yield
$$\max\{\|P_1 H(B_l)P_{-1}\|_2, \|P_{-1}H(B_l)P_1\|_2\} \le \|P_1 x_1\|_2 \cdot \|P_{-1}x_1\|_2 \le 2\sqrt n\iota, \quad (35)$$
where the last inequality is from Lemma 3 and $\lambda_1 = \|x_1\|_2^2 \le 1$. Eqs. (34) and (35) imply
$$\max\Big\{\Big\|\frac{2\eta}b P_1 H(B_l)P_{-1}M_{<l}\Big\|_2, \Big\|M_{>l}\frac{2\eta}b P_{-1}H(B_l)P_1\Big\|_2\Big\} \le \frac{4\eta\sqrt n\iota}b =: \xi(\eta). \quad (36)$$
Next, by $P_1 x_i = 0$ for $i \neq 1$ we have
$$P_1 H(B_l)P_1 = P_1\sum_{i\in B_l} x_i x_i^\top P_1 = P_1 x_1 x_1^\top P_1 = (P_1 x_1)(P_1 x_1)^\top.$$
Following the notation of the proof of Lemma 12 (Appendix C.9), where $\theta$, $\rho_{-1}$, $\rho_1$ give the eigendecomposition of the coefficient matrix in Eq. (42), Eqs. (41) and (42) yield
$$\begin{pmatrix}A_k\\B_k\end{pmatrix} \le \begin{pmatrix}q_{-1}&\xi\\\xi&q_1\end{pmatrix}^k\begin{pmatrix}A_0\\B_0\end{pmatrix} = \begin{pmatrix}\cos\theta&\sin\theta\\-\sin\theta&\cos\theta\end{pmatrix}\begin{pmatrix}\rho_{-1}^k&0\\0&\rho_1^k\end{pmatrix}\begin{pmatrix}\cos\theta&-\sin\theta\\\sin\theta&\cos\theta\end{pmatrix}\begin{pmatrix}A_0\\B_0\end{pmatrix} = \begin{pmatrix}\rho_{-1}^k+(\rho_1^k-\rho_{-1}^k)\sin^2\theta & (\rho_1^k-\rho_{-1}^k)\cos\theta\sin\theta\\(\rho_1^k-\rho_{-1}^k)\cos\theta\sin\theta & \rho_1^k-(\rho_1^k-\rho_{-1}^k)\sin^2\theta\end{pmatrix}\begin{pmatrix}A_0\\B_0\end{pmatrix}$$
$$= \begin{pmatrix}A_0\,\rho_{-1}^k + (\rho_1^k-\rho_{-1}^k)(A_0\sin\theta+B_0\cos\theta)\sin\theta\\ B_0\,\rho_1^k + (\rho_1^k-\rho_{-1}^k)(A_0\cos\theta-B_0\sin\theta)\sin\theta\end{pmatrix} \le \begin{pmatrix}A_0\,\rho_{-1}^k + (\rho_1^k-\rho_{-1}^k)\sqrt{A_0^2+B_0^2}\,\sin\theta\\ B_0\,\rho_1^k + (\rho_1^k-\rho_{-1}^k)\sqrt{A_0^2+B_0^2}\,\sin\theta\end{pmatrix} = \begin{pmatrix}A_0\,\rho_{-1}^k + (\rho_1^k-\rho_{-1}^k)\|Pv_0\|_2\sin\theta\\ B_0\,\rho_1^k + (\rho_1^k-\rho_{-1}^k)\|Pv_0\|_2\sin\theta\end{pmatrix}. \quad (43)$$
We claim the following inequalities hold by our assumptions:
$$0 < \rho_{-1} < 1 < \rho_1 \le q_1 + \xi, \quad (44a)$$
$$\rho_{-1}^{k_1}\|Pv_0\|_2 \le \tfrac\epsilon2\beta, \quad (44b)$$
$$\rho_1^{k_1}\|Pv_0\|_2\sin\theta \le \tfrac\epsilon2\beta, \quad (44c)$$
$$\xi\big(A_0 + \tfrac\epsilon2\beta_0\big) < (q_1 - 1)\beta_0. \quad (44d)$$
The verification of Eq. (44) is deferred. In the following we prove the conclusions using Eq. (44). We first bound $A_{k_1}$ using Eqs. (43) and (44):
$$A_{k_1} \le A_0\,\rho_{-1}^{k_1} + (\rho_1^{k_1} - \rho_{-1}^{k_1})\|Pv_0\|_2\sin\theta \le \|Pv_0\|_2\,\rho_{-1}^{k_1} + \rho_1^{k_1}\|Pv_0\|_2\sin\theta \le \tfrac\epsilon2\beta + \tfrac\epsilon2\beta = \epsilon\beta,$$
which justifies the first conclusion. In addition, we can obtain a uniform upper bound for $A_k$, $k = 0, 1, \dots, k_1$:
$$A_k \le A_0\,\rho_{-1}^k + (\rho_1^k - \rho_{-1}^k)\|Pv_0\|_2\sin\theta \le A_0 + \rho_1^{k_1}\|Pv_0\|_2\sin\theta \le A_0 + \tfrac\epsilon2\beta. \quad (45)$$
Next we bound $B_{k_1}$ using Eqs. (43) and (44):
$$B_{k_1} \le B_0\,\rho_1^{k_1} + (\rho_1^{k_1} - \rho_{-1}^{k_1})\|Pv_0\|_2\sin\theta \le \|Pv_0\|_2\,\rho_1^{k_1} + \rho_1^{k_1}\|Pv_0\|_2\sin\theta \le \|Pv_0\|_2\,\rho_1^{k_1} + \tfrac\epsilon2\beta,$$
which justifies the second conclusion. We proceed to derive the uniform lower bound for $B_k$, $k = 0, 1, \dots, k_1$, by induction. For $k = 0$, by assumption we have $B_0 \ge \beta_0$. Suppose $B_{k-1} \ge \beta_0$; then by Eqs.
(40), (44) and (45), we have
$$B_k \ge q_1 B_{k-1} - \xi A_{k-1} \ge q_1\beta_0 - \xi\big(A_0 + \tfrac\epsilon2\beta\big) \ge q_1\beta_0 - \xi\big(A_0 + \tfrac\epsilon2\beta_0\big) \ge \beta_0,$$
which justifies the third conclusion.

Verification of Eq. (44). From Eq. (42) and the Gershgorin circle theorem we have
$$q_1 - \xi \le \rho_1 \le q_1 + \xi, \qquad q_{-1} - \xi \le \rho_{-1} \le q_{-1} + \xi. \quad (46)$$
Moreover, reformatting Eq. (42) as
$$\begin{pmatrix} q_{-1} & \xi \\ \xi & q_1 \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}\begin{pmatrix} \rho_{-1} & 0 \\ 0 & \rho_1 \end{pmatrix}\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} = \begin{pmatrix} \rho_{-1} + (\rho_1 - \rho_{-1})\sin^2\theta & (\rho_1 - \rho_{-1})\cos\theta\sin\theta \\ (\rho_1 - \rho_{-1})\cos\theta\sin\theta & \rho_1 - (\rho_1 - \rho_{-1})\sin^2\theta \end{pmatrix},$$
we then have
$$\frac{\xi}{q_1 - q_{-1}} = \frac{(\rho_1 - \rho_{-1})\cos\theta\sin\theta}{(\rho_1 - \rho_{-1})(1 - 2\sin^2\theta)} = \frac12\tan 2\theta. \quad (47)$$
For Eq. (44a), using Eq. (46) it suffices to show
$$0 < q_{-1} - \xi, \quad (48a)$$
$$q_{-1} + \xi < 1, \quad (48b)$$
$$1 < q_1 - \xi. \quad (48c)$$
Notice the definitions of $q_1$, $q_{-1}$ and $\xi$ are given in Eq. (39). Firstly, Eq. (48a) holds trivially when $\sqrt n > 4/3$, since then $\frac{3n\eta\iota}b \ge \xi$. Secondly, for Eq. (48b), noticing that $\xi = \frac{4\eta\sqrt n\iota}b \le \frac{n\eta\iota}b$ for $\sqrt n \ge 4$, it suffices to show
$$\begin{cases}\frac{4n\iota}b\eta - 1 < 1 - \frac{2(\lambda_2+n\iota)}b\eta < 1 - \frac{4n\iota}b\eta \\[2pt] \frac{4n\iota}b\eta - 1 < 1 - \frac{2(\lambda_n-n\iota)}b\eta < 1 - \frac{4n\iota}b\eta\end{cases} \Leftrightarrow \begin{cases}\frac{2\lambda_2 - 2n\iota}b\eta > 0\\ \frac{2\lambda_2 + 6n\iota}b\eta < 2\\ \frac{2\lambda_n - 6n\iota}b\eta > 0\\ \frac{2\lambda_n + 2n\iota}b\eta < 2\end{cases} \Leftarrow \begin{cases}\eta > 0\\ \lambda_2 - n\iota > 0\\ \lambda_n - 3n\iota > 0\\ \eta < \frac b{\lambda_2 + 3n\iota}\\ \eta < \frac b{\lambda_n + n\iota}\end{cases} \Leftarrow \begin{cases}3n\iota < \lambda_n\\ 0 < \eta < \frac b{\lambda_2 + 3n\iota}\end{cases}$$
which are given in the assumptions. Thirdly, for Eq. (48c) it suffices to show
$$\frac{2\eta\lambda_1}b\|P_1\bar x_1\|_2^2 - 1 - \frac{4\eta\sqrt n\iota}b > 1 \Leftarrow \frac{2\lambda_1(1 - 4n\iota^2)}b\eta - \frac{4\sqrt n\iota}b\eta > 2 \ \ (\text{by Lemma 3}) \Leftarrow \eta > \frac b{\lambda_1(1 - 4n\iota^2) - 2\sqrt n\iota} \Leftarrow \eta > \frac b{\lambda_1 - 3\sqrt n\iota} \ \ (\text{since } n\iota < 1),$$
which is given in the assumptions. For Eq. (44b), it suffices to set
$$k_1 = 1 + \frac{\log\frac{0.5\epsilon\beta}{\|Pv_0\|_2}}{\log\rho_{-1}} = O\Big(\log\frac1{\epsilon\beta}\Big),$$
as given in the assumptions. For Eq. (44c), using Eq. (44b) it suffices to show
$$\sin\theta \le \Big(\frac{\rho_{-1}}{\rho_1}\Big)^{k_1} = \frac{\rho_{-1}}{\rho_1}\cdot\Big(\frac{0.5\epsilon\beta}{\|Pv_0\|_2}\Big)^{1 - \frac{\log\rho_1}{\log\rho_{-1}}} \Leftarrow \sin\theta \le \frac{q_{-1} - \xi}{q_1 + \xi}\cdot\Big(\frac{0.5\epsilon\beta}{\|Pv_0\|_2}\Big)^{1 - \frac{\log(q_1+\xi)}{\log(q_{-1}-\xi)}} \Leftarrow \xi \le 0.9(q_1 - q_{-1})\cdot\frac{q_{-1} - \xi}{q_1 + \xi}\cdot\Big(\frac{0.5\epsilon\beta}{\|Pv_0\|_2}\Big)^{1 - \frac{\log(q_1+\xi)}{\log(q_{-1}-\xi)}} \ (\text{by Eq. (47)}) \Leftarrow \sqrt n\iota \le \mathrm{poly}(\epsilon\beta) \ (\text{by Eq. (39)}).$$
For Eq. (44d), it suffices to show
$$\xi \le \frac{(q_1 - 1)\beta_0}{A_0 + 0.5\,\epsilon\beta_0} \Leftarrow \sqrt n\iota \le O(1) \ (\text{by Eq. (39)}).$$

C.10 PROOF OF LEMMA 13

Proof of Lemma 13. Let
$$\xi := \xi(\eta') = \frac{4\eta'\sqrt n\iota}b, \quad q_1 := q_1(\eta') = 1 - \frac{2\eta'\lambda_1}b\|P_1\bar x_1\|_2^2, \quad q_{-1} := q_{-1}(\eta') = \max\Big\{\Big|1 - \frac{2\eta'}b(\lambda_2 + n\iota)\Big| + \frac{3n\eta'\iota}b,\ \Big|1 - \frac{2\eta'}b(\lambda_n - n\iota)\Big| + \frac{3n\eta'\iota}b\Big\}.$$
Then for $k_1 < k \le k_2$, Lemma 11 gives us
$$\begin{pmatrix} A_k \\ B_k \end{pmatrix} \le \begin{pmatrix} q_{-1} & \xi \\ \xi & q_1 \end{pmatrix}\begin{pmatrix} A_{k-1} \\ B_{k-1} \end{pmatrix},$$
where "$\le$" means "entrywise smaller than". Denote
$$B := \|Pv_0\|_2\,\rho_1^{k_1} + \tfrac\epsilon2\beta = \mathrm{poly}\Big(\frac1{\epsilon\beta}\Big). \quad (49)$$

For the 2-D example in Appendix D.1, we initialize the algorithms from $w_0 = (0.6, 0.6)^\top$. Two learning rate regimes are considered. In the small learning rate regime, the learning rate is $\eta_k = 0.1/\kappa$ for $k = 1, \dots, 800$. In the moderate learning rate regime, the learning rate is $\eta_k = 1.1/\kappa$ for $k = 1, \dots, 100$, and $\eta_k = 0.1/\kappa$ for $k = 101, \dots, 800$. For SGD, the mini-batch size is 1.

For the synthetic linear regression experiment in Appendix D.2, we randomly draw $n = 100$ samples from the $d$-dimensional space as described in Section 4, where $\zeta \sim U([0.5, 1])$. We initialize the algorithms from zero. We consider two learning rate regimes. The small learning rate scheme is specified by $\eta_k = 0.2$ for $k = 1, \dots, 10^4$. The moderate learning rate scheme is specified by $\eta_k = 1.05$ for $k = 1, \dots, 2\times10^3$, and $\eta_k = 0.1$ for $k = 1 + 2\times10^3, \dots, 3\times10^3$. For SGD, the mini-batch size is 1.
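For reference, here is a condensed sketch (ours) of the synthetic experiment just described; the epoch counts are scaled down from the paper's schedule to keep the run light, and the final Rayleigh quotients of the (projected) error directions are compared against the extreme eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 10_000, 100

w_star = rng.normal(0.0, np.sqrt(0.1), size=d)
zeta = rng.uniform(0.5, 1.0, size=n)
Xbar = rng.standard_normal((n, d))
Xbar /= np.linalg.norm(Xbar, axis=1, keepdims=True)
X = zeta[:, None] * Xbar                         # rows are the samples x_i
y = X @ w_star                                   # noiseless labels

def grad(w, idx):
    # Gradient of (1/|idx|) * sum_{i in idx} (x_i^T w - y_i)^2.
    r = X[idx] @ w - y[idx]
    return 2.0 * X[idx].T @ r / len(idx)

def run(use_sgd, schedule):
    w = np.zeros(d)
    for eta in schedule:
        if use_sgd:
            for i in rng.permutation(n):         # batch-size-1 SGD without replacement
                w -= eta * grad(w, [i])
        else:
            w -= eta * grad(w, np.arange(n))     # full-batch GD
    return w

schedule = [1.05] * 100 + [0.1] * 50             # paper: 2000 + 1000 epochs
w_gd, w_sgd = run(False, schedule), run(True, schedule)

gammas = np.linalg.eigvalsh(X @ X.T)             # nonzero spectrum of sum_i x_i x_i^T
def rayleigh(w):                                 # Rayleigh quotient of P(w - w*) w.r.t. X X^T
    v = w - w_star
    Pv = X.T @ np.linalg.solve(X @ X.T, X @ v)
    return np.sum((X @ v) ** 2) / (Pv @ Pv)

print("gamma_n, gamma_1:", gammas[0], gammas[-1])
print("Rayleigh quotient, GD :", rayleigh(w_gd))
print("Rayleigh quotient, SGD:", rayleigh(w_sgd))
```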

D.3 NEURAL NETWORK ON A SUBSET OF FASHIONMNIST

This part corresponds to Figure 2(b) and Figure 3.

Model. We use a LeNet-like convolutional network: input ⇒ conv1 ⇒ ReLU ⇒ max_pool ⇒ conv2 ⇒ ReLU ⇒ max_pool ⇒ fc1 ⇒ ReLU ⇒ fc2 ⇒ ReLU ⇒ linear ⇒ output. The first convolutional layer uses 5 × 5 kernels with 10 channels and no padding, and the second convolutional layer uses 5 × 5 kernels with 16 channels and no padding. The number of hidden units between the two fully connected layers is 60.

Dataset. We use FashionMNIST (https://github.com/zalandoresearch/fashion-mnist). We randomly choose 2,000 of the original test data as our training set, and use the 60,000 original training data as our test set. Thus we have 2,000 training data and 60,000 test data. We scale the image data to [0, 1].

Algorithms. We randomly initialize the algorithms from a Gaussian distribution with zero mean and standard deviation 0.02. We consider two learning rate regimes. The small learning rate scheme is specified by $\eta_k = 10^{-3}$ for $k = 1, \dots, 10^4$. The moderate learning rate scheme is specified by $\eta_k = 10^{-2}$ for $k = 1, \dots, 2.5\times10^3$, and $\eta_k = 10^{-3}$ for $k = 1 + 2.5\times10^3, \dots, 10^4$. For SGD, the mini-batch size is 25. For both GD and SGD, the weight decay parameter is set to 0.002.

Paired t-test for Figure 3. We also conduct a statistical test to show that SGD with moderate learning rate is significantly better than the other baselines. Recall that the experiments are repeated for 10 runs; at each run, we first fix a random seed, then run the four algorithms (GD/SGD with small/moderate learning rate) under the same seed. In Table 1, we report the complete results from the 10 runs. Running a paired t-test at the 5% significance level, we find that SGD with moderate learning rate is significantly better than GD with moderate learning rate (p-value = 0.0043). Similarly, SGD with moderate learning rate is significantly better than SGD with small learning rate (p-value = 0.012), and also significantly better than GD with small learning rate (p-value = 0.0095).
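A PyTorch sketch of the architecture described above (ours, not the authors' code); the width of fc1 is our assumption, since the text pins down only the 60 hidden units between fc1 and fc2.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5),   # 28x28 -> 24x24, 10 channels, no padding
            nn.ReLU(),
            nn.MaxPool2d(2),                   # -> 12x12
            nn.Conv2d(10, 16, kernel_size=5),  # -> 8x8, 16 channels, no padding
            nn.ReLU(),
            nn.MaxPool2d(2),                   # -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 4 * 4, 120), nn.ReLU(),   # fc1 (width assumed)
            nn.Linear(120, 60), nn.ReLU(),           # fc2: 60 hidden units, as stated
            nn.Linear(60, num_classes),              # final linear layer
        )

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

model = SmallCNN()
for p in model.parameters():                   # Gaussian init with std 0.02, as in the text
    nn.init.normal_(p, mean=0.0, std=0.02)
opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=0.002)  # moderate-LR phase
```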



In this paper we focus on SGD without replacement; nonetheless, our results and techniques readily extend to SGD with replacement as well. This choice is for conciseness of presentation. Our results can also be easily generalized to linear regression with well-specified noise, i.e., noise that is independent of the feature vector. For two sequences $\{x_n \ge 0\}$ and $\{y_n \ge 0\}$: $x_n = O(y_n)$ if there exist constants $C > 0$ and $N$ such that $x_n \le Cy_n$ for every $n \ge N$; $x_n = \Theta(y_n)$ if $x_n = O(y_n)$ and $y_n = O(x_n)$; $x_n = o(y_n)$ if for every $\epsilon > 0$ there exists $N(\epsilon) > 0$ such that $x_n \le \epsilon y_n$ for every $n \ge N(\epsilon)$; $x_n = \mathrm{poly}(y_n)$ if there exists a large absolute constant $D > 0$ such that $x_n = \Theta(y_n^D)$. Finally, the assumption that $w_0$ is away from zero holds with probability 1 if $w_0$ is initialized randomly and follows, e.g., a Gaussian distribution.



Figure 1: Illustration of the 2-D example studied in Section 3. Here κ = 4 and w0 = (0.6, 0.6). (a): Small learning rate regime. The small learning rate is 0.1/κ. In this regime SGD and GD behave similarly and both converge along e2. (b): Moderate learning rate regime. The initial moderate learning rate is η = 1.1/κ and the decayed learning rate is η′ = 0.1/κ. In this regime GD converges along e2 but SGD converges along e1, the larger eigenvalue direction of the data matrix. Please refer to Section 3 for further discussion.

Next we introduce mini-batch stochastic gradient descent (SGD). Let b be the batch size. For simplicity, suppose n = mb for an integer m (the number of mini-batches). At each epoch, SGD first randomly partitions the training set into m disjoint mini-batches of size b, and then sequentially performs m updates using the stochastic gradients calculated over the m mini-batches. Specifically, at the k-th epoch, let the mini-batch index sets be $B_1^k, B_2^k, \dots, B_m^k$, where $|B_j^k| = b$ and $\bigcup_{j=1}^m B_j^k = \{1, 2, \dots, n\}$; then SGD takes m updates as follows:
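As a minimal illustrative sketch (ours) of the epoch structure just described, applied to the squared loss:

```python
import numpy as np

def sgd_epoch(w, X, y, eta, b, rng):
    """One epoch of mini-batch SGD without replacement on the squared loss.

    The index set {1, ..., n} is randomly split into n/b disjoint mini-batches,
    and one gradient step is taken per mini-batch, as described above.
    """
    n = len(y)
    perm = rng.permutation(n)
    for j in range(n // b):
        B = perm[j * b:(j + 1) * b]                # mini-batch index set B_j^k
        r = X[B] @ w - y[B]                        # residuals on the mini-batch
        w = w - eta * (2.0 / b) * (X[B].T @ r)     # stochastic gradient step
    return w
```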

Then the test loss is
$$L_D(w) = \mathbb E_{(x,y)\sim\mathcal D}\big[(w - w^*)^\top x x^\top (w - w^*)\big] = \mu\|w - w^*\|_2^2, \quad \text{where } \mu = \mathbb E[\zeta^2]/d.$$
For an i.i.d. generated training set $S = \{(x_i, y_i)\}_{i=1}^n$, the training loss and the individual losses are
$$L_S(w) = \frac1n\sum_{i=1}^n \ell_i(w), \qquad \ell_i(w) = (x_i^\top w - y_i)^2.$$
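The identity $\mathbb E[xx^\top] = (\mathbb E[\zeta^2]/d)\,I$ behind $\mu$ can be checked by simulation; a quick sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(6)
d, N = 100, 20_000
w_star, w = rng.standard_normal(d), rng.standard_normal(d)

# Draw x = zeta * xbar with xbar uniform on the sphere and zeta ~ U([0.5, 1]).
xbar = rng.standard_normal((N, d))
xbar /= np.linalg.norm(xbar, axis=1, keepdims=True)
zeta = rng.uniform(0.5, 1.0, size=N)
X = zeta[:, None] * xbar

v = w - w_star
mc = np.mean((X @ v) ** 2)          # Monte Carlo estimate of E[(x^T (w - w*))^2]
mu = np.mean(zeta**2) / d
print(mc, "vs", mu * (v @ v))       # should agree up to sampling error
```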

Figure 2: Comparison of the (relative) Rayleigh quotients. (a): A linear regression example. We randomly draw 100 samples from a 10,000-dimensional space as described in Section 4, where ζ ∼ U([0.5, 1]). The small learning rate scheme is specified by $(\eta', k_2) = (0.2, 10^4)$, and the moderate learning rate scheme is specified by $(\eta, \eta', k_1, k_2) = (1.05, 0.1, 2\times10^3, 3\times10^3)$. Numerical results show the Rayleigh quotient converges to its maximum for SGD with moderate learning rate, and converges to its minimum for GD and SGD with small learning rate, which verifies Theorems 1, 2 and 3. (b): A neural network example. The plots are averaged over 10 runs. We randomly draw 2,000 samples from FashionMNIST as the training set. The model is a 5-layer convolutional neural network. The small learning rate scheme is specified by $(\eta', k_2) = (10^{-3}, 10^4)$, and the moderate learning rate scheme is specified by $(\eta, \eta', k_1, k_2) = (10^{-2}, 10^{-3}, 2.5\times10^3, 10^4)$. Since the neural network loss is non-convex, we compare the relative Rayleigh quotient of the concerned algorithms, i.e., the Rayleigh quotient of the convergence directions divided by the maximum absolute eigenvalue of the Hessian (see Appendix D.3).

B PROOFS OF THE THEOREMS IN MAIN TEXT

B.1 THE DIRECTIONAL BIAS OF SGD WITH MODERATE LEARNING RATE

Reloading notation. Let $\pi_k := \{B_1^k, \dots, B_m^k\}$ be a randomly chosen uniform m-partition of $[n]$, where $n = mb$. Then the SGD iterates at the $k$-th epoch can be formulated as:

Let $\pi := \{B_1, \dots, B_m\}$ be a uniform m-partition of the index set $[n]$, where $n = mb$. Consider the following $d \times d$ matrix

D.2 LINEAR REGRESSION ON SYNTHETIC DATA

This part corresponds to Section 4 and Figure 2(a). The model is an overparameterized linear model, with $d = 10^4$ and $n = 100$. The true parameter $w^*$ is randomly drawn from a $d$-dimensional Gaussian distribution, $\mathcal N(0, 0.1\cdot I_{d\times d})$.

ACKNOWLEDGEMENT

We would like to thank the anonymous reviewers for their helpful comments. QG is partially supported by the National Science Foundation CAREER Award 1906169, IIS-2008981 and Salesforce Deep Learning Research Award. DZ is supported by the Bloomberg Data Science Ph.D. Fellowship. VB is supported in part by NSF CAREER grant 1652257, ONR Award N00014-18-1-2364 and the Lifelong Learning Machines program from DARPA/MTO. JW is supported by ONR Award N00014-18-1-2364. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.


• The hypothesis class is $\mathcal H_S = \{u \in \mathbb R^d : u^{(i)} = u_0^{(i)}, \text{ for } i = n+1, \dots, d\}$.
• The $\alpha$-level set is $U = \{u \in \mathcal H_S : L_S(u) = \alpha\}$.
• For $u \in \mathcal H_S$, the estimation error is $\Delta(u) = \mu\sum_{i=1}^n \big(u^{(i)}\big)^2$. Moreover, $\Delta^* = \mu n\alpha/\gamma_1$.

Proof. See Section C.11.

The following lemma solves the GD iterates in Eq. (14).

Lemma 15. For $t = 0, \dots, T$, the GD iterates in Eq. (14) admit an explicit coordinate-wise closed form.

Proof. This follows by directly solving Eq. (14), where $\Gamma$ is diagonal.

Theorem 7 (Theorem 2, formal version). Suppose $\lambda_n + 2n\iota < \lambda_{n-1}$, and suppose $u_0$ is away from $0$. Consider the GD iterates given by Eq. (14) with the stated learning rate scheme. Then for $\epsilon \in (0, 1)$, if $k \ge O(\log\frac1\epsilon)$, the stated directional bound holds, where the second inequality follows from $\gamma_1 < \lambda_1 + n\iota$ by Lemma 4. Furthermore, since the per-step factor $f(\eta)$ is increasing in $\eta$, let $q = \max_{\eta < \frac n{2\lambda_1 + 2n\iota}} f(\eta)$; then $q < 1$ by our assumption on the learning rate, and the claimed rate follows from Lemma 15.

Returning to the proof of Lemma 11: from the last display we know that $\|P_1x_1\|_2^2$ is the only nonzero eigenvalue of the rank-1 matrix $P_1H(B_l)P_1$, and the corresponding eigenspace is the column space of $P_1$. Therefore $1 - \frac{2\eta}b\|P_1x_1\|_2^2$ is an eigenvalue of the matrix $I - \frac{2\eta}b P_1H(B_l)P_1$, whose eigenspace is again the column space of $P_1$, which implies Eq. (37). Finally, according to Lemma 10, restricted to the column space of $P_{-1}$, the eigenvalues of $M_{-1}^\top M_{-1}$ are upper bounded by $(q_{-1}(\eta))^2$, which implies Eq. (38); note that $q_{-1}(\eta) < 1$ by Lemma 10. Combining Eq. (33) with Eqs. (36), (37) and (38), and letting $\xi$, $q_1$ and $q_{-1}$ be as defined in Eq. (39), completes the proof.

C.9 PROOF OF LEMMA 12

Proof of Lemma 12. Let $\xi := \xi(\eta)$, $q_1 := q_1(\eta)$ and $q_{-1} := q_{-1}(\eta)$ be as in Eq. (39). Then for $0 < k \le k_1$, Lemma 11 gives us
$$\begin{pmatrix} A_k \\ B_k \end{pmatrix} \le \begin{pmatrix} q_{-1} & \xi \\ \xi & q_1 \end{pmatrix}\begin{pmatrix} A_{k-1} \\ B_{k-1} \end{pmatrix},$$
where "$\le$" means "entrywise smaller than". Let $\theta$, $\rho_{-1}$, $\rho_1$ determine the eigendecomposition of the coefficient matrix, i.e., Eq. (42); the main estimates are Eqs. (43)-(45) together with the verification of Eq. (44).

For the iterations after the decay, we claim the inequalities in Eq. (52) hold by our assumptions; the verification of Eq. (52) is deferred. We prove the main conclusions by induction. Clearly the conclusions are true for $k = k_1$. Suppose they are also true for $k_1, \dots, k-1$. Then the induction assumptions give the needed entrywise bounds, where the last inequality is due to $B \ge B_{k_1} \ge \beta_0 > \epsilon\beta$. Then by Eq. (50) we obtain the first conclusion (by Eq. (52a)), and also by Eq. (50) the second conclusion (by Eqs. (53) and (54)).

Verification of Eq. (52). Notice the definitions of $q_1$, $q_{-1}$ and $\xi$ are given in Eq. (39). Recall that $q_{-1} < 1$ is already justified by the choice of learning rate $\eta' < \frac b{\lambda_2 + 3n\iota}$ (e.g., see Lemma 10); thus for Eq. (52a) it suffices to show the corresponding condition on $\eta'$, which is given in the assumptions. For Eq. (52b), it suffices to show a condition that is implied by $\sqrt n\iota \le \mathrm{poly}(\epsilon\beta)$. For Eq. (52c), it suffices to show the stated bound, which holds by Eq. (51). This completes the proof.

C.11 PROOF OF LEMMA 14

Proof of Lemma 14. For the empirical and population losses, the computations parallel those in the proof of Lemma 8. For the hypothesis class $\mathcal H_S = \{w \in \mathbb R^d : P_\perp w = P_\perp w_0\}$, note that $P_\perp G = \operatorname{diag}(0, \dots, 0, 1, \dots, 1)$. Applying $w - w^* = Gu$ and noticing $w_0 - w^* = Gu_0$, we obtain the stated form of $\mathcal H_S$ in the $u$-coordinates. For the level set, we only need to note that $L_S^* = \inf_{u\in\mathcal H_S} L_S(u) = 0$. As for the estimation error, we note that the infimum is attained when, e.g., $u^{(1)} = \pm\sqrt{n\alpha/\gamma_1}$ and $u^{(i)} = 0$ for $i = 2, \dots, n$.

D DETAILS OF THE EXPERIMENTS

In this section, we describe the details of our experiments.

D.1 2-D EXAMPLE

This part corresponds to Section 3 and Figure 1. The two training data points, and the corresponding individual losses, are as given in Section 3.

Relative Rayleigh quotient. We discuss the relative Rayleigh quotients calculated in Figure 2(b). Unlike the linear regression model, where the Hessian is constant, for neural networks the loss function is non-convex: not only are there multiple local minima, but the Hessian also varies from point to point. Therefore, a direct comparison of the Rayleigh quotients of the iterates of different algorithms makes little sense. Instead, we consider the relative Rayleigh quotient, i.e., the Rayleigh quotient normalized by the maximum absolute eigenvalue of the Hessian at that point. Mathematically, the relative Rayleigh quotient is defined as
$$\frac{g^\top \nabla^2 L(w)\, g}{\|\nabla^2 L(w)\|_2}, \qquad g = \frac{\nabla L(w)}{\|\nabla L(w)\|_2},$$
where $\nabla L(w)/\|\nabla L(w)\|_2$ is the convergence direction of the gradient methods, and $\|\nabla^2 L(w)\|_2$ is the operator norm of the Hessian, i.e., its maximum absolute eigenvalue. Note that it is computationally hard in practice to project a parameter onto the data manifold, thus we use the vanilla convergence direction instead of the projected one in the above definition. Since our goal is to compare the relative Rayleigh quotient between different algorithms, this simplification does not affect our conclusions. We obtain the maximum absolute eigenvalue of the Hessian by running the power method for 5 iterations.

Additional experiments: GD and SGD with increasing learning rates. We conduct numerical experiments training neural networks on a subset of the FashionMNIST dataset for SGD and GD with learning rates $\eta \in \{0.001, 0.01, 0.02, 0.04, 0.08, 0.16\}$. The test accuracy results are displayed in Figure 4. Several conclusions can be drawn from the plots. (1) For learning rates above $\eta = 0.08$, SGD cannot converge (and thus only gives about 10% test accuracy). SGD generalizes best with a moderate learning rate $\eta = 0.01$. (2) GD fails to converge when $\eta = 0.16$. Also, as the learning rate increases, the test accuracy of GD first increases and then decreases; but even at its peak ($\eta = 0.01$), GD performs worse than SGD with a moderate learning rate.
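A sketch (ours) of how the relative Rayleigh quotient above can be computed with Hessian-vector products; `loss_fn` and `params` are placeholder names, and in practice `loss_fn` would evaluate the full-batch training loss at the final iterate.

```python
import torch

def relative_rayleigh_quotient(loss_fn, params, n_iter: int = 5):
    """Rayleigh quotient of the normalized gradient direction, divided by the
    top absolute Hessian eigenvalue (estimated by power iteration on HVPs).

    loss_fn: zero-argument callable returning the scalar training loss.
    params:  list of parameter tensors with requires_grad=True.
    """
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_g = torch.cat([g.reshape(-1) for g in grads])

    def hvp(vec):
        # Hessian-vector product via differentiating <grad, vec>.
        prod = (flat_g * vec).sum()
        hv = torch.autograd.grad(prod, params, retain_graph=True)
        return torch.cat([h.reshape(-1) for h in hv])

    # Power method for the top |eigenvalue| of the Hessian (5 iterations, as in the paper).
    v = torch.randn_like(flat_g)
    for _ in range(n_iter):
        v = hvp(v)
        v = v / v.norm()
    top_abs_eig = (v * hvp(v)).sum().abs()

    g_dir = (flat_g / flat_g.norm()).detach()    # vanilla (unprojected) convergence direction
    rayleigh = (g_dir * hvp(g_dir)).sum()
    return (rayleigh / top_abs_eig).item()
```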

