STOCHASTIC NORMALIZED GRADIENT DESCENT WITH MOMENTUM FOR LARGE BATCH TRAINING

Anonymous authors. Paper under double-blind review.

Abstract

Stochastic gradient descent (SGD) and its variants have been the dominant optimization methods in machine learning. Compared with small batch training, SGD with large batch training can better utilize the computational power of current multi-core systems such as GPUs and can reduce the number of communication rounds in distributed training. Hence, SGD with large batch training has attracted more and more attention. However, existing empirical results show that large batch training typically leads to a drop in generalization accuracy. As a result, large batch training has also become a challenging topic. In this paper, we propose a novel method, called stochastic normalized gradient descent with momentum (SNGM), for large batch training. We theoretically prove that, compared to momentum SGD (MSGD), which is one of the most widely used variants of SGD, SNGM can adopt a larger batch size to converge to an ε-stationary point with the same computation complexity (total number of gradient computations). Empirical results on deep learning also show that SNGM can achieve state-of-the-art accuracy with a large batch size.

1. INTRODUCTION

In machine learning, we often need to solve the following empirical risk minimization problem:

    min_{w∈R^d} F(w) = (1/n) Σ_{i=1}^n f_i(w),    (1)

where w ∈ R^d denotes the model parameter, n denotes the number of training samples, and f_i(w) denotes the loss on the i-th training sample. The problem in (1) can be used to formulate a broad family of machine learning models, such as logistic regression and deep learning models. Stochastic gradient descent (SGD) Robbins & Monro (1951) and its variants have been the dominant optimization methods for solving (1). SGD and its variants are iterative methods. In the t-th iteration, these methods randomly choose a subset (also called a mini-batch) I_t ⊂ {1, 2, ..., n} and compute the stochastic mini-batch gradient (1/B) Σ_{i∈I_t} ∇f_i(w_t) for updating the model parameter, where B = |I_t| is the batch size. Existing works Li et al. (2014b); Yu et al. (2019a) have proved that with a batch size of B, SGD and its momentum variant, called momentum SGD (MSGD), achieve an O(1/√(TB)) convergence rate for smooth non-convex problems, where T is the total number of model parameter updates.

With the popularity of multi-core systems and the easy implementation of data parallelism, many distributed variants of SGD have been proposed, including parallel SGD Li et al. (2014a), decentralized SGD Lian et al. (2017), local SGD Yu et al. (2019b); Lin et al. (2020), local momentum SGD Yu et al. (2019a), and so on. Theoretical results show that all these methods can achieve an O(1/√(TKb)) convergence rate for smooth non-convex problems. Here, b is the batch size on each worker and K is the number of workers. By setting Kb = B, we can observe that the convergence rate of these distributed methods is consistent with that of the sequential methods. In distributed settings, a small number of model parameter updates T implies low synchronization and communication cost. Hence, a small T can further speed up the training process. Based on the O(1/√(TKb)) convergence rate, we can find that if we adopt a larger b, then T will be smaller. Hence, large batch training can reduce the number of communication rounds in distributed training. Another benefit of adopting large batch training is to better utilize the computational power of current multi-core systems like GPUs.

In this paper, we propose a novel method, called stochastic normalized gradient descent with momentum (SNGM), for large batch training, which combines the normalized gradient and Polyak's momentum technique Polyak (1964) together. The main contributions of this paper are outlined as follows:

• We theoretically prove that, compared to MSGD, which is one of the most widely used variants of SGD, SNGM can adopt a larger batch size to converge to an ε-stationary point with the same computation complexity (total number of gradient computations). That is to say, SNGM needs a smaller number of parameter updates, and hence has faster training speed than MSGD.

• For a relaxed smooth objective function (see Definition 2), we theoretically show that SNGM can achieve an ε-stationary point with a computation complexity of O(1/ε^4). To the best of our knowledge, this is the first work that analyzes the computation complexity of stochastic optimization methods for a relaxed smooth objective function.

• Empirical results on deep learning also show that SNGM can achieve state-of-the-art accuracy with a large batch size.

2. PRELIMINARIES

In this paper, we use ||·|| to denote the Euclidean norm and use w_* to denote one of the optimal solutions of (1), i.e., w_* ∈ argmin_w F(w). We call w an ε-stationary point of F(w) if ||∇F(w)|| ≤ ε. The computation complexity of an algorithm is the total number of its gradient computations. We also give the following assumption and definitions:

Assumption 1 (σ-bounded variance) For any w, E||∇f_i(w) - ∇F(w)||^2 ≤ σ^2 (σ > 0).

Definition 1 (Smoothness) A function φ(·) is L-smooth (L > 0) if for any u, w,

    φ(u) ≤ φ(w) + ∇φ(w)^T (u - w) + (L/2) ||u - w||^2.

L is called the smoothness constant in this paper.

Definition 2 (Relaxed smoothness Zhang et al. (2020)) A function φ(·) is (L, λ)-smooth (L ≥ 0, λ ≥ 0) if φ(·) is twice differentiable and for any w,

    ||∇^2 φ(w)|| ≤ L + λ ||∇φ(w)||,

where ∇^2 φ(w) denotes the Hessian matrix of φ(w).

From the above definition, we can observe that if a function φ(w) is (L, 0)-smooth, then it is a classical L-smooth function Nesterov (2004). For an (L, λ)-smooth function, we have the following property Zhang et al. (2020):

Lemma 1 If φ(·) is (L, λ)-smooth, then for any u, w, α such that ||u - w|| ≤ α, we have

    ||∇φ(u)|| ≤ (Lα + ||∇φ(w)||) e^{λα}.

All the proofs of lemmas and corollaries of this paper are put in the supplementary material.
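As a concrete illustration of Definition 2 (our own toy example, not from the paper), the scalar function φ(w) = e^w satisfies relaxed smoothness with (L, λ) = (0, 1), since φ''(w) = φ'(w) = e^w, yet it is not L-smooth for any fixed L because its second derivative is unbounded. A quick numeric check:

```python
import numpy as np

# phi(w) = exp(w): phi'(w) = phi''(w) = exp(w).
# Since |phi''(w)| = 0 + 1 * |phi'(w)|, phi is (L, lambda) = (0, 1)-smooth
# in the relaxed sense, but no uniform L bounds phi'' on the whole line.
ws = np.linspace(-10.0, 10.0, 2001)
grad = np.exp(ws)      # phi'(w)
hess = np.exp(ws)      # phi''(w)

L, lam = 0.0, 1.0
relaxed_ok = np.all(hess <= L + lam * np.abs(grad) + 1e-12)
classical_fails = hess.max() > 1e4   # Hessian exceeds any fixed candidate bound

print(relaxed_ok, classical_fails)
```

The same style of check can be run on any candidate (L, λ) pair over a grid of points.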

3. RELATIONSHIP BETWEEN SMOOTHNESS CONSTANT AND BATCH SIZE

In this section, we analyze the convergence property of MSGD in depth to find the relationship between the smoothness constant and the batch size, which provides an insightful hint for designing our new method SNGM. MSGD can be written as follows:

    v_{t+1} = β v_t + g_t,    (2)
    w_{t+1} = w_t - η v_{t+1},    (3)

where g_t = (1/B) Σ_{i∈I_t} ∇f_i(w_t) is a stochastic mini-batch gradient with a batch size of B, and v_{t+1} is the Polyak's momentum Polyak (1964).

We aim to find how large the batch size can be without loss of performance. The convergence rate of MSGD with batch size B for L-smooth functions can be derived from the work in Yu et al. (2019a). That is to say, when η ≤ (1-β)^2/((1+β)L), we obtain

    (1/T) Σ_{t=0}^{T-1} E||∇F(w_t)||^2 ≤ 2(1-β)[F(w_0) - F(w_*)]/(ηT) + Lησ^2/((1-β)^2 B) + 4L^2 η^2 σ^2/(1-β)^2    (4)
                                       = O(B/(ηC)) + O(η/B) + O(η^2),

where C = TB denotes the computation complexity (total number of gradient computations). According to Corollary 1 in Yu et al. (2019a), we set η = √B/√T = B/√C and obtain

    (1/T) Σ_{t=0}^{T-1} E||∇F(w_t)||^2 ≤ O(1/√C) + O(B^2/C).    (5)

Algorithm 1 SNGM
    Initialization: η > 0, β ∈ [0, 1), B > 0, T > 0, u_0 = 0, w_0;
    for t = 0, 1, ..., T-1 do
        Randomly choose B function indices, denoted as I_t;
        Compute a mini-batch gradient g_t = (1/B) Σ_{i∈I_t} ∇f_i(w_t);
        u_{t+1} = β u_t + g_t/||g_t||;
        w_{t+1} = w_t - η u_{t+1};
    end for

Since η ≤ (1-β)^2/((1+β)L) is necessary for (4), we first obtain that B ≤ O(√C/L). Furthermore, according to the right-hand term of (5), we have to set B such that B^2/C ≤ 1/√C, i.e., B ≤ C^{1/4}, for the O(1/ε^4) computation complexity guarantee. Hence, in MSGD we have to set the batch size such that

    B ≤ O(min{√C/L, C^{1/4}}).    (6)

We can observe that a larger L leads to a smaller batch size in MSGD. If B does not satisfy (6), MSGD will suffer a higher computation complexity.
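The MSGD recursion and the step-size restriction η ≤ (1-β)^2/((1+β)L) can be sketched on a toy problem. The quadratic objective and all constants below are illustrative choices of ours, not from the paper:

```python
# Minimal MSGD (Polyak momentum) sketch on the quadratic F(w) = 0.5 * L * w**2,
# following the update v_{t+1} = beta * v_t + g_t, w_{t+1} = w_t - eta * v_{t+1}.
# L, beta, and the starting point are illustrative assumptions.
L, beta = 10.0, 0.9
eta = (1 - beta) ** 2 / ((1 + beta) * L)   # the step-size cap from this section

w, v = 5.0, 0.0
for t in range(20000):
    g = L * w            # exact gradient of the quadratic (full-batch case)
    v = beta * v + g     # momentum buffer
    w = w - eta * v      # parameter update

print(abs(L * w) < 1e-6)  # gradient norm at an (approximate) stationary point
```

With this η the iteration contracts geometrically; step sizes above the cap can make the heavy-ball recursion diverge on ill-conditioned problems.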
In fact, to the best of our knowledge, among all the existing convergence analyses of SGD and its variants on both convex and non-convex problems Li et al. (2014b;a); Lian et al. (2017); Yu et al. (2019b;a), we can observe three necessary conditions for the O(1/ε^4) computation complexity guarantee: (a) the objective function is L-smooth; (b) the learning rate η is less than O(1/L); (c) the batch size B is proportional to the learning rate η. One direct corollary is that the batch size is limited by the smoothness constant L, i.e., B ≤ O(1/L). Hence, we cannot increase the batch size casually in these SGD-based methods. Otherwise, it may slow down the convergence rate and we need to compute more gradients, which is consistent with the observations in Hoffer et al. (2017).

4. STOCHASTIC NORMALIZED GRADIENT DESCENT WITH MOMENTUM

In this section, we propose our novel method, called stochastic normalized gradient descent with momentum (SNGM), which is presented in Algorithm 1. In the t-th iteration, SNGM runs the following update:

    u_{t+1} = β u_t + g_t/||g_t||,    (7)
    w_{t+1} = w_t - η u_{t+1},    (8)

where g_t = (1/B) Σ_{i∈I_t} ∇f_i(w_t) is a stochastic mini-batch gradient with a batch size of B. When β = 0, SNGM degenerates to stochastic normalized gradient descent (SNGD) Hazan et al. (2015). The u_t is a variant of Polyak's momentum. But different from Polyak's MSGD, which adopts g_t directly for updating u_{t+1}, SNGM adopts the normalized gradient g_t/||g_t|| for updating u_{t+1}. In MSGD, we can observe that if g_t is large, then u_t may be large as well, and this may lead to a bad model parameter. Hence, we have to control the learning rate in MSGD, i.e., η ≤ O(1/L), for an L-smooth objective function. The following lemma shows that u_t in SNGM can be well controlled whether g_t is large or small.

Lemma 2 Let {u_t} be the sequence produced by (7); then we have, ∀t ≥ 0, ||u_t|| ≤ 1/(1-β).
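A minimal sketch of Algorithm 1 on a toy least-squares problem; the synthetic data, batch size, and learning rate below are illustrative assumptions of ours, not the paper's experimental settings. The run also checks the bound of Lemma 2 empirically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: F(w) = (1/2n) ||A w - b||^2 with n = 1000 samples in R^5.
A = rng.normal(size=(1000, 5))
b = A @ np.ones(5) + 0.1 * rng.normal(size=1000)

eta, beta, B, T = 0.01, 0.9, 32, 2000
w, u = np.zeros(5), np.zeros(5)
max_u_norm = 0.0

for t in range(T):
    idx = rng.choice(1000, size=B, replace=False)   # mini-batch I_t
    g = A[idx].T @ (A[idx] @ w - b[idx]) / B        # mini-batch gradient g_t
    u = beta * u + g / np.linalg.norm(g)            # normalized-gradient momentum
    w = w - eta * u
    max_u_norm = max(max_u_norm, np.linalg.norm(u))

print(max_u_norm <= 1.0 / (1.0 - beta))             # Lemma 2: ||u_t|| <= 1/(1-beta)
print(np.linalg.norm(A.T @ (A @ w - b) / 1000) < 1.0)  # full gradient became small
```

Note that the momentum norm stays bounded regardless of how large the raw gradients are, which is exactly the property MSGD lacks.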

4.1. SMOOTH OBJECTIVE FUNCTION

For a smooth objective function, we have the following convergence result of SNGM:

Table 1: Comparison between MSGD and SNGM for an L-smooth objective function. C denotes the computation complexity (total number of gradient computations).

    method | convergence bound                     | learning rate | batch size
    MSGD   | O(1/√C) + O(B^2/C) (on E||∇F(w_t)||^2) | B/√C          | min{√C/L, C^{1/4}}
    SNGM   | O(1/C^{1/4}) (on E||∇F(w_t)||)         | √B/√C         | √C

Theorem 1 Let F(w) be an L-smooth function (L > 0). The sequence {w_t} is produced by Algorithm 1. Then for any η > 0, B > 0, we have

    (1/T) Σ_{t=0}^{T-1} E||∇F(w_t)|| ≤ 2(1-β)[F(w_0) - F(w_*)]/(ηT) + Lκη + 2σ/√B,    (9)

where κ = (1+β)/(1-β)^2.

Proof 1 See the supplementary material.

We can observe that, different from (4) which needs η ≤ O(1/L), (9) is true for any positive learning rate. According to Theorem 1, we obtain the following computation complexity of SNGM:

Corollary 1 Let F(w) be an L-smooth function (L > 0). The sequence {w_t} is produced by Algorithm 1. Given any total number of gradient computations C > 0, let T = C/B,

    B = √( C(1-β)σ^2 / (2L(1+β)(F(w_0) - F(w_*))) )  and  η = √( 2(1-β)^3 (F(w_0) - F(w_*)) B / ((1+β)LC) ).

Then we have

    (1/T) Σ_{t=0}^{T-1} E||∇F(w_t)|| ≤ 2√2 [ 8L(1+β)(F(w_0) - F(w_*)) σ^2 / ((1-β)C) ]^{1/4} = O(1/C^{1/4}).

Hence, the computation complexity for achieving an ε-stationary point is O(1/ε^4).

It is easy to verify that the η and B in Corollary 1 minimize the right-hand side of (9). However, this η and B rely on L and F(w_*), which are usually unknown in practice. The following corollary shows the computation complexity of SNGM with simple settings of the learning rate and batch size.

Corollary 2 Let F(w) be an L-smooth function (L > 0). The sequence {w_t} is produced by Algorithm 1. Given any total number of gradient computations C > 0, let T = C/B, B = √C, and η = √(B/C). Then we have

    (1/T) Σ_{t=0}^{T-1} E||∇F(w_t)|| ≤ 2(1-β)[F(w_0) - F(w_*)]/C^{1/4} + L(1+β)/((1-β)^2 C^{1/4}) + 2σ/C^{1/4} = O(1/C^{1/4}).

Hence, the computation complexity for achieving an ε-stationary point is O(1/ε^4).
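The settings of Corollary 2 can be checked numerically: with T = C/B, B = √C, and η = √(B/C), each of the three terms in the bound scales as C^{-1/4}, so multiplying the gradient budget C by 16 should exactly halve the bound. The constants (L, β, σ, and the initial gap) below are illustrative choices:

```python
import math

# Numeric illustration of Corollary 2's C**(-1/4) scaling.
# L, beta, sigma, and gap = F(w_0) - F(w_*) are illustrative constants.
L, beta, sigma, gap = 10.0, 0.9, 1.0, 5.0

def sngm_bound(C):
    B = math.sqrt(C)
    T = C / B
    eta = math.sqrt(B / C)
    kappa = (1 + beta) / (1 - beta) ** 2
    return (2 * (1 - beta) * gap / (eta * T)
            + L * kappa * eta
            + 2 * sigma / math.sqrt(B))

# A 16x larger budget halves the bound: (16C)**(-1/4) = 0.5 * C**(-1/4).
r = sngm_bound(16 * 10**8) / sngm_bound(10**8)
print(abs(r - 0.5) < 1e-9)
```

The same scaling holds term by term, which is why no tuning of the constants changes the C^{-1/4} rate.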
According to Corollary 2, the batch size of SNGM can be set as O(√C), which does not rely on the smoothness constant L, and the O(1/ε^4) computation complexity is still guaranteed (see Table 1). Hence, SNGM can adopt a larger batch size than MSGD, especially when L is large.
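To make the comparison in Table 1 concrete, the following sketch (with an illustrative gradient budget C and illustrative smoothness constants) contrasts MSGD's batch-size ceiling min{√C/L, C^{1/4}} with SNGM's B = √C:

```python
import math

# Batch-size limits from Table 1; C and the values of L are illustrative.
def max_batch_msgd(C, L):
    return min(math.sqrt(C) / L, C ** 0.25)

def batch_sngm(C):
    return math.sqrt(C)

C = 10 ** 8                       # total number of gradient computations
for L in (1.0, 1000.0):
    ratio = batch_sngm(C) / max_batch_msgd(C, L)
    print(L, ratio)               # the gap widens as L grows
```

For C = 10^8, SNGM's batch size is 10^4, while MSGD is capped at C^{1/4} = 100 even for well-conditioned problems and at √C/L = 10 when L = 1000.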

4.2. RELAXED SMOOTH OBJECTIVE FUNCTION

Recently, the authors of Zhang et al. (2020) observed the relaxed smooth property in deep neural networks. According to Definition 2, the relaxed smooth property is more general than the L-smooth property. For a relaxed smooth objective function, we have the following convergence result of SNGM:

Theorem 2 Let F(w) be an (L, λ)-smooth function (L ≥ 0, λ > 0). The sequence {w_t} is produced by Algorithm 1 with learning rate η ≤ 1/(8κλ) and batch size B. Then we have

    (1/T) Σ_{t=0}^{T-1} E||∇F(w_t)|| ≤ 2(1-β)[F(w_0) - F(w_*)]/(ηT) + 8Lκη + 4σ/√B,

where κ = (1+β)/(1-β)^2.

Proof 2 The proof is similar to that of Theorem 1. See the supplementary material.

According to Theorem 2, we obtain the computation complexity of SNGM:

Corollary 3 Let F(w) be an (L, λ)-smooth function (L ≥ 0, λ ≥ 0). The sequence {w_t} is produced by Algorithm 1. Given any total number of gradient computations C > 0, let T = C/B, B = √C, and η = (1/C)^{1/4} ≤ 1/(8κλ). Then we have

    (1/T) Σ_{t=0}^{T-1} E||∇F(w_t)|| ≤ 2(1-β)[F(w_0) - F(w_*)]/C^{1/4} + 8L(1+β)/((1-β)^2 C^{1/4}) + 4σ/C^{1/4} = O(1/C^{1/4}).

Hence, the computation complexity for achieving an ε-stationary point is O(1/ε^4). According to Corollary 3, SNGM with a batch size of B = √C can still guarantee an O(1/ε^4) computation complexity for a relaxed smooth objective function.

5. EXPERIMENTS

All experiments are conducted with PyTorch on a server with eight NVIDIA Tesla V100 (32G) GPU cards. The datasets for evaluation include CIFAR10 and ImageNet.

5.1. ON CIFAR10

First, we evaluate SNGM by training ResNet20 and ResNet56 on CIFAR10. CIFAR10 contains 50k training samples and 10k test samples. We compare SNGM with MSGD and an existing large batch training method, LARS You et al. (2017). We implement LARS by using the open source code (https://github.com/noahgolmant/pytorch-lars). The standard strategy He et al. (2016) for training the two models on CIFAR10 is using MSGD with a weight decay of 0.0001, a batch size of 128, an initial learning rate of 0.1, and dividing the learning rate at the 80th and 120th epochs. We also adopt this strategy for MSGD in this experiment. For SNGM and LARS, we set a large batch size of 4096 and also a weight decay of 0.0001. Following You et al. (2017), we adopt the poly power learning rate strategy and gradient accumulation Ott et al. (2018) with a batch size of 128 for the two large batch training methods. The momentum coefficient is 0.9 for all methods. Different from existing heuristic methods for large batch training, we do not adopt the warm-up strategy for SNGM. The results are presented in Figure 2. As can be seen, SNGM achieves a better convergence rate on the training loss than LARS. The detailed information about the final convergence results is presented in Table 2. We can observe that MSGD with a batch size of 4096 leads to a significant drop in test accuracy. SNGM with a batch size of 4096 achieves almost the same test accuracy as MSGD with a batch size of 128, while the other large batch training method, LARS, achieves worse test accuracy than MSGD with a batch size of 128. These results verify the effectiveness of SNGM.

Figure 2: Learning curves on CIFAR10.

Table 2: Experimental results on CIFAR10. In LARS with warm-up, we adopt the gradual warm-up strategy and a power of 2, which is the same setting as that in You et al. (2017).

5.2. ON IMAGENET

We then evaluate SNGM on ImageNet. The standard strategy He et al. (2016) for training the two models on ImageNet is using MSGD with a weight decay of 0.0001, a batch size of 256, an initial learning rate of 0.1, and dividing the learning rate at the 30th and 60th epochs. We also adopt this strategy for MSGD in this experiment. For SNGM, we set a larger batch size of 8192 and a weight decay of 0.0001. We still adopt the poly power learning rate and gradient accumulation with a batch size of 128 for SNGM. We do not adopt the warm-up strategy for SNGM either. The momentum coefficient is 0.9 for the two methods. The results are presented in Figure 3 and Table 3. As can be seen, SNGM with a larger batch size achieves almost the same test accuracy as MSGD with a small batch size.
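The gradient accumulation trick Ott et al. (2018) used for the large-batch runs can be sketched as follows: averaging the gradients of equally sized micro-batches reproduces the large-batch gradient exactly. The least-squares model and the sizes (128 = 4 × 32 instead of the paper's 4096 = 32 × 128) are scaled-down illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares "model" standing in for a network; the gradient of the
# mean loss over a batch is the average of per-sample gradients, so four
# accumulated micro-batches of 32 equal one batch of 128.
X = rng.normal(size=(128, 10))
y = rng.normal(size=128)
w = rng.normal(size=10)

def grad(Xb, yb, w):                        # mini-batch gradient of 0.5*mean((Xw-y)^2)
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                         # one large batch of 128
acc = np.zeros(10)
for k in range(4):                           # four micro-batches of 32
    sl = slice(32 * k, 32 * (k + 1))
    acc += grad(X[sl], y[sl], w)
acc /= 4                                     # average the accumulated gradients

print(np.allclose(full, acc))
```

In a deep learning framework the same effect is obtained by summing loss gradients over several backward passes before one optimizer step; the equivalence is exact only when the micro-batches have equal size.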

A APPENDIX

A.1 PROOF OF LEMMA 1

The proof follows Zhang et al. (2020). We put it here for completeness. For any u, w, let r(x) = x(u - w) + w and p(x) = ||∇φ(r(x))||, x ∈ [0, 1]. Then we have

    p(x) = ||∇φ(r(x))|| ≤ ||∇φ(w)|| + ∫_0^x ||∇^2 φ(r(y))|| ||u - w|| dy
         ≤ ||∇φ(w)|| + α ∫_0^x (L + λ ||∇φ(r(y))||) dy
         ≤ Lα + ||∇φ(w)|| + λα ∫_0^x p(y) dy.

According to Gronwall's inequality, we obtain p(x) ≤ (Lα + ||∇φ(w)||) e^{λα}.

A.2 PROOF OF LEMMA 2

According to (7), we have

    ||u_{t+1}|| ≤ β ||u_t|| + 1 ≤ β^2 ||u_{t-1}|| + β + 1 ≤ ... ≤ β^{t+1} ||u_0|| + β^t + β^{t-1} + ... + 1 ≤ 1/(1-β).

A.3 PROOF OF THEOREM 1

Let z_t = w_t + (β/(1-β))(w_t - w_{t-1}). Then we have w_{t+1} = w_t - η g_t/||g_t|| + β(w_t - w_{t-1}) and

    z_{t+1} = (1/(1-β)) w_{t+1} - (β/(1-β)) w_t = z_t - (η/(1-β)) g_t/||g_t||.

Using the smoothness property, we obtain

    F(z_{t+1}) ≤ F(z_t) - (η/(1-β)) ∇F(z_t)^T g_t/||g_t|| + Lη^2/(2(1-β)^2)
             = F(z_t) - (η/(1-β)) ||g_t|| + Lη^2/(2(1-β)^2)
               - (η/(1-β)) [(∇F(z_t) - ∇F(w_t))^T g_t/||g_t|| + (∇F(w_t) - g_t)^T g_t/||g_t||]
             ≤ F(z_t) - (η/(1-β)) ||g_t|| + Lη^2/(2(1-β)^2)
               + (η/(1-β)) [L ||z_t - w_t|| + ||∇F(w_t) - g_t||].

Since w_{t+1} - w_t = β(w_t - w_{t-1}) - η g_t/||g_t||, we obtain ||w_{t+1} - w_t|| ≤ β ||w_t - w_{t-1}|| + η ≤ η/(1-β). Hence ||w_t - w_{t-1}|| ≤ η/(1-β) and

    ||z_t - w_t|| = (β/(1-β)) ||w_t - w_{t-1}|| ≤ βη/(1-β)^2.

Combining the above inequalities, we obtain

    ||g_t|| ≤ (1-β)[F(z_t) - F(z_{t+1})]/η + Lη/(2(1-β)) + Lβη/(1-β)^2 + ||∇F(w_t) - g_t||.

Since ||∇F(w_t)|| ≤ ||∇F(w_t) - g_t|| + ||g_t||, we obtain

    ||∇F(w_t)|| ≤ (1-β)[F(z_t) - F(z_{t+1})]/η + Lη/(2(1-β)) + Lβη/(1-β)^2 + 2 ||∇F(w_t) - g_t||.

Using the fact that E||∇F(w_t) - g_t|| ≤ σ/√B and summing up the above inequality from t = 0 to T-1, we obtain

    (1/T) Σ_{t=0}^{T-1} E||∇F(w_t)|| ≤ 2(1-β)[F(w_0) - F(w_*)]/(ηT) + Lκη + 2σ/√B.

A.4 PROOF OF THEOREM 2

Let z_t = w_t + (β/(1-β))(w_t - w_{t-1}). Then we have w_{t+1} = w_t - η g_t/||g_t|| + β(w_t - w_{t-1}) and

    z_{t+1} = (1/(1-β)) w_{t+1} - (β/(1-β)) w_t
           = (1/(1-β)) [w_t - η g_t/||g_t|| + β(w_t - w_{t-1})] - (β/(1-β)) w_t
           = (1/(1-β)) w_t - (β/(1-β)) w_{t-1} - (η/(1-β)) g_t/||g_t||
           = z_t - (η/(1-β)) g_t/||g_t||.

Using the Taylor theorem, there exists ξ_t such that

    F(z_{t+1}) ≤ F(z_t) - (η/(1-β)) ∇F(z_t)^T g_t/||g_t|| + ||H_F(ξ_t)|| η^2/(2(1-β)^2)
             = F(z_t) - (η/(1-β)) ||g_t|| + ||H_F(ξ_t)|| η^2/(2(1-β)^2)
               - (η/(1-β)) [(∇F(z_t) - ∇F(w_t))^T g_t/||g_t|| + (∇F(w_t) - g_t)^T g_t/||g_t||],

where H_F(·) denotes the Hessian of F. Let ψ_t(w) = (∇F(w) - ∇F(w_t))^T g_t/||g_t||. Using the Taylor theorem again, there exists ζ_t such that

    |ψ_t(z_t)| = |ψ_t(w_t) + ∇ψ_t(ζ_t)^T (z_t - w_t)| = |∇ψ_t(ζ_t)^T (z_t - w_t)| ≤ ||H_F(ζ_t)|| ||z_t - w_t||.

Combining the above inequalities with ||z_t - w_t|| ≤ βη/(1-β)^2 and ||∇F(w_t)|| ≤ ||∇F(w_t) - g_t|| + ||g_t||, we obtain

    ||∇F(w_t)|| ≤ (1-β)[F(z_t) - F(z_{t+1})]/η + η/(2(1-β)) ||H_F(ξ_t)|| + βη/(1-β)^2 ||H_F(ζ_t)|| + 2 ||∇F(w_t) - g_t||.

Next, we bound the two Hessian matrices. For convenience, we denote κ = (1+β)/(1-β)^2. Since ||z_t - w_t|| ≤ βη/(1-β)^2 and ||z_{t+1} - w_t|| ≤ ||z_{t+1} - z_t|| + ||z_t - w_t|| ≤ η/(1-β) + βη/(1-β)^2 ≤ κη ≤ 1/(8λ), both ξ_t and ζ_t lie within distance 1/(8λ) of w_t. By Lemma 1 and Definition 2, this implies

    ||H_F(ξ_t)|| ≤ L + λ ||∇F(ξ_t)|| ≤ L + (L + λ ||∇F(w_t)||) e,

and the same bound holds for ||H_F(ζ_t)||. Hence

    ||∇F(w_t)|| ≤ (1-β)[F(z_t) - F(z_{t+1})]/η + [η/(2(1-β)) + βη/(1-β)^2] [L + (L + λ ||∇F(w_t)||) e] + 2 ||∇F(w_t) - g_t||.

Note that η/(2(1-β)) + βη/(1-β)^2 = κη/2 and κη ≤ 1/(8λ), so the coefficient of ||∇F(w_t)|| on the right-hand side is at most λeκη/2 ≤ e/16 < 1/4. Rearranging terms, using the fact that E||∇F(w_t) - g_t|| ≤ σ/√B, and summing up from t = 0 to T-1, we obtain

    (1/T) Σ_{t=0}^{T-1} E||∇F(w_t)|| ≤ 2(1-β)[F(w_0) - F(w_*)]/(ηT) + 8Lκη + 4σ/√B,

where η ≤ 1/(8κλ).








A.5 PROOF OF COROLLARY 1

Let x = 2(1-β)[F(w_0) - F(w_*)], y = Lκ, and z = 2σ. With T = C/B, the right-hand side of (9) equals xB/(ηC) + yη + z/√B. Then we have

    xB/(ηC) + yη + z/√B ≥ 2√(xyB/C) + z/√B ≥ 2√2 (xyz^2/C)^{1/4}.

The equality holds if and only if η = √(Bx/(Cy)) and B = √(Cz^2/(4xy)). Then we obtain

    (1/T) Σ_{t=0}^{T-1} E||∇F(w_t)|| ≤ 2√2 [ 8L(1+β)(F(w_0) - F(w_*)) σ^2 / ((1-β)C) ]^{1/4}.
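The closed-form choice of η and B in this argument can be sanity-checked numerically. The constants x, y, z, C below are illustrative placeholders, not values from the paper:

```python
import math

# h(eta, B) = x*B/(eta*C) + y*eta + z/sqrt(B) is the right-hand side of the
# bound after substituting T = C/B; the claimed minimizer is
# eta* = sqrt(B*x/(C*y)) with B* = sqrt(C*z**2/(4*x*y)).
x, y, z, C = 3.0, 2.0, 1.5, 10 ** 6

def h(eta, B):
    return x * B / (eta * C) + y * eta + z / math.sqrt(B)

B_star = math.sqrt(C * z ** 2 / (4 * x * y))
eta_star = math.sqrt(B_star * x / (C * y))
h_star = h(eta_star, B_star)

# The closed-form value matches 2*sqrt(2)*(x*y*z^2/C)**(1/4) ...
print(abs(h_star - 2 * math.sqrt(2) * (x * y * z ** 2 / C) ** 0.25) < 1e-9)

# ... and a crude grid around the closed form never does better.
scales = (0.25, 0.5, 1.0, 2.0, 4.0)
grid_min = min(h(eta_star * s, B_star * r) for s in scales for r in scales)
print(grid_min >= h_star - 1e-12)
```

At the minimizer the first two terms of h are equal and their sum equals the third term, which is exactly the balancing used in the two applications of the AM-GM inequality above.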

Figure 1: The training loss and test accuracy for training a non-convex model (a network with two convolutional layers) on CIFAR10. The optimization method is MSGD with the poly power learning rate strategy.

T gt gt . Using the Taylor theorem, there existsζ t such that |ψ t (z t )| =|ψ t (w t ) + ∇ψ t (ζ t )(z t -w t )| = |∇ψ(ζ t )(z t -w t )| ≤ H F (ζ t ) z t -w t . H F (ζ t ) z t -w t + ∇F (w t ) -g t ). ∇F (w t ) -g t . Since ∇F (w t ) ≤ ∇F (w t ) -g t + g t , we obtain ∇F (w t ) ≤ (1 -β)[F (z t ) -F (z t+1 )] -β) 2 H F (ζ t ) + 2 ∇F (w t ) -g t .Next, we bound the two Hessian matrices. For convenience, we denoteκ = 1+β (1-β) 2 . Since z tw t ≤ βη/(1 -β) 2 and z t+1 -w t ≤ z t+1 -z t + z t -w t -β) 2][L + (L + λ ∇F (w t ) )e] + 2 ∇F (w t ) -g t



6. CONCLUSION

In this paper, we propose a novel method, called stochastic normalized gradient descent with momentum (SNGM), for large batch training. We theoretically prove that, compared to MSGD, which is one of the most widely used variants of SGD, SNGM can adopt a larger batch size to converge to an ε-stationary point with the same computation complexity. Empirical results on deep learning also show that SNGM can achieve state-of-the-art accuracy with a large batch size.

