STOCHASTIC NORMALIZED GRADIENT DESCENT WITH MOMENTUM FOR LARGE BATCH TRAINING

Anonymous authors
Paper under double-blind review

Abstract

Stochastic gradient descent (SGD) and its variants have been the dominating optimization methods in machine learning. Compared with small batch training, SGD with large batch training can better utilize the computational power of current multi-core systems such as GPUs and can reduce the number of communication rounds in distributed training. Hence, SGD with large batch training has attracted more and more attention. However, existing empirical results show that large batch training typically leads to a drop in generalization accuracy. As a result, large batch training has also become a challenging topic. In this paper, we propose a novel method, called stochastic normalized gradient descent with momentum (SNGM), for large batch training. We theoretically prove that, compared to momentum SGD (MSGD), which is one of the most widely used variants of SGD, SNGM can adopt a larger batch size to converge to an $\epsilon$-stationary point with the same computation complexity (total number of gradient computations). Empirical results on deep learning also show that SNGM can achieve state-of-the-art accuracy with a large batch size.

1. INTRODUCTION

In machine learning, we often need to solve the following empirical risk minimization problem:

$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w), \qquad (1)$$

where $w \in \mathbb{R}^d$ denotes the model parameter, $n$ denotes the number of training samples, and $f_i(w)$ denotes the loss on the $i$th training sample. The problem in (1) can be used to formulate a broad family of machine learning models, such as logistic regression and deep learning models. Stochastic gradient descent (SGD) Robbins & Monro (1951) and its variants have been the dominating optimization methods for solving (1). SGD and its variants are iterative methods. In the $t$th iteration, these methods randomly choose a subset (also called mini-batch) $I_t \subset \{1, 2, \ldots, n\}$ and compute the stochastic mini-batch gradient $\frac{1}{B} \sum_{i \in I_t} \nabla f_i(w_t)$ for updating the model parameter, where $B = |I_t|$ is the batch size. Existing works Li et al. (2014b); Yu et al. (2019a) have proved that with a batch size of $B$, SGD and its momentum variant, called momentum SGD (MSGD), achieve a $O(1/\sqrt{TB})$ convergence rate for smooth non-convex problems, where $T$ is the total number of model parameter updates. With the popularity of multi-core systems and the easy implementation of data parallelism, many distributed variants of SGD have been proposed, including parallel SGD Li et al. (2014a), decentralized SGD Lian et al. (2017), local SGD Yu et al. (2019b); Lin et al. (2020), local momentum SGD Yu et al. (2019a), and so on. Theoretical results show that all these methods can achieve a $O(1/\sqrt{TKb})$ convergence rate for smooth non-convex problems. Here, $b$ is the batch size on each worker and $K$ is the number of workers. By setting $Kb = B$, we can observe that the convergence rate of these distributed methods is consistent with that of sequential methods. In distributed settings, a small number of model parameter updates $T$ implies a small synchronization cost and communication cost. Hence, a small $T$ can further speed up the training process. Based on the $O(1/\sqrt{TKb})$ convergence rate, we can find that if we adopt a larger $b$, then $T$ will be smaller. Hence, large batch training can reduce the number of communication rounds in distributed training. Another benefit of adopting
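The mini-batch gradient and the MSGD update described above can be sketched on a toy problem. The heavy-ball form of the momentum update and all hyperparameter values below are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy realizable least-squares problem: f_i(w) = 0.5 * (x_i^T w - y_i)^2
n, d = 1000, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true

def minibatch_grad(w, idx):
    """Stochastic mini-batch gradient (1/B) * sum_{i in I_t} grad f_i(w)."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

# MSGD (heavy-ball momentum; lr, beta, B are illustrative hyperparameters)
B, lr, beta = 64, 0.05, 0.9
w = np.zeros(d)
u = np.zeros(d)  # momentum buffer
for t in range(500):
    idx = rng.choice(n, size=B, replace=False)  # draw mini-batch I_t
    g = minibatch_grad(w, idx)
    u = beta * u + g
    w = w - lr * u

print(float(np.linalg.norm(w - w_true)))  # distance to the true parameter
```

On this noiseless problem the iterates approach `w_true`; on non-convex losses, the guarantees discussed above are instead stated in terms of gradient norms at stationary points.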

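The inverse relation between $T$ and $b$ implied by the $O(1/\sqrt{TKb})$ rate can be checked with a little arithmetic; the constant below absorbs problem-dependent factors and is purely illustrative:

```python
def updates_needed(eps, K, b, C=1.0):
    """Setting C / sqrt(T * K * b) = eps and solving for T gives
    T = C**2 / (K * b * eps**2); C is an illustrative constant."""
    return C ** 2 / (K * b * eps ** 2)

T_small_batch = updates_needed(eps=0.01, K=8, b=32)
T_large_batch = updates_needed(eps=0.01, K=8, b=256)
print(T_small_batch / T_large_batch)  # 8.0: an 8x larger batch needs 8x fewer updates
```

Since each update corresponds to one synchronization round in distributed training, the larger per-worker batch $b$ directly cuts communication cost, which is the motivation stated above.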
