THE IMPACT OF THE MINI-BATCH SIZE ON THE DYNAMICS OF SGD: VARIANCE AND BEYOND

Anonymous authors
Paper under double-blind review

Abstract

We study the dynamics of mini-batch stochastic gradient descent (SGD) for linear regression and deep linear networks by focusing on the variance of the gradients conditioned only on the initial weights and the mini-batch size, which is the first study of this nature. In the linear regression case, we show that in each iteration the norm of the gradient is a decreasing function of the mini-batch size b, and thus the variance of the stochastic gradient estimator is a decreasing function of b. For deep neural networks with L2 loss we show that the variance of the gradient is a polynomial in 1/b. These results theoretically back the important intuition, and common belief among researchers, that smaller batch sizes yield larger variance of the stochastic gradients and lower loss function values. The proof techniques exhibit a relationship between stochastic gradient estimators and initial weights, which is useful for further research on the dynamics of SGD. We empirically provide insights into our results on various datasets and commonly used deep network architectures. We further discuss possible extensions of our approaches to studying the generalization ability of deep learning models.

1. INTRODUCTION

Deep learning models have achieved great success in a variety of tasks including natural language processing, computer vision, and reinforcement learning (Goodfellow et al., 2016). Despite their practical success, there are only limited studies of the theoretical properties of deep learning; see the survey papers (Sun, 2019; Fan et al., 2019) and references therein. The general problem underlying deep learning models is to optimize (minimize) a loss function, defined by the deviation of model predictions on data samples from the corresponding true labels. The prevailing method to train deep learning models is the mini-batch stochastic gradient descent algorithm and its variants (Bottou, 1998; Bottou et al., 2018). SGD updates model parameters by calculating a stochastic approximation of the full gradient of the loss function, based on a randomly selected subset of the training samples called a mini-batch. It is well-accepted that selecting a large mini-batch size reduces the training time of deep learning models, as computation on large mini-batches can be better parallelized on processing units. For example, Goyal et al. (2017) scale ResNet-50 (He et al., 2016) from a mini-batch size of 256 images and a training time of 29 hours to a larger mini-batch size of 8,192 images; their training achieves the same level of accuracy while reducing the training time to one hour. However, as noted by many researchers, larger mini-batch sizes suffer from worse generalization (LeCun et al., 2012; Keskar et al., 2017). Therefore, many efforts have been made to develop specialized training procedures that achieve good generalization using large mini-batch sizes (Hoffer et al., 2017; Goyal et al., 2017). Smaller batch sizes, on the other hand, allegedly offer better generalization, at the expense of a longer training time. The focus of this study is on the behavior of SGD subject to conditions on the initial point.
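For concreteness, the mini-batch SGD update described above can be sketched for the least-squares objective as follows. This is a minimal illustration, not the paper's experimental setup; the function name, step size, and data are hypothetical.

```python
import numpy as np

def minibatch_sgd(X, y, w0, lr=0.05, b=16, steps=300, seed=0):
    """Mini-batch SGD on the least-squares loss 0.5 * ||X w - y||^2 / n.

    Each step draws a random mini-batch of b samples (without replacement)
    and moves w against the stochastic gradient computed on that batch.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = w0.copy()
    for _ in range(steps):
        idx = rng.choice(n, size=b, replace=False)      # sample a mini-batch
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / b     # stochastic gradient
        w -= lr * grad
    return w
```

With b equal to the number of samples n, every "mini-batch" is the full dataset and the method reduces to deterministic gradient descent, which is the no-stochasticity extreme discussed below.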
This is different from previous results, which analyze SGD by stringing one-step recursions together. The dynamics of SGD under different mini-batch sizes are not comparable if we merely consider one-step behavior, as the model parameters change from iteration to iteration. Fixing the initial weights and the learning rate therefore gives a fair view of the impact of different mini-batch sizes on the dynamics of SGD. We hypothesize that, given the same initial point, smaller mini-batch sizes lead to lower training loss and, unfortunately, decreased stability of the algorithm on average. The latter follows from the fact that the smaller the batch size, the more stochasticity and volatility are introduced; after all, if the batch size equals the number of samples, there is no stochasticity in the algorithm. To this end, we conjecture that the variance of the gradient in each iteration is a decreasing function of the mini-batch size. This conjecture is the focus of the work herein.

Variance correlates with many other important properties of SGD dynamics. For example, there is substantial work on variance reduction methods (Johnson & Zhang, 2013; Allen-Zhu & Hazan, 2016; Wang et al., 2013), which show great success in improving the convergence rate by controlling the variance of the stochastic gradients. The mini-batch size is also a key factor determining the performance of SGD. Some research focuses on how to choose an optimal mini-batch size based on different criteria (Smith & Le, 2017; Gower et al., 2019). However, these works make strong assumptions on the loss function (strong, pointwise, or quasi convexity, or constant variance near stationary points) or on the formulation of the SGD algorithm (a continuous-time interpretation by means of differential equations). The statements are approximate in nature and thus not exact mathematical claims.
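The conjectured monotonicity can be checked numerically at a fixed parameter point. The sketch below estimates the mini-batch gradient variance for least squares by Monte Carlo and compares it with the classical finite-population identity for sampling without replacement. This is an illustration of the monotone-in-b behavior at one fixed point, not the paper's theorems, which track the iterates over time given the initial weights; all names are hypothetical.

```python
import numpy as np

def minibatch_grad_variance(X, y, w, b, trials=2000, seed=0):
    """Monte-Carlo estimate of E||g_b - g||^2 at a fixed point w, where g_b
    is the mini-batch gradient of the least-squares loss and g the full one."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    g_full = X.T @ (X @ w - y) / n
    acc = 0.0
    for _ in range(trials):
        idx = rng.choice(n, size=b, replace=False)
        g_b = X[idx].T @ (X[idx] @ w - y[idx]) / b
        acc += np.sum((g_b - g_full) ** 2)
    return acc / trials

def exact_variance(X, y, w, b):
    """Closed form for sampling b of n without replacement:
    (1 - b/n)/b * (1/(n-1)) * sum_i ||g_i - g||^2, which decreases in b."""
    G = X * (X @ w - y)[:, None]          # row i is the sample gradient g_i
    n = G.shape[0]
    s2 = np.sum((G - G.mean(axis=0)) ** 2) / (n - 1)
    return (1 - b / n) / b * s2
```

Note that the closed form vanishes at b = n, matching the observation that a full batch introduces no stochasticity.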
Theoretical results regarding the relationship between the mini-batch size and the variance (and other performance measures, such as loss and generalization ability) of the SGD algorithm applied to general machine learning models are still missing. The work herein partially addresses this gap by showing the impact of the mini-batch size on the variance of gradients in SGD. We further discuss possible extensions of our approaches to studying generalization ability. We are able to prove the hypothesis about variance in the convex linear regression case and to show significant progress in a deep linear neural network setting with samples drawn from a normal distribution. In this case we show that the variance is a polynomial in the reciprocal of the mini-batch size and that it is decreasing if the mini-batch size is larger than a threshold (further experiments reveal that this threshold can be as small as 2). The increased variance as the mini-batch size decreases should also intuitively imply convergence to lower training loss values and, in turn, better prediction and generalization ability (these relationships are yet to be confirmed analytically, but we provide empirical evidence of their validity). The major contributions of this paper are as follows.

• For linear regression, we show that in each iteration the norm of any linear combination of sample-wise gradients is a decreasing function of the mini-batch size b (Theorem 1). As a special case, the variance of the stochastic gradient estimator and the norm of the full gradient at the iterate of step t are also decreasing functions of b at every step t (Theorem 2). In addition, the proof provides a recursive relationship between the norm of the gradients and the model parameters at each iteration (Lemma 2). This recursive relationship can be used to express any quantity related to the full/stochastic gradient or the loss at any iteration in terms of the initial weights.
• For a deep linear neural network with L2 loss and samples drawn from a normal distribution, we take a two-layer linear network as an example and show that in each iteration step t the trace of any product of the stochastic gradient estimators and weight matrices is a polynomial in 1/b whose coefficients are sums of products of the initial weights (Theorem 3). As a special case, the variance of the stochastic gradient estimator is a polynomial in 1/b without the constant term (Theorem 4), and therefore it is a decreasing function of b when b is large enough (Theorem 5). The results and proof techniques can easily be extended to general deep linear networks. In comparison, other papers that study theoretical properties of two-layer networks either fix one layer of the network or assume the model is over-parameterized and study convergence, while our paper makes no such assumptions on the model capacity. The proof also reveals the structure of the coefficients of the polynomial and thus serves as a tool for future work on proving other properties of the stochastic gradient estimators.

• The proofs are involved and require several key ideas. The main one is to prove a result more general than what is strictly necessary, so that the induction on the time step t can be carried out. New concepts and definitions are introduced in order to handle this more general case. Along the way we establish a result of general interest: the expectation of a product of several rank-one matrices sampled from a normal distribution interleaved with constant matrices.
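The two-layer setting above can also be probed numerically: at fixed weight matrices (W1, W2), the variance of the stochastic gradient can be estimated for several batch sizes and its decrease in b observed. The sketch below uses Gaussian inputs and a hypothetical teacher network to generate labels; it is an illustrative setup under those assumptions, not the paper's exact construction.

```python
import numpy as np

def twolayer_grad(W1, W2, X, Y):
    """Gradients of 0.5 * ||W2 W1 X - Y||_F^2 / m w.r.t. W1 and W2,
    where the columns of X and Y are the m samples."""
    m = X.shape[1]
    R = W2 @ W1 @ X - Y                 # residual matrix
    g1 = W2.T @ R @ X.T / m
    g2 = R @ (W1 @ X).T / m
    return g1, g2

def grad_variance_vs_b(W1, W2, bs=(2, 4, 8, 16), m_pop=256, trials=500, seed=0):
    """Monte-Carlo estimate of the stochastic-gradient variance at fixed
    (W1, W2) for several batch sizes b. Inputs X ~ N(0, I); labels come
    from a hypothetical noisy linear teacher (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    d_in, d_out = W1.shape[1], W2.shape[0]
    X = rng.standard_normal((d_in, m_pop))
    T = rng.standard_normal((d_out, d_in))          # teacher map
    Y = T @ X + 0.1 * rng.standard_normal((d_out, m_pop))
    f1, f2 = twolayer_grad(W1, W2, X, Y)            # full-batch gradients
    out = {}
    for b in bs:
        tot = 0.0
        for _ in range(trials):
            idx = rng.choice(m_pop, size=b, replace=False)
            g1, g2 = twolayer_grad(W1, W2, X[:, idx], Y[:, idx])
            tot += np.sum((g1 - f1) ** 2) + np.sum((g2 - f2) ** 2)
        out[b] = tot / trials
    return out
```

Plotting the returned values against 1/b gives a near-linear trend at moderate b, consistent with a polynomial in 1/b whose constant term vanishes.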

