THE IMPACT OF THE MINI-BATCH SIZE ON THE DYNAMICS OF SGD: VARIANCE AND BEYOND

Anonymous authors
Paper under double-blind review

Abstract

We study mini-batch stochastic gradient descent (SGD) dynamics under linear regression and deep linear networks by focusing on the variance of the gradients, given only the initial weights and the mini-batch size; this is the first study of its nature. In the linear regression case, we show that in each iteration the norm of the gradient is a decreasing function of the mini-batch size $b$, and thus the variance of the stochastic gradient estimator is a decreasing function of $b$. For deep neural networks with the $L_2$ loss we show that the variance of the gradient is a polynomial in $1/b$. These results theoretically support the common belief among researchers that smaller batch sizes yield larger variance of the stochastic gradients and lower loss function values. The proof techniques exhibit a relationship between the stochastic gradient estimators and the initial weights, which is useful for further research on the dynamics of SGD. We empirically provide insights into our results on various datasets and commonly used deep network structures. We further discuss possible extensions of our approaches to studying the generalization ability of deep learning models.

1. INTRODUCTION

Deep learning models have achieved great success in a variety of tasks including natural language processing, computer vision, and reinforcement learning (Goodfellow et al., 2016). Despite their practical success, there are only limited studies of the theoretical properties of deep learning; see the survey papers (Sun, 2019; Fan et al., 2019) and references therein. The general problem underlying deep learning models is to optimize (minimize) a loss function, defined by the deviation of model predictions on data samples from the corresponding true labels. The prevailing method to train deep learning models is the mini-batch stochastic gradient descent algorithm and its variants (Bottou, 1998; Bottou et al., 2018). SGD updates model parameters by calculating a stochastic approximation of the full gradient of the loss function, based on a randomly selected subset of the training samples called a mini-batch. It is well-accepted that selecting a large mini-batch size reduces the training time of deep learning models, as computation on large mini-batches can be better parallelized on processing units. For example, Goyal et al. (2017) scale ResNet-50 (He et al., 2016) from a mini-batch size of 256 images and a training time of 29 hours to a larger mini-batch size of 8,192 images. Their training achieves the same level of accuracy while reducing the training time to one hour. However, as noted by many researchers, larger mini-batch sizes suffer from worse generalization ability (LeCun et al., 2012; Keskar et al., 2017). Therefore, many efforts have been made to develop specialized training procedures that achieve good generalization using large mini-batch sizes (Hoffer et al., 2017; Goyal et al., 2017). Smaller batch sizes have the advantage of allegedly offering better generalization (at the expense of a higher training time). The focus of this study is on the behavior of SGD subject to conditions on the initial point.
This is different from previous results, which analyze SGD by stringing one-step recursions together. The dynamics of SGD are not comparable if we merely consider the one-step behavior, as the model parameters change from iteration to iteration. Therefore, fixing the initial weights and the learning rate gives us a fair view of the impact of different mini-batch sizes on the dynamics of SGD. We hypothesize that, given the same initial point, smaller sizes lead to lower training loss and, unfortunately, decreased stability of the algorithm on average. The latter follows from the fact that the smaller the batch size, the more stochasticity and volatility are introduced. After all, if the batch size equals the number of samples, there is no stochasticity in the algorithm. To this end, we conjecture that the variance of the gradient in each iteration is a decreasing function of the mini-batch size. This conjecture is the focus of the work herein. Variance correlates with many other important properties of SGD dynamics. For example, there is substantial work on variance reduction methods (Johnson & Zhang, 2013; Allen-Zhu & Hazan, 2016; Wang et al., 2013), which show great success in improving the convergence rate by controlling the variance of the stochastic gradients. The mini-batch size is also a key factor deciding the performance of SGD. Some research focuses on how to choose an optimal mini-batch size based on different criteria (Smith & Le, 2017; Gower et al., 2019). However, these works make strong assumptions on the loss function properties (strong or quasi convexity, or constant variance near stationary points) or about the formulation of the SGD algorithm (a continuous-time interpretation by means of differential equations). The statements are approximate in nature and thus not mathematical claims.
Theoretical results regarding the relationship between the mini-batch size and the variance (and other performance measures, such as the loss and the generalization ability) of the SGD algorithm applied to general machine learning models are still missing. The work herein partially addresses this gap by showing the impact of the mini-batch size on the variance of gradients in SGD. We further discuss possible extensions of our approaches to studying the generalization ability. We are able to prove the hypothesis about the variance in the convex linear regression case and to show significant progress in a deep linear neural network setting with samples drawn from a normal distribution. In this case we show that the variance is a polynomial in the reciprocal of the mini-batch size and that it is decreasing if the mini-batch size is larger than a threshold (further experiments reveal that this threshold can be as small as 2). The increased variance as the mini-batch size decreases should also intuitively imply convergence to lower training loss values and, in turn, better prediction and generalization ability (these relationships are yet to be confirmed analytically, but we provide empirical evidence of their validity). The major contributions of this paper are as follows.
• For linear regression, we show that in each iteration the norm of any linear combination of sample-wise gradients is a decreasing function of the mini-batch size $b$ (Theorem 1). As a special case, the variances of the stochastic gradient estimator and of the full gradient at the iterate of step $t$ are also decreasing functions of $b$ (Theorem 2). In addition, the proof provides a recursive relationship between the norm of the gradients and the model parameters at each iteration (Lemma 2). This recursive relationship can be used to calculate any quantity related to the full/stochastic gradient or loss at any iteration with respect to the initial weights.
• For the deep linear neural network with $L_2$-loss and samples drawn from a normal distribution, we take the two-layer linear network as an example and show that in each iteration step $t$ the trace of any product of the stochastic gradient estimators and weight matrices is a polynomial in $1/b$ whose coefficients are sums of products of the initial weights (Theorem 3). As a special case, the variance of the stochastic gradient estimator is a polynomial in $1/b$ without the constant term (Theorem 4), and therefore it is a decreasing function of $b$ when $b$ is large enough (Theorem 5). The results and proof techniques can be easily extended to general deep linear networks. As a comparison, other papers that study theoretical properties of two-layer networks either fix one layer of the network, or assume the over-parameterized property of the model and study convergence, while our paper makes no such assumptions on the model capacity. The proof also reveals the structure of the coefficients of the polynomial, and thus serves as a tool for future work on proving other properties of the stochastic gradient estimators.
• The proofs are involved and require several key ideas. The main one is to prove a result more general than strictly necessary, so that the induction on the time step $t$ can be carried out. New concepts and definitions are introduced in order to handle this more general case. Along the way we show a result of general interest, establishing the expectation of a product of several rank-one matrices sampled from a normal distribution, intertwined with constant matrices.
• We verify the theoretical results on various datasets and provide further understanding. We further empirically show that the results extend to other widely used network structures and hold for all choices of the mini-batch size.
We also empirically verify that, on average, in each iteration the loss function value and the generalization ability (measured by the gap between the accuracy on the training and test sets) are all decreasing functions of the mini-batch size. In conclusion, we study the dynamics of SGD under linear regression and a two-layer linear network setting by focusing on the decreasing property of the variance of the stochastic gradient estimators with respect to the mini-batch size. The proof techniques can also be used to derive other properties of the SGD dynamics in regard to the mini-batch size and the initial weights. To the best of the authors' knowledge, this work is the first to theoretically study the impact of the mini-batch size on the variance of the gradient subject to conditions on the initial weights, under mild assumptions on the network and the loss function. We support our theoretical results by experiments. We further experiment on other state-of-the-art deep learning models and datasets to empirically show the validity of the conjectures about the impact of the mini-batch size on the average loss, the average accuracy, and the generalization ability of the model. The rest of the manuscript is structured as follows. In Section 2 we review the literature, while in Section 3 we present the theoretical results on how mini-batch sizes impact the variance of stochastic gradient estimators, under different models including linear regression and deep linear networks. Section 4 introduces (part of) the experiments that verify our theorems and provide further insights into the impact of the mini-batch size on SGD performance. We defer the complete experimental details to Appendix A, and the proofs of the theorems and other technical details to Appendix B.

2. LITERATURE REVIEW

Stochastic gradient descent type methods are broadly used in machine learning (Bottou, 1991; LeCun et al., 1998; Bottou et al., 2018). The performance of SGD relies heavily on the choice of the mini-batch size. It has been widely observed that choosing a large mini-batch size to train deep neural networks appears to deteriorate generalization (LeCun et al., 2012). This phenomenon persists even if the models are trained without any budget or limits, until the loss function value ceases to improve (Keskar et al., 2017). One explanation for this phenomenon is that large mini-batch SGD produces "sharp" minima that generalize worse (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017). Specialized training procedures to achieve good performance with large mini-batch sizes have also been proposed (Hoffer et al., 2017; Goyal et al., 2017). It is well-known that SGD has a slow asymptotic rate of convergence due to its inherent variance (Johnson & Zhang, 2013). Variants of SGD that reduce the variance of the stochastic gradient estimator, and thereby yield faster convergence, have also been suggested. There is prior work focusing on the dynamics of SGD; for example, Neelakantan et al. (2015) propose to add isotropic white noise to the full gradient to study the "structured" variance, and Mou et al. (2018) also analyze the dynamics of SGD-type algorithms. In most of the prior work about the convergence of SGD, it is assumed that the variance of stochastic gradient estimators is upper-bounded by a linear function of the norm of the full gradient, e.g. Assumption 4.3 in Bottou et al. (2018). Gower et al. (2019) give more precise bounds on the variance under different sampling methods, and Khaled & Richtárik (2020) extend them to the smooth non-convex regime. These bounds still depend on the model parameters at the corresponding iteration. To the best of the authors' knowledge, there is no existing result which represents the variance of the stochastic gradient estimators using only the initial weights and the mini-batch size.
This paper partially solves this problem.

3. ANALYSIS

Mini-batch SGD is a lighter-weight version of gradient descent. Suppose that we are given a loss function $L(w)$, where $w$ is the collection (vector, matrix, or tensor) of all model parameters. At each iteration $t$, instead of computing the full gradient $\nabla_w L(w_t)$, SGD randomly samples a mini-batch set $B_t$ that consists of $b = |B_t|$ training instances and sets $w_{t+1} \leftarrow w_t - \alpha_t \nabla_w L_{B_t}(w_t)$, where the positive scalar $\alpha_t$ is the learning rate (or step size) and $\nabla_w L_{B_t}(w_t)$ denotes the stochastic gradient estimator based on mini-batch $B_t$. An important property of the stochastic gradient estimator $\nabla_w L_{B_t}(w_t)$ is that it is unbiased, i.e. $\mathbb{E}\,\nabla_w L_{B_t}(w_t) = \nabla_w L(w_t)$, where the expectation is taken over all possible choices of the mini-batch $B_t$ (Smith & Le, 2017; Gower et al., 2019). However, even the quantities $\nabla_w L(w_t)$ and $\mathrm{var}(\nabla_w L(w_t))$ are challenging to compute, as we do not have direct formulas for their precise values. Besides, as we choose different $b$'s, their values are not comparable, since we end up with different $w_t$'s. A plausible idea to address these issues is to represent $\mathbb{E}\,\nabla_w L_{B_t}(w_t)$ and $\mathrm{var}(\nabla_w L_{B_t}(w_t))$ using the fixed and known quantities $w_0$, $b$, $t$, and $\alpha_t$. In this way, we can further discover properties of $\mathbb{E}\,\nabla_w L_{B_t}(w_t)$ and $\mathrm{var}(\nabla_w L_{B_t}(w_t))$, such as monotonicity with respect to $b$. The biggest challenge is how to connect the quantities at iteration $t$ with those at iteration 0. This is similar to discovering the properties of a stochastic differential equation at time $t$ given only the dynamics of the equation and the initial point. In this section, we address these questions under two settings: linear regression and a deep linear network. In Section 3.1, in a linear regression setting, we provide explicit formulas for calculating the norm of any linear combination of sample-wise gradients.
We thereby show that $\mathrm{var}(\nabla_w L_{B_t}(w_t))$ is a decreasing function of the mini-batch size $b$. In Section 3.2, in a deep linear network setting with samples drawn from a normal distribution, we show that any trace of a product of weight matrices and stochastic gradient estimators is a polynomial in $1/b$ of finite degree. We further prove that $\mathrm{var}(\nabla_w L_{B_t}(w_t))$ is a decreasing function of the mini-batch size $b$ for $b > b_0$, for some constant $b_0$. For a random matrix $M$, we define $\mathrm{var}(M) := \mathbb{E}\,\|\mathrm{vec}(M)\|^2 - \|\mathbb{E}\,\mathrm{vec}(M)\|^2$, where $\mathrm{vec}(M)$ denotes the vectorization of the matrix $M$. We denote $[m:n] := \{m, m+1, \ldots, n\}$ if $m \le n$, and $\emptyset$ otherwise, and use $[n] := [1:n]$ as an abbreviation. For clarity, we use the superscript $b$ to distinguish variables corresponding to different choices of the mini-batch size $b$. In each iteration $t$, we use $B^b_t$ to denote the batch of samples (or sample indices) used to calculate the stochastic gradient. We denote by $\mathcal{F}^b_t$ the filtration of information available before calculating the stochastic gradient in the $t$-th iteration, i.e. $\mathcal{F}^b_t := \{w_0, B^b_0, \ldots, B^b_{t-1}\}$.
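To make the variance definition above concrete, the following small sketch (ours, not from the paper; the matrix dimensions and sample counts are arbitrary) estimates $\mathrm{var}(M) = \mathbb{E}\,\|\mathrm{vec}(M)\|^2 - \|\mathbb{E}\,\mathrm{vec}(M)\|^2$ by Monte Carlo and checks that it coincides with the sum of the entrywise variances:

```python
import numpy as np

rng = np.random.default_rng(0)

def var_matrix(samples):
    """var(M) := E||vec(M)||^2 - ||E vec(M)||^2, estimated from samples.

    `samples` has shape (num_draws, rows, cols); each slice is one draw of M.
    """
    vecs = samples.reshape(len(samples), -1)          # vec() of each draw
    return np.mean(np.sum(vecs ** 2, axis=1)) - np.sum(vecs.mean(axis=0) ** 2)

# Draws of a random 2x3 matrix with i.i.d. N(0, 1) entries.
samples = rng.normal(size=(100_000, 2, 3))
v = var_matrix(samples)

# Algebraically, var(M) equals the sum of the entrywise variances (here ~6).
elementwise = samples.reshape(len(samples), -1).var(axis=0).sum()
assert abs(v - elementwise) < 1e-8
assert abs(v - 6.0) < 0.2
```

The identity with the sum of entrywise variances is exact; the Monte Carlo estimate of the value 6 is only approximate.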

3.1. LINEAR REGRESSION

In this subsection, we discuss the dynamics of SGD applied to linear regression. Given data points $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, we define the loss function to be $L(w) = \frac{1}{n}\sum_{i=1}^n L_i(w) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}(w^T x_i - y_i)^2$, where $w \in \mathbb{R}^p$ are the model parameters. We consider minimizing $L(w)$ by mini-batch SGD. Note that the bias term of the general linear regression model is omitted; however, adding the bias term does not change the results of this section. Formally, we first choose a mini-batch size $b$ and initial weights $w_0$. In each iteration $t$, we sample $B^b_t$, a subset of $[n]$ with cardinality $b$, and update the parameters by $w^b_{t+1} = w^b_t - \alpha_t g^b_t$, where $g^b_t = \frac{1}{b}\sum_{i \in B^b_t} \nabla L_i(w^b_t)$. We first show the relationship between the variance of the stochastic gradient $g^b_t$ and the full gradient $\nabla L(w^b_t)$ and sample-wise gradients $\nabla L_i(w^b_t)$, $i \in [n]$, derived by considering all possible choices of the mini-batch $B^b_t$. Readers should note that Lemma 1 actually holds for all models with $L_2$-loss, not merely linear regression, since the proof does not require the explicit form of $L_i(w)$.

Lemma 1. Let $c_b := \frac{n-b}{b(n-1)} \ge 0$. For any matrix $A \in \mathbb{R}^{p \times p}$ we have $\mathrm{var}(A g^b_t \mid \mathcal{F}^b_t) = \mathbb{E}\big[\|A g^b_t\|^2 \mid \mathcal{F}^b_t\big] - \|A \nabla L(w^b_t)\|^2 = c_b \big( \frac{1}{n}\sum_{i=1}^n \|A \nabla L_i(w^b_t)\|^2 - \|A \nabla L(w^b_t)\|^2 \big)$.

Lemma 1 provides a bridge connecting the norm and variance of $g^b_t$ with the sample-wise gradients.

Lemma 2. For any matrices $A_i \in \mathbb{R}^{p \times p}$, $i \in [n]$, if we denote $A = \sum_{i=1}^n A_i x_i x_i^T$, then we have $\mathbb{E}\big[\|\sum_{i=1}^n A_i \nabla L_i(w^b_{t+1})\|^2 \mid \mathcal{F}_0\big] = \mathbb{E}\big[\|\sum_{i=1}^n B_i \nabla L_i(w^b_t)\|^2 \mid \mathcal{F}_0\big] + \frac{\alpha_t^2 c_b}{n^2} \sum_{k=1}^n \sum_{l=1}^n \mathbb{E}\big[\|\sum_{i=1}^n B^{kl}_i \nabla L_i(w^b_t)\|^2 \mid \mathcal{F}_0\big]$, where $B_i = A_i - \frac{\alpha_t}{n} A$; $B^{kl}_i = A$ if $i = k$, $i \ne l$, or $i = l$, $i \ne k$; and $B^{kl}_i$ equals the zero matrix otherwise.

Lemma 2 provides the tool to reduce the iteration index $t$ by one.
Therefore, we can easily use it to recursively calculate the norm of any linear combination of the sample-wise gradients, for all iterations $t$. Combined with the fact that $c_b$ is a decreasing function of $b$, this allows us to show Theorem 1.

Theorem 1. For any $t \in \mathbb{N}$ and any matrices $A_i \in \mathbb{R}^{p \times p}$, $i \in [n]$, $\mathbb{E}\big[\|\sum_{i=1}^n A_i \nabla L_i(w^b_t)\|^2 \mid \mathcal{F}_0\big]$ is a decreasing function of $b$ for $b \in [n]$.

Theorem 1 states that the norm of any linear combination of the sample-wise gradients is a decreasing function of $b$. Combining Lemma 1, which connects the variance of $g^b_t$ with linear combinations of the $\nabla L_i(w^b_t)$'s, and the fact that $\nabla L(w^b_t) = \frac{1}{n}\sum_{i=1}^n \nabla L_i(w^b_t)$, we obtain Theorem 2.

Theorem 2. Fixing the initial weights $w_0$, both $\mathrm{var}(B g^b_t \mid \mathcal{F}_0)$ and $\mathrm{var}(B \nabla L(w^b_t) \mid \mathcal{F}_0)$ are decreasing functions of the mini-batch size $b$ for all $b \in [n]$, $t \in \mathbb{N}$, and all square matrices $B \in \mathbb{R}^{p \times p}$.

As a special case, Corollary 1 guarantees that the variance of the stochastic gradient estimator is a decreasing function of $b$.

Corollary 1. Fixing the initial weights $w_0$, both $\mathrm{var}(g^b_t \mid \mathcal{F}_0)$ and $\mathrm{var}(\nabla L(w^b_t) \mid \mathcal{F}_0)$ are decreasing functions of the mini-batch size $b$ for all $b \in [n]$ and $t \in \mathbb{N}$.

In conclusion, we provide a framework for calculating the explicit value of the variance of the stochastic gradient estimators and the norm of any linear combination of sample-wise gradients. We further show that the variances of both the full gradient and the stochastic gradient estimator are decreasing functions of the mini-batch size $b$. Readers should note that the framework here is not limited to showing the decreasing property of the variance, but can also be used in many other circumstances. For example, we can use Lemma 2 to induct on $t$ and easily show that $\mathbb{E}\big[\|\sum_{i=1}^n A_i \nabla L_i(w^b_t)\|^2 \mid \mathcal{F}_0\big]$ is a polynomial in $\frac{1}{b}$ of degree at most $t$, and estimate the coefficients therein.
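As a sanity check on Lemma 1 (specialized to $A = I$), the following small numerical experiment (ours; the problem sizes are arbitrary) enumerates all $\binom{n}{b}$ equally likely mini-batches of a tiny linear regression problem and compares both sides of the identity exactly:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p, b = 6, 3, 2                     # tiny problem so we can enumerate batches
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
w = rng.normal(size=p)                # an arbitrary iterate w^b_t

grad_i = [(X[i] @ w - y[i]) * X[i] for i in range(n)]   # sample-wise gradients
full = np.mean(grad_i, axis=0)                          # full gradient

# Left side: var(g^b_t) over all C(n, b) equally likely mini-batches.
g = [np.mean([grad_i[i] for i in B], axis=0) for B in combinations(range(n), b)]
lhs = np.mean([np.sum(gb ** 2) for gb in g]) - np.sum(full ** 2)

# Right side: c_b * ( (1/n) sum_i ||grad_i||^2 - ||full||^2 ).
c_b = (n - b) / (b * (n - 1))
rhs = c_b * (np.mean([np.sum(gi ** 2) for gi in grad_i]) - np.sum(full ** 2))

assert abs(lhs - rhs) < 1e-10         # the identity holds exactly
```

Since $c_b$ is decreasing in $b$, rerunning this with larger $b$ shows the left side shrink accordingly.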

3.2. DEEP LINEAR NETWORKS WITH ONLINE SETTING

In this section, we study the dynamics of SGD on deep linear networks. We take the two-layer linear network as an example, while the results and proofs can be easily extended to deep linear networks of any depth (see Appendix B.3 for more details). We consider the population loss $L(w) = \mathbb{E}_{x \sim N(0, I_p)}\big[\frac{1}{2}\|W_2 W_1 x - \bar{W}_2 \bar{W}_1 x\|^2\big]$ under the teacher-student learning framework (Hinton et al., 2015), with $w = (W_1, W_2)$ a tuple of two matrices. Here $W_1 \in \mathbb{R}^{p_1 \times p}$ and $W_2 \in \mathbb{R}^{p_2 \times p_1}$ are the parameter matrices of the student network, and $\bar{W}_1$ and $\bar{W}_2$ are the fixed ground-truth parameters of the teacher network. We use online SGD to minimize the population loss $L(w)$. Formally, we first choose a mini-batch size $b$ and initial weight matrices $\{W_{0,1}, W_{0,2}\}$. In each iteration $t$, we draw $b$ independent and identically distributed samples $x_{t,i}$, $i \in [b]$, from $N(0, I_p)$ to form the mini-batch $B^b_t$, and update the weight matrices by $W^b_{t+1,1} = W^b_{t,1} - \alpha_t g^b_{t,1}$ and $W^b_{t+1,2} = W^b_{t,2} - \alpha_t g^b_{t,2}$, where

$g^b_{t,1} = \frac{1}{b}\sum_{i=1}^b \nabla_{W^b_{t,1}} \frac{1}{2}\big\|W^b_{t,2} W^b_{t,1} x_{t,i} - \bar{W}_2 \bar{W}_1 x_{t,i}\big\|^2 = \frac{1}{b}\sum_{i=1}^b (W^b_{t,2})^T \big(W^b_{t,2} W^b_{t,1} - \bar{W}_2 \bar{W}_1\big) x_{t,i} x_{t,i}^T,$

$g^b_{t,2} = \frac{1}{b}\sum_{i=1}^b \nabla_{W^b_{t,2}} \frac{1}{2}\big\|W^b_{t,2} W^b_{t,1} x_{t,i} - \bar{W}_2 \bar{W}_1 x_{t,i}\big\|^2 = \frac{1}{b}\sum_{i=1}^b \big(W^b_{t,2} W^b_{t,1} - \bar{W}_2 \bar{W}_1\big) x_{t,i} x_{t,i}^T (W^b_{t,1})^T. \quad (2)$

The derivation follows from the formulas in Petersen & Pedersen (2012). In the following, we use $W^b_t = W^b_{t,2} W^b_{t,1} - \bar{W}_2 \bar{W}_1$ to denote the gap between the product of the model weights and the ground-truth weights. For ease of developing our proofs, we first introduce the notion of a multiplicative term in Definition 1. Intuitively, a multiplicative term is a matrix which equals the product of its parameter matrices and constant matrices (and their transposes). The degree of a matrix $A$ in a multiplicative term $M$ is the number of appearances of $A$ and $A^T$ in $M$; the degree of $M$ is the total number of appearances of all weight matrices in $M$. Definition 1.
For any set of matrices $\mathcal{S}$, we denote $\bar{\mathcal{S}} = \mathcal{S} \cup \{M^T : M \in \mathcal{S}\}$. Given a set of parameter matrices $\mathcal{X} = \{X_1, X_2, \ldots, X_{n_v}\}$ and constant matrices $\mathcal{C} = \{C_1, C_2, \ldots, C_{n_c}\}$, we say that a matrix $M$ is a multiplicative term of parameter matrices $\mathcal{X}$ and constant matrices $\mathcal{C}$ if it can be written in the form $M = M(\mathcal{X}, \mathcal{C}) = \prod_{i=1}^k A_i$, where $A_i \in \bar{\mathcal{X}} \cup \bar{\mathcal{C}}$. We write $\deg(X_j; M) = \sum_{i=1}^k \big(\mathbb{1}\{X_j = A_i\} + \mathbb{1}\{X_j = A_i^T\}\big)$, $j \in [n_v]$, for the degree of parameter matrix $X_j$ in $M$; $\deg(C_j; M) = \sum_{i=1}^k \big(\mathbb{1}\{C_j = A_i\} + \mathbb{1}\{C_j = A_i^T\}\big)$, $j \in [n_c]$, for the degree of constant matrix $C_j$ in $M$; and $\deg(M) = \sum_{i=1}^k \mathbb{1}\{A_i \in \bar{\mathcal{X}}\} = \sum_{j=1}^{n_v} \deg(X_j; M)$ for the total degree of the parameter matrices of $M$.

As pointed out in Section 1, the difficulty of studying the dynamics of SGD is how to connect the quantities at iteration $t$ with fixed variables, like the initial weights $W_{0,1}$, $W_{0,2}$ and the mini-batch size $b$. We overcome this challenge by carefully calculating the relationship between $g^b_{t+1,i}$ and $g^b_{t,i}$, $i = 1, 2$, so that we can reduce the iteration index $t$ step by step. With the help of Lemmas 8 and 9 in Appendix B.2, we can represent $g^b_{t+1,i}$, $i = 1, 2$, using multiplicative terms of $g^b_{t,i}$, $i = 1, 2$, and some other constant matrices. Theorem 3 gives the precise representation, in the form of a polynomial in $\frac{1}{b}$ whose coefficients are sums of multiplicative terms of parameter matrices $\{W^b_{0,1}, W^b_{0,2}\}$ and constant matrices $\{\bar{W}_1, \bar{W}_2\}$.

Theorem 3. Given $t \ge 0$, for any multiplicative terms $M_i$, $i \in [0:m]$, of parameter matrices $\{g^b_{t,1}, g^b_{t,2}\}$ and constant matrices $\{W^b_{t,1}, W^b_{t,2}, \bar{W}_1, \bar{W}_2\}$ with degrees $d_i$, respectively, we denote $M = \prod_{i=1}^m \mathrm{tr}(M_i)\, M_0$, $d = \sum_{i=0}^m d_i$, and $d' = \sum_{i=0}^m \big(\deg(W^b_{t,1}; M_i) + \deg(W^b_{t,2}; M_i)\big)$.
There exists a set of multiplicative terms $\{M^k_{ij}, i \in [m_k], j \in [0:m_{ki}], k \in [0:q]\}$ of parameter matrices $\{W^b_{0,1}, W^b_{0,2}\}$ and constant matrices $\{\bar{W}_1, \bar{W}_2\}$ such that $\mathbb{E}[M \mid \mathcal{F}_0] = N_0 + N_1 \frac{1}{b} + \cdots + N_q \frac{1}{b^q}$, where $N_k = \sum_{i=1}^{m_k} \prod_{j=1}^{m_{ki}} \mathrm{tr}(M^k_{ij})\, M^k_{i0}$, $k \in [0:q]$. Here $m_k$, $m_{ki}$, and $q \le \frac{1}{2}(3^{t+1} - 1)d + \frac{1}{2}(3^t - 1)d'$ are constants independent of $b$, and $\sum_{j=0}^{m_{ki}} \deg(M^k_{ij}) \le 3^t(3d + d')$.

As a special case of Theorem 3, Theorem 4 shows that the variance of the stochastic gradient estimators is also a polynomial in $\frac{1}{b}$, but with no constant term. This backs the important intuition that the variance is approximately inversely proportional to the mini-batch size $b$. Besides, note that if we let $b \to \infty$, intuitively we should have $\mathrm{var}(g^b_{t,i} \mid \mathcal{F}_0) \to 0$, $i = 1, 2$. This observation aligns with the statement of Theorem 4.

Theorem 4. Given $t \ge 0$, the value $\mathrm{var}(g^b_{t,i} \mid \mathcal{F}_0)$, $i = 1, 2$, can be written as a polynomial in $\frac{1}{b}$ with no constant term and degree at most $2 \cdot 3^{t+1}$. Formally, we have $\mathrm{var}(g^b_{t,i} \mid \mathcal{F}_0) = \beta_1 \frac{1}{b} + \cdots + \beta_r \frac{1}{b^r}$, where $r \le 2 \cdot 3^{t+1}$ and each $\beta_i$ is a constant independent of $b$.

One should note that the polynomial representation of $\mathrm{var}(g^b_{t,i} \mid \mathcal{F}_0)$, $i = 1, 2$, has no constant term. Therefore, to show that the variance is a decreasing function of $b$, we only need to show that the leading coefficient $\beta_1$ is non-negative. This is guaranteed by the fact that the variance is always non-negative. We therefore have Theorem 5.

Theorem 5. Given $t \in \mathbb{N}$, there exists a constant $b_0$ such that for all $b \ge b_0$, $\mathrm{var}(g^b_{t,i} \mid \mathcal{F}_0)$, $i = 1, 2$, is a decreasing function of $b$. The constant $b_0$ is the largest root of the equation $\beta_1 b^{r-1} + \beta_2 b^{r-2} + \cdots + \beta_r = 0$.

See the proof of Theorem 5 in Appendix B.2 for more details. Although we cannot calculate the precise value of $b_0$, we verify that $b_0$ is smaller than 1 in many experiments.
From the proofs we conclude that the scale of each $\beta_i$ is of the order $O(\|M\|)$, where $M$ is a multiplicative term of parameter matrices $\{W_{0,1}, W_{0,2}, \bar{W}_1, \bar{W}_2\}$ (with no constant matrices) of degree $2 \cdot 3^{t+1}$. Unlike the linear regression setting, where we can iteratively calculate the variance by Lemma 2, closed-form expressions for the variance of the stochastic gradients in the deep linear network setting are much harder to obtain. However, we are able to iteratively reduce $t$ one step at a time and provide a polynomial representation of any multiplicative term of parameter matrices $\{g^b_{t,i}, W^b_{t,i}, i = 1, 2\}$ and constant matrices $\{\bar{W}_1, \bar{W}_2\}$ using only the initial weights $W_{0,1}$, $W_{0,2}$ and the mini-batch size $b$. By further studying the polynomial representation of $\mathrm{var}(g^b_{t,i} \mid \mathcal{F}_0)$, $i = 1, 2$, we are able to show the decreasing property of the variance of the stochastic gradient estimators with respect to $b$.
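To illustrate equation (2) and the $1/b$ behavior in the simplest case $t = 0$ (where the weights are still fixed, so the gradient is an average of $b$ i.i.d. terms and its variance scales exactly as $1/b$), here is a small Monte Carlo sketch of ours; the layer sizes, seed, and run counts are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
p, p1, p2 = 3, 4, 2
W1, W2 = rng.normal(size=(p1, p)), rng.normal(size=(p2, p1))          # student
W1b, W2b = rng.normal(size=(p1, p)), rng.normal(size=(p2, p1))        # teacher
D = W2 @ W1 - W2b @ W1b            # the gap W^b_t from the text (here at t = 0)

def g1(b):
    """Stochastic gradient w.r.t. W1 for one fresh mini-batch of size b, eq. (2)."""
    X = rng.normal(size=(p, b))    # b i.i.d. N(0, I_p) samples as columns
    return W2.T @ D @ (X @ X.T) / b

def mc_var(b, runs=10_000):
    """Monte Carlo estimate of var(g^b_{0,1}) = sum of entrywise variances."""
    G = np.array([g1(b).ravel() for _ in range(runs)])
    return np.sum(G.var(axis=0))

# With fixed weights, averaging b i.i.d. terms divides the variance by b.
v1, v4 = mc_var(1), mc_var(4)
assert 3.0 < v1 / v4 < 5.0         # ratio ~ 4, up to Monte Carlo noise
```

For $t > 0$ the weights themselves depend on the earlier batches, which is exactly why Theorems 3-5 are needed: the variance becomes a higher-degree polynomial in $1/b$ rather than a single $1/b$ term.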

4. EXPERIMENTS

In this section, we present numerical results to support the theorems in Section 3 and provide further insights into the impact of the mini-batch size on the dynamics of SGD. The experiments are conducted on four datasets, and the models are relatively small due to the computational cost of using large models and datasets. We only report the results on the MNIST dataset here due to limited space; the complete empirical study is deferred to Appendix A. For all experiments, we perform mini-batch SGD multiple times starting from the same initial weights and following the same choice of the learning rates and other hyper-parameters, if applicable. This enables us to calculate the variance of the gradient estimators and other statistics in each iteration, where the randomness comes only from the different samples drawn by SGD.

4.1. RESULTS ON MNIST DATASET

The MNIST task is to recognize the digits in handwritten images. We use all 60,000 training samples and 10,000 validation samples of MNIST. We build a three-layer fully connected neural network with 1024, 512, and 10 neurons in the respective layers. For the two hidden layers, we use the ReLU activation function. The last layer is a softmax layer which gives the prediction probabilities for the 10 digits. We use mini-batch SGD to optimize the cross-entropy loss of the model. The model deviates from our analytical setting since it has non-linear activations, the cross-entropy loss function (instead of $L_2$), and the empirical loss (as opposed to the population loss). MNIST is selected due to its fast training and popularity in deep learning experiments. The goal is to verify the results in this different setting and to back up our hypotheses. As shown in Figure 1(a), we run SGD with the two batch sizes 64 and 128 from five different initial weights, with 50 runs for each initial point. The plot shows that even the smallest variance among the five initial weights with mini-batch size 64 is still larger than the largest variance with mini-batch size 128. We observe that the sensitivity to the initial weights is not large. This plot also empirically verifies our conjecture from the introduction that the variance of the stochastic gradient estimators is a decreasing function of the mini-batch size, for all iterations of SGD in a general deep learning model. In addition, we also conjecture that the decreasing property with respect to the mini-batch size holds for the expected loss, the error, and the generalization ability. Figure 1(b) shows that the expected loss (again, the randomness comes from different runs of SGD through different mini-batches with the same initial weights and learning rates) on the training set is a decreasing function of $b$.
However, this decreasing property does not hold on the validation set once the loss becomes stable or starts increasing, in other words, once the model starts over-fitting. We hypothesize that this is because the learned weights start to bounce around a local minimum when the model is over-fitting. As the larger mini-batch size brings smaller variance, the weights stay closer to the local minimum found by SGD and therefore yield a smaller loss function value. Following prior work (2013), we build a test set by distorting the 10,000 images of the validation set. The prediction accuracy is computed on both the training and test sets, and we calculate the gap between these two accuracies every 100 epochs. We use this gap to measure the model's generalization ability (the smaller the better). Figure 1(d) shows that the gap is an increasing function of $b$ starting at epoch 500, which partially aligns with our conjecture regarding the relationship between the generalization ability and the mini-batch size. We tested multiple choices of the hyper-parameters which control the degree of distortion in the test set, and the pattern remains clear.
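The multi-run protocol used throughout this section can be sketched on a synthetic regression problem (our toy stand-in, not the MNIST setup; all sizes, seeds, and rates are arbitrary): rerun SGD many times from the same initial weights and estimate the variance of the stochastic gradient at a fixed iteration for two batch sizes.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lr, T, runs = 64, 5, 0.05, 30, 400
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=n)
w0 = rng.normal(size=p)                    # shared initial weights for all runs

def grad_variance(b):
    """Estimate var(g^b | F_0) at iteration T by rerunning SGD `runs` times."""
    grads = []
    for _ in range(runs):
        w = w0.copy()
        for _ in range(T):
            B = rng.choice(n, size=b, replace=False)
            g = (X[B] @ w - y[B]) @ X[B] / b   # mini-batch gradient of L2 loss
            w = w - lr * g
        grads.append(g)                        # gradient used at the last step
    G = np.array(grads)
    return np.mean(np.sum(G ** 2, axis=1)) - np.sum(G.mean(axis=0) ** 2)

v_small, v_large = grad_variance(2), grad_variance(16)
assert v_small > v_large                       # smaller batch, larger variance
```

Holding $w_0$ and the learning rate fixed across runs is what makes the per-iteration variances comparable across batch sizes, mirroring the conditioning on $\mathcal{F}_0$ in the theory.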

5. SUMMARY AND FUTURE WORK

We examine the impact of the mini-batch size on the dynamics of SGD. Our focus is on the variance of the stochastic gradient estimators. For linear regression and a two-layer linear network, we are able to theoretically prove that the variance conjecture holds. We further experiment on multiple models and datasets to verify our claims and their applicability to practical settings. Besides, we also empirically address the conjectures about the expected loss and the generalization ability. A challenging research direction is to theoretically investigate the impact of the mini-batch size on the generalization ability. There are existing works studying the relationship between the variance of the stochastic gradients and the generalization ability (Gorbunov et al., 2020; Meng et al., 2016). Together with the tools developed herein, it may be possible to bridge the mini-batch size with the generalization ability of a neural network. We could further choose an optimal mini-batch size which maximizes the generalization ability by solving the polynomial equation, given more precise estimates of the coefficients. Another appealing direction is to use our variance estimates to develop better variance reduction methods. In such algorithms, the upper bound on the variance determines the convergence rate. Researchers usually assume a much larger upper bound at each iteration, such as a linear function of the norm of the full gradient. With the help of our techniques, we could calculate the variance more precisely and further improve these algorithms. Further interesting work is to extend our techniques to more sophisticated networks. Although the underlying model of this paper corresponds to deep linear networks, we are able to show a deeper relationship between the variance and the mini-batch size, namely the polynomial in $1/b$, while the common knowledge is simply that the variance is proportional to $1/b$.
The extension to other optimization algorithms, such as Adam and gradient boosting machines, is also very attractive. We hope our theoretical framework can serve as a tool for future research of this kind.

A EXPERIMENTS

In this section, we present numerical results to support the theorems in Section 3, to back up the hypotheses discussed in the introduction, and to provide further insights into the impact of the mini-batch size on the dynamics of SGD. The experiments are conducted on four datasets, with models that are kept relatively small due to the computational cost of using large models and datasets. Remark: We cannot present the complete numerical results in the main paper due to the space limit and therefore move the whole experimental section to the appendix. To keep the reading smooth, some of the content overlaps with Section 4.

A.1 DATASETS AND SETTINGS

For all experiments, we perform mini-batch SGD multiple times, starting from the same initial weights and following the same choice of the learning rates and other hyper-parameters, if applicable. This enables us to calculate the variance of the gradient estimators and other statistics in each iteration, where the randomness comes only from the different mini-batch samples drawn by SGD. The learning rate $\alpha_t$ is selected to be either inversely proportional to the iteration counter $t$ or fixed, depending on the task at hand. All models are implemented in PyTorch version 1.4 (Paszke et al., 2019) and trained on NVIDIA 2080Ti/1080 GPUs. We have also tested several other random initial weights, ground-truth weights, and learning rates; the results and conclusions are similar and therefore not presented.
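To make the protocol concrete, the following short Python sketch (our own illustration; the function name and the synthetic gradients are not part of the experiments) computes the per-iteration variance statistic from one stochastic gradient per run, using $\operatorname{var}(g) = \mathbb{E}\|g\|^2 - \|\mathbb{E} g\|^2$ with expectations replaced by averages over runs:

```python
import numpy as np

def empirical_gradient_variance(grads):
    """grads: array of shape (R, p), one stochastic gradient per SGD run,
    all recorded at the same iteration from the same initial weights."""
    grads = np.asarray(grads, dtype=float)
    mean_sq_norm = np.mean(np.sum(grads ** 2, axis=1))   # estimate of E||g||^2
    sq_norm_mean = np.sum(np.mean(grads, axis=0) ** 2)   # estimate of ||E g||^2
    return mean_sq_norm - sq_norm_mean

# Synthetic check: coordinates ~ N(1, 0.5^2), so the true variance is p * 0.25.
rng = np.random.default_rng(0)
g = rng.normal(loc=1.0, scale=0.5, size=(1000, 6))  # R = 1000 runs, p = 6
print(empirical_gradient_variance(g))
```

With 1,000 runs the estimate concentrates near the true value $6 \cdot 0.25 = 1.5$, which mirrors why a large number of runs is used in the experiments below.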

A.1.1 GRADUATE ADMISSION DATASET

The Graduate Admission dataset (Acharya et al., 2019) poses the task of predicting the chance of a graduate admission using linear regression. The dataset contains 500 samples with 6 features and is normalized by the mean and variance of each feature. This is a popular regression dataset with clean data. We build a linear regression model to predict the chance of acceptance (including an intercept term) and minimize the empirical $L_2$ loss using mini-batch SGD, as stated in Section 3.1. For the experiment in Figure 2(a), we randomly select an initial weight vector $w_0$ and run SGD for 2,000 iterations, by which point it appears to converge. We record all statistics at every iteration. There are in total 1,000 runs behind each observation, which yields a p-value lower than 0.05. For Figure 2(b), we select 20 different values of b and run SGD from the same initial point for 40 iterations; there are in total 200,000 runs to make sure the p-values of all statistics are lower than 0.05. In all experiments, the learning rate is chosen as $\alpha_t = \frac{1}{2t}$, $t \in [2000]$, because this rate yields a theoretical convergence guarantee (the factor 1/2 has been fine-tuned). The purpose of this experiment is to empirically study the rate of decrease of the variance: the theoretical study in Section 3.1 establishes the non-increasing property but says nothing about the rate of decrease.
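A minimal sketch of this setup (ours, on synthetic data rather than the Graduate Admission dataset; dimensions and run counts are illustrative) takes one SGD step from a shared initial weight vector and estimates the gradient variance across runs for two batch sizes, exhibiting the decreasing property:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = X @ w_true + 0.1 * rng.normal(size=n)
w0 = rng.normal(size=p)  # fixed initial weights shared by all runs

def stochastic_gradient(b):
    idx = rng.choice(n, size=b, replace=False)   # mini-batch without replacement
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w0 - yb) / b             # gradient of the mini-batch L2 loss

def grad_variance(b, runs=500):
    g = np.stack([stochastic_gradient(b) for _ in range(runs)])
    return np.mean(np.sum(g ** 2, axis=1)) - np.sum(np.mean(g, axis=0) ** 2)

v5, v50 = grad_variance(5), grad_variance(50)
print(v5, v50)  # variance with b = 50 is markedly smaller than with b = 5
```

The gap matches the factor $c_b = \frac{n-b}{b(n-1)}$ from the analysis: at the initial point the variance scales linearly in $c_b$.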

A.1.2 SYNTHETIC DATASET

We build a synthetic dataset of standard normal samples to study the setting in Section 3.2. We fix the teacher network with 64 input neurons, 256 hidden neurons and 128 output neurons. We optimize the population $L_2$ loss by updating the two parameter matrices of the student network using online SGD, as stated in Section 3.2. For this case we have proved the functional form of the variance as a function of b and shown the decreasing property of the variance of the stochastic gradient estimators for large mini-batch sizes; however, we have not shown the decreasing property for every b. With this experiment we confirm that the conjecture likely holds. We randomly select two initial weight matrices $W_{0,1}, W_{0,2}$ and the ground-truth weight matrices $W_1, W_2$. We run SGD for 1,000 iterations, which appears to be sufficient for convergence, with 1,000 runs of SGD in total to again obtain a p-value below 0.05. We record all statistics at every iteration. The learning rate is chosen as $\alpha_t = \frac{1}{10t}$, $t \in [1000]$, for the same reason as in the regression experiment.
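The student-teacher loop can be sketched as follows (our own minimal version: the dimensions are shrunk from 64/256/128, and the initial scales and batch size are illustrative choices, not the paper's). Online SGD draws fresh standard normal samples at every step; since the student is linear, the population $L_2$ loss equals the squared Frobenius norm of the weight mismatch, which we use to monitor progress exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_hid, d_out = 8, 16, 4
W1_bar = 0.1 * rng.normal(size=(d_hid, d_in))   # teacher weights
W2_bar = 0.1 * rng.normal(size=(d_out, d_hid))
W1 = 0.1 * rng.normal(size=(d_hid, d_in))       # student weights
W2 = 0.1 * rng.normal(size=(d_out, d_hid))

def population_loss(W1, W2):
    D = W2 @ W1 - W2_bar @ W1_bar
    return np.sum(D ** 2)  # equals E||(W2 W1 - W2_bar W1_bar) x||^2 for x ~ N(0, I)

loss_init = population_loss(W1, W2)
b = 64
for t in range(1, 301):
    lr = 1.0 / (10 * t)                          # the schedule alpha_t = 1/(10t)
    X = rng.normal(size=(b, d_in))               # fresh mini-batch (online SGD)
    R = X @ (W2 @ W1 - W2_bar @ W1_bar).T        # residuals, shape (b, d_out)
    g2 = R.T @ (X @ W1.T) / b                    # stochastic gradient wrt W2
    g1 = W2.T @ R.T @ X / b                      # stochastic gradient wrt W1
    W1, W2 = W1 - lr * g1, W2 - lr * g2
loss_final = population_loss(W1, W2)
print(loss_init, loss_final)
```

Repeating the loop from the same $W_{0,1}, W_{0,2}$ with different sample draws yields the per-iteration variance statistics reported in Figure 3.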

A.1.3 MNIST DATASET

The MNIST task is to recognize the digits in handwritten images. We use all 60,000 training samples and 10,000 validation samples of MNIST. The images are normalized by mapping each entry to $[-1, 1]$. We build a three-layer fully connected neural network with 1024, 512 and 10 neurons in the respective layers. The two hidden layers use the ReLU activation function, and the last layer is a softmax layer which gives the prediction probabilities for the 10 digits. We use mini-batch SGD to optimize the cross-entropy loss of the model. This model deviates from our analytical setting: it has non-linear activations, a cross-entropy loss function (instead of $L_2$), and an empirical loss (as opposed to a population loss). MNIST is selected due to its fast training and popularity in deep learning experiments; the goal is to verify the results in this different setting and to back up our hypotheses. We run SGD for 1,000 epochs on the training set, which is enough for convergence. The learning rate is a constant set to $3 \cdot 10^{-3}$ (which has been tuned). For the experiment in Figure 5, there are in total 100 runs, giving a p-value below 0.05. For the experiment in Figure 4(a), we randomly select five different initial points and perform 50 runs for each initial point. For the experiment corresponding to Figure 4(b), we choose $\alpha = 8$ and $\sigma = 2$ as in Simard et al. (2013). The initial weights and other hyper-parameters are the same as in Figure 5.
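For concreteness, a numpy sketch of the forward pass of this model (layer sizes 784-1024-512-10 follow the text; the initialization scheme and the random inputs are our own illustrative choices, and no training is performed here):

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = [784, 1024, 512, 10]
# He-style initialization (an assumption, not specified in the text).
weights = [rng.normal(scale=np.sqrt(2.0 / m), size=(m, k))
           for m, k in zip(sizes[:-1], sizes[1:])]

def forward(X):
    h = X
    for W in weights[:-1]:
        h = np.maximum(h @ W, 0.0)                   # ReLU hidden layers
    logits = h @ weights[-1]
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)          # softmax probabilities

def cross_entropy(P, y):
    return -np.mean(np.log(P[np.arange(len(y)), y] + 1e-12))

X = rng.uniform(-1.0, 1.0, size=(32, 784))  # images mapped to [-1, 1]
y = rng.integers(0, 10, size=32)
P = forward(X)
print(P.shape, float(cross_entropy(P, y)))
```

The softmax rows sum to one, and at a random initialization the cross-entropy is close to $\log 10$.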

A.1.4 YELP REVIEW DATASET

The Yelp Review dataset from the Yelp Dataset Challenge (Zhang et al., 2015) contains 1,569,264 samples of customer reviews with positive/negative sentiment labels. We use 10,000 samples as our training set and 1,000 samples as the validation set. We use XLNet (Yang et al., 2019) to perform sentiment classification on this dataset. Our XLNet has 6 layers, a hidden size of 384, and 12 attention heads, for a total of 35,493,122 parameters. We intentionally reduce the number of layers and the hidden size of XLNet and select a relatively small training and validation set, since training XLNet is very time-consuming (Yang et al. (2019) train on 512 TPU v3 chips for 5.5 days) and we need to train the model for multiple runs. This setting allows us to train our model in several hours on a single GPU card. We train the model using the Adam weight decay optimizer, along with some other techniques, as suggested in Table 8 of Yang et al. (2019). This dataset represents sequential data on which we further examine the hypotheses. We randomly select a set of initial parameters and run Adam with two different mini-batch sizes, 32 and 64. For computational tractability, each mini-batch size has a total of 100 runs and each run corresponds to 20 epochs. We record the variance of the stochastic gradient, the loss and the accuracy at every step of Adam; the statistics reported in Figure 6 are averaged over each epoch. In all experiments, the learning rate is set to $4 \cdot 10^{-5}$ and the parameter of Adam is set to $10^{-8}$ (these two have been tuned). The stochastic gradients of all parameter matrices are clipped with threshold 1 in each iteration. We use the learning rate warm-up strategy suggested in Yang et al. (2019). The maximum sequence length is set to 128, and sequences shorter than 128 are padded with zeros.
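The per-matrix gradient clipping used here can be sketched as follows (our own minimal clip-by-norm routine with threshold 1; the function name is illustrative):

```python
import numpy as np

def clip_by_norm(grad, threshold=1.0):
    """Rescale a gradient matrix so its Frobenius norm is at most threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g_large = np.full((4, 4), 2.0)   # Frobenius norm 8 > 1, gets rescaled
print(np.linalg.norm(clip_by_norm(g_large)))   # → 1.0
g_small = np.full((2, 2), 0.1)   # Frobenius norm 0.2 <= 1, left unchanged
print(np.linalg.norm(clip_by_norm(g_small)))
```

Clipping caps the magnitude while preserving the gradient direction, which is why it is applied per parameter matrix rather than entrywise.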

A.2 DISCUSSION

As observed in Figure 2(a), under the linear regression setting with the Graduate Admission dataset, the variances of the stochastic gradient estimators and of the full gradients are strictly decreasing functions of b for all iterations. This verifies the theorems in Section 3.1. Figure 2(b) further studies the rate of decrease of the variance. From the proofs in Section 3.1 we see that $\operatorname{var}(g_t^b \mid \mathcal{F}_0)$ is a polynomial in $\frac{1}{b}$ of degree $t+1$. Therefore, for every t, we can approximate this polynomial by sampling many different values of b and calculating the corresponding variances. We pick b to cover all numbers in $[2, 500]$ that are either a power of 2 or a multiple of 40 (there are 21 such values in total) and fit a polynomial of degree 6 (an estimate from the analyses) at $t = 10, 20, 30, 40$. Under the two-layer linear network setting with the synthetic dataset, Figure 3 verifies that the variances of the stochastic gradient estimators and of the full gradients are strictly decreasing functions of b for all iterations. This figure also empirically shows that the constant $b_0$ in Theorem 5 can be as small as $b_0 = 4$. In fact, we also experiment with mini-batch sizes of 1 and 2, and the decreasing property continues to hold. We further test multiple choices of initial weights and learning rates, and the pattern remains clear. In the two experiments above we use SGD in its original form, by randomly sampling mini-batches. In deep learning with large-scale training data such a strategy is computationally prohibitive, so samples are instead scanned in a cyclic order, which implies that fixed mini-batches are processed many times. Therefore, on the next two datasets we perform standard "epoch"-based training to empirically study the remaining two hypotheses discussed in the introduction (decreasing loss and error as functions of b) and the sensitivity with respect to the initial weights.
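The polynomial-fitting step can be sketched with `numpy.polyfit` (our own illustration: the "variances" below are synthetic values generated from a known polynomial in 1/b, so the recovered coefficients can be checked; the real experiment fits degree 6 to measured variances):

```python
import numpy as np

# Batch sizes covering powers of 2 and multiples of 40 in [2, 500].
bs = np.array([2, 4, 8, 16, 32, 40, 64, 80, 120, 128, 160, 200, 240, 256,
               280, 320, 360, 400, 440, 480, 500], dtype=float)
true_coeffs = [0.5, 2.0, 0.1]            # v(b) = 0.5/b^2 + 2/b + 0.1 (synthetic)
v = np.polyval(true_coeffs, 1.0 / bs)    # "measured" variances at each b
fit = np.polyfit(1.0 / bs, v, deg=2)     # least-squares polynomial in 1/b
print(fit)
```

Since the synthetic data lie exactly on a degree-2 polynomial in 1/b, the fit recovers the coefficients up to numerical error; with noisy empirical variances, the fitted coefficients estimate the polynomial established by the theory.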
Note that we use the cross-entropy loss on the MNIST dataset and the Adam optimizer on the Yelp dataset, so these experiments do not meet all of the assumptions of the analysis in Section 3. As shown in Figure 4(a), we run SGD with the two batch sizes 64 and 128 from five different initial weights. Even the smallest variance among the five initial weights with mini-batch size 64 is still larger than the largest variance with mini-batch size 128, and the sensitivity to the initial weights is not large. This plot thus also empirically supports our conjecture in the introduction that the variance of the stochastic gradient estimators is a decreasing function of the mini-batch size. In addition, we conjecture that the same decreasing property holds for the expected loss, the error and the generalization ability with respect to the mini-batch size. Figure 5(a) shows that the expected loss (where, again, the randomness comes from different runs of SGD through different mini-batches with the same initial weights and learning rates) on the training set is a decreasing function of b. However, this decreasing property does not hold on the validation set once the loss becomes stable or starts increasing, in other words, once the model starts to over-fit. We hypothesize that this is because the learned weights start to bounce around a local minimum when the model over-fits. Since a larger mini-batch size brings smaller variance, the weights stay closer to the local minimum found by SGD and therefore yield a smaller loss function value. As suggested by Simard et al. (2013), we build a test set by distorting the 10,000 images of the validation set. The prediction accuracy is computed on both the training and test sets, and we calculate the gap between these two accuracies every 100 epochs. We use this gap to measure the generalization ability of the model (the smaller the better).
Figure 4(b) shows that the gap is an increasing function of b starting at epoch 500, which partially aligns with our conjecture regarding the relationship between the generalization ability and the mini-batch size. We also test this on multiple choices of the hyper-parameters which control the degree of distortion in the test set, and the pattern remains clear. Figure 6 shows a similar phenomenon: the variance of the stochastic gradient estimators, as well as the expected loss and error on both the training and validation sets, are decreasing functions of b even when XLNet is trained with Adam. This example gives us confidence that the decreasing properties are not restricted to shallow neural networks or the vanilla SGD algorithm; they also appear in many advanced models and optimization methods.

B LEMMAS AND PROOFS

B.1 LEMMAS AND PROOFS OF RESULTS IN SECTION 3.1

For two matrices A, B with the same dimension, we define the inner product $\langle A, B\rangle \triangleq \operatorname{tr}(A^T B)$. Throughout the appendix, we use $C_n^k = \frac{n!}{k!(n-k)!}$ to denote the binomial coefficient.

Lemma 3. Suppose that $f(x)$ and $g(x)$ are both smooth, non-negative and decreasing functions of $x \in \mathbb{R}$. Then $h(x) = f(x)g(x)$ is also a non-negative and decreasing function of x.

Proof. It is clear that $h(x)$ is non-negative for all x. The first-order derivative of h is $h'(x) = f'(x)g(x) + f(x)g'(x) \le 0$, and thus $h(x)$ is also a decreasing function of x.

Proof of Lemma 1. Note that
$$
\mathbb{E}\left[g_t^b (g_t^b)^T \mid \mathcal{F}_t^b\right]
= \frac{1}{b^2}\,\mathbb{E}\left[\sum_{i\in B_t^b} \nabla L_i(w_t^b) \sum_{i\in B_t^b} \nabla L_i(w_t^b)^T \,\middle|\, \mathcal{F}_t^b\right]
= \frac{1}{b^2}\left(\frac{C_{n-1}^{b-1}}{C_n^b}\sum_{i=1}^n \nabla L_i(w_t^b)\nabla L_i(w_t^b)^T + \frac{C_{n-2}^{b-2}}{C_n^b}\sum_{i\neq j}\nabla L_i(w_t^b)\nabla L_j(w_t^b)^T\right)
$$
$$
= \frac{1}{b^2}\left(\frac{b}{n}\sum_{i=1}^n \nabla L_i(w_t^b)\nabla L_i(w_t^b)^T + \frac{b(b-1)}{n(n-1)}\sum_{i\neq j}\nabla L_i(w_t^b)\nabla L_j(w_t^b)^T\right).
$$
For any $A \in \mathbb{R}^{p\times p}$, we have
$$
\mathbb{E}\left[\|A g_t^b\|^2 \mid \mathcal{F}_t^b\right]
= \mathbb{E}\left[\operatorname{tr}\left((g_t^b)^T A^T A g_t^b\right)\mid\mathcal{F}_t^b\right]
= \operatorname{tr}\left(A^T A\, \mathbb{E}\left[g_t^b (g_t^b)^T\mid\mathcal{F}_t^b\right]\right)
= \frac{n-b}{bn(n-1)}\sum_{i=1}^n \|A\nabla L_i(w_t^b)\|^2 + \frac{(b-1)n}{b(n-1)}\|A\nabla L(w_t^b)\|^2
$$
$$
= c_b\left(\frac{1}{n}\sum_{i=1}^n \|A\nabla L_i(w_t^b)\|^2 - \|A\nabla L(w_t^b)\|^2\right) + \|A\nabla L(w_t^b)\|^2.
$$
Therefore, we have
$$
\operatorname{var}\left(A g_t^b \mid \mathcal{F}_t^b\right) = \mathbb{E}\left[\|Ag_t^b\|^2\mid\mathcal{F}_t^b\right] - \left\|\mathbb{E}\left[Ag_t^b\mid\mathcal{F}_t^b\right]\right\|^2 = \mathbb{E}\left[\|Ag_t^b\|^2\mid\mathcal{F}_t^b\right] - \|A\nabla L(w_t^b)\|^2 = c_b\left(\frac{1}{n}\sum_{i=1}^n \|A\nabla L_i(w_t^b)\|^2 - \|A\nabla L(w_t^b)\|^2\right).
$$

Lemma 4. For any set of square matrices $\{A_1, \dots, A_n\} \subset \mathbb{R}^{p\times p}$, if we denote $A = \sum_{i=1}^n A_i x_i x_i^T$, then we have
$$
\mathbb{E}\left[\Big\|\sum_{i=1}^n A_i \nabla L_i(w_{t+1}^b)\Big\|^2 \,\Big|\, \mathcal{F}_0\right] = \mathbb{E}\left[\Big\|\sum_{i=1}^n B_i \nabla L_i(w_t^b)\Big\|^2\,\Big|\,\mathcal{F}_0\right] + \alpha_t^2 \frac{c_b}{n^2}\sum_{k=1}^n\sum_{l=1}^n \mathbb{E}\left[\Big\|\sum_{i=1}^n B_i^{kl}\nabla L_i(w_t^b)\Big\|^2\,\Big|\,\mathcal{F}_0\right].
$$
Here $B_i = A_i - \frac{\alpha_t}{n} A$; $B_i^{kl} = A$ if $i = k$, $i \neq l$; $B_i^{kl} = -A$ if $i = l$, $i \neq k$; and $B_i^{kl}$ equals the zero matrix otherwise.
Proof of Lemma 4. Let $C_i = x_i x_i^T$ and $C = \frac{1}{n}\sum_{i=1}^n C_i$. For the given $A_1, \dots, A_n$, denote $A = \sum_{i=1}^n A_i C_i$. Since $\nabla L_i(w_{t+1}^b) = (x_i^T w_{t+1}^b - y_i) x_i$ and $w_{t+1}^b = w_t^b - \alpha_t g_t^b$, we have $\sum_{i=1}^n A_i \nabla L_i(w_{t+1}^b) = \sum_{i=1}^n A_i \nabla L_i(w_t^b) - \alpha_t A g_t^b$, and therefore
$$
\mathbb{E}\left[\Big\|\sum_{i=1}^n A_i \nabla L_i(w_{t+1}^b)\Big\|^2 \,\Big|\, \mathcal{F}_0\right]
= \mathbb{E}\left[\mathbb{E}\left[\Big\|\sum_{i=1}^n A_i \nabla L_i(w_t^b) - \alpha_t A g_t^b\Big\|^2 \,\Big|\, \mathcal{F}_t^b\right] \,\Big|\, \mathcal{F}_0\right]
$$
$$
= \mathbb{E}\left[\Big\|\sum_{i=1}^n A_i \nabla L_i(w_t^b)\Big\|^2 \,\Big|\, \mathcal{F}_0\right] - 2\alpha_t\, \mathbb{E}\left[\Big\langle \sum_{i=1}^n A_i \nabla L_i(w_t^b),\, A\nabla L(w_t^b)\Big\rangle \,\Big|\, \mathcal{F}_0\right]
+ \alpha_t^2\, \mathbb{E}\left[c_b\left(\frac{1}{n}\sum_{i=1}^n \|A\nabla L_i(w_t^b)\|^2 - \|A\nabla L(w_t^b)\|^2\right) + \|A\nabla L(w_t^b)\|^2 \,\Big|\, \mathcal{F}_0\right]
$$
(using $\mathbb{E}[g_t^b \mid \mathcal{F}_t^b] = \nabla L(w_t^b)$ and Lemma 1 for the last term)
$$
= \mathbb{E}\left[\Big\|\sum_{i=1}^n A_i \nabla L_i(w_t^b) - \alpha_t A\nabla L(w_t^b)\Big\|^2 \,\Big|\, \mathcal{F}_0\right] + \alpha_t^2 c_b\, \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n \|A\nabla L_i(w_t^b)\|^2 - \|A\nabla L(w_t^b)\|^2 \,\Big|\, \mathcal{F}_0\right]
$$
$$
= \mathbb{E}\left[\Big\|\sum_{i=1}^n \Big(A_i - \frac{\alpha_t}{n} A\Big) \nabla L_i(w_t^b)\Big\|^2 \,\Big|\, \mathcal{F}_0\right] + \alpha_t^2 \frac{c_b}{n^2}\sum_{k=1}^n\sum_{l=1}^n \mathbb{E}\left[\left\|A\nabla L_k(w_t^b) - A\nabla L_l(w_t^b)\right\|^2 \,\Big|\, \mathcal{F}_0\right].
$$
Therefore, setting $B_i = A_i - \frac{\alpha_t}{n} A$ and
$$
B_i^{kl} = \begin{cases} A & i = k,\ i \neq l, \\ -A & i = l,\ i \neq k, \\ 0 & \text{otherwise}, \end{cases}
$$
we obtain the statement of the lemma.

Proof of Theorem 1. We use induction to show this statement.
When $t = 0$, $\mathbb{E}\big[\|\sum_{i=1}^n A_i \nabla L_i(w_t^b)\|^2 \mid \mathcal{F}_0\big] = \|\sum_{i=1}^n A_i \nabla L_i(w_0)\|^2$, which does not depend on b and is therefore trivially a (non-increasing) decreasing function of b. Suppose the statement holds for t. For any set of matrices $\{A_1, \dots, A_n\} \subset \mathbb{R}^{p\times p}$, by Lemma 4 we know that there exist matrices $\{B_1, \dots, B_n\}$ and $\{B_i^{kl} : i, k, l \in [n]\}$ such that
$$
\mathbb{E}\left[\Big\|\sum_{i=1}^n A_i \nabla L_i(w_{t+1}^b)\Big\|^2 \,\Big|\, \mathcal{F}_0\right] = \mathbb{E}\left[\Big\|\sum_{i=1}^n B_i \nabla L_i(w_t^b)\Big\|^2\,\Big|\,\mathcal{F}_0\right] + \alpha_t^2 \frac{c_b}{n^2}\sum_{k=1}^n\sum_{l=1}^n \mathbb{E}\left[\Big\|\sum_{i=1}^n B_i^{kl}\nabla L_i(w_t^b)\Big\|^2\,\Big|\,\mathcal{F}_0\right].
$$
By induction, $\mathbb{E}\big[\|\sum_{i=1}^n B_i \nabla L_i(w_t^b)\|^2 \mid \mathcal{F}_0\big]$ and all $\mathbb{E}\big[\|\sum_{i=1}^n B_i^{kl} \nabla L_i(w_t^b)\|^2 \mid \mathcal{F}_0\big]$ are non-negative and decreasing functions of b. Besides, $\alpha_t^2 c_b / n^2 = \alpha_t^2 \frac{n-b}{bn^2(n-1)}$ is clearly a non-negative and decreasing function of b. By Lemma 3, each term $\alpha_t^2 \frac{c_b}{n^2}\,\mathbb{E}\big[\|\sum_{i=1}^n B_i^{kl} \nabla L_i(w_t^b)\|^2 \mid \mathcal{F}_0\big]$ is also a non-negative and decreasing function of b. Finally, $\mathbb{E}\big[\|\sum_{i=1}^n A_i \nabla L_i(w_{t+1}^b)\|^2 \mid \mathcal{F}_0\big]$, as a sum of non-negative and decreasing functions of b, is a non-negative and decreasing function of b.

In order to prove Theorem 2, we split the task into two separate statements, about the full gradient and about the stochastic gradient, and prove them one by one.

Theorem 6. Fixing initial weights $w_0$, $\operatorname{var}(B\nabla L(w_t^b) \mid \mathcal{F}_0)$ is a decreasing function of the mini-batch size b for all $b \in [n]$, $t \in \mathbb{N}$, and all square matrices $B \in \mathbb{R}^{p\times p}$.

Theorem 7. Fixing initial weights $w_0$, $\operatorname{var}(Bg_t^b \mid \mathcal{F}_0)$ is a decreasing function of the mini-batch size b for all $b \in [n]$, $t \in \mathbb{N}$, and all square matrices $B \in \mathbb{R}^{p\times p}$.

Proof of Theorem 6. We induct on t. For $t = 0$, we have $\operatorname{var}(B\nabla L(w_0) \mid \mathcal{F}_0) = 0$ for any matrix B. Suppose the statement holds for $t - 1 \ge 0$.
Note that from
$$
\nabla L(w_t^b) = \frac{1}{n}\sum_{i=1}^n x_i\left(x_i^T w_t^b - y_i\right) = \frac{1}{n}\sum_{i=1}^n x_i\left(x_i^T w_{t-1}^b - y_i\right) - \frac{\alpha_t}{n}\sum_{i=1}^n x_i x_i^T g_{t-1}^b = \nabla L(w_{t-1}^b) - \alpha_t C g_{t-1}^b,
$$
we have
$$
\operatorname{var}\left(B\nabla L(w_t^b) \mid \mathcal{F}_0\right) = \operatorname{var}\left(B\nabla L(w_{t-1}^b) - \alpha_t BC g_{t-1}^b \mid \mathcal{F}_0\right)
= \mathbb{E}\left[\left\|B\nabla L(w_{t-1}^b) - \alpha_t BC g_{t-1}^b\right\|^2 \mid \mathcal{F}_0\right] - \left\|\mathbb{E}\left[B\nabla L(w_{t-1}^b) - \alpha_t BC g_{t-1}^b \mid \mathcal{F}_0\right]\right\|^2
$$
$$
= \mathbb{E}\left[\left\|B\nabla L(w_{t-1}^b)\right\|^2 \mid \mathcal{F}_0\right] - 2\alpha_t\, \mathbb{E}\left[\left\langle B\nabla L(w_{t-1}^b),\, BC\nabla L(w_{t-1}^b)\right\rangle \mid \mathcal{F}_0\right]
+ \alpha_t^2\, \mathbb{E}\left[c_b\left(\frac{1}{n}\sum_{i=1}^n \|BC\nabla L_i(w_{t-1}^b)\|^2 - \|BC\nabla L(w_{t-1}^b)\|^2\right) + \|BC\nabla L(w_{t-1}^b)\|^2 \,\middle|\, \mathcal{F}_0\right]
- \left\|\mathbb{E}\left[B(I - \alpha_t C)\nabla L(w_{t-1}^b) \mid \mathcal{F}_0\right]\right\|^2 \tag{3}
$$
$$
= \operatorname{var}\left(B(I - \alpha_t C)\nabla L(w_{t-1}^b) \mid \mathcal{F}_0\right) + \alpha_t^2 c_b\left(\frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\|BC\nabla L_i(w_{t-1}^b)\|^2 \mid \mathcal{F}_0\right] - \mathbb{E}\left[\|BC\nabla L(w_{t-1}^b)\|^2 \mid \mathcal{F}_0\right]\right)
$$
$$
= \operatorname{var}\left(B(I - \alpha_t C)\nabla L(w_{t-1}^b) \mid \mathcal{F}_0\right) + \alpha_t^2 \frac{c_b}{n^2}\sum_{i\neq j}\mathbb{E}\left[\left\|BC\nabla L_i(w_{t-1}^b) - BC\nabla L_j(w_{t-1}^b)\right\|^2 \mid \mathcal{F}_0\right], \tag{4}
$$
where (3) is by Lemma 1 (conditioning on $\mathcal{F}_{t-1}^b$ and using the tower property). By induction, the first term of (4) is a decreasing function of b. Taking $A_i = BC$, $A_j = -BC$ and $A_k = 0$ for $k \in [n]\setminus\{i, j\}$ in Theorem 1, we know that $\mathbb{E}\big[\|BC\nabla L_i(w_{t-1}^b) - BC\nabla L_j(w_{t-1}^b)\|^2 \mid \mathcal{F}_0\big]$ is also a decreasing function of b. Note that $\alpha_t^2 c_b / n^2$ decreases as b increases. By Lemma 3, the second term of (4) is a decreasing function of b as well, and hence we have completed the induction.

Proof of Theorem 7.
We have
$$
\operatorname{var}\left(Bg_t^b \mid \mathcal{F}_0\right) = \mathbb{E}\left[\|Bg_t^b\|^2 \mid \mathcal{F}_0\right] - \left\|\mathbb{E}\left[Bg_t^b \mid \mathcal{F}_0\right]\right\|^2 = \mathbb{E}\left[\mathbb{E}\left[\|Bg_t^b\|^2 \mid \mathcal{F}_t^b\right] \mid \mathcal{F}_0\right] - \left\|\mathbb{E}\left[\mathbb{E}\left[Bg_t^b \mid \mathcal{F}_t^b\right] \mid \mathcal{F}_0\right]\right\|^2
$$
$$
= c_b\left(\frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\|B\nabla L_i(w_t^b)\|^2 \mid \mathcal{F}_0\right] - \mathbb{E}\left[\|B\nabla L(w_t^b)\|^2 \mid \mathcal{F}_0\right]\right) + \mathbb{E}\left[\|B\nabla L(w_t^b)\|^2 \mid \mathcal{F}_0\right] - \left\|\mathbb{E}\left[B\nabla L(w_t^b) \mid \mathcal{F}_0\right]\right\|^2
$$
$$
= \frac{c_b}{n^2}\sum_{i\neq j}\mathbb{E}\left[\left\|B\nabla L_i(w_t^b) - B\nabla L_j(w_t^b)\right\|^2 \mid \mathcal{F}_0\right] + \operatorname{var}\left(B\nabla L(w_t^b) \mid \mathcal{F}_0\right).
$$
Taking $A_i = B$, $A_j = -B$ and $A_k = 0$ for $k \in [n]\setminus\{i, j\}$ in Theorem 1, we know that $\mathbb{E}\big[\|B\nabla L_i(w_t^b) - B\nabla L_j(w_t^b)\|^2 \mid \mathcal{F}_0\big]$ is a non-negative and decreasing function of b for all $i, j \in [n]$. By Theorem 6, $\operatorname{var}(B\nabla L(w_t^b) \mid \mathcal{F}_0)$ is also a decreasing function of b. Therefore $\operatorname{var}(Bg_t^b \mid \mathcal{F}_0)$, as the sum of two decreasing functions of b, is also a decreasing function of b.

Proof of Corollary 1. Simply taking $B = I_p$ in Theorem 2 (i.e., in Theorems 6 and 7) yields the proof.
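The identity from the proof of Lemma 1, $\operatorname{var}(g_t^b \mid \mathcal{F}_t^b) = c_b\big(\frac{1}{n}\sum_i \|\nabla L_i\|^2 - \|\nabla L\|^2\big)$ with $c_b = \frac{n-b}{b(n-1)}$, can be checked numerically by enumerating all $C_n^b$ mini-batches (our own sanity check, not part of the paper; the per-sample gradients are synthetic):

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n, p, b = 8, 3, 3
grads = rng.normal(size=(n, p))          # per-sample gradients ∇L_i at a fixed iterate
full = grads.mean(axis=0)                # full gradient ∇L

# Exact variance of the mini-batch gradient under uniform sampling
# without replacement, by enumerating every batch of size b.
batch_means = np.array([grads[list(S)].mean(axis=0)
                        for S in itertools.combinations(range(n), b)])
var_exact = (np.mean(np.sum(batch_means ** 2, axis=1))
             - np.sum(batch_means.mean(axis=0) ** 2))

c_b = (n - b) / (b * (n - 1))            # the constant from the proof
var_formula = c_b * (np.mean(np.sum(grads ** 2, axis=1)) - np.sum(full ** 2))
print(var_exact, var_formula)            # the two values coincide
```

The agreement reflects the classical finite-population variance of a sample mean without replacement, which is exactly what the $C_{n-1}^{b-1}/C_n^b$ and $C_{n-2}^{b-2}/C_n^b$ counting in the proof encodes.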

B.2 PROOFS FOR RESULTS IN SECTION 3.2

Remark. We often rely on the trivial facts that $x_1 x_2^T = x_1 I_p x_2^T$ and $x_1 x_2^T x_3 x_4^T = x_1 x_2^T I_p x_3 x_4^T$.

Lemma 5. Let M be a multiplicative term of parameter matrices $\{u_i v_i^T : u_i, v_i \in \mathbb{R}^p, i \in [n_1]\} \cup \{A_j \in \mathbb{R}^{p\times p} : j \in [n_2]\}$ and constant matrix $\{I_p\}$ such that $\deg(u_1 v_1^T; M) \ge 1$. Then $\operatorname{tr}(M) = v_1^T M' u_1$, where $M'$ is a multiplicative term of the same parameter matrices and constant matrix $\{I_p\}$ such that $\deg(M) = \deg(M') + 1$, $\deg(A_j; M) = \deg(A_j; M')$ for $j \in [n_2]$, $\deg(u_i v_i^T; M) = \deg(u_i v_i^T; M')$ for $i \in [2:n_1]$, and $\deg(u_1 v_1^T; M) = \deg(u_1 v_1^T; M') + 1$.

Proof. By the definition of multiplicative terms, there exist two multiplicative terms $M_1, M_2$ of the same parameter matrices and constant matrix $\{I_p\}$ such that $M = M_1 u_1 v_1^T M_2$, where $\deg(M) = \deg(M_1) + \deg(M_2) + 1$, $\deg(A_j; M) = \deg(A_j; M_1) + \deg(A_j; M_2)$ for $j \in [n_2]$, $\deg(u_i v_i^T; M) = \deg(u_i v_i^T; M_1) + \deg(u_i v_i^T; M_2)$ for $i \in [2:n_1]$, and $\deg(u_1 v_1^T; M) = \deg(u_1 v_1^T; M_1) + \deg(u_1 v_1^T; M_2) + 1$. Therefore we have
$$\operatorname{tr}(M) = \operatorname{tr}\left(M_1 u_1 v_1^T M_2\right) = \operatorname{tr}\left(v_1^T M_2 M_1 u_1\right) = v_1^T M_2 M_1 u_1.$$
Note that $M' = M_2 M_1$ satisfies $\deg(M') = \deg(M_1) + \deg(M_2)$ together with the stated degree identities, which finishes the proof.

The following two lemmas concern the expectation of products of quadratic forms of standard normal samples. Lemma 6 treats a single sample, while Lemma 7 treats the same form with b i.i.d. samples drawn from the standard normal distribution.

Lemma 6. Given matrices $A_j \in \mathbb{R}^{p\times p}$, $j \in [m-1]$, we have
$$\mathbb{E}_{x \sim N(0, I_p)}\left[x x^T A_1 x x^T A_2 \cdots A_{m-1} x x^T\right] = \sum_{i=1}^{N_m} \prod_{k=1}^{n_i} \operatorname{tr}(M_{ik})\, M_{i0},$$
where $N_m$ and $n_i$, $i \in [N_m]$, are constants depending on m, and $\{M_{ik} : k \in [0:n_i], i \in [N_m]\}$ are multiplicative terms of parameter matrices $\{A_j : j \in [m-1]\}$ and constant matrix $\{I_p\}$. Furthermore, for every $i \in [N_m]$ we have $\sum_{k=0}^{n_i} \deg(A_j; M_{ik}) = 1$ for each $j \in [m-1]$, and therefore $\sum_{k=0}^{n_i} \deg(M_{ik}) = m - 1$.

Proof. See Magnus (1978).

Lemma 7. We are given matrices $A_j \in \mathbb{R}^{p\times p}$, $j \in [m-1]$, and random vectors $x_i$, $i \in [b]$, drawn independently and identically from $N(0, I_p)$. We assume that the multi-set $S = \{i_j, i'_j : j \in [m]\}$ satisfies that every $i \in S$ is an element of $[b]$ and the number of appearances of i in S is even. Then
$$\mathbb{E}_{x_i \sim N(0, I_p)}\left[x_{i_1} x_{i'_1}^T A_1 x_{i_2} x_{i'_2}^T A_2 \cdots A_{m-1} x_{i_m} x_{i'_m}^T\right] = \sum_{i=1}^{N_m} \prod_{k=1}^{n_i} \operatorname{tr}(M_{ik})\, M_{i0}, \tag{5}$$
where $N_m$ and $n_i$ are constants depending on m (and independent of b) and $M_{ik}$, $k \in [0:n_i]$, $i \in [N_m]$, are multiplicative terms of parameter matrices $\{A_j : j \in [m-1]\}$ and constant matrix $\{I_p\}$. Furthermore, for every $i \in [N_m]$ we have $\sum_{k=0}^{n_i} \deg(A_j; M_{ik}) = 1$ for each $j \in [m-1]$, and therefore $\sum_{k=0}^{n_i} \deg(M_{ik}) = m - 1$.

Proof. Let $\beta_i$, $i \in [b]$, be the number of appearances of i in S, which are even by assumption. We induct on the quantity $N = \sum_{i=1}^b \mathbb{1}\{\beta_i \neq 0\}$. For the base case $N = 1$, all elements of the multi-set S have the same value; without loss of generality, $i_j = i'_j = 1$ for $j \in [m]$. Then
$$\mathbb{E}_{x_i \sim N(0, I_p)}\left[x_{i_1} x_{i'_1}^T A_1 x_{i_2} x_{i'_2}^T \cdots A_{m-1} x_{i_m} x_{i'_m}^T\right] = \mathbb{E}_{x_1 \sim N(0, I_p)}\left[x_1 x_1^T A_1 x_1 x_1^T \cdots A_{m-1} x_1 x_1^T\right],$$
which is the statement of Lemma 6. Suppose the statement holds for $N \ge 1$, and consider the case $N + 1$. Note that $x_{i'_j}^T A_j x_{i_{j+1}} = x_{i_{j+1}}^T A_j^T x_{i'_j}$ is a scalar, so it can be moved around without changing the value of the expression² (flips replace $A_j$ by $A_j^T$, which we identify with $A_j$ in the degree counting). We distinguish two cases.

• Let $i_1 \neq i'_m$.
Without loss of generality, we assume $i_1 = 1$. We can always reorder the scalar factors $x_{i'_j}^T A_j x_{i_{j+1}}$, $j \in [m-1]$ (flipping them to $x_{i_{j+1}}^T A_j^T x_{i'_j}$ if necessary), so that all appearances of $x_1$ take the form $x_1 x_1^T$:
$$
x_{i_1} x_{i'_1}^T A_1 x_{i_2} x_{i'_2}^T A_2 \cdots A_{m-1} x_{i_m} x_{i'_m}^T = x_1 \left(x_{i'_1}^T A_1 x_{i_2} x_{i'_2}^T A_2 \cdots A_{m-1} x_{i_m}\right) x_{i'_m}^T = x_1 x_1^T \tilde A_1 x_1 x_1^T \tilde A_2 \cdots \tilde A_{\frac{\beta_1}{2}-1} x_1 x_1^T \tilde A_{\frac{\beta_1}{2}}\, \tilde x\, x_{i'_m}^T,
$$
where $\tilde x \in \{x_i : i \in [b]\}$, $\tilde x \neq x_1$, and the $\tilde A_i$'s are multiplicative terms of parameter matrices $\{x_u x_v^T : u, v \in [2:b]\} \cup \{A_j : j \in [m-1]\}$ and constant matrix $\{I_p\}$ such that $\sum_{u,v\in[2:b]} \sum_{k=1}^{\beta_1/2} \deg(x_u x_v^T; \tilde A_k) = m - \frac{\beta_1}{2} - 1$ and $\sum_{k=1}^{\beta_1/2} \deg(A_j; \tilde A_k) = 1$ for $j \in [m-1]$.³ Applying Lemma 6 to the expectation over $x_1$ and the law of iterated expectations, we have
$$
\mathbb{E}_{x_i \sim N(0, I_p)}\left[x_{i_1} x_{i'_1}^T A_1 \cdots A_{m-1} x_{i_m} x_{i'_m}^T\right] = \mathbb{E}_{x_2, \dots, x_b}\left[\left(\sum_{i=1}^{N_m} \prod_{k=1}^{n_i} \operatorname{tr}(M_{ik})\, M_{i0}\right) \tilde A_{\frac{\beta_1}{2}}\, \tilde x\, x_{i'_m}^T\right] = \sum_{i=1}^{N_m} \mathbb{E}_{x_2, \dots, x_b}\left[\left(\prod_{k=1}^{n_i} \operatorname{tr}(M_{ik})\right) M_{i0}\, \tilde A_{\frac{\beta_1}{2}}\, \tilde x\, x_{i'_m}^T\right],
$$
where $N_m$ and $n_i$ are constants depending on m (and independent of b), and the $M_{ik}$, $k \in [0:n_i]$, $i \in [N_m]$, are multiplicative terms of parameter matrices $\{\tilde A_j : j \in [\frac{\beta_1}{2} - 1]\}$ and constant matrix $\{I_p\}$ with $\sum_{k=0}^{n_i} \deg(\tilde A_j; M_{ik}) = 1$ for each j, and therefore $\sum_{k=0}^{n_i} \deg(M_{ik}) = \frac{\beta_1}{2} - 1$. Combining this with the definition of the $\tilde A_j$'s, the $M_{ik}$ are multiplicative terms of parameter matrices $\{x_u x_v^T : u, v \in [2:b]\} \cup \{A_j : j \in [m-1]\}$ and constant matrix $\{I_p\}$ such that for every $i \in [N_m]$, $\sum_{u,v\in[2:b]} \sum_{k=0}^{n_i} \deg(x_u x_v^T; M_{ik}) = m - \frac{\beta_1}{2} - 1$ and $\sum_{k=0}^{n_i} \deg(A_j; M_{ik}) = 1$ for $j \in [m-1]$. Applying Lemma 5, for every $k \in [0:n_i]$ and every $i \in [N_m]$, there exist $u_{ik}, v_{ik} \in \{x_j : j \in [2:b]\}$ and a multiplicative term $M'_{ik}$ of parameter matrices $\{x_u x_v^T : u, v \in [2:b]\} \cup \{A_j : j \in [m-1]\}$ and constant matrix $\{I_p\}$ such that $\operatorname{tr}(M_{ik}) = u_{ik}^T M'_{ik} v_{ik}$. Therefore, we have
$$
\left(\prod_{k=1}^{n_i} \operatorname{tr}(M_{ik})\right) M_{i0}\, \tilde A_{\frac{\beta_1}{2}}\, \tilde x\, x_{i'_m}^T = M_{i0}\, \tilde A_{\frac{\beta_1}{2}}\, \tilde x \prod_{k=1}^{n_i} \left(u_{ik}^T M'_{ik} v_{ik}\right) x_{i'_m}^T \;\eqqcolon\; U_i.
$$
In other words, for every $i \in [N_m]$, $U_i$ has the form $\hat A_0\, x_{\hat i_1} x_{\hat i'_1}^T \hat A_1\, x_{\hat i_2} x_{\hat i'_2}^T \cdots \hat A_{m'-1}\, x_{\hat i_{m'}} x_{i'_m}^T \hat A_{m'}$ with no appearance of $x_1$. Here the $x_{\hat i_j}, x_{\hat i'_j}$ belong to $\{x_j : j \in [2:b]\}$, and the $\hat A_i$, $i \in [0:m']$, are multiplicative terms of parameter matrices $\{A_j : j \in [m-1]\}$ and constant matrix $\{I_p\}$ with $\sum_i \deg(A_j; \hat A_i) = 1$ for every $j \in [m-1]$ (here we use the liberty of adding identity matrices whenever more than two consecutive x's appear). Since the number of distinct indices has been reduced from $N + 1$ to $N$, we can apply the induction hypothesis to $x_{\hat i_1} x_{\hat i'_1}^T \hat A_1 \cdots \hat A_{m'-1} x_{\hat i_{m'}} x_{i'_m}^T$ and finish the proof of this case.

² For example, $x_{i_1} x_{i'_1}^T A_1 x_{i_2} x_{i'_2}^T A_2 x_{i_3} x_{i'_3}^T = x_{i_1}\big(x_{i'_1}^T A_1 x_{i_2}\big)\big[x_{i'_2}^T A_2 x_{i_3}\big] x_{i'_3}^T = x_{i_1}\big[x_{i'_2}^T A_2 x_{i_3}\big]\big(x_{i'_1}^T A_1 x_{i_2}\big) x_{i'_3}^T = x_{i_1}\big[x_{i'_2}^T \big(x_{i'_1}^T A_1 x_{i_2}\big) A_2 x_{i_3}\big] x_{i'_3}^T = x_{i_1}\big[x_{i'_2}^T A_2 \big(x_{i'_1}^T A_1 x_{i_2}\big) x_{i_3}\big] x_{i'_3}^T$.

³ For example, $x_1 x_2^T A_1 x_1 x_1^T A_2 x_3 x_3^T A_3 x_1 x_2^T = x_1\big(x_2^T A_1 x_1\big)\big[x_1^T A_2 x_3\big]\big\{x_3^T A_3 x_1\big\} x_2^T = x_1\big(x_1^T A_1^T x_2\big)\big[x_3^T A_2^T x_1\big]\big\{x_1^T A_3^T x_3\big\} x_2^T = x_1 x_1^T \tilde A_1 x_1 x_1^T \tilde A_2\, \tilde x\, x_2^T$, where $\tilde A_1 = A_1^T x_2 x_3^T A_2^T$, $\tilde A_2 = A_3^T$ and $\tilde x = x_3$. Besides, $m = 4$ and $\beta_1 = 4$, so the degrees of $x_u x_v^T$ in all $\tilde A_k$ sum to $m - \frac{\beta_1}{2} - 1 = 1$.
The two constant matrices $\hat A_0$ and $\hat A_{m'}$ do not change the result of the expectation, since $\mathbb{E}\big[\hat A_0 X \hat A_{m'}\big] = \hat A_0\, \mathbb{E}[X]\, \hat A_{m'}$.

• If $i_1 = i'_m$, without loss of generality we assume $i'_1 = 1$ and $i'_1 \neq i_1$ (note that all scalar factors $x_{i'_j}^T A_j x_{i_{j+1}}$, $j \in [m-1]$, are interchangeable, and there is at least one element of S not equal to $i_1$). We change the order of the factors (flipping them where necessary) so that all appearances of $x_1$ take the consecutive form $x_1 x_1^T$:
$$
x_{i_1} x_{i'_1}^T A_1 x_{i_2} x_{i'_2}^T A_2 \cdots A_{m-1} x_{i_m} x_{i'_m}^T = x_{i_1}\left(x_{i'_1}^T A_1 x_{i_2} x_{i'_2}^T A_2 \cdots A_{m-1} x_{i_m}\right) x_{i'_m}^T = x_{i_1}\left(\tilde x_1^T \tilde A_0 \left[x_1 x_1^T \tilde A_1 \cdots \tilde A_{\frac{\beta_1}{2}-1} x_1 x_1^T\right] \tilde A_{\frac{\beta_1}{2}}\, \tilde x_2\right) x_{i'_m}^T,
$$
where $\tilde x_1, \tilde x_2 \in \{x_i : i \in [b]\}$, $\tilde x_1, \tilde x_2 \neq x_1$, and the $\tilde A_i$'s are multiplicative terms of parameter matrices $\{x_u x_v^T : u, v \in [2:b]\} \cup \{A_j : j \in [m-1]\}$ and constant matrix $\{I_p\}$ such that
$$\sum_{u,v\in[2:b]} \sum_{k=0}^{\beta_1/2} \deg(x_u x_v^T; \tilde A_k) = m - \frac{\beta_1}{2} - 2 \quad \text{and} \quad \sum_{k=0}^{\beta_1/2} \deg(A_j; \tilde A_k) = 1, \; j \in [m-1].$$
The remaining reasoning is the same as in the previous case.

Remark. If one of the numbers of appearances $\beta_j$ of $x_j$, $j \in [b]$, is odd, then it is easy to see that the result in (5) is the zero matrix.

As pointed out in Section 1, the difficulty of studying the dynamics of SGD lies in connecting the quantities at iteration t with fixed quantities such as the initial weights $W_{0,1}, W_{0,2}$ and the mini-batch size b. We overcome this challenge with the following two lemmas. Lemma 8 provides the relationship between $g_{t,i}^b$, $i = 1, 2$, and $W_{t,i}^b$, $i = 1, 2$, by taking the expectation over the distribution of the random samples in $B_t^b$. Lemma 9 shows the relationship between $W_{t,i}^b$, $i = 1, 2$, and $g_{t-1,i}^b$, $i = 1, 2$, using (1) and (2).

Lemma 8. Let $M = \prod_{i=1}^m \operatorname{tr}(M_i)\, M_0$ for multiplicative terms $M_i$, $i \in [0:m]$, of parameter matrices $\{g_{t,1}^b, g_{t,2}^b\}$ and constant matrices $\{W_{t,1}^b, W_{t,2}^b, W_1, W_2\}$, and let d denote the total degree of $g_{t,1}^b$ and $g_{t,2}^b$ in M. Then there exists a set of multiplicative terms $\{M_{ij}^k\}$ of parameter matrices $\{W_{t,1}^b, W_{t,2}^b\}$ and constant matrices $\{W_1, W_2\}$ such that
$$\mathbb{E}\left[M \mid \mathcal{F}_t^b\right] = N_0 + N_1 \frac{1}{b} + \cdots + N_d \frac{1}{b^d}, \qquad N_k = \sum_{i=1}^{m_k} \prod_{j=1}^{m_{ki}} \operatorname{tr}\left(M_{ij}^k\right) M_{i0}^k, \; k \in [0:d].$$
Here $m_k$ and $m_{ki}$ are constants independent of b, and $\sum_{j=0}^{m_{ki}} \deg(M_{ij}^k) \le 3d + \sum_{i=0}^m \big(\deg(W_{t,1}^b; M_i) + \deg(W_{t,2}^b; M_i)\big)$.

Lemma 9. For multiplicative terms $M_i$, $i \in [0:m]$, of parameter matrices $\{W_{t,1}^b, W_{t,2}^b\}$ and constant matrices $\{W_1, W_2\}$ of degrees $d_i$, let $d = 2^{d_0 + \cdots + d_m}$. There exists a set of multiplicative terms $\{M_{ik} : i \in [0:m], k \in [d]\}$ of parameter matrices $\{g_{t,1}^b, g_{t,2}^b\}$ and constant matrices $\{W_{t,1}^b, W_{t,2}^b, W_1, W_2\}$ such that
$$\prod_{i=1}^m \operatorname{tr}(M_i)\, M_0 = \sum_{k=1}^d \prod_{i=1}^m \operatorname{tr}(M_{ik})\, M_{0k},$$
where $\sum_{i=0}^m \deg(M_{ik}) \le d$.

Proof of Lemma 8. By (1) and (2) we have
$$M = \prod_{i=1}^m \operatorname{tr}(M_i)\, M_0 = \frac{1}{b^d} \sum_{k=1}^{b^d} \prod_{i=1}^m \operatorname{tr}(M_{ki})\, M_{k0},$$
where each $M_{ki}$, $k \in [b^d]$, $i \in [0:m]$, is a multiplicative term of parameter matrices $\{x_{t,i} x_{t,i}^T : i \in [b]\}$ and constant matrices $\{W_{t,1}^b, W_{t,2}^b, W_t^b\}$. Let $\widetilde M_k = \prod_{i=1}^m \operatorname{tr}(M_{ki})\, M_{k0}$, $k \in [b^d]$. We split the set $\{\widetilde M_k : k \in [b^d]\}$ into disjoint and non-empty sets (equivalence classes) $S_1, \dots, S_{n_M}$ such that

1. for every $i \in [n_M]$ and every $M^1, M^2 \in S_i$, we have $\mathbb{E}[M^1 \mid \mathcal{F}_t^b] = \mathbb{E}[M^2 \mid \mathcal{F}_t^b]$;
2. for every $i, j \in [n_M]$, $i \neq j$, and every $M^1 \in S_i$ and $M^2 \in S_j$, we have $\mathbb{E}[M^1 \mid \mathcal{F}_t^b] \neq \mathbb{E}[M^2 \mid \mathcal{F}_t^b]$.

Note that $\cup_{i=1}^{n_M} S_i = \{\widetilde M_k : k \in [b^d]\}$. Let $\widehat M_k \in S_k$ represent the equivalence class $S_k$ (it can be any member of $S_k$). For every $i \in [n_M]$, we can always write $|S_i| = e_{i,0} + e_{i,1} b + \cdots + e_{i,d} b^d$ such that $e_{i,j} \in \mathbb{N}$, $e_{i,j} < b$, $j \in [0:d]$ (the $e_{i,j}$'s are in fact the digits of the base-b representation of $|S_i|$).
Then we have
$$\mathbb{E}\big[M \,\big|\, \mathcal F^b_t\big] = \mathbb{E}\Big[ \frac{1}{b^d} \sum_{k=1}^{b^d} \tilde M_k \,\Big|\, \mathcal F^b_t \Big] = \frac{1}{b^d}\, \mathbb{E}\Big[ \sum_{i=1}^{n_M} \big(e_{i,0} + e_{i,1} b + \cdots + e_{i,d} b^d\big)\, \hat M_i \,\Big|\, \mathcal F^b_t \Big] = \frac{1}{b^d} \sum_{i=1}^{n_M} \big(e_{i,0} + e_{i,1} b + \cdots + e_{i,d} b^d\big)\, \mathbb{E}\big[\hat M_i \,\big|\, \mathcal F^b_t\big] \tag{7}$$
$$= \sum_{i=1}^{n_M} \Big(e_{i,d} + e_{i,d-1} \tfrac 1 b + \cdots + e_{i,0} \tfrac 1 {b^d}\Big)\, \mathbb{E}\big[\hat M_i \,\big|\, \mathcal F^b_t\big].$$
It is important to note that $n_M$, the number of distinct equivalence classes, is independent of $b$. This follows from the fact that each $\mathbb{E}[\tilde M_k \,|\, \mathcal F^b_t]$ (and hence each $\mathbb{E}[\hat M_k \,|\, \mathcal F^b_t]$) includes a finite number of weight matrices $W^b_{t,1}$ and $W^b_{t,2}$ with degree at most $3d + \sum_{i=0}^m \big(\deg(W^b_{t,1}; M_i) + \deg(W^b_{t,2}; M_i)\big)$ (see Lemma 7). Thus the number of partition sets is bounded by a quantity independent of $b$.

Note that each $M_{ki}$ can be represented as
$$M_{ki} = A^{ki}_0\, x^{ki}_{t,i_1} (x^{ki}_{t,i_1})^T A^{ki}_1 \cdots A^{ki}_{d_i - 1}\, x^{ki}_{t,i_{d_i}} (x^{ki}_{t,i_{d_i}})^T A^{ki}_{d_i}$$
for some matrices $A^{ki}_0, \ldots, A^{ki}_{d_i}$ that are multiplicative terms of parameter matrices $\{W^b_{t,1}, W^b_{t,2}, W^b_t\}$ and the constant matrix $\{I_p\}$ (we stress again that some of the $A$ matrices can be identities, based on the definition of multiplicative terms), and $x^{ki}_{t,i_1}, \ldots, x^{ki}_{t,i_{d_i}} \in \{x_{t,1}, \ldots, x_{t,b}\}$. We have
$$\operatorname{tr}(M_{ki}) = \operatorname{tr}\big(A^{ki}_0\, x^{ki}_{t,i_1} (x^{ki}_{t,i_1})^T A^{ki}_1 \cdots A^{ki}_{d_i-1}\, x^{ki}_{t,i_{d_i}} (x^{ki}_{t,i_{d_i}})^T A^{ki}_{d_i}\big) = (x^{ki}_{t,i_{d_i}})^T A^{ki}_{d_i} A^{ki}_0\, x^{ki}_{t,i_1} (x^{ki}_{t,i_1})^T A^{ki}_1 \cdots A^{ki}_{d_i-1}\, x^{ki}_{t,i_{d_i}}.$$
For every $k \in [b^d]$ we therefore have
$$\prod_{i=1}^m \operatorname{tr}(M_{ki})\, M_{k0} = \Big[ \prod_{i=1}^m (x^{ki}_{t,i_{d_i}})^T A^{ki}_{d_i} A^{ki}_0\, x^{ki}_{t,i_1} (x^{ki}_{t,i_1})^T A^{ki}_1 \cdots A^{ki}_{d_i-1}\, x^{ki}_{t,i_{d_i}} \Big] A^{k0}_0\, x^{k0}_{t,i_1} (x^{k0}_{t,i_1})^T A^{k0}_1 \cdots A^{k0}_{d_0-1}\, x^{k0}_{t,i_{d_0}} (x^{k0}_{t,i_{d_0}})^T A^{k0}_{d_0},$$
which can be rewritten as
$$\tilde M_k = \prod_{i=1}^m \operatorname{tr}(M_{ki})\, M_{k0} = \Big( \prod_{j=1}^{\bar d} x^T_{t,\bar i_j} A^k_j\, x_{t,\bar i'_j} \Big) A^{k0}_0\, x^{k0}_{t,i_1} (x^{k0}_{t,i_{d_0}})^T A^{k0}_{d_0}.$$
Note that, given $\mathcal F^b_t$, the randomness of each $\tilde M_k$ comes only from the randomness of the $x_{t,j}$'s; i.e., for all $k \in [b^d]$ we have
$$\mathbb{E}\big[\tilde M_k \,\big|\, \mathcal F^b_t\big] = \mathbb{E}_{x_{t,j} \sim N(0,I)} \Big[ \Big( \prod_{j=1}^{\bar d} x^T_{t,i_j} A^k_j\, x_{t,i'_j} \Big) A^k_0\, x_{t,i'_0} x^T_{t,i_0} A^k_{0'} \Big] = \mathbb{E}_{x_{t,j} \sim N(0,I)} \Big[ A^k_0\, x_{t,i'_0} \Big( \prod_{j=1}^{\bar d} x^T_{t,i_j} A^k_j\, x_{t,i'_j} \Big) x^T_{t,i_0} A^k_{0'} \Big] \tag{8}$$
$$= \sum_{i=1}^{n^k_M} \prod_{j=1}^{n^k_i} \operatorname{tr}\big(\tilde M^k_{ij}\big)\, \tilde M^k_{i0}.$$

Proof of Theorem 3. We use induction on $t$. The base case $t = 0$ is the statement of Lemma 8. Suppose the statement holds for some $t \ge 0$, and consider $t + 1$. By Lemma 8 there exists a set of multiplicative terms $\{M^k_{t+1,i,j}, i \in [m_{t+1,k}], j \in [0:m_{t+1,k,i}], k \in [0:d]\}$ of parameter matrices $\{W^b_{t+1,1}, W^b_{t+1,2}\}$ and constant matrices $\{W_1, W_2\}$ such that
$$\mathbb{E}\big[M \,\big|\, \mathcal F^b_{t+1}\big] = N_{t+1,0} + N_{t+1,1} \tfrac 1 b + \cdots + N_{t+1,d} \tfrac 1 {b^d},$$
where $N_{t+1,k} = \sum_{i=1}^{m_{t+1,k}} \prod_{j=1}^{m_{t+1,k,i}} \operatorname{tr}\big(M^k_{t+1,i,j}\big)\, M^k_{t+1,i,0}$, $k \in [0:d]$. Here $m_{t+1,k}, m_{t+1,k,i}$ are constants independent of $b$, and $\sum_{j=0}^{m_{t+1,k,i}} \deg\big(M^k_{t+1,i,j}\big) \le 3d + d_1$. For each $i \in [m_{t+1,k}]$ and each $k \in [0:d]$, by Lemma 9 there exists a set of multiplicative terms $\{M_{t,i,j,k,l}, j \in [m_{t+1,i,k}], l \in [d_{t,i,k}]\}$ of parameter matrices $\{g^b_{t,1}, g^b_{t,2}\}$ and constant matrices $\{W^b_{t,1}, W^b_{t,2}, W_1, W_2\}$ such that
$$\prod_{j=1}^{m_{t+1,k,i}} \operatorname{tr}\big(M^k_{t+1,i,j}\big)\, M^k_{t+1,i,0} = \sum_{l=1}^{d_{t,i,k}} \prod_{j=1}^{m_{t+1,k,i}} \operatorname{tr}(M_{t,i,j,k,l})\, M_{t,i,0,k,l},$$
where $d_{t,i,k} = 2^{\sum_{j=0}^{m_{t+1,k,i}} (\deg(W^b_{t,1}; M_{t,i,j,k,l}) + \deg(W^b_{t,2}; M_{t,i,j,k,l}))}$ is a constant independent of $b$, and
$$\sum_{j=0}^{m_{t+1,k,i}} \deg(M_{t,i,j,k,l}) \le 3d + d_1, \quad\text{and}\quad \sum_{j=0}^{m_{t+1,k,i}} \big(\deg(W_{t,1}; M_{t,i,j,k,l}) + \deg(W_{t,2}; M_{t,i,j,k,l})\big) \le 3d + d_1.$$
Combining (14) and (15), we have, for every $k \in [0:d]$,
$$N_{t+1,k} = \sum_{i=1}^{m_{t+1,k}} \sum_{l=1}^{d_{t,i,k}} \prod_{j=1}^{m_{t+1,k,i}} \operatorname{tr}(M_{t,i,j,k,l})\, M_{t,i,0,k,l}.$$
Note that
$$\mathbb{E}[M \,|\, \mathcal F_0] = \mathbb{E}\big[\mathbb{E}[M \,|\, \mathcal F^b_{t+1}] \,\big|\, \mathcal F_0\big] = \mathbb{E}[N_{t+1,0} \,|\, \mathcal F_0] + \mathbb{E}[N_{t+1,1} \,|\, \mathcal F_0]\, \tfrac 1 b + \cdots + \mathbb{E}[N_{t+1,d} \,|\, \mathcal F_0]\, \tfrac 1 {b^d}$$
$$= \sum_{i=1}^{m_{t+1,0}} \sum_{l=1}^{d_{t,i,0}} \mathbb{E}\Big[ \prod_{j=1}^{m_{t+1,0,i}} \operatorname{tr}(M_{t,i,j,0,l})\, M_{t,i,0,0,l} \,\Big|\, \mathcal F_0 \Big] + \sum_{i=1}^{m_{t+1,1}} \sum_{l=1}^{d_{t,i,1}} \mathbb{E}\Big[ \prod_{j=1}^{m_{t+1,1,i}} \operatorname{tr}(M_{t,i,j,1,l})\, M_{t,i,0,1,l} \,\Big|\, \mathcal F_0 \Big] \tfrac 1 b + \cdots,$$
where $q_t \le d_1 + \tfrac 1 2 (3^t - 1)(3d + d_1)$ and $N_{t,i,k,l,0}, \cdots, N_{t,i,k,l,q_t}$ are sums of multiplicative terms of parameter matrices $\{W^b_{0,1}, W^b_{0,2}\}$ and constant matrices $\{W_1, W_2\}$ with degree at most $d \cdot 3^t$. Combining (19) and (20), we can rewrite
$$\mathbb{E}[M \,|\, \mathcal F_0] = N_0 + N_1 \tfrac 1 b + \cdots + N_q \tfrac 1 {b^q},$$
in the same form as in the statement. Here $q \le d + 3 q_t \le \tfrac 1 2 (3^{t+2} - 1) d + \tfrac 1 2 (3^{t+1} - 1) d_1$ and $\sum_{j=0}^{m_{ki}} \deg\big(M^k_{ij}\big) \le 3 \times 3^t (3d + d_1) = 3^{t+1}(3d + d_1)$ follow from (16) and (17). In conclusion, we have shown that the statement holds for $t + 1$, which finishes the proof.

By exchanging the roles of the parameter and constant matrices in Theorem 3, we obtain the following corollary.

Corollary 2. Given $t \ge 0$, for any multiplicative terms $M_i$, $i \in [0:m]$, of parameter matrices $\{W^b_{t,1}, W^b_{t,2}\}$ and constant matrices $\{W_1, W_2\}$, there exists a set of multiplicative terms $\{M^k_{ij}\}$ such that
$$\mathbb{E}[M \,|\, \mathcal F_0] = N_0 + N_1 \tfrac 1 b + \cdots + N_q \tfrac 1 {b^q},$$
where $N_k = \sum_{i=1}^{m_k} \prod_{j=1}^{m_{ki}} \operatorname{tr}\big(M^k_{ij}\big)\, M^k_{i0}$, $k \in [0:q]$. Here $m_k$, $m_{ki}$, and $q \le 3^t (d + 2 d_1)$ are constants independent of $b$, and $\sum_{j=0}^{m_{ki}} \deg\big(M^k_{ij}\big) \le 3^t (d + 2 d_1)$.

Proof of Corollary 2. We simply note that $M$ can be written as the sum of at most $2^d$ multiplicative terms of parameter matrices $\{W^b_{t,1}, W^b_{t,2}, W_1, W_2\}$ and the constant matrix $\{I_p\}$. Then we apply Lemmas 8 and 9 iteratively, in the same way as in the proof of Theorem 3, to finish the proof.

Proof of Theorem 4. We only show the case of $g_{t,1}$, since the proof for $g_{t,2}$ can be tackled similarly. … Here we have used the fact that $\mathbb{E}_{x \sim N(0, I_p)} \operatorname{tr}\big(x x^T A x x^T\big) = (p+2)\operatorname{tr}(A)$.
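The Gaussian identity $\mathbb{E}_{x \sim N(0, I_p)} \operatorname{tr}(x x^T A x x^T) = (p+2)\operatorname{tr}(A)$ used above can be checked by Monte Carlo; a minimal sketch (ours, with an arbitrary fixed $A$, not part of the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
A = np.arange(p * p, dtype=float).reshape(p, p) / 10.0  # arbitrary fixed matrix, tr(A) = 6.0

# Monte Carlo estimate of E tr(x x^T A x x^T) for x ~ N(0, I_p).
# By cyclicity of the trace, tr(x x^T A x x^T) = (x^T A x)(x^T x), so we can vectorize.
X = rng.standard_normal((400_000, p))
quad = np.einsum('ni,ij,nj->n', X, A, X)   # x^T A x per sample
norms = np.einsum('ni,ni->n', X, X)        # x^T x per sample
est = np.mean(quad * norms)

expected = (p + 2) * np.trace(A)           # = 42.0 for this A
```

The estimate agrees with $(p+2)\operatorname{tr}(A)$ up to Monte Carlo error.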
By Corollary 2 we know that there exists a set of multiplicative terms $\{M^k_{ij}, i \in [m_k], j \in [0:m_{ki}]\}$, $k \in [0:q]$, such that
$$\operatorname{tr}\Big( \mathbb{E}\big[ (W^b_t)^T W^b_{t,2} (W^b_{t,2})^T W^b_t \,\big|\, \mathcal F_0 \big] \Big) = \gamma_0 + \gamma_1 \tfrac 1 b + \cdots + \gamma_q \tfrac 1 {b^q}, \tag{21}$$
where $\gamma_k = \sum_{i=1}^{m_k} \prod_{j=0}^{m_{ki}} \operatorname{tr}\big(M^k_{ij}\big)$, $k \in [0:q]$. Here $m_k$, $m_{ki}$, and $q \le 6 \cdot 3^t$ are constants independent of $b$, and $\sum_{j=0}^{m_{ki}} \deg\big(M^k_{ij}\big) \le 6 \cdot 3^t$. Since $W^b_{0,1}, W^b_{0,2}$ are fixed, the $\gamma_k$, $k \in [0:q]$, are constants independent of $b$. Similarly, we observe that there exist constants $q' \le 2 \cdot 3^{t+1}$ and $\gamma'_k$, $k \in [0:q']$, such that
$$\big\| \mathbb{E}\big[ (W^b_{t,2})^T W^b_t \,\big|\, \mathcal F_0 \big] \big\|^2 = \gamma'_0 + \gamma'_1 \tfrac 1 b + \cdots + \gamma'_{q'} \tfrac 1 {b^{q'}}. \tag{22}$$
Defining $\gamma_i = 0$ for $i > q$ and $\gamma'_i = 0$ for $i > q'$, and combining (21) and (22), we have
$$\operatorname{var}\big(g^b_{t,1} \,\big|\, \mathcal F_0\big) = \frac 1 b \Big( (p+2) \operatorname{tr}\big( \mathbb{E}[ (W^b_t)^T W^b_{t,2} (W^b_{t,2})^T W^b_t \,|\, \mathcal F_0 ] \big) - \big\| \mathbb{E}[ (W^b_{t,2})^T W^b_t \,|\, \mathcal F_0 ] \big\|^2 \Big)$$
$$= \frac{p+2}{b} \Big( \gamma_0 + \gamma_1 \tfrac 1 b + \cdots + \gamma_q \tfrac 1 {b^q} \Big) - \frac 1 b \Big( \gamma'_0 + \gamma'_1 \tfrac 1 b + \cdots + \gamma'_{q'} \tfrac 1 {b^{q'}} \Big) = \sum_{k=1}^{\max\{q, q'\} + 1} \big( (p+2) \gamma_{k-1} - \gamma'_{k-1} \big) \tfrac 1 {b^k}.$$
Note that the $\gamma_k$'s and $\gamma'_k$'s are all constants independent of $b$, and $\max\{q, q'\} \le 2 \cdot 3^{t+1}$. This completes the proof.

Therefore, for all $b > b_0$ we have
$$\big( \operatorname{var}(g^b_{t,i} \,|\, \mathcal F_0) \big)' = -\frac{r}{b^{r+1}} f(b) + \frac{1}{b^r} f'(b) \le 0,$$
and thus $\operatorname{var}(g^b_{t,i} \,|\, \mathcal F_0)$ is a decreasing function of $b$ for all $b > b_0$.

B.3 EXTENSION TO DEEP LINEAR NETWORKS

The extension from the two-layer linear network to a deep linear network is straightforward. Here we only sketch how to translate the proof for the two-layer network to the $d$-layer network, without giving a fully rigorous proof. For simplicity, we drop all superscripts $b$ on matrices in this subsection. Assume that the $d$-layer linear network is given by $f(x; w) = W_d W_{d-1} \cdots W_2 W_1 x$, where $W_i$, $i \in [d]$, is the parameter matrix of the $i$-th layer and $w = (W_1, \ldots, W_d)$. The population loss at iterate $w_t = (W_{t,1}, \ldots, W_{t,d})$ is defined as
$$L(w_t) = \mathbb{E}_{x \sim N(0, I_p)} \Big[ \tfrac 1 2 \big\| W_{t,d} \cdots W_{t,1} x - W_d \cdots W_1 x \big\|^2 \Big],$$
where $W_1, \ldots, W_d$ now denote the constant (target) matrices, matching the notation of Appendix B.2. Similar to (1) and (2), we have
$$g_{t,k} = \frac 1 b \sum_{i=1}^b \nabla_{W_{t,k}} \Big( \tfrac 1 2 \big\| W_{t,d} \cdots W_{t,1} x_{t,i} - W_d \cdots W_1 x_{t,i} \big\|^2 \Big).$$
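The claim can be probed numerically: for a $d$-layer linear network with Gaussian inputs and fixed weights, the (trace of the) covariance of the mini-batch gradient of any layer decreases with $b$. A minimal sketch (ours, not the paper's code; layers are 0-indexed and the teacher matrices play the role of the constant matrices $W_i$):

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 4, 3
# Student (iterate) weights W_{t,k} and teacher (constant) weights W_k.
Wt = [rng.standard_normal((p, p)) / np.sqrt(p) for _ in range(d)]
Ws = [rng.standard_normal((p, p)) / np.sqrt(p) for _ in range(d)]

def prod(mats):
    """W_{last} @ ... @ W_{first} for a list ordered first-to-last layer."""
    return np.linalg.multi_dot(mats[::-1]) if len(mats) > 1 else mats[0]

def grad_layer_k(X, k):
    """Mini-batch gradient of (1/2)||W_{t,d}...W_{t,1} x - W_d...W_1 x||^2 w.r.t.
    layer k, averaged over the rows of X (one sample per row)."""
    E = (prod(Wt) - prod(Ws)) @ X.T                    # residuals, p x b
    left = prod(Wt[k + 1:]) if k + 1 < d else np.eye(p)  # W_{t,d} ... W_{t,k+1}
    right = prod(Wt[:k]) if k > 0 else np.eye(p)         # W_{t,k-1} ... W_{t,1}
    return left.T @ E @ (X @ right.T) / X.shape[0]

def grad_variance(b, k=1, trials=2000):
    """Monte Carlo estimate of var(g_{t,k} | F_0) = E||g||^2 - ||E g||^2."""
    gs = np.array([grad_layer_k(rng.standard_normal((b, p)), k).ravel()
                   for _ in range(trials)])
    return np.sum(np.var(gs, axis=0))

v1, v8 = grad_variance(1), grad_variance(8)  # variance shrinks as b grows
```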



https://www.kaggle.com/mohansacharya/graduate-admissions



Figure 1: Experimental results for the MNIST dataset. (a) The median, min, and max of the log of the variance of the stochastic gradient estimators for two different mini-batch sizes (distinguished by color) and five different initial weights. The solid lines show the median over all five initial weights, while the highlighted regions show the min and max of the log of the variance. (b) The log of the training and validation loss vs. epochs. (c) The log of the training and validation error vs. epochs. Here error is defined as one minus prediction accuracy. The plot omits epochs where the error equals zero. (d) The gap between training and test accuracy vs. epochs, starting from epoch 100.

Figure 1(c) shows that the expected errors on both the training and validation sets are decreasing functions of b.

Figure 1(d) exhibits a relationship between the model's generalization ability and the mini-batch size. As suggested by Simard et al. (2013), we build a test set by distorting the 10,000 images of the validation set. We obtain the prediction accuracy on both the training and test sets and calculate the gap between the two accuracies every 100 epochs. We use this gap to measure the model's generalization ability (the smaller, the better). Figure 1(d) shows that the gap is an increasing function of b starting at epoch 500, which partially aligns with our conjecture regarding the relationship between generalization ability and mini-batch size. We tested multiple choices of the hyper-parameters that control the degree of distortion in the test set, and the pattern remains clear.

Figure 2(b) shows the fitted polynomials. As we observe, the value $\operatorname{var}(g^b_t \,|\, \mathcal F_0)$ (approximated by the value of the polynomial) is decreasing with respect to both the mini-batch size $b$ and the iteration $t$. Further, the rate of decrease in $b$ slows as $b$ increases. This provides further insight into the dynamics of training a linear regression problem with SGD.

Figure 2: Experimental results for the Graduate Admission dataset. Left: $\log(\operatorname{var}(g^b_t \,|\, \mathcal F_0))$ and $\log(\operatorname{var}(\nabla L(w^b_t) \,|\, \mathcal F_0))$ vs. iteration $t$ for 4 different mini-batch sizes. Right: the log of the polynomial values when fitting polynomials on selected mini-batch sizes at certain iterations.

Figure 4: Experimental results for the MNIST dataset. Left: the median, min, and max of the log of the variance of the stochastic gradient estimators for two different mini-batch sizes (distinguished by color) and five different initial weights. The solid lines show the median over all five initial weights, while the highlighted regions show the min and max of the log of the variance. Right: the gap between training and test accuracy vs. epochs, starting from epoch 100.

Figure 5(b) shows that the expected errors on both the training and validation sets are decreasing functions of b.

Figure 4(b) exhibits a relationship between the model's generalization ability and the mini-batch size. As suggested by Simard et al. (2013), we build a test set by distorting the 10,000 images of the validation set. We obtain the prediction accuracy on both the training and test sets and calculate the gap between the two accuracies every 100 epochs. We use this gap to measure the model's generalization ability (the smaller, the better). Figure 4(b) shows that the gap is an increasing function of b starting at epoch 500, which partially aligns with our conjecture regarding the relationship between generalization ability and mini-batch size. We also tested multiple choices of the hyper-parameters that control the degree of distortion in the test set, and the pattern remains clear.


Proof of Theorem 5. We first show that in $\operatorname{var}(g^b_{t,i} \,|\, \mathcal F_0) = \beta_1 \tfrac 1 b + \cdots + \beta_r \tfrac 1 {b^r}$ we have $\beta_1 \ge 0$. If $r = 1$, the statement obviously holds. Assume that the statement does not hold for some $r > 1$, i.e., $\beta_1 < 0$. Taking $b$ large enough that $\beta_1 b^{r-1} + \beta_2 b^{r-2} + \cdots + \beta_r < 0$ yields
$$\operatorname{var}\big(g^b_{t,i} \,\big|\, \mathcal F_0\big) = \frac{1}{b^r} \big( \beta_1 b^{r-1} + \beta_2 b^{r-2} + \cdots + \beta_r \big) < 0,$$
which contradicts the fact that $\operatorname{var}(g^b_{t,i} \,|\, \mathcal F_0) \ge 0$. Therefore, we have $\beta_1 \ge 0$. Let $b_0$ be large enough that for all $b \ge b_0$ we have $\beta_1 b^{r-1} + 2\beta_2 b^{r-2} + \cdots + r\beta_r \ge 0$. We denote $f(b) = \beta_1 b^{r-1} + \beta_2 b^{r-2} + \cdots + \beta_r$, so that $\operatorname{var}(g^b_{t,i} \,|\, \mathcal F_0) = f(b)/b^r$. For all $b > b_0$ we have
$$\big( f(b)/b^r \big)' = -\frac{1}{b^{r+1}} \big( \beta_1 b^{r-1} + 2\beta_2 b^{r-2} + \cdots + r\beta_r \big) \le 0.$$
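As a numerical illustration of this argument (ours, with hypothetical coefficients), a polynomial $\beta_1/b + \beta_2/b^2$ with $\beta_1 \ge 0$ is eventually decreasing in $b$ even when $\beta_2 < 0$:

```python
import numpy as np

# Hypothetical coefficients with beta_1 >= 0 but beta_2 < 0, as allowed by Theorem 5.
beta = [2.0, -3.0]          # var(b) = 2/b - 3/b^2

def var_poly(b):
    return sum(bk / b**(k + 1) for k, bk in enumerate(beta))

# var'(b) = -(1/b^3)(beta_1 b + 2 beta_2) = -(2b - 6)/b^3 here,
# which is <= 0 once b >= b_0 = 3, matching the proof's choice of b_0.
bs = np.arange(3, 100)
vals = np.array([var_poly(b) for b in bs])
assert np.all(np.diff(vals) <= 0)  # monotonically decreasing past b_0
```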

However, it is unclear what the value of $\operatorname{var}(\nabla_w L_{B_t}(w_t)) := \mathbb{E} \|\nabla_w L_{B_t}(w_t)\|^2 - \|\mathbb{E}\, \nabla_w L_{B_t}(w_t)\|^2$ is. Intuitively, we should have $\operatorname{var}(\nabla_w L_{B_t}(w_t)) \propto \tfrac{n^2}{b} \operatorname{var}(\nabla_w L(w_t))$, where $n$ is the number of training samples and the stochasticity on the right-hand side comes from the mini-batch samples behind $w_t$.
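The quantity $\mathbb{E}\|g\|^2 - \|\mathbb{E} g\|^2$ can be estimated empirically; a minimal sketch for mini-batch linear-regression gradients (the data and all names are ours, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 5, 200
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)
w = np.zeros(p)                       # fixed weights at which we probe the gradient

def minibatch_grad(b):
    """Gradient of (1/2b) sum_{i in B} (x_i^T w - y_i)^2 on a random mini-batch."""
    idx = rng.choice(n, size=b, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / b

def grad_variance(b, trials=5000):
    """Monte Carlo estimate of var(g) = E||g||^2 - ||E g||^2."""
    G = np.array([minibatch_grad(b) for _ in range(trials)])
    return np.mean(np.sum(G**2, axis=1)) - np.sum(G.mean(axis=0)**2)

v2, v32 = grad_variance(2), grad_variance(32)  # variance shrinks with b
```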

For any set of square matrices $\{A_1, \cdots, A_n\}$, …


…, $W_1, W_2\}$ such that the degree is at most 1. Therefore, by induction, for every $i, k, l$, …

…$W^b_{t,2}, W^b_t\}$ and constant matrices $\{W_1, W_2\}$ such that $\sum_{i=1}^2 \deg(W^b_{t,i}; M) = d$ and $\deg(W^b_t; M) = d_1$, we denote $M = \prod_{i=1}^m \operatorname{tr}(M_i)\, M_0$. There exists a set of multiplicative terms $\{M^k_{ij}, i \in [m_k], j \in [0:m_{ki}]\}$, $k \in [0:q]$, …

…$_k \Big( \tfrac 1 2 \big\| W_{t,d} \cdots W_{t,1} x_{t,i} - W_d \cdots W_1 x_{t,i} \big\|^2 \Big)$


These degree relationships can be observed from (1), (2), and the fact that each $g^b_{t,1}$ or $g^b_{t,2}$ contributes one $W^b_t$ and one of …, where the $\tilde M^{kl}_{ij}$'s are multiplicative terms of parameter matrices … and constant matrices $\{W_1, W_2\}$ such that …, where the inequality comes from (9) and (10) and the fact that each $g^b_{t,1}$ or $g^b_{t,2}$ contributes 2 or 0 degrees in the form of $W^b_{t,2} W^b_{t,1}$ or $W_2 W_1$, respectively. Combining (7), (8) and (11), we have …. Note that all constants in (13) are independent of $b$; combining this with (12), we have finished the proof.

Proof of Lemma 9. We simply use the fact that $W^b_{t,i} = W^b_{t-1,i} - \alpha_t g^b_{t-1,i}$, $i = 1, 2$. If we replace each $W^b_{t,i}$ on the left-hand side of (13) by $W^b_{t-1,i} - \alpha_t g^b_{t-1,i}$ and expand all the parentheses, then each $M_i$, $i \in [0:m]$, becomes the sum of $2^{d_i}$ multiplicative terms of parameter matrices $\{g^b_{t,1}, g^b_{t,2}\}$ and constant matrices $\{W^b_{t,1}, W^b_{t,2}, W_1, W_2\}$ with degree at most $d_i$. As a result, $\prod_{i=1}^m \operatorname{tr}(M_i)\, M_0$ becomes the sum of $2^d$ terms of the form $\prod_{i=1}^m \operatorname{tr}(M_{ik})\, M_{0k}$, where $\deg(M_{ik}) \le 2^{d_i}$, and therefore $\sum_{i=0}^m \deg(M_{ik}) \le \prod_{i=0}^m 2^{d_i} = d$.

We denote $W_t = W_{t,d} \cdots W_{t,1} - W_d \cdots W_1$. The remaining steps are the same as the proofs in Appendix B.2, except that we replace every appearance of $\{W_{t,2}, W_{t,1}\}$ by $\{W_{t,d}, W_{t,d-1}, \cdots, W_{t,1}\}$ and every $\{W_2, W_1\}$ by $\{W_d, W_{d-1}, \cdots, W_1\}$. We can do this because the stochastic gradient $g_{t,k}$ is still a sum of multiplicative terms of parameter matrices $\{x_{t,i}\}$ and constant matrices $\{W_{t,d}, \cdots, W_{t,1}, W_d, \cdots, W_1\}$, so the lemmas in Appendix B.2 still apply.

In conclusion, we can again represent $\operatorname{var}(g_{t,k} \,|\, \mathcal F_0)$, $k \in [d]$, as a polynomial in $1/b$ with finite degree and no constant term. By the same approach as in the proof of Theorem 5, we can show that the variance is a decreasing function of the mini-batch size $b$.

