CRITICAL BATCH SIZE MINIMIZES STOCHASTIC FIRST-ORDER ORACLE COMPLEXITY OF DEEP LEARNING OPTIMIZER USING HYPERPARAMETERS CLOSE TO ONE

Anonymous authors
Paper under double-blind review

Abstract

Practical results have shown that deep learning optimizers using small constant learning rates, hyperparameters close to one, and large batch sizes can find the model parameters of deep neural networks that minimize the loss functions. We first show theoretical evidence that the momentum method (Momentum) and adaptive moment estimation (Adam) perform well in the sense that the upper bound of the theoretical performance measure is small with a small constant learning rate, hyperparameters close to one, and a large batch size. Next, we show that there exists a batch size called the critical batch size minimizing the stochastic first-order oracle (SFO) complexity, which is the stochastic gradient computation cost, and that SFO complexity increases once the batch size exceeds the critical batch size. Finally, we provide numerical results that support our theoretical results. That is, the numerical results indicate that Adam using a small constant learning rate, hyperparameters close to one, and the critical batch size minimizing SFO complexity has faster convergence than Momentum and stochastic gradient descent (SGD).

1. INTRODUCTION

1.1 BACKGROUND

Useful deep learning optimizers have been proposed to find the model parameters of deep neural networks that minimize loss functions, called the expected risk and the empirical risk, such as stochastic gradient descent (SGD) (Robbins & Monro, 1951; Zinkevich, 2003; Nemirovski et al., 2009; Ghadimi & Lan, 2012; 2013), momentum methods (Polyak, 1964; Nesterov, 1983), and adaptive methods. The various adaptive methods include Adaptive Gradient (AdaGrad) (Duchi et al., 2011), Root Mean Square Propagation (RMSProp) (Tieleman & Hinton, 2012), Adaptive Moment Estimation (Adam) (Kingma & Ba, 2015), Adaptive Mean Square Gradient (AMSGrad) (Reddi et al., 2018), Yogi (Zaheer et al., 2018), Adam with decoupled weight decay (AdamW) (Loshchilov & Hutter, 2019), and AdaBelief (named for adapting stepsizes by the belief in observed gradients) (Zhuang et al., 2020). Theoretical analyses of adaptive methods for nonconvex optimization were presented in (Zaheer et al., 2018; Zou et al., 2019; Chen et al., 2019; Zhou et al., 2020; Zhuang et al., 2020; Chen et al., 2021) (see (Jain et al., 2018; Fehrman et al., 2020; Chen et al., 2020; Scaman & Malherbe, 2020; Loizou et al., 2021) for convergence analyses of SGD). A particularly interesting feature of adaptive methods is the use of hyperparameters, denoted by β1 and β2, that can be set to influence the performance measure P(K) := (1/K) Σ_{k=1}^{K} E[‖∇f(θ_k)‖²], where ∇f is the gradient of a loss function f : R^d → R, (θ_k)_{k=1}^{K} is the sequence generated by an optimizer, and K is the number of steps. The previous results, summarized in Table 1, indicate that using β1 and/or β2 close to 0 makes the upper bound of P(K) small (see also Appendix A.1). In contrast, practical studies (e.g., Kingma & Ba, 2015; Chen et al., 2021) have shown that using, for example, β1 ∈ {0.9, 0.99} and β2 ∈ {0.99, 0.999} provides superior performance for training deep neural networks.
The practically useful β1 and β2 are each close to 1, whereas in contrast, the theoretical results (Table 1) show that using β1 and/or β2 close to 0 makes the upper bounds of the performance measures small.

Table 1: Upper bounds of the performance measure of optimizers with learning rate α_k and hyperparameters β1 and β2. (G > 0, s ∈ (0, 1/2), L denotes the Lipschitz constant of the Lipschitz continuous gradient of the loss function f, K denotes the number of steps, b is the batch size, α_{b,max} depends on b and the largest eigenvalue of the Hessian of f, h is a monotone decreasing function with respect to β1, and C3 is defined as in Table 2. β ≈ a means that the upper bound is small when β is close to a.)

Optimizer (reference) | Learning rate α_k | Hyperparameters β1, β2 | Upper bound
Tail-averaged SGD (Jain et al., 2018) | O(α_{b,max}) | — | O(1/K² + 1/(Kb))
Adam (Zaheer et al., 2018) | O(1/L) | β1 = 0, β2 ≥ 1 − O(1/G²) | O(1/K + 1/b)
Generic Adam (Zou et al., 2019) | O(1/√k) | β1 ≈ 0, β2 = 1 − 1/k ≈ 1 | O(log K/√K)
AdaFom (Chen et al., 2019) | 1/√k | β1 ≈ 0 | O(log K/√K)
AMSGrad (Zhou et al., 2020) | α | β1 < √β2 | O(1/K^{1/2−s})
AdaBelief (Zhuang et al., 2020) | O(1/√k) | β1 ≈ 0, β2 ≈ 0 | O(log K/√K)
Padam (Chen et al., 2021) | α | β1 ≈ 0, β2 ≈ 0 | O(1/K^{1/2−s})
(this paper) | α | β1 ≈ 1, β2 ≈ 1 | O(1/K + 1/b + C3)
(this paper) | varying α_k | β1 ≈ 1, β2 ≈ 1 | O(1/K + 1/(Kb) + h(β1))

The practical performance of a deep learning optimizer strongly depends on the batch size. In (Smith et al., 2018), it was numerically shown that using an enormous batch size leads to a reduction in the number of parameter updates and model training time. The theoretical results in (Zaheer et al., 2018) showed that using a large batch size makes the upper bound of P(K) of an adaptive method small (Table 1).
Convergence analyses of SGD in (Cotter et al., 2011; Chen et al., 2020; Arjevani et al., 2022) indicated that running SGD with a decaying learning rate and a large batch size for sufficiently many steps leads to convergence to a local minimizer of a loss function. Accordingly, the practical results for large batch sizes match the theoretical ones. The studies (Shallue et al., 2019; Zhang et al., 2019) examined how increasing the batch size affects the performance of deep learning optimizers. Both studies numerically showed that increasing the batch size tends to decrease the number of steps K needed for training deep neural networks, but with diminishing returns. Moreover, it was shown that momentum methods can exploit larger batches than SGD (Shallue et al., 2019), and that K-FAC and Adam can exploit larger batches than momentum methods (Zhang et al., 2019).

1.2 MOTIVATION

1.2.1 HYPERPARAMETERS CLOSE TO ONE AND CONSTANT LEARNING RATE

As described in Section 1.1, the practically useful β1 and β2 are each close to 1, whereas the theoretical results show that using β1 and/or β2 close to 0 makes the upper bounds of the performance measures small. Hence, there is a gap between theory (β1, β2 ≈ 0) and practice (β1, β2 ≈ 1) for adaptive methods. The first motivation of this paper is to bridge this gap. Since using small constant learning rates is robust for training deep neural networks (Kingma & Ba, 2015; Reddi et al., 2018; Zaheer et al., 2018; Zou et al., 2019; Chen et al., 2019; Zhuang et al., 2020; Chen et al., 2021), we focus on using a small constant learning rate α. We note that using a learning rate depending on the Lipschitz constant L of the gradient ∇f would be unrealistic, since computing L is NP-hard (Virmaux & Scaman, 2018, Theorem 2). The results using varying learning rates (the "(this paper)" row in Table 1) are given in Appendix A.4.

1.2.2. CRITICAL BATCH SIZE

The second motivation of this paper is to clarify theoretically the relationship between the batch size and the diminishing returns reported in (Shallue et al., 2019; Zhang et al., 2019). The numerical evaluations in those studies showed that, for deep learning optimizers, the number of steps K needed to train a deep neural network halves for each doubling of the batch size b until a critical batch size b⋆ is reached, beyond which there is a region of diminishing returns. This implies that there is a positive number C such that

Kb ≈ 2^C for b ≤ b⋆ and Kb ≥ 2^C for b ≥ b⋆, (1)

where K and b are defined for i, j ∈ N by K = 2^i and b = 2^j. (For example, Figure 1 in Section 4 shows C ≈ 20 and b⋆ ≈ 2^11 for Adam used to train ResNet-20 on the CIFAR-10 dataset.) We define the stochastic first-order oracle (SFO) complexity of a deep learning optimizer as N := Kb on the basis of the number of steps K needed for training the deep neural network and the batch size b used in the optimizer. Let b⋆ be a critical batch size such that there are diminishing returns for all batch sizes beyond b⋆, as asserted in (Shallue et al., 2019; Zhang et al., 2019). Relation (1) implies that, while the SFO complexity N initially almost does not change (i.e., K halves for each doubling of b), N is minimized at the critical batch size b⋆, and there are diminishing returns once the batch size exceeds b⋆.
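As a purely illustrative sketch (the constant C and the batch sizes below are hypothetical, not measured values from this paper), relation (1) can be checked numerically: the SFO complexity N = Kb stays flat during perfect scaling and grows once b exceeds the critical batch size.

```python
# Hypothetical illustration of relation (1): perfect scaling (K halves per
# doubling of b) up to a critical batch size b_star, then diminishing returns.
C = 20          # hypothetical constant: K * b ~ 2^C in the scaling regime
b_star = 2**11  # hypothetical critical batch size

def steps_needed(b):
    """Hypothetical number of steps K for batch size b."""
    if b <= b_star:
        return 2**C // b          # perfect scaling: K * b = 2^C
    return 2**C // b_star         # diminishing returns: K stops decreasing

batch_sizes = [2**j for j in range(5, 15)]
sfo = {b: steps_needed(b) * b for b in batch_sizes}  # N := K * b

# N is flat below b_star and grows once b exceeds b_star.
assert all(sfo[b] == 2**C for b in batch_sizes if b <= b_star)
assert all(sfo[b] > 2**C for b in batch_sizes if b > b_star)
```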

1.3. CONTRIBUTION

Our results are summarized in Table 2 (see also the "(this paper)" row in Table 1). Our goal is to find a local minimizer of a loss function f over R^d, i.e., a stationary point θ⋆ ∈ R^d satisfying ∇f(θ⋆) = 0, which is equivalent to the variational inequality (VI) defined for all θ ∈ R^d by ∇f(θ⋆)^⊤(θ⋆ − θ) ≤ 0. Here, we show the relationship between (i) E[∇f(θ_k)^⊤(θ_k − θ)] ≤ ε (θ ∈ R^d, k ∈ N) and (ii) E[‖∇f(θ_k)‖²] ≤ ε (k ∈ N), where ε > 0 is the precision. Let us assume that (θ_k)_{k∈N} is bounded. Suppose that (i) holds. Then, there exists a subsequence (θ_{k_i})_{i∈N} of (θ_k)_{k∈N} such that (θ_{k_i})_{i∈N} converges to θ*. The continuity of ∇f thus implies that, for all θ ∈ R^d, E[∇f(θ*)^⊤(θ* − θ)] ≤ ε. Putting θ := θ* − ∇f(θ*) ensures that E[‖∇f(θ*)‖²] ≤ ε. Suppose that (ii) holds. Then, the definition of the inner product and Jensen's inequality imply that, for all θ ∈ R^d, E[∇f(θ_k)^⊤(θ_k − θ)] ≤ Dist(θ)√ε, where Dist(θ) := sup{‖θ_k − θ‖ : k ∈ N} < +∞. Therefore, it is adequate to use (i) and the ε-approximation VI(K, θ) := (1/K) Σ_{k=1}^{K} E[∇f(θ_k)^⊤(θ_k − θ)] ≤ ε (Table 2) as the performance measure of an optimizer.

1.3.1. ADVANTAGE OF SETTING A SMALL CONSTANT LEARNING RATE AND HYPERPARAMETERS CLOSE TO ONE

We can show that the upper bound C1/K + C2/b + C3 of VI(K, θ) becomes small when α is small, β1 and β2 are close to 1, and K is large. This implies that Momentum and Adam perform well when α is small and β1 and β2 are each set close to 1. Section 3.1 shows this result in detail.

Table 2: Relationship between batch size b and the number of steps K to achieve an ε-approximation of an optimizer using a constant learning rate α and hyperparameters β1, β2. The critical batch size b⋆ minimizes the SFO complexity N. (G, σ², M, and v* are positive constants, D(θ) is a positive real number depending on θ ∈ R^d, and h is monotone decreasing with respect to β1.)

Constant | SGD | Momentum | Adam
C1 | E[‖θ1 − θ‖²]/(2α) | E[‖θ1 − θ‖²]/(2αβ1) | dD(θ)√M/(2αβ1√(1 − β2))
C2 | σ²α/2 | σ²α/(2β1) | σ²α/(2√v* β1(1 − β1))
C3 | G²α/2 | G²α/(2β1) + h(β1) | G²α/(2√v* β1(1 − β1)) + h(β1)

Upper bound of VI: VI(K, θ) := (1/K) Σ_{k=1}^{K} E[∇f(θ_k)^⊤(θ_k − θ)] ≤ C1/K + C2/b + C3 = ε
Steps K and SFO complexity N: K = C1 b/{(ε − C3)b − C2}, N = C1 b²/{(ε − C3)b − C2}
Critical batch size: b⋆ = 2C2/(ε − C3)

1.3.2 CRITICAL BATCH SIZE

As described in Section 1.2.2, the practical performance of a deep learning optimizer strongly depends on the batch size (Shallue et al., 2019; Zhang et al., 2019). The advantage of this paper is to clarify theoretically the relationship between the batch size and the performance of deep learning optimizers and to develop a theory demonstrating the existence of critical batch sizes, which were shown numerically by (Shallue et al., 2019; Zhang et al., 2019). Motivated by those results and Section 1.2.2, we use the SFO complexity N := Kb as the performance measure of a deep learning optimizer. We first show that the number of steps K needed to satisfy VI(K, θ) ≤ ε can be defined as in Table 2. As a function of the batch size b, K is convex and monotone decreasing.
Next, we show that the SFO complexity N defined as in Table 2 is convex with respect to the batch size b. This result agrees with relation (1). Moreover, N is minimized at b⋆ defined as in Table 2, which guarantees the existence of the critical batch size b⋆. Section 3.2 shows the above results in detail. However, accurately setting the critical batch size b⋆ defined as in Table 2 would be difficult, since b⋆ involves unknown parameters such as G and D(θ) (see Section 2.2.3). The advantage of our analysis is that we can estimate appropriate batch sizes using the formula for b⋆ before running deep learning optimizers. Section 4 discusses the estimation of appropriate batch sizes in detail.
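To make the formulas in Table 2 concrete, the following sketch evaluates K(b) and N(b) = K(b)b and checks that N is minimized at b⋆ = 2C2/(ε − C3). The values of C1, C2, C3, and ε are hypothetical, since the true constants depend on unknown quantities such as G and D(θ):

```python
# Hypothetical constants; in practice C1, C2, C3 depend on unknown quantities
# such as G, sigma^2, and D(theta) (see Table 2).
C1, C2, C3, eps = 1.0e6, 50.0, 0.05, 0.1

def K(b):
    """Number of steps needed to reach VI(K, theta) <= eps (Table 2)."""
    return C1 * b / ((eps - C3) * b - C2)   # requires b > C2 / (eps - C3)

def N(b):
    """SFO complexity N = K(b) * b."""
    return K(b) * b

b_star = 2 * C2 / (eps - C3)                 # critical batch size
N_star = 4 * C1 * C2 / (eps - C3) ** 2       # minimum SFO complexity

# K is monotone decreasing in b, and N attains its minimum at b_star.
assert K(2 * b_star) < K(b_star) < K(1.5 * C2 / (eps - C3))
assert abs(N(b_star) - N_star) < 1e-6 * N_star
assert N(0.8 * b_star) > N(b_star) and N(1.25 * b_star) > N(b_star)
```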

2. NONCONVEX OPTIMIZATION AND DEEP LEARNING OPTIMIZERS

This section presents the nonconvex optimization problem arising in deep neural networks and the optimizers for solving it under standard assumptions.

2.1. NONCONVEX OPTIMIZATION IN DEEP LEARNING

Let R^d be a d-dimensional Euclidean space with inner product ⟨x, y⟩ := x^⊤y inducing the norm ‖x‖, and let N be the set of nonnegative integers. Define [n] := {1, 2, . . . , n} for n ≥ 1. Given a parameter θ ∈ R^d and a data point z in a data domain Z, a machine learning model provides a prediction whose quality is measured by a differentiable nonconvex loss function ℓ(θ; z). We aim to minimize the expected loss defined for all θ ∈ R^d by f(θ) = E_{z∼D}[ℓ(θ; z)] = E[ℓ_ξ(θ)], where D is a probability distribution over Z, ξ denotes a random variable with distribution function P, and E[·] denotes the expectation taken with respect to ξ. A particularly interesting example of f(θ) is the empirical average loss defined for all θ ∈ R^d by f(θ; S) = (1/n) Σ_{i∈[n]} ℓ(θ; z_i) = (1/n) Σ_{i∈[n]} ℓ_i(θ), where S = (z_1, z_2, . . . , z_n) denotes the training set and ℓ_i(·) := ℓ(·; z_i) denotes the loss function corresponding to the i-th training data z_i.

2.2. DEEP LEARNING OPTIMIZERS

2.2.1. CONDITIONS

We assume that a stochastic first-order oracle (SFO) exists such that, for a given θ ∈ R^d, it returns a stochastic gradient G_ξ(θ) of the function f, where the random variable ξ is supported on Ξ independently of θ. The following are standard conditions when considering a deep learning optimizer.

(C1) f : R^d → R is continuously differentiable.

(C2) Let (θ_k)_{k∈N} ⊂ R^d be the sequence generated by a deep learning optimizer. For each iteration k, E_{ξ_k}[G_{ξ_k}(θ_k)] = ∇f(θ_k), where ξ_0, ξ_1, . . . are independent samples and the random variable ξ_k is independent of (θ_l)_{l=0}^{k}. There exists a nonnegative constant σ² such that E_{ξ_k}[‖G_{ξ_k}(θ_k) − ∇f(θ_k)‖²] ≤ σ².

(C3) For each iteration k, the optimizer samples a batch B_k of size b independently of k and estimates the full gradient ∇f as ∇f_{B_k}(θ_k) := (1/b) Σ_{i∈[b]} G_{ξ_{k,i}}(θ_k), where ξ_{k,i} is a random variable generated by the i-th sampling in the k-th iteration.
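The estimator in (C3) can be illustrated with a minimal Monte Carlo sketch (the quadratic loss and Gaussian noise below are assumptions for illustration, not part of the paper's setting): the minibatch gradient is unbiased and its mean squared error scales like σ²/b.

```python
import random

random.seed(0)
d, sigma2 = 3, 4.0
theta = [1.0, -2.0, 0.5]
grad_f = [2.0 * t for t in theta]  # assume f(theta) = ||theta||^2, so grad f = 2*theta

def stochastic_grad():
    """SFO call: grad f(theta) plus zero-mean noise with total variance sigma2."""
    s = (sigma2 / d) ** 0.5
    return [g + random.gauss(0.0, s) for g in grad_f]

def minibatch_grad(b):
    """(C3): average of b independent SFO calls."""
    calls = [stochastic_grad() for _ in range(b)]
    return [sum(c[i] for c in calls) / b for i in range(d)]

def mean_sq_error(b, trials=4000):
    """Monte Carlo estimate of E||grad f_B - grad f||^2."""
    tot = 0.0
    for _ in range(trials):
        g = minibatch_grad(b)
        tot += sum((g[i] - grad_f[i]) ** 2 for i in range(d))
    return tot / trials

# The estimator's variance shrinks roughly like sigma2 / b.
assert mean_sq_error(1) > mean_sq_error(16) > mean_sq_error(64)
assert abs(mean_sq_error(16) - sigma2 / 16) < 0.1
```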

2.2.2. ADAM

Algorithm 1 is the Adam optimizer (Kingma & Ba, 2015) under (C1)-(C3). The symbol ⊙ in step 6 is defined for all x = (x_i)_{i=1}^{d} ∈ R^d by x ⊙ x := (x_i²)_{i=1}^{d} ∈ R^d, and diag(x_i) in step 8 is a diagonal matrix with diagonal components x_1, x_2, . . . , x_d.

Algorithm 1 Adam (Kingma & Ba, 2015)
Require: α ∈ (0, +∞), b ∈ (0, +∞), β1 ∈ (0, 1), β2 ∈ [0, 1)
1: k ← 0, θ_0 ∈ R^d, m_{−1} := 0, v_{−1} := 0
2: loop
3:   ∇f_{B_k}(θ_k) := (1/b) Σ_{i∈[b]} G_{ξ_{k,i}}(θ_k)
4:   m_k := β1 m_{k−1} + (1 − β1)∇f_{B_k}(θ_k)
5:   m̂_k := (1 − β1^{k+1})^{−1} m_k
6:   v_k := β2 v_{k−1} + (1 − β2)∇f_{B_k}(θ_k) ⊙ ∇f_{B_k}(θ_k)
7:   v̂_k := (1 − β2^{k+1})^{−1} v_k
8:   H_k := diag(√(v̂_{k,i}))
9:   θ_{k+1} := θ_k − αH_k^{−1} m̂_k
10:  k ← k + 1
11: end loop

The SGD optimizer under (C1)-(C3) is Algorithm 1 with β1 = 0 and H_k equal to the identity matrix. The Momentum optimizer under (C1)-(C3) is defined for all k ∈ N by θ_{k+1} := θ_k − αm_k.
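A minimal Python sketch of Algorithm 1 (the quadratic test function and the numerical safeguard `eps` are assumptions added for illustration; the paper's algorithm has no eps term):

```python
def adam_step(theta, m, v, grad, k, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One iteration of Algorithm 1 (steps 4-9); eps guards the division."""
    d = len(theta)
    m = [beta1 * m[i] + (1 - beta1) * grad[i] for i in range(d)]          # step 4
    v = [beta2 * v[i] + (1 - beta2) * grad[i] ** 2 for i in range(d)]     # step 6
    m_hat = [m[i] / (1 - beta1 ** (k + 1)) for i in range(d)]             # step 5
    v_hat = [v[i] / (1 - beta2 ** (k + 1)) for i in range(d)]             # step 7
    theta = [theta[i] - alpha * m_hat[i] / (v_hat[i] ** 0.5 + eps)        # steps 8-9
             for i in range(d)]
    return theta, m, v

# Illustration on f(theta) = ||theta||^2 with exact gradients standing in for
# minibatch gradients; Adam reaches a small neighborhood of the minimizer 0.
theta, m, v = [1.0, -1.0], [0.0, 0.0], [0.0, 0.0]
for k in range(5000):
    grad = [2.0 * t for t in theta]
    theta, m, v = adam_step(theta, m, v, grad, k, alpha=1e-2)
assert sum(t * t for t in theta) < 0.1
```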

2.2.3. ASSUMPTIONS

We assume the following conditions, which were used in (Kingma & Ba, 2015, Theorem 4.1):

(A1) There exist positive numbers G and B such that, for all k ∈ N, ‖∇f(θ_k)‖ ≤ G and ‖∇f_{B_k}(θ_k)‖ ≤ B (see also Appendix A.5).

(A2) For all θ ∈ R^d, there exists a positive number Dist(θ) such that, for all k ∈ N, ‖θ_k − θ‖ ≤ Dist(θ).

Let (g²_{k,i})_{i=1}^{d} := ∇f_{B_k}(θ_k) ⊙ ∇f_{B_k}(θ_k) (k ∈ N). Assumption (A1) implies that M := sup{max_{i∈[d]} g²_{k,i} : k ∈ N} < +∞. Assumption (A2) implies that D(θ) := sup{max_{i∈[d]} (θ_{k,i} − θ_i)² : k ∈ N} < +∞. We define v* := inf{min_{i∈[d]} v̂_{k,i} : k ∈ N}. Theorem 3 in (Reddi et al., 2018) shows that there exists a stochastic convex optimization problem such that Adam using β1 < √β2 (e.g., β1 = 0.9 and β2 = 0.999) does not converge to the optimal solution. If, for all k ∈ N and all i ∈ [d], v̂_{k,i} in Adam satisfies

v̂_{k+1,i} ≥ v̂_{k,i}, (2)

then Adam with a decaying learning rate α_k = O(1/√k) and β1 and β2 satisfying β1 < √β2 can solve the stochastic convex optimization problem (Reddi et al., 2018, Theorem 4). We thus assume condition (2) for Adam to guarantee the convergence of Adam.

3. OUR RESULTS

This section states our theoretical results (Theorem 3.1) in Table 2 and our contribution (Sections 3.1 and 3.2) in detail. The proof of Theorem 3.1 is given in Appendix A.3.

Theorem 3.1. The sequence (θ_k)_{k∈N} generated by each of SGD, Momentum, and Adam with (2) under (C1)-(C3), (A1), and (A2) satisfies the following:

(i) [Upper bound of VI(K, θ)] For all K ≥ 1 and all θ ∈ R^d,

VI(K, θ) := (1/K) Σ_{k=1}^{K} E[∇f(θ_k)^⊤(θ_k − θ)] ≤ C1/K + C2/b + C3,

where C_i (i = 1, 2, 3) for SGD are

C1 := E[‖θ1 − θ‖²]/(2α), C2 := σ²α/2, C3 := G²α/2,

C_i (i = 1, 2, 3) for Momentum are

C1 := E[‖θ1 − θ‖²]/(2αβ1), C2 := σ²α/(2β1), C3 := G²α/(2β1) + h(β1),

C_i (i = 1, 2, 3) for Adam with (2) are

C1 := dD(θ)√M/(2αβ1√(1 − β2)), C2 := σ²α/(2√v* β1(1 − β1)), C3 := G²α/(2√v* β1(1 − β1)) + h(β1),

where h is the monotone decreasing function defined by (8) in Section 3.1 and the parameters are defined as in Section 2.2.

(ii) [Steps needed to satisfy VI(K, θ) ≤ ε] The number of steps K defined by

K(b) = C1 b/{(ε − C3)b − C2} (3)

satisfies VI(K, θ) ≤ ε, and the function K(b) defined by (3) is convex and monotone decreasing with respect to the batch size b (> C2/(ε − C3) > 0) (see also Appendix A.6 for the condition ε − C3 > 0).

(iii) [Minimization of SFO complexity] The SFO complexity defined by

N = K(b)b = C1 b²/{(ε − C3)b − C2} (4)

is convex with respect to the batch size b (> C2/(ε − C3) > 0). The batch size

b⋆ := 2C2/(ε − C3) (5)

attains the minimum value N⋆ = K(b⋆)b⋆ = 4C1C2/(ε − C3)² of N.

The proof of Theorem 3.1(i) ensures that C_i (i = 1, 2, 3) for AMSGrad (Algorithm 1 with m̂_k = m_k, v̂_k = v_k, ṽ_{k,i} := max{ṽ_{k−1,i}, v_{k,i}} (i.e., ṽ_{k+1,i} ≥ ṽ_{k,i}; see (2)), and H_k := diag(√(ṽ_{k,i}))) are

C1 := dD(θ)√M/(2αβ1), C2 := σ²α/(2√ṽ* β1), C3 := G²α/(2√ṽ* β1) + h(β1),

where ṽ_{−1} := 0 and ṽ* := inf{min_{i∈[d]} ṽ_{k,i} : k ∈ N} (see Appendix A.3.4).
We give a brief outline of the proof of Theorem 3.1, with an emphasis on the main difficulty that must be overcome in order not to assume Lipschitz smoothness of f (i.e., that ∇f is Lipschitz continuous with Lipschitz constant L). First, we show that (C2) and (A1) imply that (E[‖m_k‖])_{k∈N} and (E[‖d_k‖])_{k∈N} are bounded, where d_k := −H_k^{−1} m̂_k. Since we do not assume Lipschitz smoothness of f, we cannot use the descent lemma, i.e., f(y) ≤ f(x) + ∇f(x)^⊤(y − x) + (L/2)‖y − x‖² (x, y ∈ R^d). This is the main difficulty in proving Theorem 3.1. Almost all previous analyses of adaptive methods are based on the descent lemma, and hence they can use the expectation of the squared norm of the full gradient, E[‖∇f(θ_k)‖²], as the performance measure. Accordingly, we must use a performance measure different from E[‖∇f(θ_k)‖²]; this paper uses VI(K, θ). We next show that Σ_{k=1}^{K} E[m_{k−1}^⊤(θ_k − θ)] ≤ a_K + b_K + c_K ≤ C1 + C2K/b + C̄3K, where C1 and C2 are defined as in Theorem 3.1 and C̄3 > 0. In particular, (2), (A1), and (A2) imply a_K ≤ C1; the boundedness of (E[‖d_k‖])_{k∈N} implies b_K ≤ C2K/b; and (A2) and the Cauchy-Schwarz inequality imply c_K ≤ C̄3K. The definition of m_k, the Cauchy-Schwarz inequality, the triangle inequality, and Jensen's inequality then imply Theorem 3.1(i). Theorem 3.1(i) and setting C1/K + C2/b + C3 = ε lead to Theorem 3.1(ii) and (iii).
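The first step of the outline, the boundedness of the second moments of (m_k)_{k∈N}, can be illustrated numerically (a toy one-dimensional sketch; the gradient oracle and the constants below are assumptions, not from the paper): the exponential moving average m_k = β1 m_{k−1} + (1 − β1)g_k inherits the second-moment bound σ²/b + G² of its inputs.

```python
import random

random.seed(1)
beta1 = 0.9
G2 = 1.0        # assumed bound on ||grad f||^2, as in (A1)
sigma2_b = 0.5  # assumed per-batch noise variance sigma^2 / b, as in (C2)-(C3)

def noisy_grad():
    """Hypothetical 1-D stochastic gradient with E[g^2] <= sigma2_b + G2."""
    return random.uniform(-G2 ** 0.5, G2 ** 0.5) + random.gauss(0.0, sigma2_b ** 0.5)

trials, steps = 2000, 50
second_moment = 0.0
for _ in range(trials):
    m = 0.0
    for _ in range(steps):
        m = beta1 * m + (1 - beta1) * noisy_grad()   # EMA, as in step 4 of Algorithm 1
    second_moment += m * m
second_moment /= trials

# Convexity of ||.||^2 (Jensen) gives E[m_k^2] <= sigma^2/b + G^2.
assert second_moment <= sigma2_b + G2
```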

3.1. ADVANTAGE OF SETTING A SMALL CONSTANT LEARNING RATE AND HYPERPARAMETERS CLOSE TO ONE

We first show theoretical evidence that Adam using a small constant learning rate α, β1 and β2 close to 1, and a large number of steps K performs well. Theorem 3.1(i) indicates that the upper bound of VI(K, θ) for Adam is

VI(K, θ) ≤ dD(θ)√M/(2αβ1√(1 − β2)K) + {α(σ²b^{−1} + G²)/(2√v* β1(1 − β1)K)} Σ_{k=1}^{K} √(1 − β2^{k+1}) + h(β1), (7)

where β1 ∈ (0, 1), β2 ∈ [0, 1), and

h(β1) := Dist(θ){G(1 − β1)/β1 + 2√(σ² + G²)((1/β1) + 2)(1 − β1)} (8)

(the strict evaluation (7) of Theorem 3.1(i) comes from (31) in Appendix A). Since the function h(β1) defined by (8) is monotone decreasing, using β1 close to 1 makes h(β1) small. Since 1/(β1(1 − β1)) is monotone increasing for β1 ≥ 1/2, using β1 close to 1 makes 1/(β1(1 − β1)) large; hence, we need to set a small α to make α(σ²b^{−1} + G²)/(2√v* β1(1 − β1)) small. The function √(1 − β2^{k+1}) is monotone decreasing with respect to β2, while using β2 close to 1 makes 1/√(1 − β2) large. Hence, when β2 close to 1 and a small learning rate α are used, we need to use a large number of steps K to make dD(θ)√M/(2αβ1√(1 − β2)K) small.
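The competing effects of taking β1 close to 1 can be tabulated numerically (a sketch using a reconstruction of the form of h in (8), with hypothetical values for Dist(θ), G, and σ², all of which are unknown in practice):

```python
# Hypothetical constants (Dist(theta), G, sigma^2 are unknown in practice).
Dist, G, sigma2 = 1.0, 1.0, 1.0

def h(beta1):
    """h(beta1) as reconstructed from (8); monotone decreasing, h -> 0 as beta1 -> 1."""
    return Dist * (G * (1 - beta1) / beta1
                   + 2 * (sigma2 + G ** 2) ** 0.5 * (1 / beta1 + 2) * (1 - beta1))

def amplification(beta1):
    """Factor 1/(beta1*(1-beta1)) multiplying alpha in C2 and C3 for Adam."""
    return 1.0 / (beta1 * (1 - beta1))

betas = [0.5, 0.9, 0.99, 0.999]
hs = [h(b) for b in betas]
amps = [amplification(b) for b in betas]

# h decreases toward 0 as beta1 -> 1, while 1/(beta1*(1-beta1)) blows up,
# which is why a small constant learning rate alpha is needed.
assert all(hs[i] > hs[i + 1] for i in range(len(hs) - 1))
assert all(amps[i] < amps[i + 1] for i in range(len(amps) - 1))
```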

3.2. CRITICAL BATCH SIZE

Theorem 3.1(ii) indicates that the number of steps K needed to satisfy VI(K, θ) ≤ ε can be expressed as (3). The function K(b) defined by (3) is convex and monotone decreasing. Hence, the form of K defined by (3) supports theoretically the relationship between K and b shown in (Shallue et al., 2019; Zhang et al., 2019) (see also Figures 1 and 3 in this paper). Theorem 3.1(iii) indicates that the SFO complexity defined by (4) is convex with respect to the batch size b. This result agrees with relation (1) (see also the SFO complexity figures in Section 4). The critical batch sizes of SGD, Momentum, and Adam are

b⋆_S = σ²_S α/(ε − C3,S), b⋆_M = σ²_M α/(β1(ε − C3,M)), b⋆_A = σ²_A α/(√v* β1(1 − β1)(ε − C3,A)), (9)

which, since C3 > 0 implies ε − C3 < ε, satisfy

b⋆_S > b*_S := σ²_S α/ε, b⋆_M > b*_M := σ²_M α/(β1 ε), b⋆_A > b*_A := σ²_A α/(√v* β1(1 − β1)ε). (10)

4. NUMERICAL RESULTS

We evaluated the performances of SGD, Momentum, and Adam with different batch sizes. The metrics were the number of steps K and the SFO complexity N satisfying f(θ_K) ≤ 10^{−1}, where θ_K is generated by each of SGD, Momentum, and Adam using batch size b. The stopping condition was 200 epochs. The experimental environment consisted of two Intel(R) Xeon(R) Gold 6148 2.4-GHz CPUs with 20 cores each, a 16-GB NVIDIA Tesla V100 900-Gbps GPU, and the Red Hat Enterprise Linux 7.6 OS. The code was written in Python 3.8.2 using the NumPy 1.17.3 and PyTorch 1.3.0 packages. A constant learning rate α = 10^{−3} was used throughout. Momentum used β1 = 0.9. Adam used β1 = 0.9 and β2 = 0.999 (Kingma & Ba, 2015).

Here, we use (10) to estimate appropriate batch sizes in the sense that the SFO complexity is minimized. The definitions of v* and v̂_{k,i} (see also (26) in Appendix A) imply that, for all k ∈ [K] and all i ∈ [d], v* ≤ v̂_{k,i} ≤ max_{k∈[K]} max_{i∈[d]} g²_{k,i} =: g²_{k*,i*} ≤ Σ_{i=1}^{d} g²_{k*,i} = ‖∇f_{B_{k*}}(θ_{k*})‖². Condition (C2) implies that E[‖∇f_{B_k}(θ_k)‖²] ≤ σ²/b + E[‖∇f(θ_k)‖²] (see also (14) in Appendix A). Conditions (C2) and (C3) imply that, if b is large, then σ is small. Hence, assuming that σ²/b ≈ 0 implies that v* ≤ ‖∇f(θ_{k*})‖² = ‖(1/n) Σ_{i∈[n]} ∇ℓ_i(θ_{k*})‖² = (1/n²)‖Σ_{i∈[n]} ∇ℓ_i(θ_{k*})‖² =: (1/n²)‖G_{k*}‖² = (1/n²) Σ_{i∈[d]} G²_{k*,i} ≤ (d/n²) max_{i∈[d]} G²_{k*,i}. Since deep learning optimizers can approximate stationary points of f, we assume that, for example, G_{k*,i} ≈ ε. Then, (10) implies that

b*_S := σ²_S/(10³ε), b*_M := σ²_M/(9·10²ε), b*_A := σ²_A/(9·10 √v* ε) > σ²_A n/(9·10 √d ε²) =: b**_A. (11)

If SGD and Momentum can exploit large batch sizes, then σ_S and σ_M are small. However, (11) implies that b*_S and b*_M must be small when σ_S and σ_M are small. Accordingly, SGD and Momentum would not be able to use large batch sizes.
Meanwhile, from (11), Adam would be able to exploit large batch sizes.

4.1. RESNET-20 ON THE CIFAR-10 DATASET

Let us consider training ResNet-20 on the CIFAR-10 dataset with n = 50000. Figure 1 shows the number of steps for SGD, Momentum, and Adam versus batch size. For SGD and Momentum, the number of steps K needed for f(θ_K) ≤ 10^{−1} initially decreased. However, SGD with b ≥ 2^6 and Momentum with b ≥ 2^10 did not satisfy f(θ_K) ≤ 10^{−1} before the stopping condition was reached. Adam had an initial period of perfect scaling (indicated by the dashed line) such that the number of steps K needed for f(θ_K) ≤ 10^{−1} was inversely proportional to batch size b, and a critical batch size b⋆ = 2^11 such that K was no longer inversely proportional to the batch size beyond b⋆; i.e., there were diminishing returns. Figure 2 plots the SFO complexities for SGD, Momentum, and Adam versus batch size. For SGD, the SFO complexity was minimized at b⋆ = 2^2; for Momentum, it was minimized at b⋆ = 2^3. This implies that SGD and Momentum could not use large batch sizes, as indicated by (11). For Adam, the SFO complexity was minimized at the critical batch size b⋆ = 2^11 = 2048, which is close to the estimate in (11), with 1600 ≈ b**_A < b⋆_A = 2^11. We also checked that the elapsed time for Adam monotonically decreased for b ≤ 2^11 and that the elapsed time for the critical batch size b⋆ = 2^11 was the shortest. The elapsed time for b ≥ 2^12 increased with the SFO complexity, as shown in Figure 2 (see Tables 3, 4, 5, and 6 in Appendix A).

4.2. RESNET-18 ON THE MNIST DATASET

(Figure: batch sizes 2^2 to 2^14 on the horizontal axis.) The critical batch size for Adam was b⋆ = 2^10, which is close to the estimate from (11). We can also check that the elapsed time for the critical batch size b⋆ = 2^10 was the shortest (see Tables 7, 8, 9, and 10 in Appendix A). We also estimated appropriate batch sizes of Adam for (i) a CNN on MNIST (n = 60000, d ≈ 7.7 × 10^6) and (ii) ResNet-32 on CIFAR-10 (n = 50000, d ≈ 2.0 × 10^7) from b**_A in (11) and checked that the actual critical batch sizes in (Zhang et al., 2019, Figure 5(a), (e)) are close to the estimated batch sizes ((i) b⋆_A = 2^11 = 2048 > b**_A ≈ 2000; (ii) b⋆_A = 2^10 = 1024 ≈ b**_A ≈ 1200).

5. CONCLUSION

This paper showed the relationship between the batch size b and the number of steps K needed to achieve an ε-approximation for deep learning optimizers using a small constant learning rate α and hyperparameters β1 and β2 close to 1. From the convexity of the SFO complexity N, there exists a global minimizer b⋆ of N, which is the critical batch size. We also gave numerical results indicating that Adam using a small constant learning rate, hyperparameters close to one, and the critical batch size converges faster than Momentum and SGD. Moreover, we estimated appropriate batch sizes from our formula for b⋆ and showed that the actual critical batch sizes are close to the estimated ones.

A.1 PREVIOUS RESULTS IN TABLE 1

This section provides the upper bounds of the performance measures of the optimizers indicated in Table 1.

A.1.1 TAIL-AVERAGED SGD

We consider the stochastic approximation problem of least-squares regression, which is to minimize the expected square loss function f(θ) := (1/2) E_{(x,y)∼D}[(y − x^⊤θ)²] (see Section 2.1 for the mathematical preliminaries). Tail-averaged SGD (Jain et al., 2018, Algorithm 1) with a constant learning rate α depending on the batch size b and the largest eigenvalue of the Hessian of f satisfies

E[f(θ̄)] − f(θ⋆) ≤ {2(1 − αµ)^s/(α²µ²(n/b − s)²)}(f(θ_0) − f(θ⋆)) + 4σ²/{b(n/b − s)},

where s is the number of initial iterations, n is the total number of samples, µ > 0 is the smallest eigenvalue of the Hessian of f, (θ_i)_{i>s} is the sequence generated by tail-averaged SGD, and θ̄ := (1/(⌊n/b⌋ − s)) Σ_{i>s} θ_i (Jain et al., 2018, Theorem 1). Hence, with K := n/b − s, the upper bound of tail-averaged SGD is E[f(θ̄)] − f(θ⋆) = O(1/K² + 1/(Kb)).

A.1.2 ADAM

Theorem 1 in (Zaheer et al., 2018) and its proof show that, under the condition that ∇f is Lipschitz continuous with Lipschitz constant L, Adam using α = O(L^{−1}), β1 = 0, and β2 ≥ 1 − O(G^{−2}) satisfies

(1/K) Σ_{k=1}^{K} E[‖∇f(θ_k)‖²] ≤ 2(√β2 G + ε){(f(θ1) − f(θ⋆))/(αK) + (G√(1 − β2)/ε² + Lα/(2ε²))(σ²/b)},

that is, the upper bound of Adam with β1 = 0 is (1/K) Σ_{k=1}^{K} E[‖∇f(θ_k)‖²] = O(1/K + 1/b).

A.1.3 GENERIC ADAM

Theorem 4 in (Zou et al., 2019) shows that, under the condition that ∇f is Lipschitz continuous with Lipschitz constant L, Generic Adam using α_k = O(1/√k), β1 ∈ (0, 1), and β_{2k} = 1 − α_{k+1} satisfies

E[‖∇f(θ_τ)‖^{4/3}]^{3/2} ≤ (D + D′ Σ_{k=1}^{K} α_k √(1 − β_{2k}))/(α_K K),

where τ is randomly chosen from {1, 2, . . . , K}, and D and D′ are positive constants depending on α > 0, γ ∈ (0, 1), v_{0,1}, B, L, d, β1, β_{2,1}, and f(θ1) − f(θ⋆) (their explicit forms are given in (Zou et al., 2019)). This implies that the upper bound of Generic Adam is E[‖∇f(θ_τ)‖^{4/3}]^{3/2} = O(log K/√K) (see also Corollary 10 in (Zou et al., 2019) with s = 1/2 and r = 1).

A.1.4 ADAFOM

Corollary 3.2 in (Chen et al., 2019) shows that, under the condition that ∇f is Lipschitz continuous with Lipschitz constant L, the AdaGrad with First Order Momentum (AdaFom) algorithm using α_k = 1/√k and β1 ∈ (0, 1) satisfies

(1/K) Σ_{k=1}^{K} E[‖∇f(θ_k)‖²] ≤ (Q1 + Q2 log K)/√K,

where Q1 and Q2 are positive constants depending on G, Ḡ > 0, c > 0, d, L, β1, and E[f(θ1) − f(θ⋆)] (their explicit forms are given in (Chen et al., 2019)). Hence, the upper bound of AdaFom is O(log K/√K).

A.1.5 AMSGRAD

Theorem 3 in (Zhou et al., 2020) shows that, under the condition that ∇f is Lipschitz continuous with Lipschitz constant L, AMSGrad using a constant learning rate α and β1 and β2 with β1 < √β2 satisfies

(1/K) Σ_{k=1}^{K} E[‖∇f(θ_k)‖²] ≤ D1/(αK) + D2 d/K + αD3 d/K^{1/2−s},

where s ∈ (0, 1/2), G∞ > 0, γ > 0, D1 := 2G∞(f(θ1) − f(θ⋆)), D2 := 2G∞³/(γ(1 − β1)), and D3 := {2LG∞²/(γ√(1 − β2)(1 − β1/√β2))}(1 + 2β1²/(1 − β1)). Hence, the upper bound of AMSGrad is (1/K) Σ_{k=1}^{K} E[‖∇f(θ_k)‖²] = O(1/K^{1/2−s}).

A.1.6 ADABELIEF

Theorem 2.2 in (Zhuang et al., 2020) shows that, under the condition that ∇f is Lipschitz continuous with Lipschitz constant L, AdaBelief using α_k = α/√k, where α > 0, and β1, β2 ∈ (0, 1) satisfies

(1/K) Σ_{k=1}^{K} E[‖∇f(θ_k)‖²] ≤ (Q1 + Q2 log K)/(α√K),

where Q1 and Q2 are positive constants depending on G, Ḡ > 0, c > 0, σ², d, L, α, β1, and E[f(θ1) − f(θ⋆)] (their explicit forms are given in (Zhuang et al., 2020)). Hence, the upper bound of AdaBelief is O(log K/√K).

A.1.7 PADAM

Corollary 4.5 in (Chen et al., 2021) shows that, under the condition that ∇f is Lipschitz continuous with Lipschitz constant L, the partially adaptive momentum estimation (Padam) method using a constant learning rate α, p ∈ [0, 1/4], and β1 and β2 with β1 < β2^{2p} satisfies

(1/K) Σ_{k=1}^{K} E[‖∇f(θ_k)‖²] ≤ D1/(αK) + D2 d/K + αD3 dG∞/K^{1/2−s},

where s ∈ (0, 1/2), G∞ > 0, D1 := 2G∞^{2p}(f(θ1) − f(θ⋆)), D2 := 4G∞^{2+2p} E[‖v1^{−p}‖_1]/(d(1 − β1)) + 4G∞², and D3 := 4LG∞^{1−2p}/(1 − β2)^{2p} + {8LG∞^{1−2p}(1 − β1)/((1 − β2)^{2p}(1 − β1/β2^{2p}))}(β1/(1 − β1))². Hence, the upper bound of Padam is (1/K) Σ_{k=1}^{K} E[‖∇f(θ_k)‖²] = O(1/K^{1/2−s}).

A.2 LEMMAS

Lemma A.1. Suppose that (C1), (C2), and (C3) hold. Then, Adam satisfies the following: for all k ∈ N and all θ ∈ R d , E ∥θ k+1 -θ∥ 2 H k = E ∥θ k -θ∥ 2 H k + α 2 E ∥d k ∥ 2 H k + 2α β 1 β1k E (θ -θ k ) ⊤ m k-1 + β1 β1k E (θ -θ k ) ⊤ ∇f (θ k ) , where d k := -H -1 k mk , β1 := 1 -β 1 , and β1k := 1 - β k+1 1 . Proof. Let θ ∈ R d and k ∈ N. The definition θ k+1 := θ k + αd k implies that ∥θ k+1 -θ∥ 2 H k = ∥θ k -θ∥ 2 H k + 2α ⟨θ k -θ, d k ⟩ H k + α 2 ∥d k ∥ 2 H k . Moreover, the definitions of d k , m k , and mk ensure that ⟨θ k -θ, d k ⟩ H k = ⟨θ k -θ, H k d k ⟩ = ⟨θ -θ k , mk ⟩ = 1 β1k (θ -θ k ) ⊤ m k = β 1 β1k (θ -θ k ) ⊤ m k-1 + β1 β1k (θ -θ k ) ⊤ ∇f B k (θ k ). Hence, ∥θ k+1 -θ∥ 2 H k = ∥θ k -θ∥ 2 H k + α 2 ∥d k ∥ 2 H k + 2α β 1 β1k (θ -θ k ) ⊤ m k-1 + β1 β1k (θ -θ k ) ⊤ ∇f B k (θ k ) . ( ) Conditions (C2) and (C3) guarantee that E E (θ -θ k ) ⊤ ∇f B k (θ k ) θ k = E (θ -θ k ) ⊤ E ∇f B k (θ k ) θ k = E (θ -θ k ) ⊤ ∇f (θ k ) . Therefore, the lemma follows from taking the expectation on both sides of (13). This completes the proof. The discussion in the proof of Lemma A.1 also gives the following lemma. Lemma A.2. Suppose that (C1), (C2), and (C3) hold. Then, SGD satisfies the following: for all k ∈ N and all θ ∈ R d , E ∥θ k+1 -θ∥ 2 = E ∥θ k -θ∥ 2 + α 2 E ∥∇f B k (θ k )∥ 2 + 2αE (θ -θ k ) ⊤ ∇f (θ k ) . Moreover, Momentum satisfies the following: for all k ∈ N and all θ ∈ R d , E ∥θ k+1 -θ∥ 2 = E ∥θ k -θ∥ 2 + α 2 E ∥m k ∥ 2 + 2α β 1 E (θ -θ k ) ⊤ m k-1 + β1 E (θ -θ k ) ⊤ ∇f (θ k ) , where β1 := 1 -β 1 . The following lemma indicates the bounds on C2) and ( A1), for all k ∈ N satisfies (E[∥m k ∥ 2 ]) k∈N and (E[∥d k ∥ 2 H k ]) k∈N . Lemma A.3. Adam under ( E ∥m k ∥ 2 ≤ σ 2 b + G 2 , E ∥d k ∥ 2 H k ≤ β2k β2 1k √ v * σ 2 b + G 2 , where v * := inf{min i∈[d] v k,i : k ∈ N}, β1k := 1 -β k+1 1k , and β2k := 1 -β k+1 2k . Proof. 
Condition (C2) implies that
$$\mathbb{E}\left[\|\nabla f_{B_k}(\theta_k)\|^2 \mid \theta_k\right] = \mathbb{E}\left[\|\nabla f_{B_k}(\theta_k)-\nabla f(\theta_k)\|^2 \mid \theta_k\right] + \mathbb{E}\left[\|\nabla f(\theta_k)\|^2 \mid \theta_k\right] + 2\mathbb{E}\left[(\nabla f_{B_k}(\theta_k)-\nabla f(\theta_k))^\top\nabla f(\theta_k) \mid \theta_k\right] = \mathbb{E}\left[\|\nabla f_{B_k}(\theta_k)-\nabla f(\theta_k)\|^2 \mid \theta_k\right] + \|\nabla f(\theta_k)\|^2, \tag{14}$$
which, together with (C2) and (A1), in turn implies that
$$\mathbb{E}\left[\|\nabla f_{B_k}(\theta_k)\|^2\right] \leq \frac{\sigma^2}{b} + G^2. \tag{15}$$
The convexity of $\|\cdot\|^2$, together with the definition of $m_k$ and (15), guarantees that, for all $k \in \mathbb{N}$,
$$\mathbb{E}\left[\|m_k\|^2\right] \leq \beta_1\mathbb{E}\left[\|m_{k-1}\|^2\right] + \hat{\beta}_1\mathbb{E}\left[\|\nabla f_{B_k}(\theta_k)\|^2\right] \leq \beta_1\mathbb{E}\left[\|m_{k-1}\|^2\right] + \hat{\beta}_1\left(\frac{\sigma^2}{b}+G^2\right).$$
Induction thus ensures that, for all $k \in \mathbb{N}$,
$$\mathbb{E}\left[\|m_k\|^2\right] \leq \max\left\{\|m_{-1}\|^2, \frac{\sigma^2}{b}+G^2\right\} = \frac{\sigma^2}{b}+G^2, \tag{16}$$
where $m_{-1} = 0$. For $k \in \mathbb{N}$, $H_k \in \mathbb{S}_{++}^d$ guarantees the existence of a unique matrix $\bar{H}_k \in \mathbb{S}_{++}^d$ such that $H_k = \bar{H}_k^2$ (Horn & Johnson, 1985, Theorem 7.2.6). We have that, for all $x \in \mathbb{R}^d$, $\|x\|_{H_k}^2 = \|\bar{H}_k x\|^2$. Accordingly, the definitions of $d_k$ and $\hat{m}_k$ imply that, for all $k \in \mathbb{N}$,
$$\mathbb{E}\left[\|d_k\|_{H_k}^2\right] = \mathbb{E}\left[\|\bar{H}_k^{-1}\hat{m}_k\|^2\right] \leq \frac{1}{\tilde{\beta}_{1k}^2}\mathbb{E}\left[\|\bar{H}_k^{-1}\|^2\|m_k\|^2\right],$$
where
$$\|\bar{H}_k^{-1}\| = \left\|\mathrm{diag}\left(\hat{v}_{k,i}^{-1/4}\right)\right\| = \max_{i\in[d]}\hat{v}_{k,i}^{-1/4} = \max_{i\in[d]}\left(\frac{v_{k,i}}{\tilde{\beta}_{2k}}\right)^{-1/4} =: \left(\frac{v_{k,i_*}}{\tilde{\beta}_{2k}}\right)^{-1/4}.$$
Moreover, the definition $v_* := \inf\{v_{k,i_*} : k \in \mathbb{N}\}$ and (16) imply that, for all $k \in \mathbb{N}$,
$$\mathbb{E}\left[\|d_k\|_{H_k}^2\right] \leq \frac{\tilde{\beta}_{2k}^{1/2}}{\tilde{\beta}_{1k}^2 v_*^{1/2}}\left(\frac{\sigma^2}{b}+G^2\right),$$
completing the proof.

A.3 PROOF OF THEOREM 3.1

A.3.1 SGD

We show Theorem 3.1 for SGD.

Proof. (i) Lemma A.2 and (15) imply that, for all $k \in \mathbb{N}$ and all $\theta \in \mathbb{R}^d$,
$$2\alpha\mathbb{E}\left[(\theta_k-\theta)^\top\nabla f(\theta_k)\right] = \mathbb{E}\left[\|\theta_k-\theta\|^2\right] - \mathbb{E}\left[\|\theta_{k+1}-\theta\|^2\right] + \alpha^2\mathbb{E}\left[\|\nabla f_{B_k}(\theta_k)\|^2\right] \leq \mathbb{E}\left[\|\theta_k-\theta\|^2\right] - \mathbb{E}\left[\|\theta_{k+1}-\theta\|^2\right] + \alpha^2\left(\frac{\sigma^2}{b}+G^2\right).$$
Summing the above inequality from $k=1$ to $k=K$ leads to the finding that, for all $K \geq 1$,
$$2\alpha\sum_{k=1}^K\mathbb{E}\left[(\theta_k-\theta)^\top\nabla f(\theta_k)\right] \leq \mathbb{E}\left[\|\theta_1-\theta\|^2\right] - \mathbb{E}\left[\|\theta_{K+1}-\theta\|^2\right] + \alpha^2\left(\frac{\sigma^2}{b}+G^2\right)K,$$
which implies that, for all $K \geq 1$ and all $\theta \in \mathbb{R}^d$,
$$\mathrm{VI}(K,\theta) := \frac{1}{K}\sum_{k=1}^K\mathbb{E}\left[(\theta_k-\theta)^\top\nabla f(\theta_k)\right] \leq \frac{\mathbb{E}[\|\theta_1-\theta\|^2]}{2\alpha K} + \frac{\alpha}{2}\left(\frac{\sigma^2}{b}+G^2\right) = \underbrace{\frac{\mathbb{E}[\|\theta_1-\theta\|^2]}{2\alpha}}_{C_1}\frac{1}{K} + \underbrace{\frac{\sigma^2\alpha}{2}}_{C_2}\frac{1}{b} + \underbrace{\frac{G^2\alpha}{2}}_{C_3}. \tag{17}$$
(ii) Let $\theta \in \mathbb{R}^d$ and $\epsilon > 0$. The condition $\frac{C_1}{K} + \frac{C_2}{b} + C_3 = \epsilon$ is equivalent to
$$K = K(b) = \frac{C_1 b}{(\epsilon-C_3)b - C_2}. \tag{18}$$
Since $\epsilon = \frac{C_1}{K} + \frac{C_2}{b} + C_3 > C_3$, we consider the case $b > \frac{C_2}{\epsilon-C_3} > 0$ to guarantee that $K > 0$. From (17), the function $K$ defined by (18) satisfies $\mathrm{VI}(K,\theta) \leq \frac{C_1}{K} + \frac{C_2}{b} + C_3 = \epsilon$. Moreover, from (18),
$$\frac{dK(b)}{db} = \frac{-C_1C_2}{\{(\epsilon-C_3)b-C_2\}^2} \leq 0, \qquad \frac{d^2K(b)}{db^2} = \frac{2C_1C_2(\epsilon-C_3)}{\{(\epsilon-C_3)b-C_2\}^3} \geq 0,$$
which implies that $K$ is convex and monotone decreasing with respect to $b$.
(iii) We have that
$$K(b)b = \frac{C_1 b^2}{(\epsilon-C_3)b - C_2}.$$
Accordingly,
$$\frac{dK(b)b}{db} = \frac{C_1 b\{(\epsilon-C_3)b - 2C_2\}}{\{(\epsilon-C_3)b-C_2\}^2}, \qquad \frac{d^2K(b)b}{db^2} = \frac{2C_1C_2^2}{\{(\epsilon-C_3)b-C_2\}^3} \geq 0,$$
which implies that $K(b)b$ is convex with respect to $b$ and
$$\frac{dK(b)b}{db}\begin{cases} < 0 & \text{if } b < b^\star, \\ = 0 & \text{if } b = b^\star := \frac{2C_2}{\epsilon-C_3}, \\ > 0 & \text{if } b > b^\star. \end{cases}$$
The point $b^\star$ attains the minimum value $K(b^\star)b^\star = \frac{4C_1C_2}{(\epsilon-C_3)^2}$ of $K(b)b$. This completes the proof.
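As a quick sanity check of (ii) and (iii), the closed forms above can be evaluated numerically. The sketch below uses hypothetical constants $C_1$, $C_2$, $C_3$ and precision $\epsilon$ (any values with $\epsilon > C_3$ work; here $b^\star = 100$ and $K(b^\star)b^\star = 25000$) and verifies that $K(b)$ is decreasing while $N(b) = K(b)b$ is minimized at the critical batch size $b^\star = 2C_2/(\epsilon - C_3)$:

```python
import numpy as np

# hypothetical constants C1, C2, C3 and precision eps with eps > C3
C1, C2, C3, eps = 5.0, 2.0, 0.01, 0.05

def K(b):
    # steps needed so that C1/K + C2/b + C3 = eps, i.e. eq. (18)
    return C1 * b / ((eps - C3) * b - C2)

def N(b):
    # SFO complexity N = K(b) * b
    return K(b) * b

b_star = 2 * C2 / (eps - C3)            # critical batch size (Theorem 3.1(iii))
N_star = 4 * C1 * C2 / (eps - C3) ** 2  # minimum SFO complexity

b_grid = np.linspace(0.6 * b_star, 4 * b_star, 2001)
assert all(N(b) >= N_star - 1e-6 for b in b_grid)  # b_star minimizes N
assert abs(N(b_star) - N_star) < 1e-6
assert K(b_grid[0]) > K(b_grid[-1])                # K(b) decreases in b
print(round(b_star), round(N_star))
```

Past $b^\star$, each additional sample in the batch buys less than a proportional reduction in steps, so $N(b)$ turns upward; that is exactly the behavior reported in the numerical experiments.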

A.3.2 MOMENTUM

We show Theorem 3.1 for Momentum.

Proof. (i) Lemma A.2 ensures that, for all $k \in \mathbb{N}$ and all $\theta \in \mathbb{R}^d$,
$$\mathbb{E}\left[(\theta_k-\theta)^\top m_{k-1}\right] = \frac{1}{2\alpha\beta_1}\left\{\mathbb{E}\left[\|\theta_k-\theta\|^2\right] - \mathbb{E}\left[\|\theta_{k+1}-\theta\|^2\right]\right\} + \frac{\alpha}{2\beta_1}\mathbb{E}\left[\|m_k\|^2\right] + \frac{\hat{\beta}_1}{\beta_1}\mathbb{E}\left[(\theta-\theta_k)^\top\nabla f(\theta_k)\right],$$
which, together with Lemma A.3, the Cauchy-Schwarz inequality, and (A1) and (A2), implies that
$$\mathbb{E}\left[(\theta_k-\theta)^\top m_{k-1}\right] \leq \frac{1}{2\alpha\beta_1}\left\{\mathbb{E}\left[\|\theta_k-\theta\|^2\right] - \mathbb{E}\left[\|\theta_{k+1}-\theta\|^2\right]\right\} + \frac{\alpha}{2\beta_1}\left(\frac{\sigma^2}{b}+G^2\right) + \frac{\hat{\beta}_1}{\beta_1}\mathrm{Dist}(\theta)G.$$
Summing the above inequality from $k=1$ to $k=K$ gives
$$\sum_{k=1}^K\mathbb{E}\left[(\theta_k-\theta)^\top m_{k-1}\right] \leq \frac{1}{2\alpha\beta_1}\left\{\mathbb{E}\left[\|\theta_1-\theta\|^2\right] - \mathbb{E}\left[\|\theta_{K+1}-\theta\|^2\right]\right\} + \frac{\alpha}{2\beta_1}\left(\frac{\sigma^2}{b}+G^2\right)K + \frac{\hat{\beta}_1}{\beta_1}\mathrm{Dist}(\theta)GK,$$
and hence
$$\frac{1}{K}\sum_{k=1}^K\mathbb{E}\left[(\theta_k-\theta)^\top m_{k-1}\right] \leq \frac{\mathbb{E}[\|\theta_1-\theta\|^2]}{2\alpha\beta_1 K} + \frac{\alpha}{2\beta_1}\left(\frac{\sigma^2}{b}+G^2\right) + \frac{\hat{\beta}_1}{\beta_1}\mathrm{Dist}(\theta)G.$$
Moreover, we have that, for all $k \in \mathbb{N}$ and all $\theta \in \mathbb{R}^d$,
$$(\theta_k-\theta)^\top m_k = (\theta_k-\theta)^\top m_{k-1} + \hat{\beta}_1(\theta_k-\theta)^\top(\nabla f_{B_k}(\theta_k) - m_{k-1}) \leq (\theta_k-\theta)^\top m_{k-1} + \hat{\beta}_1\mathrm{Dist}(\theta)\left(\|\nabla f_{B_k}(\theta_k)\| + \|m_{k-1}\|\right),$$
where the equality comes from the definition of $m_k$ and the inequality comes from the Cauchy-Schwarz inequality, the triangle inequality, and (A2). Hence, from Lemma A.3, (15), Jensen's inequality, and $b \geq 1$,
$$\mathbb{E}\left[(\theta_k-\theta)^\top m_k\right] \leq \mathbb{E}\left[(\theta_k-\theta)^\top m_{k-1}\right] + 2\hat{\beta}_1\mathrm{Dist}(\theta)\sqrt{\frac{\sigma^2}{b}+G^2} \leq \mathbb{E}\left[(\theta_k-\theta)^\top m_{k-1}\right] + 2\hat{\beta}_1\mathrm{Dist}(\theta)\sqrt{\sigma^2+G^2}. \tag{19}$$
Therefore, for all $K \geq 1$ and all $\theta \in \mathbb{R}^d$,
$$\frac{1}{K}\sum_{k=1}^K\mathbb{E}\left[(\theta_k-\theta)^\top m_k\right] \leq \frac{\mathbb{E}[\|\theta_1-\theta\|^2]}{2\alpha\beta_1 K} + \frac{\sigma^2\alpha}{2\beta_1 b} + \frac{G^2\alpha}{2\beta_1} + \hat{\beta}_1\mathrm{Dist}(\theta)\left(\frac{G}{\beta_1} + 2\sqrt{\sigma^2+G^2}\right). \tag{20}$$
The definition of $m_k$ ensures that
$$(\theta_k-\theta)^\top\nabla f_{B_k}(\theta_k) = (\theta_k-\theta)^\top m_k + (\theta_k-\theta)^\top(\nabla f_{B_k}(\theta_k) - m_{k-1}) + (\theta_k-\theta)^\top(m_{k-1} - m_k) = (\theta_k-\theta)^\top m_k + \frac{1}{\beta_1}(\theta_k-\theta)^\top(\nabla f_{B_k}(\theta_k) - m_k) + \hat{\beta}_1(\theta_k-\theta)^\top(m_{k-1} - \nabla f_{B_k}(\theta_k)),$$
which, together with the Cauchy-Schwarz inequality, the triangle inequality, and (A2), implies that
$$(\theta_k-\theta)^\top\nabla f_{B_k}(\theta_k) \leq (\theta_k-\theta)^\top m_k + \frac{1}{\beta_1}\mathrm{Dist}(\theta)\left(\|\nabla f_{B_k}(\theta_k)\| + \|m_k\|\right) + \hat{\beta}_1\mathrm{Dist}(\theta)\left(\|\nabla f_{B_k}(\theta_k)\| + \|m_{k-1}\|\right).$$
Lemma A.3, (15), Jensen's inequality, and $b \geq 1$ guarantee that
$$\mathbb{E}\left[(\theta_k-\theta)^\top\nabla f(\theta_k)\right] \leq \mathbb{E}\left[(\theta_k-\theta)^\top m_k\right] + 2\left(\frac{1}{\beta_1}+\hat{\beta}_1\right)\mathrm{Dist}(\theta)\sqrt{\frac{\sigma^2}{b}+G^2} \leq \mathbb{E}\left[(\theta_k-\theta)^\top m_k\right] + 2\left(\frac{1}{\beta_1}+\hat{\beta}_1\right)\mathrm{Dist}(\theta)\sqrt{\sigma^2+G^2}. \tag{21}$$
Therefore, (20) ensures that, for all $K \geq 1$ and all $\theta \in \mathbb{R}^d$,
$$\frac{1}{K}\sum_{k=1}^K\mathbb{E}\left[(\theta_k-\theta)^\top\nabla f(\theta_k)\right] \leq \underbrace{\frac{\mathbb{E}[\|\theta_1-\theta\|^2]}{2\alpha\beta_1}}_{C_1}\frac{1}{K} + \underbrace{\frac{\sigma^2\alpha}{2\beta_1}}_{C_2}\frac{1}{b} + \underbrace{\frac{G^2\alpha}{2\beta_1} + \mathrm{Dist}(\theta)\left\{\frac{G\hat{\beta}_1}{\beta_1} + 2\sqrt{\sigma^2+G^2}\left(\frac{1}{\beta_1}+2\hat{\beta}_1\right)\right\}}_{C_3}.$$
(ii) A discussion similar to the one showing (ii) in Theorem 3.1 for SGD would show (ii) in Theorem 3.1 for Momentum. (iii) An argument similar to that which obtained (iii) in Theorem 3.1 for SGD would prove (iii) in Theorem 3.1 for Momentum.
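The expansions in Lemma A.2 hold pathwise, before any expectation is taken. A minimal numerical check of the SGD identity, using hypothetical random vectors, is:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_k = rng.normal(size=d)  # current iterate
theta = rng.normal(size=d)    # arbitrary comparison point
g = rng.normal(size=d)        # mini-batch gradient estimate
alpha = 0.1

theta_next = theta_k - alpha * g  # one SGD step

lhs = float(np.dot(theta_next - theta, theta_next - theta))
rhs = float(np.dot(theta_k - theta, theta_k - theta)
            + alpha**2 * np.dot(g, g)
            + 2 * alpha * np.dot(theta - theta_k, g))
assert abs(lhs - rhs) < 1e-12  # Lemma A.2 identity, pathwise
print("identity holds")
```

Taking conditional expectations of the last term and using (C2) replaces $g$ by $\nabla f(\theta_k)$, which is exactly how the lemma is stated.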

A.3.3 ADAM

We show Theorem 3.1 for Adam.

Proof. (i) Let $\theta \in \mathbb{R}^d$. Lemma A.1 guarantees that, for all $k \in \mathbb{N}$,
$$\mathbb{E}\left[(\theta_k-\theta)^\top m_{k-1}\right] = \underbrace{\frac{\tilde{\beta}_{1k}}{2\alpha\beta_1}\left\{\mathbb{E}\left[\|\theta_k-\theta\|_{H_k}^2\right] - \mathbb{E}\left[\|\theta_{k+1}-\theta\|_{H_k}^2\right]\right\}}_{a_k} + \underbrace{\frac{\alpha\tilde{\beta}_{1k}}{2\beta_1}\mathbb{E}\left[\|d_k\|_{H_k}^2\right]}_{b_k} + \underbrace{\frac{\hat{\beta}_1}{\beta_1}\mathbb{E}\left[(\theta-\theta_k)^\top\nabla f(\theta_k)\right]}_{c_k}. \tag{22}$$
We define $\gamma_k := \frac{\tilde{\beta}_{1k}}{2\beta_1\alpha}$ ($k \in \mathbb{N}$). Then, for all $K \geq 1$,
$$\sum_{k=1}^K a_k = \gamma_1\mathbb{E}\left[\|\theta_1-\theta\|_{H_1}^2\right] + \underbrace{\sum_{k=2}^K\left\{\gamma_k\mathbb{E}\left[\|\theta_k-\theta\|_{H_k}^2\right] - \gamma_{k-1}\mathbb{E}\left[\|\theta_k-\theta\|_{H_{k-1}}^2\right]\right\}}_{\Gamma_K} - \gamma_K\mathbb{E}\left[\|\theta_{K+1}-\theta\|_{H_K}^2\right]. \tag{23}$$
Since $\bar{H}_k \in \mathbb{S}_{++}^d$ exists such that $H_k = \bar{H}_k^2$, we have $\|x\|_{H_k}^2 = \|\bar{H}_k x\|^2$ for all $x \in \mathbb{R}^d$. Accordingly, we also have
$$\Gamma_K = \mathbb{E}\left[\sum_{k=2}^K\left\{\gamma_k\|\bar{H}_k(\theta_k-\theta)\|^2 - \gamma_{k-1}\|\bar{H}_{k-1}(\theta_k-\theta)\|^2\right\}\right].$$
From $\bar{H}_k = \mathrm{diag}(\hat{v}_{k,i}^{1/4})$, we have that, for all $x = (x_i)_{i=1}^d \in \mathbb{R}^d$, $\|\bar{H}_k x\|^2 = \sum_{i=1}^d\sqrt{\hat{v}_{k,i}}\,x_i^2$. Hence, for all $K \geq 2$,
$$\Gamma_K = \mathbb{E}\left[\sum_{k=2}^K\sum_{i=1}^d\left(\gamma_k\sqrt{\hat{v}_{k,i}} - \gamma_{k-1}\sqrt{\hat{v}_{k-1,i}}\right)(\theta_{k,i}-\theta_i)^2\right]. \tag{24}$$
Condition (2) and $\gamma_k \geq \gamma_{k-1}$ ($k \geq 1$) imply that, for all $k \geq 1$ and all $i \in [d]$,
$$\gamma_k\sqrt{\hat{v}_{k,i}} - \gamma_{k-1}\sqrt{\hat{v}_{k-1,i}} \geq 0. \tag{25}$$
Moreover, (A2) ensures that $D(\theta) := \sup\{\max_{i\in[d]}(\theta_{k,i}-\theta_i)^2 : k \in \mathbb{N}\} < +\infty$. Accordingly, for all $K \geq 2$,
$$\Gamma_K \leq D(\theta)\mathbb{E}\left[\sum_{k=2}^K\sum_{i=1}^d\left(\gamma_k\sqrt{\hat{v}_{k,i}} - \gamma_{k-1}\sqrt{\hat{v}_{k-1,i}}\right)\right] = D(\theta)\mathbb{E}\left[\sum_{i=1}^d\left(\gamma_K\sqrt{\hat{v}_{K,i}} - \gamma_1\sqrt{\hat{v}_{1,i}}\right)\right].$$
Let $\nabla f_{B_k}(\theta_k)\odot\nabla f_{B_k}(\theta_k) =: (g_{k,i}^2) \in \mathbb{R}_+^d$. Assumption (A1) ensures that there exists $M \in \mathbb{R}$ such that, for all $k \in \mathbb{N}$, $\max_{i\in[d]}g_{k,i}^2 \leq M$. The definition of $v_k$ guarantees that, for all $i \in [d]$ and all $k \in \mathbb{N}$, $v_{k,i} = \beta_2 v_{k-1,i} + \hat{\beta}_2 g_{k,i}^2$. Induction thus ensures that, for all $i \in [d]$ and all $k \in \mathbb{N}$,
$$v_{k,i} \leq \max\{v_{0,i}, M\} = M, \tag{26}$$
where $v_0 = (v_{0,i}) = 0$. From the definition of $\hat{v}_k$, we have that, for all $i \in [d]$ and all $k \in \mathbb{N}$,
$$\hat{v}_{k,i} = \frac{v_{k,i}}{\tilde{\beta}_{2k}} \leq \frac{M}{\tilde{\beta}_{2k}}. \tag{27}$$
Therefore, (23), $\mathbb{E}[\|\theta_1-\theta\|_{H_1}^2] \leq D(\theta)\mathbb{E}[\sum_{i=1}^d\sqrt{\hat{v}_{1,i}}]$, and (27) imply, for all $K \geq 1$,
$$\sum_{k=1}^K a_k \leq \gamma_1 D(\theta)\mathbb{E}\left[\sum_{i=1}^d\sqrt{\hat{v}_{1,i}}\right] + D(\theta)\mathbb{E}\left[\sum_{i=1}^d\left(\gamma_K\sqrt{\hat{v}_{K,i}} - \gamma_1\sqrt{\hat{v}_{1,i}}\right)\right] = \gamma_K D(\theta)\mathbb{E}\left[\sum_{i=1}^d\sqrt{\hat{v}_{K,i}}\right] \leq \gamma_K D(\theta)d\sqrt{\frac{M}{\tilde{\beta}_{2K}}} = \frac{dD(\theta)\sqrt{M}\,\tilde{\beta}_{1K}}{2\beta_1\alpha\sqrt{\tilde{\beta}_{2K}}}. \tag{28}$$
Inequality (28) with $\tilde{\beta}_{1K} := 1-\beta_1^{K+1} \leq 1$ and $\tilde{\beta}_{2K} := 1-\beta_2^{K+1} \geq 1-\beta_2 =: \hat{\beta}_2$ implies that
$$\sum_{k=1}^K a_k \leq \frac{dD(\theta)\sqrt{M}}{2\beta_1\alpha\sqrt{\hat{\beta}_2}}. \tag{29}$$
Lemma A.3 guarantees that, for all $k \in \mathbb{N}$,
$$b_k = \frac{\alpha\tilde{\beta}_{1k}}{2\beta_1}\mathbb{E}\left[\|d_k\|_{H_k}^2\right] \leq \frac{\alpha\tilde{\beta}_{1k}}{2\beta_1}\cdot\frac{\sqrt{\tilde{\beta}_{2k}}}{\tilde{\beta}_{1k}^2\sqrt{v_*}}\left(\frac{\sigma^2}{b}+G^2\right) = \frac{\alpha\sqrt{\tilde{\beta}_{2k}}}{2\sqrt{v_*}\beta_1\tilde{\beta}_{1k}}\left(\frac{\sigma^2}{b}+G^2\right). \tag{30}$$
Inequality (30) with $\tilde{\beta}_{1k} := 1-\beta_1^{k+1} \geq 1-\beta_1 =: \hat{\beta}_1$ and $\tilde{\beta}_{2k} := 1-\beta_2^{k+1} \leq 1$ implies that
$$b_k \leq \frac{\alpha}{2\sqrt{v_*}\beta_1\hat{\beta}_1}\left(\frac{\sigma^2}{b}+G^2\right). \tag{31}$$
The Cauchy-Schwarz inequality and (A2) imply that, for all $k \in \mathbb{N}$,
$$c_k = \frac{\hat{\beta}_1}{\beta_1}\mathbb{E}\left[(\theta-\theta_k)^\top\nabla f(\theta_k)\right] \leq \frac{\hat{\beta}_1}{\beta_1}\mathrm{Dist}(\theta)G. \tag{32}$$
Hence, (22), (29), (31), and (32) ensure that, for all $K \geq 1$,
$$\frac{1}{K}\sum_{k=1}^K\mathbb{E}\left[(\theta_k-\theta)^\top m_{k-1}\right] \leq \frac{dD(\theta)\sqrt{M}}{2\beta_1\alpha\sqrt{\hat{\beta}_2}\,K} + \frac{\alpha(\sigma^2 b^{-1}+G^2)}{2\sqrt{v_*}\beta_1\hat{\beta}_1} + \frac{\hat{\beta}_1}{\beta_1}\mathrm{Dist}(\theta)G.$$
Therefore, from (19), for all $K \geq 1$,
$$\frac{1}{K}\sum_{k=1}^K\mathbb{E}\left[(\theta_k-\theta)^\top m_k\right] \leq \frac{dD(\theta)\sqrt{M}}{2\beta_1\alpha\sqrt{\hat{\beta}_2}\,K} + \frac{\sigma^2\alpha}{2\sqrt{v_*}\beta_1\hat{\beta}_1 b} + \frac{G^2\alpha}{2\sqrt{v_*}\beta_1\hat{\beta}_1} + \hat{\beta}_1\mathrm{Dist}(\theta)\left(\frac{G}{\beta_1}+2\sqrt{\sigma^2+G^2}\right). \tag{33}$$
From (21) and (33), for all $K \geq 1$,
$$\frac{1}{K}\sum_{k=1}^K\mathbb{E}\left[(\theta_k-\theta)^\top\nabla f(\theta_k)\right] \leq \underbrace{\frac{dD(\theta)\sqrt{M}}{2\alpha\beta_1\sqrt{\hat{\beta}_2}}}_{C_1}\frac{1}{K} + \underbrace{\frac{\sigma^2\alpha}{2\sqrt{v_*}\beta_1\hat{\beta}_1}}_{C_2}\frac{1}{b} + \underbrace{\frac{G^2\alpha}{2\sqrt{v_*}\beta_1\hat{\beta}_1} + \mathrm{Dist}(\theta)\left\{\frac{G\hat{\beta}_1}{\beta_1} + 2\sqrt{\sigma^2+G^2}\left(\frac{1}{\beta_1}+2\hat{\beta}_1\right)\right\}}_{C_3}.$$
(ii) A discussion similar to the one showing (ii) in Theorem 3.1 for SGD would show (ii) in Theorem 3.1 for Adam. (iii) An argument similar to that which obtained (iii) in Theorem 3.1 for SGD would prove (iii) in Theorem 3.1 for Adam.
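The first bound in Lemma A.3, used repeatedly above, reflects that the exponential moving average $m_k = \beta_1 m_{k-1} + \hat{\beta}_1\nabla f_{B_k}(\theta_k)$ with $m_{-1} = 0$ never exceeds the largest squared gradient norm seen so far, by convexity of $\|\cdot\|^2$. A small simulation with synthetic (hypothetical) gradients illustrates this:

```python
import numpy as np

rng = np.random.default_rng(1)
beta1, d, K = 0.9, 8, 200
m = np.zeros(d)      # m_{-1} = 0
bound = 0.0          # running max of ||g_k||^2
for _ in range(K):
    g = rng.normal(size=d)           # synthetic stochastic gradient
    m = beta1 * m + (1 - beta1) * g  # first-moment update
    bound = max(bound, float(g @ g))
    # convexity of ||.||^2: ||m_k||^2 never exceeds max_j ||g_j||^2
    assert float(m @ m) <= bound + 1e-12
print("bound holds")
```

In the lemma the role of the running max is played by the uniform bound $\sigma^2/b + G^2$ from (15), which is why the same constant appears for every $k$.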

A.3.4 AMSGRAD

We show that the constants $C_i$ for AMSGrad are given by (6).

Proof. Let us consider AMSGrad (Algorithm 1 with $\hat{m}_k = m_k$, $\hat{v}_k = v_k$, $\tilde{v}_{k,i} := \max\{\tilde{v}_{k-1,i}, v_{k,i}\}$, and $H_k := \mathrm{diag}(\sqrt{\tilde{v}_{k,i}})$). Induction, together with (26) and $\tilde{v}_{k,i} := \max\{\tilde{v}_{k-1,i}, v_{k,i}\}$, ensures that, for all $k \in \mathbb{N}$, $\tilde{v}_{k,i} \leq M$, where $\tilde{v}_{-1} = (\tilde{v}_{-1,i}) = 0$. An argument similar to that which showed (i) in Appendix A.3.3 ensures that
$$\sum_{k=1}^K a_k \leq \frac{dD(\theta)\sqrt{M}}{2\alpha\beta_1}, \qquad b_k \leq \frac{\alpha}{2\sqrt{\tilde{v}_*}\beta_1}\left(\frac{\sigma^2}{b}+G^2\right), \qquad c_k \leq \frac{\hat{\beta}_1}{\beta_1}\mathrm{Dist}(\theta)G,$$
where $a_k$, $b_k$, and $c_k$ are defined as in (22) and $\tilde{v}_* := \inf\{\min_{i\in[d]}\tilde{v}_{k,i} : k \in \mathbb{N}\}$. Inequalities (21) and (33) thus imply that the $C_i$ for AMSGrad are (6).
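The induction $\tilde{v}_{k,i} \leq M$ used here can be checked directly: the AMSGrad max step makes $\tilde{v}_k$ componentwise monotone, and it stays below the running bound $M$ on the squared gradient entries. A sketch with synthetic (hypothetical) gradients:

```python
import numpy as np

rng = np.random.default_rng(2)
beta2, d, K = 0.999, 4, 300
v = np.zeros(d)          # v_0 = 0
v_hat = np.zeros(d)      # \tilde{v}_{-1} = 0
M = 0.0                  # running bound on squared gradient entries
prev = v_hat.copy()
for _ in range(K):
    g2 = rng.normal(size=d) ** 2
    M = max(M, float(g2.max()))
    v = beta2 * v + (1 - beta2) * g2  # second-moment EMA (eq. for v_k)
    v_hat = np.maximum(v_hat, v)      # AMSGrad max step
    assert (v_hat >= prev).all()      # monotone non-decreasing
    assert (v_hat <= M + 1e-12).all() # bounded by M, as in (26)
    prev = v_hat.copy()
print("monotone and bounded")
```

Monotonicity is what lets the telescoping argument behind (23) go through without the bias-correction factors $\tilde{\beta}_{2k}$ needed for Adam.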

A.4 RESULTS FOR VARYING LEARNING RATES

We can establish the upper bound of $\mathrm{VI}(K,\theta)$ for Adam with varying learning rates from a discussion similar to the one proving Theorem 3.1(i) (Appendix A.3.3). Let $(\alpha_k)_{k\in\mathbb{N}}$ be monotone decreasing. Then, $\gamma_k := \frac{\tilde{\beta}_{1k}}{2\beta_1\alpha_k}$ satisfies $\gamma_{k+1} \geq \gamma_k$ ($k \in \mathbb{N}$). Hence, (25) holds. The discussion in Appendix A.3.3 thus ensures that
$$\mathrm{VI}(K,\theta) \leq \frac{dD(\theta)\sqrt{M}}{2\beta_1\sqrt{1-\beta_2}\,\alpha_K K} + \frac{\sigma^2 b^{-1}+G^2}{2\sqrt{v_*}\beta_1(1-\beta_1)K}\sum_{k=1}^K\alpha_k\sqrt{1-\beta_2^{k+1}} + h(\beta_1). \tag{34}$$
Let us consider the case where $\alpha_k = \frac{1}{\sqrt{k}}$ ($k \geq 1$). From (34), we have that
$$\mathrm{VI}(K,\theta) \leq \frac{dD(\theta)\sqrt{M}}{2\beta_1\sqrt{1-\beta_2}\sqrt{K}} + \frac{\sigma^2 b^{-1}+G^2}{2\sqrt{v_*}\beta_1(1-\beta_1)K}\sum_{k=1}^K\frac{1}{\sqrt{k}} + h(\beta_1) \leq \frac{dD(\theta)\sqrt{M}}{2\beta_1\sqrt{1-\beta_2}\sqrt{K}} + \frac{\sigma^2 b^{-1}+G^2}{\sqrt{v_*}\beta_1(1-\beta_1)\sqrt{K}} + h(\beta_1),$$
where we used $\sqrt{1-\beta_2^{k+1}} \leq 1$ and $\sum_{k=1}^K\frac{1}{\sqrt{k}} \leq 2\sqrt{K}$. Hence,
$$\mathrm{VI}(K,\theta) \leq \left\{\frac{dD(\theta)\sqrt{M}}{2\beta_1\sqrt{1-\beta_2}} + \frac{G^2}{\sqrt{v_*}\beta_1(1-\beta_1)}\right\}\frac{1}{\sqrt{K}} + \frac{\sigma^2}{\sqrt{v_*}\beta_1(1-\beta_1)}\frac{1}{b\sqrt{K}} + h(\beta_1) =: \frac{\tilde{C}_1}{\sqrt{K}} + \frac{\tilde{C}_2}{b\sqrt{K}} + h(\beta_1). \tag{35}$$
Theorem 3.1(i) indicates that Adam with a constant learning rate $\alpha$ satisfies $\mathrm{VI}(K,\theta) \leq \frac{C_1}{K} + \frac{C_2}{b} + C_3$. Hence, using $\alpha_k = \frac{1}{\sqrt{k}}$ mitigates the variance term (the term $\frac{\tilde{C}_2}{b\sqrt{K}}$) in contrast to the term $\frac{C_2}{b}$ obtained using $\alpha_k = \alpha$. However, the bias term $\frac{C_1}{K}$ obtained using $\alpha_k = \alpha$ is better than the bias term $\frac{\tilde{C}_1}{\sqrt{K}}$ obtained using $\alpha_k = \frac{1}{\sqrt{k}}$.

Next, let us consider a step-decay learning rate in the sense of minimizing the upper bound of $\mathrm{VI}(K,\theta)$: let $T \geq 1$, $\gamma \in (0,1)$, $\alpha > 0$, and $P := \frac{K}{T}$, and consider $(\alpha_k) = (\alpha_T, \alpha_{2T}, \ldots, \alpha_{PT})$, where $\alpha_{jT} := (\gamma^{j-1}\alpha, \gamma^{j-1}\alpha, \ldots, \gamma^{j-1}\alpha)$ ($j = 1,2,\ldots,P$). That is, $(\alpha_k)$ is $\alpha$ repeated $T$ times, then $\gamma\alpha$ repeated $T$ times, and so on up to $\gamma^{P-1}\alpha$. Let $\underline{\alpha} > 0$ be the lower bound of $\alpha_k$. From $\sum_{k=1}^K\alpha_k = \alpha T + \gamma\alpha T + \cdots + \gamma^{P-1}\alpha T \leq \frac{\alpha T}{1-\gamma}$ and (34), we have that
$$\mathrm{VI}(TP,\theta) \leq \frac{dD(\theta)\sqrt{M}}{2\beta_1\sqrt{1-\beta_2}\,\underline{\alpha}TP} + \frac{\sigma^2 b^{-1}+G^2}{2\sqrt{v_*}\beta_1(1-\beta_1)TP}\cdot\frac{\alpha T}{1-\gamma} + h(\beta_1) = \left\{\frac{dD(\theta)\sqrt{M}}{2\beta_1\sqrt{1-\beta_2}\,\underline{\alpha}T} + \frac{G^2\alpha}{2\sqrt{v_*}\beta_1(1-\beta_1)(1-\gamma)}\right\}\frac{1}{P} + \frac{\sigma^2\alpha}{2\sqrt{v_*}\beta_1(1-\beta_1)(1-\gamma)}\frac{1}{Pb} + h(\beta_1) = O\left(\frac{1}{P} + \frac{1}{Pb}\right) + h(\beta_1).$$
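Two elementary bounds do the work in this appendix: $\sum_{k=1}^K 1/\sqrt{k} \leq 2\sqrt{K}$ for the diminishing learning rate, and the geometric-series bound $\sum_k \alpha_k \leq \alpha T/(1-\gamma)$ for the step-decay schedule. Both are easy to confirm numerically (the schedule constants below are illustrative):

```python
import math

# sum_{k=1}^K 1/sqrt(k) <= 2*sqrt(K), used for alpha_k = 1/sqrt(k)
K = 10_000
s = sum(1 / math.sqrt(k) for k in range(1, K + 1))
assert s <= 2 * math.sqrt(K)

# step-decay schedule: alpha for T steps, then gamma*alpha, ..., gamma^{P-1}*alpha
alpha, gamma, T, P = 0.1, 0.5, 50, 20
total = sum(gamma ** (j - 1) * alpha * T for j in range(1, P + 1))
assert total <= alpha * T / (1 - gamma)  # geometric-series bound from A.4
print("both bounds hold")
```

The first bound controls the variance term at rate $1/\sqrt{K}$, while the second makes the step-decay sum $O(1)$ in $K$, which is what yields the stronger $O(1/P + 1/(Pb))$ rate above.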
Here, we recall that Adam with $\alpha_k = \alpha$ satisfies $\mathrm{VI}(TP,\theta) \leq \frac{C_1}{TP} + \frac{C_2}{b} + C_3 = O(\frac{1}{P} + \frac{1}{b}) + C_3$, and Adam with $\alpha_k = \frac{1}{\sqrt{k}}$ satisfies $\mathrm{VI}(TP,\theta) \leq \frac{\tilde{C}_1}{\sqrt{TP}} + \frac{\tilde{C}_2}{b\sqrt{TP}} + h(\beta_1) = O(\frac{1}{\sqrt{P}} + \frac{1}{b\sqrt{P}}) + h(\beta_1)$. Therefore, using $(\alpha_k) = (\alpha_T, \alpha_{2T}, \ldots, \alpha_{PT})$ is more desirable for minimizing the upper bound of $\mathrm{VI}(K,\theta)$ for Adam than using $\alpha_k \in \{\alpha, \frac{1}{\sqrt{k}}\}$.

A.6 CONDITION $\epsilon - C_3 > 0$

Under the Lipschitz smoothness condition on $f$ (i.e., $\|\nabla f(x)-\nabla f(y)\| \leq L\|x-y\|$), gradient descent (GD) with a constant learning rate $\alpha = O(L^{-1})$ satisfies $\lim_{k\to+\infty}\|\nabla f(\theta_k)\| = 0$ by the descent lemma ($f(y) \leq f(x) + \nabla f(x)^\top(y-x) + \frac{L}{2}\|y-x\|^2$). Without the smoothness condition on $f$, GD with a constant learning rate $\alpha > 0$ satisfies $\liminf_{k\to+\infty}\nabla f(\theta_k)^\top(\theta_k-\theta) \leq \frac{G^2\alpha}{2}$. Hence, it is not guaranteed that GD with a constant learning rate $\alpha > 0$ converges without smoothness. However, we can expect that using a small constant learning rate $\alpha$ approximates a stationary point of $f$. Meanwhile, without smoothness, GD with a diminishing learning rate $\alpha_k = \frac{1}{\sqrt{k}}$ satisfies
$$\mathrm{VI}(K,\theta) \leq \frac{D(\theta)}{2K\alpha_K} + \frac{\sigma^2 b^{-1}+G^2}{2K}\sum_{k=1}^K\alpha_k = \frac{D(\theta)}{2K\alpha_K} + \frac{G^2}{2K}\sum_{k=1}^K\alpha_k = O\left(\frac{1}{\sqrt{K}}\right)$$
(GD uses the full gradient, i.e., $\sigma = 0$; see also Appendix A.3.1 when $\alpha$ is $\frac{1}{\sqrt{k}}$). Hence, GD with a diminishing learning rate does not need the $C_3$ used in Theorem 3.1. This result strongly depends on the condition $\beta_1 = 0$. Without smoothness, Adam with a diminishing learning rate $\alpha_k = \frac{1}{\sqrt{k}}$ and $b = n$ (i.e., $\sigma = 0$) satisfies $\mathrm{VI}(K,\theta) \leq \frac{\tilde{C}_1}{\sqrt{K}} + \frac{\tilde{C}_2}{b\sqrt{K}} + h(\beta_1) = \frac{\tilde{C}_1}{\sqrt{K}} + h(\beta_1)$ (see (35)), and Adam with a constant learning rate $\alpha$ and $b = n$ (i.e., $\sigma = 0$) satisfies $\mathrm{VI}(K,\theta) \leq \frac{C_1}{K} + \frac{C_2}{b} + C_3 = \frac{C_1}{K} + C_3$. The previous results (Zaheer et al., 2018) indicated that Adam with a constant learning rate $\alpha = O(L^{-1})$, $\beta_1 = 0$, and $\sigma = 0$ satisfies
$$\frac{1}{K}\sum_{k=1}^K\mathbb{E}\left[\|\nabla f(\theta_k)\|^2\right] \leq \frac{2(\sqrt{\beta_2}G+\epsilon)(f(\theta_1)-f(\theta^\star))}{\alpha K} =: \frac{\hat{C}_1}{K}.$$
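The claim that GD with $\alpha = O(L^{-1})$ drives $\|\nabla f(\theta_k)\|$ to zero under Lipschitz smoothness can be illustrated on a simple smooth quadratic (a hypothetical example; $L$ is the largest Hessian eigenvalue):

```python
import numpy as np

# GD on a smooth quadratic f(theta) = 0.5 * theta^T A theta
A = np.diag([1.0, 4.0, 10.0])  # Hessian; Lipschitz constant L = 10
L = 10.0
alpha = 1.0 / L                # constant learning rate alpha = O(1/L)
theta = np.array([5.0, -3.0, 2.0])
for _ in range(2000):
    theta = theta - alpha * (A @ theta)  # gradient step
grad_norm = float(np.linalg.norm(A @ theta))
assert grad_norm < 1e-8        # ||nabla f(theta_k)|| -> 0
print("converged")
```

No such guarantee exists for a non-smooth $f$ with a constant learning rate, which is exactly why the residual term $C_3 = G^2\alpha/2$ appears in Theorem 3.1 and why $\epsilon > C_3$ must be assumed.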
The condition $\beta_1 \neq 0$ is essential for analyzing Adam since the practically useful $\beta_1$ is close to 1. The upper bound of $\mathrm{VI}(K,\theta)$ for Adam with $\beta_1 \neq 0$ depends on the term with respect to $\beta_1$, i.e., $C_3 := \frac{G^2\alpha}{2\sqrt{v_*}\beta_1(1-\beta_1)} + h(\beta_1)$.

Let us consider SGD under (C1)-(C3) and let $\theta^*$ be a stationary point of $f$. From Lemma A.2, we have that
$$2\alpha\mathbb{E}\left[(\theta_k-\theta^*)^\top\nabla f(\theta_k)\right] = \mathbb{E}\left[\|\theta_k-\theta^*\|^2\right] - \mathbb{E}\left[\|\theta_{k+1}-\theta^*\|^2\right] + \alpha^2\mathbb{E}\left[\|\nabla f_{B_k}(\theta_k)\|^2\right].$$
Hence, for all $K \geq 1$,
$$2\alpha\sum_{k=1}^K\mathbb{E}\left[(\theta_k-\theta^*)^\top\nabla f(\theta_k)\right] = \mathbb{E}\left[\|\theta_1-\theta^*\|^2\right] - \mathbb{E}\left[\|\theta_{K+1}-\theta^*\|^2\right] + \alpha^2\sum_{k=1}^K\mathbb{E}\left[\|\nabla f_{B_k}(\theta_k)\|^2\right].$$
Here, we assume that there exists $c \in [0,1]$ such that $c\sigma^2 \leq \mathbb{E}[\|G_{\xi_k}(\theta_k)-\nabla f(\theta_k)\|^2] \leq \sigma^2$. From (14), we have that
$$\mathbb{E}\left[\|\nabla f_{B_k}(\theta_k)\|^2 \mid \theta_k\right] = \mathbb{E}\left[\|\nabla f_{B_k}(\theta_k)-\nabla f(\theta_k)\|^2 \mid \theta_k\right] + \|\nabla f(\theta_k)\|^2 \geq \frac{c\sigma^2}{b}.$$
Since $\theta_{K+1}$ approximates $\theta^*$, we assume that there exists $X(\theta^*) \geq 0$ such that $\mathbb{E}[\|\theta_1-\theta^*\|^2] - \mathbb{E}[\|\theta_{K+1}-\theta^*\|^2] \geq X(\theta^*)$. Hence,
$$2\alpha\sum_{k=1}^K\mathbb{E}\left[(\theta_k-\theta^*)^\top\nabla f(\theta_k)\right] \geq X(\theta^*) + \frac{c\sigma^2\alpha^2 K}{b},$$
which implies that $\mathrm{VI}(K,\theta^*) \geq \frac{D_1}{K} + \frac{D_2}{b}$, where $D_1 := \frac{X(\theta^*)}{2\alpha}$ and $D_2 := \frac{c\sigma^2\alpha}{2}$. Meanwhile, Theorem 3.1(i) indicates that $\mathrm{VI}(K,\theta^*) \leq \frac{C_1}{K} + \frac{C_2}{b} + C_3$, where $C_1 := \frac{\mathbb{E}[\|\theta_1-\theta^*\|^2]}{2\alpha}$, $C_2 := \frac{\sigma^2\alpha}{2}$, and $C_3 := \frac{G^2\alpha}{2}$. Suppose that $\mathrm{VI}(K,\theta^*) \leq \epsilon$. Then, we have that $\frac{D_1}{K} + \frac{D_2}{b} \leq \epsilon$, which implies that $\underline{K}(b) := \frac{D_1 b}{\epsilon b - D_2} \leq K$. Suppose that $\mathrm{VI}(K,\theta^*) \geq \epsilon$. Then, we have that $\frac{C_1}{K} + \frac{C_2}{b} + C_3 \geq \epsilon$, which implies that $\overline{K}(b) := \frac{C_1 b}{(\epsilon-C_3)b - C_2} \geq K$. From
$$D_1 = \frac{X(\theta^*)}{2\alpha} \leq \frac{\mathbb{E}[\|\theta_1-\theta^*\|^2] - \mathbb{E}[\|\theta_{K+1}-\theta^*\|^2]}{2\alpha} \leq \frac{\mathbb{E}[\|\theta_1-\theta^*\|^2]}{2\alpha} = C_1, \qquad D_2 = \frac{c\sigma^2\alpha}{2} \leq \frac{\sigma^2\alpha}{2} = C_2,$$
and $(\epsilon-C_3)b - C_2 \leq \epsilon b - C_2 \leq \epsilon b - D_2$, we have that $\underline{K}(b) \leq \overline{K}(b)$. Hence, $K$ with $\mathrm{VI}(K,\theta^*) = \epsilon$ satisfies $\underline{K}(b) \leq K \leq \overline{K}(b)$, and the SFO complexity $N = Kb$ satisfies $\underline{N} := \underline{K}(b)b \leq N \leq \overline{N} := \overline{K}(b)b$. Moreover, $\underline{b}^\star := \frac{2D_2}{\epsilon}$ minimizing $\underline{N}$ and $\overline{b}^\star := \frac{2C_2}{\epsilon-C_3}$ minimizing $\overline{N}$ (see Theorem 3.1(iii)) satisfy $\underline{b}^\star \leq \overline{b}^\star$. Let $\alpha$ be small enough.
Then, we have that $\underline{b}^\star \approx \overline{b}^\star$ and $\underline{K}(\underline{b}^\star)\underline{b}^\star \approx \overline{K}(\overline{b}^\star)\overline{b}^\star$, where $C_1 = \frac{\mathbb{E}[\|\theta_1-\theta^*\|^2]}{2\alpha} \approx D_1 = \frac{X(\theta^*)}{2\alpha}$ and $C_2 = \frac{\sigma^2\alpha}{2} \approx D_2 = \frac{c\sigma^2\alpha}{2}$.

For parallelized Tail-averaged SGD (see Appendix A.8), Theorem 6 in (Jain et al., 2018) guarantees that
$$\mathbb{E}[f(\bar{\theta})] - f(\theta^\star) \leq \left\{\frac{2(1-\alpha\mu)^s}{\alpha^2\mu^2\left(\frac{n}{Pb}-s\right)^2} + \frac{(P-1)(1-\alpha\mu)^s}{P}\right\}(f(\theta_0)-f(\theta^\star)) + \frac{4\sigma^2}{Pb\left(\frac{n}{Pb}-s\right)},$$
where $s$ is the number of initial (burn-in) iterations, $n$ is the total number of samples, $\mu > 0$ is the smallest eigenvalue of the Hessian of $f$, and $(\theta_i)_{i>s}$ and $\bar{\theta} := \frac{1}{\lfloor\frac{n}{Pb}\rfloor - s}\sum_{i>s}\theta_i$ are the sequences generated by Tail-averaged SGD (see also Appendix A.1.1). This result with $P = 1$ coincides with (12). Setting $s > \kappa_b\log P$ and $\alpha = \frac{\alpha_{b,\max}}{2}$ (Jain et al., 2018, p. 15, Remarks) ensures that
$$\mathbb{E}[f(\bar{\theta})] - f(\theta^\star) \leq \exp\left(-\frac{s}{\kappa_b}\right)\frac{3\kappa_b^2}{\left(\frac{n}{Pb}-s\right)^2 P}(f(\theta_0)-f(\theta^\star)) + \frac{4\sigma^2}{Pb\left(\frac{n}{Pb}-s\right)},$$
which, together with $K := \frac{n}{Pb} - s$, implies that
$$\mathbb{E}[f(\bar{\theta})] - f(\theta^\star) = O\left(\frac{1}{K^2 P} + \frac{1}{KPb}\right). \tag{36}$$
Hence, (36) indicates that the larger the number of machines $P$ is, the smaller the upper bound of $\mathbb{E}[f(\bar{\theta})] - f(\theta^\star)$ becomes. Next, let us consider SGD run independently on $P$ machines, each of which contains $\frac{n}{P}$ samples, for minimizing the expected square loss function $f$. Given $\theta_k \in \mathbb{R}^d$, machine $i$ ($i \in [P]$) generates the point
$$\theta_{k+1}^{(i)} := \theta_k - \alpha\nabla f_{B_k^{(i)}}(\theta_k) = \theta_k - \frac{\alpha}{b}\sum_{j\in[b]}G_{\xi_{k,j}^{(i)}}(\theta_k) \tag{37}$$
using SGD with batch size $b$, and the parallelized iterate is $\theta_{k+1} := \frac{1}{P}\sum_{i=1}^P\theta_{k+1}^{(i)}$, where $\xi_{k,j}^{(i)}$ is a random variable generated by the $j$-th sampling in the $k$-th iteration on machine $i$, and we assume that the stochastic gradient $G_{\xi_{k,j}^{(i)}}(\theta_k)$ satisfies (C2). We have the following theorem.

Theorem A.1. Consider minimizing the expected square loss function $f(\theta) := \frac{1}{2}\mathbb{E}_{(x,y)\sim\mathcal{D}}[(y-x^\top\theta)^2]$ and let $\theta^\star$ be a minimizer of $f$.
Then, under (C2) and (A1), the sequence $(\theta_k)$ generated by parallelized SGD (37) satisfies the following:

(i) [Upper bound of $\mathrm{VI}(K,\theta^\star)$] For all $K \geq 1$,
$$\mathbb{E}\left[f\left(\frac{1}{K}\sum_{k=1}^K\theta_k\right)\right] - f(\theta^\star) \leq \mathrm{VI}(K,\theta^\star) \leq \frac{C_1}{K} + \frac{C_2}{Pb} + C_3,$$
where $P$ is the number of machines, $C_1 := \max_{i\in[P]}\frac{\mathbb{E}[\|\theta_1^{(i)}-\theta^\star\|^2]}{2\alpha}$, $C_2 := \frac{\sigma^2\alpha}{2}$, and $C_3 := \frac{G^2\alpha}{2}$.

(ii) [Steps to satisfy $\mathrm{VI}(K,\theta^\star) \leq \epsilon$] The number of steps $K_P$ defined by
$$K_P(b) = \frac{C_1 Pb}{(\epsilon-C_3)Pb - C_2} \tag{38}$$
satisfies $\mathrm{VI}(K_P,\theta^\star) \leq \epsilon$, and the function $K_P(b)$ defined by (38) is convex and monotone decreasing with respect to $b$ ($> \frac{C_2}{(\epsilon-C_3)P} > 0$).

(iii) [Minimization of SFO complexity] The SFO complexity defined by
$$N_P = K_P(b)b = \frac{C_1 Pb^2}{(\epsilon-C_3)Pb - C_2} \tag{39}$$
is convex with respect to $b$ ($> \frac{C_2}{(\epsilon-C_3)P} > 0$). The batch size
$$b_P^\star := \frac{2C_2}{(\epsilon-C_3)P}$$
attains the following minimum value of $N_P$ defined by (39):
$$N_P^\star = K_P(b_P^\star)b_P^\star = \frac{4C_1C_2}{(\epsilon-C_3)^2 P}. \tag{40}$$

Let $\theta_1^{(i)} = \theta_1$ ($i \in [P]$). Then, the SFO complexity $N^\star = K(b^\star)b^\star$ of non-parallelized SGD with batch size $b^\star = \frac{2C_2}{\epsilon-C_3}$ is obtained from (40) with $P = 1$ (see also Theorem 3.1(iii)), i.e., $N^\star = N_1^\star = K_1(b^\star)b^\star = \frac{4C_1C_2}{(\epsilon-C_3)^2}$. Meanwhile, parallelized SGD (37) with batch size $b_P^\star = \frac{2C_2}{(\epsilon-C_3)P}$, where $P$ ($> 1$) is the number of machines, has SFO complexity (40), i.e.,
$$N_P^\star = \frac{4C_1C_2}{(\epsilon-C_3)^2 P} < \frac{4C_1C_2}{(\epsilon-C_3)^2} = N^\star.$$
Therefore, we can conclude that the larger $P$ is, the smaller the SFO complexity becomes.

Proof of Theorem A.1. (i) Let $k \in \mathbb{N}$. The definition $\theta_{k+1}^{(i)} := \theta_k - \alpha\nabla f_{B_k^{(i)}}(\theta_k)$ ensures that, for all $i \in [P]$,
$$\|\theta_{k+1}^{(i)}-\theta^\star\|^2 = \|\theta_k-\theta^\star\|^2 - 2\alpha(\theta_k-\theta^\star)^\top\nabla f_{B_k^{(i)}}(\theta_k) + \alpha^2\|\nabla f_{B_k^{(i)}}(\theta_k)\|^2,$$
which implies that
$$\mathbb{E}\left[\|\theta_{k+1}^{(i)}-\theta^\star\|^2 \mid \theta_k\right] = \|\theta_k-\theta^\star\|^2 - 2\alpha(\theta_k-\theta^\star)^\top\mathbb{E}\left[\nabla f_{B_k^{(i)}}(\theta_k) \mid \theta_k\right] + \alpha^2\mathbb{E}\left[\|\nabla f_{B_k^{(i)}}(\theta_k)\|^2 \mid \theta_k\right].$$
Hence, (C2) implies that, for all $i \in [P]$,
$$\mathbb{E}\left[\|\theta_{k+1}^{(i)}-\theta^\star\|^2\right] = \mathbb{E}\left[\|\theta_k-\theta^\star\|^2\right] - 2\alpha\mathbb{E}\left[(\theta_k-\theta^\star)^\top\nabla f(\theta_k)\right] + \alpha^2\mathbb{E}\left[\mathbb{E}\left[\|\nabla f_{B_k^{(i)}}(\theta_k)\|^2 \mid \theta_k\right]\right]. \tag{41}$$
A discussion similar to the one showing (14) implies that, for all $i \in [P]$,
$$\mathbb{E}\left[\|\nabla f_{B_k^{(i)}}(\theta_k)\|^2 \mid \theta_k\right] = \mathbb{E}\left[\|\nabla f_{B_k^{(i)}}(\theta_k)-\nabla f(\theta_k)\|^2 \mid \theta_k\right] + \|\nabla f(\theta_k)\|^2.$$
The definition of $\theta_k$ and the linearity of $\nabla f$ guarantee that
$$\mathbb{E}\left[\|\nabla f_{B_k^{(i)}}(\theta_k)-\nabla f(\theta_k)\|^2 \mid \theta_k\right] = \mathbb{E}\left[\left\|\frac{1}{b}\sum_{j\in[b]}G_{\xi_{k,j}^{(i)}}(\theta_k) - \nabla f(\theta_k)\right\|^2 \mid \theta_k\right] = \frac{1}{P^2b^2}\mathbb{E}\left[\left\|\sum_{j\in[b]}\sum_{i\in[P]}\left(G_{\xi_{k,j}^{(i)}}(\theta_k^{(i)}) - \nabla f(\theta_k^{(i)})\right)\right\|^2 \mid \theta_k\right],$$
which, together with (C2), in turn implies that
$$\mathbb{E}\left[\|\nabla f_{B_k^{(i)}}(\theta_k)-\nabla f(\theta_k)\|^2 \mid \theta_k\right] \leq \frac{Pb\sigma^2}{P^2b^2} = \frac{\sigma^2}{Pb}.$$
Hence, from (A1),
$$\mathbb{E}\left[\|\nabla f_{B_k^{(i)}}(\theta_k)\|^2\right] \leq \frac{\sigma^2}{Pb} + G^2. \tag{42}$$
Accordingly, from (41) and (42), for all $i \in [P]$,
$$\mathbb{E}\left[\|\theta_{k+1}^{(i)}-\theta^\star\|^2\right] \leq \mathbb{E}\left[\|\theta_k-\theta^\star\|^2\right] - 2\alpha\mathbb{E}\left[(\theta_k-\theta^\star)^\top\nabla f(\theta_k)\right] + \alpha^2\left(\frac{\sigma^2}{Pb}+G^2\right).$$
Since the convexity of $\|\cdot\|^2$ and the definition of $\theta_k$ imply that $\mathbb{E}[\|\theta_{k+1}-\theta^\star\|^2] \leq \frac{1}{P}\sum_{i\in[P]}\mathbb{E}[\|\theta_{k+1}^{(i)}-\theta^\star\|^2]$, we have that
$$\mathbb{E}\left[\|\theta_{k+1}-\theta^\star\|^2\right] \leq \mathbb{E}\left[\|\theta_k-\theta^\star\|^2\right] - 2\alpha\mathbb{E}\left[(\theta_k-\theta^\star)^\top\nabla f(\theta_k)\right] + \alpha^2\left(\frac{\sigma^2}{Pb}+G^2\right).$$
Hence, for all $K \geq 1$,
$$\mathrm{VI}(K,\theta^\star) := \frac{1}{K}\sum_{k=1}^K\mathbb{E}\left[(\theta_k-\theta^\star)^\top\nabla f(\theta_k)\right] \leq \frac{\mathbb{E}[\|\theta_1-\theta^\star\|^2]}{2\alpha K} + \frac{\alpha}{2}\left(\frac{\sigma^2}{Pb}+G^2\right) \leq \underbrace{\max_{i\in[P]}\frac{\mathbb{E}[\|\theta_1^{(i)}-\theta^\star\|^2]}{2\alpha}}_{C_1}\frac{1}{K} + \underbrace{\frac{\sigma^2\alpha}{2}}_{C_2}\frac{1}{Pb} + \underbrace{\frac{G^2\alpha}{2}}_{C_3}.$$
Since $f$ is convex, we have that
$$\mathbb{E}\left[f\left(\frac{1}{K}\sum_{k=1}^K\theta_k\right)\right] - f(\theta^\star) \leq \frac{1}{K}\sum_{k=1}^K\mathbb{E}\left[f(\theta_k)-f(\theta^\star)\right] \leq \mathrm{VI}(K,\theta^\star).$$
(ii) A discussion similar to the one showing (ii) in Theorem 3.1 for SGD would show (ii) in Theorem A.1. (iii) An argument similar to that which obtained (iii) in Theorem 3.1 for SGD would prove (iii) in Theorem A.1.

A.9 ADDITIONAL NUMERICAL RESULTS



(Figures 2 and 4 in this paper). Moreover, the SFO complexity $N := Kb$ is minimized at the critical batch size $b^\star$ defined by (5); e.g., $b^\star_S$ for SGD, $b^\star_M$ for Momentum, and $b^\star_A$ for Adam.

where $\sigma_S$, $\sigma_M$, and $\sigma_A$ are positive constants depending on SGD, Momentum, and Adam, respectively, and $C_3 = C_{3,S}, C_{3,M}, C_{3,A}$ are positive constants defined as in Theorem 3.1. From (9), we obtain the corresponding lower bounds $b_*$ of $b^\star_S$, $b^\star_M$, and $b^\star_A$.

We expect that Adam can exploit a large batch size $b^\star_A > b_{*A} > b_{**A}$, for example, when $b_{**A} < 2^{11}$ (CIFAR-10, $n = 50000$; ResNet-20, $d \approx 1.1\times10^7$; $b_{**A} \approx 1600$) and $2^{10} < b_{**A} < 2^{11}$ (MNIST, $n = 60000$; ResNet-18, $d \approx 1.0\times10^7$; $b_{**A} \approx 2000$) (He et al., 2016; Leong et al., 2020), where $\sigma_A^2 \approx 10^{-2}$ and $\epsilon \approx 10^{-3}$ are used.

4.1 RESNET-20

Figure 2: SFO complexities for SGD, Momentum, and Adam versus batch size needed to train ResNet-20 on CIFAR-10. The SFO complexity of Adam (resp. Momentum) is minimized at the critical batch size $b^\star = 2^{11}$ (resp. $b^\star = 2^3$), whereas the SFO complexity of SGD tends to increase with batch size.

Figure 3: Number of steps for SGD, Momentum, and Adam versus batch size needed to train ResNet-18 on MNIST. There is an initial period of perfect scaling (indicated by the dashed line) such that the number of steps $K$ for Adam is inversely proportional to the batch size $b$. Adam has critical batch size $b^\star = 2^{10}$.


Hence, $\epsilon > C_3$ and $\epsilon > h(\beta_1)$ are needed for a $K$ satisfying $\mathrm{VI}(K,\theta) \leq \frac{\tilde{C}_1}{\sqrt{K}} + h(\beta_1) = \epsilon$ (diminishing learning rate case) or $\mathrm{VI}(K,\theta) \leq \frac{C_1}{K} + C_3 = \epsilon$ (constant learning rate case) to exist, even if $\sigma = 0$.

A.7 LOWER AND UPPER BOUNDS ON $K$ WITH $\mathrm{VI}(K,\theta^*) = \epsilon$
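With hypothetical constants satisfying $D_1 \leq C_1$ and $D_2 \leq C_2$ (as derived in Appendix A.7), the sandwich $\underline{K}(b) \leq K \leq \overline{K}(b)$ and the ordering $\underline{b}^\star \leq \overline{b}^\star$ of the minimizers can be checked numerically:

```python
# hypothetical constants with D1 <= C1 and D2 <= C2, eps > C3
C1, C2, C3, eps = 5.0, 2.0, 0.01, 0.05
c = 0.5                        # c in [0, 1]
D1, D2 = 0.8 * C1, c * C2

def K_lower(b):                # lower bound K(b) = D1*b / (eps*b - D2)
    return D1 * b / (eps * b - D2)

def K_upper(b):                # upper bound K(b) = C1*b / ((eps - C3)*b - C2)
    return C1 * b / ((eps - C3) * b - C2)

b_low = 2 * D2 / eps           # minimizes the lower SFO bound
b_up = 2 * C2 / (eps - C3)     # minimizes the upper SFO bound
assert b_low <= b_up

for b in range(int(b_up) + 1, 10 * int(b_up)):
    assert K_lower(b) <= K_upper(b)  # sandwich on K
print("sandwich holds")
```

The gap between the two curves shrinks as $\alpha$ decreases, which is the sense in which $\underline{b}^\star \approx \overline{b}^\star$ for a small constant learning rate.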

Hence, the batch sizes $\underline{b}^\star$ and $\overline{b}^\star$ ($\underline{b}^\star \approx \overline{b}^\star$) approximate the batch size minimizing $N = Kb$.

A.8 SFO COMPLEXITY REGARDING COMPUTATIONAL COST OF PARALLELIZABLE STEPS

Let us consider Tail-averaged SGD (Jain et al., 2018, Algorithm 1) run independently on $P$ machines, each of which contains $\frac{n}{P}$ samples, for minimizing the expected square loss function $f(\theta) := \frac{1}{2}\mathbb{E}_{(x,y)\sim\mathcal{D}}[(y-x^\top\theta)^2]$ (see Appendix A.1.1). Let $\bar{\theta}_i$ be the point generated by Tail-averaged SGD with batch size $b$ on machine $i$ ($i \in [P]$), and let $\bar{\theta} := \frac{1}{P}\sum_{i=1}^P\bar{\theta}_i$ be the point generated by parallelized Tail-averaged SGD. Then, Theorem 6 in (Jain et al., 2018) guarantees the bound leading to (36).


ON THE CIFAR-10 DATASET

A.5 BOUNDEDNESS CONDITION OF $(\nabla f(\theta_k))_{k\in\mathbb{N}}$

Let $f: \mathbb{R}^d \to \mathbb{R}$ be convex. Then, $f$ is Lipschitz continuous (i.e., $|f(x)-f(y)| \leq G\|x-y\|$) if and only if $\|\nabla f(x)\| \leq G$ ($x \in \mathbb{R}^d$) (see, e.g., Theorem 6.2.2, Corollary 6.1.2, and Exercise 6.1.9(c) in (Borwein & Lewis, 2000)). Let $\theta^*$ be a local minimizer of a Lipschitz continuous function $f$. The continuity of $f$ ensures that $f$ is convex around $\theta^*$. Hence, for any $\theta$ belonging to a neighborhood $N(\theta^*)$ of $\theta^*$, $\|\nabla f(\theta)\| \leq G$. If the sequence $(\theta_k)_{k\in\mathbb{N}}$ generated by an optimizer approximates $\theta^*$, then $\theta_k \in N(\theta^*)$ for sufficiently large $k$, i.e., $\|\nabla f(\theta_k)\| \leq G$.

Elapsed time and training accuracy of SGD when $f(\theta_K) \leq 10^{-1}$ for training ResNet-20

Elapsed time and training accuracy of Momentum when $f(\theta_K) \leq 10^{-1}$ for training ResNet-20 on CIFAR-10

Elapsed time and training accuracy of Adam when $f(\theta_K) \leq 10^{-1}$ for training ResNet-20

Elapsed time and training accuracy of Adam when $f(\theta_K) \leq 10^{-1}$ for training ResNet-20 on CIFAR-10

Elapsed time and training accuracy of SGD when $f(\theta_K) \leq 10^{-1}$ for training ResNet-18

Elapsed time and training accuracy of Momentum when $f(\theta_K) \leq 10^{-1}$ for training

Elapsed time and training accuracy of Adam when $f(\theta_K) \leq 10^{-1}$ for training ResNet-18

