ADAM+: A STOCHASTIC METHOD WITH ADAPTIVE VARIANCE REDUCTION

Abstract

Adam is a widely used stochastic optimization method for deep learning applications. While practitioners prefer Adam because it requires less parameter tuning, its use is problematic from a theoretical point of view since it may not converge. Variants of Adam have been proposed with provable convergence guarantees, but they tend not to be competitive with Adam in practical performance. In this paper, we propose a new method named Adam+ (pronounced as Adam-plus). Adam+ retains some of the key components of Adam but also has several noticeable differences: (i) it does not maintain a moving average of the second moment estimate but instead computes a moving average of the first moment estimate at extrapolated points; (ii) its adaptive step size is formed not by dividing by the square root of the second moment estimate but by dividing by the square root of the norm of the first moment estimate. As a result, Adam+ requires little parameter tuning, like Adam, but it enjoys a provable convergence guarantee. Our analysis further shows that Adam+ enjoys adaptive variance reduction, i.e., the variance of the stochastic gradient estimator reduces as the algorithm converges, hence yielding adaptive convergence. We also propose a more general variant of Adam+ with different adaptive step sizes and establish its fast convergence rate. Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam+ significantly outperforms Adam and achieves comparable performance with best-tuned SGD and momentum SGD.

1. INTRODUCTION

Adaptive gradient methods (Duchi et al., 2011; McMahan & Streeter, 2010; Tieleman & Hinton, 2012; Kingma & Ba, 2014; Reddi et al., 2019) are one of the most important variants of Stochastic Gradient Descent (SGD) in modern machine learning applications. Contrary to SGD, adaptive gradient methods typically require little parameter tuning while still retaining the computational efficiency of SGD. One of the most widely used adaptive methods is Adam (Kingma & Ba, 2014), which practitioners consider the de-facto default optimizer for deep learning frameworks. Adam computes the update for every dimension of the model parameter through moment estimation, i.e., estimates of the first and second moments of the gradients. The estimates of the first and second moments are updated using exponential moving averages with two different control parameters. These moving averages are the key difference between Adam and previous adaptive gradient methods, such as Adagrad (Duchi et al., 2011). Although Adam exhibits great empirical performance, many mysteries remain about its convergence. First, it has been shown that Adam may not converge for some objective functions (Reddi et al., 2019; Chen et al., 2018b). Second, it is unclear, from a theoretical point of view, what benefit the moving average brings, especially its effect on the convergence rate. Third, it has been empirically observed that adaptive gradient methods can have worse generalization performance than their non-adaptive counterparts (e.g., SGD) on various deep learning tasks due to the coordinate-wise learning rates (Wilson et al., 2017). These issues motivate us to design a new algorithm that achieves the best of both worlds, i.e., provable convergence with benefits from the moving average, and good generalization performance in deep learning.
Specifically, we focus on the following optimization problem: $\min_{w\in\mathbb{R}^d} F(w)$.

Table 1: Summary of different algorithms with different assumptions and complexity results for finding an $\epsilon$-stationary point. "Individual Smooth" means assuming that $F(w) = \mathbb{E}_{\xi\sim\mathcal{D}}[f(w;\xi)]$ and that every component function $f(w;\xi)$ is $L$-smooth. "Hessian Lipschitz" means that $\|\nabla^2 F(x) - \nabla^2 F(y)\| \le L_H\|x-y\|$ holds for all $x, y$ and some $L_H \ge 0$. "Type I" means that the complexity depends on $\mathbb{E}[\sum_{i=1}^d \|g_{1:T,i}\|]$, where $g_{1:T,i}$ stands for the $i$-th row of the matrix $[g_1, \dots, g_T]$, with $g_t$ being the stochastic gradient at the $t$-th iteration and $T$ the number of iterations. "Type II" means that the complexity depends on $\mathbb{E}[\sum_{t=1}^T \|z_t\|]$, where $z_t$ is the variance-reduced gradient estimator at the $t$-th iteration.

Algorithm 1 Adam+: Good default settings for the tested machine learning problems are $\alpha = 0.1$, $a = 1$, $\beta = 0.1$, $\epsilon_0 = 10^{-8}$.
1: Require: $\alpha$, $a \ge 1$: step size parameters
2: Require: $\beta \in (0,1)$: exponential decay rate for the moment estimate
3: Require: $g_t(w)$: unbiased stochastic gradient at parameters $w$ and iteration $t$
4: Require: $w_0$: initial parameter vector
5: $z_0 = g_0(w_0)$
6: for $t = 0, \dots, T$ do
7:   Set $\eta_t = \frac{\alpha\beta^a}{\max(\|z_t\|^{1/2}, \epsilon_0)}$
8:   $w_{t+1} = w_t - \eta_t z_t$
9:   $\tilde{w}_{t+1} = (1 - 1/\beta)\, w_t + (1/\beta)\, w_{t+1}$
10:  $z_{t+1} = (1-\beta)\, z_t + \beta\, g_{t+1}(\tilde{w}_{t+1})$
11: end for
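The loop above can be sketched directly in NumPy. The toy quadratic objective $F(w)=\frac12\|w\|^2$, the noise scale, and the iteration count below are illustrative assumptions, not settings from the paper:

```python
# Minimal sketch of Adam+ (Algorithm 1) on a toy quadratic F(w) = 0.5*||w||^2
# with a noisy gradient oracle. Hyperparameter names follow Algorithm 1.
import numpy as np

rng = np.random.default_rng(0)

def grad_oracle(w):
    # unbiased stochastic gradient of F(w) = 0.5*||w||^2
    return w + 0.01 * rng.standard_normal(w.shape)

alpha, a, beta, eps0 = 0.1, 1.0, 0.1, 1e-8
w = np.ones(5)
z = grad_oracle(w)                                     # line 5: z_0 = g_0(w_0)
for t in range(200):
    eta = alpha * beta**a / max(np.linalg.norm(z) ** 0.5, eps0)  # line 7
    w_next = w - eta * z                               # line 8
    w_tilde = (1 - 1 / beta) * w + (1 / beta) * w_next # line 9: extrapolation
    z = (1 - beta) * z + beta * grad_oracle(w_tilde)   # line 10: moving average
    w = w_next

print(np.linalg.norm(w))  # the iterates approach the minimizer 0
```

Note how the moving average is refreshed at the extrapolated point `w_tilde`, not at the current iterate; this is the component that drives the variance reduction analyzed below.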

2. ALGORITHM AND THEORETICAL ANALYSIS

In this section, we introduce our algorithm Adam+ (presented in Algorithm 1) and establish its convergence guarantees. Adam+ resembles Adam in several aspects but also has noticeable differences. Similar to Adam, Adam+ maintains an exponential moving average of the first moment (i.e., the stochastic gradient), denoted by $z_t$, and uses it to update the solution in line 8. However, the difference is that the stochastic gradient is evaluated at an extrapolated point $\tilde{w}_{t+1}$, which is an extrapolation of the two previous updates $w_t$ and $w_{t+1}$. Similar to Adam, Adam+ also uses an adaptive step size that is proportional to $1/\|z_t\|^{1/2}$. Nonetheless, the difference is that its adaptive step size is computed directly from the square root of the norm of the first moment estimate $z_t$. In contrast, Adam uses an adaptive step size proportional to $1/\sqrt{v_t}$, where $v_t$ is an exponential moving average of the second moment estimate. These two key components of Adam+, i.e., extrapolation and an adaptive step size derived from the root of the norm of the first moment estimate, give it two notable benefits: variance reduction of the first moment estimate and adaptive convergence. We explain these two benefits later. Before moving to the theoretical analysis, we would like to make some remarks. First, it is worth mentioning that the moving average estimate with extrapolation is inspired by the literature on stochastic compositional optimization (Wang et al., 2017). Wang et al. (2017) showed that extrapolation helps balance the noise in the gradients, reduce the bias in the estimates, and yield a faster convergence rate. Here, our focus and analysis techniques are quite different. In fact, Wang et al. (2017) focus on compositional optimization, while we consider a general nonconvex optimization setting.
Moreover, the analysis in (Wang et al., 2017) mainly deals with the error of the gradient estimator caused by the compositional nature of the problem, while our analysis focuses on carefully designing adaptive normalization to obtain adaptive and fast convergence rates. A similar extrapolation scheme has also been employed in the algorithm NIGT by Cutkosky & Mehta (2020). In later sections, we will also provide a more general variant of Adam+ which subsumes NIGT as a special case. Another important remark is that the update of Adam+ is very different from the famous Nesterov's momentum method. In Nesterov's momentum method, the update of $w_{t+1}$ uses the stochastic gradient at an extrapolated point $\tilde{w}_{t+1} = w_{t+1} + \gamma(w_{t+1} - w_t)$ with a momentum parameter $\gamma \in (0,1)$. In contrast, in Adam+ the update of $w_{t+1}$ uses the moving average estimate at an extrapolated point $\tilde{w}_{t+1} = w_{t+1} + (1/\beta - 1)(w_{t+1} - w_t)$. Finally, Adam+ does not employ coordinate-wise learning rates as Adam does, and hence it is expected to have better generalization performance according to Wilson et al. (2017).
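The extrapolation form above is algebraically identical to line 9 of Algorithm 1, which is easy to check numerically; the random vectors are arbitrary test inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 0.1
w_t = rng.standard_normal(4)     # previous iterate w_t
w_next = rng.standard_normal(4)  # current iterate w_{t+1}

# Line 9 of Algorithm 1: (1 - 1/beta) * w_t + (1/beta) * w_{t+1}
form_line9 = (1 - 1 / beta) * w_t + (1 / beta) * w_next
# Extrapolation form: w_{t+1} + (1/beta - 1) * (w_{t+1} - w_t)
form_extrap = w_next + (1 / beta - 1) * (w_next - w_t)

print(np.allclose(form_line9, form_extrap))  # True: the two forms coincide
```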

2.1. ADAPTIVE VARIANCE REDUCTION AND ADAPTIVE CONVERGENCE

In this subsection, we analyze Adam+ by showing its variance reduction property and adaptive convergence. To this end, we make the following assumptions.

Assumption 1. There exist positive constants $L$, $\Delta$, $L_H$, $\sigma$ and an initial solution $w_0$ such that (i) $F$ is $L$-smooth, i.e., $\|\nabla F(x) - \nabla F(y)\| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^d$; (ii) for every $x \in \mathbb{R}^d$, we have access to a first-order stochastic oracle $g_t(x)$ at time $t$ such that $\mathbb{E}[g_t(x)] = \nabla F(x)$ and $\mathbb{E}\|g_t(x) - \nabla F(x)\|^2 \le \sigma^2$; (iii) $\nabla F$ is an $L_H$-smooth mapping, i.e., $\|\nabla^2 F(x) - \nabla^2 F(y)\| \le L_H\|x - y\|$ for all $x, y \in \mathbb{R}^d$; (iv) $F(w_0) - F_* \le \Delta < \infty$, where $F_* = \inf_{w\in\mathbb{R}^d} F(w)$.

Remark: Assumption 1 (i), (ii), and (iv) are standard in the literature on stochastic non-convex optimization (Ghadimi & Lan, 2013). Assumption 1 (iii) deviates from the typical analysis of stochastic methods. We leverage it to explore the benefits of the moving average, extrapolation, and adaptive normalization. It has also been used in previous works for establishing fast rates of stochastic first-order methods for nonconvex optimization (Fang et al., 2019; Cutkosky & Mehta, 2020), and it is essential for obtaining a fast rate due to the hardness result in (Arjevani et al., 2019). It is also the key assumption for finding a local minimum in previous works (Carmon et al., 2018; Agarwal et al., 2017; Jin et al., 2017). We further assume that the stochastic gradient estimator in Algorithm 1 satisfies the following variance property.

Assumption 2. Assume that $\mathbb{E}[\|g_0(w_0) - \nabla F(w_0)\|^2] \le \sigma_0^2$ and $\mathbb{E}[\|g_t(w_t) - \nabla F(w_t)\|^2] \le \sigma_m^2$ for $t \ge 1$.

Remark: When $g_0$ (resp. $g_t$) is implemented by a mini-batch stochastic gradient with mini-batch size $S$, then $\sigma_0^2$ (resp. $\sigma_m^2$) can be set to $\sigma^2/S$ by Assumption 1 (ii). We differentiate the initial variance and the intermediate variance because they contribute differently to the convergence.
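The $\sigma^2/S$ scaling in the remark above is easy to verify numerically; the Gaussian noise model, dimensions, and trial count below are illustrative assumptions:

```python
# Empirical check that averaging a mini-batch of S oracle calls scales the
# variance bound from sigma^2 down to sigma^2 / S.
import numpy as np

rng = np.random.default_rng(0)
sigma, S, trials = 1.0, 16, 20000
true_grad = np.array([1.0, -2.0])

# single-sample oracle vs. mini-batch oracle of size S
single = true_grad + sigma * rng.standard_normal((trials, 2))
batch = true_grad + sigma * rng.standard_normal((trials, S, 2)).mean(axis=1)

var_single = np.mean(np.sum((single - true_grad) ** 2, axis=-1))
var_batch = np.mean(np.sum((batch - true_grad) ** 2, axis=-1))
print(var_single / var_batch)  # close to S = 16
```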
We first introduce a lemma characterizing the variance of the moving average gradient estimator $z_t$.

Lemma 1. Suppose Assumptions 1 and 2 hold and $a \ge 1$. Then there exists a sequence of random variables $\delta_t$ satisfying $\|z_t - \nabla F(w_t)\| \le \delta_t$ for all $t \ge 0$ and
$$\mathbb{E}[\delta_{t+1}^2] \le \left(1 - \frac{\beta}{2}\right)\mathbb{E}[\delta_t^2] + 2\beta^2\sigma_m^2 + \mathbb{E}\left[\frac{C L_H^2 \|w_{t+1} - w_t\|^4}{\beta^3}\right] \le \left(1 - \frac{\beta}{2}\right)\mathbb{E}[\delta_t^2] + 2\beta^2\sigma_m^2 + \mathbb{E}\left[C L_H^2 \alpha^4 \beta^{4a-3}\|z_t\|^2\right],$$
where $C = 1944$.

Remark: Since $\delta_t$ is an upper bound on $\|z_t - \nabla F(w_t)\|$, the above lemma can be used to illustrate the variance reduction effect for the gradient estimator $z_t$. To this end, we can bound $\|z_t\|^2 \le 2\delta_t^2 + 2\|\nabla F(w_t)\|^2$; then the term $2CL_H^2\alpha^4\beta^{4a-3}\delta_t^2$ can be canceled against $-\frac{\beta}{4}\delta_t^2$ for small enough $\alpha$. Hence, we have $\mathbb{E}[\delta_{t+1}^2] \le (1 - \beta/4)\mathbb{E}[\delta_t^2] + 2\beta^2\sigma_m^2 + c\,\mathbb{E}[\|\nabla F(w_t)\|^2]$ with a small constant $c$. As $\mathbb{E}[\|\nabla F(w_t)\|^2]$ and $\beta$ decrease to zero, the variance of $z_t$ also decreases. Indeed, the above recursion for the variance of $z_t$ resembles that of recursive variance-reduced gradient estimators (e.g., SPIDER (Fang et al., 2018), STORM (Cutkosky & Orabona, 2019)). The benefit of using Adam+ is that we do not need to compute the stochastic gradient twice at each iteration. We can now state our convergence rates for Algorithm 1.

Theorem 1. Suppose Assumptions 1 and 2 hold, and suppose $\|\nabla F(w)\| \le G$ for all $w \in \mathbb{R}^d$. Choosing the parameters such that $\alpha^4 \le \frac{1}{36CL_H^2}$, $\alpha \le \frac{1}{4L}$, $a = 1$, and $\epsilon_0 = \beta^a$, we have
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\|^2 \le \frac{G\,\mathbb{E}\left[\sum_{t=1}^T \|z_t\|\right]}{T} + \frac{\Delta}{\alpha T} + \frac{18\sigma_0^2}{\beta T} + 30\beta\sigma_m^2.$$
In addition, suppose the initial batch size is $T_0$ and the intermediate batch size is $m$, and choose $\beta = T^{-b}$ with $0 \le b \le 1$; then we have
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\|^2 \le \frac{G\,\mathbb{E}\left[\sum_{t=1}^T \|z_t\|\right]}{T} + \frac{\Delta}{\alpha T} + \frac{18\sigma^2}{T_0 T^{1-b}} + \frac{30\sigma^2}{m T^b}. \quad (2)$$

Theorem 2. Suppose Assumptions 1 and 2 hold. Choosing the parameters such that $640\alpha^3 L_H^{3/2} \le 1/120$, $a = 1$, $\epsilon_0 = 0$, and $\beta = 1/T^s$ with $s = 2/3$, it takes $T = O(\epsilon^{-4.5})$ iterations to ensure that
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\|^{3/2} \le \epsilon^{3/2}, \qquad \frac{1}{T}\mathbb{E}\left[\sum_{t=1}^T \delta_t^{3/2}\right] \le \epsilon^{3/2}.$$

Remarks: • From Theorem 1, we observe that the convergence rate of Adam+ crucially depends on the growth rate of $\mathbb{E}[\sum_{t=1}^T \|z_t\|]$, which gives a data-dependent adaptive complexity. If $\mathbb{E}[\sum_{t=1}^T \|z_t\|] \le T^{\alpha}$ with $\alpha < 1$, then the algorithm converges, and a smaller $\alpha$ implies faster convergence. Our goal is to ensure that $\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\|^2 \le \epsilon^2$. Choosing $b = 1 - \alpha$, $m = O(1)$, and $T_0 = T^{1-\alpha} = O(\epsilon^{-2})$, we end up with a complexity of $T = O(\epsilon^{-\frac{2}{1-\alpha}})$. • Theorem 2 shows that, in the ergodic sense, Adam+ always converges, and the variance gets smaller as the number of iterations grows. Theorem 2 rules out the case in which the magnitude of $\|z_t\|$ converges to a constant and the bound (2) in Theorem 1 becomes vacuous. • For comparison, the convergence of Adam-style algorithms (e.g., Adam, AdaGrad) depends on the growth rate of the stochastic gradients, i.e., $\sum_{i=1}^d \|g_{1:T,i}\|/T$, where $g_{1:T,i} = [g_{1,i}, \dots, g_{T,i}]$ denotes the $i$-th coordinate of all historical stochastic gradients. Hence, the data determines the growth rate of the stochastic gradients. If the stochastic gradients are not sparse, their growth rate may not be slow and these Adam-style algorithms may suffer from slow convergence. In contrast, for Adam+ the convergence can be accelerated by the variance reduction property. Note that $\mathbb{E}[\sum_{t=1}^T \|z_t\|]/T \le \mathbb{E}[\sum_{t=1}^T (\delta_t + \|\nabla F(w_t)\|)]/T$; hence, Adam+'s convergence benefits from the variance reduction property of $z_t$.
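The contraction in the remark after Lemma 1 can be simulated directly. The loop below iterates $d_{t+1} = (1-\beta/4)d_t + 2\beta^2\sigma_m^2 + c\,g_t$, where $g_t$ is a stand-in decaying sequence for $\mathbb{E}\|\nabla F(w_t)\|^2$; all constants are illustrative assumptions:

```python
import numpy as np

beta, sigma_m2, c = 0.1, 1.0, 0.01   # illustrative constants
d = 1.0                              # stand-in for E[delta_t^2]
for t in range(500):
    g_t = 1.0 / (t + 1)              # assumed decaying E||grad F(w_t)||^2
    d = (1 - beta / 4) * d + 2 * beta**2 * sigma_m2 + c * g_t

# With g_t -> 0, the recursion settles near its fixed point
# 2*beta^2*sigma_m2 / (beta/4) = 8*beta*sigma_m2, so shrinking beta
# shrinks the stationary variance level.
print(d)
```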

2.2. A GENERAL VARIANT OF ADAM+: FAST CONVERGENCE WITH LARGE MINI-BATCH

Next, we introduce a more general variant of Adam+ by making a simple change. In particular, we keep all steps the same as in Algorithm 1 except that the adaptive step size is now set as $\eta_t = \frac{\alpha\beta^a}{\max(\|z_t\|^p, \epsilon_0)}$, where $p \in [1/2, 1]$ is a parameter. We refer to this general variant of Adam+ as power normalized Adam+ (Nadam+). This generalization allows us to compare with some existing methods and to establish a fast convergence rate. First, we notice that when setting $p = 1$, $a = 5/4$, and $\beta = 1/T^{4/7}$, Nadam+ is almost the same as the stochastic method NIGT (Cutkosky & Mehta, 2020), with only minor differences. However, we observed that normalizing by $\|z_t\|$ leads to slow convergence in practice, so we are instead interested in $p < 1$. Below, we show that Nadam+ with $p < 1$ can achieve a fast rate of $1/\epsilon^{3.5}$, which is the same as NIGT.

Theorem 3. Under the same assumptions as in Theorem 1, further assume $\sigma_0^2 = \sigma^2/T_0$ and $\sigma_m^2 = \sigma^2/m$. Using the step size $\eta_t = \frac{\alpha\beta^{4/3}}{\max(\|z_t\|^{2/3}, \epsilon_0)}$ in Algorithm 1 with $CL_H^2\alpha^4 \le 1/14$ and $\epsilon_0 = 2\beta^{4/3}$, in order to have $\mathbb{E}[\|\nabla F(w_\tau)\|] \le \epsilon$ for a randomly selected solution $w_\tau$ from $\{w_1, \dots, w_T\}$, it suffices to set $\beta = O(\epsilon^{1/2})$, $T = O(\epsilon^{-2})$, the initial batch size $T_0 = 1/\beta = O(\epsilon^{-1/2})$, and the intermediate batch size $m = 1/\beta^3 = O(\epsilon^{-3/2})$, which ends up with a total complexity of $O(\epsilon^{-3.5})$.

3. EXPERIMENTS

We evaluate Adam+ on three deep learning tasks: image classification on the CIFAR10 and CIFAR100 datasets (Krizhevsky, 2009), language modeling on the WikiText-2 dataset (Merity, 2016), and automatic speech recognition on the SWB-300 dataset (Saon et al., 2017). We choose tasks from different domains to demonstrate the applicability to real-world deep learning tasks in a broad sense. The detailed description is presented in Table 2. We compare our algorithm Adam+ with SGD, momentum SGD, Adagrad, NIGT, and Adam. We choose the same random initialization for each algorithm and run a fixed number of epochs for every task. For Adam we choose the default setting β1 = 0.9 and β2 = 0.999 as in the original Adam paper.
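For reference, the Nadam+ step size from Section 2.2 amounts to a one-line change over Algorithm 1; a minimal sketch, with function name and defaults our own ($p = 1/2$ recovers the Adam+ step size; $p = 2/3$, $a = 4/3$ matches Theorem 3):

```python
import numpy as np

def nadam_plus_stepsize(z, alpha=0.1, beta=0.1, a=4/3, p=2/3, eps0=1e-8):
    """Power-normalized step size: eta_t = alpha * beta**a / max(||z_t||**p, eps0)."""
    return alpha * beta**a / max(np.linalg.norm(z) ** p, eps0)

z = np.array([3.0, 4.0])                              # ||z_t|| = 5
eta_adam_plus = nadam_plus_stepsize(z, a=1.0, p=0.5)  # Adam+ case (p = 1/2)
eta_theorem3 = nadam_plus_stepsize(z)                 # Theorem 3 case (p = 2/3)
print(eta_adam_plus, eta_theorem3)
```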
3.1. IMAGE CLASSIFICATION

We use ResNet18 (He et al., 2016) and VGG19 (Simonyan & Zisserman, 2014) for the image classification tasks on the CIFAR10 and CIFAR100 datasets, respectively. For every optimizer, we use batch size 128 and run 350 epochs. For SGD and momentum SGD, we set the initial learning rate to 0.1 for the first 150 epochs, and the learning rate is decreased by a factor of 10 every 100 epochs. For Adagrad and Adam, the initial learning rate is tuned from {0.1, 0.01, 0.001} and we choose the one with the best performance. The best initial learning rates for Adagrad and Adam are 0.01 and 0.001, respectively. For NIGT, we tune the momentum parameter from {0.01, 0.1, 0.9} (the best momentum parameter we found is 0.9), and the learning rate is chosen to be the same as in SGD. For Adam+, the learning rate is set according to Algorithm 1, in which we choose β = 0.1 and the value of α is the same as the learning rate used in SGD. We report training and test accuracy versus the number of epochs in Figure 1 for CIFAR10 and Figure 2 for CIFAR100. We observe that our algorithm consistently outperforms all other algorithms on both CIFAR10 and CIFAR100, in terms of both training and test accuracy. Notably, we have some interesting observations for the training of VGG19 on CIFAR100. First, both Adam+ and NIGT significantly outperform SGD, momentum SGD, Adagrad, and Adam. Second, Adam+ achieves almost the same final accuracy as NIGT, and Adam+ converges much faster in the early stage of training.

3.2. LANGUAGE MODELING

WikiText-2 In the second experiment, we consider the language modeling task on the WikiText-2 dataset. We use a 2-layer LSTM (Hochreiter & Schmidhuber, 1997). The size of the word embeddings is 650 and the number of hidden units per layer is 650. We run every algorithm for 40 epochs, with batch size 20 and dropout ratio 0.5. For SGD and momentum SGD, we tune the initial learning rate from {0.1, 0.2, 0.5, 5, 10, 20} and decrease the learning rate by a factor of 4 when the validation error saturates. For Adagrad and Adam, we tune the initial learning rate from {0.001, 0.01, 0.1, 1.0}. We report the best performance of these methods across the range of learning rates. The best initial learning rates for Adagrad and Adam are 0.01 and 0.001, respectively. For NIGT, we tune the initial learning rate over the same range as for SGD and tune the momentum parameter β from {0.01, 0.1, 0.9}; the best choice is β = 0.9. The learning rate and β are both decreased by a factor of 4 when the validation error saturates. For Adam+, we follow the same tuning strategy as for NIGT. We report both training and test perplexity versus the number of epochs in Figure 3.

3.3. AUTOMATIC SPEECH RECOGNITION

SWB-300 In the third experiment, we consider the automatic speech recognition task on the SWB-300 dataset (Saon et al., 2017). SWB-300 contains roughly 300 hours of training data with over 4 million samples (30GB) and roughly 6 hours of held-out data with over 0.08 million samples (0.6GB). Each training sample is a fusion of FMLLR (40-dim), i-Vector (100-dim), and logmel with its delta and double delta. The acoustic model is a long short-term memory (LSTM) model with 6 bi-directional layers. Each layer contains 1,024 cells (512 cells in each direction). On top of the LSTM layers, there is a linear projection layer with 256 hidden units, followed by a softmax output layer with 32,000 units corresponding to context-dependent HMM states. The LSTM is unrolled over 21 frames and trained with non-overlapping feature sub-sequences of that length. This model contains over 43 million parameters and is about 165MB in size. The training takes about 20 hours on 1 V100 GPU. For comparison, we adopt the well-tuned Momentum SGD strategy described in (Zhang et al., 2019) for this task as the baseline: batch size 256, learning rate 0.1 for the first 10 epochs and then annealed by √0.5 for another 10 epochs, with momentum 0.9. We grid search the learning rate of Adam and Adagrad from {0.1, 0.01, 0.001} and report the best configuration we found (Adam with learning rate 0.001 and Adagrad with learning rate 0.01). For NIGT, we also follow the same learning rate setup (including annealing) as in the Momentum SGD baseline. In addition, we fine-tuned β in NIGT by exploring β in {0.01, 0.1, 0.9} and report the best configuration (β = 0.9). For Adam+, we follow the same learning rate and annealing strategy as in Momentum SGD and tuned β in the same way as for NIGT, reporting the best configuration (β = 0.01). From Figure 4, Adam+ achieves training and held-out losses indistinguishable from the well-tuned Momentum SGD baseline and significantly outperforms the other optimizers.

3.4. GROWTH RATE OF $\sum_{i=1}^t \|z_i\|$

In this subsection, we consider the growth rate of $\sum_{i=1}^t \|z_i\|$, since it crucially affects the convergence rate, as shown in Theorem 1. We report results for both ResNet18 trained on CIFAR10 and VGG19 trained on CIFAR100. From Figure 5, we observe that the quantity quickly reaches a plateau and then grows at a very slow rate with respect to the number of iterations. This phenomenon verifies the variance reduction effect and also explains why Adam+ enjoys fast convergence in practice.
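The diagnostic behind Figure 5 can be sketched as follows; the synthetic sequence of estimator norms (decaying like $1/\sqrt{t}$) is an assumption used only to illustrate the sublinear-growth regime:

```python
import numpy as np

T = 1000
z_norms = 1.0 / np.sqrt(np.arange(1, T + 1))  # stand-in for ||z_t|| during training
running_sum = np.cumsum(z_norms)              # sum_{i <= t} ||z_i||, as in Figure 5

# Sublinear growth: sum_{t <= T} ||z_t|| = O(sqrt(T)) here, so the average
# running_sum[T-1] / T vanishes; this is the regime where Theorem 1 gives a
# fast rate.
print(running_sum[-1] / T)
```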

4. CONCLUSION

In this paper, we design a new algorithm named Adam+ to train deep neural networks efficiently. Different from Adam, Adam+ updates the solution using a moving average of stochastic gradients calculated at extrapolated points and an adaptive normalization based only on first-order statistics of the stochastic gradients. We establish data-dependent adaptive complexity results for Adam+ from the perspective of adaptive variance reduction, and we also show that a variant of Adam+ achieves a state-of-the-art complexity. Extensive empirical studies on several tasks verify the effectiveness of the proposed algorithm. We also empirically demonstrate the slow growth rate of the new gradient estimator, which explains why Adam+ enjoys fast convergence in practice.

A PROOF OF LEMMA 1

Proof. The proof is similar to that of Lemma 12 in (Wang et al., 2017). Define
$$\zeta_k^{(t)} = \begin{cases} \beta(1-\beta)^{t-k} & \text{if } t \ge k > 0, \\ (1-\beta)^{t-k} & \text{if } t \ge k = 0. \end{cases} \quad (3)$$
By the definition of $\zeta_k^{(t)}$ and the update of Algorithm 1, we have
$$\zeta_k^{(t+1)} = (1-\beta)\zeta_k^{(t)}, \qquad \sum_{k=0}^t \zeta_k^{(t)} = 1, \qquad w_t = \sum_{k=0}^t \zeta_k^{(t)}\tilde{w}_{k+1}, \qquad z_{t+1} = \sum_{k=0}^t \zeta_k^{(t)}\nabla f(\tilde{w}_{k+1};\xi_{k+1}).$$
Define
$$m_{t+1} = \sum_{k=0}^t \zeta_k^{(t)}\|w_{t+1}-\tilde{w}_{k+1}\|^2, \qquad n_{t+1} = \sum_{k=0}^t \zeta_k^{(t)}\left[\nabla f(\tilde{w}_{k+1};\xi_{k+1}) - \nabla F(\tilde{w}_{k+1})\right],$$
where $\nabla f(\tilde{w}_{k+1};\xi_{k+1})$ is an unbiased stochastic first-order oracle for $\nabla F(\tilde{w}_{k+1})$ with variance bounded by $\sigma_m^2$. Since $\nabla F$ is an $L_H$-smooth mapping (according to Assumption 1 (iii)), by Lemma 10 of (Wang et al., 2017) we have
$$\|z_t - \nabla F(w_t)\|^2 \le \left(L_H m_t + \|n_t\|\right)^2 \le 2L_H^2 m_t^2 + 2\|n_t\|^2.$$
Define $q_{t+1} = \sum_{k=0}^t \zeta_k^{(t)}\|w_{t+1}-\tilde{w}_{k+1}\|$. According to Lemma 11 (a) and (b) of (Wang et al., 2017), we have
$$m_{t+1} + 4q_{t+1}^2 \le \left(1-\frac{\beta}{2}\right)\left(m_t + 4q_t^2\right) + \frac{18}{\beta}\|w_{t+1}-w_t\|^2.$$
Taking squares on both sides of the inequality and using the fact that $(a+b)^2 \le (1+\frac{\beta}{2})a^2 + (1+\frac{2}{\beta})b^2$ for $\beta > 0$, we have
$$\left(m_{t+1}+4q_{t+1}^2\right)^2 \le \left(1+\frac{\beta}{2}\right)\left(1-\frac{\beta}{2}\right)^2\left(m_t+4q_t^2\right)^2 + \left(1+\frac{2}{\beta}\right)\frac{324}{\beta^2}\|w_{t+1}-w_t\|^4 \le \left(1-\frac{\beta}{2}\right)\left(m_t+4q_t^2\right)^2 + \frac{972}{\beta^3}\|w_{t+1}-w_t\|^4, \quad (4)$$
where the last inequality holds since $1/\beta \ge 1$. Define $\delta_t^2 = 2L_H^2(m_t+4q_t^2)^2 + 2\|n_t\|^2$; then we have $\|z_t-\nabla F(w_t)\|^2 \le \delta_t^2$ for all $t$. Denote by $\mathcal{F}_{t+1}$ the σ-algebra generated by $\xi_1,\dots,\xi_{t+1}$. Combining (4) with the bound on $\|n_t\|$ derived in Lemma 11 (c) of (Wang et al., 2017), we have
$$\mathbb{E}\left[\delta_{t+1}^2 \mid \mathcal{F}_{t+1}\right] \le \left(1-\frac{\beta}{2}\right)\delta_t^2 + 2\beta^2\sigma_m^2 + \frac{1944L_H^2\|w_{t+1}-w_t\|^4}{\beta^3}.$$
Taking expectation on both sides yields
$$\mathbb{E}[\delta_{t+1}^2] \le \left(1-\frac{\beta}{2}\right)\mathbb{E}[\delta_t^2] + 2\beta^2\sigma_m^2 + \mathbb{E}\left[\frac{1944L_H^2\|w_{t+1}-w_t\|^4}{\beta^3}\right].$$
Since $\eta_t = \frac{\alpha\beta^a}{\max(\|z_t\|^{1/2},\epsilon_0)}$ and $w_{t+1}-w_t = -\eta_t z_t$, we have $\|w_{t+1}-w_t\|^4 \le \alpha^4\beta^{4a}\|z_t\|^2$, and hence
$$\mathbb{E}[\delta_{t+1}^2] \le \left(1-\frac{\beta}{2}\right)\mathbb{E}[\delta_t^2] + 2\beta^2\sigma_m^2 + \mathbb{E}\left[CL_H^2\alpha^4\beta^{4a-3}\|z_t\|^2\right].$$
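The two elementary bounds used in (4) above can be checked directly; this verification is ours, not part of the original proof:

```latex
% For 0 < \beta \le 1:
\Big(1+\tfrac{\beta}{2}\Big)\Big(1-\tfrac{\beta}{2}\Big)^2
  = \Big(1-\tfrac{\beta^2}{4}\Big)\Big(1-\tfrac{\beta}{2}\Big)
  \le 1-\tfrac{\beta}{2},
\qquad
\Big(1+\tfrac{2}{\beta}\Big)\frac{324}{\beta^2}
  \le \frac{3}{\beta}\cdot\frac{324}{\beta^2}
  = \frac{972}{\beta^3},
% where the second step uses 1 \le 1/\beta, so 1 + 2/\beta \le 3/\beta.
```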

B PROOF OF THEOREM 1

Proof. By Lemma 1 and the update rule of Algorithm 1, we have
$$\mathbb{E}[\delta_{t+1}^2] \le \left(1-\frac{\beta}{2}\right)\mathbb{E}[\delta_t^2] + 2\beta^2\sigma_m^2 + \mathbb{E}\left[\frac{CL_H^2\alpha^4\beta^{4a}\|z_t\|^4}{\beta^3\left(\max(\|z_t\|^{1/2},\epsilon_0)\right)^4}\right] \le \left(1-\frac{\beta}{2}\right)\mathbb{E}[\delta_t^2] + 2\beta^2\sigma_m^2 + \mathbb{E}\left[\frac{2CL_H^2\alpha^4\beta^4\left(\delta_t^2 + \|\nabla F(w_t)\|^2\right)}{\beta^3}\right], \quad (5)$$
where the second inequality holds since $(\max(\|z_t\|^{1/2},\epsilon_0))^4 \ge \|z_t\|^2$, $\|z_t\|^2 \le 2\delta_t^2 + 2\|\nabla F(w_t)\|^2$, and $a = 1$. Note that $2CL_H^2\alpha^4 \le 1/18$. Plugging this into (5), we have
$$\frac{8\beta}{18}\mathbb{E}[\delta_t^2] \le \mathbb{E}\left[\delta_t^2 - \delta_{t+1}^2\right] + 2\beta^2\sigma_m^2 + \mathbb{E}\left[\frac{\beta}{18}\|\nabla F(w_t)\|^2\right]. \quad (6)$$
Summing over $t = 1,\dots,T$ on both sides of (6) and with some simple algebra, we have
$$\sum_{t=1}^T \mathbb{E}[\delta_t^2] \le \sum_{t=1}^T \mathbb{E}\left[\frac{3(\delta_t^2-\delta_{t+1}^2)}{\beta}\right] + \sum_{t=1}^T 5\beta\sigma_m^2 + \sum_{t=1}^T \mathbb{E}\left[\frac{1}{8}\|\nabla F(w_t)\|^2\right]. \quad (7)$$
By Assumption 1 (i) and the property of $L$-smooth functions, we know that
$$F(w_{t+1}) \le F(w_t) + \nabla F(w_t)^\top(w_{t+1}-w_t) + \frac{L}{2}\|w_{t+1}-w_t\|^2 = F(w_t) - \eta_t\nabla F(w_t)^\top z_t + \frac{\eta_t^2L}{2}\|z_t\|^2$$
$$\le F(w_t) - \eta_t\nabla F(w_t)^\top\left(z_t-\nabla F(w_t)+\nabla F(w_t)\right) + \eta_t^2L\left(\|z_t-\nabla F(w_t)\|^2 + \|\nabla F(w_t)\|^2\right)$$
$$= F(w_t) - \left(\eta_t-\eta_t^2L\right)\|\nabla F(w_t)\|^2 - \eta_t\nabla F(w_t)^\top\left(z_t-\nabla F(w_t)\right) + \eta_t^2L\|z_t-\nabla F(w_t)\|^2$$
$$\le F(w_t) - \left(\frac{\eta_t}{2}-\eta_t^2L\right)\|\nabla F(w_t)\|^2 + \left(\frac{\eta_t}{2}+\eta_t^2L\right)\|z_t-\nabla F(w_t)\|^2.$$
Noting that $\eta_t = \frac{\alpha\beta^a}{\max(\|z_t\|^{1/2},\epsilon_0)}$, $\alpha \le 1/(4L)$, and $\epsilon_0 = \beta^a$, we know that $\eta_t L \le 1/4$. Hence we have
$$\|\nabla F(w_t)\|^2 \le \frac{4\left(F(w_t)-F(w_{t+1})\right)}{\eta_t} + 3\|z_t-\nabla F(w_t)\|^2.$$
Taking the summation over $t = 1,\dots,T$ and taking expectation yield
$$\sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\|^2 \le \mathbb{E}\left[\sum_{t=1}^T \frac{4(F(w_t)-F(w_{t+1}))}{\eta_t}\right] + 3\sum_{t=1}^T \mathbb{E}\|z_t-\nabla F(w_t)\|^2 \le \mathbb{E}\left[\sum_{t=1}^T \frac{4(F(w_t)-F(w_{t+1}))}{\eta_t}\right] + 3\sum_{t=1}^T \mathbb{E}[\delta_t^2]. \quad (8)$$
Combining (7) and (8) yields
$$\sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\|^2 \le \mathbb{E}\left[\sum_{t=1}^T \frac{4(F(w_t)-F(w_{t+1}))}{\eta_t}\right] + \sum_{t=1}^T \mathbb{E}\left[\frac{9(\delta_t^2-\delta_{t+1}^2)}{\beta}\right] + \sum_{t=1}^T 15\beta\sigma_m^2 + \sum_{t=1}^T \mathbb{E}\left[\frac{3}{8}\|\nabla F(w_t)\|^2\right].$$
By some simple algebra, we have
$$\sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\|^2 \le \mathbb{E}\left[\sum_{t=1}^T \frac{8(F(w_t)-F(w_{t+1}))}{\eta_t}\right] + \sum_{t=1}^T \mathbb{E}\left[\frac{18(\delta_t^2-\delta_{t+1}^2)}{\beta}\right] + \sum_{t=1}^T 30\beta\sigma_m^2.$$
Then we have
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\|^2 \le \mathbb{E}\left[\sum_{t=1}^T \frac{8\max(\|z_t\|^{1/2},\epsilon_0)\left(F(w_t)-F(w_{t+1})\right)}{\alpha\beta^a T}\right] + \frac{18\sigma_0^2}{\beta T} + 30\beta\sigma_m^2.$$
Noting that $|F(w_t)-F(w_{t+1})| \le G\eta_t\|z_t\|$, we have
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\|^2 \le \frac{8G\,\mathbb{E}\left[\sum_{t=1}^T \|z_t\|\right]}{T} + \frac{\Delta}{\alpha T} + \frac{18\sigma_0^2}{\beta T} + 30\beta\sigma_m^2.$$

C PROOF OF THEOREM 2

Before presenting the proof, we first introduce several lemmas which are useful for our analysis.

Lemma 2. Adam+ with $\eta_t = \frac{\alpha\beta}{\max(\|z_t\|^{1/2},\epsilon_0)}$ and $\epsilon_0 = 0$ satisfies
$$F(w_{t+1}) - F(w_t) \le \alpha\beta\left(-\frac{\|\nabla F(w_t)\|^{3/2}}{6} + 9\|z_t-\nabla F(w_t)\|^{3/2}\right) + \frac{64\alpha^4\beta^4L^3}{3}.$$

Proof. By the $L$-smoothness and the update of the algorithm, we have
$$F(w_{t+1}) - F(w_t) \le \nabla F(w_t)^\top(w_{t+1}-w_t) + \frac{L}{2}\|w_{t+1}-w_t\|^2 \le -\alpha\beta\,\frac{\langle\nabla F(w_t), z_t\rangle}{\max(\|z_t\|^{1/2},\epsilon_0)} + \frac{\alpha^2\beta^2L\,\|z_t\|^2}{\left(\max(\|z_t\|^{1/2},\epsilon_0)\right)^2}. \quad (11)$$
Define $\Delta_t = z_t - \nabla F(w_t)$. If $\|\nabla F(w_t)\| \ge 2\|\Delta_t\|$, we have
$$-\frac{\langle z_t,\nabla F(w_t)\rangle}{\max(\|z_t\|^{1/2},\epsilon_0)} = -\frac{\|\nabla F(w_t)\|^2 + \langle\Delta_t,\nabla F(w_t)\rangle}{\max(\|\nabla F(w_t)+\Delta_t\|^{1/2},\epsilon_0)} \le -\frac{\|\nabla F(w_t)\|^2}{2\|\nabla F(w_t)+\Delta_t\|^{1/2}} \le -\frac{\|\nabla F(w_t)\|^{3/2}}{3} \le -\frac{\|\nabla F(w_t)\|^{3/2}}{3} + 8\|\Delta_t\|^{3/2}. \quad (12)$$
If $\|\nabla F(w_t)\| \le 2\|\Delta_t\|$, we have
$$-\frac{\langle z_t,\nabla F(w_t)\rangle}{\max(\|z_t\|^{1/2},\epsilon_0)} = -\frac{\|\nabla F(w_t)\|^2 + \langle\Delta_t,\nabla F(w_t)\rangle}{\max(\|\nabla F(w_t)+\Delta_t\|^{1/2},\epsilon_0)} \le \frac{6\|\Delta_t\|^2}{\|\Delta_t\|^{1/2}} = 6\|\Delta_t\|^{3/2} \le -\frac{\|\nabla F(w_t)\|^{3/2}}{3} + 8\|\Delta_t\|^{3/2}. \quad (13)$$
By (12) and (13), we have
$$-\frac{\langle z_t,\nabla F(w_t)\rangle}{\max(\|z_t\|^{1/2},\epsilon_0)} \le -\frac{\|\nabla F(w_t)\|^{3/2}}{3} + 8\|\Delta_t\|^{3/2}. \quad (14)$$
By (11) and (14), we have
$$F(w_{t+1}) - F(w_t) \le \alpha\beta\left(-\frac{\|\nabla F(w_t)\|^{3/2}}{3} + 8\|\Delta_t\|^{3/2}\right) + \alpha^2\beta^2L\|z_t\| = \alpha\beta\left(-\frac{\|\nabla F(w_t)\|^{3/2}}{3} + 8\|\Delta_t\|^{3/2}\right) + \alpha^2\beta^2L\min_{x>0}\left(\frac{2\|z_t\|^{3/2}}{3x} + \frac{x^2}{3}\right)$$
$$\le \alpha\beta\left(-\frac{\|\nabla F(w_t)\|^{3/2}}{3} + 8\|\Delta_t\|^{3/2}\right) + \alpha^2\beta^2L\left(\frac{2\|z_t\|^{3/2}}{3(8\alpha\beta L)} + \frac{64\alpha^2\beta^2L^2}{3}\right) \le \alpha\beta\left(-\frac{\|\nabla F(w_t)\|^{3/2}}{6} + 9\|\Delta_t\|^{3/2}\right) + \frac{64\alpha^4\beta^4L^3}{3},$$
where the last inequality holds because $\|z_t\|^{3/2} \le 2\|\nabla F(w_t)\|^{3/2} + 2\|\Delta_t\|^{3/2}$.

Lemma 3. For Adam+ with $\eta_t = \frac{\alpha\beta}{\max(\|z_t\|^{1/2},\epsilon_0)}$, there exist random variables $\delta_t$ such that
$$\mathbb{E}[\delta_{t+1}^{3/2}] \le \left(1-\frac{\beta}{2}\right)\mathbb{E}[\delta_t^{3/2}] + 2\beta^{3/2}\sigma^{3/2} + \mathbb{E}\left[\frac{320L_H^{3/2}\|w_{t+1}-w_t\|^3}{\beta^2}\right].$$

Proof. The proof shares the same spirit as that of Lemma 12 in (Wang et al., 2017), but we adapt it for our purpose.
Define
$$\zeta_k^{(t)} = \begin{cases} \beta(1-\beta)^{t-k} & \text{if } t \ge k > 0, \\ (1-\beta)^{t-k} & \text{if } t \ge k = 0. \end{cases} \quad (15)$$
By the definition of $\zeta_k^{(t)}$ and the update of Algorithm 1, we have
$$\zeta_k^{(t+1)} = (1-\beta)\zeta_k^{(t)}, \qquad \sum_{k=0}^t \zeta_k^{(t)} = 1, \qquad w_t = \sum_{k=0}^t \zeta_k^{(t)}\tilde{w}_{k+1}, \qquad z_{t+1} = \sum_{k=0}^t \zeta_k^{(t)}\nabla f(\tilde{w}_{k+1};\xi_{k+1}).$$
Define
$$m_{t+1} = \sum_{k=0}^t \zeta_k^{(t)}\|w_{t+1}-\tilde{w}_{k+1}\|^2, \qquad n_{t+1} = \sum_{k=0}^t \zeta_k^{(t)}\left[\nabla f(\tilde{w}_{k+1};\xi_{k+1}) - \nabla F(\tilde{w}_{k+1})\right],$$
where $\nabla f(\tilde{w}_{k+1};\xi_{k+1})$ is an unbiased stochastic first-order oracle for $\nabla F(\tilde{w}_{k+1})$ with variance bounded by $\sigma^2$. Since $\nabla F$ is an $L_H$-smooth mapping (according to Assumption 1), by Lemma 10 of (Wang et al., 2017) we have
$$\|z_t-\nabla F(w_t)\|^{3/2} \le \left(L_Hm_t + \|n_t\|\right)^{3/2} \le 2L_H^{3/2}m_t^{3/2} + 2\|n_t\|^{3/2}.$$
Define $q_{t+1} = \sum_{k=0}^t \zeta_k^{(t)}\|w_{t+1}-\tilde{w}_{k+1}\|$. According to Lemma 11 (a) and (b) of (Wang et al., 2017), we have
$$m_{t+1} + 4q_{t+1}^2 \le \left(1-\frac{\beta}{2}\right)\left(m_t+4q_t^2\right) + \frac{18}{\beta}\|w_{t+1}-w_t\|^2.$$
Taking the power $3/2$ on both sides of the inequality and using the fact that $(a+b)^{3/2} \le (1+\frac{\beta}{2})^{1/2}a^{3/2} + (1+\frac{2}{\beta})^{1/2}b^{3/2}$ for $\beta > 0$, we have
$$\left(m_{t+1}+4q_{t+1}^2\right)^{3/2} \le \left(1+\frac{\beta}{2}\right)^{1/2}\left(1-\frac{\beta}{2}\right)^{3/2}\left(m_t+4q_t^2\right)^{3/2} + \left(1+\frac{2}{\beta}\right)^{1/2}\frac{80}{\beta^{3/2}}\|w_{t+1}-w_t\|^3 \le \left(1-\frac{\beta}{2}\right)\left(m_t+4q_t^2\right)^{3/2} + \frac{160}{\beta^2}\|w_{t+1}-w_t\|^3, \quad (16)$$
where the last inequality holds since $1/\beta \ge 1$. By the definition of $n_t$, we have $n_{t+1} = (1-\beta)n_t + \beta\left(\nabla f(\tilde{w}_{t+1};\xi_{t+1}) - \nabla F(\tilde{w}_{t+1})\right)$. Denote by $\mathcal{F}_{t+1}$ the σ-algebra generated by $\xi_1,\dots,\xi_{t+1}$. Noting that
$$\mathbb{E}\left[\|n_{t+1}\|^{3/2}\mid\mathcal{F}_{t+1}\right] \le \left(\mathbb{E}\left[\|n_{t+1}\|^2\mid\mathcal{F}_{t+1}\right]\right)^{3/4} \le \left(1-\frac{\beta}{2}\right)^{3/2}\|n_t\|^{3/2} + \beta^{3/2}\sigma^{3/2}, \quad (17)$$
where the last inequality holds by invoking Lemma 11 (c) of (Wang et al., 2017). Define $\delta_t^{3/2} = 2L_H^{3/2}(m_t+4q_t^2)^{3/2} + 2\|n_t\|^{3/2}$; then we have $\|z_t-\nabla F(w_t)\|^{3/2} \le \delta_t^{3/2}$ for all $t$. According to (16) and (17), we have
$$\mathbb{E}\left[\delta_{t+1}^{3/2}\mid\mathcal{F}_{t+1}\right] \le \left(1-\frac{\beta}{2}\right)\delta_t^{3/2} + 2\beta^{3/2}\sigma^{3/2} + \frac{320L_H^{3/2}\|w_{t+1}-w_t\|^3}{\beta^2}.$$
Taking expectation on both sides yields
$$\mathbb{E}[\delta_{t+1}^{3/2}] \le \left(1-\frac{\beta}{2}\right)\mathbb{E}[\delta_t^{3/2}] + 2\beta^{3/2}\sigma^{3/2} + \mathbb{E}\left[\frac{320L_H^{3/2}\|w_{t+1}-w_t\|^3}{\beta^2}\right].$$

Lemma 4.
Adam+ with learning rate $\eta_t = \frac{\alpha\beta}{\max(\|z_t\|^{1/2},\epsilon_0)}$ and $640\alpha^3L_H^{3/2} \le 1/120$ satisfies
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\|^{3/2} \le \frac{101\Delta}{\alpha\beta T} + \frac{2727\,\mathbb{E}[\delta_1^{3/2}]}{\beta T} + 4545\beta^{1/2}\sigma^{3/2} + \frac{3\beta^3L^{3/2}}{100}.$$
To ensure that $\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\|^{3/2} \le \epsilon^{3/2}$, we can choose $\beta = \epsilon^3$ and $T = O(\epsilon^{-9/2})$.

Proof. By Lemma 3 and noting that $\eta_t = \frac{\alpha\beta}{\max(\|z_t\|^{1/2},\epsilon_0)}$, we have
$$\mathbb{E}[\delta_{t+1}^{3/2}] \le \left(1-\frac{\beta}{2}\right)\mathbb{E}[\delta_t^{3/2}] + 2\beta^{3/2}\sigma^{3/2} + \mathbb{E}\left[\frac{320L_H^{3/2}\alpha^3\beta^3\|z_t\|^{3/2}}{\beta^2}\right] \le \left(1-\frac{\beta}{2}\right)\mathbb{E}[\delta_t^{3/2}] + 2\beta^{3/2}\sigma^{3/2} + \mathbb{E}\left[640L_H^{3/2}\alpha^3\beta\left(\|\nabla F(w_t)\|^{3/2}+\delta_t^{3/2}\right)\right]. \quad (18)$$
Note that $640\alpha^3L_H^{3/2} \le 1/120$. Plugging this into (18), we have
$$\frac{59\beta}{120}\mathbb{E}[\delta_t^{3/2}] \le \mathbb{E}\left[\delta_t^{3/2}-\delta_{t+1}^{3/2}\right] + 2\beta^{3/2}\sigma^{3/2} + \mathbb{E}\left[\frac{\beta}{120}\|\nabla F(w_t)\|^{3/2}\right]. \quad (19)$$
Summing over $t = 1,\dots,T$ on both sides of (19) and with some simple algebra, we have
$$\sum_{t=1}^T \mathbb{E}[\delta_t^{3/2}] \le \sum_{t=1}^T \mathbb{E}\left[\frac{3(\delta_t^{3/2}-\delta_{t+1}^{3/2})}{\beta}\right] + \sum_{t=1}^T 5\beta^{1/2}\sigma^{3/2} + \sum_{t=1}^T \frac{1}{59}\mathbb{E}\|\nabla F(w_t)\|^{3/2}.$$
By Lemma 2, taking expectation on both sides, we have
$$\mathbb{E}\left[F(w_{t+1})-F(w_t)\right] \le \alpha\beta\left(-\frac{\mathbb{E}\|\nabla F(w_t)\|^{3/2}}{6} + 9\,\mathbb{E}[\delta_t^{3/2}]\right) + \frac{64\alpha^4\beta^4L^3}{3}. \quad (20)$$
Summing (20) over $t = 1,\dots,T$ yields
$$\frac{5}{504}\alpha\beta\sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\|^{3/2} \le F(w_1)-F_* + \alpha\beta\left(\frac{27\,\mathbb{E}[\delta_1^{3/2}]}{\beta} + \sum_{t=1}^T 45\beta^{1/2}\sigma^{3/2}\right) + \frac{64\alpha^4\beta^4L^3T}{3}.$$
Hence, we have
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\|^{3/2} \le \frac{101\Delta}{\alpha\beta T} + \frac{2727\,\mathbb{E}[\delta_1^{3/2}]}{\beta T} + 4545\beta^{1/2}\sigma^{3/2} + 2155\alpha^3\beta^3L^3 \le \frac{101\Delta}{\alpha\beta T} + \frac{2727\,\mathbb{E}[\delta_1^{3/2}]}{\beta T} + 4545\beta^{1/2}\sigma^{3/2} + \frac{3\beta^3L^{3/2}}{100}.$$

Lemma 5. Under the same setting as Lemma 4, to ensure that $\frac{1}{T}\sum_{t=1}^T \mathbb{E}[\delta_t^{3/2}] \le \epsilon^{3/2}$, we need $T = O(\epsilon^{-9/2})$ iterations.

Proof. From (19) and Lemma 4, we have
$$\sum_{t=1}^T \frac{59\beta}{120}\mathbb{E}[\delta_t^{3/2}] \le \mathbb{E}[\delta_1^{3/2}] + 2\beta^{3/2}\sigma^{3/2}T + \sum_{t=1}^T \mathbb{E}\left[\frac{\beta}{120}\|\nabla F(w_t)\|^{3/2}\right].$$
Setting $\beta = T^{-b}$ with $0 < b < 1$, we know that
$$\frac{1}{T}\sum_{t=1}^T \frac{59}{120}\mathbb{E}[\delta_t^{3/2}] \le \frac{\mathbb{E}[\delta_1^{3/2}]}{T^{1-b}} + \frac{2\sigma^{3/2}}{T^{b/2}} + \frac{1}{T}\sum_{t=1}^T \frac{1}{120}\mathbb{E}\|\nabla F(w_t)\|^{3/2}. \quad (21)$$
Take $b = 2/3$. From Lemma 4, we know that it takes $T = O(\epsilon^{-9/2})$ iterations to ensure that $\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\|^{3/2} \le \epsilon^{3/2}$.
In addition, from (21), we know that it takes $T = O(\epsilon^{-9/2})$ iterations to ensure that $\frac{1}{T}\sum_{t=1}^T \mathbb{E}[\delta_t^{3/2}] \le \epsilon^{3/2}$. Theorem 2 then follows by combining the results of Lemma 4 and Lemma 5. It is also evident that if $\beta = 1/T^s$ with $0 < s < 1$, then it takes $T = O(\mathrm{poly}(1/\epsilon))$ iterations to ensure that $\frac{1}{T}\sum_{t=1}^T \mathbb{E}[\delta_t^{3/2}] \le \epsilon^{3/2}$ and $\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\|^{3/2} \le \epsilon^{3/2}$ hold simultaneously.
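The one-dimensional minimization used in the proof of Lemma 2 (with $c = \|z_t\|^{3/2}$) can be verified directly; this check is ours, not part of the original proof:

```latex
% For c > 0, minimize h(x) = \frac{2c}{3x} + \frac{x^2}{3} over x > 0:
h'(x) = -\frac{2c}{3x^2} + \frac{2x}{3} = 0
  \;\Longrightarrow\; x^3 = c
  \;\Longrightarrow\; x_\star = c^{1/3},
\qquad
h(x_\star) = \frac{2c}{3c^{1/3}} + \frac{c^{2/3}}{3} = c^{2/3},
% so \min_{x>0}\big(\tfrac{2\|z_t\|^{3/2}}{3x} + \tfrac{x^2}{3}\big) = \|z_t\|,
% and plugging the suboptimal choice x = 8\alpha\beta L gives the
% upper bound used in the proof of Lemma 2.
```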

D PROOF OF THEOREM 3

Proof. Define $\gamma_t = \min\left(\frac{\beta^a}{\|z_t\|^{2/3}}, \frac{\beta^a}{\epsilon_0}\right)$ with $\epsilon_0 = 2\beta^a$. Then we know that $\eta_t = \alpha\gamma_t$ and $\gamma_t \le \frac12$. Note that $\alpha \le \frac1L$, so we have $\eta_t \le \frac{1}{2L}$. By the $L$-smoothness of $F$, we have
$$F(w_{t+1}) \le F(w_t) + \nabla F(w_t)^\top(w_{t+1}-w_t) + \frac{L}{2}\|w_{t+1}-w_t\|^2 \le F(w_t) - \eta_t\nabla F(w_t)^\top z_t + \left(\frac{\eta_t^2L}{2}+\frac{\gamma_t}{2L}\right)\|z_t\|^2 - \frac{\gamma_t}{2L}\|z_t\|^2$$
$$= F(w_t) - \eta_t\nabla F(w_t)^\top\left(z_t-\nabla F(w_t)+\nabla F(w_t)\right) + \left(\frac{\eta_t^2L}{2}+\frac{\gamma_t}{2L}\right)\|z_t\|^2 - \frac{\gamma_t}{2L}\|z_t\|^2$$
$$\overset{(a)}{\le} F(w_t) - \eta_t\nabla F(w_t)^\top\left(z_t-\nabla F(w_t)+\nabla F(w_t)\right) + \left(\eta_t^2L+\frac{\gamma_t}{L}\right)\left(\|z_t-\nabla F(w_t)\|^2+\|\nabla F(w_t)\|^2\right) - \frac{\gamma_t}{2L}\|z_t\|^2$$
$$\overset{(b)}{\le} F(w_t) - \frac{\eta_t}{2}\|\nabla F(w_t)\|^2 + \frac{\eta_t}{2}\|z_t-\nabla F(w_t)\|^2 + \left(\eta_t^2L+\frac{\gamma_t}{L}\right)\left(\|z_t-\nabla F(w_t)\|^2+\|\nabla F(w_t)\|^2\right) - \frac{\gamma_t}{2L}\|z_t\|^2$$
$$= F(w_t) - \left(\frac{\eta_t}{2}-\eta_t^2L-\frac{\gamma_t}{L}\right)\|\nabla F(w_t)\|^2 + \left(\eta_t^2L+\frac{\gamma_t}{L}+\frac{\eta_t}{2}\right)\|z_t-\nabla F(w_t)\|^2 - \frac{\gamma_t}{2L}\|z_t\|^2$$
$$\overset{(c)}{\le} F(w_t) - \frac{\gamma_t}{2L}\|z_t\|^2 + \frac1L\|z_t-\nabla F(w_t)\|^2, \quad (22)$$
where (a) holds since $\|z_t\|^2 \le 2\|z_t-\nabla F(w_t)\|^2 + 2\|\nabla F(w_t)\|^2$; (b) holds since $-\eta_t\nabla F(w_t)^\top(z_t-\nabla F(w_t)) \le \frac{\eta_t}{2}\|\nabla F(w_t)\|^2 + \frac{\eta_t}{2}\|z_t-\nabla F(w_t)\|^2$; and (c) holds due to $\frac{\eta_t}{2}-\eta_t^2L-\frac{\gamma_t}{L} \ge 0$ and $\eta_t^2L+\frac{\gamma_t}{L}+\frac{\eta_t}{2} \le \frac1L$, using $\eta_t \le \frac{1}{2L}$ and $\frac{\gamma_t}{L} \le \frac{1}{2L}$. By the definition of $\gamma_t$, we have
$$\gamma_t\|z_t\|^2 \ge \beta^{2a}\|z_t\|^{2/3}\min\left(\frac{\|z_t\|^{2/3}}{\beta^a}, \frac{\|z_t\|^{4/3}}{\beta^a\epsilon_0}\right) = \beta^{2a}\|z_t\|^{2/3}\min\left(\frac{\|z_t\|^{2/3}}{\beta^a}, \frac{\|z_t\|^{4/3}}{2\beta^{2a}}\right) \overset{(a)}{\ge} \beta^{2a}\|z_t\|^{2/3}\left(\frac{\|z_t\|^{2/3}}{\beta^a}-\frac12\right) = \beta^a\|z_t\|^{4/3} - \frac{\beta^{2a}\|z_t\|^{2/3}}{2}, \quad (23)$$
where (a) holds since $x \ge x-\frac12$ and $\frac{x^2}{2} \ge x-\frac12$ hold for any $x$, applied with $x = \frac{\|z_t\|^{2/3}}{\beta^a}$. Combining (22) and (23), we have
$$\beta^a\|z_t\|^{4/3} \le \gamma_t\|z_t\|^2 + \frac{\beta^{2a}\|z_t\|^{2/3}}{2} \le 2L\left(F(w_t)-F(w_{t+1})\right) + \frac{\beta^{2a}\|z_t\|^{2/3}}{2} + 2\|z_t-\nabla F(w_t)\|^2 = 2L\left(F(w_t)-F(w_{t+1})\right) + \beta^a\|z_t\|^{4/3}\cdot\frac{\beta^a}{2\|z_t\|^{2/3}} + 2\|z_t-\nabla F(w_t)\|^2.$$
If $\frac{\beta^a}{2\|z_t\|^{2/3}} \le \frac12$, we have
$$\beta^a\|z_t\|^{4/3} \le 4L\left(F(w_t)-F(w_{t+1})\right) + 4\|z_t-\nabla F(w_t)\|^2.$$
If $\frac{\beta^a}{2\|z_t\|^{2/3}} > \frac12$, then $\beta^a > \|z_t\|^{2/3}$, and hence we have $\beta^a\|z_t\|^{4/3} \le \beta^{3a}$. As a result, we have
$$\beta^a\|z_t\|^{4/3} \le 4L\left(F(w_t)-F(w_{t+1})\right) + 4\|z_t-\nabla F(w_t)\|^2 + \beta^{3a}. \quad (24)$$
Taking the summation on both sides of (24) over $t = 1, \dots$
, T yields T t=1 z t 4/3 ≤ 4L T t=1 F (w t ) -F (w t+1 ) β a + T t=1 4 β a z t -∇F (w t ) 2 + β 2a T. Define ∆ t = z t -∇F (w t ), then we have ∇F (w t ) 4/3 ≤ 2 z t 4/3 + 2 ∆ t 4/3 . (26) Hence, T t=1 z t 4/3 + ∇F (w t ) 4/3 (a) ≤ 2 T t=1 ∆ t 4/3 + 12L T t=1 F (w t ) -F (w t+1 ) β a + T t=1 12 β a z t -∇F (w t ) 2 + 3β 2a T (b) ≤ T t=1 4 3 ∆ t 2 β a + β 2a 2 + 12L T t=1 F (w t ) -F (w t+1 ) β a + T t=1 12 β a z t -∇F (w t ) 2 + 3β 2a T ≤ 12L T t=1 F (w t ) -F (w t+1 ) β a + T t=1 14 β a z t -∇F (w t ) 2 + 4β 2a T, where (a) holds due to ( 25) and ( 26), (b) holds because min x>0 c 2 x + x 2 2 = 3c 4/3 2 . By Lemma 1, we know that E δ 2 t+1 ≤ 1 - β 2 E δ 2 t + 2β 2 σ 2 + E CL 2 η 4 t z t 4 β 3 (a) ≤ 1 - β 2 E δ 2 t + 2β 2 σ 2 + E CL 2 α 4 β 4a z t 4 max( z t 8/3 , 4 0 )β 3 ≤ 1 - β 2 E δ 2 t + 2β 2 σ 2 + E CL 2 α 4 β 4a-3 z t 4/3 . Note that CL 2 α 4 ≤ 1/14, we have β 2 E δ 2 t ≤ E δ 2 t -δ 2 t+1 + 2β 2 σ 2 + E β 4a-3 z t 4/3 14 . ( ) Taking summation on both sides of (28) over t = 1, . . . , T , we have T t=1 E δ 2 t ≤ E δ 2 1 β + 2βσ 2 T + T t=1 E β 4a-4 z t 4/3 14 = E δ 2 1 β + 2βσ 2 T + T t=1 E β a z t 4/3 14 , where the last equality holds since a = 4/3. Taking expectation on both sides of ( 27) and combining (29), we have T t=1 E z t 4/3 + ∇F (w t ) 4/3 ≤ 12L∆ β a + 14E δ 2 1 β 1+a + 28βσ 2 T β a + T t=1 E z t 4/3 + 4β 2a T. As a result, we have 1 T T t=1 E ∇F (w t ) 4/3 ≤ 12L∆ β a T + 14E δ 2 1 β 1+a T + 28βσ 2 β a + 4β 2a . Suppose initial batch size is T 0 , the intermediate batch size is m, and a = 4/3, then we have 1 T T t=1 E ∇F (w t ) 4/3 ≤ 12L∆ β 4/3 T + 14σ 2 β 7/3 T 0 T + 28σ 2 β 1/3 m + 4β 8/3 . ( ) We can choose β = O( 1/2 ), T = O( -2 ), the initial batch size T 0 = 1/β = O( -1/2 ), the intermediate batch size as m = 1/β 3 = O( -3/2 ), which ends up with the total complexity O( -3.5 ).
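The adaptive step size used throughout this proof, $\eta_t = \alpha\gamma_t$ with $\gamma_t = \min\left(\frac{\beta^a}{\|z_t\|^{2/3}}, \frac{\beta^a}{\epsilon_0}\right)$ and $\epsilon_0 = 2\beta^a$, can be sketched in a few lines. The following is an illustrative sketch only (the function name and the small floor guarding against a zero norm are our own, not from an official implementation):

```python
import numpy as np

def adaptive_step_size(z, alpha, beta, a=4/3):
    """Step size eta_t = alpha * min(beta**a / ||z_t||**(2/3), beta**a / eps0)
    from the Theorem 3 analysis (p = 2/3 case), with eps0 = 2 * beta**a.
    Illustrative sketch; the tiny floor on ||z_t|| avoids division by zero."""
    eps0 = 2.0 * beta ** a
    z_norm = np.linalg.norm(z)
    gamma = min(beta ** a / max(z_norm, 1e-12) ** (2 / 3), beta ** a / eps0)
    return alpha * gamma
```

Note that $\frac{\beta^a}{\epsilon_0} = \frac{1}{2}$ by construction, which is exactly the cap $\gamma_t \le \frac{1}{2}$ invoked at the start of the proof, so the returned step size never exceeds $\alpha/2$.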

E A NEW VARIANT OF ADAM +

Theorem 4. Assume that $\|\nabla f(w;\xi)\| \le G$ almost surely for every $w \in \mathbb{R}^d$. Choose $\eta_t = \frac{\alpha\beta^a}{\max(\|z_t\|^{1/2}, \epsilon_0)}$ with $a = 4/3$. Then we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\|\nabla F(w_t)\|^{3/2} \le \frac{12L\Delta}{\beta^aT} + \frac{14\,\mathbb{E}\|\delta_1\|^2}{\beta^{1+a}T} + \frac{28\beta\sigma^2}{\beta^a} + 4\beta^{3a}.$$
Denoting the initial batch size and the intermediate batch size by $T_0$ and $m$ respectively, we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\|\nabla F(w_t)\|^{3/2} \le \frac{12L\Delta}{\beta^aT} + \frac{14\sigma^2}{T_0T\beta^{1+a}} + \frac{28\sigma^2}{\beta^{a-1}m} + 4\beta^{3a}.$$
To ensure that $\frac{1}{T}\sum_{t=1}^T\mathbb{E}\|\nabla F(w_t)\|^{3/2} \le \epsilon^{3/2}$, we choose $\beta = \epsilon^{3/8}$, $T = O(1/\epsilon^2)$, the initial batch size $T_0 = 1/\epsilon^{3/8}$, and $m = 1/\epsilon^{1.625}$; the total computational complexity is then $O(1/\epsilon^{3.625})$.

Proof. Define $\gamma_t = \min\left(\frac{\beta^a}{\|z_t\|^{1/2}}, \frac{\beta^a}{\epsilon_0}\right)$ with $\epsilon_0 = 2\beta^a$. Then $\eta_t = \alpha\gamma_t$ and $\gamma_t \le \frac{1}{2}$. Since $\alpha \le \frac{1}{L}$, we have $\eta_t \le \frac{1}{2L}$. By the $L$-smoothness of $F$,
\begin{align*}
F(w_{t+1}) &\le F(w_t) + \nabla F(w_t)^\top (w_{t+1}-w_t) + \tfrac{L}{2}\|w_{t+1}-w_t\|^2 \\
&\le F(w_t) - \eta_t \nabla F(w_t)^\top z_t + \left(\tfrac{\eta_t^2 L}{2} + \tfrac{\gamma_t}{2L}\right)\|z_t\|^2 - \tfrac{\gamma_t}{2L}\|z_t\|^2 \\
&= F(w_t) - \eta_t \nabla F(w_t)^\top \left(z_t - \nabla F(w_t) + \nabla F(w_t)\right) + \left(\tfrac{\eta_t^2 L}{2} + \tfrac{\gamma_t}{2L}\right)\|z_t\|^2 - \tfrac{\gamma_t}{2L}\|z_t\|^2 \\
&\overset{(a)}{\le} F(w_t) - \eta_t \nabla F(w_t)^\top \left(z_t - \nabla F(w_t) + \nabla F(w_t)\right) + \left(\eta_t^2 L + \tfrac{\gamma_t}{L}\right)\left(\|z_t-\nabla F(w_t)\|^2 + \|\nabla F(w_t)\|^2\right) - \tfrac{\gamma_t}{2L}\|z_t\|^2 \\
&\overset{(b)}{\le} F(w_t) - \tfrac{\eta_t}{2}\|\nabla F(w_t)\|^2 + \tfrac{\eta_t}{2}\|z_t-\nabla F(w_t)\|^2 + \left(\eta_t^2 L + \tfrac{\gamma_t}{L}\right)\left(\|z_t-\nabla F(w_t)\|^2 + \|\nabla F(w_t)\|^2\right) - \tfrac{\gamma_t}{2L}\|z_t\|^2 \\
&= F(w_t) - \left(\tfrac{\eta_t}{2} - \eta_t^2 L - \tfrac{\gamma_t}{L}\right)\|\nabla F(w_t)\|^2 + \left(\eta_t^2 L + \tfrac{\gamma_t}{L} + \tfrac{\eta_t}{2}\right)\|z_t-\nabla F(w_t)\|^2 - \tfrac{\gamma_t}{2L}\|z_t\|^2 \\
&\overset{(c)}{\le} F(w_t) - \tfrac{\gamma_t}{2L}\|z_t\|^2 + \tfrac{1}{L}\|z_t-\nabla F(w_t)\|^2, \tag{31}
\end{align*}
where (a) holds since $\|z_t\|^2 \le 2\|z_t-\nabla F(w_t)\|^2 + 2\|\nabla F(w_t)\|^2$; (b) holds since $-\nabla F(w_t)^\top z_t \le -\frac{1}{2}\|\nabla F(w_t)\|^2 + \frac{1}{2}\|z_t-\nabla F(w_t)\|^2$; (c) holds since $\frac{\eta_t}{2} - \eta_t^2 L - \frac{\gamma_t}{L} \ge 0$ and $\eta_t^2 L + \frac{\gamma_t}{L} + \frac{\eta_t}{2} \le \frac{1}{L}$ (using $\eta_t \le \frac{1}{2L}$ and $\gamma_t \le \frac{1}{2}$).

By the definition of $\gamma_t$, we have
$$\gamma_t\|z_t\|^2 \ge \beta^{2a}\|z_t\|\min\left(\frac{\|z_t\|^{1/2}}{\beta^a}, \frac{\|z_t\|}{\beta^a\epsilon_0}\right) = \beta^{2a}\|z_t\|\min\left(\frac{\|z_t\|^{1/2}}{\beta^a}, \frac{\|z_t\|}{2\beta^{2a}}\right) \overset{(a)}{\ge} \beta^{2a}\|z_t\|\left(\frac{\|z_t\|^{1/2}}{\beta^a} - \frac{1}{2}\right) = \beta^a\|z_t\|^{3/2} - \frac{\beta^{2a}\|z_t\|}{2}, \tag{32}$$
where (a) holds since $x \ge x - \frac{1}{2}$ and $\frac{x^2}{2} \ge x - \frac{1}{2}$ hold for any $x$, applied with $x = \frac{\|z_t\|^{1/2}}{\beta^a}$.

Combining (31) and (32), we have
$$\beta^a\|z_t\|^{3/2} \le \gamma_t\|z_t\|^2 + \frac{\beta^{2a}\|z_t\|}{2} \le 2L\left(F(w_t)-F(w_{t+1})\right) + \frac{\beta^{2a}\|z_t\|}{2} + 2\|z_t-\nabla F(w_t)\|^2 = 2L\left(F(w_t)-F(w_{t+1})\right) + \beta^a\|z_t\|^{3/2}\cdot\frac{\beta^a}{2\|z_t\|^{1/2}} + 2\|z_t-\nabla F(w_t)\|^2.$$
If $\frac{\beta^a}{2\|z_t\|^{1/2}} \le \frac{1}{2}$, we have $\beta^a\|z_t\|^{3/2} \le 4L(F(w_t)-F(w_{t+1})) + 4\|z_t-\nabla F(w_t)\|^2$. If $\frac{\beta^a}{2\|z_t\|^{1/2}} > \frac{1}{2}$, then $\beta^a > \|z_t\|^{1/2}$, and hence $\beta^a\|z_t\|^{3/2} \le \beta^{4a}$. As a result, we have
$$\beta^a\|z_t\|^{3/2} \le 4L\left(F(w_t)-F(w_{t+1})\right) + 4\|z_t-\nabla F(w_t)\|^2 + \beta^{4a}. \tag{33}$$
Taking summation on both sides of (33) over $t = 1, \dots, T$ yields
$$\sum_{t=1}^T \|z_t\|^{3/2} \le 4L\sum_{t=1}^T\frac{F(w_t)-F(w_{t+1})}{\beta^a} + \sum_{t=1}^T\frac{4}{\beta^a}\|z_t-\nabla F(w_t)\|^2 + \beta^{3a}T. \tag{34}$$
Define $\Delta_t = z_t - \nabla F(w_t)$; then we have
$$\|\nabla F(w_t)\|^{3/2} \le 2\|z_t\|^{3/2} + 2\|\Delta_t\|^{3/2}. \tag{35}$$
Hence,
\begin{align*}
\sum_{t=1}^T\left(\|z_t\|^{3/2} + \|\nabla F(w_t)\|^{3/2}\right) &\overset{(a)}{\le} 2\sum_{t=1}^T\|\Delta_t\|^{3/2} + 12L\sum_{t=1}^T\frac{F(w_t)-F(w_{t+1})}{\beta^a} + \sum_{t=1}^T\frac{12}{\beta^a}\|z_t-\nabla F(w_t)\|^2 + 3\beta^{3a}T \\
&\overset{(b)}{\le} \sum_{t=1}^T\frac{3}{2}\left(\frac{\|\Delta_t\|^2}{\beta^a} + \frac{\beta^{3a}}{3}\right) + 12L\sum_{t=1}^T\frac{F(w_t)-F(w_{t+1})}{\beta^a} + \sum_{t=1}^T\frac{12}{\beta^a}\|z_t-\nabla F(w_t)\|^2 + 3\beta^{3a}T \\
&\le 12L\sum_{t=1}^T\frac{F(w_t)-F(w_{t+1})}{\beta^a} + \sum_{t=1}^T\frac{14}{\beta^a}\|z_t-\nabla F(w_t)\|^2 + 4\beta^{3a}T, \tag{36}
\end{align*}
where (a) holds due to (34) and (35), and (b) holds because $\min_{x>0}\left(\frac{c^2}{x} + \frac{x^3}{3}\right) = \frac{4c^{3/2}}{3}$.

By Lemma 1, we know that
\begin{align*}
\mathbb{E}\|\delta_{t+1}\|^2 &\le \left(1-\frac{\beta}{2}\right)\mathbb{E}\|\delta_t\|^2 + 2\beta^2\sigma^2 + \mathbb{E}\frac{CL_H^2\eta_t^4\|z_t\|^4}{\beta^3} \\
&\overset{(a)}{\le} \left(1-\frac{\beta}{2}\right)\mathbb{E}\|\delta_t\|^2 + 2\beta^2\sigma^2 + \mathbb{E}\frac{CL_H^2\alpha^4\beta^{4a}\|z_t\|^4}{\max(\|z_t\|^2, \epsilon_0^4)\beta^3} \\
&\le \left(1-\frac{\beta}{2}\right)\mathbb{E}\|\delta_t\|^2 + 2\beta^2\sigma^2 + \mathbb{E}\, CL_H^2\alpha^4\beta^{4a-3}\|z_t\|^2.
\end{align*}
Noting that $CL_H^2\alpha^4 \le \frac{1}{14G^{1/2}}$, we have
$$\frac{\beta}{2}\mathbb{E}\|\delta_t\|^2 \le \mathbb{E}\left[\|\delta_t\|^2 - \|\delta_{t+1}\|^2\right] + 2\beta^2\sigma^2 + \frac{\mathbb{E}\,\beta^{4a-3}\|z_t\|^2}{14G^{1/2}}. \tag{37}$$
Taking summation on both sides of (37) over $t = 1, \dots, T$, we have
$$\sum_{t=1}^T \mathbb{E}\|\delta_t\|^2 \le \frac{\mathbb{E}\|\delta_1\|^2}{\beta} + 2\beta\sigma^2 T + \sum_{t=1}^T \frac{\mathbb{E}\,\beta^{4a-4}\|z_t\|^2}{14G^{1/2}} = \frac{\mathbb{E}\|\delta_1\|^2}{\beta} + 2\beta\sigma^2 T + \sum_{t=1}^T \frac{\mathbb{E}\,\beta^{a}\|z_t\|^2}{14G^{1/2}} \le \frac{\mathbb{E}\|\delta_1\|^2}{\beta} + 2\beta\sigma^2 T + \sum_{t=1}^T \frac{\mathbb{E}\,\beta^{a}\|z_t\|^{3/2}}{14}, \tag{38}$$
where the equality holds since $a = 4/3$ and the last inequality holds since $\|z_t\| \le G$. Taking expectation on both sides of (36) and combining with (38), we have
$$\sum_{t=1}^T\mathbb{E}\left(\|z_t\|^{3/2}+\|\nabla F(w_t)\|^{3/2}\right) \le \frac{12L(F(w_1)-F_*)}{\beta^a} + \frac{14\,\mathbb{E}\|\delta_1\|^2}{\beta^{1+a}} + \frac{28\beta\sigma^2T}{\beta^a} + \sum_{t=1}^T\mathbb{E}\|z_t\|^{3/2} + 4\beta^{3a}T.$$
As a result, we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\|\nabla F(w_t)\|^{3/2} \le \frac{12L(F(w_1)-F_*)}{\beta^aT} + \frac{14\,\mathbb{E}\|\delta_1\|^2}{\beta^{1+a}T} + \frac{28\beta\sigma^2}{\beta^a} + 4\beta^{3a}.$$
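As a sanity check on the parameter choices in Theorem 4, the following sketch verifies the exponent arithmetic with exact rational numbers: treating $\sigma$, $L$, and $\Delta$ as constants and writing each quantity as a power of $\epsilon$ (with $\beta = \epsilon^{3/8}$, $T = \epsilon^{-2}$, $T_0 = \epsilon^{-3/8}$, $m = \epsilon^{-13/8}$), every error term scales as $\epsilon^{3/2}$ and the total complexity $T \cdot m$ scales as $\epsilon^{-3.625}$. The variable names are ours, chosen to mirror the theorem statement.

```python
from fractions import Fraction as Frac

a = Frac(4, 3)
beta = Frac(3, 8)    # exponent of eps in beta = eps**(3/8)
T = Frac(-2)         # exponent of eps in T = eps**(-2)
T0 = Frac(-3, 8)     # exponent of eps in the initial batch size T0
m = Frac(-13, 8)     # exponent of eps in the intermediate batch size m

# Exponent of eps in each term of the Theorem 4 bound:
term1 = -a * beta - T              # 12*L*Delta / (beta**a * T)
term2 = -(1 + a) * beta - T0 - T   # 14*sigma**2 / (T0 * T * beta**(1+a))
term3 = -(a - 1) * beta - m        # 28*sigma**2 / (beta**(a-1) * m)
term4 = 3 * a * beta               # 4 * beta**(3*a)

total_complexity = T + m           # exponent of eps in T * m
```

Running this confirms that all four terms equal $\epsilon^{3/2}$ and that `total_complexity` equals $-29/8 = -3.625$, matching the stated $O(1/\epsilon^{3.625})$ complexity.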

F RELATED WORK

Adaptive Gradient Methods Adaptive gradient methods were first proposed in the framework of online convex optimization (Duchi et al., 2011; McMahan & Streeter, 2010); they dynamically incorporate knowledge of the geometry of the data to perform more informative gradient-based learning. This type of algorithm was proved to converge fast when stochastic gradients are sparse (Duchi et al., 2011). Based on this idea, several other adaptive algorithms were proposed to train deep neural networks, including Adam (Kingma & Ba, 2014), Amsgrad (Reddi et al., 2019), and RMSprop (Tieleman & Hinton, 2012). Many works analyze variants of adaptive gradient methods in both the convex and nonconvex cases (Chen et al., 2018a;b; 2019; Luo et al., 2019; Ward et al., 2019; Li & Orabona, 2019). Notably, all of these works establish a faster convergence rate than SGD only under the assumption that stochastic gradients are sparse. However, this assumption may not hold in deep learning. In contrast, our algorithm can converge faster than SGD even when stochastic gradients are not sparse, since its data-dependent adaptive complexity does not rely on the sparsity of stochastic gradients. Variance Reduction Methods Variance reduction is a technique for achieving fast rates in finite-sum and stochastic optimization problems. It was first proposed for finite-sum convex optimization (Johnson & Zhang, 2013) and was later extended to finite-sum nonconvex (Allen-Zhu & Hazan, 2016; Reddi et al., 2016; Zhou et al., 2018) and stochastic nonconvex (Lei et al., 2017; Fang et al., 2018; Wang et al., 2019; Pham et al., 2020; Cutkosky & Orabona, 2019) optimization. To prove a faster convergence rate than SGD, all of these works assume that the objective function is an average of individual functions, each of which is smooth.
In contrast, our analysis does not require such an assumption to achieve a faster-than-SGD rate. Other Related Work Arjevani et al. (2019) show that SGD is optimal for stochastic nonconvex smooth optimization if one does not assume that every component function is smooth. Recent works establish faster rates than SGD when the Hessian of the objective function is Lipschitz (Fang et al., 2019; Cutkosky & Mehta, 2020). Several empirical papers, including LARS (You et al., 2017) and LAMB (You et al., 2019), utilize both moving averages and normalization for training deep neural networks with large batch sizes. Zhang et al. (2020) consider an algorithm for finding a stationary point of nonconvex nonsmooth problems. Levy (2017) considers the convex optimization setting and designs algorithms that adapt to the smoothness parameter. Liu et al. (2019) introduced Rectified Adam to alleviate the large variance at the early stage of training. However, none of these works establish data-dependent adaptive complexity as in our paper.

G ADAM + WITH FIXED LEARNING RATE

We report results on image classification with CIFAR10 on ResNet18. We further considered Adam + with a fixed learning rate of 0.1 and without any learning rate annealing scheme. We report the result in Figure 6, in which "Adam+ Fixed Stepsize" uses the default setting of Algorithm 1. As we can see from the

H GROWTH RATE ANALYSIS OF $\sum_{i=1}^{t}\|z_i\|$

We provide the log-log plot ($\log(\sum_{i=1}^{t}\|z_i\|)$ versus $\log(t)$) for the ResNet18 training on CIFAR10 experiment, as illustrated in Figure 7. We can see that the slope is around 0.8. Although this slope is not small, Figure 5 shows that the slope becomes almost zero after iteration $6 \times 10^4$ (which corresponds to epoch 154). At this particular epoch the training and test accuracy are not yet the best, so we need to continue training until epoch 350. Our algorithm Adam + is then able to take advantage of the slow growth rate of $\sum_{i=1}^{t}\|z_i\|$ for large $t$ and enjoys faster convergence, which is consistent with Figure 1.
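The slope of such a log-log plot can be estimated with a simple least-squares fit. The following sketch uses synthetic values standing in for the recorded cumulative norms $t \mapsto \sum_{i=1}^{t}\|z_i\|$ (the data here is illustrative, not from the actual experiment); grown as $t^{0.8}$, it recovers the slope of about 0.8 discussed above:

```python
import numpy as np

# Synthetic stand-in for the recorded cumulative norms sum_{i<=t} ||z_i||;
# here they grow like t**0.8, so the fitted log-log slope should be ~0.8.
t = np.arange(1, 10001)
cumulative = t ** 0.8

# Least-squares line: log(cumulative) = slope * log(t) + intercept
slope, intercept = np.polyfit(np.log(t), np.log(cumulative), 1)
```

In practice one would replace `cumulative` with the values logged during training; a slope well below 1 for large $t$ is what the adaptive analysis exploits.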



Remark: Note that the above theorem establishes the fast convergence rate for Nadam + with p = 2/3. Indeed, we can also establish a fast rate for Adam + (where p = 1/2) of order $O(1/\epsilon^{3.625})$, with details provided in Appendix E.

3. EXPERIMENTS

In this section, we conduct empirical studies to verify the effectiveness of the proposed algorithm on three different tasks: image classification on the CIFAR10 and CIFAR100 datasets (Krizhevsky et al.,



Figure 1: Comparison of optimization methods for ResNet18 Training on CIFAR10.

Figure 2: Comparison of optimization methods for VGG19 training on CIFAR100.

Figure 4: Comparison of optimization methods for six-layer LSTM training on SWB-300.


Figure 6: Comparison of Adam + and Adam + with Fixed Stepsize

Figure 7: $\log(\sum_{i=1}^{t}\|z_i\|)$ versus $\log(t)$

Summary of setups in the experiments.

From the figure, we have the following observations. First, in terms of training perplexity, our algorithm achieves performance comparable to SGD and momentum SGD, outperforms Adagrad and NIGT, and is worse than Adam. Second, in terms of test perplexity, our algorithm outperforms Adam, Adagrad, NIGT, and momentum SGD, and is comparable to SGD. An interesting observation is that Adam does not generalize well even though it converges fast in terms of training error, which is consistent with the observations in (Wilson et al., 2017).

