PROVABLE BENEFIT OF ADAPTIVITY IN ADAM

Abstract

Adaptive Moment Estimation (Adam) has been observed to converge faster than stochastic gradient descent (SGD) in practice. However, this advantage has not been theoretically characterized: the existing convergence rates of Adam are no better than those of SGD. We attribute this mismatch between theory and practice to a commonly used assumption that the smoothness is globally upper bounded by some constant $L$ (the $L$-smooth condition). Specifically, compared to SGD, Adam adaptively chooses a learning rate better suited to the local smoothness. This advantage becomes prominent when the local smoothness varies drastically across the domain, which, however, is hidden under the $L$-smooth condition. In this paper, we analyze the convergence of Adam under the $(L_0, L_1)$-smooth condition, which allows the local smoothness to grow with the gradient norm. This condition has been empirically verified to be more realistic for deep neural networks (Zhang et al., 2019a) than the $L$-smooth condition. Under the $(L_0, L_1)$-smooth condition, we establish the convergence of Adam with practical hyperparameters. As such, we argue that Adam can adapt to the local smoothness, justifying Adam's benefit of adaptivity. In contrast, SGD can be arbitrarily slow under this condition. Hence, we theoretically characterize the advantage of Adam over SGD.

Under review as a conference paper at ICLR 2023

We also provide counter-examples where SGD diverges under the same setting as in Theorem 1. Therefore, our results shed new light on why Adam outperforms SGD in deep learning tasks. The rest of this paper is organized as follows: Section 2 summarizes the related works; Section 3 introduces notations and assumptions; Section 4 presents our convergence result for Adam along with the proof ideas.

Convergence analysis for Adam. When Adam was first proposed in (Kingma & Ba, 2015), the authors provided a convergence proof. However, Reddi et al.
(2019) point out that the proof in (Kingma & Ba, 2015) has flaws, and further provide simple counterexamples on which Adam diverges. Ever since then, there have been many attempts to modify the update rules of Adam to ensure convergence, e.g., AMSGrad (Reddi et al., 2019) and AdaBound (Luo et al., 2019). Due to limited space, we introduce more details of these Adam-variants in Appendix A.

1. INTRODUCTION

Machine learning tasks are often formulated as solving the following finite-sum problem:

$$\min_{w\in\mathbb{R}^d} f(w) = \sum_{i=0}^{n-1} f_i(w), \qquad (1)$$

where $\{f_i(w)\}_{i=0}^{n-1}$ is lower bounded, $n$ denotes the number of samples or mini-batches, and $w$ denotes the trainable parameters. Among various gradient-based optimizers, Stochastic Gradient Descent (SGD) is a simple and popular method for solving Eq. (1). However, adaptive gradient methods, including Adaptive Moment Estimation (Adam) (Kingma & Ba, 2014), have recently been observed to outperform SGD on modern deep learning tasks, including GANs (Brock et al., 2018), BERTs (Kenton & Toutanova, 2019), GPTs (Brown et al., 2020), and ViTs (Dosovitskiy et al., 2020). For instance, as reported in Figure 1(a), SGD converges much slower than Adam during the training of Transformers. Similar phenomena are also reported in BERT training (Zhang et al., 2019b). The empirical success of Adam comes from its special update rules. First, it uses the heavy-ball momentum mechanism controlled by a hyperparameter $\beta_1$. Second, it uses an adaptive learning rate strategy: the learning rate of Adam contains an exponential moving average of past squared gradients, weighted by $\beta_2$. Larger $\beta_1$ and $\beta_2$ bring more gradient information from historical steps into the update. The update rule of Adam is given in Eq. (2) (presented later in Section 3). Despite its practical success, the theoretical understanding of Adam is limited. For instance, the existing convergence rates of Adam are no better than that of SGD (Zhang et al., 2022; Shi et al., 2021; Défossez et al., 2020; Zou et al., 2019; De et al., 2018; Guo et al., 2021). As such, there is a mismatch between Adam's superior empirical performance and its theoretical understanding. To close the gap between theory and practice, we revisit the existing analyses of Adam.
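The finite-sum setup of Eq. (1) can be made concrete with a minimal sketch. The quadratic components below are our own hypothetical toy example (not from the paper); the loop is plain random-shuffling SGD with a diminishing learning rate, the baseline the paper compares against:

```python
import math
import random

# Toy instance of the finite-sum problem (1): f(w) = sum_i f_i(w) with
# hypothetical quadratic components f_i(w) = (w - c_i)^2 / 2.
CENTERS = [1.0, -2.0, 0.5, 3.0]
w_star = sum(CENTERS) / len(CENTERS)   # minimizer of f

def grad_fi(w, i):
    return w - CENTERS[i]              # gradient of the i-th component

def sgd(w, eta1=0.1, epochs=300, seed=0):
    """Random-shuffling SGD with diminishing learning rate eta1 / sqrt(k)."""
    rng = random.Random(seed)
    n = len(CENTERS)
    for k in range(1, epochs + 1):
        order = list(range(n))
        rng.shuffle(order)             # one random permutation per epoch
        for i in order:
            w -= (eta1 / math.sqrt(k)) * grad_fi(w, i)
    return w

w_sgd = sgd(0.0)
```

On this well-conditioned toy problem SGD lands near the minimizer; the point of the paper is that this picture breaks down once the local smoothness varies drastically.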
Current setups for the analysis of Adam fail to model real-world applications: all existing convergence analyses of Adam are based on the $L$-smooth condition, i.e., the Lipschitz coefficient of the gradient is globally bounded. However, it has recently been observed that the $L$-smooth condition does not hold in many deep learning tasks (Zhang et al., 2019a; Crawshaw et al., 2022). This gap in setting can obscure Adam's superiority: different local Lipschitz coefficients of the gradient (local smoothness) may require different optimal learning rates for convergence, yet the learning rate of SGD is ignorant of the local smoothness along the training trajectory. If the local smoothness does not change sharply along the trajectory, the optimal learning rate does not change much either, and SGD can work well by selecting a learning rate through grid search. However, when the local smoothness varies dramatically, a learning rate that fits some points well may fit other points along the trajectory arbitrarily badly, which indicates that SGD may converge arbitrarily slowly (detailed discussions can be found in [Theorem 4, (Zhang et al., 2019a)] and Section 4.3 of this paper). In contrast, Adam adapts its update according to local information and does not suffer from this issue. Following the above methodology, we analyze the convergence of Adam under the $(L_0, L_1)$-smooth condition, which assumes the local smoothness (the spectral norm of the local Hessian) to be upper bounded by $L_1\cdot(\text{local gradient norm}) + L_0$ (Assumption 1). As the gradient norm can be unbounded, the $(L_0, L_1)$-smooth condition allows the local smoothness to grow rapidly with the gradient norm. Moreover, it has been demonstrated by (Zhang et al., 2019a; 2020a; Crawshaw et al., 2022) that, for many practical neural networks, the $(L_0, L_1)$-smooth condition more closely characterizes the optimization landscape along the optimization trajectories (as illustrated by Figure 1(b)).
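The contrast between the two conditions can be seen on a one-dimensional toy function. The function $f(x) = x^4/4$ below is our own hypothetical example (not the paper's): its second derivative $3x^2$ is globally unbounded, so no finite $L$-smooth constant exists, yet $3x^2 \le L_0 + L_1|f'(x)|$ holds with $L_0 = L_1 = 3$ since $f'(x) = x^3$:

```python
# Numeric sanity check: f(x) = x^4 / 4 (a hypothetical toy function) has
# unbounded local smoothness f''(x) = 3x^2, so the global L-smooth constant
# grows with the domain, while the (L0, L1)-style bound with L0 = L1 = 3
# holds everywhere on the grid.
def second_derivative(x):
    return 3.0 * x * x

def grad(x):
    return x ** 3

L0, L1 = 3.0, 3.0
xs = [i / 10.0 for i in range(-100, 101)]               # grid over [-10, 10]
l_smooth_bound = max(second_derivative(x) for x in xs)  # grows with the grid
ok = all(second_derivative(x) <= L0 + L1 * abs(grad(x)) for x in xs)
```

Enlarging the grid makes `l_smooth_bound` grow without bound while `ok` stays true; this gap between the two conditions is exactly what the analysis in this paper exploits.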
Under the $(L_0, L_1)$-smooth condition, we successfully establish the convergence of Adam. In contrast, under the same assumption, SGD is proved to be arbitrarily slow in (Zhang et al., 2019a), or can even diverge (see Section 4.3). Therefore, our theory demonstrates that Adam can provably converge faster than SGD. The main contributions of this paper are summarized as follows. We derive the first convergence result for Adam under the more realistic $(L_0, L_1)$-smooth condition (Theorem 1). First, our convergence result is established under the mildest assumptions so far:

• The $(L_0, L_1)$-smooth condition is strictly weaker than the $L$-smooth condition. More importantly, the $(L_0, L_1)$-smooth condition is observed to hold in practical deep learning tasks. Relaxing the smoothness condition is important for characterizing the advantage of Adam over SGD.

• Our result does not require the bounded-gradient assumption (i.e., $\|\nabla f(x)\|\le C$). Removing this condition is necessary, as otherwise the $(L_0, L_1)$-smooth condition degenerates to the $L$-smooth condition. Our result does not need other strong assumptions like a bounded adaptor or large $\varepsilon$ (see Eq. (2)), either.

Furthermore, the conclusion of our convergence result is among the strongest:

• Our convergence result holds for every possible trajectory. This is much stronger than the common "convergence in expectation" results and is technically challenging.

• In our convergence result, the setting of the hyperparameters $(\beta_1, \beta_2)$ is close to practice. Specifically, our result holds for any $\beta_1$ and any $\beta_2$ close to 1, which matches the practical settings (for example, 0.9 and 0.999 in deep learning libraries).

On the other hand, vanilla Adam works well in practice, and divergence is not reported except for carefully constructed examples. This empirical phenomenon motivates researchers to rethink the counterexamples. The counterexamples state that "for every $\beta_1<\sqrt{\beta_2}$, there exists a problem on which Adam diverges".
That is to say, the divergence statement requires picking $(\beta_1, \beta_2)$ before fixing the problem, while in practice the algorithmic hyperparameters are often picked according to the problem. Based on this observation, a recent work (Zhang et al., 2022) proves that Adam can converge with $(\beta_1, \beta_2)$ picked after the problem is given. Similar analyses of vanilla Adam and RMSProp under various conditions are given in (Zaheer et al., 2018b; Zou et al., 2019; Défossez et al., 2020; De et al., 2018; Guo et al., 2021; Shi et al., 2021). We list and compare these results in Table 1 for convenience. In summary, we emphasize that all the above works (including those for Adam-variants) require the $L$-smooth condition. In addition, most of them require stronger assumptions such as bounded gradients or large $\varepsilon$.

Table 1: Comparison of existing convergence results for Adam. All the listed works require the $L$-smooth condition, which is strictly stronger than the $(L_0, L_1)$-smooth condition. We explain the superscripts: (a) when $\beta_1=0$, Adam reduces to RMSProp (Hinton et al., 2012), whose analysis is essentially simpler due to the lack of the momentum term; (b) $\varepsilon$ is the stability hyperparameter in Adam (see Eq. (2)). In practice, $\varepsilon$ is often a small number such as $10^{-8}$; in theory, $\varepsilon$ should be allowed to be arbitrarily small, including 0.

In this work, we focus on the convergence of vanilla Adam. In particular, our result is the first to relax the $L$-smooth condition for Adam. Moreover, our convergence analysis of Adam is proved under the mildest assumptions so far and with the strongest conclusions, and it can be easily extended to other Adam-variants as well.

Generalized smoothness assumption. There are several attempts at relaxing the $L$-smooth condition. Zhang et al. (2019a) propose the $(L_0, L_1)$-smooth condition to theoretically explain the acceleration effect of clipped SGD over SGD.
Similar results have been extended to clipped SGD with momentum (Zhang et al., 2020a) and differentially-private SGD (DP-SGD) (Yang et al., 2022). Through extensive experiments, these works empirically show that the $(L_0, L_1)$-smooth condition holds in settings where Adam outperforms SGD. However, they did not theoretically analyze Adam in this setting.

Theoretical comparison between adaptive gradient methods and SGD. The comparison between adaptive gradient methods and SGD is a popular topic, and there are several theoretical works from different perspectives. Ward et al. (2020); Défossez et al. (2020); Faw et al. (2022) show that AdaGrad can converge with any constant learning rate, and argue that being "tuning-free" is one advantage of AdaGrad over SGD. This line of work differs from ours in three aspects. First, they analyze AdaGrad, while we focus on Adam. Second, they require the $L$-smooth condition, while we do not. Third, their convergence rate is no better than that of SGD, while we show that Adam can converge faster. Xie et al. (2022) study the escape rate from saddle points of continuous-dynamics approximations of Adam and SGD, and show that the adaptive learning rate helps escape saddle points efficiently. This work is orthogonal to ours: they compare the behavior of Adam and SGD after a certain stationary point is reached, while we compare the iteration complexity of approaching stationary points. Zhang et al. (2020b) prove that the (adaptive) clipped gradient method converges faster than SGD when the stochastic noise is heavy-tailed, and argue that heavy-tailed noise is one cause of SGD's poor performance. However, a recent work (Chen et al., 2021b) shows that (S)GD still performs poorly even when the stochastic noise is removed from the training process (by using the full-batch gradient), which suggests that the effect of the stochastic noise may not be crucial.
There are also recent works (Zhou et al., 2020; Zou et al., 2021) trying to compare the generalization behaviors of adaptive gradient methods and SGD. These works are orthogonal to the topic of this work. More detailed discussion is deferred to Appendix A.

3. PRELIMINARIES

This section introduces the notations, definitions, and assumptions used throughout this work.

Notations. We list the notations used in the formal definition of randomly-shuffled Adam and its convergence analysis.

• (Vector) We define $a\odot b$ as the Hadamard product (i.e., component-wise product) between two vectors $a$ and $b$ of the same dimension. We also define $\langle a,b\rangle$ as the $\ell_2$ inner product between $a$ and $b$, and $\|a\|_p$ as the $\ell_p$ norm of $a$ (specifically, we abbreviate the $\ell_2$ norm of $a$ as $\|a\|$). We define $\mathbf{1}_d$ as the all-one vector of dimension $d$.

• (Derivative) For a function $f(w):\mathbb{R}^d\to\mathbb{R}$, we define $\nabla f(w)$ as the gradient of $f$ at point $w$, and $\partial_l f(w)$ as the $l$-th partial derivative of $f$ at point $w$, i.e., $\partial_l f(w) = (\nabla f(w))_l$.

• (Array) We define $[m_1, m_2] \triangleq \{m_1,\cdots,m_2\}$ for all $m_1, m_2\in\mathbb{N}$ with $m_1\le m_2$. Specifically, we use $[m]\triangleq\{1,\cdots,m\}$.

Formal Definition of Adam. Based on the $n$-sum problem Eq. (1), we give the update rule of Adam as follows. We initialize $w_{1,0}$, $m_{1,-1}$, and $\nu_{1,-1}$ as any point in $\mathbb{R}^d$. At the beginning of every outer loop $k\in\mathbb{N}^+$, we sample $\{\tau_{k,0},\cdots,\tau_{k,n-1}\}$ as a random permutation of $\{0,1,\cdots,n-1\}$. Then, in each inner loop $i\in[0,n-1]$, we respectively calculate the first-order momentum $m_{k,i}$, the second-order momentum $\nu_{k,i}$, and the parameter $w_{k,i+1}$ as

$$m_{k,i} = \beta_1 m_{k,i-1} + (1-\beta_1)\nabla f_{\tau_{k,i}}(w_{k,i}),$$
$$\nu_{k,i} = \beta_2 \nu_{k,i-1} + (1-\beta_2)\nabla f_{\tau_{k,i}}(w_{k,i})\odot\nabla f_{\tau_{k,i}}(w_{k,i}),$$
$$w_{k,i+1} = w_{k,i} - \frac{\eta_k}{\sqrt{\nu_{k,i}}+\varepsilon\mathbf{1}_d}\odot m_{k,i}. \qquad (2)$$

At the end of the outer loop $k$, Adam sets $w_{k+1,0}=w_{k,n}$, $\nu_{k+1,-1}=\nu_{k,n-1}$, and $m_{k+1,-1}=m_{k,n-1}$ in preparation for the next outer loop. Here $m_{k,i}$ and $\nu_{k,i}$ are weighted averages with hyperparameters $\beta_1\in[0,1)$ and $\beta_2\in[0,1)$, respectively. We choose $\eta_k$ to be the diminishing learning rate $\eta_k=\eta_1/\sqrt{k}$. In practice, $\varepsilon$ is adopted for numerical stability and is often chosen to be $10^{-8}$. In our theory, we allow $\varepsilon$ to be an arbitrary non-negative constant, including 0. We further emphasize that our analysis holds for any random permutation order used to generate $\tau_{k,i}$. In deep learning libraries, vanilla Adam is implemented in this random-shuffling fashion, and it is the default setting for computer vision, NLP, and generative models.

We make two mild assumptions on the objective function (Eq. (1)).

Assumption 1 ($(L_0, L_1)$-smooth condition). $f_i(w)$ satisfies the $(L_0, L_1)$-smooth condition, i.e., there exist positive constants $(L_0, L_1)$ such that, $\forall w_1, w_2\in\mathbb{R}^d$ satisfying $\|w_1-w_2\|\le\frac{1}{L_1}$,

$$\|\nabla f_i(w_1)-\nabla f_i(w_2)\|\le(L_0+L_1\|\nabla f_i(w_1)\|)\|w_1-w_2\|. \qquad (3)$$

Eq. (3) is first introduced by Zhang et al. (2019a).
It is strictly weaker than the classical $L$-smooth condition (i.e., $L_1=0$ in Assumption 1; see [Remark 2.3, (Zhang et al., 2020a)] for details). For instance, Eq. (3) holds for a wide range of polynomials and even exponential functions, while the $L$-smooth condition does not hold for polynomials with degree larger than 2. More importantly, the empirical observations in (Zhang et al., 2019a; 2020a) demonstrate that Eq. (3) is a better characterization of the loss landscape of neural networks than the $L$-smooth condition, especially in the tasks where Adam significantly outperforms SGD.

Assumption 2 (Affine Noise Variance). $\forall w\in\mathbb{R}^d$, the gradients of $\{f_i(w)\}_{i=0}^{n-1}$ have the following connection with the gradient of $f(w)$:

$$\sum_{i=0}^{n-1}\|\nabla f_i(w)\|^2\le D_1\|\nabla f(w)\|^2+D_0.$$

Assumption 2 is quite general: it covers the "bounded variance" assumption (Ghadimi et al., 2016; Zaheer et al., 2018a; Huang et al., 2021) and the "strong growth condition" in the existing literature (Schmidt & Roux, 2013; Vaswani et al., 2019). Compared to the "bounded variance" assumption, where the relative noise diminishes when the gradient is large, Assumption 2 allows the noise variance to grow with the gradient norm, and thus applies to a richer variety of problems (e.g., (Khani & Liang, 2020)). Furthermore, combining Assumptions 1 and 2, we conclude that $f$ satisfies the $(nL_0+L_1\sqrt{n}\sqrt{D_0},\ L_1\sqrt{n}\sqrt{D_1})$-smooth condition. The proof is immediate and we defer it to Lemma 2 in the appendix.
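The randomly-shuffled Adam of Eq. (2) transcribes almost line by line into code. The following is a minimal scalar sketch (no bias correction, matching Eq. (2)); the two-component objective in the last line is our own toy example, not from the paper:

```python
import math
import random

def adam_shuffled(grad_fi, w0, n, eta1=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, epochs=200, seed=0):
    """Scalar sketch of randomly-shuffled Adam, Eq. (2): one random
    permutation of the n components per epoch, eta_k = eta1 / sqrt(k).
    grad_fi(w, i) returns the gradient of the i-th component f_i."""
    rng = random.Random(seed)
    w, m, v = w0, 0.0, 0.0
    for k in range(1, epochs + 1):
        eta_k = eta1 / math.sqrt(k)
        perm = list(range(n))
        rng.shuffle(perm)                          # random permutation tau_k
        for i in perm:
            g = grad_fi(w, i)
            m = beta1 * m + (1 - beta1) * g        # first-order momentum
            v = beta2 * v + (1 - beta2) * g * g    # second-order momentum
            w = w - eta_k * m / (math.sqrt(v) + eps)
    return w

# usage: toy 2-component problem f(w) = (w-1)^2/2 + (w+1)^2/2, minimum at 0
w_out = adam_shuffled(lambda w, i: (w - 1.0) if i == 0 else (w + 1.0), 2.0, 2)
```

Note that the per-step update magnitude is bounded by roughly $\eta_k/\sqrt{1-\beta_2}$ regardless of the gradient scale, which is the adaptivity the analysis below exploits.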

4. MAIN RESULT

4.1 ADAM CONVERGES UNDER (L_0, L_1)-SMOOTH CONDITION

We formally present our main result as follows.

Theorem 1. Consider Adam defined as in Eq. (2) with diminishing learning rate $\eta_k=\eta_1/\sqrt{k}$. Let Assumptions 1 and 2 hold. Suppose that $\beta_1$ and $\beta_2$ satisfy $\beta_1^2<\beta_2<1$ and

$$\delta(\beta_2)\triangleq\sqrt{d}\,\max\left\{\frac{1}{\beta_2^{n-1}}-1+\frac{8n(1-\beta_2^{n-1})}{\beta_2^{n}},\ \frac{\beta_2}{1-\frac{2(1-\beta_2)n}{\beta_2^{n}}}-1,\ \beta_2^{-n/2}-1,\ \frac{1-\beta_2^{n}}{\beta_2^{\sqrt{n}}}\right\}\le\frac{1}{2(4+\sqrt{2})\sqrt{D_1}\left(n-1+\frac{1+\beta_1}{1-\beta_1}\right)}. \qquad (4)$$

Then, we have

$$\min_{k\in[1,T]}\min\left\{\frac{\|\nabla f(w_{k,0})\|}{\sqrt{D_1}},\ \frac{\|\nabla f(w_{k,0})\|^2}{\sqrt{D_0}}\right\}\le O\!\left(\frac{\log T}{\sqrt{T}}\right)+O\!\left(\sqrt{D_0}\,\min\{\delta(\beta_2),\delta(\beta_2)^2\}\right). \qquad (5)$$

Theorem 1 shows that Adam converges to a neighborhood of stationary points under the $(L_0,L_1)$-smooth condition. The coefficient of $\frac{\log T}{\sqrt{T}}$ is $O(L_0^2+L_1^2)$ with respect to $(L_0,L_1)$. In Section 4.2, we present a proof sketch to illustrate our proof idea.

Remark 1 (The choice of hyperparameters $(\beta_1,\beta_2)$). In Theorem 1, the feasible range of $(\beta_1,\beta_2)$ is subject to two constraints: $\beta_1^2<\beta_2<1$ and Inequality (4). We emphasize that the intersection of these two constraints is not empty: one can easily observe that $\delta(\beta_2)\to 0$ as $\beta_2\to 1$, so the constraints hold when $\beta_1$ is fixed and $\beta_2$ approaches 1. Therefore, Theorem 1 indicates that Adam works when $\beta_2$ is close enough to 1. Moreover, there is little restriction on $\beta_1$ except $\beta_1^2<\beta_2$ (discussed in Appendix B). This agrees with the $\beta_2$ selection in deep learning libraries (e.g., 0.999 in PyTorch). On the other hand, there are counterexamples in (Zhang et al., 2022) where Adam is shown to diverge if $\beta_2$ is chosen improperly (i.e., not close to 1). Therefore, the condition that $\beta_2$ is close to 1 is both sufficient and necessary for the convergence of Adam.

Remark 2 (Convergence to the neighborhood of stationary points). When $D_0\ne 0$, our theorem only ensures that Adam converges to a neighborhood of stationary points.
As discussed under Assumption 2, convergence to a bounded region is common in Adam analyses: even under the $L$-smooth condition, Zhang et al. (2022) can only show that Adam with diminishing learning rates converges to a bounded region. This is because, even with diminishing $\eta_k$, the effective learning rate $\frac{\eta_k}{\varepsilon\mathbf{1}_d+\sqrt{\nu_{k,i}}}$ may not decay. We further conduct experiments and observe that Adam cannot reach the stationary point of a simple quadratic function with $D_0\ne 0$. Specifically, we run Adam on the following synthetic objective:

$$f_0(x)=(x-3)^2+\frac{1}{10},\qquad f_i(x)=-\frac{1}{10}\left(x-\frac{10}{3}\right)^2+\frac{1}{10},\ i\in[1,9],\qquad f(x)=\sum_{i=0}^{9}f_i(x)=\frac{1}{10}x^2.$$

This example is proposed by (Shi et al., 2021). One can easily verify that $f(x)$ and $\{f_i(x)\}_{i=0}^{9}$ satisfy Assumption 1 and Assumption 2 with $D_1=208$ and $D_0=75\frac{5}{9}$, so Theorem 1 applies. With the diminishing learning rate $\eta_t=\frac{1}{\sqrt{t}}$, we plot the distance between Adam's trajectory and the unique minimum 0 of $f$ in Figure 2. It can be observed that Adam does not converge to the unique minimum 0, but gets closer to it when $\beta_2$ is chosen closer to 1, which supports Theorem 1. The good news is that the neighborhood shrinks as $\beta_2\to 1$, which explains the practical use of large $\beta_2$. Furthermore, through a more careful analysis, we refine the bounded region $\{w:\min\{\frac{\|\nabla f(w)\|}{\sqrt{D_1}},\frac{\|\nabla f(w)\|^2}{\sqrt{D_0}}\}\le O(\sqrt{D_0}\,\delta(\beta_2))\}$ in the existing literature (Shi et al., 2021; Zhang et al., 2022) to $\{w:\min\{\frac{\|\nabla f(w)\|}{\sqrt{D_1}},\frac{\|\nabla f(w)\|^2}{\sqrt{D_0}}\}\le O(\sqrt{D_0}\,\min\{\delta(\beta_2),\delta(\beta_2)^2\})\}$. We emphasize that this is a sharp improvement, since $\delta(\beta_2)$ is required to be close to 0.

Remark 3 (Dependence on the dimension $d$). The convergence rate and the size of the neighborhood have polynomial dependence on the dimension $d$, which is undesirable, as the convergence rate of SGD under the $L$-smooth condition has no such dependence. We conjecture that this dependence may be due to the limitations of our proof techniques.
However, such a result suffices to provide a convergence-rate separation between Adam and SGD, since we will later show that SGD may converge arbitrarily slowly under the $(L_0,L_1)$-smooth condition. Furthermore, to the best of our knowledge, all existing convergence analyses of Adam depend on $d$, even under the $L$-smooth condition, e.g., (Défossez et al., 2020; Shi et al., 2021; Zhang et al., 2022). Though we believe that removing the dependence on $d$ is possible and worth studying, it is technically difficult and deserves an independent work.

Remark 4 (Constant learning rate). Following the same proof strategy as Theorem 1, one can show that with a constant learning rate $\eta$, the conclusion (Eq. (5)) of Theorem 1 changes into $\min_{k\in[1,T]}\min\{\frac{\|\nabla f(w_{k,0})\|}{\sqrt{D_1}},\frac{\|\nabla f(w_{k,0})\|^2}{\sqrt{D_0}}\}\le O(\frac{\log T}{\eta T})+O(\sqrt{D_0}\min\{\delta(\beta_2),\delta(\beta_2)^2\})+O(\eta)$. In other words, Adam converges faster to the neighborhood (with the rate improving from $1/\sqrt{T}$ to $1/T$), but the size of the neighborhood is enlarged by an additional term $O(\eta)$ due to the constant step size.

Remark 5. Theorem 1 allows both $\beta_1=0$ (RMSProp) and $\beta_1>0$ (Adam). However, the convergence rates of these two cases are not distinguished, showing no benefit of momentum so far. At the current stage, we take a first step to compare RMSProp and Adam with SGD; a more detailed comparison between RMSProp and Adam is left as future work. For completeness, we briefly discuss the difficulty of analyzing the effect of momentum in Appendix B.
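The synthetic experiment described in Remark 2 can be reproduced in a few lines. The sketch below is our own re-implementation (not the paper's code): we take $\beta_1=0$ (the RMSProp case of Theorem 1) and initialize $\nu$ at 1, which the formal definition permits, to avoid a cold-start spike in the step size:

```python
import math
import random

def grad_fi(x, i):
    # f_0(x) = (x-3)^2 + 1/10;  f_i(x) = -(x - 10/3)^2/10 + 1/10, i = 1..9
    return 2.0 * (x - 3.0) if i == 0 else -(x - 10.0 / 3.0) / 5.0

def run_adam(beta2, epochs=2000, eps=1e-8, seed=0):
    rng = random.Random(seed)
    x, v = 1.0, 1.0          # beta1 = 0; nu initialized at 1 (allowed)
    for k in range(1, epochs + 1):
        eta = 1.0 / math.sqrt(k)
        perm = list(range(10))
        rng.shuffle(perm)
        for i in perm:
            g = grad_fi(x, i)
            v = beta2 * v + (1 - beta2) * g * g
            x -= eta * g / (math.sqrt(v) + eps)
    return abs(x)            # distance to the unique minimum 0 of f

dist_small_beta2 = run_adam(0.5)    # beta2 far from 1: drifts far away
dist_large_beta2 = run_adam(0.999)  # beta2 close to 1, as Theorem 1 requires
```

Consistent with Figure 2 and Remark 1, the run with $\beta_2$ far from 1 wanders far from the minimum, while the run with $\beta_2=0.999$ settles in a small neighborhood of 0 without reaching it exactly.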

4.2. PROOF IDEAS

Here, we briefly explain the proof idea of our convergence result. We first prove Theorem 1 for RMSProp (i.e., $\beta_1=0$) to show the challenge brought by the $(L_0,L_1)$-smooth condition and how we tackle it. We then show the additional difficulty in extending the result to general Adam and our corresponding intuition for solving it.

Stage I: Convergence of RMSProp. We start with the following descent lemma:

$$f(w_{k+1,0})-f(w_{k,0})\le\underbrace{\langle w_{k+1,0}-w_{k,0},\nabla f(w_{k,0})\rangle}_{\text{First Order}}+\underbrace{\frac{L_{\mathrm{loc}}}{2}\|w_{k+1,0}-w_{k,0}\|^2}_{\text{Second Order}},\qquad (6)$$

where $L_{\mathrm{loc}}$ is the local smoothness. We first bound the second-order term by noticing that the norm of the epoch-$k$ update of RMSProp is of order $O(\eta_k)$ when $0=\beta_1^2<\beta_2<1$ (a standard property, with proof deferred to Lemma 3 in the appendix). When $k$ is large enough, the epoch-$k$ update is small enough that the $(L_0,L_1)$-smooth condition applies, leading to $L_{\mathrm{loc}}=O(1)+O(\|\nabla f(w)\|)$ and a second-order term of $O(\eta_k^2)+O(\eta_k^2\|\nabla f(w)\|)$. Therefore, in order to show that $f(w_{k,0})$ decreases, we need to prove that the first-order term is negative and dominates the term $O(\eta_k^2)+O(\eta_k^2\|\nabla f(w)\|)$. In other words, we need to lower bound the alignment between the update $w_{k+1,0}-w_{k,0}=-\eta_k\sum_i\frac{1}{\varepsilon\mathbf{1}_d+\sqrt{\nu_{k,i}}}\odot\nabla f_{\tau_{k,i}}(w_{k,i})$ and the negative gradient $-\nabla f(w_{k,0})$. It is easy to prove such a property when $\nu_{k,i}$ does not change, i.e., if we are analyzing SGD with preconditioning, in which case

$$\sum_i\frac{1}{\varepsilon\mathbf{1}_d+\sqrt{\nu_{k,i}}}\odot\nabla f_{\tau_{k,i}}(w_{k,i})=\sum_i\frac{1}{\varepsilon\mathbf{1}_d+\sqrt{\nu_{k,0}}}\odot\nabla f_{\tau_{k,i}}(w_{k,i})\approx\sum_i\frac{1}{\varepsilon\mathbf{1}_d+\sqrt{\nu_{k,0}}}\odot\nabla f_{\tau_{k,i}}(w_{k,0})=\frac{1}{\varepsilon\mathbf{1}_d+\sqrt{\nu_{k,0}}}\odot\nabla f(w_{k,0}).$$

Can we extend the above methodology to RMSProp? We give an affirmative answer when $\beta_2$ is close to 1 and the gradient is large, by providing the following lemma.

Lemma 1 (Informal). For any $l\in[d]$ and $i\in[0,n-1]$, if $\max_{p\in[0,n-1]}|\partial_l f_p(w_{k,0})|=\Omega\left(\sum_{r=1}^{k-1}\beta_2^{(k-1-r)/2}\eta_r\|\nabla f(w_{r,0})\|+\eta_k\right)$, then $|\nu_{l,k,i}-\nu_{l,k,0}|=O(\delta(\beta_2)\,\nu_{l,k,0})$.

The proof idea is that (1).
the squared maximum gradient can be bounded by $\nu_{l,k,0}$ plus an error term; (2) when the maximum gradient is larger than the error term, it can be bounded by $\nu_{l,k,0}$ alone; and (3) thus $\nu_{l,k,i}=\beta_2^i\nu_{l,k,0}+(1-\beta_2)\beta_2^{i-1}\nabla f_{\tau_{k,1}}(w_{k,1})^2+\cdots=\beta_2^i\nu_{l,k,0}+(1-\beta_2^i)O(\nu_{l,k,0})$ gets close to $\nu_{l,k,0}$ when $\beta_2$ is close to 1. The detailed proof is sophisticated due to the presence of the $(L_0,L_1)$-smooth condition, and we defer it to Corollary 2. Therefore, if we denote the set of dimensions with large gradients (i.e., satisfying the requirement of Lemma 1) by $\mathcal{L}^k_{\mathrm{large}}$ and the rest by $\mathcal{L}^k_{\mathrm{small}}$, Lemma 1 indicates

$$\sum_{l\in\mathcal{L}^k_{\mathrm{large}}}(w_{l,k+1,0}-w_{l,k,0})\,\partial_l f(w_{k,0})=-\eta_k\sum_{l\in\mathcal{L}^k_{\mathrm{large}}}\frac{\partial_l f(w_{k,0})}{\sqrt{\nu_{l,k,i}}+\varepsilon}\sum_i\partial_l f_{\tau_{k,i}}(w_{k,i})\approx-\eta_k\sum_{l\in\mathcal{L}^k_{\mathrm{large}}}\frac{\partial_l f(w_{k,0})^2}{\sqrt{\nu_{l,k,0}}+\varepsilon}\approx-\eta_k\sum_{l\in\mathcal{L}^k_{\mathrm{large}}}\frac{\partial_l f(w_{k,0})^2}{\max_{p\in[0,n-1]}|\partial_l f_p(w_{k,0})|+\varepsilon}=-\Omega\left(\eta_k\min\left\{\frac{\|\nabla f(w_{k,0})\|}{\sqrt{D_1}},\frac{\|\nabla f(w_{k,0})\|^2}{\sqrt{D_0}}\right\}\right).$$

Here in the second "$\approx$", we again use the fact that the squared maximum gradient is close to $\nu_{l,k,0}$ when it is large. A formal derivation of the above result can be found in Appendix D.2.

How about the dimensions in $\mathcal{L}^k_{\mathrm{small}}$? We treat them as error terms. Specifically, $l\in\mathcal{L}^k_{\mathrm{small}}$ indicates that $\partial_l f(w_{k,0})=O(\sum_{r=1}^{k-1}\beta_2^{(k-1-r)/2}\eta_r\|\nabla f(w_{r,0})\|+\eta_k)$. One can easily observe that $\frac{\partial_l f_{\tau_{k,i}}(w_{k,i})}{\sqrt{\nu_{l,k,i}}+\varepsilon}$ is bounded, and thus

$$\sum_{l\in\mathcal{L}^k_{\mathrm{small}}}(w_{l,k+1,0}-w_{l,k,0})\,\partial_l f(w_{k,0})=-\eta_k\sum_{l\in\mathcal{L}^k_{\mathrm{small}}}\frac{\partial_l f(w_{k,0})}{\sqrt{\nu_{l,k,i}}+\varepsilon}\sum_i\partial_l f_{\tau_{k,i}}(w_{k,i})=O\left(\eta_k\left(\sum_{r=1}^{k-1}\beta_2^{(k-1-r)/2}\eta_r\|\nabla f(w_{r,0})\|+\eta_k\right)\right).$$

Putting all the estimates together, we conclude that $f(w_{k+1,0})-f(w_{k,0})$ is smaller than $-\Omega(\eta_k\min\{\frac{\|\nabla f(w_{k,0})\|}{\sqrt{D_1}},\frac{\|\nabla f(w_{k,0})\|^2}{\sqrt{D_0}}\})+O(\eta_k\sum_{r=1}^{k}\beta_2^{(k-1-r)/2}\eta_r\|\nabla f(w_{r,0})\|)+O(\eta_k^2)$. The accumulation of the $O(\eta_k^2)$ terms is of order $\log T$ and grows slowly. To derive the convergence result, we need to show that the first term dominates.
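Lemma 1's message, that $\nu_{l,k,i}$ barely drifts from $\nu_{l,k,0}$ within one epoch when $\beta_2$ is close to 1, can be illustrated numerically. The scalar recursion and the gradient values below are our own hypothetical simplification, not the paper's exact setting:

```python
# Track the worst within-epoch relative drift of nu from its epoch-start
# value nu_0, for one epoch of n = 8 bounded (hypothetical) gradients.
def max_rel_drift(beta2, grads, nu0=1.0):
    nu, worst = nu0, 0.0
    for g in grads:
        nu = beta2 * nu + (1 - beta2) * g * g   # second-moment recursion
        worst = max(worst, abs(nu - nu0) / nu0)
    return worst

grads = [0.5, -1.2, 0.8, -0.3, 1.0, 0.6, -0.9, 0.4]   # one epoch, n = 8
drift_0999 = max_rel_drift(0.999, grads)   # beta2 close to 1: tiny drift
drift_09 = max_rel_drift(0.9, grads)       # smaller beta2: visible drift
```

With $\beta_2=0.999$ the drift stays below one percent for these values, so treating $\nu_{k,i}\approx\nu_{k,0}$ (a preconditioned-SGD view) is a mild approximation; with $\beta_2=0.9$ it is an order of magnitude larger.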
However, the second term contains historical gradient information and is not necessarily smaller than the first term in absolute value for any single $k$. How do we deal with it? We observe that, while the first and second terms are not comparable for a single $k$, summing over $k$ shows that $f(w_{T+1,0})-f(w_{1,0})$ is smaller than

$$-\Omega\left(\sum_{k=1}^{T}\eta_k\min\left\{\frac{\|\nabla f(w_{k,0})\|}{\sqrt{D_1}},\frac{\|\nabla f(w_{k,0})\|^2}{\sqrt{D_0}}\right\}\right)+O\left(\sum_{k=1}^{T}\eta_k\left(\sum_{r=1}^{k}\beta_2^{(k-1-r)/2}\eta_r\|\nabla f(w_{r,0})\|\right)\right)+O\left(\sum_{k=1}^{T}\eta_k^2\right).$$

By exchanging the order of summation, one can easily observe that the second term equals $O(\sum_{k=1}^{T}\eta_k^2\|\nabla f(w_{k,0})\|)$, which is in turn $O(\sum_{k=1}^{T}\eta_k\min\{\frac{\|\nabla f(w_{k,0})\|}{\sqrt{D_1}},\frac{\|\nabla f(w_{k,0})\|^2}{\sqrt{D_0}}\})+O(\sum_{k=1}^{T}\eta_k^2)$ due to the mean-value inequality $\eta_k^2\|\nabla f(w_{k,0})\|\le O(\eta_k^2)+O(\eta_k^2\frac{D_1}{D_0}\|\nabla f(w_{k,0})\|^2)$. Thus the second term is dominated by the first term, up to an $O(\sum_k\eta_k^2)$ remainder. This completes the proof for RMSProp.

Remark 6 (Difficulty compared to the $L$-smooth case). First, the change of $\nu_{k,i}$ is easier to bound without the historical-gradient term. Second, under the $L$-smooth condition, the error contains no historical gradient information and is only of order $O(\eta_k^2)$, which is easy to bound.

Remark 7 (Intuition why SGD converges slowly). If we replace Adam with SGD in the above analysis, the first-order term may no longer dominate the second-order term. This explains why SGD converges slowly. A detailed discussion can be found in Appendix B.

Stage II: Extending the proof to general Adam. For Adam, the update norm is still bounded and the second-order term can be bounded similarly. However, the analysis of the first-order term becomes more challenging, even though we still have $\nu_{k,i}\approx\nu_{k,0}$: even with constant $\nu_{k,i}=\nu_{k,0}$, the inner product between the epoch update and $-\nabla f(w_{k,0})$ is hard to lower bound, because the momentum $m_{k,i}$ mixes gradients taken at different points. We observe that the alignment between $w_{k+1,0}-w_{k,0}$ and $-\nabla f(w_{k,0})$ is required because our analysis is based on the potential function $f(w_{k,0})$.
However, while this potential function is suitable for the analysis of RMSProp, it is no longer appropriate for Adam based on the above discussion, and we need to construct another potential function. Our construction is based on the following observation: revisiting the update rule in Eq. (2), we can rewrite it as

$$\frac{m_{k,i}-\beta_1 m_{k,i-1}}{1-\beta_1}=\nabla f_{\tau_{k,i}}(w_{k,i}).$$

Notice that the right-hand side of the above equation contains no historical gradients but only the gradient of the current step! We then bring $1/\sqrt{\nu_{k,i}}$ into play, leading to

$$\frac{w_{k,i+1}-\beta_1 w_{k,i}}{1-\beta_1}-\frac{w_{k,i}-\beta_1 w_{k,i-1}}{1-\beta_1}=\frac{w_{k,i+1}-w_{k,i}-\beta_1(w_{k,i}-w_{k,i-1})}{1-\beta_1}\approx-\frac{\eta_k}{\sqrt{\nu_{k,0}}+\varepsilon\mathbf{1}_d}\odot\frac{m_{k,i}-\beta_1 m_{k,i-1}}{1-\beta_1}=-\frac{\eta_k}{\sqrt{\nu_{k,0}}+\varepsilon\mathbf{1}_d}\odot\nabla f_{\tau_{k,i}}(w_{k,i}).$$

One can see that the sequence $\{u_{k,i}\triangleq\frac{w_{k,i}-\beta_1 w_{k,i-1}}{1-\beta_1}\}$ is (approximately) performing SGD within one epoch (with a coordinate-wise but constant learning rate $\frac{\eta_k}{\sqrt{\nu_{k,0}}+\varepsilon\mathbf{1}_d}$)! Further notice that $u_{k,i}=w_{k,i}+\beta_1\frac{w_{k,i}-w_{k,i-1}}{1-\beta_1}$ is close to $w_{k,i}$. Therefore, we choose $f(u_{k,i})$ as our potential function. The Taylor expansion of $f$ at $u_{k,0}$ then provides a new descent lemma:

$$f(u_{k+1,0})-f(u_{k,0})\le\underbrace{\langle u_{k+1,0}-u_{k,0},\nabla f(u_{k,0})\rangle}_{\text{First Order}}+\underbrace{\frac{L_0+L_1\|\nabla f(w_{k,0})\|}{2}\|w_{k+1,0}-w_{k,0}\|^2}_{\text{Second Order}}.\qquad (7)$$

By noticing $w_{k,i}\approx u_{k,i}\approx u_{k,0}$, the first-order term can be further approximated as $\langle u_{k+1,0}-u_{k,0},\nabla f(u_{k,0})\rangle\approx-\langle\frac{\eta_k}{\sqrt{\nu_{k,0}}+\varepsilon\mathbf{1}_d}\odot\nabla f(w_{k,0}),\nabla f(w_{k,0})\rangle$, which is negative, and the challenge is resolved.

Remark 8. Zhang et al. (2022) also encounter the misalignment between $w_{k+1,0}-w_{k,0}$ and $-\nabla f(w_{k,0})$, but they do not use a potential function and can only derive in-expectation convergence. A detailed discussion can be found in Appendix B.

Remark 9. Similar potential functions appear in existing works for other optimizers, but none of them succeeds in deriving the result for Adam. We defer a detailed discussion to Appendix B.
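The key identity behind this construction, namely that the momentum recursion inverts to expose only the current gradient, is a two-line algebraic fact and can be verified numerically (random values are our own, purely illustrative):

```python
import random

# With m_i = beta1 * m_{i-1} + (1 - beta1) * g_i, we always have
# (m_i - beta1 * m_{i-1}) / (1 - beta1) == g_i: the weighted average can be
# "inverted" so that no historical gradient remains on the right-hand side.
beta1 = 0.9
rng = random.Random(0)
m_prev, ok = rng.uniform(-1.0, 1.0), True
for _ in range(1000):
    g = rng.uniform(-10.0, 10.0)
    m = beta1 * m_prev + (1 - beta1) * g
    ok = ok and abs((m - beta1 * m_prev) / (1 - beta1) - g) < 1e-9
    m_prev = m
```

The same inversion applied to the iterates $w_{k,i}$ is what produces the auxiliary sequence $u_{k,i}$ above.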

4.3. SGD MAY CONVERGE ARBITRARILY SLOWLY WITH (L_0, L_1)-SMOOTH CONDITION

As the $(L_0,L_1)$-smooth condition appears to be a more precise characterization of the landscape of neural networks than the $L$-smooth condition, one may wonder how SGD performs under it. Here, we find examples that satisfy the $(L_0,L_1)$-smooth condition but on which GD and SGD diverge. This implies that, under this realistic smoothness assumption, (S)GD can be arbitrarily slow.

GD may converge arbitrarily slowly with $(L_0,L_1)$-smooth condition. To begin with, we notice that Zhang et al. (2019a) have already provided counterexamples on which GD converges arbitrarily slowly. However, their examples use constant-learning-rate GD, which, unfortunately, does NOT match the setting of Theorem 1 (i.e., diminishing learning rate and stochastic update), and thus does not suffice for a fair comparison between GD and (full-batch) Adam. To fill this gap, we provide the following example.

Example 1. There exist a domain $\mathcal{X}$ and a target function $f:\mathcal{X}\to\mathbb{R}$ such that, for every initial learning rate $\eta_1$, there exists a region $E\subset\mathcal{X}$ with infinite measure such that GD with diminishing learning rate $\eta_t=\eta_1/\sqrt{t}$ initialized in $E$ drives $f(w_t)$ and $\|\nabla f(w_t)\|$ to infinity.

We defer a concrete construction to Appendix C.1. We emphasize that it is a classification problem with a linear model, which shares similar properties with practical classification problems. The intuition of this example is that, with a fixed initial learning rate, a bad initialization may lead to arbitrarily bad local smoothness due to the globally unbounded smoothness, making the first step relatively large. This leads to an even worse (larger) local smoothness at the next step. Moreover, the increase of the local smoothness can be rapid enough to offset the decrease of the learning rate, eventually driving the loss to divergence.
On the contrary, the update norm of Adam is bounded regardless of the initialization, so Adam does not suffer from the same problem. One may notice that the divergence issue in the above example can be equivalently viewed as "given the initialization, a large learning rate brings divergence". A natural question is whether reducing the learning rate leads to convergence. The answer could be yes, but only if some additional assumption is provided (for example, Assumption 4 in (Zhang et al., 2019a), which assumes the gradient is bounded over a loss sub-level set). However, in this case, the feasible initial learning rate depends on the local smoothness over the trajectory. Since the initial gradient can be arbitrarily large, the initial learning rate needs to be arbitrarily small, which leads to a slow convergence rate if a bad initialization is picked. This is exactly the intuition of the lower bound in (Zhang et al., 2019a); please refer to the detailed discussion in Appendix B.

Similar negative results also apply to SGD. Actually, the situation is even worse: SGD can still diverge with learning rate $1/L$, which leads to convergence for GD.

Example 2. There exist a domain $\mathcal{X}$, an initial point $w_0\in\mathcal{X}$, and a target function $f=\sum_{i=0}^{n-1}f_i$, over which SGD with initial learning rate $\eta_1=1/L$ and diminishing learning rate $\eta_t=\eta_1/\sqrt{t}$ diverges, while GD with the same setting converges, where $L$ is the smoothness upper bound over the loss sub-level set $\{w\in\mathcal{X}: f(w)\le f(w_0)\}$.

We defer the concrete construction to Appendix C.2. The intuition of this example is that, even with learning rate $1/L$, a non-full-batch update can step out of the loss sub-level set and increase the local smoothness coefficient. As before, further reducing the learning rate might lead to convergence, but only when additional assumptions are imposed. Even so, the convergence rate can be arbitrarily slow.
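The blow-up mechanism of Example 1 can be reproduced on a one-dimensional toy function. The function $f(x)=x^4/4$ below is our own illustrative stand-in, not the paper's construction: it satisfies the $(L_0,L_1)$-smooth condition but not $L$-smoothness, and GD with $\eta_t=\eta_1/\sqrt{t}$ explodes from a bad initialization because the local smoothness $3x^2$ grows faster than the learning rate decays:

```python
import math

def gd(x0, eta1=0.01, steps=8):
    """GD with diminishing rate eta1 / sqrt(t) on f(x) = x^4 / 4."""
    x = x0
    for t in range(1, steps + 1):
        x = x - (eta1 / math.sqrt(t)) * x ** 3   # gradient of x^4/4 is x^3
        if abs(x) > 1e12:                        # stop once clearly diverged
            break
    return abs(x)

good_init = gd(1.0)    # benign initialization: GD behaves
bad_init = gd(30.0)    # bad initialization: iterates overshoot and explode
```

From $x_0=30$ the very first step overshoots to a point with even larger local smoothness, and each subsequent step amplifies this; Adam's update, being bounded by roughly $\eta_t$ per step regardless of the gradient scale, cannot enter this feedback loop.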

5 CONCLUSION

In this paper, we take the first step toward theoretically understanding the adaptivity in Adam. Specifically, we prove that Adam can converge arbitrarily faster than SGD. Our assumptions are realistic and close to practical settings. On the other hand, there is still a lot to explore regarding Adam's performance. For example, it is interesting to analyze how momentum helps in Adam's optimization. Further, it is interesting to investigate whether Adam can handle sharper smoothness conditions, e.g., smoothness bounded by a higher-order polynomial of the gradient norm.

A ADDITIONAL RELATED WORKS

In this section, we provide discussions on more related works.

New variants of Adam. Ever since Reddi et al. (2019) pointed out the non-convergence issue of Adam, many new variants of Adam have been designed. For instance, Zou et al. (2019); Gadat & Gavra (2020); Chen et al. (2018b; 2021a) replaced the constant hyperparameters with iterate-dependent ones, e.g., β1t or β2t. AMSGrad (Reddi et al., 2019) and AdaFom (Chen et al., 2018b) enforced {vt} to be non-decreasing. Similarly, AdaBound (Luo et al., 2019) imposed the constraint vt ∈ [Cl, Cu] to prevent the learning rate from vanishing or exploding. In addition, Zhou et al. (2018b) adopted a new estimate of vt to correct the bias. There are also attempts to combine Adam with Nesterov momentum (Dozat, 2016) as well as warm-up techniques (Liu et al., 2020a). Some works provide theoretical analyses of the variants of Adam. For instance, Zhou et al. (2018a) studied the convergence of AdaGrad and AMSGrad. Gadat & Gavra (2020) studied the asymptotic behavior of a subclass of adaptive gradient methods from a landscape point of view; their analysis applies to Adam-variants with β1 = 0 and β2 increasing along the iterates. In summary, all these works require the L-smooth condition. Though the new variants of Adam can provably converge, their convergence rates are no better than that of SGD.

Generalization ability of adaptive gradient methods. The generalization ability of Adam is a topic of active debate. For instance, Wang et al. (2021) studied the implicit bias of adaptive optimization algorithms on homogeneous neural networks. They proved that the convergent direction of Adam and RMSProp is the same as that of SGD. Zhou et al. (2020); Xie et al. (2022); Zou et al. (2021) argue that Adam prefers sharp local minima while GD prefers wide ones; as such, they argue that Adam generalizes worse than SGD. Zou et al. (2021) prove that Adam generalizes worse than SGD over a specific model.
There are also several attempts to improve the generalization ability of Adam. For instance, Padam (Chen et al., 2018a) introduced a partial adaptive parameter to improve the generalization performance, and AdamW (Loshchilov & Hutter, 2017) decoupled the weight decay from the gradient-based update.

B ADDITIONAL DISCUSSIONS

B.1 DISCUSSION ON THE HYPERPARAMETERS β1 AND β2

We discuss a bit more on β1. First, the requirement β1² < β2 is a standard condition in the Adam-family literature (Reddi et al., 2018; Zou et al., 2019; Défossez et al., 2020; Shi et al., 2021). Since β2 is suggested to be large, this condition covers a flexible choice of β1 ∈ [0, 1). Second, although the dependency between β1 and β2 in Inequality (4) seems complicated, the effect of β1 on Inequality (4) can be neglected when adopting practical values such as β1 = 0.9. This is because the term (1 + β1)/(1 − β1) is much smaller than the term (n − 1) on the right-hand side of Inequality (4). For example, when β1 = 0.9 (the default setting in deep learning libraries), (1 + β1)/(1 − β1) = 19. This number is much smaller than n on real datasets (for instance, on CIFAR-10 (He et al., 2016), n = 50k/128 ≈ 390; on ImageNet (You et al., 2017), n = 1.2m/8k = 150). As a result, constraint (4) is inactive for practical choices of β1.

B.2 COMPARISONS OF OPTIMIZERS OVER THE FINE-TUNING TASK

When the gradient along the trajectory is small, the (L0, L1)-smooth condition degenerates to the L-smooth condition, and thus SGD works well. This may explain the phenomenon that SGD is also adopted in some fine-tuning tasks, as pretraining can be viewed as selecting a good initialization (so we can expect the gradient to be small along the trajectory). While the above discussion is intuitive, formally proving it is an interesting direction for future work.

B.3 ON THE EFFECT OF MOMENTUM

Since Adam and RMSProp both converge under the (L0, L1)-smooth condition, a natural question is what the effect of momentum is in Adam. However, this is a highly non-trivial question. Theoretically, the effect of momentum is not clear even for momentum SGD in non-convex optimization, let alone for Adam. Practically, there are existing experiments suggesting that the performance of well-tuned RMSProp can match that of well-tuned Adam (Choi et al., 2019). We believe that this question is interesting yet beyond the scope of this paper and leave it for future work.

B.4 CONVERGENCE RESULT OF RMSPROP

As mentioned in Section 4.2, we provide the convergence result of RMSProp as a corollary of Theorem 1 here.

Corollary 1. Let all the conditions in Theorem 1 hold. Then, for RMSProp, if

δ(β2) ≜ √d · max{ 1 − 1/√(β2^{n−1} + 8n(1−β2^{n−1})/β2^n), √(β2/(1 − 2(1−β2)n/β2^n)) − 1, β2^{−n/2} − 1, 1 − β2 } · √(2n)/β2^{n/2} ≤ 1/(2(4 + √2)√(D1 n)),

then we have

min_{k∈[1,T]} min{ ∥∇f(w_{k,0})∥/√D1, ∥∇f(w_{k,0})∥²/√D0 } ≤ O(log T/√T) + O(√D0 · min{δ(β2), δ(β2)²}).

B.5 ADVANTAGE OF ADAM OVER GD/SGD WITH GRADIENT CLIPPING

Zhang et al. (2019a) show that GD/SGD with gradient clipping converges under the (L0, L1)-smooth condition. A natural question is what the benefit of Adam is over GD/SGD with gradient clipping. We are not yet able to fully answer this question, but one potential advantage is that Adam can handle more complex noise. It is not known whether SGD with gradient clipping converges in our setting, as the existing analyses of GD/SGD with gradient clipping under the (L0, L1)-smooth condition all assume that the distance between the stochastic gradient and the true gradient is bounded with probability 1 (Zhang et al., 2019a; 2020), which is strictly stronger than our affine variance noise assumption.
It will be interesting either to provide a counterexample showing that SGD with clipping does not converge under the (L0, L1)-smooth condition with the affine noise assumption, or to prove that SGD with clipping does converge and find another perspective to demonstrate the advantage of Adam over SGD with clipping.

B.6 INTUITION WHY ADAM CONVERGES FASTER THAN SGD

We first illustrate this from the proof. If we analyze SGD following the same methodology as in Section 4.2, the first-order term in Eq. (6) for SGD is

⟨−ηk Σ_{i=0}^{n−1} ∇f_{τ_{k,i}}(w_{k,i}), ∇f(w_{k,0})⟩ ≈ ⟨−ηk Σ_{i=0}^{n−1} ∇f_{τ_{k,i}}(w_{k,0}), ∇f(w_{k,0})⟩ = −ηk ∥∇f(w_{k,0})∥².

However, when it comes to bounding the second-order term, we encounter two problems. First, it is not known whether the update of SGD is bounded, and thus whether we can bound L_loc using the (L0, L1)-smooth condition. Second, even if we succeed in bounding L_loc by O(∥∇f(w_{k,i})∥) + O(1), the second-order term has the form O(ηk²∥∇f(w_{k,i})∥³) + O(ηk²∥∇f(w_{k,i})∥²), and may be larger than the first-order term when the gradient is large. We invite the readers to see the counterexample in Section 4.3 to learn more.

We then provide an example. When the local smoothness coefficient L varies drastically across the domain, it is difficult to pick a pre-determined stepsize that is guaranteed to work along the SGD trajectory. There are two cases: 1. When entering a sharp local region, the pre-determined steps might be too large and cause divergence of SGD. 2. When entering a flat region, the pre-determined steps might be too small to make progress, causing slow convergence of SGD. The situation gets worse when using diminishing stepsizes. For Adam, the effective stepsize involves 1/(√ν_{k,i} + ε1_d), which changes adaptively along the trajectory. Adam can better handle the above two cases: 1. When entering a sharp local region, the gradient is usually large and 1/(√ν_{k,i} + ε1_d) gradually decreases in this region, eventually reaching a stepsize small enough to fit the sharp region. 2. When entering a flat region, the gradient is consistently small and thus 1/(√ν_{k,i} + ε1_d) gradually increases, leading to larger stepsizes and thus faster convergence of Adam.
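The flat-region case can be checked numerically with a toy sketch of our own (not an experiment from the paper): on f(w) = w⁴/4 the gradient w³ vanishes near the minimum, so SGD with a fixed small stepsize crawls, while the normalized Adam update keeps moving at roughly the stepsize η per iteration.

```python
import math

def grad(w):
    # f(w) = w**4 / 4 has a vanishing gradient w**3 near w = 0 (a flat region)
    return w ** 3

def steps_to_reach(update, w0=1.0, target=0.1, max_steps=100000):
    """Number of iterations until |w| <= target, or None if the budget runs out."""
    w, state = w0, {}
    for t in range(1, max_steps + 1):
        w = update(w, grad(w), state)
        if abs(w) <= target:
            return t
    return None

def sgd(w, g, state, eta=0.01):
    return w - eta * g

def adam(w, g, state, eta=0.01, b1=0.9, b2=0.999, eps=1e-8):
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g * g
    return w - eta * state["m"] / (math.sqrt(state["v"]) + eps)

print(steps_to_reach(sgd))   # on the order of thousands of steps
print(steps_to_reach(adam))  # orders of magnitude fewer
```

With the same nominal stepsize, the adaptive denominator shrinks together with the gradient, so Adam's effective stepsize grows in the flat region exactly as described above.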

B.7 DISCUSSION ON THE POTENTIAL FUNCTION

We notice that similar potential functions have been applied in the analysis of other momentum-based optimizers, e.g., momentum (S)GD in (Ghadimi et al., 2015) and (Liu et al., 2020b) and Adam-type optimizers (except Adam) in (Chen et al., 2018b). However, extending these proofs to Adam is highly non-trivial: a direct extension fails to provide a convergence result for Adam. The key difficulty lies in showing that the first-order expansion of f(u_{k,0}) is positive, which further requires that the adaptive learning rate does not change much within one epoch. This is hard for Adam, as the adaptive learning rate of Adam can be non-monotonic. The lack of the L-smooth condition makes the proof even more challenging due to the unbounded error brought by gradient norms.

A related analysis (2022) handles the misalignment between w_{k+1,0} − w_{k,0} and −∇f(w_{k,0}) by additionally assuming that the random permutation generating {τ_{k,0}, ..., τ_{k,n−1}} is uniformly sampled from all possible permutation orders. Based on this additional assumption, the expectation of the momentum can be shown to be close to the gradient, and thus they derive an in-expectation result. In contrast, we do not need such an assumption. Further, their proof is more complex than ours, as they need to deal with the historical gradient signal in the momentum. Our potential function allows us to offer a trajectory-wise convergence result with a simplified proof.

B.9 INSIGHT FOR PRACTITIONERS

First, Adam enjoys great popularity among practitioners (with more than 100k citations), so it is important to theoretically understand the algorithm as it is used. Second, we provide new suggestions for practitioners: when running experiments on tasks such as Transformer and LSTM training, we suggest using Adam instead of SGD. Though this is folklore based on engineering experience, it is now theoretically justified for the first time. Third, we provide suggestions for hyperparameter tuning (based on the convergence conditions in Theorem 1): when running Adam, we suggest tuning β2 up and trying different β1 < √β2. This suggestion can save much of the effort of grid-searching over (β1, β2) combinations.
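The suggested search pattern can be encoded in a small helper (a hypothetical snippet of ours; only the constraint β1 < √β2 comes from Theorem 1, the candidate values are arbitrary):

```python
import math

def valid_betas(b1, b2):
    """Check the convergence-friendly constraint beta1 < sqrt(beta2) from Theorem 1."""
    return 0.0 <= b1 < math.sqrt(b2) < 1.0  # sqrt(b2) < 1 also enforces b2 < 1

# tune beta2 up first, then sweep beta1 below sqrt(beta2)
candidates = [(0.9, 0.999), (0.9, 0.99), (0.99, 0.95), (0.999, 0.9999)]
print([valid_betas(b1, b2) for b1, b2 in candidates])   # [True, True, False, True]
```

Only the third pair violates the condition, illustrating that a large β2 leaves a wide feasible range for β1.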

C DIVERGING EXAMPLE FOR (S)GD

In this section, we discuss the divergence issue of (S)GD under the (L0, L1)-smooth condition. This section is organized as follows: we first consider the simple full-batch case, i.e., n = 1 and f(w) = f0(w) (so there is no random reshuffling and no randomness in the training process), in which we show that GD may diverge under the (L0, L1) assumption while Adam converges to a stationary point. We then step further to the random-reshuffling case and show that the randomness may bring additional divergence issues.

C.1 COMPARE GD AND FULL-BATCH ADAM

To start with, we notice there is an example in Zhang et al. (2019a) over which GD converges arbitrarily slowly. We restate their result as follows:

Property 1 (Theorem 4, Zhang et al. (2019a)). For every fixed learning rate η ∈ R+ and tolerated error bound ε ∈ R+, there exist a domain X, an optimization problem f : X → R that is lower bounded and (L0, L1)-smooth, and an initialization point w1 ∈ X with M ≜ sup{∥∇f(w)∥ : f(w) ≤ f(w1)} < ∞, such that GD over f starting from w1 with constant learning rate η satisfies ∥∇f(w_t)∥ > ε for every t ∈ Z+ smaller than L1 M (f(w1) − min_{w∈X} f(w) − 5ε/8) / (8ε²(log M + 1)).

By Property 1, we know that the convergence of GD can be arbitrarily slow depending on the gradient upper bound M over the loss sub-level set {w : f(w) ≤ f(w1)}, which in turn depends on the initialization. In other words, the performance of GD relies sensitively on the initialization. However, Property 1 does not suffice to show that Adam performs better than GD, as the learning rate in Property 1 is constant while in Theorem 1 it is decaying. To fill this gap in setting, we provide the following example satisfying Assumptions 2 and 1, on which GD with decaying learning rate diverges if initialized badly.

Example 3 (Example 1, restated). Let n = 1 and p = 2 with f0((w1, w2)) = e^{−(√3/2)w1 − (1/2)w2} + e^{(√3/2)w1 − (1/2)w2} and f((w1, w2)) = f0((w1, w2)). Then f satisfies Assumptions 1 and 2. However, for every initial learning rate η1, there exists a region E with infinite measure over R² such that, for GD with decaying learning rate ηk = η1/√k and starting point in E, f and ∥∇f∥ diverge to infinity.

Proof. Notations. To simplify the notation, we define x1 = (√3/2, 1/2) and x2 = (−√3/2, 1/2). It can be observed that ∥x1∥ = ∥x2∥ = 1 and ⟨x1, x2⟩ = −1/2. We also denote the parameter given by the k-th iteration of GD by wk.
By definition, we have wk+1 = wk + ηk e^{−⟨x1,wk⟩} x1 + ηk e^{−⟨x2,wk⟩} x2. We further define g(x) ≜ η1 e^{(3/8)x} − 17x − 16η1 e^{−(15/16)x}. As g(x) → ∞ when x → ∞, there exists a large enough constant C > 1 such that g(x) > 0 for every x > C.

Verify the assumptions. Assumption 2 immediately follows as n = 1. We then check Assumption 1. We have

∇f(w) = −e^{−⟨w,x1⟩} x1 − e^{−⟨w,x2⟩} x2, ∇²f(w) = x1 x1ᵀ e^{−⟨w,x1⟩} + x2 x2ᵀ e^{−⟨w,x2⟩}.

Therefore, ∥∇²f(w)∥2 ≤ e^{−⟨w,x1⟩} + e^{−⟨w,x2⟩}, while

∥∇f(w)∥ ≥ ⟨−∇f(w), (0, 1)⟩ = (1/2) e^{−⟨w,x1⟩} + (1/2) e^{−⟨w,x2⟩}.

As a conclusion, Assumption 1 is satisfied with L0 = 0 and L1 = 2.

Initialization. We define the region E as E ≜ {a1 x1 + b1 : a1 > C, a1 > 16∥b1∥} ∪ {a1 x2 + b1 : a1 > C, a1 > 16∥b1∥}. Note that E has infinite measure.

Iteration. As {a1 x1 + b1 : a1 > C, a1 > 16∥b1∥} is symmetric to {a1 x2 + b1 : a1 > C, a1 > 16∥b1∥}, we analyze without loss of generality the case where w1 lies in the former set, i.e., w1 = a1 x1 + b1 with a1 > C and a1 > 16∥b1∥. We prove by induction that for all k ≥ 1, wk = ak x_{ik} + bk, where ik equals 1 if k is odd and 2 if k is even, and ak and bk satisfy ak ≥ 16∥bk∥ and ak ≥ 17^{k−1} C.

For k = 1, the claim follows directly from the initialization. For k ≠ 1, suppose the claim holds for all iterations before the k-th. Suppose k is even. Then wk−1 = ak−1 x1 + bk−1, which leads to

∥wk−1∥ ≤ ak−1 + ∥bk−1∥ ≤ (17/16) ak−1, ⟨wk−1, x1⟩ ≥ ak−1 − ∥bk−1∥ ≥ (15/16) ak−1, and ⟨wk−1, x2⟩ ≤ −(1/2) ak−1 + ∥bk−1∥ ≤ −(7/16) ak−1.

Consequently,

wk = wk−1 + ηk−1 e^{−⟨x1,wk−1⟩} x1 + ηk−1 e^{−⟨x2,wk−1⟩} x2 = ηk−1 e^{−⟨x2,wk−1⟩} x2 + (wk−1 + ηk−1 e^{−⟨x1,wk−1⟩} x1).
The norm of wk−1 + ηk−1 e^{−⟨x1,wk−1⟩} x1 can be bounded as

∥wk−1 + ηk−1 e^{−⟨x1,wk−1⟩} x1∥ ≤ ∥wk−1∥ + ηk−1 e^{−⟨x1,wk−1⟩} ≤ (17/16) ak−1 + ηk−1 e^{−(15/16) ak−1},

while the coefficient of x2 can be lower bounded as

ηk−1 e^{−⟨x2,wk−1⟩} ≥ ηk−1 e^{(7/16) ak−1} ≥ η1 e^{(7/16) ak−1 − (1/2) log(k−1)} ≥ η1 e^{(3/8) ak−1 + (1/16) 17^{k−2} C − (1/2)(k−2)} ≥ η1 e^{(3/8) ak−1}.

Therefore,

ηk−1 e^{−⟨x2,wk−1⟩} − 16 ∥wk−1 + ηk−1 e^{−⟨x1,wk−1⟩} x1∥ ≥ η1 e^{(3/8) ak−1} − 17 ak−1 − 16 ηk−1 e^{−(15/16) ak−1} ≥ g(ak−1).

As the induction hypothesis gives ak−1 ≥ 17^{k−2} C ≥ C, we have g(ak−1) ≥ 0. Therefore, choosing ak = ηk−1 e^{−⟨x2,wk−1⟩} and bk = wk−1 + ηk−1 e^{−⟨x1,wk−1⟩} x1 completes the induction step for k. The case where k is odd follows the same routine with x1 and x2 exchanged.

Consequently, for ∥∇f(wk)∥ we have, when k is even,

∥∇f(wk)∥ ≥ ∥x1 e^{−⟨x1,wk⟩}∥ − ∥x2 e^{−⟨x2,wk⟩}∥ ≥ e^{(7/16) ak} − e^{−(15/16) ak} → ∞.

A similar claim holds when k is odd. We thus conclude that ∥∇f(wk)∥ → ∞ as k → ∞. The claim that f(wk) → ∞ as k → ∞ can be derived following the same routine. The proof is completed.

On the other hand, as f in Example 1 satisfies Assumption 1 and Assumption 2 (with D0 = 0), Theorem 1 applies and full-batch Adam converges to a stationary point regardless of the initialization. Example 1 indicates that, even in the full-batch case, Adam is less sensitive to the initialization than GD under the (L0, L1)-smooth condition. However, under the same setting and with the initialization point fixed, tuning down the initial learning rate η1 may help GD converge.
Specifically, when the gradient norm over the loss sub-level set {w : f(w) ≤ f(w0)} is upper bounded, choosing the learning rate sufficiently small (for example, smaller than 2/(L0 + L1 M), where M is the gradient norm upper bound over the loss sub-level set) makes the loss keep decreasing as GD iterates, so the parameter stays in the loss sub-level set. However, such a learning rate can be rather small when M is large, leading to slow convergence of GD, which is exactly the intuition behind Property 1. Furthermore, when n ≥ 2 and SGD is adopted, tuning down the learning rate may not help anymore. We discuss this issue in the following section.
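The construction of Example 3 can be simulated directly (a sketch with our own constants η1 = 0.1, a1 = 30, and a 50-step budget; these are not the values used in the proof):

```python
import math

X1 = (math.sqrt(3) / 2, 0.5)
X2 = (-math.sqrt(3) / 2, 0.5)

def loss(w):
    # f(w) = e^{-<x1, w>} + e^{-<x2, w>}
    return math.exp(-(X1[0] * w[0] + X1[1] * w[1])) + math.exp(-(X2[0] * w[0] + X2[1] * w[1]))

def gd_diverges(w, eta1=0.1, steps=50):
    """Run GD with eta_k = eta1 / sqrt(k); True if the loss overflows a float."""
    for k in range(1, steps + 1):
        try:
            e1 = math.exp(-(X1[0] * w[0] + X1[1] * w[1]))
            e2 = math.exp(-(X2[0] * w[0] + X2[1] * w[1]))
            eta = eta1 / math.sqrt(k)
            w = (w[0] + eta * (e1 * X1[0] + e2 * X2[0]),
                 w[1] + eta * (e1 * X1[1] + e2 * X2[1]))
            loss(w)
        except OverflowError:
            return True
    return False

print(gd_diverges((30 * X1[0], 30 * X1[1])))  # True: start a1*x1 inside E, iterates explode
print(gd_diverges((0.0, 5.0)))                # False: a benign start keeps the loss finite
```

The iterates from the bad initialization bounce between the two exponential walls with exploding magnitude, exactly the alternating pattern of the induction above.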

C.2 COMPARE SGD AND ADAM

First of all, Example 3 can be readily extended to a divergent example for SGD.

Example 4 (Example 1, restated). Let n = 2 and p = 2 with f0((w1, w2)) = e^{−(√3/2)w1 − (1/2)w2}, f1((w1, w2)) = e^{(√3/2)w1 − (1/2)w2}, and f = f0 + f1. Then f satisfies Assumptions 1 and 2 (with D0 = 0). However, for every initial learning rate η1, there exists a region E with infinite measure over R² such that, for SGD with decaying learning rate ηk = η1/√k and starting point in E, f and ∥∇f∥ diverge to infinity.

The proof follows a routine similar to Example 3, and we omit it here. Note that, as D0 = 0 in this case, Adam still converges to a stationary point. On the other hand, as discussed in the previous section, for SGD under the (L0, L1)-smooth condition, a small learning rate may no longer help, and SGD can still diverge. We illustrate this idea by considering the following example.

Example 5 (Diverging example for SGD with small learning rate). Specifically, we rescale x1 and x2 in Example 3 as x̃1 = x1 and x̃2 = 50x2. We choose the initialization point w1 = x̃1 and the learning rate as η1 = 1/L with decaying learning rate η_t = η1/√t, where L is the smoothness upper bound over the loss sub-level set, i.e., L ≜ max{∥∇²f(w)∥2 : f(w) ≤ f(w1)}.

By the discussion in the previous section, GD on this example converges to a stationary point: the iterates stay in the sub-level set, the smoothness stays small, and the loss decreases. However, running SGD on this example, we find that the loss and the gradient norm explode within three steps (Figure 3). The intuition behind Example 5 is that the update computed from a single batch in SGD need not reduce the loss and may fail to keep the iterates in the loss sub-level set. This can eventually make the loss diverge, or keep the parameters away from saddle points in the training of neural networks.
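The intuition, that a single-batch step can increase the loss and leave the sub-level set even when the full-batch step decreases it, can be checked numerically (with our own hand-picked tiny learning rate, not the η1 = 1/L of Example 5):

```python
import math

SCALE = 50.0
X1 = (math.sqrt(3) / 2, 0.5)                    # x1~ = x1
X2 = (-SCALE * math.sqrt(3) / 2, SCALE * 0.5)   # x2~ = 50 * x2

def f0(w): return math.exp(-(X1[0] * w[0] + X1[1] * w[1]))
def f1(w): return math.exp(-(X2[0] * w[0] + X2[1] * w[1]))
def f(w):  return f0(w) + f1(w)

def step(w, c1, c2, eta):
    # gradient step on c1-weighted f0 plus c2-weighted f1; grad of e^{-<x,w>} is -e^{-<x,w>} x
    return (w[0] + eta * (c1 * X1[0] + c2 * X2[0]),
            w[1] + eta * (c1 * X1[1] + c2 * X2[1]))

w0 = (X1[0], X1[1])   # initialization w1 = x1~
eta = 1e-12           # tiny, hand-picked learning rate (illustration only)

w_gd  = step(w0, f0(w0), f1(w0), eta)   # full-batch step sees both components
w_sgd = step(w0, f0(w0), 0.0,    eta)   # mini-batch step sees only f0

print(f(w_gd) < f(w0))    # True: the full-batch step decreases the loss
print(f(w_sgd) > f(w0))   # True: the f0-only step increases it
```

Moving along x̃1 decreases ⟨x̃2, w⟩ (because ⟨x̃1, x̃2⟩ < 0) and hence inflates the dominant e^{−⟨x̃2,w⟩} term, which the single-batch step cannot see.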

D PROOF OF THEOREM 1

This appendix provides the formal proof of Theorem 1. Specifically, we first make some preparations by ① introducing notations and ② proving lemmas that characterize several basic properties of the Adam optimizer, and then prove Theorem 1 based on these lemmas.

Remark 10. In the remaining proof, we assume without loss of generality that η1 is small enough that the following requirements are fulfilled (with notations explained later):

• 2C2√d η1 ≤ 1/L1. This will later ensure that we can directly apply the definition of the (L0, L1)-smooth condition (Assumption 1) to the parameter sequence {w_{k,i}}_{k,i};

• 1/(4(2√2 + 1)) ≥ √D1 C11 η1. This will later ensure that the second-order term is smaller than the first-order term at the end of the proof.

The proof can be easily extended to the general case (while certainly more cumbersome) by selecting a large enough epoch K, using it as a new starting point, and deriving the results after epoch K (since ηk is decaying and K is finite, the epochs before epoch K can be uniformly bounded, and we then derive the desired result for all epochs). Without loss of generality, we also take the following initialization: w_{1,0} = w0, m_{1,−1} = ∇f_{τ_{1,−1}}(w0) (τ_{1,−1} can be any integer in [0, n−1]), and ν_{l,1,−1} = max_j {∂l fj(w0)²} for all l (the maximum is taken component-wise). We take this initialization to keep the proof concise; the proof easily extends to any initialization, as the contribution of the initialization to the exponentially decayed averages of Adam (both m_{k,i} and ν_{k,i}) decays rapidly as k increases.
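The analysis below repeatedly uses the fact that Adam's coordinate-wise ratio |m|/√ν is bounded by the constant C1 of Lemma 3. Before setting up the notation, here is a quick randomized sanity check of that bound (our own sketch; gradient magnitudes are drawn across several orders of magnitude to stress the ratio):

```python
import math, random

random.seed(1)
b1, b2 = 0.9, 0.999                                        # beta1**2 < beta2 holds
C1 = (1 - b1) ** 2 / ((1 - b2) * (1 - b1 ** 2 / b2)) + 1   # C1 from Lemma 3

m = v = 0.0
ok = True
for _ in range(10000):
    g = random.gauss(0.0, 1.0) * 10 ** random.uniform(-3.0, 3.0)
    m = b1 * m + (1 - b1) * g                          # momentum
    v = b2 * v + (1 - b2) * g * g                      # adaptor
    ok = ok and abs(m) / (math.sqrt(v) + 1e-12) <= C1  # Lemma 3's claim, coordinate-wise
print(ok)   # True: the normalized update never exceeds C1
```

The bound follows from the Cauchy–Schwarz inequality applied to the two exponential averages, which is exactly the argument in the proof of Lemma 3 below.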

D.1 PRELIMINARIES D.1.1 NOTATIONS

Here we provide a complete list of notations used in the appendix for clear reference.

• We write (k1, i1) ≤ (<) (k2, i2) for k1, k2 ∈ N+ and i1, i2 ∈ {0, ..., n−1} if either k1 < k2, or k1 = k2 and i1 ≤ (<) i2.

• We define the function g : [0, 1) → R as

g(β2) ≜ max{ 1/√(β2^{n−1}) − 1, 1 − 1/√(β2^{n−1} + 8n(1−β2^{n−1})/β2^n), 1 − β2, √(β2/(1 − 2n(1−β2)/β2^n)) − 1 }.

• We define the constants {Ci}_{i=1}^{13} as follows:

C1 ≜ (1−β1)² / ((1−β2)(1 − β1²/β2)) + 1,
C2 ≜ nC1 + (β1/(1−β1)) C1 (1 + √2),
C3 ≜ C1 ( n(L0 + L1√D0) + 2√2 (L0 + L1√D0) (√(1−β2)/(1−√β2)) (√β2/(1−√β2)) + 8√2 nL0/(1−β2^n) ),
C4 ≜ 4L1C1√D1 · √(1−β2)/(1−√β2),
C5 ≜ n² (1 + n√d C1η1L1√n√D1) (C4 + dC4√D1/(1−β2^n)),
C6 ≜ (dC3 + C4 n√D1/(1−β2^n)) η1²,
C7 ≜ 3n (C4 + dC4/(1−β2^n)) (nL0 + L1√n√D0) n²√d C1η1³ + (dC3 + C2C4 n√D1/(1−β2^n)) η1²,
C8 ≜ (2n²/β2^n) L1√D1 n√n + d (g(β2)(n−1) + ((1+β1)/(1−β1)) √(2n/β2^n))² L1C1√D1 (1 + 1/(1−β2^n)) (n + n^{5/2}√d C1η1L1√D1) + (β1/((1−β1)η1)) √d C1,
C9 ≜ (2n²/β2^n) d (n²L0 + n√n L1√D0) C1η1² + (g(β2)(n−1) + ((1+β1)/(1−β1)) √(2n/β2^n))² (n + (2√2β1/(1−β1)) C1(L0 + L1√D0) d√d η1),
C10 ≜ 3d (g(β2)(n−1) + ((1+β1)/(1−β1)) √(2n/β2^n))² L1C1√D1 (1 + 1/(1−β2^n)) n (nL0 + L1√n√D0) n√d C1η1³ + C9,
C11 ≜ (1/2 + C2) C5 + C8 + 3L1√n√D1 C2² d²,
C12 ≜ (1/2 + C2) C6 + C9 + (3/2)(nL0 + L1√n√D0) C2² d η1²,
C13 ≜ (1/2 + C2) C7 + C10 + (3/2)(nL0 + L1√n√D0) C2² d η1².

D.1.2 AUXILIARY LEMMAS

Here we provide auxiliary lemmas describing basic properties of Adam and the descent lemma under (L 0 , L 1 )-smooth condition.

Smoothness of f

Lemma 2. Under Assumptions 2 and 1, f satisfies the (nL0 + L1√n√D0, L1√n√D1)-smooth condition.

Proof. For all w1, w2 ∈ R^d satisfying ∥w1 − w2∥ ≤ 1/L1,

∥∇f(w1) − ∇f(w2)∥ ≤ Σ_{i=0}^{n−1} ∥∇fi(w1) − ∇fi(w2)∥
≤ Σ_{i=0}^{n−1} (L0 + L1∥∇fi(w1)∥) ∥w1 − w2∥
≤ ( nL0 + L1√n √(Σ_{i=0}^{n−1} ∥∇fi(w1)∥²) ) ∥w1 − w2∥
≤ ( nL0 + L1√n √(D0 + D1∥∇f(w1)∥²) ) ∥w1 − w2∥
≤ ( nL0 + L1√n (√D0 + √D1 ∥∇f(w1)∥) ) ∥w1 − w2∥
= ( nL0 + L1√n√D0 + L1√n√D1 ∥∇f(w1)∥ ) ∥w1 − w2∥.

The proof is completed.
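Lemma 2's aggregation can be sanity-checked numerically on the two-component objective of Example 4 (with our own estimated constants for that particular f: L0 = D0 = 0, L1 = e, D1 = 2; these are illustrative, not the paper's):

```python
import math, random

random.seed(0)
X = [(math.sqrt(3) / 2, 0.5), (-math.sqrt(3) / 2, 0.5)]
n, L0, L1, D0, D1 = 2, 0.0, math.e, 0.0, 2.0

def grad_fi(i, w):
    e = math.exp(-(X[i][0] * w[0] + X[i][1] * w[1]))
    return (-e * X[i][0], -e * X[i][1])

def grad_f(w):
    gs = [grad_fi(i, w) for i in range(n)]
    return (sum(g[0] for g in gs), sum(g[1] for g in gs))

def norm(v):
    return math.hypot(v[0], v[1])

ok = True
for _ in range(1000):
    w1 = (random.uniform(-3, 3), random.uniform(-3, 3))
    d = (random.uniform(-1, 1), random.uniform(-1, 1))
    s = random.uniform(0.0, 1.0 / L1) / max(norm(d), 1e-12)   # ||w2 - w1|| <= 1/L1
    w2 = (w1[0] + s * d[0], w1[1] + s * d[1])
    g1, g2 = grad_f(w1), grad_f(w2)
    lhs = norm((g1[0] - g2[0], g1[1] - g2[1]))
    rhs = (n * L0 + L1 * math.sqrt(n) * (math.sqrt(D0) + math.sqrt(D1) * norm(g1))) \
          * norm((w2[0] - w1[0], w2[1] - w1[1]))
    ok = ok and lhs <= rhs + 1e-9
print(ok)   # True: the aggregated smoothness bound of Lemma 2 holds at every sampled pair
```

The check confirms that the full-batch gradient difference is controlled by the full-batch gradient norm, even though the condition is stated component-wise.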

Basic Properties of Adam

The following lemma characterizes the update norm of Adam.

Lemma 3 (Bounded Update). If β1 < √β2, then for all k ∈ N+ and i ∈ {0, ..., n−1},

|m_{l,k,i}| / (√ν_{l,k,i} + ε) ≤ C1, where C1 ≜ (1−β1)² / ((1−β2)(1 − β1²/β2)) + 1.

Furthermore, |w_{l,k,i+1} − w_{l,k,i}| ≤ C1 ηk, and thus ∥w_{k,i+1} − w_{k,i}∥ ≤ C1 ηk √d.

Proof. By the definition of m_{k,i}, we have

m_{l,k,i}² = ( (1−β1) Σ_{j=0}^{i} β1^{i−j} ∂l f_{τ_{k,j}}(w_{k,j}) + (1−β1) Σ_{m=1}^{k−1} Σ_{j=0}^{n−1} β1^{(k−1)n+i−((m−1)n+j)} ∂l f_{τ_{m,j}}(w_{m,j}) + β1^{(k−1)n+i+1} ∂l f_{τ_{1,−1}}(w_{1,0}) )²
≤ ( (1−β1) Σ_{j=0}^{i} β1^{i−j} |∂l f_{τ_{k,j}}(w_{k,j})| + (1−β1) Σ_{m=1}^{k−1} Σ_{j=0}^{n−1} β1^{(k−1)n+i−((m−1)n+j)} |∂l f_{τ_{m,j}}(w_{m,j})| + β1^{(k−1)n+i+1} max_{s∈[n]} |∂l fs(w_{1,0})| )²
≤ ( (1−β2) Σ_{j=0}^{i} β2^{i−j} |∂l f_{τ_{k,j}}(w_{k,j})|² + (1−β2) Σ_{m=1}^{k−1} Σ_{j=0}^{n−1} β2^{(k−1)n+i−((m−1)n+j)} |∂l f_{τ_{m,j}}(w_{m,j})|² + β2^{(k−1)n+i+1} max_{s∈[n]} |∂l fs(w_{1,0})|² ) · ( ((1−β1)²/(1−β2)) Σ_{j=0}^{(k−1)n+i} (β1²/β2)^j + (β1²/β2)^{(k−1)n+i+1} )   (⋆)
= ( ((1−β1)²/(1−β2)) Σ_{j=0}^{(k−1)n+i} (β1²/β2)^j + (β1²/β2)^{(k−1)n+i+1} ) ν_{l,k,i}   (*)
≤ ( (1−β1)² / ((1−β2)(1 − β1²/β2)) + 1 ) ν_{l,k,i} = C1 ν_{l,k,i},

where Eq. (⋆) is due to the Cauchy–Schwarz inequality and Eq. (*) is due to the definition of ν_{l,1,−1}. This completes the proof of the first claim. The second claim then follows directly from the update rule w_{l,k,i+1} − w_{l,k,i} = −ηk m_{l,k,i} / (√ν_{l,k,i} + ε). The proof is completed.

Based on Lemma 3, we then provide estimates of the norms of the momentum and the adaptor.

Lemma 4 (Estimation of the norm of the momentum). For all l ∈ [d], k ∈ Z+, and i ∈ [n],

|m_{l,k,i}| ≤ max_{i′∈[n]} |∂l f_{i′}(w_{k,0})| + ( n + 2√2 β1/(1−β1) ) C1 (L0 + L1√D0) √d ηk + L1C1√D1 ηk Σ_{j=0}^{i−1} ∥∇f(w_{k,j})∥ + L1C1√D1 Σ_{t=1}^{k−1} η_{k−t} Σ_{j=0}^{n−1} β1^{tn+i−j} ∥∇f(w_{k−t,j})∥.
Similarly, for l ∈ [d] and k ∈ Z+\{1},

|m_{l,k−1,n−1}| ≤ max_{i′∈[n]} |∂l f_{i′}(w_{k,0})| + Σ_{t=1}^{k−1} Σ_{j=0}^{n−1} β1^{tn−1−j} C1 η_{k−t} √d L1√D1 ∥∇f(w_{k−t,j})∥ + 2√2 (L0 + L1√D0) C1 √d ηk / (1−β1).

Proof. To begin with, for any t ∈ [k−1] and any j ∈ [0, n−1], we have the following estimate of ∂l fi(w_{k−t,j}):

|∂l fi(w_{k−t,j})| ≤ |∂l fi(w_{k,0})| + Σ_{p=j}^{n−1} |∂l fi(w_{k−t,p}) − ∂l fi(w_{k−t,p+1})| + Σ_{r=1}^{t−1} Σ_{p=0}^{n−1} |∂l fi(w_{k−r,p}) − ∂l fi(w_{k−r,p+1})|
≤ |∂l fi(w_{k,0})| + Σ_{p=j}^{n−1} (L0 + L1 ∥∇fi(w_{k−t,p})∥) ∥w_{k−t,p} − w_{k−t,p+1}∥ + Σ_{r=1}^{t−1} Σ_{p=0}^{n−1} (L0 + L1 ∥∇fi(w_{k−r,p})∥) ∥w_{k−r,p} − w_{k−r,p+1}∥   (⋆)
≤ |∂l fi(w_{k,0})| + Σ_{p=j}^{n−1} (L0 + L1 ∥∇fi(w_{k−t,p})∥) C1 η_{k−t} √d + Σ_{r=1}^{t−1} Σ_{p=0}^{n−1} (L0 + L1 ∥∇fi(w_{k−r,p})∥) C1 η_{k−r} √d
≤ |∂l fi(w_{k,0})| + Σ_{p=j}^{n−1} ( L0 + L1 √(Σ_{i′∈[n]} ∥∇f_{i′}(w_{k−t,p})∥²) ) C1 η_{k−t} √d + Σ_{r=1}^{t−1} Σ_{p=0}^{n−1} ( L0 + L1 √(Σ_{i′∈[n]} ∥∇f_{i′}(w_{k−r,p})∥²) ) C1 η_{k−r} √d
≤ |∂l fi(w_{k,0})| + Σ_{p=j}^{n−1} ( L0 + L1√D1 ∥∇f(w_{k−t,p})∥ + L1√D0 ) C1 η_{k−t} √d + Σ_{r=1}^{t−1} Σ_{p=0}^{n−1} ( L0 + L1√D1 ∥∇f(w_{k−r,p})∥ + L1√D0 ) C1 η_{k−r} √d
≤ |∂l fi(w_{k,0})| + Σ_{p=j}^{n−1} L1√D1 ∥∇f(w_{k−t,p})∥ C1 η_{k−t} √d + Σ_{r=1}^{t−1} Σ_{p=0}^{n−1} L1√D1 ∥∇f(w_{k−r,p})∥ C1 η_{k−r} √d + 2√2 (L0 + L1√D0) C1 √d ηk (tn − j),   (*)

where Inequality (⋆) is due to the (L0, L1)-smooth condition, and Inequality (*) is due to the fact that, for a, b ∈ N+ with a > b, Σ_{i=0}^{b} 1/√(a−i) ≤ 2(b+1)/√a.

Similarly, for any j ∈ [0, n−1],

|∂l fi(w_{k,j})| ≤ |∂l fi(w_{k,0})| + Σ_{p=0}^{j−1} |∂l fi(w_{k,p+1}) − ∂l fi(w_{k,p})|
≤ |∂l fi(w_{k,0})| + Σ_{p=0}^{j−1} ( L0 + L1√D1 ∥∇f(w_{k,p})∥ + L1√D0 ) C1 ηk √d
= |∂l fi(w_{k,0})| + Σ_{p=0}^{j−1} L1√D1 ∥∇f(w_{k,p})∥ C1 ηk √d + j (L0 + L1√D0) C1 √d ηk.
Therefore, the norm of m_{l,k,i} can be bounded as

|m_{l,k,i}| ≤ (1−β1) Σ_{j=0}^{i} β1^{i−j} |∂l f_{τ_{k,j}}(w_{k,j})| + (1−β1) Σ_{t=1}^{k−1} Σ_{j=0}^{n−1} β1^{tn+i−j} |∂l f_{τ_{k−t,j}}(w_{k−t,j})| + β1^{(k−1)n+i+1} |∂l f_{τ_{1,0}}(w_{1,0})|
≤ (1−β1) Σ_{j=0}^{i} β1^{i−j} |∂l f_{τ_{k,j}}(w_{k,0})| + (1−β1) Σ_{t=1}^{k−1} Σ_{j=0}^{n−1} β1^{tn+i−j} |∂l f_{τ_{k−t,j}}(w_{k,0})| + β1^{(k−1)n+i+1} |∂l f_{τ_{1,0}}(w_{k,0})|
+ (1−β1) Σ_{j=0}^{i} β1^{i−j} ( Σ_{p=0}^{j−1} C1 ηk √d L1√D1 ∥∇f(w_{k,p})∥ + (L0 + L1√D0) C1 ηk √d j )
+ (1−β1) Σ_{t=1}^{k−1} Σ_{j=0}^{n−1} β1^{tn+i−j} ( Σ_{p=j}^{n−1} C1 η_{k−t} √d L1√D1 ∥∇f(w_{k−t,p})∥ + Σ_{r=1}^{t−1} Σ_{p=0}^{n−1} C1 η_{k−r} √d L1√D1 ∥∇f(w_{k−r,p})∥ + 2√2 (L0 + L1√D0) C1 √d ηk (tn − j) )
+ β1^{(k−1)n+i+1} ( Σ_{t=1}^{k−1} Σ_{p=0}^{n−1} L1√D1 ∥∇f(w_{k−t,p})∥ C1 η_{k−t} √d + 2√2 (L0 + L1√D0) C1 √d ηk (k−1)n )
≤ max_{i′∈[n]} |∂l f_{i′}(w_{k,0})| + ( n + 2√2 β1/(1−β1) ) √d C1 (L0 + L1√D0) ηk + L1C1√D1 ηk Σ_{j=0}^{i−1} ∥∇f(w_{k,j})∥ + L1C1√D1 Σ_{t=1}^{k−1} η_{k−t} Σ_{j=0}^{n−1} β1^{tn+i−j} ∥∇f(w_{k−t,j})∥,   (⋆)

where Inequality (⋆) is due to an exchange of the summation order. Following the same routine, we have

|m_{l,k,−1}| ≤ (1−β1) Σ_{t=1}^{k−1} Σ_{j=0}^{n−1} β1^{tn−1−j} |∂l f_{τ_{k−t,j}}(w_{k−t,j})| + β1^{(k−1)n} |∂l f_{τ_{1,0}}(w_{1,0})|
≤ max_{i′∈[n]} |∂l f_{i′}(w_{k,0})| + Σ_{t=1}^{k−1} Σ_{j=0}^{n−1} β1^{tn−1−j} C1 η_{k−t} √d L1√D1 ∥∇f(w_{k−t,j})∥ + 2√2 (L0 + L1√D0) C1 √d ηk / (1−β1).

The proof is completed.

Lemma 5 (Estimation of the norm of the adaptor).
For all l ∈ [d] and k ∈ Z+,

ν_{l,k,0} ≥ β2^n ((1−β2)/(1−β2^n)) Σ_{i∈[n]} ∂l fi(w_{k,0})² − √(Σ_{i∈[n]} ∂l fi(w_{k,0})²) · ( 8√2 n ηk C1L0 (1−β2) / ((1−β2^n)² β2^n) + 4L1C1 ((1−β2)/(1−β2^n)) (√(1−β2)/(1−√β2)) Σ_{t=1}^{k−1} β2^n β2^{(t−1)n} η_{k−t} Σ_{j=0}^{n−1} ( √D1 ∥∇f(w_{k−t,j})∥ + √D0 ) ),

and

ν_{l,k,0} ≤ 2 max_{i∈[n]} ∂l fi(w_{k,0})² + 2 ( 2√2 ηk C1 (L0 + L1√D0) (√(1−β2)/(1−√β2)) (√β2/(1−√β2)) + L1C1√D1 Σ_{t=1}^{k−1} η_{k−t} (√(1−β2)/(1−√β2)) Σ_{j=0}^{n−1} β2^{(t−1)n} ∥∇f(w_{k−t,j})∥ )².

Proof. By the definition of ν_{l,k,0}, we have

ν_{l,k,0} = (1−β2) ∂l f_{τ_{k,0}}(w_{k,0})² + Σ_{t=1}^{k−1} Σ_{j=0}^{n−1} (1−β2) β2^{tn−j} ∂l f_{τ_{k−t,j}}(w_{k−t,j})² + β2^{(k−1)n+1} max_{i∈[n]} ∂l fi(w_{1,0})²
≥ (1−β2) ∂l f_{τ_{k,0}}(w_{k,0})² + Σ_{t=1}^{k−1} Σ_{j=0}^{n−1} (1−β2) β2^{tn} ∂l f_{τ_{k−t,j}}(w_{k−t,j})² + β2^{(k−1)n+1} (1/n) Σ_{i=1}^{n} ∂l fi(w_{1,0})².

Writing ∂l f_{τ_{k−t,j}}(w_{k−t,j}) = ∂l f_{τ_{k−t,j}}(w_{k,0}) + (∂l f_{τ_{k−t,j}}(w_{k−t,j}) − ∂l f_{τ_{k−t,j}}(w_{k,0})) (and similarly ∂l fi(w_{1,0}) = ∂l fi(w_{k,0}) + (∂l fi(w_{1,0}) − ∂l fi(w_{k,0}))) and using (a + b)² ≥ a² − 2|a||b|, we obtain

ν_{l,k,0} ≥ (1−β2) ∂l f_{τ_{k,0}}(w_{k,0})² + Σ_{t=1}^{k−1} Σ_{j=0}^{n−1} (1−β2) β2^{tn} ∂l f_{τ_{k−t,j}}(w_{k,0})² + β2^{(k−1)n+1} (1/n) Σ_{i=1}^{n} ∂l fi(w_{k,0})²
− 2 Σ_{t=1}^{k−1} Σ_{j=0}^{n−1} (1−β2) β2^{tn} |∂l f_{τ_{k−t,j}}(w_{k,0})| |∂l f_{τ_{k−t,j}}(w_{k,0}) − ∂l f_{τ_{k−t,j}}(w_{k−t,j})|
− 2 β2^{(k−1)n+1} (1/n) Σ_{i=1}^{n} |∂l fi(w_{k,0})| |∂l fi(w_{k,0}) − ∂l fi(w_{1,0})|,

where the first three (main) terms sum to at least β2^n ((1−β2)/(1−β2^n)) Σ_{i∈[n]} ∂l fi(w_{k,0})². Bounding each gradient difference by the estimate from the proof of Lemma 4, bounding |∂l fi(w_{k,0})| ≤ √(Σ_{i∈[n]} ∂l fi(w_{k,0})²), and summing the geometric series in t (using β2^{tn} = β2^n β2^{(t−1)n} and √(1−β2)/(1−√β2) ≥ 1) yields the claimed lower bound.

As for the upper bound, using (a + b)² ≤ 2a² + 2b² on the same decomposition,

ν_{l,k,0} ≤ 2 (1−β2) ∂l f_{τ_{k,0}}(w_{k,0})² + 2 Σ_{t=1}^{k−1} Σ_{j=0}^{n−1} (1−β2) β2^{tn−j} ∂l f_{τ_{k−t,j}}(w_{k,0})² + 2 β2^{(k−1)n+1} max_{i∈[n]} ∂l fi(w_{k,0})² + 2 Σ_{t=1}^{k−1} Σ_{j=0}^{n−1} (1−β2) β2^{tn−j} Δ_{t,j}² + 2 β2^{(k−1)n+1} Δ_{k−1,0}²,

where Δ_{t,j} denotes the gradient-difference bound from the proof of Lemma 4. Since (1−β2) + Σ_{t=1}^{k−1} Σ_{j=0}^{n−1} (1−β2) β2^{tn−j} + β2^{(k−1)n+1} = 1, the first three terms sum to at most 2 max_{i∈[n]} ∂l fi(w_{k,0})², and applying the Cauchy–Schwarz inequality to the remaining terms gives

ν_{l,k,0} ≤ 2 max_{i∈[n]} ∂l fi(w_{k,0})² + 2 ( 2√2 ηk C1 (L0 + L1√D0) (√(1−β2)/(1−√β2)) (√β2/(1−√β2)) + L1C1√D1 Σ_{t=1}^{k−1} η_{k−t} (√(1−β2)/(1−√β2)) Σ_{j=0}^{n−1} β2^{(t−1)n} ∥∇f(w_{k−t,j})∥ )².

The proof is completed.

We then immediately obtain the following corollary when max_{i∈[n]} |∂l fi(w_{k,0})| is large enough.

Corollary 2 (Lemma 1, formal). If

max_{i∈[n]} |∂l fi(w_{k,0})| ≥ 4L1C1 (√(1−β2)/(1−√β2)) Σ_{r=1}^{k−1} β2^{(r−1)n} η_{k−r} Σ_{j=0}^{n−1} ( √D1 ∥∇f(w_{k−r,j})∥ + √D0 ) + 2√2 ηk C1 (L0 + L1√D0) (√(1−β2)/(1−√β2)) (√β2/(1−√β2)) + 8√2 n ηk C1L0 / (1−β2^n) + ηk C1 ( n(L0 + L1√D0) + L1√D1 Σ_{p=0}^{n−1} ∥∇f(w_{k,p})∥ ),   (11)

then

(β2^n/2) (1/n) Σ_{i∈[n]} ∂l fi(w_{k,0})² ≤ ν_{l,k,0} ≤ 4 max_{i∈[n]} ∂l fi(w_{k,0})².

Furthermore, if Eq. (11) holds, we have, for all i ∈ {0, ..., n−1},

β2^{n−1} ν_{l,k,0} ≤ ν_{l,k,i} ≤ ( β2^{n−1} + 8n(1−β2^{n−1})/β2^n ) ν_{l,k,0},
(1/β2) ( 1 − 2n(1−β2)/β2^n ) ν_{l,k,0} ≤ ν_{l,k,−1} ≤ (1/β2) ν_{l,k,0}.

Proof. The first claim is obtained by directly substituting the assumed range of max_{i∈[n]} |∂l fi(w_{k,0})| into Lemma 5. As for the second claim, we have

ν_{l,k,i} = β2^i ν_{l,k,0} + (1−β2) ( ∂l f_{τ_{k,i}}(w_{k,i})² + ... + β2^{i−1} ∂l f_{τ_{k,1}}(w_{k,1})² ).

On the other hand, since for all j ∈ {0, ..., n−1}

|∂l fi(w_{k,j})| ≤ max_{p∈[n]} |∂l fp(w_{k,0})| + ηk C1 ( j(L0 + L1√D0) + L1√D1 Σ_{p=0}^{j−1} ∥∇f(w_{k,p})∥ ) ≤ max_{p∈[n]} |∂l fp(w_{k,0})| + ηk C1 ( n(L0 + L1√D0) + L1√D1 Σ_{p=0}^{n−1} ∥∇f(w_{k,p})∥ ),

we have

β2^{n−1} ν_{l,k,0} ≤ ν_{l,k,i} ≤ β2^i ν_{l,k,0} + 2(1−β2^i) max_{p∈[n]} ∂l fp(w_{k,0})² + 2(1−β2^i) ηk² C1² ( n(L0 + L1√D0) + L1√D1 Σ_{p=0}^{n−1} ∥∇f(w_{k,p})∥ )².

Therefore, if Eq. (11) holds, we then have

ν_{l,k,i} ≤ β2^i ν_{l,k,0} + 4(1−β2^i) max_{p∈[n]} ∂l fp(w_{k,0})² ≤ β2^i ν_{l,k,0} + (8n(1−β2^i)/β2^n) ν_{l,k,0} ≤ ( β2^{n−1} + 8n(1−β2^{n−1})/β2^n ) ν_{l,k,0}.
Following the same routine, we have β 2 ν l,k,-1 ≤ ν l,k,0 , and if Eq. (11) holds, ν l,k,-1 = 1 β 2 ν l,k,0 -(1 -β 2 )∂ l f τ k,0 (w k,0 ) 2 ≥ 1 β 2 ν l,k,0 -(1 -β 2 ) max p ∂ l f p (w k,0 ) 2 ≥ν l,k,0 1 β 2 1 -(1 -β 2 ) 2n β n 2 . The proof of the second claim is completed. Remark 11. For brevity, we denote C 3 ≜ C 1 n(L 0 + L 1 D 0 ) + 2 √ 2(L 0 + L 1 D 0 ) √ 1 -β 2 1 - √ β 2 √ β 2 1 - √ β 2 + 8 √ 2nL 0 1 1 -β n 2 , C 4 ≜ 4L 1 C 1 D 1 √ 1 -β 2 1 - √ β 2 . The right-hand side of Eq. (11) is upper bounded by C3η k + C4 k-1 r=1 β2 (r-1)n η k-r n-1 j=0 ∥∇f (w k-r,j )∥ + C4n k-1 r=1 β2 (r-1)n η k-r + η k C4 n-1 j=0 ∥∇f (w k,j )∥, and hence Eq. (11) is implied by the condition max i∈[n] |∂ l fi(w k,0 )| ≥ C3η k + C4 k-1 r=1 β2 (r-1)n η k-r n-1 j=0 ∥∇f (w k-r,j )∥ + C4n k-1 r=1 β2 (r-1)n η k-r + η k C4 n-1 j=0 ∥∇f (w k,j )∥ . (12) Furthermore, we define g(β 2 ) as g(β 2 ) ≜ max 1 √ β 2 n-1 -1, 1 - 1 β n-1 2 + 8n 1-β n-1 2 β n 2 , 1 -β 2 , β 2 1 -(1 -β 2 ) 2n β n 2 -1 , and the conclusion of Corollary 2 can be restated as follows: if Eq. (12) holds, then 1 √ ν l,k,i - 1 √ ν l,k,0 ≤ g(β 2 ) 1 √ ν l,k,0 , and 1 √ ν l,k,-1 - 1 √ ν l,k,0 ≤ g(β 2 ) 1 √ ν l,k,0 . At the end of this section, we relate the gradients across one epoch. Lemma 6. ∀k ∈ N + , i ∈ {0, • • • , n -1}, ∥∇f (w k,i )∥ ≤ (1 + n √ dC 1 η 1 L 1 √ n D 1 )∥∇f (w k,0 )∥ + nL 0 + L 1 √ n D 0 n √ dC 1 η k . Proof. By Assumption 1, we have ∥∇f (w k,i )∥ ≤∥∇f (w k,0 )∥ + nL 0 + L 1 n i=1 ∥∇f i (w k,0 )∥ ∥w k,i -w k,0 ∥ ≤∥∇f (w k,0 )∥ + nL 0 + L 1 n i=1 ∥∇f i (w k,0 )∥ i √ dC 1 η k ≤∥∇f (w k,0 )∥ +   nL 0 + L 1 √ n n i=1 ∥∇f i (w k,0 )∥ 2   i √ dC 1 η k ≤∥∇f (w k,0 )∥ + nL 0 + L 1 √ n D 1 ∥∇f (w k,0 )∥ + L 1 √ n D 0 i √ dC 1 η k ≤(1 + n √ dC 1 η 1 L 1 √ n D 1 )∥∇f (w k,0 )∥ + nL 0 + L 1 √ n D 0 n √ dC 1 η k . The proof is completed. Descent Lemma Under the (L 0 , L 1 )-Smooth Condition We need a descent lemma under the (L 0 , L 1 )-smooth condition, similar to the one available under L-smoothness.
Specifically, for a function h : X → R satisfying the L-smooth condition and two points w and v in the domain X , by Taylor's expansion, we have h(w) ≤ h(v) + ⟨∇h(v), w -v⟩ + L 2 ∥w -v∥ 2 . This is called the "Descent Lemma" in the existing literature (Sra, 2014), as it guarantees that the loss decreases with a proper parameter update. Parallel to the above inequality, we have the following descent lemma under (L 0 , L 1 )-smoothness. Lemma 7. Assume that the function h : X → R satisfies the (L 0 , L 1 )-smooth condition, i.e., ∀w, v ∈ X satisfying ∥w -v∥ ≤ 1 L1 , ∥∇h(w) -∇h(v)∥ ≤ (L 0 + L 1 ∥∇h(v)∥)∥w -v∥. Then, for any three points u, w, v ∈ X satisfying ∥w -u∥ ≤ 1 L1 and ∥v -u∥ ≤ 1 L1 , we have h(w) ≤ h(v) + ⟨∇h(u), w -v⟩ + 1 2 (L 0 + L 1 ∥∇h(u)∥)(∥v -u∥ + ∥w -u∥)∥w -v∥. Proof. By the fundamental theorem of calculus, we have h(w) =h(v) + 1 0 ⟨∇h(v + a(w -v)), w -v⟩da =h(v) + ⟨∇h(u), w -v⟩ + 1 0 ⟨∇h(v + a(w -v)) -∇h(u), w -v⟩da ≤h(v) + ⟨∇h(u), w -v⟩ + 1 0 ∥∇h(v + a(w -v)) -∇h(u)∥∥w -v∥da (⋆) ≤ h(v) + ⟨∇h(u), w -v⟩ + 1 0 (L 0 + L 1 ∥∇h(u)∥)∥v + a(w -v) -u∥∥w -v∥da ≤h(v) + ⟨∇h(u), w -v⟩ + 1 0 (L 0 + L 1 ∥∇h(u)∥)((1 -a)∥v -u∥ + a∥w -u∥)∥w -v∥da ≤h(v) + ⟨∇h(u), w -v⟩ + 1 2 (L 0 + L 1 ∥∇h(u)∥)(∥v -u∥ + ∥w -u∥)∥w -v∥, where Inequality (⋆) is due to ∥v + a(w -v) -u∥ = ∥(1 -a)(v -u) + a(w -u)∥ ≤ (1 -a)∥v -u∥ + a∥w -u∥ ≤ 1 L 1 , and thus the definition of the (L 0 , L 1 )-smooth condition can be applied. The proof is completed.
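Lemma 7 can be spot-checked numerically. Below is a minimal sketch with a test function and constants of our own choosing (h(x) = cosh(x) in one dimension, which satisfies the (L 0 , L 1 )-smooth condition with L 0 = L 1 = e, since by the mean value theorem cosh(ξ) ≤ e^{1/e} cosh(v) ≤ e(1 + |sinh(v)|) whenever |ξ - v| ≤ 1/e); none of this is part of the paper's proof:

```python
import math
import random

# Numeric spot-check of Lemma 7 on h(x) = cosh(x), which satisfies the
# (L0, L1)-smooth condition with L0 = L1 = e. These constants are
# illustrative choices of ours, not taken from the paper.
L0 = L1 = math.e

def h(x):
    return math.cosh(x)

def dh(x):
    return math.sinh(x)

def lemma7_holds(u, v, w):
    # h(w) <= h(v) + dh(u)(w - v) + (1/2)(L0 + L1|dh(u)|)(|v - u| + |w - u|)|w - v|
    rhs = (h(v) + dh(u) * (w - v)
           + 0.5 * (L0 + L1 * abs(dh(u))) * (abs(v - u) + abs(w - u)) * abs(w - v))
    return h(w) <= rhs + 1e-9

rng = random.Random(0)
checks = []
for _ in range(1000):
    u = rng.uniform(-3.0, 3.0)
    v = u + rng.uniform(-1 / L1, 1 / L1)  # keep both trial points within 1/L1 of u
    w = u + rng.uniform(-1 / L1, 1 / L1)
    checks.append(lemma7_holds(u, v, w))
```

The check passes for every sampled triple, as the lemma guarantees whenever both points stay within a 1/L1-ball around u.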

D.2 PROOF OF ADAM'S CONVERGENCE

Proof of Theorem 1. We define u k ≜ w k,0 -β1w k,-1 1-β1 (with w 1,-1 ≜ w 1,0 = w 0 ), and let u l,k be the l-th component of u k , ∀k ∈ N + , l ∈ [d]. Then, by Lemma 3, we immediately have that ∀l ∈ [d], |u l,k -w l,k,0 | is bounded as |u l,k -w l,k,0 | = w l,k,0 -β 1 w l,k,-1 1 -β 1 -w l,k,0 = β 1 1 -β 1 |w l,k,0 -w l,k,-1 | ≤ β 1 1 -β 1 C 1 η 1 1 √ k -1 (13) ≤ √ 2β 1 1 -β 1 C 1 η 1 1 √ k ≤ √ 2β 1 1 -β 1 C 1 η k ≤ C 2 η k , and |u l,k+1 -u l,k | = w l,k+1,0 -β 1 w l,k+1,-1 1 -β 1 - w l,k,0 -β 1 w l,k,-1 1 -β 1 = (w l,k+1,0 -w l,k,0 ) + β 1 1 -β 1 (w l,k+1,0 -w l,k+1,-1 ) - β 1 1 -β 1 (w l,k,0 -w l,k,-1 ) ≤ (w l,k+1,0 -w l,k,0 ) + β 1 1 -β 1 (w l,k+1,0 -w l,k+1,-1 ) - β 1 1 -β 1 (w l,k,0 -w l,k,-1 ) ≤nC 1 η 1 1 √ k + β 1 1 -β 1 C 1 η 1 1 √ k + √ 2 √ k = C 2 η 1 1 √ k = C 2 η k , where C 2 is defined as C 2 ≜ nC 1 + β1 1-β1 C 1 1 + √ 2 . We then analyze the change of the Lyapunov function f (u k ) along the iterations. Specifically, by Lemma 7, we have f (u k+1 ) ≤f (u k ) + ⟨∇f (w k,0 ), u k+1 -u k ⟩ + nL 0 + L 1 i∈[n] ∥∇f i (w k,0 )∥ 2 (∥w k,0 -u k ∥ + ∥w k,0 -u k+1 ∥)∥u k+1 -u k ∥ ≤f (u k ) + ⟨∇f (w k,0 ), u k+1 -u k ⟩ + nL 0 + L 1 i∈[n] ∥∇f i (w k,0 )∥ 2 3C 2 2 dη 2 k ≤f (u k ) + ⟨∇f (w k,0 ), u k+1 -u k ⟩ + nL 0 + L 1 √ n i∈[n] ∥∇f i (w k,0 )∥ 2 2 3C 2 2 dη 2 k ≤f (u k ) + ⟨∇f (w k,0 ), u k+1 -u k ⟩ + nL 0 + L 1 √ n D 0 + D 1 ∥∇f (w k,0 )∥ 2 2 3C 2 2 dη 2 k ≤f (u k ) + ⟨∇f (w k,0 ), u k+1 -u k ⟩ + nL 0 + L 1 √ n( √ D 0 + √ D 1 ∥∇f (w k,0 )∥) 2 3C 2 2 dη 2 k ( * ) = f (u k ) + l∈L k large ∂ l f (w k,0 )(u l,k+1 -u l,k ) + l∈L k small ∂ l f (w k,0 )(u l,k+1 -u l,k ) + nL 0 + L 1 √ n √ D 0 2 3C 2 2 dη 2 k + 3L 1 √ n √ D 1 C 2 2 dη 2 k 2 ∥∇f (w k,0 )∥. Here in Eq. ( * ), L k large and L k small are respectively defined as L k large = {l : l ∈ [d], s.t. Eq. (12) holds}, L k small = {l : l ∈ [d], s.t. Eq. (12) doesn't hold}, so that L k large ∪ L k small = [d]. We then tackle l∈L k large ∂ l f (w k,0 )(u l,k+1 -u l,k ) and l∈L k small ∂ l f (w k,0 )(u l,k+1 -u l,k ) respectively.
①Analysis for l∈L k small ∂ l f (w k,0 )(u l,k+1 -u l,k ): By directly applying the range of max i∈[n] |∂ l f i (w k,0 )|, we have 1 n l∈L k small ∂ l f (w k,0 )(u l,k+1 -u l,k ) ≤dC2η k C3η k + C4 k-1 r=1 β2 (r-1)n η k-r n-1 j=0 ∥∇f (w k-r,j )∥ +C4n k-1 r=1 β2 (r-1)n η k-r + η k C4 n-1 p=0 ∥∇f (w k,p )∥ . Summing over k from 1 to t then leads to 1 n T k=1 l∈L k small ∂ l f (w k,0 )(u l,k+1 -u l,k ) ≤ T k=1 dC2C3η 2 k + dC2C4 T k=1 η k k-1 r=1 β2 (r-1)n η k-r n-1 j=0 ∥∇f (w k-r,j )∥ + C2C4n T k=1 η k k-1 r=1 β2 (r-1)n η k-r + C2C4 T k=1 η 2 k n-1 p=0 ∥∇f ≤ T k=1 dC2C3η 2 k + dC2C4 1 -β n 2 T -1 k=1 η 2 k n-1 j=0 ∥∇f (w k,j )∥ + C2C4n 1 -β n 2 T -1 k=1 η 2 k + C2C4 T k=1 η 2 k n-1 p=0 ∥∇f (w k,p )∥ ≤ dC2C3 + C2C4n 1 -β n 2 η 2 1 (1 + ln T ) + C2C4 + dC2C4 1 -β n 2 T k=1 η 2 k n-1 j=0 ∥∇f (w k,j )∥, which by Lemma 6 further leads to T k=1 l∈L k small ∂ l f (w k,0 )(u l,k+1 -u l,k ) ≤n C2C4 + dC2C4 1 -β n 2 T k=1 η 2 k n-1 j=0 (1 + n √ dC1η1L1 √ n)∥∇f (w k,0 )∥ + nL0 + L1 √ n √ D0 n √ dC1η k + n dC2C3 + C2C4n 1 -β n 2 η 2 1 (1 + ln T ) ≤n 2 (1 + n √ dC1η1L1 √ n) C2C4 + dC2C4 1 -β n 2 T k=1 η 2 k ∥∇f (w k,0 )∥ + dC2C3 + C2C4n 1 -β n 2 η 2 1 (1 + ln T ) + n C2C4 + dC2C4 1 -β n 2 nL0 + L1 √ n √ D0 n 2 √ dC1 T k=1 η 3 k ≤n 2 (1 + n √ dC1η1L1 √ n √ D1) C2C4 + dC2C4 √ D1 1 -β n 2 T k=1 η 2 k ∥∇f (w k,0 )∥ + dC2C3 + C2C4n √ D1 1 -β n 2 η 2 1 (1 + ln T ) + 3n C2C4 + dC2C4 1 -β n 2 nL0 + L1 √ n √ D0 n 2 √ dC1η 3 1 . We further define C5 ≜ n 2 (1 + n √ dC1η1L1 √ n √ D1) C4 + dC4 √ D1 1 -β n 2 , C6 ≜ dC3 + C4n √ D1 1 -β n 2 η 2 1 , C7 ≜ 3n C4 + dC4 1 -β n 2 nL0 + L1 √ n √ D0 n 2 √ dC1η 3 1 + dC3 + C2C4n √ D1 1 -β n 2 η 2 1 , and thus T k=1 l∈L k small ∂ l f (w k,0 )(u l,k+1 -u l,k ) ≤ C 2 C 5 T k=1 η 2 k ∥∇f (w k,0 )∥ + C 6 ln T + C 7 . ②Analysis for l∈L k large ∂ l f (w k,0 )(u l,k+1 -u l,k ): This term requires a more sophisticated analysis. To begin with, we provide a decomposition of u k+1 -u k . 
According to the definition of u k , we have u k+1 -u k = (w k+1,0 -β1w k+1,-1 ) -(w k,0 -β1w k,-1 ) 1 -β1 = (w k+1,0 -w k,0 ) -β1(w k+1,-1 -w k,-1 ) 1 -β1 = n-1 i=0 (w k,i+1 -w k,i ) -β1 n-1 i=0 (w k,i -w k,i-1 ) 1 -β1 = (w k+1,0 -w k+1,-1 ) + (1 -β1) n-2 i=0 (w k,i+1 -w k,i ) -β1(w k,0 -w k,-1 ) 1 -β1 (⋆) = - η k √ ν k,n-1 ⊙ m k,n-1 + (1 -β1) n-2 i=0 η k √ ν k,i ⊙ m k,i -β1 η k-1 √ ν k-1,n-1 ⊙ m k-1,n-1 1 -β1 = - η k √ ν k,0 ⊙ m k,n-1 + (1 -β1) n-2 i=0 m k,i -β1m k-1,n-1 1 -β1 -η k 1 √ ν k,n-1 - 1 √ ν k,0 ⊙ m k,n-1 1 -β1 + n-2 i=0 1 √ ν k,i - 1 √ ν k,0 ⊙ m k,i - β1 1 -β1 1 √ ν k-1,n-1 - 1 √ ν k,0 ⊙ m k-1,n-1 - β1 1 -β1 (η k-1 -η k ) 1 √ ν k-1,n-1 ⊙ m k- Here equation (⋆) is due to a direct application of the update rule of w k,i . We then analyze the above three terms respectively, namely, we define a 1 l ≜ - η k √ ν l,k,0 m l,k,n-1 + (1 -β1) n-2 i=0 m l,k,i -β1m l,k-1,n-1 1 -β1 = - η k √ ν l,k,0 n-1 i=0 ∂ l fτ k,i (w k,i ), a 2 l ≜ -η k 1 √ ν l,k,n-1 - 1 √ ν l,k,0 m l,k,n-1 1 -β1 + n-2 i=0 1 √ ν l,k,i - 1 √ ν l,k,0 m l,k,i - β1 1 -β1 1 √ ν l,k-1,n-1 - 1 √ ν l,k,0 m l,k-1,n-1 , a 3 l ≜ - β1 1 -β1 (η k-1 -η k ) 1 √ ν l,k-1,n-1 m l,k-1,n-1 . One can then easily observe that by Eq. ( 17), l∈L k large ∂ l f (w k,0 )(u l,k+1 -u l,k ) = l∈L k large ∂ l f (w k,0 )a 1 l + l∈L k large ∂ l f (w k,0 )a 2 l + l∈L k large ∂ l f (w k,0 )a 3 l . ②. (A) Tackling Term l∈L k large ∂ l f (w k,0 )a 1 l : We have l∈L k large ∂ l f (w k,0 )a 1 l = - l∈L k large ∂ l η k √ ν l,k,0 ∂ l f (w k,0 ) n-1 i=0 ∂ l fτ k,i (w k,0 ) - l∈L k large η k √ ν l,k,0 ∂ l f (w k,0 ) n-1 i=0 (∂ l fτ k,i (w k,i ) -∂ l fτ k,i (w k,0 )) = - l∈L k large η k √ ν l,k,0 ∂ l f (w k,0 ) 2 - l∈L k large η k √ ν l,k,0 ∂ l f (w k,0 ) n-1 i=0 (∂ l fτ k,i (w k,i ) -∂ l fτ k,i (w k,0 )) (⋆) = - l∈L k large η k √ ν l,k,0 ∂ l f (w k,0 ) 2 + O η 2 k + O η 2 k ∥∇f (w k,0 )∥ , where Eq. 
(⋆) is due to l∈L k large η k √ ν l,k,0 ∂ l f (w k,0 ) n-1 i=0 (∂ l fτ k,i (w k,i ) -∂ l fτ k,i (w k,0 )) ( * ) ≤ η k 2n 2 β n 2    l∈L k large n-1 i=0 |∂ l fτ k,i (w k,i ) -∂ l fτ k,i (w k,0 )|    ≤η k 2n 2 β n 2 √ d n-1 i=0 ∥∇fτ k,i (w k,i ) -∇fτ k,i (w k,0 )∥ (•) ≤ η k 2n 2 β n 2 √ d n-1 i=0 (L0 + L1∥∇fτ k,i (w k,0 )∥)∥w k,i -w k,0 ∥ ≤η k 2n 2 β n 2 √ d(nL0 + L1 √ D1 √ n∥∇f (w k,0 )∥ + √ nL1 √ D0)n √ dC1η k (•) ≤ 2n 2 β n 2 d(n 2 L0 + n √ nL1 √ D0)C1η 2 k + η 2 k d 2n 2 β n 2 L1 √ D1n √ n∥∇f (w k,0 )∥. Here Eq. ( * ) is due to Corollary 2, Eq. (•) is due to f i is (L 0 , L 1 )-smooth, ∀i, and Eq. ( •) is due to Lemma 3. ②. (B) Tackling Term l∈L k large ∂ l f (w k,0 )a 2 l : We have for any l ∈ L max , |∂ l f (w k,0 )a 2 l | ≤η k |∂ l f (w k,0 )| 1 √ ν l,k,n-1 - 1 √ ν l,k,0 |m l,k,n-1 | 1 -β 1 + n-2 i=0 1 √ ν l,k,i - 1 √ ν l,k,0 |m l,k,i | - β 1 1 -β 1 1 √ ν l,k-1,n-1 + 1 √ ν l,k,0 |m l,k-1,n-1 | (⋆) ≤ η k g(β 2 ) |∂ l f (w k,0 )| √ ν l,k,0 |m l,k,n-1 | 1 -β 1 + n-2 i=0 |m l,k,i | + β 1 1 -β 1 |m l,k-1,n-1 | ( * ) ≤ η k g(β 2 ) n -1 + 1 + β 1 1 -β 1 |∂ l f (w k,0 )| √ ν l,k,0 max i∈[n] |∂ l f i (w k,0 )| + η 2 k g(β 2 ) n -1 + 1 + β 1 1 -β 1 √ 2n β n 2 2 n + 2 √ 2β 1 1 -β 1 C 1 (L 0 + L 1 D 0 ) √ d + η 2 k g(β 2 ) n -1 + 1 + β 1 1 -β 1 √ 2n β n 2 2 L 1 C 1 D 1 n-1 j=0 ∥∇f (w k,j )∥ + η k g(β 2 ) n -1 + 1 + β 1 1 -β 1 √ 2n β n 2 2 L 1 C 1 D 1 k-1 t=1 η k-t n-1 j=0 β tn-1-j 1 ∥∇f (w k-t,j )∥, where Inequality (⋆) is due to Corollary 2, and g(β 2 ) is defined in Lemma 11 , and Inequality ( * ) is due to Lemma 4, by which we have ∀i ∈ {-1, • • • , n -1} |m l,k,i | ≤ max i ′ ∈[n] |∂ l f i ′ (w k,0 )| + n + 2 √ 2β 1 1 -β 1 C 1 (L 0 + L 1 D 0 ) √ dη k + L 1 C 1 D 1 η k n-1 j=0 ∥∇f (w k,j )∥ + L 1 C 1 D 1 k-1 t=1 η k-t n-1 j=0 β tn-1-j 1 ∥∇f (w k-t,j )∥. 
Therefore, summing over L k large and k leads to T k=1 l∈L k large ∂ l f (w k,0 )a 2 l ≤ T k=1 l∈L k large η k g(β 2 ) n -1 + 1 + β 1 1 -β 1 |∂ l f (w k,0 )| √ ν l,k,0 max i∈[n] |∂ l f i (w k,0 )| + T k=1 η 2 k g(β 2 ) n -1 + 1 + β 1 1 -β 1 √ 2n β n 2 2 n + 2 √ 2β 1 1 -β 1 C 1 (L 0 + L 1 D 0 )d √ d + dg(β 2 ) n -1 + 1 + β 1 1 -β 1 √ 2n β n 2 2 L 1 C 1 D 1 T k=1 η 2 k n-1 j=0 ∥∇f (w k,j )∥ + dg(β 2 ) n -1 + 1 + β 1 1 -β 1 √ 2n β n 2 2 L 1 C 1 D 1 T k=1 η k k-1 t=1 η k-t n-1 j=0 β (t-1)n 1 ∥∇f (w k-t,j )∥ ≤ T k=1 l∈L k large η k g(β 2 ) n -1 + 1 + β 1 1 -β 1 |∂ l f (w k,0 )| √ ν l,k,0 max i∈[n] |∂ l f i (w k,0 )| + g(β 2 ) n -1 + 1 + β 1 1 -β 1 √ 2n β n 2 2 n + 2 √ 2β 1 1 -β 1 C 1 (L 0 + L 1 D 0 )d √ dη 1 (1 + ln T ) + dg(β 2 ) n -1 + 1 + β 1 1 -β 1 √ 2n β n 2 2 L 1 C 1 D 1 1 + 1 1 -β n 2 T k=1 η 2 k n-1 j=0 ∥∇f (w k,j )∥ (⋆) ≤ T k=1 l∈L k large η k g(β 2 ) n -1 + 1 + β 1 1 -β 1 |∂ l f (w k,0 )| √ ν l,k,0 max i∈[n] |∂ l f i (w k,0 )| + g(β 2 ) n -1 + 1 + β 1 1 -β 1 √ 2n β n 2 2 n + 2 √ 2β 1 1 -β 1 C 1 (L 0 + L 1 D 0 )d √ dη 1 (1 + ln T ) + dg(β 2 ) n -1 + 1 + β 1 1 -β 1 √ 2n β n 2 2 L 1 C 1 D 1 1 + 1 1 -β n 2 • T k=1 η 2 k n-1 j=0 (1 + n √ dC 1 η 1 L 1 √ n D 1 )∥∇f (w k,0 )∥ + nL 0 + L 1 √ n D 0 n √ dC 1 η k ≤ T k=1 l∈L k large η k g(β 2 ) n -1 + 1 + β 1 1 -β 1 |∂ l f (w k,0 )| √ ν l,k,0 max i∈[n] |∂ l f i (w k,0 )| + g(β 2 ) n -1 + 1 + β 1 1 -β 1 √ 2n β n 2 2 n + 2 √ 2β 1 1 -β 1 C 1 (L 0 + L 1 D 0 )d √ dη 2 1 (1 + ln T ) + dg(β 2 ) n -1 + 1 + β 1 1 -β 1 √ 2n β n 2 2 L 1 C 1 D 1 1 + 1 1 -β n 2 (n + n 5 2 √ dC 1 η 1 L 1 D 1 ) T k=1 η 2 k ∥∇f (w k,0 )∥ + 3dg(β 2 ) n -1 + 1 + β 1 1 -β 1 √ 2n β n 2 2 L 1 C 1 D 1 1 + 1 1 -β n 2 n nL 0 + L 1 √ n D 0 n √ dC 1 η 3 1 . where Inequality (⋆) is due to Lemma 6.

②. (C) Tackling Term l∈L k large ∂ l f (w k,0 )a 3 l : For any l ∈ L k large , |∂ l f (w k,0 )a 3 l | ≤ β 1 1 -β 1 |η k-1 -η k | 1 √ ν l,k-1,n-1 |m l,k-1,n-1 ||∂ l f (w k,0 )| ≤ β 1 η 1 (1 -β 1 ) 1 √ k √ k -1( √ k + √ k -1) C 1 |∂ l f (w k,0 )| = β 1 η k (1 -β 1 ) 1 √ k -1( √ k + √ k -1) C 1 |∂ l f (w k,0 )|. Summing over k and L k large then leads to T k=1 l∈L k large |∂ l f (w k,0 )a 3 l | ≤ β1 (1 -β1) T k=1 l∈L k large η k √ k -1( √ k + √ k -1) C1|∂ l f (w k,0 )| ≤2 β1 (1 -β1)η1 √ dC1 T k=1 η 2 k ∥∇f (w k,0 )∥. Put ②.(A), ②.(B), and ②.(C) together. We have T k=1 l∈L k large ∂ l f (w k,0 )(u l,k+1 -u l,k ) ≤ - T k=1 l∈L k large η k √ ν l,k,0 ∂ l f (w k,0 ) 2 + T k=1 l∈L k large η k g(β2) n -1 + 1 + β1 1 -β1 |∂ l f (w k,0 )| √ ν l,k,0 max i∈[n] |∂ l fi(w k,0 )| + C8 T k=1 η 2 k ∥∇f (w k,0 )∥ + C9 ln T + C10, where C 8 , C 9 , and C 10 are constants defined as C8 ≜ 2n 2 β n 2 L1 √ D1n √ n + dg(β2) n -1 + 1 + β1 1 -β1 √ 2n β n 2 2 L1C1 √ D1 1 + 1 1 -β n 2 (n + n 5 2 √ dC1η1L1 √ D1) + 2 β1 (1 -β1)η1 √ dC1, C9 ≜ 2n 2 β n 2 d(n 2 L0+n √ nL1 √ D0)C1η 2 1 +g(β2) n -1 + 1 + β1 1 -β1 √ 2n β n 2 2 n + 2 √ 2β1 1 -β1 C1(L0+L1 √ D0)d √ dη 2 1 , C10 ≜ 3dg(β2) n -1 + 1 + β1 1 -β1 √ 2n β n 2 2 L1C1 √ D1 1 + 1 1 -β n 2 n nL0 + L1 √ n √ D0 n √ dC1η 3 1 +C9. We then analyze the first two terms in Eq. (18) here.
Specifically, we have l∈L k large η k ∂ l f (w k,0 ) 2 √ ν l,k,0 + ε - l∈L k large η k g(β2) n -1 + 1 + β1 1 -β1 |∂ l f (w k,0 )| √ ν l,k,0 + ε max i∈[n] |∂ l fi(w k,0 )| (⋆) ≥ l∈L k large η k ∂ l f (w k,0 ) 2 √ ν l,k,0 + ε - l∈L k large η k g(β2) n -1 + 1 + β1 1 -β1 |∂ l f (w k,0 )| β n 2 2n max i∈[n] |∂ l fi(w k,0 )| + ε max i∈[n] |∂ l fi(w k,0 )| ≥ l∈L k large η k ∂ l f (w k,0 ) 2 2 max i∈[n] |∂ l fi(w k,0 )| + ε - l∈L k large η k g(β2) n -1 + 1 + β1 1 -β1 |∂ l f (w k,0 )| β n 2 2n max i∈[n] |∂ l fi(w k,0 )| + ε max i∈[n] |∂ l fi(w k,0 )| (•) = l∈[d] η k ∂ l f (w k,0 ) 2 2 max i∈[n] |∂ l fi(w k,0 )| + ε - l∈L k large η k g(β2) n -1 + 1 + β1 1 -β1 |∂ l f (w k,0 )| β n 2 2n max i∈[n] |∂ l fi(w k,0 )| + ε max i∈[n] |∂ l fi(w k,0 )| + O   η 2 k + η k k-1 r=1 β2 (r-1)n η k-r n-1 j=0 ∥∇f (w k-r,j )∥ + η k k-1 r=1 β2 (r-1)n η k-r + η 2 k n-1 j=0 ∥∇f (w k,j )∥   ≥ l∈[d] η k ∂ l f (w k,0 ) 2 2 max i∈[n] |∂ l fi(w k,0 )| + ε - l∈[d] η k g(β2) n -1 + 1 + β1 1 -β1 |∂ l f (w k,0 )| β n 2 2n max i∈[n] |∂ l fi(w k,0 )| + ε max i∈[n] |∂ l fi(w k,0 )| + O   η 2 k + η k k-1 r=1 β2 (r-1)n η k-r n-1 j=0 ∥∇f (w k-r,j )∥ + η k k-1 r=1 β2 (r-1)n η k-r + η 2 k n-1 j=0 ∥∇f (w k,j )∥   , where Inequality (⋆) is due to Corollary 2 and Equality (•) is due to l∈L k small η k ∂ l f (w k,0 ) 2 2 max i∈[n] |∂ l fi(w k,0 )| + ε ≤ l∈L k small η k ∂ l f (w k,0 ) 2 2 max i∈[n] |∂ l fi(w k,0 )| + ε ≤ n 2 η k l∈L k small max i∈[n] |∂ l fi(w k,0 )| ≤ ndη k 2   C3η k + C4 k-1 r=1 β2 (r-1)n η k-r n-1 j=0 ∥∇f (w k-r,j )∥ + C4n k-1 r=1 β2 (r-1)n η k-r + η k C4 n-1 j=0 ∥∇f (w k,j )∥   . Parallel to Eq. ( 16) and summing the right-hand-side of the above inequality over k from 1 to t, we have T t=1 ndη k 2   C 3 η k + C 4 k-1 r=1 β 2 (r-1)n η k-r n-1 j=0 ∥∇f (w k-r,j )∥ + C 4 n k-1 r=1 β 2 (r-1)n η k-r + η k C 4 n-1 j=0 ∥∇f (w k,j )∥   ≤ 1 2 C 5 T k=1 η 2 k ∥∇f (w k,0 )∥ + C 6 ln T + C 7 . 
Suppose now there does not exist an iteration k ∈ [T ], such that ∥∇f (w k,0 )∥ ≤ 2 √ d(2 √ 2 + 1) D 0 g(β 2 ) n -1 + 1 + β 1 1 -β 1 2n β n 2 , since otherwise, the proof has been completed. By Lemma 8, we then have l∈L k large η k ∂ l f (w k,0 ) 2 √ ν l,k,0 + ε - l∈L k large η k g(β2) n -1 + 1 + β1 1 -β1 |∂ l f (w k,0 )| √ ν l,k,0 + ε max i∈[n] |∂ l fi(w k,0 )| ≥η k 1 2(2 √ 2 + 1) min ∥∇f (w k,0 )∥ √ D1 , ∥∇f (w k,0 )∥ 2 ε + √ D0 + O η 2 k + η k k-1 r=1 β2 (r-1)n η k-r n-1 j=0 ∥∇f (w k-r,j )∥ + η k k-1 r=1 β2 (r-1)n η k-r + η 2 k n-1 j=0 ∥∇f (w k,j )∥ . Putting ① and ② together and summing over k, we have f (uT +1) -f (u1) ≤ - T k=1 η k 1 2(2 √ 2 + 1) min ∥∇f (w k,0 )∥ √ D1 , ∥∇f (w k,0 )∥ 2 ε + √ D0 + ( 1 2 + C2)C5 + C8 T k=1 η 2 k ∥∇f (w k,0 )∥ + ( 1 2 + C2)C6 + C9 ln T + ( 1 2 + C2)C7 + C10 + T k=1 nL0 + L1 √ n √ D0 2 3C 2 2 dη 2 k + T k=1 3L1 √ n √ D1C 2 2 dη 2 k 2 ∥∇f (w k,0 )∥ ≤ - T k=1 η k 1 2(2 √ 2 + 1) min ∥∇f (w k,0 )∥ √ D1 , ∥∇f (w k,0 )∥ 2 ε + √ D0 + ( 1 2 + C2)C5 + C8 + 3L1 √ n √ D1C 2 2 d 2 T k=1 η 2 k ∥∇f (w k,0 )∥ + ( 1 2 + C2)C6 + C9 + nL0 + L1 √ n √ D0 2 3C 2 2 dη 2 1 ln T + ( 1 2 + C2)C7 + C10 + nL0 + L1 √ n √ D0 2 3C 2 2 dη 2 1 ≤ T k=1 η k 1 2(2 √ 2 + 1) min ∥∇f (w k,0 )∥ √ D1 , ∥∇f (w k,0 )∥ 2 ε + √ D0 + C11 T k=1 η 2 k ∥∇f (w k,0 )∥ + C12 ln T + C13, where C 11 , C 12 , and C 13 is defined as C11 ≜ ( 1 2 + C2)C5 + C8 + 3L1 √ n √ D1C 2 2 d 2 , C12 ≜ ( 1 2 + C2)C6 + C9 + nL0 + L1 √ n √ D0 2 3C 2 2 dη 2 1 , C13 ≜ ( 1 2 + C2)C7 + C10 + nL0 + L1 √ n √ D0 2 3C 2 2 dη 2 1 . 
On the other hand, since for all k ∈ [T ], η 2 k ∥∇f (w k,0 )∥ ≤ 1 4 √ D 0 + ε √ D 1 η 2 k + √ D 1 √ D 0 + ε η 2 k ∥∇f (w k,0 )∥ 2 , we have that η 2 k ∥∇f (w k,0 )∥ ≤ 1 4 √ D 0 + ε √ D 1 η 2 k + η 2 k min ∥∇f (w k,0 )∥, √ D 1 √ D 0 + ε ∥∇f (w k,0 )∥ 2 = 1 4 √ D 0 + ε √ D 1 η 2 k + D 1 η 2 k min ∥∇f (w k,0 )∥ √ D 1 , ∥∇f (w k,0 )∥ 2 √ D 0 + ε , and thus, f (uT +1) -f (u1) ≤ - T k=1 η k 1 2(2 √ 2 + 1) min ∥∇f (w k,0 )∥ √ D1 , ∥∇f (w k,0 )∥ 2 ε + √ D0 + C11 T k=1 η 2 k ∥∇f (w k,0 )∥ + C12 ln T + C13 ≤ - T k=1 η k 1 2(2 √ 2 + 1) min ∥∇f (w k,0 )∥ √ D1 , ∥∇f (w k,0 )∥ 2 ε + √ D0 + √ D0 + ε 4 √ D1 C11 T k=1 η 2 k + C12 ln T + C13 + √ D1C11 T k=1 η 2 k min ∥∇f (w k,0 )∥ √ D1 , ∥∇f (w k,0 )∥ 2 ε + √ D0 ≤ - T k=1 η k 1 2(2 √ 2 + 1) - √ D1C11η k min ∥∇f (w k,0 )∥ √ D1 , ∥∇f (w k,0 )∥ 2 ε + √ D0 + C12 + √ D0 + ε 4 √ D1 C11η 2 1 ln T + C13 + √ D0 + ε 4 √ D1 C11η 2 1 ≤ - T k=1 η k 1 4(2 √ 2 + 1) min ∥∇f (w k,0 )∥ √ D1 , ∥∇f (w k,0 )∥ 2 ε + √ D0 + C12 + √ D0 + ε 4 √ D1 C11η 2 1 ln T + C13 + √ D0 + ε 4 √ D1 C11η 2 1 . The proof is completed. Remark 12. By the definitions of C 11 , C 12 , and C 13 , one can easily observe that the hidden coefficient of O( ln T √ T ) is of order d 3 2 n 3 2 . For completeness, we would like to emphasize that our contribution is "providing the first convergence bound of Adam without the L-smooth condition", but we agree that the bound itself can be tightened. Proving a tighter bound is an interesting topic, and we leave it for future work. Proof (of Lemma 8). To begin with, we have max i∈[n] |∂ l f i (w k,0 )| 2 ≤ i∈[n] d l ′ =1 |∂ l ′ f i (w k,0 )| 2 = i∈[n] ∥∇f i (w k,0 )∥ 2 ≤ D 1 ∥∇f (w k,0 )∥ 2 + D 0 . We respectively consider the cases ε ≤ √ D 0 and ε > √ D 0 . Case I: ε ≤ √ D 0 . In this case, we have that l∈[d] η k ∂ l f (w k,0 ) 2 2 max i∈[n] |∂ l fi(w k,0 )| + ε - l∈[d] η k g(β2) n -1 + 1 + β1 1 -β1 |∂ l f (w k,0 )| β n 2 2n max i∈[n] |∂ l fi(w k,0 )| + ε max i∈[n] |∂ l fi(w k,0 )| ≥ η k ∥∇f (w k,0 )∥ 2 2 D1∥∇f (w k,0 )∥ 2 + D0 + √ D0 - √ dη k g(β2) n -1 + 1 + β1 1 -β1 2n β n 2 ∥∇f (w k,0 )∥. We further discuss the cases depending on whether ∥∇f (w k,0 )∥ 2 ≤ D0 D1 or not.
Case I.1: ∥∇f (w k,0 )∥ 2 ≤ D0 D1 . In this case, the last line of the above equations can be further lower bounded by η k ∥∇f (w k,0 )∥ 2 2 D 1 ∥∇f (w k,0 )∥ 2 + D 0 + √ D 0 - √ dη k g(β 2 ) n -1 + 1 + β 1 1 -β 1 2n β n 2 ∥∇f (w k,0 )∥ ≥ η k ∥∇f (w k,0 )∥ 2 (2 √ 2 + 1) √ D 0 - √ dη k g(β 2 ) n -1 + 1 + β 1 1 -β 1 2n β n 2 ∥∇f (w k,0 )∥ =η k ∥∇f (w k,0 )∥ (2 √ 2 + 1) √ D 0 - √ dg(β 2 ) n -1 + 1 + β 1 1 -β 1 2n β n 2 ∥∇f (w k,0 )∥ Case I.2: ∥∇f (w k,0 )∥ 2 > D0 D1 . η k ∥∇f (w k,0 )∥ 2 2 D 1 ∥∇f (w k,0 )∥ 2 + D 0 + √ D 0 - √ dη k g(β 2 ) n -1 + 1 + β 1 1 -β 1 2n β n 2 ∥∇f (w k,0 )∥ ≥ η k ∥∇f (w k,0 )∥ 2 (2 √ 2 + 1) √ D 1 ∥∇f (w k,0 )∥ - √ dη k g(β 2 ) n -1 + 1 + β 1 1 -β 1 2n β n 2 ∥∇f (w k,0 )∥ =η k 1 (2 √ 2 + 1) √ D 1 - √ dg(β 2 ) n -1 + 1 + β 1 1 -β 1 2n β n 2 ∥∇f (w k,0 )∥ ( * ) ≥ η k 1 2(2 √ 2 + 1) √ D 1 ∥∇f (w k,0 )∥, where Inequality ( * ) is due to the constraint on β 2 . Therefore, we have either (1). there exists an iteration k ∈ [T ] such that ∥∇f (w k,0 )∥ ≤ 2 √ d(2 √ 2 + 1) D 0 g(β 2 ) n -1 + 1 + β 1 1 -β 1 2n β n 2 , or (2). for all k ∈ [1, T ], l∈[d] η k ∂ l f (w k,0 ) 2 2 max i∈[n] |∂ l fi(w k,0 )| + ε - l∈[d] η k g(β2) n -1 + 1 + β1 1 -β1 |∂ l f (w k,0 )| β n 2 2n max i∈[n] |∂ l fi(w k,0 )| + ε max i∈[n] |∂ l fi(w k,0 )| ≥ η k 1 2(2 √ 2 + 1) min ∥∇f (w k,0 )∥ √ D1 , ∥∇f (w k,0 )∥ 2 ε + √ D0 . Case II: ε > √ D 0 . In this case, we have that η k ∥∇f (w k,0 )∥ 2 2 D1∥∇f (w k,0 )∥ 2 + D0 + ε - l∈[d] η k g(β2) n -1 + 1 + β1 1 -β1 |∂ l f (w k,0 )| β n 2 2n max i∈[n] |∂ l fi(w k,0 )| + ε max i∈[n] |∂ l fi(w k,0 )| ≥ η k ∥∇f (w k,0 )∥ 2 2 D1∥∇f (w k,0 )∥ 2 + ε 2 + ε - l∈[d] η k g(β2) n -1 + 1 + β1 1 -β1 |∂ l f (w k,0 )| β n 2 2n max i∈[n] |∂ l fi(w k,0 )| + ε max i∈[n] |∂ l fi(w k,0 )| . Similarly to Case I, we further divide the cases according to the value of ∥∇f (w k,0 )∥. Case II.1: D 1 ∥∇f (w k,0 )∥ 2 ≤ ε 2 .
In this case, we have η k ∥∇f (w k,0 )∥ 2 2 D1∥∇f (w k,0 )∥ 2 + ε 2 + ε - l∈[d] η k g(β2) n -1 + 1 + β1 1 -β1 |∂ l f (w k,0 )| β n 2 2n max i∈[n] |∂ l fi(w k,0 )| + ε max i∈[n] |∂ l fi(w k,0 )| ≥ η k ∥∇f (w k,0 )∥ 2 (2 √ 2 + 1)ε - l∈[d] η k g(β2) n -1 + 1 + β1 1 -β1 |∂ l f (w k,0 )| ε max i∈[n] |∂ l fi(w k,0 )| ≥ η k ∥∇f (w k,0 )∥ 2 (2 √ 2 + 1)ε -η k g(β2) n -1 + 1 + β1 1 -β1 ∥∇f (w k,0 )∥ ε D1∥∇f (w k,0 )∥ 2 + D0 = η k ∥∇f (w k,0 )∥ ε ∥∇f (w k,0 )∥ 2 √ 2 + 1 -g(β2) n -1 + 1 + β1 1 -β1 D1∥∇f (w k,0 )∥ 2 + D0 . Case II.2: D 1 ∥∇f (w k,0 )∥ 2 > ε 2 . This case is quite similar to Case I.2, and we have η k ∥∇f (w k,0 )∥ 2 2 D1∥∇f (w k,0 )∥ 2 + ε 2 + ε - l∈[d] η k g(β2) n -1 + 1 + β1 1 -β1 |∂ l f (w k,0 )| β n 2 2n max i∈[n] |∂ l fi(w k,0 )| + ε max i∈[n] |∂ l fi(w k,0 )| ≥η k 1 2(2 √ 2 + 1) min ∥∇f (w k,0 )∥ √ D1 , ∥∇f (w k,0 )∥ 2 ε + √ D0 . The proof is completed.
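The proof of Theorem 1 earlier splits η k 2 ∥∇f (w k,0 )∥ as ¼ ((√ D0 + ε)/ √ D1) η k 2 + ( √ D1 /( √ D0 + ε)) η k 2 ∥∇f (w k,0 )∥ 2 ; this is a weighted AM-GM step, equivalent to (g - a/2) 2 ≥ 0 with a = ( √ D0 + ε)/ √ D1 . A quick numeric check, with arbitrary positive constants of our own choosing:

```python
import random

# Sanity check of the weighted AM-GM bound eta^2*g <= (1/4)*a*eta^2 + (1/a)*eta^2*g^2,
# where a = (sqrt(D0) + eps) / sqrt(D1). All sampled constants are arbitrary.
rng = random.Random(1)
ok = True
for _ in range(1000):
    D0 = rng.uniform(0.01, 5.0)
    D1 = rng.uniform(0.1, 5.0)
    eps = rng.uniform(0.01, 2.0)
    eta = rng.uniform(0.0, 1.0)
    g = rng.uniform(0.0, 50.0)  # stands in for the gradient norm
    a = (D0 ** 0.5 + eps) / D1 ** 0.5
    ok = ok and eta ** 2 * g <= 0.25 * a * eta ** 2 + eta ** 2 * g ** 2 / a + 1e-9
```

The inequality holds for every sample, since its two sides differ by η 2 (g - a/2) 2 / a ≥ 0.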

E EXPERIMENT DETAILS

This section collects the experiments and their corresponding settings. It is arranged as follows: first, we show that Adam works well under different reshuffling orders; we then provide the experimental settings of Figure 1 .

E.1 ADAM WORKS WELL UNDER DIFFERENT RESHUFFLING ORDER

We run Adam on ResNet 110 for CIFAR 10 across different random seeds and plot the 10-run mean and variance in Figure 4 . One can observe that the performance of Adam is robust with respect to the random seed, which supports Theorem 1 in terms of trajectory-wise convergence. The experiment is based on this repo, where we adopt the default hyperparameter settings. In the rest of this section, we provide the models and hyperparameter settings of Figure 1 and illustrate how we evaluate the local smoothness. Models and hyper-parameter settings in Figure 1.



m k,i √ ν k,i +ε , -∇f (w k,0 )⟩ > 0 is not necessarily correct, as the momentum m k,i contains a heavy historical signal and may push the update away from the negative gradient direction. How do we deal with this challenge?

2/(L0 + L1M ) serves as the largest reasonable setting of the learning rate for GD, as f over the loss sub-level set has smoothness upper bound (L0 + L1M ).



Figure 1: For the Transformer (Vaswani et al., 2017) on the WMT 2014 dataset, we plot (a) the training loss of SGD and Adam and (b) the gradient norm vs. the local smoothness along the training trajectory. The blue line stands for log(local smoothness) = log(gradient norm) + 1.4. It can be observed that all (log(gradient norm), log(local smoothness)) points lie under this line, and thus the training process obeys the (0, e^1.4)-smooth condition.
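Checking the condition behind Figure 1(b) amounts to verifying that every measured point lies under the line of slope 1 and intercept 1.4 in log-log space, i.e., local smoothness ≤ e^1.4 · gradient norm. A minimal sketch, where the logged pairs are hypothetical stand-ins for trajectory measurements (not the paper's data):

```python
import math

# Points under the line log(smoothness) = log(gradient norm) + 1.4 satisfy the
# (0, e^{1.4})-smooth condition. The pairs below are hypothetical placeholders.
pairs = [(0.5, 1.2), (2.0, 6.5), (8.0, 30.0)]  # (gradient norm, local smoothness)
intercept = 1.4
L1_bound = math.exp(intercept)
under_line = all(smooth <= L1_bound * gnorm for gnorm, smooth in pairs)
```

With real measurements in `pairs`, `under_line` being true certifies the trajectory-wise (0, e^1.4)-smooth condition.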

[Table: comparison of assumptions and guarantees across existing analyses. Columns: (L0, L1)-Smoothness; Allow β1 ≠ 0 (a); Allow ε = 0 (b); Allow unbounded gradient; Trajectory-wise convergence. Rows include Zaheer et al. (2018b), among others.]

(c): Zaheer et al. (2018b) further requires the signs of the gradients to remain the same along the trajectory; (d): Guo et al. (2021); Huang et al. (2021) require √ ν k,i + ε to be lower bounded, which is equivalent to requiring ε > 0.

Figure 2: Performance of Adam on a synthetic objective satisfying (L 0 , L 1 )-smooth condition. Adam doesn't converge to the unique stationary point, but gets closer to the stationary point as β 2 → 1.

improved regularization in Adam by decoupling the weight decay from the gradient-based update.

B ADDITIONAL DISCUSSIONS

B.1 RESTRICTION OF β 1 IN THEOREM 1

B.8 DISCUSSION ON THE IN-EXPECTATION RESULT IN (ZHANG ET AL., 2022)

Zhang et al. (

Figure 3: Example on which SGD with a decaying learning rate and a small initial learning rate diverges. We only plot the first two steps because the iterate values exceed the representable floating-point range in Python at the third step.
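The phenomenon in Figure 3 can be reproduced in miniature: on an (L 0 , L 1 )-smooth objective whose local smoothness grows with the gradient norm, SGD with a decaying step size η k = η 1 / √ k can still blow up from a moderately large initialization. A minimal sketch, where the objective f(x) = cosh(x) and all constants are our own illustration, not the paper's exact construction:

```python
import math

# f(x) = cosh(x): the local smoothness cosh(x) <= 1 + |f'(x)| grows with the
# gradient norm, so a fixed decaying schedule overshoots once |f'(x)| is large.
def grad(x):
    return math.sinh(x)

x, eta1 = 10.0, 0.01
trajectory = [x]
for k in (1, 2):  # the third step would already overflow a Python float
    x -= eta1 / math.sqrt(k) * grad(x)
    trajectory.append(x)
```

The iterate magnitude grows from 10 to about 100 and then to about 1e41; one more step overflows, mirroring the caption above.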

Lemma 8. Let Assumptions 1 and 2 hold, let β 2 1 < β 2 , and let Eq. (4) hold. Then either there exists an iteration k ∈ [T ] such that ∥∇f (w k,0 )∥ ≤ 2 √ d(2 √ 2 + 1) D 0 g(β 2 ) n -1 + 1 + β 1 1 -β 1 2n β n 2 , or for all k ∈ [1, T ], l∈[d] η k ∂ l f (w k,0 ) 2 2 max i∈[n] |∂ l fi(w k,0 )| + ε - l∈[d] η k g(β2) n -1 + 1 + β1 1 -β1 |∂ l f (w k,0 )| β n 2 2n max i∈[n] |∂ l fi(w k,0 )| + ε max i∈[n] |∂ l fi(w k,0 )| ≥ η k 1 2(2 √ 2 + 1) min ∥∇f (w k,0 )∥ √ D1 , ∥∇f (w k,0 )∥ 2 ε + √ D0 .

The key step of the proof is the bound l∈[d] η k ∂ l f (w k,0 ) 2 2 max i∈[n] |∂ l fi(w k,0 )| + ε ≥ η k ∥∇f (w k,0 )∥ 2 2 D1∥∇f (w k,0 )∥ 2 + D0 + ε , which follows from max i∈[n] |∂ l f i (w k,0 )| ≤ D 1 ∥∇f (w k,0 )∥ 2 + D 0 ; combining Case I and Case II above then yields the claimed dichotomy.

Figure 4: Performance of Adam with different shuffling orders. We respectively plot the training loss and the training accuracy of Adam, together with their variances over 10 runs with different random shuffling orders. The results indicate that the performance of Adam is robust w.r.t. the shuffling order.
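The setup behind Figure 4 (incremental Adam with a fresh reshuffling order each epoch) can be sketched on a toy objective. Everything below — the objective, hyperparameters, and seeds — is an illustrative stand-in, not the ResNet 110 / CIFAR 10 experiment itself:

```python
import random

def adam_shuffled(seed, n=8, epochs=200):
    """Incremental Adam with epoch-wise reshuffling on the toy objective
    f(x) = (1/n) * sum_i (x - c_i)^2 / 2, minimized at x = 0."""
    rng = random.Random(seed)
    c = [i - (n - 1) / 2 for i in range(n)]  # component minimizers, symmetric about 0
    x, m, v = 5.0, 0.0, 0.0
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    for k in range(1, epochs + 1):
        order = list(range(n))
        rng.shuffle(order)            # a fresh shuffling order every epoch
        eta = 0.1 / k ** 0.5          # decaying learning rate, as in Theorem 1
        for i in order:
            g = x - c[i]
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            x -= eta * m / (v ** 0.5 + eps)
    return x * x                       # squared distance to the minimizer

finals = [adam_shuffled(s) for s in (0, 1, 2)]
```

Across different shuffling seeds the final iterates land close to the minimizer, consistent with the robustness observed in Figure 4.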

In Figure 1, we use exactly the same setting as Vaswani et al. (2017) on the WMT 2014 dataset, based on this repo. How we evaluate the local smoothness. We use the same method as Zhang et al. (2019a). Specifically, with a finite-difference step α, we calculate the smoothness at w k as local smoothness = max γ∈{α,2α,••• ,1} ∥∇f (w k + γ(w k+1 -w k )) -∇f (w k )∥ / (γ∥w k+1 -w k ∥).
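The finite-difference estimator above is straightforward to implement. Below is a sketch of our own (the helper name and interface are ours, not from the referenced repo), verified on a quadratic f(w) = (L/2)∥w∥², whose smoothness is exactly L everywhere:

```python
def local_smoothness(grad, w, w_next, alpha=0.1):
    """Finite-difference estimate of the local smoothness between iterates w and
    w_next, following Zhang et al. (2019a): max over gamma in {alpha, 2*alpha, ..., 1}
    of ||grad(w + gamma*d) - grad(w)|| / (gamma*||d||), where d = w_next - w."""
    d = [b - a for a, b in zip(w, w_next)]
    norm_d = sum(x * x for x in d) ** 0.5
    g0 = grad(w)
    best, gamma = 0.0, alpha
    while gamma <= 1.0 + 1e-12:
        wg = [a + gamma * x for a, x in zip(w, d)]
        diff = [p - q for p, q in zip(grad(wg), g0)]
        best = max(best, sum(x * x for x in diff) ** 0.5 / (gamma * norm_d))
        gamma += alpha
    return best

# Sanity check on f(w) = (L/2)||w||^2, whose smoothness is exactly L everywhere.
L = 3.0
est = local_smoothness(lambda w: [L * x for x in w], [1.0, -2.0], [0.5, 0.0])
```

For this quadratic the ratio equals L at every γ, so the estimator returns 3.0 up to floating-point error.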


