ON THE CONVERGENCE OF ADAGRAD(NORM) ON R^d: BEYOND CONVEXITY, NON-ASYMPTOTIC RATE AND ACCELERATION

Abstract

Existing analysis of AdaGrad and other adaptive methods for smooth convex optimization is typically for functions with bounded domain diameter. In unconstrained problems, previous works guarantee an asymptotic convergence rate without an explicit constant factor that holds for the entire function class. Furthermore, in the stochastic setting, only a modified version of AdaGrad, different from the one commonly used in practice, in which the latest gradient is not used to update the stepsize, has been analyzed. Our paper aims at bridging these gaps and developing a deeper understanding of AdaGrad and its variants in the standard setting of smooth convex functions as well as the more general setting of quasar convex functions. First, we demonstrate new techniques to explicitly bound the convergence rate of vanilla AdaGrad for unconstrained problems in both the deterministic and stochastic settings. Second, we propose a variant of AdaGrad for which we can show the convergence of the last iterate, instead of the average iterate. Finally, we give new accelerated adaptive algorithms and their convergence guarantees in the deterministic setting with explicit dependency on the problem parameters, improving upon the asymptotic rate shown in previous works.

* Equal contribution, corresponding authors.

We consider the following optimization problem: min_{x ∈ R^d} F(x), where arg min_{x ∈ R^d} F(x) ≠ ∅ and x* ∈ arg min_{x ∈ R^d} F(x). We will use the following notations throughout the paper: a_+ = max{a, 0}, a ∨ b = max{a, b}, [n] = {1, 2, ..., n}, and ‖·‖ denotes the ℓ2-norm ‖·‖_2 for simplicity.

1. INTRODUCTION

In recent years, the prevalence of machine learning models has motivated the development of new optimization tools, among which adaptive methods such as Adam (Kingma & Ba, 2014), AmsGrad (Reddi et al., 2018), and AdaGrad (Duchi et al., 2011) emerge as the most important class of algorithms. These methods do not require knowledge of the problem parameters when setting the stepsize, as traditional methods like SGD do, while still showing robust performance in many ML tasks. However, it remains a challenge to analyze and understand the properties of these methods. Take AdaGrad and its variants for example. In its vanilla scalar form, also known as AdaGradNorm, the step size is set using the cumulative sum of the gradient norms of all iterates so far. The work of Ward et al. (2020) has shown the convergence of this algorithm for nonconvex functions by bounding the decay of the gradient norms. However, in convex optimization we usually require a stronger convergence criterion: bounding the function value gap. This is where we lack theoretical understanding. Even in the deterministic setting, most existing works (Levy, 2017; Levy et al., 2018; Ene et al., 2021) rely on the assumption that the domain of the function is bounded. The dependence on the domain diameter can become an issue if it is unknown or cannot be readily estimated. Other works for unconstrained problems (Antonakopoulos et al., 2020; 2022) offer a convergence rate that depends on the limit of the step size sequence. This limit is shown to exist for each function, but without an explicit value; more importantly, it is not shown to be a constant for the entire function class. This means that these methods essentially do not tell us how fast the algorithm converges in the worst case.
Another work by Ene & Nguyen (2022) gives an explicit rate of convergence for the entire class, but it requires the strong assumption that the gradients are bounded even in the smooth setting, and the convergence guarantee has additional error terms depending on this bound. In the stochastic setting, one common approach is to analyze a modified version of AdaGrad with an off-by-one step size, i.e., the gradient at the current time step is not taken into account when setting the new step size. This is where the gap between theory and practice lies.

1.1. OUR CONTRIBUTION

In this paper, we make the following contributions. First, we demonstrate a method to show an explicit non-asymptotic convergence rate of AdaGradNorm and AdaGrad on R^d in the deterministic setting. Our method extends to a more general function class, known as γ-quasar convex functions, with a weaker condition for smoothness. To the best of our knowledge, we are the first to prove this result. Second, we present new techniques to analyze stochastic AdaGradNorm and offer an explicit convergence guarantee for γ-quasar convex optimization on R^d under a mild assumption on the noise of the gradient estimates. We also propose two new variants of AdaGradNorm which guarantee the convergence of the last iterate, instead of the average iterate as in AdaGradNorm. Finally, we propose a new accelerated algorithm with two variants and show their non-asymptotic convergence rates in the deterministic setting.

1.2. RELATED WORK

Adaptive methods There has been a long line of works on adaptive methods, including AdaGrad (Duchi et al., 2011), RMSProp (Tieleman et al., 2012) and Adam (Kingma & Ba, 2014). AdaGrad was first designed for stochastic online optimization; subsequent works (Levy, 2017; Kavis et al., 2019; Bach & Levy, 2019; Antonakopoulos et al., 2020; Ene et al., 2021) analyzed AdaGrad and various adaptive algorithms for convex optimization and generalized them to variational inequality problems. These works commonly assume that the optimization problem is constrained in a set with bounded diameter. Li & Orabona (2019) are the first to analyze a variant of AdaGrad for unbounded domains where the latest gradient is not used to construct the step size, which differs from the standard version of AdaGrad commonly used in practice. However, the algorithm and analysis of Li & Orabona (2019) set the initial step size based on the smoothness parameter and thus do not adapt to it. Other works provide convergence guarantees for adaptive methods on unbounded domains, yet without explicit dependency on the problem parameters (Antonakopoulos et al., 2020; 2022), or for a class of strongly convex functions (Xie et al., 2020). Another work by Ene & Nguyen (2022) requires the strong assumption that the gradients are bounded even for smooth functions, and the convergence guarantee has additional error terms depending on the gradient upper bound. Our work analyzes the standard version of AdaGrad for unconstrained and general convex problems and shows explicit convergence rates in both the deterministic and stochastic settings. Accelerated adaptive methods have been designed to achieve O(1/T^2) and O(1/√T) rates in the deterministic and stochastic settings, respectively, in the works of Levy et al. (2018); Ene & Nguyen (2022); Antonakopoulos et al. (2022).
We present different variants and prove the same accelerated convergence rate, but with explicit constants, in the deterministic setting for unconstrained problems.

Analysis beyond convexity

The convergence of some variants of AdaGrad has been established for nonconvex functions in the work of Li & Orabona (2019) ; Ward et al. (2020) ; Faw et al. (2022) under various assumptions. Other works (Li & Orabona, 2020; Kavis et al., 2022) demonstrate the convergence with high probability. We refer the reader to Faw et al. (2022) for a more detailed survey on AdaGrad-style methods for nonconvex optimization. In general, the criterion used to study these convergence rates is the gradient norm of the function, which is weaker than the function value gap normally used in the study of convex functions. In comparison, we study the convergence of AdaGrad via the function value gap for a broader notion of convexity, known as quasar-convexity, as well as a more generalized definition of smoothness.

Algorithm 1 AdaGradNorm

Initialize: x_1, η > 0, b_0 > 0
for t = 1 to T:
  b_t = (b_0² + Σ_{i=1}^t ‖∇F(x_i)‖²)^{1/2}
  x_{t+1} = x_t − (η/b_t) ∇F(x_t)

Algorithm 2 Stochastic AdaGradNorm

Initialize: x_1, η > 0, b_0 > 0
for t = 1 to T:
  b_t = (b_0² + Σ_{i=1}^t ‖∇̂F(x_i)‖²)^{1/2}
  x_{t+1} = x_t − (η/b_t) ∇̂F(x_t)

2. PRELIMINARIES

Additionally, we list below the assumptions that will be used in the paper.

1. γ-quasar convexity: There exists γ ∈ (0, 1] such that F* ≥ F(x) + (1/γ)⟨∇F(x), x* − x⟩ for all x ∈ R^d, where x* ∈ arg min_{x∈R^d} F(x). When γ = 1, F is also known as star-convex.

1'. Convexity: F is convex. This stronger assumption implies that Assumption 1 holds with γ = 1.

2. Weak L-smoothness: ∃L > 0 such that F(x) − F* ≥ ‖∇F(x)‖²/(2L) for all x ∈ R^d.

2'. L-smoothness: ∃L > 0 such that F(x) ≤ F(y) + ⟨∇F(y), x − y⟩ + (L/2)‖x − y‖² for all x, y ∈ R^d.

2''. L-smoothness (per-coordinate): ∃L = diag((L_i)_{i∈[d]}) with L_i > 0 such that F(x) ≤ F(y) + ⟨∇F(y), x − y⟩ + (1/2)‖x − y‖²_L for all x, y ∈ R^d, where ‖a‖_L = √⟨a, La⟩.

In the stochastic setting, we assume that we have access to a stochastic gradient oracle ∇̂F(x) that is independent of the history of the randomness and satisfies the following assumptions:

3. Unbiased gradient estimate: E[∇̂F(x)] = ∇F(x).

4. Sub-Weibull noise: E[exp((‖∇̂F(x) − ∇F(x)‖/σ)^{1/θ})] ≤ exp(1) for some θ > 0.

Here, we give a brief discussion of our assumptions. Assumption 1 was introduced by Hinder et al. (2020) and is strictly weaker than Assumption 1'. Assumption 2 is a relaxation of Assumption 2'; the latter is the standard definition of smoothness used in many existing works (see Guille-Escuret et al. (2021) for a detailed comparison between different smoothness conditions). Assumption 2'' is used to analyze the AdaGrad algorithm, which uses per-coordinate step sizes. Assumption 3 is a standard assumption in stochastic optimization problems. Assumption 4 is more general and encapsulates sub-Gaussian (θ = 1/2, used in Li & Orabona (2019)) and sub-exponential noise (θ = 1). We refer the reader to Vladimirova et al. (2020) for more discussion on sub-Weibull noise.
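As a concrete illustration of the update rule, here is a minimal runnable sketch of Algorithm 1 (our own example, not from the paper; the quadratic objective and the choices η = 1, b_0 = 10^{-3} are illustrative). It runs AdaGradNorm on F(x) = ½‖x‖², for which ∇F(x) = x, L = 1 and x* = 0:

```python
import math

def adagrad_norm(grad, x1, eta=1.0, b0=1e-3, T=500):
    """AdaGradNorm (Algorithm 1): b_t = sqrt(b_0^2 + sum_{i<=t} ||grad F(x_i)||^2),
    then x_{t+1} = x_t - (eta / b_t) * grad F(x_t)."""
    x = list(x1)
    iterates = [list(x)]
    sq_sum = b0 ** 2                         # running value of b_t^2
    for _ in range(T):
        g = grad(x)
        sq_sum += sum(gi * gi for gi in g)   # the latest gradient IS included in b_t
        b_t = math.sqrt(sq_sum)
        x = [xi - eta / b_t * gi for xi, gi in zip(x, g)]
        iterates.append(list(x))
    return iterates

# F(x) = 0.5 * ||x||^2, so grad F(x) = x, F* = 0 and x* = 0.
iters = adagrad_norm(lambda x: list(x), x1=[3.0, -2.0])
avg_gap = sum(0.5 * sum(xi * xi for xi in z) for z in iters[:-1]) / (len(iters) - 1)
```

Replacing the exact gradient oracle by a stochastic one turns this into Algorithm 2; note that the stepsize requires no knowledge of L.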
3. CONVERGENCE OF ADAGRADNORM ON R^d UNDER γ-QUASAR CONVEXITY

We first turn our attention to AdaGradNorm (Algorithm 1) in the deterministic setting, which will serve as the basis for understanding Stochastic AdaGradNorm (Algorithm 2) and deterministic AdaGrad (Algorithm 7). To the best of our knowledge, we are the first to present explicit convergence rates for these three algorithms on R^d. Due to the space limit, we defer the theorem on the convergence guarantee of AdaGrad and its proof to Section A.3 in the appendix.

3.1. ADAGRADNORM

Previous analysis of AdaGradNorm often aims at bounding the gradient norm of smooth nonconvex functions, or is conducted for smooth convex functions in constrained problems with a bounded domain. Bounding the gradient norm is strictly weaker than bounding the function value gap, due to the fact that ‖∇F(x)‖² ≤ 2L(F(x) − F*), where L is the smoothness parameter. For convex functions, the common analysis always passes through the following intermediate step:

F(x_t) − F* ≤ (b_t/(2η)) (‖x_t − x*‖² − ‖x_{t+1} − x*‖²) + other terms.

Assuming a bounded domain is a way to make the terms (b_t/(2η)) (‖x_t − x*‖² − ‖x_{t+1} − x*‖²) telescope after taking the sum over all iterations t. This is critical in the analysis, but at the same time it leads to the dependence on the domain diameter, which can be hard to estimate. For unconstrained problems, a natural approach is to divide the terms by b_t, so that the remaining terms (1/(2η)) (‖x_t − x*‖² − ‖x_{t+1} − x*‖²) can telescope. Our key insight is that we can bound the function value gap via the step size b_T, which in turn can be bounded via the function value gap. This self-bounding argument allows us to finally prove the convergence rate. This result holds under conditions more general than convexity and smoothness (Assumptions 1 and 2).

Theorem 3.1. With Assumptions 1 and 2, AdaGradNorm admits

(1/T) Σ_{t=1}^T (F(x_t) − F*) ≤ (1/T) (2L‖x_1 − x*‖²/(γη) + (4ηL/γ) log_+(2ηL/(γb_0)) + b_0) (‖x_1 − x*‖²/(γη) + (2η/γ) log_+(2ηL/(γb_0))).

Proof. Starting from the γ-quasar convexity of F, we have

F(x_t) − F* ≤ (1/γ) ⟨∇F(x_t), x_t − x*⟩ = (b_t/(γη)) ⟨x_t − x_{t+1}, x_t − x*⟩ = (b_t/(2γη)) (‖x_t − x*‖² − ‖x_{t+1} − x*‖² + ‖x_{t+1} − x_t‖²).

Notice that x_{t+1} − x_t = −η b_t^{-1} ∇F(x_t). Dividing both sides by b_t and taking the sum over t, we obtain

Σ_{t=1}^T (F(x_t) − F*)/b_t ≤ ‖x_1 − x*‖²/(2γη) + Σ_{t=1}^T (η/(2γ b_t²)) ‖∇F(x_t)‖².

Note that F also satisfies Assumption 2, i.e., F(x_t) − F* ≥ ‖∇F(x_t)‖²/(2L).
Therefore

Σ_{t=1}^T [ (F(x_t) − F*)/(2b_t) + ‖∇F(x_t)‖²/(4L b_t) ] ≤ Σ_{t=1}^T (F(x_t) − F*)/b_t ≤ ‖x_1 − x*‖²/(2γη) + Σ_{t=1}^T (η/(2γ b_t²)) ‖∇F(x_t)‖²

⇒ Σ_{t=1}^T (F(x_t) − F*)/b_t ≤ ‖x_1 − x*‖²/(γη) + Σ_{t=1}^T (η/(γ b_t²) − 1/(2L b_t)) ‖∇F(x_t)‖², where we denote the last sum by A.

We can bound the term A by the technique commonly used in the analysis of adaptive methods. Let τ be the last t such that b_t ≤ 2ηL/γ. If b_1 > 2ηL/γ, we have A < 0 ≤ (2η/γ) log_+(2ηL/(γb_0)). Otherwise,

A ≤ Σ_{t=1}^τ (η/(γ b_t²) − 1/(2L b_t)) ‖∇F(x_t)‖² ≤ Σ_{t=1}^τ (η/γ) (b_t² − b_{t-1}²)/b_t² ≤ (η/γ) Σ_{t=1}^τ log(b_t²/b_{t-1}²) ≤ (2η/γ) log_+(2ηL/(γb_0)).

Thus we always have A ≤ (2η/γ) log_+(2ηL/(γb_0)), and obtain

Σ_{t=1}^T (F(x_t) − F*)/b_t ≤ ‖x_1 − x*‖²/(γη) + (2η/γ) log_+(2ηL/(γb_0)),

which gives

Σ_{t=1}^T (F(x_t) − F*) ≤ b_T (‖x_1 − x*‖²/(γη) + (2η/γ) log_+(2ηL/(γb_0))).

Note that by Assumption 2 again, we have b_T = √(b_0² + Σ_{t=1}^T ‖∇F(x_t)‖²) ≤ √(b_0² + Σ_{t=1}^T 2L(F(x_t) − F*)). Let ∆_T = Σ_{t=1}^T (F(x_t) − F*); then

∆_T ≤ √(b_0² + 2L∆_T) (‖x_1 − x*‖²/(γη) + (2η/γ) log_+(2ηL/(γb_0)))
⇒ ∆_T ≤ (2L‖x_1 − x*‖²/(γη) + (4ηL/γ) log_+(2ηL/(γb_0)) + b_0) (‖x_1 − x*‖²/(γη) + (2η/γ) log_+(2ηL/(γb_0))).

Dividing both sides by T, we get the desired result.

When F is convex (which implies γ = 1), using the above theorem and convexity, we obtain the following convergence rate for the average iterate x̄_T = (1/T) Σ_{t=1}^T x_t:

Corollary 3.2. With Assumptions 1' and 2, AdaGradNorm admits

F(x̄_T) − F* ≤ (1/T) (2L‖x_1 − x*‖²/η + 4ηL log_+(2ηL/b_0) + b_0) (‖x_1 − x*‖²/η + 2η log_+(2ηL/b_0)).

The rate in Theorem 3.1 can be improved by a factor of 1/γ by replacing Assumption 2 with Assumption 2'. The details and the proof are deferred to Section A.1 in the appendix.
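The two ingredients of the self-bounding argument can be checked numerically. The sketch below (ours, not from the paper; the quadratic objective and constants are illustrative) verifies Assumption 2 along the AdaGradNorm trajectory and the resulting bound b_T ≤ √(b_0² + 2L∆_T):

```python
import math

# F(x) = 0.5 * ||x||^2, with L = 1, gamma = 1, x* = 0. We check:
#   (i)  weak smoothness: ||grad F(x_t)||^2 <= 2L(F(x_t) - F*)  (Assumption 2),
#   (ii) hence b_T = sqrt(b_0^2 + sum_t ||grad F(x_t)||^2) <= sqrt(b_0^2 + 2L*Delta_T),
# where Delta_T = sum_t (F(x_t) - F*).
eta, b0, L, T = 1.0, 1e-3, 1.0, 200
x = [3.0, -2.0]
sq_sum, delta_T = b0 ** 2, 0.0
assumption2_holds = True
for _ in range(T):
    g = list(x)                                   # grad F(x) = x for this quadratic
    gap = 0.5 * sum(xi * xi for xi in x)          # F(x_t) - F*
    assumption2_holds &= sum(gi * gi for gi in g) <= 2 * L * gap + 1e-12
    delta_T += gap
    sq_sum += sum(gi * gi for gi in g)
    b_t = math.sqrt(sq_sum)
    x = [xi - eta / b_t * gi for xi, gi in zip(x, g)]
b_T = math.sqrt(sq_sum)
upper = math.sqrt(b0 ** 2 + 2 * L * delta_T)      # the bound used on b_T
```

For this quadratic, Assumption 2 holds with equality, so b_T matches the bound up to rounding; for general functions under Assumption 2 the inequality can be strict.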

3.2. STOCHASTIC ADAGRADNORM

In this section, we consider the stochastic setting where we only have access to an unbiased gradient estimate ∇̂F(x_t) of ∇F(x_t) (Assumption 3). As expected for a stochastic method, the accumulation of noise is the reason that we can only expect an O(1/√T) convergence rate, instead of O(1/T). This convergence rate was already shown by prior works (Levy et al., 2018) under the bounded domain assumption. However, in an unbounded domain, when extending our previous analysis to the stochastic setting, that is, dividing both sides by b_t, we face several challenges. One of them is the term b_t^{-1} ⟨∇F(x_t) − ∇̂F(x_t), x_t − x*⟩. To handle this term, existing works such as Li & Orabona (2019) analyze a modified version of Stochastic AdaGradNorm with off-by-one stepsize, i.e., b_t = √(b_0² + Σ_{i=1}^{t-1} ‖∇̂F(x_i)‖²), in which the latest gradient ∇̂F(x_t) is not used to calculate b_t. This allows us to decouple b_t from the randomness at time t, so that in expectation E[b_t^{-1} ⟨∇F(x_t) − ∇̂F(x_t), x_t − x*⟩] = 0. Yet this analysis does not apply to the standard algorithm, which is more commonly used in practice. To the best of our knowledge, we are the first to propose a new technique that can show the convergence of Algorithm 2 on R^d without going through the off-by-one stepsize. Here, we briefly compare the assumptions in our analysis with those in Li & Orabona (2019). Assumptions 2' and 3, used in both works, are standard. Meanwhile, Assumptions 1 (γ-quasar convexity) and 4 (sub-Weibull noise) in our analysis are much weaker than the convexity and sub-Gaussian noise assumptions in Li & Orabona (2019). Besides, we note that we will present a bound on E[(Σ_{t=1}^T F(x_t) − F(x*))/T], a stronger criterion than the guarantee in Li & Orabona (2019) and one that is often used in convex analysis.
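The difference between the off-by-one stepsize and the standard one can be made concrete. In the sketch below (ours; the squared-gradient-norm sequence is arbitrary), the off-by-one divisor at step t is exactly the standard divisor at step t − 1, i.e., it is independent of the newest gradient:

```python
import math

def stepsizes(grad_sq_norms, b0=0.1):
    """Return (standard, off_by_one) divisor sequences for a given sequence of
    squared (stochastic) gradient norms ||g_1||^2, ..., ||g_T||^2."""
    standard, off_by_one = [], []
    s = b0 ** 2
    for g2 in grad_sq_norms:
        off_by_one.append(math.sqrt(s))   # b_t built from g_1, ..., g_{t-1} only
        s += g2
        standard.append(math.sqrt(s))     # b_t built from g_1, ..., g_t (Algorithm 2)
    return standard, off_by_one

std, off = stepsizes([4.0, 1.0, 9.0])
```

The standard divisor dominates the off-by-one one, so the standard stepsize η/b_t is never larger; the analytical difficulty is precisely that it is correlated with the newest noise.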
We also remark that the algorithm and analysis of Li & Orabona (2019) still require the smoothness parameter to set the initial stepsize; thus their method is not fully adaptive.

The first observation is that, if we let ξ_t := ∇̂F(x_t) − ∇F(x_t) be the stochastic error and M_T := max_{t∈[T]} ‖ξ_t‖², then M_T is bounded by σ² log^{2θ}(eT/δ) with probability at least 1 − δ (cf. Lemma A.4 in Appendix A), which gives a high-probability bound on b_T.

Lemma 3.3. Suppose F satisfies Assumptions 2' and 4. If M_T ≤ σ² log^{2θ}(eT/δ), then

b_T ≤ 2b_0 + 4(F(x_1) − F*)/η + 4ηL log_+(ηL/b_0) + 4σ √( T log^{2θ}(eT/δ) log(1 + 16σ²T log^{2θ}(eT/δ)/b_0²) ).

Lemma 3.3 gives us an insight: b_t = O(1 + σ√(t log^{2θ} t)). Note that this is expected, since the classic choice of the step size for SGD is of the order O(1 + σ√t). Hence, if we are willing to accept extra log terms in the convergence guarantee, the appearance of log b_t is manageable. Next we introduce our novel technique, which, to the best of our knowledge, is the first that allows us to analyze the standard Stochastic AdaGradNorm on R^d.

Lemma 3.4. Suppose F satisfies Assumptions 1 and 3. Then

E[ (Σ_{t=1}^T F(x_t) − F(x*))/b_T ] ≤ ‖x_1 − x*‖²/(γη) + (2η/γ) E[ M_T/b_0² + log(b_T/b_0) ].  (1)

Proof sketch. Starting from the γ-quasar convexity, with simple transformations we obtain

F(x_t) − F* ≤ ⟨−ξ_t, x_t − x*⟩/γ + (b_t/(2γη)) (‖x_t − x*‖² − ‖x_{t+1} − x*‖² + ‖x_{t+1} − x_t‖²).

Here we introduce our novel technique: instead of dividing by b_t, we divide both sides by 2b_t − b_0. This divisor causes a slight non-uniformity between the coefficients of the distance terms ‖x_t − x*‖², making their sum non-telescoping. However, this is exactly what we need to handle the difficult term ⟨−ξ_t, x_t − x*⟩/(γ(2b_t − b_0)), which does not disappear after taking the expectation:

E[ (F(x_t) − F*)/(2b_t − b_0) ] ≤ E[ ⟨−ξ_t, x_t − x*⟩/(γ(2b_t − b_0)) + (b_t/(2b_t − b_0)) (‖x_t − x*‖² − ‖x_{t+1} − x*‖² + ‖x_{t+1} − x_t‖²)/(2γη) ].
The key step is to use the Cauchy–Schwarz inequality for the term |⟨−ξ_t, x_t − x*⟩| ≤ (λ/2)‖ξ_t‖² + (1/(2λ))‖x_t − x*‖² with an appropriate coefficient λ, so that the term ‖x_t − x*‖² can be absorbed to form a telescoping sum b_{t-1}‖x_t − x*‖²/(2b_{t-1} − b_0) − b_t‖x_{t+1} − x*‖²/(2b_t − b_0). The remaining terms are free of x*; hence they can be bounded more easily. We can obtain

E[ (F(x_t) − F*)/(2b_t − b_0) ] ≤ E[ Z_t‖ξ_t‖² + b_{t-1}‖x_t − x*‖²/(2γη(2b_{t-1} − b_0)) − b_t‖x_{t+1} − x*‖²/(2γη(2b_t − b_0)) + η‖∇̂F(x_t)‖²/(2γ b_t²) ],

where Z_t = (η/(γb_0)) (1/(2b_{t-1} − b_0) − 1/(2b_t − b_0)). Now we have a telescoping sum b_{t-1}‖x_t − x*‖²/(2γη(2b_{t-1} − b_0)) − b_t‖x_{t+1} − x*‖²/(2γη(2b_t − b_0)). Taking the sum over t, we have

E[ Σ_{t=1}^T (F(x_t) − F*)/(2b_t) ] ≤ E[ Σ_{t=1}^T (F(x_t) − F*)/(2b_t − b_0) ] ≤ ‖x_1 − x*‖²/(2γη) + E[ Σ_{t=1}^T Z_t‖ξ_t‖² ] + E[ Σ_{t=1}^T η‖∇̂F(x_t)‖²/(2γ b_t²) ].

Proceeding to bound each term, we obtain Lemma 3.4.

We emphasize the following crucial aspect of Lemma 3.4: the inequality gives us a relationship between the function gap and the stepsize b_T, which we know how to bound with high probability under Assumptions 2' and 4. On the other hand, this relationship is not ideal, because on the L.H.S. of (1) we have not obtained a decoupling between the function gap and b_T. To this end, we introduce our second novel technique. Let ∆_T := Σ_{t=1}^T (F(x_t) − F(x*)). We write ∆_T = ∆_T 1_{E(δ)} + ∆_T 1_{E^c(δ)}, where we define the event E(δ) = {M_T ≤ σ² log^{2θ}(eT/δ)}. For the first term, when E(δ) happens, we also know from Lemma 3.3 that the stepsize is bounded. Thus we can bound

E[ ∆_T 1_{E(δ)} ] = E[ (Σ_{t=1}^T (F(x_t) − F(x*))/b_T) b_T 1_{E(δ)} ],

which leads us back to Lemma 3.4. We can bound the second term using a tail bound for the event E^c(δ), knowing from the first observation that Pr[E^c(δ)] ≤ δ. From this insight, and using the self-bounding argument as in the proof of Theorem 3.1, we finally obtain the following result. Theorem 3.5.
Suppose F satisfies Assumptions 1, 2', 3 and 4. Stochastic AdaGradNorm admits

E[ (Σ_{t=1}^T F(x_t) − F(x*))/T ] = O( (1 + poly(σ² log^{2θ} T, log(1 + σ²T log^{2θ} T))) (1/T + σ log^θ T/√T) ).

Algorithm 3 AdaGradNorm-Last

Initialize: x_1, η > 0, ∆ > 0, p_t > 0
for t = 1 to T:
  b_t = (b_0^{2+∆} + Σ_{i=1}^t ‖∇F(x_i)‖²/p_i)^{1/(2+∆)}
  x_{t+1} = x_t − (η/b_t) ∇F(x_t)

Algorithm 4 AdaGradNorm-Last

Initialize: x_1, η > 0, δ ∈ [2/3, 1), p_t > 0
for t = 1 to T:
  b_t = (b_0² + Σ_{i=1}^t ‖∇F(x_i)‖²/p_i)^{1/2}
  x_{t+1} = x_t − (η/(b_t^δ b_{t-1}^{1−δ})) ∇F(x_t)

Remark 3.6. In the big-O notation, we only show the dependency on σ, T and θ for simplicity. The dependency on the other parameters is made explicit in the proof of the theorem. By setting σ = 0, we obtain the standard convergence rate E[(Σ_{t=1}^T F(x_t) − F(x*))/T] = O(1/T), as shown in Section 3.1 for the deterministic setting. This means our analysis adapts to the noise parameter σ. Finally, it is worth pointing out that even when we relax Assumption 2' to Assumption 2, we can still provide a convergence guarantee for Stochastic AdaGradNorm. We present the result in Theorem A.9 in the appendix.
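A small simulation (ours; the bounded-noise oracle, seed and constants are illustrative, and bounded noise is sub-Weibull for any θ) of Algorithm 2 on F(x) = ½‖x‖² illustrates the behavior guaranteed by Theorem 3.5: the average function gap stays small even though the latest noisy gradient enters b_t:

```python
import math
import random

random.seed(0)                      # illustrative run; any seed works
eta, b0, sigma, T = 1.0, 1e-3, 0.5, 2000
x = [3.0, -2.0]
sq_sum, gap_sum = b0 ** 2, 0.0
for t in range(T):
    gap_sum += 0.5 * sum(xi * xi for xi in x)             # F(x_t) - F*
    g = [xi + random.uniform(-sigma, sigma) for xi in x]  # unbiased, bounded noise
    sq_sum += sum(gi * gi for gi in g)  # the latest noisy gradient enters b_t
    b_t = math.sqrt(sq_sum)
    x = [xi - eta / b_t * gi for xi, gi in zip(x, g)]
avg_gap = gap_sum / T
final_gap = 0.5 * sum(xi * xi for xi in x)
```

Because the noise keeps accumulating in b_t, the stepsize automatically decays at the O(1/(σ√t)) rate that the analysis predicts.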

4. LAST ITERATE CONVERGENCE OF VARIANTS OF ADAGRADNORM FOR γ-QUASAR CONVEX AND SMOOTH MINIMIZATION ON R^d

In Section 3, under Assumptions 1 and 2, we proved that the average iterate produced by AdaGradNorm converges at the 1/T rate, i.e., Σ_{t=1}^T (F(x_t) − F*)/T = O(1/T). A natural question is whether there exists an adaptive algorithm that can guarantee the convergence of the last iterate. In this section, we give an affirmative answer by presenting two simple variants of AdaGradNorm and showing convergence of the last iterate under Assumptions 1 and 2'. In Algorithm 3, by setting p_i = i^{-1}, ‖∇F(x_i)‖² has a bigger coefficient than in the standard AdaGradNorm. Should we use the 1/2-power (∆ = 0) instead of 1/(2+∆) with ∆ > 0, b_t would grow faster compared with the same term in AdaGradNorm. We will see later that ∆ = 0 still leads to the convergence of the last iterate. However, we first focus on the easier case with ∆ > 0 and state the convergence rate of Algorithm 3 in Theorem 4.1.

Theorem 4.1. With Assumptions 1 and 2', by taking p_t = 1/t in Algorithm 3, we have

F(x_{T+1}) − F* ≤ (1/T) ( (2/η)(‖x_1 − x*‖²/(γη) + h(∆) + g(∆)) + b_0^∆ )^{1/∆} (‖x_1 − x*‖²/(γη) + h(∆) + g(∆)),

where h(∆) := ((2+∆)η(ηL)^∆/2) log_+(ηL/b_0) if ∆ ≥ 1, h(∆) := ((2+∆)η²L/(2b_0^{1−∆})) log_+(ηL/b_0) if ∆ ∈ (0, 1), and g(∆) := ((2+∆)η/γ)(2ηL/γ)^∆ log_+(2ηL/(γb_0)).

An issue with Algorithm 3 is that, when using the 1/(2+∆)-power, the stepsize ceases to be scale-invariant. Algorithm 4 takes a different approach, using the scale-invariant power 1/2 but a smaller stepsize η/(b_t^δ b_{t-1}^{1−δ}), for a constant δ ∈ [2/3, 1). The tradeoff is that the provable convergence rate of the second variant depends exponentially on the smoothness parameter. We also note that, when δ = 1, we obtain the same algorithm as when setting ∆ = 0 in the previous variant.

Remark 4.2. b_0 in every algorithm is only for stabilization and is set to a constant very close to 0 in practice. However, the first stepsize in Algorithm 4, i.e., η/(b_1^δ b_0^{1−δ}), will then explode.
To avoid this issue, we can simply set the first stepsize to η/b_1 instead of η/(b_1^δ b_0^{1−δ}). We note that, under this change, Algorithm 4 still admits a provable convergence rate. However, for simplicity, we keep b_1^δ b_0^{1−δ} in both the description of the algorithm and its analysis.

Theorem 4.3. With Assumptions 1 and 2', by taking p_t = 1/t in Algorithm 4, we have

F(x_{T+1}) − F* ≤ b_0 exp(k(δ)/(1−δ)) k(δ)/T,

where k(δ) = (‖x_1 − x*‖²/(γη)) (2 + (ηL/b_0)(1 − (b_0/(ηL))^{1/δ})_+) + (2/(γδ))(2ηL/(γb_0))^{2/δ − 2} log_+(2ηL/(γb_0)).

To finish this section, we briefly discuss the case ∆ = 0 in Algorithm 3, or equivalently δ = 1 in Algorithm 4. First, by letting ∆ tend to 0, we can expect a convergence rate depending exponentially on the problem parameters. When ∆ = 0, while we can still bound the function gap via the final stepsize b_T, bounding b_T becomes problematic. In the proof of Theorem 4.1, to bound b_T, we use the sum Σ_{t=1}^T ‖∇F(x_t)‖²/(b_t² p_t) = Σ_{t=1}^T (b_t^{2+∆} − b_{t-1}^{2+∆})/b_t². This sum only admits a lower bound in terms of b_T when ∆ > 0; thus the argument does not work when ∆ = 0. However, it is still possible to give an asymptotic rate under the γ-quasar convexity assumption. If we further assume that F is convex, we can give a non-asymptotic rate. The main idea for bounding b_T is as follows. Let τ be the last time such that b_t ≤ ηL/2. The increment from b_{τ+1} to b_T can be bounded by observing that the increase in each step satisfies ‖∇F(x_t)‖² ≤ (2/3) p_t b_t². Moreover, the critical step is the increase from b_τ to b_{τ+1}, which again can be analyzed via the function gap and smoothness. We present the asymptotic and non-asymptotic convergence rates and their analysis in Sections B.4 and B.5 in the appendix.
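For concreteness, the following sketch (ours; the quadratic objective and the choices ∆ = 1/2, η = 1, b_0 = 10^{-3} are illustrative) implements Algorithm 3 with p_t = 1/t on F(x) = ½‖x‖² and returns the last iterate, whose gap Theorem 4.1 controls:

```python
import math

def adagrad_norm_last(grad, x1, eta=1.0, b0=1e-3, Delta=0.5, T=400):
    """AdaGradNorm-Last (Algorithm 3) with p_t = 1/t:
    b_t = (b_0^{2+Delta} + sum_{i<=t} ||grad F(x_i)||^2 / p_i)^{1/(2+Delta)}."""
    x = list(x1)
    acc = b0 ** (2 + Delta)
    for t in range(1, T + 1):
        g = grad(x)
        acc += t * sum(gi * gi for gi in g)  # dividing by p_t = 1/t multiplies by t
        b_t = acc ** (1.0 / (2 + Delta))
        x = [xi - eta / b_t * gi for xi, gi in zip(x, g)]
    return x

# Last iterate on F(x) = 0.5 * ||x||^2 (grad F(x) = x, F* = 0, x* = 0).
x_last = adagrad_norm_last(lambda x: list(x), [3.0, -2.0])
last_gap = 0.5 * sum(xi * xi for xi in x_last)
```

The t-weighting makes b_t grow faster than in AdaGradNorm, while the 1/(2+∆)-root tempers that growth; together they yield a bound on the last iterate rather than the average.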

5. ACCELERATED VARIANTS OF ADAGRADNORM FOR CONVEX AND SMOOTH MINIMIZATION ON R^d

In this section, using the stronger Assumption 1', we give two new algorithms that achieve the accelerated rate O(1/T²), matching the optimal rate in T for convex and smooth optimization in unconstrained deterministic problems. Our new algorithms are adapted from the acceleration scheme introduced in Auslender & Teboulle (2006) (see also Lan (2020)). They are also similar to existing adaptive accelerated methods designed for bounded domains, including Levy et al. (2018); Ene et al. (2021). However, previous analyses do not apply to unconstrained problems; we therefore have to make necessary modifications. To the best of our knowledge, for unconstrained problems in the deterministic setting, the only existing analysis of an accelerated method was given in Antonakopoulos et al. (2022). Here we discuss some limitations of this work. The convergence rate for the weighted average iterate x_{T+1/2} is given by

f(x_{T+1/2}) − f(x) ≤ O( (1/T²) (R_h lim_{t→∞} b_t + K_h lim_{t→∞} b_t²) ),

where h is a K_h-strongly convex mirror map and R_h = max h(x) − min h(x) is the range of h. This result is only applicable when the domain is unbounded but the range of the mirror map is bounded. Even in the standard ℓ2 setup with h(x) = (1/2)‖x‖², this assumption does not hold. Moreover, due to the term lim_{t→∞} b_t, the above guarantee depends on the particular function. Thus, while a standard convergence guarantee is applicable to, say, all SVM models with Huber loss, the above guarantee varies for each SVM model and there is no universal bound for all of them. We further highlight some key differences between this work and ours. While the convergence rate above depends on the convergence of the stepsize, for both our variants we will show an explicit convergence rate that holds universally for the entire function class. Second, the algorithm in Antonakopoulos et al.
(2022) is based on an extra-gradient method, which requires two gradient computations per iteration. Instead, our algorithms only need one gradient computation per iteration. Finally, our algorithms guarantee the convergence of the last iterate, as opposed to the weighted average iterate as shown above.

Algorithm 5 shows the first variant. For an accelerated method, the step size typically has the form b_t = (b_0² + Σ_{i=1}^t s_i‖∇F_i‖²)^{1/2}, where ∇F_i is the gradient evaluated at time i and s_i = O(i²). However, in order to give an explicit convergence rate, Algorithm 5 uses a smaller b_t with power 1/(2+∆), for ∆ > 0. When ∆ = 0, we can only show an asymptotic convergence rate, similarly to Antonakopoulos et al. (2022). We first focus on the case ∆ > 0; in the appendix we discuss the convergence of the algorithm when ∆ = 0. We have the following theorem.

Algorithm 5 AdaGradNorm-Acc

Initialize: x_1 = w_1, η > 0, ∆ > 0, a_t > 0, q_t > 0
for t = 1 to T:
  v_t = (1 − a_t) w_t + a_t x_t
  b_t = (b_0^{2+∆} + Σ_{i=1}^t ‖∇F(v_i)‖²/q_i²)^{1/(2+∆)}
  x_{t+1} = x_t − (η/(q_t b_t)) ∇F(v_t)
  w_{t+1} = (1 − a_t) w_t + a_t x_{t+1}

Algorithm 6 AdaGradNorm-Acc

Initialize: x_1 = w_1, η > 0, δ ∈ [2/3, 1), a_t > 0, q_t > 0
for t = 1 to T:
  v_t = (1 − a_t) w_t + a_t x_t
  b_t = (b_0² + Σ_{i=1}^t ‖∇F(v_i)‖²/q_i²)^{1/2}
  x_{t+1} = x_t − (η/(q_t b_t^δ b_{t-1}^{1−δ})) ∇F(v_t)
  w_{t+1} = (1 − a_t) w_t + a_t x_{t+1}

Theorem 5.1. Suppose F satisfies Assumptions 1' and 2', and let a_t = 2/(t+1), q_t = 2/t in Algorithm 5. Then

F(w_{T+1}) − F* ≤ (1/(T(T+1))) ( 2‖x* − x_1‖²/η² + 4h(∆)/η + b_0^∆ )^{1/∆} (‖x* − x_1‖²/(2η) + h(∆)),

where h(∆) = ((2+∆)(2ηL)^{∆−1} Lη²/2) log_+(2ηL/b_0) if ∆ ≥ 1, and h(∆) = ((2+∆)Lη²/(2b_0^{1−∆})) log_+(2ηL/b_0) if ∆ ∈ (0, 1).

Similarly to the second variant in the previous section, we also have a scale-invariant accelerated algorithm, shown in Algorithm 6, which uses the power 1/2 but a smaller stepsize η/(q_t b_t^δ b_{t-1}^{1−δ}). This algorithm also has an exponential dependency on the problem parameters, as given in the following theorem.
Remark 5.2. Similarly to Remark 4.2, the first stepsize in Algorithm 6, i.e., η/(q_1 b_1^δ b_0^{1−δ}), can be replaced by η/(q_1 b_1). However, for simplicity, we keep b_1^δ b_0^{1−δ} in both the description of the algorithm and its analysis.

Theorem 5.3. Suppose F satisfies Assumptions 1' and 2', and let a_t = 2/(t+1), q_t = 2/t in Algorithm 6. Then

F(w_{T+1}) − F* ≤ b_0 exp(s(δ)/(1−δ)) s(δ)/(T(T+1)),

where s(δ) = ‖x* − x_1‖²/(2η) + (η²L/b_0)(1 − (b_0/(2ηL))^{1/δ})_+.

Similarly to the previous section, we give a more detailed discussion of the convergence of Algorithm 5 when ∆ = 0, or equivalently Algorithm 6 when δ = 1, in Section C.4 in the appendix. While we can still show an accelerated O(1/T²) asymptotic convergence rate, we only present an O(1/T² + 1/T) non-asymptotic rate. The difference between these algorithms and the ones in the previous section is that the stepsize b_t increases much faster. More precisely, the increment in each step is now O(t²‖∇F(v_t)‖²) instead of O(t‖∇F(x_t)‖²). Thus we can only show an upper bound on b_t that grows linearly with time, which leads to the O(1/T² + 1/T) convergence rate.
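A runnable sketch (ours, not from the paper) of Algorithm 5 with a_t = 2/(t+1), q_t = 2/t and the illustrative choices ∆ = 1/2, η = 1, b_0 = 1, applied to F(x) = ½‖x‖²:

```python
import math

def adagrad_norm_acc(grad, x1, eta=1.0, b0=1.0, Delta=0.5, T=1000):
    """AdaGradNorm-Acc (Algorithm 5) with a_t = 2/(t+1), q_t = 2/t."""
    x, w = list(x1), list(x1)
    acc = b0 ** (2 + Delta)
    for t in range(1, T + 1):
        a, q = 2.0 / (t + 1), 2.0 / t
        v = [(1 - a) * wi + a * xi for wi, xi in zip(w, x)]   # extrapolation point
        g = grad(v)
        acc += sum(gi * gi for gi in g) / q ** 2              # weight ~ t^2 / 4
        b_t = acc ** (1.0 / (2 + Delta))
        x = [xi - eta / (q * b_t) * gi for xi, gi in zip(x, g)]
        w = [(1 - a) * wi + a * xi for wi, xi in zip(w, x)]
    return w

# Output point w_{T+1} on F(x) = 0.5 * ||x||^2 (grad F(v) = v, F* = 0, x* = 0).
w_out = adagrad_norm_acc(lambda v: list(v), [3.0, -2.0])
acc_gap = 0.5 * sum(wi * wi for wi in w_out)
```

Note that only one gradient, ∇F(v_t), is computed per iteration, in contrast to extra-gradient-based accelerated schemes.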

6. CONCLUSION AND FUTURE WORK

In this paper, we go back to the most basic AdaGrad algorithm and study its convergence rate in generalized smooth convex optimization. We prove explicit convergence guarantees for unconstrained problems in both the deterministic and stochastic settings. Building on these insights, we propose new algorithms that exhibit last iterate convergence, with and without acceleration. We see our work as primarily theoretical, since the first and foremost goal is to understand properties of existing algorithms that work well in practice. We refer the reader to the long line of previous works (Duchi et al., 2011; Levy, 2017; Kavis et al., 2019; Bach & Levy, 2019; Antonakopoulos et al., 2020; Ene et al., 2021; Ene & Nguyen, 2022; Antonakopoulos et al., 2022) that have already demonstrated the behavior of AdaGrad and accelerated adaptive algorithms empirically.

A MISSING PROOFS FROM SECTION 3

A.1 ADAGRADNORM

As we pointed out before, it is possible to obtain an improvement by a factor of 1/γ over Theorem 3.1 by assuming L-smoothness instead of weak L-smoothness.

Theorem A.1. With Assumptions 1 and 2', AdaGradNorm admits

(1/T) Σ_{t=1}^T (F(x_t) − F*) ≤ (1/T) (L‖x_1 − x*‖²/η + 2ηL log_+(ηL/b_0) + b_0) (‖x_1 − x*‖²/(γη) + (2η/γ) log_+(2ηL/(γb_0))).

Proof. Note that Assumption 2' implies Assumption 2, so following the same proof as in Theorem 3.1, we still have

Σ_{t=1}^T (F(x_t) − F*) ≤ b_T (‖x_1 − x*‖²/(γη) + (2η/γ) log_+(2ηL/(γb_0))).

However, from here we will bound b_T directly, rather than using the self-bounding argument of the previous proof. By L-smoothness, we know

F(x_{t+1}) − F(x_t) ≤ ⟨∇F(x_t), x_{t+1} − x_t⟩ + (L/2)‖x_{t+1} − x_t‖² = (Lη²/(2b_t²) − η/b_t) ‖∇F(x_t)‖²
⇒ ‖∇F(x_t)‖²/b_t ≤ 2(F(x_t) − F(x_{t+1}))/η + (Lη/b_t² − 1/b_t) ‖∇F(x_t)‖².

Summing from 1 to T, we know

Σ_{t=1}^T ‖∇F(x_t)‖²/b_t ≤ (2/η)(F(x_1) − F(x_{T+1})) + Σ_{t=1}^T (Lη/b_t² − 1/b_t)‖∇F(x_t)‖² ≤ (2/η)(F(x_1) − F(x*)) + Σ_{t=1}^T (Lη/b_t² − 1/b_t)‖∇F(x_t)‖².
Using the same proof technique as before, we can bound

Σ_{t=1}^T (Lη/b_t² − 1/b_t)‖∇F(x_t)‖² ≤ 2ηL log_+(ηL/b_0)

and

Σ_{t=1}^T ‖∇F(x_t)‖²/b_t = Σ_{t=1}^T (b_t² − b_{t-1}²)/b_t ≥ Σ_{t=1}^T (b_t − b_{t-1}) = b_T − b_0.

Hence, we know

b_T ≤ (2/η)(F(x_1) − F(x*)) + 2ηL log_+(ηL/b_0) + b_0 ≤ L‖x_1 − x*‖²/η + 2ηL log_+(ηL/b_0) + b_0.

By using this bound on b_T, we get the final result with an improvement by a factor of 1/γ.
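The direct bound on b_T derived above can also be checked numerically. The following sketch (ours, with illustrative constants) runs AdaGradNorm on F(x) = ½‖x‖² (so L = 1, x* = 0) and compares the realized b_T against L‖x_1 − x*‖²/η + 2ηL log_+(ηL/b_0) + b_0:

```python
import math

eta, b0, L, T = 1.0, 0.5, 1.0, 300
x1 = [3.0, -2.0]
x = list(x1)
sq_sum = b0 ** 2
for _ in range(T):
    g = list(x)                          # grad F(x) = x for this quadratic
    sq_sum += sum(gi * gi for gi in g)
    b_t = math.sqrt(sq_sum)
    x = [xi - eta / b_t * gi for xi, gi in zip(x, g)]
b_T = math.sqrt(sq_sum)
log_plus = max(math.log(eta * L / b0), 0.0)   # log_+(eta*L/b_0)
bound = L * sum(v * v for v in x1) / eta + 2 * eta * L * log_plus + b0
```

In this run b_T settles well below the bound, reflecting that the bound absorbs the worst case F(x_1) − F* ≤ (L/2)‖x_1 − x*‖².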

A.2 STOCHASTIC ADAGRADNORM

We will employ the following notations for convenience:

∆_t := Σ_{s=1}^t (F(x_s) − F*);  ξ_t := ∇̂F(x_t) − ∇F(x_t);  M_t := max_{s∈[t]} ‖ξ_s‖².

Before diving into the details of our proof, we first present some technical results used in the proof of Theorem 3.5.

A.2.1 TECHNICAL LEMMAS

To start with, under Assumptions 1 and 3 only, we can obtain a bound for a term close to our final goal $\Delta_T$.

Lemma A.2. (Lemma 3.4) Suppose $F$ satisfies Assumptions 1 and 3. Then we have
$$\mathbb{E}\left[\frac{\Delta_T}{b_T}\right] \le \frac{\|x_1-x^*\|^2}{\gamma\eta} + \frac{2\eta}{\gamma}\,\mathbb{E}\left[\frac{M_T}{b_0^2} + \log\frac{b_T}{b_0}\right].$$

Proof. We start by using the $\gamma$-quasar convexity of the function $F$:
$$F(x_t) - F^* \le \frac{\langle \nabla F(x_t), x_t - x^* \rangle}{\gamma} = \frac{\langle \nabla F(x_t) - \widehat{\nabla} F(x_t), x_t - x^* \rangle}{\gamma} + \frac{\langle \widehat{\nabla} F(x_t), x_t - x^* \rangle}{\gamma} = \frac{\langle -\xi_t, x_t - x^* \rangle}{\gamma} + \frac{b_t}{2\gamma\eta}\left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2 + \|x_{t+1} - x_t\|^2\right).$$
Dividing both sides by $2b_t - b_0$ and taking expectations, we have
$$\mathbb{E}\left[\frac{F(x_t) - F^*}{2b_t - b_0}\right] \le \mathbb{E}\left[\frac{\langle -\xi_t, x_t - x^* \rangle}{\gamma(2b_t - b_0)}\right] + \mathbb{E}\left[\frac{b_t}{2b_t - b_0} \cdot \frac{\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2 + \|x_{t+1} - x_t\|^2}{2\gamma\eta}\right].$$
Now we no longer have a telescoping sum on the R.H.S. However, this is exactly what we need in order to handle the difficult term $\frac{\langle -\xi_t, x_t - x^* \rangle}{\gamma(2b_t - b_0)}$, which does not disappear after taking the expectation. The key step is to apply the Cauchy-Schwarz inequality $|\langle -\xi_t, x_t - x^* \rangle| \le \frac{\lambda}{2}\|\xi_t\|^2 + \frac{1}{2\lambda}\|x_t - x^*\|^2$ with an appropriate coefficient $\lambda$, so that the term $\|x_t - x^*\|^2$ can be absorbed to make a telescoping sum $\frac{b_{t-1}\|x_t - x^*\|^2}{2b_{t-1} - b_0} - \frac{b_t\|x_{t+1} - x^*\|^2}{2b_t - b_0}$. The remaining terms are free of $x^*$ and hence can be more easily bounded.
Published as a conference paper at ICLR 2023
To do this, note that E -ξ t , x t -x * γ(2b t -b 0 ) =E     1 γ 1 2b t -b 0 - 1 2b t-1 -b 0 A -ξ t , x t -x *     ≤E [|A| | -ξ t , x t -x * |] ≤E |A| |A| 4 b t-1 2γη(2b t-1 -b 0 ) - b t 2γη(2b t -b 0 ) -1 ξ t 2 + |A| -1 b t-1 2γη(2b t-1 -b 0 ) - b t 2γη(2b t -b 0 ) x t -x * 2 =E η γb 0 1 2b t-1 -b 0 - 1 2b t -b 0 ξ t 2 + E b t-1 2b t-1 -b 0 - b t 2b t -b 0 x t -x * 2 2γη Thus we have E F (x t ) -F * 2b t -b 0 ≤E η γb 0 1 2b t-1 -b 0 - 1 2b t -b 0 ξ t 2 + E b t-1 x t -x * 2 2γη(2b t-1 -b 0 ) - b t x t+1 -x * 2 2γη(2b t -b 0 ) + E    η ∇F (x t ) 2 2γb t (2b t -b 0 )    ≤E η γb 0 1 2b t-1 -b 0 - 1 2b t -b 0 ξ t 2 + E b t-1 x t -x * 2 2γη(2b t-1 -b 0 ) - b t x t+1 -x * 2 2γη(2b t -b 0 ) + E    η ∇F (x t ) 2 2γb 2 t    Now we have a telescoping sum b t-1 x t -x * 2 2γη(2b t-1 -b 0 ) - b t x t+1 -x * 2 2γη(2b t -b 0 ) and the remaining terms are free of x t -x * . Taking the sum over t, we have E T t=1 F (x t ) -F * 2b t -b 0 ≤E T t=1 η 2γb 0 1 2b t -b 0 - 1 2b t-1 -b 0 ξ t 2 + x 1 -x * 2 2γη + E T t=1 η ∇F (x t ) 2 2γb 2 t . First for the easy term E T t=1 η ∇F (xt) 2 2γb 2 t , we have E T t=1 η ∇F (x t ) 2 2γb 2 t = η 2γ E T t=1 b 2 t -b 2 t-1 b 2 t ≤ η 2γ E T t=1 log b 2 t -log b 2 t-1 = η γ E log b T b 0 . Next, we bound E T t=1 η γb 0 1 2b t-1 -b 0 - 1 2b t -b 0 ξ t 2 ≤E T t=1 η γb 0 1 2b t-1 -b 0 - 1 2b t -b 0 M T ≤E η γb 2 0 M T (4) Plugging the bounds (3) and ( 4) into (2), we have E T t=1 F (x t ) -F * 2b t ≤ E T t=1 F (x t ) -F * 2b t -b 0 ≤ x 1 -x * 2 2γη + η γb 2 0 E [M T ] + η γ E log b T b 0 . The last step is using T t=1 F (xt)-F * 2bt ≥ T t=1 F (xt)-F * 2b T = ∆ T 2b T to finish the proof. Due to the appearance of M T in Lemma A.2, it is natural to consider what we can obtain under the additional Assumption 4, i.e., sub-Weibull noise with parameter θ. We first provide the following simple bound on E ξ t 2 . The result is not new and the proof is only included for completeness. Lemma A.3. 
Under Assumption 4, $\forall t \in [T]$, we have $\mathbb{E}\|\xi_t\|^2 \le \Gamma(2\theta+1)e\sigma^2$.

Proof. We first note that, from the definition of sub-Weibull noise, the tail of $\|\xi_t\|$ can be bounded as follows:
$$\Pr\left[\|\xi_t\| \ge u\right] \le \frac{\mathbb{E}\exp\left((\|\xi_t\|/\sigma)^{1/\theta}\right)}{\exp\left((u/\sigma)^{1/\theta}\right)} \le \exp\left(1 - (u/\sigma)^{1/\theta}\right).$$
Then we can obtain
$$\mathbb{E}\|\xi_t\|^2 = \int_0^\infty 2u \Pr\left[\|\xi_t\| \ge u\right] du \le \int_0^\infty 2u \exp\left(1 - (u/\sigma)^{1/\theta}\right) du = 2\theta e \sigma^2 \int_0^\infty v^{2\theta-1}\exp(-v)\, dv = \Gamma(2\theta+1)e\sigma^2,$$
where $u$ is substituted by $\sigma v^\theta$ in the second equality.

Next, we prove a high-probability bound on $M_T$, the proof of which is inspired by Lemma 5 in Li & Orabona (2020).

Lemma A.4. Under Assumption 4, given $0 < \delta < 1$, define the event $E(\delta) = \left\{M_T \le \sigma^2 \log^{2\theta}\frac{eT}{\delta}\right\}$; then we have $\Pr[E(\delta)] \ge 1 - \delta$.

Proof. Note that
$$\Pr[M_T \ge u] = \Pr\left[\max_{s\in[T]}\|\xi_s\|^2 \ge u\right] = \Pr\left[\max_{s\in[T]}\|\xi_s\|^{1/\theta} \ge u^{\frac{1}{2\theta}}\right] \le \frac{\mathbb{E}\exp\left(\max_{s\in[T]}(\|\xi_s\|/\sigma)^{1/\theta}\right)}{\exp\left((u^{1/2}/\sigma)^{1/\theta}\right)} \le \frac{\sum_{s=1}^T \mathbb{E}\exp\left((\|\xi_s\|/\sigma)^{1/\theta}\right)}{\exp\left((u^{1/2}/\sigma)^{1/\theta}\right)} = T\exp\left(1 - (u^{1/2}/\sigma)^{1/\theta}\right).$$
Choose $u = \sigma^2 \log^{2\theta}\frac{eT}{\delta}$ to obtain $\Pr\left[M_T \ge \sigma^2 \log^{2\theta}\frac{eT}{\delta}\right] \le \delta$.

Lastly, we will find an upper bound on the $p$-th moment of $M_T$.

Lemma A.5. Under Assumption 4, given $p > 0$, we have $\mathbb{E}[M_T^p] \le \sigma^{2p}\left(\log^{2\theta p}\left(\Gamma(4\theta p + 1)e^2 T^2\right) + 1\right)$.

Proof. Note that in Lemma A.4 we proved $\Pr[M_T \ge u] \le T\exp\left(1 - (u^{1/2}/\sigma)^{1/\theta}\right)$. Let $E(\delta)$ be the same as in Lemma A.4. Then, by Hölder's inequality, we have
$$\mathbb{E}[M_T^p] = \mathbb{E}\left[M_T^p \mathbb{1}_{E(\delta)}\right] + \mathbb{E}\left[M_T^p \mathbb{1}_{E^c(\delta)}\right] \le \mathbb{E}\left[M_T^p \mathbb{1}_{E(\delta)}\right] + \sqrt{\mathbb{E}\left[M_T^{2p}\right]\mathbb{E}\left[\mathbb{1}_{E^c(\delta)}\right]} \le \sigma^{2p}\log^{2\theta p}\frac{eT}{\delta} + \sqrt{\mathbb{E}\left[M_T^{2p}\right]\delta}.$$
For the second term,
$$\sqrt{\mathbb{E}\left[M_T^{2p}\right]\delta} = \sqrt{\delta \int_0^\infty 2p u^{2p-1}\Pr[M_T \ge u]\, du} \le \sqrt{\delta \int_0^\infty 2p u^{2p-1} T\exp\left(1 - (u^{1/2}/\sigma)^{1/\theta}\right) du} = \sigma^{2p}\sqrt{\Gamma(4\theta p + 1)eT\delta}.$$
Choosing $\delta = \frac{1}{\Gamma(4\theta p + 1)eT} < 1$, we have
$$\mathbb{E}[M_T^p] \le \sigma^{2p}\left(\log^{2\theta p}\left(\Gamma(4\theta p + 1)e^2 T^2\right) + 1\right).$$
Note that all the above results only depend on Assumptions 1, 3 and 4, without requiring the smoothness of $F$.

A.2.2 PROOF OF THEOREM 3.5

Theorem 3.5 additionally requires Assumption 2'. Thus we first show that, under Assumptions 2' and 4, $b_T$ enjoys an $O(1 + \sigma\sqrt{T}\log^{2\theta} T)$ upper bound with high probability. Lemma A.6.
(Lemma 3.3) Suppose F satisfies Assumptions 2'and 4. Under the event E(δ) = M T ≤ σ 2 log 2θ eT δ , we have b T ≤ g T (δ) := 2b 0 + 4(F (x 1 ) -F * ) η + 4ηL log + ηL b 0 + 4σ T log 2θ eT δ log 1 + 16σ 2 T log 2θ eT δ b 2 0 . Additionally, by Lemma A.4, there is 1 -δ ≤ Pr [E(δ)] ≤ Pr [b T ≤ g T (δ)] . Proof. We start by using the smoothness of F F (x t+1 ) -F (x t ) ≤ ∇F (x t ), x t+1 -x t + L 2 x t+1 -x t 2 = - η b t ∇F (x t ), ∇F (x t ) + η 2 L 2b 2 t ∇F (x t ) 2 = - η b t ∇F (x t ) -∇F (x t ), ∇F (x t ) - η b t ∇F (x t ) 2 + η 2 L 2b 2 t ∇F (x t ) 2 ⇒ ∇F (x t ) 2 b t ≤ 2 η (F (x t ) -F (x t+1 )) + 2 ξ t , ∇F (x t ) b t + ηL b 2 t - 1 b t ∇F (x t ) 2 . Taking the sum over t we have T t=1 ∇F (x t ) 2 b t ≤ 2(F (x 1 ) -F * ) η + 2 T t=1 ξ t , ∇F (x t ) b t + T t=1 ηL b 2 t - 1 b t ∇F (x t ) 2 . Using the common technique, we know that T t=1 ηL b 2 t -1 bt ∇F (x t ) 2 ≤ 2ηL log + ηL b0 . More- over, for the L.H.S. T t=1 ∇F (x t ) 2 b t = T t=1 b 2 t -b 2 t-1 b t ≥ T t=1 b t -b t-1 = b T -b 0 . Thus we have b T ≤ b 0 + 2(F (x 1 ) -F * ) η + 2ηL log + ηL b 0 + 2 T t=1 ξ t , ∇F (x t ) b t For the last term in this equation, we notice that ξ t , ∇F (x t ) ≤ ξ t ∇F (x t ) ≤ √ M T ∇F (x t ) , hence b T ≤ 2(F (x 1 ) -F * ) η + 2ηL log + ηL b 0 + b 0 + 2 T t=1 ξ t , ∇F (x t ) b t ≤ 2(F (x 1 ) -F * ) η + 2ηL log + ηL b 0 + b 0 + 2 M T T t=1 ∇F (x t ) b t (a) ≤ 2(F (x 1 ) -F * ) η + 2ηL log + ηL b 0 + b 0 + 2 M T T T t=1 ∇F (x t ) 2 b 2 t = 2(F (x 1 ) -F * ) η + 2ηL log + ηL b 0 + b 0 + 4M T T T t=1 b 2 t -b 2 t-1 b 2 t ≤ 2(F (x 1 ) -F * ) η + 2ηL log + ηL b 0 + b 0 + 4M T T log b 2 T b 2 0 where (a) is due to Jensen's inequality. We can write 4M T T log b 2 T b 2 0 = 4M T T log b 2 T b 2 0 + 16M T T + log b 2 0 + 16M T T b 2 0 ≤ 4M T T b 2 T b 2 0 + 16M T T + log b 2 0 + 16M T T b 2 0 ≤ b 2 T 4 + 4M T T log b 2 0 + 16M T T b 2 0 . 
Hence b T ≤ b 0 + 2(F (x 1 ) -F * ) η + 2ηL log + ηL b 0 + b 2 T 4 + 4M T T log b 2 0 + 16M T T b 2 0 ≤ b 0 + 2(F (x 1 ) -F * ) η + 2ηL log + ηL b 0 + b T 2 + 2 M T T log b 2 0 + 16M T T b 2 0 which gives us b T ≤ 2b 0 + 4(F (x 1 ) -F * ) η + 4ηL log + ηL b 0 + 4 M T T log b 2 0 + 16M T T b 2 0 . Recall the definition of the event E(δ) is M T ≤ σ 2 log 2θ eT δ , thus we know b T ≤ 2b 0 + 4(F (x 1 ) -F * ) η + 4ηL log + ηL b 0 + 4σ T log 2θ eT δ log 1 + 16σ 2 T log 2θ eT δ b 2 0 . By using Lemma A.6, we can consider the following decomposition E [∆ T ] = E ∆ T 1 E(δ) + E ∆ T 1 E c (δ) = E ∆ T b T b T 1 E(δ) + E ∆ T 1 E c (δ) ≤ g T (δ)E ∆ T b T 1 E(δ) + E ∆ T 1 E c (δ) ≤ g T (δ)E ∆ T b T + E ∆ T 1 E c (δ) . Published as a conference paper at ICLR 2023 Note that Lemma A.2 tells us E ∆ T b T ≤ x 1 -x * 2 γη + 2η γ E M T b 2 0 + log b T b 0 . Hence our remaining task is to find a proper bound on E ∆ T 1 E c (δ) , which is stated in the following lemma. Lemma A.7. Under Assumptions 2' and 4 we have E ∆ T 1 E c (δ) ≤ F (x 1 ) -F * + η 2 L log + ηL 2b 0 T δ + ηE 1/4 M 2 T log E b 2 T b 2 0 T 3/2 δ 1/4 . Proof. We restart from the smoothness of F : F (x s+1 ) -F (x s ) ≤ - η b s ∇F (x s ) -∇F (x s ), ∇F (x s ) - η b s ∇F (x s ) 2 + η 2 L 2b 2 s ∇F (x s ) 2 . Taking the sum over s, we have for t ≥ 2 F (x t ) -F (x 1 ) ≤ t-1 s=1 - η b s ∇F (x s ) -∇F (x s ), ∇F (x s ) + t-1 s=1 η 2 L 2b 2 s - η b s ∇F (x s ) 2 ≤ η 2 L log + ηL 2b 0 + t-1 s=1 η b s ξ s ∇F (x s ) . Following the same proof of Lemma A.6, we have F (x t ) -F * ≤ F (x 1 ) -F * + η 2 L log + ηL 2b 0 + η M t-1 (t -1) log b 2 t-1 b 2 0 . Now we bound ∆ T as follows ∆ T = T t=1 F (x t ) -F * ≤ F (x 1 ) -F * + T t=2 F (x 1 ) -F * + η 2 L log + ηL 2b 0 + η M t-1 (t -1) log b 2 t-1 b 2 0 ≤ F (x 1 ) -F * + η 2 L log + ηL 2b 0 T + T t=2 η M t-1 (t -1) log b 2 t-1 b 2 0 ≤ F (x 1 ) -F * + η 2 L log + ηL 2b 0 T + η M T log b 2 T b 2 0 T 3/2 . 
Thus we obtain E ∆ T 1 E c (δ) ≤ F (x 1 ) -F * + η 2 L log + ηL 2b 0 T δ + ηE M T log b 2 T b 2 0 1 E c (δ) T 3/2 . Here we invoke Holder's inequality for three variables: for p, q, r > 0, 1/p + 1/q + 1/r = 1 then E[XY Z] ≤ E 1/p [X p ]E 1/q [Y q ]E 1/r [Z r ]. By substituting X = √ M T , Y = log b 2 T b 2 0 , Z = 1 E c (δ) , and p = 4, q = 2, r = 4, we have E M T log b 2 T b 2 0 1 E c (δ) ≤ E 1/4 M 2 T E 1/2 log b 2 T b 2 0 E 1/4 1 E c (δ) ≤ E 1/4 M 2 T log E b 2 T b 2 0 δ 1/4 . So finally we get E ∆ T 1 E c (δ) ≤ F (x 1 ) -F * + η 2 L log + ηL 2b 0 T δ + ηE 1/4 M 2 T log E b 2 T b 2 0 T 3/2 δ 1/4 . Lemma A.8. Suppose F satisfies Assumptions 1, 2', 3 and 4 then E [∆ T ] ≤g T   x 1 -x * 2 2γη + 2ησ 2 2 (4θ-1)∨2θ log 2θ T + C 1 γb 2 0 + η γ log E b 2 T b 2 0   + F (x 1 ) -F * + η 2 L log + ηL 2b0 T 3 + ησ 2 (2θ-1)∨θ log θ T + C 2 2 1 + log E b 2 T b 2 0 √ T where C 1 = 2 (2θ-1) + log 2θ Γ(4θ + 1)e 2 + 1 and C 2 = 2 (θ-1) + log θ Γ(8θ + 1)e 2 + 1 are two constants and g T = 2b 0 + 4(F (x 1 ) -F * ) η + 4ηL log + ηL b 0 + 4σ T log 2θ (eT 5 ) log 1 + 16σ 2 T log 2θ (eT 5 ) b 2 0 . Proof. As stated above, we know E [∆ T ] = E ∆ T 1 E(δ) + E ∆ T 1 E c (δ) = E ∆ T b T b T 1 E(δ) + E ∆ T 1 E c (δ) (a) ≤ g T (δ)E ∆ T b T 1 E(δ) + E ∆ T 1 E c (δ) ≤ g T (δ)E ∆ T b T + E ∆ T 1 E c (δ) (b) ≤ g T (δ) x 1 -x * 2 2γη + 2η γ E M T b 2 0 + log b T b 0 + E ∆ T 1 E c (δ) ≤ g T (δ) x 1 -x * 2 2γη + 2η γb 2 0 E [M T ] + η γ log E b 2 T b 2 0 + E ∆ T 1 E c (δ) where (a) is due to Lemma A.6. (b) is by Lemma A.2. Lemma A.7 gives us E ∆ T 1 E c (δ) ≤ F (x 1 ) -F * + η 2 L log + ηL 2b 0 T δ + ηE 1/4 M 2 T log E b 2 T b 2 0 T 3/2 δ 1/4 . Pluggin in this bound, we have E [∆ T ] ≤g T (δ) x 1 -x * 2 2γη + 2η γb 2 0 E [M T ] + η γ log E b 2 T b 2 0 + F (x 1 ) -F * + η 2 L log + ηL 2b 0 T δ + ηE 1/4 M 2 T log E b 2 T b 2 0 T 3/2 δ 1/4 . 
Now we take δ = T -4 and let g T := g T (T -4 ) to obtain E [∆ T ] ≤g T x 1 -x * 2 2γη + 2η γb 2 0 E [M T ] + η γ log E b 2 T b 2 0 + 1 T 3 F (x 1 ) -F * + η 2 L log + ηL 2b 0 + ηE 1/4 M 2 T log E b 2 T b 2 0 √ T . From Lemma A.5, we know E [M T ] ≤ σ 2 log 2θ Γ(4θ + 1)e 2 T 2 + 1 ≤ σ 2 (2 (4θ-1)∨2θ log 2θ (T ) + C 1 ) and E M 2 T ≤ σ 4 (log 4θ (Γ(8θ + 1)e 2 T 2 ) + 1) ⇒ E 1/4 M 2 T = σ log 4θ (Γ(8θ + 1)e 2 T 2 ) + 1 1/4 ≤ σ(log θ (Γ(8θ + 1)e 2 T 2 ) + 1) ≤ σ(2 (2θ-1)∨θ log θ T + C 2 ) Hence we have E [∆ T ] ≤g T   x 1 -x * 2 2γη + 2ησ 2 2 (4θ-1)∨2θ log 2θ T + C 1 γb 2 0 + η γ log E b 2 T b 2 0   + F (x 1 ) -F * + η 2 L log + ηL 2b0 T 3 + ησ 2 (2θ-1)∨θ log θ T + C 2 log E b 2 T b 2 0 √ T ≤g T   x 1 -x * 2 2γη + 2ησ 2 2 (4θ-1)∨2θ log 2θ T + C 1 γb 2 0 + η γ log E b 2 T b 2 0   + F (x 1 ) -F * + η 2 L log + ηL 2b0 T 3 + ησ 2 (2θ-1)∨θ log θ T + C 2 2 1 + log E b 2 T b 2 0 √ T With these results, we can finally show the theorem 3.5. Proof of Theorem 3.5 . The key technique we use is the self-bounding argument. That is, we have expressed a bound for E[∆ T ] via E[b 2 T /b 2 0 ] , now we will show how to bound this term via ∆ T . To do this, we rely on the smoothness assumption and Lemma A.3 E b 2 T = E b 2 0 + T t=1 ∇F (x t ) 2 ≤ b 2 0 + E T t=1 2 ξ t 2 + E T t=1 2 ∇F (x t ) 2 ≤ b 2 0 + 2Γ(2θ + 1)eσ 2 T + E 4L T t=1 F (x t ) -F (x * ) ≤ b 2 0 + 2Γ(2θ + 1)eσ 2 T + 4LE [∆ T ] . Thus from Lemma A.8 we can write E [∆ T ] ≤ G 0 + G 1 log 1 + 2Γ(2θ + 1)eσ 2 T b 2 0 + 4L b 2 0 E [∆ T ] where G 0 = F (x 1 ) -F * + η 2 L log + ηL 2b0 T 3 + ησ 2 (2θ-1)∨θ log θ T + C 2 √ T 2 + g T   x 1 -x * 2 2γη + 2ησ 2 2 (4θ-1)∨2θ log 2θ (T ) + C 1 γb 2 0   = O 1 + σ T log 2θ T + (1 + σ 2 log 2θ T )g T G 1 = ησ 2 (2θ-1)∨θ log θ T + C 2 √ T 2 + ηg T γ = O σ T log 2θ T + g T g T = 2b 0 + 4(F (x 1 ) -F * ) η + 4ηL log + ηL b 0 + 4σ T log 2θ (eT 5 ) log 1 + 16σ 2 T log 2θ (eT 5 ) b 2 0 = O 1 + σ T log 2θ T log(1 + σ 2 T log 2θ T ) Now we solve (5). 
Consider two cases: If 4LE [∆ T ] ≤ 2Γ(2θ + 1)eσ 2 T then E [∆ T ] ≤ G 0 + G 1 log 1 + 4Γ(2θ + 1)eσ 2 T b 2 0 . If 4LE [∆ T ] ≥ 2Γ(2θ + 1)eσ 2 T then E [∆ T ] ≤ G 0 + G 1 log 1 + 8L b 2 0 E [∆ T ] = G 0 + G 1 log 1 + 8L b 2 0 E [∆ T ] 1 + 16LG 1 /b 2 0 + G 1 log 1 + 16LG 1 b 2 0 ≤ G 0 + G 1 1 + 8L b 2 0 E [∆ T ] 1 + 16LG 1 /b 2 0 + G 1 log 1 + 16LG 1 b 2 0 ≤ G 0 + G 1 + E [∆ T ] 2 + G 1 log 1 + 16LG 1 b 2 0 ⇒ E [∆ T ] ≤ 2G 0 + 2G 1 + 2G 1 log 1 + 16eLG 1 b 2 0 . In both cases, we have E [∆ T ] ≤ 3G 0 + 2G 1 + 2G 1 log 1 + 16eLG 1 b 2 0 + G 1 log 1 + 4Γ(2θ + 1)eσ 2 T b 2 0 = O (1 + poly(σ 2 log 2θ T, log(1 + σ 2 T log 2θ T )))(1 + σ T log 2θ T ) Dividing both sides by T concludes the proof.
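To complement the analysis of Theorem 3.5, here is a minimal sketch of stochastic AdaGradNorm in which, as in the theorem, the latest stochastic gradient does enter the step size. The quadratic objective, the additive Gaussian noise model and all constants are illustrative assumptions of ours, not taken from the paper.

```python
import math
import random

def stochastic_adagradnorm(grad, x0, eta=0.5, b0=1.0, sigma=0.1, T=5000, seed=0):
    """Stochastic AdaGradNorm: b_t^2 = b_{t-1}^2 + ||g_t||^2 with the *current*
    noisy gradient g_t, then x_{t+1} = x_t - (eta / b_t) * g_t.
    Returns the average iterate, matching the guarantee on Delta_T / T."""
    rng = random.Random(seed)
    x = list(x0)
    b2 = b0 * b0
    avg = [0.0] * len(x)
    for t in range(1, T + 1):
        g = [gi + rng.gauss(0.0, sigma) for gi in grad(x)]  # unbiased noisy gradient
        b2 += sum(v * v for v in g)
        b = math.sqrt(b2)
        x = [xi - eta / b * gi for xi, gi in zip(x, g)]
        avg = [a + (xi - a) / t for a, xi in zip(avg, x)]   # running average of iterates
    return avg

avg = stochastic_adagradnorm(lambda x: list(x), [1.0, 1.0])  # F(x) = ||x||^2 / 2
```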

A.2.3 CONVERGENCE OF STOCHASTIC ADAGRADNORM UNDER WEAKER ASSUMPTIONS

Note that Theorem 3.5 depends on the stronger Assumption 2' instead of Assumption 2. Besides, in Section 3.1, we proved that Assumptions 1 and 2 are enough to ensure that AdaGradNorm can converge in the deterministic setting. Hence it is reasonable to conjecture Stochastic AdaGradNorm can also converge if replacing Assumption 2' by Assumption 2. In this section, we show that, indeed, this conjecture is true. Theorem A.9. Suppose F satisfies Assumptions 1, 2, 3 and 4. Stochastic AdaGradNorm admits E   T t=1 F (x t ) -F (x * ) T   = O (1 + poly(log(1 + σ √ T ), σ 2 log 2θ T )) 1 √ T + σ 1/2 T 1/4 . Proof. First we invoke Lemma A.2 to get E ∆ T b T ≤ x 1 -x * 2 γη + 2η γ E M T b 2 0 + log b T b 0 . Using Holder's inequality we have E ∆ T ≤ x 1 -x * 2 γη + 2η γ E M T b 2 0 + log b T b 0 E [b T ] ≤ x 1 -x * 2 γη + 2η γb 2 0 E [M T ] + 2η γ log E [b T ] b 0 E [b T ]. Applying Lemma A.5 with p = 1 to get E [M T ] ≤ σ 2 log 2θ Γ(4θ + 1)e 2 T 2 + 1 ≤ σ 2 2 2θ log 2θ T 2 + 2 2θ log 2θ Γ(4θ + 1)e 2 + 1 = σ 2 2 4θ log 2θ T + C where C = 2 2θ log 2θ Γ(4θ + 1)e 2 + 1. Besides, note that b T = b 2 0 + T t=1 ∇F (x t ) 2 ≤ b 2 0 + 2 T t=1 ξ t 2 + 4L∆ T ≤ b 0 + 2 T t=1 ξ t 2 + 2 L∆ T . Thus we know E [b T ] ≤ E   b 0 + 2 T t=1 ξ t 2 + 2 L∆ T .   ≤ b 0 + 2 T t=1 E [ ξ t 2 ] + 2 √ LE ∆ T ≤ b 0 + 2Γ(2θ + 1)eσ 2 T + 2 √ LE ∆ T where the last inequality is due to Lemma A.3. Hence, by letting B 1 = x 1 -x * 2 γη + 2η 2 4θ log 2θ T + C σ 2 γb 2 0 = O(1 + σ 2 log 2θ T ) B 2 = b 0 + 2Γ(2θ + 1)eσ 2 T = O(1 + σ √ T ) X = E ∆ T we can solve the following inequality X 2 ≤ B 1 + 2η γ log B 2 + 2 √ LX b 0 (B 2 + 2 √ LX) to get the final result.
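The sub-Weibull bounds used throughout this subsection (Lemmas A.3 and A.4) can be sanity-checked numerically. The sketch below uses Gaussian noise with theta = 1/2 (the sub-Gaussian case) and the scale sigma = 2s; both choices are ours, made so that the sub-Weibull condition E[exp((||xi||/sigma)^(1/theta))] <= e holds, and are not taken from the paper.

```python
import math
import random

random.seed(0)

theta = 0.5          # sub-Weibull parameter (theta = 1/2 is the sub-Gaussian case)
s = 1.0              # std of the Gaussian noise (illustrative choice)
sigma = 2.0 * s      # scale chosen so that E[exp((|xi|/sigma)^(1/theta))] <= e

n = 100_000
samples = [random.gauss(0.0, s) for _ in range(n)]

# Check the sub-Weibull condition E[exp((|xi|/sigma)^(1/theta))] <= e.
mgf = sum(math.exp((abs(v) / sigma) ** (1.0 / theta)) for v in samples) / n
assert mgf <= math.e

# Lemma A.3: E[||xi||^2] <= Gamma(2*theta + 1) * e * sigma^2.
second_moment = sum(v * v for v in samples) / n
assert second_moment <= math.gamma(2 * theta + 1) * math.e * sigma ** 2

# Lemma A.4: with probability >= 1 - delta, M_T <= sigma^2 * log^(2*theta)(e*T/delta).
T, delta, trials = 100, 0.05, 500
bound = sigma ** 2 * math.log(math.e * T / delta) ** (2 * theta)
violations = 0
for _ in range(trials):
    M_T = max(random.gauss(0.0, s) ** 2 for _ in range(T))
    if M_T > bound:
        violations += 1
assert violations / trials <= delta
```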

A.3 ADAGRAD

Algorithm 7 AdaGrad Initialize: x 1 , η > 0 for t = 1 to T for j = 1 to d b t,j = b 2 0,j + t i=1 (∇ j F (x i )) 2 x t+1,j = x t,j -η bt,j ∇ j F (x t ) In this section, we will extend the result of AdaGradNorm to AdaGrad (Algorithm 7) in the deterministic setting. To our knowledge, we are the first to give the explicit bound of the counvergence rate of AdaGrad on R d . First, we examine the growth of the stepsize. Lemma A.10. Suppose F satisfies Assumptions 1 and 2", we have d j=1 b T,j ≤ d j=1 b 0,j + 2(F (x 1 ) -F * ) η + 2η d j=1 L j log + ηL j b 0,j . Proof. By smoothness we have F (x t+1 ) -F (x t ) ≤ ∇F (x t ), x t+1 -x t + x t+1 -x t 2 L 2 = d j=1 - η b t,j + L j η 2 2b 2 t,j ∇ j F (x t ) 2 ⇒ d j=1 η 2b t,j ∇ j F (x t ) 2 ≤ F (x t ) -F (x t+1 ) + d j=1 L j η 2 2b 2 t,j - η 2b t,j ∇ j F (x t ) 2 ⇒ T t=1 d j=1 η 2b t,j ∇ j F (x t ) 2 ≤ F (x 1 ) -F * + T t=1 d j=1 L j η 2 2b 2 t,j - η 2b t,j ∇ j F (x t ) 2 . Note that, for the L.H.S., T t=1 d j=1 η 2b t,j ∇ j F (x t ) 2 = η 2 d j=1 T t=1 b 2 t,j -b 2 t-1,j b t,j ≥ η 2 d j=1 T t=1 b t,j -b t-1,j = η 2 d j=1 (b T,j -b 0,j ) . Besides, T t=1 d j=1 L j η 2 2b 2 t,j - η 2b t,j ∇ j F (x t ) 2 = η 2 T t=1 d j=1 L j η b 2 t,j - 1 b t,j ∇ j F (x t ) 2 ≤ η 2 d j=1 τj t=1 L j η b 2 t,j ∇ i F (x t ) 2 (τ j is the last t such thatb t,j ≤ ηL j ) = η 2 2 d j=1 L j τi t=1 b 2 t,j -b 2 t-1,j b 2 t,j ≤ η 2 d j=1 L j log + ηL j b 0,j . Hence we have d j=1 b T,j ≤ d j=1 b 0,j + 2(F (x 1 ) -F * ) η + 2η d j=1 L j log + ηL j b 0,j . Theorem A.11 states the convergence guarantee for Algorithm 7. Theorem A.11. Suppose F satisfies Assumptions 1 and 2", we have T t=1 F (x t ) -F * T ≤ d j=1 b 0,j + 2(F (x1)-F * ) η + 2η d j=1 L j log + ηLj b0,j d x1-x * 2 b 1 γη + 2η γ d j=1 2ηLj γ -b 0,j + T d d d j=1 b 0,j where x 1 -x * 2 b1 = d j=1 b 1,j (x 1,j -x * j ) 2 . Before going into the proof, it is worth discussing the result above as well as the main challenges and differences compared with AdaGradNorm. 
For simplicity, let b t = diag(b t,i ). We can expect that, by a similar argument that we used before to bound the function value gap via the stepsize, we will have is obtained in a similar manner as before, but in d-dimensions. The challenge is that since the stepsize is a vector, it is not possible to use "division" by the stepsize as in AdaGradNorm. On the one hand, we can overcome this by rewriting the argument; on the other hand, this problem will incur an exponential rate for g(b T ) dependent on the smoothness parameters. T t=1 F (x t ) -F * T ≤ g(b T ) x1-x * 2 b 1 γη + 2η γ d j=1 Proof. We can write x t+1 = x t -ηb -1 t ∇F (x t ). Starting from γ-quasar convexity F (x t ) -F * ≤ ∇F (x t ), x t -x * γ = b t (x t -x t+1 ) , x t -x * ηγ = x t -x * 2 bt -x t+1 -x * 2 bt + x t+1 -x t 2 bt 2ηγ = x t -x * 2 bt -x t+1 -x * 2 bt 2ηγ + η 2γ ∇F (x t ) 2 b -1 t . Note that F also satisfies Assumption 2 F (x t ) -F * ≥ ∇F (x t ) 2 L, * 2 . Hence F (x t ) -F * 2 + ∇F (x t ) 2 L, * 4 ≤ F (x t ) -F * ≤ x t -x * 2 bt -x t+1 -x * 2 bt 2ηγ + η 2γ ∇F (x t ) 2 b -1 t ⇒ F (x t ) -F * 2 ≤ x t -x * 2 bt -x t+1 -x * 2 bt 2ηγ + η 2γ ∇F (x t ) 2 b -1 t - ∇F (x t ) 2 L, * 4 ⇒ F (x t ) -F * ≤ x t -x * 2 bt -x t+1 -x * 2 bt ηγ + d j=1 η γb t,j - 1 2L j ∇ j F (x t ) 2 . Taking the sum over t T t=1 F (x t ) -F * ≤ T t=1 x t -x * 2 bt -x t+1 -x * 2 bt ηγ + T t=1 d j=1 η γb t,j - 1 2L j ∇ j F (x t ) 2 = x 1 -x * 2 b1 -x T +1 -x * 2 b T γη + T t=2 x t -x * 2 bt-bt-1 γη + T t=1 d j=1 η γb t,j - 1 2L j ∇ j F (x t ) 2 . Due to the excess term T t=2 xt-x * 2 b t -b t-1 γη in the RHS, we need to proceed and bound x tx * 2 bt-bt-1 . First, observe that since the L.H.S. 
is non-negative, x T +1 -x * 2 b T γη ≤ x 1 -x * 2 b1 γη + T t=2 x t -x * 2 bt-bt-1 γη + T t=1 d j=1 η γb t,j - 1 2L j ∇ j F (x t ) 2 To upperbound x t -x * 2 bt-bt-1 , a key observation is that x T +1 -x * 2 b T = x T +1 -x * 2 b T +1 -b T × x T +1 -x * 2 b T x T +1 -x * 2 b T +1 -b T ≥ x T +1 -x * 2 b T +1 -b T min k b T,k b T +1,k -b T,k Hence for T ≥ 1 x T +1 -x * 2 b T +1 -b T γη ≤ max k b T +1,k b T,k -1   x 1 -x * 2 b1 γη + T t=2 x t -x * 2 bt-bt-1 γη + T t=1 d j=1 η γb t,j - 1 2L j ∇ j F (x t ) 2   By using this bound for the last term x T -x * 2 b T -b T -1 γη we obtain T t=1 F (x t ) -F * ≤ x 1 -x * 2 b1 -x T +1 -x * 2 b T γη + T t=2 x t -x * 2 bt-bt-1 γη + T t=1 d j=1 η γb t,j - 1 2L j ∇ j F (x t ) 2 ≤ x 1 -x * 2 b1 γη + T -1 t=2 x t -x * 2 bt-bt-1 γη + T t=1 d j=1 η γb t,j - 1 2L j ∇ j F (x t ) 2 + max k b T,k b T -1,k -1   x 1 -x * 2 b1 γη + T -1 t=2 x t -x * 2 bt-bt-1 γη + T -1 t=1 d j=1 η γb t,j - 1 2L j ∇ j F (x t ) 2   = max k b T,k b T -1,k   x 1 -x * 2 b1 γη + T -1 t=2 x t -x * 2 bt-bt-1 γη + T -1 t=1 d j=1 η γb t,j - 1 2L j ∇ j F (x t ) 2   + d j=1 η γb T,j - 1 2L j ∇ j F (x T ) 2 Continue to unroll this relation and for convenience let T t=T +1 max k b t,k b t-1,k = 1, we have T t=1 F (x t ) -F * ≤ T t=2 max k b t,k b t-1,k x 1 -x * 2 b1 γη + T t=1 T =t+1 max k b ,k b -1,k   d j=1 η γb t,j - 1 2L j ∇ j F (x t ) 2   = T t=2 max k b t,k b t-1,k x 1 -x * 2 b1 γη + d j=1 T t=1 T =t+1 max k b ,k b -1,k η γb t,j - 1 2L j ∇ j F (x t ) 2 Given j, if b 1,j > 2ηLj γ , we know T t=1 T =t+1 max k b ,k b -1,k η γb t,j - 1 2L j ∇ j F (x t ) 2 < 0 ≤ 2η γ T t=2 max k b t,k b t-1,k 2ηL j γ -b 0,j + Published as a conference paper at ICLR 2023 Otherwise, let τ j be the last t such that b t,j ≤ 2ηLj γ , we also have T t=1 T =t+1 max k b ,k b -1,k η γb t,j - 1 2L j ∇ j F (x t ) 2 ≤ τj t=1 T =t+1 max k b ,k b -1,k η γb t,j - 1 2L j ∇ j F (x t ) 2 ≤ T t=2 max k b t,k b t-1,k τj t=1 η γb t,j - 1 2L j ∇ j F (x t ) 2 ≤ T t=2 max k b t,k b t-1,k τj t=1 η 
γb t,j ∇ j F (x t ) 2 = T t=2 max k b t,k b t-1,k η γ τj t=1 b 2 t,j -b 2 t-1,j b t,j ≤ T t=2 max k b t,k b t-1,k 2η γ τj t=1 b t,j -b t-1,j ≤ 2η γ T t=2 max k b t,k b t-1,k 2ηL j γ -b 0,j + . Hence we know d j=1 T t=1 T =t+1 max k b ,k b -1,k η γb t,j - 1 2L j ∇ j F (x t ) 2 ≤ 2η γ T t=2 max k b t,k b t-1,k   d j=1 2ηL j γ -b 0,j +   . Thus we have T t=1 F (x t ) -F * ≤ T t=2 max k b t,k b t-1,k x 1 -x * 2 b1 γη + d j=1 T t=1 T =t+1 max k b ,k b -1,k η γb t,j - 1 2L j ∇ j F (x t ) 2 ≤ T t=2 max k b t,k b t-1,k   x 1 -x * 2 b1 γη + 2η γ d j=1 2ηL j γ -b 0,j +   From Lemma A.10 d j=1 b T,j ≤ d j=1 b 0,j + 2(F (x 1 ) -F * ) η + 2η d j=1 L j log + ηL j b 0,j Using AM-GM we have d j=1 b T,j ≤ d j=1 b T,j d d ≤ 1 d d   d j=1 b 0,j + 2(F (x 1 ) -F * ) η + 2η d j=1 L j log + ηL j b 0,j   d Published as a conference paper at ICLR 2023 Note that T t=2 max j b t,j b t-1,j ≤ T t=2 d j=1 b t,j b t-1,j ≤ d j=1 b T,j b 0,j ≤ 1 d d d j=1 b 0,j   d j=1 b 0,j + 2(F (x 1 ) -F * ) η + 2η d j=1 L j log + ηL j b 0,j   d Hence T t=1 F (x t ) -F * ≤ T t=2 max j b T,j b T -1,j   x 1 -x * 2 b1 γη + 2η γ d j=1 2ηL j γ -b 0,j +   ≤ d j=1 b 0,j + 2(F (x1)-F * ) η + 2η d j=1 L j log + ηLj b0,j d x1-x * 2 b 1 γη + 2η γ d j=1 2ηLj γ -b 0,j + d d d j=1 b 0,j which finishes the proof.
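To accompany the coordinate-wise analysis above, here is a minimal sketch of AdaGrad as in Algorithm 7, with each coordinate maintaining its own accumulator b_{t,j}. The ill-conditioned quadratic and all constants are illustrative choices of ours.

```python
import math

def adagrad(grad, x0, eta=1.0, b0=0.1, T=2000):
    """Coordinate-wise AdaGrad (Algorithm 7):
    b_{t,j}^2 = b_{0,j}^2 + sum_{i<=t} (grad_j F(x_i))^2,
    x_{t+1,j} = x_{t,j} - (eta / b_{t,j}) * grad_j F(x_t)."""
    x = list(x0)
    acc = [b0 * b0] * len(x)             # per-coordinate running sums b_{t,j}^2
    for _ in range(T):
        g = grad(x)
        for j in range(len(x)):
            acc[j] += g[j] * g[j]
            x[j] -= eta / math.sqrt(acc[j]) * g[j]
    return x

# Ill-conditioned quadratic F(x) = sum_j L_j * x_j^2 / 2 with per-coordinate smoothness L_j.
L = [100.0, 1.0]
x = adagrad(lambda x: [L[j] * x[j] for j in range(len(x))], [1.0, 1.0])
```

The per-coordinate step sizes adapt to the individual smoothness constants L_j without knowing them, which is the point of the coordinate-wise variant.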

B MISSING PROOFS FROM SECTION 4

B.1 IMPORTANT LEMMA First, we state a general lemma that can be used for a more general setting. The proof of the lemma is standard. Lemma B.1. Suppose F satisfies Assumptions 1 and 2' and the following conditions hold: • x t is generated by x t+1 = x t -η ct ∇F (x t ) , with η > 0 and c t > 0 is non-decreasing; • p t ∈ (0, 1] satisfies 1 pt ≥ 1-pt+1 pt+1 , p 1 = 1; Then we have F (x T +1 ) -F * p T c T ≤ x 1 -x * 2 γη + T t=1 L 2c t - 1 η + p t γη - p t c t 2η 2 L η 2 ∇F (x t ) 2 c 2 t p t . ( ) Proof. Starting from L-smoothness F (x t+1 ) -F (x t ) ≤ ∇F (x t ), x t+1 -x t + L 2 x t+1 -x t 2 = 2p t γ ∇F (x t ), x t+1 -x t + 1 - 2p t γ ∇F (x t ), x t+1 -x t + L 2 x t+1 -x t 2 = 2p t γ ∇F (x t ), x * -x t + 2p t γ ∇F (x t ), x t+1 -x * + 1 - 2p t γ ∇F (x t ), x t+1 -x t + L 2 x t+1 -x t 2 ≤ 2p t (F * -F (x t )) + p t c t γη x t -x * 2 -x t+1 -x * 2 -x t+1 -x 2 -1 - 2p t γ c t η x t+1 -x t 2 + L 2 x t+1 -x t 2 = 2p t (F * -F (x t )) + p t c t γη x t -x * 2 -x t+1 -x * 2 + L 2 - c t η + p t c t γη x t+1 -x t 2 . Note that Assumption 2 can be implied by Assumption 2', hence we have F * -F (x t ) ≤ - ∇F (x t ) 2 2L = - c 2 t x t+1 -x t 2 2η 2 L . Therefore F (x t+1 ) -F (x t ) ≤ 2p t (F * -F (x t )) + p t c t γη x t -x * 2 -x t+1 -x * 2 + L 2 - c t η + p t c t γη x t+1 -x t 2 ≤ p t (F * -F (x t )) - p t c 2 t x t+1 -x t 2 2η 2 L + p t c t γη x t -x * 2 -x t+1 -x * 2 + L 2 - c t η + p t c t γη x t+1 -x t 2 = p t (F * -F (x t )) + p t c t γη x t -x * 2 -x t+1 -x * 2 + L 2 - c t η + p t c t γη - p t c 2 t 2η 2 L x t+1 -x t 2 . We obtain F (x t+1 ) -F * p t c t ≤ 1 -p t p t c t (F (x t ) -F * ) + x t -x * 2 -x t+1 -x * 2 γη + L 2c t - 1 η + p t γη - p t c t 2η 2 L x t+1 -x t 2 p t . 
Note that we require $\frac{1}{p_t} \ge \frac{1 - p_{t+1}}{p_{t+1}}$ and that $c_t$ is non-decreasing, hence
$$\frac{1}{p_t c_t} \ge \frac{1 - p_{t+1}}{p_{t+1} c_t} \ge \frac{1 - p_{t+1}}{p_{t+1} c_{t+1}},$$
which leads to
$$\frac{F(x_{T+1}) - F^*}{p_T c_T} \le \frac{1 - p_1}{p_1 c_1}\left(F(x_1) - F^*\right) + \frac{\|x_1 - x^*\|^2}{\gamma\eta} + \sum_{t=1}^{T}\left(\frac{L}{2c_t} - \frac{1}{\eta} + \frac{p_t}{\gamma\eta} - \frac{p_t c_t}{2\eta^2 L}\right)\frac{\|x_{t+1} - x_t\|^2}{p_t} = \frac{1 - p_1}{p_1 c_1}\left(F(x_1) - F^*\right) + \frac{\|x_1 - x^*\|^2}{\gamma\eta} + \sum_{t=1}^{T}\left(\frac{L}{2c_t} - \frac{1}{\eta} + \frac{p_t}{\gamma\eta} - \frac{p_t c_t}{2\eta^2 L}\right)\frac{\eta^2\|\nabla F(x_t)\|^2}{c_t^2 p_t}.$$
By setting $p_1 = 1$ we get the desired result.

B.2 FIRST VARIANT

Note that if we assume p t satisfies the condition in Lemma B.1, by replacing c t by b t , we have F (x T +1 ) -F * p T b T ≤ x 1 -x * 2 γη + T t=1 L 2b t - 1 η + p t γη - p t b t 2η 2 L η 2 ∇F (x t ) 2 b 2 t p t immediately. Now our two left tasks are to bound the residual term T t=1 L 2bt -1 η + pt γη -ptbt 2η 2 L η 2 ∇F (xt) 2 b 2 t pt and find an upper bound on b T . Lemmas B.2 and B.3 demonstrate how we achieve these two goals. Lemma B.2. If p t ≤ 1 for every t, we have T t=1 L 2b t - 1 2η η 2 ∇F (x t ) 2 b 2 t p t ≤ h(∆) T t=1 p t γη - p t b t 2η 2 L η 2 ∇F (x t ) 2 b 2 t p t ≤ g(∆) where h(∆) :=    (2+∆)η(ηL) ∆ 2 log + ηL b0 ∆ ≥ 1 (2+∆)η 2 L 2b 1-∆ 0 log + ηL b0 ∆ ∈ (0, 1) and g(∆) := (2 + ∆)η γ 2ηL γ ∆ log + 2ηL γb 0 . Proof. We first bound T t=1 L 2b t - 1 2η η 2 ∇F (x t ) 2 b 2 t p t . If b 1 > ηL, we know T t=1 L 2b t - 1 2η η 2 ∇F (x t ) 2 b 2 t p t < 0 ≤ h(∆). Otherwise, we define the time τ = max {t ∈ [T ], b t ≤ ηL}. Hence, we have T t=1 L 2b t - 1 2η η 2 ∇F (x t ) 2 b 2 t p t = τ t=1 L 2b t - 1 2η η 2 ∇F (x t ) 2 b 2 t p t + T t=τ L 2b t - 1 2η η 2 ∇F (x t ) 2 b 2 t p t ≤ τ t=1 L 2b t - 1 2η η 2 ∇F (x t ) 2 b 2 t p t ≤ τ t=1 L 2b t × η 2 ∇F (x t ) 2 b 2 t p t = η 2 L 2 τ t=1 b 2+∆ t -b 2+∆ t-1 b 3 t = η 2 L 2 τ t=1 b 2+∆ t -b 2+∆ t-1 b 2+∆ t × b ∆-1 t ≤    η 2 L 2 τ t=1 b 2+∆ t -b 2+∆ t-1 b 2+∆ t × (ηL) ∆-1 ∆ ≥ 1 η 2 L 2 τ t=1 b 2+∆ t -b 2+∆ t-1 b 2+∆ t × 1 b 1-∆ 0 ∆ < 1 =    η(ηL) ∆ 2 τ t=1 b 2+∆ t -b 2+∆ t-1 b 2+∆ t ∆ ≥ 1 η 2 L 2b 1-∆ 0 τ t=1 b 2+∆ t -b 2+∆ t-1 b 2+∆ t ∆ < 1 ≤    (2+∆)η(ηL) ∆ 2 log ηL b0 ∆ ≥ 1 (2+∆)η 2 L 2b 1-∆ 0 log ηL b0 ∆ < 1 ≤ h(∆). By applying a similar argument, we can prove T t=1 p t γη - p t b t 2η 2 L η 2 ∇F (x t ) 2 b 2 t p t ≤ g(∆). Lemma B.3. Suppose all the conditions in Lemma B.1 are satisfied by replacing c t by b t , additionally, assume p t ≤ 1, we will have b T ≤ 2 η x 1 -x * 2 γη + h(∆) + g(∆) + b ∆ 0 1 ∆ Proof. 
Using Lemma B.1 by replacing c t by b t , we know F (x T +1 ) -F * p T b T ≤ x 1 -x * 2 γη + T t=1 L 2b t - 1 η + p t γη - p t b t 2η 2 L η 2 ∇F (x t ) 2 b 2 t p t = x 1 -x * 2 γη + T t=1 L 2b t - 1 2η + p t γη - p t b t 2η 2 L η 2 ∇F (x t ) 2 b 2 t p t - η ∇F (x t ) 2 2b 2 t p t ≤ x 1 -x * 2 γη + h(∆) + g(∆) - T t=1 η ∇F (x t ) 2 2b 2 t p t , where the last inequality is by Lemma B.2. Noticing F (x T +1 ) -F * ≥ 0, we know T t=1 η ∇F (x t ) 2 2b 2 t p t ≤ x 1 -x * 2 γη + h(∆) + g(∆). Now we use the update rule of b t to get T t=1 η ∇F (x t ) 2 2b 2 t p t = η 2 T t=1 b 2+∆ t -b 2+∆ t-1 b 2 t ≥ η 2 T t=1 b ∆ t -b ∆ t-1 = η 2 b ∆ T -b ∆ 0 Hence we know b T ≤ 2 η x 1 -x * 2 γη + h(∆) + g(∆) + b ∆ 0 1 ∆ Equipped with Lemmas B.2 and B.3, we can give a proof of Theorem 4.1. Proof. Note that if p t = 1 t , all the conditions in Lemma B.1 are satisfied by replacing c t by b t . Hence we have F (x T +1 ) -F * p T b T ≤ x 1 -x * 2 γη + T t=1 L 2b t - 1 η + p t γη - p t b t 2η 2 L η 2 ∇F (x t ) 2 b 2 t p t ≤ x 1 -x * 2 γη + T t=1 L 2b t - 1 2η + p t γη - p t b t 2η 2 L η 2 ∇F (x t ) 2 b 2 t p t ≤ x 1 -x * 2 γη + h(∆) + g(∆), where the last inequality is by Lemma B.2. Multiplying both sides by p T b T , we get F (x T +1 ) -F * ≤ b T x1-x * 2 γη + h(∆) + g(∆) T . By using the upper bound of b T in Lemma B.3, we finish the proof.
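A minimal sketch of the first variant analyzed above (Algorithm 3, Theorem 4.1): the accumulator is raised to the power 2 + Delta and the squared gradient norm is weighted by 1/p_t with p_t = 1/t, which is what yields the last-iterate guarantee. The test function and all constants are illustrative assumptions of ours.

```python
import math

def adagrad_variant1(grad, x0, eta=1.0, b0=0.1, Delta=0.5, T=2000):
    """First variant: b_t^(2+Delta) = b_{t-1}^(2+Delta) + ||grad F(x_t)||^2 / p_t
    with p_t = 1/t, then x_{t+1} = x_t - (eta / b_t) * grad F(x_t)."""
    x = list(x0)
    b_pow = b0 ** (2 + Delta)
    for t in range(1, T + 1):
        g = grad(x)
        b_pow += sum(v * v for v in g) * t       # ||grad||^2 / p_t with p_t = 1/t
        b = b_pow ** (1.0 / (2 + Delta))
        x = [xi - eta / b * gi for xi, gi in zip(x, g)]
    return x

x_last = adagrad_variant1(lambda x: list(x), [3.0, -4.0])  # F(x) = ||x||^2 / 2
```

Note that the assertion below is on the last iterate, not the average, matching the last-iterate nature of Theorem 4.1.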

B.3 SECOND VARIANT

Similar to the previous section, what we need to do is to bound the residual term T t=1 L 2ct -1 η + pt γη -ptct 2η 2 L η 2 ∇F (xt) 2 c 2 t pt and find an upper bound on c T where c t = b δ t b 1-δ t-1 here. We first bound the residual term by the following lemma. Lemma B.4. If p t ≤ 1 for every t, we have T t=1 L 2c t - 1 2η η 2 ∇F (x t ) 2 c 2 t p t ≤ η 2 L b 0 1 - b 0 ηL 1 δ + T t=1 p t γη - p t c t 2η 2 L η 2 ∇F (x t ) 2 c 2 t p t ≤ 2η γδ 2ηL γb 0 2 δ -2 log + 2ηL γb 0 where c t = b δ t b 1-δ t-1 . Proof. Note that ct ct-1 = b δ t b 2δ-1 t-1 b 1-δ t-2 ≥ 1, this means c t is monotone increasing. We first bound T t=1 L 2c t - 1 2η η 2 ∇F (x t ) 2 c 2 t p t . If c 1 > ηL, we know T t=1 L 2c t - 1 2η η 2 ∇F (x t ) 2 c 2 t p t < 0 ≤ η 2 L b 0 1 - b 0 ηL 1 δ + . Otherwise, let τ = max {t ∈ [T ] , c t ≤ ηL}. We have T t=1 L 2c t - 1 2η η 2 ∇F (x t ) 2 c 2 t p t ≤ τ t=1 η 2 L ∇F (x t ) 2 2c 3 t p t = η 2 L 2 τ t=1 b 2 t -b 2 t-1 b 3δ t b 3-3δ t-1 ≤ η 2 L τ t=1 b t -b t-1 b 3δ-1 t b 3-3δ t-1 ≤ η 2 L τ t=1 b t -b t-1 b t b t-1 ≤ η 2 L 1 b 0 - 1 b τ . Note that c τ = b δ τ b 1-δ τ -1 ≤ ηL ⇒ b τ ≤ ηL b 1-δ τ -1 1/δ . Hence 1 b 0 - 1 b τ ≤ 1 b 0 - (b τ -1 ) 1 δ -1 (ηL) 1/δ ≤ 1 b 0 1 - b 0 ηL 1 δ + . Combinging two cases, there is always T t=1 L 2c t - 1 2η η 2 ∇F (x t ) 2 c 2 t p t ≤ η 2 L b 0 1 - b 0 ηL 1 δ + . Now we turn to the second bound. If c 1 > 2ηL/γ, we know T t=1 p t γη - p t c t 2η 2 L η 2 ∇F (x t ) 2 c 2 t p t < 0 ≤ 2η γδ 2ηL γb 0 2 δ -2 log + 2ηL γb 0 . Published as a conference paper at ICLR 2023 Otherwise, we define the time τ = max {t ∈ [T ] , c t ≤ 2ηL/γ}. Then, we have T t=1 p t γη - p t c t 2η 2 L η 2 ∇F (x t ) 2 c 2 t p t ≤ τ t=1 p t γη η 2 ∇F (x t ) 2 c 2 t p t ≤ η γ τ t=1 ∇F (x t ) 2 c 2 t p t = η γ τ t=1 b 2 t -b 2 t-1 b 2δ t b 2-2δ t-1 = η γ τ t=1 b t b t-1 2-2δ b 2 t -b 2 t-1 b 2 t Because b δ t b 1-δ t-1 = c t ≤ 2ηL/γ for t ≤ τ , so we know b t ≤ 2ηL γb 1-δ t-1 1/δ . 
Using this bound η γ τ t=1 b t b t-1 2-2δ b 2 t -b 2 t-1 b 2 t ≤ η γ τ t=1 2ηL γb t-1 2 δ -2 b 2 t -b 2 t-1 b 2 t ≤ η γ 2ηL γb 0 2 δ -2 τ t=1 b 2 t -b 2 t-1 b 2 t ≤ 2η γ 2ηL γb 0 2 δ -2 log b t b 0 ≤ 2η γδ 2ηL γb 0 2 δ -2 log + 2ηL γb 0 . The proof is completed. As before, our last task is to bound c T . It is enough to bound b T since c T ≤ b T . Lemma B.5. Suppose all the conditions in Lemma B.1 are satisfied by replacing c t by b t , additionally, assume p t ≤ 1, we will have b T ≤ b 0 exp      x1-x * 2 γη 2 + ηL b0 1 -b0 ηL 1 δ + + 2 γδ 2ηL γb0 2 δ -2 log + 2ηL γb0 1 -δ      . Proof. By Lemma B.1, we know 0 ≤ F (x T +1 ) -F * p T c T ≤ x 1 -x * 2 γη + T t=1 L 2c t - 1 η + p t γη - p t c t 2η 2 L η 2 ∇F (x t ) 2 c 2 t p t 0 ≤ x 1 -x * 2 γη + T t=1 L 2c t - 1 2η + p t γη - p t c t 2η 2 L η 2 ∇F (x t ) 2 c 2 t p t - η ∇F (x t ) 2 2c 2 t p t T t=1 η ∇F (x t ) 2 2c 2 t p t ≤ x 1 -x * 2 γη + T t=1 L 2c t - 1 2η + p t γη - p t c t 2η 2 L η 2 ∇F (x t ) 2 c 2 t p t ≤ x 1 -x * 2 γη + η 2 L b 0 1 - b 0 ηL 1 δ + + 2η γδ 2ηL γb 0 2 δ -2 log + 2ηL γb 0 where the last inequality is by Lemma B.4. Note that for the L.H.S., we have T t=1 η ∇F (x t ) 2 2c 2 t p t = η 2 T t=1 b 2 t -b 2 t-1 b 2δ t b 2-2δ t-1 = η 2 T t=1 b t b t-1 2-2δ - b t-1 b t 2δ ≥ η(1 -δ) T t=1 log b t b t-1 = η(1 -δ) log b T b 0 . Hence we know b T ≤ b 0 exp      x1-x * 2 γη 2 + ηL b0 1 -b0 ηL 1 δ + + 2 γδ 2ηL γb0 2 δ -2 log + 2ηL γb0 1 -δ      Finally, the proof of Theorem 4.3 is similar to the proof of Theorem 4.1, hence, which is omitted. B.4 AN ASYMPTOTIC RATE WHEN ∆ = 0 AND δ = 1 As mentioned before, by setting ∆ = 0 in Algorithm 3 and δ = 1 in Algorithm 4 we obtain the same algorithm. The square root update rule of b t and the step size now are both more similar to the original AdaGradNorm. 
Intuitively, we can also expect the convergence of the last iterate in this case; furthermore, by taking the limit when ∆ → 0 and δ → 1, we can have a sense of the exponential dependency of the provable convergence rate on the problem parameters. However, previous analysis strictly requires that ∆ > 0 and δ < 1, thus does not apply here. In this section, we partially confirm the convergence of this variant by proving an asymptotic rate, i.e., F (x T +1 ) -F * = O (1/T ). Unfortunately, under Assumptions 1 and 2', we cannot figure out the explicit dependency of the convergence rate on the problem parameters. However, in the next section, we will give an explicit rate by replacing Assumption 1 with the stronger Assumption 1'. As stated, our goal is to prove Theorem B.6 in this section. Theorem B.6. Suppose F satisfies Assumptions 1 and 2', when ∆ = 0 for Algorithm 3, or equivalently, δ = 1 for Algorithm 4, by taking p t = 1 t , we have F (x T +1 ) -F * = O (1/T ) . Before starting the proof, we first discuss why we can obtain only an asymptotic rate when ∆ = 0 and δ = 1. As before, one can still expect that F (x T +1 ) -F * ≤ b T C T remains true for some constant C. However, a critical difference will show up when we want to find an explicit upper bound on b T . Using the proof of Lemma B.3 as an example (similarly for the proof of Lemma B.5), one key step is to get T t=1 ∇F (xt) 2 b 2 t pt = O(1) , where in the previous analysis, by replacing ∇F (xt) 2 pt by b 2+∆ t -b 2+∆ t-1 with ∆ > 0, we can lower bound T t=1 ∇F (xt) 2 b 2 t pt by a function of b T and finally give an explicit bound on b T . However, this is not possible when ∆ = 0 as T t=1 ∇F (xt) 2 b 2 t pt = T t=1 b 2 t -b 2 t-1 b 2 t . The only information we can get from T t=1 b 2 t -b 2 t-1 b 2 t = O(1) is lim T →∞ b 2 T -1 b 2 T = 1 . This is not enough to tell us whether b T is upper bounded or not. 
In Lemma B.8, we will use a new argument to show that $\lim_{T\to\infty} b_T < \infty$, which leads to the desired asymptotic rate. It is worth pointing out that establishing an asymptotic rate without explicit dependency on the problem parameters is the approach taken in some previous work, such as Antonakopoulos et al. (2022). This also gives a glimpse of the method used to analyze the convergence of the accelerated methods in Section 5. Now we start the proof. As before, we can employ Lemma B.1; hence we only need to bound the residual terms as in the following lemma.

Lemma B.7. Suppose $p_t \le 1$. When ∆ = 0 for Algorithm 3, or equivalently, δ = 1 for Algorithm 4, we have
$$\sum_{t=1}^{T}\left(\frac{L}{2b_t}-\frac{1}{2\eta}\right)\frac{\eta^2\|\nabla F(x_t)\|^2}{b_t^2 p_t} \le \eta\left(\frac{\eta L}{b_0}-1\right)_+, \qquad \sum_{t=1}^{T}\left(\frac{p_t}{\gamma\eta}-\frac{p_t b_t}{2\eta^2 L}\right)\frac{\eta^2\|\nabla F(x_t)\|^2}{b_t^2 p_t} \le \frac{2\eta}{\gamma}\log_+\frac{2\eta L}{\gamma b_0}.$$
The proof is essentially similar to the proofs of Lemmas B.2 and B.4, hence we omit it here.

Lemma B.8. Suppose all the conditions in Lemma B.1 are satisfied with $c_t$ replaced by $b_t$. Then, when ∆ = 0 for Algorithm 3, or equivalently, δ = 1 for Algorithm 4, we have $\lim_{T\to\infty} b_T = b_\infty < \infty$.

Proof. First note that $b_t$ is increasing; by the monotone convergence theorem, $\lim_{T\to\infty} b_T = b_\infty$ exists. We aim to show $b_\infty < \infty$. By Lemma B.1 with $c_t$ replaced by $b_t$, we have
$$\frac{F(x_{T+1})-F^*}{p_T b_T} \le \frac{\|x_1-x^*\|^2}{\gamma\eta} + \sum_{t=1}^{T}\left(\frac{L}{2b_t}-\frac{1}{\eta}+\frac{p_t}{\gamma\eta}-\frac{p_t b_t}{2\eta^2 L}\right)\frac{\eta^2\|\nabla F(x_t)\|^2}{b_t^2 p_t}$$
$$= \frac{\|x_1-x^*\|^2}{\gamma\eta} + \sum_{t=1}^{T}\left[\left(\frac{L}{2b_t}-\frac{1}{2\eta}+\frac{p_t}{\gamma\eta}-\frac{p_t b_t}{2\eta^2 L}\right)\frac{\eta^2\|\nabla F(x_t)\|^2}{b_t^2 p_t} - \frac{\eta\|\nabla F(x_t)\|^2}{2b_t^2 p_t}\right]$$
$$\le \frac{\|x_1-x^*\|^2}{\gamma\eta} + \eta\left(\frac{\eta L}{b_0}-1\right)_+ + \frac{2\eta}{\gamma}\log_+\frac{2\eta L}{\gamma b_0} - \sum_{t=1}^{T}\frac{\eta\|\nabla F(x_t)\|^2}{2b_t^2 p_t},$$
where the last inequality is by Lemma B.7. Noticing $F(x_{T+1}) - F^* \ge 0$, we know
$$\sum_{t=1}^{T}\frac{\eta\|\nabla F(x_t)\|^2}{2b_t^2 p_t} \le \frac{\|x_1-x^*\|^2}{\gamma\eta} + \eta\left(\frac{\eta L}{b_0}-1\right)_+ + \frac{2\eta}{\gamma}\log_+\frac{2\eta L}{\gamma b_0},$$
which implies
$$\sum_{t=1}^{\infty}\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t} \le \frac{2\|x_1-x^*\|^2}{\gamma\eta^2} + 2\left(\frac{\eta L}{b_0}-1\right)_+ + \frac{4}{\gamma}\log_+\frac{2\eta L}{\gamma b_0}.$$
We observe that
$$b_T^2 = b_{T-1}^2 + \frac{\|\nabla F(x_T)\|^2}{p_T} \;\Rightarrow\; b_{T-1}^2 = b_T^2\left(1-\frac{\|\nabla F(x_T)\|^2}{b_T^2 p_T}\right) \;\Rightarrow\; b_T^2 = b_0^2\prod_{t=1}^{T}\frac{1}{1-\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}}.$$
Taking logarithms on both sides, we get
$$\log b_T^2 = \log b_0^2 + \sum_{t=1}^{T}\log\frac{1}{1-\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}} \le \log b_0^2 + \sum_{t=1}^{T}\left(\frac{1}{1-\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}}-1\right) = \log b_0^2 + \sum_{t=1}^{T}\frac{\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}}{1-\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}} \le \log b_0^2 + \sum_{t=1}^{\infty}\frac{\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}}{1-\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}}.$$
Note that Inequality (7) tells us $\lim_{t\to\infty}\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t} = 0$; hence we can let τ be the time such that $\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t} \le \frac{1}{2}$ for all $t \ge \tau$. Then we know
$$\sum_{t=1}^{\infty}\frac{\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}}{1-\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}} = \sum_{t=1}^{\tau-1}\frac{\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}}{1-\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}} + \sum_{t=\tau}^{\infty}\frac{\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}}{1-\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}} \le \sum_{t=1}^{\tau-1}\frac{\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}}{1-\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}} + 2\sum_{t=\tau}^{\infty}\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}$$
$$\le \sum_{t=1}^{\tau-1}\frac{\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}}{1-\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}} + 2\left(\frac{2\|x_1-x^*\|^2}{\gamma\eta^2} + 2\left(\frac{\eta L}{b_0}-1\right)_+ + \frac{4}{\gamma}\log_+\frac{2\eta L}{\gamma b_0}\right) < \infty.$$
The above result implies that $\log b_T^2$ has a uniform upper bound, which means $b_\infty < \infty$.

Now we can prove Theorem B.6.

Proof. Note that when ∆ = 0 for Algorithm 3, or equivalently, δ = 1 for Algorithm 4, if $p_t = \frac{1}{t}$, all the conditions in Lemma B.1 are satisfied with $c_t$ replaced by $b_t$. Hence we have
$$\frac{F(x_{T+1})-F^*}{p_T b_T} \le \frac{\|x_1-x^*\|^2}{\gamma\eta} + \sum_{t=1}^{T}\left(\frac{L}{2b_t}-\frac{1}{\eta}+\frac{p_t}{\gamma\eta}-\frac{p_t b_t}{2\eta^2 L}\right)\frac{\eta^2\|\nabla F(x_t)\|^2}{b_t^2 p_t}$$
$$\le \frac{\|x_1-x^*\|^2}{\gamma\eta} + \sum_{t=1}^{T}\left(\frac{L}{2b_t}-\frac{1}{2\eta}+\frac{p_t}{\gamma\eta}-\frac{p_t b_t}{2\eta^2 L}\right)\frac{\eta^2\|\nabla F(x_t)\|^2}{b_t^2 p_t} \le \frac{\|x_1-x^*\|^2}{\gamma\eta} + \eta\left(\frac{\eta L}{b_0}-1\right)_+ + \frac{2\eta}{\gamma}\log_+\frac{2\eta L}{\gamma b_0},$$
where the last inequality is by Lemma B.7. Multiplying both sides by $p_T b_T$ and using $p_T = \frac{1}{T}$, we know
$$F(x_{T+1}) - F^* \le \frac{b_T}{T}\left(\frac{\|x_1-x^*\|^2}{\gamma\eta} + \eta\left(\frac{\eta L}{b_0}-1\right)_+ + \frac{2\eta}{\gamma}\log_+\frac{2\eta L}{\gamma b_0}\right) \le \frac{b_\infty}{T}\left(\frac{\|x_1-x^*\|^2}{\gamma\eta} + \eta\left(\frac{\eta L}{b_0}-1\right)_+ + \frac{2\eta}{\gamma}\log_+\frac{2\eta L}{\gamma b_0}\right) = O\left(\frac{1}{T}\right),$$
where the last step is by Lemma B.8.

B.5 A NON-ASYMPTOTIC RATE WHEN ∆ = 0 AND δ = 1 FOR CONVEX SMOOTH FUNCTIONS

In the previous section, we only gave an asymptotic rate when ∆ = 0 and δ = 1.
In the following, we show that, by replacing Assumption 1 with the stronger Assumption 1', a non-asymptotic rate can be obtained, as stated in Theorem B.9.

Theorem B.9. Suppose $F$ satisfies Assumptions 1' and 2'. When ∆ = 0 for Algorithm 3, or equivalently, δ = 1 for Algorithm 4, by taking $p_t = \frac{1}{t}$, we have
$$F(x_{T+1}) - F^* \le \frac{b\left(\frac{\|x_1-x^*\|^2}{2\eta} + \frac{\eta}{2}\left(\frac{2\eta L}{b_0}-1\right)_+\right)}{T},$$
where
$$b = \max\left\{\frac{\eta L}{2},\; \sqrt{b_0^2+\|\nabla F(x_1)\|^2}\exp\left(\frac{3\|x_1-x^*\|^2}{\eta^2}+3\left(\frac{2\eta L}{b_0}-1\right)_+\right),\; \eta L\sqrt{\frac{1}{4}+\frac{\|x_1-x^*\|^2}{\eta^2}+\left(\frac{2\eta L}{b_0}-1\right)_+}\exp\left(\frac{3\|x_1-x^*\|^2}{\eta^2}+3\left(\frac{2\eta L}{b_0}-1\right)_+\right)\right\}.$$

We first give another well-known characterization of convex and L-smooth functions without proof.

Lemma B.10. Suppose $F$ satisfies Assumptions 1' and 2'. Then for all $x, y \in \mathbb{R}^d$,
$$\langle\nabla F(x)-\nabla F(y), x-y\rangle \ge \frac{\|\nabla F(x)-\nabla F(y)\|^2}{L}.$$

Next, we state a simple variant of Lemma B.1; its proof is essentially the same as that of Lemma B.1, hence we omit it.

Lemma B.11. Suppose the following conditions hold:
• $F$ satisfies Assumptions 1' and 2';
• $p_t \in (0,1]$ satisfies $\frac{1}{p_t} \ge \frac{1-p_{t+1}}{p_{t+1}}$ and $p_1 = 1$.
When ∆ = 0 for Algorithm 3, or equivalently, δ = 1 for Algorithm 4, we have
$$\frac{F(x_{T+1})-F^*}{p_T b_T} \le \frac{\|x_1-x^*\|^2}{2\eta} + \sum_{t=1}^{T}\left(\frac{L}{2b_t}-\frac{1}{\eta}+\frac{p_t}{2\eta}\right)\frac{\eta^2\|\nabla F(x_t)\|^2}{b_t^2 p_t}.$$

As with Lemma B.7, we give the following bound on the residual term without proof.

Lemma B.12. Suppose $p_t \le 1$. When ∆ = 0 for Algorithm 3, or equivalently, δ = 1 for Algorithm 4, we have
$$\sum_{t=1}^{T}\left(\frac{L}{2b_t}-\frac{1}{4\eta}\right)\frac{\eta^2\|\nabla F(x_t)\|^2}{b_t^2 p_t} \le \frac{\eta}{2}\left(\frac{2\eta L}{b_0}-1\right)_+.$$

Again, the above two lemmas give us
$$F(x_{T+1}) - F^* \le p_T b_T\left(\frac{\|x_1-x^*\|^2}{2\eta} + \frac{\eta}{2}\left(\frac{2\eta L}{b_0}-1\right)_+\right). \quad (8)$$
W.l.o.g., we assume $b_T > \frac{\eta L}{2}$ in the following analysis; otherwise, we can use the bound $b_T \le \frac{\eta L}{2}$ to get a trivial convergence rate. Now we define the time $\tau = \max\left\{t \in [T] : b_t \le \frac{\eta L}{2}\right\} \vee 0$. This time τ is extremely useful and will finally help us bound $b_T$. We now list three important lemmas related to the time τ.

Lemma B.13. With Assumptions 1' and 2', $\|\nabla F(x_t)\|$ is non-increasing for $t \ge \tau+1$.

Proof.
Taking $x = x_t$, $y = x_{t+1}$ in Lemma B.10, we get
$$\frac{\|\nabla F(x_t)-\nabla F(x_{t+1})\|^2}{L} \le \langle\nabla F(x_t)-\nabla F(x_{t+1}), x_t-x_{t+1}\rangle = \left\langle\nabla F(x_t)-\nabla F(x_{t+1}), \frac{\eta}{b_t}\nabla F(x_t)\right\rangle$$
$$\Rightarrow\; \left(\frac{1}{L}-\frac{\eta}{b_t}\right)\|\nabla F(x_t)\|^2 + \frac{1}{L}\|\nabla F(x_{t+1})\|^2 \le \left(\frac{2}{L}-\frac{\eta}{b_t}\right)\langle\nabla F(x_t), \nabla F(x_{t+1})\rangle.$$
Note that when $t \ge \tau+1$, we know $b_t > \frac{\eta L}{2}$, i.e., $\frac{2}{L}-\frac{\eta}{b_t} > 0$, hence
$$\left(\frac{1}{L}-\frac{\eta}{b_t}\right)\|\nabla F(x_t)\|^2 + \frac{1}{L}\|\nabla F(x_{t+1})\|^2 \le \left(\frac{2}{L}-\frac{\eta}{b_t}\right)\langle\nabla F(x_t),\nabla F(x_{t+1})\rangle \le \left(\frac{1}{L}-\frac{\eta}{2b_t}\right)\|\nabla F(x_t)\|^2 + \left(\frac{1}{L}-\frac{\eta}{2b_t}\right)\|\nabla F(x_{t+1})\|^2,$$
which implies $\|\nabla F(x_{t+1})\|^2 \le \|\nabla F(x_t)\|^2$.

Lemma B.14. With Assumptions 1' and 2', if $p_t = \frac{1}{t}$, then for any $t \ge \tau+2$, $\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t} \le \frac{2}{3}$.

Proof. We have
$$\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t} = \frac{t\|\nabla F(x_t)\|^2}{b_0^2+\sum_{i=1}^{t} i\|\nabla F(x_i)\|^2} \le \frac{t\|\nabla F(x_t)\|^2}{(t-1)\|\nabla F(x_{t-1})\|^2 + t\|\nabla F(x_t)\|^2} \le \frac{t\|\nabla F(x_t)\|^2}{(t-1)\|\nabla F(x_t)\|^2 + t\|\nabla F(x_t)\|^2} = \frac{t}{2t-1},$$
where the last inequality is because $t-1 \ge \tau+1$, hence $\|\nabla F(x_{t-1})\| \ge \|\nabla F(x_t)\|$ by Lemma B.13. Note that $t \ge 2$, so $\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t} \le \frac{t}{2t-1} \le \frac{2}{3}$.

Lemma B.15. With Assumptions 1' and 2', if $p_t = \frac{1}{t}$, then
$$b_{\tau+1} \le \sqrt{b_0^2+\|\nabla F(x_1)\|^2} \vee \eta L\sqrt{\frac{1}{4}+\frac{\|x_1-x^*\|^2}{\eta^2}+\left(\frac{2\eta L}{b_0}-1\right)_+}.$$

Proof. If τ = 0, we have $b_{\tau+1} = b_1 = \sqrt{b_0^2+\|\nabla F(x_1)\|^2}$. Otherwise, we know $\tau+1 \ge 2$, hence
$$b_{\tau+1}^2 = b_\tau^2 + (\tau+1)\|\nabla F(x_{\tau+1})\|^2 \le b_\tau^2 + 2L(\tau+1)(F(x_{\tau+1})-F^*) \le b_\tau^2 + 2L\cdot\frac{\tau+1}{\tau}\cdot b_\tau\left(\frac{\|x_1-x^*\|^2}{2\eta}+\frac{\eta}{2}\left(\frac{2\eta L}{b_0}-1\right)_+\right)$$
$$\le b_\tau^2 + 4Lb_\tau\left(\frac{\|x_1-x^*\|^2}{2\eta}+\frac{\eta}{2}\left(\frac{2\eta L}{b_0}-1\right)_+\right) \le \left(\frac{\eta L}{2}\right)^2 + \eta^2 L^2\left(\frac{\|x_1-x^*\|^2}{\eta^2}+\left(\frac{2\eta L}{b_0}-1\right)_+\right)$$
$$\Rightarrow\; b_{\tau+1} \le \eta L\sqrt{\frac{1}{4}+\frac{\|x_1-x^*\|^2}{\eta^2}+\left(\frac{2\eta L}{b_0}-1\right)_+},$$
where the second inequality is due to (8).

Now we combine Lemmas B.14 and B.15 to get an upper bound on $b_T$.

Lemma B.16. With Assumptions 1' and 2', if $p_t = \frac{1}{t}$, then
$$b_T \le \max\left\{\frac{\eta L}{2},\; \sqrt{b_0^2+\|\nabla F(x_1)\|^2}\exp\left(\frac{3\|x_1-x^*\|^2}{\eta^2}+3\left(\frac{2\eta L}{b_0}-1\right)_+\right),\; \eta L\sqrt{\frac{1}{4}+\frac{\|x_1-x^*\|^2}{\eta^2}+\left(\frac{2\eta L}{b_0}-1\right)_+}\exp\left(\frac{3\|x_1-x^*\|^2}{\eta^2}+3\left(\frac{2\eta L}{b_0}-1\right)_+\right)\right\}.$$

Proof. Note that if $b_T \le \frac{\eta L}{2}$, we are done.
If $b_T > \frac{\eta L}{2}$, we bound $b_T$ as follows:
$$b_T^2 = b_{T-1}^2 + \frac{\|\nabla F(x_T)\|^2}{p_T} \;\Rightarrow\; b_T^2 = \frac{b_{T-1}^2}{1-\frac{\|\nabla F(x_T)\|^2}{b_T^2 p_T}} = b_{\tau+1}^2\prod_{t=\tau+2}^{T}\frac{1}{1-\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}}$$
$$\Rightarrow\; \log b_T^2 \le \log b_{\tau+1}^2 + \sum_{t=\tau+2}^{T}\log\frac{1}{1-\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}} \le \log b_{\tau+1}^2 + \sum_{t=\tau+2}^{T}\frac{\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}}{1-\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t}} \le \log b_{\tau+1}^2 + \sum_{t=\tau+2}^{T}\frac{3\|\nabla F(x_t)\|^2}{b_t^2 p_t},$$
where the last inequality is by Lemma B.14. Noticing $p_t = \frac{1}{t} \le 1$ and combining Lemmas B.11 and B.12, we can find
$$\sum_{t=1}^{T}\frac{\|\nabla F(x_t)\|^2}{b_t^2 p_t} \le \frac{2\|x_1-x^*\|^2}{\eta^2} + 2\left(\frac{2\eta L}{b_0}-1\right)_+.$$
Hence we know
$$\log b_T^2 \le \log b_{\tau+1}^2 + \frac{6\|x_1-x^*\|^2}{\eta^2} + 6\left(\frac{2\eta L}{b_0}-1\right)_+ \;\Rightarrow\; b_T^2 \le b_{\tau+1}^2\exp\left(\frac{6\|x_1-x^*\|^2}{\eta^2}+6\left(\frac{2\eta L}{b_0}-1\right)_+\right).$$
The last step is to use the bound on $b_{\tau+1}$ from Lemma B.15. Finally, the proof of Theorem B.9 is obtained by applying Lemma B.16 to Equation (8).

C MISSING PROOFS FROM SECTION 5

C.1 IMPORTANT LEMMA

First, we state a general lemma that can be used in a more general setting; the proof of the lemma is standard.

Lemma C.1. Suppose $F$ satisfies Assumptions 1' and 2' and the following conditions hold:
• $w_t$ is generated by
$$v_t = (1-a_t)w_t + a_t x_t, \qquad x_{t+1} = x_t - \frac{\eta}{q_t c_t}\nabla F(v_t), \qquad w_{t+1} = (1-a_t)w_t + a_t x_{t+1},$$
with $\eta > 0$ and $c_t > 0$ non-decreasing;
• $a_t \in (0,1]$ and $q_t \ge a_t$ satisfy $\frac{1}{a_t q_t} \ge \frac{1-a_{t+1}}{a_{t+1}q_{t+1}}$ and $a_1 = 1$.
Then we have
$$\frac{F(w_{T+1})-F^*}{a_T q_T c_T} \le \frac{\|x_1-x^*\|^2}{2\eta} + \sum_{t=1}^{T}\left(\frac{L}{2c_t}-\frac{1}{2\eta}\right)\|x_{t+1}-x_t\|^2 = \frac{\|x_1-x^*\|^2}{2\eta} + \sum_{t=1}^{T}\left(\frac{L}{2c_t}-\frac{1}{2\eta}\right)\frac{\eta^2\|\nabla F(v_t)\|^2}{c_t^2 q_t^2}.$$

C.2 FIRST VARIANT

Using Lemma C.1, the proof idea of Theorem 5.1 is the same as that of Theorem 4.1; hence, we omit it for brevity.

C.3 SECOND VARIANT

Using Lemma C.1, the proof idea of Theorem 5.3 is the same as that of Theorem 4.3; hence, we omit it here.

C.4 A DISCUSSION ON WHEN ∆ = 0 AND δ = 1

Algorithms 5 and 6 become one when ∆ = 0 and δ = 1.
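As a numerical illustration of the three-sequence scheme in Lemma C.1 (a sketch only: the two-dimensional quadratic, the non-adaptive choice $c_t \equiv L$, $q_t = a_t = 2/(t+1)$, and $\eta = 1$ are all our own simplifying assumptions), note that with $c_t = L$ the bracket $\frac{L}{2c_t}-\frac{1}{2\eta}$ vanishes, so the lemma directly gives $F(w_{T+1}) - F^* \le \frac{a_T q_T L\|x_1-x^*\|^2}{2}$, an accelerated $O(1/T^2)$ rate:

```python
# Three-sequence scheme of Lemma C.1 on F(z) = (z1^2 + 0.1*z2^2)/2, so L = 1.
# Our simplifying choices: c_t = L (non-adaptive), q_t = a_t = 2/(t+1), eta = 1,
# which makes the step size eta/(q_t c_t) = (t+1)/(2L).
L = 1.0
grad = lambda z: [z[0], 0.1 * z[1]]

x = [5.0, 5.0]
w = x[:]
for t in range(1, 2001):
    a_t = 2.0 / (t + 1)
    v = [(1 - a_t) * wi + a_t * xi for wi, xi in zip(w, x)]      # v_t
    g = grad(v)
    x = [xi - gi * (t + 1) / (2 * L) for xi, gi in zip(x, g)]    # x_{t+1}
    w = [(1 - a_t) * wi + a_t * xi for wi, xi in zip(w, x)]      # w_{t+1}

print(w)  # w_t approaches the minimizer (0, 0)
```

One can check that these choices satisfy the lemma's conditions, since $\frac{1}{a_t q_t} = \frac{(t+1)^2}{4} \ge \frac{t(t+2)}{4} = \frac{1-a_{t+1}}{a_{t+1}q_{t+1}}$; the adaptive variants replace $c_t = L$ with the data-driven $b_t$, following the same skeleton.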
As discussed in Section B.4, the challenge is to find an explicit bound on $b_T$. First, we give an asymptotic rate in Theorem C.2, whose proof idea is the same as that of Theorem B.6 and is thus omitted.

Theorem C.2. Suppose $F$ satisfies Assumptions 1 and 2'. When ∆ = 0 for Algorithm 5, or equivalently, δ = 1 for Algorithm 6, by taking $a_t = \frac{2}{t+1}$, $p_t = \frac{2}{t}$, we have $F(w_{T+1}) - F^* = O(1/T^2)$.

Now, we aim to prove the following non-asymptotic rate.

Theorem C.3. Suppose $F$ satisfies Assumptions 1 and 2'. When ∆ = 0 for Algorithm 5, or equivalently, δ = 1 for Algorithm 6, by taking $a_t = \frac{2}{t+1}$, $p_t = \frac{2}{t}$, we have
$$F(w_{T+1}) - F^* \le \frac{4D}{T(T+1)}\left(b_0 + \frac{4\eta^2 L^2}{b_0} + 4LDT\right), \quad\text{where } D = \frac{\|x_1-x^*\|^2}{2\eta} + \frac{\eta^2 L}{b_0}\log_+\frac{\eta L}{b_0}.$$

We briefly discuss here why we can only give a rate of order 1/T rather than 1/T². Recall that in the proof of Theorem B.9, the key step is that, after a certain time, $\|\nabla F(x_t)\|$ is a non-increasing sequence, which finally yields a constant upper bound on $b_T$ and hence the final 1/T rate. However, it is unclear under what condition on $b_t$ the sequence $\|\nabla F(v_t)\|$ is non-increasing in our accelerated algorithm; thus it is unclear to us whether a constant bound on $b_T$ is possible. Instead, we show via a new trick that $b_t$ can increase at most linearly in this accelerated scheme, for which reason we finally obtain a rate of order 1/T. This guarantees that the convergence of the last iterate is no worse than for the variants in Section 4.

Proof. As before, to start with, we use Lemma C.1:
$$\frac{F(w_{T+1})-F^*}{a_T q_T b_T} \le \frac{\|x_1-x^*\|^2}{2\eta} + \sum_{t=1}^{T}\left(\frac{L}{2b_t}-\frac{1}{2\eta}\right)\frac{\eta^2\|\nabla F(v_t)\|^2}{b_t^2 q_t^2}.$$
Using $b_t^2 = b_{t-1}^2 + \frac{\|\nabla F(v_t)\|^2}{q_t^2}$ and the same technique as in the previous proofs, we know
$$\sum_{t=1}^{T}\left(\frac{L}{2b_t}-\frac{1}{2\eta}\right)\frac{\eta^2\|\nabla F(v_t)\|^2}{b_t^2 q_t^2} \le \frac{\eta^2 L}{b_0}\log_+\frac{\eta L}{b_0}.$$
So we have
$$F(w_{T+1}) - F^* \le a_T q_T b_T\left(\frac{\|x_1-x^*\|^2}{2\eta} + \frac{\eta^2 L}{b_0}\log_+\frac{\eta L}{b_0}\right) = a_T q_T b_T D.$$
Now we turn to bounding $b_t$ by observing
$$b_t^2 = b_{t-1}^2 + \frac{\|\nabla F(v_t)\|^2}{q_t^2} \le b_{t-1}^2 + \frac{2\|\nabla F(v_t)-\nabla F(w_{t+1})\|^2}{q_t^2} + \frac{2\|\nabla F(w_{t+1})\|^2}{q_t^2} \le b_{t-1}^2 + \frac{2L^2\|v_t-w_{t+1}\|^2}{q_t^2} + \frac{2\|\nabla F(w_{t+1})\|^2}{q_t^2}$$
$$= b_{t-1}^2 + \frac{2L^2 a_t^2\|x_{t+1}-x_t\|^2}{q_t^2} + \frac{2\|\nabla F(w_{t+1})\|^2}{q_t^2} \le b_{t-1}^2 + 2L^2\|x_{t+1}-x_t\|^2 + \frac{2\|\nabla F(w_{t+1})\|^2}{q_t^2},$$
where the last inequality is due to $a_t \le q_t$. Then we use
$$\|x_{t+1}-x_t\|^2 = \frac{\eta^2\|\nabla F(v_t)\|^2}{b_t^2 q_t^2} = \frac{\eta^2(b_t^2-b_{t-1}^2)}{b_t^2} \quad\text{and}\quad \|\nabla F(w_{t+1})\|^2 \le 2L(F(w_{t+1})-F^*) \le 2La_t q_t b_t D$$
to get
$$b_t^2 \le b_{t-1}^2 + \frac{2\eta^2 L^2(b_t^2-b_{t-1}^2)}{b_t^2} + \frac{4La_t q_t b_t D}{q_t^2} \le b_{t-1}^2 + \frac{2\eta^2 L^2(b_t^2-b_{t-1}^2)}{b_t^2} + 4Lb_t D$$
$$\Rightarrow\; b_t \le \frac{b_{t-1}^2}{b_t} + \frac{2\eta^2 L^2(b_t^2-b_{t-1}^2)}{b_t^3} + 4LD \le b_{t-1} + 4\eta^2 L^2\left(\frac{1}{b_{t-1}}-\frac{1}{b_t}\right) + 4LD \;\Rightarrow\; b_t \le b_0 + \frac{4\eta^2 L^2}{b_0} + 4LDt.$$
Using this bound, we finally get
$$F(w_{T+1}) - F^* \le a_T q_T b_T D \le \frac{4D}{T(T+1)}\left(b_0 + \frac{4\eta^2 L^2}{b_0} + 4LDT\right).$$

In this section, we provide some empirical evidence comparing the performance of our algorithms in the deterministic setting. Our test function is the quadratic function used to prove the lower bound for first-order methods constructed by Nesterov (Nesterov et al., 2018), namely
$$F(x) = \frac{x[1]^2 + x[d]^2 + \sum_{i=1}^{d-1}(x[i]-x[i+1])^2}{2} - x[1],$$
where $x[i]$ refers to the i-th coordinate of the point $x \in \mathbb{R}^d$. It is known that $F$ is 4-smooth and convex with the unique minimizer $x^*[i] = 1-\frac{i}{d+1}$ for all $i \in [d]$. We fix $d = 101$ and set the time horizon to $T = 1000$ in the test. The starting point $x_1$ is initialized randomly with every coordinate chosen uniformly in $[0, 1)$; all algorithms share the same $x_1$. For the adaptive algorithms, we choose $b_0 = 10^{-2}$ and set $\eta = 1$ without any further tuning. We also compare with an accelerated algorithm (Lan, 2020), which requires using the smoothness constant $L = 4$. The result is shown in Figure 1. We can find that our Algorithms 3 and 4 admit last-iterate convergence.
Additionally, both our accelerated algorithms, i.e., Algorithms 5 and 6, enjoy the accelerated behavior without knowing the smoothness parameter and are competitive with Accelerated Gradient Descent (Lan, 2020), which requires the smoothness parameter to set the step size. Another interesting observation is that AdaGradNorm also seems to exhibit last-iterate convergence. However, whether this is indeed a property of AdaGradNorm has not been confirmed by theory. We leave this as a future direction.
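The test function above can be written down directly; the sketch below (the finite checks at the end are our own additions, not part of the paper's experiment code) also verifies numerically that the stated $x^*[i] = 1 - \frac{i}{d+1}$ is a stationary point:

```python
def F(x):
    """Nesterov's worst-case quadratic: (x[1]^2 + x[d]^2 + sum (x[i]-x[i+1])^2)/2 - x[1]."""
    d = len(x)
    quad = x[0] ** 2 + x[-1] ** 2 + sum((x[i] - x[i + 1]) ** 2 for i in range(d - 1))
    return quad / 2.0 - x[0]

def gradF(x):
    """Gradient of F: tridiagonal quadratic form minus the first unit vector."""
    d = len(x)
    g = [0.0] * d
    for i in range(d):
        left = x[i] - x[i - 1] if i > 0 else 0.0
        right = x[i] - x[i + 1] if i < d - 1 else 0.0
        g[i] = left + right
    g[0] += x[0] - 1.0     # contribution of x[1]^2/2 - x[1]
    g[-1] += x[-1]         # contribution of x[d]^2/2
    return g

d = 101
x_star = [1.0 - (i + 1) / (d + 1) for i in range(d)]   # minimizer stated in the text
print(max(abs(gi) for gi in gradF(x_star)))  # numerically zero: x_star is stationary
```

For this d, one can also check that $F(x^*) = -\frac{x^*[1]}{2} = -\frac{d}{2(d+1)}$, since $x^*$ solves the tridiagonal linear system defining the gradient.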



Figure 1: Function value gap for different algorithms




Proof of Lemma C.1. Starting from smoothness,
$$F(w_{t+1}) - F(v_t) \le \langle\nabla F(v_t), w_{t+1}-v_t\rangle + \frac{L}{2}\|w_{t+1}-v_t\|^2$$
$$= (1-a_t)\langle\nabla F(v_t), w_t-v_t\rangle + a_t\langle\nabla F(v_t), x_{t+1}-v_t\rangle + \frac{L}{2}\|w_{t+1}-v_t\|^2$$
$$= (1-a_t)\langle\nabla F(v_t), w_t-v_t\rangle + a_t\langle\nabla F(v_t), x^*-v_t\rangle + a_t\langle\nabla F(v_t), x_{t+1}-x^*\rangle + \frac{L}{2}\|w_{t+1}-v_t\|^2$$
$$\le (1-a_t)(F(w_t)-F(v_t)) + a_t(F^*-F(v_t)) + a_t\langle\nabla F(v_t), x_{t+1}-x^*\rangle + \frac{L}{2}\|w_{t+1}-v_t\|^2,$$
where the last inequality is due to the convexity of $F$. Thus
$$F(w_{t+1}) - F^* \le (1-a_t)(F(w_t)-F^*) + a_t\langle\nabla F(v_t), x_{t+1}-x^*\rangle + \frac{L}{2}\|w_{t+1}-v_t\|^2.$$
Using the update rule $\nabla F(v_t) = \frac{q_t c_t}{\eta}(x_t-x_{t+1})$ and $w_{t+1}-v_t = a_t(x_{t+1}-x_t)$, we obtain
$$F(w_{t+1}) - F^* \le (1-a_t)(F(w_t)-F^*) + \frac{a_t q_t c_t}{\eta}\langle x_t-x_{t+1}, x_{t+1}-x^*\rangle + \frac{La_t^2}{2}\|x_{t+1}-x_t\|^2.$$
Note that $a_t \le q_t$, $\frac{1}{a_t q_t c_t} \ge \frac{1-a_{t+1}}{a_{t+1}q_{t+1}c_{t+1}}$ (since $c_t$ is non-decreasing), and $a_1 = 1$. Dividing both sides by $a_t q_t c_t$ and summing up from $t = 1$ to $T$ yields the claimed bound.

ACKNOWLEDGMENTS

TN and AE were supported in part by NSF CAREER grant CCF-1750333, NSF grant III-1908510, and an Alfred P. Sloan Research Fellowship. HN was supported in part by NSF CAREER grant CCF-1750716 and NSF grant CCF-1909314. Reproducibility Statement. We include the full proofs of all theorems in the Appendix.

