META-STORM: GENERALIZED FULLY-ADAPTIVE VARIANCE REDUCED SGD FOR UNBOUNDED FUNCTIONS

Abstract

We study the application of variance reduction (VR) techniques to general non-convex stochastic optimization problems. In this setting, the recent work STORM (Cutkosky & Orabona, 2019) overcomes the drawback of having to compute gradients of "mega-batches" that earlier VR methods rely on. There, STORM utilizes recursive momentum to achieve the VR effect; it is later made fully adaptive in STORM+ (Levy et al., 2021), where full adaptivity removes the requirement of obtaining certain problem-specific parameters, such as the smoothness of the objective and bounds on the variance and norm of the stochastic gradients, in order to set the step size. However, STORM+ crucially relies on the assumption that the function values are bounded, excluding a large class of useful functions. In this work, we propose META-STORM, a generalized framework of STORM+ that removes this bounded function value assumption while still attaining the optimal convergence rate for non-convex optimization. META-STORM not only maintains full adaptivity, removing the need to obtain problem-specific parameters, but also improves the convergence rate's dependency on the problem parameters. Furthermore, META-STORM supports a wide range of parameter settings that subsumes previous methods, allowing for more flexibility in a wider range of settings. Finally, we demonstrate the effectiveness of META-STORM through experiments across common deep learning tasks. Our algorithm improves upon the previous work STORM+ and is competitive with widely used algorithms after the addition of per-coordinate update and exponential moving average heuristics.

1. INTRODUCTION

In this paper, we consider the stochastic optimization problem
min_{x ∈ R^d} F(x) := E_{ξ∼D}[f(x, ξ)],   (1)
where F: R^d → R is possibly non-convex. We assume only access to a first-order stochastic oracle via sample functions f(x, ξ), where ξ comes from a distribution D representing the randomness in the sampling process. Optimization problems of this form are ubiquitous in machine learning and deep learning. Empirical risk minimization (ERM) is one instance, where F(x) is the loss function evaluated on a sample or a minibatch represented by ξ. An important advance in solving Problem (1) is the recent development of variance reduction (VR) techniques that improve the convergence rate to critical points from the O(1/T^{1/4}) of vanilla SGD to O(1/T^{1/3}) (Fang et al., 2018; Li et al., 2021) for the class of mean-squared smooth functions (Arjevani et al., 2019). In contrast to earlier VR algorithms, which often require the computation of gradients over large batches, recent methods such as Cutkosky & Orabona (2019); Levy et al. (2021); Huang et al. (2021) avoid this drawback by using a weighted average of past gradients, often known as momentum. When the weights are selected appropriately, momentum reduces the error in the gradient estimates, which improves the convergence rate. A different line of work on adaptive methods (Duchi et al., 2011; Kingma & Ba, 2014), some of which incorporate momentum techniques, has shown tremendous success in practice.
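To make the oracle model concrete, here is a minimal NumPy sketch of a stochastic first-order oracle for an ERM instance. The least-squares loss, the data, and all names are illustrative choices of ours, not part of the paper's setup:

```python
import numpy as np

# Minimal sketch of the stochastic first-order oracle for an ERM instance.
# Illustrative choice: least-squares losses f(x, xi) = 0.5*(a_i.x - b_i)^2,
# where xi = (a_i, b_i) is a sampled data point; F is the average loss.

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))            # 1000 samples, dimension d = 5
b = A @ np.ones(5) + 0.1 * rng.standard_normal(1000)

def stochastic_grad(x, batch):
    """Unbiased estimate of grad F(x) from a minibatch of sample indices."""
    Ab, bb = A[batch], b[batch]
    return Ab.T @ (Ab @ x - bb) / len(batch)

def full_grad(x):                             # exact grad F(x), for reference
    return A.T @ (A @ x - b) / len(b)

g = stochastic_grad(np.zeros(5), rng.choice(len(b), 64, replace=False))
# In expectation over the minibatch, g equals full_grad(x); the algorithms in
# this paper assume only this kind of stochastic access.
```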
[Table 1: Comparison with recent variance-reduced methods (STORM, SUPER-ADAM, STORM+, META-STORM-SG, and META-STORM): adaptivity to σ, convergence rates, required assumptions, and the problem-parameter dependencies κ.]

These adaptive methods remove the burden of obtaining certain problem-specific parameters, such as smoothness, in order to set the right step size to guarantee convergence. STORM+ (Levy et al., 2021) is the first algorithm to bridge the gap between fully-adaptive algorithms and VR methods, achieving the variance-reduced convergence rate of O(1/T^{1/3}) while not requiring knowledge of any problem-specific parameter. It is also the first work to demonstrate the interplay between adaptive momentum and step sizes in adapting to the problem's structure while still achieving the VR rate. However, STORM+ relies on a strong assumption that the function values are bounded, which generally does not hold in practice. Moreover, the convergence rate of STORM+ has high polynomial dependencies on the problem parameters, compared to what can be achieved by appropriately configuring the step sizes and momentum parameters given knowledge of the problem parameters (see Section 3.1). Our contributions: In this work, we propose META-STORM-SG and META-STORM, two flexible algorithmic frameworks that attain the optimal variance-reduced convergence rate for general non-convex objectives.
Both of them generalize STORM+ by allowing a wider range of parameter selections and removing the restrictive bounded function value assumption while maintaining its desirable fully-adaptive property, eliminating the need to obtain any problem-specific parameter. These improvements are enabled by our novel analysis framework, which also establishes a convergence rate with much better dependency on the problem parameters. We present a comparison of META-STORM and its sibling META-STORM-SG against recent VR methods in Table 1. In the appendix, we propose another algorithm, META-STORM-NA, with even less restrictive assumptions, at the cost of losing adaptivity to the variance parameter. We complement our theoretical results with experiments across three common tasks: image classification, masked language modeling, and sentiment analysis. Our algorithms improve upon the previous work, STORM+. Furthermore, the addition of heuristics such as exponential moving averages and per-coordinate updates improves our algorithms' generalization performance. These versions of our algorithms are shown to be competitive with widely used algorithms such as Adam and AdamW.

1.1. RELATED WORK

Variance reduction methods for stochastic non-convex optimization: Variance reduction was introduced to non-convex optimization by Allen-Zhu & Hazan (2016) and Reddi et al. (2016) in the context of finite-sum optimization, achieving faster convergence than full gradient descent. These methods were first improved by Lei et al. (2017) and later by Fang et al. (2018) and Li et al. (2021), both of which achieve an O(1/T^{1/3}) convergence rate, matching the lower bound of Arjevani et al. (2019). However, these earlier methods periodically need to compute the full gradient (in the finite-sum case) or a giant batch at a checkpoint, which can be quite costly. Shortly after, Cutkosky & Orabona (2019) and Tran-Dinh et al. (2019) introduced a different approach that utilizes stochastic gradients from previous time steps instead of computing the full gradient at checkpoints. These methods are framed as momentum-based methods, as they are akin to using a weighted average of the gradient estimates to achieve variance reduction. Recently, SUPER-ADAM (Huang et al., 2021) integrated STORM into a larger framework of adaptive algorithms, but loses adaptivity to the variance parameter σ. At the same time, STORM+ (Levy et al., 2021) proposed a fully adaptive version of STORM, which our work builds upon. Adaptive methods for stochastic non-convex optimization: Classical methods, like SGD (Ghadimi & Lan, 2013), typically require knowledge of problem parameters, such as the smoothness and the variance of the stochastic gradients, to set the step sizes. In contrast, adaptive methods (Duchi et al., 2011; Tieleman et al., 2012; Kingma & Ba, 2014) forgo this requirement: their step sizes rely only on the stochastic gradients obtained by the algorithm.
Although these adaptive methods were originally designed for convex optimization, they enjoy great success and popularity in highly non-convex practical applications such as training deep neural networks, often making them the method of choice in practice. As a result, the theoretical understanding of adaptive methods for non-convex problems has received significant attention in recent years. The works of Ward et al. (2019) and Kavis et al. (2021) propose convergence analyses of AdaGrad under various assumptions. Among VR methods, STORM+ is the only fully adaptive algorithm that does not require knowledge of any problem parameter. Our work builds on and generalizes STORM+, removing the bounded function value assumption while obtaining much better dependencies on the problem parameters.

1.2. PROBLEM DEFINITION AND ASSUMPTIONS

We study stochastic non-convex optimization problems in which the objective function F: R^d → R has the form F(x) := E_{ξ∼D}[f(x, ξ)], where f(·, ξ) is a sampling function depending on a random variable ξ drawn from a distribution D. For simplicity, we omit D and write E_ξ[f(x, ξ)] in the remainder of the paper. ‖·‖ denotes ‖·‖_2 for brevity, and [T] := {1, 2, ..., T}. The analysis of our algorithms relies on the following Assumptions 1-5:
1. Lower bounded function value: F* := inf_{x∈R^d} F(x) > -∞.
2. Unbiased estimator with bounded variance: we assume access to ∇f(x, ξ) satisfying E_ξ[∇f(x, ξ)] = ∇F(x) and E_ξ[‖∇f(x, ξ) - ∇F(x)‖²] ≤ σ² for some σ ≥ 0.

3. Averaged β-smoothness: E_ξ[‖∇f(x, ξ) - ∇f(y, ξ)‖²] ≤ β²‖x - y‖² for all x, y ∈ R^d.
4. Bounded stochastic gradients: ‖∇f(x, ξ)‖ ≤ G for all x ∈ R^d and ξ ∈ support(D), for some G ≥ 0.
5. Bounded stochastic gradient differences: ‖∇f(x, ξ) - ∇f(x, ξ′)‖ ≤ 2σ̃ for all x ∈ R^d and ξ, ξ′ ∈ support(D), for some σ̃ ≥ 0.

Assumptions 1, 2 and 3 are standard in the VR setting (Arjevani et al., 2019). Assumption 5 is weaker than the assumptions made in prior works based on the STORM framework (Cutkosky & Orabona, 2019; Levy et al., 2021), which assume that the stochastic gradients are bounded, i.e., Assumption 4. We note that Assumption 4 implies Assumption 5 with σ̃ replaced by G; thus we only have to consider σ̃ = O(G). To better understand Assumption 5, fix ξ ∈ support(D) and consider another ξ′ ∼ D; then, by the convexity of ‖·‖ and Jensen's inequality,
‖∇f(x, ξ) - ∇F(x)‖ = ‖∇f(x, ξ) - E_{ξ′}[∇f(x, ξ′)]‖ ≤ E_{ξ′}[‖∇f(x, ξ) - ∇f(x, ξ′)‖] ≤ 2σ̃.
This means Assumption 5 implies a stronger version of Assumption 2, so we can consider σ = O(σ̃).

Algorithm 1 META-STORM-SG
  Input: initial point x_1 ∈ R^d
  Parameters: a_0, b_0, η, p ∈ [1/4, 1/2], p + 2q = 1
  Sample ξ_1 ∼ D; d_1 = ∇f(x_1, ξ_1)
  for t = 1, ..., T do:
    a_{t+1} = (1 + Σ_{i=1}^t ‖∇f(x_i, ξ_i)‖² / a_0²)^{-2/3}
    b_t = (b_0^{1/p} + Σ_{i=1}^t ‖d_i‖²)^p / a_{t+1}^q
    x_{t+1} = x_t - (η/b_t) d_t
    Sample ξ_{t+1} ∼ D
    d_{t+1} = ∇f(x_{t+1}, ξ_{t+1}) + (1 - a_{t+1})(d_t - ∇f(x_t, ξ_{t+1}))
  end for
  Output: x_out = x_t where t ∼ Uniform([T])

Algorithm 2 META-STORM
  Input: initial point x_1 ∈ R^d
  Parameters: a_0, b_0, η, p ∈ [(3 - √7)/2, 1/2], p + 2q = 1
  Sample ξ_1 ∼ D; d_1 = ∇f(x_1, ξ_1); a_1 = 1
  for t = 1, ..., T do:
    b_t = (b_0^{1/p} + Σ_{i=1}^t ‖d_i‖²)^p / a_t^q
    x_{t+1} = x_t - (η/b_t) d_t
    Sample ξ_{t+1} ∼ D
    a_{t+1} = (1 + Σ_{i=1}^t ‖∇f(x_i, ξ_i) - ∇f(x_i, ξ_{i+1})‖² / a_0²)^{-2/3}
    d_{t+1} = ∇f(x_{t+1}, ξ_{t+1}) + (1 - a_{t+1})(d_t - ∇f(x_t, ξ_{t+1}))
  end for
  Output: x_out = x_t where t ∼ Uniform([T])
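For concreteness, the META-STORM update (Algorithm 2) can be sketched in NumPy on a toy least-squares objective. The objective, oracle, batch size, and hyperparameter values below are illustrative choices of ours; only the update rules follow the pseudocode:

```python
import numpy as np

# NumPy sketch of META-STORM (Algorithm 2) on a toy least-squares objective.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 5))
y = A @ np.ones(5)                            # noiseless targets: x* = all-ones

def grad(x, idx):                             # stochastic gradient on batch idx
    Ab, yb = A[idx], y[idx]
    return Ab.T @ (Ab @ x - yb) / len(idx)

def meta_storm(T=300, eta=0.5, a0=1.0, b0=1.0, p=0.5):
    q = (1.0 - p) / 2.0                       # enforce p + 2q = 1
    x = np.zeros(5)
    xi = rng.choice(len(y), 32)
    d = grad(x, xi)                           # d_1 = grad f(x_1, xi_1)
    a = 1.0                                   # a_1 = 1
    sum_d = sum_diff = 0.0                    # running sums in b_t and a_{t+1}
    for t in range(T):
        sum_d += d @ d
        b = (b0 ** (1.0 / p) + sum_d) ** p / a ** q
        x_prev, x = x, x - (eta / b) * d      # x_{t+1} = x_t - (eta/b_t) d_t
        xi_prev, xi = xi, rng.choice(len(y), 32)
        # a_{t+1} is driven by gradient differences at the same point x_t
        sum_diff += np.sum((grad(x_prev, xi_prev) - grad(x_prev, xi)) ** 2)
        a = (1.0 + sum_diff / a0 ** 2) ** (-2.0 / 3.0)
        d = grad(x, xi) + (1.0 - a) * (d - grad(x_prev, xi))
    return x

x_out = meta_storm()
```

On this noiseless toy instance both the stochastic gradients and the gradient differences vanish at the minimizer, so the iterates settle near the all-ones solution.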
Additional assumptions made in prior works (Cutkosky & Orabona, 2019; Levy et al., 2021; Huang et al., 2021) include the following: 3'. Almost surely β-smooth: ‖∇f(x, ξ) - ∇f(y, ξ)‖ ≤ β‖x - y‖ for all x, y ∈ R^d and ξ ∈ support(D).

6. Bounded function values: there exists B ≥ 0 such that |F(x) - F(y)| ≤ B for all x, y ∈ R^d.

We remark that 3' is strictly stronger than 3 and is not a standard assumption in Arjevani et al. (2019). Moreover, Assumption 6, which plays a critical role in the analysis of Levy et al. (2021), is relatively strong and cannot always be satisfied in non-convex optimization. Our work removes these two restrictive assumptions and also improves the dependency on the problem parameters.

2. OUR ALGORITHMS

In this section, we introduce our two main algorithms, META-STORM-SG and META-STORM, shown in Algorithm 1 and Algorithm 2, respectively. Our algorithms follow the generic framework of momentum-based variance-reduced SGD put forward by STORM (Cutkosky & Orabona, 2019). The STORM template incorporates momentum and variance reduction as follows:
d_t = a_t ∇f(x_t, ξ_t) + (1 - a_t) d_{t-1}  [momentum]  + (1 - a_t)(∇f(x_t, ξ_t) - ∇f(x_{t-1}, ξ_t))  [variance reduction]   (2)
x_{t+1} = x_t - (η/b_t) d_t.
The first variant, META-STORM-SG, similar to prior works, uses the gradient norms when setting a_t and, similarly, requires the strong assumption that the stochastic gradients are bounded. The major difference lies in the structure of the momentum parameters and the step sizes and their relationship, which is developed further in the second algorithm, META-STORM, so that Assumption 4 can be relaxed to Assumption 5. We now highlight our key algorithmic contributions and how they depart from prior works. A first point of departure is our use of stochastic gradient differences when setting the momentum parameter a_t in META-STORM: prior works set a_t based on the stochastic gradients, while META-STORM sets a_t based on the difference of two gradient estimates computed with two different samples ξ_{t-1} and ξ_t at the same point x_{t-1}. The gradient difference can be viewed as a proxy for the variance σ², which allows us to require only the mild Assumption 5 in the analysis. With this choice, our algorithm obtains the best dependency on the problem parameters. On the other hand, the coefficient 1 - a_{t+1} in the update for d_{t+1} now depends on ξ_{t+1}, and addressing this correlation requires a more careful analysis. The second point of departure is the setting of the step sizes b_t and their relationship to the momentum parameters a_t in both META-STORM-SG and META-STORM.
We propose a general update rule b_t = (b_0^{1/p} + Σ_{i=1}^t ‖d_i‖²)^p / a_t^q that allows for a broad range of choices for p and q subsuming prior works. In practice, different problem domains may benefit from different choices of p and q. Our framework captures prior works such as the STORM+ update b_t = (Σ_{i=1}^t ‖d_i‖²/a_{i+1})^{1/3} using a different but related choice of momentum parameters and a simpler update that uses only the current momentum value a_t instead of all the previous momentum values a_{i+1} with i ≤ t. We further motivate and provide intuition for our algorithmic choices in Section 3. We note that our algorithm uses only the stochastic gradient information received, and it does not require any knowledge of the problem parameters. We provide an overview and intuition for our algorithm in Section 3 and give the complete analysis in the appendix. Our analysis departs significantly from prior works such as STORM+, and it allows us to forgo the bounded function value assumption and improve the convergence rate's dependency on the problem parameters. It remains an interesting open question to determine the best convergence rate that can be achieved when the function values are bounded. We can further alleviate Assumption 5 in another new algorithm, META-STORM-NA (Algorithm 5), provided in Section H of the appendix. To the best of our knowledge, META-STORM-NA is the only adaptive algorithm that enjoys the convergence rate O(1/T^{1/3}) under only the weakest Assumptions 1-3. It also allows a wide range of choices for p ∈ (0, 1/2]. However, the tradeoff is that the algorithm does not adapt to the variance parameter σ. For the detailed analysis, we refer readers to Section H. Finally, we show the convergence rates obtained by Algorithms 1 and 2 in the following theorems. The convergence rates for general p are given in the appendix. Theorem 2.1.
Under Assumptions 1-4 in Section 1.2, with the choice p = 1/2 and setting a_0 = b_0 = η = 1 to simplify the final bound, META-STORM-SG ensures that
E[‖∇F(x_out)‖^{2/3}] = O( W_1 / (1 + σ²T)^{1/3} + (W_2 + W_3 log^{2/3}(1 + σ²T)) · (1/T^{1/3} + σ^{2/9}/T^{2/9}) ),
where
W_1 = O( F(x_1) - F* + σ² + G² + β(1 + G²) log((β + G²)/β) ),
W_2 = O( (F(x_1) - F*)^{2/3} + σ^{4/3} + G^{4/3} + (1 + G^{4/3}) β^{2/3} log^{2/3}((β + G²)/β) ),
W_3 = O( (1 + G^{4/3}) β^{2/3} ).
We note that when σ² > 0 and T is large enough, the effect of W_1 can be eliminated. Combining Theorem 2.1 and Markov's inequality, we immediately have the following corollary. Corollary 2.2. Under the same setting as in Theorem 2.1, and additionally assuming σ² > 0 and T large enough, for any 0 < δ < 1, with probability at least 1 - δ,
‖∇F(x_out)‖ ≤ O( (κ_1 + κ_2 log(1 + σ²T)) / δ^{3/2} · (1/T^{1/2} + σ^{1/3}/T^{1/3}) ),
where κ_1 = O( F(x_1) - F* + σ² + G² + κ_2 log κ_2 ) and κ_2 = O( (1 + G²)β ). Theorem 2.3. Under Assumptions 1-3 and 5 in Section 1.2, with the choice p = 1/2 and setting a_0 = b_0 = η = 1 to simplify the final bound, META-STORM ensures that
E[‖∇F(x_out)‖^{6/7}] = O( (Q_1 + Q_2 log^{6/7}(1 + σ²T)) · (1/T^{3/7} + σ^{2/7}/T^{2/7}) ),
where
Q_1 = O( (F(x_1) - F*)^{6/7} + (σσ̃)^{6/7} + σ̃^{12/7} + σ̃^{18/7} + (1 + σ̃^{18/7}) β^{6/7} log^{6/7}((β + σ̃³)/β) ),
Q_2 = O( (1 + σ̃^{18/7}) β^{6/7} ).
Combining Theorem 2.3 and Markov's inequality, we also have the following corollary. Corollary 2.4. Under the same setting as in Theorem 2.3, for any 0 < δ < 1, with probability at least 1 - δ,
‖∇F(x_out)‖ ≤ O( (κ_1 + κ_2 log(1 + σ²T)) / δ^{7/6} · (1/T^{1/2} + σ^{1/3}/T^{1/3}) ),
where κ_1 = O( F(x_1) - F* + σσ̃ + σ̃² + σ̃³ + κ_2 log κ_2 ) and κ_2 = O( (1 + σ̃³)β ). We emphasize that the aim of our analysis is to provide convergence in expectation or with constant probability. In particular, we state Corollaries 2.2 and 2.4 only to give a more intuitive view of the dependency on the problem parameters.
To boost the success probability and achieve a log(1/δ) dependency on the probability margin, a common approach is to perform log(1/δ) independent repetitions of the algorithm. We briefly discuss the difference between the convergence rates of the two algorithms. We note that these two rates cannot be compared directly since Assumption 4 is stronger than Assumption 5. Additionally, as pointed out in Section 1.2, we have σ̃ = O(G), and thus the term O(σ̃³) in Corollary 2.4 is O(G³), whereas Corollary 2.2 has an O(G²) term. To give intuition for why an extra higher-order term W_1 appears in Theorem 2.1 when σ = 0 compared with Theorem 2.3, we note that when σ = 0, d_t in both algorithms degenerates to ∇F(x_t). However, the coefficient a_{t+1} becomes 1 in META-STORM but not in META-STORM-SG. This discrepancy leads to b_t being larger in META-STORM-SG than in META-STORM; moreover, the META-STORM b_t becomes exactly the step size used in AdaGrad. Due to the larger b_t when σ = 0, it is reasonable to expect a slower convergence rate for META-STORM-SG. The appearance of the term W_1 reflects this.
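The repetition trick mentioned above can be sketched as follows; `run_once` and the gradient-norm estimator are hypothetical stand-ins for a single execution of the algorithm and a validation procedure:

```python
import numpy as np

# Sketch of the log(1/delta) repetition trick: run k independent copies of an
# algorithm whose output is good with constant probability, then return the
# candidate with the smallest (estimated) gradient norm. `run_once` and
# `estimate_grad_norm` are hypothetical stand-ins supplied by the caller.

def boost(run_once, estimate_grad_norm, delta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    k = max(1, int(np.ceil(np.log(1.0 / delta))))  # O(log(1/delta)) repeats
    candidates = [run_once(rng) for _ in range(k)]
    return min(candidates, key=estimate_grad_norm)
```

Since each repetition fails with constant probability, the chance that all k fail decays exponentially in k, which yields the log(1/δ) dependence.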

3. OVERVIEW OF MAIN IDEAS AND ANALYSIS

In this section, we give an overview of our novel analysis framework. We first present a basic non-adaptive algorithm and its analysis to motivate the algorithmic choices made by our adaptive algorithms. We then discuss how to turn the non-adaptive algorithm into an adaptive one. Section D in the appendix gives a proof sketch of Theorem 2.3 for the special case p = 1/2 that illustrates the main ideas used in the analyses of all of our algorithms. We give the complete analyses in the appendix.

3.1. NON-ADAPTIVE ALGORITHM

As a warm-up towards our fully adaptive algorithms and their analysis, we start with a basic non-adaptive algorithm and analysis that will guide our algorithmic choices and provide intuition for our analysis. The algorithm instantiates the STORM template with fixed choices a_t = a and b_t/η = b for the momentum and step size. In the following, we outline an analysis for the algorithm and derive appropriate choices for the values a and b. Algorithm: As noted above, the algorithm performs the updates
x_{t+1} = x_t - (1/b) d_t;   d_{t+1} = ∇f(x_{t+1}, ξ_{t+1}) + (1 - a)(d_t - ∇f(x_t, ξ_{t+1})).
For simplicity, we assume d_1 = ∇F(x_1). Alternatively, one can use a standard mini-batch setting, d_1 = (1/m) Σ_{i=1}^m ∇f(x_1; ξ_i), with m chosen so that the variance is small, as in previous non-adaptive analyses (Fang et al., 2018; Zhou et al., 2018; Tran-Dinh et al., 2019).
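As a quick sanity check on this recursion, note that with a noiseless oracle (σ = 0) the update preserves d_t = ∇F(x_t) exactly, since d_1 = ∇F(x_1) and the correction term cancels. The toy objective F(x) = ½‖x‖² below is our own illustrative choice:

```python
import numpy as np

# Sanity check: with a noiseless oracle (sigma = 0), the recursion
# d_{t+1} = grad F(x_{t+1}) + (1 - a)(d_t - grad F(x_t)) keeps d_t equal to
# the exact gradient for every t, because d_1 = grad F(x_1) makes eps_1 = 0.
# Toy objective (our choice): F(x) = 0.5 * ||x||^2, so grad F(x) = x, beta = 1.

gradF = lambda x: x
a, b = 0.5, 4.0                     # fixed momentum and step size (b >= 2*beta)
x = np.array([3.0, -2.0])
d = gradF(x)                        # d_1 = grad F(x_1)
for t in range(50):
    x_new = x - d / b               # x_{t+1} = x_t - (1/b) d_t
    d = gradF(x_new) + (1 - a) * (d - gradF(x))   # noiseless STORM update
    x = x_new
    assert np.allclose(d, gradF(x))  # stochastic error stays exactly zero
```

With d_t exact, the iteration is plain gradient descent and contracts by (1 - 1/b) per step on this quadratic.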

Key idea:

We start by introducing some convenient notation. Let ε_t = d_t - ∇F(x_t) be the stochastic error (in particular, ε_1 = 0), and define
H_t := Σ_{i=1}^t ‖∇F(x_i)‖²,   D_t := Σ_{i=1}^t ‖d_i‖²,   E_t := Σ_{i=1}^t ‖ε_i‖².
Bounding D_T: Starting from the function value analysis, using smoothness, the update rule x_{t+1} = x_t - (1/b) d_t, the definition of ε_t = d_t - ∇F(x_t), and Cauchy-Schwarz, we obtain
F(x_{t+1}) - F(x_t) ≤ ⟨∇F(x_t), x_{t+1} - x_t⟩ + (β/2)‖x_{t+1} - x_t‖²
= -(1/b)⟨∇F(x_t), d_t⟩ + (β/(2b²))‖d_t‖²
= -(1/b)‖d_t‖² + (1/b)⟨ε_t, d_t⟩ + (β/(2b²))‖d_t‖²
≤ -(1/(2b))‖d_t‖² + (1/(2b))‖ε_t‖² + (β/(2b²))‖d_t‖².
Suppose that we choose b so that b ≥ 2β, which ensures that β/(2b²) ≤ 1/(4b). By rearranging the previous inequality, summing over all iterations, and taking expectations, we obtain
E[D_T] ≤ 4b E[F(x_1) - F(x_{T+1})] + 2E[E_T] ≤ 4b (F(x_1) - F*) + 2E[E_T].   (4)
Bounding E_T: By the standard calculation for the stochastic error ε_t used in STORM, we have
E[‖ε_{t+1}‖²] ≤ (1 - a)² E[‖ε_t‖²] + 2(1 - a)² (β²/b²) E[‖d_t‖²] + 2a²σ².
Summing over all iterations, rearranging, and using that a ∈ (0, 1] and ε_1 = 0, we obtain
E[E_T] ≤ (1/(1 - (1 - a)²)) (2(1 - a)² (β²/b²) E[D_T] + 2a²σ²T) ≤ (2β²/(ab²)) E[D_T] + 2aσ²T.   (5)
By combining inequalities (4) and (5), we obtain
E[D_T] ≤ 4b (F(x_1) - F*) + (4β²/(ab²)) E[D_T] + 4aσ²T;   (6)
E[E_T] ≤ (8β²/(ab)) (F(x_1) - F*) + (4β²/(ab²)) E[E_T] + 2aσ²T.   (7)
Ideal non-adaptive choices for a, b: Here, we set a and b to optimize the overall bound, and obtain choices that depend on the problem parameters. In the next section, we build upon these choices to obtain adaptive algorithms that use only the stochastic gradient information received by the algorithm. We observe that (6) and (7) bound E[D_T] and E[E_T] in terms of themselves, and the coefficient on the right-hand side is 4β²/(ab²). Suppose that we set a so that this coefficient is 1/2, i.e., we set a = 8β²/b², so that 4β²/(ab²) = 1/2 (note that this requires setting b ≥ 2√2 β, so that a ≤ 1).
By plugging this choice into (6) and (7), we obtain
E[D_T], E[E_T] ≤ O( b (F(x_1) - F*) + β²σ²T/b² ).
The best choice for b is the one that balances the two terms above: b = Θ( (β²σ²T/(F(x_1) - F*))^{1/3} ). Since we also need b ≥ Ω(β), we can set b to the sum of the two. Hence, we obtain
a = Θ(β²/b²) = Θ( 1 / (1 + (β(F(x_1) - F*))^{-2/3} (σ²T)^{2/3}) );   (8)
b = Θ( β + β^{2/3} (F(x_1) - F*)^{-1/3} (σ²T)^{1/3} );   (9)
E[D_T], E[E_T], E[H_T] ≤ O( β (F(x_1) - F*) + (β (F(x_1) - F*))^{2/3} (σ²T)^{1/3} ).   (10)
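A small numeric check of this balancing argument (the specific parameter values below are arbitrary choices of ours):

```python
import numpy as np

# Numeric check of the ideal non-adaptive choices: with Delta = F(x_1) - F*,
# b = 2*sqrt(2)*beta + (beta^2 sigma^2 T / Delta)^(1/3) balances b*Delta
# against beta^2 sigma^2 T / b^2 while keeping b >= 2*sqrt(2)*beta, and
# a = 8 beta^2 / b^2 makes the self-referential coefficient 4 beta^2/(a b^2)
# equal to exactly 1/2.

def ideal_choices(beta, sigma, T, Delta):
    b = 2 * np.sqrt(2) * beta + (beta**2 * sigma**2 * T / Delta) ** (1 / 3)
    a = 8 * beta**2 / b**2          # <= 1 because b >= 2*sqrt(2)*beta
    return a, b

a, b = ideal_choices(beta=2.0, sigma=1.0, T=10_000, Delta=5.0)
```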

3.2. ADAPTIVE ALGORITHM

In this section, we build on the non-adaptive algorithm and its analysis from the previous section. We first motivate the algorithmic choices made by our algorithm via a thought experiment where we pretend that H_T, D_T, E_T are deterministic quantities. Towards adaptive algorithms: To develop an adaptive algorithm, we would like to pick a, b without an explicit dependence on the problem parameters, using only quantities that the algorithm can track. We break this down by first considering choices that do not depend on β but may depend on σ, and then removing the dependency on σ. As a thought experiment, let us pretend that H_T, D_T, E_T are deterministic quantities. A natural choice for a that mirrors the non-adaptive choice (8) is a = (1 + σ²T)^{-2/3}. Since we are pretending that D_T is a deterministic quantity, we can set b by inspecting (5):
E_T ≤ (2β²/(ab²)) D_T + 2aσ²T.
If we set b = D_T^{1/2}/a^{1/4}, we ensure that D_T cancels and we obtain the desired upper bound on E_T. More precisely, by plugging a = (1 + σ²T)^{-2/3} and b = D_T^{1/2}/a^{1/4} into (5), we obtain
E_T ≤ (2β²/(a^{1/2} D_T)) D_T + 2aσ²T ≤ O( β²(1 + σ²T)^{1/3} + (1 + σ²T)^{1/3} ).
We now consider two cases for D_T. If D_T ≤ 16β²(1 + σ²T)^{1/3}, the above inequality together with H_T ≤ 2D_T + 2E_T implies that H_T ≤ O( (1 + β²)(1 + σ²T)^{1/3} ). Otherwise, we have D_T ≥ 16β²(1 + σ²T)^{1/3} and thus ab² ≥ 16β². Plugging into (6), we obtain
D_T ≤ O( D_T^{1/2} (1 + σ²T)^{1/6} (F(x_1) - F*) + (1 + σ²T)^{1/3} ),
which solves to D_T ≤ O( (1 + σ²T)^{1/3} (F(x_1) - F*)² ). We can again bound H_T using H_T ≤ 2D_T + 2E_T. In both cases, we have the bound
H_T ≤ O( (1 + β² + (F(x_1) - F*)²) (1 + σ²T)^{1/3} ).
We now turn to removing the dependency on σ²T in a. The algorithm can also track H̃_T := Σ_{t=1}^T ‖∇f(x_t; ξ_t) - ∇f(x_t; ξ_{t+1})‖², which can be viewed as a proxy for σ²T.
Replacing σ²T by this proxy and making a and b time-dependent gives the update rules employed by our algorithm in the special case p = 1/2. Our update rule for general p follows from a similar thought experiment. Analysis: Using a similar approach as in the non-adaptive analysis, we can turn the above argument into a rigorous analysis. In the appendix, we give the complete analysis, as well as a proof sketch in Section D that gives an overview of our main analysis techniques.

4. EXPERIMENTS

We examine the empirical performance of our methods against the previous work STORM+ (Levy et al., 2021) and popular algorithms (Adam, AdamW, AdaGrad, and SGD) on three tasks: (1) image classification on the CIFAR10 dataset (Krizhevsky et al., 2009) using ResNet18 (He et al., 2016) models; (2) masked language modeling via the BERT pretraining loss (Devlin et al., 2018) on the IMDB dataset (Maas et al., 2011) using DistilBERT models (Sanh et al., 2019), where we employ the standard cross-entropy loss for MLM fine-tuning (with whole-word masking and fixed test masks) with maximum length 128; and (3) sentiment analysis on the SST2 dataset (Socher et al., 2013) via fine-tuning BERT models (Devlin et al., 2018). We use the standard train/validation split and run all algorithms for 4 epochs. We use the default implementations of AdaGrad, Adam, AdamW, and SGD from PyTorch. For STORM+, we follow the authors' original implementation. We give the complete implementation details and tables of hyperparameters for all algorithms in Section B.1 of the appendix. Heuristics. For our algorithms and STORM+, we further examine whether heuristics like an exponential moving average (EMA) of the gradient sums (often called online moment estimation) and per-coordinate updates are beneficial. The versions with heuristics are denoted (H) in the results below. This is discussed in full detail in Section B.1.1 of the appendix. Results. We perform our experiments on the standard train/test splits of each dataset. We tune for the best learning rate across a fixed grid for all algorithms and perform each run 5 times. For readability, we omit error bars in the plots. Full plots with error bars and tabular results with standard deviations, as well as further discussion, are presented in Section B.2 of the appendix. 1. CIFAR10 (Figure 1). Overall, META-STORM-SG achieves the lowest training loss, with META-STORM and STORM+ coming in close.
META-STORM with heuristics attains the best test accuracy, with Adam coming in close. 2. IMDB (Figure 2). AdamW attains the best training loss. However, META-STORM with heuristics achieves the best test loss (with AdamW coming in close). META-STORM-SG and the heuristic algorithms outperform STORM+ in minimizing both training loss and test loss. 3. SST2 (Figure 3). META-STORM with heuristics attains the best training loss and accuracy, above Adam and AdamW. It also achieves the best validation accuracy of all the algorithms. Furthermore, non-heuristic META-STORM and META-STORM-SG outperform STORM+. We remark that STORM+ appears to be rather unstable on this task, as some of the random runs do not converge to good stationary points.

5. CONCLUSION

In this paper, we propose META-STORM-SG and META-STORM, two fully-adaptive momentum-based variance-reduced SGD frameworks that generalize STORM+ and remove its restrictive bounded function values assumption. META-STORM and its sibling META-STORM-SG attain the optimal convergence rate with better dependency on the problem parameters than previous methods and allow for a wider range of configurations. Experiments demonstrate our algorithms' effectiveness across common deep learning tasks against the previous work STORM+; when heuristics are further added, they achieve competitive performance against state-of-the-art algorithms. Reproducibility Statement. We include the full proofs of all theorems in the appendix. For our experiments, full implementation details, including hyperparameter selection and algorithm development, are included in Section B of the appendix. We also make our source code available.

A APPENDIX OUTLINE

The appendix is organized as follows.
• Section B presents the full implementation details for our algorithms and the hyperparameters used. This section also includes additional ablation studies and experiments.
• Section C introduces the notation used in the analysis of our algorithms.
• Section D presents the proof sketch of Theorem 2.3.
• Section E establishes some basic results that are used in our full analysis.
• Section F gives the analysis of META-STORM for general p.
• Section G gives the analysis of META-STORM-SG for general p.
• Section H introduces META-STORM-NA and gives its analysis for general p.
• Section I gives several basic inequalities that are used in our analysis.

B EXPERIMENTAL DETAILS AND ADDITIONAL EXPERIMENTS

In this section, we present the complete implementation details along with the full experimental setup. All of our experiments were conducted on two NVIDIA RTX 3090 GPUs.

B.1 IMPLEMENTATION DETAILS AND HYPERPARAMETER TUNING

In this section, we present the full implementation details of the heuristic versions, the parameter selection, and the hyperparameter tuning for all 3 datasets.

B.1.1 HEURISTICS VERSIONS OF META-STORM AND META-STORM-SG

Algorithm 3 Heuristic update of META-STORM and META-STORM-SG
  G_t = αG_{t-1} + (1 - α)(∇f(x_t, ξ_t) - ∇f(x_t, ξ_{t+1}))²   [META-STORM (H)]
  G_t = αG_{t-1} + (1 - α)(∇f(x_t, ξ_t))²                      [META-STORM-SG (H)]
  a_{t+1} = (1 + G_t/a_0²)^{-2/3}
  D_t = αD_{t-1} + (1 - α) d_t²
  b_t = (b_0^{1/p} + D_t)^p / a_t^q       [META-STORM (H)]
  b_t = (b_0^{1/p} + D_t)^p / a_{t+1}^q   [META-STORM-SG (H)]
  x_{t+1} = x_t - η d_t / b_t
  d_{t+1} = ∇f(x_{t+1}, ξ_{t+1}) + (1 - a_{t+1})(d_t - ∇f(x_t, ξ_{t+1}))

Algorithm 4 Heuristic update of STORM+
  G_t = αG_{t-1} + (1 - α)(∇f(x_t, ξ_t))²
  a_{t+1} = (1 + G_t/a_0)^{-2/3}
  D_t = αD_{t-1} + (1 - α) d_t²/a_{t+1}
  b_t = (b_0 + D_t)^{1/3}
  x_{t+1} = x_t - η d_t / b_t
  d_{t+1} = ∇f(x_{t+1}, ξ_{t+1}) + (1 - a_{t+1})(d_t - ∇f(x_t, ξ_{t+1}))

For our algorithms, we employ the common heuristic of using an exponential moving average (EMA) scheme in the momentum and the step size. We also perform per-coordinate updates instead of simply using the norm. With this, the update x_{t+1} = x_t - η d_t/b_t becomes a coordinate-wise division, with the update rules given in Algorithm 3, where all operations between vectors are coordinate-wise multiplication, exponentiation, and division. In our experiments, we set α = 0.99, a_0 = 1, b_0 = 10^{-8}, as selected by the criterion detailed next. Similarly, we examine the same heuristics on STORM+ via the analogous implementation in Algorithm 4.
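A single step of the META-STORM (H) update in Algorithm 3 can be sketched in NumPy as follows; the oracle `grad` and the two fresh calls that play the roles of ∇f(x_t, ξ_t) and ∇f(x_t, ξ_{t+1}) are hypothetical stand-ins of ours:

```python
import numpy as np

# Sketch of one META-STORM (H) iteration (Algorithm 3): EMA accumulators G, D
# and a per-coordinate step size. All vector operations are elementwise.

def meta_storm_h_step(x, d, a, G, D, grad, rng,
                      alpha=0.99, a0=1.0, b0=1e-8, p=0.5, eta=1e-3):
    q = (1.0 - p) / 2.0                              # enforce p + 2q = 1
    g_t, g_t2 = grad(x, rng), grad(x, rng)           # fresh samples at x_t
    G = alpha * G + (1 - alpha) * (g_t - g_t2) ** 2  # EMA of squared grad diffs
    a_next = (1.0 + G / a0 ** 2) ** (-2.0 / 3.0)     # momentum a_{t+1}
    D = alpha * D + (1 - alpha) * d ** 2             # EMA of d_t^2
    b = (b0 ** (1.0 / p) + D) ** p / a ** q          # META-STORM (H): uses a_t
    x_new = x - eta * d / b                          # per-coordinate division
    d_new = grad(x_new, rng) + (1.0 - a_next) * (d - g_t2)
    return x_new, d_new, a_next, G, D
```

The caller threads the state (x, d, a, G, D) through successive calls; with a noiseless oracle the two fresh samples coincide, G stays zero, and a_{t+1} remains 1.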

B.1.2 ALGORITHM DEVELOPMENT AND DEFAULT PARAMETERS SELECTION

We develop our algorithm on MNIST and tune for p, a_0, and b_0. For a_0, we tune on MNIST across a range of values from 1 to 10^8 and find that larger values of a_0 are helpful. For b_0, we simply need a small number for numerical stability, so we pick 10^{-8}. For the heuristic versions of our algorithms, a_0 = 1 gives the best results. This might be due to the per-coordinate operations removing the need to scale down the gradient-accumulated step size. Effects of varying p. In Figures 4 and 5, we show the training loss and test accuracy for different values of p of our algorithms on MNIST (with a_0 = 10^8 and b_0 = 10^{-8}). For each configuration, we tune the base learning rate η across {10^{-3}, 10^{-2}, 10^{-1}, 1, 10}. The results suggest that lower values of p tend to perform better. While p = 1/3 has comparable performance to the lowest setting of p, this choice is somewhat analogous to STORM+. Hence, we select the lowest possible value of p for our algorithms in the subsequent experiments (p = 0.20 for META-STORM and p = 0.25 for META-STORM-SG).

Algorithm            p     a_0                    b_0
META-STORM           0.20  a_0² = 10^8            b_0^{1/p} = 10^{-8}
META-STORM-SG        0.25  a_0² = 10^8            b_0^{1/p} = 10^{-8}
META-STORM (H)       0.50  a_0² = 1               b_0^{1/p} = 10^{-8}
META-STORM-SG (H)    0.50  a_0² = 1               b_0^{1/p} = 10^{-8}
STORM+               N/A   a_0 = # of parameters  b_0 = 1
STORM+ (H)           N/A   a_0 = 1                b_0 = 10^{-8}

The discussion above leads to the default choice of a_0 = 10^8 and b_0 = 10^{-8} for our algorithms, with p = 0.20 for META-STORM and p = 0.25 for META-STORM-SG, on the benchmarks in this section. For the heuristic versions of META-STORM, we use p = 0.50, a_0 = 1, and b_0 = 10^{-8}. These versions with heuristics are further denoted (H) in our results below. For STORM+, we use the original authors' implementation of setting a_0 to the number of parameters of the model (which is roughly 10^8 for ResNet18, for example) and b_0 = 1.
For the other baseline algorithms, we use the default parameters from the PyTorch implementations.

Hyperparameter tuning. For all algorithms, we tune only the learning rate and use the default values for the other parameters. For learning-rate tuning, we perform a grid search across $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$ for CIFAR10 and IMDB, and across $\{10^{-5}, 2 \times 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$ for SST2 ($2 \times 10^{-5}$ is the default learning rate for AdamW on SST2, and the extra value is also practical since SST2 is a smaller dataset). For Adam on IMDB, no learning rate in our grid is small enough to converge, requiring additional tuning to obtain a decreasing training loss. Table 3 lists the selected learning rate for each algorithm across the datasets. After obtaining the best learning rate, we additionally run each algorithm across 5 different seeds to obtain error bars.

Table 3: Selected learning rates across datasets.

Algorithm            CIFAR10     IMDB        SST2
META-STORM           1           $10^{-2}$   $10^{-2}$
META-STORM-SG        1           $10^{-1}$   $10^{-2}$
META-STORM (H)       $10^{-3}$   $10^{-4}$   $2 \cdot 10^{-5}$
META-STORM-SG (H)    $10^{-3}$   $10^{-4}$   $2 \cdot 10^{-5}$
STORM+               0.1         $10^{-2}$   $10^{-2}$
STORM+ (H)           $10^{-2}$   $10^{-3}$   $10^{-4}$
Adam                 $10^{-3}$   $10^{-6}$   $10^{-5}$
AdamW                N/A         $10^{-4}$   $10^{-5}$
Adagrad              $10^{-3}$   $10^{-3}$   $10^{-4}$
SGD                  $10^{-3}$   $10^{-2}$   $10^{-3}$
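The tuning protocol above amounts to a one-dimensional grid search over the base learning rate. A minimal sketch, where the `train_and_eval` callback and its toy stand-in are hypothetical:

```python
import math

def tune_learning_rate(train_and_eval, grid=(1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0)):
    # Grid search over the base learning rate only, as in the protocol above;
    # train_and_eval(lr) -> validation metric (higher is better).
    scores = {lr: train_and_eval(lr) for lr in grid}
    best = max(scores, key=scores.get)
    return best, scores

# Toy stand-in for a full training run (an assumption for illustration):
# the metric peaks at lr = 1e-3.
best_lr, all_scores = tune_learning_rate(lambda lr: -abs(math.log10(lr) + 3.0))
```

In practice each `train_and_eval(lr)` call is a full training run, so the grid is kept small and only the best learning rate is then re-run across 5 seeds.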

B.2 FULL RESULTS FOR EXPERIMENTS IN SECTION 4 AND ADDITIONAL EXPERIMENTS

In this section, we show complete plots and tabular results, along with more detailed discussions, for our experiments. The reader should note that STORM-based methods require twice the amount of oracle access compared to the baselines. The plots show the average across 5 seeds along with min/max bars. The tables show the average across 5 seeds at a range of selected epochs, with one standard deviation included at the last epoch. In the plots and tables below, (H) denotes the version of an algorithm with the heuristics (EMA and per-coordinate update) employed.

B.2.1 CIFAR10: RESULTS AND DISCUSSIONS

Figure 8 shows all four plots of the main experiments in Section 4. Tables. Tables 4 and 5 show the training loss and accuracy for CIFAR10. Tables 6 and 7 show the test loss and accuracy for CIFAR10. To further study the generalization gap among different algorithms, Table 8 reports the generalization gap of each algorithm. META-STORM with heuristics and Adam achieve the smallest gap among all the algorithms. For our algorithms, the versions with heuristics exhibit a smaller generalization gap than the versions without the heuristics, while STORM+ lies in between. Interestingly, Adagrad and SGD exhibit larger generalization gaps.

B.2.2 IMDB: RESULTS AND DISCUSSIONS

Figure 2 from Section 4 shows the train and test loss of the algorithms used. We include Figure 9 here, which adds error bars across 5 random seeds. Tables. Tables 9 and 10 show the train and test loss for our experiments.

B.2.3 SST2: RESULTS AND DISCUSSIONS

Figure 10 shows all four plots of the main experiments for SST2. Tables. Tables 11 and 12 present the training loss and accuracy for the SST2 experiments. Tables 13 and 14 show the validation loss and accuracy for the SST2 experiments.

C.1 ASSUMPTIONS

We recall the assumptions in Section 1.2 that we rely on:

1. Lower bounded function value: $F^* := \inf_{x\in\mathbb{R}^d} F(x) > -\infty$.
2. Unbiased estimator with bounded variance: we assume access to $\nabla f(x,\xi)$ satisfying $\mathbb{E}_\xi[\nabla f(x,\xi)] = \nabla F(x)$ and $\mathbb{E}_\xi\|\nabla f(x,\xi) - \nabla F(x)\|^2 \le \sigma^2$ for some $\sigma \ge 0$.
3. Averaged $\beta$-smoothness: $\mathbb{E}_\xi\|\nabla f(x,\xi) - \nabla f(y,\xi)\|^2 \le \beta^2\|x-y\|^2$ for all $x, y \in \mathbb{R}^d$.
4. Bounded stochastic gradients: $\|\nabla f(x,\xi)\| \le G$ for all $x \in \mathbb{R}^d$, $\xi \in \mathrm{support}(D)$, for some $G \ge 0$.
5. Bounded stochastic gradient differences: $\|\nabla f(x,\xi) - \nabla f(x,\xi')\| \le 2\tilde\sigma$ for all $x \in \mathbb{R}^d$, $\xi, \xi' \in \mathrm{support}(D)$, for some $\tilde\sigma \ge 0$.

We remind the reader that $\tilde\sigma = O(\sigma)$ and $\tilde\sigma = O(G)$.

C.2 NOTATIONS

In the analysis below, we employ the following notation, where $\epsilon_t := d_t - \nabla F(x_t)$ denotes the stochastic error:

$$\beta_{\max} := \max\{\beta, 1\}; \quad D_t := \sum_{i=1}^{t}\|d_i\|^2; \quad E_{t,s} := \sum_{i=1}^{t} a_{i+1}^{s}\|\epsilon_i\|^2; \quad H_t := \sum_{i=1}^{t}\|\nabla F(x_i)\|^2;$$
$$\widehat{H}_t := \sum_{i=1}^{t}\|\nabla f(x_i,\xi_i)\|^2; \quad \widetilde{H}_t := \sum_{i=1}^{t}\|\nabla f(x_i,\xi_i) - \nabla f(x_i,\xi_{i+1})\|^2.$$

We will also write $E_t := E_{t,0} = \sum_{i=1}^{t}\|\epsilon_i\|^2$. We denote by $\mathcal{F}_t = \sigma(\xi_i, 1 \le i \le t)$ the sigma algebra generated by the first $t$ samples. Besides, we define $0^0 := 1$. In Section I, we list and prove all inequalities used in the subsequent proofs.

D PROOF SKETCH FOR THEOREM 2.3

In this section, to give an overview of the proof techniques, we present the proof sketch of Theorem 2.3 for the special case $p = \frac12$. For simplicity, we assume $\beta \ge 1$ to simplify the notation. The analysis of the fully adaptive algorithms follows a similar approach to the non-adaptive analysis given in Section 3. As before, toward our final goal of bounding $\|\nabla F(x_{\mathrm{out}})\|$, we will translate to $H_T$ and upper bound it via $D_T$ and $E_T$.

Bounding $E_T$: As in existing VR algorithms, we need to calculate how the stochastic error $\epsilon_t$ changes with each iteration. By a standard calculation, we obtain

$$a_{t+1}\|\epsilon_t\|^2 \le \|\epsilon_t\|^2 - \|\epsilon_{t+1}\|^2 + 2\|Z_{t+1}\|^2 + 2a_{t+1}^2\|\nabla f(x_{t+1},\xi_{t+1}) - \nabla F(x_{t+1})\|^2 + M_{t+1} \quad (11)$$

where

$$Z_{t+1} = \nabla f(x_{t+1},\xi_{t+1}) - \nabla f(x_t,\xi_{t+1}) - \nabla F(x_{t+1}) + \nabla F(x_t);$$
$$M_{t+1} = 2(1-a_{t+1})^2\langle\epsilon_t, Z_{t+1}\rangle + 2(1-a_{t+1})a_{t+1}\langle\epsilon_t, \nabla f(x_{t+1},\xi_{t+1}) - \nabla F(x_{t+1})\rangle.$$

We note that, in META-STORM, $a_{t+1} \in \mathcal{F}_{t+1}$, which implies $\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] \ne 0$. This extra term $M_{t+1}$ makes our analysis more challenging compared with previous works. Now, we highlight some challenges and point out how to solve them:

CHALLENGE 1. How to obtain a term as close to $E_T$ as possible with a proper upper bound? On the L.H.S. of (11), an extra coefficient $a_{t+1}$ appears in front of $\|\epsilon_t\|^2$. A straightforward option is to divide both sides by $a_{t+1}$ and then sum up to get $E_T$. However, if we do so, the following problem arises. Let us focus on the term $\|Z_{t+1}\|^2/a_{t+1}$.
The averaged $\beta$-smoothness assumption gives $\mathbb{E}[\|Z_{t+1}\|^2 \mid \mathcal{F}_t] \le \eta^2\beta^2\|d_t\|^2/b_t^2$. However, we cannot apply this result to $\|Z_{t+1}\|^2/a_{t+1}$, since $a_{t+1} \in \mathcal{F}_{t+1}$ as noted above. If we temporarily assume $a_{t+1}^{-1} \le c\, a_t^{-1}$ for some constant $c$ (we can expect this because the change from $a_t$ to $a_{t+1}$ is not too large, due to the bounded differences assumption), we get $\mathbb{E}[a_{t+1}^{-1}\|Z_{t+1}\|^2 \mid \mathcal{F}_t] \le \mathbb{E}[c\, a_t^{-1}\|Z_{t+1}\|^2 \mid \mathcal{F}_t] \le c\,\eta^2\beta^2\frac{\|d_t\|^2}{a_t b_t^2}$. If we plug in the update rule $b_t = (b_0^2 + D_t)^{1/2}/a_t^{1/4}$, then we obtain $\mathbb{E}[\|Z_{t+1}\|^2/a_{t+1} \mid \mathcal{F}_t] \le c\,\eta^2\beta^2 a_t^{-1/2}\frac{\|d_t\|^2}{b_0^2 + D_t}$. It can be shown that $\sum_{t=1}^T \frac{\|d_t\|^2}{b_0^2 + D_t}$ can be upper bounded by $\log(1 + D_T/b_0^2)$, but now we still have the extra $a_t^{-1/2}$ coefficient. To remove it, it is reasonable to divide both sides of (11) by $a_{t+1}^{1/2}$ rather than $a_{t+1}$.

CHALLENGE 2. How to get rid of the term involving $M_{t+1}$? As discussed in Challenge 1, we want to divide both sides by $a_{t+1}^{1/2}$, so we now focus on the term $a_{t+1}^{-1/2}M_{t+1}$. Again, due to $a_{t+1} \in \mathcal{F}_{t+1}$, $\mathbb{E}[a_{t+1}^{-1/2}M_{t+1} \mid \mathcal{F}_t] \ne 0$. An important observation here is that, if we replace $a_{t+1}$ by $a_t$ in $M_{t+1}$, we obtain a martingale difference sequence. Formally, we define

$$N_{t+1} = 2(1-a_t)^2\langle\epsilon_t, Z_{t+1}\rangle + 2(1-a_t)a_t\langle\epsilon_t, \nabla f(x_{t+1},\xi_{t+1}) - \nabla F(x_{t+1})\rangle.$$

Then $\mathbb{E}[N_{t+1} \mid \mathcal{F}_t]$ and $\mathbb{E}[a_t^{-1/2}N_{t+1} \mid \mathcal{F}_t]$ are both $0$. This observation tells us that, in order to bound $\mathbb{E}[\sum_{t=1}^T a_{t+1}^{-1/2}M_{t+1}]$, it suffices to bound $\mathbb{E}[\sum_{t=1}^T a_{t+1}^{-1/2}M_{t+1} - a_t^{-1/2}N_{t+1}]$. Using the Cauchy-Schwarz inequality, we show that the term $\sum_{t=1}^T a_{t+1}^{-1/2}M_{t+1} - a_t^{-1/2}N_{t+1}$ can be bounded by terms related to $\sum_{t=1}^T (a_{t+1}^{-1/2} - a_t^{-1/2})\|\epsilon_t\|^2$, $\sum_{t=1}^T a_{t+1}^{-1/2}\|Z_{t+1}\|^2$, and $\sum_{t=1}^T a_t^{3/2}\|\nabla f(x_{t+1},\xi_{t+1}) - \nabla F(x_{t+1})\|^2$. We then bound these latter terms in turn, eliminating the term involving $M_{t+1}$.
After overcoming the two challenges above, we can finally show the following inequality, where $K_1, K_2, K_4$ are constants that depend only on $\sigma, \tilde\sigma, \beta, a_0, b_0, \eta$ and are independent of $T$:

$$\mathbb{E}\big[a_{T+1}^{1/2}E_T\big] \le \mathbb{E}\big[E_{T,1/2}\big] \le K_1 + K_2\,\mathbb{E}\big[\log(1 + \widetilde{H}_T/a_0^2)\big] + K_4\,\mathbb{E}\big[\log(1 + D_T/b_0^2)\big]. \quad (12)$$

Bounding $D_T$: By following the standard non-adaptive analysis via smoothness, we obtain

$$F(x_{t+1}) \le F(x_t) - \frac{\eta}{b_t}\langle\nabla F(x_t), d_t\rangle + \frac{\eta^2\beta}{2b_t^2}\|d_t\|^2. \quad (13)$$

Here we proceed similarly to the non-adaptive analysis from Section 3.1, but start to diverge from the analysis approach used in STORM+. The STORM+ analysis proceeds by splitting $-\langle\nabla F(x_t), d_t\rangle = -\|\nabla F(x_t)\|^2 - \langle\nabla F(x_t), \epsilon_t\rangle \le -\frac12\|\nabla F(x_t)\|^2 + \frac12\|\epsilon_t\|^2$, multiplying both sides of (13) by $b_t/\eta$, and summing up over all iterations. This gives the following upper bound on $H_T$:

$$H_T = \sum_{t=1}^T\|\nabla F(x_t)\|^2 \le \sum_{t=1}^T\frac{2}{\eta}\big(F(x_t) - F(x_{t+1})\big)\,b_t + \sum_{t=1}^T\|\epsilon_t\|^2 + \eta\beta\sum_{t=1}^T\frac{\|d_t\|^2}{b_t}.$$

This analysis requires $F(x)$ to be bounded so that the sum $\sum_{t=1}^T\frac{2}{\eta}(F(x_t) - F(x_{t+1}))\,b_t$ can telescope. To remove this assumption, we go back to (13), split $-\langle\nabla F(x_t), d_t\rangle = -\|d_t\|^2 + \langle\epsilon_t, d_t\rangle$, and upper bound the inner product via the Cauchy-Schwarz inequality and the inequality $ab \le \frac{\gamma}{2}a^2 + \frac{1}{2\gamma}b^2$, which holds for any $\gamma > 0$:

$$\langle\epsilon_t, d_t\rangle \le \|\epsilon_t\|\,\|d_t\| \le \frac{\lambda a_{t+1}^{1/2}b_t}{2\eta\beta}\|\epsilon_t\|^2 + \frac{\eta\beta}{2\lambda a_{t+1}^{1/2}b_t}\|d_t\|^2,$$

where $\lambda > 0$ is a constant (setting $\lambda$ based on $\tilde\sigma$ yields the best dependence on $\tilde\sigma$). We note that this choice will need a bound on $\mathbb{E}[E_{T,1/2}]$, which is provided by (12). The reason for the coefficient $a_{t+1}^{1/2}b_t$ is that it ensures a constant split if $a_t$ and $b_t$ correspond to the non-adaptive choices we derived in Section 3.1, which were set so that $a^{1/2}b = \Theta(\beta)$. We obtain

$$\mathbb{E}\Bigg[\sum_{t=1}^T\frac{\|d_t\|^2}{b_t}\Bigg] \le \frac{2}{\eta}\big(F(x_1) - F^*\big) + \underbrace{\mathbb{E}\Bigg[\sum_{t=1}^T\Big(\eta\beta + \frac{\eta\beta}{a_{t+1}^{1/2}\lambda} - b_t\Big)\frac{\|d_t\|^2}{b_t^2}\Bigg]}_{(\star)} + \underbrace{\frac{\lambda}{\eta\beta}\,\mathbb{E}\big[E_{T,1/2}\big]}_{(\star\star)}. \quad (14)$$

The term $(\star)$ can be bounded using standard techniques from the analyses of adaptive algorithms. The term $(\star\star)$ has already been bounded in the previous analysis. Now we only need to simplify the term on the L.H.S. to $D_T$.
But due to the randomness of $b_t$, this is not achievable. However, as with the first inequality in (12), we can bridge this gap by aiming for a slightly weaker inequality that bounds $D_T^{1/2}$ instead of $D_T$. More precisely, we connect the left-hand side of (14) to $D_T^{1/2}$ as follows:

$$\sum_{t=1}^T\frac{\|d_t\|^2}{b_t} \ge a_{T+1}^{1/4}\sum_{t=1}^T\frac{\|d_t\|^2}{\big(b_0^2 + \sum_{i=1}^T\|d_i\|^2\big)^{1/2}} \ge a_{T+1}^{1/4}D_T^{1/2} - b_0. \quad (15)$$

By plugging (15) into (14) and setting $\lambda$ appropriately, we can finally obtain the following upper bound:

$$\mathbb{E}\big[a_{T+1}^{1/4}D_T^{1/2}\big] \le K_5 + K_6\,\mathbb{E}\big[\log(1 + \widetilde{H}_T/a_0^2)\big] + K_7\,\mathbb{E}\Bigg[\log\frac{K_8 + K_9\big(1 + \widetilde{H}_T/a_0^2\big)^{1/3}}{b_0}\Bigg], \quad (16)$$

where $K_5, K_6, K_7, K_8, K_9$ depend only on $\sigma, \tilde\sigma, \beta, a_0, b_0, \eta$ and are independent of $T$.

Combining the bounds: The final part of the analysis is to combine (12) and (16). In contrast to the simpler non-adaptive analysis, these inequalities bound $a_{T+1}^{1/2}E_T$ and $a_{T+1}^{1/4}D_T^{1/2}$ instead of $E_T$ and $D_T$. In order to obtain an upper bound on $H_T$ via the inequality $H_T \le 2D_T + 2E_T$, we need to remove these $a_{T+1}$ factors. The key fact is that $\mathbb{E}[a_{T+1}^{-3/2}] = \mathbb{E}[1 + \widetilde{H}_T/a_0^2] = O(1 + \sigma^2 T)$ (note that this $-3/2$ is the smallest exponent $c$ for which we can upper bound $\mathbb{E}[a_{t+1}^{c}]$). Combining this result and Hölder's inequality gives us the bounds

$$\mathbb{E}\big[D_T^{3/7}\big] \le \mathbb{E}^{6/7}\big[a_{T+1}^{1/4}D_T^{1/2}\big]\,\mathbb{E}^{1/7}\big[a_{T+1}^{-3/2}\big]; \qquad \mathbb{E}\big[E_T^{3/7}\big] \le \mathbb{E}^{3/7}\big[a_{T+1}^{1/2}E_T\big]\,\mathbb{E}^{4/7}\big[a_{T+1}^{-3/8}\big] \le \mathbb{E}^{3/7}\big[a_{T+1}^{1/2}E_T\big]\,\mathbb{E}^{1/7}\big[a_{T+1}^{-3/2}\big];$$

where $3/7$ is chosen to ensure that we can finally use the bound on $\mathbb{E}[a_{T+1}^{-3/2}]$.
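The first Hölder bound above can be sanity-checked numerically; since Hölder's inequality also holds for empirical averages, the check below, with arbitrary random stand-ins for $a_{T+1}$ and $D_T$ (an illustration, not the actual algorithm quantities), passes deterministically:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
a = rng.uniform(0.05, 1.0, n)     # samples standing in for a_{T+1} in (0, 1]
D = rng.uniform(0.0, 50.0, n)     # samples standing in for D_T >= 0

# D^{3/7} = (a^{1/4} D^{1/2})^{6/7} * (a^{-3/2})^{1/7}, so Hoelder with
# exponents (7/6, 7) gives E[D^{3/7}] <= E[a^{1/4}D^{1/2}]^{6/7} E[a^{-3/2}]^{1/7}.
lhs = float(np.mean(D ** (3 / 7)))
rhs = float(np.mean(a ** 0.25 * np.sqrt(D)) ** (6 / 7) * np.mean(a ** -1.5) ** (1 / 7))
```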

E BASIC ANALYSIS

As discussed in Section 3, we aim to use $E_T$ and $D_T$ to bound $H_T$. Here, we apply this framework to give some basic results which will be used frequently in the full analysis of every algorithm. We first state the following decomposition in our analysis framework. The reason we use $p \le 1$ here is that we cannot always bound $H_T$ directly, because of the randomness of $a_t$ and $b_t$ in our algorithms.

Lemma E.1. Given $p \le 1$, we have

$$\mathbb{E}\big[H_T^p\big] \le 2^{p+1}\max\big\{\mathbb{E}[E_T^p], \mathbb{E}[D_T^p]\big\} \le 4\max\big\{\mathbb{E}[E_T^p], \mathbb{E}[D_T^p]\big\}.$$

Proof. By the definitions of $H_T$, $E_T$ and $D_T$, we have $H_T \le 2E_T + 2D_T$. Hence

$$H_T^p \le (2E_T + 2D_T)^p \overset{(a)}{\le} 2^p E_T^p + 2^p D_T^p \;\Rightarrow\; \mathbb{E}\big[H_T^p\big] \le 2^p\big(\mathbb{E}[E_T^p] + \mathbb{E}[D_T^p]\big) \le 2^{p+1}\max\big\{\mathbb{E}[E_T^p], \mathbb{E}[D_T^p]\big\} \overset{(b)}{\le} 4\max\big\{\mathbb{E}[E_T^p], \mathbb{E}[D_T^p]\big\},$$

where (a) and (b) are both due to $p \le 1$.
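A quick numeric sanity check of Lemma E.1's decomposition, with random stand-ins for the $d_t$ and $\epsilon_t$ sequences (illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
d = rng.standard_normal((100, 3))     # momentum estimates d_t
eps = rng.standard_normal((100, 3))   # stochastic errors eps_t = d_t - grad F(x_t)
grad_F = d - eps                      # implied grad F(x_t)

H_T = float((grad_F ** 2).sum())      # H_T = sum ||grad F(x_t)||^2
D_T = float((d ** 2).sum())           # D_T = sum ||d_t||^2
E_T = float((eps ** 2).sum())         # E_T = sum ||eps_t||^2
```

The assertions below check both the base inequality $H_T \le 2E_T + 2D_T$ and its $p$-th-power form used in the lemma.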

E.1 VARIANCE REDUCTION ANALYSIS FOR E T

As in all existing momentum-based VR methods, we need to analyze how the error term $\epsilon_t$ changes along the algorithm. Based on our notations, we give the following two standard lemmas.

Lemma E.2. For all $t \ge 1$, we have

$$a_{t+1}\|\epsilon_t\|^2 \le \|\epsilon_t\|^2 - \|\epsilon_{t+1}\|^2 + 2\|Z_{t+1}\|^2 + 2a_{t+1}^2\|\nabla f(x_{t+1},\xi_{t+1}) - \nabla F(x_{t+1})\|^2 + M_{t+1},$$

where

$$Z_{t+1} := \nabla f(x_{t+1},\xi_{t+1}) - \nabla f(x_t,\xi_{t+1}) - \nabla F(x_{t+1}) + \nabla F(x_t),$$
$$M_{t+1} := 2(1-a_{t+1})^2\langle\epsilon_t, Z_{t+1}\rangle + 2(1-a_{t+1})a_{t+1}\langle\epsilon_t, \nabla f(x_{t+1},\xi_{t+1}) - \nabla F(x_{t+1})\rangle.$$

Proof. Starting from the definition of $\epsilon_{t+1}$, we have

$$\begin{aligned}
\|\epsilon_{t+1}\|^2 &= \|d_{t+1} - \nabla F(x_{t+1})\|^2 = \|\nabla f(x_{t+1},\xi_{t+1}) + (1-a_{t+1})(d_t - \nabla f(x_t,\xi_{t+1})) - \nabla F(x_{t+1})\|^2 \\
&= \|(1-a_{t+1})\epsilon_t + (1-a_{t+1})Z_{t+1} + a_{t+1}(\nabla f(x_{t+1},\xi_{t+1}) - \nabla F(x_{t+1}))\|^2 \\
&= (1-a_{t+1})^2\|\epsilon_t\|^2 + \|(1-a_{t+1})Z_{t+1} + a_{t+1}(\nabla f(x_{t+1},\xi_{t+1}) - \nabla F(x_{t+1}))\|^2 + M_{t+1} \\
&\overset{(a)}{\le} (1-a_{t+1})^2\|\epsilon_t\|^2 + 2(1-a_{t+1})^2\|Z_{t+1}\|^2 + 2a_{t+1}^2\|\nabla f(x_{t+1},\xi_{t+1}) - \nabla F(x_{t+1})\|^2 + M_{t+1} \\
&\overset{(b)}{\le} (1-a_{t+1})\|\epsilon_t\|^2 + 2\|Z_{t+1}\|^2 + 2a_{t+1}^2\|\nabla f(x_{t+1},\xi_{t+1}) - \nabla F(x_{t+1})\|^2 + M_{t+1},
\end{aligned}$$

where (a) is by $\|x+y\|^2 \le 2\|x\|^2 + 2\|y\|^2$ and (b) is by $0 \le 1-a_{t+1} \le 1$. Adding $a_{t+1}\|\epsilon_t\|^2 - \|\epsilon_{t+1}\|^2$ to both sides, we get the desired result.

Lemma E.3. For all $t \ge 1$, we have $\mathbb{E}[\|Z_{t+1}\|^2 \mid \mathcal{F}_t] \le \eta^2\beta^2\|d_t\|^2/b_t^2$.

Proof. From the definition of $Z_{t+1}$, we have

$$\mathbb{E}\big[\|Z_{t+1}\|^2 \mid \mathcal{F}_t\big] = \mathbb{E}\big[\|\nabla f(x_{t+1},\xi_{t+1}) - \nabla f(x_t,\xi_{t+1}) - \nabla F(x_{t+1}) + \nabla F(x_t)\|^2 \mid \mathcal{F}_t\big] \overset{(a)}{\le} \mathbb{E}\big[\|\nabla f(x_{t+1},\xi_{t+1}) - \nabla f(x_t,\xi_{t+1})\|^2 \mid \mathcal{F}_t\big] \overset{(b)}{\le} \beta^2\|x_{t+1} - x_t\|^2 \overset{(c)}{=} \frac{\eta^2\beta^2\|d_t\|^2}{b_t^2},$$

where (a) is by $\mathbb{E}\|X - \mathbb{E}[X]\|^2 \le \mathbb{E}\|X\|^2$, (b) is by the averaged $\beta$-smoothness assumption, and (c) is by the fact that $x_{t+1} - x_t = -\frac{\eta}{b_t}d_t$.
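The decomposition of $\epsilon_{t+1}$ used in the proof of Lemma E.2 is a purely algebraic identity, which can be checked numerically with arbitrary vectors standing in for the gradients (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 4
# Arbitrary stand-ins for the vectors appearing in the proof (illustrative):
# d_t, grad F(x_t), grad F(x_{t+1}), grad f(x_{t+1},xi_{t+1}), grad f(x_t,xi_{t+1}).
d_t, gF_t, gF_t1, g_new, g_old = (rng.standard_normal(dim) for _ in range(5))
a = 0.3                                   # plays the role of a_{t+1}

d_t1 = g_new + (1 - a) * (d_t - g_old)    # recursive momentum update for d_{t+1}
eps_t = d_t - gF_t                        # eps_t = d_t - grad F(x_t)
eps_t1 = d_t1 - gF_t1                     # eps_{t+1}
Z = g_new - g_old - gF_t1 + gF_t          # Z_{t+1}
# Identity: eps_{t+1} = (1-a) eps_t + (1-a) Z_{t+1} + a (g_new - grad F(x_{t+1}))
decomp = (1 - a) * eps_t + (1 - a) * Z + a * (g_new - gF_t1)
```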

E.2 ON THE WAY TO BOUND D T

We choose to bound the term $D_T$ instead of starting from $H_T$ as done in AdaGradNorm or STORM+; the latter approach also requires the bounded function value assumption in the analysis.

Lemma E.4. For any of META-STORM-SG, META-STORM or META-STORM-NA, we have, for any $\lambda > 0$,

$$\mathbb{E}\big[a_{T+1}^{q}D_T^{1-p}\big] \le b_0^{\frac1p-1} + \frac{2}{\eta}\big(F(x_1) - F^*\big) + \mathbb{E}\Bigg[\sum_{t=1}^T\Big(\eta\beta_{\max} + \frac{\eta\beta_{\max}}{\lambda a_{t+1}^{1/2}} - b_t\Big)\frac{\|d_t\|^2}{b_t^2}\Bigg] + \frac{\lambda\,\mathbb{E}\big[E_{T,1/2}\big]}{\eta\beta_{\max}}.$$

Proof. Using smoothness, the update rule $x_{t+1} = x_t - \frac{\eta}{b_t}d_t$, and the definition of $\epsilon_t = d_t - \nabla F(x_t)$, we obtain

$$F(x_{t+1}) \le F(x_t) + \langle\nabla F(x_t), x_{t+1} - x_t\rangle + \frac{\beta}{2}\|x_{t+1} - x_t\|^2 = F(x_t) - \frac{\eta\langle\nabla F(x_t), d_t\rangle}{b_t} + \frac{\eta^2\beta}{2b_t^2}\|d_t\|^2 = F(x_t) - \frac{\eta\|d_t\|^2}{b_t} + \frac{\eta\langle\epsilon_t, d_t\rangle}{b_t} + \frac{\eta^2\beta}{2b_t^2}\|d_t\|^2.$$

First we use Cauchy-Schwarz to separate the stochastic gradient and the stochastic error terms:

$$F(x_{t+1}) \le F(x_t) - \frac{\eta\|d_t\|^2}{b_t} + \frac{\lambda_t\eta\|\epsilon_t\|^2}{2b_t} + \frac{\eta\|d_t\|^2}{2\lambda_t b_t} + \frac{\eta^2\beta}{2b_t^2}\|d_t\|^2.$$

Taking $\lambda_t = \frac{\lambda a_{t+1}^{1/2}b_t}{\eta\beta_{\max}}$ for some $\lambda > 0$, we have

$$\begin{aligned}
\frac{\eta\|d_t\|^2}{2b_t} &\le F(x_t) - F(x_{t+1}) + \Big(\frac{\eta^2\beta}{2b_t^2} + \frac{\eta}{2\lambda_t b_t} - \frac{\eta}{2b_t}\Big)\|d_t\|^2 + \frac{\lambda_t\eta\|\epsilon_t\|^2}{2b_t} \\
&= F(x_t) - F(x_{t+1}) + \Big(\frac{\eta^2\beta}{2} + \frac{\eta^2\beta_{\max}}{2a_{t+1}^{1/2}\lambda} - \frac{\eta b_t}{2}\Big)\frac{\|d_t\|^2}{b_t^2} + \frac{\lambda a_{t+1}^{1/2}\|\epsilon_t\|^2}{2\beta_{\max}} \\
&\le F(x_t) - F(x_{t+1}) + \Big(\frac{\eta^2\beta_{\max}}{2} + \frac{\eta^2\beta_{\max}}{2a_{t+1}^{1/2}\lambda} - \frac{\eta b_t}{2}\Big)\frac{\|d_t\|^2}{b_t^2} + \frac{\lambda a_{t+1}^{1/2}\|\epsilon_t\|^2}{2\beta_{\max}}
\end{aligned}$$

$$\Rightarrow\quad \mathbb{E}\Bigg[\sum_{t=1}^T\frac{\|d_t\|^2}{b_t}\Bigg] \le \frac{2}{\eta}\big(F(x_1) - F^*\big) + \mathbb{E}\Bigg[\sum_{t=1}^T\Big(\eta\beta_{\max} + \frac{\eta\beta_{\max}}{a_{t+1}^{1/2}\lambda} - b_t\Big)\frac{\|d_t\|^2}{b_t^2}\Bigg] + \frac{\lambda\,\mathbb{E}\big[E_{T,1/2}\big]}{\eta\beta_{\max}}.$$

The final step is to relate the L.H.S. to $D_T$. Recall that for META-STORM-SG and META-STORM-NA, we have $b_t = \big(b_0^{1/p} + \sum_{i=1}^t\|d_i\|^2\big)^{p}/a_{t+1}^{q}$. Hence

$$\sum_{t=1}^T\frac{\|d_t\|^2}{b_t} = \sum_{t=1}^T\frac{a_{t+1}^{q}\|d_t\|^2}{\big(b_0^{1/p} + \sum_{i=1}^t\|d_i\|^2\big)^{p}} \ge \sum_{t=1}^T\frac{a_{T+1}^{q}\|d_t\|^2}{\big(b_0^{1/p} + \sum_{i=1}^T\|d_i\|^2\big)^{p}} = a_{T+1}^{q}\Big(b_0^{1/p} + \sum_{i=1}^T\|d_i\|^2\Big)^{1-p} - \frac{a_{T+1}^{q}\,b_0^{1/p}}{\big(b_0^{1/p} + \sum_{i=1}^T\|d_i\|^2\big)^{p}} \ge a_{T+1}^{q}\Big(b_0^{1/p} + \sum_{i=1}^T\|d_i\|^2\Big)^{1-p} - b_0^{\frac1p-1} \ge a_{T+1}^{q}D_T^{1-p} - b_0^{\frac1p-1}.$$

The same result holds for META-STORM by a similar proof.
By using this bound, the proof is finished.

To finish this section, we prove a technical result, Lemma E.5, which will be very useful in the proof for every algorithm. The motivation is that we want to bound the term inside the expectation in Lemma E.4.

Lemma E.5. Given $A, B \ge 0$, we have:

• for META-STORM-SG and META-STORM-NA,
$$\sum_{t=1}^T\Big(A + \frac{B}{a_{t+1}^{1/2}} - b_t\Big)\frac{\|d_t\|^2}{b_t^2} \le \frac{(A+B)^{\frac1p-1}}{1-p}\log\frac{A + a_{T+1}^{-1/2}B}{b_0};$$

• for META-STORM,
$$\sum_{t=1}^T\Big(A + \frac{B}{a_t^{1/2}} - b_t\Big)\frac{\|d_t\|^2}{b_t^2} \le \frac{(A+B)^{\frac1p-1}}{1-p}\log\frac{A + a_{T+1}^{-1/2}B}{b_0}.$$

Proof. In META-STORM-SG and META-STORM-NA, we have $b_t = \big(b_0^{1/p} + \sum_{i=1}^t\|d_i\|^2\big)^{p}/a_{t+1}^{q}$, where $p+2q=1$. Define the set $S = \{t \in [T] : b_t \le A + Ba_{t+1}^{-1/2}\}$ and let $s = \max S$. We know

$$\sum_{t=1}^T\Big(A + \frac{B}{a_{t+1}^{1/2}} - b_t\Big)\frac{\|d_t\|^2}{b_t^2} \le \sum_{t\in S}\Big(A + \frac{B}{a_{t+1}^{1/2}} - b_t\Big)\frac{\|d_t\|^2}{b_t^2} = \sum_{t\in S}\Big(A + \frac{B}{a_{t+1}^{1/2}} - b_t\Big)\frac{a_{t+1}^{q/p}b_t^{1/p} - a_t^{q/p}b_{t-1}^{1/p}}{b_t^2} \overset{(a)}{\le} \sum_{t\in S}\Big(A + \frac{B}{a_{t+1}^{1/2}} - b_t\Big)\frac{a_{t+1}^{q/p}\big(b_t^{1/p} - b_{t-1}^{1/p}\big)}{b_t^2} = \sum_{t\in S}\big(a_{t+1}^{1/2}A + B - a_{t+1}^{1/2}b_t\big)\,a_{t+1}^{\frac qp-\frac12}\,b_t^{\frac1p-2}\cdot\frac{b_t^{1/p} - b_{t-1}^{1/p}}{b_t^{1/p}},$$

where (a) is by $a_t \ge a_{t+1}$. Note that

$$\big(a_{t+1}^{1/2}A + B - a_{t+1}^{1/2}b_t\big)\,a_{t+1}^{\frac qp-\frac12}\,b_t^{\frac1p-2} \overset{(b)}{\le} \big(A + B - a_{t+1}^{1/2}b_t\big)\,a_{t+1}^{\frac qp-\frac12}\,b_t^{\frac1p-2} \overset{(c)}{=} \big(A + B - a_{t+1}^{1/2}b_t\big)\big(a_{t+1}^{1/2}b_t\big)^{\frac1p-2} \overset{(d)}{\le} \frac{(A+B)^{\frac1p-1}\big(\frac1p-2\big)^{\frac1p-2}}{\big(\frac1p-1\big)^{\frac1p-1}} \le \frac{p}{1-p}(A+B)^{\frac1p-1},$$

where (b) holds by $a_{t+1} \le 1$, (c) is due to $\frac qp - \frac12 = \frac{2q-p}{2p} = \frac{1-2p}{2p} = \frac{1}{2p}-1$ by $p+2q=1$, and (d) is by applying Lemma I.8. Thus we know

$$\sum_{t=1}^T\Big(A + \frac{B}{a_{t+1}^{1/2}} - b_t\Big)\frac{\|d_t\|^2}{b_t^2} \le \frac{p}{1-p}(A+B)^{\frac1p-1}\sum_{t\in S}\frac{b_t^{1/p} - b_{t-1}^{1/p}}{b_t^{1/p}} \overset{(e)}{\le} \frac{(A+B)^{\frac1p-1}}{1-p}\sum_{t\in S}\log\frac{b_t}{b_{t-1}} \overset{(f)}{\le} \frac{(A+B)^{\frac1p-1}}{1-p}\sum_{t=1}^{s}\log\frac{b_t}{b_{t-1}} = \frac{(A+B)^{\frac1p-1}}{1-p}\log\frac{b_s}{b_0} \overset{(g)}{\le} \frac{(A+B)^{\frac1p-1}}{1-p}\log\frac{A + a_{T+1}^{-1/2}B}{b_0},$$

where (e) is by taking $x = (b_t/b_{t-1})^{1/p}$ in $1 - \frac1x \le \log x$, and (f) is because $b_t$ is increasing. The reason (g) is true is that $b_s \le A + a_{s+1}^{-1/2}B \le A + a_{T+1}^{-1/2}B$, where the first inequality is due to $s \in S$ and the second holds because $a_t^{-1/2}$ is increasing. This finishes the proof for META-STORM-SG and META-STORM-NA. The proof for META-STORM is essentially the same and hence omitted here.
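The lower bound relating $\sum_t \|d_t\|^2/b_t$ to $D_T^{1-p}$ at the end of Lemma E.4's proof can also be checked numerically for the META-STORM-SG/NA choice of $b_t$; the random nonincreasing $a_t$ sequence below is an illustration only:

```python
import numpy as np

rng = np.random.default_rng(3)
T, p = 200, 0.4
q = (1 - p) / 2                                   # p + 2q = 1
b0 = 0.1
d_sq = rng.uniform(0.0, 4.0, T)                   # stand-ins for ||d_t||^2
# a_1 = 1 followed by a random nonincreasing sequence a_2, ..., a_{T+1} in (0, 1]
a = np.concatenate(([1.0], np.minimum.accumulate(rng.uniform(0.2, 1.0, T))))
cums = b0 ** (1 / p) + np.cumsum(d_sq)            # b0^{1/p} + sum_{i<=t} ||d_i||^2
b = cums ** p / a[1:] ** q                        # b_t for META-STORM-SG / NA

lhs = float(np.sum(d_sq / b))                     # sum ||d_t||^2 / b_t
D_T = float(d_sq.sum())
rhs = a[-1] ** q * D_T ** (1 - p) - b0 ** (1 / p - 1)
```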

F ANALYSIS OF META-STORM FOR GENERAL p

In this section, we give a general analysis of our algorithm META-STORM. We will see that $p = \frac12$ is a special corner case. First, we recall the choices of $a_t$ and $b_t$:

$$a_{t+1} = \Big(1 + \sum_{i=1}^{t}\|\nabla f(x_i,\xi_i) - \nabla f(x_i,\xi_{i+1})\|^2/a_0^2\Big)^{-2/3}, \qquad b_t = \Big(b_0^{1/p} + \sum_{i=1}^{t}\|d_i\|^2\Big)^{p}\big/a_t^{q},$$

where $p, q$ satisfy $p + 2q = 1$, $p \in \big[\frac{3-\sqrt7}{2}, \frac12\big]$, and $a_0 > 0$ and $b_0 > 0$ are absolute constants. Naturally, we have $a_1 = 1$. We will finally prove the following theorem.

Theorem F.1. Under the assumptions 1-3 and 5, defining $\bar p = \frac{3(1-p)}{4-p} \in \big[\frac37, \sqrt7-2\big]$, we have

$$\mathbb{E}\big[H_T^{\bar p}\big] \le 4\begin{cases}\Big(2K_1K_4^{\frac{p}{1-2p}} + 2K_2K_4^{\frac{p}{1-2p}} + (2K_4)^{\frac{\bar p}{2p}}\Big)\Big(1 + \frac{2\sigma^2T}{a_0^2}\Big)^{\bar p/3} & p \ne \frac12 \\[2mm] \Big(2K_1 + 2\big(K_2 + \frac{K_4}{3}\big)\log\big(1 + \frac{2\sigma^2T}{a_0^2}\big) + \frac{2K_4}{\bar p}\log\frac{4K_4}{b_0^{2\bar p}}\Big)^{\bar p}\Big(1 + \frac{2\sigma^2T}{a_0^2}\Big)^{\bar p/3} + b_0^{2\bar p} & p = \frac12\end{cases}$$
$$\qquad\qquad + 4\Big(K_5 + \big(K_6 + \tfrac{K_7}{3}\big)\log\big(1 + \tfrac{2\sigma^2T}{a_0^2}\big) + K_7\log\tfrac{K_8 + K_9}{b_0}\Big)^{\frac{\bar p}{1-p}}\Big(1 + \frac{2\sigma^2T}{a_0^2}\Big)^{\bar p/3},$$

where $K_i$, $i \in [9]$, are constants depending only on $a_0, b_0, \eta, \sigma, \tilde\sigma, \beta, p, q$, and $F(x_1) - F^*$. To simplify our final bound, indicating only the dependency on $\beta$ and $F(x_1) - F^*$:

$$\mathbb{E}\big[H_T^{\bar p}\big] = O\Big(\Big(\big(F(x_1) - F^*\big)^{\frac{\bar p}{1-p}} + \beta^{2\bar p}\log^{\frac{\bar p}{1-p}}\beta + \beta^{2\bar p}\log^{\frac{\bar p}{1-p}}\big(1 + \sigma^2T\big)\Big)\big(1 + \sigma^2T\big)^{\bar p/3}\Big).$$

Remark F.2. For all $i \in [9]$, the constant $K_i$ will be defined in the proof that follows.

By using the concavity of $x^{\bar p}$, we state the following convergence theorem without proof.

Theorem F.3. Under the assumptions 1-3 and 5, defining $\bar p = \frac{3(1-p)}{4-p} \in \big[\frac37, \sqrt7-2\big]$, we have

$$\mathbb{E}\big[\|\nabla F(x_{\mathrm{out}})\|^{2\bar p}\big] = O\Big(\Big(\big(F(x_1) - F^*\big)^{\frac{\bar p}{1-p}} + \beta^{2\bar p}\log^{\frac{\bar p}{1-p}}\beta + \beta^{2\bar p}\log^{\frac{\bar p}{1-p}}\big(1 + \sigma^2T\big)\Big)\Big(\frac{1}{T^{\bar p}} + \frac{\sigma^{2\bar p/3}}{T^{2\bar p/3}}\Big)\Big).$$

Here, we give a more explicit convergence dependency for the case $p = \frac12$ used in Theorem 2.3.

Theorem F.4. Under the assumptions 1-3 and 5, when $p = \frac12$, setting $\lambda = \min\{1, (a_0/\tilde\sigma)^{7/3}\}$ (which is used in $K_5$ to $K_9$) gives the best dependency on $\tilde\sigma$.
For simplicity, under the setting a 0 = b 0 = η = 1, we have E ∇F (x out ) 6/7 = O Q 1 + Q 2 log 6/7 1 + σ 2 T 1 T 3/7 + σ 2/7 T 2/7 where Q 1 = O (F (x 1 ) -F * ) 6/7 + σ 12/7 + ( σσ) 6/7 + σ 18/7 + 1 + σ 18/7 β 6/7 log 6/7 β + σ 3 β and Q 2 = O 1 + σ 18/7 β 6/7 . To start with, we first state the following useful bound for a t : Lemma F.5. ∀α ∈ (0, 3/2] and ∀t ≥ 1, there is a t a t+1 α ≤ 1 + 4 σ 2 a 2 0 2α 3 a α t . Especially, taking α ∈ {1/2, 1, 3/2}, we have a t a t+1 1/2 ≤ 1 + 4 1/3 σ 2/3 a 2/3 0 a 1/2 t ; a t a t+1 ≤ 1 + 4 2/3 σ 4/3 a 2/3 0 a t ; a t a t+1 3/2 ≤ 1 + 4 σ 2 a 2 0 a 3/2 t . Proof. Note that a t a t+1 α = a α t 1 a 3/2 t + ∇f (x t , ξ t ) -∇f (x t , ξ t+1 ) 2 a 2 0 2α/3 = 1 + ∇f (x t , ξ t ) -∇f (x t , ξ t+1 ) 2 a 2 0 a 3/2 t 2α/3 ≤ 1 + 4 σ 2 a 2 0 a 3/2 t 2α/3 ≤ 1 + 4 σ 2 a 2 0 2α/3 a α t where the last inequality is because 2α/3 ≤ 1. Lemma F.5 allows us to obtain some other properties of a t . Lemma F.6. For t ≥ 1 (1 -a t+1 ) 2 -(1 -a t ) 2 2 a t+1 ≤ 4 2/3 σ 4/3 a 4/3 0 ((1 -a t+1 )a t+1 -(1 -a t )a t ) 2 a t+1 ≤ 4 2/3 σ 4/3 a 4/3 0 a 2 t . Proof. Let a t+1 = x, a t = y and note that x ≤ y ≤ 1. For the first inequality, (1 -a t+1 ) 2 -(1 -a t ) 2 2 a t+1 ≤ (1 -x) 2 -(1 -y) 2 x = (y -x)(2 -x -y) x ≤ ( y x -1)(2 -y) ≤ 4 2/3 σ 4/3 a 2/3 0 a t × (2 -a t ) (Lemma F.5) ≤ 4 2/3 σ 4/3 a 4/3 0 . For the second inequality, we have ((1 -a t+1 )a t+1 -(1 -a t )a t ) 2 a t+1 = ((1 -x)x -(1 -y)y) 2 x = (y -x) 2 (1 -x -y) 2 x ≤ (y -x) 2 x ≤ y x -1 y ≤ 4 2/3 σ 4/3 a 2/3 0 a t × a t (Lemma F.5) = 4 2/3 σ 4/3 a 4/3 0 a 2 t . F.1 ANALYSIS OF E T Following a similar approach, we first define a random time τ satisfying τ = max {[T ] , a t ≥ K -1 } , where K -1 := min 1, a 4 0 /(144 σ 4 ) . One thing we need to emphasize here is that, in our current choice, a t ∈ F t , which implies {τ + 1 = t} = {τ = t -1} = {a t-1 ≥ K -1 , a t < K -1 } ∈ F t . This means τ + 1 is a stopping time instead of τ itself. We now prove a useful proposition for τ : Lemma F.7. 
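For reference, a minimal NumPy sketch of the norm-based META-STORM update with the $a_t, b_t$ choices recalled at the beginning of this section; the toy oracle and all settings are assumptions for illustration, not part of the analysis:

```python
import numpy as np

rng = np.random.default_rng(4)

def stoch_grad(x):
    # Toy oracle (assumed for illustration): grad of 0.5*||x||^2 plus noise.
    return x + 0.01 * rng.standard_normal(x.shape)

def meta_storm(x0, T=300, eta=0.5, a0=1.0, b0=1.0, p=0.5):
    # Norm-based META-STORM: a_{t+1} from accumulated squared gradient
    # differences, b_t = (b0^{1/p} + sum ||d_i||^2)^p / a_t^q.
    q = (1.0 - p) / 2.0                    # p + 2q = 1
    x = x0.astype(float).copy()
    g = stoch_grad(x)                      # grad f(x_1, xi_1)
    d = g.copy()                           # d_1
    a = 1.0                                # a_1 = 1
    grad_diff_sum = 0.0                    # sum ||grad f(x_i,xi_i) - grad f(x_i,xi_{i+1})||^2
    d_sq_sum = 0.0                         # sum ||d_i||^2
    for _ in range(T):
        g_resample = stoch_grad(x)         # grad f(x_t, xi_{t+1}), second oracle call
        grad_diff_sum += float(np.sum((g - g_resample) ** 2))
        a_next = (1.0 + grad_diff_sum / a0 ** 2) ** (-2.0 / 3.0)
        d_sq_sum += float(np.sum(d ** 2))
        b = (b0 ** (1.0 / p) + d_sq_sum) ** p / a ** q   # uses a_t, per META-STORM
        x = x - (eta / b) * d
        g = stoch_grad(x)                  # grad f(x_{t+1}, xi_{t+1})
        d = g + (1.0 - a_next) * (d - g_resample)        # recursive momentum
        a = a_next
    return x

x_out = meta_storm(np.full(5, 2.0))
```

Unlike the heuristic Algorithm 3, the accumulators here are plain (non-EMA) sums of squared norms, and the step size divides the full vector $d_t$ by a scalar $b_t$.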
We have a t+1 ≥ K 0 , ∀t ≤ τ, a -1 t+1 -a -1 t ≤ 2/9, ∀t ≥ τ + 1. where K 0 := (K -3/2 -1 + 4 σ 2 /a 2 0 ) -2/3 = (max 1, 1728 σ 6 /a 6 0 + 4 σ 2 /a 2 0 ) -2/3 . Proof. First, by the definition of τ , we know a t ≥ K -1 ≥ K 0 ,∀t ≤ τ . For time τ , we have a -3/2 τ +1 -a -3/2 τ = ∇f (x τ , ξ τ ) -∇f (x τ , ξ τ +1 ) 2 /a 2 0 ≤ 4 σ 2 /a 2 0 ⇒ a -1 τ +1 ≤ (a -3/2 τ + 4 σ 2 /a 2 0 ) 2/3 ≤ (K -3/2 -1 + 4 σ 2 /a 2 0 ) 2/3 = K -1 0 , which implies a τ +1 ≥ K 0 . For the second proposition, let h(y) = y 2/3 . Due to the concavity of h, we know h(y 1 ) -h(y 2 ) ≤ h (y 2 )(y 1 -y 2 ) = 2(y1-y2) 3y 1/3 2 . Now we have a -1 t+1 -a -1 t = (a -3/2 t + ∇f (x t , ξ t ) -∇f (x t , ξ t+1 ) 2 /a 2 0 ) 2/3 -(a -3/2 t ) 2/3 ≤ 2a 1/2 t ∇f (x t , ξ t ) -∇f (x t , ξ t+1 ) 2 3a 2 0 ≤ 8a 1/2 t σ 2 3a 2 0 ≤ 2 9 where the last step is by a t ≤ a τ +1 < K -1 ≤ a 4 0 /(144 σ 4 ). F.1.1 BOUND ON E E τ,3/2-2 FOR ∈ 1 4 , 1 2 Unlike STORM+ in which they bound E [E τ ], we choose to bound E E τ,3/2-2 . We first prove the following bound on E E τ,3/2-2 : Lemma F.8. For any ∈ 1 4 , 1 2 , we have E E τ,3/2-2 ≤ 2σ 2 + 16 1 + 6 σ 4/3 a 4/3 0 3a 2 0 + 5 σ 2 K 2 -1/2 0 + 4 1 + 6 σ 4/3 a 4/3 0 η 2 β 2 K 2 -1/2 0 E T t=1 d t 2 b 2 t . Proof. We start from Lemma E.2 a t+1 t 2 ≤ t 2 -t+1 2 + 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + M t+1 . Summing up from 1 to τ and taking expectations on both sides, we will have E [E τ,1 ] ≤E τ t=1 t 2 -t+1 2 + 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + M t+1 ≤σ 2 + E τ t=1 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + M t+1 ≤σ 2 + E T t=1 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + τ t=1 M t+1 . (17) First we bound E [ τ t=1 M t+1 ]. From the definition of M t+1 , we have E [M t+1 ] = E 2(1 -a t+1 ) 2 t , Z t+1 + 2(1 -a t+1 )a t+1 t , ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) . Now for t ≥ 1, we define N t+1 := 2(1 -a t ) 2 t , Z t+1 + 2(1 -a t )a t t , ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) ∈ F t+1 with N 1 := 0. 
A key observation is that E τ t=1 N t+1 = 0. This is because N t := t i=1 N t is a martingale and τ + 1 is a bounded stopping time. Then by optional sampling theorem, we have E τ t=1 N t+1 = E τ +1 t=1 N t = E [N τ +1 ] = 0. By subtracting E [ τ t=1 M t+1 ] by E [ τ t=1 N t+1 ], we obtain E τ t=1 M t+1 = E τ t=1 2 (1 -a t+1 ) 2 -(1 -a t ) 2 t , Z t+1 + 2 ((1 -a t+1 )a t+1 -(1 -a t )a t ) t , ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) ] . Using Cauchy-Schwarz inequality for each term, we have 2 (1 -a t+1 ) 2 -(1 -a t ) 2 t , Z t+1 ≤2 (1 -a t+1 ) 2 -(1 -a t ) 2 t Z t+1 ≤ a t+1 4 t 2 + 4 (1 -a t+1 ) 2 -(1 -a t ) 2 2 a t+1 Z t+1 2 , 2 ((1 -a t+1 )a t+1 -(1 -a t )a t ) t , ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) ≤2 |(1 -a t+1 )a t+1 -(1 -a t )a t | t ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) ≤ a t+1 4 t 2 + 4 ((1 -a t+1 )a t+1 -(1 -a t )a t ) 2 a t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 . Plugging the above bounds into (18), we obtain E τ t=1 M t+1 ≤E      τ t=1 a t+1 2 t 2 + (1 -a t+1 ) 2 -(1 -a t ) 2 2 a t+1 (i) 4 Z t+1 2 + τ t=1 ((1 -a t+1 )a t+1 -(1 -a t )a t ) 2 a t+1 (ii) 4 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2      . ( ) Plugging the bounds for (i) and (ii) from Lemma F.6 into (19), the following bound on E [ τ t=1 M t+1 ] comes up E τ t=1 M t+1 ≤E τ t=1 a t+1 2 t 2 + 4 5/3 σ 4/3 a 4/3 0 Z t+1 2 + 4 5/3 σ 4/3 a 4/3 0 a 2 t ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 ≤E 1 2 E τ,1 + 12 σ 4/3 a 4/3 0 Z t+1 2 + 12 σ 4/3 a 4/3 0 a 2 t ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 . Then from (17), we have E [E τ,1 ] ≤ σ 2 + E 1 2 E τ,1 + E T t=1 2 + 12 σ 4/3 a 4/3 0 Z t+1 2 + E T t=1 2a 2 t+1 + 12 σ 4/3 a 4/3 0 a 2 t ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 , which will give us E [E τ,1 ] ≤ 2σ 2 + 4 1 + 6 σ 4/3 a 4/3 0 E    T t=1 Z t+1 2 (iii)    + E       T t=1 4 a 2 t+1 + 6 σ 4/3 a 4/3 0 a 2 t ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 (iv)       . ( ) For term (iii), Lemma E.3 tells us E Z t+1 2 | F t ≤η 2 β 2 d t 2 b 2 t . 
For term (iv), we know E [(iv)] = E T t=1 4 a 2 t+1 + 6 σ 4/3 a 4/3 0 a 2 t ∇f (x t+1 , ξ t+1 ) -E [∇f (x t+1 , ξ t+2 )|F t+1 ] 2 ≤ E T t=1 4 a 2 t+1 + 6 σ 4/3 a 4/3 0 a 2 t E ∇f (x t+1 , ξ t+1 ) -∇f (x t+1 , ξ t+2 ) 2 |F t+1 = E T t=1 4 a 2 t+1 + 6 σ 4/3 a 4/3 0 a 2 t ∇f (x t+1 , ξ t+1 ) -∇f (x t+1 , ξ t+2 ) 2 . ( ) Note that a 2 t = 1 + t-1 i=1 ∇f (x i , ξ i ) -∇f (x i , ξ i+1 ) 2 /a 2 0 4/3 , then we have T t=1 4 a 2 t+1 + 6 σ 4/3 a 4/3 0 a 2 t ∇f (x t+1 , ξ t+1 ) -∇f (x t+1 , ξ t+2 ) 2 =4a 2 0 T t=1 ∇f (x t+1 , ξ t+1 ) -∇f (x t+1 , ξ t+2 ) 2 /a 2 0 1 + t i=1 ∇f (x i , ξ i ) -∇f (x i , ξ i+1 ) 2 /a 2 0 4/3 + 24 σ 4/3 a 2/3 0 T t=1 ∇f (x t+1 , ξ t+1 ) -∇f (x t+1 , ξ t+2 ) 2 /a 2 0 1 + t-1 i=1 ∇f (x i , ξ i ) -∇f (x i , ξ i+1 ) 2 /a 2 0 4/3 ≤4a 2 0 12 + 8 σ 2 a 2 0 + 24 σ 4/3 a 2/3 0 12 + 20 σ 2 a 2 0 =16 3a 2 0 + 2 σ 2 + 96 σ 4/3 a 4/3 0 3a 2 0 + 5 σ 2 ≤16 1 + 6 σ 4/3 a 4/3 0 3a 2 0 + 5 σ 2 , ( ) where, for the first inequality, we use Lemma I.4 and Lemma I.5. Plugging ( 21) and ( 23) into (20), we obtain E [E τ,1 ] ≤ 2σ 2 + 16 1 + 6 σ 4/3 a 4/3 0 3a 2 0 + 5 σ 2 + 4 1 + 6 σ 4/3 a 4/3 0 η 2 β 2 E T t=1 d t 2 b 2 t . Note that by Lemma F.7, we have for t ≤ τ ,a t+1 ≥ K 0 . By using this property and noticing 2 -1/2 ≥ 0 , we can obtain E K 2 -1/2 0 E τ,3/2-2 =E K 2 -1/2 0 τ t=1 a 3/2-2 t+1 t 2 ≤ E τ t=1 a t+1 t 2 ≤2σ 2 + 16 1 + 6 σ 4/3 a 4/3 0 3a 2 0 + 5 σ 2 + 4 1 + 6 σ 4/3 a 4/3 0 η 2 β 2 E T t=1 d t 2 b 2 t , which will give the desired bound immediately. F.1.2 BOUND ON E [E T,1-2 ] FOR ∈ 1 4 , 1 With the previous result on E E τ,3/2-2 , we can bound E [E T,1-2 ]. Lemma F.9. 
For any ∈ 1 4 , 1 2 , we have E [E T,1-2 ] ≤ K 1 ( ) + K 2 ( )      E H T /a 2 0 4 -1 3 > 1 4 E log 1 + H T /a 2 0 = 1 4 + E T t=1 K 3 ( )a 2 t + 3 1 + 2 2 2 η 2 β 2 d t 2 a 2 t b 2 t , where K 1 ( ) := 3 σ 2 + 24 1 + 2 σ 2 2 + 72 σ 2 σ 2 + 8 1 + 6 σ 4/3 a 4/3 0 3a 2 0 + 5 σ 2 a 2 0 K 2 -1/2 0 K 2 ( ) :=    9(1+2 2 )a 2 0 2 (4 -1) = 1 4 3(1+2 2 )a 2 0 2 = 1 4 K 3 ( ) := 144 σ 2 K 2 -1/2 0 a 2 0 1 + 6 σ 4/3 a 4/3 0 + 3 1 + 2 2 2 4 σ 2 a 2 0 4 3 Proof. We use a similar strategy as in the previous proof in which we bound E E τ,3/2-2 . Starting from Lemma E.2 a t+1 t 2 ≤ t 2 -t+1 2 + 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + M t+1 . Dividing both sides by a 2 t+1 , taking the expectations on both sides and summing up from 1 to T to get E [E T,1-2 ] ≤ E T t=1 t 2 a 2 t+1 - t+1 2 a 2 t+1 + 2 a 2 t+1 Z t+1 2 + 2a 2-2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + M t+1 a 2 t+1 ≤ σ 2 + E T t=1 a -2 t+1 -a -2 t t 2 + T t=1 2 a 2 t+1 Z t+1 2 + 2a 2-2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + M t+1 a 2 t+1 . ( ) As before, we bound E Mt+1 a 2 t+1 first. From the definition of M t+1 , we have E M t+1 a 2 t+1 = E 2(1 -a t+1 ) 2 a 2 t+1 t , Z t+1 + 2(1 -a t+1 )a 1-2 t+1 t , ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) . A similar key observation is that, if we replace a t+1 by a t , we can find E 2(1 -a t ) 2 a 2 t t , Z t+1 + 2(1 -a t )a 1-2 t t , ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) = 0. By subtracting E Mt+1 a 2 t+1 by 0, we know E M t+1 a 2 t+1 = E 2 (1 -a t+1 ) 2 a 2 t+1 - (1 -a t ) 2 a 2 t t , Z t+1 +2 (1 -a t+1 )a 1-2 t+1 -(1 -a t )a 1-2 t t , ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) . 
(25) Using Cauchy-Schwarz for each term 2 (1 -a t+1 ) 2 a 2 t+1 - (1 -a t ) 2 a 2 t t , Z t+1 ≤2 (1 -a t+1 ) 2 a 2 t+1 - (1 -a t ) 2 a 2 t t Z t+1 ≤ a -2 t+1 -a -2 t t 2 + (1-at+1) 2 a 2 t+1 -(1-at) 2 a 2 t 2 a -2 t+1 -a -2 t Z t+1 2 , 2 (1 -a t+1 )a 1-2 t+1 -(1 -a t )a 1-2 t t , ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) ≤2 (1 -a t+1 )a 1-2 t+1 -(1 -a t )a 1-2 t t ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) ≤ a -2 t+1 -a -2 t t 2 + (1 -a t+1 )a 1-2 t+1 -(1 -a t )a 1-2 t 2 a -2 t+1 -a -2 t ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 , Plugging these two bounds into (25), we obtain E M t+1 a 2 t+1 ≤ E 2 a -2 t+1 -a -2 t t 2 + E       (1-at+1) 2 a 2 t+1 -(1-at) 2 a 2 t 2 a -2 t+1 -a -2 t (i) Z t+1 2       + E       (1 -a t+1 )a 1-2 t+1 -(1 -a t )a 1-2 t 2 a -2 t+1 -a -2 t (ii) ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2       . ( ) To bound (i) and (ii), let a t+1 = x, a t = y and note that 0 ≤ x ≤ y ≤ 1. By Lemma I.6, we have for (i) (1-at+1) 2 a 2 t+1 -(1-at) 2 a 2 t 2 a -2 t+1 -a -2 t = (1-x 1/ ) 2 x 2 -(1-y 1/ ) 2 y 2 2 x 2 y 2 y 2 -x 2 ≤ 1 2 x 2 = 1 2 a 2 t+1 . ( ) For (ii), by Lemma I.7, (1 -a t+1 )a 1-2 t+1 -(1 -a t )a 1-2 t 2 a -2 t+1 -a -2 t = (1 -x 1/ )x 1/ -2 -(1 -y 1/ )y 1/ -2 2 x 2 y 2 y 2 -x 2 ≤ y 2/ -2 2 = a 2-2 t 2 . ( ) Plugging ( 27) and ( 28) into ( 26), we will have E M t+1 a 2 t+1 ≤ E 2 a -2 t+1 -a -2 t t 2 + 1 2 a 2 t+1 Z t+1 2 + a 2-2 t 2 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 . Now combining this with (24), we obtain E [E T,1-2 ] ≤ σ 2 + E       T t=1 3 a -2 t+1 -a -2 t t 2 (iii) + T t=1 1 + 2 2 2 a 2 t+1 Z t+1 2 (iv) + T t=1 a 2-2 t 2 + 2a 2-2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 (v)       . 
( ) For (iii), we split the sum according to τ then use Lemma F.7 and Lemma F.5, T t=1 3 a -2 t+1 -a -2 t t 2 = τ t=1 3 a -2 t+1 -a -2 t t 2 + T t=τ +1 3 a -2 t+1 -a -2 t t 2 Note that 3/2 -2 ∈ 1 2 , 1 , we have a -2 t+1 -a -2 t = 1 a 3/2 t+1 - 1 a 2 t a 3/2-2 t+1 a 3/2-2 t+1 ≤ a -3/2 t+1 -a -3/2 t a 3/2-2 t+1 ≤ 4 σ 2 a 2 0 a 3/2-2 t+1 , (Lemma F.5) and we can use Lemma F.7 to bound for t ≥ τ + 1 a -2 t+1 -a -2 t = 1 a t+1 - 1 a 2 t a 1-2 t+1 a 1-2 t+1 ≤ a -1 t+1 -a -1 t a 1-2 t+1 ≤ 2 9 a 1-2 t+1 . Thus T t=1 3 a -2 t+1 -a -2 t t 2 ≤ τ t=1 12 σ 2 a 2 0 a 3/2-2 t+1 t 2 + T t=τ +1 2 3 a 1-2 t+1 t 2 ≤ 12 σ 2 a 2 0 τ t=1 a 3/2-2 t+1 t 2 + T t=1 2 3 a 1-2 t+1 t 2 = 12 σ 2 a 2 0 E τ,3/2-2 + 2 3 E T,1-2 . For (iv), note that E 1 + 2 2 2 a 2 t+1 Z t+1 2 = 1 + 2 2 2 E a 2 t a 2 t+1 Z t+1 2 a 2 t ≤ 1 + 2 2 2 E 1 + 4 σ 2 a 2 0 4 3 a 2 t Z t+1 2 a 2 t (Lemma F.5) ≤ 1 + 2 2 2 E 1 + 4 σ 2 a 2 0 4 3 a 2 t E Z t+1 2 | F t a 2 t ≤ 1 + 2 2 2 E 1 + 4 σ 2 a 2 0 4 3 a 2 t η 2 β 2 d t 2 a 2 t b 2 t , where the last step is by Lemma E.3. Hence we obtain E T t=1 1 + 2 2 2 a 2 t+1 Z t+1 2 ≤ E T t=1 1 + 2 2 2 1 + 4 σ 2 a 2 0 4 3 a 2 t η 2 β 2 d t 2 a 2 t b 2 t . For (v), by the same argument when bounding ( 22), we know E [(v)] ≤ E T t=1 a 2-2 t 2 + 2a 2-2 t+1 ∇f (x t+1 , ξ t+1 ) -∇f (x t+1 , ξ t+2 )| 2 . Now we use Lemma I.2 and Lemma I.3 to get T t=1 a 2-2 t 2 + 2a 2-2 t+1 ∇f (x t+1 , ξ t+1 ) -∇f (x t+1 , ξ t+2 ) 2 = a 2 0 2 T t=1 ∇f (x t+1 , ξ t+1 ) -∇f (x t+1 , ξ t+2 ) 2 /a 2 0 1 + t-1 i=1 ∇f (x i , ξ i ) -∇f (x i , ξ i+1 ) 2 /a 2 0 4(1-)/3 + 2a 2 0 T t=1 ∇f (x t+1 , ξ t+1 ) -∇f (x t+1 , ξ t+2 ) 2 /a 2 0 1 + t i=1 ∇f (x i , ξ i ) -∇f (x i , ξ i+1 ) 2 /a 2 0 4(1-)/3 ≤ a 2 0 2 × 24 σ 2 a 2 0 + 2a 2 0 × 12 σ 2 a 2 0 + 1 + 2 2 a 2 0 2      3 4 -1 H T /a 2 0 4 -1 3 = 1 4 log 1 + H T /a 2 0 = 1 4 = 24 1 + 2 σ 2 2 + 1 + 2 2 a 2 0 2      3 4 -1 H T /a 2 0 4 -1 3 = 1 4 log 1 + H T /a 2 0 = 1 4 . 
Plugging the bounds on (iii), (iv) and (v) into (29), we get E [E T,1-2 ] ≤ σ 2 + 24 1 + 2 σ 2 2 + E 12 σ 2 a 2 0 E τ,3/2-2 + 2 3 E T,1-2 + E T t=1 1 + 2 2 2 1 + 4 σ 2 a 2 0 4 3 a 2 t η 2 β 2 d t 2 a 2 t b 2 t + 1 + 2 2 a 2 0 2      3 4 -1 E H T /a 2 0 4 -1 3 = 1 4 E log 1 + H T /a 2 0 = 1 4 , which gives us E [E T,1-2 ] ≤ 3 σ 2 + 24 1 + 2 σ 2 2 + 36 σ 2 a 2 0 E E τ,3/2-2 + E T t=1 3 1 + 2 2 2 1 + 4 σ 2 a 2 0 4 3 a 2 t η 2 β 2 d t 2 a 2 t b 2 t + 3 1 + 2 2 a 2 0 2      3 4 -1 E H T /a 2 0 4 -1 3 = 1 4 E log 1 + H T /a 2 0 = 1 4 . Now we plug in the bound on E E τ,3/2-2 in Lemma F.8 to get the final result E [E T,1-2 ] ≤ 3 σ 2 + 24 1 + 2 σ 2 2 + 72 σ 2 σ 2 + 8 1 + 6 σ 4/3 a 4/3 0 3a 2 0 + 5 σ 2 a 2 0 K 2 -1/2 0 K1( ) + K 2 ( )      E H T /a 2 0 4 -1 3 = 1 4 E log 1 + H T /a 2 0 = 1 4 + E       T t=1       144 σ 2 K 2 -1/2 0 a 2 0 1 + 6 σ 4/3 a 4/3 0 + 3 1 + 2 2 2 4 σ 2 a 2 0 4 3 K3( ) a 2 t + 3 1 + 2 2 2       η 2 β 2 d t 2 a 2 t b 2 t       , where K 2 ( ) :=    9(1+2 2 )a 2 0 (4 -1) 2 = 1 4 3(1+2 2 )a 2 0 2 = 1 4 . F.1.3 BOUND ON E E T,1/2 The following bound on E E T,1/2 will be useful when we bound D T . Corollary F.10. We have E E T,1/2 ≤ K 1 (1/4)+K 2 (1/4)E log 1 + H T /a 2 0 +E T t=1 K 3 (1/4)a 1/2 t + 54 η 2 β 2 d t 2 a 1/2 t b 2 t . Proof. Take = 1 4 in Lemma F.9. F.1.4 BOUND ON E a 1-2q T +1 E T With the previous result on E [E T,1-2 ], we can bound E a 1-2q T +1 E T immediately. Lemma F.11. Given p + 2q = 1,p ∈ 3- √ 7 2 , 1 2 , we have E a 1-2q T +1 E T ≤      K 1 + K 2 E H T /a 2 0 4q-1 3 + K 4 E D 1-2p T q > 1 4 K 1 + K 2 E log 1 + H T /a 2 0 + K 4 E log 1 + D T b 2 0 q = 1 4 where K 1 := K 1 (q) K 2 := K 2 (q) K 4 :=        K 3 (q) + 3(1+2q 2 ) q 2 η 2 β 2 4q-1 q > 1 4 K 3 (q) + 3(1+2q 2 ) q 2 η 2 β 2 q = 1 4 . Proof. 
When q > 1 4 ⇔ p < 1 2 , by Lemma F.9, taking = q, we know E [E T,1-2q ] ≤ K 1 (q) + K 2 (q)E H T /a 2 0 4q-1 3 + E T t=1 K 3 (q)a 2q t + 3 1 + 2q 2 q 2 η 2 β 2 d t 2 a 2q t b 2 t ≤ K 1 (q) + K 2 (q)E H T /a 2 0 4q-1 3 + E T t=1 K 3 (q) + 3 1 + 2q 2 q 2 η 2 β 2 d t 2 a 2q t b 2 t = K 1 (q) + K 2 (q)E H T /a 2 0 4q-1 3 + K 3 (q) + 3 1 + 2q 2 q 2 η 2 β 2 E T t=1 d t 2 a 2q t b 2 t (a) = K 1 (q) + K 2 (q)E H T /a 2 0 4q-1 3 + K 3 (q) + 3 1 + 2q 2 q 2 η 2 β 2 E    T t=1 d t 2 b 1/p 0 + t i=1 d i 2 2p    (b) ≤ K 1 (q) + K 2 (q)E H T /a 2 0 4q-1 3 + K 3 (q) + 3 1 + 2q 2 q 2 η 2 β 2 E D 1-2p T 1 -2p (c) = K 1 (q) + K 2 (q)E H T /a 2 0 4q-1 3 + K 3 (q) + 3 1 + 2q 2 q 2 η 2 β 2 4q -1 E D 1-2p T , where (a) is by a 2q t b 2 t = a 2q t b 1/p 0 + t i=1 d i 2 2p a 2q t = b 1/p 0 + t i=1 d i 2 2p , (b) is by Lemma I.1, (c) is by 1 -2p = 4q -1. When q = 1 4 , by a similar argument, we have E [E T,1-2q ] ≤ K 1 (q) + K 2 (q)E log 1 + H T /a 2 0 + K 3 (q) + 3 1 + 2q 2 q 2 η 2 β 2 E log 1 + D T b 2 0 .
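Step (a) above rests on the algebraic cancellation between the a_t^{2q} factor and the a_t^{-q} in the definition of b_t = (b_0^{1/p} + Σ_i ‖d_i‖^2)^p / a_t^q. The following is a quick numerical sanity check of that cancellation (a sketch with arbitrary made-up values for a_t, b_0, p, and the accumulated norms; not the paper's code):

```python
import random

# Hypothetical values: any a_t in (0, 1], any b0 > 0, and p + 2q = 1.
p = 0.4
q = (1.0 - p) / 2.0
b0 = 1.5
a_t = 0.37                                      # some step-size value in (0, 1]
S_t = sum(random.random() for _ in range(10))   # stands in for sum_i ||d_i||^2

b_t = (b0 ** (1.0 / p) + S_t) ** p / a_t ** q   # definition of b_t
lhs = a_t ** (2 * q) * b_t ** 2                 # a_t^{2q} b_t^2
rhs = (b0 ** (1.0 / p) + S_t) ** (2 * p)        # what step (a) claims it equals
assert abs(lhs - rhs) < 1e-9
```

The identity holds for any choice of the constants because the a_t^q factors cancel exactly; the numerics only guard against a transcription slip.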

Now we can define

K_4 := (K_3(q) + 3(1 + 2q^2)η^2β^2/q^2) / (4q − 1) when q > 1/4, and K_4 := K_3(q) + 3(1 + 2q^2)η^2β^2/q^2 when q = 1/4. The final step is by noticing that a_t is non-increasing, so for 1 − 2q = p > 0,

E_{T,1−2q} = Σ_{t=1}^T a_{t+1}^{1−2q} ‖ε_t‖^2 ≥ a_{T+1}^{1−2q} Σ_{t=1}^T ‖ε_t‖^2 = a_{T+1}^{1−2q} E_T.

F.2 ANALYSIS OF D T

We will prove the following bound Lemma F.12. Given p + 2q = 1,p ∈ 3- √ 7 2 , 1 2 , we have E a q T +1 D 1-p T ≤ K 5 + K 6 E log a 2 0 + H T a 2 0 + K 7 E   log K 8 + K 9 1 + H T /a 2 0 1/3 b 0    where K 5 := b 1 p -1 0 + 2 η (F (x 1 ) -F * ) + λK 1 (1/4) ηβ max , K 6 := λK 2 (1/4) ηβ max , K 7 := (K 8 + K 9 ) 1 p -1 1 -p , K 8 := (1 + λK 3 (1/4)) ηβ max , K 9 := 1 λ + 2 σ 2/3 a 2/3 0 λ + 54λ ηβ max , λ > 0 can be any number. Proof. We start from Lemma E.4 E a q T +1 D 1-p T ≤ b 1 p -1 0 + 2 η (F (x 1 ) -F * ) + E T t=1 ηβ max + ηβ max a 1/2 t+1 λ -b t d t 2 b 2 t + λE E T,1/2 ηβ max where λ > 0 is used to reduce the order of σ in the final bound. In the proof of the general case, we don't choose λ explicitly anymore. Plugging in the bound on E E T,1/2 in Corollary F.10, we have E a q T +1 D 1-p T ≤b 1 p -1 0 + 2 η (F (x 1 ) -F * ) + λK 1 (1/4) ηβ max + λK 2 (1/4) ηβ max E log a 2 0 + H T a 2 0 + E T t=1 ηβ max + ηβ max a 1/2 t+1 λ + K 3 (1/4)λη 2 β 2 ηβ max + 54λη 2 β 2 a 1/2 t ηβ max -b t d t 2 b 2 t ≤K 5 + K 6 E log a 2 0 + H T a 2 0 + E T t=1 (1 + λK 3 (1/4)) ηβ max + a 1/2 t λa 1/2 t+1 + 54λ ηβ max a 1/2 t -b t d t 2 b 2 t ≤K 5 + K 6 E log a 2 0 + H T a 2 0 + E       T t=1 (1 + λK 3 (1/4)) ηβ max + 1 λ + 2 σ 2/3 a 2/3 0 λ + 54λ ηβ max a 1/2 t -b t d t 2 b 2 t (i)       (30) where, in the last step, we use Lemma F.5. Next, we apply Lemma E.5 to (i) to get (i) ≤ 1 + λK 3 (1/4) + 1 λ + 2 σ 2/3 a 2/3 0 λ + 54λ ηβ max 1 p -1 1 -p × log (1 + λK 3 (1/4)) ηβ max + 1 λ + 2 σ 2/3 a 2/3 0 λ + 54λ ηβ max 1 + H T /a 2 0 1/3 b 0 = K 7 log K 8 + K 9 1 + H T /a 2 0 1/3 b 0 By plugging the above bound into (30), we get the desired result.

F.3 COMBINE THE BOUNDS AND THE FINAL PROOF

From Lemma F.11, we have E a 1-2q T +1 E T ≤      K 1 + K 2 E H T /a 2 0 4q-1 3 + K 4 E D 1-2p T q > 1 4 K 1 + K 2 E log 1 + H T /a 2 0 + K 4 E log 1 + D T b 2 0 q = 1 4 From Lemma F.12, we have E a q T +1 D 1-p T ≤ K 5 + K 6 E log a 2 0 + H T a 2 0 + K 7 E   log K 8 + K 9 1 + H T /a 2 0 1/3 b 0    . Now let p = 3(1 -p) 4 -p ∈ 3 7 , √ 7 -2 . Apply Lemma E.1, we can obtain E H p T ≤ 4 max E E p T , E D p T . Now we can give the final proof of Theorem F.1. Proof. First, we have E H T = E T i=1 ∇f (x i , ξ i ) -∇f (x i , ξ i+1 ) 2 = 2 T i=1 Var [∇f (x i , ξ i )] ≤ 2σ 2 T, where the second equation is by the independency of ξ i and ξ i+1 . Now we consider following two cases: Case 1: E D p T ≤ E E p T . In this case, we will finally prove E E p T ≤          2K1 K4 p 1-2p + 2K2 K4 p 1-2p + (2K 4 ) p 2p 1 + 2σ 2 T a 2 0 p 3 q = 1 4 2K 1 + 2 K 2 + K4 3 log 1 + 2σ 2 T a 2 0 + 2K4 p log 4K4 b 2 p 0 p 1 + 2σ 2 T a 2 0 p 3 + b 2 p 0 q = 1 4 . Note that by Holder inequality E E p T = E a (1-2q) p T +1 E p T × a -(1-2q) p T +1 ≤ E p a 1-2q T +1 E T E 1-p a - (1-2q) p 1-p T +1 = E p a 1-2q T +1 E T E 1-p (1 + H T /a 2 0 ) 2(1-2q) p 3(1-p) (a) = E p a 1-2q T +1 E T E 1-p (1 + H T /a 2 0 ) 2p p 3(1-p) (b) ≤ E p a 1-2q T +1 E T E 2p p 3 1 + H T /a 2 0 , where (a) is by 1 -2q = p, (b) is due to 2p p 3(1-p) = 2p(1-p) 2p+1 < 1. First, if q = 1 4 , we have E a 1-2q T +1 E T ≤ K 1 + K 2 E H T /a 2 0 4q-1 3 + K 4 E D 1-2p T (c) ≤ K 1 + K 2 E 1-2p 3 H T /a 2 0 + K 4 E 1-2p p D p T (d) ≤ K 1 + K 2 2σ 2 T /a 2 0 1-2p 3 + K 4 E 1-2p p E p T where (c) is by 4q-1 3 = 1-2p 3 ≤ 1 and p ≥ 3- √ 7 2 ⇒ 1 -2p ≤ 3(1-p) 4-p = p, (d) is by E H T ≤ 2σ 2 T and E D p T ≤ E E p T . Then we know E E p T ≤ E p a 1-2q T +1 E T E 2p p 3 1 + H T /a 2 0 ≤ K 1 + K 2 2σ 2 T /a 2 0 1-2p 3 + K 4 E 1-2p p E p T p 1 + 2σ 2 T /a 2 0 2p p 3 . 
If K 4 E 1-2p p E p T ≤ K 1 + K 2 2σ 2 T /a 2 0 1-2p 3 , we know E 1-2p p E p T ≤ K 1 K 4 + K 2 K 4 2σ 2 T a 2 0 1-2p 3 ⇒ E E p T ≤ K 1 K 4 + K 2 K 4 2σ 2 T a 2 0 1-2p 3 p 1-2p ≤ 2K 1 K 4 p 1-2p + 2K 2 K 4 p 1-2p 2σ 2 T a 2 0 p 3 . If K 4 E 1-2p p E p T ≥ K 1 + K 2 2σ 2 T /a 2 0 1-2p 3 , then we know E E p T ≤ (2K 4 ) p E 1-2p E p T 1 + 2σ 2 T a 2 0 2p p 3 ⇒ E E p T ≤ (2K 4 ) p 2p 1 + 2σ 2 T a 2 0 p 3 . Combining two results, we know when q = 1 4 E E p T ≤ 2K 1 K 4 p 1-2p + 2K 2 K 4 p 1-2p 2σ 2 T a 2 0 p 3 + (2K 4 ) p 2p 1 + 2σ 2 T a 2 0 p 3 ≤ 2K 1 K 4 p 1-2p + 2K 2 K 4 p 1-2p + (2K 4 ) p 2p 1 + 2σ 2 T a 2 0 p 3 . Following a similar approach, we can prove for q = 1 4 ,there is E E p T ≤   K 1 + K 2 log 1 + 2σ 2 T a 2 0 + K 4 p log   1 + E E p T b 2 p 0     p 1 + 2σ 2 T a 2 0 p 3 Now we use Lemma I.9 to get E E p T ≤     2K 1 + 2K 2 log 1 + 2σ 2 T a 2 0 + 2K 4 p log 4K 4 1 + 2σ 2 T a 2 0 p 3 b 2 p 0     p 1 + 2σ 2 T a 2 0 p 3 + b 2 p 0 = 2K 1 + 2 K 2 + K 4 3 log 1 + 2σ 2 T a 2 0 + 2K 4 p log 4K 4 b 2 p 0 p 1 + 2σ 2 T a 2 0 p 3 + b 2 p 0 . Finally, we have E E p T ≤          2K1 K4 p 1-2p + 2K2 K4 p 1-2p + (2K 4 ) p 2p 1 + 2σ 2 T a 2 0 p 3 q = 1 4 2K 1 + 2 K 2 + K4 3 log 1 + 2σ 2 T a 2 0 + 2K4 p log 4K4 b 2 p 0 p 1 + 2σ 2 T a 2 0 p 3 + b 2 p 0 q = 1 4 . Case 2: E D p T ≥ E E p T . In this case, we will finally prove E D p T ≤ K 5 + K 6 + K 7 3 log 1 + 2σ 2 T a 2 0 + K 7 log K 8 + K 9 b 0 p 1-p 1 + 2σ 2 T a 2 0 p 3 Note that by Holder inequality E D p T = E a pq 1-p T +1 D p T × a -pq 1-p T +1 ≤ E p 1-p a q T +1 D 1-p T E 1-p-p 1-p a -pq 1-p-p T +1 = E p 1-p a q T +1 D 1-p T E 1-p-p 1-p 1 + H T /a 2 0 2 pq 3(1-p-p) = E p 1-p a q T +1 D 1-p T E p 3 1 + H T /a 2 0 , where the last step is by 2 pq 3(1-p-p) = (1-p) p 3(1-p-p) = 1. 
We know E a q T +1 D 1-p T ≤ K 5 + K 6 E log a 2 0 + H T a 2 0 + K 7 E   log K 8 + K 9 1 + H T /a 2 0 1/3 b 0    (e) ≤ K 5 + K 6 log a 2 0 + E H T a 2 0 + K 7 log K 8 + K 9 E 1 + H T /a 2 0 1/3 b 0 (f ) ≤ K 5 + K 6 log a 2 0 + E H T a 2 0 + K 7 log K 8 + K 9 1 + E H T /a 2 0 1/3 b 0 (g) ≤ K 5 + K 6 log a 2 0 + 2σ 2 T a 2 0 + K 7 log K 8 + K 9 1 + 2σ 2 T /a 2 0 1/3 b 0 , where (e) is by the concavity of log function, (f ) holds due to E X 1/3 ≤ E 1/3 [X] for X ≥ 0, (g) is by E H T ≤ 2σ 2 T . Then we have E D p T ≤ E p 1-p a q T +1 D 1-p T E p 3 1 + H T /a 2 0 ≤ K 5 + K 6 log a 2 0 + 2σ 2 T a 2 0 + K 7 log K 8 + K 9 1 + 2σ 2 T /a 2 0 1/3 b 0 p 1-p 1 + 2σ 2 T a 2 0 p 3 ≤ K 5 + K 6 + K 7 3 log 1 + 2σ 2 T a 2 0 + K 7 log K 8 + K 9 b 0 p 1-p 1 + 2σ 2 T a 2 0 p 3 . Finally, combining Case 1 and Case 2 and using (31), we get the desired result and finish the proof E H p T ≤4 max E E p T , E D p T ≤4          2K1 K4 p 1-2p + 2K2 K4 p 1-2p + (2K 4 ) p 2p 1 + 2σ 2 T a 2 0 p 3 q = 1 4 2K 1 + 2 K 2 + K4 3 log 1 + 2σ 2 T a 2 0 + 2K4 p log 4K4 b 2 p 0 p 1 + 2σ 2 T a 2 0 p 3 + b 2 p 0 q = 1 4 + 4 K 5 + K 6 + K 7 3 log 1 + 2σ 2 T a 2 0 + K 7 log K 8 + K 9 b 0 p 1-p 1 + 2σ 2 T a 2 0 p 3 .

G ANALYSIS OF META-STORM-SG FOR GENERAL p

In this section, we give a general analysis for our Algorithm META-STORM-SG. Readers will see p = 1 2 is a very special corner case. First we recall the choices of a t and b t : a t+1 = (1 + t i=1 ∇f (x i , ξ i ) /a 2 0 ) -2/3 , b t = (b 1/p 0 + t i=1 d i 2 ) p /a q t+1 where p, q satisfy p + 2q = 1, p ∈ 1 4 , 1 2 . a 0 > 0 and b 0 > 0 are absolute constants. Naturally, we have a 1 = 1. We will finally prove the following theorem. Theorem G.1. Under the assumptions 1-4, by defining p = 2(1-p) 3 ∈ 1 3 , 1 2 , we have E H p T ≤4C 9 1 2σ 2 T p ≤ 4C 9 + 4C 10 1 2σ 2 T p ≤ 4C 10 + 4          2C1 C3 p 1-2p + 2C2 C3 p 1-2p + (2C 3 ) p 2p 1 + 2(2σ 2 T ) p a 2 p 0 1 3 p = 1 2 C 1 + C2 p + C3 p log 1 + (2σ 2 T ) p min{a 2 p 0 /2,4b 2 p 0 } p 1 + 2(2σ 2 T ) p a 2 p 0 1 3 p = 1 2 . + 4 C 4 + (3C 5 + C 6 ) log a 2/3 0 + 2 2σ 2 T 1/3 a 2/3 0 + C 6 log 2C 7 + 2C 8 b 0 p 1-p × 1 + 2 2σ 2 T p a 2 p 0 1/3 where C i , i ∈ [10] are some constants only depending on a 0 , b 0 , σ, G, β, p, q, F (x 1 ) -F * . To simplify our final bound, we only indicate the dependency on β and F (x 1 ) -F * when σ = 0 and T is big enough to eliminate C 9 and C 10 E H p T = O (F (x 1 ) -F * ) p 1-p + β p p log p 1-p β + β p p log p 1-p 1 + σ 2 T (1 + σ 2 T ) p 3 . Remark G.2. For all i ∈ [10], the constant C i will be defined in the proof that follows. Again, by the concavity of x p , we have the following convergence theorem, of which the proof is omitted. Theorem G.3. Under the assumptions 1-4 by defining p = 2(1-p) 3 ∈ 1 3 , 1 2 , when σ = 0 and T is big enough, we have E ∇F (x out ) 2 p = O (F (x 1 ) -F * ) p 1-p + β p p log p 1-p β + β p p log p 1-p 1 + σ 2 T 1 T p + σ 2 p/3 T 2 p/3 . Here, we give a more explicit convergence dependency for p = 1 2 used in Theorem 2.1. Theorem G.4. Under the assumptions 1-4, when p = 1 2 , by setting λ = min 1, (a 0 / G) 2 (which is used in C 4 to C 8 and C 10 ) we get the best dependency on G. 
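To make the stated choices of a_{t+1} and b_t concrete, here is a minimal sketch of the two accumulators as they would be maintained during a run of META-STORM-SG (a hypothetical standalone helper, not the authors' code; the history lists stand in for the accumulated squared norms):

```python
def meta_storm_sg_scales(grad_norms_sq, d_norms_sq, a0=1.0, b0=1.0, p=0.4):
    """Return (a_{t+1}, b_t) after t steps, following the stated formulas.

    grad_norms_sq: list of ||grad f(x_i, xi_i)||^2 for i = 1..t
    d_norms_sq:    list of ||d_i||^2 for i = 1..t
    p in [1/4, 1/2], with q determined by p + 2q = 1.
    """
    q = (1.0 - p) / 2.0
    # a_{t+1} = (1 + sum_i ||grad f(x_i, xi_i)||^2 / a0^2)^{-2/3}
    a_next = (1.0 + sum(grad_norms_sq) / a0 ** 2) ** (-2.0 / 3.0)
    # b_t = (b0^{1/p} + sum_i ||d_i||^2)^p / a_{t+1}^q
    b_t = (b0 ** (1.0 / p) + sum(d_norms_sq)) ** p / a_next ** q
    return a_next, b_t

# With empty history the formulas give a_1 = 1, matching the text.
a1, b1 = meta_storm_sg_scales([], [], a0=1.0, b0=1.0, p=0.4)
assert a1 == 1.0
```

Note that a_{t+1} only decreases as gradient norms accumulate, while b_t grows with the accumulated ‖d_i‖^2, which is what drives the adaptive step size η/b_t.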
For simplicity, under the setting a 0 = b 0 = η = 1, we have E ∇F (x out ) 2/3 = O   W 1 1 σ 2 T 1/3 ≤ W 1 T 1/3 + W 2 + W 3 log 2/3 1 + σ 2 T 1 T 1/3 + σ 2/9 T 2/9   where W 1 = O F (x 1 ) -F * + σ 2 + G 2 + β 1 + G 2 log β + G 2 β , W 2 = O (F (x 1 ) -F * ) 2/3 + σ 4/3 + G 4/3 + (1 + G 4/3 )β 2/3 log 2/3 β + G 2 β and W 3 = O (1 + G 4/3 )β 2/3 . To start with, we first state the following useful bound for a t : Lemma G.5. ∀t ≥ 1, there is a -3/2 t+1 -a -3/2 t ≤ ( G/a 0 ) 2 . Proof. a -3/2 t+1 -a -3/2 t = ∇f (x t , ξ t ) 2 /a 2 0 ≤ ( G/a 0 ) 2 . G.1 ANALYSIS OF E T Following a similar approach, we define a random time τ satisfying τ = max {[T ] , a t ≥ C 0 } where C 0 := min 1, (a 0 / G) 4 . Note that {τ = t} = {a t ≥ C 0 , a t+1 < C 0 } ∈ F t , this means τ is a stopping time. We now prove a useful proposition of τ : Lemma G.6. ∀t ≥ τ + 1, we have a -1 t+1 -a -1 t ≤2/3. Proof. Let h(y) = y 2/3 . Due to the concavity, we know h(y 1 ) -h(y 2 ) ≤ h (y 2 )(y 1 -y 2 ) = 2(y1-y2) 3y 1/3 2 . Now we have a -1 t+1 -a -1 t = (a -3/2 t + ∇f (x t , ξ t ) 2 /a 2 0 ) 2/3 -(a -3/2 t ) 2/3 ≤ 2a 1/2 t ∇f (x t , ξ t ) 2 3a 2 0 ≤ 2a 1/2 t G 2 3a 2 0 ≤ 2 3 where the last step is by a t ≤ a τ +1 < C 0 ≤ (a 0 / G) 4 . G.1.1 BOUND ON E E τ,3/2-2 FOR ∈ 1 4 , 1 2 Similar to the analysis of META-STORM, we choose to bound E E τ,3/2-2 . We first prove the following bound on E E τ,3/2-2 : Lemma G.7. For any ∈ 1 4 , 1 2 , we have E E τ,3/2-2 ≤ σ 2 + 24a 2 0 + 4 G 2 C 2 -1/2 0 + 2η 2 β 2 C 2 -1/2 0 E T t=1 d t 2 b 2 t . Proof. We start from Lemma E.2, a t+1 t 2 ≤ t 2 -t+1 2 + 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + M t+1 . 
Summing up from 1 to τ -1 and taking the expectations on both sides, we obtain E [E τ -1,1 ] ≤ E τ -1 t=1 t 2 -t+1 2 + 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + M t+1 = E 1 2 -τ 2 + τ -1 t=1 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + M t+1 ≤ E 1 2 -τ 2 + T t=1 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + τ -1 t=1 M t+1 ⇒ E E τ -1,1 + τ 2 ≤ σ 2 + E T t=1 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + τ -1 t=1 M t+1 Because C 0 ≤ 1, a τ +1 ≤ 1, 2 -1/2 ≥ 0 and 3/2 -2 ≥ 0, so we have C 2 -1/2 0 a 3/2-2 τ +1 ≤ 1. Besides, for t ≤ τ -1, by the definition of τ , we have C 0 ≤ a t+1 , then we know C 2 -1/2 0 a 3/2-2 t+1 ≤ a 2 -1/2 t+1 a 3/2-2 t+1 = a t+1 . These two results give us C 2 -1/2 0 E τ,3/2-2 = C 2 -1/2 0 τ t=1 a 3/2-2 t+1 t 2 ≤ τ -1 t=1 a t+1 t 2 + τ 2 = E τ -1,1 + τ 2 , which implies E C 2 -1/2 0 E τ,3/2-2 ≤ σ 2 + E T t=1 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + τ -1 t=1 M t+1 LetM t := t i=1 M i ∈ F t with M 1 = 0. For s ≤ t, we know E [M t |F s ] = 0, hence M t is a martingale. Note that τ is a bounded stopping time, hence by optional sampling theorem E τ -1 t=1 M t+1 = E [M τ ] = 0. Now we have E C 2 -1/2 0 E τ,3/2-2 ≤ σ 2 + E T t=1 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 . By Lemma E.3 E Z t+1 2 | F t ≤ η 2 β 2 d t 2 b 2 t . Besides, under our current choice, a t+1 ∈ F t , E a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 |F t =a 2 t+1 E ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 |F t ≤a 2 t+1 E ∇f (x t+1 , ξ t+1 ) 2 |F t . Using these two bounds, we have E C 2 -1/2 0 E τ,3/2-2 ≤ σ 2 + E T t=1 2η 2 β 2 d t 2 b 2 t + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) 2 = σ 2 + E T t=1 2η 2 β 2 d t 2 b 2 t + 2a 2 0 × ∇f (x t+1 , ξ t+1 ) 2 /a 2 0 (1 + t i=1 ∇f (x i , ξ i ) 2 /a 2 0 ) 4/3 ≤ σ 2 + 24a 2 0 + 4 G 2 + 2η 2 β 2 E T t=1 d t 2 b 2 t , where the last inequality holds by Lemma I.4. 
Dividing both sides by C 2 -1/2 0 , we get the desired bound immediately E E τ,3/2-2 ≤ σ 2 + 24a 2 0 + 4 G 2 C 2 -1/2 0 + 2η 2 β 2 C 2 -1/2 0 E T t=1 d t 2 b 2 t . G.1.2 BOUND ON E [E T,1-2 ] FOR ∈ 1 4 , 1 2 With the previous result on E E τ,3/2-2 , we can bound E [E T,1-2 ]. Lemma G.8. For any ∈ 1 4 , 1 2 , we have E [E T,1-2 ] ≤ C 1 ( ) + C 2 ( )      E H T /a 2 0 4 -1 3 > 1 4 E log 1 + H T /a 2 0 = 1 4 + E T t=1 G 2 a 2 0 C 2 -1/2 0 a 2 t+1 + 1 6η 2 β 2 d t 2 a 2 t+1 b 2 t , where C 1 ( ) := 3   σ 2 + 6 G 2 + G 2 σ 2 + 24a 2 0 + 4 G 2 a 2 0 C 2 -1/2 0   C 2 ( ) := 18a 2 0 4 -1 > 1 4 6a 2 0 = 1 4 . Proof. Starting from Lemma E.2 as well a t+1 t 2 ≤ t 2 -t+1 2 + 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + M t+1 . Dividing both sides by a 2 t+1 and taking expectations, we have E a 1-2 t+1 t 2 ≤ E t 2 a 2 t+1 - t+1 2 a 2 t+1 + 2 a 2 t+1 Z t+1 2 + 2a 2-2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + M t+1 a 2 t+1 . ( ) Note that under our current choice, a t+1 ∈ F t , hence we have E M t+1 a 2 t+1 = E E [M t+1 |F t ] a 2 t+1 = 0; E Z t+1 2 a 2 t+1 = E E Z t+1 2 |F t a 2 t+1 ≤ E η 2 β 2 d t 2 a 2 t+1 b 2 t ; E a 2-2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 = E a 2-2 t+1 E ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 |F t ≤ E a 2-2 t+1 ∇f (x t+1 , ξ t+1 ) 2 , where the second bound holds by Lemma E.3. Plugging these three bounds into (32), we know E a 1-2 t+1 t 2 ≤ E t 2 a 2 t+1 - t+1 2 a 2 t+1 + 2η 2 β 2 d t 2 a 2 t+1 b 2 t + 2a 2-2 t+1 ∇f (x t+1 , ξ t+1 ) 2 . Now sum up from 1 to T to get E [E T,1-2 ] ≤E T t=1 t 2 a 2 t+1 - t+1 2 a 2 t+1 + 2η 2 β 2 d t 2 a 2 t+1 b 2 t + 2a 2-2 t+1 ∇f (x t+1 , ξ t+1 ) 2 ≤σ 2 + E       T t=1 a -2 t+1 -a -2 t t 2 (i) +2η 2 β 2 T t=1 d t 2 a 2 t+1 b 2 t + T t=1 2a 2-2 t+1 ∇f (x t+1 , ξ t+1 ) 2 (ii)       . 
For (i), we split the time by τ T t=1 a -2 t+1 -a -2 t t 2 = τ t=1 a -2 t+1 -a -2 t t 2 + T t=τ +1 a -2 t+1 -a -2 t t 2 ≤ τ t=1 a -3/2 t+1 -a -3/2 t a 3/2-2 t+1 t 2 + T t=τ +1 a -1 t+1 -a -1 t a 1-2 t+1 t 2 ≤ G 2 a 2 0 τ t=1 a 3/2-2 t+1 t 2 + T t=τ +1 2 3 a 1-2 t+1 t 2 ≤ G 2 a 2 0 τ t=1 a 3/2-2 t+1 t 2 + T t=1 2 3 a 1-2 t+1 t 2 = G 2 a 2 0 E τ,3/2-2 + 2 3 E T,1-2 , where the second inequality is by Lemma G.5 and Lemma G.6. Next, for (ii), we use Lemma I.2 to get T t=1 2a 2-2 t+1 ∇f (x t+1 , ξ t+1 ) 2 =2a 2 0 T t=1 ∇f (x t+1 , ξ t+1 ) 2 /a 2 0 1 + t i=1 ∇f (x i , ξ i ) 2 /a 2 0 4(1-)/3 ≤2a 2 0 ×   3 G 2 a 2 0 +    1 1-4(1-)/3 T i=1 ∇f (xi,ξi) 2 a 2 0 1-4(1-)/3 4(1 -)/3 < 1 log 1 + T i=1 ∇f (xi,ξi) 2 a 2 0 4(1 -)/3 = 1   =6 G 2 +      6a 2 0 4 -1 H T /a 2 0 4 -1 3 > 1 4 2a 2 0 log 1 + H T /a 2 0 = 1 4 . Plugging these two bounds into (33), we have E [E T,1-2 ] ≤ σ 2 + 6 G 2 + E G 2 a 2 0 E τ,3/2-2 + 2 3 E T,1-2 + 2η 2 β 2 T t=1 d t 2 a 2 t+1 b 2 t +      6a 2 0 4 -1 H T /a 2 0 4 -1 3 > 1 4 2a 2 0 log 1 + H T /a 2 0 = 1 4 . Thus E [E T,1-2 ] ≤ 3 σ 2 + 6 G 2 + 3 G 2 a 2 0 E E τ,3/2-2 +      18a 2 0 4 -1 E H T /a 2 0 4 -1 3 > 1 4 6a 2 0 E log 1 + H T /a 2 0 = 1 4 + 6η 2 β 2 E T t=1 d t 2 a 2 t+1 b 2 t Plugging the bound on E E τ,3/2-2 in Lemma G.7, we finally get E [E T,1-2 ] ≤ 3   σ 2 + 6 G 2 + G 2 σ 2 + 24a 2 0 + 4 G 2 a 2 0 C 2 -1/2 0   C1( ) +C 2 ( )      E H T /a 2 0 4 -1 3 > 1 4 E log 1 + H T /a 2 0 = 1 4 + E T t=1 G 2 a 2 0 C 2 -1/2 0 a 2 t+1 + 1 6η 2 β 2 d t 2 a 2 t+1 b 2 t , where C 2 ( ) := 18a 2 0 4 -1 > 1 4 6a 2 0 = 1 4 . G.1.3 BOUND ON E E T,1/2 The following bound on E E T,1/2 will be useful when we bound D T . Corollary G.9. We have E E T,1/2 ≤ C 1 (1/4) + C 2 (1/4) E log 1 + H T /a 2 0 + E T t=1 G 2 a 2 0 a 1/2 t+1 + 1 6η 2 β 2 d t 2 a 1/2 t+1 b 2 t . Proof. Take = 1 4 in Lemma G.8. G.1.4 BOUND ON E a 1-2q T +1 E T Lemma G.10. 
Given p + 2q = 1,p ∈ 1 4 , 1 2 , we have E a 1-2q T +1 E T ≤      C 1 + C 2 E H T /a 2 0 4q-1 3 + C 3 E D 1-2p T q > 1 4 C 1 + C 2 E log 1 + H T /a 2 0 + C 3 E log 1 + D T b 2 0 q = 1 4 , where C 1 := C 1 (q) C 2 := C 2 (q) C 3 :=      G 2 a 2 0 C 2q-1/2 0 + 1 6η 2 β 2 4q-1 q > 1 4 G 2 a 2 0 + 1 6η 2 β 2 q = 1 4 . Proof. When p = 1 2 ⇔ q > 1 4 , by Lemma G.8, taking = q, we know E [E T,1-2q ] ≤ C 1 (q) + C 2 (q)E H T /a 2 0 4q-1 3 + E T t=1 G 2 a 2 0 C 2q-1/2 0 a 2q t+1 + 1 6η 2 β 2 d t 2 a 2q t+1 b 2 t ≤ C 1 (q) + C 2 (q)E H T /a 2 0 4q-1 3 + G 2 a 2 0 C 2q-1/2 0 + 1 6η 2 β 2 E T t=1 d t 2 a 2q t+1 b 2 t (a) = C 1 (q) + C 2 (q)E H T /a 2 0 4q-1 3 + G 2 a 2 0 C 2q-1/2 0 + 1 6η 2 β 2 × E    T t=1 d t 2 b 1/p 0 + t i=1 d i 2 2p    (b) ≤ C 1 (q) + C 2 (q)E H T /a 2 0 4q-1 3 + G 2 a 2 0 C 2q-1/2 0 + 1 6η 2 β 2 E D 1-2p T 1 -2p (c) = C 1 (q) + C 2 (q)E H T /a 2 0 4q-1 3 + G 2 a 2 0 C 2q-1/2 0 + 1 6η 2 β 2 4q -1 E D 1-2p T , where (a) is by When p = 1 2 ⇔ q = 1 4 , by a similar argument, we have a 2q t+1 b 2 t = a 2q E [E T,1-2q ] ≤ C 1 (q) + C 2 (q)E log 1 + H T /a 2 0 + G 2 a 2 0 + 1 6η 2 β 2 E log 1 + D T b 2 0 . Now we can define C 3 :=      G 2 a 2 0 C 2q-1/2 0 + 1 6η 2 β 2 4q-1 q > 1 4 G 2 a 2 0 + 1 6η 2 β 2 q = 1 4 . The final step is by noticing for 1 -2q = p > 0 E T,1-2q = T t=1 a 1-2q t+1 t 2 ≥ a 1-2q T +1 T t=1 t 2 = a 1-2q T +1 E T .

G.2 ANALYSIS OF D T

We will prove the following bound Lemma G.11. Given p + 2q = 1,p ∈ 1 4 , 1 2 , we have E a q T +1 D 1-p T ≤ C 4 + C 5 E log a 2 0 + H T a 2 0 + C 6 E   log C 7 + C 8 1 + H T /a 2 0 1/3 b 0    where C 4 := b 1 p -1 0 + 2 η (F (x 1 ) -F * ) + λC 1 (1/4) ηβ max , C 5 := λC 2 (1/4) ηβ max , C 6 := (C 7 + C 8 ) 1 p -1 1 -p , C 7 := 1 + 6λ G 2 a 2 0 ηβ max , C 8 := 1 λ + 6λ ηβ max , λ > 0 can be any number. Proof. The same as before, we start from Lemma E.4 E a q T +1 D 1-p T ≤ b 1 p -1 0 + 2 η (F (x 1 ) -F * )+E T t=1 ηβ max + ηβ max a 1/2 t+1 λ -b t d t 2 b 2 t + λE E T,1/2 ηβ max where λ > 0 is used to reduce the order of G in the final bound. In the proof of the general case , we don't choose λ explicitly anymore. Plugging in the bound on E E T,1/2 in Corollary G.9, we know E a q T +1 D 1-p T ≤ b 1 p -1 0 + 2 η (F (x 1 ) -F * ) + λC 1 (1/4) ηβ max + λC 2 (1/4) ηβ max E log a 2 0 + H T a 2 0 + E T t=1 1 + 6λ G 2 a 2 0 ηβ max + 1 λ + 6λ ηβ max a 1/2 t+1 -b t d t 2 b 2 t = C 4 + C 5 E log a 2 0 + H T a 2 0 + E       T t=1 1 + 6λ G 2 a 2 0 ηβ max + 1 λ + 6λ ηβ max a 1/2 t+1 -b t d t 2 b 2 t (i)       . Applying Lemma E.5 to (i), we get (i) ≤ 1 + 6λ G 2 a 2 0 + 1 λ + 6λ ηβ max 1 p -1 1 -p × log 1 + 6λ G 2 a 2 0 ηβ max + 1 λ + 6λ ηβ max 1 + H T /a 2 0 1/3 b 0 = C 6 log C 7 + C 8 1 + H T /a 2 0 1/3 b 0 By using this bound to (34), the proof is completed. G.3 COMBINE THE BOUNDS AND THE FINAL PROOF. From Lemma G.10, we have E a 1-2q T +1 E T ≤      C 1 + C 2 E H T /a 2 0 4q-1 3 + C 3 E D 1-2p T q > 1 4 C 1 + C 2 E log 1 + H T /a 2 0 + C 3 E log 1 + D T b 2 0 q = 1 4 From Lemma G.11, we have E a q T +1 D 1-p T ≤ C 4 + C 5 E log a 2 0 + H T a 2 0 + C 6 E   log C 7 + C 8 1 + H T /a 2 0 1/3 b 0    Now let p = 2(1 -p) 3 ∈ 1 3 , 1 2 . Apply Lemma E.1, we have E H p T ≤ 2 p+1 max E E p T , E D p T ≤ 4 max E E p T , E D p T , Now we can give the final proof of Theorem G.1. Proof. 
First, we have E H p T = E   T i=1 ∇f (x i , ξ i ) 2 p   ≤ E   T i=1 2 ∇F (x i ) 2 + 2 ∇f (x i , ξ i ) -∇F (x i ) 2 p   = E   2H T + 2 T i=1 ∇f (x i , ξ i ) -∇F (x i ) 2 p   ≤ E   2 p H p T + 2 T i=1 ∇f (x i , ξ i ) -∇F (x i ) 2 p   = 2 p E H p T + E   2 T i=1 ∇f (x i , ξ i ) -∇F (x i ) 2 p   ≤ 2 p E H p T + E p 2 T i=1 ∇f (x i , ξ i ) -∇F (x i ) 2 ≤ 2 p E H p T + 2σ 2 T p ≤ 2 2 p+1 max E E p T , E D p T + 2σ 2 T p ≤ 4 max E E p T , E D p T + 2σ 2 T p . ( ) Now we consider following two cases: Case 1: E E p T ≥ E D p T . In this case, we will finally prove E E p T ≤                      2C1 C3 p 1-2p + 2C2 C3 p 1-2p + (2C 3 ) p 2p 1 + 2(2σ 2 T ) p a 2 p 0 1 3 +C 9 1 2σ 2 T p ≤ 4C 9 q = 1 4 C 1 + C2 p + C3 p log 1 + (2σ 2 T ) p min{a 2 p 0 /2,4b 2 p 0 } p 1 + 2(2σ 2 T ) p a 2 p 0 1 3 +C 9 1 2σ 2 T p ≤ 4C 9 q = 1 4 . where C 9 is a constant. Note that by Holder inequality E E p T = E a (1-2q) p T +1 E p T × a -(1-2q) p T +1 ≤ E p a 1-2q T +1 E T E 1-p a -(1-2q) p 1-p T +1 = E p a 1-2q T +1 E T E 1-p (1 + H T /a 2 0 ) 2(1-2q) p 3(1-p) (a) = E p a 1-2q T +1 E T E 1-p (1 + H T /a 2 0 ) 2p p 3(1-p) (b) ≤ E p a 1-2q T +1 E T E 2p 3 (1 + H T /a 2 0 ) p ≤ E p a 1-2q T +1 E T E 2p 3 1 + H T /a 2 0 p where (a) is by 1 -2q = p, (b) is due to 2p 3(1-p) = 2p 1+2p < 1. First, if q = 1 4 , we have E a 1-2q T +1 E T ≤ C 1 + C 2 E H T /a 2 0 4q-1 3 + C 3 E D 1-2p T (c) ≤ C 1 + C 2 E 1-2p 3 p H T /a 2 0 p + C 3 E 1-2p p D p T (d) ≤ C 1 + C 2   4E E p T + 2σ 2 T p a 2 p 0   1-2p 3 p + C 3 E 1-2p p E p T , where (c) is by 4q-1 3 (d) is by ( 36) and = 1-2p 3 ≤ 2-2p 3 = p and p ≥ 1 4 ⇒ 1 -2p ≤ 2-2p 3 = p, E D p T ≤ E E p T . Then we know E E p T ≤ E p a 1-2q T +1 E T E 2p 3 1 + H T /a 2 0 p ≤     C 1 + C 2   4E E p T + 2σ 2 T p a 2 p 0   1-2p 3 p + C 3 E 1-2p p E p T     p ×   1 + 4E E p T + 2σ 2 T p a 2 p 0   2p 3 . 
If 4E E p T ≤ 2σ 2 T p , we will get E E p T ≤   C1 + C 2 2 2σ 2 T p a 2 p 0 1-2p 3 p + C 3 E 1-2p p E p T    p 1 + 2 2σ 2 T p a 2 p 0 2p 3 . If C 3 E 1-2p p E p T ≤ C 1 + C 2 2(2σ 2 T ) p a 2 p 0 1-2p 3 p , we have E 1-2p p E p T ≤ C 1 C 3 + C 2 C 3 2 2σ 2 T p a 2 p 0 1-2p 3 p ⇒ E E p T ≤    C 1 C 3 + C 2 C 3 2 2σ 2 T p a 2 p 0 1-2p 3 p    p 1-2p ≤ 2C 1 C 3 p 1-2p + 2C 2 C 3 p 1-2p 2 2σ 2 T p a 2 p 0 1 3 . If C 3 E 1-2p p E p T ≥ C 1 + C 2 2(2σ 2 T ) p a 2 p 0 1-2p 3 p , we have E E p T ≤ 2C 3 E 1-2p p E p T p 1 + 2 2σ 2 T p a 2 p 0 2p 3 = (2C 3 ) p E 1-2p E 2(1-p) 3 T 1 + 2 2σ 2 T p a 2 p 0 2p 3 ⇒ E E p T ≤ (2C 3 ) p 2p 1 + 2 2σ 2 T p a 2 p 0 1 3 . Combining two cases, we know under 4E E p T ≤ 2σ 2 T p E E p T ≤ 2C 1 C 3 p 1-2p + 2C 2 C 3 p 1-2p 2 2σ 2 T p a 2 p 0 1 3 + (2C 3 ) p 2p 1 + 2 2σ 2 T p a 2 p 0 1 3 ≤ 2C 1 C 3 p 1-2p + 2C 2 C 3 p 1-2p + (2C 3 ) p 2p 1 + 2 2σ 2 T p a 2 p 0 1 3 . Now if 4E E p T ≥ 2σ 2 T p , then we have E E p T ≤     C 1 + C 2   8E E p T a 2 p 0   1-2p 3 p + C 3 E 1-2p p E p T     p   1 + 8E E p T a 2 p 0   2p 3 ≤     C p 1 + C p 2   8E E p T a 2 p 0   1-2p 3 + C p 3 E 1-2p E p T       1 + 8E E p T a 2 p 0   2p 3 . ( ) We claim there is a constant C 9 such that E E p T ≤ C 9 because the highest order of E E p T is only 1 -2p + 2p 3 = 1 -4p 3 < 1. Here we give the order of C 9 directly without proof C 9 = O a 2 p 0 + C 1 C 3 p 1-2p + C 3 p 2 2 + C 3 p 4p 3 1 a p 0 . Hence, when q = 1 4 , we finally have E E p T ≤ 2C 1 C 3 p 1-2p + 2C 2 C 3 p 1-2p + (2C 3 ) p 2p 1 + 2 2σ 2 T p a 2 p 0 1 3 +C 9 1 2σ 2 T p ≤ 4C 9 . Following a similar approach, we can prove for q = 1 4 ,there is E E p T ≤   C 1 + C 2 p + C 3 p log   1 + 2σ 2 T p min a 2 p 0 /2, 4b 2 p 0     p 1 + 2 2σ 2 T p a 2 p 0 1 3 + C 9 , where C 9 = O C 1/2 1 + C 1/2 2 + C 1/2 3 log 1/2 C 2 + C 3 a 2 p 0 b p 0 + a 2 p 0 + a 3 p 0 + a p 0 b 2 p 0 . 
Finally, we have E E p T ≤                      2C1 C3 p 1-2p + 2C2 C3 p 1-2p + (2C 3 ) p 2p 1 + 2(2σ 2 T ) p a 2 p 0 1 3 +C 9 1 2σ 2 T p ≤ 4C 9 q = 1 4 C 1 + C2 p + C3 p log 1 + (2σ 2 T ) p min{a 2 p 0 /2,4b 2 p 0 } p 1 + 2(2σ 2 T ) p a 2 p 0 1 3 +C 9 1 2σ 2 T p ≤ 4C 9 q = 1 4 . Case 2: E E p T ≤ E D p T . In this case, we will finally prove E D p T ≤ C 4 + (3C 5 + C 6 ) log a 2/3 0 + 2 2σ 2 T 1/3 a 2/3 0 + C 6 log 2C 7 + 2C 8 b 0 p 1-p × 1 + 2 2σ 2 T p a 2 p 0 1/3 + C 10 . where C 10 is a constant. Note that by Holder inequality E D p T = E a q p 1-p T +1 D p T × a -q p 1-p T +1 ≤ E p 1-p a q T +1 D 1-p T E 1-p-p 1-p a -q p 1-p-p T +1 = E p 1-p a q T +1 D 1-p T E 1-p-p 1-p 1 + H T /a 2 0 2q p 3(1-p-p) (e) ≤ E p 1-p a q T +1 D 1-p T E 1 3 1 + H T /a 2 0 p ≤ E p 1-p a q T +1 D 1-p T E 1 3 1 + H T /a 2 0 p where (e) is by 2q 3(1-p-p) = 1-p 3(1-p-p) = 1. We know E a q T +1 D 1-p T ≤C 4 + C 5 E log a 2 0 + H T a 2 0 + C 6 E   log C 7 + C 8 1 + H T /a 2 0 1/3 b 0    = C 4 + C 5 p E   log a 2 0 + H T a 2 0 p   + C 6 3 p E     log    C 7 + C 8 1 + H T /a 2 0 1/3 b 0    3 p     (f ) ≤ C 4 + C 5 p E log a 2 p 0 + H p T a 2 p 0 + C 6 3 p E     log (2C 7 ) 3 p + (2C 8 ) 3 p 1 + H T /a 2 0 p b 3 p 0     (g) ≤ C 4 + C 5 p log a 2 p 0 + E H p T a 2 p 0 + C 6 3 p log (2C 7 ) 3 p + (2C 8 ) 3 p 1 + E H p T /a 2 p 0 b 3 p 0 (h) ≤ C 4 + C 5 p log a 2 p 0 + 4E D p T + 2σ 2 T p a 2 p 0 + C 6 3 p log (2C 7 ) 3 p + (2C 8 ) 3 p 1 + 4E[D p T ]+(2σ 2 T ) p a 2 p 0 b 3 p 0 where (f ) is by (x + y) p ≤ x p + y p ,(x + y) q ≤ (2x) q + (2y) q for 0 ≤ x, y, 0 ≤ p ≤ 1, q ≥ 0, (g) holds by the concavity of log function, (h) is due to (36) and E E p T ≤ E D p T . Then we know E D p T ≤ E p 1-p a q T +1 D 1-p T E 1 3 1 + H T /a 2 0 p ≤   C 4 + C 5 p log a 2 p 0 + 4E D p T + 2σ 2 T p a 2 p 0 + C 6 3 p log (2C 7 ) 3 p + (2C 8 ) 3 p 1 + 4E[D p T ]+(2σ 2 T ) p a 2 p 0 b 3 p 0     p 1-p ×   1 + 4E D p T + 2σ 2 T p a 2 p 0   1/3 . 
If 4E D p T ≤ 2σ 2 T p , we will get E D p T ≤     C 4 + C 5 p log a 2 p 0 + 2 2σ 2 T p a 2 p 0 + C 6 3 p log (2C 7 ) 3 p + (2C 8 ) 3 p 1 + 2(2σ 2 T ) p a 2 p 0 b 3 p 0     p 1-p × 1 + 2 2σ 2 T p a 2 p 0 1/3 ≤ C 4 + C 5 p + C 6 3 p log a 2 p 0 + 2 2σ 2 T p a 2 p 0 + C 6 3 p log (2C 7 ) 3 p + (2C 8 ) 3 p b 3 p 0 p 1-p × 1 + 2 2σ 2 T p a 2 p 0 1/3 ≤ C 4 + (3C 5 + C 6 ) log a 2/3 0 + 2 2σ 2 T 1/3 a 2/3 0 + C 6 log 2C 7 + 2C 8 b 0 p 1-p × 1 + 2 2σ 2 T p a 2 p 0 1/3 . If 4E D p T ≥ 2σ 2 T p , we have E D p T ≤     C 4 + C 5 p log a 2 p 0 + 8E D p T a 2 p 0 + C 6 3 p log (2C 7 ) 3 p + (2C 8 ) 3 p 1 + 8E[D p T ] a 2 p 0 b 3 p 0     p 1-p ×   1 + 8E D p T a 2 p 0   1/3 . ( ) which implies there is a constant C 10 such that E D p T ≤ C 10 . Here we give the order of C 10 directly without proof C 10 = O a 2 p 0 + a 3 p 0 + C 4 + C 6 log C 7 + C 8 b 0 + (C 5 + C 6 ) log C 5 + C 6 a 3 p 0 Combining these two results, we know E D p T ≤ C 4 + (3C 5 + C 6 ) log a 2/3 0 + 2 2σ 2 1/3 a 2/3 0 + C 6 log 2C 7 + 2C 8 b 0 p 1-p × 1 + 2 2σ 2 T p a 2 p 0 1/3 + C 10 1 2σ 2 T p ≤ 4C 10 . Finally, combining Case 1 and Case 2 and using 35, we get the desired result and the finish the proof E H p T ≤4 max E E p T , E D p T ≤4C 9 1 2σ 2 T p ≤ 4C 9 + 4C 10 1 2σ 2 T p ≤ 4C 10 + 4          2C1 C3 p 1-2p + 2C2 C3 p 1-2p + (2C 3 ) p 2p 1 + 2(2σ 2 T ) p a 2 p 0 1 3 q = 1 4 C 1 + C2 p + C3 p log 1 + (2σ 2 T ) p min{a 2 p 0 /2,4b 2 p 0 } p 1 + 2(2σ 2 T ) p a 2 p 0 1 3 q = 1 4 . + 4 C 4 + (3C 5 + C 6 ) log a 2/3 0 + 2 2σ 2 T 1/3 a 2/3 0 + C 6 log 2C 7 + 2C 8 b 0 p 1-p × 1 + 2 2σ 2 T p a 2 p 0 1/3 H ALGORITHM META-STORM-NA AND ITS ANALYSIS FOR GENERAL p Algorithm META-STORM-NA is shown in Algorithm 5. To highlight the differences with META-STORM-SG and META-STORM, we set a t only based on the time round t, not using the stochastic gradients. 
This is why the convergence of this algorithm does not depend on the bounded stochastic gradients or bounded stochastic gradient differences assumptions. Moreover, the requirement p ∈ (0, 1/2] is also more relaxed compared with our previous algorithms.

Algorithm 5 META-STORM-NA
Input: initial point x_1 ∈ R^d
Parameters: a_0 > √(2/3), b_0, η, p ∈ (0, 1/2], p + 2q = 1
Sample ξ_1 ∼ D, d_1 = ∇f(x_1, ξ_1)
for t = 1, · · · , T do:
    a_{t+1} = (1 + t/a_0^2)^{-2/3}
    b_t = (b_0^{1/p} + Σ_{i=1}^t ‖d_i‖^2)^p / a_{t+1}^q
    x_{t+1} = x_t − (η/b_t) d_t
    Sample ξ_{t+1} ∼ D
    d_{t+1} = ∇f(x_{t+1}, ξ_{t+1}) + (1 − a_{t+1})(d_t − ∇f(x_t, ξ_{t+1}))
end for
Output x_out = x_t where t ∼ Uniform([T]).

Now we give the main convergence result, Theorem H.1, of META-STORM-NA. As we discussed before, it can achieve the rate O(1/T^{1/3}) under the weakest assumptions 1-3, however, losing the adaptivity to the variance parameter σ as a tradeoff.

Theorem H.1. Under the assumptions 1-3, by defining p̄ = 1 − p ∈ [1/2, 1), we have (omitting the dependency on η, a_0 and b_0)

E[H_T^p̄] = O((F(x_1) − F^* + β^p̄ p̄ log(βT) + σ^2 log T + σ^{2p̄}) T^{p̄/3}).

By combining the above theorem with the concavity of x^p̄, we give the following convergence guarantee, omitting the proof:

Theorem H.2. There is

E[‖∇F(x_out)‖^{2p̄}] = O((F(x_1) − F^* + β^p̄ p̄ log(βT) + σ^2 log T + σ^{2p̄}) / T^{2p̄/3}).

Note that 2p̄ ≥ 1, hence the criterion E[‖∇F(x_out)‖^{2p̄}] used in Theorem H.2 is strictly stronger than E[‖∇F(x_out)‖]. In the following sections, we will give a proof of Theorem H.1.

H.1 BOUND ON E[E_{T,1/2}]

Lemma H.3. Given p + 2q = 1, p ∈ (0, 1/2], we have

E[E_{T,1/2}] ≤ (σ^2 (1 + 2a_0^2 log(1 + T/a_0^2)) + 2η^2 β^2 E[Σ_{t=1}^T ‖d_t‖^2 / (a_{t+1}^{1/2} b_t^2)]) / (1 − 2/(3a_0^2)).

Proof. We start from Lemma E.2,

a_{t+1} ‖ε_t‖^2 ≤ ‖ε_t‖^2 − ‖ε_{t+1}‖^2 + 2‖Z_{t+1}‖^2 + 2a_{t+1}^2 ‖∇f(x_{t+1}, ξ_{t+1}) − ∇F(x_{t+1})‖^2 + M_{t+1}.
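For concreteness, the update loop of Algorithm 5 can be sketched as follows on a scalar toy problem (a minimal sketch with a hypothetical noisy-gradient oracle; this is not the authors' implementation, and the toy oracle `lambda x, xi: x - xi` is made up for illustration):

```python
import random

def meta_storm_na(grad, x, T, a0=1.0, b0=1.0, eta=0.1, p=0.5, seed=0):
    """Sketch of META-STORM-NA (Algorithm 5) for a scalar iterate.

    grad(x, xi) returns the sampled gradient of f(x, xi); note the SAME
    sample xi_{t+1} is evaluated at both x_{t+1} and x_t in the momentum
    update, as the algorithm requires.
    """
    rng = random.Random(seed)
    q = (1.0 - p) / 2.0
    xi = rng.gauss(0.0, 1.0)
    d = grad(x, xi)                    # d_1 = grad f(x_1, xi_1)
    sum_d_sq = 0.0
    iterates = []
    for t in range(1, T + 1):
        a_next = (1.0 + t / a0 ** 2) ** (-2.0 / 3.0)   # time-based only
        sum_d_sq += d * d
        b_t = (b0 ** (1.0 / p) + sum_d_sq) ** p / a_next ** q
        x_new = x - (eta / b_t) * d
        xi = rng.gauss(0.0, 1.0)                       # fresh sample xi_{t+1}
        d = grad(x_new, xi) + (1.0 - a_next) * (d - grad(x, xi))
        x = x_new
        iterates.append(x)
    return rng.choice(iterates)        # x_out uniform over the iterates

# Toy oracle: f(x, xi) = (x - xi)^2 / 2 with xi ~ N(0, 1), so F(x) = x^2/2 + c.
out = meta_storm_na(lambda x, xi: x - xi, x=5.0, T=300)
```

Since b_t ≥ |d_t| here, each step moves at most η, so the iterates drift steadily toward the minimizer rather than jumping.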
Dividing both sides by a 1/2 t+1 , summing up from 1 to T and taking the expectations on both sides, we obtain E E T,1/2 ≤E T t=1 t 2 -t+1 2 + 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + M t+1 a 1/2 t+1 ≤σ 2 + E T t=1 a -1 t+1 -a -1 t a 1/2 t+1 t 2 + 2 a 1/2 t+1 Z t+1 2 + 2a 3/2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + M t+1 a 1/2 t+1 Because a t+1 is not random, we know E 2 a 1/2 t+1 Z t+1 2 ≤ E 2η 2 β 2 d t 2 a 1/2 t+1 b 2 t , E 2a 3/2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 ≤ 2a 3/2 t+1 σ 2 , E M t+1 a 1/2 t+1 = 0, where the first inequality is by Lemma E.3. Besides, by the concavity of x 2/3 and a 0 > 2 3 , we know a -1 t+1 -a -1 t = 1 + t/a 2 0 2/3 -1 + (t -1) /a 2 0 2/3 ≤ 2 3a 2 0 (1 + (t -1) /a 2 0 ) 1/3 ≤ 2 3a 2 0 < 1. Then we have E E T,1/2 ≤ σ 2 + E 2 3a 2 0 E T,1/2 + T t=1 2η 2 β 2 d t 2 a 1/2 t+1 b 2 t + 2a 3/2 t+1 σ 2 ⇒ E E T,1/2 ≤ σ 2 1 + 2 T t=1 a 3/2 t+1 + 2η 2 β 2 E T t=1 dt 2 a 1/2 t+1 b 2 t 1 -2/(3a 2 0 ) . Note that T t=1 a 3/2 t+1 = T t=1 1 1 + t/a 2 0 ≤ a 2 0 log 1 + T /a 2 0 . So we know E E T,1/2 ≤ σ 2 1 + 2a 2 0 log 1 + T /a 2 0 + 2η 2 β 2 E T t=1 dt 2 a 1/2 t+1 b 2 t 1 -2/(3a 2 0 ) . H.2 BOUND ON E [E T ] Lemma H.4. Given p + 2q = 1, p ∈ 0, 1 2 , we have E [E T ] ≤ 6a 2 0 σ 2 1 + T /a 2 0 1/3 1 -2/(3a 2 0 ) + 2η 2 β 2 (1 + T /a 2 0 ) 2p 3 1 -2/(3a 2 0 )    E[D 1-2p T ] 1-2p p = 1 2 E log 1 + D T b 2 0 p = 1 2 . Proof. We start from Lemma E.2, a t+1 t 2 ≤ t 2 -t+1 2 + 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + M t+1 . Dividing both sides by a t+1 , summing up from 1 to T and taking the expectations on both sides, we obtain (1 + t/a 2 0 ) E [E T ] ≤E T t=1 t 2 -t+1 2 + 2 Z t+1 2 + 2a 2 t+1 ∇f (x t+1 , ξ t+1 ) -∇F (x t+1 ) 2 + M t+1 a t+1 ≤σ 2 + E T t=1 a -1 t+1 -a -1 t ≤2/(3a 2/3 ≤ 3a 2 0 1 + T /a 2 0 1/3 -3a 2 0 < 3a 2 0 1 + T /a 2 0 1/3 -2. So we know  E [E T ] ≤ + E       T t=1 1 + 9a 2 0 -2 (3a 2 0 -2) a 1/2 t+1 ηβ max -b t d t 2 b 2 t (i)       . 
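The elementary bound Σ_{t=1}^T a_{t+1}^{3/2} = Σ_{t=1}^T 1/(1 + t/a_0^2) ≤ a_0^2 log(1 + T/a_0^2) used above follows from comparing the sum with the integral of 1/(a_0^2 + x). A quick numerical check (with arbitrary made-up values of a_0 and T; only the inequality itself comes from the text):

```python
import math

def lhs(a0, T):
    # sum_{t=1}^{T} a_{t+1}^{3/2} with a_{t+1} = (1 + t/a0^2)^{-2/3}
    return sum(1.0 / (1.0 + t / a0 ** 2) for t in range(1, T + 1))

def rhs(a0, T):
    return a0 ** 2 * math.log(1.0 + T / a0 ** 2)

# Check the bound over a spread of parameter values (a0 > sqrt(2/3)).
for a0 in (0.9, 1.0, 2.0):
    for T in (10, 1000, 100000):
        assert lhs(a0, T) <= rhs(a0, T)
```

The bound is tight up to an additive constant, which is why the σ^2 log T term appears in Theorem H.1.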
Applying Lemma E.5 to (i), we get (i) ≤ Note that a q T +1 = a From Lemma H.4, we have E [E T ] ≤ 6a 2 0 σ 2 1 + T /a 2 0 1/3 1 -2/(3a 2 0 ) + 2η 2 β 2 (1 + T /a 2 0 ) 2p 3 1 -2/(3a 2 0 )    E[D 1-2p T ] 1-2p p = 1 2 E log 1 + D T b 2 0 p = 1 2 . From Lemma H.5, we have Apply Lemma E.1, we know E D 1-p T ≤ 1 + T /a 2 E H p T ≤ 4 max E E p T , E D p T . ( ) Now we can give the final proof of Theorem H.1. Proof. Now we consider following two cases: Case 1: p = 1 2 . Note that by Holder inequality E E p T = E p [E T ] , E D 1-2p T ≤ E 1-2p p D p T . So we know E E p T ≤ 6a 2 0 σ 2 1 + T /a 2 0 1/3 1 -2/(3a 2 0 ) + 2η 2 β 2 (1 + T /a 2 0 ) 2p 3 (1 -2/(3a 2 0 ))(1 -2p) E D 1-2p T p ≤ 6a 2 0 σ 2 1 + T /a 2 0 1/3 1 -2/(3a 2 0 ) + 2η 2 β 2 (1 + T /a 2 0 ) 2p 3 (1 -2/(3a (1 -2/(3a 2 0 ))(1 -2p) (1 -2/(3a 2 0 ))(1 -2p) p E 1-2p E p T Then if 2η 2 β 2 (1+T /a 2 0 ) 2p 3 (1-2/(3a 2 0 ))(1-2p) p E 1-2p E p T ≤ 6a 2 0 σ 2 (1+T/a 2 0 ) 1/3 1-2/(3a p E 1-2p E p T ⇒ E E p T ≤ 2 1 p 2η 2 β 2 (1 -2/(3a 2 0 ))(1 -2p) p 2p 1 + T /a 2 0 p 3 . Hence under E E p T ≥ E D p T , we get  E E p T ≤   2 1 p 2η 2 β 2 (1 -2/(3a 2 0 ))(1 -2p) p 2p + 2 1 p 6a 2 0 σ 2 1 -2/(3a

I BASIC INEQUALITIES

In this section, we prove some technical lemmas used in our proof.

Lemma I.1. For $c_0>0$, $c_{i\ge1}\ge0$, $p\in(0,1]$, we have

Proof. We first prove the case $p=1$. From Lemma 3 in Levy et al. (2021), for $b_1>0$, $b_{i\ge2}\ge0$, $p\in(0,1)$, we have

By the definition of $T_0$, we know that for any $1\le t\le T_0-1$, $c_t=0$. Then we have

where the inequality holds by $1-\frac1x\le\log x$.

Lemma I.2. For $c_0>0$, $c_{i\ge1}\in(0,c]$, $p\in(0,1]$, we have

Lemma I.6. Given $0\le x\le y\le1$ and $0<\frac1\ell\le1$, we have
$$\left(\frac{\left(1-x^{1/\ell}\right)^2}{x^2}-\frac{\left(1-y^{1/\ell}\right)^2}{y^2}\right)^2\le\frac{\left(y^2-x^2\right)^2}{x^4y^4}.$$
Proof. Note that
$$\left(\frac{\left(1-x^{1/\ell}\right)^2}{x^2}-\frac{\left(1-y^{1/\ell}\right)^2}{y^2}\right)^2=\left(\frac{1-x^{1/\ell}}{x}+\frac{1-y^{1/\ell}}{y}\right)^2\left(\frac{1-x^{1/\ell}}{x}-\frac{1-y^{1/\ell}}{y}\right)^2\le\left(\frac1x+\frac1y\right)^2\left(\frac{1-x^{1/\ell}}{x}-\frac{1-y^{1/\ell}}{y}\right)^2.$$
Now let $h(x)=\frac{1-x^{1/\ell}}{x}$; we can find $h'(x)=-\frac{(1-\ell)x^{1/\ell}+\ell}{\ell x^2}\le0$. Hence
$$\frac{1-x^{1/\ell}}{x}-\frac{1-y^{1/\ell}}{y}=h(x)-h(y)\ge0.$$
Besides, let $g(x)=h(x)-\frac1x$; we can find that $g'(x)=\frac{(\ell-1)x^{1/\ell}}{\ell x^2}\ge0$.

Lemma I.9. Given $X,A,B\ge0$, $C>0$, $D\ge0$, $0<u\le1$, if we have
$$X\le\left(A+B\log\left(1+\frac XC\right)\right)^uD,$$
then there is
$$X\le\left(2A+2B\log\left(\frac{4uBD}{C}+\left(\frac CD\right)^{1/u}\right)\right)^uD.$$
Especially, when $D\ge1$, we know
$$X\le\left(2A+2B\log\left(\frac{4uBD}{C}+C^{1/u}\right)\right)^uD.$$
Proof. Let $Y=(X/D)^{1/u}$; then we know



This bound holds when σ > 0 and T is large enough. Link to the code of STORM+: https://github.com/LIONS-EPFL/storm-plus-code. The reader should keep in mind that variance-reduced algorithms like META-STORM require twice as many gradient queries per iteration, so the performance improvement our algorithms exhibit does not come without a cost. Additional plots and further discussion are available in Section B.



Figure 3: Training loss and validation accuracy on SST2. (H) denotes the addition of heuristics.

Figure 4: Training loss and test accuracy for META-STORM on MNIST for different p values. For the heuristics versions of our algorithms, we perform the same experiments and show the results in Figures 6 and 7. Since p = 0.50 attains the lowest training loss for the heuristics versions of both algorithms, we select this value for all our experiments. Default parameters.

Figure 7: Training loss and test accuracy for META-STORM-SG (H) on MNIST for different p values.

Figure 8: Losses and accuracies on CIFAR10.

Figure 10: Losses and accuracies on SST2.

and $D_T$ and $E_T$. The bounded variance assumption on the stochastic gradients gives us a bound on $\mathbb{E}\left[a_{T+1}^{-3/2}\right]$. Thus we obtain an upper bound on $\mathbb{E}\left[H_T^{3/7}\right]$. Finally, applying the concavity of $x^{3/7}$ to $\mathbb{E}\left[H_T^{3/7}\right]$ gives Theorem 2.3.
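The final step rests on the fact that $x\mapsto x^{3/7}$ is concave, so $\mathbb{E}[X^{3/7}]\le(\mathbb{E}[X])^{3/7}$ for any nonnegative random variable by Jensen's inequality. A two-line numerical confirmation (Python; the exponential sample is an arbitrary choice, and the inequality holds exactly for the empirical distribution of any sample):

```python
import random

random.seed(1)
xs = [random.expovariate(1.0) for _ in range(50_000)]

# Jensen for the concave map x -> x^(3/7): E[X^(3/7)] <= (E[X])^(3/7)
mean_pow = sum(x ** (3 / 7) for x in xs) / len(xs)   # E[X^(3/7)]
pow_mean = (sum(xs) / len(xs)) ** (3 / 7)            # (E[X])^(3/7)
```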

(b) is by Lemma I.1, and (c) is by $1-2p=4q-1$.

(a) is by $1-2q=p$, and (b) is by Lemma I.1.

Now we simply take $\lambda=1$ and use Lemma H.3 to get
$$\mathbb{E}\left[a_{T+1}^qD_T^{1-p}\right]\le\frac{\left(F(x_1)-F^*\right)+\sigma^2\left(1+2a_0^2\log\left(1+T/a_0^2\right)\right)}{\eta\beta_{\max}\left(1-2/(3a_0^2)\right)},$$

the desired result.

H.4 COMBINING THE BOUNDS AND THE FINAL PROOF

Together with the factor $\frac{2\eta^2\beta^2}{\left(1-2/(3a_0^2)\right)\left(1-2p\right)}$ from above, this gives
$$\mathbb{E}\left[H_T^p\right]\le O\left(\frac{F(x_1)-F^*+\beta^pp\log\left(\beta T\right)+\sigma^2\log T+\sigma^{2p}}{T^{p/3}}\right).$$

Case 2: $p=\frac12$. By a similar proof, we still have
$$\mathbb{E}\left[H_T^p\right]\le O\left(\frac{F(x_1)-F^*+\beta^pp\log\left(\beta T\right)+\sigma^2\log T+\sigma^{2p}}{T^{p/3}}\right).$$

Now we define $T_0=\min\left\{t\in[T]:c_t>0\right\}$.

where the inequality is by Lemma I.1.

Lemma I.3. For $c_0>0$, $c_{i\ge1}\in(0,c]$, $p\in(0,1]$, we have

where the inequality is by Lemma I.2.

Lemma I.4 (Lemma 6 in Levy et al. (2021)). For $c_{i\ge1}\in(0,c]$, we have

For $c_{i\ge1}\in(0,c]$, we have $\left(\cdots\right)^{4/3}\le12+5c$, where the last inequality is by Lemma I.4.

Note that $\log\left((m-x)x^n\right)=\log(m-x)+n\log x$. By the concavity of the log function (for $n>0$; the case $n=0$ is immediate),
$$\frac{1}{n+1}\log\left(n(m-x)\right)+\frac{n}{n+1}\log x\le\log\left(\frac{n(m-x)+nx}{n+1}\right)=\log\left(\frac{nm}{n+1}\right).$$
Then we know
$$(m-x)x^n\le\frac{m^{n+1}n^n}{(n+1)^{n+1}}.$$
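The inequality $(m-x)x^n\le\frac{m^{n+1}n^n}{(n+1)^{n+1}}$ is a calculus fact: the maximizer of $(m-x)x^n$ on $[0,m]$ is $x=\frac{nm}{n+1}$, at which the bound is attained with equality. A numerical verification by grid search (Python; the parameter values below are illustrative):

```python
def lemma_i8_holds(m, n, steps=1000):
    """Compare max over x in [0, m] of (m - x) * x^n against the
    closed-form bound m^(n+1) * n^n / (n+1)^(n+1).
    Note: Python evaluates 0**0 as 1, which matches the n = 0 case."""
    bound = m ** (n + 1) * n ** n / (n + 1) ** (n + 1)
    best = max((m - x) * x ** n for x in (m * i / steps for i in range(steps + 1)))
    return best <= bound + 1e-9
```

Since the grid maximum can only undershoot the true maximum, a passing check is consistent with the bound being tight at $x=\frac{nm}{n+1}$.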

(a) is by $(x+y)^p\le(2x)^p+(2y)^p$ for $x,y\ge0$, $p\ge1$, and (b) is by $\log x\le x-1\le x$.
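Both elementary facts invoked here, $(x+y)^p\le(2x)^p+(2y)^p$ for $x,y\ge0$, $p\ge1$ (since $x+y\le2\max(x,y)$), and $\log x\le x-1$ for $x>0$, admit a one-screen numerical check (Python; the grids are arbitrary):

```python
import math

def split_power_holds(x, y, p):
    """(x + y)^p <= (2 max(x, y))^p <= (2x)^p + (2y)^p for p >= 1."""
    return (x + y) ** p <= (2 * x) ** p + (2 * y) ** p + 1e-12

def log_bound_holds(x):
    """log x <= x - 1 for x > 0, with equality at x = 1."""
    return math.log(x) <= x - 1 + 1e-12

ok = all(
    split_power_holds(x, y, p)
    for p in (1.0, 1.5, 2.0, 3.0)
    for x in (0.0, 0.3, 1.0, 5.0)
    for y in (0.0, 0.2, 2.0, 7.5)
) and all(log_bound_holds(x) for x in (0.1, 0.5, 1.0, 2.0, 10.0))
```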

Comparison of the convergence rates after $T$ iterations under constant success probability. The assumptions and definitions of the parameters referenced can be found in Section 1.2. Assumptions 1 and 2 are used by all algorithms, so we omit them from the table.



Table of Hyperparameters.

CIFAR10 average training loss across 5 seeds for selected epochs. Lowest loss is bolded per selected epoch.

CIFAR10 average training accuracy across 5 seeds for selected epochs. Highest accuracy is bolded per selected epoch.

CIFAR10 average test loss across 5 seeds for selected epochs. Lowest loss is bolded per selected epoch.

CIFAR10 average test accuracy across 5 seeds for selected epochs. Highest accuracy is bolded per selected epoch.

CIFAR10 accuracy generalization gap (train acc -test acc) of the last epoch's accuracy.

IMDB average training loss across 5 seeds for selected epochs. Lowest loss for each epoch is bolded below.

IMDB test loss. Lowest loss for each epoch is bolded below.

SST2 training loss. Lowest loss for each epoch is bolded below.

SST2 training accuracy. Highest accuracy for each epoch is bolded below.

SST2 validation loss. Lowest loss for each epoch is bolded below.

SST2 validation accuracy. Highest accuracy for each epoch is bolded below. As with CIFAR10, we examine the generalization gap of the different algorithms in Table 15. Here, we see that MS-SG attains the smallest gap between training accuracy and test accuracy, while Adam suffers from the largest generalization gap among the algorithms compared in our experiments.

$$\cdots+2a_{t+1}\left\|\nabla f(x_{t+1},\xi_{t+1})-\nabla F(x_{t+1})\right\|^2+\frac{M_{t+1}}{a_{t+1}},\qquad\mathbb{E}\left[2a_{t+1}\left\|\nabla f(x_{t+1},\xi_{t+1})-\nabla F(x_{t+1})\right\|^2\right]\le2a_{t+1}\sigma^2,$$

This means

$$h(x)-\frac1x-h(y)+\frac1y=g(x)-g(y)\le0,$$
which implies
$$\frac{1-x^{1/\ell}}{x}-\frac{1-y^{1/\ell}}{y}\le\frac1x-\frac1y.$$
Thus we finally have
$$\left(\frac{\left(1-x^{1/\ell}\right)^2}{x^2}-\frac{\left(1-y^{1/\ell}\right)^2}{y^2}\right)^2\le\left(\frac1x+\frac1y\right)^2\left(\frac1x-\frac1y\right)^2=\frac{\left(y^2-x^2\right)^2}{x^4y^4}.$$

Proof. If $\ell=\frac12$, then we know ... By Taylor's expansion, there exists $x\le z\le y$ such that ... This will give us ... $x^2y^{2/\ell-4}$.

Lemma I.8. Given $m,n\ge0$, for $0\le x\le m$, we have
$$(m-x)x^n\le\frac{m^{n+1}n^n}{(n+1)^{n+1}}.$$

