CONTINUIZED ACCELERATION FOR QUASAR CONVEX FUNCTIONS IN NON-CONVEX OPTIMIZATION

Abstract

Quasar convexity is a condition that allows some first-order methods to efficiently minimize a function even when the optimization landscape is non-convex. Previous works develop near-optimal accelerated algorithms for minimizing this class of functions; however, they require a binary-search subroutine that results in multiple gradient evaluations in each iteration, so the total number of gradient evaluations does not match a known lower bound. In this work, we show that a recently proposed continuized Nesterov acceleration can be applied to minimizing quasar convex functions and achieves the optimal bound with high probability. Furthermore, we find that the objective functions of training generalized linear models (GLMs) satisfy quasar convexity, which broadens the applicability of the relevant algorithms, as known practical examples of quasar convexity in non-convex learning are sparse in the literature. We also show that if a smooth and one-point strongly convex, Polyak-Łojasiewicz, or quadratic-growth function satisfies quasar convexity, then an accelerated linear rate for minimizing the function is attainable under certain conditions, while acceleration is not known in general for these classes of functions.

1. INTRODUCTION

Momentum has been the main workhorse for training machine learning models (Kingma & Ba, 2015; Wilson et al., 2017; Loshchilov & Hutter, 2019; Reddi et al., 2018; He et al., 2016; Simonyan & Zisserman, 2015; Krizhevsky et al., 2012). In convex learning and optimization, several momentum methods have been developed under different machineries, including methods built on Nesterov's estimate sequence (Nesterov, 1983; 2013), methods derived from ordinary differential equations and continuous-time techniques (Krichene et al., 2015; Scieur et al., 2017; Attouch et al., 2018; Su et al., 2014; Wibisono et al., 2016; Shi et al., 2018; Diakonikolas & Orecchia, 2019), approaches based on dynamical systems and control (Hu & Lessard, 2017; Wilson et al., 2021), algorithms generated from playing a two-player zero-sum game via no-regret learning strategies (Wang et al., 2021a; Wang & Abernethy, 2018; Cohen et al., 2021), and a recently introduced continuized acceleration (Even et al., 2021). In the non-convex world, on the other hand, although ample empirical evidence confirms that momentum methods converge faster than gradient descent (GD) in several applications, see e.g., Sutskever et al. (2013); Leclerc & Madry (2020), first-order accelerated methods that provably find a global optimal point are sparse in the literature. Indeed, there are only a few results showing acceleration over GD that we are aware of. Wang et al. (2021b) show that Heavy Ball has an accelerated linear rate for training an over-parametrized ReLU network and a deep linear network, where the accelerated linear rate has a square-root dependency on the condition number of a neural tangent kernel matrix at initialization, while the linear rate of GD depends linearly on the condition number. A follow-up work of Wang et al. (2022) shows that Heavy Ball achieves acceleration for minimizing a class of Polyak-Łojasiewicz functions (Polyak, 1963).
When the goal is not finding a global optimal point but a first-order stationary point, some benefits of incorporating the dynamics of momentum can be shown (Cutkosky & Orabona, 2019; Cutkosky & Mehta, 2021; Levy et al., 2021). Nevertheless, theoretically grounded momentum methods in non-convex optimization remain under-investigated to our knowledge. With the goal of advancing the progress of momentum methods in non-convex optimization in mind, we study efficiently solving min_w f(w), where the function f(·) satisfies quasar convexity (Hinder et al., 2020; Hardt et al., 2018; Nesterov et al., 2019; Guminov & Gasnikov, 2017; Bu & Mesbahi, 2020), which is defined in the following. Under quasar convexity, it can be shown that GD or certain momentum methods can globally minimize a function even when the optimization landscape is non-convex.

Definition 1. (Quasar convexity) Let ρ > 0. Denote w* a global minimizer of f(·): R^d → R. The function f(·) is ρ-quasar convex if for all w ∈ R^d, one has:

f(w*) ≥ f(w) + (1/ρ)⟨∇f(w), w* − w⟩.   (1)

For µ > 0, the function f(·) is (ρ, µ)-strongly quasar convex if for all w ∈ R^d, one has:

f(w*) ≥ f(w) + (1/ρ)⟨∇f(w), w* − w⟩ + (µ/2)||w* − w||².   (2)

For more characterizations of quasar convexity, we refer the reader to Hinder et al. (2020) (Appendix D in that paper), where a thorough discussion is provided. Recall that a function f(·) is L-smooth if f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)||x − y||² for any x and y, where L > 0 is the smoothness constant. For minimizing L-smooth and ρ-quasar convex functions, the algorithm of Hinder et al. (2020) takes O(√L ||w_0 − w*|| / (ρ√ε)) iterations and O((√L ||w_0 − w*|| / (ρ√ε)) log(1/(ρε))) total function and gradient evaluations to reach an ε-optimality gap. For L-smooth and (ρ, µ)-strongly quasar convex functions, the algorithm of Hinder et al. (2020) takes O((1/ρ)√(L/µ) log(V/ε)) iterations, again with additional logarithmic factors in the total number of function and gradient evaluations, where V := f(w_0) − f(w*) + (µ/2)||z_0 − w*||², and w_0 and z_0 are some initial points. Both results of Hinder et al.
(2020) improve those in the previous works of Nesterov et al. (2019) and Guminov & Gasnikov (2017) for minimizing quasar and strongly quasar convex functions. A lower bound Ω(√L ||w_0 − w*|| / (ρ√ε)) on the number of gradient evaluations for minimizing quasar convex functions via any first-order deterministic method is also established in Hinder et al. (2020). The additional logarithmic factors in the (upper bounds of the) number of gradient evaluations, compared with the iteration complexity, result from a binary-search subroutine that is executed in each iteration to determine the value of a specific parameter of the algorithm. A similar concern applies to Bu & Mesbahi (2020), where the algorithm assumes an oracle whose implementation requires a subroutine demanding multiple function and gradient evaluations in each iteration. Hence, the open questions are whether the additional logarithmic factors in the total number of gradient evaluations can be removed and whether function evaluations are necessary for an accelerated method to minimize quasar convex functions. We answer both by exhibiting an accelerated randomized algorithm that avoids the subroutine, makes only one gradient call per iteration, and does not need function evaluations. Consequently, the gradient-call complexity does not incur the additional logarithmic factors of the previous works, and, perhaps more importantly, the computational cost per iteration is significantly reduced. The proposed algorithms are built on the continuized discretization technique recently introduced to the optimization community by Even et al. (2021), which offers a nice way to implement a continuous-time dynamic as a discrete-time algorithm.
Specifically, the technique allows one to use differential calculus to design and analyze an algorithm in continuous time, while the discretization of the continuized process suffers no discretization error, thanks to the fact that the Poisson process can be simulated exactly. Our acceleration results in this paper champion the approach and provably showcase the advantage of momentum over GD for minimizing quasar convex functions. While previous works on quasar convexity are theoretically interesting, a lingering issue is that few examples are known in non-convex machine learning. Some synthetic functions are shown in previous works (Hinder et al., 2020; Nesterov et al., 2019; Guminov & Gasnikov, 2017), but the only practical non-convex learning applications that we are aware of are given by Hardt et al. (2018), who show that for learning a class of linear dynamical systems, a relevant objective function over a convex constraint set satisfies quasar convexity, and by Foster et al. (2018), who show that robust linear regression with Tukey's biweight loss and a GLM with an increasing link function satisfy quasar convexity, under the assumption that the link function has a bounded second derivative (which excludes the case of Leaky-ReLU). In this work, we find that the objective functions of learning GLMs with link functions being logistic, quadratic, ReLU, or Leaky-ReLU satisfy (strong) quasar convexity under mild assumptions on the data distribution. We also establish connections between strong quasar convexity and one-point convexity (Guille-Escuret et al., 2022; Kleinberg et al., 2018), the Polyak-Łojasiewicz (PL) condition (Polyak, 1963; Karimi et al., 2016), and the quadratic-growth (QG) condition (Drusvyatskiy & Lewis, 2018). Our findings suggest that minimizing quasar convex functions is not only theoretically interesting, but is also practical for certain non-convex learning applications.
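To make Definition 1 concrete, the sketch below numerically checks the quasar-convexity inequality for a synthetic non-convex function of the kind used as examples in prior works; the specific choice f(w) = w² + 3 sin²(w), the grid, and the tolerance are our own illustration, and the value of ρ is estimated from the grid rather than taken from any paper.

```python
import numpy as np

def f(w):
    # Non-convex toy function; its global minimizer is w* = 0.
    return w ** 2 + 3.0 * np.sin(w) ** 2

def grad_f(w):
    return 2.0 * w + 3.0 * np.sin(2.0 * w)

w_star = 0.0
ws = np.linspace(-10.0, 10.0, 20001)
ws = ws[np.abs(ws) > 1e-6]            # avoid dividing by f(w) = 0 at w*

# Quasar convexity: f(w*) >= f(w) + (1/rho) <grad f(w), w* - w>,
# which for w != w* rearranges to rho <= grad_f(w) (w - w*) / (f(w) - f(w*)).
ratios = grad_f(ws) * (ws - w_star) / (f(ws) - f(w_star))
rho_est = ratios.min()
print(f"largest admissible rho on the grid: {rho_est:.3f}")

# With this rho, the defining inequality holds everywhere on the grid.
rho = rho_est
lhs = np.full_like(ws, f(w_star))
rhs = f(ws) + (1.0 / rho) * grad_f(ws) * (w_star - ws)
assert np.all(lhs >= rhs - 1e-9)
assert 0.0 < rho_est < 1.0            # rho < 1 reflects the non-convexity
```

The same grid-based check can be applied to any candidate function before attempting to tune ρ for the accelerated methods discussed below.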
To summarize, our contributions include:

• For minimizing functions satisfying quasar convexity or strong quasar convexity, we show that the continuized Nesterov acceleration not only has the optimal iteration complexity, but also makes the same number of gradient calls required to get an expected ε-optimality gap or an ε-gap with high probability. The continuized Nesterov acceleration avoids multiple gradient calls in each iteration, in contrast to the previous works. We also propose an accelerated algorithm that uses stochastic pseudo-gradients for learning a class of GLMs.

• We find that GLMs with various link functions satisfy quasar convexity. Moreover, we show that if a smooth one-point convex, PL, or QG function satisfies quasar convexity, then acceleration for minimizing the function is possible under certain conditions, while acceleration over GD is not known for these classes of functions in general in the literature.

2. PRELIMINARIES

Related works of gradient-based algorithms for structured non-convex optimization: Studying gradient-based algorithms under relaxed notions of convexity has seen growing interest in non-convex optimization, e.g., (Gower et al., 2021; Vaswani et al., 2019; 2022; Jin, 2020). These variegated notions include one-point convexity (Guille-Escuret et al., 2022; Kleinberg et al., 2018), the PL condition (Polyak, 1963; Karimi et al., 2016), the QG condition (Drusvyatskiy & Lewis, 2018), the error bound condition (Luo & Tseng, 1993; Drusvyatskiy & Lewis, 2018), local quasi convexity (Hazan et al., 2016), the regularity condition (Chi et al., 2019), variational coherence (Zhou et al., 2017), and quasar convexity (Hinder et al., 2020; Hardt et al., 2018; Nesterov et al., 2019; Guminov & Gasnikov, 2017; Bu & Mesbahi, 2020). For more details, we refer the reader to the references therein.

The continuized technique of designing optimization algorithms: The continuized technique was introduced in Aldous & Fill (2002) under the subject of Markov chains and was recently brought to optimization by Even et al. (2021), who consider the following random process and build a connection to Nesterov's acceleration (Nesterov, 1983; 2013):

dw_t = η_t(z_t − w_t)dt − γ_t ∇f(w_t)dN(t)
dz_t = η'_t(w_t − z_t)dt − γ'_t ∇f(w_t)dN(t),   (3)

in which η_t, η'_t, γ_t, γ'_t are parameters to be chosen and dN(t) is the Poisson point measure. More precisely, one has dN(t) = Σ_{k≥1} δ_{T_k}(dt), where the random times T_1, T_2, ..., T_k, ... are such that the increments T_1, T_2 − T_1, T_3 − T_2, ... are i.i.d. from the exponential distribution with mean 1 (so E[T_k] = k). Between the random times, the continuized process (3) reduces to a system of ordinary differential equations:

dw_t = η_t(z_t − w_t)dt   (4)
dz_t = η'_t(w_t − z_t)dt.   (5)

At the random times T_k, the dynamic (3) is equivalent to taking GD steps:

w_{T_k} = w_{T_k−} − γ_{T_k}∇f(w_{T_k−})   (6)
z_{T_k} = z_{T_k−} − γ'_{T_k}∇f(w_{T_k−}).   (7)

A nice feature of this continuized technique is that one can implement the dynamic (3) without causing any discretization error, thanks to the fact that the Poisson process can be simulated exactly. In contrast, other continuous-time approaches (Krichene et al., 2015; Scieur et al., 2017; Attouch et al., 2018; Su et al., 2014; Wibisono et al., 2016; Shi et al., 2018; Diakonikolas & Orecchia, 2019) do not enjoy such a benefit. The formal statement of the continuized discretization is replicated as follows.

Lemma 1. (Theorem 3 in Even et al. (2021)) The discretization of the continuized Nesterov acceleration (3) can be implemented as w̃_k := w_{T_k}, ṽ_k := w_{T_{k+1}−}, z̃_k := z_{T_k}. Furthermore, the update of the discretized process is in the following form:

ṽ_k = w̃_k + τ_k(z̃_k − w̃_k)   (8)
w̃_{k+1} = ṽ_k − γ̃_{k+1}∇f(ṽ_k)   (9)
z̃_{k+1} = z̃_k + τ'_k(ṽ_k − z̃_k) − γ'_{k+1}∇f(ṽ_k),   (10)

where τ_k, τ'_k, γ̃_k, γ'_k are random parameters that are functions of η_t, η'_t, γ_t, and γ'_t.

3. MAIN RESULTS: EXAMPLES OF (STRONG) QUASAR CONVEXITY

3.1. EXAMPLES OF QUASAR CONVEXITY

We start by identifying a class of functions that satisfy quasar convexity. To get the ball rolling, we need to introduce two notions first.

Definition 2. (C_v-generalized variational coherence w.r.t. a function h(·, ·)) Denote w* ∈ R^d a global minimizer of a function f(·). We say that the function f(·) is generalized variational coherent with the parameter C_v > 0 if for all w ∈ R^d, one has: ⟨∇f(w), w − w*⟩ ≥ C_v h(w, w*), where h(·, ·): R^d × R^d → R_+ is a non-negative function whose inputs are w and w*.

Definition 3. (C_l-generalized smoothness w.r.t. a function h(·, ·)) Denote w* ∈ R^d a global minimizer of a function f(·). We say that the function f(·) is generalized smooth with the parameter C_l > 0 if for all w ∈ R^d, one has: f(w) − f(w*) ≤ C_l h(w, w*), where h(·, ·): R^d × R^d → R_+ is a non-negative function whose inputs are w and w*.

Observe that if a function is generalized variational coherent, then it is variational coherent, i.e., ⟨∇f(w), w − w*⟩ ≥ 0, which is a condition that allows an almost-sure convergence to w* via mirror descent (Zhou et al., 2017). Also, when the non-negative function is the squared l2 norm, i.e., h(w, w*) = ||w − w*||²₂, generalized variational coherence becomes one-point convexity, i.e., ⟨∇f(w), w − w*⟩ ≥ C_v||w − w*||²₂. In the literature, a few non-convex learning problems have been shown to exhibit one-point convexity, see e.g., Yehudai & Shamir (2020); Sattar & Oymak (2022); Li & Yuan (2017); Kleinberg et al. (2018). However, Guille-Escuret et al. (2022) recently show that for minimizing the class of functions that are one-point convex w.r.t. a global minimizer w* and have gradient Lipschitzness in the sense that ||∇f(w) − ∇f(w*)||₂ ≤ L||w − w*||₂ for any w ∈ R^d (which is called the upper error bound condition, L-EB+, in their terminology), GD is optimal among first-order methods, which suggests that a condition different from the upper error bound condition might be necessary to show acceleration over GD for functions satisfying one-point convexity. We also see that if a function f(·) is L-smooth w.r.t. a norm ||·||, then it is (L/2)-generalized smooth w.r.t. the squared norm, i.e., h(w, w*) = ||w − w*||².

Lemma 2. If f(·) is C_v-generalized variational coherent and C_l-generalized smooth w.r.t. the same function h(·, ·), then f(·) is ρ = C_v/C_l-quasar convex.

Proof. Using the definitions, we have f(w) − f(w*) ≤ C_l h(w, w*) ≤ (C_l/C_v)⟨∇f(w), w − w*⟩, which is exactly the quasar-convexity inequality with ρ = C_v/C_l.

Lemma 2 could be viewed as a modified result of Lemma 5 in Foster et al. (2018), where the authors show that a GLM with the link function having a bounded second derivative and a positive first derivative satisfies quasar convexity. In the following, we provide three more examples of quasar convexity, while the proofs are deferred to Appendix C. For these examples, we assume that each sample x ∈ R^d is drawn i.i.d. from a distribution D, and that there exists a w* ∈ R^d such that its label is generated as y = σ(w*⊤x), where σ(·): R → R is the link function of a GLM.
We consider minimizing the square loss function:

f(w) := E_{x∼D}[½(σ(w⊤x) − y)²].   (11)

3.1.1. EXAMPLE 1: (GLMS WITH INCREASING LINK FUNCTIONS)

Lemma 3. Suppose that the link function σ(z) is L_0-Lipschitz and α-increasing, i.e., σ'(z) ≥ α > 0 for all z ∈ R. Then, the loss function (11) is α²-generalized variational coherent and (L_0²/2)-generalized smooth w.r.t. h(w, w*) = E_{x∼D}[((w − w*)⊤x)²]. Therefore, it is ρ = 2α²/L_0²-quasar convex.

An example of a link function that satisfies the assumption is the Leaky-ReLU, i.e., σ(z) = max(αz, z), where α > 0. If we further assume that the models w, w* and the features x in (11) have finite length, so that the input to the link function σ(·) is bounded, then the logistic link function, i.e., σ(z) = (1 + exp(−z))⁻¹, is another example.
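As a sanity check of Lemma 3, the sketch below verifies both inequalities on an empirical distribution (a finite sample set is itself a valid choice of D, so the inequalities hold deterministically, not just approximately). The Leaky-ReLU slope, dimension, and sample size are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, d, n = 0.5, 5, 2000           # Leaky-ReLU slope, dimension, sample size

def sigma(z):                        # Leaky-ReLU: 1-Lipschitz and alpha-increasing
    return np.maximum(alpha * z, z)

def dsigma(z):
    return np.where(z > 0, 1.0, alpha)

X = rng.standard_normal((n, d))      # empirical D (any distribution would do)
w_star = rng.standard_normal(d)
w = rng.standard_normal(d)
y = sigma(X @ w_star)                # labels generated by the target GLM

h = np.mean((X @ (w - w_star)) ** 2)                       # h(w, w*)
grad_inner = np.mean((sigma(X @ w) - y) * dsigma(X @ w) * (X @ (w - w_star)))
gap = np.mean(0.5 * (sigma(X @ w) - y) ** 2)               # f(w) - f(w*)

assert grad_inner >= alpha ** 2 * h - 1e-9   # alpha^2-generalized variational coherence
assert gap <= 0.5 * h + 1e-9                 # (L_0^2/2)-generalized smoothness, L_0 = 1
```

Both assertions hold term by term: for an α-increasing, 1-Lipschitz link, each sample contributes at least α²((w − w*)⊤x)² to the inner product and at most ½((w − w*)⊤x)² to the optimality gap.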

3.1.2. EXAMPLE 2: (PHASE RETRIEVAL)

When the link function is quadratic, i.e., σ(z) = z², the objective function becomes that of phase retrieval, see e.g., Yonel & Yazici (2020); Chi et al. (2019). White et al. (2016) and Yonel & Yazici (2020) show that in a neighborhood of the global minimizers ±w*, the function satisfies one-point convexity in terms of the l2 norm when the data distribution D is Gaussian, in which case a specialized initialization technique called spectral initialization finds a point in the neighborhood (Ma et al., 2020). As discussed earlier, one-point convexity is equivalent to generalized variational coherence w.r.t. the squared norm, i.e., h(w, w*) = ||w − w*||²₂. Therefore, by Lemma 2, to show quasar convexity for all w in the neighborhood of ±w*, it remains to show that the objective function is generalized smooth w.r.t. the squared norm ||w − w*||²₂.

Lemma 4. Assume that there exists a finite constant C_R > 0 such that all w ∈ R^d in the balls of radius R centered at ±w* satisfy E_{x∼D}[((w + w*)⊤x)² ||x||²₂] ≤ C_R. Then, the loss function (11) is (C_R/2)-generalized smooth w.r.t. h(w, w*) = ||w − w*||²₂.

An example of a distribution D that satisfies the assumption in Lemma 4 is the Gaussian distribution.
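For standard Gaussian data, the quantity bounded by C_R in Lemma 4 can be computed in closed form: for x ∼ N(0, I_d) and any fixed u, Isserlis' theorem gives E[(u⊤x)² ||x||²₂] = (d + 2)||u||², so C_R is finite (bounded by (d + 2) times the largest ||w + w*||² over the two balls). A Monte-Carlo sketch, with dimension and sample size chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 200_000
x = rng.standard_normal((n, d))      # x ~ N(0, I_d)

u = rng.standard_normal(d)           # stands in for w + w* at some w in the ball
mc = np.mean((x @ u) ** 2 * np.sum(x ** 2, axis=1))
exact = (d + 2) * np.dot(u, u)       # E[(u^T x)^2 ||x||^2] for standard Gaussian x

print(mc, exact)
assert abs(mc - exact) / exact < 0.1   # Monte-Carlo estimate matches the closed form
```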

3.1.3. EXAMPLE 3: (LEARNING A SINGLE RELU)

When the link function is ReLU, i.e., σ(z) = max{0, z}, Theorem 4.2 in Yehudai & Shamir (2020) shows that under mild assumptions on the data distribution, e.g., D being Gaussian, the objective function is one-point convex in terms of the l2 norm. Therefore, as in the case of phase retrieval, it remains to show generalized smoothness w.r.t. the squared norm ||w − w*||²₂ to establish quasar convexity.

Lemma 5. When the link function is ReLU, the loss function (11) is ½E_{x∼D}[||x||²₂]-generalized smooth w.r.t. h(w, w*) = ||w − w*||²₂.

3.2. EXAMPLES OF STRONG QUASAR CONVEXITY

In this subsection, we turn to strong quasar convexity. We establish its connections to one-point convexity, the PL condition, and the QG condition.

3.2.1. ONE-POINT CONVEX FUNCTIONS WITH QUASAR CONVEXITY

It turns out that if a C_v-one-point convex function f(·) also satisfies ρ-quasar convexity, then it satisfies strong quasar convexity. Specifically, we have the following lemma.

Lemma 6. Suppose that the function f(·) satisfies C_v-one-point convexity and ρ-quasar convexity. Then, it is also (ρ' = ρ/θ, µ = 2C_v(θ − 1)/ρ)-strongly quasar convex for any θ > 1.

The proof is deferred to Appendix C.4. By Lemma 6, phase retrieval and ReLU regression illustrated in the previous subsection can also be strongly quasar convex.

3.2.2. QG AND PL FUNCTIONS WITH QUASAR CONVEXITY

Recall that a function f(·) satisfies ν-QG w.r.t. a global minimizer w* ∈ R^d if f(w) − f(w*) ≥ (ν/2)||w − w*||² for some ν > 0 and all w ∈ R^d (Drusvyatskiy & Lewis, 2018; Karimi et al., 2016). Recall also that a function f(·) satisfies ν-PL if 2ν(f(w) − f(w*)) ≤ ||∇f(w)||² for some ν > 0 and all w ∈ R^d (Karimi et al., 2016). It is known that a ν-PL function satisfies ν-QG, see e.g., Appendix A in Karimi et al. (2016). The notion of PL has been discovered in various non-convex problems recently (Altschuler et al., 2021; Oymak & Soltanolkotabi, 2019; Chizat, 2021; Merigot et al., 2021). We show in Lemma 7 below that if a ν-QG function f(·) satisfies quasar convexity, then it also satisfies strong quasar convexity.

Lemma 7. Suppose that the function f(·) is ν-QG and ρ-quasar convex w.r.t. a global minimizer w*. Then, it is also (ρ' = ρθ, µ = ν(1 − θ)/θ)-strongly quasar convex for any θ ∈ (0, 1).

Lemma 8 in the following shows that GLMs with increasing link functions satisfy QG under certain distributions D, e.g., Gaussian, and hence they are strongly quasar convex by Lemma 7 and Lemma 3.

Lemma 8. Following the setting of Lemma 3, assume that the smallest eigenvalue of the matrix E_{x∼D}[xx⊤] satisfies λ_min(E_{x∼D}[xx⊤]) > 0. Then, the function (11) is α²λ_min(E_{x∼D}[xx⊤])-QG.

The proofs of Lemma 7 and Lemma 8 are available in Appendix C.5.
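The QG property of Lemma 8 can be checked numerically on an empirical distribution, where the chain f(w) − f(w*) ≥ (α²/2)(w − w*)⊤Σ̂(w − w*) ≥ (α²λ_min(Σ̂)/2)||w − w*||² holds at every point (Σ̂ denotes the empirical second-moment matrix); all sizes below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, d, n = 0.5, 4, 500

def sigma(z):
    return np.maximum(alpha * z, z)   # Leaky-ReLU: alpha-increasing link

X = rng.standard_normal((n, d))
Sigma_hat = X.T @ X / n               # empirical second-moment matrix
lam_min = np.linalg.eigvalsh(Sigma_hat).min()
w_star = rng.standard_normal(d)
y = sigma(X @ w_star)                 # f(w*) = 0 in the realizable setting

for _ in range(100):                  # QG holds at every tested point
    w = w_star + 3.0 * rng.standard_normal(d)
    gap = np.mean(0.5 * (sigma(X @ w) - y) ** 2)   # f(w) - f(w*)
    bound = 0.5 * alpha ** 2 * lam_min * np.dot(w - w_star, w - w_star)
    assert gap >= bound - 1e-9
```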

4. MAIN RESULTS: ALGORITHMIC ASPECTS

We first analyze the continuized Nesterov acceleration (3) and its discrete-time version (8)-(10) for minimizing quasar convex functions.

Theorem 1. Assume that the function f(·) is L-smooth and ρ-quasar convex. Let η_t = 2/(ρt), η'_t = 0, γ_t = 1/L, and γ'_t = ρt/(2L). Then, the update w_t of the continuized algorithm (3) satisfies

E[f(w_t) − f(w*)] ≤ 2L||z_0 − w*||² / (ρ²t²).

Furthermore, for the update w̃_k of the discrete-time algorithm (8)-(10), if the parameters are chosen as τ_k = 1 − (T_k/T_{k+1})^{2/ρ}, τ'_k = 0, γ̃_k = 1/L, and γ'_k = ρT_k/(2L), then

E[T_k²(f(w̃_k) − f(w*))] ≤ 2L||z̃_0 − w*||² / ρ².

It is noted that the expectation E is with respect to the Poisson process, which is the only source of randomness in the continuized Nesterov acceleration. By applying concentration inequalities, we can obtain from Theorem 1 a bound on the optimality gap that holds with high probability.

Corollary 1. The update w̃_k of the algorithm (8)-(10) with the same parameters indicated in Theorem 1 satisfies

f(w̃_k) − f(w*) ≤ 2c_0 L||z̃_0 − w*||² / ((1 − c)²ρ²k²)

with probability at least 1 − 1/(c²k) − 1/c_0, for any c ∈ (0, 1) and c_0 > 1.

Corollary 1 implies that K = O(√L ||z̃_0 − w*|| / (ρ√ε)) gradient calls suffice for the discrete-time algorithm to get an ε-optimality gap with high probability, since the discrete-time algorithm queries only one gradient in each iteration k. Next, we analyze the convergence rate for minimizing (ρ, µ)-strongly quasar convex functions.

Theorem 2. Assume that the function f(·) is L-smooth and (ρ, µ)-strongly quasar convex, where µ > 0. Let γ_t = 1/L, γ'_t = 1/√(µL), η_t = ρ√(µ/L), and η'_t = √(µ/L). Then, the update w_t of the continuized algorithm (3) satisfies

E[f(w_t) − f(w*)] ≤ (f(w_0) − f(w*) + (µ/2)||z_0 − w*||²) exp(−ρ√(µ/L) t).
Furthermore, for the update w̃_k of the discrete-time algorithm (8)-(10), if the parameters are chosen as

τ_k = (1/(1 + ρ))(1 − exp(−(1 + ρ)√(µ/L)(T_{k+1} − T_k))),
τ'_k = ρ(1 − exp(−(1 + ρ)√(µ/L)(T_{k+1} − T_k))) / (ρ + exp(−(1 + ρ)√(µ/L)(T_{k+1} − T_k))),
γ̃_k = 1/L, and γ'_k = 1/√(µL), then

E[exp(ρ√(µ/L) T_k)(f(w̃_k) − f(w*))] ≤ f(w̃_0) − f(w*) + (µ/2)||z̃_0 − w*||².

Corollary 2. The update w̃_k of the algorithm (8)-(10) with the same parameters indicated in Theorem 2 satisfies

f(w̃_k) − f(w*) ≤ c_0 (f(w̃_0) − f(w*) + (µ/2)||z̃_0 − w*||²) exp(−ρ√(µ/L)(1 − c)k)

with probability at least 1 − 1/(c²k) − 1/c_0, for any c ∈ (0, 1) and c_0 > 1.

The proofs of the above theorems and corollaries are available in Appendix D. Denote V := f(w̃_0) − f(w*) + (µ/2)||z̃_0 − w*||². Theorem 2 and Corollary 2 show that the proposed algorithm takes O((√(L/µ)/ρ) log(V/ε)) iterations, with the same number of gradient evaluations, to get an ε-expected optimality gap and an ε-optimality gap with high probability, respectively. Together with Theorem 1 and Corollary 1, these theoretical results show that the continuized Nesterov acceleration has an advantage compared to the existing algorithms for minimizing quasar and strongly quasar convex functions (Hinder et al., 2020; Bu & Mesbahi, 2020; Nesterov et al., 2019; Guminov & Gasnikov, 2017), as it avoids multiple gradient calls in each iteration and does not need function evaluations to obtain an ε-gap with high probability. On the other hand, it should be emphasized that the guarantees in the aforementioned works are deterministic bounds, while ours are in expectation or hold with high probability. Recall that Lemma 2 implies that C_v-one-point convexity and L-smoothness together imply ρ = 2C_v/L quasar convexity. Furthermore, Lemma 6 states that ρ-quasar convexity and C_v-one-point convexity actually imply (ρ' = ρ/θ, µ = 2C_v(θ − 1)/ρ)-strong quasar convexity for any θ > 1.
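The discrete-time updates (8)-(10) with the Theorem 1 parameters need only one gradient call per iteration; the extra work, compared with standard Nesterov acceleration, is just sampling exponential waiting times. Below is a minimal sketch on a convex quadratic, which is ρ = 1 quasar convex; the problem size and iteration budget are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
A = np.diag(np.arange(1.0, d + 1))    # f(w) = 0.5 w^T A w, minimizer w* = 0
L, rho = float(d), 1.0                # smoothness constant and quasar parameter

def f(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

w = z = rng.standard_normal(d)
f0, T = f(w), 0.0
for k in range(300):
    T_next = T + rng.exponential(1.0)        # jump times: i.i.d. Exp(1) increments
    tau = 1.0 - (T / T_next) ** (2.0 / rho)  # Theorem 1: tau_k = 1 - (T_k/T_{k+1})^{2/rho}
    v = w + tau * (z - w)                    # (8), with tau'_k = 0
    g = grad(v)                              # the single gradient call of iteration k
    w = v - g / L                            # (9), gamma_tilde_{k+1} = 1/L
    z = z - (rho * T_next / (2.0 * L)) * g   # (10), gamma'_{k+1} = rho T_{k+1} / (2L)
    T = T_next

assert f(w) < 1e-2 * f0                      # the O(1/k^2) decay kicks in quickly
```

Note that T_0 = 0 gives τ_0 = 1, so the first extrapolation point is simply z_0; afterwards τ_k adapts to the random jump times.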
By combining Lemma 2 and Lemma 6, we find that C_v-one-point convexity and L-smoothness imply (ρ' = 2C_v/(Lθ), µ = L(θ − 1))-strong quasar convexity for any θ > 1. By substituting (ρ' = 2C_v/(Lθ), µ = L(θ − 1)) into the complexity O((√(L/µ)/ρ') log(V/ε)) indicated by Corollary 2, we see that the number of iterations required to get an ε-gap with high probability for minimizing functions that satisfy C_v-one-point convexity and L-smoothness via the proposed algorithm is

O((L/C_v)(θ/√(θ − 1)) log(V/ε)) = O((L/C_v) log(V/ε)),

where the equality follows by taking, e.g., θ = 2. Recall the result of Guille-Escuret et al. (2022) mentioned in Section 3: for functions satisfying C_v-one-point convexity and the upper error bound condition L-EB+, GD is optimal among first-order methods, i.e., the optimal complexity is simply attained by GD. Our result does not contradict this lower bound result, because L-smoothness and L-EB+ are different: L-EB+ concerns gradient Lipschitzness between a minimizer w* and any w, not between arbitrary pairs of points in R^d, and hence does not imply L-smoothness; conversely, L-smoothness does not imply L-EB+. Lastly, it is noted that both theorems (and corollaries) require the L-smoothness condition. The reader might raise the question whether smoothness holds for a GLM whose link function is ReLU or Leaky-ReLU. For this case, it has been shown that the objective (11) satisfies smoothness when the data distribution is a Gaussian distribution, e.g., Lemma 5.2 in Zhang et al. (2018).

Continuized accelerated algorithm with stochastic pseudo-gradients for GLMs: We also propose a stochastic algorithm to recover an unknown GLM w* ∈ R^d that generates the label y of a sample x ∈ R^d via y = σ(w*⊤x), where σ(·) is the link function. A natural metric for this task is the distance to the unknown target w*, i.e., f(w) := ½||w − w*||²₂. However, since we do not have access to w*, we cannot use the gradients of f(·) for the update. Instead, let us consider using stochastic pseudo-gradients, defined as g(w; ξ) := (σ(w⊤x) − y)x, where ξ := (x, y) represents a random sample drawn from the data distribution. Assume that the first derivative of the link function is bounded below by a positive constant, i.e., σ'(·) ≥ α > 0.
Then, the expectation over the data distribution of the inner product between the pseudo-gradient and the gradient ∇f(w) = w − w* satisfies

E_ξ[⟨g(w; ξ), ∇f(w)⟩] = E_ξ[⟨(σ(w⊤x) − y)x, w − w*⟩] = E_x[⟨(σ(w⊤x) − σ(w*⊤x))x, w − w*⟩] = E_x[((σ(w⊤x) − σ(w*⊤x))/(w⊤x − w*⊤x)) ((w − w*)⊤x)²] ≥ α E_x[((w − w*)⊤x)²],

which implies that taking a step along the negative pseudo-gradient should make progress, in expectation, on minimizing the distance ½||w − w*||²₂ whenever w has not converged to w*. That is, the update w_{t+1} = w_t − η g(w_t; ξ) can be shown to converge to the target w* ∈ R^d under certain conditions, where η > 0 is the step size. In fact, this algorithm is called (stochastic) GLMtron in the literature (Kakade et al., 2011). We introduce a continuized acceleration of it in the following, but before that, let us provide some necessary ingredients first.

Define ψ(a, b) := (σ(a) − σ(b))/(a − b) and H(w) := E_x[ψ(w⊤x, w*⊤x) xx⊤]. When the data matrix satisfies E_x[xx⊤] ⪰ θI_d for some θ > 0 and when the derivative of the link function σ(·) satisfies σ'(·) ≥ α > 0, one has H(w) ⪰ µI_d ≻ 0, where µ := αθ. We assume that for any w ∈ R^d, it holds that E_x[ψ(w⊤x, w*⊤x)² ||x||²₂ xx⊤] ⪯ R² H(w) for some constant R² > 0, and also that E_x[ψ(w⊤x, w*⊤x)² ||x||²_{H(w)⁻¹} xx⊤] ⪯ κ H(w) for some constant κ > 0, where we denote ||x||²_{H(w)⁻¹} := x⊤H(w)⁻¹x and H(w)⁻¹ is the inverse of H(w). Define κ̃ := R²/µ. Then, we have κ ≤ κ̃, because

E_x[ψ(w⊤x, w*⊤x)² ||x||²_{H(w)⁻¹} xx⊤] ⪯ (1/µ) E_x[ψ(w⊤x, w*⊤x)² ||x||²₂ xx⊤] ⪯ (R²/µ) H(w) = κ̃ H(w).

The assumptions can be viewed as a generalization of the assumptions made in Jain et al. (2018); Even et al. (2021) for standard least-squares regression, in which case one has σ(z) = z and hence ψ(·, ·) ≡ 1. Our continuized acceleration with stochastic pseudo-gradient steps can be formulated as:

dw_t = η_t(z_t − w_t)dt − γ_t ∫_Ξ g(w_t; ξ) dN(t, ξ)
dz_t = η'_t(w_t − z_t)dt − γ'_t ∫_Ξ g(w_t; ξ) dN(t, ξ),   (13)

where η_t, η'_t, γ_t, γ'_t are parameters, ξ ∈ Ξ represents an i.i.d. random variable associated with a sample used to compute a stochastic pseudo-gradient g(w; ξ), and dN(t, ξ) = Σ_{k≥1} δ_{(T_k, ξ_k)}(dt, dξ) is the Poisson point measure on R_{≥0} × Ξ. We have Theorem 3 in the following; its proof is available in Appendix E, where we also provide a convergence guarantee of the discrete-time algorithm.

Theorem 3. (Continuized algorithm (13) for GLMs) Choose η_t = √(µ/(κR²)), η'_t = √(µ/(κR²)), γ_t = 1/R², and γ'_t = 1/√(µκR²). Then, the update w_t of (13) satisfies

E[½||w_t − w*||²₂] ≤ (½||w_0 − w*||²₂ + µ||z_0 − w*||²_{H(w_0)⁻¹}) exp(−√(µ/(κR²)) t).
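For reference, the plain stochastic GLMtron update w_{t+1} = w_t − η g(w_t; ξ) that the continuized algorithm (13) accelerates can be sketched as follows; the step size, slope, and iteration count are arbitrary choices of ours, not tuned values from any paper.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, d, eta = 0.5, 5, 0.05

def sigma(z):
    return np.maximum(alpha * z, z)   # Leaky-ReLU link, sigma' >= alpha > 0

w_star = rng.standard_normal(d)       # unknown target GLM
w = np.zeros(d)
dist0 = np.linalg.norm(w - w_star)

for _ in range(2000):
    x = rng.standard_normal(d)        # fresh sample (x, y) with y = sigma(w*^T x)
    y = sigma(x @ w_star)
    g = (sigma(x @ w) - y) * x        # stochastic pseudo-gradient g(w; xi)
    w = w - eta * g                   # GLMtron step (Kakade et al., 2011)

assert np.linalg.norm(w - w_star) < 0.1 * dist0
```

Because the data are realizable (no label noise), the pseudo-gradient vanishes at w = w*, so the iterate contracts toward the target rather than hovering around a noise floor.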

5. EXPERIMENTS

We compare the proposed continuized acceleration with GD and the accelerated method of Hinder et al. (2020) (AGD). For the method of Hinder et al. (2020), we use their implementation available online (Hinder et al., 2021). Our first set of experiments considers optimizing the empirical risks of GLMs with link functions being logistic, ReLU, and quadratic, i.e., solving min_w (1/n) Σ_{i=1}^n ½(σ(w⊤x_i) − y_i)², where n is the number of samples. Each data point x_i is sampled from the normal distribution N(0, I_d), and the label y_i is generated as y_i = σ(w*⊤x_i). Since the parameters L, µ, and ρ required by the methods are typically unknown, we use grid search and report the result under the best configuration of these parameters for each method. More precisely, we search L and µ over {..., 10^q, 5 × 10^q, 10^{q+1}, ...} with the constraint that L > µ, where q ∈ {−2, −1, ..., 4}, and search ρ ∈ {0.01, 0.1, 0.5}. Figure 1 shows the results, where we compare the performance of the algorithms in terms of the function value versus the iteration count, the number of gradient calls, and CPU time (seconds). From the first column of the figure, one can see that the proposed continuized acceleration is competitive with AGD of Hinder et al. (2020) in terms of the number of iterations. From the middle and last columns, the continuized acceleration shows promising results over AGD and GD when measured in terms of the number of gradient calls and CPU time, which confirms that the per-iteration cost of AGD is indeed higher than that of the continuized acceleration and showcases the advantage of the continuized acceleration. Our second set of experiments compares stochastic GLMtron and the proposed continuized acceleration of it (accelerated stochastic GLMtron), in which both algorithms randomly select a sample to compute a stochastic pseudo-gradient at each step of the update. We consider learning a GLM with a Leaky-ReLU link, i.e., σ(z) = max(αz, z), under different values of α.
Figure 2 shows the effectiveness of accelerated stochastic GLMtron, as it is significantly faster than stochastic GLMtron for recovering the true vector w * .

6. CONCLUSION

We show that the continuized Nesterov acceleration outperforms the previous accelerated methods for minimizing quasar convex functions. Compared to the previous approaches, the continuized discretization technique provides a relatively easy way to design and analyze an accelerated algorithm for quasar convex functions. Hence, it would be interesting to check whether this technique could offer other benefits in non-convex optimization. Specifically, can the technique help design fast algorithms for minimizing other classes of non-convex functions? On the other hand, while examples of quasar convex functions are provided in this paper, a natural question is whether this property holds more broadly in modern machine learning applications. Exploring these possibilities might be another interesting direction.

A ALGORITHMS OF HINDER ET AL. (2020)

We replicate the algorithms in Hinder et al. (2020) using our notation for the reader's reference. Their algorithms use a binary-search subroutine to determine the "mixing" parameter $\tau_k$.

Algorithm 1: AGD for $(\rho,\mu)$-strongly quasar convex function minimization in Hinder et al. (2020)
1: Set $\tau'_k = \rho\sqrt{\mu/L}$, $\bar{\gamma}_k = \frac{1}{L}$, and $\gamma'_k = \frac{1}{\sqrt{\mu L}}$.
2: for $k = 0, 1, \dots, K$ do
3: $\quad\alpha_k \leftarrow \textsc{BinaryLineSearch}\big(f, w_k, z_k, b = \frac{\rho\mu}{2}, c = \frac{L}{\mu}, \tilde{\epsilon} = 0\big)$.
4: $\quad\tau_k \leftarrow 1 - \alpha_k$.
5: $\quad v_k = w_k + \tau_k(z_k - w_k)$.
6: $\quad w_{k+1} = v_k - \bar{\gamma}_{k+1}\nabla f(v_k)$.
7: $\quad z_{k+1} = z_k + \tau'_k(v_k - z_k) - \gamma'_{k+1}\nabla f(v_k)$.
8: end
9: return $w_K$

Algorithm 2: AGD for $\rho$-quasar convex function minimization in Hinder et al. (2020)
1: Set $\tau'_k = 0$, $\bar{\gamma}_k = \frac{1}{L}$, and $\gamma'_k = \frac{\rho}{L\theta_k}$, where $\theta_k = \frac{\theta_{k-1}}{2}\big(\sqrt{(\theta_{k-1})^2 + 4} - \theta_{k-1}\big)$ for $k \geq 0$ and $\theta_{-1} = 1$.
2: for $k = 0, 1, \dots, K$ do
3: $\quad\alpha_k \leftarrow \textsc{BinaryLineSearch}\big(f, w_k, z_k, b = 0, c = \rho(\frac{1}{\theta_k} - 1), \tilde{\epsilon} = \frac{\rho\epsilon}{2}\big)$.
4: $\quad\tau_k \leftarrow 1 - \alpha_k$.
5: $\quad v_k = w_k + \tau_k(z_k - w_k)$.
6: $\quad w_{k+1} = v_k - \bar{\gamma}_{k+1}\nabla f(v_k)$.
7: $\quad z_{k+1} = z_k + \tau'_k(v_k - z_k) - \gamma'_{k+1}\nabla f(v_k)$.
8: end
9: return $w_K$

Algorithm 3: $\textsc{BinaryLineSearch}(f, w, z, b, c, \tilde{\epsilon}, [\text{guess}])$ (Hinder et al., 2020)
1: Assumptions: $f$ is $L$-smooth; $w, z \in \mathbb{R}^d$; $b, c, \tilde{\epsilon} \geq 0$; "guess" (optional) is in $[0, 1]$ if provided.
2: Define $g(\alpha) := f(\alpha w + (1-\alpha)z)$ and $p := b\|w - z\|^2$.
3: if guess provided and $c\,g(\text{guess}) + \text{guess}\cdot(g'(\text{guess}) - \text{guess}\cdot p) \leq c\,g(1) + \tilde{\epsilon}$ then return guess;
4: if $g'(1) \leq \tilde{\epsilon} + p$ then return 1;
5: else if $c = 0$ or $g(0) \leq g(1) + \frac{\tilde{\epsilon}}{c}$ then return 0;
6: $\tau \leftarrow 1 - \frac{\tilde{\epsilon} + p}{L\|w - z\|^2}$.
7: $\text{lo} \leftarrow 0$, $\text{hi} \leftarrow \tau$, $\alpha \leftarrow \tau$.
8: while $c\,g(\alpha) + \alpha(g'(\alpha) - \alpha p) > c\,g(1) + \tilde{\epsilon}$ do: $\alpha \leftarrow (\text{lo} + \text{hi})/2$; if $g(\alpha) \leq g(\tau)$ then $\text{hi} \leftarrow \alpha$ else $\text{lo} \leftarrow \alpha$; end
9: return $\alpha$.
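For concreteness, the bisection subroutine can be sketched in Python as follows (our own rendering of Algorithm 3; the optional "guess" branch is omitted, and a max-iteration cap is added for safety):

```python
import numpy as np

def binary_line_search(f, grad_f, w, z, b, c, eps, L, max_iter=60):
    """Sketch of BinaryLineSearch (Algorithm 3): find alpha with
    c*g(alpha) + alpha*(g'(alpha) - alpha*p) <= c*g(1) + eps,
    where g(alpha) = f(alpha*w + (1-alpha)*z) and p = b*||w - z||^2."""
    g = lambda a: f(a * w + (1.0 - a) * z)
    gp = lambda a: np.dot(grad_f(a * w + (1.0 - a) * z), w - z)  # g'(alpha)
    p = b * np.dot(w - z, w - z)
    if gp(1.0) <= eps + p:
        return 1.0
    if c == 0 or g(0.0) <= g(1.0) + eps / c:
        return 0.0
    tau = 1.0 - (eps + p) / (L * np.dot(w - z, w - z))
    lo, hi, alpha = 0.0, tau, tau
    for _ in range(max_iter):  # cap not present in the pseudocode
        if c * g(alpha) + alpha * (gp(alpha) - alpha * p) <= c * g(1.0) + eps:
            break
        alpha = (lo + hi) / 2.0
        if g(alpha) <= g(tau):
            hi = alpha
        else:
            lo = alpha
    return alpha
```

Note that each bisection iteration evaluates $f$ and $\nabla f$, which is the source of the extra gradient evaluations per iteration that the continuized approach avoids.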

B PROOF OF LEMMA 1

Lemma 1 (Theorem 3 in Even et al. (2021)) The discretization of the continuized Nesterov acceleration (3) can be implemented as $\tilde{w}_k := w_{T_k}$, $\tilde{v}_k := w_{T_{k+1}-}$, $\tilde{z}_k := z_{T_k}$. Furthermore, the update of the discretized process has the following form:
$$\tilde{v}_k = \tilde{w}_k + \tau_k(\tilde{z}_k - \tilde{w}_k) \quad (14)$$
$$\tilde{w}_{k+1} = \tilde{v}_k - \bar{\gamma}_{k+1}\nabla f(\tilde{v}_k) \quad (15)$$
$$\tilde{z}_{k+1} = \tilde{z}_k + \tau'_k(\tilde{v}_k - \tilde{z}_k) - \gamma'_{k+1}\nabla f(\tilde{v}_k), \quad (16)$$
where $\tau_k, \tau'_k, \bar{\gamma}_k, \gamma'_k$ are random parameters that are functions of $\eta_t$, $\eta'_t$, $\gamma_t$, and $\gamma'_t$.

Proof. We replicate the proof in Even et al. (2021) for completeness. Recall that between random times we have the ODEs
$$dw_t = \eta_t(z_t - w_t)dt \quad (17)$$
$$dz_t = \eta'_t(w_t - z_t)dt. \quad (18)$$
Integrating from $T_k$ to $T_{k+1}-$,
$$\tilde{v}_k = w_{T_{k+1}-} = w_{T_k} + \tau_k(z_{T_k} - w_{T_k}) = \tilde{w}_k + \tau_k(\tilde{z}_k - \tilde{w}_k), \quad (19)$$
$$z_{T_{k+1}-} = z_{T_k} + \bar{\tau}_k(w_{T_k} - z_{T_k}) = \tilde{z}_k + \bar{\tau}_k(\tilde{w}_k - \tilde{z}_k), \quad (20)$$
where $\tau_k$ and $\bar{\tau}_k$ depend on $\eta_t$ and $\eta'_t$, respectively. Combining the above two equations, we have
$$z_{T_{k+1}-} = \tilde{z}_k + \bar{\tau}_k\Big(\frac{1}{1-\tau_k}(\tilde{v}_k - \tau_k\tilde{z}_k) - \tilde{z}_k\Big) = \tilde{z}_k + \tau'_k(\tilde{v}_k - \tilde{z}_k), \quad (21)$$
where $\tau'_k := \frac{\bar{\tau}_k}{1-\tau_k}$. Furthermore, from (6) and (7), we have
$$\tilde{w}_{k+1} = w_{T_{k+1}} = w_{T_{k+1}-} - \gamma_{T_{k+1}}\nabla f(w_{T_{k+1}-}) = \tilde{v}_k - \gamma_{T_{k+1}}\nabla f(\tilde{v}_k), \quad (22)$$
$$\tilde{z}_{k+1} = z_{T_{k+1}} = z_{T_{k+1}-} - \gamma'_{T_{k+1}}\nabla f(w_{T_{k+1}-}) = \tilde{z}_k + \tau'_k(\tilde{v}_k - \tilde{z}_k) - \gamma'_{T_{k+1}}\nabla f(\tilde{v}_k). \quad (23)$$
Hence, $\bar{\gamma}_{k+1} = \gamma_{T_{k+1}}$ and $\gamma'_{k+1} = \gamma'_{T_{k+1}}$.
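The three-step update (14)-(16) is straightforward to implement once the random parameters are drawn; a generic sketch (parameter sequences supplied by the caller, function names ours):

```python
import numpy as np

def continuized_step(w, z, grad_f, tau, tau_p, gamma_bar, gamma_p):
    """One iteration of the discretized continuized update (14)-(16)."""
    v = w + tau * (z - w)                        # (14): mix the two iterates
    g = grad_f(v)
    w_next = v - gamma_bar * g                   # (15): gradient step from v
    z_next = z + tau_p * (v - z) - gamma_p * g   # (16): step on the z-sequence
    return w_next, z_next
```

Note that a single gradient evaluation $\nabla f(\tilde{v}_k)$ is shared by both (15) and (16), in contrast to the binary-search subroutine of Hinder et al. (2020).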

C MISSING PROOFS IN SECTION 3

C.1 PROOF OF LEMMA 3

Lemma 3 Suppose that the link function $\sigma(z)$ is $L_0$-Lipschitz and $\alpha$-increasing, i.e., $\sigma'(z) \geq \alpha > 0$ for all $z \in \mathbb{R}$. Then, the loss function (11) is $\alpha^2$-generalized variational coherent and $\frac{L_0^2}{2}$-generalized smooth w.r.t. $h(w, w^*) = \mathbb{E}_{x\sim\mathcal{D}}[((w - w^*)^\top x)^2]$. Therefore, the function (11) is $\rho = \frac{2\alpha^2}{L_0^2}$-quasar convex.

Proof. We first show generalized variational coherence. We have
$$\langle\nabla f(w), w - w^*\rangle \overset{(a)}{=} \mathbb{E}_{x\sim\mathcal{D}}\big[\big(\sigma(w^\top x) - \sigma(w^{*\top} x)\big)\sigma'(w^\top x)\langle w - w^*, x\rangle\big] = \mathbb{E}_{x\sim\mathcal{D}}\Big[\frac{\sigma(w^\top x) - \sigma(w^{*\top} x)}{(w - w^*)^\top x}\,\sigma'(w^\top x)\,((w - w^*)^\top x)^2\Big] \overset{(b)}{\geq} \alpha^2\,\mathbb{E}_{x\sim\mathcal{D}}\big[((w - w^*)^\top x)^2\big] = \alpha^2 h(w, w^*),$$
where (a) uses that $y = \sigma(w^{*\top} x)$, and (b) uses that the difference quotient satisfies $\frac{\sigma(w^\top x) - \sigma(w^{*\top} x)}{w^\top x - w^{*\top} x} \geq \alpha$, since $\sigma'(\cdot) \geq \alpha > 0$.

Now let us switch to showing generalized smoothness. We have
$$f(w) - f(w^*) = \mathbb{E}_{x\sim\mathcal{D}}\Big[\frac{1}{2}\big(\sigma(w^\top x) - \sigma(w^{*\top} x)\big)^2\Big] \leq \frac{L_0^2}{2}\,\mathbb{E}_{x\sim\mathcal{D}}\big[((w - w^*)^\top x)^2\big] = \frac{L_0^2}{2} h(w, w^*),$$
where the inequality is due to the $L_0$-Lipschitzness of $\sigma(\cdot)$. We can now invoke Lemma 2 to conclude that the objective function is $\rho = \frac{2\alpha^2}{L_0^2}$-quasar convex.

C.2 PROOF OF LEMMA 4

Proof. We have
$$f(w) - f(w^*) = \mathbb{E}_{x\sim\mathcal{D}}\Big[\frac{1}{2}\big((w^\top x)^2 - (w^{*\top} x)^2\big)^2\Big] = \mathbb{E}_{x\sim\mathcal{D}}\Big[\frac{1}{2}\big(w^\top x - w^{*\top} x\big)^2\big(w^\top x + w^{*\top} x\big)^2\Big] \leq \frac{1}{2}\|w - w^*\|_2^2\,\mathbb{E}_{x\sim\mathcal{D}}\big[((w + w^*)^\top x)^2\|x\|_2^2\big] \leq \frac{1}{2}C_R\|w - w^*\|_2^2.$$

C.3 PROOF OF LEMMA 5

Lemma 5 When the link function is ReLU, the loss function (11) is $\frac{1}{2}\mathbb{E}_{x\sim\mathcal{D}}[\|x\|_2^2]$-generalized smooth w.r.t. $h(w, w^*) = \|w - w^*\|_2^2$.

Proof. We have
$$f(w) - f(w^*) = \mathbb{E}_x\Big[\frac{1}{2}\big(\sigma(w^\top x) - \sigma(w^{*\top} x)\big)^2\Big] \leq \mathbb{E}_x\Big[\frac{1}{2}\big(w^\top x - w^{*\top} x\big)^2\Big] \leq \mathbb{E}_x\Big[\frac{1}{2}\|w - w^*\|_2^2\|x\|_2^2\Big] = \frac{1}{2}\mathbb{E}_x[\|x\|_2^2]\,\|w - w^*\|_2^2,$$
where the first inequality uses that ReLU is 1-Lipschitz.
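Lemma 3's two inequalities can also be checked numerically. The sketch below uses a hypothetical link $\sigma(z) = z + 0.5\sin(z)$, for which $\sigma'(z) \in [0.5, 1.5]$, so $\alpha = 0.5$ and $L_0 = 1.5$; since both inequalities hold pointwise per sample, the Monte Carlo averages must satisfy them as well:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical link with derivative in [alpha, L0]: sigma'(z) = 1 + 0.5*cos(z).
sigma = lambda z: z + 0.5 * np.sin(z)
dsigma = lambda z: 1.0 + 0.5 * np.cos(z)
alpha, L0 = 0.5, 1.5

d, n = 5, 20000
X = rng.normal(size=(n, d))
w_star, w = rng.normal(size=d), rng.normal(size=d)
u, v = X @ w, X @ w_star

h = np.mean((u - v) ** 2)                                     # h(w, w*)
inner = np.mean((sigma(u) - sigma(v)) * dsigma(u) * (u - v))  # <grad f(w), w - w*>
gap = np.mean(0.5 * (sigma(u) - sigma(v)) ** 2)               # f(w) - f(w*)

assert inner >= alpha ** 2 * h   # generalized variational coherence
assert gap <= (L0 ** 2 / 2) * h  # generalized smoothness
```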

C.4 PROOF OF LEMMA 6

Lemma 6 Suppose that the function $f(\cdot)$ satisfies $C_v$-one-point convexity and $\rho$-quasar convexity. Then, it is also $\big(\tilde{\rho} = \frac{\rho}{\theta},\ \tilde{\mu} = \frac{2C_v(\theta-1)}{\rho}\big)$-strongly quasar convex for any $\theta > 1$.

Proof. We have
$$f(w) - f(w^*) \leq \frac{1}{\rho}\langle\nabla f(w), w - w^*\rangle = \frac{\theta}{\rho}\langle\nabla f(w), w - w^*\rangle - \frac{\theta-1}{\rho}\langle\nabla f(w), w - w^*\rangle \leq \frac{\theta}{\rho}\langle\nabla f(w), w - w^*\rangle - \frac{\theta-1}{\rho}C_v\|w - w^*\|^2,$$
where the last inequality uses the definition of $C_v$-one-point convexity. Rearranging the above inequality, we get
$$f(w^*) \geq f(w) + \frac{1}{\rho/\theta}\langle\nabla f(w), w^* - w\rangle + \frac{2C_v(\theta-1)/\rho}{2}\|w^* - w\|^2.$$

C.5 PROOF OF LEMMA 7 AND LEMMA 8

Lemma 7 Suppose that the function $f(\cdot)$ is $\nu$-QG and $\rho$-quasar convex w.r.t. a global minimizer $w^*$. Then, it is also $\big(\tilde{\rho} = \rho\theta,\ \tilde{\mu} = \frac{\nu(1-\theta)}{\theta}\big)$-strongly quasar convex for any $\theta < 1$.

Proof. By $\rho$-quasar convexity, we have
$$\langle\nabla f(w), w - w^*\rangle \geq \rho(f(w) - f(w^*)) = \rho\theta(f(w) - f(w^*)) + \rho(1-\theta)(f(w) - f(w^*)) \geq \rho\theta(f(w) - f(w^*)) + \frac{\rho(1-\theta)\nu}{2}\|w - w^*\|^2,$$
where the last inequality uses the definition of $\nu$-QG. Rearranging the above inequality, we get
$$f(w^*) \geq f(w) + \frac{1}{\rho\theta}\langle\nabla f(w), w^* - w\rangle + \frac{\nu(1-\theta)/\theta}{2}\|w^* - w\|^2,$$
which shows the result.

Lemma 8 Following the setting of Lemma 3, assume that the smallest eigenvalue of the matrix $\mathbb{E}_{x\sim\mathcal{D}}[xx^\top]$ satisfies $\lambda_{\min}(\mathbb{E}_{x\sim\mathcal{D}}[xx^\top]) > 0$. Then, the function (11) is $\alpha^2\lambda_{\min}(\mathbb{E}_{x\sim\mathcal{D}}[xx^\top])$-QG.

Proof. We have
$$f(w) - f(w^*) = \mathbb{E}_{x\sim\mathcal{D}}\Big[\frac{1}{2}\big(\sigma(w^\top x) - \sigma(w^{*\top} x)\big)^2\Big] = \mathbb{E}_{x\sim\mathcal{D}}\Big[\frac{1}{2}\Big(\frac{\sigma(w^\top x) - \sigma(w^{*\top} x)}{w^\top x - w^{*\top} x}\Big)^2\big(w^\top x - w^{*\top} x\big)^2\Big] \geq \frac{1}{2}\alpha^2\,\mathbb{E}_{x\sim\mathcal{D}}\big[(w^\top x - w^{*\top} x)^2\big] = \frac{1}{2}\alpha^2(w - w^*)^\top\mathbb{E}_{x\sim\mathcal{D}}[xx^\top](w - w^*) \geq \frac{1}{2}\alpha^2\lambda_{\min}(\mathbb{E}_{x\sim\mathcal{D}}[xx^\top])\|w - w^*\|^2,$$
where the second-to-last inequality uses that the derivative of the link function satisfies $\sigma'(\cdot) \geq \alpha$.

D PROOF OF THEOREM 1 AND THEOREM 2

Theorem 1 Assume that the function $f(\cdot)$ is $L$-smooth and $\rho$-quasar convex. Let $\eta_t = \frac{2}{\rho t}$, $\eta'_t = 0$, $\gamma_t = \frac{1}{L}$, and $\gamma'_t = \frac{\rho t}{2L}$. Then, the update $w_t$ of the continuized algorithm (3) satisfies
$$\mathbb{E}[f(w_t) - f(w^*)] \leq \frac{2L\|z_0 - w^*\|^2}{\rho^2 t^2}.$$
Furthermore, for the update $\tilde{w}_k$ of the discrete-time algorithm (8)-(10), if the parameters are chosen as $\tau_k = 1 - \big(\frac{T_k}{T_{k+1}}\big)^{2/\rho}$, $\tau'_k = 0$, $\bar{\gamma}_k = \frac{1}{L}$, and $\gamma'_k = \frac{\rho T_k}{2L}$, then
$$\mathbb{E}\big[T_k^2\big(f(\tilde{w}_k) - f(w^*)\big)\big] \leq \frac{2L\|\tilde{z}_0 - w^*\|^2}{\rho^2}.$$

Theorem 2 Assume that the function $f(\cdot)$ is $L$-smooth and $(\rho,\mu)$-strongly quasar convex, where $\mu > 0$. Let $\gamma_t = \frac{1}{L}$, $\gamma'_t = \frac{1}{\sqrt{\mu L}}$, $\eta_t = \sqrt{\frac{\mu}{L}}$, and $\eta'_t = \rho\sqrt{\frac{\mu}{L}}$. Then, the update $w_t$ of the continuized algorithm (3) satisfies
$$\mathbb{E}[f(w_t) - f(w^*)] \leq \Big(f(w_0) - f(w^*) + \frac{\mu}{2}\|z_0 - w^*\|^2\Big)\exp\Big(-\rho\sqrt{\frac{\mu}{L}}\,t\Big).$$
Furthermore, for the update $\tilde{w}_k$ of the discrete-time algorithm (8)-(10), if the parameters are chosen as
$$\tau_k = \frac{1}{1+\rho}\Big(1 - \exp\Big(-(1+\rho)\sqrt{\tfrac{\mu}{L}}(T_{k+1} - T_k)\Big)\Big),\qquad \tau'_k = \frac{\rho\big(1 - \exp\big(-(1+\rho)\sqrt{\mu/L}\,(T_{k+1} - T_k)\big)\big)}{\rho + \exp\big(-(1+\rho)\sqrt{\mu/L}\,(T_{k+1} - T_k)\big)},$$
$\bar{\gamma}_k = \frac{1}{L}$, and $\gamma'_k = \frac{1}{\sqrt{\mu L}}$, then
$$\mathbb{E}\Big[\exp\Big(\rho\sqrt{\tfrac{\mu}{L}}\,T_k\Big)\big(f(\tilde{w}_k) - f(w^*)\big)\Big] \leq f(\tilde{w}_0) - f(w^*) + \frac{\mu}{2}\|\tilde{z}_0 - w^*\|^2.$$

Based on (65), (66), and (21), we conclude that $\tau'_k = \frac{\rho(1 - \exp(-(1+\rho)\sqrt{\mu/L}\,(T_{k+1} - T_k)))}{\rho + \exp(-(1+\rho)\sqrt{\mu/L}\,(T_{k+1} - T_k))}$. Moreover, we have $\bar{\gamma}_k = \gamma_{T_k} = \frac{1}{L}$ and $\gamma'_k = \gamma'_{T_k} = \frac{1}{\sqrt{\mu L}}$. This proves the second part of Theorem 2.
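Theorem 1's discrete-time algorithm is simple to implement: draw the Poisson jump times $T_k$ as cumulative sums of Exp(1) increments and plug them into the parameter formulas. A sketch under these assumptions (function names ours), here run on a convex quadratic, which is 1-quasar convex:

```python
import numpy as np

def accel_quasar(grad_f, w0, L, rho, K, rng):
    """Discrete-time continuized acceleration with Theorem 1's parameters:
    tau_k = 1 - (T_k/T_{k+1})^{2/rho}, tau'_k = 0, gamma_bar = 1/L,
    gamma'_{k+1} = rho*T_{k+1}/(2L)."""
    T = np.concatenate(([0.0], np.cumsum(rng.exponential(1.0, K))))
    w, z = w0.astype(float).copy(), w0.astype(float).copy()
    for k in range(K):
        tau = 1.0 - (T[k] / T[k + 1]) ** (2.0 / rho)
        v = w + tau * (z - w)          # (14)
        g = grad_f(v)                  # one gradient evaluation per iteration
        w = v - g / L                  # (15)
        z = z - rho * T[k + 1] / (2.0 * L) * g  # (16) with tau'_k = 0
    return w
```

Note that the convergence guarantee is in expectation over the jump times, which is why the paper converts it into a high-probability statement in Corollary 1.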

D.3 PROOF OF COROLLARY 1 AND COROLLARY 2

Corollary 1 The update $\tilde{w}_k$ of the algorithm (8)-(10) with the same parameters indicated in Theorem 1 satisfies
$$f(\tilde{w}_k) - f(w^*) \leq \frac{2c_0 L\|\tilde{z}_0 - w^*\|^2}{(1-c)^2\rho^2 k^2}$$
with probability at least $1 - \frac{1}{c^2 k} - \frac{1}{c_0}$ for any $c \in (0,1)$ and $c_0 > 1$.

Proof. Using Markov's inequality and Theorem 1, we get
$$\Pr\big(T_k^2(f(\tilde{w}_k) - f(w^*)) \geq C_0\big) \leq \frac{\mathbb{E}[T_k^2(f(\tilde{w}_k) - f(w^*))]}{C_0} \leq \frac{2L\|\tilde{z}_0 - w^*\|^2/\rho^2}{C_0}. \quad (67)$$
Let $C_0 := c_0 \cdot 2L\|\tilde{z}_0 - w^*\|^2/\rho^2$, where $c_0 > 1$ is a universal constant. Then, with probability at least $1 - \frac{1}{c_0}$,
$$T_k^2\big(f(\tilde{w}_k) - f(w^*)\big) \leq \frac{2c_0 L\|\tilde{z}_0 - w^*\|^2}{\rho^2}. \quad (68)$$
By Chebyshev's inequality, we have $\Pr(|T_k - \mathbb{E}[T_k]| \geq c\,\mathbb{E}[T_k]) \leq \frac{\mathrm{Var}(T_k)}{c^2(\mathbb{E}[T_k])^2}$, where $c > 0$ is a universal constant. Hence, we have $T_k \geq (1-c)\mathbb{E}[T_k] = (1-c)k$ with probability at least $1 - \frac{1}{c^2 k}$.

Corollary 2 states that
$$f(\tilde{w}_k) - f(w^*) \leq c_0\Big(f(\tilde{w}_0) - f(w^*) + \frac{\mu}{2}\|\tilde{z}_0 - w^*\|^2\Big)\exp\Big(-\rho\sqrt{\tfrac{\mu}{L}}(1-c)k\Big)$$
with probability at least $1 - \frac{1}{c^2 k} - \frac{1}{c_0}$ for any $c \in (0,1)$ and $c_0 > 1$.

Proof. Using Markov's inequality and Theorem 2, we get
$$\Pr\Big(\exp\Big(\rho\sqrt{\tfrac{\mu}{L}}T_k\Big)\big(f(\tilde{w}_k) - f(w^*)\big) \geq C_0\Big) \leq \frac{\mathbb{E}\big[\exp\big(\rho\sqrt{\mu/L}\,T_k\big)\big(f(\tilde{w}_k) - f(w^*)\big)\big]}{C_0} \leq \frac{f(\tilde{w}_0) - f(w^*) + \frac{\mu}{2}\|\tilde{z}_0 - w^*\|^2}{C_0}. \quad (69)$$
Let $C_0 := c_0\big(f(\tilde{w}_0) - f(w^*) + \frac{\mu}{2}\|\tilde{z}_0 - w^*\|^2\big)$, where $c_0 > 1$ is a universal constant. Then, with probability at least $1 - \frac{1}{c_0}$,
$$\exp\Big(\rho\sqrt{\tfrac{\mu}{L}}T_k\Big)\big(f(\tilde{w}_k) - f(w^*)\big) \leq c_0\Big(f(\tilde{w}_0) - f(w^*) + \frac{\mu}{2}\|\tilde{z}_0 - w^*\|^2\Big). \quad (70)$$
By Chebyshev's inequality, we have $\Pr(|T_k - \mathbb{E}[T_k]| \geq c\,\mathbb{E}[T_k]) \leq \frac{\mathrm{Var}(T_k)}{c^2(\mathbb{E}[T_k])^2}$, where $c > 0$ is a universal constant. Hence, we have $T_k \geq (1-c)\mathbb{E}[T_k] = (1-c)k$ with probability at least $1 - \frac{1}{c^2 k}$.

For Theorem 3, choose $\eta_t = \frac{\sqrt{\mu/\kappa}}{R^2}$, $\eta'_t = \frac{\sqrt{\mu/\kappa}}{R^2}$, $\gamma_t = \frac{1}{R^2}$, and $\gamma'_t = \frac{1}{\sqrt{\mu\kappa}R^2}$. Then, the update $w_t$ of (13) satisfies
$$\mathbb{E}\Big[\frac{1}{2}\|w_t - w^*\|^2\Big] \leq \Big(\frac{1}{2}\|w_0 - w^*\|^2 + \frac{\mu}{2}\|z_0 - w^*\|^2_{H(w_0)^{-1}}\Big)\exp\Big(-\frac{\sqrt{\mu/\kappa}}{R^2}\,t\Big).$$

Proof. Let us denote $H_t := H(w_t) = \mathbb{E}_x[\psi(w_t^\top x, w^{*\top} x)xx^\top]$ and consider a Lyapunov function for the continuized process (3), defined as
$$\phi_t := \frac{A_t}{2}\|w_t - w^*\|^2 + \frac{B_t}{2}\|z_t - w^*\|^2_{H_t^{-1}}.$$
We first show that $\phi_t$ is a super-martingale under certain values of the parameters $\eta_t, \eta'_t, \gamma_t, \gamma'_t, A_t$, and $B_t$. Let us denote the process $\bar{w}_t := (t, w_t, z_t)$, which satisfies
$$d\bar{w}_t = b(\bar{w}_t)dt + \int_\Xi G(\bar{w}_t;\xi)\,dN(t,\xi),\qquad b(\bar{w}_t) = \begin{pmatrix}1\\ \eta_t(z_t - w_t)\\ \eta'_t(w_t - z_t)\end{pmatrix},\qquad G(\bar{w}_t;\xi) = \begin{pmatrix}0\\ -\gamma_t\,g(w_t;\xi)\\ -\gamma'_t\,g(w_t;\xi)\end{pmatrix}.$$
Then, by Proposition 2 of Even et al. (2021), we have $\phi_t = \phi_0 + \int_0^t I_s\,ds + M_t$, where $M_t$ is a martingale and
$$I_t := \langle\nabla\phi(\bar{w}_t), b(\bar{w}_t)\rangle + \mathbb{E}_\xi\big[\phi(\bar{w}_t + G(\bar{w}_t;\xi)) - \phi(\bar{w}_t)\big]. \quad (74)$$
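The jump-process representation above can also be simulated directly: between Poisson jump times the linear mixing ODE has a closed-form solution, and gradient steps fire at the jumps. A minimal sketch of ours on a strongly convex quadratic (so $\rho = 1$), using Theorem 2's constant parameters as reconstructed here:

```python
import numpy as np

def continuized_run(grad_f, w0, L, mu, rho, K, rng):
    """Simulate the continuized process: solve the linear mixing ODE exactly
    between Exp(1)-spaced Poisson jumps; apply gradient steps at the jumps."""
    eta, eta_p = np.sqrt(mu / L), rho * np.sqrt(mu / L)  # mixing rates
    gamma, gamma_p = 1.0 / L, 1.0 / np.sqrt(mu * L)      # jump step sizes
    w, z = w0.astype(float).copy(), w0.astype(float).copy()
    for _ in range(K):
        dt = rng.exponential(1.0)                  # waiting time to next jump
        s = (w - z) * np.exp(-(eta + eta_p) * dt)  # w - z contracts under the ODE
        m = (eta_p * w + eta * z) / (eta + eta_p)  # invariant mean of the ODE
        w, z = m + eta * s / (eta + eta_p), m - eta_p * s / (eta + eta_p)
        g = grad_f(w)                              # gradient at w_{t-}
        w, z = w - gamma * g, z - gamma_p * g
    return w
```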



is a function whose inputs are $w$ and $w^*$.

variational coherent and $C_l$-generalized smooth w.r.t. the same non-negative function $h(\cdot,\cdot)$, then the function satisfies $\rho$-quasar convexity with $\rho = \frac{C_v}{C_l}$.

where we simply let $\theta = 2$. On the other hand, Guille-Escuret et al. (2022) consider minimizing a class of functions that satisfies $C_v$-one-point convexity and a condition called the $L$-upper error bound condition (L-EB$^+$). A function satisfies L-EB$^+$ if $\|\nabla f(w) - \nabla f(w^*)\|_2 \leq L\|w - w^*\|_2$ for a fixed minimizer $w^*$ and any $w \in \mathbb{R}^d$. Guille-Escuret et al. (2022) show that the optimal iteration complexity $k$ to have $w_k - w$

Denote a matrix $H(w) := \mathbb{E}_x[\psi(w^\top x, w^{*\top} x)xx^\top]$, where $\psi(a, b) := \frac{\sigma(a) - \sigma(b)}{a - b}$

where $w^* \sim \mathcal{N}(0, I_d)$ is the true vector and $\sigma(\cdot)$ is the link function. In the experiments, we set the number of samples $n = 1000$ and the dimension $d = 50$. The initial point $w_0 \in \mathbb{R}^d$ of all the algorithms is close to zero and is sampled as $w_0 \sim 10^{-2}\zeta$, where $\zeta \sim \mathcal{N}(0, I_d)$. Since the continuized acceleration is randomized due to the Poisson process, we replicate it over 10 runs and report the results averaged over these runs. Both the continuized acceleration and the AGD of Hinder et al. (2020) need knowledge of $L$, $\rho$, and $\mu$ to set their parameters as prescribed by the theory.
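The synthetic setup described above can be reproduced in a few lines (logistic link shown; the rng seed is our choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 50
w_star = rng.normal(size=d)                  # true vector w* ~ N(0, I_d)
X = rng.normal(size=(n, d))                  # covariates
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic link
y = sigma(X @ w_star)                        # noiseless GLM labels
w0 = 1e-2 * rng.normal(size=d)               # close-to-zero initial point

loss = lambda w: 0.5 * np.mean((sigma(X @ w) - y) ** 2)
```

The ReLU and quadratic links from Figure 1 are obtained by swapping `sigma` accordingly.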

(a) Logistic link (x: iteration). (b) Logistic link (x: # calls). (c) Logistic link (x: time). (d) ReLU link (x: iteration). (e) ReLU link (x: # calls). (f) ReLU link (x: time). (g) Quadratic link (x: iteration). (h) Quadratic link (x: # calls). (i) Quadratic link (x: time).

Figure 1: Comparison of the continuized Nesterov acceleration, GD, and AGD (Hinder et al., 2020).

Lemma 4 Assume that there exists a finite constant $C_R > 0$ such that all $w \in \mathbb{R}^d$ in the balls of radius $R$ centered at $\pm w^*$ satisfy $\mathbb{E}_{x\sim\mathcal{D}}\big[((w + w^*)^\top x)^2\|x\|_2^2\big] \leq C_R$. Then, the loss function (11) is $\frac{1}{2}C_R$-generalized smooth w.r.t. $h(w, w^*) = \|w - w^*\|_2^2$.

where we used the fact that $\mathbb{E}[T_k] = \mathrm{Var}[T_k] = k$, as $T_k$ is the $k$-th jump time of a rate-1 Poisson process, i.e., a sum of $k$ i.i.d. exponential random variables with mean 1. Combining this lower bound on $T_k$ with (68) leads to the result.

Corollary 2 The update $\tilde{w}_k$ of the algorithm (8)-(10) with the same parameters indicated in Theorem 2 satisfies $f$

where we used the fact that $\mathbb{E}[T_k] = \mathrm{Var}[T_k] = k$, as $T_k$ is the $k$-th jump time of a rate-1 Poisson process, i.e., a sum of $k$ i.i.d. exponential random variables with mean 1. Combining this lower bound on $T_k$ with (70) leads to the result.

E PROOF OF THEOREM 3

Theorem 3 (Continuized algorithm (13) for GLMs) Choose
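Since $T_k$ is a sum of $k$ i.i.d. exponential(1) increments, $\mathbb{E}[T_k] = \mathrm{Var}[T_k] = k$; a quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(1)
k, trials = 50, 200000
# T_k = k-th jump time of a rate-1 Poisson process: a sum of k Exp(1) increments.
Tk = rng.exponential(1.0, size=(trials, k)).sum(axis=1)

assert abs(Tk.mean() - k) < 0.2   # E[T_k] = k
assert abs(Tk.var() - k) < 2.0    # Var[T_k] = k
```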

Then, by Proposition 2 of Even et al. (2021), we have

3.2.2 POLYAK-ŁOJASIEWICZ (PL) OR QUADRATIC-GROWTH (QG) FUNCTIONS WITH QUASAR CONVEXITY

For the first term of $I_t$, we have
$$\langle\nabla\phi(\bar{w}_t), b(\bar{w}_t)\rangle = \partial_t\phi(\bar{w}_t) + \langle\partial_w\phi(\bar{w}_t), \eta_t(z_t - w_t)\rangle + \langle\partial_z\phi(\bar{w}_t), \eta'_t(w_t - z_t)\rangle, \quad (75)$$
which contains the cross terms $A_t\eta_t\langle w_t - w^*, z_t - w_t\rangle$ and $B_t\eta'_t\langle z_t - w^*, H_t^{-1}(w_t - z_t)\rangle$. Since $H_t = \mathbb{E}_x[\psi(w_t^\top x, w^{*\top} x)xx^\top] \succeq \mu I_d$, the second cross term can be controlled together with $A_t\eta_t\langle w_t - w^*, z_t - w_t\rangle$.

For the second term of $I_t$, we have
$$\mathbb{E}_\xi\big[\phi(\bar{w}_t + G(\bar{w}_t;\xi)) - \phi(\bar{w}_t)\big] = \frac{A_t\gamma_t^2}{2}\mathbb{E}_\xi\|g(w_t;\xi)\|^2 - A_t\gamma_t\langle w_t - w^*, \mathbb{E}_\xi g(w_t;\xi)\rangle + \frac{B_t(\gamma'_t)^2}{2}\mathbb{E}_\xi\|g(w_t;\xi)\|^2_{H_t^{-1}} - B_t\gamma'_t\langle z_t - w^*, H_t^{-1}\mathbb{E}_\xi g(w_t;\xi)\rangle. \quad (80)$$
Let us upper-bound the first two terms in (80). We have
$$\mathbb{E}_\xi\big[\|g(w_t;\xi)\|^2\big] = \mathbb{E}_x\big[\big\langle(\sigma(w_t^\top x) - y)x, (\sigma(w_t^\top x) - y)x\big\rangle\big] = \mathbb{E}_x\big[\psi(w_t^\top x, w^{*\top} x)^2((w_t - w^*)^\top x)^2\|x\|^2\big] \leq R^2\|w_t - w^*\|^2_{H_t},$$
where in the last inequality we used
$$\mathbb{E}_x\big[\psi(w_t^\top x, w^{*\top} x)^2\|x\|^2 xx^\top\big] \preceq R^2\,\mathbb{E}_x\big[\psi(w_t^\top x, w^{*\top} x)xx^\top\big] = R^2 H_t. \quad (82)$$
Furthermore, we have
$$\mathbb{E}_\xi\big[\langle g(w_t;\xi), w_t - w^*\rangle\big] = \mathbb{E}_x\big[\big\langle(\sigma(w_t^\top x) - \sigma(w^{*\top} x))x, w_t - w^*\big\rangle\big] = (w_t - w^*)^\top\mathbb{E}_x\big[\psi(w_t^\top x, w^{*\top} x)xx^\top\big](w_t - w^*) = \|w_t - w^*\|^2_{H_t}.$$
Hence, the first two terms of (80) are at most
$$\frac{A_t\gamma_t^2 R^2}{2}\|w_t - w^*\|^2_{H_t} - A_t\gamma_t\|w_t - w^*\|^2_{H_t}. \quad (84)$$
Now let us switch to upper-bounding the last two terms in (80). We have
$$\mathbb{E}_\xi\big[\|g(w_t;\xi)\|^2_{H_t^{-1}}\big] = \mathbb{E}_x\big[\big\langle(\sigma(w_t^\top x) - y)x, H_t^{-1}(\sigma(w_t^\top x) - y)x\big\rangle\big] = \mathbb{E}_x\big[\psi(w_t^\top x, w^{*\top} x)^2((w_t - w^*)^\top x)^2\,x^\top H_t^{-1}x\big],$$
and
$$\mathbb{E}_\xi\big[\langle z_t - w^*, H_t^{-1}g(w_t;\xi)\rangle\big] = \langle z_t - w^*, w_t - w^*\rangle.$$
Combining the bounds above, $I_t$ is upper-bounded by an expression whose cross term is $(A_t\eta_t - B_t\gamma'_t)\langle w_t - w^*, z_t - w^*\rangle$. Now let us determine $\eta_t, \eta'_t, \gamma_t, \gamma'_t, A_t$, and $B_t$. We start by taking $\gamma_t = \frac{1}{R^2}$. We want $I_t \leq 0$, so we want to satisfy
$$\frac{dA_t}{dt} = A_t\eta_t,\qquad \frac{dB_t}{dt} = B_t\eta'_t,\qquad A_t\eta_t = B_t\gamma'_t,\qquad B_t\eta'_t = \mu\frac{dA_t}{dt},\qquad B_t(\gamma'_t)^2 = \frac{A_t\gamma_t}{\kappa R^2}. \quad (91)$$

ACKNOWLEDGMENTS

We thank the reviewers for their constructive feedback, which helped improve the quality of this paper.

Published as a conference paper at ICLR 2023

The proof follows that of Theorem 2 in Even et al. (2021), with some modifications to account for (strong) quasar convexity. We consider a Lyapunov function for the continuized process (3), defined as
$$\phi_t := A_t\big(f(w_t) - f(w^*)\big) + \frac{B_t}{2}\|z_t - w^*\|^2.$$
We will show that $\phi_t$ is a super-martingale under certain choices of the parameters $\eta_t, \eta'_t, \gamma_t, \gamma'_t, A_t$, and $B_t$. Let us first denote the process $\bar{w}_t := (t, w_t, z_t)$, whose dynamic is $d\bar{w}_t = b(\bar{w}_t)dt + \int_\Xi G(\bar{w}_t;\xi)\,dN(t,\xi)$. Then, by Proposition 2 of Even et al. (2021), we have $\phi_t = \phi_0 + \int_0^t I_s\,ds + M_t$, where $M_t$ is a martingale. Therefore, to show that $\phi_t$ is a super-martingale, it suffices to show that $I_t \leq 0$.

For the first term of $I_t$, we expand $\langle\nabla\phi(\bar{w}_t), b(\bar{w}_t)\rangle$ as in (34). By $(\rho,\mu)$-strong quasar convexity, the resulting inner product $\langle\nabla f(w_t), w_t - w^*\rangle$ is bounded from below as in (38). For the second term of $I_t$, smoothness gives (40), so the second term can be bounded as in (41). Combining (34), (38), and (41), we obtain an upper bound on $I_t$.

Now let us determine the parameters $\eta_t, \eta'_t, \gamma_t, \gamma'_t, A_t$, and $B_t$. We start by taking $\gamma_t = \frac{1}{L}$. Since we need $I_t \leq 0$, we want to satisfy the conditions collected in (43). Let us choose the parameters as in (45), which ensures that the last three conditions of (43) are satisfied. It remains to show that the first two hold. We have (46), where (a) uses that $\frac{dA_t}{dt} = \rho A_t\eta_t = \rho A_t\sqrt{\frac{B_t}{LA_t}}$ from (45), and (b) uses $\frac{dB_t}{dt} = \rho\mu\sqrt{\frac{A_tB_t}{L}}$ from (45). The equations in (46) imply that both remaining conditions hold.

For the case of $\mu = 0$, we choose $A_0 = 0$ and $B_0 = 1$. From (46), we have $A_t = \frac{\rho^2 t^2}{4L}$, $\eta'_t = 0$, and $\eta_t = \frac{2}{\rho t}$. Therefore, as $\phi_t$ is a super-martingale, we get $\mathbb{E}[\phi_t] \leq \phi_0$, so
$$\mathbb{E}[f(w_t) - f(w^*)] \leq \frac{\phi_0}{A_t} = \frac{2L\|z_0 - w^*\|^2}{\rho^2 t^2}.$$
This proves the first part of Theorem 1.

The ODEs (4)-(5) become $dw_t = \frac{2}{\rho t}(z_t - w_t)dt$ and $dz_t = 0$. Integrating the ODEs from time $t_0$ to $t$ gives (52). Using Lemma 1 with $t_0 = T_k$ and $t = T_{k+1}-$, (52) becomes (53). This together with (8) implies that $\tau_k = 1 - \big(\frac{T_k}{T_{k+1}}\big)^{2/\rho}$, while comparing (20) and (53) leads to $\tau'_k = 0$, $\bar{\gamma}_{k+1} = \gamma_{T_{k+1}} = \frac{1}{L}$, and $\gamma'_{k+1} = \gamma'_{T_{k+1}} = \frac{\rho T_{k+1}}{2L}$.

For the case of $\mu > 0$, we have $A_t = \exp\big(\rho\sqrt{\frac{\mu}{L}}\,t\big)$ and $B_t = \mu A_t$. Then, we can conclude that
$$\mathbb{E}[f(w_t) - f(w^*)] \leq \frac{\phi_0}{A_t} = \Big(f(w_0) - f(w^*) + \frac{\mu}{2}\|z_0 - w^*\|^2\Big)\exp\Big(-\rho\sqrt{\tfrac{\mu}{L}}\,t\Big).$$
This proves the first part of Theorem 2.

The ODEs (4)-(5) become $dw_t = \sqrt{\frac{\mu}{L}}(z_t - w_t)dt$ and $dz_t = \rho\sqrt{\frac{\mu}{L}}(w_t - z_t)dt$. The solutions follow from integrating this linear system. Taking $t_0 = T_k$ and $t = T_{k+1}-$, the above becomes (64). Comparing equation (64) and (14), we know that $\tau_k = \frac{1}{1+\rho}\big(1 - \exp\big(-(1+\rho)\sqrt{\mu/L}\,(T_{k+1} - T_k)\big)\big)$. Furthermore, by using (64), we get (65) and (66).

Let us choose the parameters so that (92) and (93) hold, which ensures that the last three conditions of (91) are satisfied.
It remains to show that the first two hold. We have (94), where (a) uses that $\frac{dA_t}{dt} = A_t\eta_t = A_t\sqrt{\frac{B_t}{\kappa R^4 A_t}}$ and (b) uses that $\frac{dB_t}{dt} = B_t\eta'_t = \mu\sqrt{\frac{A_tB_t}{\kappa R^4}}$, from (92) and (93). The equations in (94) imply (95). Let us choose $A_t$ and $B_t$ as in (96). Then, we can conclude that $\phi_t$ is a super-martingale, so $\mathbb{E}[\phi_t] \leq \phi_0$; we have $\frac{B_0}{A_0} = \mu$ from (96), and we can choose $A_0 = 1$. This proves Theorem 3.

Now let us switch to determining the corresponding parameters of the discrete-time algorithm (8)-(10), where the gradient $\nabla f(w_t)$ is now replaced with the stochastic pseudo-gradient $g(w_t;\xi_t)$. The ODEs (4)-(5) become $dw_t = \frac{\sqrt{\mu/\kappa}}{R^2}(z_t - w_t)dt$ and $dz_t = \frac{\sqrt{\mu/\kappa}}{R^2}(w_t - z_t)dt$. The solutions follow from integrating this linear system. Taking $t_0 = T_k$ and $t = T_{k+1}-$, the above becomes (105). Comparing equation (105) and (14), we know
$$\tau_k = \frac{1}{2}\Big(1 - \exp\Big(-\frac{2\sqrt{\mu/\kappa}}{R^2}(T_{k+1} - T_k)\Big)\Big)\qquad\text{and}\qquad \bar{\tau}_k = \frac{1}{2}\Big(1 - \exp\Big(-\frac{2\sqrt{\mu/\kappa}}{R^2}(T_{k+1} - T_k)\Big)\Big).$$
Furthermore, by using (105), we get (106) and (107). Based on (106), (107), and (21), we conclude that
$$\tau'_k = \frac{1 - \exp\big(-\frac{2\sqrt{\mu/\kappa}}{R^2}(T_{k+1} - T_k)\big)}{1 + \exp\big(-\frac{2\sqrt{\mu/\kappa}}{R^2}(T_{k+1} - T_k)\big)}.$$
Moreover, we have $\bar{\gamma}_k = \gamma_{T_k} = \frac{1}{R^2}$ and $\gamma'_k = \gamma'_{T_k} = \frac{1}{\sqrt{\mu\kappa}R^2}$. We can now conclude that the corresponding discrete-time algorithm under the above choice of parameters satisfies the guarantee stated in the second part of Theorem 3.

