ESCAPING SADDLE POINTS IN ZEROTH-ORDER OPTIMIZATION: TWO FUNCTION EVALUATIONS SUFFICE

Abstract

Two-point zeroth-order methods are important in many applications of zeroth-order optimization arising in robotics, wind farms, power systems, online optimization, and adversarial robustness to black-box attacks in deep neural networks, where the problem can be high-dimensional and/or time-varying. Furthermore, such problems may be nonconvex and contain saddle points. While existing works have shown that zeroth-order methods utilizing Ω(d) function evaluations per iteration (with d denoting the problem dimension) can escape saddle points efficiently, it remains an open question whether zeroth-order methods based on two-point estimators can escape saddle points. In this paper, we show that by adding an appropriate isotropic perturbation at each iteration, a zeroth-order algorithm based on 2m (for any 1 ≤ m ≤ d) function evaluations per iteration can not only find ε-second order stationary points polynomially fast, but do so using only Õ(d/ε^{2.5}) function evaluations.

Related Work. Due to space considerations, we defer a full discussion of related work to Appendix A.

We make the following assumptions on the class of functions f : R^d → R which we consider.

Assumption 1 (Properties of f). We suppose that f : R^d → R satisfies the following properties: 1. f is twice-differentiable and lower bounded, i.e. f* := min_x f(x) > −∞. 2. f is L-gradient Lipschitz, i.e. ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y. 3. f is ρ-Hessian Lipschitz, i.e. ‖∇²f(x) − ∇²f(y)‖ ≤ ρ‖x − y‖ for all x, y.

In our work, we focus on finding approximate second order stationary points, defined below.

Definition 1. A point x ∈ R^d is an (ε, ϕ)-second order stationary point if ‖∇f(x)‖ < ε and λ_min(∇²f(x)) > −ϕ.

1. INTRODUCTION

Two-point (or in general 2m-point, where 1 ≤ m < d with d being the problem dimension) estimators, which approximate the gradient using two (or 2m) function evaluations per iteration, have been widely studied by researchers in the zeroth-order optimization literature, in convex (Nesterov and Spokoiny, 2017; Duchi et al., 2015; Shamir, 2017) , nonconvex (Nesterov and Spokoiny, 2017) , online (Shamir, 2017) , as well as distributed settings (Tang et al., 2019) . A key reason for doing so is that for applications of zeroth-order optimization arising in robotics (Li et al., 2022) , wind farms (Tang et al., 2020a) , power systems (Chen et al., 2020) , online (time-varying) optimization (Shamir, 2017) , learning-based control (Malik et al., 2019; Li et al., 2021) , and improving adversarial robustness to black-box attacks in deep neural networks (Chen et al., 2017) , it may be costly or impractical to wait for Ω(d) (where d denotes the problem dimension) function evaluations per iteration to make a step. This is especially true for high-dimensional and/or time-varying problems. See Appendix A for more discussion. However, despite the advantages of zeroth-order methods with two-point estimators, there has been a lack of existing work studying the ability of two-point estimators to escape saddle points in nonconvex optimization problems. Since nonconvex problems arise often in practice, it is crucial to know if two-point algorithms can efficiently escape saddle points of nonconvex functions and converge to second-order stationary points. To motivate the challenges of escaping saddle points using two-point zeroth-order methods, we begin with a review of escaping saddle points using first-order methods. The problem of efficiently escaping saddle points in deterministic first-order optimization (with exact gradients) has been carefully studied in several earlier works (Jin et al., 2017; 2018) . 
A key idea in these works is the injection of an isotropic perturbation whenever the gradient is small, facilitating escape from a saddle whenever a negative curvature direction exists, even without actively identifying that direction. However, the analysis of efficient saddle point escape for stochastic gradient methods is often more complicated. In general, the behavior of the stochastic gradient near the saddle point can be difficult to characterize. Hence, strong concentration assumptions are typically made on the stochastic gradients being used, such as subGaussianity, bounded variance, or a bounded gradient estimator (Ge et al., 2015; Daneshmand et al., 2018; Xu et al., 2018; Fang et al., 2019; Roy et al., 2020; Vlaski and Sayed, 2021b), creating an analytical issue when such idealized assumptions fail to hold. Indeed, though zeroth-order methods can be viewed as stochastic gradient methods, common zeroth-order estimators, such as two-point estimators (Nesterov and Spokoiny, 2017), are not subGaussian, and can have unbounded variance. For instance, it can be shown that the variance of the two-point estimator is on the order of Ω(d‖∇f(x)‖²) (Nesterov and Spokoiny, 2017), with a dependence both on the problem dimension d and on the norm of the gradient, which can be unbounded. Due to non-subGaussianity and unboundedness, it is tricky to bound the effect of such zeroth-order estimators and establish tight concentration inequalities that facilitate escape near saddle points. In addition, the large variance of the zeroth-order estimator is also an issue away from saddles, when the gradient is large. While this does not prevent showing function improvement in expectation, as we discuss later, it becomes an issue when guaranteeing high probability bounds.
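The Ω(d‖∇f(x)‖²) variance scaling is easy to verify empirically. For a linear function f(x) = c^⊤x, the two-point estimator equals (Z^⊤c)Z exactly (regardless of the smoothing radius u), and a short Monte Carlo computation (dimensions and sample counts below are illustrative) recovers E‖g − ∇f(x)‖² = (d + 1)‖∇f(x)‖²:

```python
import numpy as np

# Two-point estimator on f(x) = c^T x: g = (Z^T c) Z with Z ~ N(0, I).
# E[g] = c, and E||g - c||^2 = (d + 1) ||c||^2, i.e. the variance grows
# linearly with the dimension d and with the squared gradient norm.
rng = np.random.default_rng(0)
d, n_samples = 50, 20000
c = np.zeros(d); c[0] = 1.0              # gradient of f; ||c|| = 1

Z = rng.standard_normal((n_samples, d))
G = (Z @ c)[:, None] * Z                 # one estimate per row
mean_sq_err = np.mean(np.sum((G - c) ** 2, axis=1))
print(mean_sq_err)                       # close to d + 1 = 51
```

Doubling ‖c‖ quadruples the mean squared error, which is the gradient-norm dependence discussed above.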
Due to these difficulties, previous works on escaping saddle points in zeroth-order optimization have exclusively focused on approaches requiring Ω(d) function evaluations per iteration to accurately estimate the gradient (Bai et al., 2020; Vlatakis-Gkaragkounis et al., 2019), or in some cases negative curvature directions (Zhang et al., 2022; Lucchi et al., 2021) or the Hessian itself (Balasubramanian and Ghadimi, 2022), reducing in a sense the zeroth-order problem back to a first-order one. However, as explained earlier, two-point or 2m-point zeroth-order algorithms are important for high-dimensional and/or time-varying problems in many application areas. This raises an important question: Can two-point zeroth-order methods escape saddle points and reach approximate second order stationary points efficiently? Our Contribution. In this work, we show that by adding an appropriate isotropic perturbation at each iteration, a zeroth-order algorithm based on any number m of pairs (m ranging from 1 to d) of function evaluations per iteration can not only find (ε, √ε)-second order stationary points (cf. the definition later in Definition 1) polynomially fast, but do so using only Õ(polylog(1/δ) d/ε^{2.5}) function evaluations, with probability at least 1 − δ. In particular, this proves that using a single two-point zeroth-order estimator at each iteration (with appropriate perturbation) suffices to efficiently escape saddle points in zeroth-order optimization, with high probability. Moreover, for functions that are (ε, ψ)-strict saddle (see Definition 3 for a definition of strict saddle functions), our result becomes Õ(polylog(1/δ) d/(ε² ψ)), which is a significant improvement when ψ ≫ √(ρε); strict saddle functions have been identified as an important class of functions in nonconvex optimization, with several well-known examples such as tensor decomposition (Ge et al., 2015), dictionary learning and phase retrieval (Sun et al., 2015).
A comparison of our results with existing zeroth-order and first-order methods is shown in Table 1. We also provide numerical results in Appendix G showing that our proposed two-point algorithm requires fewer total function evaluations to converge than zeroth-order methods that use 2d function evaluations per iteration, for a nonconvex test function proposed in Du et al. (2017). To overcome the theoretical challenges that were discussed earlier, we i) first show, via a careful analysis, that zeroth-order methods can make function value improvement across iterates with large gradients with high probability, even when only a single two-point estimator (which can have significant variance at large gradients) is used per iteration. ii) Second, near saddle points, we overcome issues caused by the unbounded variance and non-subGaussianity of zeroth-order gradient estimators by developing new technical tools, including novel martingale concentration inequalities involving Gaussian vectors, to tightly bound such terms. In turn, this allows us to show that the noise emanating from the zeroth-order estimators will not overwhelm the effect of the additional isotropic perturbative noise, facilitating escape along negative curvature directions. To the best of our knowledge, both analyses are novel, and may be independent contributions on their own.

[Table 1: iteration complexity and function evaluations per iteration of first-order methods (Jin et al., 2017) and existing zeroth-order methods, compared with ours. All listed methods converge to (ε, √(ρε))-second order stationary points in smooth, nonconvex functions; for †, the convergence is to (ε, ε^{2/3})-second order stationary points. For ‡, the term ψ̃ in the denominator is (i) ψ when the function f is (ε, ψ)-strict saddle for a ψ > O(√ε) (see Definition 3 for a definition) and (ii) O(√ε) otherwise.]

We define an (ε, ϕ)-approximate saddle point as follows. Definition 2. A point x ∈ R^d is an (ε, ϕ)-approximate saddle point if ‖∇f(x)‖ < ε and λ_min(∇²f(x)) ≤ −ϕ.
Following past convention (Jin et al., 2019a), we will focus in particular on escaping (ε, √(ρε))-saddle points. For notational simplicity, in the following text, we refer to (ε, √(ρε))-saddle points simply as ε-saddle points, and to (ε, √(ρε))-second order stationary points as ε-second order stationary points. Beyond the definition of ε-approximate saddle points above, it is known that many nonconvex functions with saddle points, such as orthogonal tensor decomposition (Ge et al., 2015), phase retrieval and dictionary learning (Sun et al., 2015), satisfy what is known as a strict saddle condition (Ge et al., 2015). For the Hessians of the saddle points of such functions, there is always a strictly negative eigenvalue whose magnitude is bounded from below. We provide a precise definition below.

Definition 3. A twice-differentiable function f(x) is (ε, ψ)-strict saddle if, for any point x, at least one of the following is true: ‖∇f(x)‖ ≥ ε or λ_min(∇²f(x)) ≤ −ψ.

As a corollary of Theorem 1 (to be stated later), for functions f which are (ε, ψ)-strict saddle, assuming that ψ ≥ √(ρε), the sample complexity of our algorithm scales as Ω(d max{L², L}(f(x₀) − f*)/(m ε² ψ)), which scales as Ω(d/(m ε²)) when ψ is of size Ω(1). Thus, in this setting, for two-point estimators, where m = 1, the dependence on d and ε in our sample complexity (as measured by function evaluations) matches that achieved by the algorithms in Vlatakis-Gkaragkounis et al. (2019); Zhang et al. (2022), which have to use 2d function evaluations per iteration to estimate the gradient. In our work, we consider the following batch symmetric two-point zeroth-order estimator.

Definition 4 ((Batch) two-point zeroth-order estimator with perturbation). We define an m-batch two-point zeroth-order estimator as follows:

g_u^{(m)}(x) := (1/m) Σ_{i=1}^{m} [(f(x + u Z_i) − f(x − u Z_i)) / (2u)] Z_i,

where the Z_i are drawn i.i.d. from N(0, I), and u > 0 is a smoothing radius.
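As an illustration, the estimator of Definition 4 takes only a few lines to implement (a minimal NumPy sketch; the quadratic test function, the point x, and the parameter values are illustrative choices):

```python
import numpy as np

def two_point_estimator(f, x, u=1e-3, m=1, rng=None):
    """m-batch symmetric two-point zeroth-order gradient estimator
    (Definition 4): g = (1/m) * sum_i [(f(x + u Z_i) - f(x - u Z_i)) / (2u)] Z_i,
    with Z_i drawn i.i.d. from N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(m):
        z = rng.standard_normal(d)
        g += (f(x + u * z) - f(x - u * z)) / (2 * u) * z
    return g / m

# Sanity check on a quadratic, where the symmetric difference is exact
# and E[g] = E[Z Z^T] grad f(x) = grad f(x).
f = lambda x: 0.5 * np.dot(x, x)          # gradient of f is x itself
x = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
g_hat = two_point_estimator(f, x, m=4000, rng=np.random.default_rng(0))
print(np.linalg.norm(g_hat - x))          # small for a large batch m
```

Each call uses 2m evaluations of f; for the algorithms discussed below, m is typically a small constant such as 1.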
Such 2m-point zeroth-order gradient estimators have frequently been studied in zeroth-order optimization works (see e.g. Nesterov and Spokoiny (2017)). To facilitate efficient escape from saddle points, our proposed Algorithm 1 adds an isotropic perturbation at each iteration.

Algorithm 1: Zeroth-order perturbed gradient descent (ZOPGD)
input: x₀, horizon T, step-size η, smoothing radius u, perturbation radius r, batch size m
for step t = 0, . . . , T − 1 do
  Sample Z^{(m)} = {Z_{t,i}}_{i=1}^{m} ∼ N(0, I) to compute g_u^{(m)}(x_t).
  Update x_{t+1} = x_t − η (g_u^{(m)}(x_t) + Y_t), where Y_t ∼ N(0, (r²/d) I).

We now state an informal version of our main result, and follow that with a few remarks.

Theorem 1 (Main result, informal version of Theorem 2). Consider running Algorithm 1. Let Õ hide polylogarithmic terms in δ and other parameters. Suppose δ ∈ (0, 1/e]. Suppose √(ρε) ≤ min{1, L}, so that ψ̃ ≤ min{1, L}, where

ψ̃ := min{ψ, 1, L} if f(·) is (ε, ψ)-strict saddle for some ψ > √(ρε), and ψ̃ := √(ρε) otherwise. (2)

Suppose u = Õ(min{√ε, √r}/√(ρd)), r = Õ(ε), η = Õ(m ψ̃/(d max{L, L²})). Then, in

T = Ω( (f(x₀) − f*)/(η ε²) + ρ²ε (f(x₀) − f*)/(η ψ̃⁴) ) = Ω( d max{L, L²}(f(x₀) − f*)/(m ψ̃ ε²) + d max{L, L²} ρ²ε (f(x₀) − f*)/(m ψ̃⁵) ) = Ω( d max{L, L²}(f(x₀) − f*)/(m ψ̃ ε²) )

iterations (with each iteration using 2m function evaluations), with probability at least 1 − δ, at least half the iterates are ε-second order stationary points.

Remark 1. As the choices of η in Proposition 4 (Appendix D) and Theorem 2 (Appendix F) respectively imply, the Ω((f(x₀) − f*)/(η ε²)) term in the sample complexity comes from the large-gradient iterations (Proposition 4), whereas the Ω(ρ²ε (f(x₀) − f*)/(η ψ̃⁴)) term comes from the escape-saddle-point phase.

Comparison to gradient-based methods.
For first-order escape-saddle-point algorithms, standard perturbation-based methods (without acceleration) can find an (ε, O(√ε))-second-order stationary point using Õ(1/ε²) iterations for deterministic GD (Jin et al., 2019a), while for standard SGD the best-known rates are slower, at Õ(1/ε^{3.5}) (Fang et al., 2019). In contrast, our sample complexity (as measured by the total number of function evaluations) is Õ(d/(ε² ψ̃)), where ψ̃ is defined in Eq. (2). The extra (linear) dependence on d is typical for zeroth-order algorithms (see e.g. Nesterov and Spokoiny (2017)); intuitively, computing a gradient of a d-dimensional function from function values requires O(d) evaluations agnostically, so it is natural that zeroth-order algorithms require d times more iterations. For general non-strict-saddle functions, our dependence on ε sits between that of the deterministic methods and the SGD methods, which suggests the benefit of a specialized treatment of zeroth-order methods over considering them simply as a subclass of SGD methods. Moreover, for (ε, ψ)-strict-saddle functions where ψ = Ω(1), our sample complexity becomes Õ(d/ε²), with an ε dependence that matches that of the best existing sample complexity for non-accelerated first-order escape-saddle-point methods (Jin et al., 2017). Comparison to existing zeroth-order methods. As Table 1 suggests, our sample complexity significantly outperforms that of Bai et al. (2020) and Balasubramanian and Ghadimi (2022), and also that of Lucchi et al. (2021), which is a random search method. The sample complexities in Vlatakis-Gkaragkounis et al. (2019); Zhang et al. (2022) outperform our method, with a function evaluation complexity of Õ(d/ε²). However, for (ε, ψ)-strict-saddle functions where ψ = Ω(1), our sample complexity becomes Õ(d/ε²), which matches the sample complexity in Vlatakis-Gkaragkounis et al. (2019); Zhang et al. (2022).
Moreover, a key limitation of their methods is the requirement of Ω(d) function evaluations to estimate the gradient at each iteration, which may not be practical in realistic applications where d is large. In contrast, our method supports any number m of function evaluation pairs at each iteration, with 1 ≤ m ≤ d. Moreover, numerically, we found that for a nonconvex test function proposed in Du et al. (2017), our method (with two-point estimators) takes fewer function evaluations to escape saddle points and converge to the global minimum than the methods in Vlatakis-Gkaragkounis et al. (2019); Zhang et al. (2022).
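The overall method can be illustrated end-to-end on a toy strict-saddle function (a minimal sketch of Algorithm 1; the test function, step size, radii, and horizon below are illustrative choices, not the tuned values from Theorem 1):

```python
import numpy as np

def zopgd(f, x0, T=2000, eta=0.05, u=1e-3, r=0.1, m=2, seed=0):
    """Zeroth-order perturbed gradient descent (Algorithm 1, sketch).
    Each iteration uses 2m function evaluations plus an isotropic
    Gaussian perturbation Y_t ~ N(0, (r^2/d) I)."""
    rng = np.random.default_rng(seed)
    d = x0.shape[0]
    x, values = x0.copy(), []
    for _ in range(T):
        g = np.zeros(d)
        for _ in range(m):
            z = rng.standard_normal(d)
            g += (f(x + u * z) - f(x - u * z)) / (2 * u) * z
        g /= m
        y = rng.normal(scale=r / np.sqrt(d), size=d)
        x = x - eta * (g + y)              # perturbed zeroth-order step
        values.append(f(x))
    return x, values

# f has a strict saddle at the origin (Hessian diag(-1, 1)) and minima
# at (+-1, 0) with value -1/4. Started exactly at the saddle, the
# unperturbed estimator vanishes by symmetry, so plain zeroth-order GD
# is stuck; the isotropic perturbation drives the escape.
f = lambda v: 0.25 * v[0]**4 - 0.5 * v[0]**2 + 0.5 * v[1]**2
x_final, values = zopgd(f, np.zeros(2))
print(min(values))    # well below 0: the saddle has been escaped
```

The same loop with m = d recovers the 2d-evaluation-per-iteration schemes discussed above, at d times the per-iteration cost.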

3. PROOF STRATEGY AND KEY CHALLENGES IN THE ZEROTH-ORDER SETTING

Broadly speaking, our proof includes two major parts: i) characterizing the progress made in iterations when the gradient is large (which we define to be iterations t where ‖∇f(x_t)‖ ≥ ε) (Section 3.1), and ii) characterizing iterations when we are at an ε-approximate saddle point (where progress may be made along the negative eigendirection of the Hessian matrix) (Section 3.2). While the approach is similar to the first-order case (e.g. Jin et al. (2019a)), the zeroth-order setting brings forth several challenges, as we explain in the individual subsections. In the rest of this section, we explain these challenges, sketch our high-level proof outlines, and provide statements of the main technical results. Note that due to the space limit, the full proofs are provided in the Appendix.

3.1. SHOWING FUNCTION DECREASE WHEN GRADIENTS ARE LARGE

Challenge. Due to the noise in the two-point (or 2m-point, where m is a small constant) zeroth-order gradient estimator, even when the gradient is large, it may not always be possible to make progress at each iteration, especially when m < d is used in the gradient estimation in Eq. (1). While it is tempting to use an expectation-based argument to handle this issue, it is known that expectation-based function decrease arguments are insufficient for escaping saddle points (see e.g. Proposition 1 in Ziyin et al. (2021)). We tackle this issue by using high-probability arguments instead; we note that achieving these high-probability bounds is highly nontrivial due to the large variance of the two-point zeroth-order estimator (scaling with d times the squared norm of the gradient). Hence, any single iteration of the zeroth-order method may in fact lead to a function increase rather than a decrease.

High-level proof outline. (i) We first characterize the function value change for our proposed algorithm (Lemma 1). (ii) Next, we tackle the possibility that the function value might increase in any given iteration. The key idea here is that across any small number of consecutive iterations, there will be one iteration where the zeroth-order estimator is sufficiently aligned with the gradient direction (Lemma 14 in Appendix D). (iii) Along with a series of other technical results in Appendix D, we then show that the function makes sufficient progress across the duration of the algorithm, with high probability (Proposition 1). To more concretely illustrate the key analytical challenge, we next introduce the following function decrease lemma, proved in Appendix D.

Lemma 1 (Function decrease for batch zeroth-order optimization).
Suppose at each time t, the algorithm performs the update step (with batch-size parameter 1 ≤ m ≤ d)

x_{t+1} = x_t − η (g_u^{(m)}(x_t) + Y_t), g_u^{(m)}(x_t) = (1/m) Σ_{i=1}^{m} [(f(x_t + u Z_{t,i}) − f(x_t − u Z_{t,i})) / (2u)] Z_{t,i},

where each Z_{t,i} is drawn i.i.d. from N(0, I), u > 0 is the smoothing radius, and Y_t ∼ N(0, (r²/d) I) with r > 0 denoting the perturbation radius. Then, there exist absolute constants c₁ > 0, C₁ ≥ 1 such that, for any T ∈ Z₊ and T ≥ τ > 0, α > 0 and δ ∈ (0, 1/e], upon defining H_{0,τ}(δ) to be the event on which the inequality

f(x_τ) − f(x₀) ≤ −(3η/4) Σ_{t=0}^{τ−1} (1/m) Σ_{i=1}^{m} (Z_{t,i}^⊤ ∇f(x_t))² + η(α + c₁ L η χ³ d/m) Σ_{t=0}^{τ−1} ‖∇f(x_t)‖² + τ η u⁴ ρ² · c₁ d³ (log(T/δ))³ + τ L η² u⁴ ρ² · c₁ d⁴ (log(T/δ))⁴ + η c₁ r² (α + ηL) log(T/δ) + τ c₁ L η² r² (3)

is satisfied (where χ := log(C₁ d m T/δ)), we have

P(H_{0,τ}(δ)) ≥ 1 − (τ + 4)δ/T, and P(∩_{τ'=1}^{τ} H_{0,τ'}(δ)) ≥ 1 − 5τδ/T, for any 0 ≤ τ ≤ T.

Our goal is to show that we can arrive at a contradiction f(x_T) < min_x f(x) when there is a large number of steps at which ‖∇f(x_t)‖ ≥ ε (Proposition 1). As we can see from Eq. (3), this implies that we need to prove a lower bound of the form

Σ_{t=0}^{T−1} (1/m) Σ_{i=1}^{m} (Z_{t,i}^⊤ ∇f(x_t))² ≥ Ω( (α + c₁ L η χ³ d/m) Σ_{t=0}^{T−1} ‖∇f(x_t)‖² ) (4)

for some α which is not too large (an example would be picking α such that it only scales logarithmically in the problem parameters). However, it is tricky to prove such a lower bound in the zeroth-order setting. In particular, for small batch sizes m, the quantity (1/m) Σ_{i=1}^{m} (Z_{t,i}^⊤ ∇f(x_t))² could be small even when ‖∇f(x_t)‖² is large; this is because for each i ∈ [m], Z_{t,i} could have a negligible component in the ∇f(x_t) direction. This necessitates a more delicate analysis to prove a bound similar to Eq. (4). Due to space reasons, we defer a more detailed outline of our proof approach to Appendix D (see the discussion immediately following Lemma 1). The results in Appendix D culminate in the following result, which bounds the number of large-gradient iterations.
Proposition 1 (Bound on the number of iterates with large gradients, informal version of Proposition 4). Let δ ∈ (0, 1/e] be arbitrary. Letting Õ hide polylogarithmic dependencies on δ (and other parameters), consider choosing u, r, η and T such that u = Õ(√ε/√(ρd)), r = O(ε), η = Õ(m/(dL)), T = Ω((f(x₀) − f* + ε²/L)/(η ε²)). Then, with probability at least 1 − O(δ), there are at most T/4 iterations for which ‖∇f(x_t)‖ ≥ ε.
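The alignment idea behind Lemma 14 (step (ii) of the outline above) can be checked numerically: a standard Gaussian direction has a non-negligible projection on any fixed gradient direction with constant probability, independently of the dimension, so a short window of iterations contains a well-aligned step with high probability (a quick Monte Carlo sketch; the threshold 1/2 and window length are arbitrary illustrative choices):

```python
import numpy as np

# For Z ~ N(0, I_d) and any fixed g, Z^T g / ||g|| ~ N(0, 1), so
# P((Z^T g)^2 >= 0.5 ||g||^2) = P(|N(0,1)| >= sqrt(0.5)) ~ 0.48,
# independently of the dimension d.
rng = np.random.default_rng(0)
d, n_trials = 100, 20000
g = rng.standard_normal(d)                 # an arbitrary "gradient" direction
Z = rng.standard_normal((n_trials, d))
aligned = (Z @ g) ** 2 >= 0.5 * np.dot(g, g)
p = aligned.mean()
print(p)                                   # roughly 0.48

# Probability that a window of k consecutive i.i.d. draws contains at
# least one aligned direction: 1 - (1 - p)^k, already > 0.99 for k = 8.
k = 8
print(1 - (1 - p) ** k)
```

This is why, even though any single iteration may increase the function value, a short run of iterations makes progress with high probability.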

3.2. MAKING PROGRESS NEAR SADDLE POINTS

Challenge. The noise in two-point zeroth-order estimators makes the analysis around ε-approximate saddle points challenging, because the concentration properties of the (non-subGaussian) noise are hard to characterize. Intuitively, a noisier estimator might facilitate easier escape from saddle points. However, without an appropriate concentration bound, the noise may behave in unpredictable ways, preventing escape from saddle regions. Previous analyses of saddle point escape using stochastic estimators typically require these estimators to satisfy subGaussian properties (Jin et al., 2019a; Fang et al., 2019), which zeroth-order estimators do not satisfy.

High-level proof outline. (i) We first prove a technical result showing that the travelling distance of the iterates can be bounded in terms of the function value decrease (i.e., Improve or Localize, Lemma 2). (ii) Next, at any ε-saddle point, we consider a coupling argument and define two sequences running near-identical zeroth-order dynamics, differing only in the sign of their perturbative term along the minimum eigendirection of H, which denotes the Hessian at the saddle (Lemma 3). Using Lemma 2 from point (i), if we assume for contradiction that the two sequences both "get stuck" and make little function value progress, the difference between the two sequences will remain small, as both sequences remain close to the saddle point. (iii) However, since the perturbation vectors of the two sequences differ in the (most) negative direction of H, the norm of the difference of the two sequences will grow exponentially so long as (a) the sequences remain close to the saddle point (and thus the Hessian has a negative curvature direction), and (b) the effect of the zeroth-order stochastic noise can be controlled. This leads to a contradiction, implying that sufficient function decrease must have been made (Proposition 5 in Appendix E.3).
(iv) To show that the zeroth-order stochastic noise can be controlled, we prove a technical result (Proposition 2) providing a concentration bound, scaling linearly with the dimension d, for sums of products of (possibly unbounded) subGaussian random vectors. This enables us to control the effect of the zeroth-order noise near saddle points, and is essential in showing that the eventual sample complexity scales linearly with d. We provide a more detailed proof sketch below, where we elaborate on our analytical challenges and ideas. We first introduce an informal statement of a key technical result that bounds, with high probability, the travelling distance of the iterates in terms of the function value decrease.

Lemma 2 (Improve or Localize, informal version of Lemma 23). Consider the perturbed zeroth-order update of Algorithm 1. Let δ ∈ (0, 1/e] be arbitrary. Consider any T_s = Ω((1/m) log(1/δ)) and any t₀ ≥ 0. For any F > 0, suppose f(x_{T_s + t₀}) − f(x_{t₀}) > −F, i.e. f(x_{t₀}) − f(x_{T_s + t₀}) < F. Letting Õ hide polylogarithmic terms involving δ, suppose u = Õ(min{√ε, √r}/√(ρd)), r = Õ(min{ε, √(F/(η T_s))}), η = Õ(m √(ρε)/(dL)). Then, with probability at least 1 − O(T_s δ/T) (here T ≥ T_s denotes the total number of iterations), for each τ ∈ {0, 1, . . . , T_s}, we have that ‖x_{t₀+τ} − x_{t₀}‖² ≤ φ_{T_s}(δ, F), where φ_{T_s}(δ, F) = Õ(max{T_s, d/m} ηF) + Õ(η² ε²).

Intuitively, the above result shows that if little function value improvement has been made, then the algorithm's iterates have not moved much, so that the algorithm remains approximately in a saddle region if it started out in one. Next, Lemma 3 formally introduces the coupling we have mentioned, setting the stage for the rest of our arguments. For notational convenience, in this section, unless otherwise specified, we will assume that the initial iterate x₀ is an ε-saddle point.

Lemma 3. Suppose x₀ is an ε-approximate saddle point.
Without loss of generality, suppose that the minimum eigendirection of H := ∇²f(x₀) is the e₁ direction, and let γ denote −λ_min(∇²f(x₀)) (note γ ≥ √(ρε)). Consider the following coupling mechanism, where we run the zeroth-order gradient dynamics, starting from x₀, with two isotropic noise sequences, {Y_t} and {Y'_t} respectively, where (Y'_t)₁ = −(Y_t)₁, and (Y'_t)_j = (Y_t)_j for all other j ≠ 1. Suppose that the sequence {Z_{t,i}}_{t∈[T], i∈[m]} is the same for both runs. Let {x_t} denote the sequence driven by the {Y_t} noise sequence, and let {x'_t} denote the sequence driven by the {Y'_t} noise sequence, where

x_{t+1} = x_t − η [ (1/m) Σ_{i=1}^{m} ( Z_{t,i} Z_{t,i}^⊤ ∇f(x_t) + (u/2) Z_{t,i} Z_{t,i}^⊤ H̃_{t,i} Z_{t,i} ) + Y_t ], x'₀ = x₀,

and H̃_{t,i} := H_{t,i,+} − H_{t,i,−}, with H_{t,i,+} = ∇²f(x_t + α_{t,i,+} u Z_{t,i}) for some α_{t,i,+} ∈ [0, 1], and H_{t,i,−} = ∇²f(x_t − α_{t,i,−} u Z_{t,i}) for some α_{t,i,−} ∈ [0, 1] (with x'_{t+1} defined analogously using Y'_t). Then, for any t ≥ 0,

x̂_{t+1} := x_{t+1} − x'_{t+1} = −η Σ_{τ=0}^{t} (I − ηH)^{t−τ} ξ̂_{g₀}(τ) [=: W_{g₀}(t+1)] − η Σ_{τ=0}^{t} (I − ηH)^{t−τ} (H̄_τ − H) x̂_τ [=: W_H(t+1)] − η Σ_{τ=0}^{t} (I − ηH)^{t−τ} ξ̂_u(τ) [=: W_u(t+1)] − η Σ_{τ=0}^{t} (I − ηH)^{t−τ} Ŷ_τ [=: W_p(t+1)],

where

ξ_{g₀}(t) = (1/m) Σ_{i=1}^{m} (Z_{t,i} Z_{t,i}^⊤ − I) ∇f(x_t), ξ'_{g₀}(t) = (1/m) Σ_{i=1}^{m} (Z_{t,i} Z_{t,i}^⊤ − I) ∇f(x'_t), ξ̂_{g₀}(t) = ξ_{g₀}(t) − ξ'_{g₀}(t),
ξ_u(t) = (1/m) Σ_{i=1}^{m} (u/2) Z_{t,i} Z_{t,i}^⊤ H̃_{t,i} Z_{t,i}, ξ'_u(t) = (1/m) Σ_{i=1}^{m} (u/2) Z_{t,i} Z_{t,i}^⊤ H̃'_{t,i} Z_{t,i}, ξ̂_u(t) = ξ_u(t) − ξ'_u(t),
Ŷ_t = Y_t − Y'_t, H̄_t = ∫₀¹ ∇²f(a x_t + (1 − a) x'_t) da.

Our goal is to show that the dominating term in the evolution of the difference dynamics is the W_p term involving the additional perturbation. To this end, we need to bound the remaining terms W_{g₀}, W_H, W_u. A key technical challenge is to find a precise concentration bound for the W_{g₀}(t+1) term, where

W_{g₀}(t+1) = −η Σ_{τ=0}^{t} (I − ηH)^{t−τ} (1/m) Σ_{i=1}^{m} (Z_{τ,i} Z_{τ,i}^⊤ − I)(∇f(x_τ) − ∇f(x'_τ)).

For simplicity of discussion, we assume for the time being that m = 1, and drop the i index in the subscript of Z_{τ,i}.
Since E[Z_τ Z_τ^⊤] = I, heuristically, assuming that Z_τ Z_τ^⊤ − I satisfies "nice" concentration properties, utilizing the independence of the Z_τ's across time and the fact that ‖I − ηH‖ ≤ 1 + ηγ, we would like to show that with high probability,

‖W_{g₀}(t)‖ ≲ η √( Σ_{τ=0}^{t−1} (1 + ηγ)^{2(t−1−τ)} E[ ‖(Z_τ Z_τ^⊤ − I)(∇f(x_τ) − ∇f(x'_τ))‖² | F_{τ−1} ] ), (5)

where F_{τ−1} is a sigma-algebra containing all randomness up to and including iteration τ − 1, so that x_τ and x'_τ are both F_{τ−1}-measurable, but Z_τ is not. Then, assuming that Eq. (5) holds, since E[‖(Z_τ Z_τ^⊤ − I)(∇f(x_τ) − ∇f(x'_τ))‖² | F_{τ−1}] = O(d) ‖∇f(x_τ) − ∇f(x'_τ)‖², it follows that

‖W_{g₀}(t)‖ ≤ η √( O(d) Σ_{τ=0}^{t−1} (1 + ηγ)^{2(t−1−τ)} ‖∇f(x_τ) − ∇f(x'_τ)‖² ).

With this bound on ‖W_{g₀}(t)‖, we eventually prove in Proposition 5 in Appendix E.3 that our algorithm escapes any ε-saddle point with constant probability, and the O(d) term appearing under the square root above eventually leads to an O(d) dependence in the sample complexity. We note that the O(d) dimension dependence matches that of the best-known existing upper bound for finding first-order stationary points in smooth nonconvex zeroth-order optimization (Nesterov and Spokoiny, 2017), and has been conjectured to be the best possible dimension dependence for general smooth nonconvex zeroth-order optimization (Balasubramanian and Ghadimi, 2022).
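To see why the W_p term drives escape, consider the idealized case of an exact quadratic f with Hessian H = diag(−γ, 1, . . . , 1) and exact gradients, so that the W_{g₀}, W_H and W_u terms vanish. The coupled difference then evolves as x̂_{t+1} = (I − ηH) x̂_t − η Ŷ_t, with Ŷ_t supported on e₁ only; its e₁ component is amplified by the factor 1 + ηγ per step, while all other components remain exactly zero. A short simulation (all parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma, eta, r, T = 10, 0.5, 0.1, 0.01, 300
H = np.eye(d); H[0, 0] = -gamma            # one negative curvature direction

diff = np.zeros(d)                          # x_hat_0 = 0: same starting point
norms = []
for t in range(T):
    y = rng.normal(scale=r / np.sqrt(d), size=d)
    y_hat = np.zeros(d); y_hat[0] = 2 * y[0]    # mirrored noise: Y - Y' = 2(Y)_1 e_1
    diff = (np.eye(d) - eta * H) @ diff - eta * y_hat
    norms.append(np.linalg.norm(diff))

print(norms[-1] / norms[T // 2 - 1])        # exponential growth along e_1
print(np.abs(diff[1:]).max())               # exactly 0 off the e_1 direction
```

In the full zeroth-order dynamics the off-e₁ components are no longer zero, which is precisely why the W_{g₀}, W_H and W_u terms must be shown not to overwhelm this exponential growth.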

Key technical challenge

The key challenge in the above argument is to show that an inequality of the form of Eq. (5) can in fact hold. At first glance, this is rather non-obvious: while the variable (Z_τ Z_τ^⊤ − I)(∇f(x_τ) − ∇f(x'_τ)) | F_{τ−1} is mean-zero, it is subExponential rather than subGaussian. In fact, even in the subGaussian case, given a sequence of random vectors x₀, . . . , x_{t−1} such that E[x_τ | F_{τ−1}] = 0 and each x_τ | F_{τ−1} is norm-subGaussian with parameter σ_τ ∈ F_{τ−1} (an appropriate generalization of subGaussianity to vectors, proposed in Jin et al. (2019b)), proving a concentration inequality of the form ‖Σ_{τ=0}^{t−1} x_τ‖ ≈ Õ(√(Σ_{τ=0}^{t−1} σ_τ²)) is a very delicate matter. Existing techniques (cf. Tropp et al. (2015); Jin et al. (2019b)) rely crucially on subGaussian properties that allow, for each τ, the moment-generating function E[e^{θ Y_τ} | F_{τ−1}] to be defined for any fixed (and non-random) θ > 0, where Y_τ is the Hermitian dilation

Y_τ = [0, x_τ^⊤; x_τ, 0],

such that E[Y_τ | F_{τ−1}] = 0 (since E[x_τ | F_{τ−1}] = 0), and the eigenvalues of Y_τ are ±‖x_τ‖. In our case, the analogue of x_τ is (I − ηH)^{t−1−τ}(Z_τ Z_τ^⊤ − I)(∇f(x_τ) − ∇f(x'_τ)), while the analogue of σ_τ² is (1 + ηγ)^{2(t−1−τ)} E[‖(Z_τ Z_τ^⊤ − I)(∇f(x_τ) − ∇f(x'_τ))‖² | F_{τ−1}]. When x_τ is merely subExponential, as it is here, the moment-generating function E[e^{θ Y_τ} | F_{τ−1}] is no longer well-defined at every fixed (and non-random) θ > 0, which poses a challenge in our setting.

Our solution. To overcome this issue, we build on the following observation: with high probability, for any vector g ∈ R^d, |Z_τ^⊤ g| is bounded within some log factor of ‖g‖.
On the event {|Z_τ^⊤ g| = Õ(‖g‖)}, the variable (Z_τ Z_τ^⊤ − I)g = Z_τ (Z_τ^⊤ g) − g behaves approximately like a subGaussian random vector, since Z_τ ∼ N(0, I_d). Based on this intuition, after some careful analysis, we can show that (Z_τ Z_τ^⊤ − I)(∇f(x_τ) − ∇f(x'_τ)) | F_{τ−1} is subGaussian on the event that |Z_τ^⊤ (∇f(x_τ) − ∇f(x'_τ))| is bounded within some log factor of ‖∇f(x_τ) − ∇f(x'_τ)‖, which happens with high probability. This then allows us to show that, on this event, the corresponding MGF is well-defined for all fixed θ > 0, enabling us to prove a concentration inequality of the form of Eq. (5). This intuition is crystallized in the following proposition, which proves a more general bound than what we strictly need. For notational simplicity, we introduce the function lr(x) := log(x log(x)).

Proposition 2. Let F_t, t ≥ −1 be a filtration. Let (Z_t)_{t≥0} be a sequence of random vectors following the distribution N(0, I) such that Z_t ∈ F_t and is independent of F_{t−1}, and let (v_t)_{t≥0} be a sequence of random vectors such that v_t ∈ F_{t−1}. For each τ ≥ 0, let

W_τ = Σ_{t=0}^{τ−1} M_t (Z_t Z_t^⊤ − I) v_t,

where each M_t is a deterministic matrix of appropriate dimension. Then, there exist absolute constants c', C > 0 such that for any τ ∈ Z₊ and δ ∈ (0, 1/e], the following statements hold:

1. For any θ > 0, with probability at least 1 − δ, we have ‖W_τ‖ ≤ c' ( θ Σ_{t=0}^{τ−1} ‖M_t‖₂² d (lr(Cτ/δ))² ‖v_t‖² + (1/θ) log(Cdτ/δ) ).

2. For any B > b > 0, with probability at least 1 − δ, either Σ_{t=0}^{τ−1} ‖M_t‖₂² d (lr(Cτ/δ))² ‖v_t‖² ≥ B, or ‖W_τ‖ ≤ c' √( max{ Σ_{t=0}^{τ−1} ‖M_t‖₂² d (lr(Cτ/δ))² ‖v_t‖², b } (log(Cτd/δ) + log(log(B/b) + 1)) ).

Moreover, as is clear from the bounds above, we may pick C ≥ 1 such that log(C/δ) ≥ 1 for all δ ∈ (0, 1/e]. With this result, along with a series of other technical results in Appendix E.3, we can show that the algorithm makes a function decrease of F with Ω(1) probability near an ε-saddle point (Proposition 5 in Appendix E.3).
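As a quick sanity check on the d-scaling in Proposition 2, a Monte Carlo experiment with M_t = I and fixed unit vectors v_t (both illustrative choices) confirms that ‖W_τ‖² concentrates at the d Σ_t ‖v_t‖² scale; for a unit vector v, a direct computation gives E‖(Z Z^⊤ − I)v‖² = d + 1:

```python
import numpy as np

def w_sq(d, tau, rng):
    """||W_tau||^2 for W_tau = sum_t (Z_t Z_t^T - I) v_t with M_t = I and
    v_t a fixed unit vector (illustrative choices); E||W_tau||^2 = tau*(d+1)."""
    v = np.zeros(d); v[0] = 1.0
    W = np.zeros(d)
    for _ in range(tau):
        z = rng.standard_normal(d)
        W += z * np.dot(z, v) - v          # (Z Z^T - I) v
    return np.dot(W, W)

rng = np.random.default_rng(0)
tau, reps = 100, 30
m50 = np.mean([w_sq(50, tau, rng) for _ in range(reps)])    # ~ 100 * 51
m200 = np.mean([w_sq(200, tau, rng) for _ in range(reps)])  # ~ 100 * 201
print(m50, m200, m200 / m50)   # ratio roughly (200 + 1)/(50 + 1), linear in d
```

The √d factor in the high-probability bound of Proposition 2 is what this linear-in-d second moment translates into, up to the logarithmic lr(·) corrections.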
Armed with Proposition 5, as well as Proposition 1, the main result in Theorem 1 then follows. The complete detailed analysis can be found in Appendix E (Escaping saddle point) and Appendix F (main result).

4. CONCLUSION

In this paper, we proved that using two function evaluations per iteration suffices to escape saddle points and reach approximate second order stationary points efficiently in zeroth-order optimization. Along the way, we gave the first analysis of high-probability function change using two- (or more-) point zeroth-order gradient estimators, as well as a novel concentration bound for sums of subExponential (but not subGaussian) vectors that are each products of Gaussian vectors. These technical contributions may be of independent interest to researchers working in zeroth-order optimization as well as general stochastic optimization. There are a few limitations of the current results, which lead to several interesting future directions, such as the extension to noisy function evaluations, as well as studying whether some zeroth-order estimators, such as asymmetric two-point estimators (Nesterov and Spokoiny, 2017) or single-point estimators (Flaxman et al., 2005), could escape saddle points without additional perturbation noise.

A RELATED WORK

Two-point methods in zeroth-order optimization. Two-point (or in general 2m-point, where 1 ≤ m < d with d being the problem dimension) estimators, which approximate the gradient using two (or 2m) function evaluations per iteration, have been widely studied by researchers in the zeroth-order optimization literature, in convex (Nesterov and Spokoiny, 2017; Duchi et al., 2015; Shamir, 2017), nonconvex (Nesterov and Spokoiny, 2017), online (Shamir, 2017), as well as distributed settings (Tang et al., 2019). A key reason for doing so is that for applications of zeroth-order optimization arising in robotics (Li et al., 2022), wind farms (Tang et al., 2020a), power systems (Chen et al., 2020), online (time-varying) optimization (Shamir, 2017), learning-based control (Malik et al., 2019; Li et al., 2021), and improving adversarial robustness to black-box attacks in deep neural networks (Chen et al., 2017), it may be costly or impractical to wait for Ω(d) (where d denotes the problem dimension) function evaluations per iteration to make a step. This is especially true for high-dimensional and/or time-varying problems. Indeed, for high-dimensional problems, two-point estimators can make swift progress even in the initial stage compared to the 2d-point estimator, and can reach a higher-quality solution if computation is limited (Tang et al., 2020b; Chen et al., 2017). For instance, consider the work in (Chen et al., 2017), which studies the use of zeroth-order estimators to perform black-box attacks on deep neural networks, in order to identify (and then defend against) adversarial images that may lead to misclassification. In the paper, the authors employed two-point zeroth-order estimators, due to the high computational cost of using 2d function evaluations per iteration for hundreds of iterations (here d is the dimension of an image, which in this case is over 20,000).
The authors showed empirically that their two-point estimators worked well; however, there were no accompanying theoretical results. For online or time-varying environments, two-point estimators are also often preferable. Since zeroth-order methods are often used in physical systems whose environment drifts or changes over time, this leads naturally to a time-varying or online optimization problem. For these problems, 2d-point estimators will not produce a good estimate, because the underlying function can drift to a very different problem while waiting for the 2d function evaluations. Indeed, the fewer function evaluations an optimization procedure needs, the faster it can catch up with the time-varying environment. In fact, for online optimization, it has been shown that the two-point estimator is optimal for convex Lipschitz functions (Shamir, 2017). Thus, two-point estimators are a natural fit for time-varying online optimization problems.

Saddle point escape with access to deterministic gradients. While standard gradient descent can escape saddle points asymptotically (Lee et al., 2019; Panageas et al., 2019), it is known that it may take exponential time to do so (Du et al., 2017). Hence, when deterministic gradients are available, research has centered on escaping saddle points by adding perturbations (Jin et al., 2017), momentum/acceleration-based methods (Jin et al., 2018; Sun et al., 2019a; Staib et al., 2019), or gradient-based robust Hessian power/curvature exploitation methods (Zhang and Li, 2021; Adolphs et al., 2019). In addition, there has also been work on escaping saddle points in specific optimization settings, such as constrained optimization (Mokhtari et al., 2018; Avdiukhin et al., 2019), optimization of weakly convex functions (Huang, 2021), bilevel optimization (Huang et al., 2022), as well as optimization on general manifolds (Sun et al., 2019b; Criscitiello and Boumal, 2019; Han and Gao, 2020).
Saddle point escape in stochastic gradient descent (SGD). In practice, only stochastic gradient estimators are available for many problems. While SGD may converge to local maxima in worst-case scenarios (Ziyin et al., 2021), under assumptions such as bounded variance or subGaussian noise, many works have studied the problem of saddle point escape in SGD (Ge et al., 2015; Daneshmand et al., 2018; Xu et al., 2018; Jin et al., 2019a; Vlaski and Sayed, 2021b). The best existing rate (without considering momentum/variance reduction techniques) appears to belong to that of Fang et al. (2019), which converges to ε-second order stationary points using Õ(1/ε^3.5) stochastic gradients. While zeroth-order gradient estimators may also be viewed as stochastic gradients, they typically do not satisfy the bounded/subGaussian noise assumptions that are assumed in these works, making a direct comparison inappropriate. Escaping saddle points via momentum methods in SGD has also been studied (Wang et al., 2021; Antonakopoulos et al., 2022); while we do not consider incorporating momentum in our work, this may be interesting future work. A number of papers have also considered the specialized setting of escaping saddle points in nonconvex finite-sum optimization (Reddi et al., 2018; Liang et al., 2021), with many considering the case where variance reduction is used (Ge et al., 2019; Li, 2019). While the finite-sum problem is quite different from our problem, the variance reduction approach considered in these works may be a relevant future direction. The saddle point escape problem has also been studied in other specific settings such as compressed optimization (Avdiukhin and Yaroslavtsev, 2021), distributed optimization (Vlaski and Sayed, 2021a), or in the overparameterized case (Roy et al., 2020).

Saddle point escape with zeroth-order information.
The problem of escaping saddle points in zeroth-order optimization has been studied less often, and we have already listed all known works comparable to our work in the introduction (Bai et al., 2020; Vlatakis-Gkaragkounis et al., 2019; Balasubramanian and Ghadimi, 2022); a more detailed comparison of these works with our results has been provided in the discussion following the statement of our main result, Theorem 1. We would like to mention that Roy et al. (2020) also includes a convergence result of Õ(d^1.5/ε^4.5) for the case with noisy function evaluations, which is incomparable to our work, which focuses on the case with exact function evaluations. In addition, Roy et al. (2020) also makes a subGaussian assumption on the estimator noise, which the zeroth-order estimators in our paper do not satisfy. Nonetheless, considering the extension to noisy function evaluations will make for important future work.

Zeroth-order optimization. Our work rests on a line of research in zeroth-order optimization which focuses on constructing gradient estimators using zeroth-order function values (Flaxman et al., 2005; Duchi et al., 2015; Nesterov and Spokoiny, 2017; Shamir, 2017; Larson et al., 2019). As we have discussed, for smooth nonconvex functions, it is known that two-point zeroth-order estimators suffice to find first-order ε-stationary points using Õ(d/ε^2) function evaluations (Nesterov and Spokoiny, 2017). Our work studies the more complicated problem of reaching ε-second order stationary points, attaining a rate of Õ(d/ε^2.5).

B PROOF ROADMAP

We begin by introducing several key concentration inequalities in Appendix C, which we will frequently use in our proofs. We then describe in detail (and prove) the sequence of results that lead up to Proposition 4 in Appendix D, showing that there cannot be too many iterations with large gradients. Next, we describe the saddle point argument in detail, and prove Proposition 5 in Appendix E.3. Finally, we combine these results and prove our main result Theorem 2 (whose informal version is Theorem 1) in Appendix F. Throughout our proofs, absolute constants, denoted by e.g. $(c, c', C)$, may change from line to line. However, within the same proof, for clarity, we try to index different constants differently. We assume $d \ge 2$ and $m \le d$. Notation. We denote the conditional expectation and conditional probability by $\mathbb{E}_{\mathcal{F}}[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}]$ and $\mathbb{P}_{\mathcal{F}}(\cdot) = \mathbb{P}(\cdot \mid \mathcal{F})$, where $\mathcal{F}$ is a sigma-algebra.

C CONCENTRATION INEQUALITIES

This section serves to introduce several probabilistic results which will be useful for our main proofs in subsequent sections. We first introduce subGaussian, subExponential and norm-subGaussian random vectors in Appendix C.1. Next, in Appendix C.2, we provide concentration bounds for norm-subGaussian and subExponential random vectors. We then prove a novel concentration inequality involving products of subGaussian random vectors in Appendix C.3. We conclude by stating some concentration bounds for sub-Weibull random variables in Appendix C.4.

C.1 SUBGAUSSIAN, SUBEXPONENTIAL AND NORM-SUBGAUSSIAN RANDOM VECTORS

We first define subGaussian and subExponential random vectors. A detailed reference for these concepts can be found in Vershynin (2018).

Definition 5 (subGaussian and subExponential random vectors). A random vector $x \in \mathbb{R}^d$ is $\sigma$-subGaussian ($SG(\sigma)$) if there exists $\sigma > 0$ such that for any unit vector $g \in \mathcal{S}^{d-1}$,
$$\mathbb{E}\left[\exp(\lambda\langle g, x - \mathbb{E}[x]\rangle)\right] \le \exp(\lambda^2\sigma^2/2) \quad \forall \lambda \in \mathbb{R}.$$
Meanwhile, a random vector $x \in \mathbb{R}^d$ is $\sigma$-subExponential ($SE(\sigma)$) if there exists $\sigma > 0$ such that for any unit vector $g \in \mathcal{S}^{d-1}$,
$$\mathbb{E}\left[\exp(\lambda\langle g, x - \mathbb{E}[x]\rangle)\right] \le \exp(\lambda^2\sigma^2/2) \quad \forall |\lambda| \le \frac{1}{\sigma}.$$

An alternative concentration property for random vectors, revolving around the norm and known as norm-subGaussianity (Jin et al., 2019b), is also relevant.

Definition 6 (norm-subGaussian random vectors). A random vector $x \in \mathbb{R}^d$ is $\sigma$-norm-subGaussian ($nSG(\sigma)$) if there exists $\sigma > 0$ such that
$$\mathbb{P}(\|x - \mathbb{E}x\| \ge s) \le 2e^{-\frac{s^2}{2\sigma^2}} \quad \forall s \ge 0.$$

We recall the following result, which provides several examples of nSG random vectors. In particular, it tells us that a random vector $x \in \mathbb{R}^d$ that is $(\sigma/\sqrt{d})$-subGaussian is also norm-subGaussian with parameter $O(\sigma)$.

Lemma 4 (Lemma 1 in Jin et al. (2019b)). There exists an absolute constant $c$ such that the following random vectors are all $nSG(c\sigma)$:
1. a bounded random vector $x \in \mathbb{R}^d$ with $\|x\| \le \sigma$;
2. a random vector $x \in \mathbb{R}^d$ with $x = \xi e_1$, where the random variable $\xi \in \mathbb{R}$ is $\sigma$-subGaussian;
3. a random vector $x \in \mathbb{R}^d$ that is $(\sigma/\sqrt{d})$-subGaussian.

In addition, if $x \in \mathbb{R}^d$ is zero-mean $nSG(\sigma)$, its component along any fixed direction is also subGaussian.

Lemma 5. Suppose $x \in \mathbb{R}^d$ is zero-mean $nSG(\sigma)$. Then, for any fixed vector $v \in \mathbb{R}^d$, $\langle v, x\rangle$ is zero-mean $\|v\|\sigma$-subGaussian.

Proof. Without loss of generality, we assume that $v \in \mathcal{S}^{d-1}$ is a unit vector. That $\langle v, x\rangle$ is zero-mean follows directly from $x$ being zero-mean and $v$ being fixed. Meanwhile, since $|\langle v, x\rangle| \le \|v\|\|x\| = \|x\|$, for any $s \ge 0$ it follows that
$$\mathbb{P}(|\langle v, x\rangle| \ge s) \le \mathbb{P}(\|x\| \ge s) \le 2e^{-\frac{s^2}{2\sigma^2}},$$
where the last inequality follows from the fact that $x$ is zero-mean and $nSG(\sigma)$. Hence $\langle v, x\rangle$ is zero-mean $SG(\sigma)$, as desired.
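As a quick sanity check (purely illustrative; the constant c = 2 below is an assumed demo value, not the absolute constant of Lemma 4), we can verify numerically that a $(\sigma/\sqrt{d})$-subGaussian Gaussian vector obeys an nSG tail bound of the form in Definition 6:

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, n = 10, 1.0, 200_000
# Each coordinate is N(0, sigma^2/d), so x is (sigma/sqrt(d))-subGaussian
# in every direction; Lemma 4 then says x is nSG(c*sigma).
x = rng.normal(scale=sigma / np.sqrt(d), size=(n, d))
norms = np.linalg.norm(x, axis=1)

c = 2.0  # assumed demo constant
levels = (1.5, 2.0, 2.5)
empirical = [float(np.mean(norms >= s)) for s in levels]
bounds = [2 * np.exp(-s**2 / (2 * (c * sigma) ** 2)) for s in levels]
```

At each level, the empirical tail frequency sits comfortably below the claimed $2e^{-s^2/(2(c\sigma)^2)}$ bound.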

C.2 CONCENTRATION BOUNDS FOR NORM-SUBGAUSSIAN AND SUBEXPONENTIAL RANDOM VECTORS

We begin by giving some concentration bounds for norm-subGaussian random vectors. To do so, we introduce the following condition.

Condition 1. Consider random vectors $x_1, \ldots, x_n \in \mathbb{R}^d$, and corresponding filtrations $\mathcal{F}_i$ generated by $(x_1, \ldots, x_i)$. We assume $x_i \mid \mathcal{F}_{i-1}$ is zero-mean $nSG(\sigma_i)$ with $\sigma_i \in \mathcal{F}_{i-1}$, i.e.,
$$\mathbb{E}[x_i \mid \mathcal{F}_{i-1}] = 0, \qquad \mathbb{P}(\|x_i\| \ge s \mid \mathcal{F}_{i-1}) \le 2e^{-\frac{s^2}{2\sigma_i^2}} \quad \forall s \ge 0,$$
where $\sigma_i$ is a measurable function of $(x_1, \ldots, x_{i-1})$ for each $i$.

For norm-subGaussian random vectors satisfying Condition 1, we first have the following bound.

Lemma 6. Suppose $(x_1, \ldots, x_n) \in \mathbb{R}^d$ satisfy Condition 1, i.e. each $x_i \mid \mathcal{F}_{i-1}$ is mean-zero $nSG(\sigma_i)$ with $\sigma_i \in \mathcal{F}_{i-1}$. Let $\{u_i\}$ denote a sequence of random vectors such that $u_i \in \mathcal{F}_{i-1}$ for every $i \in [n]$. Then, there exists an absolute constant $c$ such that for any $\delta \in (0,1)$ and $\lambda > 0$, with probability at least $1 - \delta$,
$$\sum_{i=1}^n \langle u_i, x_i\rangle \le c\lambda\sum_{i=1}^n \|u_i\|^2\sigma_i^2 + \frac{1}{\lambda}\log(1/\delta).$$

Proof. We note that if $x_i$ is mean-zero and $nSG(\sigma_i)$, then by Lemma 5, $\langle u_i, x_i\rangle \mid \mathcal{F}_{i-1}$ is zero-mean and $\|u_i\|\sigma_i$-subGaussian. The rest of the proof follows the proof of Lemma 39 in Jin et al. (2019a) (the key idea is to exponentiate and then apply Markov's inequality). For completeness, we restate the proof here. Observe that for any $i$, since $\langle u_i, x_i\rangle$ is $\|u_i\|\sigma_i$-subGaussian, for any $\lambda > 0$ we have
$$\mathbb{E}\left[\exp(\lambda\langle u_i, x_i\rangle) \mid \mathcal{F}_{i-1}\right] \le \exp(\lambda^2\|u_i\|^2\sigma_i^2/2).$$
For any $\lambda > 0$ and $s \ge 0$, observe that
$$\begin{aligned}
\mathbb{P}\left(\sum_{i=1}^n\left(\lambda\langle u_i, x_i\rangle - \lambda^2\|u_i\|^2\sigma_i^2/2\right) \ge s\right)
&= \mathbb{P}\left(\exp\left(\sum_{i=1}^n\left(\lambda\langle u_i, x_i\rangle - \lambda^2\|u_i\|^2\sigma_i^2/2\right)\right) \ge \exp(\lambda s)\right) \\
&\le \mathbb{E}\left[\exp\left(\sum_{i=1}^n\left(\lambda\langle u_i, x_i\rangle - \lambda^2\|u_i\|^2\sigma_i^2/2\right)\right)\right]\exp(-\lambda s) \\
&= \mathbb{E}\left[\mathbb{E}\left[\exp\left(\sum_{i=1}^n\left(\lambda\langle u_i, x_i\rangle - \lambda^2\|u_i\|^2\sigma_i^2/2\right)\right)\,\middle|\,\mathcal{F}_{n-1}\right]\right]\exp(-\lambda s) \\
&= \mathbb{E}\left[\exp\left(\sum_{i=1}^{n-1}\left(\lambda\langle u_i, x_i\rangle - \lambda^2\|u_i\|^2\sigma_i^2/2\right)\right)\mathbb{E}\left[\exp\left(\lambda\langle u_n, x_n\rangle - \lambda^2\|u_n\|^2\sigma_n^2/2\right)\,\middle|\,\mathcal{F}_{n-1}\right]\right]\exp(-\lambda s) \\
&\overset{(i)}{\le} \mathbb{E}\left[\exp\left(\sum_{i=1}^{n-1}\left(\lambda\langle u_i, x_i\rangle - \lambda^2\|u_i\|^2\sigma_i^2/2\right)\right)\right]\exp(-\lambda s) \le \cdots \le \exp(-\lambda s).
\end{aligned}$$
Above, (i) follows from the fact that $\langle u_i, x_i\rangle \mid \mathcal{F}_{i-1}$ is zero-mean and $\|u_i\|\sigma_i$-subGaussian for each $i \in [n]$.
The final result then follows by picking $c = \frac{1}{2}$ and $s = \log(1/\delta)$.

Assuming Condition 1, the following concentration result also holds for a sequence of nSG random vectors.

Lemma 7 (Lemma 6, Corollary 7 and Corollary 8 in Jin et al. (2019b), combined). Suppose $(x_1, \ldots, x_n) \in \mathbb{R}^d$ satisfy Condition 1. Then, there exists an absolute constant $c$ such that for any fixed $\delta \in (0,1)$ and $\theta > 0$, with probability at least $1 - \delta$,
$$\left\|\sum_{i=1}^n x_i\right\| \le c\theta\sum_{i=1}^n \sigma_i^2 + \frac{1}{\theta}\log(2d/\delta).$$
Moreover, there are two corollaries.
1. ((Jin et al., 2019b, Corollary 7)) When $\{\sigma_i\}$ is deterministic, there exists an absolute constant $c$ such that for any fixed $\delta \in (0,1)$, with probability at least $1 - \delta$,
$$\left\|\sum_{i=1}^n x_i\right\| \le c\sqrt{\log(2d/\delta)\sum_{i=1}^n \sigma_i^2}.$$
2. ((Jin et al., 2019b, Corollary 8)) Suppose that the $\{\sigma_i\}$ sequence is random. Then, there exists an absolute constant $c$ such that for any fixed $\delta \in (0,1)$ and $B > b > 0$, with probability at least $1 - \delta$, either $\sum_{i=1}^n \sigma_i^2 \ge B$ or
$$\left\|\sum_{i=1}^n x_i\right\| \le c\sqrt{\max\left\{\sum_{i=1}^n \sigma_i^2,\ b\right\}\left(\log(2d/\delta) + \log(\log(B/b))\right)}.$$

We state here a Bernstein-type concentration inequality for subExponential random variables, which we also need.

Lemma 8 (Bernstein concentration inequality). Consider a sequence of independently distributed $\sigma$-subExponential variables $x_1, \ldots, x_n \in \mathbb{R}$, with mean $\mathbb{E}[x_i] \le c'\sigma$ for some $c' > 0$ and each $i \in [n]$. Then, there exists an absolute constant $C > 0$ such that for any $\delta \in (0,1)$, with probability at least $1 - \delta$,
$$\sum_{i=1}^n x_i \le C\sigma(n + \log(1/\delta)). \tag{6}$$

Proof. Eq. (6) follows by applying Bernstein's inequality to $\sum_{i=1}^n (x_i - \mathbb{E}[x_i])$ (so that each summand is mean-zero). Per Bernstein's inequality (cf. Theorem 2.8.1 in Vershynin (2018)), there exists an absolute constant $c > 0$ such that for any $s \ge 0$,
$$\mathbb{P}\left(\sum_{i=1}^n (x_i - \mathbb{E}[x_i]) \ge s\right) \le \exp\left(-c\min\left\{\frac{s^2}{n\sigma^2}, \frac{s}{\sigma}\right\}\right).$$
Pick $s = \frac{\sigma(n + \log(1/\delta))}{c}$. Then,
$$\min\left\{\frac{s^2}{n\sigma^2}, \frac{s}{\sigma}\right\} = \min\left\{\frac{n + 2\log(1/\delta)}{c^2} + \frac{(\log(1/\delta))^2}{c^2 n},\ \frac{n + \log(1/\delta)}{c}\right\} = \frac{n + \log(1/\delta)}{c}.$$
Continuing, we have that
$$\mathbb{P}\left(\sum_{i=1}^n (x_i - \mathbb{E}[x_i]) \ge s\right) \le \exp\left(-c\min\left\{\frac{s^2}{n\sigma^2}, \frac{s}{\sigma}\right\}\right) \le \exp\left(-c\cdot\frac{n + \log(1/\delta)}{c}\right) \le \delta.$$
Thus, it follows that with probability at least $1 - \delta$,
$$\sum_{i=1}^n (x_i - \mathbb{E}[x_i]) \le \frac{\sigma(n + \log(1/\delta))}{c} \implies \sum_{i=1}^n x_i \le \frac{\sigma(n + \log(1/\delta))}{c} + nc'\sigma,$$
where the implication holds since by assumption, $\mathbb{E}[x_i] \le c'\sigma$ for some $c' > 0$. Then, by setting $C = \max\{1 + c', 1/c\}$, the desired result follows.
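The bound of Lemma 8 can be illustrated with squared standard Gaussians, a canonical subExponential family (the constant C = 2 here is an assumed demo value, not the lemma's absolute constant):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, delta = 50, 20_000, 0.01
C = 2.0  # assumed demo constant (sigma = O(1) for chi-squared summands)
# x_i = g_i^2 with g_i ~ N(0,1): subExponential with mean 1.
sums = (rng.standard_normal((trials, n)) ** 2).sum(axis=1)
threshold = C * (n + np.log(1 / delta))
failure_rate = float(np.mean(sums > threshold))
```

Across many independent trials, the fraction of sums exceeding the Bernstein-type threshold $C\sigma(n + \log(1/\delta))$ stays below $\delta$.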

C.3 A NOVEL CONCENTRATION INEQUALITY FOR THE ZEROTH-ORDER SETTING

In the zeroth-order setting, we will frequently have to bound the norms of terms of the form
$$W_\tau = \sum_{t=0}^{\tau-1} M_t (Z_t Z_t^\top - I) v_t, \tag{7}$$
where $M_t$ is a known and fixed quantity, while $Z_t$ is random, and $v_t$ depends on $x_0$ and the history of the previous $\{Z_j\}_{j=0}^{t-1}$'s, and is hence $\mathcal{F}_{t-1}$-measurable. For our purposes, it suffices to consider $Z_t \sim N(0, I)$. To see why such a bound will be useful: as mentioned in the main text, and as we will see again later in the full proofs, in the analysis of escaping saddle points we need to bound a term of the form
$$W_{g_0}(\tau) = \eta\sum_{t=0}^{\tau-1}(I - \eta H)^{\tau-1-t}(Z_tZ_t^\top - I)(\nabla f(x_t) - \nabla f(\tilde{x}_t)),$$
where $H = \nabla^2 f(x_0)$ (assuming that $x_0$ is an $\epsilon$-saddle point), and $x_t$ and $\tilde{x}_t$ are two coupled sequences. Comparing with Eq. (7), we see that for the equation above we can define $M_t = \eta(I - \eta H)^{\tau-1-t}$ (a fixed and known quantity) and $v_t = \nabla f(x_t) - \nabla f(\tilde{x}_t)$ (clearly, $\nabla f(x_t) - \nabla f(\tilde{x}_t)$ is $\mathcal{F}_{t-1}$-measurable). This motivates why we wish to bound terms of the form Eq. (7).

Observe that each $(Z_tZ_t^\top - I)v_t \mid \mathcal{F}_{t-1}$ term is subExponential rather than subGaussian. While it is possible to define norm-subExponential vectors in a way analogous to norm-subGaussian vectors, the corresponding moment generating function (MGF) for subExponential random variables is not defined on the entirety of $\mathbb{R}$. When bounding a sum of the form $\sum_{t=0}^{\tau-1}(Z_tZ_t^\top - I)v_t$, this creates a subtle but challenging technical issue. Following the intuition outlined in the main text, we bypass this difficulty by proving the following result. For notational simplicity, we introduce the function
$$\mathrm{lr}(x) := \log(x\log(x)). \tag{8}$$
We now recall Proposition 2, which we first introduced in the main text.

Proposition 2. Let $\mathcal{F}_t$, $t \ge -1$ be a filtration. Let $(Z_t)_{t\ge0}$ be a sequence of random vectors following the distribution $N(0, I)$ such that $Z_t \in \mathcal{F}_t$ and is independent of $\mathcal{F}_{t-1}$, and let $(v_t)_{t\ge0}$ be a sequence of random vectors such that $v_t \in \mathcal{F}_{t-1}$.
For each $\tau \ge 0$, let
$$W_\tau = \sum_{t=0}^{\tau-1} M_t(Z_tZ_t^\top - I)v_t,$$
where each $M_t$ is a deterministic matrix of appropriate dimension. Then, there exist some absolute constants $c', C > 0$ such that for any $\tau \in \mathbb{Z}_+$ and $\delta \in (0, 1/e]$, the following statements hold:
1. For any $\theta > 0$, with probability at least $1 - \delta$, we have
$$\|W_\tau\| \le c'\left(\theta\sum_{t=0}^{\tau-1}\|M_t\|_2^2\,d\,(\mathrm{lr}(C\tau/\delta))^2\|v_t\|^2 + \frac{1}{\theta}\log(Cd\tau/\delta)\right).$$
2. For any $B > b > 0$, with probability at least $1 - \delta$, either $\sum_{t=0}^{\tau-1}\|M_t\|_2^2\,d\,(\mathrm{lr}(C\tau/\delta))^2\|v_t\|^2 \ge B$ or
$$\|W_\tau\| \le c'\sqrt{\max\left\{\sum_{t=0}^{\tau-1}\|M_t\|_2^2\,d\,(\mathrm{lr}(C\tau/\delta))^2\|v_t\|^2,\ b\right\}\left(\log(C\tau d/\delta) + \log(\log(B/b)+1)\right)}.$$
Moreover, as is clear from the bounds above, we may pick $C \ge 1$ such that $\log(C/\delta) \ge 1$ for all $\delta \in (0, 1/e]$.

Proof. We will focus on proving the first point, since the second follows as a natural corollary of our proof of the first part and the proof of (Jin et al., 2019b, Corollary 8). For simplicity, we shall assume $v_t \ne 0$ in the intermediate steps; the extension to the general case is straightforward. First, for $0 \le \alpha < 1$, let
$$g(\alpha; \delta) = \sqrt{\frac{2}{\pi}}\int_\alpha^{\sqrt{2\,\mathrm{lr}(1/\delta)}}(x^2-1)e^{-x^2/2}\,dx = \sqrt{\frac{2}{\pi}}\left(\alpha e^{-\alpha^2/2} - \frac{\delta\sqrt{2\,\mathrm{lr}(1/\delta)}}{\log(1/\delta)}\right).$$
It is not hard to see that for a fixed $\delta \in (0, 1/e]$, $g(\alpha; \delta)$ is continuous and strictly increasing over $\alpha \in [0, 1)$. Then, since $\frac{\log x}{x} + 1 \le x$ for $x \ge 1$, by plugging in $x = \log(1/\delta)$ we get
$$\frac{\mathrm{lr}(1/\delta)}{(\log(1/\delta))^2} = \frac{\log\log(1/\delta) + \log(1/\delta)}{(\log(1/\delta))^2} = \frac{1}{\log(1/\delta)}\left(\frac{\log\log(1/\delta)}{\log(1/\delta)} + 1\right) \le 1,$$
which leads to
$$g(2\delta; \delta) = \sqrt{\frac{2}{\pi}}\left(2\delta e^{-2\delta^2} - \frac{\delta\sqrt{2\,\mathrm{lr}(1/\delta)}}{\log(1/\delta)}\right) \ge \sqrt{\frac{2}{\pi}}\left(2e^{-2/e^2}\delta - \sqrt{2}\delta\right) > 0$$
for $\delta \in (0, 1/e]$. Furthermore, we obviously have $g(0; \delta) < 0$. Therefore $g(\alpha; \delta) = 0$ has a unique solution in $(0, 2\delta)$, which we denote by $\alpha(\delta)$.³ These results imply that, for a random variable $Z$ following the standard normal distribution, we have
$$\mathbb{E}\left[(Z^2-1)\mathbf{1}_{\alpha(\delta)\le|Z|\le\sqrt{2\,\mathrm{lr}(1/\delta)}}\right] = \sqrt{\frac{2}{\pi}}\int_{\alpha(\delta)}^{\sqrt{2\,\mathrm{lr}(1/\delta)}}(x^2-1)e^{-x^2/2}\,dx = g(\alpha(\delta); \delta) = 0$$
³ Letting $W_0(x)$ denote the principal branch of the Lambert W function, it can be shown that $\alpha(\delta) = \sqrt{-W_0\left(-\frac{2\delta^2\,\mathrm{lr}(1/\delta)}{(\log(1/\delta))^2}\right)}$.
and
$$\begin{aligned}
\mathbb{P}\left(\alpha(\delta) \le |Z| \le \sqrt{2\,\mathrm{lr}(1/\delta)}\right)
&\ge 1 - 2\left(\frac{1}{\sqrt{2\pi}}\int_{\sqrt{2\,\mathrm{lr}(1/\delta)}}^\infty e^{-x^2/2}\,dx + \frac{1}{\sqrt{2\pi}}\int_0^{\alpha(\delta)}e^{-x^2/2}\,dx\right) \\
&\ge 1 - 2\left(\frac{1}{2}\exp\left(-\frac{2\,\mathrm{lr}(1/\delta)}{2}\right) + \frac{\alpha(\delta)}{\sqrt{2\pi}}\right) = 1 - 2\left(\frac{\delta}{2\log(1/\delta)} + \frac{\alpha(\delta)}{\sqrt{2\pi}}\right) \\
&\ge 1 - 2\left(\frac{\delta}{2} + \frac{2}{\sqrt{2\pi}}\delta\right) \ge 1 - C\delta \tag{9}
\end{aligned}$$
for any $\delta \in (0, 1/e]$, where we define the absolute constant $C := 2(1/2 + 2/\sqrt{2\pi})$.

Now we let $A_t$ denote the event
$$A_t = \left\{\alpha(\delta) \le \frac{|Z_t^\top v_t|}{\|v_t\|} \le \sqrt{2\,\mathrm{lr}(1/\delta)}\right\}.$$
Since $Z_t^\top v_t/\|v_t\|$ conditioned on $\mathcal{F}_{t-1}$ follows the standard normal distribution, we have $\mathbb{P}_{\mathcal{F}_{t-1}}(A_t) \ge 1 - C\delta$, and
$$\mathbb{E}_{\mathcal{F}_{t-1}}\left[v_t^\top(Z_tZ_t^\top - I)v_t\mathbf{1}_{A_t}\right] = 0.$$
Moreover, for any random vector $u \in \mathcal{F}_{t-1}$ that is orthogonal to $v_t$, we have
$$\mathbb{E}_{\mathcal{F}_{t-1}}\left[u^\top(Z_tZ_t^\top - I)v_t\mathbf{1}_{A_t}\right] = \mathbb{E}_{\mathcal{F}_{t-1}}\left[u^\top Z_t\right]\cdot\mathbb{E}_{\mathcal{F}_{t-1}}\left[Z_t^\top v_t\mathbf{1}_{A_t}\right] = 0,$$
where we used the fact that $Z_t^\top u$ is independent of $Z_t^\top v_t$ conditioned on $\mathcal{F}_{t-1}$. Therefore $\mathbb{E}_{\mathcal{F}_{t-1}}\left[(Z_tZ_t^\top - I)v_t\mathbf{1}_{A_t}\right] = 0$.

Consider then defining the random variable $Q_t$ by $Q_t := (Z_tZ_t^\top - I)v_t\cdot\mathbf{1}_{A_t}$. We now show that $Q_t \mid \mathcal{F}_{t-1}$ is norm-subGaussian. Let $u \in \mathbb{R}^d$ with $\|u\| = 1$ be arbitrary. We have
$$\begin{aligned}
u^\top Q_t &= u^\top(Z_tZ_t^\top - I)v_t\cdot\mathbf{1}_{A_t} = u^\top\left(\frac{v_tv_t^\top}{\|v_t\|^2} + I - \frac{v_tv_t^\top}{\|v_t\|^2}\right)(Z_tZ_t^\top - I)v_t\cdot\mathbf{1}_{A_t} \\
&= u^\top v_t\left(\frac{|Z_t^\top v_t|^2}{\|v_t\|^2} - 1\right)\cdot\mathbf{1}_{A_t} + u_\perp^\top Z_tZ_t^\top v_t\cdot\mathbf{1}_{A_t},
\end{aligned}$$
where we denote $u_\perp = \left(I - \frac{v_tv_t^\top}{\|v_t\|^2}\right)u$. Since $\left|u^\top v_t\left(\frac{|Z_t^\top v_t|^2}{\|v_t\|^2} - 1\right)\cdot\mathbf{1}_{A_t}\right| \le |u^\top v_t|(2\,\mathrm{lr}(1/\delta) - 1)$, we see that $u^\top v_t\left(\frac{|Z_t^\top v_t|^2}{\|v_t\|^2} - 1\right)\cdot\mathbf{1}_{A_t}$ conditioned on $\mathcal{F}_{t-1}$ is $|u^\top v_t|(2\,\mathrm{lr}(1/\delta)-1)$-subGaussian. Furthermore, since $|u_\perp^\top Z_tZ_t^\top v_t\cdot\mathbf{1}_{A_t}| \le |Z_t^\top u_\perp|\sqrt{2\,\mathrm{lr}(1/\delta)}\|v_t\|$, we have
$$\mathbb{P}_{\mathcal{F}_{t-1}}\left(|u_\perp^\top Z_tZ_t^\top v_t\cdot\mathbf{1}_{A_t}| \ge s\right) \le \mathbb{P}_{\mathcal{F}_{t-1}}\left(|Z_t^\top u_\perp|\sqrt{2\,\mathrm{lr}(1/\delta)}\|v_t\| \ge s\right),$$
and since $Z_t^\top u_\perp/\|u_\perp\| \mid \mathcal{F}_{t-1}$ follows the standard normal distribution, we see that $u_\perp^\top Z_tZ_t^\top v_t\cdot\mathbf{1}_{A_t}$ is a $\sqrt{2\,\mathrm{lr}(1/\delta)}\|u_\perp\|\|v_t\|$-subGaussian variable.
Noting that $u^\top Q_t$ is just the sum of $u^\top v_t\left(\frac{|Z_t^\top v_t|^2}{\|v_t\|^2} - 1\right)\cdot\mathbf{1}_{A_t}$ and $u_\perp^\top Z_tZ_t^\top v_t\cdot\mathbf{1}_{A_t}$, we can conclude that $u^\top Q_t$ is subGaussian with parameter
$$(2\,\mathrm{lr}(1/\delta)-1)|u^\top v_t| + \sqrt{2\,\mathrm{lr}(1/\delta)}\|u_\perp\|\|v_t\| \le 2\,\mathrm{lr}(1/\delta)\left(|u^\top v_t| + \|u_\perp\|\|v_t\|\right) \le 2\sqrt{2}\,\mathrm{lr}(1/\delta)\sqrt{|u^\top v_t|^2 + \|u_\perp\|^2\|v_t\|^2} = 2\sqrt{2}\,\mathrm{lr}(1/\delta)\|v_t\|,$$
whenever $\delta \in (0, 1/e]$. Consequently, by (Jin et al., 2019b, Lemma 1), we see that $Q_t \mid \mathcal{F}_{t-1}$ is $\sqrt{8}\,\mathrm{lr}(1/\delta)\sqrt{d}\|v_t\|$-norm-subGaussian. It follows easily that $M_tQ_t \mid \mathcal{F}_{t-1}$ is mean-zero and $\sqrt{8}\,\mathrm{lr}(1/\delta)\|M_t\|_2\|v_t\|\sqrt{d}$-norm-subGaussian. Hence, by (Jin et al., 2019a, Lemma 6), we know that there exists an absolute constant $c > 0$ such that for any $\theta > 0$ and $\delta > 0$, with probability at least $1 - \delta$,
$$\left\|\sum_{t=0}^{\tau-1} M_tQ_t\right\| \le c\theta\sum_{t=0}^{\tau-1} d\,(\mathrm{lr}(1/\delta))^2\|M_t\|_2^2\|v_t\|^2 + \frac{1}{\theta}\log(2d/\delta).$$

Now, consider denoting the event

$$A := \bigcap_{t=0}^{\tau-1} A_t = \left\{|Z_t^\top v_t| \in \left[\alpha(\delta)\|v_t\|,\ \sqrt{2\,\mathrm{lr}(1/\delta)}\|v_t\|\right],\ \forall t \in \{0, \ldots, \tau-1\}\right\}.$$
By the union bound and Eq. (9), we note that $\mathbb{P}(A) \ge 1 - \tau C\delta$. Moreover, note that on the event $A$,
$$\sum_{t=0}^{\tau-1} M_tQ_t = \sum_{t=0}^{\tau-1} M_t(Z_tZ_t^\top - I)v_t.$$
Hence,
$$\begin{aligned}
&\mathbb{P}\left(\left\|\sum_{t=0}^{\tau-1} M_t(Z_tZ_t^\top - I)v_t\right\| \le c\theta\sum_{t=0}^{\tau-1} d\,(\mathrm{lr}(1/\delta))^2\|M_t\|_2^2\|v_t\|^2 + \frac{1}{\theta}\log(2d/\delta)\right) \\
&\ge \mathbb{P}\left(\left\|\sum_{t=0}^{\tau-1} M_tQ_t\right\| \le c\theta\sum_{t=0}^{\tau-1} d\,(\mathrm{lr}(1/\delta))^2\|M_t\|_2^2\|v_t\|^2 + \frac{1}{\theta}\log(2d/\delta),\ \text{and } A \text{ happens}\right) \\
&\ge 1 - \mathbb{P}\left(\left\|\sum_{t=0}^{\tau-1} M_tQ_t\right\| \ge c\theta\sum_{t=0}^{\tau-1} d\,(\mathrm{lr}(1/\delta))^2\|M_t\|_2^2\|v_t\|^2 + \frac{1}{\theta}\log(2d/\delta)\right) - \mathbb{P}(A^c) \ge 1 - (\delta + \tau C\delta).
\end{aligned}$$
Now, by rescaling $\delta$ to $\delta/(C\tau + 1)$, we get the desired result. Note this $C$ differs from the $C$ in the statement of the proposition by an absolute multiplicative factor.
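A quick Monte Carlo illustration of the scale predicted by Proposition 2 (with the assumed toy choices $M_t = I$ and a fixed unit $v_t$): the norm $\|W_\tau\|$ concentrates around the $\sqrt{\tau d}$ scale, far below the $\tau\cdot\mathrm{poly}(d)$ growth a naive worst-case subExponential bound would suggest.

```python
import numpy as np

rng = np.random.default_rng(3)
d, tau, trials = 50, 200, 500
v = np.zeros(d)
v[0] = 1.0  # fixed unit vector v_t (toy choice)
norms = np.empty(trials)
for k in range(trials):
    Z = rng.standard_normal((tau, d))
    # W_tau = sum_t (Z_t Z_t^T - I) v, taking M_t = I for all t
    W = (Z * (Z @ v)[:, None]).sum(axis=0) - tau * v
    norms[k] = np.linalg.norm(W)
median_norm = float(np.median(norms))
```

Here $\mathbb{E}\|W_\tau\|^2 = \tau(d+1)$ for this toy choice, so the median norm lands near $\sqrt{\tau d}$.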

C.4 SUB-WEIBULL RANDOM VARIABLES

In our work, we occasionally require bounding sums of heavy-tailed random variables, e.g. higher powers of $\|Z\|$ where $Z \sim N(0, I)$. To this end, we consider the following definition of sub-Weibull random variables.

Definition 7. We say that a random variable $X \in \mathbb{R}$ is sub-Weibull($K, \alpha$) for some $K, \alpha > 0$ if
$$\mathbb{P}(|X| \ge s) \le 2\exp\left(-(s/K)^{1/\alpha}\right) \quad \forall s \ge 0.$$
For instance, the standard normal distribution is sub-Weibull($1, \frac{1}{2}$). From the way we define the tail parameter $\alpha$, the larger the $\alpha$, the heavier the tail of the distribution. In our work, we need to show that products and sums of sub-Weibull random variables are again sub-Weibull, which is ensured by the following result.

Lemma 9. Suppose $X$ and $Y$ are sub-Weibull($K_X, \alpha$) and sub-Weibull($K_Y, \alpha$) respectively. Then, $XY$ is sub-Weibull($C(K_X\cdot K_Y), 2\alpha$) and $X + Y$ is sub-Weibull($C(K_X + K_Y), \alpha$) for some absolute constant $C > 0$.

A helpful result is the following, which bounds the sum of identically distributed sub-Weibull random variables.

Lemma 10 (Corollary 3.1 in Vladimirova et al. (2020)). Suppose $X_1, \ldots, X_n$ are identically distributed sub-Weibull($K', \alpha$) random variables. Then, for some absolute constant $c > 0$, for all $s \ge ncK'$, we have
$$\mathbb{P}\left(\sum_{i=1}^n X_i \ge s\right) \le \exp\left(-\left(\frac{s}{ncK'}\right)^{1/\alpha}\right).$$

In our work, we frequently need to bound sums of the $k$-th power of the norm of a standard $d$-dimensional Gaussian. We do so using Lemma 10.

Lemma 11. Suppose $X_i \overset{\text{i.i.d.}}{\sim} N(0, I_d)$ for $i \in [n]$. Then, for any $k \in \mathbb{Z}_+$, there exist absolute constants $c, C > 0$ such that for any $\delta \in (0,1)$, with probability at least $1 - \delta$,
$$\sum_{i=1}^n \|X_i\|^{2k} \le nCc^kd^k\left(1 + (\log(1/\delta))^k\right).$$
In particular, for any $\delta \in (0, 1/e)$, so that $\log(1/\delta) \ge 1$, it follows that $\sum_{i=1}^n \|X_i\|^{2k} \le 2nCc^kd^k(\log(1/\delta))^k$.

Proof. First, observe that for any $j \in [d]$, $(X_i)_j^2$, being subExponential, is sub-Weibull($1, 1$). Then, by Lemma 9, $\|X_i\|^2 = \sum_{j=1}^d (X_i)_j^2$ is sub-Weibull($cd, 1$) for some absolute constant $c$. Now, it follows from the definition of sub-Weibullness in Definition 7 that $\|X_i\|^{2k}$ is sub-Weibull($c^kd^k, k$).
Hence, applying Lemma 10, we have that there exists an absolute constant $C > 0$ such that for any $s \ge nCc^kd^k$,
$$\mathbb{P}\left(\sum_{i=1}^n \|X_i\|^{2k} \ge s\right) \le \exp\left(-\left(\frac{s}{nCc^kd^k}\right)^{1/k}\right).$$
Choosing $s = (1 + (\log(1/\delta))^k)nCc^kd^k$, we arrive at the desired result.
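An illustrative numerical check of the $d^k$ scaling in Lemma 11 (the dimension and powers are arbitrary demo choices): empirical averages of $\|X_i\|^{2k}$ stay within a constant$^k$ factor of $d^k$.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 20, 5000
X = rng.standard_normal((n, d))
sq = (X ** 2).sum(axis=1)  # ||X_i||^2 ~ chi-squared(d), mean d
# E||X||^{2k} = d(d+2)...(d+2k-2), so each ratio below is Theta(1)^k.
ratios = [float(np.mean(sq ** k)) / d ** k for k in (1, 2, 3)]
```

For $d = 20$ the exact ratios are $1$, $1.1$, and $1.32$ respectively, consistent with the $c^kd^k$ bound.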

C.5 SUPERMARTINGALE CONCENTRATION INEQUALITIES

We first state and prove a supermartingale-type concentration inequality of the form we later require.

Lemma 12. Consider a filtration of sigma-algebras $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots \subset \mathcal{F}_{n-1} \subset \mathcal{F}_n$ and a sequence of random variables $X_1, \ldots, X_n$ such that $X_i \in \mathcal{F}_i$. Suppose that
$$\mathbb{P}_{\mathcal{F}_{i-1}}(X_i \le a) = 1 \quad \text{and} \quad \mathbb{P}_{\mathcal{F}_{i-1}}(X_i \le -b) \ge p \tag{10}$$
for some $a, b > 0$ and $0 < p \le \frac{1}{2}$. Then, for any $0 < \mu \le b$ such that $|-b+\mu| \ge \frac{1-p}{p}(a+\mu)$, we have
$$\mathbb{P}\left(\sum_{i=1}^n X_i \ge -n\mu + s\right) \le \exp\left(-\frac{s^2}{4n(b-\mu)^2}\right) \quad \forall s > 0.$$

Proof. Observe that by Markov's inequality, for any $\lambda > 0$,
$$\mathbb{P}\left(\sum_{i=1}^n X_i \ge -n\mu + s\right) = \mathbb{P}\left(\exp\left(\lambda\sum_{i=1}^n (X_i + \mu)\right) \ge \exp(\lambda s)\right) \le \frac{\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n (X_i + \mu)\right)\right]}{\exp(\lambda s)}. \tag{11}$$

Now, observe that

$$\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n (X_i+\mu)\right)\right] = \mathbb{E}\left[\mathbb{E}_{\mathcal{F}_{n-1}}\left[\exp\left(\lambda\sum_{i=1}^n (X_i+\mu)\right)\right]\right] = \mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^{n-1}(X_i+\mu)\right)\mathbb{E}_{\mathcal{F}_{n-1}}\left[\exp(\lambda(X_n+\mu))\right]\right].$$
Let us now bound $\mathbb{E}_{\mathcal{F}_{n-1}}[\exp(\lambda(X_n+\mu))]$:
$$\begin{aligned}
\mathbb{E}_{\mathcal{F}_{n-1}}[\exp(\lambda(X_n+\mu))] &= \int_{(-\infty,-b]}\exp(\lambda(x+\mu))\,\mathbb{P}_{\mathcal{F}_{n-1}}(X_n \in dx) + \int_{(-b,a]}\exp(\lambda(x+\mu))\,\mathbb{P}_{\mathcal{F}_{n-1}}(X_n \in dx) \\
&\le \mathbb{P}_{\mathcal{F}_{n-1}}(X_n \le -b)\exp(\lambda(-b+\mu)) + \mathbb{P}_{\mathcal{F}_{n-1}}(-b < X_n \le a)\exp(\lambda(a+\mu)) \\
&\le p\exp(\lambda(-b+\mu)) + (1-p)\exp(\lambda(a+\mu)).
\end{aligned}$$
Then observe that by our choice of $\mu$, $-b+\mu \le 0$, and $|-b+\mu| \ge (a+\mu)\frac{1-p}{p}$. Since we assumed $p \le \frac{1}{2}$, we have $\frac{1-p}{p} \ge 1$, and so for any $k \ge 1$,
$$|-b+\mu| \ge (a+\mu)\frac{1-p}{p} \implies |-b+\mu| \ge (a+\mu)\left(\frac{1-p}{p}\right)^{1/k} \implies p|-b+\mu|^k \ge (1-p)(a+\mu)^k.$$
Consequently, by Taylor expansion,
$$\begin{aligned}
p\exp(\lambda(-b+\mu)) + (1-p)\exp(\lambda(a+\mu)) &= 1 + \sum_{k=1}^\infty \frac{\lambda^k\left(p(-b+\mu)^k + (1-p)(a+\mu)^k\right)}{k!} \\
&\le 1 + \sum_{k=1}^\infty \frac{\lambda^k\left(p(-b+\mu)^k + p|-b+\mu|^k\right)}{k!} = 1 + \sum_{k=1}^\infty \frac{\lambda^{2k}\cdot 2p|-b+\mu|^{2k}}{(2k)!} \\
&\le 1 + \sum_{k=1}^\infty \frac{\lambda^{2k}|-b+\mu|^{2k}}{k!} = \exp(\lambda^2(-b+\mu)^2),
\end{aligned}$$
which leads to $\mathbb{E}_{\mathcal{F}_{n-1}}[\exp(\lambda(X_n+\mu))] \le \exp(\lambda^2(b-\mu)^2)$. Now, continuing from Eq. (11), we have that
$$\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n(X_i+\mu)\right)\right] \le \mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^{n-1}(X_i+\mu)\right)\right]\exp(\lambda^2(b-\mu)^2) \le \cdots \le \exp(n\lambda^2(b-\mu)^2).$$
Thus, for any $\lambda > 0$ and $s \ge 0$,
$$\mathbb{P}\left(\sum_{i=1}^n X_i \ge -n\mu + s\right) \le \frac{\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n(X_i+\mu)\right)\right]}{\exp(\lambda s)} \le \exp\left(n\lambda^2(b-\mu)^2 - \lambda s\right).$$
Choosing the minimizing $\lambda = \frac{s}{2n(b-\mu)^2}$, we find that
$$\mathbb{P}\left(\sum_{i=1}^n X_i \ge -n\mu + s\right) \le \exp\left(-\frac{s^2}{4n(b-\mu)^2}\right),$$
which completes the proof.

We will later require a weakened form of a supermartingale concentration inequality, as stated and proven below.

Proposition 3 (Weakened supermartingale concentration inequality). Consider a filtration of sigma-algebras $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots \subset \mathcal{F}_n$ and a sequence of random variables $X_1, \ldots, X_n$ such that $X_i \in \mathcal{F}_i$. Consider for each $i \in \{1, \ldots, n\}$ a bad set $B_i$ with $\mathbf{1}_{B_i} \in \mathcal{F}_{i-1}$, and suppose
$$\mathbb{P}_{\mathcal{F}_{i-1}}(X_i\mathbf{1}_{B_i^c} \le a) = 1 \quad \text{and} \quad \mathbb{P}_{\mathcal{F}_{i-1}}(X_i\mathbf{1}_{B_i^c} \le -b) \ge p$$
for some $a, b > 0$ and $0 < p \le 1/2$.
Then, for any $0 < \mu \le b$ such that $|-b+\mu| \ge \frac{1-p}{p}(a+\mu)$, we have
$$\mathbb{P}\left(\sum_{i=1}^n X_i \ge -n\mu + s\right) \le \exp\left(-\frac{s^2}{4n(b-\mu)^2}\right) + \sum_{i=1}^n \mathbb{P}(B_i) \quad \forall s > 0.$$

Proof. We define $Q_i := X_i\mathbf{1}_{B_i^c}$. We can then apply Lemma 12 to get
$$\mathbb{P}\left(\sum_{i=1}^n Q_i \ge -n\mu + s\right) \le \exp\left(-\frac{s^2}{4n(b-\mu)^2}\right).$$
Since $\mathbb{P}(X_i \ne Q_i \text{ for some } i \in [n]) \le \sum_i \mathbb{P}(B_i)$, it follows that
$$\mathbb{P}\left(\sum_{i=1}^n X_i \ge -n\mu + s\right) \le \exp\left(-\frac{s^2}{4n(b-\mu)^2}\right) + \sum_{i=1}^n \mathbb{P}(B_i),$$
which completes the proof.
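The tail bound of Lemma 12 can be simulated directly. The parameter values below are illustrative, chosen to satisfy the condition $|-b+\mu| \ge \frac{1-p}{p}(a+\mu)$:

```python
import numpy as np

rng = np.random.default_rng(5)
a, b, p, mu = 1.0, 3.0, 0.5, 1.0  # |-b+mu| = 2 >= (1-p)/p * (a+mu) = 2
n, trials, delta = 400, 20_000, 0.01
# X_i = -b with probability p, else a: P(X_i <= a) = 1, P(X_i <= -b) >= p.
X = np.where(rng.random((trials, n)) < p, -b, a)
# Deviation s chosen so the lemma's bound exp(-s^2/(4n(b-mu)^2)) equals delta.
s = np.sqrt(4 * n * (b - mu) ** 2 * np.log(1 / delta))
exceed_rate = float(np.mean(X.sum(axis=1) >= -n * mu + s))
```

The empirical exceedance frequency sits well below the bound $\delta$, reflecting the slack in the MGF argument.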

D FUNCTION DECREASE IN LARGE GRADIENT REGIME

In this section, we show that sufficient function decrease is made across the iterations with large gradients. We first restate and prove the function decrease lemma (Lemma 1), first introduced in the main text. We then provide a detailed roadmap of our proof in the discussion following the proof of Lemma 1.

Lemma 1 (Function decrease for batch zeroth-order optimization). Suppose at each time $t$, the algorithm performs the update step (with batch-size parameter $1 \le m \le d$)
$$x_{t+1} = x_t - \eta\left(g_u^{(m)}(x_t) + Y_t\right), \quad \text{where} \quad g_u^{(m)}(x_t) = \frac{1}{m}\sum_{i=1}^m \frac{f(x_t + uZ_{t,i}) - f(x_t - uZ_{t,i})}{2u}Z_{t,i},$$
where each $Z_{t,i}$ is drawn i.i.d. from $N(0, I)$, $u > 0$ is the smoothing radius, and $Y_t \sim N(0, \frac{r^2}{d}I)$ with $r > 0$ denoting the perturbation radius. Then, there exist absolute constants $c_1 > 0$, $C_1 \ge 1$ such that, for any $T \in \mathbb{Z}_+$ and $T \ge \tau > 0$, $\alpha > 0$ and $\delta \in (0, 1/e]$, upon defining $H_{0,\tau}(\delta)$ to be the event on which the inequality
$$\begin{aligned}
f(x_\tau) - f(x_0) \le&\ -\frac{3\eta}{4}\sum_{t=0}^{\tau-1}\frac{1}{m}\sum_{i=1}^m\left(Z_{t,i}^\top\nabla f(x_t)\right)^2 + \left(\frac{\eta}{\alpha} + \frac{c_1L\eta^2\chi^3d}{m}\right)\sum_{t=0}^{\tau-1}\|\nabla f(x_t)\|^2 \\
&+ \tau\eta u^4\rho^2\cdot c_1d^3\left(\log\frac{T}{\delta}\right)^3 + \tau L\eta^2u^4\rho^2\cdot c_1d^4\left(\log\frac{T}{\delta}\right)^4 + \eta c_1r^2(\alpha + \eta L)\log\frac{T}{\delta} + \tau c_1L\eta^2r^2 \tag{3}
\end{aligned}$$
is satisfied (where $\chi := \log(C_1dmT/\delta)$), we have
$$\mathbb{P}(H_{0,\tau}(\delta)) \ge 1 - \frac{(\tau+4)\delta}{T}, \qquad \mathbb{P}\left(\cap_{\tau'=1}^{\tau}H_{0,\tau'}(\delta)\right) \ge 1 - \frac{5\tau\delta}{T}$$
for any $0 \le \tau \le T$.

Proof. First, for each $t \in \{-1, \ldots, \tau\}$, we define $\mathcal{F}_t$ to be the sigma-algebra generated by $x_0$, $(\{Z_{0,i}\}_{i=1}^m, \ldots, \{Z_{t,i}\}_{i=1}^m)$, and $(Y_0, \ldots, Y_t)$. Note that $\mathcal{F}_{-1}$ is the sigma-algebra generated only by $x_0$. By Taylor expansion, for any $x, y \in \mathbb{R}^d$, there exists $\alpha \in [0,1]$ such that
$$f(x+y) = f(x) + \langle\nabla f(x), y\rangle + \frac{1}{2}y^\top\nabla^2 f(x + \alpha y)\,y.$$
Therefore,
$$\frac{f(x_t + uZ_{t,i}) - f(x_t - uZ_{t,i})}{2u} = \langle\nabla f(x_t), Z_{t,i}\rangle + \frac{u}{2}Z_{t,i}^\top\hat{H}_{t,i}Z_{t,i}, \quad \text{with} \quad \hat{H}_{t,i} = \frac{\nabla^2 f(x_t + \alpha_{i,+}uZ_{t,i}) - \nabla^2 f(x_t - \alpha_{i,-}uZ_{t,i})}{2}$$
for some $\alpha_{i,\pm} \in [0,1]$, and
$$x_{t+1} = x_t - \eta\left(\frac{1}{m}\sum_{i=1}^m\left(Z_{t,i}Z_{t,i}^\top\nabla f(x_t) + \frac{u}{2}Z_{t,i}Z_{t,i}^\top\hat{H}_{t,i}Z_{t,i}\right) + Y_t\right). \tag{12}$$
By the $\rho$-Hessian Lipschitz property of $f$, it follows that $\|\hat{H}_{t,i}\| \le \rho u\|Z_{t,i}\|$. Observe that
$$\begin{aligned}
f(x_{t+1}) &\overset{(i)}{\le} f(x_t) + \langle x_{t+1} - x_t, \nabla f(x_t)\rangle + \frac{L}{2}\|x_{t+1} - x_t\|^2 \\
&\overset{(ii)}{=} f(x_t) - \frac{\eta}{m}\sum_{i=1}^m\left(Z_{t,i}^\top\nabla f(x_t)\right)^2 - \frac{\eta}{m}\sum_{i=1}^m\frac{u}{2}\left(Z_{t,i}^\top\nabla f(x_t)\right)\left(Z_{t,i}^\top\hat{H}_{t,i}Z_{t,i}\right) - \eta\langle\nabla f(x_t), Y_t\rangle \\
&\quad + \frac{L\eta^2}{2}\left\|\frac{1}{m}\sum_{i=1}^m\left(Z_{t,i}Z_{t,i}^\top\nabla f(x_t) + \frac{u}{2}Z_{t,i}Z_{t,i}^\top\hat{H}_{t,i}Z_{t,i}\right) + Y_t\right\|^2 \\
&\overset{(iii)}{\le} f(x_t) - \frac{\eta}{m}\sum_{i=1}^m\left(Z_{t,i}^\top\nabla f(x_t)\right)^2 + \frac{\eta}{m}\sum_{i=1}^m\left(\frac{\left(Z_{t,i}^\top\nabla f(x_t)\right)^2}{4} + \frac{u^2\left(Z_{t,i}^\top\hat{H}_{t,i}Z_{t,i}\right)^2}{4}\right) - \eta\langle\nabla f(x_t), Y_t\rangle \\
&\quad + \frac{L\eta^2}{2}\left(2\left\|\frac{1}{m}\sum_{i=1}^m Z_{t,i}Z_{t,i}^\top\nabla f(x_t)\right\|^2 + u^2\left\|\frac{1}{m}\sum_{i=1}^m Z_{t,i}Z_{t,i}^\top\hat{H}_{t,i}Z_{t,i}\right\|^2 + 4\|Y_t\|^2\right) \\
&\overset{(iv)}{\le} f(x_t) - \frac{3\eta}{4m}\sum_{i=1}^m\left(Z_{t,i}^\top\nabla f(x_t)\right)^2 + \frac{\eta u^2}{m}\sum_{i=1}^m\frac{u^2\rho^2\|Z_{t,i}\|^6}{4} - \eta\langle\nabla f(x_t), Y_t\rangle \\
&\quad + \frac{L\eta^2}{2}\left(2\left\|\frac{1}{m}\sum_{i=1}^m Z_{t,i}Z_{t,i}^\top\nabla f(x_t)\right\|^2 + \frac{u^2}{m}\sum_{i=1}^m u^2\rho^2\|Z_{t,i}\|^8 + 4\|Y_t\|^2\right) \\
&\le f(x_t) - \frac{3\eta}{4m}\sum_{i=1}^m\left(Z_{t,i}^\top\nabla f(x_t)\right)^2 + \frac{\eta u^4\rho^2}{4m}\sum_{i=1}^m\|Z_{t,i}\|^6 + \frac{L\eta^2u^4\rho^2}{2m}\sum_{i=1}^m\|Z_{t,i}\|^8 \\
&\quad - \eta\langle\nabla f(x_t), Y_t\rangle + \frac{L\eta^2}{2}\left(2\left\|\frac{1}{m}\sum_{i=1}^m Z_{t,i}Z_{t,i}^\top\nabla f(x_t)\right\|^2 + 4\|Y_t\|^2\right). \tag{13}
\end{aligned}$$
Above, to derive (i), we used the $L$-smoothness of $f$. To derive (ii), we used the expression for $x_{t+1} - x_t$ in Eq. (12). To derive (iii), we used the fact that $ab \le (a^2+b^2)/2$ for any $a, b \in \mathbb{R}_{\ge0}$, as well as two applications of the fact that $\|a+b\|^2 \le 2(\|a\|^2 + \|b\|^2)$ for any two vectors $a, b \in \mathbb{R}^d$. To derive (iv), we used the fact that $\|\hat{H}_{t,i}\| \le \rho u\|Z_{t,i}\|$.

To continue from Eq. (13), we first observe that we can rewrite $Z_{t,i}Z_{t,i}^\top\nabla f(x_t) = (Z_{t,i}Z_{t,i}^\top - I)\nabla f(x_t) + \nabla f(x_t)$, so that
$$\left\|\frac{1}{m}\sum_{i=1}^m Z_{t,i}Z_{t,i}^\top\nabla f(x_t)\right\|^2 \le 2\left\|\frac{1}{m}\sum_{i=1}^m (Z_{t,i}Z_{t,i}^\top - I)\nabla f(x_t)\right\|^2 + 2\|\nabla f(x_t)\|^2.$$
Observe that we can apply the bound in Proposition 2 to $\left\|\sum_{i=1}^m (Z_{t,i}Z_{t,i}^\top - I)\nabla f(x_t)\right\|$, and since each $Z_{t,i}$ is independent of $\mathcal{F}_{t-1}$, we know there exist absolute constants $c_1 > 0$, $C_1 \ge 1$ such that for any $\delta \in (0, 1/e]$ and $\theta > 0$, with probability at least $1 - \delta$ conditioned on $\mathcal{F}_{t-1}$,
$$\left\|\sum_{i=1}^m (Z_{t,i}Z_{t,i}^\top - I)\nabla f(x_t)\right\| \le c_1\left(\theta\sum_{i=1}^m d\,(\mathrm{lr}(C_1m/\delta))^2\|\nabla f(x_t)\|^2 + \frac{1}{\theta}\log(C_1dm/\delta)\right) = c_1\left(\theta md\,(\mathrm{lr}(C_1m/\delta))^2\|\nabla f(x_t)\|^2 + \frac{1}{\theta}\log(C_1dm/\delta)\right). \tag{14}$$
Moreover, since $C_1 \ge 1$, both $\log(C_1dm/\delta)$ and $\mathrm{lr}(C_1m/\delta)$ are at least 1 as long as $\delta \in (0, 1/e]$. Observe that conditioned on $\mathcal{F}_{t-1}$, $\nabla f(x_t)$ is fixed. Hence, we can pick
$$\theta = \frac{1}{\sqrt{c_1md\,\mathrm{lr}(C_1dm/\delta)}\,\|\nabla f(x_t)\|},$$
which is $\mathcal{F}_{t-1}$-measurable, and plug it into Eq. (14) to find that the probability conditioned on $\mathcal{F}_{t-1}$ of the event
$$\left\|\sum_{i=1}^m (Z_{t,i}Z_{t,i}^\top - I)\nabla f(x_t)\right\| \le 2\sqrt{c_1}\,(\mathrm{lr}(C_1dm/\delta))^{3/2}\sqrt{md}\,\|\nabla f(x_t)\|$$
is at least $1 - \delta$. By taking the total expectation, it follows that the event has total probability at least $1 - \delta$. Thus, with probability at least $1 - \delta$,
$$\left\|\frac{1}{m}\sum_{i=1}^m Z_{t,i}Z_{t,i}^\top\nabla f(x_t)\right\|^2 \le 2\left\|\frac{1}{m}\sum_{i=1}^m (Z_{t,i}Z_{t,i}^\top - I)\nabla f(x_t)\right\|^2 + 2\|\nabla f(x_t)\|^2 \le 4c_1(\mathrm{lr}(C_1dm/\delta))^3\frac{d}{m}\|\nabla f(x_t)\|^2 + 2\|\nabla f(x_t)\|^2 \le c_2(\mathrm{lr}(C_1dm/\delta))^3\frac{d}{m}\|\nabla f(x_t)\|^2, \tag{16}$$
where the last inequality comes from the fact that $\mathrm{lr}(C_1dm/\delta) \ge 1$, our assumption at the outset of the appendix that $d \ge m$, and denoting $c_2 := 4c_1 + 2$.

Denote by $\tilde{H}_{0,\tau}(\delta)$ the event that
$$\begin{aligned}
f(x_\tau) - f(x_0) \le&\ -\sum_{t=0}^{\tau-1}\frac{3\eta}{4m}\sum_{i=1}^m\left(Z_{t,i}^\top\nabla f(x_t)\right)^2 + \frac{L\eta^2c_2d(\mathrm{lr}(C_1dm/\delta))^3}{m}\sum_{t=0}^{\tau-1}\|\nabla f(x_t)\|^2 + \frac{\eta u^4\rho^2}{4m}\sum_{t=0}^{\tau-1}\sum_{i=1}^m\|Z_{t,i}\|^6 \\
&+ \frac{L\eta^2u^4\rho^2}{2m}\sum_{t=0}^{\tau-1}\sum_{i=1}^m\|Z_{t,i}\|^8 - \eta\sum_{t=0}^{\tau-1}\langle\nabla f(x_t), Y_t\rangle + 2L\eta^2\sum_{t=0}^{\tau-1}\|Y_t\|^2
\end{aligned} \tag{17}$$
holds. Now, continuing from Eq. (13), using the bound in Eq. (16) and summing over the iterations from $t = 0$ to $\tau - 1$, we find using the union bound that
$$\mathbb{P}\left(\cap_{\tau'=1}^{\tau}\tilde{H}_{0,\tau'}(\delta)\right) \ge 1 - \tau\delta, \qquad \mathbb{P}(\tilde{H}_{0,\tau}(\delta)) \ge 1 - \tau\delta.$$
Now, by Lemma 6, for any δ ∈ (0, 1), α > 0, with probability at least 1 -δ, there exists an absolute constant c 3 > 0 such that -η τ -1 t=0 ∇f (x t ), Y t ≤ η 1 α τ -1 t=0 ∇f (x t ) 2 + c 3 αr 2 log(1/δ) . Meanwhile, since Y t ∼ N (0, (r 2 /d)I), Y t 2 is sub-exponential with sub-exponential norm cr 2 for some absolute constant c > 0, and by Bernstein's inequality (Lemma 8), there exists some absolute constant c 4 > 0 such that τ -1 t=0 Y t 2 ≤ c 4 r 2 (τ + log(1/δ)) with probability at least 1 -δ. To bound τ -1 t=0 1 m m i=1 Z t,i 6 and τ -1 t=0 1 m m i=1 Z t,i , both sums of heavy tailed Gaussian moments, we use Lemma 11, which states that for any k ∈ Z + and δ ∈ (0, 1), with probability at least 1 -δ, 1 m τ -1 t=0 m i=1 Z t,i 2k ≤ c 5 τ (c 6 ) k d k (1 + (log(1/δ)) k ) for some absolute constants c 5 , c 6 > 0. As in the statement of the proof, using χ := lr(C 1 dm/δ) to ease the notation, denote the event that f (x τ ) -f (x 0 ) ≤ - 3η 4 τ -1 t=0 1 m m i=1 Z t,i ∇f (x t ) 2 + η α + c 2 Lη 2 χ 3 d m τ -1 t=0 ∇f (x t ) 2 + τ ηu 4 ρ 2 2 • c 5 c 3 6 d 3 log 1 δ 3 + τ Lη 2 u 4 ρ 2 • c 5 c 4 6 d 4 log 1 δ 4 + η(c 3 αr 2 + 2c 4 ηLr 2 ) log 1 δ + 2c 4 Lη 2 τ r 2 holds as H 0,τ (δ). Plugging Eq. ( 18), Eq. ( 19), and Eq. ( 20) into Eq. ( 17), by union bound, we see that P(∩ τ τ =1 H 0,τ (δ)) ≥ 1 -(τ + 4τ )δ = 1 -5τ δ, P(H 0,τ ) ≥ 1 -(τ + 4)δ. The final result then follows by rescaling δ to δ T and denoting c 1 := max{c 2 , c 3 , 2c 4 , c 5 c 3 6 /2, c 5 c 4 6 }. Outline of proof approach. Similar to the first-order setting, our goal is to show that we can arrive at a contradiction f (x T ) < min x f (x) when there is a large number of steps at which ∇f (x t ) ≥ . Roughly speaking, as Eq. ( 3) shows, we need to prove a lower bound of the form T -1 t=0 1 m m i=1 Z t,i ∇f (x t ) 2 ≥ Ω 1 α + c 1 Lηχ 3 d m T -1 t=0 ∇f (x t ) 2 (21) for some α which is not too large (an example would be picking α such that it only scales logarithmically in the problem parameters). 
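The sub-exponential bound on Σ_t ‖Y_t‖² used above admits a quick sanity check: since Y_t ~ N(0, (r²/d) I), each ‖Y_t‖² has mean r², so the sum over τ steps concentrates around τr², consistent with the c₄r²(τ + log(1/δ)) bound from Bernstein's inequality. The sketch below (hypothetical helper name, arbitrary parameters) simulates this.

```python
import numpy as np

def perturbation_energy(tau, d, r, rng):
    """Sum of ||Y_t||^2 over tau i.i.d. draws Y_t ~ N(0, (r^2/d) I)."""
    Y = rng.standard_normal((tau, d)) * (r / np.sqrt(d))
    return float(np.sum(Y ** 2))
```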
However, it is tricky to prove such a lower-bound in the zeroth-order setting. In particular, for small batch-sizes m, 1 m m i=1 Z t,i ∇f (x t ) 2 could be small even as ∇f (x t ) 2 is large; this is because for each i ∈ [m], Z t,i could have a negligible component in the ∇f (x t ) direction. This necessitates a more careful analysis to prove a bound similar to Eq. ( 21). We do so using the following approach. 1. Intuitively, whilst for each individual iteration t, 1 m m i=1 Z t,i ∇f (x t ) 2 could be small even as ∇f (x t ) 2 is large, in a small number of (consecutive) iterations {t 0 , . . . , t 0 + t f }, with high probability, there will be at least one iteration t within {t 0 , . . . , t 0 + t f -1}, such that 1 m m i=1 Z t,i ∇f (x t ) 2 = Ω( ∇f (x t ) 2 ). We formalize this intuition in Lemma 14. Thus, we consider breaking the time-steps into chunks where each chunk has t f consecutive iterations. 2. Consider any such interval {t 0 , . . . , t 0 + t f -1}. There are two cases to consider. (a) The first case is when the gradient throughout all t f iterations is large enough to dominate the perturbation terms. Intuitively, in this case, it is not hard to see that given appropriate parameter choices, the gradient will change little throughout the t f iterations. In fact, as we formalize in Lemma 16, for an appropriate choice of t f and η, we can show that 1 2 ∇f (x t0 ) ≤ ∇f (x t ) ≤ 2 ∇f (x t0 ) ∀t ∈ {t 0 , . . . , t 0 + t f -1}. As a result, combined with point 1, we see that t0+t f -1 t=t0 1 m m i=1 Z t,i ∇f (x t ) 2 ≥ Ω( ∇f (x t0 ) 2 ). Thus, by choosing α and η judiciously, for such intervals, it is possible to show that t0+t f -1 t=t0 1 m m i=1 Z t,i ∇f (x t ) 2 ≥ Ω( ∇f (x t0 ) 2 ) ≥Ω 1 α + c 1 Lηχ 3 d m t0+t f -1 t=t0 ∇f (x t ) 2 =Ω 1 α + c 1 Lηχ 3 d m Ω t f ∇f (x t0 ) 2 Thus, in these intervals, it is possible to obtain function improvement on the order of ηΩ( ∇f (x t0 ) 2 ). 
(b) The remaining case is when the gradient is small and dominated by the perturbation terms in any one of the t f iterations. In this case, as we show in Lemma 17, for each of the t f iterations, the gradient will be small and on the same scale as the perturbation terms. In turn, by choosing r, u and η appropriately, we can make the perturbation terms small. Thus, whilst these intervals may not contribute to function decrease, they also contribute little in the way of function increase. 3. When there are at least T /4 iterations with large gradient (i.e. ∇f (x t ) ≥ ), assuming t f divides T , it follows that there are at least T /(4t f ) intervals of length t f where one iteration in the interval contains a large gradient. By choosing u, r and η appropriately such that they are dominated by , it is possible to show that with high probability, such an interval cannot belong to the second case above, and must instead be from the first case. Since ∇f (x t ) ≈ ∇f (x t0 ) for each t ∈ {t 0 , . . . , t 0 + t f -1} in this case, and we know that one of the iterations has a gradient with size at least , it follows that we make function decrease progress of at least ηΩ( 2 ) for such intervals. By appropriately choosing η, u and r to limit the effects of the intervals of the second form, we can then show a contradiction of the form f (x T ) < f * . We demonstrate this formally in Proposition 4. We formalize our approach in the following series of results. First, for analytical convenience, we prove the following result showing that for any t, the perturbation terms Y t and 1 m m i=1 Z t,i 4 are bounded with high probability. Lemma 13. There exists an absolute constant c 3 > 0 such that, for any t ∈ N, the event G t (δ) := Y t 2 ≤ c 2 3 r 2 1 + log(T /δ) d and 1 m m i=1 Z t,i 4 ≤ 2c 3 d 2 log T δ 2 has probability at least 1 -2δ/T for any δ ∈ (0, 1/e]. Proof.
Noting that Y t ∼ N (0, (r 2 /d)I), by applying Bernstein's inequality (Lemma 8), it can be shown that with probability at least δ/T , Y t 2 ≤ c 2 3 r 2 1 + log(T /δ) d , where c 3 > 0 is some absolute constant. Then by using Lemma 11, applying the union bound, and redefining the constant c 3 , we complete the proof. Next, in Lemma 14, we show that in a small number of iterations, with high probability, there exists some iteration t such that 1 m m i=1 Z t,i ∇f (x t ) 2 ≥ 1 2 ∇f (x t ) 2 . Lemma 14. There exists an absolute constant c 2 ≥ 1 such that, upon defining t f (δ) = c 2 m log T δ , δ > 0, and defining the event B t0 (δ; k) := t0+k-1 t=t0 1 m m i=1 Z t,i ∇f (x t ) 2 ≥ 1 2 ∇f (x t ) 2 , we have P (B t0 (δ; k)) ≥ 1 - δ T . for any δ ∈ (0, 1), t 0 ∈ N and k ≥ t f (δ). Proof. Denote the event E t = 1 m m i=1 |Z t,i ∇f (x t )| 2 < 1 2 ∇f (x t ) 2 . Observe that, conditioned on F t-1 , the set of random variables ∇f (x t ) 2 -|Z t,i ∇f (x t )| 2 m i=1 are independent, mean-zero, and subexponential with subexponential norm ≤ c ∇f (x t ) 2 for some absolute constant c > 0. Hence P Ft-1 (E t ) = P Ft-1 1 m m i=1 |Z t,i ∇f (x t )| 2 < 1 2 ∇f (x t ) 2 = P Ft-1 m i=1 ∇f (x t ) 2 -Z t,i ∇f (x t ) 2 > m 2 ∇f (x t ) 2 ≤ exp (-c m) , where c is some positive absolute constant. Then, for any t 0 , k ∈ N, P 1 m m i=1 Z t,i ∇f (x t ) 2 < 1 2 ∇f (x t ) 2 for every t ∈ [t 0 , t 0 + k) = E t0+k-1 t=t0 1 Et = E t0+k-2 t=t0 1 Et • E F t 0 +k-2 1 E t 0 +k-1 ≤ exp(-c m) • E t0+k-2 t=t0 1 Et ≤ • • • ≤ exp(-c mk). Therefore, by letting c 2 = max{1, 1/c } and k ≥ t f (δ) = c 2 m log T δ , we get P 1 m m i=1 Z t,i ∇f (x t ) 2 < 1 2 ∇f (x t ) 2 for every t ∈ [t 0 , t 0 + k) ≤ δ T , which completes the proof. The term t f (δ) will frequently appear in the proofs to come; in the sequel we denote t f (δ) := c 2 m log T δ , δ ∈ (0, 1/e], where c 2 ≥ 1 is the absolute constant defined in Lemma 14. 
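The exponential decay exp(−c′m) established in the proof of Lemma 14 is easy to observe empirically: the failure event (1/m)Σ_i |Z_{t,i}^T ∇f(x_t)|² < ‖∇f(x_t)‖²/2 is roughly a coin flip for m = 1 but becomes rare quickly as m grows. The following illustrative sketch (hypothetical helper name, arbitrary dimension) estimates the failure probability by simulation.

```python
import numpy as np

def failure_probability(g, m, n_trials, rng):
    """Empirical probability that (1/m) sum_i (Z_i^T g)^2 < ||g||^2 / 2
    for i.i.d. standard Gaussian directions Z_i."""
    fails = 0
    for _ in range(n_trials):
        Z = rng.standard_normal((m, g.size))
        if np.mean((Z @ g) ** 2) < 0.5 * float(g @ g):
            fails += 1
    return fails / n_trials
```

Since (Z_i^T g)²/‖g‖² is a chi-square variable with mean 1, the failure event asks an average of m such variables to fall below 1/2, whose probability decays exponentially in m, matching the exp(−c′m) rate in the proof.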
We next show that with high probability, the norm difference term ∇f (x t+1 ) -∇f (x t ) can be bounded in terms of ∇f (x t ) and the perturbation terms u 2m m i=1 Z t,i Z t,i Ht,i Z t,i as well as Y t . Lemma 15. Define A t (δ) := ∇f (x t+1 )-∇f (x t ) ≤ ∇f (x t ) 8t f (δ) + ηL u 2m m i=1 Z t,i Z t,i Ht,i Z t,i + Y t (23) where t f (δ) is defined in Eq. ( 22), and let C 1 ≥ 1 be the corresponding absolute constants defined in Lemma 1. Then there exists an absolute constant c 4 > 0 such that, whenever η satisfies ηL c 4 (lr(C 1 dmT /δ)) 3/2 √ d √ m ≤ 1 8t f (δ) , we have P(A t (δ)) ≥ 1 - δ T for any δ ∈ (0, 1/e] and t ∈ Z + . Proof. Since ∇f is L-Lipschitz, following the zeroth-order update step, we see that ∇f (x t+1 ) -∇f (x t ) ≤ L x t+1 -x t (25) = ηL 1 m m i=1 Z t,i Z t,i ∇f (x t ) + u 2m m i=1 Z t,i Z t,i Ht,i Z t,i + Y t . (26) Now, it follows from Eq. ( 16) (with a slight modification in the absolute constant terms since here the norm is not squared) that there exists some absolute constant c 4 > 0 such that for any δ ∈ (0, 1/e], we have that with probability at least 1 -δ/T , the event 1 m m i=1 Z t,i Z t,i ∇f (x t ) ≤ c 4 (lr(C 1 dmT /δ)) 3/2 d m ∇f (x t ) , Hence, continuing from Eq. ( 26), it follows that with probability at least 1 -δ/T , ∇f (x t+1 ) -∇f (x t ) ≤ ηL c 4 (lr(C 1 dmT /δ)) 3/2 d m ∇f (x t ) + u 2m m i=1 Z t,i Z t,i Ht,i Z t,i + Y t , and by plugging in the condition Eq. ( 24), we see that the event A t (δ) = ∇f (x t+1 ) -∇f (x t ) ≤ ∇f (x t ) 8t f (δ) + ηL u 2m m i=1 Z t,i Z t,i Ht,i Z t,i + Y t has probability at least 1 -δ/T . We show now that if the norm of the gradient dominates the norm of the perturbation terms, and we choose the step-size η sufficiently small, then in a small number of iterations, the norm of the gradient does not change very much. For notational simplicity, we denote the event E(t 1 , t 2 , δ) := t1+t2-1 t=t1 ∇f (x t ) > 8t f (δ)ηL u 2 1 m m i=1 Z t,i Z t,i Ht,i Z t,i + Y t . Lemma 16. 
Let δ ∈ (0, 1/e] and T ∈ Z + be such that T > 2t f (δ) + 1. Consider any positive integer t f ≤ 2t f (δ), and any t 0 ∈ {0, . . . , T -1 -t f }. Suppose η satisfies the condition Eq. (24). Then, on the event E(t 0 , t f , δ) ∩ t0+t f -1 t=t0 A t (δ) , we have 1 2 ∇f (x 0 ) ≤ ∇f (x t ) ≤ 2 ∇f (x 0 ) for all t ∈ {t 0 , . . . , t 0 + t f -1}. Proof. By plugging ∇f (x t ) > 8t f (δ)ηL u 2 1 m m i=1 Z t,i Z t,i Ht,i Z t,i + Y t into the definition of A t (δ), we see that, on the event E(t 0 , t f , δ) ∩ t0+t f -1 t=t0 A t (δ) , we have ∇f (x t+1 ) -∇f (x t ) ≤ ∇f (x t ) 4t f (δ) , and consequently, 1 - 1 4t f (δ) ∇f (x t ) ≤ ∇f (x t+1 ) ≤ 1 + 1 4t f (δ) ∇f (x t ) , which leads to 1 - 1 4t f (δ) t-t0 ∇f (x 0 ) ≤ ∇f (x t ) ≤ 1 + 1 4t f (δ) t-t0 ∇f (x 0 ) for all t ∈ {t 0 , . . . , t 0 + t f }. Then, since (1 + 1/(4x)) 2x ≤ 2 and (1 -1/(4x)) 2x ≥ 1/2 for any x ≥ 1, noting that t f ≤ 2t f (δ), we get the desired result. Conversely, in the following result, we show that in a small number of consecutive iterations, if the gradient is smaller than the perturbation terms in any one of the iterations, then for each of the iterations in this range, the gradient will be small and on the same scale as the perturbation terms. Lemma 17. Let δ ∈ (0, 1/e] and T ∈ Z + be such that T > 2t f (δ) + 1. Consider any positive integer t f ≤ 2t f (δ), and any t 0 ∈ {0, . . . , T -1 -t f }. Suppose η satisfies the condition Eq. (24). Then, on the event E c (t 0 , t f , δ) ∩ t0+t f -1 t=t0 A t (δ) ∩ t0+t f -1 t=t0 G t (δ) , we have ∇f (x t ) ≤ c 5 t f (δ)ηL u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r ∀t ∈ {t 0 , t 0 +1, . . . , t 0 +t f -1}, where c 5 is some absolute constant. Proof. Let t be the first iteration in {t 0 , t 0 + 1, . . . , t 0 + t f -1} such that ∇f (x t ) ≤ 8t f (δ)ηL u 2 1 m m i=1 Z t ,i Z t ,i Ht ,i Z t ,i + Y t . ( ) Since we are working on an event which is a subset of E c (t 0 , t f , δ), t is well-defined.
By Ht ,i ≤ ρu Z t ,i , we see that ∇f (x t ) ≤ 8t f (δ)ηL u 2 ρ 2m m i=1 Z t ,i 4 + Y t ≤ 8t f (δ)ηL c 3 u 2 d 2 ρ log T δ 2 + c 3 1 + log(T /δ) d r , where we used the definition of G t (δ). Recall that t is the first time step such that Eq. ( 27) holds. By deriving similarly as in the proof of Lemma 16, we can show that for any j ∈ {t 0 , t 0 + 1, . . . , t -1}, ∇f (x j ) ≤ 2 ∇f (x t ) ≤ 16t f (δ)ηLc 3 u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r . Meanwhile, for iterations t ∈ [t , t 0 + t f ), by using the definitions of A t (δ) and G t (δ), we have ∇f (x t+1 ) ≤ 1 + 1 8t f (δ) ∇f (x t ) + ηLc 3 u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r = 1 + 1 8t f (δ) t+1-t ∇f (x t ) + t-t i=0 1 + 1 8t f (δ) t-t -i ηLc 3 u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r ≤ 1 + 1 8t f (δ) t f ∇f (x t ) + 8t f (δ) 1 + 1 8t f (δ) t f -1 ηLc 3 u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r ≤ e 1/4 • 8t f (δ)ηLc 3 u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r + 8t f (δ)(e 1/4 -1) • ηLc 3 u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r ≤ 16t f (δ)ηLc 3 u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r , where we used t f ≤ 2t f (δ) and the fact that (1 -1/(8x)) 2x ≤ e 1/4 for all x > 0. By defining c 5 := 16c 3 , we complete the proof. We next derive a useful result showing that the function change f (x τ ) -f (x 0 ) can be decomposed into one component arising from intervals when the gradient dominates noise (which improves function value) and another component arising from intervals with small gradient which may add to function value but whose contributions are bounded in terms of η, u and r. For now, we focus on the case τ ≥ t f (δ), since it will be useful to us in proving that there cannot be more than T /4 iterations with large gradient. Lemma 18 (Function change for large τ ). Let c 1 > 0, c 4 > 0, c 5 > 0, C 1 ≥ 1 be the absolute constants defined in the statements of the previous lemmas. Let δ ∈ (0, 1/e], and let τ ≥ t f (δ)) be arbitrary. Consider splitting {0, 1 . . . 
, τ -1} into K := τ /t f (δ) intervals: J k = {kt f (δ), . . . , (k + 1)t f (δ) -1}, 0 ≤ k < K -1, J K-1 = {(K -1)t f (δ), . . . , τ -1}. Let I 1 denote the set of indices k such that for every time-step t in the interval J k , the gradient dominates the noise terms as ∇f (x t ) > 8t f (δ)ηL u 2 1 m m i=1 Z t,i Z t,i Ht,i Z t,i + Y t . ( ) Suppose we choose η such that η ≤ 1 Lt f (δ) • min √ m 8c 4 (lr(C 1 dmT /δ)) 3/2 √ d , m 128c 1 (lr(C 1 dmT /δ)) 3 d . ( ) Then, on the event E τ (δ) := H τ (δ)∩ τ -1 t=0 A t (δ) ∩ τ -1 t=0 G t (δ) ∩ K-2 k=0 B kt f (δ) (δ; t f (δ)) ∩B (K-1)t f (δ) (δ; τ-(K-1)t f (δ)), we have the following upper bound on function value change: f (x τ ) -f (x 0 ) ≤ - k∈I1 η 2 min t∈J k ∇f (x t ) 2 + τ c 2 5 64 η 3 t f (δ) 2 L 2 u 2 d 2 ρ log T δ 2 + 2 log(T /δ)r 2 + τ ηu 4 ρ 2 • c 1 d 3 log T δ 3 + τ Lη 2 u 4 ρ 2 • c 1 d 4 log T δ 4 + ηc 1 r 2 (128t f (δ) + ηL) log T δ + τ c 1 Lη 2 r 2 . ( ) Moreover, P(E τ (δ)) ≥ 1 -(5τ +4)δ T . Proof. Without loss of generality, we may assume that τ is a multiple of t f (δ).foot_3 Then, any interval J k = {t 0 , . . . , t 0 + t f (δ) -1} belongs to one of the following two cases: Case 1) (Gradient dominates noise): Recall that this means that for every t ∈ J k , we have ∇f (x t ) > 8t f (δ)ηL u 2 1 m m i=1 Z t,i Z t,i Ht,i Z t,i + Y t . By our choice of η in Eq. ( 29), we can apply Lemma 16 to get min t∈J k ∇f (x t ) ≥ 1 4 max t∈J k ∇f (x t ) . Note also that on the event B kt f (δ) (δ; t f (δ)), there exists some t ∈ J k such that 1 m m i=1 Z t,i ∇f (x t ) 2 ≥ 1 2 ∇f (x t ) 2 . This implies then that 1 4 t∈J k 1 m m i=1 Z t,i ∇f (x t ) 2 ≥ 1 4 min t∈J k ∇f (x t ) 2 ≥ 1 64 max t∈J k ∇f (x t ) 2 ≥ 1 64t f (δ) t∈J k ∇f (x t ) 2 . ( ) Thus by setting α = 128t f (δ) in Eq.
( 3) and by choosing η such that c 1 Lη 2 χ 3 d m ≤ η α = η 128t f (δ) ⇐⇒ η ≤ m 128c 1 Lt f (δ)dχ 3 , it follows that - 3η 4 t∈J k 1 m m i=1 Z t,i ∇f (x t ) 2 + η 128t f (δ) + c 1 Lη 2 χ 3 d m t∈J k ∇f (x t ) 2 = - 3η 4 t∈J k 1 m m i=1 Z t,i ∇f (x t ) 2 + η 64t f (δ) t∈J k ∇f (x t ) 2 ≤ - η 2 t∈J k 1 m m i=1 Z t,i ∇f (x t ) 2 ≤ - η 2 min t∈J k ∇f (x t ) 2 (32) Case 2) (Gradient does not dominate noise): there exists some t ∈ J k such that ∇f (x t ) ≤ 8t f (δ)ηL u 2 1 m m i=1 Z t,i Z t,i Ht,i Z t,i + Y t . By our choice of η in Eq. ( 29), we can apply Lemma 17 to get ∇f (x t ) ≤ c 5 t f (δ)ηL u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r ∀t ∈ J k . Hence, by setting α = 128t f (δ) in Eq. ( 3) and choosing η such that c 1 Lη 2 χ 3 d m ≤ η α = η 128t f (δ) , it follows that η 128t f (δ) + c 1 Lη 2 χ 3 d m t∈J k ∇f (x t ) 2 ≤ η 64t f (δ) t∈J k c 5 t f (δ)ηL u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r 2 ≤ c 2 5 64 t f (δ) 2 η 3 L 2 u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r 2 (33) Having studied the two cases, we may now proceed to use them to complete the proof. Let I c 1 denote the complement of I 1 in {0, 1, . . . , K -1}. Then, - 3η 4 τ -1 t=0 1 m m i=1 Z t,i ∇f (x t ) 2 + η α + c 1 Lη 2 χ 3 d m τ -1 t=0 ∇f (x t ) 2 = k∈I1 - 3η 4 t=∈J k 1 m m i=1 Z t,i ∇f (x t ) 2 + η 128t f (δ) + c 1 Lη 2 χ 3 d m t∈J k ∇f (x t ) 2 + k∈I c 1 - 3η 4 t∈J k 1 m m i=1 Z t,i ∇f (x t ) 2 + η 128t f (δ) + c 1 Lη 2 χ 3 d m t∈J k ∇f (x t ) 2 ≤ - k∈I1 η 2 min t∈J k ∇f (x t ) 2 + k∈I c 1 t f (δ)   c 2 5 64 t f (δ) 2 η 3 L 2 u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r 2   ≤ - k∈I1 η 2 min t∈J k ∇f (x t ) 2 + τ c 2 5 64 t f (δ) 2 η 3 L 2 u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r 2 . ( ) and so by Eq.
(3), f (x τ ) -f (x 0 ) ≤ - k∈I1 η 2 min t∈J k ∇f (x t ) 2 + τ c 2 5 64 t f (δ) 2 η 3 L 2 u 2 d 2 ρ log T δ 2 + 1+ log(T /δ) d r 2 + τ ηu 4 ρ 2 • c 1 d 3 log T δ 3 + τ Lη 2 u 4 ρ 2 • c 1 d 4 log T δ 4 + ηc 1 r 2 (α + ηL) log T δ + τ c 1 Lη 2 r 2 . Note that we choose α = 128t f (δ). In addition, observe that by our choice of δ (such that δ ≤ 1 e ), it follows that 1 + log(T /δ) d ≤ 2 log(T /δ). We can now complete our proof by using the union bound (suppressing the dependence of some of the events on δ for notational simplicity) to derive P(E c τ ) ≤ P(H c τ ) + τ -1 t=0 P(A c t ) + τ -1 t=0 P(G c t ) + K-1 k=0 P(B c kt f (δ) (δ; t f (δ))) ≤ (τ + 4)δ T + τ T δ + 2 τ T δ + Kδ T ≤ (5τ + 4) T δ. We are now ready to show that if sufficiently many iterations have a large gradient, then with high probability, the function value of the last iterate f (x T ), will be less than min x f (x), a contradiction. Hence this limits the number of iterations that can have a large gradient. Proposition 4. Let c 1 > 0, c 2 ≥ 1, c 4 > 0, c 5 > 0, C 1 ≥ 1 be the absolute constants defined in the statements of the previous lemmas, and let δ ∈ (0, 1/e] be arbitrary. Suppose we choose u, r, η and T such that u ≤ √ d √ ρ log(T /δ) • min 1 64c 2 5 c 2 , 1 2048c 1 c 2 1/4 , r ≤ • min 1 8c 5 √ 2c 2 , 1 32 √ c 1 , η ≤ 1 Lt f (δ) min 1 log(T /δ) , √ m 8c 4 (lr(C 1 dmT /δ)) 3/2 √ d , m 128c 1 (lr(C 1 dmT /δ)) 3 d , T ≥ max 256t f (δ) (f (x 0 ) -f * ) + 2 /L) η 2 , 4 . Then, with probability at least 1 -6δ, there are at most T /4 iterations for which ∇f (x t ) ≥ . Proof. Without loss of generality, we assume that T is a multiple of t f (δ), and we similarly split {0, 1, . . . , T } into K = T /t f (δ) intervals J 0 , . . . , J K-1 . Let I 1 denote the set of indices k such that for every t ∈ J k , ∇f (x t ) > 8t f (δ)ηL u 2 1 m m i=1 Z t,i Z t,i Ht,i Z t,i + Y t . ( ) We let I c 1 denote the complement of I 1 in {0, 1, . . . , K -1}. 
We denote E T (δ) := H T (δ) ∩ T -1 t=0 A t (δ) ∩ T -1 t=0 G t (δ) ∩ K-1 k=0 B kt f (δ) (δ; t f (δ)) . In the remaining part of the proof, unless otherwise stated, we shall always assume that we are working on the event E T (δ). By Lemma 18 with τ = T and our choices of η and δ in the statement of the lemma, we have f (x T ) -f (x 0 ) ≤ - k∈I1 η 2 min t∈J k ∇f (x t ) 2 + T c 2 5 64 t f (δ) 2 η 3 L 2 u 2 d 2 ρ log T δ 2 + 2 log(T /δ)r 2 + T ηu 4 ρ 2 • c 1 d 3 log T δ 3 + T Lη 2 u 4 ρ 2 • c 1 d 4 log T δ 4 + ηc 1 r 2 (128t f (δ) + ηL) log T δ + T c 1 Lη 2 r 2 . ( ) Suppose that there are at least T /4 iterations where ∇f (x t ) ≥ . Let I denote the set of indices k for which there exists some t ∈ J k with ∇f (x t ) ≥ . Then, by the pigeonhole principle, the set I has at least T /(4t f (δ)) members. Note that, by our choices of the parameters η, u, r, it can be shown that c 5 t f (δ)ηL u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r < , while by Lemma 17, if k is in I c 1 , we have ∇f (x t ) ≤ c 5 t f (δ)ηL u 2 d 2 ρ log(T /δ) + 1 + log(T /δ) d r , ∀t ∈ J k . This implies that I ⊆ I 1 . Observe that by Lemma 16, for any k ∈ I 1 , we have 1 2 ∇f (x kt f (δ) ) ≤ ∇f (x t ) ≤ 2 ∇f (x kt f (δ) ) , ∀t ∈ J k . This implies in particular that for any k ∈ I , we have min t∈J k ∇f (x t ) 2 ≥ 1 16 2 , and consequently - k∈I1 η 2 min t∈J k ∇f (x t ) 2 ≤ - k∈I η 2 • 2 16 ≤ - T 4t f (δ) • η 2 • 2 16 = - T η 2 128t f (δ) . Hence, by Eq. ( 36), f (x T ) -f (x 0 ) ≤ - T η 2 128t f (δ) + T c 2 5 64 t f (δ) 2 η 3 L 2 u 2 d 2 ρ log T δ 2 + 2 log(T /δ)r 2 + T ηu 4 ρ 2 • c 1 d 3 log T δ 3 + T η • (ηL)u 4 ρ 2 • c 1 d 4 log T δ 4 + ηc 1 r 2 (128t f (δ) + ηL) log T δ + T η • c 1 ηLr 2 . ( ) Now, by our choices of u, r and η, we have T c 2 5 64 t f (δ) 2 η 3 L 2 u 2 d 2 ρ log T δ 2 + 2 log(T /δ)r 2 ≤ T η • c 2 5 32 t f (δ) 2 (ηL) 2 u 4 d 4 ρ 2 log T δ 4 + 2 log(T /δ)r 2 ≤ T η • 2 2048c 2 log T δ 2 + 2 2048c 2 log(T /δ) ≤ T η 2 512t f (δ) , where we used log(T /δ) ≥ 1 and 2c 2 log(T /δ) ≥ t f (δ). 
We also have T ηu 4 ρ 2 • c 1 d 3 log T δ 3 + T η • (ηL)u 4 ρ 2 • c 1 d 4 log T δ 4 + T c 1 Lη 2 r 2 ≤ T η • 2 2048c 2 d log(T /δ) + T η • 2 2048c 2 t f (δ) log(T /δ) + T η • 2 1024t f (δ) log(T /δ) ≤ T η 2 512t f (δ) , where we used c 2 d log(T /δ) ≥ t f (δ), c 2 ≥ 1 and log(T /δ) ≥ 1. Finally, ηc 1 r 2 (128t f (δ) + ηL) log T δ ≤ (128t f (δ) + 1) 2 1024Lt f (δ) < 2 L . By plugging these bounds into Eq. ( 38), we get f (x T ) -f (x 0 ) < - T η 2 128t f (δ) + T η 2 512t f (δ) + T η 2 512t f (δ) + 2 L ≤ - T η 2 256t f (δ) + 2 L . Therefore, as long as T ≥ 256t f (δ) (f (x 0 ) -f * ) + 2 /L) η 2 , we will get f (x T ) < f * , which is a contradiction. Thus, we can conclude that on the event E T (δ), there are at most T /4 iterations for which ∇f (x t ) ≥ . We can now complete our proof by using the union bound (suppressing the dependence of some of the events on δ for notational simplicity) to derive P(E c T ) ≤ P(H c T ) + T -1 t=0 P(A c t ) + T -1 t=0 P(G c t ) + K-1 k=0 P(B c kt f (δ) (δ; t f (δ))) ≤ (T + 4)δ T + δ + 2δ + Kδ T ≤ 6δ.
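To see the overall mechanism of Proposition 4 in action, the following toy experiment (an illustrative sketch with arbitrarily chosen parameters, not the tuned values from the proposition) runs the perturbed two-point method on f(x, y) = x² − y² + y⁴, which has a strict saddle at the origin and minima with value −1/4. Started exactly at the saddle, where the gradient gives no signal, the isotropic perturbation Y_t lets the iterates drift into the negative-curvature direction, after which the (always non-negative) component of the estimator along the gradient drives the function value down.

```python
import numpy as np

def run_perturbed_zo(f, x0, eta, m, u, r, steps, rng):
    """Perturbed two-point zeroth-order descent (illustrative parameters,
    hypothetical helper; 2m function evaluations per iteration)."""
    x = np.array(x0, dtype=float)
    d = x.size
    for _ in range(steps):
        g = np.zeros(d)
        for _ in range(m):
            z = rng.standard_normal(d)
            g += (f(x + u * z) - f(x - u * z)) / (2.0 * u) * z
        g /= m
        y_t = rng.standard_normal(d) * (r / np.sqrt(d))
        x -= eta * (g + y_t)
    return x

# Strict saddle at the origin; global minima at (0, +/- 1/sqrt(2)) with value -1/4.
saddle = lambda v: v[0] ** 2 - v[1] ** 2 + v[1] ** 4
```

Without the perturbation (r = 0), the iterates started at the origin never move, since every symmetric difference evaluates to zero there; this is exactly the degenerate behavior the isotropic noise is designed to rule out.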

E ESCAPING SADDLE POINT

In this section, we first show that the travelling distance of the iterates can be bounded in terms of the function value improvement (Appendix E.2). Utilizing this result, as well as Proposition 2 in Appendix C.3, which provides a concentration bound on the zeroth-order noise, we then prove in Appendix E.3 that sufficient function value decrease can be made near a saddle point.

E.1 KEY QUANTITIES AND NOTATION

We will use γ to denote -λ min (∇ 2 f (x 0 )), where we know that γ ≥ √ ρ .

E.2 IMPROVE OR LOCALIZE

In this subsection, we aim to bound the movement of the iterates across a number of steps in terms of the function value improvement made during these number of steps. We first state a simple result separating the norm of the difference between x t0+τ and x t0 into a few different terms. Lemma 19. Consider the perturbed zeroth-order update Algorithm 1. Then, for any t 0 ∈ N and τ ∈ N, x t0+τ -x t0 2 ≤ V 1 (t 0 , τ ) + V 2 (t 0 , τ ) + V 3 (t 0 , τ ) + V 4 (t 0 , τ ), ( ) where V 1 (t 0 , τ ) := 8η 2 τ t0+τ -1 t=t0 ∇f (x t ) 2 , V 2 (t 0 , τ ) := 8η 2 t0+τ -1 t=t0 1 m m i=1 (Z t,i Z t,i -I)∇f (x t ) 2 V 3 (t 0 , τ ) := 4η 2 t0+τ -1 t=t0 Y t 2 , V 4 (t 0 , τ ) := 4η 2 t0+τ -1 t=t0 1 m m i=1 uZ t,i Z t,i Ht,i Z t,i 2 . ( ) Proof. For notational convenience, let t 0 := 0. Then, applying the form of the perturbed zeroth-order update in Algorithm 1, we get x τ -x 0 2 = τ -1 t=0 x t+1 -x t 2 = η 2 τ -1 t=0 1 m m i=1 Z t,i Z t,i ∇f (x t ) + 1 m m i=1 uZ t,i Z t,i Ht,i Z t,i + Y t 2 ≤ 4η 2 τ -1 t=0 1 m m i=1 Z t,i Z t,i ∇f (x t ) 2 + 4η 2 τ -1 t=0 1 m m i=1 uZ t,i Z t,i Ht,i Z t,i 2 + 4η 2 τ -1 t=0 Y t 2 ≤ 4η 2 τ -1 t=0 1 m m i=1 (Z t,i Z t,i -I)∇f (x t ) + τ -1 t=0 ∇f (x t ) 2 + 4η 2 τ -1 t=0 1 m m i=1 uZ t,i Z t,i Ht,i Z t,i 2 + 4η 2 τ -1 t=0 Y t 2 ≤ 8η 2 τ τ -1 t=0 ∇f (x t ) 2 V1(0,τ ) + 8η 2 τ -1 t=0 1 m m i=1 (Z t,i Z t,i -I)∇f (x t ) 2 V2(0,τ ) + 4η 2 τ -1 t=0 Y t 2 V3(0,τ ) + 4η 2 τ -1 t=0 1 m m i=1 uZ t,i Z t,i Ht,i Z t,i V4(0,τ ) . We now proceed to bound the terms V 1 (t 0 , τ ), V 2 (t 0 , τ ), V 3 (t 0 , τ ) and V 4 (t 0 , τ ). First, we have the following result bounding V 1 (t 0 , τ ). Lemma 20. Let c 1 > 0, c 2 ≥ 1, c 4 > 0, c 5 > 0, C 1 ≥ 1 be the absolute constants defined in the statements of the previous lemmas, and let δ ∈ (0, 1/e] be arbitrary. Suppose we choose η such that η ≤ 1 Lt f (δ) • min √ m 8c 4 (lr(C 1 dmT /δ)) 3/2 √ d , m 128c 1 (lr(C 1 dmT /δ)) 3 d . There are two cases to consider. 1. The first is when τ ≥ t f (δ). 
In this case, split {t 0 , t 0 +1, . . . , t 0 +τ -1} into K := τ /t f (δ) intervals: J k = {t 0 + kt f (δ), . . . , t 0 + (k + 1)t f (δ) -1}, 0 ≤ k < K -1, J K-1 = {t 0 + (K -1)t f (δ), . . . , t 0 + τ -1}. Then, on the event Et 0 ,τ (δ) := Ht 0 ,τ (δ)∩ t 0 +τ -1 t=t 0 At(δ) ∩ t 0 +τ -1 t=t 0 Gt(δ) ∩ K-2 k=0 B t 0 +kt f (δ) (δ; t f (δ)) ∩B t 0 +(K-1)t f (δ) (δ; τ-(K-1)t f (δ)), we have that V 1 (t 0 , τ ) = 8η 2 τ t0+τ -1 t=t0 ∇f (x t ) 2 ≤ 64ητ t f (δ) ((f (x 0 ) -f (x τ )) + N u,r (τ ; δ)) , where N u,r (τ ; δ) := τ c 2 5 64 η 3 t f (δ) 2 L 2 u 2 d 2 ρ log T δ 2 + 2 log(T /δ)r 2 + τ ηu 4 ρ 2 • c 1 d 3 log T δ 3 + τ Lη 2 u 4 ρ 2 • c 1 d 4 log T δ 4 + ηc 1 r 2 (128t f (δ) + ηL) log T δ + τ c 1 Lη 2 r 2 + c 2 5 t 3 f (δ)η 3 L 2 u 2 d 2 ρ log(T /δ) + 2 log(T /δ)r 2 . ( ) 2. The second is when τ < t f (δ). Suppose we choose u and r such that u ≤ √ d √ ρ log(T /δ) • min 1 64c 2 5 c 2 , 1 2048c 1 c 2 1/4 , r ≤ • min 1 8c 5 √ 2c 2 , 1 32 √ c 1 . Suppose the event ∩ t0+τ -1 t=t0 (A t (δ) ∩ G t (δ) holds. Suppose also that ∇f (x t0 ) ≤ . Then, V 1 (t 0 , τ ) ≤ 32η 2 τ 2 2 ≤ 32η 2 (t f (δ)) 2 2 Proof. 1. We first consider the case where τ ≥ t f (δ). Let I 1 denote the set of indices k such that for every time-step t in the interval J k , the gradient dominates the noise terms as ∇f (x t ) > 8t f (δ)ηL u 2 1 m m i=1 Z t,i Z t,i Ht,i Z t,i + Y t . WLOG, we may assume that t 0 := 0, and denote V 1 (τ ) := V 1 (0, τ ). WLOG, we also assume that τ is a multiple of t f (δ). From Lemma 18, on the event that E τ (δ) holds and by our choice of η, we have f (x τ ) -f (x 0 ) ≤ - k∈I1 η 2 min t∈J k ∇f (x t ) 2 + τ c 2 5 64 η 3 t f (δ) 2 L 2 u 2 d 2 ρ log T δ 2 + 2 log(T /δ)r 2 + τ ηu 4 ρ 2 • c 1 d 3 log T δ 3 + τ Lη 2 u 4 ρ 2 • c 1 d 4 log T δ 4 + ηc 1 r 2 (128t f (δ) + ηL) log T δ + τ c 1 Lη 2 r 2 . By Lemma 16 (and our choice of η), it follows that for any k ∈ I 1 , on the event ∩ t∈J k A t (δ), we have t∈J k ∇f (x t ) 2 ≤ 4t f min t∈J k ∇f (x t ) 2 . 
Thus, on the event that E τ (δ) holds, for our choice of η, we have η k∈I1 t∈J k ∇f (x t ) 2 ≤ 4t f (δ)η k∈I1 min t∈J k ∇f (x t ) 2 ≤ 8t f (δ) k∈I1 η 2 min t∈J k ∇f (x t ) 2 ≤ 8t f (δ)   (f (x 0 ) -f (x τ )) + τ c 2 5 64 η 3 t f (δ) 2 L 2 u 2 d 2 ρ log T δ 2 + 2 log(T /δ)r 2   + 8t f (δ) τ ηu 4 ρ 2 • c 1 d 3 log T δ 3 + τ Lη 2 u 4 ρ 2 • c 1 d 4 log T δ 4 + 8t f (δ) ηc 1 r 2 (128t f (δ) + ηL) log T δ + τ c 1 Lη 2 r 2 . Similarly, for any k ∈ I c 1 (where I c 1 denotes the complement of I 1 in {0, 1, . . . , K -1}, i.e. intervals where the gradient is smaller than than the perturbation terms in some iteration), on the event (∩ t∈J k A t (δ)) ∩ (∩ t∈J k G t (δ)), by Lemma 17 (and our choice of η), we have ∇f (x t ) ≤ c 5 t f (δ)ηL u 2 d 2 ρ log(T /δ) + 2 log(T /δ)r , ∀t ∈ J k . On the event that E τ (δ) holds, this gives us then η k∈I c 1 t∈J k ∇f (x t ) 2 ≤ ητ c 2 5 t 2 f (δ)η 2 L 2 u 2 d 2 ρ log(T /δ) + 2 log(T /δ)r 2 . Hence, on the event that E τ (δ) holds, we have that η τ -1 t=0 ∇f (x t ) 2 = η k∈I1 t∈J k ∇f (x t ) 2 + η k∈I c 1 t∈J k ∇f (x t ) 2 ≤ 8t f (δ)   (f (x 0 ) -f (x τ )) + τ c 2 5 64 η 3 t f (δ) 2 L 2 u 2 d 2 ρ log T δ 2 + 2 log(T /δ)r 2   + 8t f (δ) τ ηu 4 ρ 2 • c 1 d 3 log T δ 3 + τ Lη 2 u 4 ρ 2 • c 1 d 4 log T δ 4 + 8t f (δ) ηc 1 r 2 (128t f (δ) + ηL) log T δ + τ c 1 Lη 2 r 2 + 8t f (δ)ητ c 2 5 t 2 f (δ)η 2 L 2 u 2 d 2 ρ log(T /δ) + 2 log(T /δ)r 2 . This yields the final result for the case τ ≥ t f (δ). 2. We next consider the case where 1 ≤ τ < t f (δ). Recall the notation that E(t 0 , t 0 + τ, δ) := ∩ t0+τ -1 t=t0 ∇f (x t ) > 8t f (δ)ηL u 2 1 m m i=1 Z t,i Z t,i Ht,i Z t,i + Y t There are two cases to consider. (a) On the event E(t 0 , t 0 + τ, δ) ∩ ∩ t0+τ -1 t=t0 A t (δ) , we have by Lemma 16 that ∇f (x t ) ≤ 2 ∇f (x 0 ) for each t ∈ {0, 1, . . . , τ -1}. Then, V 1 (t 0 , τ ) = 8η 2 τ t0+τ -1 t=t0 ∇f (x t ) 2 ≤ 8η 2 τ 2 4 ∇f (x 0 ) 2 ≤ 32η 2 τ 2 2 , where the final inequality uses the assumption that ∇f (x 0 ) ≤ . 
(b) Suppose the event E c (t 0 , t 0 + τ, δ) ∩ ∩ t0+τ -1 t=t0 A t (δ) ∩ ∩ t0+τ -1 t=t0 G t (δ) holds. In this case, by Lemma 17, we have that for each t ∈ {t 0 , t 0 + 1, . . . , t 0 + τ -1} ∇f (x t ) ≤c 5 t f (δ)ηL u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r ≤ , where the final inequality follows by our choice of η, u and r (cf. Eq. ( 37)). Hence, V 1 (t 0 , τ ) = 8η 2 τ t0+τ -1 t=t0 ∇f (x t ) 2 ≤ 8η 2 τ 2 c 5 t f (δ)ηL u 2 d 2 ρ log T δ 2 + 1 + log(T /δ) d r 2 ≤ 8η 2 τ 2 2 < 32η 2 τ 2 2 . The final result for the case τ < t f (δ) then follows. We proceed to bound V 2 (t 0 , τ ). Lemma 21. Let c 1 > 0, c 2 ≥ 1, c 4 > 0, c 5 > 0, C 1 ≥ 1 be the absolute constants defined in the statements of the previous lemmas, and let δ ∈ (0, 1/e] be arbitrary and τ > 0 be arbitrary. Suppose we choose η such that η ≤ 1 Lt f (δ) • min √ m 8c 4 (lr(C 1 dmT /δ)) 3/2 √ d , m 128c 1 (lr(C 1 dmT /δ)) 3 d . Let T s denote an integer such that T s ≥ max {τ, t f (δ)}, and for any F > 0, define B(δ; F ) := 8t f (δ)(F + N u,r (T s , δ)) η T s + d m (lr(CT 2 /δ)) 2 , b τ (δ; F ) := t f (δ)τ F η . Let c , C > 0 denote the same constants as in the statement of Proposition 2. Denote the event that either t0+τ -1 t=t0 d m (lr(CT 2 /δ)) 2 ∇f (x t ) 2 ≥ B(δ; F ) or V 2 (t 0 , τ ) 8η 2 ≤ c max t0+τ -1 t=t0 d m (lr(CT 2 /δ)) 2 ∇f (x t ) 2 , b τ (δ; F ) log CT 2 δ +log log B(δ; F ) b τ (δ; F ) + 1 holds as L t0,τ (δ; F )foot_5 . We show that P(L t0,τ (δ; F )) ≥ 1 -δ T . Finally, denote the event M t0,Ts (F ) as the event that f (x t0 ) -f (x t0+Ts ) < F . Then, on the event L t0,τ (δ) ∩ E t0,Ts (δ) ∩ M t0,Ts (F ) (where E 0,Ts (δ) is as defined in Lemma 20), V 2 (t 0 , τ ) ≤ 8c 2 β 1 (δ; F )ηt f (δ) max 8d m (lr(CT 2 /δ)) 2 (F + N u,r (T s , δ)) , τ F , where β 1 (δ; F ) := log CT 2 δ + log log B(δ; F ) b 1 (δ; F ) + 1 . Proof. We note that P(L t0,τ (δ; F )) ≥ 1 -δ T . is a direct consequence of Proposition 2. 
In the rest of the proof, without loss of generality, we assume that t 0 = 0 for notational simplicity. On the event L 0,τ (δ; F ) ∩ E 0,Ts (δ) ∩ M t0,Ts (F ), suppose that τ -1 t=0 d m (lr(CT 2 /δ)) 2 ∇f (x t ) 2 ≥ B(δ; F ) = 8t f (δ)(F + N u,r (T s , δ)) η T s + d m (lr(CT 2 /δ)) 2 =⇒ η τ -1 t=0 ∇f (x t ) 2 ≥ 8t f (δ)(F + N u,r (T s , δ)) =⇒ η Ts-1 t=0 ∇f (x t ) 2 ≥ 8t f (δ)(F + N u,r (T s , δ)) =⇒ 8η 2 T s Ts-1 t=0 ∇f (x t ) 2 ≥ 64ηT s t f (δ)(F + N u,r (T s , δ)) =⇒ 8η 2 T s Ts-1 t=0 ∇f (x t ) 2 ≥ 64ηT s t f (δ)(f (x 0 ) -f (x Ts ) + N u,r (T s , δ)), since f (x 0 ) -f (x Ts ) ≤ F ⇐⇒ V 1 (0, T s ) ≥ 64ηT s t f (δ)(f (x 0 ) -f (x Ts ) + N u,r (T s , δ)), where we note the last equation contradicts Lemma 20. For notational simplicity, denote β τ (δ; F ) := log CT 2 δ + log log B(δ; F ) b τ (δ; F ) + 1 . Observe that β 1 is larger than β τ for every τ ≥ 1. Since L t0,τ (δ; F ) holds, we must have then that V 2 (0, τ ) 8η 2 ≤ c max τ -1 t=0 d m (lr(CT 2 /δ)) 2 ∇f (x t ) 2 , b τ (δ; F ) β 1 (δ; F ). Now, continuing, recalling the definition of V 1 (0, T s ) = 8η 2 T s Ts-1 t=0 ∇f (x t ) 2 V 2 (0, τ ) ≤ c 2 β 1 (δ; F ) max 8η 2 τ -1 t=0 d m (lr(CT 2 /δ)) 2 ∇f (x t ) 2 , 8η 2 b τ (δ; F ) ≤ c 2 β 1 (δ; F ) max 8η 2 Ts-1 t=0 d m (lr(CT 2 /δ)) 2 ∇f (x t ) 2 , 8η 2 b τ (δ; F ) ≤ c 2 β 1 (δ; F ) max d m (lr(CT 2 /δ)) 2 V 1 (0, T s ) T s , 8ηt f (δ)τ F (i) ≤ c 2 β 1 (δ; F ) max d m (lr(CT 2 /δ)) 2 (64ηt f (δ)(f (x 0 ) -f (x T S ) + N u,r (T s , δ))) , 8ηt f (δ)τ F (ii) ≤ c 2 β 1 (δ; F ) max d m (lr(CT 2 /δ)) 2 (64ηt f (δ)(F + N u,r (T s , δ))) , 8ηt f (δ)τ F = c 2 β 1 (δ; F )(8ηt f (δ)) max d m (lr(CT 2 /δ)) 2 (8(F + N u,r (T s , δ))) , τ F . We note that (i) is a consequence of Lemma 20, while (ii) comes from our assumption that the event M t0,Ts (F ) holds, i.e. f (x t0 ) -f (x t0+Ts ) ≤ F . We next bound V 3 (t 0 , τ ) and V 4 (t 0 , τ ). Lemma 22. Let c > 0 denote the same constant in Lemma 7. Consider any arbitrary 0 < δ ≤ 1/e, and let τ ≥ t f (δ) be arbitrary. 
Let N t0,τ (δ) denote the event that V 3 (t 0 , τ ) := 4η 2 t0+τ -1 t=t0 Y t 2 ≤ 4c 6 η 2 τ log(2dT /δ)r 2 , where c 6 > 0 is an absolute constant. Then, by Lemma 7, P(N t0,τ (δ)) ≥ 1 -δ T . Denote the event O t (δ) := 1 m m i=1 Z t,i 8 ≤ c 7 d 4 log T δ 4 , where c 7 > 0 is an absolute constant. Then, on the event ∩ t0+τ -1 t=t0 O t (δ), we have V 4 (t 0 , τ ) ≤ 4c 7 η 2 τ 2 ρ 2 u 4 d 4 log T δ 4 . Moreover, for each t, P(O t (δ)) ≥ 1 -δ T . Proof. The proof for V 3 (t 0 , τ ) follows directly from Lemma 7, by picking c 6 to be the c that appears in the statement of Lemma 7. Meanwhile, observe that V 4 (t 0 , τ ) = 4η 2 t0+τ -1 t=t0 1 m m i=1 uZ t,i Z t,i Ht,i Z t,i 2 ≤ 4η 2 τ t0+τ -1 t=t0 1 m m i=1 uZ t,i Z t,i Ht,i Z t,i 2 (iii) ≤ 4η 2 τ t0+τ -1 t=t0 1 m m i=1 ρ 2 u 4 Z t,i 8 ≤ 4c 7 η 2 τ 2 ρ 2 u 4 d 4 log T δ 4 . Above, to derive (iii), we used the bound Ht,i ≤ ρu Z t,i . The final inequality is a consequence of our assumption that ∩ t0+τ -1 t=t0 O t (δ) holds. Finally, the result that P(O t (δ)) ≥ 1 -δ T holds due to Lemma 11, where we note that we may pick the absolute constant c 7 to be equal to 2Cc 4 , where c, C > 0 are the absolute constants that appear in the statement of Lemma 11. Combining the earlier results, we obtain the following technical result, which bounds the travelling distance of the iterates in terms of the decrease in function value. Lemma 23 (Improve or Localize). Consider the perturbed zeroth-order update Algorithm 1. Let c > 0, c 1 > 0, c 2 ≥ 1, c 4 > 0, c 5 > 0, c 6 > 0, c 7 > 0, C 1 ≥ 1 be the absolute constants defined in the statements of the previous lemmas, and let δ ∈ (0, 1/e] be arbitrary. Consider any T s ≥ t f (δ). For any F > 0, suppose f (x Ts ) -f (x 0 ) > -F, i.e. f (x 0 ) -f (x Ts ) < F .
Suppose that the event P t0,Ts (δ, F ) := ∩ Ts τ =1 (L t0,τ (δ; F ) ∩ N t0,τ (δ)) ∩ ∩ t0+Ts-1 t=t0 O t (δ) ∩ A t (δ) ∩ G t (δ) ∩ ∩ Ts-1 τ =t f (δ) E t0,τ (δ) holds, where the events E t0,τ (δ), L t0,τ (δ), N t0,τ (δ), O t (δ) are as defined in Lemma 20, Lemma 21 and Lemma 22, and G t (δ) and A t (δ) are as defined in Lemma 13 and Lemma 15. Suppose we choose u, r and η such that u ≤ √ d √ ρ log(T /δ) • min 1 64c 2 5 c 2 , 1 2048c 1 c 2 1/4 , r ≤ • min 1 8c 5 √ 2c 2 , 1 32 √ c 1 , η ≤ 1 Lt f (δ) min 1 log(T /δ) , √ m 8c 4 (lr(C 1 dmT /δ)) 3/2 √ d , m 128c 1 (lr(C 1 dmT /δ)) 3 d . Suppose η ≤ min 1, 1 t f (δ) , 1 t f δL . Suppose also we pick u and r small enough such that u ≤ r 1/2 d log(T /δ)ρ 1/2 , r 2 ≤ min    F ηT s log(T /δ) 65c 2 5 8 + 132c 1 + 1 , F 4c 6 log(2dT /δ) + 4c 7 ηT s    . Then, for each τ ∈ {0, 1, . . . , T s }, we have that x t0+τ -x t0 2 ≤ φ Ts (δ, F ), where φ Ts (δ, F ) ≤ max 128ηT s t f (δ)F, 32η 2 (t f (δ)) 2 2 +8c 2 β 1 (δ; F )ηt f (δ) max 16d m (lr(CT 2 /δ)) 2 F, T s F + T s ηt f (δ)F, where β 1 (δ; F ) is defined as in Lemma 21. Moreover, P(P t0,Ts (δ, F )) ≥ 1 -12Tsδ T . Proof. We recall that x t0+τ -x t0 2 ≤ 8η 2 τ t0+τ -1 t=t0 ∇f (x t ) 2 V1(t0,τ ) + 8η 2 t0+τ -1 t=t0 1 m m i=1 (Z t,i Z t,i -I)∇f (x t ) 2 V2(t0,τ ) + 4η 2 t0+τ -1 t=t0 Y t 2 V3(t0,τ ) + 4η 2 t0+τ -1 t=t0 1 m m i=1 uZ t,i Z t,i Ht,i Z t,i V4(t0,τ ) By Lemma 20, Lemma 21, and Lemma 22, which bound V 1 (t 0 , τ ), V 2 (t 0 , τ ), and V 3 (t 0 , τ ), V 4 (t 0 , τ ) respectively, on the event P t0,Ts (δ, F ), we have, for any 0 ≤ τ ≤ T s , x τ -x 0 2 ≤ V 1 (0, τ ) + V 2 (0, τ ) + V 3 (0, τ ) + V 4 (0, τ ) ≤ max 64ητ t f (δ)(F + N u,r (τ ; δ)), 32η 2 (t f (δ)) 2 2 + 8c 2 β 1 (δ; F )ηt f (δ) max 8d m (lr(CT 2 /δ)) 2 (F + N u,r (T s , δ)) , τ F + 4c 6 η 2 τ log(2dT /δ)r 2 + 4c 7 η 2 τ 2 ρ 2 u 4 d 4 (log(T /δ)) 4 , where N u,r (τ ; δ) is defined as in Lemma 20. 
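The four-term decomposition above repeatedly uses the elementary fact that the squared norm of a sum of τ vectors is at most τ times the sum of their squared norms (this is where the factors of τ in V 1 and V 4 come from). A quick numeric sanity check of this Cauchy-Schwarz-type bound, using hypothetical random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, d = 50, 20
V = rng.standard_normal((tau, d))  # rows play the role of v_0, ..., v_{tau-1}

lhs = np.linalg.norm(V.sum(axis=0)) ** 2               # ||sum_t v_t||^2
rhs = tau * (np.linalg.norm(V, axis=1) ** 2).sum()     # tau * sum_t ||v_t||^2
assert lhs <= rhs  # holds for any collection of vectors, by Cauchy-Schwarz
```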
For the simplified bound (which does not contain N u,r (τ ; δ)), it remains for us to show that our choice of u and r ensures that N u,r (T s , δ) ≤ F and 4c 6 η 2 T s log(2dT /δ)r 2 + 4c 7 η 2 T 2 s ρ 2 u 4 d 4 (log(T /δ)) 4 ≤ ηT s t f (δ)F. First, our choice of u ensures that u 4 d 4 ρ 2 (log(T /δ)) 4 ≤ r 2 . Next, recall that N u,r (τ ; δ) := τ c 2 5 64 η 3 t f (δ) 2 L 2 u 2 d 2 ρ log T δ 2 + 2 log(T /δ)r 2 + τ ηu 4 ρ 2 • c 1 d 3 log T δ 3 + τ Lη 2 u 4 ρ 2 • c 1 d 4 log T δ 4 + ηc 1 r 2 (128t f (δ) + ηL) log T δ + τ c 1 Lη 2 r 2 + c 2 5 t 3 f (δ)η 3 L 2 u 2 d 2 ρ log(T /δ) + 2 log(T /δ)r 2 . Recalling our choice of η such that η ≤ min{1, 1 t f (δ) , 1 t f (δ)L }, it follows that N u,r (T s ; δ) ≤ ηT s r 2 8c 2 5 64 log(T /δ) + 2c 1 + 2c 1 + (128c 1 + 1) log(T /δ) + c 1 + 8c 2 5 log(T /δ) ≤ ηT s r 2 log(T /δ) 65c 2 5 8 + 132c 1 + 1 ≤ F, where the last inequality follows choosing r such that r 2 ≤ F ηTs log(T /δ) 65c 2 5 8 +132c1+1 . Similarly, we have 4c 6 η 2 T s log(2dT /δ)r 2 + 4c 7 η 2 T 2 s ρ 2 u 4 d 4 (log(T /δ)) 4 ≤ ηT s t f (δ) 4c 6 η log(2dT /δ)r 2 + 4c 7 ηT s ρ 2 u 4 d 4 (log(T /δ)) 4 ≤ ηT s t f (δ) 4c 6 η log(2dT /δ)r 2 + 4c 7 ηT s r 2 By choosing r such that r 2 ≤ F 4c 6 log(2dT /δ) + 4c 7 ηT s , it follows that 4c 6 η 2 T s log(2dT /δ)r 2 + 4c 7 η 2 T 2 s ρ 2 u 4 d 4 (log(T /δ)) 4 ≤ ηT s t f (δ)F, as desired. We next lower bound the probability of P t0,Ts (δ, F ) := ∩ Ts τ =1 (L t0,τ (δ; F ) ∩ N t0,τ (δ))∩ ∩ t0+Ts-1 t=t0 O t (δ) ∩ A t (δ) ∩ G t (δ) ∩ ∩ Ts τ =t f (δ) E t0,τ (δ) . Observe that ∩ Ts τ =t f (δ) E t0,τ (δ) = ∩ Ts τ =t f (δ) H t0,τ (δ)∩ t0+τ -1 t=t0 A t (δ) ∩ G t (δ) ∩ K-2 k=0 B t0+kt f (δ) (δ; t f (δ)) ∩B t0+(K-1)t f (δ) (δ; τ -(K -1)t f (δ)) = ∩ Ts τ =t f (δ) H t0,τ (δ)∩ K-2 k=0 B t0+kt f (δ) (δ; t f (δ)) ∩B t0+(K-1)t f (δ) (δ; τ -(K -1)t f (δ)) ∩ Ts-1 t=t0 A t (δ)∩G t (δ) . 
Note this implies that ∩ Ts τ =t f (δ) E t0,τ (δ) ∩ ∩ Ts-1 t=t0 A t (δ) ∩ G t (δ) = ∩ Ts τ =t f (δ) E t0,τ (δ) We note that by Lemma 1, P ∩ Ts τ =t f (δ) H t0,τ (δ) c ≤ 5T s δ T . Meanwhile, we note that ∩ Ts-1 t=t0 B t (δ; t f (δ)) ⊆ ∩ Ts τ =t f (δ) K-2 k=0 B t0+kt f (δ) (δ; t f (δ)) ∩B t0+(K-1)t f (δ) (δ; τ -(K -1)t f (δ)) . Hence, by Lemma 14, we have that P ∩ Ts τ =t f (δ) K-2 k=0 B t0+kt f (δ) (δ; t f (δ)) ∩B t0+(K-1)t f (δ) (δ; τ -(K -1)t f (δ)) c ≤ P ∩ Ts-1 t=t0 B t (δ; t f (δ)) c ≤ T s δ T . Meanwhile, by Lemma 13 and Lemma 15, we may bound P Ts-1 t=t0 A t (δ) ∩ G t (δ) c ≤ T s δ T + 2T s δ T = 3T s δ T . Hence, it follows that P ∩ Ts τ =t f (δ) E t0,τ (δ) ∩ ∩ Ts-1 t=t0 A t (δ) ∩ G t (δ) c ≤ 5T s δ T + T s δ T + 3T s δ T = 9T s δ T . Meanwhile, it follows from our results in the preceding lemmas that P ∩ Ts τ =1 (L t0,τ (δ; F ) ∩ N t0,τ (δ)) ∩ ∩ Ts-1 t=t0 O t (δ) c ≤ 3T s δ T . Hence, it follows that P(P t0,Ts (δ, F )) ≥ 1 -12Tsδ T .
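The object of all these bounds is the perturbed zeroth-order update (Algorithm 1). A minimal sketch of one step, assuming the 2m-point gradient estimate (1/m) Σ i [(f(x + uZ i ) - f(x - uZ i ))/(2u)] Z i with Z i ∼ N(0, I), plus an isotropic Gaussian perturbation Y t with per-coordinate variance r 2 /d (matching the variance of (Y τ ) 1 used later); the function name and signature are illustrative, not from the paper's code:

```python
import numpy as np

def perturbed_zo_step(f, x, eta, u, r, m, rng):
    """One illustrative step of a perturbed 2m-point zeroth-order update.

    Uses 2m function evaluations to form a gradient estimate, then adds an
    isotropic perturbation Y_t (coordinate variance r^2/d) before stepping.
    """
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(m):
        z = rng.standard_normal(d)
        g += (f(x + u * z) - f(x - u * z)) / (2.0 * u) * z
    g /= m
    y = rng.standard_normal(d) * (r / np.sqrt(d))  # isotropic perturbation Y_t
    return x - eta * (g + y)
```

For a smooth f, a Taylor expansion of the finite difference produces exactly the Z t,i Z t,i ∇f (x t ) and (u/2) Z t,i Z t,i H t,i Z t,i terms appearing in the decomposition above.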

E.3 PROVING FUNCTION VALUE DECREASE NEAR SADDLE POINT

We next build on the earlier technical result to prove that each time we are near a saddle point, there is a constant probability of making a significant function value decrease. We briefly provide a high-level proof outline below. In our proof, we introduce a coupling argument connecting two closely related sequences, both starting from the saddle, differing only in the sign of their perturbation term along the minimum eigendirection of the Hessian at the saddle. Specifically, when the function decrease from a saddle is not sufficiently large, due to the earlier technical result, we know that the coupled sequences will remain within a radius φ of the original saddle for a large number (which we will denote as T s ) of iterations. We then utilize this fact to show that the difference of the coupled sequences will (with some constant probability) grow exponentially, eventually moving out of the specified radius φ within T s iterations, leading to a contradiction. Our first result formally introduces the coupling, setting the stage for the rest of our arguments. For notational convenience, in this section, unless otherwise specified, we will often assume that the initial iterate x 0 is an -saddle point. Lemma 3. Suppose x 0 is an -approximate saddle point. Without loss of generality, suppose that the minimum eigendirection of H := ∇ 2 f (x 0 ) is the e 1 direction, and let γ denote -λ min (∇ 2 f (x 0 )) (note γ ≥ √ ρ ). Consider the following coupling mechanism, where we run the zeroth-order gradient dynamics, starting from x 0 , with two isotropic noise sequences, Y t and Y t respectively, where (Y t ) 1 = -(Y t ) 1 , and (Y t ) j = (Y t ) j for all j = 1. Suppose that the sequence {Z t,i } t∈T,i∈ [m] is the same for both sequences.
Let {x t } denote the sequence with the {Y t } noise sequence, and let the {x t } denote the sequence with the {Y t } noise sequence, where x t+1 = x t -η 1 m m i=1 Z t,i Z t,i ∇f (x t ) + u 2 Z t,i Z t,i H t,i Z t,i + Y t , x 0 = x 0 , and H t,i := H t,i,+ -H t,i,- , with H t,i,+ = ∇ 2 f (x t + α t,i,+ uZ i ) for some α t,i,+ ∈ [0, 1], and H t,i,-= ∇ 2 f (x t -α t,i,-uZ i ) for some α t,i,-∈ [0, 1]. Then, for any t ≥ 0, xt+1 := xt+1 -x t+1 = -η t τ =0 (I -ηH) t-τ ξg 0 (τ ) Wg 0 (t+1) -η t τ =0 (I -ηH) t-τ ( Hτ -H)xτ W H (t+1) -η t τ =0 (I -ηH) t-τ ξu(τ ) Wu(t+1) -η t τ =0 (I -ηH) t-τ Ŷτ Wp(t+1) where ξg 0 (t) = 1 m m i=1 (Zt,iZ t,i -I)∇f (xt), ξ g 0 (t) = 1 m m i=1 (Zt,i(Zt,i) -I)∇f (x t ), ξg 0 (t) = ξg 0 (t) -ξ g 0 (t), ξu(t) = 1 m m i=1 u 2 Zt,iZt,i Ht,iZt,i, ξ u (t) = 1 m m i=1 u 2 Zt,iZt,i H t,i Zt,i, ξu(t) = ξu(t) -ξ u (t), Ŷt = Yt -Y t , Ht = 1 0 ∇ 2 f (axt + (1 -a)x t )da. Proof. Observe that xt+1 := x t+1 -x t+1 = x t -η (∇f (x t ) + ξ g0 (t) + ξ u (t)Y t ) -x t -η ∇f (x t ) + ξ g0 (t) + ξ u (t) + Y t = xt -η (∇f (x t ) -∇f (x t )) + ξ g0 (t) -ξ g0 (t) + (ξ u (t) -ξ u (t)) + (Y t -Y t ) = xt -ηH xt -η( Ht -H)x t -η ξg0 (t) -η ξu (t) -η Ŷt = -η t τ =0 (I -ηH) t-τ ξg0 (τ ) Wg 0 (t+1) -η t τ =0 (I -ηH) t-τ ( Hτ -H)x τ W H (t+1) -η t τ =0 (I -ηH) t-τ ξu (τ ) Wu(t+1) -η t τ =0 (I -ηH) t-τ Ŷτ Wp(t+1) where ξ g0 (t) = 1 m m i=1 (Z t,i Z t,i -I)∇f (x t ), ξ g0 (t) = 1 m m i=1 (Z t,i (Z t,i ) -I)∇f (x t ), ξg0 (t) = ξ g0 (t) -ξ g0 (t), ξ u (t) = 1 m m i=1 u 2 Z t,i Z t,i Ht,i Z t,i , ξ u (t) = 1 m m i=1 u 2 Z t,i Z t,i H t,i Z t,i , ξu (t) = ξ u (t) -ξ u (t), Ŷt = Y t -Y t , Ht = 1 0 ∇ 2 f (ax t + (1 -a)x t )da. To derive the final equality, we utilized the fact that x 0 = x 0 . This completes our proof. Suppose x 0 is an -saddle point. Recall that γ > 0 denotes -λ min (∇ 2 f (x 0 )), where we know that γ ≥ √ ρ . γ ≥ ψ := min{ψ, 1, L} if f (•) is ( , ψ)-strict saddle for any ψ > √ ρ √ ρ otherwise. 
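The coupling in Lemma 3 runs two copies of the dynamics from the same x 0 , sharing the {Z t,i } directions, with perturbations that agree on every coordinate except the first, where they are negated. A sketch of how such coupled perturbations could be drawn (function name hypothetical); note that negating one coordinate of a spherical Gaussian leaves its distribution unchanged, so both sequences are marginally identical:

```python
import numpy as np

def coupled_perturbations(d, r, rng):
    """Draw Y and Y' with (Y')_1 = -(Y)_1 and (Y')_j = (Y)_j for j != 1."""
    y = rng.standard_normal(d) * (r / np.sqrt(d))  # isotropic N(0, (r^2/d) I)
    y_prime = y.copy()
    y_prime[0] = -y[0]  # flip the minimum-eigendirection coordinate
    return y, y_prime

rng = np.random.default_rng(0)
y, yp = coupled_perturbations(5, 0.1, rng)
y_hat = y - yp  # the difference Y_hat is supported on the e_1 axis
assert np.allclose(y_hat[1:], 0.0)
assert np.isclose(y_hat[0], 2 * y[0])
```

This is what drives the argument: the difference sequence x̂ t feels only the e 1 component of the noise, which the negative curvature γ then amplifies.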
In the sequel, for any t ≥ 0, it is helpful to define the quantities β(t) 2 := (1 + ηγ) 2t (ηγ) 2 + 2ηγ , α(t) 2 := (1 + ηγ) 2t -1 (ηγ) 2 + 2ηγ . ( ) We next introduce some probabilistic events (and their implications) which, if true, can be used to bound the sizes of W g0 (t + 1) , W u (t + 1) , W u (t + 1) (and as we will see in the next result, indirectly bound W H (t + 1) . These bounds will be useful in the final proof of making function value progress near a saddle point. Lemma 24. We assume δ ∈ (0, 1/e] throughout the lemma. Suppose that we pick u, r and η as specified in Lemma 23. Suppose T s ≥ t f (δ). Suppose also that f (x Ts ) -f (x 0 ) > -F, f (x Ts ) -f (x 0 ) > -F. Then, we have the following results. 1. Let S φ (δ) denote the event S φ (δ) := max{ x t -x 0 2 , x t -x 0 2 } ≤ φ Ts (δ, F ), ∀0 ≤ t ≤ T s . In addition, let S u (δ) denote the event S u (δ) := W u (t + 1) ≤ ηβ(t + 1) √ 3 η ψ 2c 3 ρd 2 (log(T /δ)) 2 u 2 , ∀0 ≤ t ≤ T s -1 , where c 3 is the same absolute constant as the c 3 in the preceding lemmas. Then, P(S φ (δ) ∩ S u (δ)) ≥ 1 - 24T s δ T . 2. Consider defining the event R t (δ), which is the event where either t τ =0 (1 + ηγ) 2(t-τ ) dL 2 m xτ -x τ 2 (lr(CT 2 /δ)) 2 ≥ GT s (δ, F ), or Wg 0 (t + 1) ≤ c η max lr CT 2 δ 2 t τ =0 dL 2 m (1 + ηγ) 2(t-τ ) xτ -x τ 2 , g(t + 1) log CdT 2 δ +log log GT s (δ, F ) g(t + 1) +1 normalsize holds. Above, c , C refer to the same constants as in Proposition 2, and GT s (δ, F ) := 8 Ts-1 τ =0 (1 + ηγ) 2τ dL 2 m (lr(CT 2 /δ)) 2 φT s (δ, F ) + β(Ts)ηr 60 √ d 2 , g(t + 1) := β(t + 1)ηr 60 √ d 2 . Then, P(Rt(δ)) ≥ 1 -δ T . Suppose the event ∩ Ts-1 t=0 Rt(δ) ∩ S φ (δ) holds. Then, the event Sg 0 (δ) holds, where Sg 0 (δ) := ∩ Ts-1 t=0 Sg 0 ,t(δ), and Sg 0 ,t(δ) is defined as Sg 0 ,t(δ) :=    Wg 0 (t + 1) ≤ ζ1(δ, F )c η max lr CT 2 δ 2 t τ =0 dL 2 m (1 + ηγ) 2(t-τ ) xτ -x τ 2 , g    where ζ1(δ, F ) := log CdT 2 δ + log log GT s (δ, F ) g(1) + 1 . 3. 
In addition, let Sp(δ) denote the event Sp(δ) := Wp(t + 1) ≤ 2 2 log(T /δ)β(t + 1)ηr √ d ∀0 ≤ t ≤ Ts -1 . Then, P(Sp(δ)) ≥ 1 -Tsδ T . Proof. We consider the three claims separately. 1. Note that our assumptions satisfy the conditions required in Lemma 23. Hence, by Lemma 23, on the event P 0,Ts (δ, F ), we have that x τ -x 0 2 ≤ φ Ts (δ, F ). Simultaneously, on the event P 0,Ts (δ, F ), we know that ∩ Ts-1 t=0 G t (δ) holds, i.e. 1 m m i=1 Z t,i 4 ≤ 2c 3 d 2 (log(T /δ)) 2 , ∀0 ≤ t ≤ T s -1. Thus, for W u (t + 1), we have that W u (t + 1) = η t τ =0 (I -ηH) t-τ ξu (τ ) ≤ η t τ =0 (I -ηH) t-τ ξ u (τ ) + η t τ =0 (I -ηH) t-τ ξ u (τ ) ≤ η t τ =0 (1 + ηγ) t-τ 1 m m i=1 u 2 Z t,i Z t,i Ht,i Z t,i + 1 m m i=1 u 2 Z t,i Z t,i H t,i Z t,i ≤ η t τ =0 (1 + ηγ) t-τ ρ m m i=1 Z t,i 4 u 2 (iv) ≤ η t τ =0 (1 + ηγ) t-τ ρ(2c 3 )d 2 (log(T 2/δ)) 2 u 2 ≤ η (1 + ηγ) t+1 ηγ 2c 3 ρCd 2 (log(T /δ)) 2 u 2 (v) = ηβ(t + 1) (ηγ) 2 + 2ηγ ηγ 2c 3 ρd 2 (log(T /δ)) 2 u 2 ≤ ηβ(t + 1) √ 3 √ ηγ 2c 3 ρd 2 (log(T /δ)) 2 u 2 (vi) ≤ ηβ(t + 1) √ 3 η ψ 2c 3 ρd 2 (log(T /δ)) 2 u 2 where the inequality in (iv) holds due to Eq. ( 45), the equality in (v) holds due to the definition of β(t + 1), and the inequality in (vi) used the fact that γ ≥ ψ. Hence the event ∩ Ts t=0 x t -x 0 2 ≤ φ Ts (δ, F ) and ∩ S u (δ) holds with probability at least 1 -12Tsδ T . Note that by the coupling, the distribution of x τ is the same as that of x τ . Thus, by the assumption f (x Ts ) -f (x 0 ) > -F , it follows by a similar argument that the bound x τ -x 0 2 ≤ φ Ts (δ, F ) also holds with probability at least 1 -12Tsδ T . The claim then follows by an application of the union bound. 2. For the second claim, observe first that the claim P(R t (δ)) ≥ 1 -δ T is a consequence of Proposition 2. Suppose next that f (x Ts ) -f (x 0 ) > -F . Then, by definition of the event S φ (δ), we know that x τ -x 0 2 ≤ φ Ts (δ, F ), x τ -x 0 2 ≤ φ Ts (δ, F ) where φ Ts (δ, F ) is as defined in Lemma 23. 
Suppose now that R t (δ) holds true, and suppose for contradiction that t τ =0 (1 + ηγ) 2(t-τ ) dL 2 m x τ -x τ 2 (lr(CT 2 /δ)) 2 ≥ G Ts (δ, F ) = 8 Ts-1 τ =0 (1 + ηγ) 2τ dL 2 m (lr(CT 2 /δ)) 2 φ Ts (δ, F ) + β(T s )ηr 60 √ d 2 . This implies that there exists some 0 ≤ τ ≤ t ≤ T s such that x τ -x τ 2 ≥ 8φ Ts (δ, F ). However, we also know that on the event S φ (δ), x τ -x τ 2 ≤ 2 x τ -x 0 2 + 2 x τ -x 0 2 ≤ 4φ Ts (δ, F ). This leads to a contradiction. We must then have that W g0 (t + 1) ≤ ζ 1 (δ, F )c η max lr CT 2 δ 2 t τ =0 dL 2 m (1 + ηγ) 2(t-τ ) x τ -x τ 2 , g , where ζ 1 (δ, F ) := log CdT 2 δ + log log G(δ, F ) g(1) + 1 3. Observe that W p (t + 1) = η t τ =0 (I -ηH) t-τ Ŷτ = η t τ =0 (1 + ηγ) t-τ (2(Y τ ) 1 ), which means that W p (t + 1) is a 1-dimensional Gaussian with variance η 2 t τ =0 (1 + ηγ) 2(t-τ ) 4r 2 d = 4η 2 r 2 d (1 + ηγ) 2(t+1) -1 2ηγ + (ηγ) 2 = 4η 2 r 2 α(t + 1) 2 d . Since α(t + 1) ≤ β(t + 1), using the subGaussianity of a Gaussian distribution, it follows that for any t, with probability at least 1 -δ/T , W p (t + 1) ≤ 2 2 log(T /δ)β(t + 1)ηr √ d . For any F > 0, we are now ready to show that the algorithm makes a function decrease of F with Ω(1) probability near an -saddle point. Proposition 5. Suppose that x t0 is an -approximate saddle point. Let c > 0, c 1 > 0, c 2 ≥ 1, c 4 > 0, c 5 > 0, c 6 > 0, c 7 > 0, C 1 ≥ 1 be the absolute constants defined in the statements of the previous lemmas, and let δ ∈ (0, 1/e] be arbitrary. Consider any F > 0. As in the statement of Lemma 23, suppose we choose u, r and η such that u ≤ √ d √ ρ log(T /δ) • min 1 64c 2 5 c 2 , 1 2048c 1 c 2 1/4 , r ≤ • min 1 8c 5 √ 2c 2 , 1 32 √ c 1 , η ≤ 1 Lt f (δ) min 1 log(T /δ) , √ m 8c 4 (lr(C 1 dmT /δ)) 3/2 √ d , m 128c 1 (lr(C 1 dmT /δ)) 3 d . Suppose we pick T s = max ι η ψ , t f (δ), 4 , where ι = max log 2 φ Ts (δ, F ) 20 √ d η 2 γ 2 + 2ηγ ηr , 1 , ψ := min{ψ, 1, L} if f (•) is ( , ψ)-strict saddle for any ψ > √ ρ √ ρ otherwise. 
Suppose in addition that u, η also satisfy the conditions u ≤ r η ψ 120 √ 3c 3 √ dρd 2 (log(T /δ)) 2 , η ≤ max 1 c c 9 ζ 1 (δ, F ) , m ψ 360ι(c ) 2 c 2 9 dL 2 lr CT 2 δ 2 ζ 1 (δ, F ) 2 , 1 2 ψ , where ζ 1 (δ, F ) is as defined in Lemma 23, c , c 3 , C > 0 are the same constants as in the previous results, and c 9 = 2 √ 2 + 1 20 . Suppose also that φ Ts (δ, F ) satisfies the bound φ Ts (δ, F ) ≤ ψ 60c 9 ιρ log(T /δ) 2 . Then, with probability at least 1 3 -13Tsδ T , f (x t0+Ts ) -f (x t0 ) ≤ -F . Proof of Proposition 5. Without loss of generality, we assume that t 0 = 0. By Lemma 3, we have xt+1 := x t+1 -x t+1 = -η t τ =t0 (I -ηH) t-τ ξg0 (τ ) Wg 0 (t+1) -η t τ =t0 (I -ηH) t-τ ( Hτ -H)x τ W H (t+1) -η t τ =t0 (I -ηH) t-τ ξu (τ ) Wu(t+1) -η t τ =t0 (I -ηH) t-τ Ŷτ Wp(t+1) where ξ g0 (t) = 1 m m i=1 (Z t,i Z t,i -I)∇f (x t ), ξ g0 (t) = 1 m m i=1 (Z t,i (Z t,i ) -I)∇f (x t ), ξg0 (t) = ξ g0 (t) -ξ g0 (t), ξ u (t) = 1 m m i=1 u 2 Z t,i Z t,i Ht,i Z t,i , ξ u (t) = 1 m m i=1 u 2 Z t,i Z t,i H t,i Z t,i , ξu (t) = ξ u (t) -ξ u (t), Ŷt = Y t -Y t , Ht = 1 0 ∇ 2 f (ax t + (1 -a)x t )da. Recall that we define for t ≥ 0, β(t) 2 := (1 + ηγ) 2t (ηγ) 2 + 2ηγ , α(t) 2 := (1 + ηγ) 2t -1 (ηγ) 2 + 2ηγ . Throughout the proof, we suppose for contradiction that f (x Ts ) -f (x 0 ) > -F, f (x Ts ) -f (x 0 ) > -F, and assume the event ∩ Ts-1 t=0 R t (δ) ∩ S φ (δ) ∩ S u (δ) ∩ S p (δ) holds, where the events intersected are defined in Lemma 24. Then, by Lemma 24, the event S g0 (δ) (also defined in Lemma 24) holds. Consider the following induction argument, where we seek to show that there exists an absolute constant c 9 > 0 such that for every t ∈ {0, 1, . . . , T s }, x t -x t ≤ c 9 log(T /δ) β(t)ηr √ d , and max { W g0 (t) , W H (t) , W u (t) } ≤ β(t + 1)ηr √ d . Combined with a lower bound on W p (t + 1) (which makes use of the property that W p (t + 1) is a 1-dimensional Gaussian), we will then use the inductive claim in Eq.
( 49) to show that W p (T s ) ≥ 2 W g0 (T s ) + W H (T s ) + W u (T s ) . Since W p (t + 1) is a 1-dimensional Gaussian random variable with a standard deviation that grows exponentially with t, by our choice of T s , we will see that x Ts -x Ts is larger than what we expect (since our assumptions imply that max x Ts -x 0 2 , x Ts -x 0 2 ≤ φ Ts (δ, F ), i.e. x Ts and x Ts both remain close to x 0 and hence close to each other). This yields a contradiction, implying that on the event we assumed to hold, i.e. ∩ Ts-1 t=0 R t (δ) ∩ S φ (δ) ∩ S p (δ), the assumption f (x Ts ) -f (x 0 ) > -F, and f (x Ts ) -f (x 0 ) > -F is not true, i.e. one of the sequences must have made function value progress of at least F . We proceed to prove Eq. ( 49). Observe that the claim holds for the base case t = 0; this is true since x 0 = x 0 . Now suppose that this holds for all τ ≤ t. We will seek to show that Eq. ( 49) holds for t + 1 as well. We do so by bounding the norms of W g0 (t + 1), W H (t + 1), W u (t + 1) and W p (t + 1) respectively. 1. (Bounding W g0 (t + 1) ) Since the event S g0 (δ) holds, it follows that for each 0 ≤ t ≤ T s -1, we have that W g0 (t + 1) ≤ ζ 1 (δ, F )c η max lr CT 2 δ 2 t τ =0 dL 2 m (1 + ηγ) 2(t-τ ) x τ -x τ 2 , g where ζ 1 (δ, F ) := log CdT 2 δ + log log G Ts (δ, F ) g(1) + 1 , and the terms G Ts (δ, F ) and g(1) are defined as in Lemma 24. Recall by the inductive claim in Eq. ( 49) that there exists c 9 > 0 such that x τ -x τ ≤ c 9 log(T /δ) β(t)ηr √ d ∀ 0 ≤ τ ≤ t. Hence, it follows that W g0 (t + 1) ≤ c ζ 1 (δ, F )η max √ t + 1 lr CT 2 δ c 9 √ dL √ m β(t)ηr √ d , β(t + 1)ηr 60 √ d . Hence, noting the choice of T s in Eq. ( 47), by choosing η such that c c 9 ζ 1 (δ, F )η T s lr CT 2 δ √ dL √ m ≤ 1 60 ⇐⇒ η ≤ m ψ 360ι(c ) 2 c 2 9 dL 2 lr CT 2 δ 2 ζ 1 (δ, F ) 2 , (50) and c c 9 ζ 1 (δ, F )η ≤ 1, it follows that W g0 (t + 1) ≤ β(t + 1)ηr 60 √ d . 2. Meanwhile, the term W H (t + 1) can be bounded as follows. By the inductive assumption in Eq.
( 49), we have that xτ = x τ -x τ ≤ c 9 log(T /δ) β(τ )ηr √ d ∀ 0 ≤ τ ≤ t. Moreover, on the event our proof assumes, we know that max x τ -x 0 2 , x τ -x 0 2 ≤ φ Ts (δ, F ). Thus, using the ρ-Hessian Lipschitz property, we have W H (t + 1) = η t τ =0 (I -ηH) t-τ ( Hτ -H)x τ ≤ η t τ =0 (1 + ηγ) t-τ ρ φ Ts (δ, F ) c 9 log(T /δ)β(τ )ηr √ d ≤ c 9 (t + 1) log(T /δ)ηρ φ Ts (δ, F ) β(t)ηr √ d ≤ c 9 T s log(T /δ)ηρ φ Ts (δ, F ) β(t)ηr √ d . Given our choice of T s in Eq. ( 47), if c 9 T s log(T /δ)ηρ φ Ts (δ, F ) ≤ 1 60 ⇐⇒ φ Ts (δ, F ) ≤ ψ 60c 9 ιρ log(T /δ) 2 , it follows that W H (t + 1) ≤ β(t + 1)ηr 60 √ d . 3. Meanwhile, for W u (t + 1), since the event S u (δ) holds, we have that W u (t + 1) ≤ ηβ(t + 1) √ 3 η ψ 2c 3 ρd 2 (log(T /δ)) 2 u 2 . Now, by picking ηβ(t + 1) √ 3 η ψ 2c 3 ρd 2 (log(T /δ)) 2 u 2 ≤ β(t + 1)ηr 60 √ d ⇐⇒ u ≤ r η ψ 120 √ 3c 3 √ dρd 2 (log(T /δ)) 2 , it follows that W u (t + 1) ≤ β(t + 1)ηr 60 √ d , where the final inequality uses the fact that 0 < δ ≤ 1/e (which implies log(T /δ) ≥ 1). Hence, we see that the first part of the inductive claim of Eq. ( 49) holds with the constant c 9 := 1 20 + 2 √ 2, and the second part follows naturally as a consequence of our argument above. Meanwhile, observe that for any η such that η ψ ≤ 1 2 , we have that (1 + ηγ) 1 η ψ ≥ 2. Thus, by choosing η such that η ψ ≤ 1 2 , we have for any t ≥ 1 η ψ , α(t + 1) 2 ≥ 1 2 β(t + 1) 2 . Hence, following Eq. ( 46), by choosing T s ≥ 1 η ψ , W p (T s ) is a 1-dimensional Gaussian with variance at least 2η 2 r 2 β(T s ) 2 /d, such that with probability at least 2/3, W p (T s ) ≥ β(T s )ηr 10 √ d . Simultaneously, we know that on the event ∩ Ts-1 t=0 R t (δ) ∩ S φ (δ) ∩ S u (δ) ∩ S p (δ), we have W g0 (T s ) + W H (T s ) + W u (T s ) ≤ 3β(T s )ηr 60 √ d = β(T s )ηr 20 √ d . We note that by Lemma 24, we have P ∩ Ts-1 t=0 R t (δ) ∩ S φ (δ) ∩ S u (δ) ∩ S p (δ) ≥ 1 - 24T s δ T + T s δ T + T s δ T = 1 - 26T s δ T .
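Two bookkeeping facts used repeatedly in this argument can be checked numerically: (i) α(t) 2 and β(t) 2 are closed forms of geometric sums, since (1 + ηγ) 2 - 1 = (ηγ) 2 + 2ηγ; and (ii) the variance of the 1-dimensional Gaussian W p (t + 1) = η Σ τ (1 + ηγ) t-τ • 2(Y τ ) 1 , with independent (Y τ ) 1 ∼ N(0, r 2 /d), telescopes to 4η 2 r 2 α(t + 1) 2 /d. A sketch, with all parameter values purely illustrative:

```python
import math

eta, gamma, r, d, t = 0.01, 0.5, 0.05, 10, 30
q = (1 + eta * gamma) ** 2                      # per-step squared growth factor
denom = (eta * gamma) ** 2 + 2 * eta * gamma    # equals q - 1

# (i) alpha(t)^2 = ((1 + eta*gamma)^{2t} - 1) / denom is a geometric sum
alpha_sq = (q ** t - 1) / denom
assert math.isclose(alpha_sq, sum(q ** tau for tau in range(t)), rel_tol=1e-10)

# (ii) Var(W_p(t+1)) = eta^2 * sum_{tau=0}^{t} q^{t-tau} * 4 r^2 / d
direct = eta ** 2 * sum(q ** (t - tau) * 4 * r ** 2 / d for tau in range(t + 1))
closed_form = 4 * eta ** 2 * r ** 2 * ((q ** (t + 1) - 1) / denom) / d
assert math.isclose(direct, closed_form, rel_tol=1e-10)
```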
Thus, with probability at least 2/3 -26Tsδ T , we have xTs ≥ 1 2 W p (T s ) ≥ β(T s )ηr 20 √ d Thus, choosing T s ≥ ι η ψ , where ι = max log 2 φ Ts (δ, F ) 20 √ d η 2 γ 2 + 2ηγ ηr , 1 , noting that if η ψ ≤ 1/2, then (1 + ηγ) 1 η ψ ≥ (1 + η ψ) 1 η ψ ≥ 2, we have that with probability at least 2/3 -26Tsδ T , xTs ≥ β(T s )ηr 20 √ d = ηr 20 √ d (1 + ηγ) Ts 2ηγ + (ηγ) 2 ≥ ηr 20 √ d (1 + ηγ) log 2 √ φ Ts (δ,F ) 20 √ d √ η 2 γ 2 +2ηγ ηr η ψ 2ηγ + (ηγ) 2 ≥ ηr 20 √ d 2ηγ + (ηγ) 2 2 log 2 √ φ Ts (δ,F ) 20 √ d √ η 2 γ 2 +2ηγ ηr > 2 φ Ts (δ, F ) > 2 φ(T s , δ). Thus, at least one of x Ts -x 0 and x Ts -x 0 is larger than φ(T s , δ), a contradiction. Since the two sequences have the same distribution, it follows that with probability at least 1/3 -13Tsδ T , f (x Ts ) -f (x 0 ) ≤ -F . In the result above, we require an upper bound on the norm of φ Ts (δ, F ) to hold (i.e. equation 48), which in turn necessitates an upper bound on F , the function value improvement we can expect to make. Below, we show how to choose F to be as large as possible (up to constants and logarithmic factors) whilst still satisfying equation 48, assuming that u, r and η are chosen appropriately small such that the dominant term of φ Ts (δ, F ) scales with F . Lemma 25. Consider choosing F such that F = 1 2 ψ 60c 9 ιρ log(T /δ) 2 1 ηT s t f (δ) (129 + 8c 2 β 1 (δ; F ) (16(lr(CT 2 /δ)) 2 + 1)) . Suppose η ≤ min 1, 1 t f (δ) , 1 t f δL . Suppose we pick u and r small enough such that u ≤ r 1/2 d log(T /δ)ρ 1/2 , r 2 ≤ min    F ψ 2ι log(T /δ) 65c 2 5 8 + 6c 1 + 1 , F 4c 6 log(2dT /δ) + 8c7ι ψ   . Then, N u,r (T s , δ) ≤ F , and that 4c 6 η 2 T s log(2dT /δ)r 2 + 4c 7 η 2 T 2 s ρ 2 u 4 d 4 (log(T /δ)) 4 ≤ ηT s t f (δ)F. Suppose in addition η is small enough so that 32η 2 (t f (δ)) 2 ≤ 1 2 ψ 60c 9 ιρ log(T /δ)

2

. Suppose also that ψ ≤ 1 and η ≤ m d , so that T s ≥ ι η ψ ≥ d m . Then, the condition in Eq. (48) will be satisfied. Proof. We note that since ι ψ ≤ T s ≤ 2ι ψ , it follows by our choice of r that r also satisfies the condition Hence, our choice of η, u and r satisfies the conditions in Lemma 23, and it follows then that φ Ts (δ, F ) ≤ max 128ηT s t f (δ)F, 32η 2 (t f (δ)) 2 2 + 8c 2 β 1 (δ; F )ηt f (δ) max 16d m (lr(CT 2 /δ)) 2 F, T s F + T s ηt f (δ)F, where β 1 (δ; F ) is as defined in Lemma 21. The condition in Eq. ( 48) requires that Proof. Throughout the proof, we assume that the event D τ (δ) holds. Let J denote {0, 1 . . . , τ -1} where τ < t f (δ). Then, J belongs to one of the two following cases. Case 1) (Gradient dominates noise): Recall that this means that for every t ∈ J, we have Note that, by our choices of the parameters η, u, r, it can be shown that Combining both cases above (Eq. ( 53) and Eq. ( 54)), we see that for the choice α = 128t f (δ), the bound - 3η 4 t∈J 1 m m i=1 Z t,i ∇f (x t ) 2 + η 128t f (δ) + c 1 Lη 2 χ 3 d m t∈J ∇f (x t ) 2 ≤ η 4 2 always holds. Recall by Eq. ( 3) that we have f (x τ ) -f (x 0 ) ≤ - 3η 4 τ -1 t=0 1 m m i=1 Z t,i ∇f (x t ) 2 + η α + c 1 Lη 2 χ 3 d m τ -1 t=0 ∇f (x t ) 2 + τ ηu 4 ρ 2 • c 1 d 3 log T δ 3 + τ Lη 2 u 4 ρ 2 • c 1 d 4 log T δ 4 + 1 r 2 (α + ηL) log T δ + τ c 1 Lη 2 r 2 . By plugging in Eq. ( 55) above, as well as the choice α = 128t f (δ), we see that We can now complete our proof by using the union bound (suppressing the dependence of some of the events on δ for notational simplicity) to derive P(D c τ ) ≤ P(H c τ ) + τ -1 t=0 P(A c t ) + τ -1 t=0 P(G c t ) ≤ (τ + 4)δ T + τ T δ + 2 τ T δ ≤ (4t f (δ) + 4) T δ Armed with Proposition 5 and Lemma 25, we are now ready to show for T sufficiently large, with high probability, there can be no more than T /4 -saddle points. Combined with Proposition 4, this yields the following result. Theorem 2.
Suppose we pick u, r, η such that they satisfy the conditions in Proposition 5 and Lemma 25. Suppose F is chosen as prescribed in Lemma 25. Suppose that ψ ≤ 1, so that T s ≥ ι ≤ t f (δ)r 2 c 1 + t f (δ)r 2 c 1 + c 1 r 2 (128t f (δ) + 1) log(T /δ) + c 1 r 2 = r 2 (130c 1 t f (δ) + c 1 log(T /δ) + c 1 ). Hence, by picking r such that r ≤ 2 4(130c 1 t f (δ) + c 1 log(T /δ) + c 1 ) , it follows that η 2 4 ≥ t f (δ)ηu 4 ρ 2 • c 1 d 3 log T δ 3 + t f (δ)Lη 2 u 4 ρ 2 • c 1 d 4 log T δ 4 + ηc 1 r 2 (128t f (δ) + ηL) log T δ + t f (δ)c 1 Lη 2 r 2 . Then, if τ 1 < t f (δ), with probability at least 1 - (5t f (δ)+4) δ , f (x τ1 ) -f (x 0 ) ≤ η 2 2 . Suppose also that we pick r such that Choose T such that -(0.05T /T s )F ≤ -(f (x 0 ) -f * ) ⇐⇒ T ≥ 20T s (f (x 0 ) -f * ) F ≥ ϕρ 2 (f (x 0 ) -f * ) η ψ4 yields a contradiction, where ϕ := 20 2ι 2 (60c 9 ι log(T /δ)) 2 (t f (δ)) 129 + 8c 2 β 1 (δ; F ) 16(lr(CT 2 /δ)) 2 + 1 Hence, with probability at least 1 -16δ, there cannot be more than T /4 saddle points. In addition, with probability at least 1 -6δ, by Proposition 4, there cannot be more than T /4 iterates with ∇f (x t ) ≥ . Hence, with probability at least 1 -22δ, there are at least T /2 -approximate second order stationary points.

G SIMULATIONS

We test the performance of our proposed algorithm with two-point estimators (ZOPGD-2pt) against existing zeroth-order benchmarks using the octopus function (proposed in Du et al. (2017) ) of varying dimensions. It is known that the octopus function defined on R d , which chains d saddle points sequentially, takes exponential (in d) time for exact gradient descent to escape; it has thus emerged as a popular benchmark to evaluate and compare the performance of algorithms that seek to escape saddle points. In our experiments, we compare the performance of our two-point estimator algorithm (ZOPGD-2pt) with PAGD (Algorithm 1 in Vlatakis-Gkaragkounis et al. ( 2019)) and ZO-GD-NCF (see Zhang et al. (2022) ), which are the only two existing zeroth-order algorithms that have (a) a Õ( d / 2 ) sample complexity for escaping saddle points (with the latter algorithm yielding the tightest bounds), and (b) performed the best empirically on escaping saddle points (see the simulation results in Zhang et al. (2022) ). We note that both PAGD and ZO-GD-NCF have to use 2d function evaluations per iteration to estimate the gradient, while our algorithm only needs 2 function evaluations. In our plots, we show the function value against the number of function evaluations. For completeness, we also plot the performance of exact gradient descent (normalized such that its x-axis is also the number of function queries). We tested the algorithms for d = 10 and d = 30. To account for the stochasticity in the algorithms, for each algorithm, we computed the average and standard deviation over 30 trials, and plotted the mean trajectory with an additional band that represents 1.5 times the standard deviation. For our algorithm's hyperparameters, we picked η = 1 4dL , u = 10 -2 , r = 0.05, m = 1 (i.e., the two-point estimator). (57) For PAGD, we used the hyperparameters listed in their paper, and for ZO-GD-NCF, we used the code from their NeurIPS submission.
We note in particular that both methods used the step-size 1 4L . For initialization, we chose a random x 0 near the saddle point at the origin, drawn from N (0, 10 -3 I d×d ) (fixed for all trials and all algorithms). As we can see in Fig. 1 , in both cases, our algorithm reaches the global minimum of the octopus function in significantly fewer function evaluations than PAGD and ZO-GD-NCF (approximately 2.5 times faster than ZO-GD-NCF, and approximately 3 times faster than PAGD), despite our algorithm only using 2 function evaluations per iteration compared to 2d function evaluations per iteration for both PAGD and ZO-GD-NCF. As a sanity check, we note that the number of function evaluations required for PAGD and ZO-GD-NCF to reach the global minimum approximately matches that in Figure 1 of Zhang et al. (2022) ; here the correspondence is only approximate since Zhang et al. (2022) only plots one trial while we compute the mean and standard deviation of 30 trials. This result suggests that in addition to the theoretical convergence guarantees, there might also be empirical benefits to using two-point estimators versus existing 2d-point estimators in the zeroth-order escaping saddle point literature.
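As a minimal illustration of the escape behavior (not the octopus benchmark; the objective and all hyperparameters below are illustrative), one can run the perturbed two-point update on the simple quadratic saddle f(x) = (x 1 2 - x 2 2 )/2 starting near the origin; the isotropic perturbation seeds the negative-curvature direction, which the dynamics then amplify:

```python
import numpy as np

def f(x):  # simple saddle at the origin: negative curvature along x[1]
    return 0.5 * (x[0] ** 2 - x[1] ** 2)

rng = np.random.default_rng(0)
x = np.array([1e-3, 1e-3])          # initialize near the saddle point
eta, u, r, m = 0.05, 1e-4, 0.05, 1  # illustrative hyperparameters

for _ in range(500):
    g = np.zeros(2)
    for _ in range(m):  # 2m-point gradient estimate (m = 1: two evaluations)
        z = rng.standard_normal(2)
        g += (f(x + u * z) - f(x - u * z)) / (2 * u) * z
    g /= m
    y = rng.standard_normal(2) * (r / np.sqrt(2))  # isotropic perturbation Y_t
    x = x - eta * (g + y)

# the iterate typically escapes along the negative-curvature direction x[1],
# and the function value drops well below its initial value of ~0
```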



Footnotes: (1) In our paper, we focus on the case √ ρ ≤ L; otherwise, by the L-Lipschitz assumption, λmin(∇ 2 f (x)) ≥ -L for all x ∈ R d , which implies -first order stationary points are also -second order stationary points. (2) For general 1 ≤ m ≤ d, there will also be an O(1/m) dependence in the sample complexity. (3) To accommodate the last interval, which has length at most 2t f (δ) -1, we note that the results we require for the proof, namely Lemma 14, Lemma 16 and Lemma 17, all hold for any interval length t f ≤ 2t f (δ). (4) We note that by construction, B(δ; F ) ≥ bτ (δ; F ). (5) We may also directly assume that Sg 0 (δ) also holds, but our way of reasoning prevents double counting of probabilities. (6) Without loss of generality, we may set ψ = 1 if f (•) is ( , ψ)-strict saddle for any ψ > 1. Recall we focus on the case ψ ≤ L, since otherwise, by the L-Lipschitz assumption, λmin(∇ 2 f (x)) ≥ -L for all x ∈ R d , i.e. -first order stationary points are also -second order stationary points. (7) Using the random seed in our code, we note that ∇f (x0) = 0.011 for d = 10 and ∇f (x0) = 0.030 for d = 30.



log(T /δ) 2 ≥ 128ηT s t f (δ)F + 8c 2 β 1 (δ; F )ηt f (δ) max 16d m (lr(CT 2 /δ)) 2 F, T s F + ηT s t f (δ)F = 129ηT s t f (δ)F + 8c 2 β 1 (δ; F )ηt f (δ) max 16d m (lr(CT 2 /δ)) 2 F, T s F .

setting α = 128t f (δ) in Eq. (3) and choosing η such thatc 1 Lη 2 χ 3 d m ≤ η α = η 128t f (δ) , it follows that η 128t f (δ) + c 1 Lη 2 χ 3 d m t∈J ∇f (x t )

(x τ ) -f (x 0 ) ≤ η 4 2 + t f (δ)ηu 4 ρ 2 • c 1 d 3 log T δ t f (δ)Lη 2 u 4 ρ 2 • c 1 d 4 log T δ ηc 1 r 2 (128t f (δ) + ηL) log T δ + t f (δ)c 1 Lη 2 r 2 .

Suppose we pick T s as prescribed in Proposition 5. Suppose in addition we pick r 1 t f (δ) + c 1 log(T /δ) + c 1 ) Suppose also that we choose η such that+ τ 1 ηu 4 ρ 2 • c 1 d 3 log T δ Lη 2 u 4 ρ 2 • c 1 d 4 log T δ ηc 1 r 2 (128t f (δ) + ηL) log T δ + τ 1 c 1 Lη 2 r 2 .with probability at least 1 -(5τ1+4)δ T .By our choice of u, we know thatt f (δ)ηu 4 ρ 2 • c 1 d 3 log T δ t f (δ)Lη 2 u 4 ρ 2 • c 1 d 4 log T δ ηc 1 r 2 (128t f (δ) + ηL) log T δ + t f (δ)c 1 Lη 2 r 2

\[
f(x_{\tau_1}) - f(x_0) \le \tau_1 \frac{c_5^2}{64}\, \eta^3 t_f(\delta)^2 L^2 u^2 d^2 \rho \left(\log\frac{T}{\delta}\right)^{2} + 2 \log(T/\delta)\, r^2 + T \eta u^4 \rho^2 \cdot c_1 d^3 \log\frac{T}{\delta} + T L \eta^2 u^4 \rho^2 \cdot c_1 d^4 \log\frac{T}{\delta} + \eta c_1 r^2 \left(128\, t_f(\delta) + \eta L\right) \log\frac{T}{\delta} + T c_1 L \eta^2 r^2.
\]
Then, by a union bound, it follows that with probability at least 1 - 9δ, U_2 = (f(x_{τ_1}) - f(x_0)) + ⋯ is bounded accordingly. By the union bound, with probability at least 1 - 16δ,
\[
f(x_{\tau_{N_s}}) - f(x_0) = U_1 + U_2 \le \frac{T}{T_s}\left(-0.1 F\right) + \frac{\eta \epsilon^2}{2} + \frac{F}{40}.
\]
Recalling our choice of F in Lemma 25, it suffices to choose η such that
\[
F \le \frac{1}{2}\left(\frac{\psi}{60\, c_9\, \iota\, \rho \log(T/\delta)}\right)^{2} \frac{1}{t_f(\delta) \left(129 + 8 c_2 \beta_1(\delta; F) \left(16 \left(l_r(CT^2/\delta)\right)^2 + 1\right)\right)}.
\]

Figure 1: Performance on the toy octopus function, with τ = e, L = e, γ = 1. (Here, τ, L, and γ are parameters determining the properties of f. Our parameter choice is consistent with that in Zhang et al. (2022); see Du et al. (2017) for details on the definitions of τ, L, and γ.)
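The octopus function used for Figure 1 has a lengthy definition (see Du et al. (2017)); as a simpler stand-in, the sketch below runs a perturbed two-point method on the toy saddle f(x, y) = x² - y² + y⁴/4, initialized exactly at the saddle (0, 0), where unperturbed gradient descent would stall. All names and parameter values here are illustrative choices for this demonstration, not those used for the figure.

```python
import numpy as np

def saddle(v):
    """Toy test function: strict saddle at the origin,
    minima at (0, +/-sqrt(2)) with value -1."""
    x, y = v
    return x**2 - y**2 + 0.25 * y**4

def run(f, x0, eta, u, r, steps, rng):
    """Perturbed two-point (m = 1) zeroth-order descent."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        z = rng.standard_normal(x.size)
        g = (f(x + u * z) - f(x - u * z)) / (2.0 * u) * z  # two-point estimate
        xi = rng.standard_normal(x.size)
        xi *= r / np.linalg.norm(xi)                        # isotropic kick, radius r
        x = x - eta * g + xi
    return x

rng = np.random.default_rng(1)
x_final = run(saddle, [0.0, 0.0], eta=0.02, u=1e-4, r=1e-2, steps=2000, rng=rng)
```

Starting from the exact saddle, the isotropic kicks supply an initial component along the escape direction, which the negative curvature then amplifies, so the iterates leave the saddle and settle near one of the two minima.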

Selected comparison of convergence results to (ε, O(√(ρε)))-second order stationary points.

R^d, a careful examination of the argument in Proposition 5 would show that this results in an O(d²) rather than O(d) dependence in the sample complexity, incurring a heavy price on the overall sample complexity (an extra factor of d) when d is large.

Combining the bounds for W_{g_0}, W_p, W_H and W_u, it follows that
\[
\|x_{t+1}\| \le W_{g_0}(t+1) + W_p(t+1) + W_H(t+1) + W_u(t+1).
\]

\[
Z_{t,i}^{\top} H_{t,i} Z_{t,i} + Y_t.
\]
By our choice of η in Eq. (29), we can apply Lemma 16. Thus, by setting α = 128 t_f(δ) in Eq. (3), choosing η on the order of 1/(L t_f(δ) d χ³), and again using our choice of η in Eq. (29), we can apply Lemma 17 to get
\[
\|\nabla f(x_t)\| \le c_5\, t_f(\delta)\, \eta L\, u^2 d^2 \rho \log\frac{T}{\delta}.
\]


By our assumption, we know that T_s ≥ d/m. Thus, further simplifying indicates that it suffices for us to show
\[
\frac{1}{2}\left(\frac{\psi}{60\, c_9\, \iota\, \rho \log(T/\delta)}\right)^{2} \ge 129\,\eta T_s t_f(\delta) F + 8 c_2 \beta_1(\delta; F)\,\eta\, t_f(\delta) \max\left\{16\, T_s \left(l_r(CT^2/\delta)\right)^2 F,\; T_s F\right\}. \tag{51}
\]
By choosing F such that
\[
F \le \frac{1}{2}\left(\frac{\psi}{60\, c_9\, \iota\, \rho \log(T/\delta)}\right)^{2} \frac{1}{\eta T_s t_f(\delta) \left(129 + 8 c_2 \beta_1(\delta; F) \left(16 \left(l_r(CT^2/\delta)\right)^2 + 1\right)\right)},
\]
we see that Eq. (51) is satisfied.

Remark 2. Suppose without loss of generality that T_s = ι/(ηψ). Then, as a consequence of Lemma 25, the amortized progress is a decrease in function value of F over each window of T_s iterations.

In this section, we prove our main result. First, we need an additional result (Lemma 26) showing that, with high probability, we can bound the function value increase if a saddle appears within t_f(δ) iterations immediately after we have had T_s iterations after the previous saddle. Such a bound is necessary because our earlier result upper bounding the function increase over τ iterations (Lemma 18) covered only the case τ ≥ t_f(δ). Next, we state and prove Theorem 2, which is the precise version of Theorem 1 in the main text.

Lemma 26 (Function change for small τ). Let c_1 > 0, c_4 > 0, c_5 > 0, and C_1 ≥ 1 be the absolute constants defined in the statements of the previous lemmas. Let δ ∈ (0, 1/e], and suppose τ < t_f(δ). Let J denote the interval {0, 1, ..., τ - 1}, where τ < t_f(δ). Suppose we choose η as prescribed, and suppose also that we pick u, r and η as prescribed in the statement of Proposition 4. Suppose that min_{t ∈ J} ‖∇f(x_t)‖ ≤ ε. Then, on the stated high-probability event, we have a corresponding upper bound on the function value change.

Theorem 2. With probability at least 1 - 22δ, there are at least T/2 ε-approximate second order stationary points.

Proof. Consider defining the sequence of stopping times τ_1 ≤ τ_2 ≤ ⋯ at which successive ε-saddle points are encountered (with τ_i set to T if no further saddle occurs). We note that if τ_i = T, then τ_j = T for any j > i.
Let N_s be the (random) number of saddle points encountered in T iterations. We observe that we can decompose the function change f(x_{τ_{N_s}}) - f(x_0) into two parts, U_1 and U_2.

We first consider U_1. Letting x_j := x_T for any j ≥ T, we can write U_1 as a sum of the changes f(x_{τ_i + T_s}) - f(x_{τ_i}). Now, by Eq. (30), observe that with probability at least 1 - O(δ/T) (note T_s ≥ 4), for any 1 ≤ i ≤ T/T_s, the increase over such a window is at most M_{u,r,T_s}.

Suppose we pick u, r such that M_{u,r,T_s} ≤ 0.1F. Recall from Proposition 5 that with probability at least 1/3 - 13T_sδ/T,
\[
\left(f(x_{\tau_i + T_s}) - f(x_{\tau_i})\right) 1_{\tau_i < T} \le -F.
\]
Choosing δ such that 1/3 - 13T_sδ/T ≥ 0.3, and letting μ = 0.1F, we note that
\[
|-F + \mu| = 0.9F \ge \frac{0.7}{0.3}\, 0.2F \ge \frac{0.7}{0.3}\left(M_{u,r,T_s} + \mu\right).
\]
Now, let E_{τ_i} denote the bad event on which these bounds fail. We know that E_{τ_i} has probability at most 6T_sδ/T. Let E_τ := ∪_{i=1}^{T/T_s} E_{τ_i}, so that P(E_τ) ≤ 6δ. Then, by applying the weakened supermartingale inequality in Proposition 3, we obtain a high-probability bound on U_1.

Supposing for contradiction that there are at least T/4 saddles, we must then have a function value decrease that is too large, where we may ensure the last inequality by picking T/T_s sufficiently large; note that our choice of T ensures this. Thus, with probability at least 1 - 7δ, there are fewer than T/4 saddles.

Next, we bound the summand U_2, which collects the changes f(x_{τ_{i+1}}) - f(x_{τ_i + T_s}). Without loss of generality, we may analyze each of the summands f(x_{τ_{i+1}}) - f(x_{τ_i + T_s}) in the same way as we treat f(x_{τ_1}) - f(x_0). Let us then consider the summand f(x_{τ_1}) - f(x_0). There are two cases to consider.

1. The first is when τ_1 < t_f(δ). In this case, since we know that ‖∇f(x_{τ_1})‖ ≤ ε (as x_{τ_1} is an ε-saddle point), it follows by Lemma 26 that the function value increase is correspondingly bounded with probability at least 1 - (4t_f(δ) + 4)δ/T.

2. The second case is when τ_1 ≥ t_f(δ). In this case, by Lemma 18, we have that
\[
f(x_{\tau_1}) - f(x_0) \le \tau_1 \frac{c_5^2}{64}\, \eta^3 t_f(\delta)^2 L^2 u^2 d^2 \rho \left(\log\frac{T}{\delta}\right)^{2} + 2 \log(T/\delta)\, r^2 + \cdots
\]
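The counting logic above can be summarized in one display (a paraphrase in the notation of the proof, not a verbatim statement): each encountered saddle is followed, with constant probability per window, by a decrease of order F over T_s iterations, while the possible increases are kept below a small fraction of F, so too many saddles would contradict the lower bound f* on f:

```latex
f^* - f(x_0) \;\le\; f(x_{\tau_{N_s}}) - f(x_0) \;=\; U_1 + U_2
\;\le\; \frac{T}{T_s}\,(-0.1F) \;+\; \frac{\eta \epsilon^2}{2} \;+\; \frac{F}{40}.
```

Hence the number of saddles N_s is bounded in terms of (f(x_0) - f*)/F, and choosing T sufficiently large relative to T_s leaves at least T/2 iterates that are ε-approximate second order stationary points.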

