FASTER GRADIENT-FREE METHODS FOR ESCAPING SADDLE POINTS

Abstract

Escaping saddle points has become an important research topic in nonconvex optimization. In this paper, we study the case where computing explicit gradients is expensive or even infeasible, and only function values are accessible. Two types of gradient-free (zeroth-order) methods, based on random perturbation and on negative curvature finding, have been proposed to escape saddle points efficiently and converge to an ϵ-approximate second-order stationary point. In the first-order setting, Nesterov's accelerated gradient descent (AGD) has been verified to escape saddle points faster than gradient descent (GD). However, whether AGD can accelerate gradient-free methods remains unstudied. To answer this question, we propose accelerated variants of both types of gradient-free methods for escaping saddle points. We show that our algorithms find an ϵ-approximate second-order stationary point with Õ(1/ϵ^{1.75}) iteration complexity and Õ(d/ϵ^{1.75}) oracle complexity, where d is the problem dimension. Thus, our methods achieve a convergence rate comparable to their first-order counterparts and have smaller oracle complexity than prior derivative-free methods for finding second-order stationary points.

1. INTRODUCTION

Non-convex optimization has received increasing attention in recent years because many modern machine learning (ML) and deep learning (DL) tasks can be formulated as optimizing models with non-convex loss functions. In this paper, we consider non-convex optimization of the following general form:

min_{x∈R^d} f(x), (1)

where f(x) is differentiable and has Lipschitz continuous gradient and Hessian. We focus on situations where first-order information (the gradient) is not always directly accessible. Many machine learning and deep learning applications encounter settings where computing explicit gradients is expensive or even infeasible, such as black-box adversarial attacks on deep neural networks (Papernot et al., 2017; Madry et al., 2018; Chen et al., 2017; Bhagoji et al., 2018; Tu et al., 2019), policy search in reinforcement learning (Salimans et al., 2017; Choromanski et al., 2018; Jing et al., 2021), and hyper-parameter optimization (Bergstra & Bengio, 2012). Therefore, zeroth-order optimization, which utilizes only zeroth-order information (function values) to solve the non-convex problem Eq. (1), has gained increasing attention in machine learning. In general, the goal of the non-convex problem Eq. (1) is to find an ϵ-approximate first-order stationary point (FOSP, see Definition 3), since finding the global minimum is NP-hard. Gradient descent is proven to be an optimal first-order algorithm for finding an ϵ-approximate FOSP of problem Eq. (1) under the gradient Lipschitz assumption (Carmon et al., 2020; 2021), requiring a gradient query complexity of Θ(1/ϵ^2). However, for non-convex functions, FOSPs can be local minima, global minima, or saddle points. The ubiquity of saddle points makes high-dimensional non-convex optimization problems extremely difficult and can lead to highly suboptimal solutions (Jain et al., 2017; Sun et al., 2018).
Therefore, many recent works have focused on escaping saddle points and on the properties of converging to an ϵ-approximate second-order stationary point (SOSP, see Definition 4) using first-order methods. A recent line of work showed that first-order methods can efficiently escape saddle points and converge to SOSPs. Specifically, Jin et al. (2017) proposed the perturbed gradient descent (PGD) algorithm, which adds uniform random perturbation to standard gradient descent and finds an ϵ-approximate SOSP in Õ(log^4 d/ϵ^2) gradient queries. Under the zeroth-order setting, Jin et al. (2018a) proposed a zeroth-order perturbed stochastic gradient descent (ZPSGD) method, which studied the power of Gaussian smoothing and stochastic perturbed gradients for finding local minima. The role of Gaussian smoothing is to reduce zeroth-order optimization to stochastic first-order optimization of a Gaussian-smoothed version of problem Eq. (1). They proved that their method finds an ϵ-approximate SOSP with a function query complexity of Õ(d^2/ϵ^5). Vlatakis-Gkaragkounis et al. (2019) proposed the perturbed approximate gradient descent (PAGD) method using the forward-difference version of the coordinate-wise gradient estimator, which finds an ϵ-approximate SOSP in Õ(d log^4 d/ϵ^2) function queries. Recently, Lucchi et al. (2021) proposed a random search power iteration (RSPI) method, which alternates between a random search step and a zeroth-order power iteration step, and finds an (ϵ, ϵ^{2/3})-approximate SOSP (∥∇f(x)∥ ≤ ϵ, λ_min(∇^2 f(x)) ≥ -ϵ^{2/3}) in O(d log d/ϵ^{8/3}) function queries. Zhang et al. (2022) proposed a zeroth-order gradient descent method with zeroth-order negative curvature finding that finds an (ϵ, δ)-approximate SOSP (∥∇f(x)∥ ≤ ϵ, λ_min(∇^2 f(x)) ≥ -δ) in O(d/ϵ^2 + d log d/δ^{3.5}) function queries.
Although gradient descent achieves an optimal convergence rate for finding FOSPs under the gradient Lipschitz assumption, potential improvements are achievable under an additional Hessian Lipschitz assumption (Carmon et al., 2021). Nesterov's AGD, combined with some special mechanisms, has been proved to find ϵ-approximate FOSPs with lower query complexity. Carmon et al. (2017) proposed a variant of Nesterov's AGD with a "convex until guilty" mechanism, which finds an ϵ-approximate FOSP with gradient query complexity O(1/ϵ^{7/4} log(1/ϵ)). Recently, Li & Lin (2022) proposed a restarted accelerated gradient descent method, which adds a restart mechanism to Nesterov's AGD and finds an ϵ-approximate FOSP with gradient query complexity O(1/ϵ^{7/4}). For finding SOSPs, Nesterov's AGD is also proved to be more efficient than GD. Jin et al. (2018b) studied a variant of Nesterov's AGD named perturbed AGD and proved that it finds an ϵ-approximate SOSP in Õ(log^6 d/ϵ^{7/4}) gradient queries. Their method adds two algorithmic features to Nesterov's AGD, random perturbation and negative curvature exploitation, to ensure the monotonic decrease of the Hamiltonian function (see Eq. (4)). Allen-Zhu & Li (2018) proposed a first-order negative curvature finding framework named Neon2 that finds the most negative curvature direction efficiently. Combining Neon2 with the CDHS method of Carmon et al. (2018) finds an ϵ-approximate SOSP in Õ(log d/ϵ^{7/4}) gradient queries, improving the complexity of the perturbed AGD method by a factor of poly(log d) thanks to the negative curvature finding subroutine. Recently, Zhang & Li (2021) proposed a single-loop algorithm that achieves the same function query complexity by replacing the random perturbation step in perturbed AGD with accelerated negative curvature finding.
Given the advantages of Nesterov's AGD for finding SOSPs in first-order optimization, it is natural to design AGD-based zeroth-order methods that find SOSPs with smaller function query complexity. To the best of our knowledge, this problem remains open in zeroth-order optimization.

Contributions

The main contributions of this paper are summarized as follows.

• We study the complexity of two AGD-based zeroth-order methods for finding ϵ-approximate SOSPs. We first study a zeroth-order version of the perturbed AGD method (Algorithm 1) using the central finite difference version of the coordinate-wise gradient estimator, which provably has a lower approximation error than its forward-difference counterpart. The total function query complexity of Algorithm 1 for finding an ϵ-approximate SOSP is Õ(d log^6 d/ϵ^{7/4}).

• Due to the efficiency of negative curvature finding for locating the most negative curvature direction near a saddle point, we further study a zeroth-order version of the perturbed AGD with an accelerated negative curvature finding subroutine (Algorithm 3), which uses the finite difference of two coordinate-wise gradient estimators to approximate the Hessian-vector product. We show that Algorithm 3 further improves the function query complexity of Algorithm 1 by a factor of poly(log d).

• Finally, we conduct several empirical experiments to verify the efficiency and effectiveness of our methods in escaping saddle points.

2. PRELIMINARIES

2.1. NOTATIONS

Throughout this paper, we use bold uppercase letters A, B to denote matrices and bold lowercase letters x, y to denote vectors. We use ∥·∥ to denote the Euclidean norm of a vector and the spectral norm of a matrix. We use B_x(r) to denote the ℓ_2 ball with radius r centered at point x. We use Õ(·) to hide absolute constants and log factors.

2.2. DEFINITIONS

Definition 1. For a differentiable nonconvex function f : R^d → R, f is ℓ-Lipschitz smooth if ∀x, y ∈ R^d, ∥∇f(x) - ∇f(y)∥ ≤ ℓ∥x - y∥.

Definition 2. For a twice differentiable nonconvex function f : R^d → R, f is ρ-Hessian Lipschitz if ∀x, y ∈ R^d, ∥∇^2 f(x) - ∇^2 f(y)∥ ≤ ρ∥x - y∥.

Definition 3. For a differentiable function f, we say x is an ϵ-approximate first-order stationary point if ∥∇f(x)∥ ≤ ϵ.

Definition 4. For a twice differentiable function f, we say x is an ϵ-approximate second-order stationary point if ∥∇f(x)∥ ≤ ϵ and λ_min(∇^2 f(x)) ≥ -√(ρϵ).

2.3. ZEROTH-ORDER GRADIENT ESTIMATOR

In this subsection, we introduce the central difference coordinate-wise gradient estimator, which is widely studied in the zeroth-order optimization literature (Ji et al., 2019; Vlatakis-Gkaragkounis et al., 2019; Lucchi et al., 2021):

∇̃f(x) = Σ_{i=1}^d [(f(x + µe_i) - f(x - µe_i))/(2µ)] e_i, (2)

where e_i is the i-th standard basis vector, with 1 at its i-th coordinate and 0 elsewhere. When analyzing the approximation error of this estimator, previous work only exploited the smoothness of the gradient of f, not the Hessian Lipschitz property (a basic assumption for analyzing second-order convergence). To fill this gap, we establish the following lemma.

Lemma 1. For a twice differentiable function f : R^d → R, assume that f is ρ-Hessian Lipschitz. Then for any smoothing parameter µ and any x ∈ R^d, we have ∥∇̃f(x) - ∇f(x)∥^2 ≤ (1/36)ρ^2 dµ^4.

Note that, under the Hessian Lipschitz assumption, the central difference has a lower approximation error than the O(ℓ^2 dµ^2) error obtained under the ℓ-smooth assumption (Ji et al., 2019).
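As an illustration, the estimator in Eq. (2) can be sketched in a few lines of NumPy; it uses 2d function queries per estimate, and the helper name zo_gradient is ours rather than the paper's:

```python
import numpy as np

def zo_gradient(f, x, mu):
    """Central-difference coordinate-wise gradient estimator, Eq. (2).

    Uses 2d function queries. Under the rho-Hessian-Lipschitz assumption,
    Lemma 1 bounds the squared error by (1/36) * rho**2 * d * mu**4.
    """
    d = x.shape[0]
    grad = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = mu  # perturb only the i-th coordinate
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * mu)
    return grad
```

For quadratic functions the third derivative vanishes, so the central difference is exact up to floating-point error; for general f the error scales as µ^2, in line with Lemma 1.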

2.4. ZEROTH-ORDER HESSIAN-VECTOR PRODUCT ESTIMATOR

In this subsection, we show how to approximate the Hessian-vector product when only zeroth-order information is accessible. By the Hessian Lipschitz property, it is easy to check that the Hessian-vector product ∇^2 f(x)·v can be approximated by the difference of two gradients, ∇f(x + v) - ∇f(x), with approximation error up to O(∥v∥^2) for v with small magnitude. On the other hand, by Lemma 1, ∇f(x + v) and ∇f(x) can be approximated with high accuracy by the central difference coordinate-wise gradient estimator. We therefore define the following zeroth-order Hessian-vector product estimator, previously studied in (Ye et al., 2018; Lucchi et al., 2021; Zhang et al., 2022):

H̃_f(x)v = ∇̃f(x + v) - ∇̃f(x) (3)
= Σ_{i=1}^d [(f(x + v + µe_i) - f(x + v - µe_i))/(2µ)] e_i - Σ_{i=1}^d [(f(x + µe_i) - f(x - µe_i))/(2µ)] e_i.

The notation H̃_f(x) can be viewed as the Hessian matrix of f at x with small perturbations; we do not need its explicit expression, since we only need its approximation error, which is established in Lemma 2.

Lemma 2 (Zhang et al. (2022)). For a twice differentiable function f : R^d → R, assume that f is ρ-Hessian Lipschitz. Then for any smoothing parameter µ and any x ∈ R^d, we have ∥H̃_f(x)v - ∇^2 f(x)v∥ ≤ ρ(∥v∥^2/2 + √d µ^2/3).
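Given a gradient estimator, the Hessian-vector product estimator of Eq. (3) is a one-liner. The sketch below (function names ours) rebuilds the gradient estimator so the block is self-contained:

```python
import numpy as np

def zo_gradient(f, x, mu):
    # central-difference coordinate-wise gradient estimator, Eq. (2)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = mu
        g[i] = (f(x + e) - f(x - e)) / (2.0 * mu)
    return g

def zo_hvp(f, x, v, mu):
    """Zeroth-order Hessian-vector product estimator, Eq. (3):
    the difference of two estimated gradients approximates Hessian(x) @ v,
    with error at most rho * (||v||^2 / 2 + sqrt(d) * mu^2 / 3) by Lemma 2."""
    return zo_gradient(f, x + v, mu) - zo_gradient(f, x, mu)
```

Each call costs 4d function queries (two gradient estimates of 2d queries each).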

2.5. HAMILTONIAN

The following function, which takes the form of a Hamiltonian, was proposed by Jin et al. (2018b) to tackle the lack of monotonic decrease of the function value for momentum-based algorithms in the nonconvex setting:

E_t = f(x_t) + (1/(2η))∥v_t∥^2, (4)

where v_t = x_t - x_{t-1} is the momentum.
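The Hamiltonian of Eq. (4) is the quantity the analysis tracks in place of f itself; a direct transcription (assuming a function-value oracle f and the momentum v_t = x_t - x_{t-1}):

```python
import numpy as np

def hamiltonian(f, x_t, v_t, eta):
    """Hamiltonian E_t = f(x_t) + ||v_t||^2 / (2*eta), Eq. (4):
    the function value plus a kinetic-energy term from the momentum."""
    return f(x_t) + float(v_t @ v_t) / (2.0 * eta)
```

Lemmas 3 and 4 below show that this quantity decreases monotonically (up to a small smoothing error) even when f itself does not.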

3. ALGORITHM DESCRIPTION

In this section, we propose two novel Nesterov's accelerated method based algorithms that can escape saddle points and converge to an ϵ-approximate SOSP using only zeroth-order oracles.

3.1. ZEROTH-ORDER PERTURBED ACCELERATED GRADIENT DESCENT

In this subsection, we introduce the zeroth-order perturbed accelerated gradient descent method in Algorithm 1. The algorithm consists of three parts: the random perturbation step, the accelerated gradient descent steps, and the negative curvature exploitation step. The random perturbation step is called when the gradient is small and no perturbation has been added over the past T iterations. Let κ = ℓ/√(ρϵ), and set the parameters of Algorithm 1 as follows:

η = 1/(4ℓ), θ = 1/(4√κ), γ = θ^2/η, s = γ/(4ρ), T = √κ·χc, r = ηϵ·χ^{-5}c^{-8}, (5)

where c is a constant and χ = max{1, log(dℓ∆_f/(ρϵδ))} with ∆_f := f(x_0) - f(x*) < ∞. Since we only have access to zeroth-order information, we can verify whether a point x is an ϵ-approximate FOSP using the coordinate-wise gradient estimator, based on the following fact.

Algorithm 1 Zeroth-Order Perturbed Accelerated Gradient Descent
1: v_0 ← 0, t_perturb ← 0
2: for t = 0, 1, . . . do
3:   if ∥∇̃f(x_t)∥ ≤ 3ϵ/4 and t - t_perturb > T then
4:     x_t ← x_t + ξ_t, ξ_t ∼ Unif(B_0(r)), t_perturb ← t
5:   y_t ← x_t + (1 - θ)v_t
6:   x_{t+1} ← y_t - η∇̃f(y_t)
7:   v_{t+1} ← x_{t+1} - x_t
8:   if Eq. (6) holds then
9:     (x_{t+1}, v_{t+1}) ← NCE(x_t, v_t, s)

Algorithm 2 Negative Curvature Exploitation NCE(x_t, v_t, s) (Jin et al., 2018b)
1: if ∥v_t∥ ≥ s then
2:   x_{t+1} ← x_t
3: else
4:   δ = s·v_t/∥v_t∥
5:   x_{t+1} ← arg min_{x∈{x_t+δ, x_t-δ}} f(x)
6: Return (x_{t+1}, 0)

Proposition 1. Assume that f is ρ-Hessian Lipschitz. With the smoothing parameter µ in Eq. (2) chosen such that µ ≤ (3ϵ/(2ρ√d))^{1/2}, we can conclude that if ∥∇̃f(x)∥ ≤ 3ϵ/4, then ∥∇f(x)∥ ≤ ϵ; if ∥∇̃f(x)∥ > 3ϵ/4, then ∥∇f(x)∥ ≥ ϵ/2.

The proof of this proposition follows directly from Lemma 1. The random perturbation is selected uniformly at random from the ℓ_2-ball with radius r. The second part of Algorithm 1 consists of Nesterov's accelerated gradient descent steps with gradients estimated by Eq. (2).
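A compact, runnable sketch of Algorithm 1 is given below. It follows the parameter formulas of Eq. (5), but the constants r, T, and µ are illustrative defaults rather than the theoretical choices, and the negative curvature exploitation branch is skipped in the degenerate case v_t = 0 (where the check of Eq. (6) holds only with equality). Treat this as a reading aid, not the authors' implementation:

```python
import numpy as np

def zo_gradient(f, x, mu):
    # central-difference coordinate-wise gradient estimator, Eq. (2)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = mu
        g[i] = (f(x + e) - f(x - e)) / (2.0 * mu)
    return g

def zo_pagd(f, x0, eps, ell, rho=1.0, r=1e-3, T_bar=20, mu=1e-4,
            max_iter=500, rng=None):
    # Sketch of Algorithm 1 (zeroth-order perturbed AGD).
    rng = np.random.default_rng() if rng is None else rng
    eta = 1.0 / (4.0 * ell)                                   # Eq. (5)
    theta = 1.0 / (4.0 * np.sqrt(ell / np.sqrt(rho * eps)))   # 1/(4*sqrt(kappa))
    gamma = theta ** 2 / eta
    s = gamma / (4.0 * rho)
    x = x0.astype(float)
    v = np.zeros_like(x)
    t_perturb = -T_bar - 1
    for t in range(max_iter):
        # perturb when the estimated gradient is small and no recent perturbation
        if (np.linalg.norm(zo_gradient(f, x, mu)) <= 0.75 * eps
                and t - t_perturb > T_bar):
            xi = rng.standard_normal(x.shape)  # uniform sample from B_0(r)
            xi *= r * rng.uniform() ** (1.0 / len(x)) / np.linalg.norm(xi)
            x = x + xi
            t_perturb = t
        y = x + (1.0 - theta) * v              # momentum extrapolation
        gy = zo_gradient(f, y, mu)
        x_next = y - eta * gy                  # zeroth-order gradient step
        v_next = x_next - x
        # Eq. (6): approximate negative curvature observed between x and y
        if (np.linalg.norm(v) > 0.0
                and f(x) <= f(y) + gy @ (x - y)
                - 0.5 * gamma * np.linalg.norm(x - y) ** 2):
            if np.linalg.norm(v) >= s:         # NCE, Algorithm 2
                x_next = x
            else:
                delta = s * v / np.linalg.norm(v)
                x_next = min((x + delta, x - delta), key=f)
            v_next = np.zeros_like(x)
        x, v = x_next, v_next
    return x
```

On a smooth convex quadratic, the negative curvature branch never fires and the iterates simply follow accelerated descent into a small neighborhood of the minimizer.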
The negative curvature exploitation step is called when the following condition holds:

f(x_t) ≤ f(y_t) + ⟨∇̃f(y_t), x_t - y_t⟩ - (γ/2)∥y_t - x_t∥^2. (6)

If this condition holds, the function has an approximately large negative curvature between x_t and y_t. In this case, the accelerated gradient step may not decrease the value of the Hamiltonian, so we call the negative curvature exploitation step to further decrease it. Specifically, when Eq. (6) does not hold, we have the following lemma.

Lemma 3. Assume that f(·) is ℓ-smooth and ρ-Hessian Lipschitz, and set the learning rate η ≤ 1/(4ℓ) and θ ∈ [2ηγ, 1/2]. Then, for each iteration t where Eq. (6) does not hold, we have:

E_{t+1} ≤ E_t - (θ/(2η))∥v_t∥^2 - (η/4)∥∇f(y_t)∥^2 + η·ρ^2 dµ^4/48.

On the other hand, when Eq. (6) holds, i.e., a negative curvature direction is observed, we have the following lemma.

Lemma 4. Assume that f(·) is ℓ-smooth and ρ-Hessian Lipschitz. Then, for each iteration t where Eq. (6) holds, we have:

E_{t+1} ≤ E_t - min{s^2/(2η), (1/4)γs^2 - ρs^3 - ρ^2 dµ^4/(9γ)}.

Remark 1. The results in Lemmas 3 and 4 are similar to the ones in Jin et al. (2018b), with additional system error terms induced by the smoothing parameter µ. Together, Lemmas 3 and 4 ensure the monotonic decrease of the Hamiltonian in each iteration as long as the smoothing parameter µ is sufficiently small.

We then set T = √κ·χc = Θ(√κ) and denote E := √(ϵ^3/ρ)·χ^{-5}c^{-7} = Θ(√(ϵ^3/ρ)). Based on Lemmas 3 and 4, we can further prove that when the current approximate gradient is large, i.e., ∥∇̃f(x_t)∥ ≥ 3ϵ/4 (or, equivalently, ∥∇f(x_t)∥ ≥ ϵ/2, according to Lemma 1), we have the following average decrease lemma.

Lemma 5 (Large gradient). If ∥∇̃f(x_τ)∥ ≥ 3ϵ/4 with µ ≤ O((3ϵ …

On the other hand, when the current approximate gradient is small and no perturbation has been added over the past T iterations, we add a uniform random perturbation in B_0(r).
If there exists a large negative curvature direction at the current point, we have the following lemma.

Lemma 6 (Negative curvature). Suppose ∥∇̃f(x_t)∥ ≤ 3ϵ/4 (thus ∥∇f(x_t)∥ ≤ ϵ), λ_min(∇^2 f(x_t)) ≤ -√(ρϵ), and no perturbation is added in iterations [t - T, t). Then, running Algorithm 1, we have E_T - E_0 ≤ -E with probability at least 1 - δE/(2∆_f).

Utilizing the above lemmas, we obtain the following main result.

Theorem 1. Assume that f(·) is ℓ-smooth and ρ-Hessian Lipschitz. For any δ > 0, ϵ ≤ ℓ^2/ρ, and f(x_0) - f* ≤ ∆_f, if we set the hyperparameters as in Eq. (5) and choose µ = Õ(ϵ^{1/2}/d^{1/4}) in Lines 3 and 8 and µ = Õ(ϵ^{13/8}/d^{1/2}) in Line 6 of Algorithm 1, then with probability at least 1 - δ, one of the iterates x_t will be an ϵ-approximate SOSP. The total number of iterations is no more than

O((∆_f ℓ^{1/2}ρ^{1/4}/ϵ^{7/4})·log^6(dℓ∆_f/(ρϵδ))),

and the total number of function queries (oracle complexity) is no more than

O((d∆_f ℓ^{1/2}ρ^{1/4}/ϵ^{7/4})·log^6(dℓ∆_f/(ρϵδ))).

Proof outline. We first prove two monotonic descent lemmas (Lemmas 3 and 4) for the Hamiltonian in each iteration and an improve-or-localize property in Appendix B. Next, in Appendix C, we prove that the Hamiltonian decreases by E within T iterations in both the large gradient and the negative curvature scenarios.

Remark 2. Note that Theorem 1 only ensures that, with high probability, one of the iterates will be an ϵ-approximate SOSP. It is natural to add a termination condition to make the algorithm more practical: once the precondition of the random perturbation step is reached, record the current iterate x_{t_0} and the current value of the Hamiltonian E_{t_0} before adding the random perturbation. If the decrease of the Hamiltonian is less than E after T iterations, then, with high probability, x_{t_0} is an ϵ-approximate SOSP according to Lemma 6.
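The termination rule of Remark 2 can be stated as a small helper. This is our paraphrase, assuming the Hamiltonian values E_t are recorded at every iteration:

```python
def sosp_certified(E_history, t0, T_bar, E_drop):
    """Stopping rule of Remark 2: if the Hamiltonian recorded at t0 (just
    before a perturbation) has not decreased by at least E_drop within
    T_bar further iterations, then x_{t0} is an approximate SOSP with
    high probability (Lemma 6)."""
    if len(E_history) <= t0 + T_bar:
        return False  # not enough iterations have elapsed yet
    return E_history[t0] - E_history[t0 + T_bar] < E_drop
```

Here E_drop plays the role of E = Θ(√(ϵ^3/ρ)) and T_bar the role of T = Θ(√κ) from Eq. (5).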

3.2. ZEROTH-ORDER PERTURBED ACCELERATED GRADIENT DESCENT WITH ACCELERATED NEGATIVE CURVATURE FINDING

In this subsection, we introduce how to utilize negative curvature finding to accelerate escaping saddle points. The main task of negative curvature finding is to find the direction of the approximately most negative eigenvector near a saddle point; adding a perturbation along this direction yields a more efficient decrease of the function value. Classical methods for computing the most negative eigenvector direction, such as the power method and the Lanczos method, require Hessian-vector products. Since we only have access to zeroth-order information, an efficient way to approximate the Hessian-vector product is the zeroth-order estimator in Eq. (3). The accelerated negative curvature finding subroutine is self-contained in Lines 11-13 of Algorithm 3 when ζ ≠ 0. The following lemma states that accelerated negative curvature finding with the zeroth-order Hessian-vector product estimator finds a negative curvature direction in almost the same iteration complexity as the Lanczos method.

Lemma 7. Suppose ∥∇̃f(x_t)∥ ≤ 3ϵ/4, λ_min(∇^2 f(x_t)) ≤ -√(ρϵ), and no perturbation is added in iterations [t - T', t]. For any 0 < δ_0 < 1, let κ = ℓ/√(ρϵ), and set the parameters as follows:

η = 1/(4ℓ), θ = 1/(4√κ), γ = θ^2/η, s = γ/(4ρ), T' = 32√κ·log(ℓ√d/(δ_0√(ρϵ))), r' = (δ_0 ϵ/32)·√(π/(ρd)). (7)

Then, running Algorithm 3 for T' iterations after adding the random perturbation in Line 5, with probability at least 1 - δ_0 we have ê^T ∇^2 f(x_t) ê ≤ -√(ρϵ)/4.

Algorithm 3 Zeroth-Order Perturbed Accelerated Gradient Descent with Accelerated Negative Curvature Finding
1: t_perturb ← -T' - 1, y_0 ← x_0, x̄ ← x_0, ζ ← 0
2: for t = 0, 1, . . . do
3:   if ∥∇̃f(x_t)∥ ≤ 3ϵ/4 and t - t_perturb > T' then
4:     x̄ ← x_t
5:     x_t ← x̄ + ξ_t, ξ_t ∼ Unif(B_0(r'))
6:     y_t ← x_t, ζ ← ∇̃f(x̄), t_perturb ← t
7:   if t_perturb ≠ -T' - 1 and t - t_perturb = T' then
8:     ê ← (x_t - x̄)/∥x_t - x̄∥
9:     x_t ← arg min_{x∈{x̄ - (1/4)√(ϵ/ρ)·ê, x̄ + (1/4)√(ϵ/ρ)·ê}} f(x)
10:    y_t ← x_t, ζ ← 0
11:  x_{t+1} ← y_t - η(∇̃f(y_t) - ζ)
12:  v_{t+1} ← x_{t+1} - x_t
13:  y_{t+1} ← x_{t+1} + (1 - θ)v_{t+1}
14:  if t_perturb ≠ -T' - 1 and t - t_perturb < T' then
15:    (y_{t+1}, x_{t+1}) ← (x̄, x̄) + r'·((y_{t+1} - x̄)/∥y_{t+1} - x̄∥, (x_{t+1} - x̄)/∥x_{t+1} - x̄∥)
16:  else if f(x_{t+1}) ≤ f(y_{t+1}) + ⟨∇̃f(y_{t+1}), x_{t+1} - y_{t+1}⟩ - (γ/2)∥y_{t+1} - x_{t+1}∥^2 then
17:    (x_{t+1}, v_{t+1}) ← NCE(x_{t+1}, v_{t+1}, s)
18:    y_{t+1} ← x_{t+1} + (1 - θ)v_{t+1}

Above, x̄ denotes the iterate recorded just before the perturbation. Moving along the direction ê, the function value of f decreases further, according to the following lemma.

Lemma 8 (Zhang & Li (2021), Lemma 6). Suppose the function f : R^d → R is ℓ-smooth and ρ-Hessian Lipschitz. Then for any point x_0 ∈ R^d, if there exists a unit vector ê satisfying ê^T ∇^2 f(x_0) ê ≤ -√(ρϵ)/4, then

f(x_0 - (f'_ê(x_0)/(4|f'_ê(x_0)|))·√(ϵ/ρ)·ê) ≤ f(x_0) - (1/384)·√(ϵ^3/ρ),

where f'_ê(x_0) is the directional derivative along the direction ê.

Remark 3. In the first-order setting, f'_ê(x_0) = ⟨∇f(x_0), ê⟩. In the zeroth-order setting, however, the directional derivative cannot be computed directly. To tackle this problem, one can simply compare the function values of the two opposite directions, i.e., Line 9 of Algorithm 3.

Theorem 2. Assume that f(·) is ℓ-smooth and ρ-Hessian Lipschitz. For any δ > 0, ϵ ≤ ℓ^2/ρ, and f(x_0) - f* ≤ ∆_f, set the hyperparameters as in Eq. (7) with δ_0 = (δ/(384∆_f))·√(ϵ^3/ρ), and choose µ = Õ(ϵ^{1/2}/d^{1/4}) in Lines 3 and 16 and µ = Õ(ϵ^{13/8}/d^{1/2}) in Line 11 of Algorithm 3. Then, with probability at least 1 - δ, one of the iterates x_t of Algorithm 3 will be an ϵ-approximate SOSP.
The total number of iterations is no more than O((∆_f ℓ^{1/2}ρ^{1/4}/ϵ^{7/4})·log(ℓ√d∆_f/(δϵ^2))), and the total number of function queries (oracle complexity) is no more than O((d∆_f ℓ^{1/2}ρ^{1/4}/ϵ^{7/4})·log(ℓ√d∆_f/(δϵ^2))).

Remark 4. Similar to Remark 2, a practical termination condition can be added: if the function value decreases by less than (1/384)·√(ϵ^3/ρ) after T' iterations, then, with high probability, x_{t_0} is an ϵ-approximate SOSP according to Lemma 8.

Remark 5 (Proof outline). The main difference between Algorithm 1 and Algorithm 3 is the way in which random perturbations are added. Specifically, in Algorithm 1, we add a uniform random perturbation near a first-order stationary point. If it is a saddle point, then running the zeroth-order accelerated gradient descent for T = √κχc = Θ(√κ) steps decreases the value of the Hamiltonian function by E := √(ϵ^3/ρ)·χ^{-5}c^{-7} = Θ(√(ϵ^3/ρ)). In Algorithm 3, the perturbation is added along an approximate negative curvature direction, obtained by running T' = 32√κ·log(ℓ√d/(δ_0√(ρϵ))) steps of zeroth-order accelerated negative curvature finding (Lines 11-13). Moving along this negative curvature direction decreases the value of the Hamiltonian function by (1/384)·√(ϵ^3/ρ) (without the log factors appearing in E). Thus, the total function query complexity of Algorithm 3 is O(d·√κ·log(ℓ√d/(δ_0√(ρϵ)))·∆_f·√(ρ/ϵ^3)) = O((d∆_f ℓ^{1/2}ρ^{1/4}/ϵ^{7/4})·log(ℓ√d∆_f/(δϵ^2))).
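To illustrate how the zeroth-order Hessian-vector product estimator drives negative curvature finding, here is a deliberately simplified sketch that runs plain power iteration on (ℓI - ∇^2 f(x)) instead of the accelerated recursion of Lines 11-13. Plain power iteration needs on the order of κ rather than √κ steps, so this is only a conceptual stand-in, and all names are ours:

```python
import numpy as np

def zo_gradient(f, x, mu):
    # central-difference coordinate-wise gradient estimator, Eq. (2)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = mu
        g[i] = (f(x + e) - f(x - e)) / (2.0 * mu)
    return g

def zo_negative_curvature(f, x, ell, mu=1e-5, probe=1e-4, iters=200, rng=None):
    """Estimate the most-negative-curvature direction of f at x.

    Power iteration on the PSD matrix (ell*I - H), where the product H @ u
    is replaced by the zeroth-order Hessian-vector product estimator of
    Eq. (3) applied to the small probe vector probe * u."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)
    for _ in range(iters):
        hv = (zo_gradient(f, x + probe * u, mu) - zo_gradient(f, x, mu)) / probe
        u = ell * u - hv                 # one shifted power-iteration step
        u /= np.linalg.norm(u)
    return u
```

The returned unit vector u (known only up to sign) approximates the eigenvector of the smallest eigenvalue of the Hessian, the same object ê that Algorithm 3 extracts in Line 8.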

4. NUMERICAL EXPERIMENTS

In this section, we conduct several numerical experiments to verify the effectiveness of the proposed methods in escaping saddle points and their efficiency compared with existing methods. Specifically, we run zeroth-order perturbed accelerated gradient descent (Algorithm 1) and zeroth-order perturbed accelerated gradient descent with accelerated negative curvature finding (Algorithm 3) against the perturbed approximate gradient descent (PAGD) and the random search power iteration (RSPI) methods. All experiments are performed on a computer with a six-core Intel Core i5-10500 CPU. We first consider the cubic regularization problem (Liu et al., 2018), which is defined as:

4.1. CUBIC REGULARIZATION PROBLEM

min_{x∈R^d} f(x) := (1/2)x^T Ax + (1/6)∥x∥^3. (8)

Above, A is a randomly generated diagonal matrix with one diagonal entry equal to -1 and the remaining diagonal entries drawn uniformly from [1, 2]. As the dimension increases, the negative curvature directions that escape the saddle point thus become harder to explore. In this experiment, we set ϵ = 10^{-2}. To test the ability of different algorithms to escape saddle points, we initialize all algorithms at the strict saddle point x_0 = (0, . . . , 0)^T. We run Algorithms 1 and 3 and PAGD on the above cubic regularization problem starting from this strict saddle point. For Algorithms 1 and 3, the parameter settings basically follow Eq. (5) and Eq. (7). Specifically, we choose ϵ = 0.001, and the perturbation radii r and r' are set to 0.001. The Lipschitz constants ℓ and ρ are selected by a coarse grid search over {0.1, 1, 10, 100} × {0.1, 1, 10, 100}. Since all algorithms involve randomness, we run each algorithm multiple times and report the averaged function value versus the averaged number of function queries and the number of iterations in Figure 1. The results in Fig. 1 illustrate that Algorithms 1 and 3 escape saddle points in fewer iterations than PAGD and converge faster than PAGD. Moreover, across all dimensions, the numbers of iterations needed to escape saddle points are almost the same. This verifies the result of Lemmas 6 and 7 that the iteration counts of Algorithms 1 and 3 depend only logarithmically on the dimension d.

We then consider the following quartic function (Lucchi et al., 2021):

f(x_1, x_2, . . . , x_d, y) = (1/4)Σ_{i=1}^d x_i^4 - y·Σ_{i=1}^d x_i + (d/2)y^2, (9)

which has a strict saddle point at x_0 = (0, . . . , 0)^T and two global minima at (1, . . . , 1)^T and (-1, . . . , -1)^T.
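For reference, the two benchmark objectives of Eq. (8) and Eq. (9) are easy to reproduce; the constructor names below are ours:

```python
import numpy as np

def make_cubic_regularization(d, rng=None):
    """Cubic regularization objective, Eq. (8):
    f(x) = 0.5 * x^T A x + ||x||^3 / 6, with A diagonal, one entry -1 and
    the rest uniform on [1, 2]. The origin is a strict saddle point."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.uniform(1.0, 2.0, size=d)
    a[0] = -1.0  # the single negative-curvature direction
    return lambda x: float(0.5 * x @ (a * x) + np.linalg.norm(x) ** 3 / 6.0)

def make_quartic(d):
    """Quartic objective, Eq. (9), on z = (x_1, ..., x_d, y):
    f = 0.25 * sum(x_i^4) - y * sum(x_i) + (d/2) * y^2.
    Strict saddle at the origin; global minima with all x_i = +1 or all -1."""
    def f(z):
        x, y = z[:-1], z[-1]
        return float(0.25 * np.sum(x ** 4) - y * np.sum(x) + 0.5 * d * y ** 2)
    return f
```

Both functions evaluate to 0 at the origin, and descending along the negative-curvature axis of Eq. (8) or toward the all-ones point of Eq. (9) strictly decreases the objective, which is what the escape experiments measure.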
In this experiment, we run Algorithms 1 and 3, perturbed approximate gradient descent (PAGD), random search power iteration (RSPI), and ZO-GD-NCF on the above quartic function starting from its saddle point. We also run an accelerated version of RSPI, which replaces the finite difference gradient estimator in RSPI with the SPSA estimator (Spall et al., 1992). The parameter settings of PAGD are taken from Vlatakis-Gkaragkounis et al. (2019), and the parameters of RSPI are taken from the appendix of Lucchi et al. (2021). For Algorithms 1 and 3, the parameter settings basically follow Eq. (5) and Eq. (7). Specifically, we choose ϵ = 10^{-4}, and the perturbation radii r and r' are set to 0.01. The Lipschitz constants ℓ and ρ are selected by a coarse grid search over {10, 20, 100, 150, 200} × {0.1, 1, 10}. Since all algorithms involve randomness, we run each algorithm multiple times and report the averaged function value versus the averaged number of function queries in Figure 2. The results in Fig. 2 illustrate that both Algorithms 1 and 3 efficiently escape saddle points and converge quickly to the global minimum. Note that, for all dimensions, Algorithms 1 and 3 escape saddle points with fewer function queries than PAGD. This verifies the theoretical result that Algorithms 1 and 3 take Θ(√κ) iterations to escape saddle points when the initial point is a saddle point, while PAGD takes Θ(κ) iterations. For high-dimensional problems, the computational cost of RSPI for escaping saddle points is expensive; in contrast, RSPI with the SPSA estimator is much more efficient.

5. CONCLUSION

In this paper, we study the complexity of two zeroth-order AGD-based algorithms for escaping saddle points and converging to SOSPs. The first method is a zeroth-order version of the perturbed AGD that uses the central finite difference version of the coordinate-wise gradient estimator. The second method adds an accelerated negative curvature finding subroutine that approximates Hessian-vector products via the finite difference of two coordinate-wise gradient estimators. Both methods improve the function query complexity of prior zeroth-order methods for converging to SOSPs.

APPENDIX A AUXILIARY LEMMAS

Lemma 9 (Nesterov (2018), Lemma 1.2.3 & 1.2.4). If f is ℓ-Lipschitz smooth, then for all x, y ∈ R^d,
|f(y) - f(x) - ∇f(x)^T(y - x)| ≤ (ℓ/2)∥y - x∥^2.
If f is ρ-Hessian Lipschitz, then for all x, y ∈ R^d,
∥∇f(y) - ∇f(x) - ∇^2 f(x)(y - x)∥ ≤ (ρ/2)∥y - x∥^2, and
|f(y) - f(x) - ∇f(x)^T(y - x) - (1/2)(y - x)^T ∇^2 f(x)(y - x)| ≤ (ρ/6)∥y - x∥^3.

Lemma 1. If f is ρ-Hessian Lipschitz, then for any smoothing parameter µ and any x ∈ R^d, we have
∥∇̃f(x) - ∇f(x)∥^2 ≤ (1/36)ρ^2 dµ^4. (10)

Proof. We have
∇̃f(x) - ∇f(x) = Σ_{i=1}^d [(f(x + µe_i) - f(x - µe_i))/(2µ)] e_i - ∇f(x) = (1/(2µ))Σ_{i=1}^d (f(x + µe_i) - f(x - µe_i) - 2µ∇_i f(x)) e_i.
Since f is ρ-Hessian Lipschitz, for all i ∈ [d] we have
|f(x + µe_i) - f(x - µe_i) - 2µ∇_i f(x)|
= |(f(x + µe_i) - f(x) - µ∇_i f(x) - (µ^2/2)∇^2_{ii} f(x)) - (f(x - µe_i) - f(x) + µ∇_i f(x) - (µ^2/2)∇^2_{ii} f(x))|
≤ |f(x + µe_i) - f(x) - µ∇_i f(x) - (µ^2/2)∇^2_{ii} f(x)| + |f(x - µe_i) - f(x) + µ∇_i f(x) - (µ^2/2)∇^2_{ii} f(x)|
① ≤ 2·(ρ/6)µ^3 = (ρ/3)µ^3,
where ① is due to Lemma 9. Therefore,
∥∇̃f(x) - ∇f(x)∥ = (1/(2µ))·(Σ_{i=1}^d (f(x + µe_i) - f(x - µe_i) - 2µ∇_i f(x))^2)^{1/2} ≤ (1/(2µ))·(d·(ρµ^3/3)^2)^{1/2} = √d ρµ^2/6.
Squaring both sides completes the proof.

Lemma 10 (Jin et al. (2018b)). Let A ∈ R^{2×2} with eigenvalues µ_1 and µ_2, so that A can be rewritten as A = [µ_1 + µ_2, -µ_1µ_2; 1, 0]. Then for any t ∈ N:
(0 1)A^t = (1 0)A^{t-1} and (µ_1 - 1)(µ_2 - 1)·(1 0)(Σ_{τ=0}^{t-1} A^τ)(1, 0)^T = 1 - (1 0)A^t(1, 1)^T.

Lemma 11 (Jin et al. (2018b), Lemma 30). Let θ ∈ (0, 1/4], define A = [(2 - θ)(1 - x), -(1 - θ)(1 - x); 1, 0], and let x ∈ [-1/4, θ^2/(2 - θ)^2]. Denote (a_t, -b_t) = (1 0)A^t. Then for any t ≥ 2/θ + 1, we have
Σ_{τ=0}^{t-1} a_τ ≥ Ω(1/θ^2) and (1/b_t)·(Σ_{τ=0}^{t-1} a_τ) ≥ Ω(1)·min{1/θ, 1/|x|}.

Lemma 12 (Jin et al. (2018b), Lemma 32). Let θ ∈ (0, 1/4], define A = [(2 - θ)(1 - x), -(1 - θ)(1 - x); 1, 0], and let x ∈ [θ^2/(2 - θ)^2, 1/4]. Denote (a_t, -b_t) = (1 0)A^t. Then for any t ≥ 0, we have max{|a_t|, |b_t|} ≤ (t + 1)(1 - θ)^{t/2}.

Lemma 13 (Jin et al. (2018b), Lemma 34).
Under the same setting as in Lemma 12, for any sequence {ϵ_τ} and any t ≥ Ω(1/θ), we have:
|Σ_{τ=0}^{t-1} a_τ ϵ_τ| ≤ O(1/x)·(|ϵ_0| + Σ_{τ=1}^{t-1} |ϵ_τ - ϵ_{τ-1}|),
|Σ_{τ=0}^{t-1} (a_τ - a_{τ-1})ϵ_τ| ≤ O(1/√x)·(|ϵ_0| + Σ_{τ=1}^{t-1} |ϵ_τ - ϵ_{τ-1}|).

Lemma 14 (Jin et al. (2018b), Lemma 36). Let θ ∈ (0, 1/4], define A = [(2 - θ)(1 - x), -(1 - θ)(1 - x); 1, 0], and let x ∈ [-1/4, 0]. Denote (a_t, -b_t) = (1 0)A^t. Then for any 0 ≤ τ ≤ t, we have
|a_{t-τ}||a_τ - b_τ| ≤ (2/θ + t + 1)·|a_{t+1} - b_{t+1}|.

Lemma 15 (Jin et al. (2018b), Lemma 37). Under the same setting as in Lemma 14, let A(x) = A and g(x) = |(1 0)[A(x)]^t (1, 0)^T|. Then:
1. g(x) is a monotonically decreasing function for x ∈ [-1, θ^2/(2 - θ)^2].
2. For any x ∈ [θ^2/(2 - θ)^2, 1], we have g(x) ≤ g(θ^2/(2 - θ)^2).

Lemma 16 (Jin et al. (2018b), Lemma 38). Under the same setting as in Lemma 14, we have
|a_{t+1} - b_{t+1}| ≥ |a_t - b_t|, where a_t - b_t = (1 0)A^t(1, 1)^T, and |a_t - b_t| ≥ (θ/2)·(1 + (1/2)min{|x|/θ, √|x|})^t.

Lemma 17 (Zhang & Li (2021), Lemma 21). Consider the sequence with recurrence ξ_{t+2} = (1 + κ)((2 - θ)ξ_{t+1} - (1 - θ)ξ_t) for some κ > 0. Then we have
ξ_t = ((1 + κ)/2)^t·(C_1(2 - θ + µ)^t + C_2(2 - θ - µ)^t),
where µ = √((2 - θ)^2 - 4(1 - θ)/(1 + κ)), C_1 = -((2 - θ - µ)/(2µ))ξ_0 + (1/((1 + κ)µ))ξ_1, and C_2 = ((2 - θ + µ)/(2µ))ξ_0 - (1/((1 + κ)µ))ξ_1.

B PROOF OF HAMILTONIAN LEMMAS IN THE ZEROTH-ORDER SETTING

Lemma 3. Assume that f(·) is ℓ-smooth and ρ-Hessian Lipschitz, and set the learning rate η ≤ 1/(4ℓ), θ ∈ [2ηγ, 1/2]. Then, for each iteration t where Eq. (6) does not hold, we have:
E_{t+1} ≤ E_t - (θ/(2η))∥v_t∥^2 - (η/4)∥∇f(y_t)∥^2 + η·ρ^2 dµ^4/48.

Proof. The update rule is
x_{t+1} ← y_t - η∇̃f(y_t), y_{t+1} ← x_{t+1} + (1 - θ)(x_{t+1} - x_t).
By smoothness, with η ≤ 1/(4ℓ), we have
f(x_{t+1}) ≤ f(y_t) + ⟨∇f(y_t), x_{t+1} - y_t⟩ + (ℓ/2)∥x_{t+1} - y_t∥^2 = f(y_t) - η⟨∇f(y_t), ∇̃f(y_t)⟩ + (ℓη^2/2)∥∇̃f(y_t)∥^2.
According to the update rule of the accelerated gradient descent, we have
∥x_{t+1} - x_t∥^2 = ∥y_t - η∇̃f(y_t) - x_t∥^2 = ∥y_t - x_t∥^2 - 2η⟨∇̃f(y_t), y_t - x_t⟩ + η^2∥∇̃f(y_t)∥^2.
Dividing both sides by 2η, we have
(1/(2η))∥x_{t+1} - x_t∥^2 = (1/(2η))∥y_t - x_t∥^2 + ⟨∇̃f(y_t), x_t - y_t⟩ + (η/2)∥∇̃f(y_t)∥^2.
Then we have
f(x_{t+1}) + (1/(2η))∥x_{t+1} - x_t∥^2 ≤ f(y_t) + (1/(2η))∥y_t - x_t∥^2 + ⟨∇̃f(y_t), x_t - y_t⟩ - η⟨∇f(y_t), ∇̃f(y_t)⟩ + (η/2)(1 + ℓη)∥∇̃f(y_t)∥^2.
As long as the following condition (the negation of Eq. (6)) holds:
f(x_t) ≥ f(y_t) + ⟨∇̃f(y_t), x_t - y_t⟩ - (γ/2)∥x_t - y_t∥^2,
we have
f(x_{t+1}) + (1/(2η))∥x_{t+1} - x_t∥^2 ≤ f(x_t) + ((1 + ηγ)/(2η))∥y_t - x_t∥^2 - η⟨∇f(y_t), ∇̃f(y_t)⟩ + (η/2)(1 + ℓη)∥∇̃f(y_t)∥^2.
Note that
-⟨∇f(y_t), ∇̃f(y_t)⟩ = -∥∇f(y_t)∥^2 - ⟨∇f(y_t), ∇̃f(y_t) - ∇f(y_t)⟩,
∥∇̃f(y_t)∥^2 = ∥∇f(y_t)∥^2 + 2⟨∇f(y_t), ∇̃f(y_t) - ∇f(y_t)⟩ + ∥∇̃f(y_t) - ∇f(y_t)∥^2.
Combining the two identities with the above inequality, we have
f(x_{t+1}) + (1/(2η))∥x_{t+1} - x_t∥^2
≤ f(x_t) + ((1 + ηγ)/(2η))∥y_t - x_t∥^2 - (η(1 - ℓη)/2)∥∇f(y_t)∥^2 + ℓη^2⟨∇f(y_t), ∇̃f(y_t) - ∇f(y_t)⟩ + (η(1 + ℓη)/2)∥∇̃f(y_t) - ∇f(y_t)∥^2
≤ f(x_t) + ((1 + ηγ)/(2η))∥y_t - x_t∥^2 - (η(1 - ℓη)/2)∥∇f(y_t)∥^2 + (ℓη^2/2)(β∥∇f(y_t)∥^2 + (1/β)∥∇̃f(y_t) - ∇f(y_t)∥^2) + (η(1 + ℓη)/2)∥∇̃f(y_t) - ∇f(y_t)∥^2
= f(x_t) + ((1 + ηγ)/(2η))∥y_t - x_t∥^2 - η·((1 - ℓη - βℓη)/2)∥∇f(y_t)∥^2 + η·(ℓη/(2β) + (1 + ℓη)/2)∥∇̃f(y_t) - ∇f(y_t)∥^2.
Taking β = 1 and η ≤ 1/(4ℓ), we have
f(x_{t+1}) + (1/(2η))∥x_{t+1} − x_t∥² ≤ f(x_t) + ((1 + ηγ)/(2η))∥y_t − x_t∥² − (η/4)∥∇f(y_t)∥² + (3η/4)∥∇̂f(y_t) − ∇f(y_t)∥²
≤ f(x_t) + ((1 + ηγ)/(2η))∥y_t − x_t∥² − (η/4)∥∇f(y_t)∥² + ηρ²dµ⁴/48.
Using the fact that ∥y_t − x_t∥ = (1 − θ)∥x_t − x_{t−1}∥, we have
f(x_{t+1}) + (1/(2η))∥x_{t+1} − x_t∥² ≤ f(x_t) + ((1 + ηγ)/(2η))(1 − θ)²∥x_t − x_{t−1}∥² − (η/4)∥∇f(y_t)∥² + ηρ²dµ⁴/48
= f(x_t) + (1/(2η))∥x_t − x_{t−1}∥² − ((2θ − θ² − ηγ(1 − θ)²)/(2η))∥v_t∥² − (η/4)∥∇f(y_t)∥² + ηρ²dµ⁴/48
≤ f(x_t) + (1/(2η))∥x_t − x_{t−1}∥² − (θ/(2η))∥v_t∥² − (η/4)∥∇f(y_t)∥² + ηρ²dµ⁴/48.

Lemma 4. Assume that f(·) is ℓ-smooth and ρ-Hessian Lipschitz. Then, for each iteration t where Eq. (6) holds, we have
E_{t+1} ≤ E_t − min{ s²/(2η), (1/4)γs² − ρs³ − ρ²dµ⁴/(9γ) }.
Proof. When ∥v_t∥ ≥ s, we have x_{t+1} = x_t, so
E_{t+1} = f(x_{t+1}) = f(x_t) = E_t − (1/(2η))∥v_t∥² ≤ E_t − s²/(2η).
When ∥v_t∥ ≤ s, we have
f(x_t) = f(y_t) + ⟨∇f(y_t), x_t − y_t⟩ + (1/2)(x_t − y_t)ᵀ∇²f(ζ_t)(x_t − y_t),
where ζ_t = y_t + α(x_t − y_t) for some α ∈ [0, 1]. When the following condition holds:
f(x_t) ≤ f(y_t) + ⟨∇̂f(y_t), x_t − y_t⟩ − (γ/2)∥x_t − y_t∥² = f(y_t) + ⟨∇f(y_t), x_t − y_t⟩ + ⟨∇̂f(y_t) − ∇f(y_t), x_t − y_t⟩ − (γ/2)∥x_t − y_t∥²,
we have
(1/2)(x_t − y_t)ᵀ∇²f(ζ_t)(x_t − y_t) ≤ ⟨∇̂f(y_t) − ∇f(y_t), x_t − y_t⟩ − (γ/2)∥x_t − y_t∥²
≤ (1/2)((1/β)∥∇̂f(y_t) − ∇f(y_t)∥² + β∥x_t − y_t∥²) − (γ/2)∥x_t − y_t∥²
= −((γ − β)/2)∥x_t − y_t∥² + (1/(2β))∥∇̂f(y_t) − ∇f(y_t)∥²
≤ −((γ − β)/2)∥x_t − y_t∥² + ρ²dµ⁴/(18β).
Taking β = γ/2, we have
(1/2)(x_t − y_t)ᵀ∇²f(ζ_t)(x_t − y_t) ≤ −(γ/4)∥x_t − y_t∥² + ρ²dµ⁴/(9γ).
Note that min{⟨∇f(x_t), δ⟩, ⟨∇f(x_t), −δ⟩} ≤ 0. Without loss of generality, assume that ⟨∇f(x_t), δ⟩ ≤ 0. Since x_{t+1} = argmin_{x∈{x_t+δ, x_t−δ}} f(x), we have
f(x_{t+1}) ≤ f(x_t + δ) = f(x_t) + ⟨∇f(x_t), δ⟩ + (1/2)δᵀ∇²f(ζ′_t)δ ≤ f(x_t) + (1/2)δᵀ∇²f(ζ′_t)δ,
where ζ′_t = x_t + α′δ for some α′ ∈ [0, 1].
Since ∥ζ_t − ζ′_t∥ ≤ 2s and δ lines up with y_t − x_t, we have
δᵀ∇²f(ζ′_t)δ ≤ δᵀ∇²f(ζ_t)δ + ∥∇²f(ζ′_t) − ∇²f(ζ_t)∥∥δ∥² ≤ −(γ/2)∥δ∥² + 2ρs∥δ∥² + 2ρ²dµ⁴/(9γ) = −(γ/2)s² + 2ρs³ + 2ρ²dµ⁴/(9γ).
Finally, we get
E_{t+1} = f(x_{t+1}) ≤ f(x_t) − ((1/4)γs² − ρs³ − ρ²dµ⁴/(9γ)) ≤ E_t − ((1/4)γs² − ρs³ − ρ²dµ⁴/(9γ)).

Lemma 18. If Eq. (6) does not hold, then for all steps in [t, t + T], we have
Σ_{τ=t+1}^{t+T} ∥x_τ − x_{τ−1}∥² ≤ (2η/θ)(E_t − E_{t+T}) + (ηρ²dµ⁴/48)·T.
Proof. The result follows directly from Lemma 3: summing the per-step inequality over the interval, dropping the nonnegative gradient terms, and telescoping the Hamiltonian yields the claim.
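To make the objects in Lemma 3 concrete, the following is a minimal numerical sketch (not the paper's implementation) of the zeroth-order AGD step together with the Hamiltonian E_t = f(x_t) + (1/2η)∥v_t∥². The gradient estimator is assumed to be a coordinate-wise central-difference scheme, and the test function, step count, and parameter values are illustrative choices of ours.

```python
import numpy as np

def zo_grad(f, x, mu):
    """Coordinate-wise central-difference estimate of the gradient (2d queries)."""
    d = x.size
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = mu
        g[i] = (f(x + e) - f(x - e)) / (2 * mu)
    return g

def zo_agd(f, x0, ell, T=50, theta=0.1, mu=1e-4):
    """Run T zeroth-order AGD steps and return the Hamiltonian trace E_t."""
    eta = 1.0 / (4 * ell)                 # learning rate, as in Lemma 3
    x_prev, x = x0.copy(), x0.copy()      # v_0 = 0
    E = []
    for _ in range(T):
        v = x - x_prev
        E.append(f(x) + np.dot(v, v) / (2 * eta))
        y = x + (1 - theta) * v           # momentum point
        x_prev, x = x, y - eta * zo_grad(f, y, mu)
    return E

# Illustrative smooth test function f(x) = 0.5 ||x||^2 (so ell = 1).
f = lambda x: 0.5 * float(np.dot(x, x))
E = zo_agd(f, np.array([1.0, 2.0]), ell=1.0)
# For this convex quadratic the Hamiltonian decreases monotonically,
# up to the O(rho^2 d mu^4) estimation-error term of Lemma 3.
assert all(E[t + 1] <= E[t] + 1e-8 for t in range(len(E) - 1))
```

On a convex quadratic the condition negating Eq. (6) always holds, so the monotone decrease predicted by Lemma 3 is visible directly in the trace.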

C PROOF OF MAIN RESULTS OF ALGORITHM 1

Recall the parameter settings in Algorithm 1: η = 1/(4ℓ), θ = 1/(4√κ), γ = θ²/η = √(ρϵ)/4, s = γ/(4ρ) = (1/16)√(ϵ/ρ), r = ηϵχ⁻⁵c⁻⁸, where κ = ℓ/√(ρϵ). Denote T = √κ·χc, E = √(ϵ³/ρ)·χ⁻⁵c⁻⁷, S = √(2ϵ/ρ)·χ⁻²c⁻³, M = (ϵ√κ/ℓ)·c⁻¹.

Lemma 19. After running the NCE with µ ≤ Õ(ϵ^{1/2}/d^{1/4}) for one step, we have E_{t+1} − E_t ≤ −2E.
Proof. According to Lemma 4, with the choice of the smoothing parameter such that µ ≤ Õ(ϵ^{1/2}/d^{1/4}), we have
E_{t+1} − E_t ≤ −min{ s²/(2η), (1/4)γs² − ρs³ − ρ²dµ⁴/(9γ) } ≤ −Ω(E·c⁷) ≤ −2E
for a large enough constant c.

Lemma 20. Let 0 be the origin. Denote δ_τ = ∇̂f(y_τ) − ∇f(0) − ∇²f(0)y_τ. Then the zeroth-order AGD update can be rewritten as
(x_{t+1}, x_t)ᵀ = Aᵗ(x_1, x_0)ᵀ − η Σ_{τ=1}^{t} A^{t−τ}(∇f(0) + δ_τ, 0)ᵀ,
where
A = [ (2 − θ)(I − η∇²f(0)),  −(1 − θ)(I − η∇²f(0));  I,  0 ].
Proof. Since y_t = (2 − θ)x_t − (1 − θ)x_{t−1}, the update is
x_{t+1} = (2 − θ)x_t − (1 − θ)x_{t−1} − η∇̂f((2 − θ)x_t − (1 − θ)x_{t−1}),
and ∇̂f(y_t) = ∇f(0) + ∇²f(0)y_t + δ_t. Then we have
(x_{t+1}, x_t)ᵀ = A(x_t, x_{t−1})ᵀ − η(∇f(0) + δ_t, 0)ᵀ = Aᵗ(x_1, x_0)ᵀ − η Σ_{τ=1}^{t} A^{t−τ}(∇f(0) + δ_τ, 0)ᵀ.

Lemma 21. If for any τ ≤ t we have ∥x_τ∥ ≤ R, then for any τ ≤ t:
1. ∥δ_τ∥ ≤ ρ·O(R² + √d µ²);
2. ∥δ_τ − δ_{τ−1}∥ ≤ ρ·O(R∥x_τ − x_{τ−1}∥ + R∥x_{τ−1} − x_{τ−2}∥ + √d µ²);
3. Σ_{τ=1}^{t} ∥δ_τ − δ_{τ−1}∥² ≤ O(ρ²R² Σ_{τ=1}^{t} ∥x_τ − x_{τ−1}∥² + tρ²dµ⁴).
Proof. For the first inequality, by using the second inequality of Lemma 9, we have
∥∇f(y_τ) − ∇f(0) − ∇²f(0)y_τ∥ ≤ (ρ/2)∥y_τ∥² = (ρ/2)∥(2 − θ)x_τ − (1 − θ)x_{τ−1}∥² ≤ O(ρR²).
Using Lemma 1, we have
∥δ_τ∥ = ∥∇̂f(y_τ) − ∇f(0) − ∇²f(0)y_τ∥ ≤ ∥∇f(y_τ) − ∇f(0) − ∇²f(0)y_τ∥ + ∥∇̂f(y_τ) − ∇f(y_τ)∥ ≤ O(ρR² + √d ρµ²).
For the second inequality, we have δ_τ − δ_{τ−1} = ∇̂f(y_τ) − ∇̂f(y_{τ−1}) − ∇²f(0)(y_τ − y_{τ−1}). Then
∥∇f(y_τ) − ∇f(y_{τ−1}) − ∇²f(0)(y_τ − y_{τ−1})∥ = ∥∫₀¹ (∇²f(y_{τ−1} + ϕ(y_τ − y_{τ−1})) − ∇²f(0)) dϕ · (y_τ − y_{τ−1})∥
≤ ∥∫₀¹ (∇²f(y_{τ−1} + ϕ(y_τ − y_{τ−1})) − ∇²f(0)) dϕ∥ · ∥y_τ − y_{τ−1}∥
≤ ρ max{∥y_τ∥, ∥y_{τ−1}∥} · ∥y_τ − y_{τ−1}∥ ≤ O(ρR)(∥x_τ − x_{τ−1}∥ + ∥x_{τ−1} − x_{τ−2}∥).
Thus, ∥δ τ -δ τ -1 ∥ ≤∥ ∇f (y τ ) -∇f (y τ -1 ) -(∇f (y τ ) -∇f (y τ -1 ))∥ + ∥∇f (y τ ) -∇f (y τ -1 ) -∇ 2 f (0)(y τ -y τ -1 )∥ ≤O(ρR)(∥x τ -x τ -1 ∥ + ∥x τ -1 -x τ -2 ∥) + O(ρ √ dµ 2 ) Then we have t τ =1 ∥δ τ -δ τ -1 ∥ 2 ≤ O(ρ 2 R 2 t τ -1 ∥x τ -x τ -1 ∥ 2 + t τ =1 ρ 2 dµ 4 ) C.1 LARGE GRADIENT Let S be the subspace with eigenvalues in ( θ 2 η(2-θ) 2 , ℓ] and S c be the complementary subspace. Lemma 22 (Large momentum or large gradient). If ∥v t ∥ ≥ M or ∥∇f (x t )∥ ≥ 2ℓM , and at iteration t only AGD is used with smoothing parameter µ ≤ O( ϵ 1/2 κ 1/8 ρ 1/2 d 1/4 c -1/2 ) and without NCE or perturbation, we have: E t+1 -E t ≤ - 4E T . Proof. When ∥v t ∥ ≥ ϵ √ κ ℓ and µ ≤ O( ϵ 1/2 κ 1/8 ρ 1/2 d 1/4 c -1/2 ), using Lemma 3, we have E t+1 -E t ≤ - θ 2η ∥v t ∥ 2 + ηρ 2 dµ 4 48 ≤ -Ω( ℓ √ κ ϵ 2 κ ℓ 2 c -2 - ρ 2 dµ 4 ℓ ) = -Ω( ϵ 2 √ κc -2 -ρ 2 dµ 4 ℓ ) ≤ -Ω( ϵ 2 √ κc -2 ℓ ) ≤ -Ω( E T c 6 ) ≤ - 4E T , holds for large enough constant c. When ∥v t ∥ ≤ M and ∥∇f (x t )∥ ≥ 2ℓM , then by gradient Lipschitz assumption, we have ∥∇f (y t )∥ ≥ ∥∇f (x t )∥ -∥∇f (y t ) -f (x t )∥ ≥ ∥∇f (x t )∥ -ℓ(1 -θ)∥v t ∥ ≥ ℓM . Using Lemma 3, with µ ≤ O( ϵ 1/2 κ 1/8 ρ 1/2 d 1/4 c -1/2 ), we have E t+1 -E t ≤ - η 4 ∥∇f (y t )∥ 2 + ηρ 2 dµ 4 48 ≤ -Ω( ϵ 2 κc -2 -ρ 2 dµ 4 ℓ ) ≤ -Ω( ϵ 2 √ κc -2 -ρ 2 dµ 4 ℓ ) ≤ -Ω( ϵ 2 √ κc -2 ℓ ) ≤ -Ω( E T c 6 ) ≤ - 4E T , holds for large enough constant c. Lemma 23. If ∥P S c ∇f (x 0 )∥ ≥ ϵ 4 , ∥v 0 ∥ ≤ M , v T 0 [P T S ∇ 2 f (x 0 )P S ]v 0 ≤ 2 √ ρϵM 2 , µ ≤ Õ( ϵ 5/8 d 1/4 ), and for t ∈ [0, T /4] only AGD steps are used, then we have E T /4 -E 0 ≤ -E . Proof. Define x -1 = x 0 -v 0 . Without loss of generality, set x 0 = 0. Using Lemma 20, we have x t x t-1 = A t-1 0 -v 0 -η t-1 τ =0 A t-1-τ ∇f (0) + δ τ 0 Denote A j = (2 -θ)(1 -ηλ j ) -(1 -θ)(1 -ηλ j ) 1 0 , where λ j is the j-th eigenvalue of ∇ 2 f (0). Denote a (j) t -b (j) t = (1 0) A t j . 
Then we have for the j-th eigen-direction x (j) t = b (j) t v (j) 0 -η t-1 τ =0 ( ∇f (0) + δ (j) τ ) = -η t-1 τ =0 a (j) τ ∇f (0) (j) + t-1 τ =0 p (j) τ δ (j) τ + q (j) t v (j) 0 , where p (j) τ = a (j) t-1 t-1-τ τ =0 a (j) τ , q (j) t = - b (j) t η t-1-τ τ =0 a (j) τ For j ∈ S c , using Lemma , we have t-1 τ =0 a (j) τ ≥ Ω( 1 θ 2 ). Then rewrite the above equation as x (j) t = -η t-1 τ =0 a (j) τ ∇f (0) (j) + δ(j) + ṽ(j) , where δ(j) = t-1 τ =0 p (j) τ δ (j) τ , ṽ(j) = q (j) t v (j) 0 . For all j ∈ S c , | δ(j) | = | t-1 τ =0 p (j) τ δ (j) τ | ≤ t-1 τ =0 p (j) τ (|δ (j) 0 | + |δ (j) τ -δ (j) 0 |) = |δ (j) 0 | + t-1 τ =0 p (j) τ |δ (j) τ -δ (j) 0 | ≤ |δ (j) 0 | + t-1 τ =1 |δ (j) τ -δ (j) τ -1 | Then by Cauchy-Swartz inequality, ∥P S c δ∥ 2 = j∈S c | δ(j) | 2 ≤ j∈S c (|δ (j) 0 | + t-1 τ =1 |δ (j) τ -δ (j) τ -1 |) 2 ≤ 2   j∈S c |δ (j) 0 | 2 + j∈S c ( t-1 τ =1 |δ (j) τ -δ (j) τ -1 |) 2   ≤2   j∈S c |δ (j) 0 | 2 + t j∈S c t-1 τ =1 (|δ (j) τ -δ (j) τ -1 |) 2   ≤ 2∥δ 0 ∥ 2 + 2t t-1 τ =1 ∥δ τ -δ τ -1 ∥ 2 Assume that E T /4 -E 0 ≥ -E . By Lemma 18 and choose µ ≤ O(( ϵ 3 ρ 5 χ -6 c -8 d ) 1/4 ) = Õ( ϵ 3/8 (d) 1/4 ), we have ∥x t -x 0 ∥ ≤ t t τ =1 ∥x τ -x τ -1 ∥ 2 ≤ 2ηE θ • T 4 + T 2 16 ηρ 2 dµ 4 48 ≤ S . With µ ≤ O(( ϵ 5 ρ 3 χ -10 c -14 dℓ ) 1/4 ) = Õ( ϵ 5/8 (d) 1/4 ), by Lemma 21 we have ∥δ 0 ∥ ≤ O(ρS 2 ). By Lemma 18 and Lemma 21, we have t t-1 τ =1 ∥δ τ -δ τ -1 ∥ 2 ≤ O(ρ 2 S 2 t t-1 τ =1 ∥x τ -x τ -1 ∥ 2 + t 2 ρ 2 dµ 4 ) ≤ O(ρ 2 S 4 ) So we have ∥P S c δ∥ ≤ O(ρS 2 ) ≤ O(ϵc -6 ). 
By Lemma 11, -ηq (j) t = bt t-1 τ =0 aτ ≤ O(1) max{θ, η|λ j |}, then ∥P S c ṽ∥ 2 = j∈S c [q (j) t v (j) 0 ] 2 ≤ O(1) j∈S c max{θ 2 , η|λ j |} η 2 [v (j) 0 ] 2 Since the NCE step is not reached, then we have: f (x 0 ) ≥f (y 0 ) + ∇f (y 0 ), x 0 -y 0 - γ 2 ∥x 0 -y 0 ∥ 2 =f (y 0 ) + ⟨∇f (y 0 ), x 0 -y 0 ⟩ + ∇f (y 0 ) -∇f (y 0 ), x 0 -y 0 - γ 2 ∥x 0 -y 0 ∥ 2 ≥f (y 0 ) + ⟨∇f (y 0 ), x 0 -y 0 ⟩ - 1 2β ∥ ∇f (y 0 ) -∇f (y 0 )∥ 2 - γ + β 2 ∥x 0 -y 0 ∥ 2 ≥f (y 0 ) + ⟨∇f (y 0 ), x 0 -y 0 ⟩ - ρ 2 dµ 4 72β - γ + β 2 ∥x 0 -y 0 ∥ 2 =f (y 0 ) + ⟨∇f (y 0 ), x 0 -y 0 ⟩ - ρ 2 dµ 4 72γ -γ∥x 0 -y 0 ∥ 2 , where the last step is by taking β = γ. Then we have 1 2 (x 0 -y 0 ) T ∇ 2 f (ζ 0 )(x 0 -y 0 ) ≥ - ρ 2 dµ 4 72γ -γ∥x 0 -y 0 ∥ 2 , where ζ 0 = ϕx 0 + (1 -ϕ)y 0 and ϕ ∈ [0, 1]. Note that (1 -θ)v 0 = y 0 -x 0 , we have 1 2 v T 0 ∇ 2 f (ζ 0 )v 0 ≥ - ρ 2 dµ 4 72(1 -θ) 2 γ -γ∥v 0 ∥ 2 ≥ - ρ 2 dµ 4 18γ -γ∥v 0 ∥ 2 , where the last inequality uses the fact that θ ≤ 1 2 . Using the Hessian Lipschitz property, we have ∥∇ 2 f (ζ 0 ) -∇ 2 f (x 0 )∥ ≤ ρ∥y 0 ∥ ≤ ρ∥v 0 ∥ ≤ ρM = (ρϵ) 3/4 √ ℓ c -1 ≤ √ ρϵ 2 = 2γ. Then we have v T 0 ∇ 2 f (x 0 )v 0 ≥ - ρ 2 dµ 4 9γ -4γ∥v 0 ∥ 2 ≥ - ρ 2 dµ 4 √ ρϵ - √ ρϵ∥v 0 ∥ 2 . Since θ 2 η(1-θ) 2 = Θ( √ ρϵ), we have j∈S c |λ j |[v (j) 0 ] 2 ≤ √ ρϵ∥v 0 ∥ 2 + ρ 2 dµ 4 √ ρϵ + j:0<λj ≤ θ 2 η(1-θ) 2 λ j [v (j) 0 ] 2 + j:λj > θ 2 η(1-θ) 2 λ j [v (j) 0 ] 2 ≤O( √ ρϵ)∥v 0 ∥ 2 + ρ 2 dµ 4 √ ρϵ + v T 0 [P T S ∇ 2 f (0)P S ]v 0 . With µ ≤ O( ϵ 2 √ ρϵ ℓρ 2 d ) 1/4 = Õ( ϵ 5/8 d 1/4 ), then we have ∥P S c ṽ∥ 2 ≤ O( 1 η ) √ ρϵ∥v 0 ∥ 2 + ρ 2 dµ 4 √ ρϵ + v T 0 [P T S ∇ 2 f (0)P S ]v 0 ≤ O(ℓ √ ρϵM 2 ) = O(ϵ 2 c -2 ) Then we have ∥x t ∥ ≥∥P S c x t ∥ ≥ η min j∈S c t-1 τ =0 a (j) τ ∥P S c ( ∇f (0) + δ + ṽ)∥ ≥Ω( η θ 2 ) ∥P S c ∇f (0)∥ - ρ √ dµ 2 6 -∥P S c δ∥ -∥P S c ṽ∥ ≥Ω( ηϵ θ 2 ) ≥ S , which contradicts with ∥x t -x 0 ∥ = ∥x t ∥ ≤ S . So we have E T /4 -E 0 ≤ -E . Lemma 24. 
If ∥v 0 ∥ ≤ M and ∥∇f (x 0 )∥ ≤ 2ℓM , E T /2 -E 0 ≥ -E , µ ≤ Õ( ϵ 5/8 d 1/4 ) and for any t ∈ [0, T /2] only SGD steps are used. Then ∀t ∈ [T /4, T /2]: ∥P S ∇f (x t )∥ ≤ ϵ 4 and v T t [P T S ∇ 2 f (x 0 )P S ]v t ≤ √ ρϵM 2 . Proof. Since E T /4 -E 0 ≥ -E . By Lemma 18 and choose µ ≤ O(( ϵ 3 ρ 5 χ -6 c -8 d ) 1/4 ) = Õ( ϵ 3/8 (d) 1/4 ), we have ∥x t -x 0 ∥ ≤ t t τ =1 ∥x τ -x τ -1 ∥ 2 ≤ 2ηE θ • T 4 + T 2 16 ηρ 2 dµ 4 48 ≤ S . Define x -1 = x 0 -v 0 . Without loss of generality, set x 0 = 0. Using Lemma 20, we have x t x t-1 = A t-1 0 -v 0 -η t-1 τ =0 A t-1-τ ∇f (0) + δ τ 0 Define ∆ t = 1 0 (∇ 2 f (ϕx t ) -∇ 2 f (0))dϕ. Then we have ∇f (x t ) =∇f (0) + (∇ 2 f (0) + ∆ t )x t = ∇f (0) + (∇ 2 f (0) + ∆ t )x t + ∇f (0) -∇f (0) = I -η∇ 2 f (0) (I 0) t-1 τ =0 A t-1-τ I 0 ∇f (0) + ∇ 2 f (0) (I 0) A t 0 -v 0 -η∇ 2 f (0) (I 0) t-1 τ =0 A t-1-τ δ t 0 + ∆ t x t + ∇f (0) -∇f (0). If we choose µ ≤ Õ( ϵ 1/2 d 1/4 ), we have ∥∆ t x t ∥ ≤ ρ∥x t ∥ 2 ≤O(ρS 2 ) ≤ O(ϵc -6) ≤ ϵ/20 ∥∇f (0) -∇f (0)∥ ≤ ρ √ dµ 2 6 ≤ ϵ/20 By Lemma 11, we have 1 -ηλ j (1 0) t-1 τ =0 A t-1-τ j 1 0 = (1 0) A t j 1 1 . Denote a (j) t , -b (j) t = (1 0) A t j . By Lemma 12, max j∈S |a (j) t |, |b t | ≤ (t + 1)(1 -θ) t/2 , then we have when t ≥ T /4 = Ω( 2 θ log 1 θ ), µ ≤ Õ( ϵ 1/2 d 1/4 ), ∥P S I -η∇ 2 f (0) (I 0) t-1 τ =0 A t-1-τ I 0 ∇f (0) ∥ 2 = j∈S |(a (j) t -b (j) t ) ∇f (0) (j) | 2 ≤(t + 1) 2 (1 -θ) t ∥ ∇f (0)∥ 2 ≤ (t + 1) 2 (1 -θ) t 2(∥∇f (0)∥ 2 + ρ 2 dµ 4 36 ) ≤ ϵ 2 /400 ∥P S ∇ 2 f (0) (I 0) A t 0 -v 0 ∥ 2 ≤ j∈S |λ j b (j) t v (j) 0 | 2 ℓ 2 (t + 1) 2 (1 -θ) t ∥v 0 ∥ 2 ≤ ϵ 2 /400. Using Lemma 13, for all j ∈ S, we have | η∇ 2 f (0) (I 0) t-1 τ =0 A t-1-τ δ t 0 (j) | = |ηλ j t-1 τ =0 a (j) τ δ t-1-τ | ≤ |δ (j) t-1 | + t-1 τ =1 |δ (j) τ -δ (j) τ -1 | Using Lemma 21 and choose µ ≤ Õ( ϵ 5/8 d 1/4 ), we have ∥P S η∇ 2 f (0) (I 0) t-1 τ =0 A t-1-τ δ t 0 ∥ ≤ 2∥δ t-1 ∥ 2 + 2t t-1 τ =1 ∥δ τ -δ τ -1 ∥ 2 ≤ O(ρ 2 S 4 ) ≤ O(ϵ 2 c -12 ) ≤ ϵ 2 400 Thus we have for any t ∈ [T /4, T ], ∥P S ∇f (x t )∥ ≤ ϵ 4 . 
Using Lemma 20, we have v t = (1 -1) x t x t-1 = (1 -1) A t 0 -v 0 -η (1 -1) t-1 τ =0 A t-1-τ ∇f (0) 0 -η (1 -1) t-1 τ =0 A t-1-τ δ τ 0 By Lemma 12, for t ≥ T /4 = Ω( c θ log 1 θ ), we have ∥[P T S ∇ 2 f (x 0 )P S ] 1/2 (1 -1) A t 0 -v 0 ∥ 2 = j∈S |λ 1/2 j (b (j) t -b (j) t-1 )v (j) 0 | 2 ≤ ℓ(t + 1) 2 (1 -θ) t ∥v 0 ∥ 2 ≤O( ϵ 2 ℓ c -3 ) ≤ 1 3 √ ρϵM 2 By Lemma 10, we have |ηλ j (1 -1) t-1 τ =0 A t-1-τ j 1 0 | =|ηλ j (1 0) t-1 τ =0 (A t-1-τ j -A t-2-τ j ) 1 0 | =| (1 0) (A t j -A t-1 j ) 1 0 |. By choosing µ ≤ Õ( ϵ 1/2 d 1/4 ), then we have ∥[P T S ∇ 2 f (x 0 )P S ] 1/2 η (1 -1) t-1 τ =0 A t-1-τ ∇f (0) 0 ∥ 2 = j∈S |λ -1/2 j (a t-1 -b (j) t + b (j) t-1 ) ∇f (0) (j) | 2 ≤O( 1 √ ρϵ )(t + 1) 2 (1 -θ) t ∥ ∇f (0)∥ 2 ≤ O( 1 √ ρϵ )(t + 1) 2 (1 -θ) t • 2(∥∇f (0)∥ 2 + ρ 2 dµ 4 36 ) ≤O( ϵ 3 ℓ c -3 ) ≤ 1 3 √ ρϵM 2 . By Lemma 13, for any j ∈ S, we have |(∇ 2 f (0) 1 2 η (1 -1) t-1 τ =0 A t-1-τ δ τ 0 ) (j) | =|ηλ 1/2 j t-1 τ =0 (a τ -a τ -1 )δ t-1-τ | ≤ √ η(|δ (j) t-1 | + t-1 τ =1 |δ (j) τ -δ (j) τ -1 |). Using Lemma 21 and choose µ ≤ Õ( ϵ 5/8 d 1/4 ), we have ∥[P T S ∇ 2 f (x 0 )P S ] 1/2 η (1 -1) t-1 τ =0 A t-1-τ δ τ 0 ∥ 2 ≤η[2∥δ t-1 ∥ 2 + 2t t-1 τ =1 ∥δ τ -δ τ -1 ∥ 2 ] ≤O(ηρ 2 S 2 ) ≤ O( ϵ 2 ℓ c -6 ) ≤ 1 3 √ ρϵM 2 Thus we have v T t [P T S ∇ 2 f (x 0 )P S ]v t ≤ √ ρϵM 2 . Lemma 5. If ∥ ∇f (x τ )∥ ≥ 3ϵ 4 with µ ≤ O(( 3ϵ 2ρ √ d ) 1/2 ) in Line 3 of Algorithm 1 for all τ ∈ [0, T ], then by running Algorithm 1 with µ ≤ Õ( ϵ 5/8 d 1/4 ) in Line 6 and µ ≤ Õ( ϵ 1/2 d 1/4 ) in Line 8, we have E T -E 0 ≤ -E . Proof. According to lemma 1, if we choose µ ≤ O(( 3ϵ 2ρ √ d ) 1/2 ) in Line 3 of Algorithm 1. Then we get if ∥ ∇f (x t )∥ ≤ 3ϵ 4 , then ∥∇f (x t )∥ ≤ ϵ, otherwise ∥∇f (x t )∥ ≥ ϵ 2 . According to Algorithm 1, if ∥ ∇f (x τ )∥ ≥ ϵ 4 , then for all τ ∈ [0, T ], the perturbation step is not reached. According to Lemma 19, as long as the NCE step is reached, then we have the Hamiltonian we decrease by E in a single step. 
And according to Lemma 3 and Lemma 4, the Hamiltonian decreases monotonically in all steps, so Lemma 5 holds in this case. It remains to prove that Lemma 5 holds if the NCE step is never reached in steps τ ∈ [0, T]. Let t₁ = argmin_{t∈[0,T]} {t | ∥v_t∥ ≤ M and ∥∇f(x_t)∥ ≤ 2ℓM}. If t₁ ∈ [T/4, T], then we have E_T − E_0 ≤ E_{T/4} − E_0 ≤ −E according to Lemma 22. Now consider the case t₁ ∈ [0, T/4]. Applying Lemma 24 with t₁ as the initial step, we have ∥P_S ∇f(x_t)∥ ≤ ϵ/4 and v_tᵀ[P_Sᵀ∇²f(x_{t₁})P_S]v_t ≤ √(ρϵ)M² for all t ∈ [t₁ + T/4, t₁ + T/2]. Let t₂ = argmin_{t∈[t₁+T/4, T]} {t | ∥v_t∥ ≤ M}. If t₂ ≥ t₁ + T/2, then the Hamiltonian decreases by E by Lemma 22. Otherwise, t₂ ∈ [t₁ + T/4, t₁ + T/2] and ∥P_S ∇f(x_{t₂})∥ ≤ ϵ/4; by the precondition of Lemma 5 we have ∥∇f(x_{t₂})∥ ≥ 3ϵ/4, so ∥P_{S^c} ∇f(x_{t₂})∥ ≥ ϵ/4. By Lemma 18, ∥x_{t₁} − x_{t₂}∥ ≤ 2S holds, so we have
v_{t₂}ᵀ[P_Sᵀ∇²f(x_{t₂})P_S]v_{t₂} ≤ v_{t₂}ᵀ[P_Sᵀ∇²f(x_{t₁})P_S]v_{t₂} + ∥∇²f(x_{t₁}) − ∇²f(x_{t₂})∥∥v_{t₂}∥² ≤ 2√(ρϵ)M².
So according to Lemma 23, the Hamiltonian decreases by E.
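The linear-dynamics rewrite of Lemma 20 can be checked numerically: for a quadratic f(x) = ½xᵀHx we have ∇f(0) = 0 and, with exact gradients, δ_τ = 0, so iterating AGD directly must coincide with repeatedly applying the block matrix A. A small sketch under these assumptions (the dimension, step sizes, and Hessian are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, theta, eta = 3, 0.1, 0.25
M = rng.standard_normal((d, d))
H = (M + M.T) / 2                      # symmetric Hessian of f(x) = 0.5 x^T H x
I = np.eye(d)

# Block matrix A from Lemma 20 (here the Hessian at the origin is H).
A = np.block([[(2 - theta) * (I - eta * H), -(1 - theta) * (I - eta * H)],
              [I, np.zeros((d, d))]])

# Direct AGD iteration: y_t = x_t + (1-theta)(x_t - x_{t-1}), x_{t+1} = y_t - eta*H*y_t.
x_prev = rng.standard_normal(d)
x = rng.standard_normal(d)
state = np.concatenate([x, x_prev])    # stacked state (x_t, x_{t-1})
for _ in range(10):
    y = x + (1 - theta) * (x - x_prev)
    x_prev, x = x, y - eta * (H @ y)
    state = A @ state                  # matrix form of the same update
assert np.allclose(state[:d], x) and np.allclose(state[d:], x_prev)
```

The agreement holds because y_t = (2 − θ)x_t − (1 − θ)x_{t−1}, so one AGD step is exactly one multiplication by A when the forcing terms vanish.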

C.2 NEGATIVE CURVATURE

Lemma 25. Suppose ∥ ∇f (x)∥ ≤ 3ϵ 4 ( thus ∥ ∇f (x)∥ ≤ ϵ) and λ min (∇ 2 f (x)) ≤ -√ ρϵ. x 0 and x ′ 0 are at distance at most r from x. Let x 0 -x ′ 0 = r 0 e 1 and v 0 = v ′ 0 = ṽ where e 1 is the minimum eigen-direction of ∇ 2 f (x) and r 0 ≥ δE r 2∆ f √ d . Then, running zeroth-order AGD starting at (x 0 , v 0 ) and (x ′ 0 , v ′ 0 ) respectively and set µ ≤ Õ( ϵ 13/8 d 1/2 ), we have min{E T -Ẽ, E ′ T -Ẽ} ≤ -E . Proof. Assume that min{E T -E 0 , E ′ T -E ′ 0 } ≥ -2E , where E 0 and E ′ 0 are Hamiltonians at (x 0 , v 0 ) and (x ′ 0 , v ′ 0 ), respectively. By Lemma 18 and choose µ ≤ Õ( ϵ 3/8 d 1/4 ), we have for any t ≤ T , max{∥x t -x∥, ∥x ′ t -x∥} ≤ max{∥x t -x 0 + x 0 -x∥, ∥x ′ t -x ′ 0 + x ′ 0 -x∥} ≤r + max{∥x t -x 0 ∥, ∥x ′ t -x ′ 0 ∥} ≤ r + 4ηE T θ + T 2 ηρ 2 dµ 4 48 ≤ 2S . Let x = 0 be the origin. Let w t = x t -x ′ t , according to lemma 20, we have w t+1 w t = A t w 1 w 0 -η t τ =1 A t-τ ξ τ 0 = A t+1 w 0 w -1 -η t τ =0 A t-τ ξ τ 0 , where ξ t = ∇f (y t ) -∇f (y ′ t ) -∇ 2 f (0)(y t -y ′ t ) =. Let ∆ t = 1 0 (∇ 2 f (ϕy t + (1 -ϕ)y ′ t ) - ∇ 2 f (0))dϕ, then we have ξ t = ∆ t (y t -y ′ t ) + e t -e ′ t = ∆ t ((1 -θ)w t -(1 -θ)w t-1 ) + e t -e ′ t , where e t = ∇f (y t ) -∇f (y t ), e ′ t = ∇f (y ′ t ) -∇f (y ′ t ). Since v 0 = v ′ 0 , we have w -1 = w 0 , ∥∆ t ∥ ≤ ρ max{∥x t -x∥, ∥x ′ t -x∥} ≤ 2ρS and ∥ξ t ∥ ≤ 6ρS (∥w τ ∥ + ∥w τ -1 ∥) + ρ √ dµ 2 3 . Then we prove by induction that ∥η (I 0) t-1 τ =0 A t-1-τ ξ τ 0 ∥ ≤ 1 2 ∥ (I 0) A t w 0 w 0 ∥ For reasonably small µ, it is easy to check the base case holds for t = 1 as ∥A∥ ≤ ℓ = 4η. Then we assume that for all steps less than or equal to t, the induction assumption holds. 
Then we have ∥w t ∥ =∥ (I 0) A t w 0 w 0 -η (I 0) t-1 τ =0 A t-1-τ ξ τ 0 ∥ ≤ ∥ (I 0) A t w 0 w 0 ∥ + ∥η (I 0) t-1 τ =0 A t-1-τ ξ τ 0 ∥ ≤2∥ (I 0) A t w 0 w 0 ∥, then we have ∥ξ t ∥ ≤O(ρS )(∥w t ∥ + ∥w t-1 ∥) + ρ √ dµ 2 3 ≤ O(ρS )(∥ (I 0) A t w 0 w 0 ∥ + ∥ (I 0) A t-1 w 0 w 0 ∥) + ρ √ dµ 2 3 ≤O(ρS )∥ (I 0) A t w 0 w 0 ∥ + ρ √ dµ 2 3 , where the last inequality uses Lemma 16. For the case t + 1, we have ∥η (I 0) t τ =0 A t-τ ξ τ 0 ∥ ≤η t τ =0 ∥ (I 0) A t-τ I 0 ∥∥ξ τ ∥ ≤η t τ =0 ∥ (I 0) A t-τ I 0 ∥ O(ρS )∥ (I 0) A τ w 0 w 0 ∥ + ρ √ dµ 2 3 Without loss of generality, assume that the minimum eigenvector direction of ∇ 2 f (x) is along the first coordinate e 1 with the corresponding 2 × 2 matrix A 1 . Let a (1) t -b (1) t = (1 0) A 1 . If we choose µ ≤ Õ( ϵ 13/8 d 1/2 ), then ∥η (I 0) t τ =0 A t-τ ξ τ 0 ∥ ≤η t τ =0 a (1) t-τ O(ρS )(a (1) τ -b (1) τ )∥w 0 ∥ + ρ √ dµ 2 3 ≤η t τ =0 a (1) t-τ O(ρS )(a (1) τ -b (1) τ )∥w 0 ∥ ≤O(ηρS ) t τ =0 ( 2 θ + t + 1)|a (1) t+1 -b (1) t+1 |∥x 0 ∥ ≤O(ηρS )∥ (I 0) A t+1 w 0 w 0 ∥ ≤ 1 2 ∥ (I 0) A t+1 w 0 w 0 ∥, where the second inequality used Lemma 16 that |a (1) τ -b τ | ≥ θ 2 and µ ≤ Õ( ϵ 13/8 d 1/2 ); the third inequality used Lemma 14 and the fourth inequality used 1 θ ≤ S . Then we finished the proof of the induction. Then we have ∥w t ∥ ≥ ∥ (I 0) A t w 0 w 0 ∥ -∥η (I 0) t-1 τ =0 A t-1-τ ξ τ 0 ∥ ≥ 1 2 ∥ (I 0) A t w 0 w 0 ∥ ≥ 1 4 (1 + Ω(θ)) t r 0 , where the last inequality Lemma 16 and λ min (∇ 2 f (x)). Since r 0 ≥ δE r (r)) 2∆ f √ d , T = Ω( 1 θ χc) Then we have ∥w T ∥ = ∥x T -x ′ T ∥ ≥ 1 4 (1 + ω(θ)) T r 0 ≥ 4S , which is contradicted with ∀t ≤ T , max{∥x t -x∥, ∥x ′ t -x∥} ≤ 2S . Therefore the following inqualty holds min{E T -E 0 , E ′ T -E ′ 0 } ≤ -2E . Since max{E 0 -Ẽ, E ′ 0 -Ẽ} = max{f (x 0 ) -f (x), f (x ′ 0 ) -f (x)} ≤ ϵr + ℓr 2 2 ≤ E . Then we have min{E T -Ẽ, E ′ T -Ẽ} ≤ -E . Lemma 6. 
Suppose ∥∇̂f(x_t)∥ ≤ 3ϵ/4 (thus ∥∇f(x_t)∥ ≤ ϵ), λ_min(∇²f(x_t)) ≤ −√(ρϵ), and
Vol(X)/Vol(B^{(d)}_{x_0}(r)) = (r_0/r)·Γ(d/2 + 1)/(√π·Γ(d/2 + 1/2)) ≤ δE/(2∆_f),
where X ⊂ B^{(d)}_{x_0}(r) is the stuck region defined in the proof. Then, with probability at least 1 − δE/(2∆_f), we have E_T − E_0 ≤ −E.

C.3 PROOF OF THEOREM 1

Proof. Consider the set H = {τ | τ ∈ [0, T] and ∥∇̂f(x_τ)∥ ≤ 3ϵ/4} and suppose that none of the x_τ is an ϵ-approximate SOSP. If H = ∅, then no perturbation is added and by Lemma 5 we have E_T − E_0 ≤ −E. Otherwise, if H ≠ ∅, define τ′ = argmin H. Then by Lemma 6, we have E_{τ′+T} − E_0 ≤ E_{τ′+T} − E_{τ′} ≤ −E. Denote by A_i the event that the Hamiltonian decreases by E in the i-th such period. By the union bound,
Pr(∩_i A_i) = 1 − Pr(∪_i Ā_i) ≥ 1 − Σ_i Pr(Ā_i) ≥ 1 − (2∆_f/E)·(δE/(2∆_f)) = 1 − δ.
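The proofs above repeatedly invoke a Lemma-1-type bound of the form ∥∇̂f(x) − ∇f(x)∥ ≤ O(ρ√d µ²) for ρ-Hessian-Lipschitz f. A quick numerical sketch of this O(µ²) scaling, assuming a coordinate-wise central-difference estimator and a deliberately simple cubic test function (both our own illustrative choices, not the paper's):

```python
import numpy as np

def zo_grad(f, x, mu):
    # Central differences: 2d function queries per gradient estimate.
    d = x.size
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = mu
        g[i] = (f(x + e) - f(x - e)) / (2 * mu)
    return g

# f(x) = sum_i x_i^3 / 6 has Hessian-Lipschitz constant rho = 1,
# and its exact gradient is x_i^2 / 2.
f = lambda x: float(np.sum(x ** 3)) / 6
grad = lambda x: x ** 2 / 2

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
errs = [np.linalg.norm(zo_grad(f, x, mu) - grad(x)) for mu in (1e-2, 5e-3)]
# For this exactly-cubic f the per-coordinate error is mu^2 / 6, so the
# error norm equals rho * sqrt(d) * mu^2 / 6, and halving mu divides it by 4.
assert abs(errs[0] / errs[1] - 4.0) < 1e-3
assert errs[0] <= np.sqrt(5) * (1e-2) ** 2 / 6 * (1 + 1e-6)
```

The quartic-in-µ term ρ²dµ⁴ appearing throughout the lemmas is exactly the square of this estimation error.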

D PROOF OF MAIN RESULTS OF ALGORITHM 3

Algorithm 4 Zeroth-Order Accelerated Negative Curvature Finding without Renormalization(x̄, r′, T′)
1: x_0 ← Unif(B_{x̄}(r′))
2: y_0 ← x_0
3: for t = 0, . . . , T′ do
4:   x_{t+1} ← y_t − η(∥y_t − x̄∥/r′)·(∇̂f(r′(y_t − x̄)/∥y_t − x̄∥ + x̄) − ∇̂f(x̄))
5:   v_{t+1} ← x_{t+1} − x_t
6:   y_{t+1} ← x_{t+1} + (1 − θ)v_{t+1}
7: return (x_{T′} − x̄)/∥x_{T′} − x̄∥

Lemma 26. The output of Algorithm 4 is the same as the unit vector ê in Algorithm 3. Denote the sequences of iterates obtained by Algorithm 3 and Algorithm 4 by {x_{1,0}, x_{1,1}, . . . , x_{1,T′}} and {x_{2,0}, x_{2,1}, . . . , x_{2,T′}}, respectively. Then we have
(x_{1,T′} − x̄)/∥x_{1,T′} − x̄∥ = (x_{2,T′} − x̄)/∥x_{2,T′} − x̄∥.
Proof. We prove by induction that
(x_{2,k} − x̄)/∥y_{2,k} − x̄∥ = (x_{1,k} − x̄)/r′,  (y_{2,k} − x̄)/∥y_{2,k} − x̄∥ = (y_{1,k} − x̄)/r′.
It is easy to check that the base case holds for k = 0. Assume that the above equations hold for all k ≤ t. Then we have
x_{2,t+1} − x̄ = y_{2,t} − x̄ − η(∥y_{2,t} − x̄∥/r′)(∇̂f(r′(y_{2,t} − x̄)/∥y_{2,t} − x̄∥ + x̄) − ∇̂f(x̄))
= y_{2,t} − x̄ − η(∥y_{2,t} − x̄∥/r′)(∇̂f(y_{1,t}) − ∇̂f(x̄))
= (∥y_{2,t} − x̄∥/r′)(y_{1,t} − x̄ − η(∇̂f(y_{1,t}) − ∇̂f(x̄))).
Denote by x′_{1,t+1}, y′_{1,t+1} the values of x_{1,t+1}, y_{1,t+1} before renormalization, i.e.,
x′_{1,t+1} = y_{1,t} − η(∇̂f(y_{1,t}) − ∇̂f(x̄)),  y′_{1,t+1} = x′_{1,t+1} + (1 − θ)(x′_{1,t+1} − x_{1,t}).
Then we have
x_{2,t+1} − x̄ = (∥y_{2,t} − x̄∥/r′)(x′_{1,t+1} − x̄),
y_{2,t+1} − x̄ = (x_{2,t+1} − x̄) + (1 − θ)((x_{2,t+1} − x̄) − (x_{2,t} − x̄)) = (∥y_{2,t} − x̄∥/r′)(y′_{1,t+1} − x̄),
so ∥y_{2,t+1} − x̄∥ = (∥y_{2,t} − x̄∥/r′)∥y′_{1,t+1} − x̄∥. Hence
x_{2,t+1} − x̄ = (∥y_{2,t} − x̄∥/r′)(x′_{1,t+1} − x̄) = (∥y_{2,t} − x̄∥/r′)·(∥y′_{1,t+1} − x̄∥/r′)(x_{1,t+1} − x̄) = (∥y_{2,t+1} − x̄∥/r′)(x_{1,t+1} − x̄),
which completes the induction.
Note that ∇²f(x̄) has the following eigendecomposition:
Proof (of Lemma 27). Define x_{−1} = x_0 − v_0. Without loss of generality, we assume that x̄ = 0. Consider the worst case that α_0 = √(π/d)·δ_0 and the component x_{0,d} along u_d equals 0.
Assume that the eigenvalues satisfy ∇ 2 f (x) = n i=1 λ i u i u T i , λ 2 = • • • = λ p = λ p+1 = • • • = λ d-1 = - √ ρϵ. Define ∆ = ∥yt∥ r ′ ∇f (y t r ′ ∥yt∥ ) -∇f (0) -∇ 2 f (0) r ′ ∥yt∥ y t and assume that ∆ lies in the direction that make α t as small as possible. Then, the component ∆ S in S should be in the opposite direction to v S , and the component ∆ S c in S c should be in the direction of v S c . Then we have both ∥x t,S c ∥/∥x t ∥ and ∥y t,S c ∥/∥y t ∥ being non-decreasing. Note that x t+2 = x t+1 + (1 -θ)(x t+1 -x t ) -η∆ -η∇ 2 f (0)(x t+1 + (1 -θ)(x t+1 -x t )) Then we consider the following recurrence formula: ∥x t+2,S c ∥ ≤ (1 + η √ ρϵ)(∥x t+1,S c ∥ + (1 -θ)(∥x t+1,S c ∥ -∥x t,S c ∥)) + η∥∆ S c ∥. Since ∥x t,S c ∥/∥x t ∥ is non-decreasing, we have ∥∆ S c ∥ ∥x t+1,S c ∥ ≤ ∥∆∥ ∥x t+1,S c ∥ ≤ ∥∆∥ ∥x t+1 ∥ ∥x 0 ∥ ∥x 0,S c ∥ ≤ ∥∆∥ ∥x t+1 ∥ ∥x 0 ∥ ∥x 0 ∥ -∥x 0,S ∥ = ∥∆∥ ∥x t+1 ∥ 1 1 -α 0 ≤ 2∥∆∥ ∥x t+1 ∥ ≤ 2 r ′ ρ( r ′2 2 + √ dµ 2 3 ) ≤ 2ρr ′ , where the last second step uses Lemma 2 and the last step is due to our choice of µ such that √ dµ 2 ≤ r ′2 . Then we have ∥x t+2,S c ∥ ≤ (1 + η √ ρϵ + 2ηρr ′ )((2 -θ)∥x t+1,S c ∥ -(1 -θ)∥x t,S c ∥). Then by Lemma 17, we have ∥x t,S c ∥ ≤( 1 + κ S c 2 ) t (- 2 -θ -µ S c 2µ S c ∥x 0,S c ∥ + 1 (1 + κ S c )µ S c (1 + κ S c )∥x 0,S c ∥) • (2 -θ + µ S c ) t +( 2 -θ + µ S c 2µ S c ∥x 0,S c ∥ - 1 (1 + κ S c )µ S c (1 + κ S c )∥x 0,S c ∥) • (2 -θ -µ S c ) t ≤( 1 + κ S c 2 ) t (- 2 -θ -µ S c 2µ S c ∥x 0,S c ∥ + ∥x 0,S c ∥ µ S c ) + ( 2 -θ + µ S c 2µ S c ∥x 0,S c ∥ - ∥x 0,S c ∥ µ S c ) • (2 -θ + µ S c ) t =( 1 + κ S c 2 ) t ∥x 0,S c ∥(2 -θ + µ S c ) t , where κ S c = η √ ρϵ + 2ηρr ′ , µ S c = ((2 -θ) 2 -4(1-θ) 1+κ S c ). Suppose for some value t, we have α k ≥ α min for any 1 ≤ k ≤ t + 1. Then we have ∥x t+2,S ∥ ≥ (1 + η √ ρϵ) ≥ (1 + η √ t+1,S ∥ + (1 -θ)(∥x t+1,S ∥ -∥x t,S ∥)) -η∥∆ S ∥. 
Since ∥x t+1,S ∥/∥x t+1 ∥ ≥ α min holds for all t > 0, we have ∥y t+1,S ∥ ∥yt+1∥ ≥ α min , then ∥∆ S ∥ ∥y t+1,S ∥ ≤ ∥∆∥ α min ∥y t+1 ∥ ≤ 1 α min ρ( r ′2 2 + √ dµ 2 3 ) ≤ ρr ′ α min , where the last second step uses Lemma 2 and the last step is due to our choice of µ such that √ dµ 2 ≤ r ′2 . Then we have ∥x t+2,S ∥ ≥ (1 + η √ ρϵ - ηρr ′ α min )((2 -θ)∥x t+1,S ∥ -(1 -θ)∥x t,S ∥). Then by Lemma 17, we have ∥x t,S ∥ ≥( 1 + κ S 2 ) t (- 2 -θ -µ S 2µ S ∥x 0,S ∥ + 1 (1 + κ S )µ S (1 + κ S )∥x 0,S ∥) • (2 -θ + µ S ) t +( 2 -θ + µ S 2µ S ∥x 0,S ∥ - 1 (1 + κ S )µ S (1 + κ S )∥x 0,S ∥) • (2 -θ -µ S ) t ≥( 1 + κ S 2 ) t • (- 2 -θ -µ S 2µ S ∥x 0,S ∥ + ∥x 0,S ∥ µ S ) • (2 -θ + µ S ) t =( 1 + κ S 2 ) t • ∥x 0,S ∥ 2 • (2 -θ + µ S ) t where κ S = η √ ρϵ -ηρr ′ αmin , µ S = ((2 -θ) 2 -4(1-θ) 1+κ S ). Then we have ∥x t,S ∥ ∥x t,S c ∥ ≥ ( 1 + κ S 1 + κ S c ) t ∥x 0,S ∥ 2∥x 0,S c ∥ ( 2 -θ + µ S 2 -θ + µ S c ) t , where 1 + κ S 1 + κ S c ≥(1 + κ S )(1 -κ S c ) = 1 -( 1 α min + 2)ηρr ′ -κ S κ S c ≥ 1 -2ηρr ′ /α min , 2 -θ + µ S 2 -θ + µ S c ≥(1 + µ S 2 -θ )(1 - µ S c 2 -θ ) = (1 + 1 2 -θ (2 -θ) 2 - 4(1 -θ) 1 + κ S )(1 - 1 2 -θ (2 -θ) 2 - 4(1 -θ) 1 + κ S c ) =(1 + 1 2 -θ θ 2 + κ S (2 -θ) 2 1 + κ S )(1 - 1 2 -θ θ 2 + κ S c (2 -θ) 2 1 + κ S c ) ≥ 1 - 2(κ S c -κ S ) θ ≥ 1 - 3ηρr ′ α min θ . Then we have ∥x t,S ∥ ∥x t,S c ∥ ≥ ∥x 0,S ∥ 2∥x 0,S c ∥ (1 -4ρr ′ α min θ ) t ≥ ∥x 0,S ∥ 2∥x 0,S c ∥ (1 -1/T ′ ) t ≥ ∥x 0,S ∥ 2∥x 0,S c ∥ exp(-t T ′ -1 ) ≥ ∥x 0,S ∥ 4∥x 0,S c ∥ . So we have α t = ∥x t,S ∥ ∥x t,S ∥ 2 + ∥x t,S c ∥ 2 ≥ ∥x 0,S ∥ 8∥x 0,S c ∥ ≥ α min . Thus for all t ≤ T ′ , we have α t ≥ α min .
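As a sanity check on the dynamics analyzed above, here is a minimal sketch of the Algorithm 4 iteration on a quadratic f(x) = ½xᵀHx with exact gradients (so ∇f(y) − ∇f(x̄) = H(y − x̄) and the zeroth-order error vanishes). Per Lemma 26 the direction of the iterates is invariant under rescaling, so the sketch rescales each step to avoid overflow; H, η, θ, and the iteration count are illustrative choices, not the paper's settings.

```python
import numpy as np

H = np.diag([-1.0, 0.5, 1.0])          # lambda_min = -1 along e_1
ell = 1.0
eta, theta = 1.0 / (4 * ell), 0.1

rng = np.random.default_rng(0)
x_prev = rng.standard_normal(3)        # x_0 - xbar, random start (xbar = 0)
x = x_prev.copy()                      # v_0 = 0
for _ in range(200):
    y = x + (1 - theta) * (x - x_prev)
    x_prev, x = x, y - eta * (H @ y)   # AGD step on the quadratic
    scale = np.linalg.norm(x)          # rescale the homogeneous recurrence;
    x_prev, x = x_prev / scale, x / scale  # the direction is unchanged (cf. Lemma 26)
e = x / np.linalg.norm(x)              # output direction, as in Algorithm 4
assert e @ H @ e < -0.9                # e approximates the minimum eigendirection
```

The component along the most negative eigendirection grows geometrically faster than all others, which is exactly the α_t ≥ α_min mechanism of Lemma 27.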

Proof of Lemma 7

Proof. As stated above, we only need to prove the case when λ_d ≥ −√(ρϵ)/2. We have
(2 − θ + µ_{S′c})/(2 − θ + µ_{S′}) = (1 + µ_{S′c}/(2 − θ))/(1 + µ_{S′}/(2 − θ)) ≤ 1/((1 + µ_{S′}/(2 − θ))(1 − µ_{S′c}/(2 − θ)))
= 1/((1 + √(1 − 4(1 − θ)/((1 + κ_{S′})(2 − θ)²)))(1 − √(1 − 4(1 − θ)/((1 + κ_{S′c})(2 − θ)²))))
≤ 1/(1 + 2(κ_{S′} − κ_{S′c})/θ) = 1 − (2(κ_{S′} − κ_{S′c})/θ)/(1 + 2(κ_{S′} − κ_{S′c})/θ) ≤ 1 − (κ_{S′} − κ_{S′c})/θ ≤ 1 − η√(ρϵ)/(4θ) = 1 − (ρϵ)^{1/4}/(16√ℓ).
Then we have
∥x_{T′,S′c}∥/∥x_{T′,S′}∥ ≤ (2/δ_0)·√(d/π)·(1 − (ρϵ)^{1/4}/(16√ℓ))^{T′} ≤ √(ρϵ)/(8ℓ).
So we conclude that there exists some 1 ≤ t_0 ≤ T′ such that ∥x_{t_0,S′c}∥/∥x_{t_0,S′}∥ ≤ √(ρϵ)/(8ℓ). Then we have
êᵀ∇²f(0)ê = (ê_{S′c} + ê_{S′})ᵀ∇²f(0)(ê_{S′c} + ê_{S′}) = ê_{S′c}ᵀ∇²f(0)ê_{S′c} + ê_{S′}ᵀ∇²f(0)ê_{S′}
≤ ℓ∥ê_{S′c}∥² − (√(ρϵ)/2)∥ê_{S′}∥² ≤ ρϵ/(64ℓ) − (√(ρϵ)/2)(1 − √(ρϵ)/(8ℓ))² ≤ −√(ρϵ)/4.

D.1 PROOF OF THEOREM 2

Proof. Recall the parameter settings of Algorithm 3: δ_0 = (δ/(384∆_f))·√(ϵ³/ρ), η = 1/(4ℓ), θ = 1/(4√κ), γ = θ²/η, s = γ/(4ρ), T′ = 32√κ·log(ℓ√d/(δ_0√(ρϵ))), E = √(ϵ³/ρ)·c_A⁻⁷, r′ = (δ_0 ϵ/32)·√(π/(ρd)), where c_A is a large enough constant. Define a new parameter T̄ = √κ·c_A. From Lemma 5 we know that if ∥∇̂f(x_τ)∥ ≥ 3ϵ/4 for all τ ∈ [0, T̄], then by running Algorithm 3 we have E_{T̄} − E_0 ≤ −E. We first assume that each time we can escape saddle points successfully, i.e., after T′ iterations of the perturbation step we have êᵀ∇²f(x̄)ê ≤ −√(ρϵ)/4, and that the total number of random perturbations is no more than 384(f(x_0) − f*)·√(ρ/ϵ³). By the union bound, the probability that the negative curvature finding fails to escape a saddle point at least once is upper bounded by 384(f(x_0) − f*)·√(ρ/ϵ³)·δ_0 ≤ δ. Then we assume that we never encounter an SOSP in the remaining steps. Set the total number of iterations to
T = max{ 4∆_f(T̄ + T′)/E, 768∆_f T′·√(ρ/ϵ³) } = O( (∆_f ℓ^{1/2} ρ^{1/4}/ϵ^{7/4})·log(ℓ√d ∆_f/(δϵ²)) ).
Denote by N_{T̄} the number of periods containing only large-gradient steps; then we have
N_{T̄} ≥ T/(2(T̄ + T′)) − 384(f(x_0) − f*)·√(ρ/ϵ³) ≥ (2c_A⁷ − 384)·∆_f·√(ρ/ϵ³) ≥ ∆_f/E.
By Lemma 5, the Hamiltonian then decreases by N_{T̄}·E ≥ ∆_f, which is a contradiction. Thus, with probability at least 1 − δ, we must encounter an ϵ-approximate SOSP during the T iterations.
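To put the rate of Theorem 2 in perspective, a tiny calculation comparing the Õ(1/ϵ^{1.75}) iteration count, and the resulting Õ(d/ϵ^{1.75}) oracle complexity, against the 1/ϵ² rate of non-accelerated methods. Log factors and constants are suppressed, and the numeric values of d and ϵ are purely illustrative.

```python
import math

# Leading-order iteration counts, ignoring log factors and constants.
def iters_accelerated(eps):
    return eps ** -1.75      # Theorem 2: O(eps^{-7/4}) iterations

def iters_plain(eps):
    return eps ** -2.0       # non-accelerated baseline: O(eps^{-2}) iterations

d, eps = 1000, 1e-3
# Each zeroth-order gradient estimate costs O(d) function queries,
# so the oracle complexity is d times the iteration count.
queries_acc = d * iters_accelerated(eps)
queries_plain = d * iters_plain(eps)
# Acceleration saves a factor of eps^{-1/4} in both iterations and queries.
assert math.isclose(queries_plain / queries_acc, eps ** -0.25, rel_tol=1e-9)
```

At ϵ = 10⁻³ this ϵ^{−1/4} factor is roughly 5.6x, and it grows as ϵ shrinks.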






Similar to Algorithm 1, we can also add a termination condition for Algorithm 3: once the precondition of the random perturbation step is reached, record the current iterate x_{t_0} and the current function value f(x_{t_0}) before adding the random perturbation. If the decrease of the function value is less than (1/384)·√(ϵ³/ρ)

Figure 1: Performance of different algorithms to minimize the cubic regularization problem with growing dimensions. Confidence intervals show min-max intervals over ten runs.

Figure 2: Performance of different algorithms to minimize the quartic function with growing dimensions. Confidence intervals show min-max intervals over ten runs.

where {u_i}_{i=1}^{d} forms an orthonormal basis of R^d. Without loss of generality, assume that λ_1 ≤ λ_2 ≤ ··· ≤ λ_d and λ_1 ≤ −√(ρϵ). If λ_d ≤ −√(ρϵ)/2, then the lemma holds directly. We therefore prove the case when λ_d ≥ −√(ρϵ)/2 and assume that λ_p ≤ −√(ρϵ) ≤ λ_{p+1} (p > 1). Let S be the subspace of R^d spanned by {u_1, u_2, . . . , u_p} and S^c be the subspace spanned by {u_{p+1}, u_{p+2}, . . . , u_d}. Then we have the following lemma:
Lemma 27. Denote α_t = ∥(x_t − x̄)_S∥/∥x_t − x̄∥, where (x_t − x̄)_S is the component of x_t − x̄ in the subspace S. Then, during all the T′ iterations of Algorithm 4, we have α_t ≥ α_min = (δ_0/8)·√(π/d), given that α_0 = √(π/d)·δ_0.

we have ∥ê S ′c ∥ ≤ √ ρϵ 8ℓ and ∥ê S ′ ∥ ≥ ∥ê∥ -∥ê S ′c ∥ ≥ 1 -√ ρϵ 8ℓ .

Comparison of different zeroth-order methods for finding ϵ-approximate second-order stationary points.

and no perturbation is added in iterations [t − T, t]. Then by running Algorithm 1, we have E_T − E_0 ≤ −E with probability at least 1 − δE/(2∆_f).
Proof. According to the precondition of Lemma 6, a perturbation is added at iteration 0, after which the Hamiltonian increases by at most E. According to Lemma 19, the Hamiltonian decreases by 2E if at least one NCE step is called, and thus E_T − E_0 ≤ −E. Otherwise, the NCE step is never reached in iterations [0, T]. In this case, denote by B_{x_0}(r) the ball of radius r around x_0, and let X ⊂ B_{x_0}(r) be the region such that the Hamiltonian does not decrease by E if the AGD sequence starts from a point x ∈ X. Then by Lemma 25, the width of this region along the minimum eigendirection is no more than r_0 = δE·r/(2∆_f√d).

Thus the Hamiltonian will decrease by at least E/(2T) per step on average, and the total number of such periods is no more than ⌊2∆_f/E⌋. Denote by A the event that the argument of Theorem 1 is true, and denote by A_i, i ∈ {1, . . . , ⌊2∆_f/E⌋}, the events that the Hamiltonian decreases by E in the i-th period.

√(ρϵ)/2. Then there exists some p′ such that λ_{p′} ≤ −√(ρϵ)/2 ≤ λ_{p′+1}. Let S′ be the subspace of R^d spanned by {u_1, u_2, . . . , u_{p′}} and S′^c be the complementary subspace. Define
x_{t,S′} = Σ_{i=1}^{p′} ⟨u_i, x_t⟩u_i,  x_{t,S′c} = Σ_{i=p′+1}^{d} ⟨u_i, x_t⟩u_i,
and let α_t = ∥x_{t,S′}∥/∥x_t∥. We know that with probability at least 1 − δ_0 we have α_0 ≥ √(π/d)·δ_0. Then we prove that there exists some t_0 with 1 ≤ t_0 ≤ T′ such that ∥x_{t_0,S′c}∥/∥x_{t_0,S′}∥ ≤ √(ρϵ)/(8ℓ). Assume on the contrary that this fails for all 1 ≤ t ≤ T′. Then we consider the case when ∥x_{t,S′c}∥ achieves the largest possible value, and we have the following recurrence formula:
∥x_{t+2,S′c}∥ ≤ (1 + η√(ρϵ)/2)(∥x_{t+1,S′c}∥ + (1 − θ)(∥x_{t+1,S′c}∥ − ∥x_{t,S′c}∥)) + η∥∆_{S′c}∥.
Bounding ∥∆_{S′c}∥ relative to ∥x_{t+1,S′c}∥ + (1 − θ)(∥x_{t+1,S′c}∥ − ∥x_{t,S′c}∥) as in the proof of Lemma 27, where the last step is due to Lemma 2 and our choice of µ such that √d·µ² ≤ r′², we have
∥x_{t+2,S′c}∥ ≤ (1 + η√(ρϵ)/2 + 2ρr′/√(ρϵ))((2 − θ)∥x_{t+1,S′c}∥ − (1 − θ)∥x_{t,S′c}∥).
Then we have
∥x_{t,S′c}∥ ≤ ∥x_{0,S′c}∥·((1 + κ_{S′c})/2)^t·(2 − θ + µ_{S′c})^t,
where κ_{S′c} = η√(ρϵ)/2 + 2ρr′/√(ρϵ) and µ_{S′c} = √((2 − θ)² − 4(1 − θ)/(1 + κ_{S′c})). By Lemma 27, we have ∥x_{t,S′}∥ ≥ ((1 + κ_{S′})/2)^t·(∥x_{0,S′}∥/2)·(2 − θ + µ_{S′})^t for any 1 ≤ t ≤ T′. Then we have
(1 + κ_{S′c})/(1 + κ_{S′}) ≤ 1/((1 + κ_{S′})(1 − κ_{S′c})) ≤ 1/(1 + (κ_{S′} − κ_{S′c})/2) = 1 − ((κ_{S′} − κ_{S′c})/2)/(1 + (κ_{S′} − κ_{S′c})/2) ≤ 1 − (κ_{S′} − κ_{S′c})/4.

ACKNOWLEDGMENT

The authors thank four anonymous reviewers for their helpful comments and suggestions. Bin Gu was partially supported by the National Natural Science Foundation of China under Grant 62076138.

