ESCAPING SADDLE POINTS IN ZEROTH-ORDER OPTIMIZATION: TWO FUNCTION EVALUATIONS SUFFICE

Abstract

Two-point zeroth-order methods are important in many applications of zeroth-order optimization arising in robotics, wind farms, power systems, online optimization, and adversarial robustness to black-box attacks in deep neural networks, where the problem can be high-dimensional and/or time-varying. Furthermore, such problems may be nonconvex and contain saddle points. While existing works have shown that zeroth-order methods utilizing Ω(d) function evaluations per iteration (with d denoting the problem dimension) can escape saddle points efficiently, it remains an open question whether zeroth-order methods based on two-point estimators can escape saddle points. In this paper, we show that by adding an appropriate isotropic perturbation at each iteration, a zeroth-order algorithm based on 2m (for any 1 ≤ m ≤ d) function evaluations per iteration can not only find ε-second-order stationary points polynomially fast, but do so using only Õ(d/ε^2.5) function evaluations.

1. INTRODUCTION

Two-point (or more generally 2m-point, where 1 ≤ m < d with d being the problem dimension) estimators, which approximate the gradient using two (or 2m) function evaluations per iteration, have been widely studied in the zeroth-order optimization literature, in convex (Nesterov and Spokoiny, 2017; Duchi et al., 2015; Shamir, 2017), nonconvex (Nesterov and Spokoiny, 2017), online (Shamir, 2017), as well as distributed settings (Tang et al., 2019). A key reason is that for applications of zeroth-order optimization arising in robotics (Li et al., 2022), wind farms (Tang et al., 2020a), power systems (Chen et al., 2020), online (time-varying) optimization (Shamir, 2017), learning-based control (Malik et al., 2019; Li et al., 2021), and improving adversarial robustness to black-box attacks in deep neural networks (Chen et al., 2017), it may be costly or impractical to wait for Ω(d) function evaluations per iteration before making a step. This is especially true for high-dimensional and/or time-varying problems; see Appendix A for more discussion.

However, despite the advantages of zeroth-order methods with two-point estimators, there has been a lack of existing work studying the ability of two-point estimators to escape saddle points in nonconvex optimization problems. Since nonconvex problems arise often in practice, it is crucial to know whether two-point algorithms can efficiently escape saddle points of nonconvex functions and converge to second-order stationary points.

To motivate the challenges of escaping saddle points using two-point zeroth-order methods, we begin with a review of escaping saddle points using first-order methods. The problem of efficiently escaping saddle points in deterministic first-order optimization (with exact gradients) has been carefully studied in several earlier works (Jin et al., 2017; 2018).
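To make the estimator concrete, the following is a minimal sketch (our illustration, not the paper's code) of the standard Gaussian-smoothing two-point estimator of Nesterov and Spokoiny (2017), averaged over m random directions to form a 2m-point estimator; the function name and the smoothing radius mu are our own illustrative choices.

```python
import numpy as np

def two_point_estimator(f, x, m=1, mu=1e-4, rng=None):
    """2m-point gradient estimate: average of m Gaussian-smoothing
    two-point estimates (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(m):
        u = rng.standard_normal(d)  # random isotropic direction
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / m
```

Each pair of function evaluations yields an unbiased estimate of the gradient of a Gaussian-smoothed surrogate of f, which is close to ∇f(x) for small mu; a single sample (m = 1), however, has variance growing with the dimension d.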
A key idea in these works is the injection of an isotropic perturbation whenever the gradient is small, which facilitates escape from a saddle point whenever a negative curvature direction exists, even without actively identifying that direction. However, the analysis of efficient saddle-point escape for stochastic gradient methods is often more complicated, since the behavior of the stochastic gradient near a saddle point can be difficult to characterize. Hence, strong concentration assumptions are typically imposed on the stochastic gradients, such as subGaussianity, bounded variance, or a bounded gradient estimator (Ge et al., 2015; Daneshmand et al., 2018; Xu et al., 2018; Fang et al., 2019; Roy et al., 2020; Vlaski and Sayed, 2021b), creating an analytical gap when such idealized assumptions fail to hold. Indeed, although zeroth-order methods can be viewed as stochastic gradient methods, common zeroth-order estimators, such as two-point estimators (Nesterov and Spokoiny, 2017), are not subGaussian and can have unbounded variance. For instance, it can be shown that the variance of the two-point estimator is on the order of Ω(d‖∇f(x)‖²) (Nesterov and Spokoiny, 2017), depending both on the problem dimension d and on the norm of the gradient, which can be unbounded. Due to this non-subGaussianity and unboundedness, it is difficult to bound the effect of such zeroth-order estimators and establish the tight concentration inequalities that facilitate escape near saddle points. In addition, the large variance of the zeroth-order estimator is also an issue away from saddle points, where the gradient is large. While this does not prevent showing function-value improvement in expectation, as we discuss later, it becomes an issue when establishing high-probability guarantees.
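The dimension dependence of the estimator's variance can be checked numerically. The snippet below (an illustrative experiment of ours, not from the paper) estimates the total variance, i.e. the trace of the covariance, of a single two-point estimate on the quadratic f(z) = ½‖z‖² at a point with unit gradient norm; for this f and small mu the estimate reduces to (∇f(x)·u)u, whose total variance is exactly d + 1, growing linearly in the dimension.

```python
import numpy as np

def estimator_total_variance(d, n_samples=20000, mu=1e-5, seed=0):
    """Empirical total variance (trace of the covariance) of the two-point
    estimator for f(z) = 0.5*||z||^2 at a point x with ||grad f(x)|| = 1."""
    rng = np.random.default_rng(seed)
    f = lambda z: 0.5 * z @ z
    x = np.zeros(d)
    x[0] = 1.0                        # grad f(x) = e_1, unit norm
    samples = np.empty((n_samples, d))
    for i in range(n_samples):
        u = rng.standard_normal(d)
        samples[i] = (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return samples.var(axis=0).sum()  # trace of empirical covariance
```

Even though the gradient norm is held fixed at 1 here, the variance grows linearly in d, which is consistent with the Ω(d‖∇f(x)‖²) scaling discussed above.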
Due to these difficulties, previous works on escaping saddle points in zeroth-order optimization have exclusively focused on approaches requiring Ω(d) function evaluations per iteration to accurately estimate the gradient (Bai et al., 2020; Vlatakis-Gkaragkounis et al., 2019), or in some cases negative curvature directions (Zhang et al., 2022; Lucchi et al., 2021) or the Hessian itself (Balasubramanian and Ghadimi, 2022), in a sense reducing the zeroth-order problem back to a first-order one. However, as explained earlier, two-point or 2m-point zeroth-order algorithms are important for high-dimensional and/or time-varying problems in many application areas. This raises an important question: Can two-point zeroth-order methods escape saddle points and reach approximate second-order stationary points efficiently?

Our Contribution. In this work, we show that by adding an appropriate isotropic perturbation at each iteration, a zeroth-order algorithm based on any number m of pairs of function evaluations per iteration (with m ranging from 1 to d) can not only find (ε, √ρε)-second-order stationary points (cf. Definition 1) polynomially fast, but do so using only Õ(polylog(1/δ) · d/ε^2.5) function evaluations, with probability at least 1 − δ. In particular, this proves that a single two-point zeroth-order estimator at each iteration (with appropriate perturbation) suffices to efficiently escape saddle points in zeroth-order optimization, with high probability. Moreover, for functions that are (ε, ψ)-strict saddle (see Definition 3 for the definition of strict-saddle functions), our bound becomes Õ(polylog(1/δ) · d/(εψ²)), a significant improvement when ψ ≫ ε^(3/4); strict-saddle functions have been identified as an important class in nonconvex optimization, with several well-known examples such as tensor decomposition (Ge et al., 2015), dictionary learning, and phase retrieval (Sun et al., 2015).
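The algorithmic idea, a plain zeroth-order step plus a fresh isotropic perturbation at every iteration, can be sketched as follows. This is our schematic illustration with placeholder step size, smoothing radius, and perturbation scale, not the paper's tuned parameters; the test function cos(x1) + x2²/2 used below has a strict saddle at the origin (negative curvature along e1) and local minima at x1 = ±π.

```python
import numpy as np

def perturbed_zo_descent(f, x0, eta=0.05, mu=1e-4, r=1e-2,
                         n_iters=2000, seed=0):
    """Zeroth-order descent with one two-point gradient estimate per
    iteration, plus an isotropic Gaussian perturbation of scale r
    added at every step to help escape saddle points."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    d = x.shape[0]
    for _ in range(n_iters):
        u = rng.standard_normal(d)
        g = (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u  # two-point estimate
        x = x - eta * g + r * rng.standard_normal(d)        # perturbed step
    return x
```

Started at (0, 0.3), which lies on the stable manifold of the saddle at the origin (the gradient component along the unstable direction e1 is exactly zero), the isotropic perturbation supplies the initial displacement along the negative curvature direction, after which the iterates drift to a neighborhood of a minimizer at x1 = ±π.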
A comparison of our results with existing zeroth-order and first-order methods is shown in Table 1. We also provide numerical results in Appendix G showing that our proposed two-point algorithm requires fewer total function evaluations to converge than zeroth-order methods that use 2d function evaluations per iteration, on a nonconvex test function proposed in Du et al. (2017).

To overcome the theoretical challenges discussed earlier, we (i) first show, via a careful analysis, that zeroth-order methods can make function-value improvement across iterates with large gradients with high probability, even when only a single two-point estimator (which can have significant variance at large gradients) is used per iteration; and (ii) second, near saddle points, overcome the issues caused by the unbounded variance and non-subGaussianity of zeroth-order gradient estimators by developing new technical tools, including novel martingale concentration inequalities involving Gaussian vectors, to tightly bound such terms. This in turn allows us to show that the noise emanating from the zeroth-order estimators does not overwhelm the effect of the additional isotropic perturbation, facilitating escape along negative curvature directions. To the best of our knowledge, both analyses are novel and may be of independent interest.

Related Work. Due to space considerations, we defer a full discussion of related work to Appendix A.

2. PROBLEM SETUP

We make the following assumptions on the class of functions f : R^d → R which we consider.

Assumption 1 (Properties of f). We suppose that f : R^d → R satisfies the following properties:
1. f is twice-differentiable and lower bounded, i.e., f* := min_x f(x) > −∞.
2. ∇f is L-Lipschitz, i.e., ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y ∈ R^d.
3. f is ρ-Hessian Lipschitz, i.e., ‖∇²f(x) − ∇²f(y)‖ ≤ ρ‖x − y‖ for all x, y ∈ R^d.

In our work, we focus on finding approximate second-order stationary points, defined below.

Definition 1. A point x ∈ R^d is an (ε, ϕ)-second-order stationary point if ‖∇f(x)‖ < ε and λ_min(∇²f(x)) > −ϕ.
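As a small illustration of Definition 1, the check below evaluates both conditions given callables for the gradient and Hessian (exact first- and second-order oracles, used here purely for illustration; the paper's setting is of course zeroth-order).

```python
import numpy as np

def is_second_order_stationary(grad, hess, x, eps, phi):
    """Definition 1: ||grad f(x)|| < eps and lambda_min(Hess f(x)) > -phi."""
    small_grad = np.linalg.norm(grad(x)) < eps
    no_escape_direction = np.linalg.eigvalsh(hess(x)).min() > -phi
    return small_grad and no_escape_direction
```

For instance, the origin of f(x) = x1² − x2² has zero gradient but λ_min(∇²f) = −2, so it is first-order stationary yet fails the second-order condition for any ϕ < 2.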

