ESCAPING SADDLE POINTS IN ZEROTH-ORDER OPTIMIZATION: TWO FUNCTION EVALUATIONS SUFFICE

Abstract

Two-point zeroth-order methods are important in many applications of zeroth-order optimization arising in robotics, wind farms, power systems, online optimization, and adversarial robustness to black-box attacks in deep neural networks, where the problem can be high-dimensional and/or time-varying. Furthermore, such problems may be nonconvex and contain saddle points. While existing works have shown that zeroth-order methods utilizing Ω(d) function evaluations per iteration (with d denoting the problem dimension) can escape saddle points efficiently, it remains an open question whether zeroth-order methods based on two-point estimators can escape saddle points. In this paper, we show that by adding an appropriate isotropic perturbation at each iteration, a zeroth-order algorithm based on 2m (for any 1 ≤ m ≤ d) function evaluations per iteration can not only find ε-second-order stationary points polynomially fast, but do so using only Õ(d/ε^2.5) function evaluations.
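The phrase "2m function evaluations per iteration" refers to estimators of the following standard form from the zeroth-order literature (this display is a reconstruction for the reader's reference, using our own notation μ and u_i, rather than an equation quoted from this paper):

\[
\hat{\nabla} f(x) \;=\; \frac{d}{2m\mu} \sum_{i=1}^{m} \bigl( f(x + \mu u_i) - f(x - \mu u_i) \bigr)\, u_i ,
\]

where u_1, …, u_m are independent directions drawn uniformly from the unit sphere and μ > 0 is a small smoothing radius. Each summand costs exactly two function evaluations, so one iteration costs 2m evaluations, and m = 1 recovers the two-point estimator.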

1. INTRODUCTION

Two-point (or, in general, 2m-point, where 1 ≤ m < d with d being the problem dimension) estimators, which approximate the gradient using two (or 2m) function evaluations per iteration, have been widely studied in the zeroth-order optimization literature, in convex (Nesterov and Spokoiny, 2017; Duchi et al., 2015; Shamir, 2017), nonconvex (Nesterov and Spokoiny, 2017), online (Shamir, 2017), and distributed settings (Tang et al., 2019). A key reason is that for applications of zeroth-order optimization arising in robotics (Li et al., 2022), wind farms (Tang et al., 2020a), power systems (Chen et al., 2020), online (time-varying) optimization (Shamir, 2017), learning-based control (Malik et al., 2019; Li et al., 2021), and improving adversarial robustness to black-box attacks in deep neural networks (Chen et al., 2017), it may be costly or impractical to wait for Ω(d) function evaluations per iteration before making a step. This is especially true for high-dimensional and/or time-varying problems. See Appendix A for more discussion.

However, despite the advantages of zeroth-order methods with two-point estimators, existing work has not studied the ability of two-point estimators to escape saddle points in nonconvex optimization problems. Since nonconvex problems arise often in practice, it is crucial to know whether two-point algorithms can efficiently escape saddle points of nonconvex functions and converge to second-order stationary points.

To motivate the challenges of escaping saddle points using two-point zeroth-order methods, we begin with a review of escaping saddle points using first-order methods. The problem of efficiently escaping saddle points in deterministic first-order optimization (with exact gradients) has been carefully studied in several earlier works (Jin et al., 2017; 2018). A key idea in these works is the injection of an isotropic perturbation whenever the gradient is small, which facilitates escape from a saddle point whenever a negative-curvature direction exists, without actively identifying that direction.

However, the analysis of efficient saddle-point escape for stochastic gradient methods is often more complicated: in general, the behavior of the stochastic gradient near a saddle point can be difficult to characterize. Hence, strong concentration assumptions are typically made on the stochastic gradients being used, such as subGaussianity, bounded variance, or a bounded gradient estimator (Ge et al., 2015; Daneshmand et al., 2018; Xu et al., 2018; Fang et al., 2019; Roy et al., 2020; Vlaski and Sayed, 2021b), which creates an analytical issue when such idealized assumptions fail to hold. Indeed, though zeroth-order methods can be viewed as stochastic gradient methods, common zeroth-order estimators, such as two-point estimators (Nesterov and Spokoiny, 2017), are not subGaussian and can have unbounded variance. For instance, it can be shown that the variance of the two-point estimator is in general unbounded.
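To make the preceding discussion concrete, below is a minimal, self-contained sketch (our illustration, not the algorithm analyzed in this paper) of how a two-point estimator can be combined with the isotropic-perturbation idea of Jin et al. (2017); the step-size, smoothing, and perturbation parameters are placeholder values chosen for illustration.

import numpy as np

def two_point_grad(f, x, mu, rng):
    """Two-point gradient estimate along one random unit direction.

    Costs exactly two function evaluations. Note that the scale of the
    estimate grows with the directional derivative of f, which is why its
    variance need not be uniformly bounded.
    """
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)                       # uniform direction on the unit sphere
    d = x.size
    return (d / (2.0 * mu)) * (f(x + mu * u) - f(x - mu * u)) * u

def perturbed_two_point_descent(f, x0, n_iters=20_000, eta=1e-2,
                                mu=1e-4, r=1e-3, seed=0):
    """Zeroth-order descent with an isotropic perturbation at each iteration.

    The perturbation plays the same role as in the first-order scheme of
    Jin et al. (2017): near a saddle point, an isotropic random displacement
    picks up a negative-curvature direction (if one exists) without any
    attempt to identify that direction explicitly.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = two_point_grad(f, x, mu, rng)        # two function evaluations
        xi = rng.standard_normal(x.shape)        # isotropic Gaussian perturbation
        x = x - eta * g + r * xi                 # gradient step plus perturbation
    return x

For instance, on the strict-saddle function f(x) = x_1^2 - x_2^2 + x_2^4/4, initialized at the saddle point x = 0 (where the two-point estimate is exactly zero by symmetry of f), the perturbation is what moves the iterate onto the negative-curvature direction, after which descent proceeds toward the minimizers (0, ±√2).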

