FASTER GRADIENT-FREE METHODS FOR ESCAPING SADDLE POINTS

Abstract

Escaping saddle points has become an important research topic in non-convex optimization. In this paper, we study the case where calculations of explicit gradients are expensive or even infeasible, and only function values are accessible. Currently, two types of gradient-free (zeroth-order) methods, based respectively on random perturbation and on negative curvature finding, have been proposed to escape saddle points efficiently and converge to an ϵ-approximate second-order stationary point. Nesterov's accelerated gradient descent (AGD) method can escape saddle points faster than gradient descent (GD), as has been verified for first-order algorithms. However, whether AGD can accelerate gradient-free methods remains unstudied. To answer this question, we propose accelerated variants of both types of gradient-free methods for escaping saddle points. We show that our algorithms can find an ϵ-approximate second-order stationary point with Õ(1/ϵ^1.75) iteration complexity and Õ(d/ϵ^1.75) oracle complexity, where d is the problem dimension. Thus, our methods achieve a convergence rate comparable to their first-order counterparts and have smaller oracle complexity than prior derivative-free methods for finding second-order stationary points.

1. INTRODUCTION

Non-convex optimization has received increasing attention in recent years because many modern machine learning (ML) and deep learning (DL) tasks can be formulated as optimizing models with non-convex loss functions. In this paper, we consider non-convex optimization of the following general form: min_{x∈R^d} f(x), where f(x) is differentiable and has Lipschitz continuous gradient and Hessian. We focus on situations where first-order information (the gradient) is not always directly accessible. Many machine learning and deep learning applications encounter settings where the calculation of explicit gradients is expensive or even infeasible, such as black-box adversarial attacks on deep neural networks (Papernot et al., 2017; Madry et al., 2018; Chen et al., 2017; Bhagoji et al., 2018; Tu et al., 2019), policy search in reinforcement learning (Salimans et al., 2017; Choromanski et al., 2018; Jing et al., 2021), and hyper-parameter optimization (Bergstra & Bengio, 2012). Therefore, zeroth-order optimization, which uses only zeroth-order information (function values) to optimize the non-convex problem Eq. (1), has gained increasing attention in machine learning.

In general, the goal of the non-convex optimization problem Eq. (1) is to find an ϵ-approximate first-order stationary point (FOSP, see Definition 3), since finding the global minimum is NP-hard. Gradient descent is proven to be an optimal first-order algorithm for finding an ϵ-approximate FOSP of the non-convex problem Eq. (1) under the gradient Lipschitz assumption (Carmon et al., 2020; 2021), with a gradient query complexity of Θ(1/ϵ^2). However, for non-convex functions, FOSPs can be local minima, global minima, or saddle points. The ubiquity of saddle points makes high-dimensional non-convex optimization problems extremely difficult and can lead to highly suboptimal solutions (Jain et al., 2017; Sun et al., 2018).
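Since only function values are available in this setting, the gradient must be estimated from function queries. The following minimal sketch (an illustration of the zeroth-order setting, not an algorithm from this paper) runs gradient descent driven by a coordinate-wise forward-difference estimator; the step size `eta`, smoothing parameter `mu`, and stopping tolerance `eps` are hypothetical choices:

```python
import numpy as np

def zo_gradient(f, x, mu=1e-5):
    """Coordinate-wise forward-difference gradient estimator.

    Uses d + 1 function queries per call; only zeroth-order
    (function value) information is accessed.
    """
    d = x.shape[0]
    fx = f(x)
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = mu
        g[i] = (f(x + e) - fx) / mu
    return g

def zo_gd(f, x0, eta=0.1, eps=1e-3, max_iter=10_000):
    """Zeroth-order gradient descent: stop at an approximate FOSP,
    i.e. when the estimated gradient norm falls below eps."""
    x = x0.copy()
    for _ in range(max_iter):
        g = zo_gradient(f, x)
        if np.linalg.norm(g) <= eps:
            break
        x = x - eta * g
    return x
```

For a smooth convex test function such as f(x) = ∥x∥², this sketch converges to a neighborhood of the minimizer whose radius is governed by the finite-difference bias mu; on non-convex functions it can, of course, stall at a saddle point, which is exactly the failure mode the methods surveyed below address.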
Therefore, many recent research works have focused on escaping saddle points and on the properties of converging to an ϵ-approximate second-order stationary point (SOSP, see Definition 4) using first-order methods. A recent line of work showed that first-order methods can efficiently escape saddle points and converge to SOSPs. Specifically, Jin et al. (2017) proposed the perturbed gradient descent (PGD) algorithm, which adds uniform random perturbation to standard gradient descent and finds an ϵ-approximate SOSP in Õ(log^4 d/ϵ^2) gradient queries. Under the zeroth-order setting, Jin et al. (2018a) proposed a zeroth-order perturbed stochastic gradient descent (ZPSGD) method, which studied the power of Gaussian smoothing and stochastic perturbed gradients for finding local minima. The role of Gaussian smoothing is to reduce zeroth-order optimization to stochastic first-order optimization of a Gaussian-smoothed version of problem Eq. (1). They proved that their method finds an ϵ-approximate SOSP with a function query complexity of Õ(d^2/ϵ^5). Vlatakis-Gkaragkounis et al. (2019) proposed the perturbed approximate gradient descent (PAGD) method using forward-difference coordinate-wise gradient estimators, which finds an ϵ-approximate SOSP in Õ(d log^4 d/ϵ^2) function queries. Recently, Lucchi et al. (2021) proposed a random search power iteration (RSPI) method, which alternates between a random search step and a zeroth-order power iteration step, and can find an (ϵ, ϵ^{2/3})-approximate SOSP (∥∇f(x)∥ ≤ ϵ, λ_min(∇^2 f(x)) ≥ -ϵ^{2/3}) in O(d log d/ϵ^{8/3}) function queries. Zhang et al. (2022) proposed a zeroth-order gradient descent method with zeroth-order negative curvature finding that can find an (ϵ, δ)-approximate SOSP (∥∇f(x)∥ ≤ ϵ, λ_min(∇^2 f(x)) ≥ -δ) in O(d/ϵ^2 + d log d/δ^{3.5}) function queries.

Another line of work adds a restart mechanism to Nesterov's AGD. On finding SOSPs, Nesterov's AGD is also proved to be more efficient than GD. Jin et al. (2018b) studied a variant of Nesterov's AGD named perturbed AGD and proved that it can find an ϵ-approximate SOSP in Õ(log^6 d/ϵ^{7/4}) gradient queries. Their method added two algorithmic features to Nesterov's AGD, random perturbation and negative curvature exploitation, to ensure the monotonic decrease of the Hamiltonian function (see Eq. (4)). Allen-Zhu & Li (2018) proposed a first-order negative curvature finding framework named Neon2 that can find the most negative curvature direction efficiently. Combining Neon2 with the CDHS method of Carmon et al. (2018) can find an ϵ-approximate SOSP in Õ(log d/ϵ^{7/4}) gradient queries, improving the complexity of the perturbed AGD method by a factor of poly(log d) thanks to the negative curvature finding subroutine. Recently, Zhang & Li (2021) proposed a single-loop algorithm that achieves the same gradient query complexity by replacing the random perturbation step in perturbed AGD with accelerated negative curvature finding. Given the advantages of Nesterov's AGD in finding SOSPs in first-order optimization, it is natural to design AGD-based zeroth-order methods that find SOSPs with smaller function query complexity. To the best of our knowledge, this gap in zeroth-order optimization has not yet been filled.
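To make the negative curvature finding idea behind several of the methods above concrete, here is a toy sketch (not any cited algorithm's implementation) that approximates a Hessian-vector product purely from function values and runs power iteration on the shifted matrix L·I − ∇²f(x), whose dominant eigenvector is the most negative curvature direction of the Hessian; the curvature bound `L` and the difference radii `mu`, `r` are hypothetical choices:

```python
import numpy as np

def fd_grad(f, x, mu=1e-5):
    """Forward-difference gradient estimator (d + 1 function queries)."""
    d, fx = x.shape[0], f(x)
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = mu
        g[i] = (f(x + e) - fx) / mu
    return g

def neg_curvature_direction(f, x, L=10.0, r=1e-4, iters=200, seed=0):
    """Power iteration on L*I - H using only function values.

    The Hessian-vector product H v is estimated by a central
    difference of finite-difference gradients, so each iteration
    costs O(d) function queries. Returns a unit vector; when
    lambda_min(H) < 0 it aligns with the most negative curvature
    direction, provided L upper-bounds the Hessian spectral norm.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(x.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = (fd_grad(f, x + r * v) - fd_grad(f, x - r * v)) / (2 * r)
        v = L * v - hv          # one power-iteration step on L*I - H
        v /= np.linalg.norm(v)
    return v
```

For the saddle f(x) = x_1² − x_2² at the origin, this sketch converges to ±e_2, the direction along which the function curves downward; a descent step along that direction is what lets negative-curvature-based methods escape the saddle.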

Contributions

The main contributions of this paper are summarized as follows:

[Table: Comparison of different zeroth-order methods for finding ϵ-approximate second-order stationary points.]