FASTER GRADIENT-FREE METHODS FOR ESCAPING SADDLE POINTS

Abstract

Escaping from saddle points has become an important research topic in nonconvex optimization. In this paper, we study the case where calculating explicit gradients is expensive or even infeasible, and only function values are accessible. Currently, two types of gradient-free (zeroth-order) methods, based on random perturbation and negative curvature finding respectively, have been proposed to escape saddle points efficiently and converge to an ϵ-approximate second-order stationary point. Nesterov's accelerated gradient descent (AGD) method can escape saddle points faster than gradient descent (GD), a result that has been verified for first-order algorithms. However, whether AGD can accelerate gradient-free methods remains unstudied. To answer this question, we propose accelerated variants of both types of gradient-free methods for escaping saddle points. We show that our algorithms find an ϵ-approximate second-order stationary point with Õ(1/ϵ^1.75) iteration complexity and Õ(d/ϵ^1.75) oracle complexity, where d is the problem dimension. Thus, our methods achieve a convergence rate comparable to their first-order counterparts and have smaller oracle complexity than prior derivative-free methods for finding second-order stationary points.

1. INTRODUCTION

Non-convex optimization has received increasing attention in recent years because many modern machine learning (ML) and deep learning (DL) tasks can be formulated as optimizing models with non-convex loss functions. In this paper, we consider non-convex optimization of the following general form:

min_{x ∈ R^d} f(x),    (1)

where f(x) is differentiable and has Lipschitz continuous gradient and Hessian. We focus on situations where first-order information (the gradient) is not always directly accessible. Many machine learning and deep learning applications encounter settings where calculating explicit gradients is expensive or even infeasible, such as black-box adversarial attacks on deep neural networks (Papernot et al., 2017; Madry et al., 2018; Chen et al., 2017; Bhagoji et al., 2018; Tu et al., 2019), policy search in reinforcement learning (Salimans et al., 2017; Choromanski et al., 2018; Jing et al., 2021), and hyper-parameter optimization (Bergstra & Bengio, 2012). Therefore, zeroth-order optimization, which uses only zeroth-order information (function values) to optimize the non-convex problem Eq. (1), has gained increasing attention in machine learning.

In general, the goal of a non-convex optimization problem Eq. (1) is to find an ϵ-approximate first-order stationary point (FOSP, see Definition 3), since finding the global minimum is NP-hard. Gradient descent is proven to be an optimal first-order algorithm for finding an ϵ-approximate FOSP of the non-convex problem Eq. (1) under the gradient Lipschitz assumption (Carmon et al., 2020; 2021), requiring a gradient query complexity of Θ(1/ϵ^2). However, for non-convex functions, FOSPs can be local minima, global minima, or saddle points. The ubiquity of saddle points makes high-dimensional non-convex optimization extremely difficult and can lead to highly suboptimal solutions (Jain et al., 2017; Sun et al., 2018). Therefore, many recent works have focused on escaping saddle points and on converging to an ϵ-approximate second-order stationary point (SOSP, see Definition 4) using first-order methods.
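As a concrete illustration of what a zeroth-order oracle provides, the sketch below shows a standard coordinate-wise two-point finite-difference gradient estimator, which queries only function values. This is a generic estimator commonly used in derivative-free optimization, not the paper's specific method; the test function, smoothing parameter `mu`, and all names are illustrative assumptions.

```python
import numpy as np

def zo_gradient(f, x, mu=1e-4):
    """Coordinate-wise two-point finite-difference gradient estimate.

    Uses 2*d function evaluations for dimension d, which is one reason
    zeroth-order methods typically pay an extra factor of d in oracle
    complexity compared to first-order methods.
    """
    d = x.shape[0]
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = 1.0
        # Central difference along the i-th coordinate direction.
        g[i] = (f(x + mu * e) - f(x - mu * e)) / (2.0 * mu)
    return g

# Illustrative smooth non-convex function (an assumption, not from the paper):
# f(x) = x_0^2 - x_1^2 has a saddle point at the origin.
f = lambda x: x[0] ** 2 - x[1] ** 2
x = np.array([1.0, 1.0])
g = zo_gradient(f, x)
# True gradient at (1, 1) is (2, -2); for a quadratic, the central
# difference recovers it up to floating-point error.
```

Because the estimator makes 2d function queries per gradient estimate, an algorithm with Õ(1/ϵ^1.75) iteration complexity naturally incurs Õ(d/ϵ^1.75) oracle complexity, matching the scaling stated in the abstract.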

