ZEROTH-ORDER OPTIMIZATION WITH TRAJECTORY-INFORMED DERIVATIVE ESTIMATION

Abstract

Zeroth-order (ZO) optimization, in which the derivative of the objective function is unavailable, has recently succeeded in many important machine learning applications. Existing algorithms rely on finite difference (FD) methods for derivative estimation and gradient descent (GD)-based approaches for optimization. However, these algorithms suffer from query inefficiency because many additional function queries are required for derivative estimation in every GD update, which typically hinders their deployment in real-world applications where every function query is expensive. To this end, we propose a trajectory-informed derivative estimation method that employs only the optimization trajectory (i.e., the history of function queries during optimization) and hence eliminates the need for additional function queries to estimate a derivative. Moreover, based on this derivative estimation, we propose the technique of dynamic virtual updates, which allows us to reliably perform multiple steps of GD updates without reapplying derivative estimation. Based on these two contributions, we introduce the zeroth-order optimization with trajectory-informed derivative estimation (ZORD) algorithm for query-efficient ZO optimization. We theoretically demonstrate that our trajectory-informed derivative estimation and our ZORD algorithm improve over existing approaches, which is further supported by real-world experiments on black-box adversarial attack, non-differentiable metric optimization, and derivative-free reinforcement learning.

1. INTRODUCTION

Zeroth-order (ZO) optimization, in which the objective function to be optimized is only accessible by querying, has received great attention in recent years due to its success in many applications, e.g., black-box adversarial attack (Ru et al., 2020), non-differentiable metric optimization (Hiranandani et al., 2021), and derivative-free reinforcement learning (Salimans et al., 2017). In these problems, the derivative of the objective function is either prohibitively costly to obtain or even non-existent, making it infeasible to directly apply standard derivative-based algorithms such as gradient descent (GD). In this regard, existing works have proposed to estimate the derivative using finite difference (FD) methods and then apply GD-based algorithms using the estimated derivative for ZO optimization (Nesterov and Spokoiny, 2017; Cheng et al., 2021). These algorithms, which we refer to as GD with estimated derivatives, have been the most widely applied approach to ZO optimization, especially for problems with high-dimensional input spaces, because of their theoretically guaranteed convergence and competitive practical performance. Unfortunately, these algorithms suffer from query inefficiency, which hinders their real-world deployment, especially in applications with expensive-to-query objective functions, e.g., black-box adversarial attack. Specifically, one reason for the query inefficiency of existing algorithms based on GD with estimated derivatives is that, in addition to the necessary queries (i.e., the query of every updated input)¹, the FD methods applied in these algorithms require a large number of additional queries to accurately estimate the derivative at an input (Berahas et al., 2022). This naturally raises the question: Can we estimate a derivative without any additional query?
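To make the query cost of FD-based estimation concrete, the following sketch (our own illustration, not the implementation of any cited work) shows that even a simple forward-difference estimate of a d-dimensional gradient spends d function queries on top of the single necessary query at x:

```python
import numpy as np

def fd_gradient(oracle, x, h=1e-4):
    """Forward-difference gradient estimate.

    Uses 1 necessary query at x plus d additional queries, one per
    coordinate -- the extra per-update cost that FD-based ZO methods pay.
    """
    fx = oracle(x)                       # the necessary query
    g = np.empty_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (oracle(x + e) - fx) / h  # one additional query per dimension
    return g
```

For a noiseless quadratic oracle, this recovers the true gradient up to O(h) error; with noisy observations, even more queries (e.g., averaging over random directions) are typically needed, which is precisely the inefficiency discussed above.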
A natural approach to achieve this is to leverage the optimization trajectory, which is inherently available as a result of the necessary queries and their observations, to predict the derivatives. However, this requires a non-trivial method to simultaneously (a) predict a derivative using only the optimization trajectory (i.e., the history of updated inputs and their observations), and (b) quantify the uncertainty of this prediction to avoid using inaccurate predicted derivatives. Interestingly, the Gaussian process (GP) model satisfies both requirements and is hence a natural choice for such derivative estimation. Specifically, under the commonly used assumption that the objective function is sampled from a GP (Srinivas et al., 2010), the derivative at any input in the domain follows a Gaussian distribution which, surprisingly, can be calculated using only the optimization trajectory. This allows us to (a) employ the mean of this Gaussian distribution as the estimated derivative, and (b) use the covariance matrix of this Gaussian distribution to obtain a principled measure of the predictive uncertainty and hence of the accuracy of this derivative estimation; together, these constitute our trajectory-informed derivative estimation (Sec. 3.1). Another reason for the query inefficiency of existing algorithms based on GD with estimated derivatives is that every update in these algorithms requires reapplying derivative estimation and hence necessitates additional queries. This precludes the use of a large number of GD updates, since every update incurs potentially expensive additional queries. Therefore, another question arises: Can we perform multiple GD updates without reapplying derivative estimation and hence without any additional query? To address this question, we propose a technique named dynamic virtual updates (Sec. 3.2).
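As a rough illustration of how a GP posterior over the derivative can be computed from the trajectory alone, the sketch below differentiates the posterior mean and covariance of a GP with a squared-exponential kernel. The kernel choice, hyperparameters, and function names here are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rbf(a, b, sf2=1.0, ls=1.0):
    # squared-exponential kernel between the rows of a and b
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ls ** 2)

def grad_posterior(x, X, y, sf2=1.0, ls=1.0, noise=1e-4):
    """Posterior mean and covariance of the derivative at x, given only
    the trajectory (X, y) -- no additional function queries needed."""
    K = rbf(X, X, sf2, ls) + noise * np.eye(len(X))
    alpha = np.linalg.solve(K, y)
    k = rbf(x[None, :], X, sf2, ls)[0]               # k(x, X_i), shape (n,)
    dk = -(x[None, :] - X) * k[:, None] / ls ** 2    # d/dx k(x, X_i), shape (n, d)
    mean = dk.T @ alpha                              # estimated derivative
    cov = (sf2 / ls ** 2) * np.eye(len(x)) - dk.T @ np.linalg.solve(K, dk)
    return mean, cov
```

The diagonal of `cov` shrinks as the trajectory covers the neighborhood of x more densely, which is the uncertainty signal exploited below.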
Specifically, thanks to the ability of our method to estimate the derivative at any input in the domain while using only the existing optimization trajectory, we can apply multi-step GD updates without the need to reapply derivative estimation and hence without requiring any new query. Moreover, we can dynamically determine the number of steps for these updates by inspecting the aforementioned predictive uncertainty at every step, such that we only perform an update if the uncertainty is small enough (which also indicates that the estimation error is small, see Sec. 4.1). By incorporating our trajectory-informed derivative estimation and dynamic virtual updates into GD-based algorithms, we then introduce the zeroth-order optimization with trajectory-informed derivative estimation (ZORD) algorithm for query-efficient ZO optimization. We theoretically bound the estimation error of our trajectory-informed derivative estimation and show that this estimation error is non-increasing in the entire domain as the number of queries is increased, and can even be exponentially decreasing in some scenarios (Sec. 4.1). Based on this, we prove the convergence of our ZORD algorithm, which improves over existing ZO optimization algorithms that rely on FD methods for derivative estimation (Sec. 4.2). Lastly, we use extensive experiments, including black-box adversarial attack, non-differentiable metric optimization, and derivative-free reinforcement learning, to demonstrate that (a) our trajectory-informed derivative estimation improves over the existing FD methods and that (b) our ZORD algorithm consistently achieves improved query efficiency compared with previous ZO optimization algorithms (Sec. 5).
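The control logic behind dynamic virtual updates can be sketched as follows, assuming a `grad_est` callable that returns the posterior mean and covariance of the derivative at a point. The stopping rule on the trace of the covariance, the toy stand-in estimator, and all names are illustrative simplifications, not the paper's exact procedure:

```python
import numpy as np

def dynamic_virtual_updates(x, grad_est, eta=0.1, tau=1e-2, max_steps=50):
    """Take GD steps using estimated derivatives, without issuing any new
    function query, until the predictive uncertainty grows too large."""
    for _ in range(max_steps):
        mean, cov = grad_est(x)
        if np.trace(cov) > tau:  # estimate too uncertain at this input:
            break                # stop and issue a real (necessary) query
        x = x - eta * mean       # one virtual GD step, no new query
    return x

# Toy stand-in estimator: exact gradient of f(x) = ||x||^2, with zero
# uncertainty inside a ball (mimicking the region covered by the
# trajectory) and high uncertainty outside it.
def toy_grad_est(x):
    unc = 0.0 if np.linalg.norm(x) <= 2.0 else 1.0
    return 2.0 * x, unc * np.eye(len(x))
```

Starting inside the low-uncertainty region, the loop performs all `max_steps` virtual steps; starting outside it, the loop exits immediately and defers to a real query, which is the dynamic behavior described above.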

2. PRELIMINARIES

2.1 PROBLEM SETUP

Throughout this paper, we use ∇ and ∂_x to denote, respectively, the total derivative (i.e., gradient) and the partial derivative w.r.t. the variable x. We consider the minimization of a black-box objective function f : X → R, in which X ⊂ R^d is a convex subset of the d-dimensional space:

    min_{x ∈ X} f(x) .    (1)

Since we consider ZO optimization, derivative information is not accessible; instead, we are only allowed to query inputs in X. For every queried input x ∈ X, we observe a corresponding noisy output y(x) = f(x) + ζ, in which ζ is a zero-mean Gaussian noise with variance σ².
¹ In practice, it is usually necessary to query every updated input to measure the optimization performance and select the best-performing input. We refer to these queries as necessary queries.

