ZEROTH-ORDER OPTIMIZATION WITH TRAJECTORY-INFORMED DERIVATIVE ESTIMATION

Abstract

Zeroth-order (ZO) optimization, in which the derivative is unavailable, has recently succeeded in many important machine learning applications. Existing algorithms rely on finite difference (FD) methods for derivative estimation and gradient descent (GD)-based approaches for optimization. However, these algorithms suffer from query inefficiency because many additional function queries are required for derivative estimation in their every GD update, which typically hinders their deployment in real-world applications where every function query is expensive. To this end, we propose a trajectory-informed derivative estimation method which only employs the optimization trajectory (i.e., the history of function queries during optimization) and hence can eliminate the need for additional function queries to estimate a derivative. Moreover, based on our derivative estimation, we propose the technique of dynamic virtual updates, which allows us to reliably perform multiple steps of GD updates without reapplying derivative estimation. Based on these two contributions, we introduce the zeroth-order optimization with trajectory-informed derivative estimation (ZORD) algorithm for query-efficient ZO optimization. We theoretically demonstrate that our trajectory-informed derivative estimation and our ZORD algorithm improve over existing approaches, which is then supported by our real-world experiments such as black-box adversarial attack, non-differentiable metric optimization, and derivative-free reinforcement learning.

1. INTRODUCTION

Zeroth-order (ZO) optimization, in which the objective function to be optimized is only accessible by querying, has received great attention in recent years due to its success in many applications, e.g., black-box adversarial attack (Ru et al., 2020), non-differentiable metric optimization (Hiranandani et al., 2021), and derivative-free reinforcement learning (Salimans et al., 2017). In these problems, the derivative of the objective function is either prohibitively costly to obtain or even non-existent, making it infeasible to directly apply standard derivative-based algorithms such as gradient descent (GD). In this regard, existing works have proposed to estimate the derivative using finite difference (FD) methods and then apply GD-based algorithms using the estimated derivative for ZO optimization (Nesterov and Spokoiny, 2017; Cheng et al., 2021). These algorithms, which we refer to as GD with estimated derivatives, have been the most widely applied approach to ZO optimization, especially for problems with high-dimensional input spaces, because of their theoretically guaranteed convergence and competitive practical performance. Unfortunately, these algorithms suffer from query inefficiency, which hinders their real-world deployment, especially in applications with expensive-to-query objective functions, e.g., black-box adversarial attack. Specifically, one of the reasons for the query inefficiency of existing algorithms based on GD with estimated derivatives is that, in addition to the necessary queries (i.e., the query of every updated input), the FD methods applied in these algorithms require a large number of additional queries to accurately estimate the derivative at an input (Berahas et al., 2022). This naturally raises the question: Can we estimate a derivative without any additional query?
A natural approach to achieve this is to leverage the optimization trajectory, which is inherently available as a result of the necessary queries and their observations, to predict the derivatives. However, this requires a non-trivial method to simultaneously (a) predict a derivative using only the optimization trajectory (i.e., the history of updated inputs and their observations), and (b) quantify the uncertainty of this prediction to avoid using inaccurate predicted derivatives. Interestingly, the Gaussian process (GP) model satisfies both requirements and is hence a natural choice for such a derivative estimation. Specifically, under the commonly used assumption that the objective function is sampled from a GP (Srinivas et al., 2010) , the derivative at any input in the domain follows a Gaussian distribution which, surprisingly, can be calculated using only the optimization trajectory. This allows us to (a) employ the mean of this Gaussian distribution as the estimated derivative, and (b) use the covariance matrix of this Gaussian distribution to obtain a principled measure of the predictive uncertainty and the accuracy of this derivative estimation, which together constitute our trajectory-informed derivative estimation (Sec. 3.1). Another reason for the query inefficiency of the existing algorithms on GD with estimated derivatives is that every update in these algorithms requires reapplying derivative estimation and hence necessitates additional queries. This can preclude their adoption of a large number of GD updates since every update requires potentially expensive additional queries. Therefore, another question arises: Can we perform multiple GD updates without reapplying derivative estimation and hence without any additional query? To address this question, we propose a technique named dynamic virtual updates (Sec. 3.2). 
Specifically, thanks to the ability of our method to estimate the derivative at any input in the domain while only using the existing optimization trajectory, we can apply multi-step GD updates without the need to reapply derivative estimation and hence without requiring any new query. Moreover, we can dynamically determine the number of steps for these updates by inspecting the aforementioned predictive uncertainty at every step, such that we only perform an update if the uncertainty is small enough (which also indicates that the estimation error is small, see Sec. 4.1). By incorporating our aforementioned trajectory-informed derivative estimation and dynamic virtual updates into GD-based algorithms, we then introduce the zeroth-order optimization with trajectory-informed derivative estimation (ZORD) algorithm for query-efficient ZO optimization. We theoretically bound the estimation error of our trajectory-informed derivative estimation and show that this estimation error is non-increasing in the entire domain as the number of queries is increased and can even be exponentially decreasing in some scenarios (Sec. 4.1). Based on this, we prove the convergence of our ZORD algorithm, which improves over the existing ZO optimization algorithms that rely on the FD methods for derivative estimation (Sec. 4.2). Lastly, we use extensive experiments, such as black-box adversarial attack, non-differentiable metric optimization, and derivative-free reinforcement learning, to demonstrate that (a) our trajectory-informed derivative estimation improves over the existing FD methods and that (b) our ZORD algorithm consistently achieves improved query efficiency compared with previous ZO optimization algorithms (Sec. 5).

2. PRELIMINARIES

2.1. PROBLEM SETUP

Throughout this paper, we use $\nabla$ and $\partial_x$ to denote, respectively, the total derivative (i.e., gradient) and the partial derivative w.r.t. the variable $x$. We consider the minimization of a black-box objective function $f : \mathcal{X} \to \mathbb{R}$, in which $\mathcal{X} \subset \mathbb{R}^d$ is a convex subset of the d-dimensional domain:

$$\min_{x \in \mathcal{X}} f(x). \tag{1}$$

Since we consider ZO optimization, the derivative information is not accessible and instead, we are only allowed to query the inputs in $\mathcal{X}$. For every queried input $x \in \mathcal{X}$, we observe a corresponding noisy output of $y(x) = f(x) + \zeta$, in which $\zeta$ is a zero-mean Gaussian noise with a variance of $\sigma^2$: $\zeta \sim \mathcal{N}(0, \sigma^2)$. Besides, we adopt a common assumption on f which has already been widely used in the literature of Bayesian optimization (BO) (Srinivas et al., 2010; Kandasamy et al., 2018): we assume that f is sampled from a Gaussian process (GP). A GP $\mathcal{GP}(\mu(\cdot), k(\cdot, \cdot))$, which is characterized by a mean function $\mu(\cdot)$ and a covariance function $k(\cdot, \cdot)$, is a stochastic process in which any finite subset of random variables follows a multivariate Gaussian distribution (Rasmussen and Williams, 2006). In addition, following the common practice of GP and BO, we assume w.l.o.g. that $\mu(x) = 0$ and $k(x, x') \le 1$ ($\forall x, x' \in \mathcal{X}$). We also assume that the kernel function k is differentiable, and that $\|\partial_z \partial_{z'} k(z, z')|_{z=z'=x}\|_2 \le \kappa^2$, $\forall x \in \mathcal{X}$, for some $\kappa > 0$. This is satisfied by most commonly used kernels such as the squared exponential (SE) kernel (Rasmussen and Williams, 2006).

Algorithm 1: Standard (Projected) GD with Estimated Derivatives
1: Input: Objective function f : X → R, initialization x_0, iteration number T, learning rates {η_t}_{t=1}^T, projection function P_X(x)
2: for iteration t = 1, ..., T do
3:     g(x_{t-1}) ≈ ∇f(x_{t-1}) with (2)
4:     x_t ← P_X(x_{t-1} − η_{t-1} g(x_{t-1}))
5:     Query x_t to yield y(x_t)
6: end for
7: Return arg min_{x_{1:T}} y(x)

Algorithm 2: ZORD (Ours)
1: Input: In addition to the parameters in Algo. 1, set the steps of virtual updates {V_t}_{t=1}^T
2: for iteration t = 1, ..., T do
3:     x_{t,0} ← x_{t-1}
4:     for iteration τ = 1, ..., V_t do
5:         x_{t,τ} ← P_X(x_{t,τ-1} − η_{t,τ-1} ∇µ_{t-1}(x_{t,τ-1}))
6:     end for
7:     Query x_t = x_{t,V_t} to yield y(x_t)
8:     Update (4) using optimization trajectory
9: end for
10: Return arg min_{x_{1:T}} y(x)

2.2. ZO OPTIMIZATION WITH ESTIMATED DERIVATIVES

To solve (1), GD with estimated derivatives (e.g., Algo. 1) has been developed (Flaxman et al., 2005; Ghadimi and Lan, 2013; Nesterov and Spokoiny, 2017; Liu et al., 2018a;b). Particularly, these algorithms first estimate the derivative of f (line 3 of Algo. 1) and then plug the estimated derivative into GD-based methods to obtain the next input for querying (lines 4-5 of Algo. 1). In these algorithms, the derivative is typically estimated by averaging the finite difference approximation of the directional derivatives for f along certain directions, which we refer to as the finite difference (FD) method in this paper. For example, given a parameter $\lambda$ and directions $\{u_i\}_{i=1}^n$, the derivative $\nabla f$ at any $x \in \mathcal{X}$ can be estimated by the following FD method (Berahas et al., 2022):

$$\nabla f(x) \approx g(x) \triangleq \sum_{i=1}^{n} \frac{y(x + \lambda u_i) - y(x)}{\lambda}\, u_i. \tag{2}$$

The directions $\{u_i\}_{i=1}^n$ are usually sampled from the standard Gaussian distribution (Nesterov and Spokoiny, 2017) or uniformly from the unit sphere (Flaxman et al., 2005), or set as the standard basis vectors, each with 1 at one of its coordinates and 0 elsewhere (Lian et al., 2016). As mentioned before, existing FD methods typically require many additional queries (i.e., $\{x + \lambda u_i\}_{i=1}^n$) to achieve an accurate derivative estimation in every iteration of Algo. 1 (Berahas et al., 2022), making existing ZO optimization algorithms (Flaxman et al., 2005; Nesterov and Spokoiny, 2017) query-inefficient.
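As a concrete illustration of the FD estimator in (2), the following minimal Python sketch uses the standard basis vectors as directions (one of the choices mentioned above), so that n = d. The names `fd_gradient` and `quadratic`, the toy objective, and the value of λ are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def fd_gradient(y, x, lam=1e-4):
    """Forward finite-difference estimate of the gradient, as in (2),
    with the standard basis vectors e_1, ..., e_d as directions."""
    x = np.asarray(x, dtype=float)
    d = x.size
    y0 = y(x)                      # the necessary query at x itself
    g = np.zeros(d)
    for i in range(d):
        u = np.zeros(d)
        u[i] = 1.0                 # direction u_i = e_i
        g += (y(x + lam * u) - y0) / lam * u   # one additional query per direction
    return g, d                    # estimate and number of additional queries

def quadratic(x):
    """Hypothetical noise-free objective used only for this illustration."""
    return float(np.sum(x ** 2))

g_hat, extra_queries = fd_gradient(quadratic, np.array([1.0, -2.0, 0.5]))
```

Here every derivative estimate costs d additional queries on top of the query at x, which is the overhead that the trajectory-informed estimator is designed to avoid.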

3. ZO OPTIMIZATION VIA TRAJECTORY-INFORMED DERIVATIVE ESTIMATION

To improve existing GD with estimated derivatives (e.g., Algo. 1), we propose the ZORD algorithm (Algo. 2), which achieves more query-efficient ZO optimization thanks to our two major contributions. Firstly, we propose a derived GP-based derivative estimation method which only uses the optimization trajectory and consequently does not require any additional query for derivative estimation (Sec. 3.1). Secondly, thanks to the ability of our method to estimate the derivative at any input in the domain without any additional query and to measure the estimation error in a principled way, we develop the technique of dynamic virtual updates to further improve the query efficiency of our ZORD (Sec. 3.2).

3.1. TRAJECTORY-INFORMED DERIVATIVE ESTIMATION

To begin with, if a function f follows a GP, then its derivative ∇f also follows a GP (Rasmussen and Williams, 2006). This is formalized by our Lemma 1 below (proof in Appx. B.1), which then provides us a principled way to estimate the derivative at any input in the domain.

Lemma 1 (Derived GP for Derivatives). If a function f follows a GP: $f \sim \mathcal{GP}(\mu(\cdot), \sigma^2(\cdot, \cdot))$, then $\nabla f \sim \mathcal{GP}(\nabla\mu(\cdot), \partial\sigma^2(\cdot, \cdot))$, where $\partial\sigma^2(\cdot, \cdot)$ denotes the cross partial derivative w.r.t. the first and second arguments of $\sigma^2(\cdot, \cdot)$.

f Follows the Posterior GP. As discussed in Sec. 2.1, we assume that $f \sim \mathcal{GP}(\mu(\cdot), k(\cdot, \cdot))$. So, in every iteration t of our Algo. 2, conditioned on the current optimization trajectory $\mathcal{D}_{t-1} \triangleq \{(x_\tau, y_\tau)\}_{\tau=1}^{t-1}$, f follows the posterior GP: $f \sim \mathcal{GP}(\mu_{t-1}(\cdot), \sigma^2_{t-1}(\cdot, \cdot))$ with the mean function $\mu_{t-1}(\cdot)$ and the covariance function $\sigma^2_{t-1}(\cdot, \cdot)$ defined as below (Rasmussen and Williams, 2006):

$$\begin{aligned} \mu_{t-1}(x) &\triangleq k_{t-1}(x)^\top \left(K_{t-1} + \sigma^2 I\right)^{-1} y_{t-1}, \\ \sigma^2_{t-1}(x, x') &\triangleq k(x, x') - k_{t-1}(x)^\top \left(K_{t-1} + \sigma^2 I\right)^{-1} k_{t-1}(x'), \end{aligned} \tag{3}$$

where $y_{t-1}^\top \triangleq [y_\tau]_{\tau=1}^{t-1}$ and $k_{t-1}(x)^\top \triangleq [k(x, x_\tau)]_{\tau=1}^{t-1}$ are $(t-1)$-dimensional row vectors, and $K_{t-1} \triangleq [k(x_\tau, x_{\tau'})]_{\tau,\tau'=1}^{t-1}$ is a $(t-1) \times (t-1)$-dimensional matrix. Defining $\sigma^2_{t-1}(x) \triangleq \sigma^2_{t-1}(x, x)$, the posterior distribution at x is Gaussian with mean $\mu_{t-1}(x)$ and variance $\sigma^2_{t-1}(x)$.

∇f Follows the Derived GP for Derivatives. Substituting (3) into Lemma 1, we have that $\nabla f \sim \mathcal{GP}(\nabla\mu_{t-1}(\cdot), \partial\sigma^2_{t-1}(\cdot, \cdot))$, in which the mean $\nabla\mu_{t-1}(x)$ at x and the covariance $\partial\sigma^2_{t-1}(x, x')$ at x, x′ are

$$\begin{aligned} \nabla\mu_{t-1}(x) &\triangleq \partial_z k_{t-1}(z)^\top \left(K_{t-1} + \sigma^2 I\right)^{-1} y_{t-1} \Big|_{z=x}, \\ \partial\sigma^2_{t-1}(x, x') &\triangleq \partial_z \partial_{z'} k(z, z') - \partial_z k_{t-1}(z)^\top \left(K_{t-1} + \sigma^2 I\right)^{-1} \partial_{z'} k_{t-1}(z') \Big|_{z=x,\, z'=x'}, \end{aligned} \tag{4}$$

in which $\partial_z k_{t-1}(z) \triangleq [\partial_z k(z, x_\tau)]_{\tau=1}^{t-1}$ is a $(t-1) \times d$-dimensional matrix and $\partial_z \partial_{z'} k(z, z')$ is a $d \times d$-dimensional matrix. Therefore, $\nabla\mu_{t-1}(x)$ is a d-dimensional vector and $\partial\sigma^2_{t-1}(x, x')$ is a $d \times d$-dimensional matrix.
We refer to this GP (4) followed by ∇f as the derived GP for derivatives.

So, defining $\partial\sigma^2_{t-1}(x) \triangleq \partial\sigma^2_{t-1}(x, x)$, we have that for any input $x \in \mathcal{X}$, the derivative ∇f(x) at x follows a d-dimensional Gaussian distribution:

$$\nabla f(x) \sim \mathcal{N}\left(\nabla\mu_{t-1}(x), \partial\sigma^2_{t-1}(x)\right). \tag{5}$$

This allows us to (a) estimate the derivative ∇f(x) at any input $x \in \mathcal{X}$ using the posterior mean $\nabla\mu_{t-1}(x)$ of the derived GP for derivatives (4):

$$\nabla f(x) \approx \nabla\mu_{t-1}(x), \tag{6}$$

and (b) employ the posterior covariance matrix $\partial\sigma^2_{t-1}(x)$ to obtain a principled measure of the uncertainty for this derivative estimation, which together constitute our novel derivative estimation. Remarkably, our derivative estimation only makes use of the naturally available optimization trajectory $\mathcal{D}_{t-1}$ and does not need any additional query, which is in stark contrast to the existing FD methods (e.g., (2)) that require many additional queries for their derivative estimation. Moreover, our principled measure of uncertainty allows us to perform dynamic virtual updates (Sec. 3.2) and theoretically guarantee the quality of our derivative estimation (Sec. 4.1).
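To make the estimator concrete, the following minimal Python sketch computes the mean and covariance of the derived GP for derivatives in (4) from a trajectory of inputs and observations. It assumes a zero prior mean and an SE kernel k(x, x′) = exp(−‖x − x′‖² / (2ℓ²)), for which ∂_z k(z, x_τ) = −(z − x_τ) k(z, x_τ)/ℓ² and ∂_z ∂_z′ k(z, z′)|_{z=z′} = I/ℓ². The function names, the length scale `ls`, the noise variance, and the toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def se_kernel(A, B, ls=1.0):
    """SE kernel matrix between the rows of A (m x d) and B (n x d)."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / ls ** 2)

def derivative_posterior(x, X, y, noise_var=1e-2, ls=1.0):
    """Mean (d,) and covariance (d x d) of the derivative at x, following (4)."""
    x = np.asarray(x, dtype=float)
    k = se_kernel(x[None, :], X, ls)[0]                 # k(x, x_tau), shape (t,)
    J = -(x[None, :] - X) / ls ** 2 * k[:, None]        # rows: d/dz k(z, x_tau) at z = x
    A = se_kernel(X, X, ls) + noise_var * np.eye(len(X))
    mean = J.T @ np.linalg.solve(A, y)                  # estimated derivative, as in (6)
    cov = np.eye(x.size) / ls ** 2 - J.T @ np.linalg.solve(A, J)
    return mean, cov

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(8, 2))                 # trajectory inputs (toy data)
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]                     # their observations (toy objective)
grad_mean, grad_cov = derivative_posterior(np.array([0.2, -0.1]), X, y)
uncertainty = np.linalg.norm(grad_cov, 2)               # spectral norm of the covariance
```

Note that no function query beyond the trajectory `(X, y)` is used, and the same call can be evaluated at any input x in the domain.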

3.2. DYNAMIC VIRTUAL UPDATES

Note that our derived GP-based derivative estimation (6) can estimate the derivative at any input x within the domain. As a result, in every iteration t of our ZORD algorithm, for a step $\tau \ge 1$, after performing a GD update using the estimated derivative at $x_{t,\tau-1}$ (i.e., $\nabla\mu_{t-1}(x_{t,\tau-1})$) to reach the input $x_{t,\tau}$ (line 5 of Algo. 2), we can again estimate the derivative at $x_{t,\tau}$ (i.e., $\nabla\mu_{t-1}(x_{t,\tau})$) and then perform another GD update to reach $x_{t,\tau+1}$ without requiring any additional query. This process can be repeated for multiple steps, and can further improve the query efficiency of our ZORD. Formally, given the projection function $P_{\mathcal{X}}(x) \triangleq \arg\min_{z \in \mathcal{X}} \|x - z\|_2^2 / 2$ and learning rates $\{\eta_{t,\tau}\}_{\tau=0}^{V_t-1}$, we perform the following virtual updates for $V_t$ steps (lines 4-6 of Algo. 2):

$$x_{t,\tau} = P_{\mathcal{X}}\left(x_{t,\tau-1} - \eta_{t,\tau-1} \nabla\mu_{t-1}(x_{t,\tau-1})\right) \quad \forall \tau = 1, \cdots, V_t \tag{7}$$

and then choose the last $x_{t,V_t}$ to query (i.e., line 7 of Algo. 2). Importantly, these multi-step virtual GD updates are only feasible in our ZORD (Algo. 2) because our derivative estimator (6) does not require any new query in all these steps, whereas the existing FD methods require additional queries to estimate the derivative in every step. The number of steps for our virtual updates (i.e., $V_t$) induces an intriguing trade-off: An overly small $V_t$ may not be able to fully exploit the benefit of our derivative estimation (6), which is free from the requirement for additional queries, yet an excessively large $V_t$ may lead to the usage of inaccurate derivative estimations, which can hurt the performance (validated in Appx. D.2). Remarkably, (4) allows us to dynamically choose $V_t$ by inspecting our principled measure of the predictive uncertainty (i.e., $\partial\sigma^2_{t-1}(x)$) for every derivative estimation.
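The virtual updates in (7), combined with an uncertainty check on the derived-GP covariance, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: it assumes an SE kernel with zero prior mean, a box domain [lo, hi]^d (so that the projection reduces to clipping), a constant learning rate, a threshold `c` on the spectral norm of the covariance, and a hypothetical cap `max_steps` on the number of virtual steps. All function and parameter names are illustrative.

```python
import numpy as np

def se_kernel(A, B, ls=1.0):
    """SE kernel matrix between the rows of A and B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / ls ** 2)

def derivative_posterior(x, X, y, noise_var=1e-2, ls=1.0):
    """Mean and covariance of the derivative at x under the derived GP (4)."""
    k = se_kernel(x[None, :], X, ls)[0]
    J = -(x[None, :] - X) / ls ** 2 * k[:, None]
    A = se_kernel(X, X, ls) + noise_var * np.eye(len(X))
    mean = J.T @ np.linalg.solve(A, y)
    cov = np.eye(x.size) / ls ** 2 - J.T @ np.linalg.solve(A, J)
    return mean, cov

def virtual_updates(x_prev, X, y, lr=0.1, c=0.5, max_steps=10, lo=-1.0, hi=1.0):
    """Run projected GD steps (7) on the estimated derivative without new queries.
    Stops once the uncertainty at the newly reached input exceeds c."""
    x = np.asarray(x_prev, dtype=float)
    grad, _ = derivative_posterior(x, X, y)
    for tau in range(1, max_steps + 1):
        x = np.clip(x - lr * grad, lo, hi)          # projection onto the box domain
        grad, cov = derivative_posterior(x, X, y)   # re-estimate at the new input
        if np.linalg.norm(cov, 2) > c:              # estimate at x is likely unreliable
            return x, tau                           # V_t = tau
    return x, max_steps                             # cap reached (assumption of this sketch)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(10, 2))            # trajectory inputs (toy data)
y = np.sum(X ** 2, axis=1)                          # observations of a toy objective
x_next, V_t = virtual_updates(X[-1], X, y)          # x_next is the next input to query
```

Only `x_next` would be sent to the expensive objective; all intermediate inputs are visited virtually.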
Specifically, after reaching the input $x_{t,\tau}$, we continue the virtual updates (to reach $x_{t,\tau+1}$) if our predictive uncertainty is small, i.e., if $\|\partial\sigma^2_{t-1}(x_{t,\tau})\|_2 \le c$ where c is a confidence threshold; otherwise, we terminate the virtual updates and let $V_t = \tau$ since the derivative estimation at $x_{t,\tau}$ is likely unreliable.

4. THEORETICAL ANALYSIS

4.1. DERIVATIVE ESTIMATION ERROR

To begin with, we derive a theoretical guarantee on the error of our derivative estimation at any x.

Theorem 1 (Derivative Estimation Error). Let $\delta \in (0, 1)$ and $\beta \triangleq d + 2(\sqrt{d} + 1)\ln(1/\delta)$. For any $x \in \mathcal{X}$ and any $t \ge 1$, the following holds with probability of at least $1 - \delta$,

$$\|\nabla f(x) - \nabla\mu_t(x)\|_2 \le \sqrt{\beta\, \|\partial\sigma^2_t(x)\|_2}.$$

Thm. 1 (proof in Appx. B.2) presents an upper bound on the error of our derivative estimation (6) at any $x \in \mathcal{X}$ in terms of $\|\partial\sigma^2_t(x)\|_2$, which is a measure of the uncertainty about our derivative estimation at x (Sec. 3.1). This hence implies that the threshold c applied to our predictive uncertainty $\|\partial\sigma^2_t(x)\|_2$ (Sec. 3.2) also ensures that the derivative estimation error is small during our dynamic virtual updates. Next, we show in the following theorem (proof in Appx. B.3) that our upper bound on the estimation error from Thm. 1 is non-increasing as the number of function queries is increased.

Theorem 2 (Non-Increasing Error). For any $x \in \mathcal{X}$ and any $t \ge 1$, we have that $\|\partial\sigma^2_t(x)\|_2 \le \|\partial\sigma^2_{t-1}(x)\|_2$. Let $\delta \in (0, 1)$. Define $r \triangleq \max_{x \in \mathcal{X},\, t \ge 1} \sqrt{\|\partial\sigma^2_t(x)\|_2 / \|\partial\sigma^2_{t-1}(x)\|_2}$. Given the β in Thm. 1, we then have that $r \in [1/\sqrt{1 + 1/\sigma^2},\, 1]$, and that with a probability of at least $1 - \delta$,

$$\|\nabla f(x) - \nabla\mu_t(x)\|_2 \le \sqrt{\beta\, \|\partial\sigma^2_t(x)\|_2} \le \kappa \sqrt{\beta}\, r^t.$$

Thm. 2 shows that our upper bound on the derivative estimation error (i.e., $\sqrt{\beta \|\partial\sigma^2_t(x)\|_2}$ from Thm. 1) is guaranteed to be non-increasing in the entire domain as the number of function queries is increased. Moreover, in some situations (i.e., when r < 1), our upper bound on the estimation error is even exponentially decreasing. Of note, r characterizes how fast the uncertainty about our derivative estimation (measured by $\|\partial\sigma^2_t(x)\|_2$) is reduced across the domain.
Since GD-based algorithms usually perform a local search in a neighborhood (especially for the problems with high-dimensional input spaces), all the inputs within the local region are expected to be close to each other (measured by the kernel function k). Moreover, as the objective function is usually smooth in the local region (i.e., its derivatives are continuous), reducing the uncertainty of the derivative at an input $x_t$ (i.e., by querying $x_t$) is also expected to decrease the uncertainty of the derivatives at the other inputs in the same local region (i.e., decrease $\|\partial\sigma^2_t(x)\|_2$). So, r < 1 is expected to be a reasonable condition that can be satisfied in practice. This will also be corroborated by our empirical results (e.g., Figs. 1 and 2), which demonstrate that the error of our derivative estimation (6) is indeed reduced very fast.

Our GP-based Method (6) vs. Existing FD Methods. Our derivative estimation method based on the derived GP (6) is superior to the traditional FD methods (e.g., (2)) in a number of major aspects. (a) Our derivative estimation error can be exponentially decreasing in some situations (i.e., when r < 1 in Thm. 2), which is unachievable for the existing FD methods since they can only attain a polynomial rate of reduction (Berahas et al., 2022). (b) Our method (6) does not need any additional query to estimate the derivative (but only requires the optimization trajectory), whereas the existing FD methods require additional queries for every derivative estimation. (c) Our method (6) is equipped with a principled measure of the predictive uncertainty and hence the estimation error for derivative estimation (i.e., via $\|\partial\sigma^2_t(x)\|_2$, Thm. 1), which is typically unavailable for the existing FD methods. (d) Our method (6), unlike the existing FD methods, makes it possible to apply the technique of dynamic virtual updates (Sec. 3.2) thanks to its capability of estimating the derivative at any input in the domain without requiring any additional query and measuring the estimation error in a principled way (Thm. 1).
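The non-increasing uncertainty stated in Thm. 2 can be observed numerically. The following minimal Python sketch tracks the spectral norm of the derived-GP covariance at a fixed input as queried inputs are added one at a time. It assumes an SE kernel and uses randomly drawn inputs; the function names, length scale, noise variance, and data are illustrative assumptions. The covariance in (4) depends only on the queried inputs, not on the observed values, so no objective function is needed here.

```python
import numpy as np

def se_kernel(A, B, ls=1.0):
    """SE kernel matrix between the rows of A and B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / ls ** 2)

def derivative_uncertainty(x, X, noise_var=1e-2, ls=1.0):
    """Spectral norm of the derivative covariance at x given queried inputs X."""
    prior = np.eye(x.size) / ls ** 2          # covariance of the derivative before any query
    if len(X) == 0:
        return np.linalg.norm(prior, 2)
    k = se_kernel(x[None, :], X, ls)[0]
    J = -(x[None, :] - X) / ls ** 2 * k[:, None]
    A = se_kernel(X, X, ls) + noise_var * np.eye(len(X))
    cov = prior - J.T @ np.linalg.solve(A, J)
    return np.linalg.norm(cov, 2)

rng = np.random.default_rng(1)
X_all = rng.uniform(-1.0, 1.0, size=(15, 3))  # hypothetical query locations
x_test = np.zeros(3)
# Uncertainty at x_test after 0, 1, ..., 15 queries.
norms = [derivative_uncertainty(x_test, X_all[:t]) for t in range(len(X_all) + 1)]
```

The resulting sequence `norms` should never increase from one entry to the next, up to floating-point error; how fast it decays depends on how close the queried inputs are to `x_test`.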

4.2. CONVERGENCE ANALYSIS

To analyze the convergence of our ZORD, besides our main assumption that f is sampled from a GP (Sec. 2.1), we assume that f is $L_c$-Lipschitz continuous for $L_c > 0$. This is a mild assumption since it has been shown that a function f sampled from a GP is Lipschitz continuous with high probability for commonly used kernels, e.g., the SE kernel and Matérn kernel with ν > 2 (Srinivas et al., 2010). We also assume that f is $L_s$-Lipschitz smooth, which is commonly adopted in the analysis of GD-based algorithms (Reddi et al., 2016). We aim to prove the convergence of our ZORD for nonconvex f by analyzing how fast it converges to a stationary point (Ghadimi and Lan, 2013; Liu et al., 2018a). Specifically, we follow the common practice of previous works (Reddi et al., 2016; Liu et al., 2018b) to analyze the following derivative mapping:

$$G_{t,\tau} \triangleq \left(x_{t,\tau} - P_{\mathcal{X}}\left(x_{t,\tau} - \eta_{t,\tau} \nabla f(x_{t,\tau})\right)\right) / \eta_{t,\tau}.$$

The convergence of our ZORD is formally guaranteed by Thm. 3 below (proof in Appx. B.4).

Theorem 3 (Convergence of ZORD). Let $\delta \in (0, 1)$. Suppose our ZORD (Algo. 2) is run with $V_t = V$ and $\eta_{t,\tau} = \eta \le 1/L_s$ for any t and τ. Then with probability of at least $1 - \delta$, when r < 1,

$$\min_{t \le T} \frac{1}{V} \sum_{\tau=0}^{V-1} \|G_{t,\tau}\|_2^2 \le \underbrace{\frac{2[f(x_0) - f(x^*)]}{\eta T V}}_{\text{①}} + \underbrace{\frac{2\alpha^2 r^2}{T(1 - r^2)} + \frac{(2L_c + 1/\eta)\,\alpha r}{T(1 - r)}}_{\text{②}}$$

where $\alpha \triangleq \kappa \sqrt{d + 2(\sqrt{d} + 1)\ln(VT/\delta)}$. When r = 1, we instead have ② $= 2\alpha^2 + (2L_c + 1/\eta)\alpha$.

In the upper bound of Thm. 3, the term ① represents the convergence rate of (projected) GD when the true derivative is used and it asymptotically goes to 0 as T increases; the term ② corresponds to the impact of the error of our derivative estimation (6) on the convergence. In situations where r < 1, which is a reasonably achievable condition as we have discussed in Sec. 4.1, the term ② will also asymptotically approach 0.
This, remarkably, suggests that the impact of the derivative estimation error on the convergence vanishes asymptotically and our ZORD algorithm is guaranteed to converge to a stationary point (i.e., $\min_{t \le T} \frac{1}{V} \sum_{\tau=0}^{V-1} \|G_{t,\tau}\|_2^2$ approaches 0) at the rate of O(1/T) when r < 1. This is unattainable by existing ZO optimization algorithms using FD-based derivative estimation (Nesterov and Spokoiny, 2017; Liu et al., 2018b), because these methods typically converge to a stationary point at the rate of O(1/T + const.) with a constant learning rate. Even when r = 1, where the term ② becomes a constant independent of T, our Thm. 3 is still superior to the convergence of these existing works because our result (Thm. 3) is based on the worst-case analysis whereas these works are typically based on the average-case analysis, i.e., their results only hold in expectation over the randomly sampled directions for derivative estimation. This means that their convergence may become even worse when inappropriate directions are used, e.g., directions that are nearly orthogonal to the true derivative, which commonly happens in high-dimensional input spaces. In addition, given a fixed T, our ZORD enjoys a query complexity (i.e., the number of queries in T iterations) of O(T), which significantly improves over the O(nT) of the existing works based on FD (n in Sec. 2.2). The impacts of the number of steps of our virtual updates (i.e., V) are partially reflected in Thm. 3. Specifically, a larger V improves the reduction rate of the term ① because a larger number of virtual GD updates (without requiring additional queries) will be applied in our ZORD algorithm. This is also unachievable by existing ZO optimization algorithms using FD-based derivative estimation since they require additional queries for the derivative estimation in every GD update.
Meanwhile, a larger V may also negatively impact the performance of our ZORD since it may lead to the use of those estimated derivatives with large estimation errors (Sec. 3.2). However, this negative impact has only been implicitly accounted for by the term ② because this term comes from our Thm. 2, which is based on a worst-case analysis and gives a uniform upper bound on the derivative estimation error for all inputs in the domain X.

Figure 2: [...] (6) (GP) and the FD estimator, measured by cosine similarity (larger is better) and Euclidean distance (smaller is better). Each curve is the mean ± standard error from five independent runs.

5. EXPERIMENTS

In this section, we firstly empirically verify the efficacy of our derived GP-based derivative estimator (6) in Sec. 5.1, and then demonstrate that our ZORD outperforms existing baseline methods for ZO optimization using synthetic experiments (Sec. 5.2) and real-world experiments (Secs. 5.3, 5.4).

5.1. DERIVATIVE ESTIMATION

Here we investigate the efficacy of our derivative estimator (6) based on the derived GP for derivatives (4). Specifically, we sample a function f (defined on a one-dimensional domain) from a GP using the SE kernel, and then use a set of randomly selected inputs as well as their noisy observations (as optimization trajectory) to calculate our derived GP for derivatives. The results (Fig. 1 ) illustrate a number of interesting insights. Firstly, in regions where (even only a few) function queries are performed (e.g., in the region of [-3, 0]), our estimated derivative (i.e., ∇µ t-1 (x) (6)) generally aligns with the groundtruth derivative (i.e., ∇f (x)) and our estimation uncertainty (i.e., characterized by ∂σ 2 t-1 (x) 2 ) shrinks compared with other un-queried regions. These results hence demonstrate that our (4) is able to accurately estimate derivatives and reliably quantify the uncertainty of these estimations within the regions where function queries are performed. Secondly, as more input queries are collected (i.e., from left to right in Fig. 1 ), the uncertainty ∂σ 2 t-1 (x) 2 in the entire domain is decreased in general. This provides an empirical justification for our Thm. 2 which guarantees non-increasing uncertainty and hence non-increasing estimation error. Lastly, note that with only 12 queries (rightmost figure), our derivative estimator is already able to accurately estimate the derivative in the entire domain, which represents a remarkable reduction rate of our derivative estimation error. Next, we compare our derivative estimator (6) with the FD estimator (Sec. 2.2). Specifically, using the Ackley function with d = 10 (see Appx. C.2), we firstly select an input x 0 and then follow the FD method (2) to randomly sample n directions {u i } n i=1 from the standard Gaussian distribution, to construct input queries {x 0 + λu i } n i=1 (see Sec. 2.2). 
Next, these queries and their observations are (a) used as the optimization trajectory to apply our derivative estimator (6), and (b) used by the FD method to estimate the derivative following (2). The results are shown in Fig. 2a (for two different values of λ), in which for both our derived GP-based estimator (6) and the FD estimator, we measure the cosine similarity (larger is better) and Euclidean distance (smaller is better) between the estimated derivative and the true derivative at x_0. The figures show that our derivative estimation error enjoys a faster rate of reduction compared with the FD method, which corroborates our theoretical insights from Thm. 2 (Sec. 4.1) positing that our estimation error can be rapidly decreasing. Subsequently, to further highlight our advantage of being able to exploit the optimization trajectory and hence to eliminate the need for additional function queries (Sec. 4.1), we perform another comparison where our derived GP-based estimator (6) only utilizes 20 queries from the optimization trajectory (sampled using the same method above) for derivative estimation. The results (Fig. 2b) show that even with only these 20 queries (without any additional function query), our derivative estimator (6) achieves comparable or better estimation errors than FD using as many as 80 additional queries. Overall, the results in Fig. 2 provide empirical support for the superiority of our derived GP-based derivative estimation (6), which substantiates our theoretical justifications in Sec. 4.1.

5.2. SYNTHETIC EXPERIMENTS

Here we adopt the widely used Ackley and Levy functions with various dimensions (Eriksson et al., 2019) to show the superiority of our ZORD. We compare ZORD with a number of representative baselines for ZO optimization, e.g., RGF (Nesterov and Spokoiny, 2017), which uses FD for derivative estimation, PRGF (Cheng et al., 2021), which is a recent extension of RGF, GLD (Golovin et al., 2020), which is a recent ZO optimization algorithm based on direct search, and TuRBO (Eriksson et al., 2019), which is a highly performant Bayesian optimization (BO) algorithm. We also evaluate the performance of a first-order optimization algorithm, i.e., GD with true derivatives. More details are in Appx. C.2. The results are shown in Fig. 3, where ZORD outperforms all other ZO optimization algorithms. Particularly, ZORD considerably outperforms both RGF and PRGF, which can be attributed to our two major contributions. Firstly, our derivative estimator (6) used by ZORD is more accurate and more query-efficient than the FD method adopted by RGF and PRGF, as theoretically justified in Sec. 4.1 and empirically demonstrated in Sec. 5.1. Secondly, our dynamic virtual updates (Sec. 3.2) can perform multi-step GD updates without requiring any additional query, which further improves the performance of ZORD (validated in Appx. D.2). Moreover, ZORD is the only ZO optimization algorithm that is able to converge to a final performance comparable to that of GD with true derivatives in every plot of Fig. 3.
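For readers unfamiliar with the benchmark, the Ackley function in its commonly used form can be written as the short Python sketch below. The constants a = 20, b = 0.2, c = 2π are the usual defaults and are an assumption here; the exact domain, scaling, and noise settings used in the experiments are given in Appx. C.2 and are not reproduced.

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    """Ackley function in its commonly used form; global minimum 0 at x = 0."""
    x = np.asarray(x, dtype=float)
    term1 = -a * np.exp(-b * np.sqrt(np.mean(x ** 2)))
    term2 = -np.exp(np.mean(np.cos(c * x)))
    return float(term1 + term2 + a + np.e)

value_at_origin = ackley(np.zeros(10))   # d = 10, at the global minimizer
value_at_ones = ackley(np.ones(10))      # a non-optimal input for comparison
```

The function has many local minima away from the origin, which is what makes it a common test case for ZO optimization algorithms.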

5.3. BLACK-BOX ADVERSARIAL ATTACK

We further compare our ZORD with other ZO optimization algorithms in the problem of black-box adversarial attack on images, which is one of the most important applications of ZO optimization in recent years. In black-box adversarial attack (Ru et al., 2020), given a fully trained ML model and an image z, we intend to find (through only function queries) a small perturbation x to be added to z such that the perturbed image z + x will be incorrectly classified by the ML model. Following the practice from (Cheng et al., 2021), we randomly select an image from MNIST (Lecun et al., 1998) [...] Particularly, our ZORD is substantially more query-efficient than RGF and PRGF, which rely on the FD methods for derivative estimation, e.g., for CIFAR-10, the numbers of queries required by RGF and PRGF are 9.4× and 10.8× that required by ZORD. This further verifies the advantages of our trajectory-informed derivative estimation (as justified theoretically in Sec. 4.1 and empirically in Sec. 5.1) and dynamic virtual updates (as demonstrated in Appx. D.2). Remarkably, our ZORD also outperforms BO (i.e., TuRBO-1/10, which correspond to two versions of the TuRBO algorithm (Eriksson et al., 2019)), which has been widely shown to be query-efficient in black-box adversarial attack (Ru et al., 2020). Overall, these results showcase the ability of our ZORD to improve over the other ZO optimization algorithms in challenging real-world ZO optimization problems.

5.4. NON-DIFFERENTIABLE METRIC OPTIMIZATION

Non-differentiable metric optimization (Hiranandani et al., 2021; Huang et al., 2021) , which has received a surging interest recently, can also be cast as a ZO optimization problem. We therefore use it to further demonstrate the superiority of our ZORD to other ZO optimization algorithms. Specifically, we firstly train a multilayer perceptron (MLP) (d = 2189) on the Covertype (Dua and Graff, 2017) dataset with the cross-entropy loss function. Then, we use the same dataset to fine-tune this MLP model by exploiting ZO optimization algorithms to optimize a non-differentiable metric, such as precision, recall, F1 score and Jaccard index (see more details in Appx. C.4). Here we additionally compare with the evolutionary strategy (ES) which has been previously applied for non-differentiable metric optimization (Huang et al., 2021) . Fig. 4 illustrates the percentage improvements achieved by different algorithms during the fine-tuning process (i.e., (f (x 0 ) -f (x T )) × 100%/f (x 0 )). The results show that our ZORD again consistently outperforms the other ZO optimization algorithms in terms of both the query efficiency and the final converged performance. These results therefore further substantiate the superiority of ZORD in optimizing high-dimensional non-differentiable functions.

6. CONCLUSION

We have introduced the ZORD algorithm, which achieves query-efficient ZO optimization through two major contributions. Firstly, we have proposed a novel derived GP-based method (6) which only uses the optimization trajectory and hence eliminates the requirement for additional queries to estimate derivatives (Sec. 3.1). Secondly, we have introduced a novel technique, i.e., dynamic virtual updates, which is made possible by our GP-based derivative estimation, to further improve the performance of our ZORD (Sec. 3.2). Through theoretical justifications (Sec. 4) and empirical demonstrations (Sec. 5), we show that our derived GP-based derivative estimation improves over existing FD methods and that our ZORD outperforms various ZO optimization baselines.

7. REPRODUCIBILITY STATEMENT

For our theoretical results, we have discussed all our assumptions in Sec. 2.1 & Sec. 4.2, and provided our complete proofs in Appx. B. For our empirical results, we have provided our detailed experimental settings in Appx. C and included our codes in the supplementary materials (i.e., the zip file).

APPENDIX A RELATED WORK

Various types of algorithms have been proposed in the literature to solve ZO optimization problems, e.g., direct search, Bayesian optimization (BO) and GD-based algorithms with estimated derivatives. Particularly, direct search, e.g., (Stich et al., 2013; Golovin et al., 2020), relies on the comparison of function values at different inputs for the updates, which can be query-inefficient in practice owing to its indirect utilization of function values. In contrast, Bayesian optimization (BO) directly utilizes the function values to model the objective function using a Gaussian process (GP) and iteratively selects the inputs to query by trading off sampling potentially optimal inputs (i.e., exploitation) and inputs that can improve the GP belief of the objective function over the entire input domain (i.e., exploration) (Chowdhury and Gopalan, 2017; Srinivas et al., 2010; Dai et al., 2019; 2020). However, in ZO optimization problems with high-dimensional input spaces, BO algorithms typically suffer from query inefficiency and large computational complexity (Rasmussen and Williams, 2006; Letham et al., 2020; Eriksson et al., 2019), which significantly hinders their real-world applications. Therefore, GD-based algorithms with estimated derivatives, which inherit the advantage of GD-based algorithms in optimizing functions with high-dimensional input spaces, have been more widely applied in practice. For these algorithms, the derivatives are commonly estimated using the finite difference (FD) approximation (which requires additional function queries) of the directional derivatives along selected directions, in which the directions can be randomly sampled unit vectors (Flaxman et al., 2005), Gaussian vectors (Nesterov and Spokoiny, 2017), or standard bases (Lian et al., 2016) (Sec. 2.2).
More recently, some works have incorporated a time-dependent prior (i.e., the estimated derivative in the previous iteration) into existing FD methods to improve the quality of their derivative estimation (Ilyas et al., 2019; Meier et al., 2019; Cheng et al., 2021). Nevertheless, such a prior is also estimated by the FD method (i.e., in the previous iteration) and can hence be biased owing to its estimation error, which may even lead to larger derivative estimation errors in practice due to compounding errors. Another line of work has taken the surrogate derivatives from other sources to help reduce the derivative estimation error of existing FD methods (Maheswaranathan et al., 2019; Cheng et al., 2019). However, these surrogate derivatives may generally be unavailable in practice. Importantly, these existing FD methods require additional function queries for every derivative estimation during optimization, which significantly increases the query complexity of the ZO optimization algorithms that employ these FD methods for derivative estimation.
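To make the query cost of FD-based estimation concrete, the following is a minimal sketch of a forward-difference estimator with Gaussian directions in the spirit of Nesterov and Spokoiny (2017). The function name `fd_gradient`, the smoothing parameter `lam`, and the plain averaging over directions are illustrative assumptions, not necessarily the exact form of the estimator (2) in Sec. 2.2.

```python
import random


def fd_gradient(f, x, num_dirs=10, lam=1e-3, rng=None):
    """Forward-difference gradient estimate along random Gaussian directions.

    Uses 1 query at x plus 1 additional query per direction, i.e.,
    num_dirs + 1 function queries for a single derivative estimate.
    """
    rng = rng or random.Random(0)
    d = len(x)
    fx = f(x)  # the query at the current input
    grad = [0.0] * d
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # additional query that is spent only on derivative estimation
        fxu = f([xi + lam * ui for xi, ui in zip(x, u)])
        coeff = (fxu - fx) / lam  # approximates the directional derivative along u
        for i in range(d):
            grad[i] += coeff * u[i] / num_dirs
    return grad
```

The `num_dirs` additional queries per update are exactly the overhead that a trajectory-informed estimator avoids.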

APPENDIX B PROOFS

So, to prove Lemma 1, we only need to derive the mean and the covariance of the Gaussian process above for a function $f$ that is sampled from another Gaussian process, i.e., $f \sim \mathcal{GP}(\mu(\cdot), \sigma^2(\cdot, \cdot))$. Specifically, for the mean $\mathbb{E}[\nabla f]$, we have

$$\mathbb{E}[\nabla f] = \nabla \mathbb{E}[f] = \nabla \mu \tag{10}$$

where the first equality derives from the interchangeability of the expectation and derivative operations based on the Leibniz integral rule, and the second equality comes from the fact that $\mathbb{E}[f] = \mu$. For the covariance $\mathrm{Cov}(\nabla f, \nabla f)$, we have

$$\begin{aligned}
\mathrm{Cov}(\nabla f(z), \nabla f(z')) &\overset{(a)}{=} \mathbb{E}\left[(\nabla f(z) - \mathbb{E}[\nabla f(z)])^\top (\nabla f(z') - \mathbb{E}[\nabla f(z')])\right] \\
&\overset{(b)}{=} \mathbb{E}\left[\nabla (f(z) - \mathbb{E}[f(z)])^\top \nabla (f(z') - \mathbb{E}[f(z')])\right] \\
&\overset{(c)}{=} \mathbb{E}\left[\partial_z \partial_{z'} (f(z) - \mathbb{E}[f(z)])^\top (f(z') - \mathbb{E}[f(z')])\right] \\
&\overset{(d)}{=} \partial_z \partial_{z'} \mathbb{E}\left[(f(z) - \mathbb{E}[f(z)])^\top (f(z') - \mathbb{E}[f(z')])\right] \\
&\overset{(e)}{=} \partial_z \partial_{z'} \sigma^2(z, z') .
\end{aligned}$$

Notably, (b) and (d) also derive from the interchangeability of the expectation and derivative operations based on the Leibniz integral rule. Besides, (e) is obtained based on $\mathrm{Cov}(f, f) = \sigma^2(\cdot, \cdot)$. This finally completes our proof.
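The identity $\mathbb{E}[\nabla f] = \nabla \mu$ is what allows a derivative to be read off a GP posterior fitted on the optimization trajectory alone. The following NumPy sketch illustrates this under several assumptions that are ours rather than the paper's: a zero prior mean, an SE kernel with a fixed length scale `ls`, a small noise/jitter term `noise`, and the hypothetical helper names `se_kernel` and `gp_grad_mean`.

```python
import numpy as np


def se_kernel(a, b, ls=1.0):
    """Squared-exponential kernel matrix between rows of a (n, d) and b (m, d)."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ls**2)


def gp_grad_mean(x, X, y, ls=1.0, noise=1e-6):
    """Gradient of the GP posterior mean at x, using only past queries (X, y).

    Implements grad_x k(x, X) (K + noise * I)^{-1} y for a zero-mean GP prior;
    no function query beyond the trajectory (X, y) is needed.
    """
    K = se_kernel(X, X, ls) + noise * np.eye(len(X))
    alpha = np.linalg.solve(K, y)                    # (n,)
    k = se_kernel(x[None, :], X, ls)[0]              # (n,)
    dk = -(x[None, :] - X) / ls**2 * k[:, None]      # (n, d): d k(x, x_i) / d x
    return dk.T @ alpha                              # (d,)
```

For instance, with 60 past queries of a smooth 2-D function scattered in $[-1, 1]^2$, `gp_grad_mean` evaluated at an interior point is typically close to the true gradient, even though none of the queries were chosen for derivative estimation.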

B.2 PROOF OF THEOREM 1

To begin with, we introduce the following concentration inequality for the standard multivariate Gaussian distribution:

Lemma B.1 (Laurent and Massart (2000)). Let $\zeta \sim \mathcal{N}(0, I_m)$ and $\delta \in (0, 1)$, then

$$\mathbb{P}\left(\|\zeta\|_2^2 \le m + 2(\sqrt{m} + 1)\ln(1/\delta)\right) \ge 1 - \delta . \tag{12}$$

Define $\zeta \triangleq \partial\sigma_t^2(x)^{-1/2}\left(\nabla f(x) - \nabla \mu_t(x)\right)$. According to Lemma 1, we then have that $\zeta$ follows a standard multivariate Gaussian distribution, i.e.,

$$\zeta \sim \mathcal{N}(0, I_d) . \tag{13}$$

Let $\delta \in (0, 1)$. By substituting the result above into Lemma B.1, the following holds with probability of at least $1 - \delta$:

$$\|\nabla f(x) - \nabla \mu_t(x)\|_2 = \left\|\partial\sigma_t^2(x)^{1/2} \zeta\right\|_2 \le \sqrt{\|\partial\sigma_t^2(x)\|_2}\, \|\zeta\|_2 \le \sqrt{d + 2(\sqrt{d} + 1)\ln(1/\delta)}\, \sqrt{\|\partial\sigma_t^2(x)\|_2} = \beta \sqrt{\|\partial\sigma_t^2(x)\|_2}$$

with $\beta \triangleq \sqrt{d + 2(\sqrt{d} + 1)\ln(1/\delta)}$, where the first inequality is from the Cauchy-Schwarz inequality. This completes our proof.
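As a sanity check of the concentration inequality in Lemma B.1, the following sketch estimates by simulation how often a standard Gaussian vector stays within the stated threshold. It assumes that the norm in the lemma is read as the squared Euclidean norm (as in the usual chi-square tail bound); the helper names `lm_threshold` and `empirical_coverage` are hypothetical.

```python
import math
import random


def lm_threshold(m, delta):
    """Threshold m + 2 (sqrt(m) + 1) ln(1 / delta) from Lemma B.1."""
    return m + 2.0 * (math.sqrt(m) + 1.0) * math.log(1.0 / delta)


def empirical_coverage(m, delta, trials=2000, seed=0):
    """Fraction of samples zeta ~ N(0, I_m) with ||zeta||^2 below the threshold."""
    rng = random.Random(seed)
    bound = lm_threshold(m, delta)
    hits = 0
    for _ in range(trials):
        sq_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m))
        hits += sq_norm <= bound
    return hits / trials
```

For moderate `m` and `delta` the empirical coverage is usually well above `1 - delta`, which suggests that the bound is conservative in this regime.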

B.3 PROOF OF THEOREM 2

We first introduce the following lemmas.

Lemma B.2 (Chowdhury and Gopalan (2021)). For any $\sigma \in \mathbb{R}$ and any matrix $A$, the following holds:

$$I - A^\top \left(AA^\top + \sigma^2 I\right)^{-1} A = \sigma^2 \left(A^\top A + \sigma^2 I\right)^{-1} . \tag{15}$$

Lemma B.3 (Sherman-Morrison formula). For any invertible square matrix $A$ and column vectors $u, v$, suppose $A + uv^\top$ is invertible, then the following holds:

$$\left(A + uv^\top\right)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u} . \tag{16}$$

Preparation. We then introduce some additional notations and representations for our proof of Theorem 2. Following the common practice in (Chowdhury and Gopalan, 2021), we let the kernel $k$ be defined by $\psi(x)$, i.e., $k(x, x') = \psi(x)^\top \psi(x')$, and $\phi(x) \triangleq \nabla \psi(x)$. We then further define the $(t \times d)$-dimensional Jacobian matrix $\phi_t(x) \triangleq [\phi(x)^\top \psi(x_\tau)]_{\tau=1}^{t}$ and $\Psi_t \triangleq [\psi(x_\tau)]_{\tau=1}^{t}$. The matrix $K_t$ and the covariance matrix $\partial\sigma_t^2(x)$ defined on the optimization trajectory $\mathcal{D}_t$ in our Sec. 3.1 can be reformulated as

$$K_t = \Psi_t^\top \Psi_t , \qquad \partial\sigma_t^2(x) = \phi(x)^\top \phi(x) - \phi_t(x)^\top \left(K_t + \sigma^2 I\right)^{-1} \phi_t(x) .$$

Based on the reformulation above, define $V_t \triangleq \Psi_t \Psi_t^\top + \sigma^2 I$. We can then further reformulate $\partial\sigma_t^2(x)$ as below:

$$\begin{aligned}
\partial\sigma_t^2(x) &\overset{(a)}{=} \phi(x)^\top \phi(x) - \phi_t(x)^\top \left(K_t + \sigma^2 I\right)^{-1} \phi_t(x) \\
&\overset{(b)}{=} \phi(x)^\top \phi(x) - \phi(x)^\top \Psi_t \left(\Psi_t^\top \Psi_t + \sigma^2 I\right)^{-1} \Psi_t^\top \phi(x) \\
&\overset{(c)}{=} \phi(x)^\top \left[I - \Psi_t \left(\Psi_t^\top \Psi_t + \sigma^2 I\right)^{-1} \Psi_t^\top\right] \phi(x) \\
&\overset{(d)}{=} \sigma^2 \phi(x)^\top \left(\Psi_t \Psi_t^\top + \sigma^2 I\right)^{-1} \phi(x) \\
&\overset{(e)}{=} \sigma^2 \phi(x)^\top V_t^{-1} \phi(x) .
\end{aligned} \tag{18}$$

Note that (b) is obtained by exploiting the fact that $K_t = \Psi_t^\top \Psi_t$ and $\phi_t(x) = \phi(x)^\top \Psi_t$. In addition, (d) comes from Lemma B.2 by replacing the matrix $A$ in Lemma B.2 with the matrix $\Psi_t^\top$.

First Part. We then prove the first half of our Theorem 2, i.e., the following Lemma B.4.

Lemma B.4 (Non-Increasing Variance Norm). For any $x \in \mathcal{X}$ and any $t \ge 1$, we have that

$$\left\|\partial\sigma_t^2(x)\right\|_2 \le \left\|\partial\sigma_{t-1}^2(x)\right\|_2 . \tag{19}$$

Proof.
Based on our additional notations and representations, we have

$$\begin{aligned}
\partial\sigma_t^2(x) &\overset{(a)}{=} \sigma^2 \phi(x)^\top V_t^{-1} \phi(x) \\
&\overset{(b)}{=} \sigma^2 \phi(x)^\top \left(\Psi_{t-1} \Psi_{t-1}^\top + \sigma^2 I + \psi(x_t)\psi(x_t)^\top\right)^{-1} \phi(x) \\
&\overset{(c)}{=} \sigma^2 \phi(x)^\top \left(V_{t-1} + \psi(x_t)\psi(x_t)^\top\right)^{-1} \phi(x) \\
&\overset{(d)}{=} \sigma^2 \phi(x)^\top V_{t-1}^{-1} \phi(x) - \sigma^2 \left(1 + \psi(x_t)^\top V_{t-1}^{-1} \psi(x_t)\right)^{-1} \phi(x)^\top V_{t-1}^{-1} \psi(x_t)\psi(x_t)^\top V_{t-1}^{-1} \phi(x) \\
&\overset{(e)}{=} \partial\sigma_{t-1}^2(x) - \sigma^2 \left(1 + \psi(x_t)^\top V_{t-1}^{-1} \psi(x_t)\right)^{-1} \phi(x)^\top V_{t-1}^{-1} \psi(x_t)\psi(x_t)^\top V_{t-1}^{-1} \phi(x) \\
&\overset{(f)}{\preccurlyeq} \partial\sigma_{t-1}^2(x) .
\end{aligned} \tag{20}$$

Note that (a) follows from the aforementioned definition of $V_t$ and (b) comes from the fact that $\Psi_t \Psi_t^\top = \Psi_{t-1} \Psi_{t-1}^\top + \psi(x_t)\psi(x_t)^\top$. Similarly, (c) uses the definition of $V_{t-1}$. In addition, equality (d) derives from Lemma B.3 by letting $A = V_{t-1}$ and $u = v = \psi(x_t)$, and (e) follows from the reformulation of $\partial\sigma_{t-1}^2(x)$ in (18). Finally, (f) derives from the positive semi-definite property of $\phi(x)^\top V_{t-1}^{-1} \psi(x_t)\psi(x_t)^\top V_{t-1}^{-1} \phi(x)$ as well as the fact that $1 + \psi(x_t)^\top V_{t-1}^{-1} \psi(x_t) > 0$. That is, for any column vector $z$ we have that

$$z^\top \phi(x)^\top V_{t-1}^{-1} \psi(x_t)\psi(x_t)^\top V_{t-1}^{-1} \phi(x) z = \left(\psi(x_t)^\top V_{t-1}^{-1} \phi(x) z\right)^\top \left(\psi(x_t)^\top V_{t-1}^{-1} \phi(x) z\right) = \left\|\psi(x_t)^\top V_{t-1}^{-1} \phi(x) z\right\|_2^2 \ge 0 . \tag{21}$$

So, $\phi(x)^\top V_{t-1}^{-1} \psi(x_t)\psi(x_t)^\top V_{t-1}^{-1} \phi(x)$ is positive semi-definite. In a similar way, we are also able to verify that $1 + \psi(x_t)^\top V_{t-1}^{-1} \psi(x_t) > 0$ by showing that $\psi(x_t)^\top V_{t-1}^{-1} \psi(x_t) \ge 0$ using the decomposition of $V_{t-1}^{-1}$ from the Principal Component Analysis (PCA). Since $\partial\sigma_t^2(x) \preccurlyeq \partial\sigma_{t-1}^2(x)$ implies $\|\partial\sigma_t^2(x)\|_2 \le \|\partial\sigma_{t-1}^2(x)\|_2$ for positive semi-definite matrices, we then complete the proof of the first half of our Theorem 2.

Second Part. To prove the rest of our Theorem 2, we firstly introduce the following lemmas.

Lemma B.5. For any $x \in \mathcal{X}$ and any $t \ge 1$, the following holds: $V_t^{-1} \preccurlyeq V_{t-1}^{-1}$.

Proof. For any column vector $z$, we have

$$z^\top (V_t - V_{t-1}) z = z^\top \psi(x_t)\psi(x_t)^\top z = \left(\psi(x_t)^\top z\right)^\top \left(\psi(x_t)^\top z\right) = \left\|\psi(x_t)^\top z\right\|_2^2 \ge 0 .$$

The first equality comes from the intermediate result in (20).
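So, $V_t - V_{t-1}$ is positive semi-definite, i.e., $V_{t-1} \preccurlyeq V_t$. This also indicates that $V_t^{-1} \preccurlyeq V_{t-1}^{-1}$, which thus completes our proof.

Lemma B.6 (Lower Bound of Variance Norm). For any $x \in \mathcal{X}$ and any $t \ge 1$, the following holds:

$$\frac{1}{1 + 1/\sigma^2} \left\|\partial\sigma_{t-1}^2(x)\right\|_2 \le \left\|\partial\sigma_t^2(x)\right\|_2 . \tag{24}$$

Proof. We firstly show that

$$\begin{aligned}
\left\|V_t^{-1/2} \psi(x)\psi(x)^\top V_t^{-1/2}\right\|_2 &\overset{(a)}{\le} \left\|V_t^{-1/2} \psi(x)\right\|_2 \left\|\psi(x)^\top V_t^{-1/2}\right\|_2 \overset{(b)}{=} \left\|\psi(x)^\top V_t^{-1/2}\right\|_2^2 \\
&\overset{(c)}{=} \psi(x)^\top V_t^{-1/2} V_t^{-1/2} \psi(x) \overset{(d)}{=} \psi(x)^\top V_t^{-1} \psi(x) \\
&\overset{(e)}{\le} \psi(x)^\top V_{t-1}^{-1} \psi(x) \overset{(f)}{\le} \psi(x)^\top V_0^{-1} \psi(x) \\
&\overset{(g)}{=} \psi(x)^\top \psi(x) / \sigma^2 \overset{(h)}{\le} 1/\sigma^2 .
\end{aligned} \tag{25}$$

Note that (a) derives from the Cauchy-Schwarz inequality. As for (b) and (c), they have exploited the fact that $\left(V_t^{-1/2} \psi(x)\right)^\top = \psi(x)^\top V_t^{-1/2}$ and that $\psi(x)^\top V_t^{-1/2}$ is a row vector. In addition, (e) follows from Lemma B.5, and (f) follows from applying Lemma B.5 repeatedly. Finally, (g) results from $V_0^{-1} = I/\sigma^2$ and (h) derives from the assumption that $k(x, x) \le 1$ ($\forall x \in \mathcal{X}$) in Sec. 2.1. Alternatively, we can restate the result above as

$$V_t^{-1/2} \psi(x)\psi(x)^\top V_t^{-1/2} \preccurlyeq \sigma^{-2} I . \tag{26}$$

We then complete our proof of the inequality in Lemma B.6 using the following inequality:

$$\begin{aligned}
\partial\sigma_t^2(x) &\overset{(a)}{=} \sigma^2 \phi(x)^\top \left(V_{t-1} + \psi(x_t)\psi(x_t)^\top\right)^{-1} \phi(x) \\
&\overset{(b)}{=} \sigma^2 \phi(x)^\top \left(V_{t-1}^{1/2} \left(I + V_{t-1}^{-1/2} \psi(x_t)\psi(x_t)^\top V_{t-1}^{-1/2}\right) V_{t-1}^{1/2}\right)^{-1} \phi(x) \\
&\overset{(c)}{=} \sigma^2 \phi(x)^\top V_{t-1}^{-1/2} \left(I + V_{t-1}^{-1/2} \psi(x_t)\psi(x_t)^\top V_{t-1}^{-1/2}\right)^{-1} V_{t-1}^{-1/2} \phi(x) \\
&\overset{(d)}{\succcurlyeq} \sigma^2 \phi(x)^\top V_{t-1}^{-1} \phi(x) / (1 + 1/\sigma^2) \\
&\overset{(e)}{=} \partial\sigma_{t-1}^2(x) / (1 + 1/\sigma^2)
\end{aligned} \tag{27}$$

where (a) derives from (20) and (c) comes from the inversion of a matrix product. Finally, (d) follows from the result in (26) and (e) exploits the reformulation of $\partial\sigma_{t-1}^2(x)$.

According to Lemma B.4 and Lemma B.6, the following holds for any $x \in \mathcal{X}$ and any $t \ge 1$:

$$\frac{1}{1 + 1/\sigma^2} \le \frac{\left\|\partial\sigma_t^2(x)\right\|_2}{\left\|\partial\sigma_{t-1}^2(x)\right\|_2} \le 1 . \tag{28}$$

Based on the definition of $r$ in our Theorem 2, we therefore also have

$$r \triangleq \max_{x \in \mathcal{X},\, t \ge 1} \sqrt{\left\|\partial\sigma_t^2(x)\right\|_2 / \left\|\partial\sigma_{t-1}^2(x)\right\|_2} \in \left[1/\sqrt{1 + 1/\sigma^2},\ 1\right] . \tag{29}$$

As a result, for every iteration $t$ of our Algo. 2, we have

$$\sqrt{\left\|\partial\sigma_t^2(x)\right\|_2} \le r \sqrt{\left\|\partial\sigma_{t-1}^2(x)\right\|_2} \le r^t \sqrt{\left\|\partial\sigma_0^2(x)\right\|_2} = r^t \sqrt{\left\|\partial_z \partial_{z'} k(z, z')\big|_{z=z'=x}\right\|_2} \le r^t \kappa \tag{30}$$

where the last inequality derives from our assumption of $\left\|\partial_z \partial_{z'} k(z, z')|_{z=z'=x}\right\|_2 \le \kappa^2$ ($\forall x \in \mathcal{X}$) in our Sec. 2.1. By substituting the result above into our Theorem 1, we complete our proof of Theorem 2.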
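The matrix identities used in this proof (Lemma B.2 and Lemma B.3) and the two-sided bound on the variance norm (Lemma B.4 and Lemma B.6) can be checked numerically. The sketch below does so with random finite-dimensional features, which is an assumption made purely for illustration: `Psi` and `Phi` are random stand-ins for $\Psi_t$ and $\phi(x)$, and the columns of `Psi` are normalized so that $k(x, x) = \psi(x)^\top \psi(x) = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, t, sigma2 = 6, 3, 5, 0.5        # feature dim, input dim, #queries, noise variance
Psi = rng.normal(size=(p, t))         # columns stand in for psi(x_1), ..., psi(x_t)
Psi /= np.linalg.norm(Psi, axis=0)    # enforce psi^T psi = 1, i.e., k(x, x) = 1
Phi = rng.normal(size=(p, d))         # stands in for phi(x)


def V(s):
    """V_s = Psi_s Psi_s^T + sigma^2 I built from the first s queries."""
    return Psi[:, :s] @ Psi[:, :s].T + sigma2 * np.eye(p)


def dsigma2(s):
    """Reformulated covariance sigma^2 phi^T V_s^{-1} phi, as in (18)."""
    return sigma2 * Phi.T @ np.linalg.solve(V(s), Phi)


# Lemma B.2 with A = Psi_t^T
A = Psi.T
lhs_b2 = np.eye(p) - A.T @ np.linalg.inv(A @ A.T + sigma2 * np.eye(t)) @ A
rhs_b2 = sigma2 * np.linalg.inv(A.T @ A + sigma2 * np.eye(p))

# Lemma B.3 (Sherman-Morrison) with A = V_{t-1} and u = v = psi(x_t)
u = Psi[:, t - 1]
V_prev_inv = np.linalg.inv(V(t - 1))
sm_inv = V_prev_inv - np.outer(V_prev_inv @ u, u @ V_prev_inv) / (1.0 + u @ V_prev_inv @ u)

# Lemma B.4 and Lemma B.6: ratios of consecutive spectral norms
norms = [np.linalg.norm(dsigma2(s), 2) for s in range(t + 1)]
ratios = [norms[s] / norms[s - 1] for s in range(1, t + 1)]
```

In this toy setting every entry of `ratios` lies between `1 / (1 + 1 / sigma2)` and 1, in agreement with (28).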

B.4 PROOF OF THEOREM 3

Preparation. Following the definition of the derivative mapping on the true derivative $\nabla f(x_{t,\tau})$ in (8), we define the following derivative mapping on our estimated derivative $\nabla \mu_{t-1}(x_{t,\tau})$:

$$\hat{G}_{t,\tau} \triangleq \frac{x_{t,\tau} - x_{t,\tau+1}}{\eta_{t,\tau}} = \frac{x_{t,\tau} - P_{\mathcal{X}}\left(x_{t,\tau} - \eta_{t,\tau} \nabla \mu_{t-1}(x_{t,\tau})\right)}{\eta_{t,\tau}} . \tag{31}$$

By re-arranging it, we have the following update rule that reformulates (7):

$$x_{t,\tau+1} = x_{t,\tau} - \eta_{t,\tau} \hat{G}_{t,\tau} . \tag{32}$$

Based on our definition of the derivative mappings in (31) and (8), we introduce the following lemmas:

Lemma B.7 (General Projection Inequalities). Given $P_{\mathcal{X}}(x) = \arg\min_{z \in \mathcal{X}} \|x - z\|_2^2 / 2$ and domain $\mathcal{X}$, for any $x, x'$, we have

$$\|x - P_{\mathcal{X}}(x)\|_2 \le \|x - P_{\mathcal{X}}(x')\|_2 , \tag{33}$$

$$\|P_{\mathcal{X}}(x) - P_{\mathcal{X}}(x')\|_2 \le \|x - x'\|_2 . \tag{34}$$

Proof. For (33), as $P_{\mathcal{X}}(x') \in \mathcal{X}$ ($\forall x'$) and $P_{\mathcal{X}}(x) = \arg\min_{z \in \mathcal{X}} \|x - z\|_2^2 / 2$, we then naturally have (33). For (34), since $P_{\mathcal{X}}(x)$ is the optimum of $h(z) = \|x - z\|_2^2 / 2$, according to the optimality condition of the convex function $h(z)$ within the domain $z \in \mathcal{X}$ (Boyd and Vandenberghe, 2014), we then have the following inequality for any $P_{\mathcal{X}}(x') \in \mathcal{X}$:

$$\nabla h(z)^\top \left(P_{\mathcal{X}}(x') - z\right) \ge 0 . \tag{35}$$

By taking $\nabla h(z) = z - x$ with $z = P_{\mathcal{X}}(x)$ into the inequality above, we have

$$\left(P_{\mathcal{X}}(x) - x\right)^\top \left(P_{\mathcal{X}}(x') - P_{\mathcal{X}}(x)\right) \ge 0 . \tag{36}$$

By exchanging $x$ and $x'$ in the result above, we achieve the following similar result:

$$\left(P_{\mathcal{X}}(x') - x'\right)^\top \left(P_{\mathcal{X}}(x) - P_{\mathcal{X}}(x')\right) \ge 0 . \tag{37}$$

By summing (36) and (37), we obtain

$$(x - x')^\top \left(P_{\mathcal{X}}(x) - P_{\mathcal{X}}(x')\right) \ge \|P_{\mathcal{X}}(x) - P_{\mathcal{X}}(x')\|_2^2 . \tag{38}$$

Based on the Cauchy-Schwarz inequality, we finally achieve (34) using

$$\|P_{\mathcal{X}}(x) - P_{\mathcal{X}}(x')\|_2^2 \le (x - x')^\top \left(P_{\mathcal{X}}(x) - P_{\mathcal{X}}(x')\right) \le \|x - x'\|_2 \, \|P_{\mathcal{X}}(x) - P_{\mathcal{X}}(x')\|_2 \tag{39}$$

where both sides need to be divided by $\|P_{\mathcal{X}}(x) - P_{\mathcal{X}}(x')\|_2$ to complete our proof.

Lemma B.8 (Inequalities for Derivative Mappings). Given (31) and (8), for every $t$ and $\tau$, we have

$$\|\hat{G}_{t,\tau}\|_2^2 \le \nabla \mu_{t-1}(x_{t,\tau})^\top \hat{G}_{t,\tau} , \tag{40}$$

$$\|G_{t,\tau}\|_2 \le \|\nabla f(x_{t,\tau})\|_2 , \tag{41}$$

$$\|\hat{G}_{t,\tau} - G_{t,\tau}\|_2 \le \|\nabla \mu_{t-1}(x_{t,\tau}) - \nabla f(x_{t,\tau})\|_2 . \tag{42}$$

Proof.
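For (40), let $\tilde{x}_{t,\tau} = x_{t,\tau} - \eta_{t,\tau} \nabla \mu_{t-1}(x_{t,\tau})$. We then have

$$\begin{aligned}
&\|P_{\mathcal{X}}(x_{t,\tau}) - P_{\mathcal{X}}(\tilde{x}_{t,\tau})\|_2^2 - (x_{t,\tau} - \tilde{x}_{t,\tau})^\top \left(P_{\mathcal{X}}(x_{t,\tau}) - P_{\mathcal{X}}(\tilde{x}_{t,\tau})\right) \\
&\quad \overset{(a)}{=} \|x_{t,\tau} - x_{t,\tau+1}\|_2^2 - \eta_{t,\tau} \nabla \mu_{t-1}(x_{t,\tau})^\top (x_{t,\tau} - x_{t,\tau+1}) \\
&\quad \overset{(b)}{=} \eta_{t,\tau}^2 \|\hat{G}_{t,\tau}\|_2^2 - \eta_{t,\tau}^2 \nabla \mu_{t-1}(x_{t,\tau})^\top \hat{G}_{t,\tau} \\
&\quad \overset{(c)}{\le} 0
\end{aligned} \tag{43}$$

where (a) results from the fact that $x_{t,\tau+1} = P_{\mathcal{X}}\left(x_{t,\tau} - \eta_{t,\tau} \nabla \mu_{t-1}(x_{t,\tau})\right)$ based on our (7) and that $P_{\mathcal{X}}(x_{t,\tau}) = x_{t,\tau}$ as $x_{t,\tau} \in \mathcal{X}$, and (b) derives from the definition of $\hat{G}_{t,\tau}$ in (31). In addition, (c) is based on the following result obtained by substituting $x = x_{t,\tau}$ and $x' = \tilde{x}_{t,\tau}$ into (38):

$$\|P_{\mathcal{X}}(x_{t,\tau}) - P_{\mathcal{X}}(\tilde{x}_{t,\tau})\|_2^2 - (x_{t,\tau} - \tilde{x}_{t,\tau})^\top \left(P_{\mathcal{X}}(x_{t,\tau}) - P_{\mathcal{X}}(\tilde{x}_{t,\tau})\right) \le 0 .$$

Finally, by dividing both sides of the last inequality in (43) by $\eta_{t,\tau}^2$, we finish the proof for (40).

For (41), following the same proof above, we can also obtain the following inequality for the projected derivative $G_{t,\tau}$:

$$\|G_{t,\tau}\|_2^2 \le \nabla f(x_{t,\tau})^\top G_{t,\tau} \le \|\nabla f(x_{t,\tau})\|_2 \, \|G_{t,\tau}\|_2 . \tag{45}$$

We complete the proof for (41) by dividing both sides of the inequality above by $\|G_{t,\tau}\|_2$.

For (42), define $x'_{t,\tau+1} \triangleq x_{t,\tau} - \eta_{t,\tau} G_{t,\tau}$. We have

$$\begin{aligned}
\|\hat{G}_{t,\tau} - G_{t,\tau}\|_2 &\overset{(a)}{=} \frac{1}{\eta_{t,\tau}} \left\|(x_{t,\tau} - x_{t,\tau+1}) - (x_{t,\tau} - x'_{t,\tau+1})\right\|_2 \\
&\overset{(b)}{=} \frac{1}{\eta_{t,\tau}} \left\|x_{t,\tau+1} - x'_{t,\tau+1}\right\|_2 \\
&\overset{(c)}{=} \frac{1}{\eta_{t,\tau}} \left\|P_{\mathcal{X}}\left(x_{t,\tau} - \eta_{t,\tau} \nabla \mu_{t-1}(x_{t,\tau})\right) - P_{\mathcal{X}}\left(x_{t,\tau} - \eta_{t,\tau} \nabla f(x_{t,\tau})\right)\right\|_2 \\
&\overset{(d)}{\le} \frac{1}{\eta_{t,\tau}} \left\|x_{t,\tau} - \eta_{t,\tau} \nabla \mu_{t-1}(x_{t,\tau}) - \left(x_{t,\tau} - \eta_{t,\tau} \nabla f(x_{t,\tau})\right)\right\|_2 \\
&\overset{(e)}{=} \|\nabla \mu_{t-1}(x_{t,\tau}) - \nabla f(x_{t,\tau})\|_2
\end{aligned} \tag{46}$$

where (a) comes from the definition of $\hat{G}_{t,\tau}$ and $G_{t,\tau}$ in (31) and (8), respectively. In addition, (c) derives from (7) and (8). Finally, (d) results from (34).

Proof. Since the objective function $f$ is assumed to be $L_s$-Lipschitz smooth (Sec. 4.2), we have the following inequality for any $x_{t,\tau} \in \mathcal{X}$ in our ZORD algorithm:

$$f(x_{t,\tau+1}) - f(x_{t,\tau}) \le \nabla f(x_{t,\tau})^\top (x_{t,\tau+1} - x_{t,\tau}) + \frac{L_s}{2} \|x_{t,\tau+1} - x_{t,\tau}\|_2^2 . \tag{47}$$

Let $\delta' \in (0, 1)$.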
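Define $\beta \triangleq \sqrt{d + 2(\sqrt{d} + 1)\ln(1/\delta')}$. By substituting (32) into the inequality above, the following inequality holds with probability of at least $1 - \delta'$, where we write $e_{t,\tau} \triangleq \nabla \mu_{t-1}(x_{t,\tau}) - \nabla f(x_{t,\tau})$ for brevity:

$$\begin{aligned}
f(x_{t,\tau+1}) - f(x_{t,\tau}) &\overset{(a)}{\le} -\eta_{t,\tau} \nabla f(x_{t,\tau})^\top \hat{G}_{t,\tau} + \frac{L_s \eta_{t,\tau}^2}{2} \|\hat{G}_{t,\tau}\|_2^2 \\
&\overset{(b)}{=} \eta_{t,\tau}\, e_{t,\tau}^\top \hat{G}_{t,\tau} - \eta_{t,\tau} \nabla \mu_{t-1}(x_{t,\tau})^\top \hat{G}_{t,\tau} + \frac{L_s \eta_{t,\tau}^2}{2} \|\hat{G}_{t,\tau}\|_2^2 \\
&\overset{(c)}{=} \eta_{t,\tau} \left[e_{t,\tau}^\top (\hat{G}_{t,\tau} - G_{t,\tau}) + e_{t,\tau}^\top G_{t,\tau}\right] - \eta_{t,\tau} \nabla \mu_{t-1}(x_{t,\tau})^\top \hat{G}_{t,\tau} + \frac{L_s \eta_{t,\tau}^2}{2} \|\hat{G}_{t,\tau}\|_2^2 \\
&\overset{(d)}{\le} \eta_{t,\tau} \left[\|e_{t,\tau}\|_2 \|\hat{G}_{t,\tau} - G_{t,\tau}\|_2 + \|e_{t,\tau}\|_2 \|G_{t,\tau}\|_2\right] - \eta_{t,\tau} \nabla \mu_{t-1}(x_{t,\tau})^\top \hat{G}_{t,\tau} + \frac{L_s \eta_{t,\tau}^2}{2} \|\hat{G}_{t,\tau}\|_2^2 \\
&\overset{(e)}{\le} \eta_{t,\tau} \left[\|e_{t,\tau}\|_2^2 + \|e_{t,\tau}\|_2 \|\nabla f(x_{t,\tau})\|_2\right] - \frac{2\eta_{t,\tau} - L_s \eta_{t,\tau}^2}{2} \|\hat{G}_{t,\tau}\|_2^2 \\
&\overset{(f)}{\le} \eta_{t,\tau} \kappa^2 \beta^2 r^{2t} + \eta_{t,\tau} L_c \kappa \beta r^t - \frac{\eta_{t,\tau}}{2} \|\hat{G}_{t,\tau}\|_2^2
\end{aligned} \tag{48}$$

where (d) derives from the Cauchy-Schwarz inequality and (e) follows from Lemma B.8. Finally, (f) results from the bounded derivative estimation error in Theorem 2, the fact that $f$ is $L_c$-Lipschitz continuous (i.e., $\|\nabla f(x)\|_2 \le L_c$ for any $x \in \mathcal{X}$), and $\eta_{t,\tau} \le 1/L_s$ ($\forall \tau$).

For every iteration $t$ of our ZORD algorithm, we in fact apply the virtual updates (7) for $V_t$ times (see Algo. 2). Therefore, with probability of at least $1 - V_t \delta'$, we have

$$\begin{aligned}
\frac{1}{V_t} \sum_{\tau=0}^{V_t - 1} \eta_{t,\tau} \|\hat{G}_{t,\tau}\|_2^2 &\le \frac{2}{V_t} \sum_{\tau=0}^{V_t - 1} \left[f(x_{t,\tau}) - f(x_{t,\tau+1}) + \eta_{t,\tau} \left(\kappa^2 \beta^2 r^{2t} + L_c \kappa \beta r^t\right)\right] \\
&= \frac{2}{V_t} \left[f(x_{t-1}) - f(x_t)\right] + \frac{2}{V_t} \sum_{\tau=0}^{V_t - 1} \eta_{t,\tau} \left(\kappa^2 \beta^2 r^{2t} + L_c \kappa \beta r^t\right)
\end{aligned} \tag{49}$$

where the first inequality results from re-arranging (48) and then summing it up over $\tau$. However, in order to prove the convergence of our ZORD algorithm to a stationary point, we need to consider the derivative mapping $G_{t,\tau}$ instead (refer to our Sec. 4.2).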
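So, for any $\tau$, we propose the following inequality:

$$\|G_{t,\tau}\|_2 = \|G_{t,\tau} - \hat{G}_{t,\tau} + \hat{G}_{t,\tau}\|_2 \le \|G_{t,\tau} - \hat{G}_{t,\tau}\|_2 + \|\hat{G}_{t,\tau}\|_2 \le \|\nabla \mu_{t-1}(x_{t,\tau}) - \nabla f(x_{t,\tau})\|_2 + \|\hat{G}_{t,\tau}\|_2 \le \kappa \beta r^t + \|\hat{G}_{t,\tau}\|_2$$

where the first inequality is from the triangle inequality and the second inequality comes from (42). Finally, by taking the result above into (49), we have

$$\frac{1}{V_t} \sum_{\tau=0}^{V_t - 1} \eta_{t,\tau} \|G_{t,\tau}\|_2^2 \le \frac{2}{V_t} \left[f(x_{t-1}) - f(x_t)\right] + \frac{2}{V_t} \sum_{\tau=0}^{V_t - 1} \eta_{t,\tau} \left(\kappa^2 \beta^2 r^{2t} + L_c \kappa \beta r^t\right) + \kappa \beta r^t .$$

Then, substituting $V_t = V$ and $\eta_{t,\tau} = \eta$ for any $t, \tau$ into the result above, the following inequality holds with probability of at least $1 - VT\delta'$ when $r < 1$:

$$\begin{aligned}
\frac{1}{T} \sum_{t=1}^{T} \frac{1}{V} \sum_{\tau=0}^{V-1} \eta \|G_{t,\tau}\|_2^2 &\overset{(a)}{\le} \frac{1}{T} \sum_{t=1}^{T} \left[\frac{2\left(f(x_{t-1}) - f(x_t)\right)}{V} + 2\eta \kappa^2 \beta^2 r^{2t} + (2\eta L_c + 1) \kappa \beta r^t\right] \\
&\overset{(b)}{\le} \frac{2}{TV} \left[f(x_0) - f(x_T)\right] + \frac{2\eta (1 - r^{2T})}{T(1 - r^2)} \kappa^2 \beta^2 r^2 + \frac{(2\eta L_c + 1)(1 - r^T)}{T(1 - r)} \kappa \beta r \\
&\overset{(c)}{\le} \frac{2}{TV} \left[f(x_0) - f(x^*)\right] + \frac{2\eta \kappa^2 \beta^2 r^2}{T(1 - r^2)} + \frac{(2\eta L_c + 1) \kappa \beta r}{T(1 - r)} .
\end{aligned} \tag{52}$$

Note that (b) derives from the summation of the geometric sequence about $r$ and (c) comes from $x^* \triangleq \arg\min_{x \in \mathcal{X}} f(x)$. When $r = 1$, the following holds with probability of at least $1 - VT\delta'$ accordingly:

$$\begin{aligned}
\frac{1}{T} \sum_{t=1}^{T} \frac{1}{V} \sum_{\tau=0}^{V-1} \eta \|G_{t,\tau}\|_2^2 &\le \frac{1}{T} \sum_{t=1}^{T} \left[\frac{2\left(f(x_{t-1}) - f(x_t)\right)}{V} + 2\eta \kappa^2 \beta^2 r^{2t} + (2\eta L_c + 1) \kappa \beta r^t\right] \\
&= \frac{2}{TV} \left[f(x_0) - f(x_T)\right] + 2\eta \kappa^2 \beta^2 + (2\eta L_c + 1) \kappa \beta .
\end{aligned}$$

Finally, let $\delta = VT\delta' \in (0, 1)$. The following holds with probability of at least $1 - \delta$:

$$\min_{t \le T} \frac{1}{V} \sum_{\tau=0}^{V-1} \|G_{t,\tau}\|_2^2 \le \frac{1}{T} \sum_{t=1}^{T} \frac{1}{V} \sum_{\tau=0}^{V-1} \|G_{t,\tau}\|_2^2 \le ① + ②$$

where ① and ② can be defined as below with $\alpha \triangleq \kappa \sqrt{d + 2(\sqrt{d} + 1)\ln(VT/\delta)}$:

$$① = \frac{2}{\eta T V} \left[f(x_0) - f(x_T)\right] , \qquad ② = \begin{cases} \dfrac{2\alpha^2 r^2}{T(1 - r^2)} + \dfrac{(2L_c + 1/\eta)\,\alpha r}{T(1 - r)} & (r < 1) , \\[2ex] 2\alpha^2 + (2L_c + 1/\eta)\,\alpha & (r = 1) . \end{cases}$$

apply these ZO/FO optimization algorithms with a query budget of 200 for d = 20, 40, and a query budget of 400 for d = 100 to compare their query efficiency. We use the same Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.1 and exponential decay rates of 0.9, 0.999 for RGF, PRGF, GD, and our ZORD algorithm, for faster convergence compared with standard GD.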
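The projection inequalities of Lemma B.7 and the derivative-mapping inequalities of Lemma B.8, which drive the proof of Theorem 3, can be illustrated for the special case of a box domain, where the projection is a coordinate-wise clip. The sketch below is only an illustration under that assumption; the helper names `project_box`, `norm2` and `gradient_mapping` are hypothetical.

```python
import math
import random


def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^d (coordinate-wise clipping)."""
    return [min(max(v, lo), hi) for v in x]


def norm2(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(a * a for a in v))


def gradient_mapping(x, g, eta, lo, hi):
    """Derivative mapping (x - P(x - eta * g)) / eta for a (possibly estimated) gradient g."""
    x_next = project_box([xi - eta * gi for xi, gi in zip(x, g)], lo, hi)
    return [(xi - xn) / eta for xi, xn in zip(x, x_next)]
```

Calling `gradient_mapping` once with an estimated gradient and once with the true gradient at the same feasible input gives the two mappings whose difference is bounded in (42).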

C.3 BLACK-BOX ADVERSARIAL ATTACK

For the black-box adversarial attack experiment on the MNIST dataset, we use the same fully trained deep neural networks from (Cheng et al., 2021) and adopt an L∞ constraint of ∥x∥∞ ≤ 0.3 on the input perturbation x. For the black-box adversarial attack experiment on the CIFAR-10 dataset, we fully train a ResNet-18 (He et al., 2016) on CIFAR-10 using stochastic gradient descent (SGD) with a cosine annealed learning rate from 0.1 to 0, a momentum of 0.9 and a weight decay of 5 × 10^{-4} for 200 epochs, and adopt an L∞ constraint of ∥x∥∞ ≤ 0.2 on the input perturbation x. Note that we use the same loss function as (Cheng et al., 2021) for these two experiments. Meanwhile, to apply RGF, PRGF and our ZORD, we adopt the Adam optimizer with the same learning rate of 0.5 and the same exponential decay rates of 0.9, 0.999.
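The following is a minimal sketch of how an Adam update driven by an estimated derivative can be combined with the L∞ constraint on the perturbation. How the constraint is enforced together with Adam is not spelled out above, so clipping after every step is an assumption here, as are the class name `ProjectedAdam` and the value of `eps`.

```python
import math


class ProjectedAdam:
    """Adam update followed by a projection onto the L-infinity ball of radius `bound`."""

    def __init__(self, dim, lr=0.5, beta1=0.9, beta2=0.999, eps=1e-8, bound=0.3):
        self.lr, self.beta1, self.beta2, self.eps, self.bound = lr, beta1, beta2, eps, bound
        self.m = [0.0] * dim  # first-moment estimate
        self.v = [0.0] * dim  # second-moment estimate
        self.t = 0

    def step(self, x, grad):
        """Return the updated perturbation given an (estimated) derivative `grad`."""
        self.t += 1
        new_x = []
        for i, (xi, gi) in enumerate(zip(x, grad)):
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * gi
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * gi * gi
            m_hat = self.m[i] / (1.0 - self.beta1 ** self.t)  # bias correction
            v_hat = self.v[i] / (1.0 - self.beta2 ** self.t)
            xi = xi - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)
            new_x.append(min(max(xi, -self.bound), self.bound))  # clip to [-bound, bound]
        return new_x
```

With a learning rate of 0.5 and a radius of 0.3, the very first step already saturates the constraint in every coordinate with a nonzero derivative, since the first Adam step has magnitude close to the learning rate.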

C.4 NON-DIFFERENTIABLE METRIC OPTIMIZATION

The Covertype dataset used in Sec. 5.4 is a classification dataset consisting of 581,012 samples from 7 different categories. Each sample from this dataset is a 54-dimensional vector of integers. In this experiment, we randomly split the dataset into training and test sets with each containing 290,506 samples. The MLP classifier applied in Sec. 5.4 consists of 2 layers with 30 and 14 hidden neurons respectively, leading to 2189 parameters in total (i.e., d = 2189). We first train this MLP classifier on the training dataset of Covertype using the L-BFGS algorithm with the cross-entropy loss function for 300 epochs, and then apply ZO optimization algorithms to fine-tune our trained MLP directly on the non-differentiable metrics (i.e., using these metrics as the new loss functions), including precision, recall, F1 score and Jaccard index. To obtain the results of ES, RGF, PRGF and our ZORD algorithm in Sec. 5.4, we apply the same Adam optimizer with a learning rate of 0.2 (for precision and recall) or 0.01 (for F1 score and Jaccard index) and exponential decay rates of 0.9, 0.999. Note that standard BO algorithms (including TuRBO) fail to achieve any percentage improvements (i.e., achieving 0% in the y-axis of Fig. 4 ) in this experiment according to our five independent runs, which is likely due to their aggressive exploration in the input domain of such a high dimension. In light of this, we do not include them in our comparison since all other methods are able to achieve certain improvements.
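The parameter count d = 2189 and the metrics above can be reproduced with a few lines. The sketch assumes a fully connected 54-30-14-7 architecture with biases (the output width of 7 is inferred from the 7 categories) and shows the metrics in their binary form for a single positive class; how the metrics are aggregated across the 7 classes is not specified above and is left out. All function names are hypothetical.

```python
def mlp_param_count(layer_sizes):
    """Number of weights and biases of a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 score and Jaccard index for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}


def percentage_improvement(f_start, f_end):
    """Percentage improvement (f(x_0) - f(x_T)) * 100% / f(x_0) reported in Fig. 4."""
    return (f_start - f_end) * 100.0 / f_start
```

These metrics are piecewise constant in the network parameters because they depend on the parameters only through the predicted labels, which is why derivative-based fine-tuning is not directly applicable to them.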

C.5 DERIVATIVE-FREE REINFORCEMENT LEARNING

Our derivative-free RL experiments aim to learn controllers (which output policies) that maximize the rewards/return for several environments in the OpenAI Gym (Brockman et al., 2016) without using true derivatives. Specifically, we need to optimize the parameters (i.e., x) of our neural network (MLP) controller with 2 hidden layers, where each hidden layer has 10 hidden neurons and one bias term. We adopt an L∞ constraint of ∥x∥∞ ≤ 1 on the parameters x. We use a softmax output layer for the policies that deal with discrete action spaces, and a tanh output layer for the policies that deal with continuous action spaces. The dimension of neural network parameters (represented as a column vector) d is determined by the dimensions of both the observation |S| and the action space |A| of an environment, as detailed in Tab. 2. In order to search for policies that are robust to different random state initializations, we use the vectorized API of OpenAI Gym, and our observed function value y(x) given the network parameters x is an averaged return of 32 parallel environments. We also fix the seed of OpenAI Gym for all queries, which ensures that we are evaluating on a fixed set of 32 state initializations and that our results can be reproduced. We first initialize a sample of 500 points from a Latin Hypercube (McKay et al., 1979) to find a good initial input, and then proceed to apply ZO optimization algorithms (i.e., ES, RGF, PRGF, and our ZORD) with the same query budget of 1000 on this initial input. For all these ZO optimization algorithms, we employ the same Adam optimizer with a learning rate of 1.0 and exponential decay rates of 0.9, 0.999. Considering the prohibitive noise in RL experiments, we use the 300 queries from the optimization trajectory that have the smallest Euclidean distances to the input that needs to be updated. Of note, we conduct 10 trials in total, where the trials differ from each other in both the OpenAI Gym seed and the Latin Hypercube initialization.
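Two ingredients of this setup lend themselves to a short sketch: the Latin Hypercube initialization inside the box ∥x∥∞ ≤ 1 and the selection of the nearest queries from the optimization trajectory. The code below is a plain-Python illustration; the function names, the stratified sampling scheme, and the representation of the trajectory as a list of `(input, value)` pairs are assumptions and not the paper's implementation.

```python
import math
import random


def latin_hypercube(n, d, lo=-1.0, hi=1.0, seed=0):
    """Draw n points in [lo, hi]^d with exactly one point per stratum in every dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(d):
        strata = list(range(n))
        rng.shuffle(strata)  # independent random permutation per dimension
        columns.append([lo + (hi - lo) * (s + rng.random()) / n for s in strata])
    return [[columns[j][i] for j in range(d)] for i in range(n)]


def nearest_queries(x, history, k=300):
    """Return the k past queries (x_i, y_i) closest to x in Euclidean distance."""
    return sorted(history, key=lambda pair: math.dist(x, pair[0]))[:k]
```

The best of the 500 Latin Hypercube points (by observed return) would then serve as the initial input, and `nearest_queries` would restrict the trajectory used for derivative estimation at each update.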

D.1 MORE RESULTS ON DERIVATIVE ESTIMATION

Besides the comparison in Fig. 2, we provide additional comparison between our derived GP-based estimator (6) and the FD estimator (2) under various input dimensions in Fig. 6 (a) and various kernels in Fig. 6 (b). The results show that under various input dimensions and GP kernels, our derived GP-based estimator (6) is still able to achieve faster reduction rates compared with the FD estimator. Of note, all the function queries applied in our derived GP-based estimator are from the optimization trajectory whereas the FD estimator requires additional function queries for its derivative estimation. So, Fig. 6 (a)(b) also show that our derived GP method is still able to achieve better query efficiency for accurate derivative estimation than the FD method under various input dimensions and GP kernels because our method avoids the requirement of additional queries for derivative estimation. Interestingly, the objective function (i.e., the Ackley function) is not truly sampled from the GPs based on these kernels. This therefore means that though we have assumed prior knowledge about the GP from which the objective function is sampled (Sec. 2.1), such an assumption does not really need to be satisfied for our derived GP-based method to achieve accurate derivative estimation in practice. More interestingly, we notice that the Matérn(ν = 0.5) and SE kernels achieve slightly worse derivative estimation, indicating that the choice of GP kernels may impact the quality of our derived GP-based derivative estimation. However, in practice, our derived GP method based on the Matérn(ν = 2.5) kernel, which has been widely adopted in our experiments, is already able to provide us with good derivative estimation for ZO optimization, as confirmed by the results in our other experiments.
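For reference, the three stationary kernels mentioned above can be written as functions of the distance r between two inputs. The sketch uses the standard closed forms with unit signal variance and a length scale `ls`; these hyperparameter conventions are assumptions and may differ from those used to produce Fig. 6 (b).

```python
import math


def se(r, ls=1.0):
    """Squared-exponential kernel as a function of the distance r."""
    return math.exp(-0.5 * (r / ls) ** 2)


def matern52(r, ls=1.0):
    """Matern kernel with nu = 2.5."""
    s = math.sqrt(5.0) * r / ls
    return (1.0 + s + s * s / 3.0) * math.exp(-s)


def matern12(r, ls=1.0):
    """Matern kernel with nu = 0.5 (exponential kernel)."""
    return math.exp(-r / ls)
```

One structural difference is visible directly from these formulas: the Matérn(ν = 0.5) kernel has a kink at r = 0, whereas the SE and Matérn(ν = 2.5) kernels are smooth there, so the three kernels encode different smoothness assumptions about the objective function.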

D.2 MORE RESULTS ON SYNTHETIC EXPERIMENTS

In this section, we compare ZORD with more baselines in Fig. 8. Notably, we mainly compare our ZORD with CobBO (based on the code implementation provided by (Tan et al., 2021)) since CobBO generally performs better than other baselines, e.g., TPE, ATPE, and BADS according to (Tan et al., 2021). As shown in the results in Fig. 8, our ZORD algorithm is still able to outperform the other benchmark BO algorithm (i.e., CobBO). We then study the impacts of dynamic virtual updates (Sec. 3.2) on our ZORD algorithm. In particular, we apply the same setting in Appx. C.2 to optimize the Ackley and Levy functions with d = 40 under various confidence thresholds c for our dynamic virtual updates. Fig. 7 illustrates the results. As shown in both Fig. 7 (a) and (b), our ZORD algorithm using the technique of dynamic virtual updates (i.e., c > 0) can consistently achieve improved query efficiency compared with the one not using the technique of dynamic virtual updates (i.e., c = 0). This indicates the importance of dynamic virtual updates in helping improve the query efficiency of our ZORD algorithm. Such a result corroborates our theoretical insights about virtual updates (Sec. 4.2). Remarkably, our ZORD algorithm without the technique of dynamic virtual updates (i.e., c = 0) is still able to achieve both improved query efficiency and better converged performance compared with RGF and PRGF, which further verifies the superiority of our derived GP-based derivative estimation. More interestingly, both Fig. 7

D.3 MORE RESULTS ON BLACK-BOX ADVERSARIAL ATTACK

Besides the comparison in our Sec. 5.3, we also compare the success rate achieved by different ZO optimization algorithms on the 15 images selected from MNIST or CIFAR-10 in Fig. 9. Note that we adopt the same settings in Appx. C.3 for this comparison. Considering the large computational complexity of the TuRBO-1/10 algorithms for hard-to-attack images, which is usually undesirable in practice, we drop the comparison with them in this experiment. Fig. 9 shows that under the same query budget, our ZORD algorithm is able to achieve a considerably improved success rate over other ZO optimization algorithms. These results therefore further support the superior query efficiency of our ZORD algorithm in challenging real-world problems.

D.4 MORE RESULTS FOR DERIVATIVE-FREE REINFORCEMENT LEARNING

Recent years have also witnessed a surging interest in derivative-free reinforcement learning (Salimans et al., 2017; Qian and Yu, 2021), where ZO optimization algorithms are widely applied. In light of this, we also demonstrate the superiority of our ZORD algorithm in the problem of derivative-free reinforcement learning. Specifically, we adopt the setting in Sec. C.5 to experiment in different RL environments. Tab. 3 summarizes the comparison among different ZO optimization algorithms under

Of note, the novelty of our work in fact lies in its way of exploiting the GP assumption to help design an improved derivative estimation and hence an improved ZO optimization algorithm, which to the best of our knowledge has not been explored theoretically yet in the field of ZO optimization via GD with estimated derivative. That is, at this moment, it is still not known in the literature how existing FD methods can utilize such an assumption to achieve better derivative estimation (i.e., their derivative estimation quality will remain the same), even when they make the same assumption as us. In light of this, the comparison between our derived GP method and the FD method in Sec. 4 is not only necessary but also meaningful to show the advantage of exploiting such an assumption in ZO derivative estimation. Importantly, our empirical results further show that such an assumption is in fact not restrictive for our ZORD to achieve compelling performance in practice. For example, our Fig. 2 and Fig. 6 have shown that our derived GP-based method is able to achieve smaller derivative estimation error than the FD method when the objective functions are not designed to be sampled from a GP with the kernel that we had applied for our derivative estimation. Moreover, the results in our Sec. 5.2, 5.3, 5.4 have shown that our ZORD is capable of achieving competitive optimization performance for real-world optimization problems where the objective functions are also not designed to be sampled from a GP with the kernel that we had used for our ZORD.

Meanwhile, the theoretical challenges of our work lie in the theoretical guarantee on the derivative estimation error of our unique derived GP-based method for any input in the domain as well as the convergence analysis based on such a unique derivative estimation, which to the best of our knowledge have not been studied in the literature. This means that our Thm. 1 and Thm. 2 have provided new developments in the analysis of gradient estimation error and our Thm. 3 is the first convergence result for GD using our unique derivative estimation method. Interestingly, the bound in our Thm. 3 also improves over the standard ones from (Nesterov and Spokoiny, 2017; Liu et al., 2018b) in several aspects, as discussed in our Sec. 4.2.

E.2 ZORD VS. BO

Our ZORD algorithm and standard BO algorithms (e.g., GP-UCB) apply the same GP assumption in their algorithm design. That is, however, where the similarity ends. Of note, our ZORD exploits this assumption to derive a specific GP (i.e., (4)) for derivative estimation, which is then employed for local exploitation via (projected) GD updates. In contrast, BO algorithms utilize this assumption to construct their acquisition functions for a global optimization that trades off between exploitation and exploration. In practice, the exploration of BO algorithms is usually query-inefficient, especially for problems with high-dimensional input spaces, and therefore GD with estimated derivatives (especially our ZORD) is preferred to realize better optimization performance in these problems (see Sec. 5.2). So, our ZORD and BO algorithms belong to two different types of ZO optimization algorithms (i.e., GD-type vs. BO-type), whose theoretical analyses are not directly comparable. In particular, GD-type and BO-type ZO optimization algorithms apply different metrics in their theoretical analyses, e.g., the derivative estimation error as well as the convergence to a stationary point (in the nonconvex case) for GD-type ZO optimization algorithms vs. the global asymptotic convergence in terms of the regret for BO-type ZO optimization algorithms. It is therefore more reasonable to compare the theory of our ZORD (including the theoretical challenges, the new developments, and the novelty of the convergence result) with that of other GD-type ZO optimization algorithms, e.g., the ones using FD methods for their derivative estimation (Nesterov and Spokoiny, 2017; Liu et al., 2018b), as we have discussed in Sec. E.1.
In addition, in contrast to using the GP to model the objective function within the entire domain for global exploration in BO, our derived GP in ZORD is applied to estimate the derivative of the objective function for local exploitation by GD, as shown in Sec. 3.1. As GD typically optimizes in a local region, our derived GP only needs to estimate the derivative locally, which is known to be much simpler than modeling the objective function within the entire domain in BO, especially for objective functions in high-dimensional input spaces. In light of this, the derived GP for derivative estimation (4) in our ZORD algorithm improves on the standard GP in BO in the following aspects:

1. Improved Query Efficiency for Estimation. The derived GP in our ZORD algorithm requires fewer function queries to provide accurate derivative estimation. We provide a visual example in Fig. 10, in which we sample a one-dimensional function f from a GP prior GP(0, k(x, x')) using the standard SE kernel and then randomly select the same number of queries from the input domains [-6, 6] and [0, 3] for the standard GP and our derived GP, respectively. As illustrated in Fig. 10, a function in a local region (i.e., x ∈ [0, 3]) is usually smoother than its counterpart in the entire domain (i.e., x ∈ [-6, 6]). As a result, with only 4 function queries, our derived GP can already provide an accurate estimate of the derivative of this objective function, whereas the standard GP requires more than 8 function queries to model this objective function accurately in the entire domain.

2. Reduced Computational Complexity. Comparing (3) and (5), both the derived GP for derivative estimation in our ZORD algorithm and the standard GP in BO have a computational complexity of O(n^3) with n function queries. However, as a consequence of its improved query efficiency, our derived GP requires fewer function queries (i.e., a smaller n) for accurate derivative estimationfoot_3 and hence enjoys a reduced computational complexity in practice, especially when a large number of queries (e.g., n > 1000) is applied to the standard GP in BO.
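The local derivative estimation discussed above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: it assumes that the mean of the derived GP takes the standard form of the gradient of a GP posterior mean, ∇µ(x) = ∂_x k(x, X)^T (K + σ² I)^{-1} y, and it uses an SE kernel with unit length scale, a small jitter σ² = 1e-6, evenly spaced queries, and a stand-in objective sin(x) on [0, 3]. The function names `se_kernel` and `gp_gradient_mean` are hypothetical.

```python
import numpy as np

def se_kernel(A, B, ell=1.0):
    """Squared-exponential kernel matrix between the rows of A (m, d) and B (n, d)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / ell**2)

def gp_gradient_mean(x, X, y, ell=1.0, noise=1e-6):
    """Gradient of the GP posterior mean at x (assumed form of the derived GP mean).

    x: (d,) input, X: (n, d) past queries, y: (n,) observed function values.
    """
    K = se_kernel(X, X, ell) + noise * np.eye(len(X))
    alpha = np.linalg.solve(K, y)                 # (K + noise * I)^{-1} y
    k_x = se_kernel(x[None, :], X, ell)[0]        # k(x, x_i) for each query, shape (n,)
    # d/dx k(x, x_i) = -(x - x_i) / ell^2 * k(x, x_i) for the SE kernel
    dk = -(x[None, :] - X) / ell**2 * k_x[:, None]  # shape (n, d)
    return dk.T @ alpha                           # shape (d,)

# Stand-in local problem: 8 evenly spaced queries of sin(x) on [0, 3] (an assumption).
X = np.linspace(0.0, 3.0, 8)[:, None]
y = np.sin(X[:, 0])
x = np.array([1.5])
grad_est = gp_gradient_mean(x, X, y)
print(grad_est, np.cos(1.5))  # estimated vs. true derivative at x = 1.5
```

Solving the n × n linear system is the O(n^3) step mentioned above, which is why a smaller number of local queries directly reduces the cost of each estimate.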



In practice, it is usually necessary to query every updated input to measure the optimization performance and select the best-performing input. We refer to these queries as necessary queries.

The first step of GD update to reach x_{t,1} is always performed, i.e., V_t ≥ 1.

Bayesian optimization algorithms, including TuRBO-1/10, are widely known to suffer from prohibitive computational complexity when they need a large number of function queries for optimization, e.g., T > 1000 (Rasmussen and Williams, 2006).

As introduced in our Appx. C, 150 function queries for our derived GP can already help our ZORD algorithm achieve remarkable results in practice (refer to the experiments in our Sec. 5).



Figure 1: Our derived GP for derivative estimation (4) with different numbers n of queries. The green curve and its confidence interval denote the mean ∇µ(x) and the standard deviation of the derived GP, respectively.

Figure 3: Optimization of Ackley and Levy functions with different dimensions. The x-axis and y-axis denote the number of queries and the log-scaled optimality gap (i.e., log(f(x_T) - f(x^*))) achieved after this number of queries. Each curve is the mean ± standard error from ten independent runs.

Table 1: Comparison of the number of required queries to achieve a successful black-box adversarial attack. Every entry represents mean ± standard deviation from five independent runs.

Figure 4: Optimization of different non-differentiable metrics on the Covertype dataset. The x-axis and y-axis denote, respectively, the number of queries and the improvement on the non-differentiable metric. Each curve is the mean ± standard error from five independent experiments.

Figure 5: The 3D illustration of Ackley and Levy synthetic function with d = 2.

Figure 6: Comparison of the derivative estimation errors of our derived GP-based estimator (GP) and the FD estimator under various input dimensions and kernels. Similarly, each result is reported with the mean ± standard error from five independent runs.

Figure 7: Comparison of our ZORD algorithm using different confidence thresholds c for its dynamic virtual updates, where the x-axis and the y-axis denote the number of function queries and the log-scaled optimality gap (i.e., log(f(x_T) - f(x^*))) achieved with this number of queries, respectively.

Figure 8: Additional comparison between our ZORD and other baselines. The x-axis and y-axis denote the number of queries and the log-scaled optimality gap (i.e., log(f(x_T) - f(x^*))) achieved after this number of queries. Each curve is the mean ± standard error from ten independent runs.

Figure 10: Comparison of local derivative estimation (in the input domain of [0, 3]) in our ZORD and global function approximation (in the input domain of [-6, 6]) in BO under various numbers of random function queries.


) (d = 28 × 28) or CIFAR-10 (Krizhevsky et al., 2009) (d = 32 × 32), and aim to add a perturbation with an L∞ constraint to make a trained deep neural network misclassify the image (more details in Appx. C.3). Tab. 1 summarizes the number of required queries to achieve a successful attack by different algorithms (see results on multiple images in Appx. D.3). The results show that in such high-dimensional ZO optimization problems, our ZORD again significantly outperforms the other algorithms since it requires a considerably smaller number of queries to achieve a successful attack.

OpenAI Gym environment properties and their respective network dimensions.

Comparison of the rewards (larger is better) achieved by various ZO optimization algorithms in different RL environments. Each result is reported with the mean ± standard deviation from ten independent runs.

All algorithms are compared under the same query budget of 1000. As BO algorithms usually suffer from prohibitive computational complexity for a large T (Rasmussen and Williams, 2006) and GLD has never been applied in RL, we mainly compare our ZORD algorithm with ES, RGF, and PRGF, which also belong to the same type of ZO optimization algorithm: GD with estimated derivatives. Remarkably, Tab. 3 shows that under the same query budget, our ZORD algorithm consistently achieves better performance (i.e., the highest rewards) than the other ZO optimization algorithms in different RL environments. This further supports the superiority of our ZORD algorithm over other FD-based ZO optimization algorithms.

ACKNOWLEDGMENTS

This research is part of the programme DesCartes and is supported by the National Research Foundation, Prime Minister's Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme.

APPENDIX C EXPERIMENTAL SETTINGS

C.1 GENERAL SETTINGS

Derived GP. In all our experiments in Sec. 5, to apply the derivative estimation in Sec. 3.1 for every iteration t and every step τ of our ZORD algorithm, we use the derived GP (4) based on the Matérn kernel with ν = 2.5 and fit this derived GP using the 150 queries from the optimization trajectory that have the smallest Euclidean distance to the input x_{t,τ}. This is because we only need to model the objective function f precisely in the vicinity of input x_{t,τ} rather than in the entire domain, so as to achieve an accurate derivative estimation at x_{t,τ}.

Confidence Threshold. In all our experiments in Sec. 5, the confidence threshold c of our dynamic virtual updates (Sec. 3.2) is set to 0.35 in order to realize a good trade-off between query efficiency and accurate derivative estimation in practice, which already allows our ZORD to achieve compelling empirical results consistently (see Sec. 5). In light of this, c = 0.35 is a reasonably good choice in practice, especially when there is no prior knowledge about the objective functions. When we have prior knowledge about the smoothness of the objective functions, we can likely make a better choice for c: Intuitively, smooth objective functions can usually be modeled effectively by a Gaussian process (Rasmussen and Williams, 2006), so an accurate derivative estimation from our derived GP is also likely to be achieved. In this scenario, a large confidence threshold can be applied to fully exploit the benefit of our derivative estimation, which requires no additional queries and consequently results in an improved query efficiency in practice.

Baselines. In addition, in all our experiments in Sec. 5, we consistently use n = 10, λ = 0.01, and directions {u_i}_{i=1}^n that are randomly sampled from a unit sphere for the derivative estimation of the FD method (2) applied in the RGF and PRGF algorithms. Moreover, following the common practice of (Berahas et al., 2022; Cheng et al., 2021), we orthogonalize these randomly selected directions via the Gram-Schmidt procedure. As for the ES algorithm (e.g., the one applied in (Salimans et al., 2017)), we apply the same n, λ, and {u_i}_{i=1}^n as in RGF and PRGF for its update in every iteration.

Domain Transformation. Following the practice used in (Eriksson et al., 2019), for all our experiments, we first re-scale the input domains into [0, 10]^d to ease the optimization and then re-scale the updated inputs back to the original domains for querying.
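The settings above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' code: all function names are hypothetical, the orthogonalization uses a QR factorization (which produces an orthonormal basis of the same span as Gram-Schmidt, up to signs), and the FD estimator is written as a plain average of directional differences, which is assumed to match the form of (2).

```python
import numpy as np

def to_box(x, lower, upper, scale=10.0):
    """Re-scale x from the original domain [lower, upper] into [0, scale]^d."""
    return scale * (x - lower) / (upper - lower)

def from_box(z, lower, upper, scale=10.0):
    """Map z in [0, scale]^d back to the original domain for querying."""
    return lower + z / scale * (upper - lower)

def nearest_queries(x, X_hist, y_hist, k=150):
    """Select the k past queries closest to x in Euclidean distance."""
    dist = np.linalg.norm(X_hist - x, axis=1)
    idx = np.argsort(dist)[:k]
    return X_hist[idx], y_hist[idx]

def orthonormal_directions(n, d, rng):
    """Draw n random directions in R^d and orthonormalize them (requires n <= d)."""
    G = rng.standard_normal((d, n))
    Q, _ = np.linalg.qr(G)      # reduced QR: Q has shape (d, n), orthonormal columns
    return Q.T                  # one direction per row

def fd_gradient(f, x, n=10, lam=0.01, rng=None):
    """Forward-difference estimate: (1/n) * sum_i [(f(x + lam*u_i) - f(x)) / lam] * u_i.

    The 1/n averaging is an assumption about the exact form of the FD method (2).
    """
    rng = np.random.default_rng() if rng is None else rng
    U = orthonormal_directions(n, x.size, rng)
    fx = f(x)
    diffs = np.array([(f(x + lam * u) - fx) / lam for u in U])
    return diffs @ U / n

# Usage on a stand-in quadratic objective in d = 20 dimensions.
rng = np.random.default_rng(0)
x0 = rng.uniform(0.0, 10.0, size=20)
g = fd_gradient(lambda v: 0.5 * v @ v, x0, n=10, lam=0.01, rng=rng)
print(g.shape)
```

Note that each call to `fd_gradient` spends n + 1 function queries, whereas `nearest_queries` only reuses points already on the optimization trajectory; this is the query-cost contrast between the FD baselines and the derived GP described in the settings.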

C.2 SYNTHETIC EXPERIMENTS

i=1 , the Ackley and Levy functions applied in our synthetic experiments are given below,) where w_i = 1 + (x_i - 1)/4 for any i = 1, ..., d, the Ackley function achieves its minimum (i.e., min f(x) = 0) at x^* = 0, and the Levy function achieves its minimum (i.e., min f(x) = 0) at x^* = 1. Note that the Ackley and Levy functions for the synthetic experiments in our Sec. 5.2 are defined within the domains [-20, 20]^d and [-7.5, 7.5]^d, respectively. To give a better understanding of these two synthetic functions, we provide a 3D illustration of them with d = 2 in Fig. 5. As shown in Fig. 5, these two synthetic functions are highly nonconvex and therefore have many local minima within their domains.

To compare our ZORD algorithm with other ZO/FO optimization baselines in Sec. 5.2, we first employ TuRBO with 300 queries to find a good initialization for all other ZO/FO optimization algorithms in Fig. 3 because of the nonconvexity of these two synthetic functions as shown in Fig. 5. We then

