BOOSTING ONE-POINT DERIVATIVE-FREE ONLINE OPTIMIZATION VIA RESIDUAL FEEDBACK

Abstract

Zeroth-order optimization (ZO) typically relies on two-point feedback to estimate the unknown gradient of the objective function, which queries the objective function value twice at each time instant. However, if the objective function is time-varying, as in online optimization, two-point feedback cannot be used. In this case, the gradient can be estimated using one-point feedback that queries a single function value at each time instant, although at the expense of producing gradient estimates with large variance. In this work, we propose a new one-point feedback method for online optimization that estimates the objective function gradient using the residual between two feedback points at consecutive time instants. We study the regret bound of ZO with residual feedback for both convex and nonconvex online optimization problems. Specifically, for both Lipschitz and smooth functions, we show that using residual feedback produces gradient estimates with much smaller variance compared to conventional one-point feedback methods, which improves the learning rate. Our regret bound for ZO with residual feedback is tighter than the existing regret bound for ZO with conventional one-point feedback and relies on weaker assumptions, which suggests that ZO with our proposed residual feedback can better track the optimizer of online optimization problems. We provide numerical experiments that demonstrate that ZO with residual feedback significantly outperforms existing one-point feedback methods in practice.

1. INTRODUCTION

Zeroth-order optimization (ZO) algorithms have been widely used to solve online optimization problems where first- or second-order information (i.e., gradient or Hessian information) is unavailable at each time instant. Such problems arise, e.g., in online learning and involve adversarial training Chen et al. (2017) and reinforcement learning Fazel et al. (2018); Malik et al. (2018), among others. The goal in online optimization is to minimize a sequence of time-varying objective functions $\{f_t(x)\}_{t=1:T}$, where the value $f_t(x_t)$ is revealed to the agent after an action $x_t$ is selected and is used to adapt the agent's future strategy. Since the future objective functions are not known a priori, the performance of the online decision process can be measured using notions of regret, generally defined as the difference between the total cost incurred by the decisions selected by the agent online and the cost of the fixed or varying optimal decision that a clairvoyant agent could select. Perhaps the most popular zeroth-order gradient estimator is the two-point estimator that has been extensively studied in Agarwal et al. (2010); Ghadimi & Lan (2013); Duchi et al. (2015); Ghadimi et al. (2016); Bach & Perchet (2016); Nesterov & Spokoiny (2017); Gao et al. (2018); Roy et al. (2019). Specifically, the two-point estimator queries the function value $f_t(x)$ twice, for two different realizations of the decision variables, and uses the difference in these function values to estimate the desired gradient, as illustrated by the equation (Two-point feedback): $\hat g_t^{(2)}(x) = \frac{u}{\delta}\big(f_t(x + \delta u) - f_t(x)\big)$, where $\delta > 0$ is a smoothing parameter and $u \sim \mathcal{N}(0, I)$. However, the two-point gradient estimator cannot be used for the solution of non-stationary online optimization problems that arise frequently, e.g., in online learning.
The reason is that in these non-stationary online optimization problems, the objective function being queried is time-varying, and hence only a single function value can be sampled at a given time instant. In this case, the following one-point feedback can be used (One-point feedback): $\hat g_t^{(1)}(x) = \frac{u}{\delta} f_t(x + \delta u)$, which queries the objective function $f_t(x)$ only once at each time instant. One-point feedback was first proposed and analyzed in Flaxman et al. (2005) for the solution of online convex optimization problems. Saha & Tewari (2011); Hazan & Levy (2014); Dekel et al. (2015) showed that the regret of convex online optimization methods using one-point gradient estimation can be improved assuming smoothness or strong convexity of the objective functions and using self-concordant regularization. More recently, Gasnikov et al. (2017) developed such regret bounds for stochastic convex problems. On the other hand, Hazan et al. (2016) characterized the convergence of one-point zeroth-order methods for static stochastic non-convex optimization problems. However, as shown in these studies, a limitation of one-point feedback is that the resulting gradient estimator has large variance and, therefore, induces large regret. In addition, the regret analysis for ZO with one-point feedback usually requires the strong assumption that the function value is uniformly upper bounded over time, so this method cannot be used for practical non-stationary optimization problems.
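To make the distinction between the two estimators concrete, they can be sketched in a few lines of NumPy. The toy objective `f`, the smoothing parameter, and the sample counts below are illustrative choices, not part of the original analysis.

```python
import numpy as np

def two_point_estimate(f, x, delta, rng):
    # Two-point feedback: two queries of the same objective at one time instant.
    u = rng.standard_normal(x.shape)
    return u * (f(x + delta * u) - f(x)) / delta

def one_point_estimate(f, x, delta, rng):
    # Conventional one-point feedback: a single query, but the raw function
    # value multiplies the perturbation, so the variance is large.
    u = rng.standard_normal(x.shape)
    return u * f(x + delta * u) / delta

rng = np.random.default_rng(0)
f = lambda z: 0.5 * float(np.dot(z, z))  # toy static objective; true gradient is z
x = np.array([1.0, 2.0])
g2 = np.mean([two_point_estimate(f, x, 1e-3, rng) for _ in range(20000)], axis=0)
# Averaged over many samples, the two-point estimate approaches the true gradient.
```

Averaging one-point estimates converges to the same smoothed gradient, but far more slowly; closing this variance gap with a single query per time instant is precisely what the residual feedback proposed below is designed to do.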

Contributions:

In this paper, we propose a novel one-point gradient estimator for zeroth-order online optimization and develop new regret bounds to study its performance. Specifically, our contributions are as follows. We propose a new one-point feedback scheme that requires a single function evaluation at each time instant. This scheme estimates the gradient using the residual between two consecutive feedback points, and we refer to it as residual feedback. We show that residual feedback induces a smaller gradient estimation variance than the conventional one-point feedback scheme in Flaxman et al. (2005); Gasnikov et al. (2017). Furthermore, we provide regret bounds for online convex optimization with our proposed residual-feedback estimator. Our analysis relies on a weaker assumption than the one needed for the conventional one-point estimator, and our regret bounds are tighter, especially when the value of the objective function is large. In addition, we provide regret bounds for online non-convex optimization with residual feedback. Finally, we present numerical experiments demonstrating that the proposed residual-feedback estimator significantly outperforms the conventional one-point method in its ability to track the time-varying optimizers of online learning problems. To the best of our knowledge, this is the first time a one-point zeroth-order method is theoretically studied for online non-convex optimization problems. It is also the first time that a one-point gradient estimator demonstrates empirical performance comparable to that of the two-point method. We note that two-point estimators can only be used to solve online non-stationary learning problems in simulations, where the system can be hard-coded to remain fixed during the two queries of the objective function at two different decision variables.

Related work: Zeroth-order methods have been used to solve many different types of optimization problems.
For example, Balasubramanian & Ghadimi (2018) apply ZO to solve a set-constrained optimization problem where the projection onto the constraint set is non-trivial. Bubeck et al. (2017) study online bandit algorithms using ellipsoid methods. However, these methods incur heavy computation per step and achieve regret bounds with unfavorable dependence on the problem dimension. In comparison, our one-point method is computationally light and achieves regret bounds with better dependence on the problem dimension.

2. PRELIMINARIES AND RESIDUAL FEEDBACK

We first introduce the classes of Lipschitz and smooth functions.

Definition 2.1 (Lipschitz and smooth functions). The class of Lipschitz-continuous functions $C^{0,0}$ satisfies: for any $f \in C^{0,0}$, $|f(x) - f(y)| \le L_0 \|x - y\|$ for all $x, y \in \mathbb{R}^d$, where $L_0 > 0$ is the Lipschitz parameter. The class of smooth functions $C^{1,1}$ satisfies: for any $f \in C^{1,1}$, $\|\nabla f(x) - \nabla f(y)\| \le L_1 \|x - y\|$ for all $x, y \in \mathbb{R}^d$, where $L_1 > 0$ is the smoothness parameter.

In ZO, the objective is to estimate the first-order gradient of a function using zeroth-order oracles. To do so, we need to perturb the function around the current point uniformly along all directions. This motivates us to consider the Gaussian-smoothed version of the function $f$ introduced in Nesterov & Spokoiny (2017), $f_\delta(x) := \mathbb{E}_{u \sim \mathcal{N}(0, I)}[f(x + \delta u)]$, where the coordinates of the vector $u$ are i.i.d. standard Gaussian random variables. The following bounds on the approximation error of the function $f_\delta(x)$ were developed in Nesterov & Spokoiny (2017).

Lemma 2.2. Consider a function $f$ and its smoothed version $f_\delta$. It holds that $|f_\delta(x) - f(x)| \le \delta L_0 \sqrt{d}$ if $f \in C^{0,0}$, $|f_\delta(x) - f(x)| \le \frac{\delta^2}{2} L_1 d$ if $f \in C^{1,1}$, and $\|\nabla f_\delta(x) - \nabla f(x)\| \le \frac{\delta}{2} L_1 (d + 3)^{3/2}$ if $f \in C^{1,1}$.

The smoothed function $f_\delta(x)$ satisfies the following amenable property (Nesterov & Spokoiny, 2017).

Lemma 2.3. If $f \in C^{0,0}$ is $L_0$-Lipschitz, then $f_\delta \in C^{1,1}$ with smoothness constant $L_1 = \frac{\sqrt{d}}{\delta} L_0$.

Consider the following online bandit optimization problem:

$\min_{x \in X} \sum_{t=0}^{T-1} f_t(x)$,   (P)

where $X \subset \mathbb{R}^d$ is a convex set and $\{f_t\}_t$ is a random sequence of objective functions. In this setting, the objective functions $\{f_t\}_t$ are unknown a priori and their derivatives are unavailable. At time $t$, a new objective function $f_t$ is randomly generated independently of the agent's decisions, and the agent then queries the objective function value at certain perturbed points and uses these values to update the current policy parameters.
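As a quick numerical sanity check of the first bound in Lemma 2.2, the smoothed function can be approximated by Monte Carlo for a simple 1-Lipschitz function. The choice $f(x) = \|x\|$, the dimension, and the sample size below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(5)
d, delta = 10, 0.1
f = lambda z: float(np.linalg.norm(z))  # 1-Lipschitz, so L_0 = 1
x = rng.standard_normal(d)
u = rng.standard_normal((100_000, d))
# Monte Carlo estimate of the Gaussian-smoothed value f_delta(x).
f_delta = np.mean(np.linalg.norm(x + delta * u, axis=1))
gap = abs(f_delta - f(x))
# Lemma 2.2 predicts gap <= delta * L_0 * sqrt(d), about 0.316 here;
# the observed gap is far smaller, since the bound is not tight.
```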
The goal of the agent is to minimize a certain regret function. Such an online setting often occurs in non-stationary learning scenarios where either the system is time-varying on its own or a single query of the function $f_t$ changes the system state (i.e., $f_t$ changes to $f_{t+1}$). In this non-stationary setting, the conventional two-point feedback scheme is known to be impractical, as it requires evaluating $f_t$ at two different points at the same time $t$. Instead, it is natural to use the one-point feedback scheme in Gasnikov et al. (2017). However, the gradient estimate based on that one-point feedback has a large variance that leads to a large regret. In this paper, we focus on this one-point derivative-free setting and propose the following novel one-point residual feedback scheme for estimating the gradient with reduced variance:

(Residual feedback): $\hat g_t(x_t) := \frac{u_t}{\delta}\big(f_t(x_t + \delta u_t) - f_{t-1}(x_{t-1} + \delta u_{t-1})\big)$,   (3)

where $u_{t-1}, u_t \sim \mathcal{N}(0, I)$ are independent random vectors. To elaborate, the residual feedback in (3) queries $f_t$ at a single perturbed point $x_t + \delta u_t$ and subtracts from it the value $f_{t-1}(x_{t-1} + \delta u_{t-1})$ obtained at the previous iteration. We refer to this scheme as one-point residual feedback. Next, we explore some basic properties of the residual feedback. We first show that this estimator is an unbiased gradient estimate of the smoothed function $f_{\delta,t}$.

Lemma 2.4. The residual feedback satisfies $\mathbb{E}[\hat g_t(x_t)] = \nabla f_{\delta,t}(x_t)$ for all $x_t \in X$ and $t$.

Proof. The claim follows from the fact that $u_t$ has zero mean and is independent of $u_{t-1}$ and $x_{t-1}$.

We consider the following ZO algorithm with residual feedback:

(ZO with residual feedback): $x_{t+1} = \Pi_X\big(x_t - \eta\, \hat g_t(x_t)\big)$,   (4)

where $\eta$ is the learning rate and $\Pi_X$ is the projection operator onto the set $X$. The update (4) can be implemented assuming that the objective function can be queried at points outside the feasible set $X$, similar to the methods considered in Duchi et al.
(2015); Bach & Perchet (2016); Gasnikov et al. (2017). Note that it is possible to modify the update (4) so that the iterates are guaranteed to remain within the feasible set $X$. This modification and the related analysis can be found in Section H of the supplementary material. The requirement that the objective function is evaluated at feasible points in derivative-free optimization algorithms has also been considered in Bubeck et al. (2017); Bilenne et al. (2020). Specifically, Bubeck et al. (2017) develop the so-called ellipsoid method, which requires computing an ellipsoid containing the optimizer at each time step. On the other hand, almost concurrently with this work, Bilenne et al. (2020) proposed an oracle similar to (3) for a static convex optimization problem with specific objective and constraint functions.

Next, we bound the second moment of the gradient estimate based on the residual feedback.

Lemma 2.5 (Second moment). Assume that $f_t \in C^{0,0}$ with Lipschitz constant $L_0$ for all $t$. Then, under the ZO update rule in (4), the second moment of the residual feedback satisfies, for all $t$, $\mathbb{E}[\|\hat g_t(x_t)\|^2] \le \frac{4 d L_0^2 \eta^2}{\delta^2} \mathbb{E}[\|\hat g_{t-1}(x_{t-1})\|^2] + D_t$, where $D_t := 16 L_0^2 (d + 4)^2 + \frac{2d}{\delta^2} \mathbb{E}\big[\big(f_t(x_{t-1} + \delta u_{t-1}) - f_{t-1}(x_{t-1} + \delta u_{t-1})\big)^2\big]$.

The above lemma shows that the second moment of the residual feedback satisfies a perturbed contraction, provided that we choose $\eta$ and $\delta$ such that the contraction rate $\alpha = \frac{4 d L_0^2 \eta^2}{\delta^2} < 1$. As we show later in the analysis, this contraction property leads to a small variance of the residual feedback, which helps reduce the regret of the online ZO algorithm.
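A minimal unconstrained sketch of the update (4) driven by the residual feedback (3) is given below. Treating $\Pi_X$ as the identity (i.e., $X = \mathbb{R}^d$), seeding the missing $f_{-1}$ feedback with zero, and the static toy objective are simplifying assumptions made only for illustration.

```python
import numpy as np

def zo_residual_feedback(f_seq, x0, eta, delta, rng):
    # One-point ZO with residual feedback: a single query of f_t per time
    # instant, differenced against the previous instant's single query.
    x = np.asarray(x0, dtype=float)
    prev_value = 0.0  # stands in for f_{-1}(x_{-1} + delta * u_{-1})
    iterates = []
    for f_t in f_seq:
        u = rng.standard_normal(x.shape)
        value = f_t(x + delta * u)
        g = u * (value - prev_value) / delta  # residual-feedback estimate
        x = x - eta * g                       # projection omitted: X = R^d
        prev_value = value
        iterates.append(x.copy())
    return iterates

rng = np.random.default_rng(1)
target = np.ones(3)
f = lambda z: float(np.sum((z - target) ** 2))  # static convex test objective
xs = zo_residual_feedback([f] * 3000, np.zeros(3), eta=0.01, delta=0.1, rng=rng)
# The iterates drift into a small neighborhood of the minimizer `target`.
```

On this static problem the estimator reduces to a variance-reduced one-point scheme; the same loop applies verbatim when `f_seq` contains a different function at each time instant.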

3. ZO WITH RESIDUAL FEEDBACK FOR ONLINE CONVEX OPTIMIZATION

In this section, we consider the online bandit problem (P) where the functions $\{f_t\}_{t=0:T-1}$ are all convex. In particular, we are interested in analyzing the following static regret of the algorithm: $R_T := \mathbb{E}\big[\sum_{t=0}^{T-1} f_t(x_t) - \min_{x \in X} \sum_{t=0}^{T-1} f_t(x)\big]$. We make the following assumption on the non-stationarity of the online learning problem.

Assumption 3.1 (Bounded variation). There exists $V_f > 0$ such that for all $t$, $\mathbb{E}\big[|f_t(x_{t-1} + \delta u_{t-1}) - f_{t-1}(x_{t-1} + \delta u_{t-1})|^2\big] \le V_f^2$, where the expectation is taken over $x_{t-1}$, the random vector $u_{t-1}$ and the random functions $f_{t-1}, f_t$.

Intuitively, we assume that the squared variation of the objective function between two consecutive time instants is uniformly bounded over time. We note that this assumption is much weaker than the uniformly bounded function value assumption, i.e., $\mathbb{E}[f_t(x)^2] \le B^2$ for all $t$ and $x \in X$, which is used in the analysis of ZO with the conventional one-point feedback in Gasnikov et al. (2017). In particular, under Assumption 3.1, the perturbation term in Lemma 2.5 can be bounded as $D_t \le 16 L_0^2 (d + 4)^2 + \frac{2 d V_f^2}{\delta^2}$. Then, by telescoping the contraction inequality, we obtain the following bound on the second moment of the residual-feedback gradient estimate:

$\mathbb{E}[\|\hat g_t(x_t)\|^2] \le \max\Big\{\mathbb{E}[\|\hat g_0(x_0)\|^2],\ \frac{1}{1 - \alpha}\Big(16 L_0^2 (d + 4)^2 + \frac{2d}{\delta^2} V_f^2\Big)\Big\}.$   (8)

In practice, $\delta$ is usually chosen to be sufficiently small, and the above bound is dominated by $O(d \delta^{-2} V_f^2)$, which is much smaller than the corresponding second moment bound $O(d \delta^{-2} B^2)$ of the conventional one-point feedback ($B^2$ is the uniform bound on the second moment of $f_t$ over time). For example, consider the time-varying objective functions $f_0(x) = \frac{1}{2} x^2$ and $f_t(x) = f_{t-1}(x) + n_t$, where $n_t$ is zero-mean Gaussian noise at time $t$. Then, it can be verified that Assumption 3.1 holds with a finite $V_f$, whereas the second moment of $f_t(x)$ grows unbounded over time.
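The random-walk example above can be reproduced numerically. Holding $x$ fixed across time is a simplification used here only to compare the raw variances of the two feedback signals; the specific constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
T, x, delta = 5000, 0.5, 0.1
# f_t(x) = x^2/2 + c_t with c_t a Gaussian random walk: unbounded function
# values, but bounded consecutive differences (Assumption 3.1 holds).
offsets = np.cumsum(rng.standard_normal(T))
u = rng.standard_normal(T)
values = 0.5 * (x + delta * u) ** 2 + offsets
one_point = u * values / delta                         # conventional one-point
residual = u[1:] * (values[1:] - values[:-1]) / delta  # residual feedback
# The residual signal has far smaller variance than the one-point signal,
# because differencing cancels the drifting offset c_t.
```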
This suggests that the variance of the residual feedback can be significantly smaller than that of the conventional one-point feedback. We first consider the case where the objective function $f_t$ is convex and Lipschitz. Based on the above characterization of the second moment of the residual feedback, we obtain the following regret bound for ZO with residual feedback.

Theorem 3.2 (Regret for convex Lipschitz $f_t$). Let Assumption 3.1 hold. Assume that $f_t \in C^{0,0}$ is convex with Lipschitz constant $L_0$ for all $t$ and that $\|x_0 - x^*\| \le R$. Run ZO with residual feedback for $T > R^2$ iterations with $\eta = R^{3/2} (2\sqrt{2} L_0 \sqrt{d}\, T^{3/4})^{-1}$ and $\delta = \sqrt{R}\, T^{-1/4}$. Then, we have that

$R_T \le \sqrt{2} L_0 \sqrt{d}\, R T^{3/4} + \frac{\mathbb{E}[\|\hat g_0(x_0)\|^2]\, R^{3/2}}{2\sqrt{2d}\, L_0 T^{3/4}} + 8\sqrt{2} (d+4)^2 d^{-1/2} L_0 R^{3/2} T^{1/4} + 2 L_0 \sqrt{d}\, R T^{3/4} + \sqrt{2d}\, R V_f^2 L_0^{-1} T^{3/4}.$

Asymptotically, we have $R_T = O\big((L_0 + L_0^{-1} V_f^2) \sqrt{d}\, R T^{3/4}\big)$. To the best of our knowledge, the best known regret for ZO with the conventional one-point feedback is of the order $O(\sqrt{d}\, L_0 R B T^{3/4})$ (Gasnikov et al., 2017). Therefore, our regret bound is tighter if the function variation satisfies $V_f^2 \le O(B^{1/2} L_0^{3/2})$. Essentially, using the proposed residual-feedback gradient estimator, the regret of ZO no longer depends on the uniform bound on the function value, which can be huge in practice. Instead, our regret depends only on how fast the function varies over time.

Remark 3.3. We note that the complexity bound in Theorem 3.2 generally depends on the values of the Lipschitz parameters $L_0, L_1$ and the constant $V_f^2$. Specifically, choosing $\eta = R^{3/2} (2\sqrt{2} L_0 \sqrt{d}\, T^{3/4})^{-1}$ and $\delta = \sqrt{R}\, L_0^{-q} T^{-1/4}$ with $q > 0$ as a tuning parameter, we obtain $R_T = O\big((L_0 + L_0^{1+q} + L_0^{2q-1} V_f^2) \sqrt{d}\, R T^{3/4}\big)$ when $T \ge L_0^{2q} R^2$. If $L_0 < 1$, we can choose $q = 1$ to achieve the bound $R_T = O\big((L_0 + L_0 V_f^2) \sqrt{d}\, R T^{3/4}\big)$. On the other hand, if $L_0 \ge 1$, we can choose $q = 0$ to achieve the bound $R_T = O\big((L_0 + L_0^{-1} V_f^2) \sqrt{d}\, R T^{3/4}\big)$.
We note that the dependence of the bounds in Theorems 3.4, 4.2 and 4.3 on $L_0, L_1$ can also be optimized in a similar way by properly choosing $\delta$. Next, we present the regret of ZO with residual feedback for convex smooth objective functions.

Theorem 3.4 (Regret for convex smooth $f_t$). Let Assumption 3.1 hold. Assume that $f_t \in C^{0,0} \cap C^{1,1}$ is convex with Lipschitz constant $L_0$ and smoothness constant $L_1$ for all $t$, and assume that $\|x_0 - x^*\| \le R$. Run ZO with residual feedback for $T > R^2$ iterations with $\eta = R^{4/3} (2\sqrt{2} L_0 d^{2/3} T^{2/3})^{-1}$ and $\delta = R^{1/3} d^{-1/6} T^{-1/6}$. Then, we have that

$R_T \le \sqrt{2} L_0 d^{2/3} R^{2/3} T^{2/3} + \frac{\mathbb{E}[\|\hat g_0(x_0)\|^2]\, R^{4/3}}{2\sqrt{2} L_0 d^{2/3} T^{2/3}} + 8\sqrt{2} L_0 (d+4)^2 d^{-2/3} R^{4/3} T^{1/3} + 2 L_1 d^{2/3} R^{2/3} T^{2/3} + \sqrt{2} L_0^{-1} d^{2/3} R^{2/3} V_f^2 T^{2/3}.$

Asymptotically, the above regret bound is of the order $O\big((L_0 + L_1 + L_0^{-1} V_f^2)(d R T)^{2/3}\big)$. To the best of our knowledge, the best known regret for ZO with the conventional one-point feedback in the convex smooth case is of the order $O(L_1^{1/3} (d R B T)^{2/3})$ (Gasnikov et al., 2017). Therefore, our regret bound is tighter if the function variation satisfies $V_f^2 \le O(B^{2/3} L_1^{1/3} L_0)$. Our numerical experiments show that ZO with residual feedback consistently outperforms ZO with the conventional one-point feedback in practice.

4. ZO WITH RESIDUAL FEEDBACK FOR ONLINE NONCONVEX OPTIMIZATION

In this section, we analyze the regret of ZO with residual feedback for the unconstrained online bandit problem (P) with nonconvex functions. Throughout this section, we make the following assumption on the objective functions.

Assumption 4.1. There exist $W_T, \widetilde W_T > 0$ such that the following conditions hold for all $t$: 1. $\sum_{t=1}^{T} \mathbb{E}[f_{\delta,t}(x_t) - f_{\delta,t-1}(x_t)] \le W_T$, where the expectation is taken with respect to $x_t$ and the random smoothed objective functions $f_{\delta,t-1}, f_{\delta,t}$. 2. $\sum_{t=1}^{T} \mathbb{E}[|f_t(x_{t-1} + \delta u_{t-1}) - f_{t-1}(x_{t-1} + \delta u_{t-1})|^2] \le \widetilde W_T$, where the expectation is taken with respect to $x_{t-1}$, the random vector $u_{t-1}$ and the random objective functions $f_{t-1}, f_t$.

These two conditions measure the accumulated first-order and second-order function variations, as also adopted in Roy et al. (2019). Next, we consider the case where $\{f_t\}_t$ are nonconvex and Lipschitz continuous. Since the objective function $f_t$ is not necessarily differentiable, i.e., $\nabla f_t$ may not be well defined, we define the regret as the accumulated gradient norm of the smoothed function, i.e., $R_{g,\delta}^T := \sum_{t=0}^{T-1} \mathbb{E}[\|\nabla f_{\delta,t}(x_t)\|^2]$. In addition, it is often required that the smoothed function $f_{\delta,t}$ is close to the original function $f_t$, so that $|f_{\delta,t}(x) - f_t(x)| \le \epsilon_f$ for all $t$. To satisfy this condition, we need to choose $\delta \le (\sqrt{d} L_0)^{-1} \epsilon_f$ according to Lemma 2.2. We obtain the following regret bound for ZO with residual feedback.

Theorem 4.2 (Nonconvex Lipschitz $f_t$). Let Assumption 4.1 hold. Assume that $f_t \in C^{0,0}$ with Lipschitz constant $L_0$ and that $f_t$ is bounded below by $f_t^*$ for all $t$. Run ZO with residual feedback for $T > (d \epsilon_f)^{-1}$ iterations with $\eta = \epsilon_f^{3/2} (2\sqrt{2} L_0^2 d^{3/2} T^{1/2})^{-1}$ and $\delta = \epsilon_f (d^{1/2} L_0)^{-1}$. Then, we have that

$R_{g,\delta}^T \le 2\sqrt{2} L_0^2 \big(\mathbb{E}[f_{\delta,0}(x_0)] - f_{\delta,T}^* + W_T\big) d^{3/2} \epsilon_f^{-3/2} T^{1/2} + \frac{\epsilon_f^{1/2}\, \mathbb{E}[\|\hat g_0(x_0)\|^2]}{2\sqrt{2 d T}} + 4\sqrt{2} L_0 \epsilon_f^{1/2} (d+4)^2 d^{-1/2} T^{1/2} + \frac{L_0^2}{\sqrt{2}} d^{3/2} \widetilde W_T \epsilon_f^{-3/2} T^{-1/2}.$
Asymptotically, we have $R_{g,\delta}^T = O\big(d^{3/2} L_0^2 \epsilon_f^{-3/2} (W_T + \widetilde W_T T^{-1}) T^{1/2} + d^{3/2} L_0 \epsilon_f^{1/2} T^{1/2}\big)$. Based on Theorem 4.2, we observe that the regret bound satisfies $R_{g,\delta}^T / T \to 0$ whenever $W_T = o(T^{1/2} \epsilon_f^{3/2})$ and $\widetilde W_T = o(T^{3/2} \epsilon_f^{3/2})$. In particular, if the bounded variation Assumption 3.1 holds, then we have $\widetilde W_T \le O(T V_f^2)$, and it suffices to let $T^{-1/2} \epsilon_f^{-3/2} = o(1)$. Next, we consider the nonconvex and smooth setting and study the regret $R_g^T := \sum_{t=0}^{T-1} \mathbb{E}[\|\nabla f_t(x_t)\|^2]$. We obtain the following regret for ZO with residual feedback.

Theorem 4.3 (Nonconvex smooth $f_t$). Let Assumption 4.1 hold. Assume that $f_t \in C^{0,0} \cap C^{1,1}$ with Lipschitz constant $L_0$ and smoothness constant $L_1$ and that $f_t$ is bounded below by $f_t^*$ for all $t$. Run ZO with residual feedback for $T$ iterations with $\eta = (2\sqrt{2} L_0 d^{4/3} T^{1/2})^{-1}$ and $\delta = (d^{5/6} T^{1/4})^{-1}$. Then,

$R_g^T \le 4\sqrt{2} L_0 \big(\mathbb{E}[f_{\delta,0}(x_0)] - f_{\delta,T}^* + W_T\big) d^{4/3} T^{1/2} + \frac{L_1\, \mathbb{E}[\|\hat g_0(x_0)\|^2]}{\sqrt{2} L_0 d^{4/3} T^{1/2}} + 8\sqrt{2} L_1 L_0^{-1} (d+4)^2 d^{-4/3} T^{1/2} + \sqrt{2} L_1 L_0^{-1} d^{4/3} \widetilde W_T + 2 L_1^2 (d+3)^3 d^{-5/3} T^{1/2}.$

Asymptotically, the above regret bound is of the order $O\big(d^{4/3} L_0 W_T T^{1/2} + d^{4/3} L_1 L_0^{-1} \widetilde W_T\big)$. Based on Theorem 4.3, we observe that the regret bound satisfies $R_g^T / T \to 0$ whenever $W_T = o(T^{1/2})$ and $\widetilde W_T = o(T)$. We note that these requirements on $W_T, \widetilde W_T$ are more relaxed than those in the nonsmooth case, as they do not depend on the small parameter $\epsilon_f$.

5. ZO WITH RESIDUAL FEEDBACK FOR STOCHASTIC ONLINE OPTIMIZATION

In this section, we generalize the residual feedback to stochastic online bandit problems. Since the regret analysis follows the same proof logic as that of ZO with residual feedback, we only introduce the key technical lemmas and comment on the differences in the proofs. The stochastic online bandit problem is formulated as

$\min_{x \in X} \sum_{t=0}^{T-1} \mathbb{E}[F_t(x; \xi_t)]$, where $\mathbb{E}[F_t(x; \xi_t)] = f_t(x)$ for all $t$,   (R)

where $\xi_t$ denotes noise that is independent of $x$. Unlike the previous deterministic online setting, the agent in the stochastic setting can only query noisy evaluations of the function. This covers scenarios where the agent does not have access to the underlying data distribution. To solve the above stochastic online problem, we propose the following stochastic residual feedback:

$\hat g_t(x_t) := \frac{u_t}{\delta}\big(F_t(x_t + \delta u_t; \xi_t) - F_{t-1}(x_{t-1} + \delta u_{t-1}; \xi_{t-1})\big)$,   (13)

where $\xi_{t-1}$ and $\xi_t$ are independent random samples drawn at iterations $t-1$ and $t$, respectively. Since the noisy function value $F_t(x; \xi_t)$ is an unbiased estimate of the objective function $f_t(x)$, it is straightforward to show that (13) is an unbiased gradient estimate of the smoothed function $f_{\delta,t}(x)$. To analyze the regret of ZO with stochastic residual feedback, we first consider the convex setting and the following assumption that bounds the variation of the stochastic functions.

Assumption 5.1 (Bounded stochastic variation). There exists $V_{f,\xi} > 0$ such that for all $t$, $\mathbb{E}\big[\big(F_t(x_{t-1} + \delta u_{t-1}; \xi_t) - F_{t-1}(x_{t-1} + \delta u_{t-1}; \xi_{t-1})\big)^2\big] \le V_{f,\xi}^2$, where the expectation is taken with respect to $x_{t-1}$, the random vector $u_{t-1}$ and the random objective functions $F_{t-1}(\cdot; \xi_{t-1}), F_t(\cdot; \xi_t)$.

The above assumption generalizes Assumption 3.1 to the stochastic setting. The bound $V_{f,\xi}^2$ controls both the variation of the function over time and the variation due to stochastic sampling. The following lemma characterizes the second moment of the stochastic residual feedback.
Lemma 5.2. Assume that $F_t(\cdot; \xi) \in C^{0,0}$ with Lipschitz constant $L_0$ for all $\xi$. Then, under the ZO update rule, we have $\mathbb{E}[\|\hat g_t(x_t)\|^2] \le \frac{4 d L_0^2 \eta^2}{\delta^2} \mathbb{E}[\|\hat g_{t-1}(x_{t-1})\|^2] + D_{t,\xi}$, where $D_{t,\xi} := 16 L_0^2 (d + 4)^2 + \frac{2d}{\delta^2} \mathbb{E}\big[\big(F_t(x_{t-1} + \delta u_{t-1}; \xi_t) - F_{t-1}(x_{t-1} + \delta u_{t-1}; \xi_{t-1})\big)^2\big]$.

Observe that the above second moment bound is very similar to that in Lemma 2.5; the only difference is the perturbation term. In particular, the perturbation term $D_{t,\xi}$ can be further bounded using Assumption 5.1, and the resulting second moment bound is almost the same as that in eq. (8) for the deterministic case (simply replace $V_f$ in eq. (8) by $V_{f,\xi}$). Therefore, the regret analysis of ZO with stochastic residual feedback is the same as that of ZO with residual feedback in the deterministic online setting. Consequently, ZO with stochastic residual feedback achieves almost the same regret bounds as those in Theorems 3.2 and 3.4, with $V_f$ replaced by $V_{f,\xi}$. For the nonconvex setting, we adopt the following assumption that generalizes Assumption 4.1.

Assumption 5.3. There exist $W_T, \widetilde W_{T,\xi} > 0$ such that the following two conditions hold for all $t$: 1. $\sum_{t=1}^{T} \mathbb{E}[f_{\delta,t}(x_t) - f_{\delta,t-1}(x_t)] \le W_T$, where the expectation is taken with respect to $x_t$ and the random smoothed objective functions $f_{\delta,t-1}, f_{\delta,t}$. 2. $\sum_{t=1}^{T} \mathbb{E}[|F_t(x_{t-1} + \delta u_{t-1}; \xi_t) - F_{t-1}(x_{t-1} + \delta u_{t-1}; \xi_{t-1})|^2] \le \widetilde W_{T,\xi}$, where the expectation is taken with respect to $x_{t-1}$, the random vector $u_{t-1}$ and the random objective functions $F_{t-1}(\cdot; \xi_{t-1}), F_t(\cdot; \xi_t)$.

Then, following the same proof logic as that of Theorems 4.2 and 4.3, one can obtain similar regret bounds for ZO with stochastic residual feedback (simply replace $\widetilde W_T$ in Theorems 4.2 and 4.3 by $\widetilde W_{T,\xi}$).
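The unbiasedness of the stochastic residual feedback can be checked numerically as follows. The quadratic objective, the additive-noise model for $\xi$, and the sample count are illustrative assumptions, and the decision variable is held fixed so that the empirical mean can be compared against a known gradient.

```python
import numpy as np

def stochastic_residual_estimate(F, x_prev, x_curr, delta, rng):
    # Stochastic residual feedback: one noisy query per time instant,
    # differenced against the previous instant's noisy query.
    u_prev = rng.standard_normal(x_prev.shape)
    u_curr = rng.standard_normal(x_curr.shape)
    v_prev = F(x_prev + delta * u_prev) + rng.standard_normal()  # noise xi_{t-1}
    v_curr = F(x_curr + delta * u_curr) + rng.standard_normal()  # noise xi_t
    return u_curr * (v_curr - v_prev) / delta

rng = np.random.default_rng(3)
F = lambda z: 0.5 * float(np.dot(z, z))  # f(x) = ||x||^2 / 2, gradient is x
x = np.array([1.0, -1.0])
est = np.mean(
    [stochastic_residual_estimate(F, x, x, 0.05, rng) for _ in range(200_000)],
    axis=0,
)
# `est` concentrates around the true gradient [1, -1] as the sample count grows.
```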

6. NUMERICAL EXPERIMENTS

In this section, we compare the performance of ZO with one-point, two-point and residual feedback on two non-stationary reinforcement learning problems, namely LQR control and resource allocation, in which either the reward or the transition functions vary over episodes.

6.1. NONSTATIONARY LQR CONTROL

We consider an LQR problem with noisy system dynamics. The static version of this problem is considered in Fazel et al. (2018); Malik et al. (2018). Specifically, consider a system whose state $x_k \in \mathbb{R}^{n_x}$ at step $k$ is subject to the transition function $x_{k+1} = A_t x_k + B_t u_k + w_k$, where $u_k \in \mathbb{R}^{n_u}$ is the action at step $k$, and $A_t \in \mathbb{R}^{n_x \times n_x}$ and $B_t \in \mathbb{R}^{n_x \times n_u}$ are the dynamical matrices in episode $t$. These matrices are unknown and change over episodes. The vector $w_k$ is the noise on the state transition. Specifically, the entries of the dynamical matrices $A_0$ and $B_0$ at episode 0 are randomly generated from a Gaussian distribution $\mathcal{N}(0, 0.1^2)$. Then, we generate the time-varying dynamical matrices as $A_{t+1} = A_t + 0.01 M_t$ and $B_{t+1} = B_t + 0.01 N_t$, where $M_t$ and $N_t$ are random matrices whose entries are uniformly sampled from $[0, 1]$. Moreover, consider a state feedback policy $u_k = -K_t x_k$, where $K_t \in \mathbb{R}^{n_u \times n_x}$ is the policy parameter that is fixed within episode $t$. Within each episode, there exists an optimal policy $K_t^*$ that minimizes the discounted accumulated cost $V_t(K) := \mathbb{E}\big[\sum_{k=0}^{H-1} \gamma^k (x_k^\top Q x_k + u_k^\top R u_k)\big]$ at episode $t$, where $\gamma \le 1$ is the discount factor and $H$ is the horizon. The goal is to track the time-varying optimal policy parameter $K_t^*$ so that $V_t(K_t) - V_t(K_t^*)$ is small in every episode. We apply the conventional one-point method in Gasnikov et al. (2017) and the proposed residual-feedback method (13) to solve the above non-stationary LQR problem. The performance of the two-point method in Bach & Perchet (2016) is also presented as a benchmark, although it is impractical in non-stationary scenarios. This is because the two-point method in Bach & Perchet (2016) requires evaluating the value function $V_t$ for two different policies at two consecutive episodes.
However, evaluating the value function $V_t$ for a given policy during episode $t$ requires collecting samples by executing this policy. During the subsequent episode $t+1$, since the problem is non-stationary, the dynamical matrices change to $A_{t+1}, B_{t+1}$ and the value function changes to $V_{t+1}$. Therefore, it is not possible to evaluate the same value function $V_t$ at two different episodes and, as a result, the two-point method in Bach & Perchet (2016) is not applicable here. Each algorithm is run for 10 trials, and the stepsizes are tuned for each method separately. The accumulated regrets $\sum_{t=0}^{T-1} |V_t(K_t) - V_t(K_t^*)|$ of these algorithms are presented in Figure 1(a). We observe that the residual-feedback method achieves a much lower regret than the conventional one-point method and performs comparably to the impractical two-point method. Moreover, we present in Figure 1(b) the estimated variance of the gradient estimates of the three methods at the policy iterates over episodes. It can be seen that the variance of our proposed residual feedback is close to that of the impractical two-point feedback and is much smaller than that of the conventional one-point feedback. This observation corroborates our theoretical characterization of the second moment of the residual feedback.
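This experiment can be approximated with a compact simulation. The dimensions, horizon, drift magnitudes, and step sizes below are stand-ins chosen for a quick run rather than the exact values behind the paper's figures, and a single rollout per episode replaces the expectation in $V_t(K)$.

```python
import numpy as np

rng = np.random.default_rng(4)
n_x, n_u, H, gamma = 2, 1, 20, 0.95

def rollout_cost(A, B, K, rng):
    # Single-trajectory estimate of the discounted cost V_t(K) with Q = I, R = I.
    x, cost = np.ones(n_x), 0.0
    for k in range(H):
        u = -K @ x
        cost += gamma ** k * (float(x @ x) + float(u @ u))
        x = A @ x + B @ u + 0.01 * rng.standard_normal(n_x)
    return cost

A = 0.1 * rng.standard_normal((n_x, n_x))      # initial dynamics
B = 0.1 * rng.standard_normal((n_x, n_u))
K, prev_value = np.zeros((n_u, n_x)), 0.0
eta, delta = 1e-4, 0.05
costs = []
for t in range(300):                            # episodes
    A = A + 0.001 * rng.uniform(size=A.shape)   # dynamics drift between episodes
    B = B + 0.001 * rng.uniform(size=B.shape)
    U = rng.standard_normal(K.shape)            # policy-space perturbation
    value = rollout_cost(A, B, K + delta * U, rng)
    K = K - eta * U * (value - prev_value) / delta  # residual-feedback update
    prev_value = value
    costs.append(value)
```

Each episode issues exactly one value-function query, which is what makes the scheme feasible when the dynamics change between episodes.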

6.2. NONSTATIONARY RESOURCE ALLOCATION

We consider a multi-stage resource allocation problem with time-varying sensitivity to the lack of resource supply. Specifically, 16 agents are located on a $4 \times 4$ grid. During episode $t$, at step $k$, agent $i$ stores an amount $m_i(k)$ of resources and has a demand for resources in the amount $d_i(k)$. Also, agent $i$ decides to send a fraction of resources $a_{ij}(k) \in [0, 1]$ to each of its neighbors $j \in N_i$ on the grid. The local amount of resources and the demand of agent $i$ evolve as $m_i(k+1) = m_i(k) - \sum_{j \in N_i} a_{ij}(k) m_i(k) + \sum_{j \in N_i} a_{ji}(k) m_j(k) - d_i(k)$ and $d_i(k) = a_i \sin(\omega_i k + \phi_i) + w_{i,k}$, where $w_{i,k}$ is the noise in the demand. At each step $k$, agent $i$ receives a local cost $r_{i,t}(k)$, such that $r_{i,t}(k) = 0$ when $m_i(k) \ge 0$ and $r_{i,t}(k) = \zeta_t m_i(k)^2$ when $m_i(k) < 0$, where $\zeta_t$ represents the varying sensitivity of the agents to the lack of supply during episode $t$. Let agent $i$ make its decisions according to a parameterized policy function $\pi_{i,t}(o_i; \theta_{i,t}) : O_i \to [0, 1]^{|N_i|}$, where $\theta_{i,t}$ is the parameter of the policy function $\pi_{i,t}$ at episode $t$ and $o_i \in O_i$ denotes agent $i$'s local observation. Specifically, we let $o_i(k) = [m_i(k), d_i(k)]^\top$. Our goal is to track the time-varying optimal policy so that the accumulated cost over the grid $J_t(\theta_t) = \sum_{i=1}^{16} \sum_{k=0}^{H} \gamma^k r_{i,t}(k)$ during each episode is maintained at a low level, where $\theta_t = [\ldots, \theta_{i,t}, \ldots]$ is the policy parameter, $H$ is the problem horizon at each episode, and $\gamma$ is the discount factor.
In Figure 2(a), we present the cost $J_t(\theta_t)$ achieved during each episode, over 10 trials, for ZO with residual feedback, the conventional one-point feedback, and the impractical two-point feedback. It can be seen that our proposed residual feedback achieves a cost $J_t(\theta_t)$ that is as low as the cost achieved by the impractical two-point feedback in this non-stationary environment. In particular, both residual and two-point feedback perform much better than the conventional one-point feedback. Moreover, Figure 2(b) compares the estimated variances of these feedback schemes, and one can observe that the variance of the residual feedback is comparable to that of the two-point feedback and is much smaller than that of the conventional one-point feedback.

7. CONCLUSION

In this paper, we proposed a residual one-point feedback oracle for zeroth-order online learning problems, which estimates the gradient of the time-varying objective function using a single query of the function value at each time instant. We showed that the regret bound of the proposed residual-feedback estimator can be much lower than that of the conventional one-point method in the online convex optimization setting. In addition, we studied the gradient-norm regret bound of the residual-feedback estimator when it is applied to online non-convex optimization problems. Numerical experiments on two non-stationary reinforcement learning problems were conducted, and the proposed residual-feedback estimator was shown to significantly outperform the conventional one-point method in non-stationary online learning problems.



Figure 1: The regrets of applying the proposed residual one-point feedback (3) (blue), the two-point oracle in Bach & Perchet (2016) (orange) and the conventional one-point oracle in Gasnikov et al. (2017) (green) to online policy optimization for the nonstationary LQR problem. In (a), the regrets $\sum_{t=0}^{T-1} |V_t(K_t) - V_t(K_t^*)|$ of the three methods are presented. In (b), the variance of the gradient estimates given by the three methods is presented. The two-point method (orange) is infeasible to use in practice and is presented here to serve as a simulation benchmark.

Figure 2: The costs during each episode obtained by applying the proposed residual one-point feedback (3) (blue), the two-point oracle in Bach & Perchet (2016) (orange) and the conventional one-point oracle in Gasnikov et al. (2017) (green) to the non-stationary resource allocation problem. In (a), the varying cost $J_t(\theta_t)$ of the three methods is presented. In (b), the variance of the gradient estimates at agent 1 given by the three methods is presented. The two-point method (orange) is infeasible to use in practice and is presented here to serve as a simulation benchmark.

