BOOSTING ONE-POINT DERIVATIVE-FREE ONLINE OPTIMIZATION VIA RESIDUAL FEEDBACK

Abstract

Zeroth-order optimization (ZO) typically relies on two-point feedback to estimate the unknown gradient of the objective function, which queries the objective function value twice at each time instant. However, if the objective function is time-varying, as in online optimization, two-point feedback cannot be used. In this case, the gradient can be estimated using one-point feedback, which queries a single function value at each time instant, although at the expense of producing gradient estimates with large variance. In this work, we propose a new one-point feedback method for online optimization that estimates the objective function gradient using the residual between two feedback points at consecutive time instants. We study the regret of ZO with residual feedback for both convex and nonconvex online optimization problems. Specifically, for both Lipschitz and smooth objective functions, we show that residual feedback produces gradient estimates with much smaller variance than conventional one-point feedback methods, which improves the learning rate. Our regret bound for ZO with residual feedback is tighter than the existing regret bound for ZO with conventional one-point feedback and relies on weaker assumptions, which suggests that ZO with residual feedback can better track the optimizer of online optimization problems. Finally, we provide numerical experiments demonstrating that ZO with residual feedback significantly outperforms existing one-point feedback methods in practice.

1. INTRODUCTION

Zeroth-order optimization (ZO) algorithms have been widely used to solve online optimization problems where first- or second-order information (i.e., gradient or Hessian information) is unavailable at each time instant. Such problems arise, e.g., in online learning and include adversarial training Chen et al. (2017) and reinforcement learning Fazel et al. (2018); Malik et al. (2018), among others. The goal in online optimization is to minimize a sequence of time-varying objective functions $\{f_t(x)\}_{t=1:T}$, where the value $f_t(x_t)$ is revealed to the agent after an action $x_t$ is selected and is used to adapt the agent's future strategy. Since the future objective functions are not known a priori, the performance of the online decision process is measured using notions of regret, generally defined as the difference between the total cost incurred by the decisions selected by the agent online and the cost of the fixed or varying optimal decisions that a clairvoyant agent could select. Perhaps the most popular zeroth-order gradient estimator is the two-point estimator, which has been extensively studied in Agarwal et al. (2010); Ghadimi & Lan (2013); Duchi et al. (2015); Ghadimi et al. (2016); Bach & Perchet (2016); Nesterov & Spokoiny (2017); Gao et al. (2018); Roy et al. (2019). Specifically, the two-point estimator queries the function value $f_t(x)$ twice, at two different realizations of the decision variable, and uses the difference of these function values to estimate the desired gradient (Two-point feedback): $\hat{g}^{(2)}_t(x) = \frac{u}{\delta}\big(f_t(x + \delta u) - f_t(x)\big)$, where $\delta > 0$ is a smoothing parameter and $u \sim \mathcal{N}(0, I)$. However, the two-point gradient estimator cannot be used to solve the non-stationary online optimization problems that arise frequently, e.g., in online learning. The reason is that in these problems the objective function being queried is time-varying, and hence only a single function value can be sampled at any given time instant.
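In code, the two-point estimator above can be sketched as follows (a minimal NumPy illustration on a toy quadratic; the objective, dimension, and smoothing parameter are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def two_point_estimate(f, x, delta=1e-2):
    """Two-point ZO gradient estimate: queries f twice at the same time instant."""
    u = rng.standard_normal(x.shape)
    return (u / delta) * (f(x + delta * u) - f(x))

# Toy objective f(x) = ||x||^2, whose true gradient is 2x.
f = lambda x: float(x @ x)
x = np.ones(3)

# Averaging many estimates recovers the (smoothed) gradient of f at x.
g = np.mean([two_point_estimate(f, x) for _ in range(5000)], axis=0)
```

Because both queries evaluate the same function $f_t$, the difference cancels the $O(1/\delta)$ term and the estimator's variance stays bounded, which is exactly what becomes impossible when $f_t$ changes between queries.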
In this case, the following one-point feedback can be used (One-point feedback): $\hat{g}^{(1)}_t(x) = \frac{u}{\delta} f_t(x + \delta u)$, which queries the objective function $f_t(x)$ only once at each time instant. One-point feedback was first proposed and analyzed in Flaxman et al. (2005) for the solution of online convex optimization problems. Saha & Tewari (2011); Hazan & Levy (2014); Dekel et al. (2015) showed that the regret of convex online optimization methods using one-point gradient estimation can be improved by assuming smoothness or strong convexity of the objective functions and by using self-concordant regularization. More recently, Gasnikov et al. (2017) developed such regret bounds for stochastic convex problems. On the other hand, Hazan et al. (2016) characterized the convergence of one-point zeroth-order methods for static stochastic non-convex optimization problems. However, as shown in these studies, a limitation of one-point feedback is that the resulting gradient estimator has large variance and, therefore, induces large regret. In addition, the regret analysis for ZO with one-point feedback usually requires the strong assumption that the function value is uniformly bounded over time, so this method cannot be applied to many practical non-stationary optimization problems.

Contributions: In this paper, we propose a novel one-point gradient estimator for zeroth-order online optimization and develop new regret bounds to study its performance. Specifically, our contributions are as follows. We propose a new one-point feedback scheme that requires a single function evaluation at each time instant. This scheme estimates the gradient using the residual between two consecutive feedback points, and we refer to it as residual feedback. We show that residual feedback induces a smaller gradient estimation variance than the conventional one-point feedback scheme in Flaxman et al. (2005); Gasnikov et al. (2017). Furthermore, we provide regret bounds for online convex optimization with the proposed residual-feedback estimator. Our analysis relies on a weaker assumption than the one needed for the conventional one-point estimator, and our regret bounds are tighter, especially when the value of the objective function is large. In addition, we provide regret bounds for online non-convex optimization with residual feedback. Finally, we present numerical experiments demonstrating that the proposed residual-feedback estimator significantly outperforms the conventional one-point method in its ability to track the time-varying optimizers of online learning problems.

To the best of our knowledge, this is the first time a one-point zeroth-order method is theoretically studied for online non-convex optimization problems. It is also the first time that a one-point gradient estimator demonstrates empirical performance comparable to that of the two-point method. We note that two-point estimators can only be used to solve online non-stationary learning problems in simulations, where the system can be hard-coded to remain fixed during the two queries of the objective function at two different decision variables.

Related work: Zeroth-order methods have been used to solve many different types of optimization problems. For example, Balasubramanian & Ghadimi (2018) apply ZO to solve a set-constrained optimization problem where the projection onto the constraint set is non-trivial. Gorbunov et al. (2018); Ji et al. (2019) apply variance-reduction techniques and acceleration schemes to achieve better convergence speed in ZO. Wang et al. (2018) improve the dependence of the iteration complexity on the problem dimension under an additional sparsity assumption on the gradient of the objective function. Hajinezhad & Zavlanos (2018); Tang & Li (2019) apply zeroth-order oracles to distributed optimization problems where only bandit feedback is available at each local agent. Our proposed residual-feedback oracle can be used to solve such online optimization problems as well. Also related is the work by Zhang et al. (2015), which considers non-convex online bandit optimization problems with a single query at each time step. However, this method employs the exploration-exploitation bandit learning framework, and the proposed analysis is restricted to a special class of non-convex objective functions. Finally, Agarwal et al. (2011); Hazan & Li (2016); Bubeck et al. (2017) study online bandit algorithms using ellipsoid methods. These methods incur heavy computation per step and achieve regret bounds with poor dependence on the problem dimension. In comparison, our one-point method is computationally light and achieves regret bounds with better dependence on the problem dimension.
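The residual-feedback idea described above — reuse the feedback value observed at time $t-1$ in place of a second query at time $t$ — can be sketched as follows. This is a minimal sketch consistent with the paper's description of "the residual between two feedback points at consecutive time instants"; the toy drifting quadratic, the step sizes, and the initialization are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def zo_residual_feedback(f_seq, x0, T, delta=0.1, eta=0.01):
    """Zeroth-order online gradient descent with one query of f_t per step.

    The gradient estimate at time t reuses the feedback value from time t-1:
        g_t = (u_t / delta) * (f_t(x_t + delta*u_t) - f_{t-1}(x_{t-1} + delta*u_{t-1}))
    """
    x = np.array(x0, dtype=float)
    y_prev = f_seq(0, x)  # one-time initialization query (an assumed convention)
    traj = []
    for t in range(T):
        u = rng.standard_normal(x.shape)
        y = f_seq(t, x + delta * u)      # the single function query at time t
        g = (u / delta) * (y - y_prev)   # residual-feedback gradient estimate
        x = x - eta * g                  # online gradient step
        y_prev = y
        traj.append(x.copy())
    return np.array(traj)

# Toy non-stationary objective: a quadratic whose minimizer c_t drifts slowly.
def f_seq(t, x):
    c = np.array([np.sin(0.001 * t), np.cos(0.001 * t)])
    return float((x - c) @ (x - c))

traj = zo_residual_feedback(f_seq, x0=[2.0, 2.0], T=5000)
```

When the iterates move slowly, consecutive feedback values are close, so the residual $y - y_{\mathrm{prev}}$ is small and the $1/\delta$ factor no longer blows up the estimate's variance the way it does for the conventional one-point estimator, whose single term $f_t(x + \delta u)/\delta$ scales with the raw function value.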

2. PRELIMINARIES AND RESIDUAL FEEDBACK

We first introduce the classes of Lipschitz and smooth functions.
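These two classes are presumably the standard ones; written with constants $L_0$ and $L_1$ (the paper's exact notation is an assumption), the definitions are:

```latex
% f is L_0-Lipschitz:
|f(x) - f(y)| \le L_0 \, \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^d.

% f is L_1-smooth, i.e., differentiable with L_1-Lipschitz gradient:
\|\nabla f(x) - \nabla f(y)\| \le L_1 \, \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^d.
```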




