ON THE DYNAMIC REGRET OF ONLINE MULTIPLE MIRROR DESCENT

Anonymous

Abstract

We study the problem of online convex optimization, where a learner makes sequential decisions to minimize an accumulation of strongly convex costs over time. The quality of the decisions is measured by the dynamic regret, which compares the learner's performance against a sequence of dynamic minimizers. Prior works on gradient descent and mirror descent have shown that the dynamic regret can be upper bounded using the path length, which depends on the differences between successive minimizers, and an upper bound using the squared path length has also been shown when multiple gradient queries are allowed per round. However, these works all require the cost functions to be Lipschitz continuous, which is a strong requirement, especially when the cost functions are also strongly convex. In this work, we consider Online Multiple Mirror Descent (OMMD), which is based on mirror descent but uses multiple mirror descent steps per online round. Without requiring the cost functions to be Lipschitz continuous, we derive two upper bounds on the dynamic regret based on the path length and the squared path length. We further derive a third upper bound that relies on the gradients of the cost functions and can be much smaller than the path length or the squared path length, especially when the cost functions are smooth but fluctuate over time. Thus, we show that the dynamic regret of OMMD scales linearly with the minimum among the path length, the squared path length, and the sum of squared gradients. Our experimental results further show substantial improvement on the dynamic regret compared with existing alternatives.

1. INTRODUCTION

Online optimization refers to the design of sequential decisions where system parameters and cost functions vary with time. It has applications to various classes of problems, such as object tracking (Shahrampour & Jadbabaie, 2017), networking (Shi et al., 2018), cloud computing (Lin et al., 2012), and classification (Crammer et al., 2006). It is also an important tool in the development of algorithms for reinforcement learning (Yuan & Lamperski, 2017) and deep learning (Mnih et al., 2015). In this work, we consider online convex optimization, which can be formulated as a discrete-time sequential learning process as follows. At each round $t$, the learner first makes a decision $x_t \in \mathcal{X}$, where $\mathcal{X}$ is a convex set representing the solution space. The learner then receives a convex cost function $f_t : \mathcal{X} \to \mathbb{R}$ and suffers the corresponding cost $f_t(x_t)$ associated with the submitted decision. The goal of the online learner is to minimize the total accrued cost over a finite number of rounds, denoted by $T$. For performance evaluation, prior studies on online learning often focus on the static regret, defined as the difference between the learner's accumulated cost and that of an optimal fixed offline decision, which is made in hindsight with knowledge of $f_t(\cdot)$ for all $t$: $\mathrm{Reg}^s_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)$. A successful online algorithm closes the gap between the online decisions and the offline counterpart when normalized by $T$, i.e., it sustains static regret sublinear in $T$. In the literature, there are various online algorithms (Zinkevich, 2003; Cesa-Bianchi & Lugosi, 2006; Hazan et al., 2006; Duchi et al., 2010; Shalev-Shwartz, 2012) that guarantee a sublinear bound on the static regret. However, algorithms that guarantee performance close to that of a static decision may still perform poorly in dynamic settings. Consequently, the static regret fails to accurately reflect the quality of decisions in many practical scenarios.
Therefore, the dynamic regret has become a popular metric in recent works (Besbes et al., 2015; Mokhtari et al., 2016; Yang et al., 2016; Zhang et al., 2017). It allows a dynamic sequence of comparison targets and is defined by $\mathrm{Reg}^d_T = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^*_t)$, where $x^*_t = \operatorname{argmin}_{x \in \mathcal{X}} f_t(x)$ is a minimizer of the cost at round $t$. It is well known that the online optimization problem may be intractable in a dynamic setting, due to arbitrary fluctuation in the cost functions. Hence, achieving a sublinear bound on the dynamic regret may be impossible. However, it is possible to upper bound the dynamic regret in terms of certain regularity measures. One such measure is the path length, defined by $C_T = \sum_{t=2}^{T} \|x^*_t - x^*_{t-1}\|$, which captures the accumulated variation in the minimizer sequence. For instance, the dynamic regret of online gradient descent for convex cost functions can be bounded by $O(\sqrt{T}(1 + C_T))$ (Zinkevich, 2003). For strongly convex functions, the dynamic regret of online gradient descent can be reduced to $O(C_T)$ (Mokhtari et al., 2016). When the cost functions are smooth and strongly convex, by allowing the learner to make multiple queries to the gradient of the cost functions, the regret bound can be further improved to $O(\min(C_T, S_T))$, where $S_T$ is the squared path length, defined by $S_T = \sum_{t=2}^{T} \|x^*_t - x^*_{t-1}\|^2$, which can be smaller than the path length when the distance between successive minimizers is small. All the aforementioned studies require the cost functions to be Lipschitz continuous. However, many commonly used cost functions, e.g., the quadratic function, do not meet the Lipschitz condition. In addition, the above works rely on measuring distances using Euclidean norms, which hinders the projection step in the gradient descent update for some constraint sets, e.g., the probability simplex (Duchi, 2018).
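To make the regularity measures concrete, the following sketch computes the path length $C_T$ and the squared path length $S_T$ for a given minimizer sequence; the sequence used in the example is an illustrative assumption, not data from the paper.

```python
import numpy as np

def path_lengths(minimizers):
    """Compute the path length C_T = sum_{t=2}^T ||x*_t - x*_{t-1}|| and the
    squared path length S_T = sum_{t=2}^T ||x*_t - x*_{t-1}||^2 for a
    sequence of per-round minimizers x*_1, ..., x*_T."""
    diffs = np.diff(np.asarray(minimizers, dtype=float), axis=0)
    step_norms = np.linalg.norm(diffs, axis=1)
    return step_norms.sum(), (step_norms ** 2).sum()

# When successive minimizers are close (steps of norm 0.1 over 100 rounds),
# S_T is much smaller than C_T, as the text notes.
xs = [np.array([0.1 * t, 0.0]) for t in range(100)]
C_T, S_T = path_lengths(xs)
```

Here each of the 99 steps has norm 0.1, so $C_T = 9.9$ while $S_T = 0.99$, illustrating why a bound in terms of $S_T$ can be tighter when minimizers drift slowly.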
Besides gradient descent, mirror descent is another well-known technique for online convex optimization (Hall & Willett, 2015; Jadbabaie et al., 2015). Mirror descent uses the Bregman divergence, which generalizes the Euclidean norm used in the projection step of gradient descent, and thus applies to a broader range of problems. In addition, the Bregman divergence is only mildly dependent on the dimension of the decision variables (Beck & Teboulle, 2003; Nemirovsky & Yudin, 1983), so mirror descent is optimal among first-order methods when the decision variables have high dimensions (Duchi et al., 2010). In this work we focus on the mirror descent approach. In previous works on online mirror descent, the learner queries the gradient of each cost function only once and performs one step of mirror descent to update its decision (Hall & Willett, 2015; Shahrampour & Jadbabaie, 2017). In this case, the dynamic regret has an upper bound of order $O(\sqrt{T}(1 + C_T))$, the same as that of online gradient descent in (Zinkevich, 2003). In this work, we investigate whether it is possible to improve the dynamic regret when the learner performs multiple mirror descent steps in each online round, while relaxing the Lipschitz continuity condition on the cost functions. To this end, we analyze the performance of the Online Multiple Mirror Descent (OMMD) algorithm, which uses multiple steps of mirror descent per online round. When the cost functions are smooth and strongly convex, we show that the upper bound on the dynamic regret can be reduced from $O(\sqrt{T}(1 + C_T))$ to $O(\min(C_T, S_T, G_T))$, where $G_T$ is the sum of squared gradients, i.e., $G_T = \sum_{t=1}^{T} \|\nabla f_t(x_t)\|_*^2$, with $\|\cdot\|_*$ denoting the dual norm. The sum of squared gradients $G_T$ can be smaller than both the path length and the squared path length, especially when the cost functions fluctuate drastically over time.
In contrast to the aforementioned works, our analysis does not require the cost functions to be Lipschitz continuous. Furthermore, our numerical experiments show substantially reduced dynamic regret compared with the best known alternatives, including single-step dynamic mirror descent (Hall & Willett, 2015), online multiple gradient descent (Zhang et al., 2017), and online gradient descent (Zinkevich, 2003).

2. ONLINE MULTIPLE MIRROR DESCENT

In this section, we describe OMMD and discuss how the learner can improve the dynamic regret by performing multiple mirror descent steps per round. Before delving into the details, we proceed by stating several definitions and standard assumptions.

2.1. PRELIMINARIES

Definition 1: The Bregman divergence with respect to the regularization function $r(\cdot)$ is defined as $D_r(x, y) = r(x) - r(y) - \langle \nabla r(y), x - y \rangle$. The Bregman divergence is a general distance-measuring function, which contains the squared Euclidean distance and the Kullback-Leibler divergence as two special cases. Using the Bregman divergence, a generalized definition of strong convexity is given in (Shalev-Shwartz & Singer, 2007).

Definition 2: A convex function $f(\cdot)$ is $\lambda$-strongly convex with respect to a convex and differentiable function $r(\cdot)$ if $f(y) + \langle \nabla f(y), x - y \rangle + \lambda D_r(x, y) \le f(x)$, $\forall x, y \in \mathcal{X}$.

Following many prior studies on mirror descent, we assume that the cost functions are $\lambda$-strongly convex, where the above generalized strong convexity definition is used. We further assume that the cost functions are $L$-smooth, and that the regularization function $r(\cdot)$ is $L_r$-smooth and 1-strongly convex with respect to some norm (see App. A for definitions). These are standard assumptions, commonly used in the literature since the line of work initiated by (Hazan et al., 2006; Shalev-Shwartz & Singer, 2007), to provide stronger regret bounds by constraining the curvature of the cost functions. We further make a standard assumption that the Bregman divergence is Lipschitz continuous: $|D_r(x, z) - D_r(y, z)| \le K \|x - y\|$, $\forall x, y, z \in \mathcal{X}$, where $K$ is a positive constant. This condition is much milder than the Lipschitz continuity of the cost functions required in (Zhang et al., 2017; Mokhtari et al., 2016; Hall & Willett, 2015). There is a notable weakness in bounds that rely on the latter condition: since the sequence of cost functions is revealed to the learner, the learner has no control over it, and if these cost functions happen not to meet the Lipschitz condition, earlier analyses that require it become inapplicable. In this work, we do not require the cost functions to be Lipschitz continuous.
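Definition 1 can be sketched directly in code. The following illustration (function names are assumptions for this sketch) evaluates the Bregman divergence for the two special cases named above: the squared Euclidean distance, from $r(x) = \|x\|^2$, and the KL divergence, from the negative-entropy regularizer.

```python
import numpy as np

def bregman(r, grad_r, x, y):
    """D_r(x, y) = r(x) - r(y) - <grad r(y), x - y> (Definition 1)."""
    return r(x) - r(y) - grad_r(y) @ (x - y)

# Special case 1: r(x) = ||x||^2 yields the squared Euclidean distance.
sq = lambda x: x @ x
grad_sq = lambda x: 2 * x

# Special case 2: negative entropy r(x) = sum_i x_i log x_i yields the
# KL divergence on probability vectors.
negent = lambda x: np.sum(x * np.log(x))
grad_negent = lambda x: np.log(x) + 1

x = np.array([0.2, 0.8])
y = np.array([0.5, 0.5])
d_euc = bregman(sq, grad_sq, x, y)         # equals ||x - y||^2
d_kl = bregman(negent, grad_negent, x, y)  # equals KL(x || y)
```

For probability vectors the linear terms cancel and the negative-entropy case reduces exactly to $\sum_i x_i \log(x_i / y_i)$.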
Instead, we move the Lipschitz continuity condition from the cost functions to the Bregman divergence, which broadens the applicability of our work. The main benefit is that the regularization function, and hence the corresponding Bregman divergence, is within the control of the learner. The learner can carefully design the regularization function so that the associated Bregman divergence satisfies Lipschitz continuity with a small factor. For example, in the particular case of the KL divergence, which is obtained by choosing the negative entropy as the regularization function, on the set $\mathcal{X} = \{x \mid \sum_{i=1}^{d} x_i = 1,\ x_i \ge \frac{1}{D}\}$, the constant $K$ is of order $O(\log D)$. Other examples of widely used Bregman divergences that satisfy this condition are given in (Bauschke & Borwein, 2001). We consider online optimization over a finite number of rounds, denoted by $T$. At the beginning of every round $t$, the learner submits a decision $x_t$, taken from a convex and compact set $\mathcal{X}$. An adversary then selects a function $f_t(\cdot)$, and the learner suffers the corresponding cost $f_t(x_t)$. The learner then updates its decision for the next round. With standard mirror descent, the update is
$x_{t+1} = \operatorname{argmin}_{x \in \mathcal{X}} \left\{ \langle \nabla f_t(x_t), x \rangle + \frac{1}{\alpha} D_r(x, x_t) \right\}$, (4)
where $\alpha$ is a fixed step size and $D_r(\cdot, \cdot)$ is the Bregman divergence corresponding to the regularization function $r(\cdot)$. The update in equation 4 suggests that the learner aims to stay close to the current decision $x_t$, as measured by the Bregman divergence, while taking a step in a direction close to the negative gradient to reduce the current cost at round $t$. OMMD uses mirror descent at its core as the optimization workhorse. However, in contrast to classical online optimization methods, where the learner queries the gradient of each cost function only once, OMMD is designed to take advantage of the curvature of the cost functions by allowing the learner to make multiple queries to the gradient in each round.
This is especially important when successive cost functions have similar curvatures. In particular, to track $x^*_{t+1}$ the learner needs access to the gradient of the next cost function, i.e., $\nabla f_{t+1}(\cdot)$. Unfortunately, this information is not available until the end of round $t+1$. However, if the successive functions have similar curvatures, the gradient of $f_t(\cdot)$ is a reasonably accurate estimate of the gradient of $f_{t+1}(\cdot)$. In this case, every time the learner queries the gradient of $f_t(\cdot)$, it finds a point that is likely to be closer to the minimizer of $f_{t+1}(\cdot)$. Hence, it may benefit the learner to perform multiple mirror descent steps in each round. Thus, the learner generates a series of decisions, represented by $y^1_t, y^2_t, \dots, y^{M+1}_t$, via the following updates:
$y^1_t = x_t, \quad y^{i+1}_t = \operatorname{argmin}_{y \in \mathcal{X}} \left\{ \langle \nabla f_t(y^i_t), y \rangle + \frac{1}{\alpha} D_r(y, y^i_t) \right\}, \quad i = 1, 2, \dots, M$. (5)
Then, by setting $x_{t+1} = y^{M+1}_t$, the learner proceeds to the next round, and the procedure continues. Note that $M$ is independent of $T$. Applying multiple steps of mirror descent can reveal more information about the sequence of minimizers. It can reduce the dynamic regret, but only if the series of decisions in equation 5 helps decrease the distance to the minimizer $x^*_{t+1}$. Therefore, quantifying the benefit of OMMD over standard mirror descent requires careful analysis of the impact of the fluctuation of $f_t(\cdot)$ over time. To this end, we bound the dynamic regret of OMMD in the next section.
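One round of the procedure above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cost, the Euclidean mirror step (i.e., $r(x) = \frac{1}{2}\|x\|^2$ with an unconstrained $\mathcal{X}$), and all parameter values are assumptions.

```python
import numpy as np

def ommd_round(x_t, grad_f, mirror_step, alpha, M):
    """One round of OMMD: starting from the current decision x_t, apply M
    mirror descent steps on the *current* cost f_t (via its gradient
    oracle grad_f), then return the result as the next decision x_{t+1}."""
    y = x_t  # y^1_t = x_t
    for _ in range(M):
        y = mirror_step(y, grad_f(y), alpha)  # y^{i+1}_t from y^i_t
    return y  # x_{t+1} = y^{M+1}_t

# Illustrative cost f_t(x) = ||x - c||^2, whose minimizer is c, with a
# plain (Euclidean) gradient step as the mirror step.
c = np.array([1.0, -2.0])
grad_f = lambda x: 2 * (x - c)
step = lambda x, g, a: x - a * g
x_next = ommd_round(np.zeros(2), grad_f, step, alpha=0.1, M=10)
```

Each inner step contracts the distance to the round's minimizer by a factor $0.8$ here, so after $M = 10$ steps the iterate satisfies $x_{t+1} = c\,(1 - 0.8^{10})$, much closer to $c$ than the starting point.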

3. THEORETICAL RESULTS

The following lemma paves the way for our analysis of the dynamic regret of OMMD. It bounds the distance of the learner's next decision from the current optimal solution after a single step of mirror descent.

Lemma 1 Assume that $f_t(\cdot)$ is $\lambda$-strongly convex with respect to a differentiable function $r(\cdot)$, and is $L$-smooth. Single-step mirror descent with a fixed step size $\alpha \le \frac{1}{L}$ guarantees $D_r(x^*_t, x_{t+1}) \le \beta D_r(x^*_t, x_t)$, where $x^*_t$ is the unique minimizer of $f_t(\cdot)$, and $\beta = 1 - \frac{2\alpha\lambda}{1 + \alpha\lambda}$. Lemma 1 is proved in App. B in the supplementary material.

Remark 1. Lemma 1 states that a mirror descent step reduces the distance (measured by the Bregman divergence) of the learner's decision to the current minimizer. This generalizes the results in (Mokhtari et al., 2016; Zhang et al., 2017), where similar bounds were derived for online gradient descent with distances measured in Euclidean norms. In particular, those results correspond to the special choice $r(x) = \|x\|_2^2$, which reduces the Bregman divergence to the squared Euclidean distance, i.e., $D_r(x, y) = \|x - y\|^2$. Lemma 1 indicates that the distance between the next decision $x_{t+1}$ and the minimizer $x^*_t$ is strictly smaller than the distance between the current decision $x_t$ and that minimizer. This implies that if the minimizers of $f_t(\cdot)$ and $f_{t+1}(\cdot)$, namely $x^*_t$ and $x^*_{t+1}$, are not far from each other, applying mirror descent multiple times enables the online learner to more accurately track the sequence of optimal solutions $x^*_t$. The succeeding theorems provide three separate upper bounds on the dynamic regret of OMMD, based on the path length $C_T$ (as defined in equation 1), the squared path length $S_T$ (as defined in equation 2), and the sum of squared gradients $G_T$ (as defined in equation 3).
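The contraction in Lemma 1 can be checked numerically in the Euclidean special case mentioned in Remark 1, where mirror descent reduces to gradient descent. The quadratic cost and all parameter values below are assumptions chosen for illustration; here we take $r(x) = \frac{1}{2}\|x\|^2$, so $D_r(x, y) = \frac{1}{2}\|x - y\|^2$.

```python
import numpy as np

# Quadratic cost f(x) = (1/2)(x - x*)' A (x - x*) with eigenvalues of A in
# [1, 4], so lambda = 1 and L = 4, and the gradient is A (x - x*).
rng = np.random.default_rng(0)
A = np.diag(rng.uniform(1.0, 4.0, size=5))
x_star = rng.standard_normal(5)
grad = lambda x: A @ (x - x_star)

lam, L = 1.0, 4.0
alpha = 1.0 / L                               # step size alpha <= 1/L
beta = 1 - 2 * alpha * lam / (1 + alpha * lam)  # shrinking factor of Lemma 1

x = rng.standard_normal(5)
x_next = x - alpha * grad(x)                  # one (Euclidean) mirror step
D_before = 0.5 * np.sum((x_star - x) ** 2)    # D_r(x*, x_t)
D_after = 0.5 * np.sum((x_star - x_next) ** 2)  # D_r(x*, x_{t+1})
# Lemma 1 predicts: D_after <= beta * D_before
```

In this instance each coordinate error shrinks by at least a factor $1 - \alpha a_i \le 0.75$, so the divergence ratio is at most $0.5625$, comfortably below $\beta = 0.6$.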
Theorem 2 Assume that $r(\cdot)$ is $L_r$-smooth and 1-strongly convex with respect to some norm $\|\cdot\|$, and that the cost functions are $L$-smooth and $\lambda$-strongly convex with respect to $r(\cdot)$. Let $x_t$ be the sequence of decisions generated by OMMD with a fixed step size $\frac{1}{2\lambda} < \alpha \le \frac{1}{L}$ and $M \ge \left( \frac{1}{2} + \frac{1}{2\alpha\lambda} \right) \log L_r$ mirror descent steps per round. The dynamic regret satisfies the following bound:
$\sum_{t=1}^{T} \left[ f_t(x_t) - f_t(x^*_t) \right] \le \frac{K\lambda}{2\alpha\lambda - 1} \cdot \frac{1 + \sqrt{L_r \beta^M}}{1 - \sqrt{L_r \beta^M}} \left( C_T + \|x^*_1 - x_1\| \right)$,
where $\beta$ is the shrinking factor derived in Lemma 1, and $K$ is the Lipschitz constant associated with $D_r(\cdot, \cdot)$. The proof of Theorem 2 is given in App. C in the supplementary material.

Remark 2. It was shown in (Hall & Willett, 2015) that single-step mirror descent guarantees an upper bound of $O(\sqrt{T}(1 + C_T))$ on the dynamic regret for convex cost functions. With that bound, a sublinear path length is not sufficient to guarantee sublinear dynamic regret. In contrast, Theorem 2 implies that OMMD reduces the upper bound to $O(C_T)$ when the cost functions are strongly convex and smooth, so a sublinear path length is sufficient to yield sublinear dynamic regret.

Remark 3. The range of $M$ for which the bound in Theorem 2 holds is usually wide. For example, it is $M \ge 3$ and $M \ge 5$ for the two experiments shown in Section 4.

Theorem 3 Under the same convexity and smoothness conditions stated in Theorem 2, let $x_t$ be the sequence of decisions generated by OMMD with a fixed step size $\alpha \le \frac{1}{L}$ and $M \ge \left( \frac{1}{2} + \frac{1}{2\alpha\lambda} \right) \log 2L_r$ mirror descent steps per round. For any arbitrary positive constant $\theta$, the dynamic regret is upper bounded by
$\sum_{t=1}^{T} \left[ f_t(x_t) - f_t(x^*_t) \right] \le \sum_{t=1}^{T} \frac{\|\nabla f_t(x^*_t)\|_*^2}{2\theta} + \frac{L L_r + \theta}{1 - 2 L_r \beta^M} \left( S_T + \|x^*_1 - x_1\|_2^2 \right)$.
Theorem 3 is proved in App. D in the supplementary material. Since the gradient at $x^*_t$ is zero whenever $x^*_t$ lies in the relative interior of the feasible set $\mathcal{X}$, i.e., $\nabla f_t(x^*_t) = 0$, the above theorem simplifies to the following corollary.
Corollary 4 If $x^*_t$ belongs to the relative interior of the feasible set $\mathcal{X}$ for all $t$, the dynamic regret bound in Theorem 3 is of order $O(S_T)$.

When the cost functions drift slowly, the distances between successive minimizers are small. Hence, the squared path length $S_T$, which depends on the squares of those distances, can be significantly smaller than the path length $C_T$. In this case, Theorem 3 and Corollary 4 provide a tighter regret bound than Theorem 2.

Theorem 5 Under the same convexity and smoothness conditions stated in Theorem 2, let $x_t$ be the sequence of decisions generated by OMMD with a fixed step size $\alpha > \frac{1}{2\lambda}$. The following bound holds on the dynamic regret:
$\sum_{t=1}^{T} \left[ f_t(x_t) - f_t(x^*_t) \right] \le \frac{\alpha^2 \lambda}{4\alpha\lambda - 2} G_T$.
The proof of Theorem 5 is given in App. E in the supplementary material.

Remark 4. Interestingly, Theorem 5 implies that sublinear dynamic regret can be achieved when the gradients of the cost functions shrink over time. For instance, if $\|\nabla f_t(x)\|_* = O(1/t^{\gamma})$ for some $\gamma > 0$, Theorem 5 guarantees $O(T^{1-2\gamma})$ dynamic regret. This is especially important when the cost functions decrease while the minimizers fluctuate: in that scenario, the path length $C_T$ and the squared path length $S_T$ may grow linearly, whereas diminishing gradients ensure sublinear $G_T$.

Theorem 2, Corollary 4, and Theorem 5, respectively, state that the dynamic regret of OMMD is upper bounded linearly by the path length $C_T$, the squared path length $S_T$, and the sum of squared gradients $G_T$. This immediately leads to the following result.

Corollary 6 Under the same convexity and smoothness conditions stated in Theorem 2, the dynamic regret of OMMD with suitably chosen $\alpha$ and $M$ has an upper bound of $O(\min(C_T, S_T, G_T))$.

Remark 5. We note that the $O(C_T)$ and $O(\min(C_T, S_T))$ parts of this bound parallel the results previously established for online gradient descent in (Mokhtari et al., 2016) and (Zhang et al., 2017), respectively. Furthermore, in contrast to these studies, our analysis does not require the cost functions to be Lipschitz continuous.
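As a quick numerical illustration of Remark 4 (the gradient-norm schedule and all values below are assumptions): with $\|\nabla f_t\|_* = t^{-\gamma}$ and $\gamma = 1/4$, the sum $G_T = \sum_{t=1}^{T} t^{-2\gamma}$ grows like $T^{1-2\gamma} = T^{1/2}$, so quadrupling $T$ should roughly double $G_T$.

```python
import numpy as np

gamma = 0.25  # assumed gradient-decay exponent, gamma < 1/2

def G(T):
    """G_T = sum_{t=1}^T t^{-2*gamma}, the sum of squared gradient norms
    under the assumed schedule ||grad f_t||_* = t^{-gamma}."""
    return np.sum(np.arange(1, T + 1, dtype=float) ** (-2 * gamma))

T1, T2 = 10_000, 40_000
ratio = G(T2) / G(T1)  # should approach (T2 / T1)^(1 - 2*gamma) = 2
```

The measured ratio is close to 2, consistent with the sublinear $O(T^{1-2\gamma})$ growth claimed in Remark 4.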
The quantities $C_T$, $S_T$, and $G_T$ represent distinct aspects of an online learning problem and are not generally comparable. The following example demonstrates the benefit of having multiple upper bounds and taking their minimum. Consider a sequence of quadratic programming problems of the form $f_t(x) = \|A_t x - b_t\|^2$ over the $d$-dimensional probability simplex. Assume that for any $t \ge 1$, we have $A_t = \mathrm{diag}\left( \frac{1}{t^{p_1}}, 0, 0, \dots, 0 \right)$ if $t$ is odd and $A_t = \mathrm{diag}\left( 0, \frac{1}{t^{p_1}}, 0, \dots, 0 \right)$ if $t$ is even, and $b_t = \left[ \frac{1}{t^{p_2}}, \frac{1}{t^{p_2}}, \dots, \frac{1}{t^{p_2}} \right]$, where $p_1$ and $p_2$ are positive constants such that $p_2 \le p_1$. In this setting, we observe that $C_T = O(T)$ and $G_T = O(T^{1-p_1-p_2})$. Thus, $G_T$ can be considerably smaller than $C_T$. On the other hand, it is also possible for $C_T$ to be smaller than $G_T$ in other cases. For example, let $A_t = \mathrm{diag}(1/2, 0, \dots, 0)$ on odd rounds and $A_t = \mathrm{diag}(0, -1/2, \dots, -1/2)$ on even rounds, and let $b_t$ be the all-ones vector for all $t$. In this case, we observe that $C_T = O(1)$, while the sum of squared gradients scales linearly with time, i.e., $G_T = O(T)$. Thus, neither $C_T$ nor $G_T$ alone can provide a small regret bound in all cases. Similar examples can be found comparing $S_T$ and $G_T$ but are omitted for brevity.

4. EXPERIMENTS

We investigate the performance of OMMD via numerical experiments in two different learning scenarios (with further experiments presented in App. G in the supplementary material). First, we consider a ridge regression problem on the CIFAR-10 dataset (Krizhevsky, 2009). Then, we study a case of online convex optimization where the difference between successive minimizers diminishes as time progresses. We compare OMMD with the following alternatives: Online Gradient Descent (OGD) (Zinkevich, 2003), Online Multiple Gradient Descent (OMGD) (Zhang et al., 2017), and Dynamic Mirror Descent (DMD) (Hall & Willett, 2015). In the first experiment, we consider multi-class classification with ridge regression. In this task, the learner observes a sequence of labeled examples $(\omega, z)$, where $\omega \in \mathbb{R}^d$ and the label $z$, denoting the class of the data example, is drawn from a discrete space $\mathcal{Z} = \{1, 2, \dots, c\}$. We use the CIFAR-10 image dataset, which contains $5 \times 10^4$ data samples. Each data sample $\omega$ is a color image of $32 \times 32$ pixels that can be represented by a 3072-dimensional vector, i.e., $d = 3072$. Data samples correspond to color images of objects, including airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks; hence, there are $c = 10$ different classes. For ridge regression, the cost function associated with the batch of data samples at round $t$, i.e., $(\omega_{1,t}, z_{1,t}), \dots, (\omega_{b,t}, z_{b,t})$, is given by $f(x, (\omega_t, z_t)) = \|\omega_t^{\mathsf{T}} x - z_t\|_2^2$, where $x$ is the optimization variable, constrained to the set $\mathcal{X} = \{x \in \mathbb{R}^d_+ : \|x\|_1 = 1\}$, and $(\omega_t, z_t)$ compactly represents the batch of data samples at round $t$, i.e., $\omega_t = [\omega_{1,t}, \omega_{2,t}, \dots, \omega_{b,t}]$ and $z_t = [z_{1,t}, z_{2,t}, \dots, z_{b,t}]^{\mathsf{T}}$. The goal of the learner is to classify streaming images online by tracking the unknown optimal parameter $x^*_t$.
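The per-round batch cost and the gradient that the mirror descent steps query can be sketched as follows. This is an illustrative sketch, not the experiment's code: `W` stacks the batch's feature vectors as rows (i.e., $W = \omega_t^{\mathsf{T}}$), and the tiny batch values are assumptions, not CIFAR-10 data.

```python
import numpy as np

def ridge_cost_and_grad(x, W, z):
    """Per-round cost f_t(x) = ||W x - z||^2 for a batch of b samples,
    where row i of W is the feature vector omega_{i,t} and z[i] is the
    label z_{i,t}.  Returns the cost and its gradient 2 W' (W x - z)."""
    resid = W @ x - z
    return resid @ resid, 2.0 * W.T @ resid

# Tiny illustrative batch (b = 2 samples in dimension d = 2).
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
z = np.zeros(2)
cost, grad = ridge_cost_and_grad(np.array([1.0, 2.0]), W, z)
```

Here the residual is $(1, 2)$, so the cost is $5$ and the gradient is $(2, 4)$, matching the closed form $2 W^{\mathsf{T}}(Wx - z)$.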
We use the negative entropy regularization function, i.e., $r(x) = \sum_{j=1}^{d} x_j \log(x_j)$, which is strongly convex with respect to the $\ell_1$-norm. The mirror descent update in equation 4 then leads to the closed-form update
$y^{i+1}_{t,j} = \frac{y^i_{t,j} \exp\left( -\alpha \nabla f_t(y^i_t)_j \right)}{\sum_{k=1}^{d} y^i_{t,k} \exp\left( -\alpha \nabla f_t(y^i_t)_k \right)}$,
where the subscript $j$ denotes the $j$-th component, e.g., $\nabla f_t(y^i_t)_j$ is the $j$-th component of the gradient. The proof of the above closed-form update is given in App. F in the supplementary material. In our experiment, we set the batch size to 20 data samples per online round and set $\alpha = 0.1$. In Fig. 1, we compare the performance of OMMD with DMD, OGD, and OMGD in terms of the dynamic regret. We see that the methods based on mirror descent perform better than those based on gradient descent, as generally expected. Furthermore, OMMD with $M = 10$ reduces the dynamic regret by up to 30% in comparison with DMD. The dynamic regret of all algorithms grows linearly with the number of rounds. This is because the sequence of minimizers $x^*_t$ depends on batches of samples that are independent over time, so the minimizers do not converge. We note that this is common for online optimization in dynamic settings, where steady fluctuation in the environment results in linear dynamic regret.

Next, we study the performance of OMMD in solving a sequence of quadratic programming problems of the form $f_t(x_1, x_2) = \rho \|x_1 - a_t\|^2 + \|x_2 - b_t\|^2$, where $\rho$ is a positive constant, $a_t$ and $b_t$ are time-variant vectors, and the decision variables are $x_1 \in \mathbb{R}^{d_1}$ and $x_2 \in \mathbb{R}^{d_2}$, with $d_1 + d_2 = d$. In our experiment, we set $\rho = 10$, $d_1 = 500$, and $d = 1000$. We assume that $b_t$ is time-invariant, with $b_t = 2$ for all rounds $t$, while $a_t$ satisfies the recursive formula $a_{t+1} = a_t + 1/\sqrt{t}$ with initial value $a_1 = -1.5$. We further set the step size $\alpha = 0.03$. We use the same regularization function and constraint set as in the previous experiment. From Fig. 2, we observe that the performance advantage of OMMD is even more pronounced. As time progresses and the difference between successive cost functions becomes less significant, the difference between the minimizers decreases. In this case, OMMD can significantly improve the performance of online optimization by reducing the gap between the learner's decisions and the minimizer sequence. In particular, compared with DMD, OMMD with $M = 10$ reduces the dynamic regret by up to 80% after 2500 rounds.
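The closed-form entropic update used in both experiments (multiplicative reweighting followed by normalization) can be sketched as follows; the starting point and gradient below are assumptions for illustration.

```python
import numpy as np

def entropic_md_step(y, grad, alpha):
    """One mirror descent step on the probability simplex with the
    negative-entropy regularizer: componentwise exponentiated-gradient
    update followed by normalization, matching the closed-form update."""
    w = y * np.exp(-alpha * grad)
    return w / w.sum()

# One step from the uniform distribution: the coordinate with the largest
# gradient component loses mass relative to the others.
y = np.full(4, 0.25)
g = np.array([1.0, 0.0, 0.0, 0.0])
y_next = entropic_md_step(y, g, alpha=0.1)
```

Note that the update keeps the iterate strictly inside the simplex (positive entries summing to one), which is exactly why no separate Euclidean projection step is needed.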

5. RELATED WORKS

The problem of online convex optimization has been extensively studied in the literature since the seminal work of (Zinkevich, 2003). Most prior works study various online algorithms that guarantee a sublinear bound on the static regret (Zinkevich, 2003; Cesa-Bianchi & Lugosi, 2006; Hazan et al., 2006; Duchi et al., 2010; Shalev-Shwartz, 2012). Here we review the most relevant works, with a focus on the dynamic regret.

5.1. DYNAMIC REGRET OF ONLINE GRADIENT DESCENT

The dynamic regret of online gradient descent for convex cost functions is upper bounded by $O(\sqrt{T}(1 + C_T))$ (Zinkevich, 2003; Yang et al., 2016). For strongly convex cost functions, the upper bound on the dynamic regret can be reduced to $O(C_T)$ (Mokhtari et al., 2016). The above works make only a single query to the gradient of the cost functions in every round. By allowing the learner to make multiple gradient queries per online round, the regret bound can be improved to $O(\min(C_T, S_T))$ when the cost functions are smooth and strongly convex (Zhang et al., 2017). The analysis in all aforementioned studies requires the cost functions to be Lipschitz continuous. However, many commonly used cost functions do not satisfy this condition over an unbounded feasible set, e.g., the quadratic function, and even when the feasible set is bounded, the Lipschitz factor can be excessively large, especially when the cost functions are strongly convex. Therefore, Lipschitz continuity of the cost functions is not assumed in our analysis. Instead, we move this condition from the cost functions to the Bregman divergence, which the learner can control and design. In addition, the above works rely on measuring distances using Euclidean norms, while updates with Euclidean distances are challenging for some constraint sets, e.g., the probability simplex (Duchi, 2018). It is known that gradient descent does not perform as well as mirror descent, especially when the input dimension is high (Nemirovsky & Yudin, 1983; Beck & Teboulle, 2003).

5.2. DYNAMIC REGRET OF ONLINE MIRROR DESCENT

The dynamic regret of online single-step mirror descent was studied in (Hall & Willett, 2015), where an upper bound of $O(\sqrt{T}(1 + C_T))$ was derived for convex cost functions. To take advantage of smoothness in the cost functions, an adaptive algorithm based on optimistic mirror descent (Rakhlin & Sridharan, 2013) was proposed in (Jadbabaie et al., 2015), which contains two steps of mirror descent per online round. However, different from our work, in that variant the learner is allowed to make only a single query to the gradient. The algorithm further requires some prior prediction of the gradient in each round, which is used in the second mirror descent step. The dynamic regret bound was given in terms of a combination of the path length $C_T$, the deviation $D_T$ between the predictions and the actual gradients, and the functional variation $F_T = \sum_{t=2}^{T} \max_{x \in \mathcal{X}} |f_t(x) - f_{t-1}(x)|$. Unfortunately, to achieve this bound, the algorithm requires a time-varying step size that depends on the optimal solution in the previous step, which prevents direct numerical comparison with OMMD. Therefore, in Section 4 we experiment only with the method of (Hall & Willett, 2015). All aforementioned works make only a single query to the gradient of the cost functions in every online round. In contrast, in this work, we allow the learner to make multiple gradient queries per round. The learner then uses this information to update its decision via multiple steps of mirror descent. In this way, we show that the dynamic regret can be upper bounded linearly by the minimum among the path length, the squared path length, and the sum of squared gradients. Furthermore, as opposed to the aforementioned works, our analysis does not require the cost functions to be Lipschitz continuous. Finally, there is also recent work in the literature on distributed online mirror descent (Shahrampour & Jadbabaie, 2017).
As expected, it is more challenging to achieve performance guarantees in distributed optimization. We focus on centralized online convex optimization in this work.

6. CONCLUSION

We have studied online convex optimization in dynamic settings. By applying the mirror descent step multiple times in each round, we show that the upper bound on the dynamic regret can be reduced significantly from $O(\sqrt{T}(1 + C_T))$ to $O(\min(C_T, S_T, G_T))$ when the cost functions are strongly convex and smooth. In contrast to prior studies (Hall & Willett, 2015; Zhang et al., 2017; Mokhtari et al., 2016), our analysis does not require the cost functions to be Lipschitz continuous. Numerical experiments on the CIFAR-10 dataset, on sequential quadratic programming, and on additional examples show substantial improvement in the dynamic regret compared with existing alternatives.

A ADDITIONAL DEFINITIONS

Definition 3: A function $f(\cdot)$ is $L$-smooth if there exists a positive constant $L$ such that $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$, $\forall x, y \in \mathcal{X}$.

Definition 4: A convex function $f(\cdot)$ is $\lambda$-strongly convex with respect to some norm $\|\cdot\|$ if there exists a positive constant $\lambda$ such that $f(y) + \langle \nabla f(y), x - y \rangle + \frac{\lambda}{2}\|x - y\|^2 \le f(x)$, $\forall x, y \in \mathcal{X}$.

Definition 5: A function $f(\cdot)$ is Lipschitz continuous with factor $G$ if $|f(x) - f(y)| \le G\|x - y\|$, $\forall x, y \in \mathcal{X}$.

B PROOF OF LEMMA 1

Consider the single-step mirror descent update
$x_{t+1} = \operatorname{argmin}_{y \in \mathcal{X}} \left\{ f_t(x_t) + \langle \nabla f_t(x_t), y - x_t \rangle + \frac{1}{\alpha} D_r(y, x_t) \right\}$.
Strong convexity of the above minimization objective implies
$f_t(x_t) + \langle \nabla f_t(x_t), x_{t+1} - x_t \rangle + \frac{1}{\alpha} D_r(x_{t+1}, x_t) \le f_t(x_t) + \langle \nabla f_t(x_t), y - x_t \rangle + \frac{1}{\alpha} D_r(y, x_t) - \frac{1}{\alpha} D_r(y, x_{t+1}), \quad \forall y \in \mathcal{X}$. (8)
Furthermore, from the smoothness condition, we have
$f_t(x_{t+1}) \le f_t(x_t) + \langle \nabla f_t(x_t), x_{t+1} - x_t \rangle + \frac{L}{2}\|x_{t+1} - x_t\|^2$. (9)
Substituting equation 9 into equation 8 and setting $y = x^*_t$, we obtain
$f_t(x_{t+1}) - \frac{L}{2}\|x_{t+1} - x_t\|^2 + \frac{1}{\alpha} D_r(x_{t+1}, x_t) \le f_t(x_t) + \langle \nabla f_t(x_t), x^*_t - x_t \rangle + \frac{1}{\alpha} D_r(x^*_t, x_t) - \frac{1}{\alpha} D_r(x^*_t, x_{t+1})$. (10)
Since $\alpha \le \frac{1}{L}$ and the regularization function $r(\cdot)$ is 1-strongly convex, we have
$\frac{1}{\alpha} D_r(x_{t+1}, x_t) \ge L D_r(x_{t+1}, x_t) \ge \frac{L}{2}\|x_{t+1} - x_t\|^2$. (11)
Next, we exploit the strong convexity of the cost function, i.e.,
$f_t(x_t) + \langle \nabla f_t(x_t), x^*_t - x_t \rangle \le f_t(x^*_t) - \lambda D_r(x^*_t, x_t)$. (12)
Combining equation 10, equation 11, and equation 12, we obtain
$f_t(x_{t+1}) \le f_t(x^*_t) - \lambda D_r(x^*_t, x_t) + \frac{1}{\alpha} D_r(x^*_t, x_t) - \frac{1}{\alpha} D_r(x^*_t, x_{t+1})$. (13)
Next, we use the result of (Hazan & Kale, 2014), which states that for every $\lambda$-strongly convex function $f_t(\cdot)$, the following bound holds:
$f_t(x) - f_t(x^*_t) \ge \lambda D_r(x^*_t, x), \quad \text{where } x^*_t = \operatorname{argmin}_{x \in \mathcal{X}} f_t(x)$. (14)
Combining the above (applied at $x = x_{t+1}$) with equation 13, we obtain $D_r(x^*_t, x_{t+1}) \le \beta D_r(x^*_t, x_t)$, where $\beta = 1 - \frac{2\lambda\alpha}{1 + \lambda\alpha}$.

C PROOF OF THEOREM 2

C.1 KEY LEMMAS

The following two lemmas pave the way for our regret analysis leading to Theorem 2. Lemma 7 presents an alternative form of the mirror descent update.

Lemma 7 Suppose there exists $z_{t+1}$ that satisfies $\nabla r(z_{t+1}) = \nabla r(x_t) - \alpha \nabla f_t(x_t)$ for some strongly convex function $r(\cdot)$ and step size $\alpha$. Then, the following updates are equivalent:
$x_{t+1} = \operatorname{argmin}_{x \in \mathcal{X}} D_r(x, z_{t+1})$, (16)
$x_{t+1} = \operatorname{argmin}_{x \in \mathcal{X}} \left\{ \langle \nabla f_t(x_t), x \rangle + \frac{1}{\alpha} D_r(x, x_t) \right\}$. (17)
Proof.
We begin by expanding equation 16 as follows:

$$\begin{aligned} x_{t+1} &= \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ r(x) - r(z_{t+1}) - \langle \nabla r(z_{t+1}), x - z_{t+1} \rangle \right\} \\ &= \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ r(x) - \langle \nabla r(z_{t+1}), x \rangle \right\} \\ &= \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ r(x) - \langle \nabla r(x_t) - \alpha \nabla f_t(x_t), x \rangle \right\} \\ &= \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ \alpha \langle \nabla f_t(x_t), x \rangle + r(x) - r(x_t) - \langle \nabla r(x_t), x - x_t \rangle \right\} \\ &= \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ \langle \nabla f_t(x_t), x \rangle + \frac{1}{\alpha} D_r(x, x_t) \right\}. \end{aligned}$$

Thus, the update in equation 16 is equivalent to equation 17.

Lemma 8 Under the same convexity and smoothness conditions stated in Theorem 2, let $x_t$ be the sequence of decisions generated by OMMD. Then, the following bound holds:

$$\|x_{t+1} - x_t^*\| \le \sqrt{L_r \beta^M} \|x_t - x_t^*\|,$$

where $L_r$ is the smoothness factor of the regularization function $r(\cdot)$, and $\beta$ is the shrinking factor obtained in Lemma 1.

Proof. Using the result of Lemma 1, OMMD with $M$ mirror descent steps guarantees

$$D_r(x_t^*, x_{t+1}) \le \beta^M D_r(x_t^*, x_t). \tag{20}$$

Since the regularization function $r(\cdot)$ is 1-strongly convex, we have

$$\frac{1}{2}\|x_t^* - x_{t+1}\|^2 \le r(x_t^*) - r(x_{t+1}) - \langle \nabla r(x_{t+1}), x_t^* - x_{t+1} \rangle. \tag{21}$$

Next, we exploit the smoothness condition of the regularization function $r(\cdot)$, i.e.,

$$r(x_t^*) - r(x_t) - \langle \nabla r(x_t), x_t^* - x_t \rangle \le \frac{L_r}{2}\|x_t^* - x_t\|^2. \tag{22}$$

By combining the above with equation 20 and equation 21, and using the definition of the Bregman divergence, we obtain

$$\|x_{t+1} - x_t^*\|^2 \le L_r \beta^M \|x_t - x_t^*\|^2. \tag{23}$$

Taking the square root on both sides of equation 23 completes the proof.

C.2 PROOF OF THE THEOREM

Now, we are ready to present the proof of Theorem 2. In this proof, we will use the following properties of the Bregman divergence. (a) By direct substitution, the following equality holds for any $x, y, z \in \mathcal{X}$:

$$\langle \nabla r(z) - \nabla r(y), x - y \rangle = D_r(x, y) - D_r(x, z) + D_r(y, z). \tag{24}$$
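Lemma 7's equivalence can be verified numerically. The sketch below uses an assumed concrete setup not fixed by the lemma itself: the negative-entropy mirror map $r(y) = \sum_j y_j \log y_j$ over the positive orthant, for which $\nabla r(y) = 1 + \log y$, so the point $z_{t+1}$ with $\nabla r(z_{t+1}) = \nabla r(x_t) - \alpha \nabla f_t(x_t)$ has the closed form $z_{t+1} = x_t e^{-\alpha \nabla f_t(x_t)}$ and should minimize the proximal objective in equation 17.

```python
import numpy as np

rng = np.random.default_rng(1)

def D_r(y, x):  # Bregman divergence of negative entropy (generalized KL)
    return np.sum(y * np.log(y / x) - y + x)

x_t = np.array([0.2, 0.5, 0.3])
g = np.array([1.0, -0.5, 0.25])        # stand-in for grad f_t(x_t)
alpha = 0.1
z = x_t * np.exp(-alpha * g)           # solves grad r(z) = grad r(x_t) - alpha*g

def prox_obj(y):                       # objective of equation 17 (unconstrained)
    return np.dot(g, y) + D_r(y, x_t) / alpha

# z should beat any nearby positive point on the proximal objective.
for _ in range(200):
    y = z * np.exp(0.1 * rng.standard_normal(3))
    assert prox_obj(z) <= prox_obj(y) + 1e-12
```

With a nontrivial constraint set $\mathcal{X}$, equation 16 says one additionally takes the Bregman projection of $z_{t+1}$ onto $\mathcal{X}$.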
To bound the dynamic regret, we begin by using the strong convexity of the cost function $f_t(\cdot)$, i.e.,

$$\begin{aligned} f_t(x_t) - f_t(x_t^*) &\le \langle \nabla f_t(x_t), x_t - x_t^* \rangle - \lambda D_r(x_t^*, x_t) \\ &= \frac{1}{\alpha} \langle \nabla r(x_t) - \nabla r(z_{t+1}), x_t - x_t^* \rangle - \lambda D_r(x_t^*, x_t) \\ &= \frac{1}{\alpha} \big( D_r(x_t^*, x_t) - D_r(x_t^*, z_{t+1}) + D_r(x_t, z_{t+1}) \big) - \lambda D_r(x_t^*, x_t) \\ &\le \frac{1}{\alpha} \big( D_r(x_t^*, x_t) - D_r(x_t^*, x_{t+1}) - D_r(x_{t+1}, z_{t+1}) + D_r(x_t, z_{t+1}) \big) - \lambda D_r(x_t^*, x_t) \\ &\le \Big( \frac{1}{\alpha} - \lambda \Big) D_r(x_t^*, x_t) + \frac{1}{\alpha} \big( D_r(x_t, z_{t+1}) - D_r(x_{t+1}, z_{t+1}) \big) \\ &\le \Big( \frac{1}{\alpha} - \lambda \Big) \frac{f_t(x_t) - f_t(x_t^*)}{\lambda} + \frac{1}{\alpha} \big( D_r(x_t, z_{t+1}) - D_r(x_{t+1}, z_{t+1}) \big), \end{aligned} \tag{26}$$

where in the second line we have used the alternative mirror descent update stated in Lemma 7, i.e., $\nabla f_t(x_t) = \frac{1}{\alpha}(\nabla r(x_t) - \nabla r(z_{t+1}))$. To obtain the third line, we have utilized the Bregman divergence property in equation 24. We have used the Bregman projection property in equation 25 in the fourth line. By omitting some negative terms and using equation 14, we obtain the right-hand side of equation 26. Thus, if $\alpha > \frac{1}{2\lambda}$, we have

$$\begin{aligned} f_t(x_t) - f_t(x_t^*) &\le \frac{\lambda}{2\alpha\lambda - 1} \big( D_r(x_t, z_{t+1}) - D_r(x_{t+1}, z_{t+1}) \big) \\ &\overset{(a)}{\le} \frac{\lambda K}{2\alpha\lambda - 1} \|x_{t+1} - x_t\| \\ &\le \frac{\lambda K}{2\alpha\lambda - 1} \big( \|x_{t+1} - x_t^*\| + \|x_t - x_t^*\| \big) \\ &\overset{(b)}{\le} \frac{\lambda K}{2\alpha\lambda - 1} \Big( 1 + \sqrt{L_r \beta^M} \Big) \|x_t - x_t^*\|, \end{aligned} \tag{27}$$

where we have used the Lipschitz continuity of the Bregman divergence to obtain inequality (a), and we have applied Lemma 8 to obtain inequality (b). Summing equation 27 over time, we have

$$\text{Reg}_T^d = \sum_{t=1}^T \big( f_t(x_t) - f_t(x_t^*) \big) \le \frac{\lambda K}{2\alpha\lambda - 1} \Big( 1 + \sqrt{L_r \beta^M} \Big) \sum_{t=1}^T \|x_t - x_t^*\|. \tag{28}$$

Now, we proceed to bound $\sum_{t=1}^T \|x_t - x_t^*\|$ as follows:

$$\begin{aligned} \sum_{t=1}^T \|x_t - x_t^*\| &= \|x_1 - x_1^*\| + \sum_{t=2}^T \|x_t - x_t^*\| \\ &\le \|x_1 - x_1^*\| + \sum_{t=2}^T \big( \|x_t - x_{t-1}^*\| + \|x_{t-1}^* - x_t^*\| \big) \\ &\overset{(a)}{\le} \|x_1 - x_1^*\| + \sum_{t=2}^T \sqrt{L_r \beta^M} \|x_{t-1} - x_{t-1}^*\| + \sum_{t=2}^T \|x_t^* - x_{t-1}^*\|, \end{aligned} \tag{29}$$

where we used the result of Lemma 8 to obtain inequality (a). If $M \ge \Big( \frac{1}{2} + \frac{1}{2\alpha\lambda} \Big) \log L_r$, we have

$$\beta^M = \Big( 1 - \frac{2\alpha\lambda}{1 + \alpha\lambda} \Big)^M \le \exp\Big( -\frac{2M\alpha\lambda}{1 + \alpha\lambda} \Big) \le \frac{1}{L_r}, \tag{30}$$

which implies $\sqrt{L_r \beta^M} \le 1$.
Therefore, by combining equation 29 and equation 30, we have

$$\sum_{t=1}^T \|x_t - x_t^*\| \le \frac{\|x_1 - x_1^*\|}{1 - \sqrt{L_r \beta^M}} + \frac{\sum_{t=2}^T \|x_t^* - x_{t-1}^*\|}{1 - \sqrt{L_r \beta^M}}. \tag{31}$$

Finally, substituting equation 31 into equation 28 completes the proof.

D PROOF OF THEOREM 3

In order to bound the dynamic regret, we begin with the smoothness condition of the cost function $f_t(\cdot)$, i.e.,

$$f_t(x_t) - f_t(x_t^*) \le \langle \nabla f_t(x_t^*), x_t - x_t^* \rangle + \frac{L}{2}\|x_t - x_t^*\|^2 \le \|\nabla f_t(x_t^*)\|_* \|x_t - x_t^*\| + \frac{L}{2}\|x_t - x_t^*\|^2. \tag{32}$$

Next, we use the fact that

$$\|\nabla f_t(x_t^*)\|_* \|x_t - x_t^*\| \le \frac{\|\nabla f_t(x_t^*)\|_*^2}{2\theta} + \frac{\theta \|x_t - x_t^*\|^2}{2}, \tag{33}$$

for any arbitrary positive constant $\theta > 0$. Thus, we have

$$f_t(x_t) - f_t(x_t^*) \le \frac{\|\nabla f_t(x_t^*)\|_*^2}{2\theta} + \frac{L + \theta}{2} \|x_t - x_t^*\|^2. \tag{34}$$

Summing equation 34 over time, we obtain

$$\text{Reg}_T^d = \sum_{t=1}^T \big( f_t(x_t) - f_t(x_t^*) \big) \le \sum_{t=1}^T \frac{\|\nabla f_t(x_t^*)\|_*^2}{2\theta} + \frac{L + \theta}{2} \sum_{t=1}^T \|x_t - x_t^*\|^2. \tag{35}$$

Now, we proceed by bounding $\sum_{t=1}^T \|x_t - x_t^*\|^2$ as follows:

$$\begin{aligned} \sum_{t=1}^T \|x_t - x_t^*\|^2 &= \|x_1 - x_1^*\|^2 + \sum_{t=2}^T \|x_t - x_{t-1}^* + x_{t-1}^* - x_t^*\|^2 \\ &\le \|x_1 - x_1^*\|^2 + \sum_{t=2}^T \big( 2\|x_t - x_{t-1}^*\|^2 + 2\|x_{t-1}^* - x_t^*\|^2 \big) \\ &\le \|x_1 - x_1^*\|^2 + 2\beta^M L_r \sum_{t=1}^{T-1} \|x_t - x_t^*\|^2 + 2\sum_{t=2}^T \|x_{t-1}^* - x_t^*\|^2, \end{aligned} \tag{36}$$

where the last inequality uses Lemma 8. We note that if $M \ge \Big( \frac{1}{2} + \frac{1}{2\alpha\lambda} \Big) \log 2L_r$, then $2\beta^M L_r \le 1$. Therefore, from equation 36 we can obtain

$$\sum_{t=1}^T \|x_t - x_t^*\|^2 \le \frac{\|x_1 - x_1^*\|^2}{1 - 2\beta^M L_r} + \frac{2}{1 - 2\beta^M L_r} \sum_{t=2}^T \|x_t^* - x_{t-1}^*\|^2. \tag{37}$$

Substituting equation 37 into equation 35 completes the proof.
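The splitting step in equation 33 is the weighted Young's inequality $ab \le \frac{a^2}{2\theta} + \frac{\theta b^2}{2}$, which follows from $(a/\sqrt{\theta} - \sqrt{\theta}\,b)^2 \ge 0$ and is tight at $\theta = a/b$. A quick numerical check (a sketch, not the paper's code):

```python
import numpy as np

# Weighted Young's inequality: a*b <= a^2/(2*theta) + theta*b^2/2 for theta > 0.
rng = np.random.default_rng(2)
for _ in range(1000):
    a, b, theta = rng.uniform(0.01, 10, size=3)
    assert a * b <= a**2 / (2 * theta) + theta * b**2 / 2 + 1e-9
```

Choosing $\theta$ is exactly the trade-off in Theorem 3 between the sum of squared gradients and the squared path length.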

E PROOF OF THEOREM 5

The proof of Theorem 5 initially follows the first half of the proof of Theorem 2, which is repeated here for completeness. To analyze the dynamic regret, we first use the strong convexity of the cost function $f_t(\cdot)$, i.e.,

$$\begin{aligned} f_t(x_t) - f_t(x_t^*) &\le \langle \nabla f_t(x_t), x_t - x_t^* \rangle - \lambda D_r(x_t^*, x_t) \\ &= \frac{1}{\alpha} \langle \nabla r(x_t) - \nabla r(z_{t+1}), x_t - x_t^* \rangle - \lambda D_r(x_t^*, x_t) \\ &= \frac{1}{\alpha} \big( D_r(x_t^*, x_t) - D_r(x_t^*, z_{t+1}) + D_r(x_t, z_{t+1}) \big) - \lambda D_r(x_t^*, x_t) \\ &\le \frac{1}{\alpha} \big( D_r(x_t^*, x_t) - D_r(x_t^*, x_{t+1}) - D_r(x_{t+1}, z_{t+1}) + D_r(x_t, z_{t+1}) \big) - \lambda D_r(x_t^*, x_t) \\ &\le \Big( \frac{1}{\alpha} - \lambda \Big) D_r(x_t^*, x_t) + \frac{1}{\alpha} \big( D_r(x_t, z_{t+1}) - D_r(x_{t+1}, z_{t+1}) \big) \\ &\le \Big( \frac{1}{\alpha} - \lambda \Big) \frac{f_t(x_t) - f_t(x_t^*)}{\lambda} + \frac{1}{\alpha} \big( D_r(x_t, z_{t+1}) - D_r(x_{t+1}, z_{t+1}) \big), \end{aligned} \tag{38}$$

where in the second line we have used the alternative mirror descent update stated in Lemma 7, i.e., $\nabla f_t(x_t) = \frac{1}{\alpha}(\nabla r(x_t) - \nabla r(z_{t+1}))$. To obtain the third line, we have utilized the Bregman divergence property in equation 24. We have used the Bregman projection property in equation 25 in the fourth line. By omitting some negative terms and using equation 14, we obtain the right-hand side of equation 38. Therefore, if $\alpha > \frac{1}{2\lambda}$, we have

$$f_t(x_t) - f_t(x_t^*) \le \frac{\lambda}{2\alpha\lambda - 1} \big( D_r(x_t, z_{t+1}) - D_r(x_{t+1}, z_{t+1}) \big). \tag{39}$$

Now we continue to bound $D_r(x_t, z_{t+1})$.

F CLOSED-FORM UPDATE FOR MIRROR DESCENT

In this section, we derive the closed-form mirror descent update in equation 6.

…which is a natural consequence of steady fluctuation in the sequence of dynamic minimizers $x_t^*$, as explained before. Next, we consider the case where the cost function switches between two functions. Both functions are of the quadratic form $f_t(x) = \|A_t x - b_t\|_2^2$, where $A_t \in \mathbb{R}^{d \times d}$ and $b_t \in \mathbb{R}^d$. In particular, we assume that the parameter $A_t$ is switched every $\tau$ rounds, so that the cost function alternates between $f_t^{(1)}(\cdot)$ and $f_t^{(2)}(\cdot)$. In our experiment, we set $d_1 = 10$, $d = 1000$, $p_1 = 0.9$, and $p_2 = 0.1$. We further set the switching period $\tau = 10$ and the parameter $\alpha = 0.02$.
The dynamic regret roughly reflects the accumulated mismatch error over time. In Fig. 4 , we compare the performance of OMMD with that of other alternatives in terms of the dynamic regret. OMMD with M = 10 nearly halves the dynamic regret of DMD after 300 rounds. Furthermore, the benefit of applying multiple steps of mirror descent can be significant even for smaller values of M .



A more general definition of the dynamic regret was introduced in (Zinkevich, 2003), which allows comparison against an arbitrary sequence $\{u_t\}_{t=1}^T$. We note that the regret bounds developed in (Zinkevich, 2003) also hold for the specific case of $u_t = x_t^*$. We note that the regret bounds derived in (Jadbabaie et al., 2015) are under the same definition as (Zinkevich, 2003).



and (Zhang et al., 2017) provide upper bounds of $O(C_T)$ and $O(\min(C_T, S_T))$, respectively, on the dynamic regret of online gradient descent with single and multiple gradient queries, while (Hall & Willett, 2015) presents an upper bound of $O(\sqrt{T}(1 + C_T))$ on the dynamic regret of online single-step mirror descent. Corollary 6 shows that OMMD can improve the dynamic regret bound to $O(\min(C_T, S_T, G_T))$.

Figure 1: Dynamic regret comparison on CIFAR-10 dataset.

(b) If $x = \operatorname*{argmin}_{x' \in \mathcal{X}} D_r(x', z)$, i.e., $x$ is the Bregman projection of $z$ onto the set $\mathcal{X}$, then for any arbitrary point $y \in \mathcal{X}$, we have

$$D_r(y, z) \ge D_r(y, x) + D_r(x, z). \tag{25}$$
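This generalized Pythagorean inequality can be illustrated numerically. The example below is hypothetical, not from the paper: with $r(x) = \frac{1}{2}\|x\|^2$ the Bregman projection is the Euclidean projection, and projecting onto the nonnegative orthant is simply coordinate-wise clipping.

```python
import numpy as np

rng = np.random.default_rng(3)

def D_r(a, b):  # Euclidean Bregman divergence, r(x) = 0.5*||x||^2
    return 0.5 * np.sum((a - b) ** 2)

for _ in range(100):
    z = rng.standard_normal(4)
    x = np.maximum(z, 0.0)              # Bregman (Euclidean) projection onto X = R^4_+
    y = np.abs(rng.standard_normal(4))  # arbitrary point of X
    assert D_r(y, z) >= D_r(y, x) + D_r(x, z) - 1e-12
```

The inequality is what lets the proofs above replace $-D_r(x_t^*, z_{t+1})$ with $-D_r(x_t^*, x_{t+1}) - D_r(x_{t+1}, z_{t+1})$.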

Figure 3: Dynamic regret comparison on MNIST dataset.

Figure 4: Dynamic regret comparison for switching cost.

$d_1 + d_2 = d$, and $b_t = [\frac{1}{t^{p_2}}, \ldots, \frac{1}{t^{p_2}}]$. Therefore, at each round the cost function is either $f_t^{(1)}(\cdot)$ or $f_t^{(2)}(\cdot)$.

Algorithm 1 Online Multiple Mirror Descent
Input: arbitrary initialization of $x_1 \in \mathcal{X}$; step size $\alpha$; time horizon $T$.
Output: sequence of decisions $\{x_t : 1 \le t \le T\}$.
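The fragment above omits the loop body of Algorithm 1. A minimal runnable sketch under assumed details not fixed by the fragment: the constraint set is the probability simplex with negative-entropy regularization (as in the paper's experiments), so each of the $M$ mirror descent steps on the just-revealed cost $f_t$ is an exponentiated-gradient update followed by normalization.

```python
import numpy as np

def ommd(grad_fns, d, alpha=0.1, M=5):
    """Sketch of OMMD on the probability simplex.

    grad_fns: list of per-round gradient oracles, grad_fns[t](x) = grad f_t(x).
    Returns the sequence of decisions x_1, ..., x_{T+1}.
    """
    x = np.full(d, 1.0 / d)            # arbitrary feasible initialization x_1
    decisions = [x.copy()]
    for grad in grad_fns:
        y = x.copy()
        for _ in range(M):             # M mirror descent steps on f_t
            y = y * np.exp(-alpha * grad(y))
            y = y / y.sum()            # Bregman (KL) projection onto the simplex
        x = y                          # commit as the next decision x_{t+1}
        decisions.append(x.copy())
    return decisions
```

For example, with the hypothetical tracking cost $f_t(x) = \frac{1}{2}\|x - p\|^2$ and gradient oracle `lambda x: x - p`, the iterates drift toward the simplex point $p$ while staying feasible.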

By the definition of the Bregman divergence, we have

$$D_r(x_t, z_{t+1}) + D_r(z_{t+1}, x_t) = \langle \nabla r(x_t) - \nabla r(z_{t+1}), x_t - z_{t+1} \rangle = \alpha \langle \nabla f_t(x_t), x_t - z_{t+1} \rangle \le \alpha \|\nabla f_t(x_t)\|_* \|x_t - z_{t+1}\|.$$

Moreover, since $r(\cdot)$ is 1-strongly convex,

$$\frac{1}{2}\|z_{t+1} - x_t\|^2 \le r(z_{t+1}) - r(x_t) - \langle \nabla r(x_t), z_{t+1} - x_t \rangle = D_r(z_{t+1}, x_t).$$


Let $r(y) = \sum_{j=1}^d y_j \log(y_j)$ be the negative entropy. Then, we have

$$D_r(y, y_t^i) = \sum_{j=1}^d y_j \log\Big( \frac{y_j}{y_{t,j}^i} \Big) = D_{KL}(y, y_t^i),$$

where $y_{t,j}^i$ denotes the $j$-th component of the decision vector $y_t^i$, and $D_{KL}(y, y_t^i)$ represents the KL divergence between $y$ and $y_t^i$. Now consider the update in equation 5, which can be written as follows:

$$\min_{y} \ \langle \nabla f_t(y_t^i), y \rangle + \frac{1}{\alpha} D_{KL}(y, y_t^i) \quad \text{subject to} \quad \langle 1, y \rangle = 1, \ y \ge 0.$$

The Lagrangian of the above problem is given by

$$\mathcal{L}(y, \lambda, \gamma) = \langle \nabla f_t(y_t^i), y \rangle + \frac{1}{\alpha} \sum_{j=1}^d y_j \log\Big( \frac{y_j}{y_{t,j}^i} \Big) + \lambda \big( \langle 1, y \rangle - 1 \big) - \langle \gamma, y \rangle,$$

where $\lambda \in \mathbb{R}$ and $\gamma \in \mathbb{R}_+^d$ are Lagrange multipliers corresponding to the constraints. Next, we take the derivative with respect to $y_j$ to obtain

$$\nabla_j f_t(y_t^i) + \frac{1}{\alpha} \Big( \log\Big( \frac{y_j}{y_{t,j}^i} \Big) + 1 \Big) + \lambda - \gamma_j.$$

Setting the above to zero results in the following closed-form update:

$$y_{t,j}^{i+1} = \frac{y_{t,j}^i \exp\big( -\alpha \nabla_j f_t(y_t^i) \big)}{\sum_{k=1}^d y_{t,k}^i \exp\big( -\alpha \nabla_k f_t(y_t^i) \big)}.$$

G ADDITIONAL EXPERIMENTS

In this section, we present additional experiments to study the performance of OMMD. In the first experiment, we use the MNIST dataset. In the second experiment, we consider a switching problem where the cost function switches between two quadratic functions after a specific number of rounds.

First, we consider the well-known MNIST digits dataset, where every data sample $\omega$ is an image of size $28 \times 28$ pixels that can be represented by a 784-dimensional vector, i.e., $d = 784$. Each sample corresponds to one of the digits in $\{0, 1, \ldots, 9\}$, and thus there are $c = 10$ different classes. The goal of the learner is to classify streaming digit images in an online fashion.

We consider a robust regression problem, where the cost function for the batch of data samples at time $t$ is given by an $\ell_1$-norm regression loss, where $x$ is the optimization variable belonging to the constraint set $\mathcal{X} = \{x : x \in \mathbb{R}_+^n, \|x\|_1 = 1\}$. We use the negative entropy regularization function, i.e., $r(x) = \sum_{i=1}^d x_i \log(x_i)$, which is strongly convex with respect to the $\ell_1$-norm. We set the step size $\alpha = 0.1$ and use a batch size of 20 data examples per round.

From Fig. 3, we again observe that OMMD consistently outperforms the other alternatives. In particular, compared with DMD, applying $M = 10$ steps of mirror descent can reduce the dynamic regret by up to 20%. We also see that the dynamic regret grows linearly with the number of rounds,
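The closed-form entropic update derived in Appendix F can be sketched in a few lines (a sketch of the standard exponentiated-gradient solution, assuming the simplex-constrained proximal problem above): each coordinate is multiplied by $\exp(-\alpha \nabla_j f_t(y_t^i))$ and the result is renormalized.

```python
import numpy as np

def entropic_step(y, g, alpha):
    """Closed-form entropic mirror descent step on the probability simplex."""
    w = y * np.exp(-alpha * g)
    return w / w.sum()

# Sanity check: the closed form should minimize the proximal objective
# <g, y_new> + (1/alpha) * KL(y_new, y) over the simplex.
def prox_obj(y_new, y, g, alpha):
    kl = np.sum(y_new * np.log(y_new / y))
    return np.dot(g, y_new) + kl / alpha

rng = np.random.default_rng(4)
y = np.array([0.3, 0.4, 0.3])
g = np.array([0.5, -1.0, 2.0])
y_new = entropic_step(y, g, alpha=0.2)
for _ in range(300):                   # random simplex points should not do better
    cand = rng.dirichlet(np.ones(3))
    assert prox_obj(y_new, y, g, 0.2) <= prox_obj(cand, y, g, 0.2) + 1e-12
```

A zero gradient leaves the iterate unchanged, and the largest mass shifts toward the coordinate with the most negative gradient, as expected of the Lagrangian solution above.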

