MULTI-OBJECTIVE ONLINE LEARNING

Abstract

This paper presents a systematic study of multi-objective online learning. We first formulate the framework of Multi-Objective Online Convex Optimization, which encompasses a novel multi-objective regret. This regret is built upon a sequence-wise extension of the Pareto suboptimality gap, a discrepancy metric commonly used in zero-order multi-objective bandits. We then derive an equivalent form of the regret, making it amenable to optimization via first-order iterative methods. To motivate the algorithm design, we give an explicit example in which equipping OMD with the vanilla min-norm solver for gradient composition incurs a linear regret, which shows that merely regularizing the iterates, as in single-objective online learning, is not enough to guarantee sublinear regrets in the multi-objective setting. To resolve this issue, we propose a novel min-regularized-norm solver that regularizes the composite weights. Combining min-regularized-norm with OMD results in the Doubly Regularized Online Mirror Multiple Descent algorithm. We further derive the multi-objective regret bound for the proposed algorithm, which matches the optimal bound in the single-objective setting. Extensive experiments on several real-world datasets verify the effectiveness of the proposed algorithm.

1. INTRODUCTION

Traditional optimization methods for machine learning are usually designed to optimize a single objective. However, in many real-world applications, we are often required to optimize multiple correlated objectives concurrently. For example, in autonomous driving (Huang et al., 2019; Lu et al., 2019b), self-driving vehicles need to solve multiple tasks such as self-localization and object identification at the same time. In online advertising (Ma et al., 2018a;b), advertising systems need to decide on the exposure of items to different users to maximize both the Click-Through Rate (CTR) and the Post-Click Conversion Rate (CVR). In most multi-objective scenarios, the objectives may conflict with each other (Kendall et al., 2018). Hence, there may not exist any single solution that can optimize all the objectives simultaneously. For example, merely optimizing CTR or CVR will degrade the performance of the other (Ma et al., 2018a;b). Multi-objective optimization (MOO) (Marler & Arora, 2004; Deb, 2014) is concerned with optimizing multiple conflicting objectives simultaneously. It seeks Pareto optimality, where no single objective can be improved without hurting the performance of others. Many different methods for MOO have been proposed, including evolutionary methods (Murata et al., 1995; Zitzler & Thiele, 1999), scalarization methods (Fliege & Svaiter, 2000), and gradient-based iterative methods (Désidéri, 2012). Recently, the Multiple Gradient Descent Algorithm (MGDA) and its variants have been introduced to the training of multi-task deep neural networks and achieved great empirical success (Sener & Koltun, 2018), renewing significant research interest (Lin et al., 2019; Yu et al., 2020; Liu et al., 2021). These methods compute a composite gradient based on the gradient information of all the individual objectives and then apply the composite gradient to update the model parameters.
The composite weights are determined by a min-norm solver (Désidéri, 2012), which yields a common descent direction for all the objectives. However, compared to their increasingly wide application prospects, gradient-based iterative algorithms are relatively understudied, especially in the online learning setting. Multi-objective online learning is of essential importance for two reasons. First, due to the data explosion in many real-world scenarios such as web applications, making in-time predictions requires performing online learning. Second, the theoretical investigation of multi-objective online learning will lay a solid foundation for the design of new optimizers for multi-task deep learning. This is analogous to the single-objective setting, where nearly all the optimizers for training DNNs were initially analyzed in the online setting, such as AdaGrad (Duchi et al., 2011), Adam (Kingma & Ba, 2015), and AMSGrad (Reddi et al., 2018). In this paper, we give a systematic study of multi-objective online learning. To begin with, we formulate the framework of Multi-Objective Online Convex Optimization (MO-OCO). One major challenge in deriving MO-OCO is the lack of a proper regret definition. In the multi-objective setting, in general, no single decision can optimize all the objectives simultaneously. Thus, to devise the multi-objective regret, we need to first extend the single fixed comparator used in the single-objective regret, i.e., the fixed optimal decision, to the entire Pareto optimal set. Then we need an appropriate discrepancy metric to evaluate the gap between vector-valued losses. Intuitively, the Pareto suboptimality gap (PSG) metric, which is frequently used in zero-order multi-objective bandits (Turgay et al., 2018; Lu et al., 2019a), is a very promising candidate. PSG can yield scalarized measurements from any vector-valued loss to a given comparator set.
However, we find that vanilla PSG is unsuitable for our setting since it always yields non-negative values and may be too loose. In a concrete example, we show that the naive PSG-based regret R_I(T) can even be linear w.r.t. T when the decisions are already optimal, which disqualifies it as a regret metric. To overcome the failure of vanilla PSG, we propose its sequence-wise variant, termed S-PSG, which measures the suboptimality of the whole decision sequence with respect to the Pareto optimal set of the cumulative loss function. Optimizing the resulting regret R_II(T) drives the cumulative loss toward the Pareto front. However, as a zero-order metric motivated geometrically, it is too difficult to design appropriate first-order algorithms that optimize it directly. To resolve the issue, we derive a more intuitive equivalent form of R_II(T) via a highly non-trivial transformation. Based on the MO-OCO framework, we develop a novel multi-objective online algorithm termed Doubly Regularized Online Mirror Multiple Descent. The key module of the algorithm is the gradient composition scheme, which calculates a composite gradient in the form of a convex combination of the gradients of all objectives. Intuitively, the most direct way to determine the composite weights is to apply the min-norm solver (Désidéri, 2012) commonly used in offline multi-objective optimization. However, directly applying min-norm is not workable in the online setting. Specifically, the composite weights in min-norm are determined solely by the gradients at the current round. In the online setting, since the gradients are adversarial, they may result in undesired composite weights, which further produce a composite gradient that optimizes the loss in reverse. To rigorously verify this point, we give an example where equipping OMD with vanilla min-norm incurs a linear regret, showing that only regularizing the iterate, as in OMD, is not enough to guarantee sublinear regrets in our setting.
To fix the issue, we devise a novel min-regularized-norm solver with an explicit regularization on the composite weights. Equipping OMD with it results in our proposed algorithm. In theory, we derive a regret bound of O(√T) for DR-OMMD, which matches the optimal bound in the single-objective setting (Hazan et al., 2016) and is tight w.r.t. the number of objectives. Our analysis also shows that DR-OMMD attains a smaller regret bound than linearization with fixed composite weights. We show that, in the two-objective setting with linear losses, the margin between the regret bounds depends on the difference between the composite weights yielded by the two algorithms and the difference between the gradients of the two underlying objectives. To evaluate the effectiveness of DR-OMMD, we conduct extensive experiments on several large-scale real-world datasets. We first realize adaptive regularization via multi-objective optimization, and find that adaptive regularization with DR-OMMD significantly outperforms fixed regularization with linearization, which verifies the effectiveness of DR-OMMD over linearization in the convex setting. Then we apply DR-OMMD to deep online multi-task learning. The results show that DR-OMMD is also effective in the non-convex setting.

2. PRELIMINARIES

In this section, we briefly review the necessary background knowledge of two related fields.

2.1. MULTI-OBJECTIVE OPTIMIZATION

Multi-objective optimization (MOO) is concerned with solving problems that optimize multiple objectives simultaneously (Fliege & Svaiter, 2000; Deb, 2014). In general, since different objectives may conflict with each other, there is no single solution that can optimize all the objectives at the same time, hence the conventional concept of optimality used in the single-objective setting is no longer suitable. Instead, MOO seeks to achieve Pareto optimality. In the following, we give the relevant definitions more formally. We use a vector-valued loss F = (f_1, . . . , f_m) to denote the objectives, where m ≥ 2 and f_i : X → R, i ∈ {1, . . . , m}, X ⊂ R^n, is the i-th loss function.

Definition 1 (Pareto optimality). (a) For any two solutions x, x′ ∈ X, we say that x dominates x′, denoted as x ≺ x′ or x′ ≻ x, if f_i(x) ≤ f_i(x′) for all i, and there exists some i such that f_i(x) < f_i(x′); otherwise, we say that x does not dominate x′, denoted as x ⊀ x′ or x′ ⊁ x. (b) A solution x* ∈ X is called Pareto optimal if it is not dominated by any other solution in X.

Note that there may exist multiple Pareto optimal solutions. For example, it is easy to show that the optimizer of any single objective, i.e., x*_i ∈ arg min_{x∈X} f_i(x), i ∈ {1, . . . , m}, is Pareto optimal. Different Pareto optimal solutions reflect different trade-offs among the objectives (Lin et al., 2019).

Definition 2 (Pareto front). (a) All Pareto optimal solutions form the Pareto set P_X(F). (b) The image of P_X(F) constitutes the Pareto front, denoted as P(F) = {F(x) | x ∈ P_X(F)}.

Now that we have established the notion of optimality in MOO, we proceed to introduce the metrics that measure the discrepancy of an arbitrary solution x ∈ X from being optimal. Recall that, in the single-objective setting with merely one loss function f : Z → R, for any z ∈ Z, the loss difference f(z) − min_{z′′∈Z} f(z′′) directly qualifies as a discrepancy measure.
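The dominance relation and Pareto set in Definitions 1 and 2 can be made concrete on a finite set of loss vectors; the following is a minimal illustrative sketch (the function names are ours, not from the paper):

```python
import numpy as np

def dominates(Fx, Fy):
    """Check whether loss vector Fx Pareto dominates Fy: no entry is worse
    and at least one entry is strictly better (Definition 1)."""
    Fx, Fy = np.asarray(Fx), np.asarray(Fy)
    return bool(np.all(Fx <= Fy) and np.any(Fx < Fy))

def pareto_set(points):
    """Return indices of non-dominated points, i.e., a discrete Pareto set."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

losses = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (3.0, 3.0)]
print(pareto_set(losses))  # [0, 1, 2]: the point (3, 3) is dominated by (2, 2)
```

The first three loss vectors are mutually non-dominated and reflect different trade-offs between the two objectives, matching the remark after Definition 1.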
However, in MOO with more than one loss, for any x ∈ X, the loss difference F(x) − F(x′′), where x′′ ∈ P_X(F), is a vector. Intuitively, the desired discrepancy metric should scalarize the vector-valued loss difference and yield 0 for any Pareto optimal solution. In MOO, there are two commonly used discrepancy metrics, i.e., the Pareto suboptimality gap (PSG) (Turgay et al., 2018) and the Hypervolume (HV) (Bradstreet, 2011). As HV is a complex volume-based metric, it is more difficult to optimize via gradient-based algorithms (Zhang & Golovin, 2020). Hence in this paper we adopt PSG, which has already been extensively used in multi-objective bandits (Turgay et al., 2018; Lu et al., 2019a).

Definition 3 (Pareto suboptimality gap). For any x ∈ X, the Pareto suboptimality gap to a given comparator set Z ⊂ X, denoted as ∆(x; Z, F), is defined as the minimal scalar ϵ ≥ 0 that needs to be subtracted from all entries of F(x) such that F(x) − ϵ1 is not dominated by any point in Z, where 1 denotes the all-one vector in R^m, i.e., ∆(x; Z, F) = inf_{ϵ≥0} ϵ, s.t. ∀ x′′ ∈ Z, ∃ i ∈ {1, . . . , m}, f_i(x) − ϵ < f_i(x′′).

Clearly, PSG is a distance-based discrepancy metric motivated from a purely geometric viewpoint. In practice, the comparator set Z is often set to be the Pareto set X* = P_X(F) (Turgay et al., 2018); in that case, for any x ∈ X, its PSG is always non-negative, and it equals zero if and only if x ∈ P_X(F).

The Multiple Gradient Descent Algorithm (MGDA) is an offline first-order MOO algorithm (Fliege & Svaiter, 2000; Désidéri, 2012). At each iteration l ∈ {1, . . . , L} (L is the number of iterations), it first computes the gradient ∇f_i(x_l) of each objective, then derives the composite gradient g_l^comp = Σ_{i=1}^m λ_l^i ∇f_i(x_l) as a convex combination of these gradients, and finally applies g_l^comp in a gradient descent step to update the decision, i.e., x_{l+1} = x_l − η g_l^comp (η is the step size).
The core part of MGDA is the module that determines the composite weights λ_l = (λ_l^1, . . . , λ_l^m), given by λ_l = arg min_{λ∈S_m} ∥Σ_{i=1}^m λ^i ∇f_i(x_l)∥_2^2, where S_m = {λ ∈ R^m | Σ_{i=1}^m λ^i = 1, λ^i ≥ 0, i ∈ {1, . . . , m}} is the probability simplex in R^m. This is a min-norm solver, which finds the weights in the simplex that yield the minimum L2-norm of the composite gradient. Thus MGDA is also called the min-norm method. Previous works (Désidéri, 2012; Sener & Koltun, 2018) showed that when all f_i are convex, MGDA is guaranteed to decrease all the objectives simultaneously until it reaches a Pareto optimal decision.
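For m = 2, the min-norm problem admits a well-known closed form (Sener & Koltun, 2018): the weight on the first gradient is γ = clip((g_2)^⊤(g_2 − g_1)/∥g_1 − g_2∥_2^2, [0, 1]). The sketch below illustrates one MGDA step built on this closed form (an assumed two-objective setting, not the paper's general m-objective solver):

```python
import numpy as np

def min_norm_2(g1, g2):
    """Closed-form min-norm weights for two gradients: minimize
    ||gamma*g1 + (1-gamma)*g2||^2 over gamma in [0, 1]."""
    diff = g2 - g1
    denom = diff @ diff
    if denom == 0.0:                 # identical gradients: any weights work
        return 0.5, 0.5
    gamma = float(np.clip((g2 @ diff) / denom, 0.0, 1.0))
    return gamma, 1.0 - gamma

def mgda_step(x, grads, eta):
    """One MGDA update with the two-objective min-norm composite gradient."""
    gamma, one_minus = min_norm_2(grads[0], grads[1])
    g_comp = gamma * grads[0] + one_minus * grads[1]
    return x - eta * g_comp

g1, g2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(min_norm_2(g1, g2))  # (0.5, 0.5): orthogonal gradients are weighted equally
```

Here the resulting composite direction −(0.5, 0.5) has a negative inner product with both gradients, i.e., it is a common descent direction.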

2.2. ONLINE CONVEX OPTIMIZATION

Online Convex Optimization (OCO) (Zinkevich, 2003; Hazan et al., 2016) is the most commonly adopted framework for designing online learning algorithms. It can be viewed as a structured repeated game between a learner and an adversary. At each round t ∈ {1, . . . , T}, the learner is required to generate a decision x_t from a convex compact set X ⊂ R^n. Then the adversary replies with a convex function f_t : X → R and the learner suffers the loss f_t(x_t). The goal of the learner is to minimize the regret with respect to the best fixed decision in hindsight, i.e., R(T) = Σ_{t=1}^T f_t(x_t) − min_{x*∈X} Σ_{t=1}^T f_t(x*). A meaningful regret is required to be sublinear in T, i.e., lim_{T→∞} R(T)/T = 0, which implies that when T is large enough, the learner performs as well as the best fixed decision in hindsight. Online Mirror Descent (OMD) (Hazan et al., 2016) is a classic first-order online learning algorithm. At each round t ∈ {1, . . . , T}, OMD yields its decision via x_{t+1} = arg min_{x∈X} η⟨∇f_t(x_t), x⟩ + B_R(x, x_t), where η is the step size, R : X → R is the regularization function, and B_R(x, x′) = R(x) − R(x′) − ⟨∇R(x′), x − x′⟩ is the Bregman divergence induced by R. As a meta-algorithm, by instantiating different regularization functions, OMD induces two important algorithms, i.e., Online Gradient Descent (Zinkevich, 2003) and Online Exponentiated Gradient (Hazan et al., 2016).
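As a concrete instance, taking R(x) = ½∥x∥_2^2 turns the OMD update into projected online gradient descent. The sketch below assumes an L2-ball decision set (our illustrative choice) so that the projection has a simple form:

```python
import numpy as np

def ogd_step(x, grad, eta, radius):
    """OMD with R(x) = ||x||^2 / 2 reduces to projected online gradient
    descent; the decision set here is an L2 ball of the given radius."""
    y = x - eta * grad
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

x = np.zeros(2)
for t in range(1, 101):          # linear losses f_t(x) = <g_t, x>, g_t = (1, -1)
    g = np.array([1.0, -1.0])
    x = ogd_step(x, g, eta=1.0 / np.sqrt(t), radius=1.0)
print(np.round(x, 3))            # settles at the ball's boundary opposite to g
```

With a fixed linear loss, the iterate quickly reaches the boundary point −g/∥g∥ of the ball, which is exactly the best fixed decision in hindsight.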

3. MULTI-OBJECTIVE ONLINE CONVEX OPTIMIZATION

In this section, we formally formulate the MO-OCO framework. Framework overview. Analogously to single-objective OCO, MO-OCO can be viewed as a repeated game between an online learner and the adversarial environment. The main difference is that in MO-OCO the feedback is vector-valued. The general framework of MO-OCO is given as follows. At each round t ∈ {1, . . . , T}, the learner generates a decision x_t from a given convex compact decision set X ⊂ R^n. Then the adversary replies with a vector-valued loss function F_t : X → R^m, whose i-th component f_t^i : X → R is a convex function corresponding to the i-th objective, and the learner suffers the vector-valued loss F_t(x_t). The goal of the learner is to generate a sequence of decisions {x_t}_{t=1}^T that minimizes a certain kind of multi-objective regret. The remaining work in the framework formulation is to give an appropriate regret definition, which is the most challenging part. Recall that the single-objective regret R(T) = Σ_{t=1}^T f_t(x_t) − Σ_{t=1}^T f_t(x*) is defined as the difference between the cumulative loss of the actual decisions {x_t}_{t=1}^T and that of the fixed optimal decision in hindsight x* ∈ arg min_{x∈X} Σ_{t=1}^T f_t(x). When defining the multi-objective analogue of R(T), we encounter two issues. First, in the multi-objective setting, no single decision can in general optimize all the objectives simultaneously, hence we cannot compare the cumulative loss with that of any single decision. Instead, we use the Pareto optimal set X* of the cumulative loss function Σ_{t=1}^T F_t, i.e., X* = P_X(Σ_{t=1}^T F_t), which naturally aligns with the optimality concept in MOO. Second, to compare {x_t}_{t=1}^T and X* in the loss space, we need a discrepancy metric to measure the gap between vector losses. Intuitively, we can adopt the commonly used PSG metric (Turgay et al., 2018). But we find that vanilla PSG is not appropriate for OCO, which differs substantially from the bandits setting.
We explicate the reason in the following.

3.1. THE NAIVE REGRET BASED ON VANILLA PSG FAILS IN MO-OCO

By definition, at each round t, the difference between the decision x_t and the Pareto optimal set can be evaluated by the PSG ∆(x_t; X*, F_t). Naturally, we can formulate the multi-objective regret by accumulating ∆(x_t; X*, F_t) over all rounds, i.e., R_I(T) := Σ_{t=1}^T ∆(x_t; X*, F_t). Recall that the single-objective regret can also be expressed as R(T) = Σ_{t=1}^T (f_t(x_t) − f_t(x*)). Hence, R_I(T) essentially extends the scalar discrepancy f_t(x_t) − f_t(x*) to the PSG metric ∆(x_t; X*, F_t). However, these two discrepancy metrics have a major difference: f_t(x_t) − f_t(x*) can be negative, whereas ∆(x_t; X*, F_t) is always non-negative. In previous bandits settings (Turgay et al., 2018), the discrepancy is intrinsically non-negative, since the comparator set is exactly the Pareto optimal set of the evaluated loss function. However, the non-negativity of PSG can be problematic in our setting, where the comparator set X* is the Pareto set of the cumulative loss function, rather than of the instantaneous loss F_t used for evaluation. Specifically, at some round t, the decision x_t may Pareto dominate all points in X* w.r.t. F_t, which corresponds to the single-objective situation where f_t(x_t) < f_t(x*) is possible at some specific round. In this case, we would expect the discrepancy metric at this round to be negative. However, PSG can only yield 0 in this case, making the regret much looser than we expect. In the following, we provide an example in which the naive regret R_I(T) is linear w.r.t. T even when the decisions x_t are already optimal. Problem instance. Set X = [−2, 2]. Let the loss function be identical among all objectives, i.e., f_t^1(x) = . . . = f_t^m(x), alternating between x and −x. Suppose the time horizon T is an even number; then the Pareto optimal set is X* = X. Now consider the decisions x_t = 1, t ∈ {1, . . . , T}.
In this case, it can easily be checked that the single-objective regret of each objective is zero, indicating that these decisions are optimal for each objective. To calculate R_I(T), notice that when all the objectives are identical, PSG reduces to ∆(x_t; X*, f_t^1) = sup_{x*∈X*} max{f_t^1(x_t) − f_t^1(x*), 0} at each round t. Hence, in this case we have R_I(T) = Σ_{1≤k≤T/2} (sup_{x*∈[−2,2]} max{1 − x*, 0} + sup_{x*∈[−2,2]} max{x* − 1, 0}) = Σ_{1≤k≤T/2} (3 + 1) = 2T, which is linear w.r.t. T. Therefore, R_I(T) is too loose to measure the suboptimality of decisions, which disqualifies it as a regret metric.
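The calculation above can be checked numerically; in this sketch the comparator set [−2, 2] is approximated by a fine grid (an assumption of the discretization, not part of the paper's construction):

```python
import numpy as np

grid = np.linspace(-2.0, 2.0, 4001)   # discretized comparator set X* = [-2, 2]

def psg_1d(f, x_t):
    """Single-objective PSG: sup over comparators of max{f(x_t) - f(x*), 0}."""
    return max(float(np.max(f(x_t) - f(grid))), 0.0)

T = 100                               # even horizon; odd rounds f = x, even rounds f = -x
R_I = sum(psg_1d((lambda x: x) if t % 2 == 1 else (lambda x: -x), 1.0)
          for t in range(1, T + 1))
print(R_I / T)                        # prints 2.0: each odd round contributes 3, each even round 1
```

The per-round contributions (3 at odd rounds, 1 at even rounds) sum to a regret growing linearly in T, even though the decisions x_t = 1 are optimal for every objective.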

3.2. THE ALTERNATIVE REGRET BASED ON SEQUENCE-WISE PSG

In light of the failure of the naive regret, we need to modify the discrepancy metric in our setting. Recall that the single-objective regret can be interpreted as the gap between the actual cumulative loss Σ_{t=1}^T f_t(x_t) and its optimal value min_{x∈X} Σ_{t=1}^T f_t(x). In analogy, we can measure the gap between Σ_{t=1}^T F_t(x_t) and the Pareto front P* of the cumulative loss Σ_{t=1}^T F_t. However, vanilla PSG is a pointwise metric, i.e., it can only measure the suboptimality of a single decision point. To evaluate the decision sequence {x_t}_{t=1}^T, we modify its definition and propose a sequence-wise variant of PSG.

Definition 4 (Sequence-wise PSG). For any decision sequence {x_t}_{t=1}^T, the sequence-wise PSG (S-PSG) to a given comparator set X* w.r.t. the loss sequence {F_t}_{t=1}^T is defined as ∆({x_t}_{t=1}^T; X*, {F_t}_{t=1}^T) = inf_{ϵ≥0} ϵ, s.t. ∀ x′′ ∈ X*, ∃ i ∈ {1, . . . , m}, Σ_{t=1}^T f_t^i(x_t) − ϵ < Σ_{t=1}^T f_t^i(x′′).

Since X* is the Pareto set of Σ_{t=1}^T F_t, S-PSG measures the discrepancy from the cumulative loss of the decision sequence to the Pareto front P*. Now the regret can be directly given as R_II(T) := ∆({x_t}_{t=1}^T; X*, {F_t}_{t=1}^T). R_II(T) has a clear physical meaning: optimizing it drives the cumulative loss toward the Pareto front P*. However, since PSG (or S-PSG) is a zero-order metric motivated in a purely geometric sense, i.e., its calculation requires solving a constrained optimization problem with an unknown boundary {F_t(x′′) | x′′ ∈ X*}, it is difficult to design a first-order algorithm to optimize PSG-based regrets, not to mention the analysis. To resolve this issue, we derive an equivalent form via highly non-trivial transformations, which is more intuitive than the original form.

Proposition 1. The multi-objective regret R_II(T) based on S-PSG has an equivalent form, i.e., R_II(T) = max{ sup_{x*∈X*} inf_{λ*∈S_m} Σ_{t=1}^T λ*⊤(F_t(x_t) − F_t(x*)), 0 }.

Remark.
(i) The above form is closely related to the single-objective regret R(T). Specifically, when m = 1, we can prove that R_II(T) = max{R(T), 0}, so the two regret metrics coincide whenever the single-objective regret is non-negative.
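Since the inner infimum of a linear function over the simplex S_m is attained at a vertex, the equivalent form reduces to a max-min over objectives. This makes it easy to evaluate when X* is approximated by a finite set of comparators, as in the following illustrative sketch (the data values are ours):

```python
import numpy as np

def regret_II(cum_losses, cum_comparator_losses):
    """Equivalent form of R_II for a finite comparator set: the inf over the
    simplex of a linear function is attained at a vertex, so the inner inf
    reduces to a min over the m objectives."""
    gaps = cum_losses[None, :] - cum_comparator_losses   # shape (k, m)
    return max(float(np.max(np.min(gaps, axis=1))), 0.0)

# cumulative loss of the decision sequence, and of three Pareto comparators
cum = np.array([6.0, 5.0])
pareto = np.array([[3.0, 6.0], [4.5, 4.5], [6.0, 3.0]])
print(regret_II(cum, pareto))  # 0.5
```

Note the outer max with 0: if the cumulative loss already dominates every comparator, the regret is clipped at zero rather than becoming negative, mirroring the remark above.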

4. DOUBLY REGULARIZED ONLINE MIRROR MULTIPLE DESCENT

In this section, we present the Doubly Regularized Online Mirror Multiple Descent (DR-OMMD) algorithm, the protocol of which is given in Algorithm 1. At each round t, the learner first computes the gradient of the loss for each objective, then determines the composite weights of all these gradients, and finally applies the composite gradient in the online mirror descent step.

4.1. VANILLA MIN-NORM MAY INCUR LINEAR REGRETS

The core module of DR-OMMD is the composition of gradients. For simplicity, denote the gradients at round t in matrix form as ∇F_t(x_t) = [∇f_t^1(x_t), . . . , ∇f_t^m(x_t)] ∈ R^{n×m}. Then the composite gradient is g_t = ∇F_t(x_t)λ_t, where λ_t denotes the composite weights. As illustrated in the preliminaries, in the offline setting, the min-norm method (Désidéri, 2012; Sener & Koltun, 2018) is a classic way to determine the composite weights, producing a common descent direction that can decrease all the losses simultaneously. Thus, it is tempting to apply it in the online setting. However, directly applying min-norm in the online setting is not workable and may even incur linear regrets. In vanilla min-norm, the composite weights λ_t are determined solely by the gradients ∇F_t(x_t) at the current round t, which are very sensitive to the instantaneous loss F_t. In the online setting, the losses at each round can be adversarially chosen, and thus the corresponding gradients can be adversarial. These adversarial gradients may result in undesired composite weights, which may further produce a composite gradient that even deteriorates the next prediction. In the following, we provide an example in which min-norm incurs a linear regret. We extend OMD (Hazan et al., 2016) to the multi-objective setting, where the composite weights are directly yielded by min-norm. Problem instance. We consider a two-objective problem. The decision domain is X = {(u, v) | u + v ≤ 1/2, v − u ≤ 1/2, v ≥ 0}, and the loss function at each round is F_t(x) = (∥x − a∥^2, ∥x − b∥^2) for t = 2k − 1 and F_t(x) = (∥x − b∥^2, ∥x − c∥^2) for t = 2k, k = 1, 2, . . ., where a = (−2, −1), b = (0, 1), c = (2, −1). For simplicity, we first analyze the case where the total time horizon T is an even number. Then we can compute the Pareto set of the cumulative loss Σ_{t=1}^T F_t as X* = {(u, 0) | −1/2 ≤ u ≤ 1/2}, which lies on the x-axis.
For conciseness of analysis, we instantiate OMD with L2-regularization, which results in the simple OGD algorithm (McMahan, 2011). We start at an arbitrary point x_1 = (u_1, v_1) ∈ X satisfying v_1 > 0. At each round t, writing the decision as x_t = (u_t, v_t), the gradient of each objective at x_t is g_t^1 = (2u_t + 4, 2v_t + 2) for t = 2k − 1 and (2u_t, 2v_t − 2) for t = 2k; g_t^2 = (2u_t, 2v_t − 2) for t = 2k − 1 and (2u_t − 4, 2v_t + 2) for t = 2k. Since 0 ≤ v_t ≤ 1/2, the second entry of either gradient alternates between positive and negative. Using min-norm, the composite weights can be computed as λ_t = ((1 − u_t − v_t)/4, (3 + u_t + v_t)/4) for t = 2k − 1 and ((3 − u_t + v_t)/4, (1 + u_t − v_t)/4) for t = 2k. We observe that both entries of the composite weights alternate between above 1/2 and below 1/2, and ∥λ_{t+1} − λ_t∥_1 ≥ 1. Recall that ∥λ_t∥_1 = 1, hence the composite weights at two consecutive rounds change radically. The resulting composite gradient is g_t^comp = (u_t − v_t + 1, v_t − u_t − 1) for t = 2k − 1 and (u_t + v_t − 1, u_t + v_t − 1) for t = 2k. The fluctuating composite weights mix with the sign-alternating second entries of the gradients, making the second entry of g_t^comp always negative, i.e., v_t − u_t − 1 < 0 and u_t + v_t − 1 < 0 on X. Hence g_t^comp always drives x_t away from the Pareto set X* that coincides with the x-axis. This essentially optimizes the loss in reverse, hence increasing the regret. In fact, we can prove that it even incurs a linear regret. Due to space limitations, we leave the proof of linear regret when T is an odd number to Appendix H. The above results for the problem instance are summarized as follows. Proposition 2. For OMD equipped with vanilla min-norm, there exists a multi-objective online convex optimization problem in which the resulting algorithm incurs a linear regret. Remark. Stability is a basic requirement to ensure meaningful regrets in online learning (McMahan, 2017).
In the single-objective setting, directly regularizing the iterate x t (e.g., OMD) is enough. However, as shown in the above analysis, merely regularizing x t is not enough to attain sublinear regrets in the multi-objective setting, since there is another source of instability, i.e., the composite weights, that affects the direction of composite gradients. Therefore, in multi-objective online learning, besides regularizing the iterates, we also need to explicitly regularize the composite weights.
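The sign pattern derived in the example above can be verified numerically: at every feasible decision, the min-norm composite gradient of either round type has a negative second entry, so the update pushes v_t upward, away from the Pareto set on the x-axis. A small sketch that samples the triangular domain (the sampling scheme is our illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = np.array([-2.0, -1.0]), np.array([0.0, 1.0]), np.array([2.0, -1.0])

def min_norm_composite(g1, g2):
    """Composite gradient of the vanilla min-norm solver for two objectives."""
    diff = g2 - g1
    gamma = float(np.clip((g2 @ diff) / (diff @ diff), 0.0, 1.0))
    return gamma * g1 + (1.0 - gamma) * g2

# sample feasible decisions (u, v) from the triangular domain X
for _ in range(1000):
    u = rng.uniform(-0.5, 0.5)
    v = rng.uniform(0.0, 0.5 - abs(u))
    x = np.array([u, v])
    odd = min_norm_composite(2 * (x - a), 2 * (x - b))    # rounds t = 2k - 1
    even = min_norm_composite(2 * (x - b), 2 * (x - c))   # rounds t = 2k
    assert odd[1] < 0 and even[1] < 0
print("the composite gradient always pushes v_t away from X*")
```

The asserts never fire: the update x_{t+1} = x_t − η g_t^comp increases the second coordinate at every round, matching the linear-regret argument.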

4.2. THE ALGORITHM

Enlightened by the design of regularization in FTRL (McMahan, 2017), we consider the regularizer r(λ, λ_0), where λ_0 is a pre-defined weight vector that may reflect the user preference. This results in a new solver called min-regularized-norm, i.e., λ_t = arg min_{λ∈S_m} ∥∇F_t(x_t)λ∥_2^2 + α_t r(λ, λ_0), where α_t is the regularization strength. Equipping OMD with the new solver yields the proposed algorithm. Note that beyond the regularization on the iterate x_t that is intrinsic to online learning, there is another regularization on the composite weights λ_t in min-regularized-norm. Both regularizations are fundamental, and together they ensure stability in the multi-objective online setting. Hence we call the algorithm Doubly Regularized Online Mirror Multiple Descent (DR-OMMD). In principle, r can take various forms, such as the L1-norm or L2-norm. Here we adopt the L1-norm since it aligns well with the simplex constraint on λ. Min-regularized-norm can be computed very efficiently. When m = 2, it has a closed-form solution. Specifically, suppose the gradients at round t are g_t^1 and g_t^2. Set γ_L = ((g_t^2)^⊤(g_t^2 − g_t^1) − α_t)/∥g_t^2 − g_t^1∥_2^2 and γ_R = ((g_t^2)^⊤(g_t^2 − g_t^1) + α_t)/∥g_t^2 − g_t^1∥_2^2. Given any λ_0 = (γ_0, 1 − γ_0) ∈ S_2, the composite weights are λ_t = (γ_t, 1 − γ_t), where γ_t = max{min{γ_t′, 1}, 0} and γ_t′ = max{min{γ_0, γ_R}, γ_L}. When m > 2, since the constraint set S_m is a simplex, we can employ a Frank-Wolfe solver (Jaggi, 2013) (see the detailed protocol in Appendix E.1). We also discuss the L2-norm case in Appendix E.2. Compared to vanilla min-norm, the composite weights in min-regularized-norm are not fully determined by the adversarial gradients. The resulting relative stability of the composite weights makes the composite gradients more robust to the adversarial environment. In the following, we give a general analysis and prove that DR-OMMD indeed guarantees sublinear regrets.
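The closed form above amounts to soft-thresholding the vanilla min-norm weight toward γ_0; the following sketch implements it for m = 2 (with α = 0 it recovers vanilla min-norm, and for large α it keeps λ_0):

```python
import numpy as np

def min_reg_norm_2(g1, g2, gamma0, alpha):
    """Closed-form min-regularized-norm for m = 2 with the L1 regularizer:
    argmin over gamma in [0, 1] of ||gamma*g1 + (1-gamma)*g2||^2
    + alpha * |lambda - lambda0|_1, i.e., the min-norm weight
    soft-thresholded toward gamma0."""
    diff = g2 - g1
    denom = diff @ diff
    if denom == 0.0:
        return gamma0, 1.0 - gamma0       # identical gradients: keep lambda0
    gL = (g2 @ diff - alpha) / denom
    gR = (g2 @ diff + alpha) / denom
    gamma = max(min(gamma0, gR), gL)      # median of (gamma0, gL, gR)
    gamma = max(min(gamma, 1.0), 0.0)     # clip to the simplex S_2
    return gamma, 1.0 - gamma

g1, g2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(min_reg_norm_2(g1, g2, gamma0=0.9, alpha=0.0))  # (0.5, 0.5): recovers min-norm
print(min_reg_norm_2(g1, g2, gamma0=0.9, alpha=2.0))  # strong regularization keeps gamma0
```

The interval [γ_L, γ_R] widens with α_t, so a larger regularization strength leaves the weights closer to the user-specified λ_0, which is precisely the stability the online setting requires.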

4.3. THEORETICAL ANALYSIS

Our analysis is based on two conventional assumptions (Jadbabaie et al., 2015; Hazan et al., 2016).

Assumption 1. The regularization function R is 1-strongly convex. In addition, the Bregman divergence is γ-Lipschitz continuous, i.e., B_R(x, z) − B_R(y, z) ≤ γ∥x − y∥ for all x, y, z ∈ dom R, where dom R is the domain of R and satisfies X ⊂ dom R ⊂ R^n.

Assumption 2. There exists some finite G > 0 such that for each i ∈ {1, . . . , m}, the i-th loss f_t^i at each round t ∈ {1, . . . , T} is differentiable and G-Lipschitz continuous w.r.t. ∥•∥_2, i.e., |f_t^i(x) − f_t^i(x′)| ≤ G∥x − x′∥_2. Note that in the convex setting this assumption implies bounded gradients, i.e., ∥∇f_t^i(x)∥_2 ≤ G for any t ∈ {1, . . . , T}, i ∈ {1, . . . , m}, x ∈ X.

Theorem 1. Suppose the diameter of X is D. Assume F_t is bounded, i.e., |f_t^i(x)| ≤ F for all x ∈ X, t ∈ {1, . . . , T}, i ∈ {1, . . . , m}. For any λ_0 ∈ S_m, DR-OMMD attains R_II(T) ≤ γD/η_T + Σ_{t=1}^T (η_t/2)(∥∇F_t(x_t)λ_t∥_2^2 + (4F/η_t)∥λ_t − λ_0∥_1).

Remark. When η_t = √(2γD)/(G√T) or √(2γD)/(G√t) and α_t = 4F/η_t, the bound is O(√T). It matches the optimal single-objective bound w.r.t. T (Hazan et al., 2016) and is tight w.r.t. m (justified in Appendix F.2).

Comparison with linearization. Linearization with fixed weights λ_0 ∈ S_m essentially optimizes the scalar loss λ_0^⊤F_t with gradient g_t = ∇F_t(x_t)λ_0. From OMD's tight bound (Theorem 6.8 in Orabona (2019)), we can derive a bound of γD/η_T + Σ_{t=1}^T (η_t/2)∥∇F_t(x_t)λ_0∥_2^2 for linearization. In comparison, when α_t = 4F/η_t, DR-OMMD attains a regret bound of γD/η_T + Σ_{t=1}^T (η_t/2) min_{λ∈S_m}{∥∇F_t(x_t)λ∥_2^2 + α_t∥λ − λ_0∥_1}, which is never larger than that of linearization. Note that although the bound of linearization refers to the single-objective regret R(T), the comparison is reasonable due to the consistency of the two regret metrics, i.e., R_II(T) = max{R(T), 0} when m = 1, as proved in Proposition 1.
In the following, we further investigate the margin in the two-objective setting with linear losses. Suppose the loss functions are f_t^1(x) = x^⊤g_t^1 and f_t^2(x) = x^⊤g_t^2 for some vectors g_t^1, g_t^2 ∈ R^n at each round. Then we can show that the margin is at least (see Appendix F.3 for the detailed proof) M ≥ Σ_{t=1}^T (η_t/4) ∥λ_t − λ_0∥_2^2 • ∥g_t^1 − g_t^2∥_2^2, which indicates the benefit of DR-OMMD. Specifically, while linearization requires an adequate λ_0, DR-OMMD selects a more proper λ_t adaptively; the advantage is more pronounced when the gradients of the different objectives vary wildly. This matches our intuition that linearization suffers from conflicting gradients (Yu et al., 2020), while DR-OMMD can alleviate the conflict by pursuing common descent.

5. EXPERIMENTS

In this section, we conduct experiments to compare DR-OMMD with two baselines: (i) linearization, which performs single-objective online learning on the scalar losses λ_0^⊤F_t with a pre-defined fixed λ_0 ∈ S_m; (ii) min-norm, which equips OMD with the vanilla min-norm solver (Désidéri, 2012) for gradient composition.

5.1. CONVEX EXPERIMENTS: ADAPTIVE REGULARIZATION

Many real-world online scenarios adopt regularization to avoid overfitting. A standard scheme is to add a term r(x) to the loss f_t(x) at each round and optimize the regularized loss f_t(x) + σr(x) (McMahan, 2011), where σ is a pre-defined fixed hyperparameter. The formalism of multi-objective online learning provides a novel way of regularization. As r(x) measures model complexity, it can be regarded as a second objective alongside the primary goal f_t(x). We can augment the loss to F_t(x) = (f_t(x), r(x)) and thereby cast regularized online learning as a two-objective problem. Compared to the standard scheme, our approach chooses σ_t = λ_t^2/λ_t^1 in an adaptive way. We use two large-scale online benchmark datasets. (i) protein is a bioinformatics dataset for protein type classification (Wang, 2002), which has 17 thousand instances with 357 features. (ii) covtype is a biological dataset collected from a non-stationary environment for forest cover type prediction (Blackard & Dean, 1999), which has 50 thousand instances with 54 features. We set the logistic classification loss as the first objective and the squared L2-norm of the model parameters as the second objective. Since the ultimate goal of regularization is to lift predictive performance, we measure the average loss, i.e., Σ_{t≤T} l_t(x_t)/T, where l_t(x_t) is the classification loss at round t. We adopt an L2-norm ball centered at the origin with diameter K = 100 as the decision set. The learning rates are selected by a grid search over {0.1, 0.2, . . . , 3.0}. For DR-OMMD, the parameter α_t is simply set to 0.1. For fixed regularization, the strength σ = (1 − λ_0^1)/λ_0^1 is determined by some λ_0^1 ∈ [0, 1], which is exactly linearization with weights λ_0 = (λ_0^1, 1 − λ_0^1). We run both algorithms with varying λ_0^1 ∈ {0, 0.1, . . . , 1}. In Figure 1, we plot (a) their final performance w.r.t. the choice of λ_0 and (b) their learning curves with a desirable λ_0 (e.g., (0.1, 0.9) on protein).
Other results are deferred to the appendix due to the lack of space. The results show that DR-OMMD consistently outperforms fixed regularization; the gap becomes more significant when λ 0 is not properly set.
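As a concrete illustration of the two-objective view of regularization, the sketch below (our own minimal code, not the experimental implementation) runs one round of projected online gradient descent on a logistic loss paired with the regularizer r(x) = 0.5∥x∥², computing the composite weights with the m = 2 closed form of min-regularized-norm with L1-norm from Appendix E.1; the induced adaptive strength is σ_t = λ_t^2/λ_t^1 = (1 − γ_t)/γ_t. All function names are ours.

```python
import numpy as np

def min_reg_norm_l1(g1, g2, gamma0, alpha):
    """Closed-form m=2 min-regularized-norm with L1 regularization
    (the form derived in Appendix E.1); returns gamma_t such that
    lambda_t = (gamma_t, 1 - gamma_t)."""
    c = float(np.dot(g2 - g1, g2 - g1))
    if c == 0.0:                      # identical gradients: keep the prior
        return gamma0
    gl = (np.dot(g2, g2 - g1) - alpha) / c
    gr = (np.dot(g2, g2 - g1) + alpha) / c
    return float(np.clip(max(min(gamma0, gr), gl), 0.0, 1.0))

def adaptive_reg_round(x, a, y, gamma0, alpha, eta, radius=50.0):
    """One online round: objective 1 is the logistic loss on (a, y),
    objective 2 is the regularizer r(x) = 0.5 * ||x||^2.
    radius=50 corresponds to the diameter-100 L2 ball in the experiments."""
    g1 = -y * a / (1.0 + np.exp(y * np.dot(a, x)))  # grad of logistic loss
    g2 = x.copy()                                   # grad of 0.5*||x||^2
    gamma = min_reg_norm_l1(g1, g2, gamma0, alpha)
    sigma = (1.0 - gamma) / max(gamma, 1e-12)       # adaptive strength
    x = x - eta * (gamma * g1 + (1.0 - gamma) * g2) # composite gradient step
    nrm = np.linalg.norm(x)                         # project onto the ball
    if nrm > radius:
        x = x * (radius / nrm)
    return x, sigma
```

A fixed-regularization baseline corresponds to freezing γ at γ_0; here γ (and hence σ_t) is allowed to move within the band controlled by α.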

5.2. NON-CONVEX EXPERIMENTS: DEEP MULTI-TASK LEARNING

We use MultiMNIST (Sabour et al., 2017), which is a multi-task version of the MNIST dataset for image classification and commonly used in deep multi-task learning (Sener & Koltun, 2018; Lin et al., 2019). In MultiMNIST, each sample is composed of a random digit image from MNIST at the top-left and another image at the bottom-right. The goal is to classify the digit at the top-left (task L) and that at the bottom-right (task R) at the same time. We follow (Sener & Koltun, 2018)'s setup with LeNet. Learning rates in all methods are selected via grid search over {0.0001, 0.001, 0.01, 0.1}. For linearization, we examine different weights (0.25, 0.75), (0.5, 0.5), and (0.75, 0.25). For DR-OMMD, α_t is set according to Theorem 1, and the initial weights are simply set to λ_0 = (0.5, 0.5). Note that in the online setting, samples arrive in a sequential manner, which is different from offline experiments where sample batches are randomly drawn from the training set. Figure 2 compares the average cumulative loss of all the examined methods. We also measure two conventional offline metrics, i.e., the training loss and the test loss (Reddi et al., 2018); the results are similar and deferred to the appendix due to the lack of space. The results show that DR-OMMD outperforms its counterparts using min-norm or linearization in all metrics on both tasks, validating its effectiveness in the non-convex setting.

APPENDIX

The appendix is organized as follows. Appendix A reviews related work. Appendix B validates the correctness of our definition of PSG. Appendix C discusses the domain of the comparator in S-PSG, showing that it makes no difference whether the comparator is selected from the Pareto optimal set or from the whole domain. Appendix D provides the detailed derivation of the equivalent form of R_II(T). Appendix E discusses how to efficiently compute the composite weights for the min-regularized-norm solver. Appendix F discusses the order of DR-OMMD's regret bound with fixed or adaptive learning rates, shows the tightness of the derived bound, and provides more details on the regret comparison between DR-OMMD and linearization. Appendix G supplements more details on the experimental setup and empirical results. Appendices H and I provide detailed proofs of the remaining theoretical claims in the main paper. Finally, Appendix J supplements the regret analysis of DR-OMMD in the strongly convex setting.

A RELATED WORK

In this section, we review previous work in some related fields, i.e., online learning, multi-objective optimization, multi-objective multi-armed bandits, and multi-objective Bayesian optimization.

A.1 ONLINE LEARNING

Online learning aims to make sequential predictions for streaming data. Please refer to the introductory books (Hazan et al., 2016; Orabona, 2019) for more background. Most previous work on online learning is conducted in the single-objective setting. To the best of our knowledge, there are only two lines of work concerning multi-objective learning. The first line provides a multi-objective perspective on the prediction-with-expert-advice (PEA) problem (Koolen, 2013; Koolen & Van Erven, 2015). Specifically, they view each individual expert as a multi-objective criterion, and characterize the Pareto optimal trade-offs among different experts. These works have two main distinctions from our proposed MO-OCO. First, they are still built upon the original PEA problem where the payoff of each expert (or decision) is a scalar, while we focus on vector-valued payoffs. Second, their framework is restricted to an absolute loss game, whereas our framework is general and can be applied to any coordinate-wise convex loss functions. The second line of work studies online learning with vector-valued payoffs via Blackwell approachability (Blackwell, 1956; Mannor et al., 2014; Abernethy et al., 2011). In their framework, the learner is given a target set T ⊂ R^m and its goal is to generate decisions {x_t}_{t=1}^T that minimize the distance between the average loss Σ_{t=1}^T l_t(x_t)/T and the target set T. There are two major differences between Blackwell approachability and our proposed MO-OCO: previous works on Blackwell approachability are zero-order methods and the target set T is often known beforehand (also see the discussion in (Busa-Fekete et al., 2017)), while in MO-OCO we intend to develop a first-order method to reach the unknown Pareto front.

A.2 MULTI-OBJECTIVE OPTIMIZATION

Multi-objective optimization aims to optimize multiple objectives concurrently. Most previous work on multi-objective optimization is conducted in the offline setting, including the batch optimization setting (Désidéri, 2012; Liu et al., 2021) and the stochastic optimization setting (Sener & Koltun, 2018; Lin et al., 2019; Yu et al., 2020; Chen et al., 2020; Javaloy & Valera, 2021). These methods are based on gradient composition, and have shown very promising results in multi-task learning applications. As the first work on multi-objective optimization in the OCO setting, our work differs from these previous works in three aspects. First, we contribute the first formal framework of multi-objective online convex optimization. In particular, our framework is based on a novel equivalent transformation of the PSG metric, which is intrinsically different from previous offline optimization frameworks. Second, we provide a showcase in which a commonly used method in the offline setting, namely min-norm (Désidéri, 2012; Sener & Koltun, 2018), fails to attain sublinear regret in the online setting. Our proposed min-regularized-norm is a novel design for tailoring offline methods to the online setting. Third, the regret analysis of multi-objective online learning is intrinsically different from the convergence analysis in the offline setting (Yu et al., 2020).

A.3 MULTI-OBJECTIVE MULTI-ARMED BANDITS

Another branch of related work studies multi-objective optimization in the multi-armed bandits setting (Busa-Fekete et al., 2017; Tekin & Turgay, 2018; Turgay et al., 2018; Lu et al., 2019a; Degenne et al., 2019). Among these works, the most relevant one to ours is (Turgay et al., 2018), which introduces the Pareto suboptimality gap (PSG) metric to characterize the multi-objective regret in the bandits setting, and proposes a zero-order zooming algorithm to minimize the regret. In this work, our regret definition also utilizes the PSG metric (Turgay et al., 2018). However, as the first study of multi-objective optimization in the OCO setting, our work is intrinsically different from these previous works in the following aspects. First, since PSG is a zero-order metric, we perform a novel equivalent transformation, making it amenable to the OCO setting. Second, our proposed algorithm is a first-order multiple gradient algorithm, whose design principles are completely distinct from those of zero-order algorithms. For example, the concept of the stability of composite weights does not even exist in the design of previous zero-order methods for multi-objective bandits (Turgay et al., 2018; Lu et al., 2019a). Third, the regret analysis of MO-OCO is intrinsically different from that in the bandits setting.

A.4 MULTI-OBJECTIVE BAYESIAN OPTIMIZATION

The final area related to our work is multi-objective Bayesian optimization (Zhang & Golovin, 2020; Konakovic Lukovic et al., 2020; Chowdhury & Gopalan, 2021; Maddox et al., 2021; Daulton et al., 2022), which studies Bayesian optimization with vector-valued feedback. There are two branches of work in this area, using different notions of regret. The first branch is based on scalarization, adopting as the regret the expected gap between scalarized losses over some given distribution (Chowdhury & Gopalan, 2021). In this approach, the distribution of scalarization can be understood as a set of preferences, which needs to be known beforehand. The second branch is based on Pareto optimality (Zhang & Golovin, 2020), which uses hypervolume as the discrepancy metric and adopts the gap between the true Pareto front and the estimated Pareto front as the regret. As the first work on multi-objective optimization in the OCO setting, our work differs from these works in the following aspects. First, the regret definitions are different. Specifically, compared to the first branch based on scalarization, our regret definition is purely motivated by Pareto optimality and does not require any preference in advance; compared to the second branch using hypervolume, we note that hypervolume is mainly used for Pareto front approximation, which is unsuitable for our adversarial setting, where the goal is to drive the cumulative loss toward the Pareto front. Second, multi-objective Bayesian optimization is conducted in a stochastic setting, which typically assumes that the losses follow some Gaussian distribution, whereas our work is conducted in the adversarial setting where the losses can be generated arbitrarily.

B AN EQUIVALENT DEFINITION OF PSG

Recall that in Definition 3, we formulate the PSG metric as a constrained optimization problem. We note that, since the PSG metric is based on the notion of "non-dominance" (Turgay et al., 2018), its most direct form is actually ∆′(x; K*, F) = inf_{ϵ≥0} ϵ, s.t. ∀x″ ∈ K*: ∃i ∈ {1, . . . , m}, f_i(x) − ϵ < f_i(x″), or ∀i ∈ {1, . . . , m}, f_i(x) − ϵ = f_i(x″). At first glance, this definition seems quite different from Definition 3, since it has the extra condition "∀i ∈ {1, . . . , m}, f_i(x) − ϵ = f_i(x″)". In the following, we prove that both definitions actually yield the same value due to the infimum operation on ϵ. Specifically, for any possible pair (x, K*, F), we denote ∆′(x; K*, F) = ϵ′_0 and ∆(x; K*, F) = ϵ_0. By comparing the constraints of both definitions, it is obvious that ϵ_0 must satisfy the constraint of ∆′(x; K*, F), hence the infimum operation guarantees that ϵ′_0 ≤ ϵ_0. It remains to prove that ϵ′_0 ≥ ϵ_0. To this end, we only need to show that ϵ′_0 + ξ satisfies the constraint of ∆(x; K*, F) for any ξ > 0. Consider an arbitrary x″ ∈ K*. From the definition of ∆′(x; K*, F), we know that either ∃i ∈ {1, . . . , m}, f_i(x) − ϵ′_0 < f_i(x″), or ∀i ∈ {1, . . . , m}, f_i(x) − ϵ′_0 = f_i(x″). Whichever condition holds, we must have ∃i ∈ {1, . . . , m}, f_i(x) − ϵ′_0 − ξ < f_i(x″) for any ξ > 0. Since this holds for any x″ ∈ K*, the value ϵ′_0 + ξ lies in the feasible region of ∆(x; K*, F), hence ϵ_0 ≤ ϵ′_0 + ξ for all ξ > 0, and thus ϵ_0 ≤ ϵ′_0. In summary, ∆′(x; K*, F) = ∆(x; K*, F) for any pair (x, K*, F).

C DISCUSSION ON THE DOMAIN OF THE COMPARATOR IN S-PSG

Recall that in Definition 4, the comparator x′ in S-PSG is selected from the Pareto optimal set X* of the cumulative loss Σ_{t=1}^T F_t. This stems from the original definition of PSG (Turgay et al., 2018), which uses the Pareto optimal set as the comparator set. In fact, comparing with the Pareto optimal decisions in X* is already enough to measure the suboptimality of any decision sequence {x_t}_{t=1}^T. The reason is that, for any non-optimal decision x′ ∈ X − X*, there must exist some Pareto optimal decision x″ ∈ X* that dominates x′, hence the suboptimality metric does not need to compare with this non-optimal decision x′. In other words, even if we extend the comparator set in S-PSG to the whole domain X, the modified form is equivalent to the original form based on the Pareto optimal set X*. In the following, we formally prove the equivalence ∆({x_t}_{t=1}^T; X, {F_t}_{t=1}^T) = ∆({x_t}_{t=1}^T; X*, {F_t}_{t=1}^T). Specifically, we modify the definition of S-PSG and let the comparator domain X′ be any subset of the decision domain X, i.e., ∆({x_t}_{t=1}^T; X′, {F_t}_{t=1}^T) = inf_{ϵ≥0} ϵ, s.t. ∀x″ ∈ X′, ∃i ∈ {1, . . . , m}, Σ_{t=1}^T f_t^i(x_t) − ϵ < Σ_{t=1}^T f_t^i(x″). Then the modified regret based on the whole domain X takes the form R′_II(T) = ∆({x_t}_{t=1}^T; X, {F_t}_{t=1}^T). Now we begin to prove the equivalence ∆({x_t}_{t=1}^T; X, {F_t}_{t=1}^T) = ∆({x_t}_{t=1}^T; X*, {F_t}_{t=1}^T). For any X′ ⊂ X, let E(X′) denote the constraint set of ∆({x_t}_{t=1}^T; X′, {F_t}_{t=1}^T), i.e.,

E(X′) = {ϵ ≥ 0 | ∀x″ ∈ X′, ∃i ∈ {1, . . . , m}, Σ_{t=1}^T f_t^i(x_t) − ϵ < Σ_{t=1}^T f_t^i(x″)}, so that ∆({x_t}_{t=1}^T; X′, {F_t}_{t=1}^T) = inf E(X′). Hence, we just need to prove inf E(X) = inf E(X*). On the one hand, since X* ⊂ X, from the above definition of S-PSG it is easy to check that any ϵ ∈ E(X) must satisfy ϵ ∈ E(X*). Hence, we have E(X) ⊂ E(X*). On the other hand, given any ϵ ∈ E(X*), we now check that ϵ ∈ E(X). To this end, we consider an arbitrary point x″ ∈ X in two cases. (i) If x″ ∈ X*, since ϵ ∈ E(X*), we naturally have Σ_{t=1}^T f_t^{i_0}(x_t) − ϵ < Σ_{t=1}^T f_t^{i_0}(x″) for some i_0. (ii) If x″ ∉ X*, since X* is the Pareto optimal set of Σ_{t=1}^T F_t, there must exist some Pareto optimal decision x̂ ∈ X* that dominates x″ w.r.t. Σ_{t=1}^T F_t, which means that Σ_{t=1}^T f_t^i(x̂) ≤ Σ_{t=1}^T f_t^i(x″) for all i ∈ {1, ..., m}. Notice that ϵ ∈ E(X*) gives Σ_{t=1}^T f_t^{i_0}(x_t) − ϵ < Σ_{t=1}^T f_t^{i_0}(x̂) for some i_0, hence in this case we also have Σ_{t=1}^T f_t^{i_0}(x_t) − ϵ < Σ_{t=1}^T f_t^{i_0}(x″). Combining the above two cases, we prove that ϵ ∈ E(X), and consequently E(X*) ⊂ E(X). In summary, we have E(X) = E(X*), hence ∆({x_t}_{t=1}^T; X, {F_t}_{t=1}^T) = inf E(X) = inf E(X*) = ∆({x_t}_{t=1}^T; X*, {F_t}_{t=1}^T). Therefore, it makes no difference whether the comparator in R_II(T) is generated from the Pareto optimal set X* or from the whole domain X.
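The equality inf E(X) = inf E(X*) is easy to spot-check numerically on a toy discrete problem using the closed form of S-PSG derived in Appendix D (a supremum over comparators of min_i max{·, 0}); the helpers below are illustrative sketches with our own names, not part of the algorithm.

```python
import numpy as np

def pareto_set(L):
    """Indices of non-dominated rows of the cumulative loss matrix L
    (rows = decisions, columns = objectives; smaller is better)."""
    n = L.shape[0]
    keep = []
    for j in range(n):
        dominated = any(np.all(L[k] <= L[j]) and np.any(L[k] < L[j])
                        for k in range(n) if k != j)
        if not dominated:
            keep.append(j)
    return keep

def s_psg(cum_play, L, idx):
    """S-PSG of the played cumulative loss vector cum_play against the
    comparators indexed by idx: sup over comparators of min_i max{., 0}."""
    return max(float(np.min(np.maximum(cum_play - L[j], 0.0))) for j in idx)
```

Any dominated comparator yields a value no larger than the Pareto point dominating it, so the supremum over the whole domain coincides with the supremum over the Pareto set.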

D DERIVATION OF THE EQUIVALENT MULTI-OBJECTIVE REGRET FORM

In this section, we formally derive the equivalent form of R_II(T) in Proposition 1, which is highly non-trivial and forms the basis of the subsequent algorithm design and theoretical analysis. Proof of Proposition 1. Recall that the PSG metric used in R_II(T) is an extension of vanilla PSG that handles an entire decision sequence. To motivate the analysis, we first investigate vanilla PSG ∆(x; X*, F), which deals with a single decision x, and derive a useful lemma as follows. Lemma 1. Vanilla PSG has an equivalent form, i.e., ∆(x; X*, F) = sup_{x*∈X*} inf_{λ∈S_m} λ⊤(F(x) − F(x*))_+, where for any vector l = (l_1, ..., l_m) ∈ R^m, the truncation (l)_+ produces a vector whose i-th entry equals max{l_i, 0} for all i ∈ {1, ..., m}. Proof. In the definition of PSG, the evaluated decision x is compared to all Pareto optimal points x′ ∈ X*. For any fixed comparator x′ ∈ X*, we define the pair-wise suboptimality gap w.r.t. F between decisions x and x′ as δ(x; x′, F) = inf_{ϵ≥0} {ϵ | F(x) − ϵ1 ⊁ F(x′)}. Hence, PSG can be expressed as ∆(x; X*, F) = sup_{x′∈X*} δ(x; x′, F). To proceed, we analyze the pair-wise gap δ(x; x′, F). From its definition, we know that δ(x; x′, F) measures the minimal non-negative value that needs to be subtracted from each entry of F(x) until it is no longer dominated by F(x′). Now we consider two cases. (i) If F(x) ⊁ F(x′), i.e., f_{k_0}(x) ≤ f_{k_0}(x′) for some k_0 ∈ {1, ..., m}, nothing needs to be subtracted from F(x) and we directly have δ(x; x′, F) = 0. (ii) If F(x) ≻ F(x′), we have f_k(x) ≥ f_k(x′) for all k ∈ {1, ..., m}, which obviously violates the condition F(x) − ϵ1 ⊁ F(x′) when ϵ = 0. Now let us gradually increase ϵ from zero. Notice that the condition holds only when there exists some k_0 satisfying f_{k_0}(x) − ϵ ≤ f_{k_0}(x′), or equivalently ϵ ≥ f_{k_0}(x) − f_{k_0}(x′). Hence, in this case, we have δ(x; x′, F) = min_{k∈{1,...,m}} {f_k(x) − f_k(x′)}.
Combining the above two cases, we derive an equivalent form of the pair-wise suboptimality gap. Specifically, we can easily check that the following form holds in both cases: δ(x; x′, F) = min_{k∈{1,...,m}} max{f_k(x) − f_k(x′), 0}. To relate the above form with F, denote U_m = {e_k | 1 ≤ k ≤ m} as the set of all unit vectors in R^m; then we equivalently have δ(x; x′, F) = min_{λ∈U_m} λ⊤(F(x) − F(x′))_+. Now the calculation of δ(x; x′, F) is transformed into a minimization problem over λ ∈ U_m. Since U_m is a discrete set, we can apply a linear relaxation trick. Specifically, we now turn to minimize the scalar p(λ) = λ⊤(F(x) − F(x′))_+ over the convex hull of U_m, which is exactly the probability simplex S_m = {λ ∈ R^m | λ ⪰ 0, ∥λ∥_1 = 1}. Note that U_m contains all the vertices of S_m. Since inf_{λ∈S_m} p(λ) is a linear optimization problem, its minimum must be attained at a vertex of the simplex, i.e., at some λ* ∈ U_m. Hence, the relaxed problem is equivalent to the original problem, namely, δ(x; x′, F) = min_{λ∈U_m} λ⊤(F(x) − F(x′))_+ = inf_{λ∈S_m} λ⊤(F(x) − F(x′))_+. Taking the supremum of both sides over x′ ∈ X*, we prove the lemma. ■ The above lemma naturally extends to the sequence-wise variant S-PSG. Specifically, we can extend the pair-wise suboptimality gap δ(x; x′, F) to measure any decision sequence, which now becomes δ({x_t}_{t=1}^T; x′, {F_t}_{t=1}^T) = inf_{ϵ≥0} {ϵ | Σ_{t=1}^T F_t(x_t) − ϵ1 ⊁ Σ_{t=1}^T F_t(x′)}. Then S-PSG can be expressed as ∆({x_t}_{t=1}^T; X*, {F_t}_{t=1}^T) = sup_{x*∈X*} δ({x_t}_{t=1}^T; x*, {F_t}_{t=1}^T).
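The two-case analysis of the pair-wise gap can be checked numerically: the closed form min_k max{f_k(x) − f_k(x′), 0} should agree with a brute-force scan of the original inf-over-ϵ definition up to the grid resolution. The helper names below are ours.

```python
import numpy as np

def gap_closed_form(Fx, Fxp):
    # delta(x; x', F) = min_k max{f_k(x) - f_k(x'), 0}
    return float(np.min(np.maximum(Fx - Fxp, 0.0)))

def gap_from_definition(Fx, Fxp, step=1e-4):
    # Smallest eps >= 0 such that F(x) - eps*1 no longer dominates F(x'),
    # i.e., some coordinate k satisfies f_k(x) - eps <= f_k(x').
    eps = 0.0
    while not np.any(Fx - eps <= Fxp):
        eps += step
    return eps
```

Case (i) of the proof corresponds to the loop exiting immediately with eps = 0; case (ii) corresponds to eps climbing to the smallest coordinate-wise difference.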
Similar to the derivation of the above lemma, by investigating the relation between Σ_{t=1}^T F_t(x_t) and Σ_{t=1}^T F_t(x′), we can derive an equivalent form of δ({x_t}_{t=1}^T; x′, {F_t}_{t=1}^T) as δ({x_t}_{t=1}^T; x′, {F_t}_{t=1}^T) = min_{k∈{1,...,m}} max{Σ_{t=1}^T f_t^k(x_t) − Σ_{t=1}^T f_t^k(x′), 0}, and further δ({x_t}_{t=1}^T; x′, {F_t}_{t=1}^T) = inf_{λ∈S_m} λ⊤(Σ_{t=1}^T F_t(x_t) − Σ_{t=1}^T F_t(x′))_+. Hence, the S-PSG-based regret can be expressed as R_II(T) = sup_{x*∈X*} inf_{λ∈S_m} λ⊤(Σ_{t=1}^T F_t(x_t) − Σ_{t=1}^T F_t(x*))_+. The max-min form of R_II(T) has a truncation operation (·)_+, which brings irregularity to the regret form. To handle the truncation operation, we utilize the following lemma: Lemma 2. (a) For any l ∈ R^m, we have inf_{λ∈S_m} λ⊤(l)_+ = max{inf_{λ∈S_m} λ⊤l, 0}. (b) For any h : X → R, we have sup_{x∈X} max{h(x), 0} = max{sup_{x∈X} h(x), 0}. Proof. To prove the first statement, we consider the following two cases. (i) If l ≻ 0, then (l)_+ = l. For any λ ∈ S_m, we have λ⊤(l)_+ = λ⊤l > 0. Taking the infimum over λ ∈ S_m on both sides, we have inf_{λ∈S_m} λ⊤(l)_+ = inf_{λ∈S_m} λ⊤l ≥ 0. Moreover, from the last equation we have max{inf_{λ∈S_m} λ⊤l, 0} = inf_{λ∈S_m} λ⊤l, which proves the statement in this case. (ii) If l ⊁ 0, then l_i ≤ 0 for some i ∈ {1, ..., m}. Set e_i as the i-th unit vector in R^m; then e_i⊤l ≤ 0. On the one hand, since e_i ∈ S_m, we have inf_{λ∈S_m} λ⊤l ≤ e_i⊤l ≤ 0, and further max{inf_{λ∈S_m} λ⊤l, 0} = 0. On the other hand, notice that e_i⊤(l)_+ = 0 and λ⊤(l)_+ ≥ 0 for any λ ∈ S_m, so inf_{λ∈S_m} λ⊤(l)_+ = e_i⊤(l)_+ = 0. Hence, the statement also holds in this case. To prove the second statement, we also consider two cases. (i) If h(x_0) > 0 for some x_0 ∈ X, then sup_{x∈X} h(x) ≥ h(x_0) > 0, and max{sup_{x∈X} h(x), 0} = sup_{x∈X} h(x). Since we also have sup_{x∈X} max{h(x), 0} = sup_{x∈X} h(x), the statement holds in this case.
(ii) If h(x) ≤ 0 for all x ∈ X, then sup_{x∈X} h(x) ≤ 0, and thus max{sup_{x∈X} h(x), 0} = 0. Meanwhile, for any x ∈ X, we have max{h(x), 0} = 0, which validates the statement in this case. ■ From the above lemma, we directly have R_II(T) = sup_{x*∈X*} max{inf_{λ∈S_m} λ⊤(Σ_{t=1}^T F_t(x_t) − Σ_{t=1}^T F_t(x*)), 0} = max{sup_{x*∈X*} inf_{λ∈S_m} λ⊤(Σ_{t=1}^T F_t(x_t) − Σ_{t=1}^T F_t(x*)), 0}, which gives the desired equivalent form. ■
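Since the infimum of a linear function over the simplex is attained at a vertex, Lemma 2(a) reduces to the scalar identity min_k max{l_k, 0} = max{min_k l_k, 0}, which a short check confirms on random vectors (the function name is ours):

```python
import numpy as np

def lemma_2a_holds(l):
    # The infimum over the simplex of a linear function is attained at a
    # vertex, so inf_lambda lambda^T (l)_+ = min_k max{l_k, 0} and
    # inf_lambda lambda^T l = min_k l_k.
    lhs = float(np.min(np.maximum(l, 0.0)))  # inf lambda^T (l)_+
    rhs = max(float(np.min(l)), 0.0)         # max{inf lambda^T l, 0}
    return abs(lhs - rhs) < 1e-12
```

When all entries of l are positive, both sides equal min_k l_k; when some entry is non-positive, both sides collapse to 0, mirroring the two cases in the proof.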

E CALCULATION OF MIN-REGULARIZED-NORM

In this section, we discuss how to efficiently calculate the solutions to min-regularized-norm with L1-norm and L2-norm regularization.

Algorithm 2 Frank-Wolfe Solver for Min-Regularized-Norm with L1-Norm
1: Initialize: λ_t = (γ_t^1, . . . , γ_t^m) = (1/m, . . . , 1/m).
2: Compute the matrix U = ∇F_t(x_t)⊤∇F_t(x_t), i.e., U_ij = ∇f_t^i(x_t)⊤∇f_t^j(x_t), ∀i, j ∈ {1, . . . , m}.
3: repeat
4:   Select an index k ∈ arg max_{i∈{1,...,m}} {Σ_{j=1}^m γ_t^j U_ij + α sgn(γ_t^i − γ_0^i)}.
5:   Compute δ ∈ arg min_{0≤δ≤1} ∥δ∇f_t^k(x_t) + (1 − δ)∇F_t(x_t)λ_t∥_2^2 + α∥δ(e_k − λ_t) + λ_t − λ_0∥_1.
6:   Update λ_t = (1 − δ)λ_t + δe_k.
7: until δ ≈ 0 or the iteration limit is reached
8: return λ_t.

E.1 L1-NORM

Similar to (Sener & Koltun, 2018), we first consider the setting of two objectives, namely m = 2. In this case, for any λ = (γ, 1 − γ), λ_0 = (γ_0, 1 − γ_0) ∈ S_2, the L1-regularization ∥λ − λ_0∥_1 equals 2|γ − γ_0|. Hence min-regularized-norm with L1-norm at round t reduces to λ_t = (γ_t, 1 − γ_t), where γ_t ∈ arg min_{0≤γ≤1} ∥γg_1 + (1 − γ)g_2∥_2^2 + 2α|γ − γ_0|. Interestingly, the above problem has a closed-form solution. Proposition 3. Set γ_L = (g_2⊤(g_2 − g_1) − α)/∥g_2 − g_1∥_2^2 and γ_R = (g_2⊤(g_2 − g_1) + α)/∥g_2 − g_1∥_2^2. Then min-regularized-norm with L1-norm produces weights λ_t = (γ_t, 1 − γ_t), where γ_t = max{min{γ″_t, 1}, 0} with γ″_t = max{min{γ_0, γ_R}, γ_L}. Proof. We solve the following two quadratic sub-problems: min_{0≤γ≤γ_0} h_1(γ) = ∥γg_1 + (1 − γ)g_2∥_2^2 + 2α(γ_0 − γ), as well as min_{γ_0≤γ≤1} h_2(γ) = ∥γg_1 + (1 − γ)g_2∥_2^2 + 2α(γ − γ_0). It can be checked that in the former sub-problem, h_1 monotonically decreases on (−∞, γ_R] and increases on [γ_R, +∞); in the latter sub-problem, h_2 monotonically decreases on (−∞, γ_L] and increases on [γ_L, +∞).
Since each sub-problem has its own constraint ([0, γ_0] or [γ_0, 1]), the solution to the original optimization problem can be derived by comparing the optimal values of the two constrained sub-problems. Specifically, noticing that γ_L ≤ γ_R and 0 ≤ γ_0 ≤ 1, we consider the following three cases. (i) When 0 ≤ γ_0 ≤ γ_L ≤ γ_R, h_1 monotonically decreases on [0, γ_0] and its minimum on [0, γ_0] is h_1(γ_0). Notice that h_1(γ_0) = h_2(γ_0). For the sub-problem of h_2, we further consider two situations: (i-a) If γ_L ≤ 1, then γ_L ∈ [γ_0, 1], hence the minimum of h_2 on [γ_0, 1] is h_2(γ_L). Since h_2(γ_L) ≤ h_2(γ_0) = h_1(γ_0), the minimal point of the original problem is γ_L, and hence γ_t = γ_L. (i-b) If γ_L > 1, then h_2 monotonically decreases on [γ_0, 1], and we surely have h_2(1) ≤ h_2(γ_0) = h_1(γ_0). Hence γ_t = 1 in this situation. Combining the two situations, we have γ_t = min{γ_L, 1} in this case. (ii) When γ_L ≤ γ_R ≤ γ_0 ≤ 1, h_2 monotonically increases on [γ_0, 1] and its minimum on [γ_0, 1] is h_2(γ_0). Notice that h_1(γ_0) = h_2(γ_0). For the sub-problem of h_1, similar to the first case, we also consider two situations: (ii-a) If γ_R ≥ 0, then γ_R ∈ [0, γ_0], hence the minimum of h_1 on [0, γ_0] is h_1(γ_R). Since h_1(γ_R) ≤ h_1(γ_0) = h_2(γ_0), the minimal point of the original problem is γ_R, and hence γ_t = γ_R. (ii-b) If γ_R < 0, then h_1 monotonically increases on [0, γ_0], hence h_1(0) ≤ h_1(γ_0) = h_2(γ_0), and the solution to the original problem is γ_t = 0. Combining the two situations, we have γ_t = max{γ_R, 0} in this case. (iii) When γ_L < γ_0 < γ_R, h_1 monotonically decreases on [0, γ_0] and h_2 monotonically increases on [γ_0, 1]. Hence each sub-problem attains its minimum at γ_0, and thus γ_t = γ_0. Summarizing the above three cases gives γ_t = min{γ_L, 1} if γ_0 ≤ γ_L; γ_t = max{γ_R, 0} if γ_0 ≥ γ_R; and γ_t = γ_0 otherwise. We can further rewrite this into the compact form γ_t = max{min{γ″_t, 1}, 0}, where γ″_t = max{min{γ_0, γ_R}, γ_L}, which can be checked case by case. This gives the closed-form solution of min-regularized-norm when m = 2. ■

Now that we have derived the closed-form solution to the min-regularized-norm solver with any two gradients, in principle, we can apply (Sener & Koltun, 2018)'s technique to efficiently compute the solution to the solver with more than two gradients. We provide the full procedure in Algorithm 2, which is an extension of (Sener & Koltun, 2018). By following the exact line search technique (Jaggi, 2013) in MGDA, we get our line search oracle as line 5 in Algorithm 2. The first term is the same as that in MGDA, and the second term is an extra L1-regularization term related to the design in Algorithm 1. Unlike the oracle of MGDA, which has a closed-form solution by a reduction to the case of two gradients, the extra L1-norm term makes it difficult to obtain a closed-form solution for our oracle. The reason is that this extra term is the L1-norm of an m-dimensional vector, hence it cannot simply be reduced to the case of two gradients. To proceed, we can directly apply numerical methods to get the solution (e.g., similar to the implementation in (Liu et al., 2021)).

Algorithm 3 Frank-Wolfe Solver for Min-Regularized-Norm with L2-Norm
1: Initialize: λ_t = (γ_t^1, . . . , γ_t^m) = (1/m, . . . , 1/m).
2: Compute the matrix U = ∇F_t(x_t)⊤∇F_t(x_t), i.e., U_ij = ∇f_t^i(x_t)⊤∇f_t^j(x_t), ∀i, j ∈ {1, . . . , m}.
3: repeat
4:   Select an index k ∈ arg max_{i∈{1,...,m}} {Σ_{j=1}^m γ_t^j U_ij + α(γ_t^i − γ_0^i)}.
5:   Compute δ ∈ arg min_{0≤δ≤1} ∥δ∇f_t^k(x_t) + (1 − δ)∇F_t(x_t)λ_t∥_2^2 + α∥δ(e_k − λ_t) + λ_t − λ_0∥_2^2, which has the analytical form δ = max{min{((∇F_t(x_t)λ_t − ∇f_t^k(x_t))⊤∇F_t(x_t)λ_t − α(e_k − λ_t)⊤(λ_t − λ_0)) / (∥∇F_t(x_t)λ_t − ∇f_t^k(x_t)∥_2^2 + α∥e_k − λ_t∥_2^2), 1}, 0}.
6:   Update λ_t = (1 − δ)λ_t + δe_k.
7: until δ ≈ 0 or the iteration limit is reached
8: return λ_t.
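Because the closed form in Proposition 3 is exact, its objective value should never exceed that of any candidate γ on a grid over [0, 1]; the snippet below performs this sanity check (function names are ours).

```python
import numpy as np

def gamma_l1_closed_form(g1, g2, gamma0, alpha):
    # Proposition 3: gamma_t = clip(max(min(gamma0, gamma_R), gamma_L), 0, 1)
    c = float(np.dot(g2 - g1, g2 - g1))
    gamma_l = (np.dot(g2, g2 - g1) - alpha) / c
    gamma_r = (np.dot(g2, g2 - g1) + alpha) / c
    return float(np.clip(max(min(gamma0, gamma_r), gamma_l), 0.0, 1.0))

def l1_objective(gamma, g1, g2, gamma0, alpha):
    # ||gamma*g1 + (1-gamma)*g2||^2 + 2*alpha*|gamma - gamma0|
    v = gamma * g1 + (1.0 - gamma) * g2
    return float(np.dot(v, v)) + 2.0 * alpha * abs(gamma - gamma0)
```

Since the closed form is the true constrained minimizer, its objective value lower-bounds every grid point, so the check is exact up to floating-point error.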

E.2 L2-NORM

Recall that in min-regularized-norm, the regularization on λ can take various forms. In the following, we discuss an alternative regularization, i.e., the L2-regularization r(λ, λ_0) = (1/2)∥λ − λ_0∥_2^2. We will show that, similar to (Sener & Koltun, 2018), min-regularized-norm with L2-norm can be computed very efficiently via the Frank-Wolfe method. As in the previous discussion on L1-regularization, we first consider the setting of m = 2. In this case, for any λ = (γ, 1 − γ), λ_0 = (γ_0, 1 − γ_0) ∈ S_2, the L2-regularization (1/2)∥λ − λ_0∥_2^2 equals (γ − γ_0)^2. Hence min-regularized-norm with L2-norm at round t reduces to λ_t = (γ_t, 1 − γ_t), where γ_t ∈ arg min_{0≤γ≤1} ∥γg_1 + (1 − γ)g_2∥_2^2 + α(γ − γ_0)^2. Since the above problem is quadratic, it also has a closed-form solution. The proof is elementary and hence omitted. Proposition 4. Min-regularized-norm with L2-norm produces weights λ_t = (γ_t, 1 − γ_t), where γ_t = max{min{((g_2 − g_1)⊤g_2 + αγ_0)/(∥g_2 − g_1∥_2^2 + α), 1}, 0}. When m > 2, since λ_t is constrained to the probability simplex S_m, similar to the case of L1-regularization, we can use a Frank-Wolfe method to efficiently calculate the composite weights, which is presented in Algorithm 3. Note that since the line search (line 5) has a closed-form solution, its calculation cost is low, i.e., just the same as the calculation cost of the original min-norm solver (Sener & Koltun, 2018).
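Proposition 4 admits the same style of numerical sanity check: the clipped stationary point of the convex quadratic should do at least as well as every candidate on a grid over [0, 1] (function names are ours).

```python
import numpy as np

def gamma_l2_closed_form(g1, g2, gamma0, alpha):
    # Proposition 4: stationary point of a convex quadratic, clipped to [0, 1]
    num = float(np.dot(g2 - g1, g2)) + alpha * gamma0
    den = float(np.dot(g2 - g1, g2 - g1)) + alpha
    return float(np.clip(num / den, 0.0, 1.0))

def l2_objective(gamma, g1, g2, gamma0, alpha):
    # ||gamma*g1 + (1-gamma)*g2||^2 + alpha*(gamma - gamma0)^2
    v = gamma * g1 + (1.0 - gamma) * g2
    return float(np.dot(v, v)) + alpha * (gamma - gamma0) ** 2
```

Because the objective is a one-dimensional convex quadratic, clipping the unconstrained stationary point to [0, 1] yields the exact constrained minimizer.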

F MORE DETAILS OF THE THEORETICAL RESULTS

In this section, we first prove the remark below Theorem 1, i.e., that with proper choices of η_t and α_t, DR-OMMD is guaranteed to have a sublinear regret bound of O(√T). Then we show the tightness of the derived regret bound of DR-OMMD. Finally, we give a more detailed theoretical comparison with linearization.

F.1 MORE DETAILS OF THE REMARK BELOW THEOREM 1

Recall that in the remark below Theorem 1 in the main paper, we claim that with proper choices of η_t and α_t, DR-OMMD is guaranteed to attain a sublinear regret bound. We summarize this remark in the following corollary and provide a rigorous proof. Corollary 1. (i) (Fixed learning rate) When setting η_t = √(2γD)/(G√T) and α_t = 4F/η_t, for any λ_0 ∈ S_m, DR-OMMD achieves the multi-objective regret R(T) ≤ G√(2γDT). (ii) (Diminishing learning rate) When setting η_t = √(2γD)/(G√t) and α_t = 4F/η_t, for any λ_0 ∈ S_m, DR-OMMD attains the multi-objective regret R(T) ≤ (3/2)G√(2γDT). Proof. We start from the regret bound regarding λ_t in Theorem 1. When α_t = 4F/η_t, from the definition of min-regularized-norm, the composite weights λ_t generated by DR-OMMD at each round satisfy λ_t ∈ arg min_{λ∈S_m} ∥∇F_t(x_t)λ∥_2^2 + (4F/η_t)∥λ − λ_0∥_1. Recall that λ_0 ∈ S_m. Hence, for any t ∈ {1, ..., T}, the last term of the regret bound in Theorem 1 can be bounded as ∥∇F_t(x_t)λ_t∥_2^2 + (4F/η_t)∥λ_t − λ_0∥_1 = min_{λ∈S_m} {∥∇F_t(x_t)λ∥_2^2 + (4F/η_t)∥λ − λ_0∥_1} ≤ ∥∇F_t(x_t)λ_0∥_2^2. From Assumption 2, each gradient g_t^i is bounded as ∥g_t^i∥_2 ≤ G, hence ∥∇F_t(x_t)λ_0∥_2 ≤ Σ_{i=1}^m ∥λ_0^i g_t^i∥_2 = Σ_{i=1}^m λ_0^i ∥g_t^i∥_2 ≤ G. Therefore, when α_t = 4F/η_t, we have ∥∇F_t(x_t)λ_t∥_2^2 + (4F/η_t)∥λ_t − λ_0∥_1 ≤ ∥∇F_t(x_t)λ_0∥_2^2 ≤ G^2. Plugging this into the regret bound in Theorem 1, with the fixed rate η_t = √(2γD)/(G√T) we have R(T) ≤ γD/η_T + (G^2/2)Σ_{t=1}^T η_t = G√(2γDT), which proves the bound with the fixed optimal learning rate. Alternatively, setting η_t = √(2γD)/(G√t) and utilizing Σ_{t=1}^T 1/√t ≤ 2√T, we have R(T) ≤ γD/η_T + (G^2/2)Σ_{t=1}^T η_t = (G√(2γD)/2)(√T + Σ_{t=1}^T 1/√t) ≤ (3/2)G√(2γDT), which proves the bound with the diminishing learning rate. ■

F.2 THE TIGHTNESS OF DR-OMMD'S BOUND

In this subsection, we show that the derived bound in Corollary 1 is tight w.r.t. m for any gradient-based algorithm. Specifically, we follow the standard worst-case analysis for deriving lower bounds and construct a special case in which any gradient-based algorithm will incur a regret of order Ω(√T). Assume f_t^1 = f_t^2 = · · · = f_t^m at each round t. In this case, the instantaneous gradients of all the objectives are identical, i.e., g_t^i = ∇f_t^i(x_t) ≡ ∇f_t^1(x_t) = g_t^1, ∀i ∈ {1, ..., m}. Any gradient-based algorithm can only utilize the gradient information of the objectives, so it cannot distinguish the objective to which a certain gradient belongs. In other words, in this case, any multiple gradient algorithm will treat all gradients in the same way and thus behave like a single-objective algorithm using the single gradient g_t^1. Hence, intuitively, the worst-case bound of any gradient-based algorithm is at least independent of m. In particular, the worst-case bounds of gradient-based algorithms cannot decrease as m increases; otherwise, the above case would be violated. In the following, we provide a detailed proof of the tightness of the O(√T) bound. In the above case, since f_t^1 = f_t^2 = · · · = f_t^m for any t, the cumulative losses of all the objectives are also identical, i.e., Σ_{t=1}^T f_t^1 = Σ_{t=1}^T f_t^2 = · · · = Σ_{t=1}^T f_t^m. Therefore, the Pareto set X* of the cumulative vector loss Σ_{t=1}^T F_t coincides with the optimal decision set of the cumulative loss Σ_{t=1}^T f_t^1 of the first objective, i.e., X* = arg min_{x∈X} Σ_{t=1}^T f_t^1(x). Recall our definition of the multi-objective regret. Since λ⊤F_t(x) = f_t^1(x) for any λ ∈ S_m, we have R(T) = sup_{x*∈X*} (Σ_{t=1}^T f_t^1(x_t) − Σ_{t=1}^T f_t^1(x*)) = Σ_{t=1}^T f_t^1(x_t) − min_{x*∈X} Σ_{t=1}^T f_t^1(x*), which exactly reduces to the single-objective regret R_S(T) defined by the losses {f_t^1}_{t=1}^T of the first objective.
Hence we have R(T) = R_S(T) in this case. Since the losses {f_t^1}_{t=1}^T of the first objective can be chosen adversarially, we can follow Section 3.2 in (Hazan et al., 2016) to construct a sequence {f_t^1}_{t=1}^T that admits a single-objective regret lower bound of Ω(√T). Hence, in this case, any multiple gradient algorithm admits a multi-objective regret R(T) = Ω(√T), matching our derived regret bound for DR-OMMD in terms of both T and m. Some readers may find it counterintuitive that in the multi-objective setting the derived regret bounds do not increase as m increases. We now explain why this independence is reasonable. In fact, the independence of m lies in the adoption of PSG in the formulation of the regret. Recall that, in the definition of PSG, "∃i ∈ {1, . . . , m}" means that it suffices to pick one coordinate i satisfying f_t^i(x_t) − ϵ < f_t^i(x″), which removes the dependence on m. We can also see this from another perspective. Recall from the derivation of Proposition 1 that the regret R(T) has an equivalent form, namely sup_{x*∈X*} inf_{λ*∈S_m} (λ*)⊤ Σ_{t=1}^T (F_t(x_t) − F_t(x*)), or equivalently sup_{x*∈X*} min_{i∈{1,...,m}} Σ_{t=1}^T (f_t^i(x_t) − f_t^i(x*)). In particular, PSG takes a minimum over all objectives, and thus it does not necessarily increase as m increases. There is another intuitive way to understand this independence of m. As is well recognized in existing research on multi-objective optimization (Emmerich & Deutz, 2018), the proportion of Pareto optimal solutions (more precisely, non-dominated solutions) in the decision domain tends to increase rapidly as the number of objectives increases. As a consequence, it might not be harder to reach the Pareto optimal set when m becomes larger; hence, intuitively, the regret bound does not necessarily increase as m increases.

F.3 MORE DETAILS IN THE COMPARISON WITH LINEARIZATION

Recall that in the remark below Theorem 1, we show that our derived bound for DR-OMMD is smaller than that of linearization, and discuss the margin between the two regret bounds in the two-objective setting with linear losses. We now summarize this result in Theorem 2.

Theorem 2. Consider a two-objective setting with linear losses. Suppose the loss functions are $f_t^1(x) = x^\top g_t^1$ and $f_t^2(x) = x^\top g_t^2$ at each round $t$. For any $\lambda_0 = (\gamma_0, 1-\gamma_0) \in S_m$, let $\lambda_t = (\gamma_t, 1-\gamma_t)$ denote the composite weights produced by min-regularized-norm with the L1 norm. When the regularization strength is set as $\alpha_t = 4F/\eta_t$, the margin between the regret bound of linearization with fixed weights $\lambda_0$ and that of DR-OMMD with composite weights $\lambda_t$ is at least
$M \ge \sum_{t=1}^T \frac{\eta_t}{2} (\gamma_t - \gamma_0)^2 \|g_t^1 - g_t^2\|_2^2$.

Before proving the theorem, we remark that the two bounds are essentially of the same order. An analogous phenomenon is common in the offline setting, where multiple gradient algorithms often share the same convergence rate with linearization (Yu et al., 2020; Liu et al., 2021). The benefit of multiple gradient algorithms mainly comes from how the gradients are composed. For example, the notion of common descent (Sener & Koltun, 2018; Yu et al., 2020) eliminates the gradient conflict issue, and the resulting algorithms achieve substantial empirical improvements over linearization. In this paper, we move one step further and quantify the margin between DR-OMMD and linearization: the margin is governed by the gradient difference $g_t^1 - g_t^2$ and the gap between the pre-defined weights $\lambda_0$ and the adaptive weights $\lambda_t$. The regret bound comparison for $m \ge 3$ is left for future research.

Proof of Theorem 2. We first write out the regret bounds of both methods.
For DR-OMMD with $\alpha_t = 4F/\eta_t$, Theorem 1 provides the regret bound
$R_{\mathrm{DR\text{-}OMMD}}(T) \le \frac{\gamma D}{\eta_T} + \sum_{t=1}^T \frac{\eta_t}{2} \min_{\lambda \in S_m} \big\{ \|\nabla F_t(x_t)\lambda\|_2^2 + \frac{4F}{\eta_t} \|\lambda - \lambda_0\|_1 \big\}$.
Linearization with fixed weights $\lambda_0 \in S_m$ can be viewed as single-objective optimization on the linearized loss $\lambda_0^\top F_t$. Hence we can directly borrow the tight bound of OMD (e.g., Theorem 6.8 in (Orabona, 2019)) and derive
$R_{\mathrm{linear}}(T) \le \frac{\gamma D}{\eta_T} + \sum_{t=1}^T \frac{\eta_t}{2} \|\nabla F_t(x_t)\lambda_0\|_2^2$.
The margin between the above two bounds is
$M = \sum_{t=1}^T \frac{\eta_t}{2} \big( \|\nabla F_t(x_t)\lambda_0\|_2^2 - \min_{\lambda \in S_m} \big\{ \|\nabla F_t(x_t)\lambda\|_2^2 + \frac{4F}{\eta_t} \|\lambda - \lambda_0\|_1 \big\} \big)$,
which stems from the different choices of composite weights. We investigate the margin at each round.

Lemma 3. In a two-objective setting, suppose the gradients at some specific round $t$ are $g_1$ and $g_2$, with the corresponding gradient matrix $G = [g_1, g_2]$. For any $\lambda_0 = (\gamma_0, 1-\gamma_0) \in S_m$, let $\lambda_t = (\gamma_t, 1-\gamma_t)$ denote the composite weights produced by min-regularized-norm with the L1 norm. Then the following inequality holds:
$\|G\lambda_0\|_2^2 - \big(\|G\lambda_t\|_2^2 + \alpha \|\lambda_t - \lambda_0\|_1\big) \ge (\gamma_0 - \gamma_t)^2 \|g_2 - g_1\|_2^2$.

Proof. Denote the left side of the target inequality by $M(\lambda_t, \lambda_0)$; it simplifies to
$M(\lambda_t, \lambda_0) = (G\lambda_0 - G\lambda_t)^\top (G\lambda_0 + G\lambda_t) - \alpha \|\lambda_t - \lambda_0\|_1 = (\lambda_0 - \lambda_t)^\top G^\top G (\lambda_0 + \lambda_t) - \alpha \|\lambda_t - \lambda_0\|_1$.
To evaluate this term, we plug the derived composite weights $\lambda_t$ into $M(\lambda_t, \lambda_0)$. Recall that in the two-objective setting, the weight $\gamma_t$ is given by $\gamma_t = \max\{\min\{\gamma_t'', 1\}, 0\}$, where $\gamma_t'' = \max\{\min\{\gamma_0, \gamma_R\}, \gamma_L\}$, with $\gamma_L = \big(g_2^\top (g_2 - g_1) - \alpha\big)/\|g_2 - g_1\|_2^2$ and $\gamma_R = \big(g_2^\top (g_2 - g_1) + \alpha\big)/\|g_2 - g_1\|_2^2$. Since the maximum and minimum operations truncate the value of the produced weight, we calculate $M(\lambda_t, \lambda_0)$ by cases. Noting that $\gamma_L < \gamma_R$ and $0 \le \gamma_0 \le 1$, we consider the following cases.

Case 1: When $\gamma_R < 0$, we must have $\gamma_L < \gamma_R < 0 \le \gamma_0$, which leads to $\gamma_t = 0$.
In this case, $\lambda_0 - \lambda_t = (\gamma_0, -\gamma_0)$ and $\lambda_0 + \lambda_t = (\gamma_0, 2 - \gamma_0)$. Therefore,
$M(\lambda_t, \lambda_0) = (\gamma_0 g_1 - \gamma_0 g_2)^\top \big(\gamma_0 g_1 + (2 - \gamma_0) g_2\big) - 2\alpha\gamma_0$.
The condition $\gamma_R < 0$ gives $\alpha < g_2^\top (g_1 - g_2)$. Since $\gamma_0 \ge 0$, plugging this bound on $\alpha$ into the expression above yields
$M(\lambda_t, \lambda_0) \ge (\gamma_0 g_1 - \gamma_0 g_2)^\top (\gamma_0 g_1 - \gamma_0 g_2) = \gamma_0^2 \|g_1 - g_2\|_2^2$.
Since $\gamma_t = 0$ in this case, the desired inequality follows.

Case 2: When $\gamma_L > 1$, we must have $\gamma_0 \le 1 < \gamma_L < \gamma_R$, which results in $\gamma_t = 1$. In this case, $\lambda_0 - \lambda_t = (\gamma_0 - 1, 1 - \gamma_0)$ and $\lambda_0 + \lambda_t = (\gamma_0 + 1, 1 - \gamma_0)$, so
$M(\lambda_t, \lambda_0) = \big((\gamma_0 - 1) g_1 + (1 - \gamma_0) g_2\big)^\top \big((\gamma_0 + 1) g_1 + (1 - \gamma_0) g_2\big) - 2\alpha(1 - \gamma_0)$.
The condition $\gamma_L > 1$ gives $\alpha < g_1^\top (g_2 - g_1)$. Since $1 - \gamma_0 \ge 0$, plugging this bound on $\alpha$ into the expression above yields
$M(\lambda_t, \lambda_0) \ge \big((\gamma_0 - 1) g_1 + (1 - \gamma_0) g_2\big)^\top \big((\gamma_0 - 1) g_1 + (1 - \gamma_0) g_2\big) = (1 - \gamma_0)^2 \|g_1 - g_2\|_2^2$.
Since $\gamma_t = 1$ in this case, the desired inequality follows.

Case 3: When $0 \le \gamma_L \le \gamma_R \le 1$, the margin is a bit more involved, since the value of $\lambda_t$ further depends on the relation between $\gamma_0$, $\gamma_L$, and $\gamma_R$. Specifically, we consider the following sub-cases.
(i) If $0 \le \gamma_L \le \gamma_0 \le \gamma_R \le 1$, then $\gamma_t = \gamma_0$, and the inequality holds trivially.
(ii) If $0 \le \gamma_0 \le \gamma_L \le 1$, then $\gamma_t = \gamma_L$. In this case, since $\|\lambda_t - \lambda_0\|_1 = 2(\gamma_L - \gamma_0)$, we can calculate
$M(\lambda_t, \lambda_0) = \big((\gamma_0 - \gamma_L) g_1 + (\gamma_L - \gamma_0) g_2\big)^\top \big((\gamma_0 + \gamma_L) g_1 + (2 - \gamma_0 - \gamma_L) g_2\big) - 2\alpha(\gamma_L - \gamma_0) = (\gamma_0 - \gamma_L) \big( (g_1 - g_2)^\top \big((\gamma_0 + \gamma_L) g_1 + (2 - \gamma_0 - \gamma_L) g_2\big) + 2\alpha \big)$.
Since $\gamma_t = \gamma_L = \big(g_2^\top (g_2 - g_1) - \alpha\big)/\|g_2 - g_1\|_2^2$, we have $\alpha = (g_1 - g_2)^\top \big(-\gamma_L g_1 + (\gamma_L - 1) g_2\big)$. Therefore,
$(g_1 - g_2)^\top \big((\gamma_0 + \gamma_L) g_1 + (2 - \gamma_0 - \gamma_L) g_2\big) + 2\alpha = (g_1 - g_2)^\top \big((\gamma_0 - \gamma_L) g_1 + (\gamma_L - \gamma_0) g_2\big) = (\gamma_0 - \gamma_L)(g_1 - g_2)^\top (g_1 - g_2)$.
Plugging this into the expression for $M(\lambda_t, \lambda_0)$, we derive $M(\lambda_t, \lambda_0) = (\gamma_L - \gamma_0)^2 \|g_1 - g_2\|_2^2$.
(iii) If $0 \le \gamma_R \le \gamma_0 \le 1$, then $\gamma_t = \gamma_R$. In this case, $\|\lambda_t - \lambda_0\|_1 = 2(\gamma_0 - \gamma_R)$, and
$M(\lambda_t, \lambda_0) = \big((\gamma_0 - \gamma_R) g_1 + (\gamma_R - \gamma_0) g_2\big)^\top \big((\gamma_0 + \gamma_R) g_1 + (2 - \gamma_0 - \gamma_R) g_2\big) - 2\alpha(\gamma_0 - \gamma_R) = (\gamma_0 - \gamma_R) \big( (g_1 - g_2)^\top \big((\gamma_0 + \gamma_R) g_1 + (2 - \gamma_0 - \gamma_R) g_2\big) - 2\alpha \big)$.
Since $\gamma_t = \gamma_R = \big(g_2^\top (g_2 - g_1) + \alpha\big)/\|g_2 - g_1\|_2^2$, we have $\alpha = (g_1 - g_2)^\top \big(\gamma_R g_1 + (1 - \gamma_R) g_2\big)$, and thus
$(g_1 - g_2)^\top \big((\gamma_0 + \gamma_R) g_1 + (2 - \gamma_0 - \gamma_R) g_2\big) - 2\alpha = (g_1 - g_2)^\top \big((\gamma_0 - \gamma_R) g_1 + (\gamma_R - \gamma_0) g_2\big) = (\gamma_0 - \gamma_R)(g_1 - g_2)^\top (g_1 - g_2)$.
Plugging this into the expression for $M(\lambda_t, \lambda_0)$, we derive $M(\lambda_t, \lambda_0) = (\gamma_R - \gamma_0)^2 \|g_1 - g_2\|_2^2$.
Combining all of the above cases proves the lemma. ■

For any $t \in \{1, \dots, T\}$, set $g_1 = g_t^1$, $g_2 = g_t^2$ (i.e., $G = G_t$) and $\alpha = \frac{4F}{\eta_t}$ in Lemma 3; then
$\|G_t \lambda_0\|_2^2 - \big(\|G_t \lambda_t\|_2^2 + \frac{4F}{\eta_t} \|\lambda_t - \lambda_0\|_1\big) \ge (\gamma_t - \gamma_0)^2 \|g_t^2 - g_t^1\|_2^2$.
Since $\|\lambda_t - \lambda_0\|_2^2 = 2(\gamma_t - \gamma_0)^2$, summing the above inequality over $t \in \{1, \dots, T\}$, we can directly bound the margin as
$M \ge \sum_{t=1}^T \frac{\eta_t}{4} \|\lambda_t - \lambda_0\|_2^2 \cdot \|g_t^2 - g_t^1\|_2^2$,
which proves the theorem. ■
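The closed-form weights used in the proof, and the inequality of Lemma 3, can be checked numerically. The sketch below (function names are ours; a two-objective sketch following our reading of the case analysis, not the paper's released code) implements the truncated closed form and evaluates the gap between the two sides of the lemma on random instances.

```python
import numpy as np

def min_reg_norm_2obj(g1, g2, gamma0, alpha):
    """Min-regularized-norm with L1 regularizer for two objectives:
    argmin over (g, 1-g) in the simplex of ||g*g1 + (1-g)*g2||^2 + alpha*||lam - lam0||_1,
    via the closed form used in the proof of Lemma 3."""
    d2 = float(np.dot(g2 - g1, g2 - g1))
    if d2 == 0.0:
        return gamma0                  # identical gradients: keep the preference weight
    gL = (float(np.dot(g2, g2 - g1)) - alpha) / d2
    gR = (float(np.dot(g2, g2 - g1)) + alpha) / d2
    g = max(min(gamma0, gR), gL)       # pull gamma0 into the interval [gL, gR]
    return min(max(g, 0.0), 1.0)       # truncate onto [0, 1]

def lemma3_gap(g1, g2, gamma0, alpha):
    """Left side minus right side of the inequality in Lemma 3 (should be >= 0)."""
    gt = min_reg_norm_2obj(g1, g2, gamma0, alpha)
    comp = lambda g: g * g1 + (1.0 - g) * g2        # composite gradient G * lam
    lhs = (float(np.dot(comp(gamma0), comp(gamma0)))
           - float(np.dot(comp(gt), comp(gt)))
           - alpha * 2.0 * abs(gt - gamma0))        # ||lam_t - lam_0||_1 = 2|gt - gamma0|
    rhs = (gamma0 - gt) ** 2 * float(np.dot(g2 - g1, g2 - g1))
    return lhs - rhs

rng = np.random.default_rng(0)
gaps = [lemma3_gap(rng.normal(size=2), rng.normal(size=2),
                   rng.uniform(), rng.uniform(0.01, 2.0))
        for _ in range(200)]
```

On interior sub-cases (ii) and (iii) the gap is exactly zero, matching the equalities derived in the proof; on the truncated cases it is strictly positive.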

G.1 MORE DETAILS OF THE EXPERIMENTAL SETUP

The protein and covtype datasets used in our experiments are publicly available (Dua & Graff, 2017). The MultiMNIST dataset is generated with the code provided by Sener & Koltun (2018). All runs are executed on an Intel(R) Xeon(R) E5-2699 CPU @ 2.2GHz.

G.2 MORE RESULTS FOR ADAPTIVE REGULARIZATION

We supplement the empirical results on covtype in Figure 3, which were omitted from the main paper due to space constraints. These results are consistent with the results on protein presented in the main paper.

G.3 MORE RESULTS FOR ONLINE DEEP MULTI-TASK LEARNING

In our main paper, due to the page limit, Figure 2

Proof. As described in our main paper, we consider the following two-objective optimization problem. The decision domain is set as $X = \{(u, v) \mid u + v \le \frac{1}{2},\ v - u \le \frac{1}{2},\ v \ge 0\}$. Since the first entry of $g_t^{\mathrm{comp}}$ is non-negative when $t = 2k-1$, namely $u_{2k-1} - v_{2k-1} + 1 \ge 0$, we have $u_{2k} \le u_{2k-1}$ for any $k$; since the first entry of $g_t^{\mathrm{comp}}$ is non-positive when $t = 2k$, namely $u_{2k} + v_{2k} - 1 \le 0$, we have $u_{2k+1} \ge u_{2k}$ for any $k$. Now we can go back to analyzing the gap between the composite weights at any two consecutive rounds. It is easy to verify that $\gamma_{2k-1} < \gamma_{2k}$ and $\gamma_{2k} > \gamma_{2k+1}$; hence
$\|\lambda_{2k} - \lambda_{2k-1}\|_1 = 2(\gamma_{2k} - \gamma_{2k-1}) = \frac{2 - (u_{2k} - u_{2k-1}) + (v_{2k} + v_{2k-1})}{2} \ge 1 + v_1$,
$\|\lambda_{2k+1} - \lambda_{2k}\|_1 = 2(\gamma_{2k} - \gamma_{2k+1}) = \frac{2 - (u_{2k} - u_{2k+1}) + (v_{2k} + v_{2k+1})}{2} \ge 1 + v_1$.
Therefore, the composite weights $\lambda_t$ indeed change radically at any two consecutive rounds.

The above analysis of $v_t$ also implies the failure of min-norm in this problem. Recall that any Pareto optimal solution $x^* = (u^*, v^*) \in X^*$ must satisfy $v^* = 0$. Suppose the initial iterate $x_1 = (u_1, v_1)$ does not lie in $X^*$, i.e., $v_1 > 0$, which holds almost surely under random initialization $x_1 \in X$. Then we iteratively have $0 < v_1 \le v_2 \le \dots \le v_T$, which means that $x_t$ moves away from the Pareto set $X^*$. In the following, we rigorously prove that min-norm indeed incurs a linear multi-objective regret. To calculate $R(T)$, we first investigate the quantity $R(x^*, \lambda) = \lambda^\top \sum_{t=1}^T (F_t(x_t) - F_t(x^*))$ for any fixed weights $\lambda = (\gamma, 1-\gamma) \in S_2$ and best fixed decision $x^* = (u^*, 0) \in X^*$. Specifically, recalling the form of $\sum_{t=1}^T F_t$ derived above, we have
$\lambda^\top \sum_{t=1}^T F_t(x^*) = \big(\gamma (u^*+1)^2 + (1-\gamma)(u^*-1)^2 + 2\big) T$.
Denote the cumulative loss $\sum_{t=1}^T F_t(x_t) = (L_1, L_2)$; we now consider the loss of each objective, $L_1$ and $L_2$, separately. Specifically, for the first objective, we have
$L_1 = \sum_{k=1}^{T/2} \big((u_{2k-1}+2)^2 + u_{2k}^2 + (v_{2k-1}+1)^2 + (v_{2k}-1)^2\big)$.
Since $0 < v_1 \le v_2 \le \dots$
$\le v_T \le 1$, for the term regarding $v_t$ we have
$\sum_{k=1}^{T/2} \big((v_{2k-1}+1)^2 + (v_{2k}-1)^2\big) = (v_1+1)^2 + (v_T-1)^2 + \sum_{k=1}^{T/2-1} \big((v_{2k}-1)^2 + (v_{2k+1}+1)^2\big) \ge (v_1+1)^2 + \sum_{k=1}^{T/2-1} \big((v_{2k}-1)^2 + (v_{2k}+1)^2\big) = (v_1+1)^2 + \sum_{k=1}^{T/2-1} (2v_{2k}^2 + 2) \ge (v_1+1)^2 + (2v_1^2+2)\big(\tfrac{T}{2}-1\big) = v_1^2 T + T - (1-v_1)^2 \ge v_1^2 T + T - 2$.
For the $k$-th term regarding $u_t$, we have
$(u_{2k-1}+2)^2 + u_{2k}^2 = (u_{2k-1}+1)^2 + (u_{2k}+1)^2 + 2(u_{2k-1} - u_{2k}) + 2$.
Recall that we have derived $u_{2k} \le u_{2k-1}$; thus
$\sum_{k=1}^{T/2} \big((u_{2k-1}+2)^2 + u_{2k}^2\big) \ge \sum_{k=1}^{T/2} \big((u_{2k-1}+1)^2 + (u_{2k}+1)^2 + 2\big) \ge \sum_{t=1}^T (u_t+1)^2 + T \ge (\bar{u}+1)^2 T + T$,
where $\bar{u} = \frac{1}{T}\sum_{t=1}^T u_t$ and the last inequality follows from Jensen's inequality. In summary, the cumulative loss of the first objective satisfies
$L_1 \ge (\bar{u}+1)^2 T + v_1^2 T + 2T - 2$.
Similarly, we can analyze the cumulative loss of the second objective,
$L_2 = \sum_{k=1}^{T/2} \big(u_{2k-1}^2 + (u_{2k}-2)^2 + (v_{2k-1}-1)^2 + (v_{2k}+1)^2\big)$.
Since $0 < v_1 \le v_2 \le \dots \le v_T \le 1$, for the term regarding $v_t$ we have
$\sum_{k=1}^{T/2} \big((v_{2k-1}-1)^2 + (v_{2k}+1)^2\big) \ge \sum_{k=1}^{T/2} \big((v_{2k-1}-1)^2 + (v_{2k-1}+1)^2\big) \ge v_1^2 T + T$.
For the term regarding $u_t$, we also have
$\sum_{k=1}^{T/2} \big(u_{2k-1}^2 + (u_{2k}-2)^2\big) = \sum_{k=1}^{T/2} \big((u_{2k-1}-1)^2 + (u_{2k}-1)^2 + 2(u_{2k-1} - u_{2k}) + 2\big) \ge \sum_{t=1}^T (u_t-1)^2 + T \ge (\bar{u}-1)^2 T + T$,
where the last inequality again follows from Jensen's inequality. Therefore,
$L_2 \ge (\bar{u}-1)^2 T + v_1^2 T + 2T$.
Combining the above inequalities, we have
$R(x^*, \lambda) = \gamma L_1 + (1-\gamma) L_2 - \lambda^\top \sum_{t=1}^T F_t(x^*) \ge \gamma\big((\bar{u}+1)^2 - (u^*+1)^2\big)T + (1-\gamma)\big((\bar{u}-1)^2 - (u^*-1)^2\big)T + v_1^2 T - 2\gamma$.
For any $\lambda \in S_2$ (i.e., $\gamma \in [0,1]$), set $x' = (\bar{u}, 0) \in X^*$; then $R(x', \lambda) \ge v_1^2 T - 2$. Equivalently, the multi-objective regret satisfies
$R(T) = \sup_{x^* \in X^*} \inf_{\lambda \in S_2} R(x^*, \lambda) \ge \inf_{\lambda \in S_2} R(x', \lambda) \ge v_1^2 T - 2$,
which is linear w.r.t. $T$ for any $x_1 = (u_1, v_1) \in X$ such that $v_1 > 0$. We now investigate the case when $T$ is odd.
Since the computation of the composite weights $\lambda_t$ and the composite gradient $g_t^{\mathrm{comp}}$ at each round is independent of the total time horizon $T$, we still have $\|\lambda_{t+1} - \lambda_t\|_1 \ge 1 + v_1$ for any $t$; hence the first desired property also holds for any odd $T$. It remains to prove that OMD with min-norm still incurs a linear regret when $T$ is odd. In this case, the Pareto optimal set $X^*$ no longer lies on the horizontal axis, so it is difficult to compute $R(T)$ directly. However, we can still use the quantity $R(x^*, \lambda)$ derived for even horizons to estimate the regret. Specifically, set $x' = \big(\frac{1}{T-1}\sum_{t=1}^{T-1} u_t,\ 0\big)$; from the above derivation with an even horizon (note that $T-1$ is now even), for any $\lambda \in S_2$ we still have
$\lambda^\top \sum_{t=1}^{T-1} F_t(x_t) - \lambda^\top \sum_{t=1}^{T-1} F_t(x') \ge v_1^2 (T-1) - 2$.
Since $0 \le \|x - a\|_2^2, \|x - b\|_2^2, \|x - c\|_2^2 \le 10$ for any $x \in X$, the final round changes this quantity by at most $10$, so
$R(x', \lambda) = \lambda^\top \sum_{t=1}^{T} F_t(x_t) - \lambda^\top \sum_{t=1}^{T} F_t(x') \ge v_1^2 (T-1) - 12$.
Furthermore, by the definition of Pareto optimality, there exists some $x'' \in X^*$ that Pareto dominates $x'$ regarding the cumulative loss $\sum_{t=1}^T F_t$, namely $\sum_{t=1}^T F_t(x'') \preceq \sum_{t=1}^T F_t(x')$. Hence
$R(x'', \lambda) = \lambda^\top \sum_{t=1}^{T} F_t(x_t) - \lambda^\top \sum_{t=1}^{T} F_t(x'') \ge R(x', \lambda)$ for any $\lambda \in S_2$.
Therefore, the multi-objective regret satisfies
$R(T) = \sup_{x^* \in X^*} \inf_{\lambda \in S_2} R(x^*, \lambda) \ge \inf_{\lambda \in S_2} R(x'', \lambda) \ge \inf_{\lambda \in S_2} R(x', \lambda) \ge v_1^2 (T-1) - 12$,
which is also linear w.r.t. $T$ for any $x_1 = (u_1, v_1) \in X$ such that $v_1 > 0$. ■
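The failure mode above can be reproduced numerically. The sketch below (our own code with an assumed step size and starting point, chosen so that the iterates stay inside $X$ and no projection is triggered over the short horizon) runs OMD with the vanilla min-norm solver on the alternating losses, tracking $v_t$ and the consecutive weight gaps.

```python
import numpy as np

a, b, c = np.array([-2., -1.]), np.array([0., 1.]), np.array([2., -1.])

def min_norm_gamma(g1, g2):
    """Vanilla min-norm weights for two objectives:
    argmin_g ||g*g1 + (1-g)*g2||^2 over g in [0, 1]."""
    d2 = float(np.dot(g1 - g2, g1 - g2))
    if d2 == 0.0:
        return 0.5
    return min(max(float(np.dot(g2 - g1, g2)) / d2, 0.0), 1.0)

def run_omd_min_norm(x1, eta=0.01, T=20):
    xs, gammas = [np.array(x1, dtype=float)], []
    for t in range(1, T + 1):
        x = xs[-1]
        # odd rounds use losses (||x-a||^2, ||x-b||^2); even rounds (||x-b||^2, ||x-c||^2)
        g1, g2 = (2 * (x - a), 2 * (x - b)) if t % 2 == 1 else (2 * (x - b), 2 * (x - c))
        gamma = min_norm_gamma(g1, g2)
        gammas.append(gamma)
        xs.append(x - eta * (gamma * g1 + (1 - gamma) * g2))  # interior: no projection
    return np.array(xs), np.array(gammas)

xs, gammas = run_omd_min_norm([0.0, 0.25])
v = xs[:, 1]
weight_gaps = 2 * np.abs(np.diff(gammas))   # ||lambda_{t+1} - lambda_t||_1
```

The simulation exhibits both claimed properties: $v_t$ never decreases (the iterate drifts away from the Pareto set) and every consecutive weight gap is at least $1 + v_1$.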

I OMITTED PROOFS OF THEOREM 1

Proof. We start from the definition of the multi-objective regret $R_{II}(T)$ (abbreviated as $R(T)$). Specifically, for any $\lambda \in S_m$ and any $\lambda_1, \dots, \lambda_T \in S_m$, it holds that
$R(T) = \sup_{x^* \in X^*} \inf_{\lambda^* \in S_m} (\lambda^*)^\top \sum_{t=1}^T \big(F_t(x_t) - F_t(x^*)\big) \le \sup_{x^* \in X^*} \lambda^\top \sum_{t=1}^T \big(F_t(x_t) - F_t(x^*)\big) \le 2F \sum_{t=1}^T \|\lambda - \lambda_t\|_1 + \sup_{x^* \in X^*} \sum_{t=1}^T \lambda_t^\top \big(F_t(x_t) - F_t(x^*)\big)$,
where the last inequality follows by decomposing $\lambda^\top (F_t(x_t) - F_t(x^*))$ into $(\lambda - \lambda_t)^\top F_t(x_t) + \lambda_t^\top (F_t(x_t) - F_t(x^*)) + (\lambda_t - \lambda)^\top F_t(x^*)$ and applying Hölder's inequality together with the bound $F$ on the magnitude of the losses.

To proceed, notice that if the composite weights $\lambda_t$ were given beforehand instead of being computed via min-regularized-norm, DR-OMMD would act exactly like standard OMD on the linearized loss $\lambda_t^\top F_t$. Hence, the second term in the above regret bound can be further analyzed in a similar way to single-objective OMD (Srebro et al., 2011; Cesa-Bianchi et al., 2012). Specifically, at each round $t$, since $F_t$ is coordinate-wise convex, the linearized loss $\lambda_t^\top F_t$ is also convex. Also notice that the composite gradient $g_t = \nabla F_t(x_t)\lambda_t$ is exactly the gradient of $\lambda_t^\top F_t$ at $x_t$. Hence, for any $x^* \in X^*$, we have
$\lambda_t^\top F_t(x_t) - \lambda_t^\top F_t(x^*) \le g_t^\top (x_t - x^*) = g_t^\top (x_{t+1} - x^*) + g_t^\top (x_t - x_{t+1})$.
From the first-order optimality condition of $x_{t+1}$, for any $x' \in X$, we have
$\big(\eta_t \nabla F_t(x_t)\lambda_t + \nabla R(x_{t+1}) - \nabla R(x_t)\big)^\top (x' - x_{t+1}) \ge 0$.
Recall that $g_t = \nabla F_t(x_t)\lambda_t$. Setting $x' = x^*$ and combining the above two inequalities, we derive
$\lambda_t^\top F_t(x_t) - \lambda_t^\top F_t(x^*) \le \frac{1}{\eta_t}\big(\nabla R(x_{t+1}) - \nabla R(x_t)\big)^\top (x^* - x_{t+1}) + g_t^\top (x_t - x_{t+1})$.
Recall the definition of the Bregman divergence $B_R$. One can check (see also (Beck & Teboulle, 2003)) that
$B_R(x^*, x_t) - B_R(x^*, x_{t+1}) - B_R(x_{t+1}, x_t) = \big(\nabla R(x_{t+1}) - \nabla R(x_t)\big)^\top (x^* - x_{t+1})$.
Since $R$ is 1-strongly convex, $B_R(x_{t+1}, x_t) \ge \frac{1}{2}\|x_{t+1} - x_t\|_2^2$. Hence
$\lambda_t^\top F_t(x_t) - \lambda_t^\top F_t(x^*) \le \frac{1}{\eta_t}\Big(B_R(x^*, x_t) - B_R(x^*, x_{t+1}) - \frac{1}{2}\|x_{t+1} - x_t\|_2^2\Big) + g_t^\top (x_t - x_{t+1})$.
Moreover, by the Cauchy–Schwarz and Young's inequalities,
$g_t^\top (x_t - x_{t+1}) \le \frac{\eta_t}{2}\|g_t\|_2^2 + \frac{1}{2\eta_t}\|x_t - x_{t+1}\|_2^2$.
Combining the above two inequalities, we derive, for any $x^* \in X^*$,
$\lambda_t^\top F_t(x_t) - \lambda_t^\top F_t(x^*) \le \frac{1}{\eta_t}\big(B_R(x^*; x_t) - B_R(x^*; x_{t+1})\big) + \frac{\eta_t}{2}\|\nabla F_t(x_t)\lambda_t\|_2^2$.
Summing over $t \in \{1, \dots, T\}$ and using $B_R(x^*; x_{T+1}) \ge 0$, we have
$\sum_{t=1}^T \big(\lambda_t^\top F_t(x_t) - \lambda_t^\top F_t(x^*)\big) \le \frac{B_R(x^*; x_1)}{\eta_1} + \sum_{t=2}^T \Big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\Big) B_R(x^*; x_t) + \sum_{t=1}^T \frac{\eta_t}{2}\|\nabla F_t(x_t)\lambda_t\|_2^2 \le \frac{\gamma D}{\eta_T} + \sum_{t=1}^T \frac{\eta_t}{2}\|\nabla F_t(x_t)\lambda_t\|_2^2$,
where the last step uses $B_R(x^*; x_t) \le \gamma D$ and $\eta_t \le \eta_{t-1}$ for any $t$, so that the coefficients telescope to $\frac{1}{\eta_T}$. Taking the supremum over $x^* \in X^*$ and plugging the result back into the above regret bound, we prove the theorem. ■
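The Bregman three-point identity invoked in the proof is a purely algebraic fact and can be checked numerically for any differentiable, strictly convex $R$. The sketch below (our illustrative choice of the negative-entropy mirror map; any smooth convex $R$ would do) verifies it on random points.

```python
import numpy as np

def R(x):
    return float(np.sum(x * np.log(x)))      # negative entropy, defined for x > 0

def gradR(x):
    return np.log(x) + 1.0

def breg(x, z):
    """Bregman divergence B_R(x, z) = R(x) - R(z) - <gradR(z), x - z>."""
    return R(x) - R(z) - float(np.dot(gradR(z), x - z))

rng = np.random.default_rng(0)
x_star, x_t, x_next = (rng.uniform(0.1, 1.0, size=4) for _ in range(3))

# three-point identity:
# B(x*, x_t) - B(x*, x_{t+1}) - B(x_{t+1}, x_t) = <gradR(x_{t+1}) - gradR(x_t), x* - x_{t+1}>
lhs = breg(x_star, x_t) - breg(x_star, x_next) - breg(x_next, x_t)
rhs = float(np.dot(gradR(x_next) - gradR(x_t), x_star - x_next))
```

Expanding the three divergences, the $R(\cdot)$ terms cancel pairwise, leaving exactly the inner-product term on the right-hand side.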

J REGRET ANALYSIS IN THE STRONGLY CONVEX SETTING

In this section, we discuss the regret bound of DR-OMMD in the strongly convex setting, where each loss function $f_t^i$ is $H$-strongly convex. Recall that most of the literature in this setting considers only OGD (Zhao & Zhang, 2021; Wan et al., 2022), which is the special case of OMD that instantiates the regularization function on the iterate $x$ as the L2 regularizer, i.e., $R(x) = \frac{1}{2}\|x\|_2^2$. Hence, in the following, we mainly analyze the bound of the OGD-type variant in the strongly convex setting.

Theorem 3. Assume that for any $t \in \{1, \dots, T\}$ and $i \in \{1, \dots, m\}$, the loss function $f_t^i$ is $H$-strongly convex. Setting $\eta_t = \frac{1}{Ht}$ and $R(x) = \frac{1}{2}\|x\|_2^2$ in DR-OMMD, it attains the regret
$R(T) \le \frac{H}{2}\|x_1 - x^*\|_2^2 + \sum_{t=1}^T \frac{1}{2Ht}\big(\|\nabla F_t(x_t)\lambda_t\|_2^2 + 4FHt\|\lambda_t - \lambda_0\|_1\big)$.

Remark. By setting $\alpha_t = 4FHt$, the above bound reduces to
$R(T) \le \frac{H}{2}\|x_1 - x^*\|_2^2 + \sum_{t=1}^T \frac{1}{2Ht}\min_{\lambda \in S_m}\big\{\|\nabla F_t(x_t)\lambda\|_2^2 + \alpha_t\|\lambda - \lambda_0\|_1\big\} \le \frac{H}{2}\|x_1 - x^*\|_2^2 + \sum_{t=1}^T \frac{\|\nabla F_t(x_t)\lambda_0\|_2^2}{2Ht} \le \frac{H}{2}\|x_1 - x^*\|_2^2 + \frac{G^2}{2H}\sum_{t=1}^T \frac{1}{t} = O(\log T)$,
which aligns with the optimal regret bound $O(\log T)$ in the single-objective strongly convex setting (Hazan et al., 2016).

Proof. From the derivation of Theorem 1, we have
$R(T) \le 2F\sum_{t=1}^T \|\lambda - \lambda_t\|_1 + \sup_{x^* \in X^*} \sum_{t=1}^T \lambda_t^\top \big(F_t(x_t) - F_t(x^*)\big)$.
Denote the composite gradient by $g_t = \nabla F_t(x_t)\lambda_t$, which equals the gradient of $\lambda_t^\top F_t$ at $x_t$. Since $\lambda_t \in S_m$, $\lambda_t^\top F_t$ is a convex combination of $f_t^1, \dots, f_t^m$, hence $\lambda_t^\top F_t$ is also $H$-strongly convex. For any $x^* \in X^*$, we now have
$\lambda_t^\top F_t(x_t) - \lambda_t^\top F_t(x^*) \le g_t^\top (x_t - x^*) - \frac{H}{2}\|x_t - x^*\|_2^2$.
We now bound the term $g_t^\top (x_t - x^*)$. When $R(x) = \frac{1}{2}\|x\|_2^2$, the Bregman divergence is $B_R(x, z) = \frac{1}{2}\|x - z\|_2^2$ and the OMD update reduces to OGD, i.e., $x_{t+1} = \Pi_X(x_t - \eta_t g_t)$, where $\Pi_X$ denotes the standard Euclidean projection onto $X$.
Plugging the above update rule into $\|x_{t+1} - x^*\|_2^2$, we derive
$\|x_{t+1} - x^*\|_2^2 = \|\Pi_X(x_t - \eta_t g_t) - x^*\|_2^2 \le \|x_t - \eta_t g_t - x^*\|_2^2$,
where the inequality follows from the Pythagorean theorem. Hence we have
$\|x_{t+1} - x^*\|_2^2 \le \|x_t - x^*\|_2^2 + \eta_t^2\|g_t\|_2^2 - 2\eta_t g_t^\top (x_t - x^*)$,
or equivalently
$g_t^\top (x_t - x^*) \le \frac{\|x_t - x^*\|_2^2 - \|x_{t+1} - x^*\|_2^2}{2\eta_t} + \frac{\eta_t}{2}\|g_t\|_2^2$.
Plugging this into the inequality on $\lambda_t^\top F_t(x_t) - \lambda_t^\top F_t(x^*)$, we derive
$\lambda_t^\top F_t(x_t) - \lambda_t^\top F_t(x^*) \le \frac{\|x_t - x^*\|_2^2 - \|x_{t+1} - x^*\|_2^2}{2\eta_t} - \frac{H}{2}\|x_t - x^*\|_2^2 + \frac{\eta_t}{2}\|g_t\|_2^2$.
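As a numerical illustration of the OGD-type update with $\eta_t = 1/(Ht)$ (a sketch with made-up strongly convex quadratics and an illustrative $\alpha_t$ schedule, reusing the two-objective closed-form weights from the proof of Lemma 3; not the paper's experimental code), the iterate is driven onto the Pareto segment between the two minimizers:

```python
import numpy as np

H = 2.0
c1, c2 = np.array([0.0, 0.0]), np.array([1.0, 0.0])   # minimizers of the two objectives
# f^i(x) = (H/2)*||x - c_i||^2 is H-strongly convex with gradient H*(x - c_i)

def min_reg_norm_2obj(g1, g2, gamma0, alpha):
    """Closed-form min-regularized-norm weight (two objectives, L1 regularizer)."""
    d2 = float(np.dot(g2 - g1, g2 - g1))
    if d2 == 0.0:
        return gamma0
    gL = (float(np.dot(g2, g2 - g1)) - alpha) / d2
    gR = (float(np.dot(g2, g2 - g1)) + alpha) / d2
    return min(max(max(min(gamma0, gR), gL), 0.0), 1.0)

x = np.array([0.5, 1.0])
gamma0 = 0.5
for t in range(1, 501):
    eta, alpha = 1.0 / (H * t), 0.5 * H * t     # eta_t = 1/(Ht); alpha_t grows like t
    g1, g2 = H * (x - c1), H * (x - c2)
    gamma = min_reg_norm_2obj(g1, g2, gamma0, alpha)
    x = x - eta * (gamma * g1 + (1 - gamma) * g2)  # domain taken large: no projection

# the Pareto set of the cumulative loss is the segment between c1 and c2 on the u-axis
```

Because each update contracts toward the $\lambda_t$-weighted combination of the minimizers, the off-segment coordinate vanishes and the iterate settles inside the segment $[c_1, c_2]$.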



Footnotes: Our definition looks a bit different from that of (Turgay et al., 2018); in Appendix B, we show they are equivalent. It is equivalent to use either $X^*$ or $X$ as the comparator set; see Appendix C for the detailed proof.

CONCLUSIONS

In this paper, we give a systematic study of multi-objective online learning, encompassing a novel framework, a new algorithm, and the corresponding non-trivial theoretical analysis. We believe that this work paves the way for future research on more advanced multi-objective optimization algorithms, which may inspire the design of new optimizers for multi-task deep learning.



Figure 1: Results to verify the effectiveness of adaptive regularization on protein. (a) Performance of DR-OMMD and linearization under varying $\lambda_0 = (\lambda_0^1, 1 - \lambda_0^1)$. (b) Performance using the optimal weights $\lambda_0 = (0.1, 0.9)$.

Figure 2: Results to verify the effectiveness of DR-OMMD in the non-convex setting. The two plots show the performance of DR-OMMD and various baselines on both tasks (Task L and Task R) of MultiMNIST.

Figure 3: Results to verify the effectiveness of adaptive regularization on covtype. (a) Performance of DR-OMMD and linearization under varying $\lambda_0 = (\lambda_0^1, 1 - \lambda_0^1)$. (b) Performance using the optimal weights $\lambda_0 = (0.1, 0.9)$.

only reports the average cumulative loss of DR-OMMD and various baselines on MultiMNIST. Here we supplement the results on the training loss and the test loss in Figure 4. The results are consistent with the average cumulative loss, showing the superiority of DR-OMMD over linearization and MGDA in the non-convex setting.

H OMITTED PROOFS OF PROPOSITION 2 (MIN-NORM MAY INCUR LINEAR REGRETS)

Algorithm 1 Doubly Regularized Online Mirror Multiple Descent (DR-OMMD)
1: Input: Convex set $X$, time horizon $T$, regularization parameters $\alpha_t$, learning rates $\eta_t$, regularization function $R$, user preference $\lambda_0$.
2: Initialize: $x_1 \in X$.
3: for $t = 1, \dots, T$ do
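A generic-$m$ sketch of one DR-OMMD round (with the Euclidean mirror map; the inner min-regularized-norm problem is solved approximately by projected subgradient, which is our illustrative choice and not necessarily the paper's solver; all function names are ours):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def min_reg_norm(G, lam0, alpha, steps=300, lr=0.01):
    """Approximately solve min_{lam in simplex} ||G lam||^2 + alpha*||lam - lam0||_1
    by projected subgradient, keeping the best iterate seen."""
    obj = lambda lam: float(np.dot(G @ lam, G @ lam) + alpha * np.abs(lam - lam0).sum())
    lam, best, best_val = lam0.copy(), lam0.copy(), obj(lam0)
    for _ in range(steps):
        grad = 2.0 * (G.T @ (G @ lam)) + alpha * np.sign(lam - lam0)
        lam = project_simplex(lam - lr * grad)
        if obj(lam) < best_val:
            best, best_val = lam.copy(), obj(lam)
    return best

def dr_ommd_step(x, grads, lam0, eta, alpha):
    """One DR-OMMD round with the Euclidean mirror map (OGD-type update)."""
    G = np.stack(grads, axis=1)           # d x m matrix of per-objective gradients
    lam = min_reg_norm(G, lam0, alpha)    # composite weights (doubly regularized)
    return x - eta * (G @ lam), lam       # projection onto X omitted for brevity
```

Since the inner solver is initialized at $\lambda_0$ and keeps the best iterate, the resulting composite weights are never worse, in regularized-norm objective value, than the user preference itself.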

For any $\lambda \in S_m$ and any $x^* \in X^*$,
$\lambda^\top \sum_{t=1}^T \big(F_t(x_t) - F_t(x^*)\big) = \sum_{t=1}^T \big( (\lambda - \lambda_t)^\top F_t(x_t) + \lambda_t^\top (F_t(x_t) - F_t(x^*)) + (\lambda_t - \lambda)^\top F_t(x^*) \big) \le 2F \sum_{t=1}^T \|\lambda - \lambda_t\|_1 + \sum_{t=1}^T \lambda_t^\top \big(F_t(x_t) - F_t(x^*)\big)$.

Since $B_R(x^*; x_t) \le \gamma D$ and $\eta_t \le \eta_{t-1}$ for any $t$, we have $\big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\big) B_R(x^*; x_t) \le \big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\big)\gamma D$, so the telescoped sum is bounded by $\frac{\gamma D}{\eta_T}$.

Summing the above inequality over $t \in \{1, \dots, T\}$ with $\eta_t = \frac{1}{Ht}$ and telescoping, we have


ACKNOWLEDGMENTS This work was supported in part by the National Key Research and Development Program of China No. 2020AAA0106300 and National Natural Science Foundation of China No. 62250008. This work was also supported by Ant Group through Ant Research Intern Program. We would like to thank Wenliang Zhong, Jinjie Gu, Guannan Zhang and Jiaxin Liu for generous support on this project.

ANNEX

At each round $t$, the loss function $F_t: X \to \mathbb{R}^2$ takes
$F_t(x) = (\|x - a\|_2^2, \|x - b\|_2^2)$ for $t = 2k-1$, $k = 1, 2, \dots$, and $F_t(x) = (\|x - b\|_2^2, \|x - c\|_2^2)$ for $t = 2k$, $k = 1, 2, \dots$,
where $a = (-2, -1)$, $b = (0, 1)$, $c = (2, -1)$. For simplicity of analysis, we first consider the case where the total time horizon $T$ is even. It can then be checked that the cumulative loss function takes
$\sum_{t=1}^T F_t(x) = T\big((u+1)^2 + v^2 + 2,\ (u-1)^2 + v^2 + 2\big)$
for any $x = (u, v) \in X$. Obviously, the Pareto optimal set $X^*$ of the cumulative loss coincides with the line segment between $(-1, 0)$ and $(1, 0)$ intersected with $X$, i.e., $X^* = \{(u, 0) \mid -\frac{1}{2} \le u \le \frac{1}{2}\}$.

Now consider equipping OMD with vanilla min-norm, where the composite gradients are produced by the min-norm solver. Suppose the learning process starts at some $x_1 = (u_1, v_1)$ with $v_1 > 0$; note that this holds if and only if $x_1 \notin X^*$. Then, for the iterate $x_t = (u_t, v_t)$ at each round $t$, we can directly calculate the individual gradients as
$g_t^1 = 2(x_t - a),\ g_t^2 = 2(x_t - b)$ for odd $t$, and $g_t^1 = 2(x_t - b),\ g_t^2 = 2(x_t - c)$ for even $t$.
The min-norm weights $\lambda_t = (\gamma_t, 1 - \gamma_t)$ can be computed in closed form as
$\gamma_t = \frac{(g_t^2 - g_t^1)^\top g_t^2}{\|g_t^1 - g_t^2\|_2^2}$, which lies in $[0, 1]$ on $X$ so that no truncation occurs, giving $\gamma_t = \frac{1 - u_t - v_t}{4}$ for odd $t$ and $\gamma_t = \frac{3 - u_t + v_t}{4}$ for even $t$.
The composite gradient $g_t^{\mathrm{comp}} = \gamma_t g_t^1 + (1 - \gamma_t) g_t^2$ then takes
$g_t^{\mathrm{comp}} = (u_t - v_t + 1,\ v_t - u_t - 1)$ for odd $t$, and $g_t^{\mathrm{comp}} = (u_t + v_t - 1,\ u_t + v_t - 1)$ for even $t$.
Recall that the update of OMD takes $x_{t+1} = \Pi_X(x_t - \eta_t g_t^{\mathrm{comp}})$, where $\eta_t > 0$ is the learning rate and $\Pi_X$ is the projection operation onto $X$. We investigate the relation between $x_t$ and $x_{t+1}$ by considering the following two cases. If $x_t - \eta_t g_t^{\mathrm{comp}} \in X$, then no projection is needed, and we directly have $x_{t+1} = x_t - \eta_t g_t^{\mathrm{comp}}$. Otherwise, we consider the projection based on the Euclidean distance, namely $\Pi_X(x') = \operatorname{argmin}_{x \in X} \|x - x'\|_2$. Since the composite gradient is orthogonal to the boundary on which the projected iterate $x_{t+1} = \Pi_X(x'_{t+1})$ is located, it can be checked that $x_{t+1}$ lies on the line segment linking $x_t$ and $x'_{t+1}$; alternatively speaking, $x_{t+1}$ can be expressed as $x_t - \eta'_t g_t^{\mathrm{comp}}$ for some $0 \le \eta'_t < \eta_t$. Combining the above two cases, we know that at each round $t$ there exists some $\eta'_t \in [0, \eta_t]$ such that $x_{t+1} = x_t - \eta'_t g_t^{\mathrm{comp}}$. Now we can analyze the relation between each entry of $x_t$ and $x_{t+1}$.
Specifically, since the second entry of the composite gradient is always non-positive, namely $v_t - u_t - 1 \le 0$ for odd $t$ and $u_t + v_t - 1 \le 0$ for even $t$ on $X$, we have $v_{t+1} \ge v_t$ for any $t$. Moreover, since the first entry of the composite gradient is non-negative at odd rounds, namely $u_{2k-1} - v_{2k-1} + 1 \ge 0$, we have $u_{2k} \le u_{2k-1}$ for any $k$.

Plugging it into the above regret form, we derive the desired regret bound. ■

