MITIGATING GRADIENT BIAS IN MULTI-OBJECTIVE LEARNING: A PROVABLY CONVERGENT APPROACH

Abstract

Machine learning problems with multiple objectives appear either i) in learning with multiple criteria, where learning has to trade off multiple performance metrics such as fairness, safety, and accuracy; or ii) in multi-task learning, where multiple tasks are optimized jointly, sharing inductive bias among them. These multiple-objective learning problems are often tackled via the multi-objective optimization framework. However, existing stochastic multi-objective gradient methods and their recent variants (e.g., MGDA, PCGrad, CAGrad) all adopt a biased gradient direction, which leads to degraded empirical performance. To this end, we develop a stochastic Multi-objective gradient with Correction (MoCo) method for multi-objective optimization. The unique feature of our method is that it guarantees convergence without increasing the batch size, even in the nonconvex setting. Simulations on supervised and reinforcement learning demonstrate the effectiveness of our method relative to state-of-the-art methods.

1. INTRODUCTION

Multi-objective optimization (MOO) involves optimizing multiple, potentially conflicting objectives simultaneously. Recently, MOO has gained attention in various application settings such as optimizing hydrocarbon production (You et al., 2020), tissue engineering (Shi et al., 2019), safe reinforcement learning (Thomas et al., 2021), and training neural networks for multiple tasks (Sener & Koltun, 2018). We consider the stochastic MOO problem

min_{x∈X} F(x) := (E_ξ[f_1(x, ξ)], E_ξ[f_2(x, ξ)], ..., E_ξ[f_M(x, ξ)]),    (1)

where X ⊆ R^d is the feasible set, and f_m : X → R with f_m(x) := E_ξ[f_m(x, ξ)] for m ∈ [M]. Here we denote [M] := {1, 2, ..., M}, and ξ is a random variable. In this setting, we are interested in optimizing all of the objective functions simultaneously without sacrificing any individual objective. Since we cannot always hope to find a common variable x that achieves the optimum of all functions simultaneously, a natural alternative is to find a so-termed Pareto stationary point x that cannot be further improved in all objectives without sacrificing some objective. In this context, the multiple gradient descent algorithm (MGDA) has been developed for achieving this goal (Désidéri, 2012). The idea of MGDA is to iteratively update the variable x along a common descent direction for all objectives, obtained through a time-varying convex combination of the gradients of the individual objectives. Recently, various MGDA-based MOO algorithms have been proposed, especially for multi-task learning (MTL) (Sener & Koltun, 2018; Chen et al., 2018; Yu et al., 2020a; Liu et al., 2021a). While the deterministic MGDA algorithm and its variants are well understood in the literature, little theoretical study has been devoted to its stochastic counterpart. Recently, (Liu & Vicente, 2021) introduced the stochastic multi-gradient (SMG) method as a stochastic counterpart of MGDA (see Section 2.3 for details).
To establish convergence, however, (Liu & Vicente, 2021) requires a strong assumption on the fast-decaying first moment of the gradient error, which is enforced by linearly growing the batch size. While this allows for analysis of multi-objective optimization in the stochastic setting, it may not hold for many MTL tasks in practice. Furthermore, the analysis in (Liu & Vicente, 2021) cannot cover the important setting of non-convex objectives, which is prevalent in challenging MTL tasks. This leads us to a natural question: Can we design a stochastic MOO algorithm that provably converges to a Pareto stationary point without growing the batch size, even in the nonconvex setting?

The work was supported by the National Science Foundation CAREER project 2047177 and the RPI-IBM Artificial Intelligence Research Collaboration. Correspondence to: Tianyi Chen (chentianyi19@gmail.com).

Figure 1: A toy example from (Liu et al., 2021a) with two objectives (Figures 1b and 1c) to show the impact of gradient bias. We use the mean objective as a reference when plotting the trajectories corresponding to each initialization (3 initializations in total). The starting points of the trajectories are denoted by a black •, and the trajectories are shown fading from red (start) to yellow (end). The Pareto front is given by the gray bar, and the black ⋆ denotes the point on the Pareto front corresponding to equal weights for each objective. We implement recent MOO algorithms such as SMG (Liu & Vicente, 2021), PCGrad (Yu et al., 2020a), CAGrad (Liu et al., 2021a), and MGDA (Désidéri, 2012) alongside our method. Except for MGDA (Figure 1d), all the other algorithms only have access to gradients of each objective with added zero-mean Gaussian noise. It can be observed that SMG, CAGrad, and PCGrad fail to find the Pareto front for some initializations.

Our contributions.
In this paper, we answer this question affirmatively by providing the first stochastic MOO algorithm that provably converges to a Pareto stationary point without growing the batch size. Specifically, we make the following major contributions:

C1) (Asymptotically unbiased multi-gradient). We introduce a new method for MOO that we call the stochastic Multi-objective gradient with Correction (MoCo) method. MoCo is a simple algorithm that addresses the convergence issues of stochastic MGDA and provably converges to a Pareto stationary point under several stochastic MOO settings. We use a toy example in Figure 1 to demonstrate the empirical benefit of our method. In this example, MoCo is able to reach the Pareto front from all initializations, while other MOO algorithms such as SMG, CAGrad, and PCGrad fail to find the Pareto front due to their use of biased multi-gradients.

C2) (Unified non-asymptotic analysis). We generalize our MoCo method to the case where the individual objective functions have a nested structure, so that obtaining unbiased stochastic gradients is costly. We provide a unified convergence analysis of the nested MoCo algorithm in smooth non-convex and convex stochastic MOO settings. To the best of our knowledge, this is the first analysis of smooth non-convex stochastic gradient-based MOO.

C3) (Experiments on MTL applications). We provide an empirical comparison of our method with existing state-of-the-art MTL algorithms in supervised learning and reinforcement learning (RL) settings, and show that our method can outperform prior methods such as stochastic MGDA, PCGrad, CAGrad, and GradDrop.

In this section, we introduce the concepts of Pareto optimality and Pareto stationarity, and then discuss MGDA and its existing stochastic counterpart. We then motivate our proposed method by elaborating on the challenges of stochastic MOO. The notations used in the paper are summarized in Appendix A.

2.1. PARETO OPTIMALITY AND PARETO STATIONARITY

In MOO, we are interested in finding points that cannot be improved simultaneously for all the objective functions, which leads to the notion of Pareto optimality. Consider two feasible solutions x, x′ ∈ X of (1). We say that x dominates x′ if f_m(x) ≤ f_m(x′) for all m ∈ [M] and F(x) ≠ F(x′). If a point x* ∈ X is not dominated by any x ∈ X, we say x* is Pareto optimal. The collection of all Pareto optimal points is called the Pareto set, and the collection of objective vectors F(x*) over all Pareto optimal x* is called the Pareto front. Akin to the single-objective case, a necessary condition for Pareto optimality is Pareto stationarity. If x is a Pareto stationary point, then there is no common descent direction for all f_m at x. Formally, x is called a Pareto stationary point if range(∇F(x)^⊤) ∩ (−R^M_{++}) = ∅, where ∇F(x) := (∇f_1(x), ∇f_2(x), ..., ∇f_M(x)) ∈ R^{d×M} is the Jacobian of F(x), and R^M_{++} is the positive orthant cone. When all f_m(x) are strongly convex, a Pareto stationary point is also Pareto optimal.
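As a concrete illustration, dominance and the Pareto front of a finite set of candidate points can be checked directly from the definition above. The helper names below (`dominates`, `pareto_front_mask`) are our own, not from the paper:

```python
import numpy as np

def dominates(Fx, Fx_prime):
    """Return True if objective vector Fx dominates Fx_prime:
    no objective is worse, and at least one is strictly better."""
    Fx, Fx_prime = np.asarray(Fx), np.asarray(Fx_prime)
    return bool(np.all(Fx <= Fx_prime) and np.any(Fx < Fx_prime))

def pareto_front_mask(F):
    """Boolean mask of non-dominated rows in an (N, M) array of objective values."""
    N = F.shape[0]
    return np.array([not any(dominates(F[j], F[i]) for j in range(N) if j != i)
                     for i in range(N)])
```

For example, among the objective vectors (1, 2), (2, 1), (2, 2), (3, 3), only the first two are non-dominated.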

2.2. MULTIPLE GRADIENT DESCENT ALGORITHM (MGDA)

The MGDA algorithm, proposed in (Désidéri, 2012), converges to a Pareto stationary point of F(x). MGDA achieves this by seeking a convex combination of the individual gradients ∇f_m(x) (also known as the multi-gradient), given by d(x) = Σ_{m=1}^M λ*_m(x)∇f_m(x), where the weights λ*(x) := (λ*_1(x), ..., λ*_M(x))^⊤ are found by solving the following subproblem:

λ*(x) = argmin_λ ∥∇F(x)λ∥²  s.t.  λ ∈ ∆^M := {λ ∈ R^M | 1^⊤λ = 1, λ ≥ 0}.    (2)

With this multi-gradient d(x), the kth iteration of MGDA is given by x_{k+1} = Π_X(x_k − α_k d(x_k)) with d(x_k) = Σ_{m=1}^M λ*_m(x_k)∇f_m(x_k), where α_k is the learning rate and Π_X denotes the projection onto the set X. It can be shown that MGDA improves all objectives simultaneously along the direction −d(x) whenever x is not a Pareto stationary point, and terminates once it reaches a Pareto stationary point (Fliege et al., 2019). However, in many real-world applications we either do not have access to the true gradients of the functions f_m, or obtaining them is prohibitively expensive in terms of computation. This motivates a stochastic counterpart of MGDA, which is discussed next.
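For M = 2, the subproblem (2) has a closed-form solution (the same expression derived for the stochastic case in Section 2.3), so one unconstrained MGDA iteration can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation; `mgda_weights_2obj` and `mgda_step` are hypothetical names:

```python
import numpy as np

def mgda_weights_2obj(g1, g2):
    """Closed-form solution of the MGDA subproblem (2) for M = 2:
    argmin_{lam in [0,1]} ||lam*g1 + (1-lam)*g2||^2."""
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:              # identical gradients: any weight is optimal
        return 0.5
    lam = ((g2 - g1) @ g2) / denom
    return float(np.clip(lam, 0.0, 1.0))

def mgda_step(x, grads_fn, alpha):
    """One unconstrained MGDA iteration x_{k+1} = x_k - alpha * d(x_k)."""
    g1, g2 = grads_fn(x)
    lam = mgda_weights_2obj(g1, g2)
    d = lam * g1 + (1.0 - lam) * g2   # common descent direction
    return x - alpha * d
```

Note that when the two gradients point in exactly opposite directions, the resulting multi-gradient d(x) is zero, i.e., the point is Pareto stationary.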

2.3. STOCHASTIC MULTI-OBJECTIVE GRADIENT AND ITS BRITTLENESS

The stochastic counterpart of MGDA, referred to as the SMG algorithm, has been studied in (Liu & Vicente, 2021). In the SMG algorithm, the stochastic multi-gradient is obtained by replacing the true gradients ∇f_m(x) in (2) with their stochastic approximations ∇f_m(x, ξ), where ∇f_m(x) = E_ξ[∇f_m(x, ξ)]. Specifically, the stochastic multi-gradient is given by

g(x, ξ) = Σ_{m=1}^M λ^g_m(x, ξ)∇f_m(x, ξ)  with  λ^g(x, ξ) = argmin_{λ∈∆^M} ∥Σ_{m=1}^M λ_m ∇f_m(x, ξ)∥².    (4)

While this change of the subproblem facilitates the use of stochastic gradients in place of deterministic ones, it introduces bias into the stochastic multi-gradient computed by this method.

The bias of SMG. To better understand the cause of this bias, consider the case M = 2 of (4) for simplicity. We can rewrite the problem of solving for the convex combination weights as argmin_{λ∈[0,1]} ∥λ∇f_1(x, ξ) + (1−λ)∇f_2(x, ξ)∥², which admits the closed-form solution

λ^g(x, ξ) = [ (∇f_2(x, ξ) − ∇f_1(x, ξ))^⊤ ∇f_2(x, ξ) / ∥∇f_1(x, ξ) − ∇f_2(x, ξ)∥² ]_{[0,1]},    (5)

where [a]_{[0,1]} := max(min(a, 1), 0). It can be seen that the solution for λ is non-linear in ∇f_1(x, ξ) and ∇f_2(x, ξ), which suggests that E[λ^g(x, ξ)] ≠ λ*(x) and thus E[g(x, ξ)] ≠ d(x). To ensure convergence, a recent approach proposed replacing the stochastic gradient ∇f_m(x, ξ) with its mini-batch version, with the batch size growing with the number of iterations (Liu & Vicente, 2021). However, this may not be desirable in practice and often leads to sample inefficiency. In multi-objective reinforcement learning settings, this means running an increasing number of roll-outs for policy gradient calculation, which may be infeasible. On the other hand, Yang et al. (2021) also analyzes MGDA in the stochastic, smooth, and non-convex setting, and establishes convergence. However, to overcome the bias issue in stochastic MGDA, Yang et al. (2021) assumes access to λ*(x), which yields an unbiased estimate of the true multi-gradient d(x). This assumption is not practical since computing λ*(x) requires access to the true gradients ∇f_m(x), which are unavailable in a stochastic setting. In contrast, in the following section we propose a method that reduces the bias in the multi-gradient asymptotically and enjoys provable convergence.

Algorithm 1 MoCo: Stochastic Multi-objective gradient with Correction
1: Input: initial model parameter x_0, tracking parameters {y_{0,m}}_{m=1}^M, convex combination coefficient λ_0, and their respective learning rates {α_k}_{k=0}^K, {β_k}_{k=0}^K, {γ_k}_{k=0}^K.
2: for k = 0, ..., K−1 do
3:   for objective m = 1, ..., M do
4:     Obtain gradient estimator h_{k,m}   ▷ either h_{k,m} = ∇f_m(x_k, ξ_k) or h_{k,m} in (13)-(14)
5:     Update y_{k+1,m} following (6)
6:   end for
7:   Update λ_{k+1} and x_{k+1} following (9)-(10)
8: end for
9: Output: x_K
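The bias of SMG is easy to observe numerically: averaging the SMG direction over many independent noise draws does not recover the true multi-gradient d(x), even though each individual gradient estimate is unbiased. Below is a minimal Monte Carlo sketch (our own, with an arbitrary two-objective example and noise level, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def smg_multigrad(g1, g2):
    """Stochastic multi-gradient g(x, xi) of SMG, eqs. (4)-(5), for M = 2."""
    lam = np.clip(((g2 - g1) @ g2) / ((g1 - g2) @ (g1 - g2)), 0.0, 1.0)
    return lam * g1 + (1.0 - lam) * g2

# True gradients of a hypothetical 2-objective problem at some point x.
g1_true, g2_true = np.array([1.0, 0.0]), np.array([0.0, 1.0])
d_true = 0.5 * g1_true + 0.5 * g2_true        # lambda* = 0.5 by symmetry

# Average the SMG direction over many independent Gaussian perturbations.
N, sigma = 50_000, 1.0
mean_g = np.mean([smg_multigrad(g1_true + sigma * rng.standard_normal(2),
                                g2_true + sigma * rng.standard_normal(2))
                  for _ in range(N)], axis=0)

bias = np.linalg.norm(mean_g - d_true)
print(bias)   # bounded away from 0, although each gradient estimate is unbiased
```

Because λ^g(x, ξ) correlates with the noise in the gradients, the averaged direction is systematically shrunk relative to d(x), which is exactly the bias discussed above.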

3. STOCHASTIC MULTI-OBJECTIVE GRADIENT DESCENT WITH CORRECTION

In this section, we will first propose a new stochastic update that addresses the biased multi-gradient in MOO, extend it to the nested MOO setting, and then establish its convergence result. To achieve this, we use a momentum-like gradient estimate and a regularized version of MGDA subproblem.

3.1. A BASIC ALGORITHMIC FRAMEWORK

We start by discussing how to estimate ∇f_m(x) without incurring the bias issue. The key idea is to approximate the true gradient of each objective using a 'tracking' variable, and to use these approximations when finding the optimal convex combination coefficients, similar to MGDA and SMG. At each iteration k, assume we have access to h_{k,m}, a stochastic estimator of ∇f_m(x_k) (e.g., h_{k,m} = ∇f_m(x_k, ξ_k)). We approximate ∇f_m(x_k) by iteratively updating the tracking variable y_{k,m} ∈ R^d via

y_{k+1,m} = Π_{L_m}( y_{k,m} − β_k( y_{k,m} − h_{k,m} ) ),  m = 1, 2, ..., M,    (6)

where β_k is the step size, Π_{L_m} denotes the projection onto the set {y ∈ R^d | ∥y∥ ≤ L_m}, and L_m is the Lipschitz constant of f_m on X. Under some assumptions on the stochastic gradients h_{k,m} that will be specified in Section 3.3, we can show that for a given x_k, the recursion in (6) admits a unique fixed point y*_m(x_k) that satisfies y*_m(x_k) = E[h_{k,m}] = ∇f_m(x_k). In this subsection we first assume that h_{k,m} is an unbiased estimator of ∇f_m(x_k), and generalize to biased estimators in the next subsection. In this case, with only one sample needed at each iteration, the distance between y_{k,m} and ∇f_m(x_k) is expected to diminish as k increases. Even with an accurate estimate of ∇f_m(x), solving (1) is still not easy since these gradients can conflict with each other. As described in Section 2.2, given x ∈ X, the MGDA algorithm finds the optimal scalars {λ*_m(x)}_{m=1}^M to weight each gradient ∇f_m(x) such that d(x) = Σ_{m=1}^M λ*_m(x)∇f_m(x), and −d(x) is a common descent direction for every f_m(x). To obtain the corresponding convex combination when we do not have access to the true gradients, we propose to use Y_k := (y_{k,1}, ..., y_{k,M}) ∈ R^{d×M} as an approximation of ∇F(x_k). In general, the solution of (2) is not necessarily unique. We overcome this issue by adding ℓ2 regularization.
Specifically, with ρ > 0 denoting the regularization constant, the new subproblem is given by

λ*_ρ(x) = argmin_λ ∥∇F(x)λ∥² + (ρ/2)∥λ∥²  s.t.  λ ∈ ∆^M := {λ ∈ R^M | 1^⊤λ = 1, λ ≥ 0}.    (8)

Remark 1 (On the Lipschitz continuity of λ*_ρ(x)). Since (2) and (4) depend on x, the subproblems change at each iteration. To analyze the convergence of the algorithm, it is important to quantify the change of the solutions λ*(x) and λ^g(x, ξ) at different x. One natural way is to assume that the aforementioned solutions are Lipschitz continuous in ∇F(x); see (Liu & Vicente, 2021). However, this condition does not hold in general, since ∇F(x) is not positive definite at least at Pareto stationary points, and thus the solutions to (2) and (4) are not unique. We overcome this issue by adding the regularization ρ to ensure uniqueness of the solution and the Lipschitz continuity of λ*_ρ(x) in x.

With this regularized reformulation, we find λ*_ρ(x) by running stochastic projected gradient descent on (8), given by

λ_{k+1} = Π_{∆^M}( λ_k − γ_k( Y_k^⊤ Y_k + ρI )λ_k ),    (9)

where γ_k is the step size, I ∈ R^{M×M} is the identity matrix, and Π_{∆^M} denotes the projection onto the probability simplex ∆^M. With λ_k as an approximation of λ*_ρ(x_k) and Y_k as an approximation of ∇F(x_k), we then update x_k via

x_{k+1} = Π_X( x_k − α_k Y_k λ_k ),    (10)

where X is a closed convex set. The basic MoCo algorithm is summarized in Algorithm 1.
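Putting the updates (6), (9), and (10) together, one MoCo iteration can be sketched as follows. This is an illustrative NumPy sketch under simplifying assumptions (exact Euclidean projections, and X = R^d so the projection in (10) is dropped); the function names are ours, not the paper's:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ind = np.arange(1, len(v) + 1)
    rho = ind[u - css / ind > 0][-1]
    theta = css[rho - 1] / rho
    return np.maximum(v - theta, 0.0)

def moco_step(x, Y, lam, grads, alpha, beta, gamma, rho_reg, L):
    """One MoCo iteration: tracking update (6), lambda update (9), x update (10).
    `grads` holds the stochastic gradient estimates h_{k,m}; `L` holds the
    Lipschitz bounds used to project the tracking variables."""
    M = len(grads)
    # (6): momentum-like tracking of each objective's gradient
    for m in range(M):
        y = Y[:, m] - beta * (Y[:, m] - grads[m])
        n = np.linalg.norm(y)
        Y[:, m] = y if n <= L[m] else y * (L[m] / n)   # projection onto {||y|| <= L_m}
    # (9): projected gradient step on the rho-regularized subproblem (8)
    lam = project_simplex(lam - gamma * (Y.T @ Y + rho_reg * np.eye(M)) @ lam)
    # (10): descend along the corrected multi-gradient estimate Y_k lambda_k
    x = x - alpha * (Y @ lam)
    return x, Y, lam
```

On the two-quadratic toy problem f_1 = 0.5∥x − a∥², f_2 = 0.5∥x − b∥², iterating this step drives x toward the Pareto front between a and b while λ_k remains on the simplex.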

3.2. GENERALIZATION TO NESTED MOO SETTING

In this section we extend MoCo to the nested (bi-level) MOO setting. Recall that in the previous section we introduced the gradient estimator h_{k,m}. In the simple case where ∇f_m(x, ξ) is available, setting h_{k,m} = ∇f_m(x_k, ξ_k) leads to the exact fixed point y*_m(x_k) = ∇f_m(x_k). However, in some practical applications, as shown in Section 5.2, ∇f_m(x, ξ) is difficult to obtain, and hence h_{k,m} can be biased. To put this on concrete ground, we first consider the following nested multi-objective problem:

min_{x∈X} F(x) := (E_ξ[f_1(x, z*_1(x), ξ)], E_ξ[f_2(x, z*_2(x), ξ)], ..., E_ξ[f_M(x, z*_M(x), ξ)])
s.t. z*_m(x) := argmin_{z∈R^d} l_m(x, z) := E_φ[l_m(x, z, φ)],  m = 1, 2, ..., M,    (11)

where l_m is a strongly convex function and φ is a random variable. For convenience, we define f_m(x, z) := E_ξ[f_m(x, z, ξ)]. Under some conditions that will be specified later, it has been shown in (Ghadimi & Wang, 2018) that the gradient of f_m(x) takes the following form:

∇f_m(x) = ∇_x f_m(x, z*_m(x)) − ∇²_{xz} l_m(x, z*_m(x)) [∇²_{zz} l_m(x, z*_m(x))]^{-1} ∇_z f_m(x, z*_m(x)),    (12)

where ∇_x f_m(x, z*_m(x)) = ∂f_m(x, z)/∂x |_{z=z*_m(x)}, ∇²_{xz} l_m(x, z*_m(x)) = ∂²l_m(x, z)/∂x∂z |_{z=z*_m(x)}, and likewise for ∇_z f_m(x, z*_m(x)) and ∇²_{zz} l_m(x, z*_m(x)). Computing an unbiased stochastic estimate of (12) requires z*_m(x), which is often costly in practice. Instead, we iteratively update z_{k,m} to approach z*_m(x_k) via

z_{k+1,m} = z_{k,m} − β_k ∇_z l_m(x_k, z_{k,m}, φ_k).    (13)

We then use z_{k,m} in place of z*_m(x_k) in (12) to compute a biased gradient estimator

h_{k,m} = ∇_x f_m(x_k, z_{k,m}, ξ_k) − ∇²_{xz} l_m(x_k, z_{k,m}, φ′_k) H^{zz}_{k,m} ∇_z f_m(x_k, z_{k,m}, ξ_k),    (14)

where φ_k, φ′_k have the same distribution as φ, and H^{zz}_{k,m} is a stochastic approximation of the Hessian inverse [∇²_{zz} l_m(x_k, z_{k,m})]^{-1}. Given x_k, when z_{k,m} reaches the optimal solution z*_m(x_k), it follows from (12) that E[h_{k,m}] = ∇f_m(x_k).
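The estimator (13)-(14) can be illustrated on a toy bi-level problem where the lower-level solution is known in closed form, so the bias can be measured directly. In this sketch (ours, not the paper's code) we take l(x, z) = 0.5∥z − Ax∥² and f(x, z) = 0.5∥z∥², and for clarity we use the exact Hessian inverse in place of the stochastic approximation H^{zz}:

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 3, 2
A = rng.standard_normal((p, d))      # toy data: l(x, z) = 0.5 * ||z - A x||^2
x = rng.standard_normal(d)

# Here z*(x) = A x, and (12) gives nabla f(x) = A.T @ A @ x for f(x, z) = 0.5||z||^2.
true_grad = A.T @ A @ x

z = np.zeros(p)
beta, sigma = 0.5, 0.01
for _ in range(200):
    # (13): stochastic gradient step on the lower-level objective
    z = z - beta * ((z - A @ x) + sigma * rng.standard_normal(p))

# (14): biased estimator h, using z in place of z*(x)
grad_x_f = np.zeros(d)                      # nabla_x f(x, z) = 0
grad_z_f = z                                # nabla_z f(x, z) = z
hess_xz = -A.T                              # nabla^2_{xz} l(x, z)
hess_zz_inv = np.linalg.inv(np.eye(p))      # nabla^2_{zz} l(x, z) = I (exact inverse)
h = grad_x_f - hess_xz @ hess_zz_inv @ grad_z_f

print(np.linalg.norm(h - true_grad))        # small: the bias shrinks as z -> z*(x)
```

As the lower-level iterate z approaches z*(x), the estimator h approaches the true gradient, which is the mechanism that Lemma 1 quantifies.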
We summarize the MoCo algorithm with inexact gradients in Algorithm 2 (see Appendix D). When z_{k,m} is non-optimal, we quantify the error below.

Lemma 1. Define F_k as the σ-algebra generated by Y_1, Y_2, ..., Y_k. Consider the sequences generated by (6), (9), (10), (13) and (14). Under certain standard assumptions that will be specified in the supplementary material, we have for any m ∈ [M] and any k that

(1/K) Σ_{k=1}^K E[ ∥E[h_{k,m} | F_k] − ∇f_m(x_k)∥² ] = O(α_K²/β_K²),  E[ ∥h_{k,m} − E[h_{k,m} | F_k]∥² | F_k ] ≤ σ_0²,    (15)

where E[·] is the total expectation, α_K, β_K are the learning rates, and σ_0 > 0 is a constant.

Lemma 1 shows that the average bias of the gradient estimator diminishes if α_k and β_k are chosen properly. In addition, the variance of the estimator is bounded by a constant. Allowing biased gradients in this manner enables MoCo to tackle more challenging MTL tasks, as highlighted below.

Remark 2 (Connection between nested MOO and multi-objective actor-critic). Choosing each f_m in (11) to be the infinite-horizon accumulated reward and each l_m to be the critic objective function leads to the popular actor-critic algorithm in reinforcement learning (Konda & Borkar, 1999; Wen et al., 2021). In this work, we extend this to the multi-objective case, and conduct experiments on multi-objective soft actor-critic in Appendix K.3.

3.3. A UNIFIED CONVERGENCE RESULT

In this section we provide the convergence analysis of our proposed method. First, we make the following assumptions on the objective functions.

Assumption 1. For each m ∈ [M]: f_m(x) is Lipschitz continuous with modulus L_m, and ∇f_m(x) is Lipschitz continuous with modulus L_{m,1}, for any x ∈ X.

Due to the x update in (10), the optimal solutions of the y_{k,m} and λ_k subproblems change at each iteration, and the change scales with ∥x_{k+1} − x_k∥. To guarantee the convergence of y_{k,m} and λ_k, this change needs to be controlled. The first half of Assumption 1 ensures that ∇f_m(x) is uniformly bounded, so that ∥x_{k+1} − x_k∥ is upper bounded and thus controlled. The second half of the assumption is standard in establishing convergence for non-convex functions (Bottou et al., 2018). Next, we state an alternative to Assumption 1 for analysis in the convex setting.

Assumption 2. The function f_m(x) is convex for every m ∈ [M], and the feasible set X is bounded.

Notice that when X is bounded, there exists a constant C_x such that ∥x − x′∥ ≤ C_x for any x, x′ ∈ X. This assumption controls ∥x_{k+1} − x_k∥ when the objective functions are convex. Next, to unify the analysis of the nested MOO in Section 3.2 and the basic MOO in Section 3.1, we make the following assumption on the quality of the gradient estimator h_{k,m}.

Assumption 3. For any m ∈ [M], there exist constants c_m, σ_m such that (1/K) Σ_{k=1}^K E[ ∥E[h_{k,m} | F_k] − ∇f_m(x_k)∥² ] ≤ c_m α_K²/β_K² and E[ ∥h_{k,m} − E[h_{k,m} | F_k]∥² | F_k ] ≤ σ_m² for any k.

Assumption 3 requires the stochastic gradient h_{k,m} to be asymptotically unbiased on average and to have bounded variance. Compared to (Liu & Vicente, 2021, Assumption 5.2), Assumption 3 is weaker, because i) it does not require the variance σ_m² to decrease at the same speed as α_k²; and ii) it allows bias in the stochastic gradient of each objective function.
In practice, the batch size is often fixed, and thus the variance is non-decreasing, which highlights one benefit of Assumption 3 over that in (Liu & Vicente, 2021).

Lemma 2. Consider the sequences generated by Algorithm 1. Assume that K ≫ M with K = O(M^{10}). Then, under Assumptions 1 and 3, or Assumptions 2 and 3, if we choose the step sizes α_k = Θ(K^{-9/10}), β_k = Θ(K^{-1/2}), γ_k = Θ(K^{-2/5}), and ρ = Θ(K^{-1/5}), it holds that (1/K) Σ_{k=1}^K E[ ∥d(x_k) − Y_k λ_k∥² ] = O(K^{-1/5}).

With a suitable choice of ρ, as λ_k and Y_k converge to λ*_ρ(x_k) and ∇F(x_k) respectively, the update direction Y_k λ_k for x_k converges to d(x_k) = ∇F(x_k)λ*(x_k), which is the desired MGDA direction. Our method thus achieves a vanishingly small expected error in the stochastic multi-gradient asymptotically for all trajectories, while SMG fails to reduce the error in the multi-gradient. The following theorem captures the convergence of x_k under convex objective functions.

Theorem 1. Consider the sequences generated by Algorithm 1. Under Assumptions 2 and 3, if we choose α_k = Θ(K^{-9/10}), β_k = Θ(K^{-1/2}), γ_k = Θ(K^{-2/5}), and ρ = Θ(K^{-1/5}), it holds for all x* ∈ X that

(1/K) Σ_{k=1}^K E[ λ*(x_k) · (F(x_k) − F(x*)) ] = O(K^{-1/10}).

If we choose x* as a Pareto-optimal point and λ*(x_k) > 0, Theorem 1 captures convergence to the Pareto-optimal objective values. In many practical problems the objective functions are non-convex; the following theorem establishes the convergence of the proposed method for non-convex functions.

Theorem 2. Consider the sequences generated by Algorithm 1 with X = R^d. Under Assumptions 1 and 3, if we choose α_k = Θ(K^{-9/10}), β_k = Θ(K^{-1/2}), γ_k = Θ(K^{-2/5}), and ρ = Θ(K^{-1/5}), it holds that

(1/K) Σ_{k=1}^K E[ ∥∇F(x_k)λ*(x_k)∥² ] = O(K^{-1/10}).

Theorem 2 shows that the MGDA direction ∇F(x_k)λ*(x_k) converges to 0, which indicates that the proposed method achieves Pareto stationarity.
This is the first finite-time convergence guarantee for the stochastic MGDA method under non-convex objective functions.

Theorem 3. Consider the sequences generated by Algorithm 1 with X = R^d. Furthermore, assume there exists a constant F̄ > 0 such that ∥F(x_k)∥ ≤ F̄ for all k ∈ [K]. Then, under Assumptions 1 and 3, if we choose α_k = Θ(K^{-3/5}), β_k = Θ(K^{-2/5}), γ_k = Θ(K^{-1}), and ρ = 0, it holds that

(1/K) Σ_{k=1}^K E[ ∥∇F(x_k)λ*(x_k)∥² ] = O(M K^{-2/5}).

Theorem 3 shows that Algorithm 1 converges to a Pareto stationary point at an improved rate if the sequence F(x_1), F(x_2), ..., F(x_K) is bounded.

Remark 3 (Comparison with SMG). Theorems 1 and 2 provide the convergence rates of MoCo under Assumptions 2-3 and Assumptions 1 and 3, respectively. Compared to the convergence analysis of SMG in (Liu & Vicente, 2021), the convergence rates in Theorems 1 and 2 are derived with a small batch size, without the unjustified assumption on the Lipschitz continuity of λ*(x), and additionally cover the non-convex MOO setting. The Lipschitz continuity of λ*(x) may fail to hold unless ∇F(x) is full rank, which cannot be the case at Pareto stationary points. We overcome this problem by adding a properly chosen regularization to the subproblem. We also provide an improved sample/iteration complexity in Theorem 3 under an additional assumption on F(x_k). Furthermore, we provide an improvement over Theorem 3 with a modified assumption on the stochastic gradient bias in Appendix J.
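For reference, the step-size schedules appearing in Lemma 2 and Theorems 1-2 can be written out programmatically; this is a trivial illustrative helper (ours, with all Θ(·) constants set to 1):

```python
def moco_stepsizes(K):
    """Step sizes from Lemma 2 / Theorems 1-2 (constants set to 1 for illustration):
    alpha_k = Theta(K^{-9/10}), beta_k = Theta(K^{-1/2}),
    gamma_k = Theta(K^{-2/5}),  rho = Theta(K^{-1/5})."""
    return dict(alpha=K ** -0.9, beta=K ** -0.5, gamma=K ** -0.4, rho=K ** -0.2)
```

Note the separation of timescales: the x update (alpha) is the slowest, the tracking update (beta) is faster, and the regularization rho decays slowest of all, which is what drives the bias of Y_k λ_k toward zero.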

4. RELATED WORK

To put our work in context, we review prior art grouped into the following two categories.

Multi-task learning. MTL algorithms find a common model that can solve multiple, possibly related tasks. MTL has shown great success in many fields such as natural language processing, computer vision, and robotics (Hashimoto et al., 2016; Ruder, 2017; Zhang & Yang, 2021; Vandenhende et al., 2021). One line of research designs machine learning models that facilitate MTL, such as architectures with task-specific modules (Misra et al., 2016), with attention-based mechanisms (Rosenbaum et al., 2017; Yang et al., 2020), or with different path activations corresponding to different tasks. Our method is model agnostic, and thus can be applied to these methods in a complementary manner. Another line of research decomposes the problem into multiple tasks and learns these tasks using smaller models (Rusu et al., 2015; Parisotto et al., 2015; Teh et al., 2017; Ghosh et al., 2017), which are then aggregated into a single model using knowledge distillation (Hinton et al., 2015). Our method does not require multiple models, and focuses on learning different tasks simultaneously using a single model.

Gradient-based MOO. This line of work optimizes multiple objectives simultaneously using gradient manipulations. A foundational algorithm in this regard is MGDA (Désidéri, 2012), which dynamically combines gradients to find a common descent direction for all objectives. A comprehensive convergence analysis of the deterministic MGDA algorithm has been provided in (Fliege et al., 2019). Recently, (Liu & Vicente, 2021) extended this analysis to the stochastic counterpart of the multi-gradient descent algorithm, for smooth convex and strongly convex functions. However, this work makes strong assumptions on the bias of the stochastic gradient and does not consider the nested MOO setting that is central to multi-task reinforcement learning. In Yang et al.
(2021), the authors establish convergence of stochastic MGDA under the assumption of access to the true convex combination coefficients, which may not hold in a practical stochastic optimization setting. Another related line of work considers the optimization challenges of MTL, treating task losses as objectives. One common approach is to find gradients that balance the learning of different tasks. The simplest way is to re-weight per-task losses based on a specific criterion such as uncertainty (Kendall et al., 2017), gradient norms (Chen et al., 2018), or task difficulty (Guo et al., 2018). These methods are often heuristic and may be unstable. More recent works (Sener & Koltun, 2018; Yu et al., 2020a; Liu et al., 2021a; Gu et al., 2021) introduce gradient aggregation methods that mitigate conflict among tasks while preserving utility. In (Sener & Koltun, 2018), MTL was first tackled through the lens of MOO techniques using MGDA. In (Yu et al., 2020a), a new method called PCGrad was developed to amend gradient magnitudes and directions in order to avoid conflicts among per-task gradients. In (Liu et al., 2021a), an MGDA-like algorithm named CAGrad was developed, which additionally minimizes the average task loss. In (Liu et al., 2021b), an impartial objective gradient modification mechanism was studied. Concurrent to our work, a Nash bargaining solution was proposed in (Navon et al., 2022) for weighting per-objective gradients. All the aforementioned works on MTL use deterministic objective gradients in their analysis (if any), albeit the accompanying empirical evaluations are done in a stochastic setting. There are also gradient-based MOO algorithms that find a set of Pareto optimal points for a given problem rather than a single point.
To this end, works such as (Liu et al., 2021c; Liu & Vicente, 2021; Lin et al., 2019; Mahapatra & Rajan, 2021; Navon et al., 2020; Lin et al., 2022; Kyriakis et al., 2021; Yang et al., 2021; Zhao et al., 2021; Momma et al., 2022) develop algorithms that find multiple points on the Pareto front in multi-task supervised learning or reinforcement learning settings, ensuring some quality of the obtained set of Pareto points. Our work is orthogonal to this line of research, and can potentially be combined with those methods to achieve better performance.

5. EXPERIMENTS

In this section, we first provide further illustration of our method in comparison with existing gradient-based MOO algorithms on the toy example introduced in Section 1. We then provide empirical comparisons in supervised and reinforcement learning settings.

Toy example. To further elaborate on how MoCo converges to a Pareto stationary point, we again optimize the two objectives given in Figure 1 and demonstrate the performance in the objective space (Figure 2). MGDA with true gradients converges to a Pareto stationary point from all initializations. However, the SMG, PCGrad, and CAGrad methods fail to converge to a Pareto stationary point, and end up at dominated points in the objective space for some initializations. This is because these algorithms use a biased multi-gradient that does not vanish. In contrast, MoCo converges to a Pareto stationary point from every initialization, and follows a similar trajectory to MGDA.

5.1. SUPERVISED LEARNING

We compare MoCo with existing MTL algorithms on the NYU-v2 (Silberman et al., 2012) and CityScapes (Cordts et al., 2015) datasets. We follow the experimental setup of (Liu et al., 2021a) and combine our method with the MTL method MTAN (Liu et al., 2019), which applies an attention mechanism. We evaluate our method against CAGrad, PCGrad, vanilla MTAN, and Cross-Stitch (Misra et al., 2016). Following (Maninis et al., 2019; Liu et al., 2021a; Navon et al., 2022), we use the average per-task performance drop of method A with respect to baseline B as a measure of the overall performance of a given method:

∆m = (1/M) Σ_{m=1}^M (−1)^{ℓ_m} (S_{A,m} − S_{B,m}) / S_{B,m},

where M is the number of tasks, and S_{B,m} and S_{A,m} are the values of metric S_m obtained by the baseline and the compared method, respectively. Here, ℓ_m = 1 if higher values of S_m are better and ℓ_m = 0 otherwise. The results of the experiments are shown in Tables 1 and 2. Our method, MoCo, outperforms all the compared MTL algorithms in terms of ∆m% on both CityScapes and NYU-v2. Since our method focuses on gradient correction, it can also be applied on top of existing gradient-based MOO methods; additional experiments in this regard are provided in Appendix K.
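The ∆m metric can be computed with a short helper (our own illustrative function; the argument names are assumptions, not from the paper's code):

```python
def delta_m_percent(S_A, S_B, higher_is_better):
    """Average per-task performance drop of method A vs. baseline B, in %:
    Delta_m = (1/M) * sum_m (-1)^{l_m} (S_{A,m} - S_{B,m}) / S_{B,m},
    with l_m = 1 when higher metric values are better. Lower is better."""
    assert len(S_A) == len(S_B) == len(higher_is_better)
    M = len(S_A)
    total = sum((-1) ** int(hib) * (a - b) / b
                for a, b, hib in zip(S_A, S_B, higher_is_better))
    return 100.0 * total / M
```

With this sign convention, improving on a higher-is-better metric (or lowering a lower-is-better one) makes ∆m% negative, so more negative values indicate a better method relative to the baseline.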

5.2. REINFORCEMENT LEARNING

For the multi-task reinforcement learning setting, we use the multi-task reinforcement learning benchmark MT10 available in the Meta-World environment (Yu et al., 2020b). We follow the experimental setup used in (Liu et al., 2021a) and provide an empirical comparison between our MoCo method and the existing baselines. Specifically, we use the MTRL codebase (Sodhani & Zhang, 2021) and soft actor-critic (SAC) (Haarnoja et al., 2018) as the underlying reinforcement learning algorithm. Due to space limitations, the experimental results in the multi-task reinforcement learning setting and the details of hyperparameter selection are provided in Appendix K.

A NOTATIONS

In this section we summarize the notations used in the paper, each with a brief description.

| Notation | Description |
| --- | --- |
| x | Decision variable / model parameter |
| ξ | Some random variable independent of x |
| X | Feasible set of x |
| M | Number of objectives in the MOO problem |
| f_m(x) | An objective such that f_m : X → R, where m ∈ {1, 2, . . . , M} |
| F(x) | Vector of functions containing f_1(x), f_2(x), . . . , f_M(x), such that F : X → R^M |
| ∇F(x) | Jacobian of F(x), which has ∇f_1(x), ∇f_2(x), . . . , ∇f_M(x) as columns |
| R^M_{++} | Positive orthant cone of dimension M |
| ∆^M | M-dimensional probability simplex |
| 1 | M-dimensional vector of ones |
| λ | An element of ∆^M |
| λ*(x) | A solution to problem (2) (original MGDA subproblem) |
| λ_g(x) | A solution to problem (4) (SMG subproblem) |
| λ*_ρ(x) | The unique solution to problem (8) (ℓ2-regularized subproblem) |
| d(x) | MGDA multi-gradient, d(x) = ∇F(x)λ*(x) |
| l_m(x, z) | Lower-level objective, where m ∈ {1, 2, . . . , M} |
| φ | Some random variable independent of x, z |
| z*_m(x) | Minimizer of l_m(x, z) with respect to z for any x ∈ X |
| z_{k,m} | Approximation of z*_m(x_k), updated as in equation (13) |
| ∇²_{zz} l_m(x, z) | Hessian of l_m(x, z) with respect to z |
| H^{zz}_{k,m} | Stochastic approximation of ∇²_{zz} l_m(x_k, z_{k,m}) |

Table 3: Some important notations used in the paper.

B ADDITIONAL RELATED WORK

In this section we provide a brief discussion of additional work related to MTL and MOO. We first discuss a closely related concurrent work, Zhou et al. (2022), which was not available at the time of submission of this paper. In Zhou et al. (2022), the authors introduce a unified framework for stochastic gradient-based MOO algorithms. Similar to MoCo, the algorithm proposed in Zhou et al. (2022) also uses momentum-like techniques to reduce the multi-gradient bias. The key algorithmic difference is that MoCo uses a momentum-based correction on the full gradient estimate, while Zhou et al. (2022) uses momentum-like moving averaging for the convex combination coefficients λ computed from stochastic gradients. As a consequence of this difference, it is not clear whether the averaged weights λ obtained in Zhou et al. (2022) ensure convergence of the stochastic MOO update direction to any deterministic MOO direction; in our work, we show convergence to the MGDA direction in Lemma 2. In terms of theoretical results, Zhou et al. (2022) provides convergence guarantees for both the convex and non-convex cases, with convergence rates similar to those of single-objective stochastic gradient descent (SGD). However, to obtain these results, Zhou et al. (2022) requires an additional assumption on the boundedness of function values, which is not required in the standard single-objective SGD convergence analysis. In contrast, the convergence guarantees presented in our Theorems 1 and 2 are based on fairly standard assumptions in the optimization literature and do not assume any bound on the function values. Furthermore, our results rely on weaker assumptions on the stochastic gradient bias, which facilitates bi-level MOO.
In addition, we provide improved convergence results for MoCo in Theorems 3 and 4, with stronger assumptions on function value bounds similar to those in Zhou et al. (2022), but still with weaker assumptions on the bias of the stochastic gradients. Beyond gradient-based MOO, which is the main focus of this paper, there also exist gradient-free blackbox MOO algorithms based on evolutionary algorithms or Bayesian optimization, such as (Deb et al., 2002; Golovin & Zhang, 2020; Knowles, 2006; Konakovic Lukovic et al., 2020). However, these methods often suffer from the curse of dimensionality and may not be feasible in large-scale MOO problems. On the other hand, recent works have analysed MTL from different viewpoints. In (Wang et al., 2021), the authors explore the connection between gradient-based meta learning and MTL. In (Ye et al., 2021), the meta learning problem with multiple objectives in the upper level is tackled via a gradient-based MOO approach. The importance of task grouping in MTL is analysed in works such as (Fifty et al., 2021). In (Meyerson & Miikkulainen, 2020), the authors show that seemingly unrelated tasks can be used for MTL.

The table in Appendix C summarizes the comparison between MoCo and SMG (Liu & Vicente, 2021). There, the "Batch size" column gives the number of samples used at each (outer-level) iteration; the "Non-convex" column denotes whether the analysis is valid for non-convex functions; the "Lipschitz continuity of λ*(x)" column denotes whether Lipschitz continuity of λ*(x) (see Remark 1) with respect to x was assumed; the "Bounded functions" column denotes whether boundedness of function values was assumed; the "Biased gradient" column gives the bias in the stochastic gradients of the functions allowed by the analysis; and the "Sample complexity" column gives the (outer-level) sample complexity of the corresponding method.

C SUMMARY OF COMPARISON WITH CLOSELY RELATED PRIOR WORK

| Method | Batch size | Non-convex | Lipschitz continuity of λ*(x) | Bounded functions | Biased gradient | Sample complexity |
| --- | --- | --- | --- | --- | --- | --- |
| SMG (Theorem 1) | O(ϵ^{-2}) | ✗ | ✓ | ✗ | ✗ | O(ϵ^{-4}) |
| MoCo (Theorem 1) | O(1) | ✗ | ✗ | ✗ | O(α²/β²) | O(ϵ^{-10}) |
| MoCo (Theorem 2) | O(1) | ✓ | ✗ | ✗ | O(α²/β²) | O(ϵ^{-10}) |
| MoCo (Theorem 3) | O(1) | ✓ | ✗ | ✓ | O(α²/β²) | O(ϵ^{-2.5}) |
| MoCo (Theorem 4) | O(1) | ✓ | ✗ | ✓ | O(β) | O(ϵ^{-2}) |

D ALGORITHM FOR MOCO WITH INEXACT GRADIENT

In this section we provide the pseudo-code, omitted from the main text, for MoCo with inexact gradients as described in Section 3.2. As remarked in Section 3.2, this algorithm corresponds to the actor-critic setting with multiple critics. In Appendix K.3, we provide empirical evaluations of multi-task reinforcement learning using soft actor-critic.
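To illustrate the inexact-gradient construction h_{k,m} = ∇_x f_m − ∇²_{xz} l_m H^{zz}_{k,m} ∇_z f_m alongside the stochastic lower-level tracking step on z_{k,m} (the updates analyzed in Appendix E), here is a minimal sketch on a toy scalar bilevel instance. The quadratic objectives, coefficients, and step size below are our own illustration, not an experiment from the paper:

```python
import numpy as np

# Toy bilevel instance (hypothetical, for illustration only):
#   lower level: l_m(x, z) = 0.5 * (z - a_m * x)^2  =>  z_m^*(x) = a_m * x
#   upper level: f_m(x, z) = 0.5 * z^2              =>  grad f_m(x) = a_m^2 * x
A = np.array([1.0, 2.0])   # one coefficient a_m per objective (M = 2)
X_POINT = 3.0              # fixed x at which gradients are evaluated

def inexact_gradient_step(z, x, noise, beta=0.1):
    """One lower-level tracking step and the resulting inexact gradient h_m.

    z is updated by a noisy stochastic gradient step on l_m; then
    h_m = grad_x f_m - grad_xz l_m * [grad_zz l_m]^{-1} * grad_z f_m,
    which for the toy above reduces to h_m = a_m * z_m, since
    grad_x f_m = 0, grad_xz l_m = -a_m, and grad_zz l_m = 1.
    """
    z_new = z - beta * ((z - A * x) + noise)  # noisy grad_z l_m(x, z)
    h = A * z_new                             # inexact gradient estimate
    return z_new, h

rng = np.random.default_rng(0)
z = np.zeros(2)
for _ in range(2000):
    z, h = inexact_gradient_step(z, X_POINT, rng.normal(0.0, 0.1, size=2))
# as z_m tracks z_m^*(x), h_m approaches the true gradient a_m^2 * x
```

As z_{k,m} converges to z*_m(x), the bias of h_m vanishes, which is the mechanism quantified by Lemma 1.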

E PROOF OF LEMMA 1

Throughout this section, we write E[·|F_k] as E_k[·] for conciseness. Consider the following conditions:

(a) For any x ∈ X, l_m(x, z) is strongly convex w.r.t. z with modulus μ_m > 0.

(b) There exist constants L_{xz} and l_{zz} such that ∥∇²_{xz} l_m(x, z)∥ ≤ L_{xz}, and ∇²_{zz} l_m(x, z) is l_{zz}-Lipschitz continuous w.r.t. z.

(c) There exist constants l_{fx}, l_{fz}, l'_{fz}, l_z such that ∇_x f_m(x, z) and ∇_z f_m(x, z) are, respectively, l_{fx}- and l_{fz}-Lipschitz continuous w.r.t. z; ∇_z f_m(x, z) is l'_{fz}-Lipschitz continuous w.r.t. x; and f_m(x, z) is l_z-Lipschitz continuous w.r.t. z.

(d) There exist constants C_H and σ_H such that ∥E_k[H^{zz}_{k,m}] − [∇²_{zz} l_m(x_k, z_{k,m})]^{-1}∥² ≤ C_H α_k²/β_k² and E_k∥H^{zz}_{k,m}∥² ≤ σ_H². Moreover, there exist constants C_x, C_{xz}, C_z, C_l such that E_k∥∇_x f_m(x_k, z_{k,m}, ξ_k)∥² ≤ C_x², E_k∥∇²_{xz} l_m(x_k, z_{k,m}, φ'_k)∥² ≤ C_{xz}², E_k∥∇_z f_m(x_k, z_{k,m}, ξ_k)∥² ≤ C_z², and E_k∥∇_z l_m(x_k, z_{k,m}, φ_k)∥² ≤ C_l².

Algorithm 2 MoCo with inexact gradient
1: Input: initial model parameter x_0, tracking parameters {y_{0,m}}_{m=1}^{M}, lower-level parameters {z_{0,m}}_{m=1}^{M}, convex combination coefficient λ_0, and their respective step sizes {α_k}_{k=0}^{K}, {β_k}_{k=0}^{K}, and {γ_k}_{k=0}^{K}.
2: for k = 0, . . . , K − 1 do
3:   Update z_{k,m}, y_{k,m}, λ_k, and x_k via the updates (13), (6), (9), and (10).
4: end for

The above conditions are standard in the literature (Ghadimi & Wang, 2018). In particular, condition (d) on H^{zz}_{k,m} can be guaranteed by (Ghadimi & Wang, 2018, Algorithm 3). With these conditions, we first give a restatement of Lemma 1.

Lemma 3 (Restatement of Lemma 1). Consider the sequences generated by (13), (6), (9) and (10). Under conditions (a)-(d), if we choose the step sizes as those of Lemma 2, then for any m,

(1/K) Σ_{k=1}^{K} E∥E[h_{k,m}|F_k] − ∇f_m(x_k)∥² = O(α_K²/β_K²),  (20)

and there exists a constant σ_0 such that E[∥h_{k,m} − E[h_{k,m}|F_k]∥² | F_k] ≤ σ_0².  (21)

Proof.
We first prove (21):

E_k∥h_{k,m} − E_k[h_{k,m}]∥² = E_k∥h_{k,m}∥² − ∥E_k[h_{k,m}]∥² ≤ E_k∥h_{k,m}∥²
≤ 2E_k∥∇_x f_m(x_k, z_{k,m}, ξ_k)∥² + 2E_k[∥∇²_{xz} l_m(x_k, z_{k,m}, φ'_k)∥² ∥H^{zz}_{k,m}∥² ∥∇_z f_m(x_k, z_{k,m}, ξ_k)∥²]
≤ 2C_x² + 2C_{xz}²C_z²σ_H²,

where the last inequality follows from condition (d) along with the conditional independence of ∇²_{xz} l_m(x_k, z_{k,m}, φ'_k), H^{zz}_{k,m}, and ∇_z f_m(x_k, z_{k,m}, ξ_k) given F_k.

Next we prove (20). We have

E_k[h_{k,m}] = ∇_x f_m(x_k, z_{k,m}) − ∇²_{xz} l_m(x_k, z_{k,m}) E_k[H^{zz}_{k,m}] ∇_z f_m(x_k, z_{k,m}).

In the following proof, we write z*_m(x_k) as z*_{k,m}. The above identity along with (12) implies

∥E_k[h_{k,m}] − ∇f_m(x_k)∥ ≤ ∥∇_x f_m(x_k, z_{k,m}) − ∇_x f_m(x_k, z*_{k,m})∥
+ ∥∇²_{xz} l_m(x_k, z_{k,m}) E_k[H^{zz}_{k,m}] ∇_z f_m(x_k, z_{k,m}) − ∇²_{xz} l_m(x_k, z_{k,m})[∇²_{zz} l_m(x_k, z_{k,m})]^{-1} ∇_z f_m(x_k, z_{k,m})∥
+ ∥∇²_{xz} l_m(x_k, z_{k,m})[∇²_{zz} l_m(x_k, z_{k,m})]^{-1} ∇_z f_m(x_k, z_{k,m}) − ∇²_{xz} l_m(x_k, z*_{k,m})[∇²_{zz} l_m(x_k, z*_{k,m})]^{-1} ∇_z f_m(x_k, z*_{k,m})∥
≤ l_{fx}∥z_{k,m} − z*_m(x_k)∥ + L_{xz}l_z∥E_k[H^{zz}_{k,m}] − [∇²_{zz} l_m(x_k, z_{k,m})]^{-1}∥ + (L_{xz}l_z/μ_m + l_{fz}/μ_m + l_{fz}l_zl_{zz}/μ_m²)∥z_{k,m} − z*_m(x_k)∥,  (24)

where the last inequality follows from conditions (a)-(c). The inequality (24) implies

∥E_k[h_{k,m}] − ∇f_m(x_k)∥² ≤ 2L_{xz}²l_z²∥E_k[H^{zz}_{k,m}] − [∇²_{zz} l_m(x_k, z_{k,m})]^{-1}∥² + 2(l_{fx} + L_{xz}l_z/μ_m + l_{fz}/μ_m + l_{fz}l_zl_{zz}/μ_m²)²∥z_{k,m} − z*_m(x_k)∥²
≤ 2L_{xz}²l_z²C_H α_K²/β_K² + 2(l_{fx} + L_{xz}l_z/μ_m + l_{fz}/μ_m + l_{fz}l_zl_{zz}/μ_m²)²∥z_{k,m} − z*_m(x_k)∥²,  (25)

where the last inequality follows from condition (d) on the quality of H^{zz}_{k,m}. Thus, to prove (20), it suffices to show that (1/K) Σ_{k=1}^{K} E∥z_{k,m} − z*_{k,m}∥² = O(α_K²/β_K²).

The convergence of z_{k,m}. We start with

∥z_{k+1,m} − z*_{k+1,m}∥² = ∥z_{k+1,m} − z*_{k,m}∥² + 2⟨z_{k+1,m} − z*_{k,m}, z*_{k,m} − z*_{k+1,m}⟩ + ∥z*_{k,m} − z*_{k+1,m}∥².  (26)

The first term is bounded as

E_k∥z_{k+1,m} − z*_{k,m}∥² = E_k∥z_{k,m} − β_k∇_z l_m(x_k, z_{k,m}, φ_k) − z*_{k,m}∥²
= ∥z_{k,m} − z*_{k,m}∥² − 2β_k⟨z_{k,m} − z*_{k,m}, ∇_z l_m(x_k, z_{k,m})⟩ + β_k²E_k∥∇_z l_m(x_k, z_{k,m}, φ_k)∥²
≤ (1 − 2μ_mβ_k)∥z_{k,m} − z*_{k,m}∥² + C_l²β_k²,  (27)

where the last inequality follows from conditions (a) and (d). Under conditions (a)-(c), there exists a constant L_{z,m} such that z*_m(x) is L_{z,m}-Lipschitz continuous (Ghadimi & Wang, 2018, Lemma 2.2(b)). Let L_z = max_m L_{z,m}. Then the second term in (26) can be bounded as

⟨z_{k+1,m} − z*_{k,m}, z*_{k,m} − z*_{k+1,m}⟩ ≤ L_z∥z_{k+1,m} − z*_{k,m}∥∥x_k − x_{k+1}∥ ≤ L_zC_yα_k∥z_{k+1,m} − z*_{k,m}∥ ≤ (μ_m/2)β_k∥z_{k+1,m} − z*_{k,m}∥² + (1/2)L_z²C_y²μ_m^{-1}α_k²/β_k,  (28)

where C_y = sup_k ∥Y_k∥ ≤ sup_{x∈X} ∥∇F(x)∥ < ∞ (by Assumption 1 or 2 and the projection), and the last inequality uses Young's inequality. The last term in (26) is bounded as

∥z*_{k,m} − z*_{k+1,m}∥² ≤ L_z²C_y²α_k².  (29)

Substituting (27)-(29) into (26) yields

E_k∥z_{k+1,m} − z*_{k+1,m}∥² ≤ (1 − μ_mβ_k)∥z_{k,m} − z*_{k,m}∥² + C_l²β_k² + L_z²C_y²μ_m^{-1}α_k²/β_k + L_z²C_y²α_k².

Taking total expectation on both sides and telescoping (with α_k = α_K and β_k = β_K for all k) implies

(1/K) Σ_{k=1}^{K} E∥z_{k,m} − z*_{k,m}∥² = O(1/(Kβ_K)) + O(β_K) + O(α_K) + O(α_K²/β_K²).

This, together with the step-size choice in Lemma 3, implies (1/K) Σ_{k=1}^{K} E∥z_{k,m} − z*_{k,m}∥² = O(α_K²/β_K²), which along with (25) implies (20). This completes the proof.

F PROOF OF LEMMA 2

Before we present the main proof, we first introduce Lemmas 4 and 5, which are direct consequences of (Dontchev & Rockafellar, 2009, Theorem 2F.7) and (Koshal et al., 2011, Lemma A.1), respectively.

Lemma 4. Under Assumption 1, there exists a constant L_λ := ρ^{-1} Σ_{m=1}^{M} L_m such that

∥λ*_ρ(x) − λ*_ρ(x')∥ ≤ L_λ∥∇F(x) − ∇F(x')∥,

which further indicates ∥λ*_ρ(x) − λ*_ρ(x')∥ ≤ L_{λ,F}∥x − x'∥, where L_{λ,F} := ρ^{-1}L with L = Σ_{m=1}^{M} L_m Σ_{m=1}^{M} L_{m,1}.

Lemma 5. For any ρ > 0 and x ∈ X, we have

0 ≤ ∥∇F(x)λ*_ρ(x)∥² − ∥∇F(x)λ*(x)∥² ≤ (ρ/2)(1 − 1/M).

Now we start to prove Lemma 2.

Proof. Throughout the following proof, we write E[·|F_k] as E_k[·] for conciseness. With d(x_k) = ∇F(x_k)λ*(x_k), we have

∥d(x_k) − Y_kλ_k∥² = ∥∇F(x_k)λ*(x_k) − ∇F(x_k)λ*_ρ(x_k) + ∇F(x_k)λ*_ρ(x_k) − Y_kλ*_ρ(x_k) + Y_kλ*_ρ(x_k) − Y_kλ_k∥²
≤ 3∥∇F(x_k)λ*(x_k) − ∇F(x_k)λ*_ρ(x_k)∥² + 3∥∇F(x_k)λ*_ρ(x_k) − Y_kλ*_ρ(x_k)∥² + 3∥Y_kλ*_ρ(x_k) − Y_kλ_k∥²
≤ 3∥∇F(x_k)λ*(x_k) − ∇F(x_k)λ*_ρ(x_k)∥² + 3∥∇F(x_k) − Y_k∥² + 3C_y²∥λ*_ρ(x_k) − λ_k∥².  (36)

From (36), to prove Lemma 2, it suffices to show that ∥∇F(x_k)λ*(x_k) − ∇F(x_k)λ*_ρ(x_k)∥ diminishes, and to establish the convergence of Y_k and λ_k.

Bounding ∥∇F(x_k)λ*(x_k) − ∇F(x_k)λ*_ρ(x_k)∥. Denoting λ*(x_k) by λ*_k and λ*_ρ(x_k) by λ*_{ρ,k}, we first consider the bound

∥∇F(x_k)λ*_k − ∇F(x_k)λ*_{ρ,k}∥² = ∥∇F(x_k)λ*_k∥² + ∥∇F(x_k)λ*_{ρ,k}∥² − 2⟨∇F(x_k)λ*_k, ∇F(x_k)λ*_{ρ,k}⟩
≤ ∥∇F(x_k)λ*_{ρ,k}∥² − ∥∇F(x_k)λ*_k∥² ≤ ρ/2,  (37)

where the first inequality is due to the optimality condition ⟨λ, ∇F(x_k)ᵀ∇F(x_k)λ*_k⟩ ≥ ⟨λ*_k, ∇F(x_k)ᵀ∇F(x_k)λ*_k⟩ = ∥∇F(x_k)λ*_k∥² for any λ ∈ ∆^M, and the last inequality is due to Lemma 5. With the choice ρ = Θ(K^{-1/5}) as required by Theorems 1 and 2, we have

∥∇F(x_k)λ*_k − ∇F(x_k)λ*_{ρ,k}∥² = O(K^{-1/5}).  (38)

Convergence of Y_k. With E_k[h_{k,m}] = E[h_{k,m}|F_k], we start from

E_k∥y_{k+1,m} − ∇f_m(x_{k+1})∥² = E_k∥y_{k+1,m} − ∇f_m(x_k)∥² + 2E_k⟨y_{k+1,m} − ∇f_m(x_k), ∇f_m(x_k) − ∇f_m(x_{k+1})⟩ + E_k∥∇f_m(x_k) − ∇f_m(x_{k+1})∥².  (39)

We bound the first term in (39) as

E_k∥y_{k+1,m} − ∇f_m(x_k)∥² ≤ E_k∥y_{k,m} − β_k(y_{k,m} − h_{k,m}) − ∇f_m(x_k)∥²
= ∥y_{k,m} − ∇f_m(x_k)∥² − 2β_k⟨y_{k,m} − ∇f_m(x_k), y_{k,m} − E_k[h_{k,m}]⟩ + β_k²E_k∥y_{k,m} − h_{k,m}∥²
= ∥y_{k,m} − ∇f_m(x_k)∥² − 2β_k∥y_{k,m} − ∇f_m(x_k)∥² − 2β_k⟨y_{k,m} − ∇f_m(x_k), ∇f_m(x_k) − E_k[h_{k,m}]⟩ + β_k²E_k∥y_{k,m} − h_{k,m}∥²
≤ (1 − 2β_k)∥y_{k,m} − ∇f_m(x_k)∥² + 2β_k∥y_{k,m} − ∇f_m(x_k)∥∥∇f_m(x_k) − E_k[h_{k,m}]∥ + β_k²E_k∥y_{k,m} − h_{k,m}∥²
≤ (1 − β_k)∥y_{k,m} − ∇f_m(x_k)∥² + β_k∥∇f_m(x_k) − E_k[h_{k,m}]∥² + β_k²E_k∥y_{k,m} − h_{k,m}∥²,  (40)

where the last inequality follows from Young's inequality. Now consider the last term of (40). Selecting β_k such that 3β_k² ≤ β_k/2, and using Assumption 3, we have

β_k²E_k∥y_{k,m} − h_{k,m}∥² = β_k²E_k∥y_{k,m} − ∇f_m(x_k) + ∇f_m(x_k) − E_k[h_{k,m}] + E_k[h_{k,m}] − h_{k,m}∥²
≤ 3β_k²∥y_{k,m} − ∇f_m(x_k)∥² + 3β_k²∥∇f_m(x_k) − E_k[h_{k,m}]∥² + 3β_k²E_k∥E_k[h_{k,m}] − h_{k,m}∥²
≤ (β_k/2)∥y_{k,m} − ∇f_m(x_k)∥² + 3β_k²∥∇f_m(x_k) − E_k[h_{k,m}]∥² + 3σ_m²β_k².  (41)

Plugging (41) into (40), we obtain

E_k∥y_{k+1,m} − ∇f_m(x_k)∥² ≤ (1 − β_k/2)∥y_{k,m} − ∇f_m(x_k)∥² + (β_k + 3β_k²)∥∇f_m(x_k) − E_k[h_{k,m}]∥² + 3σ_m²β_k².  (42)

The second term in (39) can be bounded as

⟨y_{k+1,m} − ∇f_m(x_k), ∇f_m(x_k) − ∇f_m(x_{k+1})⟩ ≤ ∥y_{k+1,m} − ∇f_m(x_k)∥∥∇f_m(x_k) − ∇f_m(x_{k+1})∥ ≤ L_{m,1}∥y_{k+1,m} − ∇f_m(x_k)∥∥x_{k+1} − x_k∥ ≤ L_{m,1}C_yα_k∥y_{k+1,m} − ∇f_m(x_k)∥ ≤ (β_k/8)∥y_{k+1,m} − ∇f_m(x_k)∥² + 2L_{m,1}²C_y²α_k²/β_k ≤ (β_k/8)∥y_{k+1,m} − ∇f_m(x_k)∥² + 2L̄_1²C_y²α_k²/β_k,  (43)

where the second-to-last inequality follows from Young's inequality, and the last inequality follows from the definition L̄_1 = max_m L_{m,1}. The last term in (39) can be bounded as

∥∇f_m(x_k) − ∇f_m(x_{k+1})∥² ≤ L̄_1²∥x_{k+1} − x_k∥² ≤ L̄_1²C_y²α_k².  (44)

Collecting the upper bounds in (42)-(44) and substituting them into (39) gives

E_k∥y_{k+1,m} − ∇f_m(x_{k+1})∥² ≤ (1 − β_k/4)∥y_{k,m} − ∇f_m(x_k)∥² + (β_k + 3β_k²)∥∇f_m(x_k) − E_k[h_{k,m}]∥² + 3σ_m²β_k² + L̄_1²C_y²α_k² + 4L̄_1²C_y²α_k²/β_k.  (45)

Suppose α_k and β_k are constants. Taking total expectation and then telescoping both sides of (45) gives

(1/K) Σ_{k=1}^{K} E∥y_{k,m} − ∇f_m(x_k)∥² = O(1/(Kβ_k)) + O((1/K) Σ_{k=1}^{K} E∥∇f_m(x_k) − E_k[h_{k,m}]∥²) + O(β_k) + O(α_k²/β_k²).  (46)

Along with the choice of step sizes required by Theorems 1 and 2, and due to Assumption 3, the last inequality gives

(1/K) Σ_{k=1}^{K} E∥y_{k,m} − ∇f_m(x_k)∥² = O(K^{-1/2}),  (47)

which, based on the definitions of Y_k and ∇F(x_k), implies that

(1/K) Σ_{k=1}^{K} E∥Y_k − ∇F(x_k)∥² = O(MK^{-1/2}).  (48)

Convergence of λ_k. We write λ*_ρ(x_k) in short as λ*_{ρ,k} in the following. We start from

∥λ_{k+1} − λ*_{ρ,k+1}∥² = ∥λ_{k+1} − λ*_{ρ,k}∥² + 2⟨λ_{k+1} − λ*_{ρ,k}, λ*_{ρ,k} − λ*_{ρ,k+1}⟩ + ∥λ*_{ρ,k} − λ*_{ρ,k+1}∥².  (49)

The first term is bounded as

∥λ_{k+1} − λ*_{ρ,k}∥² = ∥Π_{∆^M}(λ_k − γ_k(Y_kᵀY_k + ρI)λ_k) − λ*_{ρ,k}∥² ≤ ∥λ_k − γ_k(Y_kᵀY_k + ρI)λ_k − λ*_{ρ,k}∥²
= ∥λ_k − λ*_{ρ,k}∥² − 2γ_k⟨λ_k − λ*_{ρ,k}, (Y_kᵀY_k + ρI)λ_k⟩ + γ_k²∥(Y_kᵀY_k + ρI)λ_k∥²
≤ ∥λ_k − λ*_{ρ,k}∥² − 2γ_k⟨λ_k − λ*_{ρ,k}, (Y_kᵀY_k + ρI)λ_k⟩ + (C_y² + ρ)²γ_k².  (50)

Consider the inner product in the last inequality:

⟨λ_k − λ*_{ρ,k}, (Y_kᵀY_k + ρI)λ_k⟩ = ⟨λ_k − λ*_{ρ,k}, (Y_kᵀY_k − ∇F(x_k)ᵀ∇F(x_k))λ_k⟩ + ⟨λ_k − λ*_{ρ,k}, (∇F(x_k)ᵀ∇F(x_k) + ρI)λ_k⟩
≥ −2C_y∥λ_k − λ*_{ρ,k}∥∥Y_k − ∇F(x_k)∥ + ⟨λ_k − λ*_{ρ,k}, (∇F(x_k)ᵀ∇F(x_k) + ρI)(λ_k − λ*_{ρ,k})⟩ + ⟨λ_k − λ*_{ρ,k}, (∇F(x_k)ᵀ∇F(x_k) + ρI)λ*_{ρ,k}⟩
≥ −2C_y∥λ_k − λ*_{ρ,k}∥∥Y_k − ∇F(x_k)∥ + ρ∥λ_k − λ*_{ρ,k}∥²
≥ −2C_y²ρ^{-1}∥Y_k − ∇F(x_k)∥² + (ρ/2)∥λ_k − λ*_{ρ,k}∥²,  (51)

where the second inequality follows from the optimality condition ⟨λ_k − λ*_{ρ,k}, (∇F(x_k)ᵀ∇F(x_k) + ρI)λ*_{ρ,k}⟩ ≥ 0, and the last inequality follows from Young's inequality. Plugging (51) back into (50) gives

∥λ_{k+1} − λ*_{ρ,k}∥² ≤ (1 − ργ_k)∥λ_k − λ*_{ρ,k}∥² + 4C_y²ρ^{-1}γ_k∥Y_k − ∇F(x_k)∥² + (C_y² + ρ)²γ_k².  (52)

With Lemma 4, the second term in (49) can be bounded as

⟨λ_{k+1} − λ*_{ρ,k}, λ*_{ρ,k} − λ*_{ρ,k+1}⟩ ≤ L_{λ,F}∥λ_{k+1} − λ*_{ρ,k}∥∥x_k − x_{k+1}∥ ≤ L_{λ,F}C_yα_k∥λ_{k+1} − λ*_{ρ,k}∥ ≤ (ρ/4)γ_k∥λ_{k+1} − λ*_{ρ,k}∥² + L_{λ,F}²C_y²ρ^{-1}α_k²/γ_k,  (53)

where the last inequality is due to Young's inequality. The last term in (49) is bounded as

∥λ*_{ρ,k} − λ*_{ρ,k+1}∥² ≤ L_{λ,F}²C_y²α_k².  (54)

Substituting (52)-(54) into (49) yields

∥λ_{k+1} − λ*_{ρ,k+1}∥² ≤ (1 − (ρ/2)γ_k)∥λ_k − λ*_{ρ,k}∥² + 4C_y²ρ^{-1}γ_k∥Y_k − ∇F(x_k)∥² + (C_y² + ρ)²γ_k² + 2L_{λ,F}²C_y²ρ^{-1}α_k²/γ_k + L_{λ,F}²C_y²α_k².  (55)

Suppose α_k, β_k, and γ_k are constants given K. Taking total expectation, rearranging, and taking the telescoping sum on both sides of (55) gives

(1/K) Σ_{k=1}^{K} E∥λ_k − λ*_{ρ,k}∥² = O(1/(Kργ_k)) + O((1/(ρK)) Σ_{k=1}^{K} E∥Y_k − ∇F(x_k)∥²) + O(γ_k/ρ) + O(α_k²/(γ_k²ρ⁴)) + O(α_k²/(γ_kρ³)),  (56)

where we have used L_{λ,F} = O(1/ρ) from Lemma 4. Then, plugging in the choices α_k = Θ(K^{-9/10}), β_k = Θ(K^{-1/2}), γ_k = Θ(K^{-2/5}), ρ = Θ(K^{-1/5}) and substituting (48) into (56) gives

(1/K) Σ_{k=1}^{K} E∥λ_k − λ*_{ρ,k}∥² = O(MK^{-3/10} + K^{-1/5}).  (57)

Since the number of objectives is typically very small compared to the number of iterations, under the assumption K = Ω(M^{10}) we get

(1/K) Σ_{k=1}^{K} E∥λ_k − λ*_{ρ,k}∥² = O(K^{-1/5}).  (58)

Thus, from (36), (38), (48), and (58), we have

(1/K) Σ_{k=1}^{K} E∥d(x_k) − Y_kλ_k∥² = O(K^{-1/5}).  (59)

This completes the proof.
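The λ_k update analyzed above, λ_{k+1} = Π_{∆^M}(λ_k − γ_k(Y_kᵀY_k + ρI)λ_k), requires a Euclidean projection onto the probability simplex. The following minimal sketch uses the standard sort-based projection routine; it is our own illustration, not code from the paper:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex Delta^M
    (standard sort-and-threshold routine)."""
    u = np.sort(v)[::-1]                       # sort in decreasing order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u * idx > css)[0][-1]     # largest index kept active
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def lambda_step(lam, Y, gamma, rho_reg):
    """One projected-gradient step on the regularized subproblem:
    lam <- Proj_{Delta^M}(lam - gamma * (Y^T Y + rho*I) lam)."""
    grad = Y.T @ (Y @ lam) + rho_reg * lam
    return project_simplex(lam - gamma * grad)
```

By construction the output of `lambda_step` always lies on the simplex, matching the non-expansiveness of Π_{∆^M} used in (50).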

G PROOF OF THEOREM 1

Proof. We first have

∥x_{k+1} − x*∥² = ∥Π_X(x_k − α_kY_kλ_k) − x*∥² ≤ ∥x_k − α_kY_kλ_k − x*∥²
= ∥x_k − x*∥² − 2α_k⟨x_k − x*, Y_kλ_k⟩ + α_k²∥Y_kλ_k∥²
≤ ∥x_k − x*∥² − 2α_k⟨x_k − x*, ∇F(x_k)λ*_k⟩ − 2α_k⟨x_k − x*, Y_kλ_k − ∇F(x_k)λ*_k⟩ + α_k²C_y²
≤ ∥x_k − x*∥² − 2α_kλ*_k • (F(x_k) − F(x*)) − 2α_k⟨x_k − x*, Y_kλ_k − ∇F(x_k)λ*_k⟩ + α_k²C_y²,  (60)

where the last inequality uses the convexity of f_m(x) for all m. The third term in (60) can be bounded using the Cauchy-Schwarz inequality as

⟨x_k − x*, Y_kλ_k − ∇F(x_k)λ*_k⟩ ≥ −C_x(∥Y_k − ∇F(x_k)∥ + C_y∥λ_k − λ*_{ρ,k}∥ + ∥∇F(x_k)λ*_k − ∇F(x_k)λ*_{ρ,k}∥),  (61)

where C_x is an upper bound on ∥x − x'∥ for x, x' ∈ X, which exists by Assumption 2. Substituting the above inequality into (60) and rearranging gives

λ*_k • (F(x_k) − F(x*)) ≤ (1/(2α_k))(∥x_k − x*∥² − ∥x_{k+1} − x*∥²) + C_x(∥Y_k − ∇F(x_k)∥ + C_y∥λ_k − λ*_{ρ,k}∥ + ∥∇F(x_k)λ*_k − ∇F(x_k)λ*_{ρ,k}∥) + (C_y²/2)α_k.

Taking the telescoping sum of the last inequality gives

(1/K) Σ_{k=1}^{K} λ*_k • (F(x_k) − F(x*)) ≤ ∥x_1 − x*∥²/(2Kα_k) + (C_x/K) Σ_{k=1}^{K} (∥Y_k − ∇F(x_k)∥ + C_y∥λ_k − λ*_{ρ,k}∥ + ∥∇F(x_k)λ*_k − ∇F(x_k)λ*_{ρ,k}∥) + (C_y²/2)α_k,

which along with (38), (48), and (58) indicates (1/K) Σ_{k=1}^{K} E[λ*_k • (F(x_k) − F(x*))] = O(K^{-1/10}) if we choose α_k = Θ(K^{-9/10}), β_k = Θ(K^{-1/2}), γ_k = Θ(K^{-2/5}), and ρ = Θ(K^{-1/5}).

H PROOF OF THEOREM 2

Before we go into the main proof, we first show the following key lemma.

Lemma 6. For any x ∈ X and λ ∈ ∆^M, with d(x) = ∇F(x)λ*(x), it holds that ⟨d(x), ∇F(x)λ⟩ ≥ ∥d(x)∥².

Proof. We write d_λ(x) = ∇F(x)λ in the following proof. Since ∆^M is a convex set, for any λ' ∈ ∆^M we have α(λ' − λ*) + λ* ∈ ∆^M for any α ∈ [0, 1]. Then, since d(x) minimizes ∥d_λ(x)∥² over λ ∈ ∆^M, we have

∥d(x)∥² ≤ ∥α(d_{λ'}(x) − d(x)) + d(x)∥².  (65)

Expanding the right-hand side of the inequality gives α²∥d_{λ'}(x) − d(x)∥² + 2α⟨d(x), d_{λ'}(x) − d(x)⟩ ≥ 0. Since this must hold for α arbitrarily close to 0, we have ⟨d(x), d_{λ'}(x) − d(x)⟩ ≥ 0 for all λ' ∈ ∆^M, which yields the stated inequality after rearranging.

Now we can prove Theorem 2.

Proof. By the L_{m,1}-smoothness of f_m, we have for any m

f_m(x_{k+1}) ≤ f_m(x_k) + α_k⟨∇f_m(x_k), −Y_kλ_k⟩ + (L_{m,1}/2)∥x_{k+1} − x_k∥² ≤ f_m(x_k) + α_k⟨∇f_m(x_k), −Y_kλ_k⟩ + (L_{m,1}/2)C_y²α_k².  (68)

The second term in the last inequality can be bounded as

⟨∇f_m(x_k), −Y_kλ_k⟩ = ⟨∇f_m(x_k), ∇F(x_k)λ*_k − Y_kλ_k⟩ + ⟨∇f_m(x_k), −∇F(x_k)λ*_k⟩
≤ L_m(∥Y_k − ∇F(x_k)∥ + C_y∥λ_k − λ*_{ρ,k}∥ + ∥∇F(x_k)λ*_k − ∇F(x_k)λ*_{ρ,k}∥) + ⟨∇f_m(x_k), −∇F(x_k)λ*_k⟩
≤ L_m(∥Y_k − ∇F(x_k)∥ + C_y∥λ_k − λ*_{ρ,k}∥ + ∥∇F(x_k)λ*_k − ∇F(x_k)λ*_{ρ,k}∥) − ∥∇F(x_k)λ*_k∥²,  (69)

where the last inequality follows from Lemma 6 by letting ∇F(x_k)λ = ∇f_m(x_k), and the first inequality follows from

⟨∇f_m(x_k), ∇F(x_k)λ*_k − Y_kλ_k⟩ ≤ L_m∥∇F(x_k)λ*_k − Y_kλ_k∥ ≤ L_m∥∇F(x_k)λ_k − Y_kλ_k + ∇F(x_k)λ*_{ρ,k} − ∇F(x_k)λ_k + ∇F(x_k)λ*_k − ∇F(x_k)λ*_{ρ,k}∥ ≤ L_m(∥Y_k − ∇F(x_k)∥ + C_y∥λ_k − λ*_{ρ,k}∥ + ∥∇F(x_k)λ*_k − ∇F(x_k)λ*_{ρ,k}∥).

Plugging (69) into (68), taking expectation on both sides, and rearranging yields

α_kE∥∇F(x_k)λ*_k∥² ≤ E[f_m(x_k) − f_m(x_{k+1})] + L_mα_k(E∥Y_k − ∇F(x_k)∥ + C_yE∥λ_k − λ*_{ρ,k}∥ + E∥∇F(x_k)λ*_k − ∇F(x_k)λ*_{ρ,k}∥) + (L_{m,1}/2)C_y²α_k².

Taking the telescoping sum on both sides of the last inequality gives

(1/K) Σ_{k=1}^{K} E∥∇F(x_k)λ*_k∥² ≤ (1/(α_kK))(f_m(x_1) − inf_x f_m(x)) + L_m(1/K) Σ_{k=1}^{K} (E∥Y_k − ∇F(x_k)∥ + C_yE∥λ_k − λ*_{ρ,k}∥ + E∥∇F(x_k)λ*_k − ∇F(x_k)λ*_{ρ,k}∥) + (L_{m,1}/2)C_y²α_k,

which along with (38), (48), and (58) indicates (1/K) Σ_{k=1}^{K} E∥∇F(x_k)λ*_k∥² = O(K^{-1/10}) if we choose α_k = Θ(K^{-9/10}), β_k = Θ(K^{-1/2}), γ_k = Θ(K^{-2/5}), and ρ = Θ(K^{-1/5}).
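Lemma 6's variational property of the min-norm direction can be sanity-checked numerically for M = 2, where the MGDA subproblem has a closed form. This check is our own illustration, not part of the paper:

```python
import numpy as np

def min_norm_point_2obj(g1, g2):
    """Closed-form solution of min_{lam in [0, 1]} ||lam*g1 + (1-lam)*g2||^2,
    i.e. the min-norm point d(x) of the convex hull of {g1, g2}."""
    diff = g1 - g2
    denom = diff @ diff
    lam = 0.0 if denom == 0.0 else np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0)
    return lam * g1 + (1.0 - lam) * g2

# Empirical check of Lemma 6: <d(x), grad F(x) lam> >= ||d(x)||^2
# for random Jacobians and random lam = (t, 1-t) on the simplex.
rng = np.random.default_rng(1)
for _ in range(200):
    g1, g2 = rng.normal(size=3), rng.normal(size=3)
    d = min_norm_point_2obj(g1, g2)
    t = rng.uniform()
    combo = t * g1 + (1.0 - t) * g2   # grad F(x) lam for lam in Delta^2
    assert combo @ d >= d @ d - 1e-9
```

The inequality holds exactly (up to floating-point error) because d(x) is the projection of the origin onto the convex hull of the gradients.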

I PROOF OF THEOREM 3

Proof. Recall from (68) that, by the L_{m,1}-smoothness of f_m, we have for any m

f_m(x_{k+1}) ≤ f_m(x_k) + α_k⟨∇f_m(x_k), −Y_kλ_k⟩ + (L_{m,1}/2)C_y²α_k².

Multiplying both sides by λ^m_k and summing over all m ∈ [M], we obtain

F(x_{k+1})ᵀλ_k ≤ F(x_k)ᵀλ_k + α_k⟨∇F(x_k)λ_k, −Y_kλ_k⟩ + (L̄_1/2)C_y²α_k²,  (74)

where we have used λ_k := (λ¹_k, λ²_k, . . . , λ^M_k)ᵀ and L̄_1 = max_m L_{m,1}. We can bound the second term of (74) as

⟨∇F(x_k)λ_k, −Y_kλ_k⟩ = ⟨∇F(x_k)λ_k, −∇F(x_k)λ_k + (∇F(x_k) − Y_k)λ_k⟩ ≤ −∥∇F(x_k)λ_k∥² + (1/2)∥∇F(x_k)λ_k∥² + (1/2)∥∇F(x_k) − Y_k∥² = −(1/2)∥∇F(x_k)λ_k∥² + (1/2)∥∇F(x_k) − Y_k∥²,  (75)

where the inequality is due to the Cauchy-Schwarz and Young inequalities together with ∥λ_k∥ ≤ 1. Substituting (75) into (74) and rearranging, we have

(α_k/2)∥∇F(x_k)λ_k∥² ≤ F(x_k)ᵀλ_k − F(x_{k+1})ᵀλ_k + (α_k/2)∥∇F(x_k) − Y_k∥² + (L̄_1/2)α_k²C_y².  (76)

Given K, let α_k, β_k, and γ_k be constants for all k ∈ [K]. Taking total expectation on both sides and summing over iterations, we obtain

(α_k/2) Σ_{k=1}^{K} E∥∇F(x_k)λ_k∥² ≤ Σ_{k=1}^{K} E[F(x_k)ᵀλ_k − F(x_{k+1})ᵀλ_k] + (α_k/2) Σ_{k=1}^{K} E∥∇F(x_k) − Y_k∥² + (L̄_1/2)α_k²KC_y².  (77)

We bound the first term on the right-hand side of (77) as

Σ_{k=1}^{K} E[F(x_k)ᵀλ_k − F(x_{k+1})ᵀλ_k] = E[Σ_{k=1}^{K−1} F(x_{k+1})ᵀ(λ_{k+1} − λ_k) + F(x_1)ᵀλ_1 − F(x_{K+1})ᵀλ_K]
≤ E[Σ_{k=1}^{K−1} ∥F(x_{k+1})∥∥λ_{k+1} − λ_k∥ + ∥F(x_1)∥∥λ_1∥ + ∥F(x_{K+1})∥∥λ_K∥]
≤ F̄ Σ_{k=1}^{K−1} ∥γ_kY_kᵀY_kλ_k∥ + 2F̄ ≤ F̄C_y²(K − 1)γ_k + 2F̄,  (78)

where the first inequality is due to Cauchy-Schwarz, the second inequality is due to the bounds on F(x_k) and λ_k together with the update for λ_k with ρ = 0, and the third inequality is due to the bound on Y_k for all k ∈ [K]. Substituting (78) into (77) and dividing both sides by α_kK/2, we have

(1/K) Σ_{k=1}^{K} E∥∇F(x_k)λ_k∥² ≤ 2F̄C_y²((K − 1)/K)(γ_k/α_k) + 4F̄/(α_kK) + (1/K) Σ_{k=1}^{K} E∥∇F(x_k) − Y_k∥² + L̄_1α_kC_y²,  (79)

which, along with (46) and choosing α_k = Θ(K^{-3/5}), β_k = Θ(K^{-2/5}), and γ_k = Θ(K^{-1}), gives (1/K) Σ_{k=1}^{K} E∥∇F(x_k)λ_k∥² = O(MK^{-2/5}). The result then follows by observing that, for any k ∈ [K],

∥∇F(x_k)λ_k∥² ≥ min_{λ∈∆^M} ∥∇F(x_k)λ∥² = ∥∇F(x_k)λ*_k∥².

J IMPROVED CONVERGENCE RATE WITH MODIFIED ASSUMPTIONS

In this section we state and prove Theorem 4, which improves upon the results of Theorem 3 under modified assumptions. We first state the modified assumption.

Assumption 4. For any m, there exist constants c'_m and σ_m such that (1/K) Σ_{k=1}^{K} E∥E[h_{k,m}|F_k] − ∇f_m(x_k)∥² ≤ c'_mβ_K and E[∥h_{k,m} − E[h_{k,m}|F_k]∥² | F_k] ≤ σ_m² for any k.

Similar to Assumption 3, Assumption 4 requires the stochastic gradient h_{k,m} to be almost unbiased with bounded variance, which is still weaker than the standard unbiasedness assumption on stochastic gradients. Furthermore, Assumption 4 can be satisfied by running multiple nested updates, which requires additional lower-level samples. With this assumption, we present the following improved result.

Theorem 4. Consider the sequences generated by Algorithm 1, and further assume that there exists a constant F̄ > 0 such that ∥F(x_k)∥ ≤ F̄ for all k ∈ [K]. Then, under Assumptions 1 and 4, if we choose the step sizes α_k = Θ(K^{-1/2}), β_k = Θ(K^{-1/2}), γ_k = Θ(K^{-3/4}), and ρ = 0, it holds that

(1/K) Σ_{k=1}^{K} E∥∇F(x_k)λ*(x_k)∥² = O(MK^{-1/2}).  (82)

Proof. Convergence of Y_k. We begin the proof by revisiting the convergence analysis of Y_k from the proof of Theorem 3, under the assumptions considered in Theorem 4. For convenience, we restate (39) here:

E_k∥y_{k+1,m} − ∇f_m(x_{k+1})∥² = E_k∥y_{k+1,m} − ∇f_m(x_k)∥² + 2E_k⟨y_{k+1,m} − ∇f_m(x_k), ∇f_m(x_k) − ∇f_m(x_{k+1})⟩ + E_k∥∇f_m(x_k) − ∇f_m(x_{k+1})∥².  (83)

We bound the first term in (83), similar to (42), as

E_k∥y_{k+1,m} − ∇f_m(x_k)∥² ≤ (1 − β_k/2)∥y_{k,m} − ∇f_m(x_k)∥² + (β_k + 3β_k²)∥∇f_m(x_k) − E_k[h_{k,m}]∥² + 3σ_m²β_k².  (84)

The second term in (83) can be bounded as

⟨y_{k+1,m} − ∇f_m(x_k), ∇f_m(x_k) − ∇f_m(x_{k+1})⟩ ≤ ∥y_{k+1,m} − ∇f_m(x_k)∥∥∇f_m(x_k) − ∇f_m(x_{k+1})∥ ≤ L_{m,1}∥y_{k+1,m} − ∇f_m(x_k)∥∥x_{k+1} − x_k∥ ≤ L_{m,1}α_k∥y_{k+1,m} − ∇f_m(x_k)∥∥Y_kλ_k∥ ≤ (β_k/8)∥y_{k+1,m} − ∇f_m(x_k)∥² + 2L_{m,1}²(α_k²/β_k)∥Y_kλ_k∥² ≤ (β_k/8)∥y_{k+1,m} − ∇f_m(x_k)∥² + 2L̄_1²(α_k²/β_k)∥Y_kλ_k∥²,  (85)

where the third inequality is due to the x_k update, the second-to-last follows from Young's inequality, and the last follows from the definition L̄_1 = max_m L_{m,1}. The last term in (83) can be bounded as

∥∇f_m(x_k) − ∇f_m(x_{k+1})∥² ≤ L̄_1²∥x_{k+1} − x_k∥² ≤ L̄_1²α_k²∥Y_kλ_k∥².  (86)

Collecting the upper bounds in (84)-(86) and substituting them into (83) gives

E_k∥y_{k+1,m} − ∇f_m(x_{k+1})∥² ≤ (1 − β_k/4)∥y_{k,m} − ∇f_m(x_k)∥² + (β_k + 3β_k²)∥∇f_m(x_k) − E_k[h_{k,m}]∥² + 3σ_m²β_k² + (L̄_1²α_k² + 4L̄_1²α_k²/β_k)∥Y_kλ_k∥².  (87)

For all k, let α_k = α_K, β_k = β_K, and γ_k = γ_K be constants given K. Then, taking total expectation and telescoping both sides of (87) gives

(1/K) Σ_{k=1}^{K} E∥y_{k,m} − ∇f_m(x_k)∥² = O(1/(Kβ_K)) + O((1/K) Σ_{k=1}^{K} E∥∇f_m(x_k) − E_k[h_{k,m}]∥²) + O(β_K) + (L̄_1²α_K² + 4L̄_1²α_K²/β_K)(1/K) Σ_{k=1}^{K} E∥Y_kλ_k∥².  (88)

Summing the last inequality over the objectives m ∈ [M] and using Assumption 4, we obtain

(1/K) Σ_{k=1}^{K} E∥Y_k − ∇F(x_k)∥² = O(M/(Kβ_K)) + O(Mβ_K) + (ML̄_1²α_K² + 4ML̄_1²α_K²/β_K)(1/K) Σ_{k=1}^{K} E∥Y_kλ_k∥²,  (89)

where we have used Σ_{m=1}^{M} ∥y_{k,m} − ∇f_m(x_k)∥² ≥ ∥Y_k − ∇F(x_k)∥². Then, with the decomposition

∥Y_kλ_k∥² ≤ 2∥Y_k − ∇F(x_k)∥² + 2∥∇F(x_k)λ_k∥²,  (90)

we arrive at

(1/K) Σ_{k=1}^{K} E∥Y_k − ∇F(x_k)∥² = O(M/(Kβ_K)) + O(Mβ_K) + (2ML̄_1²α_K² + 8ML̄_1²α_K²/β_K)(1/K) Σ_{k=1}^{K} E∥∇F(x_k)λ_k∥² + (2ML̄_1²α_K² + 8ML̄_1²α_K²/β_K)(1/K) Σ_{k=1}^{K} E∥Y_k − ∇F(x_k)∥².  (91)

Now, note that with the step-size choice α_K ≤ √β_K/(4L̄_1√M), 0 < β_K < 1, and α_K and β_K on the same time scale, there exist a constant 0 < C_1 < 1 and a valid choice of β_K such that the following inequality holds:

1 − 2ML̄_1²α_K² − 8ML̄_1²α_K²/β_K ≥ C_1;  (92)

an example of a constant satisfying (92) for the aforementioned choice of α_K and β_K is C_1 = 1/4. Then, from (91) and (92), we can arrive at

(1/K) Σ_{k=1}^{K} E∥Y_k − ∇F(x_k)∥² = O(M/(Kβ_K)) + O(Mβ_K) + ((2ML̄_1²/C_1)α_K² + (8ML̄_1²/C_1)α_K²/β_K)(1/K) Σ_{k=1}^{K} E∥∇F(x_k)λ_k∥².  (93)

Next, we analyse the x_k sequence, following similar lines as in the proof of Theorem 3. Accordingly, from (68), we have for any m

f_m(x_{k+1}) ≤ f_m(x_k) + α_k⟨∇f_m(x_k), −Y_kλ_k⟩ + (L_{m,1}/2)C_y²α_k².  (94)

Multiplying both sides by λ^m_k and summing over all m ∈ [M], we obtain

F(x_{k+1})ᵀλ_k ≤ F(x_k)ᵀλ_k + α_k⟨∇F(x_k)λ_k, −Y_kλ_k⟩ + (L̄_1/2)C_y²α_k²,  (95)

where we have used λ_k := (λ¹_k, . . . , λ^M_k)ᵀ and L̄_1 = max_m L_{m,1}. We can bound the second term of (95) as

⟨∇F(x_k)λ_k, −Y_kλ_k⟩ = ⟨∇F(x_k)λ_k, −∇F(x_k)λ_k⟩ + ⟨∇F(x_k)λ_k, (∇F(x_k) − Y_k)λ_k⟩ ≤ −∥∇F(x_k)λ_k∥² + (1/2)∥∇F(x_k)λ_k∥² + (1/2)∥(∇F(x_k) − Y_k)λ_k∥² ≤ −(1/2)∥∇F(x_k)λ_k∥² + (1/2)∥∇F(x_k) − Y_k∥²,  (96)

where the first inequality is due to the Cauchy-Schwarz and Young inequalities, and the last is due to the bound on λ_k. Substituting (96) into (95) and rearranging, we have

(α_k/2)∥∇F(x_k)λ_k∥² ≤ F(x_k)ᵀλ_k − F(x_{k+1})ᵀλ_k + (α_k/2)∥∇F(x_k) − Y_k∥² + (L̄_1/2)α_k²C_y².  (97)

We then take total expectation on both sides and sum over iterations to obtain

(α_K/2) Σ_{k=1}^{K} E∥∇F(x_k)λ_k∥² ≤ Σ_{k=1}^{K} E[F(x_k)ᵀλ_k − F(x_{k+1})ᵀλ_k] + (α_K/2) Σ_{k=1}^{K} E∥∇F(x_k) − Y_k∥² + (L̄_1/2)α_K²KC_y².  (98)
We bound the first term on the right-hand side of the inequality (98) as

Σ_{k=1}^{K} E[F(x_k)ᵀλ_k − F(x_{k+1})ᵀλ_k] = E[Σ_{k=1}^{K−1} F(x_{k+1})ᵀ(λ_{k+1} − λ_k) + F(x_1)ᵀλ_1 − F(x_{K+1})ᵀλ_K]
≤ E[Σ_{k=1}^{K−1} ∥F(x_{k+1})∥∥λ_{k+1} − λ_k∥ + ∥F(x_1)∥∥λ_1∥ + ∥F(x_{K+1})∥∥λ_K∥]
≤ F̄ Σ_{k=1}^{K−1} ∥γ_KY_kᵀY_kλ_k∥ + 2F̄ ≤ F̄C_yγ_K Σ_{k=1}^{K−1} ∥Y_kλ_k∥ + 2F̄,  (99)

where the first inequality is due to Cauchy-Schwarz, the second is due to the bounds on F(x_k) and λ_k together with the update for λ_k with ρ = 0, and the third is due to the bound on Y_k for all k ∈ [K].

K.1 TOY EXAMPLE

Generation of Figure 2. For the comparison of the MOO algorithms in the objective space depicted in Figure 2, we use 5 initializations, x_0 ∈ {(−8.5, 7.5), (−8.5, 5), (10, −8), (0, 0), (9, 9)}. The optimization configuration for each algorithm is the same as in the aforementioned trajectory example, except for an initial learning rate of 0.0025.

Comparison with SMG with growing batch size. To compare the multi-gradient bias of SMG, SMG with an increasingly large batch size, and MoCo, we use the norm of the error of the stochastic multi-gradient along three trajectories randomly initialized from x_0 ∈ {(−8.5, 7.5), (−8.5, 5), (10, −8)}. To estimate the bias of the multi-gradient, we compute the multi-gradient using 10 sets of gradient samples at each point of the trajectory, take the average, and record the norm of the difference between the computed average and the true multi-gradient. All three methods are run for 70000 iterations and follow the same optimization configuration used for Figure 2. For SMG with increasing batch size, we increase the number of samples in the minibatch used for estimating the gradient by one every 10000 iterations. We report the bias of the multi-gradient with respect to the number of iterations and also the number of samples in Figure 3.
In the figure, Trajectories 1, 2, and 3 correspond to the initializations (-8.5, 7.5), (-8.5, 5), and (10, -8), respectively. It can be seen that our method performs comparably to SMG with increasing batch size, but with fewer samples. Furthermore, SMG has a non-decaying bias on some trajectories.
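The bias-measurement procedure described above (average 10 stochastic multi-gradient samples at each point and compare against the true multi-gradient) can be sketched as follows. This is a minimal illustration, not the paper's code; the helper names `stoch_multigrad` and `true_multigrad` are hypothetical.

```python
# Sketch of the multi-gradient bias measurement: at a point x on the
# trajectory, draw n_sets independent stochastic multi-gradients, average
# them, and record the norm of the gap to the true multi-gradient.
import numpy as np

def multigrad_bias(x, stoch_multigrad, true_multigrad, n_sets=10, rng=None):
    """Estimate the multi-gradient bias at point x.

    stoch_multigrad(x, rng) -> one stochastic multi-gradient sample (array)
    true_multigrad(x)       -> exact multi-gradient (array)
    """
    rng = rng or np.random.default_rng(0)
    samples = [stoch_multigrad(x, rng) for _ in range(n_sets)]
    avg = np.mean(samples, axis=0)
    return np.linalg.norm(avg - true_multigrad(x))
```

Recording this quantity at every point of a trajectory gives the curves plotted in Figure 3.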

K.2 SUPERVISED LEARNING

In this section, we provide additional details and experiments on the Cityscapes and NYU-v2 datasets, and also provide experimental results on two additional datasets, Office-31 and Office-home. Each experiment consists of solving multiple supervised learning problems related to each dataset. Let N be the number of training data samples, and let x be the model that we train to perform all the tasks simultaneously. Let the image dimension be P × Q. We will use $T_m$ for the ground truth and $\hat{T}_m$ for the corresponding prediction by the model x, where m ∈ [M] and M is the number of tasks. We can now formulate the corresponding objective for each task, as given in (Liu et al., 2019). The objective for pixel-wise classification is the pixel-wise cross-entropy, given as
$$f_1(x) = -\frac{1}{NPQ}\sum_{i,p,q} T_{1,i}(p,q)\log \hat{T}_{1,i}(p,q),$$
where i ∈ [N], p ∈ [P], and q ∈ [Q]. Similarly, we have the objectives for pixel-wise depth estimation and surface normal estimation, respectively, as
$$f_2(x) = \frac{1}{NPQ}\sum_{i,p,q} |T_{2,i}(p,q) - \hat{T}_{2,i}(p,q)| \quad\text{and}\quad f_3(x) = \frac{1}{NPQ}\sum_{i,p,q} T_{3,i}(p,q)\cdot \hat{T}_{3,i}(p,q),$$
where · is the element-wise dot product. With these objectives, we can formulate the problem (1) for the Cityscapes and NYU-v2 tasks, with f_3 only used in the latter. Similar to the NYU-v2 and Cityscapes experiments, we can also formulate the supervised learning tasks on Office-31 and Office-home MTL as instances of problem (1).

Cityscapes dataset. For the evaluation on the Cityscapes dataset, we follow the experimental setup used in (Liu et al., 2021a). All the MTL algorithms considered are trained using a SegNet (Badrinarayanan et al., 2017) model with the attention mechanism MTAN (Liu et al., 2019) applied on top of it for the different tasks. All the MTL methods in comparison are trained for 200 epochs, using a batch size of 8. We use Adam as the optimizer with a learning rate of 0.0001 for the first 100 epochs, and a learning rate of 0.00005 for the remaining epochs.
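The three per-task objectives above can be sketched in NumPy as follows. This is an illustrative sketch only; tensor shapes and the sign convention for the normal loss are assumptions, not the paper's implementation.

```python
# Illustrative NumPy versions of the three per-pixel objectives. Shapes:
#   seg_gt, seg_pred:     (N, C, P, Q)  one-hot labels / predicted probabilities
#   depth_gt, depth_pred: (N, P, Q)
#   norm_gt, norm_pred:   (N, 3, P, Q)  unit surface normals
import numpy as np

def seg_loss(seg_gt, seg_pred, eps=1e-8):
    # f1: pixel-wise cross-entropy, averaged over images and pixels.
    N, _, P, Q = seg_gt.shape
    return -np.sum(seg_gt * np.log(seg_pred + eps)) / (N * P * Q)

def depth_loss(depth_gt, depth_pred):
    # f2: pixel-wise absolute (L1) error.
    return np.mean(np.abs(depth_gt - depth_pred))

def normal_loss(norm_gt, norm_pred):
    # f3: mean element-wise dot product between true and predicted normals,
    # negated here so that lower is better (the sign is an assumption).
    return -np.mean(np.sum(norm_gt * norm_pred, axis=1))
```

Each function corresponds to one entry of the objective vector F(x) in problem (1).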
NYU-v2 dataset. Following (Liu et al., 2021a), for the NYU-v2 dataset we use the same setup as for the Cityscapes dataset, except with a batch size of 2. For implementing MoCo in the NYU-v2 experiments, we use β_k = 0.99 and γ_k = 0.1 with gradient normalization, followed by weighting each gradient with the corresponding task loss. This normalization was applied to avoid biasing towards one task, which can be seen to be the case for MGDA. For the projection onto the simplex in the λ_k update in MoCo, we apply a softmax function to the update, to improve computational efficiency. The training and test loss curves for semantic segmentation, depth estimation, and surface normal estimation on the NYU-v2 dataset are shown in Figures 5a, 5b, and 5c, respectively. It can be seen that the model starts to slightly overfit the training dataset with respect to the semantic loss after the 100th epoch. However, this did not significantly harm the test performance in terms of accuracy compared to the other algorithms.

MoCo with existing MTL algorithms. We apply the gradient correction introduced in MoCo on top of existing MTL algorithms to further improve their performance. Specifically, we apply the gradient correction of MoCo to PCGrad and CAGrad on Cityscapes and NYU-v2. For the gradient correction (update step (6)) in PCGrad we use β_k = 0.99 for both the Cityscapes and NYU-v2 datasets, and for that in CAGrad we use β_k = 0.99 for the Cityscapes dataset and β_k = 0.99/k^0.5 for the NYU-v2 dataset. The results are shown in Tables 5 and 6 for the Cityscapes and NYU-v2 datasets, respectively. We restate the results for independent task performance and the original MTL algorithm performance for reference. It can be seen that the gradient correction improves the performance of the algorithms which only use stochastic gradients. For PCGrad, where no explicit computation of convex combination coefficients for the gradients is involved, there is an improvement in ∆m% for NYU-v2 by 0.35%.
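The softmax-based λ_k update mentioned above, used in place of the Euclidean projection onto the probability simplex, can be sketched as follows. This is a sketch under the paper's notation (γ_k, Y_k, λ_k); the exact form of the unconstrained step is an assumption.

```python
# Sketch: instead of a Euclidean projection onto the probability simplex,
# the unconstrained lambda-update is passed through a softmax, which always
# returns a point on the simplex and is cheaper to compute.
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def lambda_update_softmax(lam_k, Y_k, gamma_k):
    # Unconstrained gradient step on lambda (gradient of 0.5*||Y lam||^2),
    # then softmax in place of the simplex projection.
    step = lam_k - gamma_k * (Y_k.T @ Y_k @ lam_k)
    return softmax(step)
```

Note that the softmax is not a true projection, so the resulting iterate differs from the projected one; the text above motivates this as a computational-efficiency trade-off.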
For Cityscapes, a slight degradation can be seen in terms of ∆m%, in exchange for improvements in 3 out of 4 performance metrics. This can be expected, as PCGrad does not explicitly control the point of convergence.

Figure 6: Metaworld MT10 benchmark tasks (Yu et al., 2020b).

For the MoCo implementation on Office-home, we use β_k = 0.5/k^0.5 and γ_k = 0.1/k^0.5. The results are given in Table 8. It can be seen that MoCo significantly outperforms other methods on most tasks, and also in terms of ∆m%.

Table 9: Multi-task reinforcement learning results on the MT10 Metaworld benchmark.

K.3 REINFORCEMENT LEARNING

For the multi-task reinforcement learning setting, we use the multi-task reinforcement learning benchmark MT10 available in the Meta-world environment (Yu et al., 2020b). Figure 6 illustrates the 10 tasks associated with the MT10 benchmark. We follow the experimental setup used in (Liu et al., 2021a) and provide an empirical comparison between our MoCo method and the existing baselines. Specifically, we use the MTRL codebase (Sodhani & Zhang, 2021) and use soft actor-critic (SAC) (Haarnoja et al., 2018) as the underlying reinforcement learning algorithm. All the methods are trained for 2 million steps with a batch size of 1280. Each method is evaluated once every 10000 steps, and the highest average test performance of a method over 5 random seeds over the entire training stage is reported in Table 9. In this experiment, the vanilla MoCo outperforms PCGrad, but its performance is not as good as that of CAGrad, which optimizes the average performance of all tasks. We further run the gradient correction of MoCo on top of CAGrad, and the resulting algorithm outperforms the vanilla CAGrad. This suggests that incorporating the gradient correction of MoCo into existing gradient-based MTL algorithms also boosts their performance.
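The reporting protocol above (evaluate every 10000 steps, average success over the 5 seeds at each checkpoint, report the best checkpoint average over the whole run) can be sketched as follows; the array layout is an assumption for illustration.

```python
# Sketch of the MT10 reporting rule: given a (n_seeds x n_checkpoints)
# array of per-evaluation success rates, average over seeds at each
# checkpoint, then take the best checkpoint average over training.
import numpy as np

def best_average_success(eval_success):
    # eval_success: shape (n_seeds, n_checkpoints)
    per_checkpoint = eval_success.mean(axis=0)   # average over the 5 seeds
    return per_checkpoint.max()                  # best checkpoint over training
```

This is the "highest average test performance over the entire training stage" number reported in Table 9.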



z, ξ)] and f_m(x) := f_m(x, z*_m(x)). The problem (11) is a generalization of the popular bilevel optimization framework (Ghadimi & Wang, 2018; Hong et al., 2020; Liu et al., 2020; Ji et al., 2021; Chen et al., 2021; 2022).
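The bilevel structure f_m(x) = f_m(x, z*_m(x)), where z*_m(x) minimizes an inner objective, can be illustrated with a toy sketch. All functions here are stand-ins chosen for illustration, not the paper's objectives; the inner problem is solved approximately by plain gradient descent.

```python
# Toy illustration of the bilevel structure f(x) = f(x, z*(x)), where
# z*(x) = argmin_z l(x, z) is approximated by gradient descent on z.
import numpy as np

def inner_solution(x, inner_grad, z0, lr=0.1, steps=200):
    # Approximate z*(x) by running gradient descent on the inner variable z.
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        z -= lr * inner_grad(x, z)
    return z

# Example inner problem: l(x, z) = 0.5 * ||z - x||^2, so z*(x) = x exactly.
inner_grad = lambda x, z: z - x
x = np.array([1.0, -2.0])
z_star = inner_solution(x, inner_grad, z0=np.zeros(2))
outer_value = 0.5 * np.sum((x - z_star) ** 2)   # outer objective at (x, z*(x))
```

With this inner problem, the gradient-descent iterates contract toward x, so `z_star` is numerically indistinguishable from the exact solution z*(x) = x.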

Figure 2: Comparison of trajectories in the objective space. We use five initializations in the same toy example as in Figure 1, and plot the optimization trajectories in the objective space. MGDA converges to the Pareto front from all of the initializations. SMG, PCGrad, and CAGrad, which only have access to a single stochastic gradient per objective, fail to converge to the Pareto front from some initializations. Our MoCo follows a similar trajectory to that of MGDA, and finds the Pareto front for each initialization.

λ_{k+1} and x_{k+1} following (9)–(10)
9: end for
10: Output x_K

(b) There exist constants L_xz, L_zz, l_xz, l_zz such that ∇_z l_m(x, z) is L_xz-Lipschitz continuous w.r.t. x and L_zz-Lipschitz continuous w.r.t. z, and ∇_xz l_m(x, z) and ∇_zz l_m(x, z) are respectively l_xz-Lipschitz and l_zz-Lipschitz continuous w.r.t. (x, z).

Figure 3: Comparison of the multi-gradient error.

Generation of Figure 1. For generating the trajectories in Figure 1, we use 3 initializations x_0 ∈ {(-8.5, 7.5), (-8.5, 5), (10, -8)}, and run each algorithm for 70000 iterations. For all the algorithms, we use an initial learning rate of 0.001, exponentially decaying at a rate of 0.05. In this example, for MoCo, we use β_k = 5/k^0.5, where k is the number of iterations.

Figure 4: Training and test loss for the Cityscapes tasks

For each method in comparison, we report the average test performance of the model over the last 10 epochs, averaged over 3 seeds. For implementing MoCo on the Cityscapes dataset, we use β_k = 0.05/k^0.5 and γ_k = 0.1/k^0.5, where k is the iteration number. For the projection onto the simplex in the λ_k update, we use a softmax function. The training and test loss curves for semantic segmentation and depth estimation are shown in Figures 4a and 4b, respectively.
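The decaying schedules stated above for Cityscapes can be written out directly (assuming, as elsewhere in the appendix, that k is the iteration number starting from 1):

```python
# MoCo hyper-parameter schedules for Cityscapes, as stated in the text:
# beta_k = 0.05 / k^0.5 and gamma_k = 0.1 / k^0.5, with k >= 1.
def beta_schedule(k: int) -> float:
    return 0.05 / k ** 0.5

def gamma_schedule(k: int) -> float:
    return 0.1 / k ** 0.5
```

Both schedules decay at the O(1/√k) rate used in the theory, differing only in their initial values.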

Method | success (mean ± stderr)
Multi-task SAC | 0.49 ± 0.073
Multi-task SAC + Task Encoder | 0.54 ± 0.047
Multi-headed SAC | 0.61 ± 0.036
PCGrad | 0.72 ± 0.022
CAGrad | 0.83 ± 0.045
MoCo (ours) | 0.75 ± 0.050
CAGrad + MoCo (ours) | 0.86 ± 0.022
One SAC agent per task (upper bound) | 0.90 ± 0.032

Multi-task supervised learning on the Cityscapes dataset with 7-class semantic segmentation and depth estimation results. Results are averaged over 3 independent runs. CAGrad, PCGrad, GradDrop, and our method are applied on the MTAN backbone.

Multi-task supervised learning on the NYU-v2 dataset with 13-class semantic segmentation, depth estimation, and surface normal prediction results. Results are averaged over 3 independent runs. CAGrad, PCGrad, GradDrop, and MoCo are applied on the MTAN backbone.

comparison of our proposed method with the state-of-the-art MTL algorithms, using challenging and widely used real-world MTL benchmarks in supervised and reinforcement learning settings. The details of the hyperparameters are provided in Appendix K.

x_k: model parameter at iteration k, updated as in (10)
h_{k,m}: stochastic gradient estimator of ∇f_m(x_k)
y_{k,m}: "tracking" variable that approximates ∇f_m(x_k)
Y

Comparison of MoCo with prior work on gradient-based stochastic MOO, the stochastic multi-gradient (SMG) method.

MoCo with existing gradient manipulation MTL algorithms for Cityscapes dataset tasks. Results are averaged over 3 independent runs. (Higher is better for mIoU, Pix Acc, and the angle thresholds; lower is better for the error metrics and ∆m%.)

Method | mIoU | Pix Acc | Abs Err | Rel Err | Mean | Median | 11.25 | 22.5 | 30 | ∆m%
Independent | 38.30 | 63.76 | 0.6754 | 0.2780 | 25.01 | 19.21 | 30.14 | 57.20 | 69.15 | -
PCGrad (Yu et al., 2020a) | 38.06 | 64.64 | 0.5550 | 0.2325 | 27.41 | 22.80 | 23.86 | 49.83 | 63.14 | 3.97
CAGrad (Liu et al., 2021a) | 39.79 | 65.49 | 0.5486 | 0.2250 | 26.31 | 21.58 | 25.61 | 52.36 | 65.58 | 0.20
MoCo (ours) | 40.30 | 66.07 | 0.5575 | 0.2135 | 26.67 | 21.83 | 25.61 | 51.78 | 64.85 | 0.16
PCGrad + MoCo | 38.80 | 65.02 | 0.5492 | 0.2326 | 27.39 | 22.75 | 23.64 | 49.89 | 63.21 | 3.62
CAGrad + MoCo | 39.58 | 65.49 | 0.5535 | 0.2292 | 25.97 | 20.86 | 26.84 | 53.79 | 66.65 | -0.97

MoCo with existing gradient manipulation MTL algorithms for NYU-v2 dataset tasks. Results are averaged over 3 independent runs.

summarizes the hyper-parameter choices used for MoCo in each of the experiments.

Summary of hyper-parameter choices for MoCo in each experiment


Published as a conference paper at ICLR 2023

Substituting (99) in (98) and dividing both sides by α_K K / 2, we have

where the last equality is by substituting from (93). Now, with a similar argument to the one we made in (92), given some C_1, there exists some constant 0 < C_2 < 1 such that

Then, we can have

), we arrive at

The result then follows by observing that for any k ∈ [K], we have

K DETAILS OF EXPERIMENTS

In this section, we describe the omitted details of experiments in the main paper.

K.1 TOY EXAMPLE

To show the advantages of our algorithm, we use a toy example similar to that of (Liu et al., 2021a). The example consists of optimizing two objectives f_1(x) and f_2(x) with x = (x_1, x_2)^⊤ ∈ R^2, given by

For CAGrad, which explicitly computes dynamic convex combination coefficients for the gradients using stochastic gradients such that it converges closer to a point that performs well in terms of the average loss, there is an improvement in ∆m% for Cityscapes by 1.52% and for NYU-v2 by 1.17%. This suggests that incorporating the gradient correction of MoCo into existing gradient-based MTL algorithms also boosts their performance.

In addition to the experiments described above, we demonstrate the performance of MoCo in comparison with other MTL algorithms using the Office-31 (Saenko et al., 2010) and Office-home (Venkateswara et al., 2017) datasets. Both of these datasets consist of images of several classes belonging to different domains. We use the method "Mean" as the baseline for ∆m%, instead of the independent task performance. The Mean baseline is the method where the average of the task losses on each domain is used as the objective of a single-objective optimization problem. For reporting the per-domain performance of all the methods compared in the Office-31 and Office-home experiments, the test performance at the epoch with the highest average validation accuracy (across domains) is used for each independent run. This performance measure is then averaged over three independent runs.

Office-31 dataset. The dataset consists of three classification tasks on 4,110 images collected from three domains: Amazon, DSLR, and Webcam, where each domain has 31 object categories. For implementing the experiments on the Office-home and Office-31 datasets, we use the experimental setup and implementation given by the LibMTL framework (Lin & Zhang, 2022). The MTL algorithms are implemented using a hard parameter sharing architecture with a ResNet18 backbone.
As per the implementation in (Lin & Zhang, 2022), 60% of the total dataset is used for training, 20% for validation, and the remaining 20% for testing. All methods in comparison are run for 100 epochs. For the MoCo implementation, we use β_k = 0.5/k^0.5 and γ_k = 0.1/k^0.5. We report the test performance of the best-performing model based on the validation accuracy after each epoch, averaged over 3 seeds. The results are given in Table 7. It can be seen that MoCo significantly outperforms other methods on most tasks, and also in terms of ∆m%.

Office-home dataset. This dataset consists of four classification tasks over 15,500 labeled images in four domains. Art: paintings, sketches, and/or artistic depictions; Clipart: clipart images; Product: images without background; and Real-World: regular images captured with a camera. Each domain has 65 object categories. We follow the same experimental setup as for Office-31.
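The model-selection rule used for the Office experiments (pick, for each run, the epoch with the highest validation accuracy averaged across domains, report that epoch's test accuracy, then average over runs) can be sketched as follows; the array names and layout are assumptions for illustration.

```python
# Sketch of the Office-31 / Office-home reporting rule.
import numpy as np

def report_office_performance(val_acc, test_acc):
    """val_acc, test_acc: arrays of shape (n_runs, n_epochs, n_domains)."""
    results = []
    for run in range(val_acc.shape[0]):
        mean_val = val_acc[run].mean(axis=1)       # average val acc over domains
        best_epoch = int(np.argmax(mean_val))      # epoch with highest avg val acc
        results.append(test_acc[run, best_epoch])  # that epoch's per-domain test acc
    return np.mean(results, axis=0)                # average over independent runs
```

The returned vector is the per-domain test performance reported in the Office tables.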

