GRADIENT DECONFLICTION VIA ORTHOGONAL PROJECTIONS ONTO SUBSPACES FOR MULTI-TASK LEARNING Anonymous

Abstract

Although multi-task learning (MTL) has been a preferred approach and successfully applied in many real-world scenarios, MTL models are not guaranteed to outperform single-task models on all tasks, mainly due to the negative effects of conflicting gradients among the tasks. In this paper, we thoroughly examine the influence of conflicting gradients and further emphasize the importance and advantages of achieving non-conflicting gradients, which allow simple yet effective trade-off strategies among the tasks with stable performance. Based on our findings, we propose Gradient Deconfliction via Orthogonal Projections onto Subspaces (GradOPS) spanned by other task-specific gradients. Our method not only resolves all conflicts among the tasks, but can also effectively search for diverse solutions toward different trade-off preferences among the tasks. Theoretical analysis on convergence is provided, and the performance of our algorithm is evaluated on multiple benchmarks in various domains. Results demonstrate that our method can effectively find multiple state-of-the-art solutions with different trade-off strategies among the tasks on multiple datasets.

1. INTRODUCTION

Multi-task learning (MTL) aims to jointly train one model to master different tasks via shared representations and bottom structures, achieving better and more generalized results. Such positive knowledge transfer is the prominent advantage of MTL and is key to its successful applications in various domains, like computer vision (Misra et al., 2016; Kokkinos, 2017; Zamir et al., 2018; Liu et al., 2019), natural language processing (Dong et al., 2015; McCann et al., 2018; Wang et al., 2020; Radford et al., 2019), and recommender systems (Ma et al., 2018a;b; Tang et al., 2020; Wen et al., 2020). Many works make further improvements via task-relationship modelling (Misra et al., 2016; Ma et al., 2018a; Xia et al., 2018; Liu et al., 2019; Tang et al., 2020; Xi et al., 2021) to fully exploit the benefit of shared structures. However, while such designs allow positive transfer among the tasks, they also introduce the major challenges of MTL, the most dominant of which is the conflicting gradients problem. Gradients of two tasks, g_i and g_j, are considered conflicting if their dot product g_i · g_j < 0. With conflicting gradients, improvements on some tasks may be achieved at the expense of undermining other tasks. The most recent and representative works that seek to directly resolve conflicts among the tasks are PCGrad (Yu et al., 2020) and GradVac (Wang et al., 2020). For each task T_i, PCGrad iteratively projects its gradient g_i onto the gradient directions of other tasks and abandons the conflicting part; GradVac argues that each task pair (T_i, T_j) should have a unique gradient similarity, and chooses to modify g_i toward such similarity goals. However, these methods only discuss convergence guarantees and gradient deconfliction under two-task settings (Liu et al., 2021a).
For MTL with 3 or more tasks, the properties of these algorithms may not hold, because the sequential gradient modifications are not guaranteed to produce non-conflicting gradients between every task pair. Specifically, as demonstrated in Figure 1(b) and 1(c), the aggregated update direction G' of these methods (e.g. PCGrad) still randomly conflicts with different original g_i because of the randomly selected processing orders of task pairs, leading to decreasing performance on the corresponding tasks.

Figure 1: Illustrative example of gradient conflicts in a three-task learning problem using gradient descent (GD), PCGrad and GradOPS. Task-specific gradients are labeled g_1, g_2 and g_3. The aggregated gradient G or G' in (a), (b) and (c) conflicts with the original gradient g_3, g_2 and g_3, respectively, resulting in decreasing performance of the corresponding tasks. Note that different processing orders in PCGrad ([1,2,3] for (b), [3,2,1] for (c)) lead to conflicts of G with different original g_i. In contrast, the GradOPS-modified g'_1 is orthogonal to S = span{g_2, g_3} with the conflicting part on S removed, and similarly for g'_2 and g'_3 (omitted). Thus, neither each g'_i nor G' conflicts with any of {g_i}.

Another stream of work (Sener & Koltun, 2018; Lin et al., 2019b;a; Liu et al., 2021a) avoids directly dealing with gradient conflicts and instead casts MTL problems as multi-objective optimization (MOO) problems. Though practically effective, applications of most MOO methods are greatly limited, since these methods have to design delicate and complex algorithms for a certain trade-off among the tasks based on conflicting gradients. However, solutions toward a certain trade-off may not always produce the expected performance, especially when the convex loss assumption fails and thus convergence to Pareto optima fails.
In addition, different scenarios may require divergent preferences over the tasks, whereas most MOO methods can only search for one Pareto optimum toward a certain trade-off because their algorithms are designed to be bound to a certain trade-off strategy. Therefore, it is difficult to provide flexible and different trade-offs given conflicting gradients. Even for Liu et al. (2021a), Liu et al. (2021b) and Navon et al. (2022), which claim to seek Pareto points with balanced trade-offs among the tasks, their solutions might not necessarily satisfy MTL practitioners' needs, since it is always better to provide the ability to reach different Pareto optima and leave the decision to users (Lin et al., 2019a). In this paper, we propose a simple yet effective MTL algorithm: Gradient Deconfliction via Orthogonal Projections onto Subspaces spanned by other task-specific gradients (GradOPS). Compared with existing projection-based methods (Yu et al., 2020; Wang et al., 2020), our method not only completely resolves all conflicts among the tasks, with stable performance invariant to the random processing orders during gradient modification, but is also guaranteed to converge to Pareto stationary points regardless of the number of tasks to be optimized. Moreover, with gradient conflicts among the tasks completely resolved, our method can effectively search for diverse solutions toward different trade-off preferences simply via different non-negative linear combinations of the deconflicted gradients, controlled by a single hyperparameter (see Figure 2).
The main contributions of this paper can be summarized as follows:

• Focusing on the conflicting gradients challenge, we propose an orthogonal-projection-based gradient modification method that completely resolves all conflicts among the tasks, with stable and invariant results regardless of the number of tasks and their processing orders; the aggregated final update direction is also non-conflicting with all tasks.

• With non-conflicting gradients obtained, a simple reweighting strategy is designed to offer the ability to search for Pareto stationary points toward different trade-offs. We also empirically verify that, with gradient conflicts completely resolved, such a simple strategy is already effective and flexible enough to achieve trade-offs similar to some MOO methods and even outperform them.

• Theoretical analysis on convergence is provided, and comprehensive experiments demonstrate that our algorithm can effectively find multiple state-of-the-art solutions with different trade-offs among the tasks on MTL benchmarks in various domains.

Figure 2 (caption, partially recovered): … (… & Ba, 2014). See Appendix B for details. GD is unable to traverse the deep valley from two of the initial points because there are conflicting gradients and the gradient magnitude of one task is much larger than the other's. For MGDA (Sener & Koltun, 2018), CAGrad (Liu et al., 2021a), and IMTL-G (Liu et al., 2021b), the final convergence point is fixed for each initial point. In contrast, GradOPS can converge to multiple points in the Pareto set by setting different α.

2. PRELIMINARIES

In this section, we first clarify the formal definition of MTL. Then we discuss the major problems in MTL and state the importance of achieving non-conflicting gradients among the tasks. Problem Definition. For a multi-task learning (MTL) problem with T > 1 tasks {T_1, ..., T_T}, each task is associated with a loss function L_i(θ) for a shared set of parameters θ. Normally, a standard objective for MTL is to minimize the summed loss over all tasks: $\theta^* = \arg\min_\theta \sum_{i=1}^{T} L_i(\theta)$.

Pareto Stationary Points and Optima. A solution θ dominates another θ' if L_i(θ) ≤ L_i(θ') for all T_i and L_i(θ) < L_i(θ') holds for at least one T_i. A solution θ* is called Pareto optimal if no solution dominates θ*. A solution θ is called Pareto stationary if there exists $w \in \mathbb{R}^T$ such that $w_i \ge 0$, $\sum_{i=1}^{T} w_i = 1$ and $\sum_{i=1}^{T} w_i g_i(\theta) = 0$, where $g_i = \nabla_\theta L_i(\theta)$ denotes the gradient of T_i. All Pareto optima are Pareto stationary; the reverse holds when L_i is convex for all T_i (Drummond & Iusem, 2004; Cruz et al., 2011). Note that MTL methods can only ensure reaching Pareto stationary points; convergence to Pareto optima may not hold without the convex loss assumption.

Conflicting Gradients. Two gradients g_1 and g_2 are conflicting if the dot product g_1 · g_2 < 0. Let G be the aggregated update gradient, i.e. $G = \sum_{i=1}^{T} g_i$; we define two types of conflicting gradients:

• conflicts among the tasks, if there exist any two tasks T_i and T_j with g_i · g_j < 0;

• conflicts of the final update direction with the tasks, if any task T_i satisfies G · g_i < 0.

For clarity, we call the gradients strong non-conflicting if g_i · g_j ≥ 0, ∀i, j, and weak non-conflicting if the aggregated gradient G does not conflict with any original task gradient g_i. Note that strong non-conflicting gradients are always guaranteed to be weak non-conflicting, whereas weak non-conflicting gradients are not necessarily strong non-conflicting. Most existing MOO methods focus on seeking a weak non-conflicting G toward certain trade-offs, which is directly responsible for task performance. Current projection-based methods (Yu et al., 2020; Wang et al., 2020) can only ensure strong non-conflicting gradients when T = 2; when T > 2, even weak non-confliction is not guaranteed. Advantages of Strong Non-Conflicting Gradients.
Without strong non-conflicting gradients, MOO methods are limited to finding only one solution toward a certain trade-off by delicately balancing the conflicting gradients, which empirically may lead to unsatisfying results if such a trade-off is less effective when the convexity assumption on L_i fails. In contrast, with strong non-conflicting gradients, we instantly acquire the advantage that it becomes much easier to apply different trade-offs on G simply via non-negative linear combinations of the deconflicted gradients, all guaranteed to be non-conflicting with each original g_i. With such ability, stable results under various trade-offs are also ensured, since all tasks are always updated toward directions with non-decreasing performance.

Figure 3: Visualization of the update direction (in yellow) obtained by various methods on a two-task learning problem. We rescale the update vector to half length for better visibility. g_1 and g_2 represent the two task-specific gradients. MGDA minimizes the minimum-norm convex combination of task gradients, and the update vector is perpendicular to the dashed line. IMTL-G makes the projections of the update vector onto {g_1, g_2} equal. PCGrad and GradOPS project each gradient onto the normal plane of the other to obtain g'_1 and g'_2. For PCGrad, the final update vector is the average of {g'_1, g'_2}. GradOPS further reweights {g'_1, g'_2} to make trade-offs between the two tasks. As a result, the final update direction of GradOPS can move flexibly between g'_1 and g'_2, covering the directions of MGDA and IMTL-G instead of being fixed as in other methods, and never conflicts with either task-specific gradient.
Thus, our GradOPS aims to solve MTL problems with the following goals: (1) to obtain stable and better performance on all the tasks by achieving strong non-conflicting gradients, (2) to provide simple yet effective strategies capable of performing different trade-offs.
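The two notions of non-conflict defined above can be checked numerically. A minimal NumPy sketch (the function name and gradient values are illustrative, not from the paper):

```python
import numpy as np

def conflict_status(grads):
    """Classify task gradients (rows of `grads`).

    strong: g_i . g_j >= 0 for every task pair (i, j)
    weak:   G . g_i >= 0 for every task i, where G = sum_i g_i
    """
    G = grads.sum(axis=0)
    strong = bool((grads @ grads.T >= 0).all())  # all pairwise dot products
    weak = bool((grads @ G >= 0).all())          # aggregated direction vs each g_i
    return strong, weak

# g1 and g2 conflict (g1 . g2 = -0.5 < 0), yet the summed direction G
# still has a non-negative projection on every task: weak but not strong.
g = np.array([[1.0, 0.0],
              [-0.5, 1.0],
              [0.0, 1.0]])
```

As stated above, strong non-conflict implies weak non-conflict, so `conflict_status` can never return `(True, False)`.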

3.1. STRONG NON-CONFLICTING GRADIENTS

Though recent works (Yu et al., 2020; Wang et al., 2020) have sought to directly deconflict each task-specific gradient g_i with all other tasks iteratively, none of these methods manages to resolve all conflicts among the tasks, as stated in Section 1, let alone further examine the benefits of strong non-conflicting gradients. As shown in Figure 1(b) and 1(c), both the modified g'_i and the aggregated gradient G' of existing methods may still conflict with some of the original task gradients {g_i}. To address the limitations of existing methods and to ensure strong non-conflicting gradients, GradOPS deconflicts gradients by projecting each g_i onto the subspace orthogonal to the span of the other task gradients. Formally, GradOPS proceeds as follows:

(i) For each g_i in any permutation of the original task gradients, GradOPS first identifies whether g_i conflicts with any of the other tasks by checking the signs of g_i · g_j for all {g_j}_{j≠i}.

(ii) If no conflict exists, g_i remains unmodified: g'_i = g_i. Otherwise, GradOPS projects g_i onto the subspace orthogonal to S = span{g_j}_{j≠i}. An orthogonal basis {u_j} for S is computed using the Gram-Schmidt procedure:

$$u_1 = g_1 \text{ (or } u_2 = g_2 \text{ if } i = 1\text{)}, \quad u_j = g_j - \sum_{k<j,\, k\neq i} \mathrm{proj}_{u_k}(g_j),\; j > 1,\, j \neq i, \quad \text{where } \mathrm{proj}_u(v) = \frac{u \cdot v}{\|u\|^2}\, u. \quad (1)$$

Given the orthogonal basis {u_j}, the modified g'_i is orthogonal to S:

$$g'_i = g_i - \sum_{j \neq i} \mathrm{proj}_{u_j}(g_i). \quad (2)$$

Eventually, we have g'_i · g_j ≥ 0, ∀i, j, and thus each g'_i does not conflict with any original g_j.

(iii) With strong non-conflicting gradients {g'_i}, the vanilla aggregated update direction $G' = \sum_i g'_i$ is guaranteed to be non-conflicting with any of the original g_j, since $G' \cdot g_j = (\sum_i g'_i) \cdot g_j = \sum_i (g'_i \cdot g_j) \ge 0, \forall j$; the same holds for any non-negative linear combination of the {g'_i}.
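Steps (i)–(iii) can be sketched in NumPy as follows (a sketch of the procedure as described, not the authors' implementation; `gradops_project` is a hypothetical name):

```python
import numpy as np

def gradops_project(grads, eps=1e-12):
    """For each task gradient g_i that conflicts with some other task,
    replace it with its projection onto the subspace orthogonal to
    S = span{g_j, j != i} (Eq. 2). `grads` has shape (T, d)."""
    T = grads.shape[0]
    out = grads.astype(float).copy()
    for i in range(T):
        others = [grads[j] for j in range(T) if j != i]
        # Steps (i)/(ii): leave g_i untouched if it conflicts with nobody.
        if all(grads[i] @ g >= 0 for g in others):
            continue
        # Gram-Schmidt orthogonal basis {u_j} of S (Eq. 1).
        basis = []
        for g in others:
            u = g.astype(float).copy()
            for b in basis:
                u -= (b @ u) / (b @ b) * b   # subtract proj_b(u)
            if u @ u > eps:                  # skip linearly dependent g_j
                basis.append(u)
        # Remove from g_i its component lying in S.
        gp = grads[i].astype(float).copy()
        for b in basis:
            gp -= (b @ gp) / (b @ b) * b
        out[i] = gp
    return out

# Three tasks where g1 conflicts with g2. After projection, every
# modified gradient has a non-negative dot product with every
# original gradient (strong non-conflict), and g3 is left unchanged.
g = np.array([[1.0, 0.0, 0.0],
              [-0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
gp = gradops_project(g)
```

Note that a modified g'_i is exactly orthogonal to every other original g_j, so the dot products in step (iii) are zero or positive, matching the guarantee above.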
With this procedure, GradOPS ensures that both each modified task gradient g'_i and the final update gradient G' do not conflict with any original task gradient g_j, thus completely solving the conflicting gradients problem. In addition, unlike Yu et al. (2020) and Wang et al. (2020), our result is stable and invariant to the permutation of tasks during the procedure. Theoretical analysis on the convergence of GradOPS is provided as Theorem 1 in Appendix A.1, under mild assumptions on the neural network and a small update step size.

3.2. TRADE-OFFS AMONG TASKS

Algorithm 1 GradOPS Update Rule
Require: task number T, a constant α, initial model parameters θ
1: while not converged do
2:   g_i ← ∇_θ L_i(θ), ∀i
3:   for i = 1 to T do
4:     if ∃k ≠ i such that g_i · g_k < 0 then
5:       compute the orthogonal basis {u_j} according to Eq. 1
6:       g'_i = g_i − Σ_{j≠i} proj_{u_j}(g_i)
7:     else
8:       g'_i = g_i
9:     end if
10:   end for
11:   G' = Σ_i g'_i
12:   compute scaling factors {w_i} according to Eq. 5
13:   update θ with gradient G'_new = Σ_i w_i g'_i
14: end while

As shown in Figure 2, most existing methods can only converge to one point in the Pareto set, without the ability to flexibly perform trade-offs among the tasks. Figures 3(a) to 3(d) give a more detailed illustration of the fixed update strategies of different methods; different solutions are achieved only through different initial points. However, note that MTL practitioners may need varying trade-offs among the tasks on demand in real-world applications (Lin et al., 2019a; Navon et al., 2021). In order to converge to points with different trade-offs, the most straightforward idea is to assign tasks different loss weights, i.e. to scale task gradients. However, this is often unsatisfying for two reasons: (1) the update strategies of some methods like IMTL-G (Liu et al., 2021b) are invariant to gradient scales, and (2) it is difficult to assign appropriate weights without prior knowledge. Therefore, we propose to dynamically scale the modified task gradients, with the strong non-conflicting guarantees obtained by GradOPS, to adjust the update direction. To this end, we first define R_i as the scalar projection of $G' = \sum_i g'_i$ onto the original g_i:

$$R_i = \|G'\| \cos(\phi_{\langle g_i, G' \rangle}) = \frac{G' \cdot g_i}{\|g_i\|}. \quad (3)$$

Note that R_i ≥ 0, as GradOPS always ensures g_i · G' ≥ 0.
Given the same G', each R_i can be regarded as a measure of the angle between g_i and G', or a measure of the real update magnitude in the direction of g_i, indicating the dominance and relative update speed of task T_i among the tasks. A new update direction $G'_{new} = \sum_i w_i g'_i$ is thus obtained by summing over the reweighted g'_i with:

$$r_i = \frac{R_i}{\sum_i R_i / T}, \quad (4) \qquad w_i = \frac{r_i^\alpha}{\sum_i r_i^\alpha / T}, \quad (5)$$

where w_i functions as the scale factor calculated from {r_i} and determines trade-offs among the tasks; r_i reveals the relative update magnitude in the direction of g_i and always splits the tasks into dominating ones {T_i | r_i > 1} and dominated ones with r_i ≤ 1. Thus, a single hyperparameter α ∈ R on r_i^α allows effective trade-offs between the dominating and dominated tasks via non-negative w_i in Eq. 5. Details on the convergence of GradOPS with different w are analysed in Theorem 2.

Discussion of α. With α > 0, the performance of the dominating tasks is expected to improve, since a higher value of α pushes G'_new toward the {g'_i} of these tasks by yielding greater {w_i}. With α < 0, the dominated tasks are emphasized, since a lower value of α ensures greater {w_i} for tasks with smaller {r_i}. Specifically, a proper α < 0 pays more attention to the dominated tasks and thus obtains a more balanced update gradient, which is similar to IMTL-G. Decreasing α further focuses on tasks with smaller gradient magnitudes, which coincides with the idea of MGDA. An example is provided in Figure 2, where the top-right points of GradOPS(α=-2) and GradOPS(α=-5) converge to points similar to those of IMTL-G and MGDA, respectively. Note that G'_new = G' when α = 0. A pictorial description of this idea is shown in Figure 3(e): different α redirect G'_new between g'_1 and g'_2. The complete GradOPS algorithm is summarized in Algorithm 1. Note that PCGrad is a special case of our method when T = 2 and α = 0.
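Eqs. 3–5 translate directly into code. A sketch (the function name and the `eps` guard are our additions; it assumes the deconflicted g'_i from Section 3.1 are already available):

```python
import numpy as np

def gradops_reweight(grads, grads_proj, alpha=0.0, eps=1e-12):
    """Compute G'_new = sum_i w_i g'_i.

    R_i (Eq. 3) is the scalar projection of G' = sum_i g'_i onto the
    original g_i; r_i (Eq. 4) normalizes R_i by its mean; w_i (Eq. 5)
    is r_i^alpha, again normalized by its mean, so alpha = 0 recovers
    the plain sum G'."""
    Gp = grads_proj.sum(axis=0)
    R = grads @ Gp / (np.linalg.norm(grads, axis=1) + eps)  # >= 0 by construction
    r = R / (R.mean() + eps)
    w = (r + eps) ** alpha            # eps guards r = 0 when alpha < 0
    w = w / (w.mean() + eps)
    return (w[:, None] * grads_proj).sum(axis=0), w

# Two already non-conflicting gradients, so g'_i = g_i.
g = np.array([[1.0, 0.0],
              [0.0, 2.0]])
G_new, w = gradops_reweight(g, g, alpha=0.0)   # w_i = 1, G'_new = G'
```

With α = 0 every w_i equals 1 and G'_new = G'; α < 0 shifts weight toward the dominated tasks (small r_i), α > 0 toward the dominating ones, matching the discussion above.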

3.3. ADVANTAGES OF GRADOPS

Now we discuss the advantages of GradOPS. Both GradOPS and existing MOO methods converge to Pareto stationary points, which are also Pareto optima under convex loss assumptions. Thus, ideally, there exist no dominating methods, only methods whose trade-off strategies best suit certain MTL applications. Realizing that the assumptions may fail in practice and convergence to Pareto optima may not hold, we are interested in achieving robust performance via the simplest strategies, instead of designing delicate trade-off strategies that may fail to reach the expected performance when the assumptions are violated. Our goal of securing robust performance is achieved by ensuring strong non-conflicting gradients. Though Yu et al. (2020) and Wang et al. (2020) are already aware of the importance of resolving gradient conflicts (see Section 1), neither manages to guarantee strong non-conflicting gradients, which GradOPS achieves. In addition, we further observe empirically the benefit of achieving strong non-conflicting gradients, which GradOPS fully exploits: with all conflicts resolved, effective trade-offs among the tasks become much easier. With conflicting gradients, existing methods have to conceive very delicate algorithms to achieve certain trade-off preferences. The trade-off strategy in Liu et al. (2021a) requires solving the dual problem of maximizing the minimum local improvement; IMTL-G (Liu et al., 2021b) applies complex matrix multiplications only to find a G with equal projections onto each g_i. Unlike these methods, GradOPS can provide solutions with different trade-offs simply via non-negative linear combinations of the deconflicted gradients, and achieves trade-offs similar to those of different MOO methods, like MGDA and IMTL-G, or even outperforms them. Moreover, GradOPS, which only requires tuning a single hyperparameter, also outperforms very expensive and exhaustive grid-search procedures.
See Section 4 and Appendix C.3 for details on the effects of tuning α. Moreover, the experimental results in Section 4 imply that a certain trade-off strategy may not always produce promising results on different datasets. Unlike most MOO methods, which are bound to certain fixed trade-offs, GradOPS is less affected and more stable, since it flexibly supports different trade-offs and always ensures that all tasks are updated toward directions with non-decreasing performance. Lastly, it is worth mentioning that GradNorm (Chen et al., 2018) can also dynamically adjust gradient norms. However, its target of balancing the training rates of different tasks can be biased, since the relationships among gradient directions are ignored; this can be improved by taking the real update magnitude on each task into account, as we do in Eq. 3.

4. EXPERIMENTS

In this section, we evaluate our method on diverse multi-task learning datasets, including two public benchmarks and one industrial large-scale recommendation dataset. For the two public benchmarks, instead of using the frameworks implemented by different MTL methods with various experimental details, we present fair and reproducible comparisons under a unified training framework (Lin & Zhang, 2022). GradOPS inherits the hyperparameters of the respective baseline method in all experiments, except for one additional hyperparameter α. We include additional experimental details and results in Appendix B and C, respectively. Source code will be made publicly available.

Compared methods.

(1) Uniform scaling: minimizing $\sum_i L_i$; (2) Single-task: solving tasks independently; (3) existing loss-reweighting methods (Chen et al., 2018), projection methods (Yu et al., 2020; Wang et al., 2020) and MOO methods (Sener & Koltun, 2018; Liu et al., 2021a;b).

Evaluation. In addition to the common evaluation metrics in each experiment, we follow (Maninis et al., 2019; Liu et al., 2021a; Navon et al., 2022) and report the average per-task performance drop of method m with respect to the single-task baseline b:

$$\Delta_m = \frac{1}{T} \sum_{i=1}^{T} (-1)^{l_i} (M_{m,i} - M_{b,i}) / M_{b,i},$$

where l_i = 1 if a higher value is better for criterion M_i on task i, and 0 otherwise.
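The ∆_m metric can be computed as follows (a small sketch; the function name and the example numbers are illustrative):

```python
import numpy as np

def delta_m(method_metrics, baseline_metrics, higher_is_better):
    """Average per-task performance drop relative to the single-task
    baseline: Delta_m = (1/T) * sum_i (-1)^{l_i} (M_{m,i} - M_{b,i}) / M_{b,i},
    where l_i = 1 if a higher value of metric i is better."""
    m = np.asarray(method_metrics, dtype=float)
    b = np.asarray(baseline_metrics, dtype=float)
    sign = np.where(np.asarray(higher_is_better), -1.0, 1.0)  # (-1)^{l_i}
    return float(np.mean(sign * (m - b) / b))

# Two tasks: an AUC (higher is better) that improves and an error rate
# (lower is better) that is unchanged -> negative Delta_m = improvement.
dm = delta_m([0.92, 0.10], [0.90, 0.10], [True, False])
```

By construction, a negative ∆_m means the method improves on the single-task baseline on average, and a positive ∆_m means it falls behind.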

4.1. MULTI-TASK CLASSIFICATION

The UCI Census-income dataset (Dua & Graff, 2017) is a commonly used benchmark for multi-task learning, which contains 299,285 samples and 40 features extracted from the 1994 census database. Following the experimental settings in Ma et al. (2018a), we construct three multi-task learning problems from this dataset by setting some of the features as prediction targets. In detail, task Income aims to predict whether the income exceeds 50K, task Marital aims to predict whether the person's marital status is never married, and task Education aims to predict whether the education level is at least college. Since all tasks are binary classification problems, we use the Area Under Curve (AUC) score as the evaluation metric. We summarize the experimental results for all methods in Table 1. The reported results are the average performance over 10 runs with random parameter initializations. As shown, GradOPS(α=0) outperforms the projection-based method PCGrad on all tasks, indicating the benefit of our proposed strategy, which projects each task-specific gradient onto the subspace orthogonal to the span of the other gradients to deconflict gradients. Then, we compare the performance of GradOPS with different α against other MOO methods. We note that GradOPS(α=-1) recovers CAGrad and IMTL-G, and GradOPS(α=-2) approximates MGDA. These results show that GradOPS with a proper α < 0 can roughly recover the final performance of CAGrad, IMTL-G and MGDA. GradOPS(α=-3) achieves the best performance in terms of Marital, Education and Average. Although GradOPS(α=2) does not achieve results as good as GradOPS with α ∈ {-1, -2, -3}, it outperforms the Uniform scaling baseline. We also find that, for this dataset, almost any value of -10 ≤ α ≤ 5 improves the Average performance over the Uniform scaling baseline (see Appendix C.2 for details). This suggests that our way of deconflicting gradients is beneficial for MTL.

4.2. SCENE UNDERSTANDING

NYUv2 (Silberman et al., 2012) is an indoor scene dataset which contains 3 tasks: 13-class semantic segmentation, depth estimation, and surface normal prediction. We follow the experimental setup from (Liu et al., 2019; Yu et al., 2020), and fix SegNet (Badrinarayanan et al., 2017) as the network structure for all methods. The results are presented in Table 2. Each experiment is repeated 3 times over different random parameter initializations and the mean performance is reported. Our method GradOPS(α=-1.5) achieves performance comparable to IMTL-G. GradOPS(α=-0.5) and GradOPS(α=-1) achieve the best ∆_m and MR, respectively. As surface normal estimation has the smallest gradient magnitude, MGDA focuses on learning to predict surface normals and achieves poor performance on the other two tasks, similar to GradOPS with a lower α = -3. It is worth noting that although MGDA has overall good results on the UCI Census-income dataset, its ∆_m on this dataset is not as good, suggesting that its fixed trade-off is hard to generalize to different domains.

4.3. LARGE-SCALE RECOMMENDATION

In this subsection, we conduct offline experiments on a large-scale recommendation system at Taobao Inc. to evaluate the performance of the proposed method. We collect an industrial dataset by sampling user logs from the recommendation system over 7 consecutive days. There are 46 features, more than 300 million users, 10 million items and about 1.7 billion samples in the dataset. We consider predicting three post-click actions: purchase, add to shopping cart (Cart) and add to wish list (Wish). We implement the network as a feedforward neural network with several fully-connected layers with ReLU activations, and adopt AUC scores as the evaluation metrics. Results are shown in Table 3. As wish prediction is the dominated task in this dataset, GradOPS with a negative α ∈ {-0.5, -1, -2} shows better performance than GradOPS(α=0) on the Wish task, and GradOPS with a positive α ∈ {1, 2} achieves better performance on the other two tasks. Our proposed GradOPS successfully finds a set of well-distributed solutions with different trade-offs, and it outperforms the single-task baseline where each task is trained with a separate model. This experiment also shows that our method remains effective in real-world, large-scale applications.

5. RELATED WORK

Ever since the successful application of multi-task learning in various domains, many methods have been released in pursuit of fully exploiting the benefits of MTL. Early branches of popular approaches heuristically explore shared experts and task-relationship modelling via gating networks (Misra et al., 2016; Ma et al., 2018a; Xia et al., 2018; Liu et al., 2019; Tang et al., 2020; Xi et al., 2021), and loss-reweighting methods that ensure equal gradient norms across tasks (Chen et al., 2018) or uplift losses toward harder tasks (Guo et al., 2018; Kendall et al., 2018). GradOPS is agnostic to model architectures and loss manipulations, and can thus be combined with these methods on demand. Recent trends in MTL focus on gradient manipulations toward certain objectives. Chen et al. (2020) simply drop out task gradients randomly according to conflicting ratios among tasks. Javaloy & Valera (2021) focus on reducing negative transfer by scaling and rotating task gradients. Existing projection-based methods explicitly study the conflicting gradients problem and the relationships among task gradients. Some methods more precisely alter each task gradient toward a non-negative inner product (Yu et al., 2020) or certain values of gradient similarity (Wang et al., 2020) with other tasks, iteratively. GradOPS makes further improvements over these methods, as stated in Section 3.1. Originating from (Désidéri, 2012), some methods seek to find Pareto optima of MTL as solutions of multi-objective optimization (MOO) problems. Sener & Koltun (2018) apply the Frank-Wolfe algorithm to solve the MOO problem defined in Désidéri (2012); Lin et al. (2019b) choose to project the closed-form solutions of a relaxed quadratic programming problem onto the feasible space, hoping to achieve Pareto efficiency. More recent methods seek to find Pareto optima with a certain trade-off among the tasks instead of finding an arbitrary one. Liu et al.
(2021a) seeks to maximize the worst local improvement of individual tasks to achieve Pareto points with minimum average loss. IMTL-G (Liu et al., 2021b) seeks solutions with balanced performance among all tasks by searching for an update vector that has equal projections on each task gradient. Note that some methods have already been proposed to provide Pareto solutions with different trade-offs. Lin et al. (2019a) explicitly split the loss space into independent cones and apply constrained MOO methods to search for a single solution in each cone; Mahapatra & Rajan (2020) provide the ability to search for solutions toward certain desired directions determined by an input preference ray r; Navon et al. (2021) train a hypernetwork, with preference vector r as input, to directly predict the weights of an MTL model with losses in the desired ray r. Comparisons of GradOPS with MOO methods are discussed in Section 3.3. Finally, as discussed in Yu et al. (2020), we also state the difference between GradOPS and the gradient projection methods (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2018; Farajtabar et al., 2020) applied in continual learning, which mainly solve the catastrophic forgetting problem in sequential lifelong learning. GradOPS is distinct from these methods in two aspects: (1) instead of concentrating only on conflicts of the current task with historical ones, GradOPS focuses on gradient deconfliction among all tasks within the same batch simultaneously, to allow mutually positive transfer among tasks; (2) GradOPS aims to provide effective trade-offs among tasks given the non-conflicting gradients.

6. CONCLUSION

In this work, focusing on the conflicting gradients challenge in MTL, we introduce the idea of strong non-conflicting gradients, and further emphasize the advantages of acquiring such gradients: it becomes much easier to apply varying trade-offs among the tasks simply via different non-negative linear combinations of the deconflicted gradients, which are all guaranteed to be non-conflicting with each original task. Solutions with stable performance toward different trade-offs can thus be achieved, since all tasks are updated toward directions with non-decreasing performance in each training step for all trade-off preferences. To fully exploit these advantages, we propose a simple algorithm (GradOPS) that ensures strong non-conflicting gradients, on top of which a simple reweighting strategy is implemented to provide effective trade-offs among the tasks. Comprehensive experiments show that GradOPS achieves state-of-the-art results and is also flexible enough to achieve trade-offs similar to some existing MOO methods and even outperform them.

A DETAILED DERIVATION

A.1 PROOF OF THEOREM 1

Theorem 1. Assume the individual loss functions $L_1, L_2, \dots, L_T$ are differentiable. Suppose the gradient of $L$ is $L$-Lipschitz with $L > 0$. Then with update step size $t < \frac{2}{TL}$, GradOPS in Section 3.1 converges to a Pareto stationary point.

Proof. As $L$ is differentiable and $L$-smooth, we have
$$L(\theta') \le L(\theta) + \nabla L(\theta)^{\top}(\theta' - \theta) + \frac{1}{2} \lVert \nabla^2 L(\theta) \rVert \, \lVert \theta' - \theta \rVert^2 \le L(\theta) + \nabla L(\theta)^{\top}(\theta' - \theta) + \frac{1}{2} L \lVert \theta' - \theta \rVert^2. \tag{6}$$
Plugging in the GradOPS update by letting $\theta' = \theta - t G'$, we can conclude:
$$\begin{aligned}
L(\theta') &\le L(\theta) - t\, G \cdot G' + \frac{1}{2} L t^2 \lVert G' \rVert^2 \\
&= L(\theta) - t \sum_i g_i \cdot \sum_i g'_i + \frac{1}{2} L t^2 \Big\lVert \sum_i g'_i \Big\rVert^2 && \text{(expanding, using } G = \textstyle\sum_i g_i,\ G' = \textstyle\sum_i g'_i\text{)} \\
&= L(\theta) - t \Big[ \sum_{i,j} (g_i \cdot g'_j) - \frac{1}{2} L t \sum_{i,j} (g'_i \cdot g'_j) \Big] \\
&\le L(\theta) - t \Big[ \sum_i (g_i \cdot g'_i) - \frac{1}{2} L t \sum_{i,j} (g'_i \cdot g'_j) \Big] && \text{(using } g_i \cdot g'_j \ge 0\ \forall i,j \text{ from Section 3.1)} \\
&\le L(\theta) - t \Big[ \sum_i (g_i \cdot g'_i) - \frac{1}{2} L t \sum_{i,j} \frac{1}{2} \big( \lVert g'_i \rVert^2 + \lVert g'_j \rVert^2 \big) \Big] && \text{(using } g'_i \cdot g'_j \le \lVert g'_i \rVert \lVert g'_j \rVert \le \tfrac{1}{2}(\lVert g'_i \rVert^2 + \lVert g'_j \rVert^2)\text{)} \\
&= L(\theta) - t \sum_i (g_i \cdot g'_i) + \frac{1}{2} T L t^2 \sum_i \lVert g'_i \rVert^2 \\
&= L(\theta) - t \sum_i \lVert g'_i \rVert^2 + \frac{1}{2} T L t^2 \sum_i \lVert g'_i \rVert^2 && \text{(using } g_i \cdot g'_i = \lVert g'_i \rVert^2\ \forall i\text{)} \\
&= L(\theta) - t \Big( 1 - \frac{1}{2} T L t \Big) \sum_i \lVert g'_i \rVert^2. \tag{7}
\end{aligned}$$
Note that $t(1 - \frac{1}{2} T L t) \sum_i \lVert g'_i \rVert^2 \ge 0$ when $t \le \frac{2}{TL}$. Further, when $t < \frac{2}{TL}$, $t(1 - \frac{1}{2} T L t) \sum_i \lVert g'_i \rVert^2 = 0$ if and only if $\lVert g'_i \rVert^2 = 0\ \forall i$, i.e. $g'_i = 0\ \forall i$.

Hence repeatedly applying the GradOPS update with $t < \frac{2}{TL}$ reaches some point $\theta^*$ in the optimization landscape where $g'_i = 0\ \forall i$. According to Section 3.1, this means $g_i = 0$ or $g_i$ belongs to the subspace $S = \mathrm{span}\{g_j\}_{j \ne i}$. It follows that there exists a convex combination of the gradients $\{g_i\}$ at $\theta^*$ that equals zero, and therefore $\theta^*$ is Pareto stationary under the definitions in Section 2.

A.2 PROOF OF THEOREM 2

Theorem 2. Assume the individual loss functions $L_1, L_2, \dots, L_T$ are differentiable. Suppose the gradient of $L$ is $L$-Lipschitz with $L > 0$. Then with update step size $t < \min_{i,j} \frac{2 w_i \lVert g'_i \rVert^2}{T L w_j^2 \lVert g'_j \rVert^2}$, GradOPS in Section 3.2 converges to a Pareto stationary point.

Proof. Similar to Section A.1, plugging $\theta' = \theta - t G'_{\mathrm{new}}$ into Inequality (6), we can conclude:
$$\begin{aligned}
L(\theta') &\le L(\theta) - t\, G \cdot G'_{\mathrm{new}} + \frac{1}{2} L t^2 \lVert G'_{\mathrm{new}} \rVert^2 \\
&= L(\theta) - t \sum_i g_i \cdot \sum_i w_i g'_i + \frac{1}{2} L t^2 \Big\lVert \sum_i w_i g'_i \Big\rVert^2 && \text{(expanding, using } G = \textstyle\sum_i g_i,\ G'_{\mathrm{new}} = \textstyle\sum_i w_i g'_i\text{)} \\
&\le L(\theta) - t \sum_i w_i \lVert g'_i \rVert^2 + \frac{1}{2} T L t^2 \sum_i w_i^2 \lVert g'_i \rVert^2. \tag{8}
\end{aligned}$$
If $g'_i = 0\ \forall i$, GradOPS has reached a Pareto stationary point. Otherwise, denote by $T^+$ the set of tasks with $\lVert g'_i \rVert^2 > 0$, $i \in T^+$. Then $w_i > 0\ \forall i \in T^+$ according to Section 3.2. In this case, with $t < \min_{i \in T^+, j} \frac{2 w_i \lVert g'_i \rVert^2}{T L w_j^2 \lVert g'_j \rVert^2}$, GradOPS reaches some point $\theta^*$ where $g'_i = 0\ \forall i$, i.e. a Pareto stationary point.

To test whether the performance of GradOPS is robust to changes in the hyperparameter α, we show both the training and test performance gains for different α on the UCI Census-income dataset in Figure 7. We note that the training performance gain of the dominating task Marital rises as α gets larger, while the gain of the dominated task Income goes down. This again demonstrates the ability of GradOPS to obtain solutions with different trade-offs. It should be mentioned that higher training performance does not necessarily guarantee higher test performance.
The test performance of task Marital with α > 0 is worse than its performance with α ≤ 0, which is inconsistent with the corresponding training performance. The reason may be that the model overfits on task Marital, suggesting that a better regularization scheme is needed in this domain. Note that we achieve positive Average performance gains on both training and test for almost all values of α.
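The step-size condition in Theorem 1 can be sanity-checked on a toy quadratic objective. A minimal sketch, under the simplifying assumption that the task gradients are already mutually non-conflicting so that $g'_i = g_i$ (for the losses chosen here, $g_1 \cdot g_2 = 2\lVert\theta\rVert^2 \ge 0$ always holds):

```python
import numpy as np

# Toy objective: L(theta) = sum_i 0.5 * theta^T A_i theta with PSD A_i,
# so grad L is Lipschitz with constant L = ||A_1 + A_2||_2.
A = [np.diag([1.0, 2.0]), np.diag([2.0, 1.0])]  # T = 2 tasks
T = len(A)
L_const = np.linalg.norm(A[0] + A[1], 2)  # spectral norm = 3

def loss(theta):
    return sum(0.5 * theta @ Ai @ theta for Ai in A)

theta = np.array([1.0, -1.5])
t = 0.9 * 2 / (T * L_const)  # step size strictly below 2 / (T L)

for _ in range(200):
    # These task gradients never conflict (g1 . g2 = 2 ||theta||^2 >= 0),
    # so we take g'_i = g_i.
    grads = [Ai @ theta for Ai in A]
    decrease = t * (1 - 0.5 * T * L_const * t) * sum(g @ g for g in grads)
    new_theta = theta - t * sum(grads)
    # Inequality (7): each step decreases L by at least `decrease`.
    assert loss(new_theta) <= loss(theta) - decrease + 1e-9
    theta = new_theta

print(loss(theta))  # driven to (numerically) zero
```

The per-step assertion is exactly Inequality (7) evaluated numerically, and the guaranteed decrease drives the loss to a stationary point of this toy problem.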



Figure 2: Visualization of trade-offs in a 2D multi-task optimization problem. Shown are trajectories of each method from 3 different initial points (labeled with black •) using the Adam optimizer (Kingma & Ba, 2014). See Appendix B for details. GD is unable to traverse the deep valley from two of the initial points because the gradients conflict and the gradient magnitude of one task is much larger than the other's. For MGDA (Sener & Koltun, 2018), CAGrad (Liu et al., 2021a), and IMTL-G (Liu et al., 2021b), the final convergence point is fixed for each initial point. In contrast, GradOPS can converge to multiple points in the Pareto set by setting different α.

Figure 4: Visualization of the loss surfaces of Figure 2.

Figure 5: Two-dimensional contour graphs of the three-dimensional surfaces in Figure 4. ⋆ denotes the point with lowest single-task loss.

Figure 6: Traces of how w_i changes during training for different values of α on the UCI Census-income dataset. The top row compares α ∈ {1, 0, -1} and the bottom row compares α ∈ {3, 0, -3}. For α = 0, w_i remains at 1 for all three tasks. A positive α = 1.0 assigns the dominated task Income a greater weight w_i > 1 and the dominating task Marital a lower weight w_i < 1, and a larger value of α pushes the weights farther apart. A negative α has the opposite effect.

Experiment results on the UCI Census-income dataset. The best scores are shown in bold and the second-best scores are underlined.

Experimental results on the NYUv2 dataset. Arrows indicate whether higher (↑) or lower (↓) values are better. The best performance for each task is bold, with the second-best underlined.

Experiment results on the real large-scale recommendation system. The best and runner-up results in each column are bold and underlined, respectively.

B IMPLEMENTATION DETAILS

Illustrative Example. We provide here the details for the illustrative example of Figure 2. We modify the illustrative example in Liu et al. (2021a) and consider θ = (θ_1, θ_2) ∈ R^2 with the following individual loss functions:

The three-dimensional loss surfaces and the corresponding two-dimensional contour graphs are shown in Figures 4 and 5, respectively. We pick 3 initial parameter vectors θ ∈ {(-0.85, 0.75), (-0.85, -0.3), (0.9, 0.9)} and perform 20,000 gradient updates to minimize L using the Adam optimizer with a learning rate of 0.001. The corresponding optimization trajectories of the different methods are shown in Figure 2.

Multi-Task Classification. The UCI Census-income dataset contains 199,523 training examples and 99,762 test examples. Following Ma et al. (2018a), we further randomly split the test examples into a validation set and a test set with a 1:1 ratio. We apply a 2-layer fully-connected ReLU-activated neural network with 192 hidden units per layer for all methods. We also place one dropout layer (Hinton et al., 2012) with p = 0.3 for regularization. For the GradNorm (Chen et al., 2018) and CAGrad (Liu et al., 2021a) baselines, which have additional hyperparameters, we follow the original papers and search α ∈ {0.5, 1.5, 2.0} for GradNorm and c ∈ {0.1, 0.2, ..., 0.9} for CAGrad, selecting the values with the best average performance over the three tasks on the validation set (α = 1.5 for GradNorm and c = 0.5 for CAGrad). We train each method for 200 epochs with a batch size of 1024, using the Adam optimizer with a learning rate of 1e-4.

Scene Understanding. We follow the training and evaluation procedure used in previous work on MTL (Liu et al., 2019; Yu et al., 2020) and apply SegNet (Badrinarayanan et al., 2017) as the network structure for all methods. For the GradNorm (Chen et al., 2018) and CAGrad (Liu et al., 2021a) baselines, we follow the hyperparameter settings in the original papers, with α = 1.5 for GradNorm and c = 0.4 for CAGrad.
Each method is trained for 200 epochs with a batch size of 2, using the Adam optimizer with a learning rate of 1e-4. The learning rate is halved to 5e-5 after 100 epochs.

Large-scale Recommendation. We collect an industrial dataset by sampling user logs from the recommendation system in Taobao over 7 consecutive days. The dataset contains 46 features, more than 300 million users, 10 million items, and about 1.7 billion samples. We implement the network as a feedforward neural network with 4 fully-connected ReLU-activated layers. The hidden sizes are fixed at {1024, 512, 256, 128} for all models. All methods are trained with the Adam optimizer with a learning rate of 1e-3.
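The optimization harness behind the illustrative example (20,000 Adam updates from each of the three initial points) can be sketched as below. The two quadratic losses here are hypothetical stand-ins for the actual modified losses, and Adam is written out by hand so the sketch is self-contained:

```python
import numpy as np

# Hypothetical stand-ins for the two task losses of the toy problem; the
# actual modified losses from Liu et al. (2021a) are given in the text.
def task_grads(theta):
    x, y = theta
    g1 = np.array([2.0 * (x - 1.0), 2.0 * y])  # grad of (x - 1)^2 + y^2
    g2 = np.array([2.0 * x, 2.0 * (y + 1.0)])  # grad of x^2 + (y + 1)^2
    return g1, g2

def adam_run(theta0, steps=20_000, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Run Adam on the summed loss, mirroring the Figure 2 setup
    (20,000 updates, learning rate 0.001)."""
    theta = np.array(theta0, dtype=float)
    m, v = np.zeros(2), np.zeros(2)
    for k in range(1, steps + 1):
        g1, g2 = task_grads(theta)
        g = g1 + g2  # plain GD baseline; a gradient-surgery method
                     # would deconflict g1 and g2 before combining them
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** k)
        v_hat = v / (1 - b2 ** k)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# The three initial points used in Figure 2.
for init in [(-0.85, 0.75), (-0.85, -0.3), (0.9, 0.9)]:
    print(init, adam_run(init))  # each run ends near the joint optimum
```

Swapping the `g = g1 + g2` line for a different gradient-combination rule is all that distinguishes the methods compared in Figure 2.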

C ADDITIONAL EXPERIMENTS

C.1 EFFECTS OF TUNING α.

To better understand how α affects the weights {w_i}, we show the traces of w_i during training for different values of α on the UCI Census-income dataset in Figure 6. Positive and negative α have opposite effects on the w_i of tasks Income and Marital compared to α = 0, and values of α with larger absolute magnitudes further widen the gap.

Performance gains are positive for almost all -10 ≤ α ≤ 5, indicating that GradOPS is numerically stable. Moreover, the consistently positive performance gains across all these values of α suggest that the way GradOPS deconflicts gradients improves MTL performance.
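The exact definition of w_i is given in Section 3.2. Purely for illustration, a hypothetical norm-based reweighting that reproduces the qualitative behavior in Figure 6 (uniform weights at α = 0, dominated tasks upweighted for α > 0, the opposite for α < 0, larger |α| spreading the weights further) could look like:

```python
import numpy as np

def reweight(norms_sq, alpha):
    """Hypothetical task reweighting, for illustration only.

    Maps per-task quantities s_i = ||g'_i||^2 to weights w_i with
    sum(w) = T, matching the qualitative behavior in Figure 6:
    alpha = 0 gives uniform w_i = 1, a positive alpha upweights the
    tasks with smaller s_i (the dominated tasks), a negative alpha
    does the opposite, and a larger |alpha| spreads the weights
    further apart. The actual definition is given in Section 3.2.
    """
    s = np.asarray(norms_sq, dtype=float)
    w = s ** (-alpha / 2)          # alpha > 0 favors small-norm tasks
    return len(s) * w / w.sum()    # normalize so the weights sum to T

s = [4.0, 1.0, 0.25]    # dominating task first, dominated task last
print(reweight(s, 0))   # uniform: [1., 1., 1.]
print(reweight(s, 1))   # dominated task gets the largest weight
print(reweight(s, -1))  # dominating task gets the largest weight
```

Any monotone map with these properties would display the same weight traces; the point is only that a single scalar α controls the direction and strength of the trade-off.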

C.3 COMPARISON BETWEEN GRADOPS AND GRADOPS WITH GRID-SEARCH WEIGHTS.

To further verify the effectiveness of the strategy introduced in Section 3.2 for obtaining w_i, we conduct a comparative experiment in which we train GradOPS with static weights w_i^static (referred to as GradOPS-static) on the UCI Census-income dataset. The {w_i^static} are treated as hyperparameters and are sampled from candidates generated by grid search. The sum of {w_i^static} is constrained to equal T = 3, and G'_static = Σ_i w_i^static g'_i is used as the final update gradient. We then compare the performance of GradOPS-static to our GradOPS (α = -3). The results are shown in Figure 8. We train GradOPS-static 100 times with different combinations of {w_i^static}.
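One way to enumerate the grid-search candidates with Σ w_i = T = 3 is sketched below; the grid step is an assumption (it controls how many candidates are produced, e.g. 91 for a step of 0.25):

```python
from itertools import product

def weight_grid(T=3, total=3.0, step=0.25):
    """Enumerate static weight vectors on a grid with sum(w) == total.

    Each of the first T-1 weights is a non-negative multiple of `step`;
    the last weight takes whatever remains, so every candidate sums
    exactly to `total` and all weights are non-negative.
    """
    n = round(total / step)
    grid = []
    for ks in product(range(n + 1), repeat=T - 1):
        if sum(ks) <= n:
            w = [k * step for k in ks]
            w.append((n - sum(ks)) * step)
            grid.append(w)
    return grid

candidates = weight_grid()
print(len(candidates))  # 91 candidates for step = 0.25
assert all(abs(sum(w) - 3.0) < 1e-9 for w in candidates)
```

Each candidate is then one full training run of GradOPS-static, which makes the contrast with GradOPS (a single run per α) clear: the grid grows combinatorially with the number of tasks.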

