GRADIENT DECONFLICTION VIA ORTHOGONAL PROJECTIONS ONTO SUBSPACES FOR MULTI-TASK LEARNING

Anonymous

Abstract

Although multi-task learning (MTL) is a preferred approach that has been successfully applied in many real-world scenarios, MTL models are not guaranteed to outperform single-task models on all tasks, mainly due to the negative effects of conflicting gradients among the tasks. In this paper, we examine the influence of conflicting gradients in depth and highlight the importance and advantages of achieving non-conflicting gradients, which enable simple yet effective trade-off strategies among the tasks with stable performance. Based on our findings, we propose Gradient Deconfliction via Orthogonal Projections onto Subspaces (GradOPS) spanned by other task-specific gradients. Our method not only resolves all conflicts among the tasks, but can also effectively search for diverse solutions toward different trade-off preferences among the tasks. We provide a theoretical analysis of convergence and thoroughly evaluate our algorithm on multiple benchmarks in various domains. Results demonstrate that our method effectively finds multiple state-of-the-art solutions with different trade-off strategies among the tasks on multiple datasets.

1. INTRODUCTION

Multi-task learning (MTL) aims to jointly train one model to master different tasks via shared representations and bottom structures, achieving better and more generalized results. Such positive knowledge transfer is the prominent advantage of MTL and is key to its successful application in various domains, such as computer vision (Misra et al., 2016; Kokkinos, 2017; Zamir et al., 2018; Liu et al., 2019), natural language processing (Dong et al., 2015; McCann et al., 2018; Wang et al., 2020; Radford et al., 2019), and recommender systems (Ma et al., 2018a;b; Tang et al., 2020; Wen et al., 2020). Many works make further improvements via task-relationship modelling (Misra et al., 2016; Ma et al., 2018a; Xia et al., 2018; Liu et al., 2019; Tang et al., 2020; Xi et al., 2021) to fully exploit the benefit of shared structures.

However, while such designs allow positive transfer among the tasks, they also introduce the major challenges in MTL, the most prominent of which is the conflicting gradients problem. Gradients of two tasks, g_i and g_j, are considered conflicting if their dot product g_i · g_j < 0. With conflicting gradients, improvements on some tasks may come at the expense of undermining other tasks. The most recent and representative works that seek to directly resolve conflicts among the tasks are PCGrad (Yu et al., 2020) and GradVac (Wang et al., 2020). For each task T_i, PCGrad iteratively projects its gradient g_i onto the gradient directions of the other tasks and discards the conflicting part; GradVac argues that each task pair (T_i, T_j) should have a unique gradient similarity, and modifies g_i toward such similarity goals. However, these methods only discuss convergence guarantees and gradient deconfliction under two-task settings (Liu et al., 2021a).
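To make the projection step concrete, the following is a minimal NumPy sketch of a PCGrad-style update as we understand it from Yu et al. (2020); it is not the authors' reference implementation, and the function name and random-order handling are our own.

```python
import numpy as np

def pcgrad_step(grads, rng=None):
    """PCGrad-style deconfliction sketch: for each task gradient g_i, visit
    the other tasks in random order and, whenever g_i conflicts with g_j
    (g_i . g_j < 0), subtract g_i's component along g_j. With three or more
    tasks the result depends on the visiting order."""
    rng = np.random.default_rng() if rng is None else rng
    projected = []
    for i in range(len(grads)):
        g = np.asarray(grads[i], dtype=float).copy()
        order = [j for j in range(len(grads)) if j != i]
        rng.shuffle(order)
        for j in order:
            gj = np.asarray(grads[j], dtype=float)
            dot = g @ gj
            if dot < 0:  # conflicting pair: drop the component along g_j
                g -= (dot / (gj @ gj)) * gj
        projected.append(g)
    return np.sum(projected, axis=0)  # aggregated update direction
```

For two tasks this update never conflicts with either original gradient; the point made below is that with three or more tasks, different random orders can yield different aggregated directions, some of which still conflict with an original g_i.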
For MTL with three or more tasks, the properties of these algorithms may not hold, because the sequential gradient modifications are not guaranteed to produce non-conflicting gradients between every task pair. Specifically, as demonstrated in Figures 1(b) and 1(c), the aggregated update direction G′ of these methods (e.g., PCGrad) still randomly conflicts with different original g_i because of the randomly selected processing orders of task pairs, leading to decreased performance on the corresponding tasks.

[Figure 1: four 3-D panels, (a) GD, (b) PCGrad, (c) PCGrad, (d) GradOPS, each showing the task gradients g_1, g_2, g_3 and the aggregated direction G or G′.] Figure 1: Illustrative example of gradient conflicts in a three-task learning problem using gradient descent (GD), PCGrad, and GradOPS. Task-specific gradients are labeled g_1, g_2, and g_3. The aggregated gradient G or G′ in (a), (b), and (c) conflicts with the original gradient g_3, g_2, and g_3, respectively, resulting in decreased performance on the corresponding tasks. Note that different processing orders in PCGrad ([1,2,3] for (b), [3,2,1] for (c)) lead to conflicts of G′ with different original g_i. In contrast, the GradOPS-modified g′_1 is orthogonal to S = span{g_2, g_3}, with the conflicting part on S removed; similarly for g′_2 and g′_3 (omitted). Thus, neither any g′_i nor G′ conflicts with any of {g_i}.

Another stream of work (Sener & Koltun, 2018; Lin et al., 2019b;a; Liu et al., 2021a) avoids directly dealing with gradient conflicts by casting MTL problems as multi-objective optimization (MOO) problems. Though practically effective, the applications of most MOO methods are greatly limited, since these methods must delicately design complex algorithms for a certain trade-off among the tasks based on conflicting gradients.
However, solutions toward a certain trade-off may not always produce the expected performance, especially when the convex-loss assumption fails and convergence to a Pareto optimum therefore fails. In addition, different scenarios may require divergent preferences over the tasks, whereas most MOO methods can only search for one Pareto optimum toward a certain trade-off because their algorithms are designed to be bound to a specific trade-off strategy. It is therefore difficult to provide flexible and varied trade-offs given conflicting gradients.

In this paper, we propose a simple yet effective MTL algorithm: Gradient Deconfliction via Orthogonal Projections onto Subspaces spanned by other task-specific gradients (GradOPS). Compared with existing projection-based methods (Yu et al., 2020; Wang et al., 2020), our method not only completely resolves all conflicts among the tasks, with stable performance invariant to the random processing orders during gradient modification, but is also guaranteed to converge to Pareto stationary points regardless of the number of tasks to be optimized. Moreover, with gradient conflicts among the tasks completely resolved, our method can effectively search for diverse solutions toward different trade-off preferences simply via different non-negative linear combinations of the deconflicted gradients, controlled by a single hyperparameter (see Figure 2).

The main contributions of this paper can be summarized as follows:

• Focusing on the conflicting-gradients challenge, we propose an orthogonal-projection-based gradient modification method that completely resolves all conflicts among the tasks, with stable and invariant results regardless of the number of tasks and their processing orders; the aggregated final update direction also does not conflict with any task.

• With non-conflicting gradients obtained, a simple reweighting strategy is designed to offer the ability to search for Pareto stationary points toward different trade-offs.
We also empirically verify that, with gradient conflicts completely resolved, such a simple strategy is effective and flexible enough to achieve trade-offs similar to those of some MOO methods, and even to outperform them.

• Theoretical analysis of convergence is provided, and comprehensive experiments demonstrate that our algorithm can effectively find multiple state-of-the-art solutions with different trade-offs among the tasks on MTL benchmarks in various domains.
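The projection underlying GradOPS, as described above and in Figure 1(d), can be sketched as follows. This is our reading of the method (replace each g_i by its component orthogonal to S = span{g_j : j ≠ i}, then aggregate via a non-negative linear combination), not the authors' implementation; the least-squares projection and the `alpha`-controlled weighting are illustrative assumptions standing in for the paper's single trade-off hyperparameter.

```python
import numpy as np

def gradops_step(grads, alpha=0.0):
    """GradOPS-style sketch: each task gradient g_i is replaced by its
    component orthogonal to the subspace spanned by the other tasks'
    gradients, so no modified gradient conflicts with any original one.
    The final direction is a non-negative combination of the deconflicted
    gradients; `alpha` is a hypothetical stand-in for the paper's
    trade-off hyperparameter, reweighting tasks by gradient norm."""
    grads = [np.asarray(g, dtype=float) for g in grads]
    deconflicted = []
    for i, g in enumerate(grads):
        others = np.stack([grads[j] for j in range(len(grads)) if j != i],
                          axis=1)
        # Least-squares coefficients of g's projection onto span{g_j : j != i}.
        coef, *_ = np.linalg.lstsq(others, g, rcond=None)
        deconflicted.append(g - others @ coef)  # keep only the orthogonal part
    # Any non-negative weights keep the aggregate non-conflicting with every g_i.
    weights = np.array([np.linalg.norm(d) ** alpha for d in deconflicted])
    return np.sum([w * d for w, d in zip(weights, deconflicted)], axis=0)
```

Because each deconflicted gradient is orthogonal to every other task's original gradient, the aggregated direction satisfies G′ · g_i ≥ 0 for all i under any non-negative weighting, which is what makes the simple reweighting strategy above possible.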



Even for Liu et al. (2021a), Liu et al. (2021b), and Navon et al. (2022), which claim to seek Pareto points with balanced trade-offs among the tasks, the solutions might not necessarily satisfy MTL practitioners' needs, since it is always better to provide the ability to reach different Pareto optima and leave the decision to users (Lin et al., 2019a).

