RECON: REDUCING CONFLICTING GRADIENTS FROM THE ROOT FOR MULTI-TASK LEARNING

Abstract

A fundamental challenge for multi-task learning is that different tasks may conflict with each other when they are solved jointly, and a cause of this phenomenon is conflicting gradients during optimization. Recent works attempt to mitigate the influence of conflicting gradients by directly altering the gradients based on some criteria. However, our empirical study shows that such "gradient surgery" cannot effectively reduce the occurrence of conflicting gradients. In this paper, we take a different approach to reduce conflicting gradients from the root. In essence, we investigate the task gradients w.r.t. each shared network layer, select the layers with high conflict scores, and turn them into task-specific layers. Our experiments show that this simple approach can greatly reduce the occurrence of conflicting gradients in the remaining shared layers and achieve better performance, with only a slight increase in model parameters in many cases. Our approach can be easily applied to improve various state-of-the-art methods, including gradient manipulation methods and branched architecture search methods. Given a network architecture (e.g., ResNet18), it only needs to search for the conflict layers once, and the network can then be modified to work with different methods on the same or even different datasets to gain performance improvement. The source code is available at https://github.com/moukamisama/Recon.

1. INTRODUCTION

Multi-task learning (MTL) is a learning paradigm in which multiple different but correlated tasks are jointly trained with a shared model (Caruana, 1997), in the hope of achieving better performance with an overall smaller model size than learning each task independently. By discovering shared structures across tasks and leveraging domain-specific training signals of related tasks, MTL can achieve both efficiency and effectiveness. Indeed, MTL has been successfully applied in many domains including natural language processing (Hashimoto et al., 2017), reinforcement learning (Parisotto et al., 2016; D'Eramo et al., 2020), and computer vision (Vandenhende et al., 2021).

A major challenge for multi-task learning is negative transfer (Ruder, 2017), which refers to the performance drop on a task caused by the learning of other tasks, resulting in worse overall performance than learning them separately. This is caused by task conflicts, i.e., tasks compete with each other, and unrelated information of individual tasks may impede the learning of common structures. From the optimization point of view, a cause of negative transfer is conflicting gradients (Yu et al., 2020), i.e., two task gradients point away from each other and the update for one task has a negative effect on the other. Conflicting gradients make it difficult to optimize the multi-task objective, since task gradients with larger magnitudes may dominate the update vector, making the optimizer prioritize some tasks over others and struggle to converge to a desirable solution.

Prior works address task/gradient conflicts mainly by balancing the tasks via task reweighting or gradient manipulation. Task reweighting methods adaptively re-weight the loss functions by homoscedastic uncertainty (Kendall et al., 2018), by balancing the pace at which tasks are learned (Chen et al., 2018; Liu et al., 2019), or by learning a loss weight parameter (Liu et al., 2021b).
Gradient manipulation methods reduce the influence of conflicting gradients by directly altering the gradients based on different criteria (Sener & Koltun, 2018; Yu et al., 2020; Chen et al., 2020; Liu et al., 2021a) or by rotating the shared features (Javaloy & Valera, 2022). While these methods have demonstrated effectiveness in different scenarios, our empirical study shows that they cannot reduce the occurrence of conflicting gradients (see Sec. 3.3 for more discussion).

We propose a different approach to reduce conflicting gradients for MTL. Specifically, we investigate layer-wise conflicting gradients, i.e., the task gradients w.r.t. each shared network layer. We first train the network with a regular MTL algorithm (e.g., joint-training) for a number of iterations, compute the conflict scores for all shared layers, and select those with the highest conflict scores (indicating severe conflicts). We then set the selected shared layers to be task-specific and train the modified network from scratch. As demonstrated by comprehensive experiments and analysis, our simple approach Recon has the following key advantages: (1) Recon can greatly reduce conflicting gradients with only a slight increase in model parameters (less than 1% in some cases) and lead to significantly better performance. (2) Recon can be easily applied to improve various gradient manipulation methods and branched architecture search methods. Given a network architecture, it only needs to search for the conflict layers once, and the network can be modified to work with different methods and even on different datasets to gain performance improvement. (3) Recon can achieve better performance than branched architecture search methods with a much smaller model.

2. RELATED WORKS

In this section, we briefly review related works in multi-task learning in four categories: task clustering, architecture design, architecture search, and task balancing. Task clustering methods mainly focus on identifying which tasks should be learned together (Thrun & O'Sullivan, 1996; Zamir et al., 2018; Standley et al., 2020; Shen et al., 2021; Fifty et al., 2021). Architecture design methods include hard parameter sharing methods (Kokkinos, 2017; Long et al., 2017; Bragman et al., 2019), which learn a shared feature extractor and task-specific decoders, and soft parameter sharing methods (Misra et al., 2016; Ruder et al., 2019; Gao et al., 2019; 2020; Liu et al., 2019), in which some parameters of each task are used for cross-task communication via a sharing mechanism. Compared with soft parameter sharing methods, our approach Recon has much better scalability when dealing with a large number of tasks.

Instead of designing a fixed network structure, some methods (Rosenbaum et al., 2018; Meyerson & Miikkulainen, 2018; Yang et al., 2020) propose to dynamically self-organize the network for different tasks. Among them, branched architecture search methods (Guo et al., 2020; Bruggemann et al., 2020) are most related to our work. They propose automated architecture search algorithms to build a tree-structured network by learning where to branch. In contrast, our method Recon decides which layers should be shared across tasks by considering the severity of layer-wise conflicting gradients, resulting in a more compact architecture with lower time cost and better performance.

Another line of research is task balancing methods. To address task/gradient conflicts, some methods attempt to re-weight the multi-task loss function using homoscedastic uncertainty (Kendall et al., 2018), task prioritization (Guo et al., 2018), or similar learning pace (Liu et al., 2019; 2021b). GradNorm (Chen et al., 2018) learns task weights by dynamically tuning gradient magnitudes.
MGDA (Sener & Koltun, 2018) finds the weights by minimizing the norm of the weighted sum of task gradients. To reduce the influence of conflicting gradients, PCGrad (Yu et al., 2020) projects each gradient onto the normal plane of another gradient and uses the average of the projected gradients for the update. GradDrop (Chen et al., 2020) randomly drops some elements of the gradients based on element-wise conflict. CAGrad (Liu et al., 2021a) ensures convergence to a minimum of the average loss across tasks by gradient manipulation. RotoGrad (Javaloy & Valera, 2022) re-weights task gradients and rotates the shared feature space. Instead of manipulating gradients, our method Recon leverages gradient information to modify the network structure and thereby mitigates task conflicts from the root.

3. PILOT STUDY: TASK CONFLICTS IN MULTI-TASK LEARNING

3.1. MULTI-TASK LEARNING: PROBLEM DEFINITION

Multi-task learning (MTL) aims to learn a set of correlated tasks {T_i}_{i=1}^T simultaneously. For each task T_i, the empirical loss function is L_i(θ_sh, θ_i), where θ_sh are parameters shared among all tasks and θ_i are task-specific parameters. The goal is to find optimal parameters θ = {θ_sh, θ_1, θ_2, ..., θ_T} that achieve high performance across all tasks. Formally, MTL aims to minimize the multi-task objective

θ* = arg min_θ Σ_{i=1}^T w_i L_i(θ_sh, θ_i),   (1)

where the w_i are pre-defined or dynamically computed weights for the different tasks. A popular choice is the average loss (i.e., equal weights). However, optimizing the multi-task objective is difficult, and a known cause is conflicting gradients.
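As a minimal sketch of this objective (the quadratic losses, parameter values, and equal weights below are our own illustrative assumptions, not the paper's setup):

```python
def multitask_objective(theta_sh, thetas, losses, weights):
    """Weighted multi-task objective: sum_i w_i * L_i(theta_sh, theta_i)."""
    return sum(w * L(theta_sh, th) for w, L, th in zip(weights, losses, thetas))

# Two toy quadratic tasks that share the scalar parameter theta_sh.
L1 = lambda sh, th: (sh - 1.0) ** 2 + (th - 0.5) ** 2
L2 = lambda sh, th: (sh + 1.0) ** 2 + (th + 0.5) ** 2

# Equal weights: the "average loss" choice mentioned above.
total = multitask_objective(0.0, [0.5, -0.5], [L1, L2], [0.5, 0.5])
```

Note that the two tasks pull the shared parameter in opposite directions (toward +1 and -1), which is exactly the situation that produces the conflicting gradients discussed next.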

3.2. CONFLICTING GRADIENTS

Let g_i = ∇_{θ_sh} L_i(θ_sh, θ_i) denote the gradient of task T_i w.r.t. the shared parameters θ_sh (i.e., the vector of partial derivatives of L_i w.r.t. θ_sh), and let g_i^ts = ∇_{θ_i} L_i(θ_sh, θ_i) denote the gradient w.r.t. the task-specific parameters θ_i. A small change of θ_sh in the direction of the negative gradient g_i is θ_sh ← θ_sh − αg_i, with a sufficiently small step size α. The effect of this change on the performance of another task T_j is measured by

ΔL_j = L_j(θ_sh − αg_i, θ_j) − L_j(θ_sh, θ_j) = −α g_i · g_j + o(α),

where the second equality is obtained by a first-order Taylor approximation. Likewise, the effect of a small update of θ_sh in the direction of the negative gradient of task T_j (i.e., −g_j) on the performance of task T_i is ΔL_i = −α g_i · g_j + o(α). Notably, the model update for task T_i is considered to have a negative effect on task T_j when g_i · g_j < 0, since it increases the loss of task T_j, and vice versa. A formal definition of conflicting gradients is given as follows (Yu et al., 2020).

Definition 1 (Conflicting Gradients). The gradients g_i and g_j (i ≠ j) are said to be conflicting with each other if cos ϕ_ij < 0, where ϕ_ij is the angle between g_i and g_j.

As shown in Yu et al. (2020), conflicting gradients pose serious challenges for optimizing the multi-task objective (Eq. 1). Using the average gradient (i.e., (1/T) Σ_{i=1}^T g_i) for gradient descent may hurt the performance of individual tasks, especially when there is a large difference in gradient magnitudes, which makes the optimizer struggle to converge to a desirable solution.
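Definition 1 amounts to a sign test on the cosine similarity of flattened gradients. A small self-contained sketch (the toy gradient vectors are illustrative assumptions):

```python
import math

def cos_phi(g_i, g_j):
    """Cosine of the angle between two flattened task gradients."""
    dot = sum(a * b for a, b in zip(g_i, g_j))
    ni = math.sqrt(sum(a * a for a in g_i))
    nj = math.sqrt(sum(b * b for b in g_j))
    return dot / (ni * nj)

def is_conflicting(g_i, g_j):
    """Definition 1: gradients conflict iff cos(phi_ij) < 0."""
    return cos_phi(g_i, g_j) < 0

# Toy gradients pointing away from each other: cos(phi) = -1/sqrt(1.25) < 0.
g1, g2 = [1.0, 0.0], [-1.0, 0.5]
```

In a real network the gradients would be flattened parameter tensors, but the test is the same: only the sign of the inner product matters for detecting a conflict.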

3.3. GRADIENT SURGERY CANNOT EFFECTIVELY REDUCE CONFLICTING GRADIENTS

To mitigate the influence of conflicting gradients, several methods (Yu et al., 2020; Chen et al., 2020; Liu et al., 2021a) have been proposed to perform "gradient surgery". Instead of following the average gradient direction, they alter conflicting gradients based on some criteria and use the modified gradients for the model update. We conduct a pilot study to investigate whether gradient manipulation can effectively reduce the occurrence of conflicting gradients. For each training iteration, we first calculate the task gradients of all tasks w.r.t. the shared parameters (i.e., g_i for any task i) and compute the conflict angle between any two task gradients g_i and g_j in terms of cos ϕ_ij. We then count and draw the distribution of cos ϕ_ij over all training iterations. We provide the statistics of the joint-training baseline (i.e., training all tasks jointly with equal loss weights and all parameters shared) and several state-of-the-art gradient manipulation methods, including GradDrop (Chen et al., 2020), PCGrad (Yu et al., 2020), CAGrad (Liu et al., 2021a), and MGDA (Sener & Koltun, 2018), on the Multi-Fashion+MNIST (Lin et al., 2019), CityScapes, NYUv2, and PASCAL-Context datasets.
The results are provided in Fig. 1, Fig. 5, Fig. 6, Fig. 7, Table 6, and Tables 8-10. It can be seen that gradient manipulation methods can only slightly reduce the occurrence of conflicting gradients (compared to joint-training) in some cases, and in some other cases they even increase it.
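The bookkeeping in this pilot study can be sketched as follows; the bin edges below are our own illustrative choice, not necessarily the binning used in the paper's figures:

```python
from collections import Counter

def cos_distribution(cos_values, edges=(-1.0, -0.5, -0.1, 0.0, 0.1, 0.5, 1.0)):
    """Count cos(phi_ij) values observed over training into half-open bins
    [edges[b], edges[b+1]); the value 1.0 is folded into the last bin."""
    counts = Counter()
    last = len(edges) - 2
    for c in cos_values:
        for b in range(last + 1):
            if edges[b] <= c < edges[b + 1] or (b == last and c == edges[-1]):
                counts[(edges[b], edges[b + 1])] += 1
                break
    return counts

# e.g., cosines collected from all task pairs over all training iterations:
hist = cos_distribution([-0.9, -0.2, 0.05, 0.7, 1.0])
```

Comparing such histograms across methods is how one can see whether "gradient surgery" actually shifts mass out of the negative-cosine bins.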

4. OUR APPROACH: REDUCING CONFLICTING GRADIENTS FROM THE ROOT

Our pilot study shows that adjusting gradients for model update cannot effectively prevent the occurrence of conflicting gradients in MTL, which suggests that the root causes of this phenomenon may be closely related to the nature of different tasks and the way how model parameters are shared among them. Therefore, to mitigate task conflicts for MTL, in this paper, we take a different approach to reduce the occurrence of conflicting gradients from the root.

4.1. RECON: REMOVING LAYER-WISE CONFLICTING GRADIENTS

Our approach is extremely simple and intuitive. We first identify the shared network layers where conflicts occur most frequently and then turn them into task-specific layers. Suppose the shared model parameters θ_sh are composed of n layers, i.e., θ_sh = {θ_sh^(k)}_{k=1}^n, where θ_sh^(k) is the k-th shared layer. Let g_i^(k) denote the gradient of task T_i w.r.t. the k-th shared layer θ_sh^(k), i.e., g_i^(k) is the vector of partial derivatives of L_i w.r.t. the parameters of θ_sh^(k). Let ϕ_ij^(k) denote the angle between g_i^(k) and g_j^(k). We define layer-wise conflicting gradients and the S-conflict score as follows.

Definition 2 (Layer-wise Conflicting Gradients). The gradients g_i^(k) and g_j^(k) (i ≠ j) are said to be conflicting with each other if cos ϕ_ij^(k) < 0.

Definition 3 (S-Conflict Score). For any −1 < S ≤ 0, the S-conflict score for the k-th shared layer, denoted s^(k), is the number of distinct pairs (i, j) (i ≠ j) such that cos ϕ_ij^(k) < S.

S indicates the severity of conflicts: setting S smaller means we care only about cases of more severe conflicts. The S-conflict score s^(k) counts the occurrences of conflicting gradients at severity level S for the k-th shared layer. If s^(k) = T(T−1)/2, then for any two different tasks there is a conflict in their gradients w.r.t. the k-th shared layer. By computing S-conflict scores, we can identify the shared layers where conflicts occur most frequently.

We describe our method Recon in Algorithm 1. First, we train the network for I iterations and compute the S-conflict score for each shared layer θ_sh^(k) in every iteration, denoted by {s_t^(k)}_{t=1}^I. Then, we sum up the scores over all iterations, i.e., s^(k) = Σ_{t=1}^I s_t^(k), and find the layers with the highest s^(k) scores. Next, we set these layers to be task-specific and train the modified network from scratch.

Algorithm 1: Recon: Removing Layer-wise Conflicting Gradients
Input: Model parameters θ, learning rate α, a set of tasks {T_i}_{i=1}^T, number of iterations I for computing conflict scores, conflict severity level S, number of selected layers K.
// Train the network and compute conflict scores for all layers
for iteration t = 1, 2, ..., I do
    for task i = 1, 2, ..., T do
        Compute the gradients of task T_i w.r.t. all shared layers, i.e., {g_i^(k)}_{k=1}^n;
    end
    Calculate the S-conflict scores for all shared layers in the current iteration, i.e., {s_t^(k)}_{k=1}^n;
    Update θ with joint-training or any gradient manipulation method;
end
// Set the layers with the top conflict scores task-specific
For each layer k, calculate the sum of S-conflict scores over all iterations, i.e., s^(k) = Σ_{t=1}^I s_t^(k); select the top K layers with the highest s^(k) and set them task-specific;
// Train the modified network from scratch
for iteration t = 1, 2, ... do
    Update θ with joint-training or any gradient manipulation method;
end
Output: Model parameters θ.

We demonstrate the effectiveness of Recon by a theoretical analysis in Sec. 4.2 and comprehensive experiments in Sec. 5. The results show that Recon can effectively reduce the occurrence of conflicting gradients in the remaining shared layers and lead to substantial improvements over the state-of-the-art.
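A compact sketch of the scoring-and-selection step of Algorithm 1, assuming the per-layer task gradients have already been flattened into plain lists (all names and the toy values are illustrative):

```python
import math
from itertools import combinations

def cos_phi(u, v):
    """Cosine of the angle between two flattened layer gradients."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def s_conflict_scores(layer_grads, S):
    """Definition 3: for layer k, count task pairs (i, j) with cos(phi_ij^(k)) < S.
    layer_grads[k] holds one flattened gradient per task."""
    return [sum(1 for gi, gj in combinations(grads, 2) if cos_phi(gi, gj) < S)
            for grads in layer_grads]

def top_k_conflict_layers(accumulated_scores, K):
    """Indices of the K layers with the highest accumulated S-conflict scores;
    these are the layers Recon turns task-specific."""
    return sorted(range(len(accumulated_scores)),
                  key=lambda k: -accumulated_scores[k])[:K]

# Toy iteration with 2 shared layers and 3 tasks:
layer_grads = [
    [[1.0, 0.0], [-1.0, 0.1], [0.0, 1.0]],  # layer 0: one severely conflicting pair
    [[1.0, 0.0], [0.9, 0.1], [1.0, 0.2]],   # layer 1: well-aligned gradients
]
scores = s_conflict_scores(layer_grads, S=-0.5)
selected = top_k_conflict_layers(scores, K=1)
```

In the full algorithm the per-iteration scores would be summed over the I scoring iterations before the top-K selection, and the selected layers would then be replicated once per task in the modified network.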

4.2. THEORETICAL ANALYSIS

Here, we provide a theoretical analysis of Recon. Let θ_sh = {θ_sh^fix, θ_sh^cf}, where θ_sh^fix are the remaining shared parameters and θ_sh^cf are those that will be turned into task-specific parameters θ_1^cf, θ_2^cf, ..., θ_T^cf, all of which are initialized with θ_sh^cf. Therefore, after applying Recon, the model parameters are θ_r = {θ_sh^fix, θ_1^cf, ..., θ_T^cf, θ_1^ts, ..., θ_T^ts}. Let g_i^fix = ∇_{θ_sh^fix} L_i, g_i^cf = ∇_{θ_sh^cf} L_i, and g_i^ts = ∇_{θ_i^ts} L_i. A one-step gradient update of θ_r is

θ̂_sh^fix = θ_sh^fix − α Σ_{i=1}^T w_i g_i^fix,  θ̂_i^cf = θ_i^cf − α g_i^cf,  θ̂_i^ts = θ_i^ts − α g_i^ts,  i = 1, ..., T,   (3)

where the w_i are weight parameters. Without applying Recon, the model parameters are θ = {θ_sh^fix, θ_sh^cf, θ_1^ts, ..., θ_T^ts}, and a one-step gradient update of θ is

θ̂_sh^fix = θ_sh^fix − α Σ_{i=1}^T w_i g_i^fix,  θ̂_sh^cf = θ_sh^cf − α Σ_{i=1}^T w_i g_i^cf,  θ̂_i^ts = θ_i^ts − α g_i^ts,  i = 1, ..., T.   (4)

After the one-step updates, the loss functions with the updated parameters θ̂_r and θ̂ are, respectively,

L(θ̂_r) = Σ_{i=1}^T L_i(θ̂_sh^fix, θ̂_i^cf, θ̂_i^ts)  and  L(θ̂) = Σ_{i=1}^T L_i(θ̂_sh^fix, θ̂_sh^cf, θ̂_i^ts),   (5)

where L_i is the loss function of task T_i. Denote the set of indices of the layers turned task-specific by P, so that θ_sh^cf = {θ_sh^(k)}, k ∈ P, and assume that Σ_{i=1}^T w_i = 1. Then we have the following theorem.

Theorem 4.1. Assume that L is differentiable and that, for any two different tasks T_i and T_j,

cos ϕ_ij^(k) ∥g_i^(k)∥ < ∥g_j^(k)∥,  ∀k ∈ P.   (6)

Then, for any sufficiently small learning rate α > 0, L(θ̂_r) < L(θ̂).

The theorem indicates that a single gradient update on the model parameters of Recon achieves a lower loss than a single update on the original model parameters. The proof is provided in Appendix A.
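The following toy computation (our own illustrative check, not from the paper) instantiates Theorem 4.1 with one shared scalar and two fully conflicting quadratic tasks:

```python
# Shared scalar t with L1(t) = (t - 1)^2 and L2(t) = (t + 1)^2.
# At t0 = 0: g1 = -2, g2 = 2, so cos(phi_12) = -1, and condition (6) holds:
# cos(phi_12) * |g1| = -2 < 2 = |g2| (and symmetrically with the tasks swapped).
alpha, t0 = 0.01, 0.0
w1 = w2 = 0.5
g1, g2 = 2 * (t0 - 1), 2 * (t0 + 1)

# Without Recon: the shared parameter follows the weighted average gradient,
# which is zero here because the conflicting gradients cancel exactly.
t_joint = t0 - alpha * (w1 * g1 + w2 * g2)
loss_joint = (t_joint - 1) ** 2 + (t_joint + 1) ** 2

# With Recon: the layer is duplicated and each copy follows its own task gradient.
t1, t2 = t0 - alpha * g1, t0 - alpha * g2
loss_recon = (t1 - 1) ** 2 + (t2 + 1) ** 2
```

Here loss_joint stays at 2.0 while loss_recon drops to 2(1 − 2α)², so L(θ̂_r) < L(θ̂) for any small α > 0, as the theorem predicts.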

5. EXPERIMENTS

In this section, we conduct extensive experiments to evaluate our approach Recon for multi-task learning and demonstrate its effectiveness, efficiency and generality.

5.1. EXPERIMENTAL SETUP

Datasets. We evaluate Recon on five multi-task datasets, namely Multi-Fashion+MNIST (Lin et al., 2019), CityScapes (Cordts et al., 2016), NYUv2 (Couprie et al., 2013), PASCAL-Context (Mottaghi et al., 2014), and CelebA (Liu et al., 2015). The tasks of each dataset are as follows. 1) Multi-Fashion+MNIST contains two image classification tasks; each image consists of an item from FashionMNIST and an item from MNIST. 2) CityScapes contains 2 vision tasks: 7-class semantic segmentation and depth estimation. 3) NYUv2 contains 3 tasks: 13-class semantic segmentation, depth estimation, and normal prediction. 4) PASCAL-Context contains 5 tasks: semantic segmentation, human parts segmentation, saliency estimation, surface normal estimation, and edge detection. 5) CelebA contains 40 binary classification tasks.

Baselines. The baselines include 1) single-task learning (single-task): training all tasks independently; 2) joint-training (joint-train): training all tasks together with equal loss weights and all parameters shared; 3) gradient manipulation methods: MGDA (Sener & Koltun, 2018), PCGrad (Yu et al., 2020), GradDrop (Chen et al., 2020), CAGrad (Liu et al., 2021a), and RotoGrad (Javaloy & Valera, 2022); 4) branched architecture search methods: BMTAS (Bruggemann et al., 2020); 5) architecture design methods: Cross-Stitch (Misra et al., 2016) and MMoE (Ma et al., 2018). Following Liu et al. (2021a), we implement Cross-Stitch based on SegNet (Badrinarayanan et al., 2017). For a fair comparison, all methods use the same configurations and random seeds, and we run all experiments 3 times with different random seeds. More experimental details are provided in Appendix B.

Relative task improvement. Following Maninis et al. (2019), we compute the relative task improvement with respect to the single-task baseline for each task.
Given a task T_j, the relative task improvement is

Δm_{T_j} = (1/K) Σ_{i=1}^K (−1)^{l_i} (M_i − S_i) / S_i,

where M_i and S_i refer to the values of the i-th criterion obtained by the evaluated model and the single-task model, respectively, and l_i = 1 if a lower value for the criterion is better and l_i = 0 otherwise. The average relative task improvement is Δm = (1/T) Σ_{j=1}^T Δm_{T_j}.
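This metric can be sketched directly; the metric values below are made-up numbers for illustration, not results from the paper:

```python
def relative_task_improvement(task_metrics, single_task_metrics, lower_is_better):
    """Delta m_Tj = (1/K) * sum_i (-1)^{l_i} (M_i - S_i) / S_i over the K criteria."""
    K = len(task_metrics)
    return sum((-1) ** l * (M - S) / S
               for M, S, l in zip(task_metrics, single_task_metrics, lower_is_better)) / K

def average_relative_improvement(per_task_improvements):
    """Delta m = (1/T) * sum_j Delta m_Tj."""
    return sum(per_task_improvements) / len(per_task_improvements)

# One task with two criteria: accuracy (higher is better) and error (lower is better).
dm = relative_task_improvement([0.80, 0.10], [0.75, 0.12], [0, 1])
```

The sign flip (−1)^{l_i} makes an improvement positive regardless of whether the criterion is a score to maximize or an error to minimize.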

5.2. COMPARISON WITH THE STATE-OF-THE-ART

Recon improves the performance of all base models. The main results on Multi-Fashion+MNIST, CelebA, CityScapes, PASCAL-Context, and NYUv2 are presented in Tables 1, 2, 3, 4, and 5, respectively. (1) Compared to gradient manipulation methods, Recon consistently improves their performance on most evaluation metrics and achieves comparable performance on the rest. (2) Compared with branched architecture search methods and architecture design methods, Recon can further improve the performance of BMTAS and MMoE. Besides, Recon combined with gradient manipulation methods can, with a small model size, achieve better results than branched architecture search methods with much bigger models.

Small increases in model parameters can lead to good performance gains. Note that Recon only changes a small portion of shared parameters to task-specific. As shown in Tables 1-5, Recon increases the model size by 0.52% to 57.25%, turning 1.42%, 1.46%, 12.77%, 0.26%, and 9.80% of the shared parameters task-specific on Multi-Fashion+MNIST, CelebA, CityScapes, NYUv2, and PASCAL-Context, respectively. The results suggest that gradient conflicts in a small portion (less than 13%) of the shared parameters impede the training of the model for multi-task learning.

Recon is compatible with various neural network architectures. We use ResNet18 on Multi-Fashion+MNIST, SegNet (Badrinarayanan et al., 2017) on CityScapes, MTAN (Liu et al., 2019) on NYUv2, and MobileNetV2 (Sandler et al., 2018) on PASCAL-Context. Recon improves the performance of baselines with different neural network architectures, including the architecture search method BMTAS (Bruggemann et al., 2020), which finds a tree-like structure for multi-task learning.

Only one search of conflict layers is needed for the same network architecture. An interesting observation from our experiments is that the network architecture seems to be the deciding factor for the conflict layers found by Recon.
With the same network architecture (e.g., ResNet18), the conflict layers found are quite consistent w.r.t. (1) different training stages (e.g., the first 25% of iterations, or the middle or last ones) (see Table 12) and (2) different MTL methods (see Table 14). Hence, in our experiments, we only search for the conflict layers once, with the joint-training baseline in the first 25% of training iterations, and modify the network to improve various methods on the same dataset. We also find that the conflict layers found on one dataset can be used to modify the network to be directly applied on another dataset to gain performance improvement.

5.3. ABLATION STUDY AND ANALYSIS

Recon greatly reduces the occurrence of conflicting gradients. In Fig. 4 and Table 6, we compare the distribution of cos ϕ_ij before and after applying Recon on Multi-Fashion+MNIST (the results on other datasets are provided in Appendix C). It can be seen that Recon reduces the number of gradient pairs with severe conflicts (cos ϕ_ij ∈ [−1, −0.01)) by at least 67% and up to 79% compared with joint-training, while gradient manipulation methods only slightly reduce the percentage and some even increase it. Similar observations can be made from Tables 8-10.

Randomly selecting conflict layers does not work. To show that the performance gain of Recon comes from selecting the layers with the most severe conflicts rather than merely increasing model parameters, we further compare Recon with the following two baselines. RSL: randomly selecting the same number of layers as Recon and setting them task-specific. RSP: randomly selecting a similar amount of parameters as Recon and setting them task-specific. The results in Table 7 show that both RSL and RSP lead to significant performance drops, which verifies the effectiveness of the selection strategy of Recon. We compare Recon with baselines that select the first or last K layers in Appendix C.

Ablation study on hyperparameters. We study the influence of the conflict severity S and the number of selected layers K on the performance of CAGrad w/ Recon on Multi-Fashion+MNIST. As shown in Fig. 3, a small K leads to a significant performance drop, which indicates that some shared network layers still suffer from severe gradient conflicts, while a large K does not lead to further performance improvement since the severe conflicts have already been resolved. For the conflict severity S, we find that a high value of S (e.g., 0.0) leads to performance drops, since it includes too many gradient pairs with small conflicts, some of which are helpful for learning common structures and should not be removed.
Meanwhile, too small a value of S (e.g., −0.15) also leads to performance degradation, because it ignores too many gradient pairs with large conflicts, which may be detrimental to learning. Although performance is sensitive to K and S, they may only need to be tuned once for a given network architecture, as discussed in Sec. 5.2.

6. CONCLUSION

We have proposed a very simple yet effective approach, namely Recon, to reduce the occurrence of conflicting gradients for multi-task learning. By considering layer-wise gradient conflicts, identifying the shared layers with severe conflicts, and setting them task-specific, Recon can significantly reduce the occurrence of severe conflicting gradients and boost the performance of existing methods with only a reasonable increase in model parameters. We have demonstrated the effectiveness, efficiency, and generality of Recon via extensive experiments and analysis.

A PROOF OF THEOREM A.1

Theorem A.1. Assume that L is differentiable and that, for any two different tasks T_i and T_j,

cos ϕ_ij^(k) ∥g_i^(k)∥ < ∥g_j^(k)∥,  ∀k ∈ P.   (8)

Then, for any sufficiently small learning rate α > 0, L(θ̂_r) < L(θ̂).

Proof. We consider the first-order Taylor approximation of L_i. For the normal update, we have

L_i(θ̂_sh^fix, θ̂_sh^cf, θ̂_i^ts) = L_i(θ_sh^fix, θ_sh^cf, θ_i^ts) + (θ̂_sh^fix − θ_sh^fix)⊤ g_i^fix + (θ̂_sh^cf − θ_sh^cf)⊤ g_i^cf + (θ̂_i^ts − θ_i^ts)⊤ g_i^ts + o(α).

For the Recon update, we have

L_i(θ̂_sh^fix, θ̂_i^cf, θ̂_i^ts) = L_i(θ_sh^fix, θ_sh^cf, θ_i^ts) + (θ̂_sh^fix − θ_sh^fix)⊤ g_i^fix + (θ̂_i^cf − θ_sh^cf)⊤ g_i^cf + (θ̂_i^ts − θ_i^ts)⊤ g_i^ts + o(α).

The difference between the two loss functions after the update is

L_i(θ̂_sh^fix, θ̂_i^cf, θ̂_i^ts) − L_i(θ̂_sh^fix, θ̂_sh^cf, θ̂_i^ts)
= (θ̂_i^cf − θ̂_sh^cf)⊤ g_i^cf + o(α)
= −α (g_i^cf − Σ_{j=1}^T w_j g_j^cf)⊤ g_i^cf + o(α)
= −α Σ_{j=1}^T w_j (g_i^cf − g_j^cf)⊤ g_i^cf + o(α)    (using Σ_{j=1}^T w_j = 1)
= −α Σ_{j=1}^T w_j (∥g_i^cf∥² − g_j^cf⊤ g_i^cf) + o(α).

Assume, without loss of generality, that ∥g_i^cf∥ ≠ 0. Then

∥g_i^cf∥² − g_j^cf⊤ g_i^cf = Σ_{k∈P} (∥g_i^(k)∥² − g_i^(k)⊤ g_j^(k)) = Σ_{k∈P} ∥g_i^(k)∥ (∥g_i^(k)∥ − cos ϕ_ij^(k) ∥g_j^(k)∥) > 0,

where the inequality follows from assumption (8) with the roles of i and j exchanged: each summand is nonnegative, and at least one is positive since ∥g_i^cf∥ ≠ 0. Hence the above difference is negative if α is sufficiently small. As such, the difference between the multi-task loss functions is also negative for sufficiently small α:
L(θ̂_r) − L(θ̂) = Σ_{i=1}^T L_i(θ̂_sh^fix, θ̂_i^cf, θ̂_i^ts) − Σ_{i=1}^T L_i(θ̂_sh^fix, θ̂_sh^cf, θ̂_i^ts) < 0.

B EXPERIMENTAL SETUP

B.1 MULTI-FASHION+MNIST

Model. We adopt ResNet18 (He et al., 2016) without pre-training as the backbone and modify the output feature dimension of the last linear layer to 100. For the task-specific heads, we define two linear layers followed by a ReLU function.

Table 9: The distribution of gradient conflicts (in terms of cos ϕ_ij) w.r.t. the shared parameters on the NYUv2 dataset. "Reduction" means the percentage of conflicting gradients in the interval [−1.0, −0.04) reduced by the model compared with joint-training. The grey cell color indicates that Recon greatly reduces the conflicting gradients (by more than 50%). In contrast, gradient manipulation methods only slightly decrease their occurrence, and some methods even increase it.

Selecting the first K layers or the last K layers as conflict layers does not work. To further support the conclusion that selecting the parameters with a higher probability of conflicting gradients, rather than the increase in model capacity, contributes most to the performance gain, we compare Recon with two baselines: (1) selecting the first K neural network layers and turning them into task-specific layers; (2) selecting the last K neural network layers and turning them into task-specific layers. The multi-task learning results on the Multi-Fashion+MNIST benchmark are presented in Table 11. The results show that directly turning the top or the bottom of the neural network into task-specific parameters still leads to performance degradation compared to Recon.

Recon finds similar layers in different training stages. Recon ranks the network layers according to the computed S-conflict scores. The ranking result can be represented as a layer permutation, denoted as π, where π(l) is the position of layer l. The similarity between two rankings π_i and π_j can be measured as

d(π_i, π_j) = (1/|L|) Σ_{l∈L} |π_i(l) − π_j(l)|,   (22)

where L denotes the set of neural network layers.
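Eq. 22 can be sketched as a small helper; the layer names and positions below are illustrative, not taken from an actual ResNet18 ranking:

```python
def ranking_distance(pi_i, pi_j):
    """d(pi_i, pi_j) = (1/|L|) * sum over layers l of |pi_i(l) - pi_j(l)|,
    where each ranking maps a layer name to its position (Eq. 22)."""
    assert pi_i.keys() == pi_j.keys(), "rankings must cover the same layers"
    return sum(abs(pi_i[l] - pi_j[l]) for l in pi_i) / len(pi_i)

# Two conflict-score rankings of four layers that only swap the top two layers:
d = ranking_distance({"conv1": 0, "conv2": 1, "conv3": 2, "fc": 3},
                     {"conv1": 1, "conv2": 0, "conv3": 2, "fc": 3})
```

A distance of 0 means identical rankings; swapping adjacent layers contributes 2/|L| to the distance, so the small values reported below (less than 2.4) correspond to nearly identical rankings.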
In Table 12, we measure the differences in the rankings obtained in different training stages (e.g., in the first 25% of iterations or the second 25% of iterations) on Multi-Fashion+MNIST by Eq. 22. The small distances (less than 2.4) indicate that the layers found in different training stages are quite similar. In Table 13, we compare the performance of the networks modified by Recon with conflict layers found in different training stages on CityScapes. The results of the last three rows are the same, because the layers found in the 3rd 25% of iterations, the 4th 25% of iterations, and all iterations are exactly the same (though the rankings may be slightly different). The layers found in the later stages lead to slightly better performance than those found in the early stages (i.e., the 1st and 2nd 25% of iterations), indicating that the conflict scores in early iterations might be a little noisy. However, since the performance gaps are acceptably small, to save time, we use the initial 25% of training iterations to find conflict layers.

Recon finds similar layers with different MTL methods. In Table 14, we measure the differences in the layer permutations (rankings) obtained by Recon with different methods (e.g., CAGrad and PCGrad) on Multi-Fashion+MNIST by Eq. 22. The small distances (less than 1.9) indicate that the layers found by Recon with different methods are quite similar. Therefore, in our experiments, we only use joint-training to search for the conflict layers once, and directly apply the modified network to improve different gradient manipulation methods, as shown in Tables 1-5.

The conflict layers found by Recon with the same architecture are transferable between different datasets. We conduct experiments with three different architectures: ResNet18, SegNet, and MTAN. (1) For ResNet18, we find that the layers found by Recon on CelebA and those found on Multi-Fashion+MNIST are exactly the same.
(2) For SegNet, we find that 95% of the layers (38 out of 40) found on NYUv2 are identical to those found on CityScapes. On NYUv2, we compare the performance of using conflict layers found on NYUv2 (baselines w/ Recon) to that of using conflict layers found on CityScapes (baselines w/ Recon*), as shown in Table 15. (3) For MTAN (SegNet with attention), we find that 68% of the layers (17 out of 25) found on CityScapes are identical to those found on NYUv2. On CityScapes, we compare the performance of using conflict layers found on CityScapes (baselines w/ Recon) to that of using conflict layers found on NYUv2 (baselines w/ Recon*), as shown in Table 16. The results show that the conflict layers found on one dataset can be used to modify the network to be directly used on another dataset and consistently improve the performance of various baselines, while searching for the conflict layers again on the new dataset may lead to better performance.

Analysis of running time. We evaluate how Recon scales with the number of tasks on the CelebA dataset by comparing the running time of one iteration used by Recon in computing gradient conflict scores (the most time-consuming part of Recon) to that of the baselines. The results in Fig. 11 show that Recon is as fast as gradient manipulation methods such as CAGrad (Liu et al., 2021a) and GradDrop (Chen et al., 2020), but much slower than joint-training, especially when the number of tasks is large, which is natural since Recon needs to compute pairwise cosine similarities of task gradients. However, since Recon only needs to search for the conflict layers once for a given network architecture, as discussed above, the running time is not a problem.



Figure 1: The distributions of gradient conflicts (in terms of cos ϕ ij ) of the joint-training baseline and state-of-the-art gradient manipulation methods on Multi-Fashion+MNIST benchmark.

Figure 2: Illustration of the differences between joint-training, gradient manipulation, and our approach. (a) In joint-training, the update vector (in green) is the average gradient ½(g_i + g_j). Due to the conflict between g_i and g_j, the update vector is dominated by g_i (in red). (b) PCGrad (Yu et al., 2020) projects each gradient onto the normal plane of the other and uses the average of the projected gradients (indicated by dashed grey arrows) as the update vector (in green). As such, the update vector is less dominated by g_i. (c) Our approach Recon finds the parameters contributing most to gradient conflicts (e.g., θ_3) and turns them into task-specific ones. In effect, it performs an orthographic (coordinate) projection of the conflicting gradients onto the space of the remaining parameters (e.g., θ_1 and θ_2), such that the projected gradients g_i^fix and g_j^fix are better aligned. (d) Illustration of Recon turning a shared layer with a high conflict score into task-specific layers.
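A toy numeric version of panel (c), assuming a three-parameter shared layer: dropping the high-conflict coordinate θ_3 (i.e., making it task-specific) is a coordinate projection that can turn a conflicting pair of gradients into an aligned one. The gradient values below are invented for illustration.

```python
import numpy as np

# Toy illustration (ours, not from the paper): if one shared parameter
# (theta_3) drives the conflict, making it task-specific amounts to a
# coordinate projection of both gradients onto the remaining shared
# coordinates (theta_1, theta_2).
def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

g_i = np.array([1.0, 0.5, 3.0])    # dominated by the theta_3 component
g_j = np.array([0.8, 0.6, -3.0])   # conflicts with g_i along theta_3

before = cos(g_i, g_j)             # negative: conflicting gradients
after = cos(g_i[:2], g_j[:2])      # drop theta_3: shared part only
print(before < 0 < after)          # -> True
```

The projected gradients are well aligned here precisely because the conflict was concentrated in the removed coordinate, which is what Recon's conflict scores are meant to detect.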

// Search for conflict layers
for iteration i = 1, 2, . . . do
    Update θ with joint-training or any gradient manipulation method;
end
// Set layers with top conflict scores task-specific
For each layer k, calculate the sum of S-conflict scores over all iterations, i.e., s(k); select the top K layers with the highest s(k) and set them task-specific;
// Train the modified network from scratch
for iteration i = 1, 2, . . . do
    Update θ with joint-training or any gradient manipulation method;
end
Output: Model parameters θ.
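The search step above can be sketched in Python as follows, assuming the S-conflict score of a layer counts task-gradient pairs whose cosine similarity falls below the severity threshold S; the helper names and this exact scoring rule are our reading of the text, not the authors' implementation.

```python
import numpy as np

# Hedged sketch of the conflict-layer search: accumulate a per-layer
# S-conflict score over training iterations, then pick the top-K layers.
def s_conflict_score(layer_grads, S):
    """Count conflicting task pairs for one shared layer.

    layer_grads: (T, D) per-task gradients of this layer.
    S: severity threshold (e.g., -0.1); pairs with cosine < S count.
    """
    n = np.linalg.norm(layer_grads, axis=1, keepdims=True)
    u = layer_grads / np.clip(n, 1e-12, None)
    sims = u @ u.T
    iu = np.triu_indices(len(layer_grads), k=1)
    return int((sims[iu] < S).sum())

def find_conflict_layers(score_history, K):
    """score_history: {layer_name: [score per iteration]} -> top-K layers."""
    totals = {k: sum(v) for k, v in score_history.items()}
    return sorted(totals, key=totals.get, reverse=True)[:K]
```

The selected layers are then duplicated per task (panel (d) of Figure 2), and the modified network is retrained from scratch.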

Notice that different methods, such as joint-training, MGDA (Sener & Koltun, 2018), PCGrad (Yu et al., 2020), and CAGrad (Liu et al., 2021a), choose different w_i dynamically.

θ_sh^(k)}, k ∈ P. Assume that ∑_{i=1}^{T} w_i = 1; then we have the following theorem.

Figure 3: The performance of CAGrad combined with Recon on the Multi-Fashion+MNIST benchmark with (a) different numbers of selected layers K and (b) different severity values S for computing conflict scores.

Figure 4: The distribution of gradient conflicts (in terms of cos ϕ ij ) of baselines and baselines with Recon on Multi-Fashion+MNIST dataset.

Figure 11: Comparison of running time (one iteration, excluding data fetching) on the CelebA dataset.

Multi-task learning results on Multi-Fashion+MNIST dataset. All experiments are repeated over 3 random seeds and the mean values are reported. ∆m% denotes the average relative improvement of all tasks. #P denotes model size (MB). The grey cell color indicates that Recon improves the result of the base model. The best average result is marked in bold.

Multi-task learning results on CelebA dataset. All experiments are repeated over 3 random seeds and the mean values are reported. ∆m% denotes the average relative improvement of all tasks. #P denotes model size (MB). The grey cell color indicates that Recon improves the result of the base model. The best average result is marked in bold.

Multi-task learning results on CityScapes dataset. All experiments are repeated over 3 random seeds and the mean values are reported. ∆m% denotes the average relative improvement of all tasks. #P denotes the model size (MB). The grey cell color indicates that Recon improves the result of the base model. The best average result is marked in bold.

and Table 13 and discussion in Appendix C), (2) different MTL methods (e.g., joint-training or gradient manipulation methods) (see Table 14 and discussion in Appendix C), and (3) different datasets (see Table 15 and Table 16 and discussion in Appendix C).

Multi-task learning results on PASCAL-Context dataset with 4-task setting. All experiments are repeated over 3 random seeds and the mean values are reported. ∆m% denotes the average relative improvement of all tasks. #P denotes the model size (MB). The grey cell color indicates Recon improves the result of the base model. The best average result is marked in bold.

Multi-task learning results on NYUv2 dataset with MTAN as backbone. All experiments are repeated over 3 random seeds and the mean values are reported. ∆m% denotes the average relative improvement of all tasks. #P denotes the model size (MB). The grey cell color indicates that Recon improves the result of the base model. The best average result is marked in bold.

The distribution of gradient conflicts (in terms of cos ϕ_ij) w.r.t. the shared parameters on the Multi-Fashion+MNIST dataset. "Reduction" denotes the percentage of conflicting gradients in the interval [-1.0, -0.01) reduced by the model compared with joint-training. The grey cell color indicates that Recon greatly reduces the conflicting gradients (by more than 50%). In contrast, gradient manipulation methods only slightly decrease their occurrence, and some methods even increase it.
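A minimal sketch of how such a "Reduction" figure can be computed, under the assumption that it is the relative decrease in the fraction of gradient pairs falling in the conflicting interval; the numbers below are invented for illustration.

```python
# Sketch (ours) of the "Reduction" column: measure the fraction of gradient
# pairs whose cosine lies in the conflicting interval [-1.0, hi), then
# report the decrease relative to the joint-training baseline.
def conflict_fraction(cosines, hi=-0.01):
    conflicting = [c for c in cosines if -1.0 <= c < hi]
    return len(conflicting) / len(cosines)

def reduction_vs_baseline(method_cosines, baseline_cosines):
    base = conflict_fraction(baseline_cosines)
    meth = conflict_fraction(method_cosines)
    return 100.0 * (base - meth) / base   # percent of conflicts removed

baseline = [-0.9, -0.4, -0.2, 0.1, 0.3, 0.5]   # 3/6 pairs conflicting
recon = [-0.3, 0.0, 0.2, 0.4, 0.6, 0.7]        # 1/6 pairs conflicting
print(round(reduction_vs_baseline(recon, baseline), 1))  # -> 66.7
```

A reduction above 50% under this definition corresponds to the grey cells in the table, i.e., more than half of the baseline's conflicting pairs are removed.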

Comparison of Recon with RSL and RSP. PD: performance drop compared to Recon.

The distribution of gradient conflicts (in terms of cos ϕ_ij) w.r.t. the shared parameters on the PASCAL-Context dataset. "Reduction" denotes the percentage of conflicting gradients in the interval [-1.0, -0.02) reduced by the model compared with joint-training. The grey cell color indicates that Recon greatly reduces the conflicting gradients (by more than 50%). In contrast, gradient manipulation methods only slightly decrease their occurrence, and some methods even increase it.

Multi-task learning results on the Multi-Fashion+MNIST dataset. FSK refers to turning the first K layers into task-specific layers, and LSK refers to turning the last K layers into task-specific layers. PD denotes the performance drop compared with Recon.

Table 12: The distance between the layer permutations (rankings) obtained in different training stages on the Multi-Fashion+MNIST dataset. "Iter." denotes iterations.

Table 13: Performance of the networks modified by Recon with conflict layers found in different training stages of joint-training on the CityScapes dataset. ∆m% denotes the average relative improvement of all tasks. #P denotes the model size (MB). The best result is marked in bold.

Table 14: The distance between the layer permutations (rankings) obtained by Recon with different methods on the Multi-Fashion+MNIST dataset.

Table 15: Multi-task learning results on the NYUv2 dataset with SegNet as backbone. Recon* denotes setting the layers found on CityScapes to task-specific. ∆m% denotes the average relative improvement of all tasks. #P denotes the model size (MB). The grey cell color indicates that Recon or Recon* improves the result of the base model.

Table 16: Multi-task learning results on the CityScapes dataset with MTAN as backbone. Recon* denotes setting the layers found on NYUv2 to task-specific. ∆m% denotes the average relative improvement of all tasks. #P denotes the model size (MB). The grey cell color indicates that Recon or Recon* improves the result of the base model.

ACKNOWLEDGMENTS

The authors would like to thank Lingzi Jin for checking the proof of Theorem A.1 and the anonymous reviewers for their insightful and helpful comments.

APPENDIX

B.1 MULTI-FASHION+MNIST

Tasks, losses, and metrics. Each task is a classification problem with 10 classes, and we use the cross-entropy loss as the classification loss. For evaluation, we use classification accuracy as the metric for each task.

Model hyperparameters. We train the model for 120 epochs with a batch size of 256. We adopt SGD with an initial learning rate of 0.1 and decay the learning rate by 0.1 at the 60th and 90th epochs.

Baseline hyperparameters. For CAGrad, we set α = 0.2. For BMTAS, we set the resource loss weight to 1.0 and search the architecture for 100 epochs. For RotoGrad, we set R_k = 100, which is equal to the dimension of the shared features, and set the learning rate of the rotation parameters equal to that of the neural networks. For MMoE, the initial learning rates of the expert networks and the gates are 0.1 and 1e-3, respectively.

Recon hyperparameters. We use CAGrad to train the model for 30 epochs and compute the conflict score of each shared layer. We set S = -0.1 for computing the scores. We select the 25 layers with the highest conflict scores and turn them into task-specific layers.

B.2 CITYSCAPES

Model. We adopt SegNet (Badrinarayanan et al., 2017) as the backbone, where the decoder is split into two convolutional heads.

Model hyperparameters. We train the model for 200 epochs with a batch size of 8. We adopt Adam with an initial learning rate of 5e-5 and decay the learning rate by 0.5 at the 100th epoch.

Baseline hyperparameters. For CAGrad, we set α = 0.2. For RotoGrad, we set R_k = 1024 and set the learning rate of the rotation parameters to one tenth of that of the neural networks.

Recon hyperparameters. We use joint-training to train the model for 40 epochs and compute the conflict score of each shared layer. We set S = 0.0 for computing the scores. We select the 39 layers with the highest conflict scores and turn them into task-specific layers.

B.3 NYUV2

Model. We adopt MTAN (Liu et al., 2019), i.e., SegNet combined with task-specific attention modules on the encoder.

Model hyperparameters. We train the model for 200 epochs with a batch size of 2. We adopt Adam with an initial learning rate of 1e-4 and decay the learning rate by 0.5 at the 100th epoch.

Baseline hyperparameters. For CAGrad, we set α = 0.4, following Liu et al. (2021a).

Recon hyperparameters. We use joint-training to train the model for 40 epochs and compute the conflict score of each shared layer. We set S = -0.02 for computing the scores. We select the 22 layers with the highest conflict scores and turn them into task-specific layers.

B.4 PASCAL-CONTEXT

Model. Following Bruggemann et al. (2020), we employ MobileNetV2 (Sandler et al., 2018) as the backbone with a reduced design of the ASPP module (R-ASPP) (Sandler et al., 2018). We pre-train the model on ImageNet (Deng et al., 2009).

Model hyperparameters. We train the model for 130 epochs with a batch size of 6. We adopt Adam with an initial learning rate of 1e-4 and decay the learning rate by 0.1 at the 70th and 100th epochs.

Baseline hyperparameters. For CAGrad, we set α = 0.1. For BMTAS, we set the resource loss weight to 0.1 and search the architecture for 130 epochs.

Recon hyperparameters. We use joint-training to train the model for 40 epochs and compute the conflict score of each shared layer. We set S = -0.02 for computing the scores. We select the 85 layers with the highest conflict scores and turn them into task-specific layers.

C ADDITIONAL ABLATION STUDY

The distribution of gradient conflicts. In addition to the statistics on Multi-Fashion+MNIST, we further show the distributions of gradient conflicts of various baselines on CityScapes, NYUv2, and PASCAL-Context in Tables 8, 9, and 10.

