TOWARDS IMPARTIAL MULTI-TASK LEARNING

Abstract

Multi-task learning (MTL) has been widely used in representation learning. However, naïvely training all tasks simultaneously may lead to the partial training issue, where specific tasks are trained more adequately than others. In this paper, we propose to learn multiple tasks impartially. Specifically, for the task-shared parameters, we optimize the scaling factors via a closed-form solution, such that the aggregated gradient (the sum of raw gradients weighted by the scaling factors) has equal projections onto the raw gradients of the individual tasks. For the task-specific parameters, we dynamically weigh the task losses so that all of them are kept at a comparable scale. Further, we find that the above gradient balance and loss balance are complementary, and thus propose a hybrid balance method to further improve the performance. Our impartial multi-task learning (IMTL) can be trained end-to-end without any heuristic hyper-parameter tuning, and is general enough to be applied to all kinds of losses without any distribution assumption. Moreover, IMTL converges to similar results even when the task losses are designed to have different scales, and is thus scale-invariant. We extensively evaluate IMTL on the standard MTL benchmarks including Cityscapes, NYUv2 and CelebA. It outperforms existing loss weighting methods under the same experimental settings.

1. INTRODUCTION

Recent deep networks in computer vision can match or even surpass human beings on some specific tasks separately. However, in reality multiple tasks (e.g., semantic segmentation and depth estimation) must be solved simultaneously. Multi-task learning (MTL) (Caruana, 1997; Evgeniou & Pontil, 2004; Ruder, 2017; Zhang & Yang, 2017) aims at sharing the learned representation among tasks (Zamir et al., 2018) to make them benefit from each other and achieve better results and stronger robustness (Zamir et al., 2020). However, sharing the representation can lead to a partial learning issue: some specific tasks are learned well while others are overlooked, due to the different loss scales or gradient magnitudes of various tasks and the mutual competition among them. Several methods have been proposed to mitigate this issue, either via gradient balance such as gradient magnitude normalization (Chen et al., 2018) and Pareto optimality (Sener & Koltun, 2018), or via loss balance like homoscedastic uncertainty (Kendall et al., 2018). Gradient balance can evenly learn the task-shared parameters but ignores the task-specific ones. Loss balance can prevent MTL from being biased in favor of tasks with large loss scales but cannot ensure impartial learning of the shared parameters. In this work, we find that gradient balance and loss balance are complementary, and combining the two can further improve the results. To this end, we propose impartial MTL (IMTL), which simultaneously balances gradients and losses across tasks.

For gradient balance, we propose IMTL-G(rad) to learn the scaling factors such that the aggregated gradient of the task-shared parameters has equal projections onto the raw gradients of the individual tasks (see Fig. 1 (d)).

[Figure 1: {g_t} denotes the gradients computed by the raw loss of each task, and the gray surface represents the plane they span. The red arrow denotes the aggregated gradient g computed from the weighted sum loss, which is ultimately used to update the model parameters. The blue arrows show the projections of g onto the raw gradients {g_t}. g has the largest projection on g_2 (nearest to the mean direction), g_3 (smallest magnitude) and g_2 (largest magnitude) for GradNorm, MGDA and PCGrad, respectively (panels (a), (b) and (c)), while the projections onto {g_t} are equal in our IMTL-G (panel (d)).]

We show that the scaling factor optimization problem is equivalent to finding the angle bisector of the gradients of all tasks in geometry, and derive a closed-form solution to it. Previous gradient balance methods such as GradNorm (Chen et al., 2018), MGDA (Sener & Koltun, 2018) and PCGrad (Yu et al., 2020) have learning biases in favor of tasks whose gradients are close to the average gradient direction, tasks with small gradient magnitudes, and tasks with large gradient magnitudes, respectively (see Fig. 1 (a), (b) and (c)); in contrast, our IMTL-G updates the task-shared parameters without bias towards any task. For loss balance, we propose IMTL-L(oss) to automatically learn a loss weighting parameter for each task so that the weighted losses have comparable scales and the effect of different loss scales from various tasks is canceled out. Compared with uncertainty weighting (Kendall et al., 2018), which is biased towards regression tasks rather than classification tasks, our IMTL-L treats all tasks equivalently without any bias. Besides, we model the loss balance problem from the optimization perspective, without the distribution assumption required by (Kendall et al., 2018). Therefore, ours is more general and can be used with any kind of loss. Moreover, the loss weighting parameters and the network parameters can be jointly learned in an end-to-end fashion in IMTL-L. Further, we find the above two balances are complementary and can be combined to improve the performance: we apply IMTL-G to the task-shared parameters and IMTL-L to the task-specific parameters, leading to the hybrid balance method IMTL.
Our IMTL is scale-invariant: the model converges to similar results even when the same task is designed to have different loss scales, which is common in practice. For example, the cross-entropy loss in semantic segmentation may have different scales when using "average" or "sum" reduction over locations in the loss computation. We empirically validate that our IMTL is more robust against heavy loss scale changes than its competitors. Meanwhile, IMTL only adds negligible computational overhead. We extensively evaluate our proposed IMTL on standard benchmarks: Cityscapes, NYUv2 and CelebA, where the experimental results show that IMTL achieves superior performance under all settings. Besides, considering that there is no fair and practical benchmark for comparing MTL methods, we unify the experimental settings such as image resolution, data augmentation, network structure, learning rate and optimizer choice. We re-implement and compare with representative MTL methods in a unified framework, which will be publicly available. Our contributions are:

• We propose a novel closed-form gradient balance method, which learns the task-shared parameters without any task bias; and we develop a general learnable loss balance method, where no distribution assumption is required and the scale parameters can be jointly trained with the network parameters.

• We unveil that gradient balance and loss balance are complementary and accordingly propose a hybrid balance method to simultaneously balance gradients and losses.

• We validate that our proposed IMTL is loss scale-invariant and more robust against loss scale changes than its competitors, and we give in-depth theoretical and experimental analyses of its connections and differences with previous methods.

• We extensively verify the effectiveness of our IMTL.
For fair comparisons, a unified codebase will also be publicly available, where more practical settings are adopted and stronger performances are achieved compared with existing code-bases.

2. RELATED WORK

Recent advances in MTL mainly come from two aspects: network structure improvements and loss weighting developments. Network-structure methods based on soft parameter-sharing usually lead to high inference cost (reviewed in Appendix A). Loss weighting methods find loss weights to be multiplied with the raw losses for model optimization. They employ a hard parameter-sharing paradigm (Ruder, 2017), where several light-weight task-specific heads are attached upon a heavy-weight task-agnostic backbone. There are also efforts that learn to group tasks and branch the network in the middle layers (Guo et al., 2020; Standley et al., 2020), which try to achieve a better accuracy-efficiency trade-off and can be seen as semi-hard parameter-sharing. We believe task grouping and loss weighting are orthogonal and complementary directions to facilitate multi-task learning and can benefit from each other. In this work we focus on loss weighting methods, which are the most economical, as almost all of the computation is shared across tasks, leading to high inference speed. Task Prioritization (Guo et al., 2018) weights task losses by their difficulties to focus on the harder tasks during training. Uncertainty weighting (Kendall et al., 2018) models the loss weights as data-agnostic task-dependent homoscedastic uncertainty, and the loss weighting is then derived from maximum likelihood estimation. GradNorm (Chen et al., 2018) learns the loss weights to enforce the norms of the scaled gradients of all tasks to be close. MGDA (Sener & Koltun, 2018) casts multi-task learning as multi-objective optimization and finds the minimum-norm point in the convex hull composed by the gradients of the multiple tasks; Pareto optimality is supposed to be achieved under mild conditions. GLS (Chennupati et al., 2019) instead uses the geometric mean of the task-specific losses as the target loss; we will show that it actually weights each loss by its reciprocal value.
PCGrad (Yu et al., 2020) avoids interference between tasks by projecting the gradient of one task onto the normal plane of the other. DSG (Lu et al., 2020) dynamically makes a task "stop or go" according to its convergence state, where a stopped task is updated only once in a while. Although many loss weighting methods have been proposed, they are seldom open-sourced and rarely compared thoroughly under practical settings where strong performances are achieved, which motivates us to give an in-depth analysis and a fair comparison of them.

3. IMPARTIAL MULTI-TASK LEARNING

In MTL, we map a sample x ∈ X to its labels {y_t ∈ Y_t}, t ∈ [1, T], of all T tasks through multiple task-specific mappings {f_t : X → Y_t}. Most loss weighting methods employ the hard parameter-sharing paradigm, where f_t is parameterized by heavy-weight task-shared parameters θ and light-weight task-specific parameters θ_t. All tasks take the same shared intermediate feature z = f(x; θ) as input, and the t-th task head outputs the prediction f_t(x) = f_t(z; θ_t). We aim to find the scaling factors {α_t} for all T task losses {L_t(f_t(x), y_t)}, so that the weighted sum loss L = Σ_t α_t L_t can be optimized to make all tasks perform well. This poses great challenges because: 1) losses may have distinct forms such as cross-entropy loss and cosine similarity; 2) the dynamic ranges of losses may differ by orders of magnitude. In this work, we propose a hybrid solution covering both the task-shared parameters θ and the task-specific parameters {θ_t}, as illustrated in Fig. 2. For the task-shared parameters θ, we can obtain T gradients {g_t = ∇_θ L_t} via back-propagation from the T raw losses {L_t}, and these gradients represent the optimal update directions for the individual tasks. As the parameters θ can only be updated with a single gradient, we compute an aggregated gradient g as a linear combination of {g_t}. This also amounts to finding the scaling factors {α_t} of the raw losses {L_t}, since g = Σ_t α_t g_t = ∇_θ L = ∇_θ(Σ_t α_t L_t). Motivated by the principle of balance among tasks, we propose to make the projections of g onto {g_t} equal, as in Fig. 1 (d).
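The hard parameter-sharing setup above can be made concrete with a toy NumPy sketch: a linear shared backbone produces z, linear task heads produce predictions, and the weighted sum loss combines the per-task losses. All shapes and names here are illustrative, not from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # batch of samples
theta = rng.normal(size=(8, 4))      # task-shared parameters
z = x @ theta                        # shared intermediate feature z = f(x; theta)

T = 3
heads = [rng.normal(size=(4, 2)) for _ in range(T)]    # task-specific theta_t
targets = [rng.normal(size=(5, 2)) for _ in range(T)]
# per-task losses L_t(f_t(z; theta_t), y_t); mean squared error as a stand-in
losses = [np.mean((z @ h - y) ** 2) for h, y in zip(heads, targets)]

alpha = np.full(T, 1.0 / T)            # scaling factors, to be learned by IMTL
L = float(alpha @ np.asarray(losses))  # weighted sum loss L = sum_t alpha_t L_t
```

With uniform alpha this reduces to the naïve baseline; the rest of the section is about choosing alpha impartially.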

3.1. GRADIENT BALANCE: IMTL-G

Algorithm 1 (excerpt, steps 9–17):

9:  compute gradient differences D = [g_1 − g_2, ⋯, g_1 − g_T]
10: compute unit-norm gradient differences U = [u_1 − u_2, ⋯, u_1 − u_T]
11: compute scaling factors for tasks 2 to T: α_{2:T} = g_1 U^⊤ (D U^⊤)^{−1}   ▷ gradient balance
12: compute scaling factors for all tasks: α = [1 − α_{2:T} 1^⊤, α_{2:T}]
13: update task-shared parameters θ ← θ − η ∇_θ Σ_t α_t L_t
14: for t = 1 to T do
15:   update task-specific parameters θ_t ← θ_t − η ∇_{θ_t} L_t
16:   update loss scale parameter s_t ← s_t − η ∂L_t/∂s_t
17: end for

We treat all tasks equally so that they progress at the same speed and none is left behind. Formally, let {u_t = g_t/‖g_t‖} denote the unit-norm vectors of the row vectors {g_t}; then we have:

g u_1^⊤ = g u_t^⊤ ⇔ g (u_1 − u_t)^⊤ = 0, ∀ 2 ≤ t ≤ T.  (1)

The above problem is under-determined, but we can obtain closed-form results for {α_t} by constraining Σ_t α_t = 1. Let α_{2:T} = [α_2, ⋯, α_T], U = [u_1 − u_2, ⋯, u_1 − u_T], D = [g_1 − g_2, ⋯, g_1 − g_T] and 1 = [1, ⋯, 1]. From Eq. (1) we obtain:

α_{2:T} = g_1 U^⊤ (D U^⊤)^{−1}.  (IMTL-G) (2)

The detailed derivation is in Appendix B.1. After obtaining α_{2:T}, the scaling factor of the first task is computed as α_1 = 1 − α_{2:T} 1^⊤ since Σ_t α_t = 1. The optimized {α_t} are used to compute L = Σ_t α_t L_t, which is ultimately minimized by SGD to update the model. So far, back-propagation needs to be executed T times to obtain the gradient of each task loss with respect to the heavy-weight task-shared parameters θ, which is time-consuming and non-scalable. We therefore replace the parameter-level gradients {g_t = ∇_θ L_t} with feature-level gradients {∇_z L_t} to compute {α_t}. This amounts to achieving gradient balance with respect to the last shared feature z as a surrogate for the task-shared parameters θ, since the network can back-propagate this balance all the way through the task-shared backbone starting from z.
This relaxation allows us to do back propagation through the backbone only once after obtaining {α t }, and thus the training time can be dramatically reduced.
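The closed-form solution of Eq. (2) can be sketched in a few lines of NumPy. This is a hedged illustration with our own function name; in practice the inputs would be the feature-level gradients {∇_z L_t} described above.

```python
import numpy as np

def imtl_g_weights(grads):
    """Closed-form IMTL-G scaling factors (Eq. 2).

    grads: list of T flattened task gradients g_t (row vectors).
    Returns alpha with sum(alpha) == 1 such that the aggregated gradient
    g = sum_t alpha_t g_t has equal projections onto every unit-norm
    direction u_t = g_t / ||g_t||.
    """
    g = np.stack(grads)                                  # (T, d)
    u = g / np.linalg.norm(g, axis=1, keepdims=True)     # unit-norm gradients
    D = g[0] - g[1:]                                     # rows g_1 - g_t, t = 2..T
    U = u[0] - u[1:]                                     # rows u_1 - u_t
    # alpha_{2:T} = g_1 U^T (D U^T)^{-1}
    alpha_rest = g[0] @ U.T @ np.linalg.inv(D @ U.T)
    return np.concatenate([[1.0 - alpha_rest.sum()], alpha_rest])
```

For T tasks the matrix to invert is only (T−1)×(T−1), so the overhead is negligible next to a backward pass.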

3.2. LOSS BALANCE: IMTL-L

For the task-specific parameters {θ_t}, we cannot employ the IMTL-G described above, because ∇_{θ_t} L_τ = 0, ∀ t ≠ τ, and thus only the gradient of the corresponding task, ∇_{θ_t} L_t, can be obtained for each θ_t. Instead, we propose to balance the losses among tasks by forcing the scaled losses {α_t L_t} to be a constant for all tasks; without loss of generality, we take the constant to be 1. The most direct idea is to compute the scaling factors as {α_t = 1/L_t}, but these are sensitive to outlier samples and manifest severe oscillations, so we further propose to learn to scale the losses via gradient descent, which achieves stronger stability. Suppose the positive losses {L_t > 0} are to be balanced. We first introduce a mapping function h : R → R_+ to transform the arbitrarily-ranged learnable scale parameters {s_t} into positive scaling factors {h(s_t) > 0}; hereafter we drop the subscript t for brevity. We should then construct an appropriate scaled loss g(s) so that both the network parameters θ and the scale parameter s can be optimized by minimizing g(s). On one hand, we balance different tasks by encouraging the scaled loss h(s)L(θ) to be 1 for all tasks, so the optimum s* of s is achieved when h(s)L(θ) = 1, or equivalently:

f(s) ≡ h(s)L(θ) − 1 = 0, if s = s*.  (3)

One may expect to minimize |f(s)| = |h(s)L(θ) − 1| to find s*; however, when h(s)L(θ) < 1, the gradient with respect to θ, ∇_θ |f(s)| = −h(s)∇_θ L(θ), points in the opposite direction. On the other hand, assume our scaled loss g(s) is a differentiable convex function with respect to s; then its minimum is achieved if and only if s = s*, where the derivative of g(s) is zero:

g′(s) = 0, if s = s*.  (4)

From Eqs. (3) and (4) we find that f(s) and g′(s) are both 0 when s = s*, so we can regard f(s) as the derivative of g(s), which is our target scaled loss used to optimize both the network parameters θ and the loss scale parameter s:

g′(s) = f(s) ⇔ g(s) = ∫ f(s) ds = L(θ) ∫ h(s) ds − s.  (5)

From Eqs. (3) and (5), we notice that both h(s) and ∫h(s) ds denote loss scales, so we have ∫h(s) ds = C h(s), where C > 0 is a constant. By this ordinary differential equation, ∫h(s) ds must be an exponential function: ∫h(s) ds = b a^s with a > 1, b > 0 (see Appendix B.2). We then have g″(s) = k a^s with k > 0, which is always positive and verifies our assumption about the convexity of g(s). Also note that the gradient of g(s) with respect to θ, ∇_θ g(s) = ∫h(s) ds · ∇_θ L(θ) = b a^s ∇_θ L(θ), points in the appropriate direction since b a^s > 0. As an instantiation, we set ∫h(s) ds = e^s (a = e, b = 1); then

g(s) = e^s L(θ) − s.  (IMTL-L) (6)

From Eq. (6) we find that the raw loss is scaled by e^s, and −s acts as a regularizer to avoid the trivial solution s = −∞ while minimizing the scaled loss g(s). In implementation, the task losses {L_t} are scaled by {e^{s_t}}, and the scaled losses {e^{s_t} L_t − s_t} are used to update both the network parameters θ, {θ_t} and the scale parameters {s_t}.
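To make Eq. (6) concrete, here is a minimal sketch of the scale-parameter update with the raw losses held fixed (a simplification for illustration; during training θ is updated jointly). Since ∂g/∂s_t = e^{s_t} L_t − 1, gradient descent drives each scaled loss e^{s_t} L_t toward 1. The function name is our own.

```python
import numpy as np

def imtl_l_scale_step(raw_losses, s, lr=0.1):
    """One gradient step on the scale parameters of g(s_t) = e^{s_t} L_t - s_t.

    The -s_t regularizer keeps s_t from drifting to -inf; the fixed point
    satisfies e^{s_t} L_t = 1, i.e. all scaled losses reach the same scale.
    """
    raw_losses = np.asarray(raw_losses, dtype=float)
    grad_s = np.exp(s) * raw_losses - 1.0    # d/ds [e^s L - s]
    return s - lr * grad_s
```

Iterating this map with fixed losses such as [4.0, 0.25] drives both scaled losses e^{s_t} L_t to 1, i.e. s_t → −log L_t, illustrating the loss balance.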

3.3. HYBRID BALANCE: IMTL

We have introduced IMTL-G and IMTL-L to achieve gradient balance and loss balance, respectively, and both of them produce scaling factors to be applied on the raw losses. They can be used on their own, but we find them complementary and able to be combined to improve the performance. In IMTL-G, even if the raw losses are multiplied by arbitrary (possibly different among tasks) positive factors, the direction of the aggregated gradient g stays unchanged: by definition g = Σ_t α_t g_t is the angular bisector of the gradients {g_t}, and positive scaling does not change the directions of {g_t} and thus that of g (proof in Theorem 2). So we can also obtain the scaling factors {α_t} in IMTL-G with the losses that have been scaled by {s_t} from IMTL-L. IMTL-G and IMTL-L are combined as follows: 1) the task-specific parameters {θ_t} and scale parameters {s_t} are updated by the scaled losses {e^{s_t} L_t − s_t}; 2) the task-shared parameters θ are updated by Σ_t α_t (e^{s_t} L_t), the weighted average of {e^{s_t} L_t}, with the weights {α_t} computed from {∇_z(e^{s_t} L_t)} using IMTL-G. Note that the regularization terms {−s_t} in Eq. (6) are constant with respect to θ and z, and thus can be ignored when computing gradients and updating parameters in IMTL-G. In this way, we achieve both gradient balance for task-shared parameters and loss balance for task-specific parameters, leading to our full IMTL as illustrated in Alg. 1.
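One hybrid update can be sketched as follows (a self-contained NumPy illustration with our own names): IMTL-L scales each raw loss by e^{s_t} and updates s_t, while IMTL-G computes the closed-form weights on the gradients of the scaled losses. The gradient vectors below stand in for {∇_z(e^{s_t} L_t)}, with the −s_t terms dropped since they are constant w.r.t. z.

```python
import numpy as np

def hybrid_imtl_step(raw_losses, raw_grads, s, lr=0.1):
    """Combine IMTL-L scaling with IMTL-G weighting in one step."""
    raw_losses = np.asarray(raw_losses, dtype=float)
    g = np.exp(s)[:, None] * np.stack(raw_grads)     # grads of scaled losses
    u = g / np.linalg.norm(g, axis=1, keepdims=True)
    D, U = g[0] - g[1:], u[0] - u[1:]
    a_rest = g[0] @ U.T @ np.linalg.inv(D @ U.T)     # Eq. (2) on scaled grads
    alpha = np.concatenate([[1.0 - a_rest.sum()], a_rest])
    shared_update = alpha @ g                        # aggregated gradient for theta
    s_new = s - lr * (np.exp(s) * raw_losses - 1.0)  # IMTL-L step on s_t
    return alpha, shared_update, s_new
```

The aggregated gradient updates the shared backbone, while each head is updated by its own scaled loss as in Alg. 1.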

4. DISCUSSION

We draw connections between our method and previous state-of-the-art methods in Fig. 3. We will show that previous methods can all be categorized as gradient or loss balance, and thus each of them can be seen as a specification of our method. However, all of them have intrinsic biases or shortcomings leading to inferior performance, which we try to overcome.

[Figure 3: For the loss balance methods, the scaled loss is annotated in brackets: L_cls, L_reg and L_t are the raw losses of classification, regression and the t-th task, respectively, and α_cls, α_reg and α_t are the corresponding loss scales; L̄ is the geometric mean loss and T is the number of tasks. For the gradient balance methods, the projection of the aggregated gradient g = Σ_t α_t g_t onto the raw gradient g_t of the t-th task is annotated in brackets, where u_t = g_t/‖g_t‖ is the unit-norm vector, p_t = g u_t^⊤ is the projection of g onto g_t, and u_s = Σ_t u_t is the mean direction.]

GradNorm (Chen et al., 2018) balances tasks by making the norms of the scaled gradients approximately equal across tasks. It also introduces the inverse training rate and a hyper-parameter γ to control the strength of approaching the mean gradient norm, such that tasks which learn more slowly receive larger gradient magnitudes. However, it does not take the relationship between the gradient directions into account. We show that when the angle between the gradients of each pair of tasks is identical, our IMTL-G leads to a solution equivalent to GradNorm's.

Theorem 1. If the angle between any pair u_t, u_τ stays constant, i.e. u_t u_τ^⊤ = C_1, ∀ t ≠ τ with C_1 < 1, then our IMTL-G leads to the same solution as that of GradNorm: g u_t^⊤ = C_2 ⇔ n_t ≡ ‖α_t g_t‖ = α_t ‖g_t‖ = C_3. Here u_t = g_t/‖g_t‖, and C_1, C_2 and C_3 are constants. Proof in Appendix C.1.
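Theorem 1 can be sanity-checked numerically: mutually orthogonal gradients satisfy the constant-angle condition (u_t u_τ^⊤ = 0 for all t ≠ τ), and the closed-form IMTL-G weights of Eq. (2) then equalize the scaled gradient norms α_t‖g_t‖, exactly the GradNorm target. The snippet below is our own illustration, not code from either paper.

```python
import numpy as np

# Orthogonal gradients with different magnitudes: constant pairwise angle.
grads = np.diag([2.0, 0.5, 1.0])    # rows g_t = c_t * e_t, so u_t u_tau^T = 0

u = grads / np.linalg.norm(grads, axis=1, keepdims=True)
D = grads[0] - grads[1:]
U = u[0] - u[1:]
a_rest = grads[0] @ U.T @ np.linalg.inv(D @ U.T)    # Eq. (2)
alpha = np.concatenate([[1.0 - a_rest.sum()], a_rest])

# GradNorm's target quantity: the scaled norms alpha_t * ||g_t||
scaled_norms = alpha * np.linalg.norm(grads, axis=1)
```

Here α = [1/7, 4/7, 2/7] and every scaled norm equals 2/7, as Theorem 1 predicts.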
In GradNorm, without the above constant-angle condition u_t u_τ^⊤ = C_1, the projection of the aggregated gradient g onto the task-specific gradient, g u_t^⊤ = (Σ_τ C_3 u_τ) u_t^⊤ = C_3 (Σ_τ u_τ) u_t^⊤, is proportional to (Σ_τ u_τ) u_t^⊤. GradNorm thus tends to optimize the "majority tasks" whose gradient directions are closer to the mean direction Σ_t u_t, resulting in undesired task bias.

MGDA (Sener & Koltun, 2018) finds the weighted average gradient g = Σ_t α_t g_t with minimum norm in the convex hull composed by {g_t}, so that Σ_t α_t = 1 and α_t ≥ 0, ∀t. It adopts an iterative method based on the Frank-Wolfe algorithm to solve the multi-objective optimization problem. We note that the minimum-norm point has a closed-form representation if the constraints {α_t ≥ 0} are dropped. In this case, we minimize g g^⊤ = (Σ_t α_t g_t)(Σ_τ α_τ g_τ)^⊤ such that Σ_t α_t = 1. This implies that g is perpendicular to the hyper-plane composed by {g_t}, as illustrated in Fig. 1 (b), and thus we have:

g ⊥ (g_1 − g_t) ⇔ g (g_1 − g_t)^⊤ = 0, ∀ 2 ≤ t ≤ T,  (7)

and can obtain α_{2:T} = g_1 D^⊤ (D D^⊤)^{−1} (see Appendix C.2). From Eq. (7), we note that the aggregated gradient satisfies g g_t^⊤ = C. Then the projection of g onto g_t, g u_t^⊤ = C/‖g_t‖, is inversely proportional to the norm of g_t. So MGDA focuses on tasks with smaller gradient magnitudes, which breaks the task balance. Even with {α_t ≥ 0}, the problem still exists in the original MGDA method (see Appendix C.2). Through experiments, we note that finding the minimum-norm point without the constraints {α_t ≥ 0} leads to similar performance as MGDA with the constraints. In our IMTL-G, although we do not constrain {α_t ≥ 0}, the loss weighting scales are always positive during training, as shown in Fig. 4.

Uncertainty weighting (Kendall et al., 2018) regards the task uncertainty as the loss weight.
For regression, it derives the L_1 loss from a Laplace distribution: −log p(y | f(x)) = |y − f(x)|/b + log b, where x is the data sample, y is the ground-truth label, f denotes the prediction model and b is the diversity of the Laplace distribution (the L_2 case can be found in Appendix C.4). For classification, it treats the cross-entropy loss as a scaled categorical distribution and introduces the following approximation:

−log p(y | f(x)) = −log softmax_y(f(x)/σ²) ≈ −(1/σ²) log[softmax_y(f(x))] + log σ,  (8)

in which softmax_y(·) stands for taking the y-th entry after the softmax(·) operator. MTL corresponds to maximizing the joint likelihood of multiple targets, and the derivations yield the scaling factor b/σ for the regression/classification loss, respectively. Kendall et al. (2018) learn b and σ as model parameters, updated by stochastic gradient descent. However, this is applicable only if we can find an appropriate correspondence between the loss and a distribution. It is difficult to use for losses such as cosine similarity, and it is impossible to traverse all kinds of losses to obtain a unified form for them. Moreover, it sacrifices classification tasks. From Eq. (8) we find that the scaled cross-entropy loss is approximated as L = e^{2s} L_cls − s if we set s = −log σ. Taking the derivative, ∂L/∂s = 2e^{2s} L_cls − 1, so s is optimized to make the scaled loss e^{2s} L_cls close to 1/2. However, the scaled L_1 loss is approximated as L = e^s L_reg − s if we set s = −log b, and taking the derivative, ∂L/∂s = e^s L_reg − 1. So s is optimized to make the scaled L_1 loss reach 1, which is twice the classification loss, and thus the classification task is overlooked. We would like to remark on the differences between our IMTL-L and uncertainty weighting (Kendall et al., 2018).
Firstly, our derivation is motivated by fairness among tasks, which intrinsically differs from uncertainty weighting, which is based on task uncertainty and considers each task independently. Secondly, IMTL-L learns to balance among tasks without any biases, while uncertainty weighting may sacrifice classification tasks to favor regression tasks, as derived above. Thirdly, IMTL-L does not depend on any distribution assumptions and thus can be generally applied to various losses including cosine similarity, which uncertainty weighting may have difficulty with: as far as we know, there is no appropriate correspondence between cosine similarity and a specific distribution. Lastly, uncertainty weighting needs to deal with different losses case by case, and it introduces approximations in order to derive scaling factors for certain losses (such as cross-entropy) which may not be optimal, whereas our IMTL-L has a unified form for all kinds of losses.

GLS (Chennupati et al., 2019) calculates the target loss as the geometric mean L̄ = (Π_t L_t)^{1/T}; the gradient of L̄ with respect to the model parameters θ (derived in Appendix C.5) can be regarded as weighting each loss by its reciprocal value. However, as the gradient depends on the value of L̄, it is not invariant to loss scale changes. Moreover, we find it to be unstable when the number of tasks is large because of the geometric mean computation.
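The classification/regression bias derived above can be checked numerically. Under the uncertainty-weighting surrogates L = e^{2s}L_cls − s and L = e^{s}L_reg − s, gradient descent on s (with the raw losses held fixed for illustration) drives the scaled classification loss to 1/2 but the scaled regression loss to 1. This snippet is our own demonstration, not code from either paper.

```python
import numpy as np

s_cls, s_reg = 0.0, 0.0
L_cls, L_reg = 2.0, 2.0       # equal raw losses, held fixed
lr = 0.05
for _ in range(2000):
    s_cls -= lr * (2.0 * np.exp(2.0 * s_cls) * L_cls - 1.0)  # d/ds [e^{2s}L - s]
    s_reg -= lr * (np.exp(s_reg) * L_reg - 1.0)              # d/ds [e^{s}L - s]

scaled_cls = np.exp(2.0 * s_cls) * L_cls   # settles near 1/2
scaled_reg = np.exp(s_reg) * L_reg         # settles near 1
```

Even with identical raw losses, the classification task ends up weighted half as strongly, which is the bias IMTL-L removes by giving every task the same fixed point e^{s}L = 1.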

5. EXPERIMENTS

In previous methods, various experimental settings have been adopted but there are no extensive comparisons. As one contribution of our work, we re-implement representative methods and present fair comparisons among them under a unified code-base, where more practical settings are adopted and stronger performances are achieved compared with existing code-bases. The implementations exactly follow the original papers and open-sourced code to ensure correctness. We run experiments on the Cityscapes (Cordts et al., 2016), NYUv2 (Silberman et al., 2012) and CelebA (Liu et al., 2015) datasets to extensively analyze the different methods. Details can be found in Appendix D.

Results on Cityscapes. From Tab. 1 we can draw several informative conclusions. The uniform scaling baseline, which naïvely adds all losses, tends to optimize tasks with larger losses and gradient magnitudes, resulting in severe task bias. Uncertainty weighting (Kendall et al., 2018) sacrifices classification tasks to aid regression ones, leading to significantly worse results on semantic segmentation compared with our IMTL-L. GradNorm (Chen et al., 2018) is very sensitive to the choice of the hyper-parameter γ controlling the strength of equal gradient magnitudes: the default γ = 1.5 works well on NYUv2 but performs badly on Cityscapes, and we find its best option is γ = 0, which makes the scaled gradient norms exactly equal. MGDA (Sener & Koltun, 2018) focuses on tasks with smaller gradient magnitudes, so the performance of semantic segmentation is good but the other two tasks have difficulty converging. In addition, we find that our proposed closed-form variant without the hard constraints {α_t ≥ 0} achieves similar results as the original iterative method; through the experiments we notice the closed-form solution almost always yields {α_t ≥ 0}.
As for PCGrad (Yu et al., 2020), it yields only slightly better performance than uniform scaling because its conflict projection has no effect when the angles between the gradients are less than or equal to π/2. In contrast, our IMTL method, in terms of both gradient balance and loss balance, yields competitive performance and achieves the best balance among tasks. Moreover, we verify that the two balances are complementary and can be combined to further improve the performance, with visualizations in Appendix E. Surprisingly, we find our IMTL can even beat the single-task baseline.

[Table 1: Comparison between IMTL and previous methods on Cityscapes, where semantic segmentation, instance segmentation and disparity/depth estimation are considered. The first group of columns shows the regular results of the different methods. The second group shows the results obtained by manually multiplying the semantic segmentation loss by 10 before applying these methods; the subscript numbers show the absolute change after scaling the loss, to demonstrate the robustness of the various methods. The arrows indicate whether values are the higher the better (↑) or the lower the better (↓). The best and runner-up results for each task are bold and underlined, respectively.]

In addition, we present the real-world training time per iteration for each method in Tab. 1. As shown, loss balance methods are the most efficient, and our gradient balance method IMTL-G adds acceptable computational overhead, similar to that of GradNorm (Chen et al., 2018) and MGDA (Sener & Koltun, 2018). This efficiency comes from computing gradients with respect to the shared feature maps instead of the shared model parameters (the row "IMTL-G (exact)"), which brings similar performance but adds significant complexity due to multiple (T) backward passes through the shared parameters. Our IMTL-G only needs to do the backward computation on the shared parameters once, after obtaining the loss weights via Eq. (2), in which the computational overhead mainly comes from the matrix multiplication rather than the matrix inverse, since the inverted matrix D U^⊤ ∈ R^{(T−1)×(T−1)} is small compared with the dimension of the shared feature z. As we outperform MGDA (Sener & Koltun, 2018) and PCGrad (Yu et al., 2020) significantly in terms of the objective metrics shown in Tab. 1, we further compare the qualitative results of our hybrid balance IMTL with the loss balance method uncertainty weighting (Kendall et al., 2018) and the gradient balance method GradNorm (Chen et al., 2018), considering their strong performances (see Fig. 6). For depth estimation we only show predictions at the pixels where ground truth (GT) labels exist to compare with GT, which is different from Fig. 7 where depth predictions are shown for all pixels. Consistent with the results in Tab. 1, our IMTL shows visually noticeable improvements, especially for the semantic and instance segmentation tasks. It is worth noting that we conduct experiments under strong baselines and practical settings which are seldom explored before; in this regime, changing the backbone in PSPNet (Zhao et al., 2017) from ResNet-50 to ResNet-101 improves the mIoU of the semantic segmentation task by only around 0.5% according to the public code base.

Scale invariance. We are also interested in scale invariance, i.e., how the results change with the loss scale. For example, in semantic segmentation, the loss scale differs if we replace the reduction method "mean" (averaged over all locations) with "sum" (summed over all locations) in the cross-entropy loss computation, or if the number of classes of interest increases. Scale invariance is beneficial for model robustness. To simulate this effect, we manually multiply the semantic segmentation loss by 10 and apply the same methods to see how the performances are affected. In the last three columns of Tab. 1 we report the absolute changes resulting from the multiplier.
Our IMTL achieves the smallest performance fluctuations and thus the best invariance, while the other methods are more or less affected by the loss scale change.

Results on NYUv2. In Tab. 2 we find similar patterns as on Cityscapes; however, NYUv2 is a rather small dataset, so uniform scaling can also obtain reasonable results. Note that uncertainty weighting (Kendall et al., 2018) cannot be directly used for surface normal estimation when the cosine similarity is used as the loss, since no appropriate distribution corresponds to cosine similarity. In this setting, surface normal estimation has the smallest gradient magnitude, so MGDA (Sener & Koltun, 2018) learns it best but does not perform well on the other two tasks. Again, our IMTL performs best, taking advantage of the complementary gradient and loss balances.

Results on CelebA. To compare the different methods in the many-task setting, in Tab. 2 we also conduct multi-label classification experiments on the CelebA (Liu et al., 2015) dataset. The mean accuracy over 40 tasks is used as the final metric. Our IMTL outperforms its competitors in this scenario where the task number is large, showing its superiority. Note that in this setting, GLS (Chennupati et al., 2019) has difficulty converging and no reasonable results can be obtained.
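The loss-scale invariance of IMTL-G tested above can also be illustrated numerically: multiplying any raw loss by a positive constant rescales its gradient but leaves the direction of the aggregated gradient g unchanged, as claimed in Theorem 2. The self-contained check below uses our own names and simulates the paper's 10× multiplier on one task's loss.

```python
import numpy as np

def aggregated_grad(grads):
    """Aggregated gradient under the closed-form IMTL-G weights (Eq. 2)."""
    g = np.stack(grads)
    u = g / np.linalg.norm(g, axis=1, keepdims=True)
    D, U = g[0] - g[1:], u[0] - u[1:]
    a_rest = g[0] @ U.T @ np.linalg.inv(D @ U.T)
    alpha = np.concatenate([[1.0 - a_rest.sum()], a_rest])
    return alpha @ g

rng = np.random.default_rng(0)
grads = [rng.normal(size=8) for _ in range(3)]
g_before = aggregated_grad(grads)
# Simulate a 10x loss scale on the first task: its gradient is scaled by 10.
g_after = aggregated_grad([10.0 * grads[0], grads[1], grads[2]])
cos = g_before @ g_after / (np.linalg.norm(g_before) * np.linalg.norm(g_after))
```

The unit-norm directions {u_t} are unchanged by positive scaling, so the equal-projection constraints pin down the same direction for g; only its magnitude changes.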

6. CONCLUSION

We propose an impartial multi-task learning method integrating gradient balance and loss balance, which are applied to the task-shared and task-specific parameters, respectively. Through our in-depth analysis, we have theoretically compared our method with previous state-of-the-art methods. We have also shown that these methods can all be categorized as gradient or loss balance, but introduce specific biases among tasks. Through extensive experiments we verify our analysis and demonstrate the effectiveness of our method. Besides, for fair comparisons, we contribute a unified code base, which adopts more practical settings and delivers stronger performance than existing code bases; it will be made publicly available for future research.

Published as a conference paper at ICLR 2021.

Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2017.

A RELATED WORK OF NETWORK STRUCTURE

Cross-stitch Networks (Misra et al., 2016) learn coefficients to linearly combine activations from multiple tasks to construct better task-specific representations. To go beyond channel-wise-only cross-task feature fusion, NDDR-CNN (Gao et al., 2019) proposes layer-wise cross-channel feature aggregation as 1 × 1 convolutions on the concatenated feature maps from multiple tasks. More generally, MTL-NAS (Gao et al., 2020) introduces cross-layer connections among tasks to fully exploit feature sharing across both low and high layers, extending the idea of Sluice Networks (Ruder et al., 2019) by leveraging neural architecture search (Zoph & Le, 2017). The parameters of these methods increase linearly with the number of tasks. To improve model compactness, Residual Adapters (Rebuffi et al., 2017) introduce a small number of task-specific parameters for each layer and convolve them with the task-agnostic representations to form the task-related ones. MTAN (Liu et al., 2019) generates data-dependent attention tensors from task-specific parameters to attend to the task-shared features. Single-tasking (Maninis et al., 2019) instead applies the squeeze-and-excitation (Hu et al., 2018) module to generate attentive vectors for each task. In Task Routing (Strezoski et al., 2019), the attentive vectors are randomly sampled before training and are fixed for each image. Piggyback (Mallya et al., 2018) opts to mask parameter weights instead of activation maps, addressing task sharing from another point of view. The above methods can share parameters among tasks to a large extent; however, they are not memory-efficient, because each task still needs to compute all of its own intermediate feature maps, which also leads to inferior inference speed compared with loss weighting methods.

B DETAILED DERIVATION

B.1 GRADIENT BALANCE: IMTL-G

Here we give the detailed derivation of the closed-form solution of our IMTL-G; we also demonstrate the scale-invariance property of IMTL-G, i.e., it is invariant to scale changes of the losses.

Solution. We want to achieve
$$\mathbf{g}\mathbf{u}_1^\top = \mathbf{g}\mathbf{u}_t^\top \;\Leftrightarrow\; \mathbf{g}(\mathbf{u}_1 - \mathbf{u}_t)^\top = 0, \quad \forall\, 2 \le t \le T, \tag{9}$$
where $\mathbf{u}_t = \mathbf{g}_t/\|\mathbf{g}_t\|$. Recall that $\mathbf{g} = \sum_t \alpha_t \mathbf{g}_t$ with $\sum_t \alpha_t = 1$. If we set $\boldsymbol{\alpha} = [\alpha_2, \cdots, \alpha_T]$ and $\mathbf{G} = [\mathbf{g}_2^\top, \cdots, \mathbf{g}_T^\top]^\top$, then $\alpha_1 = 1 - \boldsymbol{\alpha}\mathbf{1}^\top$ and Eq. (9) can be expanded as
$$\sum_t \alpha_t \mathbf{g}_t\, [\mathbf{u}_1 - \mathbf{u}_2, \cdots, \mathbf{u}_1 - \mathbf{u}_T]^\top = \mathbf{0} \;\Leftrightarrow\; \big[1 - \boldsymbol{\alpha}\mathbf{1}^\top,\; \boldsymbol{\alpha}\big] \begin{bmatrix} \mathbf{g}_1 \\ \mathbf{G} \end{bmatrix} \mathbf{U}^\top = \mathbf{0}, \tag{10}$$
where $\mathbf{U} = [\mathbf{u}_1 - \mathbf{u}_2, \cdots, \mathbf{u}_1 - \mathbf{u}_T]$, and $\mathbf{1}$ and $\mathbf{0}$ indicate the all-one and all-zero row vectors, respectively. Eq. (10) can be solved by
$$\big((1 - \boldsymbol{\alpha}\mathbf{1}^\top)\mathbf{g}_1 + \boldsymbol{\alpha}\mathbf{G}\big)\mathbf{U}^\top = \mathbf{0} \;\Leftrightarrow\; \boldsymbol{\alpha}\big(\mathbf{1}^\top\mathbf{g}_1 - \mathbf{G}\big)\mathbf{U}^\top = \mathbf{g}_1\mathbf{U}^\top. \tag{11}$$
Let $\mathbf{D} = \mathbf{1}^\top\mathbf{g}_1 - \mathbf{G} = [\mathbf{g}_1 - \mathbf{g}_2, \cdots, \mathbf{g}_1 - \mathbf{g}_T]$; then we reach
$$\boldsymbol{\alpha}\mathbf{D}\mathbf{U}^\top = \mathbf{g}_1\mathbf{U}^\top \;\Leftrightarrow\; \boldsymbol{\alpha} = \mathbf{g}_1\mathbf{U}^\top\big(\mathbf{D}\mathbf{U}^\top\big)^{-1}. \tag{12}$$

Property. We can also prove that the aggregated gradient $\mathbf{g} = \sum_t \alpha_t\mathbf{g}_t$ with $\{\alpha_t\}$ given by Eq. (12) is invariant to scale changes of the losses $\{L_t\}$ (equivalently, of the gradients $\{\mathbf{g}_t = \nabla_\theta L_t\}$), as stated in the following theorem.

Theorem 2. Given $\mathbf{g} = \sum_t \alpha_t\mathbf{g}_t$ with $\sum_t \alpha_t = 1$ satisfying $\mathbf{g}\mathbf{u}_t^\top = C$ for all $t$: when $\{L_t\}$ are scaled by $\{k_t > 0\}$ (equivalently, $\{\mathbf{g}_t\}$ are scaled by $\{k_t\}$), if $\mathbf{g}' = \sum_t \alpha_t'(k_t\mathbf{g}_t)$ with $\sum_t \alpha_t' = 1$ satisfies $\mathbf{g}'\mathbf{u}_t^\top = C'$ for all $t$, then $\mathbf{g}' = \lambda\mathbf{g}$. Here $\mathbf{u}_t = \frac{\mathbf{g}_t}{\|\mathbf{g}_t\|} = \frac{k_t\mathbf{g}_t}{\|k_t\mathbf{g}_t\|}$, and $\lambda$, $C$ and $C'$ are constants.

Proof. As we have $\mathbf{g} = \sum_t \alpha_t\mathbf{g}_t = \sum_t \frac{\alpha_t}{k_t}(k_t\mathbf{g}_t)$ and $\mathbf{g}\mathbf{u}_t^\top = C$, by constructing
$$\alpha_t' = \frac{\alpha_t/k_t}{\sum_\tau \alpha_\tau/k_\tau} \quad\text{and}\quad \mathbf{g}' = \sum_t \alpha_t'(k_t\mathbf{g}_t) = \frac{\mathbf{g}}{\sum_\tau \alpha_\tau/k_\tau} = \lambda\mathbf{g}, \tag{14}$$
we have $\sum_t \alpha_t' = 1$ and $\mathbf{g}'\mathbf{u}_t^\top = C/\sum_\tau \alpha_\tau/k_\tau = C'$. From Eq. (12) we know that $\{\alpha_t'\}$ has a unique solution, and thus the $\mathbf{g}'$ satisfying IMTL-G is unique, so it must be the one given by Eq. (14); hence $\mathbf{g}'$ and $\mathbf{g}$ are linearly correlated.
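For concreteness, the closed-form solution in Eq. (12) and the scale-invariance property of Theorem 2 can be checked numerically. The following is an illustrative NumPy sketch (not the authors' released code); it builds the weights for random toy gradients, verifies that the aggregated gradient has equal projections onto all task directions, and verifies that rescaling the per-task gradients only rescales the aggregate.

```python
import numpy as np

def imtl_g_weights(grads):
    """Closed-form IMTL-G scaling factors (Eq. 12).

    grads: (T, d) array whose rows are the per-task gradients g_t.
    Returns alpha: (T,) with sum(alpha) == 1 such that g = alpha @ grads
    has equal projections onto every unit direction u_t = g_t / ||g_t||.
    """
    g = np.asarray(grads, dtype=float)
    u = g / np.linalg.norm(g, axis=1, keepdims=True)
    U = u[0] - u[1:]                            # (T-1, d) rows: u_1 - u_t
    D = g[0] - g[1:]                            # (T-1, d) rows: g_1 - g_t
    a = (g[0] @ U.T) @ np.linalg.inv(D @ U.T)   # alpha_2, ..., alpha_T
    return np.concatenate(([1.0 - a.sum()], a))

rng = np.random.default_rng(0)
G = rng.normal(size=(3, 8))                     # T = 3 toy task gradients
alpha = imtl_g_weights(G)
g = alpha @ G
proj = g @ (G / np.linalg.norm(G, axis=1, keepdims=True)).T  # projections onto u_t

# Theorem 2: scaling the losses (gradients) by k_t only rescales g.
k = np.array([10.0, 0.5, 3.0])
G_scaled = k[:, None] * G
g_scaled = imtl_g_weights(G_scaled) @ G_scaled
cos = g_scaled @ g / (np.linalg.norm(g_scaled) * np.linalg.norm(g))
```

Here `proj` has identical entries (the defining property of IMTL-G), and `cos` is ±1, i.e., `g_scaled` is collinear with `g` as Theorem 2 predicts.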

B.2 LOSS BALANCE: IMTL-L

With an ordinary differential equation, we can derive that the antiderivative $\int h(s)\,\mathrm{d}s$ of the scale function in our IMTL-L must be an exponential function. We require
$$\int h(s)\,\mathrm{d}s = C\,h(s), \quad C > 0.$$
If we set $y = \int h(s)\,\mathrm{d}s$, then
$$y = C\frac{\mathrm{d}y}{\mathrm{d}s} \;\Rightarrow\; \frac{\mathrm{d}y}{y} = \frac{1}{C}\,\mathrm{d}s.$$
By taking the antiderivative:
$$\int \frac{\mathrm{d}y}{y} = \int \frac{1}{C}\,\mathrm{d}s \;\Rightarrow\; \ln y = \frac{1}{C}s + C'.$$
Then we have
$$\int h(s)\,\mathrm{d}s = y = e^{C'} e^{\frac{1}{C}s} = b\,a^s, \quad a > 1,\; b > 0.$$

C DETAILED DISCUSSION

C.1 CONDITIONAL EQUIVALENCE OF IMTL-G AND GRADNORM

First we introduce the following lemma.

Lemma 3. If $\mathbf{u}_t\mathbf{u}_\tau^\top = C_1$, $\forall t \ne \tau$, then the solution $\{\alpha_t\}$ of IMTL-G satisfies $\alpha_t > 0$ for all $t$.

Proof. As $\mathbf{u}_t = \mathbf{g}_t/\|\mathbf{g}_t\|$, construct $\mathbf{g} = \sum_t \alpha_t\mathbf{g}_t$ where
$$\alpha_t = \frac{\|\mathbf{g}_t\|^{-1}}{\sum_\tau \|\mathbf{g}_\tau\|^{-1}}; \tag{20}$$
then $\sum_t \alpha_t = 1$ and
$$\mathbf{g}\mathbf{u}_t^\top = \frac{\sum_\tau \mathbf{u}_\tau\mathbf{u}_t^\top}{\sum_\tau \|\mathbf{g}_\tau\|^{-1}} = \frac{(T-1)C_1 + 1}{\sum_\tau \|\mathbf{g}_\tau\|^{-1}} = C_2.$$
From Eq. (12) we know the solution $\{\alpha_t\}$ of IMTL-G is unique, so it must be the one given by Eq. (20), where $\alpha_t > 0$; the lemma is proved.

Then we prove Theorem 1, which states that IMTL-G leads to the same solution as GradNorm when the angle between any pair of gradients $\{\mathbf{g}_t\}$ is identical: $\mathbf{u}_t\mathbf{u}_\tau^\top = C_1$, $\forall t \ne \tau$.

Proof. (⇒ Necessity) Given constant projections in IMTL-G, we have
$$\mathbf{g}\mathbf{u}_t^\top = \sum_\tau \alpha_\tau\mathbf{g}_\tau\mathbf{u}_t^\top = C_2.$$
Recall that $\mathbf{u}_t = \mathbf{g}_t/\|\mathbf{g}_t\|$ and $\mathbf{u}_t\mathbf{u}_\tau^\top = C_1$, $\forall t \ne \tau$. From Lemma 3 we know that the $\{\alpha_t\}$ given by IMTL-G must satisfy $\alpha_t > 0$. If we let $n_t = \alpha_t\|\mathbf{g}_t\|$, then $\alpha_t\mathbf{g}_t = n_t\mathbf{u}_t$ and
$$\sum_\tau n_\tau\mathbf{u}_\tau\mathbf{u}_t^\top = \sum_{\tau \ne t} n_\tau C_1 + n_t = C_2.$$
Now we obtain
$$\sum_{\tau \ne t} n_\tau C_1 + n_t = \sum_\tau n_\tau C_1 + (1 - C_1)n_t = C_2. \tag{24}$$
As $C_1 < 1$, it follows that $n_t = C_3$, $\forall t$. This means the norm of every scaled gradient is constant, which is exactly what GradNorm (Chen et al., 2018) requires. Moreover, from Eq. (24) we obtain the relationship among the constants:
$$C_1 T C_3 + (1 - C_1)C_3 = C_2 \;\Rightarrow\; C_3 = \frac{C_2}{(T-1)C_1 + 1}.$$
(⇐ Sufficiency) In GradNorm, $\{\alpha_t\}$ are always chosen to satisfy $\alpha_t > 0$, so if we let $n_t = \alpha_t\|\mathbf{g}_t\|$, then given the constant norm of the scaled gradients in GradNorm, we have
$$\alpha_t\mathbf{g}_t = n_t\mathbf{u}_t = C_3\mathbf{u}_t, \tag{26}$$
where $\mathbf{u}_t = \mathbf{g}_t/\|\mathbf{g}_t\|$. As $\mathbf{g} = \sum_t \alpha_t\mathbf{g}_t$ and $\mathbf{u}_t\mathbf{u}_\tau^\top = C_1$, $\forall t \ne \tau$, we obtain
$$\mathbf{g}\mathbf{u}_t^\top = \sum_\tau \alpha_\tau\mathbf{g}_\tau\mathbf{u}_t^\top = \sum_\tau C_3\mathbf{u}_\tau\mathbf{u}_t^\top = C_3\big[(T-1)C_1 + 1\big] = C_2. \tag{27}$$
That is, the projections of $\mathbf{g}$ onto $\{\mathbf{g}_t\}$ are constant, as required by our IMTL-G.

Corollary 4.
In GradNorm, if the solution $\{\alpha_t\}$ satisfies $\sum_t \alpha_t = 1$, then its constants are given by $C_3 = 1/\sum_t \|\mathbf{g}_t\|^{-1}$ and $C_2 = [(T-1)C_1 + 1]/\sum_t \|\mathbf{g}_t\|^{-1}$, and its scaling factors are given by $\alpha_t = \|\mathbf{g}_t\|^{-1}/\sum_\tau \|\mathbf{g}_\tau\|^{-1}$.

Proof. Using $\alpha_t = C_3/\|\mathbf{g}_t\|$ from Eq. (26), we have $\sum_t C_3/\|\mathbf{g}_t\| = 1$, so $C_3 = 1/\sum_t \|\mathbf{g}_t\|^{-1}$ and $\alpha_t = \|\mathbf{g}_t\|^{-1}/\sum_\tau \|\mathbf{g}_\tau\|^{-1}$. Since the relationship between $C_2$ and $C_3$ from Eq. (27) is $C_3[(T-1)C_1 + 1] = C_2$, we get $C_2 = [(T-1)C_1 + 1]/\sum_t \|\mathbf{g}_t\|^{-1}$.

C.2 CLOSED-FORM SOLUTION OF MGDA

In our relaxed MGDA (Sener & Koltun, 2018) without the constraints $\{\alpha_t \ge 0\}$, finding $\mathbf{g} = \sum_t \alpha_t\mathbf{g}_t$ with $\sum_t \alpha_t = 1$ such that $\mathbf{g}$ has minimum norm is equivalent to finding the normal vector of the hyper-plane spanned by $\{\mathbf{g}_t\}$. So we let $\mathbf{g}$ be perpendicular to all of $\{\mathbf{g}_1 - \mathbf{g}_t\}$ on the hyper-plane:
$$\mathbf{g} \perp (\mathbf{g}_1 - \mathbf{g}_t) \;\Leftrightarrow\; \mathbf{g}(\mathbf{g}_1 - \mathbf{g}_t)^\top = 0, \quad \forall\, 2 \le t \le T. \tag{28}$$
If we set $\boldsymbol{\alpha} = [\alpha_2, \cdots, \alpha_T]$ and $\mathbf{G} = [\mathbf{g}_2^\top, \cdots, \mathbf{g}_T^\top]^\top$, then $\alpha_1 = 1 - \boldsymbol{\alpha}\mathbf{1}^\top$, and Eq. (28) can be expanded as
$$\sum_t \alpha_t\mathbf{g}_t\,[\mathbf{g}_1 - \mathbf{g}_2, \cdots, \mathbf{g}_1 - \mathbf{g}_T]^\top = \mathbf{0} \;\Leftrightarrow\; \big[1 - \boldsymbol{\alpha}\mathbf{1}^\top,\; \boldsymbol{\alpha}\big]\begin{bmatrix}\mathbf{g}_1\\ \mathbf{G}\end{bmatrix}\mathbf{D}^\top = \mathbf{0}, \tag{29}$$
where $\mathbf{D} = [\mathbf{g}_1 - \mathbf{g}_2, \cdots, \mathbf{g}_1 - \mathbf{g}_T]$, and $\mathbf{1}$ and $\mathbf{0}$ indicate the all-one and all-zero row vectors. Eq. (29) can be rewritten as
$$\big((1 - \boldsymbol{\alpha}\mathbf{1}^\top)\mathbf{g}_1 + \boldsymbol{\alpha}\mathbf{G}\big)\mathbf{D}^\top = \mathbf{0} \;\Leftrightarrow\; \boldsymbol{\alpha}\big(\mathbf{1}^\top\mathbf{g}_1 - \mathbf{G}\big)\mathbf{D}^\top = \mathbf{g}_1\mathbf{D}^\top.$$
As we also have $\mathbf{D} = \mathbf{1}^\top\mathbf{g}_1 - \mathbf{G}$, the closed-form solution of $\boldsymbol{\alpha}$ is given by
$$\boldsymbol{\alpha}\mathbf{D}\mathbf{D}^\top = \mathbf{g}_1\mathbf{D}^\top \;\Leftrightarrow\; \boldsymbol{\alpha} = \mathbf{g}_1\mathbf{D}^\top\big(\mathbf{D}\mathbf{D}^\top\big)^{-1}. \tag{30}$$

Bias of MGDA. In the main text we state that MGDA focuses on tasks with small gradient magnitudes, where we relaxed MGDA by dropping the constraints $\{\alpha_t \ge 0\}$. However, even with these constraints, the problem persists. For example, with two tasks, assume $\|\mathbf{g}_1\| < \|\mathbf{g}_2\|$. If the minimum-norm point $\mathbf{g}$ satisfying $\mathbf{g} = \alpha\mathbf{g}_1 + (1 - \alpha)\mathbf{g}_2$ lies outside the convex hull of $\{\mathbf{g}_1, \mathbf{g}_2\}$, or equivalently $\alpha > 1$, MGDA clamps $\alpha$ to $\alpha = 1$ and the optimal $\mathbf{g} = \mathbf{g}_1$. Then the projections of $\mathbf{g}$ onto $\mathbf{g}_1$ and $\mathbf{g}_2$ are $\|\mathbf{g}_1\|$ and $\mathbf{g}_1\mathbf{u}_2^\top$ ($\mathbf{u}_2 = \mathbf{g}_2/\|\mathbf{g}_2\|$), respectively. As $\|\mathbf{g}_1\| > \mathbf{g}_1\mathbf{u}_2^\top$, MGDA still focuses on the task with the smaller gradient magnitude.
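The relaxed-MGDA closed form $\boldsymbol{\alpha} = \mathbf{g}_1\mathbf{D}^\top(\mathbf{D}\mathbf{D}^\top)^{-1}$ and the equal-angle construction of Lemma 3 can both be checked numerically. Below is an illustrative NumPy sketch (not part of the released code base); the equal-angle case uses mutually orthogonal toy gradients, so $C_1 = 0$.

```python
import numpy as np

def relaxed_mgda_weights(grads):
    """Relaxed MGDA: minimum-norm affine combination sum_t alpha_t g_t with
    sum(alpha) == 1, ignoring the alpha_t >= 0 constraints."""
    g = np.asarray(grads, dtype=float)
    D = g[0] - g[1:]                              # rows: g_1 - g_t
    a = (g[0] @ D.T) @ np.linalg.inv(D @ D.T)
    return np.concatenate(([1.0 - a.sum()], a))

rng = np.random.default_rng(1)
G = rng.normal(size=(3, 6))
g_min = relaxed_mgda_weights(G) @ G
# g_min is orthogonal to every difference g_1 - g_t (normal to the affine hull)
ortho = g_min @ (G[0] - G[1:]).T

# Lemma 3 / Theorem 1 check: with equal pairwise angles (orthogonal gradients),
# alpha_t proportional to 1/||g_t|| gives both constant projections (IMTL-G)
# and constant scaled-gradient norms (GradNorm).
norms = np.array([1.0, 5.0, 0.2])
Go = np.diag(norms)                               # orthogonal task gradients
alpha = (1.0 / norms) / (1.0 / norms).sum()
g_eq = alpha @ Go
proj = g_eq @ (Go / norms[:, None]).T             # projections onto each u_t
scaled_norms = alpha * norms                      # ||alpha_t g_t||
```

`ortho` is numerically zero, `proj` has identical entries, and `scaled_norms` is constant, matching Eq. (30), Lemma 3, and Corollary 4.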

C.3 ANALYSIS OF PCGRAD

PCGrad (Yu et al., 2020) mitigates gradient conflicts by projecting the gradient of one task onto the orthogonal direction of the others, and the aggregated gradient can be written as
$$\mathbf{g} = \sum_t \Big(\mathbf{g}_t + \sum_\tau C_{t\tau}\mathbf{u}_\tau\Big), \tag{31}$$
with $\mathbf{u}_t = \mathbf{g}_t/\|\mathbf{g}_t\|$ and the coefficients
$$C_{tt} = 0, \quad C_{t\tau} = -\Big[\Big(\mathbf{g}_t + \sum_{\tau' < \tau} C_{t\tau'}\mathbf{u}_{\tau'}\Big)\mathbf{u}_\tau^\top\Big]_+, \quad \forall t, \tau, \tag{32}$$
where $[\cdot]_+$ denotes the ReLU operator. Note that the tasks are shuffled before calculating the aggregated gradient $\mathbf{g}$ to achieve the expected symmetry with respect to the task order. Eq. (31) can be represented more compactly in matrix form:
$$\mathbf{g} = \mathbf{1}\,(\mathbf{I}_T + \mathbf{C}\mathbf{N})\,\mathbf{G} \equiv \boldsymbol{\alpha}\mathbf{G}, \tag{33}$$
where $\mathbf{I}_T$ is the identity matrix, $\mathbf{C} = \{C_{t\tau}\}$ is the coefficient matrix whose entries are given in Eq. (32), and $\mathbf{N} = \mathrm{diag}(1/\|\mathbf{g}_1\|, \cdots, 1/\|\mathbf{g}_T\|)$ is the diagonal normalization matrix. In Eq. (33) we use $\mathbf{G}$ and $\boldsymbol{\alpha}$ to denote the raw gradients and scaling factors of all tasks. We find that PCGrad can also be regarded as loss weighting, with the loss weights given by $\boldsymbol{\alpha} = \mathbf{1}(\mathbf{I}_T + \mathbf{C}\mathbf{N})$. However, it may still break the balance among tasks. For example, with two tasks, let the angle between the gradients be $\varphi$: 1) when $\pi/2 \le \varphi < \pi$,
$$\mathbf{C} = \begin{bmatrix} 0 & -\mathbf{g}_1\mathbf{g}_2^\top/\|\mathbf{g}_2\| \\ -\mathbf{g}_1\mathbf{g}_2^\top/\|\mathbf{g}_1\| & 0 \end{bmatrix}$$
and the projections onto the two raw gradients are $\|\mathbf{g}_1\|\sin^2\varphi$ and $\|\mathbf{g}_2\|\sin^2\varphi$; 2) when $0 < \varphi < \pi/2$, $\mathbf{C} = \mathbf{0}$ and the projections are $\|\mathbf{g}_1\| + \|\mathbf{g}_2\|\cos\varphi$ and $\|\mathbf{g}_2\| + \|\mathbf{g}_1\|\cos\varphi$. In both cases, the projections are equal if and only if $\|\mathbf{g}_1\| = \|\mathbf{g}_2\|$. Otherwise, the task with the larger gradient magnitude will be trained more sufficiently, which encounters the same problem as uniform scaling that naïvely adds all the losses even though the loss scales are highly different.
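As a concrete check of the two-task analysis above, the following NumPy sketch (illustrative only, not the authors' code) implements PCGrad for two tasks and verifies that, in the conflicting case $\pi/2 \le \varphi < \pi$, the projections onto the raw gradients are $\|\mathbf{g}_1\|\sin^2\varphi$ and $\|\mathbf{g}_2\|\sin^2\varphi$.

```python
import numpy as np

def pcgrad_two_tasks(g1, g2):
    # PCGrad for two tasks: each gradient drops its component along the
    # other whenever the two conflict (negative inner product); with only
    # two tasks the result does not depend on the task order.
    def project(g, other):
        dot = g @ other
        if dot < 0:
            g = g - dot / (other @ other) * other
        return g
    return project(g1, g2) + project(g2, g1)

phi = 2.4                                          # obtuse angle: conflicting gradients
g1 = np.array([1.0, 0.0]) * 3.0                    # ||g1|| = 3
g2 = np.array([np.cos(phi), np.sin(phi)]) * 5.0    # ||g2|| = 5
g = pcgrad_two_tasks(g1, g2)
p1 = g @ g1 / np.linalg.norm(g1)                   # projection of g onto u_1
p2 = g @ g2 / np.linalg.norm(g2)                   # projection of g onto u_2
```

Both projections scale with the corresponding gradient norms ($p_1/\|\mathbf{g}_1\| = p_2/\|\mathbf{g}_2\| = \sin^2\varphi$), so they are equal only when $\|\mathbf{g}_1\| = \|\mathbf{g}_2\|$, matching the imbalance discussed above.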
C.4 L2 LOSS IN UNCERTAINTY WEIGHTING

For regression, uncertainty weighting (Kendall et al., 2018) regards the L2 loss as the likelihood of the sample target under a Gaussian distribution:
$$-\log p\big(y \mid f(x)\big) = \frac{1}{2}\Big(\frac{1}{\sigma^2}\big\|y - f(x)\big\|_2^2 + \log\sigma^2\Big), \tag{34}$$
where $x$ is the data sample, $y$ is the ground-truth label, $f$ denotes the prediction model and $\sigma$ is the standard deviation of the Gaussian distribution. By setting $s = -\log\sigma^2$, the scaled L2 loss is $L = \frac{1}{2}(e^s L_{\mathrm{reg}} - s)$, which has a similar form as the scaled L1 loss except for the front factor $1/2$. So uncertainty weighting has difficulty reaching a unified form for all kinds of losses, which makes it less general than our IMTL-L.

C.5 GRADIENT OF GEOMETRIC MEAN

GLS (Chennupati et al., 2019) computes the loss as the geometric mean; its gradient with respect to the model parameters is
$$\nabla_\theta L = \frac{1}{T}\Big(\prod_t L_t\Big)^{\frac{1}{T}-1} \sum_t \Big(\prod_{\tau \ne t} L_\tau\Big)\nabla_\theta L_t \tag{35}$$
$$= \frac{1}{T}\Big(\prod_t L_t\Big)^{\frac{1}{T}} \sum_t \frac{\nabla_\theta L_t}{L_t} = \frac{L}{T}\sum_t \frac{1}{L_t}\nabla_\theta L_t, \tag{36}$$
where $L$ is the geometric-mean loss and $T$ is the task number. It is equivalent to weighing each task-specific loss by its reciprocal, except that there is an extra factor $L/T$ in front, where $L = (\prod_t L_t)^{\frac{1}{T}}$; therefore GLS is sensitive to scale changes of $\{L_t\}$ and is not scale-invariant.
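Both loss-balance observations above admit quick numerical checks. The NumPy sketch below (illustrative only) first shows that the exponential scaled loss $e^s L - s$ pins the weighted loss $e^s L$ to 1 at the optimal $s$, whatever the raw loss scale (the uncertainty-weighted L2 loss above is half of this form), and then verifies the geometric-mean gradient identity of Eq. (36) against finite differences on hypothetical quadratic per-task losses.

```python
import numpy as np

# 1) Exponential scaled loss e^s L - s: d/ds (e^s L - s) = e^s L - 1 = 0
#    gives s* = -ln L, so the weighted loss e^{s*} L equals 1 for any scale.
scaled_at_opt = [np.exp(-np.log(L)) * L for L in (0.01, 1.0, 250.0)]

# 2) GLS gradient identity: grad of L = (prod_t L_t)^(1/T) should equal
#    (L/T) * sum_t (1/L_t) grad L_t, checked with central finite differences
#    for toy losses L_t(theta) = ||theta - c_t||^2 + 1.
T, d = 3, 4
rng = np.random.default_rng(2)
C = rng.normal(size=(T, d))
theta = rng.normal(size=d)

task_losses = lambda th: np.array([np.sum((th - c) ** 2) + 1.0 for c in C])
gls = lambda th: np.prod(task_losses(th)) ** (1.0 / T)

L_t = task_losses(theta)
grads = np.array([2.0 * (theta - c) for c in C])           # analytic grad of each L_t
analytic = gls(theta) / T * (grads / L_t[:, None]).sum(axis=0)

eps = 1e-6
numeric = np.array([
    (gls(theta + eps * np.eye(d)[i]) - gls(theta - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
```

`scaled_at_opt` is `[1.0, 1.0, 1.0]` and `analytic` agrees with `numeric`, confirming both the unified exponential form and Eq. (36).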

D IMPLEMENTATION DETAILS

To compare the loss weighting methods alone, we fix the network structure and choose ResNet-50 (He et al., 2016) with dilation (Chen et al., 2017) and synchronized (Peng et al., 2018) batch normalization (Ioffe & Szegedy, 2015) as the shared backbone and PSPNet (Zhao et al., 2017) as the task-specific head; the backbone weights are pretrained on ImageNet (Deng et al., 2009). Following the common practice of semantic segmentation, in training we adopt augmentations such as random resizing (between 0.5 and 2), random rotation (between −10 and 10 degrees), Gaussian blur (with a radius of 5) and random horizontal flipping. Besides, we apply strided cropping and horizontal flipping as testing augmentations. The predicted results in the overlapped regions of different crops are averaged to obtain the aggregated prediction of the whole image. Only pixels with ground-truth labels are included in the loss and metric computation, while the others are ignored. Semantic segmentation, instance segmentation, surface normal estimation and disparity/depth estimation are considered. As for the losses/metrics, semantic segmentation uses cross-entropy/mIoU, surface normal estimation adopts (1 − cos)/cosine similarity, and both instance segmentation and disparity/depth estimation use the L1 loss. We use a polynomial learning rate schedule with a power of 0.9 and SGD with a momentum of 0.9 and a weight decay of $10^{-4}$ as the optimizer, with the model trained for 200 epochs. After passing through the shared backbone, where strided convolutions exist, the feature maps have 1/8 the spatial size of the input. Instance segmentation is taken as offset regression, where each pixel $p_i = (x_i, y_i)$ approximates the relative offset $o_i = (\mathrm{d}x_i, \mathrm{d}y_i)$ with respect to the centroid $c_{\mathrm{id}(p_i)}$ of its belonging instance $\mathrm{id}(p_i)$. For inference, we abandon the time-consuming and complicated clustering methods adopted by the previous method (Kendall et al., 2018).
Instead, we directly use the offset vectors $\{o_i\}$ predicted by the model to find the centroids of instances. By definition, the offset vector of a centroid should have norm 0, so we can transform the offset vector norm $\|o_i\|$ into the probability $q_i$ of being a centroid with the exponential function $q_i = e^{-\|o_i\|}$. Next, a 7 × 7 edge filter is applied on the centroid probability map to filter out the spurious centroids on object edges resulting from the ambiguity of the regression target. The locations with centroid probability $q_i < 0.1$ are also manually suppressed. Then 7 × 7 max-pooling on the filtered probability map is used to produce candidate centroids and filter out duplicates. With the predicted centroids $\{c_j\}$, we can then assign each pixel $p_i$ to its belonging instance $\mathrm{id}(p_i)$ by the distance between its approximated centroid $p_i + o_i$ and the candidate centroids $\{c_j\}$: $\mathrm{id}(p_i) = \arg\min_j \|p_i + o_i - c_j\|$. Depth is measured in pixels by the disparity between the left and right images. Fig. 5 shows the whole process. Note that we need to carefully handle label transformation during data augmentation. For example, the disparity ground truth needs to be up-scaled by $s$ times if the image is up-sampled by $s$ times. Also, the predicted offset vectors of a flipped input should be mirrored to match those of the normal one. On the NYUv2 dataset, the batch size is 48 (6 × 8 GPUs) with an initial learning rate of 0.03. We use the 795 training images for training and the 654 validation images for testing with the 480 × 640 full resolution; 401 × 401 crops are used for training and testing. 13 coarse-grained classes are considered in semantic segmentation. The surface normal is represented by the unit normal vector of the corresponding surface. When doing data augmentation, the surface normal ground truth $n = (x, y, z)$ should be processed accordingly.
If we resize the image by $s$ times, the $z$ coordinate of the normal vector should be scaled by $s$ and the vector renormalized: $n' = (x, y, sz)/\|(x, y, sz)\|$. If the image is rotated by the rotation matrix $R$, the normal vector should also be in-plane rotated, $(x', y') = (x, y)R$, with $z$ unchanged. Moreover, a left-right flip should be applied to the normal vector, $n' = (-x, y, z)$, when mirroring the image horizontally. During testing, the normal vectors in the overlapped regions of crops are averaged and renormalized to produce the aggregated results. Depth is the absolute distance to the camera measured in meters, which is inversely proportional to the disparity measurement adopted by Cityscapes. So the depth in meters needs to be scaled by $1/s$ when the image is scaled by $s$ times, which is the reciprocal of the disparity transformation. CelebA contains 202,599 face images of 10,177 identities, where each image has 40 binary attribute annotations. We train on the 162,770 training images and test on the 19,867 validation images. Most of the implementation details are the same as those on the Cityscapes dataset, except that: 1) we employ ResNet-18 as the backbone and linear classifiers as the task-specific heads, so in total 40 heads are attached to the backbone; 2) the binary cross-entropy is used as the classification loss for each attribute; 3) the batch size is 256 (32 × 8 GPUs) and the model is trained from scratch for 100 epochs; 4) the input images have been aligned with the 5 annotated landmarks and cropped to 218 × 178.
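For concreteness, the clustering-free instance decoding and the geometry-aware label transforms described above can be sketched as follows. This is a toy NumPy illustration with exact offsets on an 8 × 8 grid and two hypothetical centroids; because the offsets are exact, the 7 × 7 edge filter and max-pool NMS collapse to a simple probability threshold here.

```python
import numpy as np

# --- Clustering-free instance decoding (toy sketch) ---
H, W = 8, 8
centroids = np.array([[2.0, 2.0], [6.0, 5.0]])             # hypothetical centroids (x, y)
ys, xs = np.mgrid[0:H, 0:W]
pix = np.stack([xs, ys], axis=-1).astype(float)             # (H, W, 2) pixel coordinates
gt_id = np.argmin(((pix[:, :, None, :] - centroids) ** 2).sum(-1), axis=-1)
offsets = centroids[gt_id] - pix                            # exact offsets o_i

q = np.exp(-np.linalg.norm(offsets, axis=-1))               # q_i = e^{-||o_i||}
cand = pix[q > 0.99]                                        # candidates: q == 1 only at true centroids
# assign each pixel via id(p_i) = argmin_j ||p_i + o_i - c_j||
ids = np.argmin(np.linalg.norm((pix + offsets)[:, :, None, :] - cand, axis=-1), axis=-1)

# --- Geometry-aware label transforms ---
def resize_normal(n, s):
    # image up-sampled by s: scale z, then renormalize to unit length
    v = np.array([n[0], n[1], s * n[2]])
    return v / np.linalg.norm(v)

def hflip_normal(n):
    # horizontal mirror negates the x component
    return np.array([-n[0], n[1], n[2]])

def resize_disparity(d, s):
    # disparity is measured in pixels, so it scales with the image
    return s * d

def resize_depth(z, s):
    # metric depth is inversely proportional to disparity: scale by 1/s
    return z / s
```

With exact offsets, `cand` recovers exactly the two centroids and `ids` reproduces the instance assignment `gt_id`; the transform helpers mirror the disparity/depth/normal rules stated above.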



Our analysis of PCGrad (Yu et al., 2020) can be found in Appendix C.3.

https://github.com/open-mmlab/mmsegmentation/tree/master/configs/pspnet



Figure 1: Comparison of gradient balance methods. In (a) to (d), g1, g2 and g3 represent the gradient computed by the raw loss of each task, respectively.

Figure 2: Overview of IMTL.

Figure 3 annotations: Uncertainty ($\alpha_{\mathrm{cls}} L_{\mathrm{cls}} \approx 1/2$, $\alpha_{\mathrm{reg}} L_{\mathrm{reg}} \approx 1$); GLS ($\alpha_t L_t = L/T$); GradNorm ($p_t \propto \mathbf{u}_t\mathbf{u}_s^\top$); PCGrad ($p_t \propto \|\mathbf{g}_t\|$); MGDA ($p_t \propto \|\mathbf{g}_t\|^{-1}$); IMTL-L ($\alpha_t L_t \approx \mathrm{const}$).

Figure 3: Relationship between our IMTL and previous methods. The blue dashed arrows indicate the characteristic of each method. For the loss balance methods, we annotate the scaled loss in brackets: $L_{\mathrm{cls}}$, $L_{\mathrm{reg}}$ and $L_t$ are the raw losses of classification, regression and an individual task, respectively, and $\alpha_{\mathrm{cls}}$, $\alpha_{\mathrm{reg}}$ and $\alpha_t$ are the corresponding loss scales; $L$ is the geometric-mean loss and $T$ is the task number. For the gradient balance methods, we annotate in brackets the projection of the aggregated gradient $\mathbf{g} = \sum_t \alpha_t\mathbf{g}_t$ onto the raw gradient $\mathbf{g}_t$ of the $t$-th task; $\mathbf{u}_t = \mathbf{g}_t/\|\mathbf{g}_t\|$ is the unit-norm direction, $p_t = \mathbf{g}\mathbf{u}_t^\top$ is the projection of $\mathbf{g}$ onto $\mathbf{g}_t$, and $\mathbf{u}_s = \sum_t \mathbf{u}_t$ is the mean direction.


Figure 4: Loss scales of IMTL-G for different tasks when training on the Cityscapes dataset.

Figure 5: Pipeline used in the Cityscapes visual understanding experiment. The centroids are computed from the offset regression results. Each pixel is assigned to its nearest candidate centroid.

Figure 7: Qualitative results of our IMTL on Cityscapes. Semantic segmentation, instance segmentation and disparity estimation predictions are produced by a single network. The task-shared backbone is ResNet-50 and the task-specific heads are PSPNet. The image resolution is 1024×2048.

Figure 8: Qualitative results of our IMTL on NYUv2. Semantic segmentation, surface normal estimation and depth estimation predictions are produced by a single network. The task-shared backbone is ResNet-50 and the task-specific heads are PSPNet. The image resolution is 480 × 640.


Table 2: Experimental results on the NYUv2 and CelebA datasets; semantic segmentation, surface normal estimation, depth estimation and multi-label classification are considered. Arrows indicate whether values are the higher the better (↑) or the lower the better (↓). The best and runner-up results in each column are bold and underlined, respectively.

ACKNOWLEDGEMENTS

This work was supported by the Natural Science Foundation of Guangdong Province (No. 2020A1515010711), the Special Foundation for the Development of Strategic Emerging Industries of Shenzhen (No. JCYJ20200109143010272), and the Innovation and Technology Commission of the Hong Kong Special Administrative Region, China (Enterprise Support Scheme under the Innovation and Technology Fund B/E030/18).

