ROTOGRAD: DYNAMIC GRADIENT HOMOGENIZATION FOR MULTITASK LEARNING

Abstract

GradNorm (Chen et al., 2018) is a broadly used gradient-based approach for training multitask networks, where different tasks share, and thus compete for, the network parameters during learning. GradNorm eases the fitting of all individual tasks by dynamically equalizing the contribution of each task to the overall gradient magnitude. However, it does not prevent the individual tasks' gradients from conflicting, i.e., pointing in opposite directions, which results in poor multitask performance. In this work we propose Rotograd, an extension of GradNorm that addresses this problem by dynamically homogenizing not only the gradient magnitudes but also their directions across tasks. To this end, Rotograd adds a layer of task-specific rotation matrices that aligns all the task gradients. Importantly, we then analyze Rotograd (and its predecessor) through the lens of game theory, providing theoretical guarantees on the stability and convergence of the algorithm. Finally, our experiments on several real-world datasets and network architectures show that Rotograd outperforms previous approaches to multitask learning.

1. INTRODUCTION

While single-task learning is broadly used and keeps achieving state-of-the-art results in different domains, sometimes even beating human performance, there are countless scenarios where we would like to solve several related tasks, where we encounter overfitting problems, or where the available data is scarce. Multitask learning is a promising field of machine learning that aims to solve, or at least alleviate, the aforementioned problems that single-task networks suffer from (Caruana, 1993). Many multitask architectures have emerged throughout the past years (Kokkinos, 2017; Maninis et al., 2019; Dong et al., 2015; He et al., 2017), yet most of them fall under the umbrella of hard parameter sharing (Caruana, 1993). This architecture is characterized by two components: i) a shared backbone which acts as an encoder and produces, based on the input, an intermediate representation shared across tasks; and ii) a set of task-specific modules that act as decoders and, using this intermediate representation as input, produce a specialized output for each of the tasks. These networks have proven to be powerful, efficient, and in many occasions capable of improving the results of their single-task counterparts. However, they can be difficult to train. Competition among tasks for the shared resources can lead to poor results where the resources are dominated by a subset of tasks. Internally, this can be traced back to a sum over task gradients of the form ∇L = Σ_k ∇L_k with respect to the backbone parameters. In this setting two undesirable scenarios may occur, as depicted in Figure 1a: i) a subset of tasks dominates the overall gradient evaluation due to magnitude differences across task gradients; and ii) individual per-task gradients point in different directions in the parameter space, cancelling each other out.
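As a toy illustration of these two failure modes (hypothetical numbers, not taken from the paper), consider two task gradients in a two-dimensional shared parameter space, where one dominates the sum by magnitude while also pointing in a conflicting direction:

```python
import numpy as np

# Two hypothetical task gradients w.r.t. the shared (backbone) parameters.
g1 = np.array([10.0, 0.0])   # large-magnitude task gradient
g2 = np.array([-0.5, 0.5])   # small task gradient pointing elsewhere

total = g1 + g2              # the summed update applied to the backbone

# Scenario (i): magnitude imbalance, g1 dominates the sum.
dominance = np.linalg.norm(g1) / np.linalg.norm(g2)

# Scenario (ii): conflicting directions; a negative cosine similarity
# means the gradients partially cancel each other out.
cos = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))

print(total)      # close to g1: the small task is effectively ignored
print(dominance)  # magnitude ratio between the two task gradients
print(cos)        # < 0: conflicting directions
```

GradNorm addresses the magnitude imbalance of scenario (i); Rotograd additionally targets the direction conflict of scenario (ii).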
This problem can be seen as a particular case of negative transfer, a phenomenon describing when "sharing information with unrelated tasks might actually hurt performance" (Ruder, 2017). Under the assumption that the problem comes from task unrelatedness, one solution is to use only related tasks. Several works have explored this direction (Thrun & O'Sullivan, 1996; Zamir et al., 2018); see for example Standley et al. (2019) for a recent in-depth study on task relatedness. Another way of tackling the problem is by weighting the tasks dynamically during training, the motivation being that the task contributions can then be manipulated at will and a good balance between gradient magnitudes can be found. Two important works in this direction are GradNorm (Chen et al., 2018), which we will cover in detail later, and Kendall et al. (2018), which adopts a probabilistic approach to the problem. As another example, Guo et al. (2018) adapts the weights based on the tasks' "difficulty", prioritizing harder tasks. The key contributions of this paper are the following:
• The introduction of Rotograd, an extension of GradNorm that homogenizes both the magnitude and the direction of the task gradients during training. Different experiments support this statement, showing the performance gain provided by Rotograd.
• A novel interpretation of GradNorm (and Rotograd) as a gradient-play Stackelberg game, which allows us to draw theoretical guarantees regarding the stability of the training process as long as GradNorm (Rotograd) is an asymptotically slower learner than the network's optimizer.
• A new implementation of GradNorm where it is applied to the intermediate representation instead of a subset of the shared parameters, thereby removing the need to choose this subset.
Besides, we provide some guarantees regarding the norm of the gradient with respect to all the shared parameters, and empirically show that the results are comparable to those of the usual way of applying GradNorm.
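The task-specific rotation layer that Rotograd adds can be sketched in a few lines. This is a minimal illustration, not the authors' reference implementation; here the rotation is parameterized as the matrix exponential of a skew-symmetric matrix, one common way to guarantee an orthogonal matrix, so each task sees a rotated copy of the shared representation while norms are preserved:

```python
import torch
import torch.nn as nn

class TaskRotation(nn.Module):
    """Per-task rotation of the shared representation (illustrative sketch,
    not the authors' reference implementation). Parameterizing the rotation
    as exp(A - A^T) guarantees an orthogonal matrix, so rotating the
    representation changes gradient directions but not their magnitudes."""

    def __init__(self, dim: int):
        super().__init__()
        # Unconstrained parameters; A - A^T below is skew-symmetric.
        self.raw = nn.Parameter(0.1 * torch.randn(dim, dim))

    def rotation(self) -> torch.Tensor:
        skew = self.raw - self.raw.T
        return torch.matrix_exp(skew)   # orthogonal: R @ R.T = I

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return z @ self.rotation().T    # rotate each row of Z

dim = 4
rot = TaskRotation(dim)
R = rot.rotation()
z = torch.randn(8, dim)
z_rot = rot(z)

# Orthogonality preserves norms: ||R z|| == ||z|| for every sample.
print(torch.allclose(R @ R.T, torch.eye(dim), atol=1e-5))
print(torch.allclose(z_rot.norm(dim=1), z.norm(dim=1), atol=1e-5))
```

In Rotograd, one such rotation per task sits between the backbone output Z and the corresponding task head, and its parameters are adjusted during training to align the task gradients.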

2. BACKGROUND

2.1 MULTITASK LEARNING

Let us consider K different tasks that we want to learn simultaneously using a gradient-based approach. All of them share the same input dataset X ∈ R^{N×D}, where each row x_n ∈ R^D is a D-dimensional sample. Additionally, each task has its own outputs Y_k ∈ R^{N×I_k} and loss function L_k : Y_k × Y_k → R. As mentioned in Section 1 and illustrated in Figure 1c, the model is composed of a shared backbone and task-specific modules. Upon invoking the model, first a batch of inputs X ∈ R^{B×D} is passed through the backbone, h_θ : X → Z, which produces a common intermediate representation of the input, Z ∈ R^{B×d}, where B is the batch size and d the size of the shared intermediate space. Then, each task-specific module, f_{φ_k} : Z → Y_k, using the intermediate representation Z as input, predicts the desired outcome for its designated task, Ŷ_k.
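This hard parameter sharing setup can be written down directly (a minimal sketch; the sizes and layer choices are hypothetical placeholders, not taken from the paper):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, not taken from the paper.
D, d, B = 16, 8, 32          # input dim D, shared dim d, batch size B
task_dims = [1, 3]           # output sizes I_k for K = 2 tasks

# Shared backbone h_theta : X -> Z
backbone = nn.Sequential(nn.Linear(D, d), nn.ReLU())

# Task-specific modules f_phi_k : Z -> Y_k
heads = nn.ModuleList(nn.Linear(d, I_k) for I_k in task_dims)

x = torch.randn(B, D)
z = backbone(x)                    # Z in R^{B x d}, shared across tasks
preds = [f(z) for f in heads]      # one specialized prediction per task

print(tuple(z.shape))              # (32, 8)
print([tuple(p.shape) for p in preds])   # [(32, 1), (32, 3)]
```

Each task loss L_k is then computed from its own prediction, and only the gradients flowing into the backbone are shared across tasks.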




Figure 1: Examples of: (a) neglecting one vector due to magnitude difference; (b) partial cancellation of two vectors with equal magnitude; (c) a hard parameter sharing architecture.

Furthermore, a series of papers attempting to solve the conflicting-gradients problem in different settings have recently emerged. For example, Flennerhag et al. (2019) deals with a similar problem in the context of meta-learning, and Yu et al. proposes two different heuristic solutions for reinforcement learning. In the context of multitask learning, Levi & Ullman (2020) adopts an image-dependent solution that uses a top-down network, and Maninis et al. (2019) applies adversarial training to deal with conflicting gradients, making use of double backpropagation, which scales poorly.

