ROTOGRAD: DYNAMIC GRADIENT HOMOGENIZATION FOR MULTITASK LEARNING

Abstract

GradNorm (Chen et al., 2018) is a broadly used gradient-based approach for training multitask networks, where different tasks share, and thus compete for, the network parameters during learning. GradNorm eases the fitting of all individual tasks by dynamically equalizing the contribution of each task to the overall gradient magnitude. However, it does not prevent the individual tasks' gradients from conflicting, i.e., pointing in opposite directions, thus resulting in poor multitask performance. In this work, we propose Rotograd, an extension to GradNorm that addresses this problem by dynamically homogenizing not only the gradient magnitudes but also their directions across tasks. For this purpose, Rotograd adds a layer of task-specific rotation matrices that aligns all the task gradients. Importantly, we then analyze Rotograd (and its predecessor) through the lens of game theory, providing theoretical guarantees on the algorithm's stability and convergence. Finally, our experiments on several real-world datasets and network architectures show that Rotograd outperforms previous approaches for multitask learning.

1. INTRODUCTION

While single-task learning is broadly used and keeps achieving state-of-the-art results in different domains, sometimes beating human performance, there are countless scenarios where we would like to solve several related tasks, we encounter overfitting problems, or the available data is scarce. Multitask learning is a promising field of machine learning aiming to solve, or at least alleviate, the aforementioned problems that single-task networks suffer from (Caruana, 1993). Many multitask architectures have emerged throughout the past years (Kokkinos, 2017; Maninis et al., 2019; Dong et al., 2015; He et al., 2017), yet most of them fall under the umbrella of hard parameter sharing (Caruana, 1993). This architecture is characterized by two components: i) a shared backbone which acts as an encoder and produces an intermediate representation shared across tasks based on the input; and ii) a set of task-specific modules that act as decoders and, using this intermediate representation as input, produce a specialized output for each of the tasks. These networks have proven to be powerful, efficient, and, on many occasions, capable of improving the results of their single-task counterparts. However, they can be difficult to train. Competition among tasks for the shared resources can lead to poor results where the resources are dominated by a subset of tasks. Internally, this can be traced to a sum over task gradients of the form ∇L = Σ_k ∇L_k with respect to the backbone parameters. In this setting, two undesirable scenarios may occur, as depicted in Figure 1a: i) a subset of tasks dominates the overall gradient evaluation due to magnitude differences across task gradients; and ii) individual per-task gradients point in different directions in the parameter space, cancelling each other out.
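The second failure mode can be made concrete with a toy numerical example (the gradient values below are made up purely for illustration): two per-task gradients of similar magnitude but opposite direction largely cancel in the sum ∇L = Σ_k ∇L_k, so the shared backbone barely moves even though both tasks demand large updates.

```python
import numpy as np

# Hypothetical per-task gradients w.r.t. the shared backbone parameters
# (values chosen purely for illustration, not from any real model).
g1 = np.array([1.0, 0.5])          # task 1's gradient
g2 = -0.9 * np.array([1.0, 0.5])   # task 2: opposite direction, similar magnitude

# The summed gradient the backbone actually receives.
total = g1 + g2

# Cosine similarity between the two task gradients: -1 means full conflict.
cos = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))

print(f"|g1|={np.linalg.norm(g1):.3f}  |g2|={np.linalg.norm(g2):.3f}  "
      f"|g1+g2|={np.linalg.norm(total):.3f}  cos={cos:.3f}")
```

Here the summed gradient's norm (about 0.11) is an order of magnitude smaller than either task's own gradient norm (about 1.1 and 1.0), which is precisely the cancellation scenario ii) above.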
This problem can be seen as a particular case of negative transfer, a phenomenon describing when "sharing information with unrelated tasks might actually hurt performance" (Ruder, 2017). Under the assumption that the problem comes from task unrelatedness, one solution is to use only related tasks. Several works have explored this direction (Thrun & O'Sullivan, 1996; Zamir et al., 2018); see, for example, Standley et al. (2019) for a recent in-depth study on task relatedness. Another way of tackling the problem is by weighting the tasks dynamically during training, the motivation being that each task's contribution can then be manipulated at will.
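The dynamic-weighting idea boils down to optimizing a weighted loss L = Σ_k w_k L_k, where the weights w_k change over training. A minimal sketch of that combination step follows (a generic weighted loss with our own function name, not the specific update rule of GradNorm or any other method); the weights are renormalized to sum to the number of tasks, a common convention that keeps the overall loss scale comparable to an unweighted sum.

```python
import numpy as np

def weighted_multitask_loss(losses, weights):
    """Combine per-task losses as L = sum_k w_k * L_k.

    The weights are renormalized so that sum_k w_k equals the number of
    tasks, keeping the loss scale comparable to the unweighted sum.
    This is a generic illustrative sketch, not a specific published method.
    """
    w = np.asarray(weights, dtype=float)
    w = w * len(w) / w.sum()
    return float(np.dot(w, np.asarray(losses, dtype=float))), w

# Down-weight a task that is already well fit (loss 0.1) and up-weight a
# lagging one (loss 2.0):
loss, w = weighted_multitask_loss(losses=[0.1, 2.0], weights=[0.5, 1.5])
print(loss, w)
```

A dynamic method would then update `weights` every few steps, e.g., based on each task's recent training progress, rather than keeping them fixed.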

