AUXILIARY TASK UPDATE DECOMPOSITION: THE GOOD, THE BAD AND THE NEUTRAL

Abstract

While deep learning has been very beneficial in data-rich settings, tasks with smaller training sets often resort to pre-training or multitask learning to leverage data from other tasks. In this case, careful consideration is needed to select tasks and model parameterizations such that updates from the auxiliary tasks actually help the primary task. We seek to alleviate this burden by formulating a model-agnostic framework that performs fine-grained manipulation of the auxiliary task gradients. We propose to decompose auxiliary updates into directions which help, damage or leave the primary task loss unchanged. This allows weighting the update directions differently depending on their impact on the problem of interest. We present a novel and efficient algorithm for that purpose and show its advantage in practice. Our method leverages efficient automatic differentiation procedures and randomized singular value decomposition for scalability. We show that our framework is generic and encompasses some prior work as particular cases. Our approach consistently outperforms strong and widely used baselines when leveraging out-of-distribution data for text and image classification tasks.

1. INTRODUCTION

Multitask learning (Caruana, 1997) and pretraining (Devlin et al., 2018; Caron et al., 2019) have transformed machine learning by allowing downstream tasks with small training sets to benefit from statistical regularities of data-rich related tasks (Collobert & Weston, 2008; Zhang et al., 2014; Liu et al., 2019; Kornblith et al., 2019). Despite these advances, leveraging the mixing of tasks is still an art left to the practitioner. When one is interested in a primary task, it is unclear how to select helpful auxiliary tasks, an appropriate parameter-sharing architecture, and a good way to filter out auxiliary data which might be detrimental to the primary task. Without careful choices, pretraining might hurt end-task performance (Gururangan et al., 2020) or have limited impact (Raghu et al., 2019). Prior work has examined these problems and proposed solutions, either to choose auxiliary tasks depending on their impact on the primary task (Du et al., 2018; Lin et al., 2019) or to equalize the impact of updates across tasks (Sener & Koltun, 2018; Chen et al., 2018; Hessel et al., 2019). Recently, several approaches (Sinha et al., 2018; Suteu & Guo, 2019; Yu et al., 2020) have been proposed that attempt to minimize interference between the updates across tasks. Our work builds on this direction but, unlike these previous approaches, we do not take a symmetric view of multitask learning: our goal is not to train a model performing well on all tasks. Instead, we focus on improving generalization for a single task, the primary task; the other tasks, the auxiliary tasks, are considered only through their impact on the problem of interest. For that purpose, we introduce a framework which decomposes the gradient updates from the auxiliary tasks according to their impact on the primary task. We analyze the auxiliary task gradients in the subspace spanned by the primary task per-example gradients.
This allows us to decompose auxiliary gradients into three components: components that help, interfere with, or have no impact on the primary task according to a Taylor expansion of the expected primary loss. This decomposition allows us to re-weight each component differently prior to the update. Our framework enables us to treat each auxiliary update differently depending on its impact on the task of interest, and it encompasses prior methods such as classical multitask learning (Caruana, 1997) or more novel gradient surgery techniques (Yu et al., 2020). To achieve a tractable approach, we introduce an efficient, robust algorithm (ATTITTUD: Auxiliary Task Training with Influence from Target Task Update Direction) to estimate the subspace spanned by the primary task gradients in an online manner and decompose the auxiliary updates appropriately. As a result, we can integrate our approach with the stochastic training of large neural networks in various contexts. The contribution of our work is four-fold. To our knowledge, this paper proposes the first approach to adapt auxiliary gradients using a decomposition built from the span of the primary task Jacobian. In order to scale this approach to deep neural nets, we contribute a tractable and efficient algorithm called ATTITTUD that leverages insights from randomized linear algebra and automatic differentiation, such as the R-operator (Pearlmutter, 1994). As our third contribution, we show that the fine-grained manipulation of the auxiliary task gradients under ATTITTUD represents a unified framework that encompasses several previous approaches to asymmetric task learning as special cases. Finally, we demonstrate the efficacy of our approach on both data-rich and data-starved primary tasks, over both image and textual data.
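To make the decomposition concrete, the following is a minimal NumPy sketch of the idea, not the paper's actual ATTITTUD implementation: the per-example primary gradients (rows of the primary Jacobian) define a subspace via an SVD, and the auxiliary gradient is split into an out-of-span part (no first-order effect on the primary loss) plus in-span directions that agree or disagree with the mean primary gradient. All function names and the example weights are illustrative assumptions.

```python
import numpy as np

def decompose_aux_gradient(primary_jacobian, aux_grad, rank=None):
    """Split an auxiliary gradient into components that, to first order,
    help, hurt, or leave the primary loss unchanged.

    primary_jacobian: (n_examples, n_params) per-example primary gradients.
    aux_grad:         (n_params,) auxiliary-task gradient.
    """
    # Orthonormal basis (rows of vt) for the span of the primary gradients.
    _, _, vt = np.linalg.svd(primary_jacobian, full_matrices=False)
    if rank is not None:
        vt = vt[:rank]  # optional low-rank truncation of the subspace

    coords = vt @ aux_grad          # auxiliary coordinates in the subspace
    in_span = vt.T @ coords
    neutral = aux_grad - in_span    # orthogonal part: no first-order impact

    # A descent step along direction d changes the primary loss by
    # roughly -lr * <g_primary, d>, so the sign of the per-direction
    # agreement with the mean primary gradient sorts help from harm.
    g_primary = primary_jacobian.mean(axis=0)
    agree = coords * (vt @ g_primary)
    helpful = vt.T @ np.where(agree >= 0, coords, 0.0)
    harmful = vt.T @ np.where(agree < 0, coords, 0.0)
    return helpful, harmful, neutral

def reweighted_update(primary_jacobian, aux_grad, w=(1.0, 0.0, 1.0)):
    """Re-weight the three components before applying the update;
    w = (1, 0, 1) would keep helpful and neutral parts and drop harm."""
    h, b, n = decompose_aux_gradient(primary_jacobian, aux_grad)
    return w[0] * h + w[1] * b + w[2] * n
```

Note that materializing the full Jacobian, as done here for clarity, is exactly what the paper's use of randomized linear algebra and the R-operator avoids at scale.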

2. RELATED WORK

Methods to leverage data outside of the task of interest have been popular in machine learning since the inception of multitask learning (Caruana, 1997; Ruder, 2017; Vandenhende et al., 2020). These methods address multiple tasks simultaneously and have been successful in various application domains (Collobert & Weston, 2008; Zhang et al., 2014; Misra et al., 2016). The optimization problem induced by multitask learning is difficult, and solutions have been proposed for its various difficulties, including dealing with task gradients of different magnitudes (Sener & Koltun, 2018; Chen et al., 2018; Hessel et al., 2019) or gradients interfering with each other (Sinha et al., 2018; Suteu & Guo, 2019; Yu et al., 2020). The specific problem of interference has been studied extensively in the context of continual learning. Continual learning visits tasks in sequence, and update interference is particularly problematic as it causes newer tasks to damage performance on previously mastered tasks. In particular, a family of methods has been proposed to project the gradient of the new task to be orthogonal to the gradients of the previous tasks (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2018; Farajtabar et al., 2019). Unlike many previous approaches, we are not interested in addressing multiple tasks per se. In our setting, only the primary task matters, and the auxiliary tasks have the sole role of improving generalization on the primary task. This is the setting considered by Du et al. (2018); Lin et al. (2019), who favor auxiliary tasks whose gradient directions are helpful to the primary task. Unlike these works, which use coarse properties like the cosine similarity between averaged gradients, our approach allows fine-grained gradient manipulation within a subspace. Also, in our case, we do not distinguish between the different auxiliary tasks. Instead, we aim at correcting every auxiliary gradient in the same manner to improve the loss on the primary task.
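The orthogonal-projection family from the continual learning literature mentioned above can be sketched as follows; the function name and shapes are illustrative assumptions, not code from any of the cited papers:

```python
import numpy as np

def project_orthogonal(new_grad, old_grads):
    """Project the new task's gradient onto the orthogonal complement of
    the span of previous tasks' gradients, so the update cannot (to first
    order) change the loss along any previously protected direction.

    new_grad:  (n_params,) gradient of the new task.
    old_grads: list of (n_params,) gradients from previous tasks.
    """
    # Orthonormal basis q for the span of the old gradients.
    q, _ = np.linalg.qr(np.stack(old_grads, axis=1))
    # Subtract the component of new_grad lying inside that span.
    return new_grad - q @ (q.T @ new_grad)
```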
This type of gradient correction is related to Yu et al. (2020), which considers projecting multi-task gradients such that the directions of disagreement are removed. This method is in fact a special case of our framework. Our work also shares some similarities with data selection and domain adaptation approaches. In this case, the training data come from a single task but their distribution differs from the validation/test distribution (Moore & Lewis, 2010; Axelrod et al., 2011; Ngiam et al., 2018). This classical problem has recently been addressed by sampling training points whose gradients align well with the expected validation gradient (Wang et al., 2020a;b). Instead of sampling individual points based on an estimated distribution of how helpful they will be to the primary task, our work avoids the use (and inherent challenges) of this reinforcement learning approach by operating on batch gradients of groups of points. Our primary task/auxiliary task setting is also related to the pretraining-then-fine-tuning paradigm, in which the auxiliary tasks are visited first (pre-training) to give an initialization for training on the primary task (fine-tuning). These methods have been very successful in settings where primary task data are scarce. In particular, it is common to first rely on an unsupervised task over very large datasets prior to fine-tuning on a supervised task (Devlin et al., 2018; Liu et al., 2019; Kornblith et al., 2019; Yang et al., 2019; Song et al., 2019; Caron et al., 2018).
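The gradient-surgery projection of Yu et al. (2020), which the framework recovers as a special case, can be sketched in its simplest two-gradient form as follows (a minimal illustration, assuming plain NumPy arrays of flattened parameters):

```python
import numpy as np

def project_conflicting(aux_grad, primary_grad):
    """If the auxiliary gradient conflicts with the primary gradient
    (negative inner product), remove its component along the primary
    gradient; otherwise leave it untouched."""
    dot = aux_grad @ primary_grad
    if dot < 0:
        return aux_grad - dot / (primary_grad @ primary_grad) * primary_grad
    return aux_grad
```

In the language of the decomposition above, this corresponds to zeroing out the harmful component along a single direction, the primary gradient, rather than re-weighting components across the full primary-gradient subspace.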

