MEASURING AND HARNESSING TRANSFERENCE IN MULTI-TASK LEARNING

Abstract

Multi-task learning can leverage information learned by one task to benefit the training of other tasks. Despite this capacity, naïve formulations often degrade performance; in particular, identifying the tasks that would benefit from co-training remains a challenging design question. In this paper, we analyze the dynamics of information transfer, or transference, across tasks throughout training. Specifically, we develop a similarity measure that can quantify transference among tasks and use this quantity both to better understand the optimization dynamics of multi-task learning and to improve overall learning performance. In the latter case, we propose two methods to leverage our transference metric. The first operates at a macro-level by selecting which tasks should train together, while the second functions at a micro-level by determining how to combine task gradients at each training step. We find these methods can lead to significant improvement over prior work on three supervised multi-task learning benchmarks and one multi-task reinforcement learning paradigm.

1. INTRODUCTION

Deciding if two or more objectives should be trained together in a multi-task model, as well as choosing how that model's parameters should be shared, is an inherently complex issue often left to human experts (Zhang & Yang, 2017). However, a human's understanding of similarity is motivated by intuition and experience rather than a prescient knowledge of the underlying structures learned by a neural network. To further complicate matters, the benefit or detriment induced by co-training depends on many non-trivial decisions including, but not limited to, dataset characteristics, model architecture, hyperparameters, capacity, and convergence (Wu et al., 2020; Vandenhende et al., 2019; Standley et al., 2019; Sun et al., 2019). As a result, a quantifiable measure which conveys the effect of information transfer in a neural network would be valuable to practitioners and researchers alike to better construct or understand multi-task learning paradigms (Baxter, 2000; Ben-David & Schuller, 2003). The training dynamics specific to multi-task neural networks, namely cross-task interactions at the shared parameters (Zhao et al., 2018), are difficult to predict and only fully manifest at the completion of training. Given the cost, in both time and resources, of fully training a deep neural network, an exhaustive search over the 2^m − 1 possible combinations of m tasks to determine ideal task groupings can be infeasible. This search is further complicated by the irreproducibility inherent in traversing a loss landscape with high curvature, an effect which appears especially pronounced in multi-task learning paradigms (Yu et al., 2020; Standley et al., 2019). In this paper, we aim to take a step towards quantifying transference, or the dynamics of information transfer, and understanding its effect on multi-task training efficiency.
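To make the combinatorial cost concrete, the 2^m − 1 candidate groupings can be enumerated directly; the sketch below (with illustrative task names, not drawn from any benchmark in this paper) shows why exhaustive search over groupings quickly becomes infeasible:

```python
from itertools import combinations

def candidate_groupings(tasks):
    """Enumerate every non-empty subset of tasks: 2^m - 1 candidates for m tasks."""
    return [set(c) for r in range(1, len(tasks) + 1)
            for c in combinations(tasks, r)]

tasks = ["segmentation", "depth", "normals"]  # m = 3 illustrative tasks
groups = candidate_groupings(tasks)           # 2**3 - 1 = 7 candidate groupings
```

Each candidate grouping would require a full training run to evaluate, so even m = 10 tasks already implies 1,023 runs.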
As both the input data and the state of model convergence are fundamental to transference (Wu et al., 2020), we develop a parameter-free approach to measure this effect at a per-minibatch level of granularity. Moreover, our quantity makes no assumptions regarding model architecture and is applicable to any paradigm in which shared parameters are updated with respect to multiple task losses. By analyzing multi-task training dynamics through the lens of transference, we present the following observations. First, information transfer is highly dependent on model convergence and varies significantly throughout training. Second, and perhaps surprisingly, excluding certain task gradients from the multi-task gradient update for select minibatches can improve learning efficiency. Our analysis suggests this is due to large variation in the loss landscapes of different tasks, as illustrated in Figure 4. Building on these observations, we propose two methods to utilize transference in multi-task learning algorithms: to choose which tasks to train together, and to determine which gradients to apply at each minibatch.

Figure 1: Transference in (a) CelebA for a subset of 9 attributes; (b) Meta-World for "push", "reach", "press button top", and "open window". To determine task groupings, we compute the transference of each task i on all other tasks j, i.e. Z^t_{{i}→j}, and average over time. For the purpose of illustration, we normalize the transference along each axis. Notice the majority of the tasks in (a) concentrate around a single value for each attribute. Tasks which exhibit transference above this value are considered to have relatively high transference. For instance, A_3 exhibits higher-than-average transference on A_0, A_4, and A_5. A similar effect is observed in (b), with "close window" manifesting high transference on "push" and "reach".
Our experiments indicate the former can identify promising task groupings, while the latter can improve learning performance over prior methods. In summary, our main contributions are three-fold: we (1) propose the first measure (to our knowledge) which quantifies information transfer among tasks in multi-task learning; (2) demonstrate how transference can be used as a heuristic to select task groupings; (3) present a method which leverages minibatch-level transference to augment network performance.
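The per-minibatch view of transference above can be illustrated with a lookahead-style sketch: take a gradient step on the shared parameters using only task i's loss, then compare task j's loss on the same minibatch before and after. The quadratic per-task losses, learning rate, and targets below are illustrative toys and not the paper's exact formulation:

```python
import numpy as np

# Illustrative lookahead sketch of transference (not the paper's exact formula):
# update shared parameters theta with task i's gradient alone, then measure the
# relative change in task j's loss on the SAME minibatch.

def quadratic_loss(theta, target):
    return 0.5 * np.sum((theta - target) ** 2)

def quadratic_grad(theta, target):
    return theta - target

def transference(theta, target_i, target_j, lr=0.1):
    theta_lookahead = theta - lr * quadratic_grad(theta, target_i)
    before = quadratic_loss(theta, target_j)
    after = quadratic_loss(theta_lookahead, target_j)
    return 1.0 - after / before  # positive: task i's step helped task j

theta = np.zeros(2)
aligned = transference(theta, np.array([1.0, 1.0]), np.array([1.0, 1.0]))
opposed = transference(theta, np.array([-1.0, -1.0]), np.array([1.0, 1.0]))
# aligned > 0 (task i's step reduces task j's loss); opposed < 0 (it increases it)
```

In this toy, a positive value means task i's update moved the shared parameters in a direction that also reduced task j's loss, matching the intuition of "high transference" in Figure 1.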

2. RELATED WORK

Multi-Task Formulation. The most prevalent formulation of MTL is hard parameter sharing of hidden layers (Ruder, 2017; Caruana, 1993). In this design, a subset of the hidden layers is typically shared among all tasks, and task-specific layers are stacked on top of the shared base to output a prediction value. Each task is assigned a weight, and the loss of the entire model is a linear combination of each task's loss multiplied by its respective loss weight. This design enables parameter efficiency by sharing hidden layers across tasks, reduces overfitting, and can facilitate transfer-learning effects among tasks (Ruder, 2017; Baxter, 2000; Zhang & Yang, 2017).

Information Sharing. Prevailing wisdom suggests tasks which are similar or share a similar underlying structure may benefit from co-training in a multi-task system (Caruana, 1993; 1997). A plethora of multi-task methods addressing what to share have been developed, such as Neural Architecture Search (Guo et al., 2020; Sun et al., 2019; Vandenhende et al., 2019; Rusu et al., 2016; Huang et al., 2018; Lu et al., 2017) and Soft-Parameter Sharing (Misra et al., 2016; Duong et al., 2015; Yang & Hospedales, 2016), to improve multi-task performance. Though our measure of transference is complementary to these methods, we direct our focus towards which tasks should be trained together rather than architecture modifications to maximize the benefits of co-training. While deciding which tasks to train together has traditionally been addressed with costly cross-validation techniques or high-variance human intuition, recent advances have developed increasingly efficient algorithms to assess co-training performance. Swirszcz & Lozano (2012) and Bingel & Søgaard (2017) approximate multi-task performance by analyzing single-task learning characteristics.
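A minimal sketch of the hard-parameter-sharing formulation, using plain NumPy with illustrative shapes, task names, and loss weights (assumptions, not an implementation from any cited work): a shared hidden layer feeds per-task heads, and the model's loss is the weighted linear combination of per-task losses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hard parameter sharing: one shared hidden layer, one linear head per task,
# total loss = sum_k w_k * L_k. All shapes and weights below are illustrative.
W_shared = rng.normal(size=(4, 8))            # shared hidden layer
heads = {"task_a": rng.normal(size=(8, 1)),   # task-specific output heads
         "task_b": rng.normal(size=(8, 1))}
loss_weights = {"task_a": 1.0, "task_b": 0.5}

def forward(x, head):
    h = np.maximum(x @ W_shared, 0.0)  # shared representation (ReLU)
    return h @ head                    # task-specific prediction

def total_loss(x, targets):
    # Linear combination of per-task mean-squared errors.
    return sum(loss_weights[k] *
               np.mean((forward(x, heads[k]) - targets[k]) ** 2)
               for k in targets)

x = rng.normal(size=(16, 4))
targets = {k: rng.normal(size=(16, 1)) for k in heads}
loss = total_loss(x, targets)
```

Because every task's gradient flows through `W_shared`, cross-task interactions at these shared parameters are exactly where transference effects arise.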
An altogether different approach may leverage recent advances in transfer learning focused on understanding task relationships (Zamir et al., 2018; Achille et al., 2019b; Dwivedi & Roig, 2019; Zhuang et al., 2020; Achille et al., 2019a); however, Standley et al. (2019) show transfer learning

