αVIL: LEARNING TO LEVERAGE AUXILIARY TASKS FOR MULTITASK LEARNING

Abstract

Multitask Learning is a Machine Learning paradigm that aims to train a range of (usually related) tasks with the help of a shared model. While the goal is often to improve the joint performance of all training tasks, another approach is to focus on the performance of a specific target task, treating the remaining tasks as auxiliary data from which to leverage positive transfer towards the target during training. In such settings, it becomes important to estimate the positive or negative influence auxiliary tasks will have on the target. While many methods have been proposed to estimate task weights before or during training, they typically rely on heuristics or extensive search of the weighting space. We propose a novel method, α-Variable Importance Learning (αVIL), that adjusts task weights dynamically during model training by making direct use of task-specific updates of the underlying model's parameters between training epochs. Experiments indicate that αVIL is able to outperform other Multitask Learning approaches in a variety of settings. To our knowledge, this is the first attempt at making direct use of model updates for task weight estimation.

1. INTRODUCTION

In Machine Learning, we often encounter tasks that are similar, if not almost identical. For example, in Computer Vision, multiple datasets might require object segmentation or recognition (Deng et al., 2009; LeCun et al., 1998; Lin et al., 2014), whereas in Natural Language Processing, tasks can deal with sentence entailment (De Marneffe et al., 2019) or paraphrase recognition (Quirk et al., 2004), both of which share similarities and fall under the category of Natural Language Understanding. Given that many such datasets are accessible to researchers, a naturally emerging question is whether we can leverage their commonalities in training setups. Multitask Learning (Caruana, 1993) is a Machine Learning paradigm that aims to address the above by training a group of sufficiently similar tasks together. Instead of optimizing each individual task's objective, a shared underlying model is fit so as to maximize a global performance measure, for example a LeNet-like architecture (LeCun et al., 1998) for Computer Vision, or a Transformer-based encoder (Vaswani et al., 2017) for Natural Language Processing problems. For a broader perspective on Multitask Learning approaches, we refer the reader to the overviews of Ruder (2017); Vandenhende et al. (2020). In this paper we introduce αVIL, an approach to Multitask Learning that estimates individual task weights through direct, gradient-based meta-optimization on a weighted accumulation of task-specific model updates. To our knowledge, this is the first attempt to leverage task-specific model deltas, that is, realized differences of model parameters before and after a task's training steps, to directly optimize task weights for target task-oriented multitask learning. We perform initial experiments on multitask setups in two domains, Computer Vision and Natural Language Understanding, and show that our method is able to successfully learn a good weighting of classification tasks.
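The core idea of recombining task-specific model deltas under learnable weights can be illustrated numerically. The following is a minimal sketch using toy one-parameter quadratic tasks; the function names and objectives are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def task_delta(theta, grad_fn, lr=0.1, steps=3):
    """Train a copy of the shared parameters on one task and return
    the realized parameter difference (the task-specific model delta)."""
    theta_t = theta.copy()
    for _ in range(steps):
        theta_t -= lr * grad_fn(theta_t)
    return theta_t - theta

# Two toy quadratic "tasks" pulling the parameter toward different optima.
g1 = lambda th: 2 * (th - 1.0)   # task 1: optimum at 1.0
g2 = lambda th: 2 * (th + 0.5)   # task 2: optimum at -0.5

theta = np.array([0.0])
deltas = [task_delta(theta, g) for g in (g1, g2)]

# Task weights; in alpha-VIL these would be meta-optimized on the
# target task's development data rather than fixed by hand.
alphas = np.array([0.8, 0.2])

# The epoch's update is the weighted accumulation of task deltas.
theta_new = theta + sum(a * d for a, d in zip(alphas, deltas))
```

With these weights the combined update is dominated by task 1's delta, so the parameter moves toward task 1's optimum, which is the intended effect of up-weighting a helpful task.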

2. RELATED WORK

Multitask Learning (MTL) can be divided into techniques which aim to improve a joint performance metric for a group of tasks (Caruana, 1993), and methods which use auxiliary tasks to boost the performance of a single target task (Caruana, 1998; Bingel & Søgaard, 2017). Some combinations of tasks suffer when their model parameters are shared, a phenomenon that has been termed negative transfer. There have been efforts to identify the causes of negative transfer. Du et al. (2018) use negative cosine similarity between gradients as a heuristic for determining negative transfer between target and auxiliary tasks. Yu et al. (2020) suggest that these conflicting gradients are detrimental to training when the joint optimization landscape has high positive curvature and there is a large difference in gradient magnitudes between tasks. They address this by projecting task gradients onto the normal plane if they conflict with each other. Wu et al. (2020) hypothesize that the degree of transfer between tasks is influenced by the alignment of their data samples, and propose an algorithm which adaptively aligns embedded inputs. Sener & Koltun (2018) avoid the issue of negative transfer due to competing objectives altogether, by casting MTL as a Multiobjective Optimization problem and searching for a Pareto optimal solution. In this work, we focus on the target task approach to Multitask Learning, tackling the problem of auxiliary task selection and weighting to avoid negative transfer and maximally utilize positively related tasks. Auxiliary tasks have been used to improve target task performance in Computer Vision, Reinforcement Learning (Jaderberg et al., 2016), and Natural Language Processing (Collobert et al., 2011). They are commonly selected based on knowledge about which tasks should be beneficial to each other through the insight that they utilize similar features to the target task (Caruana, 1998), or are grouped empirically (Søgaard & Goldberg, 2016).
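The gradient-similarity heuristic and the normal-plane projection described above can be sketched in a few lines. This is a simplified illustration of the cited ideas on two-dimensional toy gradients, not the original implementations:

```python
import numpy as np

def cosine_sim(g1, g2):
    """Cosine similarity between two task gradients; a negative value is
    used as a heuristic indicator of conflict (potential negative transfer)."""
    return g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))

def project_if_conflicting(g_task, g_other):
    """If the gradients conflict (negative inner product), remove from
    g_task its component along g_other, i.e. project it onto the
    normal plane of g_other."""
    dot = g_task @ g_other
    if dot < 0:
        g_task = g_task - (dot / (g_other @ g_other)) * g_other
    return g_task

g_target = np.array([1.0, 0.0])
g_aux = np.array([-1.0, 1.0])

sim = cosine_sim(g_target, g_aux)          # negative: heuristic flags a conflict
g_proj = project_if_conflicting(g_target, g_aux)
residual = g_proj @ g_aux                  # ~0 after projection
```

After projection, the target gradient no longer has a component opposing the auxiliary gradient, which is the mechanism the projection-based approach uses to mitigate destructive interference.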
While this may often result in successful task selection, such approaches have some obvious drawbacks. Manual feature-based selection requires the researcher to have deep knowledge of the available data, an undertaking that becomes ever more difficult with the introduction of more datasets. Furthermore, this approach is prone to failure when it comes to Deep Learning, where model behaviour does not necessarily follow human intuition. Empirical task selection, e.g., through trialling various task combinations, quickly becomes computationally infeasible when the number of tasks grows large. Therefore, in both approaches to Multitask Learning (optimizing either a target task using auxiliary data or a global performance metric), automatic task weighting during training can be beneficial for optimally exploiting relationships between tasks. To this end, Guo et al. (2019) use a two-stage approach: first, a subset of auxiliary tasks most likely to improve the main task's validation performance is selected by utilizing a Multi-Armed Bandit, the estimates of which are continuously updated during training. The second step makes use of a Gaussian Process to infer a mixing ratio for data points belonging to the selected tasks, which are subsequently used to train the model. A different approach by Wang et al. (2020) aims to directly differentiate, at each training step, the model's validation loss with respect to the probability of selecting instances of the training data (parametrised by a scorer network). This approach is used in multilingual translation by training the scorer to output probabilities for all of the tasks' training data. However, this method relies on noisy, per-step estimates of the scorer's parameter gradients, as well as on an analytical derivation that depends on the optimizer used. Our method, in comparison, is agnostic to the optimization procedure. Most similarly to our method, Sivasankaran et al. (2017) recently introduced Discriminative Importance Weighting for acoustic model training. In their work, the authors train a model on the CHiME-3 dataset (Barker et al., 2015), adding 6 artificially perturbed datasets as auxiliary tasks. Their method relies on estimating model performances on the targeted validation data when training tasks in isolation, and subsequently using those estimates as a proxy to adjust individual task weights. Our method differs from this approach by directly optimizing the target validation loss with respect to the weights applied to the model updates originating from training each task.

3. α-VARIABLE IMPORTANCE LEARNING

The target task-oriented approach to Multitask Learning can be defined as follows. We are given a set of classification tasks T = {t_1, t_2, ..., t_n}, each associated with training and validation datasets, D^train_{t_i} and D^dev_{t_i}, as well as a target task t* ∈ T. We want to find weights W = {ω_1, ω_2, ..., ω_n} capturing the importance of each task, such that training the parameters θ of a Deep Neural Network on the weighted sum of task losses maximizes the model's performance on the target task's development set:

W* = argmin_W L_{t*}(θ*(W); D^dev_{t*}),   where   θ*(W) = argmin_θ Σ_{i=1}^{n} ω_i · L_{t_i}(θ; D^train_{t_i})

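The inner, weighted training objective above can be made concrete with toy quadratic task losses. The loss functions and weights below are illustrative placeholders, not the paper's actual tasks:

```python
import numpy as np

# Per-task losses over the shared parameters theta (toy quadratics).
task_losses = [
    lambda th: np.sum((th - 1.0) ** 2),   # target task t_1, optimum at 1.0
    lambda th: np.sum((th + 2.0) ** 2),   # auxiliary task t_2, optimum at -2.0
]
task_grads = [
    lambda th: 2 * (th - 1.0),
    lambda th: 2 * (th + 2.0),
]
omegas = [1.0, 0.25]  # task weights W = {omega_1, omega_2}

def weighted_loss(theta):
    """Weighted sum of per-task losses: sum_i omega_i * L_{t_i}(theta)."""
    return sum(w * L(theta) for w, L in zip(omegas, task_losses))

def weighted_grad(theta):
    """Gradient of the weighted objective."""
    return sum(w * g(theta) for w, g in zip(omegas, task_grads))

# One gradient step on the weighted objective.
theta = np.zeros(2)
theta = theta - 0.1 * weighted_grad(theta)
```

Note that the minimizer of this weighted objective lies between the tasks' individual optima, pulled toward each in proportion to its weight; choosing W so that this compromise favors t*'s development performance is exactly the outer optimization αVIL addresses.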