CONDITIONALLY ADAPTIVE MULTI-TASK LEARNING: IMPROVING TRANSFER LEARNING IN NLP USING FEWER PARAMETERS & LESS DATA

Abstract

Multi-Task Learning (MTL) networks have emerged as a promising method for transferring learned knowledge across different tasks. However, MTL must deal with challenges such as overfitting to low-resource tasks, catastrophic forgetting, and negative task transfer, or learning interference. In Natural Language Processing (NLP), a separate model per task is often needed to obtain the best performance, yet many fine-tuning approaches are both parameter inefficient, i.e., potentially involving one new model per task, and highly susceptible to losing knowledge acquired during pretraining. We propose a novel Transformer-based Adapter consisting of a new conditional attention mechanism as well as a set of task-conditioned modules that facilitate weight sharing. Through this construction, we achieve more efficient parameter sharing and mitigate forgetting by keeping half of the weights of a pretrained model fixed. We also use a new multi-task data sampling strategy to mitigate the negative effects of data imbalance across tasks. Using this approach, we are able to surpass single-task fine-tuning methods while being parameter and data efficient (using around 66% of the data for weight updates). Compared to other BERT Large methods on GLUE, our 8-task model surpasses other Adapter methods by 2.8%, and our 24-task model outperforms models that use MTL and single-task fine-tuning by 0.7-1.0%. We show that a larger variant of our single multi-task model approach performs competitively across 26 NLP tasks and yields state-of-the-art results on a number of test and development sets. Our code is publicly available at https://github.com/CAMTL/CA-MTL.

1. INTRODUCTION

The introduction of deep, contextualized Masked Language Models (MLMs) trained on massive amounts of unlabeled data has led to significant advances across many different Natural Language Processing (NLP) tasks (Peters et al., 2018; Liu et al., 2019a). Many of these recent advances can be attributed to the now well-known BERT approach (Devlin et al., 2018). Substantial improvements over previous state-of-the-art results on the GLUE benchmark (Wang et al., 2018) have been obtained by multiple groups using BERT models with task-specific fine-tuning. The "BERT-variant + fine-tuning" formula has continued to improve over time, with newer work constantly pushing the state-of-the-art forward on the GLUE benchmark. The use of a single neural architecture for multiple NLP tasks showed promise long before the current wave of BERT-inspired methods (Collobert & Weston, 2008), and recent work has argued that autoregressive language models (ARLMs) trained on large-scale datasets, such as the GPT family of models (Radford et al., 2018), are in practice multi-task learners (Brown et al., 2020).

However, even with MLMs and ARLMs trained for multi-tasking, single-task fine-tuning is usually also employed to achieve state-of-the-art performance on specific tasks of interest. Typically, this fine-tuning process may entail creating a task-specific fine-tuned model (Devlin et al., 2018), training specialized model components for task-specific predictions (Houlsby et al., 2019), or fine-tuning a single multi-task architecture (Liu et al., 2019b).

Single-task fine-tuning over all pretrained model parameters may raise other issues. Recent analyses of such MLMs have shed light on the linguistic knowledge that is captured in the hidden states and attention maps (Clark et al., 2019b; Tenney et al., 2019a; Merchant et al., 2020). In particular, the middle Transformer (Vaswani et al., 2017) layers of BERT are typically the most transferable to a downstream task (Liu et al., 2019a).
The model proxies the steps of the traditional NLP pipeline in a localizable way (Tenney et al., 2019a), with basic syntactic information appearing earlier in the network and high-level semantic information appearing in higher-level layers. Since pretraining is usually done on large-scale datasets, it may be useful, for a variety of downstream tasks, to conserve that knowledge. However, single-task fine-tuning causes catastrophic forgetting of the knowledge learned during MLM pretraining (Howard & Ruder, 2018). To preserve knowledge, freezing part of a pretrained network and using Adapters for new tasks have shown promising results (Houlsby et al., 2019).

Inspired by the human ability to transfer learned knowledge from one task to another new task, Multi-Task Learning (MTL) in a general sense (Caruana, 1997; Rajpurkar et al., 2016b; Ruder, 2017) has been applied in many fields outside of NLP. Caruana (1993) showed that a model trained in a multi-task manner can take advantage of the inductive transfer between tasks, achieving better generalization performance. MTL has the advantage of computational and storage efficiency (Zhang & Yang, 2017), but training models in a multi-task setting is a balancing act, particularly when tasks differ in (a) dataset size, (b) difficulty level, and (c) type of loss function. In practice, learning multiple tasks at once is challenging since negative transfer (Wang et al., 2019a), task interference (Wu et al., 2020; Yu et al., 2020), and catastrophic forgetting (Serrà et al., 2018) can lead to worse data efficiency, training stability, and generalization compared to single-task fine-tuning.

Using Conditionally Adaptive Learning, we seek to improve pretraining knowledge retention and multi-task inductive knowledge transfer. Our contributions are the following:

• A new task-conditioned Transformer that adapts and modulates pretrained weights (Section 2.1).
• A novel way to prioritize tasks with an uncertainty-based multi-task data sampling method that helps balance the sampling of tasks to avoid catastrophic forgetting (Section 2.2).

Our Conditionally Adaptive Multi-Task Learning (CA-MTL) approach is illustrated in Figure 1. To the best of our knowledge, our work is the first to explore the use of a latent representation of tasks to modularize and adapt pretrained architectures. Further, we believe our work is also the first to examine uncertainty sampling for large-scale multi-task learning in NLP. We show the efficacy of CA-MTL by (a) testing on 26 different tasks and (b) presenting state-of-the-art results on a number of test sets as well as superior performance against both single-task and MTL baselines. Moreover, we demonstrate that our method has advantages over (c) other adapter networks and (d) other MTL sampling methods. Finally, we provide ablations and a separate analysis of the MT-Uncertainty Sampling technique in Section 4.1 and of each component of the adapter in Section 4.2.

2. METHODOLOGY

This section is organized according to the two main MTL problems that we will tackle: (1) How to modularize a pretrained network with latent task representations? (2) How to balance different tasks in MTL? We define each task as $\mathcal{T}_i \triangleq \{p_i(y_i \mid x_i, z_i), \mathcal{L}_i, \tilde{p}_i(x_i)\}$, where $z_i$ is task $i$'s learnable shallow embedding, $\mathcal{L}_i$ is the task loss, and $\tilde{p}_i(x_i)$ is the empirical distribution of the training data pair $\{x_i, y_i\}$, for $i \in \{1, \ldots, T\}$ and $T$ the number of supervised tasks. The MTL objective is:

$$\min_{\phi(z),\, \theta_1, \ldots, \theta_T} \; \sum_{i=1}^{T} \mathcal{L}_i\big(f_{\phi(z_i),\, \theta_i}(x_i),\, y_i\big) \tag{1}$$
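To make the structure of Eq. (1) concrete, the following is a minimal sketch of how the objective could be evaluated on toy data. It is not the CA-MTL implementation: the scalar encoder, the names `Task`, `shared_encoder`, and `mtl_objective`, the shared weights `w` and `u`, and the choice to average each task loss over its examples are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Task:
    """One supervised task: embedding z_i, head parameter theta_i, loss L_i, data."""
    z: float                                # task embedding z_i (a scalar, for brevity)
    theta: float                            # task-specific parameter theta_i
    loss: Callable[[float, float], float]   # task loss L_i(prediction, target)
    data: Sequence[Tuple[float, float]]     # training pairs (x_i, y_i)


def shared_encoder(x: float, z: float, w: float, u: float) -> float:
    """Toy stand-in for the shared network: (w, u) are shared by all tasks,
    and the task embedding z conditions the output (the role of phi(z_i))."""
    return w * x + u * z


def mtl_objective(tasks: List[Task], w: float, u: float) -> float:
    """Sum over tasks of the mean per-example task loss, mirroring Eq. (1).
    Averaging within a task is an assumption made for this sketch."""
    total = 0.0
    for task in tasks:
        losses = [
            task.loss(task.theta * shared_encoder(x, task.z, w, u), y)
            for x, y in task.data
        ]
        total += sum(losses) / len(losses)
    return total


def squared_error(pred: float, target: float) -> float:
    return (pred - target) ** 2


def absolute_error(pred: float, target: float) -> float:
    return abs(pred - target)


# Two hypothetical tasks with different loss functions and dataset sizes.
tasks = [
    Task(z=0.0, theta=1.0, loss=squared_error, data=[(1.0, 2.0), (2.0, 4.0)]),
    Task(z=1.0, theta=2.0, loss=absolute_error, data=[(1.0, 3.0)]),
]

print(mtl_objective(tasks, w=2.0, u=0.5))  # 2.0
```

In this toy setting the first task is fit exactly and the entire objective value comes from the second task, which illustrates why tasks with different loss types and data sizes need to be balanced during training.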



Figure 1: CA-MTL base architecture with our uncertainty-based sampling algorithm. Each task has its own decoder. The input embedding layer and the lower Transformer layers are frozen. The upper Transformer layer and Conditional Alignment module are modulated with the task embedding.
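The layout described in the caption of Figure 1 can be summarized as a plan of which modules receive weight updates. The sketch below is only an illustration under stated assumptions: the module names, the function `build_freeze_plan`, the split of 12 frozen layers out of 24, and the task names are hypothetical and are not taken from the released code.

```python
from typing import Dict, List


def build_freeze_plan(num_layers: int, num_frozen: int,
                      task_names: List[str]) -> Dict[str, bool]:
    """Map each (hypothetical) module name to True if trainable, False if frozen."""
    if not 0 <= num_frozen <= num_layers:
        raise ValueError("num_frozen must be between 0 and num_layers")
    plan = {"embeddings": False}              # input embedding layer is frozen
    for idx in range(num_layers):
        # Lower layers stay frozen; upper layers are adapted per task.
        plan[f"layer_{idx}"] = idx >= num_frozen
    plan["conditional_alignment"] = True      # modulated with the task embedding
    plan["task_embeddings"] = True            # learnable embeddings z_i
    for name in task_names:
        plan[f"decoder_{name}"] = True        # each task has its own decoder
    return plan


# Assumed example: a 24-layer encoder with the lower 12 layers frozen.
plan = build_freeze_plan(num_layers=24, num_frozen=12, task_names=["mnli", "rte"])
frozen = [name for name, trainable in plan.items() if not trainable]
print(len(frozen), "frozen modules;", len(plan) - len(frozen), "trainable modules")
```

Keeping the frozen set explicit in one place makes it easy to check that adding a task only introduces a new decoder entry and leaves the frozen modules unchanged.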

