CONDITIONALLY ADAPTIVE MULTI-TASK LEARNING: IMPROVING TRANSFER LEARNING IN NLP USING FEWER PARAMETERS & LESS DATA

Abstract

Multi-Task Learning (MTL) networks have emerged as a promising method for transferring learned knowledge across different tasks. However, MTL must deal with challenges such as overfitting to low-resource tasks, catastrophic forgetting, and negative task transfer (i.e., learning interference). Often, in Natural Language Processing (NLP), a separate model per task is needed to obtain the best performance. However, many fine-tuning approaches are both parameter inefficient, potentially requiring one new model per task, and highly susceptible to losing knowledge acquired during pretraining. We propose a novel Transformer-based Adapter consisting of a new conditional attention mechanism as well as a set of task-conditioned modules that facilitate weight sharing. Through this construction, we achieve more efficient parameter sharing and mitigate forgetting by keeping half of the weights of a pretrained model fixed. We also use a new multi-task data sampling strategy to mitigate the negative effects of data imbalance across tasks. Using this approach, we are able to surpass single task fine-tuning methods while being parameter and data efficient (using around 66% of the data for weight updates). Compared to other BERT Large methods on GLUE, our 8-task model surpasses other Adapter methods by 2.8%, and our 24-task model outperforms models that use MTL and single task fine-tuning by 0.7-1.0%. We show that a larger variant of our single multi-task model approach performs competitively across 26 NLP tasks and yields state-of-the-art results on a number of test and development sets. Our code is publicly available at https://github.com/CAMTL/CA-MTL.

1. INTRODUCTION

The introduction of deep, contextualized Masked Language Models (MLM) trained on massive amounts of unlabeled data has led to significant advances across many different Natural Language Processing (NLP) tasks (Peters et al., 2018; Liu et al., 2019a). Many of these recent advances can be attributed to the now well-known BERT approach (Devlin et al., 2018). Substantial improvements over previous state-of-the-art results on the GLUE benchmark (Wang et al., 2018) have been obtained by multiple groups using BERT models with task specific fine-tuning. The "BERT-variant + fine-tuning" formula has continued to improve over time, with newer work constantly pushing the state-of-the-art forward on the GLUE benchmark. The use of a single neural architecture for multiple NLP tasks showed promise long before the current wave of BERT-inspired methods (Collobert & Weston, 2008), and recent work has argued that autoregressive language models (ARLMs) trained on large-scale datasets, such as the GPT family of models (Radford et al., 2018), are in practice multi-task learners (Brown et al., 2020). However, even with MLMs and ARLMs trained for multi-tasking, single task fine-tuning is usually still employed to achieve state-of-the-art performance on specific tasks of interest. Typically this fine-tuning process entails creating a task-specific fine-tuned model (Devlin et al., 2018), training specialized model components for task-specific predictions (Houlsby et al., 2019), or fine-tuning a single multi-task architecture (Liu et al., 2019b). Single task fine-tuning of all pretrained model parameters may have other issues. Recent analyses of such MLMs have shed light on the linguistic knowledge captured in their hidden states and attention maps (Clark et al., 2019b; Tenney et al., 2019a; Merchant et al., 2020). In particular, the middle Transformer (Vaswani et al., 2017) layers of BERT are typically the most transferable to a downstream task (Liu et al., 2019a).
The model proxies the steps of the traditional NLP pipeline in a localizable way (Tenney et al., 2019a), with basic syntactic information appearing earlier in the network and high-level semantic information appearing in higher layers. Since pretraining is usually done on large-scale datasets, it may be useful, for a variety of downstream tasks, to conserve that knowledge. However, single task fine-tuning causes catastrophic forgetting of the knowledge learned during MLM pretraining (Howard & Ruder, 2018). To preserve this knowledge, freezing part of a pretrained network and using Adapters for new tasks has shown promising results (Houlsby et al., 2019). Inspired by the human ability to transfer learned knowledge from one task to another new task, Multi-Task Learning (MTL) in a general sense (Caruana, 1997; Rajpurkar et al., 2016b; Ruder, 2017) has been applied in many fields beyond NLP. Caruana (1993) showed that a model trained in a multi-task manner can take advantage of the inductive transfer between tasks, achieving better generalization performance. MTL has the advantage of computational and storage efficiency (Zhang & Yang, 2017), but training models in a multi-task setting is a balancing act, particularly with datasets that differ in (a) size, (b) task difficulty level, and (c) type of loss function. In practice, learning multiple tasks at once is challenging since negative transfer (Wang et al., 2019a), task interference (Wu et al., 2020; Yu et al., 2020) and catastrophic forgetting (Serrà et al., 2018) can lead to worse data efficiency, training stability and generalization than single task fine-tuning. Using Conditionally Adaptive Learning, we seek to improve pretraining knowledge retention and multi-task inductive knowledge transfer. Our contributions are the following:

• A new task conditioned Transformer that adapts and modulates pretrained weights (Section 2.1).
• A novel way to prioritize tasks with an uncertainty-based multi-task data sampling method that helps balance the sampling of tasks to avoid catastrophic forgetting (Section 2.2).

Our Conditionally Adaptive Multi-Task Learning (CA-MTL) approach is illustrated in Figure 1. To the best of our knowledge, our work is the first to explore the use of a latent representation of tasks to modularize and adapt pretrained architectures. Further, we believe our work is also the first to examine uncertainty sampling for large-scale multi-task learning in NLP. We show the efficacy of CA-MTL by (a) testing on 26 different tasks and (b) presenting state-of-the-art results on a number of test sets as well as superior performance against both single-task and MTL baselines. Moreover, we demonstrate that our method has advantages over (c) other adapter networks and (d) other MTL sampling methods. Finally, we provide ablations and a separate analysis of the MT-Uncertainty Sampling technique in Section 4.1 and of each adapter component in Section 4.2.

2. METHODOLOGY

This section is organized according to the two main MTL problems that we tackle: (1) how to modularize a pretrained network with latent task representations, and (2) how to balance different tasks in MTL. We define each task as T_i = {p_i(y_i|x_i, z_i), L_i, p̂_i(x_i)}, where z_i is task i's learnable shallow embedding, L_i is the task loss, and p̂_i(x_i) is the empirical distribution of the training data pair {x_i, y_i}, for i ∈ {1, . . . , T} with T the number of supervised tasks. The MTL objective is:

$$\min_{\phi(z), \theta_1, \ldots, \theta_T} \sum_{i=1}^{T} \mathcal{L}_i\big(f_{\phi(z_i), \theta_i}(x_i),\, y_i\big) \qquad (1)$$

where f is the predictor function (which includes the encoder model and decoder heads), φ(z) are learnable generated weights conditioned on z, and θ_i are task-specific parameters for the output decoder heads. z is constructed using an embedding lookup table.
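As a concrete illustration, the objective in equation 1 can be sketched with a toy shared encoder, a task embedding lookup table, and per-task linear decoder heads. This is a minimal sketch under assumed shapes; the function names (`encoder`, `mtl_loss`) and the toy `tanh` encoder are illustrative and not from the paper's released code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 2
# z_i: learnable task embedding lookup table; theta_i: task-specific decoder heads
task_embeddings = rng.normal(size=(T, d))
heads = [rng.normal(size=(d, c)) for c in (2, 3)]  # tasks with 2 and 3 classes

def softmax(u):
    e = np.exp(u - u.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder(x, z):
    # stand-in for f_phi(z): a shared encoder whose output is conditioned on z
    return np.tanh(x + z)

def mtl_loss(batches):
    # batches: one (x, y) pair per task; the objective sums the task losses L_i
    total = 0.0
    for i, (x, y) in enumerate(batches):
        probs = softmax(encoder(x, task_embeddings[i]) @ heads[i])
        total += -np.log(probs[np.arange(len(y)), y]).mean()  # cross-entropy L_i
    return total
```

In the real model, minimizing this summed loss by gradient descent updates the shared conditioned weights φ(z), the task embeddings z_i, and the heads θ_i jointly.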

2.1. TASK CONDITIONED TRANSFORMER

Our task conditioned Transformer architecture is based on one simple concept: we either add conditional layers or modulate existing pretrained weights using a task representation, extending Feature-Wise Linear Modulation (Perez et al., 2018) functions in several ways depending on the Transformer layer. We define our framework below.

Definition 1 (Conditional Weight Transformations). Given a neural network weight matrix W, we compute transformations of the form φ(W|z_i) = γ_i(z_i)W + β_i(z_i), where γ_i and β_i are learned functions that transform the weights based on a learned vector embedding z_i for task i.

Definition 2 (Conditionally Adaptive Learning). In our setting, Conditionally Adaptive Learning is the process of learning a set of φs for the conditionally adaptive modules presented below, along with a set of task embedding vectors z_i for T tasks, using a multi-task loss (see equation 1).

In the subsections that follow: we introduce a new Transformer attention module using block-diagonal Conditional Attention that allows the original query-key based attention to account for task-specific biases (section 2.1.1); we propose a new Conditional Alignment method that aligns the data of diverse tasks and performs better than its unconditioned, higher-capacity predecessor (section 2.1.2); we adapt layer normalization statistics to specific tasks using a new Conditional Layer Normalization module (section 2.1.3); and we add a Conditional Bottleneck that facilitates weight sharing and task-specific information flow from lower layers (section 2.1.4). In our experiments we provide an ablation study of these components (Table 1), examining performance in terms of GLUE scores.
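Definition 1 can be sketched in a few lines. For brevity this sketch models γ_i and β_i as scalar-output linear maps of the task embedding (an assumption; the paper only requires them to be learned functions), so the same pretrained matrix W yields a distinct view per task.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W = rng.normal(size=(d, d))   # a shared (pretrained) weight matrix
G = rng.normal(size=(d,))     # parameters of gamma(z) = G @ z (hypothetical form)
C = rng.normal(size=(d,))     # parameters of beta(z)  = C @ z (hypothetical form)

def phi(W, z):
    # Definition 1: phi(W | z_i) = gamma_i(z_i) * W + beta_i(z_i)
    return float(G @ z) * W + float(C @ z)

z1, z2 = rng.normal(size=(2, d))   # embeddings for two different tasks
W1, W2 = phi(W, z1), phi(W, z2)    # two task-specific views of the same weights
```

Only W, G, C and the task embeddings are trained; each new task costs one extra embedding vector rather than a new copy of W.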

2.1.1. CONDITIONAL ATTENTION

Figure 2: Conditional Attention Module

Given the input dimension d, the query Q, the key K, and the value V as defined in Vaswani et al. (2017), we redefine the attention operation as:

$$\mathrm{Attention}(Q, K, V, z_i) = \mathrm{softmax}\left(M(z_i) + \frac{QK^{T}}{\sqrt{d}}\right) V$$

$$M(z_i) = \bigoplus_{n=1}^{N} A_n(z_i), \qquad A_n(z_i) = A_n\, \gamma_i(z_i) + \beta_i(z_i)$$

where ⊕ is the direct sum operator (see section A.6), N is the number of block matrices A_n ∈ R^{(L/N)×(L/N)} along the diagonal of the attention matrix, L is the input sequence length, and M(z_i) = diag(A_1, . . . , A_N) is a block-diagonal conditional matrix. Note that A_n is constructed from L/N trainable, randomly initialized L/N-dimensional vectors. While the original attention matrix depends on the hidden states h, M(z_i) is a learnable weight matrix that depends only on the task embedding z_i ∈ R^d. γ_i, β_i : R^d → R^{L²/N²} are Feature-Wise Linear Modulation (Perez et al., 2018) functions. We also experimented with full-block Conditional Attention in R^{L×L}. Not only did it have N² more parameters than the block-diagonal variant, but it also performed significantly worse on the GLUE development set (see the FBA variant in Table 10). It is possible that GLUE tasks derive a certain benefit from the localized attention induced by M(z_i): with M(z_i), each element in a sequence can only attend to other elements in its subsequence of length L/N. In our experiments we used N = d/L. The full Conditional Attention mechanism used in our experiments is illustrated in Figure 2.

2.1.2. CONDITIONAL ALIGNMENT

Wu et al. (2020) showed that in MTL, having T separate alignment modules R_1, . . . , R_T increases BERT LARGE avg. scores on five GLUE tasks (CoLA, MRPC, QNLI, RTE, SST-2) by 2.35%. Inspired by this work, we found that adding a task conditioned alignment layer between the input embedding layer and the first BERT Transformer layer improved multi-task model performance.
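The block-diagonal Conditional Attention of section 2.1.1 can be sketched in a few lines of numpy. The shapes, the scalar stand-ins for γ_i(z_i) and β_i(z_i), and the random block initialization are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def direct_sum(blocks):
    # M(z_i) = A_1 (+) ... (+) A_N: block-diagonal matrix, zeros elsewhere
    L = sum(b.shape[0] for b in blocks)
    M, off = np.zeros((L, L)), 0
    for A in blocks:
        s = A.shape[0]
        M[off:off + s, off:off + s] = A
        off += s
    return M

def conditional_attention(Q, K, V, M):
    # softmax(M(z_i) + Q K^T / sqrt(d)) V, with M a task-conditioned bias
    d = Q.shape[-1]
    return softmax(M + Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(0)
L, d, N = 8, 16, 2
Q, K, V = rng.normal(size=(3, L, d))
gamma, beta = 0.5, 0.1   # stand-ins for gamma_i(z_i), beta_i(z_i)
blocks = [rng.normal(size=(L // N, L // N)) * gamma + beta for _ in range(N)]
out = conditional_attention(Q, K, V, direct_sum(blocks))
```

Because M is added inside the softmax, the task embedding biases the attention pattern without adding any dependence on the hidden states.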
However, instead of having T separate alignment matrices R_i, one for each of the T tasks, a single alignment matrix R is generated as a function of the task embedding z_i. As in Wu et al. (2020), we tested this module on the same five GLUE tasks with BERT LARGE. Enabling task conditioned weight sharing across covariance alignment modules allows us to outperform BERT LARGE by 3.61%, which is 1.26% higher than having T separate alignment matrices. Inserting R into BERT yields the following encoder function f:

$$f = \sum_{i=1}^{T} g_{\theta_i}\big(E(x_i)\, R(z_i)\, B\big), \qquad R(z_i) = R\, \gamma_i(z_i) + \beta_i(z_i) \qquad (2)$$

where x_i ∈ R^d is the layer input, g_{θ_i} is the decoder head function for task i with weights θ_i, E is the frozen BERT embedding layer, B denotes the BERT Transformer layers, and R is the linear weight matrix of a single task conditioned alignment module. γ_i, β_i : R^d → R^d are Feature-Wise Linear Modulation functions.
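The key point of this module, one shared matrix R modulated per task rather than T separate matrices, can be sketched as follows. The linear forms chosen for γ_i and β_i are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
R = rng.normal(size=(d, d)) * 0.1   # the single shared alignment matrix
G = rng.normal(size=(d,))           # gamma_i(z) = G @ z (illustrative form)
C = rng.normal(size=(d,))           # beta_i(z)  = C @ z (illustrative form)

def aligned(E_x, z):
    # R(z_i) = R * gamma_i(z_i) + beta_i(z_i), applied between the frozen
    # embedding layer E and the Transformer stack B
    R_z = R * float(G @ z) + float(C @ z)
    return E_x @ R_z

E_x = rng.normal(size=(4, d))        # frozen embeddings of a 4-token input
z_a, z_b = rng.normal(size=(2, d))   # embeddings for two different tasks
h_a, h_b = aligned(E_x, z_a), aligned(E_x, z_b)  # task-specific alignments of one input
```

Both tasks reuse R, G and C; adding a task only adds one embedding vector, in contrast with the T-matrix variant of Wu et al. (2020).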

2.1.3. CONDITIONAL LAYER NORMALIZATION (CLN)

We extend the Conditional Batch Normalization idea of de Vries et al. (2017) to Layer Normalization (Ba et al., 2016). For task T_i, i ∈ {1, . . . , T}:

$$h_i = \frac{1}{\sigma}(a_i - \mu) \odot \tilde{\gamma}_i(z_i) + \beta_i(z_i), \qquad \tilde{\gamma}_i(z_i) = \gamma\, \gamma_i(z_i) + \beta \qquad (3)$$

where h_i is the CLN output vector, a_i are the preceding layer activations associated with task i, µ and σ are the mean and variance of the summed inputs within each layer as defined in Ba et al. (2016), and γ and β are the scale and shift parameters of the pretrained Layer Normalization.

2.1.4. CONDITIONAL BOTTLENECK

We created a task conditioned two-layer feed-forward bottleneck layer (CFF up/down in Figure 3). The conditional bottleneck layer follows the same transformation as in equation 2. The module in Figure 3a is added to the topmost Transformer layers of CA-MTL BASE and uses a CLN. For CA-MTL LARGE, this module is the main building block of the skip connection added alongside all Transformer layers, seen in Figure 3b. The connection at layer j takes in the matrix sum of the Transformer layer output at j and the previous connection's output at j-1. The Conditional Bottleneck allows lower-layer information to flow upwards depending on the task. Our intuition for introducing this component is related to recent studies (Tenney et al., 2019a) showing that the "most important layers for a given task appear at specific positions". As with the other modules described so far, each task adaptation is created from the weights of a single shared adapter that is modulated by the task embedding.
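A minimal sketch of Conditional Layer Normalization (equation 3) follows. Modeling γ_i and β_i as linear maps of the task embedding is an assumption made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
gamma, beta = np.ones(d), np.zeros(d)   # pretrained LN scale and shift
Wg = rng.normal(size=(d, d)) * 0.1      # gamma_i(z) = Wg @ z (illustrative form)
Wb = rng.normal(size=(d, d)) * 0.1      # beta_i(z)  = Wb @ z (illustrative form)

def conditional_layer_norm(a, z, eps=1e-5):
    # eq. 3: h_i = (a_i - mu) / sigma * gamma~_i(z_i) + beta_i(z_i),
    # with gamma~_i(z_i) = gamma * gamma_i(z_i) + beta
    mu = a.mean(axis=-1, keepdims=True)
    sigma = a.std(axis=-1, keepdims=True)
    gamma_tilde = gamma * (Wg @ z) + beta
    return (a - mu) / (sigma + eps) * gamma_tilde + Wb @ z

a = rng.normal(size=(3, d))             # activations for 3 positions
h = conditional_layer_norm(a, rng.normal(size=(d,)))
```

The pretrained γ and β stay fixed; only the small modulation maps and the task embeddings are trained, so normalization statistics adapt per task at negligible parameter cost.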

2.2. MULTI-TASK UNCERTAINTY SAMPLING

MT-Uncertainty Sampling is a task selection strategy inspired by Active Learning techniques. Our Algorithm 1 is outlined in the Appendix, Section A.2. As in Active Learning, the algorithm first evaluates model uncertainty: MT-Uncertainty Sampling uses Shannon entropy to choose training examples, first doing a forward pass through the model with b × T input samples. For an output classification prediction with C_i possible classes and probabilities (p_{i,1}, . . . , p_{i,C_i}), the Shannon entropy H_i for task T_i, i ∈ {1, . . . , T}, and our uncertainty measure U(x) are given by:

$$H_i = H\big(f_{\phi(z_i), \theta_i}(x)\big) = -\sum_{c=1}^{C_i} p_c \log p_c, \qquad \mathcal{U}(x_i) = \frac{H\big(f_{\phi(z_i), \theta_i}(x)\big)}{\hat{H} \times H'_i} \qquad (4)$$

$$\hat{H} = \max_{i \in \{1, \ldots, T\}} \bar{H}_i = \max_i \frac{1}{b} \sum_{x \in x_i} H_i, \qquad H'_i = -\sum_{c=1}^{C_i} \frac{1}{C_i} \log \frac{1}{C_i} \qquad (5)$$

where H̄_i is the average Shannon entropy across the b samples of task i, H'_i is the Shannon entropy of choosing classes with uniform probability, and Ĥ is the maximum of each task's average entropy over b samples. H'_i is a normalizing factor that accounts for differing numbers of prediction classes (without it, tasks with binary classification, C_i = 2, were rarely chosen). Further, to limit high-entropy outliers and to favor tasks with the highest uncertainty, we normalize by Ĥ. The measure in equation 4 allows Algorithm 1 to choose b samples from the b × T candidates to train the model.
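A minimal sketch of this selection rule, using dummy model probabilities (the helper names are illustrative, and the small epsilon inside the log is a numerical-stability assumption):

```python
import numpy as np

def entropy(p):
    # rowwise Shannon entropy H = -sum_c p_c log p_c
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def mt_uncertainty(task_probs):
    # task_probs[i]: (b, C_i) class probabilities for b candidates of task i.
    # U(x_i) = H_i / (H^ * H'_i), with H'_i = log C_i (uniform-guess entropy)
    # and H^ the maximum per-task average entropy, as in eq. 4-5.
    H = [entropy(p) for p in task_probs]
    H_hat = max(h.mean() for h in H)
    return [h / (H_hat * np.log(p.shape[1])) for h, p in zip(H, task_probs)]

def select_batch(task_probs, b):
    # keep the b (task, sample) pairs with the highest uncertainty
    U = mt_uncertainty(task_probs)
    flat = [(u, t, i) for t, s in enumerate(U) for i, u in enumerate(s)]
    return [(t, i) for u, t, i in sorted(flat, reverse=True)[:b]]

# an uncertain binary task vs. a confident 3-class task
probs_a = np.array([[0.5, 0.5], [0.6, 0.4]])
probs_b = np.array([[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]])
chosen = select_batch([probs_a, probs_b], b=2)
```

Here both selected samples come from the near-uniform binary task: dividing by log C_i puts tasks with different class counts on the same scale before comparison.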

3. RELATED WORK

Multi-Tasking in NLP. To take advantage of the potential positive transfer of knowledge from one task to another, several works have proposed carefully choosing which tasks to train on as an intermediate step before single task fine-tuning (Bingel & Søgaard, 2017; Kerinec et al., 2018; Wang et al., 2019a; Standley et al., 2019; Pruksachatkun et al., 2020; Phang et al., 2018). The intermediate tasks are not required to perform well and are typically not evaluated jointly. In this work, all tasks are trained jointly and all tasks are evaluated from a single model. In Natural Language Understanding (NLU), it is still the case that obtaining the best task performance often requires a separate model per task (Clark et al., 2019c; McCann et al., 2018). At scale, multilingual NMT systems (Aharoni et al., 2019) have also found that MTL model performance degrades as the number of tasks increases; we notice a similar trend in NLU with our baseline MTL model. Recent MTL approaches have tackled the problem by designing task-specific decoders on top of a shared model (Liu et al., 2019b) or by distilling multiple single-task models into one (Clark et al., 2019c). Nonetheless, such MTL approaches still involve single task fine-tuning. In this paper, we show that it is possible to achieve high performance in NLU without single task fine-tuning. Adapters. Adapters are trainable modules attached at specific locations of a pretrained network. They provide another promising avenue for limiting the number of parameters needed when confronted with a large number of tasks. This approach is useful with pretrained MLM models that hold rich linguistic information (Tenney et al., 2019b; Clark et al., 2019b; Liu et al., 2019a; Tenney et al., 2019a). Recently, Houlsby et al. (2019) added an adapter to a pretrained BERT model by fine-tuning the layer norms and adding feed-forward bottlenecks in every Transformer layer.
However, such methods adapt each task individually during the fine-tuning process. Unlike prior work, our method harnesses vectorized representations of tasks to modularize a single pretrained model across all tasks. Stickland et al. (2019) and Tay et al. (2020) also mix MTL and adapters, with BERT and the T5 encoder-decoder (Raffel et al., 2019) respectively, by creating local task modules that are controlled by a global task-agnostic module. The main drawback is that a new set of non-shared parameters must be added whenever a new task is introduced; CA-MTL shares all parameters and is able to re-modulate existing weights with a new task embedding vector. Active Learning, Task Selection and Sampling. Our sampling technique is similar to those found in several active learning algorithms (Chen et al., 2006) that are based on Shannon entropy estimation. Reichart et al. (2008) and Ikhwantri et al. (2018) examined Multi-Task Active Learning (MTAL), a technique that chooses one informative sample for T different learners (or models), one per task. We instead choose samples from T tasks for one model. Moreover, our algorithm weights each sample by the corresponding task score, and the Shannon entropy is normalized to account for various losses (see equation 5). Also, our algorithm is used in a large-scale MTL setup (≥ 2 tasks). Recently, Glover & Hokamp (2019) explored task selection in MTL using learning policies based on counterfactual estimation (Charles et al., 2013). However, such methods consider only fixed stochastic parameterized policies, while our method adapts its selection criterion based on model uncertainty throughout the training process.

4. EXPERIMENTS AND RESULTS

We show that the adapter of Section 2 achieves parameter-efficient transfer across 26 NLP tasks. Our implementation of CA-MTL is based on HuggingFace (Wolf et al., 2019). Hyperparameters and our experimental set-up are outlined in A.5. To preserve the weights of the pretrained model, CA-MTL's bottom half Transformer layers are frozen in all experiments (except in section 4.4). We also tested different layer freezing configurations and found that freezing half the layers worked best on average (see Section A.8).

4.1. MT-UNCERTAINTY SAMPLING

In Figure 4, we see that MT-Uncertainty converges faster, reaching the 80% average GLUE score line before the other task sampling methods. Further, MT-Uncertainty's maximum score over 200k iterations is 82.2, which is 1.7% higher than Counterfactual sampling. The GLUE benchmark offers a wide range of dataset sizes, which is useful for testing how MT-Uncertainty manages a jointly trained low-resource task (CoLA) and a high-resource task (MNLI). Figure 5 shows how catastrophic forgetting is curtailed by sampling tasks before performance drops. With π_rand, all of CoLA's training samples are seen by iteration 500, at which point the larger MNLI dataset overtakes the learning process and CoLA's dev set performance starts to diminish. With MT-Uncertainty sampling, on the other hand, CoLA is sampled whenever its Shannon entropy is higher than MNLI's: the model first assesses uncertain samples using Shannon entropy and then decides what data to train on, which allows lower-resource tasks to keep performance steady. We provide evidence in Figure 8 of A.2 that MT-Uncertainty is able to manage task difficulty by choosing the most difficult tasks first.

Table note: CA = Conditional Alignment, CLN = Conditional Layer Normalization, Task σ = score standard deviation across tasks; π_rand = 1/T, π_|task| = |D_i| / Σ_{i=1}^T |D_i|.

4.2. ABLATION AND MODULE ANALYSIS

In Table 1, we present the results of an ablation study to determine which elements of CA-MTL BERT-BASE had the largest positive gain on average GLUE scores, starting from an MTL BERT BASE baseline trained with random task sampling (π_rand). Apart from the Conditional Adapter, each module, as well as MT-Uncertainty, lifts overall performance and reduces variance across tasks. Note that we also included accuracy/F1 scores for QQP and MRPC and Pearson/Spearman correlations for STS-B when calculating the score standard deviation Task σ. Intuitively, when negative task transfer occurs between two tasks, either (1) task interference is bidirectional and both scores are impacted, or (2) interference is unidirectional and only one score is impacted. We calculate Task σ to characterize changes in the dynamic range of performance across multiple tasks, in order to assess the degree to which performance improvements are distributed across all tasks or only subsets of tasks. As we can see from Table 1, Conditional Attention, Conditional Alignment, Conditional Layer Normalization and MT-Uncertainty all play a role in reducing Task σ and increasing performance across tasks. This provides partial evidence of CA-MTL's ability to mitigate negative task transfer. We show that Conditional Alignment can learn to capture covariate distribution differences with task embeddings co-learned from the other adapter components of CA-MTL. In Figure 6, we arrive at similar conclusions as Wu et al. (2020), who proved that negative task transfer is reduced when task covariances are aligned. The authors provided a "covariance similarity score" to gauge covariance alignment.
For tasks i and j with m_i and m_j data samples respectively, and given d-dimensional inputs to the first Transformer layer X_i ∈ R^{m_i×d} and X_j ∈ R^{m_j×d}, we restate the steps to calculate the covariance similarity score between tasks i and j: (a) take the covariance matrix X_i^T X_i; (b) find its best rank-r_i approximation U_{i,r_i} D_{i,r_i} U_{i,r_i}^T, where r_i is chosen to contain 99% of the singular values; (c) apply steps (a) and (b) to X_j, and compute the covariance similarity score:

$$\mathrm{CovSim}_{i,j} := \frac{\big\| (U_{i,r_i} D_{i,r_i}^{1/2})^{T}\, U_{j,r_j} D_{j,r_j}^{1/2} \big\|_F}{\big\| U_{i,r_i} D_{i,r_i}^{1/2} \big\|_F \cdot \big\| U_{j,r_j} D_{j,r_j}^{1/2} \big\|_F} \qquad (6)$$

$$\mathrm{CovSim}_i = \frac{1}{T-1} \sum_{j \neq i} \mathrm{CovSim}_{i,j} \qquad (7)$$

Since we are training models with T tasks, we take the average covariance similarity score CovSim_i between task i and all other tasks. We measure CovSim_i using equation 7 between 9 single-task models trained on individual GLUE tasks. For each task in Figure 6, we measure the similarity score on the MTL-trained BERT BASE baseline, e.g., CoLA (MTL), or the CA-MTL BERT-BASE model, e.g., MNLI (CA-MTL). Our score improvement measure is the % difference between a single-task model and MTL or CA-MTL on the particular task. We find that covariance similarity increases for the 9 tasks and that performance increases for 7 out of 9 tasks. These measurements confirm that Conditional Alignment is able to align task covariances, thereby helping alleviate task interference.
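Steps (a)-(c) can be sketched directly with numpy's SVD; the 99% threshold follows the text, while the toy data and helper names are illustrative.

```python
import numpy as np

def low_rank_factor(X, keep=0.99):
    # steps (a)-(b): covariance X^T X, then the rank-r factor U_r D_r^{1/2},
    # with r chosen to retain 99% of the singular value mass
    U, s, _ = np.linalg.svd(X.T @ X)
    r = int(np.searchsorted(np.cumsum(s) / s.sum(), keep)) + 1
    return U[:, :r] * np.sqrt(s[:r])

def cov_sim(Xi, Xj):
    # step (c): normalized Frobenius inner product of the two factors (eq. 6)
    Fi, Fj = low_rank_factor(Xi), low_rank_factor(Xj)
    return np.linalg.norm(Fi.T @ Fj) / (np.linalg.norm(Fi) * np.linalg.norm(Fj))

rng = np.random.default_rng(0)
Xi = rng.normal(size=(50, 8))                      # inputs for task i
Xj = Xi @ np.diag(rng.uniform(0.5, 2.0, size=8))   # a rescaled version of Xi
Xk = rng.normal(size=(50, 8))                      # inputs for an unrelated task
```

By the Cauchy-Schwarz inequality for the Frobenius inner product, the score always lies in [0, 1], with higher values indicating better-aligned task covariances.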

4.3. JOINTLY TRAINING ON 8 TASKS: GLUE

In Table 2, we compare CA-MTL against other adapter methods: PALS + Anneal Sampling (Stickland et al., 2019) and the LARGE adapter, Adapters-256 (Houlsby et al., 2019). Against single task (ST) models, CA-MTL is 1.3% higher than BERT BASE, with equal or greater performance on 5 of 9 tasks, and 0.7% higher than BERT LARGE, with equal or greater performance on 3 of 9 tasks. ST models, however, need 9 models, or close to 9× more parameters, to cover all 9 tasks. We noted that CA-MTL BERT-LARGE's average score is driven by strong RTE scores. While RTE benefits from MTL, this behavior may also be a side effect of layer freezing: in Table 10, we see that CA-MTL gains over ST on more and more tasks as we gradually unfreeze layers.

4.4. DOMAIN ADAPTATION

In Table 3 we examine the ability of our method to quickly adapt to new tasks. We performed domain adaptation on the SciTail (Khot et al., 2018) and SNLI (Bowman et al., 2015) datasets, using a CA-MTL BASE model trained on GLUE and a new linear decoder head. We tested several pretrained and randomly initialized task embeddings in a zero-shot setting. The complete set of experiments with all task embeddings can be found in the Appendix, Section A.4. We then selected the best task embedding for our results in Table 3: the STS-B and MRPC MTL-trained task embeddings performed best on SciTail and SNLI respectively. CA-MTL BERT-BASE adapts faster than MT-DNN SMART (Jiang et al., 2020), as evidenced by higher performance in low-resource regimes (0.1% and 1% of the data). When trained on the complete dataset, CA-MTL BERT-BASE is on par with MT-DNN SMART. Unlike MT-DNN SMART, however, we do not add context from a semantic similarity model (MT-DNN SMART is built off HNN (He et al., 2019)). Nonetheless, with a larger model, CA-MTL surpasses MT-DNN SMART on the full SNLI and SciTail datasets in Table 6. Effects of Scaling Task Count.
In Figure 7 we further test whether CA-MTL mitigates task interference by measuring GLUE average scores while progressively adding 9 GLUE tasks, 8 SuperGLUE tasks (Wang et al., 2019b) and 6 MRQA tasks (Fisch et al., 2019). Tasks are described in Appendix section A.9. The results show that scaling to 23 tasks drops the performance of our baseline MTL BERT BASE (π_rand): MTL BERT increases by 4.3% when adding MRQA but, at 23 tasks, model performance drops by 1.8%. The opposite is true when CA-MTL modules are integrated into the model: CA-MTL continues to show gains with a large number of tasks and surpasses the baseline MTL model by close to 4% when trained on 23 tasks. We notice in Table 4 that even for large models, CA-MTL provides large average performance gains over both ST and MTL models. For BERT-based models, CA-MTL provides a 2.3% gain over ST and higher scores on 17 of 24 tasks. For RoBERTa-based models, CA-MTL provides a 1.2% gain over ST and higher scores on 15 of 24 tasks. We remind the reader that this is achieved with a single model. It is interesting to note that, even when trained with 16 other tasks, the MTL baseline performs better than the ST baseline on SuperGLUE, where most tasks have a small number of samples. We also used NER to test whether we could still outperform the ST baseline on a token-level task that is significantly different from the other tasks. While CA-MTL performs significantly better than the MTL baseline model on NER, it had not yet overfit on this particular task and could have closed the gap with the ST baseline given more training cycles. Comparisons with other methods. In Table 5, CA-MTL BERT is compared to other Large BERT-based methods that use MTL + ST, such as MT-DNN (Liu et al., 2019b), intermediate tasks + ST, such as STILTS (Phang et al., 2018), or MTL model distillation + ST, such as BAM! (Clark et al., 2019c). Our method scores higher than MT-DNN on 5 of 9 tasks and by 1.0% on average.
Against STILTS, CA-MTL realizes a 0.7% avg. score gain, surpassing its scores on 6 of 9 tasks. We also show that CA-MTL RoBERTa is within only 1.6% of a RoBERTa ensemble that uses 5 to 7 models per task along with intermediate tasks. Using our 24-task CA-MTL large RoBERTa-based model, we report NER F1 scores on the WNUT2017 test set in Table 6a. We compare our result with RoBERTa LARGE and with XLM-R LARGE (Nguyen et al., 2020), the current state-of-the-art (SOTA); our model outperforms XLM-R LARGE by 1.6%, reaching a new state-of-the-art. Using domain adaptation as described in Section 4.4, we report results on the SciTail test set in Table 6b and the SNLI test set in Table 6c. For SciTail, our model matches the current SOTA, ALUM (Liu et al., 2020), a RoBERTa large based model that additionally uses the SMART (Jiang et al., 2020) fine-tuning method. For SNLI, our model outperforms SemBERT, the current SOTA.

5. CONCLUSION

We believe that our experiments have demonstrated the potential of task conditioned adaptive learning within a single model that performs multiple tasks. In a large-scale 24-task NLP experiment, CA-MTL outperforms fully tuned single task models by 2.3% for BERT Large and by 1.2% for RoBERTa Large using 1.12 times the number of parameters, whereas the single task fine-tuning approach requires 24 separately tuned models, or 24 times the number of parameters. Whereas a vanilla BERT MTL model sees its performance drop as the number of tasks increases, CA-MTL scores continue to climb. Performance gains are not driven by a single task, as is often the case in MTL. Each CA-MTL module that adapts a Transformer model is able to reduce performance variance between tasks, increase average scores and align task covariances. This evidence shows that CA-MTL is able to mitigate task interference and promote more efficient parameter sharing. We showed that MT-Uncertainty avoids degrading the performance of low-resource tasks: tasks are sampled whenever the model sees their entropy increase, helping avoid catastrophic forgetting. Overall, CA-MTL offers a promising avenue to dynamically adapt and modularize the knowledge embedded in large monolithic pretrained models. Extending these ideas will be an objective of future work.

A APPENDIX

A.1 SUMMARY OF ACRONYMS

Acronyms of datasets and descriptions can be found below in section A.9.

A.2 MT-UNCERTAINTY SAMPLING ALGORITHM

Algorithm 1: MT-Uncertainty Sampling
  Input: model f; task datasets D_1, . . . , D_T; per-task candidate count b
  Output: B - multi-task batch of size b
  1  B ← ∅
  2  for t ← 1 to T do
  3      generate x_t := {x_{t,1}, . . . , x_{t,b}} i.i.d. ~ D_t
  4      for i ← 1 to b do
  5          H_{t,i} ← -Σ_{c=1}^{C_t} p_c(f(x_{t,i})) log p_c(f(x_{t,i}))
  6          U_{t,i} ← uncertainty of x_{t,i} per equation 4
  7  B ← top_b({U_{t,i} | t ∈ [1, . . . , T], i ∈ [1, . . . , b]})   // b samples with highest uncertainty
  Return: with B, solve equation 1 with gradient descent; updated model f

An advantage of our MT-Uncertainty Sampling approach is its ability to manage task difficulty, as highlighted in Figure 8. In this experiment, we estimated task difficulty using the Evolutionary Data Measures (EDM) proposed by Collins et al. (2018). The task difficulty estimate relies on multiple dataset statistics such as data size, class diversity, class balance and class interference. Interestingly, estimated task difficulty correlates with the first instance at which a specific task is selected. Supposing that QNLI is an outlier, we notice that peaks in the data occur whenever tasks are first selected by MT-Uncertainty sampling. Selection follows the order 1. MNLI, 2. CoLA, 3. RTE, 4. QQP, 5. MRPC, 6. SST-2, which is the order from highest to lowest task difficulty under EDM. As opposed to Curriculum Learning (Bengio et al., 2009), MT-Uncertainty dynamically prioritizes the most difficult tasks. As also found in MTL vision work (Guo et al., 2018), this type of prioritization of more difficult tasks may explain MT-Uncertainty's improved performance over other task selection methods. In MTL, balancing tasks during training is typically done with heuristics that weight each task's loss differently; here, MT-Uncertainty instead prioritizes by task difficulty directly. While the EDM difficulty measure is shown to correlate well with model performance, it lacks precision. As reported in Collins et al.
(2018), the average score achieved on the Yahoo Answers dataset is 69.9% and its difficulty is 4.51, while the average score achieved on Yelp Full is 56.8%, 13.1% less than Yahoo Answers, with a difficulty of 4.42. The authors mention that "this indicates that the difficulty measure in its current incarnation may be more effective at assigning a class of difficulty to datasets, rather than a regression-like value".

A.3 OTHER RELATED WORK

Multi-Tasking in NLP and other fields. MTL weight-sharing algorithms such as Mixture-of-Experts (MoE) have found success in NLP (Lepikhin et al., 2020). CA-MTL can complement MoE, since the Transformer's multi-headed attention can itself be seen as a form of MoE (Peng et al., 2020). In vision, MTL can also be improved with optimization-based (Sener & Koltun, 2018) or gradient-based approaches (Chen et al., 2017; Yu et al., 2020).

Active Learning, Task Selection and Sampling. Ikhwantri et al. (2018) examined multi-task active learning for neural semantic role labeling in a low-resource setting, using entity recognition as the sole auxiliary task. They used uncertainty sampling for active learning and found that 12% less data could be used compared to passive learning. Reichart et al. (2008) examined different active learning techniques for the two-task annotation scenario, focusing on named entity recognition and syntactic parse tree annotations. In contrast, we examine the larger-scale data regime, the modularization of a multi-task neural architecture, and the many-task setting, among other differences. Other than MTAL (Reichart et al., 2008; Ikhwantri et al., 2018), Kendall et al. (2017) leveraged model uncertainty to balance MTL losses, but not to select tasks as is proposed here.
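To contrast loss balancing with our sampling-based approach, the uncertainty weighting of Kendall et al. can be sketched as below. This is an illustrative reconstruction of one common form of their objective, not code from either paper; `log_sigmas` stands for the learned per-task log noise parameters.

```python
import math

def uncertainty_weighted_loss(task_losses, log_sigmas):
    """Combine per-task losses with learned homoscedastic uncertainty,
    following the form L = sum_t 0.5 * exp(-2*log_sigma_t) * L_t + log_sigma_t.

    Tasks with high predicted noise (large log_sigma) are down-weighted,
    while the log_sigma regularizer prevents sigma from growing unboundedly.
    """
    total = 0.0
    for loss, log_sigma in zip(task_losses, log_sigmas):
        total += 0.5 * math.exp(-2.0 * log_sigma) * loss + log_sigma
    return total
```

Note the difference in mechanism: this re-weights gradients of all tasks in every step, whereas MT-Uncertainty changes which examples enter the batch in the first place.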

A.4 ZERO-SHOT RESULTS ON SCITAIL AND SNLI

Before testing models on domain adaptation in section 4.4, we ran zero-shot evaluations on the development sets of SciTail and SNLI. Table 8 outlines the 8-task CA-MTL BERT-BASE model's zero-shot transfer abilities when pretrained on GLUE with our MTL approach. We expand the task embedding layer to accommodate an extra task and explore various embedding initializations. We found that reusing the STS-B and MRPC task embeddings worked best for SciTail and SNLI, respectively.

A.5 MORE EXPERIMENTAL DETAILS

We used a batch size of 32 and a seed of 12 in all experiments. We used Adam (Kingma & Ba, 2015) as the optimizer with a learning rate of 2e-5, applying learning rate decay with warm-up over the first 10% of the training steps. Unless otherwise specified, we used 5 epochs and a sequence length of 128. Additional details are outlined in the sections below. Our data preprocessing and linear decoder heads are the same as in Devlin et al. (2018). We used the same dropout rate of 0.1 in all layers. To run our experiments, we used either four NVIDIA P100 GPUs for base models or four NVIDIA V100 GPUs for larger ones. We did not perform hyperparameter search, and we do not use model ensembles or task-specific tricks (Devlin et al., 2018; Liu et al., 2019b; Clark et al., 2019c). All models have either 12 Transformer layers (BASE) or 24 Transformer layers (LARGE). Apart from CA-MTL, models trained with multi-task learning (BERT or RoBERTa without adapters) used random task sampling. For Table 1 and Figure 7, all BERT-based models have half their layers frozen (untrained) for a fair comparison of ablation results. For the 24-task MTL and CA-MTL models in Tables 4 and 5, we increased the input sequence length to 256 and used 8 epochs.

A.6 THE DIRECT SUM OPERATOR

In section 2.1.1, we used the direct sum operator ⊕, which allows us to create a block diagonal matrix.
The direct sum of a matrix $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$ results in a matrix of size $(m + p) \times (n + q)$, defined as:

$$A \oplus B = \begin{bmatrix} A & 0 \\ 0 & B \end{bmatrix} = \begin{bmatrix}
a_{11} & \cdots & a_{1n} & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
a_{m1} & \cdots & a_{mn} & 0 & \cdots & 0 \\
0 & \cdots & 0 & b_{11} & \cdots & b_{1q} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & b_{p1} & \cdots & b_{pq}
\end{bmatrix}$$
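As a quick sanity check, the operator can be computed with a block-diagonal construction. The sketch below uses plain NumPy (scipy.linalg.block_diag provides the same operation); it only illustrates the operator itself, not how it is wired into the model.

```python
import numpy as np

def direct_sum(A, B):
    """Direct sum A ⊕ B: place A and B on the diagonal of a zero matrix
    of size (m + p) x (n + q), for A of shape (m, n) and B of shape (p, q)."""
    m, n = A.shape
    p, q = B.shape
    out = np.zeros((m + p, n + q), dtype=A.dtype)
    out[:m, :n] = A   # top-left block
    out[m:, n:] = B   # bottom-right block
    return out
```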

A.7 BASELINES AND OTHER EXPERIMENTAL RESULTS

In this section, we present our baseline results for BERT, RoBERTa and CA-MTL, as well as other models. The single task (ST) results that we ran ourselves surpass other papers' reported scores in Table 9. Liu et al. (2019c) report random seed median scores for RoBERTa; our RoBERTa ST baseline matches or surpasses the original paper's scores 4 out of 7 times on the development set when scores are comparable (QQP F1 and STS-B Spearman are not reported). All experiments in this section were run for only 5 epochs, exclusively on the GLUE dataset, with the large BERT-based 8-task CA-MTL model.

A.8 EFFECTS OF LAYER FREEZING

Results in Table 10 reveal that as we freeze more layers, performance tends to decrease. However, since we wanted to preserve as much pretrained knowledge as possible, we chose to keep at least 50% of layers frozen. While this slightly lowered our performance on the 9 GLUE tasks, we believe that keeping as many of the original pretrained weights as possible is beneficial when increasing the total number of tasks in MTL to 24 or more; however, we did not explore this hypothesis further.

A.9 DATASET DESCRIPTION

The datasets used for the domain adaptation experiments were SciTail and SNLI. We jointly trained a CA-MTL RoBERTa-LARGE model on 9 GLUE tasks, 8 SuperGLUE tasks, 6 MRQA tasks, and on WNUT2017 (Derczynski et al., 2017). All GLUE tasks are binary classification, except STS-B (regression) and MNLI (three classes). We used the same GLUE data preprocessing as in Devlin et al. (2018). SuperGLUE has a more diverse task format than GLUE, which is mostly limited to sentence and sentence-pair classification; we follow the same preprocessing procedure as in Wang et al. (2019b). All SuperGLUE tasks are binary classification, except CB (three classes), while WiC and WSC are span-based classification tasks. We used the same modified MRQA dataset and preprocessing steps as in Joshi et al. (2019).
All MRQA tasks are span prediction tasks, which seek to identify the start and end tokens of an answer span in the input text. SNLI is a natural language inference task with three target classes: Entailment, Contradiction, and Neutral (irrelevant). SciTail is a textual entailment dataset: the hypotheses are created from multiple-choice science exams and the answer candidates (premises) are extracted from the web using information retrieval tools. SciTail is a binary true/false classification task that seeks to predict whether the premise entails the hypothesis. The two datasets are used only for domain adaptation in this study (see section A.4 for the details of our approach).
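Span prediction of the kind used in MRQA is typically decoded by picking the start/end token pair with the highest combined score, subject to the start not following the end. The sketch below is a generic illustration of this decoding step, not MRQA's official evaluation code; `max_len` (the answer-length cap) is an assumed parameter.

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return (start, end) maximizing start_logits[s] + end_logits[e],
    subject to s <= e <= s + max_len - 1."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, capped in length.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best
```

Production decoders usually vectorize this search and restrict it to the top-k start and end candidates, but the constraint structure is the same.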



Dataset and tool URLs:
EDM: https://github.com/Wluper/edm
SciTail: https://allenai.org/data/scitail; leaderboard: https://leaderboard.allenai.org/scitail/submissions/public (accessed 09/27/2020)
SNLI: https://nlp.stanford.edu/projects/snli/ (accessed 09/27/2020)
SuperGLUE: https://super.gluebenchmark.com/tasks
MRQA: https://github.com/mrqa/MRQA-Shared-Task-2019
WNUT2017: https://github.com/leondz/emerging_entities_17



Figure 1: CA-MTL base architecture with our uncertainty-based sampling algorithm. Each task has its own decoder. The input embedding layer and the lower Transformer layers are frozen. The upper Transformer layer and Conditional Alignment module are modulated with the task embedding.

Figure 3: a) Conditional Bottleneck for CA-MTLBASE. b) Conditional Bottleneck for CA-MTLLARGE.

Figure 5: CoLA/MNLI Dev set scores and Entropy for π rand (left) and MT-Uncertainty (right).

Figure 6: Task performance vs. avg. covariance similarity scores (eq. 7) for MTL and CA-MTL.

Figure 7: Effects of adding more datasets on avg GLUE scores. Experiments conducted on 3 epochs. When 23 tasks are trained jointly, performance of CA-MTLBERT-BASE continues to improve.

Figure 8: Task composition of MT-Uncertainty sampling and estimated task difficulty using EDM: number of training samples per task at each iteration for batch size of 32. The occurrence of first peaks and estimated difficulty follow the same order: From highest to lowest: MNLI > CoLA > RTE > QQP = MRPC > SST-2.

Model ablation study on the GLUE dev set. All models have the bottom half of their layers frozen.

We evaluate the performance of CA-MTL against single task fine-tuned models, MTL, as well as the other BERT-based adapters on GLUE. As in Houlsby et al. (2019), MNLI-m and MNLI-mm are treated as separate tasks. Our results indicate that CA-MTL outperforms both the BASE adapter,

Adapters with layer freezing vs. ST/MT on GLUE test set. F1 scores are reported for QQP/MRPC, Spearman's correlation for STS-B, accuracy on the matched/mismatch sets for MNLI, Matthew's correlation for CoLA and accuracy for other tasks. * Individual scores not available. ST=Single Task, MTL=Multitask, g.e.= greater or equal to. Results from: 1 Devlin et al. (2018) 2 Stickland et al. (2019). 3 Houlsby et al. (2019) .

Domain adaptation results on dev. sets for BASE models. 1 Liu et al. (2019b), 2 Jiang et al. (2020)



Our 24-task CA-MTL vs. other large models on GLUE. F1 is reported for QQP/MRPC, Spearman's corr. for STS-B, Matthew's corr. for CoLA and accuracy for other tasks. *Split not available. **Uses intermediate task fine-tuning + ST.

CA-MTL test performance vs. SOTA.

List of acronyms used in this paper.


CA-MTL is flexible and extensible to new tasks; however, it is sensitive to the new task's embedding. We tested multiple task embeddings and selected those that worked best on either SciTail or SNLI by checking performance in a zero-shot setting, i.e., using 0% of the data.

F1 scores are reported for QQP/MRPC, Spearman's correlation for STS-B, accuracy on the matched/mismatch sets for MNLI, Matthew's correlation for CoLA and accuracy for other tasks. ST=Single Task, MTL=Multitask. *QNLI v1 (we report v2) **F1 score or Spearman's correlation is not reported. ***Unknown random seeds. Results from: 1 Stickland et al. (2019) 2 Liu et al. (2019b) 3 Phang et al. (2018) 4 Liu et al. (2019c).

8-task CA-MTL BERT-LARGE (see section 4.3) for various layer freezing configurations. F1 scores are reported for QQP/MRPC, Spearman's correlation for STS-B, accuracy on the matched/mismatch sets for MNLI, Matthew's correlation for CoLA and accuracy for other tasks. FBA = Full Block Attention

GLUE(Wang et al., 2018) dataset description. References: 1 Warstadt et al. (2018), 2 Socher et al. (2013), 3 Dolan & Brockett (2005), 4 Cer et al. (2017), 5 Williams et al. (2018), 6 Wang et al. (2018), 7 Levesque (2011)

Super-GLUE(Wang et al., 2019b)  dataset description. References: 1 Clark et al. (2019a), 2 de Marneffe et al. (2019), 3 Gordon et al. (2012), 4 Khashabi et al. (2018), 5 Zhang et al. (2018), 6 Wang et al. (2019b), 7 Poliak et al. (2018), 8 Levesque (2011)

MRQA(Fisch et al., 2019) dataset description. References: 1 Rajpurkar et al. (2016a), 2 Trischler et al. (2017), 3 Joshi et al. (2017), 4 Dunn et al. (2017), 5 Yang et al. (2018), 6 Kwiatkowski et al. (2019)

SNLI (Bowman et al., 2015) and SciTail (Khot et al., 2018) dataset descriptions.

| Dataset | Full name | Size | Task | Source |
| --- | --- | --- | --- | --- |
| SNLI 1 | Stanford Natural Language Inference | 550.2k | inference | human-written English sentence pairs |
| SciTail 2 | Science and Entailment | 23.5k | entailment | science question answering |

ACKNOWLEDGMENTS

This research was supported by the Canada CIFAR AI Chairs Program, NSERC and PROMPT. Experiments in this article were conducted with Compute Canada and MILA computational infrastructure and we thank them for their support. We would like to thank Colin Raffel, Sandeep Subramanian, and Nicolas Gontier for their useful feedback and the anonymous reviewers for helpful comments, discussions and suggestions.

