CONTINUAL PRE-TRAINING OF LANGUAGE MODELS

Abstract

Language models (LMs) have been instrumental in the rapid advance of natural language processing. This paper studies continual pre-training of LMs, in particular, continual domain-adaptive pre-training (or continual DAP-training). Existing research has shown that further pre-training an LM with a domain corpus to adapt it to the domain improves end-task performance in that domain. This paper proposes a novel method to continually DAP-train an LM with a sequence of unlabeled domain corpora to adapt the LM to these domains and improve their end-task performances. The key novelty of our method is a soft-masking mechanism that directly controls the update to the LM. A novel proxy is also proposed to preserve the general knowledge in the original LM. Additionally, the method contrasts the representations of the previously learned knowledge (including the general knowledge in the pre-trained LM) and the knowledge from the current full network to achieve knowledge integration. The method not only overcomes catastrophic forgetting, but also achieves knowledge transfer to improve end-task performances. Empirical evaluation demonstrates the effectiveness of the proposed method.

1. INTRODUCTION

Pre-trained language models (LMs) like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have significantly advanced NLP. Recently, LMs have also been used by many continual learning (CL) systems to learn a sequence of end-tasks incrementally (Ke et al., 2021a; Sun et al., 2020; Huang et al., 2021), which we call continual end-task learning. It is also desirable to continually pre-train LMs themselves. This includes (1) continual general pre-training, which incrementally updates the LM using the most recent data that has a similar distribution to the pre-training data, and (2) continual domain-adaptive pre-training, which further pre-trains an LM incrementally to adapt it to a sequence of domains. Note that LM editing (with or without continual learning) (Mitchell et al., 2022), which corrects mistakes learned by the LM, is a special case of continual end-task learning (Kim et al., 2022), as each editing task or group of editing tasks learned together is essentially a task in continual learning, which aims to perform the edits correctly without interfering with or forgetting the other knowledge already learned in the current LM. This paper focuses on continual domain-adaptive pre-training (or continual DAP-training) of LMs. It is known that DAP-training an LM (without continual learning) on a large unlabeled domain corpus before end-task fine-tuning achieves better results (Gururangan et al., 2020; Xu et al., 2019; Ke et al., 2022b). This paper goes a step further and continually learns to improve an LM's ability to handle new or emerging domains or topics without forgetting the skills or knowledge learned in the past. This is important in the real world, where the data shifts constantly and new domains, events or topics keep emerging (Ke et al., 2022b), so the LM needs to be updated to serve users better. We call this problem continual DAP-training.
Starting from a pre-trained general LM (i.e., an LM already pre-trained on D_0), we incrementally DAP-train a sequence of domain corpora D_1, D_2, .... Once a domain is trained, its data is no longer accessible. This is different from conventional continual learning (CL), where each task is an end-task. In the proposed continual DAP-training, each task is an unlabeled domain corpus to be learned. An end-task fine-tunes the continually DAP-trained LM to evaluate its performance. It is worth noting that D_0 is usually a broad or general domain (e.g., News). In practice, a continually DAP-trained LM may be trained by individual users, institutions or a mix of both, who have one or more large corpora of particular domains. In such cases, the raw data may not be shared, but the final LM can be shared by all. There are multiple desiderata for a continual DAP-training system: (1) It should not suffer from catastrophic forgetting (CF), i.e., it should perform reasonably well on learned domains. This requires the system (a) to overcome CF for each new domain and (b) to overcome CF for the general language knowledge in the LM, which is important because the knowledge learned from each domain alone will not be sufficient for good end-task performances. (2) It should encourage knowledge transfer (KT) across domains to achieve improved end-task performances. This requires the system to enable (a) forward transfer, learning a new domain by leveraging the knowledge from previous domains, and (b) backward transfer, gaining improved performance on previous domains after learning a relevant new domain. (3) It should work without requiring the domain-ID for each end-task fine-tuning. None of the existing CL methods can achieve all the above. This paper represents a step towards achieving them. The proposed method is called DAS (Continual DA-pre-training of LMs with Soft-masking). DAS proposes a novel soft-masking mechanism that computes the importance (a real number between 0 and 1) of units (attention heads and neurons) for the general or domain knowledge and soft-masks them based on their importance values to control the backward gradient flow. In the forward pass, soft-masking is not applied, which encourages KT across domains. DAS does not isolate a sub-network for any domain, so the knowledge in the full LM can be leveraged for end-task fine-tuning.
To apply this mechanism, DAS implements two functions: (1) initialization, which computes the importance of units for the general knowledge in the LM without accessing the LM pre-training data (D_0); it is applied to the pre-trained LM before continual learning starts; and (2) continual learning, which DAP-trains each domain while preventing CF of the general and domain knowledge and encouraging cross-domain KT. In (1), it is not obvious how to compute the importance without the pre-training data. DAS proposes a novel proxy based on robustness to compute the importance of units for the general knowledge. In (2), soft-masking is directly applicable because we have the domain data and the importance can be computed from gradients, inspired by the pruning community (Li et al., 2021; Michel et al., 2019). Moreover, DAS contrasts the previously learned knowledge and the full knowledge (covering both the learned domains and the current domain) to encourage the current domain representation to learn knowledge that is not already present in the knowledge learned from previous domains and to integrate it with that learned knowledge. In end-task fine-tuning, DAS does not require the domain-ID, as all knowledge is accumulated into the DAP-trained LM. In summary, this work makes the following contributions. (i) It studies the new problem of continual DAP-training and discovers that the full LM is needed for a good continual DAP-training method; the popular parameter-isolation approach to overcoming CF in conventional CL is unsuitable. (ii) It proposes a novel soft-masking method to overcome CF and to encourage KT, and a contrastive learning based method for knowledge integration. (iii) To preserve the general knowledge in the LM, a novel proxy is also proposed. (iv) Experimental results demonstrate the effectiveness of DAS.

2. RELATED WORK

DAP-training. An LM can be DAP-trained either by directly updating the LM or by training only a small set of additional parameters (e.g., adapters or prompts). Learning such additional modules well is usually challenging and the result can be inaccurate. DAS belongs to the former family that directly updates the LM.
This is very challenging for CL due to CF; to our knowledge, no existing system in this family addresses CL.

Continual learning. Most CL methods were proposed to overcome CF: (1) Regularization methods (Kirkpatrick et al., 2016; Seff et al., 2017) compute the importance of each parameter to previous tasks and use a regularizer to penalize the sum of parameter changes. DAS is related to but also very different from EWC (Kirkpatrick et al., 2016). First, DAS does not control each parameter/weight but only attention heads or neurons, based on their importance scores. This causes less forgetting (see the forgetting rate in Table 2) because even a small change to each parameter of a neuron can produce a large total change to the neuron's activation. Second, DAS directly controls the backward gradient flow on each neuron, which is more fine-grained and effective than penalizing the sum of changes of all parameters. Our experimental results confirm that EWC performs significantly worse than DAS (see Table 2). (2) Replay methods retain (Rebuffi et al., 2017; Wang et al., 2020b) or generate (Shin et al., 2017; He & Jaeger, 2018) some data of old tasks and use it in learning a new task. (3) Parameter-isolation methods (Serrà et al., 2018; Wortsman et al., 2020) allocate neurons, parameters or sub-networks to different tasks/domains and mask them in task learning. For continual DAP-training, this means that end-tasks cannot use the general knowledge in the LM, which results in poor end-task performances. In NLP, CL has been used for slot filling (Shen et al., 2019), language learning (Li et al., 2019), sentiment analysis (Ke et al., 2021a), topic modeling (Gupta et al., 2020), question answering (Greco et al., 2019) and text classification (Sun et al., 2020; Huang et al., 2021; Chuang et al., 2020), but none is for DAP-training. Some recent CL papers concern LMs. The system in (Madotto et al., 2020) learns separate adapters for different domains and thus has no CF or KT.
DEMIX (Gururangan et al., 2021) initializes the new adapter with the closest old adapter. CPT (Ke et al., 2022a) and ELLE (Qin et al., 2022) are the most closely related to DAS. However, CPT uses the parameter-isolation approach to learn and protect each task, which is weak (see Sec. 4.2); it also needs the domain-ID in end-task fine-tuning. ELLE has to start from pre-training the LM itself rather than from a pre-trained LM like DAS; it also uses a large memory (1G per domain) to store the replay data (including the pre-training data) and expands the network for each domain. Neither is required in DAS. Jin et al. (2021) evaluated several existing CL techniques in a similar setting to DAS and analyzed how they deal with CF, but proposed no new technique.

Neural network pruning. Many parameters in a network are redundant and can be pruned (Li et al., 2021; Lai et al., 2021; Michel et al., 2019; Voita et al., 2019). Existing methods include discarding parameters with small absolute values (Han et al., 2015; Guo et al., 2016), using the accumulated gradient (Michel et al., 2019), and the lottery ticket hypothesis (Brix et al., 2020). However, these methods are not directly applicable here, as we need to preserve not only individual domain knowledge but also the general knowledge in the LM. For the general knowledge, since we do not have any pre-training data, a proxy based on robustness is proposed. For the domain knowledge, we adopt a pruning method but use the importance scores as soft-masks, as we want to accumulate knowledge rather than compress the LM.

Contrastive learning.
Contrastive learning (Chen et al., 2020; He et al., 2020) learns good representations by maximizing the similarity of positive pairs and minimizing that of negative pairs,

$\mathcal{L}_{\text{contrast}} = -\frac{1}{N}\sum_{n=1}^{N}\log\frac{e^{\text{sim}(q_n, q_n^{+})/\tau}}{\sum_{j=1}^{N}e^{\text{sim}(q_n, q_j^{+})/\tau}},$ (1)

where N is the batch size, τ is a temperature parameter, sim(·) is a similarity metric, and q_n and q_n^+ are the representations of a positive pair x_n and x_n^+. DAS contrasts the learned knowledge from previous domains and the pre-trained LM (general knowledge) with the full knowledge (including both the previous domains and the current domain knowledge) to achieve a complementary effect.

3. PROPOSED DAS TECHNIQUE

Continual DAP-training in DAS is based on two main ideas: (1) preserving the important general language knowledge in the LM and the knowledge learned from previous domains to overcome CF, by soft-masking units based on their importance, which also facilitates cross-task knowledge transfer (KT); and (2) encouraging the model to learn complementary representations of the current domain and previous domains to achieve knowledge integration. Figure 1 gives an overview of DAS.

The whole learning consists of two main functions: (i) initialization and (ii) continual learning. Function (i) computes the importance of units for the general language knowledge in the LM; it is done before continual learning starts. Function (ii) performs the continual learning, which consists of two steps: (a) domain training and (b) importance computation. Step (a) takes the importance scores accumulated so far (including those for the general knowledge in the original LM and those for the knowledge learned from previous domains) and the input data of the current domain to learn the domain and to achieve (1) and (2) above, while step (b) computes the importance scores of the units for the current domain for future use. The following sub-sections present each function and step in detail.

3.1. INITIALIZATION: COMPUTING IMPORTANCE OF UNITS TO THE GENERAL KNOWLEDGE

This initialization function computes the importance of units (attention heads and neurons) in the Transformer for the general knowledge in the original LM. The key components of a Transformer are the multi-head attention layer, the intermediate layer and the output layer. Below, we use "layer" or l to denote any of these three layers because our method treats them similarly.

Importance of units in a layer. It has been found that not all units in a layer are important (Michel et al., 2019). We introduce a virtual parameter, g_l, for computing the importance of the units in layer l:

$\hat{o}_l = g_l \otimes o_l,$ (2)

where o_l refers to the output of layer l (which can be any of the three layers mentioned above) and ⊗ refers to element-wise multiplication, with each variable g_{l,i} in g_l corresponding to a unit (a neuron or attention head) in the layer. We call these virtual parameters because each entry of g_l is initialized to 1 and is never updated; we only need the gradient on each parameter to compute the importance of its corresponding unit. We adapt the gradient-based importance detection method in (Michel et al., 2019) for our purpose. Given a dataset D = {(x_n, y_n)}_{n=1}^N of N samples (y_n is the class label of x_n, as (Michel et al., 2019) worked on supervised learning), the importance of the neurons or heads in the layer is estimated with a gradient-based proxy score

$I_l = \frac{1}{N}\sum_{n=1}^{N}\left|\frac{\partial \mathcal{L}_{impt}(x_n, y_n)}{\partial g_l}\right|,$ (3)

where L_impt is a task-specific loss function. Note that the virtual parameter g_l is initialized to all 1's and is never changed: we need only its average gradient ∇_{g_l} (the term within |·| in Eq. 3) over all the data to compute the importance, and we never use the gradient to update the virtual parameter. In training (Sec. 3.2 and Fig. 1 (B)), the virtual parameter can be discarded. The resulting I_l has the same size as g_l, with each entry corresponding to the importance of a unit (a neuron or attention head).
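To make Eqs. 2 and 3 concrete, here is a minimal numpy sketch in which a toy squared-error loss stands in for L_impt and the gradient with respect to the virtual mask g_l is computed analytically; in the actual method this gradient would come from backpropagation through the Transformer:

```python
import numpy as np

def unit_importance(outputs, targets):
    """Gradient-based importance of units (Eq. 3 sketch).

    outputs: (N, H) layer outputs o_l, one row per sample
    targets: (N, H) per-sample targets for a toy squared-error loss,
             standing in for the real task-specific loss L_impt

    We insert a virtual mask g (all ones, never updated) so that
    o_hat = g * o (Eq. 2). With L_n = 0.5 * ||o_hat_n - t_n||^2, the
    gradient w.r.t. g is dL_n/dg = (o_hat_n - t_n) * o_n at g = 1.
    """
    g = np.ones(outputs.shape[1])            # virtual parameters, stay at 1
    o_hat = g * outputs                      # Eq. 2 (element-wise)
    grad_g = (o_hat - targets) * outputs     # per-sample dL/dg at g = 1
    return np.abs(grad_g).mean(axis=0)       # Eq. 3: average absolute gradient

rng = np.random.default_rng(0)
o = rng.normal(size=(8, 4))                  # 8 toy samples, 4 "units"
t = rng.normal(size=(8, 4))
I = unit_importance(o, t)                    # one importance score per unit
```

The sketch only illustrates the shape of the computation: the importance of a unit is the average magnitude of the loss gradient on its (fixed) virtual mask entry, not the magnitude of any weight update.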
Recall that the initialization function is to learn the importance of units for the general knowledge in the LM (denoted I^(0)_l). Although Eq. 3 offers a possible way, it is not directly applicable. If we use the domain data at hand and employ the MLM loss as L_impt, ∇_{g_l} only gives the importance for the domain-specific knowledge. However, to compute the importance of units for the general knowledge in the LM (which is our goal), we would need the original data used in pre-training the LM to compute L_impt. In practice, such data is not accessible to users of the LM. Further, labels are needed in Eq. 3, but our domain corpus is unlabeled in DAP-training. To address these issues, we propose a proxy KL-divergence loss (L_proxy) to replace L_impt for learning the unit importance for the general knowledge.

Proxy KL-divergence loss. We propose to use model robustness as the proxy, i.e., we try to detect units that are important for the LM's robustness. Their gradients, ∇_{g_l}, then indicate the robustness and hence the importance to the LM. Our rationale is as follows: if I^(0)_{l,i} (the importance of unit i in layer l) has a high value, the unit is important to the LM's robustness because a change to it can cause the LM's output to change a lot; it is thus an important unit. In contrast, if I^(0)_{l,i} is small, it is a less important unit. To compute the robustness of the LM, we take a subset of the current domain data {x^sub_n} (no labels are available in DAP-training), input each x^sub_n twice to the LM to obtain two representations of it, and then compute the KL-divergence between them,

$\mathcal{L}_{impt} = \text{KL}(f^{1}_{LM}(x^{sub}_n), f^{2}_{LM}(x^{sub}_n)),$ (4)

where f^1_LM and f^2_LM are the LM with different dropout masks. We do not need to add any additional dropout to implement these two, as the Transformer already has dropout masks on the fully-connected layers and the attention probabilities. Thus, simply feeding the same input to the Transformer twice yields two representations with different dropout masks.
Since dropout is similar to adding noise, the difference between the two representations can be regarded as a measure of the robustness of the LM.
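A minimal numpy sketch of the proxy (Eq. 4), with softmax rows standing in for the LM's output distributions and a hand-rolled inverted dropout standing in for the Transformer's internal dropout; this is an illustration of the idea, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout(x, p):
    """Inverted dropout with a fresh random mask on each call."""
    if p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def proxy_kl_loss(hidden, p=0.1):
    """Proxy robustness loss (Eq. 4 sketch): KL divergence between two
    stochastic 'views' of the same input, produced by different dropout
    masks. A large value means the output is sensitive to the noise."""
    view1 = softmax(dropout(hidden, p))
    view2 = softmax(dropout(hidden, p))
    eps = 1e-12
    return np.sum(view1 * (np.log(view1 + eps) - np.log(view2 + eps)))

h = rng.normal(size=(2, 8))   # toy stand-in for LM representations
loss = proxy_kl_loss(h)
```

In DAS the gradient of this loss on the virtual mask parameters (Eq. 3, with this loss replacing L_impt) gives the general-knowledge importance; here we only show how the two dropout views and the KL term fit together.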

3.2. TRAINING: LEARNING A NEW DOMAIN VIA SOFT-MASKING AND CONTRASTIVE LOSS

Accumulating importance. We accumulate the importance after task t−1 is learned via element-wise max (EMax):

$I^{(\leq t-1)}_l = \text{EMax}(\{I^{(t-1)}_l, I^{(\leq t-2)}_l\}),$ (5)

where t refers to the current task-ID and $I^{(\leq t-2)}_l$ refers to the importance accumulated up to task t−2. We do not need to save $I^{(0)}_l$ and all $\{I^{(k)}_l\}_{k=1}^{t-1}$ for Eq. 5; we only save the incrementally accumulated importance after training each task.

Soft-masking units. Given the accumulated importance $I^{(\leq t-1)}_l$ of layer l and the DAP-training loss L_DAP-train (typically the MLM loss; we also propose an additional loss in Eq. 7), we constrain (or soft-mask) the corresponding gradient (∇_l) flow as follows,

$\hat{\nabla}_l = (1 - I^{(\leq t-1)}_l) \otimes \nabla_l.$ (6)

As mentioned in Sec. 3.1, we expand (by copying) the importance $I^{(\leq t-1)}_l$ to match the dimensions of ∇_l so that it applies to all associated parameters. This is soft-masking, as each element in $I^{(\leq t-1)}_l$ is a real number in [0, 1] (not binary {0, 1}), which gives the model the flexibility to adjust any unit. We note that the soft-masks are applied only in the backward pass, not in the forward pass, which encourages knowledge transfer as each domain training can leverage the knowledge learned from all past domains. To further encourage the model to learn a good representation from both the accumulated knowledge ($I^{(\leq t-1)}_l$) and the full knowledge (both accumulated and current domain knowledge), we introduce a contrastive learning method to encourage complementary representations.

Integrating the previously learned knowledge and the current domain knowledge. Soft-masking helps prevent forgetting of the previously learned knowledge. We want to further encourage knowledge transfer by integrating the new and the learned knowledge. We propose to contrast the previously learned knowledge and the full knowledge (both the previously learned knowledge and the current domain knowledge). Note that the contrasting cannot change the shared past knowledge, as it is protected by soft-masks.
Thus, it effectively pushes the current domain knowledge away so that it is complementary to the past knowledge. This is done based on the current domain data as follows.

Contrasting the learned and full knowledge. We denote the output of the LM without any consideration of importance as o^full, which represents the full knowledge. We further denote the output of the LM multiplied by the importance (i.e., $I^{(\leq t-1)}_l \otimes o_l$) as o^prev, which represents the previously learned knowledge. We contrast the two by using o^full as the anchor and o^full with different dropouts as positive samples (denoted o^full+); o^prev provides the negative instances:

$\mathcal{L}_{\text{contrast}} = -\frac{1}{N}\sum_{n=1}^{N}\log\frac{e^{\text{sim}(o^{full}_n, o^{full+}_n)/\tau}}{\sum_{j=1}^{N}e^{\text{sim}(o^{full}_n, o^{full+}_j)/\tau} + \sum_{j=1}^{N}e^{\text{sim}(o^{full}_n, o^{prev}_j)/\tau}}.$ (7)

Compared to Eq. 1, a second term is added in the denominator: the representations of the previously learned knowledge serve as additional negative instances.

Final loss function. The final DAP-training loss combines the masked language model (MLM) loss, after applying the proposed soft-masking for the general knowledge (Sec. 3.1), and the proposed contrastive loss (λ is a hyper-parameter):

$\mathcal{L}_{\text{DAP-train}} = \mathcal{L}_{\text{MLM}} + \lambda\mathcal{L}_{\text{contrast}}.$ (8)
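A minimal numpy sketch of the contrastive step (Eq. 7), with random vectors standing in for the LM representations and a small perturbation standing in for the second dropout pass; all sizes and values here are illustrative:

```python
import numpy as np

def sim(a, b):
    """Pairwise cosine similarity between two sets of row vectors."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def contrastive_loss(o_full, o_full_pos, o_prev, tau=0.05):
    """Eq. 7 sketch: InfoNCE where o_full is the anchor, a second
    dropout view o_full_pos supplies the positives, and the importance-
    masked 'previously learned' representations o_prev enter the
    denominator as extra negatives."""
    pos = np.exp(sim(o_full, o_full_pos) / tau)   # N x N
    neg = np.exp(sim(o_full, o_prev) / tau)       # N x N, the added term
    denom = pos.sum(axis=1) + neg.sum(axis=1)
    return -np.mean(np.log(np.diag(pos) / denom))

rng = np.random.default_rng(2)
o_full = rng.normal(size=(4, 16))
o_full_pos = o_full + 0.01 * rng.normal(size=(4, 16))  # stand-in for dropout view
o_prev = rng.normal(size=(4, 16))                      # stand-in for masked output
loss = contrastive_loss(o_full, o_full_pos, o_prev)
```

Dropping the `neg` term recovers the standard contrastive loss of Eq. 1; keeping it pushes the anchor away from the previously learned representations, which is the integration effect described above.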

3.3. COMPUTE IMPORTANCE OF UNITS TO THE CURRENT DOMAIN

After training the new/current domain t, we compute the unit importance by applying Eq. 3 to the domain. We do not need any proxy for L_impt as in Eq. 4, because we can directly use the current domain data. Specifically, we randomly sample a subset (its size is a hyper-parameter) of the current domain data, $\{(x^{sub}_n, y^{sub}_n)\}$, where $x^{sub}_n$ is the input and $y^{sub}_n$ is the masked token as in the MLM self-supervised loss. We can then compute the importance $I^{(t)}_l$ by plugging L_MLM into L_impt in Eq. 3. The resulting $I^{(t)}_l$ is used in the next task, by accumulating it with the previously accumulated importance (Eq. 5) and soft-masking the learning (Eq. 6).
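The importance accumulation (Eq. 5) and gradient soft-masking (Eq. 6) referenced above reduce to a few array operations; a minimal numpy sketch with hypothetical importance values:

```python
import numpy as np

def accumulate_importance(I_prev_acc, I_new):
    """Eq. 5: element-wise max keeps a unit protected if it was
    important for ANY previously learned domain (or for the general
    knowledge in the original LM)."""
    return np.maximum(I_prev_acc, I_new)

def soft_mask_gradient(grad, I_acc):
    """Eq. 6: shrink the backward gradient of each unit in proportion
    to its accumulated importance; importance 1 freezes the unit,
    importance 0 leaves its update untouched."""
    return (1.0 - I_acc) * grad

I_general = np.array([0.9, 0.1, 0.4])   # toy importance from initialization
I_domain1 = np.array([0.2, 0.8, 0.3])   # toy importance of domain 1
I_acc = accumulate_importance(I_general, I_domain1)   # unit-wise max

grad = np.array([1.0, 1.0, 1.0])        # toy backward gradient
masked = soft_mask_gradient(grad, I_acc)
```

Note that the mask touches only the backward pass; the forward computation is left untouched, exactly as described in Sec. 3.2.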

4. EXPERIMENTS

We use RoBERTa (Liu et al., 2019) as the LM. Following the standard evaluation setup (Lange et al., 2019), after a domain is trained, its training data is discarded. After all domains are incrementally learned, the final model is evaluated by fine-tuning on the end-tasks of all domains (Gururangan et al., 2020; Dery et al., 2021; Beltagy et al., 2019). The results are averages over 5 random seeds (the domain training order is as they appear in the first row). Due to space limits, the results for different domain orders and the standard deviations are reported in Appendix D and Appendix E, respectively. Non-CL baselines have no forgetting. (2) Some CL baselines, while also achieving some transfer, are weaker than NCL. In general, the CL baselines are all poorer than DAS, as they have no mechanism to encourage knowledge transfer or they have to rely on adapters. (3) Directly learning the domains within the LM helps DAS achieve better results than adapter- and prompt-based methods. DAS is better than the adapter-based systems (DAP-Adapter, NCL-Adapter and HAT-Adapter) and the prompt-based system (DAP-Prompt). This is because adapters and prompts do not have sufficient trainable parameters, which are also randomly initialized and can be hard to train. (4) Using the full LM to learn all tasks, rather than sub-networks (as in the HAT-based methods), makes DAS more effective. HAT performs poorly, indicating it is unsuitable for DAP-training, as discussed in Sec. 1. Even if we use all features (not only the features from the corresponding sub-network), we still get poor results (HAT-All), because the features used in DAP-training (in an LM sub-network) differ from the features used in end-task fine-tuning (features from the whole LM).

Knowledge transfer and forgetting avoidance. To see how the models fare on CF and knowledge transfer, we compare the forgetting rates (forget R.)
(Liu et al., 2020), defined as $\frac{1}{t-1}\sum_{k=1}^{t-1}(A_{k,k} - A_{t,k})$, where $A_{k,k}$ is the end-task accuracy right after its domain k is DAP-trained, and $A_{t,k}$ is the accuracy of the end-task of domain k after DAP-training the last domain t. We average over all end-tasks except the last, as the last domain has no forgetting. The higher the forgetting rate, the more forgetting there is; negative rates indicate positive knowledge transfer. Clearly, DAS has the strongest negative forgetting rate, indicating it does well on both forgetting prevention and knowledge transfer. NCL, NCL-Adapter, DEMIX, EWC, KD and DER++ all suffer from some forgetting. HAT and BCL have no forgetting, but HAT cannot learn well and both are weak in transfer.

Effectiveness of the proxy KL-divergence loss. We use the proxy KL-divergence loss in the initialization function (Sec. 3.1) to compute the importance of units for the general knowledge. We are interested in how good this proxy is and use two kinds of experiments to provide evidence. (1) Comparing with a sample set of D_0. In some cases, the users doing continual DAP-training may have the data D_0 that was used to pre-train the LM. Then we can simply sample a subset of D_0 to compute the parameter importance for the general knowledge in the LM. Since we do not have the D_0 that was used to pre-train RoBERTa, we use the Wiki data (Merity et al., 2017) as the sample set of D_0. We choose it because it is a general dataset with wide topic coverage, it was used to pre-train an LM, and it has a similar size to our domain data (around 700M). We conducted two experiments using this data: (a) DAS (Wiki+MLM), which uses MLM as the loss in the initialization stage to compute the importance of units (to identify the general knowledge), just like any other domain in the continual learning part; and (b) DAS (Wiki+KL), which uses KL-divergence in the initialization stage, just like the proposed proxy method. The results are given in Table 3.
We can see that DAS (Wiki + KL) performs similarly to DAS but outperforms DAS (Wiki + MLM). This indicates that the proposed proxy KL-divergence loss is more effective. MLM actually adapts the LM to the Wikipedia data, which may not be sufficiently representative of the original data used in pre-training the LM; as a result, it ends up identifying knowledge that is suitable only for the Wikipedia data. In contrast, the proposed proxy KL-divergence loss leverages the random dropout masks and measures robustness, which is less tied to a specific domain and thus better reflects the (general) knowledge in the original LM. (2) Comparing general knowledge computed from different domain corpora. Here, we also provide some indirect evidence for the effectiveness of the proxy method for computing the importance of units for the general knowledge in the LM. We conduct a separate non-CL experiment to compare the attention heads' importance score vectors after applying the proxy using the data from different domains. For each domain i, we compare its importance vector with the importance vector of every other domain, and then average the cosine similarities to get the value for domain i. We get 0.92 for Restaurant, 0.91 for each of ACL, AI and Phone, 0.89 for PubMed and 0.92 for Camera. The different domains give similar importance values, which indirectly shows that our proxy can approximately identify the common general knowledge.

Ablation. We want to know whether the proposed (1) initialization (Sec. 3.1), (2) soft-masking, and (3) contrastive learning are helpful. To answer (1), we conduct the ablation DAS (w/o initialization), where we remove the initialization and do the continual learning directly, with no consideration of the general knowledge in the LM. To answer (2), we conduct two ablations: (i) DAS (w/o soft-mask), where we remove the soft-masks and only use contrastive learning based on Eq. 7 (with the second term in the denominator removed); and (ii) DAS (random), which uses randomly generated importance scores for soft-masking and contrastive learning. To answer (3), we conduct two ablations: (i) DAS (w/o contrast), where we remove the contrastive loss and only soft-mask according to the importance; and (ii) DAS (domain-specific), where we contrast the domain-specific and the learned knowledge (Sec. 3.2).
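The forgetting rate used in the comparison above is easy to compute from a matrix of end-task accuracies; a minimal numpy sketch with made-up accuracy values:

```python
import numpy as np

def forgetting_rate(A):
    """A[t, k]: accuracy on the end-task of domain k after DAP-training
    up to domain t (0-indexed). Returns the average of A[k, k] - A[T, k]
    over all domains but the last; a negative value means the later
    domains improved earlier end-tasks (positive transfer)."""
    T = A.shape[0] - 1
    return np.mean([A[k, k] - A[T, k] for k in range(T)])

# toy accuracy matrix for 3 domains (lower triangle filled as training proceeds)
A = np.array([
    [0.80, 0.00, 0.00],
    [0.78, 0.85, 0.00],
    [0.79, 0.86, 0.90],
])
rate = forgetting_rate(A)   # ~0: the small loss on domain 0 offsets the gain on domain 1
```

The zeros above the diagonal are placeholders: an end-task of a domain is only evaluated once that domain has been DAP-trained.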

5. CONCLUSION

This paper proposed a novel method, DAS, for the continual DAP-training of an LM. It has three key ideas: (1) preserving the important previous knowledge by soft-masking units according to their importance, to overcome CF and to facilitate knowledge transfer; (2) using a novel proxy to compute the importance of units for the general knowledge in the LM; and (3) learning complementary representations for knowledge integration. A set of techniques is proposed to achieve them. Extensive experiments showed the effectiveness of DAS. The current approach involves two functions in learning; in future work, we will study how to combine them to further improve the results.

B BASELINES

(7) DEMIX Gururangan et al. (2021) initializes the new adapter with the parameters of the previously trained adapter nearest to the new domain data. It uses the perplexity on a held-out sample to choose the most probable adapter. For fair comparison, we use the same size as $\{x^{sub}_n\}$ for the held-out samples. (8) Hard attention to overcome forgetting (HAT-Adapter) Ke et al. (2021c) is derived from HAT Serrà et al. (2018), the state-of-the-art parameter-isolation method with almost no forgetting. However, HAT requires the task-ID in end-task fine-tuning (DAS works in a domain-agnostic manner and does not need the task-ID; see Sec. 1). HAT also needs to train an additional task embedding to mask each layer of the network, which makes DAP-training inefficient. (9) Continual learning plugin with capsules (BCL) Ke et al. (2021c) is a continual learning model that avoids forgetting and encourages knowledge transfer. It is similar to NCL-Adapter. The difference is that its adapters consist of two modules: one is a capsule network (a new capsule is added once a new domain arrives) to encourage transfer, and the other is similar to HAT to avoid forgetting. As with HAT, the task/domain information is needed in end-task fine-tuning. We replace the backbone network from BERT with RoBERTa for fair comparison.
(10) Continual learning plugin with contrastive transfer (CLASSIC) Ke et al. (2021b) is a continual learning model that avoids forgetting and encourages knowledge transfer via contrastive losses. It is similar to HAT, but three additional contrastive losses are used for distillation, knowledge transfer and supervised contrast. Since DAS works on unlabeled data, we remove the supervised contrastive loss. As with HAT, the task information is needed in end-task fine-tuning. We replace the backbone network from BERT with RoBERTa for fair comparison. (11) Knowledge distillation (KD) Hinton et al. (2015) minimizes the representational deviation between the previously learned representation and the new representation in DAP-training. We compute the KL divergence between the representations (the output before the masked language model prediction head) of each token of the previously DAP-trained LM and the current LM as the distillation loss. (13) DER++ Buzzega et al. (2020) is a recent replay method using distillation to regularize the new task training. We store 16.4K tokens for each learned domain as the memory, which is the largest memory we could use to run the system. (14) HAT Serrà et al. (2018) is applied in the Transformer layers (including the self-attention, intermediate and output layers) rather than in the added adapter layers. An additional task embedding and the task information for end-task fine-tuning are needed.

C IMPLEMENTATION DETAILS

Architecture. We adopt RoBERTa-BASE as our backbone LM. A masked language model head is applied for DAP-training. The end-task fine-tuning of RoBERTa follows the standard practice. For the three ASC tasks (see Table 1), we adopt the ASC formulation in Xu et al. (2019), where the aspect (e.g., "sound") and the review sentence (e.g., "The sound is great") are concatenated via </s>.

Hyper-parameters. Unless otherwise stated, the same hyper-parameters are used in all experiments. The maximum input (sequence) length is set to 164, which is sufficient for all datasets. The Adam optimizer is used for both DAP-training and end-task fine-tuning.

DAP-training.

The learning rate is set to 1e-4 and the batch size to 256. We train for 2.5K steps on each domain, roughly a full pass through the domain data, following Gururangan et al. (2020); Xu et al. (2019). The subset of data $\{x^{sub}_n\}$ used to compute L_impt for determining head importance in Secs. 3.1 and 3.3 is set to 1.64 million tokens, which is sufficient in our experiments. λ in Eq. 8 is set to 1 and τ in Eq. 7 is set to 0.05.

End-task fine-tuning. The learning rate is set to 1e-5 and the batch size to 16. We fine-tune on the end-task datasets for 5 epochs for Restaurant; 10 epochs for ACL, AI and PubMed; and 15 epochs for Phone and Camera. We simply take the results of the last epoch, assuming no validation sets. We empirically found that these numbers of epochs give stable, converged results.
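The settings above can be collected into a configuration fragment for reference (the values come from the text; the key names are our own):

```python
# Hyper-parameters as stated in the text; key names are our own choice.
DAP_TRAIN = {
    "learning_rate": 1e-4,
    "batch_size": 256,
    "steps_per_domain": 2500,               # ~ one pass over each domain
    "importance_subset_tokens": 1_640_000,  # |{x_n^sub}| for L_impt
    "lambda_contrast": 1.0,                 # lambda in Eq. 8
    "tau": 0.05,                            # temperature in Eq. 7
    "max_input_length": 164,
}

END_TASK = {
    "learning_rate": 1e-5,
    "batch_size": 16,
    "epochs": {"Restaurant": 5, "ACL": 10, "AI": 10,
               "PubMed": 10, "Phone": 15, "Camera": 15},
}
```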



For simplicity, we use the term units to mean both attention heads and neurons.

Contrasting the past domains and only the domain-specific knowledge gives poorer results (see Sec. 4.2), as it causes the two types of knowledge to split rather than to integrate.

We use a subset to save computation, as we assume that a DAP-training domain can be very large. In Sec. 4, we show that a subset is sufficient to compute the importance of units for the general knowledge.

Before training, we normalize the importance values in each layer l for a domain k so that the importance scores of all units in the layer have a mean of 0 and a standard deviation of 1. To further facilitate soft-masking, the normalized importance scores are passed through a Tanh activation so that the values lie in [0, 1]. To simplify the notation, we still use $I^{(k)}_l$ to represent the resulting importance.

https://huggingface.co/roberta-base

We use attention heads instead of other units because they are arguably the most important component in a Transformer (Michel et al., 2019; Voita et al., 2019; McCarley et al., 2019).
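One plausible reading of the normalization footnote above, as a numpy sketch; note that tanh alone maps to (−1, 1), so clipping negative values to reach [0, 1] is our own assumption, not something the text specifies:

```python
import numpy as np

def normalize_importance(I):
    """Standardize a layer's importance scores to mean 0 / std 1, then
    squash with tanh. tanh maps to (-1, 1), so we clip negatives to 0
    (an assumption on our part) to land in the [0, 1] range that the
    soft-masking of Eq. 6 expects."""
    z = (I - I.mean()) / I.std()
    return np.clip(np.tanh(z), 0.0, 1.0)

I_raw = np.array([0.1, 0.5, 2.0, 3.0])   # hypothetical raw importance scores
I_norm = normalize_importance(I_raw)
```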



(2) It should encourage knowledge transfer (KT) across domains to achieve improved end-task performance. This requires the system to enable (a) forward transfer, learning a new domain better by leveraging the knowledge from previous domains, and (b) backward transfer, gaining improved performance on previous domains after learning a relevant new domain. (3) It should work without requiring a domain-ID for each end-task fine-tuning.

Figure 1: Illustration of DAS. The red cross indicates that the gradient is not used to update the Transformer but only to compute importance. (A) Initialization (Sec. 3.1) computes the importance of units for the general knowledge in the LM. (B) Domain Training (Sec. 3.2) trains a new domain using the importance scores as soft-masks and contrasts the previously learned knowledge and the full knowledge. (C) Importance Computation (Sec. 3.3) computes the importance of the units for the current domain.

A NEW DOMAIN VIA SOFT-MASKING AND CONTRASTIVE LOSS

Recall that we want to preserve the learned knowledge in the LM during DAP-training of domain t by using the accumulated importance I_l^{(≤t-1)}, which includes both the importance for the general knowledge, I_l^{(0)} (Sec. 3.1), and that for the learned domain-specific knowledge, I_l^{(k)}.
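A minimal sketch of the soft-masking mechanism (our own simplified per-unit formulation; DAS applies this per layer to attention heads and neurons, and the element-wise max accumulation rule is our assumption):

```python
def soft_mask_gradients(grads, accumulated_importance):
    # Scale each unit's gradient by (1 - I_l^{(<=t-1)}): units important
    # to the general or previously learned domain knowledge receive
    # attenuated updates instead of being hard-pruned.
    return [g * (1.0 - imp) for g, imp in zip(grads, accumulated_importance)]

def accumulate_importance(prev, current):
    # One plausible accumulation rule (element-wise max), so that a unit
    # deemed important by any earlier domain stays protected.
    return [max(p, c) for p, c in zip(prev, current)]
```

A unit with importance 1 is fully protected (zero update), while a unit with importance 0 is trained as usual; intermediate values give graded protection, which is what makes the masking "soft".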

With o_n^full, o_n^full+, and o_n^prev, our contrastive loss is (sim(·) is the cosine similarity)

L_contrast = - Σ_{n∈N} log [ exp(sim(o_n^full, o_n^full+)/τ) / Σ_{m∈N} ( exp(sim(o_n^full, o_m^full+)/τ) + exp(sim(o_n^full, o_m^prev)/τ) ) ].
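This contrastive loss can be sketched in pure Python as an InfoNCE-style objective (our reconstruction from the definitions above; in practice this runs on batched tensors with a deep-learning framework):

```python
import math

def cos(u, v):
    # Cosine similarity between two vectors given as lists of floats.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(o_full, o_full_pos, o_prev, tau=0.05):
    """For each anchor o_full[n], the positive is its second forward pass
    o_full_pos[n]; negatives are the other positives in the batch plus all
    o_prev representations (previously learned knowledge)."""
    n_items = len(o_full)
    loss = 0.0
    for n in range(n_items):
        pos = math.exp(cos(o_full[n], o_full_pos[n]) / tau)
        denom = sum(math.exp(cos(o_full[n], o_full_pos[m]) / tau) +
                    math.exp(cos(o_full[n], o_prev[m]) / tau)
                    for m in range(n_items))
        loss += -math.log(pos / denom)
    return loss / n_items
```

When the anchor aligns with its positive and is far from the previous-knowledge representation, the loss approaches zero, pushing the full network's representation to integrate rather than mirror the old knowledge.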

Figure 1 (B) shows a red arrow pointing from o^full to itself, indicating that the positive instances come from inputting the sample twice. The dashed red arrow pointing to o^prev indicates the negative instances, contrasting the full and the previously learned knowledge.

EWC (Kirkpatrick et al., 2017) is a popular regularization-based method that adopts elastic weight consolidation to add an L2 regularization on parameter changes. (13) DER++ (Buzzega et al., 2020) is a replay-based method that augments experience replay with a distillation loss on stored logits.

Table 1: Statistics of the datasets for DAP-training. More details of their end-task supervised learning datasets are given in Appendix A.

End-task macro-F1 (MF1), accuracy, and forgetting rate results for all domains after continual DAP-training of all domains, except for CHEMPROT in the PubMed domain, for which we use micro-F1 following prior work.

Results with the Wiki dataset as the sample set of D_0 (averages of 5 random seeds).

Ablation results -averages of 5 random seeds. See standard deviations in Appendix E.

The ablation results show that the full DAS is the best on average and for most domains, indicating that every component contributes. Additional observations: (1) DAS's gain comes partially from the preserved general knowledge, as DAS (w/o initialization) is poorer on average. (2) Soft-masking helps, as DAS (w/o softmask) is poorer than DAS; this is reasonable because soft-masking preserves the learned domains. Moreover, our gradient-based mask is informative, as DAS (random) is worse than DAS. (3) Contrastive learning is effective, as DAS (w/o contrast) and DAS (domain-specific) are both poorer, indicating that the contrastive learning in DAS helps learn good representations.

ACKNOWLEDGEMENTS

The work of Zixuan Ke, Gyuhak Kim, and Bing Liu was supported in part by a research contract from KDDI, a research contract from DARPA (HR001120C0023), and three NSF grants (IIS-1910424, IIS-1838770, and CNS-2225427).

A DATASETS DETAILS

Table 1 in the main paper has already shown the number of examples in each dataset. Here we provide additional details about the 4 types of end-tasks. (1) (Phone, Camera and Restaurant) Aspect Sentiment Classification (ASC): given an aspect or product feature (e.g., picture quality in a camera review) and a review sentence containing the aspect in a domain or product category (e.g., camera), classify whether the sentence expresses a positive, negative, or neutral (no opinion) sentiment or polarity about the aspect (the Phone and Camera datasets contain only negative and positive polarities). (2) (ACL) Citation intent classification: given a citing sentence (a sentence containing a citation), classify the citation function expressed by the sentence among "background", "motivation", "uses", "extension", "comparison or contrast" and "future". (3) (AI) Relation classification: given a within-sentence word span containing a pair of entities, classify the relation expressed by the span among "feature of", "conjunction", "evaluate for", "hyponym of", "used for", "part of" and "compare". (4) (PubMed) Chemical-protein interaction classification: given a span containing a pair of a chemical and a protein, classify the interaction expressed by the span among "downregulator", "substrate", "indirect-upregulator", "indirect-downregulator", "agonist", "activator", "product of", "agonist-activator", "inhibitor", "upregulator", "substrate product of", "agonist-inhibitor" and "antagonist".

B BASELINE DETAILS

Non-Continual Learning Baselines. Each of these baselines builds a separate model for each task independently; it thus has no knowledge transfer or CF. (4) Prompt tuning (Lester et al., 2021) adds a sequence of real-vector tokens (called virtual tokens or prompt tokens) to the end of the original sequence. In DAP-training, RoBERTa (the LM) is fixed and only the prompt tokens are trained. In end-task fine-tuning, both the LM and the trained prompt are trainable. We initialize 100 tokens and set the learning rate of the prompt tokens to 0.3 in DAP-training, following the setting in Lester et al. (2021).

Continual Learning (CL) Baselines. (5) Naive continual learning (NCL) is a naive extension of Gururangan et al. (2020), which continually/incrementally DAP-trains the LM to learn all domains using the MLM loss with no mechanism to deal with CF. (6) Continual learning with adapters (NCL-Adapter) (Houlsby et al., 2019) is similar to the adapter-based system; the only difference is that the same set of adapters is shared across all domains, rather than using a new adapter for each new domain.

