CONTINUAL PRE-TRAINING OF LANGUAGE MODELS

Abstract

Language models (LMs) have been instrumental for the rapid advance of natural language processing. This paper studies continual pre-training of LMs, in particular, continual domain-adaptive pre-training (or continual DAP-training). Existing research has shown that further pre-training an LM using a domain corpus to adapt the LM to the domain can improve the end-task performance in the domain. This paper proposes a novel method to continually DAP-train an LM with a sequence of unlabeled domain corpora to adapt the LM to these domains to improve their end-task performances. The key novelty of our method is a soft-masking mechanism that directly controls the update to the LM. A novel proxy is also proposed to preserve the general knowledge in the original LM. Additionally, it contrasts the representations of the previously learned domain knowledge (including the general knowledge in the pre-trained LM) and the knowledge from the current full network to achieve knowledge integration. The method not only overcomes catastrophic forgetting, but also achieves knowledge transfer to improve end-task performances. Empirical evaluation demonstrates the effectiveness of the proposed method.

1. INTRODUCTION

Pre-trained language models (LMs) like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have significantly advanced NLP. Recently, LMs have also been used by many continual learning (CL) systems to learn a sequence of end-tasks incrementally (Ke et al., 2021a; Sun et al., 2020; Huang et al., 2021), which we call continual end-task learning. It is also desirable to continually pre-train LMs themselves. This includes (1) continual general pre-training, which incrementally updates the LM using the most recent data that has a distribution similar to the pre-training data, and (2) continual domain-adaptive pre-training, which further pre-trains an LM incrementally to adapt it to a sequence of domains. Note that LM editing (with or without continual learning) (Mitchell et al., 2022), which corrects mistakes learned in the LM, is a special case of continual end-task learning (Kim et al., 2022), as each editing task or group of editing tasks learned together is essentially a task in continual learning, which aims to perform the edits correctly without interfering with or forgetting the other knowledge already learned in the current LM. This paper focuses on continual domain-adaptive pre-training (or continual DAP-training) of LMs. It is known that DAP-training an LM (without continual learning) using a large unlabeled domain corpus before end-task fine-tuning achieves better results (Gururangan et al., 2020; Xu et al., 2019; Ke et al., 2022b). This paper goes a step further to continually learn to improve an LM's ability to handle new or emerging domains or topics without forgetting the skills or knowledge learned in the past. This is important in the real world, where the data shifts constantly and new domains, events or topics keep emerging (Ke et al., 2022b), and the LM needs to be updated to serve the users better. We call this problem continual DAP-training.
Starting from a pre-trained general LM (i.e., an LM that has already been pre-trained on D_0), we incrementally DAP-train it on a sequence of domain corpora D_1, D_2, .... Once a domain is trained, its data is no longer accessible. This is different from conventional continual learning (CL), where each task is an end-task. In the proposed continual DAP-training, each task is an unlabeled domain corpus to be learned. An end-task fine-tunes the continually DAP-trained LM to evaluate its performance. It is worth noting that D_0 is usually a broad or general domain (e.g., News). In practice, a continually DAP-trained LM may be trained by individual users or institutions that have one or more large corpora of particular domains. In such cases, the raw data may not be shared, but the final LM can be shared by all. There are multiple desiderata for a continual DAP-training system: (1) It should not suffer from catastrophic forgetting (CF), i.e., it should perform reasonably well on learned domains. This requires the system (a) to overcome CF for each new domain and (b) to overcome CF for the general language knowledge in the LM. This is important because the knowledge learned from each domain alone will not be sufficient for good end-task performances. (2) It should encourage knowledge transfer (KT) across domains to achieve improved end-task performances. This requires the system to enable (a) forward transfer, learning a new domain by leveraging the knowledge from previous domains, and (b) backward transfer, gaining improved performance on previous domains after learning a relevant new domain. (3) It should work without requiring the domain-ID for each end-task fine-tuning. No existing CL method achieves all of the above. This paper represents a step towards achieving them. The proposed method is called DAS (Continual DA-pre-training of LMs with Soft-masking).
DAS proposes a novel soft-masking mechanism that computes the importance (a real number between 0 and 1) of units (Footnote 1) for general or domain knowledge and soft-masks them based on their importance values to control the backward gradient flow. In the forward pass, soft-masking is not applied, which encourages KT across domains. It does not isolate any sub-network for any domain, so the knowledge in the full LM can be leveraged for end-task fine-tuning. To apply this mechanism, DAS implements two functions: (1) initialization, which computes the importance of units to the general knowledge in the LM without accessing the LM pre-training data (D_0); it is applied to the pre-trained LM before continual learning starts; and (2) continual learning, which DAP-trains each domain while preventing CF on the general and domain knowledge and encouraging cross-domain KT. In (1), it is not obvious how to compute the importance without the pre-training data. DAS proposes a novel proxy based on robustness to compute the importance of units for the general knowledge. In (2), soft-masking is directly applicable because we have the domain data, and the importance can be computed based on gradients, inspired by the pruning literature (Li et al., 2021; Michel et al., 2019). Moreover, DAS contrasts the previously learned knowledge and the full (including both the learned domains and the current domain) knowledge to encourage the current domain representation to learn knowledge that is not already in the knowledge learned from previous domains and to integrate it with the learned knowledge (Footnote 2). In end-task fine-tuning, DAS does not require the domain-ID, as all knowledge is accumulated into the DAP-trained LM. In summary, this work makes the following contributions. (i) It studies the new problem of continual DAP-training and discovers that the full LM is needed for a good continual DAP-training method.
The popular parameter-isolation approach to overcoming CF in conventional CL is unsuitable. (ii) It proposes a novel soft-masking method to overcome CF and to encourage KT, and a contrastive learning based method for knowledge integration. (iii) To preserve the general knowledge in the LM, a novel proxy is also proposed. (iv) Experimental results demonstrate the effectiveness of DAS.
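The gradient soft-masking idea can be made concrete with a minimal NumPy sketch. This is an illustration under simplifying assumptions, not the authors' implementation: each row of a toy weight matrix stands in for one unit (attention head or neuron), and the helper names `importance_from_gradients` and `soft_masked_update` are hypothetical. The key point is that each unit's gradient is scaled by (1 - importance) in the backward pass, while the forward pass uses the unmasked weights.

```python
import numpy as np

def importance_from_gradients(grads):
    # Accumulate gradient magnitude per unit (row) and normalize to [0, 1];
    # a simplified stand-in for the gradient-based importance scores.
    mag = np.abs(grads).sum(axis=1)
    return (mag - mag.min()) / (mag.max() - mag.min() + 1e-12)

def soft_masked_update(weights, grads, importance, lr=0.1):
    # Backward pass only: each unit's gradient is scaled by (1 - importance),
    # so highly important units are softly protected from being overwritten.
    # The forward computation is untouched, allowing cross-domain transfer.
    return weights - lr * (1.0 - importance)[:, None] * grads

# Toy example with 3 units of 2 weights each.
W = np.ones((3, 2))
g = np.full((3, 2), 0.5)
imp = np.array([1.0, 0.5, 0.0])  # unit 0 fully protected, unit 2 free
W_new = soft_masked_update(W, g, imp)
```

Note that, unlike hard parameter isolation, a unit with importance 0.5 still receives half of its gradient, so no sub-network is carved out for any single domain.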

2. RELATED WORK

DAP-training. DAP-training can be achieved by directly updating the LM (Xu et al., 2019; Sun et al., 2019; Lee et al., 2020; Alsentzer et al., 2019; Gururangan et al., 2020; Chakrabarty et al., 2019; Ke et al., 2022b) or by training only a small set of additional parameters. 

Footnote 1: For simplicity, we use the term units to mean both attention heads and neurons.
Footnote 2: Contrasting the past domains and only the domain-specific knowledge gives poorer results (see Sec. 4.2) as it causes the two types of knowledge to split rather than to integrate.
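The contrast between previously learned knowledge and the full network's representation can be illustrated with a generic InfoNCE-style loss. This is a hedged sketch, not DAS's exact objective: the function name `info_nce`, the toy vectors, and the choice of negatives are illustrative assumptions. The full network's representation (anchor) is pulled toward a positive view of the same input and pushed away from representations that hold only the previously learned knowledge, which encourages integration rather than splitting.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=1.0):
    # Generic InfoNCE-style contrastive loss: higher similarity between
    # anchor and positive (relative to the negatives) gives a lower loss.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))

anchor = np.array([1.0, 0.0])       # full-network representation (toy)
aligned = np.array([2.0, 0.0])      # positive view, same direction
orthogonal = np.array([0.0, 1.0])   # positive view, unrelated direction
negs = [np.array([-1.0, 0.0])]      # previously-learned-knowledge-only view
loss_aligned = info_nce(anchor, aligned, negs)
loss_orthogonal = info_nce(anchor, orthogonal, negs)
```

Minimizing such a loss drives the current-domain representation toward knowledge not already captured by the negatives, which is the intuition behind the integration described in Footnote 2.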

For example, Pfeiffer et al. (2020), Wang et al. (2020a) and Ke et al. (2021a;b;c) trained adapters, and Gu et al. (2021) trained a prompt to adapt to a domain. While adapters and prompts can be effective, transferring knowledge among