CONTINUAL PRE-TRAINING OF LANGUAGE MODELS

Abstract

Language models (LMs) have been instrumental to the rapid advance of natural language processing. This paper studies continual pre-training of LMs, in particular, continual domain-adaptive pre-training (or continual DAP-training). Existing research has shown that further pre-training an LM on a domain corpus to adapt it to that domain improves end-task performance in the domain. This paper proposes a novel method to continually DAP-train an LM with a sequence of unlabeled domain corpora, adapting the LM to these domains to improve their end-task performance. The key novelty of our method is a soft-masking mechanism that directly controls the update to the LM. A novel proxy is also proposed to preserve the general knowledge in the original LM. Additionally, the method contrasts the representations of the previously learned domain knowledge (including the general knowledge in the pre-trained LM) and the knowledge from the current full network to achieve knowledge integration. The method not only overcomes catastrophic forgetting but also achieves knowledge transfer to improve end-task performance. Empirical evaluation demonstrates the effectiveness of the proposed method.

1. INTRODUCTION

Pre-trained language models (LMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have significantly advanced NLP. Recently, LMs have also been used by many continual learning (CL) systems to learn a sequence of end-tasks incrementally (Ke et al., 2021a; Sun et al., 2020; Huang et al., 2021), which we call continual end-task learning. It is also desirable to continually pre-train LMs themselves. This includes (1) continual general pre-training, which incrementally updates the LM using the most recent data that has a distribution similar to that of the pre-training data, and (2) continual domain-adaptive pre-training, which further pre-trains an LM incrementally to adapt it to a sequence of domains. Note that LM editing (with or without continual learning) (Mitchell et al., 2022), which corrects mistakes learned in the LM, is a special case of continual end-task learning (Kim et al., 2022): each editing task, or a group of editing tasks learned together, is essentially a task in continual learning that aims to perform the edits correctly without interfering with or forgetting the other knowledge already learned in the current LM. This paper focuses on continual domain-adaptive pre-training (or continual DAP-training) of LMs. It is known that DAP-training an LM (without continual learning) on a large unlabeled domain corpus before end-task fine-tuning achieves better results (Gururangan et al., 2020; Xu et al., 2019; Ke et al., 2022b). This paper goes a step further and continually learns to improve an LM's ability to handle new or emerging domains or topics without forgetting the skills or knowledge learned in the past. This is important in the real world, where data shifts constantly and new domains, events, and topics keep emerging (Ke et al., 2022b), so the LM needs to be updated to serve its users better.
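The soft-masking idea mentioned above can be illustrated with a minimal sketch: gradients for units that are important to previously learned knowledge are attenuated rather than hard-blocked, so important parameters change little while unimportant ones adapt freely. The function name, dictionary layout, and the particular importance values here are illustrative assumptions, not the paper's exact formulation (which computes importance scores per unit, including via a proxy for the general knowledge in the LM).

```python
import numpy as np

def soft_mask_gradients(grads, importance):
    """Scale each unit's gradient by (1 - importance).

    `importance` holds per-unit scores in [0, 1] accumulated from
    previously learned domains: 1.0 means fully protected (no update),
    0.0 means free to adapt. Names and shapes are illustrative only.
    """
    return {name: g * (1.0 - importance[name]) for name, g in grads.items()}

# Toy example: two "units" with different importance to past domains.
grads = {
    "ffn.layer0": np.array([0.5, -0.2]),
    "ffn.layer1": np.array([1.0, 0.4]),
}
importance = {
    "ffn.layer0": np.array([0.9, 0.0]),  # first unit mostly protected
    "ffn.layer1": np.array([0.5, 0.5]),  # both units partially protected
}
masked = soft_mask_gradients(grads, importance)
# masked["ffn.layer0"] -> [0.05, -0.2]; masked["ffn.layer1"] -> [0.5, 0.2]
```

Because the mask is soft, the mechanism permits knowledge transfer (partially important units still receive some update) while mitigating catastrophic forgetting, in contrast to hard parameter-freezing approaches.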

