LANGUAGE MODEL PRE-TRAINING WITH LINGUISTICALLY MOTIVATED CURRICULUM LEARNING

Abstract

Pre-training serves as the foundation of recent NLP models, where a language modeling task is performed over large amounts of text. It has been shown that data affects the quality of pre-training, and curricula based on sequence length have been investigated. We take a linguistic perspective on the curriculum, in which frequent words are learned first and rare words last. This is achieved by replacing hierarchical phrases that contain infrequent words with their constituent labels. Through such syntactic substitutions, a curriculum can be built by gradually introducing words of decreasing frequency. Without modifying the model architecture or introducing external computational overhead, our data-centric method outperforms vanilla BERT on various downstream benchmarks.

1. INTRODUCTION

Pre-trained language models (PLM) have gained much attention and achieved strong results in various NLP tasks (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020). Based on self-supervised learning objectives such as causal language modeling (Peters et al., 2018; Radford et al., 2019), masked language modeling (Devlin et al., 2019), and text-to-text generation (Lewis et al., 2020; Raffel et al., 2020), PLM can learn task-agnostic transferable features from large-scale unlabeled corpora. It has also been shown that PLM can encode syntactic (Hewitt & Manning, 2019; Goldberg, 2019; Wu et al., 2020), semantic (Tenney et al., 2019; Jawahar et al., 2019), and factual (Petroni et al., 2019; Dai et al., 2022) knowledge. To improve the representation power of PLM, much research has been done on setting different training objectives (Zhang et al., 2019; Yang et al., 2019; Liu et al., 2019b), modifying model architectures (Dong et al., 2019; Clark et al., 2020; He et al., 2021), and scaling up the parameter count (Shoeybi et al., 2019; Rae et al., 2021; Fedus et al., 2022; Chowdhery et al., 2022). However, relatively less work considers how the pre-training corpus itself is used: most methods leverage the raw text as a whole (from millions to billions of tokens) and train for multiple epochs, so that, given sufficient data, the training strategy may have a reduced effect.

Recent work has shown the influence of a curriculum for pre-training. Li et al. (2021) propose a sequence length warmup strategy for GPT-2 pre-training, which can improve training stability and efficiency. Similarly, Nagatsuka et al. (2021) split the corpus into blocks of specified sizes for BERT pre-training. These methods focus on changing the sequence length rather than the content, and emphasize convergence speed. Beyond text length, there is a more salient discrepancy between current PLM training and the language learning process of humans.
In particular, humans first learn a limited set of the most common and useful words, then grasp basic syntactic concepts such as part-of-speech, set phrases, and clauses, before recognizing a large number of uncommon words through generalization or their specific usages. Inspired by psycho-linguistic curriculum learning (Elman, 1990; 1993; Bengio et al., 2009), we propose a data-centric approach that progressively pre-trains a language model using a curriculum over reconstructed data. Figure 1 contrasts vanilla masked language model pre-training with our multi-stage curriculum training. Our curriculum consists of m stages (m = 2 in Figure 1(b)), each with an incrementally larger vocabulary. Specifically, we first use constituent (and part-of-speech) labels from the Penn Treebank (Marcus et al., 1993) to replace lower-frequency words, and the model is updated on text composed of the most frequent words and the constituent labels. In this stage, all words are at the same frequency level and are thus trained equally thoroughly. We then gradually introduce less frequent words, letting the model improve further on the basis of previously acquired knowledge. During these stages, the previously learned constituent labels serve as categorical knowledge that guides the learning of infrequent words. Experimental results with BERT show that our method improves pre-training, yielding better performance across tasks including general language understanding, named entity recognition, question answering, part-of-speech tagging, and parsing. Through empirical analysis, we find that our curriculum training mitigates the representation degeneration problem (Gao et al., 2019) in PLM, and that the injected constituent labels encode meaningful linguistic features that bridge word representations across different frequencies. Code and models will be released for further research.
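As a concrete but hypothetical illustration of such a staged vocabulary schedule (the paper does not prescribe a particular formula), one could unlock the frequency-ranked vocabulary linearly across the m stages:

```python
def stage_vocab_sizes(total_vocab, m):
    # Stage k (1-indexed) unlocks the top total_vocab * k / m most
    # frequent words; the full vocabulary is available by stage m.
    return [total_vocab * k // m for k in range(1, m + 1)]
```

For example, with a 30,000-word vocabulary and m = 2, the first stage would train on the top 15,000 words (with rarer words replaced by constituent labels) and the second on all 30,000. Other schedules (e.g. frequency-threshold based) are equally compatible with the method described above.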

2. METHOD

We take BERT (Devlin et al., 2019) as our baseline, which is trained using masked language modeling, one of the most successful self-supervised learning objectives for pre-training ( §2.1). Our method leverages linguistically motivated curriculum learning based on vanilla masked language modeling ( §2.2), with a dedicated data-centric method for stage-wise corpus reconstruction ( §2.3).

2.1. MASKED LANGUAGE MODELING

Masked language modeling (MLM) aims to predict the original target word $w_i$ by modeling the contextualized representation of a randomly masked word in its context:

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i} \log P_\theta(w_i \mid \tilde{w}_i) = -\sum_{i} \log \frac{\exp\left(E(w_i)^\top h_i\right)}{\sum_{j=1}^{|V|} \exp\left(E(w_j)^\top h_i\right)},$$

where $\tilde{w}_i$ is the masked symbol [MASK] in the context, $h_i$ is the corresponding contextualized output, $E(\cdot)$ is the output word embedding, and $|V|$ is the vocabulary size.
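The loss above is a softmax cross-entropy over the output embedding matrix. As a minimal NumPy sketch (not the paper's implementation), with `E` the output embedding matrix, `H` the contextualized outputs at the masked positions, and `targets` the indices of the original words:

```python
import numpy as np

def mlm_loss(E, H, targets):
    """MLM cross-entropy: score each masked hidden state h_i against every
    word embedding, then take -log P(w_i) of the true word at position i."""
    logits = H @ E.T                                  # (num_masked, |V|)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()
```

In practice this is computed by the framework's cross-entropy op over the masked positions only; the sketch just makes the term-by-term correspondence with the equation explicit.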

2.2. MOTIVATION FOR DATA-CENTRIC CURRICULUM TRAINING

During vanilla MLM pre-training, shown in Figure 1(a), the model must predict each word surface independently. Although more common words such as "10000", "million", and "twelve" are updated frequently, lower-frequency words such as "1981", "18th", and "trillion" may receive far less training signal. This discrepancy in word frequency can harm model training in tasks such as text classification and machine translation (Gong et al., 2018). For language modeling, previous studies have shown that lower-frequency words are learned poorly (Schick & Schütze, 2020), which can also degrade training for all other words (Yu et al., 2022). We address these issues with a psycho-linguistically motivated curriculum learning method built on two rules: 1) common words first, rare words next, and 2) models learn under structural constraints, i.e., syntax. To build a curriculum schedule that satisfies these rules, we inject constituent labels (Marcus et al., 1993) into raw text, replacing different words at different training stages according to their corresponding constituent structures.
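A simplified sketch of this stage-wise corpus reconstruction is given below. It is a word-level variant only: the paper replaces whole hierarchical phrases with constituent labels, whereas this illustration (the function name `build_stage_corpus` is ours) substitutes individual rare words with their POS tags:

```python
from collections import Counter

def build_stage_corpus(tagged_sentences, stage_vocab_size):
    """Keep the stage_vocab_size most frequent words; replace every rarer
    word with its POS/constituent label, so a given stage sees only common
    words plus label symbols (all at comparable frequency levels)."""
    counts = Counter(w for sent in tagged_sentences for w, _ in sent)
    kept = {w for w, _ in counts.most_common(stage_vocab_size)}
    return [[w if w in kept else tag for w, tag in sent]
            for sent in tagged_sentences]
```

Raising `stage_vocab_size` across stages gradually restores the original text, so the model first learns the label "concepts" and later the rare words they cover.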




[Figure 1 example contexts: "It covers about [MASK] square meters."; "Dexter Devon Reid Jr. (born March 18, [MASK]) ..."; "With more than 3 [MASK] members ..."; "... after being a faculty member for [MASK] years."; "The [MASK] century was a tumultuous time ..."; "Over 400 [MASK] tokens have been burned."]

Figure 1: An example of (a) vanilla masked language modeling and (b) our method using two-stage curriculum training. In the first stage, we replace an original lower-frequency target word, such as "trillion", with the constituent label CD, which stands for cardinal number. The representations can be better updated in the later training stage, after the model has acquired the "concept" of CD.

