LANGUAGE MODEL PRE-TRAINING WITH LINGUISTICALLY MOTIVATED CURRICULUM LEARNING

Abstract

Pre-training serves as the foundation of recent NLP models, where a language modeling task is performed over large amounts of text. It has been shown that data affects the quality of pre-training, and curricula have been investigated with respect to sequence length. We consider a linguistic perspective on the curriculum, where frequent words are learned first and rare words last. This is achieved by replacing hierarchical phrases that contain infrequent words with their constituent labels. Through such syntactic substitutions, a curriculum can be built by gradually introducing words at decreasing frequency levels. Without modifying model architectures or introducing external computational overhead, our data-centric method yields better performance than vanilla BERT on various downstream benchmarks.
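The staged substitution described above can be sketched in a few lines. The following is a minimal illustration under stated assumptions (a toy corpus, POS tags standing in for full constituent labels, and hand-picked frequency thresholds), not the paper's actual pipeline:

```python
from collections import Counter

# Toy corpus: each "sentence" is a list of (token, label) pairs. The real
# method uses constituent (and part-of-speech) labels from treebank-style
# parses; bare POS tags are a simplification for illustration.
corpus = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
     ("on", "IN"), ("the", "DT"), ("mat", "NN")],
    [("the", "DT"), ("cat", "NN"), ("chased", "VBD"),
     ("a", "DT"), ("ferret", "NN")],
]

# Corpus-level word frequencies drive the curriculum.
freq = Counter(tok for sent in corpus for tok, _ in sent)

def stage_text(sent, min_count):
    """Keep words seen at least min_count times; replace rarer
    words with their syntactic label."""
    return [tok if freq[tok] >= min_count else tag for tok, tag in sent]

# Decreasing thresholds yield curriculum stages with incrementally
# larger vocabularies: early stages contain only frequent words plus
# labels; the final stage (threshold 1) restores the original text.
stages = {t: [stage_text(s, t) for s in corpus] for t in (2, 1)}
```

Training then proceeds stage by stage, updating the model first on the heavily substituted text and last on the original text.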

1. INTRODUCTION

Pre-trained language models (PLMs) have gained much attention and achieved strong results in various NLP tasks (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020). Based on self-supervised learning objectives such as causal language modeling (Peters et al., 2018; Radford et al., 2019), masked language modeling (Devlin et al., 2019), and text-to-text generation (Lewis et al., 2020; Raffel et al., 2020), PLMs can learn task-agnostic transferable features from large-scale unlabeled corpora. It has also been shown that PLMs can encode syntactic (Hewitt & Manning, 2019; Goldberg, 2019; Wu et al., 2020), semantic (Tenney et al., 2019; Jawahar et al., 2019), and factual (Petroni et al., 2019; Dai et al., 2022) knowledge. To improve the representation power of PLMs, much research has been devoted to setting different training objectives (Zhang et al., 2019; Yang et al., 2019; Liu et al., 2019b), modifying model architectures (Dong et al., 2019; Clark et al., 2020; He et al., 2021), and scaling up the parameter count (Shoeybi et al., 2019; Rae et al., 2021; Fedus et al., 2022; Chowdhery et al., 2022). However, relatively little work considers how the pre-training corpus itself is used: most methods leverage the raw text as a whole (from millions to billions of tokens) and train for multiple epochs, so given sufficient data, the training strategy may have a reduced effect.

Recent work has shown the influence of a curriculum for pre-training. Li et al. (2021) propose a sequence length warmup strategy for GPT-2 pre-training, which improves training stability and efficiency. Similarly, Nagatsuka et al. (2021) split the corpus into blocks of specified sizes for BERT pre-training. These methods focus on changing the sequence length rather than the content, and emphasize convergence speed. Beyond text length, there is a more salient discrepancy between current PLM training and the way humans learn language.

In particular, we learn only a limited set of the most common and useful words at the beginning, then grasp basic syntactic concepts such as parts of speech, set phrases, and clauses, before recognizing a large number of uncommon words via generalization or their specific usages. Inspired by psycho-linguistic curriculum learning (Elman, 1990; 1993; Bengio et al., 2009), we propose a data-centric approach that progressively pre-trains a language model using a curriculum over reconstructed data. An example contrasting masked language model pre-training with our multi-stage curriculum training is shown in Figure 1. Our curriculum consists of m stages (m = 2 in Figure 1(b)), each with an incrementally larger vocabulary. Specifically, we first use constituent (and part-of-speech) labels from the Penn Treebank (Marcus et al., 1993) to replace the lower-frequency words, and the model updates on text composed of the most frequent words and the constituent labels. In this stage, all words are at the same frequency level, and thus

