LANGUAGE MODEL PRE-TRAINING WITH LINGUISTICALLY MOTIVATED CURRICULUM LEARNING

Abstract

Pre-training serves as the foundation of recent NLP models, where a language modeling task is performed over large amounts of text. It has been shown that data affects the quality of pre-training, and curricula have been investigated with respect to sequence length. We consider a linguistic perspective on the curriculum, where frequent words are learned first and rare words last. This is achieved by replacing hierarchical phrases that contain infrequent words with their constituent labels. Through such syntactic substitutions, a curriculum can be built by gradually introducing words of decreasing frequency. Without modifying model architectures or introducing extra computational overhead, our data-centric method outperforms vanilla BERT on various downstream benchmarks.

1. INTRODUCTION

Pre-trained language models (PLM) have gained much attention and achieved strong results in various NLP tasks (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020). Based on self-supervised learning objectives such as causal language modeling (Peters et al., 2018; Radford et al., 2019), masked language modeling (Devlin et al., 2019), and text-to-text generation (Lewis et al., 2020; Raffel et al., 2020), PLM can learn task-agnostic transferable features from large-scale unlabeled corpora. It has also been shown that PLM can encode syntactic (Hewitt & Manning, 2019; Goldberg, 2019; Wu et al., 2020), semantic (Tenney et al., 2019; Jawahar et al., 2019), and factual (Petroni et al., 2019; Dai et al., 2022) knowledge. For improving the representation power of PLM, much research has been done on setting different training objectives (Zhang et al., 2019; Yang et al., 2019; Liu et al., 2019b), modifying model architectures (Dong et al., 2019; Clark et al., 2020; He et al., 2021), and scaling up the parameter count (Shoeybi et al., 2019; Rae et al., 2021; Fedus et al., 2022; Chowdhery et al., 2022). However, relatively little work considers how the pre-training corpus is used: most methods take the raw text as a whole (from millions to billions of tokens) and train on it for multiple epochs; given sufficient data, the training strategy may thus appear to have a reduced effect. Recent work has shown the influence of a curriculum for pre-training. Li et al. (2021) propose a sequence length warmup strategy for GPT-2 pre-training, which can improve training stability and efficiency. Similarly, Nagatsuka et al. (2021) split the corpus into blocks of specified sizes for BERT pre-training. These methods focus on changing the sequence length rather than the content, and emphasize convergence speed. Beyond text length, there is a more salient discrepancy between current PLM training and the language learning process of humans.
In particular, we only learn a limited set of the most common and useful words at the beginning; we then grasp basic syntactic concepts such as part-of-speech, set phrases, and clauses, before recognizing a large number of uncommon words via generalization or their specific usages. Inspired by psycho-linguistic curriculum learning (Elman, 1990; 1993; Bengio et al., 2009), we propose a data-centric approach that progressively pre-trains a language model using a curriculum built on reconstructed data. A contrast between masked language model pre-training and our multi-stage curriculum training is shown in Figure 1. Our curriculum consists of m stages (m = 2 in Figure 1 (b)), each with an incrementally larger vocabulary. Specifically, we first use constituent (and part-of-speech) labels from the Penn Treebank (Marcus et al., 1993) to replace the lower-frequency words, and the model is updated on text composed of the most frequent words and the constituent labels. In this stage, all words are at the same frequency level, and are thus trained equally thoroughly. We then gradually introduce less frequent words, letting the model further improve based on previously acquired knowledge. During this stage, the previously learned constituent labels can serve as categorical knowledge to guide the learning of infrequent words. Experimental results using BERT show that our method improves pre-training, giving better performance across tasks including general language understanding, named entity recognition, question answering, part-of-speech tagging, and parsing. Through empirical analysis, we find that our curriculum training can mitigate the representation degeneration problem (Gao et al., 2019) in PLM, and that the injected constituent labels encode meaningful linguistic features that bridge word representations across different frequencies. Code and model will be released for further research.

2. METHOD

We take BERT (Devlin et al., 2019) as our baseline, which is trained using masked language modeling, one of the most successful self-supervised learning objectives for pre-training (§2.1). Our method leverages linguistically motivated curriculum learning based on vanilla masked language modeling (§2.2), with a dedicated data-centric method for stage-wise corpus reconstruction (§2.3).

2.1. MASKED LANGUAGE MODELING

Masked language modeling (MLM) aims to predict the original target word $w_i$ from the contextualized representation of its randomly masked occurrence:

$$\mathcal{L}_{\text{MLM}} = -\sum_i \log P_\theta(w_i \mid \tilde{w}_i) = -\sum_i \log \frac{\exp(E(w_i)^\top h_i)}{\sum_{j=1}^{|V|} \exp(E(w_j)^\top h_i)},$$

where $\tilde{w}_i$ denotes the context in which $w_i$ is replaced by the [MASK] symbol, $E(\cdot)$ is the word embedding table, $|V|$ is the vocabulary size, and $h_i$ is the corresponding contextualized output.
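The objective above can be illustrated at a single masked position with a minimal pure-Python sketch (toy vocabulary and hand-set vectors; the names and numbers are ours, not from the paper):

```python
import math

def mlm_loss_at_position(h, E, target_idx):
    """Negative log-likelihood of the target word at one masked position.

    h: contextualized output vector at the [MASK] position.
    E: embedding table, one row per vocabulary word.
    target_idx: index of the original (unmasked) word.
    """
    # logits[j] = E(w_j)^T h, the dot product with each word embedding
    logits = [sum(e_j * h_j for e_j, h_j in zip(e, h)) for e in E]
    log_norm = math.log(sum(math.exp(l) for l in logits))
    return -(logits[target_idx] - log_norm)  # -log softmax(target)

# Toy example: 3-word vocabulary, 2-dimensional embeddings.
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
h = [2.0, 0.1]  # pretend contextual output pointing toward word 0
loss = mlm_loss_at_position(h, E, target_idx=0)
```

A word whose embedding aligns with the contextual output receives a low loss; words pointing away from it (e.g., word 2 here) receive a much higher one, which is the training signal that frequency imbalance skews.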

2.2. MOTIVATION FOR DATA-CENTRIC CURRICULUM TRAINING

During vanilla MLM pre-training, shown in Figure 1 (a), the model predicts each word surface independently. Although common words such as "10000", "million", and "twelve" are updated frequently, lower-frequency words such as "1981", "18th", and "trillion" receive far less training signal. This discrepancy in word frequency can harm model training in tasks such as text classification and machine translation (Gong et al., 2018). For language modeling, previous studies have shown that lower-frequency words are learned poorly (Schick & Schütze, 2020), which can also degenerate training for all other words (Yu et al., 2022). We consider a psycho-linguistically motivated curriculum learning method with two main rules to address these issues: 1) common words first, rare words next; and 2) models are learned with structural constraints, i.e., syntax. To build a curriculum schedule that satisfies these rules, we inject constituent labels (Marcus et al., 1993).


Figure 2: Illustration of reconstructing the sentence "Yakitori is a Japanese type of skewered chicken." in our four stages of curriculum training. At each stage, we use the corresponding constituent labels (colored in blue) to replace the original words according to their overall frequency.

2.3. STAGE-WISE CORPUS RECONSTRUCTION

Figure 2 shows an example of injecting constituent labels into a sentence. In the first stage, we keep only the most frequent words ("is", "a", "type", "of") while replacing the others with their corresponding constituent labels; these labels are used directly as normal words in the corpus, and can also be randomly masked and predicted. In the second stage, we allow medium-frequency words ("chicken") to appear together with the most frequent words and the remaining labels. In the third stage, we add the low-frequency words ("Japanese", "skewered"), and the data is close to the original format, except for labels that indicate rare words ("Yakitori"). In the last stage, we use the original corpus for training. In practice, we set the word frequency ranking intervals of ∼0.5K, 0.5K∼3K, 3K∼18K, and 18K∼ as the high-, medium-, and low-frequency words and the remaining rare words, respectively. The number of training stages and the frequency intervals are set roughly according to the word distribution and the vocabulary size of the embedding table; we leave the optimization of these settings to future work. To simplify the implementation of our curriculum, we directly use the wordfreq library (Speer et al., 2018) and ignore the statistics of sub-words (Sennrich et al., 2016; Wu et al., 2016) after tokenization. Thus we can process the corpus directly without considering the distinction of word distributions across domains, and avoid selecting among tokenizers.
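The stage-wise replacement can be sketched as follows. This is a word-level simplification with toy frequency ranks and a precomputed word-to-label map (in the paper, ranks come from the wordfreq library and labels from hierarchical Benepar parses, so phrases can also be replaced as a whole; all names here are ours):

```python
# Frequency-rank thresholds for the four stages
# (from the paper: ~0.5K, 0.5K-3K, 3K-18K, 18K+).
STAGE_RANK_LIMIT = {1: 500, 2: 3_000, 3: 18_000, 4: float("inf")}

def reconstruct(tokens, rank, label, stage):
    """Replace every word whose frequency rank exceeds the stage limit
    with its constituent/POS label; stage 4 returns the raw text."""
    limit = STAGE_RANK_LIMIT[stage]
    return [w if rank.get(w, float("inf")) <= limit else label[w]
            for w in tokens]

# Toy statistics; real ranks come from wordfreq and labels from a parser.
rank = {"is": 9, "a": 6, "type": 400, "of": 4, "chicken": 1200,
        "Japanese": 5000, "skewered": 17000, "Yakitori": 200000, ".": 1}
label = {"Yakitori": "NNP", "Japanese": "JJ",
         "skewered": "JJ", "chicken": "NN"}
sent = ["Yakitori", "is", "a", "Japanese", "type", "of",
        "skewered", "chicken", "."]
```

For the running example, stage 1 yields "NNP is a JJ type of JJ NN .", stage 2 restores "chicken", stage 3 restores "Japanese" and "skewered", and stage 4 is the original sentence.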

3. EXPERIMENTS

3.1. PRE-TRAINING

We follow the setup of the BERT-base-cased architecture from Devlin et al. (2019). The model is a 12-layer Transformer encoder with a hidden size of 768 and 12 attention heads. English WIKIPEDIA and the BOOKCORPUS (Zhu et al., 2015) are used as the pre-training data. We train our model with the AdamW (Loshchilov & Hutter, 2019) optimizer for 1M steps with a learning rate of 1e-4, a batch size of 256, a warmup ratio of 0.01, and mixed precision, using 8×32GB V100 GPUs. Following the recipes of Liu et al. (2019c) and Izsak et al. (2021), we do not use the next sentence prediction objective. For offline data reconstruction, we use Benepar (Kitaev & Klein, 2018) for parsing. In each of the first three stages, we train our model on the reconstructed corpus for 200K steps (i.e., a total of 600K steps). We then use the raw corpus for the remaining 600K∼1M training steps. Since we add constituent labels such as NP, VP, and JJ (see the full list in Appendix A) to the text, we enlarge the embedding table by treating them as normal tokens, making our vocabulary slightly larger (from 28,996 to 29,051). The 55 externally added embeddings can be discarded after pre-training.
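The 1M-step schedule described above can be expressed as a simple step-to-stage mapping (a sketch; the function and parameter names are ours):

```python
def curriculum_stage(step, stage_steps=200_000, total_steps=1_000_000):
    """Map a global training step to its curriculum stage.

    Stages 1-3 each train on a reconstructed corpus for `stage_steps`
    updates; the remaining steps (600K to 1M in the paper) use the raw
    corpus, i.e., stage 4.
    """
    assert 0 <= step < total_steps
    return min(step // stage_steps + 1, 4)
```

The data loader would consult this mapping to select which of the four (pre-built, offline-reconstructed) corpora to draw batches from at each step.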

3.2. DOWNSTREAM TASKS AND DATASET

We evaluate on general tasks including natural language understanding, named entity recognition, and question answering. Since our method uses text mixed with syntax-related labels during pre-training, we also evaluate on syntax-related tasks such as part-of-speech tagging and parsing. Statistics of the datasets are shown in Appendix B. GLUE. The GLUE benchmark (Wang et al., 2019a) is used for evaluating general language understanding. We compare on sub-tasks including MNLI (Williams et al., 2018), QQP (Chen et al., 2018), QNLI (Rajpurkar et al., 2016), SST-2 (Socher et al., 2013), CoLA (Warstadt et al., 2019), STS-B (Cer et al., 2017), MRPC (Dolan & Brockett, 2005), and RTE (Bentivogli et al., 2009). Named Entity Recognition. We use the CoNLL2003 dataset (Tjong Kim Sang & De Meulder, 2003) for named entity recognition; the entity labels include PER, LOC, ORG, and MISC. Question Answering. Two versions of the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016; 2018) are used. SQuAD 1.1 aims to predict the answer span in the passage; SQuAD 2.0 additionally allows the possibility that no answer exists in the paragraph. Part-of-Speech Tagging. The Wall Street Journal (WSJ) portion of the Penn Treebank (Marcus et al., 1993) is used for POS tagging. We follow Manning (2011) in selecting sections 0-18 as the training set, 19-21 as the development set, and 22-24 as the test set. Constituency Parsing. WSJ is also used for constituency parsing, where we use the standard splits: sections 02-21 for training, 22 for development, and 23 for testing.

3.3. FINE-TUNING

For constituency parsing, we use the self-attentive encoder (Kitaev & Klein, 2018). We compare fine-tuned results using different pre-trained models: 1) the BERT-base-cased checkpoint released by Google, denoted as BERT; 2) our model trained from scratch, denoted as BERT-reimp, whose main difference from BERT is that we do not use the next sentence prediction objective; 3) our model trained from scratch with curriculum learning, denoted as BERT-CL, which differs from BERT-reimp only in the training corpus. For our results, we report the average of five runs with different seeds.

3.4. RESULTS

Table 1 shows the results on the GLUE benchmark. We find that BERT-CL consistently outperforms BERT-reimp across all tasks, showing that our curriculum pre-training is useful. Among all models, BERT-CL gives the best averaged results on both the development and test sets. The largest improvement (+12.2% compared with BERT) is on CoLA, which asks for judgments of the linguistic acceptability of sentences. This task benefits from our curriculum since we offer syntactic labels during pre-training; nevertheless, our method also improves general language understanding tasks such as natural language inference and sentiment analysis. Table 2 shows the results for named entity recognition, where we also list the result of Gui et al. (2020) for reference. The improvement over BERT and BERT-reimp may come from the fact that a named entity usually forms an NN or NP structure in the constituency tree, and such knowledge can be better acquired in our preliminary curriculum training stages. Table 3 shows the results for question answering. Our model gives the best results on both datasets (+0.49/+0.36 and +1.83/+1.86 over BERT-reimp). Compared with sentence-level classification and token-level labeling tasks, question answering is more challenging since it requires understanding both the query and the passage with long-term dependencies. We hypothesize that the improvement comes from span-based answers usually forming common constituent structures such as NP and CD, and these features are quite useful for answering the majority of questions like "what...", "where...", "how many...", and "when...". Table 4 shows the results for POS tagging. Using the full training set, our model gives better results than BERT and BERT-reimp. Since WSJ POS tagging is a less complicated task with rich training resources, we also 1) use less training data, with 2% to 75% of the samples, and 2) fix the model parameters while training only a linear classifier upon the contextualized output, which is also called probing (Conneau et al., 2018; Liu et al., 2019a).
We find that BERT-CL still consistently gives better results under low-resource and probing settings. Table 5 shows the results for constituency parsing. Compared with BERT and BERT-reimp, our model gives absolute improvements of +0.30 and +0.37 F1, respectively. The advantage on the complete match metric is more significant, where BERT-CL gives +2.47 and +1.30 absolute improvements, respectively. Note that although we leverage a constituency parser for building reconstructed data mixed with constituent labels, the improvement is non-trivial since we use general-purpose MLM training instead of specifically augmenting training data for parsing. To analyze the influence of the corpus, we evaluate on the GLUE and CoNLL2003 NER test sets 1) using only WIKIPEDIA, and 2) adding CC-NEWS (Hamborg et al., 2017) for pre-training. Results are shown in Table 6. We can see that BERT-CL still outperforms BERT-reimp when changing the pre-training corpus, showing that the curriculum remains useful when applied to different corpora. We also compare with a length-based curriculum; results are shown in Table 7. We find that the length-based curriculum does not help much on downstream tasks, while our method performs significantly better on both GLUE and CoNLL2003 NER. This shows that our content-based curriculum is not only closer to the human language learning process, but also more helpful for model training. To further evaluate the generalizability of our method, we also apply it to 1) different model settings, including the larger RoBERTa-large model and the generative-style language model GPT-2, and 2) different curriculum training schedules. We find that our method also generalizes to a larger model and to GPT-2-style causal language model training, and that the curriculum schedule affects the overall performance. Detailed results and discussion can be found in Appendix D.

4. ANALYSIS

We analyze the possible reasons behind the performance advantage of BERT-CL, discussing how data-centric curriculum training helps language model pre-training.

4.1. REPRESENTATION DEGENERATION OF LANGUAGE MODEL

Existing work shows the representation degeneration problem of language models (Gao et al., 2019; Ethayarajh, 2019; Wang et al., 2019b; Cai et al., 2020; Biś et al., 2021; Yu et al., 2022), where the word embeddings or contextualized outputs are highly anisotropic, which may limit their representation power. This problem also exists when training static word embeddings (Mu et al., 2018). Figure 3 visualizes the evolution of word embeddings over training iterations using PCA. In the top row, we find that the embeddings of BERT degenerate quickly in the early stage, where the overall distribution falls into a relatively narrow angle. The low-frequency words are significantly separated from the others, and the overall shape does not change much over the training iterations. The evolution of the word distribution in BERT-CL is very different. Since we incorporate words of different frequency intervals at different stages, they gradually form their clusters stage by stage until 600K steps. When using the raw corpus for the last 400K steps of training, we find that the low-frequency words are no longer represented separately, and the different clusters come close to each other. Finally, there is no significant borderline between words of different frequencies, and the overall distribution is also more uniform than that of BERT. In addition to visualization, we use a measure of isotropy from Mu et al. (2018) and Rajaee & Pilehvar (2021) for evaluation. Details and results are shown in Appendix E; we find quantitatively that our curriculum training leads to a more isotropic representation space. One of the reasons behind the representation degeneration problem is the frequency discrepancy between words. For example, Gong et al. (2018) find that word embeddings are heavily biased towards word frequency, and propose an adversarial training method to learn frequency-agnostic representations. Gao et al. (2019) theoretically show that the problem can be caused by the large number of rarely appearing tokens, and use a cosine regularization term to normalize the distribution. Yu et al. (2022) leverage an adaptive gradient gating mechanism for rare-token training. Although these methods alleviate the degeneration problem and improve performance in pure language modeling or machine translation tasks, they require additional computational costs and have not been used in pre-training. In contrast, we use a data-centric curriculum pre-training approach that introduces constituent labels and decouples words of different frequencies by reformulating the corpus.

4.2. INTERMEDIATE RESULTS OF CURRICULUM TRAINING

In Figure 4, we show GLUE benchmark test set results using different checkpoints during BERT pre-training with and without curriculum training. We find that the BERT model gives relatively higher performance in the initial stages for almost all tasks. However, the improvement becomes limited as the training step increases, and is inconsistent for tasks such as SST-2, CoLA, STS-B, and MRPC. For BERT-CL, the performance is lower at the very beginning, but then increases significantly and stably, especially over the first 600K training steps. This shows that curriculum training can steadily boost the capability of the model. We also find that, for MNLI, QNLI, CoLA, and RTE, BERT-CL (600K) already performs on par with or better than BERT (1M). This shows that the reconstructed data can steer PLM capability for certain tasks at a lower training cost.

4.3. VISUALIZATION OF CONSTITUENT LABEL EMBEDDINGS

The added constituent labels are heavily used during the early pre-training stages for data reconstruction, and their corresponding representations are updated through their parameters in the extended embedding table. Although the externally added parameters can be discarded during fine-tuning, we are still interested in the role of these constituent labels during curriculum training. Figure 5 visualizes the 2D distributions of the constituent label embeddings. In the very early stages (steps 0∼50K), constituent structures are learned quickly: some small meaningful clusters emerge and some relationships are mined, for example the groups of nouns and verbs, and the similarity between the JJR/JJS and RBR/RBS pairs. The discriminating distribution becomes clearer as the number of training steps increases, where the nouns and verbs (also the most common constituent structures) are gradually separated from each other. This shows that our model can learn meaningful concepts of these constituent structures during pre-training.

4.4. CONSTITUENT LABELS SERVE AS ANCHORS TO BRIDGE TOKEN LEARNING

In the initial stage, the constituent labels are combined with the most frequent words so that the model can learn some syntax rules, the basic usage of frequent words or terms, and their interactions. Since we keep only the high-frequency words and a set of labels, the vocabulary size is much smaller, and the training signal received by each token is enriched and more uniform. This initial stage offers the fundamental capability for language understanding; we then allow more medium- and low-frequency words to participate in pre-training, combined with the constituent labels and most frequent words that have already been well learned. When infrequent words are added to replace the constituent labels, since the words and labels share a similar context, the contextualized knowledge learned from the constituent labels can serve as guidance for the subsequent learning of the upcoming words. From this perspective, using constituent labels in the curriculum can bridge the gap between words across different frequencies.

Table 8: The most similar words to each constituent label according to their dot product.

Table 8 shows the most similar words to some constituent labels in the trained embedding table. We find that: 1) The neighboring words reflect the fine-grained linguistic characteristics of the constituent labels, for example the plural number ("games", "states", "children") of NNS, the 3rd person singular present ("is", "does", "has") of VBZ, and the cardinal number ("five", "million", "00") of CD; 2) Different words are captured reasonably according to each constituent label, across the high, medium, and low frequency intervals.
This shows that the constituent label can serve as an anchor or prototype to help model words of different frequency ranges in the curriculum; 3) Phrase-level constituent labels that are usually composed of multiple tokens, such as the noun phrase NP and the wh-noun phrase WHNP, are also well encoded with meaningful similar words such as "him", "she", "he" and "who", "what", "where". See Appendix F for more examples.
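The similarity lookup behind Table 8 amounts to a dot-product ranking over the embedding table. Below is a minimal sketch with toy two-dimensional vectors; the real embeddings are 768-dimensional and come from the trained model, and the specific numbers here are ours:

```python
def most_similar(query_vec, embeddings, k=3):
    """Rank vocabulary words by dot product with a query embedding
    (e.g., a constituent label embedding)."""
    scored = sorted(
        embeddings.items(),
        key=lambda kv: -sum(q * x for q, x in zip(query_vec, kv[1])))
    return [w for w, _ in scored[:k]]

# Toy embedding table; real vectors come from the trained model.
emb = {"is": [0.9, 0.1], "does": [0.8, 0.2], "games": [0.1, 0.9],
       "states": [0.2, 0.8], "five": [-0.5, 0.5]}
vbz = [1.0, 0.0]  # hypothetical embedding of the VBZ label
```

With these toy vectors, the VBZ label retrieves the verb-like words first, mirroring how the trained label embeddings surface words with matching linguistic characteristics.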

5. RELATED WORK

Knowledge Enhanced LM. It has been shown that PLM are capable of encoding syntactic and semantic knowledge (Hewitt & Manning, 2019; Tenney et al., 2019; Pérez-Mayos et al., 2021). There is also a line of work explicitly integrating such knowledge to enhance model representations (Lauscher et al., 2020; Sachan et al., 2021; Xu et al., 2021b). In particular, Levine et al. (2020) add a word sense prediction task to BERT pre-training, and Bai et al. (2022) propose hypernym class prediction for causal language modeling. These methods focus on word-level external knowledge stored in WordNet. Instead, we use the hierarchical syntactic tree to inject word-, phrase-, and clause-level knowledge. Moreover, through the underlying syntactic structure of texts, our method can handle more situations during pre-training where words are not in WordNet (e.g., URLs/email addresses, new words). Curriculum Learning. Curriculum learning has been extensively studied across a range of tasks (Wang et al., 2022). In natural language processing, Bengio et al. (2009) first show that it can help generalization and speed up the convergence of language modeling. For general-purpose language model pre-training, Campos (2021) defines sentence difficulty metrics based on sentence length, n-gram probability, and part-of-speech diversity for curriculum settings on an LSTM-based model. Zhang et al. (2021) group sequences of similar length during pre-training and find that it helps downstream tasks. Nagatsuka et al. (2021) propose progressively increasing the block size of the input text, i.e., using sentences of increasing lengths for pre-training. Li et al. (2021) propose a regularization method for GPT-2 curriculum training, also based on sentence length. Unlike these methods, we build a linguistically motivated curriculum based on the learning content. Data-centric AI.
Data-centric methods have become an emerging topic for modern AI systems (Ng, 2021; Hajij et al., 2021; Xu et al., 2021a; Huang et al., 2022; Eyuboglu et al., 2022). The main idea is to use an established model off-the-shelf but engineer the data for stronger results, including data collection, annotation, augmentation, cleaning, reordering, and deduplication (Russell et al., 2008; Krishnan et al., 2016; Wei & Zou, 2019; Press et al., 2021; Agrawal et al., 2021; Lee et al., 2022). To better leverage language models for downstream tasks, there is also a trend of building cloze-style samples for fine-tuning (Schick & Schütze, 2021; Gao et al., 2021b). Reconstructed training data such as prompts or instructions are also being studied for better using large language models in few/zero-shot scenarios (Sanh et al., 2022; Wei et al., 2022; Yuan & Liu, 2022). In this paper, we reformulate the data using a syntax-guided mixup strategy for language model pre-training.

6. CONCLUSION

We investigate curriculum learning for language model pre-training, focusing on a purely data-centric method that does not set multiple training tasks, modify the model architecture, or introduce external computational overhead during pre-training. In particular, we propose a data mixup strategy that injects constituent labels into the text and progressively increases the vocabulary of the corpus from high-frequency to low-frequency words. Experiments on multiple downstream tasks show that our method leads to better performance compared with baselines. Beyond autoencoding-style models like BERT, our method can also apply to auto-regressive models like GPT-2, where the reconstructed data is used for left-to-right language modeling. Specifically, we use GPT2-base as our backbone and pre-train on the same corpus. We evaluate on LAMBADA (Paperno et al., 2016), WikiText2 (Merity et al., 2016), and SWAG (Zellers et al., 2018) without any fine-tuning (i.e., zero-shot) using the LM evaluation framework from Gao et al. (2021a). Overall, the proposed curriculum training improves performance stably. We find that different schedule settings affect the results, where a moderate amount of training on both the mixup and the raw data is necessary; for example, spending only 30% of the steps on the first three stages with the mixup data is not sufficient. In future work, techniques such as self-paced learning (Kumar et al., 2010; Jiang et al., 2015) can be considered for setting a better schedule.

F MORE EXAMPLES FOR THE MOST SIMILAR WORDS

More examples for the most similar words to each constituent label are shown in Table 15 . 



There is another option to avoid increasing the vocabulary size: using the [unused1] to [unused99] tokens in place of the added constituent labels.



Figure 1: An example of (a) vanilla masked language modeling, and (b) our method using two-stage curriculum training. In the first stage, we replace the original lower-frequency target word such as "trillion" with the constituent label CD, which stands for cardinal number. The representations can be better updated in the later training stage after acquiring the "concept" of CD.

The parser encoder is initialized with the pre-trained model under comparison. For sentence-level classification (GLUE), token-level labeling (NER and POS tagging), and span-based question answering (SQuAD), we follow BERT for fine-tuning. Following Liu et al. (2019c) and Lan et al. (2020), we fine-tune STS-B, MRPC, and RTE starting from a trained MNLI checkpoint. For the other tasks, we train separately using a single model per task without data augmentation. Hyperparameter settings are shown in Appendix C.

Figure 3: Visualization of word embeddings during pre-training; the numbers in parentheses denote the training steps. Different colors indicate different word frequencies.

Figure 5: Visualization of the learned constituent label embeddings using t-SNE.




Table 1: Results on the GLUE benchmark dev and test sets. The best results are in bold.

Table 2: Results on the CoNLL2003 test set.

Table 3: Results on the SQuAD dev sets.



Table 4: Results on the WSJ POS tagging test set with model fine-tuning and linear probing. †, ‡: statistically significant compared with BERT-reimp with p < 0.01 and p < 0.05 by t-test, respectively.

Table 5: Results on the WSJ parsing test set. CM means complete match of the constituency tree of the whole sentence.

Table 6: Results on GLUE and CoNLL2003 using different pre-training corpora.

Table 7: Comparison with an existing method that leverages curriculum learning for pre-training.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, 2015.

A LIST OF CONSTITUENT LABELS

We injected into the corpus a total of 55 constituent labels defined in the Penn Treebank: Labels = [-LRB-, -RRB-, ADJP, ADVP, CONJP, DT, EX, FRAG, FW, INTJ, JJ, JJR, JJS, LS, LST, NAC, NN, NNP, NNPS, NNS, NP, NX, PDT, POS, PRN, PRP, PRP$, PRT, QP, RBR, RBS, RP, RRC, SBAR, SBARQ, SINV, SQ, SYM, TOP, UCP, UH, VB, VBD, VBG, VBN, VBP, VBZ, WDT, WHADJP, WHADVP, WHNP, WHPP, WP, WP$, WRB]. A brief description of these labels can be found at http://surdeanu.cs.arizona.edu/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html. After data reconstruction, we treat these labels as normal tokens and enlarge our embedding table, making them involved in masked language model pre-training.

B DATASET STATISTICS

Statistics of the datasets in our experiments.

C PARAMETER SETTINGS FOR FINE-TUNING

Fine-tuning parameters for GLUE, NER, SQuAD, POS tagging, and parsing are given in Table 10.

Table 10: Parameter settings for fine-tuning downstream tasks.

D.1 DIFFERENT MODEL SETTINGS

For larger model settings, we pre-train a RoBERTa-large model (355M parameters) on the same corpus and compare on the GLUE tasks. In particular, we use the recipe from Wettig et al. (2022) for efficient pre-training, with a 40% masking ratio and 500K training steps.

Table 11: Comparison between larger models with the RoBERTa-large setting.

Results on the GLUE dev sets are shown in Table 11. Overall, compared with RoBERTa-large-reimp, our method leads to large improvements on CoLA and MRPC, and gives close performance or minor improvements across the other tasks.

Table 12: Comparison between auto-regressive language models with the GPT2-base setting. ppl: perplexity; bpb: bits per byte.

Results are shown in Table 12. We find that the results of auto-regressive LMs are still in favor of the proposed technique, with better results such as lower perplexity.

D.2 DIFFERENT CURRICULUM TRAINING STEP SETTINGS

We investigate different training step settings in our four-stage curriculum training; results are shown in Table 13.

Table 13: Comparison of different four-stage curriculum training schedules.

Table 15: More examples for the most similar words to the constituent labels. NN: noun; NNP: proper noun, singular; NNPS: proper noun, plural; VB: verb; VBP: non-3rd person singular present; VBN: past participle; VBD: past tense; VBG: gerund or present participle; WHADVP: wh-adverb phrase; WRB: wh-adverb; RBS: adverb, superlative; RBR: adverb, comparative; JJS: adjective, superlative; JJR: adjective, comparative.

E MEASURE THE ISOTROPY OF REPRESENTATION SPACE

We measure the isotropy of the representation space using the metric defined by Mu et al. (2018), which has also been used for measuring recent language models (Rajaee & Pilehvar, 2021):

$$I(W) = \frac{\min_{u \in U} Z(u)}{\max_{u \in U} Z(u)},$$

where $W$ is the set of representation vectors, $U$ is the set of eigenvectors of $W^\top W$, and $Z(u)$ is the partition function defined in Arora et al. (2016):

$$Z(u) = \sum_{w \in W} \exp(u^\top w).$$

A perfectly isotropic space would have $I(W)$ close to 1. We calculate the $I(W)$ scores for BERT, BERT-reimp, and BERT-CL; the results are shown in Table 14.

Table 14: Measuring the isotropy of the representation space of different models.

Model   BERT      BERT-reimp   BERT-CL
I(W)    1.05e-5   6.15e-7      1.15e-4

We find that the I(W) score of BERT-reimp is lower than that of BERT; the reason may be that the original BERT leverages multi-task training (masked language modeling and next sentence prediction). Compared with these two models, BERT-CL gives a higher I(W) score of 1.15e-4, showing that curriculum training leads to a more isotropic representation space.
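As an illustration, the metric can be computed directly in two dimensions, where the eigenvectors of $W^\top W$ have a closed form. This is a sketch with toy vectors and our own variable names; the real computation runs over the high-dimensional embedding table:

```python
import math

def eigvecs_2x2(a, b, d):
    """Unit eigenvectors of the symmetric matrix [[a, b], [b, d]]."""
    if abs(b) < 1e-12:
        return [(1.0, 0.0), (0.0, 1.0)]
    tr, det = a + d, a * d - b * b
    disc = math.sqrt(tr * tr / 4.0 - det)
    vecs = []
    for lam in (tr / 2.0 + disc, tr / 2.0 - disc):
        vx, vy = lam - d, b  # satisfies b*vx + (d - lam)*vy = 0
        n = math.hypot(vx, vy)
        vecs.append((vx / n, vy / n))
    return vecs

def isotropy(W):
    """I(W) = min_u Z(u) / max_u Z(u) over eigenvectors u of W^T W,
    with Z(u) = sum_w exp(u . w); values near 1 mean isotropic."""
    a = sum(x * x for x, _ in W)
    b = sum(x * y for x, y in W)
    d = sum(y * y for _, y in W)
    Z = [sum(math.exp(u0 * x + u1 * y) for x, y in W)
         for u0, u1 in eigvecs_2x2(a, b, d)]
    return min(Z) / max(Z)

uniform = [(1, 0), (0, 1), (-1, 0), (0, -1)]       # spread in all directions
skewed = [(2.0, 0.05), (1.9, -0.05), (2.1, 0.0)]   # vectors in a narrow cone
```

Vectors spread in all directions score close to 1, while vectors collapsed into a narrow cone (the degeneration pattern described in §4.1) score far below it.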

