TAKING NOTES ON THE FLY HELPS LANGUAGE PRE-TRAINING

Abstract

How to make unsupervised language pre-training more efficient and less resource-intensive is an important research direction in NLP. In this paper, we focus on improving the efficiency of language pre-training methods by providing better data utilization. It is well-known that in a language data corpus, words follow a heavy-tail distribution: a large proportion of words appear only a few times, and the embeddings of such rare words are usually poorly optimized. We argue that these embeddings carry inadequate semantic signals, which makes data utilization inefficient and slows down the pre-training of the entire model. To mitigate this problem, we propose Taking Notes on the Fly (TNF), which takes notes for rare words on the fly during pre-training to help the model understand them the next time they occur. Specifically, TNF maintains a note dictionary and saves a rare word's contextual information in it as a note when the rare word occurs in a sentence. When the same rare word occurs again during training, the note saved beforehand can be employed to enhance the semantics of the current sentence. In this way, TNF provides better data utilization, since cross-sentence information is employed to cover the inadequate semantics caused by rare words. We implement TNF on both BERT and ELECTRA to verify its efficiency and effectiveness. Experimental results show that TNF reaches the same performance as its backbone pre-training models with 60% less training time. When trained for the same number of iterations, TNF outperforms its backbone methods on most downstream tasks and in the average GLUE score.
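The note-taking mechanism described above can be made concrete with a short sketch. The names below (`NoteDictionary`, `enhance_input`), the exponential-moving-average note update, the mean-pooled context summary, and the mixing weight are all illustrative assumptions rather than the paper's exact design, which is specified in later sections.

```python
import torch

class NoteDictionary:
    """Minimal sketch of TNF's note dictionary: maps each rare word id
    to a running "note" vector that summarizes the contexts in which the
    word has appeared. The EMA update rule and momentum value here are
    assumptions for illustration only."""

    def __init__(self, rare_word_ids, dim, momentum=0.9):
        self.notes = {w: torch.zeros(dim) for w in rare_word_ids}
        self.momentum = momentum

    def update(self, word_id, context_embeddings):
        # Summarize the current occurrence by mean-pooling the embeddings
        # of the surrounding tokens, then fold it into the saved note.
        summary = context_embeddings.mean(dim=0)
        old = self.notes[word_id]
        self.notes[word_id] = self.momentum * old + (1 - self.momentum) * summary

    def lookup(self, word_id):
        # Retrieve the note built from earlier occurrences of this word.
        return self.notes[word_id]


def enhance_input(input_embeddings, token_ids, note_dict, weight=0.5):
    """Mix each rare token's saved note into its input embedding so the
    current sentence carries semantics collected from past occurrences.
    `token_ids` is a list of int token ids aligned with the embeddings."""
    enhanced = input_embeddings.clone()
    for i, tok in enumerate(token_ids):
        if tok in note_dict.notes:
            enhanced[i] = (1 - weight) * input_embeddings[i] + weight * note_dict.lookup(tok)
    return enhanced
```

During pre-training, `update` would be called whenever a rare word occurs, and `enhance_input` would be applied to the model's input embeddings; how the notes interact with the backbone model is detailed in the method section.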

1. INTRODUCTION

Unsupervised language pre-training, e.g., BERT (Devlin et al., 2018), has been shown to be a successful way to improve the performance of various NLP downstream tasks. However, since the pre-training task requires no human labeling effort, a massive training corpus from the Web can be used to train models with billions of parameters (Raffel et al., 2019), making pre-training computationally expensive. As an illustration, training a BERT-base model on the Wikipedia corpus requires more than five days on 16 NVIDIA Tesla V100 GPUs. Therefore, how to make language pre-training more efficient and less resource-intensive has become an important research direction in the field (Strubell et al., 2019).

Our work aims at improving the efficiency of language pre-training methods. In particular, we study how to speed up pre-training through better data utilization. It is well-known that in a natural language corpus, words follow a heavy-tail distribution (Larson, 2010). A large proportion of words appear only a few times, and the embeddings of those (rare) words are usually poorly optimized and noisy (Bahdanau et al., 2017; Gong et al., 2018; Khassanov et al., 2019; Schick & Schütze, 2020).
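To make the heavy-tail claim concrete, the toy snippet below counts word frequencies in a corpus and reports the fraction of the vocabulary that falls below a rarity threshold. The threshold of 10 occurrences is an arbitrary choice for illustration, not a value taken from the paper.

```python
from collections import Counter

def rare_word_fraction(corpus_lines, threshold=10):
    """Count word frequencies over an iterable of text lines and return
    the fraction of distinct words occurring fewer than `threshold` times."""
    counts = Counter()
    for line in corpus_lines:
        counts.update(line.lower().split())
    rare = sum(1 for c in counts.values() if c < threshold)
    return rare / len(counts)

# On a real corpus such as Wikipedia, this fraction is typically large,
# which is why rare-word embeddings receive few gradient updates and
# remain poorly optimized.
```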

