TAKING NOTES ON THE FLY HELPS LANGUAGE PRE-TRAINING

Abstract

How to make unsupervised language pre-training more efficient and less resource-intensive is an important research direction in NLP. In this paper, we focus on improving the efficiency of language pre-training methods by providing better data utilization. It is well-known that in language corpora, words follow a heavy-tail distribution. A large proportion of words appear only a few times, and the embeddings of rare words are usually poorly optimized. We argue that such embeddings carry inadequate semantic signals, which could make data utilization inefficient and slow down the pre-training of the entire model. To mitigate this problem, we propose Taking Notes on the Fly (TNF), which takes notes for rare words on the fly during pre-training to help the model understand them when they occur next time. Specifically, TNF maintains a note dictionary and saves a rare word's contextual information in it as notes when the rare word occurs in a sentence. When the same rare word occurs again during training, the note information saved beforehand can be employed to enhance the semantics of the current sentence. By doing so, TNF provides better data utilization, since cross-sentence information is employed to cover the inadequate semantics caused by rare words in the sentences. We implement TNF on both BERT and ELECTRA to check its efficiency and effectiveness. Experimental results show that TNF's training time is 60% less than that of its backbone pre-training models when reaching the same performance. When trained for the same number of iterations, TNF outperforms its backbone methods on most downstream tasks and in the average GLUE score.

1. INTRODUCTION

Unsupervised language pre-training, e.g., BERT (Devlin et al., 2018), has been shown to be a successful way to improve the performance of various NLP downstream tasks. However, because the pre-training task requires no human labeling effort, massive training corpora from the Web can be used to train models with billions of parameters (Raffel et al., 2019), making pre-training computationally expensive. As an illustration, training a BERT-base model on the Wikipedia corpus requires more than five days on 16 NVIDIA Tesla V100 GPUs. Therefore, how to make language pre-training more efficient and less resource-intensive has become an important research direction in the field (Strubell et al., 2019).

Our work aims at improving the efficiency of language pre-training methods. In particular, we study how to speed up pre-training through better data utilization. It is well-known that in a natural language corpus, words follow a heavy-tail distribution (Larson, 2010). A large proportion of words appear only a few times, and the embeddings of those (rare) words are usually poorly optimized and noisy (Bahdanau et al., 2017; Gong et al., 2018; Khassanov et al., 2019; Schick & Schütze, 2020).
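The heavy-tail effect can be seen even in a toy word-frequency count like the one below. The corpus and the frequency threshold here are illustrative assumptions, not data from the paper; on a real pre-training corpus such as Wikipedia, the skew is far more extreme.

```python
# A minimal sketch (hypothetical toy corpus): count word frequencies and
# split the vocabulary into "rare" words (appearing once) and the rest.
from collections import Counter

corpus = [
    "the covid-19 pandemic is an ongoing global crisis",
    "the model is trained on a large corpus",
    "covid-19 has cost thousands of lives",
    "the training corpus is large",
]

freq = Counter(w for sent in corpus for w in sent.split())

# Words appearing only once ("rare" under a frequency threshold of 1).
rare = [w for w, c in freq.items() if c == 1]
common = [w for w, c in freq.items() if c > 1]
print(len(rare), len(common))
```

Even on four sentences, most vocabulary entries occur exactly once; a frequency cutoff like this is one simple way to decide which words count as "rare" during pre-training.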

Figure 1: An illustration of how taking notes of rare words can help language understanding. The left part of the figure shows that without any understanding of the rare word "COVID-19", there are too many grammatically correct but semantically wrong options for filling in the blank in "COVID-19 has cost thousands of ______." (dollars? donuts? puppies? tomatoes?). The right half shows that a note of "COVID-19" taken from a previously seen sentence ("The COVID-19 pandemic is an ongoing global crisis.") can act as a very strong signal for predicting the correct word at the masked position.

Unlike previous works that sought merely to improve the embedding quality of rare words, we argue that the existence of rare words could also slow down the training of other model parameters. Taking BERT as an example, imagine that the model encounters the following masked sentence during pre-training: "COVID-19 has cost thousands of ______." Note that "COVID-19" is a rare word, yet it is also the only key information the model can rely on to fill in the blank with the correct answer, "lives". As the embedding of the rare word "COVID-19" is poorly trained, the Transformer lacks a concrete input signal for predicting "lives". Furthermore, with noisy inputs, the model takes longer to converge and sometimes even fails to generalize well (Zhang et al., 2016). Empirically, we observe that around 20% of the sentences in the corpus contain at least one rare word. Moreover, since most pre-training methods concatenate multiple adjacent sentences to form one input sample, we find that more than 90% of input samples contain at least one rare word. The large proportion of such sentences could cause a severe data utilization problem for language pre-training due to the lack of concrete semantics for sentence understanding. Therefore, learning from masked language modeling tasks using these noisy embeddings may make pre-training inefficient.
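The sentence-level statistic above (the fraction of sentences containing at least one rare word) is a simple one-pass count once a rare-word vocabulary is fixed. The rare-word set and sentences below are illustrative assumptions, not the paper's data.

```python
# Sketch: fraction of sentences containing at least one rare word,
# given a precomputed rare-word vocabulary (assumed here).
rare_words = {"covid-19", "pandemic"}

sentences = [
    "covid-19 has cost thousands of lives",
    "the model is trained on a large corpus",
    "the covid-19 pandemic is an ongoing global crisis",
    "the training corpus is large",
]

n_with_rare = sum(
    any(w in rare_words for w in s.split()) for s in sentences
)
fraction = n_with_rare / len(sentences)
print(fraction)
```

On a real corpus, the same count over concatenated multi-sentence input samples (rather than single sentences) is what drives the fraction from roughly 20% up past 90%.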
Moreover, completely removing those sentences with rare words is not an applicable choice either, since it would significantly reduce the size of the training data and hurt the final model performance.

Our method to solve this problem is inspired by how humans manage information. Note-taking is a useful skill that can help people recall information that would otherwise be lost, especially for new concepts during learning (Makany et al., 2009). If people take notes when facing a rare word that they do not know, then the next time the rare word appears, they can refer to the notes to better understand the sentence. For example, we may have seen the following sentence beforehand: "The COVID-19 pandemic is an ongoing global crisis." From this sentence, we can realize that "COVID-19" is related to "pandemic" and "global crisis" and record the connection in the notes. When facing "COVID-19" again in the masked-language-modeling task above, we can refer to the note of "COVID-19". Once "pandemic" and "global crisis" are connected to "COVID-19", we can understand the sentence and predict "lives" more easily, as illustrated in Figure 1. Mapped back to language pre-training, we believe that for rare words, explicitly leveraging cross-sentence information helps enhance the semantics of rare words in the current sentence for predicting the masked tokens. Through this more efficient data utilization, the Transformer receives better input signals, which leads to more efficient training of its model parameters.

Motivated by the discussion above, we propose a new learning approach called "Taking Notes on the Fly" (TNF) to improve data utilization for language pre-training. Specifically, we maintain a note dictionary, where the keys are rare words and the values are their historical contextual representations. In the forward pass, when a rare word w appears in a sentence, we query the value of w in the note dictionary and use it as a part of the input.
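The note-dictionary mechanism can be sketched as below. This is a minimal illustration, not the paper's implementation: the mixing weight `lam`, the moving-average update rate `gamma`, and the exact form of the update are assumptions for the sketch.

```python
# Sketch of a note dictionary: rare word -> stored contextual vector.
import numpy as np

DIM = 4  # toy embedding dimension (assumed)

class NoteDictionary:
    def __init__(self, dim=DIM):
        self.notes = {}  # rare word -> note vector
        self.dim = dim

    def query(self, word):
        # Return the saved note, or a zero vector if the word is unseen.
        return self.notes.get(word, np.zeros(self.dim))

    def update(self, word, context_repr, gamma=0.1):
        # Blend the new contextual representation into the stored note
        # (a moving-average-style update; the exact rule is assumed here).
        old = self.query(word)
        self.notes[word] = (1 - gamma) * old + gamma * context_repr

def enrich_input(word_emb, note, lam=0.5):
    # Use the note as part of the input: mix it with the word embedding.
    return (1 - lam) * word_emb + lam * note

notes = NoteDictionary()
notes.update("covid-19", np.ones(DIM))  # note taken at first occurrence
enriched = enrich_input(np.zeros(DIM), notes.query("covid-19"))
print(enriched)  # nonzero: the note injects cross-sentence signal
```

The design point the sketch captures is that the dictionary lives outside the model parameters: it is read in the forward pass to enrich the input and written back after each occurrence, so semantic signal accumulated from earlier sentences is available the next time the rare word appears.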
In this way, the semantic information of w saved in the note can be encoded together with other words through the model. Besides updating the model parameters, we also update the note dictionary. In particular, we define the note of w in the current

