PMI-MASKING: PRINCIPLED MASKING OF CORRELATED SPANS

Abstract

Masking tokens uniformly at random constitutes a common flaw in the pretraining of Masked Language Models (MLMs) such as BERT. We show that such uniform masking allows an MLM to minimize its training objective by latching onto shallow local signals, leading to pretraining inefficiency and suboptimal downstream performance. To address this flaw, we propose PMI-Masking, a principled masking strategy based on the concept of Pointwise Mutual Information (PMI), which jointly masks a token n-gram if it exhibits high collocation over the corpus. PMI-Masking motivates, unifies, and improves upon prior, more heuristic approaches that attempt to address the drawback of random uniform token masking, such as whole-word masking, entity/phrase masking, and random-span masking. Specifically, we show experimentally that PMI-Masking reaches the performance of prior masking approaches in half the training time, and consistently improves performance at the end of training.
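As a toy illustration of the collocation measure underlying PMI-Masking, the following sketch scores bigrams by pointwise mutual information, PMI(w1, w2) = log p(w1, w2) / (p(w1) p(w2)), over a small corpus. This is illustrative code under simplified assumptions (bigrams only, maximum-likelihood probability estimates), not the paper's actual implementation:

```python
import math
from collections import Counter

def pmi_bigrams(tokens):
    """Return PMI(w1, w2) = log p(w1, w2) / (p(w1) * p(w2)) for each bigram."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_joint = count / n_bi
        p1 = unigrams[w1] / n_uni
        p2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log(p_joint / (p1 * p2))
    return scores

corpus = ("new york is a city . new york has parks . "
          "a city has a park .").split()
scores = pmi_bigrams(corpus)
# "new" and "york" only ever occur together, so the bigram scores high;
# "a city" co-occurs too, but each word also appears in other contexts,
# so its PMI is lower. High-PMI n-grams are the candidates for joint masking.
assert scores[("new", "york")] > scores[("a", "city")]
```

In the full method one would rank n-grams of several lengths over the pretraining corpus and mask the top-scoring spans jointly; the sketch above shows only the scoring step.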

1. INTRODUCTION

In the couple of years since BERT was introduced in a seminal paper by Devlin et al. (2019a), Masked Language Models (MLMs) have rapidly advanced the NLP frontier (Sun et al., 2019; Liu et al., 2019; Joshi et al., 2020; Raffel et al., 2019). At the heart of the MLM approach is the task of predicting a masked subset of the text given the remaining, unmasked text. The text itself is broken up into tokens, each token consisting of a word or part of a word; thus "chair" constitutes a single token, but out-of-vocabulary words like "e-igen-val-ue" are broken up into several sub-word tokens. In BERT, 15% of tokens are chosen to be masked uniformly at random. It is this random choice of single tokens that we address in this paper: we show that it is suboptimal and offer a principled alternative.

To see why Random-Token Masking is suboptimal, consider the special case of sub-word tokens. Given the masked sentence "To approximate the matrix, we use the eigenvector corresponding to its largest e-[mask]-val-ue", an MLM will quickly learn to predict "igen" based only on the local context "e-[mask]-val-ue", rendering the rest of the sentence redundant. The question is whether the network will also learn to relate the broader context to the tokens comprising "eigenvalue". When these tokens are masked together, the network is forced to do so, but under uniform random masking such joint masking occurs with vanishingly small probability. One might hypothesize that the network would nonetheless be able to piece such meaning together from local cues; however, we show that it often struggles to do so. We establish this via a controlled experiment in which we reduced the size of the vocabulary, thereby breaking more words into sub-word tokens. We then compared the extent to which such vocabulary reduction degraded regular BERT relative to so-called Whole-Word Masking BERT (WW-BERT) (Devlin et al., 2019b), a version of BERT that jointly masks all sub-word tokens comprising an out-of-vocabulary word during training.
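The whole-word grouping just described can be sketched in a few lines. The snippet below contrasts whole-word masking with independent token masking on WordPiece-style tokens, where a "##" prefix marks a continuation piece of the same word; the function names, the 15% mask rate, and the greedy span-selection loop are illustrative simplifications, not BERT's actual training code:

```python
import random

def whole_word_spans(tokens):
    """Group token indices so that all pieces of one word stay together."""
    spans = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and spans:
            spans[-1].append(i)      # continuation piece joins the current word
        else:
            spans.append([i])        # a new word starts a new span
    return spans

def whole_word_mask(tokens, mask_rate=0.15, seed=0):
    """Mask whole words (all their sub-word pieces together) up to mask_rate."""
    rng = random.Random(seed)
    spans = whole_word_spans(tokens)
    rng.shuffle(spans)
    budget = max(1, round(mask_rate * len(tokens)))
    out, n_masked = list(tokens), 0
    for span in spans:
        if n_masked >= budget:
            break
        for i in span:               # mask every piece of the chosen word
            out[i] = "[MASK]"
        n_masked += len(span)
    return out

tokens = "we use the e ##igen ##val ##ue of the matrix".split()
masked = whole_word_mask(tokens)
# Either all four pieces of "eigenvalue" are masked together, or none are;
# the model is never shown a partial hint like "e-[MASK]-val-ue".
```

Under random-token masking, by contrast, each piece would be masked independently, so the partially masked word discussed above arises routinely while the fully masked word is vanishingly rare.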
We show that vanilla BERT's performance degrades much more rapidly than that of WW-BERT as the vocabulary size shrinks. The intuitive explanation

