ON A BENEFIT OF MASKED LANGUAGE MODEL PRE-TRAINING: ROBUSTNESS TO SIMPLICITY BIAS

Anonymous authors
Paper under double-blind review

Abstract

Despite the success of pretrained masked language models (MLMs), the question of why MLM pretraining is useful has not been fully answered. In this work, we theoretically and empirically show that MLM pretraining makes models robust to lexicon-level spurious features, partly answering the question. Our explanation is that MLM pretraining may alleviate problems brought by simplicity bias (Shah et al., 2020), the phenomenon that a deep model tends to rely excessively on simple features. In NLP tasks, those simple features can be token-level features whose spurious association with the label is easy to learn. We show that MLM pretraining makes learning from the context easier, so pretrained models are less likely to rely excessively on a single token. We also explore theoretical explanations of MLM's efficacy in causal settings: compared with Wei et al. (2021), we achieve similar results under milder assumptions. Finally, we close the gap between our theories and real-world practice by conducting experiments on real-world tasks.

1. INTRODUCTION

The question "why is masked language model (MLM) pretraining (Devlin et al., 2019; Liu et al., 2019) useful?" has not been fully answered. In this work, as an initial step toward the answer, we show and explain that MLM pretraining makes a model robust to lexicon-level features that are spuriously associated with the target label, giving the model better generalization capability under distribution shift.

Previous studies have empirically shown the robustness of MLM-pretrained models. Hao et al. (2019) show that MLM pretraining leads to wider optima and better generalization capability. Hendrycks et al. (2020) and Tu et al. (2020) show that pretrained models are more robust to out-of-distribution data and spurious features. However, it remains unanswered why pretrained models are more robust.

We conjecture that models trained from scratch suffer from the pitfall of simplicity bias (Shah et al., 2020) (Figure 1). Shah et al. (2020) and Kalimeris et al. (2019) showed that deep networks tend to converge to a simple decision boundary that involves only a few features. The networks may not utilize all the features and thus may not maximize the margin, which results in worse robustness. A consequence is that a model may rely excessively on a feature that has a spurious association with the label and ignore other features that are more robust. Shah et al. (2020) and Kalimeris et al. (2019) investigated networks with continuous input; Lovering et al. (2021) discovered similar results on synthetic NLP tasks, where the inputs are discrete. We further explore this discrete setting in this work.

We start the exploration with the following assumptions. Let the sentence-label pair be (X, Y).

Assumption 1. From X, we can extract two features X_1 and X_2.

Assumption 2. X_1 is a spurious feature that has a strong association with Y. Specifically, relying solely on X_1, one can predict Y with high accuracy over the data distribution, but not with 100% accuracy.

Assumption 3. X_2 is a robust feature based on which Y can be predicted with 100% accuracy. Namely, there exists a deterministic mapping f_{X_2 → Y} that maps X_2 to Y.

The assumptions above are realistic in some NLP tasks, where the input X is a sequence of tokens. Some tasks satisfy Assumption 1: X can be decomposed into X_1 and X_2, where X_1 is the presence of certain tokens and X_2 is the context of those tokens. Thus, X_2 has a much higher dimensionality than X_1. As shown by the analysis of Gardner et al. (2021), there are indeed datasets where Assumption 2 holds. When Assumption 3 also holds, we would like the model to rely on X_2, which contains the semantics of the input X.

With these assumptions, in Section 2 we empirically demonstrate that spurious features in discrete inputs can cause problems as in the continuous cases (Shah et al., 2020; Kalimeris et al., 2019). We show that, possibly due to the simplicity bias, a deep model is likely to rely excessively on X_1 and less on X_2. In Sections 3.1 and 3.2 we provide a theoretical explanation of how MLM pretraining makes a model robust to spurious features. Let Π_1 be the conditional probability P(X_1 | X_2). We show that (1) the mutual information satisfies I(Π_1; Y) ≥ I(X_1; Y), and (2) the convergence rate of learning from Π_1 is of the same order as that of learning from X_1. That is, when the MLM can perfectly model the probability P(X_1 | X_2) and thus produce a perfect Π_1, learning from Π_1 is as easy as learning from X_1. As a result, the model will be more likely to rely on Π_1.
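To make claim (1) concrete, the sketch below computes both mutual informations on a small discrete distribution satisfying Assumptions 1-3. The particular distribution (a uniform X_2 over four values, a threshold label, hand-picked conditionals) is our own illustrative choice, not the paper's construction.

```python
from itertools import product
from math import log2

# Illustrative toy distribution (our choice, not the paper's construction):
# X_2 in {0,1,2,3} is uniform, Y = f(X_2) is deterministic (Assumption 3),
# and X_1 in {0,1} is only spuriously associated with Y (Assumption 2).
p_x2 = {x2: 0.25 for x2 in range(4)}
p_x1_given_x2 = {0: 0.1, 1: 0.2, 2: 0.8, 3: 0.9}  # P(X_1 = 1 | X_2 = x2)

def f(x2):
    """Deterministic mapping f_{X_2 -> Y} of Assumption 3."""
    return int(x2 >= 2)

def mutual_information(joint):
    """I(A; B) in bits, from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Joint of (X_1, Y), marginalizing over X_2.
joint_x1_y = {}
for x2, x1 in product(range(4), (0, 1)):
    p = p_x2[x2] * (p_x1_given_x2[x2] if x1 else 1 - p_x1_given_x2[x2])
    key = (x1, f(x2))
    joint_x1_y[key] = joint_x1_y.get(key, 0.0) + p

# Joint of (Pi_1, Y): Pi_1 = P(X_1 = 1 | X_2) is a function of X_2.
joint_pi_y = {}
for x2 in range(4):
    key = (p_x1_given_x2[x2], f(x2))
    joint_pi_y[key] = joint_pi_y.get(key, 0.0) + p_x2[x2]

i_x1 = mutual_information(joint_x1_y)   # ~0.39 bits
i_pi = mutual_information(joint_pi_y)   # 1.0 bit: Pi_1 determines Y here
assert i_pi >= i_x1                     # I(Pi_1; Y) >= I(X_1; Y)
```

Because Π_1 is a function of X_2 (and here a one-to-one one), the data-processing inequality caps I(Π_1; Y) at I(X_2; Y) = H(Y) = 1 bit, which this distribution attains, while X_1 alone carries only about 0.39 bits.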
Since Π_1 is estimated from X_2, higher reliance on Π_1 also implies higher reliance on the robust feature X_2. This avoids the pitfall of simplicity bias in which the model relies excessively on X_1. To relax Assumption 3, we go one step further by considering causal settings in Section 3.3.

The above results partly explain why MLM pretraining is useful for NLP. Denote a sequence of tokens as X = X_1, X_2, ..., X_L. During MLM pretraining, each token is masked randomly with a certain probability, and the training objective is to predict the masked tokens with the maximum-likelihood loss. As a result, the model is capable of estimating the conditional probability P(X_i | X \ X_i) for all i = 1, 2, ..., L. Even though it is unknown which token is spurious, as long as the spurious token has a non-zero probability of being masked during pretraining, the MLM can estimate its distribution conditioned on the context and thus reduce the reliance on it.

Finally, we close the gap between our theories and reality. One major gap is that, in practice, we do not use the conditional probability for downstream tasks. Instead, we feed the input X without masking any token and fine-tune the model along with a shallow layer over its output. Regardless, we hypothesize that the robustness brought by MLM pretraining still exists. To verify this, in Section 4 we use the toy example to examine the effect of MLM pretraining under the common fine-tuning practice. In Section 5, we validate our theories on two real-world NLP tasks.

In sum, our study leads to new research directions. First, we provide a new explanation of the efficacy of MLM pretraining. Unlike previous purely theoretical studies (Saunshi et al., 2021; Wei et al., 2021), our assumptions are milder and more realistic. Second, we study NLP robustness from the perspective of self-supervised models, which have been widely used since Word2vec (Mikolov et al., 2013) and are thus indispensable for generalization to unseen data. We reveal the mechanism that leads to this robustness, which may enable us to further reinforce it in the future.
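The random-masking step described above can be sketched as follows. This is a minimal illustration only: real MLM pretraining such as BERT's operates on subword ids rather than strings and also leaves some selected positions unchanged or replaces them with random tokens.

```python
import random

MASK = "[MASK]"  # real tokenizers use a special integer id instead

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Randomly mask tokens, returning (corrupted input, prediction targets).

    targets[i] holds the original token wherever a mask was placed, else None.
    An MLM is trained to maximize log-likelihood of the targets given the
    corrupted input, i.e. to model the conditional of each token given all
    the other tokens (its context).
    """
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)   # hide the token from the model
            targets.append(tok)      # loss is computed only at this position
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets

rng = random.Random(0)
corrupted, targets = mask_tokens(
    "the movie was absolutely wonderful".split(), mask_prob=0.3, rng=rng)
```

Because every position has probability mask_prob of being hidden, even a spurious token is eventually masked, so the pretrained model learns that token's distribution given its context.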

2. A TOY EXAMPLE

To show that spurious association can cause difficulty of convergence, we construct a toy example with random variables X_1, X_2, Y and experimental variables (d_2, ν). In our setting, the random variables X_1, X_2, Y satisfy the assumptions mentioned above, and the experimental variables ν and d_2 control the strength of the spurious association and the difficulty of learning from the robust feature, respectively. Finally, we measure the difficulty of the task for different (d_2, ν) by counting the number of updates required for a model to converge.

Specifically, we design the relationship between the random variables in the following way. Let the dimensions of the random variables X_1 and X_2 be 2 and d_2, respectively. Their values are x_1 ∈ X_1 = {e_1, e_2} and x_2 ∈ X_2 = {e_1, ..., e_{d_2}}, where e_i is the one-hot vector whose i-th element is 1. We



Figure 1: The pitfall of simplicity bias (Shah et al., 2020): the solid line is a simple (linear) decision boundary that utilizes only one dimension, while the dashed line is a more complex decision boundary that utilizes two dimensions and maximizes the margin.
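A sampler for the Section 2 toy setting could look like the sketch below. The excerpt ends before the exact joint distribution is specified, so the details here are hypothetical: the parity-based mapping f and the way ν injects the spurious agreement are our assumptions, not the paper's construction.

```python
import random

def one_hot(i, dim):
    """e_{i+1}: a one-hot vector of length dim with a 1 at index i."""
    v = [0.0] * dim
    v[i] = 1.0
    return v

def sample(d2, nu, rng=random):
    """Draw one (x1, x2, y) triple for the toy setting.

    y is a deterministic function of x2 (Assumption 3); here we use the
    parity of its hot index, an arbitrary illustrative choice. x1 agrees
    with y with probability nu (the spurious association, Assumption 2).
    """
    i2 = rng.randrange(d2)
    y = i2 % 2                               # f_{X_2 -> Y}, 100% accurate
    i1 = y if rng.random() < nu else 1 - y   # spurious, correct w.p. nu
    return one_hot(i1, 2), one_hot(i2, d2), y

rng = random.Random(0)
x1, x2, y = sample(d2=8, nu=0.9, rng=rng)
```

Under this construction, a larger d_2 spreads the robust signal over more one-hot dimensions (making it harder to learn), while ν close to 1 makes the two-dimensional spurious feature an almost perfect shortcut.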

