TOWARDS CONDITIONALLY DEPENDENT MASKED LANGUAGE MODELS

Abstract

Masked language modeling has proven to be an effective paradigm for learning representations of language. However, when multiple tokens are masked out, the masked language model's (MLM) distribution over the masked positions assumes that the masked tokens are conditionally independent given the unmasked tokens, an assumption that does not hold in practice. Existing work addresses this limitation by interpreting the sum of unary scores (i.e., the logits or the log probabilities of single tokens when conditioned on all others) as the log potential of a Markov random field (MRF). While this new model no longer makes any independence assumptions, it remains unclear whether this approach (i) results in a good probabilistic model of language and, further, (ii) derives a model that is faithful (i.e., has matching unary distributions) to the original model. This paper studies MRFs derived this way in a controlled setting where only two tokens are masked out at a time, which makes it possible to compute exact distributional properties. We find that such pairwise MRFs are often worse probabilistic models of language from a perplexity standpoint, and moreover have unary distributions that do not match the unary distributions of the original MLM. We then study a statistically motivated iterative optimization algorithm for deriving joint pairwise distributions that are more compatible with the original unary distributions. While this iterative approach outperforms the MRF approach, the algorithm itself is too expensive to be practical. We thus amortize this optimization process through a parameterized feed-forward layer that learns to modify the original MLM's pairwise distributions to be both non-independent and faithful, and find that this approach outperforms the MLM for scoring pairs of tokens.

1. INTRODUCTION

Masked language modeling has proven to be an effective paradigm for learning generalizable representations of language (Devlin et al., 2019; Liu et al., 2019; He et al., 2021) and other structured domains (Rives et al., 2021; Mahmood et al., 2021; He et al., 2022). From a probabilistic perspective, masked language models (MLMs) make strong independence assumptions. When multiple tokens are masked out, MLMs assume that the distributions over the masked tokens are conditionally independent given the unmasked tokens, an assumption that clearly does not hold for language. For example, consider the sentence: "The [MASK]1 [MASK]2 pleasantly surprised by an analysis paper." MLMs assume that the distributions over the two tokens are independent and thus cannot systematically assign higher probability to grammatical subject-verb agreements ("reviewer was" and "reviewers were") than ungrammatical ones (*"reviewer were" and *"reviewers was"). These types of statistical dependencies can also occur between words that are far apart, as in "The [MASK]1, tired from reading so many papers that focused on performance gains, [MASK]2 pleasantly surprised by an analysis paper." Indeed, such long-range dependencies animate much work on hierarchical approaches to language, which posit (usually tree-like) structures in which words that are "close" in structure space (but potentially far apart in surface form) exhibit strong dependencies on one another. From a purely representation-learning perspective, such model misspecifications arising from incorrect statistical assumptions may not be catastrophic. These assumptions can enable scalable training and even aid in learning better representations by serving as a statistical bottleneck that forces more
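The independence assumption, and why it rules out preferring both agreeing subject-verb pairs, can be made concrete with a small numerical sketch. The probabilities and scores below are hypothetical, chosen purely for illustration: under the MLM's factorized joint, the odds ratio between agreeing and disagreeing pairs is always exactly 1, whereas a jointly normalized pairwise model (as in the MRF construction, where summed unary scores act as a log potential) can push it above 1.

```python
import numpy as np

# Two masked positions in "The [MASK]1 [MASK]2 pleasantly surprised ..."
# Rows index {"reviewer", "reviewers"}, columns index {"was", "were"}.

# Hypothetical unary MLM distributions, each conditioned only on the
# unmasked context (both positions masked simultaneously):
p1 = np.array([0.5, 0.5])  # P(token_1 | context)
p2 = np.array([0.5, 0.5])  # P(token_2 | context)

# Under conditional independence, the joint is the outer product:
joint_indep = np.outer(p1, p2)

def odds_ratio(joint):
    # (agreeing pairs) / (disagreeing pairs):
    # P(reviewer, was) * P(reviewers, were)
    # ------------------------------------
    # P(reviewer, were) * P(reviewers, was)
    return (joint[0, 0] * joint[1, 1]) / (joint[0, 1] * joint[1, 0])

# For ANY choice of p1, p2 the factorized joint gives odds_ratio == 1:
# the terms cancel, so both agreeing pairs can never be preferred at once.
print(odds_ratio(joint_indep))  # 1.0

# A pairwise model with a jointly normalized score table (e.g., summed
# unary logits used as a log potential) has no such constraint:
scores = np.array([[2.0, 0.0],   # hypothetical pairwise log potentials
                   [0.0, 2.0]])
joint_pairwise = np.exp(scores) / np.exp(scores).sum()

print(odds_ratio(joint_pairwise) > 1)  # True: agreement is preferred
```

The cancellation in the factorized case is the whole point: no setting of the two unary distributions can simultaneously upweight "reviewer was" and "reviewers were" relative to the mismatched pairs, which is exactly what the jointly normalized model is free to do.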

