TOWARDS CONDITIONALLY DEPENDENT MASKED LANGUAGE MODELS

Abstract

Masked language modeling has proven to be an effective paradigm for learning representations of language. However, when multiple tokens are masked out, the masked language model's (MLM) distribution over the masked positions assumes that the masked tokens are conditionally independent given the unmasked tokens, an assumption that does not hold in practice. Existing work addresses this limitation by interpreting the sum of unary scores (i.e., the logits or the log probabilities of single tokens when conditioned on all others) as the log potential of a Markov random field (MRF). While this new model no longer makes any independence assumptions, it remains unclear whether this approach (i) results in a good probabilistic model of language, and further (ii) yields a model that is faithful (i.e., has matching unary distributions) to the original model. This paper studies MRFs derived this way in a controlled setting where only two tokens are masked out at a time, which makes it possible to compute exact distributional properties. We find that such pairwise MRFs are often worse probabilistic models of language from a perplexity standpoint and, moreover, have unary distributions that do not match the unary distributions of the original MLM. We then study a statistically motivated iterative optimization algorithm for deriving joint pairwise distributions that are more compatible with the original unary distributions. While this iterative approach outperforms the MRF approach, the algorithm itself is too expensive to be practical. We thus amortize the optimization process through a parameterized feed-forward layer that learns to modify the original MLM's pairwise distributions to be both non-independent and faithful, and find that this approach outperforms the MLM for scoring pairwise tokens.

1. INTRODUCTION

Masked language modeling has proven to be an effective paradigm for learning generalizable representations of language (Devlin et al., 2019; Liu et al., 2019; He et al., 2021) and other structured domains (Rives et al., 2021; Mahmood et al., 2021; He et al., 2022). From a probabilistic perspective, masked language models (MLMs) make strong independence assumptions: when multiple tokens are masked out, MLMs assume that the distributions over the masked tokens are conditionally independent given the unmasked tokens, an assumption that clearly does not hold for language. For example, consider the sentence: "The [MASK]1 [MASK]2 pleasantly surprised by an analysis paper." MLMs assume that the distributions over the two masked tokens are independent and thus cannot systematically assign higher probability to grammatical subject-verb agreements ("reviewer was" and "reviewers were") than to ungrammatical ones (*"reviewer were" and *"reviewers was"). Such statistical dependencies can also hold between words that are far apart: "The [MASK]1, tired from reading so many papers that focused on performance gains, [MASK]2 pleasantly surprised by an analysis paper." Indeed, such long-range dependencies animate much work on hierarchical approaches to language, which posit (usually tree-like) structures in which words that are "close" in structure space (but potentially far apart in surface form) have high dependency with one another. From a purely representation learning perspective, such model misspecification arising from incorrect statistical assumptions may not be catastrophic.
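The limitation above can be made concrete with a toy calculation: under any conditionally independent (factorized) joint, the model's preference over the second masked token cannot depend on its choice for the first. A minimal sketch with hypothetical probabilities (not from a trained model):

```python
import numpy as np

# Toy unary distributions over the two masked positions, conditioned only on
# the unmasked context (hypothetical numbers, not from a real MLM).
vocab1 = ["reviewer", "reviewers"]
vocab2 = ["was", "were"]
q1 = np.array([0.6, 0.4])    # q(w1 | context)
q2 = np.array([0.55, 0.45])  # q(w2 | context)

# A conditionally independent MLM scores the pair as an outer product.
joint = np.outer(q1, q2)     # joint[i, j] = q1[i] * q2[j]

# Consequence: the ranking over position 2 is identical for every choice at
# position 1, so the model cannot prefer both "reviewer was" over
# "reviewer were" AND "reviewers were" over "reviewers was".
pref_given_reviewer = joint[0].argmax()
pref_given_reviewers = joint[1].argmax()
assert pref_given_reviewer == pref_given_reviewers

# Equivalently, the odds ratio of the 2x2 table is exactly 1 under independence.
odds_ratio = (joint[0, 0] * joint[1, 1]) / (joint[0, 1] * joint[1, 0])
print(round(odds_ratio, 6))  # 1.0
```

The odds ratio being pinned to 1 is exactly the sense in which a factorized model cannot encode subject-verb agreement between the two masked positions.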
These assumptions can enable scalable training and even aid in learning better representations by serving as a statistical bottleneck that forces more information to be captured by the hidden states.1 However, we observe that MLMs are increasingly being employed as probabilistic models of language, for example for scoring (Salazar et al., 2020; Xu et al., 2022) and sampling/decoding (Wang & Cho, 2019; Ghazvininejad et al., 2019; Ng et al., 2020; Yamakoshi et al., 2022) sentences. Under such probabilistic uses of MLMs, it becomes critical to ensure that the underlying statistical assumptions are plausibly grounded in reality. Existing work has approached this problem by using the conditionals of an MLM to define an alternative probabilistic model of language that does not make said conditional independence assumptions. Noting that the unary conditional distributions of an MLM (i.e., the conditional distributions output by the MLM when a single token is masked out) do not make any independence assumptions, Goyal et al. (2022) define a fully connected Markov random field (MRF) language model whose log potential for a sentence is defined to be the sum of these unary log probabilities (or logits). This approach, while sensible, raises two questions: (i) is this new model a good probabilistic model of language, and (ii) are the conditionals of the derived model faithful to the original MLM, i.e., are the unary conditionals of the new model the same as (or similar to) the unary conditionals of the MLM?2 The latter faithfulness question is important because, given the scale at which these models are trained, it is not completely outrageous to posit that the unary conditionals learned by the MLM are close enough to the true unary distributions of language.3 This paper investigates both questions in a controlled pairwise conditional setting where only two tokens are masked out at a time, which makes it possible to compute the MRF's pairwise distribution exactly.
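For two masked positions, the MRF construction can be sketched as follows. Here the unary tables are random stand-ins for MLM forward passes with a single position masked (i.e., hypothetical values of the unary conditionals, not outputs of a real transformer):

```python
import numpy as np

V = 5  # toy vocabulary size
rng = np.random.default_rng(0)

# Hypothetical unary conditional tables standing in for MLM forward passes:
# unary_i[b, a] = log q_i(a | other masked position filled with b)
# unary_j[a, b] = log q_j(b | other masked position filled with a)
unary_i = np.log(rng.dirichlet(np.ones(V), size=V))
unary_j = np.log(rng.dirichlet(np.ones(V), size=V))

# MRF log potential for the pair (a, b): sum of the two unary log probabilities.
log_phi = unary_i.T + unary_j  # log_phi[a, b] = log q_i(a | b) + log q_j(b | a)

# In the two-mask setting the partition function is a sum over V x V pairs,
# so the pairwise MRF distribution can be normalized exactly.
log_Z = np.logaddexp.reduce(log_phi.ravel())
mrf_joint = np.exp(log_phi - log_Z)

assert np.isclose(mrf_joint.sum(), 1.0)
```

Exact normalization over the V x V table is what makes the pairwise setting a useful controlled testbed: for full sentences the partition function is intractable.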
Surprisingly, we find that such pairwise MRFs are often a worse probabilistic model of language than even the original MLM that assumes independence between the two masked tokens. We moreover find that the MRF's unary distributions do not match the MLM's unary distributions. In light of this result, we study two alternative approaches to deriving non-independent pairwise distributions from the MLM's unary distributions. The first approach exploits the Hammersley-Clifford-Besag theorem (Besag, 1974), which allows one to write down a joint distribution in terms of unary conditionals. The second approach uses an iterative algorithm that finds a joint distribution over two masked positions whose unary conditionals are closest, in the KL sense, to the unary conditionals of the MLM (Arnold & Gokhale, 1998). We find that joint pairwise distributions from the iterative approach have better perplexity than both the MRF and the MLM, and also have unary conditionals that are closer to those of the original MLM. While effective, the iterative algorithm is too expensive to be practical. We thus propose an amortized variant of the iterative approach that can compute non-independent pairwise conditionals using only a single forward pass of the MLM followed by an efficient feed-forward layer, and find that this amortized approach outperforms the original MLM when scoring adjacent pairwise tokens. Our code will be made publicly available.
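The flavor of the iterative approach can be illustrated with a simple averaging fixed-point scheme in the spirit of Arnold & Gokhale (1998); this is a sketch, not necessarily the exact updates used in our experiments. The idea: combine each target conditional with the current estimate's marginals to get two candidate joints, then average them. When the two conditionals are exactly compatible, the true joint is a fixed point.

```python
import numpy as np

def fit_joint(cond_x_given_y, cond_y_given_x, iters=2000):
    """Find a joint p(x, y) whose unary conditionals approximately match two
    target conditional tables. cond_x_given_y[x, y] = q(x | y) and
    cond_y_given_x[y, x] = q(y | x). A sketch of one averaging fixed-point
    scheme, not necessarily the exact Arnold-Gokhale updates."""
    nx, ny = cond_x_given_y.shape
    p = np.full((nx, ny), 1.0 / (nx * ny))  # uniform initialization
    for _ in range(iters):
        px = p.sum(axis=1)  # current marginal over x
        py = p.sum(axis=0)  # current marginal over y
        # Average the two joints implied by each conditional + current marginal.
        p = 0.5 * (cond_x_given_y * py[None, :] + cond_y_given_x.T * px[:, None])
    return p

# Sanity check: if the two conditionals come from a single true joint
# (i.e., they are compatible), the scheme recovers that joint.
rng = np.random.default_rng(1)
p_true = rng.random((3, 4)) + 0.1
p_true /= p_true.sum()
cx = p_true / p_true.sum(axis=0, keepdims=True)      # q(x | y), shape (3, 4)
cy = (p_true / p_true.sum(axis=1, keepdims=True)).T  # q(y | x), shape (4, 3)
p_hat = fit_joint(cx, cy)
print(np.abs(p_hat - p_true).max() < 1e-6)  # True
```

When the MLM's conditionals are incompatible (as we find empirically), no joint matches them exactly, and the iteration instead settles on a compromise; running such an iteration for every token pair at inference time is what motivates the amortized variant.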

2. BACKGROUND

We begin by introducing notation. Let $V$ be a vocabulary of tokens, $T$ the text length, and $w \in V^T$ an input sentence/paragraph. We are particularly interested in the case where the tokens at a subset of positions $S \subseteq [T] \triangleq \{1, \ldots, T\}$ are replaced with the special [MASK] token; in this case we use the notation $q_{t|S}(\cdot \mid w_{\setminus S})$ to denote the output distribution of the MLM at position $t \in S$, where $w_{\setminus S}$ is derived from $w$ by masking out $w_t$ for all $t \in S$. MLMs are trained to maximize the log-likelihood of a set of masked words $S$ in a sentence. More formally, consider an MLM parameterized by a vector $\theta \in \Theta$ and some distribution $\mu(\cdot)$ over subsets of positions to mask $S \subseteq [T]$. The MLM learning objective can then be written as
$$\arg\max_{\theta} \; \mathbb{E}_{w \sim p(\cdot)} \, \mathbb{E}_{S \sim \mu(\cdot)} \left[ \frac{1}{|S|} \sum_{t \in S} \log q_{t|S}(w_t \mid w_{\setminus S}; \theta) \right],$$
where $p(\cdot)$ denotes the true data distribution. Let $p_{S|S}(\cdot \mid w_{\setminus S})$ analogously be the conditionals of the data distribution, and further let $q_{S|S}(w_S \mid w_{\setminus S}) \triangleq \prod_{i \in S} q_{i|S}(w_i \mid w_{\setminus S})$ be the joint distribution
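The training objective above can be read as a Monte Carlo recipe: sample a sentence and a mask set, mask out the chosen positions, and average the masked tokens' log probabilities under the model. A toy sketch with a lookup-table "MLM" (the table and sentence are illustrative placeholders, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
V, T = 6, 5                # toy vocabulary size and sentence length
w = rng.integers(0, V, T)  # a "sentence" of token ids

def mlm_conditional(t, masked_positions):
    """Stand-in for one MLM forward pass: returns q_{t|S}(. | w \\ S), the
    model's distribution at masked position t given the partially masked
    sentence. Here it is just a random table keyed by the visible context."""
    key = tuple(int(w[i]) if i not in masked_positions else -1 for i in range(T))
    local_rng = np.random.default_rng(hash((key, t)) % (2**32))
    return local_rng.dirichlet(np.ones(V))

def mlm_objective_term(w, S):
    """One inner term of the objective: (1/|S|) sum_t log q_{t|S}(w_t | w \\ S)."""
    return np.mean([np.log(mlm_conditional(t, S)[w[t]]) for t in S])

S = {1, 3}  # one sampled mask set
print(mlm_objective_term(w, S))  # a single (negative) log-likelihood term
```

The outer expectations over $w$ and $S$ would be estimated by averaging such terms over a corpus and over sampled mask sets; in practice the gradient of this average is what MLM training follows.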



1 Indeed, prior work has found that masking out contiguous words (which on average have higher dependency than non-contiguous words; Joshi et al., 2020) or employing more aggressive masking rates (Wettig et al., 2022) can improve representation learning.
2 Of course, it is possible that the set of unary conditional distributions themselves may be incompatible (Arnold & Press, 1989), i.e., there is no joint distribution whose unary conditionals exactly equal those of the MLM. In our empirical study we show that this is indeed the case.
3 As noted by https://machinethoughts.wordpress.com/2019/07/14/a-consistency-theorem-for-bert/.

