ON THE USEFULNESS OF EMBEDDINGS, CLUSTERS AND STRINGS FOR TEXT GENERATOR EVALUATION

Abstract

A good automatic evaluation metric for language generation ideally correlates highly with human judgements of text quality. Yet there is a dearth of such metrics, which inhibits the rapid and efficient progress of language generators. One exception is the recently proposed MAUVE. In theory, MAUVE measures an information-theoretic divergence between two probability distributions over strings: one representing the language generator under evaluation and the other representing the true natural language distribution. MAUVE's authors argue that its success comes from the qualitative properties of their proposed divergence. Yet in practice, as this divergence is uncomputable, MAUVE approximates it by measuring the divergence between multinomial distributions over clusters instead, where cluster assignments are attained by grouping strings based on a pre-trained language model's embeddings. As we show, however, this is not a tight approximation, in either theory or practice. This begs the question: why does MAUVE work so well? In this work, we show that MAUVE was right for the wrong reasons, and that its newly proposed divergence is not necessary for its high performance. In fact, classical divergences paired with its proposed cluster-based approximation may actually serve as better evaluation metrics. We finish the paper with a probing analysis; this analysis leads us to conclude that, by encoding syntax- and coherence-level features of text while ignoring surface-level features, such cluster-based substitutes for string distributions may simply be better for evaluating state-of-the-art language generators.1

* Equal contribution.
1 Code available at https://github.com/rycolab/clusters-in-language-evaluation.
2 We define a language generator as a probability distribution q_w over strings w. Specifically, we consider this distribution as used during generation. E.g., if decoding is performed with nucleus sampling, we consider the final distribution in which every string containing tokens outside the nucleus is assigned a probability of 0.
3 Most measures we consider are not metrics in a strict sense; we use the term "metric" out of convention.

1. INTRODUCTION

Probabilistic text generators have improved greatly in recent years, with models producing increasingly human-like text (Yang et al., 2019; Brown et al., 2020; Raffel et al., 2020; Rae et al., 2021; Hoffmann et al., 2022). As the gap between human and model-generated text closes, the quality of our evaluation metrics becomes ever more important for determining generator quality, especially given the increasing number of user-facing systems employing these generators. While human evaluations serve as the gold standard, they are costly (in both time and money), leading researchers to rely on automatic metrics, i.e., metrics that can be measured by a computer, for the bulk of their development process.

Many automatic language generator evaluation metrics share the same underlying mechanism: the quantitative comparison of two probability distributions. Specifically, most metrics measure a difference between the distributions over strings defined by (1) a language generation model2 and (2) the natural language itself. This includes some of the most widely used language evaluation metrics:3 cross-entropy (Shannon, 1948), perplexity (Jelinek et al., 1977), and (more recently) MAUVE (Pillutla et al., 2021). As typically applied to evaluate language generators, however, these metrics have a number of computational and qualitative issues (discussed in §3). Such issues manifest empirically: the most commonly used automatic metrics are known to correlate poorly with human judgements (Wiseman et al., 2017; Reiter, 2018; Sellam et al., 2020; Gehrmann et al., 2021).

A newly proposed metric stands apart: MAUVE (Pillutla et al., 2021). In theory, MAUVE measures the area under the curve formed by the divergence between two probability distributions, qualitatively mimicking a precision-recall quantification (Djolonga et al., 2020; Kynkäänniemi et al., 2019). The authors attribute the success of their metric to the qualitative properties of this new class of divergences.
Yet, as this divergence is uncomputable in practice, Pillutla et al. propose an approximation to it. Specifically, rather than directly comparing the original two distributions over strings, MAUVE first clusters samples taken from these distributions based on the embeddings of a pre-trained language model; it then estimates the proposed divergence using the samples' empirically observed multinomial distributions over cluster assignments. As we will show, this approximation is loose in both theory (§4) and practice (§5.1), to the point that the term "approximation" is arguably a misnomer. Thus, the reasons why MAUVE works well, knowledge which is important for the continued progress of language generator evaluation metrics, are still unknown.

In this work, we aim to uncover these reasons. To this end, we consider the axes on which MAUVE differs from other evaluation metrics: is MAUVE's success due to its new divergence, to its "approximation", or to both? Empirically, we identify MAUVE's substitution of probability distributions over strings with probability distributions over embedding-based clusters as the main factor in its success. We show that, mathematically, this substitution leads to a quite biased estimator of the original string-based divergences. Yet it also leads to metrics with lower variance and stronger correlations with human judgements. In fact, all divergence measures analysed here correlate more strongly with human judgements when cluster-based distributions are used in place of string-based ones. Finally, in order to understand the root of the effectiveness of these cluster-based metrics, we probe the clusters themselves. We find that sentence-level permutations within texts noticeably affect cluster assignments, suggesting that cluster-based metrics are sensitive to attributes such as coherence.
On the other hand, basic manipulations that render text non-human-like, such as removing all articles from the input text, do not seem to affect these metrics significantly. Together, these results lead us to conjecture that embedding-based metrics may be preferable when estimating the quality of state-of-the-art (SOTA) language generators, as SOTA models are known to (at least typically) produce grammatical text. That is, by ignoring surface-level features of text, while emphasising discourse- and coherence-level ones, clustered embeddings may simply be better suited for the evaluation of top language generation systems. Yet these findings also suggest routes through which such metrics can be gamed, bringing into question their robustness. We believe these findings, along with the theoretical framework we provide for comparing evaluation metrics, are important for the further development of language generator evaluation metrics.

2. DIVERGENCE METRICS FOR LANGUAGE GENERATOR EVALUATION

When evaluating language generation systems, we first assume the existence of an unknown ground-truth distribution p_w. This distribution is defined over strings w, and its domain is W ≡ Σ*, where Σ is an alphabet of words and Σ* is its Kleene closure. Second, we are given a probabilistic text generator q_w, which is also a distribution over W. An evaluation metric for a language generator q_w can now be defined as a measure of its "distance" from p_w: ∆(p_w, q_w). In short, ∆(•, •) should return high values if q_w is a bad approximation to p_w, and low values if it is a good one. Notably, it is not clear whether q_w being a good approximation to p_w in terms of an arbitrary ∆(•, •) guarantees that it will be a good language generator. Indeed, models that perform well in terms of standard metrics, such as perplexity, often still produce poor-quality text (Holtzman et al., 2020). Thus, we are interested specifically in ∆(•, •) that correlate highly with human quality judgements.

More formally, we define human quality judgements as a (potentially noisy) mapping α(q_w) from a language generator to a real-valued score. For a fixed p_w, a useful metric ∆(p_w, •) for evaluating the quality of a language generator q_w is one whose scores correlate highly with α(•). This notion can be operationalised as follows. Assume we have N language generator models. Let us define:

δ_human(q_w^(1), ..., q_w^(N)) = ( α(q_w^(1)), ..., α(q_w^(N)) )   (1)
δ_metric(q_w^(1), ..., q_w^(N)) = ( ∆(p_w, q_w^(1)), ..., ∆(p_w, q_w^(N)) )   (2)

We then quantify a metric's usefulness on a specific natural language task (and its distribution p_w) as:

quality(∆, p_w) = |corr(δ_human, δ_metric)|   (3)

We now review common choices for ∆(•, •). Given the probabilistic nature of most language generators, a number of divergence measures, which quantify the difference between two probability distributions, are among these choices.
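The quality measure in Eq. (3) can be sketched in a few lines of Python. The human ratings and metric values below are hypothetical, and we use a Spearman correlation (Pearson on ranks), one of the two correlations considered in this paper:

```python
def ranks(xs):
    # Average ranks (1-based); ties receive the mean of their positions.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def metric_quality(human_scores, metric_scores):
    # quality(Δ, p_w) = |corr(δ_human, δ_metric)|, Eq. (3). The absolute value
    # makes the sign irrelevant: divergences decrease as quality increases.
    return abs(pearson(ranks(human_scores), ranks(metric_scores)))

# Hypothetical scores for N = 4 generators: human ratings (higher = better)
# vs. a divergence-style metric (lower = better).
human = [4.2, 3.1, 2.5, 1.0]
metric = [0.08, 0.15, 0.21, 0.55]
print(metric_quality(human, metric))  # ≈ 1.0: the metric ranks generators perfectly
```

A perfect anti-correlation of ranks yields quality 1; a metric uninformative about human preferences would hover near 0.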
4 The rest of this work focuses primarily on this class of metrics.

Forward Divergence. Cross-entropy, ∆_H(p_w, q_w) := H(p_w, q_w), which is equivalent (up to an additive constant) to the forward Kullback-Leibler (KL) divergence, is one such choice:

∆_→(p_w, q_w) := KL(p_w || q_w) = H(p_w, q_w) − H(p_w) ∝ H(p_w, q_w) = ∆_H(p_w, q_w)   (4)

where we use ∝ to signify additive or multiplicative equivalence. The equivalence in Eq. (4) holds because H(p_w) is constant with respect to q_w. Since Pearson and Spearman correlations, the metrics we use to evaluate ∆'s quality, are invariant to translational shifts, cross-entropy and the forward KL are equivalent as language generator metrics; we will refer to them interchangeably in subsequent comparisons.

Backward Divergence. Albeit much less common, another potential evaluation metric is the backward (exclusive) KL divergence:

∆_←(p_w, q_w) := KL(q_w || p_w)   (5)

As opposed to the forward KL, when used as an evaluation metric, Eq. (5) is not effectively equivalent to the cross-entropy between q_w and p_w, as H(q_w) is not constant across language generators q_w.

Exponentiated Divergence. By far the most common choice of ∆ for evaluating language models is the perplexity: ∆_perp(p_w, q_w) := e^{H(p_w, q_w)}. Notably, perplexity is equivalent (up to a multiplicative constant) to an exponentiated KL divergence between p_w and q_w, which follows from the same relationship as in Eq. (4). Given that both Pearson and Spearman correlations are also invariant to a change in scale, perplexity and the exponentiated KL are thus equivalent as language generator metrics. For consistency, we will use solely the exponentiated KL in our analyses:

∆_exp(p_w, q_w) := e^{KL(p_w || q_w)}   (6)

Jensen-Shannon Divergence. Note that the KL divergence is non-symmetric and unbounded.
On the other hand, the Jensen-Shannon (JS) divergence, defined as the average of two KLs, is symmetric with respect to its inputs and is guaranteed to produce bounded values:

∆_JS(p_w, q_w) := 1/2 ( KL(p_w || r_w^0.5) + KL(q_w || r_w^0.5) ),   r_w^λ = λ p_w + (1 − λ) q_w   (7)

Area Under the Curve (AUC) Divergence. Finally, information divergence frontiers are a recently proposed class of metrics for generative models (Sajjadi et al., 2018; Kynkäänniemi et al., 2019). The variant proposed by Pillutla et al. (2021) computes the area under the curve formed by a series of KL divergences as we change a mixing parameter λ:

∆_AUC(p_w, q_w) := 1 − AUC( e^{−s KL(p_w || r_w^λ)}, e^{−s KL(q_w || r_w^λ)} ),   r_w^λ = λ p_w + (1 − λ) q_w   (8)

where λ is varied across the interval [0, 1] and s ∈ R_{>0} is a strictly positive real-valued scaling constant. Note that we define the AUC divergence as 1 − AUC(•, •) so that a larger value indicates a greater discrepancy with the reference corpus p_w.
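To make these definitions concrete, here is a small self-contained sketch of the divergences above on toy multinomials. All probabilities are invented for illustration, and the AUC computation is a simplification of Eq. (8): we trace the frontier on a λ grid, pad it with the corner points (0, 1) and (1, 0), and integrate with the trapezoidal rule.

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -Σ_x p(x) log q(x), in nats
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    # KL(p || q) = H(p, q) - H(p), Eq. (4)
    return cross_entropy(p, q) - cross_entropy(p, p)

def js(p, q):
    # Eq. (7): symmetric in its arguments and bounded by log 2 (in nats)
    r = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, r) + 0.5 * kl(q, r)

def auc_divergence(p, q, s=1.0, grid=100):
    # Eq. (8), simplified: frontier of (e^{-s KL(p||r_λ)}, e^{-s KL(q||r_λ)})
    # over a λ grid, padded with corner points, integrated by trapezoids.
    pts = [(0.0, 1.0), (1.0, 0.0)]
    for i in range(1, grid):
        lam = i / grid
        r = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]
        pts.append((math.exp(-s * kl(p, r)), math.exp(-s * kl(q, r))))
    pts.sort(key=lambda t: (t[0], -t[1]))
    area = sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
    return 1.0 - area

# Toy distributions over a four-string support.
p = [0.4, 0.3, 0.2, 0.1]
q = [0.1, 0.2, 0.3, 0.4]

# Shift invariance: H(p, q) and KL(p || q) differ by the constant H(p),
# so they rank generators identically under Pearson/Spearman correlations.
assert abs((cross_entropy(p, q) - kl(p, q)) - cross_entropy(p, p)) < 1e-12
assert abs(js(p, q) - js(q, p)) < 1e-12   # symmetry
assert auc_divergence(p, p) < 1e-9        # identical distributions → divergence 0
```

Identical distributions give ∆_AUC ≈ 0, while increasingly dissimilar pairs push the value towards 1, matching the sign convention in Eq. (8).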

3. INFELICITIES AND APPROXIMATIONS

There are several issues, both computational and qualitative, with using the divergences presented in §2 to evaluate language generators. We now review these issues, along with both commonly-used and newly-proposed methods to address them via approximations.

3.1. NECESSITY OF FULL SUPPORT

A well-known property of the (forward) KL divergence between two distributions p_w and q_w is that it is infinite for any q_w that assigns 0 probability to an event in the support of p_w (i.e., to any w for which p_w(w) > 0 but q_w(w) = 0). The above is often not an issue for ∆_exp and ∆_→ per se: most neural language generators cannot assign 0 probability to any string, due to the final softmax operation typically used to project their outputs onto the probability simplex. However, these same models are often used with decoding strategies that prune the space W: e.g., both top-k and nucleus sampling modify q_w such that strings which do not meet a certain criterion are reassigned 0 probability. While top-k and nucleus sampling typically lead to systems with qualitatively better text, such systems will likely be given an infinitely bad score by both ∆_exp and ∆_→, which is perhaps too harsh a penalty for an otherwise good language generator.
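This failure mode can be illustrated with a small numerical example, with hypothetical probabilities standing in for string distributions:

```python
import math

def forward_kl(p, q):
    # KL(p || q); returns math.inf if q gives zero mass to a supported event.
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.3, 0.15, 0.05]      # "true" distribution over four strings
q_full = [0.4, 0.3, 0.2, 0.1]   # softmax-style model: full support
# Top-k style pruning (k = 2): the two tail strings are reassigned
# probability 0 and the head is renormalised (0.4, 0.3 → 4/7, 3/7).
q_topk = [4 / 7, 3 / 7, 0.0, 0.0]

assert forward_kl(p, q_full) < math.inf
assert forward_kl(p, q_topk) == math.inf  # one pruned string → infinite penalty
```

A single pruned string in the support of p suffices to send the forward KL (and hence ∆_→ and ∆_exp) to infinity, regardless of how good the generator is elsewhere.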

3.2. p_w IS UNKNOWN

In practice, we do not have access to the true distribution p_w. Rather, we are typically given a corpus {w_n^{p_w}}_{n=1}^N, whose instances we assume to be sampled i.i.d. from p_w. The common approach to address this issue is thus to derive a statistical estimator ∆̂ that uses this corpus to approximate ∆. There are two common strategies for building such estimators: Monte Carlo and plug-in estimation.

Monte Carlo Estimation. Our i.i.d. assumption w.r.t. the samples in {w_n^{p_w}}_{n=1}^N allows us to derive a Monte Carlo estimator for certain divergences. We start with the forward KL divergence:

KL̂(p_w || q_w) := (1/N) Σ_{n=1}^N log( p_w(w_n^{p_w}) / q_w(w_n^{p_w}) ) = −(1/N) Σ_{n=1}^N log q_w(w_n^{p_w}) + const   (9)

where const ∈ R is constant with respect to q_w. Eq. (9) is an unbiased estimator of the KL divergence, which in turn allows us to build the estimators ∆̂_→ and ∆̂_exp. Unfortunately, unbiased estimates of ∆_←, ∆_JS and ∆_AUC are not as straightforward to compute, as they require explicit knowledge of p_w rather than just samples from it (see App. A). This issue motivates the use of our next set of estimation techniques.

Plug-in Estimation. Here we consider estimation via building an approximation of p_w itself to use in the formulas given in §2. Specifically, we construct a density estimator for p_w (which we denote p̂_w) and "plug it into" a given ∆. However, this is a bit circular: the task of building a language generator q_w is itself often framed as density estimation of p_w. Thus, if we think q_w is the "best" estimator of p_w, we should logically use it in our plug-in estimator. Yet using q_w would be nonsensical; by the definition of a (shifted) divergence, it would always lead to the lowest possible value of ∆, e.g., ∆_→(q_w, q_w) = 0. To use plug-in estimation in this setting, we should therefore choose a different estimator for p_w, e.g., from a family of density estimators that differs from the one used to create q_w.
More formally, we consider a function π which takes a corpus as input and produces a (queryable) distribution p̂_w := π({w_n^{p_w}}_{n=1}^N). This function typically induces a secondary model, e.g., an n-gram model or a neural network, trained on the corpus {w_n^{p_w}}_{n=1}^N. Our chosen π may introduce biases (e.g., from the inductive biases of the architecture parameterising π) into our metrics' estimation. To balance out such biases, we may consider using the same method to create an approximation q̂_w for use in our plug-in estimators, rather than directly querying q_w:

∆̂_←({w_n^{p_w}}_{n=1}^N, q_w) := KL(q̂_w || p̂_w)   (10)

Plug-in estimators for ∆_JS and ∆_AUC are defined similarly. Further, if q̂_w is a smoothed approximation to the original q_w, using it may also mitigate the issues discussed in §3.1. We thus also compute estimators for the forward and exponentiated divergences using plug-in estimators, e.g.:

∆̂_→({w_n^{p_w}}_{n=1}^N, q_w) := −(1/N) Σ_{n=1}^N log q̂_w(w_n^{p_w})   (11)

Unfortunately, most functions π cannot produce a good estimate of p_w using only a small corpus, which is the case we consider since we rely on evaluation sets for {w_n^{p_w}}_{n=1}^N. While the best available language models are a class of π typically trained on millions (if not billions) of sentences, a standard evaluation set is quite small, on the order of one to ten thousand sentences, and we cannot expect π to provide a good p̂_w when fit on such a small dataset. Accordingly, depending on our choice of π, this class of metrics may have either high variance or high bias, both of which are problematic.
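Both estimation strategies can be sketched on a toy four-string "language". The unigram model below is a deliberately crude stand-in for π (the experiments in this paper use Kneser-Ney smoothed 5-gram models), and all probabilities are invented for illustration:

```python
import math
import random
from collections import Counter

# Toy "true" and model string distributions.
p_w = {"the cat": 0.5, "a cat": 0.3, "the dog": 0.15, "a dog": 0.05}
q_w = {"the cat": 0.4, "a cat": 0.3, "the dog": 0.2, "a dog": 0.1}

# --- Monte Carlo estimation, Eq. (9) ---------------------------------------
random.seed(0)
strings = list(p_w)
corpus = random.choices(strings, weights=[p_w[w] for w in strings], k=50_000)
mc_kl = sum(math.log(p_w[w] / q_w[w]) for w in corpus) / len(corpus)
exact_kl = sum(pw * math.log(pw / q_w[w]) for w, pw in p_w.items())
# mc_kl ≈ exact_kl, with error shrinking as O(1/√N)

# --- Plug-in estimation, Eq. (10) ------------------------------------------
def pi(corpus, vocab, alpha=1.0):
    # Stand-in for π: a Laplace-smoothed empirical distribution over strings.
    counts = Counter(corpus)
    total = len(corpus) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

model_corpus = random.choices(strings, weights=[q_w[w] for w in strings], k=5_000)
p_hat = pi(corpus, strings)        # p̂_w = π(human corpus)
q_hat = pi(model_corpus, strings)  # q̂_w = π(model corpus)
plugin_backward = sum(q_hat[w] * math.log(q_hat[w] / p_hat[w]) for w in strings)
```

The Monte Carlo route needs only samples (plus q_w's probabilities), whereas the plug-in route yields queryable densities p̂_w and q̂_w, making backward, JS and AUC divergences computable at the cost of π's bias.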

3.3. CLUSTERING-BASED APPROXIMATIONS

For the ∆ above that require density estimators for p_w and/or q_w, our choice of π will have a large effect on the metric's value. We may thus wish to rethink our approximation technique altogether and instead work with different distributions for which we can create lower-variance density estimators. This is the approach used by Pillutla et al. (2021) when approximating ∆_AUC. Specifically, instead of computing the above metrics on the original distributions p_w and q_w, they use the cluster-based distributions p_c and q_c. Given a pre-trained language model, these cluster-based distributions are defined as:

p_c(c) := Σ_{w∈W} p_w(w) 1{c = φ(PLM(w))}   (12)

where PLM(•) takes as input an utterance w and outputs an embedding r, and φ(•) is a pre-trained clustering function. Note that the function φ(•) is trained jointly on samples from the two distributions under consideration, as we will detail later; we refer the reader to Pillutla et al. (2021) for a more detailed explanation of the procedure. Given these distributions, we can evaluate cluster-based versions of all the divergences above simply by substituting the original p_w and q_w with the new p_c and q_c.
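A minimal sketch of this pipeline, with random Gaussian vectors standing in for PLM embeddings and scikit-learn's PCA and K-means playing the role of φ (the actual procedure uses GPT-2 embeddings and K = 500):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_distributions(emb_p, emb_q, k=8, seed=0):
    """Empirical version of Eq. (12): replace string distributions with
    multinomials over clusters. φ = PCA + K-means, fit jointly on samples
    from both distributions, as in MAUVE."""
    joint = np.vstack([emb_p, emb_q])
    # Keep 90% of the variance, mirroring the setup described in §5.
    reduced = PCA(n_components=0.9, random_state=seed).fit_transform(joint)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(reduced)
    lp, lq = labels[: len(emb_p)], labels[len(emb_p):]
    p_c = np.bincount(lp, minlength=k) / len(lp)  # empirical p_c
    q_c = np.bincount(lq, minlength=k) / len(lq)  # empirical q_c
    return p_c, q_c

# Stand-in "embeddings": two Gaussian samples in place of real PLM outputs.
rng = np.random.default_rng(0)
p_c, q_c = cluster_distributions(rng.normal(0.0, 1.0, (200, 32)),
                                 rng.normal(0.5, 1.0, (200, 32)))
```

In practice `emb_p` and `emb_q` would come from a pre-trained language model, and the resulting multinomials `p_c`, `q_c` can then be fed to any of the divergences from §2.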

4. ANALYSING CLUSTER-BASED APPROXIMATIONS

We now take a closer look at the bias introduced by the substitution of cluster-based distributions suggested in §3.3. For simplicity, we focus on the bias introduced to KL(p_w || q_w), a computation involved in MAUVE's ∆_AUC. This divergence can be decomposed as:

KL(p_w || q_w)  =  KL(p_c || q_c) + E_{c∼p_c}[ KL(p(w | c) || q(w | c)) ]  ≥  KL(p_c || q_c)   (13)
  (string-based KL)                         (≥ 0)                           (cluster-based KL)

where the equality follows from the fact that p(c, w) = p(w), which is true because the cluster assignment is deterministic, i.e., p(c | w) = 1{c = φ(PLM(w))}. See the full decomposition of this equation in App. B. Notably, as KL divergences are always non-negative, the cluster-based version is negatively biased, lower-bounding the string-based one. Further, the actual measurement is done on the distribution over cluster assignments p_c; the conditional distribution p(w | c) is completely ignored. Assuming a reasonable number of clusters is used when defining p_c, however, it should be easier to create good approximations of distributions over clusters than of distributions over strings, due to the sheer difference in the size of their supports alone. Consequently, the variance of cluster-based metrics should be lower, at the cost of the bias introduced by this substitution. Further, it is not clear whether this bias is inherently bad when evaluating the quality of language generators: the answer to this question must be determined empirically (by measuring the correlation in Eq. (3)). To this end, we now provide an empirical comparison between string- and cluster-based language generation evaluation.
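The lower bound in Eq. (13) can be checked numerically on a toy example. The cluster assignment φ below is hypothetical, but the inequality holds for any deterministic clustering (it is an instance of the data-processing inequality):

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Six strings, deterministically assigned to three clusters via φ(PLM(w)).
phi = [0, 0, 1, 1, 2, 2]  # hypothetical cluster id of each string
p_w = [0.30, 0.10, 0.20, 0.15, 0.15, 0.10]
q_w = [0.10, 0.20, 0.25, 0.05, 0.20, 0.20]

def marginalise(dist, phi, k=3):
    # p_c(c) = Σ_w p_w(w) 1{c = φ(w)}, Eq. (12)
    out = [0.0] * k
    for pr, c in zip(dist, phi):
        out[c] += pr
    return out

p_c, q_c = marginalise(p_w, phi), marginalise(q_w, phi)
# Eq. (13): the cluster-based KL lower-bounds the string-based KL.
assert kl(p_c, q_c) <= kl(p_w, q_w) + 1e-12
```

Coarsening the clustering (fewer clusters) can only shrink the cluster-based KL further, which is exactly the negative bias discussed above.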

5. EXPERIMENTS

Setup. We follow the setup of Pillutla et al. throughout our experiments. We compare systems for open-ended text generation q_w with human-generated text p_w. As human-generated samples, we use 5k strings taken from WebText's test set. As model-generated text, we sample 5k strings from each of our evaluated systems, conditioning our models on the first 10 words of human-generated strings before sampling (i.e., a text-completion task). For our language generators, we compare 4 model architectures (all variants of GPT-2), each under two decoding strategies, giving us a total of 8 systems. Explicitly, we compare the small, medium, large, and XL versions of GPT-2, decoding strings using either ancestral or nucleus sampling. Following Pillutla et al. (2021), we use a nucleus probability of 0.9 for the small and medium models and 0.95 for the large and XL models. Importantly, we run our experiments only on English text, which is a notable limitation of our work; future work should verify that our findings hold across languages.

String-based Approximations p̂_w. To compute our string-based divergences, we require a secondary language model p̂_w to estimate p_w. Further, following the issues highlighted in §3, we also rely on a secondary language model q̂_w to estimate q_w. We use n-gram models for these approximations. Specifically, we use Kneser-Essen-Ney smoothed 5-gram models, as implemented in KenLM (Ney et al., 1994; Heafield, 2011). We choose n-gram models explicitly because, while they are by no means SOTA language models, they should have inductive biases different from those of the models we are trying to evaluate. We present results using LSTM-based estimators in App. E. When computing ∆_AUC, we use a scaling constant s of 0.2.

Cluster-based Approximations p̂_c. Cluster-based distributions, as presented in Eq. (12), are defined by a choice of PLM(•) and a pre-trained clustering function φ(•).
We rely on GPT-2 XL as our PLM and use K-means as our clustering function; results using other pre-trained language models are in App. E. Specifically, we first extract the embedding of the last word in each sentence from GPT-2 XL and then use PCA to reduce their dimensionality (keeping 90% of the variance explained); results using the mean of word embeddings can be found in App. E. We then train K-means (with K = 500) on the joint set of GPT-2 embeddings extracted from the 5k human-generated strings and the 5k model-generated strings. Finally, we approximate p_c and q_c by computing the frequency with which these strings are assigned to each cluster. To avoid infinite divergence measures, we estimate these distributions using Laplace smoothing with α = 1 (which is equivalent to imposing a Dirichlet prior with concentration α = 1 over cluster assignments). When computing ∆_AUC, we use a scaling constant s of 5.

5.1. ARE THE APPROXIMATIONS FAITHFUL?

Figure 1: Correlations between the true q_w and the estimated q̂_c and q̂_w.

Our first experiment tries to identify whether p̂_c and q̂_c provide faithful approximations of p_w and q_w. To this end, we compare both q̂_w and q̂_c to the true q_w, i.e., the language generator under evaluation. Explicitly, we compute the Spearman correlations between the probabilities assigned by each model to the strings in {w_n^{q_w}}_{n=1}^N. Fig. 1 presents these correlations. We see that, despite being estimated on very little data, probability estimates from our n-gram models correlate strongly with the ground-truth probabilities of q_w; this result holds for all four GPT-2 architectures. On the other hand, our cluster-based probabilities consistently present negative correlations with q_w. This result has an important implication: if cluster distributions do not correlate with q_w, then KL(p̂_c || q̂_c) is likely a poor estimate of KL(p_w || q_w). This further implies that the approximation used by Pillutla et al. is not an accurate estimate of ∆_AUC(p_w, q_w), which brings into question whether this new divergence is really responsible for MAUVE's success.
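The Laplace-smoothed estimate of the cluster distributions used throughout these experiments can be sketched in a few lines (the assignments below are hypothetical; the real setting uses K = 500 clusters):

```python
def smoothed_cluster_dist(assignments, k, alpha=1.0):
    # Laplace-smoothed (add-α) estimate of p_c from cluster assignments.
    # With α = 1, no cluster receives zero mass, so KLs stay finite.
    counts = [0] * k
    for c in assignments:
        counts[c] += 1
    total = len(assignments) + alpha * k
    return [(n + alpha) / total for n in counts]

# Six strings assigned to clusters {0, 1, 2}; cluster 3 is never observed
# but still receives mass (0 + 1) / 10 under the smoothed estimate.
p_c = smoothed_cluster_dist([0, 0, 1, 2, 2, 2], k=4)
print(p_c)
```

Without smoothing, any cluster populated by only one of the two corpora would send the KL-based divergences to infinity, echoing the support issue of §3.1.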

5.2. ∆ AS TEXT EVALUATION METRICS

We now compare how various string- and cluster-based divergence measures correlate with human judgement scores. In short, Fig. 2 shows that all divergences do better when estimated with cluster distributions. These results evince that MAUVE's (Pillutla et al., 2021) high correlations with human judgements (represented here as ∆_AUC(p̂_c, q̂_c)) are mainly due to its use of cluster-based approximations (p̂_c, q̂_c), rather than to its proposed divergence ∆_AUC. In fact, we see slight improvements over ∆_AUC when using the divergences ∆_← and ∆_JS instead. Furthermore, cluster-based divergences appear to be more stable, exhibiting smaller variances across random seeds. Collectively, our results suggest that cluster-based divergences may produce better metrics of text quality than string-based ones. This motivates the question: which components of natural language are captured by the cluster-based distributions p_c, and which are overlooked by ignoring p(w | c) when computing cluster-based divergences? Our next set of experiments aims to answer this.

6. PROBING CLUSTERS

To better understand the aspects of natural language that our cluster distributions encode, we must first understand how φ(PLM(•)) partitions the string space W. In other words, we must understand which components of natural language (e.g., semantics, syntactic attributes, or surface features) lead to strings being assigned to similar or different clusters. Such an analysis should provide deeper insight into the actual similarity being measured by cluster-based divergences (while also revealing how such a metric might be gamed). To this end, we probe (Alain & Bengio, 2016) the clusters learned by the MAUVE algorithm for a number of linguistic attributes, including subject matter, sentiment, prose style, word order, basic grammaticality and document length, looking at how they affect both cluster assignment and the divergence scores between texts that differ in these characteristics. Notably, we probe cluster assignments directly, without relying on any diagnostic classifiers (Adi et al., 2017). Our probing analyses are thus exempt from most of the recent criticism levelled against probing methodologies (Hewitt & Liang, 2019; Pimentel et al., 2020a; b; Ravichander et al., 2021; Elazar et al., 2021).

6.1. FINDING FEATURES p_c ENCODES

Setup. We look at texts annotated with different attribute categories in order to explore correlations between the presence of these attributes and cluster membership. Specifically, we analyse texts' sentiment, authorship, and topic (using the Yelp Polarity, News Category, and 20 Newsgroups datasets, respectively). Further details on these datasets are provided in App. D. For each of these classification datasets, we compute the cluster-category distributions that result from the MAUVE algorithm using the dataset's standard training split; all evaluations are then performed on the test split. Explicitly, we first learn a partitioning φ(•) of the embedding space (w.r.t. a language model PLM(•)). Each cluster is then labelled with the majority category among the training examples it contains; the category of a test example is then predicted according to the cluster it falls into. For comparison's sake, we use four language models as PLM(•): GPT-2 with the small, medium, large, and XL architectures. Results using embeddings from BERT (Devlin et al., 2019) can be found in App. E. Further, we use two methods for learning clusters:

• φ(•) Learned on WebText. Using the same procedure as in §5.2, we train the PCA and K-means functions to partition the embedding space (again relying on WebText's test set for our data). This mimics the setting under which our partitions would be learned in practice.

• φ(•) Learned on Training Set. We instead train the PCA and K-means clustering functions on the analysed dataset's training set. This setting studies the partitioning that our clustering functions have the capacity to learn in an ideal setting, i.e., one where the attribute in question is one of the main differentiating factors between texts.

Results. In Fig. 3a, we see that, at least for large numbers of clusters, cluster assignment is indeed indicative of a text's sentiment. Interestingly, this is the case even when clusters are trained on data that is not particularly polar in sentiment (i.e., on WebText). On the other hand, we are only able to predict author and topic with reasonable accuracy when clusters are learned on text data with authorship or topic as a distinguishing factor. These results indicate that, while writing style and subject matter are captured by the text embeddings, they were likely not used as distinguishing features between corpora in our cluster-based divergences. We further see that, in all classification settings, the capacity to encode the analysed attributes appears to increase with model size, perhaps suggesting that the embedding spaces of larger models decompose along higher-level features of text.

6.2. HOW TEXT FEATURES IMPACT ∆

We next assess how changing different features of the evaluated text impacts divergence scores. Specifically, we look at the impact of: text truncation; article removal; stopword removal; sentence-level permutations; and word-level permutations.

Setup. We follow a similar setup to §5. In order to create a more controlled setting, though, we primarily consider human-generated text in these experiments (i.e., the 5k human-written articles in WebText's test set). We take the first 2500 articles of this dataset as our reference corpus {w_n^{p_w}}_{n=1}^N. We then use the remaining 2500 reference strings as the comparison corpus, i.e., in place of the model-generated text {w_n^{q_w}}_{n=1}^N that we would typically evaluate. In order to explore how changing specific features of text affects ∆ w.r.t. the reference corpus, we then compute scores after making the following modifications to the comparison corpus:

• No Modification (p). This is a baseline experiment in which we keep the original strings unchanged.

• Text Truncation (p_short). We truncate texts to 1/3 of their original length. This allows us to understand whether the divergence metrics pick up on differences in dataset length statistics.

• Article Removal (p_no art). We remove all articles ('a', 'an' and 'the') from the text. This allows us to understand whether the divergence metrics can distinguish between texts with or without basic levels of fluency and grammaticality.

• Stopword Removal (p_no stop). We remove all stopwords (e.g., 'that' or 'so') from the text. This allows us to understand whether the divergence metrics can detect differing levels of syntactic coherence, rather than just focusing on content words.

• Sentence-level Permutation (p_swap). We permute the first halves of texts (as delineated by sentences) across the entire corpus (i.e., randomly reassigning the strings' first halves). This allows us to understand whether the divergence metrics detect coherence.

• Word-level Permutation (p_rand). We randomly permute all words within a text. This allows us to understand whether the divergence metrics distinguish anything beyond bag-of-words-level features.

• GPT-2 Baseline (q). As an extra baseline, we also use the first 2500 generations from GPT-2 XL on the WebText text-completion task, as in §5.

Results. Fig. 4 shows that certain alterations to the evaluated text, such as completely removing articles, have almost no impact on its divergences from the reference corpus for various ∆. In fact, text without any articles is judged as better than GPT-2 XL's by all of the cluster-based divergences (see Fig. 10 for a zoomed-in version). Further, while this perturbation undoubtedly affects the text's fluency, it has less of an effect on ∆ than, e.g., truncating texts. This is arguably undesirable: a metric of text quality should place more emphasis on fluency than on surface statistics, such as length. On the other hand, our metrics deem text with stopwords removed utterly different from the reference. Permuting words within texts has a similar effect, demonstrating that, at least to some extent, the embedding space captures notions of syntax and grammaticality, rather than pure unigram statistics. The increase in ∆ observed when performing sentence-level permutations likewise suggests that the clusters delineate different levels of coherence. In Fig. 13 (in App. E), we perform an additional experiment where we again probe the clusters (as in §6.1), but this time for surface features of text, such as the percentage of stopwords and punctuation symbols in a text. There we see evidence that such features are not strongly encoded in the clustering scheme. Perhaps surprisingly, when applied to cluster-based distributions, all of the studied ∆ metrics rank the distances of the perturbed texts from the reference texts in the exact same order (this is clear in Fig. 10, a zoomed-in version of Fig. 4).
For these perturbations, the ∆ differ only in the magnitude of their outputs, further suggesting that the ∆_AUC metric itself is likely not critical to the effectiveness of MAUVE. One potential avenue for future research would be investigating whether different algorithms for discretisation of the embedding space create clusters that align with specific linguistic attributes; this could serve as a useful diagnostic for the improvement of language generators.

This section's results, along with those of §6.1, suggest that divergences based on PLM embeddings are more sensitive to syntax- and coherence-related properties of the target text than to its superficial features. The opposite might be said of our string-based distributions. These findings offer a potential explanation for the effectiveness of metrics that make use of PLM embeddings, such as MAUVE or BERTSCORE (Zhang et al., 2020). As current state-of-the-art language generators already typically produce grammatical text, invariance to surface statistics may be a feature, as opposed to a bug, when trying to assess the quality of the text they produce. Yet this may also reveal ways in which such metrics can be gamed, bringing the paradigm's robustness into question.
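The corpus perturbations studied in this section can be sketched as simple string transformations. The following is a minimal illustration; the function names and the toy stopword list are ours (the paper's experiments use NLTK's full English stopword list, and split p_swap at sentence boundaries), so treat this as a sketch rather than the exact implementation:

```python
import random

# Toy stand-in stopword list; the paper uses NLTK's English stopwords.
STOPWORDS = {"a", "an", "the", "that", "so", "is", "to", "of", "and"}
ARTICLES = {"a", "an", "the"}

def truncate(text, frac=1 / 3):
    """Keep the first `frac` of a text's words (p_short)."""
    words = text.split()
    return " ".join(words[: max(1, int(len(words) * frac))])

def remove_articles(text):
    """Drop 'a', 'an' and 'the' (p_no_art)."""
    return " ".join(w for w in text.split() if w.lower() not in ARTICLES)

def remove_stopwords(text):
    """Drop all stopwords (p_no_stop)."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

def permute_words(text, seed=0):
    """Shuffle all words within a text (p_rand)."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def swap_first_halves(corpus, seed=0):
    """Randomly reassign the first halves of texts across a corpus (p_swap).
    The paper splits at sentence boundaries; we split at the word midpoint."""
    split = [(t.split(), len(t.split()) // 2) for t in corpus]
    firsts = [words[:k] for words, k in split]
    random.Random(seed).shuffle(firsts)
    return [" ".join(f + words[k:]) for f, (words, k) in zip(firsts, split)]
```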

7. CONCLUSION

In this paper, we analyse MAUVE, a recently proposed automatic metric for language generator evaluation. While MAUVE correlates quite well with human quality judgements, it is unclear which of the metric's design choices are in fact responsible for its success, a shortcoming that impedes the further development of language generator evaluation metrics. We attempt to rectify this shortcoming. We first provide a general theoretical framework for the comparison of language generator evaluation metrics. Then, through a series of empirical studies, we identify MAUVE's substitution of probability distributions over embedding-based clusters, in place of the traditional distributions over strings, as the attribute largely responsible for the metric's success. In order to better understand the nature of this improvement, we probe the clusters used by the density estimators, analysing what they ignore and what they emphasise about the input text. We find that, while distributions over clusters are sensitive to syntax- or coherence-level perturbations of the text, this is not the case for several surface-level perturbations. We thus conjecture that, by focusing on higher-level text features, cluster-based evaluation metrics may simply be better suited to rank high-performing models, and that this is a general paradigm worth further exploration.

Such statistics include n-gram repetitions (Welleck et al., 2020), the Zipfian coefficient (Holtzman et al., 2020), and the perplexity of generated text (Fan et al., 2018). Final assessments of language generation systems are still often performed using human evaluations, as automatic metrics on their own have not proven sufficient for differentiating between top-end language generation systems. Automatic evaluation metrics for language models based on statistical divergences have been proposed by a number of different authors.
For instance, Meister & Cotterell (2021) assess the quality of language models using a number of divergences between distributions over surface statistics of text corpora. Xiang et al. (2021) propose approximating distributions over strings with distributions in the embedding space within standard divergence metrics. Finally, Pillutla et al. (2021) present MAUVE, the object of study of this work.

D EXPERIMENTAL SETUP

Embedding Models. To obtain text embeddings for our input strings, we use several different PLMs. Namely, we use the four sizes of GPT-2 (Radford et al., 2019), as well as BERT-base and BERT-large (both cased; Devlin et al., 2019). For the former, we use the embedding of the final token. For the latter, we use the embedding associated with the special CLS token. In both cases, we additionally present results using the average across token embeddings in App. E. All texts are truncated to 512 tokens (for the experiments in §6.2, this truncation is performed before any manipulation of the text), in order to ensure that all models can fit the input text in their context window.

Probing Datasets. For our probing analysis in §6, we employ the following datasets:

Figure 13: R² when using cluster assignments to predict the percentage of tokens in a text that are either punctuation or stopwords. The setup follows that of §6.1, albeit using solely the WebText dataset to train our clustering functions in this setting. We compute the average percentage of stopwords or punctuation per cluster in half of our strings and use these pre-computed averages when predicting the percentages in the other half, computing this prediction's R² (i.e., the percentage of explained variance). We see that larger PLMs, which are often claimed to provide better representations of language, do encode more information about such surface features than smaller models. This could simply be due to the fact that the embeddings from larger PLMs are typically of higher dimension and thus have the capacity to encode additional (perhaps "less critical") attributes of text. While these attributes do not appear to be differentiating factors when partitioning the embedding space into a small number of clusters, they become relevant when partitioning into a larger number of clusters. Even with several clusters and large PLMs, though, the R² values we find are still quite small, at around 0.20.
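The prediction scheme behind Fig. 13 can be sketched with a short numpy-only function. The function name and toy inputs below are ours; in the actual experiments, cluster assignments come from k-means over PLM embeddings and the target is a per-text stopword or punctuation percentage:

```python
import numpy as np

def cluster_r2(train_clusters, train_targets, test_clusters, test_targets, k):
    """Predict each held-out text's target value (e.g., its stopword
    percentage) with the mean target of its cluster, computed on the other
    half of the data; return the R^2 of that prediction."""
    train_clusters = np.asarray(train_clusters)
    train_targets = np.asarray(train_targets, dtype=float)
    test_clusters = np.asarray(test_clusters)
    test_targets = np.asarray(test_targets, dtype=float)

    # Per-cluster mean on the training half; clusters with no training
    # strings fall back to the global mean.
    means = np.full(k, train_targets.mean())
    for c in range(k):
        mask = train_clusters == c
        if mask.any():
            means[c] = train_targets[mask].mean()

    preds = means[test_clusters]
    ss_res = float(((test_targets - preds) ** 2).sum())
    ss_tot = float(((test_targets - test_targets.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot
```

An R² of 1 means cluster identity fully determines the surface feature; an R² near 0 means the clusters carry no information about it beyond the global mean.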



Here we make use of shifted divergences: a divergence measure that potentially has an additive constant, i.e., there exists a constant c ∈ R such that ∆(·, ·) + c is a divergence. For our purposes, an additive constant should not affect the quality of our metrics, as the correlation in Eq. (3) is translation-invariant.

Often, plug-in and Monte Carlo estimators must be used together. Even if we are measuring the divergence between two queryable distributions, the sum over W is infinite and non-decomposable, and thus uncomputable.

We ran our entire pipeline (from sampling strings {w_n^{q_w}}_{n=1}^N from q_w to evaluating the KLs) with 5 different seeds. The variance across seeds is depicted as error bars in Figs. 1 and 2.

We follow Pillutla et al.'s (2021) experimental setup here. Note that Pillutla et al. provide an in-depth analysis of experimental design choices and show MAUVE is robust to various hyperparameter choices.

We use the human judgement scores collected by Pillutla et al. (2021).

If no strings in the training set are assigned to a cluster, we label it with the overall majority category.

Stopwords are defined as common words, such as "that" or "so", that primarily serve a syntactic function. We use the set of English stopwords defined by NLTK (Bird et al., 2009).

Specifically, Djolonga et al. proposed a framework that measures the trade-off between precision and recall using Rényi divergences.




DOES p_c APPROXIMATE p_w?

Figure 2: Correlations of string- and cluster-based divergences with human judgement scores. Legend: ∆_exp in dark green; ∆_→ in orange; ∆_← in blue; ∆_JS in pink; ∆_AUC in lime green.

Figure 3: Accuracy when predicting different attributes of text from their cluster assignments. Assignments (i.e., φ(·)) are learned using text from either WebText or the training set of the respective classification datasets. Dashed lines represent baseline accuracies, i.e., always guessing the majority class.

Figure 4: Divergence measures between two corpora: the reference text is unmodified while the comparison text undergoes perturbation. Higher values indicate a greater discrepancy according to ∆.

Figure 5: Correlations of string- and cluster-based divergences with human judgement scores using a number of estimators (with different PLMs when defining p_c, or language models for p_w). Legend: ∆_exp in dark green; ∆_→ in orange; ∆_← in blue; ∆_JS in pink; ∆_AUC in lime green.

Figure 7: Accuracy when predicting different attributes of text from their cluster assignments. Same plot as Fig. 3 albeit using embeddings from BERT.

Figure 8: ∆_AUC scores between reference text and alternate text distributions as a function of the number of clusters used to estimate p_c and q_c. Importantly, in this figure, we compare the first 2500 sentences (p^(1)) in the human-generated WebText test set to these same strings under the proposed interventions. I.e., for all points corresponding to an (altered) distribution p^(1), we estimate our cluster distributions on the same set of human-generated sentences, but where the strings in one group have been intervened on. The baseline distribution p^(2) still represents the final 2500 sentences in WebText (as in the original plot), and the baseline distribution q is text sampled from GPT-2 XL.

Figure 10: Zoomed-in version of Fig. 4, giving a closer look at the different scores assigned to texts manipulated in different ways.

Figure 11: Version of Fig. 9 that uses the mean of the contextual embeddings from a text to form clusters.

Figure 12: Version of Fig. 10 that uses the mean of the contextual embeddings from a text to form clusters.

Pillutla et al. (2021) present MAUVE, the object of study of this work, which is based on a new divergence metric inspired by the information frontier divergences of Djolonga et al. (2020). Besides proposing this AUC divergence metric, Pillutla et al. (2021) also propose a new way to approximate it using clusters over word embeddings. We provide an analysis of this paradigm, showing that, in practice, we should expect this method to provide a poor estimate of the intended quantity. We go on to perform empirical experiments to identify that the proposed use of distributions over clusters is itself likely responsible for the metric's success, rather than characteristics of the new divergence. We see this work as complementary to Pillutla et al. (2021), providing deeper and more comprehensive insights into the metric's inner workings.
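As a concrete illustration of this cluster-based paradigm, the following numpy-only sketch quantises two sets of embeddings (synthetic ones here, standing in for PLM features) with a small k-means and compares the resulting multinomials with a smoothed Jensen-Shannon divergence. All implementation choices here (initialisation, smoothing, divergence) are ours, not necessarily those of MAUVE's official implementation:

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal Lloyd's algorithm with deterministic farthest-point init."""
    centroids = [X[0]]
    for _ in range(k - 1):
        d = ((X[:, None, :] - np.asarray(centroids)[None]) ** 2).sum(-1).min(1)
        centroids.append(X[d.argmax()])
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(iters):
        labels = ((X[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = X[labels == c].mean(0)
    return centroids

def cluster_histogram(X, centroids, eps=1e-6):
    """Multinomial over clusters for one corpus, with light smoothing."""
    labels = ((X[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
    counts = np.bincount(labels, minlength=len(centroids)) + eps
    return counts / counts.sum()

def js_divergence(p, q):
    """Jensen-Shannon divergence between two multinomials (natural log)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float((a * np.log(a / b)).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

In the actual pipeline, the quantisation is fit on embeddings of the reference and model corpora together, and the two histograms are then compared with the chosen ∆.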

• Sentiment. To analyse sentiment, we use Yelp Polarity (Zhang et al., 2015), a dataset extracted from the Yelp Dataset Challenge 2015, which contains binary sentiment classifications for highly polar Yelp reviews. We use 10k examples randomly sampled from the training set and 5k examples randomly sampled from the test set.
• Authorship. To analyse authorship, we use News Category (Misra & Grover, 2021; Misra, 2022), a dataset consisting of 200k news headlines from the years 2012-2018 obtained from HuffPost. We scrape entire articles from the URLs provided by the dataset. We only use the subset of articles for which the article's author has ≥ 400 articles within the dataset, giving us a training set of 32k and a test set of 6k examples with 46 unique authors.
• Topic. To analyse topic, we use the 20 Newsgroups dataset, which contains 18k newsgroup posts (partitioned into train and test sets using the original splits) on 20 topics, such as subcategories of science, politics and religion. The distribution over text topics is relatively uniform.

8. ACKNOWLEDGEMENTS

We would like to thank the authors of MAUVE for sharing their human evaluation data with us. We would also like to thank Gábor Melis for his insightful feedback on an initial version of this paper, our anonymous reviewers for their help in improving our final draft, and Luca Malagutti for detailed feedback on clarity and presentation.

A A MONTE CARLO ESTIMATOR FOR BACKWARD, JS, AND AUC DIVERGENCES

Consider a Monte Carlo estimator for ∆_←:

KL(q_w || p_w) ≈ (1/N) Σ_{n=1}^N [ log q_w(w_n^{q_w}) − log p_w(w_n^{q_w}) ],   (14)

where w_n^{q_w} ∼ q_w. We can easily sample from q_w. However, computing Eq. (14) requires knowledge of log p_w, which we do not have. Thus, we resort to other techniques for estimating these divergences, as discussed in §3.
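When log p is available, e.g., for toy discrete distributions, this estimator is straightforward; the following sketch (with distributions of our own choosing) makes explicit that it needs log p, which is exactly what is unavailable for the true string distribution:

```python
import numpy as np

def mc_backward_kl(p_logprob, q_logprob, q_sample, n=100_000, seed=0):
    """Monte Carlo estimate of KL(q || p) = E_{w ~ q}[log q(w) - log p(w)].
    Note that it requires log p, which we lack for the true string
    distribution p_w; hence the cluster-based estimators used instead."""
    rng = np.random.default_rng(seed)
    w = q_sample(rng, n)
    return float(np.mean(q_logprob(w) - p_logprob(w)))

# Toy distributions over a 3-symbol vocabulary (ours, for illustration only).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

estimate = mc_backward_kl(
    p_logprob=lambda w: np.log(p[w]),
    q_logprob=lambda w: np.log(q[w]),
    q_sample=lambda rng, n: rng.choice(len(q), size=n, p=q),
)
exact = float((q * np.log(q / p)).sum())
```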

B STRING VS. CLUSTER-BASED KULLBACK-LEIBLER DECOMPOSITION

The decomposition in Eq. (13) can be shown as follows:

KL(p_w || q_w) = Σ_w p(w) log [p(w) / q(w)]
             = Σ_c Σ_{w : φ(w)=c} p(c, w) log [p(c, w) / q(c, w)]        (1)
             = Σ_c Σ_{w : φ(w)=c} p(c) p(w | c) log [ p(c) p(w | c) / (q(c) q(w | c)) ]
             = KL(p(c) || q(c)) + Σ_c p(c) KL(p(w | c) || q(w | c)),

where (1) follows from marginalising over all w and from the fact that p(c, w) = p(w) for c = φ(w), which is true because the cluster assignment is deterministic, i.e.:

p(c, w) = p(c | w) p(w) = 1{φ(w) = c} p(w).

As the second term above is non-negative, we have KL(p(c) || q(c)) ≤ KL(p_w || q_w); thus, KL(p(c) || q(c)) provides a biased estimate of KL(p_w || q_w).
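The bias is easy to verify numerically. The following toy example (our own construction) checks the chain-rule decomposition and the resulting lower bound for a deterministic clustering φ over six "strings":

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence (natural log)."""
    return float((p * np.log(p / q)).sum())

# Toy distributions over 6 "strings" and a deterministic clustering phi
# mapping strings {0, 1, 2} to cluster 0 and {3, 4, 5} to cluster 1.
p_w = np.array([0.25, 0.15, 0.10, 0.20, 0.20, 0.10])
q_w = np.array([0.10, 0.20, 0.20, 0.25, 0.15, 0.10])
phi = np.array([0, 0, 0, 1, 1, 1])

# Marginals over clusters.
p_c = np.array([p_w[phi == c].sum() for c in (0, 1)])
q_c = np.array([q_w[phi == c].sum() for c in (0, 1)])

# Chain rule: string-level KL = cluster-level KL + expected within-cluster KL.
within = sum(
    p_c[c] * kl(p_w[phi == c] / p_c[c], q_w[phi == c] / q_c[c]) for c in (0, 1)
)
assert abs(kl(p_w, q_w) - (kl(p_c, q_c) + within)) < 1e-9
# The within-cluster term is non-negative, so the cluster KL is a lower bound.
assert kl(p_c, q_c) <= kl(p_w, q_w)
```

In this particular example p_c = q_c, so the cluster-level KL is exactly 0 while the string-level KL is not, illustrating how severe the bias can be.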

C RELATED WORK

Over the years, a number of evaluation metrics have been proposed for language generation tasks (such as translation and summarization); the most well-established and commonly used include BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and METEOR (Banerjee & Lavie, 2005). However, these metrics, which rely on n-gram statistics, have been shown to correlate poorly with human judgement (Reiter, 2018). Recent metrics have improved upon these correlations with more advanced techniques, e.g., BEER (Stanojević & Sima'an, 2014), MoverScore (Zhao et al., 2019) and BLEURT (Sellam et al., 2020). These metrics, though, are intended for language generation tasks with a strict set of reference texts. While reasonably effective for directed generation tasks, they do not transfer well to the open-ended domain.

For tasks in which there is not a clear reference, e.g., story generation, basic statistics are typically employed to provide a preliminary evaluation of generated text.

Published as a conference paper at ICLR 2023

