ON THE USEFULNESS OF EMBEDDINGS, CLUSTERS AND STRINGS FOR TEXT GENERATOR EVALUATION

Abstract

A good automatic evaluation metric for language generation ideally correlates highly with human judgements of text quality. Yet there is a dearth of such metrics, which inhibits the rapid and efficient progress of language generators. One exception is the recently proposed MAUVE. In theory, MAUVE measures an information-theoretic divergence between two probability distributions over strings: one representing the language generator under evaluation and the other representing the true natural language distribution. MAUVE's authors argue that its success comes from the qualitative properties of their proposed divergence. Yet in practice, as this divergence is uncomputable, MAUVE approximates it by instead measuring the divergence between multinomial distributions over clusters, where cluster assignments are obtained by grouping strings based on a pre-trained language model's embeddings. As we show, however, this is not a tight approximation, in either theory or practice. This raises the question: why does MAUVE work so well? In this work, we show that MAUVE was right for the wrong reasons, and that its newly proposed divergence is not necessary for its high performance. In fact, classical divergences paired with its proposed cluster-based approximation may actually serve as better evaluation metrics. We finish the paper with a probing analysis, which leads us to conclude that, by encoding syntactic- and coherence-level features of text while ignoring surface-level ones, such cluster-based substitutes for string distributions may simply be better for evaluating state-of-the-art language generators.¹

* Equal contribution.
¹ Code available at https://github.com/rycolab/clusters-in-language-evaluation.
² We define a language generator as a probability distribution q_w over strings w. Specifically, we consider this distribution as used during generation. E.g., if decoding is performed with nucleus sampling, we consider the final distribution in which every sentence containing tokens outside the nucleus is assigned probability 0.
³ Most measures we consider are not metrics in the strict sense; we use the term "metric" out of convention.

1. INTRODUCTION

Probabilistic text generators have improved greatly over the last few years, with models producing increasingly human-like text (Yang et al., 2019; Brown et al., 2020; Raffel et al., 2020; Rae et al., 2021; Hoffmann et al., 2022). As the gap between human and model-generated text closes, the quality of our evaluation metrics becomes ever more important for determining generator quality, especially given the increasing number of user-facing systems employing these generators. While human evaluations serve as the gold standard, they are costly (in both time and money), leading researchers to rely on automatic metrics, i.e., metrics that can be computed by a machine, for the bulk of their development process.

Many automatic language generator evaluation metrics share the same underlying mechanism: the quantitative comparison of two probability distributions. Specifically, most metrics measure a difference between the distributions over strings defined by (1) a language generation model² and (2) the natural language itself. This includes some of the most widely used language evaluation metrics:³ cross-entropy (Shannon, 1948), perplexity (Jelinek et al., 1977) and, more recently, MAUVE (Pillutla et al., 2021). As typically applied to evaluate language generators, however, these metrics have a number of computational and qualitative issues (discussed in §3). Such issues manifest empirically: the most commonly used automatic metrics are known to correlate poorly with human judgements (Wiseman et al., 2017; Reiter, 2018; Sellam et al., 2020; Gehrmann et al., 2021).

A newly proposed metric stands apart: MAUVE (Pillutla et al., 2021). In theory, MAUVE measures the area under the curve formed by the divergence between two probability distributions, qualitatively mimicking a precision-recall quantification (Djolonga et al., 2020; Kynkäänniemi et al., 2019). The authors attribute the success of their metric to the qualitative properties of this new class of divergences.
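As a concrete example of such a string-level comparison, perplexity is the exponentiated average negative log-probability that a model assigns to the tokens of a held-out corpus. A minimal sketch (the function name and input format here are illustrative, not taken from any particular library):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from a model's per-token natural-log probabilities
    over a held-out corpus: exp of the average negative log-likelihood."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 1/4 to every token has perplexity 4.
uniform = [math.log(0.25)] * 10
```

Here `perplexity(uniform)` evaluates to 4.0, matching the intuition that perplexity measures the effective branching factor of the model's predictions.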
Yet, because this divergence is uncomputable in practice, Pillutla et al. propose an approximation to it. Specifically, rather than directly comparing the original two distributions over strings, MAUVE first clusters samples taken from these distributions based on the embeddings of a pre-trained language model; it then estimates the proposed divergence using the samples' empirically observed multinomial distributions over cluster assignments. As we will show, this approximation is poor in both theory (§4) and practice (§5.1), to the point that the term "approximation" is arguably a misnomer. Thus, the reasons why MAUVE works well, knowledge that is important for the continued progress of language generator evaluation metrics, remain unknown.

In this work, we aim to uncover these reasons. To this end, we consider the axes on which MAUVE differs from other evaluation metrics: is MAUVE's success due to its new divergence, to its "approximation", or both? Empirically, we identify MAUVE's substitution of probability distributions over strings with probability distributions over embedding-based clusters as the main factor in its success. We show that, mathematically, this substitution leads to a heavily biased estimator of the original string-based divergences. Yet it also leads to metrics with lower variance and stronger correlations with human judgements. In fact, all divergence measures analysed here correlate more strongly with human judgements when cluster-based distributions are used in place of string-based ones.

Finally, in order to understand the root of the effectiveness of these cluster-based metrics, we probe the clusters themselves. We find that sentence-level permutations within texts noticeably affect cluster assignments, suggesting that cluster-based metrics are sensitive to attributes such as coherence.
On the other hand, basic manipulations that render text non-human-like, such as removing all articles from the input text, do not seem to affect these metrics significantly. Together, these results lead us to conjecture that embedding-based metrics may be preferable when estimating the quality of state-of-the-art (SOTA) language generators, as SOTA models are known to (at least typically) produce grammatical text. That is, by ignoring surface-level features of text while emphasising discourse- and coherence-level ones, clustered embeddings may simply be better suited to the evaluation of top language generation systems. Yet these findings also suggest routes through which such metrics can be gamed, bringing their robustness into question. We believe these findings, along with the theoretical framework we provide for comparing evaluation metrics, are important for the further development of language generator evaluation metrics.
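The cluster-based substitution discussed above can be sketched in a few lines. The following is an illustrative stand-in, not MAUVE's actual pipeline: a plain k-means over (here, synthetic) embedding vectors replaces the pre-trained model's quantisation step, and a smoothed KL divergence replaces the proposed divergence-frontier measure:

```python
import numpy as np

def kmeans_labels(points, k, iters=20, seed=0):
    """Naive k-means; returns a cluster label per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centre, then recompute centres
        dists = np.linalg.norm(points[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def cluster_kl(p_emb, q_emb, k=8, eps=1e-6):
    """Smoothed KL(p_hat || q_hat) between the two samples' empirical
    multinomial distributions over jointly fitted clusters."""
    labels = kmeans_labels(np.vstack([p_emb, q_emb]), k)
    p_hist = np.bincount(labels[:len(p_emb)], minlength=k) + eps
    q_hist = np.bincount(labels[len(p_emb):], minlength=k) + eps
    p_hat, q_hat = p_hist / p_hist.sum(), q_hist / q_hist.sum()
    return float(np.sum(p_hat * np.log(p_hat / q_hat)))
```

On two identical embedding samples the estimate is zero; as the two samples' embedding distributions drift apart, their cluster histograms diverge and the score grows.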

2. DIVERGENCE METRICS FOR LANGUAGE GENERATOR EVALUATION

When evaluating language generation systems, we first assume the existence of an unknown ground-truth distribution p_w. This distribution is defined over strings w, and its domain spans W ≡ Σ*, where Σ is an alphabet of words and Σ* is its Kleene closure. Second, we are given a probabilistic text generator q_w, which is also a distribution over W. An evaluation metric for a language generator q_w can now be defined as a measure of its "distance" from p_w: ∆(p_w, q_w). In short, ∆(•, •) should return high values if q_w is a bad approximation to p_w and low values if it is a good one.

Notably, it is not clear whether q_w being a good approximation to p_w in terms of an arbitrary ∆(•, •) guarantees that it will be a good language generator. Indeed, models that perform well in terms of standard metrics, such as perplexity, often still produce poor-quality text (Holtzman et al., 2020). Thus, we are interested specifically in ∆(•, •) that correlate highly with human quality judgements.

More formally, we define human quality judgements as a (potentially noisy) mapping α(q_w) from a language generator to a real-valued score. For fixed p_w, a useful metric ∆(p_w, •) for evaluating the quality of a language generator q_w is one whose scores correlate highly with α(•). This notion can be operationalised as follows. Assume we have N language generator models. Let us define:

δ_human(q_w^(1), ..., q_w^(N)) = (α(q_w^(1)), ..., α(q_w^(N)))        (1)

δ_metric(q_w^(1), ..., q_w^(N)) = (∆(p_w, q_w^(1)), ..., ∆(p_w, q_w^(N)))        (2)
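A metric's usefulness can then be scored by the rank correlation between δ_metric and δ_human across the N systems. A minimal sketch, assuming the two score vectors are simply given as arrays (the argsort-based ranking ignores ties, which suffices for illustration; note that for metrics where lower is better, such as perplexity, a strongly negative correlation is what we want):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Four hypothetical systems: human scores vs. a metric's scores.
human = [0.2, 0.5, 0.7, 0.9]
metric = [1.3, 2.1, 2.4, 3.0]   # same ordering as the human scores
```

Here `spearman(human, metric)` is 1.0, since the metric ranks the four systems exactly as the humans do; reversing the metric's scores would yield -1.0.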

