DISTRIBUTION AWARE METRICS FOR CONDITIONAL NATURAL LANGUAGE GENERATION

Abstract

Traditional automated metrics for evaluating conditional natural language generation rely on pairwise comparisons between a single generated text and the best-matching gold-standard ground-truth text. When multiple ground truths are available, scores are aggregated using an average or max operation across references. While this approach works well when diversity in the ground-truth data (i.e., dispersion of the distribution of conditional texts) can be ascribed to noise, such as in automated speech recognition, it does not allow for robust evaluation when diversity in the ground truths represents signal for the model. In this work we argue that existing metrics are not appropriate for domains such as visual description or summarization, where ground truths are semantically diverse and where that diversity captures useful additional information about the context. We propose a novel paradigm for multi-candidate evaluation of conditional language generation models, and a new family of meta-metrics built on top of existing pairwise metrics that compares the distributions of reference and model-generated caption sets using small samples of each. We demonstrate the utility of our approach with a case study in visual description: we show that existing models optimize for single-description quality over diversity, and we gain insight into how sampling methods and temperature impact description quality and diversity.

1. INTRODUCTION

Recent models for conditional language generation, particularly in the field of visual description, have shown dramatic improvements in both fluency and the ability to ground generated language in context (Liu et al., 2021; Zhou et al., 2020; Mokady et al., 2021; Chen et al., 2018). Standard metrics for these tasks, such as BLEU, ROUGE, METEOR, and CIDEr, compare a generated text against a reference set of texts and compute some measure of quality for the generated text. By construction of these metrics, a model achieves the best performance by generating a single high-scoring text. In contrast, it has been widely observed that large language models such as GPT-3 (Brown et al., 2020) or LaMDA (Thoppilan et al., 2022) generate the most realistic texts at temperatures close to one, where the set of potential generated texts is often very diverse. More significantly, if we look at an example of an image from MS-COCO and its set of reference captions (Figure 1), we notice that each human-generated reference contains a unique subset of the overall information in the image:

"A woman in a red robe is sitting at a dining table."
"A woman in a red flowered shawl sits at a table while a man wearing jeans is in the kitchen looking at her."
"A person sits at a table and another person stands in the kitchen."
"A woman is sitting at a table wearing a robe while a man is cooking."
"Man and woman in a kitchen looking in the same direction."

Important features, such as the red robe, the man, and the gaze of the two people, are mentioned in only one or a few captions. Metrics that encourage generating information from only one of these captions will generally fail to capture much of the important detail in the image. This holds for more than just image description.
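The role of temperature mentioned above can be made concrete. The sketch below is a minimal illustration, not any particular model's decoder: it samples a token index from temperature-scaled logits, where low temperatures collapse onto the argmax (a single repeated output) and temperatures near one preserve the diversity of the underlying distribution. The function name and logits are illustrative.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random.Random(0)):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    # Inverse-CDF sampling over the normalized probabilities.
    r = rng.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r <= acc:
            return i
    return len(exps) - 1  # guard against floating-point round-off

# At temperature 0.05 the argmax dominates; at temperature 100 the
# distribution is near-uniform, so repeated samples are diverse.
```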
Figure 1: Samples from these two models achieve similar BLEU scores; however, the samples from VLP lie near a center of the distribution and fail to capture the dispersion of natural language in the ground truths, while the samples from an ideal model better match the ground-truth distribution. In this work, we introduce metrics which better measure deviations between samples from candidate and reference distributions than single-sample pairwise metrics do.

For many conditional language generation tasks, such as video captioning, abstractive summarization, translation, and open-ended question answering, it is often beneficial to be able to sample from a diverse distribution of generated outputs. If we compute a caption for the image in Figure 1 from a state-of-the-art model (Zhou et al., 2020), we get: "A woman sitting in a kitchen next to a man." In this description, only information common to most or all of the reference captions is preserved. This is intuitive, since including more information runs the risk that no reference caption contains that information, leading to a low score. Note that this caption may not actually be the most likely one (i.e., the one with the highest expected similarity to a reference caption), because, e.g., the BLEU score also includes a term encouraging longer texts. The designers of these metrics appear to be aware that directly using the shortest distance to a reference caption favors generated captions that are even shorter and more impoverished. However, the (log-) text length heuristic in standard metrics is a poor proxy for actual diversity: rather than producing a variety of captions, taking 10 samples from the state-of-the-art model yields only 10 repetitions of the caption above. Thus, we encounter an issue in the evaluation of conditional text generation models from multiple sampled texts. When several ground truths are available, the metric score is typically based on the maximum score against some ground truth.
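In standard BLEU, this length term is the brevity penalty, which scales the score by exp(1 - r/c) when the candidate length c falls below the effective reference length r: it discourages degenerate short captions but, as noted above, says nothing about diversity. A minimal sketch:

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BLEU's brevity penalty: 1 if the candidate is at least as long as
    the (effective) reference, otherwise exp(1 - r/c), which shrinks
    toward 0 for very short candidates. It penalizes brevity but is
    blind to diversity across samples."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)
```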
This leads to an issue, shown above, where the model is encouraged to produce a text with the lowest expected distance (or max pairwise score, for an n-gram metric such as BLEU) to a reference text, i.e., a text near a strong mode of the training text distribution. Changing the aggregation method, say from the max score over reference examples to an average or sum (as in ROUGE), does not change the situation substantially: the model is still encouraged to produce a single output with a high average score against nearby references, which is maximized at a smoothed mode of the training text distribution. Failure modes of other aggregation methods are discussed in both Caglayan et al. (2020) and Yeh et al. (2021), including issues with multi-modal reference distributions and single outlier texts. This over-reliance on simple aggregations over multiple candidates and references has, over time, compounded into several issues. The first, discussed further in Section 3, is that, as observed in visual description by Chan et al. (2022) and in dialog generation by Caglayan et al. (2020), human performance on datasets under existing metrics is often lower than model performance, even though human-generated captions tend to receive higher scores under human evaluation. The second, discussed in Section 2, is that the diversity of candidate texts is largely relegated to reference-unaware measures, encouraging models to diverge from ground-truth distributions to hit diversity targets. In this work, we aim to solve these problems by introducing several novel automated ways of measuring the performance of conditional text generation models, based on measuring the divergence between samples from two text distributions.
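To see concretely why both max and mean aggregation reward mode-seeking captions, consider the following sketch. Jaccard token overlap stands in for a real pairwise metric such as BLEU or METEOR, and the captions are illustrative: under either aggregation, a short generic caption near the "center" of the references outscores a detailed caption that commits to information from a single reference.

```python
def jaccard(a, b):
    """Token-set overlap: a stand-in for a pairwise metric such as BLEU."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def score_max(candidate, references):
    """Aggregate by the best-matching reference (BLEU-style)."""
    return max(jaccard(candidate, r) for r in references)

def score_mean(candidate, references):
    """Aggregate by averaging over references (ROUGE-style)."""
    return sum(jaccard(candidate, r) for r in references) / len(references)

# Illustrative references, loosely based on the Figure 1 captions.
references = [
    "a woman in a red robe is sitting at a dining table",
    "a woman is sitting at a table wearing a robe while a man is cooking",
    "a person sits at a table and another person stands in the kitchen",
]
mode_caption = "a woman sitting at a table"  # generic, mode-like
detailed_caption = "a woman in a red flowered shawl sits while a man cooks"

# The generic caption wins under BOTH aggregations, so neither rewards detail.
```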
While some recent methods have been designed to closely measure the divergence between full distributions of text data in the unconditional case (Pillutla et al., 2021), no such methods exist for the conditional generation case, which operates at the level of tens of reference samples and candidates. Our contributions are summarized as follows:

1. We introduce a new paradigm for the evaluation of conditional text generation models based on sampling from both candidate and reference distributions.
2. We introduce two new families of metrics which extend existing semantic distances: triangle-rank metrics and kernel-based metrics, designed to measure the divergence between small text samples from candidate and reference distributions.
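As one simplified instance of the kernel-based family, the sketch below computes a biased estimate of the squared maximum mean discrepancy (MMD) between two small samples of texts, using a token-overlap Jaccard kernel as a placeholder for a real semantic kernel. This is an illustration of the paradigm, not the exact metric proposed in this work: a degenerate candidate set (one caption repeated) exhibits a larger divergence from a diverse reference set than the reference set does from itself.

```python
def jaccard_kernel(a, b):
    """Token-overlap kernel; a placeholder for a semantic text kernel."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def mmd2(X, Y, k=jaccard_kernel):
    """Biased estimate of squared MMD between two small text samples:
    mean within-X similarity + mean within-Y similarity
    - 2 * mean cross similarity. Zero when X and Y match exactly."""
    kxx = sum(k(x1, x2) for x1 in X for x2 in X) / len(X) ** 2
    kyy = sum(k(y1, y2) for y1 in Y for y2 in Y) / len(Y) ** 2
    kxy = sum(k(x, y) for x in X for y in Y) / (len(X) * len(Y))
    return kxx + kyy - 2.0 * kxy
```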

