SMART: SENTENCES AS BASIC UNITS FOR TEXT EVALUATION

Abstract

Widely used evaluation metrics for text generation either do not work well with longer texts or fail to evaluate all aspects of text quality. In this paper, we introduce a new metric called SMART to mitigate such limitations. Specifically, we treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences. Candidate sentences are also compared to sentences in the source documents to allow grounding (e.g., factuality) evaluation. Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics on the SummEval summarization meta-evaluation dataset, while the same metric with a string-based matching function is competitive with current model-based metrics. The latter does not use any neural model, which is useful during model development phases where resources can be limited and fast evaluation is required. SMART also outperforms all factuality evaluation metrics on the TRUE benchmark. Finally, we also conducted extensive analyses showing that our proposed metrics work well with longer summaries and are less biased towards specific models.



1. INTRODUCTION

One major obstacle in the progress of text generation tasks (e.g., document summarization, long-form question answering, data-to-text generation, etc.) is automatic evaluation. Traditionally, automatic metrics that rely on discrete token-level matching, such as ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002), have been utilized to check whether system outputs are of high quality across four dimensions (Kryscinski et al., 2019; Fabbri et al., 2021): coherence, factuality, fluency, and informativeness. These metrics do not correlate well with human judgments on all four dimensions of text quality (Fabbri et al., 2021). Because of this, evaluation is usually coupled with human elicitation studies that ask humans to rate texts. These studies can be expensive and nearly impossible to reproduce. More recently, pretrained language models have been leveraged for automatically evaluating system-generated texts (Zhang* et al., 2020; Sellam et al., 2020; Yuan et al., 2021), showing improved correlation with human judgments. Nevertheless, both ROUGE and LM-based metrics have two major drawbacks. Firstly, these metrics are not good at evaluating long and multi-sentence texts. Figure 1 illustrates system-level rank correlations of ROUGE at different text lengths, showing that beyond a certain length, ROUGE's evaluative power drops drastically. By design, ROUGE is also not robust to possibly shuffled information in long outputs, hurting its performance in evaluating coherence. On the other hand, LM-based metrics, such as the state-of-the-art BARTScore (Yuan et al., 2021), are constrained by the length limit of the pretrained LM used, and thus are not able to evaluate outputs longer than this limit. Secondly, most of these metrics only use reference texts during evaluation.
This restricts the capability of the metrics to evaluate dimensions of text quality that require grounding to the source. Yuan et al. (2021) suggested using the source document during evaluation; however, their evaluation is still limited to short documents because of length limitations in LMs. In this paper, we propose an automatic metric called SMART (Sentence MAtching for Rating Text). SMART is motivated by the pyramid method of human evaluation for summarization (Nenkova et al., 2007), which transforms text into semantic content units (SCUs), i.e., sentences that contain a single fact. Since this kind of transformation cannot be done automatically, we use sentences as a proxy for SCUs and treat them as basic units of matching instead of tokens. This additionally allows the metric to effectively support long and multi-sentence texts. Since sentences most likely do not have exact matches, we use a soft-matching function that returns a matching score between 0 and 1, given a pair of sentences. Moreover, to allow grounded evaluation, we also include the source in the calculation of the metric. Similar to ROUGE, we introduce multiple SMART versions using sentence n-gram overlap and longest common subsequence. Our experiments show that SMART with BLEURT (Sellam et al., 2020) as a soft-matching function outperforms all the competing approaches on all four dimensions of quality in the SummEval dataset (Fabbri et al., 2021). We also show that SMART with T5-ANLI (Honovich et al., 2022) outperforms all competing factuality-based evaluation metrics on the TRUE benchmark (Honovich et al., 2022). Moreover, a faster variant of SMART, which does not use any neural model for text matching, shows competitive correlations with human judgments. Finally, our extensive analyses show that SMART works better with longer summaries and is less biased towards specific models.
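To make the sentence-level soft-matching idea concrete, the following is a minimal illustrative sketch, not the released implementation (see the linked code repository). The `token_f1` matcher is a simple string-based stand-in for model-based matchers such as BLEURT, and the greedy best-match aggregation is one assumed way to combine per-sentence scores into precision- and recall-style numbers:

```python
# Illustrative sketch of sentence-level soft matching, NOT the official
# SMART implementation. `match` can be any function mapping a sentence
# pair to a score in [0, 1]; a unigram-overlap F1 stands in here for
# model-based matchers such as BLEURT.

def token_f1(sent_a: str, sent_b: str) -> float:
    """String-based soft-matching function: unigram-overlap F1."""
    a, b = set(sent_a.lower().split()), set(sent_b.lower().split())
    if not a or not b or not (a & b):
        return 0.0
    prec = len(a & b) / len(a)
    rec = len(a & b) / len(b)
    return 2 * prec * rec / (prec + rec)

def soft_sentence_f1(candidate, reference, match=token_f1):
    """Greedy sentence-level matching combined into an F1-style score.

    Precision side: each candidate sentence takes its best-matching
    reference sentence; recall side: the reverse direction.
    """
    precision = sum(max(match(c, r) for r in reference)
                    for c in candidate) / len(candidate)
    recall = sum(max(match(c, r) for c in candidate)
                 for r in reference) / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

reference = ["The cat sat on the mat.", "It was raining outside."]
candidate = ["A cat was sitting on the mat.", "Rain fell outside."]
score = soft_sentence_f1(candidate, reference)
```

A grounded (factuality-oriented) variant would score candidate sentences against source-document sentences in the same way; function names and the greedy aggregation here are assumptions for illustration only.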

2. RELATED WORK

Evaluation in conditional generation tasks such as machine translation and document summarization is a long-standing problem. Traditionally, evaluation involves human elicitation studies that score texts based on different metrics of quality, such as adequacy, fidelity, and fluency in machine translation (Hovy, 1999), and coherence, conciseness, fluency, readability, and content relevance in summarization (Mani, 2001; Nenkova et al., 2007). Automatic metrics based on token n-gram matching have been developed to replace these expensive and time-consuming studies, of which ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002) are the most widely used in summarization and translation, respectively. Several extensions to token n-gram matching have been proposed, such as using paraphrases, synonyms (Lavie & Agarwal, 2007), and word embeddings (Ng & Abrecht, 2015) to handle cases that are semantically equivalent, and downweighting common n-grams to focus more on salient ones (Vedantam et al., 2015). Popović (2015) instead uses character-level n-gram matching to also match words that are conjugated differently and to support morphologically rich languages. With the introduction and success of pretrained language models such as BERT (Devlin et al., 2019) and BART (Lewis et al., 2020), evaluation metrics that leverage them have been proposed. BERTScore (Zhang* et al., 2020) leverages contextualized token embeddings from BERT and obtains pairwise matching of tokens from reference and system summaries. MoverScore (Zhao et al., 2019) extends BERTScore by instead having many-to-one soft alignments using Word Mover's Distance (WMD; Kusner et al., 2015). BLEURT (Sellam et al., 2020) fine-tunes BERT to predict human scores with large-scale synthetic training data. BARTScore (Yuan et al., 2021) uses BART and treats evaluation as a text generation problem, using the likelihood of predicting the system summary given the source document or the reference summary. Clark et al. (2019) and Zhao et al. (2019) also explored sentence-level matching with WMD using (contextualized) sentence embeddings; however, they show no concrete improvements over other model-based metrics (Fabbri et al., 2021). In contrast, we show that our metric correlates better with human judgments than all competing models. Factuality in summarization (Falke et al., 2019; Maynez et al., 2020) is usually evaluated separately, since most automatic metrics are focused on informativeness and do not include the source document in the metric calculation. Factuality-specific metrics can be divided into three approaches: natural

Code: github.com/google-research/google-research/tree/master/smart_eval

Figure 1: Kendall tau system-level correlations of ROUGE and SMART averaged over four dimensions of summary quality as the number of tokens increases. Summaries are from CNN/DM (Hermann et al., 2015) and human annotations are from SummEval (Fabbri et al., 2021). Each bucket on the x-axis contains an equal number of data points. More details in Section A.2.

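The sentence-level longest-common-subsequence variant mentioned in the introduction can be sketched in the same spirit: the classic LCS dynamic program, but with its binary equality test replaced by a soft matching score, so partially matching sentences earn partial credit while sentence order is still enforced. This is an assumed simplification for illustration (unnormalized, and not necessarily the paper's exact recurrence); the released code linked above is authoritative:

```python
def soft_lcs(candidate, reference, match):
    """Soft longest common subsequence over sentence lists (sketch).

    `match(c, r)` returns a score in [0, 1]. With an exact-match
    function (1.0 if equal, else 0.0), this reduces to the classic
    LCS length.
    """
    n, m = len(candidate), len(reference)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j],      # skip a candidate sentence
                dp[i][j - 1],      # skip a reference sentence
                dp[i - 1][j - 1] + match(candidate[i - 1], reference[j - 1]),
            )
    return dp[n][m]
```

Dividing `dp[n][m]` by the candidate or reference length would give precision- and recall-style scores, mirroring how ROUGE-L normalizes token-level LCS.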
