DO SUMMARIZATION MODELS SYNTHESIZE?

Abstract

Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key property or aspect. For example, a synopsis of film reviews all written about a particular movie should reflect the average critic consensus. As a more consequential example, consider narrative summaries that accompany biomedical systematic reviews of clinical trial results. These narratives should fairly summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this type of synthesis? To assess this we perform a suite of experiments that probe the degree to which conditional generation models trained for summarization using standard methods yield outputs that appropriately synthesize inputs. We find that existing models do partially perform synthesis, but do so imperfectly. In particular, they are over-sensitive to changes in input ordering and under-sensitive to changes in input composition (e.g., the ratio of positive to negative movie reviews). We propose a simple, general method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or abstaining when the model produces no good candidate. This approach improves model synthesis performance. Our hope is that by highlighting the need for synthesis (in some summarization settings), this work motivates further research into multi-document summarization methods and learning objectives that explicitly account for the need to synthesize.

1. INTRODUCTION

Multi-document summarization (MDS) models aim to distill inputs into concise synopses that preserve key content. Examples of MDS include summarizing news articles (Dang, 2005; Fabbri et al., 2019; Ghalandari et al., 2020; Evans et al., 2004), answering questions from multiple sources (Dang, 2006), and producing overviews of scientific literature (Liu et al., 2018; Lu et al., 2020; Mollá & Santiago-Martínez, 2012; Wallace et al., 2020; DeYoung et al., 2021). We expect summarization models to produce outputs consistent with inputs (Kryscinski et al., 2020; Nan et al., 2021a), e.g., discussing the same types of entities (Nan et al., 2021b) and permitting questions to be answered in a way that is consistent with the individual inputs (Wang et al., 2020a; Scialom et al., 2021). In some applications, models must synthesize inputs, i.e., aggregate potentially conflicting information, to yield an accurate synopsis (Figure 1). As a simple example, consider the meta-reviews of movies featured on Rotten Tomatoes,[1] which provide a consensus view of individual critic opinions. These reviews should therefore reflect the mean and range of sentiment implicit in the input critiques: a summary of mostly negative reviews (e.g., Gigli) should communicate that the film was widely panned, while a summary of mixed reviews (as in the case of The Fifth Element) ought to convey that critics disagreed and discuss the main positive and negative attributes. A more consequential example is the task of summarizing the evidence presented in clinical trials. Individual trials frequently present conflicting evidence about whether or not a particular health intervention is effective. An ideal summary of the evidence would appropriately weigh the findings presented in the constituent inputs and reflect the evidence on balance.

What are the desiderata of multi-document synthesis? First, summaries produced by models should be consistent with the input data with respect to the latent property of interest. In the case of Rotten Tomatoes, the sentiment of the summary should be in line with the aggregate sentiment expressed in the individual critic reviews. A corollary is that models should be sensitive to changes in the composition of inputs: removing most of the negative reviews from a set of inputs, for example, should yield a summary with a corresponding increase in the expressed sentiment.
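This sensitivity desideratum can be operationalized as a simple probe: perturb the input composition and compare the shift in the expected aggregate against the shift in the property of the generated summary. Below is a minimal sketch of such a probe; `summarize` and `predict_sentiment` are hypothetical stand-ins for a summarization model and a sentiment scorer, neither of which is specified here:

```python
def sensitivity_gap(reviews, sentiments, summarize, predict_sentiment, keep):
    """Drop some inputs (keeping only the indices in `keep`), then compare
    how much the expected mean sentiment moved against how much the
    generated summary's sentiment moved. A well-synthesizing model should
    produce shifts of similar size and direction."""
    subset = [reviews[i] for i in keep]
    expected_shift = (sum(sentiments[i] for i in keep) / len(keep)
                      - sum(sentiments) / len(sentiments))
    generated_shift = (predict_sentiment(summarize(subset))
                       - predict_sentiment(summarize(reviews)))
    return expected_shift, generated_shift
```

A model that is under-sensitive to composition would show a generated shift much smaller in magnitude than the expected shift.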


In this work we evaluate neural MDS models with respect to these criteria. To this end we use a dataset of meta-reviews from Rotten Tomatoes (Leone, 2020) and a dataset of systematic reviews (meta-analyses) summarizing the evidence about medical interventions (Wallace et al., 2020). For the former, we probe the degree to which generated meta-review sentiment agrees with the expected aggregate sentiment score; for the latter, we evaluate whether the generated summary indicates that the input evidence suggests, on balance, that the intervention under consideration was effective. Our main contributions are summarized as follows. (1) To the best of our knowledge, this is the first work to investigate implicit synthesis in summarization, and the degree to which modern models are capable of it.[2] (2) We show that "off-the-shelf" neural MDS models are somewhat inconsistent and insensitive with respect to performing synthesis in summarization. (3) We propose and evaluate a simple, general technique that involves generating a diverse set of output candidates (Vijayakumar et al., 2016) and then selecting from these on the basis of agreement with an expected aggregate measure (computed over the inputs), with promising results.
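The select-or-abstain step of this technique can be sketched as follows. This is a minimal illustration, not the paper's implementation: `candidates` would come from a diverse decoding routine (e.g., diverse beam search), and `predict_property` is a hypothetical auxiliary scorer that estimates a candidate summary's latent property (e.g., its sentiment):

```python
def select_or_abstain(candidates, predict_property, target, tolerance):
    """Pick the candidate whose predicted latent property is closest to the
    expected aggregate of the inputs; abstain (return None) if no candidate
    falls within `tolerance` of that target."""
    scored = [(abs(predict_property(c) - target), c) for c in candidates]
    best_gap, best = min(scored, key=lambda pair: pair[0])
    return best if best_gap <= tolerance else None  # None signals abstention
```

Here `target` would be the expected aggregate measure computed over the inputs (e.g., the mean sentiment of the individual critic reviews).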

2. SYNTHESIS AND SUMMARIZATION

In standard multi-document summarization, we assume inputs (X_i, y_i), where X_i = {x_i1, ..., x_i|X_i|}. We then typically train a summarization model with parameters θ to consume X_i and yield summaries ŷ_i as similar as possible to targets y_i. More precisely, the standard objective entails finding estimates for θ which maximize target token log-probabilities. Assuming the input documents x_ij in X_i have been linearized (i.e., concatenated, usually with adjoining special tokens to demarcate individual inputs) into a string x_i^⊕ of input tokens, this objective takes the form:

Σ_{t=1}^{|y_i|} log p_θ(y_it | y_i1, ..., y_i(t-1), x_i^⊕),

where p_θ is the probability assigned to the token at position t in the target y_i by a summarization model with parameters θ, conditioned on the preceding target tokens and the linearized input x_i^⊕. By myopically focusing on encouraging the model to produce tokens that mimic the targets, this objective aligns with standard (but flawed) measures of automated summary quality like ROUGE (Lin, 2004), which quantify n-gram overlap between targets y_i and outputs ŷ_i. We are interested in settings in which there is an additional latent property z_ij implicit in each constituent input text x_ij. For example, z_ij might reflect the sentiment in critique j of the film indexed by i. Summaries should synthesize this aspect, i.e., the generated summary ŷ_i should implicitly convey an aggregated z_i which reflects a synthesis or aggregation G over Z_i = {z_i1, ..., z_i|X_i|}. That is, we assume z_i = G(Z_i). In both cases considered here (summaries of film critiques and synopses of clinical trial evidence), G can reasonably be assumed to be a (weighted) mean:

G(Z_i) = (1/|X_i|) Σ_{j=1}^{|X_i|} α_ij z_ij.

That is, summaries should roughly reflect the average sentiment and the average reported treatment effect in the cases of movie reviews and clinical trial reports, respectively.
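As a worked illustration of the (weighted) mean aggregation G, the computation is straightforward; the weights α_ij are part of this sketch's assumptions and reduce to 1 in the unweighted case:

```python
def aggregate(z, alpha=None):
    """Compute G(Z_i) = (1/|X_i|) * sum_j alpha_ij * z_ij over the latent
    per-input properties z_ij (e.g., per-review sentiment scores)."""
    if alpha is None:
        alpha = [1.0] * len(z)  # uniform weights recover the plain mean
    return sum(a * v for a, v in zip(alpha, z)) / len(z)
```

For movie reviews, z would hold per-critique sentiment scores; for clinical trials, per-trial effect estimates, with α_ij possibly encoding, say, relative trial size.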



[1] A website that aggregates film reviews: https://www.rottentomatoes.com/. See Appendix B for related work on content aggregation over structured relations (Shah et al., 2021a).




Figure 1: Two multi-document summarization tasks where models must implicitly synthesize inputs to produce accurate summaries. Left: summarizing film reviews with varying sentiment to yield a critics' consensus. Right: summarizing trials that have evaluated a particular medical intervention.

