DO SUMMARIZATION MODELS SYNTHESIZE?

Abstract

Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key property or aspect. For example, a synopsis of film reviews all written about a particular movie should reflect the average critic consensus. As a more consequential example, consider narrative summaries that accompany biomedical systematic reviews of clinical trial results. These narratives should fairly summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this type of synthesis? To assess this we perform a suite of experiments that probe the degree to which conditional generation models trained for summarization using standard methods yield outputs that appropriately synthesize inputs. We find that existing models do partially perform synthesis, but do so imperfectly. In particular, they are over-sensitive to changes in input ordering and under-sensitive to changes in input composition (e.g., the ratio of positive to negative movie reviews). We propose a simple, general method for improving model synthesis capabilities: generate an explicitly diverse set of candidate outputs, and then select from these the string best aligned with the expected aggregate measure for the inputs, or abstain when the model produces no good candidate. This approach improves model synthesis performance. Our hope is that by highlighting the need for synthesis (in some summarization settings), this work motivates further research into multi-document summarization methods and learning objectives that explicitly account for the need to synthesize.
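The candidate-selection idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sentiment` is a toy stand-in scorer (an assumption for this sketch), and in practice the candidates would come from a summarization model decoded with an explicit diversity objective, while the target would be the expected aggregate measure computed over the inputs.

```python
def sentiment(text):
    """Toy stand-in sentiment scorer in [0, 1] (assumption, for illustration);
    a real system would use a trained classifier."""
    positive = {"great", "good", "excellent"}
    negative = {"bad", "poor", "panned"}
    words = text.lower().split()
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return 0.5 if pos + neg == 0 else pos / (pos + neg)

def select_or_abstain(candidates, inputs, tolerance=0.15):
    """Pick the candidate summary whose aggregate measure (here: sentiment)
    best matches the mean over the inputs; abstain (return None) if even the
    best candidate is farther from the target than `tolerance`."""
    target = sum(sentiment(doc) for doc in inputs) / len(inputs)
    best = min(candidates, key=lambda c: abs(sentiment(c) - target))
    return best if abs(sentiment(best) - target) <= tolerance else None
```

For example, given inputs that are two-thirds positive, a candidate whose sentiment is also about two-thirds positive would be selected, while a uniformly negative candidate set would trigger abstention.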

1. INTRODUCTION

Multi-document summarization (MDS) models aim to distill inputs into concise synopses that preserve key content. Examples of MDS include summarizing news articles (Dang, 2005; Fabbri et al., 2019; Ghalandari et al., 2020; Evans et al., 2004), answering questions from multiple sources (Dang, 2006), and producing overviews of scientific literature (Liu et al., 2018; Lu et al., 2020; Mollá & Santiago-Martínez, 2012; Wallace et al., 2020; DeYoung et al., 2021). We expect summarization models to produce outputs consistent with inputs (Kryscinski et al., 2020; Nan et al., 2021a), e.g., discussing the same types of entities (Nan et al., 2021b) and allowing one to answer questions in a way that is consistent with the individual inputs (Wang et al., 2020a; Scialom et al., 2021). In some applications models must synthesize inputs, i.e., aggregate potentially conflicting information, to yield an accurate synopsis (Figure 1). As a simple example, consider the metareviews of movies featured on Rotten Tomatoes,1 which provide a consensus view of individual critic opinions. These reviews should therefore reflect the mean and range of sentiment implicit in the input critiques: a summary of mostly negative reviews (e.g., Gigli) should communicate that the film was widely panned; a summary of mixed reviews (as in the case of The Fifth Element) ought to convey that critics disagreed and discuss the main positive and negative attributes. A more consequential example is the task of summarizing the evidence presented in clinical trials. Individual trials frequently present conflicting evidence about whether or not a particular health intervention is effective. An ideal summary of this evidence would appropriately weigh the findings presented in the constituent inputs and reflect the evidence on balance.



1 A website that aggregates film reviews: https://www.rottentomatoes.com/

