DO SUMMARIZATION MODELS SYNTHESIZE?

Abstract

Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key property or aspect. For example, a synopsis of film reviews all written about a particular movie should reflect the average critic consensus. As a more consequential example, consider narrative summaries that accompany biomedical systematic reviews of clinical trial results. These narratives should fairly summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this type of synthesis? To assess this we perform a suite of experiments that probe the degree to which conditional generation models trained for summarization using standard methods yield outputs that appropriately synthesize inputs. We find that existing models do partially perform synthesis, but do so imperfectly. In particular, they are over-sensitive to changes in input ordering and under-sensitive to changes in input composition (e.g., the ratio of positive to negative movie reviews). We propose a simple, general method for improving model synthesis capabilities: generate an explicitly diverse set of candidate outputs, then select from these the string best aligned with the expected aggregate measure for the inputs, or abstain when the model produces no good candidate. This approach improves model synthesis performance. Our hope is that by highlighting the need for synthesis (in some summarization settings), this work motivates further research into multi-document summarization methods and learning objectives that explicitly account for the need to synthesize.

1. INTRODUCTION

Multi-document summarization (MDS) models aim to distill inputs into concise synopses that preserve key content. Examples of MDS include summarizing news articles (Dang, 2005; Fabbri et al., 2019; Ghalandari et al., 2020; Evans et al., 2004), answering questions from multiple sources (Dang, 2006), and producing overviews of scientific literature (Liu et al., 2018; Lu et al., 2020; Mollá & Santiago-Martínez, 2012; Wallace et al., 2020; DeYoung et al., 2021). We expect summarization models to produce outputs consistent with inputs (Kryscinski et al., 2020; Nan et al., 2021a), e.g., discussing the same types of entities (Nan et al., 2021b) and allowing one to answer questions in a way that is consistent with individual inputs (Wang et al., 2020a; Scialom et al., 2021). In some applications models must synthesize inputs, i.e., aggregate potentially conflicting information, to yield an accurate synopsis (Figure 1). As a simple example, consider the meta-reviews of movies featured on Rotten Tomatoes, which provide a consensus view of individual critic opinions. These reviews should therefore reflect the mean and range of sentiment implicit in the input critiques: a summary of mostly negative reviews (e.g., Gigli) should communicate that the film was widely panned; a summary of mixed reviews (as in the case of The Fifth Element) ought to convey that critics disagreed, and discuss the main positive and negative attributes. A more consequential example is the task of summarizing the evidence presented in clinical trials. Individual trials will frequently present conflicting evidence about whether or not a particular health intervention is effective. An ideal summary of the evidence would appropriately weigh the findings presented in the constituent inputs and reflect the evidence on balance.

Figure 1: Two multi-document summarization tasks where models must implicitly synthesize inputs to produce accurate summaries. Left: summarizing film reviews with varying sentiment to yield a critics' consensus. Right: summarizing trials that have evaluated a particular medical intervention.

What are the desiderata of multi-document synthesis? First, summaries produced by models should be consistent with the input data with respect to the latent property of interest. In the case of Rotten Tomatoes, the sentiment of the summary should be in line with the aggregate sentiment expressed in the individual critic reviews. A corollary to this is that models should be sensitive to changes in the composition of inputs; e.g., removing most of the negative reviews from a set of inputs should yield a summary with a corresponding increase in the expressed sentiment.

In this work we evaluate neural MDS models with respect to these criteria. To this end we use a meta-reviews dataset from Rotten Tomatoes (Leone, 2020) and a dataset of systematic reviews (meta-analyses) summarizing the evidence about medical interventions (Wallace et al., 2020). For the former we probe the degree to which generated meta-review sentiment agrees with the expected aggregate sentiment score; for the latter we evaluate whether the generated summary indicates that the input evidence suggests, on balance, that the intervention under consideration was effective.

Our main contributions are summarized as follows. (1) To the best of our knowledge, this is the first work to investigate implicit synthesis in summarization, and the degree to which modern models are capable of this.
(2) We show that "off-the-shelf" neural MDS models are somewhat inconsistent and insensitive with respect to performing synthesis in summarization. (3) We propose and evaluate a simple and general technique which involves generating a diverse set of output candidates (Vijayakumar et al., 2016) and then selecting from these on the basis of agreement with an expected aggregate measure (based on inputs), with promising results.

2. SYNTHESIS AND SUMMARIZATION

In standard multi-document summarization, we assume inputs $(X_i, y_i)$, where $X_i = \{x_{i1}, \ldots, x_{i|X_i|}\}$. We then typically train a summarization model with parameters $\theta$ to consume $X_i$ and yield summaries $\hat{y}_i$ as similar as possible to targets $y_i$. More precisely, the standard objective entails finding estimates for $\theta$ which maximize target token log-probabilities. Assuming the input documents $x_{ij}$ in $X_i$ have been linearized (i.e., concatenated, usually with adjoining special tokens to demarcate individual inputs) into a string $x_i^{\oplus}$ of input tokens, this objective takes the form:

$$\sum_{t=1}^{|y_i|} \log p_\theta(y_{it} \mid y_{i1}, \ldots, y_{i(t-1)}, x_i^{\oplus})$$

where $p_\theta$ is the probability assigned to the token at position $t$ of the target $y_i$, conditioned on the preceding target tokens and the linearized input $x_i^{\oplus}$, by a summarization model with parameters $\theta$. By myopically focusing on encouraging the model to produce tokens that mimic the targets, this objective aligns with standard (but flawed) measures of automated summary quality like ROUGE (Lin, 2004), which quantify n-gram overlap between targets $y_i$ and outputs $\hat{y}_i$.

We are interested in settings in which there is an additional latent property $z_{ij}$ implicit in each constituent input text $x_{ij}$. For example, $z_{ij}$ might reflect the sentiment in critique $j$ of the film indexed by $i$. Summaries should synthesize this aspect, i.e., the generated summary $\hat{y}_i$ should implicitly convey an aggregated $z_i$ which reflects a synthesis or aggregation $G$ over $Z_i = \{z_{i1}, \ldots, z_{i|X_i|}\}$. That is, we assume $z_i = G(Z_i)$. In both cases considered here (summaries of film critiques and synopses of clinical trial evidence), $G$ can reasonably be assumed to be a (weighted) mean:

$$G(Z_i) = \frac{1}{|X_i|} \sum_{j=1}^{|X_i|} \alpha_{ij} z_{ij}$$

That is, summaries should roughly reflect the average sentiment and reported treatment effect in the cases of movie reviews and clinical trial reports, respectively.
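The setup above can be sketched concretely. This is a minimal illustration (not the authors' code; the separator token and helper names are hypothetical) of linearizing a set of input documents and computing the aggregate synthesis target $G(Z_i)$ as a (weighted) mean:

```python
# Hypothetical separator token demarcating individual input documents.
SEP = "<doc-sep>"

def linearize(docs):
    """Concatenate input documents X_i into a single string x_i^+ ,
    demarcating each document with a special separator token."""
    return SEP.join(docs)

def aggregate(z_scores, weights=None):
    """G(Z_i): (weighted) mean of per-input latent scores z_ij.
    With no weights this is a plain average (e.g., mean sentiment)."""
    if weights is None:
        weights = [1.0] * len(z_scores)
    total = sum(w * z for w, z in zip(weights, z_scores))
    return total / sum(weights)
```

For instance, two reviews with sentiment scores 0.2 and 0.8 yield an aggregate target of 0.5, i.e., a mixed consensus.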
Table 1: Dataset statistics: number of meta-reviews, average meta-review length (tokens), number of input reviews per split, average number of inputs per instance, and average total length of an input to an instance. For movie reviews, the target percent positive reports the fraction of meta-reviews with a positive sentiment; for systematic reviews this refers to the fraction of meta-reviews reporting a significant effect. † We subset the original dev set to instances of ≤ 4k tokens (to accommodate T5; the other models can consume up to 16k).

We investigate the following questions. (1) Do model summaries ŷ_i reflect the anticipated aggregate aspect of interest? That is, how well calibrated is the aspect communicated in the generated summary (z_iŷ) compared to the expected z_i? (2) Can we improve the ability of summarization models to synthesize by explicitly incorporating synthesis targets z_i into the decoding process? We propose a simple inference-time procedure that explicitly preferences output candidates aligning with the expected aggregate property of interest (e.g., average sentiment), and report promising results for the approach. This strategy also naturally lends itself to cautious summarization, i.e., approaches in which we allow the model to abstain from generating an output if it does not produce any candidates that reflect the anticipated aggregate measure.

3.1. MOVIE REVIEWS

We first consider a dataset comprising movie reviews and associated meta-reviews from Rotten Tomatoes. An in-house staffer summarizes critic reviews into meta-reviews. These meta-reviews synthesize the constituent input reviews, and reflect the aggregate critic reception of a film. Each meta-review is associated with a numerical "Tomatometer" score, which is an overall measure of what percent of reviews were positive for the corresponding film (G is then an average of a per-review positivity indicator). The Rotten Tomatoes dataset we use comprises 9,095 movies with meta-reviews constructed from 244,000 individual reviews (Table 1).

Measuring sentiment in movie reviews. As our measure g we train a BERT model (Devlin et al., 2019) using the continuous (fine-grained) sentiment targets provided in the SST dataset (Socher et al., 2013). We trained this model for 3 epochs with a learning rate of 5e-5 using the Huggingface library, with no hyperparameter tuning. While the raw text of the SST dataset is in-domain, the targets themselves are not. We find a reasonably strong correlation between our sentiment estimates and the "true" meta-review sentiment ("Tomatometer" score): the R² (centered) is 0.696, the mean squared error (MSE) is 0.022, and Pearson's r is 0.836 (Figure 2, upper left).
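The three calibration metrics reported above (Pearson's r, MSE, and centered R²) compare predicted sentiment against the reference Tomatometer score. A self-contained sketch of how such metrics can be computed (illustrative only; the real inputs would be the sentiment model's predictions over meta-reviews):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mse(preds, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def r_squared(preds, targets):
    """Centered R^2: 1 - SS_residual / SS_total."""
    mt = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for p, t in zip(preds, targets))
    ss_tot = sum((t - mt) ** 2 for t in targets)
    return 1 - ss_res / ss_tot
```

A perfectly calibrated predictor would score r = 1, MSE = 0, R² = 1; the values reported in the text (0.836, 0.022, 0.696) indicate a strong but imperfect measurement model.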

3.2. BIOMEDICAL SYSTEMATIC REVIEWS OF TREATMENTS

Our second dataset is a collection of systematic reviews from the Cochrane Collaboration. This dataset comprises roughly 2,600 systematic reviews summarizing a total of 16,500 clinical trials evaluating interventions in healthcare (Table 1). Each review includes both a natural language summary and accompanying statistical meta-analysis results. The latter provides an aggregate statistical summary of the individual (study-level) data extracted from the trials included in each review. The natural language summary should accurately convey and contextualize the findings of the meta-analysis. Therefore, the (lack of) treatment efficacy communicated in a given summary should generally agree with the direction of the corresponding meta-analytic point estimate.

Measuring effects in evidence syntheses. For systematic reviews of clinical trials, we resort to a less granular classification model g(x_ij), g(y_i) which attempts to infer whether a given piece of text reports a significant result or not. In particular we use RobotReviewer (Marshall et al., 2017; DeYoung et al., 2020). Given a narrative describing a clinical trial result (or a systematic review summary of such results), RobotReviewer predicts whether the reported result indicates a significant effect of the treatment being investigated, or not. We can compare this prediction to the "truth", which here is derived from the meta-analytic result (specifically by checking whether p < 0.05). Applying this off-the-shelf model to the manually composed summaries accompanying the meta-analyses in our Cochrane set, we observe a macro-average F1 score of 0.577 (Table 10, Appendix D), providing a reasonable (if weak) measure for this task.
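The macro-averaged F1 used to validate RobotReviewer's significance predictions against the meta-analytic ground truth (significant iff p < 0.05) can be sketched as follows (illustrative only; labels are binary, with 1 denoting a significant effect):

```python
def f1(preds, golds, positive):
    """F1 score treating `positive` as the positive class."""
    tp = sum(p == g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive != g for p, g in zip(preds, golds))
    fn = sum(g == positive != p for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def macro_f1(preds, golds):
    """Macro average: unweighted mean of per-class F1 scores."""
    return (f1(preds, golds, 0) + f1(preds, golds, 1)) / 2
```

Macro-averaging weights the "significant" and "not significant" classes equally, which matters here because the two classes are imbalanced in the Cochrane set.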

4. MODELS

We evaluate a suite of transformer (Vaswani et al., 2017) summarization models: Longformer (Beltagy et al., 2020), Pegasus (Zhang et al., 2020), PRIMERA (Xiao et al., 2021), and T5 (Raffel et al., 2020). PRIMERA was designed and pre-trained specifically for multi-document summarization. And while not explicitly designed as multi-document summarization models, both Pegasus (Zhang et al., 2020) and T5 have been used on multi-document tasks, while Longformer has been used for a related multi-document summarization task (DeYoung et al., 2021). For all models we mostly use the default hyperparameters from their respective Huggingface implementations. We conduct a hyperparameter sweep over optimization steps and learning rate, selecting the best model by ROUGE-1 performance on the dev set (Appendix C, Tables 8 and 9).

5.1. HOW WELL DO SUMMARIZATION MODELS SYNTHESIZE?

We report sentiment performance for all models as correlations between (a) the sentiment inferred (via g) over model-generated or reference (human-written) summaries and (b) the reference sentiment (Tomatometer) score. Across these metrics, correlations between the sentiment measured in model-generated outputs and the Tomatometer score are considerably lower than those between the same measurement over human-composed summaries and said score. Based on these metrics, human authors do a better job of synthesis than the models when composing their summaries.

For systematic reviews (Section 3.2), we are able to measure (via g) whether a text appears to report a significant treatment effect or not, and we can compare this against the p-value from the corresponding statistical meta-analysis. This permits only a coarse assessment of synthesis, as we are unable to measure correlations. Instead we report classification metrics describing how often the effect significance inferred from a summary (generated or manually written) matches the ground truth derived from the meta-analysis (Table 2). The results are qualitatively similar to the sentiment case, in that the humans appear to do a better job of synthesis: as best we can measure, the significance reported in their summaries better aligns with the statistical results than that in model-generated summaries.

5.2. SENSITIVITY TO INPUT ORDERING

Synthesis of inputs should be invariant to ordering (e.g., the critics' consensus on a film does not depend on the order in which one reads the reviews). Here we evaluate whether models are sensitive to input orderings with respect to the synthesized aspect of interest (z_iŷ) in the resultant outputs. Specifically, X_i = {x_i1, ..., x_i|X_i|} will constitute an arbitrary ordering of inputs reflected in the linearized version x_i^⊕. This ordering should not affect the aggregate aspect z_iŷ in the summary. To evaluate whether models realize this invariance, we permute the instance i inputs X_i (and, consequently, the linearized x_i^⊕) one hundred times, randomizing input orderings. For each such permutation X̃_i (and associated x̃_i^⊕), we generate a summary ỹ_i and an estimate z̃_iŷ of the resultant aspect, using the corresponding measurement model. By repeating this process for each instance i, we can construct an empirical distribution over the z̃_iŷ's under different random orderings.

Movie reviews. We zero-mean the z̃_iŷ's inferred for each instance, and combine the distributions from all instances into a histogram (Figure 3, left). This shows the spread of sentiments inferred over outputs under random input orderings, minus the corresponding instance mean sentiment. Were a model completely invariant to ordering, the empirical distribution over these differences would collapse to 0. Instead, we observe a relatively wide spread in the sentiment measured over outputs generated from different permutations, indicating a counter-intuitive sensitivity to orderings.

Systematic reviews. For each X_i we have 100 order permutations and associated summaries; we infer whether these report significant results or not, and record the fraction that do (p_i). If models were invariant to ordering, this fraction would always be 0 or 1. Values in-between suggest the model flips the reported conclusion as a result of different input orderings. We calculate the entropy of p_i to quantify this.
Figure 3 (right) shows a histogram of these entropies calculated over the subset of examples where the associated meta-analysis indicates a significant effect. Densities away from zero indicate sensitivity to ordering.
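The order-sensitivity probe for systematic reviews can be sketched as follows (illustrative only; `generate` and `classify` are stand-ins for the summarization model and RobotReviewer, which are not reproduced here):

```python
import math
import random

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli variable with parameter p.
    0 when p is 0 or 1 (order-invariant conclusion), 1 when p = 0.5."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def order_sensitivity(inputs, generate, classify, n_perms=100, seed=0):
    """Permute the inputs n_perms times, summarize each permutation,
    classify each summary as significant (1) or not (0), and return
    the entropy of the fraction p_i of significant conclusions."""
    rng = random.Random(seed)
    flags = []
    for _ in range(n_perms):
        perm = inputs[:]
        rng.shuffle(perm)
        flags.append(classify(generate(" ".join(perm))))
    p = sum(flags) / len(flags)
    return binary_entropy(p)
```

A model whose conclusion never flips under reordering yields entropy 0; entropy near 1 means the conclusion flips on roughly half of the permutations.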

5.3. SENSITIVITY TO INPUT COMPOSITION

Synthesis models should be responsive to changes in the distribution of the attribute to be synthesized in the input composition: if we increase the ratio of positive to negative reviews in an input set, we would anticipate a concomitant change in the sentiment communicated in the meta-review z_iŷ. To assess whether models meet this synthesis desideratum, we manipulate model inputs X_i in such a way as to induce an expected change in the target measure z_iŷ; we then measure whether the model yields a summary that aligns with this expected change.

Movie reviews. We manipulate the ratio of positive to negative reviews and observe the resultant change in the property of interest latent in the corresponding output. We take movies with mixed reviews, and delete 10%, 20%, 30%, ..., 100% of the positive inputs, retaining the negative inputs; we then repeat the process but instead remove negative inputs. For each of these permutations, we measure the input sentiment, the meta-review sentiment, and how well they correlate (Table 3). Figure 4 plots the relationship between the fraction of positive reviews in the (manipulated) input sets and the granular sentiment score inferred over the resultant outputs. The models are generally under-sensitive to changes in their input: rather than shifting meta-review sentiment by an amount equivalent to the change in input sentiment (a slope of 1, as we observe when we fit a model to the human-written summaries), models tend to have trouble changing their sentiment, and require a large change in input distribution to substantially change the sentiment communicated in the output.
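The slope-based sensitivity analysis above amounts to a least-squares fit of output sentiment against the fraction of positive input reviews; a slope near 1 indicates calibrated sensitivity, while a slope near 0 indicates under-sensitivity. A minimal sketch (toy values; not the authors' code):

```python
def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

For example, a summarizer whose output sentiment moves only from 0.4 to 0.6 as the positive-input fraction sweeps from 0 to 1 has slope 0.2, i.e., it is strongly under-sensitive to input composition.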
Systematic reviews. To measure sensitivity to changes in input composition, we manipulate our inputs X_i such that the meta-analysis result (target z_iŷ) flips from a significant effect to no effect, or from no effect to an effect. Operationally, we do this by first taking a subset of the reviews that have conflicting evidence (yielding 139 unique reviews). We then order the inputs in these by (weighted) effect sizes, and remove subsets which ought to flip the significance result.

6. IMPROVING SYNTHESIS IN SUMMARIZATION

We propose a simple post-hoc approach to improving the synthesis performed by multi-document summarization models. This involves the following steps: (1) generate an explicitly diverse set of output candidates; (2) select from these as the final output the candidate that best agrees with the expected synthesis result (as predicted by an external model).

For (1), we rely on a previously proposed technique for generating diverse outputs C_i from input x_i^⊕, namely Diverse Beam Search (DBS) (Vijayakumar et al., 2016). This method modifies standard beam search to maintain multiple groups of beams. During decoding, a term is added to the next-token log probabilities which effectively penalizes production of (partial) strings similar to candidates on beams in other groups.

In (2) we would like to select the output that best synthesizes the property of interest; this requires a mechanism for specifying what we expect the synthesized property to be, given the inputs. For example, if we know the sentiment scores associated with input movie reviews, we might enforce that the sentiment expressed in the output agrees with the average of these. To realize this intuition, we can select as the final output from C_i the string that best aligns with this anticipated aggregate property (sentiment score or significance finding). Operationally, this requires an external model to measure, or estimate, the aspect of interest latent in a given candidate output. This is a limitation of the approach, but in many settings it may be feasible to identify or construct such a model; we were able to do so for both tasks considered in this paper. There is no guarantee that any member of C_i will align well with the anticipated aggregated property.
In such cases, we have no means of yielding an output consistent with respect to synthesis, and it may be desirable to abstain from outputting anything at all; that is, to be a cautious summarizer (Ferri et al., 2004; Hechtlinger et al., 2018). We consider this strategy in the case of generating narrative synopses of evidence, as this constitutes a case in which (a) one would very much prefer not to produce a misleading summary of clinical evidence (Kell et al., 2021), and (b) we observe many cases where the diverse decoding strategy yields an output that seems to communicate (at a granular level) the aggregate findings expected.

(In fixed-effects meta-analysis the weights are inverse variances associated with study-level effect estimates. See Appendix Tables 11 and 12 for an ablation of diverse vs. standard beam search outputs; for a related generate-and-select approach (Oved & Levy, 2021) see Appendix B. The DBS penalty is associated with a hyperparameter λ that encodes the relative importance of realizing diversity; we use λ=0.5 here and did not extensively tune this. Other hyperparameters include the number of groups and the total number of beams; we used 5 for both of these, retaining 5 beams as used for the analysis above.)

Figure 6: Differences relative to human summaries under vanilla decoding and the proposed generate-diverse-then-select strategy on the Rotten Tomatoes dataset and task. We report Pearson's r and R², both measures of synthesis "calibration". Vanilla decoding yields synthesis performance worse than humans, but explicitly considering synthesis at inference time as proposed results in performance comparable to, and sometimes better than, the human summaries (as best we can measure).
Movie reviews. For movie reviews we use a BERT (Devlin et al., 2019) model trained on IMDB (Maas et al., 2011) to predict the sentiment of each input x_ij, using the proportion of x_ij ∈ X_i with a positive score as an approximation for the target sentiment z_iŷ. For each diverse prediction in C_i, we predict a sentiment z̃_iŷ using our sentiment regression model (Section 3.1), and select the prediction closest to the estimated target sentiment, i.e., minimizing |z̃_iŷ − z_iŷ|. We find this improves model performance to human-like levels in terms of synthesis, as best we can measure (Table 4, Figure 6).

Systematic reviews. In the case of systematic reviews, we have only a binary measure of significant effect (or not). As a proxy for z_iŷ, we again use RobotReviewer to extract an effect for each of the model inputs x_ij, using the majority vote (i.e., do the plurality of x_ij ∈ X_i indicate that there was an effect?). We classify each output candidate in C_i, again using RobotReviewer, to estimate z̃_iŷ. We then select for output the highest-probability candidate in C_i which agrees with the majority vote of the inputs, and abstain where there are no viable candidates. For the instances where we do select a summary, we find improved performance, as best we can measure (Table 5). Movie reviews show a wide range of sentiments; systematic reviews show some improvement but are biased towards no effect (qualitatively observed in Appendix G).
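The selection (and abstention) step described above can be sketched generically as follows. This is illustrative only: `score` stands in for the external aspect model (the sentiment regressor or RobotReviewer), and the tolerance-based abstention rule is one simple way to realize cautious summarization:

```python
def select_candidate(candidates, score, z_target, tol=None):
    """Pick from `candidates` the string whose estimated aspect
    (via `score`) is closest to the expected aggregate z_target.
    If `tol` is given and no candidate is within it, abstain (None)."""
    best, best_gap = None, float("inf")
    for c in candidates:
        gap = abs(score(c) - z_target)
        if gap < best_gap:
            best, best_gap = c, gap
    if tol is not None and best_gap > tol:
        return None  # cautious summarization: no viable candidate
    return best
```

For the systematic-review variant, `score` would return a binary significance prediction and `z_target` the majority vote over inputs, with abstention when no candidate agrees.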

7. RELATED WORK

Automatic (multi-document) summarization (Nenkova & McKeown, 2011; Maybury, 1999) has been an active subfield within NLP for decades. We have focused our analysis on modern, neural abstractive models for conditional text generation (Bahdanau et al., 2015). In light of their empirical success, we have specifically evaluated a set of Transformer-based (Vaswani et al., 2017) models which have recently been used for multi-document summarization (Beltagy et al., 2020; Zhang et al., 2020; Xiao et al., 2021; Raffel et al., 2020). There has been some work on highlighting conflicting evidence in health literature specifically (Shah et al., 2021b;a), though this was focused primarily on highlighting conflicting evidence and explicitly aggregating extracted content.

Table 5: Systematic review results with modified-then-selected predictions. F1 is a macro-averaged F1 on the set of returned results. We abstain when no output matches the expected synthesis result.

Sentence fusion. One view on synthesis might be that it is a particular kind of sentence fusion (Barzilay & McKeown, 2005). However, past work on "fusing" sentences has assumed that the aim is to generate an output that contains the information common to similar sentences (Thadani & McKeown, 2013). This is intuitive in the context of, e.g., summarizing multiple news articles covering the same event. But here we are interested in the more challenging setting in which the output should reflect an aggregate measure of potentially conflicting evidence or opinions.

Interpretation and analysis of neural models for NLP. This work is also related to the emerging body of work on analyzing neural NLP models, their behaviors, "knowledge", and "abilities" in general (e.g., Linzen et al., 2016, inter alia). There has been some work specifically on analyzing neural summarization models. Xu et al. (2020a) investigated when a model is likely to extract (copy) rather than abstract (generate).
Xu & Durrett (2021) furthered this analysis by assessing when models rely on the local input to produce particular output tokens, and when they instead rely mostly on a background language distribution acquired in pre-training.

Factuality of neural summarizers. Neural conditional generation models have proven adept at producing fluent outputs, but in the context of summarization they are prone to hallucinating content unsupported by input documents (Maynez et al., 2020; Kryscinski et al., 2019). Automated metrics such as ROUGE do not reliably capture such phenomena (Falke et al., 2019; Maynez et al., 2020). This has motivated several efforts to design automated factuality metrics (e.g., Wang et al. (2020b); Xu et al. (2020b); see Pagnoni et al. (2021) for an overview).

8. CONCLUSIONS

We have outlined and investigated the problem of synthesis as related to some summarization tasks. We showed that existing models are partially able to synthesize implicitly, but do so imperfectly: for instance, the aggregation they perform is sensitive to input ordering, and they are not as sensitive to perturbations in the composition of inputs as one would hope. We proposed and validated a straightforward inference-time method to improve model synthesis capabilities by preferentially outputting summary candidates that align with a predicted aggregate measure, and demonstrated empirically that this offers gains in performance. Our hope is that this work encourages additional research into summarization models that explicitly optimize to accurately synthesize potentially conflicting evidence and information.

A NOTATION

Variable | Definition
X_i | The i-th set of input documents, corresponding to instance i
y_i | The target summary y of the i-th instance
y_{i,t} | The t-th token of target y_i
ŷ_i | A generated summary of the i-th instance
x_ij | The j-th input document for instance i
x_i^⊕ | A particular linearization of the input documents X_i
θ | Model parameters
p_θ | Probability under parameters θ
p_θ(y_{i,t} | y_{i,1..t-1}, x_i^⊕) | Standard conditional token probability (training objective)

Shah et al. (2021a) created a "nutri-bullets" system for generating consensus-based summaries of health and nutrition related content. They assume a low-supervision setting in which one has a set of tuples extracted from texts with which to train a content extractor, and where one can design heuristic rule-based aggregation strategies on top of extracted tuples mapping onto discrete categories like "consensus". By contrast, we assume a more typical supervised summarization setting and are interested in continuous aggregation of a latent attribute of interest, and we do not assume (or have access to) relational tuples over inputs. Indeed, recent work (Wolhandler et al., 2022) has shown that systematic reviews are categorically different from news summarization, and that relational tuple extractors do not perform well in the medical domain.

First, Shah et al. (2021a) focus primarily on settings in which training data is (severely) limited, and motivate their pipeline approach on the basis of this limited-supervision assumption. For this reason they define separate modules: the first performs content selection (tuple extraction; this does require manual annotations of tuples on a subset of texts to train such an extractor); the second applies (manually composed) deterministic aggregation rules over these extracted tuples to combine them; a final module then generates a "surface realization" conditioned on the aggregated result.
We have investigated more typical supervised settings (with thousands of input and summary pairs), training modern end-to-end transformer-based summarization models, and have empirically assessed the extent to which model outputs in this training regime are consistent with the anticipated continuous synthesis result. We do not have annotated tuples over our inputs (as would be required to use the Shah et al. (2021a) approach, which assumes a trained content extractor module). And while applying discrete (manually composed) aggregation operators over inputs makes sense in some settings, we are explicitly interested in the ability of models to aggregate variables of interest continuously, for example producing "very positive" summaries when movie reviews are overwhelmingly positive, and merely "positive" summaries when they are only mostly positive. In sum, the approach proposed by Shah et al. (2021a) is appropriate in, and designed for, low-supervision settings (which we do not consider here), where there are natural "tuples" to be extracted from inputs and supervision for this sub-task (which we do not have), and where discrete aggregation over inputs is natural (whereas we are interested in continuous aggregation, e.g., mean sentiment).

Wolhandler et al. (2022) measure how challenging multi-document summarization is as a function of the unique knowledge (represented as relational tuples) required to produce a summary, i.e., how many new tuples each input document adds relative to subsets of the other inputs. By greedily building subsets of inputs as a function of new information added, they find that standard multi-document summarization datasets merely require selecting two to four documents from inputs of up to ten, whereas this approach breaks down in the case of systematic reviews. They find that, due both to technical constraints on relation extraction and to the inability to model contradiction, relational extraction and aggregation methods are insufficient for producing evidence syntheses.
Oved & Levy (2021) introduce the Perturb-and-Select Summarizer (PASS) for summarizing Amazon product reviews. It works by perturbing model inputs (i.e., keeping random subsets of the input), generating a summary for each perturbation (via standard beam search), and then selecting amongst outputs (via a ranker) to produce a coherent, self-consistent, and fluent summary. PASS is similar to our work in that it generates multiple outputs and selects amongst them, but it differs in several key respects. The main conceptual difference is that PASS targets a summary's self-consistency (a product review might contradict itself on some aspect, e.g., simultaneously describing a product as fitting well and as running a size small), whereas our target is a continuous quantity derived from the inputs as a whole (e.g., aggregate sentiment or effect sizes). PASS is designed to produce summaries that are plausible, as opposed (and complementary) to summaries that reflect inherent contradiction in the input data. And because PASS produces summaries from subsets of each instance's input, it cannot perform an explicit synthesis on its own, whereas in our work each summary is produced with access to the whole of each instance's input.

C MODELS

We train all models using a modified Huggingface Transformers library (Wolf et al., 2020). For the Pegasus model, we use a distilled version provided by Huggingface (Table 7). All models were trained using their default hyperparameters, except for batch size, optimization steps, learning rates, and any parameters specified in Table 7. We fix the batch size to 16, using gradient accumulation over single instances at a time with 16-bit floating point (fp16) precision (due to data size), and perform an approximate (subject to resource constraints) grid search over learning rates and training steps (Tables 8, 9), keeping the model highlighted in bold. Earlier experimentation was performed ad hoc with Longformer and T5 models only; we found that while smaller numbers of steps could perform well, they exhibited high variance and were more sensitive to hyperparameter changes than longer runs. All training was performed on 48GB NVIDIA RTX 8000 GPUs; most models cannot fit gradient information for a single instance in under 40GB, even at reduced precision.
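The grid search over learning rates and step counts can be sketched as below. This is an illustrative skeleton only: `grid_search` and the example grid values are hypothetical, not the exact procedure or grids used (the grids actually searched appear in Tables 8 and 9, and `evaluate` stands in for a full fine-tuning run plus dev-set scoring).

```python
import itertools

def grid_search(grid, evaluate):
    """Evaluate every configuration in the grid and keep the one with
    the best dev-set score (ties broken arbitrarily by max())."""
    configs = [dict(zip(grid, values))
               for values in itertools.product(*grid.values())]
    return max(configs, key=evaluate)

# Hypothetical grid values for illustration only.
grid = {"learning_rate": [1e-5, 3e-5, 5e-5], "steps": [10_000, 30_000]}
```

In practice each `evaluate` call is a multi-hour training run, which is why the search is only approximate and subject to resource constraints.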

D DETAILED RESULTS

Measure validation. As our results rely on proxy metrics, we measure the quality of these proxies. See Figure 8 for the correlation of movie meta-review sentiment with human results, and Table 10 for how well the automatic significance measures correlate with the underlying truth.

Diversity sampling. We include detailed results on the importance of diversity sampling; the diversity sampling procedure produces better metrics in every dimension.

E ROUGE RESULTS

We report mean differences in ROUGE outputs for both datasets in Figure 10. Ideally, these would have all mass at zero.

Table 12: Systematic reviews results with multiple generate-then-select predictions, this time using the top-5 results from standard beam search. F1 is a macro-averaged F1 on the set of returned results. We abstain when no output matches the expected synthesis result.
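The evaluation in the caption above, a macro-averaged F1 computed only over the returned (non-abstained) predictions, can be sketched as follows. The function and label names are hypothetical helpers for illustration, not the exact evaluation code.

```python
def macro_f1(pairs, labels=("significant", "not significant")):
    """Macro-averaged F1 over (gold, predicted) pairs.

    Abstentions are encoded as predicted == None and are excluded,
    mirroring 'F1 on the set of returned results'."""
    answered = [(g, p) for g, p in pairs if p is not None]
    f1s = []
    for lab in labels:
        tp = sum(1 for g, p in answered if g == lab and p == lab)
        fp = sum(1 for g, p in answered if g != lab and p == lab)
        fn = sum(1 for g, p in answered if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Note that because abstentions are dropped from the denominator, a model that abstains often can score well on this F1 while answering few instances; coverage should be reported alongside it.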

Generated: The overall evidence supports the use of topical antibiotics in surgical patients who have undergone minor surgery, compared to no treatment. The effect on other outcomes, other than infection rate, is consistent. The safety profile of topical antibiotics is also of concern. Further well-designed RCTs are needed to assess effectiveness of topical antibiotics in surgical patients.
Effect: no significant difference

Generated: A single application of topical antibiotics in surgical site wounds reduces the risk of infection, and the risk of other complications, including wound dehiscence. The risk of infection recurrence is low. The use of topical antibiotics outside of surgery should be restricted to surgical site wounds.
Effect: no significant difference

Generated: A single application of topical antibiotics in surgical site wounds reduces the risk of infection, and the risk of other complications, including wound dehiscence. The risk of infection recurrence is low.
Effect: no significant difference

Generated: The overall evidence supports the use of topical antibiotics in surgical patients to reduce the risk of infection, and the risk of other complications, especially in high-risk patients. There is a lack of evidence in low-risk patients to support the use of topical antibiotics in this setting.
Effect: significant difference

Generated: A single application of topical antibiotics in surgical site infection prevention has been demonstrated to reduce the risk of infection in patients who have undergone surgery. The number of patients who have been treated with topical antibiotics has been small but this is due to risk of bias in the trials. Ointment use should be limited to patients whose primary wound is irradiated.
Effect: significant difference

Table 16: An instance where generating multiple reviews allows our models to find a candidate summary reporting a significant difference (the target).



Footnotes:
1. A website that aggregates film reviews: https://www.rottentomatoes.com/
2. See Appendix B for related content aggregation work over structured relations (Shah et al., 2021a).
3. Written by designated "top critics," audience members recognized for the quality and quantity of their reviews.
4. SST is itself based on a collection of Rotten Tomatoes critic reviews (Pang & Lee, 2005). We verified that the SST text fragments do not overlap with any of our target reviews by manually checking any (fragment, review) pair with substantial (>= 75%) overlap for approximately one quarter of all reviews.
5. https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py
6. An international non-profit dedicated to helping healthcare providers make evidence-based decisions.
7. https://huggingface.co/osama7/t5-summarization-multinews
8. For a ROUGE1 comparison, see Appendix E, Figure 10.
9. These are the more interesting cases; we provide results over the entire dataset in Appendix Figure 9.
10. https://huggingface.co/lvwerra/bert-imdb



Figure 3: The spread of sentiment/treatment effect measured in outputs produced from permuted input orderings. Left: movie review sentiment. Right: systematic review significance prediction entropy (0 indicates order insensitivity), on the subset of reviews that report significant effects.
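The order-sensitivity entropy in the caption above can be sketched as follows. Here `predict` is a hypothetical stand-in for the full pipeline (summarizer plus significance classifier applied to the generated summary); a perfectly order-insensitive model scores 0 bits.

```python
import itertools
import math
from collections import Counter

def order_sensitivity_entropy(inputs, predict):
    """Entropy (in bits) of a discrete prediction across all permutations
    of the input documents; 0 means the prediction ignores ordering."""
    labels = [predict(list(p)) for p in itertools.permutations(inputs)]
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For instances with many inputs, one would sample permutations rather than enumerate all of them, since the number of permutations grows factorially.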

Movie reviews (left): Correlation between subsampled inputs and generated meta-reviews. Systematic reviews (right): macro-averaged results (F1 and accuracy) for subsampled inputs.

Figure 4: Model sentiment sensitivity to manipulated input sentiment. The intensity patterns indicate that models tend to oscillate between low and high sentiments in outputs, and are not responsive to subtler shifts in input sentiment compositions. For context we include a model regression (blue) and the reference sensitivity regression (black).

[Figure 5 graphic: input documents x_i (e.g., "The movie was excellent … [SEP] Meandering but well directed … [SEP] The film is over long …") yield candidate outputs C_i from Diverse Beam Search (e.g., "A mess; difficult to watch and too long.", "A bloated movie lacking direction."); a sentiment g(ŷ_il) is inferred for each candidate and compared against the predicted aggregate ẑ_i = G(x_i).]

Figure 5: Proposed strategy to improve synthesis. We generate an intentionally diverse set of output candidates (Vijayakumar et al., 2016) and then select from these the text that best agrees with the predicted aggregate property of interest (here, sentiment). We can also abstain when the model fails to yield an appropriate output.
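The selection step of Figure 5 can be written as a minimal sketch. The names here are hypothetical: `scorer` stands in for the auxiliary measurement function g (e.g., a sentiment classifier run on each candidate), `target` for the predicted aggregate ẑ_i, and `tol` for the abstention threshold, which is an assumption of this sketch rather than a value from the paper.

```python
def select_or_abstain(candidates, scorer, target, tol=0.1):
    """From a diverse candidate pool, return the summary whose measured
    property is closest to the predicted aggregate target; abstain
    (return None) if even the best candidate is farther than `tol`."""
    gap, best = min((abs(scorer(c) - target), c) for c in candidates)
    return best if gap <= tol else None
```

The diversity of the candidate pool matters because selection can only pick from what was generated; if all candidates cluster at one sentiment, no re-ranking can recover the target aggregate.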

Figure 7: Distributions of outputs for the candidate summaries. Movie reviews (left): a histogram of the range of differences between the lowest and highest output sentiments. Systematic reviews (right): histograms of the fraction of outputs reporting significant results.


Figure 8: Actual sentiment vs. predicted sentiments on model outputs.

Dataset statistics for movie reviews (left) and systematic reviews (right).



Base synthesis results. Movie reviews (left): correlations between sentiment measured in model outputs and target sentiments; we report R^2, Pearson's r, and mean squared errors. Systematic reviews (right): we report macro-averaged F1s. ROUGE1 is included for reference.

Movie Reviews: Generate diverse movie meta-reviews and then choose among them using an approximate target sentiment (left) or the oracle sentiment (right). R1 is ROUGE1 score.

Notation.
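The aggregation notation can be made concrete with a toy sketch. The lexicon-based `g` below is a deliberately simple stand-in (an assumption of this sketch; the real g is a learned classifier), and `G` is the weighted-mean aggregator over per-input properties z_ij with weights α_ij.

```python
# Toy lexicon standing in for a learned sentiment model.
LEXICON = {"good": 1.0, "great": 1.0, "bad": 0.0, "dull": 0.0}

def g(text):
    """Measure the latent property z_ij of one input x_ij (toy sentiment)."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.5

def G(z, alpha=None):
    """Aggregate per-input properties z_ij into z_i via a weighted mean."""
    alpha = alpha or [1.0] * len(z)
    return sum(a * zj for a, zj in zip(alpha, z)) / sum(alpha)

reviews = ["A great film", "Good fun", "A dull mess"]
z_i = G([g(x) for x in reviews])  # uniform weights: mean input sentiment
```

With uniform α the aggregate is the mean input sentiment; non-uniform α would correspond to, e.g., weighting clinical trials by sample size.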

Model hyperparameters. We used optimizers, schedulers, weight decay, and label smoothing as best according to examples from source implementations (where available). Optimizer warmup was arbitrarily chosen. Non-specified parameters were the Huggingface defaults.

Movie Reviews dev training results, best models bolded.

Systematic Reviews dev training results, best models bolded. We experimented with other parameters (in particular learning rates) and found that the total number of steps was more important.

Systematic review significance validation results.

Movie Reviews. Top left: Generate 5 diverse movie meta-reviews and then choose among them using an approximate target sentiment. Top right: Generate 25 diverse movie meta-reviews and then choose among them using an approximate target sentiment; this was accidentally referenced in an earlier version of this work. Bottom left: Generate 5 movie meta-reviews using standard beam search and choose among them using an approximate target sentiment. Bottom right: Generate 5 diverse movie meta-reviews and select amongst them using the oracle sentiment. In all cases R1 refers to ROUGE1.


F EXAMPLES OF DIVERSE MOVIE SUMMARIES

Summary: The Private Lives of Pippa Lee relies on a strong ensemble cast to deliver witty and poignant observations about life and relationships.
Sentiment: 0.800731

Summary: The Private Lives of Pippa Lee relies on a strong ensemble cast to deliver witty and poignant observations about life and relationships. With a strong cast and Robin Wright Penn's sharp performance, The Private Lives of Pippa Lee succeeds as both a witty tribute to lost characters and a showcase for Robin Wright Penn.
Sentiment: 0.809596

Summary: With a strong cast and Robin Wright Penn's empathetic direction, The Private Lives of Pippa Lee succeeds as both a humorous look at domestic issues and a poignant look at relationships.
Sentiment: 0.809081

Summary: The Private Lives of Pippa Lee benefits from Robin Wright Penn's superb performance, as well as a strong ensemble cast that includes Keanu Reeves, and Faye Dunaway.
Sentiment: 0.845693

Summary: The Private Lives of Pippa Lee has an affecting ensemble cast and Robin Wright Penn delivers a noteworthy performance, although the film is a bit too episodic.
Sentiment: 0.654905

Table 13: Different meta-reviews of "The Private Lives of Pippa Lee" and corresponding sentiments. The target sentiment for this meta-review is 70%; generating diverse candidates helps find a meta-review closer to the target.

Summary: You Don't Mess With the Zohan's handful of laughs are almost enough to compensate for its inconsistent tone and stale, obvious jokes.
Sentiment: 0.242698

Summary: You Don't Mess with the Zohan has a handful of crotch thrusts, but not enough of them land.
Sentiment: 0.429654

Summary: You Don't Mess With the Zohan's handful of laughs are almost enough to compensate for its aimless, crass script.
Sentiment: 0.287896

Summary: You Don't Mess with the Zohan has its moments, but not all of them - and the jokes are embarrassingly crass and often crude.
Sentiment: 0.434442

Summary: You Don't Mess with the Zohan has its moments, but not all of them - and the jokes are embarrassingly crass and often crude. The script
Sentiment: 0.406172

Table 14: Different meta-reviews for "You Don't Mess With The Zohan", a relatively panned movie with a target meta-review sentiment of 37%.

Generated: Ketanserin versus placebo in the Raynaud's phenomenon is neither effective nor safe. The Raynaud's phenomenon is associated with significant adverse effects including dizziness and pain. The effectiveness of ketanserin for the Raynaud's phenomenon is unknown.
Effect: no significant difference

Generated: Ketanserin versus placebo in the Raynaud's phenomenon is neither effective nor safe. The Raynaud's phenomenon is associated with significant adverse effects including dizziness and pain.
Effect: no significant difference

Generated: Ketanserin and serotonin receptor antagonists in the Raynaud's phenomenon treatment of systemic scleroderma reduce the incidence of ischaemic ulcers and may reduce the frequency of adverse events.
Effect: significant difference

Generated: The Raynaud's phenomenon is associated with a small number of adverse effects when administered orally to patients with Raynaud's phenomenon. The frequency of Raynaud's phenomenon is similar to that of other drugs. However, there is little evidence to aid the treatment of Raynaud's phenomenon.
Effect: no significant difference

Generated: The Raynaud's phenomenon is associated with a small number of adverse effects when administered orally to patients with Raynaud's phenomenon. The frequency of Raynaud's phenomenon is similar to that of other drugs.
Effect: no significant difference

Table 15: An instance where generating multiple reviews allows our models to find a candidate summary reporting a significant difference (the target).

