TEXT SUMMARIZATION WITH ORACLE EXPECTATION

Abstract

Extractive summarization produces summaries by identifying and concatenating the most important sentences in a document. Since most summarization datasets do not come with gold labels indicating whether document sentences are summary-worthy, different labeling algorithms have been proposed to extrapolate oracle extracts for model training. In this work, we identify two flaws with the widely used greedy labeling approach: it delivers suboptimal and deterministic oracles. To alleviate both issues, we propose a simple yet effective labeling algorithm that creates soft, expectation-based sentence labels. We define a new learning objective for extractive summarization which incorporates learning signals from multiple oracle summaries and prove it is equivalent to estimating the oracle expectation for each document sentence. Without any architectural modifications, the proposed labeling scheme achieves superior performance on a variety of summarization benchmarks across domains and languages, in both supervised and zero-shot settings. 1

1. INTRODUCTION

Summarization is the process of condensing a source text into a shorter version while preserving its information content. Thanks to neural encoder-decoder models (Bahdanau et al., 2015; Sutskever et al., 2014), Transformer-based architectures (Vaswani et al., 2017), and large-scale pretraining (Liu & Lapata, 2019; Zhang et al., 2020a; Lewis et al., 2020), the past few years have witnessed a huge leap forward in summarization technology. Abstractive methods fluently paraphrase the main content of the input, using a vocabulary different from the original document. Extractive approaches are less creative, producing summaries by identifying and then concatenating the most important sentences in a document, but they manage to avoid hallucinations, false statements, and inconsistencies.

Neural extractive summarization is typically formulated as a sequence labeling problem (Cheng & Lapata, 2016), assuming access to (binary) labels indicating whether a document sentence should be in the summary. In contrast to the plethora of datasets available for abstractive summarization (typically thousands of document-abstract pairs; see Section 5 for examples), there are no large-scale datasets with gold sentence labels for extractive summarization. Oracle labels are thus extrapolated from abstracts via heuristics, amongst which greedy search (Nallapati et al., 2017) is by far the most popular (Liu & Lapata, 2019; Xu et al., 2020; Dou et al., 2021; Jia et al., 2022).

In this work we challenge received wisdom and rethink whether greedy search is the best way to create sentence labels for extractive summarization. Specifically, we highlight two flaws with greedy labeling: (1) the search procedure is suboptimal, i.e., it does not guarantee a global optimum for the search objective, and (2) greedy oracles are deterministic, i.e., they yield a single reference extract for any given input by matching document sentences against the corresponding abstract.
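To make the contrast concrete, the greedy labeling heuristic, together with a soft expectation-style alternative in the spirit of this paper, can be sketched as follows. This is an illustrative sketch, not the paper's implementation: `unigram_f1` is a simplified stand-in for ROUGE, and `expectation_labels` exhaustively enumerates and weights small candidate extracts, which only mirrors the general idea of expectation-based labeling (the paper's actual formulation and hypothesis set differ).

```python
from itertools import combinations

def unigram_f1(candidate, reference):
    """Simplified stand-in for ROUGE: unigram-overlap F1 between two strings."""
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    if not c or not r:
        return 0.0
    overlap = len(c & r)
    p, rec = overlap / len(c), overlap / len(r)
    return 0.0 if p + rec == 0 else 2 * p * rec / (p + rec)

def greedy_oracle(sentences, abstract, max_sents=3):
    """Greedy labeling (after Nallapati et al., 2017): repeatedly add the
    sentence that most improves the extract's score against the abstract,
    stopping when no sentence yields an improvement. Returns hard 0/1 labels."""
    selected, best = [], 0.0
    while len(selected) < max_sents:
        gains = [
            (unigram_f1(" ".join(sentences[j] for j in sorted(selected + [i])),
                        abstract), i)
            for i in range(len(sentences)) if i not in selected
        ]
        if not gains:
            break
        score, i = max(gains)
        if score <= best:  # no candidate improves the current extract
            break
        selected.append(i)
        best = score
    return [1 if i in selected else 0 for i in range(len(sentences))]

def expectation_labels(sentences, abstract, max_sents=3):
    """Soft labels: weight every candidate extract (up to max_sents sentences)
    by its normalized score, and set each sentence's label to its expected
    membership across hypotheses. Exhaustive enumeration is exponential and
    only meant for illustration on short documents."""
    hyps = [h for k in range(1, max_sents + 1)
            for h in combinations(range(len(sentences)), k)]
    scores = [unigram_f1(" ".join(sentences[i] for i in h), abstract)
              for h in hyps]
    z = sum(scores) or 1.0
    return [sum(s / z for h, s in zip(hyps, scores) if i in h)
            for i in range(len(sentences))]
```

Note how the greedy oracle commits to a single extract, whereas the soft labels distribute credit over all scored hypotheses, so a sentence that appears in many good (but not top-scoring) extracts still receives a non-zero label.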
Perhaps an obvious solution to the suboptimality problem would be to search for oracle summaries with beam search. Although beam search finds better oracles, we empirically observe that summarization models trained on them do not consistently improve over models trained on greedy oracles, possibly due to a higher risk of under-fitting (Narayan et al., 2018a): there are too few positive labels. Moreover, beam search would also create deterministic oracles. A summarization



1 Our code and models can be found at https://github.com/yumoxu/oreo.


Published as a conference paper at ICLR 2023

Table 1: Sentence labels for a CNN/DM article according to different labeling schemes. Only the first 10 document sentences are shown. Greedy and Beam create oracle summaries (i.e., sentences with label 1) with greedy and beam search, respectively. OREO, our labeling algorithm, incorporates information from multiple summary hypotheses shown in the bar chart (R is the mean of ROUGE-1 and ROUGE-2). OREO assigns high scores (> 0.5) to sentences 1 and 4, which contain an important named entity, Jasmine Coleman, and location, Croydon, South East London. In comparison, greedy and beam labeling consider only one oracle summary and assign zero to sentence 1 or 4, failing to capture that these are informative and should probably be included in the summary.

ID  Greedy  Beam  OREO   Document Sentence
1   0       1     0.568  Jasmine Coleman, 12, has been found safe and well some 50 miles from her home.
2   –       –     –      A 12-year-old girl who went missing from her family home at 2 AM amid fears she was driven away by an "older man" has been found safe and well.
…
–   0       0     0.000  In it, she was described as fair with long, blonde hair and as having possibly been wearing black riding trousers and a polo shirt or a paisley pattern dress.
10  0       0     0.000  On Saturday afternoon the force confirmed she had been found safe and well in Croydon but could not confirm the circumstances under which police located her.

Reference Summary

• Jasmine Coleman disappeared from her home at around 2 AM this morning.
• Police believed she may have been driven towards London by an older man.
• She has been found safe and well in Croydon, South East London today.

system trained on either greedy or beam oracles is optimized by maximizing the likelihood of a single oracle summary. This ignores the fact that there can be multiple valid summaries for an article; in other words, the summary hypothesis space is naturally a multi-modal probability distribution. We illustrate this point in Table 1.

In this paper we define a new learning objective for extractive summarization which promotes non-deterministic learning in the summary hypothesis space, and introduce OREO (ORacle ExpectatiOn labeling), a simple yet effective sentence labeling scheme. We prove the equivalence between estimating OREO labels and optimizing the proposed learning objective. As a result, current models can be trained on OREO labels without requiring any architectural changes. Extensive experiments on summarization benchmarks show that OREO outperforms comparison labeling schemes in both supervised and zero-shot settings, including cross-domain and cross-lingual tasks. Additionally, we show that extracts created by OREO can better guide the learning and inference of a generative system, facilitating the generation of higher-quality abstracts. We further analyze OREO's behavior by measuring attainable summary knowledge at inference time, and demonstrate it is superior to related deterministic and soft labeling schemes, which we argue contributes to consistent performance gains across summarization tasks.

2. RELATED WORK

Narayan et al. (2018a) were among the first to discuss problematic aspects of sentence labeling schemes for extractive summarization. They argue that labeling sentences individually, as in Cheng & Lapata (2016), often generates too many positive labels, which leads to overfitting, while a model

