SUMMARIZATION PROGRAMS: INTERPRETABLE ABSTRACTIVE SUMMARIZATION WITH NEURAL MODULAR TREES

Abstract

Current abstractive summarization models either suffer from a lack of clear interpretability or provide incomplete rationales by only highlighting parts of the source document. To this end, we propose the Summarization Program (SP), an interpretable modular framework consisting of an (ordered) list of binary trees, each encoding the step-by-step generative process of an abstractive summary sentence from the source document. A Summarization Program contains one root node per summary sentence, and a distinct tree connects each summary sentence (root node) to the document sentences (leaf nodes) from which it is derived, with the connecting nodes containing intermediate generated sentences. Edges represent the different modular operations involved in summarization, such as sentence fusion, compression, and paraphrasing. We first propose an efficient best-first search method over neural modules, SP-SEARCH, that identifies SPs for human summaries by directly optimizing for ROUGE scores. Next, using these programs as automatic supervision, we propose seq2seq models that generate Summarization Programs, which are then executed to obtain final summaries. We demonstrate that SP-SEARCH effectively represents the generative process behind human summaries using modules that are typically faithful to their intended behavior. We also conduct a simulation study to show that Summarization Programs improve the interpretability of summarization models by allowing humans to better simulate model reasoning. Summarization Programs constitute a promising step toward interpretable and modular abstractive summarization, a complex task previously addressed primarily through black-box end-to-end neural systems.¹

1. INTRODUCTION

Progress in pre-trained language models has led to state-of-the-art abstractive summarization models capable of generating highly fluent and concise summaries (Lewis et al., 2020; Zhang et al., 2020; Raffel et al., 2020). Abstractive summarization models do not suffer from the restrictive nature of extractive summarization systems, which only copy parts of the source document. However, their ability to generate non-factual content (Cao et al., 2018; Maynez et al., 2020) and their lack of clear interpretability make it harder to debug their errors and deploy them in real-world scenarios. Toward interpretable summarization models, Jing & McKeown (1999; 2000) show that human summaries typically follow a cut-and-paste process, and propose a modular architecture involving separate operations that perform sentence extraction, sentence reduction, sentence fusion, etc. Most recent efforts on explainable abstractive summarization follow an extractive-abstractive framework that only provides supporting evidence or 'rationales' for the summary (Hsu et al., 2018; Gehrmann et al., 2018; Liu & Lapata, 2019; Zhao et al., 2020; Li et al., 2021). These models highlight words or sentences from the source document but are not able to explicitly capture the generative process of a summary, i.e., the reasoning steps performed in order to generate each summary sentence from the source document sentence(s), such as sentence compression and fusion.

[Figure 1 near here; caption below. Node contents:
D1: The French prosecutor leading an investigation into the crash of Germanwings Flight 9525 insisted Wednesday that he was not aware of any video footage from on board the plane.
D2: Marseille prosecutor Brice Robin told CNN that "so far no videos were used in the crash investigation."
D3: Paris Match and Bild reported that the video was recovered from a phone at the wreckage site.
I1 (Compression): The prosecutor leading an investigation into the crash of Germanwings Flight 9525 insisted he was not aware of any video footage.
I2 (Fusion): Prosecutor Brice Robin says he is not aware of any video footage from the plane.
S1' (Paraphrase): Prosecutor Brice Robin says he is not aware of any video footage of the crash of Flight 9525.
S2' (Fusion): The video was recovered from a phone at the crash site, according to Paris Match.
Summary: Prosecutor Brice Robin says he is not aware of any video footage of the crash of Flight 9525. The video was recovered from a phone at the crash site, according to Paris Match.]

In this work, we seek to bridge this gap by proposing a novel Summarization Program framework for explaining abstractive summarization that views summarization as a systematic reasoning process over document sentences. A Summarization Program (SP) is a modular executable program that consists of an (ordered) list of binary trees, each encoding the generative process of an abstractive summary sentence from the source document (§3). Fig. 1 shows an example. The leaves in an SP are the source document sentences (typically, only a small subset that are relevant for generating the summary). Each intermediate node represents a generation from a neural module (shown with labeled edges); these generations are composed to derive the final summary sentences at the roots of the trees. We develop three neural modules for building SPs (sentence compression, paraphrasing, and fusion; Jing & McKeown, 1999; 2000), each of which fine-tunes a pre-trained language model on task-specific data. We evaluate Summarization Programs by asking the following two research questions (see Fig. 2 for an overview). RQ1: Given a human-written abstractive summary, can we develop an algorithm for identifying a Summarization Program that effectively represents the generative process of the summary? RQ2: Using the SPs identified in RQ1 as supervision, can we develop models that generate Summarization Programs as interpretable intermediate representations for generating summaries? We answer the first research question by automatically identifying SPs for human summaries (§4). Specifically, we develop an efficient best-first search algorithm, SP-SEARCH, that iteratively applies different neural modules to a set of extracted document sentences in order to generate new sentences, such that the ROUGE (Lin, 2004) score of these new sentences with respect to the gold summary is maximized.
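To make the SP data structure concrete, the following is a minimal illustrative sketch, not the authors' implementation: each tree node either holds a document sentence (leaf) or names a module applied to its children, and executing a tree bottom-up yields one summary sentence. The module functions here are identity/concatenation placeholders standing in for the fine-tuned neural compression, paraphrase, and fusion models.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Placeholder modules; the paper fine-tunes a pre-trained language model
# for each of these operations. These stubs only illustrate the interface.
def compress(sent: str) -> str:
    return sent  # would return a shortened version of the sentence

def paraphrase(sent: str) -> str:
    return sent  # would return a reworded version of the sentence

def fuse(a: str, b: str) -> str:
    return a + " " + b  # would return a single sentence merging a and b

MODULES: dict[str, Callable] = {
    "compress": compress, "paraphrase": paraphrase, "fuse": fuse,
}

@dataclass
class Node:
    """A node in one binary tree of a Summarization Program.

    Leaves carry a document sentence; internal nodes carry a module name
    and children (compression/paraphrase are unary, fusion is binary).
    """
    text: Optional[str] = None         # set for leaves (document sentences)
    op: Optional[str] = None           # 'compress' | 'paraphrase' | 'fuse'
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def execute(node: Node) -> str:
    """Recursively execute a tree; the root's output is a summary sentence."""
    if node.op is None:
        return node.text
    if node.right is None:             # unary module
        return MODULES[node.op](execute(node.left))
    return MODULES[node.op](execute(node.left), execute(node.right))

def run_program(trees: list[Node]) -> str:
    """An SP is an ordered list of trees, one per summary sentence."""
    return " ".join(execute(t) for t in trees)
```

Executing the ordered list of trees reproduces the full summary, which is what makes the representation both interpretable (every summary sentence has an explicit derivation) and executable.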
SP-SEARCH achieves efficiency through important design choices, including maintaining a priority queue that scores, ranks, and prunes intermediate generations (Appendix A). We conduct experiments on two English single-document summarization datasets, CNN/DailyMail (Hermann et al., 2015) and XSum (Narayan et al., 2018), to show that SP-SEARCH outputs SPs that effectively reproduce human summaries, significantly outperforming several baselines (§6.1). Moreover, human evaluation shows that our neural modules are highly faithful,² performing the operations they are supposed to and generating outputs that are mostly factual with respect to their inputs (§6.2). We leverage SP-SEARCH to obtain oracle programs for human summaries, which also serve as supervision for answering our second research question. In particular, we propose two seq2seq models for Summarization Program generation from a source document (§5, Fig. 2). In our first, Extract-and-Build, SP generation model, an extractive summarization model first selects a set of document sentences, which are then passed to another program-generating model. In our second, Joint, SP generation model, sentence extraction and SP generation happen as part of a single model. We obtain promising initial results, and while state-of-the-art end-to-end models demonstrate better ROUGE scores, our oracle SP-SEARCH results indicate significant room for improvement in future work (§6.3).
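The priority-queue design can be sketched as follows. This is a simplified illustration under stated assumptions: a toy unigram-overlap scorer stands in for ROUGE, only unary modules are applied (binary fusion would additionally enumerate pairs of queue entries), and pruning is a fixed beam cutoff per step. The actual algorithm and its design choices are detailed in Appendix A of the paper.

```python
import heapq
import itertools

def overlap_score(candidate: str, reference: str) -> float:
    """Toy stand-in for ROUGE: Jaccard overlap of unigrams with the gold sentence."""
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def sp_search(doc_sents, gold_sent, modules, beam=5, max_depth=2):
    """Best-first search sketch over module applications.

    A bounded priority queue (max-heap via negated scores) keeps only the
    `beam` most promising intermediate generations at each step and prunes
    the rest. Returns (best_score, derivation) where the derivation is a
    list of (module_name, generated_sentence) steps.
    """
    tick = itertools.count()  # unique tie-breaker so the heap never compares strings
    heap = [(-overlap_score(s, gold_sent), next(tick), s, [("extract", s)])
            for s in doc_sents]
    heapq.heapify(heap)
    best_score, best_trace = -heap[0][0], heap[0][3]
    for _ in range(max_depth):
        frontier = sorted(heap)[:beam]  # keep the `beam` best; prune everything else
        heap = []
        for _neg, _, sent, trace in frontier:
            for name, fn in modules.items():   # apply each (unary) module
                out = fn(sent)
                score = overlap_score(out, gold_sent)
                new_trace = trace + [(name, out)]
                if score > best_score:
                    best_score, best_trace = score, new_trace
                heapq.heappush(heap, (-score, next(tick), out, new_trace))
    return best_score, best_trace
```

With a toy "compression" module that drops filler words, the search discovers the derivation that maximizes overlap with the gold sentence; in the real system, each expansion calls a fine-tuned neural module and the scorer is ROUGE against the gold summary.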



¹ Supporting code available at https://github.com/swarnaHub/SummarizationPrograms.
² 'Neural module faithfulness' refers to whether the modules perform their expected operations. This is different from 'summary faithfulness', which evaluates whether a summary contains only factual claims from the source document; the latter will be referred to as 'summary factuality'.



Figure 1: Example of a Summarization Program showing the generative process of two summary sentences (marked with labels S1' and S2' in yellow) from three document sentences (marked with labels D1, D2, and D3 in blue) using the compression, paraphrase, and fusion neural modules. Edges are directed from leaves to roots, and the intermediate generations are labeled I1 and I2.

