SUMMARIZATION PROGRAMS: INTERPRETABLE ABSTRACTIVE SUMMARIZATION WITH NEURAL MODULAR TREES

Abstract

Current abstractive summarization models either suffer from a lack of clear interpretability or provide incomplete rationales by only highlighting parts of the source document. To this end, we propose the Summarization Program (SP), an interpretable modular framework consisting of an (ordered) list of binary trees, each encoding the step-by-step generative process of an abstractive summary sentence from the source document. A Summarization Program contains one root node per summary sentence, and a distinct tree connects each summary sentence (root node) to the document sentences (leaf nodes) from which it is derived, with the connecting nodes containing intermediate generated sentences. Edges represent different modular operations involved in summarization, such as sentence fusion, compression, and paraphrasing. We first propose an efficient best-first search method over neural modules, SP-SEARCH, that identifies SPs for human summaries by directly optimizing for ROUGE scores. Next, using these programs as automatic supervision, we propose seq2seq models that generate Summarization Programs, which are then executed to obtain final summaries. We demonstrate that SP-SEARCH effectively represents the generative process behind human summaries using modules that are typically faithful to their intended behavior. We also conduct a simulation study to show that Summarization Programs improve the interpretability of summarization models by allowing humans to better simulate model reasoning. Summarization Programs constitute a promising step toward interpretable and modular abstractive summarization, a complex task previously addressed primarily through black-box end-to-end neural systems.

1. INTRODUCTION

Progress in pre-trained language models has led to state-of-the-art abstractive summarization models capable of generating highly fluent and concise summaries (Lewis et al., 2020; Zhang et al., 2020; Raffel et al., 2020). Abstractive summarization models do not suffer from the restrictive nature of extractive summarization systems that only copy parts of the source document. However, their ability to generate non-factual content (Cao et al., 2018; Maynez et al., 2020) and their lack of clear interpretability make it harder to debug their errors and deploy them in real-world scenarios. Toward interpretable summarization models, Jing & McKeown (1999; 2000) show that human summaries typically follow a cut-and-paste process, and propose a modular architecture involving separate operations that perform sentence extraction, sentence reduction, sentence fusion, etc. Most recent efforts on explainable abstractive summarization follow an extractive-abstractive framework that only provides supporting evidence or 'rationales' for the summary (Hsu et al., 2018; Gehrmann et al., 2018; Liu & Lapata, 2019; Zhao et al., 2020; Li et al., 2021). These models highlight words or sentences from the source document but are not able to explicitly capture the generative process of a summary, i.e., the reasoning steps performed in order to generate each summary sentence from the source document sentence(s), such as sentence compression, fusion, etc. In this work, we seek to bridge this gap by proposing a novel Summarization Program framework for explaining abstractive summarization, which views summarization as a systematic reasoning process over document sentences. A Summarization Program (SP) is a modular executable program that consists of an (ordered) list of binary trees, each encoding the generative process of an abstractive summary sentence from the source document (§3). Fig. 1 shows an example.
The leaves in an SP are the source document sentences (typically, only a small subset that are relevant for generating the summary). Each intermediate node represents a generation from a neural module (shown with labeled edges), and these generations are composed to derive the final summary sentences at the roots of the trees. We develop three neural modules for building SPs (sentence compression, paraphrasing, and fusion; Jing & McKeown, 1999; 2000), each finetuning a pre-trained language model on task-specific data. We evaluate Summarization Programs by asking the following two research questions (see Fig. 2 for an overview). RQ1: Given a human-written abstractive summary, can we develop an algorithm for identifying a Summarization Program that effectively represents the generative process of the summary? RQ2: Using the SPs identified in RQ1 as supervision, can we develop models that generate Summarization Programs as interpretable intermediate representations for generating summaries? We answer the first research question by automatically identifying SPs for human summaries (§4). Specifically, we develop an efficient best-first search algorithm, SP-SEARCH, that iteratively applies different neural modules to a set of extracted document sentences in order to generate newer sentences, such that the ROUGE (Lin, 2004) score of these new sentences with respect to the gold summary is maximized. SP-SEARCH achieves efficiency through important design choices, including maintaining a priority queue that scores, ranks, and prunes intermediate generations (Appendix A). We conduct experiments on two English single-document summarization datasets, CNN/DailyMail (Hermann et al., 2015) and XSum (Narayan et al., 2018), to show that SP-SEARCH outputs SPs that effectively reproduce human summaries, significantly outperforming several baselines (§6.1).
Moreover, human evaluation shows that our neural modules are highly faithful, performing the operations they are supposed to and also generating outputs that are mostly factual with respect to their inputs (§6.2). We leverage SP-SEARCH to obtain oracle programs for human summaries that also serve as supervision for answering our second research question. In particular, we propose two seq2seq models for Summarization Program generation from a source document (§5, Fig. 2). In our first, Extract-and-Build SP generation model, an extractive summarization model first selects a set of document sentences, which are then passed to another program-generating model. In our second, Joint SP generation model, sentence extraction and SP generation happen as part of a single model. We obtain initial promising results, and while state-of-the-art end-to-end models demonstrate better ROUGE scores, our oracle SP-SEARCH results indicate significant room for improvement in future work (§6.3). To evaluate whether SPs improve the interpretability of summarization models, we conduct a small-scale simulation study (Doshi-Velez & Kim, 2017; Hase & Bansal, 2020; Zhou et al., 2022) where we ask humans to simulate the model's reasoning by writing programs for model summaries (§6.4). We observe that after seeing model SPs, humans are able to better predict them for unseen samples, such that the executed summaries match more closely with the model summaries. Our contributions are:

• We introduce the Summarization Program, an interpretable modular framework for explainable abstractive summarization. A Summarization Program consists of an ordered list of binary trees, each encoding the step-by-step generative process of an abstractive summary sentence through the use of different neural modules (sentence compression, paraphrasing, and fusion).
• We propose an efficient best-first search method, SP-SEARCH, that identifies Summarization Programs for human-written summaries, obtaining a high ROUGE-2 of 40 on the CNN/DailyMail dataset with neural modules that are highly faithful to their intended behavior.

• We present initial Summarization Program generation models that generate SPs from a source document, which are then executed to obtain final summaries. We demonstrate that SPs improve the interpretability of summarization models by allowing humans to better simulate model behavior.

2. RELATED WORK

Pre-trained Models for Summarization. State-of-the-art pre-trained language models like BART (Lewis et al., 2020), T5 (Raffel et al., 2020), and PEGASUS (Zhang et al., 2020) generate summaries in an end-to-end manner. Lacking a complete mechanistic understanding of transformers (Elhage et al., 2021), it remains difficult to understand the reasoning process behind a generated summary. In particular, we cannot determine whether generation follows a process similar to that of humans, e.g., using operations like selection of important information, abstraction through vocabulary generalization, sentence fusion, etc. (Kintsch & van Dijk, 1978; Brown & Day, 1983; Jing & McKeown, 1999; 2000). Moreover, it is hard to determine the source of errors in summaries without knowing the reasoning path to the summary. Recent work on interpretable summarization models highlights words or sentences as rationales for the generated summaries (Hsu et al., 2018; Gehrmann et al., 2018; Liu & Lapata, 2019; Liu et al., 2019; Zhao et al., 2020; Li et al., 2021), but textual highlights do not provide a complete explanation of the summary generation process. Thus, Summarization Programs are a step towards more explicit interpretability in abstractive summarization. Neural Module Networks. Our work also draws inspiration from Neural Module Networks (NMNs) that execute programs as learned functions composed of neural modules (Andreas et al., 2016; Hu et al., 2018; Jiang & Bansal, 2019; Gupta et al., 2020; Subramanian et al., 2020; Saha et al., 2021a; Le et al., 2022). Typically, the modules in an NMN provide attention-based explanations whose interpretability has been debated (Serrano & Smith, 2019; Wiegreffe & Pinter, 2019). Khot et al. (2021) alternatively propose Text Modular Networks that decompose multi-hop questions into sub-questions to be solved by simpler QA models.
We also follow this text-in, text-out paradigm for our modules, but equip them to perform diverse sentence-level operations in document summarization. We evaluate our modules for 'neural module faithfulness' (Subramanian et al., 2020) to demonstrate that the SPs mostly provide a faithful interpretation of the generated summaries.

Multi-step Reasoning over Text.

Step-by-step reasoning has received much interest as a way to explain various reasoning tasks including QA (Dalvi et al., 2021; Ribeiro et al., 2022) and natural language deduction (Bostrom et al., 2022; Saha et al., 2020; Gontier et al., 2020; Saha et al., 2021b; Tafjord et al., 2021) . Summarization Programs similarly encode the reasoning steps in a summary generation process. Our SP-SEARCH method follows a forward chaining method that tries to reach a hypothesis from a set of premises (Tafjord et al., 2021; Bostrom et al., 2022) , unlike backward chaining methods that do the opposite (Gontier et al., 2020; Arabshahi et al., 2021; Kalyanpur et al., 2022; Dalvi et al., 2022) . In another recent line of work, chain-of-thought prompting (Nye et al., 2021; Wei et al., 2022; Wang et al., 2022) encourages LMs to generate intermediate reasoning steps before producing a final answer to a problem. However, the lack of explicit chaining between the reasoning steps and the final output may compromise the faithfulness of those steps, which have also not yet been evaluated as explanations of model behavior per se. Recent work has explored ways to force these reasoning steps to be more like deductive proofs of the final answer (Creswell & Shanahan, 2022) or instead use generations from a larger language model as silver supervision for a smaller pipeline student model (Eisenstein et al., 2022) . In contrast, we aim to develop a method whose rationales (our summarization programs) exactly describe the reasoning process of the overall system and we explicitly evaluate their faithfulness. Specifically, we generate silver programs by our search algorithm that tries to emulate the human summaries and then train models that generate programs which are further executed to obtain final summaries (and evaluated via a simulatability study).

3. SUMMARIZATION PROGRAM

We assume that we have a document D = {D_i}_{i=1}^{d} consisting of d sentences and a corresponding abstractive summary S = {S_i}_{i=1}^{s} consisting of s sentences. A Summarization Program P = {T_i}_{i=1}^{s} is defined as an (ordered) list of s binary trees, where each tree T_i = (V_i, E_i) is a structured representation of the generative process of one summary sentence S_i ∈ S. Fig. 1 shows an example of an SP with two trees for two summary sentences. The set of nodes V_i in each tree consists of single sentences, and the edges E_i are labeled with one of the neural modules m ∈ {paraphrase(·), compression(·), fusion(·, ·)}. These modules represent operations over sentences, wherein compression(X) → Y and paraphrase(X) → Y are unary operations and fusion(X, Y) → Z is a binary operation. The leaf nodes in each tree are sentences from the document D_i ∈ D, and the root is a summary sentence S_i ∈ S. All other nodes are intermediate sentences generated by executing a neural module (referred to as I1 and I2 in Fig. 1). An edge from a node u ∈ V_i to a node v ∈ V_i labeled with the module m means that v is generated by executing m on u. The summary S is obtained by concatenating the root nodes of the trees in order. We hypothesize that the generative process of each summary sentence can be captured by composing different neural modules that operate over sentences. Following prior work on modular approaches to abstractive summarization (Jing & McKeown, 1999; 2000; Lebanoff et al., 2019; 2020b), we define the following three neural modules for building Summarization Programs. Fusion Module. Sentence fusion in summarization combines information from multiple sentences (Lebanoff et al., 2020a). We finetune a BART-large model, which takes two sentences as input and outputs one fused sentence.
Existing sentence fusion datasets either aim to improve discourse connections (Geva et al., 2019) or aim to fuse similar sentences from multiple documents (Brook Weiss et al., 2022). Instead, we want to fuse two disparate sentences into one sentence, which requires the model to merge related pieces and remove unimportant information. To obtain training data for such a model, we follow Lebanoff et al. (2019) to align each summary sentence from CNN/DailyMail with one to many similar and non-redundant document sentences. As our training data, we only use examples that align one summary sentence with two source sentences. Compression Module. The compression module generates a compressed output of a single sentence. It involves generating a shorter sentence while preserving the essential content and the syntactic structure of the input. We finetune a BART-large model (Lewis et al., 2020) on a large parallel corpus of uncompressed and compressed sentences (Filippova & Altun, 2013). Paraphrase Module. The paraphrase module generates a sentence that involves syntactic transformations or lexical paraphrasing of the input sentence. We use a publicly available PEGASUS-based (Zhang et al., 2020) paraphrase model from HuggingFace (Wolf et al., 2020). In practice, we observe paraphrased outputs to frequently involve some compression as well, which we analyze in detail as part of the 'Neural Module Faithfulness' evaluation (§6.2).
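To make the framework concrete, here is a minimal sketch of how an SP's binary trees can be represented and executed bottom-up, with toy string functions standing in for the finetuned fusion, compression, and paraphrase models. All names (`Node`, `execute_sp`, the stub behaviors) are illustrative and not from the paper's released codebase.

```python
from dataclasses import dataclass
from typing import Optional

# Toy stand-ins for the neural modules; in the paper these are finetuned
# BART/PEGASUS models. The stubs only illustrate the arities and data flow.
MODULES = {
    "compression": lambda x: x.split(",")[0] + ".",            # crude trim
    "paraphrase": lambda x: x,                                  # identity stub
    "fusion": lambda x, y: x.rstrip(".") + " and " + y[0].lower() + y[1:],
}

@dataclass
class Node:
    """A node in one SP tree: a leaf holds a document sentence; an internal
    node holds the module applied to its child(ren)."""
    module: Optional[str] = None    # None for leaf nodes
    sentence: Optional[str] = None  # set for leaves; filled in on execution
    left: Optional["Node"] = None
    right: Optional["Node"] = None  # only used by the binary fusion module

def execute(node: Node) -> str:
    """Recursively execute a tree bottom-up; the root becomes a summary sentence."""
    if node.module is None:
        return node.sentence
    args = [execute(node.left)]
    if node.right is not None:
        args.append(execute(node.right))
    node.sentence = MODULES[node.module](*args)
    return node.sentence

def execute_sp(trees: list) -> str:
    """Concatenate the root outputs, in order, to form the final summary."""
    return " ".join(execute(t) for t in trees)

d1 = Node(sentence="The storm hit the coast on Monday, officials said.")
d2 = Node(sentence="Thousands of residents were evacuated.")
tree = Node(module="fusion",
            left=Node(module="compression", left=d1),
            right=d2)
print(execute_sp([tree]))
```

Executing the tree above first compresses d1 and then fuses the result with d2, mirroring how an SP's labeled edges compose module outputs up to the root.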

4. RQ1: SUMMARIZATION PROGRAM SEARCH

Our first research question of interest is whether, given a document D and a human-written summary S, we can generate a Summarization Program P that best explains the generative process of the summary (see the left part of Fig. 2). We achieve this by developing an efficient best-first search method, named SP-SEARCH, outlined in Algorithm 1 in the Appendix. Conceptually, it is similar to a forward chaining algorithm in logic programming (Russell & Norvig, 2009), in which we start from a set of premises (equivalently, a small set of document sentences) and iteratively apply neural modules on them to generate newer deductions (equivalently, intermediate sentences) until the goal hypothesis (equivalently, the summary) is generated.

Figure 2: Overview of the two RQs aimed at evaluating SPs. On the left, we show our SP-SEARCH algorithm that takes a source document and a human summary as input and identifies an SP that best explains the summary (RQ1). On the right, we show our SP generation models that take a document as input and output a program, which is then executed to generate the summary (RQ2). S1 and S2 denote human summary sentences, while S1' and S2' represent identified or generated summaries.

SP-SEARCH processes each summary sentence separately, thereby generating a unique tree for each summary sentence. We optimize the trees for their ROUGE-L (R-L) scores with respect to the gold summary sentences. SP-SEARCH Algorithm. SP-SEARCH operates as follows (see Algorithm 1). For each summary sentence, it initializes a priority queue, each element (s_1, s_2, m, h) of which represents a module m, the sentences s_1 and s_2 on which m is defined (s_2 is empty for the 'compression' and 'paraphrase' modules), and the corresponding tree height h. The initial elements in the queue represent possible operations only on the document sentences.
Next, at each step of the search, SP-SEARCH pops an element from the queue and executes the operation over the corresponding operands to generate a new sentence. Each module generates top-5 outputs using beam search, and the one with the highest R-L score is chosen. Using this new sentence, SP-SEARCH creates new elements for the priority queue by considering all potential operations involving the new sentence and other available sentences. These new elements may then be explored in the next steps of the search. As a best-first search method, SP-SEARCH ranks the elements in the queue with the following scoring function:

Score(s_1, s_2, m, h) = max(R-L(s_1, S), R-L(s_2, S))    (1)

The scoring function considers the maximum of the R-L scores of the operands s_1 and s_2, such that if the module m is executed, it can possibly lead to a sentence with an even higher R-L score than its children. Intuitively, it encourages prioritizing elements in the queue that can potentially lead to higher scores. Whenever a new sentence is generated, its R-L score with the gold summary sentence is computed and, accordingly, the best possible root node of the tree is updated. Upon completion of the search (when the queue becomes empty), the node with the maximum score is chosen as the root node, and the tree rooted at that node is the final tree. Since performing SP-SEARCH exhaustively can be prohibitive, it is made efficient through the following design choices. Design Choices for Efficient SP-SEARCH. Here we discuss the choices briefly and provide more details in Appendix A. (1) Top-k Document Sentences. We rank each document sentence by computing ROUGE-1 with the summary and build the SP with only the top-k document sentences as eligible source sentences. (2) Filtering Queue Elements.
To limit the search space, SP-SEARCH defines certain heuristics for choosing the elements to be added to the queue (e.g., a compressed sentence is not compressed again, and each document sentence is used at most once in a tree). (3) Priority Queue and Pruning. SP-SEARCH maintains a fixed queue size and a ranked list of the elements according to Eq. 1, such that only the top-ranked elements are kept in the queue and the rest are pruned. (4) Parent Has a Higher ROUGE than Children. A branch of the search is only expanded further if the new generation (parent) has a higher R-L score than the source sentence(s) (children). This greedy approach ensures that every reasoning step is a step closer to the summary. (5) Maximum Tree Height. SP-SEARCH chooses a maximum height for the trees. (6) Batching Module Executions. SP-SEARCH executes all operations in the queue together by batching at each depth of the search.
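The search loop above can be sketched compactly. This is a minimal sketch under simplifying assumptions: toy string functions stand in for the finetuned modules, a set-based unigram recall stands in for ROUGE-L, and not all filtering heuristics are enforced. Function and variable names are ours, not the paper's.

```python
import heapq

def overlap(cand: str, ref: str) -> float:
    """Toy stand-in for ROUGE-L: unigram recall against the reference."""
    c, r = set(cand.lower().split()), set(ref.lower().split())
    return len(c & r) / max(len(r), 1)

# Toy string modules standing in for the finetuned compression/fusion LMs.
MODULES = {
    "compression": lambda x: " ".join(x.split()[: max(1, len(x.split()) // 2)]),
    "fusion": lambda x, y: x + " " + y,
}

def sp_search(doc_sents, target, queue_size=20, max_height=2):
    """Best-first search for a tree whose root best matches `target`.
    A queue element scores a candidate operation by the maximum overlap
    of its operands with the target (cf. Eq. 1)."""
    best = max(doc_sents, key=lambda s: overlap(s, target))
    best_score = overlap(best, target)
    frontier = []  # max-heap via negated scores

    def push(op, args, height):
        score = max(overlap(a, target) for a in args)
        heapq.heappush(frontier, (-score, op, args, height))
        del frontier[queue_size:]  # crude pruning to a fixed queue size

    # Initial elements: operations over document sentences only.
    for i, s in enumerate(doc_sents):
        push("compression", (s,), 1)
        for t in doc_sents[i + 1:]:        # fuse in document order only
            push("fusion", (s, t), 1)

    while frontier:
        _, op, args, height = heapq.heappop(frontier)
        out = MODULES[op](*args)
        # Expand only if the parent beats its children (greedy constraint).
        if overlap(out, target) <= max(overlap(a, target) for a in args):
            continue
        if overlap(out, target) > best_score:
            best, best_score = out, overlap(out, target)
        if height < max_height:            # maximum tree height cutoff
            for s in doc_sents:
                push("fusion", (out, s), height + 1)
    return best, best_score
```

The priority queue, the max-of-operands scoring, the parent-beats-children cutoff, and the maximum height all correspond to the design choices listed above; a real implementation would additionally batch module executions per depth.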

5. RQ2: ABSTRACTIVE SUMMARIZATION VIA SUMMARIZATION PROGRAMS

Given that we have proposed SP-SEARCH for identifying SPs for human summaries, we now leverage the algorithm to generate training data for building SP generation models, as part of RQ2. In particular, we use SP-SEARCH to identify Summarization Programs for all training samples in the CNN/DailyMail dataset (Hermann et al., 2015). Our second research question now asks whether these identified SPs can be used as supervision for developing abstractive summarization models via the generation of SPs (see the right part of Fig. 2). Hence, we define a supervised learning problem f : D → P that generates a Summarization Program P from a document D. Summary Generation from an SP. We parse and execute a generated SP through iterative inference over the neural modules to obtain the final summary. We check the well-formedness of the SP by ensuring that the generated sequence (1) has balanced parentheses, (2) does not contain any out-of-vocabulary token (besides module names, sentence identifiers, and parentheses), and (3) has a consistent number of operands for each operation. During inference, we decode up to top-k SPs using beam search and execute the first well-formed one. When none of the top-k outputs is well-formed (<1% of the samples), we generate the corresponding extractive summary.
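The three well-formedness checks can be sketched as follows, assuming a hypothetical whitespace-tokenized serialization for a single tree, such as "fusion ( compression ( <D1> ) <D2> )", with sentence identifiers like <D3>; the actual token format used by the models may differ.

```python
import re

# Module vocabulary and the expected operand count for each module.
ARITY = {"compression": 1, "paraphrase": 1, "fusion": 2}

def is_well_formed(sp: str) -> bool:
    tokens = sp.split()
    if not tokens:
        return False
    # (1) Balanced parentheses.
    depth = 0
    for t in tokens:
        depth += (t == "(") - (t == ")")
        if depth < 0:
            return False
    if depth != 0:
        return False
    # (2) No out-of-vocabulary tokens (only modules, ids, parentheses).
    for t in tokens:
        if t not in ("(", ")") and t not in ARITY and not re.fullmatch(r"<D\d+>", t):
            return False
    # (3) Consistent operand counts, via recursive-descent parsing.
    def parse(i):
        """Parse one subtree starting at token i; return the index just
        after it, or None on an arity/structure error."""
        if re.fullmatch(r"<D\d+>", tokens[i]):
            return i + 1
        if tokens[i] in ARITY and i + 1 < len(tokens) and tokens[i + 1] == "(":
            j, nargs = i + 2, 0
            while j < len(tokens) and tokens[j] != ")":
                j = parse(j)
                if j is None:
                    return None
                nargs += 1
            if j >= len(tokens) or nargs != ARITY[tokens[i]]:
                return None
            return j + 1
        return None
    return parse(0) == len(tokens)
```

A malformed decode such as "fusion ( <D1> )" (wrong arity) or "merge ( <D1> <D2> )" (out-of-vocabulary module) would be rejected, triggering the fallback to the next beam or to the extractive summary.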

6. EXPERIMENTS

We experiment with two English single-document summarization datasets, CNN/DailyMail (Hermann et al., 2015) and XSum (Narayan et al., 2018). We discuss CNN/DM results below and point readers to Appendix F for XSum results. Evaluation Metrics. We use ROUGE to measure the similarity between the SP-SEARCH summaries and the human summaries. We also compute the factuality of the SP-SEARCH summaries with respect to the gold summaries using a SOTA factuality metric, QuestEval (Scialom et al., 2021). Note that while factuality is typically measured against the source document, RQ1 requires us to evaluate how factual the SP-SEARCH summary is to the reference summary (which the SPs are aiming to reproduce). Experimental Design. We conduct experiments with 1000 random validation samples. We set the number of initial document sentences (k) to 4, the maximum queue size (Q) to 20, the maximum height of the trees (H) to 2, and the decoding strategy for each module to beam search with a beam size of 5. Our choice of hyperparameters is based on a comprehensive study of the trade-off between ROUGE scores and average search time, as presented in Appendix C. As baselines, we consider the following extractive and abstractive oracles. (1) Random SP. For each summary sentence in the gold summary, we randomly sample a tree structure (from those obtained through SP-SEARCH) and execute it with randomly chosen leaf nodes from the set of top-k sentences. (2) Top-4 Sentences. Our next baseline is an extractive summarization system with the top-4 document sentences ranked according to ROUGE-1 scores with the gold summary. (3) BART-Oracle. We also compare with two BART-oracle models: in one, we generate the top-10 summaries using beam search; in the other, we sample 10 summaries with multinomial sampling. In both, the summary with the highest ROUGE-L score with respect to the gold summary is chosen. (4) SP-SEARCH Leaves. Our final baseline is an SP-SEARCH variant where we only consider the leaf nodes (document sentences).
This represents an extractive summary with up to top-4 sentences. Since the trees are built on top of these nodes, a better ROUGE score for SP-SEARCH will indicate that these leaf nodes or document sentences are being composed in a way such that the resultant summary is more similar to the human summary. (5) Ablations of Modules. These are SP-SEARCH variants where each module is removed. The random SP baseline uses the same pre-trained modules as SP-SEARCH but only obtains an R-2 of 15, demonstrating that arbitrary composition of these modules is much less effective than the searched programs. The BART-oracle models exhibit less diversity between the generations, as reflected by their lower ROUGE scores. This demonstrates the utility of a tree structure in an SP. Finally, we observe that each of our neural modules plays a significant role in constructing more accurate SPs. Removing the fusion module leads to the highest drop in ROUGE, which suggests that fusing information from multiple sentences is one of the core operations in human summaries. SP-SEARCH also obtains the best QuestEval score, even though it does not specifically optimize for it, suggesting that the summaries are not only similar to the human summaries (as measured by ROUGE) but also have high factual overlap with them. In summary, SPs are an effective way of representing the generative process of abstractive summaries, outperforming random SPs, extractive oracles, and unstructured abstractive oracles.
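Both the Top-4 Sentences baseline and SP-SEARCH's choice of eligible leaf sentences rank document sentences by their ROUGE-1 overlap with the summary. A minimal sketch of that selection step, using a simplified clipped unigram-recall proxy in place of a full ROUGE implementation (a real experiment would use a proper ROUGE package):

```python
from collections import Counter

def rouge1_recall(sentence: str, summary: str) -> float:
    """Fraction of summary unigrams covered by the sentence (clipped counts);
    a simplified proxy for ROUGE-1 recall."""
    s = Counter(sentence.lower().split())
    r = Counter(summary.lower().split())
    hits = sum(min(count, s[w]) for w, count in r.items())
    return hits / max(sum(r.values()), 1)

def top_k_sentences(doc_sents, summary, k=4):
    """Rank document sentences by overlap with the summary, keep the top-k,
    and return them in their original document order."""
    ranked = sorted(range(len(doc_sents)),
                    key=lambda i: rouge1_recall(doc_sents[i], summary),
                    reverse=True)[:k]
    return [doc_sents[i] for i in sorted(ranked)]
```

Preserving document order among the selected sentences matters downstream, since SP-SEARCH fuses sentences while keeping their temporal order in the document intact.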

6.2. RQ1 RESULTS: NEURAL MODULE FAITHFULNESS EVALUATION

Higher ROUGE and factuality scores obtained by SP-SEARCH with respect to the reference summaries do not guarantee that SPs provide a faithful interpretation of the generated summaries. §6.1 specifically evaluates the final summaries but not the intermediate steps used to generate them. In fact, prior work on Neural Module Networks has shown that faithfulness may not always be guaranteed for neural modules and that incorrect intermediate outputs can still lead to correct final outputs (Subramanian et al., 2020).

6.3. RQ2 RESULTS: SUMMARIZATION PROGRAM GENERATION MODELS

Experimental Design. We compare our SP generation models (Joint and Extract-and-Build) with the following baselines and oracles on the CNN/DailyMail test set. As baselines, we consider (1) a SOTA Extractive Summarization Model, MatchSum (Zhong et al., 2020), which we also use to extract initial document sentences for our Extract-and-Build SP generation model, (2) SOTA Abstractive Summarization Models, BART (Lewis et al., 2020) and PEGASUS (Zhang et al., 2020), and (3) Random SP Models that randomly sample the number of summary sentences from {1,2,3,4} and, for each summary sentence, randomly sample and execute a tree structure with randomly chosen leaf nodes from all or top-k document sentences. As oracles, we consider the same models introduced in §6.1 for RQ1; they provide an estimate of the upper bound for SP models. Evaluation Metrics. Besides ROUGE, we also evaluate the factuality of the generated summaries (with respect to the source document) using QuestEval (Scialom et al., 2021), which has been shown to correlate well with human judgments. Results. Table 3 shows the results. We observe that our models obtain R-2 and R-L scores of 16 and 26, respectively. The Extract-and-Build model performs slightly better than our Joint model, possibly because the former generates programs over a good initial set of document sentences.
Both of our models outperform the Random SP baselines, demonstrating that they learn useful patterns of which neural modules should act on which document sentences. Compared to SOTA abstractive models, our models' interpretability (as discussed in §6.4) comes at the cost of some drop in performance (about 5 points in R-2). However, the oracle results suggest that our models provide good starting points for better SP generation models and that there is substantial room for improvement in future work (e.g., the SP-SEARCH Top-1 oracle obtains an R-2 of 35, leaving a large gap of 19 R-2 points between our models and oracle SPs). We observe a similar trend for the QuestEval scores, where our models largely outperform the random SP baselines while showing worse performance than BART and PEGASUS, possibly because of the cascading effect of compositional generation via SPs. Discussion. BART and PEGASUS optimize for gold summaries, while our SP models optimize for SP-SEARCH summaries. The SP models, by themselves, are simple and effective: when the generated summaries are evaluated using SP-SEARCH summaries as targets, we obtain an R-2 of 21, comparable to SOTA models. This shows that retraining our models on even better oracle programs, by incorporating more modules into SP-SEARCH or further enhancing the existing ones, can help close the gap to SOTA models. We hope the community will explore and improve upon our methods, as a way to make progress on the important and challenging task of developing interpretable abstractive summarization models.

6.4. RQ2 RESULTS: INTERPRETABILITY EVALUATION OF SP VIA SIMULATION STUDY

We ask whether SPs improve the interpretability of a summarization model. In particular, we are interested in evaluating model simulatability, a measure of whether a person can predict model behavior on new inputs (Doshi-Velez & Kim, 2017). Similar to Zhou et al. (2022), we are specifically interested in model reasoning, as represented by the programs that our models generate and execute. The primary motivation for predicting model reasoning is that it is what we want to better understand by virtue of model explanations, and simulation studies that focus on predicting final outputs do so only to show that users have a good mental model of model reasoning (Hase & Bansal, 2020). We say that an SP is more representative of the model summary if executing it with our pre-trained modules generates a summary that is closer to the model summary. This is a measure of program similarity, since variance in the SPs may make graph similarity measures inappropriate (e.g., compression followed by paraphrase and vice versa may generate similar outputs). Results. The results are presented in Table 4. In general, high ROUGE scores suggest that, when models are forced to generate summaries via SPs, their reasoning becomes quite predictable a priori. We also see about a 4-point improvement in ROUGE-2, with 61% of samples showing an improved R-2 score (statistically significant at p < 0.001), 10% of samples showing a drop in R-2, and the remaining being ties. This suggests that SPs are potentially good explanations of model reasoning, such that humans can generalize across model reasoning patterns after being given the explanations.

7. DISCUSSION AND CONCLUSION

We proposed the Summarization Program, a novel framework for interpretable abstractive summarization. We demonstrated its effectiveness by developing SP-SEARCH, which identifies Summarization Programs for human summaries with highly faithful neural modules, and SP models that produce summaries from source documents. The two most common forms of errors in the generated SPs are (1) redundant or longer paths in an SP, and (2) the fusion module generating non-factual sentences or ignoring one of its source sentences. Two other notable issues, arising out of the independence assumption between summary sentences, are (1) final summary sentences having overlapping information, and (2) incoherence between consecutive sentences. One way to address this is to add a 'coherence' module on top of the root nodes before generating the final summary. We build SPs using sentences as the fundamental content unit (nodes) due to the relative ease of defining and training neural modules on sentences and the availability of large-scale training data. Summarization may also involve other text manipulation operations that are not fully captured by our modules, but our framework allows easy inclusion of other modules.

ETHICS STATEMENT

Despite the recent success of pre-trained language models in abstractive summarization, their lack of explainability remains a major concern, and we hope that Summarization Programs prove to be an important step in bridging that gap. That said, summarization is inherently a subjective task, and existing summarization datasets also vary significantly in stylistic features like abstractiveness, length, and specificity (Goyal et al., 2022). Hence, more future work is needed to understand the general applicability of our neural modules and how effective they are in encoding different kinds of summaries. Broadly put, the Summarization Program is a case study for breaking a complex NLP problem down into sub-problems and then solving them through neural modules without access to intermediate supervision. We fine-tune language models for building modules, and language models can be prone to generating unwanted content (Weidinger et al., 2021). However, since each module is focused on one particular skill, this should help limit the negative impact and provide users with more control compared to end-to-end models. Summarization Programs are also specifically designed to trace the origin of any toxic or hallucinated content in the generated summaries.

REPRODUCIBILITY STATEMENT

To encourage reproducibility, we make our source code publicly available.

Table 5: Top 10 tree structures for the human-written summaries in the training corpus of CNN/DM. Each tree structure is accompanied by the corresponding tree height and its frequency (in percentage). A structure of "(•)" represents a singleton node.

Tree Structure                                  Height  Frequency (%)
(•)                                                1        8
compression ( fusion (•, •) )                      2        8
fusion ( fusion (•, •) fusion (•, •) )             2        7
(•)                                                0        7
fusion ( compression (•) fusion (•, •) )           2        6
paraphrase ( compression (•) )                     2        6
paraphrase ( fusion (•, •) )                       2        6
fusion ( fusion (•, •) compression (•) )           2        5
fusion ( fusion (•, •) )                           2        5
fusion ( compression (•) )                         2        5

A DESIGN CHOICES FOR EFFICIENT SP-SEARCH (CONTINUED FROM §4)

SP-SEARCH is outlined in Algorithm 1. It is made efficient through several important design choices, discussed below. For clarity, not all of these choices are shown in the algorithm outline.

Top-k Document Sentences. The search space grows exponentially with depth (because of the fusion operation); one way to limit this growth is to ensure that SP-SEARCH starts with a small number of document sentences. A summary is typically constructed from only a small number of sentences in the document. Hence, we rank each document sentence by its ROUGE-1 score with the summary and build the Summarization Program with only the top-k document sentences (for a small k) as eligible source sentences.

Filtering Queue Elements. The search space also depends on how many elements are added to the queue at each step and how many of those are expanded further. Note that whenever a new sentence is generated by a module, it can potentially fuse with all previous generations to create new queue elements.
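The top-k sentence selection step described above can be sketched as follows. This is an illustration, not the paper's implementation: `rouge1_f1` is a simplified unigram-overlap F1 stand-in for the HuggingFace ROUGE implementation, and `top_k_sentences` is a hypothetical helper name.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1: a simplified stand-in for ROUGE-1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def top_k_sentences(doc_sentences, summary, k=4):
    """Rank document sentences by ROUGE-1 against the summary; keep the top k
    as the eligible source sentences for building the Summarization Program."""
    ranked = sorted(doc_sentences, key=lambda s: rouge1_f1(s, summary), reverse=True)
    return ranked[:k]
```

The rest of the search then only ever considers these k sentences as leaves, which bounds the exponential growth at the first level.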
Since doing this exhaustively increases the search space, SP-SEARCH applies the following heuristics when choosing the elements to add to the queue: (1) a sentence that has been compressed once is not compressed again; (2) a sentence that has been paraphrased once is not paraphrased again; (3) each document sentence is used at most once in a tree; (4) two sentences are not fused if they are intermediate generations derived from the same sentence; and (5) since fusion is not a symmetric operation and can lead to different generations depending on the order, sentences are fused in the temporal order in which they appear in the document.

Priority Queue and Pruning. SP-SEARCH additionally prunes the elements added to the queue. It maintains a fixed queue size: while expanding elements in best-first order, it keeps a ranked list of elements according to Eq. 1 such that only the top-ranked elements remain in the queue and the rest are pruned.

Parent Has a Higher ROUGE than Children. Each module generates multiple outputs through beam search, and SP-SEARCH requires the best output to have a higher R-L score than the nodes the module was applied to; otherwise, the corresponding branch of the search is not expanded further. This constraint generalizes to the property that every node in a Summarization Program has a higher R-L score (with respect to the summary sentence) than all other nodes in the subtree rooted at it. Conceptually, this greedy approach ensures that every reasoning step in a Summarization Program moves a step closer to the summary (according to a scoring function).

Maximum Tree Height. SP-SEARCH fixes a maximum tree height, beyond which nodes are not expanded further during the search.

Batching Module Executions. Instead of executing each module separately, which requires a forward pass over a neural model and can be time-consuming, SP-SEARCH executes all operations in the queue together by batching at each depth of the search.

Algorithm 1 (SP-SEARCH) takes as input the top-k document sentences D_k, the summary S, the modules {M_c, M_p, M_f}, the maximum height H, the maximum queue size Q, the number of generations G, and the scoring function Score, and outputs a Summarization Program P.
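The best-first loop with these design choices (greedy parent-beats-children expansion, bounded queue, maximum height) can be sketched as follows. This is a simplified illustration with unary modules only (fusion would pair the popped sentence with earlier generations), and `modules` and `score` are stand-ins for the neural modules and the Eq. 1 scorer.

```python
import heapq
import itertools

def sp_search(leaves, modules, score, max_height=2, queue_size=20):
    """Best-first search sketch: expand the highest-scoring sentence first,
    keep only the best `queue_size` candidates, stop expanding at `max_height`.
    `modules` maps a name to a function over a sentence (returning a new
    sentence or None); `score` maps a sentence to a number (higher is better)."""
    counter = itertools.count()  # unique tie-breaker so the heap never compares strings
    heap = []                    # max-heap via negated scores
    for s in leaves:
        heapq.heappush(heap, (-score(s), next(counter), s, 0))
    best_sent, best_score = None, float("-inf")
    while heap:
        neg, _, sent, height = heapq.heappop(heap)
        if -neg > best_score:
            best_sent, best_score = sent, -neg
        if height >= max_height:
            continue  # do not expand nodes at the maximum tree height
        for _name, fn in modules.items():
            out = fn(sent)
            # greedy constraint: expand only if the child improves on its parent
            if out is not None and score(out) > score(sent):
                heapq.heappush(heap, (-score(out), next(counter), out, height + 1))
        # pruning: keep only the top `queue_size` elements in the queue
        if len(heap) > queue_size:
            heap = heapq.nsmallest(queue_size, heap)
            heapq.heapify(heap)
    return best_sent, best_score
```

In the real algorithm the queue holds module applications rather than bare sentences, and all pending module calls at a depth are executed as one batched forward pass.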

B ANALYSIS OF SP-SEARCH SUMMARIZATION PROGRAMS FOR CNN/DM

We observe that 68% of the summaries in CNN/DM have 3 or 4 summary sentences, so the corresponding Summarization Programs have 3 or 4 trees. Note that while we initialize SP-SEARCH with the top-4 document sentences, a Summarization Program may choose to ignore some of these sentences if including them does not lead to higher ROUGE scores. Upon analysis, we find that 73% of the programs are constructed using all four initial sentences, 23% with three sentences, and 3% with two. We also note that the trees can have any height up to the maximum defined height of 2. A tree of height 0 is a singleton node containing a single document sentence; thus, a Summarization Program with only singleton nodes represents an extractive summary. Overall, we observe as many as 20 different tree structures.
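The tree heights and structure signatures tallied above can be computed with a small recursive routine over SP trees. The `Node` class and the rendering below are simplified illustrations (the exact notation in Table 5 differs slightly), not the paper's data structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One node of a Summarization Program tree. Leaves are document
    sentences (module is None); internal nodes name the module applied."""
    module: Optional[str] = None          # "fusion", "compression", "paraphrase", or None
    children: List["Node"] = field(default_factory=list)

def height(node: Node) -> int:
    """Height 0 for a singleton leaf; otherwise 1 + the tallest child."""
    if not node.children:
        return 0
    return 1 + max(height(c) for c in node.children)

def structure(node: Node, root: bool = True) -> str:
    """Render an approximate structure signature, e.g. '(•)' for a
    singleton tree or 'fusion (•, •)' for a fusion of two leaves."""
    if node.module is None:
        return "(•)" if root else "•"
    inner = ", ".join(structure(c, root=False) for c in node.children)
    return f"{node.module} ({inner})"

# Example: fuse a compressed D1 with D2, a shape similar to several Table 5 entries.
tree = Node("fusion", [Node("compression", [Node()]), Node()])
```

Grouping the training-set trees by `structure(...)` and counting yields frequency statistics of the kind reported in Table 5.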

C SP-SEARCH HYPERPARAMETERS

In order to analyze the effect of different hyperparameters on SP-SEARCH, we compare the trade-off between ROUGE scores and average search time per sample (on a single RTX 2080 Ti GPU) while varying the number of initial document sentences (k), the queue size (Q), the decoding strategy (D), and the maximum height of the trees (H) in Table 6. We observe that increasing the number of document sentences k beyond 4 leads to some improvement in ROUGE scores, but at the cost of increased search time. Increasing the queue size (Q) also yields minor improvements, again with increased search time. While increasing the maximum tree height (H) to 3 results in better ROUGE scores, on inspection we find that this happens primarily due to low-quality fusions of two arbitrary sentences, which do not always yield better programs. Finally, beam search performs better than greedy decoding, and taking the best of the top-5 generations from each module improves results further with almost no increase in search time. Overall, at a moderate search time of under 30 seconds per sample on a single RTX 2080 Ti GPU, SP-SEARCH obtains a 20-point increase in R-2 over our extractive baseline of leaf nodes (document sentences).

SP-SEARCH builds Summarization Programs by optimizing for ROUGE-L because of its better overall performance across all ROUGE metrics. As shown in Table 7, optimizing instead for R-1 or R-2 leads to slightly better R-1 or R-2 respectively, but R-L performs better overall; we find that the different ROUGE metrics may not correlate well with each other when optimizing for only one of them.

SP-SEARCH uses ROUGE-1 scores to select the candidate sentences for building the Summarization Programs. We also experiment with other ROUGE metrics in Table 8 and observe that our algorithm is fairly robust to this choice. R-2 obtains slightly better results than R-1 and R-L, but all metrics still yield sufficiently high oracle results of 40-41 points in R-2 through SP-SEARCH, independent of the selection metric.

D SP MODEL HYPERPARAMETERS

We build our models on top of the HuggingFace transformers library (Wolf et al., 2020). All models are trained for 40,000 steps with a batch size of 16, a learning rate of 3e-5, and 500 warmup steps. We set the maximum input length to 512 and the maximum generation length to 100. During inference, we generate up to the top-10 Summarization Programs with beam search and output the first well-formed program. We also set the minimum generation length to 10 to prevent the model from generating overly short sequences, and the repetition penalty to 2. Program execution from the SPs is performed with the same set of hyperparameters for each module as used during SP-SEARCH.

E RQ1 RESULTS: NEURAL MODULE FAITHFULNESS EVALUATION ON CNN/DM (CONTINUED FROM §6.1)

We discuss the faithfulness evaluation of our modules below.

Study Design. Ideally, we would attribute each generated sentence to exactly one module, but since our modules are generative, we do not know whether each module performs only and exactly its named function. For example, fusing two sentences into a fluent sentence may in practice also involve some paraphrasing or compression. Hence, we evaluate module faithfulness by analyzing how often a module m_i performs the role of a module m_j. Two expert annotators, with knowledge of NLP, annotate each intermediate generation (from a module) and assign a binary label against each of the modules that could have produced that output. Additionally, the annotators label each generation for whether it contains non-factual content (content that cannot be verified from the sentence(s) used to generate it). The study is conducted with 20 Summarization Programs, comprising a total of 107 samples: 54 fusion, 25 paraphrase, and 28 compression operations.

F.1 RQ1 RESULTS: SP-SEARCH EVALUATION ON XSUM

To answer RQ1 on XSum, we keep the paraphrase and compression modules unaltered and only retrain the fusion module, with the training data obtained using the same heuristics as used for CNN/DM (Lebanoff et al., 2019).
We also use the same hyperparameters for SP-SEARCH as those used for CNN/DM and compare all methods on 1000 randomly chosen validation samples of XSum. As shown in Table 10 , SP-SEARCH obtains an R-2 of 29.60 and an R-L of 48.48, outperforming all baseline methods by a significant margin. Unlike CNN/DM, extractive baselines like 'Lead 1', 'SP-SEARCH Leaves', and 'Top-1 Sentence' do not perform well for XSum, while the abstractive baseline, BART-Oracle does significantly better. The SP-SEARCH results are also lower than in CNN/DM because of the highly abstractive nature of the dataset and the relative difficulty in emulating reference summaries.

F.2 RQ2 RESULTS: EVALUATION OF SUMMARIZATION PROGRAM GENERATION MODELS

Next, for RQ2, we train a Joint SP generation model that generates the program directly from the document. In Table 11, we compare our model with BART and PEGASUS and with a random SP generation baseline. While our model outperforms the random SP baseline, there is significant room for improvement relative to state-of-the-art methods, as indicated by the 16-point gap in R-2 compared to the SP-SEARCH oracle.

G EXAMPLES OF SUMMARIZATION PROGRAMS

[Figure content (SP-SEARCH example, missing eighth-grader): Document sentences describe a Pennsylvania community searching for an eighth-grade student missing since Wednesday (D2), hundreds of volunteers searching on foot and online (D1), and the parents of Cayman Naib, 13, communicating through the Facebook group "Find Cayman" (D3). Fusion operations produce S1' "Volunteers are searching for missing eighth-grader." and S2' "Cayman Naib, 13, has been missing since Wednesday." Gold Summary: "Cayman Naib, 13, hasn't been heard from since Wednesday. Police, family, volunteers search for eighth-grader." SP-Search Summary: "Cayman Naib, 13, has been missing since Wednesday. Volunteers are searching for missing eighth-grader. Hundreds of volunteers are searching for missing eighth-grader."]

H SIMULATABILITY INTERFACE OF SUMMARIZATION PROGRAMS

In Figure 10, we show the interface for our simulatability study, in which annotators are asked to write Summarization Programs for model-generated summaries.



Footnotes.
- Supporting code available at https://github.com/swarnaHub/SummarizationPrograms.
- 'Neural module faithfulness' refers to whether the modules perform their expected operations. This is different from 'summary faithfulness', which evaluates whether a summary contains only factual claims from the source document; we refer to the latter as 'summary factuality'.
- Paraphrase model available at https://huggingface.co/tuner007/pegasus_paraphrase.
- We conducted initial experiments showing that optimizing for R-L leads to more consistent improvements across all ROUGE measures. See Appendix C for details.
- Due to the expensive nature of SP-SEARCH, we experiment with a random subset of the validation set.
- SP-SEARCH can ignore some of the top-4 sentences if including them does not lead to higher ROUGE scores.
- We use a non-parametric bootstrap test (Efron & Tibshirani, 1994) for significance testing.
- We use the HuggingFace implementation of ROUGE at https://github.com/huggingface/datasets/blob/main/metrics/rouge/rouge.py.
- We do not experiment with the Extract-and-Build model on XSum because extractive models do not perform well on XSum.



Figure 1: Example of a Summarization Program showing the generative process of two summary sentences (marked with labels S1' and S2' in yellow) from three document sentences (marked with labels D1, D2, and D3 in blue) using compression, paraphrase, and fusion neural modules. Edges are directed from leaves to roots, and the intermediate generations are labeled I1 and I2.

RQ1 RESULTS: CAN SP-SEARCH REPRESENT THE SUMMARIZATION PROCESS?

Our first set of experiments addresses our first RQ (as discussed in §4): how well do the Summarization Programs identified by SP-SEARCH represent human-written summaries? We consider two variants of SP-SEARCH: (1) SP-SEARCH Top-1, in which each module generates only one output, and (2) SP-SEARCH, in which each module generates Top-5 outputs (via beam search) and the one with the best R-L is chosen.


Figure 3: Example of a Summarization Program identified by SP-SEARCH. The summary identified by SP-SEARCH matches closely with the gold summary. [Figure content, Figures 3-4: D1 "Real Madrid fell to a lacklustre 1-0 defeat at the hands of Athletic Bilbao Saturday, potentially handing the La Liga advantage to arch rival Barcelona." is compressed to I1 "Real Madrid fell to a defeat at the hands of Athletic Bilbao." and paraphrased to "Real Madrid lost to Athletic Bilbao."; another branch yields I2 "Aritz Aduriz headed the ball in at the end of the first half to give Bilbao the victory."; a further document sentence reads "Bayern Munich continued its seemingly inexorable march to a third consecutive Bundesliga title with a come from behind 3-1 victory away to Hannover." Gold Summary: "Real Madrid slump to defeat against Athletic Bilbao. Solitary goal from Aritz Aduriz enough to give the Basques victory. Bayern Munich continue Bundesliga domination." SP-Search Summary: "Real Madrid lost to Athletic Bilbao. Aritz Aduriz headed to give Bilbao the victory. Bayern Munich continued its march to a Bundesliga title."]

Figure 4: Example of a Summarization Program identified by SP-SEARCH. The summary identified by SP-SEARCH matches closely with the gold summary.

Figure 10: Interface for simulating Summarization Programs given a source document and a model summary. Annotators construct trees for all summary sentences by creating one edge at a time.

Summarization Program Generation Models. We propose two initial models for our task. The SPs are encoded as strings so that they can be generated by a seq2seq model. We first label each sentence in the document with a unique identifier like '<D1>', '<D2>', etc. The Summarization Program is then represented using a nested bracketing structure composed of the modules and the sentence identifiers, as shown in Fig. 2. For our first model, the Extract-and-Build Summarization Program Generation Model, we hypothesize that a document can contain a large number of sentences but only a small fraction of those are typically useful for generating a summary. Given a training corpus of samples (D
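The nested bracketing encoding above can be parsed and executed bottom-up with a short recursive routine. This is an illustrative sketch, not the paper's implementation: the '<Di>' token convention follows the description above, while `tokenize`, `parse`, `execute`, and the stub module functions are assumed names.

```python
import re

def tokenize(program: str):
    """Split an SP string into module names, parentheses, and <Di> identifiers."""
    return re.findall(r"<D\d+>|\(|\)|[a-z]+", program)

def parse(tokens):
    """Recursive-descent parse of one tree into nested tuples:
    '<D1>' stays a string; modules become (name, [children])."""
    tok = tokens.pop(0)
    if tok.startswith("<D"):
        return tok
    module = tok
    assert tokens.pop(0) == "("
    children = []
    while tokens[0] != ")":
        children.append(parse(tokens))
    tokens.pop(0)  # consume ")"
    return (module, children)

def execute(tree, doc_sentences, modules):
    """Run a parsed tree bottom-up; `modules` maps a module name to a
    function over a list of input sentences (stand-ins for neural modules)."""
    if isinstance(tree, str):
        idx = int(tree[2:-1]) - 1         # '<D3>' -> doc_sentences[2]
        return doc_sentences[idx]
    name, children = tree
    inputs = [execute(c, doc_sentences, modules) for c in children]
    return modules[name](inputs)
```

Executing each tree of a generated SP in order, and concatenating the root outputs, yields the final summary.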

RQ1 - ROUGE scores for SP-SEARCH summaries and oracle extractive and abstractive baselines. The final column compares the factuality of the summaries with respect to the gold summaries using the QuestEval factuality metric.

RQ1 - Neural Module Faithfulness evaluation of SPs. Entry (i,j) shows how often a module 'i' in a row performs the operation 'j' in a column. The final column shows the fraction of non-factual outputs.

Hence, we also evaluate the faithfulness of the three modules through a human study (with 2 expert annotators with knowledge of NLP) over 20 SPs. These constitute a total of 107 samples, including 54 fusion, 25 paraphrase, and 28 compression operations. Table 2 shows the results (with more details in Appendix E). The diagonal entries in the table demonstrate how often a module performs its intended behavior, and the high values of 0.8-0.98 suggest that our modules are highly faithful. Interestingly, the values in the last column demonstrate that the compression and paraphrase modules almost never generate non-factual outputs, while the fusion module is more prone to doing so (about 20% of cases), indicating room for improvement in the fusion module.

6.3 RQ2 RESULTS: EVALUATION OF SUMMARIZATION PROGRAM GENERATION MODELS

Our next set of experiments addresses our second RQ (as discussed in §5), evaluating Summarization Program generation models in terms of their summary generation capabilities.


The details of our SP-SEARCH algorithm are shown in Algorithm 1. The hyperparameters for RQ1 and RQ2 are discussed in Appendix C and Appendix D respectively. The CNN/DailyMail and XSum datasets are also publicly available at https://huggingface.co/datasets/cnn_dailymail and https://huggingface.co/datasets/xsum respectively.


Table 5 shows the top 10 tree structures in our corpus.

Analysis of ROUGE scores (R1/R2/RL) and average search time (in seconds) for SP-SEARCH on 1000 validation samples of CNN/DM under different search hyperparameters. k = number of extracted document sentences; Q = maximum queue size; D = decoding strategy for the modules; H = maximum tree height; T = average search time in seconds, computed on a single RTX 2080 Ti GPU. Beam(5) refers to beam search with beam size 5, taking the best of the top-5 generations from each module.

Comparison of ROUGE scores for the summaries generated by SP-SEARCH based on the optimization metric. Optimizing for R-L leads to more consistent results across all ROUGE metrics.

Comparison of ROUGE scores for the summaries generated by SP-SEARCH based on the metric used for choosing the initial sentences (IS). The optimization metric is set to R-L for all.

RQ1 - Neural Module Faithfulness evaluation of Summarization Programs. Each entry (i,j) shows how often a module 'i' in a row performs the operation 'j' in a column. The final column shows the fraction of non-factual outputs.

Table 9 shows the results. The diagonal entries in the table demonstrate how often a module performs its intended behavior, and the high values of 0.8-0.98 suggest that our modules are highly faithful. In about 20% of the cases, a fusion operation may not involve any fusion. We also observe that some form of compression frequently happens as part of the paraphrase and fusion modules, as shown in the first column of the table; similarly, the fusion outputs tend to involve some compression and paraphrasing. Interestingly, the values in the last column of the table demonstrate that the compression and paraphrase modules almost never generate non-factual outputs, while the fusion module is more prone to doing so (about 20% of cases), indicating room for improvement in the fusion module.

RQ1 - Comparison of ROUGE scores for the SP-SEARCH summaries with different oracle extractive and abstractive baselines on the XSum dataset.
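For concreteness, the per-module fractions in such a faithfulness table can be computed from the binary annotations as follows. The annotation format here is a hypothetical illustration of the study design, not the authors' actual data.

```python
from collections import defaultdict

def faithfulness_matrix(annotations, module_names):
    """Each annotation is (intended_module, set_of_observed_operations).
    Entry (i, j) = fraction of module i's outputs in which operation j was
    judged to occur; the diagonal measures intended behavior."""
    counts = defaultdict(int)   # (intended, observed_op) -> count
    totals = defaultdict(int)   # intended -> number of annotated outputs
    for intended, observed in annotations:
        totals[intended] += 1
        for op in observed:
            counts[(intended, op)] += 1
    return {
        i: {j: counts[(i, j)] / totals[i] if totals[i] else 0.0
            for j in module_names}
        for i in module_names
    }
```

Averaging two annotators' matrices (and reporting their variance) gives numbers of the kind shown in Table 9.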

Figures 3, 4, 5, and 6 show examples of Summarization Programs identified by SP-SEARCH for human-written summaries. Figures 7, 8, and 9 show examples of Summarization Programs and corresponding summaries generated by our Extract-and-Build SP generation model. For compactness, we show the Summarization Programs by merging the common nodes between the trees.

RQ2 - Comparison of our Joint SP generation model with state-of-the-art abstractive summarization models on the XSum test set.

ACKNOWLEDGEMENTS

We thank the reviewers for their valuable feedback. We also thank Archiki Prasad and Prateek Yadav for their help with neural module faithfulness annotations, and David Wan for help with factuality evaluation. This work was supported by NSF-CAREER Award 1846185, NSF-AI Engage Institute DRL-2112635, Google Ph.D. Fellowship, and Bloomberg Data Science Ph.D. Fellowship.

[Figure content (SP-SEARCH example, Norman Lee): Document sentences describe Norman Lee, an artist for DC and Marvel comics, going missing while snorkeling with his wife off the eastern coast of Grand Cayman (D1, per CNN affiliate WCVB); a search now being called a recovery mission (D2); and strong currents hindering the search, which lasted until Friday evening (D3, per Cayman 27). Paraphrase and fusion steps yield S1' "Comic book artist Norman Lee went missing in the Cayman Islands on Monday." and S2' "The search lasted until Friday evening." Gold Summary: "Comic book artist Norman Lee went missing in the Cayman Islands on Thursday. Authorities called off search on Friday evening."]

[Figure content (SP-SEARCH example, Colosseum): Document sentences describe two American women arrested for carving their initials into a wall with a coin inside Rome's Colosseum (D1), a 42-year-old man apprehended after a guard saw him carve the letter "K" into a section of brickwork (D2), the women snapping a selfie with their initials (D3), and nudity-related incidents at Cambodia's Angkor Archeological Park (D4/S3'). Gold Summary: "Two American women arrested for carving initials into a Colosseum wall. Meanwhile, Egypt investigating Russian pornography film reportedly shot at Great Pyramids. Cambodia's Angkor Archeological Park experienced a string of nudity-related incidents this year." SP-Search Summary: "Two American women were arrested for carving their initials into a wall. Two women and a 42-year-old man have reportedly been arrested at Colosseum. Cambodia's Angkor Archeological Park experienced its own string of nudity-related incidents this year." The first two summary sentences have overlapping information, originating from the independence assumption in the generative process of each summary sentence.]

[Figure content (model-generated SP example, Flight 9525): Document sentences describe the French prosecutor insisting he was not aware of any video footage from on board Germanwings Flight 9525 (D1), Marseille prosecutor Brice Robin telling CNN that "so far no videos were used in the crash investigation" (D2), and Paris Match and Bild reporting that a video was recovered from a phone at the wreckage site (D3). Predicted Summary: "Prosecutor Brice Robin says he is not aware of any video footage of the crash of Flight 9525. The video was recovered from a phone at the crash site, according to Paris Match." Gold Summary: "Marseille prosecutor says 'so far no videos were used in the crash investigation' despite media reports. Journalists at Bild and Paris Match are 'very confident' the video clip is real, an editor says. Andreas Lubitz had informed his Lufthansa training school of an episode of severe depression, airline says."]

[Figure content (model-generated SP example, police drones): Document sentences describe miniature aircraft to be fitted with a camera and pepper spray, each drone costing between $9,560 and $19,300 (D2), and the drones having been tested in controlled conditions. Predicted Summary: "Police in India are using drones to spray crowds with an aerial solution to crowds. Drones cost between $9,560 and $19,300 and have been tested in controlled conditions." Gold Summary: "Police in Lucknow, northern India, have bought four drones to help control crowds. The unmanned aerial vehicles are being fitted with cameras and pepper spray to subdue angry protesters. Some Indians have questioned why police are resorting to 'authoritarian and forceful methods'."]

[Figure content (model-generated SP example, Manning): Document sentences describe Manning joining Twitter, suffering from gender dysphoria (D2/S2'), and saying in August 2013 that she wanted to transition from male to female after being sentenced to 35 years in prison. Predicted Summary: "Manning appears to have joined Twitter this week. Manning said she suffers from gender dysphoria. In August 2013, Manning said she wanted to transition from male to female after being sentenced to 35 years in prison." Gold Summary: "Manning is serving a 35-year sentence for leaking thousands of classified documents. She says she will be using a voice phone to dictate her tweets."]

