CORDIAL: COARSE-TO-FINE ABSTRACTIVE DIALOGUE SUMMARIZATION WITH CONTROLLABLE GRANULARITY

Abstract

Dialogue summarization is challenging due to multi-speaker standpoints, casual spoken language style, and limited labeled data. In this paper, we propose CORDIAL, which aims to improve abstractive dialogue summarization quality and, at the same time, enable granularity controllability. We propose 1) a coarse-to-fine generation strategy that generates a summary draft followed by a final summary. The summary draft, which provides weakly-supervised signals, comprises pseudo-labeled interrogative pronoun categories and noisy key phrases extracted with a constituency parser. 2) A simple strategy to control the granularity of the final summary. CORDIAL can automatically determine or control the number of generated summary sentences for a given dialogue by predicting and highlighting different text spans from the source text. Our model achieves state-of-the-art performance on the largest dialogue summarization corpus, SAMSum. We conduct a comprehensive case study and show competitive human evaluation results and controllability comparable to human-annotated summaries.

1. INTRODUCTION

Text summarization tasks aim to distill the most critical information in a text to produce an abridged version. In particular, abstractive summarization, as opposed to extractive, requires neural generative models with a high level of semantic understanding, since the output words do not necessarily appear in the source text. It is more challenging, but offers far more flexibility than any extractive summarization model (Zhang et al., 2018). Abstractive dialogue summarization has been discussed in the literature since the AMI meeting corpus (McCowan et al., 2005). The size and quality of labeled data are bottlenecks, as collecting summaries is costly and the judgements made when creating them are inherently subjective. The AMI corpus has only 141 summaries, and the largest dialogue summarization dataset, SAMSum (Gliwa et al., 2019), has only 5% as many training samples as the commonly used text summarization dataset CNN/DailyMail (Hermann et al., 2015). In addition to (and perhaps due to) the shortage of labeled data, dialogue summarization has not received much attention, despite the prevalence of dialogues (text messages, emails, social media) and the vast application potential of dialogue summarization systems. Significant research effort has focused on summarizing single-speaker documents such as news (Hermann et al., 2015; Nallapati et al., 2016; See et al., 2017) or scientific publications (Qazvinian & Radev, 2008; Nikolov et al., 2018). However, summarizing a dialogue presents a unique set of challenges. A conversation always involves multiple speakers with different points of view, and its natural language style is very different from standard written text; for example, conversational data contains more abbreviations and typos. Information is also more scattered across a dialogue, compared to articles, where the title or the first few sentences usually contain the most salient information.
Recently, the ability to control text summarization in the news domain has been gradually attracting more attention (Fan et al., 2018; Liu et al., 2018), with work focusing on learning length embeddings to control summary lengths. However, the length information is only added during the decoding stage, making the encoding stage less informed and the overall conditional generation implicit. Saito et al. (2020) instead first explicitly extract a "prototype" text span of the desired length and then paraphrase it into the output news summary. However, this retrieve-and-rewrite process is restricted by the extraction quality, leaving its performance limited by the capabilities of extractive solutions.

Figure 1: An input and output example for our proposed solution. Given the dialogue on the left-hand side, we first construct a summary draft with intent and key phrase information for coarse-to-fine generation. Then, we split the dialogue into several pieces by special tokens for model controllability and interpretability.

In this paper, we propose CORDIAL, a coarse-to-fine abstractive dialogue summarization model equipped with granularity controllability. Unlike previous methods (Goo & Chen, 2018; Pan et al., 2018), which heavily rely on explicit intent annotations in datasets, we label each dialogue turn with a pre-defined interrogative pronoun category using a weakly-supervised labeling approach. The automatically labeled user intent, together with its corresponding extracted key phrases, provides weak supervision during summary generation. In addition, we propose a length-controllable generation method specifically for dialogue summarization. We match each summary sentence "linearly" to its corresponding dialogue context and clip it by highlighting tokens. We then train our model to predict where to clip and to generate only one sentence for each highlighted dialogue span.
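The summary-draft construction described above can be illustrated with a minimal sketch. Note that this is a simplified, hypothetical illustration: the intent categories, keyword cues, and the phrase heuristic below are placeholders of our own choosing, and the crude chunking stands in for the constituency-parser-based key phrase extraction actually used.

```python
# Hypothetical sketch of weakly-supervised summary-draft construction.
# Category names and cue words are illustrative assumptions, not the
# paper's actual label set; a real system would use a constituency
# parser for key phrase extraction rather than the heuristic below.

INTENT_KEYWORDS = {
    "WHY": ["why", "because"],
    "WHAT": ["what"],
    "WHERE": ["where"],
    "WHEN": ["when", "tomorrow", "tonight"],
    "CONFIRM": ["are you", "do you", "is it"],
}

def label_intent(turn: str) -> str:
    """Assign a pseudo interrogative-pronoun category to one dialogue turn."""
    lowered = turn.lower()
    for intent, cues in INTENT_KEYWORDS.items():
        if any(cue in lowered for cue in cues):
            return intent
    return "ABSTAIN"  # fallback for turns that match no cue

def extract_key_phrase(turn: str, max_words: int = 4) -> str:
    """Crude stand-in for parser-based key phrase extraction:
    keep the longest run of non-stopword tokens."""
    stop = {"i", "you", "the", "a", "an", "is", "are", "do", "to", "it"}
    words = turn.replace("?", "").replace(".", "").split()
    best, cur = [], []
    for w in words:
        if w.lower() in stop:
            best, cur = (cur, []) if len(cur) > len(best) else (best, [])
        else:
            cur.append(w)
    if len(cur) > len(best):
        best = cur
    return " ".join(best[:max_words])

def build_summary_draft(dialogue: list[str]) -> str:
    """Concatenate (intent, key phrase) pairs into a coarse summary draft,
    which then serves as a weak supervision signal for the final summary."""
    parts = [f"[{label_intent(t)}] {extract_key_phrase(t)}" for t in dialogue]
    return " | ".join(parts)
```

The draft is deliberately noisy; its role is only to expose intent and salient-phrase structure to the generator before the fine (final) summary is produced.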
This strategy enables CORDIAL to generate summaries at different granularities by highlighting arbitrary numbers of text spans in a dialogue, and it also makes our model more interpretable. We build our model on top of BART-xsum (Lewis et al., 2019), which is first pre-trained with unsupervised denoising objectives and further fine-tuned on the news summarization corpus XSUM (Narayan et al., 2018). In the experimental results, we show that CORDIAL achieves state-of-the-art dialogue summarization performance on several automatic metrics. The main contributions of this work are: 1) we propose a coarse-to-fine strategy that uses an artificial summary draft as weak supervision; 2) we introduce a text-span-based conditional generation approach to control the granularity of generated dialogue summaries without requiring human-written summaries at different levels of detail; and 3) we conduct a comprehensive case study and human evaluation to show that CORDIAL produces consistent and informative summaries, especially in the controllable setting, which existing models either cannot handle or handle poorly.
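The span-highlighting mechanism for granularity control can be sketched as follows. This is a minimal illustration under assumptions: the special token names (`<hl>`, `</hl>`) and the one-sentence-per-span decoding loop are our own placeholders, and `generate_one_sentence` stands in for a call to the fine-tuned seq2seq model, which is not reproduced here.

```python
# Hypothetical illustration of highlight-token granularity control.
# The number of highlighted spans directly controls the number of
# generated summary sentences: one sentence per span.

HL_OPEN, HL_CLOSE = "<hl>", "</hl>"  # assumed special-token names

def highlight(turns, spans):
    """Wrap selected (start, end) turn ranges with highlight tokens.

    turns: list of dialogue-turn strings.
    spans: list of (start, end) index pairs (end exclusive).
    """
    out = []
    for start, end in spans:
        chunk = " ".join(turns[start:end])
        out.append(f"{HL_OPEN} {chunk} {HL_CLOSE}")
    return " ".join(out)

def summarize_controlled(turns, spans, generate_one_sentence):
    """Generate one summary sentence per highlighted span.

    `generate_one_sentence` is a stand-in for the fine-tuned
    seq2seq model (e.g., BART) constrained to emit a single sentence."""
    return " ".join(
        generate_one_sentence(highlight(turns, [span])) for span in spans
    )
```

Under this scheme, passing three spans yields a three-sentence summary and passing a single span covering the whole dialogue yields a one-sentence summary, which is how granularity becomes a user-controllable knob at inference time.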

2. METHODOLOGY

In this section, we first briefly cover the background of generative pre-trained language models. Then, we introduce our proposed summary draft construction and summary controllability in detail. The proposed solution is generalizable to all generative language models. We define the conversational history input as D = {X_1, X_2, ..., X_N}, where each X_i is a sequence of words, N is the



Our code is released at www.anonymous.com

