CORDIAL: COARSE-TO-FINE ABSTRACTIVE DIALOGUE SUMMARIZATION WITH CONTROLLABLE GRANULARITY

Abstract

Dialogue summarization is challenging due to its multi-speaker standpoints, casual spoken language style, and limited labeled data. In this paper, we propose CORDIAL, aiming to improve abstractive dialogue summarization quality and, at the same time, enable granularity controllability. We propose 1) a coarse-to-fine generation strategy that generates a summary draft followed by a final summary. The summary draft, which provides weakly-supervised signals, comprises pseudo-labeled interrogative pronoun categories and noisy key phrases extracted with a constituency parser. 2) A simple strategy to control the granularity of the final summary. CORDIAL can automatically determine or control the number of generated summary sentences for a given dialogue by predicting and highlighting different text spans from the source text. Our model achieves state-of-the-art performance on the largest dialogue summarization corpus, SAMSum. We conduct a comprehensive case study and show competitive human evaluation results and controllability compared to human-annotated summaries.

1. INTRODUCTION

Text summarization tasks aim to distill the most critical information in a text to produce an abridged version. In particular, abstractive summarization, as opposed to extractive summarization, requires neural generative models with a high level of semantic understanding, as the output words do not necessarily appear in the source text. It is more challenging but offers much greater flexibility than extractive summarization models (Zhang et al., 2018). Abstractive dialogue summarization has been studied on the AMI meeting corpus (McCowan et al., 2005). The size and quality of labeled data are bottlenecks, as collecting summaries is costly and the judgements made when creating them are inherently subjective. The AMI corpus has only 141 summaries, and the largest dialogue summarization dataset, SAMSum (Gliwa et al., 2019), has only about 5% as many training samples as the commonly-used text summarization dataset CNN/DailyMail (Hermann et al., 2015). In addition to (and perhaps due to) the shortage of labeled data, dialogue summarization has not received much attention despite the prevalence of dialogues (text messages, emails, social media) and the vast application potential of dialogue summarization systems. Significant research efforts have been focused on summarization of single-speaker documents such as news (Hermann et al., 2015; Nallapati et al., 2016; See et al., 2017) or scientific publications (Qazvinian & Radev, 2008; Nikolov et al., 2018). However, summarizing a dialogue presents a unique set of challenges. A conversation always involves multiple speakers with different points of view, and its natural language style is very different from a standard writing style. For example, conversational data contains more abbreviations and typos. Information is also more scattered across a dialogue, compared to articles, where the title or the first few sentences usually contain the most salient information.
Recently, the ability to control text summarization in the news domain has been gradually attracting more attention (Fan et al., 2018; Liu et al., 2018), with work focusing on learning length embeddings to control summary length. However, the length information is only added during the decoding stage, making the encoding stage less informed and the overall conditional generation implicit. Saito et al. (2020) instead first explicitly extract a "prototype" text span of the desired length and then
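To make the length-embedding approach discussed above concrete, the following is a minimal NumPy sketch (not CORDIAL's method, and all dimensions and names are illustrative assumptions): a learned embedding for the desired length bucket is added to every decoder input token, so the length signal enters only at the decoding stage, which is exactly the limitation the passage points out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions -- assumptions for illustration only.
VOCAB, D_MODEL, N_LEN_BINS = 100, 16, 4

token_emb = rng.normal(size=(VOCAB, D_MODEL))        # ordinary token embeddings
length_emb = rng.normal(size=(N_LEN_BINS, D_MODEL))  # one vector per length bucket

def decoder_inputs(token_ids, length_bin):
    """Add the desired-length embedding to each decoder input token.

    Because the length signal is injected only here, the encoder never
    sees it, and the conditioning on length remains implicit.
    """
    return token_emb[token_ids] + length_emb[length_bin]

x = decoder_inputs(np.array([1, 5, 7]), length_bin=2)
print(x.shape)  # (3, 16)
```

At inference time, choosing a different `length_bin` shifts every decoder input by a different learned vector, which is how such models steer the decoder toward shorter or longer outputs.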

