TOWARDS DATA DISTILLATION FOR END-TO-END SPOKEN CONVERSATIONAL QUESTION ANSWERING

Abstract

In spoken question answering, QA systems are designed to answer questions from contiguous text spans within the related speech transcripts. However, the most natural way that humans seek or test their knowledge is via conversation. We therefore propose a new Spoken Conversational Question Answering task (SCQA), aiming to enable QA systems to model complex dialogue flows given speech utterances and text corpora. In this task, our main objective is to build a QA system that handles conversational questions in both spoken and text form, and to explore the plausibility of providing more cues from spoken documents to systems during information gathering. To this end, instead of adopting highly noisy, automatically generated speech transcripts alone, we propose a novel unified data distillation approach, DDNet, which directly fuses audio and text features to reduce the misalignment between automatic speech recognition hypotheses and the reference transcriptions. In addition, to evaluate the capacity of QA systems in dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 120k question-answer pairs. Experiments demonstrate that our proposed method achieves superior performance in spoken conversational question answering.

1. INTRODUCTION

Conversational Machine Reading Comprehension (CMRC) has been studied extensively over the past few years within the natural language processing (NLP) communities (Zhu et al., 2018; Liu et al., 2019; Yang et al., 2019). Different from traditional MRC tasks, CMRC aims to enable models to learn representations of the context paragraph and multi-turn dialogues. Existing methods for conversational question answering (QA) tasks (Huang et al., 2018a; Devlin et al., 2018; Xu et al., 2019; Gong et al., 2020) have achieved superior performance on several benchmark datasets, such as QuAC (Choi et al., 2018) and CoQA (Elgohary et al., 2018). However, few studies have investigated CMRC over both spoken content and text documents. To incorporate spoken content into machine comprehension, only a few public datasets evaluate the effectiveness of models in spoken question answering (SQA) scenarios. TOEFL listening comprehension (Tseng et al., 2016) is one related corpus for this task, drawn from an English test designed to evaluate the language proficiency of non-native speakers. However, its multiple-choice question answering setting and limited scale make it unsuitable for training robust SCQA models. The other two spoken question answering datasets are Spoken-SQuAD (Li et al., 2018) and ODSQA (Lee et al., 2018), respectively. However, in these datasets there is usually no connection between successive questions and answers within the same spoken passage. More importantly, the most common way people seek or test their knowledge is via human conversations, which capture and maintain common ground in the spoken and text context of the dialogue flow. Many real-world applications, such as voice assistants and chatbots, relate to SCQA tasks. In recent years, neural network based methods have achieved promising progress in the speech processing domain.
Most existing works first select a feature extractor (Gao et al., 2019), and then feed the resulting feature embeddings into a state-of-the-art learning framework, as in single-turn spoken language processing tasks such as speech retrieval (Lee et al., 2015; Fan-Jiang et al., 2020; Karakos et al., 2020), translation (Bérard et al., 2016; Serdyuk et al., 2018; Di Gangi et al., 2020; Tu et al., 2020) and recognition (Zhang et al., 2017; Zhou et al., 2018; Bruguier et al., 2019; Siriwardhana et al., 2020). However, simply adopting existing methods for SCQA tasks raises several challenges. First, transforming speech signals into ASR transcriptions inevitably introduces ASR errors (see Table 2). Previous work (Lee et al., 2019) shows that directly feeding ASR output into the downstream modules usually causes significant performance loss, especially in SQA tasks. Second, the speech corresponds to a multi-turn conversation (e.g., lectures, interviews, meetings), so the discourse structure exhibits more complex correlations between questions and answers than a monologue does. Third, additional information, such as audio recordings, contains potentially valuable cues in spoken form, and many QA systems may leverage this kind of orality to generate better representations. Fourth, existing QA models are tailored to a specific (text) domain; for our SCQA tasks, it is crucial to guide the system to learn this kind of orality in documents. In this work, we propose a new spoken conversational question answering task, SCQA, and introduce Spoken-CoQA, a spoken conversational question answering dataset to evaluate whether QA systems can tackle question answering over noisy speech transcripts and text documents. We compare Spoken-CoQA with existing SQA datasets (see Table 1). Unlike existing SQA datasets, Spoken-CoQA is a multi-turn conversational SQA dataset, which is more challenging than single-turn benchmarks.
First, every question in the Spoken-CoQA dataset depends on the conversation history, which makes it difficult for a machine to parse. Second, errors from ASR modules also degrade a machine's contextual understanding of the context paragraph. To mitigate the effects of speech recognition errors, we then present a novel knowledge distillation (KD) method for spoken conversational question answering tasks. Our key intuition is that speech utterances and text contents share a dual nature, and we can exploit this property to learn correspondences between the two forms. We distill this knowledge into the student model, guiding the student to overcome the bottleneck of noisy ASR outputs and boost performance. Empirical results show that our proposed DDNet achieves remarkable performance gains in SCQA tasks. To the best of our knowledge, this is the first work on spoken conversational machine reading comprehension tasks.
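The teacher-student distillation idea above can be sketched with the standard temperature-scaled KD objective of Hinton et al. (2015): a teacher trained on clean reference transcripts provides soft targets that regularize a student trained on noisy ASR transcripts. This is a minimal illustrative sketch, not the paper's actual DDNet objective; the function names and the hyperparameters `T` (temperature) and `alpha` (soft/hard mixing weight) are our own assumptions.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T yields softer distributions.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix the hard-label cross-entropy with a soft KL term that pulls the
    student (fed noisy ASR transcripts) toward the teacher (fed clean
    reference transcripts)."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    # KL(teacher || student), scaled by T^2 as in the standard KD recipe.
    soft = (p_teacher * (np.log(p_teacher) - log_p_student)).sum(axis=-1).mean() * T * T
    # Standard cross-entropy against the gold answer-span labels.
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * soft + (1.0 - alpha) * hard
```

When the student matches the teacher exactly, the soft term vanishes and only the hard cross-entropy remains; `alpha` trades off how strongly the clean-speech teacher guides the noisy-transcript student.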



Figure 1: An illustration of flow diagram for spoken conversational question answering tasks with an example from our proposed Spoken-CoQA dataset.

Table 1: Comparison of Spoken-CoQA with existing spoken question answering datasets.

