TOWARDS DATA DISTILLATION FOR END-TO-END SPOKEN CONVERSATIONAL QUESTION ANSWERING

Abstract

In spoken question answering, QA systems are designed to answer questions from contiguous text spans within the associated speech transcripts. However, the most natural way for humans to seek or test knowledge is through conversation. We therefore propose a new Spoken Conversational Question Answering (SCQA) task, which aims to enable QA systems to model complex dialogue flows given speech utterances and text corpora. In this task, our main objective is to build a QA system that handles conversational questions in both spoken and text form, and to explore whether spoken documents can provide additional cues for information gathering. To this end, instead of relying on automatically generated speech transcripts, which are highly noisy, we propose a novel unified data distillation approach, DDNet, which directly fuses audio and text features to reduce the misalignment between automatic speech recognition hypotheses and the reference transcriptions. In addition, to evaluate the capacity of QA systems in dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 120k question-answer pairs. Experiments demonstrate that our proposed method achieves superior performance on spoken conversational question answering.

1. INTRODUCTION

Conversational Machine Reading Comprehension (CMRC) has been studied extensively over the past few years within the natural language processing (NLP) community (Zhu et al., 2018; Liu et al., 2019; Yang et al., 2019). Different from traditional MRC tasks, CMRC aims to enable models to learn representations of the context paragraph and multi-turn dialogues. Existing approaches to conversational question answering (QA) tasks (Huang et al., 2018a; Devlin et al., 2018; Xu et al., 2019; Gong et al., 2020) have achieved superior performance on several benchmark datasets, such as QuAC (Choi et al., 2018) and CoQA (Elgohary et al., 2018). However, few studies have investigated CMRC over both spoken content and text documents. Moreover, few public datasets exist for evaluating models in spoken question answering (SQA) scenarios. TOEFL listening comprehension (Tseng et al., 2016), drawn from an English test designed to evaluate the language proficiency of non-native speakers, is one related corpus, but its multiple-choice question answering setting and limited scale make it unsuitable for training robust SCQA models. The other two spoken question answering datasets are Spoken-SQuAD (Li et al., 2018) and ODSQA (Lee et al., 2018). However, in these datasets there is usually no connection among the questions and answers within the same spoken passage. More importantly, the most common way people seek or test knowledge is through human conversation, which captures and maintains common ground across the spoken and textual context of the dialogue flow. Many real-world applications relate to SCQA tasks, such as voice assistants and chatbots. In recent years, neural network based methods have achieved promising progress in the speech processing domain.
Most existing works first select a feature extractor (Gao et al., 2019), and then feed the resulting feature embeddings into a state-of-the-art learning framework, as in single-turn spoken language processing tasks such as speech retrieval (Lee et al., 2015; Fan-Jiang et al., 2020; Karakos et al., 2020), translation (Bérard et al., 2016; Serdyuk et al., 2018; Di Gangi et al., 2020; Tu et al., 2020), and recognition (Zhang et al., 2017; Zhou et al., 2018; Bruguier et al., 2019; Siriwardhana
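The two-stage pipeline described above (a feature extractor followed by a downstream learner) can be sketched as follows. This is a minimal, self-contained illustration with toy stand-ins: the random-projection "features", the tiny vocabulary, and the linear scorer are all hypothetical placeholders for illustration, not components of DDNet or of any cited system.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_audio_features(waveform, frame_len=400, hop=160, n_feats=40):
    """Toy frame-level feature extractor: per-frame log-energy projected
    into an n_feats-dim embedding (a stand-in for, e.g., log-Mel
    filterbank features from a real acoustic front end)."""
    frames = [waveform[i:i + frame_len]
              for i in range(0, len(waveform) - frame_len + 1, hop)]
    energies = np.array([np.log(np.sum(f ** 2) + 1e-8) for f in frames])
    proj = rng.standard_normal((1, n_feats))   # fixed random projection
    return energies[:, None] @ proj            # shape: (num_frames, n_feats)

def embed_text(tokens, vocab, dim=40):
    """Toy text embedder: one fixed random vector per vocabulary word."""
    table = {w: rng.standard_normal(dim) for w in vocab}
    return np.stack([table[t] for t in tokens])  # shape: (num_tokens, dim)

# Stage 1: extract features from each modality and pool over time.
waveform = rng.standard_normal(16000)                    # 1 s of fake 16 kHz audio
audio_vec = extract_audio_features(waveform).mean(axis=0)   # (40,)
question = ["what", "is", "spoken", "qa"]
text_vec = embed_text(question, vocab=question).mean(axis=0)  # (40,)

# Stage 2: fuse the embeddings and hand them to a downstream model
# (here just a random linear scorer standing in for the QA framework).
fused = np.concatenate([audio_vec, text_vec])            # (80,) audio-text fusion
W = rng.standard_normal((2, fused.shape[0]))
logits = W @ fused                                       # (2,)
```

In real systems the pooling and concatenation would typically be replaced by learned attention or cross-modal fusion layers, but the overall extract-then-learn structure is the same.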

