MODELLING HIERARCHICAL STRUCTURE BETWEEN DIALOGUE POLICY AND NATURAL LANGUAGE GENERATOR WITH OPTION FRAMEWORK FOR TASK-ORIENTED DIALOGUE SYSTEM

Abstract

Designing task-oriented dialogue systems is a challenging research topic, since such a system must not only generate utterances that fulfill user requests but also guarantee their comprehensibility. Many previous works trained end-to-end (E2E) models with supervised learning (SL); however, the bias in annotated system utterances remains a bottleneck. Reinforcement learning (RL) addresses this problem by using non-differentiable evaluation metrics (e.g., the success rate) as rewards. Nonetheless, existing RL-based works have shown that the comprehensibility of generated system utterances can be corrupted while improving performance on fulfilling user requests. In this work, we (1) propose modelling the hierarchical structure between the dialogue policy and the natural language generator (NLG) with the option framework, called HDNO, where a latent dialogue act is applied to avoid designing specific dialogue act representations; (2) train HDNO via hierarchical reinforcement learning (HRL), and suggest asynchronous updates between the dialogue policy and the NLG during training to theoretically guarantee their convergence to a local maximizer; and (3) propose using a discriminator modelled with language models as an additional reward to further improve comprehensibility. We test HDNO on MultiWoz 2.0 and MultiWoz 2.1, two datasets of multi-domain dialogues, in comparison with a word-level E2E model trained with RL, LaRL, and HDSA, showing improvements on both automatic evaluation metrics and human evaluation. Finally, we demonstrate the semantic meanings of latent dialogue acts to show the explainability of HDNO.

1. INTRODUCTION

Designing a task-oriented dialogue system has been a popular and challenging research topic in recent decades. In contrast to an open-domain dialogue system (Ritter et al., 2011), it aims to help people complete real-life tasks through dialogues without human service (e.g., booking tickets) (Young, 2006). In a task-oriented dialogue task, each dialogue is defined by a goal that includes user requests (i.e., represented as a set of keywords known as slot values). The conventional task-oriented dialogue system is comprised of 4 modules (see Appendix 3.1), each of which was traditionally implemented with handcrafted rules (Chen et al., 2017). Given user utterances, the system responds in turn to fulfill the requests by mentioning the corresponding slot values. Recently, several works have focused on training a task-oriented dialogue system in an end-to-end (E2E) fashion (Bordes et al., 2016; Wen et al., 2017) to generalize to dialogues outside the corpora. To train an E2E model via supervised learning (SL), generated system utterances are forced to fit the oracle responses collected from human-to-human conversations (Budzianowski et al., 2017a). These oracle responses contain human faults and are thus inaccurate, which leads to biased SL. On the other hand, although the goal itself is perfectly clear, the success rate criterion that evaluates goal completion is non-differentiable and cannot be used as a loss for SL.
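To illustrate why the success rate can serve as an RL reward but not as an SL loss, consider the following minimal sketch. It is not the method of this paper; the `success_rate` and `policy_gradient_loss` functions, and all values shown, are hypothetical. The point is that the success metric is computed on decoded token strings, so no gradient flows through it, yet a REINFORCE-style objective can still use it as a scalar weight on the differentiable log-likelihood of a sampled utterance.

```python
import math

# Hypothetical success metric: 1.0 if every requested slot value
# appears in the generated response, else 0.0. Since it operates on
# decoded strings, it is non-differentiable w.r.t. model parameters
# and cannot be used directly as an SL loss.
def success_rate(generated_tokens, requested_slots):
    return float(all(slot in generated_tokens for slot in requested_slots))

# REINFORCE-style objective: the non-differentiable reward merely
# scales the (differentiable) log-likelihood of the sampled
# utterance, so gradients can flow through the token probabilities.
def policy_gradient_loss(token_probs, reward):
    log_likelihood = sum(math.log(p) for p in token_probs)
    return -reward * log_likelihood

# Toy example: the response mentions both requested slot values,
# so the dialogue counts as successful.
reward = success_rate(["book", "hotel", "tuesday"], ["hotel", "tuesday"])
loss = policy_gradient_loss([0.9, 0.8, 0.7], reward)
```

In practice the reward would be assigned per dialogue (often only at the final turn) and combined with a baseline to reduce variance, but the structural point is the same: the success signal enters the objective only as a scalar coefficient.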

