MODELLING HIERARCHICAL STRUCTURE BETWEEN DIALOGUE POLICY AND NATURAL LANGUAGE GENERATOR WITH OPTION FRAMEWORK FOR TASK-ORIENTED DIALOGUE SYSTEM

Abstract

Designing task-oriented dialogue systems is a challenging research topic, since such systems must not only generate utterances that fulfill user requests but also guarantee their comprehensibility. Many previous works trained end-to-end (E2E) models with supervised learning (SL); however, the bias in annotated system utterances remains a bottleneck. Reinforcement learning (RL) deals with this problem by using non-differentiable evaluation metrics (e.g., the success rate) as rewards. Nonetheless, existing works with RL showed that the comprehensibility of generated system utterances could be corrupted while improving the performance on fulfilling user requests. In our work, we (1) propose modelling the hierarchical structure between the dialogue policy and the natural language generator (NLG) with the option framework, called HDNO, where a latent dialogue act is applied to avoid designing specific dialogue act representations; (2) train HDNO via hierarchical reinforcement learning (HRL), and suggest asynchronous updates between the dialogue policy and the NLG during training to theoretically guarantee their convergence to a local maximizer; and (3) propose using a discriminator modelled with language models as an additional reward to further improve comprehensibility. We test HDNO on MultiWOZ 2.0 and MultiWOZ 2.1, two datasets of multi-domain dialogues, in comparison with a word-level E2E model trained with RL, LaRL and HDSA, showing improvements on the performance evaluated by automatic metrics and human evaluation. Finally, we demonstrate the semantic meanings of latent dialogue acts to show the explainability of HDNO.

1. INTRODUCTION

Designing a task-oriented dialogue system has been a popular and challenging research topic in recent decades. In contrast to an open-domain dialogue system (Ritter et al., 2011), it aims to help people complete real-life tasks (e.g., booking tickets) through dialogues without human service (Young, 2006). In a task-oriented dialogue task, each dialogue is defined with a goal that includes the user's requests (i.e., represented as a set of keywords known as slot values). The conventional task-oriented dialogue system is comprised of 4 modules (see Appendix 3.1), each of which used to be implemented with handcrafted rules (Chen et al., 2017). Given user utterances, it responds in turn to fulfill the requests by mentioning the corresponding slot values. Recently, several works focused on training a task-oriented dialogue system in an end-to-end (E2E) fashion (Bordes et al., 2016; Wen et al., 2017) to generalize to dialogues outside the corpora. To train an E2E model via supervised learning (SL), generated system utterances are forced to fit the oracle responses collected from human-to-human conversations (Budzianowski et al., 2017a). The oracle responses contain human faults and are thus inaccurate, which leads to biased SL. On the other hand, although the goal is absolutely clear, the success rate that evaluates goal completion is non-differentiable and cannot be used as a loss for SL. To tackle this problem, reinforcement learning (RL) is applied to train a task-oriented dialogue system (Williams and Young, 2007; Zhao and Eskénazi, 2016; Peng et al., 2018; Zhao et al., 2019). Specifically, some works merely optimized the dialogue policy while other modules, e.g., the natural language generator (NLG), were fixed (Peng et al., 2018; Zhao et al., 2019; Su et al., 2018).
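To illustrate why the success criterion serves as an RL reward rather than an SL loss, the following toy sketch (function name, goal format, and responses are all hypothetical, not the paper's implementation) computes a discrete success signal by checking whether every requested slot value appears in the system's responses; such a 0/1 check is non-differentiable with respect to model parameters:

```python
def dialogue_success(goal_slots, system_utterances):
    """Return 1.0 if every requested slot value is mentioned, else 0.0."""
    mentioned = " ".join(system_utterances).lower()
    return float(all(v.lower() in mentioned for v in goal_slots))

# A toy goal and a pair of toy system responses.
goal = ["saturday", "18:45", "cambridge"]
responses = [
    "i booked a train to cambridge for saturday .",
    "it departs at 18:45 , reference number is xyz .",
]
print(dialogue_success(goal, responses))  # 1.0 here; 0.0 if any slot were missing
```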
In contrast, other works extended the dialogue policy to the NLG and applied RL to the entire E2E dialogue system, regarding each generated word in a response as an action (Zhao and Eskénazi, 2016). Although previous works enhanced the performance on fulfilling user requests, the comprehensibility of generated system utterances was corrupted (Peng et al., 2018; Zhao et al., 2019; Tang et al., 2018a). The possible reasons are: (1) solely optimizing the dialogue policy can easily cause a biased improvement on fulfilling user requests while ignoring the comprehensibility of generated utterances (see Section 3.1); (2) the state space and action space (represented as a vocabulary) in the E2E fashion are so large that learning to generate comprehensible utterances becomes difficult (Lewis et al., 2017); and (3) a dialogue system in the E2E fashion may lack explainability in its decision procedure. In our work, we propose to model the hierarchical structure between the dialogue policy and the NLG with the option framework, i.e., a hierarchical reinforcement learning (HRL) framework (Sutton et al., 1999), called HDNO (see Section 4.1), so that the high-level temporal abstraction provides explainability in the decision procedure. Specifically, the dialogue policy works as a high-level policy over dialogue acts (i.e., options) and the NLG works as a low-level policy over generated words (i.e., primitive actions). Therefore, these two modules are decoupled during optimization, with a smaller state space for the NLG and a smaller action space for the dialogue policy (see Appendix F). To reduce the effort of designing dialogue act representations, we represent a dialogue act as latent factors. During training, we suggest asynchronous updates between the dialogue policy and the NLG to theoretically guarantee their convergence to a local maximizer (see Section 4.2).
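The two-level decision process above can be made concrete with a toy NumPy sketch. Here a high-level dialogue policy maps a dialogue state to a discrete latent dialogue act (an option), and a low-level NLG emits words (primitive actions) conditioned on that act until an end-of-sentence token; all names, dimensions, and the simple linear parameterization are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["<eos>", "the", "hotel", "is", "booked", "sorry", "no", "train"]
N_LATENT = 4    # size of the discrete latent dialogue-act space (hypothetical)
STATE_DIM = 6   # toy dialogue-state encoding dimension (hypothetical)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class DialoguePolicy:
    """High-level policy: maps a dialogue state to a latent dialogue act (option)."""
    def __init__(self):
        self.W = rng.normal(size=(N_LATENT, STATE_DIM))
    def act(self, state):
        return rng.choice(N_LATENT, p=softmax(self.W @ state))

class NLG:
    """Low-level policy: emits words (primitive actions) conditioned on the option."""
    def __init__(self):
        self.W = rng.normal(size=(len(VOCAB), N_LATENT))
    def generate(self, latent, max_len=10):
        onehot = np.eye(N_LATENT)[latent]
        words = []
        for _ in range(max_len):
            w = VOCAB[rng.choice(len(VOCAB), p=softmax(self.W @ onehot))]
            if w == "<eos>":
                break
            words.append(w)
        return words

state = rng.normal(size=STATE_DIM)   # a toy encoded dialogue context
policy, nlg = DialoguePolicy(), NLG()
z = policy.act(state)                # option: latent dialogue act
utterance = nlg.generate(z)          # primitive actions: words until <eos>
print(z, utterance)
```

The decoupling the text describes shows up directly here: the NLG conditions only on the compact latent act rather than the full dialogue state, and the dialogue policy chooses among `N_LATENT` options rather than over the whole vocabulary.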
Finally, we propose using a discriminator modelled with language models (Yang et al., 2018) as an additional reward to further improve comprehensibility (see Section 5). We evaluate HDNO on two datasets of multi-domain dialogues, MultiWOZ 2.0 (Budzianowski et al., 2018) and MultiWOZ 2.1 (Eric et al., 2019), compared with the word-level E2E model (Budzianowski et al., 2018) trained with RL, LaRL (Zhao et al., 2019) and HDSA (Chen et al., 2019). The experiments show that HDNO achieves the best overall performance under both automatic metrics (see Section 6.2.1) and human evaluation (see Section B.1). Furthermore, we study the latent dialogue acts to show the explainability of HDNO (see Section 6.4).
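The idea of a language-model-based comprehensibility reward can be illustrated with a minimal bigram language model fitted on a toy corpus (the corpus, smoothing, and reward definition below are hedged stand-ins for the paper's discriminator, chosen only to show the shape of the signal): fluent word orderings receive a higher average log-probability than scrambled ones, and that score can be added to the task reward.

```python
import math
from collections import Counter

# Toy corpus standing in for human system utterances (hypothetical data).
corpus = [
    "the hotel is booked".split(),
    "the train is booked".split(),
    "sorry no hotel is available".split(),
]

# Fit a bigram language model with add-one smoothing.
bigrams, unigrams, vocab = Counter(), Counter(), set()
for sent in corpus:
    toks = ["<s>"] + sent
    vocab.update(toks)
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks[:-1], toks[1:]))

def lm_reward(utterance):
    """Average log-probability under the bigram LM: a comprehensibility reward."""
    toks = ["<s>"] + utterance
    V = len(vocab)
    logp = sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
        for a, b in zip(toks[:-1], toks[1:])
    )
    return logp / len(utterance)

fluent = "the hotel is booked".split()
garbled = "booked the is hotel".split()
print(lm_reward(fluent), lm_reward(garbled))  # fluent scores higher
```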

2. RELATED WORK

Firstly, we go through previous works studying dialogue act representations for task-oriented dialogue systems. Some previous works optimized the dialogue policy with reinforcement learning (RL), making decisions by selecting from handcrafted dialogue acts represented as an ontology (Peng et al., 2018; Young et al., 2007; Walker, 2000; He et al., 2018). Such a representation is easily understood by human beings, but the dialogue act space it can express is limited. To deal with this problem, some researchers investigated training dialogue acts by fitting oracle dialogue acts represented as sequences (Chen et al., 2019; Zhang et al., 2019; Lei et al., 2018). This method generalizes dialogue acts; however, designing a good representation demands considerable effort. To handle this problem, learning a latent representation of dialogue acts was attempted (Zhao et al., 2019; Yarats and Lewis, 2018). In our work, similar to Zhao et al. (2019), we learn latent dialogue acts without any dialogue act labels. From this view, our work can be regarded as an extension of LaRL (Zhao et al., 2019) in its learning strategy.

Then, we review previous works modelling a dialogue system with a hierarchical structure. In the field of task-oriented dialogue systems, many works model dialogue acts or the state space with a hierarchical structure to tackle the decision problem for dialogues with multi-domain tasks (Cuayáhuitl et al., 2009; Peng et al., 2017; Chen et al., 2019; Tang et al., 2018b; Budzianowski et al., 2017b). Distinguished from these works, our work views the relationship between the dialogue policy and the natural language generator (NLG) as a natural hierarchical structure and models it with the option framework (Sutton et al., 1999). In the field of open-domain dialogue systems, a similar hierarchical structure was proposed (Serban et al., 2017; Saleh et al., 2019), but with a different motivation from ours. In this sense, the two fields could potentially be unified.

