MODELLING HIERARCHICAL STRUCTURE BETWEEN DIALOGUE POLICY AND NATURAL LANGUAGE GENERATOR WITH OPTION FRAMEWORK FOR TASK-ORIENTED DIALOGUE SYSTEM

Abstract

Designing task-oriented dialogue systems is a challenging research topic, since a system must not only generate utterances that fulfill user requests but also guarantee their comprehensibility. Many previous works trained end-to-end (E2E) models with supervised learning (SL); however, the bias in annotated system utterances remains a bottleneck. Reinforcement learning (RL) circumvents this problem by using non-differentiable evaluation metrics (e.g., the success rate) as rewards. Nonetheless, existing works with RL showed that the comprehensibility of generated system utterances can be corrupted while the performance on fulfilling user requests improves. In our work, we (1) propose modelling the hierarchical structure between the dialogue policy and the natural language generator (NLG) with the option framework, called HDNO, where a latent dialogue act is applied to avoid designing specific dialogue act representations; (2) train HDNO via hierarchical reinforcement learning (HRL), and suggest asynchronous updates between the dialogue policy and the NLG during training to theoretically guarantee their convergence to a local maximizer; and (3) propose using a discriminator modelled with language models as an additional reward to further improve comprehensibility. We test HDNO on MultiWOZ 2.0 and MultiWOZ 2.1, two multi-domain dialogue datasets, in comparison with a word-level E2E model trained with RL, LaRL and HDSA, showing improvements under both automatic evaluation metrics and human evaluation. Finally, we demonstrate the semantic meanings of latent dialogue acts to show the explainability of HDNO.

1. INTRODUCTION

Designing a task-oriented dialogue system has been a popular and challenging research topic in recent decades. In contrast to an open-domain dialogue system (Ritter et al., 2011), it aims to help people complete real-life tasks through dialogues without human service (e.g., booking tickets) (Young, 2006). In a task-oriented dialogue task, each dialogue is defined with a goal which includes user requests (i.e., represented as a set of keywords known as slot values). The conventional task-oriented dialogue system is composed of 4 modules (see Appendix 3.1), each of which used to be implemented with handcrafted rules (Chen et al., 2017). Given user utterances, it responds in turn to fulfill the requests by mentioning the corresponding slot values. Recently, several works focused on training a task-oriented dialogue system in end-to-end (E2E) fashion (Bordes et al., 2016; Wen et al., 2017) to generalize to dialogues outside corpora. To train an E2E model via supervised learning (SL), generated system utterances are forced to fit the oracle responses collected from human-to-human conversations (Budzianowski et al., 2017a). The oracle responses contain human faults and are thus inaccurate, which leads to biased SL. On the other hand, although the goal is absolutely clear, the success rate that evaluates goal completion is non-differentiable and cannot be used as a loss for SL. To tackle this problem, reinforcement learning (RL) has been applied to train task-oriented dialogue systems (Williams and Young, 2007; Zhao and Eskénazi, 2016; Peng et al., 2018; Zhao et al., 2019). Specifically, some works merely optimized the dialogue policy while other modules, e.g., the natural language generator (NLG), were fixed (Peng et al., 2018; Zhao et al., 2019; Su et al., 2018).
In contrast, other works extended the dialogue policy to the NLG and applied RL to the entire E2E dialogue system, regarding each generated word in a response as an action (Zhao and Eskénazi, 2016). Although previous works enhanced the performance on fulfilling user requests, the comprehensibility of generated system utterances is corrupted (Peng et al., 2018; Zhao et al., 2019; Tang et al., 2018a). The possible reasons are: (1) solely optimizing the dialogue policy can easily cause a biased improvement on fulfilling user requests while ignoring the comprehensibility of generated utterances (see Section 3.1); (2) the state space and action space (represented as a vocabulary) in E2E fashion are so huge that learning to generate comprehensible utterances becomes difficult (Lewis et al., 2017); and (3) a dialogue system in E2E fashion may lack explainability during the decision procedure. In our work, we propose to model the hierarchical structure between the dialogue policy and the NLG with the option framework, i.e., a hierarchical reinforcement learning (HRL) framework (Sutton et al., 1999), called HDNO (see Section 4.1), so that the high-level temporal abstraction can provide explainability during the decision procedure. Specifically, the dialogue policy works as a high-level policy over dialogue acts (i.e. options) and the NLG works as a low-level policy over generated words (i.e. primitive actions). Therefore, these two modules are decoupled during optimization, with a smaller state space for the NLG and a smaller action space for the dialogue policy (see Appendix F). To reduce the effort of designing dialogue act representations, we represent a dialogue act as latent factors. During training, we suggest asynchronous updates between the dialogue policy and the NLG to theoretically guarantee their convergence to a local maximizer (see Section 4.2).
Finally, we propose using a discriminator modelled with language models (Yang et al., 2018) as an additional reward to further improve comprehensibility (see Section 5). We evaluate HDNO on two multi-domain dialogue datasets: MultiWOZ 2.0 (Budzianowski et al., 2018) and MultiWOZ 2.1 (Eric et al., 2019), compared with the word-level E2E model (Budzianowski et al., 2018) trained with RL, LaRL (Zhao et al., 2019) and HDSA (Chen et al., 2019). The experiments show that HDNO achieves the best total performance evaluated with automatic metrics (see Section 6.2.1) and human evaluation (see Section B.1). Furthermore, we study the latent dialogue acts and show the explainability of HDNO (see Section 6.4).

2. RELATED WORK

Firstly, we go through previous works studying dialogue act representations for task-oriented dialogue systems. Some previous works optimized the dialogue policy with reinforcement learning (RL), making decisions by selecting from handcrafted dialogue acts represented as an ontology (Peng et al., 2018; Young et al., 2007; Walker, 2000; He et al., 2018). Such a representation is easily understood by human beings, but the dialogue act space it can represent is limited. To deal with this problem, some researchers investigated training dialogue acts by fitting oracle dialogue acts represented as sequences (Chen et al., 2019; Zhang et al., 2019; Lei et al., 2018). This representation generalizes dialogue acts; however, designing a good representation demands effort. To handle this problem, learning a latent representation of dialogue acts was attempted (Zhao et al., 2019; Yarats and Lewis, 2018). In our work, similar to (Zhao et al., 2019), we learn latent dialogue acts without any dialogue act labels. From this view, our work can be regarded as an extension of LaRL (Zhao et al., 2019) in terms of the learning strategy. Then, we review previous works modelling a dialogue system with a hierarchical structure. In the field of task-oriented dialogue systems, many works model dialogue acts or the state space with a hierarchical structure to tackle the decision problem for dialogues with multi-domain tasks (Cuayáhuitl et al., 2009; Peng et al., 2017; Chen et al., 2019; Tang et al., 2018b; Budzianowski et al., 2017b). Distinguished from these works, our work views the relationship between the dialogue policy and the natural language generator (NLG) as a natural hierarchical structure and models it with the option framework (Sutton et al., 1999). In the field of open-domain dialogue systems, a similar hierarchical structure was proposed (Serban et al., 2017; Saleh et al., 2019) but with a different motivation from ours.
In this sense, it is possible to unify these two fields. Finally, among the works training with hierarchical reinforcement learning (HRL), some set up an extrinsic reward for the high-level policy and an intrinsic reward for the low-level policy respectively to encourage convergence (Peng et al., 2017; Budzianowski et al., 2017b). In our work, we train both the high-level policy and the low-level policy with identical rewards to guarantee the consistency between the two policies (Sutton et al., 1999). On the other hand, in the field of open-domain dialogue systems, Saleh et al. (2019) represented the jointly generated utterances over a turn as a low-level action such that both the high-level policy and the low-level policy operated on identical time scales. Besides, its low-level policy gradients flowed through the high-level policy during training, which degraded the hierarchical policies to an E2E policy with a word-level action space. In our work, (1) the dialogue policy and the NLG are decoupled during optimization and no gradients are allowed to flow between them; (2) these two policies are asynchronously updated to theoretically guarantee convergence to a local maximizer; and (3) each generated word is regarded as a low-level action.

3.1. TASK-ORIENTED DIALOGUE SYSTEM

Brief Introduction: A task-oriented dialogue system aims to help fulfill a user's task through conversation in turns. In general, each dialogue is modelled with an ontology called a goal, which includes inform slots and request slots. The traditional modular dialogue system consists of a natural language understanding (NLU) module, a dialogue state tracker (DST), a dialogue policy and a natural language generator (NLG). A dialogue system needs to infer inform slots from user utterances and transform them into a dialogue state, which is completed by the NLU and the DST (Chen et al., 2017). In this work, we focus on optimizing the dialogue policy and the NLG, leveraging oracle dialogue states and database search results to produce dialogue acts and then responses (which should include as many request slots as possible) in turns. The dialogue policy is optimized by modelling it as a Markov decision process (MDP) (Williams and Young, 2007).

Existing Challenges: We identify the main challenges of task-oriented dialogue systems: (1) Dialogues with a single domain (i.e. completing one task in a dialogue) have been broadly studied; however, handling a dialogue with multiple domains is more challenging and needs more study (Budzianowski et al., 2018). (2) If the syntactic structure of generated system utterances is ignored (i.e. comprehensibility is lost), the mission of task-oriented dialogues is simplified to generating the corresponding labels (i.e., slots) for user utterances. Several existing algorithms already reach high scores on request slot acquisition but low scores on the comprehensibility of generated system utterances (Zhao et al., 2019; Mehri et al., 2019), so the simplified task has been well addressed. Conversely, if only comprehensibility is focused on, the score on request slot acquisition can be drastically affected (Chen et al., 2019; Hosseini-Asl et al., 2020). In this work, we investigate the trade-off between comprehensibility and request slot acquisition. (3) Designing and annotating a dialogue act structure demands effort (Budzianowski et al., 2018); therefore, learning a meaningful latent dialogue act becomes a new challenge (Zhao et al., 2019).

3.2. HIERARCHICAL REINFORCEMENT LEARNING WITH OPTION FRAMEWORK

Hierarchical reinforcement learning (HRL) is a variant of reinforcement learning (RL) which extends the decision problem to coarser grains with multiple hierarchies (Sutton et al., 1999; Dayan and Hinton, 1993; Parr and Russell, 1998; Dietterich, 1998). Amongst several HRL methods, the option framework (Sutton et al., 1999) is a temporal abstraction for RL, where each option (i.e. a high-level action) lasts for a number of steps through primitive actions (i.e. low-level actions). From the view of decision problems, an MDP defined with a fixed set of options naturally forms a semi-MDP (SMDP). Formally, an option o = ⟨I, β, π⟩ is composed of three components: an initiation set I ⊆ S (where S is a state space), a termination function β(s_t) → [0, 1] and an intra-option policy π(a_t | s_t) → [0, 1]. The reward over primitive actions is defined as r_t ∈ R, identical to vanilla RL. An option o_t is available at s_t ∈ I. At each s_t, π is used to decide a low-level action a_t until the option is stochastically terminated by β. Similar to RL on flat actions, the probability transition function over options is defined as p(s' | s_t, o_t), and the cumulative reward collected while an option executes is abbreviated as g_t for simplicity. Given a set of options o ∈ O, the optimization problem over options is defined as max_o E_o[ Σ_{k∈m} γ^{k−t} g_k ], where m = (t, t', ...) is a sequence containing the time step of each event that will be experienced from some time step t onwards. To automatically fit complicated circumstances, we may also need to discover options dynamically during learning. The intra-option policy gradient theorem and the termination policy gradient theorem (Bacon et al., 2017) provide the basis for applying a policy gradient method to option discovery.
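As a toy illustration of the control flow above (illustrative option names, vocabularies and policies; not the paper's model), an option picked by a high-level policy runs its intra-option policy until termination, here absorbed into the policy via an end-of-sequence symbol:

```python
import random

random.seed(0)

# Illustrative options and per-option vocabularies; "<eos>" plays the
# role of the termination symbol absorbed into the intra-option policy.
VOCAB = {"greet": ["hello", "there", "<eos>"],
         "inform": ["the", "train", "leaves", "<eos>"]}

def high_level_policy(state):
    # phi(o|s): pick an option (uniform here, just for the sketch)
    return random.choice(list(VOCAB))

def intra_option_policy(option, state):
    # pi_o(a|s): emit one primitive action (a word)
    return random.choice(VOCAB[option])

def smdp_step(state, max_steps=50):
    """One SMDP-level step: the chosen option lasts for several
    primitive steps until it terminates."""
    option = high_level_policy(state)
    words = []
    for _ in range(max_steps):
        w = intra_option_policy(option, state)
        words.append(w)
        if w == "<eos>":
            break
    return option, words

option, words = smdp_step(state="s0")
```

The key point is the two time scales: one high-level decision spans many primitive steps, which is exactly the dialogue-act-to-words relationship HDNO exploits.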

4.1. MODEL FORMULATION

In this section, we present a view on modelling the Hierarchical structure between the Dialogue policy and the Natural language generator (NLG) with the Option framework (Sutton et al., 1999), called HDNO. Specifically, a dialogue act in HDNO is seen as an option whereas each generated word from the NLG is a primitive action. Accordingly, the dialogue policy and the NLG become the policy over options (i.e. the high-level policy) and the intra-option policy (i.e. the low-level policy) respectively. Distinguished from a conventional modular system, we additionally give a context to the NLG to satisfy the conditions of the option framework. Moreover, since the primitive action space (i.e. a vocabulary) comprises a termination symbol, the NLG can take over the responsibility of termination. For this reason, the termination policy in the original option framework is absorbed into the intra-option policy. The formal definition of HDNO is given in Definition 1.

Definition 1. A dialogue policy (i.e. a policy over options) is defined as φ : S × O → [0, 1], and a natural language generator (NLG) (i.e. an intra-option policy) is defined as π_o : I_o × V → [0, 1], where V is a vocabulary (including a termination symbol).

According to the MDP theorem over options (Sutton et al., 1999) and the intra-option policy gradient theorem (Bacon et al., 2017), we can naturally apply REINFORCE (Williams, 1992) to learn both φ and π. Therefore, following Section 3 and Definition 1, we can write the policy gradients in our case as ∇J(φ) = E_φ[ Σ_{k∈m} γ^{k−t} g_k ∇ ln φ(o_t | s_t) ] and ∇J(π_{o_t}) = E_{π_{o_t}}[ Σ_{i=t}^{T} γ^{i−t} r_i ∇ ln π_{o_t}(w_t | s_t) ], where we assume that the length of all generated system utterances is T and m = (t, t', ...) is a sequence containing the time steps of events that appear in the future for an arbitrary o_t = ⟨I_{o_t}, π_{o_t}⟩ ∈ O.
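Both gradient estimators weight log-probability gradients by Monte Carlo returns; the returns themselves follow the standard backward recursion, sketched here with an illustrative reward sequence:

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = sum_{i>=t} gamma^(i-t) * r_i for every t, as
    used to weight the log-probability terms in both REINFORCE
    estimators above."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

# e.g. a sparse success reward given only at the final step of a turn
returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.9)
```

The high-level policy applies the same recursion, only over option-level events and option-level rewards g_k rather than per-word rewards.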

4.2. ASYNCHRONOUSLY UPDATING DIALOGUE POLICY AND NLG DURING LEARNING

As mentioned in Section 4.1, the dialogue policy and the NLG are written as φ(o|s) and π_o(w|s) respectively. However, since o = ⟨I_o, π_o⟩, we can assume that when the dialogue policy makes a decision, it has to consider the current performance of the overall set of low-level policies for the NLG, denoted as π = {π_o}_{o∈O}. For this reason, we temporarily rewrite the dialogue policy as φ(o|π, s) for convenience. The aim is to find the best policies (i.e. maximizers) such that the value is maximized, i.e., max_{φ,π} v(s | φ(o|π, s)), ∀s ∈ S. If these two policies are updated synchronously, the composite state (i.e. ⟨π, s⟩) of φ(o|π, s) becomes inconsistent before and after each update. Therefore, the value does not always improve monotonically during learning, which affects the convergence of both policies (see Proposition 1). To address this problem, we suggest updating the dialogue policy and the NLG asynchronously during learning to theoretically guarantee the convergence of these policies to a local maximizer (see Proposition 2). The proofs of these two propositions are left to the appendices due to limited space.

Proposition 1. Following the model of Definition 1, if φ(o|π, s) and π are synchronously updated, the value does not always monotonically improve and the policies may never converge to a local maximizer.

Assumption 1. (1) The reward function is bounded. (2) With a sufficient number of samples, the Monte Carlo estimation of the value of any state is accurate enough.

Proposition 2. Following the model of Definition 1 and Assumption 1, if φ(o|π, s) and π are asynchronously updated, the value can improve monotonically during learning and the policies can finally converge to a local maximizer.
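The asynchronous schedule can be sketched as a strict alternation in which only one of the two policies changes during any window of k updates (a minimal sketch of our reading of Proposition 2's schedule; the update callables are placeholders):

```python
def train_asynchronously(update_policy, update_nlg, num_iters, k):
    """Alternate between the two update routines every k steps so
    that only one policy moves at a time; the other stays fixed,
    keeping the composite state of phi(o|pi, s) consistent."""
    log = []
    for step in range(num_iters):
        if (step // k) % 2 == 0:
            update_policy()
            log.append("policy")
        else:
            update_nlg()
            log.append("nlg")
    return log

log = train_asynchronously(lambda: None, lambda: None, num_iters=6, k=2)
```

With k = 2 and 6 iterations, the schedule is policy, policy, nlg, nlg, policy, policy: the NLG is frozen while the dialogue policy improves, and vice versa.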

4.3. IMPLEMENTATION OF HDNO

In the implementation of HDNO, we represent a dialogue act as latent factors (Zhao et al., 2019), which reduces the effort of designing a suitable representation. In detail, a dialogue act z (i.e. an indicator representing an option) is sampled from a dialogue policy represented as an isotropic multivariate Gaussian distribution such that φ(z|c; λ) = N(z | µ(c), Σ(c)), where c is a context, and µ(c) ∈ R^K and Σ(c) ∈ R^{K×K} are parameterized with λ. Moreover, the NLG, i.e. π(w_t | z, c_t; ν), is represented as a categorical distribution over words parameterized with ν, conditioned on an option z and a context c_t which involves preceding generated utterances in addition to the context c that activates the option z. The full picture of this architecture is shown in Figure 1. Furthermore, φ(z|c; λ) (i.e. the dialogue policy) is implemented as a one-layer linear model that outputs the mean and variance of a multivariate Gaussian distribution. The input of φ(z|c; λ) is a context vector. In detail, the last user utterances are firstly encoded with a bidirectional RNN (Schuster and Paliwal, 1997) with gated recurrent unit (GRU) cells (Chung et al., 2014) and a global attention mechanism (Bahdanau et al., 2015). Then, an oracle dialogue state and an oracle database search result are concatenated to the encoding vector of user utterances to form a context vector. The utterance encoder is only trained during pretraining and fixed as a context extractor during HRL, so that the context space is reduced. On the other hand, π(w_t | z, c_t; ν) (i.e. the NLG) is implemented as a recurrent neural network (RNN) with long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997), where the initial state is the concatenation of a context vector and a dialogue act sampled from the dialogue policy. The information in the initial state is assumed to be propagated to the hidden states at future time steps, so we only feed it in the initial state.
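A minimal sketch of the latent dialogue act sampling: a linear map from the context vector to the mean and log-variance of an isotropic Gaussian, from which z is drawn (toy dimensions and random weights stand in for the learned one-layer model):

```python
import numpy as np

rng = np.random.default_rng(0)

K, CTX_DIM = 4, 8                         # toy latent and context sizes
W_mu = rng.normal(size=(K, CTX_DIM))      # stand-ins for the learned
W_logvar = rng.normal(size=(K, CTX_DIM))  # one-layer linear model

def dialogue_policy(context):
    """phi(z|c): sample a latent dialogue act from N(mu(c), Sigma(c))
    with a diagonal (isotropic-style) covariance."""
    mu = W_mu @ context
    sigma = np.exp(0.5 * (W_logvar @ context))  # per-dim std dev > 0
    z = mu + sigma * rng.normal(size=K)
    return z, mu, sigma

context = rng.normal(size=CTX_DIM)  # encoder output + oracle state, flattened
z, mu, sigma = dialogue_policy(context)
```

In the actual model, the sampled z is then concatenated with the context vector to initialize the LSTM state of the NLG.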

4.4. PRETRAINING WITH BAYESIAN FRAMEWORK

Compared to VHRED (Serban et al., 2017), proposed in the field of open-domain dialogue systems, the latent dialogue act in HDNO is equivalent to the latent variables in VHRED. However, a context in HDNO includes not only user utterances but also the dialogue state and database search result. In this sense, HDNO extends VHRED to the field of task-oriented dialogue systems. As a result, by changing the user utterances in VHRED to the context in HDNO, we can directly formulate a variational lower bound following the Bayesian framework and the model in Section 4.1 such that

max_{λ,ν} E_{z∼φ(z|c;λ)}[ Σ_{t=1}^{T} log π(w_t | z, c_t; ν) ] − β KL[ φ(z|c; λ) || N(z | 0, I) ], (1)

where φ(z|c; λ) is constrained by the KL-divergence (Kullback and Leibler, 1951) from a multivariate standard Gaussian distribution N(z | 0, I). Referring to (Higgins et al., 2017), we additionally add a multiplier β on the KL-divergence term to control the disentanglement of the latent dialogue act z. We use Eq. 1 for pretraining the dialogue policy and the NLG in E2E fashion with oracle system utterances, to roughly allocate the roles of these two modules. Nevertheless, because of the existing human faults in oracle system utterances (see Section 1), we need to further improve the model via HRL.
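The KL term in Eq. 1 has a closed form when the posterior is a diagonal Gaussian; a sketch with toy numbers, where recon_loglik stands in for the decoder's summed log-likelihood:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def lower_bound(recon_loglik, mu, logvar, beta=1.0):
    """The beta-weighted variational lower bound of Eq. 1."""
    return recon_loglik - beta * kl_to_standard_normal(mu, logvar)

# when the posterior equals the prior, the KL penalty vanishes
mu = np.zeros(4)
logvar = np.zeros(4)
bound = lower_bound(-10.0, mu, logvar, beta=0.5)
```

Raising β penalizes deviation from the prior more strongly, which is the disentanglement knob borrowed from (Higgins et al., 2017).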

5. USING DISCRIMINATOR OF LANGUAGE MODELS AS A REWARD

Benefiting from the availability of non-differentiable evaluation metrics in RL, we can directly use the success rate as a reward, denoted by r^succ. However, there exist two potential issues: (1) the success rate is only given at the end of a dialogue and is zero in all other turns, which may cause a sparse reward; and (2) since the success rate is only correlated with the occurrence of request slots in generated system utterances, the improvement in comprehensibility may be weakened. To mitigate the above drawbacks, we propose to leverage a discriminator modelled as language models (see Definition 2) as an additional reward. Specifically, at each time step it evaluates each generated word by log-likelihood to reflect the comprehensibility.

Definition 2. The discriminator D(w_t | w_{t−1}) → [0, 1] is defined as the Markov language model following (Yang et al., 2018). At some time step τ, for an arbitrary option e = ⟨I_e, π_e⟩, w_τ ∼ π_e(·|s_τ) is a sampled word at state s_τ ∈ I_e. The reward of the discriminator for evaluating w_τ is defined as r^disc_τ = log D(w_τ | w_{τ−1}), where w_{τ−1} is the word generated at time step τ − 1.

According to Definition 2, Σ_{t=1}^{T} γ^t r^disc_t consistently grows as the joint log-likelihood Σ_{t=1}^{T} log D(w_t | w_{t−1}) grows; thereby, maximizing E_π[ Σ_{t=1}^{T} γ^t r^disc_t ] is almost equivalent to maximizing E[ Σ_{t=1}^{T} log D(w_t | w_{t−1}) ] when γ is close to 1 and T is not too large, where T denotes the number of time steps in a turn. For this reason, r^disc is suitable for evaluating the comprehensibility of generated system utterances if we presume that the discriminator can well represent human language. Combining the reward of success rate r^succ and the reward of discriminator r^disc, we propose a total reward such that

r^total_t = (1 − α) r^succ_t + α r^disc_t, (2)

where α ∈ [0, 1] is a multiplier controlling the trade-off between these two types of rewards.
In implementation, the discriminator is equipped with the same architecture as that of NLG.
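The reward shape of Eq. 2 can be sketched as follows; the tiny bigram table is only a stand-in for the pretrained language-model discriminator:

```python
import math

# Stand-in bigram probabilities D(w_t | w_{t-1}); unseen pairs get a floor.
BIGRAM = {("the", "train"): 0.6, ("train", "leaves"): 0.5}

def r_disc(prev_word, word, floor=1e-6):
    # per-step discriminator reward: log-likelihood of the generated word
    return math.log(BIGRAM.get((prev_word, word), floor))

def r_total(r_succ, r_disc_t, alpha=0.1):
    # Eq. 2: blend the episodic success reward with the per-step
    # comprehensibility reward
    return (1 - alpha) * r_succ + alpha * r_disc_t

fluent = r_total(1.0, r_disc("the", "train"))
garbled = r_total(1.0, r_disc("train", "the"))
```

Even with an identical success reward, a word sequence the discriminator finds unlikely receives a lower total reward, which is the mechanism that preserves comprehensibility during RL.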

6.1. EXPERIMENTAL SETUPS

Dataset Description: To evaluate the performance of our task-oriented dialogue system, we run experiments on the latest benchmark datasets MultiWOZ 2.0 (Budzianowski et al., 2018) and MultiWOZ 2.1 (Eric et al., 2019). MultiWOZ 2.0 is a large-scale task-oriented dialogue dataset including 10425 dialogues spanning 7 distinct domains, where all dialogues are generated by human-to-human conversations. Each dialogue is defined with a goal for a user, which may consist of 1-5 domains. A dialogue system attempts to fulfill a goal by interacting with a user. For data preprocessing, we follow the same delexicalization method provided by (Budzianowski et al., 2018), also used in other works (Zhao et al., 2019; Chen et al., 2019). On the other hand, MultiWOZ 2.1 is a modified version of MultiWOZ 2.0 which mainly fixes the noisy dialogue state annotations and corrects 146 dialogue utterances. Finally, each of MultiWOZ 2.0 and MultiWOZ 2.1 is split into a training set with 8438 dialogues, a validation set with 1000 dialogues and a test set with 1000 dialogues (Budzianowski et al., 2018; Eric et al., 2019).

Task Description: Since we only concentrate on learning the dialogue policy and the natural language generator (NLG), all experiments are conducted on the dialog-context-to-text generation task proposed in (Budzianowski et al., 2018). This task assumes that a dialogue system has access to the oracle dialogue state and database search result. Given user utterances, a dialogue system attempts to generate appropriate utterances as a response in each turn. To train a dialogue system with hierarchical reinforcement learning (HRL), we follow the setups described in Definition 1. Each dialogue is only evaluated against the goal (e.g. calculating the success rate) at the end of the dialogue, which means that no evaluation is allowed during intermediate turns.
Automatic Evaluation Metrics: Following (Budzianowski et al., 2017a), we leverage three automatic metrics to evaluate generated utterances from a dialogue system: inform rate, success rate and BLEU score. The inform rate measures whether a dialogue system provides appropriate entities (e.g., the name of a restaurant). The success rate shows the ratio of request slots appearing in generated utterances. The BLEU score (Papineni et al., 2002) evaluates the comprehensibility of generated utterances. Finally, we use a popular total score (Zhang et al., 2019; Mehri et al., 2019) defined as 0.5 × (Inform + Success) + BLEU to fairly evaluate the performance of a dialogue system.

Baseline Description: We compare HDNO with other models, i.e., LaRL (Zhao et al., 2019), HDSA (Chen et al., 2019), and a baseline end-to-end model (Budzianowski et al., 2018). All these models leverage oracle dialogue states and database search results, and are introduced as follows:
• The baseline end-to-end model (Budzianowski et al., 2018) directly maps a context to system utterances. Following (Zhao et al., 2019), we train it with RL, where each generated word is treated as an action. For convenience, we name it the word-level E2E model, abbreviated as WE2E.
• LaRL (Zhao et al., 2019) is the first model to represent a dialogue act as latent factors in the field of task-oriented dialogue systems. Specifically, it models a latent dialogue act as categorical variables, each of which is mapped to a continuous embedding vector for learning. During training, it only updates the dialogue policy, where a latent categorical dialogue act for each turn is treated as an action.
• HDSA (Chen et al., 2019) is a model that represents each dialogue act as a hierarchical graph. To fit the oracle dialogue act, a pretrained 12-layer BERT (Devlin et al., 2019) is applied.
Then the predicted dialogue act is transformed to the hierarchical graph structure with a 3-layer self-attention model (Vaswani et al., 2017), called the disentangled self-attention model. This model is trained only with supervised learning (SL).

Experimental Details: For HDNO, we pretrain a model following Eq. 1 and select the best model with the minimum loss on the validation set; the discriminator is pretrained with oracle system utterances. During HRL, we initialize the parameters with the pretrained model and select the best model according to the greatest reward on the validation set. For efficiency, we only use greedy search for decoding in validation. In testing, we apply beam search (Medress et al., 1977) for decoding to obtain a better performance. The beam width is selected through the best validation performance for each model. For simplicity, we only show the performance of the best beam width on the test set. We use stochastic gradient descent (SGD) for HRL and the Adam optimizer (Kingma and Ba, 2015) for pretraining. For the baselines, we train them with their original source codes. The specific details and hyperparameters for training and testing are shown in Appendix B.3.

Notification: Please note that the results of the baselines shown in their original papers could be underestimated, due to the upgrade of the official evaluator this year. For this reason, we re-run these experiments with the original open-source codes and evaluate the performance of all models (including HDNO) via the latest official evaluator.
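For reference, the total score used throughout the experiments is a direct linear combination of the three automatic metrics (a sketch; the inform/success rates are percentages and BLEU is on a 0-100 scale as reported, with hypothetical numbers only to show the weighting):

```python
def total_score(inform, success, bleu):
    """0.5 * (Inform + Success) + BLEU, the combined automatic metric
    used to rank models."""
    return 0.5 * (inform + success) + bleu

score = total_score(inform=90.0, success=80.0, bleu=18.0)
```

The halved weighting on the two task metrics means a one-point gain in BLEU counts as much as a two-point gain in either rate, encoding the trade-off between task completion and comprehensibility.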

6.2.1. AUTOMATIC EVALUATION METRICS

We firstly compare HDNO with the state-of-the-art baselines and the human performance on both datasets via automatic evaluation metrics. As Table 1 shows, HDNO trained with the proposed asynchronous updates between the dialogue policy and the NLG (i.e. HDNO (Async.)) largely outperforms the baselines on the inform rate and total score, while its BLEU score is lower than that of HDSA. Moreover, on the inform rate and success rate, all models trained with RL except HDNO trained with synchronous updates (i.e. HDNO (Sync.)) exceed the model trained with SL (i.e. HDSA). The possible reason is that SL is highly dependent on the oracle system utterances, and humans may commit faults when generating these dialogues, as stated in Section 1. Besides, the poor results of HDNO (Sync.) validate the theoretical analysis in Section 4.2 that synchronous updates between the dialogue policy and the NLG can cause a failure to approach a local optimum, while HDNO (Async.) shows the success of the asynchronous updates proposed in Proposition 2. Furthermore, the results of pretraining give evidence that the improvement actually comes from the proposed algorithm. For conciseness, we write HDNO instead of HDNO (Async.) in the rest of the paper. We also conduct human evaluations for WE2E, LaRL and HDNO, shown in Appendix B.1. We now compare the proposed reward shape in Eq. 2 with a reward only constituted of the success rate and a reward combining the success rate with the BLEU score (i.e. a linear combination similar to the proposed reward shape). As Table 2 shows, in comparison with the other reward shapes, the proposed reward shape (i.e. success + discriminator) performs better at preserving comprehensibility while improving the success rate and inform rate to the maximum. To further study the impact of the discriminator in the total reward, we also run several ablation studies on α (see Section 5). As Table 2 shows, the results oscillate within a small range, which means that the proposed reward shape is not sensitive to the hyperparameter α if it is selected within a reasonable range.

6.4. STUDY ON LATENT DIALOGUE ACT

We now study the semantic meanings of latent dialogue acts and demonstrate the clustering results in Figure 2. Clustering is conducted with the k-means algorithm (Arthur and Vassilvitskii, 2006) on the original dimensions, whereas dimension reduction for visualization is conducted with the t-SNE algorithm (Maaten and Hinton, 2008); we randomly show 3 turns of system utterances for each cluster. To show the explainability of HDNO, which the other baselines do not possess, we show the results for both HDNO and LaRL. Since LaRL tends to generate duplicate latent dialogue acts, its dots in the diagram overlap. Through analyzing the randomly selected system utterances, we find that the clusters of latent dialogue acts of HDNO possess semantic meanings, while no meaningful explanation can be observed for those of LaRL. Next, we briefly describe our findings on the clusters of HDNO. The clusters in blue and green dots are related to the general phrases for goodbye at the end of service; the cluster in orange dots is related to booking trains; the cluster in red dots is related to informing the user of database search results; the cluster in brown dots is related to recommendation; the cluster in pink dots is related to informing of an unsuccessful booking; the cluster in grey is related to informing of a completed booking; and the cluster in yellow is related to requesting more information. Surprisingly, these semantic meanings of latent dialogue acts are highly correlated with those of the oracle handcrafted dialogue acts (see Appendix E) described by Budzianowski et al. (2018). Therefore, learning latent dialogue acts with the option framework (Sutton et al., 1999) may potentially substitute for handcrafting dialogue acts with an ontology, without losing explainability.
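The clustering step of this analysis can be reproduced in miniature: k-means on the original latent dimensions, here on synthetic two-dimensional blobs standing in for sampled latent dialogue acts (the paper applies t-SNE only for the 2-D visualization):

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=20):
    """Plain Lloyd's k-means: assign points to the nearest center,
    then move each center to the mean of its points."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# two synthetic, well-separated blobs of "latent dialogue acts"
blob_a = rng.normal(0.0, 0.2, size=(30, 2))
blob_b = rng.normal(5.0, 0.2, size=(30, 2))
X = np.vstack([blob_a, blob_b])
labels, centers = kmeans(X, k=2)
```

On real latent dialogue acts the clusters are then inspected by sampling system utterances per cluster, as done for Figure 2.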

7. CONCLUSION AND FUTURE WORK

In this paper, we present a view on modelling the hierarchical structure between the dialogue policy and the natural language generator (NLG) with the option framework (Sutton et al., 1999) in a task-oriented dialogue system, and train it with hierarchical reinforcement learning (HRL). Moreover, we suggest asynchronous updates between the dialogue policy and the NLG to theoretically guarantee their convergence to a local maximizer. Finally, we propose using a discriminator modelled as language models (Yang et al., 2018) as a reward to further improve the comprehensibility of generated responses. In future work, we plan to extend this work to optimizing all modules by HRL instead of only the dialogue policy and the NLG, and to study the problem of credit assignment among these modules (Chen et al., 2017) during training. Moreover, thanks to the option framework (Sutton et al., 1999), the latent dialogue act shows explicit semantic meanings, while disentangling the factors of a latent dialogue act (i.e. each latent factor owning a distinct semantic meaning) during HRL is left to be further investigated.

(Continuation of the proof of Proposition 1.) At the next time step t + 1, however, the actual value for any arbitrary state s that we obtain from the last synchronous update is v_{t+1}(s | φ_{q+1}(π_{q+1})), and the scenario

v_{t+1}(s | φ_{q+1}(π_{q+1})) < v_{t+1}(s | φ_q(π_q)) (5)

could happen, which means that the monotonic improvement of the value cannot always hold during learning. Since the value is an evaluation of the policies, the policies could never converge to a local maximizer.

Proof (of Proposition 2). Firstly, in REINFORCE (Williams, 1992), the value is approximated by Monte Carlo estimation.
Following (2) in Assumption 1 and the notation introduced in the proof of Proposition 1 above, if we update φ(π) and π asynchronously every k ∈ N steps, starting from some time step t ∈ N at which both policies have been updated q ∈ N times, we can construct a monotonically increasing sequence of values for any arbitrary state s ∈ S during learning such that

v^t(s | φ_q(π_q)) ≤ v^{t+k}(s | φ_{q+1}(π_q)) ≤ v^{t+2k}(s | φ_{q+1}(π_{q+1})) ≤ v^{t+3k}(s | φ_{q+2}(π_{q+1})) ≤ ... ≤ v^{t+nk}(s | φ_{q+n/2}(π_{q+n/2})), if n is an even number, or v^{t+nk}(s | φ_{q+(n+1)/2}(π_{q+(n-1)/2})), if n is an odd number. (6)

Due to (1) in Assumption 1, the value is bounded and we consider v(s | φ(π)) ∈ R, ∀s ∈ S. According to (1) in Lemma 1, the sequence of values is Fejér monotone with respect to its maximum value. For simplicity, we denote the sequence of values in Eq. 6 as {v_m}_{m ∈ {t+nk | n ∈ N}} and the maximum value as v*. By (2) in Lemma 1, the sequence {||v_m − v*||}_{m ∈ {t+nk | n ∈ N}} converges. Also, since v* is the supremum of this increasing sequence, we can write

||v_m − v*|| − ||v* − v*|| → 0, as m → ∞. (7)

Rearranging the left-hand side of Eq. 7, we obtain

v_m → v*, as m → ∞. (8)

From Eq. 8, we conclude that the sequence of values v(s | φ(π)) finally converges to some local maximum. Since the value is an evaluation of φ(o | π, s) and π(w | s), we can conclude that the asynchronous updates enable these two policies to converge to a local maximizer.
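The asynchronous schedule in this proof can be illustrated on a toy concave objective (not the paper's dialogue model): two scalar parameters stand in for the NLG φ and the dialogue policy π, and only one of them is updated during each phase of k steps, so the value ascends monotonically along the alternating path. The objective, step size and phase length are all illustrative choices.

```python
# Toy illustration of the asynchronous update schedule: alternate phases of
# k gradient-ascent steps on each parameter block, holding the other fixed.

def value(phi, pi):
    # any smooth, bounded (concave) objective works for the illustration
    return -((phi - 1.0) ** 2) - ((pi + 2.0) ** 2)

def grad_phi(phi, pi):
    return -2.0 * (phi - 1.0)

def grad_pi(phi, pi):
    return -2.0 * (pi + 2.0)

phi, pi, lr, k = 5.0, 5.0, 0.1, 10
values = [value(phi, pi)]
for step in range(200):
    if (step // k) % 2 == 0:       # phases of k steps: update phi ...
        phi += lr * grad_phi(phi, pi)
    else:                          # ... then pi, never both at once
        pi += lr * grad_pi(phi, pi)
    values.append(value(phi, pi))

# the value improves monotonically along the asynchronous path
assert all(b >= a - 1e-12 for a, b in zip(values, values[1:]))
```

Because each phase ascends the objective with the other block frozen, the recorded values form exactly the kind of increasing bounded sequence used in Eq. 6.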

B.1 HUMAN EVALUATION

Due to the possible inconsistency between automatic evaluation metrics and human perception, we conducted a human evaluation comparing the quality of generated responses. We provide two criteria for human assessors to evaluate the generated responses: (1) fluency: how fluent the generated responses are (i.e., with no obvious grammar errors or redundant descriptions); and (2) appropriateness: how related the generated responses are to the provided context. For each criterion, a human assessor gives a score ranging from 1 to 5, where 5 indicates the best and 1 the worst performance. We then calculated the mean and variance of the scores for each model. The final results and some example generated responses from the questionnaire are shown in Figure 3. The proposed HDNO performs best in the human evaluation, and in particular outperforms LaRL (Zhao et al., 2019) on appropriateness by a large margin.

Context

[user]: Before booking, i would also like to know the travel time, price, and departure time please. [sys]: There are two trains meeting your needs with the first leaving at 16:00 and the last one leaving at 16:30. Do you want to book one of these? [user]: No, hold off on booking for now. Can you help me find an attraction called Tate Modern?

Generated Response

WE2E: Night Pub is a nightclub located in the center of town. The address is Street 1st. The phone number is 12345678. The entrance fee is 30 pounds. The postcode is P01. Is there anything else i can help you with? LaRL: Tate Modern is a museum in the center. Their phone number is 12345678. HDNO: Tate Modern is a museum in the center. Their phone number is 12345678. Is there anything else i can help you with?

B.2 LEARNING CURVES

In this section, we show the learning curves of HDNO (Async.) (abbreviated as HDNO in the rest of the appendix) and the baselines on both MultiWoz 2.0 and MultiWoz 2.1. To show the generalization performance, we only demonstrate the validation results during training. As seen from Figure 4, compared with the baselines, HDNO preserves the comprehensibility (i.e. BLEU) while improving the success rate and inform rate faster. We only show the results for the initial 16,000 episodes for conciseness of the figure.

B.3 RESULTS WITH BEAM SEARCH

In this section, we show the complete results of HDNO (Async.) and HDNO (Sync.), as well as the pretraining results with beam search, in Tables 3, 5 and 4 respectively, where the beam width is selected from 1, 2 and 5. Apparently, the proposed HRL algorithm gives an enormous improvement in performance in comparison with the pretraining.
Nevertheless, pretraining remains an essential step that cannot be omitted before training with reinforcement learning in the task-oriented dialogue system. Whether HRL can achieve good performance when trained from scratch for the task-oriented dialogue system is left to be investigated.

B.4 COMPLETE BENCHMARK RESULTS ON MULTIWOZ 2.0

In this section, we show the complete state-of-the-art results on MultiWoz 2.0 for the policy optimization task, from the official records. All these results are collected from their original papers. As Table 6 shows, HDNO leads the board on the total performance evaluated by 0.5 × (Inform + Success) + BLEU, where each metric is measured in percentage. However, due to an update of the official evaluator this year, the results marked with * were probably underestimated.

B.5 EXAMPLES OF GENERATED DELEXICALIZED DIALOGUES

In this section, we demonstrate some system utterances generated by the baselines and HDNO. Since most of the dialogues in MultiWoz 2.0 and MultiWoz 2.1 are similar, we only show the results on MultiWoz 2.0. As we can see from Tables 7 and 8, compared with the baseline trained with SL (i.e. HDSA), the performance of HDNO on fulfilling a user's request is actually better; however, the generated utterances of HDSA could be more fluent and comprehensible. In comparison with the other baselines trained with RL, the generated utterances of HDNO are apparently more fluent and comprehensible. In particular, WE2E tends to generate as many slots as possible so as to increase the success rate, regardless of the comprehensibility of the generated utterances. This is a common issue of most RL methods for task-oriented dialogue systems, as we stated in Section 3 in the main part of the paper.

Table 6 (excerpt):

Model                        Inform   Success   BLEU     Total
(Pei et al., 2019b)           85.30     73.30   20.13     99.43
HDSA* (Chen et al., 2019)     82.90     68.90   23.60     99.50
ARDM (Wu et al., 2019)        87.40     72.80   20.60    100.70
DAMD (Zhang et al., 2019)     89.20     77.90   18.60    102.15
SOLOIST (Peng et al., 2020)   89.60     79.30   18.30    102.75
MarCo (Wang et al., 2020)     92.30     78.60   20.02    105.47

C EXTRA EXPERIMENTAL SETUPS

C.1 TRAINING DETAILS

During pretraining, the natural language generator (NLG) is trained by forcing the prediction at each time step to match an oracle word, given the preceding oracle word as input. During hierarchical reinforcement learning (HRL), in contrast, the NLG updates the estimated distribution at each time step by ∇J(π) = E_π[ (Σ_{i=t}^{T} γ^{i−t} r_i) ∇ ln π(w_t | s_t) ], given a word sampled from the preceding predicted distribution. As for sampling a dialogue act from the dialogue policy, we leverage the reparameterization trick during pretraining. On the other hand, the discriminator is first pretrained and then fixed as a reward during HRL. We find it useful to train both the discriminator and HDNO simultaneously during pretraining. Specifically, in addition to training on oracle system utterances, we also use the generated utterances to train the discriminator, which improves its performance in our experiments. The optimization problem is expressed as Eq. 9, where D(·|·, θ) is the discriminator defined in Definition 2 in the main part of the paper, parameterized by θ, and η is a multiplier controlling the contribution of the generated utterances (ŵ_t)_{t=0}^{T} given an initial state. We hypothesize that the generated utterances expand the original small-scale corpus (i.e. the oracle system utterances), so as to improve the generalization of the discriminator. Since we have not yet grasped the intrinsic reason why this works, it is only regarded as a trick for training the discriminator in our work.

max_θ Σ_{t=0}^{T} log D(w_t | w_{t−1}, θ) + η Σ_{t=0}^{T} log D(ŵ_t | ŵ_{t−1}, θ) (9)

During HRL, r_succ and r_disc are respectively normalized by z-score normalization (Patro and Sahu, 2015) to adaptively control their impacts on the total reward. For instance, when either r_succ or r_disc converges around some value, its normalized value will be close to zero and the total reward will be biased towards the other.
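The z-score normalization of the two rewards can be sketched as follows. The batch-statistics estimator and the concrete reward values are illustrative assumptions (the exact estimator is not specified here); the sketch shows the stated effect that a converged reward contributes roughly zero to the total.

```python
# Sketch of z-score normalization of r_succ and r_disc before summing them.
import math

def z_score(rewards):
    """Normalize a batch of rewards to zero mean and unit variance."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    if std == 0.0:                 # converged reward: contributes ~0
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

r_succ = [1.0, 1.0, 1.0, 1.0]     # success reward has converged ...
r_disc = [0.2, 0.9, 0.4, 0.7]     # ... so the total is driven by r_disc
total = [s + d for s, d in zip(z_score(r_succ), z_score(r_disc))]
```

Here the normalized success reward collapses to zeros, so the total reward reduces to the normalized discriminator reward, matching the adaptive balancing described above.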
Therefore, this can further mitigate the conflict between improving the success rate and preserving the comprehensibility, as stated in Section 3 in the main part of the paper. Furthermore, the experiments on HDNO were run on an Nvidia GeForce RTX 2080Ti graphics card, consuming around 3 hours for SL and 2 hours for RL; it is therefore not expensive to reproduce our results.

D EXTRA BACKGROUND ON REINFORCEMENT LEARNING

A Markov decision process (MDP) is defined with a set of states, a set of actions and a reward function r(s_t, a_t) (abbreviated as r_t for simplicity) to measure the performance of an action at each time step. RL (Sutton and Barto, 2018) is a learning paradigm which aims to find an optimal policy π(a_t | s_t) ∈ [0, 1] for an MDP by maximizing the expectation of cumulative long-term rewards. Mathematically, this can be expressed as max_π E_π[ Σ_{t=0}^{∞} γ^t r_t ], where γ ∈ (0, 1) is a discount factor. Different from value-based methods, a policy gradient method derives a stochastic policy directly by optimizing a performance function w.r.t. the parameters of the policy (Sutton and Barto, 2018). Since the performance function cannot be differentiated directly w.r.t. the parameters of the policy, the policy gradient is derived as ∇_θ J(θ) = E_π[ Q(s_t, a_t) ∇_θ ln π_θ(a_t | s_t) ]. REINFORCE (Williams, 1992) is a policy gradient method that evaluates Q(s_t, a_t) by the return G_t = Σ_{k=0}^{∞} γ^k r_{t+k}. To deal with a continuous action space, π_θ(a_t | s_t) can be represented as a Gaussian distribution (Sutton and Barto, 2018), where the mean and scale are both parameterized by θ.

E DIALOGUE ACT ONTOLOGY

In this section, we show the ontology for representing a handcrafted dialogue act (Budzianowski et al., 2018) in Table 10. The semantic meanings of the latent dialogue acts analyzed in Section 5.2.3 in the main part of the paper are correlated with these dialogue act types. This is the reason why we conclude that learning latent dialogue acts can potentially substitute for handcrafted dialogue acts with an ontology.

F EXTRA DISCUSSION

Discussion on Reinforcement Learning for Task-oriented Dialogue System: As we stated in Section 3 in the main part of the paper, the primary issue of using reinforcement learning (RL) in a task-oriented dialogue system is that the improvement on fulfilling user requests and the comprehensibility of generated system utterances are not easy to balance. One reason could be that the reward is only set up to improve the success rate, so the aspect of comprehensibility may be easily ignored during learning. This is the reason why we consider extra reward criteria in our work to mitigate this predicament. Another reason could be that the state space and action space of an end-to-end (E2E) model are so large that learning a mapping between these two spaces becomes difficult, as stated in Section 1 in the main part of the paper. To deal with this, we propose to decouple the dialogue policy and the natural language generator (NLG) of an E2E model into two separate modules, as in the traditional architecture, during learning, as well as to model them with the option framework (Sutton et al., 1999). As a result, the complexity of mapping from context to system utterances is reduced from V^2 to (L + M)V + ML, where V is the vocabulary size, L ≪ V is the size of the latent dialogue act space and M ≪ V is the size of the encoded utterance space.
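The quoted complexity reduction can be checked with a back-of-the-envelope calculation; the concrete sizes below are made-up illustrations (V is a typical vocabulary size, with L and M chosen much smaller than V), not numbers from the paper.

```python
# Compare the mapping complexity of a flat E2E model with the decoupled
# hierarchical one, under assumed (illustrative) space sizes.
V = 10_000   # vocabulary size
L = 64       # latent dialogue act space size (L << V)
M = 128      # encoded utterance space size (M << V)

e2e = V ** 2                   # direct context -> utterance mapping
hier = (L + M) * V + M * L     # decoupled dialogue policy + NLG mapping

print(f"E2E: {e2e:,}  hierarchical: {hier:,}  ratio: {e2e / hier:.1f}x")
```

Under these assumed sizes the hierarchical factorization is smaller by roughly an order of magnitude squared, which is the intuition behind decoupling the two modules.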



An event is defined as a policy over options calling an intra-option policy at some state. The source code of the implementation of HDNO is at https://github.com/mikezhang95/HDNO. Please check the clarification for this incident below the table called Policy Optimization on the official website of MultiWoz: https://github.com/budzianowski/multiwoz.



Figure1: This diagram demonstrates the overall architecture for modelling the hierarchical structure between dialogue policy (i.e. high-level policy) and NLG (i.e. low-level policy) as an option framework. The text in gray represents the concepts for a traditional task-oriented dialogue system whereas the text in red matches the concepts for the option framework.

Figure 2: These diagrams demonstrate latent dialogue acts of HDNO and LaRL clustered into 8 categories on a 2-D plane. Clustering is conducted with the k-means algorithm (Arthur and Vassilvitskii, 2006) on the original dimensions, whereas dimension reduction is conducted with the t-SNE algorithm (Maaten and Hinton, 2008). We randomly show 3 turns of system utterances for each cluster.

Figure 3: (a) This figure shows the statistical results of the human evaluation, in which 30 people participated; the questionnaire consists of 31 randomly selected turns of dialogues. (b) This figure shows example generated responses from the questionnaire.

Figure 4: Validation inform rate, success rate and BLEU score during training.

where S is a set of contexts (i.e. utterances, dialogue states and database search results) and O is a set of dialogue acts (i.e. options). A dialogue act is defined as o = ⟨I_o, π_o⟩, where I_o ⊆ S is a set of corresponding contexts for a generated word and π_o is the corresponding intra-option policy.

The table shows the main results on MultiWoz 2.0 and MultiWoz 2.1 evaluated with the automatic evaluation metrics. The results of HDNO are from the models trained with the proposed reward shape in Section 5, where α = 0.0001 for MultiWoz 2.0 and α = 0.01 for MultiWoz 2.1.

The table shows the results of different reward shapes for HDNO on MultiWoz 2.0.

yes , their contact number is [taxi_phone] . do you need anything else ?
you are welcome . may i help with any other bookings ? please call again . goodbye .
i have train [train_id] that leaves at [value_time] and will get you into [value...
i have train [train_id] . it is departing at [value_time] and will arrive by [va...
and what day and time ?
happy to be of service have a wonderful day !
thank you , have a great day , goodbye .
we are happy to help . come back soon .
i am sorry , there are no restaurant -s like that in [value_place] unfortunately...
oh <unk> . i did a search for a [value_pricerange] vegetarian restaurant and my ...
i am sorry but there is nothing matching your request . would you like to try an...

A.2 PROOF OF PROPOSITION 2

Lemma 1 (Bauschke et al. (2011)). (1) Any increasing bounded sequence (x_n)_{n∈N} in R is Fejér monotone with respect to sup{x_n}_{n∈N}, i.e. ∀n ∈ N, ||x_{n+1} − sup{x_n}_{n∈N}|| ≤ ||x_n − sup{x_n}_{n∈N}||. (2) For a Fejér monotone sequence (x_n)_{n∈N}, the sequence (||x_n − sup{x_n}_{n∈N}||)_{n∈N} converges, where || · || is an arbitrary norm on R.

Assumption 1. (1) The reward function is bounded. (2) With a sufficient number of samples, the Monte Carlo estimation of the value on any state is accurate enough.

Proposition 2. Following the model of Definition 1 and Assumption 1, if φ(o|π, s) and π are asynchronously updated, the value can improve monotonically during learning and both policies can finally converge to a local maximizer.
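Part (1) of Lemma 1 can be checked numerically on a simple example: the increasing bounded sequence x_n = 1 − 1/(n+1) with supremum 1 is an illustrative stand-in, not tied to the paper's value sequences.

```python
# Numeric check of Lemma 1 on the increasing bounded sequence
# x_n = 1 - 1/(n+1), whose supremum over all n is 1.
N = 1000
xs = [1.0 - 1.0 / (n + 1) for n in range(N)]
sup = 1.0  # supremum over all n, not just the first N terms

# (1) Fejér monotonicity w.r.t. the supremum: |x_{n+1} - sup| <= |x_n - sup|
gaps = [abs(x - sup) for x in xs]
assert all(b <= a for a, b in zip(gaps, gaps[1:]))

# (2) the gap sequence converges (here, to 0)
assert gaps[-1] < 1e-2
```

This is exactly the structure used in the proof: an increasing, bounded value sequence is Fejér monotone with respect to its supremum, so the gap to the supremum converges.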

Context

[user]: I need to reserve a table at Steak House, can you help me? [sys]: I sure can. Steak House is an expensive English restaurant in the center of town. What day would you like your reservation? [user]: No, hold off on booking for now. Can you help me find an attraction called Tate Modern?

Generated Response

WE2E: I can book that for you. What day and time would you like to book the table for? LaRL: Your reference number is R01. HDNO: What time would you like to dine?

The table shows the results of HDNO (Async.) with beam search on MultiWoz 2.0 and MultiWoz 2.1.

The table shows the pretraining results with beam search on MultiWoz 2.0 and MultiWoz 2.1.

The table shows the results of HDNO (Sync.) with beam search on MultiWoz 2.0 and MultiWoz 2.1. The reward shape is the one we proposed in Section 5, with α = 0.0001 for MultiWoz 2.0 and α = 0.01 for MultiWoz 2.1 respectively.

The table shows the full benchmark results on MultiWoz 2.0, compared with HDNO.

The table shows some delexicalized system utterances generated by the baselines and HDNO on MultiWoz 2.0.

usr: please help me make a taxi reservation
sys: i can help with that . what time do you need to leave or arrive by ?
usr: it does not matter , just so i arrive by [value time] .
sys: i would be happy to help with your request , but i will need to know what time you are looking for .
usr: i want to go to the [restaurant name] . and i need the car type and contact number please .
sys: i have booked a [taxi type] for you . the contact number is [taxi phone] . is there anything else i can help you with today ?
usr: no that is all . thanks
sys: you are welcome . have a great day . goodbye .

LaRL
usr: please help me make a taxi reservation
sys: i have booked a [taxi type] for you . the contact number is [taxi phone] .
usr: it does not matter , just so i arrive by [value time] .
sys: i have booked a [taxi type] for you . the contact number is [taxi phone] .
usr: i want to go to the [restaurant name] . and i need the car type and contact number please .
sys: i have booked a [taxi type] for you . the contact number is [taxi phone] .

Hyperparameters for training on MultiWoz 2.0 and MultiWoz 2.1.

Dialogue act ontology. / select / recommend / not found / request booking info / offer booking / inform booked / decline booking / welcome / greet / bye / reqmore

ACKNOWLEDGMENTS

This work is supported by the UK Engineering and Physical Sciences Research Council (EPSRC) under award EP/S000909/1.

APPENDICES

A PROOFS

A.1 PROOF OF PROPOSITION 1

Proposition 1. Following the model of Definition 1, if φ(o|π, s) and π are synchronously updated, the value does not always monotonically improve and the policies may never converge to a local maximizer.

Proof. At an arbitrary time step t ∈ N, assume that both φ(o|π, s) and π have been updated q ∈ N times, denoted as φ_q(π_q) and π_q respectively for conciseness. The current value for any arbitrary state s ∈ S is denoted as v(s | φ_q(π_q)). If we synchronously update φ_q(π_q) and π_q, we obtain the following inequalities on the value after the updates at time step t ∈ N:

v^t(s | φ_{q+1}(π_q)) ≥ v^t(s | φ_q(π_q)), v^t(s | φ_q(π_{q+1})) ≥ v^t(s | φ_q(π_q)). (3)
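The failure mode in Proposition 1 mirrors what can happen in simultaneous coordinate ascent when the two updates interfere. The following toy objective is unrelated to the dialogue model and the step size is chosen to expose the interference; it simply shows a synchronous step lowering the value while the corresponding one-coordinate (asynchronous-style) step raises it.

```python
# Toy illustration of Proposition 1: on f(x, y) = -(x + y)^2, a synchronous
# gradient step on both coordinates overshoots and lowers the value, while
# updating only one coordinate raises it.

def f(x, y):
    return -((x + y) ** 2)

def grad(x, y):           # df/dx = df/dy = -2(x + y)
    return -2.0 * (x + y)

lr, x0, y0 = 0.6, 1.0, 1.0
before = f(x0, y0)

# synchronous: both coordinates move using gradients taken at the old point
x_sync = x0 + lr * grad(x0, y0)
y_sync = y0 + lr * grad(x0, y0)
after_sync = f(x_sync, y_sync)     # worse than before

# asynchronous-style: only x moves this phase, y is frozen
x_async = x0 + lr * grad(x0, y0)
after_async = f(x_async, y0)       # better than before

assert after_sync < before < after_async
```

Each individual update improves the value with the other variable held fixed, matching Eq. 3, yet applying both at once breaks the improvement, which is the scenario the proof formalizes.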

