FANTASTIC REWARDS AND HOW TO TAME THEM: A CASE STUDY ON REWARD LEARNING FOR TASK-ORIENTED DIALOGUE SYSTEMS

Abstract

When learning task-oriented dialogue (ToD) agents, reinforcement learning (RL) techniques can naturally be utilized to train dialogue strategies to achieve user-specific goals. Prior works mainly focus on adopting advanced RL techniques to train the ToD agents, while the design of the reward function is not well studied. This paper aims at answering the question of how to efficiently learn and leverage a reward function for training end-to-end (E2E) ToD agents. Specifically, we introduce two generalized objectives for reward-function learning, inspired by the classical learning-to-rank literature. Further, we utilize the learned reward function to guide the training of the E2E ToD agent. With the proposed techniques, we achieve competitive results on the E2E response-generation task on the MultiWOZ 2.0 dataset.

1. INTRODUCTION

The rapid progress of pre-trained language models (e.g., Devlin et al., 2018; Lewis et al., 2019; Radford et al., 2019; Zhang et al., 2022c) has significantly pushed the boundaries of natural language processing (NLP) on real-world tasks. One important application is task-oriented dialogue (ToD) systems, which interact with users over multiple turns in natural language to accomplish tasks such as weather inquiry, ticket booking, or schedule planning (Chen et al., 2017; Kwan et al., 2022). Traditionally, the problem of ToD is decomposed into several sub-tasks (Smith & Hipp, 1994; Young et al., 2013): natural language understanding (NLU) for understanding turn-level user intents or slot values (Tur & De Mori, 2011; Casanueva et al., 2020), dialogue state tracking (DST) for tracking the user belief state across multiple dialogue turns (Zhang et al., 2019; Zhu et al., 2020), dialogue management (DM) for choosing system actions to take (Peng et al., 2017; Zhao et al., 2019), and natural language generation (NLG) for mapping system actions to natural language responses (Wen et al., 2015; Zhang et al., 2020). This pipeline approach, however, requires intensive structural designs and comprehensive data annotation for model training (Kwan et al., 2022). Recently, there has been growing interest in building end-to-end (E2E) ToD agents, which directly generate responses based on the natural language conversation mixing user utterances and past responses. Apart from this structural simplicity, many E2E ToD models can utilize pre-trained language models and are trained simply by supervised fine-tuning of the pre-trained models on the ToD datasets (e.g., Hosseini-Asl et al., 2020; Ham et al., 2020; Lin et al., 2020; Peng et al., 2021).
Due to the intrinsic similarity between dialogues and sequential decision-making, reinforcement learning (RL) methods are naturally employed to train dialogue systems and have achieved some success (e.g., Williams & Young, 2007; Georgila & Traum, 2011; Zhao et al., 2019). Since interacting with users during the training process is mostly impractical, offline RL (Lange et al., 2012; Levine et al., 2020), i.e., RL on static datasets, has recently been adopted to train E2E ToD models (e.g., Jaques et al., 2019; 2020; Ramachandran et al., 2021; Snell et al., 2022a; b; Jang et al., 2022). Although this direction already presents promising empirical results, how to properly design the reward function for the underlying (offline) RL remains an open question. Existing works (e.g., Wu et al., 2019c; Jang et al., 2022; Snell et al., 2022b) manually design a sparse reward function that only indicates whether the agent achieves the goal or not. Unfortunately, due to the delayed feedback, learning from such a sparse reward signal is itself challenging for RL agents (Andrychowicz et al., 2017; Liu et al., 2019; Durugkar et al., 2021). When applied to train the more complicated ToD agents, the sparse reward signal could lead to poor empirical performance (Takanobu et al., 2019; Wang et al., 2020a). To address this issue, we aim at answering the following question in this paper: How to efficiently learn a reward function and leverage it for training E2E dialogue agents? We answer the first half of this question by introducing two reward-learning objectives, RewardNet and RewardMLE, based on the classical learning-to-rank literature (Cao et al., 2007; Xia et al., 2008). Our desideratum is a reward function that can "explain" some non-trivial preference-based ordering among multiple alternative dialogue trajectories, thus potentially allowing the resulting RL-trained ToD agents to have better-than-demo performance.
We accomplish this goal by learning a parameterized reward function on dialogue turns, from which the accumulated reward of a dialogue trajectory can reflect the preference among multiple alternatives. We answer the second half of the question by utilizing the learned reward function to guide the training of the E2E ToD system, with special considerations on the training stability. With these answers to the above question, we achieve competitive results on the E2E response-generation task on the widely used dialogue benchmark MultiWOZ 2.0 (Budzianowski et al., 2018). Several ablation studies and analyses are conducted to provide further insights into the proposed techniques.

2. BACKGROUND

Task-oriented dialogue as reinforcement learning. We formulate the problem of the ToD system as a partially observable Markov decision process (POMDP) (Kaelbling et al., 1998), specified by $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, R, \gamma \rangle$, where state $s \in \mathcal{S}$ consists of the previous dialogue history $h$ and the user intended goal $g$ specified prior to the start of the dialogue; $o \in \mathcal{O}$ is the observation that can be the user utterance; action $a \in \mathcal{A}$ can be the system response or dialogue act; $\mathcal{P}(s' \mid s, a)$ is the underlying transition probability; $R(h, a, g)$ is the intermediate reward function for taking action $a$ under dialogue history $h$ and goal $g$; and $\gamma \in [0, 1]$ is the discount factor. The dialogue history $h_t$ at timestep $t$ consists of all the previous observations and actions, i.e., $h_t \triangleq \{o_0, a_0, \ldots, o_{t-1}, a_{t-1}, o_t\}$. Since the ToD agent cannot directly observe the user goal $g$, it makes a decision based on the entire dialogue history $h_t$ so far. Specifically, the policy $\pi$ is defined as a mapping from $h_t$ to a probability distribution over $\mathcal{A}$, i.e., $\pi \triangleq \pi(a_t \mid h_t)$. The training objective is to find a policy $\pi$ that maximizes the expected (discounted) cumulative reward

$$J(\pi) \triangleq \mathbb{E}_{\mu_g, \pi, \mathcal{P}} \left[ \sum_{t=0}^{T} \gamma^t R(h_t, a_t, g) \right],$$

where $\mu_g$ is the distribution of goals and $T$ is the number of turns in the dialogue trajectory.

Reward design and learning in ToD systems. Unlike the classical RL problems where the intermediate reward function is well designed and provided, in ToD systems we can only get the evaluation results at the end of the dialogue (Budzianowski et al., 2018).
Consequently, most existing works adopt a manually designed intermediate reward function that only gives a binary reward to indicate whether the dialogue agent achieves the goal or not (e.g., Weisz et al., 2018; Wu et al., 2019c; Jang et al., 2022):

$$R(h_t, a_t, g) = \begin{cases} R_{\mathrm{const}} \text{ or } 0, & \text{if goal } g \text{ is achieved at timestep } t, \\ -R_{\mathrm{const}}, & \text{if goal } g \text{ is not achieved at timestep } t, \end{cases}$$

where $R_{\mathrm{const}}$ is a positive constant that can be 1. However, such a sparse reward signal can be one of the reasons that the ToD agents from RL often have poor performance (Takanobu et al., 2019; Wang et al., 2020a). A similar issue is also observed in goal-oriented RL (Andrychowicz et al., 2017).

To address the above issue, a few recent works focus on learning an intermediate reward function from demonstrations or mechanical dialogue assessments (e.g., Wang et al., 2020a; Ramachandran et al., 2021), inspired by reward learning from preferences in RL (e.g., Christiano et al., 2017; Brown et al., 2019; 2020). More precisely, suppose we are given two dialogue trajectories $\tau_i$ and $\tau_j$, taking the form $\tau_i \triangleq \{g^{(i)}, (o_0^{(i)}, a_0^{(i)}), \ldots, (o_T^{(i)}, a_T^{(i)})\}$. We want to learn a parametrized reward function $R_\theta(o_t, a_t, g)$ with parameter $\theta$, such that $\sum_{t=0}^{T} R_\theta(o_t^{(i)}, a_t^{(i)}, g^{(i)}) > \sum_{t=0}^{T} R_\theta(o_t^{(j)}, a_t^{(j)}, g^{(j)})$ when $\tau_i$ is preferred over $\tau_j$ (denoted as $\tau_i \succ \tau_j$) and vice versa. Then one can follow the Bradley-Terry model of pairwise preferences (Bradley & Terry, 1952) to train the reward function by minimizing the loss

$$\ell(\theta) = -\sum_{\tau_i \succ \tau_j} \log \left[ \frac{\exp\left( \sum_{t=0}^{T} R_\theta(o_t^{(i)}, a_t^{(i)}, g^{(i)}) \right)}{\sum_{k \in \{i, j\}} \exp\left( \sum_{t=0}^{T} R_\theta(o_t^{(k)}, a_t^{(k)}, g^{(k)}) \right)} \right]. \tag{1}$$

$\ell(\theta)$ can be interpreted as a pairwise ranking loss, which is formalized as binary classification in the problem of learning to rank (Herbrich et al., 1999; Freund et al., 2003; Burges et al., 2005).
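
As a minimal sketch of Eq. (1) for a single preference pair, the following pure-Python snippet sums per-turn rewards into trajectory returns and evaluates the Bradley-Terry loss. The function names and the reward values are illustrative assumptions; in practice the per-turn rewards would be produced by the learned $R_\theta$.

```python
import math


def trajectory_return(turn_rewards):
    """Accumulated (undiscounted) reward of one dialogue trajectory."""
    return sum(turn_rewards)


def bradley_terry_loss(preferred_turn_rewards, other_turn_rewards):
    """Negative log-probability that the preferred trajectory wins (Eq. (1)).

    Uses -log(exp(J_i) / (exp(J_i) + exp(J_j))) = log(1 + exp(J_j - J_i)),
    evaluated in a numerically stable way.
    """
    diff = trajectory_return(other_turn_rewards) - trajectory_return(preferred_turn_rewards)
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Hypothetical per-turn rewards for two three-turn dialogues.
loss_agree = bradley_terry_loss([0.9, 0.8, 0.7], [0.2, 0.1, 0.3])
loss_disagree = bradley_terry_loss([0.2, 0.1, 0.3], [0.9, 0.8, 0.7])
```

The loss equals log 2 when both trajectories have the same return, drops below it when the reward model agrees with the preference, and rises above it otherwise.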

3. MAIN METHOD

In this section, we first introduce two objectives for reward-function learning based on the classical approaches in the learning-to-rank (LTR) literature (Liu, 2009) . Then we use MinTL (Lin et al., 2020) as an example to demonstrate how we can use the learned reward function as a plugin module to improve existing methods of training the E2E ToD models.

3.1. TWO GENERALIZED OBJECTIVES FOR REWARD LEARNING

We introduce two objectives, RewardNet and RewardMLE, both of which can utilize multiple dialogue trajectories on each update for optimizing the reward function. Our motivation is that, compared with the pairwise approach described in Eq. (1), these two objectives consider more information at each training step, and thus can be more effective for reward learning and may lead to a better solution under the stochastic training setting.

Setup. Assume that there are $N \geq 2$ dialogue trajectories, denoted by $\mathcal{D}_N \triangleq (\tau_1, \tau_2, \ldots, \tau_N)$, and each trajectory $\tau_i$ has an automatic evaluation score $S(\tau_i)$. For simplicity, we assume that these $N$ dialogue trajectories are of equal length $T$ and are already sorted by the automatic evaluation scores, i.e., $\tau_1 \succ \tau_2 \succ \cdots \succ \tau_N$, or equivalently, $S(\tau_1) > S(\tau_2) > \cdots > S(\tau_N)$. We denote the accumulated reward of dialogue trajectory $\tau_i$ from $R_\theta$ as $J(\tau_i; \theta) = \sum_{t=0}^{T} R_\theta(o_t^{(i)}, a_t^{(i)}, g^{(i)})$. Our goal is to learn a reward function $R_\theta(o, a, g)$ such that the accumulated rewards of those trajectories can reflect the ranking order, i.e., $J(\tau_1; \theta) > \cdots > J(\tau_N; \theta)$.

RewardNet. The proposed RewardNet objective for reward function learning is adapted from the ListNet loss (Cao et al., 2007) in the LTR literature. Specifically, given $N$ trajectories and their associated scores, we define the RewardNet loss as the cross entropy between $\{J(\tau_i; \theta)\}_{i=1}^{N}$ and $\{S(\tau_i)\}_{i=1}^{N}$:

$$\ell_{\mathrm{RewardNet}}(\theta; \mathcal{D}_N) \triangleq -\sum_{i=1}^{N} P_S(\tau_i) \cdot \log P_{J(\tau;\theta)}(\tau_i), \quad P_S(\tau_i) = \frac{S(\tau_i)}{\sum_{k=1}^{N} S(\tau_k)}, \quad P_{J(\tau;\theta)}(\tau_i) = \frac{\Phi(J(\tau_i; \theta))}{\sum_{k=1}^{N} \Phi(J(\tau_k; \theta))}, \tag{2}$$

where $\Phi(\cdot)$ is a monotonic positive function defined on $\mathbb{R}^{+}$, and $P_S(\tau_i)$ is a normalized probability vector defined by the true evaluation scores of those $N$ trajectories. Note that when $N = 2$ and $\Phi$ is the identity function, RewardNet can be viewed as a soft version of the pairwise preference loss defined in Eq. (1), where the hard binary preference labels are replaced by $\{P_S(\tau_i)\}_{i=1}^{N}$. This soft pairwise loss is adopted for reward learning in the recent CASPI paper (Ramachandran et al., 2021).
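
To make Eq. (2) concrete, here is a small pure-Python sketch of the RewardNet loss for one batch of trajectories. It assumes the evaluation scores are non-negative with a positive sum and that $\Phi$ returns positive values; the helper names `rewardnet_loss` and `escort`, and all numbers, are hypothetical and chosen only for illustration.

```python
import math


def rewardnet_loss(scores, returns, phi=math.exp):
    """Cross entropy between the score-induced and return-induced distributions (Eq. (2)).

    scores[i]  : automatic evaluation score S(tau_i)
    returns[i] : accumulated learned reward J(tau_i; theta)
    phi        : monotonic positive transform (softmax transform by default)
    """
    score_total = sum(scores)
    target = [s / score_total for s in scores]          # P_S
    phi_values = [phi(j) for j in returns]
    phi_total = sum(phi_values)
    return -sum(p * math.log(v / phi_total) for p, v in zip(target, phi_values))


def escort(power):
    """Escort transform Phi(x) = x ** power; assumes positive inputs."""
    return lambda x: x ** power


scores = [0.9, 0.6, 0.3]  # made-up scores, sorted from best to worst
aligned = rewardnet_loss(scores, returns=[3.0, 2.0, 1.0])
reversed_order = rewardnet_loss(scores, returns=[1.0, 2.0, 3.0])
identity_loss = rewardnet_loss(scores, returns=[3.0, 2.0, 1.0], phi=escort(1))
```

With the identity transform and returns proportional to the scores, the two distributions coincide and the loss reduces to the entropy of $P_S$, which is its minimum for these scores.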

RewardMLE.

The RewardMLE objective is based on the ListMLE loss (Xia et al., 2008), where we only utilize the ranking order in the batched dialogue trajectories $\mathcal{D}_N$, rather than the original metric scores $\{S(\tau_i)\}_{i=1}^{N}$. Let $y = \mathrm{rank}(S)$ be the random variable that represents the ranking order of the dialogue trajectories ($y(\tau_i) = i, \forall\, i$, if the batched trajectories $\mathcal{D}_N$ are sorted). The RewardMLE objective is defined as the negative log-likelihood of the ranking order $y$ under the Plackett-Luce choice model (Plackett, 1975; Luce, 2012) induced by the accumulated reward of each trajectory $\{J(\tau_i; \theta)\}_{i=1}^{N}$. Specifically, the loss is defined as

$$\ell_{\mathrm{RewardMLE}}(\theta; \mathcal{D}_N) \triangleq -\log P\left(y \mid \{J(\tau_i; \theta)\}_{i=1}^{N}\right), \quad \text{with} \quad P\left(y \mid \{J(\tau_i; \theta)\}_{i=1}^{N}\right) = \prod_{i=1}^{N} \frac{\Phi(J(\tau_i; \theta))}{\sum_{k=i}^{N} \Phi(J(\tau_k; \theta))}, \tag{3}$$

where trajectories in $\mathcal{D}_N$ are assumed sorted as described in the problem setup, i.e., $\tau_1 \succ \cdots \succ \tau_N$. Since RewardMLE only requires the ranking information derived from the raw scores, it is potentially a more robust choice when the preference scores could be inaccurate.

In Eqs. (2) and (3), the monotonic positive function $\Phi$ transforms the unnormalized inputs $\{J(\tau_i; \theta)\}_{i=1}^{N}$ to an $N$-dimensional probability simplex. In this work, we consider $\Phi$ to be the exponential function $\exp(\cdot)$ or the power function $(\cdot)^p$, $p \in \mathbb{N}$, which are respectively known as the softmax transform and the escort transform (Mei et al., 2020).
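
The Plackett-Luce likelihood in Eq. (3) can be sketched in a few lines of pure Python. The snippet assumes the returns are already sorted from the most to the least preferred trajectory; the function name and the example values are illustrative assumptions only.

```python
import math


def rewardmle_loss(sorted_returns, phi=math.exp):
    """Negative Plackett-Luce log-likelihood of the given order (Eq. (3)).

    sorted_returns[i] is J(tau_i; theta), with trajectories already sorted
    from most to least preferred. phi must return positive values.
    """
    phi_values = [phi(j) for j in sorted_returns]
    loss = 0.0
    for i, value in enumerate(phi_values):
        # Probability of picking trajectory i first among those still remaining.
        loss -= math.log(value / sum(phi_values[i:]))
    return loss


# Made-up returns: the first list agrees with the ranking, the second reverses it.
consistent = rewardmle_loss([3.0, 2.0, 1.0])
inconsistent = rewardmle_loss([1.0, 2.0, 3.0])
# Escort transform with power two, Phi(x) = x ** 2, on a pair of trajectories.
escort_pair = rewardmle_loss([2.0, 1.0], phi=lambda x: x ** 2)
```

For $N = 2$ with the softmax transform, this loss coincides with the pairwise Bradley-Terry loss of Eq. (1) for that pair, which is one way to sanity-check an implementation.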

3.2. POLICY GRADIENT UPDATE WITH LEARNED REWARD FUNCTION

With the learned reward function $R_\theta(o, a, g)$, the next step is to improve the parametrized dialogue agents $\pi_\phi$ via policy gradient methods (Sutton & Barto, 2018), given a collected offline dataset $\mathcal{D} := \{\tau_k\}_{k=1}^{K}$. A classical approach to train the policy $\pi_\phi$ is to estimate the policy gradient via the REINFORCE method (Williams, 1992):

$$\nabla_\phi J_{\mathrm{REINFORCE}}(\pi_\phi) = \mathbb{E}_{(g, h_t) \sim \mathcal{D},\, \tilde{a}_t \sim \pi_\phi(\cdot \mid h_t)} \left[ \nabla_\phi \log \pi_\phi(\tilde{a}_t \mid h_t) \cdot G^{\pi_\phi}(h_t, \tilde{a}_t, g) \right], \tag{4}$$

where $G^{\pi_\phi}(h_t, \tilde{a}_t, g)$ is the (discounted) accumulated reward that the agent $\pi_\phi$ receives, starting from the observation $o_t$ (part of $h_t$) and action $\tilde{a}_t$, under the given goal $g$. When the discount factor $\gamma > 0$, estimating $G^{\pi_\phi}(h_t, \tilde{a}_t, g)$ requires Monte Carlo sampling (on-policy) or temporal difference learning (off-policy), both of which require learning an additional value-function network. Empirically we observe that learning an additional action-value function could introduce instability and extra compute to the subsequent training of the E2E dialogue model. To simplify the training pipeline, we simply set the discount factor $\gamma = 0$, and thus $G^{\pi_\phi}(h_t, \tilde{a}_t, g) = R_\theta(o_t, \tilde{a}_t, g)$.

Though the policy gradient estimator defined in Eq. (4) is unbiased, it tends to have high variance, especially when the action space is large. Unfortunately, in the E2E ToD system, the action space is often defined to be the Cartesian product of the vocabulary, which itself has a dimension larger than 30000. As a result, optimizing the agent $\pi_\phi$ by the REINFORCE estimator may suffer from divergent training. We illustrate this phenomenon via a toy example in Section 5.2.

To mitigate the high-variance issue of the REINFORCE estimator, we utilize the Gumbel-softmax (GS) trick (Jang et al., 2016; Maddison et al., 2016; Fan et al., 2021). Specifically,

$$J_{\mathrm{GS}}(\pi_\phi) = \mathbb{E}_{a_t \sim \pi_\phi(\cdot \mid h_t)} \left[ R_\theta(o_t, a_t, g) \right] \approx \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathrm{Gumbel}(0, 1)} \left[ R_\theta(o_t, f_\phi(h_t, \boldsymbol{\epsilon}), g) \right],$$

with

$$f_\phi(h_t, \boldsymbol{\epsilon}) = \left[ f_\phi^{(1)}(h_t, \boldsymbol{\epsilon}), \ldots, f_\phi^{(|\mathcal{A}|)}(h_t, \boldsymbol{\epsilon}) \right] \in \mathbb{R}^{|\mathcal{A}|}, \quad f_\phi^{(i)}(h_t, \boldsymbol{\epsilon}) = \frac{\exp\left( (l_i(h_t; \phi) + \epsilon_i) / \lambda \right)}{\sum_{j=1}^{|\mathcal{A}|} \exp\left( (l_j(h_t; \phi) + \epsilon_j) / \lambda \right)},$$

where $\{l_i(h_t; \phi)\}_{i=1}^{|\mathcal{A}|}$ are the logits of the categorical distribution defined by the agent $\pi_\phi$, and $\lambda$ is the temperature parameter that we set as 1.

Besides, following the pessimistic principle in offline RL (Buckman et al., 2020), we add a weighted regularization such that the actions generated by the agent $\pi_\phi$ are close to the actions in the dataset $\mathcal{D}$,

$$\ell_W(\pi_\phi) := -\mathbb{E}_{(h_t, a_t, g) \sim \mathcal{D}} \left[ \log \pi_\phi(a_t \mid h_t) \cdot R_\theta(o_t, a_t, g) \right],$$

which is similar to the weighted behavior cloning in offline RL (Wang et al., 2020b), except that we directly use the intermediate rewards as the weights, rather than using the value function. Combining the policy gradient and the weighted regularization, we have the following loss for the agent $\pi_\phi$:

$$\ell_{\mathrm{GEN}}(\phi) = -\alpha \cdot J_{\mathrm{GS}}(\pi_\phi) + \ell_W(\pi_\phi), \tag{5}$$

where $\alpha$ is the coefficient balancing these two parts. Note that the original supervised-learning loss of MinTL (Lin et al., 2020) can be decomposed into two parts, respectively for the dialogue state tracking (DST) and the response generation. We retain the DST loss $\ell_{\mathrm{DST}}(\phi)$ in MinTL and replace its response-generation loss with Eq. (5). Our final loss for ToD agent training is

$$\ell(\phi) = \ell_{\mathrm{GEN}}(\phi) + \ell_{\mathrm{DST}}(\phi). \tag{6}$$

We illustrate our method in Fig. 1 and provide an algorithm box in Appendix B.

Remark. Eq. (6) for the learning of the dialogue agent $\pi_\phi$ is essentially a generalized objective from several previous works. Specifically, if we set $\alpha = 0$ and set the reward function to be constant, $R_\theta(o_t, a_t, g) \equiv 1$, Eq. (6) reduces to the objective in MinTL, without any guidance for response generation from the learned reward function $R_\theta$. If we set $\alpha = 0$, and use the RewardNet loss with $N = 2$ and $\Phi = (\cdot)^1$ (i.e., the identity function) to train the reward function, Eq. (6) reduces to the objective in CASPI (Ramachandran et al., 2021).
In Section 5, we demonstrate the advantages of our techniques proposed in this section, including the RewardNet and RewardMLE losses for reward learning, and the J GS (π ϕ ) for agent training.
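
The Gumbel-softmax relaxation $f_\phi(h_t, \boldsymbol{\epsilon})$ used in $J_{\mathrm{GS}}$ can be sketched with the standard library alone. This is an illustration under stated assumptions, not the training code: the logits, the three-action space, and the per-action reward table standing in for $R_\theta$ are all hypothetical, and a real implementation would feed the relaxed sample to the reward network and backpropagate through it.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Draw one relaxed one-hot vector from the categorical logits."""
    perturbed = []
    for logit in logits:
        u = max(rng.random(), 1e-12)            # clamp to avoid log(0)
        gumbel_noise = -math.log(-math.log(u))  # standard Gumbel(0, 1) draw
        perturbed.append((logit + gumbel_noise) / temperature)
    peak = max(perturbed)                       # subtract the max for stability
    weights = [math.exp(p - peak) for p in perturbed]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
sample = gumbel_softmax_sample([2.0, 0.5, -1.0], temperature=1.0, rng=rng)

# Hypothetical linear stand-in for R_theta: one fixed reward per action.
toy_action_rewards = [1.0, 0.2, 0.0]
relaxed_reward = sum(w * r for w, r in zip(sample, toy_action_rewards))
```

Because the sample is a smooth function of the logits for fixed noise, the reward evaluated on it is differentiable with respect to the policy parameters, which is what allows a lower-variance gradient than the score-function estimator in Eq. (4).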

4. RELATED WORK

Recent works on the E2E ToD systems (e.g., Wu et al., 2019b; Lin et al., 2020; Hosseini-Asl et al., 2020; Ham et al., 2020; Peng et al., 2021; Yang et al., 2021) have significantly improved the overall system's performance and simplified the algorithmic designs in earlier works, which require solving several pipeline-based sub-tasks (e.g., Young et al., 2013; Gao et al., 2018; Zhang et al., 2020). The reward function trained by our methods can be leveraged as guidance to train existing E2E models, without changing the underlying structures. We demonstrate the effectiveness of our proposed reward learning methods under the structure of MinTL (Lin et al., 2020) and GALAXY (He et al., 2022; in Appendix E), where we only add a reward-function-guided objective for the response-generation model, while keeping other components of the respective structure unchanged.

One line of related research is applying RL to train ToD agents. Directly applying RL algorithms such as DDPG (Lillicrap et al., 2015) or PPO (Schulman et al., 2017) is often unsuccessful, since the agent training could potentially diverge (Zhao et al., 2019; Jang et al., 2022; Kwan et al., 2022). Recently, a number of works consider offline RL (Levine et al., 2020) as a promising solution to stabilize the agent training on a static dataset (e.g., Jaques et al., 2020; Ramachandran et al., 2021; Jang et al., 2022; Verma et al., 2022; Snell et al., 2022a; b). Following the offline RL principle, we use a reward-weighted regularization to stabilize the dialogue-agent training. Together with the incorporation of the Gumbel-softmax trick to estimate the policy gradient, our work retains algorithmic simplicity while improving the training stability and overall performance.

Finally, our paper closely relates to works on reward learning for the ToD systems (e.g., Takanobu et al., 2019; Ramachandran et al., 2021). This research thread differs from works that directly use a manually designed reward function, which only gives sparse signals to indicate whether the agent achieves the goal or not (e.g., Weisz et al., 2018; Wu et al., 2019c; Jang et al., 2022; Snell et al., 2022b). One line of this research direction utilizes inverse reinforcement learning (IRL) (Russell, 1998) to learn a dense reward function, by assuming that the collected data are expert demonstrations (Takanobu et al., 2019). However, modern IRL techniques such as GAIL-style algorithms (Ho & Ermon, 2016; Fu et al., 2017) often require iterating between reward learning and policy training (Finn et al., 2016), which is computationally expensive and less scalable to dialogue-generation models. Besides, the IRL methods aim at justifying the data, while the reward-learning framework in our work seeks to explain the preference among multiple trajectories, potentially leading to better-than-demo agents (Brown et al., 2019; 2020). Our paper is more closely related to the research on

5. EXPERIMENTS

Dataset. We evaluate the proposed methods on the MultiWOZ 2.0 dataset (Budzianowski et al., 2018), which is a representative ToD benchmark. MultiWOZ 2.0 is a large-scale multi-domain dialogue corpus with seven domains: attraction, hospital, police, hotel, restaurant, taxi, and train. Each dialogue therein covers between one and three domains. This dataset has 8438 dialogues in the training set and 1000 dialogues each in the validation and test sets.

Evaluation Metrics. Our proposed method is evaluated on the E2E dialogue-modeling task of the MultiWOZ 2.0 dataset. Following the standard setup (e.g., Budzianowski et al., 2018; Mehri et al., 2019), we use four automatic evaluation metrics: 1) Inform rate: the fraction of the dialogues where the system has provided an appropriate entity; 2) Success rate: the fraction of the dialogues where the system answered all the requested information; 3) BLEU score (Papineni et al., 2002): measures the fluency of the generated responses; 4) Combined Score (Mehri et al., 2019): an overall quality measure defined as Combined Score $\triangleq$ (Inform + Success) $\times$ 0.5 + BLEU. Details on preprocessing and implementation are in Appendices B and G.

2)) when we use the escort transform ($\Phi = (\cdot)^1$, the identity function) with pairwise preference ($N = 2$). When we use three dialogue trajectories ($N = 3$) to construct the RewardNet loss and retain the same escort transform, the overall performance generally improves over CASPI. As discussed in Section 3.1, our RewardNet loss generalizes the pairwise-preference learning by taking more information on each update of the reward model and thus could learn a better reward function. Appendix D further compares our methods with CASPI.
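
For reference, the Combined Score defined under the evaluation metrics above is simple arithmetic. The sketch below uses made-up metric values purely to show the computation; they are not results from any table in this paper, and the function name is a hypothetical choice.

```python
def combined_score(inform, success, bleu):
    """Combined Score = (Inform + Success) * 0.5 + BLEU, all on a 0-100 scale."""
    return 0.5 * (inform + success) + bleu


# Illustrative inputs only: 90% inform rate, 80% success rate, BLEU of 20.
example = combined_score(inform=90.0, success=80.0, bleu=20.0)
```

Since task completion enters with a total weight of one across two metrics on a 0-100 scale, a one-point change in BLEU moves the Combined Score as much as a two-point change in either Inform or Success.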

Main evaluation.

The performance is further slightly improved by changing the RewardNet loss (Eq. (2)) to the RewardMLE loss (Eq. (3)), with the softmax transform ($\Phi = \exp(\cdot)$) and $N = 5$ dialogue trajectories. This again demonstrates the benefit of our proposal of using multiple trajectories to learn the reward model. Section 5.2 conducts ablation studies on the number of trajectories and choice of $\Phi$.

So far, we follow prior work in not using policy gradient to train the response-generation model, i.e., $\alpha = 0$ in Eq. (5). Extra performance gain can be obtained by adding the policy-gradient updates via the Gumbel-softmax trick (GS) discussed in Section 3.2. Indeed, GS improves both the plain RewardNet and RewardMLE models. This shows the efficacy of directly optimizing the response-generation model w.r.t. the learned reward function. Further discussion is provided in Section 5.2. Appendix E provides the experimental results when applying our method to the recent GALAXY backbone (He et al., 2022). Appendix F discusses the results on the MultiWOZ 2.1 dataset.

Low-resource experiment. We evaluate our method in the low-data regime by following the testing strategy in Lin et al. (2020). Specifically, we use 5%, 10%, and 20% of the training data to train our basic RewardNet and RewardMLE models in Table 1, without the GS component. We compare them with the baseline scores in Lin et al. (2020). Table 2 reports the results. Our models outperform the baselines, MinTL and DAMD, showing the efficacy of our method. Compared with Table 1, our models trained with 20% of the data perform competitively with the baseline methods trained on the full training set.

5.2. ABLATION STUDY

The ablation study considers the following four research questions to better understand our methods.

(a): What if we learn the reward function via a different number of trajectories? In Figs. 2a and 2b, we vary the number of trajectories used for the reward-learning losses in Table 1. To avoid unwanted interference, we use the basic version of models without the GS component. The case of using two trajectories reduces to the pairwise-preference loss discussed in Section 2. As shown in Figs. 2a and 2b, our generalized approach of using multiple trajectories to learn the reward function provides the flexibility to outperform the classical pairwise-preference learning. This is more apparent in the RewardMLE models, which are less sensitive to small errors in the ground-truth scores. In general, the optimal number of trajectories may depend on the scoring quality.

(b): Do different probabilistic transforms in reward learning objectives affect the performance? We modify the basic version of the RewardNet and RewardMLE models in Table 1 by using the softmax transform and by using different powers in the escort transform in the reward learning losses Eqs. (2) and (3). For the escort transform, we consider $\Phi = (\cdot)^p$, $p \in \{1, 2, 3, 4\}$. Figs. 2c and 2d plot the resulting Combined Scores. We see that the RewardMLE model is less sensitive to the choice of probabilistic transform: all the considered variants have a Combined Score of at least 104. In fact, changing the softmax transform used in Table 1 to the escort transform with power two improves the performance to 106.77. Thus, the choice of probabilistic transform provides an additional angle to improve the learned reward function and the entire ToD model.

Figure 3: Line plots showing the four automatic evaluation metrics of the RewardNet + GS model in Table 1 under different $\alpha$ values in the generation-model loss Eq. (5). Results are the average over five seeds. (Panel (d): BLEU Score.)
(c): Is our method sensitive to the coefficient $\alpha$ in the generation-model loss Eq. (5)? To investigate the robustness of our method under different weights for the policy-gradient optimization of the response-generation model, we select our best policy-gradient-based model in Table 1, the RewardNet + GS model, and vary the $\alpha$ coefficient in the generation-model loss Eq. (5). Fig. 3 plots the resulting four automatic evaluation metrics. We see that our model is relatively robust to the choice of $\alpha$. The five variants in Fig. 3 all have Combined Scores of at least 105, higher than the best baseline result of 104.78 in Table 1. In fact, by changing the $\alpha$ coefficient to 0.5 from the 0.1 used in Table 1, we achieve an even better Combined Score of $\approx 107.2$. Further, the capability of task completion and the fluency of the generated responses are both relatively insensitive to the choice of $\alpha$.

(d): How does the addition of the policy-gradient method Gumbel-softmax help the performance? Fig. 4 compares the performance of our models in Table 1, with error bars showing the standard deviation of the Combined Score over five seeds. The addition of the Gumbel-softmax method not only improves the score but also reduces the performance variation, which is apparent when comparing the RewardMLE model with the RewardMLE + GS model.

As discussed in Section 3.2, the Gumbel-softmax (GS) trick can be more advantageous than the classical REINFORCE method (Williams, 1992) for the policy-gradient update. As a demonstration, we conduct a toy experiment following Yin et al. (2019) and plot the results in Fig. 5. The task here is to learn the parameter $\psi$ of a $D$-dimensional categorical distribution to maximize a simple reward function. Specifically, denote the sigmoid function as $\sigma(\cdot)$, the goal is

$$\max_{\psi \in \mathbb{R}^{D}} \mathbb{E}_{x \sim \mathrm{Cate}(\sigma(\psi))} \left[ f(x) \right], \quad f(x) \triangleq 0.5 + x / (D \cdot R), \quad \forall\, x \in \{1, \ldots, D\},$$

where $\mathrm{Cate}(\sigma(\psi))$ denotes the categorical distribution with probability vector $\sigma(\psi)$, and $D = R = 30$. The best $\sigma(\psi)$ is $(0, \ldots, 0, 1)$, leading to the optimal expected reward of $\approx 0.533$. We initialize $\psi = 0$ and use one sample for the stochastic gradient-ascent update, with a learning rate of 1.0.

The first row of Fig. 5 traces the objective function during the training process when using the true gradient, REINFORCE, and the GS for policy-gradient updates. We see that the REINFORCE method converges to a local maximum, while the GS method reaches the global optimum, as do updates with the true gradient. The second row shows the gradients for $\psi_1$ and $\psi_D$, where we see that gradient estimates from the REINFORCE method are both unstable and vanishing, compared to the GS method. The learned probabilities $\{\sigma(\psi)_1, \ldots, \sigma(\psi)_D\}$ are traced in the third row, where the red line is for $\sigma(\psi)_D$ that should ideally be 1, and the shadowed lines are for the other components that ought to be 0. The learning process of the GS method closely resembles that of using the true gradient, while REINFORCE oscillates around a local optimum. The last row of Fig. 5 plots the estimate of gradient variance via 500 samples, averaged over each component of the $\psi$ vector. The gradient variance of the REINFORCE method is on the order of $10^{-2}$ at the beginning and converges to roughly $10^{-4}$, while that of the GS is $10^{-6}$ throughout the training process. This toy experiment shows that a low-variance method, such as the GS, can be critical to the success of policy-gradient training.
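
The toy objective above can be reproduced in a few lines, which also makes the reported optimum of roughly 0.533 easy to verify. The sketch below is an illustration under one explicit assumption: it normalizes $\psi$ with a softmax so that the categorical probabilities sum to one, which may differ from the exact parameterization used in the experiment. All function names are hypothetical, and the snippet only computes the objective, the exact gradient, and a single-sample REINFORCE estimate; it does not reproduce Fig. 5.

```python
import math
import random

D = 30
R = 30


def f(x):
    """Toy reward for category x in {1, ..., D}."""
    return 0.5 + x / (D * R)


def softmax(psi):
    """Assumed normalization of psi into a probability vector."""
    peak = max(psi)
    weights = [math.exp(p - peak) for p in psi]
    total = sum(weights)
    return [w / total for w in weights]


def expected_reward(psi):
    probs = softmax(psi)
    return sum(p * f(x) for x, p in enumerate(probs, start=1))


def true_gradient(psi):
    """Exact gradient under the softmax assumption: p_k * (f(k) - E[f])."""
    probs = softmax(psi)
    mean = expected_reward(psi)
    return [p * (f(x) - mean) for x, p in enumerate(probs, start=1)]


def reinforce_gradient(psi, rng):
    """Single-sample REINFORCE estimate: f(x) * grad log p_x."""
    probs = softmax(psi)
    x = rng.choices(range(1, D + 1), weights=probs)[0]
    return [f(x) * ((1.0 if k == x else 0.0) - p)
            for k, p in enumerate(probs, start=1)]


psi0 = [0.0] * D                     # initialization used in the text
start_value = expected_reward(psi0)  # uniform distribution over categories
best_value = f(D)                    # all mass on the last category
one_sample = reinforce_gradient(psi0, random.Random(0))
```

At the uniform initialization every exact gradient component is small, while a single REINFORCE sample places a spike of magnitude close to $f(x)$ on the sampled category, which gives an intuition for the variance gap discussed above.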

5.3. FURTHER ANALYSIS

Human evaluation. For a more comprehensive evaluation of our method, we conduct a human evaluation on the quality of the generated responses, where our model and the top two baselines in Table 1, GPT-Critic and CASPI, are compared. We follow the evaluation protocol in prior work (e.g., Zhang et al., 2020; Ramachandran et al., 2021; Jang et al., 2022) to evaluate on two metrics: 1) Appropriateness: measures the appropriateness of the generated response under the context of the dialogue turn; 2) Fluency: evaluates the comprehensibility and coherency of the generated response. We randomly picked 50 turns in the test set and showed 10 evaluators the responses generated from each method, together with the dialogue history up to that turn. The method names were anonymized. The evaluators were asked to read the dialogue history and score the response on a 5-point Likert scale {1, 2, 3, 4, 5}, where score 5 is the highest and 1 the lowest. Fig. 6 summarizes the evaluation results. We see that our model outperforms the baselines in both the appropriateness and fluency scores. The human-evaluation results coincide with our comparatively good dialogue-task completion and BLEU score in Table 1.

Examples of the generated dialogues. Tables 3 and 4 in Appendix A present two case studies comparing the generated responses from our method with those from the baselines GPT-Critic and CASPI. We additionally annotate the generated responses to discuss the quality of those generations. These examples show that the responses from our model compare favorably with the baselines in both task completion and comprehensibility, aligning with the automatic and human evaluations.

Quality of the DST. To further understand the performance gain of our models, we compare our basic RewardNet and RewardMLE models in Table 1 with the baselines UBAR and GPT-Critic on the quality of the generated dialogue states. Fig. 7 plots the results of the dialogue state prediction, measured by the two metrics Joint (Goal) Accuracy and Slot Accuracy (Wu et al., 2019a). We see that our two models have more accurate DST than the two baselines, which can be related to their better performance in Table 1. Interestingly, the DST of the RewardMLE model is also better than that of the RewardNet model. This may suggest that a better reward model benefits not only the learning of response generation, but also the DST. These two losses are jointly minimized in training the ToD model, and thus a good response-generation loss from a better reward model may help the optimization of the DST loss.

6. CONCLUSION

In this paper, we aim to answer the question of how to efficiently learn and utilize a reward function for training the E2E ToD agents. We answer this question by introducing two generalized reward-learning objectives, and we utilize a stable policy-gradient method to guide the training of the E2E ToD agents. Future work includes extending our reward-learning objectives to other applications, such as question answering with verification.

ACKNOWLEDGMENTS

S. Yang and M. Zhou acknowledge the support of NSF-IIS 2212418 and the Texas Advanced Computing Center (TACC) for providing HPC resources that have contributed to the research results reported within this paper.

ETHICAL STATEMENT

We develop our methods based on the publicly available MultiWOZ 2.0 dataset (Budzianowski et al., 2018). It is important to note that, like other ToD models, our implementation will likely reflect the socio-economic and entity biases inherent in the MultiWOZ dataset (Qian et al., 2021). We initialize model parameters from the pre-trained BART model, and a comprehensive analysis of the biases captured by BART is outside our scope. In addition, we invited volunteers for the human evaluation and gave them transparent and detailed explanations of data usage, research intent, occupied hours, etc.

B ALGORITHMIC DETAILS

Preprocessing. The raw corpus is preprocessed following common practice in the ToD literature. Specifically, we represent the database (DB) query results as one-hot vectors following Budzianowski et al. (2018), use the domain-adaptive delexicalization proposed by Wen et al. (2016), and generate delexicalized responses with placeholders for specific DST/DB information as in Zhang et al. (2020).

Implementation of the response model. Our model in Section 5 is based on the MinTL ToD model (Lin et al., 2020), which uses the pre-trained BART-large model (Lewis et al., 2019). MinTL directly works on the system response and does not explicitly output the dialogue act. Our proposed method in Section 3 is applied to the response training, and we retain the DST-training loss in MinTL. Our model is trained by fine-tuning BART on the training set, with early stopping on the validation set.

Implementation of the reward model. Our reward model is implemented by the encoder part of the BART-base model, followed by a simple two-layer MLP. The output of the reward model is scaled to [0, 1] via the sigmoid function. The input to the reward model is the concatenation of the belief state, system response, and dialogue goal, at each turn of the sampled dialogue rollout. The model outputs the reward of each turn in the dialogue rollout; these per-turn rewards are summed and fed into the losses proposed in Section 3.1. We use the HuggingFace library (Wolf et al., 2019) to implement our reward model. Algorithm 1 illustrates the pipeline of our methods.

Algorithm 1: Pipeline of the proposed reward learning and utilization methods for training E2E ToD agents.
Input: Reward function R_θ(o, a, g), ToD agent π_ϕ, dataset D := {g^(k), (o_t^(k), a_t^(k))_t}_k, number of iterations M_1 and M_2, probabilistic transform function Φ, hyperparameters N, α.
for iteration ∈ {1, ..., M_1} do
    Sample N dialogue trajectories from the dataset D.
    Optimize R_θ via RewardNet (Eq. (2)) or RewardMLE (Eq. (3)).
end for
Fix the reward function R_θ.
for iteration ∈ {1, ..., M_2} do
    Sample a batch of transition tuples (g^(k), (o_t^(k), a_t^(k))) from the dataset D.
    Optimize the ToD agent π_ϕ via Eq. (6).
end for
Output: Trained ToD agent π_ϕ.
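The reward-learning stage of Algorithm 1 can be sketched in plain Python. Eqs. (2) and (3) are not reproduced in this appendix, so the two losses below are generic ListNet-style and ListMLE-style objectives from the learning-to-rank literature, written under the assumption that they resemble the general shape of RewardNet and RewardMLE; they are not the paper's exact equations. All function names and the toy numbers are hypothetical.

```python
import math


def accumulated_reward(per_turn_logits):
    """Sum of per-turn rewards, each squashed to [0, 1] by a sigmoid."""
    return sum(1.0 / (1.0 + math.exp(-z)) for z in per_turn_logits)


def escort(values, power=1):
    """Escort transform: |x_i|^p / sum_j |x_j|^p."""
    weights = [abs(v) ** power for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def listnet_style_loss(scores, rewards, transform=escort):
    """Cross-entropy from the transformed scores (target) to the transformed rewards."""
    target = transform(scores)
    predicted = transform(rewards)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


def listmle_style_loss(scores, rewards):
    """Negative Plackett-Luce log-likelihood (exp transform) of the ordering
    induced by the scores, evaluated on the accumulated rewards."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [rewards[i] for i in order]
    loss = 0.0
    for i in range(len(ranked)):
        tail = ranked[i:]
        top = max(tail)  # subtract the max for a stable log-sum-exp
        log_norm = top + math.log(sum(math.exp(r - top) for r in tail))
        loss -= ranked[i] - log_norm
    return loss


# Hypothetical example: N = 3 sampled trajectories of three turns each.
turn_logits = [[2.0, 1.0, 0.5], [-1.0, 0.0, -0.5], [0.3, 0.2, 0.1]]
scores = [95.0, 40.0, 70.0]  # stand-in for a trajectory-level score S(tau_i)
rewards = [accumulated_reward(t) for t in turn_logits]

net_loss = listnet_style_loss(scores, rewards)
mle_loss = listmle_style_loss(scores, rewards)
```

In an actual implementation the per-turn logits would come from the reward network and the loss would be minimized by gradient descent over the network parameters; the sketch only evaluates the losses for fixed numbers.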

C COMPARISON WITH SOME OTHER REWARD-LEARNING METHODS IN RL-BASED DIALOGUE AGENTS

As an additional discussion on related work, in this section we briefly compare our work with three other reward-learning methods in RL-based dialogue agents, i.e., Saito (2018), Hu et al. (2018), and Li et al. (2020). These three papers all focus on the dialogue-management module of the pipeline design, wherein the action spaces of the agents are abstract dialogue acts rather than the human-language-like system responses as in our paper. The possible system responses are fixed in these three papers. For example, Hu et al. (2018) train the agent to select from "the set of the indices to all available questions in the Q20 game," and Li et al. (2020) have an action space of size 300. As discussed in Section 1, such a pipeline approach requires intensive structural designs, such as determining the possible questions in the Q20 games, and may not enjoy the language plurality and conversation elicitation that our E2E model could offer, e.g., as shown in Table 4. Due to the complexity and the much higher dimension of our action space, namely the system responses, the methods proposed in these three papers are not directly applicable to our setting, as discussed in detail below.

The method proposed in Saito (2018) requires specially designed curriculum data and a hand-crafted decomposition of the entire task into sub-tasks, which are not readily available and are non-trivial to obtain for large-scale multi-domain dialogue corpora such as our tested MultiWOZ 2.0 dataset. The use of progressive neural networks to provide reward information in Saito (2018) incurs additional computation and memory cost and thus may not scale to transformers. Meanwhile, our method scales well to transformers, as shown in our experiments (Section 5). Further, the method in Saito (2018) may only be feasible on the task of constrained information retrieval, but not on some more general tasks such as the booking requirement in the tested MultiWOZ 2.0 dataset. Our experimental results show that our method is capable of such tasks.

Similar to our work, Hu et al. (2018) propose a neural network to approximate the reward function to deliver immediate rewards at each timestep. Apart from the aforementioned simple action space, Hu et al. (2018) use the long-term return G_t as a surrogate indicator of r_{t+1} to train the reward function (Eq. (6) of Hu et al. (2018)), which lacks justification. By contrast, as discussed in Section 2 and Section 3, our method is based on the classical approaches in the learning-to-rank (LTR) literature and extends classical reward learning from preferences to utilize multiple dialogue trajectories simultaneously when optimizing the reward function.

As discussed before, Li et al. (2020) consider a relatively small action space of size 300 and learn the reward model via the GAN structure, which may not stably scale up to a high-dimensional action space such as the system response in our E2E ToD system. The learned reward function in Li et al. (2020) only measures the probability that the input is from the real-data distribution, i.e., it only considers a pair of dialogue state s_t and the corresponding system action a_t. This reward function does not consider the success of the entire dialogue, which is intuitively less favorable to E2E ToD systems. By contrast, our method trains a reward function that aligns with evaluations on the entire dialogue trajectories, which is more directly related to the usage of ToD systems.

We further note that apart from the reward-learning method, our paper also discusses using the Gumbel-softmax trick as a more stable method to train the E2E ToD systems, and conducts a toy experiment in Section 5.2 to illustrate the advantage of the Gumbel-softmax trick over the classical REINFORCE method. This is not covered in the three prior works Saito (2018), Hu et al. (2018), and Li et al. (2020).
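For readers unfamiliar with the Gumbel-softmax trick mentioned above, the following is a minimal sketch of the standard relaxation over a toy three-way categorical distribution. It is a generic illustration under assumed names (`gumbel_softmax_sample`, `temperature`) and does not reproduce the paper's training objective or its toy experiment.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Draw one relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    perturbed = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append((logit + gumbel) / temperature)
    top = max(perturbed)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over three tokens.
sample = gumbel_softmax_sample([1.0, 0.5, -0.2], temperature=0.5, rng=random.Random(0))
```

Because the sample is a smooth function of the logits, gradients from a downstream differentiable reward can flow back to the logits, whereas REINFORCE relies on a score-function estimator that is typically noisier. Lower temperatures push the sample closer to a one-hot vector.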
Finally, these three prior works use online RL methods such as DQN, REINFORCE, and PPO, which require environmental interactions and are thus less practical, as discussed in Section 1. In contrast, our method allows training E2E response-generation models from static datasets by utilizing offline RL techniques (e.g., Levine et al., 2020; Fujimoto & Gu, 2021; Yang et al., 2022a;b;c).

D DETAILED COMPARISON WITH CASPI

The performances of both our methods and CASPI are relatively stable across random seeds. In particular, both of our methods have a higher Combined Score than CASPI on each of the five tested random seeds. This stable improvement of our methods over CASPI aligns with our intuition discussed in Section 3 and our main experimental results in Section 5.1.

We note that both the Average score and the Median score across the tested random seeds are valid metrics for performance comparison. Nevertheless, there is some ambiguity in calculating the "Median" of the Combined Score: should it be the median of the Combined Scores on each random seed, or should it be calculated as Median(Combined Score) ≜ (Median(Inform) + Median(Success)) × 0.5 + Median(BLEU)? The first way aligns better with the definition of "Median," while the second way aligns better with the definition of the Combined Score. This ambiguity disappears when the Average is used as the evaluation metric: since the Combined Score is linear in its components, the average of the per-seed Combined Scores equals the Combined Score of the averaged components. Besides, the Average score is widely used in prior work (e.g., Zhang et al., 2020; Lin et al., 2020; Jang et al., 2022). With these considerations, we choose to report the Average over five random seeds in our experimental section (Section 5).
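The ambiguity in the "Median" of the Combined Score can be checked with a small worked example. The per-seed numbers below are hypothetical and chosen only to make the two calculations disagree; the Combined Score formula is the one stated above.

```python
from statistics import mean, median


def combined_score(inform, success, bleu):
    """Combined Score = (Inform + Success) * 0.5 + BLEU."""
    return (inform + success) * 0.5 + bleu


# Hypothetical results on five random seeds (not taken from any table).
inform = [96.0, 91.0, 84.0, 92.0, 88.0]
success = [70.0, 80.0, 88.0, 76.0, 85.0]
bleu = [15.0, 19.5, 19.0, 16.0, 20.0]

per_seed = [combined_score(i, s, b) for i, s, b in zip(inform, success, bleu)]

# Two readings of the "Median" Combined Score, which disagree here.
median_of_combined = median(per_seed)
combined_of_medians = combined_score(median(inform), median(success), median(bleu))

# The two readings of the "Average" Combined Score always agree, by linearity.
mean_of_combined = mean(per_seed)
combined_of_means = combined_score(mean(inform), mean(success), mean(bleu))
```

On these numbers the median of the per-seed Combined Scores is 105.0, while combining the per-metric medians gives 104.5; the two averages coincide up to floating-point rounding.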

E EXPERIMENTS WITH THE GALAXY

To further demonstrate the efficacy and applicability of our approach, we apply our reward learning and utilization methods to the recently proposed GALAXY backbone (He et al., 2022), which achieves competitive performance on the E2E response-generation task on the MultiWOZ 2.0 dataset. We note that the GALAXY paper does not disclose how many and which random seeds were used to obtain its main results, and the official codebase fixes the random seed as 10. This makes us unsure whether its reported results are only from this single seed of 10. To mitigate the randomness in the optimization process, we re-run the vanilla GALAXY on the five random seeds used to generate our main results and compare it on these seeds with the variants equipped with our methods RewardNet +GS, N = 3, Φ = (·)^1 and RewardMLE +GS, N = 5, Φ = exp(·). Table 6 shows a detailed breakdown of the scores on each of the five random seeds. Note that since we use different random seeds than the original GALAXY paper, we are unable to reproduce its reported scores exactly.

We see from Table 6 that adding our reward learning and utilization methods improves the performance of the vanilla GALAXY in almost all evaluation metrics, in both the Average and the Median scores. In particular, adding our RewardMLE +GS method improves the average Combined Score of the vanilla GALAXY by 3.27, and adding our RewardNet +GS improves the vanilla GALAXY by 2.11. These relatively significant improvements may further demonstrate the effectiveness and general applicability of our proposed methods.

Table 7 compares our two methods with the baselines SimpleTOD and UBAR in the main evaluation (Table 1), which also report results on the MultiWOZ 2.1 dataset. Additionally, we present our rerun of CASPI on this dataset. As in Table 1, the Combined Scores of our methods are generally better than the baselines. In fact, our methods achieve both good task completion (Inform and Success rates) and fluent generated responses (BLEU score).
Table 8 shows a detailed breakdown of the scores of CASPI and our two methods on each of the five tested random seeds. We see that the scores of our methods are generally robust across random seeds. In fact, on four out of those five seeds, at least one of our methods performs better than CASPI. Further, our methods have higher Average and Median scores than CASPI on each of the four evaluation metrics. This set of experiments may further demonstrate the efficacy of our proposed methods. 



We use the belief state, action, and goal as the reward function inputs. The belief state is part of the observation o_t. We also drop the dependency on h_t for R_θ to simplify the reward function learning.

We use the Combined Score (Mehri et al., 2019) as S(τ_i). Detailed definition is delayed to Section 5.

The CASPI paper reports the median score over random seeds, instead of the more commonly used mean score. We run the official CASPI codebase (https://github.com/salesforce/CASPI) and report the mean scores.



Figure 1: Overview of the proposed method. We denote "Accumulated Reward" for the learned accumulated reward, J(·; θ) for the accumulated reward of each trajectory, S(·) for the combined score of each trajectory, ℓ_W for the weighted regularization, ℓ_DST for the DST loss, and "DH" for the dialogue history. In the right panel, (h_t, a_t, g) ∼ D. We use BART for both the reward model and the ToD model.

Figure 2: Line plots comparing the Combined Score when the RewardNet and RewardMLE losses are constructed under a different number of sampled trajectories or different probabilistic transforms. The y-axis represents the Combined Score. p = 1, 2, 3, 4 is the escort transform with power 1, 2, 3, 4. "smax" is the softmax transform. Results are the average over five random seeds.

Figure 4: Bar plot comparing our four models in Table 1. Mean and one standard deviation over five random seeds are shown. "R-Net" denotes RewardNet. "R-MLE" is RewardMLE. "GS" is Gumbel-softmax.

Results of the E2E response-generation task on the MultiWOZ 2.0 dataset. The best result on each metric is bold. The results of UBAR are from the reproduction by Jang et al. (2022). The results of CASPI are from our reproduction. All our provided results are the average over five random seeds. Other results are from the original paper. "GS" denotes the Gumbel-softmax trick. (·)^1 denotes the power function with power 1.

Table 1 compares the performance of our methods with several classical and recent approaches in the E2E response-generation task. As shown in Table 1, our proposed methods not only improve the dialogue-task completion, measured by the Inform rate and the Success rate; but also generate fluent responses, reflected by the competitive BLEU scores. Recall that CASPI is a special case of the RewardNet loss (Eq. (

Results on the simulated low-resource settings, where 5%, 10%, and 20% of the training data is used to train the models. The best result on each metric under each setting is bold. "Comb." is the Combined Score. All our provided results are the average over five random seeds. Baseline results are from Lin et al. (2020).

Generated and original dialogue for ID PMUL4610. "User" denotes User Utterance. "System" denotes true System Response. Red annotations explain unsatisfactory responses and green for satisfactory ones.
System Response: thank you for calling. please let us know if we can assist you in the future. goodbye.

Generated and original dialogue for ID SNG1012. "User" denotes User Utterance. "System" denotes true System Response. Red annotations explain unsatisfactory responses and green for satisfactory ones.
] is a [value price] [value stars] star [value type] in the [value area] with free wifi and parking. would you like me to book it for you?
CASPI: i have [value choice] [value type] that meet your criteria . [value name] and [value name]. both are [value price] and offer free wifi and parking . do you have a preference ?
Ours: i have [value choice] [value type] that have free wifi . i would recommend the [value name]. would you like me to book a room ?

further compares CASPI and our two methods: RewardNet +GS, N = 3, Φ = (·)^1 and RewardMLE +GS, N = 5, Φ = exp(·), showing a detailed breakdown of the scores onto each of the five tested random seeds.

Per random-seed results of the E2E response generation task on the MultiWOZ 2.0 dataset, comparing

Per random-seed results of the E2E response-generation task on the MultiWOZ 2.0 dataset, comparing the vanilla GALAXY and the variants with our proposed methods: RewardNet +GS, N = 3, Φ = (·)^1 and RewardMLE +GS, N = 5, Φ = exp(·). Here, (·)^1 denotes the power function with power 1. "Comb." is the Combined Score. The row "Median" shows the median score of the corresponding column over the five tested random seeds.

To test the efficacy of our proposed methods on additional datasets, we run our two methods RewardNet +GS, N = 3, Φ = (·)^1 and RewardMLE +GS, N = 5, Φ = exp(·) on the MultiWOZ 2.1 dataset. Table 7 compares our two methods with the baselines SimpleTOD and UBAR in the main evaluation (Table

Results of the E2E response-generation task on the MultiWOZ 2.1 dataset. The best result on each metric is bold. The results of SimpleTOD and UBAR are from the original paper. The results of CASPI are from our reproduction. All our provided results are the average over five random seeds. "GS" denotes the Gumbel-softmax trick. (·)^1 denotes the power function with power 1.

Per random-seed results of the E2E response-generation task on the MultiWOZ 2.1 dataset, comparing CASPI and our proposed methods: RewardNet +GS, N = 3, Φ = (·)^1 and RewardMLE +GS, N = 5, Φ = exp(·). Here, (·)^1 denotes the power function with power 1. "Comb." is the Combined Score. The row "Median" shows the median score of the corresponding column over the five tested random seeds.

availability

Source code and checkpoints are publicly released at https://github.com/

Appendix

A EXAMPLES OF THE GENERATED DIALOGUES

Tables 3 and 4 show two case studies comparing the generated responses from our method with those from the baselines GPT-Critic and CASPI. Our method outperforms the baselines in terms of both task completion and the quality of the generated expressions.

G IMPLEMENTATION DETAILS

Our implementation is based on the official codebases of MinTL and CASPI. Apart from the hyperparameters discussed in Section 5.2, most other hyperparameters and the training procedure of our models follow MinTL and CASPI. In addition to the discussion of our method in Section 3, we list the important hyperparameters for training our reward model in Table 9 and the important hyperparameters for training our response-generation model in Table 10. Both BART models use the default token length of 142.

We note that unlike CASPI, which uses the dialogue acts as the actions, the action space of our reward model is the system response, which is a combinatorial space of the vocabulary. We use this action space because we want to pass gradients from the reward model to the E2E response-generation model during the training process, where the output of the E2E model is the human-language-like system response.

For the reward-function learning, we do not change the length of the trajectories in the dataset. The reward model is updated by the preference scores/orderings among multiple trajectories of the same length. Our intuition is that trajectories of the same length may roughly correspond to tasks of similar complexity, making the preference among them more comparable and meaningful. This approach is also taken by the prior work CASPI.

Our tested MultiWOZ 2.0 dataset is publicly available at https://github.com/budzianowski/multiwoz, and the MultiWOZ 2.1 dataset is available at https://github.com/thu-coai/ConvLab-2/tree/master/data/multiwoz.
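The same-length sampling described above can be sketched as follows. This is a minimal illustration, not the released implementation: the trajectory representation (a dictionary with a `turns` list), the helper names, and the choice to pick a length bucket uniformly at random are all assumptions.

```python
import random
from collections import defaultdict


def group_by_length(trajectories):
    """Bucket dialogue trajectories by their number of turns."""
    buckets = defaultdict(list)
    for trajectory in trajectories:
        buckets[len(trajectory["turns"])].append(trajectory)
    return buckets


def sample_same_length(trajectories, n, rng=random):
    """Sample n distinct trajectories that all have the same number of turns."""
    buckets = group_by_length(trajectories)
    eligible = [bucket for bucket in buckets.values() if len(bucket) >= n]
    if not eligible:
        raise ValueError("no length bucket contains at least n trajectories")
    # Assumption: choose the bucket uniformly; other weightings are possible.
    bucket = rng.choice(eligible)
    return rng.sample(bucket, n)


# Hypothetical toy dataset: dialogue ids with 2-turn and 3-turn dialogues.
toy_dataset = [
    {"id": "d1", "turns": ["t1", "t2"]},
    {"id": "d2", "turns": ["t1", "t2", "t3"]},
    {"id": "d3", "turns": ["t1", "t2", "t3"]},
    {"id": "d4", "turns": ["t1", "t2", "t3"]},
    {"id": "d5", "turns": ["t1", "t2"]},
    {"id": "d6", "turns": ["t1", "t2", "t3"]},
]

batch = sample_same_length(toy_dataset, n=3, rng=random.Random(0))
```

With n = 3 only the bucket of three-turn dialogues is eligible in this toy dataset, so every sampled batch compares dialogues of equal length.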

