FANTASTIC REWARDS AND HOW TO TAME THEM: A CASE STUDY ON REWARD LEARNING FOR TASK-ORIENTED DIALOGUE SYSTEMS

Abstract

When learning task-oriented dialogue (ToD) agents, reinforcement learning (RL) techniques can naturally be utilized to train dialogue strategies to achieve user-specific goals. Prior works mainly focus on adopting advanced RL techniques to train the ToD agents, while the design of the reward function is not well studied. This paper aims at answering the question of how to efficiently learn and leverage a reward function for training end-to-end (E2E) ToD agents. Specifically, we introduce two generalized objectives for reward-function learning, inspired by the classical learning-to-rank literature. Further, we utilize the learned reward function to guide the training of the E2E ToD agent. With the proposed techniques, we achieve competitive results on the E2E response-generation task on the MultiWOZ 2.0 dataset.

1. INTRODUCTION

The bloom of pre-trained language models (e.g., Devlin et al., 2018; Lewis et al., 2019; Radford et al., 2019; Zhang et al., 2022c) has significantly pushed the boundaries of natural language processing (NLP) on real-world tasks. Among the many promising applications, one important example is task-oriented dialogue (ToD) systems, which interact with users over multiple turns in natural language to accomplish tasks such as weather inquiry, ticket booking, or schedule planning (Chen et al., 2017; Kwan et al., 2022). Traditionally, the problem of ToD is decomposed into several sub-tasks (Smith & Hipp, 1994; Young et al., 2013): natural language understanding (NLU) for understanding turn-level user intents or slot values (Tur & De Mori, 2011; Casanueva et al., 2020), dialogue state tracking (DST) for tracking the user belief state across multiple dialogue turns (Zhang et al., 2019; Zhu et al., 2020), dialogue management (DM) for choosing system actions to take (Peng et al., 2017; Zhao et al., 2019), and natural language generation (NLG) for mapping system actions to natural language responses (Wen et al., 2015; Zhang et al., 2020). This pipeline approach, however, requires intensive structural design and comprehensive data annotation for model training (Kwan et al., 2022). Recently, there has been growing interest in building end-to-end (E2E) ToD agents, which directly generate responses based on the natural-language conversation mixing user utterances and past responses. Apart from this structural simplicity, many E2E ToD models can utilize pre-trained language models and are trained simply by fine-tuning the pre-trained models on ToD datasets in a supervised manner (e.g., Hosseini-Asl et al., 2020; Ham et al., 2020; Lin et al., 2020; Peng et al., 2021).
Due to the intrinsic similarity between dialogues and sequential decision-making, reinforcement learning (RL) methods are naturally employed to train dialogue systems and have achieved some success (e.g., Williams & Young, 2007; Georgila & Traum, 2011; Zhao et al., 2019). Since interacting with users during the training process is mostly impractical, offline RL (Lange et al., 2012; Levine et al., 2020), i.e., RL on static datasets, has recently been adopted to train E2E ToD models (e.g., Jaques et al., 2019; 2020; Ramachandran et al., 2021; Snell et al., 2022a;b; Jang et al., 2022). Although this direction already presents promising empirical results, an open question remains: how to properly design the reward function for the underlying (offline) RL. Existing works (e.g., Wu et al., 2019c; Jang et al., 2022; Snell et al., 2022b) manually design a sparse reward function that only indicates whether the agent achieves the goal or not. Unfortunately, due to the delayed feedback, learning from such a sparse reward signal is itself challenging for RL agents (Andrychowicz et al., 2017; Liu et al., 2019; Durugkar et al., 2021). When applied to train the more complicated ToD agents, the sparse reward signal can lead to poor empirical performance (Takanobu et al., 2019; Wang et al., 2020a).

To address this issue, we aim at answering the following question in this paper: how can we efficiently learn a reward function and leverage it for training E2E dialogue agents? We answer the first half of this question by introducing two reward-learning objectives, RewardNet and RewardMLE, based on the classical learning-to-rank literature (Cao et al., 2007; Xia et al., 2008). Our desideratum is a reward function that can "explain" some non-trivial preference-based ordering among multiple alternative dialogue trajectories, thus potentially allowing the resulting RL-trained ToD agents to achieve better-than-demonstration performance.
We accomplish this goal by learning a parameterized reward function on dialogue turns, from which the accumulated reward of a dialogue trajectory can reflect the preference among multiple alternatives. We answer the second half of the question by utilizing the learned reward function to guide the training of the E2E ToD system, with special considerations on the training stability. With these answers to the above question, we achieve competitive results on the E2E response-generation task on the widely-used dialogue benchmark MultiWOZ 2.0 (Budzianowski et al., 2018) . Several ablation studies and analyses are conducted to provide further insights into the proposed techniques.
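RewardNet and RewardMLE are developed formally later in the paper; as background intuition for how listwise learning-to-rank objectives (ListNet, Cao et al., 2007; ListMLE, Xia et al., 2008) can relate a turn-level reward function to a preference ordering over whole trajectories, here is a minimal sketch. The function names are illustrative, not the paper's implementation; trajectories are scored by the sum of their turn-level rewards, and the losses encourage that score ordering to match the given preference ordering.

```python
import math

def trajectory_return(turn_rewards):
    """Score a trajectory by summing its (learned) turn-level rewards."""
    return sum(turn_rewards)

def listnet_top1_loss(returns, preferred_idx):
    """ListNet-style top-one loss (Cao et al., 2007): the probability that the
    preferred trajectory ranks first is a softmax over trajectory returns."""
    m = max(returns)  # subtract max for numerical stability
    exps = [math.exp(r - m) for r in returns]
    return -math.log(exps[preferred_idx] / sum(exps))

def listmle_loss(returns, order):
    """ListMLE-style loss (Xia et al., 2008): negative log-likelihood of a full
    preference ordering `order` (best first) under a Plackett-Luce model."""
    loss = 0.0
    remaining = list(range(len(returns)))
    for idx in order:
        m = max(returns[i] for i in remaining)
        exps = {i: math.exp(returns[i] - m) for i in remaining}
        loss -= math.log(exps[idx] / sum(exps.values()))
        remaining.remove(idx)
    return loss
```

A reward model trained with either loss assigns larger accumulated rewards to better-ranked trajectories, which is exactly the property needed when the learned rewards are later used as the RL training signal.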

2. BACKGROUND

Task-oriented dialogue as reinforcement learning. We formulate the ToD system as a partially observable Markov decision process (POMDP) (Kaelbling et al., 1998), specified by $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, R, \gamma \rangle$, where state $s \in \mathcal{S}$ consists of the previous dialogue history $h$ and the user's intended goal $g$ specified prior to the start of the dialogue; $o \in \mathcal{O}$ is the observation, which can be the user utterance; action $a \in \mathcal{A}$ can be the system response or dialogue act; $\mathcal{P}(s' \mid s, a)$ is the underlying transition probability; $R(h, a, g)$ is the intermediate reward function for taking action $a$ under dialogue history $h$ and goal $g$; and $\gamma \in [0, 1]$ is the discount factor. The dialogue history $h_t$ at timestep $t$ consists of all the previous observations and actions, i.e., $h_t \triangleq \{o_0, a_0, \ldots, o_{t-1}, a_{t-1}, o_t\}$. Since the ToD agent cannot directly observe the user goal $g$, it makes decisions based on the entire dialogue history $h_t$ so far. Specifically, the policy $\pi$ is defined as a mapping from $h_t$ to a probability distribution over $\mathcal{A}$, i.e., $\pi \triangleq \pi(a_t \mid h_t)$. The training objective is to find a policy $\pi$ that maximizes the expected (discounted) cumulative reward $J(\pi) \triangleq \mathbb{E}_{\mu_g, \pi, \mathcal{P}}\left[\sum_{t=0}^{T} \gamma^t R(h_t, a_t, g)\right]$, where $\mu_g$ is the distribution of goals and $T$ is the number of turns in the dialogue trajectory.

Reward design and learning in ToD systems. Unlike classical RL problems, where the intermediate reward function is well designed and provided, in ToD systems we can only get the evaluation results at the end of the dialogue (Budzianowski et al., 2018).
Consequently, most existing works adopt a manually designed intermediate reward function that only gives a binary reward indicating whether the dialogue agent achieves the goal or not (e.g., Weisz et al., 2018; Wu et al., 2019c; Jang et al., 2022):

$$R(h_t, a_t, g) = \begin{cases} R_{\mathrm{const}} \ (\text{or } 0), & \text{if goal } g \text{ is achieved at timestep } t, \\ -R_{\mathrm{const}}, & \text{if goal } g \text{ is not achieved at timestep } t, \end{cases}$$

where $R_{\mathrm{const}}$ is a positive constant that can simply be $1$. However, such a sparse reward signal can be one of the reasons that ToD agents trained by RL often perform poorly (Takanobu et al., 2019; Wang et al., 2020a). A similar issue is also observed in goal-oriented RL (Andrychowicz et al., 2017). To address the above issue, a few recent works focus on learning an intermediate reward function from demonstrations or mechanical dialogue assessments (e.g., Wang et al., 2020a; Ramachandran et al., 2021), inspired by reward learning from preferences in RL (e.g., Christiano et al., 2017; Brown et al., 2019; 2020). More precisely, suppose we are given two dialogue trajectories $\tau_i$ and $\tau_j$, each taking the form $\tau_i \triangleq \{g^{(i)}, (o_0^{(i)}, a_0^{(i)}), \ldots, (o_T^{(i)}, a_T^{(i)})\}$. We want to learn a
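To make the contrast concrete, the sparse goal-based reward above and the pairwise preference-learning alternative (a Bradley-Terry comparison model, as in Christiano et al., 2017) can be sketched as follows. This is an illustrative reconstruction, not the paper's code, and the helper names are hypothetical; trajectory returns stand in for the accumulated output of a learned turn-level reward model.

```python
import math

def sparse_goal_reward(goal_achieved, r_const=1.0):
    """Manually designed binary reward: +r_const on success, -r_const otherwise."""
    return r_const if goal_achieved else -r_const

def discounted_return(rewards, gamma=1.0):
    """Accumulated reward sum_t gamma^t * r_t of one dialogue trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def pairwise_preference_loss(return_i, return_j):
    """Bradley-Terry negative log-likelihood that trajectory i is preferred
    over trajectory j, given their accumulated (learned) rewards:
    P(tau_i > tau_j) = exp(R_i) / (exp(R_i) + exp(R_j))."""
    m = max(return_i, return_j)  # subtract max for numerical stability
    p_i = math.exp(return_i - m) / (math.exp(return_i - m) + math.exp(return_j - m))
    return -math.log(p_i)
```

Under the sparse scheme, every turn before the final one contributes zero learning signal; the preference loss instead shapes a turn-level reward model so that trajectory-level reward sums respect the observed preference ordering.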

Availability

Source code and checkpoints are publicly released at https://github.com/

