FANTASTIC REWARDS AND HOW TO TAME THEM: A CASE STUDY ON REWARD LEARNING FOR TASK-ORIENTED DIALOGUE SYSTEMS

Abstract

When learning task-oriented dialogue (ToD) agents, reinforcement learning (RL) techniques can naturally be utilized to train dialogue strategies to achieve user-specific goals. Prior works mainly focus on adopting advanced RL techniques to train the ToD agents, while the design of the reward function is not well studied. This paper aims to answer the question of how to efficiently learn and leverage a reward function for training end-to-end (E2E) ToD agents. Specifically, we introduce two generalized objectives for reward-function learning, inspired by the classical learning-to-rank literature. Further, we utilize the learned reward function to guide the training of the E2E ToD agent. With the proposed techniques, we achieve competitive results on the E2E response-generation task on the MultiWOZ 2.0 dataset.

1. INTRODUCTION

The bloom of pre-trained language models (e.g., Devlin et al., 2018; Lewis et al., 2019; Radford et al., 2019; Zhang et al., 2022c) has significantly pushed the boundaries of natural language processing (NLP) on real-world tasks. Among the many promising applications, one important example is task-oriented dialogue (ToD) systems, which interact with users over multiple turns in natural language to accomplish tasks such as weather inquiry, ticket booking, or schedule planning (Chen et al., 2017; Kwan et al., 2022). Traditionally, the problem of ToD is decomposed into several sub-tasks (Smith & Hipp, 1994; Young et al., 2013): natural language understanding (NLU) for understanding turn-level user intents or slot values (Tur & De Mori, 2011; Casanueva et al., 2020), dialogue state tracking (DST) for tracking the user belief state across multiple dialogue turns (Zhang et al., 2019; Zhu et al., 2020), dialogue management (DM) for choosing system actions to take (Peng et al., 2017; Zhao et al., 2019), and natural language generation (NLG) for mapping system actions to natural language responses (Wen et al., 2015; Zhang et al., 2020). This pipeline approach, however, requires intensive structural design and comprehensive data annotation for model training (Kwan et al., 2022). Recently, there has been growing interest in building end-to-end (E2E) ToD agents, which directly generate responses based on the natural language conversation mixing user utterances and past responses. Apart from this structural simplicity, many E2E ToD models can utilize pre-trained language models and are simply trained by fine-tuning the pre-trained models on ToD datasets in a supervised manner (e.g., Hosseini-Asl et al., 2020; Ham et al., 2020; Lin et al., 2020; Peng et al., 2021).
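The modular pipeline above can be sketched as a composition of four components. The toy rules, templates, and slot values below are illustrative placeholders only, not any system from the literature:

```python
# Minimal sketch of the traditional ToD pipeline: NLU -> DST -> DM -> NLG.
# Each component here is a stand-in for a trained model.

def nlu(user_utterance: str) -> dict:
    """Natural language understanding: extract turn-level intent and slots."""
    # Toy rule; a real system would use a trained classifier/slot tagger.
    if "weather" in user_utterance:
        return {"intent": "weather_inquiry", "slots": {"city": "Austin"}}
    return {"intent": "unknown", "slots": {}}

def dst(belief_state: dict, turn_semantics: dict) -> dict:
    """Dialogue state tracking: accumulate slot values across turns."""
    updated = dict(belief_state)
    updated.update(turn_semantics["slots"])
    return updated

def dm(belief_state: dict) -> str:
    """Dialogue management: choose a system action given the belief state."""
    return "inform_weather" if "city" in belief_state else "request_city"

def nlg(system_action: str) -> str:
    """Natural language generation: map the action to a surface response."""
    templates = {
        "inform_weather": "It is sunny today.",
        "request_city": "Which city are you asking about?",
    }
    return templates[system_action]

def pipeline_turn(belief_state: dict, user_utterance: str):
    """One dialogue turn through the full pipeline."""
    belief_state = dst(belief_state, nlu(user_utterance))
    return belief_state, nlg(dm(belief_state))

state, response = pipeline_turn({}, "What's the weather like?")
```

An E2E agent collapses this composition into a single sequence-to-sequence model that maps the dialogue history directly to the next response, which is why each intermediate module no longer needs its own annotations.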
Due to the intrinsic similarity between dialogues and sequential decision-making, reinforcement learning (RL) methods are naturally employed to train dialogue systems and have achieved some success (e.g., Williams & Young, 2007; Georgila & Traum, 2011; Zhao et al., 2019). Since interacting with users during the training process is mostly impractical, offline RL (Lange et al., 2012; Levine et al., 2020), i.e., RL on static datasets, has recently been adopted to train E2E ToD models (e.g., Jaques et al., 2019; 2020; Ramachandran et al., 2021; Snell et al., 2022a;b; Jang et al., 2022). Although this direction already presents promising empirical results, an open question exists on how to properly design the reward function for the underlying (offline) RL. Existing works (e.g.,

* Equal Contribution. Corresponds to {yihao.ac@gmail.com, shentao.yang@mccombs.utexas.edu}.
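To make the learning-to-rank flavor of reward learning concrete, the following is a minimal sketch of a pairwise (Bradley-Terry-style) objective: a reward model is trained so that a preferred dialogue scores higher than a rejected one. The linear featurization, synthetic preference data, and hyperparameters are assumptions for illustration, not the paper's actual objectives:

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(theta: np.ndarray, features: np.ndarray) -> float:
    """Linear reward model r_theta(x) = theta . phi(x)."""
    return float(theta @ features)

def pairwise_loss_grad(theta, feat_pref, feat_rej):
    """Pairwise ranking objective L = -log sigmoid(r(pref) - r(rej))
    and its gradient with respect to theta."""
    margin = reward(theta, feat_pref) - reward(theta, feat_rej)
    sig = 1.0 / (1.0 + np.exp(-margin))
    loss = -np.log(sig + 1e-12)
    grad = -(1.0 - sig) * (feat_pref - feat_rej)
    return loss, grad

# Synthetic preference data: (preferred features, rejected features) pairs,
# with preferred dialogues shifted to have larger feature values on average.
pairs = [(rng.normal(size=4) + 1.0, rng.normal(size=4)) for _ in range(64)]

theta = np.zeros(4)
for _ in range(200):  # plain full-batch gradient descent
    g = np.mean(
        [pairwise_loss_grad(theta, fp, fr)[1] for fp, fr in pairs], axis=0
    )
    theta -= 0.5 * g

# Fraction of pairs where the preferred dialogue now scores higher.
acc = np.mean([reward(theta, fp) > reward(theta, fr) for fp, fr in pairs])
```

The learned reward can then stand in for a hand-designed one when scoring candidate responses during (offline) RL training of the agent.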

Availability

Source code and checkpoints are publicly released at https://github.com/Shentao-YANG

