ONE CANNOT STAND FOR EVERYONE! LEVERAGING MULTIPLE USER SIMULATORS TO TRAIN TASK-ORIENTED DIALOGUE SYSTEMS

Anonymous authors

Abstract

User simulators are agents designed to imitate human users; recent advances have found that Task-oriented Dialogue (ToD) systems optimized toward a user simulator can better satisfy the needs of human users. However, this might result in a sub-optimal ToD system if it is tailored to only one ad hoc user simulator, since human users can behave differently. In this paper, we propose a framework called MUST to optimize ToD systems by leveraging Multiple User SimulaTors. The main challenges of implementing MUST lie in 1) how to adaptively determine which user simulator interacts with the ToD system at each optimization step, since the ToD system might be over-fitted to some specific user simulators and simultaneously under-fitted to others; and 2) how to avoid catastrophically forgetting the adaption to a simulator that is not selected for several consecutive optimization steps. To tackle these challenges, we formulate MUST as a Multi-armed bandits (MAB) problem and provide a method called MUST_adaptive that balances i) a boosting adaption, which adaptively schedules interactions between different user simulators and the ToD system, and ii) a uniform adaption, which avoids the catastrophic forgetting issue. With both automatic and human evaluations, our experimental results on the restaurant search task from MultiWOZ show that the dialogue system trained with our proposed MUST achieves better performance than those trained with any single user simulator. It also generalizes better when tested with unseen user simulators. Moreover, our visualization analysis of convergence speeds shows that MUST_adaptive efficiently leverages multiple user simulators to train the ToD system.

1. INTRODUCTION

Task-oriented dialogue systems aim to help users accomplish various tasks (e.g., restaurant reservations) through natural language conversations. Training task-oriented dialogue systems with supervised learning (SL) approaches often requires a large amount of expert-labeled dialogues; however, collecting these dialogues is usually expensive and time-consuming. Moreover, even with a large amount of dialogue data, some dialogue states may not be explored sufficiently by dialogue systems^1 (Li et al., 2016b). To this end, many researchers build user simulators to mimic human users and generate reasonable and natural conversations. With a user simulator and sampled user goals, we can train the dialogue system from scratch with reinforcement learning (RL) algorithms. Previous works tend to design better user simulator models (Schatzmann et al., 2007; Asri et al., 2016; Gur et al., 2018; Kreyssig et al., 2018; Lin et al., 2021). Notably, Shi et al. (2019) build various user simulators and analyze the behavior of each on the popular restaurant search task from MultiWOZ (Budzianowski et al., 2018).

In real application scenarios, a deployed dialogue system needs to face various types of human users. A single ad hoc user simulator can only represent one user or one group of users, while other users might be under-represented. Instead of choosing the best-performing one among many dialogue systems trained with different single user simulators, we believe it is worth training a dialogue system by leveraging all user simulators simultaneously. In this paper, we propose a framework called MUST to utilize Multiple User SimulaTors simultaneously to obtain a better system agent. There exist several simple ways to implement the MUST framework, including a merging strategy, a continual reinforcement learning (CRL) strategy, and a uniform adaption strategy, denoted as MUST_merging, MUST_CRL, and MUST_uniform respectively (see Sec. 3.2).
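RL training against a pool of simulators can be pictured as a simple episode loop. The sketch below shows one plausible reading of the uniform adaption strategy — sample a simulator uniformly at random per episode; every interface here (`open`, `respond`, `task_success`, the success reward) is an illustrative assumption, not the paper's implementation:

```python
import random

def run_episode(system_policy, simulator, goal, max_turns=10):
    """Roll out one dialogue between a system policy and a user simulator.

    `system_policy` and `simulator` are assumed to expose a `respond`
    method; `simulator.task_success(goal)` checks goal completion.
    These interfaces are illustrative, not from a specific toolkit.
    """
    transcript = []
    user_utt = simulator.open(goal)          # user speaks first
    for _ in range(max_turns):
        sys_utt = system_policy.respond(user_utt)
        transcript.append((user_utt, sys_utt))
        user_utt = simulator.respond(sys_utt)
        if simulator.task_success(goal):
            return transcript, 1.0           # success reward
    return transcript, 0.0                   # failure

def train(system_policy, simulators, goal_generator, episodes=1000):
    """Uniform-adaption-style loop: pick a simulator uniformly per episode."""
    for _ in range(episodes):
        simulator = random.choice(simulators)     # uniform over simulators
        goal = goal_generator()
        transcript, reward = run_episode(system_policy, simulator, goal)
        system_policy.update(transcript, reward)  # any RL update, e.g. REINFORCE
```

The loop treats the simulator pool as interchangeable, which is exactly the weakness discussed next: it spends the same budget on simulators the system has already mastered and on those it still fails against.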
However, none of them can effectively tackle the following challenges: 1) how to efficiently leverage multiple user simulators when training the dialogue system, since the system might be easily over-fitted to some specific user simulators and simultaneously under-fitted to others; and 2) how to avoid the catastrophic forgetting issue. To tackle them effectively, we first formulate the problem as a Multi-armed bandits (MAB) problem (Auer et al., 2002); similar to the exploitation vs. exploration trade-off, scheduling multiple user simulators should trade off a boosting adaption (tackling challenge 1) and a uniform adaption (tackling challenge 2); see Sec. 4.1 for more details. We then implement a new method called MUST_adaptive, which maintains an adaptively-updated distribution over all user simulators and samples from it during RL training of the dialogue system. Our experimental results on the restaurant search task from MultiWOZ, with both automatic and human evaluations, show that the dialogue system trained with our proposed MUST achieves better performance than those trained with any single user simulator. It also generalizes better when tested with unseen user simulators and is more robust to the diversity of user simulators. Moreover, our visualization analysis of convergence speeds demonstrates that MUST_adaptive leverages multiple user simulators more efficiently than MUST_uniform to train dialogue systems.

Our contributions are three-fold: (1) To the best of our knowledge, MUST is the first work to improve the dialogue system by using multiple user simulators simultaneously; (2) We design several ways to implement MUST.
In particular, we formulate MUST as a Multi-armed bandits (MAB) problem, based on which we provide a novel method, MUST_adaptive; and (3) Our results show that dialogue systems trained with MUST consistently outperform those trained with a single user simulator under both automatic and human evaluations; in particular, MUST largely improves the performance of the dialogue system on out-of-domain evaluation. Furthermore, training the system with the proposed MUST_adaptive converges faster than with MUST_uniform.
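The bandit view of simulator scheduling can be illustrated with a toy sampler: treat each user simulator as an arm, boost arms the system currently performs poorly against (the boosting adaption), and mix in a uniform component so no simulator is forgotten (the uniform adaption). The class, its update rule, and the reward scale below are illustrative assumptions, not the paper's actual MUST_adaptive algorithm:

```python
import random

class SimulatorBandit:
    """Toy multi-armed-bandit sampler over user simulators.

    Illustrative only: boost simulators on which the system currently
    performs poorly (low reward -> higher sampling weight) and mix in a
    uniform component so no simulator is forgotten.
    """

    def __init__(self, n_simulators, mix=0.2, lr=0.1):
        self.n = n_simulators
        self.mix = mix            # weight of the uniform component
        self.lr = lr              # step size for the running reward estimate
        self.avg_reward = [0.0] * n_simulators

    def distribution(self):
        # Boosting component: prefer simulators with LOW average reward,
        # i.e. those the system is currently under-fitted to.
        deficits = [max(1.0 - r, 1e-6) for r in self.avg_reward]
        total = sum(deficits)
        boosted = [d / total for d in deficits]
        uniform = 1.0 / self.n
        return [(1 - self.mix) * b + self.mix * uniform for b in boosted]

    def sample(self):
        # Draw the simulator to interact with at this optimization step.
        probs = self.distribution()
        return random.choices(range(self.n), weights=probs, k=1)[0]

    def update(self, arm, reward):
        # Exponential moving average of the dialogue-level reward (e.g.
        # task success in [0, 1]) obtained against simulator `arm`.
        self.avg_reward[arm] += self.lr * (reward - self.avg_reward[arm])
```

Per RL episode one would call `sample()` to pick a simulator, run a dialogue against it, and feed the resulting reward back via `update()`; the `mix` term plays the role of the uniform adaption that guards against catastrophic forgetting.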

2. BACKGROUND

Dialogue System. Task-oriented dialogue systems aim to help users accomplish various tasks, such as restaurant reservations, through natural language conversations. Researchers usually divide a task-oriented dialogue system into four modules (Wen et al., 2017; Ham et al., 2020; Peng et al., 2021): Natural Language Understanding (NLU) (Liu & Lane, 2016), which comprehends the user's intents and extracts slot-value pairs; Dialog State Tracking (DST) (Williams et al., 2013), which tracks the values of slots; Dialog Policy Learning (POL) (Peng et al., 2017; 2018), which decides the dialog actions; and Natural Language Generation (NLG) (Wen et al., 2015; Peng et al., 2020), which translates the dialog actions into natural language. The DST and POL modules are usually collectively referred to as the dialogue manager (DM) (Chen et al., 2017). These modules can be trained independently or jointly in an end-to-end manner (Wen et al., 2017; Liu & Lane, 2018; Ham et al., 2020; Peng et al., 2021; Hosseini-Asl et al., 2020).

User Simulator. The user simulator is also an agent, but it plays the user role. Different from dialogue systems, the user agent has a goal describing a target entity (e.g., a restaurant at a specific location) and should express its goal completely and in an organized way by interacting with the system agent (Takanobu et al., 2020). Therefore, besides NLU, DM, and NLG modules like those of dialogue systems, the user agent has an additional module called the Goal Generator (Kreyssig et al., 2018), which is responsible for generating the user's goal. A user simulator is usually built with either an agenda-based approach (Schatzmann et al., 2007; Schatzmann & Young, 2009), which designs handcrafted rules to mimic user behaviors, or a model-based approach, such as neural networks (Asri et al., 2016; Kreyssig et al., 2018; Gur et al., 2018) learned on a corpus of dialogues.

Training Dialogue Systems with a User Simulator.
At the beginning of a dialogue, the user agent obtains its initial goal from the Goal Generator and then expresses this goal in natural language. The system agent, which does not know the user's goal, should gradually understand the user's utterances, query the database to find matching entities, and provide useful information to see if it is accomplishing



^1 We use "dialogue systems" to refer to task-oriented dialogue systems for simplicity in this paper.

