A MIXTURE-OF-EXPERT APPROACH TO RL-BASED DIALOGUE MANAGEMENT

Abstract

Despite recent advancements in language models (LMs), their application to dialogue management (DM) problems and their ability to carry on rich conversations remain a challenge. We use reinforcement learning (RL) to develop a dialogue agent that avoids being short-sighted (outputting generic utterances) and maximizes overall user satisfaction. Most existing RL approaches to DM train the agent at the word level, and thus have to deal with a combinatorially complex action space even for a medium-size vocabulary. As a result, they struggle to produce a successful and engaging dialogue even if they are warm-started with a pre-trained LM. To address this issue, we develop an RL-based DM using a novel mixture-of-expert language model (MoE-LM) that consists of (i) an LM capable of learning diverse semantics for conversation histories, (ii) a number of specialized LMs (or experts) capable of generating utterances corresponding to a particular attribute or personality, and (iii) an RL-based DM that performs dialogue planning with the utterances generated by the experts. Our MoE approach provides greater flexibility to generate sensible utterances with different intents and allows RL to focus on conversation-level DM. We compare it with SOTA baselines on open-domain dialogues and demonstrate its effectiveness both in terms of the diversity and sensibility of the generated utterances and the overall DM performance.

1. INTRODUCTION

With the tremendous advancements in natural language understanding and generation, increasing attention has been directed toward constructing intelligent dialogue agents that can carry out engaging conversations with users. Such interactions can be open-ended, contain different topics, and often involve an underlying task, such as negotiation, information exchange, or recommendation. Therefore, to satisfy the user, a good dialogue agent should not only generate natural responses but also be capable of pursuing the task's objectives and adapting to the user's feedback on the fly. A standard solution is to train the dialogue agent using behavioral cloning, where the agent is a language model (LM) that imitates the utterances in the training set (Gašić et al., 2011; Fatemi et al., 2016). By leveraging deep neural networks, e.g., RNNs (Sutskever et al., 2014) and Transformers (Vaswani et al., 2017), an LM encodes the conversation into a low-dimensional dialogue state and predicts an utterance, but steering such generation for particular purposes remains an open question. Several works have studied ways to fine-tune an LM to generate texts with specific contexts (Ziegler et al., 2019; Ficler and Goldberg, 2017). Other works learn a single steerable LM that is capable of generating utterances for multiple specific intents (Gu et al., 2017; Chen et al., 2018; Subramani et al., 2019; Dathathri et al., 2019). While these LMs produce fluent and relevant responses, it is unclear how to control them to systematically pursue goals during multi-turn dialogue conversations. Another popular approach is to view dialogue management (DM) as a control problem and use reinforcement learning (RL) to optimize the agent's policy (which is often an LM itself). Using RL for dialogue systems has a long history.
Earlier work relies on specific, hand-crafted semantic states (Levin and Pieraccini, 1997; Singh et al., 2002; Walker, 2000) or partially observable belief states (Williams and Young, 2007; Young et al., 2010), in which the agent chooses the best hand-crafted dialogue act at each turn, with the goal of either satisfying the user (Shah et al., 2018), completing the task (Shi and Yu, 2018), or responding to the user's query (Serban et al., 2017a). However, the application of these approaches is limited to problems whose action space can be captured by hand-crafted representations, and they cannot handle complex conversations. On the other hand, more recent approaches use deep learning to extract semantic representations from conversation histories, treat these representations as dialogue belief states, and apply RL to learn a word-level generative DM agent (Jaques et al., 2019; Li et al., 2016; 2017; Shin et al., 2020). However, since there are innumerable possible language utterances, and thus the action space of the RL problem is extremely large, the agent often plans poorly and generates incomprehensible utterances (Zhao et al., 2019). Another issue is that RL only optimizes a scalar reward, while the aforementioned methods often need to optimize for both the quality of the generated utterance, e.g., ease of answering (Li et al., 2016), fluency (Li et al., 2017; 2019), and diversity (Yarats and Lewis, 2018), and the goal, e.g., conversation length (Zhou et al., 2020), user's sentiment (Hancock et al., 2019), and task completion (Verma et al., 2022; Jang et al., 2021). Moreover, defining the reward as a weighted combination of these metrics is not ideal, since the hand-picked weights often do not reflect the underlying success criteria. To address the above issues related to using RL in dialogue management (DM) systems, we propose an RL-based DM agent using a novel mixture of experts (MoE) approach.
Our MoE approach is based on a mixture-of-expert language model (MoE-LM), which consists of three main components: 1) an LM (a probabilistic encoder and a decoder) capable of learning diverse semantics for conversation histories, and as a result generating diverse utterances, which we refer to as the primitive LM or LM_0; 2) a number of specialized LMs (or experts), {LM_i}_{i=1}^m, each of which is constructed using the latent space learned by LM_0, but has been trained such that it is capable of generating utterances corresponding to a certain intent or personality; and 3) an RL-based dialogue manager (DM) that, at each turn, given the latent state shared by the experts {LM_i}_{i=0}^m and the utterance action(s) they suggest, chooses one among them for the agent to execute. Our MoE-LM can be seen as a special case of hierarchical LMs (e.g., Serban et al. 2017a; Zhao et al. 2019; Saleh et al. 2020), but it differs from them because it learns both the LMs (experts) and the DM. Moreover, the DM in MoE-LM is a policy conditioned on both the latent state and the actions suggested by the experts, and not just the state, as is common in hierarchical RL. The primitive LM (LM_0) plays an important role in this model because it learns diverse semantics for conversation histories and allows the agent to generate a wide variety of utterances. This diversity is also shared with the specialized LMs (experts) and gives them flexibility in generating their (more) specialized utterances. Another important feature of MoE-LM is its modularity, which facilitates adding and removing specialized LMs (experts). Moreover, this hierarchical architecture allows us to solve an RL problem with much smaller state and action spaces, which is quite important for the quality of the learned policy.
Finally, since the candidate utterances are generated by experts with different intents, instead of combining all agent-user signals into a single RL reward, our DM agent can focus on optimizing the specific goal of the conversation task. We start the paper with a brief introduction to LMs and the use of Markov decision processes (MDPs) in modeling dialogue management problems in Section 2. We then describe the overall architecture of our MoE-LM in Section 3, followed by the detailed implementation of each of its three main components (described in the above paragraph) in Sections 4 to 6. Finally, in Section 7, we demonstrate the effectiveness of our MoE-LM in open-domain dialogues, in terms of both its ability to generate diverse and sensible utterances and its overall DM performance.

2. PRELIMINARIES

Language Models (LMs) In this work, we employ seq2seq LMs to generate the next utterances in a dialogue. We assume access to a dataset of the form D = {(X^(k), Y^(k))}_{k=1}^{|D|}, where each X = X^(k) is an L-turn conversation history X = {X_l}_{l=0}^{L−1} and Y is its next utterance. We denote by N_X an upper bound on the length (number of tokens) of each utterance X_l in X. The role of an LM is to predict the probability of the next utterance Y, consisting of N tokens, conditioned on the conversation history X, i.e., p(Y = {y_n}_{n=1}^N | X). In the transformer architecture (Vaswani et al., 2017), the LM first encodes the conversation history X using an encoder into an (L × N_X)-length sequence of embeddings {(z_{l,0}, . . . , z_{l,N_X−1})}_{l=0}^{L−1}, where each z_{l,n} is a vector in the latent space. For notational convenience, we concatenate these embeddings into a single embedding z ∈ Z ⊆ R^d and denote the overall dimension of the latent space by d. In the RNN architecture (Serban et al., 2016), the LM's encoder directly maps the conversation history X to a latent state z ∈ Z ⊆ R^d.
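As a concrete illustration of the notational convenience above, here is a minimal numpy sketch; the sizes L, N_X, and d_tok, and the helper name `concat_embeddings`, are hypothetical choices for illustration only.

```python
import numpy as np

# Hypothetical sizes: L past turns, N_X tokens per utterance, d_tok dims per token.
L, N_X, d_tok = 4, 16, 32

def concat_embeddings(token_embeddings: np.ndarray) -> np.ndarray:
    """Flatten an (L, N_X, d_tok) grid of per-token embeddings into a single
    latent vector z in R^d, mirroring the concatenation described above."""
    assert token_embeddings.shape == (L, N_X, d_tok)
    return token_embeddings.reshape(-1)  # d = L * N_X * d_tok

rng = np.random.default_rng(0)
z = concat_embeddings(rng.normal(size=(L, N_X, d_tok)))
print(z.shape)  # (2048,)
```

In the RNN case the encoder would instead produce a single d-dimensional state directly, so no concatenation step is needed.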
In both architectures, the next utterance Ŷ = {ŷ_n}_{n=1}^N is sampled token-by-token from the decoder.
Markov Decision Processes (MDPs) have been used to model dialogue management problems in a variety of settings (Li et al., 2016; Asadi and Williams, 2016; Jaques et al., 2019). In such MDPs, denoted by M = (S, A, P, r, s_0, γ), the state space S represents the tokenized conversation history and the initial state s_0 ∈ S is the initial user's query. The action space A is also the tokenized language space, with each action a ∈ A being the agent's next utterance (a fixed-length, N_X, sequence of tokens). The transition kernel P models the user's response to the action taken by the agent (bot). Finally, the reward function r measures the user's satisfaction. In these MDPs, we can think of the entire LM as a policy that maps conversation histories to next utterances, and solve them by finding a policy π* with maximum expected discounted return, i.e., π* ∈ arg max_π J_π := E[Σ_{t=0}^∞ γ^t r_t | P, s_0, π]. Note that the sizes of the tokenized state and action spaces grow exponentially with the size of the vocabulary. This makes it intractable to solve the MDP even for a medium-size vocabulary. As a result, it would be quite desirable to develop a novel MDP paradigm that is more amenable to RL-based DM systems.
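The discounted-return objective above can be sketched in a few lines; this is an illustrative stand-in only (the class name and the reward numbers are hypothetical, and the user model P and reward r are left to the caller).

```python
from dataclasses import dataclass

# Minimal sketch of the dialogue MDP M = (S, A, P, r, s0, gamma) described above.
# States/actions are token sequences in the paper; here we only model the return.

@dataclass
class DialogueMDP:
    gamma: float  # discount factor

    def discounted_return(self, rewards):
        """J_pi = sum_t gamma^t * r_t for one sampled conversation."""
        return sum(self.gamma ** t * r for t, r in enumerate(rewards))

mdp = DialogueMDP(gamma=0.9)
# Illustrative per-turn user-satisfaction rewards for a 3-turn conversation.
print(mdp.discounted_return([1.0, 0.5, 0.25]))  # ≈ 1.0 + 0.45 + 0.2025 = 1.6525
```

The intractability noted above comes from the fact that each "action" in this MDP is itself a full token sequence, not from the return computation.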
[Figure: the MoE-LM architecture, with latent distributions G_1, . . . , G_m producing the experts' candidate next utterances Ŷ_1, . . . , Ŷ_m.]

3. MIXTURE OF EXPERTS (MOE) LANGUAGE MODEL

We start by explaining how a MoE language model (MoE-LM) can enrich the bot's utterances and improve the overall performance of the DM. While our approach is applicable to any DM system, we use open-domain dialogue (Sankar et al., 2019) as a running example to show how MoE-LM-based agents can improve user satisfaction, measured by an improvement in sentiment or engagement. Intuitively, a good DM agent should possess different behaviors (e.g., inquisitive, explorative, relevant, soothing, empathetic, complimentary, provoking) and swiftly decide which intent to use to pivot a conversation, build rapport, pique the user's interests, improve their mood, etc. To achieve this goal, we require the LM to have a language representation (primitive discovery) that captures different semantics, in order to encode different conversations and avoid generating dull and repetitive responses. We also need a machinery (expert construction) to embed different intents into sub-models of this LM, so that they can behave accordingly when prompted and respond efficiently. Finally, with various candidate utterances available, the DM module of this LM should understand the current level of user satisfaction and determine which response is the most appropriate. Motivated by these observations, we construct our MoE-LM in three steps, as shown in Figure 1. We give the main idea behind each step here and leave their detailed descriptions to Sections 4, 5, and 6. Step 1: Primitive Discovery. We first employ the dataset D, introduced in Section 2, and learn a language model LM_0 = (φ, G_0, ψ) consisting of a stochastic encoder (i.e., an encoder φ together with a latent distribution G_0 that maps the encoded conversation into a distribution over the latent space) and a decoder ψ.
The stochastic encoder (φ, G_0) comprises an encoder φ that maps tokenized conversation histories X to a latent space Z ⊆ R^d, i.e., z = φ(X) ∈ Z, which is then used to construct a parameterized d-dimensional Gaussian distribution G_0(z′|z) = N(μ_0(z), σ_0²(z) I_{d×d}) over R^d. The decoder ψ predicts the next utterance Ŷ_0 (token-by-token) conditioned on the point z′ sampled from the latent distribution, i.e., ψ(Ŷ_0|z′), z′ ∼ G_0(·|z). We denote by LM_0(Y|X) := E_{z′∼G_0(·|z), z=φ(X)}[ψ(Y|z′)] the primitive and learn it using a loss function that, in addition to predicting the next utterance accurately, encourages diversity and generalization in the learned latent space Z (see Eq. 1 and Fig. 2). As we will explain in Section 4, our loss function is inspired by those in prior work, and more specifically by the one in OPAL (Ajay et al., 2020), an unsupervised learning method for discovering primitive skills in trajectories that are used by downstream RL tasks. Step 2: Expert Construction. Given the latent space Z, the stochastic encoder (φ, G_0), and the decoder ψ learned in Step 1, we now learn m latent distributions {G_i}_{i=1}^m, each defined as G_i(z′|z) = N(μ_i(z), σ_i²(z) I_{d×d}). Intuitively, each G_i corresponds to an attribute, e.g., an intent or a personality (in the case of a chatbot), and generates samples in specific parts of the latent space Z. This results in m LMs, {LM_i}_{i=1}^m, LM_i = (φ, G_i, ψ), each of which corresponds to a specialized version of the original LM, LM_0, and serves as an expert in our MoE-LM. Upon receiving a conversation history X, each expert LM_i generates one (or more) candidate(s) for the next utterance Ŷ_i in certain parts of the language space that are compatible with its attribute (personality). As we will explain in Section 5, each G_i is learned using a loss function that encourages its corresponding LM, LM_i, to generate utterances consistent with its attribute (see Eq. 2). Step 3: Dialogue Manager (DM).
The dialogue manager, denoted by μ, takes as input the encoded conversation history z = φ(X) and the candidate action utterances generated by the experts {Ŷ_i}_{i=0}^m, and selects one of them as the action for the bot to execute, i.e., Ŷ ∼ μ(· | z, {Ŷ_i}_{i=0}^m). We will describe how the DM is trained using reinforcement learning (RL) in Section 6.
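The three steps above can be sketched end-to-end in plain numpy. Everything here is an illustrative stand-in, not the paper's trained models: the encoder `phi`, decoder `psi`, the experts' Gaussian parameters, and the uniform-random dialogue-manager choice are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # latent dimension (illustrative)

def phi(history):
    """Stand-in encoder: conversation history -> latent state z."""
    return rng.normal(size=d)

def psi(z_prime):
    """Stand-in decoder: latent sample z' -> next-utterance string."""
    return f"utterance({z_prime[:2].round(2)})"

class Expert:
    """LM_i = (phi, G_i, psi): shares phi/psi, owns a Gaussian G_i over the latent space."""
    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def propose(self, z):
        z_prime = z + self.mu + self.sigma * rng.normal(size=d)  # z' ~ G_i(.|z)
        return psi(z_prime)

def dialogue_manager(z, candidates):
    # The DM policy conditions on BOTH the latent state and the experts'
    # suggested utterances; here it just picks uniformly as a placeholder.
    return candidates[rng.integers(len(candidates))]

experts = [Expert(mu=np.full(d, 0.5 * i), sigma=0.1) for i in range(3)]
z = phi(["Hi there!", "Hello, how are you?"])
candidates = [e.propose(z) for e in experts]
print(dialogue_manager(z, candidates))
```

The key structural point is that only the small Gaussian heads differ across experts, while the (expensive) encoder and decoder are shared.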

4. PRIMITIVE DISCOVERY IN MOE-LM

Motivated by literature in the reinforcement and imitation learning fields (Ajay et al., 2020), we propose to learn the primitive LM, LM_0, in our MoE-LM by solving the following KL-constrained optimization problem that aims at capturing diverse semantics: min_{(φ,G_0,ψ),ρ} Ê_{z′∼ρ(·|z,Y), z=φ(X)}[−log ψ(Y|z′)], s.t. Ê_{z=φ(X)}[KL(ρ(z′|z,Y) || G_0(z′|z))] ≤ ε_KL, (1) where Ê is the empirical expectation over (X, Y) in the dataset D, ρ is a distribution over the latent space conditioned on the encoded conversation history z and the target utterance Y, and ε_KL is a positive real-valued threshold. Using (1), we learn LM_0 = (φ, G_0, ψ) by maximizing the log-likelihood, while enforcing consistency between the latent variable z′ predicted by G_0(·|z) and ρ(·|z, Y) via the KL constraint. The distribution ρ(·|z, Y) is a Gaussian N(μ_ρ(z, φ_ρ(Y)), σ_ρ²(z, φ_ρ(Y)) I_{d×d}), in which φ_ρ is a pre-trained encoder for the target utterance Y, and the mean μ_ρ(·,·) and variance σ_ρ²(·,·) are trainable models. One reason for using a separate encoder φ_ρ for the target utterance Y is to avoid overfitting (i.e., to avoid back-propagating gradients of φ with Y as input).
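A standard way to optimize such a KL-constrained problem is a Lagrangian-style relaxation. Under that (assumed) penalty form with coefficient β, and writing φ for the encoder, ψ for the decoder, and ρ, G_0 for the latent distributions as above, (1) becomes a single unconstrained objective:

```latex
\min_{(\phi, G_0, \psi),\, \rho}\;
\hat{\mathbb{E}}_{z' \sim \rho(\cdot \mid z, Y),\, z = \phi(X)}
\bigl[ -\log \psi(Y \mid z') \bigr]
\;+\; \beta\,
\hat{\mathbb{E}}_{z = \phi(X)}
\bigl[ \mathrm{KL}\bigl( \rho(z' \mid z, Y) \,\|\, G_0(z' \mid z) \bigr) \bigr]
```

A larger β enforces the constraint more tightly, playing the role of a smaller effective ε_KL.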

Connection to VAE-like objectives

In practice, we implement the KL constraint in (1) as a penalty weighted by an appropriately chosen coefficient β. Thus, one may interpret the objective in (1) as a variation of β-VAE (Burgess et al., 2018). Due to the connection to VAEs, one may draw similarities between our method and existing dialogue approaches such as VHRED (Serban et al., 2017b) and VHCR (Park et al., 2018). However, we emphasize that there are key differences, and these may be explained by first understanding how the objective in (1) encourages diversity, which is key to good primitive learning. Namely, it is important that primitive discovery learns an encoder-decoder (φ, ψ) that can be modulated by the choice of z′; i.e., changing z′ while fixing X should lead to different distributions over generated utterances. The objective in (1) encourages this diversity by conditioning the latent variable z′ on both the target utterance Y and z = φ(X), i.e., z′ ∼ ρ(·|z, Y). In contrast, the KL constraint is used to make sure that the stochastic encoder G_0(·|z) of our primitive LM does not vary too much for different Y, and thus has a limiting effect on diversity. For example, in the extreme when ε_KL = 0 (or β → ∞ when the constraint is used as a regularizer), there will be no specialization of the latent space for different Y. Although β → ∞ is an extreme case, degenerate behavior can also happen when β = 1, i.e., in the traditional variational loss used by VHRED and VHCR. Specifically, it is well known that the traditional VAE loss is an upper bound on the negative log-likelihood of the data under a stochastic encoder-decoder parameterization. Thus, if the data can be modeled by a single LM, then a VAE-optimal decoder can simply ignore G_0, leading to a degenerate latent space, as observed in previous work (Park et al., 2018). This is precisely the reason that, in our approach, we weaken the KL constraint (ε_KL ≫ 0 or, equivalently, β ≪ 1).
This enables our approach to more reliably guarantee that a unique z′ represents each distinct conversation pair (X, Y), thus capturing diverse semantic modalities and enabling easier downstream specialization. In the mathematical results below, we formalize this claim, namely, that the log-likelihood objective in (1) leads to a learned pair (φ, ψ) that can easily recover any arbitrary desired LM by specializing the latent distribution G. We begin with a definition that characterizes the coverage of an arbitrary LM on the conditional conversation data distribution P_D(Y|X). Definition 1. LM_{D,ξ} is a ξ-common LM of data D if E_D[TV(LM_{D,ξ}(Y|X) || P_D(Y|X))] ≤ ξ. Leveraging Theorem 4.1 in Ajay et al. (2020), we now present the theoretical result characterizing the representational power of our primitive encoder-decoder pair (φ, ψ) on data D. Lemma 1. Let (φ, ρ, ψ) be the solution to (1) with Ê_{z′∼ρ(·|z,Y), z=φ(X)}[−log ψ(Y|z′)] = ε. Then there exists LM := (φ, G, ψ) such that E_D[TV(LM_{D,ξ}(Y|X) || LM(Y|X))] ≤ ξ + √((ε + H)/2), where G(z′|z) = E_{Y∼D}[ρ(z′|z, Y)], and H = E_D[log P_D(Y|X)] is a constant depending on D. The result above shows that, as long as LM_{D,ξ} is ξ-common in D, there exists a specialization of the latent space G that, when paired with (φ, ψ), can approximately recover LM_{D,ξ}. The quality of the approximation is a function of ε (how well the objective in (1) was optimized) and ξ. In practice, we construct the primitive by replacing G with G_0, i.e., LM_0 = (φ, G_0, ψ), because G_0(z′|z) can be viewed as a distillation of ρ(z′|z, Y). This theoretical result also motivates the next section, where we explain our algorithm's Step 2: Expert Construction. Specifically, we show how to use the trained encoder-decoder pair (φ, ψ) to learn a spectrum of different specialized experts parameterized by different latent distributions G_i.
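Returning to the penalty implementation of the KL constraint: assuming diagonal Gaussians for both ρ(·|z, Y) and G_0(·|z), the penalized loss can be sketched numerically. The function names, β value, and all numbers below are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) )."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def primitive_loss(nll, mu_rho, var_rho, mu_g0, var_g0, beta):
    # Negative log-likelihood of Y under the decoder plus the beta-weighted KL
    # penalty; beta << 1 corresponds to the weakened KL constraint discussed above.
    return nll + beta * kl_diag_gauss(mu_rho, var_rho, mu_g0, var_g0)

mu, var = np.zeros(4), np.ones(4)
print(kl_diag_gauss(mu, var, mu, var))                        # 0.0 for identical Gaussians
print(primitive_loss(2.3, mu, var, mu + 1.0, var, beta=0.1))  # 2.3 + 0.1 * 2.0 = 2.5
```

With unit variances and a unit mean gap in each of the 4 dimensions, the KL term is 0.5 · 4 = 2.0, so a small β keeps the penalty from dominating the likelihood term.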

5. EXPERT CONSTRUCTION WITH PLUG-AND-PLAY LANGUAGE MODELS

To complete the MoE framework, one needs to systematically create a gamut of different experts LM_i, ∀i ∈ {1, . . . , m}, each generating candidate utterances with a different intent. By viewing each expert as a distribution over particular behaviors in the conversation data D, we leverage the results of Section 4 and Lemma 1 and adopt a universal encoder-decoder (φ, ψ) shared among all the experts. Therefore, each expert i is only parameterized by an arbitrary d-dimensional latent distribution (e.g., Gaussian), and it samples certain regions of the latent space Z. Following the terminology of Dathathri et al. (2019), these experts can all be categorized as plug-and-play language models (PPLMs). Creating experts is convenient because it only requires learning new latent distributions, while switching between experts amounts to sampling from a different distribution. Denote by ℓ_i(X, Y) ∈ R a real-valued label that characterizes the intent of expert i ∈ {1, . . . , m}, e.g., determined by an off-the-shelf sentiment classifier. We train the latent distribution G_i(z′|z) of expert i by solving the optimization problem min_{G_i} Ê_{z′∼G_i(·|z), z=φ(X), Y∼ψ(·|z′)}[−ℓ_i(X, Y)]. (2) Unlike the weighted maximum likelihood approach considered in Dathathri et al. (2019), which assigns weight ℓ_i to training samples that correspond to expert i, we propose to learn each expert via reward maximization and treat ℓ_i as a reward signal w.r.t. expert i to be maximized. Interestingly, this approach is also linked to reinforcement learning (RL), in which both the "state" and "action" spaces are the latent space Z, and the "policy" is the latent distribution G_i. The main benefit of our approach is that it does not require the target utterance Y from data D and is thus less vulnerable to data-imbalance issues in D on certain intents. Notice from (2) that the reward-maximization problem is myopic, i.e., the above RL problem has a discount factor of 0.
The main motivation is that, unlike dialogue management, which is a sequential decision-making problem, here we want each expert to exhibit particular behaviors, and this can readily be done via greedy maximization; long-term dialogue optimization will be handled by the dialogue manager rather than the experts. For example, in the case of a Gaussian $G_i$, we use the standard REINFORCE (Sutton et al., 1999) algorithm to learn the model parameters $(\mu_i, \sigma_i^2)$ of $G_i$ according to
$$\{\mu_i, \sigma_i\} \leftarrow \{\mu_i, \sigma_i\} + \alpha \cdot \mathbb{E}_{z' \sim G_i(\cdot \mid z),\, Y \sim \psi(\cdot \mid z')}\left[\ell_i(X, Y) \cdot \nabla_{\{\mu_i, \sigma_i\}} \log P_{G_i}(z' \mid z)\right], \quad i \in \{1, \ldots, m\},$$
where $\alpha > 0$ is the learning rate. To reduce the variance of these estimates, we also adopt the baseline-reduction technique (Greensmith et al., 2004) from policy gradient, which replaces $\ell_i(X, Y)$ with $\bar{\ell}_i(X, Y) := \ell_i(X, Y) - \mathbb{E}_{Y \sim \psi(\cdot \mid \phi(X))}[\ell_i(X, Y)]$. Following arguments from Lemma 1 and Lemma 4.0.1 in Ajay et al. (2020), we now quantify the sub-optimality of expert $\mathrm{LM}_i$.
Corollary 1. Denote the $i$-th reward-maximizing objective by $L_i(\mathrm{LM}) := \widehat{\mathbb{E}}_{Y \sim \mathrm{LM}(\cdot \mid X)}[\ell_i(X, Y)]$. Suppose an optimal LM for this objective, $\mathrm{LM}_{i,\xi} \in \arg\max_{\mathrm{LM}} L_i(\mathrm{LM})$, is $\xi$-common in $\mathcal{D}$. Moreover, let $G_i^\star$ be in the $\arg\min$ of (2). Then with expert $\mathrm{LM}_i = (\phi, G_i^\star, \psi)$ and $(\epsilon, H)$ from Lemma 1, we have
$$|L_i(\mathrm{LM}_i) - L_i(\mathrm{LM}_{i,\xi})| \le 2\|\ell_i\|_\infty \cdot \Big(\xi + \sqrt{\tfrac{1}{2}(\epsilon + H)}\Big).$$
While it may be obvious that optimizing $G_i$ w.r.t. (2) encourages expert $\mathrm{LM}_i$ to capture the behaviors encouraged by $\ell_i$, this corollary has two further implications: (i) since the sub-optimality of $\mathrm{LM}_i$ relative to the oracle $\mathrm{LM}_{i,\xi}$ is bounded by the quantity $\epsilon$ defined in Lemma 1, it justifies using the primitive $(\phi, \psi)$, which optimizes $\epsilon$, for expert construction; (ii) the sub-optimality further depends on $\xi$, which quantifies how well $\mathrm{LM}_{i,\xi}$ is represented in the original dataset $\mathcal{D}$.
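As a concrete sketch, one REINFORCE step with baseline reduction for a diagonal-Gaussian latent policy might look as follows. This is a simplified, unconditional version (the paper's latent distribution conditions on the encoded history), so treat it as an illustration of the update rule rather than the exact training loop.

```python
import numpy as np

def reinforce_update_gaussian(mu, sigma, z_samples, rewards, lr=0.01):
    """One REINFORCE step for a diagonal-Gaussian latent policy.

    mu, sigma:  current parameters, shape [d]
    z_samples:  latent actions z' drawn from N(mu, sigma^2), shape [n, d]
    rewards:    intent scores for the utterances decoded from z', shape [n]
    """
    # Baseline reduction: subtract the mean reward to lower gradient variance.
    adv = rewards - rewards.mean()
    # Score functions of a diagonal Gaussian w.r.t. mu and sigma.
    grad_mu = (z_samples - mu) / sigma**2                     # [n, d]
    grad_sigma = ((z_samples - mu)**2 - sigma**2) / sigma**3  # [n, d]
    # Ascend the (reward-weighted) log-likelihood.
    mu = mu + lr * (adv[:, None] * grad_mu).mean(axis=0)
    sigma = sigma + lr * (adv[:, None] * grad_sigma).mean(axis=0)
    return mu, np.maximum(sigma, 1e-3)  # keep sigma strictly positive
```

Because the reward maximization is myopic (discount factor 0), a single bandit-style gradient step like this is all each expert needs per batch.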

6. REINFORCEMENT LEARNING FOR MOE-LM DIALOGUE MANAGER

We now describe the dialogue manager (DM) of our MoE-LM and propose RL algorithms to train it. As mentioned in Section 3, the DM is a policy $\mu$ that takes the encoded conversation history $z = \phi(X)$ and the $m+1$ candidate action utterances $\{\hat{Y}_i\}_{i=0}^{m}$ generated by the experts, and stochastically selects one of them to execute, i.e., $\hat{Y} \sim \mu(\cdot \mid z, \{\hat{Y}_i\}_{i=0}^{m})$. Note that each expert $i \in \{0, \ldots, m\}$ is a LM, $\mathrm{LM}_i$, that acts as a policy $\pi_i(\cdot \mid X)$ and maps each conversation history $X$ to an utterance $\hat{Y}_i$. With this architecture, we address the large size of the state and action spaces of the original MDP, which grows exponentially with the size of the vocabulary. As described in Section 2, the state and action spaces of the original MDP are the tokenized conversation history and the tokenized language space, respectively, whereas here the DM chooses among $m+1$ actions (a much smaller, finite action space) given the latent space $\mathcal{Z}$ (a continuous state space) of encoded conversations. It is important to note that our MoE-LM is different from other hierarchical architectures (Kulkarni et al., 2016) in which the high-level decision is to choose a low-level controller based only on the current state of the system. In MoE-LM, the DM observes both the current state and the actions suggested by the experts, and then chooses one among them. We consider two RL settings to solve this specialized MDP. The first is offline RL, in which the policy must be learned from the collected conversations $\mathcal{D}$ without further (online) interactions. Offline RL requires optimizing a policy while minimizing the deviation from the behavior policy to avoid errors due to data covariate shift. Among numerous offline RL algorithms (Kumar et al., 2020; Carta et al., 2021; Jaques et al., 2019), one effective algorithm for learning the DM policy $\mu$ is IQL (Kostrikov et al., 2021).
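The selection step of such a DM policy over the $m+1$ expert candidates can be sketched as follows; `q_fn` is a hypothetical stand-in for a learned critic that scores a candidate utterance given the encoded history, and the softmax branch is one simple way to make the selection stochastic.

```python
import numpy as np

def dm_select(z, candidates, q_fn, temperature=None, rng=None):
    """Pick one of the m+1 expert candidates.

    Greedy (temperature=None) returns the argmax of the critic scores;
    otherwise a softmax over scores gives a stochastic DM policy."""
    scores = np.array([q_fn(z, y) for y in candidates])
    if temperature is None:
        return int(scores.argmax())            # greedification
    p = np.exp((scores - scores.max()) / temperature)
    p /= p.sum()
    return int(rng.choice(len(candidates), p=p))
```

The key point the code makes explicit: the DM's action space is the index set of candidates, not the vocabulary, so any standard discrete-action RL method applies.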
Given conversation data $\mathcal{D}$, IQL first computes the critic functions $(Q_{\theta^\star}(z, a), V_{\omega^\star}(z))$ by solving
$$\min_{\theta, \omega} \; \mathbb{E}_{(z, a, r, z_+) \in \mathcal{D}}\left[(r + \gamma V_{\omega}(z_+) - Q_{\theta}(z, a))^2\right] + \lambda \cdot \mathbb{E}_{(z, a) \in \mathcal{D}}\left[L_2^{\tau}(Q_{\theta}(z, a) - V_{\omega}(z))\right],$$
where $z$ is the encoded conversation, $a$ is the bot utterance, $z_+$ is the next encoded conversation, $r$ is the conversation-level reward, $\lambda > 0$ is a tunable weight, and $L_2^{\tau}$ is the expectile regression operator (Koenker and Hallock, 2001) that estimates the top-$\tau$ expectile statistics of the Q-function random variable (approximated by the value function $V$). It then extracts the DM policy $\mu$ via greedification over the finite set of MoE candidate utterances:
$$\mu(a \mid z, \{\hat{Y}_i\}_{i=0}^{m}) = \arg\max_{a \in \{\hat{Y}_i\}_{i=0}^{m}} Q_{\theta^\star}(z, a).$$
Intuitively, IQL leverages the generalization capacity of the critic functions to estimate the value of the best action without directly querying the values of unseen actions. This makes it less conservative than most offline RL methods, which either constrain the policy's actions to be in-distribution or solve a behavior-regularized policy optimization problem. The second RL setting for learning the DM policy $\mu$ is model-based RL (MBRL) (Shah et al., 2018; Wei et al., 2018). While this paradigm can be applied to any online/offline RL algorithm, we demonstrate it with simple DQN (Mnih et al., 2013). Here we first learn a user utterance model $P_{\mathrm{user}}(X_+ \mid X, a) := \mathbb{E}_{z = \phi_{\mathrm{user}}([X, a])}[\psi_{\mathrm{user}}(X_+ \mid z)]$ via maximum likelihood, then generate data $\mathcal{D}_{\mathrm{MB}}$, whose next state $\hat{s}_+$ encodes the next conversation generated from roll-outs together with the corresponding candidate actions, solve the Bellman error minimization
$$\min_{\theta} \sum_{(s, a, r, \hat{s}_+) \in \mathcal{D}_{\mathrm{MB}}} \Big(r + \gamma\, Q_{\theta_{\mathrm{target}}}\big(\hat{s}_+, \arg\max_{a_+ \in \{0, \ldots, m\}} Q_{\theta}(\hat{s}_+, a_+)\big) - Q_{\theta}(s, a)\Big)^2,$$
and obtain the DM policy $\mu$ via the aforementioned greedification step. The benefit of MBRL over offline RL is two-fold.
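The asymmetric expectile loss at the heart of the IQL value update reduces to a few lines; the sketch below assumes the standard form $L_2^{\tau}(u) = |\tau - \mathbb{1}\{u < 0\}| \cdot u^2$:

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Expectile regression loss used by IQL to fit V toward an upper
    expectile of Q. `diff` holds samples of Q(z, a) - V(z)."""
    # Positive residuals (Q above V) are up-weighted by tau, negative
    # residuals by (1 - tau), so V tracks the top-tau expectile of Q.
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float((weight * diff**2).mean())
```

With `tau` close to 1, V approaches the max of Q over in-distribution actions, which is exactly what makes the greedification step sound without querying unseen actions.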
First, one can easily obtain a high-fidelity user utterance model (Peng et al., 2020) by simply fine-tuning a large LM (e.g., GPT-3 (Floridi and Chiriatti, 2020)). Second, with sufficient dialogue roll-outs that capture many different scenarios, MBRL is generally more data-efficient and less prone to distributional shift than offline RL.
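The Bellman target used above, in which the online network selects the expert action and the target network evaluates it, can be sketched as follows; this is an illustrative fragment under assumed array shapes, with `gamma=0.8` matching the effective horizon used later in the experiments.

```python
import numpy as np

def moe_dqn_target(r, q_next_online, q_next_target, gamma=0.8):
    """Double-DQN-style Bellman target over the m+1 expert actions.

    r:             rewards, shape [batch]
    q_next_online: Q_theta(s_+, .) for each candidate, shape [batch, m+1]
    q_next_target: Q_target(s_+, .) for each candidate, shape [batch, m+1]
    """
    a_star = q_next_online.argmax(axis=1)                 # select with online net
    bootstrap = q_next_target[np.arange(len(r)), a_star]  # evaluate with target net
    return r + gamma * bootstrap
```

Because the action set is only the m+1 candidates, the argmax is over a handful of entries rather than the whole vocabulary.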

7. EXPERIMENTS

We evaluate our MoE approach on two open-domain benchmarks that are common in the RL-based dialogue management literature (Jaques et al., 2019). The first is the Cornell Movie corpus (Danescu-Niculescu-Mizil and Lee, 2011), which consists of conversations between speakers in different movie lines and has a median conversation length of 3 utterances. The second is Reddit Casual (Ghandeharioun et al., 2019), a subset of the Reddit corpus that contains only casual conversations on various topics with at least 3 turns and a median of 7 utterances. We conduct several experiments to test the efficacy of different parts of the MoE-LM, namely (i) the predictive power and diversity of the primitive, (ii) the quality of the experts, and (iii) the overall DM performance. For each metric, we report mean ± standard error over 100 conversations sampled from the evaluation set. We also ran an ablation study on 4 transformer-based MoE-LMs, namely MoE-1, MoE-2, MoE-3, and MoE-4, to understand how performance is affected by different model architectures, language encoders, and latent generators. MoE-1 and MoE-2 use a simpler architecture, while MoE-3 and MoE-4 use the same encoder architecture as BERT (Devlin et al., 2018). MoE-1 uses much smaller latent distribution models $\{G_i\}$ than MoE-2; MoE-3 uses the pre-trained BERT encoder, while MoE-4 trains it from scratch. Details of these models can be found in Appendix B.3. EXP 1: Comparing Primitive Models. We compare the quality of the latent representations learned by the 4 MoE-LMs (via optimizing Eq. 1) and 2 baselines (a standard Transformer (Vaswani et al., 2017) and VHRED (Serban et al., 2017b)).
To assess their quality, for each test conversation we generate 25 utterances and report the following 3 metrics: (i) Diversity: one minus the $\ell_1$-sparsity (Hurley and Rickard, 2009) of the singular values of the embedded utterances, i.e.,
$$\mathrm{Diversity}(\{\hat{Y}_i\}) := 1 - \frac{\sqrt{d} - \|\sigma\|_1 / \|\sigma\|_2}{\sqrt{d} - 1} \in [0, 1],$$
where $\sigma := \mathrm{SVD}(\{\phi_{\mathrm{SE}}(\hat{Y}_i)\}_{i=1}^{25})$ denotes the singular values and $\phi_{\mathrm{SE}}$ is a pre-trained sentence encoder (e.g., USE (Cer et al., 2018)); (ii) Dist-{1, 2, 3} (Li et al., 2015): the ratio of unique {1, 2, 3}-grams in the generated utterances; (iii) Perplexity: the perplexity score of the utterance w.r.t. a GPT-2 LM, which correlates better with human judgments of semantic fluency (Pang et al., 2020). These metrics measure both accuracy and semantic diversity. Qualitatively, we also measure the fluency and diversity of the LMs using human ratings (see Appendix B.8 for details). The results of these experiments are reported in Tables 1 and 6. In comparison with the baselines (Transformer and VHRED), generally (i) transformer-based LMs outperform VHRED due to their attention mechanism, which explicitly encodes sequential semantic information, and (ii) the MoE-LMs achieve substantially better diversity without sacrificing much accuracy (i.e., their perplexity scores remain quite low). Qualitatively, the sample utterances generated by the Transformer are closer to the targets than those by MoE-2 and MoE-4, likely because the Transformer tends to memorize the corpus (Kharitonov et al., 2021). In contrast, the MoE-LMs generate utterances that either share similar contexts with the targets but are paraphrased, or share similar structures but differ in context, demonstrating their generalizability. Human evaluations also show that MoE-2 and MoE-4 generate more diverse utterances while retaining sufficient semantic fluency. Among the different MoE-LMs, MoE-2 and MoE-4 perform best; in particular, MoE-4 has better diversity while MoE-2 has lower perplexity.
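Assuming the Hoyer-style $\ell_1$-sparsity above, the diversity metric can be computed directly from the stacked sentence embeddings; this is an illustrative sketch, not the paper's exact implementation:

```python
import numpy as np

def diversity(embeddings):
    """Diversity of a set of utterance embeddings (one row per utterance):
    one minus the l1-sparsity (Hurley & Rickard, 2009) of the singular
    values of the embedding matrix. Returns a value in [0, 1]."""
    s = np.linalg.svd(embeddings, compute_uv=False)  # singular values
    d = len(s)
    sparsity = (np.sqrt(d) - s.sum() / np.linalg.norm(s)) / (np.sqrt(d) - 1)
    return 1.0 - sparsity
```

Identical utterances give a rank-1 embedding matrix (diversity 0), while mutually orthogonal embeddings spread the singular values evenly (diversity 1).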
This corroborates our hypotheses that (i) jointly training the encoder and decoder with Eq. 1 appears necessary to encourage semantic diversity (as opposed to using a pre-trained BERT encoder, which maximizes likelihood), and (ii) sufficient representation power is necessary for $G_0$ to match the posterior distribution $\rho$ in order to capture different semantics in $\mathcal{D}$. In Fig. 2a and 2b, we visualize the latent space of 400 conversation data samples for both the Transformer and MoE-4. The latent states of MoE-4 are much more dispersed across the embedding space, implying that most conversations are encoded uniquely. In contrast, the latent space of the Transformer contains many clusters, suggesting it is more prone to generating similar utterances even for different input conversations and is thus less generalizable. EXP 2: Quality of Experts. We compare the performance of the experts learned by the 4 MoE-LMs (where experts are separately trained by optimizing Eq. 2) and 2 baselines (WD (Holtzman et al., 2018) and PPLM (Dathathri et al., 2019)). To study the sub-optimality gap in Corollary 1, we also include the performance of Transformer-based end-to-end expert LMs that are individually optimized with REINFORCE (Li et al., 2016), using the expert labels $\{\ell_i\}$ as rewards. Inspired by Ghandeharioun et al. (2019) on how bot behaviors are characterized, we use sentiment-, emotion-, and behavior-based label functions to define the experts. Compared with the baseline LMs, the experts created under the MoE-LM framework, especially under MoE-2 and MoE-4, generally better capture all the different language intents (whereas WD and PPLM capture negative sentiments and emotions much more effectively than behaviors), demonstrating the efficacy of our approach, which constructs specialized experts on a diverse language space via reward maximization (instead of weighted MLE). Human evaluations also show that MoE-4 is most effective at generating semantically fluent utterances that possess a wide range of expert characteristics.
Similar to the ablation study in EXP 1, the experts associated with MoE-2 and MoE-4 are also among the best at capturing language intents. Interestingly, on the Reddit data the experts in MoE-4 perform best, while with much less data (Cornell) the best experts are built upon the simpler MoE-2 architecture. We conjecture this difference is due to over-fitting issues faced by the larger LMs (MoE-4) when there is insufficient data for expert fine-tuning. In Fig. 2c and 2d, we visualize the latent space of the sentiment-based experts in MoE-4, each with 400 conversation data samples. Notice that the sentiment experts' latent distributions are clearly separated (because positive and negative sentiments exhibit opposite behaviors), while the emotion experts' latent distributions show more gradual separations and even some overlaps (because, e.g., joy and optimism are quite similar, while joy and anger are quite different). This validates that our MoE-LM represents different behaviors in separate regions of the latent space and justifies our structural prior of modeling each expert as a specialized version of the primitive LM whose latent distribution focuses on particular regions. EXP 3: MoE-RL Against DialoGPT Simulated Users. We compare the dialogue management performance of MoE-LMs whose MoE-based DMs $\mu$ are trained with different methods, (i) IQL (Kostrikov et al., 2021), (ii) Ensemble DQN (Carta et al., 2021), and (iii) KL-control (Jaques et al., 2019), against 3 standard RL-based DM baselines using the VHRL architecture (Saleh et al., 2020): (i) REINFORCE (Li et al., 2016), (ii) KL-control, and (iii) SAC (Haarnoja et al., 2018). Based on the results on expert quality in EXP 2, we pick the MoE-2 and MoE-4 frameworks for the Cornell and Reddit tasks, respectively. For systematic evaluation, we have these RL agents interact with a DialoGPT (Zhang et al., 2019) simulated user environment for a maximum of 5 turns.
The DM task is to maximize total user satisfaction at the conversation level, measured by both (i) the user's overall sentiment and (ii) the user's sentiment transition. To construct an immediate reward that serves as a surrogate for user satisfaction, we set
$$r(X, a, X_+) = \lambda_1\, \ell_{\mathrm{sent}}(X_+) + \lambda_2 \Big(\ell_{\mathrm{sent}}(X_+) - \frac{1}{L} \sum_{l=0}^{L-1} \ell_{\mathrm{sent}}(X_l)\Big),$$
where $(\lambda_1, \lambda_2)$ are linear combination weights. We attribute the strong DM performance of the MoE-LMs to three factors: (i) MoE-LM restricts the action space to a smaller set of candidate utterances generated by the experts (whose quality is validated in EXP 2); the corresponding RL problem then becomes simpler and requires less data (especially on Cornell) to solve. (ii) Unlike the baseline RL methods, which need to optimize both bot and user signals, the MoE DM agents focus on optimizing the user-satisfaction goal and are therefore more effective. (iii) Among the different RL settings, MBRL, which first learns a user utterance model (using the same encoder as the primitive LM and a separate decoder for user-utterance prediction) and then performs DM, performs much better than the offline RL methods, which only moderately improve upon the primitive LM (the behavior policy). IQL-based dialogue managers are among the best across different settings, potentially because IQL is more robust to covariate shift than standard RL methods, e.g., Ens-Q and SAC, and yet is less conservative than behavior-regularized algorithms, e.g., KLC. Interestingly, our MoE-LMs also have lower (better) perplexity scores than the other methods. This may be because MoE-LM uses the pre-trained encoder and decoder from the primitive LM, which are optimized for generalization and accuracy, while the other RL methods may distort their language representations to create utterances that maximize reward but become less natural.
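A minimal sketch of this surrogate reward, assuming the weights $(\lambda_1, \lambda_2) = (0.75, 0.25)$ reported in the experiment details and a sentiment labeler that returns scores in $[-1, 1]$:

```python
import numpy as np

def satisfaction_reward(sent_next, sent_history, lam1=0.75, lam2=0.25):
    """Surrogate user-satisfaction reward: a weighted sum of the user's
    current sentiment and the sentiment transition (current sentiment
    minus the average sentiment over the previous L utterances).

    sent_next:    sentiment score of the user's next utterance, in [-1, 1]
    sent_history: sentiment scores of the previous utterances
    """
    transition = sent_next - np.mean(sent_history)
    return lam1 * sent_next + lam2 * transition
```

The transition term rewards the agent for improving the user's mood relative to the recent past, not just for hitting an absolutely positive sentiment.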

8. CONCLUDING REMARKS

We developed a mixture-of-expert (MoE) approach to RL-based dialogue management (DM). Our MoE language model (MoE-LM) comprises three main components: (i) a LM that can generate diverse semantics for conversation histories, (ii) a number of specialized LMs (or experts) that can produce utterances corresponding to a particular attribute or intent, and (iii) a RL-based DM that performs dialogue planning with the utterances generated by the experts. To understand how well our MoE approach generates diverse and sensible utterances and solves DM problems, we evaluated it on two open-domain dialogue tasks and compared it with SOTA baselines. Our results showed that MoE-LM (i) improves the diversity of text generation, (ii) can generate utterances with specific intents, and (iii) yields better overall performance. We consider our work a step toward creating steerable LMs that possess different intents and toward training RL-based DMs that can carry on rich conversations. Future work includes improving the language representation with information-theoretic approaches, fine-tuning the experts based on the DM objective, extending the RL agent to track users' behaviors (via abstract belief states) and plan upon them, preventing RL dialogue agents from generating harmful behaviors (via enforcing safety constraints), and evaluating our MoE-LM on more realistic problems, such as information retrieval, recommendation, and negotiation.



If the actual utterance $X_l$ has fewer tokens than $N_X$, it will be padded by a specific token and masked. Note that we use $Y$ to denote the next utterance in the dataset and $\hat{Y}$ the one predicted by the LM. In practice, we can use both latent states as input to the decoder model $\psi(\hat{Y}_0 \mid z', z)$. For simplicity, we assume that each expert generates only a single candidate utterance at each step; it would be straightforward to extend this to multiple (and even a varying number of) candidate utterances. The VHRED model implementation is identical to that in Jaques et al. (2019) to ensure fair comparisons.



Figure 1: (Left) MoE-LM architecture. (Right) Sample utterance workflow generated by an MoE-LM trained with Reddit data. Step 1: $\phi$ encodes the conversation history. Step 2: the latent generators $G_i$, $\forall i$, generate candidate bot utterances. Step 3: $\mu$ selects the bot response via Q-score ranking & post-processing.

(Appendix A.1); sample utterances generated by these LMs can be found in Appendix A.3, and human evaluations of the diversity and fluency of the different LMs are given in the appendix.

Figure 2: Latent Space Visualizations. Figures (a) and (b) compare two primitive representations. Figures (c) and (d) illustrate how experts (of different sentiments and emotions) are represented by latent clusters.

Accuracy (Perplexity) and Diversity of Language Primitive Experts Trained with Reddit.

Performance (w.r.t. User Satisfaction in Conversation) of MBRL-based DM Trained with Reddit.

Quality of Each Expert PPLM Trained on the Reddit Casual Dataset w.r.t. its Trained Label. Trans. RL Corresponds to Individually Optimized LMs Using the Expert Labels $\{\ell_i\}$ as Rewards.

Phase 2 Raters Evaluation (Reddit Casual Models).

The linear combination weights $(\lambda_1, \lambda_2) = (0.75, 0.25)$ follow Ghandeharioun et al. (2019), and $\ell_{\mathrm{sent}}(X)$ is the same RoBERTa-based sentiment labeler as in EXP 2, which assigns a score in $[-1, 1]$ that is proportional to the positive-sentiment prediction probability and inversely proportional to the negative-sentiment prediction probability. To ensure the baseline RL DM methods can also possess certain bot-level features, e.g., questions, positive sentiment, etc., besides the above RL reward for user satisfaction we also optimize a linear combination of bot-based rewards when training the baseline models; see Appendix B of Saleh et al. (2020) for more details. Since the DM problem lasts at most 5 turns, we use this as the effective horizon and set $\gamma = 1 - 1/5 = 0.8$. The results of the above experiments (performed in both offline RL and MBRL settings) are reported in Table 2, Table 7 (Appendix A.1), and Tables 9 and 10 (Appendix A.2), with sample utterances reported in Appendix A.11. Our experiments show that MoE-LMs outperform most baselines on DM performance.

