AN EQUAL-SIZE HARD EM ALGORITHM FOR DIVERSE DIALOGUE GENERATION

Abstract

Open-domain dialogue systems aim to interact with humans through natural language text in an open-ended fashion. Despite the recent success of very large dialogue systems such as ChatGPT, using medium-to-small-sized dialogue systems remains common practice, as they are more lightweight and accessible; however, generating diverse dialogue responses is challenging, especially with smaller models. In this work, we propose an Equal-size Hard Expectation-Maximization (EqHard-EM) algorithm to train a multi-decoder model for diverse dialogue generation. Our algorithm assigns a sample to a decoder in a hard manner and additionally imposes an equal-assignment constraint to ensure that all decoders are well trained. We provide a detailed theoretical analysis to justify our approach. Further, experiments on two large-scale open-domain dialogue datasets verify that our EqHard-EM algorithm generates high-quality, diverse responses.

1. INTRODUCTION

Open-domain dialogue systems aim to generate natural language utterances to hold open-ended conversations with humans (Li et al., 2017a; Wang et al., 2021b). These systems have shown great success and are seamlessly integrated into our society through chatbots. The recently launched ChatGPT model, for example, has shown remarkable conversational skills. However, ChatGPT is prohibitively large and requires significant human feedback during training; therefore, training medium-to-small-sized language models without human feedback remains the common practice (Wu et al., 2021; Wang et al., 2021a; Chen et al., 2022). Unfortunately, these smaller models tend to generate generic responses such as I don't know (Li et al., 2016b; Wang et al., 2021b). One possible cause is the one-to-many mapping phenomenon in the dialogue task, where a dialogue context may correspond to multiple plausible responses (Wei et al., 2019; Bao et al., 2020; Khan et al., 2020). Learning to generate a dialogue utterance is thus analogous to learning a multimodal distribution of a continuous variable, where a mode refers to a peak in the distribution; in the dialogue task, a mode can be thought of as a set of similar responses. The widely used cross-entropy training is not good at capturing different modes, as it encourages the prediction to cover all plausible responses, forcing the model to learn an overly smooth distribution. Consequently, these neural dialogue models resort to generating generic responses (Wei et al., 2019).

Shen et al. (2019) propose to train a multi-decoder mixture model with the expectation-maximization (EM) algorithm (Dempster et al., 1977) to address the one-to-many mapping phenomenon. However, their direct application of the standard EM algorithm suffers from the decoder-collapsing problem, where the multi-decoder model degenerates into a single-decoder model. The authors attempt to alleviate this problem by disabling the dropout layers during the E-step of the EM algorithm, but their results show that the model is still susceptible to collapses. To this end, we propose a novel EM variant for multi-decoder diverse dialogue generation. A standard EM algorithm assigns a sample to all decoders weighted by the posterior probability, and is thus known as Soft-EM; it suffers from synchronous-training collapse, where the posterior probabilities tend to be similar and all decoders are trained similarly. The Hard-EM variant trains only the decoder with the highest posterior probability; it suffers from non-training collapse due to the rich-gets-richer phenomenon. Our proposed EM algorithm avoids both types of collapse by adopting hard assignments and imposing an equal-assignment constraint. Hence, we call our approach Equal-size Hard EM (EqHard-EM).

We conducted experiments on the Weibo (Gao et al., 2019) and OpenSubtitles (Lison et al., 2018) datasets. Results show that our EqHard-EM algorithm alleviates the decoder-collapsing problem present in both Soft-EM and Hard-EM, allowing the multi-decoder model to specialize and generate high-quality and diverse responses. In addition, we provide in-depth theoretical and empirical analyses to better understand our EqHard-EM algorithm.

To sum up, our contributions are three-fold: 1) We propose the EqHard-EM algorithm to alleviate the collapse issues in the soft and hard EM variants; 2) We provide a detailed theoretical analysis to justify our equal assignment; and 3) We conduct extensive empirical experiments to show the effectiveness of our approach.

2. APPROACH

In this section, we first present our mixture model architecture (Subsection 2.1). Then, we propose EqHard-EM, a novel EM variant, for training such mixture models (Subsection 2.2).

2.1. NEURAL ARCHITECTURE

We propose to address the one-to-many mapping phenomenon in the dialogue task with a mixture model. Given a dialogue context, we use a shared encoder to build the input utterance's hidden representations, based on which multiple decoders generate a set of output responses. It is possible to instantiate each decoder with a full Transformer model (Shen et al., 2019), but this leads to a large number of parameters that may not fit into the memory of an ordinary GPU. To this end, we propose a multi-adapter architecture (Figure 1a), where the decoders share most Transformer parameters but differ in a few inserted, thin adapter layers (Houlsby et al., 2019). In fact, the adapter model is widely used for parameter-efficient domain adaptation (Artetxe et al., 2020; Wang et al., 2021a; Ngo Trung et al., 2021). We propose to apply this architecture to the mixture model for diverse dialogue generation.
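To make the architecture concrete, below is a minimal PyTorch sketch of a bottleneck adapter and of routing a sample through one of several decoder-specific adapters attached to a shared sub-layer. The class names, the bottleneck size, and the use of a plain linear layer as the shared component are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch: bottleneck adapters (Houlsby et al., 2019) on top of a
# shared sub-layer, so K "decoders" differ only in their thin adapters.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project, nonlinearity, up-project, plus a residual connection."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class MultiAdapterLayer(nn.Module):
    """One shared sub-layer followed by K decoder-specific adapters."""
    def __init__(self, shared: nn.Module, d_model: int, num_decoders: int):
        super().__init__()
        self.shared = shared  # shared Transformer sub-layer (stand-in here)
        self.adapters = nn.ModuleList(
            Adapter(d_model) for _ in range(num_decoders))

    def forward(self, x: torch.Tensor, decoder_id: int) -> torch.Tensor:
        # All decoders reuse the shared weights; only the adapter differs.
        return self.adapters[decoder_id](self.shared(x))

# Example: route the same hidden states through decoder 0 vs. decoder 2.
layer = MultiAdapterLayer(nn.Linear(512, 512), d_model=512, num_decoders=4)
h = torch.randn(2, 10, 512)  # (batch, length, d_model)
out0, out2 = layer(h, 0), layer(h, 2)
```

Since only the adapters are decoder-specific, adding a decoder costs a few thin layers rather than a full Transformer stack, which is what keeps the mixture model within an ordinary GPU's memory.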

Our code is available at https://github.com/MANGA-UOFA/EqHard-EM

ChatGPT: https://openai.com/blog/chatgpt/

Previous studies address generic responses mainly at training or inference time. Training-time approaches explore different training objectives: Li et al. (2016b) apply reinforcement learning to optimize a customized reward function that discourages generic responses, and Khan et al. (2020) train a generative adversarial network in the latent space, where generic responses are discouraged because they are easily identified by the discriminator. Among inference-time approaches, Vijayakumar et al. (2016) encourage diverse beam search results by penalizing similar beam hypotheses, Holtzman et al. (2020) apply nucleus sampling to allow plausible but less probable words to be decoded (see the sketch below), and Wang et al. (2021b) apply label smoothing to prevent the model from being overly confident in generic responses.
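As an illustration of one such inference-time technique, the following is a minimal PyTorch sketch of nucleus (top-p) sampling: it samples only from the smallest set of tokens whose cumulative probability exceeds p, so plausible but less probable words can still be decoded. The threshold p = 0.9 and the function name are illustrative choices, not values from the cited work.

```python
# Minimal sketch of nucleus (top-p) sampling (Holtzman et al., 2020).
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    # Keep tokens until cumulative mass exceeds p (the top token always stays).
    cutoff = int(torch.searchsorted(cum, torch.tensor(p)).item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize
    return int(sorted_ids[torch.multinomial(kept, 1)].item())

# Example: sample a token id from a toy vocabulary of size 10.
print(nucleus_sample(torch.randn(10)))
```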

Figure 1: (a) Our multi-adapter neural architecture. (b) The equal-size hard assignment scheme. Dashed circles: decoders are conceptually duplicated when we solve the assignment problem.
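To illustrate the duplication trick in Figure 1b, below is a minimal sketch of an equal-size hard assignment step, assuming a batch of B samples, K decoders with B divisible by K, and the negative log-likelihood of each sample under each decoder as the assignment cost. The function name and the use of SciPy's linear-assignment solver are our illustrative assumptions, not necessarily the paper's exact procedure.

```python
# Minimal sketch of an equal-size hard E-step: duplicate each decoder
# B // K times so a standard assignment solver yields an equal-size,
# hard sample-to-decoder assignment (cf. Figure 1b).
import numpy as np
from scipy.optimize import linear_sum_assignment

def equal_size_hard_assign(nll: np.ndarray) -> np.ndarray:
    """Return assign[b] = decoder index for sample b, with every decoder
    receiving exactly B // K samples; nll has shape (B, K)."""
    B, K = nll.shape
    assert B % K == 0, "batch size must be divisible by the number of decoders"
    # Duplicate each decoder's cost column B // K times -> a B x B cost matrix.
    cost = np.repeat(nll, B // K, axis=1)
    rows, cols = linear_sum_assignment(cost)  # minimizes total NLL
    # Map duplicated columns back to the original decoder indices.
    assign = cols // (B // K)
    return assign[np.argsort(rows)]

# Example: 8 samples, 2 decoders -> each decoder gets exactly 4 samples.
rng = np.random.default_rng(0)
print(equal_size_hard_assign(rng.random((8, 2))))
```

Because each decoder's cost column is duplicated B // K times, the solver is forced to give every decoder exactly B // K samples, which is precisely the equal-assignment constraint that prevents the rich-gets-richer collapse of plain Hard-EM.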

