AN EQUAL-SIZE HARD EM ALGORITHM FOR DIVERSE DIALOGUE GENERATION

Abstract

Open-domain dialogue systems aim to interact with humans through natural language texts in an open-ended fashion. Despite the recent success of very large dialogue systems such as ChatGPT, using medium-to-small-sized dialogue systems remains the common practice, as they are more lightweight and accessible; however, generating diverse dialogue responses is challenging, especially with smaller models. In this work, we propose an Equal-size Hard Expectation-Maximization (EqHard-EM) algorithm to train a multi-decoder model for diverse dialogue generation. Our algorithm assigns a sample to a decoder in a hard manner and additionally imposes an equal-assignment constraint to ensure that all decoders are well-trained. We provide a detailed theoretical analysis to justify our approach. Further, experiments on two large-scale open-domain dialogue datasets verify that our EqHard-EM algorithm generates high-quality diverse responses.[1]
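To make the equal-assignment idea concrete, the following is a rough sketch, not the paper's exact procedure: given per-sample losses under each decoder, a greedy balanced hard assignment walks (sample, decoder) pairs from cheapest to most expensive, capping each decoder at N/K samples. The function name and greedy strategy are illustrative assumptions.

```python
import numpy as np

def equal_size_hard_assign(losses):
    """Hard-assign each sample to one decoder under an equal-load constraint.

    losses: (N, K) array where losses[i, k] is sample i's loss under
    decoder k. Greedily visits (sample, decoder) pairs in order of
    increasing loss, giving each decoder at most N // K samples.
    Returns a length-N array of decoder indices.
    Illustrative sketch only; assumes K divides N.
    """
    n, k = losses.shape
    assert n % k == 0, "sketch assumes K divides N"
    capacity = [n // k] * k          # remaining slots per decoder
    assign = np.full(n, -1, dtype=int)
    for flat in np.argsort(losses, axis=None):  # cheapest pairs first
        i, j = divmod(int(flat), k)
        if assign[i] == -1 and capacity[j] > 0:
            assign[i] = j
            capacity[j] -= 1
    return assign
```

With losses `[[0.1, 0.9], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4]]`, plain hard EM would compare per-sample minima only, whereas this balanced variant still gives each of the two decoders exactly two samples.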

1. INTRODUCTION

Open-domain dialogue systems aim to generate natural language text utterances to hold open-ended conversations with humans (Li et al., 2017a; Wang et al., 2021b). These systems have shown great success and are seamlessly integrated into our society through chatbots. The recently launched ChatGPT[2] model, for example, has shown remarkable conversational skills. However, ChatGPT is prohibitively large in size and requires significant human feedback during its training process; therefore, training medium-to-small-sized language models without human feedback still remains the common practice (Wu et al., 2021; Wang et al., 2021a; Chen et al., 2022), and these smaller models tend to generate generic responses such as I don't know (Li et al., 2016b; Wang et al., 2021b). One possible cause is the one-to-many mapping phenomenon in the dialogue task, where a dialogue context may correspond to multiple plausible responses (Wei et al., 2019; Bao et al., 2020; Khan et al., 2020). Learning to generate a dialogue utterance is thus analogous to learning a multimodal distribution of a continuous variable, where a mode refers to a peak in the distribution; in the dialogue task, a mode can be thought of as a set of similar responses. The widely used cross-entropy training is not good at capturing different modes, as it encourages the prediction to cover all plausible responses, forcing the model to learn an overly smooth distribution. Consequently, these neural dialogue models resort to generating generic responses (Wei et al., 2019).
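The smoothing effect of cross-entropy under one-to-many targets can be illustrated numerically (a toy sketch; the token distributions below are invented for illustration): when a context has two plausible responses concentrated on different tokens, the single distribution minimizing the average cross-entropy is their mixture, which matches neither response.

```python
import numpy as np

# Two "modes": plausible responses concentrating mass on different tokens
# of a toy 4-word vocabulary (values are illustrative assumptions).
mode_a = np.array([0.9, 0.1, 0.0, 0.0])
mode_b = np.array([0.0, 0.0, 0.1, 0.9])
p_bar = 0.5 * (mode_a + mode_b)  # the averaged target

def avg_cross_entropy(q, eps=1e-12):
    # 0.5 * H(mode_a, q) + 0.5 * H(mode_b, q) = H(p_bar, q)
    return float(-p_bar @ np.log(q + eps))

# By Gibbs' inequality, H(p_bar, q) is minimized at q = p_bar: an overly
# smooth distribution spread across both modes, mirroring how a single
# cross-entropy-trained decoder averages over plausible responses.
```

A multi-decoder model sidesteps this averaging by letting each decoder specialize in one mode instead of covering all of them at once.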



[1] Our code is available at https://github.com/MANGA-UOFA/EqHard-EM
[2] https://openai.com/blog/chatgpt/



Previous studies address generic responses mainly at training or inference time. Training-time approaches explore different training objectives: Li et al. (2016b) apply reinforcement learning to optimize a customized reward function that discourages generic responses. Khan et al. (2020) train a generative adversarial network in the latent space, where generic responses are discouraged because they are easily identified by the discriminator. Among inference-time approaches, Vijayakumar et al. (2016) encourage diverse beam search results by penalizing similar beam candidates. Holtzman et al. (2020) apply nucleus sampling to allow plausible but less probable words to be decoded. Wang et al. (2021b) apply label smoothing to prevent the model from being overly confident in generic responses.
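As a brief sketch of the nucleus sampling idea (not the authors' implementation; the function name and toy distribution are assumptions): decoding keeps only the smallest set of tokens whose cumulative probability exceeds a threshold p, renormalizes, and samples from that set.

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Top-p (nucleus) sampling over a next-token distribution.

    Keeps the smallest prefix of tokens, sorted by descending
    probability, whose cumulative mass reaches at least p, then
    renormalizes and samples one token index from that prefix.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    order = np.argsort(probs)[::-1]          # tokens, most probable first
    cdf = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cdf, p)) + 1  # smallest prefix with mass >= p
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()   # renormalize the nucleus
    return int(rng.choice(keep, p=kept))
```

With `probs = [0.5, 0.3, 0.15, 0.05]` and `p = 0.9`, the nucleus is the top three tokens: the long tail is cut off, yet less probable but plausible tokens can still be decoded.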

