DIFFUSEQ: SEQUENCE TO SEQUENCE TEXT GENERATION WITH DIFFUSION MODELS

Abstract

Recently, diffusion models have emerged as a new paradigm for generative models. Despite their success in domains using continuous signals such as vision and audio, adapting diffusion models to natural language is under-explored due to the discrete nature of texts, especially for conditional generation. We tackle this challenge by proposing DIFFUSEQ: a diffusion model designed for sequence-to-sequence (SEQ2SEQ) text generation tasks. Upon extensive evaluation over a wide range of SEQ2SEQ tasks, we find that DIFFUSEQ achieves comparable or even better performance than six established baselines, including a state-of-the-art model that is based on pre-trained language models. Apart from quality, an intriguing property of DIFFUSEQ is its high diversity during generation, which is desired in many SEQ2SEQ tasks. We further include a theoretical analysis revealing the connection between DIFFUSEQ and autoregressive/non-autoregressive models. Bringing together theoretical analysis and empirical evidence, we demonstrate the great potential of diffusion models in complex conditional language generation tasks.

1. INTRODUCTION

Among existing generative models, GANs (Goodfellow et al., 2014) suffer from training instability (Salimans et al., 2016) and are subject to mode collapse (Metz et al., 2017); VAEs (Kingma & Welling, 2014) have to rely on surrogate objectives to approximate maximum likelihood training; and flow-based models (Dinh et al., 2017) have to use specialized architectures to construct reversible transforms. Diffusion models (Ho et al., 2020; Nichol & Dhariwal, 2021) have circumvented several of these limitations and emerged as a new paradigm for generative models, theoretically underpinned by non-equilibrium thermodynamics (Sohl-Dickstein et al., 2015) and score-matching networks (Song & Ermon, 2019). To date, the major breakthroughs are in domains using continuous signals, such as vision (Saharia et al., 2022a;b; Ramesh et al., 2022) and audio (Kong et al., 2020). However, extending continuous diffusion models to natural language remains an open challenge due to the inherently discrete nature of texts. Building on unconditional generation in continuous space, illustrated in Figure 1(a), existing efforts (Hoogeboom et al., 2021; Austin et al., 2021) have started customizing diffusion models to text in discrete space for unconditional language modeling (i.e., free text generation). Diffusion-LM (Li et al., 2022), as in Figure 1(b), models texts in continuous space and proposes to use an additionally trained classifier as guidance (i.e., the condition signal x) to impose subtle changes (usually complex, fine-grained constraints) on generated sentences. Nonetheless, these models do not naturally generalize to conditional language modeling (i.e., the model assigns probabilities p(w|x) to sequences of words w given x). In the more general sequence-to-sequence (SEQ2SEQ) setting, where the condition x is also a sequence of words, applying Diffusion-LM can be difficult.
The reason is that classifiers are attribute-oriented, and we cannot train hundreds of thousands of classifiers to model the semantic relationship between conditions and generated sentences. SEQ2SEQ is an essential setting in NLP that covers a wide range of important tasks such as open-ended sentence generation, dialogue, paraphrasing, and text style transfer. In this paper, we propose



Code is available at https://github.com/Shark-NLP/DiffuSeq

