DIFFUSEQ: SEQUENCE TO SEQUENCE TEXT GENERATION WITH DIFFUSION MODELS

Abstract

Recently, diffusion models have emerged as a new paradigm for generative models. Despite their success in domains with continuous signals such as vision and audio, adapting diffusion models to natural language is under-explored due to the discrete nature of text, especially for conditional generation. We tackle this challenge by proposing DIFFUSEQ: a diffusion model designed for sequence-to-sequence (SEQ2SEQ) text generation tasks. Upon extensive evaluation over a wide range of SEQ2SEQ tasks, we find that DIFFUSEQ achieves comparable or even better performance than six established baselines, including a state-of-the-art model based on pre-trained language models. Apart from quality, an intriguing property of DIFFUSEQ is its high diversity during generation, which is desired in many SEQ2SEQ tasks. We further include a theoretical analysis revealing the connection between DIFFUSEQ and autoregressive/non-autoregressive models. Bringing together theoretical analysis and empirical evidence, we demonstrate the great potential of diffusion models in complex conditional language generation tasks. 1

1. INTRODUCTION

Among existing generative models, GANs (Goodfellow et al., 2014) suffer from training instability (Salimans et al., 2016) and are subject to mode collapse (Metz et al., 2017); VAEs (Kingma & Welling, 2014) have to rely on surrogate objectives to approximate maximum likelihood training; and Flow-based models (Dinh et al., 2017) have to use specialized architectures to construct reversible transforms. Diffusion models (Ho et al., 2020; Nichol & Dhariwal, 2021) have circumvented several of these limitations and emerged as a new paradigm for generative models, theoretically underpinned by non-equilibrium thermodynamics (Sohl-Dickstein et al., 2015) and score-matching networks (Song & Ermon, 2019). To date, the major breakthroughs are in domains using continuous signals, such as vision (Saharia et al., 2022a; b; Ramesh et al., 2022) and audio (Kong et al., 2020). However, extending continuous diffusion models to natural language remains an open challenge due to the inherently discrete nature of text. Building on unconditional generation in continuous space, illustrated in Figure 1 (a), existing efforts (Hoogeboom et al., 2021; Austin et al., 2021) customize diffusion models to text in discrete space for unconditional language modeling (i.e., free text generation). Diffusion-LM (Li et al., 2022), as in Figure 1 (b), models text in continuous space and proposes to use an extra trained classifier as guidance (i.e., the condition signal x) to impose subtle changes (usually complex, fine-grained constraints) on generated sentences. Nonetheless, these models do not naturally generalize to conditional language modeling (i.e., where the model assigns probabilities p(w|x) to sequences of words w given x). In the more general sequence-to-sequence (SEQ2SEQ) setting, where the condition x is also a sequence of words, applying Diffusion-LM can be difficult.
The reason is that classifiers are attribute-oriented, and we cannot train hundreds of thousands of classifiers to model the semantic relation between conditions and generated sentences. SEQ2SEQ is an essential setting in NLP that covers a wide range of important tasks such as open-ended sentence generation, dialogue, paraphrasing, and text style transfer. In this paper, we propose DIFFUSEQ, a classifier-free diffusion model for SEQ2SEQ text generation.

To corroborate the effectiveness of our DIFFUSEQ, we conduct experiments on four SEQ2SEQ tasks. Compared to AR and NAR models, which suffer from the "degeneration" problem (Holtzman et al., 2019) and rely on decoding strategies, DIFFUSEQ can achieve considerable sentence-level diversity without sacrificing quality (see § 4.2).

To sum up, we make a series of technical and conceptual contributions: (a) we are the first to deploy the diffusion model on SEQ2SEQ text generation, and our proposed DIFFUSEQ as a conditional language model is trained end-to-end in a classifier-free manner; (b) we establish a theoretical connection among AR, NAR and DIFFUSEQ models, and justify DIFFUSEQ as an extension of iterative-NAR models; (c) with strong empirical evidence, we demonstrate the great potential of diffusion models in complex conditional language generation tasks.

2. PRELIMINARY AND PROBLEM STATEMENT

Preliminary. A diffusion model typically contains forward and reverse processes. Given a data point sampled from a real-world data distribution $z_0 \sim q(z_0)$, the forward process gradually corrupts $z_0$ into a standard Gaussian noise $z_T \sim \mathcal{N}(0, I)$. For each forward step $t \in \{1, 2, \ldots, T\}$, the perturbation is controlled by $q(z_t \mid z_{t-1}) = \mathcal{N}(z_t; \sqrt{1-\beta_t}\, z_{t-1}, \beta_t I)$, with $\beta_t \in (0, 1)$ as different variance scales. Once the forward process is completed, the reverse denoising process gradually reconstructs the original data $z_0$ via sampling from $z_T$, using a learned diffusion model $f_\theta$.

Problem Statement. Many recent efforts have been devoted to adapting diffusion models to discrete texts (see § 5). However, they all focus on unconditional sequence modeling. In this paper, we target sequence-to-sequence text generation tasks. In particular, given an $m$-length source sequence $w^x = \{w^x_1, \ldots, w^x_m\}$, we aim to learn a diffusion model that can produce an $n$-length target sequence $w^y = \{w^y_1, \ldots, w^y_n\}$ conditioned on the source sequence.
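The forward process above can be sketched in a few lines of numpy. This is a minimal illustration only, not the paper's implementation: the linear schedule, step count, and toy dimensions are assumptions, and it uses the standard closed-form $q(z_t \mid z_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\, z_0, (1-\bar\alpha_t) I)$ with $\bar\alpha_t = \prod_{s \le t}(1-\beta_s)$, which follows from composing the single-step Gaussians.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # variance schedule, beta_t in (0, 1) (assumed linear)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def forward_step(z_prev, t):
    """One step q(z_t | z_{t-1}) = N(z_t; sqrt(1 - beta_t) z_{t-1}, beta_t I)."""
    noise = rng.standard_normal(z_prev.shape)
    return np.sqrt(1.0 - betas[t]) * z_prev + np.sqrt(betas[t]) * noise

def sample_zt(z0, t):
    """Closed-form q(z_t | z_0) = N(z_t; sqrt(abar_t) z_0, (1 - abar_t) I)."""
    noise = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * noise

z0 = rng.standard_normal((8, 16))    # a toy "embedded sequence": 8 tokens, dim 16
zT = sample_zt(z0, T - 1)
# By t = T, almost none of z_0 survives, so z_T is close to standard Gaussian noise.
print(float(alpha_bars[-1]))
```

The reverse process then trains $f_\theta$ to invert `sample_zt` step by step, which is what the learned model supplies.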

3. DIFFUSEQ

We propose DIFFUSEQ to extend vanilla diffusion models to conditional text generation (as shown in Figure 2), concerning both the model architecture and the training objective.

Forward Process with Partial Noising. At the beginning of the forward process, we follow Diffusion-LM (Li et al., 2022) in designing an embedding function EMB(w) that maps the discrete text w into a continuous space. In particular, given a pair of sequences $w^x$ and $w^y$, DIFFUSEQ learns a unified
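A minimal sketch of the two ingredients named so far: an embedding function EMB(w) and a "partial noising" forward step that corrupts only the target half of the unified sequence while leaving the source condition $w^x$ untouched. All names, sizes, and the fixed noise level here are illustrative assumptions, not the paper's actual code; in practice the embedding table is learned jointly with the model.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, DIM = 100, 16
emb_table = rng.standard_normal((VOCAB, DIM)) * 0.1  # learned in practice, random here

def emb(tokens):
    """EMB(w): map discrete token ids to continuous vectors."""
    return emb_table[np.asarray(tokens)]

def partial_noise(z0, src_len, noise_level):
    """Corrupt only the target rows of the unified sequence z_0 = EMB(w^x) ++ EMB(w^y).

    noise_level plays the role of the accumulated variance 1 - abar_t; the first
    src_len rows (the source condition) are left intact throughout the forward process.
    """
    zt = z0.copy()
    noise = rng.standard_normal(z0.shape)
    zt[src_len:] = (np.sqrt(1.0 - noise_level) * z0[src_len:]
                    + np.sqrt(noise_level) * noise[src_len:])
    return zt

w_x = [5, 7, 9]          # toy source token ids
w_y = [11, 13, 17, 19]   # toy target token ids
z0 = np.concatenate([emb(w_x), emb(w_y)], axis=0)
zt = partial_noise(z0, src_len=len(w_x), noise_level=0.9)
print(np.allclose(zt[:len(w_x)], z0[:len(w_x)]))  # prints True: source is never noised
```

Because the source rows stay clean at every step, the model that learns the reverse process is always conditioned on an uncorrupted $w^x$, which is what makes the approach classifier-free.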



Code is available at https://github.com/Shark-NLP/DiffuSeq



Figure 1: The demonstration of unconditional, classifier-guided, and classifier-free diffusion models.

