DIFFUSEQ: SEQUENCE TO SEQUENCE TEXT GENERATION WITH DIFFUSION MODELS

Abstract

Recently, diffusion models have emerged as a new paradigm for generative models. Despite the success in domains using continuous signals such as vision and audio, adapting diffusion models to natural language is under-explored due to the discrete nature of texts, especially for conditional generation. We tackle this challenge by proposing DIFFUSEQ: a diffusion model designed for sequence-to-sequence (SEQ2SEQ) text generation tasks. Upon extensive evaluation over a wide range of SEQ2SEQ tasks, we find DIFFUSEQ achieving comparable or even better performance than six established baselines, including a state-of-the-art model based on pre-trained language models. Apart from quality, an intriguing property of DIFFUSEQ is its high diversity during generation, which is desired in many SEQ2SEQ tasks. We further include a theoretical analysis revealing the connection between DIFFUSEQ and autoregressive/non-autoregressive models. Bringing together theoretical analysis and empirical evidence, we demonstrate the great potential of diffusion models in complex conditional language generation tasks.

1. INTRODUCTION

Among existing generative models, GANs (Goodfellow et al., 2014) suffer from training instability (Salimans et al., 2016) and are subject to mode collapse (Metz et al., 2017); VAEs (Kingma & Welling, 2014) have to rely on surrogate objectives to approximate maximum likelihood training; and Flow-based models (Dinh et al., 2017) have to use specialized architectures to construct reversible transforms. Diffusion models (Ho et al., 2020; Nichol & Dhariwal, 2021) have circumvented several of these limitations and emerged as a new paradigm for generative models, theoretically underpinned by non-equilibrium thermodynamics (Sohl-Dickstein et al., 2015) and score-matching networks (Song & Ermon, 2019). To date, the major breakthroughs are in domains using continuous signals, such as vision (Saharia et al., 2022a; b; Ramesh et al., 2022) and audio (Kong et al., 2020). However, extending continuous diffusion models to natural language remains an open challenge due to the inherently discrete nature of texts. Building on unconditional generation in continuous space, illustrated in Figure 1(a), existing efforts (Hoogeboom et al., 2021; Austin et al., 2021) customize diffusion models to text in discrete space for unconditional language modeling (i.e., free text generation). Diffusion-LM (Li et al., 2022), as in Figure 1(b), models texts in continuous space and proposes to use an extra-trained classifier as guidance (i.e., the condition signal x) to impose subtle changes (usually complex, fine-grained constraints) on generated sentences. Nonetheless, these models do not naturally generalize to conditional language modeling (i.e., the model assigns probabilities p(w|x) to sequences of words w given x). In the more general sequence-to-sequence (SEQ2SEQ) setting, where the condition x is also a sequence of words, applying Diffusion-LM can be difficult.
The reason is that classifiers are attribute-oriented, and we cannot train hundreds of thousands of classifiers to model the semantic relation between conditions and generated sentences. SEQ2SEQ is an essential setting in NLP that covers a wide range of important tasks such as open-ended sentence generation, dialogue, paraphrasing, and text style transfer. In this paper, we propose DIFFUSEQ, depicted in Figure 1(c), a classifier-free diffusion model that supports SEQ2SEQ text generation tasks. By modeling the conditional probability of the target sentence w given context x with one single model, DIFFUSEQ allows a complete model to fit the data distribution and utilize conditional guidance, rather than depending on a separate classifier. Different from canonical generation approaches in an autoregressive (AR) left-to-right manner (Radford et al., 2019), DIFFUSEQ generates text tokens in parallel in the non-autoregressive (NAR) way. To corroborate the effectiveness of DIFFUSEQ, we conduct experiments on four SEQ2SEQ tasks. Compared to AR and NAR models, which suffer from the "degeneration" problem (Holtzman et al., 2019) and rely on decoding strategies, DIFFUSEQ can achieve considerable sentence-level diversity without sacrificing quality (see § 4.2). To sum up, we make a series of technical and conceptual contributions: (a) we are the first to deploy the diffusion model on SEQ2SEQ text generation, and our proposed DIFFUSEQ as a conditional language model is trained end-to-end in a classifier-free manner; (b) we establish a theoretical connection among AR, NAR and DIFFUSEQ models, and justify DIFFUSEQ as an extension of iterative-NAR models; (c) with strong empirical evidence, we demonstrate the great potential of diffusion models in complex conditional language generation tasks.

2. PRELIMINARY AND PROBLEM STATEMENT

Preliminary. A diffusion model typically contains forward and reverse processes. Given a data point sampled from a real-world data distribution $z_0 \sim q(z)$, the forward process gradually corrupts $z_0$ into a standard Gaussian noise $z_T \sim \mathcal{N}(0, I)$. For each forward step $t \in \{1, 2, \ldots, T\}$, the perturbation is controlled by $q(z_t|z_{t-1}) = \mathcal{N}(z_t; \sqrt{1-\beta_t}\, z_{t-1}, \beta_t I)$, with $\beta_t \in (0, 1)$ as different variance scales. Once the forward process is completed, the reverse denoising process tries to gradually reconstruct the original data $z_0$ via sampling from $z_T$ by learning a diffusion model $f_\theta$.

Problem Statement. Many recent efforts have been devoted to adapting diffusion models to discrete texts (see § 5). However, they all focus on unconditional sequence modeling. In this paper, we target sequence-to-sequence text generation tasks. In particular, given an $m$-length source sequence $w^x = \{w^x_1, \ldots, w^x_m\}$, we aim to learn a diffusion model that can produce an $n$-length target sequence $w^y = \{w^y_1, \ldots, w^y_n\}$ conditioned on the source sequence.
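The closed-form forward corruption $q(z_t|z_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t} z_0, (1-\bar{\alpha}_t)I)$ implied by this chain can be sketched as follows. This is an illustrative sketch, not the authors' released implementation; the linear beta schedule here is only for demonstration (the paper itself uses a square-root schedule).

```python
import numpy as np

def forward_sample(z0, t, betas, rng):
    """Sample z_t ~ q(z_t | z_0) in closed form.

    q(z_t | z_0) = N(sqrt(alpha_bar_t) * z_0, (1 - alpha_bar_t) * I),
    where alpha_bar_t is the cumulative product of alpha_i = 1 - beta_i.
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = rng.standard_normal(z0.shape)  # Gaussian noise epsilon
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((64, 32))        # a toy batch of latent vectors
betas = np.linspace(1e-4, 0.02, 2000)     # illustrative linear schedule
zT = forward_sample(z0, 1999, betas, rng) # at t = T, nearly pure Gaussian noise
```

At the final step, `alpha_bar` is vanishingly small, so `zT` is statistically indistinguishable from a standard Gaussian, as the preliminary above requires.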

3. DIFFUSEQ

We propose DIFFUSEQ to extend vanilla diffusion models to conditional text generation (as shown in Figure 2), concerning both the model architecture and the training objective.

Forward Process with Partial Noising. At the beginning of the forward process, we follow Diffusion-LM (Li et al., 2022) in designing an embedding function EMB(w) to map the discrete text w into a continuous space. In particular, given a pair of sequences $w^x$ and $w^y$, DIFFUSEQ learns a unified


Figure 2: The diffusion process of our conditional diffusion language model DIFFUSEQ. Given the source $w^x$ and the target $w^y$, we pair-wisely transform them into continuous space $z_0$. Partial Gaussian noise is iteratively added to the target part of $z_t$.

feature space of $w^x$ and $w^y$ by embedding transformation and concatenation as $\mathrm{EMB}(w^{x\oplus y}) = [\mathrm{EMB}(w^x_1), \ldots, \mathrm{EMB}(w^x_m), \mathrm{EMB}(w^y_1), \ldots, \mathrm{EMB}(w^y_n)] \in \mathbb{R}^{(m+n)\times d}$. This transformation allows us to adapt discrete textual input to the standard forward process, by extending the original forward chain with a new Markov transition $q_\phi(z_0|w^{x\oplus y}) = \mathcal{N}(\mathrm{EMB}(w^{x\oplus y}), \beta_0 I)$. We denote $z_t = x_t \oplus y_t$ to simplify the wording, where $x_t$ and $y_t$ represent the parts of $z_t$ that belong to $w^x$ and $w^y$, respectively. For each forward step $q(z_t|z_{t-1})$, we gradually inject noise into the previous hidden state $z_{t-1}$ to obtain $z_t$. Unlike conventional diffusion models that corrupt the whole $z_t$ (both $x_t$ and $y_t$) without distinction, we only impose noise on $y_t$. This modification (termed partial noising) allows us to adapt diffusion models to conditional language modeling.

Reverse Process with Conditional Denoising. The ultimate goal of the reverse process is to recover the original $z_0$ by denoising $z_t$: $p_\theta(z_{0:T}) := p(z_T) \prod_{t=1}^{T} p_\theta(z_{t-1}|z_t)$. We model the learning process $p_\theta(z_{t-1}|z_t) = \mathcal{N}(z_{t-1}; \mu_\theta(z_t, t), \sigma_\theta(z_t, t))$ using the proposed diffusion model DIFFUSEQ $f_\theta(z_t, t)$, where $\mu_\theta(\cdot)$ and $\sigma_\theta(\cdot)$ parameterize the predicted mean and standard deviation of $q(z_{t-1}|z_t)$ in the forward process, derived using Bayes' rule. The detailed derivations are in Appendix A. With the partial noising strategy adopted in the forward process, we can impose the input as the condition when denoising, as shown in Figure 1. The proposed conditional denoising is classifier-free by nature: we do not require extra-trained classifiers to control the denoising process.
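The partial noising described above can be sketched in a few lines: the concatenated embedding $z_0$ is split at the source length $m$, and only the target rows are corrupted. This is a minimal illustration under an assumed linear beta schedule, not the paper's actual training code.

```python
import numpy as np

def partial_noise(z0, m, t, betas, rng):
    """Corrupt only the target segment y_t of z_t = x_t (+) y_t.

    z0: (m + n, d) array, the concatenated embeddings EMB(w^{x+y});
    m:  length of the source sequence, whose rows are left untouched.
    """
    alpha_bar = np.cumprod(1.0 - betas)[t]
    zt = z0.copy()
    eps = rng.standard_normal(z0[m:].shape)
    # standard forward corruption, applied to the y-part only
    zt[m:] = np.sqrt(alpha_bar) * z0[m:] + np.sqrt(1.0 - alpha_bar) * eps
    return zt

rng = np.random.default_rng(1)
betas = np.linspace(1e-4, 0.02, 2000)          # illustrative schedule
z0 = rng.standard_normal((10, 4))               # m = 4 source + n = 6 target tokens
zt = partial_noise(z0, m=4, t=1000, betas=betas, rng=rng)
# zt[:4] equals z0[:4]; only zt[4:] has been noised
```

Leaving the source rows intact is exactly what lets the denoiser condition on $w^x$ without any external classifier.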
Specifically, we use a Transformer architecture to model $f_\theta$, which spontaneously models the semantic relation between $x_t$ and $y_t$. We compute the variational lower bound ($\mathcal{L}_{\mathrm{VLB}}$) following the original diffusion process, where $\mathcal{L}_{\mathrm{round}}$ corresponds to the rounding operation in Figure 2:

$$\mathcal{L}_{\mathrm{VLB}} = \mathbb{E}_{q(z_{1:T}|z_0)}\Big[\underbrace{\log\frac{q(z_T|z_0)}{p_\theta(z_T)}}_{\mathcal{L}_T} + \sum_{t=2}^{T}\underbrace{\log\frac{q(z_{t-1}|z_0, z_t)}{p_\theta(z_{t-1}|z_t)}}_{\mathcal{L}_{t-1}} + \underbrace{\log\frac{q_\phi(z_0|w^{x\oplus y})}{p_\theta(z_0|z_1)}}_{\mathcal{L}_0} - \underbrace{\log p_\theta(w^{x\oplus y}|z_0)}_{\mathcal{L}_{\mathrm{round}}}\Big].$$

We further simplify the training objective as follows (details in Appendix A):

$$\min_\theta \mathcal{L}_{\mathrm{VLB}} = \min_\theta \Big[\sum_{t=2}^{T} \|z_0 - f_\theta(z_t, t)\|^2 + \|\mathrm{EMB}(w^{x\oplus y}) - f_\theta(z_1, 1)\|^2 - \log p_\theta(w^{x\oplus y}|z_0)\Big]$$
$$\rightarrow \min_\theta \Big[\sum_{t=2}^{T} \|y_0 - \tilde{f}_\theta(z_t, t)\|^2 + \|\mathrm{EMB}(w^y) - \tilde{f}_\theta(z_1, 1)\|^2 + \mathcal{R}(\|z_0\|^2)\Big],$$

where $\tilde{f}_\theta(z_t, t)$ denotes the fraction of the recovered $z_0$ corresponding to $y_0$. Note that although in the first term we only compute the loss w.r.t. $y_0$, the reconstruction of $y_0$ also takes $x_0$ into account due to the attention mechanism in the Transformer; thus the gradients from the first term also affect the learning of $x_0$. The mathematically equivalent regularization term $\mathcal{R}(\|z_0\|^2)$ regularizes the embedding learning. We further share the embedding function between source and target sequences, enabling the two feature spaces to be trained jointly. This distinguishes DIFFUSEQ from existing solutions in vision such as GLIDE (Nichol et al., 2022).

Training and Inference Methods. In our preliminary experiments, we find that the high diversity of NLP datasets and the long diffusion steps often result in insufficient training. We hypothesize that sampling the step $t$ uniformly causes unnecessary noise in the $\mathcal{L}_{\mathrm{VLB}}$ objective. We hence employ importance sampling (Nichol & Dhariwal, 2021) to address this problem:

$$\mathcal{L}_{\mathrm{VLB}} = \mathbb{E}_{t \sim p_t}\Big[\frac{\mathcal{L}_t}{p_t}\Big], \quad p_t \propto \sqrt{\mathbb{E}[\mathcal{L}_t^2]}, \quad \sum_{t=0}^{T-1} p_t = 1.$$

Intuitively, the importance-weighted sampling algorithm spends more steps on diffusion steps with larger $\mathcal{L}_t$, and vice versa.
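A loss-aware step sampler in the style of Nichol & Dhariwal (2021) can be sketched as below. The class name and the history length are illustrative choices, not taken from the paper; the essential points are that sampling stays uniform until every step has loss history, and that each draw returns an importance weight so the reweighted expectation matches the uniform average.

```python
import numpy as np

class LossAwareSampler:
    """Importance-sample diffusion steps t with p_t proportional to sqrt(E[L_t^2])."""

    def __init__(self, T, history=10):
        self.T = T
        self.history = history
        self.sq_losses = [[] for _ in range(T)]  # recent squared losses per step

    def weights(self):
        # warm-up: uniform until every step has a full history
        if any(len(h) < self.history for h in self.sq_losses):
            return np.full(self.T, 1.0 / self.T)
        w = np.sqrt([np.mean(h) for h in self.sq_losses])
        return w / w.sum()

    def sample(self, rng):
        p = self.weights()
        t = int(rng.choice(self.T, p=p))
        # weight 1/(T * p_t) makes E_{t~p}[L_t * weight] equal the uniform mean over t
        return t, 1.0 / (self.T * p[t])

    def update(self, t, loss):
        h = self.sq_losses[t]
        h.append(loss ** 2)
        if len(h) > self.history:
            h.pop(0)
```

During training one would call `sample` to pick `t`, compute the loss for that step, multiply it by the returned weight, and feed the raw loss back via `update`.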
To conduct SEQ2SEQ generation given the condition $\mathrm{EMB}(w^x)$, we randomly sample $y_T \sim \mathcal{N}(0, I)$ and concatenate it with $\mathrm{EMB}(w^x)$ to obtain $z_T$. We then repeat the reverse process until we arrive at $z_0$. At each sampling step, an anchoring function is executed on the reparameterized $z_t$. Specifically, the anchoring function: (a) performs rounding on $z_t$ to map it back to the word embedding space, following Li et al. (2022); (b) replaces the part of the recovered $z_{t-1}$ that belongs to $w^x$ with the original $x_0$, since this part is recovered from the corrupted $z_t$ via $f_\theta$ and does not strictly equal $x_0$. Note that (b) is designed specifically for DIFFUSEQ. To improve the quality of generation, we apply the widely used Minimum Bayes Risk (MBR) decoding strategy (Koehn, 2004): we first generate a set of candidate samples $S$ from different random seeds of DIFFUSEQ, and select the output sequence that achieves the minimum expected risk under a meaningful loss function (e.g., BLEU or other cheaper metrics like precision). In practice, we use the negative BLEU score in our implementation.

Connections to AR, Iter-NAR, and Fully-NAR Models. To better understand the behavior of DIFFUSEQ, we give its theoretical connection to autoregressive (AR), iterative non-autoregressive (iter-NAR), and fully non-autoregressive (fully-NAR) models. We argue that DIFFUSEQ can be seen as an extension of the iter-NAR model. Detailed graphical models of these four cases are discussed in Appendix B for reference. AR models learn $p(w^y_{1:n}|w^x)$ by autoregressive decomposition based on the left context:

$$p_{\mathrm{AR}}(w^y_{1:n}|w^x) = \underbrace{p(w^y_1|w^x)}_{\text{initial prediction}} \prod_{i=1}^{n-1} \underbrace{p(w^y_{i+1}|w^y_{1:i}, w^x)}_{\text{progressive left-context prediction}}, \quad (4)$$

while fully-NAR models (Gu et al., 2018; Qian et al., 2021) learn the conditional probability under an independence assumption for fast inference:

$$p_{\text{fully-NAR}}(w^y_{1:n}|w^x) = \prod_{i=1}^{n} p(w^y_i|w^x). \quad (5)$$
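The MBR selection described above can be sketched as follows. As an assumption for a self-contained example, a cheap clipped bigram precision stands in for BLEU (the paper uses negative BLEU); the function names are hypothetical.

```python
from collections import Counter

def ngram_overlap(a, b, n=2):
    """A cheap stand-in for BLEU: clipped bigram precision between token lists."""
    if len(a) < n or len(b) < n:
        return 0.0
    ca = Counter(tuple(a[i:i + n]) for i in range(len(a) - n + 1))
    cb = Counter(tuple(b[i:i + n]) for i in range(len(b) - n + 1))
    clipped = sum(min(c, cb[g]) for g, c in ca.items())
    return clipped / max(1, sum(ca.values()))

def mbr_select(candidates):
    """Minimum Bayes Risk: pick the candidate with the lowest expected risk
    (risk = negative similarity) against all other candidates in the set."""
    def expected_risk(c):
        return sum(-ngram_overlap(c, o) for o in candidates if o is not c)
    return min(candidates, key=expected_risk)
```

Because the risk is averaged over the whole candidate set, a candidate that agrees with many others wins, which is why a diverse, high-quality set raises the ceiling of MBR.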
To make a better analogy to AR and NAR models, we use a lossless way to formulate iterative NAR models (Gu et al., 2019; Ghazvininejad et al., 2019) by introducing a series of intermediate sequences $w^y_1, \ldots, w^y_{K-1}$, with $w^y_K = w^y$, over $K$ editable iterations:

$$p_{\text{iter-NAR}}(w^y_{1:n}|w^x) = \sum_{w^y_1, \ldots, w^y_{K-1}} \underbrace{\prod_{i=1}^{n} p(w^y_{1,i}|w^x)}_{\text{initial prediction}} \prod_{k=1}^{K-1} \underbrace{\prod_{i=1}^{n} p(w^y_{k+1,i}|w^y_{k,1:n}, w^x)}_{\text{progressive full-context prediction}}. \quad (6)$$

A previous study (Huang et al., 2022) shows that there is a gap called the conditional total correlation between the AR Eq. (4) and fully-NAR Eq. (5) learning paradigms, caused by the lossy decomposition of NAR models. However, when comparing iter-NAR Eq. (6) with AR Eq. (4), both can be factorized into an initial prediction term and a progressive prediction process based on different contexts (i.e., left-context in AR and full-context in iter-NAR), so the discrepancy pointed out by Huang et al. (2022) is closed in iter-NAR assuming sufficient steps. By showing DIFFUSEQ is an extension of the iter-NAR model, we offer a justification that it will not suffer from the conditional total correlation for the same reason. A straightforward way to formulate pure continuous diffusion models is to introduce a series of Gaussian noise-corrupted features along the diffusion steps, $y_{T-1:1}$, with $y_0 = y$ and $y_T \sim \mathcal{N}(0, I)$:

$$p_{\text{diffusion}}(w^y|w^x) = \sum_{y_T, \ldots, y_0} \underbrace{p(w^y|y_0, w^x)}_{\text{final prediction}} \prod_{t=T, \ldots, 1} \underbrace{p(y_{t-1}|y_t, w^x)}_{\text{progressive full-context diffusion}}, \quad (7)$$

where $p(y_{t-1}|y_t, w^x)$ describes the diffusion step on the continuous representations $y$. The rounding operation in DIFFUSEQ maps the continuous vectors $y_t$ to discrete $w^y_t$ at each time step $t$; introducing this rounding into Eq. (7) yields Eq. (8). By rearranging Eq. (8) into Eq. (9), we can see that DIFFUSEQ is a more generalized form of the iter-NAR Eq. (6) before marginalizing out $\{y_T, \ldots, y_0\}$, despite the different initialization of $y_T$.
A more detailed derivation is shown in Appendix C.

$$p_{\mathrm{DIFFUSEQ}}(w^y|w^x) = \sum_{w^y_T, \ldots, w^y_1} \underbrace{p(w^y_T|w^x)}_{\text{initial prediction}} \prod_{t=T, \ldots, 1} \underbrace{p(w^y_{t-1}|w^y_t, w^x)}_{\text{progressive full-context prediction}}. \quad (9)$$

4. EXPERIMENTS

We conduct experiments to validate the effectiveness of DIFFUSEQ on four different tasks, against six strong AR/NAR baselines.

4.1. EXPERIMENTAL SETUP

Tasks and Datasets. SEQ2SEQ generation covers a wide range of tasks, among which we choose four typical and popular ones. Open domain dialogue requires models to generate informative responses given a dialogue context. We use the Commonsense Conversation Dataset (Zhou et al., 2018), which is extracted from Reddit single-round dialogs and contains over 3 million conversational pairs. Question generation (QG) aims to generate questions given a context as input. To obtain sufficient training samples, we use the Quasar-T dataset (Dhingra et al., 2017) preprocessed by Lin et al. (2018), and then generate document-question pairs to obtain 119K training samples (details in Appendix D.1). Text simplification aims to revise complex text into sequences with simplified grammar and word choice. Jiang et al. (2020) construct a corpus consisting of 677K complex-simple sentence pairs with revision alignment. The paraphrase task generates an alternative surface form in the same language expressing the same semantic content. We adopt the widely used QQP dataset, sourced from the community question answering forum Quora, with 147K positive pairs.

Baselines. We consider three groups of models as baselines, covering both AR and NAR architectures. The first group adopts the encoder-decoder architecture (Cho et al., 2014), which is well-studied for SEQ2SEQ tasks; we conduct experiments on two popular models: GRU with attention and Transformer (Vaswani et al., 2017). The second group consists of finetuned large pre-trained language models (PLMs), among which GPT2 (Radford et al., 2019) has demonstrated great success in almost all SEQ2SEQ tasks. We further compare to GPVAE (Du et al., 2022), which augments a pre-trained T5 (Raffel et al., 2020) with a VAE to improve generation diversity. For the last group of baselines, we consider LevT (Gu et al., 2019), a widely used, strong iterative NAR model.
All baselines are trained following the instructions in their papers; details can be found in Appendix D.2.

Evaluation. We evaluate the generated sequences from two aspects: quality and diversity. To evaluate quality, we use the standard metrics BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). Since string-similarity-based metrics can be unsatisfactory for open-ended generation, we also report BERTScore (Zhang et al., 2019), which assesses the semantic similarity between generated sentences and references; details are in Appendix D.4. Higher scores of BLEU, ROUGE and BERTScore reflect better performance. As for diversity, we use distinct unigrams (dist-1) to measure intra-diversity within each generated sentence, where a lower dist-1 indicates that the generated sentence contains more repeated words. For sentence-level diversity evaluation, we consider sentence-level self-BLEU (Zhu et al., 2018) to measure the n-gram overlap among the set of outputs w.r.t. one source sentence, and we additionally use diverse 4-grams (div-4) (Deshpande et al., 2019) to measure the ratio of distinct 4-grams in the set of outputs per source sentence. Lower self-BLEU and higher div-4 suggest higher diversity of generation. For each method, including DIFFUSEQ, we generate 3 samples for each source sentence to compute the diversity metrics.

Implementation Details. Our DIFFUSEQ is based on a 12-layer Transformer with 12 attention heads, where the time step embedding is plugged in akin to the position embedding. The maximum sequence length is 128, with embedding dimension d = 128, diffusion steps T = 2,000 and a square-root noise schedule. To reduce out-of-vocabulary generation, we apply Byte Pair Encoding (Sennrich et al., 2016) to construct the vocabulary. After conducting diverse beam search (DBS) (Vijayakumar et al., 2016) for the Transformer-base and GPT2 models, we find that DBS does not always promote diversity over temperature sampling, and we therefore list the best diversity results. We compute the accuracy metrics of DIFFUSEQ using MBR with a candidate set of size |S| = 10. Experiments are run on NVIDIA A100 Tensor Core GPUs, with 4 GPUs for training and a single GPU for sampling.
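The time step embedding "plugged in akin to the position embedding" can be illustrated with the standard sinusoidal form. This is a sketch under the assumption of sinusoidal encodings; the paper does not specify the exact functional form, and the function name is hypothetical.

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000):
    """Sinusoidal embedding of the diffusion step t, added to each token's
    representation in the same way a position embedding would be."""
    half = dim // 2
    # geometric frequency ladder, as in Transformer position encodings
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

emb = timestep_embedding(500, dim=128)  # one vector per diffusion step
```

Because the same vector is broadcast to all $m+n$ positions, the Transformer sees both where each token sits (position embedding) and how noisy the current state is (time step embedding).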

4.2. MAIN RESULTS

As shown in Table 1, DIFFUSEQ achieves comparable or even higher generation quality than strong baselines, and at the same time consistently demonstrates its superiority in generating diverse outputs given the same input sequence. As we can see from Table 1, DIFFUSEQ wins on at least one quality metric in each of the 6 baselines × 4 tasks comparisons. Although NAR models such as LevT can sometimes outperform AR baselines, they still lag well behind DIFFUSEQ by large margins (i.e., relative improvements over 50% for BLEU in the QG task and R-L in the Dialogue task). Even compared with pre-trained then finetuned GPT2 models, DIFFUSEQ delivers superior performance to the base variant and is comparable with the large variant, which has 8.2 times more parameters than DIFFUSEQ. These empirical results amply support our findings in § 3, where we theoretically analyze the potential of diffusion models in modeling text sequences compared with AR models, given sufficient diffusion steps. DIFFUSEQ, as a member of the deep generative model family, also exhibits the capacity to generate highly diverse sequences. As suggested by self-BLEU (lower is better) and div-4 (higher is better), in almost all cases DIFFUSEQ significantly outperforms the 4 AR baselines in terms of sentence-level diversity (i.e., producing diverse outputs given the same input). For diversity in word choice within one sentence, we consider dist-1: a higher dist-1 indicates less repetition within a sentence. As we can see from Table 1, DIFFUSEQ has less repetition than the encoder-decoder methods, but still falls behind the pre-trained GPT2 models (the same holds for BERTScore). These results suggest there is still room for improvement (e.g., via pre-training techniques) in diffusion models' token-level choices.
Different from NAR-LevT, DIFFUSEQ does not rely on an extra length prediction module; instead, the output length is decided automatically by the padding token, and DIFFUSEQ is able to generate longer output sentences, as indicated by the last column (average generation length).

4.3. ANALYSIS

We conduct a series of analyses to investigate the effectiveness of different aspects of DIFFUSEQ.

Diversity Ensures Quality. Generating high-quality texts with high diversity is an important requirement for many text generation applications, and the trade-off between quality and diversity is a critical concern in open-ended NLG tasks (Zhang et al., 2021). Different from AR models, which rely on decoding strategies like temperature and nucleus sampling (Holtzman et al., 2019), and VAE models, which sample a latent variable from a Gaussian prior, the natural advantage of DIFFUSEQ is to generate different sentences from a series of random Gaussian noises. In Figure 4, we show that DIFFUSEQ has a better trade-off between generation quality (BLEU) and sentence-level diversity (div-4). Here we further demonstrate that the high diversity provided by DIFFUSEQ can be turned into better quality. MBR is a common strategy to improve generation quality by aggregating and ranking candidate sequences, and we find that the upper bound of MBR is determined by the diversity of the candidate set. To validate this, we simultaneously apply MBR on both DIFFUSEQ and GPT2 with various candidate sizes |S|. The results are shown in Figure 3. As we can see, DIFFUSEQ lags behind GPT2 without MBR (|S| = 1) or with a small candidate set (|S| = 3). However, as |S| increases, DIFFUSEQ starts to outperform GPT2 by an increasing margin. The reason is that autoregressive models like GPT2 tend to generate highly similar candidates (as discussed in § 4.2), which impedes the effectiveness of MBR. As |S| increases to 20, DIFFUSEQ still shows a better rising trend than GPT2. Our findings also stress the importance of better ranking methods in diffusion research.

Step-wise Analysis against Iterative NAR. Given the underlying theoretical connection between iterative NAR and DIFFUSEQ discussed in § 3, we empirically investigate the behavior of LevT and DIFFUSEQ by analyzing their step-wise quality (i.e.
BLEU) and diversity (i.e. div-4) curves. As suggested in Figure 5, LevT improves sharply in quality at the very beginning of generation and quickly slows down in the successive refinement process. DIFFUSEQ behaves differently: its BLEU score grows slowly at first, increases rapidly as the diffusion process progresses, and finally surpasses LevT. We also observe that the diversity of both LevT and DIFFUSEQ is determined at a very early stage, regardless of future refinement or diffusion, and that DIFFUSEQ consistently outperforms LevT on diversity at every stage of generation. We conjecture that DIFFUSEQ explores more possible results in the first half of the generation process and soon converges to several potential candidates close to the end of the steps. In this case, DIFFUSEQ shows its capacity to balance generation quality and diversity, a capacity that iterative NAR and even AR models cannot attain, due to their different learning paradigms.

Inference Speed. Slow sampling is one of the major concerns about diffusion models. Here we fix the number of diffusion steps during training of DIFFUSEQ while shrinking the inference steps following DDIM (Song et al., 2020). As we can see from Figure 6, when reducing inference to 1,000 diffusion steps on a single GPU, DIFFUSEQ achieves a higher BLEU score than GPT2-large while approaching its inference speed.

Effectiveness of Joint Training. In DIFFUSEQ, the representations of w^x and w^y are jointly trained using the same embedding function EMB(·) (stated in § 3). To validate this design, we decouple the training of EMB(w^x) and EMB(w^y), in the spirit of two-staged models in vision (Nichol et al., 2022; Ramesh et al., 2022), by replacing EMB(w^x) with representations extracted from a pre-trained BERT-tiny model (Turc et al., 2019). From Table 3, we find that the decoupled training strategy results in poor performance.
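Shrinking the inference steps amounts to running the reverse process on an evenly spaced subsequence of the trained steps. A minimal sketch of such DDIM-style respacing (the helper name is an assumption, not from the paper or the DDIM codebase):

```python
def respace_steps(T, n_infer):
    """Pick at most n_infer evenly spaced steps out of the T training steps,
    returned in the descending order the reverse process visits them."""
    stride = T / n_infer
    # round to the nearest trained step; a set removes duplicates from rounding
    steps = sorted({round(i * stride) for i in range(n_infer)}, reverse=True)
    return steps

steps = respace_steps(2000, 1000)  # run only every other step at inference
```

The denoiser is then queried only at these steps, roughly halving (here) the sampling cost while reusing the model trained with T = 2,000 steps unchanged.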

5. RELATED WORK

Diffusion Models for Text Modeling. Text-to-image generation using diffusion models has developed many potential applications. Models such as Imagen (Saharia et al., 2022b) and DALL-E (Ramesh et al., 2022) are usually two-staged, relying on pre-trained models and requiring alignment between the embedding vectors from the two sources. GLIDE (Nichol et al., 2022) explores diffusion models with classifier-free guidance (Ho & Salimans, 2022) by setting the guidance scale during training. The target space of these models is not the discrete text space but continuous pixel values. There are other works applying diffusion to text generation, but they stick to the original encoder-decoder architecture and intersperse the diffusion process in the decoder (Savinov et al., 2021) or the latent space (Yu et al., 2022). For text generation using diffusion models, Hoogeboom et al. (2021) introduce multinomial diffusion for character-level text generation, where forward categorical noise is applied through a Markov transition matrix. Austin et al. (2021) generalize discrete text diffusion models by introducing an absorbing state ([MASK]). However, discrete diffusion models may suffer from the scaling of one-hot row vectors, and they only generate text samples unconditionally in discrete space. Diffusion-LM (Li et al., 2022) and Analog Bits (Chen et al., 2022) propose language models diffused on continuous latent representations, with different mapping functions connecting the discrete and continuous spaces of texts. In contrast, we focus on SEQ2SEQ diffusion models for text generation in continuous space, and to the best of our knowledge our work is the first to explore this setting.

Diffusion Models for Conditional Generation. Related to conditional VAE (Zhao et al., 2017), we can consider the latent encoded input x as a condition.
Diffusion-LM (Li et al., 2022) adopts plug-and-play approaches (Dathathri et al., 2020) to compose fine-grained constraints on the generated sentences, but it fails to condition on a whole source sentence in SEQ2SEQ tasks. Note that this controllable generation method is orthogonal to our DIFFUSEQ; in other words, we can add classifier-guided constraints on the SEQ2SEQ output to further control the text generation. There are also conditional diffusion models for time series prediction, such as CSDI (Tashiro et al., 2021), and for audio generation, such as WaveGrad (Chen et al., 2021), but their class conditions are usually attributes that are easy to model, while contextual texts as conditions are much more complex.

6. CONCLUSIONS

We propose DIFFUSEQ to tackle SEQ2SEQ tasks in a diffusion manner, which shows strong potential to achieve a better trade-off between generation quality and diversity. This capability further enhances the quality of final results when DIFFUSEQ is combined with a minimum Bayes risk decoding algorithm. Besides, we theoretically connect AR and NAR models to DIFFUSEQ, and show that DIFFUSEQ is a powerful extension of iterative NAR models. The empirical results demonstrate that DIFFUSEQ is also a powerful model for text generation, matching or even surpassing competitive AR, iterative NAR, and large-scale pre-trained models on quality and diversity. Given the limited progress of current diffusion models on text generation, our study demonstrates promising achievements for this new sequence-to-sequence learning paradigm.

A OBJECTIVE DERIVATIONS OF DIFFUSEQ

The diffusion model is well known for achieving a trade-off between flexibility and tractability of the model's probability distribution, compared with GAN, VAE and Flow-based models. Following Ho et al. (2020); Nichol & Dhariwal (2021); Song et al. (2020), we systematically define the forward noising and reverse denoising processes on the latent continuous space $z$. The forward noising perturbs the structure of the data $z_0$, which is finally turned into partial Gaussian noise with $y_T \sim \mathcal{N}(0, I)$ through the $T$-step forward disturbance $q(z_t|z_{t-1}) = \mathcal{N}(z_t; \sqrt{1-\beta_t}\, z_{t-1}, \beta_t I)$, where $t = 1, 2, \ldots, T$ and $\{\beta_t \in (0,1)\}_{t=1}^{T}$ is the variance schedule. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$; we have

$$z_t = \sqrt{\alpha_t}\, z_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_{t-1} = \sqrt{\alpha_t \alpha_{t-1}}\, z_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}}\, \bar{\epsilon}_{t-2} = \cdots = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \quad (11)$$

where $\epsilon$ stands for Gaussian noise. In the end, $q(z_t|z_0) = \mathcal{N}(z_t; \sqrt{\bar{\alpha}_t}\, z_0, (1-\bar{\alpha}_t) I)$. We use the sqrt noise schedule of Diffusion-LM (Li et al., 2022), i.e., $\bar{\alpha}_t = 1 - \sqrt{t/T + s}$, with $s$ a small constant at the start of the noise level. The reverse process then denoises $z_t$, aiming to recover the original $z_0$, and is defined as

$$p_\theta(z_{0:T}) := p(z_T) \prod_{t=1}^{T} p_\theta(z_{t-1}|z_t), \quad p_\theta(z_{t-1}|z_t) = \mathcal{N}(z_{t-1}; \mu_\theta(z_t, t), \sigma_\theta(z_t, t)).$$

The learning of $p_\theta$ is based on our diffusion model DIFFUSEQ $f_\theta(z_t, t)$, where $\mu_\theta(\cdot)$ and $\sigma_\theta(\cdot)$ parameterize the predicted mean and standard deviation of $q(z_{t-1}|z_t)$ in the forward process. Using Bayes' rule:

$$q(z_{t-1}|z_t, z_0) = q(z_t|z_{t-1}, z_0)\, \frac{q(z_{t-1}|z_0)}{q(z_t|z_0)}.$$

Substituting Eq. (11) into it, we obtain the parameterized mean of $q(z_{t-1}|z_t, z_0)$:

$$\tilde{\mu}_t(z_t, z_0) = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, z_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1-\bar{\alpha}_t}\, z_0, \quad (14)$$

and for brevity we shorten the coefficients of $z_t$ and $z_0$ as $U$ and $E$, respectively. We can use the variational lower bound to optimize the negative log-likelihood: $\mathbb{E}[-\log p_\theta(z_0)] \leq \mathcal{L}_{\mathrm{VLB}}$.
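The sqrt noise schedule above can be computed directly; the snippet below is an illustrative sketch (the clipping bounds are an assumption to keep $\bar{\alpha}_t$ strictly inside $(0, 1)$, since $1 - \sqrt{t/T + s}$ dips slightly below zero at $t = T$).

```python
import numpy as np

def sqrt_alpha_bar(T=2000, s=1e-4):
    """Sqrt noise schedule from Diffusion-LM: alpha_bar_t = 1 - sqrt(t/T + s).

    Returns alpha_bar for t = 0..T, clipped to stay in (0, 1).
    """
    t = np.arange(T + 1)
    return np.clip(1.0 - np.sqrt(t / T + s), 1e-5, 1.0)

ab = sqrt_alpha_bar()
# ab[0] is close to 1 (almost no noise); ab[T] is close to 0 (pure noise)
```

Compared with a linear beta schedule, this schedule injects noise faster at small $t$, which is argued to suit the near-discrete structure of text embeddings.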
The objective can be further rewritten as a combination of several KL-divergence and entropy terms following Sohl-Dickstein et al. (2015):

$$\mathcal{L}_{\mathrm{VLB}} = \mathcal{L}_T + \mathcal{L}_{T-1} + \cdots + \mathcal{L}_0 = \mathbb{E}_{q(z_{1:T}|z_0)}\Big[\log\frac{q(z_T|z_0)}{p_\theta(z_T)} + \sum_{t=2}^{T}\log\frac{q(z_{t-1}|z_0, z_t)}{p_\theta(z_{t-1}|z_t)} + \log\frac{q_\phi(z_0|w^{x\oplus y})}{p_\theta(z_0|z_1)} - \log p_\theta(w^{x\oplus y}|z_0)\Big]. \quad (15)$$

For $1 \leq t \leq T-1$, we compute the parameterization of $\mathcal{L}_t$ by substituting Eq. (14) to minimize the difference between $\tilde{\mu}_t$ and $\mu_\theta$, following Ho et al. (2020):

$$\mathcal{L}_t = \mathbb{E}_{z_0}\Big[\log\frac{q(z_t|z_0, z_{t+1})}{p_\theta(z_t|z_{t+1})}\Big] = \mathbb{E}_{z_0}\Big[\frac{1}{C}\|\tilde{\mu}_t(z_t, z_0) - \mu_\theta(z_t, t)\|^2\Big].$$

From DIFFUSEQ to diffusion model. We show how to reduce DIFFUSEQ to the straightforward diffusion model on continuous space.



Footnotes:
1. Code is available at https://github.com/Shark-NLP/DiffuSeq
2. For NAR models, y_T is uniformly copied from the source sentence or unk's token embedding (Gu et al., 2018); for diffusion models, y_T is sampled from the normal distribution N(0, I).
3. https://www.kaggle.com/c/quora-question-pairs
4. Including top-p sampling, temperature, diverse beam search (DBS), etc.
5. Implemented using HuggingFace Transformers: https://github.com/huggingface/transformers



Figure 1: The demonstration of unconditional, classifier-guided, and classifier-free diffusion models.

Figure 4: Trade-off between quality and diversity (details in Appendix D.3).

Figure 5: The curves of BLEU/div-4 scores along the generation process (percentage of steps).

Figure 6: The BLEU and inference speed of DIFFUSEQ and GPT2-large.

Table 1: The overall results of different methods on different SEQ2SEQ tasks. The first group ⋄ of methods adopts the autoregressive encoder-decoder architecture, the second group • is the finetuned large pre-trained language models (also autoregressive), and the last group ‡ is non-autoregressive. The best results are bold, and the best results without PLMs are underlined.

Table 2: Sample outputs on the QQP test set, all conditioned on the same input x.

In Table 2, we provide examples to showcase DIFFUSEQ's ability to generate diverse samples. More examples can be found in Appendix D.5.

Results with and without joint training on the Question Generation task.


Figure 7: Graphical models of AR, fully NAR, iterative NAR, and DIFFUSEQ models. For simplicity, we omit the source node w^x. Gray nodes indicate dependency on the source node, while white nodes are independent of it.

Sample outputs with different random seeds on the Text Simplification test set.
Complex sentence: People can experience loneliness for many reasons, and many life events may cause it, such as a lack of friendship relations during childhood and adolescence, or the physical absence of meaningful people around a person.
Simplified: One cause of loneliness is a lack of friends during childhood and teenage years.

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers and other peers for their valuable advice, and we also acknowledge Chenxin An's efforts to update the generation results for the Transformer-base model on QG and Paraphrasing tasks. This work is partially supported by the Shanghai Committee of Science and Technology (Grant No. 21DZ1100100) and the joint research scheme of the National Natural Science Foundation of China (NSFC) and the Research Grants Council (RGC) under grant number N HKU714/21.


5 https://www.nltk.org/_modules/nltk/translate/bleu_score.html
6 https://github.com/Tiiiger/bert_score

APPENDIX

AR models factorize the conditional probability into an initial prediction and an autoregressive left-context prediction process, while fully NAR models (Gu et al., 2018; Qian et al., 2021) learn the conditional probability under an independence assumption for fast inference. To draw a better analogy to AR and NAR models, we formulate iterative NAR models (Gu et al., 2019; Ghazvininejad et al., 2019) in a lossless way by introducing a series of intermediate sequences w^y_{1:K-1}, with w^y_K = w^y.

A previous study (Huang et al., 2022) shows that there is a gap, called the conditional total correlation, between the AR and fully NAR learning paradigms, caused by the lossy decomposition of NAR models. This gap is mainly responsible for the performance drop from AR to NAR models. However, comparing iter-NAR, Eq. (20), with AR models, both can be factorized into an initial prediction term and a progressive prediction process based on different contexts (i.e., left-context in AR and full-context in iter-NAR). The discrepancy pointed out by Huang et al. (2022) is therefore closed in iter-NAR, assuming sufficient steps. By showing that DIFFUSEQ is an extension of the iter-NAR model, we offer a justification that it does not suffer from the conditional total correlation for the same reason.

A straightforward way to formulate naive diffusion models is to introduce a series of Gaussian noise-corrupted features y_{1:T-1}, with y_0 = y and y_T ~ N(0, I), on continuous space, where p(y_{t-1} | y_t, w^x) describes the diffusion process on continuous representations y and T denotes the total number of diffusion steps. Hereafter we omit the independent decomposition over w^y and y_t. To apply diffusion models on discrete space, the rounding operation in DIFFUSEQ maps the continuous vectors y to discrete w^y at each time step t; we therefore introduce both the continuous feature y and the discrete text w^y into Eq. (21).

By rearranging Eq. (23) and Eq. (24), we can see that DIFFUSEQ is a more generalized form of iter-NAR before marginalizing out {y_T, ..., y_0}, where Eq. (23) and Eq. (24) are equivalent up to computation order, despite the different initialization of y_T: for NAR models, y_T is uniformly copied from the source sentence or from the unk token embedding (Gu et al., 2018), while for diffusion models, y_T is sampled from the normal distribution N(0, I). Notably, unlike AR and fully NAR models, which generate text all at once, iterative NAR and diffusion models feature a self-correcting text generation process. A graphical comparison is shown in Figure 7.
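The self-correcting refinement loop shared by iter-NAR and diffusion decoding can be sketched as a generic reverse-sampling routine. This is a schematic under the Gaussian parameterization, not the paper's exact sampler: `f_theta` is a stand-in for the trained denoiser that re-estimates the clean latent y_0 at every step, and the choice of \sigma_t = \sqrt{\beta_t} is one common simplification.

```python
import numpy as np

def reverse_sample(f_theta, shape, alpha_bar, beta, rng):
    """Schematic reverse process: start from y_T ~ N(0, I) and iteratively refine."""
    T = len(alpha_bar)
    yt = rng.standard_normal(shape)   # diffusion init; iter-NAR would copy source embeddings
    for t in range(T, 0, -1):
        y0_hat = f_theta(yt, t)       # model's current estimate of the clean sequence latent
        a_t = 1.0 - beta[t - 1]
        abar_t = alpha_bar[t - 1]
        abar_prev = alpha_bar[t - 2] if t > 1 else 1.0
        # posterior mean of q(y_{t-1} | y_t, y_0), with y_0 replaced by its estimate
        mean = (np.sqrt(a_t) * (1.0 - abar_prev) / (1.0 - abar_t) * yt
                + np.sqrt(abar_prev) * beta[t - 1] / (1.0 - abar_t) * y0_hat)
        noise = rng.standard_normal(shape) if t > 1 else 0.0
        yt = mean + np.sqrt(beta[t - 1]) * noise
    return yt   # final continuous latent, to be rounded back to discrete tokens
```

The structural point is that every iteration conditions on the full current sequence estimate (full-context), not only on a left prefix as in AR decoding.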

D.2 SETTINGS OF BASELINES

We compare the settings of different models, including the number of parameters and the way different output sentences are sampled, in Table 4. For the plain GRU-based encoder-decoder method, we do not implement diversity-promoting search algorithms, so its sentence-level diversity can be very poor. For NAR-LevT, we set the maximum number of iterations to 9 and follow the termination condition from the original paper. For GPVAE-T5, we tune the scalar on the dev set to trade off quality against diversity; the scalars for all four tasks are set to 2. We implement the GPT2 baselines using HuggingFace Transformers, and the Transformer-base baseline using Fairseq. For the remaining baselines, there is no explicit factor controlling generation diversity, so we leave them as single points in the figure.

D.4 METRICS

The BLEU score we report is sentence-level and smoothed over 1- to 4-grams, and the ROUGE-L score is based on longest-common-subsequence statistics. The implementations are based on NLTK 5 and torchmetrics. Since n-gram-based metrics may fail to capture the semantic meaning of sentences, we also report BERTScore 6 . Specifically, we use microsoft/deberta-xlarge-mnli as the underlying model to help BERTScore correlate better with human judgments.
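On the diversity side of the evaluation, the div-4 score (Figure 5) counts distinct 4-grams among all generated 4-grams. A minimal pure-Python sketch, assuming the standard distinct-n definition (the paper's exact normalization may differ):

```python
def distinct_n(sentences, n=4):
    """div-n: fraction of distinct n-grams among all n-grams in a set of generations."""
    ngrams = []
    for s in sentences:
        toks = s.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Two identical generations share every 4-gram, so diversity is low.
low = distinct_n(["what is the best way to learn python",
                  "what is the best way to learn python"])
# Two unrelated generations share no 4-gram, so diversity is maximal.
high = distinct_n(["what is the best way to learn python",
                   "how do i improve my spoken english quickly"])
```

Scores near 1.0 indicate that the set of samples barely repeats itself, which is the behavior the paper highlights for DIFFUSEQ.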

D.5 GENERATION RESULTS

We list generation examples for the different tasks. As shown in Table 5, Table 6, and Table 7, DIFFUSEQ tends to generate diverse outputs, but its sentences are sometimes less fluent than those of finetuned GPT2.

