SELF-CONDITIONED EMBEDDING DIFFUSION FOR TEXT GENERATION

Abstract

Can continuous diffusion models bring the same performance breakthrough to natural language that they did for image generation? To circumvent the discrete nature of text data, we can simply project tokens into a continuous space of embeddings, as is standard in language modeling. We propose Self-conditioned Embedding Diffusion (SED), a continuous diffusion mechanism that operates on token embeddings and allows learning flexible and scalable diffusion models for both conditional and unconditional text generation. Through qualitative and quantitative evaluation, we show that our text diffusion models generate samples comparable with those produced by standard autoregressive language models, while being in theory more efficient on accelerator hardware at inference time. Our work paves the way for scaling up diffusion models for text, similarly to autoregressive models, and for improving performance with recent refinements to continuous diffusion.

1. INTRODUCTION

Continuous diffusion models (Sohl-Dickstein et al., 2015) have taken the world of image generation by storm, advancing the state of the art further than ever before (Rombach et al., 2021; Ramesh et al., 2022). Can the same framework encounter as much success on the text modality? Diffusion for language is indeed an attractive prospect. Compared to autoregressive (AR) models (Bengio et al., 2000; Sutskever et al., 2011; Austin et al., 2021; Hoffmann et al., 2022), diffusion models can predict all tokens in a sequence at once. This allows for bidirectional, rather than causal, attention, increasing interactions between tokens and potentially leading to more coherent samples. Diffusion models can also make better use of hardware accelerators during inference than AR models, since computations are parallelizable over the sequence axis. Yet AR models remain the mainstream approach for modelling text.

A major obstacle to text diffusion is that diffusion processes typically operate in continuous space. While this naturally handles images, text is inherently discrete. Consequently, most previous attempts to apply diffusion to text have focused on discrete diffusion-like approaches. These methods do not benefit from the refinements made to continuous diffusion in the image domain. Crucially, they cannot make use of guidance (Dhariwal & Nichol, 2021), which drastically improves diffusion models' sample quality. We address this gap by making a simple observation: language models operate mostly in continuous space, with discrete tokens only as inputs and outputs. A natural idea is then to conduct diffusion directly in a continuous token embedding space. For simplicity, we use a fixed embedding space, either random or stemming from a trained language model. Combined with the "self-conditioning" (Chen et al., 2022) refinement, this forms the basis of the method we propose, Self-conditioned Embedding Diffusion (SED).
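The core idea above, projecting discrete tokens into a fixed embedding space, running a Gaussian corruption process there, and decoding back by nearest neighbor, can be sketched as follows. This is a minimal illustration with names and shapes of our own choosing (a random embedding table, a toy variance-preserving noise schedule), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a fixed (random) embedding table, one option SED considers.
vocab_size, embed_dim = 100, 16
E = rng.normal(size=(vocab_size, embed_dim))   # fixed token embeddings
tokens = rng.integers(0, vocab_size, size=32)  # a token sequence
x0 = E[tokens]                                 # project tokens into continuous space

def noise(x0, alpha):
    """Variance-preserving corruption: x_t = sqrt(alpha)*x0 + sqrt(1-alpha)*eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha) * x0 + np.sqrt(1.0 - alpha) * eps

def decode(x):
    """Map each continuous vector back to the token with the nearest embedding."""
    dists = ((x[:, None, :] - E[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

# With little noise, nearest-neighbor decoding recovers the original tokens.
x_t = noise(x0, alpha=0.99)
recovered = decode(x_t)
```

A diffusion model would then be trained to denoise `x_t` back towards `x0`; the discrete tokens only appear at the input (embedding lookup) and output (nearest-neighbor rounding) of the pipeline.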
SED models rival mainstream AR models in both conditional and unconditional text generation. We make the following contributions:

• In section 3, we introduce SED, the first continuous diffusion approach for text with good scaling properties (testing models up to 420M parameters). We analyze several continuous text diffusion settings, and identify self-conditioning and diffusion on small fixed embeddings as key factors in making continuous text diffusion work.

• In section 4, we apply classifier-free guidance (Ho & Salimans, 2022) to text data, which to our knowledge had not been done before. We show that SED can rival AR models on generic language tasks at similar model sizes. SED samples achieve a better likelihood-entropy trade-off than those of these models, and are deemed comparable (if slightly worse) by human raters.
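Classifier-free guidance, as applied in the second contribution, combines the model's conditional and unconditional predictions at each denoising step. A minimal sketch of the combination rule (function and argument names are ours, not the paper's API):

```python
def cfg(pred_uncond, pred_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    towards (and, for w > 1, past) the conditional one.
    w = 0 recovers the unconditional prediction, w = 1 the conditional one."""
    return pred_uncond + w * (pred_cond - pred_uncond)
```

In practice both predictions come from a single network trained with conditioning randomly dropped out, so guidance costs one extra forward pass per step; weights `w > 1` trade sample diversity for fidelity.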

2. RELATED WORK

We provide an overview of diffusion models with a focus on modeling discrete data, as well as AR models and sample-based metrics for evaluating text generation.

Continuous diffusion on continuous image data. Continuous diffusion has recently established itself as the method of choice for modeling continuous data such as images. While our main focus in this paper is on discrete data, we review some key works in continuous data modeling, as this literature was the major source of inspiration for SED. The first continuous diffusion formulation was introduced in the seminal work of Sohl-Dickstein et al. (2015). Ho et al. (2020) improved and simplified this formulation, relating it to denoising score matching and creating a new method called DDPM. Nichol & Dhariwal (2021) further improved upon DDPM, showcasing impressive diffusion results compared to GANs. Rombach et al. (2021, Stable Diffusion) introduced diffusion in latent space; while conceptually similar to SED, it was specifically targeted at image modeling. Classifier-free guidance was proposed by Ho & Salimans (2022) as a means to improve image fidelity at the cost of reduced diversity. GLIDE (Nichol et al., 2022) scaled up the ideas of guided diffusion, while DALL-E 2 (Ramesh et al., 2022) and Imagen (Saharia et al., 2022) are the latest, most advanced image generation systems to date, combining most of the improvements proposed in previous works.

Discrete diffusion on discrete data. SUNDAE (Savinov et al., 2022) was the first non-AR method to show strong results in both machine translation and unconditional text generation. MaskGIT (Chang et al., 2022) demonstrated excellent results in modeling VQ-discretized images. These approaches rely on training models to predict masked tokens from their context, and on iterating this reconstruction step multiple times at sampling time. Despite these positive developments, samples from discrete diffusion methods for text modeling remain less coherent than those produced by AR methods.

Continuous diffusion on discrete data. Fewer works tackle diffusion on discrete data from the same angle as SED: turning the data into continuous representations before modeling it with continuous diffusion formulations. Mittal et al. (2021) used a VAE to generate such representations for discrete music modeling, with exciting results. Closest to SED, Diffusion-LM (Li et al., 2022) trains a token embedding together with the diffusion model itself. Diffusion-LM meets success on specific language applications, in low-data regimes and on constrained, highly formatted textual data. Most recently, Analog Bits (Chen et al., 2022) introduced self-conditioning, closely related to step-unrolls in SUNDAE (Savinov et al., 2022), together with bit-level modeling to improve the generation of discretized images. While the qualitative results of these continuous methods on text modeling show promise, they have not yet been shown to scale to large realistic text datasets like C4 (Raffel et al., 2020), or to compare with AR approaches on generic language tasks.

Auto-regressive modelling on discrete data. AR models remain the method of choice for modeling discrete data. In combination with neural networks, they were first explored by Bengio et al. (2000) and later combined with RNNs (Sutskever et al., 2011). Their breakthrough moment came with the advent of the Transformer architecture, introduced by Vaswani et al. (2017) for machine translation. Even more impressive results were shown with GPT-3 (Brown et al., 2020), which trained a large AR language model unconditionally and used few-shot prompting to adapt it to new tasks. A few works later improved upon the results of GPT-3, including Hoffmann et al. (2022).

Sample-based evaluation of text generative models. There are traditionally two classes of metrics for generative modeling: likelihood-based and sample-based. While the likelihood-based approach is mathematically appealing, its usefulness for measuring progress is reduced by the fact that not all generative models admit tractable likelihood computation.
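The self-conditioning idea mentioned above (Chen et al., 2022) has the denoiser receive, as an extra input, its own estimate of the clean data from the previous sampling step. A highly simplified sketch of the sampling loop, with an abstract `denoise_fn` and an elided noise-schedule update (all names and the update rule are illustrative, not the paper's):

```python
import numpy as np

def sample_with_self_conditioning(denoise_fn, x_t, n_steps):
    """Run n_steps of denoising, feeding the previous clean-data estimate
    (x0_hat) back into the model at each step."""
    x0_hat = np.zeros_like(x_t)  # first step: previous estimate is zeroed
    for _ in range(n_steps):
        x0_hat = denoise_fn(x_t, x0_hat)  # condition on the last estimate
        x_t = x0_hat  # simplified update; a real sampler would re-noise
                      # x0_hat to the next (lower) noise level
    return x0_hat
```

At training time, the extra input is populated by running the model once without it and reusing the (gradient-stopped) output, so the model learns to refine its own predictions.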

