SELF-CONDITIONED EMBEDDING DIFFUSION FOR TEXT GENERATION

Abstract

Can continuous diffusion models bring to natural language the same performance breakthrough they delivered for image generation? To circumvent the discrete nature of text data, we can simply project tokens into a continuous space of embeddings, as is standard in language modeling. We propose Self-conditioned Embedding Diffusion (SED), a continuous diffusion mechanism that operates on token embeddings and allows learning flexible and scalable diffusion models for both conditional and unconditional text generation. Through qualitative and quantitative evaluation, we show that our text diffusion models generate samples comparable with those produced by standard autoregressive language models, while being in theory more efficient on accelerator hardware at inference time. Our work paves the way for scaling up diffusion models for text, similarly to autoregressive models, and for improving performance with recent refinements to continuous diffusion.

1. INTRODUCTION

Continuous diffusion models (Sohl-Dickstein et al., 2015) have taken the world of image generation by storm, advancing the state of the art further than ever before (Rombach et al., 2021; Ramesh et al., 2022). Can the same framework encounter as much success on the text modality? Diffusion for language is indeed an attractive prospect. Compared to autoregressive (AR) models (Bengio et al., 2000; Sutskever et al., 2011; Austin et al., 2021; Hoffmann et al., 2022), diffusion models can predict all tokens in a sequence at once. This allows for bidirectional, rather than causal, attention, increasing interactions between tokens and potentially leading to more coherent samples. Diffusion models can also make better use of hardware accelerators during inference than AR models, since computations are parallelizable over the sequence axis. Yet AR models remain the mainstream approach for modeling text. A major obstacle to text diffusion is that diffusion processes typically operate in continuous space. While this naturally handles images, text is inherently discrete. Consequently, most previous attempts to apply diffusion to text have focused on discrete diffusion-like approaches. These methods do not benefit from the refinements made to continuous diffusion in the image domain. Crucially, they cannot make use of guidance (Dhariwal & Nichol, 2021), which drastically improves the sample quality of diffusion models. We address this gap by making a simple observation: language models operate mostly in continuous space, with discrete tokens only as inputs and outputs. A natural idea is then to conduct diffusion directly in a continuous token embedding space. For simplicity, we use a fixed embedding space, either random or stemming from a trained language model. Combined with the "self-conditioning" (Chen et al., 2022) refinement, this forms the basis of the method we propose, Self-conditioned Embedding Diffusion (SED).
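The core idea of diffusing in a fixed token embedding space can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random embedding table, the simple linear noise schedule, and all function names are assumptions made for clarity. The only essential ingredients it demonstrates are (i) mapping discrete tokens to continuous vectors via a fixed table, (ii) corrupting those vectors with Gaussian noise, and (iii) rounding continuous vectors back to tokens by nearest-neighbor lookup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a fixed embedding table mapping each vocabulary
# token to a continuous vector. The paper considers random embeddings
# (used here) or embeddings taken from a trained language model.
vocab_size, embed_dim, seq_len = 100, 16, 8
embedding_table = rng.normal(size=(vocab_size, embed_dim))

def embed(tokens):
    """Project discrete tokens into the continuous embedding space."""
    return embedding_table[tokens]

def add_noise(x0, t):
    """Forward diffusion: mix clean embeddings with Gaussian noise.

    A simple linear variance-preserving schedule alpha(t) = 1 - t for
    t in [0, 1] is assumed here purely for illustration.
    """
    alpha = 1.0 - t
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha) * x0 + np.sqrt(1.0 - alpha) * eps

def decode(x):
    """Map continuous vectors back to tokens by nearest-neighbor lookup."""
    # Squared distance between each position's vector and every embedding row.
    dists = ((x[:, None, :] - embedding_table[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=-1)

tokens = rng.integers(0, vocab_size, size=seq_len)
x0 = embed(tokens)
# With no noise (t = 0), nearest-neighbor decoding recovers the tokens.
assert (decode(add_noise(x0, t=0.0)) == tokens).all()
```

At generation time, a learned denoiser would iteratively refine pure noise toward points near valid embeddings, with `decode` applied at the end; the denoiser itself is omitted here.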
SED models rival mainstream AR models in both conditional and unconditional text generation. We make the following contributions:

• In section 3, we introduce SED, the first continuous diffusion approach to text with good scaling properties (we test models of up to 420M parameters). We analyze several continuous text diffusion settings and identify self-conditioning and diffusion over small fixed embeddings as the key factors that make continuous text diffusion work.

• In section 4, we apply classifier-free guidance (Ho & Salimans, 2022) to text data, which to our knowledge had not been done before. We show that SED can rival AR models on generic language tasks at similar model sizes. SED samples achieve a better likelihood-entropy trade-off than those of AR models and are judged comparable (if slightly worse) by human raters.
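Classifier-free guidance, as introduced by Ho & Salimans (2022), combines a conditional and an unconditional prediction from the same model at each sampling step. A minimal sketch of that combination rule is shown below; the function name and toy arrays are illustrative assumptions, but the extrapolation formula is the standard one.

```python
import numpy as np

def guided_prediction(pred_cond, pred_uncond, guidance_scale):
    """Classifier-free guidance: extrapolate the conditional prediction
    away from the unconditional one.

    guidance_scale = 0 gives the unconditional prediction, 1 gives the
    purely conditional one, and values > 1 strengthen the conditioning.
    """
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

# Toy check on two hypothetical denoiser outputs.
cond = np.array([1.0, 2.0])
uncond = np.array([0.5, 1.5])
assert np.allclose(guided_prediction(cond, uncond, 1.0), cond)
assert np.allclose(guided_prediction(cond, uncond, 0.0), uncond)
```

In practice, training drops the conditioning signal for a fraction of examples so that a single network can produce both predictions.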

