DIFFUSER: DIFFUSION VIA EDIT-BASED RECONSTRUCTION

Abstract

In text generation, models that generate text from scratch one token at a time are currently the dominant paradigm. Despite being performant, these models lack the ability to revise existing text, which limits their usability in many practical scenarios. We address this with DIFFUSER (Diffusion via Edit-based Reconstruction), a new edit-based generative model for text based on denoising diffusion models, a class of models that use a Markov chain of denoising steps to incrementally generate data. DIFFUSER is not only a strong generative model in general, rivalling autoregressive models on several tasks spanning machine translation, summarization, and style transfer; it can also perform other varieties of generation that standard autoregressive models are not well-suited for. For instance, we demonstrate that DIFFUSER makes it possible for a user to condition generation on a prototype or an incomplete sequence, and to continue revising based on previous edit steps.

1. INTRODUCTION

Revision and editing are central to how humans produce content: we write and revise emails and papers, gradually produce works of art, and iterate on plans for a project. Despite this, the dominant paradigm in text generation is purely autoregressive, producing text left-to-right in a single pass (Bengio et al., 2003). Although models employing this single-pass form of generation are highly performant, they are limited by the inability to refine existing text. To address this, we propose DIFFUSER: Diffusion via Edit-based Reconstruction, a flexible method to apply edit-based generative processes to arbitrary text generation tasks. Specifically, we take inspiration from diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020), generative models that generate by way of incremental denoising steps, and adapt this approach to the text generation paradigm with a formulation similar to natural editing processes. Prior work on text generation either focuses on improving the performance of standard autoregressive (AR) models through larger models and datasets (Vaswani et al., 2017; Sutskever et al., 2014; Radford et al.; Brown et al., 2020) or on proposing new, non-autoregressive approaches (Gu et al., 2017; Ghazvininejad et al., 2019; Gu et al., 2019) to improve general modes of text generation. A thus far separate line of work has modeled text edits for specific tasks, e.g. style transfer (Reid & Zhong, 2021; Malmi et al., 2020), sentence fusion (Malmi et al., 2019), and grammatical error correction (Dale & Kilgarriff, 2011). DIFFUSER unifies these two perspectives by enabling edit processes to be applied to general-purpose text generation without compromising performance or requiring external supervised data (Guu et al., 2018). This design enables it to both generate and edit text, including externally produced content, a natural extension of the text generation paradigm.
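The forward, corruption side of such an edit-based diffusion process can be pictured as a Markov chain that progressively perturbs a token sequence with random edits. The toy sketch below (with a hypothetical vocabulary and edit schedule, not DIFFUSER's actual corruption procedure) illustrates the idea:

```python
import random

# Toy sketch of an edit-based forward (corruption) process: at each timestep,
# a few tokens are randomly replaced, deleted, or have noise tokens inserted.
# The vocabulary and number of edits per step are illustrative assumptions.
VOCAB = ["the", "a", "cat", "dog", "sat", "ran"]

def corrupt_step(tokens, n_edits, rng):
    """Apply n_edits random edit operations to a copy of the token list."""
    tokens = list(tokens)
    for _ in range(n_edits):
        op = rng.choice(["replace", "insert", "delete"])
        if op == "replace" and tokens:
            tokens[rng.randrange(len(tokens))] = rng.choice(VOCAB)
        elif op == "insert":
            tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(VOCAB))
        elif op == "delete" and tokens:
            tokens.pop(rng.randrange(len(tokens)))
    return tokens

def forward_process(tokens, timesteps=3, n_edits=1, seed=0):
    """Return the chain [x_0, x_1, ..., x_T] of increasingly corrupted states."""
    rng = random.Random(seed)
    chain = [list(tokens)]
    for _ in range(timesteps):
        chain.append(corrupt_step(chain[-1], n_edits, rng))
    return chain
```

A reconstruction model is then trained to invert each corruption step, recovering x_{t-1} from x_t via edit operations.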
This enables flexible editing in a range of contexts, including machine translation, summarization, and style transfer, while also allowing for the possibility of taking outside input to guide and constrain generation. Learning these edit-based diffusion processes required several innovations over standard autoregressive and MLM-style iterative generation approaches (Ghazvininejad et al., 2019; Austin et al., 2021; Savinov et al., 2022), including edit-based corruption and reconstruction processes for training (Sec 3), as well as techniques to improve decoding quality across both timesteps and token-level generations (including 2D beam search; Sec 3.6, Sec 3.5). To demonstrate the effectiveness of DIFFUSER, we test our method on three text generation tasks: machine translation, abstractive summarization, and text style transfer, and show on-par or improved performance compared to purely autoregressive, single-pass, and non-autoregressive methods. We also provide qualitative samples of the edit processes learned by the models in different settings, analyses of training and inference speeds, and an analysis of the relationship between edit steps and performance. Overall, we demonstrate the potential of edit-based generative models to offer 1) more performant generation, 2) greater interactivity between models (as we can now perform edits in the discrete space on model-generated output), and 3) more flexible and controllable generation.
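To make the edit operations concrete: reconstruction steps act on sequences through keep, replace, insert, and delete operations. The sketch below applies such Levenshtein-style operations to a token list; the `(op, value)` tuple format is an illustrative assumption, not the paper's actual interface:

```python
# Minimal sketch of applying a sequence of Levenshtein-style edit operations
# (KEEP / REPLACE / INSERT / DELETE) to a token list. Each operation consumes
# one source token, except INSERT, which only produces an output token.
def apply_edits(tokens, edits):
    out, i = [], 0
    for op, value in edits:
        if op == "KEEP":
            out.append(tokens[i]); i += 1
        elif op == "REPLACE":
            out.append(value); i += 1
        elif op == "INSERT":
            out.append(value)
        elif op == "DELETE":
            i += 1
    out.extend(tokens[i:])  # any unconsumed source tokens are kept as-is
    return out
```

One diffusion step then amounts to predicting such an edit sequence and applying it to the current state of the text.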

2. BACKGROUND

DIFFUSER operates at the intersection of text generation, editing processes, and diffusion models. We first provide background and intuition for these three areas.

2.1. TEXT GENERATION

Most text generation models used in NLP today are autoregressive in nature. In this paradigm, given a sequence $s = [s_0, s_1, \ldots, s_N]$, one can model the likelihood of the entire sequence $P(s)$ by modeling the probability of predicting each token in an autoregressive, often left-to-right, manner. This formulation, where the likelihood of a token $p(s_t)$ is conditioned on its predecessors $s_{<t}$, is

$$P(s) = \prod_{t=0}^{N} p(s_t \mid s_{<t}).$$
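As a concrete toy instance of this factorization, the log-likelihood of a sequence can be accumulated term by term. The bigram table below is a purely illustrative stand-in for a learned conditional model $p(s_t \mid s_{<t})$, with context truncated to the previous token:

```python
import math

# Hypothetical conditional probabilities p(token | previous token); a real
# autoregressive model would condition on the full prefix s_{<t}.
BIGRAM = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.4,
    ("cat", "sat"): 0.6,
}

def sequence_log_likelihood(tokens):
    """log P(s) = sum_t log p(s_t | s_{<t}), truncated here to bigram context."""
    total = 0.0
    prev = "<s>"
    for tok in tokens:
        total += math.log(BIGRAM.get((prev, tok), 1e-9))  # floor unseen pairs
        prev = tok
    return total
```

Working in log space avoids numerical underflow when the product runs over long sequences.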
Figure 1: DIFFUSER's text generation process. Orange represents replacements, blue represents insertions, red represents deletions, and white represents keep operations. This process largely imitates a natural editing process (Reid & Neubig, 2022).

