ANALOG BITS: GENERATING DISCRETE DATA USING DIFFUSION MODELS WITH SELF-CONDITIONING

Abstract

We present Bit Diffusion: a simple and generic approach for generating discrete data with continuous state and continuous time diffusion models. The main idea behind our approach is to first represent the discrete data as binary bits, and then train a continuous diffusion model to model these bits as real numbers, which we call analog bits. To generate samples, the model first generates the analog bits, which are then thresholded to obtain the bits that represent the discrete variables. We further propose two simple techniques, namely Self-Conditioning and Asymmetric Time Intervals, which lead to a significant improvement in sample quality. Despite its simplicity, the proposed approach achieves strong performance on both discrete image generation and image captioning tasks. For discrete/categorical image generation, we significantly improve on the previous state-of-the-art on both CIFAR-10 (which has 3K discrete 8-bit tokens) and IMAGENET 64×64 (which has 12K discrete 8-bit tokens), outperforming the best autoregressive model in both sample quality (measured by FID) and efficiency. For image captioning on the MS-COCO dataset, our approach achieves results competitive with autoregressive models.
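The bit conversion described above can be sketched in a few lines. The following is a minimal NumPy illustration (the function names `int2bit` and `bit2int` are ours, not from the paper's released code): discrete values are expanded into their binary representation, shifted and scaled into real-valued analog bits in {-1, 1}, and at sampling time the generated analog bits are simply thresholded at zero and packed back into integers.

```python
import numpy as np

def int2bit(x, n_bits=8):
    """Map non-negative integers to analog bits in {-1.0, +1.0}.

    Each integer becomes n_bits real numbers, most significant bit first.
    """
    bits = (x[..., None] >> np.arange(n_bits - 1, -1, -1)) & 1
    return bits.astype(np.float32) * 2.0 - 1.0  # {0, 1} -> {-1, +1}

def bit2int(analog_bits):
    """Threshold analog bits at 0 and pack them back into integers."""
    bits = (analog_bits > 0).astype(np.int64)
    n_bits = bits.shape[-1]
    weights = 1 << np.arange(n_bits - 1, -1, -1)  # e.g. 128, 64, ..., 1
    return (bits * weights).sum(axis=-1)
```

Because the diffusion model itself only ever sees real-valued tensors, any continuous-state diffusion architecture can be trained on the output of `int2bit` unchanged; the discreteness is recovered purely by the final thresholding step.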

1. INTRODUCTION

State-of-the-art generative models for discrete data, such as discrete images and text, are based on autoregressive modeling (Van den Oord et al., 2016; Salimans et al., 2017; Parmar et al., 2018; Child et al., 2019; Roy et al., 2021; Jun et al., 2020; Sutskever et al., 2014; Brown et al., 2020; Chowdhery et al., 2022), where the networks, often Transformers (Vaswani et al., 2017), are trained to predict each token given its preceding ones, either in a sequential manner or with causal attention masks. One major drawback of such approaches is that they typically require computation and memory quadratic in the dimension of the data (e.g., sequence length or image size), leading to difficulties in modeling large images or sequences. Another drawback is that, during generation, autoregressive models produce one token at a time, so the total number of sequential sampling steps is often equal to the dimension of the data, making generation of large images or long sequences slow. In contrast, diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020), or score-based generative models (Song & Ermon, 2019; 2020; Song et al., 2021), can model much higher-dimensional data without running into computation and memory issues. During generation, diffusion models iteratively refine samples with a high degree of parallelism, so the total number of sequential sampling steps can be much smaller than the dimension of the data. However, state-of-the-art diffusion models (Dhariwal & Nichol, 2021; Ho et al., 2022; Nichol et al., 2021; Ramesh et al., 2022; Saharia et al., 2022) can only generate continuous data (mainly real-valued pixels), and have not yet achieved results competitive with autoregressive models in generating discrete/categorical data, such as discrete/categorical images (Hoogeboom et al., 2021; Austin et al., 2021).
In this work, we propose a simple and generic approach for enabling continuous state diffusion models to generate discrete data. The key ingredient in our approach is analog bits: real numbers used to model the bits that represent the discrete data. Analog bits can be directly modeled by continuous state diffusion models, without requiring a discrete state space or re-formulation of the continuous



† Work done as a student researcher at Google. Code at https://github.com/google-research/pix2seq.

