ANALOG BITS: GENERATING DISCRETE DATA USING DIFFUSION MODELS WITH SELF-CONDITIONING

Abstract

We present Bit Diffusion: a simple and generic approach for generating discrete data with continuous state and continuous time diffusion models. The main idea behind our approach is to first represent the discrete data as binary bits, and then train a continuous diffusion model to model these bits as real numbers, which we call analog bits. To generate samples, the model first generates the analog bits, which are then thresholded to obtain the bits that represent the discrete variables. We further propose two simple techniques, namely Self-Conditioning and Asymmetric Time Intervals, which lead to a significant improvement in sample quality. Despite its simplicity, the proposed approach can achieve strong performance in both discrete image generation and image captioning tasks. For discrete/categorical image generation, we significantly improve on the previous state of the art on both CIFAR-10 (which has 3K discrete 8-bit tokens) and IMAGENET 64×64 (which has 12K discrete 8-bit tokens), outperforming the best autoregressive model in both sample quality (measured by FID) and efficiency. For image captioning on the MS-COCO dataset, our approach achieves competitive results compared to autoregressive models.

1. INTRODUCTION

State-of-the-art generative models for discrete data, such as discrete images and text, are based on autoregressive modeling (Van den Oord et al., 2016; Salimans et al., 2017; Parmar et al., 2018; Child et al., 2019; Roy et al., 2021; Jun et al., 2020; Sutskever et al., 2014; Brown et al., 2020; Chowdhery et al., 2022), where the networks, often Transformers (Vaswani et al., 2017), are trained to predict each token given its preceding ones in a sequential manner or with causal attention masks. One major drawback of such approaches is that they typically require computation and memory quadratic in the dimension of the data (e.g., sequence length or image size), leading to difficulties in modeling large images or sequences. Another drawback is that, during generation, autoregressive models generate one token at a time, so the total number of sequential sampling steps is often the same as the dimension of the data, making generation slow for large images or long sequences. In contrast, diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020), or score-based generative models (Song & Ermon, 2019; 2020; Song et al., 2021), can model much higher dimensional data without running into computation and memory issues. During generation, diffusion models iteratively refine samples with a high degree of parallelism, so the total number of sequential sampling steps can be much smaller than the dimension of the data. However, state-of-the-art diffusion models (Dhariwal & Nichol, 2021; Ho et al., 2022; Nichol et al., 2021; Ramesh et al., 2022; Saharia et al., 2022) can only generate continuous data (mainly real-valued pixels), and have not yet achieved results competitive with autoregressive models in generating discrete/categorical data, such as discrete/categorical images (Hoogeboom et al., 2021; Austin et al., 2021).
In this work, we propose a simple and generic approach for enabling continuous state diffusion models to generate discrete data. The key ingredient in our approach is analog bits: real numbers used to model the bits that represent the discrete data. Analog bits can be directly modeled by continuous state diffusion models, without requiring a discrete state space or a re-formulation of the continuous diffusion process. At sampling time, the generated analog bits can be decoded into discrete variables by a simple thresholding operation. Our approach, as illustrated in Figure 1, is based on the following high-level conjecture: with strong continuous generative models (diffusion models in particular), it should not be too difficult to generate highly concentrated bimodal data where each real-valued analog bit is close to a binary bit. To reduce the prediction loss (such as negative log likelihood), the network has to model structures among analog bits that actually lead to meaningful discrete variables after thresholding. Besides analog bits, we further propose two simple techniques, namely Self-Conditioning and Asymmetric Time Intervals, that greatly improve the sample quality. We evaluate the proposed approach on both discrete image generation and image-conditional text/caption generation. On discrete CIFAR-10 and IMAGENET 64×64, the proposed Bit Diffusion model significantly outperforms not only existing discrete diffusion models but also the best autoregressive model. For example, on categorical CIFAR-10, the best autoregressive model (Jun et al., 2020) obtains an FID of 12.75, while our model (with 1/3 of the model size of the autoregressive model, using 100 instead of 3072 sequential inference steps) achieves a much better 6.93. For image captioning on the MS-COCO dataset, our model achieves a result competitive with a strong autoregressive captioner based on a Transformer.

2. METHOD

Preliminaries We start with a short introduction to diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020; 2021). Diffusion models learn a series of state transitions to map noise from a known prior distribution to $x_0$ from the data distribution. To learn this (reverse) transition from the noise distribution to the data distribution, a forward transition from $x_0$ to $x_t$ is first defined:

$$x_t = \sqrt{\gamma(t)}\, x_0 + \sqrt{1 - \gamma(t)}\, \epsilon,$$

where $\epsilon \sim \mathcal{N}(0, I)$, $t \sim \mathcal{U}(0, T)$ is a continuous variable, and $\gamma(t)$ is a monotonically decreasing function from 1 to 0. Instead of directly learning a neural net to model the transition from $x_t$ to $x_{t-\Delta}$, one can learn a neural net $f(x_t, t)$ to predict $x_0$ (or $\epsilon$) from $x_t$, and estimate $x_{t-\Delta}$ from $x_t$ and the estimated $\tilde{x}_0$ (or $\tilde{\epsilon}$). The training of $f(x_t, t)$ is based on denoising with an $\ell_2$ regression loss:

$$\mathcal{L}_{x_0} = \mathbb{E}_{t \sim \mathcal{U}(0,T),\, \epsilon \sim \mathcal{N}(0, I)} \left\| f\!\left(\sqrt{\gamma(t)}\, x_0 + \sqrt{1 - \gamma(t)}\, \epsilon,\ t\right) - x_0 \right\|^2. \quad (1)$$

To generate samples from a learned model, one follows a series of (reverse) state transitions $x_T \rightarrow x_{T-\Delta} \rightarrow \cdots \rightarrow x_0$. This is achieved by iteratively applying the denoising function $f$ to each state $x_t$ to estimate $x_0$, and then making a transition to $x_{t-\Delta}$ with the estimated $\tilde{x}_0$ (using transition rules such as those specified in DDPM (Ho et al., 2020) or DDIM (Song et al., 2020)). Note that the state transitions in these diffusion models assume a continuous data space and state space; therefore, they cannot be directly applied to model and generate discrete/categorical data.

Analog Bits A discrete data variable from an alphabet of size $K$ can be represented using $n = \lceil \log_2 K \rceil$ bits, as $\{0, 1\}^n$. Due to this discreteness, existing work has to re-formulate continuous diffusion models by adopting a discrete data space and state space (Sohl-Dickstein et al., 2015; Hoogeboom et al., 2021; Austin et al., 2021). In contrast, we propose to simply cast the binary bits $\{0, 1\}^n$ into real numbers $\mathbb{R}^n$ for the continuous diffusion models.¹
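As a concrete illustration, the conversion between discrete variables and analog bits could look like the following minimal NumPy sketch (the function names and the shift/scale to {-1, 1} are illustrative choices, not the paper's exact implementation):

```python
import numpy as np

def int_to_analog_bits(x, n_bits, scale=1.0):
    """Encode integers in [0, 2^n_bits) as analog bits in {-scale, scale}."""
    # Extract binary bits, least significant bit first.
    bits = ((x[..., None] >> np.arange(n_bits)) & 1).astype(np.float32)
    # Shift and scale {0, 1} -> {-scale, scale}.
    return (bits * 2.0 - 1.0) * scale

def analog_bits_to_int(analog_bits):
    """Decode analog bits back to integers by thresholding at 0."""
    bits = (analog_bits > 0).astype(np.int64)
    return (bits << np.arange(bits.shape[-1])).sum(axis=-1)

x = np.array([0, 3, 255])
a = int_to_analog_bits(x, n_bits=8)   # shape (3, 8), values in {-1, 1}
assert (analog_bits_to_int(a) == x).all()
```

The decoder tolerates imperfect model outputs: any real-valued analog bit is mapped to a binary bit by its sign, so the generated values need only be on the correct side of the threshold.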
We term these real numbers analog bits since they learn to share the same bimodal values as binary bits but are modeled as real numbers. To draw samples, we follow the same procedure as sampling in a continuous diffusion model, except that we apply a quantization operation at the end by simply thresholding the generated analog bits. This yields binary bits which can then be converted into the original discrete/categorical variables. Notably, there is no hard constraint to force the model to generate exact binary bits, but we



¹ After casting the bits as real numbers, one may also transform them by shifting and scaling from {0, 1} to {-b, b}.
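Once the discrete data is encoded as analog bits, training reduces to the standard ℓ2 denoising objective described above. A minimal sketch of one Monte-Carlo sample of that loss (the cosine schedule and function names are illustrative assumptions, not the paper's exact choices):

```python
import numpy as np

def gamma(t):
    # An illustrative cosine noise schedule, decreasing from 1 at t=0 to 0 at t=1.
    return np.cos(0.5 * np.pi * t) ** 2

def l2_denoising_loss(f_model, x0, rng):
    """One Monte-Carlo sample of the denoising loss: corrupt x0 with the
    forward process, then regress f(x_t, t) onto x0. f_model is a
    placeholder for a trained denoising network."""
    t = rng.uniform(0.0, 1.0)                    # t ~ U(0, T), with T = 1
    eps = rng.standard_normal(x0.shape)          # eps ~ N(0, I)
    x_t = np.sqrt(gamma(t)) * x0 + np.sqrt(1.0 - gamma(t)) * eps
    return np.mean((f_model(x_t, t) - x0) ** 2)
```

Here x0 would hold the analog bits of a training example; nothing in the objective itself is aware of the underlying discreteness.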



Figure 1: Bit Diffusion: modeling discrete data using continuous diffusion models with analog bits.
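The sampling procedure of Section 2 — iteratively estimating x0, transitioning to x_{t−∆}, and finally thresholding the analog bits — could be sketched with a deterministic DDIM-style update as follows (the schedule, step count, and f_model placeholder are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def gamma(t):
    # An illustrative cosine noise schedule, decreasing from 1 at t=0 to 0 at t=1.
    return np.cos(0.5 * np.pi * t) ** 2

def sample(f_model, shape, steps=100):
    """DDIM-style reverse process, then threshold analog bits to binary bits.

    f_model(x_t, t) -> predicted x0; a placeholder for a trained network.
    """
    x_t = np.random.randn(*shape)                  # x_T ~ N(0, I)
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t_now, t_next in zip(ts[:-1], ts[1:]):
        x0_pred = f_model(x_t, t_now)
        # Recover the implied noise estimate from x_t and the x0 prediction.
        eps = (x_t - np.sqrt(gamma(t_now)) * x0_pred) / np.sqrt(1.0 - gamma(t_now))
        # Deterministic DDIM transition to the next (less noisy) state.
        x_t = np.sqrt(gamma(t_next)) * x0_pred + np.sqrt(1.0 - gamma(t_next)) * eps
    return (x_t > 0).astype(np.int64)              # threshold analog bits -> bits
```

Only the final line differs from ordinary continuous diffusion sampling, which is what makes the approach generic: any continuous sampler (DDPM, DDIM, or otherwise) can be reused unchanged up to the thresholding step.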

