SCORE-BASED CONTINUOUS-TIME DISCRETE DIFFUSION MODELS

Abstract

Score-based modeling through stochastic differential equations (SDEs) has provided a new perspective on diffusion models, and demonstrated superior performance on continuous data. However, the gradient of the log-likelihood function, i.e., the score function, is not properly defined for discrete spaces. This makes it non-trivial to adapt the score-based modeling to categorical data. In this paper, we extend diffusion models to discrete variables by introducing a stochastic jump process where the reverse process denoises via a continuous-time Markov chain. This formulation admits an analytical simulation during backward sampling. To learn the reverse process, we extend score matching to general categorical data, and show that an unbiased estimator can be obtained via simple matching of the conditional marginal distributions. We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.

1. INTRODUCTION

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) have emerged as an important technique for data distribution modeling, where a data-corrupting forward process is coupled with a denoising reverse process to simulate a diffusion relationship between the data distribution and an uninformed source. Such models admit stable learning procedures and have demonstrated superior performance on continuous data modeling in challenging scenarios (Dhariwal & Nichol, 2021), leading to rapidly increasing popularity. Song et al. (2020) established a stochastic differential equation (SDE) view of diffusion models by taking the limit of ever finer corruption and denoising steps in the forward and backward processes, rendering a continuum of distributions. This perspective has provided a unified framework under a new score-based learning objective, and inspired a variety of simulation methods for efficient sampling and inference (De Bortoli et al., 2021; Zhang & Chen, 2022).

Given the advantages of diffusion models in terms of flexibility, learning tractability, and sampling, there have been several attempts to extend the approach to discrete data. Recent attempts have investigated alternative corruption operations for discrete data in the forward process, yielding promising results (Hoogeboom et al., 2021b;a; Austin et al., 2021). However, these extensions still execute a finite sequence of corruption and restoration steps, and remain restricted to a fixed reverse sampling strategy that can be sub-optimal. To overcome this limitation, we investigate whether a continuous-time discrete diffusion formulation might admit more effective estimation and generation. Such an extension is highly non-trivial, however. The continuous-time diffusion framework is based on a stochastic differential equation with respect to the score function, which itself is the gradient of the log-likelihood with respect to a continuous variable.
Although this can be used to characterize a continuum of infinitesimally evolving distributions over a continuous space, such a formulation no longer exists for discrete variables, since the gradient of the log-likelihood is not defined with respect to a discrete variable. Recently, Campbell et al. (2022) made significant progress in closing this gap by proposing a discrete data distribution that evolves as a continuous-time Markov chain. With this formulation they were able to approximate maximum likelihood training with an evidence lower bound (ELBO) surrogate and use a predictor-corrector sampling scheme, generalizing some of the benefits of continuous-time modeling to a discrete space. However, a limitation of this previous work is its reliance on an ELBO approximation to the MLE, when it is known that score-based learning yields superior estimation quality (when it can be applied) due to its unbiasedness (Song et al., 2020).

Of course, developing an analog of score matching for discrete spaces is non-trivial due to non-differentiability. Nevertheless, a score function for a discrete random variable cannot be arbitrary: whenever two score functions match over a space, their corresponding distributions should also match; the score function should characterize the direction of infinitesimal evolution of a discrete distribution; and finally, the score should enable tractable score-based estimation. In this paper, we investigate the design of such score functions for discrete spaces, achieving these desired properties, and provide corresponding learning and sampling methods. In addition, we complete the agenda for continuous-time diffusion in discrete spaces by formulating a coherent SDE in terms of stochastic jump processes.
The main contributions are:
• We extend the definition of score functions to generic categorical discrete variables, and derive a continuous-time discrete diffusion model via a continuous-time Markov chain in Section 3.1;
• We derive a score-based objective called categorical ratio matching for estimating the proposed model in Section 3.2, which can be tractably optimized, showing that a previous proposal for binary discrete data (Hyvärinen, 2007) can be obtained as a special case;
• In Section 4, we develop a numerical simulation technique for reverse sampling, then provide an analytical sampling method based on implicit modeling of the conditional marginals;
• We discuss three architectural choices and present a novel "hollow" neural model in Section 5;
• We evaluate the proposed SDDM on a set of synthetic and real-world music and image benchmarks, achieving promising results in Section 6.

2. BACKGROUND

Diffusion models (Sohl-Dickstein et al., 2015) are characterized by a forward Markov process that transforms an observation $x_0 \sim \pi_{\text{data}}(x_0)$ into a reference distribution $x_T \sim q_T(x_T)$, and a backward process, also Markovian, that recovers the data distribution from $x_T$. Specifically, the forward process is defined through simple corruption operations $q_{t+1|t}$. For continuous data the corruption kernel usually adds Gaussian noise (Ho et al., 2020); for discrete data, where $x_0 \in \mathcal{X}$ with finite cardinality $|\mathcal{X}|$, the corruption kernel can be uniform, discrete Gaussian, or some other choice (Austin et al., 2021; Johnson et al., 2021). Given the corruption kernel, after $T$ steps the forward process forms a joint distribution
$$q_{0:T}(x_{0:T}) = \pi_{\text{data}}(x_0) \prod_{t=0}^{T-1} q_{t+1|t}(x_{t+1}|x_t). \quad (1)$$
The backward process can be derived from the joint distribution via Bayes' rule,
$$q_{0:T}(x_{0:T}) = q_T(x_T) \prod_{t=0}^{T-1} q_{t|t+1}(x_t|x_{t+1}), \qquad q_{t|t+1}(x_t|x_{t+1}) = \frac{q_{t+1|t}(x_{t+1}|x_t)\, q_t(x_t)}{q_{t+1}(x_{t+1})}, \quad (2)$$
where $q_t(x_t)$ denotes the marginal distribution and the prior $q_T$ is usually a simple distribution. Typically the backward kernel $q_{t|t+1}(x_t|x_{t+1})$ is intractable; thus it is usually parameterized by a neural network, denoted $p^\theta_{t|t+1}$, and learned via the ELBO (Sohl-Dickstein et al., 2015; Ho et al., 2020; Austin et al., 2021) or score matching (Song & Ermon, 2019). Due to the structure of the joint distributions (1) and (2), the ELBO admits a particularly simple formulation,
$$\ell_{vb} = \mathbb{E}_{\pi_{\text{data}}}\Big[ D_{KL}\big(q_{T|0}\,\|\,q_T\big) + \sum_{t=1}^{T-1} \mathbb{E}_{q_{t|0}} D_{KL}\big(q_{t|t+1,0}\,\|\,p^\theta_{t|t+1}\big) - \mathbb{E}_{q_{1|0}} \log p^\theta_{0|1}(x_0|x_1) \Big], \quad (3)$$
which is applicable to both continuous and discrete diffusion model learning.
For continuous variables with Gaussian corruptions $q(x_{t+1}|x_t) = \mathcal{N}\big(x_{t+1}; \sqrt{1-\beta_{t+1}}\, x_t, \beta_{t+1} I\big)$ and a backward kernel $p^\theta(x_t|x_{t+1}) = \mathcal{N}\big(x_t; \tfrac{1}{\sqrt{1-\beta_{t+1}}}\big(x_{t+1} + \beta_{t+1} r^\theta_t(x_{t+1})\big), \beta_{t+1} I\big)$, where $r^\theta_t$ is learned as a neural network and $\beta_t$ is a predefined variance schedule, the ELBO (3) can be rewritten as
$$\ell_{vb} = \sum_{t=0}^{T-1} (1-\alpha_t)\, \mathbb{E}_{\pi_{\text{data}}} \mathbb{E}_{p_{\alpha_t}(x'|x)} \big\| r^\theta_t(x') - \nabla_{x'} \log p_{\alpha_t}(x'|x) \big\|^2, \quad (4)$$
where $\alpha_t = \prod_{\tau=0}^{t-1}(1-\beta_\tau)$. Equation (4) is closely related to score matching (Hyvärinen & Dayan, 2005) with a score function defined as $\nabla_x \log p_{t+1|t}(x_{t+1}|x_t)$. Reverse sampling in discrete-time models is somewhat restricted by the forward process. Therefore, continuous-time diffusion models have been constructed with $t$ indexed over $[0, T]$. Song et al. (2020) develop an SDE perspective and define a diffusion process with respect to the score as
$$dx = f(x, t)\, dt + g(t)\, dw, \quad \text{forward SDE}, \quad (5)$$
$$dx = \big[f(x, t) - g^2(t) \nabla_x \log p_t(x)\big]\, dt + g(t)\, d\bar{w}, \quad \text{reverse SDE}, \quad (6)$$
where $w$ and $\bar{w}$ are standard Wiener processes, $f(x, t)$ is a vector-valued drift function, and $g(t)$ is a scalar-valued diffusion coefficient. A score-based learning method can be easily derived as an extension of (4). However, the score function $\nabla_x \log p_t(x)$ is not defined for a discrete variable, and the SDE above no longer suffices to characterize a continuous-time extension for discrete diffusion.
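For the Gaussian corruption kernel above, the score inside Equation (4) has the closed form $\nabla_{x'} \log p_{\alpha_t}(x'|x) = -(x' - \sqrt{\alpha_t}\,x)/(1-\alpha_t)$. As a quick sanity check (our own illustration, not from the paper; the function names are ours), the following sketch compares this closed form against a finite-difference gradient of the log density:

```python
import numpy as np

def log_kernel(xp, x, alpha):
    """log N(x'; sqrt(alpha) x, (1 - alpha) I) for the Gaussian corruption."""
    var = 1.0 - alpha
    return -0.5 * np.sum((xp - np.sqrt(alpha) * x) ** 2) / var \
           - 0.5 * xp.size * np.log(2 * np.pi * var)

def analytic_score(xp, x, alpha):
    """Closed-form gradient of log_kernel with respect to x'."""
    return -(xp - np.sqrt(alpha) * x) / (1.0 - alpha)

rng = np.random.default_rng(0)
x, xp, alpha = rng.normal(size=3), rng.normal(size=3), 0.7

# Central finite differences along each coordinate of x'.
eps = 1e-5
fd = np.array([
    (log_kernel(xp + eps * e, x, alpha) - log_kernel(xp - eps * e, x, alpha)) / (2 * eps)
    for e in np.eye(3)
])
print(np.abs(fd - analytic_score(xp, x, alpha)).max())   # close to zero
```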

3. CONTINUOUS TIME DISCRETE SCORE MATCHING

Although Campbell et al. (2022) bypass the score function in their continuous-time extension by leveraging a stochastic process view of the ELBO approximation, we instead focus on a score-based extension for continuous-time discrete diffusion.

3.1. CONTINUOUS TIME MODELING

Consider the finite discrete state space $\mathcal{X} = \mathcal{C}^D$, where $\mathcal{C} = \{1, 2, \ldots, C\}$ is a code book. To generalize score matching from a continuous space $\mathbb{R}^n$ to the discrete space $\mathcal{X}$, we first model the forward process as a continuous-time Markov chain $\{X_t\}_{t \in [0,T]}$, whose transition probability is characterized by rate matrices $Q_t \in \mathbb{R}^{|\mathcal{X}| \times |\mathcal{X}|}$. In particular, if we let $q$ denote the distribution of the forward process $X_t$, the transition probability satisfies the Kolmogorov forward equation:
$$\frac{d}{dt} q_{t|s}(x_t|x_s) = \sum_{x \in \mathcal{X}} q_{t|s}(x|x_s)\, Q_t(x, x_t), \quad s < t. \quad (7)$$
If the forward process starts at the target distribution $q_0 = \pi_{\text{data}}$, the marginal at time $t$ has the form
$$q_t(x_t) = \sum_{x_0 \in \mathcal{X}} \pi_{\text{data}}(x_0)\, q_{t|0}(x_t|x_0). \quad (8)$$
By properly choosing the rate matrices $Q_t$, we can achieve a final distribution close to a known tractable distribution, $q_T \approx \pi_{\text{ref}}$. Then, the reverse-time process $\bar{X}_t = X_{T-t}$ can be expressed as a generative process from the reference distribution $\pi_{\text{ref}}$ to the target distribution $\pi_{\text{data}}$ (Anderson & Rhodes, 1983; Van Handel, 2007).

Proposition 3.1. The reverse-time process $\bar{X}_t$ of the continuous-time Markov chain $X_t$ is also a Markov process, whose transition probabilities $q_{s|t}(\cdot|\cdot)$ for $s < t$ satisfy:
$$q_{s|t}(x_s|x_t) = \frac{q_s(x_s)}{q_t(x_t)}\, q_{t|s}(x_t|x_s), \quad s < t. \quad (9)$$
Since we are considering a finite space $\mathcal{X}$, the reverse-time process $\bar{X}$ is uniquely determined by its rate matrices, which we denote $R_t$. Using the transition in Equation (9), we have the following.

Proposition 3.2. For a continuous-time Markov chain $\{X_t\}_{t \in [0,T]}$ with distribution $q$ and rate matrices $Q_t$, the rate matrices $R_t$ of the reverse process satisfy:
$$R_t(x, y) = \frac{q_t(y)}{q_t(x)}\, Q_t(y, x). \quad (10)$$
Equation (10) provides a closed-form expression for the reverse-time rate $R_t$. Therefore, once we know the ratio $q_t(y)/q_t(x)$, we can obtain the generative flow towards $\pi_{\text{data}}$.
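Propositions 3.1 and 3.2 can be checked numerically on a toy chain. The sketch below (our own illustration, with a hand-rolled Taylor-series `expm` to stay dependency-free) builds a small time-homogeneous rate matrix, forms the reverse kernel by Bayes' rule as in Equation (9), and confirms that dividing by a small step $\epsilon$ recovers the reverse rate of Equation (10):

```python
import numpy as np

def expm(A, terms=40):
    """Truncated Taylor series for the matrix exponential (fine for small ||A||)."""
    out, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

C = 4
rng = np.random.default_rng(0)
# Time-homogeneous rate matrix, convention Q[x, y] = rate of jumping x -> y:
# nonnegative off-diagonal entries, rows summing to zero.
Q = rng.uniform(size=(C, C))
np.fill_diagonal(Q, 0.0)
np.fill_diagonal(Q, -Q.sum(axis=1))

q0 = rng.dirichlet(np.ones(C))          # initial marginal q_0
t, eps = 0.5, 1e-4
P_t = expm(t * Q)                        # q_{t|0}: row x_0, column x_t
qt = q0 @ P_t                            # marginal q_t
qte = q0 @ expm((t - eps) * Q)           # marginal q_{t-eps}

# Reverse kernel over a small step via Bayes' rule (Eq. 9):
# q_{t-eps|t}(y|x) = q_{t-eps}(y) q_{t|t-eps}(x|y) / q_t(x)
P_step = expm(eps * Q)
rev = (qte[:, None] * P_step) / qt[None, :]     # entry [y, x]
# Reverse rate of Eq. 10: R_t(x, y) = q_t(y) / q_t(x) * Q(y, x)
R = (qt[None, :] / qt[:, None]) * Q.T
x, y = 0, 2
print(rev[y, x] / eps, R[x, y])          # the two quantities are close
```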

3.2. CATEGORICAL RATIO MATCHING

In general, the ratio $q_t(y)/q_t(x)$ in Equation (10) is intractable. This ratio behaves analogously to the score function $\nabla \log \pi(x)$ in Equation (6). Such a connection inspires learning the ratio $q_t(y)/q_t(x)$ in a similar manner to learning the score function for diffusion models in continuous spaces. For binary discrete variables, Hyvärinen (2007) proposed to match the ratio $\pi(x^{\setminus d}, X^d = c)/\pi(x^{\setminus d}, X^d = 1-c)$ as an extension of score matching. In this work, we consider the more general categorical case, where neither the score function $\nabla \log \pi(x)$ nor the binary ratio is well defined. The generalization relies on the singleton conditional distribution
$$\pi(X^d = c\,|\,x^{\setminus d}) = \pi(x^{\setminus d}, X^d = c) \Big/ \sum_{c' \in \mathcal{C}} \pi(x^{\setminus d}, X^d = c'), \quad (11)$$
yielding a score function in categorical space that we seek to match. The sufficiency of matching Equation (11) is guaranteed by the property that a joint distribution is completely determined by its singleton conditional distributions (Brook, 1964; Lyu, 2012).

Proposition 3.3. Consider random variables $X = (X^1, \ldots, X^D) \in \mathcal{X}$ and two probability distributions $\pi_1, \pi_2$. We have $\pi_1 = \pi_2$ if and only if their conditional distributions are equal: $\pi_1(X^d = x^d\,|\,x^{\setminus d}) = \pi_2(X^d = x^d\,|\,x^{\setminus d})$ for any $x \in \mathcal{X}$ and $d = 1, \ldots, D$.

Going back to the ratio $q_t(y)/q_t(x)$ in the reverse-time rate of Equation (10), Proposition 3.3 tells us we can recover the probability ratio by employing a time-dependent neural network $p_t(\cdot\,; \theta)$ to match the conditional distributions:
$$p_t(X^d\,|\,x^{\setminus d}; \theta) \approx q_t(X^d\,|\,x^{\setminus d}) \;\Rightarrow\; \frac{q_t(y^d, x^{\setminus d})}{q_t(x^d, x^{\setminus d})} = \frac{q_t(X^d_t = y^d\,|\,x^{\setminus d})}{q_t(X^d_t = x^d\,|\,x^{\setminus d})} \approx \frac{p_t(X^d_t = y^d\,|\,x^{\setminus d}; \theta)}{p_t(X^d_t = x^d\,|\,x^{\setminus d}; \theta)}. \quad (12)$$
To train $p_t(\cdot\,; \theta)$, we minimize the expected cross entropy with respect to the conditional distributions along the forward process:
$$\theta^* = \arg\min_\theta \int_0^T \sum_{x_t \in \mathcal{X}} q_t(x_t) \sum_{d=1}^D \Big[ -\sum_{c \in \mathcal{C}} q_t(X^d_t = c\,|\,x^{\setminus d}_t) \log p_t(X^d_t = c\,|\,x^{\setminus d}_t; \theta) \Big]\, dt, \quad (13)$$
where the loss is minimized if $p_t(X^d_t\,|\,x^{\setminus d}_t; \theta) \equiv q_t(X^d_t\,|\,x^{\setminus d}_t)$.
This loss function matches the ratio in Equation (11), hence we name it categorical ratio matching, a generalization of Hyvärinen (2007). Since the spirit of this loss function is the same as that of score matching in continuous spaces, we also interchangeably call this method categorical score matching. In Equation (13), the expectation over $q_t(\cdot)$ can be efficiently estimated via Monte Carlo sampling. However, the conditional distribution $q_t(X^d_t = c\,|\,x^{\setminus d}_t)$ is intractable, which complicates training. Fortunately, using a property of conditional distributions, we can avoid computing this intractable term in the loss function.

Proposition 3.4. The categorical ratio matching loss function in Equation (13) can be simplified as:
$$\theta^* = \arg\min_\theta \int_0^T \sum_{x_t \in \mathcal{X}} q_t(x_t) \sum_{d=1}^D -\log p_t(X^d = x^d_t\,|\,x^{\setminus d}_t; \theta)\, dt. \quad (14)$$
Using this surprisingly neat loss function, we can efficiently learn the conditional distributions. The learned $p_t(X^d_t\,|\,x^{\setminus d}_t; \theta)$ determines a reverse process, and we use $p(\cdot\,; \theta)$ to denote its joint distribution in order to distinguish it from the true reverse process. We will sometimes drop the $\theta$ when it does not create ambiguity.
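The equivalence of Equations (13) and (14) can be verified exactly on a small joint distribution: summing the full cross entropy against the data marginal collapses to the plain negative log-likelihood. The following sketch is our own illustration; the toy `q` and `p` are arbitrary joints, not trained models:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 3, 4
q = rng.dirichlet(np.ones(C ** D)).reshape((C,) * D)   # a small joint q_t
p = rng.dirichlet(np.ones(C ** D)).reshape((C,) * D)   # stand-in model joint

def conditional(joint, d, x):
    """Singleton conditional joint(X^d = . | all coordinates of x except d)."""
    idx = list(x)
    idx[d] = slice(None)
    row = joint[tuple(idx)]
    return row / row.sum()

loss_full, loss_simple = 0.0, 0.0
for x in np.ndindex(*q.shape):
    for d in range(D):
        qc, pc = conditional(q, d, x), conditional(p, d, x)
        # Eq. 13: cross entropy against the intractable q_t conditional.
        loss_full += q[x] * -(qc * np.log(pc)).sum()
        # Eq. 14: negative log-likelihood of the observed value only.
        loss_simple += q[x] * -np.log(pc[x[d]])

print(loss_full, loss_simple)   # the two losses coincide
```

The identity holds because, for each dimension, summing $q_t(x_t)$ over $x^d_t$ yields the marginal of $x^{\setminus d}_t$ times the conditional $q_t(x^d_t | x^{\setminus d}_t)$, which is exactly the weight in Equation (13).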

4. CONTINUOUS TIME DISCRETE SAMPLING

We next develop an efficient sampling scheme for the proposed diffusion model.

4.1. CONTINUOUS TIME SIMULATION FOR FORWARD PROCESS

In a continuous-time Markov chain, the transition matrix $q_{t|s}(\cdot|\cdot)$ from time $s$ to time $t$ in the forward process can be obtained by solving the ODE in Equation (7). For general rate matrices $Q_t \in \mathbb{R}^{|\mathcal{X}| \times |\mathcal{X}|}$, solving the ODE is intractable. Therefore, we follow standard practice in diffusion models (Austin et al., 2021; Campbell et al., 2022) and approximate the forward process by factorizing $X_t = (X^1_t, \ldots, X^D_t)$, where each sub-process $X^d_t$ propagates independently. Furthermore, we define the sub-rate matrix $Q^d_t = Q\,\beta(t)$ in terms of a fixed base rate $Q = P \Lambda P^{-1} \in \mathbb{R}^{C \times C}$ and a time schedule function $\beta(t)$; see the design of $\beta(t)$ in Appendix C.1. In this way, the sub-transition matrix in each dimension can be easily computed (Campbell et al., 2022) as:
$$q^d_{t|s} = P \exp\Big(\Lambda \int_s^t \beta(\tau)\, d\tau\Big) P^{-1}. \quad (15)$$
In particular, we use a uniform stationary base rate $Q = \mathbf{1}\mathbf{1}^\top - C I$, a natural choice for categorical discrete distributions as it admits the uniform distribution as its stationary distribution (Hoogeboom et al., 2021b; Austin et al., 2021). For simplicity of comparison we use the same schedule function as Campbell et al. (2022) in all experiments; in Appendix C.1 we discuss other possible choices.
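For the uniform base rate $Q = \mathbf{1}\mathbf{1}^\top - CI$, Equation (15) also has a well-known closed form, $\exp(\tau Q) = e^{-C\tau} I + \frac{1-e^{-C\tau}}{C}\, \mathbf{1}\mathbf{1}^\top$. A minimal sketch (ours, with `tau` standing in for the integrated schedule $\int_s^t \beta(u)\,du$) confirming that the eigendecomposition route matches:

```python
import numpy as np

C = 5
Q = np.ones((C, C)) - C * np.eye(C)        # uniform base rate 1 1^T - C I

# Eigendecomposition route of Eq. 15 (Q is symmetric, so eigh applies).
lam, P = np.linalg.eigh(Q)
tau = 0.3                                   # stand-in for the integrated schedule
trans = P @ np.diag(np.exp(lam * tau)) @ P.T

# Known closed form for the uniform rate.
closed = np.exp(-C * tau) * np.eye(C) \
       + (1 - np.exp(-C * tau)) / C * np.ones((C, C))

print(np.abs(trans - closed).max())         # close to zero
print(trans.sum(axis=1))                    # every row sums to 1
```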

4.2. DISCRETE TIME SAMPLING FOR REVERSE PROCESS

Even given the learned conditional distribution $p_t(\cdot\,; \theta)$, simulating the reverse-time process is much harder than simulating the forward process, since the reverse process does not factorize. For example, consider $x, y$ that differ only at the $d$-th site. The rate for the reverse process to jump from $x$ to $y$ at time $t$ is:
$$R^d_t(x_t, y; \theta) = \frac{p_t(X^d_t = y^d\,|\,x^{\setminus d}_t; \theta)}{p_t(X^d_t = x^d_t\,|\,x^{\setminus d}_t; \theta)}\, Q_t(y, x_t). \quad (16)$$
Such a jump rate depends on both the time $t$ and the values of the other dimensions $x^{\setminus d}_t$. Hence, we cannot simulate each dimension in parallel, making exact simulation of the reverse-time process $\bar{X}_t$ less efficient. Considering that $R^d_t(x_t, y; \theta)$ is already an approximation of $R^d_t(x_t, y)$, we employ Euler's method for parallel simulation. Specifically, given $x_t$ at time $t$, we fix the rate in Equation (16), then determine the transition probabilities for dimension $d$ at time $t - \epsilon$ according to:
$$p^d_{t-\epsilon|t}(X^d_{t-\epsilon} = c\,|\,x^{\setminus d}_t; \theta) = \begin{cases} \epsilon\, R^d_t(x_t, X^d_{t-\epsilon} = c; \theta), & c \neq x^d_t \\ 1 - \epsilon \sum_{c' \neq x^d_t} R^d_t(x_t, X^d_{t-\epsilon} = c'; \theta), & c = x^d_t \end{cases} \quad (17)$$
We clip these quantities to ensure all probabilities are non-negative. Then, we draw a new value for each dimension to obtain a new state $y_{t-\epsilon}$, which has the factorized probability:
$$p_{t-\epsilon|t}(X_{t-\epsilon} = y_{t-\epsilon}\,|\,x_t; \theta) = \prod_{d=1}^D p^d_{t-\epsilon|t}(X^d_{t-\epsilon} = y^d_{t-\epsilon}\,|\,x^{\setminus d}_t; \theta). \quad (18)$$
In this way, we can update all dimensions of $x_t$ in parallel, making simulation of the reverse-time process much more efficient.
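The step above can be sketched as follows (our illustration; `euler_reverse_step` is a hypothetical helper with a dummy conditional table standing in for the learned network, and the uniform base rate):

```python
import numpy as np

def euler_reverse_step(x, cond_probs, Q, eps, rng):
    """One parallel Euler step of Eq. 17.

    x          : current state, shape (D,), values in {0, ..., C-1}
    cond_probs : stand-in for p_t(X^d = c | rest), shape (D, C)
    Q          : base forward rate matrix, shape (C, C)
    """
    D, C = cond_probs.shape
    new_x = np.empty(D, dtype=int)
    for d in range(D):
        # Eq. 16: ratio of conditionals times the forward rate Q(y, x_t).
        rates = cond_probs[d] / cond_probs[d, x[d]] * Q[:, x[d]]
        probs = eps * rates
        probs[x[d]] = 0.0
        probs[x[d]] = max(0.0, 1.0 - probs.sum())   # stay probability, clipped
        probs = np.clip(probs, 0.0, None)           # Eq. 17 plus clipping
        probs /= probs.sum()
        new_x[d] = rng.choice(C, p=probs)           # factorized draw (Eq. 18)
    return new_x

rng = np.random.default_rng(0)
D, C, eps = 8, 4, 1e-3
Q = np.ones((C, C)) - C * np.eye(C)                 # uniform base rate
cond = rng.dirichlet(np.ones(C), size=D)            # dummy learned conditionals
x = rng.integers(C, size=D)
y = euler_reverse_step(x, cond, Q, eps, rng)
print(x, y)                                         # a parallel one-step update
```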

4.3. ANALYTICAL SAMPLING FOR REVERSE PROCESS

Euler's method above assumes the rate matrix $R_\tau$ is fixed during $\tau \in [t-\epsilon, t]$. Such an approximation makes the samples $x_t$ deviate from the correct time-slice marginal $q_t(\cdot)$, especially when the time step $\epsilon$ is large. To mitigate this approximation error, diffusion models usually introduce a corrector to improve sample quality (Song et al., 2020; Campbell et al., 2022), but this increases computational cost. Instead, we propose an alternative approach that leverages implicit modeling of $p_t(X^d\,|\,x^{\setminus d}_t; \theta)$. Specifically, for the distribution $q$, we have:
$$q_t(X^d_t\,|\,x^{\setminus d}_t) = \sum_{x^d_0} q_{0|t}(x^d_0\,|\,x^{\setminus d}_t)\, q_{t|0}(X^d_t\,|\,x^d_0). \quad (19)$$
Since $q_{t|0}$ is tractable, computing Equation (19) is efficient once we know $q_{0|t}(X^d_0\,|\,x^{\setminus d}_t)$. Hence, we can replace the explicit approximation in Equation (12) by the following implicit approximation:
$$p_{0|t}(X^d_0\,|\,x^{\setminus d}_t; \theta) \approx q_{0|t}(X^d_0\,|\,x^{\setminus d}_t) \;\Rightarrow\; p_t(X^d_t\,|\,x^{\setminus d}_t; \theta) = \sum_{x^d_0} p_{0|t}(x^d_0\,|\,x^{\setminus d}_t; \theta)\, q_{t|0}(X^d_t\,|\,x^d_0). \quad (20)$$
Equation (20) provides a tractable transformation from $p_{0|t}(X^d_0\,|\,x^{\setminus d}_t; \theta)$ to $p_t(X^d_t\,|\,x^{\setminus d}_t; \theta)$. Hence, we can continue using the categorical ratio matching loss in Equation (14) to train $p_{0|t}(X^d_0\,|\,x^{\setminus d}_t; \theta)$. To conduct backward sampling via this new parameterization, we consider the true reverse process:
$$q_{t-\epsilon|t}(X^d_{t-\epsilon}\,|\,x_t) = \sum_{x^d_0} q_{0|t}(x^d_0\,|\,x_t)\, q_{t-\epsilon|0,t}(X^d_{t-\epsilon}\,|\,x^d_0, x_t) \quad (21)$$
$$= \sum_{x^d_0} \frac{q(x^d_t\,|\,x^d_0, x^{\setminus d}_t)\, q(x^d_0\,|\,x^{\setminus d}_t)}{q(x^d_t\,|\,x^{\setminus d}_t)} \cdot \frac{q(x_t\,|\,x^d_0, X^d_{t-\epsilon})\, q(X^d_{t-\epsilon}\,|\,x^d_0)}{q(x_t\,|\,x^d_0)} \quad (22)$$
$$\propto \sum_{x^d_0} q_{0|t}(x^d_0\,|\,x^{\setminus d}_t)\, q_{t|t-\epsilon}(x^d_t\,|\,X^d_{t-\epsilon})\, q_{t-\epsilon|0}(X^d_{t-\epsilon}\,|\,x^d_0). \quad (23)$$
By substituting $p_{0|t}(X^d_0\,|\,x^{\setminus d}_t; \theta)$, we have:
$$p_{t-\epsilon|t}(X^d_{t-\epsilon}\,|\,x_t; \theta) \propto \sum_{x^d_0} p_{0|t}(x^d_0\,|\,x^{\setminus d}_t; \theta)\, q_{t|t-\epsilon}(x^d_t\,|\,X^d_{t-\epsilon})\, q_{t-\epsilon|0}(X^d_{t-\epsilon}\,|\,x^d_0). \quad (24)$$
Thus, we obtain an analytical expression for reverse process sampling that avoids the simulation error of Euler's method. Hence, we refer to this method as analytical sampling.
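Equation (24) can be sketched for a single dimension under the uniform base rate, where the needed forward kernels $q_{t|t-\epsilon}$ and $q_{t-\epsilon|0}$ have closed forms. This is our illustration: we assume $\beta \equiv 1$ so integrated rates equal time differences, and `p0` is a stand-in for the learned denoising distribution:

```python
import numpy as np

def uniform_transition(tau, C):
    """Closed-form forward kernel for the uniform base rate; tau is the
    integrated rate (here beta = 1, so tau is just the elapsed time)."""
    return np.exp(-C * tau) * np.eye(C) \
         + (1 - np.exp(-C * tau)) / C * np.ones((C, C))

def analytical_posterior(xd_t, p0, t, eps, C):
    """Eq. 24 for one dimension: distribution over X^d_{t-eps} given the
    current value x^d_t and a denoising distribution p0 = p_{0|t}(. | rest)."""
    q_t_given_tmeps = uniform_transition(eps, C)       # q_{t | t-eps}
    q_tmeps_given_0 = uniform_transition(t - eps, C)   # q_{t-eps | 0}
    # sum over x^d_0 of p0(x0) q_{t|t-eps}(x^d_t | v) q_{t-eps|0}(v | x0)
    post = q_t_given_tmeps[:, xd_t] * (p0 @ q_tmeps_given_0)
    return post / post.sum()

rng = np.random.default_rng(0)
C, t, eps = 6, 0.8, 0.05
p0 = rng.dirichlet(np.ones(C))
xd_t = 3
post = analytical_posterior(xd_t, p0, t, eps, C)
print(post.sum())                                   # a valid distribution
print(analytical_posterior(xd_t, p0, t, 0.0, C))    # eps = 0: mass stays on x^d_t
```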

5. PARAMETERIZATION

The key to the simplicity of the objective in Equation (14) is the special structure of the conditional marginal distributions $p_t(X^d_t\,|\,x^{\setminus d}_t; \theta)$ and $p_{0|t}(X^d_0\,|\,x^{\setminus d}_t; \theta)$: the marginal for the $d$-th dimension does not depend on the current value $x^d_t$ at time $t$. However, this elegance in the objective brings additional challenges for designing an efficient and flexible neural network parameterization. A key design constraint is that the prediction at dimension $d$ must not depend on $x^d_t$, otherwise the information leak would make Equation (14) trivial to solve. The prediction can depend on the other coordinates of $x_t$, however. We offer three alternative architectures that satisfy this constraint but incur different trade-offs between flexibility and computational cost. Without loss of generality we consider the parameterization of $p_t(X^d_t\,|\,x^{\setminus d}_t; \theta)$; the same designs apply directly to the parameterization of $p_{0|t}(X^d_0\,|\,x^{\setminus d}_t; \theta)$.

5.1. ENERGY BASED MODELS

The most general and flexible parameterization is an energy-based model (EBM), where an arbitrary neural network $f_\theta(x, t): \mathcal{C}^D \times \mathbb{R} \to \mathbb{R}$ specifies the energy of any sample. In this case the conditional marginals can be modeled as:
$$p_t(X^d_t = c\,|\,x^{\setminus d}_t; \theta) = \frac{\exp\big(-f_\theta([X^d_t = c, x^{\setminus d}_t], t)\big)}{\sum_{c' \in \mathcal{C}} \exp\big(-f_\theta([X^d_t = c', x^{\setminus d}_t], t)\big)},$$
where we overload the notation $[c, x^{\setminus d}_t]$ to denote $[x^1_t, \ldots, x^{d-1}_t, c, x^{d+1}_t, \ldots]$, i.e., $x_t$ with only the $d$-th value replaced by $c$. This modeling flexibility, however, comes at a high computational cost: to evaluate $\prod_{d=1}^D p_t(X^d_t\,|\,x^{\setminus d}_t; \theta)$ one needs $O(D \times C)$ rounds of evaluation of $f_\theta$, which is computationally prohibitive when modeling high-dimensional data with a deep neural parameterization of $f_\theta$. Nevertheless, this connection between EBMs and diffusion provides another way to learn time-dependent EBMs, which might be of independent interest.
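The construction can be sketched with a toy energy in place of a neural $f_\theta$ (our illustration): each conditional takes $C$ energy evaluations, so all $D$ conditionals cost $O(D \times C)$, and the result matches the brute-force conditional of the induced joint:

```python
import numpy as np

def energy(x, t):
    """Toy stand-in for f_theta(x, t): any scalar function of a state works."""
    return np.sin(x.sum() + t) + 0.1 * (x ** 2).sum()

def ebm_conditional(x, d, t, C):
    """p_t(X^d = c | other coordinates) from the energy: C evaluations per site."""
    logits = np.empty(C)
    for c in range(C):
        xc = x.copy()
        xc[d] = c                          # replace only the d-th coordinate
        logits[c] = -energy(xc, t)
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()

# Compare with the brute-force conditional of the induced joint pi(x) ~ exp(-f).
D, C, t = 3, 4, 0.5
joint = np.empty((C,) * D)
for idx in np.ndindex(*joint.shape):
    joint[idx] = np.exp(-energy(np.array(idx), t))
joint /= joint.sum()

x, d = np.array([1, 3, 0]), 1
slice_d = joint[x[0], :, x[2]]             # vary only dimension d = 1
print(np.abs(ebm_conditional(x, d, t, C) - slice_d / slice_d.sum()).max())
```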

5.2. MASKED MODELS

To alleviate the computational overhead of EBMs while preserving flexibility, a masked model is a natural choice. Specifically, let a masking function $m_d(x) = [x^1, \ldots, x^{d-1}, \text{MASK}, x^{d+1}, \ldots, x^D]$ replace the $d$-th dimension of a given $x$ with a special mask token MASK. Then one can formulate the following conditional parameterization:
$$p_t(X^d_t\,|\,x^{\setminus d}_t; \theta) = \text{Softmax}\big(f_\theta(m_d(x_t), t)\big), \quad \text{where } f_\theta(x, t): (\mathcal{C} \cup \{\text{MASK}\})^D \times \mathbb{R} \to \mathbb{R}^C.$$
Here $f_\theta$ can still be a general neural network, with only the minor requirement of handling the mask token. Overall this approach requires $O(D)$ rounds of evaluation of $f_\theta$. Since we are dealing with discrete data, one can further reduce $D$ at the cost of increasing the vocabulary size $C$, to further reduce the rounds of feed-forward evaluations of $f_\theta$.
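A minimal sketch of the masked parameterization with a toy one-layer network (our illustration; the paper uses deep networks): the MASK token is an extra vocabulary id, each of the $D$ conditionals costs one forward pass, and by construction the prediction at site $d$ cannot depend on $x^d$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 6, 5
MASK = C                                   # extra vocabulary id for the mask token
E = rng.normal(size=(C + 1, 8))            # token embeddings (MASK included)
W = rng.normal(size=(D * 8 + 1, C))        # one linear layer: flat embeddings + t

def f_theta(x_masked, t):
    """Toy masked network: embeds tokens (incl. MASK) and maps to C logits."""
    h = np.concatenate([E[x_masked].ravel(), [t]])
    return h @ W

def masked_conditionals(x, t):
    """All D conditionals with D forward passes, masking one site each time."""
    out = np.empty((D, C))
    for d in range(D):
        xm = x.copy()
        xm[d] = MASK                        # hide the value being predicted
        logits = f_theta(xm, t)
        e = np.exp(logits - logits.max())
        out[d] = e / e.sum()
    return out

x = rng.integers(C, size=D)
probs = masked_conditionals(x, 0.3)
# The prediction at d must not depend on x^d: changing x[2] leaves row 2 intact.
x2 = x.copy(); x2[2] = (x[2] + 1) % C
print(np.abs(masked_conditionals(x2, 0.3)[2] - probs[2]).max())  # exactly 0
```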

5.3. HOLLOW TRANSFORMERS

Even though the two parameterizations above allow flexibility in the neural network design, they require a number of feed-forward evaluations that scales with the dimensionality and/or vocabulary size of the discrete data. As an alternative, we propose a Transformer variant that requires only $O(1)$ feed-forward evaluations. The key constraint is that the diagonal of the Jacobian matrix of $p_t(X_t\,|\,x_t; \theta)$ be zero for any input $x_t$. Many techniques have been developed to guarantee this property, including autoregressive masking (Germain et al., 2015; Vaswani et al., 2017), which creates a triangular Jacobian for multi-layer networks, and hollow masking (Chen & Duvenaud, 2019), which considers the full context but only permits a single layer of dense interaction between dimensions. Here, to capture the full context for each dimension with an expressive deep neural architecture, we introduce the hollow Transformer, a model that runs two autoregressive Transformers, one in each direction, for $L$ layers; see Figure 1. The hollow Jacobian is obtained as the sum of upper- and lower-triangular Jacobians, so the full context for each position is covered. We add one additional Transformer layer at the top, where the query vector comes from the two directional embeddings of the corresponding dimension, and attention over the key and value vectors is conducted jointly across the two directions. For clarity we omit the details of each Transformer layer, as we use the standard architecture (Vaswani et al., 2017). Overall, this specially designed architecture requires a number of feed-forward evaluations independent of the dimensionality and vocabulary size, while leveraging the expressiveness of multi-layer Transformers.

Figure 1: Illustration of the hollow Transformer: two directional streams jointly produce the conditionals $p(x^d\,|\,x^{\setminus d})$ in one pass, and the Jacobian $\partial p_t(X\,|\,x_t)/\partial x_t$ has a zero diagonal.
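The zero-diagonal Jacobian property can be illustrated with a stripped-down stand-in for the two directional streams (a toy sketch, not the actual two-stream Transformer): composing strictly triangular linear maps with a nonlinearity keeps each stream strictly causal, and summing the two directions covers the full context except the position itself:

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 6, 3

def strict_tril(rng):
    """Random strictly lower-triangular weights: position d sees only < d."""
    return np.tril(rng.normal(size=(D, D)), k=-1)

fwd = [strict_tril(rng) for _ in range(L)]       # left-to-right stream
bwd = [strict_tril(rng).T for _ in range(L)]     # right-to-left (strictly upper)

def stream(x, layers):
    h = x
    for W in layers:
        h = np.tanh(W @ h)                        # stays strictly causal per layer
    return h

def hollow_net(x):
    # Summing the two one-directional streams gives every position access to
    # the full context except the position itself: a hollow Jacobian.
    return stream(x, fwd) + stream(x, bwd)

x = rng.normal(size=D)
y = hollow_net(x)
d = 2
x2 = x.copy(); x2[d] += 1.0
y2 = hollow_net(x2)
print(y2[d] - y[d])        # exactly 0: output d does not see input d
print(np.any(y2 != y))     # True: other positions do change
```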

6. EXPERIMENTS

We present an empirical evaluation of the proposed diffusion approach on synthetic data, the CIFAR10 dataset, and a monophonic music dataset. The primary goal is to evaluate the effectiveness of the proposed categorical ratio matching under the alternative parameterizations presented in Section 5. We use the time duration T = 1 and the uniform forward process in all cases. Experiments are conducted on machines equipped with TPU-v4 chips. For more details please refer to Appendix C.

6.1. DEEP EBMS ON SYNTHETIC DATA

As discussed in Section 5.1, one can leverage EBMs for modeling the conditional marginals. Here we verify this approach using synthetic datasets from the discrete EBM learning community. Following prior works (Dai et al., 2020; Zhang et al., 2022), we use seven different distributions of 32-dimensional binary discrete data to evaluate the approaches. Each discrete distribution is obtained from a 2-D continuous point $(x, y) \in \mathbb{R}^2$ by quantizing both $x$ and $y$ into 16-bit representations using a Gray code (Gray, 1953). We parameterize the energy function $f_\theta(x, t)$ using the same 3-layer MLP as prior works (Dai et al., 2020; Zhang et al., 2022), with one minor change: a sinusoidal embedding of $t$ is added to each hidden layer before the activation. The uniform rate constant is set to 1.0 and we use a time resolution of 1e-3 for simulation. To measure sample quality, we follow Zhang et al. (2022) and compare 4,000 generated samples to true data samples using the exponential Hamming MMD (Gretton et al., 2012), repeating 10 times to report average results in Table 1. As baselines we include PCD (Tieleman, 2008), ALOE (Dai et al., 2020) with a larger network for dual parameterization, and EB-GFN (Zhang et al., 2022). We find that the proposed continuous-time categorical ratio matching approach successfully learns a good EBM, yielding MMD values consistently lower than the baseline methods. We also visualize the distribution obtained via SDDM in Figure 2, by reverse mapping the discrete Gray codes into 2-D points in continuous space. These plots show a distribution similar to the ground truth data (see Appendix C.2 for more information). Ablation study: we further provide ablation studies on the three parameterizations proposed in Section 5 and the two sampling methods. For the full results please see Appendix C.2.1.
Overall, when 3-layer Transformers are used in these three parameterizations, the performance is comparable across them, which shows that the masked model and hollow model can achieve better quality-speed trade-offs than EBMs. We revisit the comparison among samplers in the next section.
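The Gray-code quantization used for the synthetic benchmarks can be sketched as follows (our illustration; the bit width matches the text, but the value range is an assumption): each coordinate is binned into $2^{16}$ levels and encoded so that adjacent bins differ in a single bit:

```python
import numpy as np

BITS = 16

def gray_encode(n):
    """Binary-reflected Gray code of integer n."""
    return n ^ (n >> 1)

def gray_decode(g):
    """Invert the Gray code: bit i of n is the XOR of all bits >= i of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def point_to_bits(x, y, lim=4.0):
    """Quantize a 2-D point into a 32-dim binary vector via per-axis Gray codes."""
    bits = []
    for v in (x, y):
        idx = int(np.clip((v + lim) / (2 * lim), 0, 1) * (2 ** BITS - 1))
        g = gray_encode(idx)
        bits.extend((g >> i) & 1 for i in reversed(range(BITS)))
    return np.array(bits, dtype=np.int8)

# Round trip, and the one-bit-flip property of adjacent codes.
for n in [0, 1, 2, 1000, 2 ** BITS - 1]:
    assert gray_decode(gray_encode(n)) == n
print(bin(gray_encode(7) ^ gray_encode(8)))   # a single set bit
print(point_to_bits(0.5, -1.25).shape)        # (32,)
```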

6.2. IMAGE MODELING ON CIFAR10

Published as a conference paper at ICLR 2023

Figure 2: Visualization of sampled discrete binary data in 2-D space via decoding of Gray codes (2spirals, 8gaussians, circles, moons, pinwheel, swissroll, checkerboard).

Table 1: Quality of generated binary samples from the learned EBMs, in terms of MMD with exponential Hamming kernel using bandwidth 0.1 (in units of 1 × 10^-4; lower is better).

Next we evaluate image generation quality using the CIFAR10 dataset. In this case, each raw image has shape (32, 32, 3) with 256 choices per pixel value. We compare several diffusion models from the literature, with the main goal of understanding performance in different spaces with either continuous or discrete time formulations. We primarily evaluate the masked model formulation presented in Section 5.2, and report the commonly used Inception Score (IS) and Fréchet Inception Distance (FID) against the training set. As image pixels are naturally discretized continuous values, we instead evaluate the proposed method in the quantized hash-code space of a pretrained VQGAN (Esser et al., 2021). The main purpose of doing this is to 1) evaluate score matching on complex categorical discrete data, and 2) alleviate the burden on the Transformer of modeling long sequences. The resulting data lies in a 64-dimensional categorical space with vocabulary size |C| = 512. Note that VQGAN is a lossy compressor: the reconstructed images from its decoder obtain IS = 9.67 and FID = 9.05, which upper-bounds any categorical diffusion built on top of this vector quantizer (VQ). Here we simply use a 12-layer BERT-base architecture to parameterize $f_\theta(x, t)$. See Appendix C.3 for more details on the VQ and architecture used.
From Table 2 we can see that dealing with images in a categorical discrete space is generally much harder than in an ordinal discrete space, as the ordinal structure prior is lost. The proposed approach comes close to the performance limit of VQ-based categorical discrete diffusion, and improves upon D3PM and τLDR in the same VQ space with the same parameterization. In Figure 3 we visualize the reverse sampling procedure (at unevenly spaced time steps) for the two proposed samplers, where the analytical sampler achieves reasonable quality in fewer steps. To quantify why continuous-time modeling achieves better flexibility, we report image quality for different numbers of sampling steps in Table 3. Since D3PM was trained with 1,000 steps, its performance drops quickly with fewer steps, whereas our continuous-time version remains more robust even with far fewer steps. The analytical sampler is also more robust when few steps are given, whereas forward Euler may need a corrector for better performance.

6.3. MONOPHONIC MUSIC MODELING

Finally, we conduct a study on discrete music generation following the same settings as Campbell et al. (2022). This benchmark derives from the Lakh pianoroll dataset (Raffel, 2016; Dong et al., 2018), cleaned by removing some trivial music sequences, leaving 6,000 music sequences for training and 973 sequences for evaluation. Each sequence has length 256, with a vocabulary of size 129 consisting of 128 notes and a rest. The vocabulary is scrambled so no ordering information is preserved, creating a challenging categorical discrete data modeling task. Here we primarily compare against existing diffusion models in discrete spaces. We use the simulation time ε = 1e-3 for all methods. Since the dimensionality is high, the conditional marginals $p_{0|t}(X^d\,|\,x^{\setminus d}_t)$ are parameterized by the proposed hollow Transformer (as discussed in Section 5.3) with 6 Transformer layers. Appendix C.4 provides more details about the training and parameterization. Evaluation is conducted on the held-out test set: for each sequence, a prefix of length 32 is given as conditioning and the models are asked to generate the remaining 224 tokens. We follow the same evaluation protocol as Campbell et al. (2022) and report the Hellinger distance and proportion of outliers over 5 runs per test case in Table 4. Overall, the proposed SDDM consistently improves both metrics.

7. LIMITATION AND CONCLUSION

In this paper we presented a new learning and sampling paradigm for continuous-time diffusion models in categorical discrete spaces. We developed an extension of score matching to categorical discrete variables, showing that the corresponding continuous-time score matching properly aligns the reverse process with the posterior of the forward process. The new learning paradigm also naturally introduces new sampling algorithms. Despite the promising preliminary results, there are still limitations in the current treatment. The main bottleneck is the design of the conditional marginal parameterization, which requires non-trivial trade-offs between computational cost and architectural flexibility; score matching for general categorical discrete variables does not benefit from prior knowledge about ordinal discrete data; and finally, unifying score matching between continuous and discrete spaces would be needed to handle data in mixed spaces. We believe this initiative sheds new light on score matching for discrete-space diffusion.

B.2 PROOF FOR PROPOSITION 3.2

Proof. Since the transition kernel of the time-reversal process satisfies Equation 9, we consider its time derivative. For $x \neq y$, we have
$$\frac{d}{dt} q_{T-t|T-s}(y|x) = \frac{d}{dt}\left[\frac{q_{T-t}(y)}{q_{T-s}(x)}\, q_{T-s|T-t}(x|y)\right] \qquad (32)$$
$$= \left[\frac{d}{dt} q_{T-t}(y)\right] \frac{q_{T-s|T-t}(x|y)}{q_{T-s}(x)} + \frac{q_{T-t}(y)}{q_{T-s}(x)}\, \frac{d}{dt} q_{T-s|T-t}(x|y). \qquad (33)$$
For the first term in Equation 33, we use the Kolmogorov forward equation:
$$\frac{d}{dt} q_{T-t}(y) = \frac{d}{dt} \sum_{x_0 \in X} \pi_{\text{data}}(x_0)\, q_{T-t|0}(y|x_0) \qquad (34)$$
$$= \sum_{x_0 \in X} \pi_{\text{data}}(x_0)\, \frac{d}{dt} q_{T-t|0}(y|x_0) \qquad (35)$$
$$= \sum_{x_0 \in X} \pi_{\text{data}}(x_0) \sum_{z} -q_{T-t|0}(z|x_0)\, Q_{T-t}(z, y) \qquad (36)$$
$$= -\sum_{z} q_{T-t}(z)\, Q_{T-t}(z, y).$$
Since $X$ is a finite space, the summation over $z$ is finite; moreover, $q_{T-s|T-t}(x|y) \to 0$ as $t \to s$ because $x \neq y$. Hence we obtain:
$$\lim_{t\to s} \left[\frac{d}{dt} q_{T-t}(y)\right] \frac{q_{T-s|T-t}(x|y)}{q_{T-s}(x)} = \lim_{t\to s} \frac{-\sum_{z} q_{T-t}(z) Q_{T-t}(z, y)}{q_{T-s}(x)}\, q_{T-s|T-t}(x|y) = 0.$$
For the second term in Equation 33, we use the Kolmogorov backward equation to obtain the derivative:
$$\frac{d}{dt} q_{T-s|T-t}(x|y) = \sum_{z} Q_{T-t}(y, z)\, q_{T-s|T-t}(x|z).$$
Since $q_{T-s|T-t}(x|z) \to \mathbb{1}\{z = x\}$ as $t \to s$, we then have:
$$\lim_{t\to s} \frac{q_{T-t}(y)}{q_{T-s}(x)}\, \frac{d}{dt} q_{T-s|T-t}(x|y) = \frac{q_{T-s}(y)}{q_{T-s}(x)}\, Q_{T-s}(y, x).$$
By combining these two results, and using the property that
$$R_{T-s}(x, y) = \lim_{t\to s} \frac{d}{dt} q_{T-t|T-s}(y|x) = 0 + \frac{q_{T-s}(y)}{q_{T-s}(x)}\, Q_{T-s}(y, x),$$
relabelling $T-s$ as $t$ yields
$$R_t(x, y) = \frac{q_t(y)}{q_t(x)}\, Q_t(y, x).$$

B.3 PROOF FOR PROPOSITION 3.3

Proof. For one direction, if $\pi_1 = \pi_2$, it is trivial that their conditional distributions match. For the other direction, consider $x, y \in X$ and an arbitrary probability distribution $\pi$.
We have:
$$\frac{\pi(y)}{\pi(x)} = \frac{\pi(y^1, y^2, y^3, \ldots, y^{D-1}, y^D)}{\pi(x^1, x^2, x^3, \ldots, x^{D-1}, x^D)} = \frac{\pi(y^1, y^2, y^3, \ldots, y^{D-1}, y^D)}{\pi(x^1, y^2, y^3, \ldots, y^{D-1}, y^D)} \cdot \frac{\pi(x^1, y^2, y^3, \ldots, y^{D-1}, y^D)}{\pi(x^1, x^2, y^3, \ldots, y^{D-1}, y^D)} \cdots \frac{\pi(x^1, x^2, \ldots, x^{D-1}, y^D)}{\pi(x^1, x^2, \ldots, x^{D-1}, x^D)} \qquad (44)$$
$$= \prod_{d=1}^{D} \frac{\pi(y^d \mid x^{1:d-1}, y^{d+1:D})}{\pi(x^d \mid x^{1:d-1}, y^{d+1:D})}.$$
Thus the probability ratio decomposes into a product completely determined by the singleton conditional distributions. Hence, if $\pi_1$ and $\pi_2$ have the same singleton conditional distributions, then $\pi_1(y)/\pi_1(x) = \pi_2(y)/\pi_2(x)$ for all $x, y$. As they are both distributions, we can easily see $1/\pi_1(x) = \sum_y \pi_1(y)/\pi_1(x) = \sum_y \pi_2(y)/\pi_2(x) = 1/\pi_2(x)$ for all $x$. Hence $\pi_1 = \pi_2$.

B.4 PROOF FOR PROPOSITION 3.4

The key idea of the proof is that the conditional distribution on the $d$-th dimension does not rely on the value $x^d$. That is to say, the value of $q_t(X^d = c \mid x_t^{\backslash d}) \log p_t(X^d = c \mid x_t^{\backslash d}; \theta)$ does not depend on $x_t^d$.
Hence, we have:
$$\sum_{x_t \in X} q_t(x_t) \sum_{d=1}^{D} \sum_{c \in C} q_t(X^d = c \mid x_t^{\backslash d}) \log p_t(X^d = c \mid x_t^{\backslash d}; \theta) \qquad (46)$$
$$= \sum_{d=1}^{D} \sum_{c \in C} \sum_{x_t \in X} q_t(x_t)\, \frac{q_t(x_t^{\backslash d}, X^d = c)}{\sum_{c' \in C} q_t(x_t^{\backslash d}, X^d = c')}\, \log p_t(X^d = c \mid x_t^{\backslash d}; \theta) \qquad (47)$$
$$= \sum_{d=1}^{D} \sum_{c \in C} \sum_{x_t^{\backslash d}} \sum_{c'' \in C} q_t(x_t^{\backslash d}, c'')\, \frac{q_t(x_t^{\backslash d}, X^d = c)}{\sum_{c' \in C} q_t(x_t^{\backslash d}, X^d = c')}\, \log p_t(X^d = c \mid x_t^{\backslash d}; \theta) \qquad (48)$$
$$= \sum_{d=1}^{D} \sum_{c \in C} \sum_{x_t^{\backslash d}} \frac{q_t(x_t^{\backslash d}, X^d = c)}{\sum_{c' \in C} q_t(x_t^{\backslash d}, X^d = c')}\, \log p_t(X^d = c \mid x_t^{\backslash d}; \theta) \sum_{c'' \in C} q_t(x_t^{\backslash d}, c'') \qquad (49)$$
$$= \sum_{d=1}^{D} \sum_{c \in C} \sum_{x_t^{\backslash d}} q_t(x_t^{\backslash d}, X^d = c)\, \log p_t(X^d = c \mid x_t^{\backslash d}; \theta) \qquad (50)$$
$$= \sum_{d=1}^{D} \sum_{x_t \in X} q_t(x_t)\, \log p_t(X^d = x_t^d \mid x_t^{\backslash d}; \theta) \qquad (51)$$
$$= \sum_{x_t \in X} q_t(x_t) \sum_{d=1}^{D} \log p_t(X^d = x_t^d \mid x_t^{\backslash d}; \theta).$$
Substituting this result into the original score matching loss function
$$\theta^* = \arg\min_{\theta} \int_0^T \sum_{x_t \in X} q_t(x_t) \sum_{d=1}^{D} \left[-\sum_{c \in C} q_t(X^d = c \mid x_t^{\backslash d}) \log p_t(X^d = c \mid x_t^{\backslash d}; \theta)\right] dt,$$
we obtain the simplified and tractable loss function
$$\theta^* = \arg\min_{\theta} \int_0^T \sum_{x_t \in X} q_t(x_t) \sum_{d=1}^{D} \left[-\log p_t(X^d = x_t^d \mid x_t^{\backslash d}; \theta)\right] dt,$$
which proves Proposition 3.4.
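The marginalization identity above can be checked numerically on a toy example with $D = 2$ dimensions and $C = 3$ categories. The joint $q_t$ and the model conditionals $p_t$ below are arbitrary random distributions, chosen only to exercise the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
C = 3
q = rng.random((C, C)) + 0.1; q /= q.sum()                    # joint q_t over (x^1, x^2)
# Arbitrary model conditionals p_t(X^d = c | x^{\d}), properly normalized.
p0 = rng.random((C, C)) + 0.1; p0 /= p0.sum(axis=0, keepdims=True)  # p(X^1=c | x^2)
p1 = rng.random((C, C)) + 0.1; p1 /= p1.sum(axis=1, keepdims=True)  # p(X^2=c | x^1)

# True conditionals q_t(X^1=c | x^2) and q_t(X^2=c | x^1) from the joint.
q0 = q / q.sum(axis=0, keepdims=True)
q1 = q / q.sum(axis=1, keepdims=True)

# LHS (Eq. 46): E_{x ~ q_t} [ sum_d sum_c q_t(X^d=c | x^{\d}) log p_t(X^d=c | x^{\d}) ]
lhs = 0.0
for x1 in range(C):
    for x2 in range(C):
        inner = sum(q0[c, x2] * np.log(p0[c, x2]) for c in range(C))
        inner += sum(q1[x1, c] * np.log(p1[x1, c]) for c in range(C))
        lhs += q[x1, x2] * inner

# RHS (Eq. 51): E_{x ~ q_t} [ sum_d log p_t(X^d = x^d | x^{\d}) ], a cross-entropy.
rhs = sum(q[x1, x2] * (np.log(p0[x1, x2]) + np.log(p1[x1, x2]))
          for x1 in range(C) for x2 in range(C))
print("identity holds up to float error:", lhs, rhs)
```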

C EXPERIMENTAL DETAILS

C.1 NOISE SCHEDULE OF β(t)

Empirically we find the following time schedule β(t) to be effective in some situations:
$$\int_0^t \beta(\tau)\, d\tau = -\cos\left(\frac{\pi}{2}\, t^{\frac{1}{2}}\right) + 1.$$
This provides a cosine-style noise schedule so that the forward process has a reasonable noise level to contribute to sample quality (Nichol & Dhariwal, 2021).

C.2 SYNTHETIC DATA

To parameterize f θ (x, t), we use the same 3-layer MLP as used in Zhang et al. (2022). Each hidden layer has dimensionality 256 with ELU activations. We use a constant learning rate of 1e-4 and train the model for 300k steps with a per-step batch size of 128. The data is generated on the fly using the data generator provided by Dai et al. (2020). In this section we provide an ablation study of the three parameterization methods proposed in Section 5, as well as the two proposed sampling methods (namely the forward Euler method, denoted Fwd Euler, and the analytical sampling method, denoted analytical). Overall, all combinations of parameterization and sampler achieve reasonable results. The comparison among different samplers and models is mixed, which indicates that more efficient parameterizations like the masked or hollow models can achieve better quality-speed trade-offs than the EBM on this synthetic data.
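A minimal sketch of such a network, using plain numpy with illustrative input/output dimensions (the actual model follows the architecture of Zhang et al. (2022); the dimensions and initialization scale here are assumptions):

```python
import numpy as np

def elu(z, alpha=1.0):
    """ELU activation: identity for z > 0, alpha * (exp(z) - 1) otherwise."""
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def init_mlp(in_dim, hidden=256, out_dim=2, seed=0):
    """3-layer MLP weights; hidden width 256 as in the paper, rest illustrative."""
    rng = np.random.default_rng(seed)
    dims = [in_dim + 1, hidden, hidden, out_dim]   # +1 for the scalar time input t
    return [(rng.normal(0.0, 0.05, (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def mlp_forward(params, x, t):
    """f_theta(x, t): concatenate t to x, then apply the MLP with ELU activations."""
    h = np.concatenate([x, [t]])
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:        # no activation on the output layer
            h = elu(h)
    return h

params = init_mlp(in_dim=32)
out = mlp_forward(params, np.zeros(32), t=0.5)
print(out.shape)   # (2,)
```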

C.3 EXPERIMENTS ON CIFAR10

Vector Quantization We model the images in a learned categorical discrete latent space with a Vector Quantized Variational Autoencoder (VQ-VAE) (Van Den Oord et al., 2017). The VQ-VAE encoder maps an image of size H × W × 3 to H/4 × W/4 tokens with a vocabulary size of 512. During training, we use the VQGAN variant (Esser et al., 2021), which adds a GAN loss and a perceptual loss to the ELBO objective. We follow the implementation of VQGAN in MaskGIT (Chang et al., 2022) with adaptations for a smaller downsample rate and fewer parameters. In particular, we use three convolution blocks in the encoder, with average pooling for downsampling between them. The three blocks use 64, 128, and 256 filters respectively, where each block consists of two standard residual blocks. We use nearest-neighbor lookup to map the encoder outputs to token indices, according to a codebook of size 512 × 256. The decoder mirrors the encoder architecture. The VQ-VAE has 12.35M parameters overall. We use the straight-through gradient estimator (Van Den Oord et al., 2017) for the quantization step during training. To acquire a general VQ-VAE model as the bridge between pixel space and latent space, we train it on the ImageNet dataset (Deng et al., 2009) at 64 × 64 resolution for 90 epochs. The model is trained with the Adam optimizer (β1 = 0, β2 = 0.99) (Kingma & Ba, 2014), using a peak learning rate of 1e-4 with linear warmup and cosine decay. The GAN loss and perceptual loss are added with weight 0.1. For the 32 × 32 images in the CIFAR10 dataset, 8 × 8 latent codes are produced. The reconstruction achieves an FID of 9.05 and an IS of 9.67.
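The nearest-neighbor lookup at the heart of the quantization step can be sketched as follows. This is an illustrative numpy sketch of the generic VQ-VAE mechanism, not the actual VQGAN implementation; the shapes match the 512 × 256 codebook and an 8 × 8 latent grid described above:

```python
import numpy as np

def quantize(z_e, codebook):
    """Map encoder outputs to their nearest codebook entries.

    z_e: (N, D) encoder output vectors; codebook: (K, D), here K=512, D=256.
    Returns the token indices and the quantized vectors.
    """
    # Squared Euclidean distance between every output vector and every code.
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)   # discrete token index per latent position
    z_q = codebook[idx]       # quantized latent vectors
    return idx, z_q

# In a framework with autodiff, the straight-through estimator is written as
# z_q = z_e + stop_gradient(z_q - z_e), so gradients bypass the argmin.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 256))
z_e = rng.normal(size=(64, 256))      # e.g. an 8x8 latent grid, flattened
idx, z_q = quantize(z_e, codebook)
print(idx.shape, z_q.shape)           # (64,) (64, 256)
```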

Parameterization and training

We parameterize f θ (x, t) using masked modeling (see Section 5.2). The backbone neural network is the same as BERT-base: 12 Transformer layers, each with 12 attention heads, embedding size 768, and hidden size 3072 for the MLPs. We simply concatenate the time embedding of t with the other tokens. After obtaining the final embedding of each token, we feed it into a 2-block ResNet (using the same parameterization as Campbell et al. (2022) in their music generation experiments) to predict logits for the masked position. We use a constant uniform rate of 0.007 for the forward process. We train the model on 4x4x4 TPU-v4 chips with a batch size of 128. Training finishes after 700k steps in about 20 hours. The learning rate is warmed up from 0 to 1e-4 during the first 3% of training steps, then decays to 0 on a linear schedule. The final evaluation uses an exponential moving average of the model parameters with decay rate 0.999.
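To illustrate the cost profile of the masked parameterization: recovering the conditionals $p_t(X^d \mid x^{\backslash d})$ for all $D$ positions typically requires one masked copy of the sequence per position, hence $D$ forward passes. A small sketch, where the MASK token id and sequence contents are hypothetical:

```python
import numpy as np

MASK = 512   # hypothetical extra [MASK] token id beyond the vocabulary

def masked_inputs(x):
    """Build the D masked copies of sequence x whose model outputs give
    p(X^d | x^{\\d}) -- one forward pass is needed per position d."""
    D = len(x)
    batch = np.tile(x, (D, 1))                  # D copies of the sequence
    batch[np.arange(D), np.arange(D)] = MASK    # mask position d in copy d
    return batch

x = np.array([3, 17, 42, 7])
print(masked_inputs(x))
```

This per-position cost is one reason a single-pass architecture like the hollow Transformer can be preferable at high dimensionality.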

C.4 MUSIC DATASET

We parameterize f θ (x, t) using the proposed hollow Transformer (see Section 5.3). Each Transformer component has 6 layers with embedding size 256, 8 attention heads, and hidden dimension 2048. We simply concatenate the time embedding with the other tokens for the feed-forward pass. We run the experiments on 2x2 TPU-v4 chips; training takes around 12 hours to converge, or roughly 2M steps with batch size 64.
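One way a "hollow" dependency structure can be realized is with two directional attention streams whose masks exclude the diagonal, so the representation at position d never attends to the token $x^d$ itself. The sketch below only shows the mask construction and is an assumption about the mechanism, not the exact implementation:

```python
import numpy as np

D = 6
# Two causal streams: one attends strictly to the past, one strictly to the
# future. Neither stream's representation of position d ever sees x^d itself,
# so combining them yields features suitable for predicting p(X^d | x^{\d}).
fwd_mask = np.tril(np.ones((D, D), dtype=bool), k=-1)   # strictly lower triangular
bwd_mask = np.triu(np.ones((D, D), dtype=bool), k=1)    # strictly upper triangular
print(fwd_mask.astype(int))
```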






Figure 1: Diagram illustration of the Hollow Transformer.

Figure 3: Visualization of reverse sampling with different samplers.

Metrics on sample quality for different diffusion models on the CIFAR10 dataset. Inception Score (IS) and Fréchet Inception Distance (FID) are compared. We follow common practice and compare 50,000 unconditionally sampled images from the model against images from the training dataset. Approaches with representations in different state spaces are listed in separate sections. *Our re-implementation in VQ space, where we tuned configurations and report the best result.

FID/IS on CIFAR10 with different sampling steps, using continuous/discrete-time methods.

Conditional music generation.

Conditional music generation compared against ground truth completions, with different noise schedules.

Visualization of the true data in 2D space via decoding of Gray codes.

Ablation study of SDDM on the synthetic dataset, using the same experimental protocol as in Table 1.

A RELATED WORK

The discrete diffusion model was actually described in the original paper by Sohl-Dickstein et al. (2015), which considers a binomial diffusion process for modeling binary variables. After the resurgence of diffusion models with successful applications to continuous variable distribution modeling (Ho et al., 2020; Song et al., 2020), recent discrete diffusion extensions have mainly focused on the design of forward corruption kernels. Hoogeboom et al. (2021b) propose a multinomial diffusion extension, which inspires more structured perturbation operations in the forward process in Austin et al. (2021) beyond uniform noise injection, including discretized Gaussian perturbation, transitions proportional to token embedding distance, and absorbing state corruption. Furthermore, Hoogeboom et al. (2021a) introduce the standard masking operation from language modeling into the forward process, and Johnson et al. (2021) consider insertion and deletion operations for corruption, going beyond in-place perturbation. Discrete diffusion models demonstrate good flexibility and have been used as alternatives to autoregressive models; e.g., in Gu et al. (2022) and Cohen et al. (2022), discrete diffusion models operate over quantized latent codes and are composed with the decoder/encoder of a VQ-VAE. These models still follow finite-step corruption in the forward process and exploit MLE-related objectives for learning, which is orthogonal to our focus. The most related work is Campbell et al. (2022), which generalizes discrete diffusion processes by characterizing a continuum of discrete variables evolving through a continuous-time Markov chain. Our work also considers continuous-time discrete diffusion. The major differences are twofold: i) we design score-based learning, while the learning in Campbell et al. (2022) relies on an ELBO; ii) our derivation admits an analytical sampling strategy during backward sampling, in addition to the numerical simulations commonly used in Campbell et al. (2022). Together with Campbell et al. (2022), we fill in missing pieces of discrete diffusion models in the continuous-time framework.

B DEFERRED PROOFS

B.1 PROOF FOR PROPOSITION 3.1

Proof. For a time $u \in (0, T)$, denote the filtration $\mathcal{F}_u = \sigma\{X_v : v \in [0, u]\}$, which is the sigma algebra of the events of $X$ from time 0 to time $u$. Also, denote the filtration for the reverse-time process as $\bar{\mathcal{F}}_u = \sigma\{X_v : v \in [u, T]\}$, which is the sigma algebra of the events of $X$ from time $u$ to time $T$. By the Markov property, $\mathcal{F}_u$ and $\bar{\mathcal{F}}_u$ are conditionally independent given $X_u$. Now, consider any $A \in \bar{\mathcal{F}}_t$, and any $s < t$. Then, using the conditional independence $P(A \mid X_s = x_s, X_t = x_t) = P(A \mid X_t = x_t)$, we have
$$P(X_s = x_s \mid X_t = x_t, A) = \frac{P(X_s = x_s, X_t = x_t, A)}{P(X_t = x_t, A)} = \frac{q_s(x_s)\, q_{t|s}(x_t \mid x_s)\, P(A \mid X_t = x_t)}{q_t(x_t)\, P(A \mid X_t = x_t)} = q_{t|s}(x_t \mid x_s)\, \frac{q_s(x_s)}{q_t(x_t)},$$
which is independent of $A$; this proves the reverse-time process is a continuous-time Markov chain. Also, from the derivation above we have
$$q_{t|s}(x_t \mid x_s)\, \frac{q_s(x_s)}{q_t(x_t)} = P(X_s = x_s \mid X_t = x_t, A) = P(X_s = x_s \mid X_t = x_t) = q_{s|t}(x_s \mid x_t). \qquad (31)$$
Hence, we have established the proposition.
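As a sanity check, the reverse-rate formula of Proposition 3.2, $R_t(x, y) = q_t(y)\, Q_t(y, x)/q_t(x)$, can be verified numerically on a small chain: the reversed marginals $p_t = q_{T-t}$ must solve the forward equation driven by the reverse rates. The 3-state generator and initial distribution below are arbitrary and purely illustrative:

```python
import numpy as np
from scipy.linalg import expm

# Forward generator: Q[z, y] is the rate of jumping z -> y (rows sum to 0).
Q = np.array([[-1.0, 0.6, 0.4],
              [0.3, -0.8, 0.5],
              [0.2, 0.7, -0.9]])
q0 = np.array([0.5, 0.3, 0.2])
T = 1.0

def marginal(t):
    return q0 @ expm(t * Q)            # q_t solves dq_t/dt = q_t Q

def reverse_rate(t):
    q = marginal(t)
    R = (q[None, :] / q[:, None]) * Q.T    # R_t(x, y) = q_t(y)/q_t(x) * Q_t(y, x)
    np.fill_diagonal(R, 0.0)
    np.fill_diagonal(R, -R.sum(axis=1))    # diagonal so rows sum to 0
    return R

# The reversed marginals p_t = q_{T-t} should solve dp_t/dt = p_t R_{T-t}.
t, h = 0.4, 1e-5
p = lambda u: marginal(T - u)
lhs = (p(t + h) - p(t - h)) / (2 * h)      # numerical time derivative
rhs = p(t) @ reverse_rate(T - t)
print("max deviation:", np.max(np.abs(lhs - rhs)))
```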

