SCORE-BASED CONTINUOUS-TIME DISCRETE DIFFUSION MODELS

Abstract

Score-based modeling through stochastic differential equations (SDEs) has provided a new perspective on diffusion models, and demonstrated superior performance on continuous data. However, the gradient of the log-likelihood function, i.e., the score function, is not properly defined for discrete spaces. This makes it non-trivial to adapt score-based modeling to categorical data. In this paper, we extend diffusion models to discrete variables by introducing a stochastic jump process where the reverse process denoises via a continuous-time Markov chain. This formulation admits an analytical simulation during backward sampling. To learn the reverse process, we extend score matching to general categorical data, and show that an unbiased estimator can be obtained via simple matching of the conditional marginal distributions. We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.

1. INTRODUCTION

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) have emerged as an important technique for data distribution modeling, where a data-corrupting forward process is coupled with a denoising reverse process that simulates a diffusion relationship between the data distribution and an uninformed source. Such models admit stable learning procedures and have demonstrated superior performance on continuous data in challenging scenarios (Dhariwal & Nichol, 2021), leading to rapidly increasing popularity. Song et al. (2020) established a stochastic differential equation view of diffusion models by taking the limit of ever finer corruption and denoising steps in the forward and backward processes, rendering a continuum of distributions. This perspective has provided a unified framework under a new score-based learning objective, and inspired a variety of simulation methods for efficient sampling and inference (De Bortoli et al., 2021; Zhang & Chen, 2022).

Given the advantages of diffusion models in terms of flexibility, learning tractability, and sampling, there have been several attempts to extend the approach to discrete data. Recent work has investigated alternative corruption operations for discrete data in the forward process, yielding promising results (Hoogeboom et al., 2021b;a; Austin et al., 2021). However, these extensions still execute a finite sequence of corruption and restoration steps, and remain restricted to a fixed reverse sampling strategy that can be sub-optimal. To overcome this limitation, we investigate whether a continuous-time discrete diffusion formulation might admit more effective estimation and generation. Such an extension is highly non-trivial, however: the continuous-time diffusion framework is based on a stochastic differential equation (SDE) with respect to the score function, which is itself the gradient of the log-likelihood with respect to a continuous variable.
Although this can be used to characterize a continuum of infinitesimally evolving distributions over a continuous space, such a formulation no longer exists for discrete variables, since the gradient of the log-likelihood is not defined with respect to a discrete variable. Recently, Campbell et al. (2022) made significant progress toward closing this gap by proposing a discrete data distribution that evolves as a continuous-time Markov chain. With this formulation they were able to approximate maximum likelihood training with an evidence lower bound (ELBO) surrogate and to use a predictor-corrector sampling scheme, generalizing some of the benefits of continuous-time modeling to a discrete space. However, a limitation of this previous work is its reliance on an ELBO approximation to the MLE, when it is known that score-based learning yields superior estimation quality (where it can be applied) due to its unbiasedness (Song et al., 2020). Of course, developing an analog of score matching for discrete spaces is non-trivial due to non-differentiability. Nevertheless, a score function for a discrete random variable cannot be arbitrary: whenever two score functions match over a space, their corresponding distributions should also match; the score function should characterize the direction of infinitesimal evolution of a discrete distribution; and finally, the score should enable tractable score-based estimation. In this paper, we investigate the design of score functions for discrete spaces that achieve these desired properties, and provide corresponding learning and sampling methods. In addition, we complete the agenda for continuous-time diffusion in discrete spaces by formulating a coherent SDE in terms of stochastic jump processes.
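To make the continuous-time Markov chain view concrete, the following sketch simulates a forward corruption process on a finite space with Gillespie-style sampling. The uniform jump distribution, constant rate, and all concrete values are illustrative assumptions, not the method's actual choices:

```python
import numpy as np

# Illustrative forward-corruption process as a continuous-time Markov chain
# (CTMC) on a finite space, simulated Gillespie-style. The uniform jump
# distribution and the constant rate are assumptions for this sketch.
K = 4          # number of categories (assumed)
RATE = 1.0     # constant jump intensity (assumed)
rng = np.random.default_rng(0)

def simulate_ctmc(x0, t_end):
    """Exponential holding times; on each jump, move to a different state."""
    x, t = x0, 0.0
    while True:
        t += rng.exponential(1.0 / RATE)   # holding time ~ Exp(RATE)
        if t >= t_end:
            return x
        # Jump to a uniformly chosen *different* state.
        x = rng.choice([k for k in range(K) if k != x])

# Run long enough and the chain forgets x0, mixing toward uniform over K states.
samples = np.array([simulate_ctmc(0, t_end=5.0) for _ in range(2000)])
```

For large `t_end` the empirical distribution of `samples` approaches uniform, mirroring how the forward process transports the data distribution to a simple reference distribution.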
The main contributions are:
• We extend the definition of score functions to generic categorical discrete variables, and derive a continuous-time discrete diffusion model via a continuous-time Markov chain in Section 3.1;
• We derive a score-based objective called categorical ratio matching for estimating the proposed model in Section 3.2, which can be tractably optimized, showing that a previous proposal for binary discrete data (Hyvärinen, 2007) can be obtained as a special case;
• In Section 4, we develop a numerical simulation technique for reverse sampling, then provide an analytical sampling method based on implicit modeling of the conditional marginals;
• We discuss three architectural choices and present a novel "hollow" neural model in Section 5;
• We evaluate the proposed SDDM on a set of synthetic and real-world music and image benchmarks, achieving promising results in Section 6.

2. BACKGROUND

Diffusion models (Sohl-Dickstein et al., 2015) are characterized by a forward Markov process that transforms an observation $x_0 \sim \pi_{\text{data}}(x_0)$ into a reference distribution $x_T \sim q_T(x_T)$, and a backward process, also Markovian, that recovers the data distribution from $x_T$. Specifically, the forward process is defined through simple corruption operations $q_{t+1|t}$. For continuous data the corruption kernel usually adds Gaussian noise (Ho et al., 2020); whereas for discrete data, where $x_0 \in \mathcal{X}$ with finite cardinality $|\mathcal{X}|$, the corruption kernel can be uniform, discrete Gaussian, or some other choice (Austin et al., 2021; Johnson et al., 2021). Given the corruption kernel, after $T$ steps the forward process forms a joint distribution
$$q_{0:T}(x_{0:T}) = \pi_{\text{data}}(x_0) \prod_{t=0}^{T-1} q_{t+1|t}(x_{t+1} \mid x_t). \tag{1}$$
The backward process can be derived from the joint distribution via Bayes' rule,
$$q_{0:T}(x_{0:T}) = q_T(x_T) \prod_{t=0}^{T-1} q_{t|t+1}(x_t \mid x_{t+1}), \qquad q_{t|t+1}(x_t \mid x_{t+1}) = \frac{q_{t+1|t}(x_{t+1} \mid x_t)\, q_t(x_t)}{q_{t+1}(x_{t+1})}, \tag{2}$$
where $q_t(x_t)$ denotes the marginal distribution and the prior $q_T$ is usually a simple distribution. Typically the backward kernel $q_{t|t+1}(x_t \mid x_{t+1})$ is intractable; thus it is usually parameterized by a neural network, denoted $p^\theta_{t|t+1}$, and learned via the ELBO (Sohl-Dickstein et al., 2015; Ho et al., 2020; Austin et al., 2021) or score matching (Song & Ermon, 2019). Due to the structure of the joint distributions (1) and (2), the ELBO admits a particularly simple formulation,
$$\ell_{\text{vb}} = \mathbb{E}_{\pi_{\text{data}}}\Big[ D_{\mathrm{KL}}\big(q_{T|0} \,\|\, q_T\big) + \sum_{t=1}^{T-1} \mathbb{E}_{q_{t|0}}\, D_{\mathrm{KL}}\big(q_{t|t+1,0} \,\|\, p^\theta_{t|t+1}\big) - \mathbb{E}_{q_{1|0}} \log p^\theta_{0|1}(x_0 \mid x_1) \Big], \tag{3}$$
which is applicable to both continuous and discrete diffusion model learning.
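The forward factorization (1) and the Bayes'-rule backward kernel (2) can be checked numerically on a toy categorical space. The uniform corruption kernel, the beta schedule, and the toy data distribution below are illustrative assumptions, not the paper's choices:

```python
import numpy as np

# Toy discrete forward process on a space of |X| = 4 categories, with a
# uniform corruption kernel. Schedule and distributions are assumed values.
K, T = 4, 5
betas = np.linspace(0.1, 0.5, T)          # corruption schedule (assumed)
pi_data = np.array([0.7, 0.1, 0.1, 0.1])  # toy data distribution (assumed)

def uniform_kernel(beta):
    # Keep the state w.p. (1 - beta); otherwise resample uniformly over X.
    return (1.0 - beta) * np.eye(K) + beta * np.ones((K, K)) / K

# Marginals q_t from pushing pi_data through the forward kernels.
marginals = [pi_data]
for t in range(T):
    marginals.append(marginals[-1] @ uniform_kernel(betas[t]))

# Backward kernel via Bayes' rule (Eq. 2):
#   q_{t|t+1}(x_t | x_{t+1}) = q_{t+1|t}(x_{t+1}|x_t) q_t(x_t) / q_{t+1}(x_{t+1})
t = 2
q_back = uniform_kernel(betas[t]) * marginals[t][:, None] / marginals[t + 1][None, :]

# Each column of q_back is a valid conditional distribution over x_t.
print(q_back.sum(axis=0))   # -> all ones (up to floating point)
```

This also illustrates why the backward kernel is generally intractable: computing it exactly requires the marginals $q_t$, which are only available in closed form for toy problems like this one.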
For continuous variables with Gaussian corruptions $q(x_{t+1} \mid x_t) = \mathcal{N}\big(x_{t+1};\, \sqrt{1-\beta_{t+1}}\, x_t,\ \beta_{t+1} I\big)$ and a backward kernel $p^\theta(x_t \mid x_{t+1}) = \mathcal{N}\big(x_t;\, \tfrac{1}{\sqrt{1-\beta_{t+1}}}\big(x_{t+1} + \beta_{t+1}\, r^\theta_t(x_{t+1})\big),\ \beta_{t+1} I\big)$, where $r^\theta_t$ is learned as a neural network and $\beta_t$ is a predefined variance schedule, the ELBO (3) can be rewritten as
$$\ell_{\text{vb}} = \sum_{t=0}^{T-1} (1-\alpha_t)\, \mathbb{E}_{\pi_{\text{data}}}\, \mathbb{E}_{p_{\alpha_t}(x'|x)} \big\| r^\theta_t(x') - \nabla_{x'} \log p_{\alpha_t}(x' \mid x) \big\|^2. \tag{4}$$

* Work done during an internship at Google.
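In the Gaussian case, the regression target $\nabla_{x'} \log p_{\alpha_t}(x' \mid x)$ in Eq. (4) has a closed form, which is what makes the objective tractable. A minimal sketch, where all concrete values are illustrative assumptions:

```python
import numpy as np

# For Gaussian corruptions, with p_{alpha_t}(x'|x) = N(x'; sqrt(alpha_t) x,
# (1 - alpha_t) I), the conditional score has the closed form
#   grad_{x'} log p_{alpha_t}(x'|x) = -(x' - sqrt(alpha_t) x) / (1 - alpha_t),
# which r_theta_t is regressed onto. Values below are assumed for illustration.
rng = np.random.default_rng(0)
d = 3
alpha_t = 0.6                      # cumulative signal level (assumed)
x = rng.normal(size=d)             # a clean data point
noise = rng.normal(size=d)
x_prime = np.sqrt(alpha_t) * x + np.sqrt(1.0 - alpha_t) * noise  # x' ~ p_{alpha_t}(.|x)

score = -(x_prime - np.sqrt(alpha_t) * x) / (1.0 - alpha_t)
# Equivalently, score = -noise / sqrt(1 - alpha_t): the familiar noise-
# prediction target, up to scale.
assert np.allclose(score, -noise / np.sqrt(1.0 - alpha_t))
```

No such closed-form gradient exists once $x$ is discrete, which is precisely the gap the categorical score functions of Section 3 are designed to fill.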

