DENOISING MCMC FOR ACCELERATING DIFFUSION-BASED GENERATIVE MODELS

Abstract

Diffusion models are powerful generative models that simulate the reverse of diffusion processes using score functions to synthesize data from noise. The sampling process of diffusion models can be interpreted as solving the reverse stochastic differential equation (SDE) or the ordinary differential equation (ODE) of the diffusion process, which often requires up to thousands of discretization steps to generate a single image. This has sparked a great interest in developing efficient integration techniques for reverse-S/ODEs. Here, we propose an orthogonal approach to accelerating score-based sampling: Denoising MCMC (DMCMC). DMCMC first uses MCMC to produce initialization points for reverse-S/ODE in the product space of data and variance (or diffusion time). Then, a reverse-S/ODE integrator is used to denoise the initialization points. Since MCMC traverses close to the data manifold, the cost of producing a clean sample for DMCMC is much less than that of producing a clean sample from noise. To verify the proposed concept, we show that Denoising Langevin Gibbs (DLG), an instance of DMCMC, successfully accelerates all six reverse-S/ODE integrators considered in this work on the tasks of CIFAR10 and CelebA-HQ-256 image generation. Notably, combined with integrators of Karras et al. ( 2022) and pre-trained score models of Song et al. (2021b), DLG achieves state-of-the-art results among score-based models. In the limited number of score function evaluation (NFE) settings on CIFAR10, we have 3.86 FID with ≈ 10 NFE and 2.63 FID with ≈ 20 NFE. On CelebA-HQ-256, we have 6.99 FID with ≈ 160 NFE, which beats the current best record of Kim et al. (2022) among score-based models, 7.16 FID with 4000 NFE.

1. INTRODUCTION

Sampling from a probability distribution given its score function, i.e., the gradient of the log-density, is an active area of research in machine learning. Its applications range far and wide, from Bayesian learning (Welling & Teh, 2011) to learning energy-based models (Song & Kingma, 2021) , synthesizing new high-quality data (Dhariwal & Nichol, 2021) , and so on. Typical examples of traditional score-based samplers are Markov chain Monte Carlo (MCMC) methods such as Langevin dynamics (Langevin, 1908) and Hamiltonian Monte Carlo (Neal, 2011) . Recent developments in score matching with deep neural networks (DNNs) have made it possible to estimate scores of high-dimensional distributions such as those of natural images (Song & Ermon, 2019) . However, natural data distributions are often sharp and multi-modal, rendering naïve application of traditional MCMC methods impractical. Specifically, MCMC methods tend to skip over or get stuck at local high-density modes, producing biased samples (Levy et al., 2018) . Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021a) depart from MCMC and use the concept of diffusion, the process of gradually corrupting data into noise, to generate samples. Song et al. (2021b) observed that for each diffusion process, there is a reverse stochastic differential equation (SDE) and an ordinary differential equation (ODE). Hence, given a noise sample, integrating the reverse-S/ODE produces a data sample. Only a time-dependent score function of the data during the diffusion process is required to simulate the reverse process. This discovery generated great interest in finding better ways to integrate reverse-S/ODEs. For instance, Song et al. (2021b) uses black-box ODE solvers with adaptive stepsizes to accelerate sampling. 
Furthermore, a multitude of recent works on score-based generative modeling focus on improving reverse-S/ODE integrators (Jolicoeur-Martineau et al., 2021; Lu et al., 2022; Karras et al., 2022; Zhang & Chen, 2022).

Figure 1: Top: a conceptual illustration of a VE diffusion model sampling process and the DMCMC sampling process. VE diffusion models integrate the reverse-S/ODE starting from the maximum diffusion time / maximum noise level, so samples are often noisy under a small computation budget due to large truncation error. DMCMC produces an MCMC chain which travels close to the image manifold (compare the noise level σ), so the MCMC samples can be denoised to produce high-quality data with relatively little computation. Bottom: visualization of sampling processes without (left) and with (right) DMCMC on CelebA-HQ-256 under a fixed computation budget.

In this work, we develop an orthogonal approach to accelerating score-based sampling. Specifically, we propose Denoising MCMC (DMCMC), which combines MCMC with reverse-S/ODE integrators. MCMC is used to generate initialization points $\{(x_n, t_n)\}$ in the product space of data $x$ and variance exploding (VE) diffusion time $t$ / noise level $\sigma$ (see Fig. 1, top panel). Since all modes are connected in the product space, MCMC mixes well. Then, a reverse-S/ODE integrator solves the reverse-S/ODE starting at $x_n$ from time $t = t_n$ to $t = 0$. Since MCMC explores high-density regions, the MCMC chain stays close to the data manifold, so $t_n$ tends to be close to 0, i.e., the noise level tends to be small (see Fig. 1, top and bottom panels). Thus, integrating the reverse-S/ODE from $t = t_n$ to $t = 0$ is much faster than integrating the reverse-S/ODE from the maximum time $t = T$ to $t = 0$ starting from noise. This leads to a significant acceleration of the sampling process. Our contributions can be summarized as follows.
• We introduce the product space of data and diffusion time, and develop a novel score-based sampling framework called Denoising MCMC on the product space. Our framework is general, as any MCMC, any VE process noise-conditional score function, and any reverse-S/ODE integrator can be used in a plug-and-play manner.
• We develop Denoising Langevin Gibbs (DLG), which is an instance of Denoising MCMC that is simple to implement and is scalable. The MCMC part of DLG alternates between a data update step with Langevin dynamics and a noise level prediction step, so all that DLG requires is a pre-trained noise-conditional score network and a noise level classifier.

2. BACKGROUND

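2.1. DENOISING SCORE MATCHING

Given a distribution $p(x)$, a noise level $\sigma$, and a perturbation kernel $p_\sigma(x \mid \tilde{x}) = \mathcal{N}(x \mid \tilde{x}, \sigma^2 I)$, solving the denoising score matching objective (Vincent, 2011)

$$\min_\theta \ \mathbb{E}_{p(\tilde{x})} \mathbb{E}_{p_\sigma(x \mid \tilde{x})} \left[ \| s_\theta(x) - \nabla_x \log p_\sigma(x \mid \tilde{x}) \|_2^2 \right] \tag{1}$$

yields a score model $s_\theta(x)$ which approximates the score of $\int p_\sigma(x \mid \tilde{x}) \, p(\tilde{x}) \, d\tilde{x}$. Denoising score matching was then extended to train Noise Conditional Score Networks (NCSNs) $s_\theta(x, \sigma)$, which approximate the score of data smoothed at a general set of noise levels by solving

$$\min_\theta \ \mathbb{E}_{\lambda(\sigma)} \mathbb{E}_{p(\tilde{x})} \mathbb{E}_{p_\sigma(x \mid \tilde{x})} \left[ \| s_\theta(x, \sigma) - \nabla_x \log p_\sigma(x \mid \tilde{x}) \|_2^2 \right] \tag{2}$$

where $\lambda(\sigma)$ can be a discrete or a continuous distribution over $(\sigma_{\min}, \sigma_{\max})$ (Song & Ermon, 2019; Song et al., 2021b). We note that $\int p_\sigma(x \mid \tilde{x}) \, p(\tilde{x}) \, d\tilde{x}$ approaches $p(x)$ as $\sigma \to 0$, since the perturbation kernel $p_\sigma(x \mid \tilde{x})$ converges to the Dirac delta function centered at $\tilde{x}$.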
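To make the objective in Eq. (1) concrete, the following is a minimal one-dimensional sketch, not the training code of any cited work. It assumes toy data $p(x) = \mathcal{N}(0, 1)$, for which the smoothed score is known in closed form, and estimates the loss by Monte Carlo. The names `dsm_loss`, `smoothed_score`, and `unsmoothed_score` are hypothetical, and the values of `S0` and `SIGMA` are arbitrary illustrative choices.

```python
import random

def dsm_loss(score_fn, clean_data, sigma, rng):
    """Monte Carlo estimate of the denoising score matching loss of Eq. (1) in 1-D."""
    total = 0.0
    for x_clean in clean_data:
        x_noisy = x_clean + sigma * rng.gauss(0.0, 1.0)   # x ~ N(x_clean, sigma^2)
        target = -(x_noisy - x_clean) / sigma**2          # grad_x log p_sigma(x | x_clean)
        total += (score_fn(x_noisy) - target) ** 2
    return total / len(clean_data)

S0, SIGMA = 1.0, 1.0   # toy data std and noise level (assumed values)
data_rng = random.Random(0)
clean_data = [data_rng.gauss(0.0, S0) for _ in range(20000)]

def smoothed_score(x):
    # Exact score of N(0, S0^2) convolved with N(0, SIGMA^2), i.e. N(0, S0^2 + SIGMA^2).
    return -x / (S0**2 + SIGMA**2)

def unsmoothed_score(x):
    # Score of the clean data distribution, which ignores the added noise.
    return -x / S0**2

# The same noise seed is used for both evaluations so the comparison is paired.
loss_smoothed = dsm_loss(smoothed_score, clean_data, SIGMA, random.Random(1))
loss_unsmoothed = dsm_loss(unsmoothed_score, clean_data, SIGMA, random.Random(1))
print(loss_smoothed, loss_unsmoothed)
```

In this toy setting the score of the smoothed distribution attains the lower loss, which is consistent with the statement that the minimizer of Eq. (1) approximates the score of the noise-smoothed data rather than that of the clean data.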

2.2. MARKOV CHAIN MONTE CARLO (MCMC)

Given an unnormalized version of $p(x)$ or the score function $\nabla_x \log p(x)$, MCMC constructs a Markov chain in the data space whose stationary distribution is $p(x)$. An MCMC method that uses the unnormalized density is Metropolis-Hastings (Metropolis et al., 1953; Hastings, 1970), which builds a Markov chain by sequentially accepting or rejecting samples from a proposal distribution according to a density ratio. A popular score-based MCMC method is Langevin dynamics (Langevin, 1908). Langevin dynamics generates a Markov chain $\{x_n\}_{n=1}^{\infty}$ using the iteration

$$x_{n+1} = x_n + (\eta/2) \cdot \nabla_x \log p(x_n) + \sqrt{\eta} \cdot \epsilon \tag{3}$$

where $\epsilon \sim \mathcal{N}(0, I)$. $\{x_n\}_{n=1}^{\infty}$ converges to $p(x)$ in distribution for an appropriate choice of $\eta$.

To sample from a joint distribution $p(x, y)$, we may resort to Gibbs sampling (Geman & Geman, 1984). Given a current Markov chain state $(x_n, y_n)$, Gibbs sampling produces $x_{n+1}$ by sampling from $p(x \mid y_n)$ and $y_{n+1}$ by sampling from $p(y \mid x_{n+1})$. The sampling steps may be replaced with MCMC. Hence, Gibbs sampling is useful when conditional distributions are amenable to MCMC.

Annealed MCMC. Despite their diversity, MCMC methods often have difficulty crossing low-density regions in high-dimensional multimodal distributions. For Langevin dynamics, the score function in Eq. (3) vanishes in a low-density region, so the update degenerates into a meaningless diffusion. Moreover, natural data often lies on a low-dimensional manifold. Thus, once Langevin dynamics leaves the data manifold, it becomes impossible for Langevin dynamics to find its way back. One way to remedy this problem is to use annealing, i.e., constructing a sequence of increasingly smooth and wide distributions and running MCMC at different levels of smoothness. As smoothness is increased, disjoint modes merge, so MCMC can cross over to other modes. Annealing has been used to empower various types of MCMC (Geyer & Thompson, 1995; Neal, 2001). In this work, we shall refer to the collection of MCMC methods that use annealing as annealed MCMC.
An instance of annealed MCMC is annealed Langevin dynamics (ALD) (Song & Ermon, 2019). For a sequence of increasing noise levels $\{\sigma_i\}_{i=1}^{N}$, Langevin dynamics is sequentially executed with $\int p_{\sigma_i}(x \mid \tilde{x}) \, p(\tilde{x}) \, d\tilde{x}$ in place of $p(x)$ in Eq. (3) for $i = N, N-1, \ldots, 1$. Since $p(x)$ smoothed at a large noise level has wide support and connected modes, ALD overcomes the pitfalls of vanilla Langevin dynamics. However, ALD has the drawback that thousands of iterations are required to produce a single batch of samples.
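The contrast between vanilla and annealed Langevin dynamics can be illustrated on a toy problem. The sketch below assumes a one-dimensional, two-mode Gaussian mixture whose noise-smoothed score is available in closed form. The step-size rule `eta = step_scale * sigma**2`, the noise levels, and all numerical values are illustrative assumptions, not settings taken from this paper.

```python
import math
import random

M, S0 = 4.0, 0.5   # toy target: 0.5*N(-M, S0^2) + 0.5*N(+M, S0^2)

def score(x, sigma):
    """Exact score of the toy mixture smoothed with Gaussian noise of std sigma."""
    v = S0**2 + sigma**2
    return (M * math.tanh(M * x / v) - x) / v

def langevin(x, sigma, eta, n_steps, rng):
    """n_steps of the Langevin update in Eq. (3) at a fixed noise level."""
    for _ in range(n_steps):
        x = x + 0.5 * eta * score(x, sigma) + math.sqrt(eta) * rng.gauss(0.0, 1.0)
    return x

def annealed_langevin(x, sigmas, steps_per_level, step_scale, rng):
    """Run Langevin dynamics from the largest noise level down to the smallest."""
    for sigma in sorted(sigmas, reverse=True):
        eta = step_scale * sigma**2   # assumed rule: step size shrinks with the noise level
        x = langevin(x, sigma, eta, steps_per_level, rng)
    return x

rng = random.Random(0)
n_chains = 200
sigmas = [5.0 * (0.05 / 5.0) ** (i / 9) for i in range(10)]   # geometric, 5.0 down to 0.05

# Every chain starts in the left mode at x = -M.
plain = [langevin(-M, 0.0, 0.01, 1000, rng) for _ in range(n_chains)]
annealed = [annealed_langevin(-M, sigmas, 100, 0.2, rng) for _ in range(n_chains)]

frac_right_plain = sum(x > 0 for x in plain) / n_chains
frac_right_annealed = sum(x > 0 for x in annealed) / n_chains
print(frac_right_plain, frac_right_annealed)
```

With these settings the vanilla chains are expected to stay in the mode where they started, while the annealed chains split between the two modes. Both runs use the same budget of 1000 score evaluations per chain, which also hints at why ALD becomes expensive when many noise levels are needed.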

2.3. DIFFUSION MODELS

Diffusion models and differential equations. Diffusion models opened up a new avenue towards fast sampling with score functions via SDEs and ODEs (Song et al., 2021b). Suppose data is distributed in $\mathbb{R}^d$. Given a diffusion process of a data sample $x_0 \sim p(x)$ into a sample from a simple prior noise distribution, the trajectory of data during diffusion can be described with an Itô SDE

$$dx = f(x, t) \, dt + g(t) \, dw \tag{4}$$

for some drift coefficient $f : \mathbb{R}^d \times [0, T] \to \mathbb{R}^d$, diffusion coefficient $g : [0, T] \to \mathbb{R}$, and Brownian motion $w$. Here, $T$ is the diffusion termination time. With initial condition $x(0) = x_0$, integrating Eq. (4) from time $t = 0$ to $t = T$ produces a sample from the prior distribution.

For each diffusion SDE, there exists a corresponding reverse-SDE

$$dx = \left[ f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right] dt + g(t) \, d\bar{w} \tag{5}$$

where $p_t(x)$ is the density of $x(t)$ evolving according to Eq. (4) and $\bar{w}$ is a Brownian motion if time flows from $t = T$ to $t = 0$. Given a sample $x_T$ from the prior distribution, integrating Eq. (5) with initial condition $x(T) = x_T$ from $t = T$ to $t = 0$ results in a sample from $p(x)$. Moreover, to each reverse-SDE, there exists a corresponding deterministic reverse-ODE

$$dx = \left[ f(x, t) - (1/2) \cdot g(t)^2 \nabla_x \log p_t(x) \right] dt \tag{6}$$

which can also be integrated from $t = T$ to $t = 0$ to produce samples from $p(x)$.

Diffusion models generate data by simulating the reverse of the diffusion process, i.e., by solving the reverse-S/ODE of the diffusion process. Initial works on diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) used computationally expensive ancestral sampling to solve the reverse differential equations. Later works discovered that using adaptive numerical integrators to solve the reverse-S/ODE could accelerate the sampling process. This drew considerable attention to developing better reverse-S/ODE integrators (Jolicoeur-Martineau et al., 2021; Song et al., 2021b; Lu et al., 2022; Karras et al., 2022; Zhang & Chen, 2022).
Our work is orthogonal to such works, as we focus on finding good initialization points for integration via MCMC. Hence, a better integration technique directly translates to even better generative performance when plugged into Denoising MCMC.

Variance exploding (VE) diffusion model. A VE diffusion model considers the diffusion process

$$dx = \sqrt{\frac{d[\sigma^2(t)]}{dt}} \, dw \tag{7}$$

where $\sigma(t)$ increases from $\sigma_{\min} = \sigma(0)$ to $\sigma_{\max} = \sigma(T)$. The distribution of $x(t)$ evolves as

$$p_t(x) = \int p_{\sigma(t)}(x \mid \tilde{x}) \, p(\tilde{x}) \, d\tilde{x} \tag{8}$$

so if $\sigma_{\min}$ is sufficiently small, $p_0(x) \approx p(x)$, and if $\sigma_{\max}$ is sufficiently large, so that the variance explodes, $p_T(x) \approx \mathcal{N}(x \mid 0, \sigma_{\max}^2 I)$. If we have a score model $s_\theta(x, \sigma)$ trained with Eq. (2), then $\nabla_x \log p_t(x) \approx s_\theta(x, \sigma(t))$. It follows that with $x_T \sim \mathcal{N}(x \mid 0, \sigma_{\max}^2 I)$, we may integrate the reverse-S/ODE corresponding to Eq. (7) with $x(T) = x_T$ from $t = T$ to $t = 0$ using a score model to generate data. In the next section, we bridge MCMC and reverse-S/ODE integrators with VE diffusion to form a novel sampling framework that improves both MCMC and diffusion models.
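As a rough numerical check of the VE forward process, the sketch below simulates the SDE with diffusion coefficient $g(t)^2 = d[\sigma^2(t)]/dt$ in one dimension and compares the variance accumulated over $[0, T]$ with $\sigma_{\max}^2 - \sigma_{\min}^2$. The geometric schedule `sigma(t)`, the constants, and the helper names are assumptions made for illustration; the text above only requires that $\sigma(t)$ be increasing.

```python
import math
import random

SIGMA_MIN, SIGMA_MAX, T = 0.01, 50.0, 1.0   # assumed toy values

def sigma(t):
    """Geometric noise schedule (an assumed choice of sigma(t))."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** (t / T)

def g_squared(t):
    """Squared diffusion coefficient d[sigma^2(t)]/dt for the schedule above."""
    return 2.0 * sigma(t) ** 2 * math.log(SIGMA_MAX / SIGMA_MIN) / T

def forward_diffuse(x0, n_steps, rng):
    """Euler-Maruyama simulation of dx = g(t) dw from t = 0 to t = T (1-D)."""
    dt = T / n_steps
    x = x0
    for k in range(n_steps):
        t_mid = (k + 0.5) * dt   # evaluate g at the interval midpoint
        x += math.sqrt(g_squared(t_mid) * dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
n_steps, n_samples = 200, 2000

# Variance injected by the discretized process, and its empirical counterpart.
added_var = sum(g_squared((k + 0.5) * T / n_steps) * T / n_steps for k in range(n_steps))
finals = [forward_diffuse(0.0, n_steps, rng) for _ in range(n_samples)]
empirical_var = sum(x * x for x in finals) / n_samples
target_var = SIGMA_MAX**2 - SIGMA_MIN**2
print(added_var, empirical_var, target_var)
```

Starting from a fixed point, the process adds variance $\sigma^2(T) - \sigma^2(0)$, which is why $p_T(x)$ is close to $\mathcal{N}(x \mid 0, \sigma_{\max}^2 I)$ once $\sigma_{\max}$ dominates the scale of the data.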

3. DENOISING MARKOV CHAIN MONTE CARLO (DMCMC)

We now develop a general framework called Denoising MCMC (DMCMC) which combines MCMC with reverse-S/ODE integrators. We denote the data space as $X \subseteq \mathbb{R}^d$ and the noise level space as $S = [\sigma_{\min}, \sigma_{\max}]$. The construction of DMCMC consists of two steps. In the first step, we build MCMC to generate initialization points in the product space X × S, i.e., X augmented by the smoothness parameter $\sigma$. Since $\sigma(t)$ is a monotone increasing function, this is equivalent to augmenting the data space with diffusion time $t$. In the second step, we incorporate denoising steps, where we denoise the generated initialization points via reverse-S/ODE integrators. Note that while the predictor-corrector (PC) method (Song et al., 2021b) also uses MCMC, PC uses MCMC to improve the denoising process. Thus, DMCMC and PC contribute to orthogonal aspects of diffusion sampling. More discussion on the differences between DMCMC and PC is in Appendix D.

3.1. CONSTRUCTION STEP 1: MCMC ON THE PRODUCT SPACE X × S

Suppose $p(x)$ is a high-dimensional multimodal distribution, supported on a low-dimensional manifold. If the modes are separated by wide low-density regions, MCMC can have difficulty moving between the modes. Indeed, convergence time for such distributions can grow exponentially in dimension $d$ (Roberts & Rosenthal, 2001). Intuitively, for MCMC to move between disjoint modes, the Markov chain would have to step off the data manifold. However, once MCMC leaves the data manifold, the density or the score vanishes. Then, most random directions produced by the proposal distribution do not point to the manifold. Thus, MCMC gets lost in the ambient space, whose volume grows exponentially in $d$.

Annealing via Gaussian smoothing, used in both ALD and VE diffusion, circumvents this problem. As $p(x)$ is smoothed with the perturbation kernel $p_\sigma(x \mid \tilde{x})$ of increasing $\sigma$, the modes of $p(x)$ grow wider and start to connect. Thus, MCMC can easily transition between modes.
However, running MCMC in the manner of ALD is inefficient since we do not know how many iterations within each noise level is sufficient. To address this problem, we propose to augment X with the smoothness scale σ and run MCMC in the product space X × S such that MCMC automatically controls the value of σ. Below, we formally describe MCMC on X × S.

Let us define the σ-conditional distribution

$$p(x \mid \sigma) := \int p_\sigma(x \mid \tilde{x}) \, p(\tilde{x}) \, d\tilde{x}. \tag{9}$$

We also define a prior $p(\sigma)$ on $S$. Then, by the product rule,

$$p(x, \sigma) = p(x \mid \sigma) \cdot p(\sigma). \tag{10}$$

In Appendix E.2, we discuss the effect of varying the prior. MCMC with $p(x, \sigma)$ will produce samples $\{(x_n, \sigma_n)\}$ in X × S such that

$$\sigma_n \sim p(\sigma), \quad x_n \sim p(x \mid \sigma_n). \tag{11}$$

Hence, if $\sigma_n \gg \sigma_{\min}$, $x_n$ will be a noisy sample, i.e., a sample corrupted with Gaussian noise of variance $\sigma_n^2$, and if $\sigma_n \approx \sigma_{\min}$, $x_n$ will resemble a sample from $p(x)$. Since our goal is to generate samples from $p(x)$, naïvely, we can keep samples $(x_n, \sigma_n)$ with $\sigma_n \approx \sigma_{\min}$ and discard other samples. However, this could waste a large amount of computation. In the next section, we incorporate reverse-S/ODE integrators to avert this problem.

3.2. CONSTRUCTION STEP 2: INCORPORATING DENOISING STEPS

Let us recall that integrating the reverse-S/ODE for the VE diffusion SDE Eq. (7) from time $t = T$ to $t = 0$ sends samples from $p_T(x)$ to samples from $p_0(x) \approx p(x)$. In general, integrating the reverse-SDE or ODE from time $t = t_2$ to $t = t_1$ for $t_1 < t_2$ sends samples from $p_{t_2}(x)$ to samples from $p_{t_1}(x)$ (Song et al., 2021b). We use this fact to denoise MCMC samples from $p(x, \sigma)$.

Suppose we are given a sample $(x_n, \sigma_n) \sim p(x, \sigma)$. With $t_n := \sigma^{-1}(\sigma_n)$, Eq. (11) tells us

$$x_n \sim p_{t_n}(x) \tag{12}$$

so integrating the reverse-S/ODE with initial condition $x(t_n) = x_n$ from $t = t_n$ to $t = 0$ produces a sample from $p_0(x) \approx p(x)$. Here, we note that any reverse-S/ODE solver may be used to carry out the integration. In Appendix F.2, we show the necessity of the denoising step.

Let us briefly explain how DMCMC accelerates sampling. Given an MCMC chain $\{(x_n, \sigma_n)\}$ in X × S and a prior $p(\sigma)$ which places high mass near $\sigma_{\min}$, MCMC traverses high probability regions, so $\sigma_n \ll \sigma_{\max}$ for most $n$, i.e., $t_n \ll T$ for most $n$. This also means the sequence $\{x_n\}$ generally stays close to the data manifold. So, the average length of the integration intervals $(0, t_n)$ will tend to be much shorter than $T$. Thus, integrating the reverse-S/ODE over $(0, t_n)$ to denoise $x_n$ is much faster than integrating the reverse-S/ODE over $(0, T)$, i.e., standard diffusion sampling. This idea is illustrated in Figure 1, and a more detailed explanation is given in Appendix E.1.

Naïvely, we could extend denoising score matching Eq. (1) to estimate the score $\hat{s}_\theta(x, \sigma) : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d \times \mathbb{R}$ of $p(x, \sigma)$ and apply Langevin dynamics in the first step of DMCMC. But this would prevent us from using pre-trained score models, as we would have to solve (for some small $\nu > 0$)

$$\min_\theta \ \mathbb{E}_{\epsilon \sim \mathcal{N}(0_{d+1}, \nu^2 I_{d+1})} \mathbb{E}_{p(x, \sigma)} \left[ \| \hat{s}_\theta(x - \epsilon_{1:d}, \sigma - \epsilon_{d+1}) - \epsilon / \nu^2 \|_2^2 \right] \tag{13}$$

which differs from Eq. (2). Gibbs sampling provides a simple path around this problem.
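The acceleration argument can be checked on a toy problem where the reverse-ODE has a closed-form solution. With $f = 0$ and $g(t)^2 = d[\sigma^2(t)]/dt$, Eq. (6) can be rewritten in terms of the noise level as $dx/d\sigma = -\sigma \nabla_x \log p_\sigma(x)$. The sketch below assumes one-dimensional data $\mathcal{N}(0, 1)$, a plain Euler integrator on a geometric grid of noise levels, and illustrative start points; it is not one of the integrators evaluated in this paper.

```python
import math

S0 = 1.0   # std of the toy data distribution N(0, S0^2)

def score(x, sigma):
    """Exact noise-conditional score for Gaussian toy data."""
    return -x / (S0**2 + sigma**2)

def euler_denoise(x, sigma_start, sigma_end, n_steps):
    """Euler integration of the VE reverse-ODE, written in sigma: dx/dsigma = -sigma * score."""
    ratio = (sigma_end / sigma_start) ** (1.0 / n_steps)   # geometric sigma grid
    sigma = sigma_start
    for _ in range(n_steps):
        sigma_next = sigma * ratio
        x = x + (sigma_next - sigma) * (-sigma * score(x, sigma))
        sigma = sigma_next
    return x

def exact_denoise(x, sigma_start, sigma_end):
    """Closed-form solution of the same ODE for the Gaussian toy."""
    return x * math.sqrt((S0**2 + sigma_end**2) / (S0**2 + sigma_start**2))

def relative_error(sigma_start, n_steps, sigma_end=0.01, x=1.0):
    approx = euler_denoise(x, sigma_start, sigma_end, n_steps)
    exact = exact_denoise(x, sigma_start, sigma_end)
    return abs(approx - exact) / abs(exact)

nfe = 10
err_from_prior = relative_error(sigma_start=50.0, n_steps=nfe)   # start at sigma_max
err_from_chain = relative_error(sigma_start=0.5, n_steps=nfe)    # start near the data
print(err_from_prior, err_from_chain)
```

Under the same budget of ten score evaluations, the short interval incurs a much smaller truncation error than the full interval in this toy setting, which mirrors the mechanism described above.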
Let us recall that given a previous MCMC iterate $(x_n, \sigma_n)$, Gibbs sampling proceeds by alternating between an $x$ update step $x_{n+1} \sim p(x \mid \sigma_n)$ and a $\sigma$ update step $\sigma_{n+1} \sim p(\sigma \mid x_{n+1})$. Below, we describe our score-based sampling algorithm, Denoising Langevin Gibbs (DLG) (pseudocode in Appendix C).

Updating x. Suppose we are given an MCMC iterate $(x_n, \sigma_n)$ and a score model $s_\theta(x, \sigma)$ from Eq. (2). We generate $x_{n+1}$ by a Langevin dynamics step on $p(x \mid \sigma_n)$. Specifically, by Eq. (9),

$$\nabla_x \log p(x \mid \sigma_n) \approx s_\theta(x, \sigma_n) \tag{14}$$

and so a Langevin dynamics update on $x$, according to Eq. (3), is

$$x_{n+1} = x_n + (\eta/2) \cdot s_\theta(x_n, \sigma_n) + \sqrt{\eta} \cdot \epsilon \tag{15}$$

for $\epsilon \sim \mathcal{N}(0, I)$. Here, we call $\eta$ the step size.

Updating σ. We now have $x_{n+1}$ and need to sample $\sigma_{n+1} \sim p(\sigma \mid x_{n+1})$. To this end, we first train a DNN noise level classifier $q_\phi(\sigma \mid x)$ to approximate $p(\sigma \mid x)$ by solving

$$\max_\phi \ \mathbb{E}_{p(x, \sigma)} \left[ \log q_\phi(\sigma \mid x) \right]. \tag{16}$$

Specifically, we discretize $[\sigma_{\min}, \sigma_{\max}]$ into $M$ levels $\tau_1 = \sigma_{\min} < \tau_2 < \cdots < \tau_M = \sigma_{\max}$. Given $\tau_m$ where $1 \le m \le M$, $m$ serves as the label and clean training data corrupted by Gaussian noise of variance $\tau_m^2$ serves as the classifier input. The classifier is trained to predict $m$ by minimizing the cross entropy loss. Having trained a noise level classifier, we sample $\sigma_{n+1}$ by drawing an index $m$ according to the classifier output probability for $x_{n+1}$ and setting $\sigma_{n+1} = \tau_m$. In practice, using the index of largest probability worked fine. We denote this process as $\sigma_{n+1} \sim q_\phi(\sigma \mid x_{n+1})$. In Appendix F.1, we verify whether DLG with the approximated conditional works as intended.
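The alternation between the $x$ update and the $\sigma$ update can be sketched as follows. This is a structural illustration under stated assumptions rather than the authors' implementation: `score_fn` stands in for the score network $s_\theta$, `level_probs_fn` stands in for the classifier $q_\phi$, and both are replaced here by closed-form expressions for toy data $\mathcal{N}(0, I_D)$. The function names, the grid `LEVELS`, and all numerical values are hypothetical.

```python
import math
import random

def langevin_x_update(x, sigma, eta, score_fn, rng):
    """x update: one Langevin step on p(x | sigma) as in Eq. (15)."""
    s = score_fn(x, sigma)
    return [xi + 0.5 * eta * si + math.sqrt(eta) * rng.gauss(0.0, 1.0)
            for xi, si in zip(x, s)]

def sigma_update(x, levels, level_probs_fn, rng, use_argmax=False):
    """sigma update: pick a level index from the classifier output for x."""
    probs = level_probs_fn(x)
    if use_argmax:
        m = max(range(len(levels)), key=lambda i: probs[i])
    else:
        m = rng.choices(range(len(levels)), weights=probs, k=1)[0]
    return levels[m]

def dlg_block(x, sigma, n_skip, eta, levels, score_fn, level_probs_fn, rng,
              use_argmax=False):
    """Run n_skip Gibbs iterations; return the chain state and the min-sigma iterate."""
    best = (x, sigma)
    for _ in range(n_skip):
        x = langevin_x_update(x, sigma, eta, score_fn, rng)
        sigma = sigma_update(x, levels, level_probs_fn, rng, use_argmax)
        if sigma < best[1]:
            best = (x, sigma)
    return (x, sigma), best

# --- Toy stand-ins (assumptions): data ~ N(0, I_D), so everything is analytic. ---
D = 8
LEVELS = [0.01 * (50.0 / 0.01) ** (i / 29) for i in range(30)]

def toy_score(x, sigma):
    return [-xi / (1.0 + sigma**2) for xi in x]

def toy_level_probs(x):
    # Exact p(sigma_m | x) for the toy, with equal prior mass on each grid level.
    sq = sum(xi * xi for xi in x)
    logw = [-0.5 * D * math.log(1.0 + s**2) - 0.5 * sq / (1.0 + s**2) for s in LEVELS]
    top = max(logw)
    w = [math.exp(lw - top) for lw in logw]
    total = sum(w)
    return [wi / total for wi in w]

def toy_integrator(x, sigma):
    # Exact reverse-ODE solution for the toy; stands in for any reverse-S/ODE solver.
    scale = math.sqrt((1.0 + LEVELS[0]**2) / (1.0 + sigma**2))
    return [xi * scale for xi in x]

rng = random.Random(0)
state = ([rng.gauss(0.0, 1.0) for _ in range(D)], 0.5)
samples = []
for _ in range(20):
    state, (x_best, sigma_best) = dlg_block(*state, n_skip=5, eta=0.05, levels=LEVELS,
                                            score_fn=toy_score,
                                            level_probs_fn=toy_level_probs, rng=rng)
    samples.append(toy_integrator(x_best, sigma_best))
print(len(samples))
```

Passing `use_argmax=True` corresponds to the remark that taking the index of largest probability worked fine in practice. In a real setting the two toy functions would be replaced by evaluations of the trained networks, and `toy_integrator` by a reverse-S/ODE solver with its own NFE budget.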

4.1. PRACTICAL CONSIDERATIONS

Computation cost of σ prediction. We found that using shallow neural networks for the noise classifier $q_\phi$ was sufficient to accelerate sampling. Concretely, using a neural net with four convolution layers and one fully connected layer as the classifier, one evaluation of $q_\phi$ was around 100 to 1000 times faster than one evaluation of the score model $s_\theta$. So, when comparing sampling methods, we only count the number of score function evaluations (NFE). We also note that the training time of $q_\phi$ was negligible compared to the training time of $s_\theta$. For instance, on CelebA-HQ-256, training $q_\phi$ with the aforementioned architecture for 100 epochs took around 15 minutes on an RTX 2080 Ti.

Starting points for DLG. Theoretically, an MCMC chain $\{(x_n, \sigma_n)\}_{n=0}^{\infty}$ will converge to $p(x, \sigma)$ regardless of the starting point $(x_0, \sigma_0)$. However, theory shows that setting starting points close to the stationary distribution, i.e., using a "warm start", can significantly accelerate convergence of the Markov chain (Dalalyan, 2017; Dwivedi et al., 2019). Thus, we set $x_0$ by generating a clean data sample with a reverse-S/ODE solver starting from prior noise, adding some Gaussian noise to the clean data sample, and running Gibbs sampling for a few iterations. Pseudocode is shown in Appendix C. The NFE involved in generating $x_0$ is included in the final per-sample average NFE computation for DLG when comparing methods in Section 5. However, we note that this cost vanishes in the limit of infinite sample size.

Reducing autocorrelation. Autocorrelation in MCMC chains, i.e., correlation between consecutive samples in the MCMC chain, could reduce the sample diversity of MCMC. A typical technique to reduce autocorrelation is to use every $n_{\text{skip}}$-th sample of the MCMC chain for some $n_{\text{skip}} > 1$. For DMCMC, this means we denoise every $n_{\text{skip}}$-th sample. So, if we use $n_{\text{den}}$ NFE to denoise MCMC samples, the average NFE for generating a single sample is around $n_{\text{skip}} + n_{\text{den}}$.
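The NFE accounting described above amounts to simple arithmetic, sketched here under the assumption of a single chain in which each Langevin step costs one score evaluation and classifier evaluations are not counted. The helper name `average_nfe_per_sample` and the values of `n_skip` and `n_den` are illustrative; the warm-start cost of 37 integrator NFE plus 20 Langevin-Gibbs iterations is the CIFAR10 setting reported in Appendix A.

```python
def average_nfe_per_sample(n_skip, n_den, n_samples, init_nfe=0):
    """Average score-network evaluations per generated sample for one chain."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # n_skip Langevin steps and n_den denoising evaluations per sample,
    # plus the one-off warm-start cost amortized over all samples.
    return n_skip + n_den + init_nfe / n_samples

# Warm start: 37 integrator NFE plus 20 Langevin-Gibbs steps, amortized over 50,000 samples.
print(average_nfe_per_sample(n_skip=5, n_den=15, n_samples=50_000, init_nfe=37 + 20))
```

The amortized warm-start term shrinks as the number of samples grows, in line with the remark that this cost vanishes in the limit of infinite sample size.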
Choosing iterates to apply denoising. The MCMC chain can be partitioned into blocks which consist of $n_{\text{skip}}$ consecutive samples. Using every $n_{\text{skip}}$-th sample of the MCMC chain corresponds to denoising the last iterate of each block. Instead, to further shorten the length of integration, within each block, we apply denoising to the sample of minimum noise scale $\sigma$.

Choice of prior p(σ). We use $p(\sigma) \propto 1/\sigma$ to drive the MCMC chain towards small values of $\sigma$.
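A prior $p(\sigma) \propto 1/\sigma$ on $[\sigma_{\min}, \sigma_{\max}]$ is uniform in $\log \sigma$, so it can be sampled by inverting its CDF, and a geometrically spaced grid carries equal spacing in $\log \sigma$. The sketch below illustrates both points; the function names and the values of $\sigma_{\min}$ and $\sigma_{\max}$ are assumptions chosen for the example.

```python
import math
import random

def sigma_grid(sigma_min, sigma_max, m):
    """m geometrically spaced noise levels from sigma_min to sigma_max."""
    return [sigma_min * (sigma_max / sigma_min) ** (i / (m - 1)) for i in range(m)]

def sample_sigma_prior(sigma_min, sigma_max, rng):
    """Inverse-CDF sample from p(sigma) proportional to 1/sigma on [sigma_min, sigma_max]."""
    u = rng.random()
    return sigma_min * (sigma_max / sigma_min) ** u

SIGMA_MIN, SIGMA_MAX = 0.01, 50.0   # assumed toy range
grid = sigma_grid(SIGMA_MIN, SIGMA_MAX, 1000)

rng = random.Random(0)
draws = [sample_sigma_prior(SIGMA_MIN, SIGMA_MAX, rng) for _ in range(20000)]

# Under this prior, log(sigma) is uniform, so the mean position in log-space is about 0.5.
log_span = math.log(SIGMA_MAX / SIGMA_MIN)
mean_log_position = sum(math.log(s / SIGMA_MIN) for s in draws) / (len(draws) * log_span)
print(mean_log_position)
```

Half of the prior mass lies below the geometric mean $\sqrt{\sigma_{\min} \sigma_{\max}}$, which is far smaller than the midpoint of the interval, so the prior favors small noise levels in the sense described above.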

5.1. MIXING OF DMCMC CHAINS

Here we provide experimental proof that DMCMC is capable of visiting diverse modes as a consequence of running MCMC in the product space X × S. To this end, we compare DLG with and without σ updates. DLG without σ updates is just Langevin dynamics at fixed σ, so we run fifty Langevin dynamics chains and fifty DLG chains on a mixture of Gaussians (MoG) with 1k modes at CIFAR10 images. All chains are initialized at a single mode. For each method, we compute the mode coverage of the samples, the class distribution of the samples, and the autocorrelation of sample image class sequence. Since the noise conditional score function can be calculated analytically for MoGs, this setting decouples sampler performance from score model approximation error. Figure 2 shows the results. In the left panel, we observe that Langevin dynamics is unable to escape the initial mode. Increasing the step size η of Langevin dynamics caused the chain to diverge. On the other hand, DLG successfully captures all modes of the distribution. DLG samples cover all 1k modes at chain length 432. Middle panel provides evidence that DLG samples correctly reflect the statistics of the data distribution. Finally, the right panel indicates that the DLG chain moves freely between classes, i.e., distant modes. These observations validate our claim that DLG mixes well.
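The remark that the noise-conditional score is analytic for a MoG can be made explicit. For an equally weighted mixture with isotropic modes, smoothing with noise level $\sigma$ only inflates each mode's variance, and the score is a responsibility-weighted pull toward the mode centers. The sketch below assumes equal weights and a shared mode standard deviation `mode_std`, neither of which is specified in the text, and uses three hypothetical two-dimensional modes in place of image-sized ones.

```python
import math

def _log_weights(x, means, var):
    """Unnormalized log responsibilities of each mode for the point x."""
    return [-0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, mu)) / var for mu in means]

def mog_log_density(x, means, sigma, mode_std=0.0):
    """log p_sigma(x) up to an additive constant that does not depend on x."""
    var = mode_std**2 + sigma**2
    logw = _log_weights(x, means, var)
    top = max(logw)
    return top + math.log(sum(math.exp(lw - top) for lw in logw))

def mog_score(x, means, sigma, mode_std=0.0):
    """Analytic noise-conditional score of the smoothed mixture at x."""
    var = mode_std**2 + sigma**2
    logw = _log_weights(x, means, var)
    top = max(logw)
    w = [math.exp(lw - top) for lw in logw]
    total = sum(w)
    # Responsibility-weighted mean of the modes; the score pulls x toward it.
    mean = [sum(wk * mu[i] for wk, mu in zip(w, means)) / total for i in range(len(x))]
    return [(mi - xi) / var for mi, xi in zip(mean, x)]

means = [[1.0, 0.0], [-1.0, 0.5], [0.0, -2.0]]   # hypothetical mode centers
x, sigma = [0.3, -0.2], 0.7
analytic = mog_score(x, means, sigma)

# Central finite differences of the log-density as an independent check.
h = 1e-5
numeric = []
for i in range(len(x)):
    xp = list(x); xp[i] += h
    xm = list(x); xm[i] -= h
    numeric.append((mog_log_density(xp, means, sigma) - mog_log_density(xm, means, sigma)) / (2 * h))
print(analytic, numeric)
```

Because the score is exact here, any difference between samplers in such an experiment can be attributed to the sampler rather than to score approximation error.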

5.2. ACCELERATING IMAGE GENERATION WITH SCORE NETWORKS

We compare six integrators with and without DLG on CIFAR10 and CelebA-HQ-256 image generation. The deterministic integrators are: the deterministic integrator of Karras et al. (2022) (KAR1), the probability flow integrator of Song et al. (2021b) , and the RK45 solver. The stochastic integrators are: the stochastic integrator of Karras et al. (2022) (KAR2), the reverse diffusion integrator of Song et al. (2021b) , and the Euler-Maruyama method. We use the Fréchet Inception Distance (FID) (Heusel et al., 2017) to measure sample quality. For CIFAR10, we generate 50k samples, and for CelebA-HQ-256, we generate 10k samples. We use pre-trained score models of Song et al. (2021b) . CIFAR10. In Figure 3 , we make two important observations. First, DLG successfully accelerates all six integrators by a non-trivial margin. In particular, if an integrator without DLG already performs well, the integrator combined with DLG outperforms other integrators combined with DLG. For instance, compare the results for KAR1 with those of other deterministic integrators. Second, DLG improves the performance lower bound for some deterministic integrators. While the performance of KAR1 and RK45 saturates at around 4 FID, KAR1 and RK45 with DLG achieve around 2.4 FID. CelebA-HQ-256. Figure 4 shows the results on CelebA-HQ-256. We observe that DLG improves computational efficiency and sample quality simultaneously. Indeed, in Figure 5 , we observe remarkable improvements in sample quality despite using fewer NFE. This demonstrates the scalability of DLG to generating high-resolution images. We also note that we did not perform an exhaustive search of DLG hyper-parameters for CelebA-HQ-256, so fine-tuning could yield better results. Achieving 

5.3. HYPER-PARAMETER ABLATION STUDY

Given an integrator, DLG is determined by total NFE per sample n, NFE spent on denoising samples n den , and Langevin dynamics step size η. We fix the integrator to be KAR1 and observe the effect of each component on CIFAR10 image generation. We observed similar trends with other samplers. η vs. n den /n. In the left panel of Figure 6 , we fix NFE and vary η and n den /n. n den governs individual sample quality, and n skip = n -n den governs sample diversity. Thus, we observe optimal FID is achieved when n den /n has intermediate values, not extreme values near 0 or 1. Also, lower n den /n is needed to attain optimality for lower η. This is because lower η means the MCMC chain travels closer to the image manifold at the cost of slower mixing. NFE vs. n den /n. In the middle panel of Figure 6 , we fix η and vary NFE and n den /n. We observe two trends. First, in the small NFE regime, where 10 ≤ NFE ≤ 50, it is beneficial to decrease n den /n as NFE increases. Second, in the large NFE regime, where NFE > 50, it is beneficial to increase n den /n as NFE increases. This is because if n skip is sufficiently large, MCMC chain starts producing essentially independent samples, so increasing n skip further provides no gain. Also, as we increase NFE, the set of n den /n which provides near-optimal performance becomes larger. Moreover, we see that most of the time, optimal FID is achieved when n den /n > 0.5, i.e., when n den > n skip . So, a reasonable strategy for choosing n den given η and NFE budget n is to find smallest n skip which produces visually distinct samples, and then allocate n den = n -n skip . η vs. NFE. In the right panel of Figure 6 , we choose optimal (in terms of FID) n den /n for each combination of η and NFE. We see choosing overly small or large η leads to performance degradation. If η is within a certain range, we obtain similarly good performance. In the case of CIFAR10, we found it reasonable to set η ∈ [0.05, 1.0]. 
To choose η for data of general dimension d, we define a value κ := η/ √ d called displacement per dimension. As Eq. (15) shows, Gaussian noise of zero mean and variance η is added to the sample at each update step. The Gaussian annulus theorem tells us that high-dimensional Gaussian noise with zero mean and variance η has Euclidean norm approximately η 

6. CONCLUSION

In this work, we proposed DMCMC which combines MCMC with reverse-S/ODE integrators. This has led to improvements for both MCMC and diffusion models. For MCMC, DMCMC allows Markov chains to visit disjoint modes. For diffusion models, DMCMC accelerates sampling by reducing the average integration interval length of reverse-S/ODE. We developed a practical instance of DMCMC called DLG, and demonstrated the practicality and scalability of DLG through various experiments. In particular, DLG achieved state-of-the-art results on CIFAR10 and CelebA-HQ-256 among score-based models. Overall, our work opens up an orthogonal approach to accelerating score-based sampling. We leave exploration of other kinds of MCMC or diffusion process such as VP diffusion (see Appendix E.3) in DMCMC as future work.

A DETAILED EXPERIMENT SETTINGS

Device. We use an RTX 2080 Ti or two Quadro RTX 6000 depending on the required VRAM.

Code. For probability flow, RK45, reverse diffusion, and Euler-Maruyama integrators, we modify the code provided by Song et al. (2021b) in the GitHub repository https://github.com/yang-song/score_sde_pytorch. For KAR1 and KAR2, since Karras et al. (2022) did not release their implementation of the samplers, we used our implementation based on their paper. For evaluation, we use the FID implementation provided in the GitHub repository https://github.com/mseitzer/pytorch-fid.

Datasets. We use the CIFAR10 dataset (Krizhevsky, 2009) and the CelebA-HQ-256 dataset (Karras et al., 2018).

Data processing. All data are normalized into the range [0, 1]. Following Song et al. (2021b), for all methods, a denoising step using Tweedie's denoising formula is applied at the end of the sampling process.

Noise predictor network $q_\phi$. The noise predictor network $q_\phi$ has four convolution layers followed by a fully connected layer. The convolution layers have channels 32, 64, 128, 256, and $(\sigma_{\min}, \sigma_{\max})$ is discretized into 1k points $\sigma_{\min} (\sigma_{\max} / \sigma_{\min})^t$ for $t$ spaced evenly on [0, 1]. On CIFAR10, $q_\phi$ is trained for 200 epochs, and on CelebA-HQ-256, $q_\phi$ is trained for 100 epochs. We use the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.001.

Mixture of Gaussians. MoG has 1k modes at randomly sampled CIFAR10 training set images. For Langevin dynamics, we use step size $\eta = 0.0001$. Using a larger step size caused the Langevin dynamics chain to diverge. For DLG, we use step size $\eta = 1.0$, $n_{\text{skip}} = 1$, $n_{\text{den}} = 20$, and the reverse diffusion integrator. For each method, all chains were initialized at a single mode to test mixing capabilities in the worst-case scenario.

CIFAR10 image generation. We use a pre-trained NCSN++ (cont.) score model provided by Song et al. (2021b). For the baseline methods, we use the recommended settings. For DLG, the chain was initialized by generating samples with the deterministic integrator of Karras et al. (2022) using 37 NFE, adding Gaussian noise of variance 0.25, and running 20 iterations of Langevin-Gibbs.

CelebA-HQ-256 image generation. We use a pre-trained NCSN++ (cont.) score model provided by Song et al. (2021b). For the baseline methods, we use the recommended settings. For DLG, the chain was initialized by generating samples with the stochastic integrator of Karras et al. (2022) using 37 NFE, adding Gaussian noise of variance 0.25, and running 70 iterations of Langevin-Gibbs.

C PSEUDOCODES

Let us denote a reverse-S/ODE integrator as π(s_θ, x, σ, n). Given a point x at noise level σ, score function s_θ, and NFE budget n, π returns the result of integrating the reverse-S/ODE from σ to σ_min starting from x. That is, π returns a sample from p_0(x).

Algorithm 1 MCMC Starting Point Generation
Input: integration NFE budget n_1, number of Langevin-Gibbs steps n_2, Langevin step size η
Sample x̃_0 ∼ N(0, σ_max² I)
x̃_0 ← π(s_θ, x̃_0, σ_max, n_1)
x̃_0 ← x̃_0 + 0.5 · ϵ where ϵ ∼ N(0, I)
σ_0 ← 0.5
for t = 1, 2, ..., n_2 do
    x̃_0 ← x̃_0 + 0.5 · η · s_θ(x̃_0, σ_0) + √η · ϵ where ϵ ∼ N(0, I)
    σ_0 ∼ q_ϕ(σ | x̃_0)
end for
Return (x̃_0, σ_0)

Algorithm 2 Denoising Langevin Gibbs
Input: n_den, n_skip, MCMC starting point (x̃_0, σ_0), Langevin step size η, number of samples needed N
(x̃, σ̃) ← (x̃_0, σ_0)
for k = 1, 2, ..., N do
    Initialize minimum noise level tracking variables (x̂, σ̂) ← (x̃, σ̃)
    for t = 1, 2, ..., n_skip do
        x̃ ← x̃ + 0.5 · η · s_θ(x̃, σ̃) + √η · ϵ where ϵ ∼ N(0, I)
        σ̃ ∼ q_ϕ(σ | x̃)
        if σ̃ < σ̂ then
            (x̂, σ̂) ← (x̃, σ̃)
        end if
    end for
    x_k ← π(s_θ, x̂, σ̂, n_den)
end for
Return {x_k} for k = 1, ..., N

D MORE RELATED WORKS

D.1 COMPARISON OF DMCMC WITH THE PREDICTOR-CORRECTOR SAMPLER

As both DMCMC and the predictor-corrector (PC) algorithm (Song et al., 2021b) rely on MCMC, one may ask whether DMCMC is a straightforward extension of PC. However, DMCMC is not a trivial generalization of PC. On a high level, the main message of our paper is that we can significantly improve diffusion sampling by choosing better initial points for the denoising process. This claim was verified through extensive experiments. On the other hand, the focus of the corrector step of PC is on improving the denoising process by refining the predictor / integrator steps. So, DMCMC and PC contribute to orthogonal aspects of diffusion sampling.
Indeed, the improvements offered by the MCMC corrector step are rather marginal (see Table 1 in Song et al. (2021b)) compared to the improvements offered by DMCMC (see Figure 3 in our paper).

On a low level, we distinguish DMCMC from PC on three levels: theoretical, implementation-wise, and empirical.

Theoretically, MCMC in DMCMC and MCMC in PC reduce truncation error (error arising from numerically integrating the reverse-S/ODE) in entirely distinct ways. We observe that diffusion model sampling can be broken down into two parts: (part 1) generating an initialization point, and (part 2) integrating the reverse-S/ODE starting from the initialization point to generate clean data. MCMC in PC aims to improve part 2, whereas MCMC in DMCMC aims to improve part 1. In PC, MCMC is used as a corrector. That is, MCMC corrects the distributions of intermediate points during integration of the reverse-S/ODE (part 2). That is why PC sampling proceeds by alternating between taking a predictor step and running Langevin dynamics at a fixed noise level. In DMCMC, MCMC generates initialization points for the reverse-S/ODE (part 1) that lie near the image manifold. This is possible because MCMC runs in the augmented space X × S by adaptively updating the noise level. This leads to reduced truncation error, or equivalently, acceleration, as it is easier for reverse-S/ODE integrators to generate clean data from points near the data manifold than from prior noise distribution samples (e.g., Gaussian noise). For a more detailed explanation of the acceleration mechanism of DMCMC, we refer the reader to Appendix E.1.

DMCMC does not resemble PC even from the perspective of implementation. MCMC in PC runs during reverse-S/ODE integration, whereas MCMC in DMCMC runs before reverse-S/ODE integration.

Finally, DMCMC can also be used to accelerate PC algorithms, as shown in Figure 13. This means MCMC in DMCMC and MCMC in PC play orthogonal roles in generating clean data.
If DMCMC were a straightforward extension of PC, we would not see such acceleration.

Nichol & Dhariwal (2021) and Roman et al. (2021) have also used noise predictor networks to improve diffusion sampling. Although both works employed a neural net to predict the noise level / diffusion time for a given data input, we emphasize that these works do not overlap with the contributions of DMCMC. The main contribution of our paper is not the noise prediction network itself. The novelty of DMCMC arises from how we use the noise prediction network. On a high level, the main message of our paper is that we can significantly improve diffusion sampling by choosing better initial points for denoising. A noise predictor network was used in the process of finding initial points. On the other hand, the focus of the mentioned works is on improving the denoising process with a noise predictor network. So, DMCMC and the mentioned works contribute to orthogonal aspects of diffusion sampling.

Let us elaborate. We first note that diffusion model sampling can be broken down into two parts: (part 1) generating an initialization point, and (part 2) integrating the reverse-S/ODE to generate data from the initialization point. Nichol & Dhariwal (2021) and Roman et al. (2021) use the noise prediction network to accelerate integration of the reverse-S/ODE (improve part 2). This is achieved by learning the covariance of the reverse distribution (Nichol & Dhariwal, 2021) or by adjusting the noise schedule with a neural net (Roman et al., 2021). In DMCMC, MCMC generates initialization points for the reverse-S/ODE that lie near the image manifold (improve part 1). This is possible because MCMC runs in X × S by adaptively updating the noise level with a noise prediction network. This leads to acceleration, as it is easier for reverse-S/ODE integrators to generate clean data from points near the data manifold than from prior noise distribution samples (e.g., Gaussian noise).
Hence, our paper proposes an acceleration approach entirely different from those of Nichol & Dhariwal (2021) and Roman et al. (2021). In fact, DMCMC does not resemble Nichol & Dhariwal (2021) and Roman et al. (2021) even from the perspective of implementation. Noise predictors in Nichol & Dhariwal (2021) and Roman et al. (2021) are used during reverse-S/ODE integration, whereas the noise predictor in DMCMC is used before reverse-S/ODE integration as a sub-step of MCMC.

E MORE DISCUSSIONS

E.1 MORE EXPLANATION ON HOW DMCMC ACCELERATES SAMPLING

The discussion at the end of Section 3.2 describes why DMCMC combined with a reverse-S/ODE integrator is faster than using a reverse-S/ODE integrator alone. Here, we provide a more detailed explanation of the acceleration phenomenon.

We note that DMCMC consists of two steps that are executed sequentially: (a) MCMC on p(x, σ), and (b) denoising MCMC samples by reverse-S/ODE. Since DMCMC without (a) is just standard diffusion, the acceleration comes from using p(x, σ) samples as initial points for the reverse-S/ODE. Thus, we need to show how using p(x, σ) samples as initial points accelerates image generation. This proceeds in two steps. We first explain how MCMC produces samples (x_n, σ_n) from p(x, σ) such that σ_n is significantly smaller than σ_max. Then, we explain how integrating over (σ_min, σ_n) is faster than integrating over (σ_min, σ_max) as in standard diffusion.

For the first step, we observe that ∫ p_σ(x | x′) p(x′) dx′ becomes flatter and wider as σ increases. This is because ∫ p_σ(x | x′) p(x′) dx′ applies Gaussian smoothing to p(x) with a Gaussian kernel of variance σ². For instance, if p(x) is a normal distribution with variance γ², then ∫ p_σ(x | x′) p(x′) dx′ is a normal distribution with variance σ² + γ². It follows that high density values of p(x, σ) occur when x is near the data manifold and σ is small. Since MCMC traverses high-probability regions, we can expect that {x_n} will be close to the data manifold and {σ_n} will be smaller than σ_max. Indeed, in Appendix F.1, we see that actual σ_n values are significantly smaller than σ_max = 50 in CIFAR10.

Standard diffusion needs to integrate the reverse-S/ODE over the large interval (σ_min, σ_max) to produce clean images.
On the other hand, in DMCMC, MCMC produces samples (x_n, σ_n) from p(x, σ) such that σ_n is significantly smaller than σ_max. So, in the DMCMC framework, to generate clean images, we only need to integrate the reverse-S/ODE over the small interval (σ_min, σ_n). This means the cost of integrating over (σ_n, σ_max) vanishes for DMCMC, leading to accelerated image generation. (To be precise, the cost of integrating over (σ_n, σ_max) is replaced by the cost of running MCMC to sample from p(x, σ), but we observe in Section 5.1 that DLG mixes rapidly, so this cost is negligible.)

More rigorously, given the same computation budget and the same integration method, integrating over (σ_min, σ_n) has less truncation error than integrating over (σ_min, σ_max). Computation budget roughly corresponds to the number of discretization points of the integration interval we can use to approximate the reverse-S/ODE. Thus, a shorter interval of integration means we can use a smaller step size h during integration, which implies smaller error. For instance, Euler's method has O(h²) local error. A more rigorous exposition is given in Chapter 7 of Stoer & Bulirsch (2002). This justifies how DMCMC can generate better samples than standard diffusion under a fixed computation budget. In other words, DMCMC can use less computation than standard diffusion to achieve similar sample quality, i.e., DMCMC can accelerate sampling.
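The truncation-error argument above can be checked on a toy problem. The sketch below is only an illustration under stated assumptions, not the paper's experimental setup: the data distribution is a one-dimensional Gaussian N(0, γ²) with γ = 1, so the VE score is known in closed form, and the integrator is plain Euler on the probability-flow ODE dx/dσ = -σ · score with uniform steps in σ. The names `euler_denoise`, `exact_ratio`, and `endpoint_error` are hypothetical helpers. With the same budget of 10 steps, it compares integrating from σ_max = 50 against integrating from a smaller starting level σ_n = 2 on the same exact trajectory.

```python
import math

GAMMA = 1.0        # std of the toy Gaussian data distribution (assumption)
SIGMA_MIN = 0.01   # final noise level (assumption)
SIGMA_MAX = 50.0   # prior noise level, matching sigma_max = 50 quoted for CIFAR10


def exact_ratio(sigma_from, sigma_to):
    """Exact flow map x(sigma_to) / x(sigma_from) of the ODE for N(0, GAMMA^2) data."""
    return math.sqrt((GAMMA**2 + sigma_to**2) / (GAMMA**2 + sigma_from**2))


def euler_denoise(x, sigma_start, n_steps):
    """Euler steps on dx/dsigma = -sigma * score from sigma_start down to SIGMA_MIN."""
    h = (sigma_start - SIGMA_MIN) / n_steps
    sigma = sigma_start
    for _ in range(n_steps):
        score = -x / (GAMMA**2 + sigma**2)   # score of N(0, GAMMA^2 + sigma^2)
        x = x + (-h) * (-sigma * score)      # one step of size -h in sigma
        sigma -= h
    return x


def endpoint_error(sigma_start, n_steps, x_at_max=50.0):
    """Error at SIGMA_MIN when starting from the exact trajectory point at sigma_start."""
    x_start = x_at_max * exact_ratio(SIGMA_MAX, sigma_start)
    x_true = x_at_max * exact_ratio(SIGMA_MAX, SIGMA_MIN)
    return abs(euler_denoise(x_start, sigma_start, n_steps) - x_true)


err_long = endpoint_error(SIGMA_MAX, n_steps=10)   # standard diffusion: full interval
err_short = endpoint_error(2.0, n_steps=10)        # DMCMC-style: start at sigma_n = 2
print(f"full interval error:  {err_long:.4f}")
print(f"short interval error: {err_short:.4f}")
```

Under these assumptions the short-interval error is smaller than the full-interval error at equal step count, which is the qualitative behavior the paragraph describes. The toy says nothing about the cost of the MCMC stage itself.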

E.2 TRADE-OFF BETWEEN THE CONVERGENCE SPEED OF MCMC AND SHARPNESS OF p(σ)

Let us assume that -log p(x | σ) is strongly convex and has L-Lipschitz continuous gradients. We also assume -log p(σ) is strongly convex and has M-Lipschitz continuous gradients. We use such assumptions, because in the setting where either -log p(x | σ) or -log p(σ) is nonconvex, it is

In VE diffusion, the (numerical) support of ∫ p_σ(x | x′) p(x′) dx′ grows wider as σ increases. Thus, wherever the current MCMC x iterate is, there is always some noise level σ such that x is in the support of ∫ p_σ(x | x′) p(x′) dx′, so the score function provides a meaningful (i.e., non-zero) direction. The noise classification network q_ϕ predicts this optimal noise level σ for x, as shown in Appendix F.1.

On the other hand, in the VP diffusion framework, the prior noise distribution (standard normal distribution) has finite support. To be precise, the standard normal distribution has non-zero density everywhere, but numerically, the density becomes zero at points sufficiently far from the origin. As the data distribution also has finite support, all intermediate distributions p(x | α) of VP diffusion have finite support as well. Since MCMC takes random steps, it is possible for an MCMC iterate to land at a point where no intermediate distribution p(x | α) of VP diffusion provides meaningful density or gradient.

Even worse, the mean of the corruption process of VP diffusion shifts in the process of changing data samples to zero-mean standard normal samples. So, high-density paths between prior noise samples and data samples can become very narrow in high-dimensional scenarios.

Due to these pathologies of VP diffusion, what we observed in practice was that Langevin Gibbs would repeat the following behavior: try to approach the image manifold in the x update step so α increases, step off the high-density path leading to the data manifold in the x update step, and get absorbed into the prior noise distribution so α decays to 0.
Indeed, in Figure 14 we observe α values oscillating around zero. This implies the x_n are essentially prior noise samples, so using {(x_n, α_n)} as initial points for the reverse-S/ODE has no practical benefit. This type of problem with narrow, high-density regions often plagues MCMC. For a toy example, see Figure 3 of Cobb et al. (2019). We postulate that a better MCMC sampling scheme could enable us to run DMCMC under the VP diffusion framework. For instance, one could use MCMC with rejection steps that would reject points that step off high-density regions, or incorporate Riemannian manifold structure into the sampling scheme. However, we believe this is a topic for future work.

F.2 DENOISING STEP ABLATION

One may ask whether the denoising step in DMCMC is a necessary component to generate clean samples. To answer this question, we compare DLG with and without the denoising step. We use the reverse diffusion integrator as the reverse-SDE solver in the denoising step. Figure 17 shows that DLG without the denoising step completely fails to generate valid clean image samples, regardless of how large an n_skip we use.

Figure 17: Ablation of the denoising step in DMCMC.



DENOISING LANGEVIN GIBBS (DLG)

In Section 3, we described an abstract framework, DMCMC, for accelerating score-based sampling by combining MCMC and reverse-S/ODE integrators. We now develop a concrete instance of DMCMC. As the second construction step of DMCMC is simple, we only describe the first step.

When we mention the support of a distribution, we mean the set of points where the numerical value (value represented on a computer) of the density is nonzero. This is because, even when a distribution theoretically has nonzero density everywhere, it is possible that the numerical value of the density vanishes outside some bounded region.



Figure 2: Ablation study of σ update step in DLG.

Figure 3: Sampling acceleration of DLG on CIFAR10. FID values of notable points are written in the corresponding color. Top row: deterministic integrators. Bottom row: stochastic integrators.

Figure 4: Sampling acceleration of DLG on CelebA-HQ-256. A dot indicates an integrator without DLG, and a cross of the same color indicates the corresponding integrator combined with DLG. Dotted lines indicate performance improvement due to DLG.

Figure 5: Non-cherry-picked samples on CelebA-HQ-256 using the settings for Fig. 4. Each row shows samples for an integrator without (left col.) and with (right col.) DLG.

Figure 6: Ablation study of DLG with KAR1. Dots indicate the points of lowest FID.

So, the average displacement of the sample per dimension by the random noise is around η/√d. Since κ is a dimension-independent value, given κ and d, we can set η = √d · κ. On CIFAR10, we have η ∈ [0.05, 1.0], which translates to κ ∈ [0.0009, 0.018]. This means, on CelebA-HQ-256, we can choose η ∈ [0.4, 8.0]. If the sampler was inefficient, we chose a smaller κ to trade off diversity for sample quality.
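The step-size transfer above is a short calculation that can be reproduced directly. The sketch below assumes that d is the number of pixel values of an RGB image (3 · H · W), which is not stated explicitly in the text but reproduces the quoted ranges; the helper names are hypothetical.

```python
import math


def image_dim(channels, height, width):
    """Data dimension d of an image tensor (assumed to be channels * height * width)."""
    return channels * height * width


def kappa_from_eta(eta, d):
    """Dimension-independent value kappa = eta / sqrt(d)."""
    return eta / math.sqrt(d)


def eta_from_kappa(kappa, d):
    """Langevin step size eta = sqrt(d) * kappa for a dataset of dimension d."""
    return math.sqrt(d) * kappa


d_cifar = image_dim(3, 32, 32)      # 3072
d_celeba = image_dim(3, 256, 256)   # 196608

# Step-size range on CIFAR10, converted to kappa and then to CelebA-HQ-256.
kappa_range = [kappa_from_eta(eta, d_cifar) for eta in (0.05, 1.0)]
eta_range_celeba = [eta_from_kappa(k, d_celeba) for k in kappa_range]

print([round(k, 4) for k in kappa_range])        # roughly 0.0009 and 0.018
print([round(e, 3) for e in eta_range_celeba])   # roughly 0.4 and 8.0
```

Because 196608 / 3072 = 64, the CelebA-HQ-256 step sizes are the CIFAR10 step sizes multiplied by √64 = 8 under this assumption.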

Figures 7 and 8 show a DLG chain on CIFAR10 and CelebA-HQ-256, respectively. The chain progresses from left to right within each row and continues from the right end of a row to the left end of the row below. On CIFAR10, we can see that the chain visits diverse classes. On CelebA-HQ-256, we can see that the chain transitions between diverse attributes such as gender, hair color, skin color, glasses, facial expression, posture, etc.

Figure 7: Visualization of a DLG chain on CIFAR10.

Figure 8: Visualization of a DLG chain on CelebA-HQ-256.

Figure 9: Additional non-cherry-picked samples for deterministic integrators on CIFAR10 without (left col.) and with (right col.) DLG.

Figure 10: Additional non-cherry-picked samples for stochastic integrators on CIFAR10 without (left col.) and with (right col.) DLG.

Figure 11: Additional non-cherry-picked samples for deterministic integrators on CelebA-HQ-256 without (left col.) and with (right col.) DLG.

Figure 12: Additional non-cherry-picked samples for stochastic integrators on CelebA-HQ-256 without (left col.) and with (right col.) DLG.

• PC: (prior noise) → (integ. step) → (MCMC) → (integ. step) → (MCMC) → . . . → (data)
• DMCMC: (MCMC to generate points near image manifold) → (integration) → (data)

Moreover, as already mentioned, MCMC in PC runs in X at a fixed σ, whereas MCMC in DMCMC runs in X × S by adaptively updating σ.
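The DMCMC schedule above (MCMC in X × S first, integration second) can be sketched on a one-dimensional toy problem. Everything specific in this sketch is an assumption made for illustration: the data distribution is three equal-weight point masses, the exact score replaces the score network s_θ, an exact posterior over a log-spaced grid of noise levels with a uniform prior replaces the noise classifier q_ϕ, plain Euler on the probability-flow ODE plays the role of the integrator π, and η = 0.05 is an arbitrary step size. In one dimension the chain is much noisier than in the image setting, so this shows the control flow rather than the paper's performance.

```python
import math
import random

MODES = [-4.0, 0.0, 4.0]             # toy data: equal-weight point masses (assumption)
SIGMA_MIN, SIGMA_MAX = 0.01, 10.0
# Log-spaced noise levels; a uniform prior over this grid is an assumption.
GRID = [SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** (j / 49) for j in range(50)]


def log_density(x, sigma):
    """log p(x | sigma) up to a constant for the point masses blurred by N(0, sigma^2)."""
    logs = [-0.5 * ((x - m) / sigma) ** 2 - math.log(sigma) for m in MODES]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


def score(x, sigma):
    """Exact d/dx log p(x | sigma); stands in for the score network."""
    logs = [-0.5 * ((x - m) / sigma) ** 2 for m in MODES]
    top = max(logs)
    w = [math.exp(v - top) for v in logs]
    mean = sum(wi * m for wi, m in zip(w, MODES)) / sum(w)
    return (mean - x) / sigma ** 2


def sample_sigma(x, rng):
    """Gibbs step: draw sigma from the exact posterior on GRID; stands in for q_phi."""
    logs = [log_density(x, s) for s in GRID]
    top = max(logs)
    return rng.choices(GRID, weights=[math.exp(v - top) for v in logs])[0]


def denoise(x, sigma, n_den):
    """Euler on dx/dsigma = -sigma * score over a geometric grid from sigma to SIGMA_MIN."""
    for i in range(n_den):
        nxt = sigma * (SIGMA_MIN / sigma) ** (1.0 / (n_den - i))
        x += (nxt - sigma) * (-sigma * score(x, sigma))
        sigma = nxt
    return x


def dlg(n_samples, n_skip, n_den, eta, rng):
    """Langevin-Gibbs in (x, sigma), then denoise the lowest-noise state of each block."""
    x, sigma = rng.choice(MODES) + 0.5 * rng.gauss(0, 1), 0.5   # start near the data
    out = []
    for _ in range(n_samples):
        best_x, best_sigma = x, sigma              # minimum noise level tracking variables
        for _ in range(n_skip):
            x += 0.5 * eta * score(x, sigma) + math.sqrt(eta) * rng.gauss(0, 1)
            sigma = sample_sigma(x, rng)
            if sigma < best_sigma:
                best_x, best_sigma = x, sigma
        out.append(denoise(best_x, best_sigma, n_den))
    return out


rng = random.Random(0)
samples = dlg(n_samples=200, n_skip=5, n_den=20, eta=0.05, rng=rng)
dist = [min(abs(x - m) for m in MODES) for x in samples]
frac_close = sum(d < 0.1 for d in dist) / len(dist)
print(f"fraction of samples within 0.1 of a mode: {frac_close:.2f}")
```

A PC-style sampler would instead interleave the Langevin update inside `denoise` at each fixed noise level; here the Langevin-Gibbs loop finishes before any integration starts, which is the structural difference the two schedules describe.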

Figure 13: PC acceleration with DLG on CIFAR10.

Figure 14: α n trajectories for DLG with VP diffusion in the MoG setting.

OF σ DURING LANGEVIN GIBBS ON X × S

In Figure 15, we have visualized Langevin Gibbs trajectories of σ in the MoG setting with 1k modes at CIFAR10 images (this setting decouples sampler behavior from score function approximation error) and in the score network setting. We indeed observe that σ moves up and down, allowing x to travel between disjoint modes of the distribution. Moreover, in Figure 2 of Section 5.1, x samples cover all 1k modes of the MoG. This shows experimentally that the x sequence of DLG explores the entire image distribution.

In the MoG setting, let us define δ_n as the distance of x_n to the closest mode in the MoG. In Figure 16, we see that σ_n ∼ q_ϕ(σ | x_n) is almost identical to δ_n/√d, where d is the dimension of x_n. This is reasonable, as the Gaussian annulus theorem tells us that samples from a high-dimensional Gaussian distribution of mean µ and variance σ² concentrate on a shell of radius σ√d centered at µ. Specifically, due to the Gaussian annulus theorem, the samples of ∫ p_σ(x | x′) p(x′) dx′ are most likely to come from a shell of radius σ√d centered around the image manifold (see Figure 2 (a) in Chung et al. (2022)). Then, given a point x at distance δ from the image manifold, we can intuitively argue that the optimal σ for x, i.e., the σ of highest likelihood under p(σ | x), can be determined by equating δ = σ√d, which gives σ = δ/√d. In the MoG case, this δ is approximated by the distance of x to the closest mode in the MoG. Since the σ values predicted by the noise classifier agree with the approximated δ/√d, we can see that the noise classifier is a good approximation of p(σ | x).
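The relation σ ≈ δ/√d used above follows from the concentration of Gaussian norms and is easy to check numerically. The sketch below is a minimal illustration, assuming d = 3072 (the size of a 3 × 32 × 32 image) and a single mode at the origin; the helper name and the sample counts are arbitrary choices. It draws isotropic Gaussian noise at a few noise levels and reports the average distance from the mode divided by √d.

```python
import math
import random


def mean_radius_over_sqrt_d(sigma, d, n_samples, rng):
    """Average of ||eps|| / sqrt(d) for eps ~ N(0, sigma^2 I_d)."""
    total = 0.0
    for _ in range(n_samples):
        sq_norm = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(d))
        total += math.sqrt(sq_norm) / math.sqrt(d)
    return total / n_samples


rng = random.Random(0)
d = 3072  # assumed dimension of a 3 x 32 x 32 image
estimates = {}
for sigma in (0.1, 0.5, 2.0):
    estimates[sigma] = mean_radius_over_sqrt_d(sigma, d, n_samples=20, rng=rng)
    print(f"sigma = {sigma}: delta / sqrt(d) is about {estimates[sigma]:.4f}")
```

For each noise level the printed value stays close to σ itself, so inverting the relation, a point at distance δ from a mode is most consistent with a noise level near δ/√d.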

Figure 15: Visualization of σ n trajectories in MoG and score network settings.

Figure 16: q ϕ prediction and (approximate) distance to image manifold in the MoG setting.

We verify the effectiveness of DLG by accelerating six reverse-S/ODE integrators. Notably, combined with the integrators of Karras et al. (2022), DLG achieves state-of-the-art results among score-based models. On CIFAR10 in the limited number of score function evaluation (NFE) setting, we obtain 3.86 FID with ≈ 10 NFE and 2.63 FID with ≈ 20 NFE. On CelebA-HQ-256, we have 6.99 FID with ≈ 160 NFE, which is currently the best result with score-based models. The computation cost of evaluating a noise level classifier is negligible, so we obtain acceleration essentially for free.

Comparison of fast samplers on CIFAR10 FID. Numbers in parentheses indicate extra or fewer NFE used. Best numbers are bolded.

Table 2 lists the hyper-parameters for DLG and the corresponding FID used to produce Figure 3.

DLG hyper-parameters and FID for integrators in Figure

Table 3 lists the hyper-parameters for DLG and the corresponding FID used to produce Figure 4.

DLG hyper-parameters and FID for deterministic samplers in Figure 4.

Achieving

On CIFAR10, we use the KAR1 settings of Table 2. On CelebA-HQ-256, we use n_den = 131, n_skip = 27, η = 4.0, which achieves 6.99 FID.

acknowledgement

We provide an anonymous Google Drive link to a zip file containing code and noise classifier checkpoints for our main experiments:
https://drive.google.com/file/d/1C5RO4UB6x8eatIHDxGiGLV_oLsEKcyOI/view?usp=share_link
We have also added pseudocode for DMCMC in Appendix C, and hyper-parameters are described in Appendix A.

annex

difficult to say anything theoretically meaningful about convergence of MCMC on the joint p(x, σ). We also assume the MCMC of choice is Langevin dynamics for ease of analysis.

The sharpness of the prior p(σ) can then be characterized by M. Intuitively, if p(σ) is more peaked around zero, -log p(σ) will have a gradient which changes more rapidly, and so M will be larger. We then note that since -log p(x, σ) = -log p(x | σ) - log p(σ), -log p(x, σ) is also strongly convex and has (L + M)-Lipschitz continuous gradients.

We now resort to Theorem 3 of Cheng & Bartlett (2018), which shows that the convergence time of Langevin dynamics on -log p(x, σ) in terms of KL divergence grows in the order of O((L + M)²) (ignoring log terms). So, we indeed see there is a trade-off between the mixing time of MCMC and the choice of the prior. Using a prior that is sharper around zero will increase M and thus increase the convergence time quadratically. We speculate that a similar analysis will hold for Langevin Gibbs as well, but a rigorous analysis of Langevin Gibbs is worthy of a paper of its own.

However, in practice, we do note that Langevin Gibbs with the prior used in our paper converges quite fast, as shown in Section 5.1. In the rightmost panel of Figure 2, we observe that the autocorrelation of image labels vanishes after only a few iterations. If the sampler mixed poorly in the x space, image labels would have high autocorrelation. For visualization of DLG chains, we refer the readers to Appendix B.1. So, we can say Langevin Gibbs reliably produces samples from p(x, σ) even with a small number of steps.
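The (L + M) constant used above follows from the triangle inequality. The derivation below is a sketch under an explicit assumption that the text leaves implicit: writing z = (x, σ), the gradient of f(z) = -log p(x | σ) is taken to be L-Lipschitz jointly in z, and g(z) = -log p(σ) depends on σ only. The ratio at the end treats the O((L + M)²) bound as if it were tight, purely to illustrate the scaling.

```latex
% Assumptions (not stated in this form in the text):
%   f(z) = -\log p(x \mid \sigma), \quad g(z) = -\log p(\sigma), \quad z = (x, \sigma),
%   \nabla f is L-Lipschitz in z, and \nabla g is M-Lipschitz in \sigma.
\begin{align*}
\|\nabla (f+g)(z) - \nabla (f+g)(z')\|
  &\le \|\nabla f(z) - \nabla f(z')\| + \|\nabla g(z) - \nabla g(z')\| \\
  &\le L\,\|z - z'\| + M\,|\sigma - \sigma'| \\
  &\le (L + M)\,\|z - z'\| .
\end{align*}
% Scaling of the convergence-time bound (log factors ignored):
\[
  T(M) \propto (L + M)^2,
  \qquad
  \frac{T(M)}{T(M_0)} = \left( \frac{L + M}{L + M_0} \right)^{2} .
\]
```

For example, with M_0 = L, sharpening the prior so that M = 3L doubles L + M and therefore multiplies the bound by four.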

E.3 VE DIFFUSION VS. VP DIFFUSION IN DMCMC

A natural question is whether we can join MCMC with reverse-S/ODE using the VP diffusion framework as well. It is true that the VP setting could be more stable than the VE setting, which is why some recent solvers work in the VP setting. For instance, DPM-Solver (Lu et al., 2022) only provides experimental results in the VP setting, even though DPM-Solver can be applied to the VE setting as well. In the case of DEIS (Zhang & Chen, 2022), there is a large performance gap between the VE and VP settings. Specifically, the performance of DEIS deteriorates significantly in the VE setting.

We used VE diffusion because, from the perspective of generating better initialization points via MCMC, the VE setting was better than the VP setting. We believe the benefits outweigh the downsides, since, as shown in Table 1, DLG beats all recent fast solvers regardless of whether the fast solver works in the VP setting or the VE setting. We conjecture that stability does not have a large influence on DMCMC because the reverse-S/ODE initialization points generated by MCMC are close to the image manifold, such that the variance of initialization points is small compared to the variance of prior noise. This is possibly why DMCMC improves the performance lower bound for some deterministic integrators (Section 5.2).

We also give a detailed explanation of why DMCMC is more compatible with VE. Concretely, we observed two differences between VE diffusion and VP diffusion that made MCMC with VP diffusion difficult.

We first establish some notation. For α ∈ [0, 1), we define the perturbation kernel […]. If α = 0, we get the prior noise distribution, which is the standard normal distribution. If α = 1, we recover the data distribution. VP diffusion proceeds by decreasing α from 1 to 0 (Song et al., 2021b). We can generate clean data from Gaussian noise by using a numerical integrator to solve the reverse-S/ODE for VP diffusion from α = 0 to α = 1.
With a prior p(α) on A := [0, 1), we then have a joint distribution p(x, α) = p(x | α) p(α) on the product space X × A. Then, analogous to DMCMC with VE diffusion, we may use MCMC to produce samples {(x_n, α_n)} from p(x, α) with the help of an α-classifier q_ϕ and then integrate the reverse-S/ODE from α_n to 1 to generate clean data. However, we observed two differences between VE diffusion and VP diffusion that made MCMC with VP diffusion difficult.

