GDDIM: GENERALIZED DENOISING DIFFUSION IMPLICIT MODELS

Abstract

Our goal is to extend the denoising diffusion implicit model (DDIM) to general diffusion models (DMs) beyond isotropic diffusions. Instead of constructing a non-Markov noising process as in the original DDIM, we examine the mechanism of DDIM from a numerical perspective. We discover that DDIM can be obtained by using specific approximations of the score when solving the corresponding stochastic differential equation. We present an interpretation of the accelerating effects of DDIM that also explains the advantages of a deterministic sampling scheme over the stochastic one for fast sampling. Building on this insight, we extend DDIM to general DMs, coined generalized DDIM (gDDIM), with a small but delicate modification in parameterizing the score network. We validate gDDIM on two non-isotropic DMs: the blurring diffusion model (BDM) and the critically-damped Langevin diffusion (CLD) model. We observe more than 20 times acceleration on BDM. On CLD, a diffusion model that augments the diffusion process with a velocity variable, our algorithm achieves an FID score of 2.26 on CIFAR10 with only 50 score function evaluations (NFEs), and an FID score of 2.86 with only 27 NFEs. Project page and code: https://github.com/qshzh/gDDIM.

1. INTRODUCTION

Generative models based on diffusion models (DMs) have developed rapidly in the past few years. They achieve sample quality competitive with generative adversarial networks (GANs) (Dhariwal & Nichol, 2021; Ramesh et al.; Rombach et al., 2021) and negative log-likelihood competitive with autoregressive models across various domains and tasks (Song et al., 2021; Kawar et al., 2021). Besides, DMs enjoy other merits such as stable and scalable training and resilience to mode collapse (Song et al., 2021; Nichol & Dhariwal, 2021). However, slow and expensive sampling prevents DMs from being applied to more complex and higher-dimensional tasks. Once trained, a GAN generates a sample with a single forward pass of its network, whereas the vanilla sampling method of DMs needs 1000 or even 4000 steps (Nichol & Dhariwal, 2021; Ho et al., 2020; Song et al., 2020b) to pull noise back to the data distribution, i.e., thousands of network forward evaluations. The generation process of DMs is therefore several orders of magnitude slower than that of GANs, and accelerating the sampling of DMs has received significant attention. Building on the seminal work by Song et al. (2020b) connecting diffusion models with stochastic differential equations (SDEs), a promising strategy based on probability flows (Song et al., 2020b) has been developed. The probability flow is an ordinary differential equation (ODE) associated with a DM that shares the same marginals as the SDE. Simply plugging in off-the-shelf ODE solvers already achieves significant acceleration compared with SDE-based methods (Song et al., 2020b). Arguably the most popular sampling method is the denoising diffusion implicit model (DDIM) (Song et al., 2020a), which includes both deterministic and stochastic samplers; both show tremendous improvement in sampling quality over previous methods when only a small number of steps is used for generation.
Although significant improvements in sampling efficiency from DDIM have been observed empirically, an understanding of its mechanism is still lacking. First, why does solving the probability flow ODE yield much higher sample quality than solving SDEs when the number of steps is small? Second, it has been shown that stochastic DDIM reduces to a marginal-equivalent SDE (Zhang & Chen, 2022), but its discretization scheme and acceleration mechanism are still unclear. Finally, can we generalize DDIM to other DMs and achieve similar or even better acceleration? In this work, we conduct a comprehensive study to answer the above questions, so that we can generalize and improve DDIM. We start with an interesting observation: DDIM can solve the corresponding SDE/ODE exactly, without any discretization error, in finitely many or even one step when the training dataset consists of only one data point. For deterministic DDIM, we find that the noise added to perturb the data is constant along an exact solution of the probability flow ODE (see Prop 1). Besides, provided only one evaluation of the gradient of the log density (a.k.a. the score), we are already able to recover accurate score information at any point, and this explains the acceleration of stochastic DDIM for SDEs (see Prop 3). Based on these observations, together with the manifold hypothesis, we present one possible interpretation of why the discretization scheme used in DDIM is effective on realistic datasets (see Fig. 2). Equipped with this new interpretation, we extend DDIM to general DMs, which we coin generalized DDIM (gDDIM). With only a small but delicate change to the score model parameterization during sampling, gDDIM can accelerate DMs based on general diffusion processes.
Specifically, we verify the sampling quality of gDDIM on blurring diffusion models (BDM) (Hoogeboom & Salimans, 2022; Rissanen et al., 2022) and critically-damped Langevin diffusion (CLD) (Dockhorn et al., 2021) in terms of the Fréchet inception distance (FID) (Heusel et al., 2017). To summarize, we make the following contributions: 1) We provide an interpretation of DDIM that unravels its mechanism. 2) This interpretation not only justifies the numerical discretization of DDIM but also provides insights into why ODE-based samplers are preferred over SDE-based samplers when the NFE is low. 3) We propose gDDIM, a generalized DDIM that can accelerate a large class of DMs, deterministically and stochastically. 4) We show through extensive experiments that gDDIM can drastically improve sampling quality and efficiency almost for free. Specifically, when applied to CLD, gDDIM achieves an FID score of 2.86 with only 27 steps and 2.26 with 50 steps, and it accelerates BDM more than 20 times compared with the original samplers. The rest of this paper is organized as follows. In Sec. 2 we provide a brief introduction to diffusion models. In Sec. 3 we present an interpretation of DDIM that explains its effectiveness in practice. Building on this interpretation, we generalize DDIM to general diffusion models in Sec. 4.

2. BACKGROUND

In this section, we provide a brief introduction to diffusion models (DMs). Most DMs are built on two continuous-time diffusion processes: a forward diffusion, known as the noising process, that drives any data distribution to a tractable distribution such as a Gaussian by gradually adding noise to the data, and a backward diffusion, known as the denoising process, that sequentially removes noise from noised data to generate realistic samples. The continuous-time noising and denoising processes are modeled by stochastic differential equations (SDEs) (Särkkä & Solin, 2019). In particular, the forward diffusion is a linear SDE with state u(t) ∈ R^D,

du = F_t u dt + G_t dw, t ∈ [0, T],   (1)

where F_t, G_t ∈ R^{D×D} are the linear drift coefficient and the diffusion coefficient respectively, and w is a standard Wiener process. When the coefficients are piecewise continuous, Eq. (1) admits a unique solution (Oksendal, 2013). Denote by p_t(u) the distribution of the solutions {u(t)}_{0≤t≤T} (simulated trajectories) to Eq. (1) at time t; then p_0 is determined by the data distribution and p_T is an (approximately) Gaussian distribution. That is, the forward diffusion Eq. (1) starts at a data sample and ends at a Gaussian random variable, which can be achieved with properly chosen coefficients F_t, G_t. Thanks to the linearity of Eq. (1), the conditional distribution p_{0t}(u(t)|u(0)) is a Gaussian N(µ_t u(0), Σ_t).

Figure 1: Importance of K_t for the score parameterization s_θ(u, t) = −K_t^{−T} ε_θ(u, t) and acceleration of diffusion sampling with the probability flow ODE. Trajectories of the probability flow ODE for CLD (Dockhorn et al., 2021) at random pixel locations (left). Pixel value and output of ε_θ in the v channel with the choice K_t = L_t (Dockhorn et al., 2021) along the trajectory (middle). Output of ε_θ in the x, v channels with our choice R_t (right). The smooth network output along trajectories enables large step sizes and thus sampling acceleration. gDDIM based on the proper parameterization of K_t accelerates sampling more than 50 times compared with the naive Euler solver (lower row).

The backward process from u(T) to u(0) of Eq. (1) is the denoising process. It can be characterized by the backward SDE simulated in the reverse-time direction (Song et al., 2020b; Anderson, 1982)

du = [F_t u − G_t G_t^T ∇ log p_t(u)] dt + G_t dw̄,   (2)

where w̄ denotes a standard Wiener process running backward in time. Here ∇ log p_t(u) is known as the score function. When Eq. (2) is initialized with u(T) ∼ p_T, the distribution of the simulated trajectories coincides with that of the forward diffusion Eq. (1). Thus, u(0) of these trajectories are unbiased samples from p_0; the backward diffusion Eq. (2) is an ideal generative model. In general, the score function ∇ log p_t(u) is not accessible. In diffusion-based generative models, a time-dependent network s_θ(u, t), known as the score network, is used to fit the score ∇ log p_t(u). One effective approach to train s_θ(u, t) is the denoising score matching (DSM) technique (Song et al., 2020b; Ho et al., 2020; Vincent, 2011), which minimizes the DSM loss

E_{t∼U[0,T]} E_{u(0), u(t)|u(0)} [ ‖∇ log p_{0t}(u(t)|u(0)) − s_θ(u(t), t)‖²_{Λ_t} ],   (3)

where U[0, T] represents the uniform distribution over the interval [0, T]. The time-dependent weight Λ_t is chosen to balance the trade-off between sample fidelity and data likelihood of the learned generative model (Song et al., 2021). It is discovered in Ho et al. (2020) that reparameterizing the score network as

s_θ(u, t) = −K_t^{−T} ε_θ(u, t), with K_t K_t^T = Σ_t,   (4)

leads to better sampling quality. In this parameterization, the network directly predicts the noise added to perturb the original data. Invoking the expression N(µ_t u(0), Σ_t) of p_{0t}(u(t)|u(0)), this parameterization results in the new DSM loss

L(θ) = E_{t∼U[0,T]} E_{u(0)∼p_0, ε∼N(0, I_D)} [ ‖ε − ε_θ(µ_t u(0) + K_t ε, t)‖²_{K_t^{−1} Λ_t K_t^{−T}} ].   (5)
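As a concrete illustration (not from the paper), the reweighted DSM loss Eq. (5) can be sketched for the scalar DDPM case, where µ_t = √α_t, K_t = √(1−α_t), and the weight K_t^{−1} Λ_t K_t^{−T} is set to the identity. The schedule `alpha` and the stand-in model `eps_model` are illustrative assumptions.

```python
import numpy as np

def alpha(t):
    # an illustrative decreasing schedule with alpha(0) = 1
    return np.exp(-5.0 * t ** 2)

def dsm_loss(eps_model, x0_batch, rng, n_mc=1):
    """Monte-Carlo estimate of E_t E_{x0,eps} ||eps - eps_theta(mu_t x0 + K_t eps, t)||^2
    (Eq. (5) with identity weighting) for the scalar DDPM case."""
    losses = []
    for x0 in x0_batch:
        for _ in range(n_mc):
            t = rng.uniform(1e-3, 1.0)
            a = alpha(t)
            eps = rng.standard_normal(x0.shape)
            ut = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps  # perturbed data
            losses.append(np.sum((eps - eps_model(ut, t)) ** 2))
    return np.mean(losses)
```

For a dataset with a single datapoint x0, the ideal predictor ε*(u, t) = (u − √α_t x0)/√(1−α_t) drives this loss to zero, matching the Dirac toy example used later.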
Sampling: After the score network s_θ is trained, one can generate samples via the backward SDE Eq. (2) with the learned score, or via the marginally equivalent SDE/ODE (Song et al., 2020b; Zhang & Chen, 2021; 2022)

du = [F_t u − (1+λ²)/2 G_t G_t^T s_θ(u, t)] dt + λ G_t dw̄,   (6)

where λ ≥ 0 is a free parameter. Regardless of the value of λ, the exact solutions to Eq. (6) produce unbiased samples from p_0(u) if s_θ(u, t) = ∇ log p_t(u) for all t, u. When λ = 1, Eq. (6) reduces to the reverse-time diffusion Eq. (2). When λ = 0, Eq. (6) is known as the probability flow ODE (Song et al., 2020b)

du = [F_t u − ½ G_t G_t^T s_θ(u, t)] dt.   (7)

Isotropic diffusion and DDIM: Most existing DMs are isotropic diffusions. A popular DM is denoising diffusion probabilistic modeling (DDPM) (Ho et al., 2020). For a given data distribution p_data(x), DDPM has u = x ∈ R^d and sets p_0(u) = p_data(x). Though originally proposed in the discrete-time setting, it can be viewed as a discretization of a continuous-time SDE with parameters

F_t := ½ (d log α_t/dt) I_d,  G_t := √(−d log α_t/dt) I_d,   (8)

for a decreasing scalar function α_t satisfying α_0 = 1, α_T = 0. Here I_d represents the identity matrix of dimension d. For this SDE, K_t is always chosen to be √(1−α_t) I_d. The sampling scheme proposed in DDPM is inefficient; it requires hundreds or even thousands of steps, and thus numbers of score function evaluations (NFEs), to generate realistic samples. A more efficient alternative is denoising diffusion implicit modeling (DDIM), proposed in Song et al. (2020a). It uses a different sampling scheme over a grid {t_i}:

x(t_{i−1}) = √(α_{t_{i−1}}/α_{t_i}) x(t_i) + ( √(1 − α_{t_{i−1}} − σ_{t_i}²) − √(1 − α_{t_i}) √(α_{t_{i−1}}/α_{t_i}) ) ε_θ(x(t_i), t_i) + σ_{t_i} ε,   (9)

where {σ_{t_i}} are hyperparameters and ε ∼ N(0, I_d). DDIM can generate reasonable samples within 50 NFEs. For the special case σ_{t_i} = 0, it is recently discovered in Zhang & Chen (2022) that Eq. (9) coincides with the numerical solution to Eq. (7) using an advanced discretization scheme known as the exponential integrator (EI), which utilizes the semi-linear structure of Eq. (7).

CLD and BDM: Dockhorn et al. (2021) propose critically-damped Langevin diffusion (CLD), a DM based on an augmented diffusion with an auxiliary velocity variable. More specifically, the state of the diffusion in CLD is of the form u(t) = [x(t), v(t)] ∈ R^{2d} with velocity variable v(t) ∈ R^d. CLD employs the forward diffusion Eq. (1) with coefficients

F_t := [0, βM^{−1}; −β, −ΓβM^{−1}] ⊗ I_d,  G_t := [0, 0; 0, √(2Γβ)] ⊗ I_d.   (10)

Here Γ > 0, β > 0, M > 0 are hyperparameters. Compared with most other DMs such as DDPM that inject noise into the data state x directly, CLD introduces noise into x through the coupling between v and x, as the noise only enters the velocity component v directly. Another interesting DM is the blurring diffusion model (BDM) (Hoogeboom & Salimans, 2022). The forward process in BDM can be formulated as an SDE with (detailed derivation in App. B)

F_t := d log[V α_t V^T]/dt,  G_t G_t^T := dΣ_t/dt − F_t Σ_t − Σ_t F_t^T with Σ_t = V σ_t² V^T,   (11)

where V^T denotes the discrete cosine transform (DCT) and V denotes the inverse DCT. The diagonal matrices α_t, σ_t are determined by frequency information and the dissipation time. Though it is argued that the inductive biases in CLD and BDM can benefit diffusion models (Dockhorn et al., 2021; Hoogeboom & Salimans, 2022), non-isotropic DMs are not easy to accelerate. Compared with DDPM, CLD introduces significant oscillation due to the x-v coupling, while BDM only supports an inefficient ancestral sampling algorithm (Hoogeboom & Salimans, 2022).
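Before moving on, the central DDIM update Eq. (9) can be sketched in a few lines for scalar α schedules. This is a minimal illustration; the names `ddim_step` and `eps_pred` (the output of ε_θ at the current state) are ours, not from any released DDIM codebase.

```python
import numpy as np

def ddim_step(x, eps_pred, a_t, a_s, sigma, noise=0.0):
    """One DDIM update (Eq. 9) from alpha_{t_i} = a_t to alpha_{t_{i-1}} = a_s.
    With sigma = 0 this is the deterministic DDIM; otherwise pass Gaussian noise."""
    coef = np.sqrt(1.0 - a_s - sigma ** 2) - np.sqrt(1.0 - a_t) * np.sqrt(a_s / a_t)
    return np.sqrt(a_s / a_t) * x + coef * eps_pred + sigma * noise
```

On the single-datapoint toy example discussed next, plugging in the exact noise ε moves x exactly along the curve √α_t x_0 + √(1−α_t) ε, i.e., the step incurs no discretization error.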

3. DDIM AS AN EXACT SOLUTION

The complexity of sampling from a DM is proportional to the NFE used to numerically solve Eq. (6). To establish a sampling algorithm with a small NFE, we ask a bold question: can we generate samples exactly from a DM in a finite number of steps if the score function is precise? To gain insight into this question, we start with the simplest scenario, where the training dataset consists of only one data point x_0. It turns out that accurate sampling from a diffusion model on this toy example is not easy, even when the exact score function is accessible. Most well-known numerical methods for Eq. (6), such as Runge-Kutta (RK) for ODEs and Euler-Maruyama (EM) for SDEs, incur discretization error and cannot recover the single data point in the training set unless an infinite number of steps is used. Surprisingly, DDIM can recover the single data point in this toy example in one step. Building on this example, we show how the DDIM can be obtained by solving the SDE/ODE Eq. (6) with proper approximations. The effectiveness of DDIM is then explained by justifying those approximations on general datasets at the end of this section.

3.1. ODE SAMPLING

We consider the deterministic DDIM, that is, Eq. (9) with σ_{t_i} = 0. In view of Eq. (8), the score network Eq. (4) is s_θ(u, t) = −ε_θ(u, t)/√(1−α_t). To differentiate between the learned score and the real score, denote the ground-truth version of ε_θ by ε_GT. In our toy example, the following property holds for ε_GT.

Proposition 1. Assume p_0(u) is a Dirac distribution. Let u(t) be an arbitrary solution to the probability flow ODE Eq. (7) with coefficients Eq. (8) and the ground-truth score. Then ε_GT(u(t), t) = −√(1−α_t) ∇ log p_t(u(t)) remains constant along u(t); in particular, it equals −√(1−α_T) ∇ log p_T(u(T)).

We remark that even though ε_GT(u(t), t) remains constant along an exact solution, the score ∇ log p_t(u(t)) is time-varying. This underscores the advantage of the parameterization ε_θ over s_θ. Inspired by Prop 1, we devise a sampling algorithm that recovers the exact data point in one step for our toy example. This algorithm turns out to coincide with the deterministic DDIM.

Proposition 2. With the parameterization s_θ(u, τ) = −ε_θ(u, τ)/√(1−α_τ) and the approximation ε_θ(u, τ) ≈ ε_θ(u(t), t) for τ ∈ [t−∆t, t], the solution to the probability flow ODE Eq. (7) with coefficients Eq. (8) is

u(t−∆t) = √(α_{t−∆t}/α_t) u(t) + ( √(1−α_{t−∆t}) − √(1−α_t) √(α_{t−∆t}/α_t) ) ε_θ(u(t), t),   (12)

which coincides with the deterministic DDIM.

When ε_θ = ε_GT, as is the case in our toy example, there is no approximation error in Prop 2 and Eq. (12) is exact. This implies that the deterministic DDIM can recover the training data in one step in our example. The update Eq. (12) corresponds to a numerical method known as the exponential integrator applied to the probability flow ODE Eq. (7) with coefficients Eq. (8) and the parameterization s_θ(u, τ) = −ε_θ(u, τ)/√(1−α_τ). This strategy was used and developed recently in Zhang & Chen (2022). Prop 1 and the toy experiments in Fig. 2 provide insights into why such a strategy should work.
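Prop 1 can be checked numerically (a sketch under illustrative choices, not the paper's experiment): for a Dirac data distribution at x0, integrate the probability flow ODE (7)-(8) with the exact score and verify that ε_GT(u(t), t) = (u(t) − √α_t x0)/√(1−α_t) stays constant along the trajectory. The schedule α_t = exp(−t) and the integration interval are arbitrary demo choices.

```python
import numpy as np

x0 = 1.3

def alpha(t):
    return np.exp(-t)

def ode_rhs(u, t):
    # du/dt = F_t u - 0.5 G_t^2 * score, with F_t = -1/2 and G_t^2 = 1
    # for alpha_t = exp(-t); score = -(u - sqrt(a) x0) / (1 - a) (Dirac p_0)
    a = alpha(t)
    score = -(u - np.sqrt(a) * x0) / (1.0 - a)
    return -0.5 * u - 0.5 * score

def eps_gt(u, t):
    a = alpha(t)
    return (u - np.sqrt(a) * x0) / np.sqrt(1.0 - a)

def integrate(u, t0, t1, n=3000):
    # classic RK4, integrating backward in time from t0 down to t1
    h = (t1 - t0) / n
    t = t0
    for _ in range(n):
        k1 = ode_rhs(u, t)
        k2 = ode_rhs(u + 0.5 * h * k1, t + 0.5 * h)
        k3 = ode_rhs(u + 0.5 * h * k2, t + 0.5 * h)
        k4 = ode_rhs(u + h * k3, t + h)
        u = u + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return u
```

Starting from u(3) = √α_3 x0 + √(1−α_3) c, ε_GT evaluated anywhere along the integrated trajectory returns the same constant c, which is exactly what makes the one-step update of Prop 2 exact.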

3.2. SDE SAMPLING

The above discussion, however, does not hold in the stochastic case, where λ > 0 in Eq. (6) and σ_{t_i} > 0 in Eq. (9). Since the solutions to Eq. (6) from t = T to t = 0 are stochastic, neither ∇ log p_t(u(t)) nor ε_GT(u(t), t) remains constant along sampled trajectories; both are affected by the stochastic noise. The denoising SDE Eq. (6) is more challenging than the probability flow ODE since it injects additional noise into u(t). The score information needs to remove not only the noise present in u(T) but also the noise injected along the diffusion. In general, one evaluation of ε_θ(u, t) can only provide the information needed to remove noise in the current state u; it cannot predict the noise injected in the future. Can we do better? The answer is affirmative on our toy dataset: given only one score evaluation, the score at any point can be recovered.

Proposition 3. Assume the SDE coefficients Eq. (8) and that p_0(u) is a Dirac distribution. Given any evaluation of the score function ∇ log p_s(u(s)), one can recover ∇ log p_t(u) for any t, u as

∇ log p_t(u) = (1−α_s)/(1−α_t) √(α_t/α_s) ∇ log p_s(u(s)) − 1/(1−α_t) ( u − √(α_t/α_s) u(s) ).   (13)

The major difference between Prop 3 and Prop 1 is that Eq. (13) retains the dependence of the score on the state u. This dependence is important in canceling the injected noise in the denoising SDE Eq. (6). The approximation Eq. (13) turns out to lead to a numerical scheme for Eq. (6) that coincides with the stochastic DDIM.

Theorem 1. Given the parameterization s_θ(u, τ) = −ε_θ(u, τ)/√(1−α_τ) and the approximation

s_θ(u, τ) ≈ (1−α_t)/(1−α_τ) √(α_τ/α_t) s_θ(u(t), t) − 1/(1−α_τ) ( u − √(α_τ/α_t) u(t) )

for τ ∈ [t−∆t, t], the exact solution u(t−∆t) to Eq. (6) with coefficients Eq. (8) is

u(t−∆t) ∼ N( √(α_{t−∆t}/α_t) u(t) + ( √(1−α_{t−∆t}−σ_t²) − √(α_{t−∆t}/α_t) √(1−α_t) ) ε_θ(u(t), t), σ_t² I_d ),   (14)

with σ_t² = (1−α_{t−∆t}) [ 1 − ( (1−α_{t−∆t}) α_t / ((1−α_t) α_{t−∆t}) )^{λ²} ], which is the same as the DDIM Eq. (9).
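The score-recovery identity Eq. (13) can be verified directly (a sketch with an illustrative schedule α_t = exp(−t), not the paper's code): for a Dirac data distribution the exact score is score_t(u) = −(u − √α_t x0)/(1−α_t), and Eq. (13) reproduces it at an arbitrary (u, t) from a single evaluation at (u(s), s).

```python
import numpy as np

def alpha(t):
    return np.exp(-t)

def exact_score(u, t, x0):
    # exact score of p_t when p_0 is a Dirac at x0
    a = alpha(t)
    return -(u - np.sqrt(a) * x0) / (1.0 - a)

def recover_score(u, t, u_s, s, score_s):
    """Eq. (13): recover the score at (u, t) from one evaluation score_s at (u_s, s)."""
    a_t, a_s = alpha(t), alpha(s)
    r = np.sqrt(a_t / a_s)
    return (1 - a_s) / (1 - a_t) * r * score_s - (u - r * u_s) / (1 - a_t)
```

Note that the recovered score is an affine function of the query state u; this u-dependence is what allows the stochastic DDIM step to cancel noise injected by the SDE.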
Note that Thm 1 with λ = 0 agrees with Prop 2; both reproduce the deterministic DDIM but with different derivations. In summary, DDIMs can be derived by utilizing local approximations.

3.3. JUSTIFICATION OF THE DIRAC APPROXIMATION

While Prop 1 and Prop 3 require the strong assumption that the data distribution is a Dirac, the DDIMs in Prop 2 and Thm 1 work very effectively on realistic datasets, which may contain millions of datapoints (Nichol et al., 2021). Here we present one possible interpretation based on the manifold hypothesis (Roweis & Saul, 2000). It is believed that real-world data lie on a low-dimensional manifold (Tenenbaum et al., 2000) embedded in a high-dimensional space and that the data points are well separated in the high-dimensional data space. For example, realistic images are scattered in pixel space, and the distance between two images can be very large when measured in pixel differences even if they are perceptually similar. To model this property, we consider a dataset consisting of M datapoints {u^{(m)}}_{m=1}^M. The exact score is

∇ log p_t(u) = Σ_m w_m ∇ log p_{0t}(u|u^{(m)}),  w_m = p_{0t}(u|u^{(m)}) / Σ_{m'} p_{0t}(u|u^{(m')}),   (15)

which can be interpreted as a weighted sum of M score functions associated with Dirac distributions. This is illustrated in Fig. 2. In the red region, the weights {w_m} are dominated by one specific datapoint u^{(m*)}, and thus ∇ log p_t(u) ≈ ∇ log p_{0t}(u|u^{(m*)}). Moreover, in the green region, different modes have similar ∇ log p_{0t}(u|u^{(m)}), as all of them are close to Gaussian; the score can be approximated by the conditional score of any mode. The {ε_GT(u(t), t)} trajectories in Fig. 2 validate our hypothesis, as the curves are very smooth at the beginning and ending periods. The phenomenon that the score of a realistic dataset can be locally approximated by the score of one datapoint partially justifies the Dirac distribution assumption in Prop 1 and 3 and the effectiveness of DDIMs.
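The mixture structure of Eq. (15) can be sketched as follows (an illustrative toy, with schedule α_t = exp(−t)): the exact score is a softmax-weighted combination of per-datapoint conditional scores, and for well-separated data with moderate noise the weights concentrate on the nearest datapoint, so the mixture score is locally a single-Dirac score.

```python
import numpy as np

def mixture_score(u, t, data, alpha):
    """Eq. (15) for a discrete dataset. data: (M, d) array of datapoints.
    Returns (score at u, softmax weights w_m)."""
    a = alpha(t)
    log_p = -np.sum((u - np.sqrt(a) * data) ** 2, axis=1) / (2.0 * (1.0 - a))
    w = np.exp(log_p - log_p.max())  # stable softmax over datapoints
    w /= w.sum()
    cond = -(u - np.sqrt(a) * data) / (1.0 - a)  # per-datapoint conditional scores
    return w @ cond, w
```

Querying the score near one of two far-apart datapoints yields a weight vector essentially equal to a one-hot vector, matching the "red region" discussion above.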

4. GENERALIZE AND IMPROVE DDIM

The DDIM is specifically designed for DDPMs. Can we generalize it to other DMs? With the insights from Prop 1 and 3, it turns out that, with a carefully chosen K_t, we can generalize DDIM to DMs with general drift and diffusion coefficients. We coin the resulting algorithm the generalized DDIM (gDDIM).

4.1. DETERMINISTIC GDDIM WITH PROP 1

Toy dataset: Motivated by Prop 1, we ask whether there exists an ε_GT that remains constant along a solution to the probability flow ODE Eq. (7). We start with a special case with initial distribution p_0(u) = N(u_0, Σ_0). It turns out that any solution to Eq. (7) is of the form

u(t) = Ψ(t, 0) u_0 + R_t ε,   (16)

with a constant ε and time-varying parameterization coefficients R_t ∈ R^{D×D} satisfying

R_0 R_0^T = Σ_0,  dR_t/dt = (F_t + ½ G_t G_t^T Σ_t^{−1}) R_t.   (17)

Here Ψ(t, s) is the transition matrix associated with F_τ; it is the solution to ∂Ψ(t, s)/∂t = F_t Ψ(t, s), Ψ(s, s) = I_D. Interestingly, R_t satisfies R_t R_t^T = Σ_t like K_t in Eq. (4). We remark that K_t = √(1−α_t) I_d is a solution to Eq. (17) when the DM is specialized to DDPM. Based on Eq. (16) and Eq. (17), we extend Prop 1 to more general DMs.

Proposition 4. Assume the data distribution p_0(u) is N(u_0, Σ_0). Let u(t) be an arbitrary solution to the probability flow ODE Eq. (7) with the ground-truth score. Then ε_GT(u(t), t) := −R_t^T ∇ log p_t(u(t)) remains constant along u(t).

Note that Prop 4 is slightly more general than Prop 1 in the sense that the initial distribution p_0 is a Gaussian instead of a Dirac. Diffusion models with augmented states such as CLD use a Gaussian distribution on the velocity channel for each data point; thus, when there is a single data point, the initial distribution is a Gaussian rather than a Dirac distribution. A direct consequence of Prop 4 is that, if K_t in Eq. (4) is set to R_t, we can conduct accurate sampling in one step in the toy example, since we can recover the score along any simulated trajectory given its value at t = T. This choice K_t = R_t makes a huge difference in sampling quality, as we show later, and it guides the design of an efficient sampling scheme for realistic data.
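The remark below Eq. (17) can be checked numerically (a sketch with the illustrative schedule α_t = exp(−t)): for DDPM, F_t = ½ d log α_t/dt, G_t G_t^T = −d log α_t/dt, and Σ_t = 1−α_t, so integrating dR/dt = (F_t + ½ G_t G_t^T Σ_t^{−1}) R_t should reproduce R_t = √(1−α_t). We start the integration at an interior time because the ODE is singular at t = 0 for a Dirac initialization (Σ_0 = 0).

```python
import numpy as np

def alpha(t):
    return np.exp(-t)

def r_rhs(R, t):
    # F_t = -1/2, G_t^2 = 1, Sigma_t = 1 - alpha_t for alpha_t = exp(-t)
    return (-0.5 + 0.5 / (1.0 - alpha(t))) * R

def integrate_R(t0, t1, n=4000):
    """RK4 integration of Eq. (17) for scalar DDPM, from R(t0) = sqrt(1 - alpha_t0)."""
    R, t = np.sqrt(1.0 - alpha(t0)), t0
    h = (t1 - t0) / n
    for _ in range(n):
        k1 = r_rhs(R, t)
        k2 = r_rhs(R + 0.5 * h * k1, t + 0.5 * h)
        k3 = r_rhs(R + 0.5 * h * k2, t + 0.5 * h)
        k4 = r_rhs(R + h * k3, t + h)
        R += h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return R
```

For general (matrix-valued) DMs the same ODE can be integrated with a standard solver, which is how R_t is obtained when no closed form is available.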

Realistic dataset:

As the accurate score is not available for realistic datasets, we use the learned score s_θ(u, t) for sampling. With our new parameterization ε_θ(u, t) = −R_t^T s_θ(u, t) and the approximation ε̃_θ(u, τ) = ε_θ(u(t), t) for τ ∈ [t−∆t, t], we reach the update step of the deterministic gDDIM by solving the probability flow exactly with the approximator ε̃_θ(u, τ):

u(t−∆t) = Ψ(t−∆t, t) u(t) + [ ∫_t^{t−∆t} ½ Ψ(t−∆t, τ) G_τ G_τ^T R_τ^{−T} dτ ] ε_θ(u(t), t).   (18)

Multistep predictor-corrector for ODE: Inspired by Zhang & Chen (2022), we further boost the sampling efficiency of gDDIM by combining Eq. (18) with multistep methods (Hochbruck & Ostermann, 2010; Zhang & Chen, 2022; Liu et al., 2022). We derive multistep predictor-corrector methods to reduce the number of steps while retaining accuracy (Press et al., 2007; Sauer, 2005). Empirically, we find that spending more NFEs in the predictor leads to better performance when the total NFE budget is small; thus, we only present the multistep predictor for the deterministic gDDIM. We include the proof and the multistep corrector in App. B. For a time discretization grid {t_i}_{i=0}^N with t_0 = 0, t_N = T, the q-step predictor from t_i to t_{i−1} in terms of the ε_θ parameterization reads

u(t_{i−1}) = Ψ(t_{i−1}, t_i) u(t_i) + Σ_{j=0}^{q−1} C_{ij} ε_θ(u(t_{i+j}), t_{i+j}),   (19a)

C_{ij} = ∫_{t_i}^{t_{i−1}} ½ Ψ(t_{i−1}, τ) G_τ G_τ^T R_τ^{−T} Π_{k≠j} [ (τ − t_{i+k}) / (t_{i+j} − t_{i+k}) ] dτ.   (19b)

We note that the coefficients in Eqs. (18) and (19b) for general DMs can be calculated efficiently using standard numerical solvers if closed-form solutions are not available.
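As a sketch of that last remark (with the illustrative schedule α = exp(−t), not the paper's implementation): for scalar DDPM, Ψ(s, τ) = √(α_s/α_τ), G_τ² = −d log α_τ/dτ = 1, and R_τ = √(1−α_τ), so the one-step coefficient in Eq. (18) can be computed by simple quadrature and compared against the closed form appearing in Eq. (12).

```python
import numpy as np

def alpha(t):
    return np.exp(-t)

def gddim_coef(t, s, n=200_000):
    """Quadrature for C = \\int_t^s 0.5 Psi(s, tau) G_tau^2 / R_tau dtau
    (trapezoid rule over [s, t], then a sign flip for the reversed limits)."""
    tau = np.linspace(s, t, n)
    f = 0.5 * np.sqrt(alpha(s) / alpha(tau)) / np.sqrt(1.0 - alpha(tau))
    I = np.sum((f[1:] + f[:-1]) * 0.5 * np.diff(tau))
    return -I

def closed_form(t, s):
    # the eps_theta coefficient of the deterministic DDIM step, Eq. (12)
    return np.sqrt(1 - alpha(s)) - np.sqrt(1 - alpha(t)) * np.sqrt(alpha(s) / alpha(t))
```

For non-isotropic DMs the same integrals are matrix-valued, but they depend only on the schedule, so they can be precomputed once for a chosen time grid.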

4.2. STOCHASTIC GDDIM WITH PROP 3

Following the same spirit, we generalize Prop 3.

Proposition 5. Assume the data distribution p_0(u) is N(u_0, Σ_0). Given any evaluation of the score function ∇ log p_s(u(s)), one can recover ∇ log p_t(u) for any t, u as

∇ log p_t(u) = Σ_t^{−1} Ψ(t, s) Σ_s ∇ log p_s(u(s)) − Σ_t^{−1} [u − Ψ(t, s) u(s)].   (20)

Prop 5 is not surprising; in our example, the score has a closed form. Eq. (20) not only provides an accurate score estimation for our toy dataset, but also serves as a score approximator for realistic data.

Realistic dataset: Based on Eq. (20), with the parameterization s_θ(u, τ) = −R_τ^{−T} ε_θ(u, τ), we propose the following gDDIM approximator ε̃_θ(u, τ) for ε_θ(u, τ):

ε̃_θ(u, τ) = R_τ^{−1} Ψ(τ, s) R_s ε_θ(u(s), s) + R_τ^{−1} [u − Ψ(τ, s) u(s)].   (21)

Proposition 6. With the parameterization ε_θ(u, t) = −R_t^T s_θ(u, t) and the approximator ε̃_θ(u, τ) in Eq. (21), the solution to Eq. (6) satisfies

u(t) ∼ N( Ψ(t, s) u(s) + [Ψ̂(t, s) − Ψ(t, s)] R_s ε_θ(u(s), s), P_{st} ),   (22)

where Ψ̂(t, s) is the transition matrix associated with F̂_τ := F_τ + (1+λ²)/2 G_τ G_τ^T Σ_τ^{−1}, and the covariance matrix P_{st} solves

dP_{sτ}/dτ = F̂_τ P_{sτ} + P_{sτ} F̂_τ^T + λ² G_τ G_τ^T,  P_{ss} = 0.   (23)

Our stochastic gDDIM then uses Eq. (22) as the update. Though the stochastic gDDIM and the deterministic gDDIM look quite different from each other, there is a connection between them.

Proposition 7. Eq. (22) in the stochastic gDDIM reduces to Eq. (18) in the deterministic gDDIM when λ = 0.
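The identity in Prop 5 can be verified on a scalar example (a sketch with illustrative constants, not the paper's code): take Gaussian p_0 = N(u0, S0) and the OU forward SDE du = −½ u dt + dw, so that Ψ(t, s) = exp(−(t−s)/2) and Σ_t = exp(−t) S0 + 1 − exp(−t). Eq. (20) then recovers the exact score at any (u, t) from a single evaluation at (u_s, s).

```python
import numpy as np

u0, S0 = 0.8, 0.25  # illustrative Gaussian data distribution N(u0, S0)

def Psi(t, s):
    return np.exp(-0.5 * (t - s))

def Sigma(t):
    # solves dSigma/dt = 2 F Sigma + G^2 = -Sigma + 1, Sigma(0) = S0
    return np.exp(-t) * S0 + 1.0 - np.exp(-t)

def exact_score(u, t):
    return -(u - Psi(t, 0.0) * u0) / Sigma(t)

def recover_score(u, t, u_s, s):
    """Eq. (20): recover the score at (u, t) from one evaluation at (u_s, s)."""
    score_s = exact_score(u_s, s)
    return (Psi(t, s) * Sigma(s) * score_s - (u - Psi(t, s) * u_s)) / Sigma(t)
```

As in Eq. (13), the recovered score stays affine in the query state u, which is the property the stochastic gDDIM approximator Eq. (21) inherits.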

5. EXPERIMENTS

As gDDIM reduces to DDIM for VPSDE, and DDIM has already proven very successful there, we validate the generality and effectiveness of gDDIM on CLD and BDM. We design experiments to answer the following questions: How can Prop 4 and 5 be verified empirically? Can gDDIM improve sampling efficiency compared with existing works? What difference do the choices of λ and K_t make? We conduct experiments with different DMs and sampling algorithms on CIFAR10 for quantitative comparison. We include more illustrative experiments on toy datasets, high-dimensional image datasets, and more baseline comparisons in App. C.

Choice of K_t:

A key ingredient of gDDIM is the special choice K_t = R_t, obtained by solving Eq. (17). In CLD, Dockhorn et al. (2021) choose K_t = L_t based on the Cholesky decomposition of Σ_t, which does not obey Eq. (17). More details regarding L_t are included in App. C. As shown in Fig. 1, on a real dataset with a trained score model, we randomly pick pixel locations and check the pixel value and the ε_θ output along solutions to the probability flow ODE produced by a high-resolution ODE solver. With the choice K_t = L_t, the v-channel output ε_θ^{(L)}(u, t) oscillates over time, similarly to the pixel value itself. In contrast, ε_θ^{(R)}(u, t) is much flatter. We further compare samples generated with the L_t and R_t parameterizations in Tab. 1, where both use the multistep exponential solver in Eq. (19).

Choice of λ: We further conduct a study with different λ values. Note that the polynomial extrapolation in Eq. (19) is not used here, even when λ = 0. As shown in Tab. 2, increasing λ deteriorates the sample quality, supporting our claim that deterministic DDIM outperforms its stochastic counterpart when a small NFE is used. We also find that stochastic gDDIM significantly outperforms EM, which indicates the effectiveness of the approximation Eq. (21).

Accelerating various DMs:

We present a comparison among various DMs and various sampling algorithms. To make a fair comparison, we compare three DMs with similar-size networks while retaining other hyperparameters from their original works. We make two modifications to DDPM: continuous-time training (Song et al., 2020b) and a smaller stop sampling time (Karras et al., 2022).

B PROOFS

Since the proposed gDDIM is a generalization of DDIM, the results regarding gDDIM in Sec. 4 generalize those in Sec. 3. In particular, Prop 4 generalizes Prop 1, Eq. (18) generalizes Prop 2, and Prop 5 generalizes Prop 3. Thus, for simplicity, we mainly present proofs for gDDIM in Sec. 4.

B.1 BLURRING DIFFUSION MODELS

We first review the formulation of BDM proposed in Hoogeboom & Salimans (2022) and show that it can be reformulated as a continuous-time SDE, Eq. (11). Hoogeboom & Salimans (2022) introduce a forward noising scheme in which noise corrupts the data in frequency space with a different schedule per dimension. Different from existing DMs, the diffusion process is defined in frequency space:

p(y_t | y_0) = N(y_t | α_t y_0, σ_t²),  y_t = V^T x_t,   (24)

where α_t, σ_t are diagonal matrices in R^{d×d} that control the diffusion rate of the data along different dimensions, and y_t is the image of x_t in the frequency domain obtained through the discrete cosine transform (DCT) V^T. We note that V is the inverse DCT and V^T V = I. Based on Eq. (24), we can derive the corresponding noising scheme in the original data space:

p(x_t | x_0) = N(x_t | V α_t V^T x_0, V σ_t² V^T).   (25)

Eq. (25) indicates that BDM is a non-isotropic diffusion process. We can therefore express its forward process as a linear SDE. For a general linear SDE Eq. (1), the mean and covariance follow

dm_t/dt = F_t m_t,   (26)
dΣ_t/dt = F_t Σ_t + Σ_t F_t^T + G_t G_t^T.   (27)

Plugging Eq. (25) into Eqs. (26) and (27), we can derive the drift and diffusion coefficients F_t, G_t of BDM as in Eq. (11). As BDM admits an SDE formulation, we can use Eq. (3) to train it. We include more details on the choice of the hyperparameters α_t, σ_t and the practical implementation in App. C.
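The drift identification can be sketched numerically (illustrative choices throughout: the orthonormal DCT-II matrix and the per-frequency schedule α_k(t) = exp(−(1+k)t) are ours, not the paper's): with a diagonal frequency-space schedule, F_t = V diag(d log α_k/dt) V^T, and the mean m_t = V α_t V^T x_0 of Eq. (25) should satisfy dm/dt = F_t m_t, i.e., Eq. (26).

```python
import numpy as np

N = 8

def dct_matrix(n):
    # orthonormal DCT-II: rows indexed by frequency k, columns by position j
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * j + 1) * k / (2 * n))
    C[0] /= np.sqrt(2.0)
    return C  # this is V^T; V = C.T

VT = dct_matrix(N)
V = VT.T
rates = -(1.0 + np.arange(N))  # d log(alpha_k)/dt, constant for this schedule

def mean(x0, t):
    # m_t = V alpha_t V^T x0 with alpha_k(t) = exp(rates_k * t)
    return V @ (np.exp(rates * t) * (VT @ x0))

def F(t):
    # F_t = V diag(d log alpha/dt) V^T; constant in t for this schedule
    return V @ np.diag(rates) @ VT
```

Higher frequencies decay faster here, mimicking the blur-like behavior of BDM; the covariance identity Eq. (27) can be checked the same way to obtain G_t G_t^T.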

B.2 DETERMINISTIC GDDIM

B.2.1 PROOF OF EQ. (17)

Since we assume the data distribution p_0(u) = N(u_0, Σ_0), the score has the closed form

∇ log p_t(u) = −Σ_t^{−1} (u − Ψ(t, 0) u_0).   (28)

To make sure our construction Eq. (16) is a solution to the probability flow ODE, we examine the condition on R_t. The LHS of the probability flow ODE is

du = d[Ψ(t, 0) u_0 + R_t ε] = Ψ̇(t, 0) u_0 dt + Ṙ_t ε dt = [F_t Ψ(t, 0) u_0 + Ṙ_t ε] dt.   (29)

The RHS of the probability flow ODE is

[F_t u − ½ G_t G_t^T ∇ log p_t(u)] dt = [F_t Ψ(t, 0) u_0 + F_t R_t ε + ½ G_t G_t^T R_t^{−T} ε] dt   (30)
= [F_t Ψ(t, 0) u_0 + F_t R_t ε + ½ G_t G_t^T R_t^{−T} R_t^{−1} R_t ε] dt,

where the first equality is due to ∇ log p_t(u) = −R_t^{−T} ε. Since Eqs. (29) and (30) hold for every ε, we establish

Ṙ_t = (F_t + ½ G_t G_t^T R_t^{−T} R_t^{−1}) R_t   (31)
    = (F_t + ½ G_t G_t^T Σ_t^{−1}) R_t.   (32)

B.2.2 PROOF OF PROP 4

Similar to the proof of Eq. (17), along a solution {u(t)} of the probability flow ODE, R_t^{−1} (u(t) − Ψ(t, 0) u_0) is constant. Furthermore, by Eq. (28),

∇ log p_t(u(t)) = −Σ_t^{−1} (u(t) − Ψ(t, 0) u_0) = −R_t^{−T} R_t^{−1} (u(t) − Ψ(t, 0) u_0).   (33)

Eq. (33) implies that −R_t^T ∇ log p_t(u(t)) = R_t^{−1} (u(t) − Ψ(t, 0) u_0) is constant, i.e., invariant with respect to t.

B.2.3 PROOF OF EQS. (12) AND (18)

We derive Eq. (18) first. The update step is based on the approximation ε̃_θ(u, τ) = ε_θ(u(t), t) for τ ∈ [t−∆t, t]. The resulting ODE with ε̃_θ reads

du/dτ = F_τ u + ½ G_τ G_τ^T R_τ^{−T} ε_θ(u(t), t),   (34)

which is a linear ODE. The closed-form solution reads

u(t−∆t) = Ψ(t−∆t, t) u(t) + [ ∫_t^{t−∆t} ½ Ψ(t−∆t, τ) G_τ G_τ^T R_τ^{−T} dτ ] ε_θ(u(t), t),   (35)

where Ψ(t, s) is the transition matrix associated with F_t, that is, Ψ satisfies

dΨ(t, s)/dt = F_t Ψ(t, s),  Ψ(s, s) = I_D.   (36)

When the DM is specialized to DDPM, we obtain Eq. (12) from Eq. (18) by expanding the coefficients explicitly:

Ψ(t, s) = √(α_t/α_s),  Ψ(t−∆t, t) = √(α_{t−∆t}/α_t),

∫_t^{t−∆t} ½ Ψ(t−∆t, τ) G_τ G_τ^T R_τ^{−T} dτ = ∫_t^{t−∆t} −½ √(α_{t−∆t}/α_τ) (d log α_τ/dτ) (1/√(1−α_τ)) dτ
= √(α_{t−∆t}) [ √((1−α_τ)/α_τ) ]_{τ=t}^{τ=t−∆t}
= √(1−α_{t−∆t}) − √(1−α_t) √(α_{t−∆t}/α_t).

B.2.4 PROOF OF MULTISTEP PREDICTOR-CORRECTOR

Our Multistep Predictor-Corrector method slightly extends the traditional linear multistep Predictor-Corrector method to incorporate the semilinear structure of the probability flow ODE via an exponential integrator (Press et al., 2007; Hochbruck & Ostermann, 2010).

Predictor: For Eq. (7), the key insight of the multistep predictor is to use the existing function evaluations $\epsilon_\theta(u(t_i), t_i), \epsilon_\theta(u(t_{i+1}), t_{i+1}), \cdots, \epsilon_\theta(u(t_{i+q-1}), t_{i+q-1})$ and their timestamps $t_i, t_{i+1}, \cdots, t_{i+q-1}$ to fit a degree-$(q-1)$ polynomial $\epsilon_p(t)$ that approximates $\epsilon_\theta(u(\tau), \tau)$. With this approximator $\hat\epsilon_\theta(u,\tau) = \epsilon_p(\tau)$ for $\tau \in [t_{i-1}, t_i]$, the multistep predictor step is obtained by solving
$$\frac{du}{dt} = F_\tau u + \tfrac{1}{2} G_\tau G_\tau^T R_\tau^{-T} \hat\epsilon_\theta(u,\tau) = F_\tau u + \tfrac{1}{2} G_\tau G_\tau^T R_\tau^{-T} \epsilon_p(\tau), \tag{37}$$
which is a linear ODE. The solution to Eq. (37) satisfies
$$u(t_{i-1}) = \Psi(t_{i-1}, t_i) u(t_i) + \int_{t_i}^{t_{i-1}} \tfrac{1}{2}\Psi(t_{i-1},\tau) G_\tau G_\tau^T R_\tau^{-T} \epsilon_p(\tau)\, d\tau. \tag{38}$$
By the Lagrange interpolation formula, we can write $\epsilon_p(\tau)$ as
$$\epsilon_p(\tau) = \sum_{j=0}^{q-1} \Big[\prod_{k \neq j} \frac{\tau - t_{i+k}}{t_{i+j} - t_{i+k}}\Big] \epsilon_\theta(u(t_{i+j}), t_{i+j}). \tag{39}$$
Plugging Eq. (39) into Eq. (38), we obtain
$$u(t_{i-1}) = \Psi(t_{i-1}, t_i) u(t_i) + \sum_{j=0}^{q-1} {}^{p}C^{(q)}_{ij}\, \epsilon_\theta(u(t_{i+j}), t_{i+j}), \tag{40}$$
$${}^{p}C^{(q)}_{ij} = \int_{t_i}^{t_{i-1}} \tfrac{1}{2}\Psi(t_{i-1},\tau) G_\tau G_\tau^T R_\tau^{-T} \prod_{k=0,\, k\neq j}^{q-1} \frac{\tau - t_{i+k}}{t_{i+j} - t_{i+k}}\, d\tau, \tag{41}$$
which are Eqs. (19a) and (19b). We write ${}^{p}C^{(q)}_{ij}$ to emphasize that these are the constants used in the $q$-step predictor. The 1-step predictor reduces to Eq. (18).
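Eqs. (40) and (41) reduce coefficient computation to integrals of a weight function against Lagrange bases. The scalar sketch below (with a hypothetical stand-in weight in place of $\tfrac{1}{2}\Psi G G^T R^{-T}$; names and values are illustrative) computes such coefficients by trapezoidal quadrature and checks the partition-of-unity property: the $q$ coefficients sum to the one-step integral of Eq. (18):

```python
import numpy as np

def trapezoid(y, x):
    # simple trapezoidal rule (avoids relying on np.trapz vs np.trapezoid)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def lagrange_basis(j, ts, tau):
    # l_j(tau) = prod_{k != j} (tau - t_k) / (t_j - t_k)
    val = np.ones_like(tau)
    for k, tk in enumerate(ts):
        if k != j:
            val *= (tau - tk) / (ts[j] - tk)
    return val

def predictor_coeffs(t_next, t_cur, ts_hist, weight, n_grid=10_000):
    """C_j ~ Eq. (19b): integral of weight(tau) * l_j(tau) from t_cur to t_next,
    where weight stands in for 0.5 * Psi(t_next, tau) G G^T R^{-T} (scalar sketch)."""
    tau = np.linspace(t_cur, t_next, n_grid)
    return [trapezoid(weight(tau) * lagrange_basis(j, ts_hist, tau), tau)
            for j in range(len(ts_hist))]

# The Lagrange bases sum to one, so the q coefficients must sum to the plain
# one-step integral of the weight (the 1-step predictor of Eq. (18)).
ts_hist = np.array([0.8, 0.9, 1.0])        # t_i, t_{i+1}, t_{i+2}  (q = 3)
w = lambda tau: np.exp(-tau)               # hypothetical stand-in weight
C = predictor_coeffs(0.7, 0.8, ts_hist, w)
grid = np.linspace(0.8, 0.7, 10_000)
ref = trapezoid(w(grid), grid)
print(np.isclose(sum(C), ref))  # True
```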

Corrector:

Compared with the explicit scheme of the multistep predictor, the multistep corrector behaves like an implicit method (Press et al., 2007). Instead of constructing $\epsilon_p(\tau)$ to extrapolate the model output for $\tau \in [t_{i-1}, t_i]$ as in the predictor, the $q$-step corrector seeks $\epsilon_c(\tau)$ interpolating $\epsilon_\theta(u(t_{i-1}), t_{i-1}), \epsilon_\theta(u(t_i), t_i), \epsilon_\theta(u(t_{i+1}), t_{i+1}), \cdots, \epsilon_\theta(u(t_{i+q-2}), t_{i+q-2})$ at the timestamps $t_{i-1}, t_i, t_{i+1}, \cdots, t_{i+q-2}$. Thus, $u(t_{i-1})$ is obtained by solving
$$u(t_{i-1}) = \Psi(t_{i-1}, t_i) u(t_i) + \int_{t_i}^{t_{i-1}} \tfrac{1}{2}\Psi(t_{i-1},\tau) G_\tau G_\tau^T R_\tau^{-T} \epsilon_c(\tau)\, d\tau. \tag{42}$$
Since $\epsilon_c(\tau)$ is defined implicitly through $u(t_{i-1})$, it is not easy to find $\epsilon_c(\tau)$ and $u(t_{i-1})$ jointly. Instead, practitioners bypass the difficulty by interpolating $\epsilon_\theta(\bar{u}(t_{i-1}), t_{i-1}), \epsilon_\theta(\bar{u}(t_i), t_i), \epsilon_\theta(\bar{u}(t_{i+1}), t_{i+1}), \cdots, \epsilon_\theta(\bar{u}(t_{i+q-2}), t_{i+q-2})$, where $\bar{u}(t_{i-1})$ is obtained from the predictor in Eq. (38) and $\bar{u}(t_i) = u(t_i), \bar{u}(t_{i+1}) = u(t_{i+1}), \cdots, \bar{u}(t_{i+q-2}) = u(t_{i+q-2})$. Hence, we derive the update step for the corrector from
$$u(t_{i-1}) = \Psi(t_{i-1}, t_i) u(t_i) + \int_{t_i}^{t_{i-1}} \tfrac{1}{2}\Psi(t_{i-1},\tau) G_\tau G_\tau^T R_\tau^{-T} \epsilon_c(\tau)\, d\tau, \tag{43}$$
where $\epsilon_c(\tau)$ is defined as
$$\epsilon_c(\tau) = \sum_{j=-1}^{q-2} \Big[\prod_{k \neq j} \frac{\tau - t_{i+k}}{t_{i+j} - t_{i+k}}\Big] \epsilon_\theta(\bar{u}(t_{i+j}), t_{i+j}). \tag{44}$$
Plugging Eq. (44) into Eq. (43), we reach the update step for the corrector
$$u(t_{i-1}) = \Psi(t_{i-1}, t_i) u(t_i) + \sum_{j=-1}^{q-2} {}^{c}C^{(q)}_{ij}\, \epsilon_\theta(\bar{u}(t_{i+j}), t_{i+j}), \tag{45}$$
$${}^{c}C^{(q)}_{ij} = \int_{t_i}^{t_{i-1}} \tfrac{1}{2}\Psi(t_{i-1},\tau) G_\tau G_\tau^T R_\tau^{-T} \prod_{k=-1,\, k\neq j}^{q-2} \frac{\tau - t_{i+k}}{t_{i+j} - t_{i+k}}\, d\tau. \tag{46}$$
We write ${}^{c}C^{(q)}_{ij}$ to emphasize that these are the constants used in the $q$-step corrector.

Exponential multistep Predictor-Corrector:

Here we present the Exponential multistep Predictor-Corrector algorithm. Specifically, we employ one $q$-step corrector update after each update of the $q$-step predictor. The interested reader can easily extend the idea to multiple corrector updates or to different numbers of steps for the predictor and the corrector. We note that the coefficients ${}^{p}C, {}^{c}C$ can be calculated once with a high-resolution ODE solver and reused everywhere.

Published as a conference paper at ICLR 2023

Lemma 1.
$$\int_s^t \Psi(t,\tau)\, \frac{1+\lambda^2}{2} G_\tau G_\tau^T \Sigma_\tau^{-1}\, \hat\Psi(\tau,s)\, d\tau = \hat\Psi(t,s) - \Psi(t,s).$$
Proof. For a fixed $s$, we define $N(t) = \int_s^t \Psi(t,\tau) \frac{1+\lambda^2}{2} G_\tau G_\tau^T \Sigma_\tau^{-1} \hat\Psi(\tau,s)\, d\tau$ and $M(t) = \hat\Psi(t,s) - \Psi(t,s)$. It follows that
$$\frac{dN(\tau)}{d\tau} = F_\tau N(\tau) + \frac{1+\lambda^2}{2} G_\tau G_\tau^T \Sigma_\tau^{-1} \hat\Psi(\tau,s), \tag{54}$$
$$\frac{dM(\tau)}{d\tau} = \hat F_\tau \hat\Psi(\tau,s) - F_\tau \Psi(\tau,s) = F_\tau M(\tau) + \frac{1+\lambda^2}{2} G_\tau G_\tau^T \Sigma_\tau^{-1} \hat\Psi(\tau,s), \tag{55}$$
where $\hat F_\tau = F_\tau + \frac{1+\lambda^2}{2} G_\tau G_\tau^T \Sigma_\tau^{-1}$. Define $E(t) = N(t) - M(t)$; subtracting the two equations gives $\frac{dE(t)}{dt} = F_t E(t)$. On the other hand, $N(s) = M(s) = 0$, which implies $E(s) = 0$. We thus conclude $E(t) \equiv 0$ and $N(t) = M(t)$.

Using Lemma 1, we simplify Eq. (52) to
$$\Psi(t,s) u(s) + [\hat\Psi(t,s) - \Psi(t,s)] R_s\, \epsilon_\theta(u(s), s),$$
which is the mean in Eq. (22).

B.3.3 PROOF OF THM 1

We restate the conclusion of Thm 1. The exact solution $u(t-\Delta t)$ to Eq. (6) with coefficients Eq. (8) is
$$u(t-\Delta t) \sim \mathcal{N}\Big(\sqrt{\tfrac{\alpha_{t-\Delta t}}{\alpha_t}}\, u(t) + \Big(-\sqrt{\tfrac{\alpha_{t-\Delta t}}{\alpha_t}}\sqrt{1-\alpha_t} + \sqrt{1-\alpha_{t-\Delta t}-\sigma_t^2}\Big)\epsilon_\theta(u(t), t),\ \sigma_t^2 I_d\Big)$$
with
$$\sigma_t^2 = (1-\alpha_{t-\Delta t})\Big[1 - \Big(\tfrac{1-\alpha_{t-\Delta t}}{1-\alpha_t}\Big)^{\lambda^2}\Big(\tfrac{\alpha_t}{\alpha_{t-\Delta t}}\Big)^{\lambda^2}\Big],$$
which is the same as the DDIM Eq. (9). Thm 1 is a concrete application of Eq. (22) when the DM is a DDPM and $F_\tau, G_\tau$ are set to Eq. (8). Thanks to the special form of $F_\tau$, $\Psi$ has the expression $\Psi(t,s) = \sqrt{\alpha_t/\alpha_s}\, I_d$, and $\hat\Psi$ satisfies
$$\log \hat\Psi(t,s) = \int_s^t \Big[\frac{1}{2}\frac{d\log\alpha_\tau}{d\tau} - \frac{1+\lambda^2}{2}\frac{d\log\alpha_\tau}{d\tau}\frac{1}{1-\alpha_\tau}\Big] d\tau \implies \hat\Psi(t,s) = \Big(\frac{1-\alpha_t}{1-\alpha_s}\Big)^{\frac{1+\lambda^2}{2}}\Big(\frac{\alpha_s}{\alpha_t}\Big)^{\frac{\lambda^2}{2}}. \tag{61}$$
Mean: Based on Eq. (22), we obtain the mean of $\hat p_{st}$ as
$$\sqrt{\frac{\alpha_t}{\alpha_s}}\, u(s) + \Big[-\sqrt{\frac{\alpha_t}{\alpha_s}}\sqrt{1-\alpha_s} + \Big(\frac{1-\alpha_t}{1-\alpha_s}\Big)^{\frac{1+\lambda^2}{2}}\Big(\frac{\alpha_s}{\alpha_t}\Big)^{\frac{\lambda^2}{2}}\sqrt{1-\alpha_s}\Big]\epsilon_\theta(u(s), s) \tag{62}$$
$$= \sqrt{\frac{\alpha_t}{\alpha_s}}\, u(s) + \Big[-\sqrt{\frac{\alpha_t}{\alpha_s}}\sqrt{1-\alpha_s} + \sqrt{(1-\alpha_t)\Big(\frac{1-\alpha_t}{1-\alpha_s}\Big)^{\lambda^2}\Big(\frac{\alpha_s}{\alpha_t}\Big)^{\lambda^2}}\Big]\epsilon_\theta(u(s), s) \tag{63}$$
$$= \sqrt{\frac{\alpha_t}{\alpha_s}}\, u(s) + \Big[-\sqrt{\frac{\alpha_t}{\alpha_s}}\sqrt{1-\alpha_s} + \sqrt{1-\alpha_t-\sigma_s^2}\Big]\epsilon_\theta(u(s), s), \tag{64}$$
where
$$\sigma_s^2 = (1-\alpha_t)\Big[1 - \Big(\frac{1-\alpha_t}{1-\alpha_s}\Big)^{\lambda^2}\Big(\frac{\alpha_s}{\alpha_t}\Big)^{\lambda^2}\Big]. \tag{65}$$
Setting $(s,t) \leftarrow (t, t-\Delta t)$, we arrive at the mean update in Eq. (14).
Covariance: It follows from
$$\frac{dP_{s\tau}}{d\tau} = 2\Big[\frac{1}{2}\frac{d\log\alpha_\tau}{d\tau} - \frac{1+\lambda^2}{2}\frac{d\log\alpha_\tau}{d\tau}\frac{1}{1-\alpha_\tau}\Big] P_{s\tau} - \lambda^2 \frac{d\log\alpha_\tau}{d\tau} I_d, \quad P_{ss} = 0,$$
that
$$P_{st} = (1-\alpha_t)\Big[1 - \Big(\frac{1-\alpha_t}{1-\alpha_s}\Big)^{\lambda^2}\Big(\frac{\alpha_s}{\alpha_t}\Big)^{\lambda^2}\Big] I_d.$$
Setting $(s,t) \leftarrow (t, t-\Delta t)$, we recover the covariance in Eq. (14).
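The variance formula in Thm 1 can be checked against its two notable endpoints: $\lambda = 0$ gives the deterministic DDIM (zero transition variance), and $\lambda = 1$ recovers the DDPM ancestral posterior variance. A small sketch (the $\bar\alpha$ values are illustrative):

```python
import numpy as np

def ddim_sigma2(lam, abar_t, abar_s):
    """Transition variance of Thm 1 for a step from t to s = t - dt (abar_s > abar_t)."""
    r = ((1 - abar_s) / (1 - abar_t)) ** (lam**2) * (abar_t / abar_s) ** (lam**2)
    return (1 - abar_s) * (1 - r)

abar_t, abar_s = 0.3, 0.7
# lambda = 0: deterministic DDIM, zero transition variance
print(ddim_sigma2(0.0, abar_t, abar_s))  # 0.0
# lambda = 1: recovers the DDPM ancestral posterior variance beta_tilde
beta_tilde = (1 - abar_s) / (1 - abar_t) * (1 - abar_t / abar_s)
print(np.isclose(ddim_sigma2(1.0, abar_t, abar_s), beta_tilde))  # True
```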

B.3.4 PROOF OF PROP 7

When $\lambda = 0$, the update step in Eq. (22) from $s$ to $t$ reads
$$u(t) = \Psi(t,s)u(s) + [\hat\Psi(t,s) - \Psi(t,s)] R_s\, \epsilon_\theta(u(s), s). \tag{66}$$
Meanwhile, the update step in Eq. (18) from $s$ to $t$ is
$$u(t) = \Psi(t,s)u(s) + \Big[\int_s^t \tfrac{1}{2}\Psi(t,\tau) G_\tau G_\tau^T R_\tau^{-T}\, d\tau\Big] \epsilon_\theta(u(s), s). \tag{67}$$
Eqs. (66) and (67) are equivalent once we have the following lemma.
Lemma 2. When $\lambda = 0$,
$$\int_s^t \tfrac{1}{2}\Psi(t,\tau) G_\tau G_\tau^T R_\tau^{-T}\, d\tau = [\hat\Psi(t,s) - \Psi(t,s)] R_s.$$
Proof. We introduce two new functions
$$N(t) := \int_s^t \tfrac{1}{2}\Psi(t,\tau) G_\tau G_\tau^T R_\tau^{-T}\, d\tau, \quad M(t) := [\hat\Psi(t,s) - \Psi(t,s)] R_s.$$
First, $N(s) = M(s) = 0$. Second, they satisfy
$$\frac{dN(t)}{dt} = F_t N(t) + \tfrac{1}{2} G_t G_t^T R_t^{-T}, \tag{70}$$
$$\frac{dM(t)}{dt} = \Big[\frac{d\hat\Psi(t,s)}{dt} - F_t\Psi(t,s)\Big] R_s = F_t M(t) + \tfrac{1}{2} G_t G_t^T \Sigma_t^{-1} \hat\Psi(t,s) R_s. \tag{72}$$
Note that $\hat\Psi$ and $R$ satisfy the same linear differential equation:
$$\frac{d\hat\Psi(t,s)}{dt} = [F_t + \tfrac{1}{2} G_t G_t^T \Sigma_t^{-1}]\hat\Psi(t,s), \quad \frac{dR_t}{dt} = [F_t + \tfrac{1}{2} G_t G_t^T \Sigma_t^{-1}] R_t.$$
It is a standard result in linear system theory (see Särkkä & Solin (2019, Eq. (2.34))) that $\hat\Psi(t,s) = R_t R_s^{-1}$. Plugging this and $R_t R_t^T = \Sigma_t$ into Eq. (72) yields
$$\frac{dM(t)}{dt} = F_t M(t) + \tfrac{1}{2} G_t G_t^T R_t^{-T}.$$
Define $E(t) = N(t) - M(t)$; then it satisfies $E(s) = 0$ and $dE(t)/dt = F_t E(t)$, which clearly implies $E(t) \equiv 0$. Thus, $N(t) = M(t)$.

C MORE EXPERIMENT DETAILS

We present the practical implementation of gDDIM and its application to BDM and CLD. We include training details and discuss the computational overhead required to execute gDDIM. More experiments are conducted to verify the effectiveness of gDDIM compared with other sampling algorithms. We report image sampling performance averaged over 3 runs with different random seeds.

C.1 BDM: TRAINING AND SAMPLING

Unfortunately, pre-trained models for BDM are not available. We therefore reproduce the training pipeline of BDM (Hoogeboom & Salimans, 2022) to validate the acceleration of gDDIM. The official pipeline is quite similar to the popular DDPM (Ho et al., 2020); we highlight the main differences and the changes in our implementation. Compared with DDPM, BDM uses a different forward noising scheme, Eqs. (11) and (25). The two key hyperparameter schedules $\{\alpha_t\}, \{\sigma_t\}$ follow the exact setup of Hoogeboom & Salimans (2022), whose details and Python implementation can be found in Appendix A of Hoogeboom & Salimans (2022). In our implementation, we use the Unet network architecture of Song et al. (2020b). We find that our larger Unet improves sample quality: as a comparison, our SDE sampler achieves an FID as low as 2.51, while Hoogeboom & Salimans (2022) reports only 3.17 on CIFAR10.

C.2 CLD: TRAINING AND SAMPLING

For CLD, our training pipeline, model architectures, and hyperparameters are similar to those in Dockhorn et al. (2021). The main differences lie in the choice of $K_t$ and the loss weights $K_t^{-1}\Lambda_t K_t^{-T}$. Denote by $\epsilon_\theta(u,t) = [\epsilon_\theta(u,t;x), \epsilon_\theta(u,t;v)]$ the corresponding model parameterization. The authors of Dockhorn et al. (2021) originally propose the parameterization $s_\theta(u,t) = -L_t^{-T}\epsilon_\theta(u,t)$, where $\Sigma_t = L_t L_t^T$ is the Cholesky decomposition of the covariance matrix of $p_{0t}(u(t)|x(0))$. Built on the DSM objective Eq. (3), they propose hybrid score matching (HSM), which is claimed to be advantageous (Dockhorn et al., 2021). It uses the loss
$$\mathbb{E}_{t\sim\mathcal{U}[0,T]}\,\mathbb{E}_{x(0),\, u(t)|x(0)}\big[\|\epsilon - \epsilon_\theta(\mu_t(x_0) + L_t\epsilon, t)\|^2_{L_t^{-1}\Lambda_t L_t^{-T}}\big]. \tag{76}$$
With a similar derivation (Dockhorn et al., 2021), we obtain the HSM loss for our new score parameterization $s_\theta(u,t) = -R_t^{-T}\epsilon_\theta(u,t)$ as
$$\mathbb{E}_{t\sim\mathcal{U}[0,T]}\,\mathbb{E}_{x(0),\, u(t)|x(0)}\big[\|\epsilon - \epsilon_\theta(\mu_t(x_0) + R_t\epsilon, t)\|^2_{R_t^{-1}\Lambda_t R_t^{-T}}\big]. \tag{77}$$
Though Eqs. (76) and (77) look similar, we cannot directly use the pretrained model provided in Dockhorn et al. (2021) for gDDIM. Due to the lower triangular structure of $L_t$ and the special $G_t$, the solution to Eq. (6) relies only on $\epsilon_\theta(u,t;v)$, and thus only $\epsilon_\theta(u,t;v)$ is learned in Dockhorn et al. (2021) via a special choice of $\Lambda_t$. In contrast, in our new parameterization, both $\epsilon_\theta(u,t;x)$ and $\epsilon_\theta(u,t;v)$ are needed to solve Eq. (6). To train the score model for gDDIM, we set $R_t^{-1}\Lambda_t R_t^{-T} = I$ for simplicity, similar to the choice made in Ho et al. (2020). Our weight choice yields reasonable performance, and we leave possible improvements, such as the mixed score (Dockhorn et al., 2021) or better $\Lambda_t$ weights (Song et al., 2021), for future work. Though we require a different training scheme for the score model compared with Dockhorn et al. (2021), the modifications to the training pipeline and the extra costs are almost negligible: we change from $K_t = L_t$ to $K_t = R_t$.
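The weight choice $R_t^{-1}\Lambda_t R_t^{-T} = I$ is equivalent to $\Lambda_t = R_t R_t^T = \Sigma_t$, under which the weighted HSM norm collapses to a plain squared error. A small sketch with a hypothetical $2\times 2$ $R_t$ (the CLD weights reduce to $2\times 2$ blocks; the matrix entries are illustrative):

```python
import numpy as np

def weighted_sq_norm(v, W):
    # ||v||_W^2 = v^T W v
    return float(v @ W @ v)

rng = np.random.default_rng(1)
R = np.array([[1.0, 0.3],          # hypothetical 2x2 R_t (illustrative entries)
              [0.2, 0.8]])
resid = rng.normal(size=2)         # plays the role of eps - eps_theta

# R^{-1} Lambda R^{-T} = I  <=>  Lambda = R R^T (= Sigma_t); then the weighted
# HSM loss term equals the unweighted squared error.
Lam = R @ R.T
Rinv = np.linalg.inv(R)
W = Rinv @ Lam @ Rinv.T
print(np.isclose(weighted_sq_norm(resid, W), float(resid @ resid)))  # True
```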
Unlike $L_t$, which has a triangular structure and the closed-form expression (Dockhorn et al., 2021)
$$L_t = \begin{pmatrix} \sqrt{\Sigma_t^{xx}} & 0 \\ \Sigma_t^{xv}/\sqrt{\Sigma_t^{xx}} & \sqrt{(\Sigma_t^{xx}\Sigma_t^{vv} - (\Sigma_t^{xv})^2)/\Sigma_t^{xx}} \end{pmatrix}, \quad \text{with } \Sigma_t = \begin{pmatrix} \Sigma_t^{xx} & \Sigma_t^{xv} \\ \Sigma_t^{xv} & \Sigma_t^{vv} \end{pmatrix},$$
we rely on an accurate numerical solver to compute $R_t$. The triangular structure of $L_t$ and the sparsity pattern of $G_t$ for CLD in Eq. (10) also affect the loss weights: Dockhorn et al. (2021) effectively use
$$L_t^{-1}\Lambda_t L_t^{-T} = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} \otimes I_d, \quad \text{while we choose } R_t^{-1}\Lambda_t R_t^{-T} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \otimes I_d.$$
As a result, we need to double the channels of the output layer in our new parameterization associated with $R_t$, though the increased number of parameters in the last layer is negligible compared with the rest of the diffusion model. We include the model architectures and hyperparameters in Tab. 4. In addition to the standard-size model on CIFAR10, we also train a smaller model on CELEBA to show the efficacy and advantages of gDDIM.

C.3 NUMERICAL CALCULATION OF COEFFICIENTS

In gDDIM, many coefficients cannot be obtained in closed form. Here we present our approach to obtaining them numerically. These constant coefficients fall into two categories: solutions to ODEs and definite integrals. We remark that these coefficients only need to be calculated once and can then be reused everywhere. For CLD, each coefficient corresponds to a $2\times 2$ matrix, and the calculation of all of them can be done within 1 min.

Type I: Solving ODEs. This problem appears when we need to evaluate $R_t$ in Eq. (17) and $\hat\Psi(t,s)$ in
$$\frac{d\hat\Psi(t,s)}{dt} = \hat F_t\, \hat\Psi(t,s), \quad \hat\Psi(s,s) = I. \tag{81}$$
Across our experiments, we use RK4 with step size $10^{-6}$ to compute the ODE solutions. For $\hat\Psi(t,s)$, we only need to calculate $\hat\Psi(t,0)$ because $\hat\Psi(t,s) = \hat\Psi(t,0)[\hat\Psi(s,0)]^{-1}$. In CLD, $F_t, G_t, R_t, \Sigma_t$ can be reduced to $2\times 2$ matrices; solving the ODE with a small step size is extremely fast. We note that $\Psi(t,s)$ and $\Sigma_t$ admit closed-form formulas (Dockhorn et al., 2021). Since the output of the numerical solver is discrete in time, we employ linear interpolation to handle queries in continuous time.
Since $R_t$ and $\hat\Psi$ are determined by the forward SDE of the DM, the numerical results can be shared across samplers. In stochastic gDDIM Eq. (22), we apply the same techniques to solve for $P_{st}$. In BDM, $F_t$ can be simplified to a matrix whose shape aligns with the spatial shape of the images. Since the drift coefficient of BDM is a diagonal matrix, we can decompose the matrix ODE into many one-dimensional ODEs. Thanks to parallel computation on GPUs, solving many one-dimensional ODEs is efficient.
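The Type I computation can be sketched as follows, with a hypothetical $2\times 2$ time-varying drift standing in for the CLD drift and a coarser RK4 step than the $10^{-6}$ used in the paper; the sketch also verifies the reduction $\hat\Psi(t,s) = \hat\Psi(t,0)[\hat\Psi(s,0)]^{-1}$:

```python
import numpy as np

def F(t):
    # hypothetical 2x2 time-varying drift (illustrative, stands in for CLD)
    return np.array([[0.0, 1.0], [-1.0 - t, -0.5]])

def transition(t_end, t_start, h=1e-4):
    """RK4 solve of dPsi/dt = F(t) Psi with Psi(t_start) = I."""
    Psi, t = np.eye(2), t_start
    n = round((t_end - t_start) / h)
    for _ in range(n):
        k1 = F(t) @ Psi
        k2 = F(t + h/2) @ (Psi + h/2 * k1)
        k3 = F(t + h/2) @ (Psi + h/2 * k2)
        k4 = F(t + h) @ (Psi + h * k3)
        Psi, t = Psi + h/6 * (k1 + 2*k2 + 2*k3 + k4), t + h
    return Psi

# Only Psi(t, 0) needs to be tabulated: Psi(t, s) = Psi(t, 0) Psi(s, 0)^{-1}
lhs = transition(1.0, 0.5)
rhs = transition(1.0, 0.0) @ np.linalg.inv(transition(0.5, 0.0))
print(np.allclose(lhs, rhs, atol=1e-8))  # True
```

In practice, the solver output on a time grid is paired with linear interpolation for continuous-time queries, as described above.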

Type II: Definite integrals

This problem appears in the derivation of the update steps in Eqs. (18), (19b) and (46), which require coefficients such as ${}^{p}C^{(q)}_{ij}$ and ${}^{c}C^{(q)}_{ij}$. We use step size $10^{-5}$ for the integration from $t$ to $t - \Delta t$. The integrand can be evaluated efficiently in parallel on GPUs. Again, coefficients such as ${}^{p}C^{(q)}_{ij}, {}^{c}C^{(q)}_{ij}$ are calculated once and reused whenever we sample another batch with the same time discretization.

C.4 GDDIM WITH OTHER DIFFUSION MODELS

Though we only test gDDIM on several existing diffusion models, namely DDPM, BDM, and CLD, gDDIM can be applied to any other pre-trained diffusion model as long as the full score function is available. In the following, we list the key procedures for integrating the gDDIM sampler into general diffusion models. The integration consists of two stages: offline preparation of gDDIM (Stage I) and online execution of gDDIM (Stage II).

Stage I: Offline preparation of gDDIM. Preparation includes scheduling timestamps and calculating the coefficients needed to execute gDDIM.
Step 1: Determine an increasing time sequence T = {t_i}.
Step 2: Obtain $\hat\Psi(t,s)$ by solving the ODE Eq. (81).
Step 3: Calculate $R_t$ by solving the ODE Eq. (17).
Step 4: Obtain ${}^{p}C^{(q)}_{ij}, {}^{c}C^{(q)}_{ij}$ by applying definite-integral solvers to Eqs. (18), (19b) and (46).
How to solve the ODEs Eqs. (17) and (81) and the definite integrals Eqs. (18), (19b) and (46) is discussed in App. C.3.

Stage II: Online execution of gDDIM. Stage II employs high-order EI-based ODE solvers for Eq. (82) with $K_t = R_t$. We include pseudo-code for simulating the EI-multistep solvers in Algo 1. It mainly uses the updates Eq. (40) and Eq. (45).
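The two stages can be illustrated end-to-end in the scalar Gaussian case, where the Stage I quantities have closed forms and, by Prop 4 and Lemma 2, a single $\lambda = 0$ gDDIM step is exact. All constants below are illustrative, not from the released code:

```python
import numpy as np

beta, T, Sigma0 = 2.0, 1.0, 0.25   # illustrative constant-beta VP-SDE, data variance Sigma0

# Stage I quantities (closed forms in the scalar case):
Psi = lambda t, s: np.exp(-0.5 * beta * (t - s))            # transition of F = -beta/2
Sigma = lambda t: 1.0 + (Sigma0 - 1.0) * np.exp(-beta * t)  # marginal variance
R = lambda t: np.sqrt(Sigma(t))                             # scalar case: R_t = sqrt(Sigma_t)

# Forward construction u(T) = Psi(T,0) u0 + R_T eps, with the exact eps known
u0, eps = 1.3, -0.7
uT = Psi(T, 0) * u0 + R(T) * eps

# Stage II: one lambda = 0 gDDIM step from T to 0. By Lemma 2, the integral
# coefficient of Eq. (18) equals (R_0 - Psi(0, T) R_T).
u_rec = Psi(0, T) * uT + (R(0) - Psi(0, T) * R(T)) * eps
print(np.isclose(u_rec, u0 + R(0) * eps))  # True: exact in a single step
```

Because $\epsilon$ is constant along exact solutions (Prop 4), the one-step exponential integrator incurs no discretization error in this Gaussian setting; with a learned score, the error comes only from the score approximation.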

C.5 MORE EXPERIMENTS ON THE CHOICE OF SCORE PARAMETERIZATION

Here we present more experiment details and further experiments regarding Prop 1 and the differences between the two parameterizations based on $R_t$ and $L_t$.

Toy experiments:

Figure 4: Sampling on a challenging 2D example with the exact score $\nabla\log p_t(u)$, where the data distribution is a mixture of Gaussians with small variance. Compared with Euler, algorithms based on the Exponential Integrator (EI) Eq. (82) have much better sampling quality. Among EI-based samplers, different choices of $K_t$ in the score parameterization $\epsilon(u,t) = -K_t^T\nabla\log p_t(u)$ yield different sampling performance when the NFE is small. Clearly, the $R_t$ proposed by gDDIM enjoys better sampling quality given the same NFE budget. [Panels: EI with $K_t = \sqrt{\Sigma_t}$, $K_t = L_t$, $K_t = R_t$, at NFE = 15, 25, 35, 60, 200.]

Here we present more empirical results to demonstrate the advantage of a proper $K_t$. In VPSDE, the advantage of DDIM has been verified on various models and datasets (Song et al., 2020a). To empirically verify Prop 1, we present a toy example where the data distribution is a mixture of two one-dimensional Gaussians. The ground-truth score $\nabla\log p_t(u)$ is known in this example. As shown in Fig. 2, along solutions to the probability flow ODE, the score parameterization $\epsilon(u,t) = -\sqrt{1-\alpha_t}\,\nabla\log p_t(u)$ enjoys a smoothing property. We remark that $R_t = L_t = \sqrt{\Sigma_t} = \sqrt{1-\alpha_t}\, I_d$ in VPSDE, and thus gDDIM coincides with DDIM there. In CLD, we present a similar study. As the covariance $\Sigma_t$ in CLD is no longer diagonal, we find the difference between the $L_t$ parameterization suggested by Dockhorn et al. (2021) and the $R_t$ parameterization is large, as shown in Fig. 3. The oscillation of $\{\epsilon^{(L_t)}\}$ prevents numerical solvers from taking large step sizes and slows down sampling. We also include a more challenging 2D example to further illustrate the difference in Fig. 4.
For a fair comparison, we compare sampling algorithms based on the Exponential Integrator without multistep methods, which read
$$u(t-\Delta t) = \Psi(t-\Delta t, t)u(t) + \Big[\int_t^{t-\Delta t} \tfrac{1}{2}\Psi(t-\Delta t, \tau) G_\tau G_\tau^T K_\tau^{-T}\, d\tau\Big]\epsilon^{(K_t)}(u, t), \tag{82}$$
where $\epsilon^{(K_t)}(u,t) = -K_t^T \nabla\log p_t(u(t))$. Though we have the exact score, sampling with Eq. (82) will not give satisfying results when the NFE is small if we use $K_t = L_t$ rather than $R_t$.

Image experiments:

We present more empirical results comparing $L_t$ and $R_t$. Note that we use the exponential integrator for the $L_t$ parameterization as well, similar to DEIS (Zhang & Chen, 2022). We vary the polynomial order $q$ in the multistep methods (Zhang & Chen, 2022) and test sampling performance on CIFAR10 and CELEBA. On both datasets, we generate 50k images and calculate their FID. As shown in Tabs. 5 and 6, $R_t$ has significant advantages, especially when the NFE is small. We also find that a multistep method with large $q$ can harm sampling performance when the NFE is small. This is reasonable: a method with larger $q$ assumes the nonlinear function is smooth over a large domain and may rely on outdated information for its approximation, which can worsen accuracy.

C.6 MORE EXPERIMENTS ON THE CHOICE OF λ

To study the effects of $\lambda$, we visualize trajectories generated with various $\lambda$ but the same random seeds in Fig. 5 on our toy example. Clearly, trajectories with smaller $\lambda$ are smoother, while trajectories with large $\lambda$ contain much more randomness. From the fast-sampling perspective, trajectories with more stochasticity are much harder to predict with a small NFE than smooth trajectories.



Figure 2: Manifold hypothesis and Dirac distribution assumption. We model an image dataset as a mixture of well-separated Dirac distributions and visualize the diffusion process on the left. Red curves indicate the high-density areas spanned by $p_{0t}(u(t)|u(0))$ for different modes; the region they surround indicates the phase when $p_t(u)$ is dominated by one mode, the region surrounded by the blue curve indicates the mixing phase, and the green region indicates the fully mixed phase. On the right, sampling trajectories depict the smoothness of $\epsilon^{GT}$ along ODE solutions, which justifies the approximations used in DDIM and partially explains its empirical acceleration.

Figure 3: Trajectory and ϵ of Probability Flow solution in CLD

Figure 7: Comparison between L (Upper) and R (Lower) with exponential integrator on CELEBA.

Figure 8: Comparison between EM (Upper) and gDDIM (Lower) on CIFAR10.

L t vs R t (Our) on CLD

For BDM, we note that Hoogeboom & Salimans (2022) only supports the ancestral sampling algorithm, a variant of the EM algorithm. With the noising and denoising processes reformulated as the SDE Eq. (11), we can generate samples by solving the corresponding SDE/ODEs. The sampling quality of gDDIM with 50 NFE outperforms the original ancestral sampler with 1000 NFE, more than a 20-fold acceleration.

Acceleration on various DMs with similar training pipelines and architectures. For RK45, we tune its tolerance hyperparameters so that the actual NFE is close, but not equal, to the given NFE.

The more structural knowledge we leverage, the more efficient the algorithms we obtain. In this work, we provide a clean interpretation of DDIMs based on the manifold hypothesis and the sparsity property of realistic datasets. This new perspective unboxes the numerical discretization used in DDIM and explains the advantage of ODE-based samplers over SDE-based ones when the NFE is small. Based on this interpretation, we extend DDIMs to general diffusion models. The new algorithm, gDDIM, requires only a tiny but elegant modification to the parameterization of the score model and improves sampling efficiency drastically. We conduct extensive experiments to validate the effectiveness of our new sampling algorithm.

MORE RELATED WORKS

Learning generative models with DMs via score matching has received tremendous attention recently (Sohl-Dickstein et al., 2015; Lyu, 2012; Song & Ermon, 2019; Song et al., 2020b; Ho et al., 2020; Nichol & Dhariwal, 2021). However, the sampling efficiency of DMs is still not satisfying. Jolicoeur-Martineau et al. (2021a) introduced adaptive solvers for the SDEs associated with DMs for the task of image generation. Song et al. (2020a) modified the forward noising process into a non-Markov process without changing the training objective. The authors then proposed a family of samplers, including deterministic DDIM and stochastic DDIM, based on these modifications. Both samplers demonstrate significant improvements over previous ones. There are variants of DDIM that aim to further improve sampling quality and efficiency. Deterministic DDIM in fact reduces to the probability flow in the infinitesimal-step-size limit (Song et al., 2020a; Liu et al., 2022).
Meanwhile, various approaches have been proposed to accelerate DDIM (Kong & Ping, 2021b; Watson et al., 2021; Liu et al., 2022). Bao et al. (2022) improved DDIM by optimizing the reverse variance in DMs. Watson et al. (2022) generalized DDIM in DDPM with learned update coefficients, which are trained by minimizing an external perceptual loss. Nichol & Dhariwal (2021) tuned the variance schedule of DDPM. Liu et al. (2022) found that DDIM is a pseudo-numerical method and proposed a pseudo linear multistep method for it. Zhang & Chen (2022) discovered that DDIMs are numerical integrators for marginal-equivalent SDEs, and that deterministic DDIM is actually an exponential integrator for the probability flow ODE. They further utilized exponential multistep methods to boost sampling performance for VPSDE.

The choice of $K_t$ also affects the training loss function of the score model. Due to the special structure of $G_t$, only $s_\theta(u,t;v)$, the score in the velocity channel, needs to be learned. When $K_t = L_t$, $s_\theta(u,t;v)$ can be recovered from $\epsilon_\theta(u,t;v)$ alone via $s_\theta(u,t) = -L_t^{-T}\epsilon_\theta(u,t)$. In contrast, $R_t$ does not share the triangular structure of $L_t$; both $\epsilon_\theta(u,t;x)$ and $\epsilon_\theta(u,t;v)$ are needed to recover $s_\theta(u,t;v)$. Therefore, Dockhorn et al. (2021) sets the loss weights $L_t^{-1}\Lambda_t L_t^{-T} = \mathrm{diag}(0,1)\otimes I_d$.

Model architectures and hyperparameters

[Table, $L_t$ vs $R_t$ on CLD, FID / IS at increasing NFE. Caption fragment: ... in VPSDE, and thus gDDIM is the same as DDIM; the differences among $R_t, L_t, \sqrt{\Sigma_t}$ only appear when $\Sigma_t$ is non-diagonal. Flattened rows: $L_t$: 463.45 / 1.17, 463.32 / 1.17, 240.45 / 2.29; $R_t$: 332.70 / 1.47, 292.31 / 1.70, 13.27 / 10.15, 2.26 / 9.77.]

More experiments on CELEBA

More comparisons on CIFAR10. Reported FIDs may be based on different training techniques and data augmentation, and should not be regarded as the only evidence when comparing different algorithms.

Predictor-only vs Predictor-Corrector (PC). Compared with the Predictor-only approach, PC adds one correcting step after each predicting step except the last. When sampling with N steps, the Predictor-only approach requires N score evaluations, while PC consumes 2N - 1 NFEs.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their useful comments. This work is partially supported by NSF ECCS-1942523, NSF CCF-2008513, and NSF DMS-1847802. 

Algorithm 1 Exponential multistep Predictor-Corrector

Input: timestamps $\{t_i\}_{i=0}^N$, step order $q$, coefficients ${}^{p}C$ for the predictor update, coefficients ${}^{c}C$ for the corrector update
Instantiate: $u(t_N) \sim p_T(u)$
for $i$ in $N, N-1, \cdots, 1$ do
  # predictor update step
  $q_{cur} = \min(q, N-i+1)$  # warm start: use a lower-order multistep method
  $u(t_{i-1}) \leftarrow$ simulate Eq. (40) with the $q_{cur}$-step predictor
  # corrector update step
  $q_{cur} = \min(q, N-i+2)$  # warm start: use a lower-order multistep method
  $\bar{u}(t_{i-1}), \bar{u}(t_i), \cdots, \bar{u}(t_{i+q_{cur}-2}) \leftarrow u(t_{i-1}), u(t_i), \cdots, u(t_{i+q_{cur}-2})$
  $u(t_{i-1}) \leftarrow$ simulate Eq. (45) with the $q_{cur}$-step corrector
end for

B.3.1 PROOF OF EQ. (20)

Assuming that the data distribution $p_0(u)$ is $\mathcal{N}(u_0, \Sigma_0)$ with a given $\Sigma_0$, we can derive the mean $\Psi(t,0)u_0$ and covariance $\Sigma_t$ of $p_t(u)$. Therefore, the ground-truth score reads $\nabla\log p_t(u) = -\Sigma_t^{-1}(u - \Psi(t,0)u_0)$ (Eq. (49)). We assume $\Sigma_0$ is given but $u_0$ is unknown. Fortunately, $u_0$ can be inferred via one score evaluation as follows. Given the evaluation $\nabla\log p_s(u(s))$, we can recover $u_0$ as
$$u_0 = \Psi(0,s)[\Sigma_s \nabla\log p_s(u(s)) + u(s)]. \tag{50}$$
Plugging Eq. (50) and $\Psi(t,s) = \Psi(t,0)\Psi(0,s)$ into Eq. (49), we recover Eq. (20).
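Structurally, Algo 1 is a short loop. The sketch below captures only its control flow (warm-start order selection and predictor-then-corrector updates), with `pred_step` and `corr_step` as assumed callables implementing Eqs. (40) and (45):

```python
def gddim_pc_sampler(u_T, ts, q, pred_step, corr_step):
    """Control-flow skeleton of Algo 1. ts = [t_0, ..., t_N]; pred_step / corr_step
    are assumed callables implementing Eq. (40) / Eq. (45) on the state history."""
    N = len(ts) - 1
    u = {N: u_T}                 # u[i] approximates u(t_i)
    for i in range(N, 0, -1):
        q_cur = min(q, N - i + 1)            # warm start: lower order early on
        u[i - 1] = pred_step(u, i, q_cur)    # predictor update, Eq. (40)
        q_cur = min(q, N - i + 2)
        u[i - 1] = corr_step(u, i, q_cur)    # corrector reuses predicted u[i-1], Eq. (45)
    return u[0]

# Exercise only the control flow: the "predictor" halves the state and the
# "corrector" keeps the predicted value (trivial stand-ins, not real updates).
halve = lambda u, i, q_cur: u[i] * 0.5
keep = lambda u, i, q_cur: u[i - 1]
out = gddim_pc_sampler(1.0, [0.0, 0.25, 0.5, 0.75, 1.0], 3, halve, keep)
print(out)  # 0.0625, i.e. 1.0 * 0.5**4 after N = 4 steps
```

Plugging in trivial callables this way exercises the warm-start logic without any diffusion-specific machinery; in the real sampler, the callables consume the cached coefficients ${}^{p}C, {}^{c}C$.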

B.3.2 PROOF OF PROP 6

With the approximator $\hat\epsilon_\theta(u, \tau)$ defined in Eq. (21) for $\tau \in [s, t]$, Eq. (6) can be reformulated as a linear differential equation in $u$ with drift matrix $\hat F_\tau = F_\tau + \frac{1+\lambda^2}{2} G_\tau G_\tau^T \Sigma_\tau^{-1}$ (Eq. (51)); denote by $\hat\Psi(t,s)$ the transition matrix associated with $\hat F_\tau$. Consequently, the conditional probability $\hat p_{st}(u(t)|u(s))$ associated with Eq. (51) is Gaussian. Applying Särkkä & Solin (2019, Eqs. (6.6, 6.7)), we obtain the exact expression Eq. (52) for the mean of $\hat p_{st}(u(t)|u(s))$; its covariance $P_{s\tau}$ satisfies Eq. (53). Eq. (52) has a closed-form expression with the help of Lemma 1.

We include more qualitative results on the choice of $\lambda$ and the comparison between the Euler-Maruyama (EM) method and gDDIM in Figs. 8 and 9. Clearly, when the NFE is small, increasing $\lambda$ has a negative effect on the sampling quality of gDDIM. We hypothesize that $\lambda = 0$ already generates high-fidelity samples and additional noise may harm sampling performance: with a fixed number of function evaluations, information derived from the score network fails to remove the injected noise as we increase $\lambda$. On the other hand, we find that the EM method shows slightly better quality as we increase $\lambda$. We hypothesize that the ODE and the SDEs with small $\lambda$ are more oscillatory than the SDEs with large $\lambda$. It is known that the EM method performs very poorly on oscillatory systems and suffers from large discretization error (Press et al., 2007); from previous experiments, we find that the ODE in CLD is highly oscillatory. We also find that both methods perform worse than the Symmetric Splitting CLD Sampler (SSCS) (Dockhorn et al., 2021) when $\lambda = 1$. The improvement from utilizing the Hamiltonian and SDE structure is significant, which encourages future exploration incorporating the Hamiltonian structure into gDDIM. Nevertheless, we also remark that SSCS with $\lambda = 1.0$ performs much worse than gDDIM with $\lambda = 0$.

C.7 MORE COMPARISONS

We also compare the performance of the CLD model we trained with that reported in Dockhorn et al. (2021) in Tab. 7. We find that our trained model performs worse than Dockhorn et al. (2021) when a black-box ODE solver or the EM sampling scheme with large NFE is used. There may be two reasons. First, with a model of similar size, our training scheme needs to fit not only $\nabla_v \log p_t(u)$ but also $\nabla_x \log p_t(u)$, while Dockhorn et al. (2021) can allocate all the representational capacity of the neural network to $\nabla_v \log p_t(u)$. Another factor is the mixed score trick in the parameterization, which has been shown empirically to boost model performance (Dockhorn et al., 2021) but which we do not include in our training. We also compare our algorithm with more accelerated sampling methods in Tab. 7. gDDIM achieves the best sampling-acceleration results among training-free methods, but it still cannot compete with some distillation-based acceleration methods. In Tab. 8, we compare the Predictor-only method with the Predictor-Corrector (PC) method. With the same number of steps N, PC improves the quality of Predictor-only at the cost of an additional N - 1 score evaluations, which is almost two times slower than the Predictor-only method. We also find that a large q may harm sampling performance in the exponential multistep method when the NFE is small. A high-order polynomial requires more datapoints to fit; these datapoints may be out of date and harmful to sampling quality when the step sizes are large.

C.8 NEGATIVE LOG LIKELIHOOD EVALUATION

Because our method only modifies the score parameterization compared with the original CLD (Dockhorn et al., 2021), we follow a similar procedure to evaluate the bound on the negative log-likelihood (NLL). Specifically, we can simulate the probability flow ODE Eq. (7) to estimate the log-likelihood of given data (Grathwohl et al., 2018). However, our diffusion model models the joint distribution $p(u_0) = p(x_0, v_0)$ of the test data $x_0$ and the augmented velocity $v_0$. Obtaining the marginal distribution $p(x_0)$ from $p(u_0)$ is challenging, as we would need to integrate over $v_0$ for each $x_0$. To circumvent this issue, Dockhorn et al. (2021) derive a lower bound on the log-likelihood, where $H(p(v_0))$ denotes the entropy of $p(v_0)$. We can then estimate the lower bound with a Monte Carlo approach. Empirically, our trained model achieves an NLL upper bound of 3.33 bits/dim, which is comparable with the 3.31 bits/dim reported for the original CLD (Dockhorn et al., 2021). Possible approaches to further reduce the NLL bound include maximum-likelihood weights (Song et al., 2021) and improved training techniques such as the mixed score. For more discussion of the log-likelihood and how to tighten the bound, we refer the reader to Dockhorn et al. (2021).

C.9 CODE LICENSES

We implemented gDDIM and related algorithms in JAX. We have used code from a number of sources, listed in Tab. 9.

