FAST SAMPLING OF DIFFUSION MODELS WITH EXPONENTIAL INTEGRATOR

Abstract

The past few years have witnessed the great success of diffusion models (DMs) in generating high-fidelity samples in generative modeling tasks. A major limitation of the DM is its notoriously slow sampling procedure, which normally requires hundreds to thousands of time discretization steps of the learned diffusion process to reach the desired accuracy. Our goal is to develop a fast sampling method for DMs that uses far fewer steps while retaining high sample quality. To this end, we systematically analyze the sampling procedure in DMs and identify key factors that affect the sample quality, among which the method of discretization is the most crucial. By carefully examining the learned diffusion process, we propose the Diffusion Exponential Integrator Sampler (DEIS). It is based on the Exponential Integrator designed for discretizing ordinary differential equations (ODEs) and leverages a semilinear structure of the learned diffusion process to reduce the discretization error. The proposed method can be applied to any DM and can generate high-fidelity samples in as few as 10 steps. Moreover, by directly using pre-trained DMs, we achieve state-of-the-art sampling performance when the number of score function evaluations (NFEs) is limited, e.g., 4.17 FID with 10 NFEs and 2.86 FID with only 20 NFEs on CIFAR10.

1. INTRODUCTION

The diffusion model (DM) (Ho et al., 2020) is a recently developed generative modeling method that relies on the basic idea of reversing a given simple diffusion process. A time-dependent score function is learned for this purpose, and DMs are thus also known as score-based models (Song et al., 2020b). Compared with other generative models such as generative adversarial networks (GANs), in addition to great scalability, the DM has the advantage of stable training and is less sensitive to hyperparameters (Creswell et al., 2018; Kingma & Welling, 2019). DMs have recently achieved impressive performance on a variety of tasks, including unconditional image generation (Ho et al., 2020; Song et al., 2020b; Rombach et al., 2021; Dhariwal & Nichol, 2021), text-conditioned image generation (Nichol et al., 2021; Ramesh et al., 2022), text generation (Hoogeboom et al., 2021; Austin et al., 2021), 3D point cloud generation (Lyu et al., 2021), and inverse problems (Kawar et al., 2021; Song et al., 2021b). However, the remarkable performance of DMs comes at the cost of slow sampling; it takes much longer to produce high-quality samples than with GANs. For instance, the Denoising Diffusion Probabilistic Model (DDPM) (Ho et al., 2020) needs 1000 steps to generate one sample, and each step requires evaluating the learned neural network once; this is substantially slower than GANs (Goodfellow et al., 2014; Karras et al., 2019). For this reason, several studies aim to improve the sampling speed of DMs (more related works are discussed in App. A). One category of methods modifies or optimizes the forward noising process such that the backward denoising process can be more efficient (Nichol & Dhariwal, 2021; Song et al., 2020b; Watson et al., 2021; Bao et al., 2022). An important and effective instance is the Denoising Diffusion Implicit Model (DDIM) (Song et al., 2020a), which uses a non-Markovian noising process.
Another category of methods speeds up the numerical solver for the stochastic differential equations (SDEs) or ordinary differential equations (ODEs) associated with the DMs (Jolicoeur-Martineau et al., 2021; Song et al., 2020b; Tachibana et al., 2021). In (Song et al., 2020b), blackbox ODE solvers are used to solve a marginal equivalent ODE known as the Probability Flow (PF) for fast sampling. In (Liu et al., 2022), the authors combine DDIM with high order methods to solve this ODE and achieve further acceleration. Note that the deterministic DDIM can also be viewed as a time discretization of the PF as it matches the latter in the continuous limit (Song et al., 2020a; Liu et al., 2022). However, it is unclear why DDIM works better than generic methods such as Euler.

Figure 1: Generated images with various DMs. Latent diffusion (Rombach et al., 2021) (Left), 256 × 256 image with text A shirt with inscription "World peace" (15 NFE). VE diffusion (Song et al., 2020b) (Mid), FFHQ 256 × 256 (12 NFE). VP diffusion (Ho et al., 2020) (Right), CIFAR10 (7 NFE) and CELEBA (5 NFE).

The objective of this work is to establish a principled discretization scheme for the learned backward diffusion processes in DMs so as to achieve fast sampling. Since the most expensive part of sampling a DM is the evaluation of the neural network that parameterizes the backward diffusion, we seek a discretization method that requires a small number of network function evaluations (NFEs). We start with a family of marginal equivalent SDEs/ODEs associated with DMs and investigate numerical error sources, which include fitting error and discretization error. We observe that even with the same trained model, different discretization schemes can have dramatically different performances in terms of discretization error. We then carry out a sequence of experiments to systematically investigate the influences of different factors on the discretization error.
We find that the Exponential Integrator (EI) (Hochbruck & Ostermann, 2010), which utilizes the semilinear structure of the backward diffusion, has minimum error. To further reduce the discretization error, we propose to either use high order polynomials to approximate the nonlinear term in the ODE or employ Runge-Kutta methods on a transformed ODE. The resulting algorithms, termed Diffusion Exponential Integrator Sampler (DEIS), achieve the best sampling quality with limited NFEs. Our contributions are summarized as follows:

1) We investigate a family of marginal equivalent SDEs/ODEs for fast sampling and conduct a systematic error analysis for their numerical solvers.

2) We propose DEIS, an efficient sampler that can be applied to any DM to achieve superior sampling quality with a limited number of NFEs. DEIS can also accelerate data log-likelihood evaluation.

3) We prove that the deterministic DDIM is a special case of DEIS, justifying the effectiveness of DDIM from a discretization perspective.

4) We conduct comprehensive experiments to validate the efficacy of DEIS. For instance, with a pre-trained model (Song et al., 2020b), DEIS is able to reach 4.17 FID with 10 NFEs and 2.86 FID with 20 NFEs on CIFAR10.

2. BACKGROUND ON DIFFUSION MODELS

A DM consists of a fixed forward diffusion (noising) process that adds noise to the data, and a learned backward diffusion (denoising) process that gradually removes the added noise. The backward diffusion is trained to match the forward one in probability law, and when this happens, one can in principle generate perfect samples from the data distribution by simulating the backward diffusion.

Forward noising diffusion:

The forward diffusion of a DM for $D$-dimensional data is a linear diffusion described by the stochastic differential equation (SDE) (Särkkä & Solin, 2019)

$$dx = F_t x\,dt + G_t\,dw, \tag{1}$$

where $w$ is a standard Wiener process. We denote by $p_t$ the marginal distribution of $x_t$, with $p_0$ the data distribution. The conditional distribution $p_{0t}(x_t \mid x_0)$ is a simple Gaussian distribution, denoted as $\mathcal{N}(\mu_t x_0, \Sigma_t)$, and the distribution $\pi(x_T) := p_T(x_T)$ is easy to sample from. Two popular SDEs in diffusion models (Song et al., 2020b) are summarized in Tab. 1. Here we use matrix notation for $F_t$ and $G_t$ to highlight the generality of our method. Our approach is applicable to any DM, including the Blurring diffusion models (BDM) (Hoogeboom & Salimans, 2022; Rissanen et al., 2022) and the critically-damped Langevin diffusion (CLD) (Dockhorn et al., 2021), where these coefficients are indeed non-diagonal matrices.

Backward denoising diffusion:

Under mild assumptions (Anderson, 1982; Song et al., 2020b), the forward diffusion Eq. (1) is associated with a reverse-time diffusion process

$$dx = [F_t x - G_t G_t^T \nabla \log p_t(x)]\,dt + G_t\,dw, \tag{2}$$

where $w$ denotes a standard Wiener process in the reverse-time direction. The distribution of the trajectories of Eq. (2) with terminal distribution $x_T \sim \pi$ coincides with that of Eq. (1) with initial distribution $x_0 \sim p_0$; that is, Eq. (2) matches Eq. (1) in probability law. Thus, in principle, we can generate new samples from the data distribution $p_0$ by simulating the backward diffusion Eq. (2). However, to solve Eq. (2), we need to evaluate the score function $\nabla \log p_t(x)$, which is not accessible.

Training:

The basic idea of DMs is to use a time-dependent network $s_\theta(x, t)$, known as a score network, to approximate the score $\nabla \log p_t(x)$. This is achieved by score matching techniques (Hyvärinen, 2005; Vincent, 2011), where the score network $s_\theta$ is trained by minimizing the denoising score matching loss

$$\mathcal{L}(\theta) = \mathbb{E}_{t \sim \mathrm{Unif}[0,T]}\, \mathbb{E}_{p(x_0)\, p_{0t}(x_t \mid x_0)} \left[ \| \nabla \log p_{0t}(x_t \mid x_0) - s_\theta(x_t, t) \|^2_{\Lambda_t} \right]. \tag{3}$$
Here $\nabla \log p_{0t}(x_t \mid x_0)$ has a closed form expression as $p_{0t}(x_t \mid x_0)$ is a simple Gaussian distribution, and $\Lambda_t$ denotes a time-dependent weight. This loss can be evaluated using empirical samples by Monte Carlo methods, and thus standard stochastic optimization algorithms can be used for training. We refer the reader to (Ho et al., 2020; Song et al., 2020b) for more details on choices of $\Lambda_t$ and training techniques.
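The closed-form conditional score and the Monte Carlo evaluation of the loss in Eq. (3) can be sketched as follows. This is a minimal illustration, not the training code of the paper. It assumes a scalar VPSDE with a linear noise schedule (the constants `beta_min`, `beta_max` are hypothetical choices), uses $\mu_t = \sqrt{\alpha_t}$ and $\Sigma_t = 1 - \alpha_t$, picks the weight $\Lambda_t = \Sigma_t$ for concreteness, and replaces the neural network by hand-written score functions for 1-dimensional Gaussian data.

```python
import numpy as np

def alpha(t, beta_min=0.1, beta_max=20.0):
    """Assumed VPSDE schedule: alpha_t = exp(-int_0^t beta(s) ds) with linear beta."""
    return np.exp(-(beta_min * t + 0.5 * (beta_max - beta_min) * t**2))

def conditional_score(x_t, x_0, t):
    """Closed-form gradient of log N(x_t; sqrt(alpha_t) x_0, (1 - alpha_t) I) in x_t."""
    a = alpha(t)
    return -(x_t - np.sqrt(a) * x_0) / (1.0 - a)

def marginal_score(x_t, t, mean0, std0):
    """Exact score of p_t when the data distribution is N(mean0, std0^2)."""
    a = alpha(t)
    return -(x_t - np.sqrt(a) * mean0) / (a * std0**2 + 1.0 - a)

def dsm_loss(score_fn, x_0, rng, t_min=1e-3, t_max=1.0):
    """Monte Carlo estimate of Eq. (3) with the assumed weight Lambda_t = 1 - alpha_t."""
    t = rng.uniform(t_min, t_max, size=(x_0.shape[0], 1))
    a = alpha(t)
    eps = rng.standard_normal(x_0.shape)
    x_t = np.sqrt(a) * x_0 + np.sqrt(1.0 - a) * eps      # sample from p_0t(x_t | x_0)
    diff = conditional_score(x_t, x_0, t) - score_fn(x_t, t)
    return float(np.mean((1.0 - a) * np.sum(diff**2, axis=-1, keepdims=True)))

# Toy data: 1-D Gaussian with mean 1.0 and standard deviation 0.1 (illustrative values).
data = 1.0 + 0.1 * np.random.default_rng(0).standard_normal((20000, 1))
loss_exact = dsm_loss(lambda x, t: marginal_score(x, t, 1.0, 0.1), data, np.random.default_rng(1))
loss_zero = dsm_loss(lambda x, t: np.zeros_like(x), data, np.random.default_rng(1))
print("loss with exact marginal score:", loss_exact)
print("loss with zero score:          ", loss_zero)
```

Note that the loss does not vanish even for the exact marginal score, because the regression target is the conditional score; the marginal score is only its minimizer.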

3. FAST SAMPLING WITH LEARNED SCORE MODELS

Once the score network $s_\theta(x, t) \approx \nabla \log p_t(x)$ is trained, it can be used to generate new samples by solving the backward SDE Eq. (2) with $\nabla \log p_t(x)$ replaced by $s_\theta(x, t)$. It turns out there are infinitely many diffusion processes one can use. In this work, we consider a family of SDEs

$$d\hat{x} = \left[F_t \hat{x} - \frac{1 + \lambda^2}{2} G_t G_t^T s_\theta(\hat{x}, t)\right] dt + \lambda G_t\,dw, \tag{4}$$

parameterized by $\lambda \geq 0$. Here we use $\hat{x}$ to distinguish the solution to the SDE associated with the learned score from the ground truth $x$ in Eqs. (1) and (2). When $\lambda = 0$, Eq. (4) reduces to an ODE known as the probability flow ODE (Song et al., 2020b). The reverse-time diffusion Eq. (2) with an approximated score is a special case of Eq. (4) with $\lambda = 1$. Denote the trajectories generated by Eq. (4) as $\{\hat{x}^*_t\}_{0 \leq t \leq T}$ and the marginal distributions as $\hat{p}^*_t$. The following proposition (Zhang & Chen, 2021) (proof in App. D) holds.

Proposition 1. When $s_\theta(x, t) = \nabla \log p_t(x)$ for all $x, t$, and $\hat{p}^*_T = \pi$, the marginal distribution $\hat{p}^*_t$ of Eq. (4) matches $p_t$ of the forward diffusion Eq. (1) for all $0 \leq t \leq T$.

The above result justifies the usage of Eq. (4) for generating samples. To generate a new sample, one can sample $\hat{x}^*_T$ from $\pi$ and solve Eq. (4) to obtain a sample $\hat{x}^*_0$. However, in practice, exact solutions to Eq. (4) are not attainable and one needs to discretize Eq. (4) over time to get an approximated solution. Denote the approximated solution by $\hat{x}_t$ and its marginal distribution by $\hat{p}_t$. The error of the generative model, that is, the difference between $p_0(x)$ and $\hat{p}_0(x)$, is then caused by two error sources: fitting error and discretization error. The fitting error is due to the mismatch between the learned score network $s_\theta$ and the ground truth score $\nabla \log p_t(x)$. The discretization error includes all extra errors introduced by the discretization in numerically solving Eq. (4).
To reduce the discretization error, one needs to use a smaller step size and thus a larger number of steps, making the sampling less efficient. The objective of this work is to investigate these two error sources and develop a more efficient sampling scheme for Eq. (4) with smaller error. In this section, we focus on the ODE approach with $\lambda = 0$. All experiments in this section are conducted based on VPSDE over the CIFAR10 dataset unless stated otherwise. The discussions on the SDE approach with $\lambda > 0$ are deferred to App. C.

Since DMs demonstrate impressive empirical results in generating high-fidelity samples, it is tempting to believe that the learned score network is able to fit the score of the data distribution very well, that is, $s_\theta(x, t) \approx \nabla \log p_t(x)$ for almost all $x \in \mathbb{R}^D$ and $t \in [0, T]$. This is, however, not true; the fitting error can be arbitrarily large on some $x, t$, as illustrated in a simple example below. In fact, the learned score models are not accurate for most $x, t$.

3.1. CAN WE LEARN GLOBALLY ACCURATE SCORE?

[Figure 2: density $p_t(x)$ and fitting error $\|\nabla \log p_t(x) - s_\theta(x, t)\|_2$, shown from $t = 0$ to $t = T$.]

Consider a generative modeling task over 1-dimensional space, i.e., $D = 1$. The data distribution is a Gaussian concentrated with a very small variance. We plot the fitting error between a score model trained by minimizing Eq. (3) and the ground truth score in Fig. 2. As can be seen from the figure, the score model works well in the region where $p_t(x)$ is large but suffers from large error in the region where $p_t(x)$ is small. This observation can be explained by examining the training loss Eq. (3). In particular, the training data of Eq. (3) are sampled from $p_t(x)$. In regions with a low $p_t(x)$ value, the learned score network is not expected to work well due to the lack of training data. This phenomenon becomes even clearer in realistic settings with high-dimensional data. The region with high $p_t(x)$ value is extremely small since realistic data is often sparsely distributed in $\mathbb{R}^D$; it is believed that real data such as images concentrate on an intrinsic low dimensional manifold (Deng et al., 2009; Pless & Souvenir, 2009; Liu et al., 2022). As a consequence, to ensure $\hat{x}_0$ is close to $x_0$, we need to make sure $\hat{x}_t$ stays in the high $p_t(x)$ region for all $t$. This makes fast sampling from Eq. (4) a challenging task, as it prevents us from taking an aggressive step size that is likely to take the solution to a region where the fitting error of the learned score network is large. A good discretization scheme for Eq. (4) should be able to help reduce the impact of the fitting error of the score network during sampling.
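The lack of training data in low-density regions can be quantified in the 1-dimensional Gaussian example. The sketch below is illustrative only: it assumes a VP-type perturbation kernel with $\mu_t = \sqrt{\alpha_t}$ and $\Sigma_t = 1 - \alpha_t$, and the data standard deviation `std0` and the three values of $\alpha_t$ are hypothetical. It compares the exact Gaussian tail mass with the fraction of simulated training points that fall more than four standard deviations away from the mean of $p_t$.

```python
import math
import numpy as np

def marginal(alpha_t, mean0=0.0, std0=0.05):
    """Mean and std of p_t for data ~ N(mean0, std0^2) under the assumed VP kernel."""
    mean = math.sqrt(alpha_t) * mean0
    std = math.sqrt(alpha_t * std0**2 + 1.0 - alpha_t)
    return mean, std

def mass_outside(k):
    """Exact probability that a Gaussian sample lies more than k std from its mean."""
    return math.erfc(k / math.sqrt(2.0))

rng = np.random.default_rng(0)
n_train = 100_000
for alpha_t in (0.999, 0.5, 0.01):
    mean, std = marginal(alpha_t)
    x_t = rng.normal(mean, std, size=n_train)     # points the loss Eq. (3) sees at this t
    seen = np.mean(np.abs(x_t - mean) > 4.0 * std)
    print(f"alpha_t={alpha_t}: std={std:.3f}, "
          f"empirical mass beyond 4 std={seen:.1e}, exact={mass_outside(4.0):.1e}")
```

Only a handful of the simulated training points land in the tail region, so the loss gives the network almost no signal there, regardless of how large the score error is in that region.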

3.2. DISCRETIZATION ERROR

We next investigate the discretization error of solving the probability flow ODE ($\lambda = 0$)

$$\frac{d\hat{x}}{dt} = F_t \hat{x} - \frac{1}{2} G_t G_t^T s_\theta(\hat{x}, t). \tag{5}$$

The exact solution to this ODE is

$$\hat{x}_t = \Psi(t, s)\hat{x}_s + \int_s^t \Psi(t, \tau)\left[-\frac{1}{2} G_\tau G_\tau^T s_\theta(\hat{x}_\tau, \tau)\right] d\tau, \tag{6}$$

where $\Psi(t, s)$, satisfying $\frac{\partial}{\partial t}\Psi(t, s) = F_t \Psi(t, s)$, $\Psi(s, s) = I$, is known as the transition matrix from time $s$ to $t$ associated with $F_\tau$. Eq. (5) is a semilinear stiff ODE (Hochbruck & Ostermann, 2010) that consists of a linear term $F_t \hat{x}$ and a nonlinear term $s_\theta(\hat{x}, t)$. There exist many different numerical solvers for Eq. (5), associated with different discretization schemes to approximate Eq. (6) (Griffiths & Higham, 2010). As the discretization step size goes to zero, the solutions obtained from all these methods converge to that of Eq. (5). However, the performances of these methods can be dramatically different when the step size is large. On the other hand, to achieve fast sampling with Eq. (5), we need to approximately solve it with a small number of discretization steps, and thus a large step size. This motivates us to develop an efficient discretization scheme that fits Eq. (5) best. In the rest of this section, we systematically study the discretization error in solving Eq. (5), both theoretically and empirically with carefully designed experiments. Based on these results, we develop an efficient algorithm for Eq. (5) that requires a small number of NFEs.

Figure 3: Fig. 3a shows the average pixel difference $\Delta_p$ between the ground truth $\hat{x}^*_0$ and the numerical solution $\hat{x}_0$ from the Euler method and the EI method. Fig. 3b depicts the approximation error $\Delta_s$ along ground truth solutions. Fig. 3d shows that $\Delta_s$ can be dramatically reduced if the parameterization $\epsilon_\theta(x, t)$ instead of $s_\theta(x, t)$ is used. This parameterization helps the EI method outperform the Euler method in Fig. 3c.

Ingredient 1: Exponential Integrator over Euler method.
The Euler method is the most elementary explicit numerical method for ODEs and is widely used in numerical software (Virtanen et al., 2020). When applied to Eq. (5), the Euler method reads

$$\hat{x}_{t-\Delta t} = \hat{x}_t - \left[F_t \hat{x}_t - \frac{1}{2} G_t G_t^T s_\theta(\hat{x}_t, t)\right]\Delta t. \tag{7}$$

This is used in many existing works on DMs (Song et al., 2020b; Dockhorn et al., 2021). This approach, however, has low accuracy and is sometimes unstable when the step size is not sufficiently small. To improve the accuracy, we propose to use the Exponential Integrator (EI), a method that leverages the semilinear structure of Eq. (5). When applied to Eq. (5), the EI reads

$$\hat{x}_{t-\Delta t} = \Psi(t - \Delta t, t)\hat{x}_t + \left[\int_t^{t-\Delta t} -\frac{1}{2}\Psi(t - \Delta t, \tau) G_\tau G_\tau^T\, d\tau\right] s_\theta(\hat{x}_t, t). \tag{8}$$

It is effective if the nonlinear term $s_\theta(\hat{x}_t, t)$ does not change much along the solution. In fact, for any given $\Delta t$, Eq. (8) solves Eq. (5) exactly if $s_\theta(\hat{x}_t, t)$ is constant over the time interval $[t - \Delta t, t]$. To compare the EI Eq. (8) and the Euler method Eq. (7), we plot in Fig. 3a the average pixel difference $\Delta_p$ between the ground truth $\hat{x}^*_0$ and the numerical solution $\hat{x}_0$ obtained by these two methods for various numbers $N$ of steps. Surprisingly, the EI method performs worse than the Euler method. This observation suggests that there are other major factors that contribute to the error $\Delta_p$. In particular, the condition that the nonlinear term $s_\theta(\hat{x}_t, t)$ does not change much along the solution, assumed for the EI method, does not hold. To see this, we plot the score approximation error $\Delta_s(\tau) = \|s_\theta(\hat{x}_\tau, \tau) - s_\theta(\hat{x}_t, t)\|_2$, $\tau \in [t - \Delta t, t]$, along the exact solution $\{\hat{x}^*_t\}$ to Eq. (5) in Fig. 3b. It can be seen that the approximation error grows rapidly as $t$ approaches 0. This is not surprising; the score of a realistic data distribution $\nabla \log p_t(x)$ should change rapidly as $t \to 0$ (Dockhorn et al., 2021).

Ingredient 2: $\epsilon_\theta(x, t)$ over $s_\theta(x, t)$.
The issues caused by the rapidly changing score $\nabla \log p_t(x)$ do not only exist in sampling, but also appear in the training of DMs. To address these issues, a different parameterization of the score network is used. In particular, it is found that the parameterization (Ho et al., 2020)

$$\nabla \log p_t(x) \approx -L_t^{-T} \epsilon_\theta(x, t),$$

where $L_t$ can be any matrix satisfying $L_t L_t^T = \Sigma_t$, leads to significant improvements in accuracy. The rationale of this parameterization is based on a reformulation of the training loss Eq. (3) as (Ho et al., 2020)

$$\mathcal{L}(\theta) = \mathbb{E}_{t \sim \mathrm{Unif}[0,T]}\, \mathbb{E}_{p(x_0),\, \epsilon \sim \mathcal{N}(0, I)} \left[\|\epsilon - \epsilon_\theta(\mu_t x_0 + L_t \epsilon, t)\|^2_{\bar{\Lambda}_t}\right]$$

with $\bar{\Lambda}_t = L_t^{-1} \Lambda_t L_t^{-T}$. The network $\epsilon_\theta$ tries to follow $\epsilon$, which is sampled from a standard Gaussian and thus has a small magnitude. In comparison, the parameterization $s_\theta = -L_t^{-T}\epsilon_\theta$ can take large values since $L_t \to 0$ as $t$ approaches 0. It is thus better to approximate $\epsilon_\theta$ than $s_\theta$ with a neural network. We adopt this parameterization and rewrite Eq. (5) as

$$\frac{d\hat{x}}{dt} = F_t \hat{x} + \frac{1}{2} G_t G_t^T L_t^{-T} \epsilon_\theta(\hat{x}, t). \tag{10}$$

Applying the EI to Eq. (10) yields

$$\hat{x}_{t-\Delta t} = \Psi(t - \Delta t, t)\hat{x}_t + \left[\int_t^{t-\Delta t} \frac{1}{2}\Psi(t - \Delta t, \tau) G_\tau G_\tau^T L_\tau^{-T}\, d\tau\right] \epsilon_\theta(\hat{x}_t, t). \tag{11}$$

Compared with Eq. (8), Eq. (11) employs $-L_\tau^{-T}\epsilon_\theta(\hat{x}_t, t)$ instead of $s_\theta(\hat{x}_t, t) = -L_t^{-T}\epsilon_\theta(\hat{x}_t, t)$ to approximate the score $s_\theta(\hat{x}_\tau, \tau)$ over the time interval $\tau \in [t - \Delta t, t]$. This modification from $L_t^{-T}$ to $L_\tau^{-T}$ turns out to be crucial; the coefficient $L_\tau^{-T}$ changes rapidly over time. This is verified by Fig. 3d, where we plot the score approximation error $\Delta_s$ when the parameterization $\epsilon_\theta$ is used, from which we see that the error $\Delta_s$ is greatly reduced compared with Fig. 3b. With this modification, the EI method significantly outperforms the Euler method, as shown in Fig. 3c. Next we develop several fast sampling algorithms for DMs, all coined the Diffusion Exponential Integrator Sampler (DEIS), based on Eq. (11).

Interestingly, the discretization Eq. (11) based on EI coincides with the popular deterministic DDIM when the forward diffusion Eq. (1) is VPSDE (Song et al., 2020a), as summarized below (proof in App. E).

Proposition 2. When the forward diffusion Eq. (1) is set to be VPSDE ($F_t$, $G_t$ are specified in Tab. 1), the EI discretization Eq. (11) becomes

$$\hat{x}_{t-\Delta t} = \sqrt{\frac{\alpha_{t-\Delta t}}{\alpha_t}}\,\hat{x}_t + \left[\sqrt{1 - \alpha_{t-\Delta t}} - \sqrt{\frac{\alpha_{t-\Delta t}}{\alpha_t}}\sqrt{1 - \alpha_t}\right]\epsilon_\theta(\hat{x}_t, t), \tag{12}$$

which coincides with the deterministic DDIM sampling algorithm.

Our result provides an alternative justification for the efficacy of DDIM for VPSDE from a numerical discretization point of view. Unlike DDIM, our method Eq. (11) can be applied to any diffusion SDE to improve the efficiency and accuracy of discretizations. In the discretization Eq. (11), we use $\epsilon_\theta(\hat{x}_t, t)$ to approximate $\epsilon_\theta(\hat{x}_\tau, \tau)$ for all $\tau \in [t - \Delta t, t]$, which is a zero order approximation. Comparing Eq. (11) and Eq. (6), we see that this approximation error largely determines the accuracy of the discretization. One natural question to ask is whether it is possible to use a better approximation of $\epsilon_\theta(\hat{x}_\tau, \tau)$ to further improve the accuracy. We answer this question affirmatively below with an improved algorithm.

Ingredient 3: Polynomial extrapolation of $\epsilon_\theta$.

Before presenting our algorithm, we investigate how $\epsilon_\theta(\hat{x}_t, t)$ evolves along a ground truth solution $\{\hat{x}^*_t\}$ from $t = T$ to $t = 0$. We plot the relative change in 2-norm of $\epsilon_\theta(\hat{x}_t, t)$ in Fig. 4a. It reveals that for most time instances the relative change is small. This motivates us to use previous (backward) evaluations of $\epsilon_\theta$ up to $t$ to extrapolate $\epsilon_\theta(\hat{x}_\tau, \tau)$ for $\tau \in [t - \Delta t, t]$. Inspired by the high-order polynomial extrapolation in linear multistep methods, we propose to use high-order polynomial extrapolation of $\epsilon_\theta$ in our EI method. To this end, consider a time discretization $\{t_i\}_{i=0}^N$ where $t_0 = 0$, $t_N = T$.
For each $i$, we fit a polynomial $P_r(t)$ of degree $r$ with respect to the interpolation points $(t_{i+j}, \epsilon_\theta(\hat{x}_{t_{i+j}}, t_{i+j}))$, $0 \leq j \leq r$. This polynomial $P_r(t)$ has the explicit expression

$$P_r(t) = \sum_{j=0}^{r} \left[\prod_{k \neq j} \frac{t - t_{i+k}}{t_{i+j} - t_{i+k}}\right] \epsilon_\theta(\hat{x}_{t_{i+j}}, t_{i+j}). \tag{13}$$

Figure 4: Fig. 4a shows that relative changes of $\epsilon_\theta(\hat{x}^*_t, t)$ with respect to $t$ are relatively small, especially when $t > 0.15$. Fig. 4b depicts the extrapolation error with $N = 10$. High order polynomials can reduce the approximation error effectively. Fig. 4c illustrates the effects of extrapolation. When $N$ is small, higher order polynomial approximation leads to better samples.

We then use $P_r(t)$ to approximate $\epsilon_\theta(\hat{x}_\tau, \tau)$ over the interval $[t_{i-1}, t_i]$. For $i > N - r$, we need to use polynomials of lower order to approximate $\epsilon_\theta$. To see the advantages of this approximation, we plot the approximation error $\Delta_\epsilon(t) = \|\epsilon_\theta(\hat{x}_t, t) - P_r(t)\|_2$ of $\epsilon_\theta(\hat{x}_t, t)$ by $P_r(t)$ along ground truth trajectories $\{\hat{x}^*_t\}$ in Fig. 4b. It can be seen that higher order polynomials can reduce the approximation error compared with the case $r = 0$, which uses the zero order approximation as in Eq. (11). As in the EI method Eq. (11), which uses a zero order approximation of the score in Eq. (6), the update step of order $r$ is obtained by plugging the polynomial approximation Eq. (13) into Eq. (6). It can be written explicitly as

$$\hat{x}_{t_{i-1}} = \Psi(t_{i-1}, t_i)\hat{x}_{t_i} + \sum_{j=0}^{r} \left[C_{ij}\, \epsilon_\theta(\hat{x}_{t_{i+j}}, t_{i+j})\right], \tag{14}$$

$$C_{ij} = \int_{t_i}^{t_{i-1}} \frac{1}{2}\Psi(t_{i-1}, \tau) G_\tau G_\tau^T L_\tau^{-T} \prod_{k \neq j}\left[\frac{\tau - t_{i+k}}{t_{i+j} - t_{i+k}}\right] d\tau. \tag{15}$$

We remark that the update in Eq. (14) is a linear combination of $\hat{x}_{t_i}$ and $\epsilon_\theta(\hat{x}_{t_{i+j}}, t_{i+j})$, where the weights $\Psi(t_{i-1}, t_i)$ and $C_{ij}$ are calculated once for a given forward diffusion Eq. (1) and time discretization, and can be reused across batches. For some diffusions Eq. (1), $\Psi(t_{i-1}, t_i)$ and $C_{ij}$ have closed form expressions. Even if analytic formulas are not available, one can use a high accuracy solver to obtain these coefficients. In DMs (e.g., VPSDE and VESDE), the integrals in Eq. (15) are normally 1-dimensional or 2-dimensional and are thus easy to evaluate numerically. This approach resembles the classical Adams-Bashforth method (Hochbruck & Ostermann, 2010), and we thus term it tAB-DEIS. Here we use $t$ to differentiate it from the other DEIS algorithms we present later in Sec. 4, which are based on a time-scaled ODE. The tAB-DEIS algorithm is summarized in Algo 1. Note that the deterministic DDIM is a special case of tAB-DEIS for VPSDE with $r = 0$. The polynomial approximation used in DEIS improves the sampling quality significantly when the number of sampling steps $N$ is small, as shown in Fig. 4c.
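The three ingredients above can be tried out on a toy problem in which $\epsilon_\theta$ is known exactly. The sketch below is an illustration under stated assumptions, not the reference implementation: it assumes a scalar VPSDE with a linear noise schedule (so $F_t = -\beta(t)/2$, $G_t^2 = \beta(t)$, $\mu_t = \sqrt{\alpha_t}$, $\Sigma_t = 1 - \alpha_t$), Gaussian data $\mathcal{N}(M, S^2)$ whose exact $\epsilon$ is available in closed form, a time grid uniform in $t$ that stops at a hypothetical $t_0 = 10^{-3}$ rather than 0, and Gauss-Legendre quadrature for the coefficients $C_{ij}$ of Eq. (15). All constants and function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

BETA_MIN, BETA_MAX = 0.1, 20.0   # assumed linear VPSDE noise schedule
M, S = 1.0, 0.5                  # assumed toy data distribution N(M, S^2)

def beta(t):
    return BETA_MIN + (BETA_MAX - BETA_MIN) * t

def alpha(t):
    # alpha_t = exp(-int_0^t beta), so that F_t = -beta/2 and G_t^2 = beta
    return np.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2))

def eps_exact(x, t, s=S):
    # exact noise prediction: -sqrt(1 - alpha_t) times the score of p_t
    a = alpha(t)
    return np.sqrt(1 - a) * (x - np.sqrt(a) * M) / (a * s**2 + 1 - a)

def exact_solution(x_T, t, T, s=S):
    # closed-form probability flow map between the Gaussian marginals p_T and p_t
    aT, a = alpha(T), alpha(t)
    vT, v = aT * s**2 + 1 - aT, a * s**2 + 1 - a
    return np.sqrt(a) * M + np.sqrt(v / vT) * (x_T - np.sqrt(aT) * M)

def deis_coeffs(ts, i, r, n_quad=64):
    """C_ij of Eq. (15) for the step t_i -> t_{i-1}, by Gauss-Legendre quadrature."""
    nodes, weights = leggauss(n_quad)
    t_cur, t_next = ts[i], ts[i - 1]
    tau = 0.5 * (t_next - t_cur) * nodes + 0.5 * (t_next + t_cur)
    w = 0.5 * (t_next - t_cur) * weights          # negative: integration runs backward
    base = 0.5 * np.sqrt(alpha(t_next) / alpha(tau)) * beta(tau) / np.sqrt(1 - alpha(tau))
    coeffs = []
    for j in range(r + 1):
        lagrange = np.ones_like(tau)
        for k in range(r + 1):
            if k != j:
                lagrange *= (tau - ts[i + k]) / (ts[i + j] - ts[i + k])
        coeffs.append(float(np.sum(w * base * lagrange)))
    return coeffs

def tab_deis(x_T, ts, r, s=S):
    """tAB-DEIS from ts[-1] down to ts[0]; r = 0 is the EI update Eq. (11)."""
    x, buf = x_T, []                              # buf[j] holds eps at t_{i+j}
    for i in range(len(ts) - 1, 0, -1):
        buf.insert(0, eps_exact(x, ts[i], s))
        order = min(r, len(buf) - 1)              # lower order until history is available
        c = deis_coeffs(ts, i, order)
        psi = np.sqrt(alpha(ts[i - 1]) / alpha(ts[i]))
        x = psi * x + sum(cj * buf[j] for j, cj in enumerate(c))
    return x

def euler(x_T, ts, s=S):
    """Euler discretization Eq. (7) of the score-parameterized ODE Eq. (5)."""
    x = x_T
    for i in range(len(ts) - 1, 0, -1):
        t, dt = ts[i], ts[i] - ts[i - 1]
        score = -eps_exact(x, t, s) / np.sqrt(1 - alpha(t))
        x = x - (-0.5 * beta(t) * x - 0.5 * beta(t) * score) * dt
    return x

T, t0, N = 1.0, 1e-3, 10
ts = np.linspace(t0, T, N + 1)
x_T = np.linspace(-2.0, 2.0, 9)                   # a few deterministic starting points
ref = exact_solution(x_T, t0, T)
for name, out in [("Euler", euler(x_T, ts)),
                  ("EI / DDIM (r=0)", tab_deis(x_T, ts, 0)),
                  ("tAB-DEIS (r=1)", tab_deis(x_T, ts, 1))]:
    print(f"{name:16s} max abs error vs exact: {np.abs(out - ref).max():.3e}")
```

Two properties of this sketch follow directly from the text: the quadrature value of $C_{i0}$ for $r = 0$ agrees with the closed-form DDIM coefficient of Proposition 2, and when the data is a point mass (`s=0.0`) the exact $\epsilon$ is constant along each trajectory, so the $r = 0$ update reproduces the exact solution for any number of steps.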

4. EXPONENTIAL INTEGRATOR: SIMPLIFY PROBABILITY FLOW ODE

Next we present a different perspective on DEIS based on ODE transformations. The probability flow ODE Eq. (10) can be transformed into a simple non-stiff ODE, and then off-the-shelf ODE solvers can be applied to solve the ODE effectively. To this end, we introduce the variable $\hat{y}_t := \Psi(0, t)\hat{x}_t$ and rewrite Eq. (10) into

$$\frac{d\hat{y}}{dt} = \frac{1}{2}\Psi(0, t) G_t G_t^T L_t^{-T} \epsilon_\theta(\Psi(t, 0)\hat{y}, t). \tag{16}$$

Note that, departing from Eq. (10), Eq. (16) does not possess semilinear structure. Thus, we can apply off-the-shelf ODE solvers to Eq. (16) without accounting for the semilinear structure in algorithm design. This transformation Eq. (16) can be further improved by taking into account the analytical form of $\Psi$, $G_t$, $L_t$.

Algorithm 1 tAB-DEIS
  Input: $\{t_i\}_{i=0}^N$, $r$
  Instantiate: $\hat{x}_{t_N}$, empty $\epsilon$-buffer
  Calculate weights $\Psi$, $C$ based on Eq. (15)
  for $i$ in $N, N-1, \cdots, 1$ do
    $\epsilon$-buffer.append($\epsilon_\theta(\hat{x}_{t_i}, t_i)$)
    $\hat{x}_{t_{i-1}}$ ← Eq. (14) with $\Psi$, $C$, $\epsilon$-buffer
  end for

Proposition 3. For the VPSDE, with $\hat{y}_t = \sqrt{\alpha_0 / \alpha_t}\,\hat{x}_t$ and the time-scaling $\beta(t) = \sqrt{\alpha_0}\left(\sqrt{\frac{1 - \alpha_t}{\alpha_t}} - \sqrt{\frac{1 - \alpha_0}{\alpha_0}}\right)$, Eq. (10) can be transformed into

$$\frac{d\hat{y}}{d\rho} = \epsilon_\theta\left(\sqrt{\frac{\alpha_{\beta^{-1}(\rho)}}{\alpha_0}}\,\hat{y},\ \beta^{-1}(\rho)\right), \quad \rho \in [\beta(0), \beta(T)]. \tag{17}$$

After the transformation, the ODE becomes a black-box ODE that can be solved by generic ODE solvers efficiently, since the stiffness caused by the semilinear structure is removed. This is the core idea of the variants of DEIS we present next. Based on the transformed ODE Eq. (17) and the above discussions, we propose two variants of the DEIS algorithm: ρRK-DEIS when applying classical RK methods, and ρAB-DEIS when applying Adams-Bashforth methods. We remark that the difference between tAB-DEIS and ρAB-DEIS lies in the fact that tAB-DEIS fits polynomials in $t$, which may not be polynomials in $\rho$. Thanks to the simplified ODEs, DEIS enjoys the same convergence order guarantee as its underlying RK or AB solvers.
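A minimal sketch of ρRK-DEIS with Heun's second-order method applied to Eq. (17) is given below. It is illustrative only and makes several assumptions that are not part of the text: a scalar VPSDE with a linear noise schedule and $\alpha_0 = 1$ (so that $\rho = \sqrt{(1 - \alpha_t)/\alpha_t}$ and $\hat{y}_t = \hat{x}_t / \sqrt{\alpha_t}$), a toy Gaussian data distribution for which the exact $\epsilon$ replaces the network, a grid that is uniform in $t$ and then mapped to $\rho$, and a final time of $10^{-3}$ instead of 0. Each Heun step costs two evaluations of $\epsilon$.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0      # assumed linear VPSDE noise schedule
M, S = 1.0, 0.5                     # assumed toy data distribution N(M, S^2)

def alpha(t):
    return np.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2))

def rho_of_t(t):
    a = alpha(t)
    return np.sqrt((1 - a) / a)

def t_of_rho(rho):
    # invert rho(t): -log(alpha_t) = log(1 + rho^2) is a quadratic in t
    d = BETA_MAX - BETA_MIN
    return (-BETA_MIN + np.sqrt(BETA_MIN**2 + 2 * d * np.log1p(rho**2))) / d

def eps_exact(x, t):
    # exact noise prediction for the toy data, standing in for the network
    a = alpha(t)
    return np.sqrt(1 - a) * (x - np.sqrt(a) * M) / (a * S**2 + 1 - a)

def rhs(y, rho):
    # right-hand side of Eq. (17): eps evaluated at x = sqrt(alpha_t) * y, t = t(rho)
    t = t_of_rho(rho)
    return eps_exact(np.sqrt(alpha(t)) * y, t)

def rho2heun(x_T, ts):
    """Heun's method in the (y, rho) variables, from ts[-1] down to ts[0]."""
    rhos = rho_of_t(ts)
    y = x_T / np.sqrt(alpha(ts[-1]))
    for i in range(len(ts) - 1, 0, -1):
        h = rhos[i - 1] - rhos[i]                 # negative step in rho
        k1 = rhs(y, rhos[i])
        k2 = rhs(y + h * k1, rhos[i - 1])
        y = y + 0.5 * h * (k1 + k2)
    return np.sqrt(alpha(ts[0])) * y              # map back to x

def exact_solution(x_T, t, T):
    # closed-form probability flow map between the Gaussian marginals
    aT, a = alpha(T), alpha(t)
    vT, v = aT * S**2 + 1 - aT, a * S**2 + 1 - a
    return np.sqrt(a) * M + np.sqrt(v / vT) * (x_T - np.sqrt(aT) * M)

T, t0 = 1.0, 1e-3
x_T = np.linspace(-2.0, 2.0, 9)
ref = exact_solution(x_T, t0, T)
for n_steps in (5, 10, 20, 40):
    ts = np.linspace(t0, T, n_steps + 1)
    err = np.abs(rho2heun(x_T, ts) - ref).max()
    print(f"steps={n_steps:3d} (NFE={2 * n_steps:3d}): max abs error {err:.3e}")
```

Because the transformed ODE has no linear term, the solver is a plain explicit Runge-Kutta scheme; any other fixed-step RK tableau could be substituted for the two Heun stages.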

5. EXPERIMENTS

Ablation study: As shown in Fig. 5, the ingredients introduced in Sec. 3.2 can significantly improve sampling efficiency on CIFAR10. Besides, DEIS outperforms standard samplers by a large margin.

DEIS variants:

We include performance evaluations of various DEIS variants with VPSDE on CIFAR10 in Tab. 2, including DDIM, ρRK-DEIS, ρAB-DEIS and tAB-DEIS. For ρRK-DEIS, we find that Heun's method works best among second-order RK methods (denoted as ρ2Heun) and the Kutta method works best for third order (denoted as ρ3Kutta); the classic fourth-order RK is denoted as ρ4RK. For Adams-Bashforth methods, we consider fitting polynomials of order 1, 2, and 3 in $t$ and in $\rho$, denoted as tAB and ρAB respectively. We observe that almost all DEIS algorithms can generate high-fidelity images with small NFE. Also, note that DEIS with high-order polynomial approximation can significantly outperform DDIM; the latter coincides with the zero-order polynomial approximation. We also find that the performance of high order ρRK-DEIS is not satisfying when NFE is small but becomes competitive as NFE increases. This is expected: high order methods enjoy smaller local truncation error and total accumulated error when a small step size is used, and this advantage vanishes as we reduce the number of steps.

More comparisons:

We conduct more comparisons with popular samplers for DMs, including DDPM, DDIM, PNDM (Liu et al., 2021), A-DDIM (Bao et al., 2022), FastDPM (Kong & Ping, 2021), and Ito-Taylor (Tachibana et al., 2021). We further propose Improved PNDM (iPNDM), which avoids the expensive warm start and leads to better empirical performance. We conduct [...] (Yu et al., 2015) with pre-trained model (Dhariwal & Nichol, 2021). We compare DEIS with selected baselines quantitatively in Fig. 7, and show empirical samples in Fig. 6. More implementation details, the performance of various DMs, and many more qualitative experiments are included in App. H.

6. CONCLUSION

In this work, we consider fast sampling problems for DMs. We present the Diffusion Exponential Integrator Sampler (DEIS), a fast sampling algorithm for DMs based on a novel discretization scheme of the backward diffusion process. In addition to its theoretical elegance, DEIS also works efficiently in practice; it is able to generate high-fidelity samples with fewer than 10 NFEs. Exploring better extrapolation may further improve sampling quality. More discussions are included in App. B.

A MORE RELATED WORKS

A lot of research has been conducted to speed up the sampling of DMs. In (Kong & Ping, 2021; Watson et al., 2021) the authors optimize the denoising process by modifying the underlying stochastic process. However, such acceleration cannot generate high quality samples with a small number of discretization steps. In (Song et al., 2020a) the authors use a non-Markovian forward noising. The resulting algorithm, DDIM, achieves significant acceleration over DDPMs. More recently, the authors of (Bao et al., 2022) optimize the backward Markovian process to approximate the non-Markovian forward process and obtain an analytic expression for the optimal variance in the denoising process. Another strategy is to make the forward diffusion nonlinear and trainable (Zhang & Chen, 2021; Vargas et al., 2021; De Bortoli et al., 2021; Wang et al., 2021; Chen et al., 2021a) in the spirit of the Schrödinger bridge (Chen et al., 2021b). This, however, comes with a heavy training overhead.

More closely related to our method is (Liu et al., 2022), which interprets the update step in deterministic DDIM as a combination of a gradient estimation step and a transfer step. It modifies high order ODE methods to provide an estimation of the gradient and uses DDIM for the transfer step. However, the decomposition of DDIM into two separate components is not theoretically justified. In light of our analysis of the Exponential Integrator, the method of Liu et al. (2022) uses an Exponential Integrator but with an Euler discretization-based approximation of the nonlinear term. This approximation is inaccurate and may suffer from large discretization error if the step size is large, as we show in Sec. 5.

The semilinear structure present in the probability flow ODE has been widely investigated in physics and numerical simulation (Hochbruck & Ostermann, 2010; Whalen et al., 2015), from which we draw inspiration. The stiffness of the ODEs calls for more efficient ODE solvers than blackbox solvers designed for general ODE problems. In this work, we investigate solvers for the differential equations in diffusion models and take advantage of the semilinear structure.

B DISCUSSIONS

1. Q - Can DEIS help accelerate the likelihood evaluation of diffusion models?

A - Theoretically, our methods can be used in likelihood evaluation, as DEIS only changes the numerical discretization. Practically, we can use ρRK-DEIS with Eq. (16) and Prop 3 to accelerate likelihood evaluation. We find that NLL evaluation based on RK can converge with 36 NFE using the third-order Kutta solver, which reaches 3.16 bits/dim compared with 3.15 bits/dim for RK45 (Song et al., 2020b) and achieves around 4 times acceleration.

2. Q - Can the proposed method be further accelerated by designing an adaptive step size solver?

A - The proposed ρRK-DEIS can be combined with off-the-shelf adaptive step size solvers. However, we find that most ODE trajectories resulting from various starting points share similar patterns in curvature, and a tuned fixed step size works efficiently. Most existing adaptive step size strategies have some probability of rejecting a proposed step size, which wastes the NFE budget. Take RK45 as an example: one rejection wastes 5 NFE, which is unacceptable when we try to generate samples in 10 NFE or even fewer.

3. Q - The proposed AB-DEIS and iPNDM use lower-order multistep solvers for computing the initial solution. Do they have a convergence guarantee?

A - We use lower-order multistep methods for the first few steps to save computational costs. This strategy helps us achieve similar sampling quality with fewer NFE, as we show in Tabs. 4 and 5, which aligns with our goal of sampling with small NFE. Moreover, lower order Adams-Bashforth methods also enjoy a convergence guarantee, albeit with a slower rate.

4. [...] where $D(x, t)$ is trained to predict clean data given noised data $x$ at time $t$. They employ the second-order Heun method to solve Eq. (18). Additionally, they show that all isotropic diffusion models with arbitrary $s(t)$, $\sigma(t)$ can be transformed into the suggested diffusion model with parameter schedule $s(t) = 1$, $\sigma(t) = t$ by proper rescaling.
The rescaling in Karras et al. (2022) is equivalent to the change of variables we introduce in Sec. 4, and Eq. (18) is the simplified ODE Eq. (17) that takes into account the analytical form of Ψ, G_t, L_t. To further illustrate the point, consider the example with the popular VPSDE in Prop 3. In this case, ρRK-DEIS uses the time rescaling $\rho(t) = \sqrt{\frac{1-\alpha_t}{\alpha_t}}$ and the state rescaling $\hat{y}_t = \sqrt{\frac{1}{\alpha_t}}\,\hat{x}_t$ (note $\alpha_0 = 1$ in VPSDE). The forward process for $\hat{y}_\rho$ becomes

$$\hat{y}_\rho = \hat{y}_{t(\rho)} \sim \mathcal{N}\Big(\hat{x}_0, \frac{1-\alpha_t}{\alpha_t}\Big) = \mathcal{N}(\hat{y}_0, \rho^2), \tag{19}$$

where $t(\rho)$ is the inverse function of $\rho(t)$ and the last equality holds due to $\hat{x}_0 = \hat{y}_0$. Comparing Eq. (19) and the parameter schedule s(t) = 1, σ(t) = t in Karras et al. (2022), we conclude that $\hat{y}_\rho$ is equivalent to their $x_t$ and ρ is the same as their t. Moreover, $\frac{x - D(x,t)}{t}$ is equivalent to $\epsilon_\theta\Big(\sqrt{\frac{\alpha_{\beta^{-1}(\rho)}}{\alpha_0}}\,\hat{y}, \beta^{-1}(\rho)\Big)$ since both predict the added white noise from noised data.

In summary, Karras et al. (2022, Algorithm 1) is a special case of ρRK-DEIS, which can be obtained by employing the second-order Heun method in Eq. (17). We include the empirical comparison between other DEIS algorithms and Karras et al. (2022, Algorithm 1), which we denote as ρ2Heun. We find that, with relatively large NFE, third-order Kutta is better than second-order Heun, and that tAB-DEIS outperforms ρRK-DEIS when NFE is small.
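The time rescaling discussed above can be checked numerically. The following is a minimal sketch, assuming the linear-β VPSDE schedule of Song et al. (2020b) with β_min = 0.1 and β_max = 20 (these constants and all function names are illustrative assumptions, not fixed by this appendix). It compares a finite-difference derivative of $\rho(t) = \sqrt{(1-\alpha_t)/\alpha_t}$ with the right-hand side of Eq. (31) for $\alpha_0 = 1$, and inverts $\rho(t)$ by bisection, which is one simple way to obtain $t(\rho)$.

```python
import math

# Assumed VPSDE schedule (linear beta schedule); hypothetical constants for illustration.
BETA_MIN, BETA_MAX = 0.1, 20.0

def log_alpha(t):
    """log(alpha_t) for the assumed schedule, so that alpha_0 = 1."""
    return -(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t)

def dlog_alpha_dt(t):
    """Time derivative of log(alpha_t) for the assumed schedule."""
    return -(BETA_MIN + (BETA_MAX - BETA_MIN) * t)

def alpha(t):
    return math.exp(log_alpha(t))

def rho(t):
    """Rescaled time rho(t) = sqrt((1 - alpha_t) / alpha_t)."""
    a = alpha(t)
    return math.sqrt((1.0 - a) / a)

def drho_dt_formula(t):
    """Right-hand side of Eq. (31) with alpha_0 = 1."""
    a = alpha(t)
    return -0.5 * math.sqrt(1.0 / a) * dlog_alpha_dt(t) / math.sqrt(1.0 - a)

def drho_dt_numeric(t, h=1e-6):
    """Central finite difference of rho(t)."""
    return (rho(t + h) - rho(t - h)) / (2.0 * h)

def t_of_rho(r, lo=0.0, hi=1.0, iters=200):
    """Invert rho(t) on [lo, hi] by bisection; rho increases because alpha_t decreases."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if rho(mid) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Under these assumptions, the variance of the rescaled forward process at time t is simply `rho(t) ** 2`, which is the statement of Eq. (19).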

5. Q - How does DEIS compare with the sampling algorithm in Lu et al. (2022)?

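A - We note DPM-Solver (Lu et al., 2022)  where $\lambda := \log\frac{\alpha_t}{\sigma_t}$ is known as one half of the log-SNR (signal-to-noise ratio) and $\hat{\epsilon}_\theta(x_\lambda, \lambda) = \epsilon_\theta(x_t, t)$ with the corresponding t given λ. Similar to the exponential Runge-Kutta method (Hochbruck & Ostermann, 2010), Lu et al. (2022) approximate $\int_{\lambda_s}^{\lambda_t} e^{-\lambda}\,\hat{\epsilon}_\theta(x_\lambda, \lambda)\,d\lambda$ based on Taylor expansion and propose DPM-Solvers. Eq. (20) shares many similarities with ρRK-DEIS. Specifically, $\rho(t) = e^{-\lambda(t)}$, since $\rho = \sqrt{\frac{1-\alpha_t}{\alpha_t}}$ and, in VPSDE, our $\sqrt{\alpha_t}$ corresponds to their $\alpha_t$ and our $\sqrt{1-\alpha_t}$ to their $\sigma_t$. Similar to Eq. (20), the exact solution in Eq. (17) follows

$$x_t = \sqrt{\frac{\alpha_t}{\alpha_s}}\,x_s + \sqrt{\alpha_t}\int_{\rho_s}^{\rho_t} \hat{\epsilon}_\theta(x_\rho, \rho)\,d\rho, \tag{21}$$

where $\hat{\epsilon}_\theta(x_\rho, \rho) = \epsilon_\theta(x_t, t)$ with the corresponding t given ρ. ρRK-DEIS employs off-the-shelf Runge-Kutta solvers for $\int_{\rho_s}^{\rho_t} \hat{\epsilon}_\theta(x_\rho, \rho)\,d\rho$.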

An example of DPM-Solver2

To illustrate the connection and difference more clearly, we consider DPM-Solver2 and ρRK-DEIS with the standard midpoint solver and compare their update schemes. To compare these two algorithms, we first introduce a function $F_{\mathrm{DDIM}}$ inspired by DDIM. In ρRK-DEIS and DPM-Solver, $F_{\mathrm{DDIM}}$ is defined as

$$F_{\mathrm{DDIM}}(x, g, s, t) = \sqrt{\frac{\alpha_t}{\alpha_s}}\,x + \Big[\sqrt{1-\alpha_t} - \sqrt{\frac{\alpha_t}{\alpha_s}}\sqrt{1-\alpha_s}\Big]\,g, \tag{22}$$

$$F_{\mathrm{DDIM}}(x, g, s, t) = \frac{\alpha_t}{\alpha_s}\,x - \big[\sigma_t(e^{h} - 1)\big]\,g, \quad \text{where } h = \lambda_t - \lambda_s, \tag{23}$$

respectively, each written in the notation of its own paper. With $F_{\mathrm{DDIM}}$, we can reformulate the update schemes of DPM-Solver2 and ρRK-DEIS with the midpoint solver into Algo 2 and 3. The two algorithms differ only in the choice of the midpoint $s_i$. In particular, the midpoint of DPM-Solver2 corresponds to $\sqrt{\rho_i \rho_{i+1}}$ in the rescaled time ρ.

Connection with Runge-Kutta

Though both algorithms are inspired by EI methods and Runge-Kutta, they are actually different even when there is no semilinear structure in the diffusion flow ODE. Let us consider the VESDE introduced in Karras et al. (2022), where $\alpha_t = 1$, $\sigma_t = t$. This VESDE has a simple ODE formulation,

$$dx = \epsilon_\theta(x, t)\,dt. \tag{24}$$

Eq. (24) does not have a semilinear structure. In this case, ρRK-DEIS reduces to standard Runge-Kutta methods and has convergence order $O(\Delta t^{\kappa})$ for κ-order RK methods. DPM-Solver instead uses the parametrization $\lambda = -\log(t)$ and reformulates Eq. (24) as

$$dx = -e^{-\lambda}\,\epsilon_\theta(x, t_\lambda(\lambda))\,d\lambda, \tag{25}$$

so it differs from standard Runge-Kutta. A κ-order DPM-Solver has convergence order $O(\Delta\lambda^{\kappa})$ under certain assumptions stated in Lu et al. (2022).
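The relation between the two forms of $F_{\mathrm{DDIM}}$ can be sketched in a few lines. The snippet below is an illustration under stated assumptions rather than either paper's implementation: the function names are hypothetical, the DEIS form follows Eq. (22), and the DPM-Solver form uses the first-order update of Lu et al. (2022), $x_t = \frac{\alpha_t}{\alpha_s}x_s - \sigma_t(e^{h}-1)\,g$. With the correspondence $\sqrt{\alpha_t}$ (DEIS) = $\alpha_t$ (DPM-Solver) and $\sqrt{1-\alpha_t}$ (DEIS) = $\sigma_t$ (DPM-Solver), both return the same value. The last helper contrasts the two midpoint choices: an arithmetic mean in ρ versus a mean in λ, which is a geometric mean in ρ because $\rho = e^{-\lambda}$.

```python
import math

def f_ddim_deis(x, g, alpha_s, alpha_t):
    """Eq. (22) in DEIS notation, where x_t ~ N(sqrt(alpha_t) x_0, 1 - alpha_t)."""
    ratio = math.sqrt(alpha_t / alpha_s)
    return ratio * x + (math.sqrt(1.0 - alpha_t) - ratio * math.sqrt(1.0 - alpha_s)) * g

def f_ddim_dpm(x, g, a_s, sig_s, a_t, sig_t):
    """First-order DPM-Solver update in the notation of Lu et al., where x_t ~ N(a_t x_0, sig_t^2)."""
    h = math.log(a_t / sig_t) - math.log(a_s / sig_s)  # h = lambda_t - lambda_s
    return (a_t / a_s) * x - sig_t * math.expm1(h) * g

def midpoints_in_rho(rho_a, rho_b):
    """Return (midpoint taken in rho, midpoint taken in lambda = -log(rho) mapped back to rho)."""
    return 0.5 * (rho_a + rho_b), math.sqrt(rho_a * rho_b)

# Usage: translate DEIS quantities into DPM-Solver notation and compare one step.
alpha_s, alpha_t = 0.3, 0.6
step_deis = f_ddim_deis(0.7, -0.4, alpha_s, alpha_t)
step_dpm = f_ddim_dpm(0.7, -0.4,
                      math.sqrt(alpha_s), math.sqrt(1.0 - alpha_s),
                      math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t))
```

The two midpoints coincide only when the neighbouring ρ values are equal, which is why the two second-order schemes produce different intermediate evaluation points even though they share the same first-order update.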

Empirical comparison

We compare DPM-Solver2, DPM-Solver3, tAB-DEIS, and ρRK-DEIS on 64 × 64 class-conditioned ImageNet. We observe that tAB-DEIS has the best sample quality most of the time. We believe this is because multistep methods are better than single-step methods when the NFE budget is limited, e.g., 6. DPM-Solvers are better than ρRK-DEIS in the small-NFE region, and the difference shrinks quickly as we increase the number of sampling steps. We hypothesize that this is because DPM-Solvers are tailored for sampling with small NFE. However, ρRK-DEIS has a slightly better FID when NFE is relatively large, although the difference in performance is small. This observation aligns with our experiments on CIFAR10, where third-order ρRK-DEIS achieves 2.56 with 51 NFE while the third-order DPM-Solver achieves 2.65 with 48 NFE (Lu et al., 2022). We include more visual comparisons in Figs. 8 and 9.

Algorithm 2 DPM-Solver-2
1: Input: $x_i$, $t_i$, $t_{i-1}$ and corresponding $\lambda_i$, $\lambda_{i-1}$
2: Output: $x_{i-1}$
3: $s_i = t_\lambda\big(\frac{\lambda_i + \lambda_{i-1}}{2}\big)$
4: $g = \epsilon_\theta(x_i, t_i)$
5: $u_i = F_{\mathrm{DDIM}}(x_i, g, t_i, s_i)$
6: $g = \epsilon_\theta(u_i, s_i)$
7: $x_{i-1} = F_{\mathrm{DDIM}}(u_i, g, s_i, t_{i-1})$

Algorithm 3 ρRK-DEIS with midpoint solver
1: Input: $x_i$, $t_i$, $t_{i-1}$ and corresponding $\rho_i$, $\rho_{i-1}$
2: Output: $x_{i-1}$
3: $s_i = t_\rho\big(\frac{\rho_i + \rho_{i-1}}{2}\big)$
4: $g = \epsilon_\theta(x_i, t_i)$
5: $u_i = F_{\mathrm{DDIM}}(x_i, g, t_i, s_i)$
6: $g = \epsilon_\theta(u_i, s_i)$
7: $x_{i-1} = F_{\mathrm{DDIM}}(u_i, g, s_i, t_{i-1})$

(Lu et al., 2022; Karras et al., 2022; Song et al., 2020a). How do the compared algorithms and DEIS perform under different step size scheduling?

A - The comparison given the same time discretization is included in App. H.3. We find that different algorithms may prefer different time discretizations. We provide a comparison of different sampling algorithms under their best time scheduling in Tab. 2. In most cases, especially in the low-NFE region, we find that tAB-DEIS performs better than other approaches.

7. Q - Can DEIS be generalized to accelerate SDE sampling for diffusion models?
A -Some techniques developed in DEIS, such as better score parameterization and analytic treatment of linear term, can be applied to SDE counterparts. However, SDE is more difficult to accelerate compared with ODE. We include more discussions in App. C.

C DISCRETIZATION ERROR OF SDE SAMPLING

In this section, we consider the problem of solving the SDE Eq. (4) with λ > 0. As shown in Prop 1, this also leads to a sampling scheme for DMs. The exact solution to Eq. (4) satisfies

$$\hat{x}_t = \underbrace{\Psi(t,s)\,\hat{x}_s}_{\text{Linear term}} + \underbrace{\int_s^t \overbrace{\Psi(t,\tau)\,\frac{1+\lambda^2}{2}\,G_\tau G_\tau^T L_\tau^{-T}}^{\text{Weight}}\,\epsilon_\theta(\hat{x}_\tau, \tau)\,d\tau}_{\text{Nonlinear term}} + \underbrace{\int_s^t \lambda\,\Psi(t,\tau)\,G_\tau\,dw}_{\text{Noise term}}, \tag{26}$$

where Ψ is as before. The goal is to approximate Eq. (26) through discretization. Interestingly, the stochastic DDIM (Song et al., 2020a) turns out to be a numerical solver for Eq. (26), as stated below (proof in App. G).

Proposition 4. For the VPSDE, the stochastic DDIM is a discretization scheme of Eq. (26).

How do we discretize Eq. (26) for a general SDE Eq. (4)? One strategy is to follow what we did for the ODE (λ = 0) in Sec. 3.2 and approximate $\epsilon_\theta(\hat{x}_\tau, \tau)$ by a polynomial. However, we found that this strategy does not work well in practice. We do not pursue the discretization of the SDE Eq. (4) further in this paper and leave it for future work. We believe the difficulty is due to several possible reasons, as follows.

Nonlinear weight and discretization error. In Eq. (26), the linear and noise terms can be calculated exactly without discretization error. Thus, only the nonlinear term involving $\epsilon_\theta$ can induce error in the EI method. Compared with Eq. (11), Eq. (26) has a larger weight on the nonlinear term since λ > 0 and is therefore more likely to cause larger errors. From this perspective, the ODE with λ = 0 is the best option since it minimizes the weight of the nonlinear term. In Song et al. (2020a), the authors also observed that the deterministic DDIM outperforms the stochastic DDIM, which is consistent with our analysis. Besides, we notice that the nonlinear weight in VPSDE is significantly smaller than that in VESDE, which implies that VPSDE has a smaller discretization error. Indeed, empirically, VPSDE has much better sampling performance when N is small.

Additional noise. Compared with Eq. (11) for ODEs, Eq. (26) injects additional noise into the state when it is simulated backward. Thus, to generate new samples by denoising, the score model needs to remove not only the noise in $\hat{x}_{t_N}$ but also this injected noise. For this reason, a better approximation of $\epsilon_\theta$ may be needed.

D PROOF OF PROP 1

The proof is inspired by Zhang & Chen (2021). We show that the marginal distribution induced by Eq. (4) does not depend on the choice of λ and equals the marginal distribution induced by Eq. (2) when the score model is perfect. Consider the distribution q induced by the SDE

$$dx = \Big[F_t x - \frac{1+\lambda^2}{2}\,G_t G_t^T \nabla\log q_t(x)\Big]dt + \lambda G_t\,dw. \tag{27}$$

Eq. (27) is simulated from t = T to t = 0. According to the Fokker-Planck-Kolmogorov (FPK) equation, q solves the partial differential equation

$$\begin{aligned}
\frac{\partial q_t(x)}{\partial t} &= -\nabla\cdot\Big\{\Big[F_t x - \frac{1+\lambda^2}{2}\,G_t G_t^T \nabla\log q_t(x)\Big]q_t(x)\Big\} - \frac{\lambda^2}{2}\Big\langle G_t G_t^T, \frac{\partial^2}{\partial x_i \partial x_j} q_t(x)\Big\rangle \\
&= -\nabla\cdot\Big\{\Big[F_t x - \frac{1}{2}\,G_t G_t^T \nabla\log q_t(x)\Big]q_t(x)\Big\} + \nabla\cdot\Big\{\Big[\frac{\lambda^2}{2}\,G_t G_t^T \nabla\log q_t(x)\Big]q_t(x)\Big\} - \frac{\lambda^2}{2}\Big\langle G_t G_t^T, \frac{\partial^2}{\partial x_i \partial x_j} q_t(x)\Big\rangle,
\end{aligned} \tag{28}$$

where ∇· denotes the divergence operator. Since

$$\nabla\cdot\Big\{\Big[\frac{\lambda^2}{2}\,G_t G_t^T \nabla\log q_t(x)\Big]q_t(x)\Big\} = \nabla\cdot\Big[\frac{\lambda^2}{2}\,G_t G_t^T \nabla q_t(x)\Big] = \Big\langle \frac{\lambda^2}{2}\,G_t G_t^T, \frac{\partial^2}{\partial x_i \partial x_j} q_t(x)\Big\rangle,$$

we obtain

$$\frac{\partial q_t(x)}{\partial t} = -\nabla\cdot\Big\{\Big[F_t x - \frac{1}{2}\,G_t G_t^T \nabla\log q_t(x)\Big]q_t(x)\Big\}. \tag{29}$$

Eq. (29) shows that the above partial differential equation does not depend on λ. Thus, the marginal distribution of Eq. (27) is independent of the value of λ.

E PROOF OF PROP 2

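A straightforward calculation based on Eq. (6) gives that Ψ(t, s) for the VPSDE is

$$\Psi(t, s) = \sqrt{\frac{\alpha_t}{\alpha_s}}.$$

It follows that

$$\begin{aligned}
\int_s^t \Psi(t,\tau)\,\frac{1}{2}\,G_\tau G_\tau^T L_\tau^{-T}\,d\tau
&= \int_s^t -\frac{1}{2}\sqrt{\frac{\alpha_t}{\alpha_\tau}}\,\frac{d\log\alpha_\tau}{d\tau}\,\frac{1}{\sqrt{1-\alpha_\tau}}\,d\tau \\
&= \sqrt{\alpha_t}\int_s^t -\frac{1}{2}\,\frac{d\alpha_\tau}{\alpha_\tau^{1.5}(1-\alpha_\tau)^{0.5}} \\
&= \sqrt{\alpha_t}\,\sqrt{\frac{1-\tau}{\tau}}\,\Bigg|_{\alpha_s}^{\alpha_t} \\
&= \sqrt{1-\alpha_t} - \sqrt{\frac{\alpha_t}{\alpha_s}}\sqrt{1-\alpha_s}.
\end{aligned}$$

Setting t ← t − Δt, s ← t, we write Eq. (11) as

$$\hat{x}_{t-\Delta t} = \sqrt{\frac{\alpha_{t-\Delta t}}{\alpha_t}}\,\hat{x}_t + \Big[\sqrt{1-\alpha_{t-\Delta t}} - \sqrt{\frac{\alpha_{t-\Delta t}}{\alpha_t}}\sqrt{1-\alpha_t}\Big]\epsilon_\theta(\hat{x}_t, t).$$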
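The resulting update can be exercised on a toy problem. The sketch below is an illustration under a strong simplifying assumption: the data distribution is a point mass at `x0`, so the exact noise prediction is available in closed form (the helper names are hypothetical and no network is involved). In this setting the predicted noise stays constant along the trajectory, so the update reproduces $\sqrt{\alpha}\,x_0 + \sqrt{1-\alpha}\,\epsilon$ exactly for any step size, which makes it a convenient sanity check for an implementation of the step.

```python
import math

def ddim_step(x, eps, alpha_t, alpha_next):
    """One deterministic update from noise level alpha_t to alpha_next (the final display above)."""
    ratio = math.sqrt(alpha_next / alpha_t)
    return ratio * x + (math.sqrt(1.0 - alpha_next) - ratio * math.sqrt(1.0 - alpha_t)) * eps

def eps_point_mass(x, alpha_t, x0):
    """Exact noise prediction when the data distribution is a point mass at x0 (toy assumption)."""
    return (x - math.sqrt(alpha_t) * x0) / math.sqrt(1.0 - alpha_t)

def sample(x_start, alphas, x0):
    """Apply the update along a grid of alpha values, from the noisiest (smallest alpha) to the cleanest."""
    x = x_start
    for a_t, a_next in zip(alphas[:-1], alphas[1:]):
        x = ddim_step(x, eps_point_mass(x, a_t, x0), a_t, a_next)
    return x

# Usage: one large step and several small steps land on the same point.
x0, x_start = 2.0, 1.3
coarse = sample(x_start, [0.01, 0.999], x0)
fine = sample(x_start, [0.01, 0.2, 0.5, 0.9, 0.999], x0)
```

For realistic data the noise prediction varies along the trajectory, and the step size then matters; the toy case only isolates the part of the update that is solved exactly.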

F PROOF OF PROP 3

We start our proof with Eq. (16). In VPSDE, Eq. (16) reduces to

$$\frac{d\hat{y}}{dt} = -\frac{1}{2}\sqrt{\frac{\alpha_0}{\alpha_t}}\,\frac{d\log\alpha_t}{dt}\,\frac{1}{\sqrt{1-\alpha_t}}\,\epsilon_\theta(\Psi(0,t)\hat{y}, t). \tag{30}$$

Now we consider a rescaled time ρ, which satisfies

$$\frac{d\rho}{dt} = -\frac{1}{2}\sqrt{\frac{\alpha_0}{\alpha_t}}\,\frac{d\log\alpha_t}{dt}\,\frac{1}{\sqrt{1-\alpha_t}}. \tag{31}$$

Plugging Eq. (31) into Eq. (30), we reach

$$\frac{d\hat{y}}{d\rho} = \epsilon_\theta(\Psi(0,t)\hat{y}, t). \tag{32}$$

In VPSDE, $\alpha_t$ is a monotonically decreasing function of t. Therefore, there exists a bijective mapping between ρ and t based on Eq. (31), which we denote by β, so that ρ = β(t). Furthermore, we can solve Eq. (31) for β:

$$\beta(t) = \sqrt{\alpha_0}\,\Big(\sqrt{\frac{1-\alpha_t}{\alpha_t}} - \sqrt{\frac{1-\alpha_0}{\alpha_0}}\Big). \tag{33}$$

G PROOF OF PROP 4

Our derivation uses the notations in Song et al. (2020a). The DDIM employs the update step

$$x_{t-\Delta t} = \sqrt{\alpha_{t-\Delta t}}\,\Big(\frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\alpha_t}}\Big) + \sqrt{1-\alpha_{t-\Delta t} - \eta^2\,\frac{1-\alpha_{t-\Delta t}}{1-\alpha_t}\Big(1-\frac{\alpha_t}{\alpha_{t-\Delta t}}\Big)}\;\epsilon_\theta(x_t, t) + \eta\,\sqrt{\frac{1-\alpha_{t-\Delta t}}{1-\alpha_t}\Big(1-\frac{\alpha_t}{\alpha_{t-\Delta t}}\Big)}\;\epsilon_t, \tag{34}$$

where η is a hyperparameter and η ∈ [0, 1]. When η = 0, Eq. (34) becomes deterministic and reduces to Eq. (12). We show that Eq. (34) is equivalent to Eq. (4) when η = λ and Δt → 0. By Eq. (34), $x_{t-\Delta t} \sim \mathcal{N}(\mu_\eta, \sigma_\eta^2 I)$, where

$$\mu_\eta = \sqrt{\alpha_{t-\Delta t}}\,\Big(\frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\alpha_t}}\Big) + \sqrt{1-\alpha_{t-\Delta t} - \eta^2\,\frac{1-\alpha_{t-\Delta t}}{1-\alpha_t}\Big(1-\frac{\alpha_t}{\alpha_{t-\Delta t}}\Big)}\;\epsilon_\theta(x_t, t),$$

$$\sigma_\eta^2 = \eta^2\,\frac{1-\alpha_{t-\Delta t}}{1-\alpha_t}\Big(1-\frac{\alpha_t}{\alpha_{t-\Delta t}}\Big).$$

It follows that

$$\begin{aligned}
\lim_{\Delta t\to 0}\frac{x_t - \mu_\eta}{t - (t-\Delta t)}
&= \lim_{\Delta t\to 0}\Bigg[\frac{1-\sqrt{\frac{\alpha_{t-\Delta t}}{\alpha_t}}}{\Delta t}\,x_t + \frac{\sqrt{\alpha_{t-\Delta t}}\sqrt{\frac{1-\alpha_t}{\alpha_t}} - \sqrt{1-\alpha_{t-\Delta t} - \eta^2\,\frac{1-\alpha_{t-\Delta t}}{1-\alpha_t}\big(1-\frac{\alpha_t}{\alpha_{t-\Delta t}}\big)}}{\Delta t}\,\epsilon_\theta(x_t, t)\Bigg] \\
&= \frac{1}{2}\frac{d\log\alpha_t}{dt}\,x_t - \frac{1+\eta^2}{2}\,\frac{d\log\alpha_t}{dt}\,\frac{1}{\sqrt{1-\alpha_t}}\,\epsilon_\theta(x_t, t) \\
&= F_t x_t + \frac{1+\eta^2}{2}\,G_t G_t^T L_t^{-T}\,\epsilon_\theta(x_t, t),
\end{aligned}$$

$$\lim_{\Delta t\to 0}\frac{\eta^2\,\frac{1-\alpha_{t-\Delta t}}{1-\alpha_t}\big(1-\frac{\alpha_t}{\alpha_{t-\Delta t}}\big)}{\Delta t} = -\eta^2\,\frac{d\log\alpha_t}{dt} = \eta^2\,G_t G_t^T.$$

Consequently, the continuous limit of Eq. (34) is

$$dx = \Big[F_t x + \frac{1+\eta^2}{2}\,G_t G_t^T L_t^{-T}\,\epsilon_\theta(x, t)\Big]dt + \eta\,G_t\,dw,$$

which is exactly Eq. (4) if η = λ.

Algorithm 4 Improved PNDM (iPNDM)
Input: $\{t_i\}_{i=0}^{N}$, $t_i = i\Delta t$, order r
Instantiate: $x_{t_N}$, empty ϵ-buffer
for i in N, N − 1, ⋯, 1 do
j = min(N − i + 1,

From Fig. 4, we observe that the approximation error is not uniformly distributed over $t_0 \le t \le t_N$ when uniform discretization over time is used; the error increases as t approaches 0. This observation implies that, instead of a uniform step size (linear timesteps), a smaller step size should be used for t close to 0 to improve accuracy. One such option is the quadratic timestep suggested in Song et al. (2020a), which follows $\mathrm{linspace}(t_0, \sqrt{t_N}, N+1)^2$.

To better understand the effects of time discretization, we investigate the difference between the ground truth $x^*_t$ and the numerical solution $\hat{x}_t$ with the same boundary value $x^*_T$:

$$x^*_t - \hat{x}_t = \int_T^t \frac{1}{2}\,\Psi(t,\tau)\,G_\tau G_\tau^T L_\tau^{-T}\,\Delta\epsilon_\theta(\tau)\,d\tau, \qquad \Delta\epsilon_\theta(\tau) = \epsilon_\theta(x^*_\tau, \tau) - P_r(\tau). \tag{41}$$

Eq. (41) shows that the difference between the solutions $x^*_t$ and $\hat{x}_t$ is a weighted sum of $\Delta\epsilon_\theta(\tau)$. We emphasize that Eq. (41) contains not only the approximation error of $P_r(\tau)$, which we discussed before, but also an accumulation error. Indeed, since $P_r(\tau)$ is fitted on the numerical solution $\{\hat{x}_\tau\}$ instead of the ground-truth trajectory $\{x^*_\tau\}$, there exists an accumulation error caused by past errors. A good choice of time discretization should balance the approximation error and the accumulation error.

We have two options for time discretization: adaptive step sizes and fixed timestamps. There exists one unique ODE for DMs, and we find empirically that various ODE trajectories share a similar pattern of curvature. Moreover, the cost of rejected steps in adaptive step size solvers is not negligible when the NFE budget is small, such as 10 or even 5. Thus, we prefer and explore fixed timestamps in DEIS.


We experiment with several popular options for time discretization (Salimans & Ho, 2022; Song et al., 2020a) in App. H.3. Surprisingly, given different NFE budgets, we find that various samplers have different preferences for timesteps. How to design the time discretization in a systematic way is an interesting problem; we leave it for future research. In Fig. 5, we show the effects of each ingredient we introduce. With the Exponential Integrator, the other ingredients consistently improve sampling quality in terms of FID. Compared with other sampling algorithms, DEIS enjoys significant acceleration.

We present a study of sampling with different $t_0$ and time scheduling based on VPSDE. We consider two choices of $t_0$ ($10^{-3}$, $10^{-4}$) and three choices for time scheduling. The first time scheduling follows the power function in t,

$$t_i = \Big(\frac{N-i}{N}\,t_0^{\frac{1}{\kappa}} + \frac{i}{N}\,t_N^{\frac{1}{\kappa}}\Big)^{\kappa}, \tag{42}$$

the second time scheduling follows the power function in ρ,

$$\rho_i = \Big(\frac{N-i}{N}\,\rho_0^{\frac{1}{\kappa}} + \frac{i}{N}\,\rho_N^{\frac{1}{\kappa}}\Big)^{\kappa}, \tag{43}$$

and the last time scheduling follows a uniform step in log ρ space,

$$\log\rho_i = \frac{N-i}{N}\log\rho_0 + \frac{i}{N}\log\rho_N. \tag{44}$$

We include the comparison between different $t_0$ and time scheduling in Tabs. 6 to 8. We notice that $t_0$ has a huge influence on image FIDs, which has also been noticed and investigated in other studies (Kim et al., 2021; Dockhorn et al., 2021). Among the various schedulings, we observe that tAB-DEIS has an obvious advantage when NFE is small, and ρRK-DEIS is competitive when NFE is relatively large.
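The three schedules can be written down directly from Eqs. (42) to (44). The following is a minimal sketch with hypothetical function names; the end points used in the usage lines ($t_0 = 10^{-3}$, $t_N = 1$, and ρ between 0.002 and 80) are example values, not a recommendation. Setting κ = 1 in the power schedule gives uniform steps, and κ = 2 gives a quadratic schedule.

```python
import math

def power_schedule(v0, vN, N, kappa):
    """Eq. (42) in t, or Eq. (43) in rho: N + 1 grid points from v0 to vN."""
    a, b = v0 ** (1.0 / kappa), vN ** (1.0 / kappa)
    return [(((N - i) / N) * a + (i / N) * b) ** kappa for i in range(N + 1)]

def log_uniform_schedule(rho0, rhoN, N):
    """Eq. (44): N + 1 grid points with uniform steps in log(rho)."""
    l0, lN = math.log(rho0), math.log(rhoN)
    return [math.exp(((N - i) / N) * l0 + (i / N) * lN) for i in range(N + 1)]

# Usage with example end points.
ts_quadratic = power_schedule(1e-3, 1.0, 10, 2)      # kappa = 2: denser grid near t_0
ts_uniform = power_schedule(0.0, 1.0, 4, 1)          # kappa = 1: uniform grid
rhos_power = power_schedule(0.002, 80.0, 10, 7)      # power function in rho
rhos_log = log_uniform_schedule(0.002, 80.0, 8)      # constant ratio between neighbours
```

A larger κ concentrates more grid points near the small end of the interval, which is where the text above reports the largest approximation error.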

H.4 MORE ABLATION STUDY

We include more quantitative comparisons of the introduced ingredients in Tab. 9 for Fig. 5. Since the $\epsilon_\theta$-based parameterization and polynomial extrapolation are only compatible with the exponential integrator, we cannot combine them with the Euler method. We also report the performance of applying quadratic timestamp scheduling directly to the Euler method in Tab. 10. We find that sampling with small NFE and sampling with large NFE have different preferences for time schedules. We also report the performance of the RK45 ODE solver for VPSDE on CIFAR10 in Tab. 11. As a popular and well-developed ODE solver, RK45 has decent sampling performance when NFE ≥ 50. However, its sampling quality with limited NFE is not satisfying. Such results are expected, as RK45 does not take advantage of the structural information of diffusion models. The overall performance of the RK45 solver is worse than that of iPNDM and DEIS when NFE is small.

H.5 COMPARISON WITH ANALYTIC-DDIM (A-DDIM) (BAO ET AL., 2022)

We also compare our algorithm with Analytic-DDIM (A-DDIM) in terms of fast sampling performance. We failed to reproduce the significant improvements claimed in Bao et al. (2022) with our default CIFAR10 checkpoint. Two factors could contribute to this. First, we use a score network trained with a continuous-time loss objective and different weights (Song et al., 2020b), whereas Analytic-DDIM is proposed for DDPM with discrete times and finite timestamps. Second, some tricks have huge impacts on the sampling quality of A-DDIM. For instance, A-DDIM heavily depends on clipping values in the last few steps (Bao et al., 2022), and it does not provide high-quality samples without proper clipping when NFE is low. To compare with A-DDIM, we conduct another experiment with the checkpoints provided by Bao et al. (2022) and integrate iPNDM and DEIS into the provided codebase; the results are shown in Tab. 12. We use a piecewise linear function to fit the discrete SDE coefficients in Bao et al. (2022) for DEIS. Without any ad-hoc tricks, the plug-and-play iPNDM is comparable to or even slightly better than A-DDIM when the NFE budget is small, and DEIS is better than both of them.

FID for various DEIS with κ = 1 in Eq. (42



Because the fitting error explodes when t → 0, we have scaled the fitting error for better visualization.

The $\{x^*_t\}$ are approximated by solving the ODE with high accuracy solvers and sufficiently small step size. For better visualization, we have removed the time discretization points in Fig. 3b and Fig. 3d, since Δs = 0 at these points and becomes negative infinity in log scale.

We use scipy.integrate.solve_ivp and tune the tolerance to get different performances at different NFE. We find that different combinations of absolute tolerance and relative tolerance may result in the same NFE but different FID. We report the best FID in that case.

https://github.com/openai/guided-diffusion



Figure 2: Fitting error on a toy demo. Lighter areas represent higher probability region (left) and larger fitting error (right).

Figure 5: Ablation study and comparison with other samplers. We notice that switching from Euler to the Exponential Integrator worsens FID, which we explore and explain in Ingredient 2 in Sec. 3. With EI, the $\epsilon_\theta$ parameterization, polynomial extrapolation, and optimized timestamps can significantly improve the sampling quality. Compared with other samplers, namely the ODE sampler based on RK45 (Song et al., 2020b), SDE samplers based on Euler-Maruyama (EM) (Song et al., 2020b), and the SDE adaptive step size solver (Jolicoeur-Martineau et al., 2021), DEIS converges much faster.

analytical form of Ψ, G_t, L_t. Here we present treatment for VPSDE; the results can be extended to other (scalar) DMs such as VESDE. Proposition 3. For the VPSDE, with ŷt =

Figure 6: Generated samples of DDIM and DEIS with unconditional 256×256 ImageNet pretrained model(Dhariwal & Nichol, 2021)

For ρRK-DEIS, the upper right number indicates extra NFEs used. Bold numbers denote the best performance achieved with similar NFE budgets. For a fair comparison, we report numbers based on their best time discretization for different algorithms with different NFE. We include a comparison given the same time discretization in App. H.3. †: The concurrent work (Karras et al., 2022) applies the Heun method to a rescaled DM. This is a special case of ρ2Heun (more discussion included in App. B).

comparison on image datasets, including 64 × 64 CelebA (Liu et al., 2015) with pre-trained model from Song et al. (2020a), class-conditioned 64 × 64 ImageNet (Deng et al., 2009) with pre-trained model (Dhariwal & Nichol, 2021), 256 × 256 LSUN Bedroom

Figure 7: Sample quality measured by FID ↓ of different sampling algorithms with pre-trained DMs.

Figure 8: DPM-Solver v.s. DEIS with unconditional 256 × 256 ImageNet pretrained model (Dhariwal & Nichol, 2021). (Zoom in to see details)

Figure 11: Generated images with text "A artistic painting of snow trees by Pablo Picaso, oil on canvas" (15 NFE)

Two popular SDEs, variance preserving SDE (VPSDE) and variance exploding SDE (VESDE). The parameter α t is decreasing with α 0 ≈ 1, α T ≈ 0, while σ t is increasing. D×D denotes the linear drift coefficient, G t ∈ R D×D denotes the diffusion coefficient, and w is a standard Wiener process. The diffusion Eq. (1) is initiated at the training data and simulated over a fixed time window [0, T ]. Denote by p t (x t ) the marginal distribution of x t and by p 0t (x t |x 0 ) the conditional distribution from x 0 to x t , then p 0 (x 0 ) represents the underlying distribution of the training data. The simulated trajectories are represented by {x t } 0≤t≤T . The parameters F t and G t are chosen such that the conditional marginal distribution p 0t (x t |x 0

More results of DEIS for VPSDE on CIFAR10 with limited NFE.

We note Karras et al. (2022) is a concurrent work that introduces a second-order Heun method in a rescaled ODE. The algorithm is a special case of ρRK-DEIS with the second-order Heun RK method. Below we show the equivalence. As the two works use different sets of notations, we use blue for notations from Karras et al. (2022) and orange for our notations.

is a concurrent work and it also uses the exponential integrator to reduce discretization error during sampling. Both start with the exact ODE solution but differ in the discretization methods for the nonlinear score part. Below we show the connections and differences. As the two works use different sets of notations, we use cyan for notations from Lu et al. (2022) and orange for our notations.

Exact ODE solution

Lu et al. (2022) investigate diffusion models with forward noising $x_t \sim \mathcal{N}(\alpha_t x_0, \sigma_t^2)$. Lu et al. (2022, Proposition 3.1) propose the exact solution of the ODE for $x_t$ given the initial value $x_s$ at time s ≥ 0

The ODE solvers are sensitive to step size choice. Different works suggest different time discretization

PNDM and iPNDM on CIFAR10

H.3 IMPACT OF t_0 AND TIME SCHEDULING ON FIDS

Ingredient 4: Optimizing time discretization. From Fig.

PNDM and iPNDM on CELEBA

DEIS for VPSDE on CIFAR10 with $t_0 = 10^{-3}$.

H.6 SAMPLING QUALITY ON IMAGENET 32 × 32

We conduct experiments on ImageNet 32 × 32 with the pre-trained VPSDE model provided in (Song et al., 2021a). Again, we observe significant improvement over DDIM and iPNDM methods when the NFE budget is low. Even with 50 NFE, DEIS is able to outperform the blackbox ODE solver in terms of sampling quality.

H.7 DETAILS OF EXPERIMENTS ON IMAGENET 64 × 64 AND BEDROOM 256 × 256

We use popular checkpoints from guided-diffusion for our class-conditioned ImageNet 64 × 64 and 256 × 256 LSUN bedroom experiments. Though the models are trained with discrete time, we simply treat them as continuous diffusion models. Better performance is possible if we have a better time discretization scheme. We adopt the time scheduling with κ = 7 in Eq. (43) suggested by Karras et al. (2022) with $\rho_1 = 0.002$, $\rho_N = 80.0$, which gives a better empirical performance in class-conditioned ImageNet. We also use the time scheduling in Eq. (44) suggested by Lu et al. (2022) with $\rho_1 = 0.002$, $\rho_N = 80.0$. Better sampling quality may be obtained with different time discretization.

DEIS for VPSDE on CIFAR10 with t 0 = 10 -4 in Eq. (42)

DEIS for VPSDE on CIFAR10 with $t_0$ and time scheduling suggested by Karras et al. (2022)

H.9 MORE RESULTS ON VESDE

Though VESDE does not achieve the same accelerations as VPSDE, our method can significantly accelerate VESDE sampling compared with previous methods for VESDE. We show the accelerated FID for VESDE on CIFAR10 in Tab. 15 and sampled images in Fig. 10.

H.10 CHECKPOINT USED AND CODE LICENSES

Our code will be released in the future. We implemented our approach in Jax and PyTorch. We have also used code from a number of sources in Tab. 16.

Quantitative comparison in Fig. 5 for the introduced ingredients: Exponential Integrator (EI), $\epsilon_\theta$-based score parameterization, polynomial extrapolation, and optimizing time discretization $\{t_i\}$, where we change the uniform step size to the quadratic one ($t_0 = 10^{-4}$). We include Tabs. 6 to 8 for more ablation studies regarding time discretization.

.99 8.46 4.96 3.54 2.81 2.62
Quadratic 294.01 138.73 39.82 19.26 8.49 3.96 2.88 2.61 2.57

Effects of different timesteps on the Euler method. We use $t_0 = 10^{-4}$, which has a lower FID score compared with the default $t_0 = 10^{-3}$ (Song et al., 2020b) in the experiments.

We list the used checkpoints and the corresponding experiments in Tab. 17.

Quantitative performance of RK45 ODE solver with t 0 = 10 -4 in Fig.5.

Comparison with A-DDIM on the checkpoint and time scheduling provided by (Bao et al., 2022) on CIFAR10

DEIS        34.69   13.94   9.55   8.41
tAB2-DEIS   29.50   11.36   8.79   8.29
tAB3-DEIS   28.09   10.55   8.58   8.25

Sampling quality on VPSDE ImageNet 32 × 32 with the checkpoint provided by Song et al. (2021a). The blackbox ODE solver reports FID 8.34 with ODE tolerance $1 \times 10^{-5}$ (NFE around 130).

DEIS        26.65±0.63   8.81±0.23   4.33±0.07   3.19±0.03
tAB2-DEIS   25.13±0.56   7.20±0.21   3.61±0.05   3.04±0.02
tAB3-DEIS   25.07±0.49   6.95±0.09   3.41±0.04   2.95±0.03

Mean and standard deviation of multiple runs with 4 different random seeds on the checkpoint and time scheduling provided by Liu et al. (2022) on CELEBA.

DEIS        103.52±2.09   46.90±0.38   27.64±0.05   19.86±0.03
tAB1-DEIS   56.33±0.87    26.16±0.12   18.52±0.03   16.64±0.01
tAB2-DEIS   58.65±0.25    20.89±0.09   16.94±0.03   16.33±0.02
tAB3-DEIS   96.70±0.90    25.01±0.03   16.59±0.03   16.31±0.02

FID results of DEIS on VESDE CIFAR10. We note that the Predictor-Corrector algorithm proposed in (Song et al., 2020b) has FID ≥ 100 when sampling with a limited NFE budget (≤ 50).

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their useful comments. This work is partially supported by NSF ECCS-1942523 and NSF CCF-2008513. 


• In Sec. 3, the ground-truth solutions $\{x^*_t\}$ are approximated by solving the ODE with high accuracy solvers and small step size. We empirically find that solutions of RK4 converge when the step size is smaller than $2 \times 10^{-3}$ in VPSDE. We approximate ground-truth solutions by RK4 solutions with step size $1 \times 10^{-3}$.
• It is found that correcting steps and an extra denoising step can improve image quality at additional NFE costs (Song et al., 2020b; Jolicoeur-Martineau et al., 2021). For a fair comparison, we disable the correcting steps, extra denoising step, and other heuristic clipping tricks for all methods and experiments in this work unless stated otherwise.
• Due to numerical issues, we set the ending time $t_0$ in DMs during sampling to a non-zero number. Song et al. (2020b) suggest $t_0 = 10^{-3}$ for VPSDE and $t_0 = 10^{-5}$ for VESDE. In practice, we find that the value of $t_0$ and the time scheduling have huge impacts on FIDs. This finding is not new and has been pointed out by existing works (Jolicoeur-Martineau et al., 2021; Kim et al., 2021; Song et al., 2020a). Interestingly, we found that different algorithms have different preferences for $t_0$ and time scheduling. We report the best FIDs for each method among different choices of $t_0$ and time scheduling in Tab. 2. We use the $t_0$ suggested by the original paper and codebase for different checkpoints and the quadratic time scheduling suggested by Song et al. (2020a) unless stated otherwise. We include a comprehensive study of $t_0$ and time scheduling in App. H.3.
• Because PNDM needs 12 NFE for the first 3 steps, we compare with PNDM only when NFE is greater than 12. However, our proposed iPNDM can work when NFE is less than 12.
• We include the comparison against A-DDIM (Bao et al., 2022) with its official checkpoints and implementation in App. H.5.
• We only provide qualitative results for the text-to-image experiment with the pre-trained model (Ramesh et al., 2022).
• We include the proposed r-th order iPNDM in App. H.2. We use r = 3 by default unless stated otherwise.

H.2 IMPROVED PNDM

By Eq. (11), PNDM can be viewed as a combination of the Exponential Integrator and a linear multistep method based on the Euler method. More specifically, it uses a linear combination of multiple score evaluations instead of only the latest score evaluation. PNDM follows the steps

where

The coefficients in Eq. (36) are derived based on black-box ODE Euler discretization with fixed step size. Similarly, there exist lower order approximations

Originally, PNDM uses Runge-Kutta for the warm start and costs 4 score network evaluations for each of the first 3 steps. To reduce the NFE in sampling, the improved PNDM (iPNDM) uses lower-order multistep methods for the warm start. We summarize iPNDM in Algo 4. We include a comparison with tAB-DEIS in Tabs. 4 and 5; we adopt a uniform step size for tAB-DEIS when NFE = 50 on CIFAR10, as we find that its performance is slightly better than with the quadratic one.
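The warm-start idea can be illustrated with a short sketch. It assumes that the linear combinations referred to above use the classical Adams-Bashforth weights for a uniform step size (the fourth-order weights 55, −59, 37, −9 over 24 are the ones used by PNDM in Liu et al. (2022)); the function name, the buffer layout, and the default r = 3 are choices made for this example. The order is capped by the number of evaluations already stored, so the first steps need no additional network calls.

```python
# Classical Adams-Bashforth weights for a uniform step size, newest evaluation first.
# Assumption: these are the coefficients meant by the multistep combinations above.
AB_COEFFS = {
    1: [1.0],
    2: [3 / 2, -1 / 2],
    3: [23 / 12, -16 / 12, 5 / 12],
    4: [55 / 24, -59 / 24, 37 / 24, -9 / 24],
}

def combined_eps(eps_buffer, r=3):
    """Combine stored noise predictions; eps_buffer[-1] is the newest one.

    The effective order is min(len(eps_buffer), r), which is the lower-order
    warm start: one stored evaluation gives order 1, two give order 2, and so on.
    """
    order = min(len(eps_buffer), r)
    recent = eps_buffer[-order:][::-1]  # newest first, to match AB_COEFFS
    return sum(c * e for c, e in zip(AB_COEFFS[order], recent))

# Usage: the combined prediction would replace eps_theta(x_t, t) in the update of Eq. (11).
first_step = combined_eps([2.0])            # order 1, plain DDIM-style step
second_step = combined_eps([4.0, 5.0])      # order 2, linear extrapolation
third_step = combined_eps([1.0, 4.0, 9.0])  # order 3, quadratic extrapolation
```

Each weight vector sums to one, so a constant prediction is returned unchanged; for a buffer that follows a polynomial of degree below the order, the result equals the average of that polynomial over the next unit step.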

