QUASI-TAYLOR SAMPLERS FOR DIFFUSION GENERATIVE MODELS BASED ON IDEAL DERIVATIVES

Anonymous authors. Paper under double-blind review.

Abstract

Diffusion generative models have emerged as a new challenger to popular deep neural generative models such as GANs, but have the drawback that they often require a huge number of neural function evaluations (NFEs) during synthesis unless sophisticated sampling strategies are employed. This paper proposes new efficient samplers based on numerical schemes derived from the familiar Taylor expansion, which directly solves the ODE/SDE of interest. In general, it is not easy to compute the derivatives required in higher-order Taylor schemes, but in the case of diffusion models this difficulty is alleviated by a trick that the authors call "ideal derivative substitution," in which the higher-order derivatives are replaced by tractable ones. To derive the ideal derivatives, the authors argue that the "single point approximation," in which the true score function is approximated by a conditional one, holds in many cases, and consider the derivatives under this approximation. Applying the quasi-Taylor samplers thus obtained to image generation tasks, the authors experimentally confirmed that the proposed samplers can synthesize plausible images in a small number of NFEs, and that their performance is better than or on par with DDIM and Runge-Kutta methods. The paper also discusses the relationship between the proposed samplers and the existing ones mentioned above.

Generative modeling based on deep neural networks is an important research subject for both fundamental and applied purposes, and has been a major trend in machine learning studies for several years. To date, various types of neural generative models have been studied, including



Starting from a certain random variable x_T and evolving the R-SDE backward in time, we may obtain an x_0 which follows p(x_0, 0 | x_T, T) (i.e., the solution of the R-FPE, eq. (4)). Therefore, if the initial density p(x_0, 0) of the forward dynamics eq. (2) is the true data density, we may use this mechanism as a generative model to draw a new sample x_0 from it. Another approach is based on the FPE eq. (2). By formally eliminating the diffusion term of the FPE for the forward process, we can derive another backward FPE (see also § E.3.1). Being diffusion-free, the backward FPE yields a deterministic ODE, called the Probability Flow ODE (PF-ODE) (Song et al., 2020b), which is an example of neural ODEs (Chen et al., 2018). The population density obtained by evolving this system is exactly the same as that of the above R-SDE.

PF-ODE coefficients: f̃(x_t, t) := f(x_t, t) - (1/2) g(t)² ∇_{x_t} log p(x_t, t),  g̃(t) = 0.   (7)

Some extensions of this framework are as follows. Dockhorn et al. (2021) introduced a velocity variable, considering Hamiltonian dynamics. Another extension is the introduction of a conditioning parameter, and guidance techniques using it (Dhariwal & Nichol, 2021; Ho & Salimans, 2021; Choi et al., 2021) to steer the dynamics toward a specific class of images, which has achieved remarkable results in text-to-image tasks (Nichol et al., 2021; Ramesh et al., 2022).

Variance-Preserving Model (VP-SDE Model): The solution of the unconditioned FPE is written as the convolution of the initial density p(x_0, 0) with the fundamental solution, or heat kernel, p(x_t, t | x_0, 0), which is the solution of the conditional FPE under the assumption that the initial density is a delta function, p(x_0, 0) = δ(x_0 - x*_0). Although it is still intractable to solve this problem in general, a well-known exception is the (time-dependent) Ornstein-Uhlenbeck (OU) process, where f(x_t, t) = -(1/2) β_t x_t and g(x_t, t) = √β_t.
β_t = β(t) is a non-negative continuous function. The specific form of the diffusion coefficient β_t has some options: the simplest would be a linear function, and another would be the cosine schedule proposed in (Nichol & Dhariwal, 2021); see also § D. In any case, for the OU process the heat kernel is simply written as follows:

p(x_t, t | x_0, 0) = N(x_t | √(1 - σ_t²) x_0, σ_t² I), where σ_t² = 1 - exp(-∫₀ᵗ β_s ds).   (8)

Hereafter, we denote the noise variance by ν_t := σ_t². (In some literature, the signal level ᾱ_t := 1 - σ_t² is used as a basic parameter instead of the variance.) This model is referred to as the variance-preserving (VP) model by Song et al. (2020b). It has good properties, such as the scale of the data ‖x_t‖_2 being almost homogeneous, which is advantageous in neural models. However, the variance-exploding (VE) model (Song et al., 2020b), in which the norm increases, is also practicable, and the theory can be developed in a similar manner.

Training Objective: In diffusion-based generative models, one estimates the score function ∇_{x_t} log p(x_t, t) = ∇_{x_t} log E_{p(x_0,0)}[p(x_t, t | x_0, 0)] by a neural network S_θ(x_t, t). This sort of learning has been referred to as score matching (Hyvärinen & Dayan, 2005; Vincent, 2011). However, the exact evaluation of this training target is clearly intractable because of the expectation E_{p(x_0,0)}[•], so it has been common to consider a variational Bayesian surrogate loss; Ho & Salimans (2021) showed that the following loss function approximates the negative ELBO:

L := E[ ‖-√ν_t ∇_{x_t} log p(x_t, t | x_0, 0) - S_θ(x_t, t)‖²_2 ] = E[ ‖(x_t - √(1-ν_t) x_0)/√ν_t - S_θ(x_t, t)‖²_2 ]   (9)
  = E[ ‖w - S_θ(√(1-ν_t) x_0 + √ν_t w, t)‖²_2 ],   (10)

where the expectation in eq. (10) is taken w.r.t. x_0 ∼ D, w ∼ N(0, I), and t ∼ Uniform([0, T]). Some variants of the score matching objective have also been studied. For example, Chen et al. (2020) reported that the L1 loss gave better results than the L2 loss in speech synthesis.
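As a concrete illustration, the noise-prediction form of the loss in eq. (10) can be sketched as follows (a minimal NumPy sketch; `score_net` is a hypothetical stand-in for S_θ, not the paper's network):

```python
import numpy as np

def loss_sample(score_net, x0, nu_t, t, rng):
    """One Monte-Carlo term of eq. (10):
    || w - S_theta(sqrt(1 - nu_t) x0 + sqrt(nu_t) w, t) ||^2, with w ~ N(0, I)."""
    w = rng.standard_normal(x0.shape)                    # noise sample
    x_t = np.sqrt(1.0 - nu_t) * x0 + np.sqrt(nu_t) * w   # noised data point
    return np.sum((w - score_net(x_t, t)) ** 2)
```

For a single-point dataset, the minimizer is the conditional score of eq. (9), S_θ(x_t, t) = (x_t - √(1-ν_t) x_0)/√ν_t, for which this loss term vanishes.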
Also, Kingma et al. (2021) argued that a weighted loss with SNR-based weights improves the performance. It should be noted that the above loss function will actually be very close to the ideal score matching loss in practice, where the probability is not conditioned on x_0, i.e.,

L_ideal = E[ ‖-√ν_t ∇_{x_t} log p(x_t, t) - S_θ(x_t, t)‖²_2 ].

This is because there almost always exists a point x_0 on the data manifold such that ∇_{x_t} log p(x_t, t) ≈ ∇_{x_t} log p(x_t, t | x_0, 0) holds with very high accuracy in very high-dimensional cases, because of the well-known "log-sum-exp ≈ max" law. For more details, see § 3.3 and § A.

Sampling Schemes for R-SDE and PF-ODE: The network S_θ(x_t, t) thus obtained is expected to closely approximate -√ν_t ∇_{x_t} log p(x_t, t), and we may use it in eq. (5). One of the simplest numerical schemes for solving SDEs is the Euler-Maruyama method (Maruyama, 1955, Theorem 1), as follows, and many diffusion generative models actually use it:

Euler-Maruyama: x_{t-h} ← x_t - h f̄(x_t, t) + √h g(t) w, where w ∼ N(0, I),   (12)

where h > 0 is the step size. The error of the Euler-Maruyama method is of order O(√h) in general, though it is actually O(h) in our case; this is because ∇_{x_t} g(t) = 0. As a better solver for the R-SDE, the Predictor-Corrector (PC) sampler was proposed in (Song et al., 2020b). The PC sampler outperformed the predictor-only strategy, but it requires many NFEs in the correction process, so we exclude it from our discussion. Another R-SDE solver is the one proposed by Jolicoeur-Martineau et al. (2021), whose NFE per refinement step is 2. On the other hand, there are also deterministic samplers for PF-ODE eqs. (5), (7), as follows:

Euler: x_{t-h} ← x_t - h f̃(x_t, t),   (13)
Runge-Kutta: x_{t-h} ← x_t - h Σ_{i=1}^m b_i k_i, where k_i = f̃( x_t - h Σ_{j=1}^{i-1} a_{ij} k_j, t - h c_i ),   (14)

where {a_ij}, {b_i}, {c_i} are the coefficients of the Runge-Kutta (RK) method (see § E.5).
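For concreteness, one Euler-Maruyama refinement step (eq. (12)) for the VP model can be sketched as below. This is a hedged sketch, not the paper's code; `score_net`, `beta`, and `nu` are placeholder callables, and the R-SDE drift follows eq. (6) as written out in eq. (20):

```python
import numpy as np

def euler_maruyama_step(x_t, t, h, beta, nu, score_net, rng):
    """x_{t-h} <- x_t - h * fbar(x_t, t) + sqrt(h) * g(t) * w,  w ~ N(0, I),
    with the R-SDE drift fbar(x, t) = -(beta_t/2) x + (beta_t / sqrt(nu_t)) S_theta(x, t)
    and diffusion g(t) = sqrt(beta_t)."""
    b, n = beta(t), nu(t)
    fbar = -0.5 * b * x_t + (b / np.sqrt(n)) * score_net(x_t, t)
    w = rng.standard_normal(x_t.shape)
    return x_t - h * fbar + np.sqrt(h * b) * w
```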
The error of the Euler method is O(h), and that of the RK method is O(hᵖ), p ≤ m, in general (Press et al., 2007, § 16). Another deterministic sampler is DDIM (Song et al., 2020a, Eq. (13)), which can also be understood as a PF-ODE solver (Salimans & Ho, 2022). Its NFE per step is only 1, and it is capable of efficiently generating samples:

DDIM: x_{t-h} ← (α_{t-h}/α_t) x_t + ( σ_{t-h} - (α_{t-h}/α_t) σ_t ) S_θ(x_t, t),   (15)

where α_t := √(1-ν_t) and σ_t := √ν_t. In addition, in work concurrent with ours, Lu et al. (2022) proposed the DPM-solver, which is based on a Taylor expansion of the PF-ODE. However, as the gradient is evaluated at several different points, its NFE per step is greater than 1 in general. Liu et al. (2022) proposed a sampler based on the linear multistep method, in which the NFE/step is reduced to 1 except for the initial 3 steps. Another PF-ODE solver is DEIS (Zhang & Chen, 2022), which is based on the exponential integrator with some non-trivial approximations such as polynomial interpolation of the score function. Other techniques aimed at faster sampling include the following. Song & Ermon (2020) proposed a variety of techniques to accelerate sampling. Watson et al. (2021) proposed a DP-based optimization method to tune noise schedules for faster sampling. Luhman & Luhman (2021) and Salimans & Ho (2022) proposed distilling a pretrained teacher model into a student model that can predict several teacher steps in a single step, which is efficient during sampling but requires extra training for distillation. Bao et al. (2022a; b) derived analytic expressions for the reverse dynamics to enable faster sampling.

3 PROPOSED METHOD: QUASI-TAYLOR SAMPLERS

3.1. MOTIVATION: HIGHER-ORDER STRAIGHTFORWARD SOLVERS FOR R-SDE AND PF-ODE

As mentioned above, DDIM already exists as an efficient solver for PF-ODE, but it has only been justified as a PF-ODE solver up to first-order terms (Song et al., 2020a; Salimans & Ho, 2022), and it is not clear enough whether it can be considered a higher-order solver for PF-ODE. Some other techniques (Lu et al., 2022; Liu et al., 2022; Zhang & Chen, 2022) were designed as higher-order PF-ODE solvers, though their derivations are rather sophisticated and less simple. Since PF-ODE and R-SDE provide the basis for diffusion generative models, it would be beneficial to develop samplers that directly solve them through intuitive and straightforward arguments. From these motivations, we propose a simple but efficient sampler based on the Taylor expansion, a very basic technique familiar to many researchers and practitioners. In general, Taylor methods are not very popular as numerical schemes because they require higher-order derivatives, which are not always tractable. In diffusion models, however, the derivatives can be evaluated easily and efficiently, albeit approximately. The validity of this approximation requires some consideration (see § A, § B), but once it is accepted, an efficient sampler can be derived simply by substituting this approximation formula into the Taylor series. This section describes the details of the idea and derives solvers for both PF-ODE and R-SDE. The entire sampling procedures are summarized in § F.

3.2. TAYLOR SCHEME FOR ODE AND ITÔ-TAYLOR SCHEME FOR SDE

Taylor Scheme for Deterministic Systems: For simplicity, we consider the 1-dim case here, but it is easily generalized to multidimensional cases (see § E.1.1). Given an ODE ẋ_t = a(x_t, t), where the function a is sufficiently smooth, we can consider the Taylor expansion of the path x_t using the differential operator L := ∂_t + a(x_t, t) ∂_{x_t}, as follows. Ignoring the o(hᵖ) terms of the series, we obtain a numerical scheme of order p:

x_{t+h} = x_t + h a(x_t, t) + (h²/2!) L a(x_t, t) + (h³/3!) L² a(x_t, t) + ···.

Itô-Taylor Scheme for Stochastic Systems: In stochastic systems, the Taylor expansion requires modification because of the relation E[dB_t²] = dt. If x_t obeys a stochastic system dx_t = a(x_t, t) dt + b(x_t, t) dB_t, then the path is written as a stochastic version of the Taylor series, which is often called the Itô-Taylor expansion, a.k.a. the Wagner-Platen expansion (Platen & Wagner, 1982); (Kloeden et al., 1994, § 2.3.B); (Särkkä & Solin, 2019, § 8.2). The Itô-Taylor expansion is based on the following differential operators L, G, which arise from Itô's formula (Itô, 1944):

L := ∂_t + a(x, t) ∂_x + (1/2) b(x, t)² ∂²_x,  G := b(x, t) ∂_x.   (17)

In (Kloeden & Platen, 1992), a number of higher-order numerical schemes for SDEs based on the Itô-Taylor expansion are presented. One of the simplest of them is as follows; see also § E.1.2.

Theorem 1 (Kloeden & Platen (1992, § 14.2): an Itô-Taylor scheme of weak order β = 2). Let x_t obey the above SDE, and let the differential operators L, G be given by eq. (17). Then the following numerical scheme converges weakly with order β = 2 (see § E.4). Furthermore, in the special case where G²b ≡ 0, strong convergence of order γ = 1.5 is also guaranteed (Kloeden & Platen, 1992, § 10.4).
x_{t+h} ← x_t + h a + w̄_t b + ((w̄_t² - h)/2) G b + (h²/2) L a + (w̄_t h - z̄_t) L b + z̄_t G a,

where w̄_t = √h w_t and z̄_t = h√h z_t are correlated Gaussian random variables, with w_t = u_1 and z_t = (1/2) u_1 + (1/(2√3)) u_2, where u_1, u_2 ∼ N(0, 1) (i.i.d.). The notations a, L a, etc. are abbreviations for a(x_t, t), (L a)(x_t, t), etc.
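A minimal scalar implementation of this scheme may clarify how the terms combine. This is an illustrative sketch, not the paper's sampler: here L and G are evaluated by finite differences, whereas the proposed method replaces them with closed-form ideal derivatives.

```python
import numpy as np

def ito_taylor_weak2_step(x, t, h, a, b, rng, eps=1e-5):
    """One weak order-2 Ito-Taylor step for a scalar SDE dx = a(x,t)dt + b(x,t)dB.
    L f = f_t + a f_x + (b^2/2) f_xx and G f = b f_x are approximated by
    central finite differences (a sketch; accuracy is limited by eps)."""
    dx  = lambda f: (f(x + eps, t) - f(x - eps, t)) / (2 * eps)
    dxx = lambda f: (f(x + eps, t) - 2 * f(x, t) + f(x - eps, t)) / eps**2
    dt  = lambda f: (f(x, t + eps) - f(x, t - eps)) / (2 * eps)
    L = lambda f: dt(f) + a(x, t) * dx(f) + 0.5 * b(x, t)**2 * dxx(f)
    G = lambda f: b(x, t) * dx(f)
    u1, u2 = rng.standard_normal(2)
    wb = np.sqrt(h) * u1                                       # \bar{w}_t
    zb = h * np.sqrt(h) * (0.5 * u1 + u2 / (2 * np.sqrt(3)))   # \bar{z}_t
    return (x + h * a(x, t) + wb * b(x, t)
            + 0.5 * (wb**2 - h) * G(b)
            + 0.5 * h**2 * L(a)
            + (wb * h - zb) * L(b)
            + zb * G(a))
```

With b ≡ 0 the stochastic terms vanish and the step reduces to the deterministic 2nd-order Taylor step x + h a + (h²/2) L a.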

3.3. SINGLE POINT APPROXIMATION OF THE SCORE FUNCTION

Before proceeding, let us introduce the single point approximation of the score function: for ∇_{x_t} log p(x_t, t), there almost certainly exists some point x_0 on the data manifold such that the following approximation holds,

∇_{x_t} log p(x_t, t) = ∇_{x_t} log ∫ p(x_t, t | x_0, 0) p(x_0, 0) dx_0 ≈ ∇_{x_t} log p(x_t, t | x_0, 0).   (19)

To date, this approximation has often been understood as a tractable variational surrogate. However, the error between the integral and the single point approximation is actually very small in practical scenarios. More specifically, the following facts can be shown under some assumptions.

1. The relative L2 distance between ∇_{x_t} log p(x_t, t) and ∇_{x_t} log p(x_t, t | x_0, 0) is bounded above by √((1-ν_t)/ν_t) for any point x_0 on the "data manifold" in practical scenarios.
2. When the noise level is low (ν_t ≈ 0) and the data space is sufficiently high-dimensional, points distant from x_t do not contribute to the integral. If the data manifold is locally a k-dim subspace of the entire d-dim data space, where 1 ≪ k ≪ d, then the relative L2 distance is bounded above by around 2√(k/d).

Of course, the single point approximation is not always valid. In fact, the approximation tends to break down when the noise level ν_t is around 0.9 (SNR = (1-ν_t)/ν_t around 0.1). In this region, the single point approximation can deviate from the true gradient by about 20% in some cases. Conversely, however, it can also be said that the error is as small as this level even in the worst empirical cases. For more details on this approximation, see § A.
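A small numerical illustration of claim 1 is easy to set up. The following toy check compares the exact mixture score of a discrete dataset with the single-point (conditional) score at a high noise level; the random points of roughly unit norm are a stand-in assumption for the "data manifold", not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, nu = 200, 50, 0.99                          # high noise level nu_t
x0s = rng.standard_normal((n, d)) / np.sqrt(d)    # toy "data manifold" points
x0 = x0s[0]
x_t = np.sqrt(1 - nu) * x0 + np.sqrt(nu) * rng.standard_normal(d)

# exact mixture score vs. single-point (conditional) score
diffs = x_t - np.sqrt(1 - nu) * x0s               # (n, d)
logq = -np.sum(diffs**2, axis=1) / (2 * nu)
q = np.exp(logq - logq.max()); q /= q.sum()       # weights q(x0 | x_t)
score_true = -(q[:, None] * diffs).sum(0) / nu
score_single = -(x_t - np.sqrt(1 - nu) * x0) / nu

cos = score_true @ score_single / (np.linalg.norm(score_true) * np.linalg.norm(score_single))
rel = np.linalg.norm(score_true - score_single) / np.linalg.norm(score_single)
```

In this regime the relative error `rel` stays well below the bound √((1-ν_t)/ν_t) and the cosine similarity is nearly 1.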

3.4. IDEAL DERIVATIVE SUBSTITUTION

In order to adapt the above Taylor schemes to our problem setting, where the base SDE is eq. (5) and f̄, f̃ are given by eqs. (6), (7), we need to consider the following differential operators. Note that since time evolves backward in our case, the temporal derivative appears as -∂_t:

L̃ = -∂_t - f̃(x_t, t) · ∇_{x_t},  L̄ = -∂_t - f̄(x_t, t) · ∇_{x_t} + (β_t/2) Δ_{x_t},  Ḡ = √β_t (1 · ∇_{x_t}),

where

f̃(x_t, t) = -(β_t/2) x_t + (β_t/(2√ν_t)) S_θ(x_t, t),  f̄(x_t, t) = -(β_t/2) x_t + (β_t/√ν_t) S_θ(x_t, t).   (20)

It is not easy in general to evaluate expressions involving so many derivatives. Indeed, for example, L̃(-f̃) contains derivatives of the learned score function, viz. ∂_t S_θ(x_t, t) and (• · ∇_{x_t}) S_θ(x_t, t), which are costly to evaluate exactly, whether by finite differences (as in (Lu et al., 2022)), by back-propagation, or in the JAX paradigm (Bradbury et al., 2018), because they eventually require extra evaluations of a deeply nested function other than S_θ(x_t, t), as well as extra memory consumption. Fortunately, however, by using the trick that the authors call "ideal derivative substitution," we may write all of the derivatives as a simple combination of known values, consisting only of x_t, S_θ(x_t, t), ν_t, β_t and the derivatives of β_t, so that no extra computation is needed. Since the score function has the single point approximation eq. (19), we may assume that the derivatives should ideally satisfy the following equalities. For the derivation, see § B.1.

Conjecture 1 (Ideal Derivatives). Under the assumptions in § A (i.e., the data space R^d is sufficiently high-dimensional, d ≫ 1; the data manifold M ⊂ R^d is also sufficiently high-dimensional but much smaller than the entire space, 1 ≪ dim M ≪ d; M is bounded; M is locally sufficiently smooth; and the variance parameter ν_t is close to 0 or 1), it is likely that the following approximations hold, where a ∈ R^d is an arbitrary vector. We call them the "ideal derivatives".
(a · ∇_{x_t}) S_θ(x_t, t) = (1/√ν_t) a,  -∂_t S_θ(x_t, t) = -(β_t/(2√ν_t)) ( x_t - S_θ(x_t, t)/√ν_t ).

To confirm the accuracy of this approximation, we compared empirical and ideal derivatives using MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky, 2009). As a result, it was confirmed that the approximation of the spatial derivative, i.e., (a · ∇), is usually very accurate; the cosine similarity between the empirical and ideal derivatives is nearly always > 0.99 (Figure 10). On the other hand, for the time derivative ∂_t, the approximation is quite accurate when the time parameter t (and the variance ν_t) is small, but the error increases as t (and ν_t) becomes larger (Figure 9). See § B.2 for more details.
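The spatial part of Conjecture 1 can be checked directly in the single-point case, where the conditional score is available in closed form; a central difference along an arbitrary direction a then recovers a/√ν_t. A minimal sketch (the closed-form `S` below is the conditional score of a one-point dataset, an assumption used only for this check):

```python
import numpy as np

rng = np.random.default_rng(0)
d, nu = 8, 0.5
x0 = rng.standard_normal(d)

def S(x):
    """Exact score net for a single-point dataset: S(x) = (x - sqrt(1-nu) x0) / sqrt(nu)."""
    return (x - np.sqrt(1 - nu) * x0) / np.sqrt(nu)

x = rng.standard_normal(d)
a = rng.standard_normal(d)
eps = 1e-6
jvp = (S(x + eps * a) - S(x - eps * a)) / (2 * eps)  # (a . grad) S by central difference
ideal_jvp = a / np.sqrt(nu)                          # ideal spatial derivative
```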

3.5. QUASI-TAYLOR AND QUASI-ITÔ-TAYLOR SCHEMES WITH IDEAL DERIVATIVES

As we can see in § B.2, the ideal derivative approximation is sometimes very accurate and sometimes not. In any case, however, the error in the ideal derivative only affects the second- and higher-order terms of the Taylor series, and it will not be the dominant error in the whole. As there is an overall correlation between the true and ideal derivatives, the advantages will outweigh the disadvantages on average, and we can regularly use this approximation on a speculative basis, even though there exist some cases where the approximation is not accurate. If we accept the ideal derivative approximation, we can formally compute the symbolic expressions for the derivatives L̃(-f̃), L̄(-f̄), L̄(g), Ḡ(-f̄) and Ḡ(g) that appear in the Taylor and Itô-Taylor series by routine calculations, which can easily be automated by computer algebra systems such as SymPy (Meurer et al., 2017), as shown in § B.3. By substituting the symbolic expressions thus obtained into the above Taylor series, we can derive Taylor schemes for both PF-ODE and R-SDE as follows.

Algorithm 1 (Quasi-Taylor Sampler with Ideal Derivatives for PF-ODE). Starting from a Gaussian noise x_T ∼ N(0, I), iterate the following refinement steps until x_0 is obtained:

x_{t-h} = ρ_{t,h} x_t + µ_{t,h} S_θ(x_t, t)/√ν_t, where

ρ_{t,h} = 1 + (β_t h)/2 + (h²/4)( β_t²/2 - β̇_t ) + (h³/4)( β_t³/12 - β_t β̇_t/2 + β̈_t/3 ) + ···,
µ_{t,h} = -(β_t h)/2 + (h²/4)( β̇_t - β_t²/(2ν_t) ) + (h³/4)( β_t³(-ν_t² + 3ν_t - 3)/(12ν_t²) + β_t β̇_t/(2ν_t) - β̈_t/3 ) + ···,

and β̇_t, β̈_t denote the first and second time derivatives of β_t. Using terms up to O(h²), the sampler has 2nd-order convergence (henceforth referred to as Taylor 2nd), and using terms up to O(h³), the sampler is 3rd-order convergent (similarly, Taylor 3rd). If we use only the terms up to O(h), the algorithm is the same as the Euler method.

Algorithm 2 (Quasi-Itô-Taylor Sampler with Ideal Derivatives for R-SDE). Starting from a Gaussian noise x_T ∼ N(0, I), iterate the following refinement steps until x_0 is obtained:
x_{t-h} = ρ_{t,h} x_t + µ_{t,h} S_θ(x_t, t)/√ν_t + n_{t,h}, where

ρ_{t,h} = 1 + (β_t h)/2 + (h²/4)( β_t²/2 - β̇_t ),  µ_{t,h} = -β_t h + (β̇_t h²)/2,
n_{t,h} = √(β_t h) w_t + h^{3/2} [ (-β̇_t/(2√β_t))(w_t - z_t) + ( β_t^{3/2}(ν_t - 2)/(2ν_t) ) z_t ].

The Gaussian variables w_t and z_t have dimension-wise correlations, and each dimension is sampled similarly to Theorem 1.

Computation Cost: At first glance, these algorithms may appear very complex. However, the computational complexity hardly increases compared to the Euler or Euler-Maruyama methods, because almost all of the computational cost is accounted for by the neural network S_θ(x_t, t), while the costs of the scalar coefficients ρ_{t,h}, µ_{t,h} and of the noise generation n_{t,h} are almost negligible. It should also be noted that these scalar values can be pre-computed and stored in memory before synthesis. Thus the computational complexity of these methods is practically equal to that of the Euler, Euler-Maruyama, and DDIM methods.

Error from the Exact Solution of PF-ODE: The numerical error of the Quasi-Taylor method from the exact solution grows with the following factors: (1) the truncation error of the Taylor series in each step, i.e., O(h^{p+1}); (2) the number of steps, i.e., O(1/h); (3) the training and generalization error of the score function, i.e., ≈ L; and (4) the average error between the true and ideal derivatives of the score function, =: δ. If factors 3 and 4 were zero, the numerical error would be of order O(hᵖ). Otherwise, the expected numerical error is roughly evaluated as

error = O( h⁻¹ ( hL + h²(L + δ) + h³(L + δ) + ··· + h^{p+1} ) ) = O( L + h(L + δ) + h²(L + δ) + ··· + hᵖ ).

That is, the error of the Euler method is O(L + h), that of the Heun method (2nd-order Runge-Kutta) is O(L + hL + h²), and that of the Taylor-2nd method is O(L + h(L + δ) + h²). As long as L, δ > 0, the predominant O(h) term will not disappear.
Therefore, the overall order of the error will not decrease even if we increase the order of the Taylor series beyond p ≥ 3. Nevertheless, beyond such an order evaluation, the specific coefficients of the higher-order terms can still affect the performance, which should be validated empirically.
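To make the cost argument concrete, the pre-computable scalars of Algorithms 1 and 2 can be sketched as follows for the default linear schedule β_t = 0.1 + 19.9t used in § 4 (so β̇_t = 19.9 and β̈_t = 0; this is a sketch under that schedule assumption, with Algorithm 1 truncated at O(h²), i.e., Taylor 2nd):

```python
import numpy as np

BETA0, BETA1 = 0.1, 19.9                 # beta_t = 0.1 + 19.9 t (schedule of Sec. 4)
beta  = lambda t: BETA0 + BETA1 * t
dbeta = lambda t: BETA1                  # beta'_t; beta''_t = 0 for this schedule
nu    = lambda t: 1 - np.exp(-BETA0 * t - BETA1 * t**2 / 2)

def taylor2_coeffs(t, h):
    """rho_{t,h}, mu_{t,h} of Alg. 1 truncated at O(h^2) ("Taylor 2nd")."""
    b, db, n = beta(t), dbeta(t), nu(t)
    rho = 1 + b * h / 2 + (h**2 / 4) * (b**2 / 2 - db)
    mu = -b * h / 2 + (h**2 / 4) * (db - b**2 / (2 * n))
    return rho, mu

def ito_taylor_noise(t, h, d, rng):
    """Noise term n_{t,h} of Alg. 2; per dimension w = u1, z = u1/2 + u2/(2 sqrt(3))."""
    b, db, n = beta(t), dbeta(t), nu(t)
    u1, u2 = rng.standard_normal((2, d))
    w, z = u1, u1 / 2 + u2 / (2 * np.sqrt(3))
    return (np.sqrt(b * h) * w
            + h**1.5 * (-db / (2 * np.sqrt(b)) * (w - z)
                        + b**1.5 * (n - 2) / (2 * n) * z))
```

Since ρ_{t,h}, µ_{t,h} and the noise scales depend only on t and h, they can be tabulated once for a fixed step schedule, so the per-step cost remains one evaluation of S_θ.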

4. IMAGE SYNTHESIS EXPERIMENT

Experimental Configuration: In this section, we conduct experiments to verify the effectiveness of the methods developed in this paper. Specifically, we compare the performance of the Euler scheme (eq. (13)), Taylor 2nd & Taylor 3rd (Alg. 1), DDIM (Song et al., 2020a), and the Runge-Kutta methods (Heun and RK4, § E.5; these are less efficient than the others because of their NFEs per step) for PF-ODE, as well as the Euler-Maruyama scheme (eq. (12)) and Itô-Taylor (Alg. 2) for R-SDE. The datasets we used were CIFAR-10 (32 × 32) (Krizhevsky, 2009) and CelebA (64 × 64) (Liu et al., 2015). The network structure was not novel but based on an existing open-source implementation; we used the "NCSN++" implemented in the official PyTorch code by Song et al. (2020b). The network consisted of 4 levels of resolution, with the feature dimensions of the levels being 128 → 128 → 256 → 256 → 256. Each level consisted of BigGAN-type ResBlocks, and the number of ResBlocks per level was 8 (CIFAR-10) and 4 (CelebA). The loss function was the unweighted L2 loss, as in (Ho et al., 2020). The optimizer was Adam (Kingma & Ba, 2014). The machine used for training was an in-house Linux server dedicated to medium-scale machine learning training, with four GPUs (NVIDIA Tesla V100). The batch size was 256. The number of training steps was 0.1 M, and training took about a day for each dataset. The noising schedule was also the same as an existing one, the default configuration of VP-SDE (Song et al., 2020b): β_t = 0.1 + 19.9t and ν_t = 1 - exp(-0.1t - 9.95t²) (eq. (76)). The integration duration was T = 1, and the step size h was constant, i.e., h = T/N, where N is the number of refinement steps. As a quality assessment metric, we used the Fréchet Inception Distance (FID) (Heusel et al., 2017).
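The noising schedule above satisfies the VP relation ν̇_t = β_t(1 - ν_t), i.e., ν_t = 1 - exp(-∫₀ᵗ β_s ds), which can be sanity-checked numerically (a small sketch, not part of the experimental code):

```python
import numpy as np

beta = lambda t: 0.1 + 19.9 * t                      # default VP-SDE schedule
nu   = lambda t: 1 - np.exp(-0.1 * t - 9.95 * t**2)  # nu_t = 1 - exp(-int_0^t beta_s ds)

t, eps = 0.5, 1e-6
nu_dot = (nu(t + eps) - nu(t - eps)) / (2 * eps)     # numerical d(nu_t)/dt
vp_rhs = beta(t) * (1 - nu(t))                       # beta_t (1 - nu_t)
```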
To evaluate FIDs, we used the pretrained Inception v3 checkpoint (Szegedy et al., 2016), and resized all images to 299 × 299 × 3 by bilinear interpolation before feeding them to the Inception network. For each condition, 10,000 images were randomly generated to compute the FID score. Note that in this experiment the computational resources for training were limited, and training was stopped before full convergence (only 0.1 M steps, whereas other papers often train much longer, e.g., 1.3 M steps in (Song et al., 2020b)). Therefore, relative comparisons between samplers should be observed, rather than directly comparing these FID values with those reported in other papers.

Results: Figure 1 and Figure 2 show random samples from each sampler. More examples are available in § G. The deterministic samplers considered in this paper generated plausible images much faster than the vanilla Euler-Maruyama sampler. Figure 3a and Figure 3b report the FID scores. From these figures, the following observations can be made. First, the proposed Quasi-Taylor methods perform about the same as or slightly better than DDIM. The reason for this is discussed in the next section (§ 5). We also found that the Runge-Kutta methods reduce FID in fewer steps overall. However, they also hit bottom sooner. This may be due to the effect of the singularity at the time origin (see § D) in the final step. (This can be seen in Figure 16: in the second column from the right, the Runge-Kutta methods produce images similar to the other deterministic samplers, but the rightmost ones seem slightly noisier than the others.) Even though the ideal derivatives are only approximations and contain some errors, the convergence destinations of the Quasi-Taylor methods were almost the same as those of the Runge-Kutta methods.
This suggests that the error in the ideal derivatives is actually hardly a problem, because in the regions where the approximation error is large, the state x_t is noisy to begin with (e.g., the left 2/3 of the figures in Figure 16), and the approximation error is negligible compared to the noise that was originally there. The proposed stochastic sampler (Itô-Taylor) also showed sufficiently competitive results, in terms of both FID scores and visual impression. Comparison of the figures in § G (e.g., Figure 21) confirms that the Itô-Taylor method empirically reaches almost the same target as the Euler-Maruyama method much more accurately, and it can be expected to be a safe alternative to the Euler-Maruyama method when stochastic sampling is important.

5. DISCUSSION: RELATIONSHIP WITH DDIM

In the above experiment, the performance of the proposed Quasi-Taylor methods was found to be almost equivalent to that of DDIM. In fact, despite having distinctly different derivation logics, the proposed method and DDIM actually agree, at least up to the 3rd-order terms in h. Therefore, it is not surprising that the results are similar; and the smaller h is, the closer the results are. This can be quickly verified by Taylor-expanding the coefficients of eq. (15), i.e., α_{t-h}/α_t and (σ_{t-h} - (α_{t-h}/α_t) σ_t), w.r.t. h. Although it is tedious to perform this calculation by hand, computer algebra systems, e.g., SymPy, immediately calculate it. For this computation, see § C. This finding, that truncating DDIM at the 2nd or 3rd order of h yields exactly the same algorithms as the proposed Quasi-Taylor methods, may be a useful insight for DDIM users, even if it does not lead them to switch their regular sampler from DDIM to Quasi-Taylor. That is, it offers the option of truncating the higher-order terms of DDIM.
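The agreement can also be checked numerically rather than symbolically. With α_t = √(1-ν_t), the DDIM coefficient α_{t-h}/α_t should match ρ_{t,h} truncated at O(h²) up to an O(h³) residual, so halving h should shrink the residual by roughly 2³ (a sketch assuming the default schedule of § 4):

```python
import numpy as np

nu    = lambda t: 1 - np.exp(-0.1 * t - 9.95 * t**2)
alpha = lambda t: np.sqrt(1 - nu(t))       # DDIM signal level, alpha_t = sqrt(1 - nu_t)

t = 0.5
b, db = 0.1 + 19.9 * t, 19.9               # beta_t and beta'_t

def rho2(h):
    """rho_{t,h} of Alg. 1 truncated at O(h^2)."""
    return 1 + b * h / 2 + (h**2 / 4) * (b**2 / 2 - db)

# residual between the DDIM x_t-coefficient and its 2nd-order Taylor truncation
errs = [abs(alpha(t - h) / alpha(t) - rho2(h)) for h in (0.01, 0.005)]
```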

6. CONCLUDING REMARKS

This paper proposed a Taylor-expansion approach for diffusion generative models, in particular solvers for the Probability Flow ODE (PF-ODE) and the reverse-time SDE (R-SDE). The assumptions needed to derive our samplers were minimalistic, and the derivation process was straightforward: we simply substituted the derivatives in the Taylor series with ideal ones. The obtained Quasi-Taylor and Quasi-Itô-Taylor samplers performed better than or on par with DDIM and Runge-Kutta methods. This fact implicitly supports the validity of our approximations. Conversely, if we could find examples where the Quasi-Taylor methods, DDIM, and RK methods gave decisively different results, we might gain a deeper understanding of the structure of the data manifold and the fundamentals of diffusion models by investigating the causes of the discrepancy.

Reproducibility Statement: Pseudocode for the proposed methods is available in § F, and the derivation of the proposed method is described in § B.1 and § B.3. The experiment is based on open-source code with minimal modifications to match the proposed method, and all the data used in this paper are publicly available. Experimental conditions are elaborated in § 4.

Ethics Statement: As a final note, negative aspects of generative models are often pointed out, such as the risk of reproducing bias and discrimination in training data and the risk of misuse for deepfakes. Since this method only provides a solver for existing generative models, it does not take special measures against these problems. Maximum ethical care should be taken in practical applications of this method.

A ON THE APPROXIMATION

∇_{x_t} log p(x_t, t) ≈ ∇_{x_t} log p(x_t, t | x_0, 0)

Although ∇_{x_t} log p(x_t, t) and ∇_{x_t} log p(x_t, t | x_0, 0) are clearly distinct concepts, they are nearly equivalent in practical situations where the data space is high-dimensional and the data are distributed on a small subset (low-dimensional manifold) of the space. That is, in such a case, the integrated gradient ∇_{x_t} log p(x_t, t), given by

∇_{x_t} log p(x_t, t)
= ∇_{x_t} log ∫ p(x_t, t | x_0, 0) p(x_0, 0) dx_0
= ( ∫ ∇_{x_t} p(x_t, t | x_0, 0) p(x_0, 0) dx_0 ) / ( ∫ p(x_t, t | x_0, 0) p(x_0, 0) dx_0 )
= E_{p(x_0)}[ ∇_{x_t} p(x_t, t | x_0, 0) ] / E_{p(x_0)}[ p(x_t, t | x_0, 0) ]
= E_{p(x_0)}[ -((x_t - √(1-ν_t) x_0)/ν_t) p(x_t, t | x_0, 0) ] / E_{p(x_0)}[ p(x_t, t | x_0, 0) ]
= E_{p(x_0)}[ ∇_{x_t} log p(x_t, t | x_0, 0) · q(x_0 | x_t) ],   (29)

where ∇_{x_t} log p(x_t, t | x_0, 0) = -(x_t - √(1-ν_t) x_0)/ν_t and the weight is

q(x_0 | x_t) := exp( -‖x_t - √(1-ν_t) x_0‖²/(2ν_t) ) / E_{p(x_0)}[ exp( -‖x_t - √(1-ν_t) x_0‖²/(2ν_t) ) ],

almost always has a single point approximation, written as ∇_{x_t} log p(x_t, t | x_0^{(i)}, 0), where x_0^{(i)} is a certain point on the data manifold. There are two phases depending on the noise scale ᾱ_t = 1 - ν_t, and the two phases have different reasons why the approximation is valid.

A.1 PHASE (1): ANYONE CAN BE A REPRESENTATIVE (ᾱ_t = 1 - ν_t ≈ 0)

If x_t is far from all of the scaled data points {√(1-ν_t) x_0^{(i)} | x_0^{(i)} ∼ p(x_0)}, the gradients contributed by the individual scaled data points are almost identical. Therefore, the integrated gradient ∇ log p(x_t, t) can be approximated by ∇ log p(x_t, t | x_0^{(i)}, 0) for any x_0^{(i)} ∼ p(x_0). Since q(· | x_t) satisfies E_{p(x_0)}[q(x_0 | x_t)] = 1, the L2 distance between ∇ log p(x_t, t) and ∇ log p(x_t, t | x_0^{(i)}, 0) is bounded above as follows:

‖∇ log p(x_t, t) - ∇ log p(x_t, t | x_0^{(i)}, 0)‖²_2
= ‖ E_{p(x_0)}[ ((x_t - √(1-ν_t) x_0)/ν_t) q(x_0 | x_t) ] - (x_t - √(1-ν_t) x_0^{(i)})/ν_t ‖²_2
= ‖ E_{p(x_0)}[ ( (x_t - √(1-ν_t) x_0)/ν_t - (x_t - √(1-ν_t) x_0^{(i)})/ν_t ) q(x_0 | x_t) ] ‖²_2
= ((1-ν_t)/ν_t²) ‖ E_{p(x_0)}[ (x_0 - x_0^{(i)}) q(x_0 | x_t) ] ‖²_2
≤ ((1-ν_t)/ν_t²) E_{p(x_0)}[ ‖x_0 - x_0^{(i)}‖²_2 q(x_0 | x_t) ]   (∵ Jensen's inequality)
≤ ((1-ν_t)/ν_t²) max_{x_0} ‖x_0 - x_0^{(i)}‖²_2 E_{p(x_0)}[ q(x_0 | x_t) ]
= ((1-ν_t)/ν_t²) max_{x_0} ‖x_0 - x_0^{(i)}‖²_2
≤ 4(1-ν_t) R²_M / ν_t²,

where R_M is the radius of the smallest ball that covers the data manifold M. The radius R_M will be a finite constant in most practical scenarios we are interested in. For example, if the data space is a d-dim box, i.e., M ⊂ [0, 1]^d, then R_M is bounded above by R_M ≤ √d/2.

[Figure 4: Intuitive reason why the single point approximation is valid in the case 1 - ν_t ≈ 0: the gradients from x_t toward all scaled data points {√(1-ν_t) x_0^{(i)}}_{i=1}^n are nearly identical, so ∇ log p(x_t, t) ≈ ∇ log p(x_t, t | x_0^{(i)}, 0).]

Noting that ‖∇ log p(x_t, t | x_0^{(i)}, 0)‖_2 ≈ √(d/ν_t) (this is actually a random variable that follows the χ-distribution¹), the relative L2 error is evaluated as follows:

(relative L2 error) = ‖∇ log p(x_t, t) - ∇ log p(x_t, t | x_0^{(i)}, 0)‖_2 / ‖∇ log p(x_t, t | x_0^{(i)}, 0)‖_2 ≲ √((1-ν_t)/ν_t).

Similarly, the cosine similarity is evaluated as

cossim( ∇ log p(x_t, t), ∇ log p(x_t, t | x_0^{(i)}, 0) ) ≳ 1 - (1/2)( 1 + √((1-ν_t)/ν_t) ) · ((1-ν_t)/ν_t).
These bounds suggest that the relative error between the true gradient and the single point approximation approaches 0, and the cosine similarity goes to 1, as ν_t → 1 (ᾱ_t → 0). That is, the single point approximation is largely valid when the noise level ν_t is high.

Proofs for some facts used in this section

Jensen's inequality: A sketchy proof of Jensen's inequality for the $L^2$-norm (with weights $w_n \ge 0$, $\sum_n w_n = 1$) is as follows:
$$\left\|\sum_n w_n x^{(n)}\right\|_2^2 = \sum_i\left(\sum_n w_n x_i^{(n)}\right)^2 \le \sum_i\sum_n w_n\big(x_i^{(n)}\big)^2 = \sum_n w_n\sum_i\big(x_i^{(n)}\big)^2 = \sum_n w_n\|x^{(n)}\|_2^2.$$

Cosine similarity: Given the relative distance between $x$ and $y$ as $\|x-y\|_2/\|x\|_2 = r$ (for simplicity, $r \ll 1$), the cosine similarity between the vectors is evaluated as follows:
$$\operatorname{cossim}(x,y) = \frac{\langle x,y\rangle}{\|x\|_2\|y\|_2} = \frac{\|x\|_2}{\|y\|_2}\cdot\frac{\langle x,y\rangle}{\|x\|_2^2} = \frac12\cdot\frac{\|x\|_2}{\|y\|_2}\cdot\frac{\|x\|_2^2+\|y\|_2^2-\|x-y\|_2^2}{\|x\|_2^2} = \frac12\left(\frac{\|x\|_2}{\|y\|_2}+\frac{\|y\|_2}{\|x\|_2}-\frac{\|x\|_2}{\|y\|_2}r^2\right).$$
Noting that the triangle inequality bounds the $L^2$ norm of $y$ as $(1-r)\|x\|_2 \le \|y\|_2 \le (1+r)\|x\|_2$, we can put $\|y\|_2/\|x\|_2 = 1+\delta$ $(0<|\delta|<r\ll1)$ and evaluate the cosine similarity as
$$\operatorname{cossim}(x,y) = \frac12\left(\frac{1}{1+\delta}+(1+\delta)-\frac{r^2}{1+\delta}\right) \approx \frac12\big((1-\delta)+(1+\delta)-(1-\delta)r^2\big) > 1-\frac12(1+r)r^2.$$

¹ Given i.i.d. Gaussian random variables $X_i \sim \mathcal N(0,1)$, the squared sum $\sum_{i=1}^d X_i^2$ follows the $\chi^2$-distribution with $d$ degrees of freedom. It is well known that the $\chi^2$-distribution converges to $\mathcal N(d, 2d)$ as $d$ increases. Therefore, if $d$ is sufficiently large, $\|x\|_2 := (\sum_{i=1}^d X_i^2)^{1/2} \approx \sqrt{d \pm \sqrt{2d}} \approx \sqrt d \pm O(1)$; thus the $L^2$ norm of a $d$-dim Gaussian variable is approximated by $\sqrt d$, with an error of only constant order. It implies that the Gaussian distribution looks like a sphere in high-dimensional space, contrary to our low-dimensional intuition. Indeed, the following inequality holds (Laurent & Massart, 2000, Lem. 1):
$$p\left((1-2\sqrt{y/d})^{1/2} \le \|x\|_2/\sqrt d \le (1+2\sqrt{y/d}+2y/d)^{1/2}\right) \ge 1-2e^{-y}.$$

A.2 PHASE (2): WINNER TAKES ALL ($\bar\alpha_t = 1-\nu_t \gg 0$)

The above bounds suggest that the single point approximation is valid when the noise level $\nu_t$ is high ($\nu_t \approx 1$). However, we can also show that the approximation is valid, for a different reason, when the noise level $\nu_t$ is low ($\nu_t \approx 0$).
If $p(x_0)$ is a discrete distribution and $\nu_t \approx 0$, the weight factor $q(x_0^{(i)} \mid x_t)$ in eq. (29) is almost certainly a "one-hot vector" such that
$$q(x_0^{(i)} \mid x_t) \approx 1, \qquad q(x_0^{(j)} \mid x_t) \approx 0 \quad (j \ne i).$$
That is, the entropy of $q(x_0 \mid x_t)$ is nearly zero (see Figure 8). This is almost certain (prob $\approx 1$) because the converse is quite rare: it seldom happens that another data point $x_0^{(j)}$ lies near $x_t \sim p(x_t, t \mid x_0^{(i)}, 0)$ in such a situation². Figure 5a intuitively shows the reason.

If $p(x_0)$ is a continuous distribution, only a very small region (almost a single point) of the data manifold contributes to the integration eq. (29); Figure 5b intuitively shows the situation. The weight function becomes nearly a delta function, and the integration is almost the same as the single point approximation,
$$q(x_0 \mid x_t) \approx \delta(x_0 - x_0^*),$$
where $x_0^*$ is a certain point close to the perpendicular projection of $x_t$ onto the data manifold. These facts are related to the well-known formula that $\varepsilon \log \sum \exp(\cdot/\varepsilon)$ goes to $\max(\cdot)$ as $\varepsilon \to 0$. Naturally, this is more likely to hold in high-dimensional space and is expected to break down for low-dimensional toy data.

Now let us elaborate the above discussion. As the scale we are now interested in is very small, the data manifold is approximated as a flat $k$-dim subspace. In addition, as $q(\cdot \mid x_t)$ is a Gaussian function, it decays rapidly with distance from its center (the perpendicular projection of $x_t \in \mathbb R^d$ onto the $k$-dim subspace).

² Let us evaluate the probability more quantitatively using a toy model. Denote by $B_k(r)$ the $k$-dim ball of radius $r$, and let $D_i$ be the $L^2$ distance between $x_t$ and $\sqrt{1-\nu_t}\,x_0^{(i)}$, i.e., $D_i := \|x_t - \sqrt{1-\nu_t}\,x_0^{(i)}\|_2 \approx \sqrt{\nu_t d}$. The question is whether there exists another data point $x_0^{(j)}$ $(j\ne i)$ in the discrete data distribution $\mathcal D = \{x_0^{(i)}\}_i$ such that $D_j < D_i$; in other words, whether there exists another data point in $B_d(\sqrt{\nu_t d})$ centered at $\sqrt{1-\nu_t}\,x_0^{(i)}$.
If that is the case, $q(x_0^{(j)} \mid x_t)$ has a significantly larger value than $q(x_0^{(i)} \mid x_t)$, and it strongly affects the result of the integration eq. (29). Let us show that this is unlikely in a high-dimensional data space. As it is difficult to evaluate the probability in general, consider a toy model in which all $n$ points are uniformly distributed in a $k$-dim ball ($k \ll d$); that is, assume the data manifold is $\mathcal M = B_k(\sqrt k)$, which satisfies $\mathcal D \subset \mathcal M \subset \mathbb R^d$. The probability of interest is then roughly evaluated as
$$p(\exists j \ne i,\; D_j < D_i) \approx 1 - \left(1 - \frac{\operatorname{vol}[B_k(\sqrt{\nu_t k})]}{\operatorname{vol}[\sqrt{1-\nu_t}\,\mathcal M]}\right)^{\!n} = 1 - \left(1 - \frac{\operatorname{vol}[B_k(\sqrt{\nu_t k})]}{\operatorname{vol}[B_k(\sqrt{(1-\nu_t)k})]}\right)^{\!n}.$$
Remembering that the volume of the $k$-dim ball is $\operatorname{vol}[B_k(r)] = \frac{\pi^{k/2}}{\Gamma(k/2+1)}r^k$, the probability is approximated as
$$\text{r.h.s.} = 1 - \left(1-\left(\frac{\nu_t k}{(1-\nu_t)k}\right)^{\!k/2}\right)^{\!n} \approx n\left(\frac{\nu_t}{1-\nu_t}\right)^{\!k/2}, \quad \text{if } \nu_t \approx 0,\; k \gg 1.$$
Thus, when $\nu_t$ is sufficiently small and the data manifold is sufficiently high-dimensional, it is quite unlikely that such a point $x_0^{(j)}$ $(j\ne i)$ exists unless $n$ is very large.

For this reason, only a small ball around the perpendicular foot of $x_t$ contributes to the integration. As the majority of $k$-dim Gaussian variables with covariance $\sigma^2 I$ are distributed on a sphere of radius $\sqrt k\,\sigma$, it is sufficient to consider a ball of radius $(\sqrt k + 2)\sqrt{\nu_t}$. The integration eq. (29) is then approximated as
$$\text{eq. (29)} \approx \int_B \nabla_{x_t}\log p(x_t,t\mid x_0,0)\, q(x_0\mid x_t)\, p(x_0)\, dx_0, \quad \text{where } B = \big\{x_0 \,\big|\, \|\sqrt{1-\nu_t}\,x_0 - x_t\|_2 < (\sqrt k+2)\sqrt{\nu_t}\big\}.$$
As $B$ is small, the gradient vector $\nabla_{x_t}\log p(x_t,t\mid x_0,0)$ is almost constant in this region.
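The toy-model probability in footnote 2 can be evaluated directly. This small sketch (ours, with a hypothetical helper name `collision_prob`) computes the exact expression via numerically stable `expm1`/`log1p` and compares it with the first-order approximation $n\,(\nu_t/(1-\nu_t))^{k/2}$.

```python
import math

def collision_prob(nu, k, n):
    """Toy-model probability that another scaled data point is closer (footnote 2)."""
    ratio = (nu / (1 - nu)) ** (k / 2)   # vol[B_k(sqrt(nu k))] / vol[B_k(sqrt((1-nu)k))]
    # 1 - (1 - ratio)^n, computed stably for tiny ratio
    return -math.expm1(n * math.log1p(-ratio))

nu, k, n = 0.01, 32, 60_000              # low noise, modest manifold dim, MNIST-sized n
exact = collision_prob(nu, k, n)
approx = n * (nu / (1 - nu)) ** (k / 2)  # the first-order approximation in the text
print(exact, approx)
```

Even with a dataset of 60,000 points, the probability is astronomically small here, in line with the "winner takes all" claim.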
Quantitatively, let $x_0^{(1)}, x_0^{(2)} \in B$; the distance between these points is evidently bounded above by $\|x_0^{(1)}-x_0^{(2)}\|_2 \le 2(\sqrt k+2)\sqrt{\nu_t}$, which implies
$$\begin{aligned}
\big\|\nabla\log p(x_t,t\mid x_0^{(1)},0) - \nabla\log p(x_t,t\mid x_0^{(2)},0)\big\|_2
&= \left\|\frac{x_t-\sqrt{1-\nu_t}\,x_0^{(1)}}{\nu_t} - \frac{x_t-\sqrt{1-\nu_t}\,x_0^{(2)}}{\nu_t}\right\|_2
= \left\|\frac{\sqrt{1-\nu_t}\,(x_0^{(1)}-x_0^{(2)})}{\nu_t}\right\|_2 \\
&\le \frac{\sqrt{1-\nu_t}}{\nu_t}\cdot 2(\sqrt k+2)\sqrt{\nu_t} = 2(\sqrt k+2)\sqrt{\frac{1-\nu_t}{\nu_t}}.
\end{aligned}$$
Noting that $\|\nabla\log p(x_t,t\mid x_0,0)\|_2 \approx \sqrt{d/\nu_t}$, the relative $L^2$ error between the two gradients is bounded above by
$$\frac{\|\nabla\log p(x_t,t\mid x_0^{(1)},0)-\nabla\log p(x_t,t\mid x_0^{(2)},0)\|_2}{\|\nabla\log p(x_t,t\mid x_0^{(2)},0)\|_2} \lesssim \frac{2(\sqrt k+2)\sqrt{1-\nu_t}}{\sqrt d} \approx 2\sqrt{\frac kd}.$$
By integrating both sides w.r.t. $x_0^{(1)}$, i.e. $\int(\cdot)\,q(x_0^{(1)}\mid x_t)\,p(x_0^{(1)})\,dx_0^{(1)}$, we obtain the following inequality for any point $x_0^* \in B$:
$$\frac{\|\nabla\log p(x_t,t)-\nabla\log p(x_t,t\mid x_0^*,0)\|_2}{\|\nabla\log p(x_t,t\mid x_0^*,0)\|_2} \lesssim 2\sqrt{\frac kd}.$$
Now, recall that a noised datum $x_t \sim p(x_t,t)$ is generated by the following procedure:
1. Draw a datum $x_0^{(i)}$ from $p(x_0,0)$.
2. Draw a Gaussian $x_t \sim p(x_t,t\mid x_0^{(i)},0) = \mathcal N(\sqrt{1-\nu_t}\,x_0^{(i)}, \nu_t I_d)$.
Here, the data point $x_0^{(i)} \in \mathcal M$ is in the ball $B$, because the distance between $\sqrt{1-\nu_t}\,x_0^{(i)}$ and the perpendicular foot of $x_t$ is about $\sqrt{\nu_t k}$, which is clearly smaller than $(\sqrt k+2)\sqrt{\nu_t}$. Thus we can write
$$\frac{\|\nabla\log p(x_t,t)-\nabla\log p(x_t,t\mid x_0^{(i)},0)\|_2}{\|\nabla\log p(x_t,t\mid x_0^{(i)},0)\|_2} \lesssim 2\sqrt{\frac kd},$$
and we may conclude that the following approximation is largely valid if $\nu_t \approx 0$ and $k \ll d$:
$$\nabla\log p(x_t,t) \approx \nabla\log p(x_t,t\mid x_0^{(i)},0).$$

A.3 COMPARISON OF THE EMPIRICAL SCORE FUNCTION AND THE SINGLE POINT APPROXIMATION

Let us empirically validate the accuracy of the single point approximation using real data:
• $\mathcal D$ = {MNIST (LeCun et al., 2010), 60,000 samples},
• $\mathcal D$ = {CIFAR-10 (Krizhevsky, 2009), 50,000 samples}.
Since the true score function cannot be determined without knowing the true density (which would be possible with synthetic data, but discussing such data would not be very interesting here), the empirical score function was computed using the real data $\mathcal D$ above:
$$\text{True Score} = \nabla\log p(x_t,t) = \mathbb E_{p(x_0)}\big[q(x_0\mid x_t)\nabla\log p(x_t,t\mid x_0,0)\big] \approx \frac{1}{|\mathcal D|}\sum_{x_0\in\mathcal D} q(x_0\mid x_t)\nabla\log p(x_t,t\mid x_0,0) =: \text{Empirical Score}.$$
Evaluating the empirical score function over an entire dataset is unrealistic if $\mathcal D$ is large, but it is feasible for small datasets like MNIST and CIFAR-10. To assess the accuracy of the single point approximation, we evaluated the following three metrics:
• relative $L^2$ error between the empirical score function and $\nabla\log p(x_t,t\mid x_0,0)$,
• cosine similarity between the empirical score function and $\nabla\log p(x_t,t\mid x_0,0)$,
• entropy of $q(x_0\mid x_t)$.
Figure 6 shows the relative $L^2$ distance for both datasets, Figure 7 similarly shows the distribution (10,000 random trials) of the cosine similarity, and Figure 8 shows the entropy. Dashed curves indicate the bounds evaluated in eq. (31) and eq. (32). These figures show that the intermediate region between Phase (1) and Phase (2) has little impact in practical situations, since the neural network $S_\theta(\cdot,\cdot)$ is not evaluated many times in this range (i.e., $\bar\alpha_t \sim 10^{-3}$ to $10^{-1}$, i.e. $\nu_t \sim 0.999$ to $0.9$); moreover, the approximation accuracy remains very high even in this region. Furthermore, although MNIST and CIFAR-10 are quite "low-dimensional" among real-world images, the approximation already holds with high accuracy, so it is expected to hold with even higher accuracy for more realistic, higher-dimensional images.
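The three metrics above can be reproduced in a few lines. The following sketch is ours and uses synthetic Gaussian data as a stand-in for MNIST/CIFAR-10 (the paper's actual experiments use the real datasets); the helper name `metrics` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 128, 500
X0 = rng.standard_normal((n, d))          # synthetic stand-in for the dataset D

def metrics(nu, i=0):
    xt = np.sqrt(1 - nu) * X0[i] + np.sqrt(nu) * rng.standard_normal(d)
    mu = np.sqrt(1 - nu) * X0
    logw = -np.sum((xt - mu) ** 2, axis=1) / (2 * nu)
    w = np.exp(logw - logw.max()); w /= w.sum()   # q(x0 | x_t)
    emp = w @ (-(xt - mu) / nu)           # empirical score
    one = -(xt - mu[i]) / nu              # single point approximation
    rel = np.linalg.norm(emp - one) / np.linalg.norm(one)
    cos = emp @ one / (np.linalg.norm(emp) * np.linalg.norm(one))
    ent = -np.sum(w * np.log(np.clip(w, 1e-300, None)))
    return rel, cos, ent

rel_hi, cos_hi, ent_hi = metrics(0.999)   # Phase (1): weights spread out
rel_lo, cos_lo, ent_lo = metrics(0.001)   # Phase (2): winner takes all
print((rel_hi, cos_hi, ent_hi), (rel_lo, cos_lo, ent_lo))
```

At low noise the weight vector is effectively one-hot (entropy near zero) and the approximation is exact; at high noise the weights spread out, yet the relative error remains small, mirroring the two phases.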
B ON THE IDEAL DERIVATIVE APPROXIMATION

Thus, we can assume that the single point approximation almost always holds in practice:
$$\underbrace{-\frac{S_\theta(x_t,t)}{\sqrt{\nu_t}}}_{\text{model}} \;\approx\; \nabla_{x_t}\log p(x_t,t) \;\underset{\text{almost equal}}{\approx}\; \nabla_{x_t}\log p(x_t,t\mid x_0^{(i)},0) = -\frac{x_t-\sqrt{1-\nu_t}\,x_0^{(i)}}{\nu_t}.$$
Therefore, we may also expect that a similar approximation is valid for their derivatives. Of course, strictly speaking, such an expectation is mathematically unjustified in general. For example, let $g(x) = f(x)+\varepsilon\sin\omega x$; the difference $g(x)-f(x) = \varepsilon\sin\omega x$ goes to zero as $\varepsilon\to0$, but the difference of derivatives $g'(x)-f'(x) = \varepsilon\omega\cos\omega x$ does not if $\omega\to\infty$ faster than $1/\varepsilon$. If the error between the functions in the Fourier domain is written as $E(\omega) = G(\omega)-F(\omega)$, then the $L^2$ error between the derivatives is $\|g'(x)-f'(x)\|_2^2 = \|\omega E(\omega)\|_2^2 \times \text{const}$ (Parseval's theorem). In other words, the single point approximation does not by itself imply the ideal derivative approximation; to be mathematically rigorous, it would have to be supported by additional nontrivial knowledge of the data manifold. This nontrivial leap is the most important "conjecture" made in this paper, and its theoretical background should be evaluated more closely in the future.

B.1 DERIVATION OF THE "IDEAL DERIVATIVES"

Because of the discussion in § A, the true score function $\nabla_{x_t}\log p(x_t,t)$ is finely approximated by the single point approximation $\nabla_{x_t}\log p(x_t,t\mid x_0,0)$. Now we may also assume that the derivatives of both are close. In this paper, we are interested in the Taylor expansion of the following form (see also § E.1.1),
$$\psi(x_h,h) = \psi(x_0,0) + \sum_{k=1}^\infty \frac{h^k}{k!}\Big[(\partial_t + a(x_t,t)\cdot\nabla_{x_t})^k\,\psi(x_t,t)\Big]_{t=0}.$$
If the function $\psi(x_t,t)$ is separable in each dimension (i.e., $\partial_{x_i}\psi_j = 0$ for $i\ne j$), the following relation holds,
$$(a(x_t,t)\cdot\nabla_{x_t})\,\psi(x_t,t) = a(x_t,t)\odot\nabla^\odot_{x_t}\psi(x_t,t),$$
where $\odot$ is the element-wise product or operation. If $a(x_t,t)$ is also separable in each dimension⁴, the Taylor series is formally rewritten as
$$\psi(x_t,t) = \psi(x_0,0)+\sum_{k=1}^\infty \frac{t^k}{k!}\Big[\big(\mathbf 1\partial_t + a(x_t,t)\odot\partial_{x_t}\big)^{k}\psi(x_t,t)\Big]_{t=0},$$
where $\partial_{x_t} := \nabla^\odot_{x_t}$ is the element-wise derivative operator. This is formally the same as the 1-dim Taylor series. Therefore, it is sufficient to consider the 1-dim Taylor series first and parallelize each dimension later. Thus the derivatives we actually need are the following two:
$$\partial_{x_t}S_\theta(x_t,t) = \nabla^\odot_{x_t}S_\theta(x_t,t), \qquad \partial_t S_\theta(x_t,t) = (\mathbf 1\partial_t)\,S_\theta(x_t,t). \tag{49}$$

B.1.1 SPATIAL DERIVATIVE $\partial_{x_t}S_\theta(x_t,t) := \nabla^\odot_{x_t}S_\theta(x_t,t)$

Let us first compute the spatial derivative of the conditional score function:
$$(a\cdot\nabla_{x_t})\big(-\sqrt{\nu_t}\,\nabla_{x_t}\log p(x_t,t\mid x_0,0)\big) = \sum_i a_i\,\partial_{x_t^i}\frac{x_t-\sqrt{1-\nu_t}\,x_0}{\sqrt{\nu_t}}
= \frac{1}{\sqrt{\nu_t}}\begin{pmatrix}\sum_i a_i\partial_{x_t^i}(x_t^1-\sqrt{1-\nu_t}\,x_0^1)\\ \vdots\\ \sum_i a_i\partial_{x_t^i}(x_t^d-\sqrt{1-\nu_t}\,x_0^d)\end{pmatrix}
= \frac{1}{\sqrt{\nu_t}}\begin{pmatrix}a_1\\ \vdots\\ a_d\end{pmatrix} = \frac{1}{\sqrt{\nu_t}}\,a = a\odot\frac{1}{\sqrt{\nu_t}}\mathbf 1.$$
Here we used the notation $x_t^i$ to denote the $i$-th component of the vector $x_t$. Note that up to this point no approximation has been made; the computation is exact. Now let us consider the approximation. Because of the single point approximation, we may assume that the derivative of the integrated score function is also approximated by the derivative of the conditional score function, i.e.,
$$(a\cdot\nabla_{x_t})\big(-\sqrt{\nu_t}\,\nabla_{x_t}\log p(x_t,t)\big) \approx (a\cdot\nabla_{x_t})\big(-\sqrt{\nu_t}\,\nabla_{x_t}\log p(x_t,t\mid x_0,0)\big).$$
As the neural network $S_\theta(x_t,t)$ is trained so that it approximates the integrated score function, we can also assume
$$(a\cdot\nabla_{x_t})S_\theta(x_t,t) \approx (a\cdot\nabla_{x_t})\big(-\sqrt{\nu_t}\,\nabla_{x_t}\log p(x_t,t\mid x_0,0)\big) = \frac{1}{\sqrt{\nu_t}}\,a.$$
Thus we have obtained the ideal spatial derivative of the neural network. We can also formally write the spatial derivative using the above notation as
$$a\odot\big(\partial_{x_t}S_\theta(x_t,t)\big) = a\odot\frac{1}{\sqrt{\nu_t}}\mathbf 1, \quad\text{or simply}\quad \partial_{x_t}S_\theta(x_t,t) = \frac{1}{\sqrt{\nu_t}}\mathbf 1. \tag{54}$$

⁴ In general, $(a\cdot\nabla)^2 = (\sum_i a_i\partial_i)(\sum_j a_j\partial_j) = \sum_i a_i\sum_j\big((\partial_ia_j)\partial_j + a_j\partial_i\partial_j\big)$. If $a$ is separable in each dimension, the $\partial_ia_j$ $(i\ne j)$ terms vanish. If the function $\psi(x_t,t)$ is also separable in each dimension, then $(a\cdot\nabla)^2\psi_k = \big(a_k(\partial_ka_k)\partial_k + a_k^2\partial_k^2\big)\psi_k$. Thus we can formally write $(a\cdot\nabla)^2\psi = \big(a\odot\partial_x\big)^{2}\psi$. (Note that the operator $(a\cdot\nabla)$ is scalar while $(a\odot\partial_x)$ is a $d$-dim vector.) We can similarly show $(a\cdot\nabla)^k\psi = (a\odot\partial_x)^{k}\psi$ for $k\ge3$.

B.1.2 TIME DERIVATIVE $-\partial_t S_\theta(x_t,t)$

Next, let us compute $-\partial_t\big(-\sqrt{\nu_t}\,\nabla_{x_t}\log p(x_t,t\mid x_0,0)\big)$. During the computation, $x_0$ is replaced using the relation
$$x_0 = \frac{1}{\sqrt{1-\nu_t}}\big(x_t + \nu_t\nabla_{x_t}\log p(x_t,t\mid x_0,0)\big).$$
We also use the following relation between $\nu_t$ and $\beta_t$, which is immediately obtained from the definition of $\nu_t$:
$$\dot\nu_t = (1-\nu_t)\beta_t.$$
Using the above, we may compute the temporal derivative of the conditional score function as follows:
$$\begin{aligned}
-\partial_t\big(-\sqrt{\nu_t}\,\nabla_{x_t}\log p(x_t,t\mid x_0,0)\big) &= -\partial_t\frac{x_t-\sqrt{1-\nu_t}\,x_0}{\sqrt{\nu_t}}\\
&= -\left[\frac{1}{\sqrt{\nu_t}}\cdot\frac{\dot\nu_t}{2\sqrt{1-\nu_t}}\,x_0 - \big(x_t-\sqrt{1-\nu_t}\,x_0\big)\cdot\frac{\dot\nu_t}{2}\nu_t^{-3/2}\right]\\
&= -\frac{\dot\nu_t}{2\nu_t^{3/2}}\left[\frac{\nu_t}{\sqrt{1-\nu_t}}\,x_0 - \big(x_t-\sqrt{1-\nu_t}\,x_0\big)\right]
= -\frac{\dot\nu_t}{2\nu_t^{3/2}}\left[-x_t + \frac{1}{\sqrt{1-\nu_t}}\,x_0\right]\\
&= -\frac{\dot\nu_t}{2\nu_t^{3/2}}\left[-x_t + \frac{1}{1-\nu_t}\big(x_t + \nu_t\nabla_{x_t}\log p(x_t,t\mid x_0,0)\big)\right]\\
&= -\frac{1}{2\nu_t^{3/2}}\cdot\frac{\dot\nu_t}{1-\nu_t}\big(\nu_t x_t + \nu_t\nabla_{x_t}\log p(x_t,t\mid x_0,0)\big)\\
&= -\frac{\beta_t}{2\nu_t^{3/2}}\big(\nu_t x_t + \nu_t\nabla_{x_t}\log p(x_t,t\mid x_0,0)\big)
= -\frac{\beta_t}{2\sqrt{\nu_t}}\big(x_t + \nabla_{x_t}\log p(x_t,t\mid x_0,0)\big).
\end{aligned}$$
(Note that this calculation is exact; no approximation has been injected.) Because of the single point approximation, we may assume
$$-\partial_t\big(-\sqrt{\nu_t}\,\nabla_{x_t}\log p(x_t,t)\big) \approx -\frac{\beta_t}{2\sqrt{\nu_t}}\big(x_t + \nabla_{x_t}\log p(x_t,t\mid x_0,0)\big) \approx -\frac{\beta_t}{2\sqrt{\nu_t}}\big(x_t + \nabla_{x_t}\log p(x_t,t)\big),$$
and therefore the temporal derivative of the neural network is approximated as
$$-\partial_t S_\theta(x_t,t) \approx -\frac{\beta_t}{2\sqrt{\nu_t}}\left(x_t - \frac{1}{\sqrt{\nu_t}}S_\theta(x_t,t)\right).$$
These "ideal derivatives" have some convenient properties; for example, the partial derivatives commute: $\partial_{x_t}\partial_t S_\theta(x_t,t) = \partial_t\partial_{x_t}S_\theta(x_t,t)$.
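Since the computation above is exact for the conditional score, the resulting identity can be verified numerically. The following sketch (ours) defines a surrogate network output equal to $-\sqrt{\nu_t}$ times the conditional score, and compares its central finite difference in $t$ with the ideal time derivative; `beta`/`nu` implement the paper's linear schedule.

```python
import numpy as np

b0, b1 = 0.1, 9.95                          # linear schedule (as in the paper)
beta = lambda t: b0 + 2 * b1 * t
nu   = lambda t: 1 - np.exp(-b0 * t - b1 * t ** 2)

rng = np.random.default_rng(2)
d = 16
x0, xt = rng.standard_normal(d), rng.standard_normal(d)

def S(t):   # surrogate network output: -sqrt(nu_t) * conditional score
    return (xt - np.sqrt(1 - nu(t)) * x0) / np.sqrt(nu(t))

t, eps = 0.5, 1e-6
fd = (S(t + eps) - S(t - eps)) / (2 * eps)                       # numerical d/dt
ideal = beta(t) / (2 * np.sqrt(nu(t))) * (xt - S(t) / np.sqrt(nu(t)))
print(np.max(np.abs(fd - ideal)))
```

The two agree to finite-difference precision for arbitrary $x_t$ and $x_0$, confirming that the identity holds exactly before the single point approximation is invoked.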

B.2 COMPARISON OF THE EMPIRICAL SCORE DERIVATIVES AND IDEAL DERIVATIVES

Let us empirically validate the ideal derivative approximation using real data, similarly to the above. However, since the equations would become very complicated if we evaluated the exact empirical score derivatives, we instead used finite differences as the ground truth. That is, let $S(x,t)$ be the routine that computes the empirical score function,
$$S(x,t) = -\frac{\sqrt{\nu_t}}{|\mathcal D|}\sum_{x_0\in\mathcal D}q(x_0\mid x_t)\,\nabla\log p(x_t,t\mid x_0,0),$$
and evaluate the empirical score derivatives by finite differences as follows⁵:
$$\text{Empirical } t \text{ Deriv:}\quad \partial_t S \approx \frac{S(x_t,t+\varepsilon)-S(x_t,t)}{\varepsilon} \tag{62}$$
$$\text{Empirical } x_t \text{ Deriv:}\quad (a\cdot\nabla_{x_t})S \approx \frac{S(x_t+\varepsilon a,t)-S(x_t,t)}{\varepsilon}, \quad \text{where } a\sim\mathcal N(0,I), \tag{63}$$
where $\varepsilon$ should be a sufficiently small value; we used $\varepsilon=10^{-3}$. We compared these empirical derivatives with the ideal derivatives using MNIST and CIFAR-10:
$$\text{Ideal } t \text{ Deriv:}\quad \partial_t S_\theta = \frac{\beta_t}{2\sqrt{\nu_t}}\left(x_t-\frac{1}{\sqrt{\nu_t}}S_\theta(x_t,t)\right) = \frac{\beta_t}{2\sqrt{\nu_t}}\left(x_t-\frac{x_t-\sqrt{1-\nu_t}\,x_0}{\nu_t}\right)$$
$$\text{Ideal } x_t \text{ Deriv:}\quad (a\cdot\nabla_{x_t})S_\theta = \frac{1}{\sqrt{\nu_t}}\,a.$$
As the ideal derivatives require the specific functional forms of the diffusion and variance schedules, we tested the following two noise schedules.

Linear schedule. We first tested the linear schedule eq. (76), where $\beta_0=0.1$ and $\beta_1=9.95$; this is the same schedule as the one used in the main text. Figure 9 shows the relative $L^2$ error and the cosine similarity between the ideal $t$ derivative eq. (21) and the empirical $t$ derivative eq. (62): they are very close when $0\lesssim t\lesssim0.5$, while the approximation accuracy decreases as $t$ increases; even then, there tends to be an overall positive correlation. An error that seems to originate from the singularity at the time origin is also observed when $t\approx0$ (see also § D.2). For the $x$ derivative (Figure 10), on the other hand, we can confirm that the ideal $x$ derivative and the empirical $x$ derivative eq. (63) are generally very highly correlated, except around $t\approx0.5$.
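The directional finite difference of footnote 5 can be illustrated directly: for the conditional-score surrogate, which is linear in $x$, the difference quotient matches the ideal spatial derivative $a/\sqrt{\nu_t}$ up to rounding. This sketch is ours (synthetic $x_0$ instead of a dataset).

```python
import numpy as np

b0, b1 = 0.1, 9.95
nu = lambda t: 1 - np.exp(-b0 * t - b1 * t ** 2)

rng = np.random.default_rng(3)
d, t, eps = 16, 0.5, 1e-3                   # eps = 1e-3 as in the text
x0 = rng.standard_normal(d)

def S(x, t):                                # conditional-score surrogate for S_theta
    return (x - np.sqrt(1 - nu(t)) * x0) / np.sqrt(nu(t))

xt, a = rng.standard_normal(d), rng.standard_normal(d)
fd = (S(xt + eps * a, t) - S(xt, t)) / eps  # directional difference (footnote 5)
ideal = a / np.sqrt(nu(t))                  # ideal spatial derivative
print(np.max(np.abs(fd - ideal)))
```

For the real empirical score the agreement is only approximate, as the figures in this section show; here the surrogate makes it exact.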

Modified tanh schedule

We also tested another noise schedule, the modified tanh schedule eq. (79), which does not have the singularity at the time origin. The parameters $A,k$ were determined so that $\nu_0=0.001$ and $\nu_1=0.999$. Figure 11 and Figure 12 show the results. In this case, the overall trend is similar to the linear schedule, but we can observe that the singularity of the $t$ derivative at the time origin is eliminated.

⁵ To verify the empirical $x_t$ derivative, consider a simple case of a three-variable function $f(x,y,z)$. As its total derivative is $df=\partial_xf\,dx+\partial_yf\,dy+\partial_zf\,dz$, we have $f(x+a,y+b,z+c)-f(x,y,z)\approx(a\partial_x+b\partial_y+c\partial_z)f(x,y,z)$ for small $a,b,c$. Let $a=\varepsilon a'$, $b=\varepsilon b'$ and $c=\varepsilon c'$; then $f(x+\varepsilon a',y+\varepsilon b',z+\varepsilon c')-f(x,y,z)\approx\varepsilon(a'\partial_x+b'\partial_y+c'\partial_z)f(x,y,z)$. Therefore, we can write the spatial derivative as $(a'\partial_x+b'\partial_y+c'\partial_z)f(x,y,z)=\lim_{\varepsilon\to0}\frac1\varepsilon\big(f(x+\varepsilon a',y+\varepsilon b',z+\varepsilon c')-f(x,y,z)\big)$.

Computation of the derivatives $L(-\tilde f)$, $\bar L(-\bar f)$, $\bar L(\bar g)$, $G(-\bar f)$, $G(\bar g)$. The computation of these derivatives does not require any particularly nontrivial process. All we have to do is rewrite a term every time we encounter a derivative of $S_\theta(x_t,t)$ or $\nu_t$; the rest is at the level of elementary exercises in introductory calculus. To execute this symbolic computation, the use of computer algebra systems is a good option. It should be noted, however, that some implementation tricks are required to process such custom derivatives (in other words, the term-rewriting system should be customized). The results are shown below. Although these expressions appear complex at first glance, the code generation system can automatically generate code for them.
$$L(-\tilde f)(x_t,t) = \left(\frac{\beta_t^2}{4}-\frac{\dot\beta_t}{2}\right)x_t + \left(\frac{\dot\beta_t}{2\sqrt{\nu_t}}-\frac{\beta_t^2}{4\nu_t^{3/2}}\right)S_\theta(x_t,t) \tag{64}$$
$$\bar L(-\bar f)(x_t,t) = \left(\frac{\beta_t^2}{4}-\frac{\dot\beta_t}{2}\right)x_t + \frac{\dot\beta_t}{\sqrt{\nu_t}}\,S_\theta(x_t,t) \tag{65}$$
$$G(-\bar f)(x_t,t) = \left(\frac12-\frac{1}{\nu_t}\right)\beta_t^{3/2} \tag{66}$$
$$\bar L(\bar g)(t) = -\frac{\dot\beta_t}{2\sqrt{\beta_t}} \tag{67}$$
$$G(\bar g)(t) = 0. \tag{68}$$
We may also compute higher-order derivatives, though we do not use them in this paper except $LL(-\tilde f)$:
$$LL(-\tilde f)(x_t,t) = \left(\frac{\beta_t^3}{8}-\frac{3\beta_t\dot\beta_t}{4}+\frac{\ddot\beta_t}{2}\right)x_t + \left(\frac{\beta_t^3(-\nu_t^2+3\nu_t-3)}{8\nu_t^{5/2}}+\frac{3\beta_t\dot\beta_t}{4\nu_t^{3/2}}-\frac{\ddot\beta_t}{2\sqrt{\nu_t}}\right)S_\theta(x_t,t) \tag{69}$$
$$\bar L\bar L(-\bar f)(x_t,t) = \left(\frac{\beta_t^3}{8}-\frac{3\beta_t\dot\beta_t}{4}+\frac{\ddot\beta_t}{2}\right)x_t - \frac{\beta_t^3+4\ddot\beta_t}{4\sqrt{\nu_t}}\,S_\theta(x_t,t)$$
$$\bar LG(-\bar f)(x_t,t) = \frac{\sqrt{\beta_t}}{\nu_t^2}\left(\frac{\nu_t(2\beta_t^2+3\dot\beta_t)}{2}-\beta_t^2-\frac{3\nu_t^2\dot\beta_t}{4}\right)$$
$$G\bar L(-\bar f)(x_t,t) = \sqrt{\beta_t}\left(\frac{\beta_t^2}{4}-\frac{\dot\beta_t}{2}+\frac{\dot\beta_t}{\nu_t}\right)$$
$$GG(-\bar f)(x_t,t) = 0$$
$$\bar L\bar L(\bar g)(t) = \frac{2\beta_t\ddot\beta_t-\dot\beta_t^2}{4\beta_t^{3/2}}$$
$$\bar LG(\bar g)(t) = 0, \qquad G\bar L(\bar g)(t) = 0, \qquad GG(\bar g)(t) = 0.$$
As we can see, no factors other than integers, $x_t$, $S_\theta(x_t,t)$, $\nu_t$, $\beta_t$, and derivatives of $\beta_t$ appear. This is also true for higher-order derivatives, which can easily be shown.

SymPy Code Snippet for Automatic Symbolic Computation of Derivatives

The following code snippet is a minimalistic example of SymPy code to compute the above derivatives using the customized derivative method (we used SymPy 1.11 to test it). After defining the drift and the operators with the ideal-derivative rewriting rules, one prints, e.g.,

    print(simplify(f_flat(x, t)))                    # see eq. (64)
    print(simplify(L_flat(f_flat(x, t))))            # L f(x_t, t); see eq. (64)
    print(simplify(L_flat(L_flat(f_flat(x, t)))))    # L L f(x_t, t); see eq. (69)
    # we can similarly define f, L, G for the other cases and compute the other derivatives.

The result will look like
$$[\text{Out 1}]\quad -\frac{x\beta(t)}{2}+\frac{S_\theta(x,t)\beta(t)}{2\sqrt{\nu(t)}}$$
$$[\text{Out 2}]\quad -\frac{x\beta^2(t)}{4}+\frac{x\frac{d}{dt}\beta(t)}{2}+\frac{S_\theta(x,t)\beta^2(t)}{4\nu^{\frac32}(t)}-\frac{S_\theta(x,t)\frac{d}{dt}\beta(t)}{2\sqrt{\nu(t)}}$$
$$[\text{Out 3}]\quad -\frac{x\beta^3(t)}{8}+\frac{3x\beta(t)\frac{d}{dt}\beta(t)}{4}-\frac{x\frac{d^2}{dt^2}\beta(t)}{2}+\frac{S_\theta(x,t)\beta^3(t)}{8\sqrt{\nu(t)}}-\frac{3S_\theta(x,t)\beta^3(t)}{8\nu^{\frac32}(t)}+\frac{3S_\theta(x,t)\beta^3(t)}{8\nu^{\frac52}(t)}-\frac{3S_\theta(x,t)\beta(t)\frac{d}{dt}\beta(t)}{4\nu^{\frac32}(t)}+\frac{S_\theta(x,t)\frac{d^2}{dt^2}\beta(t)}{2\nu^{\frac12}(t)}$$
and so on. Some additional coding techniques can further improve the readability of these expressions, but there is no need to go deeper into such subsidiary issues here. The symbolic expressions thus obtained can be automatically converted into executable code in practical programming languages, including Python and C++, using a code generator, though the authors hand-coded the obtained expressions in Python for the experiments in this paper.
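A complete, runnable variant of such a snippet is sketched below. It is ours, not the paper's original listing: the class and variable names (`nu`, `S`, `f_pf`, `L`) are our choices, and the sign of `L` is chosen so that the output reproduces eq. (64) (the paper's operator appears to be defined with respect to reverse time). The custom `fdiff` methods implement the term-rewriting rules $\dot\nu_t=(1-\nu_t)\beta_t$ and the ideal derivatives of $S_\theta$.

```python
import sympy as sp

x, t = sp.symbols('x t')
B = sp.Function('beta')                                   # beta_t, kept fully symbolic

class nu(sp.Function):
    """Variance schedule nu(t); rewriting rule d nu/dt = (1 - nu) beta."""
    def fdiff(self, argindex=1):
        return (1 - nu(self.args[0])) * B(self.args[0])

class S(sp.Function):
    """Score network S_theta(x, t) with the ideal derivatives substituted."""
    def fdiff(self, argindex=1):
        x_, t_ = self.args
        if argindex == 1:                                 # ideal spatial derivative
            return 1 / sp.sqrt(nu(t_))
        # ideal temporal derivative
        return B(t_) / (2 * sp.sqrt(nu(t_))) * (x_ - S(x_, t_) / sp.sqrt(nu(t_)))

f_pf = -B(t) / 2 * x + B(t) / (2 * sp.sqrt(nu(t))) * S(x, t)   # PF-ODE drift
L = lambda phi: -(sp.diff(phi, t) + f_pf * sp.diff(phi, x))    # reverse-time operator

e64 = (B(t)**2 / 4 - B(t).diff(t) / 2) * x \
    + (B(t).diff(t) / (2 * sp.sqrt(nu(t))) - B(t)**2 / (4 * nu(t)**sp.Rational(3, 2))) * S(x, t)
diff64 = sp.simplify(sp.expand(L(-f_pf) - e64))
print(diff64)   # 0, i.e. L(-f) reproduces eq. (64)
```

Once the rewriting rules are installed via `fdiff`, higher-order derivatives such as eq. (69) follow by simply nesting `L`.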

C TRUNCATED DDIM IS EQUIVALENT TO THE QUASI-TAYLOR SAMPLER

Using SymPy, we can easily compute the Taylor expansion of a given function. For example, the following code

    sympy.series(B(t+h), h, 0, 4)

yields the result
$$\beta(t)+h\left.\frac{d}{d\xi_1}\beta(\xi_1)\right|_{\xi_1=t}+\frac{h^2}{2}\left.\frac{d^2}{d\xi_1^2}\beta(\xi_1)\right|_{\xi_1=t}+\frac{h^3}{6}\left.\frac{d^3}{d\xi_1^3}\beta(\xi_1)\right|_{\xi_1=t}+O(h^4).$$
Similarly, using the relation $\dot\nu_t=(1-\nu_t)\beta_t$, we can easily compute the Taylor expansion of $\nu_{t-h}$:

    sympy.series(nu(t-h), h, 0, 3)

$$\nu_{t-h}=\nu(t)+h\big(\beta(t)\nu(t)-\beta(t)\big)+h^2\left(\frac{\beta^2(t)\nu(t)}{2}-\frac{\beta^2(t)}{2}-\frac{\nu(t)}{2}\left.\frac{d}{d\xi_1}\beta(\xi_1)\right|_{\xi_1=t}+\frac12\left.\frac{d}{d\xi_1}\beta(\xi_1)\right|_{\xi_1=t}\right)+O(h^3).$$
Using this functionality of SymPy, we can easily compute the Taylor expansion of DDIM (Song et al., 2020a). Recall that the DDIM algorithm is given by eq. (15); using our notation $\alpha=\sqrt{1-\nu}$ and $\sigma=\sqrt\nu$, it can be written as
$$\text{DDIM:}\quad x_{t-h}\leftarrow \underbrace{\sqrt{\frac{1-\nu_{t-h}}{1-\nu_t}}}_{=:\rho_{t,h}^{\text{DDIM}}}x_t+\underbrace{\left(\sqrt{\nu_{t-h}}-\sqrt{\frac{(1-\nu_{t-h})\,\nu_t}{1-\nu_t}}\right)}_{=:\mu_{t,h}^{\text{DDIM}}}S_\theta(x_t,t).$$
Then, using SymPy, the Taylor expansions of $\rho_{t,h}^{\text{DDIM}}$ and $\mu_{t,h}^{\text{DDIM}}$ are computed as
$$\rho_{t,h}^{\text{DDIM}}=1+\frac{\beta_t}{2}h-\frac{h^2}{4}\left(\dot\beta_t-\frac{\beta_t^2}{2}\right)+\frac{h^3}{4}\left(\frac{\beta_t^3}{12}-\frac{\beta_t\dot\beta_t}{2}+\frac{\ddot\beta_t}{3}\right)+o(h^3),$$
$$\sqrt{\nu_t}\,\mu_{t,h}^{\text{DDIM}}=-\frac{\beta_t}{2}h+\frac{h^2}{4}\left(\dot\beta_t-\frac{\beta_t^2}{2\nu_t}\right)+\frac{h^3}{4}\left(-\frac{\beta_t^3}{12}+\frac{\beta_t^3}{4\nu_t}-\frac{\beta_t^3}{4\nu_t^2}+\frac{\beta_t\dot\beta_t}{2\nu_t}-\frac{\ddot\beta_t}{3}\right)+o(h^3).$$
Although it has been known that DDIM corresponds to the Euler method up to first-order terms (Song et al., 2020a; Salimans & Ho, 2022), this expansion gives a better understanding of the higher-order terms: they are exactly equivalent to our deterministic Quasi-Taylor sampler eq. (23) and eq. (24) up to third-order terms. This fact suggests that the assumptions behind the DDIM derivation are logically equivalent to our assumptions of ideal derivatives. The advantage of the proposed Quasi-Taylor method is that the order at which the Taylor expansion is truncated is a tunable hyperparameter. DDIM, on the other hand, automatically incorporates terms of much higher order, leaving no room for order tuning.
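The low-order coefficients of $\rho_{t,h}^{\text{DDIM}}$ can be checked mechanically. The following sketch (ours) instantiates the explicit linear schedule, for which the series can be computed and simplified without any custom rewriting rules.

```python
import sympy as sp

t, h, b0, b1 = sp.symbols('t h b0 b1', positive=True)
nu = 1 - sp.exp(-b0 * t - b1 * t**2)       # linear schedule (eq. 76)
beta = b0 + 2 * b1 * t                     # beta_t = -d/dt log(1 - nu_t)
nu_h = nu.subs(t, t - h)                   # nu_{t-h}

rho = sp.sqrt((1 - nu_h) / (1 - nu))       # DDIM coefficient of x_t
ser = sp.series(rho, h, 0, 3).removeO()
c1 = sp.simplify(ser.coeff(h, 1))
c2 = sp.simplify(ser.coeff(h, 2))
betadot = sp.diff(beta, t)
# expected from the expansion in the text: beta/2 and -(betadot - beta^2/2)/4
print(sp.simplify(c1 - beta / 2))                         # 0
print(sp.simplify(c2 + (betadot - beta**2 / 2) / 4))      # 0
```

The same check with symbolic $\nu_t$ requires the custom-derivative trick shown in the previous section.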
For example, consider the ODE
$$\dot x = -\frac{t-1}{1-e^{-(t-1)^2}}\,x, \qquad x(0)=1,$$
which is a simplified model of the linear schedule eq. (76). The exact solution is
$$x = \frac{\sqrt{e-1}}{\sqrt{e^{(t-1)^2}-1}},$$
which diverges at $t=1$. In this case, $a(x,t) = -x\,(t-1)/(1-e^{-(t-1)^2})$ is not Lipschitz continuous: the Taylor expansion of the denominator is $1-e^{-(t-1)^2} = (t-1)^2 + O((t-1)^4)$, so $a(x,t)$ is approximately $-x/(t-1)$ near $t=1$. In these cases, the coefficient $a(\cdot,\cdot)$ is not Lipschitz continuous; even such seemingly simple ODEs behave very badly unless the coefficients are carefully designed. In PF-ODE, the Lipschitz condition is written as
$$\operatorname{Lip}(\tilde f) = \left|\partial_{x_t}\left(\frac{\beta_t}{2}x_t - \frac{\beta_t}{2\sqrt{\nu_t}}S_\theta(x_t,t)\right)\right| < \infty.$$
Using the ideal derivative of $S_\theta(x_t,t)$, this condition translates to
$$\operatorname{Lip}(\tilde f) = \left|\beta_t\left(1-\frac{1}{\nu_t}\right)\right| = \frac{\dot\nu_t}{\nu_t} < \infty.$$
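The singular behavior of this toy ODE is easy to see numerically. The sketch below (ours) integrates it with classical RK4 up to $t=0.9$, where the coefficient is still smooth, and compares against the exact solution; by that point the solution has already grown by roughly a factor of 13, and it diverges as $t\to1$.

```python
import numpy as np

# toy ODE: xdot = -(t-1)/(1 - exp(-(t-1)^2)) * x,  x(0) = 1; coefficient blows up at t = 1
a = lambda x, t: -(t - 1) / (1 - np.exp(-(t - 1) ** 2)) * x
exact = lambda t: np.sqrt(np.e - 1) / np.sqrt(np.exp((t - 1) ** 2) - 1)

x, t = 1.0, 0.0
n = 9000
h = 0.9 / n                                 # integrate only up to t = 0.9
for _ in range(n):                          # classical RK4
    k1 = a(x, t); k2 = a(x + h/2 * k1, t + h/2)
    k3 = a(x + h/2 * k2, t + h/2); k4 = a(x + h * k3, t + h)
    x += h/6 * (k1 + 2*k2 + 2*k3 + k4); t += h
print(x, exact(0.9))
```

Pushing the integration closer to $t=1$ requires ever smaller steps, which is exactly the pathology the Lipschitz condition is meant to exclude.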

D.2 SPECIFIC SCHEDULES

Including this point, the necessary conditions for a variance schedule $\nu_t$ are summarized as follows.
1. $\nu_0 \approx 0$, so that the initial density $p(x_0,0)$ is close to the true data density.
2. $\nu_T \approx 1$, so that the terminal density $p(x_T,T)$ is close to the Gaussian.
3. Sufficiently smooth, so that $\beta_t = -\frac{d}{dt}\log(1-\nu_t)$ is well defined. In addition, $\beta_t$ should also be smooth so that the Taylor schemes can be used.
4. Monotonic ($s<t \implies \nu_s\le\nu_t$), to make $\beta_t$ non-negative.
5. Preferably, the drift coefficient $\tilde f$ should be Lipschitz continuous so that PF-ODE has a unique solution, i.e., $\operatorname{Lip}(\tilde f) \approx |\dot\nu_t/\nu_t| < \infty$.
The following two scheduling functions, which are common in diffusion generative models, satisfy conditions 1, 2, and 4 above (the linear schedule also satisfies condition 3):
$$\text{Linear:}\quad \nu_t = 1-e^{-\beta_0t-\beta_1t^2}, \qquad \beta_t = \beta_0+2\beta_1t,$$
$$\text{Cosine:}\quad \nu_t = 1-C\cos^2\left(\frac\pi2\cdot\frac{t/T+\varsigma}{1+\varsigma}\right), \qquad \beta_t = \begin{cases}\dfrac\pi T\tan\left(\dfrac\pi2\cdot\dfrac{t/T+\varsigma}{1+\varsigma}\right) & 0\le t\le T'\\[4pt] \Theta & T'<t\le T,\end{cases}$$
where the clipping threshold keeps $\beta_t$ bounded near the terminal time.

E SUPPLEMENT ON FUNDAMENTALS

For convenience, let us summarize some basics behind the ideas in this paper. The contents of this section are not particularly novel, but the authors expect that this section will give a better understanding of the ideas of this paper and of the continuous-time approach to diffusion generative models.

E.1 TAYLOR EXPANSION AND ITÔ-TAYLOR EXPANSION

E.1.1 TAYLOR EXPANSION OF DETERMINISTIC SYSTEMS

1-dimensional case. Let us first consider a 1-dim deterministic system $\dot x(t) = a(x(t),t)$, where $a(\cdot,\cdot)$ is sufficiently smooth, and derive the Taylor-series expression of the solution of this ODE. Let $\varphi(x(t),t)$ be a differentiable function. Its total derivative is written as
$$d\varphi = \frac{\partial\varphi}{\partial t}dt + \frac{\partial\varphi}{\partial x}dx = \frac{\partial\varphi}{\partial t}dt + \frac{\partial\varphi}{\partial x}\frac{dx}{dt}dt = \left(\frac{\partial\varphi}{\partial t}+\frac{\partial\varphi}{\partial x}a(x,t)\right)dt = \underbrace{\left(\frac{\partial}{\partial t}+a(x,t)\frac{\partial}{\partial x}\right)}_{=:L}\varphi\,dt.$$
Integrating both sides from $0$ to $t$, we have
$$\varphi(x(t),t) = \varphi(x(0),0)+\int_0^t(L\varphi)(x(s),s)\,ds.$$
We use this formula recursively to obtain the Taylor series of the above system.
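The schedule conditions above can be verified numerically for the linear schedule. This sketch (ours, with the paper's $\beta_0=0.1$, $\beta_1=9.95$) checks the endpoint conditions, monotonicity, the identity $\dot\nu_t=(1-\nu_t)\beta_t$, and the finiteness of the Lipschitz quantity $\dot\nu_t/\nu_t$.

```python
import numpy as np

b0, b1 = 0.1, 9.95
nu   = lambda t: 1 - np.exp(-b0 * t - b1 * t ** 2)   # linear schedule
beta = lambda t: b0 + 2 * b1 * t                      # beta_t = -d/dt log(1 - nu_t)

t = np.linspace(0.05, 0.95, 19)
eps = 1e-6
nudot = (nu(t + eps) - nu(t - eps)) / (2 * eps)       # numerical nu-dot
lip = nudot / nu(t)                                   # Lipschitz quantity
print(nu(0.0), nu(1.0), lip.max())
```

Condition 5 fails only at $t=0$, where $\nu_0=0$ makes $\dot\nu_t/\nu_t$ blow up; this is the time-origin singularity discussed above.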
Let $\varphi(x(t),t)=x(t)$; then we have
$$x(t)=x(0)+\int_0^t(Lx)(x(s),s)\,ds=x(0)+\int_0^ta(x(s),s)\,ds.$$
Let $\varphi(x(t),t)=a(x(t),t)$; then we have
$$a(x(t),t)=a(x(0),0)+\int_0^t(La)(x(s),s)\,ds.$$
Using the above two equations, we have
$$x(t)=x(0)+\int_0^t\left[a(x(0),0)+\int_0^{t_1}(La)(x(t_2),t_2)\,dt_2\right]dt_1 = x(0)+t\,a(x(0),0)+\int_0^t\!\!\int_0^{t_1}(La)(x(t_2),t_2)\,dt_2\,dt_1.$$
We can expand the term inside the integral again using the relation
$$(La)(x(t),t)=(La)(x(0),0)+\int_0^t(L^2a)(x(s),s)\,ds,$$
and obtain the expansion
$$x(t)=x(0)+t\,a(x(0),0)+\frac{t^2}{2}(La)(x(0),0)+\int_0^t\!\!\int_0^{t_1}\!\!\int_0^{t_2}(L^2a)(x(t_3),t_3)\,dt_3\,dt_2\,dt_1.$$
Applying this argument recursively, we can obtain the Taylor series of the deterministic system, if $a(\cdot,\cdot)$ is sufficiently smooth:
$$x(t)=x(0)+t\,a(x(0),0)+\sum_{k=2}^\infty\frac{t^k}{k!}(L^{k-1}a)(x(0),0).$$
Since the above discussion is valid when the integration interval is $(t,t+h)$ instead of $(0,t)$, it can also be written as
$$x(t+h)=x(t)+h\,a(x(t),t)+\sum_{k=2}^\infty\frac{h^k}{k!}(L^{k-1}a)(x(t),t).$$

Multi-dimensional case. Let us consider the multi-dimensional ODE $\dot x = a(x,t)$. The total derivative of a smooth scalar function $\varphi(x,t)$ is written as
$$d\varphi=\frac{\partial\varphi}{\partial t}dt+\sum_i\frac{\partial\varphi}{\partial x_i}dx_i=\frac{\partial\varphi}{\partial t}dt+\sum_ia_i(x,t)\frac{\partial\varphi}{\partial x_i}dt=\left(\frac{\partial\varphi}{\partial t}+(a(x,t)\cdot\nabla)\varphi\right)dt=\left(\frac{\partial}{\partial t}+a(x,t)\cdot\nabla\right)\varphi\,dt.$$
Let $\varphi(x,t)$ be a vector-valued smooth function; then we immediately have $d\varphi=(\partial_t+a(x,t)\cdot\nabla)\varphi\,dt$. Using the scalar operator $L=(\partial_t+a(x,t)\cdot\nabla)$, we can obtain the following Taylor expansion similarly to the 1-dim case:
$$x(t)=x(0)+t\,(Lx)\big|_{t=0}+\frac{t^2}{2}(L^2x)\big|_{t=0}+\cdots = x(0)+t\,a(x(0),0)+\frac{t^2}{2}\big[(\partial_t+a(x,t)\cdot\nabla)\,a(x,t)\big]_{t=0}+\cdots.$$
The second-order term $(a\cdot\nabla)a$ is written as
$$(a\cdot\nabla)\,a=\sum_ia_i\partial_{x_i}\begin{pmatrix}a_1(x,t)\\\vdots\\a_d(x,t)\end{pmatrix}=\begin{pmatrix}a_1\partial_{x_1}a_1&\cdots&a_d\partial_{x_d}a_1\\\vdots&\ddots&\vdots\\a_1\partial_{x_1}a_d&\cdots&a_d\partial_{x_d}a_d\end{pmatrix}\begin{pmatrix}1\\\vdots\\1\end{pmatrix}.$$
In the special case where each dimension is separable, i.e. $\partial_{x_i}a_j=0$ $(i\ne j)$, the above $d\times d$ matrix is diagonal, and we have
$$(a\cdot\nabla)\,a=\begin{pmatrix}a_1\partial_{x_1}a_1\\\vdots\\a_d\partial_{x_d}a_d\end{pmatrix}=\begin{pmatrix}a_1\\\vdots\\a_d\end{pmatrix}\odot\begin{pmatrix}\partial_{x_1}\\\vdots\\\partial_{x_d}\end{pmatrix}\odot\begin{pmatrix}a_1\\\vdots\\a_d\end{pmatrix}=a\odot\nabla^\odot\odot\,a,$$
where $\odot$ is the element-wise product or operation. In this case, it is sufficient to consider each dimension separately, and the system is formally equivalent to the 1-dim case.
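The second-order Taylor scheme that this expansion yields can be demonstrated on a tiny separable example. This sketch (ours) solves $\dot x=-tx$ (exact solution $e^{-t^2/2}$) with the update $x\mathrel{+}=h\,a+\tfrac{h^2}{2}(La)$, where $La=\partial_ta+a\,\partial_xa$; halving $h$ should shrink the global error by roughly $4\times$.

```python
import numpy as np

a  = lambda x, t: -t * x                   # test ODE: xdot = -t x, x(0) = 1
La = lambda x, t: -x - t * a(x, t)         # (L a) = da/dt + a * da/dx = (-1 + t^2) x
exact = lambda t: np.exp(-t ** 2 / 2)

def taylor2(h, T=1.0):
    """Second-order Taylor scheme x <- x + h a + h^2/2 (L a)."""
    x, t = 1.0, 0.0
    for _ in range(round(T / h)):
        x += h * a(x, t) + h * h / 2 * La(x, t)
        t += h
    return x

errs = [abs(taylor2(h) - exact(1.0)) for h in (0.1, 0.05, 0.025)]
print(errs)
```

The quasi-Taylor samplers in the main text are exactly this scheme, with the operator applied to the PF-ODE drift and the ideal derivatives substituted for the intractable ones.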

E.1.2 ITÔ-TAYLOR EXPANSION OF STOCHASTIC SYSTEMS

For the above reasons, it is sufficient to consider the 1-dim case even in the stochastic setting. As is well known, the naive Taylor expansion is not valid for stochastic systems $dx_t=a(x_t,t)dt+b(x_t,t)dB_t$, because of the relation "$dB_t^2\sim dt$". This effect is taken into account in the celebrated Itô's lemma, i.e., the stochastic version of eq. (80):
$$d\varphi=\underbrace{\left(\frac{\partial}{\partial t}+a(x,t)\frac{\partial}{\partial x}+\frac{b(x,t)^2}{2}\frac{\partial^2}{\partial x^2}\right)}_{=:L}\varphi\,dt+\underbrace{b(x,t)\frac{\partial}{\partial x}}_{=:G}\varphi\,dB_t.$$
By using Itô's formula recursively, we can obtain the following higher-order expansion of a stochastic system, called the Itô-Taylor expansion:
$$x_h=x_0+\int_0^ha(x_t,t)\,dt+\int_0^hb(x_t,t)\,dB_t \tag{95}$$
$$=x_0+\int_0^h\left[a(x_0,0)+\int_0^t(La)(x_s,s)\,ds+\int_0^t(Ga)(x_s,s)\,dB_s\right]dt+\int_0^h\left[b(x_0,0)+\int_0^t(Lb)(x_s,s)\,ds+\int_0^t(Gb)(x_s,s)\,dB_s\right]dB_t$$
$$=x_0+a(x_0,0)h+b(x_0,0)B_h+(La)(x_0,0)\int_0^h\!\!\int_0^tds\,dt+(Ga)(x_0,0)\int_0^h\!\!\int_0^tdB_s\,dt+(Lb)(x_0,0)\int_0^h\!\!\int_0^tds\,dB_t+(Gb)(x_0,0)\int_0^h\!\!\int_0^tdB_s\,dB_t+\text{Remainder},$$
where the integrands inside the double integrals have been frozen at $(x_0,0)$, and the remainder consists of triple integrals such as
$$\text{Remainder}=\int_0^h\!\!\int_0^t\!\!\int_0^s(L^2a)(x_u,u)\,du\,ds\,dt+\cdots.$$
We ignore these terms now. If we also ignore the double integrals, we obtain the Euler-Maruyama scheme.

Evaluation of each integral in eq. (96). The double integrals to evaluate are
$$\int_0^h\!\!\int_0^tds\,dt\ \ \text{(deterministic)},\qquad \int_0^h\!\!\int_0^tdB_s\,dt\ \ \text{(stochastic 1)},\qquad \int_0^h\!\!\int_0^tds\,dB_t\ \ \text{(stochastic 2)},\qquad \int_0^h\!\!\int_0^tdB_s\,dB_t\ \ \text{(stochastic 3)}.$$
Deterministic: The deterministic one, $\int_0^h\int_0^tds\,dt=h^2/2$, is easy to evaluate.
Stochastic 1: Let us write $z:=\int_0^h\int_0^tdB_s\,dt=\int_0^hB_t\,dt$. (99) As $z$ is the limit of a sum of Gaussian variables,
$$z=\lim_{n\to\infty}\sum_{i=0}^{n-1}\frac hnB_{hi/n}=\lim_{n\to\infty}\frac hn\sum_{i=0}^{n-1}\sum_{j=1}^i\big(B_{hj/n}-B_{h(j-1)/n}\big)=\lim_{n\to\infty}\frac hn\sum_{i=0}^{n-1}\sum_{j=1}^iW_j,\qquad W_j\sim\mathcal N(0,h/n),$$
$z$ is also Gaussian, with mean $0$. The variance, however, requires some discussion, which we defer; we shall also see that $z$ is correlated with $B_h$, i.e. $\mathbb E[z\,B_h]\ne0$.

Stochastic 2: The second one is related to the first by the integration-by-parts formula; see e.g. (Øksendal, 2013, Theorem 4.1.5):
$$\int_0^h\!\!\int_0^tds\,dB_t=\int_0^ht\,dB_t=\big[tB_t\big]_0^h-\int_0^hB_t\,dt=hB_h-z.$$

Stochastic 3: The third one is computed using a famous formula of the Itô integral:
$$\int_0^h\!\!\int_0^tdB_s\,dB_t=\int_0^hB_t\,dB_t=\frac12\big(B_h^2-h\big),$$
which is derived by substituting $\varphi(x,t)=x^2$ and $x_t=B_t$ (i.e., $dx_t=0\cdot dt+1\cdot dB_t$) into Itô's formula eq. (94).

Substituting the integrals into the Itô-Taylor expansion: Let us denote $w:=B_h\sim\mathcal N(0,h)$, and shift the integration interval from $(0,h)$ to $(t,t+h)$. Then we may rewrite the above second-order expansion, which we have already seen in the main text, as
$$x_{t+h}=x_t+h\,a(x_t,t)+w\,b(x_t,t)+(La)(x_t,t)\,\frac{h^2}2+(Ga)(x_t,t)\,z+(Lb)(x_t,t)\,(wh-z)+(Gb)(x_t,t)\,\frac{w^2-h}2.$$

Covariance of the random variables $w,z$: Next, let us evaluate the variance of $z$ and the correlation between the Gaussian variables $w$ and $z$. First, by Itô's isometry,
$$\mathbb E[(wh-z)^2]=\mathbb E\left[\left(\int_0^hs\,dB_s\right)^2\right]=\mathbb E\left[\int_0^hs^2\,ds\right]=\frac13h^3.$$
The correlation between $w$ and $z$ is similarly evaluated by Itô's isometry (see (Øksendal, 2013, proof of Lemma 3.1.5)):
$$\mathbb E[w\,(hw-z)]=\mathbb E\left[B_h\int_0^hs\,dB_s\right]=\int_0^hs\,ds=\frac12h^2, \qquad\text{hence}\qquad \mathbb E[wz]=h\,\mathbb E[w^2]-\frac12h^2=\frac12h^2.$$
Using the above variance and covariance, we can calculate the variance of $z$ as
$$\mathbb E[z^2]=\mathbb E\big[(wh-z)^2-h^2w^2+2hwz\big]=\frac13h^3-h^2\cdot h+2h\cdot\frac{h^2}2=\frac13h^3.$$
We need to find random variables $w, z$ that satisfy the requirements for the (co)variances, $\mathbb{E}[w^2] = h$, $\mathbb{E}[wz] = h^2/2$ and $\mathbb{E}[z^2] = h^3/3$, and we can easily verify that the following ones do,
$$\begin{pmatrix} w \\ z \end{pmatrix} = \begin{pmatrix} \sqrt{h} & 0 \\ h\sqrt{h}/2 & h\sqrt{h}/(2\sqrt{3}) \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \end{pmatrix}, \qquad u_1, u_2 \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1). \tag{108}$$
Let us compute the covariance matrix just to be sure,
$$\mathbb{E}\left[\begin{pmatrix} w \\ z\end{pmatrix}\begin{pmatrix} w & z\end{pmatrix}\right] = \begin{pmatrix} \sqrt{h} & 0 \\ \frac{h\sqrt{h}}{2} & \frac{h\sqrt{h}}{2\sqrt{3}} \end{pmatrix}\,\mathbb{E}\begin{pmatrix} u_1^2 & u_1 u_2 \\ u_1 u_2 & u_2^2\end{pmatrix}\,\begin{pmatrix} \sqrt{h} & \frac{h\sqrt{h}}{2} \\ 0 & \frac{h\sqrt{h}}{2\sqrt{3}} \end{pmatrix} = \begin{pmatrix} h & h^2/2 \\ h^2/2 & h^3/3 \end{pmatrix}. \tag{111}$$

For simplicity, we consider the 1-dim case here. The following approach using Fourier methods (the characteristic function) is easy and intuitive. See also e.g. (Cox & Miller, 1965, §5), (Karlin & Taylor, 1981, §15), (Shreve, 2004, §6) and (Särkkä & Solin, 2019, §5). Let us consider an infinitesimal time step $h$. Then $x_{t+h}$ is written as follows,
$$x_{t+h} = x_t + w_t, \qquad w_t \sim \mathcal{N}\big(f(x_t,t)h,\ g(t)^2 h\big). \tag{112}$$
Let us consider the characteristic function $\varphi_x(\omega) := \mathbb{E}_{p(x)}[\exp(i\omega x)]$ (where $i = \sqrt{-1}$), which is the Fourier transform of the density function $p(x)$. Because of the convolution theorem $\varphi_{x+y}(\omega) = \varphi_x(\omega)\,\varphi_y(\omega)$, the characteristic functions of $p(x_{t+h}, t+h)$, $p(x_t, t)$, and $p(w_t)$ have the following relation,
$$\varphi_{x_{t+h}}(\omega) = \varphi_{w_t}(\omega)\,\varphi_{x_t}(\omega). \tag{113}$$
It is easily shown that the characteristic function of the above Gaussian is given by
$$\varphi_{w_t}(\omega) = \exp\Big(i\omega f(x_t,t)h - \frac{1}{2}g(t)^2 h\omega^2\Big). \tag{114}$$
Expanding the r.h.s. up to the first-order terms of $h$, we have $\varphi_{w_t}(\omega) = 1 + i\omega f(x_t,t)h - \frac{1}{2}g(t)^2 h\omega^2 + O(h^2)$. Thus we obtain
$$\frac{\varphi_{x_{t+h}}(\omega) - \varphi_{x_t}(\omega)}{h} = \Big(-(-i\omega)f(x_t,t) + (-i\omega)^2\frac{g(t)^2}{2}\Big)\varphi_{x_t}(\omega) + O(h).$$
When $h \to 0$,
$$\partial_t \varphi_{x_t}(\omega) = -(-i\omega)f(x_t,t)\,\varphi_{x_t}(\omega) + (-i\omega)^2\frac{g(t)^2}{2}\varphi_{x_t}(\omega).$$
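As a quick numerical check of the construction in eq. (108), one can draw many $(w, z)$ pairs and compare their empirical covariance with the target matrix above (a minimal sketch; the function name is our own):

```python
import numpy as np

def sample_wz(h, n, rng):
    """Draw n correlated pairs (w, z) via the lower-triangular factor of eq. (108)."""
    u = rng.standard_normal((2, n))                      # u1, u2 ~ iid N(0, 1)
    L = np.array([[np.sqrt(h), 0.0],
                  [h * np.sqrt(h) / 2, h * np.sqrt(h) / (2 * np.sqrt(3))]])
    return L @ u                                         # rows: w, z

h, n = 0.1, 1_000_000
rng = np.random.default_rng(1)
wz = sample_wz(h, n, rng)
emp = wz @ wz.T / n                                      # empirical covariance matrix
target = np.array([[h, h**2 / 2], [h**2 / 2, h**3 / 3]])
print(np.abs(emp - target).max())
```

The maximum entrywise deviation shrinks as $O(1/\sqrt{n})$, confirming the covariance requirements $\mathbb{E}[w^2] = h$, $\mathbb{E}[wz] = h^2/2$, $\mathbb{E}[z^2] = h^3/3$.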
Since $(-i\omega)$ in the Fourier domain corresponds to the spatial derivative $\partial_{x_t}$ in the real domain [fn. 6], this translates as follows,
$$\partial_t p(x_t, t) = -\partial_{x_t}\big(f(x_t,t)\,p(x_t,t)\big) + \partial_{x_t}^2\Big(\frac{g(t)^2}{2}p(x_t,t)\Big),$$
and thus we have obtained the Fokker-Planck equation. In particular, if $f = 0$, this equation is called the heat equation, which was also first studied by Fourier.

Example: Overdamped Langevin Dynamics. When the drift term is the gradient of a potential function $U(\cdot)$, the SDE is often called the overdamped Langevin equation [fn. 7],
$$dx_t = -\nabla_{x_t}U(x_t)\,dt + \sqrt{2D}\,dB_t,$$
where $D$ is a scalar constant. Its associated Fokker-Planck equation is
$$\partial_t p(x_t,t) = \nabla_{x_t}\cdot\big(\nabla_{x_t}U(x_t)\,p(x_t,t) + D\,\nabla_{x_t}p(x_t,t)\big) = \nabla_{x_t}\cdot\Big[\nabla_{x_t}\big(U(x_t) + D\log p(x_t,t)\big)\,p(x_t,t)\Big].$$
If $U(x_t) + D\log p(x_t,t)$ is constant, the r.h.s. will be zero, and therefore $\partial_t p(x_t,t) = 0$. That is,
$$p(x) = \frac{1}{Z}e^{-U(x)/D}, \qquad \text{where } Z = \int_{\mathbb{R}^d} e^{-U(x)/D}\,dx, \tag{121}$$
is the stationary solution of the FPE, and it no longer evolves over time.

Footnote 6: Integration by parts with $p(-\infty) = p(\infty) = 0$. Formally writing, $\mathbb{E}_{\frac{d}{dx}p(x)}[e^{i\omega x}] = \int_{-\infty}^{\infty} e^{i\omega x}\frac{d}{dx}p(x)\,dx = -\int_{-\infty}^{\infty} i\omega\, e^{i\omega x}p(x)\,dx = (-i\omega)\,\mathbb{E}_{p(x)}[e^{i\omega x}]$.

Footnote 7: The Langevin equation should actually be the following equation system, which is called the underdamped Langevin equation, $\dot{x}_t = v_t$, $M\dot{v}_t = -\gamma v_t - \nabla_{x_t}U(x_t) + \sqrt{2D}\,\frac{dB_t}{dt}$, where $v_t$ is the velocity (momentum) variable, $M$ is the mass of the particle, and $\gamma$ is a constant called the friction or viscosity coefficient. In this case, the energy function should also be modified as $E = \frac{1}{2}Mv^2 + U(x)$. Assuming that the mass is very small compared to the friction, the momentum derivative $M\dot{v}_t$ can be ignored (i.e., if the force $F$ is constant, the ODE $M\dot{v} = -\gamma v + F$ has the solution $v = Ce^{-t\gamma/M} + F/\gamma$, and the velocity immediately converges to $F/\gamma$ when $M \ll \gamma$), and the overdamped equation is obtained.
Therefore, it is expected that the particles obeying the Langevin equation will eventually follow this Boltzmann distribution. Let us compare the microscopic dynamics of each particle obeying the Langevin equation with the macroscopic dynamics of the population obeying the FPE, using a 2-dim toy model. The energy function we use here is the following Gaussian mixture model,
$$U(x,y) = -\log\sum_{k=1}^{5}\exp\left(-\frac{\big(x - \cos\frac{2k\pi}{5}\big)^2 + \big(y - \sin\frac{2k\pi}{5}\big)^2}{2\sigma^2}\right), \tag{122}$$
where $\sigma = 0.1$. The diffusion parameter was $D = 5$. Figure 13 shows the force field $-\nabla U(\mathbf{x})$, where $\mathbf{x} = (x, y)$, the potential function $U(\mathbf{x})$, and the Boltzmann distribution $p(\mathbf{x}) = \frac{1}{Z}e^{-U(\mathbf{x})/D}$. Figure 14 shows the time evolution of the Langevin and Fokker-Planck equations, where the initial density was $p(\mathbf{x}_0, 0) = \mathcal{N}(0, I)$. The time step size was $h = 5\times 10^{-5}$, and the figures are plotted every 50 steps. These figures will be helpful to intuitively understand that the SDE (Langevin equation) and the FPE describe the same phenomenon from different perspectives: an individual description or a population description.
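The Langevin side of this comparison can be sketched in a few lines. The following is our own minimal reimplementation of the toy setup; only the potential of eq. (122), $\sigma = 0.1$, $D = 5$ and $h = 5\times 10^{-5}$ are taken from the text, while particle counts and step counts are arbitrary choices.

```python
import numpy as np

def grad_U(p, sigma=0.1):
    """Gradient of the Gaussian-mixture potential of eq. (122).
    p: particle positions, shape (N, 2)."""
    centers = np.stack([(np.cos(2 * k * np.pi / 5), np.sin(2 * k * np.pi / 5))
                        for k in range(1, 6)])           # five modes on the unit circle
    d = p[:, None, :] - centers[None, :, :]              # (N, 5, 2)
    sq = (d ** 2).sum(-1) / (2 * sigma ** 2)             # (N, 5)
    wgt = np.exp(-(sq - sq.min(1, keepdims=True)))       # stabilized softmax weights
    wgt /= wgt.sum(1, keepdims=True)
    return (wgt[:, :, None] * d).sum(1) / sigma ** 2     # gradient of -log-sum-exp

rng = np.random.default_rng(2)
D, h, n_steps = 5.0, 5e-5, 1000
x = rng.standard_normal((2000, 2))                       # initial density N(0, I)
for _ in range(n_steps):                                 # Euler-Maruyama for the Langevin SDE
    x += -grad_U(x) * h + np.sqrt(2 * D * h) * rng.standard_normal(x.shape)
print(np.isfinite(x).all())
```

Plotting histograms of `x` at intermediate steps reproduces the particle-side panels of Figure 14.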

E.3 DERIVATION OF THE PROBABILITY FLOW ODE AND REVERSE-SDE E.3.1 DERIVATION OF THE PROBABILITY FLOW ODE

For convenience, the derivation of the PF-ODE is briefly described here. For the more general case where the diffusion coefficient is a matrix-valued function depending on $x_t$, i.e., $G(x_t, t)$, please refer to the original paper (Song et al., 2020b). Firstly, let us consider an SDE, $dx_t = f(x_t,t)\,dt + g(t)\,dB_t$. The associated FPE is
$$\partial_t p(x_t,t) = -\nabla\cdot\big(f(x_t,t)\,p(x_t,t)\big) + \Delta\Big(\frac{1}{2}g(t)^2 p(x_t,t)\Big).$$
Here, let us forcibly incorporate the diffusion term (i.e., the second-order derivative) into the drift term (first-order derivative). Noting that the Laplacian is the inner product of two $\nabla$-s, i.e., $\Delta = \nabla\cdot\nabla$,
$$\text{r.h.s.} = -\nabla\cdot\Big(f(x_t,t)\,p(x_t,t) - \nabla\Big(\frac{1}{2}g(t)^2 p(x_t,t)\Big)\Big) = -\nabla\cdot\Big[\Big(f(x_t,t) - \frac{1}{p(x_t,t)}\nabla\Big(\frac{1}{2}g(t)^2 p(x_t,t)\Big)\Big)p(x_t,t)\Big] = -\nabla\cdot\Big[\underbrace{\Big(f(x_t,t) - \frac{1}{2}g(t)^2\nabla\log p(x_t,t)\Big)}_{=:\,\bar{f}(x_t,t)}\,p(x_t,t)\Big].$$
Thus we have obtained a diffusion-free version of the FPE of the form,
$$\partial_t p(x_t,t) = -\nabla\cdot\big(\bar{f}(x_t,t)\,p(x_t,t)\big) + \Delta\big(0\cdot p(x_t,t)\big).$$
Since it is an 'FPE', it has an associated 'SDE' as follows,
$$dx_t = \bar{f}(x_t,t)\,dt + 0\cdot dB_t.$$
Although this 'SDE' does not give the same particle-wise dynamics as the original SDE, its macroscopic dynamics of the population is exactly the same as the original one, since the associated FPE has not changed from the original one at all. That is, the density evolves from $p(x_0, 0)$ to $p(x_T, T)$ in exactly the same way. Since the obtained 'SDE' is actually a deterministic ODE, its time-reversal is simply obtained by flipping the sign, i.e.,
$$dx_t = \big(-\bar{f}(x_t,t)\big)(-dt). \tag{128}$$
Thus the Probability Flow ODE is obtained. If the terminal random variables are drawn from $p(x_T, T)$, then the particles obeying this ODE will reconstruct the initial density $p(x_0, 0)$ as a population. Note that a vector field of the form $j := \rho v$ is often referred to as the flux, particularly in physical contexts e.g.
fluid dynamics, where $\rho$ is a density and $v$ is a vector field (such as a velocity field), and a PDE of the form
$$\partial_t \rho + \nabla\cdot j = 0 \tag{129}$$
is often called the continuity equation, which is closely related to conservation laws; in the fluid case, the mass conservation law. Another well-known example is the charge conservation law in electromagnetism, where $\rho$ is the charge density and $j$ is the current density. In the present case eq. (126), $j := p\bar{f}$ is understood as the flux, and therefore $\bar{f}$ can be understood as a sort of velocity field; the conservation law corresponds to the fact that the total probability mass is constant. This gives an intuitive understanding of why the method is called the Probability "Flow" ODE.
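The claim that the PF-ODE transports exactly the same marginals as the SDE can be verified directly when the score is known in closed form. The sketch below is our own example (not from the paper): for the VP drift $f = -\beta x/2$, $g = \sqrt{\beta}$ with constant $\beta$ and a Gaussian initial density, $p(x,t) = \mathcal{N}(0, v_t)$ with $v_t = s_0^2 e^{-\beta t} + 1 - e^{-\beta t}$, so the PF-ODE drift is $\bar{f}(x,t) = -\tfrac{\beta}{2}x\,(1 - 1/v_t)$.

```python
import numpy as np

beta, s0sq, T, n_steps = 2.0, 4.0, 1.0, 200
h = T / n_steps

def v(t):
    """Marginal variance of the VP-SDE started from N(0, s0sq)."""
    return s0sq * np.exp(-beta * t) + 1.0 - np.exp(-beta * t)

def f_bar(x, t):
    """PF-ODE drift f - (g^2/2) * score, with score = -x / v(t)."""
    return -0.5 * beta * x * (1.0 - 1.0 / v(t))

rng = np.random.default_rng(3)
x = rng.standard_normal(50_000) * np.sqrt(s0sq)    # samples from p(x, 0)
for i in range(n_steps):                           # Heun integration of the deterministic ODE
    t = i * h
    k1 = f_bar(x, t)
    k2 = f_bar(x + h * k1, t + h)
    x += h * (k1 + k2) / 2
print(x.var(), v(T))
```

Even though each particle moves deterministically, the ensemble variance tracks the SDE marginal $v_t$, illustrating that only the population dynamics, not the path-wise dynamics, is preserved.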

E.3.2 DERIVATION OF R-FPE FROM KBE

In this section, let us show that the R-FPE eq. (4) follows from the KBE eq. (3). Let us consider the case where the diffusion coefficient is independent of the spatial variable, i.e., $g(x_t, t) = g(t)$. The overall discussion is based on (Anderson, 1982, §5), though the following calculation is not explicitly written in that paper. Firstly, the KBE is
$$-\partial_s p(x_t,t \mid x_s,s) = f(x_s,s)\cdot\nabla_{x_s}p(x_t,t \mid x_s,s) + \frac{g(s)^2}{2}\Delta_{x_s}p(x_t,t \mid x_s,s),$$
and what we need to show is that this is equivalent to the R-FPE,
$$-\partial_s p(x_s,s \mid x_t,t) = \nabla_{x_s}\cdot\Big[\big(f(x_s,s) - g(s)^2\nabla_{x_s}\log p(x_s,s)\big)\,p(x_s,s \mid x_t,t)\Big] + \Delta_{x_s}\Big[\frac{g(s)^2}{2}p(x_s,s \mid x_t,t)\Big].$$
For simplicity, we show that the latter reduces to the former. By the Bayes theorem, $p(x_s,s \mid x_t,t) = p(x_t,t \mid x_s,s)\,p(x_s,s)/p(x_t,t)$, where the denominator is constant w.r.t. $x_s$ and $s$. To simplify the notation, let us denote $p_s := p(x_s,s)$ and $p_{t|s} := p(x_t,t \mid x_s,s)$, and drop the variables $x_s$ and $s$ (that is, $\partial$ always means $\partial_s$, and $\nabla$ always means $\nabla_{x_s}$); recall also the Leibniz rule $\nabla\cdot(au) = a\cdot\nabla u + (\nabla\cdot a)u$.

Now let us evaluate each term. The l.h.s. (up to the constant factor $1/p(x_t,t)$) is
$$-\partial(p_{t|s}p_s) = -(\partial p_{t|s})p_s - p_{t|s}\,\partial p_s = -(\partial p_{t|s})p_s - p_{t|s}\Big(-\nabla\cdot(f p_s) + \Delta\frac{g^2}{2}p_s\Big) \quad (\because \text{FPE})$$
$$= -(\partial p_{t|s})p_s - p_{t|s}\Big(-f\cdot(\nabla p_s) - (\nabla\cdot f)p_s + \Delta\frac{g^2}{2}p_s\Big) \quad (\because \text{Leibniz rule}).$$
The drift term is
$$\nabla\cdot\Big[\big(f - g^2\nabla\log p_s\big)p_{t|s}p_s\Big] = \nabla\cdot\Big[\Big(f - g^2\frac{\nabla p_s}{p_s}\Big)p_{t|s}p_s\Big] = \nabla\cdot\big(f p_{t|s}p_s\big) - \nabla\cdot\big(g^2(\nabla p_s)p_{t|s}\big)$$
$$= f\cdot\nabla(p_{t|s}p_s) + (\nabla\cdot f)p_{t|s}p_s - g^2(\nabla p_s)\cdot(\nabla p_{t|s}) - g^2(\Delta p_s)p_{t|s}$$
$$= f\cdot\big(p_s\nabla p_{t|s} + p_{t|s}\nabla p_s\big) + (\nabla\cdot f)p_{t|s}p_s - g^2(\nabla p_s)\cdot(\nabla p_{t|s}) - g^2(\Delta p_s)p_{t|s}.$$
Finally, the diffusion term is
$$\Delta\Big(\frac{g^2}{2}p_{t|s}p_s\Big) = \nabla\cdot\nabla\Big(\frac{g^2}{2}p_{t|s}p_s\Big) = \frac{g^2}{2}\big(p_s\Delta p_{t|s} + p_{t|s}\Delta p_s + 2\nabla p_s\cdot\nabla p_{t|s}\big).$$
Now let us simplify the R-FPE using the above relations:
$$\partial p_{s|t} + \nabla\cdot\big[(f - g^2\nabla\log p_s)p_{s|t}\big] + \Delta\Big(\frac{g^2}{2}p_{s|t}\Big) \propto \partial(p_{t|s}p_s) + \nabla\cdot\big[(f - g^2\nabla\log p_s)p_{t|s}p_s\big] + \Delta\Big(\frac{g^2}{2}p_{t|s}p_s\Big) = p_s\Big(\partial p_{t|s} + f\cdot\nabla p_{t|s} + \frac{g^2}{2}\Delta p_{t|s}\Big),$$
where all the remaining terms cancel pairwise. It implies that if the following relation holds, then the R-FPE also holds,
$$\partial_s p_{t|s} + f\cdot\nabla p_{t|s} + \frac{g^2}{2}\Delta p_{t|s} = 0.$$
This equation is nothing other than the KBE. Hence, the KBE implies the R-FPE.
We may also do the same thing for the case where the diffusion coefficient $g(x_t, t)$ is dependent on the spatial variable $x_t$. In this case, the target R-FPE is modified as
$$-\partial p_{s|t} = \nabla\cdot\Big[\Big(f - \frac{1}{p_s}\nabla(g^2 p_s)\Big)p_{s|t}\Big] + \Delta\Big(\frac{g^2}{2}p_{s|t}\Big) = \nabla\cdot\Big[\big(f - \nabla g^2 - g^2\nabla\log p_s\big)p_{s|t}\Big] + \Delta\Big(\frac{g^2}{2}p_{s|t}\Big),$$
but we can also show that the KBE implies this relation. The use of computer algebra systems such as SymPy can reduce the calculation effort for checking this fact.

E.4 ON THE CONVERGENCE OF NUMERICAL SDE SCHEMES.

Let us introduce two convergence concepts which are commonly used in numerical SDE studies. See also (Kloeden et al., 1994, §3.3, §3.4).

Definition 1 (Strong Convergence). Let $\bar{x}_t$ be the path in the continuous limit $h \to 0$, and let $x_t$ be the discretized numerical path, computed by a numerical scheme with step size $h > 0$. The numerical scheme is said to have strong order of convergence $\gamma$ if the following inequality holds for a certain constant $K_s > 0$,
$$\mathbb{E}[|x_t - \bar{x}_t|] \leq K_s h^\gamma. \tag{137}$$

Definition 2 (Weak Convergence). Similarly, the scheme is said to have weak order of convergence $\beta$ if the following inequality holds for any test function $\varphi(\cdot)$ in a certain class of functions, and a certain constant $K_w > 0$,
$$\big|\mathbb{E}[\varphi(x_t)] - \mathbb{E}[\varphi(\bar{x}_t)]\big| \leq K_w h^\beta. \tag{138}$$

It is known that the Euler-Maruyama scheme has strong order $\gamma = 0.5$ and weak order $\beta = 1$ in general. However, in more specific cases, including the diffusion generative models in which the diffusion coefficient $g(t)$ does not depend on $x_t$, the Euler-Maruyama scheme has the slightly better strong order $\gamma = 1$. Strong convergence is concerned with the precision of the path, while weak convergence is concerned with the precision of the moments. In our case, we are not much interested in whether a sample $x_0$ generated using a finite $h > 0$ approximates the continuous limit $\bar{x}_0$ $(h \to 0)$ driven by the same Brownian motion. Instead, we are more interested in whether the density $p(x_0)$ of the samples generated with a finite step size $h > 0$ approximates the ideal density $\bar{p}(x_0)$ $(h \to 0)$, which is supposed to approximate the true density. In this sense, the concept of strong convergence is not very important for us; weak convergence would be sufficient.
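The improved strong order $\gamma = 1$ for $x$-independent $g$ is easy to observe numerically. The following sketch (our own experiment, with arbitrary problem constants) couples Euler-Maruyama paths at two step sizes to the same fine Brownian path for the additive-noise OU SDE $dx = -x\,dt + dB$, and reads off the order from the error ratio.

```python
import numpy as np

def em_paths(x0, dW, h):
    """Euler-Maruyama for dx = -x dt + dB, driven by increments dW of shape (n_paths, n_steps)."""
    x = np.full(dW.shape[0], x0, dtype=float)
    for k in range(dW.shape[1]):
        x += -x * h + dW[:, k]
    return x

rng = np.random.default_rng(4)
T, n_paths, n_fine = 1.0, 20_000, 1024
dW = rng.standard_normal((n_paths, n_fine)) * np.sqrt(T / n_fine)
x_ref = em_paths(1.0, dW, T / n_fine)                     # fine-grid proxy for the exact path

errs = []
for n_coarse in (16, 32):
    dW_c = dW.reshape(n_paths, n_coarse, -1).sum(axis=2)  # aggregate the same Brownian noise
    x_c = em_paths(1.0, dW_c, T / n_coarse)
    errs.append(np.mean(np.abs(x_c - x_ref)))             # strong error E|x_T^h - x_T|
order = np.log2(errs[0] / errs[1])                        # halving h should roughly halve the error
print(order)
```

With additive noise the estimated order comes out close to 1 rather than the generic 0.5, matching the statement above.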

E.5 RUNGE-KUTTA METHODS

Let us briefly introduce the derivation of Runge-Kutta methods for a 1-dim ODE $\dot{x} = f(x, t)$. We are interested in deriving derivative-free formulae of the form
$$x_{t+h} = x_t + \sum_{i=1}^{n} h b_i k_i + o(h^p), \qquad \text{where } k_i = f\Big(x_t + \sum_{j<i} h a_{ij} k_j,\ t + h c_i\Big), \quad n \geq p. \tag{139}$$
The array of coefficients is often shown in the following form, which is called the Butcher tableau,
$$\begin{array}{c|cccc}
0 & & & & \\
c_2 & a_{21} & & & \\
\vdots & \vdots & \ddots & & \\
c_n & a_{n1} & a_{n2} & \cdots & a_{n,n-1} \\ \hline
 & b_1 & b_2 & \cdots & b_n
\end{array} \tag{140}$$
Now the problem is how to design each value in this tableau. First, for simplicity, let us consider the following 2nd-order case,
$$x_{t+h} = x_t + h b_1 k_1 + h b_2 k_2 + o(h^2), \qquad k_1 = f(x_t, t), \qquad k_2 = f(x_t + h a_{21} k_1,\ t + h c_2).$$
By comparing the Taylor expansions of both sides (eqs. (141)-(144)), we obtain the following relations among the coefficients,
$$b_1 + b_2 = 1, \qquad b_2 a_{21} = \frac{1}{2}, \qquad b_2 c_2 = \frac{1}{2},$$
and thus we obtain the Butcher tableau of eq. (148). (When $c_2 = 1$, the method is particularly called Heun's method.) Thus we have confirmed that the second-order Taylor expansion of $x_{t+h}$ is expressed in the derivative-free form eq. (141), which is independent of any derivatives of $f$. We can similarly consider higher-order methods. In the 3rd-order case, the analogous relations among the coefficients are automatically obtained after a little effort of symbolic math programming using e.g. SymPy. By solving this equation system using SymPy, the following Butcher tableau is obtained,
$$\begin{array}{c|ccc}
0 & & & \\
c_2\ (\neq \tfrac{2}{3}) & c_2 & & \\
c_3\ (\neq c_2) & \dfrac{c_3(3c_2^2 - 3c_2 + c_3)}{c_2(3c_2 - 2)} & \dfrac{c_3(c_2 - c_3)}{c_2(3c_2 - 2)} & \\ \hline
 & 1 - \dfrac{1}{2c_2} - \dfrac{1}{2c_3} + \dfrac{1}{3c_2 c_3} & \dfrac{2 - 3c_3}{6c_2(c_2 - c_3)} & \dfrac{3c_2 - 2}{6c_3(c_2 - c_3)}
\end{array} \tag{149}$$
It will naturally become much more complicated when we consider higher-order methods, though it can be simpler if specific values are substituted. In the 4th-order case, when the Butcher tableau is
$$\begin{array}{c|cccc}
0 & & & & \\
\tfrac12 & \tfrac12 & & & \\
\tfrac12 & 0 & \tfrac12 & & \\
1 & 0 & 0 & 1 & \\ \hline
 & \tfrac16 & \tfrac13 & \tfrac13 & \tfrac16
\end{array}$$
the method is particularly called the Classical 4th-order Runge-Kutta (RK4) method.

Numerical Example: Let us consider a toy example $\dot{x} = x\sin t$, $x_0 = 1$. The exact solution is $x_t = e^{1 - \cos t}$.
Figure 15 compares the numerical solutions of the Euler, Heun, Classical RK4 and Taylor-2nd methods, where the step size is $h = 0.5$. It is clearly observed that the Euler method immediately deviates significantly from the exact solution, while higher-order methods follow it for longer periods; the 2nd-order methods (Heun and Taylor 2nd) gradually deviate from the exact solution, whereas the Classical 4th-order Runge-Kutta method approximates the exact solution much more closely. Note that the Taylor-2nd method is given as follows,
$$x_{t+h} = x_t + h(x_t\sin t) + \frac{h^2}{2}\underbrace{\big(x_t\cos t + x_t\sin^2 t\big)}_{\text{derivative of } x\sin t}. \tag{151}$$
In this case the computation of the derivative $(\partial_t + x\sin t\,\partial_x)(x\sin t)$ is tractable. However, it will be infeasible if this term is a little more complicated. The Runge-Kutta methods are advantageous when the derivatives are not easy to compute.
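The toy comparison of Figure 15 is easy to reproduce. The following sketch is our own reimplementation of the four schemes on $\dot{x} = x\sin t$ with exact solution $e^{1-\cos t}$; the horizon $T = 5$ is an arbitrary choice.

```python
import numpy as np

f = lambda x, t: x * np.sin(t)
exact = lambda t: np.exp(1.0 - np.cos(t))

def integrate(step, h=0.5, T=5.0):
    x, t = 1.0, 0.0
    while t < T - 1e-12:
        x = step(x, t, h)
        t += h
    return x

def euler(x, t, h):
    return x + h * f(x, t)

def heun(x, t, h):                       # 2nd-order RK with c2 = 1
    k1 = f(x, t)
    k2 = f(x + h * k1, t + h)
    return x + h * (k1 + k2) / 2

def taylor2(x, t, h):                    # Taylor-2nd scheme of eq. (151)
    return x + h * x * np.sin(t) + h**2 / 2 * (x * np.cos(t) + x * np.sin(t)**2)

def rk4(x, t, h):                        # classical 4th-order Runge-Kutta
    k1 = f(x, t)
    k2 = f(x + h * k1 / 2, t + h / 2)
    k3 = f(x + h * k2 / 2, t + h / 2)
    k4 = f(x + h * k3, t + h)
    return x + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6

errs = {name: abs(integrate(s) - exact(5.0))
        for name, s in [('euler', euler), ('heun', heun),
                        ('taylor2', taylor2), ('rk4', rk4)]}
print(errs)
```

Even at the coarse step $h = 0.5$, the RK4 error is the smallest of the four, consistent with the figure.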



(Footnotes.) We used the following relation: $\|\mathbb{E}_x[f(x)] - y\|^2 = \|\mathbb{E}_x[f(x) - y]\|^2 \leq \mathbb{E}_x\big[\|f(x) - y\|^2\big]$. The Leibniz rule is also written as $\mathrm{div}(au) = a\cdot\mathrm{grad}\,u + u\,(\mathrm{div}\,a)$; this is just expressing the relation $\sum_i \partial_i(a_i u) = \sum_i\big(a_i(\partial_i u) + (\partial_i a_i)u\big)$.



Figure 1: Comparison of the synthesis results of CelebA (64 × 64) data. The number of refinement steps is N = 12.

Figure 2: Comparison of the synthesis results of CIFAR-10 (32 × 32) data. The number of refinement steps is N = 30.

Figure 4 intuitively illustrates the reason. More quantitatively, noting that the weight function q

Figure 5: Intuitive reasons why a single point approximation is valid. The case where $1 - \nu_t \approx 1$.

Figure 6: Relative L 2 distance (↓) between the empirical score function eq. (45) and the single point approximation, depending on the noise level ν t . Dashed curves indicate the upper bounds evaluated in eq. (31).

Figure 8: Entropy of q(x 0 | x t ) depending on the noise level ν t .

Figure 11: Relative L 2 Error (↓) and Cosine similarity (↑) between the empirical t derivative and the ideal t derivative. (Modified tanh schedule.)

Other integrals contain stochastic integrations. Let us denote the first one by $z$:
$$z := \int_0^h\!\!\int_0^t dB_s\,dt = \int_0^h B_t\,dt.$$

Figure 13: Force field (potential gradient, score function) $-\nabla U(\mathbf{x})$, potential function $U(\mathbf{x})$, and the Boltzmann distribution $p(\mathbf{x}) = \frac{1}{Z}e^{-U(\mathbf{x})/D}$. The scalar potential $U(\mathbf{x})$ is given by eq. (122).

$$\begin{array}{c|cc}
0 & & \\
c_2 & c_2 & \\ \hline
 & 1 - \dfrac{1}{2c_2} & \dfrac{1}{2c_2}
\end{array} \tag{148}$$

$$b_1 + b_2 + b_3 = 1, \qquad b_2 c_2 + b_3 c_3 = \frac{1}{2}, \qquad b_2 c_2^2 + b_3 c_3^2 = \frac{1}{3}, \qquad b_3 a_{32} c_2 = \frac{1}{6}, \qquad c_2 = a_{21}, \quad c_3 = a_{31} + a_{32}.$$

$$\begin{array}{c|cccc}
0 & & & & \\
c_2 & a_{21} & & & \\
c_3 & a_{31} & a_{32} & & \\
c_4 & a_{41} & a_{42} & a_{43} & \\ \hline
 & b_1 & b_2 & b_3 & b_4
\end{array}$$

Figure 15: Comparison of numerical solutions of the ODE ẋ = x sin t, x 0 = 1.

(e) Taylor (2nd), (f) Taylor (3rd), (g) Euler-Maruyama, (h) Itô-Taylor

Figure 17: CelebA (64 × 64) synthesis samples. N = 4.

Figure 18: CelebA (64 × 64) synthesis samples. N = 8.

(e) Taylor (2nd), (f) Taylor (3rd), (g) Euler-Maruyama, (h) Itô-Taylor

Figure 19: CelebA (64 × 64) synthesis samples. N = 16.

Figure 20: CelebA (64 × 64) synthesis samples. N = 20.

(e) Taylor (2nd), (f) Taylor (3rd), (g) Euler-Maruyama, (h) Itô-Taylor

Figure 21: CelebA (64 × 64) synthesis samples. N = 30.

Figure 22: CelebA (64 × 64) synthesis samples. N = 100.

Figure 23: CIFAR-10(32 × 32) synthesis samples. N = 4.

Figure 24: CIFAR-10 (32 × 32) synthesis samples. N = 8.

Figure 25: CIFAR-10 (32 × 32) synthesis samples. N = 20.

Figure 26: CIFAR-10 (32 × 32) synthesis samples. N = 100.

D ON THE NOISE SCHEDULE

D.1 BACKGROUND: PICARD-LINDELÖF THEOREM

Let us consider a 1-dim deterministic system $\dot{x}(t) = a(x(t), t)$. It is well known that this ODE has a unique solution if $a(x, t)$ is Lipschitz continuous w.r.t. $x$ and continuous w.r.t. $t$ (Picard-Lindelöf theorem). Otherwise, ODEs often behave less favorably. (Similar Lipschitz conditions are also required for SDEs.)

Example 1. For example, the ODE $\dot{x} = x^2$, $x(0) = 1$ has the solution $x = 1/(1-t)$ when $t < 1$, and it blows up at $t = 1$. It is usually impossible to consider what happens after $t > 1$ in ordinary contexts.

Example 2. Another well-known example is $\dot{x} = \sqrt{x}$, $x(0) = 0$. It has a solution $x = t^2/4$, but $x \equiv 0$ is also a solution. It actually has infinitely many solutions: $x = 0$ (if $t \leq t_0$), $x = (t - t_0)^2/4$ (if $t > t_0$), where $t_0 \geq 0$ is an arbitrary constant.

Example 3. Let us consider the following ODE
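Both pathologies of Examples 1 and 2 are easy to see numerically. In the small sketch below (our own demo), the Euler iteration for $\dot{x} = x^2$ escapes to arbitrarily large values shortly after the exact blow-up time $t = 1$, while for $\dot{x} = \sqrt{x}$ started at $0$ it sits on the trivial solution $x \equiv 0$ forever, even though $x = t^2/4$ is an equally valid solution.

```python
import math

# Example 1: blow-up of dx/dt = x^2, x(0) = 1 (exact solution 1/(1 - t))
x, t, h = 1.0, 0.0, 1e-3
t_blow = None
while t < 2.0:
    x += h * x * x          # Euler step
    t += h
    if x > 1e6:             # the numerical solution also explodes ...
        t_blow = t          # ... slightly later than the exact time t = 1
        break
print(t_blow)

# Example 2: non-uniqueness of dx/dt = sqrt(x), x(0) = 0
y = 0.0
for _ in range(1000):
    y += h * math.sqrt(y)   # Euler stays on the trivial solution y = 0 forever
alt = lambda t: t * t / 4   # but x = t^2/4 also solves the same initial value problem
print(y, alt(1.0))
```

A numerical scheme silently picks one branch of a non-unique problem and cannot continue past a blow-up, which is why the Lipschitz condition matters for the schedule design below.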

E.2 DIFFUSION PROCESS AND FOKKER-PLANCK EQUATION

Derivation of the Fokker-Planck Equation: In this section, let us derive the Fokker-Planck equation
$$\partial_t p(x_t, t) = -\nabla_{x_t}\cdot\big(f(x_t,t)\,p(x_t,t)\big) + \Delta_{x_t}\Big(\frac{g(t)^2}{2}p(x_t,t)\Big)$$
associated with the SDE $dx_t = f(x_t,t)\,dt + g(t)\,dB_t$.

By the Bayes theorem, we may substitute each $p(x_s,s \mid x_t,t)$ with $p(x_t,t \mid x_s,s)\,p(x_s,s)/p(x_t,t)$. To simplify the notation, let us denote $p_s := p(x_s,s)$, $p_{t|s} := p(x_t,t \mid x_s,s)$, and drop the variables $x_s$ and $s$ (that is, $\partial$ always means $\partial_s$, and $\nabla$ always means $\nabla_{x_s}$). Let us also recall the Leibniz rule for the divergence of a vector field: for a vector-valued function $a$ and a scalar function $u$, the relation $\nabla\cdot(au) = a\cdot\nabla u + (\nabla\cdot a)u$ holds.


$$k_1 = f(x_t, t), \qquad k_2 = f(x_t + h a_{21} k_1,\ t + h c_2),$$
or,
$$x_{t+h} = x_t + h b_1 f(x_t,t) + h b_2 f\big(x_t + h a_{21} f(x_t,t),\ t + h c_2\big) + o(h^2). \tag{141}$$
Noting that the Taylor expansions of the l.h.s. and of the third term on the r.h.s. are written as follows (with $f' := \partial_x f$ and $\dot{f} := \partial_t f$),
$$x_{t+h} = x_t + h f(x_t,t) + \frac{h^2}{2}\big(f'(x_t,t)f(x_t,t) + \dot{f}(x_t,t)\big) + o(h^2),$$
$$x_t + h b_1 f(x_t,t) + h b_2\Big(f(x_t,t) + h a_{21} f'(x_t,t)f(x_t,t) + h c_2 \dot{f}(x_t,t)\Big) + o(h^2), \tag{144}$$
and comparing both sides, we can find that the derivatives on both sides are eliminated if the following equations are satisfied,
$$h f(x_t,t) = h b_1 f(x_t,t) + h b_2 f(x_t,t), \qquad \frac{h^2}{2}\big(f'f + \dot{f}\big)(x_t,t) = h b_2\Big(h a_{21} f'(x_t,t)f(x_t,t) + h c_2 \dot{f}(x_t,t)\Big).$$


where $\varsigma > 0$ is a small constant, $C = 1/\cos^2\!\big(\pi\varsigma/(2(1+\varsigma))\big)$ is a constant to make $\nu_0 = 0$, and the threshold constant is $\Theta = \beta_T$. However, these common schedules do not satisfy the 5th condition, that the drift coefficient $f$ be Lipschitz continuous. Indeed, it is easily verified that $\lim_{t\to 0}\dot{\nu}_t/\nu_t = \infty$ in both cases, since $\nu_0 = 0$ but $\dot{\nu}_0 > 0$. Nevertheless, $t = 0$ is the only singular point, and since no function value or derivative at $t = 0$ is evaluated by the numerical methods (except by the Runge-Kutta methods), this point can practically be ignored. Note that we can also consider other schedule functions, such as the sigmoid function (eq. (78)) and the hyperbolic tangent (eq. (79)), which satisfy conditions 2-5 but do not rigorously satisfy the 1st condition (although, if $\nu_0$ is less than or equal to the level of the quantization error in the data, we may consider the first condition to be essentially satisfied). In the modified tanh schedule, the parameter function $\lambda(t)$ has some options, such as $\lambda(t) = \log(1 + Ae^{kt})$, where $A > 0$ and $k > 0$ are hyperparameters.

D.3 HOW TO AVOID THE TIME ORIGIN SINGULARITY IN THE RUNGE-KUTTA METHODS

When using the Heun and Classical RK4 methods, the function $f(x_t, t)$ is evaluated at time $t = 0$. However, since $f(x_t, t)$ contains a term proportional to $1/\sqrt{\nu_t}$, it will diverge at $t = 0$ if the linear schedule eq. (76) or the cosine schedule eq. (77) is used. The simplest way to avoid this is to replace the function $f(x_0, 0)$ with $f(x_\varepsilon, \varepsilon)$, where $\varepsilon > 0$ is a sufficiently small constant, only when the need to evaluate the function at time $t = 0$ arises. The same thing could happen at $t = T$ if the cosine schedule and DDIM were used simultaneously, but this can be handled in the same way. If we use the sigmoid eq. (78) or modified tanh eq. (79) schedules, these problems do not occur unless the hyperparameters $A$ and $k$ are chosen to be very extreme values.
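A minimal sketch of this workaround, assuming a linear schedule; the helper names and the schedule constants are our own, and only the $1/\sqrt{\nu_t}$ divergence at $t = 0$ and the $\varepsilon$-clamp come from the text:

```python
import math

B0, B1, T, EPS = 0.1, 20.0, 1.0, 1e-5   # hypothetical linear-schedule constants

def nu(t):
    """nu_t = 1 - exp(-int_0^t beta_s ds) for the linear schedule beta_t = B0 + (B1 - B0) * t / T.
    Note nu(0) = 0, so any term proportional to 1/sqrt(nu_t) diverges at t = 0."""
    return 1.0 - math.exp(-(B0 * t + (B1 - B0) * t * t / (2 * T)))

def inv_sqrt_nu(t, eps=EPS):
    """Evaluate 1/sqrt(nu_t) with the time origin clamped to eps."""
    return 1.0 / math.sqrt(nu(max(t, eps)))

print(inv_sqrt_nu(0.0))   # finite, thanks to the clamp
```

Without the clamp, `nu(0.0)` is exactly zero and the division fails; with it, Heun and RK4 can safely query the drift at $t = 0$.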

F PSEUDOCODE F.1 QUASI-TAYLOR SAMPLER

The following pseudocode shows the proposed Quasi-Taylor sampler.

Algorithm 1 Quasi-Taylor Sampling Scheme with Ideal Derivatives
Require: trained score network $S_\theta(x, t, c)$
  for each refinement step do
    Compute $\rho$ using eq. (23)
    Compute $\mu$ using eq. (24)
    [Update Data] $x \leftarrow \rho\, x + \mu\, S_\theta(x, t, c)/\sqrt{\nu}$  (eq. (22))
  end for
  Clip outliers of $x$ so that e.g. $-1 \leq x \leq 1$
  Output: $x$

Note that $\rho$ and $\mu$, as well as $\nu_t$, $\lambda_t$, $\beta_t$ and their derivatives, are only dependent on $T$ and $h$. Therefore, they can be pre-computed before the actual synthesis.
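A structural NumPy sketch of Algorithm 1. The coefficient functions `rho_fn`, `mu_fn`, `nu_fn` and the score network are dummy stand-ins of our own (eqs. (22)-(24) are not reproduced here), so this only illustrates the control flow, the precomputation of the coefficients, and the final clipping.

```python
import numpy as np

def quasi_taylor_sample(score_fn, rho_fn, mu_fn, nu_fn, shape, T=1.0, N=12, seed=0):
    """Skeleton of Algorithm 1. rho_fn/mu_fn/nu_fn stand in for eqs. (22)-(24);
    score_fn stands in for the trained network S_theta(x, t)."""
    h = T / N
    ts = np.linspace(T, h, N)                 # refinement times T, T-h, ..., h
    # rho, mu, nu depend only on T and h, so they can be precomputed once
    coeffs = [(rho_fn(t, h), mu_fn(t, h), nu_fn(t)) for t in ts]
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)            # start from Gaussian noise
    for (rho, mu, nu), t in zip(coeffs, ts):
        x = rho * x + mu * score_fn(x, t) / np.sqrt(nu)   # update rule of eq. (22)
    return np.clip(x, -1.0, 1.0)              # clip outliers to [-1, 1]

# dummy stand-ins, for illustrating the control flow only
x = quasi_taylor_sample(score_fn=lambda x, t: -x,
                        rho_fn=lambda t, h: 1.0 - h / 2,
                        mu_fn=lambda t, h: h / 2,
                        nu_fn=lambda t: max(t, 1e-3),
                        shape=(4, 8))
print(x.shape)
```

Swapping in the actual coefficient formulae and a trained network would turn this skeleton into the sampler of the paper.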

F.2 QUASI-ITÔ-TAYLOR SAMPLER

The following pseudocode shows the proposed Quasi-Itô-Taylor sampler.

Algorithm 2 Quasi-Itô-Taylor Sampling Scheme with Ideal Derivatives
Require: trained neural network model $S_\theta(x_t, t, c)$ (the conditioning information $c$ is optional)
  Draw a $d$-dimensional Gaussian noise with unit variance
  for each refinement step do
    Compute, or get from a lookup table, the values of $\nu_t$, $\beta_t$, $\dot{\beta}_t$
    Compute $\rho$ and $\mu$ using eq. (26)
    Draw $w, z$; see eq. (108) and Theorem 1
    Compute $n$ using eq. (27) and $w, z$ above
    Update the data $x$ using $\rho$, $\mu$ and $n$
  end for
  Clip outliers of $x$ so that e.g. $-1 \leq x \leq 1$
  Output: $x$

The following pseudocode shows an example for the training of diffusion-based generative models: for each training example, the tuple $(x_0, c, t, \nu, w)$ is appended to the batch.

Algorithm 3 Training of Diffusion models

See also eq. (10).

