APPROXIMATED ANOMALOUS DIFFUSION: GAUSSIAN MIXTURE SCORE-BASED GENERATIVE MODELS

Abstract

Score-based generative models (SGMs) can generate high-quality samples via Langevin dynamics with a drift term and a diffusion term (Gaussian noise) iteratively calculated and added to a sample until convergence. In biological systems, it is observed that the neural population can conduct heavy-tailed Lévy dynamics for sampling-based probabilistic representation through neural fluctuations. Critically, unlike the existing sampling process of SGMs, heavy-tailed Lévy dynamics can produce both large jumps and small roaming to explore the sampling space, resulting in better sampling results than Langevin dynamics with a lacking of large jumps. Motivated by this contrast, we explore a new class of SGMs with the sampling based on the Lévy dynamics. However, exact numerical simulation of the Lévy dynamics is significantly more challenging and intractable. We hence propose an alternative solution by leveraging Gaussian mixture noises during training to mimic the desired large jumps and small roaming properties. Theoretically, GM-SGMs conduct a probabilistic graphical model used by empirical Bayes for sampling, expanding the maximum a posteriori (MAP) estimation applied by conventional SGMs. Expensive experiments on the challenging image generation tasks show that our GM-SGMs exhibit superior sampling quality over prior art SGMs across various sampling iterations.

1. INTRODUCTION

Score-based generative models (SGMs) (Song and Ermon, 2019; 2020; Song et al., 2021b; a; Dockhorn et al., 2022; Karras et al., 2022) have recently demonstrated tremendous performance in data synthesis, especially high-quality images, along with easier model optimization (Song and Ermon, 2019) , richer generative diversity (Xiao et al., 2022) , and solid theories (De Bortoli et al., 2021) . During optimization, SGMs learn to fit a score function by predicting the Gaussian noises added to a sample drawn from a target dataset. To generate a sample in the target distribution, the SGMs conduct Langevin dynamics constructed from the score function. This process reverses a Brownian motion starting from the dataset distribution with i.i.d. Gaussian increments. As a special case of Monte Carlo Markov Chain methods, the Langevin dynamics has been widely applied for constructing sampling-based algorithms (Rey-Bellet and Spiliopoulos, 2015) . However, increasing evidence from experimental observation suggests that in biological systems, the neural population implements sampling-based probabilistic representations through a heavy-tailed Lévy dynamics (He, 2014; Donoghue et al., 2020; Townsend and Gong, 2018; Muller et al., 2018) , which instead reverses an anomalous diffusion process with heavy-tailed increments (Fig. 2 (Left)). The neural coding benefits from Lévy dynamics, since it can implement large jumps that facilitates the process to escape from local minimal and explore the sampling space more thoroughly (Ye and Zhu, 2018; Qi and Gong, 2022) . A natural question arises: can we apply the Lévy dynamics instead of the Langevin dynamics for better sampling performance of SGMs? Inspired by this insight, we explore a novel class of SGMs that reverse the anomalous diffusion for sampling. Nonetheless, exact numerical simulation of the Lévy dynamics (i.e., reversing the anomalous diffusion) is drastically more challenging and intractable, especially for high-dimensional data such as images. To tackle this challenge, we consider Brownian motion with Gaussian mixture as an approximation of the anomalous diffusion. To tackle this challenge, we train the SGMs with Gaussian mixture noises to enable both large jumps and small roaming during the sampling phase, reminiscent of the Lévy dynamics. Concretely, we construct a novel variant of SGMs, namely Gaussian Mixture SGMs (GM-SGMs), that learn to denoise Gaussian mixture noises; In doing so, our model is enabled to reverse the Gaussian mixture Brownian motion by switching between large jump and small roaming during sampling, resembling the merits of Lévy dynamics. Theoretically, our GM-SGMs conduct the probabilistic graphical model (PGM) of an empirical Bayes (EB) to reverse a Gaussian mixture Brownian motion; Instead, conventional SGMs perform a PGM of maximum a posteriori (MAP) estimation to reverse a Brownian motion. Empirically, extensive experiments on several challenging image generation tasks verify the ability of our GM-SGMs to automatically select large jump or small roaming during sampling and the promising ability of our GM-SGMs over state-of-the-art SGMs under different sampling budgets.

2. PRELIMINARY: SCORE-BASED GENERATIVE MODELS

Score matching was originally developed for non-normalized statistical learning (Hyvärinen and Dayan, 2005) . By observing i.i.d. samples of an unknown (target) distribution p * in d dimensions, score matching directly approximates the score function s(x) := ▽ x log p * (x) via a model s θ parameterized by θ, for x ∈ R d . Score-based generative models (SGMs) aim to generate samples in the distribution p * via score matching through the following iterations x T ∼ N (0, I), x t-1 = x t + ϵ 2 t 2 s θ (x t , t) + ϵ t z t , t = T, . . . , 1, where ϵ t is the step size and z t ∼ i.i.d. N (0, I). This process transforms a Gaussian noise x T towards a sample x 0 obeying p * . Eq.( 1) can be considered as the reverse of a corrupting process where the noises are gradually added to a datum x 0 x 0 ∼ p * , x t+1 = x t + ϵ t z t , t = 0, . . . , T -1, The first SGM, noise conditional score network (NCSN) (Song and Ermon, 2019) , is trained by fitting the score function s(x) via minimizing the weighted explicit score matching (ESM) objective L(θ; {σ t } T t=1 ) = T t=1 λ(σ t )E x∼p * ,η∼N (0,σ 2 t I) 1 2 ∥s θ (x + η, t) -s(x + η)∥ 2 , where σ 2 t is the noise variance at the time step t, and λ(σ t ) the weights for each time step t. Discarding the constant part independent from θ, the ESM can be rewritten as a tractable denoising score matching (DSM) objective (Vincent, 2011)  L(θ; {σ t } T t=1 ) = T t=1 λ(σ t )E x∼p * ,η∼N (0,σ 2 t I) 1 2 ∥s θ (x + η, t) -▽ x+η log p σt (x + η | x)∥ 2 = T t=1 λ(σ t )E x∼p * ,η∼N (0,σ 2 t I) 1 2 s θ (x + η, t) + η σ 2 t 2 , where p σt (• | x) := N (•; x, σ 2 t I). Considering the sampling process as a stochastic differential equation (SDE), Song et al. (2021b) further proposed an improved version NCSN++ that utilizes an existing numerical solver of SDEs to enhance the sampling quality.

3.1. ANOMALOUS DIFFUSION PERSPECTIVE

In this section, we introduce a wider class of corrupting processes as well as their reverses (i.e., the corresponding sampling process), followed by analyzing their fundamental properties. We first write Eq.(1-2) in a continuous formula. Eq.( 2) is the discretized version of a Brownian motion x 0 ∼ p * , dx = dw, where the stochastic increment is Gaussian and satisfies The sampling process of SGMs (Eq.( 1)) is the dicretized version of a Langevin dynamics ∆w t = w t+∆t -w t ∼ N (0, ∆tI), t, ∆t > 0. dx = 1 2 ϵ 2 (t) ▽ x log p * (x)dt + ϵ(t)d w, where ϵ(t) is a time-dependent positive scalar function, and • represents reversing the time flow of the process. The Langevin dynamics (Eq.( 5)) is the reverse process of the Brownian motion (Eq.( 4)). Further, Eq.( 4) is an instance of a more general class of stochastic process called stable Lévy process, x 0 ∼ p * , dx = dL α , ( ) where α is the Lévy index, and the increment ∆L α t obeys a Lévy stable distribution ∆L α t = L α t+∆t -L α t ∼ SαS(x; α, ∆t 1 α ). The Lévy stable distribution SαS has the probabilistic distribution function (pdf) SαS(x; α, γ) = 1 π d R d + exp{- 1 2 ∥γy∥ α } cos(x • y)dy. Similar as the Langevin dynamics (Eq.( 5)), Eq.( 6) has its reverse process called Lévy dynamics dx = ϵ(t) 2 2p * (x) D α-2 {p * (x) ▽ x log p * (x)}dt + ϵ(t) 2 α d Lα , where D α-2 {•} is the Riesz fractional derivative of order α, and dL α represents another stable Lévy process (Mandelbrot and Mandelbrot, 1982) . Theoretically, we can sample from p * through the Lévy dynamics (Eq.( 7)) that reverses a stable Lévy process (Eq.( 6)). When α = 2, γ = 1 2 , SαS becomes the standard Gaussian distribution, and the stable Lévy process (Eq.( 6)) becomes a Brownian motion. When 1 ≤ α < 2, SαS is heavy-tailed, and the stable Lévy process becomes an anomalous diffusion (a.k.a., super-diffusion or Lévy flight) (Klages et al., 2008; Metzler et al., 2014) . To visually demonstrate the differences between Gaussian distribution and heavy-tailed Lévy stable distribution, we depict their probability distribution functions (pdf) in one-dimensional case. It is observed from Fig. 1 (left) that compared to Gaussian distribution, the Lévy stable distribution is not only more heavy-tailed but also more concentrated around zero center. In particular, the latter property prevents us from simply using a Gaussian distribution with high variance as an substitution of the Lévy stable distribution. We further inspect the stochastic behavior differences of Brownian motion and anomalous diffusion resulted from their different distributions in the increment. As shown in Fig. 2 , it is evident that the heavy-tailed property of Lévy stable distribution enables the sampling process (i.e., the Lévy dynamics Eq.( 7)) to better explore the full space more efficiently (Ye and Zhu, 2018) . More specifically, the Lévy dynamics can produce large jumps for allowing the sample point to traverse though low-probability regions. As a result, it is less possible for the sample point to be trapped by local minimal. Also, it is easier for the sample point to switch between different modes, making it suitable to deal with multi-mode distribution situations (Qi and Gong, 2022) . On the other hand, the higher concentration near zero eases the Lévy dynamics to conduct small roaming for exploring local regions. Consequently, the anomalous diffusion involves at least two modes: (1) small roaming mode for careful searching a local area, and (2) large jump mode for escaping from one area to a In contrast, the Gaussian distribution is relatively light-tailed, resulting in a lacking of large jumps in the Langevin dynamics (Eq.( 5)). The sample point is thus easier to be trapped at local minimal. More importantly, for conventional SGMs, the light-tailed issue is not only caused by the stochastic increments z t , but also intensified by the model s θ itself, which plays the main role to guide the sample point to the target distribution p * . To illustrate this, we revisit the score function in the conventional training objective (Eq.( 3)), where the model s θ (•, t) is trained to fit log p * σt (x + η | x) = - η σ 2 t , x ∼ p * , ∀η ∈ R d . Ideally, the model s θ (•, t) could learn the above map for all η ∈ R d , since the range of a Gaussian distribution is the whole R d . In practice, however, often s θ (•, t) only learns this map well for those η with relatively high probability being sampled from N (0, σ 2 t I), whilst lacking experiences for the low-probability others (corresponding to the large jumps). As a result, the conventional training only allows SGMs learn to reverse a Brownian motion, with less possible large jumps and inferior capability of predicting the corresponding samples. This insightful analysis as above motivates us to design a new class of SGMs capable of conducting large jumps similar as the anomalous diffusion for sampling. As the reverse capability of a SGM is largely shaped by the distribution of the training noises η, the key is on reformulating this noise distribution for training.

3.2. GAUSSIAN MIXTURE

Although the anomalous diffusion comes with desired merits as discussed above, conducting a general Lévy dynamics (Eq.( 7)) is computationally more challenging. First, the Riesz fractional derivative D{•} is difficult to calculate (C ¸elik and Duman, 2012) . Second, the Lévy dynamics involves p * which is intractable for high-dimensional cases. Third, sampling from a Lévy stable distribution (Mantegna, 1994) is much more expensive than sampling from a Gaussian distribution. To bypass these obstacles, instead of struggling to approximate the Lévy dynamics, we resort to an alternative approach that mimics the functional properties (i.e., ability to implement both large jumps and small roaming) of the anomalous diffusion. Formally, we exploit a Gaussian mixture sampling N mix (0, {I, β 2 I}) := 1 2 N (0, I) + N (0, β 2 I) , where we mix N (0, I) with another Gaussian distribution with large variance (N (0, β 2 I), β > 1). Note, our mixture operation sums two pdfs up to a single mixed pdf, instead of taking the means of samples from two individual distributions respectively. We visualize their differences in Fig. 1 (Right): (1) The latter has much less concentration near zero. (2) Whilst the former (ours) largely keeps the original concentration degree, and is also more heavy-tailed than the latter. Correspondingly, we define the Gaussian mixture Brownian motion as dw 1,β , where the increment ∆w 1,β t = w 1,β t+∆t -w 1,β t (8) obeys the Gaussian mixture distribution N mix (0, {∆tI, β 2 ∆tI}). To verify that our Gaussian mixture Brownian motion captures the behavior of the anomalous diffusion dL α , we conduct a twodimensional simulation as shown in Fig. 2 (Middle). We can see that the large jumps property can be approximated well although not exactly for global space exploration, whilst preserving the small roaming trait in local regions.

3.3. TRAINING

As analyzed in Sec. 3.1, the distribution of η in Eq.( 3) determines what kinds of corrupting processes the SGM can reverse. To enable the model to reverse a Gaussian mixture Brownian motion, we propose Gaussian mixture SGMs (GM-SGM), which is trained to minimize T t=1 λ(σ t )E x∼p * (x),η∼Nmix(0,{σ 2 t I,β 2 σ 2 t I}) 1 2 s θ (x + η, t) + η σ 2 t 2 , where β ≥ 1 is the only hyper-parameter. Compared to the conventional training objective (Eq.( 3)), the only difference is that the noise η now is sampled from the Gaussian mixture N (0, {σ 2 t I, β 2 σ 2 t I}) instead of N (0, σ 2 t I) at each time step t. Minimizing our objective Eq.( 9) is equivalent to alternatively minimize the original objective Eq.( 3) and the following objective T t=1 λ(σ t )E x∼p * (x),η∼N (0,σ 2 t I) 1 2 s θ (x + βη, t) + βη σ 2 t 2 . ( ) This provides new opportunities, where the scaled noises βη added to the sample x have much larger deviation than expected, forcing the model conduct a large jump in order to recover x ∼ p * . As there is no indication with the input about if the noise η is scaled by β or not, the model must learn to decide on its own when to make the large jump, i.e., learning to switch between large jump mode and small roaming mode properly. After training, at each time step t, our GM-SGM will automatically select one of possible denoising prior: the noise to be denoised at this step obeys N (0, σ 2 t I) or N (0, σ 2 t β 2 I). This enables GM-SGM to vary the step size adaptively to some degree with more robustness against outlier cases (e.g., η satisfying N (η; 0, σ 2 t I) ≪ 1). To reduce the stochasticity of training, we alternatively minimize two objectives Eq.(3) and Eq.( 10), instead of uniformly sampling them. The training procedure is summarized in Alg. 1.

Algorithm 1 GM-SGM training

Input: The model s θ , the overall training iterations N , the parameter β.  for n = 1 to N do Draw x ∼ p * (x), t ∼ Uniform[1, 2, . . . , T ], η ∼ N (0, I) if n mod 2 = 0 then β = 1 else β = β end if Backpropagate on λ(σ t ) s θ (x + βη, t) + βη

3.4. SAMPLING

During inference, we sample from a trained GM-SGM using exactly the same procedure as conventional SGMs (Eq.( 1)). Thus, our GM-SGMs do not introduce any extra cost during inference. This is some inconsistent with our training noise increments z t ∼ N mix (0, {I, β 2 I}). We provide the theoretical insights and analysis as follows.

4.1. UNDERSTANDING GM-SGMS VIA PROBABILISTIC INFERENCE

GM-SGMs can be extended to general cases, where we train the model via the Gaussian mixture N mix (x; 0, p( β)) := N (x; 0, β2 I)p( β)d β, where p( β) is a possibly continuous pdf of the scalar β. In Sec. 3.2, the distribution of β is discrete: p( β = 1) = p( β = β) = 0.5, corresponding to the small roaming and large jump modes respectively. In this general case, the corresponding objective is T t=1 λ(σ t )E β∼p( β)   E x∼p * (x),η∼N (0,σ 2 t I)   1 2 s θ (x + βη, t) + βη σ 2 t 2 β    . ( ) To understand why the sampling process of GM-SGM can converge to the target distribution, we analyze what the GM-SGM is really doing from the perspective of probabilistic inference. Let us first revisit the conventional SGMs. For convenience, we denote -σ t s θ (x+σ t ξ, t) as ξ θ (x, t), an equivalent form of the SGM. Then ξ θ is trained to infer the standardized noise ξ through the probabilistic graphical model (PGM) p(ξ | x t , t) ∝ p(x t | ξ, t)p(ξ), given x t = x 0 + σ t ξ and t, where x 0 ∼ p * , ξ ∼ N (0, I), and p(ξ) = N (0, I). This PGM is the same as that in a maximum a posteriori estimation (MAP) formula. After estimating ξ by ξ θ , at step t, the model can push the sample point x t along the direction of -ξ to approximate the target distribution. This approximation is reasonable due to following relationship. Theorem 1. Given the corrupting process (σ t > σ s , ∀t > s) x 0 ∼ p * , x t = x 0 + σ t ξ t , ξ t ∼ i.i.d. N (0, I), for any time points 0 ≤ u < s < t, we have p(x s | x t , x u ) = N x t - σ 2 t -σ 2 s σ 2 t -σ 2 u (x t -x u ), (σ 2 t -σ 2 s )(σ 2 s -σ 2 u ) σ 2 t -σ 2 u I . See the proof in A.1 of Appendix. As a result, if we set u = 0, it is clear that if x t moves along the direction -ξ = -(x t -x 0 )/σ t with small enough step, it will be closer to x s for s < t. During sampling, the SGM takes the step along this direction with a scalar ϵ 2 t 2 controlling the step size at every time step t and a random increment z t that represents the uncertainty in p(x s | x t , x 0 ). When applying the above analysis to GM-SGM, we find that this time ξ θ is trained to infer ξ through p(ξ | x t , t) ∝ p(x t | ξ, t)p(ξ | β)p( β | x t , t)d β, p( β | x t , t) ∝ p(x t | β, t)p( β) where p( β) is the distribution of β in Eq.( 11), and p(ξ | β) = N (ξ; 0, β2 I). Therefore, GM-SGMs also infer a direction -ξ to push the sample point x t closer to x s , s < t. The key difference is that infers the direction -ξ is inferred through the PGM of maximum a posteriori estimation (MAP) formula with the SGMs, whilst through the PGM of empirical Bayes (EB) formula (BP and TA, 2001 ) by our GM-SGM. As illustrated in Fig. 3 , GM-SGM takes into consideration of the probability that ξ is associated with different covariance matrices β2 I, corresponding to large jumps and small roaming. With the above analysis, we can more clearly explain why we apply the Langevin dynamics as the sampling process of GM-SGM, with the increments z t in Gaussian distribution instead of Gaussian mixture. From the perspective of probabilistic inference, z t represents the level of uncertainty of estimating ξ θ at time step t. Indeed, during training, we do not specify different levels of uncertainty of ξ for different β given the time t. This means the model infers ξ at the same level of certainty at each time step t, independent of β. Therefore, for GM-SGM we can adopt the sampling process as the SGMs. Otherwise, when Gaussian mixture increments are used for sampling, the model performance would degrade due to inconsistent uncertainty introduced (verified in Sec. 5).

4.2. RELATIONSHIP TO THE CONVENTIONAL SGMS

Following the relationships between our GM-SGMs and conventional SGMs as analyzed in Sec. 4.1, we further make additional interesting connections as follows. First, GM-SGMs can be considered as a more robust variant of SGMs. Extra robustness comes from that the model is optimized to deal with large jumps βη, η ∼ N (0, β2 σ 2 t I), β > 1 (the outliers of N (0, σ 2 t I)) at each step t (Eq.( 10)). Second, training GM-SGMs can be regarded as an ensemble learning procedure of training SGMs. Specifically, for every β, we train a SGM (denoted as s θ, β ) by minimizing corresponding objective constructed by β, followed by mixing these SGMs based on p( β). Third, the sampling process of GM-SGMs can be seen as reversing a mixture of multiple Brownian motions, where the noise term z t is amplified by different β. These corrupting processes expand the hypothesis space of original SGMs which only contains a single Brownian motion. The model s θ adaptively selects one of them for denoising, i.e., switching from different modes ( β > 1 for large jumps and β ≤ 1 for small roaming). This will be verified in Fig. 4 . As a result, going beyond conventional SGMs with a need for designing a step schedule for sampling, our GM-SGMs can vary the step sizes of the score function term automatically.

5. EXPERIMENTS

We evaluate image generation tasks on CIFAR-10 ( Krizhevsky et al., 2009) , CelebA (Liu et al., 2015) and LSUN (church and bedroom) (Yu et al., 2015) . We focus on comparing GM-SGMs with the prior art SGMs, namely NCSN++ (Song et al., 2021b) . For fair comparison, we construct our proposed GM-SGM using the same U-Net architecture as NCSN++. We use the released checkpoints of NCSN++ for all the datasets except CelebA for which no checkpoints released and we use the released codes to train by ourselves. We first set β = 2 in Alg. 1 and investigate the effect of different β later. We provide more detailed settings of implementation in A.2 of Appendix. 

5.1. SAMPLING STATISTICS

To verify that our GM-SGM indeed makes large jumps following the PGM of empirical Bayes inference, we track the values of the score function s θ (x t , t) evaluated during the sampling procedure and calculate the norm of the estimated noise vector ξ θ (x t , t) for every time step t (see Sec. 4.1). We denote ξ GM-SGM θ (x t , t) and ξ SGM θ (x t , t) as the noise vector produced by our GM-SGM and SGMs respectively. For a relative comparison, we calculate the ratio of mean and the ratio of variance r mean (t) = E ξ GM-SGM θ (x t , t) E ξ SGM θ (x t , t) ] , r var (t) = var ξ GM-SGM θ (x t , t) var ξ SGM θ (x t , t) ] , where r mean /r var represent the relative denoising step-size/variation taken by our GM-SGM over convnetional SGMs on average. We sample 512 images using GM-SGM and SGM trained on CelebA (64 × 64) (Liu et al., 2015) respectively and plot the results of sampling statistics in Fig. 4 . We draw several key observations: (1) With the r mean > 1 across the whole sampling process, we validate that our GM-SGM does take more large jumps. (2) r var starts at the range of [50, 150] and surges rapidly afterwards followed by a quick dip. This implies that when the sample point is sufficiently informative (e.g., close to the target distribution), our GM-SGM would activately infer the noises ξ based on different hypothesis of β. In contrast, the norm variation by SGM is always much smaller. The underlying reason is that, GM-SGM can take much more diverse sampling paths, as expressed by Eq.( 13). More specifically, at every step t, GM-SGM estimate ξ with two possible selections in the prior distribution of ξ: N (0, I) or N (0, β 2 I) at each t. Therefore, there are a total of 2 T denoising paths GM-SGM can choose from during sampling.

5.2. EVALUATION ON IMAGE SYNTHESIS QUALITY

Next, we evaluate the sampling quality. As shown in Fig. 5 , our GM-SGMs can generate much better quality images than SGMs under the same iterations T . See A.7 of Appendix for more visual comparison. For quantitative measurement, we adopt the popular Fréchet inception distance (FID) (Heusel et al., 2017) . Lower FID is better. As shown in Table 1 -2, GM-SGM is consistently superior over SGM across different sampling budgets and generation tasks. 

5.3. GM-SGM PARAMETER ANALYSIS

We investigate the effect of β on sampling quality. The default setting is β = 2. In general, if β is over small, our model advantage will be reduced due to reduced heavy-tail property. If β is over large, the denoising task would become too challenge to solve, hence harming the performance. To empirically validate this, we test GM-SGMs by varying β in [1.5, 3] on CIFAR-10 with 50 iterations. When β = 1.5, the FID increases from 29.4 to 37.9. When β = 3, the FID goes to 105.4. However, both are still better than 456.5 by SGM. See Tab. 3 in A.3 of Appendix for more results. Also, we test other Gaussian mixture designs and find them less effective (see A.6 of Appendix).

5.4. FURTHER STUDY

We investigate more designs of GM-SGMs. First, we only use the objective Eq.( 10) for training. As shown in Fig. 7 of Appendix, the model fails to generate natural images even under many iterations. This is because this model only learns to implement large jumps, while ignoring small roaming which is also necessary for space exploring, as discussed in Sec. 3.1. Second, instead of Gaussian increments as SGMs, we turn to use Gaussian mixture N (0, {I, β 2 I}). We test this on CelebA (64 × 64) under 333 iterations. We find the FID increases from 4.1 to 5.2 (still much better than 14.5 by SGMs). This verifies empirically our analysis in Sec. 4.1 that the probabilistic inference of our GM-SGM shares the same level of uncertainty for all β at each step t. We also investigate GM-SGMs using Gaussian mixture with general covariance matrices in A.5 of Appendix.

6. RELATED WORK

The first SGMs, the noise conditional score network (NCSN), is introduced (Song and Ermon, 2019). Later on, Song and Ermon (2020) further improved the NCSN by scaling the noises and improving the stability with the moving average. By abstracting the previous SGMs into a unified framework based on the stochastic differential equation (SDE), Song et al. (2021b) proposed the NCSN++ for high-resolution image generation using numerical SDE solvers and several architectural enhancements. Relying on Hamiltonian Monte Carlo methods (Neal et al., 2011) , critically-damped Langevin diffusion (CLD) based SGMs is introduced (Dockhorn et al., 2022) . Alternatively, Jing et al. (2022) implemented the sampling process on a series of selected subspaces. Besides, Vahdat et al. (2021) trained the SGMs in a latent space of variational autoencoder (Kingma and Welling, 2019). While Karras et al. (2022) focused on refining the training hyper-parameters, sampling schedule and high-order numerical methods for better performance. DDPMs (Ho et al., 2020) are another class of iterative denoising based generation models. Their origin is (Sohl-Dickstein et al., 2015) that proposed to destroy the data through a diffusion process, and learn to reverse this process via maximizing the variational bound. Although theoretically equivalent to SGMs (Vincent, 2011) , DDPMs adopt a variance preserving (VP) stochastic process for sampling. In contrast, SGMs consider a variance exploding (VE) stochastic process (i.e., the Langevin dynamics). There have been further improvements for DDPMs (Yang et al., 2022) . For instance, DDIMs (Song et al., 2020) generate samples via a class of non-Markovian diffusion processes. Nichol and Dhariwal (2021) proposed to learn the noise schedules for better sampling quality. Liu et al. (2021) further improved DDIMs by implementing the sampling on a manifold for better denoising. Bao et al. (2022b; a) attempted to optimize the discrete-time schedules for sampling speed-up. Application of SGMs and DDPMs has been rapidly evolved, e.g., text-to-image generation (Nichol et al., 2021; Saharia et al., 2022; Rombach et al., 2022) . Recently, SGMs were applied in creating, editing, and recovering photo-realistic images (Meng et al., 2021; Saharia et al., 2021a; b; Kawar et al., 2022) , or high fidelity audio streams (Kong et al., 2021; Chen et al., 2021) . Complementary to all the prior efforts, in this work we investigate the fundamental limitation of lacking large jumps in sampling with conventional SGMs and construct a more robust variant characterized by superior exploration ability, along with solid theoretical justification.

7. CONCLUSION

Inspired by the heavy-tailed Lévy dynamics (the reverse of the anomalous diffusion) implemented by neural coding in biological systems, we propose a novel class of generative models, namely Gaussian mixture score-based generative models (GM-SGMs). In contrast to conventional SGMs based on Langevin dynamics, our models are featured with a relatively heavy-tailed sampling property and capability of more thoroughly exploring the sampling space. Specifically, GM-SGMs are trained by denoising Gaussian mixture noises with built-in automatic switches between large jumps and small roaming in the spirit of anomalous diffusion. Theoretically, GM-SGMs implement a probabilistic graphical model (PGM) of an empirical Bayesian for generation, expanding the PGM of maximum a posteriori estimation used by the SGMs; This results in a sampling process with automatic selection of the step schedule and more robustness. Empirically, we demonstrate that our GM-SGMs significantly outperform the conventional SGMs on a diversity of challenging image generation tasks. Next, we compare GM-SGMs with DDPMs using non-Gaussian noises for training or sampling, including DDGM (Nachmani et al., 2021) where Gaussian noises are replaced by noises obeying Gamma distribution. As shown in Table 5 , our GM-SGMs are superior over DDPMs, DDPM-based DDGM, and comparable to DDIM and DDIM-based DDGM, despite that their base model DDPM and DDIM already have better low-iteration performance than our base model SGMs. Table 5 : FID of SGMs (Song et al., 2021b) , GM-SGMs, DDPM (Ho et al., 2020) , DDIM (Song et al., 2020) , and DDGM (based on DDPM and DDIM) (Nachmani et al., 2021)  T t=1 λ(σ t )E B∼p( B)   E x∼p * ,η∼N (0,σ 2 t I)   1 2 s θ (x + Bη, t) + Bη σ 2 t 2 B    . It is not obvious which kinds of p( B) can train a better GM-SGM. Here, we try a p( B) motivated by the observation that for image generation, SGMs tend to generate images with too much highfrequency noise when the iterations number is low (see examples in Sec. 3.4). To suppress the generation for high-frequency components, we design a p( B) matrix as follows p( B = I) = 0.5, P ( B = 2M ) = 0.5, where M is a filter operated on the height and width coordinates to shrink the top-5% high-frequency part of an image by 0.5. We obtain the following results: For CelebA (64 × 64), the FID of GM-SGM decreases to 3.77 at 400 iterations (better than the result of GM-SGM in the main paper); On CIFAR-10, the FID of GM-SGM decreases to 2.24 at 1000 iterations, better than SGMs (2.51). This implies that the optimal settings of Gaussian mixture can be more complicated, which calls for a further study in the future. 



Figure 1: (Left) The pdf of one-dimensional Lévy stable distribution with different α and standard Gaussian distribution. The Lévy stable distribution is more heavy-tailed than the Gaussian distribution. (Right) Addition of two Gaussian distributions vs. their mixture. Best viewed in color.

Figure 2: Visualization of (Left) anomalous diffusion dL α (α = 1.8), (Middle) Gaussian mixture Brownian motion dw 1,β (β = 2), and (Right) Brownian motion dw. We fixed the random seed at 2023 for all the three simulations. Best viewed in color.

Schematic diagram of the sampling inference: SGMs vs. GM-SGMs.

Figure 4: The norm ∥ξ θ ∥ of GM-SGMs and SGMs. (Left) The norms across several sampling processes. The ratio of (Middle) mean r mean and (Right) variance r var at different time steps.

Figure 5: Image samples generated by SGMs and GM-SGMs on CIFAR-10, CelebA with 64 × 64 resolution and LSUN (church and bedroom) with 256 × 256 resolutions.

FAILURE EXAMPLES In Fig. 6 and Fig. 7, we show the failure examples under improper parameter settings as mentioned in Sec. 5.

Figure 6: Image samples generated by a GM-SGM trained on CIFAR-10, where we apply the Gaussian mixture N (0, {0.5 2 I, 2 2 I}) instead of Gaussian mixture N (0, {I, 2 2 I}) for training. Iterations: 2000.

Figure 7: Image samples generated by a GM-SGM trained on CelebA (64 × 64), where we apply N (0, 2 2 I) instead of Gaussian mixture N (0, {I, 2 2 I}) for training. Iterations: 2000.

Figure 10: Image samples generated by SGMs (left) and GM-SGMs (right). Dataset: CIFAR-10. Iterations: 500.

Figure 11: Facial image samples generated by SGMs (left) and GM-SGMs (right). Dataset: CelebA (64 × 64). Iterations: 20.

Figure 12: Facial image samples generated by SGMs (left) and GM-SGMs (right). Dataset: CelebA (64 × 64). Iterations: 200.

Figure 13: Facial image samples generated by SGMs (left) and GM-SGMs (right). Dataset: CelebA (64 × 64). Iterations: 400.

Figure 14: Church image samples generated by SGMs (left) and GM-SGMs (right). Dataset: LSUN (church) (256 × 256). Iterations: 100.

Figure 15: Church image samples generated by SGMs (left) and GM-SGMs (right). Dataset: LSUN (church) (256 × 256). Iterations: 200.

Figure 16: Church image samples generated by SGMs (left) and GM-SGMs (right). Dataset: LSUN (church) (256 × 256). Iterations: 400.

Figure 17: Bedroom image samples generated by SGMs (left) and GM-SGMs (right). Dataset: LSUN (bedroom) (256 × 256). Iterations: 100.

Figure 18: Bedroom image samples generated by SGMs (left) and GM-SGMs (right). Dataset: LSUN (bedroom) (256 × 256). Iterations: 200.

Figure 19: Bedroom image samples generated by SGMs (left) and GM-SGMs (right). Dataset: LSUN (bedroom) (256 × 256). Iterations: 400.

Figure 20: Church image samples generated by GM-SGMs with 50 iterations (left) and 40 iterations (right). Dataset: LSUN (church) (256 × 256). SGMs only produce unrecognizable noises (like left of Fig. 11) below 100 iterations.

Figure 21: Bedroom image samples generated by GM-SGMs with 50 iterations (left) and 40 iterations (right). Dataset: LSUN (bedroom) (256 × 256). SGMs only produce unrecognizable noises (like left of Fig. 11) below 100 iterations.

FID score of generating object images on CIFAR-10. Lower FID is better. SGM: NCSN++Song et al. (2021b)).

The comparison of the performance of GM-SGMs under different β in Alg. 1. Metrics: FID score. Dataset: CIFAR-10.In this section, we compare our GM-SGMs with existing SGMs and DDPMs. First, we compare GM-SGMs with SGMs equipped with adaptive SDE solver (ASDE)(Jolicoeur-Martineau et al., 2021). As shown in Table4, our GM-SGMs outperform the competitors significantly. This suggests that our method has superior acceleration benefit. Also note our GM-SGMs do not alter the sampling process, without adding extra per-step complexity during inference. In contrast, alternative acceleration methods usually increase per-step cost.

FID of SGMs(Song et al., 2021b), GM-SGMs, and SGMs equipped with adaptive SDE solver(Jolicoeur-Martineau et al., 2021). Dataset: CIFAR-10.

annex

which implies that p(x s | x t , x u ) also obeys a Gaussian distribution, and its mean and covariance matrix can be obtainedHence, the estimated direction -ξ θ (x t , t) = -(x t -x 0 )/σ t guides the sample point x t closer to x s for some s < t whose distribution is closer to p * .

A.2 DETAILED SETTINGS

We train the GM-SGM on CIFAR-10 for 550k iterations at the batch size of 64 on single NVIDIA RTX 3090 GPU to compare the public checkpoint of the SGM (NCSN++) trained on CIFAR-10. We train both GM-SGM and SGMs on CelebA (64 × 64) for 230k iterations at the batch size of 64 on two NVIDIA RTX 3090 GPUs. We train the GM-SGM on LSUN (church) and LSUN (bedroom) at resolution of 256 × 256 for 250k iterations at the batch size of 64 and compare the public checkpoint of the SGM (NCSN++) trained on LSUN (church) on 4 NVIDIA RTX A6000 GPUs. We apply the same hyper-parameters of optimizer and learning schedule as that in Song et al. (2021b) . For sampling, we use the same generation process of continuous variance exploding (VE)-NCSN++ Song et al. (2021b) with both predictor and corrector for both SGMs and GM-SGM.For sampling, all the experiments apply the original linear schedule used in Song et al. (2021b) with both predictor and corrector (PC).

A.3 EXTRA RESULTS ON PARAMETER ANALYSIS

We report more results on how the parameter β influences the performance of GM-SGM in Tab. 3. In particular, when β = 1, we get the original SGM. 

