LEARNING ENERGY-BASED MODELS BY DIFFUSION RECOVERY LIKELIHOOD

Abstract

While energy-based models (EBMs) exhibit a number of desirable properties, training and sampling on high-dimensional datasets remains challenging. Inspired by recent progress on diffusion probabilistic models, we present a diffusion recovery likelihood method to tractably learn and sample from a sequence of EBMs trained on increasingly noisy versions of a dataset. Each EBM is trained with recovery likelihood, which maximizes the conditional probability of the data at a certain noise level given their noisy versions at a higher noise level. Optimizing recovery likelihood is more tractable than marginal likelihood, as sampling from the conditional distributions is much easier than sampling from the marginal distributions. After training, synthesized images can be generated by the sampling process that initializes from Gaussian white noise distribution and progressively samples the conditional distributions at decreasingly lower noise levels. Our method generates high fidelity samples on various image datasets. On unconditional CIFAR-10 our method achieves FID 9.58 and inception score 8.30, superior to the majority of GANs. Moreover, we demonstrate that unlike previous work on EBMs, our long-run MCMC samples from the conditional distributions do not diverge and still represent realistic images, allowing us to accurately estimate the normalized density of data even for high-dimensional datasets. Our implementation is available at https://github.com/ruiqigao/recovery_likelihood.



Inspired by Sohl-Dickstein et al. (2015) and Ho et al. (2020), we propose a diffusion recovery likelihood method to tackle the challenge of training EBMs directly on a dataset by instead learning a sequence of EBMs for the marginal distributions of the diffusion process. The sequence of marginal EBMs is learned with recovery likelihoods that are defined as the conditional distributions that invert the diffusion process. Compared to standard maximum likelihood estimation (MLE) of EBMs, learning marginal EBMs by diffusion recovery likelihood only requires sampling from the conditional distributions, which is much easier than sampling from the marginal distributions. After learning the marginal EBMs, we can generate synthesized images by a sequence of conditional samples initialized from the Gaussian white noise distribution. Unlike Ho et al. (2020), who approximate the reverse process by normal distributions, in our case the conditional distributions are derived from the marginal EBMs, which are more flexible. The framework of recovery likelihood was originally proposed in Bengio et al. (2013). In our work, we adapt it to learning the sequence of marginal EBMs from the diffusion data. Our work is also related to the denoising score matching method of Vincent (2011), which was further developed by Song & Ermon (2019; 2020) for learning from diffusion data. The training objective used for diffusion probabilistic models is a weighted version of the denoising score matching objective, as revealed by Ho et al. (2020). These methods learn the score functions (the gradients of the energy functions) directly, instead of using the gradients of learned energy functions as in EBMs. On the other hand, Saremi et al. (2018) parametrize the score function as the gradient of an MLP energy function, and Saremi & Hyvarinen (2019) further unify denoising score matching and neural empirical Bayes.
We demonstrate the efficacy of diffusion recovery likelihood on CIFAR-10, CelebA and LSUN datasets. The generated samples are of high fidelity and comparable to GAN-based methods. On CIFAR-10, we achieve FID 9.58 and inception score 8.30, exceeding existing methods of learning explicit EBMs to a large extent. We also demonstrate that diffusion recovery likelihood outperforms denoising score matching from diffusion data if we naively take the gradients of explicit energy functions as the score functions. More interestingly, by using a thousand diffusion time steps, we demonstrate that even very long MCMC chains from the sequence of conditional distributions produce samples that represent realistic images. With the faithful long-run MCMC samples from the conditional distributions, we can accurately estimate the marginal partition function at zero noise level by importance sampling, and thus evaluate the normalized density of data under the EBM. 

2. BACKGROUND

Let $x \sim p_{\text{data}}(x)$ denote a training example, and let $p_\theta(x)$ denote a model's probability density function that aims to approximate $p_{\text{data}}(x)$. An energy-based model (EBM) is defined as

$$p_\theta(x) = \frac{1}{Z_\theta} \exp(f_\theta(x)), \qquad (1)$$

where $Z_\theta = \int \exp(f_\theta(x))\,dx$ is the partition function, which is analytically intractable for high-dimensional $x$. For images, we parameterize $f_\theta(x)$ with a convolutional neural network with a scalar output.

The energy-based model in equation 1 can, in principle, be learned through MLE. Specifically, suppose we observe samples $x_i \sim p_{\text{data}}(x)$ for $i = 1, 2, ..., n$. The log-likelihood function is

$$\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i) \doteq \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]. \qquad (2)$$

In MLE, we seek to maximize the log-likelihood function, where the gradient approximately follows (Xie et al., 2016b)

$$-\frac{\partial}{\partial\theta} D_{KL}(p_{\text{data}} \,\|\, p_\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] - \mathbb{E}_{x \sim p_\theta}\left[\frac{\partial}{\partial\theta} f_\theta(x)\right]. \qquad (3)$$

The expectations can be approximated by averaging over the observed samples and the synthesized samples drawn from the model distribution $p_\theta(x)$, respectively. Generating synthesized samples from $p_\theta(x)$ can be done with Markov Chain Monte Carlo (MCMC) such as Langevin dynamics (or Hamiltonian Monte Carlo (Girolami & Calderhead, 2011)), which iterates

$$x^{\tau+1} = x^{\tau} + \frac{\delta^2}{2} \nabla_x f_\theta(x^{\tau}) + \delta \epsilon^{\tau}, \qquad (4)$$

where $\tau$ indexes the time, $\delta$ is the step size, and $\epsilon^{\tau} \sim \mathcal{N}(0, I)$. The difficulty lies in the fact that for high-dimensional and multi-modal distributions, MCMC sampling can take a long time to converge, and the sampling chains may have difficulty traversing modes. As demonstrated in Figure 2, training EBMs with synthesized samples from non-convergent MCMC results in malformed energy landscapes (Nijkamp et al., 2019b), even if the samples from the model look reasonable.

3. RECOVERY LIKELIHOOD

3.1. FROM MARGINAL TO CONDITIONAL

Given the difficulty of sampling from the marginal density $p_\theta(x)$, following Bengio et al. (2013), we use the recovery likelihood defined by the density of the observed sample conditional on a noisy sample perturbed by isotropic Gaussian noise. Specifically, let $\tilde{x} = x + \sigma\epsilon$ be the noisy observation of $x$, where $\epsilon \sim \mathcal{N}(0, I)$. Suppose $p_\theta(x)$ is defined by the EBM in equation 1; then the conditional EBM can be derived as

$$p_\theta(x \mid \tilde{x}) = \frac{1}{\tilde{Z}_\theta(\tilde{x})} \exp\left( f_\theta(x) - \frac{1}{2\sigma^2} \|\tilde{x} - x\|^2 \right), \qquad (5)$$

where $\tilde{Z}_\theta(\tilde{x}) = \int \exp\left( f_\theta(x) - \frac{1}{2\sigma^2} \|\tilde{x} - x\|^2 \right) dx$ is the partition function of this conditional EBM. See Appendix A.1 for the derivation. Compared to $p_\theta(x)$ (equation 1), the extra quadratic term $\frac{1}{2\sigma^2}\|\tilde{x} - x\|^2$ in $p_\theta(x \mid \tilde{x})$ constrains the energy landscape to be localized around $\tilde{x}$, making the latter less multi-modal and easier to sample from. As we will show later, when $\sigma$ is small, $p_\theta(x \mid \tilde{x})$ is approximately a single-mode Gaussian distribution, which greatly reduces the burden of MCMC.

A more general formulation is $\tilde{x} = ax + \sigma\epsilon$, where $a$ is a positive constant. In that case, we can let $y = ax$ and treat $y$ as the observed sample. Assume $p_\theta(y) = \frac{1}{Z_\theta} \exp(f_\theta(y))$; then by change of variable, the density function of $x$ can be derived as $g_\theta(x) = a\,p_\theta(ax)$.

3.2. MAXIMIZING RECOVERY LIKELIHOOD

With the conditional EBM, assume we have observed samples $x_i \sim p_{\text{data}}(x)$ and the corresponding perturbed samples $\tilde{x}_i = x_i + \sigma\epsilon_i$ for $i = 1, ..., n$. We define the recovery log-likelihood function as

$$\mathcal{J}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i \mid \tilde{x}_i). \qquad (6)$$

The term recovery indicates that we attempt to recover the clean sample $x_i$ from the noisy sample $\tilde{x}_i$. Thus, instead of maximizing $\mathcal{L}(\theta)$ in equation 2, we can maximize $\mathcal{J}(\theta)$, whose distributions are easier to sample from. Specifically, we generate synthesized samples by $K$ steps of Langevin dynamics that iterates

$$x^{\tau+1} = x^{\tau} + \frac{\delta^2}{2} \left( \nabla_x f_\theta(x^{\tau}) + \frac{1}{\sigma^2} (\tilde{x} - x^{\tau}) \right) + \delta \epsilon^{\tau}. \qquad (7)$$

The model is then updated following the same learning gradients as MLE (equation 3), because the quadratic term $-\frac{1}{2\sigma^2}\|\tilde{x} - x\|^2$ does not depend on $\theta$. Following the classical analysis of MLE, we can show that the point estimate given by maximizing recovery likelihood is an unbiased estimator of the true parameters, which means that given enough data, a rich enough model and exact synthesis, maximizing the recovery likelihood learns $\theta$ such that $p_{\text{data}}(x) = p_\theta(x)$. See Appendix A.2 for a theoretical explanation.
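The conditional Langevin update described above can be sketched on a toy problem. The example below is only an illustration under stated assumptions: the energy `grad_f` is a hypothetical 1D two-mode mixture (not a trained network), and the function name `conditional_langevin`, the step size and the number of steps are illustrative choices rather than settings from the paper.

```python
import numpy as np

def grad_f(x, mu=4.0):
    """Gradient of a toy energy f(x) = log(exp(-(x-mu)^2/2) + exp(-(x+mu)^2/2)).

    Hypothetical stand-in for the gradient of a learned f_theta.
    """
    return -x + mu * np.tanh(mu * x)

def conditional_langevin(x_tilde, sigma, step, n_steps, rng):
    """Langevin dynamics targeting p(x | x_tilde), started at x_tilde."""
    x = np.array(x_tilde, dtype=float)
    for _ in range(n_steps):
        # Marginal gradient plus the pull of the quadratic term toward x_tilde.
        grad = grad_f(x) + (x_tilde - x) / sigma**2
        x = x + 0.5 * step**2 * grad + step * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
sigma = 0.5
x_clean = 4.0 + rng.standard_normal(1000)            # clean samples near one mode
x_tilde = x_clean + sigma * rng.standard_normal(1000)  # noisy observations
x_syn = conditional_langevin(x_tilde, sigma, step=0.2, n_steps=200, rng=rng)
```

Because the quadratic term anchors each chain to its own noisy observation, the chains in this toy setting stay in the mode that generated the data and never need to cross to the other mode.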

3.3. NORMAL APPROXIMATION TO CONDITIONAL

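When the variance of the perturbing noise $\sigma^2$ is small, $p_\theta(x \mid \tilde{x})$ can be approximated by a normal distribution via a first-order Taylor expansion at $\tilde{x}$. Specifically, the negative conditional energy is

$$-E_\theta(x \mid \tilde{x}) = f_\theta(x) - \frac{1}{2\sigma^2} \|\tilde{x} - x\|^2 \qquad (8)$$
$$\doteq f_\theta(\tilde{x}) + \langle \nabla_x f_\theta(\tilde{x}), x - \tilde{x} \rangle - \frac{1}{2\sigma^2} \|\tilde{x} - x\|^2 \qquad (9)$$
$$= -\frac{1}{2\sigma^2} \left\| x - (\tilde{x} + \sigma^2 \nabla_x f_\theta(\tilde{x})) \right\|^2 + c, \qquad (10)$$

where $c$ includes the terms that do not depend on $x$ (see Appendix A.3 for a detailed derivation). In the above approximation, we do not perform a second-order Taylor expansion because $\sigma^2$ is small, and $\|\tilde{x} - x\|^2 / 2\sigma^2$ will dominate all the second-order terms from the Taylor expansion. Thus we can approximate $p_\theta(x \mid \tilde{x})$ by a Gaussian approximation $\tilde{p}_\theta(x \mid \tilde{x})$:

$$\tilde{p}_\theta(x \mid \tilde{x}) = \mathcal{N}\left( x;\ \tilde{x} + \sigma^2 \nabla_x f_\theta(\tilde{x}),\ \sigma^2 I \right). \qquad (11)$$

We can sample from this distribution using

$$x_{\text{gen}} = \tilde{x} + \sigma^2 \nabla_x f_\theta(\tilde{x}) + \sigma\epsilon, \qquad (12)$$

where $\epsilon \sim \mathcal{N}(0, I)$. This resembles a single step of Langevin dynamics, except that $\sigma\epsilon$ is replaced by $\sqrt{2}\sigma\epsilon$ in Langevin dynamics. This normal approximation has two traits: (1) it verifies the fact that the conditional density $p_\theta(x \mid \tilde{x})$ is generally easier to sample from when $\sigma$ is small; (2) it provides hints for choosing the step size of Langevin dynamics, as discussed in section 3.5.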
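As a small sanity check of the normal approximation, consider a toy case where the exact conditional is known. The sketch below assumes a hypothetical energy $f(x) = -x^2/2$ (a standard normal marginal), for which the exact mean of $p(x \mid \tilde{x})$ is $\tilde{x} / (1 + \sigma^2)$; the function names are illustrative and not part of the paper's code.

```python
import numpy as np

def grad_f(x):
    """Gradient of the toy energy f(x) = -x^2 / 2 (standard normal marginal)."""
    return -x

def approx_mean(x_tilde, sigma):
    """Mean of the normal approximation: x_tilde + sigma^2 * grad f(x_tilde)."""
    return x_tilde + sigma**2 * grad_f(x_tilde)

def exact_mean(x_tilde, sigma):
    """Exact posterior mean for a N(0, 1) prior and N(x, sigma^2) noise."""
    return x_tilde / (1.0 + sigma**2)

def normal_approx_sample(x_tilde, sigma, rng):
    """One draw x_gen = x_tilde + sigma^2 * grad f(x_tilde) + sigma * eps."""
    return approx_mean(x_tilde, sigma) + sigma * rng.standard_normal(np.shape(x_tilde))

x_tilde = 1.5
gap_large = abs(approx_mean(x_tilde, 0.10) - exact_mean(x_tilde, 0.10))
gap_small = abs(approx_mean(x_tilde, 0.05) - exact_mean(x_tilde, 0.05))
ratio = gap_large / gap_small  # halving sigma shrinks the gap by about 2^4
```

In this toy case the error in the mean equals $\tilde{x}\sigma^4 / (1 + \sigma^2)$, so it vanishes quickly as $\sigma$ decreases, which is consistent with the claim that the approximation is accurate for small noise levels.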

3.4. CONNECTION TO VARIATIONAL INFERENCE AND SCORE MATCHING

The normal approximation to the conditional distribution leads to a natural connection to diffusion probabilistic models (Sohl-Dickstein et al., 2015; Ho et al., 2020) and denoising score matching (Vincent, 2011; Song & Ermon, 2019; 2020; Saremi et al., 2018; Saremi & Hyvarinen, 2019). Specifically, diffusion probabilistic models do not model $p_\theta(x)$ as an energy-based model; instead, they use variational inference and directly model the conditional density as

$$p_\theta(x \mid \tilde{x}) \sim \mathcal{N}\left( \tilde{x} + \sigma^2 s_\theta(\tilde{x}),\ \sigma^2 I \right), \qquad (13)$$

which is in agreement with the normal approximation (equation 11), with $s_\theta(x) = \nabla_x f_\theta(x)$. On the other hand, the training objective of denoising score matching is to minimize

$$\frac{1}{2\sigma^2} \mathbb{E}_{p(x, \tilde{x})}\left[ \left\| x - (\tilde{x} + \sigma^2 s_\theta(\tilde{x})) \right\|^2 \right], \qquad (14)$$

where $s_\theta(\tilde{x})$ is the score of the density of $\tilde{x}$. This objective is in agreement with the objective of maximizing the log-likelihood of the normal approximation (equation 10), except that for the normal approximation, $\nabla_x f_\theta(\cdot)$ is the score of the density of $x$ instead of $\tilde{x}$. However, the difference between the scores of the densities of $x$ and $\tilde{x}$ is of $O(\sigma^2)$, which is negligible when $\sigma$ is sufficiently small (see Appendix A.4 for details). We can further show that the learning gradient of maximizing the log-likelihood of the normal approximation is approximately the same as the learning gradient of maximizing the original recovery log-likelihood with one step of Langevin dynamics (see Appendix A.5). This indicates that the training process of maximizing recovery likelihood agrees with that of diffusion probabilistic models and denoising score matching when $\sigma$ is small.

As the normal approximation is accurate only when $\sigma$ is small, it requires many time steps in the diffusion process for this approximation to work well, which is also reported in Ho et al. (2020) and Song & Ermon (2020). In contrast, the diffusion recovery likelihood framework can be more flexible in choosing the number of time steps and the magnitude of $\sigma$.
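To make the denoising score matching objective concrete, the sketch below evaluates it by Monte Carlo on toy data. It assumes standard normal data, so the score of the noisy density is known in closed form; the helper `dsm_loss` and the two candidate score functions are illustrative names, not the paper's implementation.

```python
import numpy as np

def dsm_loss(score_fn, x, x_tilde, sigma):
    """Monte Carlo estimate of (1 / (2 sigma^2)) E || x - (x_tilde + sigma^2 s(x_tilde)) ||^2."""
    resid = x - (x_tilde + sigma**2 * score_fn(x_tilde))
    return np.mean(np.sum(resid**2, axis=-1)) / (2.0 * sigma**2)

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.standard_normal((20000, 1))                 # toy data: x ~ N(0, 1)
x_tilde = x + sigma * rng.standard_normal(x.shape)  # noisy data: N(0, 1 + sigma^2)

score_noisy = lambda z: -z / (1.0 + sigma**2)  # score of the density of x_tilde
score_clean = lambda z: -z                     # score of the density of x

loss_noisy = dsm_loss(score_noisy, x, x_tilde, sigma)
loss_clean = dsm_loss(score_clean, x, x_tilde, sigma)
```

In this toy setting the objective is minimized by the score of the noisy density, and plugging in the score of the clean density gives a slightly larger value. The two scores differ by a factor of $1 / (1 + \sigma^2)$, so the gap closes as $\sigma$ decreases.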

3.5. DIFFUSION RECOVERY LIKELIHOOD

As discussed above, sampling from $p_\theta(x \mid \tilde{x})$ becomes simple only when $\sigma$ is small. In the extreme case when $\sigma \to \infty$, $p_\theta(x \mid \tilde{x})$ converges to the marginal distribution $p_\theta(x)$, which is again highly multi-modal and difficult to sample from. To keep $\sigma$ small and meanwhile equip the model with the ability to generate new samples initialized from white noise, inspired by Sohl-Dickstein et al. (2015) and Ho et al. (2020), we propose to learn a sequence of recovery likelihoods on gradually perturbed observed data based on a diffusion process. Specifically, assume a sequence of perturbed observations $x_0, x_1, ..., x_T$ such that

$$x_0 \sim p_{\text{data}}(x); \quad x_{t+1} = \sqrt{1 - \sigma_{t+1}^2}\, x_t + \sigma_{t+1} \epsilon_{t+1}, \quad t = 0, 1, ..., T-1. \qquad (15)$$

The scaling factor $\sqrt{1 - \sigma_{t+1}^2}$ ensures that the sequence is a spherical interpolation between the observed sample and Gaussian white noise. Let $y_t = \sqrt{1 - \sigma_{t+1}^2}\, x_t$, and we assume a sequence of conditional EBMs

$$p_\theta(y_t \mid x_{t+1}) = \frac{1}{\tilde{Z}_{\theta,t}(x_{t+1})} \exp\left( f_\theta(y_t, t) - \frac{1}{2\sigma_{t+1}^2} \|x_{t+1} - y_t\|^2 \right), \quad t = 0, 1, ..., T-1, \qquad (16)$$

where $f_\theta(y_t, t)$ is defined by a neural network conditioned on $t$. We follow the learning algorithm in section 3.2. A question is how to determine the step size schedule $\delta_t$ of Langevin dynamics. Inspired by the sampling procedure of the normal approximation (equation 12), we set the step size $\delta_t = b\sigma_t$, where $b < 1$ is a tuned hyperparameter. This schedule turns out to work well in practice. Thus the $K$ steps of Langevin dynamics iterate

$$y_t^{\tau+1} = y_t^{\tau} + \frac{b^2 \sigma_t^2}{2} \left( \nabla_y f_\theta(y_t^{\tau}, t) + \frac{1}{\sigma_t^2} (x_{t+1} - y_t^{\tau}) \right) + b\sigma_t \epsilon^{\tau}. \qquad (17)$$
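The forward diffusion process above can be sketched in a few lines. The text only states that $\sigma_t^2$ increases linearly, so the schedule endpoints below are hypothetical values chosen for illustration, and the function names are not from the paper's code. The closed-form scales follow from unrolling the recursion: $x_t$ is a signal scale times $x_0$ plus a noise scale times a standard normal variable.

```python
import numpy as np

def linear_sigma2_schedule(T, sigma2_min=1e-4, sigma2_max=0.02):
    """sigma_1^2, ..., sigma_T^2 increasing linearly (endpoints are hypothetical)."""
    return np.linspace(sigma2_min, sigma2_max, T)

def diffuse(x0, sigma2, rng):
    """Return x_T after applying x_{t+1} = sqrt(1 - s2) * x_t + sqrt(s2) * eps for each s2."""
    x = np.array(x0, dtype=float)
    for s2 in sigma2:
        x = np.sqrt(1.0 - s2) * x + np.sqrt(s2) * rng.standard_normal(x.shape)
    return x

def signal_and_noise_scale(sigma2):
    """Scales such that x_t = signal_t * x_0 + noise_t * eps with eps ~ N(0, I)."""
    signal = np.sqrt(np.cumprod(1.0 - sigma2))
    noise = np.sqrt(1.0 - signal**2)
    return signal, noise

sigma2 = linear_sigma2_schedule(T=1000)
signal, noise = signal_and_noise_scale(sigma2)

rng = np.random.default_rng(0)
x_T = diffuse(rng.standard_normal(5000), sigma2, rng)  # unit-variance toy data
```

Since the squared signal and noise scales always sum to one, unit-variance data keeps unit variance at every time step, and with this hypothetical schedule almost no signal remains at $t = T$.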

| Model | FID↓ | Inception↑ |
|---|---|---|
| GAN-based | | |
| WGAN-GP (Gulrajani et al., 2017) | 36.4 | 7.86 ± .07 |
| SNGAN (Miyato et al., 2018) | 21.7 | 8.22 ± .05 |
| SNGAN-DDLS (Che et al., 2020) | 15.42 | 9.09 ± .10 |
| StyleGAN2-ADA (Karras et al., 2020) | 3.26 | 9.74 ± .05 |
| Score-based | | |
| NCSN (Song & Ermon, 2019) | 25.32 | 8.87 ± .12 |
| NCSN-v2 (Song & Ermon, 2020) | 10.87 | 8.40 ± .07 |
| DDPM (Ho et al., 2020) | 3.17 | |

| Setting / Objective | FID↓ | Inception↑ |
|---|---|---|
| T = 1, K = 180 | 32.12 | 6.72 ± 0.12 |
| T = 1000, K = 0 | 22.58 | 7.71 ± 0.08 |
| T = 1000, K = 0 (DSM) | 21.76 | 7.76 ± 0.11 |
| T = 6, K = 10 | - | - |
| T = 6, K = 30 | 9.58 | 8.30 ± 0.11 |
| T = 6, K = 50 | 9.36 | 8.46 ± 0.13 |

Interpolation. As shown in Figure 5, our model is capable of smooth interpolation between two generated samples. Specifically, for two samples $x_0^{(0)}$ and $x_0^{(1)}$, we do a spherical interpolation between the initial white noise images $x_T^{(0)}$ and $x_T^{(1)}$ and between the noise terms of Langevin dynamics $\epsilon_{t,\tau}^{(0)}$ and $\epsilon_{t,\tau}^{(1)}$ for every sampling step at every time step. More interpolation results can be found in Appendix C.3.

Image inpainting. A promising application of energy-based models is to use the learned model as a prior model for image processing, such as image inpainting, denoising and super-resolution (Gao et al., 2018; Du & Mordatch, 2019; Song & Ermon, 2019). In Figure 6, we demonstrate that the models learned by maximizing recovery likelihoods are capable of realistic and semantically meaningful image inpainting. Specifically, given a masked image and the corresponding mask, we first obtain a sequence of perturbed masked images at different noise levels. Inpainting is then achieved by running Langevin dynamics progressively on the masked pixels at decreasing noise levels while keeping the observed pixels fixed. Additional image inpainting results can be found in Appendix C.4.

Ablation study. Table 2 summarizes the results of the ablation study on CIFAR-10. We investigate the effect of changing the number of time steps T and the number of sampling steps K. First, to show that it is beneficial to learn by diffusion recovery likelihood, we compare against a baseline approach (T = 1, K = 180) where we use only one time step, so that the recovery likelihood becomes the marginal likelihood. This approach is adopted by Nijkamp et al. (2019b) and Du & Mordatch (2019). For a fair comparison, we give the baseline method the same budget of MCMC sampling as our T6 setting (i.e., 180 sampling steps). Our method outperforms this baseline by a large margin. The models are also trained more efficiently, as the number of sampling steps per iteration is reduced and amortized over time steps.

Next, we report the sample quality of setting T1k. We test two training objectives for this setting: (1) maximizing recovery likelihoods (T = 1000, K = 0) and (2) maximizing the approximated normal distributions (T = 1000, K = 0 (DSM)). As mentioned in section 3.4, (2) is equivalent to the training objectives of denoising score matching (Song & Ermon, 2019; 2020) and diffusion probabilistic models (Ho et al., 2020), except that the score functions are taken as the gradients of explicit energy functions. In practice, for a direct comparison, (2) follows the same implementation as in Ho et al. (2020), except that the score function is parametrized as the gradient of the explicit energy function used in our method. (1) and (2) achieve similar sample quality in terms of quantitative metrics, where (2) results in a slightly better FID score yet a slightly worse inception score. This verifies the fact that the training objectives of (1) and (2) are consistent. Both (1) and (2) perform worse than setting T6. A possible explanation is that the sampling error may accumulate over many time steps, so that a more flexible schedule of time steps accompanied by a certain number of sampling steps is preferred.

Last, we examine the influence of varying the number of sampling steps while fixing the number of time steps. Training becomes unstable when the number of sampling steps is too small (T = 6, K = 10), and more sampling steps lead to better sample quality. However, since K = 50 does not yield a significant improvement over K = 30 and has a much higher computational cost, we keep K = 30 for image generation on all datasets. See Appendix C.1 for a plot of FID scores over iterations.

4.2. LONG-RUN CHAIN ANALYSIS

Besides achieving high quality generation, a perhaps equally important aspect of learning EBMs is to obtain a faithful energy potential. A principled way to check the validity of the learned potential is to run long sampling chains and see whether the samples remain realistic. However, as pointed out in Nijkamp et al. (2019a), almost all existing methods of learning EBMs fail to produce realistic long-run chain samples. In this subsection, we demonstrate that by composing a thousand diffusion time steps (T1k setting), we can form steady long-run MCMC chains for the conditional distributions. First, we prepare a faithful sampler for conducting long-run sampling. Specifically, after training the model under the T1k setting by maximizing diffusion recovery likelihood, for each time step, we first sample from the normal approximation and count it as one sampling step, and then use Hamiltonian Monte Carlo (HMC) (Neal et al., 2011) with 2 leapfrog steps to perform the subsequent sampling steps. To obtain a reasonable schedule of sampling step size, for each time step we adaptively adjust the step size of HMC so that the average acceptance rate falls within [0.6, 0.9].

5. CONCLUSION

We propose to learn EBMs by diffusion recovery likelihood, a variant of MLE applied to diffusion processes. We achieve high quality image synthesis, and with a thousand noise levels, we obtain faithful long-run MCMC samples that indicate the validity of the learned energy potentials. Since this method can learn EBMs efficiently with a small budget of MCMC, we are also interested in scaling it up to higher resolution images and investigating this method in other data modalities in the future.

Evaluation metrics. We use FID and inception scores as quantitative evaluation metrics of sample quality. On all the datasets, we calculate FID and inception scores on 50,000 samples using the original code from Salimans et al. (2016) and Heusel et al. (2017).



Figure 1: Generated samples on LSUN church outdoor 128 × 128 (left), LSUN bedroom 128 × 128 (center) and CelebA 64 × 64 (right).

Figure 3: Illustration of diffusion recovery likelihood on 2D checkerboard example. Top: progressively generated samples. Bottom: estimated marginal densities.

Figure 2: Comparison of learning EBMs by diffusion recovery likelihood (Ours) versus marginal likelihood (Short-run).

Figure 5: Interpolation results between the leftmost and rightmost generated samples. From top to bottom: LSUN church outdoor 128 × 128, LSUN bedroom 128 × 128 and CelebA 64 × 64.

Figure 6: Image inpainting on LSUN church outdoor 128 × 128 (left) and CelebA 64 × 64 (right). Within each block, the top row shows the masked images and the bottom row shows the inpainted images.

Figure 7: Left: Adjusted step size of HMC over time step. Center: Acceptance rate over time step. Right: Estimated log partition function over number of samples with different number of sampling steps per time step. The x axis is plotted in log scale.

The average acceptance rate is computed over 1000 chains for 100 steps. Figure 7 displays the adjusted step size (left) and acceptance rate (center) over time step. The adjusted step size increases logarithmically. With this step size schedule, we generate long-run chains from the learned sequence of conditional distributions. As shown in Figure 8, images remain realistic even after 100k sampling steps in total (i.e., 100 sampling steps per time step), resulting in FID 24.89. This score is close to the one computed on samples generated by 1k steps (i.e., sampled from the normal approximation), which is 25.12. As a further check, we use a No-U-Turn Sampler (Hoffman & Gelman, 2014) with the same step size schedule as HMC to perform long-run sampling, where the samples also remain realistic. See Appendix C.2 for details.

Figure 8: Long-run chain samples from model-T1k with different total amount of HMC steps. From left to right: 1k steps, 10k steps and 100k steps.

More interestingly, given the faithful long-run MCMC samples from the conditional distributions, we can estimate the log ratio of the partition functions of the marginal distributions, and further estimate the partition function of $p_\theta(y_0)$. The strategy is based on annealed importance sampling (Neal, 2001). See Appendix A.6 for the implementation details. The right subfigure of Figure 7 depicts the estimated log partition function of $p_\theta(y_0)$ over the number of MCMC samples used. To verify the estimation strategy and again check the long-run chain samples, we conduct multiple runs using samples generated with different numbers of HMC steps and display the estimation curves. All the curves saturate to values close to each other at the end, indicating the stability of long-run chain samples and the effectiveness of the estimation strategy. With the estimated partition function, by change of variable, we can estimate the normalized density of data as $g_\theta(x_0) = \sqrt{1 - \sigma_1^2}\, p_\theta(\sqrt{1 - \sigma_1^2}\, x_0)$. We report test bits per dimension on CIFAR-10 in Table 4. Note that the result should be taken with a grain of salt, because the partition function is estimated by samples and, as shown in Appendix A.6, it is a stochastic lower bound of the true value that converges to the true value when the number of samples grows large.

Figure 9 shows FID scores computed on 2,500 samples every 15,000 iterations.

Figure 9: FIDs for different number of Langevin steps.

Figure 11: Interpolation results between the leftmost and rightmost generated samples on CelebA 64 × 64.

Figure 12: Interpolation results between the leftmost and rightmost generated samples on LSUN church outdoor 128 × 128.

Figure 13: Interpolation results between the leftmost and rightmost generated samples on LSUN bedroom 128 × 128.

Figure 14: Image inpainting results on CelebA 64 × 64. Top: masked images, bottom: inpainted images.

Figure 15: Image inpainting results on LSUN church outdoor 128 × 128. Top: masked images, bottom: inpainted images.

Figure 16: Generated samples on CIFAR-10.

Training
repeat
    Sample $t \sim \mathrm{Unif}(\{0, ..., T-1\})$.
    Sample pairs $(y_t, x_{t+1})$.

FID and inception scores on CIFAR-10.

Ablation of training objectives, time steps T and sampling steps K on CIFAR-10. K = 0 indicates that we sample from the normal approximation.

FID scores on CelebA 64 × 64.

Model architectures of various resolutions. N is a hyperparameter that we sweep over.

Hyperparameters of various datasets.

ACKNOWLEDGEMENT

The work was done while Ruiqi Gao and Yang Song were interns at Google Brain during the summer of 2020. The work of Ying Nian Wu is supported by NSF DMS-2015577. We thank Alexander A. Alemi, Jonathan Ho, Tim Salimans and Kevin Murphy for their insightful discussions during the course of this project.


Algorithm 2 Progressive sampling
Sample $x_T \sim \mathcal{N}(0, I)$.
for $t \leftarrow T-1$ to $0$ do
    $y_t = x_{t+1}$.
    for $\tau \leftarrow 1$ to $K$ do
        Update $y_t$ according to equation 17.
    end for
    $x_t = y_t / \sqrt{1 - \sigma_{t+1}^2}$.
end for
return $x_0$.
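The progressive sampling procedure can be sketched on a toy problem where the energies are known exactly. The example below is a sketch under assumptions: the data are taken to be standard normal, so that $y_t \sim \mathcal{N}(0, 1 - \sigma_{t+1}^2)$ and `grad_f` is available in closed form instead of coming from a trained network. The noise schedule, `b` and `K` are hypothetical values, and the code uses the noise level $\sigma_{t+1}$ of the conditional EBM for both the quadratic term and the step size, which is an indexing assumption.

```python
import numpy as np

def grad_f(y, sigma2_next):
    """Exact gradient of the toy energy f(y, t) = -y^2 / (2 (1 - sigma_{t+1}^2))."""
    return -y / (1.0 - sigma2_next)

def progressive_sampling(sigma2, n_chains, K, b, rng):
    """Sample x_0 by running K Langevin steps per time step, from t = T-1 down to 0."""
    T = len(sigma2)
    x = rng.standard_normal(n_chains)          # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        s2 = sigma2[t]                         # sigma_{t+1}^2 in the text's indexing
        step = b * np.sqrt(s2)                 # step size proportional to the noise level
        y = x.copy()                           # initialize y_t at x_{t+1}
        for _ in range(K):
            grad = grad_f(y, s2) + (x - y) / s2
            y = y + 0.5 * step**2 * grad + step * rng.standard_normal(n_chains)
        x = y / np.sqrt(1.0 - s2)              # rescale y_t back to x_t
    return x

rng = np.random.default_rng(0)
sigma2 = np.linspace(0.1, 0.6, 6)              # hypothetical linear schedule, T = 6
x0 = progressive_sampling(sigma2, n_chains=5000, K=100, b=0.3, rng=rng)
```

In this toy case the generated `x0` should be close to standard normal, with a small upward bias in the variance due to the finite Langevin step size.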

4. EXPERIMENTS

To show that diffusion recovery likelihood is flexible for diffusion processes with various magnitudes of noise, we test the method under two settings: (1) T = 6, with K = 30 steps of Langevin dynamics per time step; (2) T = 1000, with sampling from the normal approximation. Setting (2) resembles the noise schedule of Ho et al. (2020), and the magnitude of noise added at each time step is much smaller than in (1). For both settings, we set $\sigma_t^2$ to increase linearly. The network structure of $f_\theta(x, t)$ is based on Wide ResNet (Zagoruyko & Komodakis, 2016), with weight normalization removed. The time step $t$ is encoded by the Transformer sinusoidal position embedding as in Ho et al. (2020). For (1), we find that adding another scaling factor $c_t$ to the step size $\delta_t$ helps. Architecture and training details are in Appendix B. Henceforth we refer to the two settings as T6 and T1k.

4.1. IMAGE GENERATION

Tables 1 and 3 summarize the quantitative evaluations on the CIFAR-10 and CelebA datasets, in terms of Frechet Inception Distance (FID) (Heusel et al., 2017) and inception scores (Salimans et al., 2016). On CIFAR-10, our model achieves FID 9.58 and inception score 8.30, which outperforms existing methods of learning explicit energy-based models to a large extent, and is superior to a majority of GAN-based methods. On CelebA, our model obtains results comparable with the state-of-the-art GAN-based methods, and outperforms score-based methods (Song & Ermon, 2019; 2020). Note that the score-based methods (Song & Ermon, 2019; 2020) and diffusion probabilistic models (Ho et al., 2020) directly parametrize and learn the score of the data distribution, whereas our goal is to learn explicit energy-based models.

A EXTENDED DERIVATIONS

A.1 DERIVATION OF EQUATION 5

Let $\tilde{x} = x + \sigma\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. Given the marginal distribution of $x$, $p_\theta(x) = \frac{1}{Z_\theta} \exp(f_\theta(x))$, we can derive the conditional distribution of $x$ given $\tilde{x}$ by Bayes' rule as

$$p_\theta(x \mid \tilde{x}) \propto p_\theta(x)\, p(\tilde{x} \mid x) \propto \exp(f_\theta(x)) \exp\left( -\frac{1}{2\sigma^2} \|\tilde{x} - x\|^2 \right), \quad \text{so that} \quad p_\theta(x \mid \tilde{x}) = \frac{1}{\tilde{Z}_\theta(\tilde{x})} \exp\left( f_\theta(x) - \frac{1}{2\sigma^2} \|\tilde{x} - x\|^2 \right),$$

where we absorb all the terms that do not depend on $x$ into $\tilde{Z}_\theta(\tilde{x})$.

A.2 THEORETICAL UNDERSTANDING

In this subsection, we analyze the asymptotic behavior of maximizing the recovery log-likelihood. For the model class $\{p_\theta(x), \forall\theta\}$, suppose there exists $\theta^*$ such that $p_{\text{data}} = p_{\theta^*}$. According to the classical theory of MLE, let $\hat{\theta}_0$ be the point estimate given by MLE. Then $\hat{\theta}_0$ is an unbiased estimator of $\theta^*$ with asymptotic normality, where $I_0(\theta)$ is the Fisher information and $n$ is the number of observed samples. Let $\hat{\theta}$ be the point estimate given by maximizing the recovery log-likelihood. We can derive a result parallel to that of MLE, with a corresponding information $I(\theta)$. From the relationship between $I_0(\theta)$ and $I(\theta)$, there is a loss of information, but $\hat{\theta}$ is still an unbiased estimator of $\theta^*$ with asymptotic normality.

A.4 DIFFERENCE BETWEEN THE SCORES OF p(x) AND p(x̃)

For notational clarity, with $\tilde{x} = x + \epsilon$, we let $\tilde{p}$ be the distribution of $\tilde{x}$ and $p$ be the distribution of $x$. Then, for a smooth test function with vanishing tails, integration by parts leads to a heat equation relating $\tilde{p}$ and $p$. Comparing the resulting scores, the difference between the scores of $p$ and $\tilde{p}$ is of the order $\sigma^2$, which is negligible when $\sigma^2$ is small.
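The order of the score difference can be checked in a case where both scores are available in closed form. The sketch below assumes a hypothetical 1D Gaussian $p = \mathcal{N}(0, 1)$, so that $\tilde{p} = \mathcal{N}(0, 1 + \sigma^2)$; the function names are illustrative.

```python
def score_clean(x, var=1.0):
    """Score of p = N(0, var): d/dx log p(x) = -x / var."""
    return -x / var

def score_noisy(x, sigma, var=1.0):
    """Score of the noisy density N(0, var + sigma^2)."""
    return -x / (var + sigma**2)

def score_gap(x, sigma):
    """Absolute difference between the two scores at the point x."""
    return abs(score_noisy(x, sigma) - score_clean(x))

x = 0.7
ratio = score_gap(x, 0.10) / score_gap(x, 0.05)  # halving sigma: gap shrinks about 4x
```

Here the gap equals $|x|\,\sigma^2 / (1 + \sigma^2)$, so halving $\sigma$ reduces it by roughly a factor of four, in line with the $O(\sigma^2)$ statement.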

LIKELIHOOD

In this subsection we demonstrate that the learning gradient of maximizing the likelihood of the normal approximation is approximately the same as the gradient of maximizing the original recovery likelihood with one step of Langevin sampling. Specifically, consider first the gradient of the normal approximation of the recovery log-likelihood for an observed $x_{\text{obs}}$. On the other hand, to maximize the original recovery likelihood, suppose we sample $x_{\text{syn}} \sim p_\theta(x \mid \tilde{x})$; the gradient ascent direction of the original recovery log-likelihood is then expressed in terms of $h_\theta(x) = \nabla_\theta f_\theta(x)$. Approximately, if we perform one step of Langevin dynamics from $\tilde{x}$ to obtain $x_{\text{syn}}$, i.e., $x_{\text{syn}} = \tilde{x} + \sigma^2 \nabla_x f_\theta(\tilde{x}) + \sqrt{2}\sigma e$, and assume $f_\theta(x)$ is locally linear in $x$, then, comparing equations 37 and 43, we see that the two gradients agree with each other.
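The argument can be sketched as follows. This is a reconstruction from the definitions in sections 3.2 and 3.3, not necessarily the exact displays of the original appendix; in particular, the first-order expansion of $h_\theta$ around $\tilde{x}$ is an assumption standing in for the local linearity condition.

```latex
% Gradient of the normal approximation (equation 10) for an observed x_obs:
\nabla_\theta \log \tilde{p}_\theta(x_{\text{obs}} \mid \tilde{x})
  = \nabla_\theta \left[ -\frac{1}{2\sigma^2}
      \left\| x_{\text{obs}} - \tilde{x} - \sigma^2 \nabla_x f_\theta(\tilde{x}) \right\|^2 \right]
  = \left\langle \nabla_x h_\theta(\tilde{x}),\;
      x_{\text{obs}} - \tilde{x} - \sigma^2 \nabla_x f_\theta(\tilde{x}) \right\rangle .

% Gradient of the original recovery log-likelihood with one synthesized sample:
\nabla_\theta \log p_\theta(x_{\text{obs}} \mid \tilde{x})
  \approx h_\theta(x_{\text{obs}}) - h_\theta(x_{\text{syn}})
  \approx \left\langle \nabla_x h_\theta(\tilde{x}),\; x_{\text{obs}} - x_{\text{syn}} \right\rangle .

% Substituting one Langevin step x_syn = \tilde{x} + \sigma^2 \nabla_x f_\theta(\tilde{x}) + \sqrt{2}\sigma e:
\left\langle \nabla_x h_\theta(\tilde{x}),\;
   x_{\text{obs}} - \tilde{x} - \sigma^2 \nabla_x f_\theta(\tilde{x}) \right\rangle
 - \sqrt{2}\,\sigma \left\langle \nabla_x h_\theta(\tilde{x}),\; e \right\rangle .
```

The last term has zero mean over the Langevin noise $e$, so under these assumptions the two gradients coincide in expectation.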

A.6 ESTIMATING THE PARTITION FUNCTION

We can utilize the sequence of learned distributions of $y_t$ ($= \sqrt{1 - \sigma_{t+1}^2}\, x_t$) to estimate the partition function. Specifically, consider the marginal distribution of $y_t$. We can estimate the ratio of the partition functions at two consecutive time steps using importance sampling, where $y_{t+1,i}$ are samples generated by progressive sampling. Starting from $t = T$, where $p_T(x)$ follows a Gaussian distribution, we can compute $\log Z_{\theta,t}$ along the reverse path of the diffusion process until we reach $t = 0$. In practice, since the ratio given by MCMC samples can vary across many orders of magnitude, it is more meaningful to estimate the logarithm of the ratio. Unfortunately, although equation 46 is an unbiased estimator of $Z_{\theta,t} / Z_{\theta,t+1}$, the logarithm of this estimator is generally a stochastic lower bound of $\log(Z_{\theta,t} / Z_{\theta,t+1})$ (Grosse et al., 2016). However, as we show below, this bound will gradually converge to an unbiased estimator of $\log(Z_{\theta,t} / Z_{\theta,t+1})$ as the number of samples becomes large. Specifically, let $A$ be the estimator in equation 46 and $\mu$ be the true value of $Z_{\theta,t} / Z_{\theta,t+1}$. We have $\mathbb{E}[A] = \mu$; then by a second-order Taylor expansion,

$$\mathbb{E}[\log A] \approx \log \mu - \frac{\mathrm{Var}(A)}{2\mu^2}.$$

By the law of large numbers, $\mathrm{Var}(A) \to 0$ as the number of samples $M \to \infty$, and thus $\mathbb{E}[\log A] \to \log \mu$. This is also consistent with the estimation curves in the right subfigure of Figure 7: since $\mathrm{Var}(A) \geq 0$, the estimation curve increases from below as the number of samples becomes larger. When the curve becomes stable, it indicates convergence.
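The importance sampling idea can be illustrated with the generic identity $Z_a / Z_b = \mathbb{E}_{y \sim p_b}[\exp(f_a(y) - f_b(y))]$ for two unnormalized densities $\exp(f_a)$ and $\exp(f_b)$. The sketch below is not the paper's equation 46 (it ignores the rescaling between consecutive time steps); the two toy energies and the function names are assumptions chosen so that the true ratio is known.

```python
import numpy as np

def log_mean_exp(a):
    """Numerically stable log of the mean of exp(a)."""
    m = np.max(a)
    return m + np.log(np.mean(np.exp(a - m)))

def log_partition_ratio(f_a, f_b, samples_b):
    """Estimate log(Z_a / Z_b) from samples drawn from p_b, proportional to exp(f_b)."""
    return log_mean_exp(f_a(samples_b) - f_b(samples_b))

# Toy energies: exp(f_a) is an unnormalized N(0, 1), exp(f_b) an unnormalized N(0, 4),
# so Z_a = sqrt(2 pi), Z_b = sqrt(8 pi) and the true ratio is 1/2.
f_a = lambda y: -0.5 * y**2
f_b = lambda y: -0.5 * y**2 / 4.0

rng = np.random.default_rng(0)
samples_b = 2.0 * rng.standard_normal(100000)   # exact samples from p_b
estimate = log_partition_ratio(f_a, f_b, samples_b)
true_value = np.log(0.5)
```

Working in the log domain with `log_mean_exp` avoids overflow when the importance weights span many orders of magnitude, which is the practical concern raised in the text.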

B EXPERIMENTAL DETAILS

Model architecture. Our network structure is based on Wide ResNet (Zagoruyko & Komodakis, 2016). Table 5 lists the detailed network structures of various resolutions. The number of ResBlocks at every level, N, is a hyperparameter that we sweep over. The values of N for various datasets are listed in Table 6. Each ResBlock consists of two Conv2D layers. For the second Conv2D layer, we use zero initialization for the weights, and add a trainable channel-wise scaling parameter to the output. We remove the weight normalization, and use leaky ReLU (slope = 0.2) as the activation function in ResBlocks. Spectral normalization (Miyato et al., 2018) is used to regularize parameters in the Conv2D layers, ResBlocks and Dense layer. For encoding the time step t, we follow the scheme in Ho et al. (2020). Specifically, the time step t is first transformed into a sinusoidal embedding, and then two Dense layers are added. The time embedding is added after the first Conv2D layer of each ResBlock.

Training. We use the Adam (Kingma & Ba, 2014) optimizer for all the experiments. We find that for high resolution images, using a smaller β1 in Adam helps stabilize training. We use learning rate 0.0001 for all the experiments. For the values of β1, batch sizes and the number of training iterations for various datasets, see Table 6.

Datasets. We use the following datasets in our experiments: CIFAR-10 (Krizhevsky et al., 2009), CelebA (Liu et al., 2018) and LSUN (Yu et al., 2015)

