LEARNING ENERGY-BASED MODELS BY DIFFUSION RECOVERY LIKELIHOOD

Abstract

While energy-based models (EBMs) exhibit a number of desirable properties, training and sampling on high-dimensional datasets remains challenging. Inspired by recent progress on diffusion probabilistic models, we present a diffusion recovery likelihood method to tractably learn and sample from a sequence of EBMs trained on increasingly noisy versions of a dataset. Each EBM is trained with recovery likelihood, which maximizes the conditional probability of the data at a certain noise level given their noisy versions at a higher noise level. Optimizing recovery likelihood is more tractable than marginal likelihood, as sampling from the conditional distributions is much easier than sampling from the marginal distributions. After training, synthesized images can be generated by a sampling process that initializes from a Gaussian white noise distribution and progressively samples the conditional distributions at successively lower noise levels. Our method generates high fidelity samples on various image datasets. On unconditional CIFAR-10 our method achieves FID 9.58 and inception score 8.30, superior to the majority of GANs. Moreover, we demonstrate that unlike previous work on EBMs, our long-run MCMC samples from the conditional distributions do not diverge and still represent realistic images, allowing us to accurately estimate the normalized density of data even for high-dimensional datasets. Our implementation is available at https://github.com/ruiqigao/recovery_likelihood.



Inspired by Sohl-Dickstein et al. (2015) and Ho et al. (2020), we propose a diffusion recovery likelihood method to tackle the challenge of training EBMs directly on a dataset by instead learning a sequence of EBMs for the marginal distributions of the diffusion process. The sequence of marginal EBMs is learned with recovery likelihoods that are defined as the conditional distributions that invert the diffusion process. Compared to standard maximum likelihood estimation (MLE) of EBMs, learning marginal EBMs by diffusion recovery likelihood only requires sampling from the conditional distributions, which is much easier than sampling from the marginal distributions. After learning the marginal EBMs, we can generate synthesized images by a sequence of conditional samples initialized from the Gaussian white noise distribution. Unlike Ho et al. (2020), who approximate the reverse process by normal distributions, in our case the conditional distributions are derived from the marginal EBMs, which are more flexible. The framework of recovery likelihood was originally proposed in Bengio et al. (2013). In our work, we adapt it to learning the sequence of marginal EBMs from the diffusion data. Our work is also related to the denoising score matching method of Vincent (2011).

We demonstrate the efficacy of diffusion recovery likelihood on CIFAR-10, CelebA and LSUN datasets. The generated samples are of high fidelity and comparable to GAN-based methods. On CIFAR-10, we achieve FID 9.58 and inception score 8.30, substantially outperforming existing methods for learning explicit EBMs. We also demonstrate that diffusion recovery likelihood outperforms denoising score matching from diffusion data if we naively take the gradients of explicit energy functions as the score functions.
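To make concrete why sampling from a conditional distribution is easier than sampling from a marginal one, the following is a minimal one-dimensional sketch, not our implementation. It assumes additive noise x_noisy = x + sigma * eps and a hypothetical toy log-density f(x) = -x^2 / 2 in place of a learned network, so that by Bayes' rule p(x | x_noisy) is proportional to exp(f(x) - (x_noisy - x)^2 / (2 sigma^2)). The function names, step size and number of steps are illustrative assumptions.

```python
import numpy as np


def energy_grad(x):
    """Gradient of the toy log-density f(x) = -x**2 / 2.

    Hypothetical stand-in for the gradient of a learned EBM.
    """
    return -x


def conditional_langevin(x_noisy, sigma, n_steps=300, step=0.1, rng=None):
    """Langevin dynamics targeting p(x | x_noisy).

    The target is proportional to exp(f(x) - (x_noisy - x)**2 / (2 * sigma**2)).
    The quadratic term keeps the chain close to x_noisy, so the conditional
    is far more concentrated than the marginal when sigma is small.
    """
    rng = np.random.default_rng() if rng is None else rng
    x_noisy = np.asarray(x_noisy, dtype=float)
    x = x_noisy.copy()  # initialize each chain at its noisy observation
    for _ in range(n_steps):
        # Gradient of the conditional log-density with respect to x.
        grad = energy_grad(x) + (x_noisy - x) / sigma**2
        x = x + 0.5 * step**2 * grad + step * rng.standard_normal(x.shape)
    return x


if __name__ == "__main__":
    sigma = 0.5
    samples = conditional_langevin(np.full(5000, 2.0), sigma,
                                   rng=np.random.default_rng(0))
    # For this toy model the conditional is Gaussian with
    # mean x_noisy / (1 + sigma**2) and variance sigma**2 / (1 + sigma**2).
    print(samples.mean(), samples.var())
```

In this toy case the conditional is Gaussian in closed form, so the sampler can be checked against the analytic mean and variance; with a learned energy no such closed form is available and only the gradient is used.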
More interestingly, by using a thousand diffusion time steps, we demonstrate that even very long MCMC chains from the sequence of conditional distributions produce samples that represent realistic images. With the faithful long-run MCMC samples from the conditional distributions, we can accurately estimate the marginal partition function at zero noise level by importance sampling, and thus evaluate the normalized density of data under the EBM.

Introduction

EBMs (LeCun et al., 2006; Ngiam et al., 2011; Kim & Bengio, 2016; Zhao et al., 2016; Goyal et al., 2017; Xie et al., 2016b; Finn et al., 2016; Gao et al., 2018; Kumar et al., 2019; Nijkamp et al., 2019b; Du & Mordatch, 2019; Grathwohl et al., 2019; Desjardins et al., 2011; Gao et al., 2020; Che et al., 2020; Grathwohl et al., 2020; Qiu et al., 2019; Rhodes et al., 2020) are an appealing class of probabilistic models, which can be viewed as generative versions of discriminators (Jin et al., 2017; Lazarow et al., 2017; Lee et al., 2018; Grathwohl et al., 2020), yet can be learned from unlabeled data. Despite a number of desirable properties, two challenges remain for training EBMs on high-dimensional datasets. First, learning EBMs by maximum likelihood requires Markov chain Monte Carlo (MCMC) to generate samples from the model, which can be extremely expensive. Second, as pointed out in Nijkamp et al. (2019a), the energy potentials learned with non-convergent MCMC do not have a valid steady-state, in the sense that samples from long-run Markov chains can differ greatly from observed samples, making it difficult to evaluate the learned energy potentials.

Another line of work, originating from Sohl-Dickstein et al. (2015), is to learn from a diffused version of the data, which is obtained from the original data via a diffusion process that sequentially adds Gaussian white noise.
From such diffusion data, one can learn the conditional model of the data at a certain noise level given their noisy versions at the higher noise level of the diffusion process. After learning the sequence of conditional models that invert the diffusion process, one can then generate synthesized images from Gaussian white noise images by ancestral sampling. Building on



Figure 1: Generated samples on LSUN 128 × 128 church outdoor (left), LSUN 128 × 128 bedroom (center) and CelebA 64 × 64 (right).

Denoising score matching was further developed by Song & Ermon (2019; 2020) for learning from diffusion data. The training objective used for diffusion probabilistic models is a weighted version of the denoising score matching objective, as revealed by Ho et al. (2020). These methods learn the score functions (the gradients of the energy functions) directly, instead of using the gradients of learned energy functions as in EBMs. On the other hand, Saremi et al. (2018) parametrize the score function as the gradient of an MLP energy function, and Saremi & Hyvarinen (2019) further unify denoising score matching and neural empirical Bayes.
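As an illustration of taking the gradient of an explicit energy function as the score, the sketch below fits a one-parameter toy model by denoising score matching. It is not the parametrization used in any of the cited works: the log-density f(x) = -theta * x^2 / 2, the standard normal data, and the function names are all assumptions made for the example. The regression target (x - x_noisy) / sigma^2 is the score of the Gaussian noise kernel evaluated at the noisy point.

```python
import numpy as np


def dsm_loss(theta, x, x_noisy, sigma):
    """Denoising score matching loss for the toy model f(x) = -theta * x**2 / 2.

    The model score is the gradient of f, i.e. -theta * x, evaluated at the
    noisy point. The target is the score of the Gaussian noise kernel.
    """
    model_score = -theta * x_noisy
    target = (x - x_noisy) / sigma**2
    return np.mean((model_score - target) ** 2)


def fit_theta(x, x_noisy, sigma):
    """Closed-form minimizer of dsm_loss over theta (one-parameter least squares)."""
    target = (x - x_noisy) / sigma**2
    return -np.sum(x_noisy * target) / np.sum(x_noisy**2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sigma = 0.5
    x = rng.standard_normal(200_000)                      # toy data, N(0, 1)
    x_noisy = x + sigma * rng.standard_normal(x.shape)    # diffused data
    theta = fit_theta(x, x_noisy, sigma)
    # The noisy marginal is N(0, 1 + sigma**2), whose score is
    # -x / (1 + sigma**2), so theta should be close to 1 / (1 + sigma**2).
    print(theta, 1.0 / (1.0 + sigma**2))
```

With a neural energy function the minimizer is no longer available in closed form, and the same loss would be minimized by gradient descent, which requires differentiating through the gradient of the energy.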

