LEARNING ENERGY-BASED MODELS BY DIFFUSION RECOVERY LIKELIHOOD

Abstract

While energy-based models (EBMs) exhibit a number of desirable properties, training and sampling on high-dimensional datasets remain challenging. Inspired by recent progress on diffusion probabilistic models, we present a diffusion recovery likelihood method to tractably learn and sample from a sequence of EBMs trained on increasingly noisy versions of a dataset. Each EBM is trained with recovery likelihood, which maximizes the conditional probability of the data at a certain noise level given their noisy versions at a higher noise level. Optimizing recovery likelihood is more tractable than optimizing marginal likelihood, as sampling from the conditional distributions is much easier than sampling from the marginal distributions. After training, synthesized images can be generated by a sampling process that initializes from the Gaussian white noise distribution and progressively samples the conditional distributions at successively lower noise levels. Our method generates high-fidelity samples on various image datasets. On unconditional CIFAR-10, our method achieves an FID of 9.58 and an inception score of 8.30, superior to the majority of GANs. Moreover, we demonstrate that, unlike in previous work on EBMs, our long-run MCMC samples from the conditional distributions do not diverge and still represent realistic images, allowing us to accurately estimate the normalized density of data even for high-dimensional datasets. Our implementation is available at https://github.com/ruiqigao/recovery_likelihood.
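For reference, under a Gaussian noising step the conditional distribution that recovery likelihood maximizes can be written in the following form. This is a sketch in assumed notation (later sections make the method precise): $f_\theta$ denotes the learned negative energy, $\tilde{x} = x + \sigma\epsilon$ the noisy observation, and $Z_\theta(\tilde{x})$ the normalizing constant.

```latex
% Assumed notation: f_\theta is the learned negative energy,
% \tilde{x} = x + \sigma\epsilon the Gaussian-noised observation,
% and Z_\theta(\tilde{x}) the normalizing constant.
p_\theta(x \mid \tilde{x})
  = \frac{1}{Z_\theta(\tilde{x})}
    \exp\!\left( f_\theta(x) - \frac{1}{2\sigma^2}\,\lVert \tilde{x} - x \rVert^2 \right)
```

The quadratic term anchors $x$ near $\tilde{x}$, so for a modest noise gap the conditional is concentrated and close to unimodal; this is the sense in which sampling the conditionals is much easier than sampling the marginal $p_\theta(x) \propto \exp(f_\theta(x))$.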

1. INTRODUCTION

EBMs (LeCun et al., 2006; Ngiam et al., 2011; Kim & Bengio, 2016; Zhao et al., 2016; Goyal et al., 2017; Xie et al., 2016b; Finn et al., 2016; Gao et al., 2018; Kumar et al., 2019; Nijkamp et al., 2019b; Du & Mordatch, 2019; Grathwohl et al., 2019; Desjardins et al., 2011; Gao et al., 2020; Che et al., 2020; Grathwohl et al., 2020; Qiu et al., 2019; Rhodes et al., 2020) are an appealing class of probabilistic models, which can be viewed as generative versions of discriminators (Jin et al., 2017; Lazarow et al., 2017; Lee et al., 2018; Grathwohl et al., 2020), yet can be learned from unlabeled data. Despite a number of desirable properties, two challenges remain for training EBMs on high-dimensional datasets. First, learning EBMs by maximum likelihood requires Markov Chain Monte Carlo (MCMC) to generate samples from the model, which can be extremely expensive. Second, as pointed out in Nijkamp et al. (2019a), the energy potentials learned with non-convergent MCMC do not have a valid steady state, in the sense that samples from long-run Markov chains can differ greatly from observed samples, making it difficult to evaluate the learned energy potentials.

Another line of work, originating from Sohl-Dickstein et al. (2015), is to learn from a diffused version of the data, which is obtained from the original data via a diffusion process that sequentially adds Gaussian white noise. From such diffused data, one can learn a conditional model of the data at a certain noise level given their noisy versions at the higher noise level of the diffusion process. After learning the sequence of conditional models that invert the diffusion process, one can then generate synthesized images from Gaussian white noise images by ancestral sampling; a minimal sketch of such a sampling loop follows. Building on
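The sketch below illustrates the sampling scheme just described: Langevin dynamics targeting the conditional $p(x \mid \tilde{x}) \propto \exp(f(x) - \lVert \tilde{x} - x \rVert^2 / (2\sigma^2))$, starting from Gaussian white noise and sweeping through decreasing noise levels. It is a minimal illustration under placeholder assumptions, not the paper's implementation: the quadratic toy energy, the noise schedule, and the step sizes are all stand-ins, and the method proper learns a separate neural-network energy per noise level.

```python
import numpy as np

# Toy stand-in for a learned (negative) energy: f(x) = -||x||^2 / 2, i.e. a
# standard Gaussian. In the method proper, f would be a neural network and
# each noise level would have its own learned energy function.
def grad_f(x):
    return -x

def conditional_langevin(x_tilde, sigma, n_steps=30, step_size=0.01, rng=None):
    """Langevin sampling of p(x | x_tilde) ∝ exp(f(x) - ||x_tilde - x||^2 / (2 sigma^2)).

    The quadratic coupling to x_tilde keeps the conditional concentrated around
    the noisy observation, which is what makes it much easier to sample than
    the marginal p(x) ∝ exp(f(x)).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = x_tilde.copy()  # initialize the chain at the noisy observation
    for _ in range(n_steps):
        grad_log_p = grad_f(x) + (x_tilde - x) / sigma**2
        x = (x + 0.5 * step_size * grad_log_p
               + np.sqrt(step_size) * rng.standard_normal(x.shape))
    return x

# Progressive sampling: start from Gaussian white noise, then sample the
# conditionals at successively lower noise levels (largest sigma first).
rng = np.random.default_rng(0)
sigmas = [1.0, 0.5, 0.25, 0.1]   # illustrative schedule, not the paper's
x = rng.standard_normal(16)      # a toy 16-dimensional "image"
for sigma in sigmas:
    x = conditional_langevin(x, sigma, rng=rng)
print(x)
```

Note the design point the abstract emphasizes: each inner chain only has to traverse a distribution pinned near its initialization $\tilde{x}$, so short runs suffice at every level, whereas a single chain on the marginal would have to mix across all modes of $f$.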

