VAEBM: A SYMBIOSIS BETWEEN VARIATIONAL AUTOENCODERS AND ENERGY-BASED MODELS

Abstract

Energy-based models (EBMs) have recently been successful in representing complex distributions of small images. However, sampling from them requires expensive Markov chain Monte Carlo (MCMC) iterations that mix slowly in high-dimensional pixel space. Unlike EBMs, variational autoencoders (VAEs) generate samples quickly and are equipped with a latent space that enables fast traversal of the data manifold. However, VAEs tend to assign high probability density to regions in data space outside the actual data distribution and often fail at generating sharp images. In this paper, we propose VAEBM, a symbiotic composition of a VAE and an EBM that offers the best of both worlds. VAEBM captures the overall mode structure of the data distribution using a state-of-the-art VAE, and it relies on its EBM component to explicitly exclude non-data-like regions from the model and refine the image samples. Moreover, the VAE component in VAEBM allows us to speed up MCMC updates by reparameterizing them in the VAE's latent space. Our experimental results show that VAEBM outperforms state-of-the-art VAEs and EBMs in generative quality on several benchmark image datasets by a large margin. It can generate high-quality images as large as 256×256 pixels with short MCMC chains. We also demonstrate that VAEBM provides complete mode coverage and performs well in out-of-distribution detection.

1. INTRODUCTION

Deep generative learning is a central problem in machine learning. It has found diverse applications, ranging from image (Brock et al., 2018; Karras et al., 2019; Razavi et al., 2019), music (Dhariwal et al., 2020) and speech (Ping et al., 2020; Oord et al., 2016a) generation, distribution alignment across domains (Zhu et al., 2017; Liu et al., 2017; Tzeng et al., 2017), and semi-supervised learning (Kingma et al., 2014; Izmailov et al., 2020) to 3D point cloud generation (Yang et al., 2019), light-transport simulation (Müller et al., 2019), molecular modeling (Sanchez-Lengeling & Aspuru-Guzik, 2018; Noé et al., 2019), and equivariant sampling in theoretical physics (Kanwar et al., 2020). Among competing frameworks, likelihood-based models include variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014), normalizing flows (Rezende & Mohamed, 2015; Dinh et al., 2016), autoregressive models (Oord et al., 2016b), and energy-based models (EBMs) (Lecun et al., 2006; Salakhutdinov et al., 2007). These models are trained by maximizing the data likelihood under the model, and unlike generative adversarial networks (GANs) (Goodfellow et al., 2014), their training is usually stable and they cover the modes of the data distribution more faithfully by construction.

Among likelihood-based models, EBMs model the unnormalized data density by assigning low energy to high-probability regions in the data space (Xie et al., 2016; Du & Mordatch, 2019). EBMs are appealing because they require almost no restrictions on network architectures (unlike normalizing flows) and are therefore potentially very expressive. They also exhibit better robustness and out-of-distribution generalization (Du & Mordatch, 2019) because, during training, areas with high probability under the model but low probability under the data distribution are penalized explicitly.
However, training and sampling from EBMs usually require MCMC, which can suffer from slow mode mixing and is computationally expensive when neural networks represent the energy function. On the other hand, VAEs are computationally more efficient for sampling than EBMs, as they do not require running expensive MCMC steps. VAEs also do not suffer from the expressivity limitations that normalizing flows face (Dupont et al., 2019; Kong & Chaudhuri, 2020), and in fact, they have recently shown state-of-the-art generative results among non-autoregressive likelihood-based models (Vahdat & Kautz, 2020). Moreover, VAEs naturally come with a latent embedding of data that allows fast traversal of the data manifold by moving in the latent space and mapping the movements to the data space. However, VAEs tend to assign high probability to regions with low density under the data distribution. This often results in blurry or corrupted samples generated by VAEs, and it also explains why VAEs often fail at out-of-distribution detection (Nalisnick et al., 2019).

In this paper, we propose a novel generative model as a symbiotic composition of a VAE and an EBM (VAEBM) that combines the best of both. VAEBM defines the generative distribution as the product of a VAE generator and an EBM component defined in pixel space. Intuitively, the VAE captures the majority of the mode structure in the data distribution. However, it may still generate samples from low-probability regions in the data space. Thus, the energy function focuses on refining the details and reducing the likelihood of non-data-like regions, which leads to significantly improved samples. Moreover, we show that training VAEBM by maximizing the data likelihood easily decomposes into training the VAE and the EBM component separately. The VAE is trained using the reparameterization trick, while the EBM component requires sampling from the joint energy-based model during training.
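The product form described above can be illustrated with a deliberately simple 1-D toy. All functions below are hypothetical closed-form stand-ins (the actual VAEBM uses a deep VAE and a deep energy network); the sketch only shows how the EBM factor exp(-E) suppresses the spurious mass of a too-broad VAE density:

```python
import numpy as np

# Hypothetical 1-D stand-ins for the model components; illustrative only.
def log_p_vae(x):
    # A broad standard Gaussian: it covers the "data" but also assigns
    # mass to non-data-like regions (here, around x = 0).
    return -0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)

def energy(x):
    # Toy EBM component: low energy near the "data" at |x| = 1, high
    # energy elsewhere, so exp(-E) carves out the VAE's spurious mass.
    return (x**2 - 1.0) ** 2

def log_p_vaebm_unnorm(x):
    # VAEBM density up to the intractable log-partition constant:
    # log p(x) = log p_VAE(x) - E(x) - log Z.
    return log_p_vae(x) - energy(x)
```

In this toy, the unnormalized VAEBM density ranks the data-like point x = 1 above the spurious region at x = 0, whereas the VAE alone ranks them the other way around.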
We show that we can sidestep the difficulties of sampling from VAEBM by reparameterizing the MCMC updates using the VAE's latent variables. This allows MCMC chains to quickly traverse the model distribution and speeds up mixing. As a result, we only need to run short chains to obtain approximate samples from the model, accelerating both training and sampling at test time. Experimental results show that our model outperforms previous EBMs and state-of-the-art VAEs on image generation benchmarks including CIFAR-10, CelebA 64, LSUN Church 64, and CelebA HQ 256 by a large margin, reducing the gap with GANs. We also show that our model covers the modes in the data distribution faithfully, while exhibiting fewer spurious modes for out-of-distribution data. To the best of our knowledge, VAEBM is the first successful EBM applied to large images. In summary, this paper makes the following contributions: i) We propose a new generative model using the product of a VAE generator and an EBM defined in the data space. ii) We show how training this model can be decomposed into training the VAE first, and then training the EBM component. iii) We show how MCMC sampling from VAEBM can be pushed to the VAE's latent space, accelerating sampling. iv) We demonstrate state-of-the-art image synthesis quality among likelihood-based models, confirm complete mode coverage, and show strong out-of-distribution detection performance.
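The latent-space reparameterization can be sketched on a toy problem. Here a hypothetical linear "generator" stands in for the VAE decoder, the pixel-space energy gradient is set to zero so the latent target reduces to the standard-normal prior, and gradients are hand-coded via the chain rule (the real model differentiates through deep networks); all names are illustrative:

```python
import numpy as np

# Hypothetical linear latent-to-pixel map standing in for the VAE decoder.
W = np.array([[2.0, 0.0], [0.0, 0.5]])
g = lambda z: z @ W.T

def grad_latent_energy(z, grad_E_pixel):
    # Chain rule for d/dz [ E(g(z)) + ||z||^2 / 2 ]; the quadratic term
    # is the standard-normal prior over latents.
    return grad_E_pixel(g(z)) @ W + z

def latent_langevin(z0, grad_E_pixel, step=0.05, n_steps=200, rng=None):
    # Langevin dynamics run in latent space instead of pixel space.
    rng = np.random.default_rng(0) if rng is None else rng
    z = z0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(z.shape)
        z = z - 0.5 * step * grad_latent_energy(z, grad_E_pixel) \
              + np.sqrt(step) * noise
    return g(z)  # map the refined latents back to pixel space

# With a zero pixel-space energy, the chain targets the latent prior
# N(0, I), so pixel samples inherit the generator's output scales.
zero_grad = lambda x: np.zeros_like(x)
x = latent_langevin(np.zeros((2000, 2)), zero_grad)
```

Because the chain moves in the smooth, unimodal latent space and only maps its state through the generator, it mixes much faster than a chain forced to cross low-density valleys directly in pixel space.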

2. BACKGROUND

Energy-based Models: An EBM assumes $p_\psi(x)$ to be a Gibbs distribution of the form $p_\psi(x) = \exp(-E_\psi(x))/Z_\psi$, where $E_\psi(x)$ is the energy function with parameters $\psi$ and $Z_\psi = \int \exp(-E_\psi(x))\,dx$ is the normalization constant. There is no restriction on the particular form of $E_\psi(x)$. Given a set of samples drawn from the data distribution $p_d(x)$, the goal of maximum likelihood learning is to maximize the log-likelihood $L(\psi) = \mathbb{E}_{x \sim p_d(x)}[\log p_\psi(x)]$, which has the derivative (Woodford, 2006):
$$\partial_\psi L(\psi) = \mathbb{E}_{x \sim p_d(x)}\left[-\partial_\psi E_\psi(x)\right] + \mathbb{E}_{x \sim p_\psi(x)}\left[\partial_\psi E_\psi(x)\right] \quad (1)$$
For the first expectation, the positive phase, samples are drawn from the data distribution $p_d(x)$, and for the second expectation, the negative phase, samples are drawn from the model $p_\psi(x)$ itself. However, sampling from $p_\psi(x)$ in the negative phase is itself intractable, and approximate samples are usually drawn using MCMC. A commonly used MCMC algorithm is Langevin dynamics (LD) (Neal, 1993). Given an initial sample $x_0$, Langevin dynamics iteratively updates it as:
$$x_{t+1} = x_t - \frac{\eta}{2} \nabla_x E_\psi(x_t) + \sqrt{\eta}\,\omega_t, \qquad \omega_t \sim \mathcal{N}(0, I), \quad (2)$$
where $\eta$ is the step size.¹ In practice, Eq. 2 is run for a finite number of iterations, which yields a Markov chain whose invariant distribution is approximately close to the original target distribution.
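As a minimal illustration, the Langevin update of Eq. 2 can be sketched in NumPy for a toy quadratic energy $E(x) = \|x\|^2/2$, whose Gibbs distribution is the standard normal. The gradient is hand-coded here; in practice it comes from backpropagation through the energy network:

```python
import numpy as np

def langevin_dynamics(grad_energy, x0, step_size=0.1, n_steps=500, rng=None):
    """Eq. 2: x_{t+1} = x_t - (eta/2) grad E(x_t) + sqrt(eta) * omega_t."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)  # omega_t ~ N(0, I)
        x = x - 0.5 * step_size * grad_energy(x) + np.sqrt(step_size) * noise
    return x

# Toy energy E(x) = ||x||^2 / 2, so grad E(x) = x and the Gibbs
# distribution exp(-E(x))/Z is N(0, I).
grad_E = lambda x: x
samples = langevin_dynamics(grad_E, np.zeros((5000, 2)))
```

For a finite step size $\eta$ the chain's stationary distribution is only approximately the target (here its variance is $1/(1 - \eta/4)$ rather than exactly 1), which is why an accept/reject correction would be needed in principle.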



¹ In principle, one would require an accept/reject step to make this a rigorous MCMC algorithm, but for sufficiently small step sizes this is not necessary in practice (Neal, 1993).

