NCP-VAE: VARIATIONAL AUTOENCODERS WITH NOISE CONTRASTIVE PRIORS

Abstract

Variational autoencoders (VAEs) are a powerful class of likelihood-based generative models with applications in various domains. However, they struggle to generate high-quality images, especially when samples are obtained from the prior without any tempering. One explanation for VAEs' poor generative quality is the prior hole problem: the prior distribution fails to match the aggregate approximate posterior. Due to this mismatch, there exist areas in the latent space with high density under the prior that do not correspond to any encoded image, and samples from those areas are decoded to corrupted images. To tackle this issue, we propose an energy-based prior defined by the product of a base prior distribution and a reweighting factor, designed to bring the base prior closer to the aggregate posterior. We train the reweighting factor by noise contrastive estimation, and we generalize this approach to hierarchical VAEs with many latent variable groups. Our experiments confirm that the proposed noise contrastive priors improve the generative performance of state-of-the-art VAEs by a large margin on the MNIST, CIFAR-10, CelebA 64, and CelebA HQ 256 datasets.

1. INTRODUCTION

[Figure 1: We propose an EBM prior using the product of a base prior p(z) and a reweighting factor r(z), designed to bring the base prior closer to the aggregate posterior q(z).]

Variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) are a powerful class of likelihood-based generative models with applications in image generation (Brock et al., 2018; Karras et al., 2019; Razavi et al., 2019), music synthesis (Dhariwal et al., 2020), speech generation (Oord et al., 2016; Ping et al., 2020), image captioning (Aneja et al., 2019; Deshpande et al., 2019; Aneja et al., 2018), semi-supervised learning (Kingma et al., 2014; Izmailov et al., 2020), and representation learning (Van Den Oord et al., 2017; Fortuin et al., 2018). Although there has been tremendous progress in improving the expressivity of the approximate posterior, several studies have observed that VAE priors fail to match the aggregate (approximate) posterior (Rosca et al., 2018; Hoffman & Johnson, 2016). This phenomenon is sometimes described as holes in the prior, referring to regions in the latent space that are not decoded to data-like samples. Such regions often have a high density under the prior but a low density under the aggregate approximate posterior.

The prior hole problem is commonly tackled by increasing the flexibility of the prior via hierarchical priors (Klushyn et al., 2019), autoregressive models (Gulrajani et al., 2016), mixtures of approximate posteriors (Tomczak & Welling, 2018), normalizing flows (Xu et al., 2019; Chen et al., 2016), resampled priors (Bauer & Mnih, 2019), and energy-based models (Pang et al., 2020; Vahdat et al., 2018a;b; 2020). Among them, energy-based models (EBMs) (Du & Mordatch, 2019; Pang et al., 2020) have shown promising results in learning expressive priors.
However, EBMs require running iterative MCMC steps during training, which is computationally expensive, especially when the energy function is represented by a neural network. Moreover, they scale poorly to hierarchical models in which an EBM is defined on each group of latent variables.

Our key insight in this paper is that training a VAE already brings a trainable prior as close as possible to the aggregate posterior. The remaining mismatch between the two can be reduced by simply reweighting the prior to re-adjust its likelihood in the areas where it disagrees with the aggregate posterior. To represent this reweighting mechanism, we formulate the prior as an EBM defined by the product of a reweighting factor and a base trainable prior, as shown in Fig. 1. We represent the reweighting factor using neural networks and the base prior using Normal distributions. Instead of expensive MCMC sampling, we use noise contrastive estimation (NCE) (Gutmann & Hyvärinen, 2010) to train the EBM prior. We show that NCE naturally trains the reweighting factor in our prior by learning a binary classifier that distinguishes samples from a target distribution (i.e., samples from the approximate posterior) from samples from a noise distribution (i.e., the base trainable prior). However, since NCE's success depends on how close the noise distribution is to the target distribution, we first train the VAE with the base prior to bring it close to the aggregate posterior, and only then train the EBM prior using NCE.

In this paper, we make the following contributions: i) We propose an EBM prior, termed noise contrastive prior (NCP), which is trained by contrasting samples from the aggregate posterior against samples from a base prior. NCPs are learned as a post-training mechanism that replaces the original prior with a more flexible one, and they can improve the generative performance of VAEs of any structure. ii) We show how NCPs are trained on hierarchical VAEs with many latent variable groups; training hierarchical NCPs scales easily to many groups, as the factor for each latent variable group is trained in parallel. iii) Finally, we demonstrate that NCPs improve the generative quality of VAEs by a large margin across datasets.
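The NCE training described above reduces to logistic regression in latent space. The following NumPy sketch (ours, not the paper's implementation; the paper's classifier is a neural network, and the Gaussian stand-in samples here are hypothetical) illustrates the mechanism: a binary classifier separates samples from the aggregate posterior (label 1) from samples drawn from the base prior (label 0), and its logit plays the role of the log reweighting factor log r(z).

```python
# Minimal NCE sketch: at the optimum, the classifier's logit approximates
# log r(z) = log q(z) - log p(z) up to a constant, so the EBM prior is
# proportional to r(z) * p(z).
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 2  # hypothetical latent dimensionality

# Stand-ins for samples; in a real VAE these would come from the trained
# encoder (aggregate posterior) and from the base prior, respectively.
z_q = rng.normal(loc=1.0, size=(512, latent_dim))  # "posterior" samples
z_p = rng.normal(loc=0.0, size=(512, latent_dim))  # base-prior samples

# Logistic regression in place of the paper's neural-net classifier;
# the NCE (binary cross-entropy) objective is the same.
w = np.zeros(latent_dim)
b = 0.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(200):  # plain gradient descent on the BCE loss
    z = np.concatenate([z_q, z_p])
    y = np.concatenate([np.ones(len(z_q)), np.zeros(len(z_p))])
    grad_logit = (sigmoid(z @ w + b) - y) / len(z)
    w -= 1.0 * (z.T @ grad_logit)
    b -= 1.0 * grad_logit.sum()

def log_r(z):
    """Log reweighting factor: log r(z) ~= log q(z) - log p(z)."""
    return z @ w + b
```

After training, log_r is larger in regions favored by the aggregate posterior than in regions covered only by the base prior, which is exactly the re-adjustment the EBM prior needs.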

2. BACKGROUND

We first review VAEs, their extension to hierarchical VAEs, and the prior hole problem.

Variational Autoencoders: VAEs learn a generative distribution $p(x, z) = p(z)\,p(x|z)$, where $p(z)$ is a prior distribution over the latent variable $z$ and $p(x|z)$ is a likelihood function that generates the data $x$ given $z$. VAEs are trained by maximizing a variational lower bound on the log-likelihood $\log p(x)$:

$\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - \mathrm{KL}(q(z|x)\,\|\,p(z)) =: \mathcal{L}_{\mathrm{VAE}}(x), \quad (1)$

where $q(z|x)$ is an approximate posterior and KL is the Kullback-Leibler divergence. The final training objective is $\mathbb{E}_{p_d(x)}[\mathcal{L}_{\mathrm{VAE}}(x)]$, where $p_d(x)$ is the data distribution (Kingma & Welling, 2014).

Hierarchical VAEs (HVAEs): To increase the expressivity of both the prior and the approximate posterior, earlier work adopted a hierarchical latent variable structure (Vahdat & Kautz, 2020; Kingma et al., 2016; Sønderby et al., 2016; Gregor et al., 2016). In HVAEs, the latent variable $z$ is divided into $K$ separate groups, $z = \{z_1, \ldots, z_K\}$. The approximate posterior and the prior are then defined by $q(z|x) = \prod_{k=1}^{K} q(z_k | z_{<k}, x)$ and $p(z) = \prod_{k=1}^{K} p(z_k | z_{<k})$. Using these, the training objective becomes

$\mathcal{L}_{\mathrm{HVAE}}(x) := \mathbb{E}_{q(z|x)}[\log p(x|z)] - \sum_{k=1}^{K} \mathbb{E}_{q(z_{<k}|x)}\big[\mathrm{KL}(q(z_k | z_{<k}, x)\,\|\,p(z_k | z_{<k}))\big],$

where $q(z_{<k}|x) = \prod_{i=1}^{k-1} q(z_i | z_{<i}, x)$ is the approximate posterior up to the $(k-1)$-th group.¹

The Prior Hole Problem: Let $q(z) := \mathbb{E}_{p_d(x)}[q(z|x)]$ denote the aggregate (approximate) posterior. In Appendix B.1, we show that maximizing $\mathbb{E}_{p_d(x)}[\mathcal{L}_{\mathrm{VAE}}(x)]$ with respect to the prior parameters corresponds to bringing the prior as close as possible to the aggregate posterior by minimizing $\mathrm{KL}(q(z)\,\|\,p(z))$ with respect to $p(z)$. Formally, the prior hole problem refers to the phenomenon that $p(z)$ fails to match $q(z)$.

¹ For $k = 1$, the expectation inside the summation simplifies to $\mathrm{KL}(q(z_1|x)\,\|\,p(z_1))$.
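To make the variational bound concrete, the following NumPy sketch (ours, with hypothetical numbers) evaluates a single-sample estimate of $\mathcal{L}_{\mathrm{VAE}}(x)$ for the common case of a factorized Gaussian posterior and a standard-Normal prior, where the KL term has a closed form; the reconstruction term $\log p(x|z)$ is passed in as an assumed Monte Carlo estimate.

```python
import numpy as np

def kl_to_std_normal(mu, log_var):
    """Closed-form KL(q(z|x) || p(z)) for q(z|x) = N(mu, diag(exp(log_var)))
    and the standard-Normal prior p(z) = N(0, I)."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def elbo(mu, log_var, log_px_given_z):
    """L_VAE(x) = E_q[log p(x|z)] - KL(q(z|x) || p(z)), with the
    reconstruction expectation replaced by a Monte Carlo estimate."""
    return log_px_given_z - kl_to_std_normal(mu, log_var)

# Hypothetical numbers: a 4-dim latent and one reconstruction estimate.
mu = np.array([0.1, -0.2, 0.0, 0.3])
log_var = np.array([-0.1, 0.0, 0.05, -0.2])
print(elbo(mu, log_var, log_px_given_z=-95.3))
```

When $q(z|x)$ equals the prior, the KL term vanishes and the bound reduces to the reconstruction term alone; maximizing the bound trades reconstruction quality against keeping the posterior close to the prior.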

