NCP-VAE: VARIATIONAL AUTOENCODERS WITH NOISE CONTRASTIVE PRIORS

Abstract

Variational autoencoders (VAEs) are a powerful class of likelihood-based generative models with applications in various domains. However, they struggle to generate high-quality images, especially when samples are obtained from the prior without any tempering. One explanation for VAEs' poor generative quality is the prior hole problem: the prior distribution fails to match the aggregate approximate posterior. Due to this mismatch, there exist areas in the latent space with high density under the prior that do not correspond to any encoded image. Samples from those areas are decoded to corrupted images. To tackle this issue, we propose an energy-based prior defined by the product of a base prior distribution and a reweighting factor, designed to bring the base prior closer to the aggregate posterior. We train the reweighting factor by noise contrastive estimation, and we generalize it to hierarchical VAEs with many latent variable groups. Our experiments confirm that the proposed noise contrastive priors improve the generative performance of state-of-the-art VAEs by a large margin on the MNIST, CIFAR-10, CelebA 64, and CelebA HQ 256 datasets.
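The key step above, training a reweighting factor by noise contrastive estimation, reduces density-ratio estimation to binary classification: a classifier trained to distinguish samples from the aggregate posterior q(z) from samples of the base prior p(z) recovers, at its optimum, the log-ratio log q(z)/p(z), which serves as the log of the reweighting factor. The following is a minimal toy sketch of this reduction, not the paper's implementation: the two 1-D Gaussians standing in for p(z) and q(z), the logistic-regression model, and all hyperparameters are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumed for illustration): base prior p(z) = N(0, 1),
# aggregate posterior q(z) = N(1, 1).
z_p = rng.normal(0.0, 1.0, size=20000)  # "noise" samples from the prior
z_q = rng.normal(1.0, 1.0, size=20000)  # "data" samples from q(z)

# NCE as binary classification: label 1 for q-samples, 0 for p-samples.
# For two equal-variance Gaussians the true log-ratio is linear in z,
# so logistic regression with logit w*z + b can represent it exactly.
X = np.concatenate([z_q, z_p])
y = np.concatenate([np.ones_like(z_q), np.zeros_like(z_p)])

w, b, lr = 0.0, 0.0, 0.1
for _ in range(3000):  # full-batch gradient descent on the logistic loss
    pred = 1.0 / (1.0 + np.exp(-(w * X + b)))
    grad = pred - y
    w -= lr * np.mean(grad * X)
    b -= lr * np.mean(grad)

# The classifier logit estimates log r(z) = log q(z)/p(z).
# Analytically, log q(z)/p(z) = z - 0.5 here, so w ~ 1.0 and b ~ -0.5.
print(w, b)
```

In the paper's setting the classifier is a neural network over latent codes rather than a linear model, but the same logit-equals-log-ratio identity is what makes the learned factor a valid reweighting term for the product prior.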

1. INTRODUCTION

Figure 1: We propose an EBM prior using the product of a base prior p(z) and a reweighting factor r(z), designed to bring the base prior closer to the aggregate posterior q(z).

Variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) are a powerful class of likelihood-based generative models with applications in image generation (Brock et al., 2018; Karras et al., 2019; Razavi et al., 2019), music synthesis (Dhariwal et al., 2020), speech generation (Oord et al., 2016; Ping et al., 2020), image captioning (Aneja et al., 2019; Deshpande et al., 2019; Aneja et al., 2018), semi-supervised learning (Kingma et al., 2014; Izmailov et al., 2020), and representation learning (Van Den Oord et al., 2017; Fortuin et al., 2018).

Although there has been tremendous progress in improving the expressivity of the approximate posterior, several studies have observed that VAE priors fail to match the aggregate (approximate) posterior (Rosca et al., 2018; Hoffman & Johnson, 2016). This phenomenon is sometimes described as holes in the prior, referring to regions in the latent space that are not decoded to data-like samples. Such regions often have a high density under the prior but a low density under the aggregate approximate posterior.

The prior hole problem is commonly tackled by increasing the flexibility of the prior via hierarchical priors (Klushyn et al., 2019), autoregressive models (Gulrajani et al., 2016), a mixture of approximate posteriors (Tomczak & Welling, 2018), normalizing flows (Xu et al., 2019; Chen et al., 2016), resampled priors (Bauer & Mnih, 2019), and energy-based models (Pang et al., 2020; Vahdat et al., 2018b;a; 2020). Among them, energy-based models (EBMs) (Du & Mordatch, 2019; Pang et al., 2020) have shown promising results in learning expressive priors.
However, they require running iterative MCMC steps during training, which is computationally expensive, especially when the energy function is represented by a

