VERY DEEP VAES GENERALIZE AUTOREGRESSIVE MODELS AND CAN OUTPERFORM THEM ON IMAGES

Abstract

We present a hierarchical VAE that, for the first time, generates samples quickly while outperforming the PixelCNN in log-likelihood on all natural image benchmarks. We begin by observing that, in theory, VAEs can represent autoregressive models, as well as faster, better models if they exist, when made sufficiently deep. Despite this, autoregressive models have historically outperformed VAEs in log-likelihood. We test whether insufficient depth explains this gap by scaling a VAE to greater stochastic depth than previously explored and evaluating it on CIFAR-10, ImageNet, and FFHQ. Compared to the PixelCNN, these very deep VAEs achieve higher likelihoods, use fewer parameters, generate samples thousands of times faster, and are more easily applied to high-resolution images. Qualitative studies suggest this is because the VAE learns efficient hierarchical visual representations. We release our source code and models at https://github.com/openai/vdvae.



Figure 1: Selected samples from our very deep VAE on FFHQ-256, and a demonstration of its learned generative process. VAEs can learn to first generate global features at low resolution, then fill in local details in parallel at higher resolutions. When made sufficiently deep, this learned, parallel, multiscale generative procedure attains a higher log-likelihood than the PixelCNN.

1. INTRODUCTION

One potential path to increased data-efficiency, generalization, and robustness of machine learning methods is to train generative models. These models can learn useful representations without human supervision by learning to create examples of the data itself. Many types of generative models have flourished in recent years, including likelihood-based generative models, which include autoregressive models (Uria et al., 2013), variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014), and invertible flows (Dinh et al., 2014; 2016). Their objective, the negative log-likelihood, is equivalent to the KL divergence between the data distribution and the model distribution. A wide variety of models can be compared and assessed by this criterion, which measures how well they fit the data in an information-theoretic sense.

Starting with the PixelCNN (Van den Oord et al., 2016), autoregressive models have long achieved the highest log-likelihoods across many modalities, despite counterintuitive modeling assumptions. For example, although natural images are observations of latent scenes, autoregressive models learn dependencies solely between observed variables. That process can require complex function approximators that integrate long-range dependencies (Oord et al., 2016; Child et al., 2019). In contrast, VAEs and invertible flows incorporate latent variables and can thus, in principle, learn a simpler model that mirrors how images are actually generated. Despite this theoretical advantage, on the landmark ImageNet density estimation benchmark, the Gated PixelCNN still achieves higher likelihoods than all flows and VAEs, corresponding to a better fit with the data. Is the autoregressive modeling assumption actually a better inductive bias for images, or can VAEs, sufficiently improved, outperform autoregressive models?
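The equivalence between the likelihood objective and the KL divergence is a standard identity, restated here for concreteness: the expected negative log-likelihood under the data distribution differs from the KL divergence only by the entropy of the data, which is fixed, so minimizing one minimizes the other:

```latex
\mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[-\log p_\theta(x)\bigr]
  \;=\; \mathrm{KL}\!\left(p_{\mathrm{data}} \,\|\, p_\theta\right)
  \;+\; H(p_{\mathrm{data}})
```

Since $H(p_{\mathrm{data}})$ does not depend on the model parameters $\theta$, comparing models by held-out log-likelihood is equivalent to comparing them by KL divergence to the data distribution.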
The answer has significant practical stakes, because large, compute-intensive autoregressive models (Strubell et al., 2019) are increasingly used for a variety of applications (Oord et al., 2016; Brown et al., 2020; Dhariwal et al., 2020; Chen et al., 2020). Unlike autoregressive models, latent variable models need only learn dependencies between latent and observed variables; such models can not only support faster synthesis and higher-dimensional data, but may also do so using smaller, less powerful architectures. We start this work with a simple but (to the best of our knowledge) unstated observation: hierarchical VAEs should be able to at least match autoregressive models, because autoregressive models are equivalent to VAEs with a powerful prior and a restricted approximate posterior (one that merely outputs the observed variables). In the worst case, VAEs should be able to replicate the functionality of autoregressive models; in the best case, they should be able to learn better latent representations, possibly with far fewer layers, if such representations exist. We formalize this observation in Section 3, showing it holds only for VAEs with more stochastic layers than previous work has explored. We then test it experimentally on competitive natural image benchmarks. Our contributions are the following:

• We provide theoretical justification for why greater depth (up to the data dimension D, but possibly as low as some value K ≪ D) could improve VAE performance (Section 3)
• We introduce an architecture capable of scaling past 70 stochastic layers, where previous work explored at most 30 (Section 4)
• We verify that depth, independent of model capacity, improves log-likelihood, and allows VAEs to outperform the PixelCNN on all benchmarks (Section 5.1)
• Compared to the PixelCNN, we show the model also uses fewer parameters, generates samples thousands of times more quickly, and can be scaled to larger images. We show evidence these qualities may emerge from the model learning an efficient hierarchical representation of images (Section 5.2)
• We release code and models at https://github.com/openai/vdvae.
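The observation that autoregressive models are a special case of VAEs can be sketched as follows; this is a simplified, discrete-data version of the argument, not the formal statement of Section 3. Take one latent $z_i$ per observed dimension $x_i$, a deterministic posterior that copies the data, $q(z \mid x) = \mathbb{1}[z = x]$, an autoregressive prior $p_\theta(z) = \prod_i p_\theta(z_i \mid z_{<i})$, and a decoder that copies the latents back, $p_\theta(x \mid z) = \mathbb{1}[x = z]$. The reconstruction term is then zero and the KL term reduces to the negative log-prior evaluated at $z = x$, so the ELBO is tight and equals the autoregressive log-likelihood:

```latex
\log p_\theta(x)
  \;\ge\; \underbrace{\mathbb{E}_{q(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]}_{=\,0}
  \;-\; \underbrace{\mathrm{KL}\!\left(q(z \mid x)\,\|\,p_\theta(z)\right)}_{=\,-\log p_\theta(z = x)}
  \;=\; \sum_{i=1}^{D} \log p_\theta\!\left(z_i = x_i \mid z_{<i} = x_{<i}\right)
```

Under this construction the VAE computes exactly the PixelCNN objective, which is why matching autoregressive models requires a number of stochastic layers that can grow with the data dimension $D$.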

2. PRELIMINARIES

We review prior work and introduce some of the basic terminology used in the field.

