VARIANCE REDUCTION IN HIERARCHICAL VARIATIONAL AUTOENCODERS

Abstract

Variational autoencoders with deep hierarchies of stochastic layers are known to suffer from posterior collapse, where the top layers fall back to the prior and become independent of the input. We observe that the hierarchical VAE objective explicitly depends on the variance of the functions parameterizing the mean and variance of the latent Gaussian distributions, and that these parameterizing functions are themselves often high variance. Building on this, we generalize VAE neural networks by incorporating a smoothing parameter, motivated by Gaussian analysis, that reduces higher-frequency components and consequently the variance of the parameterizing functions, and we show that this can help to solve the problem of posterior collapse. We further show that under such smoothing the VAE loss exhibits a phase transition, where the top-layer KL divergence sharply drops to zero at a critical value of the smoothing parameter that is similar for the same model across datasets. We validate the phenomenon across model configurations and datasets.

1. INTRODUCTION

Variational autoencoders (VAEs) [10] are a popular latent variable model for unsupervised learning that simplifies learning through the introduction of a learned approximate posterior. Given data x and latent variables z, we specify the conditional distribution p(x|z) by parameterizing its distribution parameters with a neural network. Since it is difficult to learn such a model directly, another conditional distribution q(z|x) is introduced to approximate the posterior distribution. During learning the goal is to maximize the evidence lower bound (ELBO), which lower bounds the log likelihood, log p(x) ≥ E_{q(z|x)}[log p(x|z) + log p(z) − log q(z|x)]. In their simplest form, the generative model p(x|z) and the approximate posterior q(z|x) are Gaussian distributions optimized jointly. A natural way to increase the modeling capacity of a VAE is to incorporate a hierarchy of stochastic variables. Such models, however, turn out to be difficult to train, and higher levels in the hierarchy tend to remain independent of the input data, a problem termed posterior collapse. Posterior collapse in VAEs manifests itself as the latent distribution falling back to the prior. With hierarchical VAEs the effect is found to be more pronounced in the top layers, farther from the output. For the purpose of this paper and for clarity of exposition, we focus on the simplest extension of hierarchical variational autoencoders, where stochastic layers are stacked serially on top of each other [2, 21]: p(x, z) = p(x|z_1) p(z_L) ∏_{i=1}^{L−1} p(z_i|z_{i+1}) and q(z|x) = q(z_1|x) ∏_{i=1}^{L−1} q(z_{i+1}|z_i). The intermediate distributions in this model are commonly taken to be Gaussian distributions parameterized by neural networks, so that p(z_i|z_{i+1}) = N(z_i | µ(z_{i+1}), σ(z_{i+1})), where µ(z), σ(z) are neural networks computing the mean and variance of the Gaussian distribution. We refer to such models as vanilla hierarchical variational autoencoders.
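As an illustration, the serial factorization p(x, z) = p(x|z_1) p(z_L) ∏ p(z_i|z_{i+1}) can be sampled by ancestral sampling: draw z_L from the standard-normal prior, then sample each z_i from its conditional Gaussian down the hierarchy. The sketch below is not the paper's implementation; random linear maps stand in for the µ(z), σ(z) networks, and all names and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_params(z, W_mu, W_logvar):
    # Toy stand-ins for the mu(z), sigma(z) neural networks of the
    # intermediate Gaussians: plain linear maps for illustration.
    return z @ W_mu, z @ W_logvar

def ancestral_sample(L, dim, rng):
    # Sample z_L ~ p(z_L) = N(0, I), then z_i ~ p(z_i | z_{i+1})
    # for i = L-1, ..., 1, matching the serial factorization
    # p(x, z) = p(x|z_1) p(z_L) prod_{i=1}^{L-1} p(z_i|z_{i+1}).
    z = rng.standard_normal(dim)  # top layer z_L from the prior
    for _ in range(L - 1):
        W_mu = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        W_logvar = np.zeros((dim, dim))  # unit variance, for simplicity
        mu, logvar = gaussian_params(z, W_mu, W_logvar)
        # Reparameterized draw z_i = mu + sigma * eps, eps ~ N(0, I)
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(dim)
    return z

z1 = ancestral_sample(L=3, dim=4, rng=rng)  # bottom latent z_1
print(z1.shape)  # (4,)
```

The inference path q(z|x) = q(z_1|x) ∏ q(z_{i+1}|z_i) runs the same loop in the opposite direction, conditioning each layer on the sample below it.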
For each stochastic layer in this model there is a corresponding KL divergence term in the objective, given by

E[KL(q(z_i|z_{i−1}) || p(z_i|z_{i+1}))].    (1)

As described later, expression (1) can be decomposed to show an explicit dependence on the variance of the parameterizing functions µ(z_i), σ(z_i) of the intermediate Gaussian distributions. We further show the KL divergence term to be closely related to the harmonics of the parameterizing function. For complex parameterizing functions the KL divergence term has large high-frequency components (and thus high variance), which leads to unstable training and causes posterior collapse.
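The dependence on the parameterizing functions can be made concrete with the closed-form KL between diagonal Gaussians, which contains a (µ_q − µ_p)²/σ_p² term: averaging it over samples of the conditioning variable picks up the variance of the mean function. The numerical sketch below uses hypothetical mean functions, chosen only to contrast a smooth parameterization with a high-frequency one; it is not the paper's decomposition:

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    # Closed-form KL(N(mu_q, var_q) || N(mu_p, var_p)) for diagonal
    # Gaussians; the (mu_q - mu_p)^2 / var_p term is where the
    # parameterizing mean functions enter the objective.
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

rng = np.random.default_rng(0)
z = rng.standard_normal(1000)  # samples of the conditioning variable

# Two hypothetical scalar mean functions for p(z_i | z_{i+1}):
# a smooth, low-variance one and a high-frequency one.
mu_smooth = 0.1 * z
mu_rough = np.sin(20.0 * z)  # large high-frequency component

# Expected KL against a fixed standard-normal q, estimated by averaging
# the closed-form KL over samples of the conditioning variable.
kl_smooth = np.mean([kl_diag_gauss(0.0, 0.0, m, 0.0) for m in mu_smooth])
kl_rough = np.mean([kl_diag_gauss(0.0, 0.0, m, 0.0) for m in mu_rough])
print(kl_rough > kl_smooth)  # True
```

Here both expected KLs reduce to 0.5·E[µ(z)²], so the high-frequency mean function inflates the KL term by roughly its variance, mirroring the decomposition argument above.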

