VARIANCE REDUCTION IN HIERARCHICAL VARIATIONAL AUTOENCODERS

Abstract

Variational autoencoders with deep hierarchies of stochastic layers are known to suffer from posterior collapse, where the top layers fall back to the prior and become independent of the input. We observe that the hierarchical VAE objective explicitly includes the variance of the functions parameterizing the mean and variance of the latent Gaussian distribution, and that these functions often have high variance themselves. Building on this, we generalize VAE neural networks by incorporating a smoothing parameter, motivated by Gaussian analysis, that attenuates higher-frequency components and consequently the variance of the parameterizing functions, and we show that this can help to solve the problem of posterior collapse. We further show that under such smoothing the VAE loss exhibits a phase transition: the top-layer KL divergence sharply drops to zero at a critical value of the smoothing parameter that is similar for the same model across datasets. We validate the phenomenon across model configurations and datasets.

1. INTRODUCTION

Variational autoencoders (VAEs) [10] are a popular latent variable model for unsupervised learning that simplify learning through the introduction of a learned approximate posterior. Given data x and latent variables z, we specify the conditional distribution p(x|z) by parameterizing its distribution parameters with a neural network. Since it is difficult to learn such a model directly, another conditional distribution q(z|x) is introduced to approximate the posterior distribution. During learning the goal is to maximize the evidence lower bound (ELBO), which lower bounds the log likelihood: log p(x) ≥ E_{q(z|x)}[log p(x|z) + log p(z) − log q(z|x)]. In their simplest form, the generative model p(x|z) and the approximate posterior q(z|x) are Gaussian distributions optimized in unison. A natural way to increase the modeling capacity of VAEs is to incorporate a hierarchy of stochastic variables. Such models, however, turn out to be difficult to train, and higher levels in the hierarchy tend to remain independent of the input data, a problem termed posterior collapse. Posterior collapse in VAEs manifests itself in the latent distribution falling back to the prior. In hierarchical VAEs the effect is more pronounced in the top layers, farther from the output. For the purposes of this paper and for clarity of exposition, we focus on the simplest extension of hierarchical variational autoencoders, where stochastic layers are stacked serially on top of each other [2, 21]: p(x, z) = p(x|z_1) p(z_L) ∏_{i=1}^{L−1} p(z_i|z_{i+1}) and q(z|x) = q(z_1|x) ∏_{i=1}^{L−1} q(z_{i+1}|z_i). The intermediate distributions in this model are commonly taken to be Gaussian distributions parameterized by neural networks, so that p(z_i|z_{i+1}) = N(z_i | μ(z_{i+1}), σ(z_{i+1})), where μ(z), σ(z) are neural networks computing the mean and variance of the Gaussian distribution. We refer to these models as vanilla hierarchical variational autoencoders.
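The serial hierarchy above can be made concrete with a small ancestral-sampling sketch: sample the top layer from the prior, then sample each lower layer from a Gaussian whose mean and variance are computed from the layer above. The affine maps standing in for the μ and σ networks are hypothetical placeholders, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
L, dim = 3, 4  # toy stochastic depth and latent width

# Hypothetical stand-ins for the neural networks mu(z), sigma(z):
# fixed random affine maps, so the sketch is self-contained.
W = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(L - 1)]

def mu(i, z):      # mean network for layer i
    return W[i] @ z

def sigma(i, z):   # variance network for layer i (softplus keeps it positive)
    return np.log1p(np.exp(W[i] @ z))

# Ancestral sampling through p(z) = p(z_L) * prod_{i=1}^{L-1} p(z_i | z_{i+1}):
z = rng.standard_normal(dim)          # z_L ~ N(0, I)
for i in reversed(range(L - 1)):      # z_i ~ N(mu(z_{i+1}), sigma(z_{i+1}))
    z = mu(i, z) + np.sqrt(sigma(i, z)) * rng.standard_normal(dim)
# z now holds a sample of z_1, which would parameterize p(x | z_1)
```

Note that σ here denotes the variance, matching the paper's notation, so its square root is used as the standard deviation when sampling.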
For each stochastic layer in this model there is a corresponding KL divergence term in the objective, given by E[KL(q(z_i|z_{i−1}) || p(z_i|z_{i+1}))]. (1) As described later, expression (1) can be decomposed to show an explicit dependence on the variance of the parameterizing functions μ(z_i), σ(z_i) of the intermediate Gaussian distribution. We further show the KL divergence term to be closely related to the harmonics of the parameterizing function. For complex parameterizing functions the KL divergence term has large high-frequency components (and thus high variance), which leads to unstable training and causes posterior collapse. Building on this, we propose a method for training the simplest hierarchical extension of the VAE that avoids the problem of posterior collapse without introducing further architectural complexity [13, 21]. Given a hierarchical variational autoencoder, our training method incorporates a smoothing parameter (denoted ρ) in the neural network functions used to parameterize the intermediate latent distributions. The smoothing is done such that expected values are preserved while the higher frequencies are attenuated and the variance is reduced. The gradients computed with the smoothed functions are then used to train the original hierarchical variational autoencoder. To construct the smoothing transformations for VAEs with Gaussian latent spaces, we use ideas from the analysis of Gaussian spaces: we analyze the stochastic functions in vanilla hierarchical VAEs as Hermite expansions on Gaussian spaces [9]. The Ornstein-Uhlenbeck (OU) semigroup from Gaussian analysis is a set of operators that, as we show, smoothly interpolates between a random variable and its expectation. The OU semigroup provides the appropriate set of smoothing operators, enabling us to control variance and avoid posterior collapse.
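The interpolation property of the OU semigroup can be seen in its standard form (T_ρ f)(z) = E_{z′∼N(0,I)}[f(ρz + √(1−ρ²) z′)]: at ρ = 1 it is the identity, at ρ = 0 it returns the constant E[f], and in between it scales the degree-k Hermite coefficient by ρ^k, so variance shrinks. A minimal Monte Carlo sketch (the test function f is an arbitrary illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def T(f, rho, z, n_samples=2000):
    """Monte Carlo estimate of the Ornstein-Uhlenbeck operator
    (T_rho f)(z) = E_{z' ~ N(0,1)}[ f(rho*z + sqrt(1-rho^2)*z') ]."""
    zp = rng.standard_normal(n_samples)
    return np.mean(f(rho * z + np.sqrt(1.0 - rho**2) * zp))

f = lambda z: z**3 - 3*z + np.cos(4*z)   # a wiggly, high-frequency test function

# Compare Var(f(z)) with Var((T_rho f)(z)) over z ~ N(0, 1):
zs = rng.standard_normal(2000)
var_f = np.var(f(zs))
var_Tf = np.var([T(f, 0.5, z) for z in zs])
# Smoothing scales the degree-k Hermite coefficient by rho^k,
# so the smoothed function has strictly smaller variance.
assert var_Tf < var_f
```

With ρ = 0.5 the degree-3 component of f is damped by a factor of 0.5³ in its coefficient, and the oscillatory cos(4z) part, concentrated on high Hermite degrees, is attenuated far more strongly, which is exactly the high-frequency suppression the method relies on.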
We further show that when the intermediate parameterizing functions μ(z), σ(z) are smoothed in the proposed manner, the KL divergence of the top layer undergoes a sudden, sharp drop toward zero as the amount of smoothing is decreased. This behaviour persists when we evaluate the KL divergence on the original unsmoothed variational autoencoder model. The behaviour is reminiscent of phase transitions in statistical mechanics, and we adopt the same terminology to describe the phenomenon. Our experiments suggest that the phenomenon is general across datasets and commonly used architectures. Furthermore, the critical value of the smoothing parameter ρ at which the transition occurs is fixed for a given model configuration and varies with stochastic depth and width. We make the following contributions. First, we establish a connection between higher harmonics, variance, posterior collapse, and phase transitions in hierarchical VAEs. Second, we show that applying the Ornstein-Uhlenbeck semigroup of operators to the generative stochastic functions in VAEs reduces higher frequencies and consequently variance, mitigating posterior collapse. We corroborate our findings experimentally and, on CIFAR-10, obtain likelihoods competitive with more complex architectural solutions alongside a reduction in model size. We refer to the proposed family of models as Hermite variational autoencoders (HVAE).
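For concreteness, the per-layer KL term between diagonal Gaussians has a standard closed form, which makes the collapse diagnostic explicit: a collapsed layer is one where this quantity is (near) zero, i.e. the posterior matches the prior. A short sketch, with illustrative values only:

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), standard closed form."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

d = 8
# A collapsed layer: q equals the prior N(0, I), so the KL term is exactly 0.
kl_collapsed = kl_diag_gauss(np.zeros(d), np.ones(d), np.zeros(d), np.ones(d))
assert np.isclose(kl_collapsed, 0.0)

# A non-collapsed layer: the KL grows with the mean/variance mismatch.
kl_active = kl_diag_gauss(np.full(d, 0.5), np.full(d, 2.0),
                          np.zeros(d), np.ones(d))
assert kl_active > 0.0
```

The phase transition described above is a sharp drop of exactly this quantity, evaluated for the top layer, as ρ crosses its critical value.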

2.1. ANALYSIS ON GAUSSIAN SPACES

The analysis of Gaussian spaces studies functions of Gaussian random variables. These are real-valued functions defined on R^n endowed with the Gaussian measure. Many functions employed in machine learning are instances of such functions: decoders for variational autoencoders, as in this work, and generators for generative adversarial networks are two examples. By way of summary, the main facts we use from this field are, first, that a function on a Gaussian space can be expanded in an orthonormal basis whose basis functions are the Hermite polynomials; this orthonormal expansion is akin to a Fourier transform in this space. The second fact is that the coefficients of such an expansion can be modified so as to reduce the variance of the expanded function by applying an operator from the Ornstein-Uhlenbeck semigroup of operators. Next, we give a brief introduction. For further details on Gaussian analysis we refer to [9].

Gaussian spaces: Let L^2(R^n, γ) be the space of square-integrable functions f : R^n → R with the Gaussian measure γ(z) = ∏_i N(z_i | 0, 1). Given functions f, g in this space, the inner product is ⟨f, g⟩ = E_{γ(z)}[f(z) g(z)].

Basis functions for L^2(R, γ): In the space of univariate functions L^2(R, γ), it is known that the polynomial functions φ_i(z) = z^i form a basis. By a process of orthonormalization we obtain the normalized Hermite polynomial basis for this space. The first few Hermite polynomials are h_0(z) = 1, h_1(z) = z, h_2(z) = (z^2 − 1)/√2, . . . .

Basis functions for L^2(R^n, γ): Letting α ∈ N^n be a multi-index, the basis functions for L^2(R^n, γ) are obtained by multiplying the univariate basis functions across dimensions, h_α(z) = ∏_i h_{α_i}(z_i).
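The orthonormality of the normalized Hermite basis under γ can be checked numerically. The sketch below builds h_k = He_k/√(k!) from NumPy's probabilists' Hermite polynomials and verifies ⟨h_j, h_k⟩ = δ_{jk} with Gauss-Hermite quadrature (a verification aid, not part of the paper's method):

```python
import numpy as np
from math import factorial, sqrt
from numpy.polynomial.hermite_e import hermegauss, hermeval

# Nodes/weights integrate against exp(-z^2/2); dividing by sqrt(2*pi)
# turns this into integration against the standard Gaussian measure gamma.
nodes, weights = hermegauss(40)
weights = weights / np.sqrt(2 * np.pi)

def h(k, z):
    """Normalized probabilists' Hermite polynomial h_k = He_k / sqrt(k!)."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0
    return hermeval(z, coeffs) / sqrt(factorial(k))

# Gram matrix of inner products <h_j, h_k> = E_gamma[h_j(z) h_k(z)]:
G = np.array([[np.sum(weights * h(j, nodes) * h(k, nodes)) for k in range(5)]
              for j in range(5)])
assert np.allclose(G, np.eye(5), atol=1e-8)   # orthonormal under gamma
```

With 40 quadrature nodes the rule is exact for polynomials up to degree 79, so the low-degree inner products here are computed essentially without error.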

