AR-ELBO: PREVENTING POSTERIOR COLLAPSE INDUCED BY OVERSMOOTHING IN GAUSSIAN VAE

Abstract

Variational autoencoders (VAEs) often suffer from posterior collapse, a phenomenon in which the learned latent space becomes uninformative. It is related to local optima of the objective function, which are often introduced by a fixed hyperparameter resembling the data variance. We suggest that this variance parameter regularizes the VAE and affects its smoothness, i.e., the magnitude of its gradient. An inappropriate choice of this parameter causes oversmoothing and leads to posterior collapse. We show this theoretically by analyzing a linearly approximated objective function, and empirically in the general case. We propose AR-ELBO, which stands for adaptively regularized ELBO (Evidence Lower BOund). It controls the strength of regularization by adapting the variance parameter, and thus avoids oversmoothing the model. Generative models trained with the proposed objective achieve improved Fréchet inception distance (FID) on images generated from the MNIST and CelebA datasets.

1. INTRODUCTION

The variational autoencoder (VAE) framework (Kingma & Welling, 2014; Higgins et al., 2017; Zhao et al., 2019) is a popular approach to generative modeling in machine learning. In this framework, a model that approximates the true posterior of the observed data is learned by jointly training an encoder and a decoder, which together define a stochastic mapping between the observed data and a learned deep latent space. The latent space is assumed to follow a prior distribution, so a new data sample can be generated by sampling from the latent space and passing the sample through the decoder. It is common to assume that both the prior on the latent space and the posterior of the observed data follow Gaussian distributions; this setup is known as the Gaussian VAE. In this case, the variance of the decoder output is usually modeled as an isotropic matrix σ_x^2 I with a scalar parameter σ_x^2 ≥ 0. Furthermore, to deal with the intractable log-likelihood of the true posterior, the evidence lower bound (ELBO) (Jordan et al., 1999) is adopted as the objective function instead.

While VAE-based generative models are usually considered more stable and easier to train than generative adversarial networks (Goodfellow et al., 2014), they often suffer from the problem of posterior collapse (Bowman et al., 2015; Sønderby et al., 2016; Alemi et al., 2017; Xu & Durrett, 2018; He et al., 2019; Razavi et al., 2019a; Ma et al., 2019), in which the latent space carries little information about the input data. The phenomenon is often described as "the posterior collapses to the prior in the latent space" (Razavi et al., 2019a). Recently, several works have suggested that the variance parameter σ_x^2 in the ELBO is strongly related to posterior collapse. For example, Lucas et al. (2019) analyzed posterior collapse through the analysis of a linear VAE.
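For concreteness, the Gaussian ELBO with isotropic decoder variance can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function and variable names are our own. It makes explicit how a fixed σ_x^2 rescales the reconstruction term relative to the KL regularizer: a large σ_x^2 down-weights reconstruction so the KL term dominates and pushes the posterior toward the prior (the oversmoothing regime), while a very small σ_x^2 under-regularizes the decoder.

```python
import numpy as np

def gaussian_negative_elbo(x, x_hat, mu_z, logvar_z, sigma_x_sq=1.0):
    """Per-sample negative ELBO of a Gaussian VAE with decoder
    covariance sigma_x_sq * I and a standard-normal prior on z.

    x        : observed data vector
    x_hat    : decoder mean output
    mu_z     : encoder posterior mean
    logvar_z : encoder posterior log-variance (diagonal)
    """
    d = x.size
    # Gaussian reconstruction term: ||x - x_hat||^2 / (2 sigma_x^2)
    # plus the log-normalizer (d/2) * log(2 * pi * sigma_x^2).
    recon = np.sum((x - x_hat) ** 2) / (2.0 * sigma_x_sq) \
            + 0.5 * d * np.log(2.0 * np.pi * sigma_x_sq)
    # KL( N(mu_z, diag(exp(logvar_z))) || N(0, I) ), in closed form.
    kl = 0.5 * np.sum(np.exp(logvar_z) + mu_z ** 2 - 1.0 - logvar_z)
    return recon + kl

# Toy check: a large sigma_x_sq shrinks the reconstruction penalty,
# leaving the KL regularizer relatively dominant.
x, x_hat = np.ones(4), np.zeros(4)
mu_z, logvar_z = np.zeros(2), np.zeros(2)
loose = gaussian_negative_elbo(x, x_hat, mu_z, logvar_z, sigma_x_sq=10.0)
tight = gaussian_negative_elbo(x, x_hat, mu_z, logvar_z, sigma_x_sq=0.1)
```

Note that when the posterior equals the prior (mu_z = 0, logvar_z = 0), the KL term is exactly zero: this is the collapsed configuration the paper is concerned with, and an overly large σ_x^2 makes it an attractive optimum.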
Their analysis revealed that an inappropriate choice of σ_x^2 introduces sub-optimal local optima and causes posterior collapse. Moreover, Lucas et al. (2019) showed that, contrary to popular belief, these local optima are not introduced by replacing the log-likelihood with the ELBO, but by an excessively large σ_x^2. On the other hand, it can be shown that fixing σ_x^2 to an excessively small value under-regularizes the decoder, which can cause overfitting. Nevertheless, in most implementations of a Gaussian VAE, the variance parameter σ_x^2 is a fixed constant regardless of the input data, usually 1.0. In another line of work, Dai & Wipf (2019) proposed a two-stage VAE and treated σ_x^2 as a trainable parameter. Besides an inappropriate choice of the variance parameter, posterior collapse can also be induced by other causes. For example, Dai et al. (2020) found that small nonlinear perturbation introduced in

