AR-ELBO: PREVENTING POSTERIOR COLLAPSE INDUCED BY OVERSMOOTHING IN GAUSSIAN VAE

Abstract

Variational autoencoders (VAEs) often suffer from posterior collapse, a phenomenon in which the learned latent space becomes uninformative. This is related to local optima of the objective function that are often introduced by a fixed hyperparameter resembling the data variance. We suggest that this variance parameter regularizes the VAE and affects its smoothness, i.e., the magnitude of its gradient. An inappropriate choice of this parameter causes oversmoothness and leads to posterior collapse. We show this theoretically by analyzing a linearly approximated objective function and empirically in the general case. We propose AR-ELBO, which stands for adaptively regularized ELBO (Evidence Lower BOund). It controls the strength of regularization by adapting the variance parameter and thus avoids oversmoothing the model. Generative models trained with the proposed objectives achieve improved Fréchet inception distance (FID) on images generated from the MNIST and CelebA datasets.

1. INTRODUCTION

The variational autoencoder (VAE) framework (Kingma & Welling, 2014; Higgins et al., 2017; Zhao et al., 2019) is a popular approach to generative modeling in the field of machine learning. In this framework, a model that approximates the true posterior of the observation data is learned by jointly training an encoder and a decoder, which creates a stochastic mapping between the observation data and the learned deep latent space. The latent space is assumed to follow a prior distribution, and a new data sample can be generated by sampling from the latent space and passing the sample through the decoder. It is common to assume that both the prior on the latent space and the posterior of the observation data follow a Gaussian distribution; this setup is known as the Gaussian VAE. In this case, the variance of the decoder output is usually modeled as an isotropic matrix σ_x^2 I with a scalar parameter σ_x^2 ≥ 0. Furthermore, to deal with the intractable log-likelihood of the true posterior, the evidence lower bound (ELBO) (Jordan et al., 1999) is adopted as the objective function instead.

While VAE-based generative models are usually considered more stable and easier to train than generative adversarial networks (Goodfellow et al., 2014), they often suffer from the problem of posterior collapse (Bowman et al., 2015; Sønderby et al., 2016; Alemi et al., 2017; Xu & Durrett, 2018; He et al., 2019; Razavi et al., 2019a; Ma et al., 2019), in which the latent space carries little information about the input data. The phenomenon is commonly described as "the posterior collapses to the prior in the latent space" (Razavi et al., 2019a). Recently, several works have suggested that the variance parameter σ_x^2 in the ELBO is strongly related to posterior collapse. For example, Lucas et al. (2019) analyzed posterior collapse through the analysis of a linear VAE.
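The Gaussian VAE objective described above can be sketched numerically. The following is a minimal illustration (not the authors' implementation) of the per-sample ELBO with an isotropic decoder covariance σ_x^2 I and a standard Gaussian prior; the function name and argument layout are our own:

```python
import numpy as np

def gaussian_elbo(x, x_hat, mu_z, logvar_z, sigma_x_sq=1.0):
    """Per-sample ELBO of a Gaussian VAE with decoder covariance sigma_x_sq * I.

    x, x_hat       : (D,) observation and decoder output mean
    mu_z, logvar_z : (d,) encoder mean and log-variance of q(z|x)
    sigma_x_sq     : fixed scalar decoder variance (commonly 1.0)
    """
    D = x.shape[0]
    # Gaussian log-likelihood ln p(x|z) with isotropic covariance sigma_x_sq * I
    recon = -0.5 * (np.sum((x - x_hat) ** 2) / sigma_x_sq
                    + D * np.log(2.0 * np.pi * sigma_x_sq))
    # Closed-form KL( N(mu_z, diag(exp(logvar_z))) || N(0, I) )
    kl = 0.5 * np.sum(np.exp(logvar_z) + mu_z ** 2 - 1.0 - logvar_z)
    return recon - kl
```

Training maximizes the expectation of this quantity over the data (in practice, minimizes its negative by stochastic gradient descent with the reparameterization trick).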
Their analysis revealed that an inappropriate choice of σ_x^2 introduces sub-optimal local optima and causes posterior collapse. Moreover, Lucas et al. (2019) showed that, contrary to popular belief, these local optima are not introduced by replacing the log-likelihood with the ELBO, but by an excessively large σ_x^2. On the other hand, it can be shown that fixing σ_x^2 to an excessively small value leads to under-regularization of the decoder, which can cause overfitting. Nevertheless, in most implementations of a Gaussian VAE, the variance parameter σ_x^2 is a fixed constant, usually 1.0, regardless of the input data. In another line of work, Dai & Wipf (2019) proposed a two-stage VAE and treated σ_x^2 as a training parameter.

Besides an inappropriate choice of the variance parameter, posterior collapse can also be induced by other causes. For example, Dai et al. (2020) found that small nonlinear perturbations introduced in the network architecture can also result in extra sub-optimal local minima. In this work, however, we keep our focus on the variance parameter. We suggest that σ_x^2 affects the strength of regularization over the gradient magnitude of the decoder. Throughout this paper, we call the expected gradient magnitude the smoothness: the smaller the gradient magnitude, the smoother the model. In particular, we focus on the local smoothness of the model, i.e., the smoothness evaluated within a neighborhood of the encoded latent variable of the observation data. We thus begin with the following hypothesis:

Main Hypothesis. The value of σ_x^2 controls the regularization strength of the smoothness of the decoder. Therefore, an excessively large σ_x^2 causes oversmoothness, which results in posterior collapse.

Following the hypothesis, the estimation of σ_x^2 should be related to properties of the approximated posterior of the latent space, such as its local smoothness.
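To make the regularizing role of σ_x^2 concrete, one can write out the negative ELBO of a Gaussian VAE with decoder mean μ_θ(z); this is a standard derivation, not taken verbatim from the paper:

```latex
% Negative ELBO with Gaussian decoder p(x|z) = N(\mu_\theta(z), \sigma_x^2 I):
-\mathrm{ELBO}(x)
  = \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{\|x-\mu_\theta(z)\|^2}{2\sigma_x^2}\right]
  + \frac{D}{2}\ln\!\left(2\pi\sigma_x^2\right)
  + \mathrm{KL}\!\left(q_\phi(z|x)\,\|\,p(z)\right).
% For a fixed sigma_x^2, multiplying by 2*sigma_x^2 leaves the minimizers unchanged:
\mathbb{E}_{q_\phi(z|x)}\|x-\mu_\theta(z)\|^2
  + 2\sigma_x^2\,\mathrm{KL}\!\left(q_\phi(z|x)\,\|\,p(z)\right)
  + \mathrm{const}.
```

In this rescaled form, σ_x^2 plays the same role as the weight β in a β-VAE: the larger σ_x^2 is, the more heavily deviations of the approximate posterior from the prior are penalized, which is consistent with the hypothesis that an excessively large σ_x^2 over-regularizes the model.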
We start by analyzing how σ_x^2 regularizes the local smoothness of the stochastic decoder, and then propose new objective functions that inherently determine σ_x^2 via maximum likelihood estimation (MLE). The proposed objective function is named AR-ELBO (Adaptively Regularized ELBO), as it controls the regularization strength via σ_x^2. Furthermore, several variations are derived for different parameterizations of the variance parameters. Our main contributions are as follows:

1. We show that our main hypothesis holds for the linearly approximated ELBO and empirically holds in the general case in Section 3. This also suggests that the variance parameter σ_x^2 should be estimated from properties of the approximated posterior instead of being treated as a hyperparameter.

2. We propose AR-ELBO, an ELBO-based objective function that adaptively regularizes the smoothness of the decoder via MLE of the variance parameter σ_x^2, in Section 4. Variations of AR-ELBO for several variance parameterizations of the posterior distributions are also derived. AR-ELBO prevents posterior collapse induced by oversmoothing and improves the quality of generation, as shown in Section 5.

The organization of this paper is as follows. In Section 2, we propose a mathematical definition of posterior collapse in terms of mutual information, which subsumes the conventional description "the posterior collapses to the prior in the latent space". In Section 3, we give the theoretical analysis and empirical support for the main hypothesis, showing that σ_x^2 affects the smoothness of the decoder via the variance parameters of the latent space learned by the encoder. In Section 4, we propose new AR-ELBO objective functions for various variance parameterizations of the posterior distributions, which relieve the decoder from being oversmoothed during training and prevent posterior collapse.
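As a rough illustration of estimating σ_x^2 by MLE rather than fixing it: for a Gaussian decoder p(x|z) = N(x̂, σ_x^2 I), maximizing the likelihood over σ_x^2 with the reconstructions held fixed yields the mean squared reconstruction error in closed form. The sketch below shows this textbook estimator; it is a simplified stand-in for the paper's AR-ELBO update, and the function name and clamping constant are our own:

```python
import numpy as np

def mle_sigma_x_sq(x_batch, x_hat_batch, eps=1e-8):
    """Closed-form MLE of the isotropic decoder variance sigma_x^2.

    For p(x|z) = N(x_hat, sigma_x^2 I), setting the derivative of the Gaussian
    log-likelihood w.r.t. sigma_x^2 to zero gives the mean squared
    reconstruction error over all pixels in the batch.
    """
    sq_err = (x_batch - x_hat_batch) ** 2
    # Clamp away from zero so the log-likelihood stays finite as
    # reconstructions become near-perfect late in training.
    return max(float(np.mean(sq_err)), eps)
```

Since the reconstruction error typically shrinks as training progresses, this estimate of σ_x^2 shrinks with it, which matches the observation that the regularization strength on the decoder smoothness gradually decreases during training.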
These objective functions no longer include any hyperparameters and adaptively estimate the variance parameter σ_x^2 from the observation data. It should be noted that when σ_x^2 is determined adaptively with the proposed AR-ELBO, the strength of regularization on the decoder smoothness gradually decreases as training progresses. In Section 5, we conduct experiments on the MNIST and CelebA datasets, showing that the standard Gaussian VAE trained with the proposed AR-ELBO is competitive with many other VAE variants in most situations.

Throughout this paper, we use a, a, and A for a scalar, a column vector, and a matrix, respectively, and ln and log denote the natural logarithm and the common logarithm. Our code is available from the following URL.[1]

2. POSTERIOR COLLAPSE IN GAUSSIAN VAE

We begin with the standard formulation of the Gaussian VAE, which is the foundation of our research, and then propose a definition of posterior collapse in terms of mutual information (MI).



[1] URL hidden due to blind review.

