SIMPLE AND EFFECTIVE VAE TRAINING WITH CALIBRATED DECODERS

Abstract

Variational autoencoders (VAEs) provide an effective and simple method for modeling complex distributions. However, training VAEs often requires considerable hyperparameter tuning to determine the optimal amount of information retained by the latent variable. We study the impact on VAE performance of calibrated decoders, which learn the uncertainty of the decoding distribution and can determine this amount of information automatically. While many methods for learning calibrated decoders have been proposed, many recent papers that employ VAEs rely on heuristic hyperparameters and ad-hoc modifications instead. We perform the first comprehensive comparative analysis of calibrated decoders and provide recommendations for simple and effective VAE training. Our analysis covers a range of datasets and several single-image and sequential VAE models. We further propose a simple but novel modification to the commonly used Gaussian decoder that computes the prediction variance analytically. We observe empirically that heuristic modifications are not necessary with our method.

1. INTRODUCTION

Deep density models based on the variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) have found ubiquitous use in probabilistic modeling and representation learning, as they are conceptually simple and able to scale to very complex distributions and large datasets. VAE techniques are used for tasks such as future frame prediction (Castrejon et al., 2019), image segmentation (Kohl et al., 2018), generating speech (Chung et al., 2015) and music (Dhariwal et al., 2020), as well as model-based reinforcement learning (Hafner et al., 2019a). However, in practice, many of these approaches require careful manual tuning of the balance between two terms that correspond to distortion and rate from information theory (Alemi et al., 2017). This balance trades off fidelity of reconstruction against quality of samples from the model: a model with too low a rate would not contain enough information to reconstruct the data, while allowing the model a high rate might lead to unrealistic samples from the prior as the KL-divergence constraint becomes weaker (Alemi et al., 2017; Higgins et al., 2017). While a proper variational lower bound does not expose any free parameters to control this tradeoff, many prior works heuristically introduce a weight on the prior KL-divergence term, often denoted β. Usually, β needs to be tuned as a hyperparameter for every dataset and model variant, which slows down development and can lead to poor performance, since finding the optimal value is often prohibitively computationally expensive. Moreover, using β ≠ 1 precludes the appealing interpretation of the VAE objective as a bound on the data likelihood, and is undesirable for applications like density modeling.
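To make the β-weighted objective concrete, the following is a minimal NumPy sketch of the (negative) β-ELBO for a Gaussian decoder with fixed unit variance and a diagonal-Gaussian encoder. The function name and signature are our own illustrative choices, not an API from the paper:

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Negative beta-ELBO with a unit-variance Gaussian decoder.

    distortion: reconstruction term, 0.5 * ||x - x_hat||^2 up to a constant.
    rate: KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form.
    beta = 1 recovers a valid lower bound on log p(x); beta != 1 is the
    heuristic weighting discussed in the text.
    """
    distortion = 0.5 * np.sum((x - x_hat) ** 2)
    rate = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return distortion + beta * rate
```

With β = 1 this is simply the negative ELBO; decreasing β lets the model keep more information in z at the cost of the likelihood-bound interpretation.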
While many architectures for calibrating decoders have been proposed in the literature (Kingma & Welling, 2014; Kingma et al., 2016; Dai & Wipf, 2019), more applied work typically employs VAEs with uncalibrated decoding distributions, such as Gaussian distributions without a learned variance, where the decoder only outputs the mean parameter (Castrejon et al., 2019; Denton & Fergus, 2018; Lee et al., 2019; Babaeizadeh et al., 2018; Lee et al., 2018; Hafner et al., 2019b; Pong et al., 2019; Zhu et al., 2017; Pavlakos et al., 2019), or uses other ad-hoc modifications to the objective (Sohn et al., 2015; Henaff et al., 2019). Indeed, it is well known that attempting to learn the variance in a Gaussian decoder may lead to numerical instability (Rezende & Viola, 2018; Dai & Wipf, 2019), and naïve approaches often lead to poor results. As a result, it remains unclear whether the practical empirical performance of VAEs actually benefits from calibrated decoders. To rectify this, our first contribution is a comparative analysis of various calibrated decoder architectures and practical recommendations for simple and effective VAE training. We find that, while naïve calibrated decoders often lead to worse results, a careful choice of the decoder distribution can work very well and removes the need to tune the additional parameter β. Indeed, we note that the entropy of the decoding distribution controls the mutual information I(x; z); calibrated decoders allow the model to control I(x; z) automatically, instead of relying on manual tuning. Our second contribution is a simple but novel technique for optimizing the decoder variance analytically, without requiring the decoder network to produce it as an additional output. We call the resulting approach to learning the Gaussian variance the σ-VAE. In our experiments, the σ-VAE outperforms the alternative of learning the variance through gradient descent, while being simpler to implement and extend.
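The idea of solving for the decoder variance analytically can be sketched as follows: minimizing the Gaussian negative log-likelihood over a single shared variance gives σ²* equal to the mean squared reconstruction error, which can be substituted back into the loss. This is a NumPy sketch of that calculation under our own naming, not the paper's reference implementation:

```python
import numpy as np

def analytic_sigma_nll(x, x_hat, eps=1e-6):
    """Gaussian NLL with the shared variance set to its analytic optimum.

    For NLL = D/2 * log(2*pi*sigma^2) + ||x - x_hat||^2 / (2*sigma^2),
    the minimizing shared variance is sigma^2* = mean squared error.
    Substituting it back gives D/2 * (log(2*pi*sigma^2*) + 1).
    eps guards against a zero variance under perfect reconstruction.
    """
    d = x.size  # number of observed dimensions
    sigma_sq = np.mean((x - x_hat) ** 2) + eps  # analytic optimum of sigma^2
    return 0.5 * d * (np.log(2.0 * np.pi * sigma_sq) + 1.0)
```

In a training loop, this reconstruction term would replace the fixed-variance squared error inside the ELBO, so the effective weighting between distortion and rate adapts to the data without a β hyperparameter.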
We validate our results on several VAE and sequence VAE models and a range of image and video datasets.

2. RELATED WORK

Prior work on variational autoencoders has studied a number of different decoder parameterizations. Kingma & Welling (2014); Rezende et al. (2014) use the Bernoulli distribution for the binary MNIST data, and Kingma & Welling (2014) use Gaussian distributions with a learned variance parameter for grayscale images. However, modeling images with continuous distributions is prone to instability, as the variance can converge to zero (Rezende & Viola, 2018; Mattei & Frellsen, 2018; Dai & Wipf, 2019). Some work has attempted to rectify this problem by using dequantization (Gregor et al., 2016), which is theoretically appealing as it is tightly related to the log-likelihood of the original discrete data (Theis et al., 2016), by optimizing the variance in a two-stage procedure (Arvanitidis et al., 2017), or by training a post-hoc prior (Ghosh et al., 2019). Takahashi et al. (2018); Barron (2019) proposed more expressive distributions. Additionally, different choices exist for representing such variance, including a diagonal covariance (Kingma & Welling, 2014; Sønderby et al., 2016; Rolfe, 2016) or a single shared parameter (Kingma et al., 2016; Dai & Wipf, 2019; Edwards & Storkey, 2016; Rezende & Viola, 2018). We analyze these choices and observe that learning a single variance parameter shared across images leads to stable training and good performance, without dequantization or even clipping the variance, although these techniques can be used with our decoders; we further improve the estimation of this variance with an analytic solution.

Early work on discrete VAE decoders for color images modeled them with the Bernoulli distribution, treating the color intensities as probabilities (Gregor et al., 2015). Further work has explored various parameterizations based on discretized continuous distributions, such as the discretized logistic (Kingma et al., 2016). More recent work has improved the expressivity of the decoder with a mixture of discretized logistics (Chen et al., 2016; Maaløe et al., 2019). However, these models also employ powerful autoregressive decoders (Chen et al., 2016; Gulrajani et al., 2016; Maaløe et al., 2019), and the latent variables in these models may not represent all of the significant factors of variation in the data, as some factors can instead be modeled internally by the autoregressive decoder (Alemi et al., 2017).[1]

[1] BIVA (Maaløe et al., 2019) uses the mixture of logistics decoder proposed in (Salimans et al., 2017), which produces the channels for each pixel autoregressively; see also App. D.

While many calibrated decoders have been proposed, uncalibrated decoders are ubiquitous outside the core generative modeling community. They are used in work on video prediction (Denton & Fergus, 2018; Castrejon et al., 2019; Lee et al., 2018; Babaeizadeh et al., 2018), image segmentation (Kohl et al., 2018), image-to-image translation (Zhu et al., 2017), 3D human pose (Pavlakos et al., 2019), as well as model-based reinforcement learning (Henaff et al., 2019; Hafner et al., 2019b; a) and representation learning (Lee et al., 2019; Watter et al., 2015; Pong et al., 2019). Most of these works instead employ the heuristic hyperparameter β, which is undesirable both because the resulting objective is no longer a bound on the likelihood and because β usually requires extensive tuning. In this work, we analyze the common pitfalls of using calibrated decoders that may have prevented practitioners from adopting them, propose a simple and effective analytic way of learning such a calibrated distribution, and provide a comprehensive experimental evaluation of different decoding distributions.

Alternative discussions of the hyperparameter β are presented by Zhao et al. (2017); Higgins et al. (2017); Alemi et al. (2017); Achille & Soatto (2018), who show that it controls the amount of information in the latent variable, I(x; z). Peng et al. (2018); Rezende & Viola (2018) further discuss constrained optimization objectives for VAEs, which also yield a similar hyperparameter. Here, we focus on β-VAEs with Gaussian decoders with constant variance, as commonly used in recent work, and show that the hyperparameter β can be incorporated into the decoding likelihood for these models.
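The relationship between β and the decoder variance can be made explicit with a short derivation; this is our restatement of the equivalence under a constant-variance Gaussian decoder, with notation consistent with the text:

```latex
% Gaussian decoder with constant variance \sigma^2 over D dimensions:
-\log p(x \mid z) = \frac{1}{2\sigma^2}\,\lVert x - \mu(z) \rVert^2
                    + \frac{D}{2}\log(2\pi\sigma^2).
% Up to the additive constant, the negative ELBO is
\mathcal{L} = \frac{1}{2\sigma^2}\,\lVert x - \mu(z) \rVert^2
              + \mathrm{KL}\!\left(q(z \mid x)\,\Vert\,p(z)\right).
% Multiplying through by 2\sigma^2 gives the \beta-VAE objective with
% squared-error distortion and \beta = 2\sigma^2:
2\sigma^2 \mathcal{L} = \lVert x - \mu(z) \rVert^2
                        + \beta\,\mathrm{KL}\!\left(q(z \mid x)\,\Vert\,p(z)\right),
\qquad \beta = 2\sigma^2.
```

Under this view, tuning β amounts to fixing the decoder variance by hand, whereas a calibrated decoder determines it from the data.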

