SIMPLE AND EFFECTIVE VAE TRAINING WITH CALIBRATED DECODERS

Abstract

Variational autoencoders (VAEs) provide an effective and simple method for modeling complex distributions. However, training VAEs often requires considerable hyperparameter tuning to determine the optimal amount of information retained by the latent variable. We study the impact of calibrated decoders, which learn the uncertainty of the decoding distribution and can determine this amount of information automatically, on VAE performance. While many methods for learning calibrated decoders have been proposed, many of the recent papers that employ VAEs rely on heuristic hyperparameters and ad-hoc modifications instead. We perform the first comprehensive comparative analysis of calibrated decoders and provide recommendations for simple and effective VAE training. Our analysis covers a range of datasets and several single-image and sequential VAE models. We further propose a simple but novel modification to the commonly used Gaussian decoder, which computes the prediction variance analytically. We observe empirically that using heuristic modifications is not necessary with our method.

1. INTRODUCTION

Deep density models based on the variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) have found ubiquitous use in probabilistic modeling and representation learning as they are both conceptually simple and able to scale to very complex distributions and large datasets. These VAE techniques are used for tasks such as future frame prediction (Castrejon et al., 2019), image segmentation (Kohl et al., 2018), generating speech (Chung et al., 2015) and music (Dhariwal et al., 2020), as well as model-based reinforcement learning (Hafner et al., 2019a). However, in practice, many of these approaches require careful manual tuning of the balance between two terms that correspond to distortion and rate from information theory (Alemi et al., 2017). This balance trades off fidelity of reconstruction and quality of samples from the model: a model with low rate would not contain enough information to reconstruct the data, while allowing the model to have high rate might lead to unrealistic samples from the prior as the KL-divergence constraint becomes weaker (Alemi et al., 2017; Higgins et al., 2017). While a proper variational lower bound does not expose any free parameters to control this tradeoff, many prior works heuristically introduce a weight on the prior KL-divergence term, often denoted β. Usually, β needs to be tuned for every dataset and model variant as a hyperparameter, which slows down development and can lead to poor performance, as finding the optimal value is often prohibitively computationally expensive. Moreover, using β ≠ 1 precludes the appealing interpretation of the VAE objective as a bound on the data likelihood, and is undesirable for applications like density modeling.
While many architectures for calibrating decoders have been proposed in the literature (Kingma & Welling, 2014; Kingma et al., 2016; Dai & Wipf, 2019) , more applied work typically employs VAEs with uncalibrated decoding distributions, such as Gaussian distributions without a learned variance, where the decoder only outputs the mean parameter (Castrejon et al., 2019; Denton & Fergus, 2018; Lee et al., 2019; Babaeizadeh et al., 2018; Lee et al., 2018; Hafner et al., 2019b; Pong et al., 2019; Zhu et al., 2017; Pavlakos et al., 2019) , or uses other ad-hoc modifications to the objective (Sohn et al., 2015; Henaff et al., 2019) . Indeed, it is well known that attempting to learn the variance in a Gaussian decoder may lead to numerical instability (Rezende & Viola, 2018; Dai & Wipf, 2019) , and naïve approaches often lead to poor results. As a result, it remains unclear whether practical empirical performance of VAEs actually benefits from calibrated decoders or not. To rectify this, our first contribution is a comparative analysis of various calibrated decoder architectures and practical recommendations for simple and effective VAE training. We find that, while naïve calibrated decoders often lead to worse results, a careful choice of the decoder distribution can work very well, and removes the need to tune the additional parameter β. Indeed, we note that the entropy of the decoding distribution controls the mutual information I(x; z). Calibrated decoders allow the model to control I(x; z) automatically, instead of relying on manual tuning. Our second contribution is a simple but novel technique for optimizing the decoder variance analytically, without requiring the decoder network to produce it as an additional output. We call the resulting approach to learning the Gaussian variance the σ-VAE. In our experiments, the σ-VAE outperforms the alternative of learning the variance through gradient descent, while being simpler to implement and extend. 
We validate our results on several VAE and sequence VAE models and a range of image and video datasets. 2014) use Gaussian distributions with learned variance parameter for grayscale images. However, modeling images with continuous distributions is prone to instability as the variance can converge to zero (Rezende & Viola, 2018; Mattei & Frellsen, 2018; Dai & Wipf, 2019) . Some work has attempted to rectify this problem by using dequantization (Gregor et al., 2016) , which is theoretically appealing as it is tightly related to the log-likelihood of the original discrete data (Theis et al., 2016) , optimizing the variance in a two-stage procedure (Arvanitidis et al., 2017) , or training a post-hoc prior (Ghosh et al., 2019) . Takahashi et al. (2018) ; Barron (2019) proposed more expressive distributions. Additionally, different choices for representing such variance exist, including diagonal covariance (Kingma & Welling, 2014; Sønderby et al., 2016; Rolfe, 2016) , or a single shared parameter (Kingma et al., 2016; Dai & Wipf, 2019; Edwards & Storkey, 2016; Rezende & Viola, 2018) . We analyze these and notice that learning a single variance parameter shared across images leads to stable training and good performance, without the use of dequantization or even clipping the variance, although these techniques can be used with our decoders; and further improve the estimation of this variance with an analytic solution.

2. RELATED WORK

Early work on discrete VAE decoders for color images modeled them with the Bernoulli distribution, treating the color intensities as probabilities (Gregor et al., 2015). Further work has explored various parameterizations based on discretized continuous distributions, such as discretized logistic (Kingma et al., 2016). More recent work has improved expressivity of the decoder with a mixture of discretized logistics (Chen et al., 2016; Maaløe et al., 2019). However, these models also employ powerful autoregressive decoders (Chen et al., 2016; Gulrajani et al., 2016; Maaløe et al., 2019), and the latent variables in these models may not represent all of the significant factors of variation in the data, as some factors can instead be modeled internally by the autoregressive decoder (Alemi et al., 2017).
While many calibrated decoders have been proposed, outside the core generative modeling community uncalibrated decoders are ubiquitous. They are used in work on video prediction (Denton & Fergus, 2018; Castrejon et al., 2019; Lee et al., 2018; Babaeizadeh et al., 2018), image segmentation (Kohl et al., 2018), image-to-image translation (Zhu et al., 2017), 3D human pose (Pavlakos et al., 2019), as well as model-based reinforcement learning (Henaff et al., 2019; Hafner et al., 2019a;b), and representation learning (Lee et al., 2019; Watter et al., 2015; Pong et al., 2019). Most of these works utilize the heuristic hyperparameter β instead, which is undesirable both because the resulting objective is no longer a bound on the likelihood, and because β usually requires extensive tuning. In this work, we analyze the common pitfalls of using calibrated decoders that may have prevented practitioners from using them, propose a simple and effective analytic way of learning such a calibrated distribution, and provide a comprehensive experimental evaluation of different decoding distributions.
Alternative discussions of the hyperparameter β are presented by Zhao et al. (2017); Higgins et al. (2017); Alemi et al. (2017); Achille & Soatto (2018), who show that it controls the amount of information in the latent variable, I(x; z). Peng et al. (2018); Rezende & Viola (2018) further discuss constrained optimization objectives for VAEs, which also yield a similar hyperparameter. Here, we focus on β-VAEs with Gaussian decoders with constant variance, as commonly used in recent work, and show that the hyperparameter β can be incorporated in the decoding likelihood for these models.

3. ANALYSING DECODING DISTRIBUTIONS

The generative model of a VAE (Kingma & Welling, 2014; Rezende et al., 2014) with parameters θ is specified with a prior distribution over the latent variable $p_\theta(z)$, commonly unit Gaussian, and a decoding distribution $p_\theta(x|z)$, which for color images is commonly a conditional Gaussian parameterized with a neural network. We would like to fit this generative model to a given dataset by maximizing the evidence lower bound (ELBO; Neal & Hinton, 1998; Jordan et al., 1999; Kingma & Welling, 2014; Rezende et al., 2014), which uses an approximate posterior distribution $q_\phi(z|x)$, also commonly a conditional Gaussian specified with a neural network. In this work, we focus on the form of the decoding distribution $p_\theta(x|z)$. To achieve the best results, we want a decoding distribution that represents the required probability $p(x|z)$ accurately. In this section, we will review and analyze various choices of decoding distributions that enable better decoder calibration, including expressive decoding distributions that can represent both the prediction of the image and the uncertainty about such prediction, or even multimodal predictions.

3.1. GAUSSIAN DECODERS

We first analyse the commonly used Gaussian decoders. We note that the commonly used MSE reconstruction loss between the reconstruction $\hat{x}$ and ground truth data $x$ is equivalent to the negative log-likelihood objective with a Gaussian decoding distribution with constant variance:
$$-\ln p(x|z) = \frac{1}{2}\|\hat{x} - x\|^2 + D \ln \sqrt{2\pi} = \frac{1}{2}\|\hat{x} - x\|^2 + c = \frac{D}{2}\mathrm{MSE}(\hat{x}, x) + c,$$
where $p(x|z) \sim \mathcal{N}(\hat{x}, I)$, the prediction $\hat{x}$ is produced with a neural network $\hat{x} = \mu_\theta(z)$, and $D$ is the dimensionality of $x$. This demonstrates a drawback of methods that rely simply on the MSE loss (Castrejon et al., 2019; Denton & Fergus, 2018; Lee et al., 2019; Hafner et al., 2019b; Pong et al., 2019; Zhu et al., 2017; Henaff et al., 2019), as it is equivalent to assuming a particular, constant variance of the Gaussian decoding distribution. By learning this variance, we can achieve much better performance due to better calibration of the decoder. There are several ways in which we can specify this variance. An expressive way is to specify a diagonal covariance matrix for the image, with one value per pixel (Kingma & Welling, 2014; Sønderby et al., 2016; Rolfe, 2016). This can be done, for example, by letting a neural network $\sigma_\theta$ output the diagonal entries of the covariance matrix given a latent sample $z$:
$$p_\theta(x|z) \sim \mathcal{N}\left(\mu_\theta(z), \sigma_\theta(z)^2\right). \quad (1)$$
This parameterization of the decoding distribution outputs one variance value for each pixel and channel. While powerful, we observe in Section 5.3 that this approach attains suboptimal performance, and is moreover prone to numerical instability. Instead, we will find experimentally that a simpler parameterization, in which the covariance matrix is specified with a single shared parameter $\sigma$ (Kingma et al., 2016; Dai & Wipf, 2019; Edwards & Storkey, 2016; Rezende & Viola, 2018) as $\Sigma = \sigma^2 I$, often works better in practice:
$$p_{\theta,\sigma}(x|z) \sim \mathcal{N}\left(\mu_\theta(z), \sigma^2 I\right). \quad (2)$$
The parameter $\sigma$ can be optimized together with the parameters of the neural network θ with gradient descent. Of particular interest is the interpretation of this parameter. Writing out the expression for the decoding likelihood, we obtain
$$-\ln p(x|z) = \frac{1}{2\sigma^2}\|\hat{x} - x\|^2 + D \ln\left(\sigma\sqrt{2\pi}\right) = \frac{1}{2\sigma^2}\|\hat{x} - x\|^2 + D \ln \sigma + c = D \ln \sigma + \frac{D}{2\sigma^2}\mathrm{MSE}(\hat{x}, x) + c.$$
The full objective of the resulting Gaussian σ-VAE is:
$$\mathcal{L}_{\theta,\phi,\sigma} = D \ln \sigma + \frac{D}{2\sigma^2}\mathrm{MSE}(\hat{x}, x) + D_{KL}(q(z|x)\,\|\,p(z)). \quad (3)$$
Note that $\sigma$ may be viewed as a weighting parameter between the MSE reconstruction term and the KL-divergence term in the objective. Moreover, this objective explicitly specifies how to select the optimal variance: the variance should be selected to minimize the (weighted) MSE loss while also minimizing the logarithm of the variance.
Decoder calibration. It is important that the decoder distribution be calibrated in the statistical sense, that is, the predicted probabilities should correspond to the frequencies of seeing a particular value of $x$ given that prediction (DeGroot & Fienberg, 1983; Dawid, 1982). The calibration of a neural network can usually be improved by estimating the uncertainty of that prediction (Guo et al., 2017), such as the variance of a Gaussian (Kendall & Gal, 2017). Since the naive MSE loss assumes a constant variance, it does not effectively represent the uncertainty of the prediction, and is often poorly calibrated. Instead, learning the variance as in Eq. 3 leads to better uncertainty estimation and better calibration. In Section 5.1, we show that learning a good estimate of this uncertainty is crucial for the quality of the VAE generations.
Connection to β-VAE. The β-VAE objective (Higgins et al., 2017) for a Gaussian decoder with unit variance is:
$$\mathcal{L}_\beta = \frac{D}{2}\mathrm{MSE}(\hat{x}, x) + \beta D_{KL}(q(z|x)\,\|\,p(z)).$$
We see that it can be interpreted as a particular case of the objective (3), where the variance is constant and the term $D \ln \sigma$ can be ignored during optimization.
The β-VAE objective is then equivalent to a σ-VAE with a constant variance $\sigma^2 = \beta/2$ (for a particular learning rate setting). In recent work (Zhu et al., 2017; Denton & Fergus, 2018; Lee et al., 2019), β-VAE models are often used in this exact regime. By tuning the β term, practitioners are able to tune the variance of the decoder, manually producing a more calibrated decoder. However, by re-interpreting the β-VAE objective as a special case of the VAE and introducing the missing $D \ln \sigma$ term, we can both obtain a valid evidence lower bound and remove the need to manually select β. Instead, the variance $\sigma^2$ can simply be learned end-to-end, reducing the need for hyperparameter tuning. An alternative discussion of this connection in the context of linear VAEs is also presented by Lucas et al. (2019). While the β term is not necessary for good performance if the decoder is calibrated, it can still be employed if desired, such as when the aim is to attain better disentanglement (Higgins et al., 2017) or a particular rate-distortion tradeoff (Alemi et al., 2017). However, we found that with calibrated decoders, the best sample quality is obtained when β = 1.
Loss implementation details. For the correct evidence lower bound computation, it is necessary to sum the values of the MSE loss and the KL divergence across the dimensions. We observe that common implementations of these losses (Denton & Fergus, 2018; Abadi et al., 2016; Paszke et al., 2019) use averaging instead, which will lead to poor results if the number of image dimensions is significantly different from the number of latent dimensions. While this can be conveniently ignored in the β-VAE regime, where the balance term is tuned manually anyway, for the σ-VAE it is essential to compute the objective value correctly.
Variance implementation details. Since the variance is non-negative, we parameterize it logarithmically as $\sigma^2 = e^{2\lambda}$, where $\lambda$ is the logarithm of the standard deviation.
For some models, such as per-pixel variance decoders, we observed that it is necessary to restrict the variance range for numerical stability. We do so by using the soft clipping operations proposed by Chua et al. (2018): $\lambda := \lambda_{max} - \mathrm{softplus}(\lambda_{max} - \lambda)$; $\lambda := \lambda_{min} + \mathrm{softplus}(\lambda - \lambda_{min})$. We observe that setting $\lambda_{min} = -6$, which lower bounds the standard deviation to be at least half of the distance between allowed color values, works well in practice. We also observe that this clipping is unnecessary when learning a shared $\sigma$ value.
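As a minimal sketch of the shared-variance likelihood and the soft clipping described above, the following plain-Python functions compute the Gaussian negative log-likelihood summed over dimensions. The function names, the list-based inputs, and the upper bound `hi=3.0` are illustrative assumptions, not part of the original implementation.

```python
import math

def softplus(v):
    # Numerically stable softplus: log(1 + exp(v)).
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def soft_clip(log_sigma, lo=-6.0, hi=3.0):
    # Soft clipping of the log standard deviation, following the two updates
    # in the text. hi=3.0 is an arbitrary illustrative upper bound.
    log_sigma = hi - softplus(hi - log_sigma)
    log_sigma = lo + softplus(log_sigma - lo)
    return log_sigma

def gaussian_nll(x, x_hat, log_sigma):
    # -ln N(x | x_hat, sigma^2 I), summed (not averaged) over the D dimensions.
    d = len(x)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return (0.5 * sq_err / math.exp(2.0 * log_sigma)
            + d * log_sigma
            + 0.5 * d * math.log(2.0 * math.pi))
```

With `log_sigma = 0` this reduces to the unit-variance case, that is, half the summed squared error plus a constant. Adding the KL term, also summed over latent dimensions, would give the objective in Eq. 3 up to that constant.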

3.2. DISCRETE DECODERS

It is possible to use discrete decoding distributions to generate images, as color values are commonly restricted to a fixed set of integer pixel intensities (e.g. 0..255). Indeed, for discrete color values, discrete distributions are arguably more appropriate. In the most general case, a discrete decoding distribution factorized over each pixel and channel would be specified by a probability mass vector $\hat{x}$ with 256 entries, one for each possible intensity value, similarly to a per-pixel classifier of the intensity value. We can implement it with a soft-max layer, yielding the following log-likelihood loss (sometimes called the cross-entropy loss) for a true pixel with intensity $i$:
$$-\ln p(x|z) = -\ln \frac{\exp(\hat{x}_i)}{\sum_j \exp(\hat{x}_j)}.$$
We will evaluate these and further choices of discrete decoders, described in Appendix D. We recommend choosing the decoder distribution that best suits the structure of the data, such as discrete decoders for discrete data and continuous decoders for continuous data.
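The categorical loss for a single pixel and channel can be sketched as follows. This is an illustrative plain-Python version; the function name and the use of a list of 256 unnormalized scores are assumptions made for the example.

```python
import math

def categorical_nll(logits, intensity):
    # -ln softmax(logits)[intensity], computed with the log-sum-exp trick
    # so that large scores do not overflow.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[intensity]
```

For uniform scores the loss equals ln 256 for every intensity, which is the cost of encoding one 8-bit value with no information.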

4. OPTIMAL VARIANCE ESTIMATION FOR CALIBRATED GAUSSIAN DECODERS

In this section, we propose a simple but novel analytic way of obtaining a calibrated decoder for continuous distributions that further improves performance. The Gaussian decoders with learned variance described in Section 3.1 are calibrated and work better than naïve unit variance decoders. However, for a σ-VAE optimized with gradient descent or Adam (Kingma & Ba, 2015), we observe that careful learning rate tuning can yield significantly better performance, which is in line with prior work that reported poor performance of gradient descent for optimizing Gaussian distributions (Amari, 1998; Peters & Schaal, 2008). A smaller learning rate often produces better performance, but slows down the training, as the likelihood values $p(x|z)$ will be very suboptimal in the beginning. Instead, here we propose a closed-form solution for the value of $\sigma$ that does not require gradient descent. The maximum likelihood estimate of the variance given a known mean is the average squared distance from the mean:
$$\sigma^{*2} = \arg\max_{\sigma^2} \mathcal{N}(x|\mu, \sigma^2 I) = \mathrm{MSE}(x, \mu), \quad (5)$$
where $\mathrm{MSE}(x, \mu) = \frac{1}{D}\sum_i (x_i - \mu_i)^2$. Eq. 5 can be easily shown using manual differentiation, and is a generalization of the fact that the MLE estimate of the variance is the sample variance. The optimal variance for the decoder distribution under the maximum likelihood criterion is then simply the average MSE loss over the data and the encoder distribution. We leverage this to create an optimal analytic solution for the variance. In the batch setting, the optimal variance would be simply the MSE loss, and can be updated after every gradient update for the other parameters of the decoder. In the mini-batch setting, we use a batchwise estimate of the variance computed for the current minibatch. We analyze these approximations in Appendix C. At test time, a running average of the variance over the training data is used.
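For completeness, the differentiation step behind Eq. 5 can be written out as follows. This is a standard derivation added for illustration, using the same symbols as Section 3.1.

```latex
\begin{align*}
-\ln \mathcal{N}(x \mid \mu, \sigma^2 I)
  &= \frac{1}{2\sigma^2}\lVert x - \mu \rVert^2 + \frac{D}{2}\ln \sigma^2 + c, \\
\frac{\partial}{\partial \sigma^2}\left[-\ln \mathcal{N}(x \mid \mu, \sigma^2 I)\right]
  &= -\frac{1}{2\sigma^4}\lVert x - \mu \rVert^2 + \frac{D}{2\sigma^2} = 0 \\
\Longrightarrow \quad \sigma^{*2}
  &= \frac{1}{D}\lVert x - \mu \rVert^2 = \mathrm{MSE}(x, \mu).
\end{align*}
```

The derivative is negative for $\sigma^2 < \sigma^{*2}$ and positive for $\sigma^2 > \sigma^{*2}$, so the stationary point is the minimum of the negative log-likelihood.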
This method, which we call optimal σ-VAE, allows us to learn very efficiently as we use the optimal variance estimate at every training step. It is also easier to implement, as no separate optimizer for the variance parameter is needed. If the variance is not needed at test time, it can also be simply discarded after training.
Per-image optimal σ-VAE. Optimal σ-VAE uses a single variance value shared across all data points. However, the optimal σ-VAE also allows more powerful variance estimates, such as learning a variance value for each pixel, or even a variance value for each image, the difference in implementation simply being the dimensions across which the averaging in Equation 5 operates. This approach can be interpreted as variational variance prediction in the framework of Stirn & Knowles (2020).
We now provide an empirical analysis of different decoding distributions, and validate the benefits of our σ-VAE approach. We use a small convolutional VAE model on SVHN (Netzer et al., 2011), a larger hierarchical HVAE model (Maaløe et al., 2019) on the CelebA (Liu et al., 2015) and CIFAR (Krizhevsky et al., 2009) datasets, and a sequence VAE model called SVG (Denton & Fergus, 2018) on the BAIR Pushing dataset (Finn & Levine, 2017). We evaluate the ELBO values as well as visual quality measured by the Fréchet Inception Distance (FID; Heusel et al., 2017). Images are 28 × 28 for SVHN and 32 × 32 for CelebA and CIFAR, while video experiments were performed on 64 × 64 frames following Denton & Fergus (2018). We do not use KL annealing as it did not improve the results in our experiments. Further experimental details are in Appendix B.
Figure caption: Higher values of β cause the images to lose detail, while lower values of β might make samples unrealistic. The proposed optimal σ-VAE is able to learn the balance end-to-end, here converging to an equivalent of β-VAE with β = 0.006.

5.1. DO CALIBRATED DECODERS BALANCE THE VAE OBJECTIVE WITHOUT TUNING β?

As detailed in Section 3.1, a β-VAE with a unit variance Gaussian decoder commonly used in prior work is equivalent to a σ-VAE with constant, manually tuned variance. There is a simple relationship between β and the variance: $\sigma^2 = \beta/2$. To compare the variance that the σ-VAE learns to the manually tuned variance in the case of the β-VAE, we compare the ELBO values and the corresponding values of β in Table 1. We find that learning the variance produces similar values of β to the manually tuned values in the β-VAE case, indicating that the σ-VAE is able to learn the balance between the two objective terms in a single training run, without hyperparameter tuning. Moreover, the σ-VAE outperforms the best β-VAE run. This is because end-to-end learning produces better estimates of the variance than is possible with manual search, improving the likelihood (as measured by the lower bound) and the visual quality. Figure 3 shows the qualitative results from this experiment.
We further validate our results on both single-image and sequential VAE models on a range of datasets in Table 2 and Figure 2. Single-sample ELBO values are reported, and ELBO values on discretized data are reported for discrete distributions. We see that learning a shared variance in a Gaussian decoder (shared σ-VAE) outperforms the naïve unit variance decoder (Gaussian VAE) as well as tuning the β constant for the Gaussian VAE manually. We also see that calibrated discrete decoders, such as the full categorical distribution or the mixture of discretized logistics, perform better than the naïve Gaussian VAE. Using the Bernoulli distribution by treating the color intensities as probabilities (Gregor et al., 2015; Watter et al., 2015) performs poorly. Our results further improve upon the sequence VAE method of Denton & Fergus (2018), which uses a unit variance Gaussian with the β-VAE objective.

5.2. HOW DOES LEARNING CALIBRATED DECODERS IMPACT THE LATENT VARIABLE INFORMATION CONTENT?

We saw above that calibrated decoders result in higher log-likelihood bounds. Are calibrated decoders also beneficial for representation learning? We evaluate the mutual information $I_e(x; z)$ between the data $p_d(x)$ and encoder samples $q(z|x)$, as well as the mismatch between the prior $p(z)$ and the marginal encoder distribution $m(z) = \mathbb{E}_{p_d(x)} q(z|x)$, measured by the marginal KL $D_{KL}(m(z)\,\|\,p(z))$. These terms are related to the rate term of the VAE objective as follows (Alemi et al., 2017):
$$\mathbb{E}_{p_d(x)}\left[D_{KL}(q(z|x)\,\|\,p(z))\right] = \mathbb{E}_{p_d(x)}\left[D_{KL}(q(z|x)\,\|\,m(z))\right] + D_{KL}(m(z)\,\|\,p(z)) = I_e(x; z) + D_{KL}(m(z)\,\|\,p(z)).$$
That is, the rate term decomposes into the true mutual information and the marginal KL term. We want to learn expressive latent variables with high mutual information. However, doing so by tuning the β value relaxes the constraint that the encoder and the prior distributions match, and leads to degraded quality of samples from the prior, which creates a trade-off between expressive representations and the ability to generate good samples. To compare the β-VAE and σ-VAE in terms of these quantities, we estimate the marginal KL term via Monte Carlo sampling, as proposed by Rosca et al. (2018), and plot the results in Figure 4. As expected, we see that lower β values lead to higher mutual information. However, after a certain point, lower values of β also cause a significant mismatch between the marginal and the prior distributions. By calculating the "effective" β for the σ-VAE, as per Section 4, we can see that the σ-VAE captures an inflection point in the $D_{KL}(m(z)\,\|\,p(z))$ term, learning a representation with the highest possible MI, but without degrading sample quality. This explains the high visual quality of the optimal σ-VAE samples: since the marginal and the prior distributions match, the samples from the prior look similar to reconstructions, while for a β-VAE with low β, the samples from the prior are poor.
We see that, in contrast to the β-VAE, where the mutual information is controlled by a hyperparameter, the σ-VAE can adjust the appropriate amount of information automatically and is able to find the setting that produces both informative latents and high quality samples. An alternative discussion of tuning β is presented by Alemi et al. (2017), who show that β controls the rate-distortion trade-off. Here, we show that the crucial trade-off also controlled by β is the trade-off between two components of the rate itself, which control expressivity of representations and the match between the variational and the prior distributions, respectively.
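The decomposition of the rate into mutual information and marginal KL can be checked numerically on a toy problem. The sketch below uses a hypothetical discrete latent with three states and two data points (all numbers are made up for illustration), so that every term can be computed exactly without the Monte Carlo estimation used in the experiments.

```python
import math

def kl(p, q):
    # KL divergence between two discrete distributions given as lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

p_data = [0.5, 0.5]                      # toy data distribution over two points
q = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]   # toy encoder q(z|x) for each point
prior = [1 / 3, 1 / 3, 1 / 3]            # uniform prior p(z)

# Marginal encoder distribution m(z) = E_{p_d(x)} q(z|x).
m = [sum(p_data[i] * q[i][k] for i in range(2)) for k in range(3)]

rate = sum(p_data[i] * kl(q[i], prior) for i in range(2))
mutual_info = sum(p_data[i] * kl(q[i], m) for i in range(2))
marginal_kl = kl(m, prior)
```

Here `rate` equals `mutual_info + marginal_kl` up to floating-point error, mirroring the identity above.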

5.3. WHAT ARE THE COMMON CHALLENGES IN LEARNING THE VARIANCE THAT PREVENT PRACTITIONERS FROM USING IT, AND HOW TO RECTIFY THEM?

If learning the decoder variance improves generation, why are learned variances not used more often? In this section, we discuss how the naïve approach to learning variances, where the decoder outputs a variance for each pixel along with the mean, leads to poor results. First, we find that this method often diverges very quickly due to numerical instability, as the network is able to predict certain pixels with very high certainty, leading to degenerate variances. In contrast, learning a shared variance is always numerically stable in our experiments. We can rectify this numerical instability by bounding the output variance (Section 3.1). However, even with bounded variance, we observe that learning per-pixel variances leads to poor results in Table 2. While the per-pixel variance achieves a good ELBO value, it produces very poor samples, as measured by FID and visual inspection. We see that the specific form of learned variance (a shared variance, a per-image variance, or a per-pixel variance) can lead to very different performance in practice. We hypothesize that the per-pixel decoder performs poorly because it incentivizes the model to focus on particular pixels that can be predicted well, instead of focusing equally on all parts of the image. This is consistent with prior work on denoising diffusion models, which noted that likelihood-based models place too much focus on imperceptible details, which leads to deteriorated results (Ho et al., 2020). The shared and per-image variance models mitigate this issue at the cost of introducing more bias, and work better in practice.

5.4. CAN AN ANALYTIC SOLUTION FOR OPTIMAL VARIANCE FURTHER IMPROVE LEARNING?

We evaluate the optimal σ-VAE which uses an analytic solution for the variance (Section 4). Table 2 shows that it achieves superior results in terms of log-likelihood. We also note that the optimal σ-VAE converges to a good variance estimate instantaneously, which speeds up learning (highlighted in Figure 9 in the Appendix). In addition, we evaluate the per-image optimal σ-VAE, in which a single variance is computed per image. This model achieves significantly higher visual quality. While producing this per-image variance with a neural network would require additional architecture tuning, optimal σ-VAE is extremely simple to implement (it can be implemented simply as changing the axes of summation), not requiring any new tunable parameters.
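The remark that the per-image variant only changes the axes of averaging can be illustrated with a small sketch. The function names and the nested-list layout (a batch as a list of flattened images) are assumptions made for this example; a tensor library would express the same thing with a different reduction axis.

```python
def optimal_variance_shared(batch, recon):
    # One variance for the whole mini-batch:
    # average squared error over all images and all pixels.
    n = sum(len(x) for x in batch)
    total = sum((a - b) ** 2 for x, r in zip(batch, recon) for a, b in zip(x, r))
    return total / n

def optimal_variance_per_image(batch, recon):
    # One variance per image:
    # average squared error over the pixels of each image only.
    return [sum((a - b) ** 2 for a, b in zip(x, r)) / len(x)
            for x, r in zip(batch, recon)]

batch = [[0.0, 1.0], [1.0, 1.0]]
recon = [[0.0, 0.0], [1.0, 0.5]]
shared = optimal_variance_shared(batch, recon)        # 0.3125
per_image = optimal_variance_per_image(batch, recon)  # [0.5, 0.125]
```

In this toy batch the first image is reconstructed worse than the second, so the per-image estimate assigns it a larger variance, while the shared estimate is the average of the two.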

6. CONCLUSION

We presented a simple and effective method for learning calibrated decoders, as well as an evaluation of different decoding distributions with several VAE and sequential VAE models. The proposed method outperforms methods that use naïve unit variance Gaussian decoders and tune a heuristic weight β on the KL-divergence loss, as commonly done in prior work. Moreover, it does not use the heuristic weight β, making it easier to train than this prior work. We expect that the simple techniques for learning calibrated decoders can allow practitioners to speed up the development cycle, obtain better results, and reduce the need for manual hyperparameter tuning. 

A ADDITIONAL EXPERIMENTAL RESULTS

In this section, we provide more qualitative results in Figures 5, 6, 7, and 8, as well as a graph showing the convergence properties of the variance for different models in Fig. 9. In order to validate our method with a different architecture, we also report the performance of different decoders with a small 5-layer convolutional architecture on the CelebA and CIFAR datasets in Table 3. We see that the ordering of the methods is consistent with this smaller architecture.

B EXPERIMENTAL DETAILS

For the small convolutional network test on SVHN, the encoder has 3 convolutional layers followed by a fully connected layer, while the decoder has a fully connected layer followed by 3 convolutional layers. The β was tuned from 100 to 0.0001 for β-VAE. The number of channels in the convolutional layers starts with 32 and doubles in every layer. The dimension of the latent variable is 20. Adam (Kingma & Ba, 2015) with a learning rate of 1e-3 is used for optimization. A batch size of 128 was used and all models were trained for 10 epochs. We additionally evaluate this small convolutional network on the CelebA, CIFAR, and Frey Face datasets in Table 3. A unit Gaussian prior and Gaussian posteriors with diagonal covariance were used. For the larger hierarchical VAE, we used the official PyTorch implementation of Maaløe et al. (2019). We use the baseline hierarchical VAE with 15 layers of latent variables, without the top-down and bottom-up connections. For the hierarchical VAE and the SVG-LP model, we use the default hyperparameters in the respective implementations. We use the standard train-val-test split for all datasets. All models were trained on a single high-end GPU. We use the official PyTorch implementation of the Inception network to compute FID. All methods are compared on the same hyperparameters.
Under review as a conference paper at ICLR 2021

C EMPIRICAL ANALYSIS OF APPROXIMATIONS FOR OPTIMAL σ-VAE

The optimal σ-VAE requires computing the following estimate of the variance σ * = arg max σ E x∼Data E q(z|x) ln p(x|µ θ (z), σ 2 I) = E x∼Data E q(z|x) MSE(x, µ θ (z)). This requires computing two expectations, with respect to the data in the dataset, and with respect to the encoder distribution. We use MC sampling with one sample per data point to approximate both expectations. Inspired by common practices in VAEs, we use one sample per data point to approximate the inner expectation. On SVHN, the standard error of this approximation is 0.26% of the value of sigma. We further approximate the outer expectation with a single batch instead of the entire dataset. On SVHN, the standard error of this approximation is 2% of the value of sigma. We see that both approximations are accurate in practice. The second approximation yields a biased estimate of the evidence lower bound because the same batch is used to approximate the variance and compute the lower bound estimate. However, this bias can be corrected by using a different batch, or with a running average of the variance with an appropriate decay. This running average can also be used to reduce the variance of the estimate and to achieve convergence guarantees, but we did not find it necessary in our experiments. We describe the alternative decoders evaluated in Table 2 : using the bitwise-categorical, and the logistic mixture distributions.

D ALTERNATIVE DECODER CHOICES

Bitwise-categorical VAE. While the 256-way categorical decoder described in Section 3.2 is very powerful due to its ability to specify any possible intensity distribution, it suffers from high computational and memory requirements. Because 256 values need to be kept for each pixel and channel, simply keeping this distribution in memory for one 3-channel 1024 × 1024 image would require 3 GiB of memory, compared to 0.012 GiB for the Gaussian decoder. Therefore, training deep neural networks with this full categorical distribution is impractical for high-resolution images or videos. The bitwise-categorical VAE improves the memory complexity by defining the distribution over 256 values in a more compact way. Specifically, it defines a binary distribution over each bit in the pixel intensity value, requiring 8 values in total, one for each bit. This distribution can be thought of as a classifier that predicts the value of each bit in the image separately. In our implementation of the bitwise-categorical likelihood, we convert the image channels to binary format and use the standard binary cross-entropy loss (which reduces to the negative binary log-likelihood, since all bits in the image are deterministically either zero or one). While in our experiments the bitwise-categorical distribution did not outperform other choices, it often performs on par with our proposed method. We expect this distribution to be useful due to its generality, as it is able to represent values stored in any digital format by converting them into binary.

Logistic mixture VAE. For this decoder, we adapt the discretized logistic mixture from Salimans et al. (2017). To define a discrete 256-way distribution, it divides the corresponding continuous distribution into 256 bins, where the probability mass of each bin is defined as the integral of the PDF over that bin. Kingma et al. (2016) use the logistic distribution discretized in this manner for the decoder. Salimans et al. (2017) suggest making all bins except the first and the last of equal size, whereas the first and the last bins include, respectively, the intervals (-∞, 0] and [1, ∞). Salimans et al. (2017) further suggest using a mixture of discretized logistics for improved capacity. Our implementation largely follows that of Salimans et al. (2017); however, we note that the original implementation is not suitable for learning latent variable models, as it generates the channels autoregressively. This causes the latent variable to lose color information, since that information can be represented by the autoregressive decoder instead. We therefore adapt the mixture of discretized logistics to the pure latent variable setup by removing the mean-adjusting coefficients from Salimans et al. (2017). In our experiments, the logistic mixture outperformed other discrete distributions.
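The two decoders described in this appendix, together with the memory comparison, can be illustrated with a short plain-Python sketch. It is not the authors' implementation. All function names are hypothetical, 4-byte floating-point storage is assumed for the memory figures, intensities are assumed to be rescaled to [0, 1] with bin edges at k/256, and only a single logistic component is shown (a mixture would be a weighted sum of such components).

```python
import math
from typing import List, Sequence


def decoder_memory_gib(height: int, width: int, channels: int,
                       values_per_subpixel: int, bytes_per_value: int = 4) -> float:
    """Memory needed to hold the decoder output for one image, in GiB."""
    return height * width * channels * values_per_subpixel * bytes_per_value / 2 ** 30


def to_bits(intensity: int) -> List[int]:
    """8-bit binary expansion of an intensity in 0..255, most significant bit first."""
    return [(intensity >> k) & 1 for k in range(7, -1, -1)]


def bitwise_nll(intensity: int, bit_probs: Sequence[float]) -> float:
    """Binary cross-entropy summed over the 8 bits of one sub-pixel."""
    eps = 1e-12  # guards against log(0)
    return -sum(
        b * math.log(p + eps) + (1 - b) * math.log(1 - p + eps)
        for b, p in zip(to_bits(intensity), bit_probs)
    )


def logistic_cdf(x: float, mean: float, scale: float) -> float:
    """Numerically stable CDF of the logistic distribution."""
    t = (x - mean) / scale
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def discretized_logistic_pmf(mean: float, scale: float, bins: int = 256) -> List[float]:
    """Bin masses of a logistic distribution; the outer bins absorb both tails."""
    interior_edges = [k / bins for k in range(1, bins)]
    cdf = [0.0] + [logistic_cdf(e, mean, scale) for e in interior_edges] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(bins)]


# Memory for one 3-channel 1024 x 1024 image: categorical vs. Gaussian mean.
print(decoder_memory_gib(1024, 1024, 3, 256), decoder_memory_gib(1024, 1024, 3, 1))

# Bitwise likelihood of intensity 5 under uninformative bit probabilities.
print(bitwise_nll(5, [0.5] * 8))

# Discretized logistic centered on bin 100.
pmf = discretized_logistic_pmf(mean=100.5 / 256, scale=0.05)
print(len(pmf), sum(pmf))
```

The bitwise decoder stores 8 values per sub-pixel instead of 256, which is where the memory saving comes from, while the discretized logistic needs only a mean and a scale per component.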



BIVA (Maaløe et al., 2019) uses the mixture of logistics decoder proposed in Salimans et al. (2017), which produces the channels for each pixel autoregressively; see also Appendix D.

The Frey Face dataset is available at https://cs.nyu.edu/~roweis/data.html



Figure 1: Different types of calibrated decoders for Gaussian VAEs; model parameters are denoted with enclosing squares. Left: both the mean µ and the variance σ² are output by a neural network with parameters θ. Center: σ-VAE with shared variance; the mean is output by a neural network with parameters θ, but the variance is itself a global parameter. Right: the proposed optimal σ-VAE; the mean is output by a neural network with parameters θ, and the variance is computed analytically from the training data D.

Figure 2: Images or videos (bottom right) sampled from the proposed optimal σ-VAE and a unit-variance Gaussian VAE. The Gaussian VAE has no means of controlling the expressivity of the latent variable and produces suboptimal, blurry samples. The σ-VAE controls the expressivity by learning a calibrated decoder, and produces higher-quality samples on all datasets.

Analysis of learned variance on SVHN. The parameter β is tuned manually in β-VAE and learned in σ-VAE. σ-VAE achieves better performance, while the value of β (implicitly defined via the decoder variance) automatically converges close to the value found by manual tuning.

Figure 3: Analysis of learned variance on SVHN. The parameter β is tuned manually in β-VAE and learned in σ-VAE. Higher values of β cause the images to lose detail, while lower values of β might make samples unrealistic. The proposed optimal σ-VAE is able to learn the balance end-to-end, here converging to an equivalent of β-VAE with β = 0.006.

Figure 4: Comparison of β-VAE and σ-VAE on SVHN in terms of mutual information I_e(x; z) and marginal KL divergence KL(m(z)||p(z)) (see Sec. 5.2). I_e(x; z) increases with lower β, yielding expressive representations and better reconstruction. However, after a certain point, lowering β leads to a rapid increase in the marginal KL, yielding poor samples from the prior. The σ-VAE is able to automatically find the inflection point after which the marginal KL begins to increase, capturing as much information as possible while still producing good samples.

Figure 5: Samples from the σ-VAE (left) and the Gaussian VAE (right) on the SVHN dataset. The Gaussian VAE produces blurry results with muted colors, while the σ-VAE is able to produce accurate images of digits.

Figure 6: Samples from the σ-VAE (left) and the Gaussian VAE (right) on the CelebA dataset, images cropped to the face for clarity. The Gaussian VAE produces blurry results with indistinct face features, while the σ-VAE is able to produce accurate images of faces.

Figure 7: Samples from the σ-VAE (top) and the Gaussian VAE (bottom) on the BAIR dataset. Sampled sequences conditioned on two initial frames are shown, and the ground truth sequence is shown at the top. The Gaussian VAE produces blurry robot arm texture and the arm often disappears towards the end of the sequence, while the σ-VAE is able to produce sequences with realistic motion and model the details of the arm texture, such as the gripper.

Figure 8: Samples from the σ-VAE (left) and the Gaussian VAE (right) on the challenging CIFAR dataset. The Gaussian VAE produces blurry results with muted colors, while the σ-VAE models the distribution of shapes in the CIFAR data more faithfully.

Figure 9: Variance convergence speed on SVHN. We see that the shared σ-VAE, which optimizes the variance with gradient descent, has an initial period during which the variance converges to the region of the optimal value. In contrast, the σ-VAE with analytical (optimal) variance quickly learns a good estimate of the variance, which leads to better performance. The unit-variance Gaussian β-VAE can be interpreted as having a constant variance determined by β, shown here. Since the variance does not change throughout training, it achieves suboptimal performance.

Prior work on variational autoencoders has studied a number of different decoder parameterizations. Kingma & Welling (2014); Rezende et al. (2014) use the Bernoulli distribution for the binary MNIST data and Kingma & Welling (

Generative modeling performance of the proposed σ-VAE on different models and datasets. For SVG, we compare with the original method (Denton & Fergus, 2018), which uses β-VAE. We see that uncalibrated decoders such as the mean-only Gaussian perform poorly. β-VAE allows calibrating the decoder but needs careful hyperparameter tuning. Calibrated decoders such as the categorical decoder or σ-VAE perform best. [1] Gregor et al. (2015), [2] Takahashi et al. (2018), [3] Higgins et al. (2017).

Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. ICLR, 2016.

Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pp. 2746-2754, 2015.

Generative modeling performance of the proposed σ-VAE on CelebA, CIFAR, and Frey Face with a smaller model. We see that uncalibrated decoders such as the mean-only Gaussian perform poorly. β-VAE allows calibrating the decoder but needs careful hyperparameter tuning. Calibrated decoders such as the categorical decoder or σ-VAE perform best.

ELBO on discretized data. All distributions except the categorical have scalar scale parameters. The σ-VAE performs well on the discretized ELBO metric, performing similarly to a discrete distribution parametrized as a discretized Gaussian or discretized logistic. The full categorical distribution attains the highest likelihood due to having the most statistical power.

