AR-ELBO: PREVENTING POSTERIOR COLLAPSE INDUCED BY OVERSMOOTHING IN GAUSSIAN VAE

Abstract

Variational autoencoders (VAEs) often suffer from posterior collapse, a phenomenon in which the learned latent space becomes uninformative. This is often related to local optima of the objective function introduced by a fixed hyperparameter resembling the data variance. We suggest that this variance parameter regularizes the VAE and affects its smoothness, i.e., the magnitude of its gradient. An inappropriate choice of this parameter causes oversmoothness and leads to posterior collapse. We show this theoretically by analyzing a linearly approximated objective function, and empirically in the general case. We propose AR-ELBO, which stands for adaptively regularized ELBO (Evidence Lower BOund). It controls the strength of regularization by adapting the variance parameter and thus avoids oversmoothing the model. Generative models trained with the proposed objectives show improved Fréchet inception distance (FID) on images generated from the MNIST and CelebA datasets.

1. INTRODUCTION

The variational autoencoder (VAE) framework (Kingma & Welling, 2014; Higgins et al., 2017; Zhao et al., 2019) is a popular approach to generative modeling in machine learning. In this framework, a model that approximates the true posterior of the observed data is learned by jointly training an encoder and a decoder, which together form a stochastic mapping between the observed data and a learned deep latent space. The latent space is assumed to follow a prior distribution. A new data sample is generated by sampling from the latent space and passing the sample through the decoder. It is common to assume that both the prior on the latent space and the posterior of the observed data follow Gaussian distributions; this setup is known as the Gaussian VAE. In this case, the variance of the decoder output is usually modeled as an isotropic matrix σ_x² I with a scalar parameter σ_x² ≥ 0. Furthermore, to deal with the intractable log-likelihood of the true posterior, the evidence lower bound (ELBO) (Jordan et al., 1999) is adopted as the objective function instead.

While VAE-based generative models are usually considered more stable and easier to train than generative adversarial networks (Goodfellow et al., 2014), they often suffer from posterior collapse (Bowman et al., 2015; Sønderby et al., 2016; Alemi et al., 2017; Xu & Durrett, 2018; He et al., 2019; Razavi et al., 2019a; Ma et al., 2019), in which the latent space carries little information about the input data. The phenomenon is commonly described as "the posterior collapses to the prior in the latent space" (Razavi et al., 2019a). Recently, several works have suggested that the variance parameter σ_x² in the ELBO is strongly related to posterior collapse. For example, Lucas et al. (2019) analyzed posterior collapse through the analysis of a linear VAE.
It revealed that an inappropriate choice of σ_x² introduces sub-optimal local optima and causes posterior collapse. Moreover, Lucas et al. (2019) reveal that, contrary to popular belief, these local optima are introduced not by replacing the log-likelihood with the ELBO but by an excessively large σ_x². On the other hand, it can be shown that fixing σ_x² to an excessively small value leads to under-regularization of the decoder, which can cause overfitting. Nevertheless, in most implementations of a Gaussian VAE, the variance parameter σ_x² is a fixed constant, usually 1.0, regardless of the input data. In another line of work, Dai & Wipf (2019) proposed a two-stage VAE and treated σ_x² as a trainable parameter. Besides an inappropriate choice of the variance parameter, posterior collapse can also be induced by other causes. For example, Dai et al. (2020) found that small nonlinear perturbations introduced in the network architecture can also result in extra sub-optimal local minima. In this work, however, we keep our focus on the variance parameter.

We suggest that σ_x² affects the strength of regularization over the gradient magnitude of the decoder. We call the expected gradient magnitude the smoothness throughout this paper: the smaller the gradient magnitude, the smoother the model. In particular, we focus on the local smoothness of the model, i.e., the smoothness evaluated within the neighborhood of the encoded latent variable of the observed data. Thus, we begin with the following hypothesis:

Main Hypothesis. The value of σ_x² controls the regularization strength of the smoothness of the decoder. Therefore, an excessively large σ_x² causes oversmoothness, which results in posterior collapse.

Following the hypothesis, the estimation of σ_x² should be related to properties of the approximated posterior of the latent space, such as its local smoothness.
We start by analyzing how σ_x² regularizes the local smoothness of the stochastic decoder and then propose new objective functions that inherently determine σ_x² via maximum likelihood estimation (MLE). The proposed objective function is named AR-ELBO (Adaptively Regularized ELBO); it controls the regularization strength via σ_x². Furthermore, several variations are derived for different parameterizations of the variance parameter. Our main contributions are as follows:

1. We show that our main hypothesis holds for the linearly approximated ELBO and empirically holds in the general case in Section 3. This also suggests that the variance parameter σ_x² should be estimated from properties of the approximated posterior instead of being treated as a hyperparameter.

2. We propose AR-ELBO, an ELBO-based objective function that adaptively regularizes the smoothness of the decoder by MLE of the variance parameter σ_x², in Section 4. Variations of AR-ELBO for several variance parameterizations of the posterior distribution are also derived. AR-ELBO prevents the model from posterior collapse induced by oversmoothing and improves the quality of generation, as shown in Section 5.

The organization of this paper is as follows. In Section 2, we propose a mathematical definition of posterior collapse in the form of mutual information, which subsumes the conventional description that "the posterior collapses to the prior in the latent space". In Section 3, we give the theoretical analysis and empirical support for the main hypothesis, showing that σ_x² affects the smoothness of the decoder via the variance parameters of the latent space learned by the encoder. In Section 4, we propose the new AR-ELBO objective functions for various variance parameterizations of the posterior distribution, which relieve the decoder from being oversmoothed during training and prevent posterior collapse.
These objective functions no longer include any hyperparameters and adaptively estimate the variance parameter σ_x² from the observed data. It should be noted that if we adaptively determine σ_x² with the proposed AR-ELBO, the strength of regularization of the decoder smoothness gradually decreases as training progresses. In Section 5, we conduct experiments on the MNIST and CelebA datasets, which show that the standard Gaussian VAE equipped with the proposed AR-ELBO is competitive with many other VAE variants in most situations. Throughout this paper, we use a, a, and A for a scalar, a column vector, and a matrix, respectively, and ln and log denote the natural and common logarithms. Our code is publicly available.

2. POSTERIOR COLLAPSE IN GAUSSIAN VAE

We begin with the standard formulation of the Gaussian VAE, which is the foundation of our research, and propose a definition of posterior collapse using mutual information (MI).

2.1. GAUSSIAN VAE

Consider a data space X ⊂ R^{d_x} and a sample set {x_i}_{i=1}^N ⊂ X, where x_i ∼ p_data(x). The empirical distribution p̂_data(x) on X is given by p̂_data(x) = (1/N) Σ_{n=1}^N δ(x − x_n), where δ(·) denotes the Dirac delta function. In the standard VAE framework, a latent space Z ⊂ R^{d_z} is learned, and sampled latent variables z ∈ Z are used to generate data samples x ∈ X. Let q_φ(z|x) and p_θ(x|z) denote the stochastic encoder and decoder, respectively, whose trainable neural-network parameters are φ and θ. The decoder generates data samples through p_θ(x) := E_{p(z)}[p_θ(x|z)], where p(z) is the prior distribution on Z. The encoder and decoder are jointly trained by minimizing the following objective function:

L = −E_{p̂_data(x)}[ln p_θ(x)] + E_{p̂_data(x)}[D_KL(q_φ(z|x) ‖ p_θ(z|x))] + E_{p̂_data(x)}[ln p̂_data(x)]
  = D_KL(p̂_data(x) ‖ p_θ(x)) + E_{p̂_data(x)}[D_KL(q_φ(z|x) ‖ p_θ(z|x))].   (1)

This objective function was derived in Zhao et al. (2019); it represents everything in the form of Kullback–Leibler divergences and is equivalent to ELBO maximization up to an additive constant. In the context of the Gaussian VAE, the encoder and decoder are assumed to satisfy

q_φ(z|x) = N(z | μ_φ(x), diag(σ²_φ(x))),   p_θ(x|z) = N(x | μ_θ(z), σ_x² I),   (2)

and the prior p(z) is also assumed to be Gaussian, p(z) = N(z | 0, I). Substituting (2) into (1) while omitting terms independent of θ and φ leads to the following objective:

J_{σ_x²}(θ, φ) = E_{p̂_data(x)}[ (1/(2σ_x²)) E_{q_φ(z|x)}[‖x − μ_θ(z)‖₂²] + D_KL(q_φ(z|x) ‖ p(z)) ],   (3)

which can be interpreted as the sum of the expected reconstruction loss and a regularization term. For a Gaussian prior and posterior, the regularization term equals (1/2) Σ_{i=1}^{d_z} (σ²_{φ,i}(x) + μ_{φ,i}(x)² − ln σ²_{φ,i}(x) − 1).
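As a concrete reference, the objective (3) with the closed-form KL term can be sketched as follows. This is a minimal NumPy sketch of ours (function and variable names are assumptions, not from the paper's code), using a single Monte Carlo sample of z per input:

```python
import numpy as np

def gaussian_vae_objective(x, mu_x, mu_z, log_var_z, sigma_x_sq=1.0):
    """Batch estimate of the objective J in Eq. (3):
    reconstruction term weighted by 1/(2*sigma_x^2) plus the closed-form
    KL divergence between N(mu_z, diag(var_z)) and the prior N(0, I).

    x, mu_x: (batch, d_x) data and decoder means (one z sample per x).
    mu_z, log_var_z: (batch, d_z) encoder outputs.
    """
    recon = np.sum((x - mu_x) ** 2, axis=1) / (2.0 * sigma_x_sq)
    var_z = np.exp(log_var_z)
    kl = 0.5 * np.sum(var_z + mu_z ** 2 - log_var_z - 1.0, axis=1)
    return np.mean(recon + kl)
```

With a perfect reconstruction and the posterior equal to the prior, both terms vanish; shifting the posterior mean adds exactly the closed-form KL penalty.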

2.2. POSTERIOR COLLAPSE

Posterior collapse is a major problem in which the encoder maps inputs to the latent space while ignoring the data distribution. In this phenomenon, the MI between the input data and the data reconstructed through the encoder–decoder path is reduced because the latent space holds less information about the data distribution. We therefore suggest the following definition of posterior collapse.

Definition 1. Posterior collapse is defined as the MI I(x; x′) becoming nearly zero, where x′ := μ_θ(z) with z ∼ q_φ(z|x).

In many works (Bowman et al., 2015; Sønderby et al., 2016; Alemi et al., 2017; He et al., 2019; Razavi et al., 2019a), the phenomenon denoted as posterior collapse has been mathematically represented as E_{p̂_data(x)}[D_KL(q_φ(z|x) ‖ p(z))] → 0, which we hereafter refer to as KL collapse (Xu & Durrett, 2018). However, posterior collapse is not always caused by a diminished KL divergence; i.e., posterior collapse with E_{p̂_data(x)}[D_KL(q_φ(z|x) ‖ p(z))] ≠ 0 can occur. The proposed definition of posterior collapse includes KL collapse, by the following theorem, which is proven in Appendix A.

Theorem 2. I(x; x′) → 0 as E_{p̂_data(x)}[D_KL(q_φ(z|x) ‖ p(z))] → 0 holds for any p_θ(x|z).

In Appendix E, it is demonstrated that posterior collapse can happen even if the KL divergence is nonzero, when the posterior variance is fixed in Z.

3. VARIANCE PARAMETER σ_x² AND THE LOCAL SMOOTHNESS

In this section, we provide mathematical and empirical support for the main hypothesis. Throughout this section, we adopt the following simplified parameterization for the encoder: q_{φ,σ_z²}(z|x) = N(z | μ_φ(x), σ_z² I), where the variance is parameterized as an isotropic matrix, unlike in the conventional VAE. A similar analysis of the conventional VAE can be found in Appendix B. We first show that the choice of σ_x² affects the convergence point of σ_z², the variance parameter of the latent space. We then show that σ_z² acts as the weight of a gradient penalty that is implicitly included in (3). This supports the main hypothesis that the over-regularization imposed by a large σ_x² via σ_z² causes oversmoothness of the decoder and leads to posterior collapse. The hypothesis is also supported empirically by observing the tendencies of the convergence point of σ_z², the smoothness, and the MI I(x; x′). Ultimately, this evidence motivated us to develop a method that adapts σ_x² to prevent oversmoothing of the decoder.

3.1. REGULARIZATION EFFECT OF σ_x² IN THE LINEARLY APPROXIMATED ELBO

The effect of σ_x² on the convergence point of the variance parameter σ_z² can be observed in two extreme cases, σ_x² → 0+ and σ_x² → ∞. In the first case, J_{σ_x²} reduces to E_{p̂_data(x)}[E_{q_{φ,σ_z²}(z|x)}[‖x − μ_θ(z)‖₂²]], which becomes zero only if σ_z² = 0. In the second case, J_{σ_x²} reduces to E_{p̂_data(x)}[D_KL(q_{φ,σ_z²}(z|x) ‖ p(z))], and σ_z² becomes 1 at the minimum point, from D_KL(q_{φ,σ_z²}(z|x) ‖ p(z)) = (d_z/2)(σ_z² − ln σ_z² − 1) + (1/2)‖μ_φ(x)‖₂². This shows that a small σ_x² makes σ_z² converge to a value around 0, while a large σ_x² makes σ_z² converge to a value around 1. If σ_z² is sufficiently small once training has progressed to a certain extent, the perturbed decoding process μ_θ(z + ε_z) around z = μ_φ(x) with ε_z ∼ N(ε_z | 0, σ_z² I) can be approximated as a linear function.
By using the linear approximation of μ_θ(·) and omitting terms independent of θ and φ, the ELBO-based objective can be approximated as

J_{σ_x²}(θ, φ, σ_z²) ≈ (1/(2σ_x²)) E_{p̂_data(x)}[ ‖x − μ_θ(μ_φ(x))‖₂² + σ_z² ‖∇μ_θ(μ_φ(x))‖_F² + σ_x² ‖μ_φ(x)‖₂² ].   (4)

In the approximation above, ‖·‖_F is the Frobenius norm and σ_z² is treated as a function parameter. The derivation can be found in Appendix B. Equation (4) decomposes the objective function into three terms: a reconstruction error term, a gradient penalty term, and an L2 regularization term. As one can see from (4), σ_z² regularizes the smoothness of the decoder by penalizing its gradient norm during training. Although the linear approximation above is derived for the simplified VAE parameterization, we also provide the linear approximation of the ELBO for the standard VAE parameterization (2) in Appendix B, where the second term in (4) becomes a weighted gradient penalty.

Summarizing the above observations, σ_x² affects the smoothness via σ_z², while σ_z² directly regularizes the smoothness. This means that an excessively large σ_x² causes over-regularization of the decoder and suppresses I(z; x′) (≥ I(x; x′)), which finally leads to posterior collapse. To avoid such over-regularization, σ_x² and σ_z² must be determined appropriately. In addition, an experiment shows that posterior collapse can be triggered by directly manipulating σ_z², as discussed in Appendix E.
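For a linear decoder, the decomposition in (4) is exact rather than approximate, which gives an easy numerical check of the gradient-penalty term. The following NumPy sketch is our own illustration (the linear decoder μ_θ(z) = Wz and all names are assumptions):

```python
import numpy as np

def expected_perturbed_mse(x, W, mu, sigma_z_sq, n=20000, rng=0):
    """Monte Carlo estimate of E_{eps ~ N(0, sigma_z^2 I)} ||x - W(mu + eps)||^2
    for a linear decoder mu_theta(z) = W z."""
    rng = np.random.default_rng(rng)
    eps = rng.normal(scale=np.sqrt(sigma_z_sq), size=(n, mu.size))
    recon = (mu + eps) @ W.T          # decoded perturbed codes, shape (n, d_x)
    return np.mean(np.sum((x - recon) ** 2, axis=1))

def linearized_value(x, W, mu, sigma_z_sq):
    """Closed form predicted by the decomposition in Eq. (4): the
    reconstruction error plus sigma_z^2 times the squared Frobenius norm."""
    return np.sum((x - W @ mu) ** 2) + sigma_z_sq * np.sum(W ** 2)
```

Running both on the same (x, W, mu) shows the Monte Carlo value matching the reconstruction-plus-gradient-penalty form, illustrating how σ_z² weights the penalty.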

3.2. EMPIRICAL STUDY ON SMOOTHNESS OF DECODER IN THE GENERAL CASE

Section 3.1 showed the impact of σ_x² on the regularization of the decoder smoothness through the linearly approximated objective function. To support the main hypothesis in the general case, we conduct an experiment on the MNIST dataset (LeCun et al., 1998). Several criteria are assessed to provide evidence for the regularization effect of σ_x² on the decoder smoothness and its consequent effect on the MI I(x; x′). To confirm that σ_x² affects the smoothness via σ_z², we run the experiment in two settings: stochastic encoding and deterministic encoding. The former uses the stochastic encoder q_{φ,σ_z²}(z|x); the latter uses a VAE with a deterministic encoder, i.e., σ_z² is fixed to zero during training. The difference between the two settings provides empirical support for Section 3.1. To isolate the relation between σ_x² and the smoothness of the decoder, common generalization techniques such as batch normalization (Ioffe & Szegedy, 2015; Santurkar et al., 2018) and weight decay are not used.

Criteria. We use several criteria to observe the impact of σ_x², namely the reconstruction error (MSE), the KL divergence value E_{p̂_data(x)}[D_KL(q_φ(z|x) ‖ p(z))], and the final converged value of σ_z². In addition, we estimate the local smoothness of the decoder and the MI between x and x′, denoted I(x; x′). To assess the local smoothness, we introduce the expected local smoothness (ELS), a lower bound of the Lipschitz constant of the decoder. Consider a sample decoded with perturbation, μ_θ(μ_φ(x) + ε_z), where the perturbation follows a zero-mean Gaussian distribution with variance s_z², ε_z ∼ N(ε_z | 0, s_z² I). Let ε_z and ε′_z be i.i.d. random variables. We define the expected gap ∆²(s_z²) as

∆²(s_z²) := E_{p̂_data(x)} E_{N(ε_z|0,s_z²I) N(ε′_z|0,s_z²I)}[∆²(x, ε_z, ε′_z)],   (5)

with ∆²(x, ε_z, ε′_z) := ‖μ_θ(μ_φ(x) + ε_z) − μ_θ(μ_φ(x) + ε′_z)‖₂². As s_z² decreases, the ratio ∆²(s_z²)/(2s_z²) converges and becomes an estimator of the quantity regularized by σ_z² in (4). We therefore define the ELS as

ELS := E_{p̂_data(x)}[‖∇μ_θ(μ_φ(x))‖_F²].   (6)

Further details on the ELS are described in Appendix C. As a reference, we estimate an upper bound of I(x; x′), namely I(x′; z), by Monte Carlo estimation.

Table 1: Evaluation of various criteria for different σ_x²: the expected value of ‖x − x′‖₂² (MSE), the KL divergence, the converged value of σ_z², the upper bound I(x′; z) of the MI, the expected gap (with perturbation variances s_z² set to 10⁻² and 10⁻³), and the expected local smoothness (ELS).

Results. Table 1 summarizes the results for different σ_x². In the stochastic encoding case, a larger σ_x² consistently leads to a larger σ_z². This results in a smaller expected gap, a smaller ELS, and a lower upper bound of the MI, supporting the main hypothesis that a larger σ_x² makes the decoder smoother. In the case σ_x² = 1.0, all the criteria except the MSE become nearly zero; here KL collapse and posterior collapse both occur due to over-regularization of the smoothness of the latent space. In contrast, in the deterministic encoding case, the ELS increases with σ_x², because σ_x² does not directly regularize the decoder via the gradient penalty as in (4). As a result, the MI upper bound does not shrink to zero even at σ_x² = 1.0, where posterior collapse occurs under stochastic encoding. The difference between the two cases clearly suggests that a large σ_x² triggers oversmoothness via σ_z², which is consistent with the discussion in Section 3.1. These results provide empirical support for the main hypothesis. Further details and image examples are shown in Appendix D.
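The expected-gap ratio ∆²(s_z²)/(2s_z²) used above as an ELS proxy can be estimated by straightforward Monte Carlo sampling. The following is our own illustrative NumPy sketch (not the paper's code); for a linear decoder μ_θ(z) = Wz the ratio equals ‖W‖_F² exactly in expectation, which makes it easy to sanity-check:

```python
import numpy as np

def expected_gap_ratio(decoder, z_codes, s_sq, n_pairs=2000, rng=None):
    """Monte Carlo estimate of Delta^2(s_z^2) / (2 s_z^2) from Eq. (5),
    which for small s_z^2 approximates the ELS E[||grad mu_theta||_F^2].

    decoder: maps (m, d_z) codes to (m, d_x) outputs.
    z_codes: (m, d_z) encoded means mu_phi(x) of the evaluation samples.
    """
    rng = np.random.default_rng(rng)
    m, d_z = z_codes.shape
    total = 0.0
    for _ in range(n_pairs):
        eps = rng.normal(scale=np.sqrt(s_sq), size=(m, d_z))
        eps2 = rng.normal(scale=np.sqrt(s_sq), size=(m, d_z))
        gap = decoder(z_codes + eps) - decoder(z_codes + eps2)
        total += np.mean(np.sum(gap ** 2, axis=1))
    return total / n_pairs / (2.0 * s_sq)
```

In practice one would pass the trained decoder network and a batch of encoded test samples, evaluating the ratio at a small s_z² such as 10⁻³.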

3.3. DIFFICULTY OF CHOOSING AN APPROPRIATE σ_x²

According to the previous sections, a large σ_x² causes oversmoothness. Therefore, we consider fixing σ_x² to a sufficiently small value to avoid the problem. We arrive at the following theorem, whose proof can be found in Appendix F.

Theorem 3. Consider the global optimum of J_{σ_x²}(θ, φ, σ_z²) w.r.t. a given σ_x². If σ_x² → 0, then σ_z² → 0.

In Theorem 3, J_{σ_x²}(θ, φ, σ_z²) is optimized on the basis of the true data distribution instead of the empirical data distribution. According to the theorem, σ_z² converges to zero as σ_x² approaches zero, which leads to a vanishing gradient penalty for the decoder as VAE training progresses. In practice, we have no access to p_data(x), only to the empirical distribution p̂_data(x); Theorem 3 still holds when p_data(x) is replaced with p̂_data(x). In this case, when σ_x² is chosen to be small, optimizing J_{σ_x²} fits p_{θ,σ_x²}(x) to the empirical distribution p̂_data(x), which usually results in overfitting. As shown above, choosing an appropriate σ_x² that avoids both oversmoothness and overfitting is nontrivial. Moreover, it is likely that σ_x² should be adapted to the state of training. Therefore, it is natural to adapt σ_x² instead of fixing it, as described in the next section.
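Both extremes can be made concrete with a small calculation. Treating the decoder gradient norm g² = ‖∇μ_θ(μ_φ(x))‖_F² as fixed and minimizing the gradient-penalty term of (4) together with the σ_z²-dependent part of the KL divergence over σ_z² is an illustrative simplification of ours (not a result stated in the paper), but it yields a closed form that reproduces both limits: σ_z² → 0 as σ_x² → 0 (Theorem 3) and σ_z² → 1 as σ_x² → ∞.

```python
import numpy as np

def optimal_sigma_z_sq(sigma_x_sq, grad_norm_sq, d_z):
    """Minimizer over sigma_z^2 > 0 of the simplified objective
        sigma_z^2 * g^2 / (2 * sigma_x^2)
        + (d_z / 2) * (sigma_z^2 - ln sigma_z^2 - 1),
    i.e., the gradient-penalty term of Eq. (4) plus the sigma_z^2 part
    of the KL term. Setting the derivative to zero gives the closed form
        sigma_z^2 = d_z * sigma_x^2 / (d_z * sigma_x^2 + g^2)."""
    return d_z * sigma_x_sq / (d_z * sigma_x_sq + grad_norm_sq)
```

The objective is convex in σ_z² (its second derivative is (d_z/2)/σ_z⁴ > 0), so the stationary point is the global minimum.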

4. ADAPTIVELY REGULARIZED ELBO

A modified ELBO-based objective function is proposed that can be interpreted as an implicit scheme simultaneously updating σ_x² and the remaining parameters. We also derive corresponding objective functions for models with different variance parameterizations.

4.1. ELBO WITH ADAPTIVE σ_x²

In this subsection, we optimize the VAE objective function (1) w.r.t. all the parameters including σ_x², which is usually fixed in existing implementations. Following the derivation of (3) but keeping the terms related to σ_x², we arrive at

J(θ, φ, σ_x²) = E_{p̂_data(x)}[ (1/(2σ_x²)) E_{q_φ(z|x)}[‖x − μ_θ(z)‖₂²] + D_KL(q_φ(z|x) ‖ p(z)) ] + (d_x/2) ln σ_x².   (7)

From the partial derivative of J w.r.t. σ_x², the MLE of σ_x², denoted σ̂_x², can be evaluated with the other parameters fixed. The network parameters θ and φ can then be updated by optimizing (7) with the variance σ_x² fixed. This combination of MLE and alternating updates between (θ, φ) and σ_x² guarantees that (i) if θ and φ are fixed, there exists σ̂_x² such that J(θ, φ, σ̂_x²) ≤ J(θ, φ, σ_x²), and (ii) for the σ̂_x² obtained in the previous step, there exist θ̂ and φ̂ such that J(θ̂, φ̂, σ̂_x²) ≤ J(θ, φ, σ̂_x²). In this respect, the convergence of the optimization is assured, and the parameter σ_x² is kept at its MLE throughout training. This inspired us to develop a weight-scheduling scheme for σ̂_x², leading to a modified ELBO-based objective function. Consider the trainable network parameters (θ, φ) and the variance parameter σ_x². The update of the objective J(θ, φ, σ_x²) is divided as

σ_x^{2(t+1)} = (1/d_x) E_{p̂_data(x)} E_{q_{φ^{(t)}}(z|x)}[‖x − μ_{θ^{(t)}}(z)‖₂²],   (8a)
(θ^{(t+1)}, φ^{(t+1)}) = argmin_{θ,φ} J_{σ_x^{2(t+1)}}(θ, φ),   (8b)

where t is the iteration index. The update of (θ, φ) in (8b) is the same as in the standard VAE; the update of σ_x² in (8a) can be interpreted as determining an appropriate balance between the reconstruction error and the KL term in J_{σ_x²}(θ, φ). As learning progresses, the parameter σ_x² decreases along with the MSE E_{p̂_data(x)} E_{q_φ(z|x)}[‖x − μ_θ(z)‖₂²], which is consistent with the discussion in Dai & Wipf (2019).
Proposed objective function (AR-ELBO). The update scheme above can be further simplified by substituting (8a) into (7), which converts J(θ, φ, σ̂_x²) into

J_AR(θ, φ) = (d_x/2) ln E_{p̂_data(x)} E_{q_φ(z|x)}[‖x − μ_θ(z)‖₂²] + E_{p̂_data(x)}[D_KL(q_φ(z|x) ‖ p(z))],   (9)

where all terms constant w.r.t. the parameters are omitted. Optimizing (9) also keeps σ_x² at its MLE during VAE training. Moreover, (9) is equivalent to the standard Gaussian VAE objective combined with the weight balancing of (8a). This relieves the VAE from the imbalance between the KL divergence term and the reconstruction loss. Also, as stated by Theorem 3, decreasing σ_x² also decreases σ_z², which gradually relaxes the regularization of the ELS (6), as can be observed from (4). However, this eventually diminishes the gradient penalty; we therefore suggest using early stopping and learning-rate scheduling in this situation, which can give the decoder both appropriate smoothness and generalization capability.
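The Iso-I form of AR-ELBO in (9) and the MLE update (8a) can be sketched as follows. This is a minimal NumPy sketch of ours (names and batch conventions are assumptions), again with one Monte Carlo sample of z per input:

```python
import numpy as np

def ar_elbo_iso_i(x, mu_x, mu_z, log_var_z):
    """AR-ELBO for the Iso-I parameterization, Eq. (9):
    (d_x/2) * ln E[||x - mu_theta(z)||^2] + E[KL(q(z|x) || N(0, I))].
    mu_x holds the decoded means mu_theta(z) for one z sample per input."""
    d_x = x.shape[1]
    mse = np.mean(np.sum((x - mu_x) ** 2, axis=1))   # E ||x - mu_theta(z)||^2
    var_z = np.exp(log_var_z)
    kl = np.mean(0.5 * np.sum(var_z + mu_z ** 2 - log_var_z - 1.0, axis=1))
    return 0.5 * d_x * np.log(mse) + kl

def sigma_x_sq_mle(x, mu_x):
    """MLE of sigma_x^2 from Eq. (8a): mean squared error divided by d_x."""
    return np.mean(np.sum((x - mu_x) ** 2, axis=1)) / x.shape[1]
```

Substituting the MLE back into (7) reproduces (9) up to the additive constant (d_x/2)(1 − ln d_x), which is one way to check the two functions against each other numerically.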

4.2. OBJECTIVES FOR VARIOUS PARAMETERIZATIONS

In the standard VAE given by (2), the covariance of the decoded distribution on X, denoted Σ_x, is modeled as a scaled identity matrix, Σ_x = σ_x² I. In this case, σ_x² is a scalar, the reconstruction objective is the conventional MSE, and it is minimized as in (9). However, the covariance Σ_x can be parameterized not only as an isotropic matrix but also as a diagonal one, and it can be chosen to be either independent of or dependent on z. We therefore explore three variance parameterizations in addition to (2) and derive the corresponding reconstruction objectives, which are no longer equal to the MSE. We denote the four cases in Table 2 as Iso-I (Isotropic-Independent), Iso-D (Isotropic-Dependent), Diag-I (Diagonal-Independent), and Diag-D (Diagonal-Dependent); the first case, Iso-I, corresponds to the standard variance model Σ_x = σ_x² I. For these parameterizations, the corresponding objectives can be summarized as

J_AR(θ, φ, Σ_x) = J_rec(θ, φ, Σ_x) + E_{p̂_data(x)}[D_KL(q_φ(z|x) ‖ p(z))],   (10a)
J_rec(θ, φ, Σ_x) = (1/2) E_{p̂_data(x)} E_{q_φ(z|x)}[ trace(Σ_x⁻¹ (x − μ_θ(z))(x − μ_θ(z))ᵀ) + ln |Σ_x| ].   (10b)

By setting the partial derivative of J_rec w.r.t. Σ_x to zero, i.e., using its MLE, the reconstruction loss J_rec corresponding to each case can be derived. All derivations can be found in Appendix G. The final reconstruction objectives with the different parameterizations of Σ_x are listed in Table 2.

Table 2: Parameterizations of Σ_x and the corresponding reconstruction objectives J_rec(θ, φ, Σ_x).
(Iso-I)  Σ_x = σ_x² I:        (d_x/2) ln E_{p̂_data(x)} E_{q_φ(z|x)}[‖x − μ_θ(z)‖₂²]
(Iso-D)  Σ_x = σ_x²(z) I:     (d_x/2) E_{p̂_data(x)} E_{q_φ(z|x)}[ln ‖x − μ_θ(z)‖₂²]
(Diag-I) Σ_x = diag(σ_x²):    (1/2) Σ_{i=1}^{d_x} ln E_{p̂_data(x)} E_{q_φ(z|x)}[(x_i − μ_{θ,i}(z))²]
(Diag-D) Σ_x = diag(σ_x²(z)): (1/2) Σ_{i=1}^{d_x} E_{p̂_data(x)} E_{q_φ(z|x)}[ln (x_i − μ_{θ,i}(z))²]
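The four reconstruction objectives of Table 2 differ only in where the expectation and the logarithm are taken. The following NumPy sketch of ours computes all four from a batch of data and decoded means (names are assumptions; the small constant follows the stability suggestion in the text):

```python
import numpy as np

def recon_objectives(x, mu_x, eps=1e-6):
    """The four reconstruction objectives of Table 2, for data x and decoded
    means mu_x, both of shape (batch, d_x), with one z sample per input.
    eps is a small constant added before logarithms for numerical stability
    (suggested in the text for all cases except Iso-I)."""
    d_x = x.shape[1]
    sq = (x - mu_x) ** 2                # per-dimension squared errors
    per_sample = np.sum(sq, axis=1)     # ||x - mu_theta(z)||^2 per sample
    return {
        "Iso-I": 0.5 * d_x * np.log(np.mean(per_sample)),
        "Iso-D": 0.5 * d_x * np.mean(np.log(per_sample + eps)),
        "Diag-I": 0.5 * np.sum(np.log(np.mean(sq, axis=0) + eps)),
        "Diag-D": 0.5 * np.sum(np.mean(np.log(sq + eps), axis=0)),
    }
```

By Jensen's inequality, moving the logarithm inside the expectation can only decrease the value, so the z-dependent variants lower-bound their z-independent counterparts on the same batch.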
It is interesting to note that the reconstruction error for each dimension of the data space must be computed separately for Diag-D, whereas only the MSE over the whole minibatch is needed for Iso-I. For optimization stability in practice, we suggest adding a small constant, e.g., 10⁻⁶, before taking the logarithm, except in the case of Iso-I. Although the proposed objective functions are capable of determining Σ_x appropriately, it should be noted that a gap still exists between the prior p(z) and the aggregated posterior q_φ(z) obtained by the proposed methods. The cause can be observed from the reformulation of (1) (see Appendix H):

L = D_KL(p̂_data(x) ‖ p_θ(x)) + D_KL(p̂_data(x) q_φ(z|x) ‖ q_φ(z) p_θ(x|z)) + D_KL(q_φ(z) ‖ p(z)).   (11)

The first two terms in (11) can eventually dominate VAE training. As a consequence, generating by sampling latent variables from the prior can produce off-distribution samples. To overcome this prior–posterior mismatch, at least two approaches can be adopted: (i) conduct another posterior estimation after the ordinary VAE training (van den Oord et al., 2017; Razavi et al., 2019b; Dai & Wipf, 2019; Ghosh et al., 2020; Morrow & Chiu, 2020), or (ii) add another regularizing term to the objective function (Makhzani et al., 2015; Tolstikhin et al., 2018; Zhao et al., 2019). We adopt the former approach, since it is effective (Ghosh et al., 2020) and applicable to any VAE variant. To summarize, the proposed AR-ELBO regularizes the Gaussian VAE with an appropriate weighting of the gradient penalty without the need for an extra hyperparameter, and the remaining mismatch between the prior and the posterior is mitigated by an extra pass of posterior estimation. A detailed discussion comparing the proposed objective functions with previous work can be found in Appendix I.

5. EXPERIMENTS

We compare the proposed methods with the following models: VAE, RAE (Ghosh et al., 2020), WAE-MMD (Tolstikhin et al., 2018), and a plain autoencoder (AE). The quality of generated images is evaluated using the Fréchet inception distance (FID) (Heusel et al., 2017) on the MNIST and CelebA (Liu et al., 2015) datasets with the default train/test split. Regarding the prior–posterior mismatch, three approaches are tested on all models. The first is the conventional approach, which samples latent variables from the prior. The other two are applied after the ordinary training: the second forms an aggregated posterior q_φ(z) with a second-stage VAE (Dai & Wipf, 2019), and the third fits the posterior with a Gaussian mixture model (GMM) with 10 to 100 components (Ghosh et al., 2020). The baseline is the standard VAE, for which two methods of choosing σ_x² are tested: (i) σ_x² fixed to 1.0, as in common implementations, and (ii) σ_x² learned with (7) by an optimizer as a usual trainable parameter (Dai & Wipf, 2019). The RAE objective function is the sum of the reconstruction error, a regularization of the decoder, and an L2 regularization of the latent space; in RAE-GP, a gradient penalty is used as the decoder regularization. The objective function of RAE-GP is equivalent to (4) except that the weighting parameters are determined manually. For WAE-MMD, inverse multiquadric kernels with seven scales are used, as proposed by Tolstikhin et al. (2018). Note that both WAE and RAE have hyperparameters in their objective functions, whereas the other methods, including the proposed objective functions, have none. Regarding the major factor that affects the smoothness of the decoder, the proposed methods and the VAE control smoothness through regularizing terms in their objective functions, while WAE and AE rely only on the network architecture and generalization techniques.
The latent space dimensions for MNIST and CelebA were set to d_z = 16 and 64, respectively, consistent with Ghosh et al. (2020). A common network architecture, adopted from Chen et al. (2016) and described in Appendix J, is used for all models. In Table 3, we report two evaluation results for each method: (i) the MSE of the reconstructed test data and (ii) the FID of generated images. The proposed methods other than Iso-I (Diag-I, Iso-D, and Diag-D) do not necessarily achieve the best MSE, since the MSE is no longer their reconstruction loss. When z ∈ Z is sampled from the prior, WAE achieves a low FID due to its relatively strong regularization of the aggregated posterior with MMD. The images generated by the proposed method achieve the best FID score on MNIST and are competitive on CelebA. It should be noted that the σ_x² values learned by Iso-I on MNIST and CelebA are 0.0056 and 0.0050, respectively, which are much smaller than 1.0. Examples of reconstructed and generated images are shown in Appendix K. As shown in Table 3, different parameterizations of the variance can affect the FID greatly. To clearly observe the advantage of estimating Σ_x by MLE rather than as a usual trainable parameter, we examined both approaches on all four parameterizations (Iso-I, Iso-D, Diag-I, and Diag-D): (i) solve (10b) by MLE as in (9) (AR-ELBO); (ii) simply treat Σ_x as a trainable parameter (Dai & Wipf, 2019). The comparison can be read from the bottom eight rows of Table 3, which show that applying AR-ELBO improves the FID scores in most cases. Furthermore, to examine the usefulness of the learned latent spaces, we also evaluated the FID scores of images generated by interpolating between two latent variables with ratios in [0, 1] for all the models above. The proposed method still shows the best FID on MNIST and is competitive on CelebA.
This result suggests the feasibility of the proposed method for downstream tasks. Details and image samples can be found in Appendix L.

6. CONCLUSION

We analyzed the posterior collapse phenomenon in the Gaussian VAE and investigated how strongly the variance parameter affects the local smoothness of the decoder. The relation between the variance parameter and the local smoothness was examined both theoretically and empirically. We proposed optimization schemes that regulate the local smoothness appropriately, which prevents posterior collapse due to oversmoothness. The proposed AR-ELBO implicitly optimizes the variance parameter to avoid over-regularizing the smoothness. In addition, we proposed several parameterizations (Iso-D, Diag-I, Diag-D) of the posterior variance, which extend the conventional VAE (Iso-I), and derived the corresponding AR-ELBOs. Our experiments show that the Gaussian VAE equipped with the proposed objective functions is competitive with other state-of-the-art models in terms of FID for both generated and interpolated images. Moreover, the proposed method remains stable under the most complicated parameterization (Diag-D). In this work, the prior–posterior mismatch was handled by extra posterior estimation methods; we would like to seek a more thorough solution in the future.

A PROOF OF THEOREM 2

Let x be the input sample. We denote its corresponding latent vector by z and the reconstructed sample by x′. We have the relation

I(x; z) ≥ I(x; x′),   (12)

which can be proved similarly to the proof of Lemma 5 in Appendix F. On the other hand, I(x; z) can be evaluated from the definition of the MI as

I(x; z) = D_KL(p̂_data(x) q_φ(z|x) ‖ p̂_data(x) q_φ(z))
       = E_{p̂_data(x)} E_{q_φ(z|x)}[ln q_φ(z|x) − ln q_φ(z)]
       = E_{p̂_data(x)}[D_KL(q_φ(z|x) ‖ p(z))] − D_KL(q_φ(z) ‖ p(z))   (13)
       ≤ E_{p̂_data(x)}[D_KL(q_φ(z|x) ‖ p(z))],   (14)

where I(x; z), D_KL(q_φ(z|x) ‖ p(z)), and D_KL(q_φ(z) ‖ p(z)) are all non-negative. Inequalities (12) and (14) complete the proof.

B LINEAR APPROXIMATION OF THE ELBO-BASED OBJECTIVE J_σ²x

We start by parameterizing the encoder following the assumption in (2). Given a sufficiently small perturbation εz with p(εz) = N(εz|0, diag(σ²φ(x))), the linear approximation of µθ(·) at µφ(x) can be represented as

$$
\mu_\theta(\mu_\phi(x) + \epsilon_z) = \mu_\theta(\mu_\phi(x)) + J_{\mu_\theta}(\mu_\phi(x))\,\epsilon_z, \tag{15}
$$

where $J_{\mu_\theta}(\mu_\phi(x))$ denotes the Jacobian matrix of $\mu_\theta(z)$ at $z = \mu_\phi(x)$. Substituting (15) into (3) leads to

$$
\begin{aligned}
\mathbb{E}_{q_\phi(z|x)}\big[\|x - \mu_\theta(z)\|_2^2\big]
&= \mathbb{E}_{p(\epsilon_z)}\big[\|x - (\mu_\theta(\mu_\phi(x)) + J_{\mu_\theta}(\mu_\phi(x))\,\epsilon_z)\|_2^2\big] \\
&= \|x - \mu_\theta(\mu_\phi(x))\|_2^2
 + \mathbb{E}_{p(\epsilon_z)}\big[\epsilon_z^\top J_{\mu_\theta}(\mu_\phi(x))^\top J_{\mu_\theta}(\mu_\phi(x))\,\epsilon_z\big] \\
&\quad - 2\underbrace{\mathbb{E}_{p(\epsilon_z)}\big[(x - \mu_\theta(\mu_\phi(x)))^\top J_{\mu_\theta}(\mu_\phi(x))\,\epsilon_z\big]}_{=0},
\end{aligned} \tag{16}
$$

where the last term vanishes because the perturbation εz has zero mean. The expectation in the second term can be evaluated as

$$
\begin{aligned}
\mathbb{E}_{p(\epsilon_z)}\big[\epsilon_z^\top J_{\mu_\theta}(\mu_\phi(x))^\top J_{\mu_\theta}(\mu_\phi(x))\,\epsilon_z\big]
&= \mathrm{trace}\Big(\mathbb{E}_{p(\epsilon_z)}\big[\epsilon_z \epsilon_z^\top\big]\, J_{\mu_\theta}(\mu_\phi(x))^\top J_{\mu_\theta}(\mu_\phi(x))\Big) \\
&= \mathrm{trace}\Big(\mathrm{diag}(\sigma_\phi^2(x))\, J_{\mu_\theta}(\mu_\phi(x))^\top J_{\mu_\theta}(\mu_\phi(x))\Big) \\
&= \sum_{i=1}^{d_x} \sum_{j=1}^{d_z} \sigma_{\phi,j}^2(x) \left( \frac{\partial \mu_{\theta,i}(z)}{\partial z_j}\bigg|_{z=\mu_\phi(x)} \right)^2,
\end{aligned} \tag{17}
$$

which can be interpreted as a gradient penalty on the decoder weighted by σ²φ(x). Substituting the above result into (3), the linear approximation of the objective can be obtained as

$$
J_{\sigma_x^2}(\theta, \phi) \approx \frac{1}{2\sigma_x^2}\,\mathbb{E}_{p_{\mathrm{data}}(x)}\left[ \|x - \mu_\theta(\mu_\phi(x))\|_2^2 + \sum_{i=1}^{d_x} \sum_{j=1}^{d_z} \sigma_{\phi,j}^2(x) \left( \frac{\partial \mu_{\theta,i}(z)}{\partial z_j}\bigg|_{z=\mu_\phi(x)} \right)^2 + 2\sigma_x^2 \|\mu_\phi(x)\|_2^2 \right]. \tag{18}
$$

In the case of the simplified parameterization described in Section 3, the perturbation follows a multivariate i.i.d. Gaussian distribution, εz ∼ N(εz|0, σ²zI), and the second term in (18) further reduces to σ²z ‖∇µθ(µφ(x))‖²_F.
Under this assumption, we have

$$
\begin{aligned}
\mathbb{E}_{p(\epsilon_z)}\big[\epsilon_z^\top J_{\mu_\theta}(\mu_\phi(x))^\top J_{\mu_\theta}(\mu_\phi(x))\,\epsilon_z\big]
&= \mathbb{E}_{p(\epsilon_z)}\Big[\epsilon_z^\top \sum_{i=1}^{d_z} \lambda_i\, u_i(x) u_i(x)^\top \epsilon_z\Big] \\
&= \sum_{i=1}^{d_z} \lambda_i\, u_i(x)^\top \mathbb{E}_{p(\epsilon_z)}\big[\epsilon_z \epsilon_z^\top\big]\, u_i(x)
= \sigma_z^2 \sum_{i=1}^{d_z} \lambda_i,
\end{aligned} \tag{19}
$$

where λᵢ is the i-th eigenvalue of $J_{\mu_\theta}(\mu_\phi(x))^\top J_{\mu_\theta}(\mu_\phi(x))$, which is a symmetric positive semi-definite matrix, and $(u_i(x))_{i=1}^{d_z}$ are the corresponding eigenvectors. Following the simplified assumption, the second term in (16) becomes $\mathbb{E}_{\mathcal{N}(\epsilon_z|0,\sigma_z^2 I)}[\epsilon_z^\top J_{\mu_\theta}(\mu_\phi(x))^\top J_{\mu_\theta}(\mu_\phi(x))\,\epsilon_z]$. Combining (19) with the fact that $\sum_{i=1}^{d_z} \lambda_i = \mathrm{trace}\big(J_{\mu_\theta}(\mu_\phi(x))^\top J_{\mu_\theta}(\mu_\phi(x))\big) = \|\nabla\mu_\theta(\mu_\phi(x))\|_F^2$, we finally obtain the following linear approximation for the simplified parameterization:

$$
\mathbb{E}_{\mathcal{N}(\epsilon_z|0,\sigma_z^2 I)}\big[\epsilon_z^\top J_{\mu_\theta}(\mu_\phi(x))^\top J_{\mu_\theta}(\mu_\phi(x))\,\epsilon_z\big]
= \sigma_z^2 \sum_{i=1}^{d_x} \sum_{j=1}^{d_z} \left( \frac{\partial \mu_{\theta,i}(z)}{\partial z_j}\bigg|_{z=\mu_\phi(x)} \right)^2
= \sigma_z^2\, \|\nabla\mu_\theta(\mu_\phi(x))\|_F^2. \tag{20}
$$
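The trace identity underlying this derivation, E[εᵀ Jᵀ J ε] = trace(diag(σ²φ(x)) Jᵀ J), can be checked numerically. The sketch below uses a random matrix as a stand-in for the decoder Jacobian; all dimensions and values are toy choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dz = 6, 3
J = rng.normal(size=(dx, dz))            # stand-in for the decoder Jacobian
sigma2 = rng.uniform(0.1, 1.0, size=dz)  # per-dimension encoder variances

# Closed form: trace(diag(sigma2) J^T J) = sum_j sigma2_j * ||d mu / d z_j||^2
A = J.T @ J
closed = np.trace(np.diag(sigma2) @ A)
column_sum = sum(sigma2[j] * np.sum(J[:, j] ** 2) for j in range(dz))

# Monte Carlo estimate of E[eps^T J^T J eps] with eps ~ N(0, diag(sigma2))
eps = rng.normal(size=(200_000, dz)) * np.sqrt(sigma2)
mc = np.einsum('ni,ij,nj->n', eps, A, eps).mean()

assert abs(closed - column_sum) < 1e-9
assert abs(mc - closed) / closed < 0.02
```

The column-sum form makes the interpretation explicit: each latent dimension contributes its encoder variance times the squared norm of the corresponding Jacobian column.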

C EXPECTED LOCAL SMOOTHNESS OF DECODER

Here, we describe the relation between the expected local smoothness $\mathbb{E}_{p_{\mathrm{data}}(x)}[\|\nabla\mu_\theta(\mu_\phi(x))\|_F^2]$ and the expected gap $\bar\Delta^2(s_z^2)$. First, consider the relation

$$
\Delta^2(x, \epsilon_z', \epsilon_z'') := \|\mu_\theta(\mu_\phi(x) + \epsilon_z') - \mu_\theta(\mu_\phi(x) + \epsilon_z'')\|_2^2
= K_\theta(\mu_\phi(x), \epsilon_z', \epsilon_z'')^2\, \|\epsilon_z' - \epsilon_z''\|_2^2, \tag{21}
$$

with the perturbations following the Gaussian distribution $\mathcal{N}(\epsilon_z|0, s_z^2 I)$. Applying the expectation operator to (21) leads to

$$
\begin{aligned}
\mathbb{E}\big[\Delta^2(x, \epsilon_z', \epsilon_z'')\big]
&= \mathbb{E}_{p(\epsilon_z', \epsilon_z'')}\big[K_\theta(\mu_\phi(x), \epsilon_z', \epsilon_z'')^2\, \|\epsilon_z' - \epsilon_z''\|_2^2\big] \qquad (22) \\
&\le \mathbb{E}_{p(\epsilon_z', \epsilon_z'')}\big[K_\theta(\mu_\phi(x), \epsilon_z', \epsilon_z'')^2\big]\; \mathbb{E}_{p(\epsilon_z', \epsilon_z'')}\big[\|\epsilon_z' - \epsilon_z''\|_2^2\big] \qquad (23) \\
&=: 2\,\bar K_\theta^2(\mu_\phi(x), s_z^2)\, d_z s_z^2, \qquad (24)
\end{aligned}
$$

where $p(\epsilon_z', \epsilon_z'') := \mathcal{N}(\epsilon_z'|0, s_z^2 I)\,\mathcal{N}(\epsilon_z''|0, s_z^2 I)$, so that $\epsilon_z' - \epsilon_z'' \sim \mathcal{N}(0, 2 s_z^2 I)$, and $\bar K_\theta^2(\mu_\phi(x), s_z^2) := \mathbb{E}_{p(\epsilon_z', \epsilon_z'')}\big[K_\theta(\mu_\phi(x), \epsilon_z', \epsilon_z'')^2\big]$. Note that in (23), we assume that $K_\theta^2(\mu_\phi(x), \cdot, \cdot)$ is independent of εz' and εz''. Consider the case where the variance s²z is sufficiently small to approximate µθ(z) linearly around z = µφ(x). In such a case, Kθ(µφ(x), εz', εz'') is indeed independent of εz' and εz'', which fits the assumption in (23). Under this local-linearity assumption, $\bar K_\theta^2(\mu_\phi(x), s_z^2)$ is bounded as $\bar K_\theta^2(\mu_\phi(x), s_z^2) \le K_\theta^2$, where Kθ denotes the Lipschitz constant of the decoder. Following the assumption, $\bar K_\theta^2(\mu_\phi(x), s_z^2)$ can be formulated by invoking (15) as

$$
\begin{aligned}
\bar K_\theta^2(\mu_\phi(x), s_z^2)
&= \frac{\mathbb{E}_{p(\epsilon_z', \epsilon_z'')}\big[\|\mu_\theta(\mu_\phi(x)+\epsilon_z') - \mu_\theta(\mu_\phi(x)+\epsilon_z'')\|_2^2\big]}{\mathbb{E}_{p(\epsilon_z', \epsilon_z'')}\big[\|\epsilon_z' - \epsilon_z''\|_2^2\big]} \\
&= \frac{\mathbb{E}_{p(\epsilon_z', \epsilon_z'')}\big[(\epsilon_z' - \epsilon_z'')^\top J_{\mu_\theta}(\mu_\phi(x))^\top J_{\mu_\theta}(\mu_\phi(x))(\epsilon_z' - \epsilon_z'')\big]}{2 d_z s_z^2} \\
&= \frac{\mathrm{trace}\big(\mathbb{E}_{p(\epsilon_z', \epsilon_z'')}\big[(\epsilon_z' - \epsilon_z'')(\epsilon_z' - \epsilon_z'')^\top\big]\, J_{\mu_\theta}(\mu_\phi(x))^\top J_{\mu_\theta}(\mu_\phi(x))\big)}{2 d_z s_z^2} \\
&= \frac{\mathrm{trace}\big(J_{\mu_\theta}(\mu_\phi(x))^\top J_{\mu_\theta}(\mu_\phi(x))\big)}{d_z}. \qquad (26)
\end{aligned}
$$

Applying the expectation operator to (26) leads to

$$
\bar K_\theta^2(s_z^2) := \mathbb{E}_{p_{\mathrm{data}}(x)}\big[\bar K_\theta^2(\mu_\phi(x), s_z^2)\big]
= \frac{\mathbb{E}_{p_{\mathrm{data}}(x)}\big[\mathrm{trace}\big(J_{\mu_\theta}(\mu_\phi(x))^\top J_{\mu_\theta}(\mu_\phi(x))\big)\big]}{d_z}
= \frac{\mathbb{E}_{p_{\mathrm{data}}(x)}\big[\|\nabla\mu_\theta(\mu_\phi(x))\|_F^2\big]}{d_z}. \qquad (27)
$$

Finally, combining (24) and (27) yields the following connection between the expected gap and the expected local smoothness:

$$
\bar\Delta^2(s_z^2) = 2\, \mathbb{E}_{p_{\mathrm{data}}(x)}\big[\|\nabla\mu_\theta(\mu_\phi(x))\|_F^2\big]\, s_z^2.
$$
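The final relation, expected gap = 2 E[‖∇µθ‖²F] s²z, can be sanity-checked with an exactly linear decoder µθ(z) = Wz, for which the local-linearity assumption holds globally. The dimensions and the matrix W below are toy choices of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
dx, dz, s2 = 5, 3, 0.01
W = rng.normal(size=(dx, dz))   # linear decoder: mu_theta(z) = W z

# Monte Carlo expected gap: E ||W(eps' - eps'')||^2, eps', eps'' ~ N(0, s2 I) i.i.d.
e1 = rng.normal(size=(100_000, dz)) * np.sqrt(s2)
e2 = rng.normal(size=(100_000, dz)) * np.sqrt(s2)
gap_mc = np.mean(np.sum(((e1 - e2) @ W.T) ** 2, axis=1))

# Prediction: 2 * ||W||_F^2 * s2 (for a linear decoder, E||grad mu||_F^2 = ||W||_F^2).
gap_pred = 2.0 * np.sum(W ** 2) * s2

assert abs(gap_mc - gap_pred) / gap_pred < 0.03
```

For a nonlinear decoder the same relation holds only approximately, with the approximation improving as s²z shrinks, which is exactly the regime the derivation assumes.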
D EXPERIMENTAL DETAILS FOR SECTION 3.2

D.1 EXPERIMENTAL SETUP

In the experiment, the model is trained with the Adam optimizer with a learning rate of 10⁻³. The dimension of the latent space is set to 8. We run 200 epochs with a minibatch size of 64 for all σ²x. We use the following DNN architectures for the encoder and decoder, respectively:

Encoder: x ∈ R^{1×28×28} → Conv_64 → ReLU [output (64, 14, 14)] → Conv_128 → ReLU [output (128, 7, 7)] → Flatten → FC_1024 → ReLU → FC_16.

Decoder: z ∈ R^8 → FC_1024 → ReLU → FC_{128×7×7} → ReLU → Reshape to (128, 7, 7) → ConvT_64 → ReLU [output (64, 14, 14)] → ConvT_1 → Sigmoid [output (1, 28, 28)].

Here, FC_k, Conv_k, ConvT_k and ReLU denote the fully connected layer mapping to R^k, the convolutional layer mapping to k channels, the transpose convolutional layer mapping to k channels, and the rectified linear unit (ReLU), respectively. The 3-tuple (channels, height, width) represents the output shape of the corresponding layer. In all the Conv_k and ConvT_k layers, 4 × 4 convolutional filters are used with a common stride of (2, 2). Regarding the evaluation criteria, the MSE and KL are evaluated on the training set because the aim of the experiment is to validate the relation between σ²x and the smoothness of the decoder. The upper bound of the MI is obtained by calculating

$$
-\mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_{\phi,\sigma_z^2}(z|x)}\Big[\ln \mathbb{E}_{p_{\mathrm{data}}(x')} \exp\Big(-\frac{\|z - \mu_\phi(x')\|_2^2}{2\sigma_z^2}\Big)\Big] - \frac{d_z}{2}
$$

for each minibatch and then taking the mean, where the batch size is 10,000 for all the evaluations.
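The MI upper-bound estimator above can be sketched with a stable log-sum-exp. The rendering below is our own reading of the estimator (the minibatch of encoder means is synthetic, and the trailing constant d_z/2 is taken as the entropy constant of the d_z-dimensional Gaussian posterior):

```python
import numpy as np

def mi_upper_bound(mu, sigma2_z, rng):
    """For each encoder mean mu_i, sample z ~ N(mu_i, sigma2_z I) and evaluate
    -ln E_j exp(-||z - mu_j||^2 / (2 sigma2_z)) - d_z / 2, averaged over the batch."""
    n, dz = mu.shape
    z = mu + rng.normal(size=(n, dz)) * np.sqrt(sigma2_z)
    # pairwise -||z_i - mu_j||^2 / (2 sigma2_z), then a stable log-mean-exp over j
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    logits = -d2 / (2.0 * sigma2_z)
    m = logits.max(axis=1, keepdims=True)
    log_mean = m[:, 0] + np.log(np.mean(np.exp(logits - m), axis=1))
    return float(np.mean(-log_mean) - dz / 2.0)

rng = np.random.default_rng(2)
spread = rng.normal(size=(256, 8)) * 3.0   # well-separated encoder means
collapsed = np.zeros((256, 8))             # identical means: a collapsed encoder
hi = mi_upper_bound(spread, 0.1, rng)
lo = mi_upper_bound(collapsed, 0.1, rng)
assert hi > lo   # an informative encoder yields a larger bound
```

When all encoder means coincide (collapse), the bound drops to roughly zero; well-separated means push it toward ln(batch size), which matches the intuition that the bound tracks how much z reveals about x.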

D.2 SAMPLES OF GENERATED IMAGES AND T-SNE VISUALIZATION OF LATENT SPACES

Figure 1 shows several images decoded from µφ(x) + εz with εz ∼ N(εz|0, s²zI) for the cases σ²x = 1.0 and 0.1. Posterior collapse can be observed from the blurry images decoded in the stochastic-encoding case with σ²x = 1.0. This is due to the removal of batch normalization, which makes σ²x = 1.0 an inappropriate choice. However, if σ²x is determined or adapted appropriately, such as by the proposed method, posterior collapse does not happen. In the other settings, the tendency of how the image changes with the perturbation is similar, as shown in Table 1.

E EXPERIMENT WITH A FIXED σ²z

From the previous sections, we know that σ²x affects the smoothness of the decoder. It is therefore interesting to see what happens if, conversely, σ²z is fixed while σ²x is optimized. In this experiment, the variance parameter σ²z is fixed while σ²x is optimized with the AR-ELBO (9) under the parameterization in Section 3. The other settings remain the same as those in Section 3.2. We evaluate the numerical results for different σ²z with the criteria listed in Section 3.2. According to Table 5, the tendencies of the expected gap and the ELS show that a large σ²z makes the decoder smoother, which is consistent with the discussion in Section 3.1. However, the tendency of the KL divergence differs from that in Section 3.2. Although a larger σ²z consistently leads to a smaller MI, and eventually the MI collapses to zero, the KL divergence remains far from zero; that is, posterior collapse can happen without KL collapse. This phenomenon can be visually confirmed in the t-SNE plot in Figure 3. The cause can be roughly inferred from the linearly approximated ELBO (4), in which σ²z directly weights the gradient penalty and causes oversmoothness. It should be pointed out that the strength of the L2 regularization in (4) gradually decreases with decreasing σ²x; therefore, it does not dominate the whole objective function. As a result, the mean of the approximated posterior qφ(z) stays far from the mean of the prior p(z) (which is 0), and therefore D_KL(qφ(z) ‖ p(z)) in (13) does not diminish to zero.

F PROOF OF THEOREM 3

According to Theorem 4 in Dai & Wipf (2019), we know that

$$
\lim_{\sigma_x^2 \to 0} \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_{\phi,\sigma_z^2}(z|x)}\big[\|x - \mu_\theta(z)\|_2^2\big] = 0,
$$

which also implies σ̂²x → 0 as σ²x → 0. Here, σ̂²x is estimated through MLE and is given by

$$
\hat\sigma_x^2 = \frac{1}{d_x} \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_{\phi,\sigma_z^2}(z|x)}\big[\|x - \mu_\theta(z)\|_2^2\big].
$$

To prove Theorem 3, we need the following auxiliary theorem.

Theorem 4. In the training stage of the VAE, σ²z → 0 as σ²x → 0.

First, we state two lemmas with proofs.

Lemma 5. In a VAE, I(x; x') ≤ I(z; z_e) always holds, where z_e is the encoded latent variable z_e = µφ(x) with x ∼ p_data(x).

Proof. The data-processing flow of the VAE is x → z_e → z → x', with z_e = µφ(x), z = z_e + εz and x' = µθ(z), where εz ∼ N(εz|0, σ²zI). The MI I(x; z, x') can be represented as

$$
\begin{aligned}
I(x; z, x') &= I(x; x') + I(x; z \mid x') \qquad (32) \\
&= I(x; z) + I(x; x' \mid z). \qquad (33)
\end{aligned}
$$

Since x and x' are conditionally independent given z, it follows that I(x; x'|z) = 0. From the non-negativity of the MI, we have I(x; z) ≥ I(x; x'). Repeating the same procedure for I(x; z_e, z) leads to the proof.

Lemma 6. The MI between x and x' diverges to positive infinity as σ²x → 0, where x' is obtained from x ∼ p_data(x) as x' = µθ(µφ(x) + εz).

Proof. A lower bound of I(x; x') is

$$
\begin{aligned}
I(x; x') &= D_{\mathrm{KL}}\big(p_{\mathrm{data}}(x)\, p_{\theta,\phi}(x'|x)\,\|\,p_{\mathrm{data}}(x)\, p_{\theta,\phi}(x')\big) \\
&= \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{p_{\theta,\phi}(x'|x)}\big[\ln p_{\theta,\phi}(x'|x) - \ln p_{\theta,\phi}(x')\big] \\
&= H\big[p_{\theta,\phi}(x')\big] - \mathbb{E}_{p_{\mathrm{data}}(x)} H\big[p_{\theta,\phi}(x'|x)\big] \\
&\ge H\big[p_{\theta,\phi}(x')\big] - H(\hat\sigma_x^2 I), \qquad (34)
\end{aligned}
$$

where $p_{\theta,\phi}(x'|x) := \mathbb{E}_{q_\phi(z|x)}[p_\theta(x'|z)]$ and $p_{\theta,\phi}(x') := \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}[p_\theta(x'|z)]$. Here, we denote the differential entropy of the Gaussian with covariance $\hat\sigma_x^2 I$ as

$$
H(\hat\sigma_x^2 I) := \frac{1}{2} \ln\big((2\pi e)^{d_x}\, \hat\sigma_x^{2 d_x}\big). \qquad (35)
$$

Since σ̂²x → 0 as σ²x → 0, we have H[p_{θ,φ}(x')] → H[p_data(x)] and H(σ̂²xI) → −∞ in the inequality (34). Therefore, I(x; x') → ∞ as σ²x → 0.

Now we prove Theorem 4. The MI I(z; z_e) satisfies

$$
\begin{aligned}
I(z; z_e) &= D_{\mathrm{KL}}\big(q_\phi(z_e)\, q_{\sigma_z^2}(z|z_e)\,\|\,q_\phi(z_e)\, q_{\phi,\sigma_z^2}(z)\big) \\
&= \mathbb{E}_{q_\phi(z_e)} \mathbb{E}_{q_{\sigma_z^2}(z|z_e)}\big[\ln q_{\sigma_z^2}(z|z_e) - \ln q_{\phi,\sigma_z^2}(z)\big] \\
&= H\big[q_{\phi,\sigma_z^2}(z)\big] - \mathbb{E}_{q_\phi(z_e)} H\big[q_{\sigma_z^2}(z|z_e)\big] \\
&\le H(\Sigma_{\phi,\sigma_z^2}) - H(\sigma_z^2 I)
= \frac{1}{2} \ln \frac{\det(\Sigma_{\phi,\sigma_z^2})}{\sigma_z^{2 d_z}}, \qquad (36)
\end{aligned}
$$

where $\Sigma_{\phi,\sigma_z^2}$ denotes the covariance of $q_{\phi,\sigma_z^2}(z)$. Invoking Lemma 5 and (36) leads to

$$
I(x; x') \le \frac{1}{2} \ln \frac{\det(\Sigma_{\phi,\sigma_z^2})}{\sigma_z^{2 d_z}}.
$$

Now, suppose that σ²z does not converge to zero as σ²x → 0. According to Lemma 6, the left-hand side of the above bound diverges, so det(Σ_{φ,σ²z}) → +∞, which contradicts the fact that q_{φ,σ²z}(z) → p(z). Thus, we must have σ²z → 0 as σ²x converges to zero.
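The differential-entropy expressions used in Lemma 6 and (36) rest on the identity H[N(0, σ²I)] = (d/2) ln(2πeσ²) = −E[ln p(x)], which can be verified by Monte Carlo. The dimensions and variance below are toy choices of ours:

```python
import math
import numpy as np

rng = np.random.default_rng(3)
d, s2 = 4, 0.3

# Sample from N(0, s2 I) and estimate the entropy as -E[ln p(x)].
x = rng.normal(size=(400_000, d)) * math.sqrt(s2)
log_p = -0.5 * np.sum(x**2, axis=1) / s2 - 0.5 * d * math.log(2 * math.pi * s2)
h_mc = -np.mean(log_p)

# Closed form: H = (d/2) ln(2 pi e s2); it diverges to -inf as s2 -> 0.
h_closed = 0.5 * d * math.log(2 * math.pi * math.e * s2)
assert abs(h_mc - h_closed) < 0.01
```

The divergence H → −∞ as the variance shrinks is exactly what drives I(x; x') → ∞ in Lemma 6.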

G DERIVATION OF PROPOSED OBJECTIVES

Here, we derive the objectives listed in Table 2. Consider an arbitrary Σx without any constraint. The MLE of Σx, denoted $\hat\Sigma_x$, is given by

$$
\hat\Sigma_x = \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}\big[(x - \mu_\theta(z))(x - \mu_\theta(z))^\top\big].
$$

This follows from the partial derivative of $\hat J_{\mathrm{rec}}(\theta, \phi, \Sigma_x)$ with respect to Σx,

$$
\frac{\partial \hat J_{\mathrm{rec}}(\theta, \phi, \Sigma_x)}{\partial \Sigma_x}
= \frac{1}{2}\Big( \Sigma_x^{-1} - \Sigma_x^{-1}\, \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}\big[(x - \mu_\theta(z))(x - \mu_\theta(z))^\top\big]\, \Sigma_x^{-1} \Big),
$$

which vanishes at $\Sigma_x = \hat\Sigma_x$. The MLEs and the objectives for the different parameterizations are described in the following.
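That the residual second-moment matrix indeed minimizes the Σx-dependent part of the reconstruction objective can be checked numerically. The sketch below uses synthetic residuals of our own; `j_rec` is the empirical version of ½ E[eᵀΣ⁻¹e] + ½ ln det Σ:

```python
import numpy as np

rng = np.random.default_rng(5)
n, dx = 2000, 4
e = rng.normal(size=(n, dx)) @ rng.normal(size=(dx, dx))  # residuals x - mu_theta(z)

def j_rec(S):
    """Sigma_x-dependent part of the reconstruction objective for covariance S."""
    Sinv = np.linalg.inv(S)
    quad = np.einsum('ni,ij,nj->n', e, Sinv, e).mean()
    return 0.5 * quad + 0.5 * np.log(np.linalg.det(S))

S_mle = e.T @ e / n   # empirical residual second moment (the unconstrained MLE)

# Any positive semi-definite perturbation away from the MLE increases the objective.
for _ in range(5):
    A = rng.normal(size=(dx, dx)) * 0.1
    assert j_rec(S_mle + A @ A.T) > j_rec(S_mle)
```

This is the standard Gaussian maximum-likelihood result: the empirical covariance of the residuals is the unique minimizer over positive-definite matrices.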

G.1 ISO-I

For Iso-I, the MLE of σ²x can be given as

$$
\hat\sigma_x^2 = \frac{1}{d_x} \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}\big[\|x - \mu_\theta(z)\|_2^2\big]. \tag{40}
$$

Substituting (40) into (7) leads to

$$
\begin{aligned}
\hat J_{\mathrm{AR}}(\theta, \phi, \hat\sigma_x^2)
&= \mathbb{E}_{p_{\mathrm{data}}(x)}\Big[ \frac{1}{2\hat\sigma_x^2} \mathbb{E}_{q_\phi(z|x)}\big[\|x - \mu_\theta(z)\|_2^2\big] + D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z)) \Big] + \frac{d_x}{2} \ln \hat\sigma_x^2 \\
&= \frac{d_x}{2} + \mathbb{E}_{p_{\mathrm{data}}(x)}\big[D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))\big]
 + \frac{d_x}{2} \ln \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}\big[\|x - \mu_\theta(z)\|_2^2\big] - \frac{d_x}{2} \ln d_x,
\end{aligned}
$$

where the first and fourth terms are constants and thus omitted in (9).
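The Iso-I reduction can be exercised numerically: for any fixed reconstruction error, the MLE (40) minimizes the σ²x-dependent part of (7), and substituting it reproduces the reduced form above. The batch MSE value below is a made-up illustration:

```python
import math

def j_sigma(mse_mean, s2, dx):
    """sigma_x^2-dependent part of (7): reconstruction term plus log-variance term."""
    return mse_mean / (2.0 * s2) + 0.5 * dx * math.log(s2)

dx = 784
mse_mean = 43.7            # E ||x - mu_theta(z)||^2 on a batch (made-up value)
s2_mle = mse_mean / dx     # the MLE (40)

# Substituting the MLE gives dx/2 + (dx/2) ln(mse_mean) - (dx/2) ln(dx).
reduced = dx / 2 + 0.5 * dx * math.log(mse_mean) - 0.5 * dx * math.log(dx)
assert abs(j_sigma(mse_mean, s2_mle, dx) - reduced) < 1e-8

# The MLE is the minimizer: any other sigma_x^2 gives a larger objective.
for s2 in [s2_mle * 0.5, s2_mle * 2.0, 1.0]:
    assert j_sigma(mse_mean, s2, dx) >= j_sigma(mse_mean, s2_mle, dx)
```

Because the minimizing σ²x shrinks with the MSE, training with the reduced objective automatically tightens the reconstruction weighting as the fit improves.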

G.2 ISO-D

First, substitute Σx = σ²x(z)I into (10b):

$$
\hat J_{\mathrm{rec}}(\theta, \phi, \Sigma_x) = \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}\Big[ \frac{1}{2\sigma_x^2(z)} \|x - \mu_\theta(z)\|_2^2 + \frac{d_x}{2} \ln \sigma_x^2(z) \Big]. \tag{42}
$$

Also, the MLE of σ²x(z) is

$$
\hat\sigma_x^2(z) = \frac{1}{d_x} \|x - \mu_\theta(z)\|_2^2. \tag{43}
$$

Substituting (43) into (42) leads to the reconstruction objective of Iso-D:

$$
\hat J_{\mathrm{rec}}(\theta, \phi, \hat\Sigma_x)
= \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}\Big[ \frac{1}{2\hat\sigma_x^2(z)} \|x - \mu_\theta(z)\|_2^2 + \frac{d_x}{2} \ln \hat\sigma_x^2(z) \Big]
= \frac{d_x}{2} + \frac{d_x}{2}\, \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}\big[\ln \|x - \mu_\theta(z)\|_2^2\big] - \frac{d_x}{2} \ln d_x.
$$

G.3 DIAG-I

First, substitute Σx = diag(σ²x) into (10b):

$$
\hat J_{\mathrm{rec}}(\theta, \phi, \Sigma_x) = \mathbb{E}_{p_{\mathrm{data}}(x)}\Big[ \sum_{i=1}^{d_x} \frac{1}{2\sigma_{x,i}^2} \mathbb{E}_{q_\phi(z|x)}\big[(x_i - \mu_{\theta,i}(z))^2\big] + \sum_{i=1}^{d_x} \frac{1}{2} \ln \sigma_{x,i}^2 \Big]. \tag{46}
$$

Also, the MLE of σ²x,i is

$$
\hat\sigma_{x,i}^2 = \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}\big[(x_i - \mu_{\theta,i}(z))^2\big]. \tag{47}
$$

Substituting (47) into (46) leads to the reconstruction objective for Diag-I:

$$
\hat J_{\mathrm{rec}}(\theta, \phi, \hat\Sigma_x)
= \mathbb{E}_{p_{\mathrm{data}}(x)}\Big[ \sum_{i=1}^{d_x} \frac{1}{2\hat\sigma_{x,i}^2} \mathbb{E}_{q_\phi(z|x)}\big[(x_i - \mu_{\theta,i}(z))^2\big] + \sum_{i=1}^{d_x} \frac{1}{2} \ln \hat\sigma_{x,i}^2 \Big]
= \frac{d_x}{2} + \frac{1}{2} \sum_{i=1}^{d_x} \ln \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}\big[(x_i - \mu_{\theta,i}(z))^2\big].
$$

G.4 DIAG-D

First, substitute Σx = diag(σ²x(z)) into (10b):

$$
\hat J_{\mathrm{rec}}(\theta, \phi, \Sigma_x) = \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}\Big[ \sum_{i=1}^{d_x} \Big( \frac{1}{2\sigma_{x,i}^2(z)} (x_i - \mu_{\theta,i}(z))^2 + \frac{1}{2} \ln \sigma_{x,i}^2(z) \Big) \Big]. \tag{50}
$$

Also, the MLE of σ²x,i(z) is

$$
\hat\sigma_{x,i}^2(z) = (x_i - \mu_{\theta,i}(z))^2. \tag{51}
$$

Substituting (51) into (50) leads to the reconstruction objective for Diag-D:

$$
\hat J_{\mathrm{rec}}(\theta, \phi, \hat\Sigma_x)
= \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}\Big[ \sum_{i=1}^{d_x} \Big( \frac{1}{2\hat\sigma_{x,i}^2(z)} (x_i - \mu_{\theta,i}(z))^2 + \frac{1}{2} \ln \hat\sigma_{x,i}^2(z) \Big) \Big] \qquad (52)
$$
$$
= \frac{d_x}{2} + \frac{1}{2} \sum_{i=1}^{d_x} \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}\big[\ln (x_i - \mu_{\theta,i}(z))^2\big].
$$

H DERIVATION OF (11)

The KL divergence terms of (1) can be represented as

$$
\mathbb{E}_{p_{\mathrm{data}}(x)}\big[D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p_\theta(z|x))\big]
= \mathbb{E}_{p_{\mathrm{data}}(x) q_\phi(z|x)}\big[\ln q_\phi(z|x) - \ln p_\theta(z|x)\big]
= \mathbb{E}_{p_{\mathrm{data}}(x) q_\phi(z|x)}\Big[\ln \frac{p_{\mathrm{data}}(x)\, q_\phi(z|x)}{p(z)\, p_\theta(x|z)}\Big]
$$

and

$$
\mathbb{E}_{p_{\mathrm{data}}(x) q_\phi(z|x)}\big[\ln q_\phi(z) - \ln p(z)\big] = D_{\mathrm{KL}}(q_\phi(z)\,\|\,p(z)). \tag{55}
$$

Substituting the two equations above into (1), L can be reformulated into (11).
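The algebraic reductions for Iso-D and Diag-I can be verified on synthetic residuals: plugging the MLEs (43) and (47) into the full diagonal-Gaussian objective must reproduce the reduced forms exactly. The residual batch below is our own toy data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, dx = 512, 10
err = rng.normal(size=(n, dx)) * rng.uniform(0.5, 2.0, size=dx)  # x - mu_theta(z)

def full(err, s2):
    """Generic diagonal-Gaussian reconstruction objective for per-dim variances s2."""
    return np.mean(np.sum(err**2 / (2 * s2) + 0.5 * np.log(s2), axis=1))

# Diag-I: one variance per pixel, shared across samples; MLE (47) and its reduced form.
s2_diag = np.mean(err**2, axis=0)
reduced_diag = dx / 2 + 0.5 * np.sum(np.log(s2_diag))
assert abs(full(err, s2_diag) - reduced_diag) < 1e-8

# Iso-D: one scalar variance per sample; MLE (43) and its reduced form.
s2_iso = np.mean(err**2, axis=1, keepdims=True)   # sigma_x^2(z) = ||err||^2 / dx
reduced_iso = (dx / 2
               + dx / 2 * np.mean(np.log(np.sum(err**2, axis=1)))
               - dx / 2 * np.log(dx))
assert abs(full(err, s2_iso) - reduced_iso) < 1e-8
```

The broadcasting in `full` handles both cases: a length-d_x vector of shared variances for Diag-I, and an (n, 1) column of per-sample variances for Iso-D.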

I RELATED WORKS

To the best of our knowledge, Lucas et al. (2019) were among the first to suggest that posterior collapse may be caused by a sub-optimal σ²x. In the past, one of the common approaches to dealing with posterior collapse was to anneal the weight of the KL term in the ELBO. The first such attempt was KL annealing (Bowman et al., 2015), which introduced a weighting coefficient on the KL term in the cost function during training. The weighting schedule is determined in advance; e.g., the weight increases monotonically (Bowman et al., 2015; Sønderby et al., 2016) or changes cyclically (Fu et al., 2019) as training progresses. A weighting coefficient also appears in Higgins et al. (2017), where it is interpreted as a hyperparameter that controls the information capacity of the latent space; the suggested value is larger than 1. This is analogous to setting σ²x larger than the MLE value σ̂²x in (7) and (8a), which enforces stronger smoothness in exchange for better latent-space disentanglement. The differences between Higgins et al. (2017) and the proposed method are: (i) σ²x is updated at every minibatch; and (ii) the estimation of σ²x aims to prevent oversmoothness. Shao et al. (2020) proposed ControlVAE, which combines control theory with the VAE and applies PI/PID control to determine the weight on the KL term. Although applying control theory to the weighting of the KL term makes it possible to reflect the status of the optimization, ControlVAE needs extra hyperparameters to be tuned in advance. In contrast, our method can be interpreted as automatic KL annealing that estimates σ²x through MLE without the need to tune an extra hyperparameter. Ghosh et al. (2020) interpreted the stochastic autoencoder with the reparameterization trick as a noise-injection process and proposed replacing this mechanism with an explicitly regularized autoencoder (RAE).
RAE regularizes its decoder in several ways: L2 regularization, a gradient penalty (Gulrajani et al., 2017) and spectral normalization (Miyato et al., 2018). As discussed in Section 3.1, if σ²z is sufficiently small, the ELBO can also be approximately represented as a sum of three losses (4), which correspond to the terms included in the basic RAE objective function. The approximated objective function (4) can be obtained when RAE with a gradient penalty (RAE-GP) is used and tuned appropriately. Dai & Wipf (2019) optimized σ²x using an optimizer, which is the approach most similar to our Iso-I model in that (7) is used as the objective function. Our proposed method provides simplified objective functions that allow the variance parameter to be optimized automatically and guarantee that σ²x decreases as the reconstruction loss decreases, so that the gradient penalty is gradually weakened.



URL hidden due to blind review. We used the PyTorch version of the FID implementation from https://github.com/mseitzer/pytorch-fid for all the models; the result may slightly differ from that obtained with the TensorFlow implementation at https://github.com/bioinf-jku/TTUR.



Figure 1: Images in red boxes are the original images sampled from the MNIST dataset. The images in blue boxes are reconstructed by µθ(µφ(x)). The other images are decoded from neighboring points of µφ(x), perturbed by εz ∼ N(εz|0, s²zI). The latent spaces are also visualized via t-SNE (Maaten & Hinton, 2008) in Figure 2. The dots with different colors represent the latent vectors encoded from images with different labels (digits), and the pink dots are sampling points generated from the prior p(z). As mentioned earlier, to observe the effect of σ²x clearly, we remove batch normalization, which usually helps prevent posterior collapse to a certain extent. As a result, the latent space with σ²x = 1.0 completely collapses and qφ(z) approaches p(z), as shown in Figures 1 and 2(a). In this case, both KL collapse and posterior collapse occur.


Figure 2: Visualization of latent space via t-SNE. Pink dots are sampling points generated from the prior p(z).

Figure 3: Visualization of latent space via t-SNE. Pink dots are sampling points generated from the prior p(z).

Figure 5: Reconstructed images and examples of images generated from the prior and the estimated posterior on CelebA.

Figure 6: Examples of interpolated images.

Table 2: Parameterizations of the posterior variance in X and the corresponding reconstruction objectives.

Table 3: Numerical evaluation on MNIST and CelebA. The MSE of sample reconstruction is evaluated on the test set. The quality of generated samples is measured by FID for three cases: sampling the latent variables from the prior, and from the posterior estimated by a second-stage VAE and by a GMM. Σx is learned as a trainable parameter as in Dai & Wipf (2019); however, that work does not include the Diag-I, Iso-D and Diag-D parameterizations.

Table 4: Evaluation of FID scores of interpolated images on MNIST and CelebA. The interpolation ratio for each image pair is designated as (i) the mid-point; and (ii) a random point between the two.

Table 5: Evaluation of various criteria for different σ²z. The criteria are the expected value of ‖x − x'‖²₂ (MSE), the KL divergence, the upper bound of the MI I(x; z), the expected gap (with the perturbation variance s²z set to 10⁻² and 10⁻³), and the expected local smoothness (ELS).


Comparing our work with these works shows that the proposed AR-ELBO and its variations are also capable of regularizing the Gaussian VAE via weighting of the gradient penalty. In the standard parameterization such as (2), the second term in (4) is a weighted gradient penalty, which makes it possible to regularize each dimension of the latent space differently according to the properties of the input data. In addition, combining the proposed objective functions with the standard Gaussian VAE allows implicit gradient regularization of the decoder at a lower computational cost than explicitly adding the gradient penalty to the objective function, as in RAE.

J DETAILS OF EXPERIMENTAL SETUP IN SECTION 5

In this experiment, the Adam optimizer (Kingma & Ba, 2015) is used, and the maximum number of epochs is set to 100 for MNIST and 70 for CelebA. The learning rates are 0.001 for MNIST and 0.0002 for CelebA, with a minibatch size of 64. All the FID values are evaluated with 10,000 generated samples. For the posterior estimation by the second-stage VAE, we adopt the same encoder and decoder networks as in Dai & Wipf (2019). For GMM fitting, we use the same settings as in Ghosh et al. (2020). Experimental details, including the network architectures for each dataset, are described in the following.

J.1 MNIST

We construct the encoder and decoder for the MNIST dataset using the architecture in Chen et al. (2016), with final decoder output of size (1, 28, 28). In all the Conv_k layers and all the ConvT_k layers except for the last, 5 × 5 convolutional filters with stride (2, 2) are used. The difference between this architecture and the one used in Appendix D.1 is whether batch normalization is applied. Although in the original work of Chen et al. (2016) the discriminator used leaky ReLU (lReLU), we adopt ReLU for the encoder part, which improves the performance for all the models evenly.
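As a quick check of the spatial shapes implied above, the standard convolution output-size formula reproduces the 28 → 14 → 7 halving for 5 × 5, stride-2 convolutions, assuming a padding of 2 (the padding is not stated explicitly in the text and is our assumption):

```python
def conv_out(size, kernel, stride, pad):
    """Output spatial size of a convolution: floor((size + 2*pad - kernel)/stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# Encoder path on MNIST with 5x5 filters, stride 2, assumed padding 2.
s = 28
s = conv_out(s, kernel=5, stride=2, pad=2)
s = conv_out(s, kernel=5, stride=2, pad=2)
assert s == 7

# Appendix D.1 uses 4x4 filters with stride 2; padding 1 gives the same halving.
t = conv_out(28, kernel=4, stride=2, pad=1)
assert t == 14
```

The same arithmetic applies in reverse for the ConvT_k layers, which upsample 7 → 14 → 28 back to the image resolution.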

J.2 CELEBA

The CelebA images are preprocessed with center cropping of 140 × 140 and then resized to 64 × 64, as described in Tolstikhin et al. (2018) and Ghosh et al. (2020). It should be noted that the cropping size differs among previous works, and it markedly affects the FID score. We choose the above cropping size because it is the largest among the related works and appears to be the most difficult case for image generation. The decoder output has size (3, 64, 64). In all the Conv_k layers and all the ConvT_k layers except for the last, 5 × 5 convolutional filters with stride (2, 2) are used. We use ReLU instead of leaky ReLU for the performance reason described in the previous subsection. To fit the size of the input images in our experiment, one extra convolutional layer is added to the encoder, and the channel sizes are twice as large as those in Chen et al. (2016).
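The preprocessing can be sketched in a few lines. The 178 × 218 input size is the standard CelebA resolution, and the nearest-neighbor resize is our own simplification of the resizing step (the paper does not state which interpolation is used):

```python
import numpy as np

def center_crop_resize(img, crop=140, out=64):
    """Center-crop an HxWxC image to crop x crop, then nearest-neighbor resize."""
    h, w = img.shape[:2]
    top, left = (h - crop) // 2, (w - crop) // 2
    patch = img[top:top + crop, left:left + crop]
    idx = (np.arange(out) * crop / out).astype(int)  # nearest-neighbor sampling grid
    return patch[np.ix_(idx, idx)]

img = np.zeros((218, 178, 3), dtype=np.uint8)  # CelebA images are 218 x 178
result = center_crop_resize(img)
assert result.shape == (64, 64, 3)
```

Because the crop (140) nearly fills the image width (178), this choice keeps most of the face and background, which is part of why it is the harder generation setting.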

K EXAMPLES OF RECONSTRUCTED AND GENERATED IMAGES IN SECTION 5

We show examples of reconstructed images and images generated by sampling the learned approximated posterior from the proposed method and other works in Figures 4 and 5.

L INTERPOLATION OF LATENT VARIABLES

This section aims to investigate the feasibility of the learned latent spaces of the methods mentioned in Section 5. If high-quality images can be generated by interpolating the latent variables, the corresponding latent space is more likely to be feasible for other downstream tasks. Therefore, in this section, we evaluate the FID scores of images generated by latent-variable interpolation for the various models mentioned in Section 5. First, we choose 10,000 random pairs of images from both the MNIST and CelebA datasets. The interpolation is done by applying spherical interpolation (Ghosh et al., 2020) in the latent space and then generating the interpolated images with the decoders. Finally, we evaluate the FID of these interpolated images. The experiment is conducted with two setups of mixing ratios: (i) a fixed ratio of 0.5, i.e., the mid-point of the two latent variables; and (ii) a uniformly distributed random ratio in [0, 1] for each image pair. The results are shown in Table 4, where the proposed method achieves the best score on MNIST and is competitive on CelebA. This suggests that a generative model with proper smoothness achieved via the proposed method is also feasible for other applications such as interpolation, and possibly applicable to other semantic controls.

