QUANTITATIVE UNDERSTANDING OF VAE AS A NON-LINEARLY SCALED ISOMETRIC EMBEDDING

Abstract

Variational autoencoder (VAE) estimates the posterior parameters (mean and variance) of latent variables corresponding to each input data point. While it is used for many tasks, the transparency of the model is still an open issue. This paper provides a quantitative understanding of the VAE's properties by interpreting VAE as a non-linearly scaled isometric embedding. According to rate-distortion theory, the optimal transform coding is achieved by using a PCA-like orthonormal transform where the transform space is isometric to the input. From this analogy, we show theoretically and experimentally that VAE can be mapped to an implicit isometric embedding with a scale factor derived from the posterior parameters. As a result, we can estimate the data probabilities in the input space from the prior, the loss metric, and the corresponding posterior parameters. In addition, the quantitative importance of each latent variable can be evaluated like an eigenvalue of PCA.

1. INTRODUCTION

Variational autoencoder (VAE) (Kingma & Welling, 2014) is one of the most successful generative models, estimating posterior parameters of latent variables for each input data point. In VAE, the latent representation is obtained by maximizing an evidence lower bound (ELBO). A number of studies (Higgins et al., 2017; Kim & Mnih, 2018; Lopez et al., 2018; Chen et al., 2018; Locatello et al., 2019; Alemi et al., 2018; Rolínek et al., 2019) have tried to reveal the properties of the latent variables. However, the quantitative behavior of VAE is still not well clarified. For example, there has not been a theoretical formulation of the reconstruction loss and KL divergence in ELBO after optimization. More specifically, although the conditional distribution p_θ(x|z) in the reconstruction loss of ELBO is predetermined, e.g., as a Gaussian or Bernoulli distribution, it has not been discussed well whether the true conditional distribution after optimization matches the predetermined distribution. Rate-distortion (RD) theory (Berger, 1971), which is an important part of Shannon information theory and has been successfully applied to image compression, quantitatively formulates the optimal RD trade-off in lossy compression. To realize a quantitative data analysis, an RD theory-based autoencoder, RaDOGAGA (Kato et al., 2020), has been proposed; it employs an isometric embedding (Han & Hong, 2006), in which the distance between any two points of the input space under a given metric is always the same as the L2 distance in the isometric embedding space. In this paper, by mapping the VAE latent space to an implicit isometric space like RaDOGAGA on a variable-by-variable basis and analysing VAE quantitatively as a well-examined lossy compression scheme, we thoroughly clarify the quantitative properties of VAE theoretically and experimentally as follows. 1) An implicit isometric embedding is derived in the loss-metric-defined space such that the entropy of the data representation becomes minimum.
A scaling factor between the VAE latent space and the implicit isometric space is formulated from the posterior for each input. In the case of β-VAE, the posterior variance of each dimensional component in the implicit isometric embedding space is a constant β/2, which is analogous to the rate-distortion optimum of transform coding in RD theory. As a result, the reconstruction loss and KL divergence in ELBO can be quantitatively formulated. 2) From these properties, VAE can provide a practical quantitative analysis of input data. First, the data probabilities in the input space can be estimated from the prior, the loss metric, and the posterior parameters. In addition, the quantitative importance of each latent variable, analogous to an eigenvalue of PCA, can be evaluated from the posterior variance of VAE. This work will help steer information-theoretic generative models in the right direction.

2.1. VARIATIONAL AUTOENCODER AND THEORETICAL ANALYSIS

In VAE, ELBO is maximized instead of maximizing the log-likelihood directly. Let x ∈ R^m be a point in a dataset. The original VAE model consists of a latent variable with fixed prior z ∼ p(z) = N(z; 0, I_n) ∈ R^n, a parametric encoder Enc_φ : x → z, and a parametric decoder Dec_θ : z → x̂. The encoder provides q_φ(z|x) = N(z; μ_(x), σ_(x)²) by estimating the parameters μ_(x) and σ_(x). Let L_x be a local cost at data x. Then, ELBO is described by

ELBO = E_{x∼p(x)} [ E_{z∼q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) ‖ p(z)) ].  (1)

In E_{x∼p(x)}[·], the first term E_{z∼q_φ(z|x)}[·] is called the reconstruction loss. The second term D_KL(·) is a Kullback-Leibler (KL) divergence. Let μ_j(x), σ_j(x), and D_KLj(x) be the j-th dimensional values of μ_(x), σ_(x), and the KL divergence. Then D_KL(·) is derived as:

D_KL(·) = Σ_{j=1}^n D_KLj(x), where D_KLj(x) = (1/2) ( μ_j(x)² + σ_j(x)² − log σ_j(x)² − 1 ).  (2)

D(x, x̂) denotes a metric such as the sum square error (SSE) and binary cross-entropy (BCE), i.e., the log-likelihoods of the Gaussian and Bernoulli distributions, respectively. In training VAE, the next objective is used instead of Eq. 1, where β is a parameter to control the trade-off (Higgins et al., 2017):

L_x = E_{z∼q_φ(z|x)}[ D(x, x̂) ] + β D_KL(·).  (3)

However, it has not been fully discussed whether the true conditional distribution matches the predetermined distribution, or how the value of the KL divergence is derived after training. There have been several studies to analyse VAE theoretically. Alemi et al. (2018) introduced the RD trade-off based on the information-theoretic framework to analyse β-VAE. However, they did not clarify the quantitative property after optimization. Dai et al. (2018) showed that a VAE restricted to a linear transform can be considered as a robust PCA. However, their model has a limitation for the analysis on each latent variable basis because of the linearity assumption. Rolínek et al.
(2019) showed that the Jacobian matrix of VAE at each latent variable is orthogonal, which makes latent variables disentangled implicitly. However, they do not uncover the orthonormality or the quantitative properties because they simplify the KL divergence as a constant. Dai & Wipf (2019) also showed that the expected rate of VAE for an r-dimensional manifold is close to -(r/2) log γ + O(1) at γ → 0 when p_θ(x|x̂) = N(x; x̂, γI_m) holds. The remaining challenge is to clearly figure out what latent space is obtained for a given dataset, loss metric, and β in the model.
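The closed-form per-dimension KL divergence of Eq. 2 can be sketched in a few lines; the helper name below is illustrative, not from the paper:

```python
import math

def kl_per_dim(mu, sigma):
    """Per-dimension KL of N(mu_j, sigma_j^2) from the prior N(0, 1), Eq. 2."""
    return [0.5 * (m * m + s * s - math.log(s * s) - 1.0)
            for m, s in zip(mu, sigma)]

# A posterior identical to the prior carries zero rate.
assert abs(sum(kl_per_dim([0.0, 0.0], [1.0, 1.0]))) < 1e-12

# A sharp posterior (small sigma) pays rate, dominated by the -log sigma term.
print(kl_per_dim([0.5], [0.1])[0])  # ~1.9326
```

Summing these per-dimension terms over j gives the D_KL(·) used in Eqs. 1 and 3.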

2.2. RATE-DISTORTION THEORY, TRANSFORM CODING, AND ISOMETRIC EMBEDDING

RD theory (Berger, 1971) formulated the optimal transform coding (Goyal, 2001) for the Gaussian source with the square error metric as follows. Let x ∈ R^m be a point in a dataset. First, the data are transformed deterministically with an orthonormal transform (orthogonal and unit norm) such as the Karhunen-Loève transform (KLT) (Rao & Yip, 2000). Let z ∈ R^m be a point transformed from x. Then, z is entropy-coded by allowing an equivalent stochastic distortion (or a posterior with constant variance) in each dimension. The lower bound of the rate R at a distortion D is denoted by R(D). The derivation of R(D) is as follows. Let z_j be the j-th dimensional component of z and σ_zj² be the variance of z_j in the dataset. It is noted that σ_zj² is equivalent to an eigenvalue of PCA for the dataset. Let d be the distortion equally allowed in each dimensional channel. At the optimal condition, the distortion D_opt and rate R_opt on the curve R(D) are calculated as functions of d (assuming d ≤ σ_zj²; dimensions with σ_zj² < d are assigned zero rate):

D_opt = Σ_{j=1}^m d = m d,   R_opt = Σ_{j=1}^m (1/2) log( σ_zj² / d ).  (4)

The simplest way to allow equivalent distortion is to use a uniform quantization (Goyal, 2001). Let T be a quantization step, and round(·) be a round function. The quantized value ẑ_j is derived as kT, where k = round(z_j/T). Then, d is approximated by T²/12 as explained in Appendix H.1. To practically achieve the best RD trade-off in image compression, rate-distortion optimization (RDO) has also been widely used (Sullivan & Wiegand, 1998). In RDO, the best trade-off is achieved by finding an encoding parameter that minimizes a cost L = D + λR at a given Lagrange parameter λ. Recently, deep image compression (Ballé et al., 2018) has been proposed; in these works, instead of an orthonormal transform with the sum square error (SSE) metric of conventional lossy compression, a deep autoencoder is trained with flexible metrics, such as the structural similarity (SSIM) (Wang et al., 2001), for RDO. More recently, an isometric autoencoder, RaDOGAGA (Kato et al., 2020), was proposed based on Ballé et al. (2018).
They proved that the latent space is isometric to the input space if the model is trained by RDO using a parametric prior and a posterior with constant variance. By contrast, VAE uses a fixed prior with a variable posterior. In Section 3, we explain that VAE can be quantitatively understood as the rate-distortion optimum as in Eq. 4 by mapping the VAE latent space to an implicit isometric embedding on a variable-by-variable basis as in Fig. 1.
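The transform-coding optimum of Eq. 4 and the T²/12 quantization approximation from Appendix H.1 can both be sketched numerically; function and variable names here are illustrative:

```python
import math, random

def rd_optimal(variances, d):
    """Eq. 4 sketch: distortion and rate at equal per-dimension distortion d.
    Dimensions whose variance is below d get zero rate (no information)."""
    D = sum(min(d, v) for v in variances)
    R = sum(0.5 * math.log(v / d) for v in variances if v > d)
    return D, R

D, R = rd_optimal([4.0, 1.0, 0.25], d=0.25)
print(D, R)  # D = 0.75; rate is spent only where variance exceeds d

# Uniform quantization with step T allows a distortion close to T^2/12
# (the Appendix H.1 approximation), provided T is small vs. the signal scale.
random.seed(0)
T = 0.5
mse = sum((z - T * round(z / T)) ** 2
          for z in (random.gauss(0.0, 2.0) for _ in range(200000))) / 200000
print(mse, T * T / 12)  # both close to 0.0208
```

This is the "reverse water-filling" behavior that Section 3 maps the β-VAE optimum onto, with d playing the role of β/2.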

3. UNDERSTANDING OF VAE AS A SCALED ISOMETRIC EMBEDDING

This section shows the quantitative understanding of VAE. First, we present the hypothesis of mapping VAE latent space to an implicit isometric embedding space. Second, we reformulate the objective of β-VAE for easy analysis. Third, we prove the hypothesis from the minimum condition of the objective. Then, we show that ELBO can be interpreted as an optimized RDO cost of transform coding where the quantitative properties are well clarified, as well as discuss and correct some prior theoretical studies. Lastly, we explain the quantitative properties of VAE to validate the theory including approximations and provide a practical data analysis.

3.1. HYPOTHESIS OF MAPPING VAE TO THE IMPLICIT ORTHONORMAL TRANSFORM

Figure 1 shows the mapping of VAE to the implicit isometric embedding. Assume the data manifold is smooth and differentiable. Let S_input (⊂ R^m) be the input space of the dataset. D(x, x̂) denotes a metric for points x, x̂ ∈ S_input. Using the second-order Taylor expansion, D(x, x + δx) can be approximated by δxᵀ G_x δx, where G_x and δx are an x-dependent positive definite Hermitian metric tensor and an arbitrary infinitesimal displacement in S_input, respectively. The derivations of G_x for SSE, BCE, and SSIM are shown in Appendix H.2. Next, an implicit isometric embedding space S_Iso (⊂ R^m) is introduced like the isometric latent space in RaDOGAGA (Kato et al., 2020), such that the entropy of the data representation is minimum in the inner product space of G_x. Let y and y_j be a point in S_Iso and its j-th component, respectively. Because of the isometricity, p(x) ≃ p(y) will hold. We will also show that the posterior variance of each dimensional component y_j is a constant β/2. In addition, the variance of y_j will show the importance like PCA when the data manifold has a disentangled feature by nature in the metric space of G_x and the prior covariance is diagonal. Then, S_Iso is nonlinearly scaled to the VAE's anisometric orthogonal space S_VAE (⊂ R^n) on a variable-by-variable basis. Let z be a point in S_VAE, and let z_j denote the j-th component of z. Let p(y_j) and p(z_j) be the probability distributions of the j-th variable in S_Iso and S_VAE. Each variable y_j is nonlinearly scaled to z_j such that dz_j/dy_j = p(y_j)/p(z_j), fitting the cumulative distributions. dz_j/dy_j also equals σ_j(x)/√(β/2), the ratio of the posterior standard deviations of z_j and y_j, such that the KL divergences in both spaces are equivalent. In addition, dimensional components whose KL divergences are zero can be discarded because such dimensions have no information.
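The variable-wise rescaling dz_j/dy_j = p(y_j)/p(z_j) can be sketched in one dimension: map a y with an illustrative density (a Laplace here; any smooth 1-D density works, this choice is not from the paper) to a standard-normal z by matching CDFs, and compare the numerical derivative with the density ratio:

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_ppf(p):
    lo, hi = -10.0, 10.0   # bisection inverse CDF; enough for a sketch
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative p(y_j): Laplace with scale b.
b = 1.5
lap_cdf = lambda y: 0.5 * math.exp(y / b) if y < 0 else 1.0 - 0.5 * math.exp(-y / b)
lap_pdf = lambda y: math.exp(-abs(y) / b) / (2.0 * b)

y = 0.7
z = norm_ppf(lap_cdf(y))   # CDF matching: F_Z(z) = F_Y(y)
eps = 1e-5
dz_dy = (norm_ppf(lap_cdf(y + eps)) - norm_ppf(lap_cdf(y - eps))) / (2 * eps)
print(dz_dy, lap_pdf(y) / norm_pdf(z))  # dz/dy = p(y)/p(z), as claimed
```

The same monotone CDF-matching map is what relates each implicit isometric variable y_j to the VAE latent variable z_j above.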
3.2. REFORMULATION OF OBJECTIVE TO THE FORM USING ∂x/∂z_j AND ∂x/∂z

We reformulate the objective L_x to a form using ∂x/∂z_j and ∂x/∂z. Here, the dimensions of x and z, i.e., m and n, are set to be the same. The condition to reduce n is shown in Section 3.3.

Reformulation of the D(x, x̂) loss: In accordance with Kato et al. (2020), the loss D(x, x̂) can be decomposed into D(x̄, x̂) + D(x, x̄), where x̄ denotes Dec_θ(μ_(x)). The first term D(x̄, x̂) is a distortion between the decoded values of μ_(x) with and without the noise σ_(x). We call this term a coding loss. This term is expanded as follows. Let δx̂ denote x̂ − x̄. Then, the D(x̄, x̂) term can be approximated by δx̂ᵀ G_x δx̂. Let x_zj be ∂x/∂z_j at z_j = μ_j(x), and let δz_j ∼ N(0, σ_j(x)²) be the noise added to z_j. Then, δx̂ is approximated by δx̂ ≃ Σ_{j=1}^n δz_j x_zj. Because δz_j and δz_k for j ≠ k are uncorrelated, the average of D(x̄, x̂) over z ∼ q_φ(z|x) can finally be reformulated as

E_{z∼q_φ(z|x)}[D(x̄, x̂)] ≃ E_{z∼q_φ(z|x)}[ δx̂ᵀ G_x δx̂ ] ≃ Σ_{j=1}^n σ_j(x)² x_zjᵀ G_x x_zj.  (5)

The second term D(x, x̄) is a loss between the input data and Dec_θ(μ_(x)). We call this term a transform loss. We presume VAE is analogous to the Wiener filter (Wiener, 1964; Jin et al., 2003), where the coding loss is regarded as an added noise. From the Wiener filter theory, the ratio between the transform loss and the coding loss is close to the ratio between the coding loss and the variance of the input data. The coding loss, approximately nβ/2 as in Eq. 14, should be smaller than the variance of the input data to capture meaningful information. Thus the transform loss, usually small, is not considered in the following discussion. Appendix B explains the details with a simple 1-dimensional VAE. We show an exhaustive and quantitative evaluation of the coding loss and transform loss on the toy dataset in Appendix E.2 to validate this approximation.

Reformulation of the KL divergence: When σ_j(x) ≪ 1, σ_j(x)² ≪ −log σ_j(x)² is observed.
For example, when σ_j(x)² < 0.1, we have −(σ_j(x)² / log σ_j(x)²) < 0.05. In such dimensions, D_KLj(x) can be approximated as Eq. 6 by ignoring the σ_j(x)² term and setting p(μ_j(x)) to N(z_j; 0, 1):

D_KLj(x) ≃ (1/2) ( μ_j(x)² − log σ_j(x)² − 1 ) = −log ( σ_j(x) p(μ_j(x)) ) − (1/2) log 2πe.  (6)

Eq. 6 can be considered as the rate of entropy coding for a symbol with mean μ_j(x), allowing quantization noise σ_j(x)², as shown in Appendix H.3. Thus, in a dimension with meaningful information, σ_j(x)² is much smaller than the prior variance 1, and the approximation in Eq. 6 is reasonable. Let p(μ_(x)) be Π_{j=1}^n p(μ_j(x)). Then p(μ_(x)) ≃ p(x) |det(∂x/∂z)| holds, where det(∂x/∂z) is the Jacobian determinant at z = μ_(x). Let C_DKL be the constant (n/2) log 2πe. Then, D_KL(·) is reformulated as

D_KL(·) ≃ −log ( p(μ_(x)) Π_{j=1}^n σ_j(x) ) − C_DKL ≃ −log ( p(x) |det(∂x/∂z)| Π_{j=1}^n σ_j(x) ) − C_DKL.  (7)

Final objective form: From Eqs. 5 and 7, the objective L_x to minimise is derived as:

L_x ≃ Σ_{j=1}^n σ_j(x)² x_zjᵀ G_x x_zj − β log ( p(x) |det(∂x/∂z)| Π_{j=1}^n σ_j(x) ) − βC_DKL.  (8)
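The quality of the Eq. 6 approximation is easy to verify numerically: comparing it with the exact per-dimension KL of Eq. 2 shows a gap of exactly σ_j(x)²/2, which vanishes for sharp posteriors. A minimal sketch (helper names are illustrative):

```python
import math

def kl_exact(mu, sigma):
    # Eq. 2, per dimension
    return 0.5 * (mu ** 2 + sigma ** 2 - math.log(sigma ** 2) - 1.0)

def kl_approx(mu, sigma):
    # Eq. 6: drop sigma^2 and express the rest via the prior density p(mu)
    p_mu = math.exp(-0.5 * mu ** 2) / math.sqrt(2.0 * math.pi)
    return -math.log(sigma * p_mu) - 0.5 * math.log(2.0 * math.pi * math.e)

for sigma in (0.3, 0.1, 0.03):
    e, a = kl_exact(0.8, sigma), kl_approx(0.8, sigma)
    print(f"sigma={sigma}: exact={e:.4f}  approx={a:.4f}  gap={e - a:.5f}")
# The gap is exactly sigma^2/2, so the approximation improves as sigma -> 0.
```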

3.3. PROOF OF THE HYPOTHESIS

Mapping VAE to implicit isometric embedding: The minimum condition of L_x at x is examined. Let x̃_zj be the j-th column vector of the cofactor matrix of the Jacobian matrix ∂x/∂z. Note that d log |det(∂x/∂z)| / dx_zj = x̃_zj / det(∂x/∂z) holds, as is also used in Kato et al. (2020). Using this equation, the derivative of L_x with respect to x_zj is described by

dL_x/dx_zj = 2σ_j(x)² G_x x_zj − ( β / det(∂x/∂z) ) x̃_zj.  (9)

Note that x_zkᵀ · x̃_zj = det(∂x/∂z) δ_jk holds by the cofactor's property. Here, · denotes the dot product, and δ_jk denotes the Kronecker delta. By setting Eq. 9 to zero and multiplying by x_zkᵀ from the left, the condition to minimize L_x is derived as the next orthogonal form of x_zj:

(2σ_j(x)²/β) x_zkᵀ G_x x_zj = δ_jk.  (10)

Here, the diagonal posterior covariance is the key to the orthogonality. Next, the implicit latent variable y and its j-th dimensional component y_j are introduced. Set y_j to zero at z_j = 0. The derivative between y_j and z_j at μ_j(x) is defined by

dy_j/dz_j |_{z_j = μ_j(x)} = √(β/2) σ_j(x)⁻¹.  (11)

Let x_yj denote ∂x/∂y_j. By applying x_zj = (dy_j/dz_j) x_yj to Eq. 10, x_yj shows the isometric property (Han & Hong, 2006; Kato et al., 2020) in the inner product space with the metric tensor G_x as follows:

x_yjᵀ G_x x_yk = δ_jk.  (12)

Minimum entropy of the implicit isometric representation: Let L_x^min be the minimum of L_x at x. D_x^min and R_x^min denote the coding loss and KL divergence in L_x^min, respectively. By applying Eqs. 10-11 and p(z_j) = (dy_j/dz_j) p(y_j) to Eqs. 5 and 7, the following equations are derived:

L_x^min = D_x^min + β R_x^min, where D_x^min = nβ/2, R_x^min = −log p(y) − (n/2) log(βπe).  (13)

Here, D_x^min is derived as Σ_{j=1}^n (β/2) x_yjᵀ G_x x_yj = nβ/2, implying that the posterior variance of each dimensional component of the implicit isometric variable is a constant β/2. In addition, exp(−L_x^min/β) = p(y) exp(Const.) ∝ p(y) ≃ p(x) will hold in the inner product space of G_x from the isometricity.
By averaging L_x^min over x ∼ p(x) and approximating this average by the integration over y ∼ p(y), the global minimum L_G is derived as:

L_G = D_G + β R_G, where D_G = nβ/2, R_G = min_{p(y)} [ −∫ p(y) log p(y) dy ] − (n/2) log(βπe).  (14)

The term −∫ p(y) log p(y) dy in R_G is the entropy of y. Thus, the optimal implicit isometric space is derived such that the entropy of the data representation is minimum in the inner product space of G_x. When the data manifold has a disentangled property in the given metric, each y_j will capture a disentangled feature with minimum entropy, as shown in Kato et al. (2020). This is analogous to PCA for Gaussian data, which gives a disentangled representation with minimum entropy under SSE. Considering the similarity to the PCA eigenvalues, the variance of y_j will indicate the importance of each dimension. In the dimensions where the variance of y_j is less than β/2, σ_j(x) = 1, μ_j(x) = 0, and D_KLj(x) = 0 will hold. In addition, σ_j(x)² x_zjᵀ G_x x_zj will be close to 0 because this term need not be balanced with D_KLj(x). This is similar to the case in RD theory in Eq. 4 where σ_zj² is less than d, meaning no information. As a result, Eqs. 10-14 will not hold in such dimensions. Thus, the latent variables with variances from the largest to the n-th with D_KLj(x) > 0 are sufficient for the representation, and the dimensions with D_KLj(x) = 0 can be ignored, allowing a reduction of the dimension n of z. Some approximations may be slightly violated; however, our analysis still helps to understand VAE.
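The rewriting of the rate from the z-space form (Eq. 7) into the y-space form (Eq. 13) is a one-dimensional change of variables, which can be checked numerically under the scaling of Eq. 11. All numeric values below are illustrative:

```python
import math

# 1-D check: the rate written with sigma_j(x) and p(mu_j(x)) (Eq. 7 form)
# equals -log p(y_j) - (1/2) log(beta*pi*e) (Eq. 13 form) under
# dy_j/dz_j = sqrt(beta/2)/sigma_j(x) (Eq. 11).
beta, sigma = 0.2, 0.07
p_z = 0.31                              # illustrative density p(mu_j(x)) in z space
dy_dz = math.sqrt(beta / 2.0) / sigma
p_y = p_z / dy_dz                       # change of variables: p(z) dz = p(y) dy

r_eq7 = -math.log(p_z * sigma) - 0.5 * math.log(2.0 * math.pi * math.e)
r_eq13 = -math.log(p_y) - 0.5 * math.log(beta * math.pi * math.e)
print(r_eq7, r_eq13)  # identical up to rounding
```

The factor (β/2)^(1/2) absorbed from dy/dz is exactly what turns the constant (1/2)log 2πe into (1/2)log βπe.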

3.4. DISCUSSION AND RELATIONSHIP WITH PRIOR THEORETICAL STUDIES

First, we show that the β-VAE optimum as in Eq. 14 can be interpreted as the rate-distortion optimum (Eq. 4) in RD theory when the uniform distortion d in Eq. 4 is set to β/2 in the metric-defined space. H(X) = −∫ p(x) log p(x) dx denotes the differential entropy for a set x ∈ X, x ∼ p(x). For 1-dimensional Gaussian data x ∼ N(x; 0, σ²), H(X) = (1/2) log(2πeσ²) holds. Thus, R_opt in Eq. 4 is derived as the difference of the differential entropies between the transformed data z ∼ Π_j N(z_j; 0, σ_zj²) and the uniform distortion D ∼ N(D; 0, dI_m). R_G is likewise derived as the difference of the differential entropies between the transformed data y ∼ p(y) and the uniform distortion D ∼ N(D; 0, (β/2)I_m). Furthermore, D_G in Eq. 14 can be interpreted as D_opt in Eq. 4 by setting d = β/2. As a result, the VAE optimum corresponds to the rate-distortion optimum of transform coding in RD theory, and β/2 is regarded as the variance of the constant distortion equally added to each dimensional component. Because of the isometricity, the power of the distortion (i.e., the posterior variance) in the implicit isometric space is the same as that in the metric-defined input space. Thus the conditional distribution after optimization in the metric-defined space is derived as p_θ(x|z) = p_θ(x|x̂) ≃ N(x; x̂, (β/2)I). This is consistent with the fact that the quality of the reconstructed data becomes worse at larger β. Next, we estimate the reconstruction loss E_{q_φ(z|x)}[log p_θ(x|z)] and KL divergence D_KL(·) in β-VAE and also correct the analysis in Alemi et al. (2018). Let H = −E_{p(x)}[log p(x)] be the differential entropy of the input data. When β = 1, Alemi et al.
(2018) suggest that "the ELBO objective alone (and the marginal likelihood) cannot distinguish between models that make no use of the latent variable (autodecoders) versus models that make large use of the latent variable and learn useful representations for reconstruction (autoencoders)," because the reconstruction loss and KL divergence can take arbitrary values on the line −E_{q_φ(z|x)}[log p_θ(x|z)] + D_KL(·) = H. Correctly, the reconstruction loss and KL divergence after optimization are deterministically estimated at any β (including β = 1) as:

E_{q_φ(z|x)}[log p_θ(x|z)] ≃ −(n/2) log(βπe),  D_KL(·) ≃ −log p(y) − (n/2) log(βπe).  (15)

The proof is explained in Appendix A.1. Thus ELBO can be estimated as:

ELBO = E_{p(x)}[ E_{z∼q_φ(z|x)}[log p_θ(x|z)] − D_KL(·) ] ≃ E_{p(x)}[log p(y)] ≃ E_{p(x)}[log p(x)].  (16)

As a result, when the objective of β-VAE is optimised, ELBO (Eq. 1) in the original form (Kingma & Welling, 2014) is approximately equal to the log-likelihood of x, regardless of whether β = 1 or not. Finally, the predetermined conditional distribution p_Rp(x|x̂) and the true conditional distribution after optimization p_Rθ(x|x̂) are examined using β in the input Euclidean space of x. Assume p_Rp(x|x̂) = N(x; x̂, σ²I). In this case, the metric D(x, x̂) is derived as −log p_Rp(x|x̂) = (1/2σ²)|x − x̂|₂² + Const. From Eq. 13, the following equations are derived:

E_{q_φ(x̂|x)}[D(x, x̂)] = E_{q_φ(x̂|x)}[ (1/2σ²)|x − x̂|₂² ] = E_{q_φ(x̂|x)}[ (1/2σ²) Σ_i (x_i − x̂_i)² ] ≃ nβ/2,  (17)

E_{q_φ(x̂|x)}[ (x_i − x̂_i)² ] ≃ βσ².  (18)

Because the variance of each dimension is estimated as βσ², the true conditional distribution after optimization is approximated as p_Rθ(x|x̂) ≃ N(x; x̂, βσ²I). If β = 1, i.e., the original VAE, p_Rp(x|x̂) and p_Rθ(x|x̂) are equivalent as expected. If β ≠ 1, however, p_Rp(x|x̂) and p_Rθ(x|x̂) are different.
Actually, what β-VAE does is only to scale the variance of the predetermined conditional distribution of the original VAE by a factor of β, because the β-VAE objective can be rewritten as:

E_{q_φ(·)}[log N(x; x̂, σ²I)] − β D_KL(·) = β ( E_{q_φ(·)}[log N(x; x̂, βσ²I)] − D_KL(·) ) + const.  (19)

More detailed discussions about prior works (Higgins et al. (2017); Alemi et al. (2018); Dai et al. (2018); Dai & Wipf (2019); Tishby et al. (1999); Goyal (2001)) are given in Appendix A.
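The identity of Eq. 19 is purely algebraic and can be checked numerically: the difference between the two Gaussian log-likelihood terms is the same constant for every (x, x̂), because the quadratic parts cancel exactly. A minimal sketch with illustrative values:

```python
import math, random

def log_gauss(x, xhat, var):
    """log N(x; xhat, var*I) for equal-length lists x, xhat."""
    n = len(x)
    return (sum(-0.5 * (a - b) ** 2 / var for a, b in zip(x, xhat))
            - 0.5 * n * math.log(2.0 * math.pi * var))

random.seed(1)
n, beta, s2 = 4, 4.0, 0.5
consts = []
for _ in range(5):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    xh = [random.gauss(0.0, 1.0) for _ in range(n)]
    consts.append(log_gauss(x, xh, s2) - beta * log_gauss(x, xh, beta * s2))

# The difference is identical for every (x, xhat): only the normalization
# constants survive, which is exactly the claim of Eq. 19.
assert max(consts) - min(consts) < 1e-9
print("constant offset:", consts[0])
```

The KL term cancels on both sides of Eq. 19, so checking the likelihood terms alone suffices.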

3.5. QUANTITATIVE PROPERTIES TO VALIDATE THE THEORY

This section shows three quantitative properties of VAE with a prior N(z; 0, I_n) to validate the theory in Section 3.3. The second and third properties also provide practical data analysis approaches. The derivations of the equations for the second and third properties are explained in Appendix C.

Norm of x_yj equal to 1: Let e^(j) be the unit vector whose j-th dimension is 1 and whose other dimensions are 0. Let D_j(z) be D(Dec_θ(z), Dec_θ(z + ε e^(j)))/ε², where ε denotes a minute value for the numerical differentiation. From Eq. 10, the squared norm of x_yj can be numerically evaluated as the first term of Eq. 20. This value will be equal to 1 at any x and dimension j except where D_KLj(x) = 0:

(2/β) σ_j(x)² D_j(z) ≃ (2/β) σ_j(x)² x_zjᵀ G_x x_zj ≃ x_yjᵀ G_x x_yj = 1.  (20)

If observed, the existence of an implicit isometric embedding can be shown because of the unit norm and orthogonality (Rolínek et al., 2019). Eq. 20 also shows σ_j(x)² x_zjᵀ G_x x_zj ≃ β/2, implying that the noise with variance σ_j(x)² added to each dimension of the latent variable causes an equal distortion β/2 in the input space.

PCA-like feature: When the data manifold has a disentangled property in the given metric, the variance of the j-th implicit latent component y_j can be roughly estimated as

∫ y_j² p(y_j) dy_j ≃ (β/2) E_{x∼p(x)}[ σ_j(x)⁻² ].  (21)

The average E[σ_j(x)⁻²] on the right allows evaluating the quantitative importance of each dimension in practice, like an eigenvalue of PCA. Note that a dimension whose average is close to 1 implies D_KLj(x) = 0. Such a dimension has no information and is an exception to the property in Eq. 20.

Estimation of the data probability distribution: First, assume the case m = n. Since the y space is isometric to the inner product space of G_x, the PDFs in both spaces are the same. The Jacobian determinant between the input space and the inner product space, giving the ratio of the PDFs, is derived as |G_x|^(1/2). We set p(μ_(x)) to the prior.
Thus, the data probability in the input space can be estimated from |G_x|^(1/2) and either the prior/posterior or L_x after training, as in the last two expressions:

p(x) ≃ |G_x|^(1/2) p(y) ∝ |G_x|^(1/2) p(μ_(x)) Π_{j=1}^m σ_j(x) ∝ |G_x|^(1/2) exp(−L_x/β).  (22)

In the case m > n, the derivation of the PDF ratio between the input space and the inner product space is generally intractable, except for G_x = a_x I_m, where a_x is an x-dependent scalar factor. In this case, the PDF ratio is given by a_x^(n/2). Thus, p(x) can be estimated as follows:

p(x) ∝ a_x^(n/2) p(μ_(x)) Π_{j=1}^n σ_j(x) ∝ a_x^(n/2) exp(−L_x/β).  (23)

Equations 22 and 23 enable a probability-based quantitative data analysis/sampling in practice.
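The "rough" nature of the variance estimate in Eq. 21 can be illustrated without a trained model. Under the scaling of Eq. 11, (β/2)E[σ_j(x)⁻²] equals E[g′(z)²] for the monotone map y_j = g(z_j) with z_j ∼ N(0, 1); for a hypothetical smooth map (not from the paper), this tracks Var(y_j) only approximately:

```python
import random

# Illustrative monotone map y = g(z); by Eq. 11, g'(z) = sqrt(beta/2)/sigma,
# so (beta/2) E[sigma^-2] = E[g'(z)^2], the right-hand side of Eq. 21.
random.seed(2)
g  = lambda z: z + 0.1 * z ** 3
dg = lambda z: 1.0 + 0.3 * z ** 2

zs = [random.gauss(0.0, 1.0) for _ in range(400000)]
# E[g(z)] = 0 here (g is odd), so the variance is just the second moment.
var_y  = sum(g(z) ** 2 for z in zs) / len(zs)    # exact value: 1.75
est_21 = sum(dg(z) ** 2 for z in zs) / len(zs)   # exact value: 1.87
print(var_y, est_21)  # rough agreement, as "roughly estimated" suggests
```

The two quantities coincide exactly when g is linear (the PCA case) and drift apart as the map becomes more non-linear, which is consistent with calling Eq. 21 a rough estimate.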

4. EXPERIMENT

We present experiments on the quantitative properties described in Section 3.5. First, the results on the toy dataset are presented. Then, the results on CelebA are shown as a real-data example.

4.1. EVALUATION OF QUANTITATIVE PROPERTIES IN THE TOY DATASET

The toy dataset is generated as follows. First, three-dimensional variables s_1, s_2, and s_3 are sampled in accordance with three different shapes of distributions p(s_1), p(s_2), and p(s_3), as shown in Fig. 2. The variances of s_1, s_2, and s_3 are 1/6, 2/3, and 8/3, respectively, such that the ratio of the variances is 1:4:16. Second, three 16-dimensional uncorrelated vectors v_1, v_2, and v_3 with L2 norm 1 are provided. Finally, 50,000 toy data with 16 dimensions are generated by x = Σ_{i=1}^3 s_i v_i. The data generation probability p(x) is set to p(s_1)p(s_2)p(s_3). If our hypothesis is correct, p(y_j) will be close to p(s_j). Then, σ_j(x) ∝ dz_j/dy_j = p(y_j)/p(z_j) will also vary a lot with these varieties of PDFs. Because the properties presented in Section 3.5 are calculated from σ_j(x), our theory can be easily validated by evaluating those properties. Then, the VAE model is trained using Eq. 3. We use two kinds of reconstruction loss D(·, ·) to analyze the effect of the loss metrics. The first is the square error loss, equivalent to the sum square error (SSE). The second is the downward-convex loss, which we design as Eq. 24 such that its shape becomes similar to the BCE loss as in Appendix H.2:

D(x, x̂) = a_x |x − x̂|₂², where a_x = 2/3 + 2|x|₂²/21 and G_x = a_x I_m.  (24)

Here, a_x is chosen such that the mean of a_x over the toy dataset is 1.0, since the variance of x is 1/6 + 2/3 + 8/3 = 7/2. The details of the networks and training conditions are given in Appendix D.1. Then the network is trained with the two types of reconstruction losses. The ratio of the transform loss to the coding loss is 0.023 for the square error loss and 0.024 for the downward-convex loss. As expected in Section 3.2, the transform losses are negligibly small. Tables 1 and 2 show the measurements of (2/β)σ_j(x)² D_j(z) (shown as (2/β)σ_j² D_j), D_j(z), and σ_j(x)⁻² described in Section 3.5.
In these tables, z_1, z_2, and z_3 are the acquired latent variables. "Av." and "SD" are the average and standard deviation, respectively.

[Figure 3: scatter plots between the data generation probability p(x) and (a) p(μ_(x)), (b) exp(−L_x/β), (c) a_x^(3/2) p(μ_(x)) Π_j σ_j(x), and (d) a_x^(3/2) exp(−L_x/β).]

To begin with, the norm of the implicit orthonormal basis is discussed. In both tables, the values of (2/β)σ_j(x)² D_j(z) are close to 1.0 in each dimension, as described in Eq. 20. By contrast, the average of D_j(z), which corresponds to x_zjᵀ G_x x_zj, differs in each dimension. Therefore, the derivative of x with respect to z_j, the original latent variable of VAE, is not normalized. Next, the PCA-like feature is examined. The average of σ_j(x)⁻² in Eq. 21 and its ratio are shown in Tables 1 and 2. Although the average of σ_j(x)⁻² is a rough estimate of the variance, the ratio is close to 1:4:16, i.e., the variance ratio of the generation parameters s_1, s_2, and s_3. When comparing both losses, the ratios for s_2 and s_3 under the downward-convex loss are somewhat smaller than those under the square error. This is explained as follows. In the downward-convex loss, |x_yj|² tends to be 1/a_x from Eq. 12, i.e., x_yjᵀ (a_x I_m) x_yk = δ_jk. Therefore, the region of the inner product space with a larger norm is shrunk, and the estimated variances corresponding to s_2 and s_3 become smaller. Figure 3 shows the scatter plots of the data generation probability p(x) and the estimated probabilities for the downward-convex loss. The plots for the square error loss are shown in Appendix E. Figure 3a shows the plots of p(x) and the prior probabilities p(μ_(x)). This graph implies that it is difficult to estimate p(x) only from the prior. The correlation coefficient, shown as "R" (0.434), is also low. Figure 3b shows the plots of p(x) and exp(−L_x/β), i.e., the lower bound of the likelihood. The correlation coefficient (0.771) becomes better, but is still not high.
Next, Figures 3c and 3d show the plots of a_x^(3/2) p(μ_(x)) Π_j σ_j(x) and a_x^(3/2) exp(−L_x/β) in Eq. 23. These graphs, showing high correlation coefficients around 0.91, support that the objective L_x in Eq. 3 is optimized in the inner product space of G_x. In the case of the square error loss, the plots with exp(−L_x/β) also show a high correlation coefficient of 0.904 because a_x is 1, allowing the probability estimation from L_x in Eq. 3. The ablation study with different PDFs, losses, and β is shown in Appendix E.
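The toy dataset described above can be sketched as follows. The three PDF shapes of Fig. 2 are not reproducible here, so Gaussian/uniform/Laplace stand-ins with the stated variances 1/6, 2/3, 8/3 are used instead; only the 1:4:16 ratio and the mean of a_x matter for this check:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50000

# Stand-ins for p(s_1), p(s_2), p(s_3); variances 1/6, 2/3, 8/3.
s1 = rng.normal(0.0, np.sqrt(1 / 6), N)
s2 = rng.uniform(-np.sqrt(2.0), np.sqrt(2.0), N)   # var = (2*sqrt(2))^2/12 = 2/3
s3 = rng.laplace(0.0, np.sqrt(4 / 3), N)           # var = 2*b^2 = 8/3

# Three uncorrelated 16-dim unit-norm vectors via QR decomposition.
v, _ = np.linalg.qr(rng.normal(size=(16, 3)))      # columns are orthonormal
x = np.stack([s1, s2, s3], axis=1) @ v.T           # x = sum_i s_i * v_i

print(x.shape)                                     # (50000, 16)
print(np.var(s1), np.var(s2), np.var(s3))          # ~0.167, ~0.667, ~2.667

# The downward-convex weight a_x of Eq. 24 averages ~1 over the dataset,
# since E[|x|^2] = 1/6 + 2/3 + 8/3 = 7/2.
a_x = 2 / 3 + 2 * np.sum(x ** 2, axis=1) / 21
print(a_x.mean())                                  # ~1.0
```

Because the v_i are orthonormal, the total input variance is the sum of the component variances, which is what makes E[a_x] ≃ 1 by construction.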

4.2. EVALUATIONS IN CELEBA DATASET

This section evaluates the first and second quantitative properties of VAE trained with the CelebA dataset (Liu et al., 2015) as an example of real data. This dataset is composed of 202,599 celebrity images. In use, the images are center-cropped to form 64 × 64 sized images. Figure 4 shows the averages of σ_j(x)⁻² in Eq. 21 as the estimated variances, as well as the average and standard deviation of (2/β)σ_j(x)² D_j(z) in Eq. 20 as the estimated squared norm of the implicit transform. The latent variables z_j are numbered in descending order of the estimated variance. In the dimensions greater than the 27th, the averages of σ_j(x)⁻² are close to 1 and those of (2/β)σ_j(x)² D_j(z) are close to 0, implying D_KL(·) = 0. Between the 1st and 26th dimensions, the mean and standard deviation of the (2/β)σ_j(x)² D_j(z) averages are 1.83 and 0.13, respectively. These values seem almost constant with a small standard deviation; however, the mean is somewhat larger than the expected value 1. This result implies that the implicit transform can be considered almost orthonormal after dividing by √1.83 ≃ 1.35. Thus, the average of σ_j(x)⁻² can still determine the quantitative importance of each latent variable. This also means that the noise added to each y_j is around 1.83(β/2). We also train VAE with the decomposed loss explicitly, where L_x is set to D(x, x̄) + D(x̄, x̂) + βD_KL(·). Figure 5 shows the result. Here, the mean and standard deviation of the (2/β)σ_j(x)² D_j(z) averages are 0.92 and 0.04, respectively, which suggests almost a unit norm. As a result, the explicit use of the decomposed loss matches the theory better, allowing a better analysis. The slight violation of the norm in the conventional form needs a more exact analysis as a future study. Figure 6 shows decoder outputs where selected latent variables are traversed from -2 to 2 while setting the rest to 0. The average of σ_j(x)⁻² is also shown there.
The components are grouped by the average of σ_j(x)⁻²: z_1, z_2, and z_3 in the large group, z_16 and z_17 in the medium group, and z_32 in the small group. In the large group, significant changes of the background brightness, the direction of the face, and the hair color are observed. In the medium group, we can see minor changes such as facial expressions. However, in the small group, there are almost no changes. This result strongly supports that the average of σ_j(x)⁻² shows the importance of each latent variable. The traversed outputs for all the components and results under other conditions are shown in Appendix F.

5. CONCLUSION

This paper provides a quantitative understanding of VAE by a non-linear mapping to an isometric embedding. According to the rate-distortion theory, the optimal transform coding is achieved by using a PCA/KLT orthonormal transform, where the transform space is isometric to the input. From this analogy, we show theoretically and experimentally that VAE can be mapped to an implicit isometric embedding with a scale factor derived from the posterior parameter. Based on this property, we also clarify that VAE can provide a practical quantitative analysis of input data, such as probability estimation in the input space and PCA-like quantitative multivariate analysis. We believe the quantitative properties thoroughly uncovered in this paper will be a milestone to further advance information theory-based generative models such as VAE in the right direction.

A DETAILED RELATION TO PRIOR WORKS

Firstly, we clarify the difference between ELBO in Eq. 1 and the objective L_x in Eq. 3. Then, we discuss the relation to the prior works. We also point out the incorrectness of some works.

A.1 DERIVATION OF ELBO WITH CLEAR AND QUANTITATIVE FORM

We derive the reconstruction loss and KL divergence terms in ELBO (without β) at x in Eq. 1 when the objective of β-VAE, L_x in Eq. 13, is optimised. The reconstruction loss can be rewritten as:

E_{z∼q_φ(z|x)}[log p_θ(x|z)] = ∫ q_φ(z|x) log p_θ(x|z) dz = ∫ q_φ(y|x) log p_θ(x|y) dy.

Let μ_y(x) be the implicit isometric variable corresponding to μ_(x). Because the posterior variance in each isometric latent variable is a constant β/2, q_φ(y|x) ≃ N(y; μ_y(x), (β/2)I_n) will hold. If β/2 is small, p(x̂) ≃ p(x) will hold. Then, the next equation will hold, also using the isometricity:

p_θ(x|z) = p_θ(x|y) = p_θ(x|x̂) = p(x̂|x) p(x)/p(x̂) ≃ p(x̂|x) ≃ q_φ(y|x).

Thus the reconstruction loss is estimated as:

E_{z∼q_φ(z|x)}[log p_θ(x|z)] ≃ ∫ N(y; μ_y(x), (β/2)I_n) log N(y; μ_y(x), (β/2)I_n) dy = -(n/2) log(βπe).

From Eq. 13, the KL divergence is derived as:

D_KL(·) = R_min,x = -log p(y) - (n/2) log(βπe).

By summing both terms, ELBO at x can be estimated as:

ELBO = E_{x∼p(x)}[E_{z∼q_φ(z|x)}[log p_θ(x|z)] - D_KL(·)] ≃ E_{x∼p(x)}[log p(y)] ≃ E_{x∼p(x)}[log p(x)].

As a result, ELBO (Eq. 1) in the original form (Kingma & Welling, 2014) is close to the log-likelihood of x, regardless of whether β = 1 or not, when the objective of β-VAE (Higgins et al., 2017) is optimised. Some of the prior VAE works do not explicitly distinguish between the reconstruction loss E_{x̂∼p(x̂|x)}[p(x|x̂)] in ELBO and the distortion D(x, x̂) in the objective L_x, which leads to some incorrect discussion. If D(x, x + δx) = ᵗδx G_x δx + O(||δx||³) is not SSE, by introducing a variable x′ = L_x x, where L_x satisfies ᵗL_x L_x = G_x, the metric D(·, ·) can be replaced by the SSE in the Euclidean space of x′.

A.2 RELATION TO TISHBY ET AL. (1999)

The theory described in Tishby et al. (1999) is consistent with our analysis. Tishby et al.
(1999) clarified the behaviour of the compressed representation when the rate-distortion trade-off is optimized. x ∈ X denotes the signal space with a fixed probability p(x), and x̂ ∈ X̂ denotes its compressed representation. Let D(x, x̂) be a loss metric. Then the rate-distortion trade-off can be described as:

L = I(X; X̂) + β̃ E_{p(x,x̂)}[D(x, x̂)].

By solving this condition, they derive the following equation:

p(x̂|x) ∝ exp(-β̃ D(x, x̂)).

As shown in our discussion above, p(x̂|x) ≃ N(x̂; x, (β/2)I_m) will hold in the metric-defined space from our VAE analysis. This result is equivalent to Eq. 31 in their work if D(x, x̂) is SSE and β̃ is set to β^{-1}, as follows:

p(x̂|x) ∝ exp(-β̃ D(x, x̂)) = exp(-||x - x̂||² / (2(β/2))) ∝ N(x̂; x, (β/2)I_m).

If D(x, x̂) is not SSE, the use of the space transformation explained in Appendix A.1 will lead to the same result.

A.3 RELATION TO β-VAE (HIGGINS ET AL., 2017)

In the β-VAE work by Higgins et al. (2017), it is presumed that the objective L_x was mistakenly not distinguished from ELBO. In their work, the ELBO equation is modified as:

E_{p(x)}[E_{z∼q_φ(z|x)}[log p_θ(x|z)] - β D_KL(·)]. (33)

When the estimates of the reconstruction loss, -(n/2) log(βπe), and the KL divergence, -log p(y) - (n/2) log(βπe), derived in Appendix A.1 are applied, the value of Eq. 33 is derived as β log p(x) + (β - 1)(n/2) log(βπe), which is different from the log-likelihood of x if β ≠ 1. Correctly, what β-VAE really does is only to scale the variance of the pre-determined conditional distribution in the original VAE by a factor of β. In the case the pre-determined conditional distribution is the Gaussian N(x; x̂, σ²I), the objective of β-VAE can be rewritten as a linearly scaled original VAE objective with a Gaussian N(x; x̂, βσ²I):

E_{q_φ(·)}[log N(x; x̂, σ²I)] - β D_KL(·)
= E_{q_φ(·)}[-(1/2) log 2πσ² - |x - x̂|²/(2σ²)] - β D_KL(·)
= β (E_{q_φ(·)}[-(1/2) log 2πβσ² - |x - x̂|²/(2βσ²)] - D_KL(·)) + (β/2) log 2πβσ² - (1/2) log 2πσ²
= β (E_{q_φ(·)}[log N(x; x̂, βσ²I)] - D_KL(·)) + const.

A.4 RELATION TO ALEMI ET AL. (2018)

In this appendix, we show that β specifically determines the values of R and D in Alemi et al. (2018). We also show that R ≃ H - D will hold regardless of whether β = 1 or not. In their work, the values of H, D, and R are mathematically defined as:

H ≡ -∫ dx p*(x) log p*(x), (35)
D ≡ -∫ dx p*(x) ∫ dz e(z|x) log d(x|z), (36)
R ≡ ∫ dx p*(x) ∫ dz e(z|x) log (e(z|x)/m(z)). (37)

Here, p*(x) is the true PDF of x, e(z|x) is a stochastic encoder, d(x|z) is a decoder, and m(z) is a marginal probability of z. Our work allows a rough estimation of Eqs. 35-37 with β by introducing the implicit isometric variable y as explained in our work. Using the isometric variable y and the relation dz e(z|x) = dy e(y|x), Eq. 36 can be rewritten as:

D = -∫ dx p*(x) ∫ dy e(y|x) log d(x|y). (38)

Let μ_y be the implicit isometric latent variable corresponding to the mean of the encoder output μ_(x). As discussed in Section 3.3, e(y|x) ≃ N(y; μ_y, (β/2)I_n) will hold. Because of the isometricity, the value of d(x|y) will also be close to e(y|x) = N(y; μ_y, (β/2)I_n). Though d(x|z) must depend on e(z|x), this important point has not been discussed well in their work. By using the implicit isometric variable, we can connect both theoretically. Thus, D can be estimated as:

D ≃ -∫ dx p*(x) ∫ dy e(y|x) log e(y|x) = (n/2) log(βπe). (39)

Here, if β/2, i.e., the added noise, is small enough compared to the variance of x, the normal distribution function term in this equation will act like a delta function. Thus m(y) can be approximated as:

m(y) ≃ ∫ dx̂ p*(x̂) δ(x̂ - x) ≃ p*(x).

In a similar way, the following approximation will also hold:

∫ dy e(y|x) log m(y) ≃ ∫ dx̂ δ(x̂ - x) log p*(x̂) ≃ log p*(x). (43)

By using these approximations and applying Eqs. 38-39, R in Eq. 37 can be approximated as:

R ≃ -∫ dx p*(x) log p*(x) - (n/2) log(βπe) = H - D. (44)

Alemi et al. (2018) also suggest that D should satisfy D ≥ 0 because D is a distortion; however, we suggest that D should be treated as a differential entropy and can be less than 0, because x is once handled as a continuous signal with a stochastic process in Eqs. 35-37. Here, D ≃ (n/2) log(βπe) can be -∞ if β → 0, as also shown in Dai & Wipf (2019). Thus, the upper bound of R at β → 0 is not H, but R = H - (-∞) = ∞, as shown in the RD theory for a continuous signal. Huang et al. (2020) show this property experimentally in their figures 4-8, such that R seems to diverge if MSE is close to 0.
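The estimate D ≃ (n/2) log(βπe) is just the differential entropy of an n-dimensional Gaussian with per-axis variance β/2, which can be checked by Monte Carlo. A stdlib-only sketch; n and β are arbitrary choices:

```python
import math, random

def mc_entropy(n, beta, trials=200_000):
    """Monte Carlo estimate of -E[log e(y|x)] for e(y|x) = N(y; mu, (beta/2)I_n),
    i.e. the differential entropy appearing in the estimate of D (Eq. 39)."""
    var = beta / 2.0
    total = 0.0
    for _ in range(trials):
        # squared distance of a sample from its own mean (mean can be taken as 0)
        sq = sum(random.gauss(0.0, math.sqrt(var)) ** 2 for _ in range(n))
        # negative log-density of the sample under its own distribution
        total += 0.5 * n * math.log(2 * math.pi * var) + sq / (2 * var)
    return total / trials

random.seed(1)
n, beta = 3, 0.1
est = mc_entropy(n, beta)
closed = (n / 2) * math.log(beta * math.pi * math.e)
print(round(est, 2), round(closed, 2))  # the two values agree closely
```

Note that the value is negative for small β, illustrating why D, treated as a differential entropy, can fall below 0.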

A.5 RELATION TO DAI ET AL. (2018) AND DAI & WIPF (2019)

Our work is consistent with Dai et al. (2018) and Dai & Wipf (2019). Dai et al. (2018) analyse VAE by assuming a linear model. As a result, the estimated posterior variance is constant. If the distribution of the manifold is Gaussian, our work and Dai et al. (2018) give a similar result with constant posterior variances. For non-Gaussian data, however, quantitative analysis such as probability estimation is intractable with their linear model. Our work reveals that the posterior variance gives a scaling factor between z in VAE and y in the isometric space when VAE is ideally trained with rich parameters. This is validated by Figures 3c and 3d, where the estimation of the posterior variance at each data point is a key. Next, the relation to Dai & Wipf (2019) is discussed. They analyse the behavior of VAE when it is ideally trained. For example, Theorem 5 in their work shows that D → (d/2) log γ + O(1) and R → -(r/2) log γ + O(1) hold if γ → +0, where γ, d, and r denote the variance of d(x|z), the data dimension, and the latent dimension, respectively. By setting γ = β/2 and d = r = n, this theorem is consistent with R and D derived in Eq. 39 and Eq. 44.

A.6 RELATION TO TRANSFORM CODING (GOYAL, 2001)

We show that the optimum condition of VAE in Eq. 14 can be mapped to the optimum condition of transform coding (Goyal, 2001) shown in Eq. 4. First, the derivation of Eq. 4 is explained by solving the optimal distortion assignment to each dimension. In transform coding for m-dimensional Gaussian data, an input data x is transformed to z using an orthonormal transform such as the KLT/DCT. Then each dimensional component z_j is encoded while allowing a distortion d_j. Let D be a target distortion satisfying D = Σ_{j=1}^m d_j. σ²_{z_j} denotes the variance of each dimensional component z_j over the input dataset. Then, the rate R can be derived as Σ_{j=1}^m (1/2) log(σ²_{z_j}/d_j).
By introducing a Lagrange parameter λ and minimizing the rate-distortion optimization cost L = D + λR, the optimum condition is derived as:

d_j = λ_opt/2 = D/m, i.e., λ_opt = 2D/m.

This result is consistent with Eq. 14 by setting β = λ_opt = 2D/m. This implies that L_G in Eq. 14 is a rate-distortion optimization (RDO) cost of transform coding when x is deterministically transformed to y in the implicit isometric space and stochastically encoded with a distortion β/2.
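The equal-distortion optimum d_j = λ_opt/2 can be confirmed numerically: minimizing d + λ · (1/2) log(σ²/d) over d lands at d = λ/2 regardless of the component variance. A stdlib-only grid-search sketch (the λ and variance values are arbitrary):

```python
import math

def best_distortion(sigma2, lam):
    """Grid-minimize d + lam * 0.5 * log(sigma2 / d) over d for one component
    of the transform coder; theory predicts d* = lam / 2 for every sigma2."""
    grid = [i * 1e-4 for i in range(1, 50_000)]  # d in (0, 5)
    return min(grid, key=lambda d: d + lam * 0.5 * math.log(sigma2 / d))

lam = 0.2
for sigma2 in (1.0, 4.0, 16.0):
    d_star = best_distortion(sigma2, lam)
    print(sigma2, round(d_star, 4))  # d* ~ 0.1 = lam / 2 in every case
```

That the optimal per-dimension distortion is independent of σ²_{z_j} is exactly the "constant distortion per dimension" property the mapping to Eq. 14 relies on.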

B ESTIMATION OF THE CODING LOSS AND TRANSFORM LOSS IN 1-DIMENSIONAL LINEAR VAE

This appendix estimates the coding loss and transform loss in a 1-dimensional linear β-VAE for Gaussian data, and also shows that the result is consistent with the Wiener filter. Let x be one-dimensional data with the normal distribution:

x ∈ R, x ∼ N(x; 0, σ_x²). (46)

Let z be a one-dimensional latent variable. The following linear encoder and decoder are provided, with constant parameters a, b, and σ_z to be optimized:

z = ax + σ_z ε, where ε ∼ N(ε; 0, 1), and x̂ = bz.

First, the KL divergence at x, D_KL,x, is derived. Due to the above relationship, we have p(z) = N(z; 0, (aσ_x)²). Using Eq. 6, the KL divergence at x can be evaluated as:

D_KL,x = -log(σ_z p(z)) - (1/2) log 2πe ≃ -log σ_z + a²x²/2 - 1/2. (48)

Second, the reconstruction loss at x, D_x, is evaluated as:

D_x = E_{ε∼N(ε;0,1)}[(x - b(ax + σ_z ε))²] = ((ab - 1)x)² + b²σ_z².

Then, the loss objective L_x = D_x + β D_KL,x is averaged over x ∼ N(x; 0, σ_x²), and the objective L to minimize is derived as:

L = E_{x∼N(x;0,σ_x²)}[L_x] = (ab - 1)²σ_x² + b²σ_z² + β(-log σ_z + a²σ_x²/2 - 1/2). (51)

Here, (ab - 1)²σ_x² and b²σ_z² in the last equation correspond to the transform loss D_T and the coding loss D_C, respectively. By solving dL/da = 0, dL/db = 0, and dL/dσ_z = 0, a, b, and σ_z are derived as follows:

a = 1/σ_x, b = σ_x (1 + √(1 - 2β/σ_x²))/2, σ_z = 2√(β/2) / (σ_x (1 + √(1 - 2β/σ_x²))). (52)

From Eq. 52, D_T and D_C are derived as:

D_T = ((1 - √(1 - 2β/σ_x²))/2)² σ_x², D_C = β/2. (53)

As shown in Section 3.3, the added noise β/2 should be reasonably smaller than the data variance σ_x². If σ_x² ≫ β, D_T in Eq. 53 can be approximated as:

D_T ≃ (β/2)²/σ_x² = ((β/2)/σ_x²) D_C. (54)

As shown in this equation, D_T/D_C is small in a VAE where the added noise is reasonably small, and D_T can be ignored. Next, the relation to the Wiener filter is discussed. We consider a simple 1-dimensional Gaussian process. Let x ∼ N(x; 0, σ_x²) be input data.
Then, x is scaled by s, and a Gaussian noise n ∼ N(n; 0, σ_n²) is added. Thus, y = sx + n is observed. From the Wiener filter theory, the estimated value with minimum distortion, x̂, can be formulated as:

x̂ = (sσ_x² / (s²σ_x² + σ_n²)) y.

In this case, the estimation error is derived as:

E[(x - x̂)²] = σ_n⁴σ_x²/(s²σ_x² + σ_n²)² + s²σ_x⁴σ_n²/(s²σ_x² + σ_n²)² = (σ_x²/(σ_x² + σ_n²/s²)) (σ_n²/s²). (55)

In the second equation, the first term corresponds to the transform loss, and the second term corresponds to the coding loss. Here, the ratio of the transform loss to the coding loss is derived as σ_n²/(s²σ_x²). By applying s = 1/σ_x and σ_n = σ_z to σ_n²/(s²σ_x²) and assuming σ_x² ≫ β/2, this ratio can be described as:

σ_n²/(s²σ_x²) = σ_z² = ((β/2)/σ_x²) · 4/(1 + √(1 - 2β/σ_x²))² = (β/2)/σ_x² + O(((β/2)/σ_x²)²). (56)

This result is consistent with Eq. 54, implying that an optimized VAE and the Wiener filter show similar behaviours.

C DERIVATION OF QUANTITATIVE PROPERTIES IN SECTION 3.5

C.1 DERIVATION OF THE ESTIMATED VARIANCE

This appendix explains the derivation of Eq. 21 in Section 3.5. Here, we assume that z_j is mapped to y_j such that y_j is 0 at z_j = 0. We also assume that the prior distribution is N(z; 0, I_n). The variance is derived by subtracting E[y_j]², the square of the mean, from E[y_j²], the square mean. Thus, approximations of both E[y_j] and E[y_j²] are needed. First, the approximation of the mean E[y_j] is explained. Because the cumulative distribution function (CDF) of y_j is the same as the CDF of z_j, the following equations hold:

∫_{-∞}^0 p(y_j) dy_j = ∫_{-∞}^0 p(z_j) dz_j = 0.5, ∫_0^∞ p(y_j) dy_j = ∫_0^∞ p(z_j) dz_j = 0.5. (57)

This equation means that the median of the y_j distribution is 0. Because the mean and median are close in most cases, the mean E[y_j] can be approximated as 0. As a result, the variance of y_j can be approximated by the square mean E[y_j²]. Second, the approximation of the square mean E[y_j²] is explained. The standard deviation of the posterior, σ_j(x), is assumed to be a function of z_j, regardless of x. This function is denoted as σ_j(z_j). For z_j ≥ 0, y_j is approximated as follows, using Eq. 11 and replacing the average of 1/σ_j(ź_j) over ź_j ∈ [0, z_j] by 1/σ_j(z_j):

y_j = ∫_0^{z_j} (dy_j/dź_j) dź_j = √(β/2) ∫_0^{z_j} (1/σ_j(ź_j)) dź_j ≃ √(β/2) (1/σ_j(z_j)) ∫_0^{z_j} dź_j = √(β/2) z_j/σ_j(z_j). (58)

The same approximation is applied to z_j < 0. Then the square mean of y_j is approximated as follows, assuming that the correlation between σ_j(z_j)^{-2} and z_j² is low:

∫ y_j² p(y_j) dy_j ≃ (β/2) ∫ (z_j/σ_j(z_j))² p(z_j) dz_j ≃ (β/2) ∫ σ_j(z_j)^{-2} p(z_j) dz_j ∫ z_j² p(z_j) dz_j. (60)

Finally, the square mean of y_j is approximated by the following equation, using ∫ z_j² p(z_j) dz_j = 1 and replacing σ_j(z_j)² by σ_j(x)², i.e., the posterior variance derived from the input data:

∫ y_j² p(y_j) dy_j ≃ (β/2) ∫ σ_j(z_j)^{-2} p(z_j) dz_j = (β/2) E_{z_j∼p(z_j)}[σ_j(z_j)^{-2}] ≃ (β/2) E_{x∼p(x)}[σ_j(x)^{-2}]. (61)
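The chain of approximations in Eqs. 58-61 can be exercised on a toy monotone mapping. Below, an invented map y_j = g(z_j) (not from the paper) defines a posterior σ_j(z) = √(β/2)/g′(z) via Eq. 11, and the Monte Carlo variance of y_j is compared with the estimate (β/2) E[σ_j^{-2}]:

```python
import math, random

random.seed(2)
beta = 0.02
g  = lambda z: z + 0.1 * z ** 3    # invented monotone map y_j = g(z_j)
gp = lambda z: 1.0 + 0.3 * z ** 2  # its derivative g'(z)
# posterior std implied by Eq. 11: dy/dz = sqrt(beta/2) / sigma_j(z) = g'(z)
sigma = lambda z: math.sqrt(beta / 2) / gp(z)

zs = [random.gauss(0.0, 1.0) for _ in range(400_000)]
var_y = sum(g(z) ** 2 for z in zs) / len(zs)                  # E[y_j^2]; E[y_j] ~ 0 since g is odd
est = (beta / 2) * sum(sigma(z) ** -2 for z in zs) / len(zs)  # Eq. 21 estimate
print(round(var_y / est, 2))  # close to 1 despite the rough approximations
```

For a linear g the two quantities coincide exactly; for this mildly nonlinear g the low-correlation assumption behind Eq. 60 introduces only a small bias.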
Although some rough approximations are used in the expansion, the estimated variance in the last equation still seems reasonable, because σ_j(x) gives a scale factor between y_j and z_j while the variance of z_j is always 1 for the prior N(z_j; 0, 1). By considering the variance of the prior, ∫ z_j² p(z_j) dz_j, in the expansion, this estimation method can be applied to any prior distribution.

This appendix shows the derivation of the variables in Eqs. 22 and 23. First, the derivation of L_x for the input x is described. Then, the PDF ratio between the input space and the inner product space is explained for the cases m = n and m > n.

Derivation of L_x for the input x: As shown in Eq. 1, L_x is denoted as -E_{z∼q_φ(z|x)}[·] + β D_KL(·). We approximate E_{z∼q_φ(z|x)}[·] as (1/2)(D(x, Dec_θ(μ_x + σ_x)) + D(x, Dec_θ(μ_x - σ_x))), i.e., the average of two samples, instead of the average over z ∼ q_φ(z|x). D_KL(·) can be calculated from μ_x and σ_x using Eq. 2.

The PDF ratio in the case m = n: The PDF ratio for m = n is a Jacobian determinant between the two spaces. First, ᵗ(∂x/∂y) G_x (∂x/∂y) = I_m holds from Eq. 12. |∂x/∂y|² |G_x| = 1 also holds by taking the determinant. Finally, |∂y/∂x| is derived as |G_x|^{1/2} using |∂y/∂x| = |∂x/∂y|^{-1}.

The PDF ratio in the case m > n and G_x = a_x I_m: Although the strict derivation needs the treatment of a Riemannian manifold, we provide a simple explanation in this appendix. Here, it is assumed that D_KL(j)(·) > 0 holds for all j ∈ [1, .., n]. If D_KL(j)(·) = 0 for some j, n is replaced by the number of latent variables with D_KL(j)(·) > 0. For the implicit isometric space S_iso (⊂ R^m), there exists a matrix L_x such that both w = L_x x and G_x = ᵗL_x L_x hold. w denotes a point in S_iso, i.e., w ∈ S_iso. Because G_x is assumed to be a_x I_m in Section 3.5, L_x = a_x^{1/2} I_m holds.
Then, the mapping function w = h(x) between S_input and S_iso is defined such that:

∂h(x)/∂x = ∂w/∂x = L_x, and h(x⁽⁰⁾) = w⁽⁰⁾ for ∃x⁽⁰⁾ ∈ S_input and ∃w⁽⁰⁾ ∈ S_iso. (62)

Let δx and δw be infinitesimal displacements around x and w = h(x), such that w + δw = h(x + δx). Then the next equation holds from Eq. 62: δw = L_x δx. Let δx⁽¹⁾, δx⁽²⁾, δw⁽¹⁾, and δw⁽²⁾ be two arbitrary infinitesimal displacements around x and w = h(x), such that δw⁽¹⁾ = L_x δx⁽¹⁾ and δw⁽²⁾ = L_x δx⁽²⁾. Then the following equation holds, where · denotes the dot product:

ᵗδx⁽¹⁾ G_x δx⁽²⁾ = ᵗ(L_x δx⁽¹⁾)(L_x δx⁽²⁾) = δw⁽¹⁾ · δw⁽²⁾. (64)

This equation shows the isometric mapping from the inner product space for x ∈ S_input with the metric tensor G_x to the Euclidean space for w ∈ S_iso. Note that all of the column vectors in the Jacobian matrix ∂x/∂y have a unit norm and are orthogonal to each other in the metric space for x ∈ S_input with the metric tensor G_x. Therefore, the m × n Jacobian matrix ∂w/∂y should have the property that all of its column vectors have a unit norm and are orthogonal to each other in the Euclidean space. The n-dimensional space composed of the meaningful dimensions of the implicit isometric space is named the implicit orthonormal space S_ortho. Figure 7 shows the projection of the volume element from the implicit orthonormal space to the isometric space and the input space. Let dV_ortho be an infinitesimal n-dimensional volume element in S_ortho. This volume element is an n-dimensional rectangular solid with edge lengths dy_j. Let V_n(dV_X) be the n-dimensional volume of a volume element dV_X. Then, V_n(dV_ortho) = Π_j^n dy_j holds. Next, dV_ortho is projected to the n-dimensional infinitesimal element dV_iso in S_iso by ∂w/∂y. Because of the orthonormality, dV_iso is equivalent to a rotation/reflection of dV_ortho, and V_n(dV_iso) is the same as V_n(dV_ortho), i.e., Π_j^n dy_j.
Then, dV_iso is projected to the n-dimensional element dV_input in S_input by ∂x/∂w = L_x^{-1} = a_x^{-1/2} I_m. Because each dimension is scaled equally by the scale factor a_x^{-1/2}, V_n(dV_input) = Π_j^n a_x^{-1/2} dy_j = a_x^{-n/2} V_n(dV_ortho) holds. Here, the ratio of the volume elements between S_input and S_ortho is V_n(dV_input)/V_n(dV_ortho) = a_x^{-n/2}. Note that the PDF ratio is the reciprocal of V_n(dV_input)/V_n(dV_ortho). As a result, the PDF ratio is derived as a_x^{n/2}.

D EXPERIMENTS

This appendix explains the networks and training conditions in Section 4.

D.1 TOY DATA SET

This appendix explains the details of the networks and training conditions in the experiment of the toy data set in Section 4.1.

Network configurations:

FC(i, o, f) denotes an FC layer with input dimension i, output dimension o, and activation function f. The encoder network is composed of FC(16, 128, tanh)-FC(128, 64, tanh)-FC(64, 3, linear)×2 (for µ and σ). The decoder network is composed of FC(3, 64, tanh)-FC(64, 128, tanh)-FC(128, 16, linear).

Training conditions:

The reconstruction loss D(·, ·) is derived such that the loss per input dimension is calculated and all of the losses are averaged over the input dimension m = 16. The KL divergence is derived as a summation of the D_KL(j)(·), as explained in Eq. 2. In our code, we use an essentially equivalent, but constant-factor scaled, loss objective instead of the original β-VAE form L_x = D(·, ·) + β Σ_j D_KL(j)(·) in Eq. 1:

L_x = λ D(·, ·) + Σ_j D_KL(j)(·). (65)

Equation 65 is essentially equivalent to L_x = D(·, ·) + β Σ_j D_KL(j)(·), multiplying the original form by a constant λ = β^{-1}. The reason why we use this form is as follows. Let ELBO_true be the true ELBO in the sense of the log-likelihood, E[log p(x)]. As shown in Section 3.3, the minimum of the loss objective in the original β-VAE form is likely to be -β ELBO_true + const. If we use Eq. 65, the minimum of the loss objective will be -ELBO_true + const, which seems a more natural form of ELBO. Thus, Eq. 65 allows estimating a data probability from L_x in Eqs. 22 and 23 without scaling L_x by 1/β. Then the network is trained with λ = β^{-1} = 100 for 500 epochs with a batch size of 128. Here, the Adam optimizer is used with a learning rate of 1e-3. We use a PC with an Intel(R) Xeon(R) CPU E3-1280v5@3.70GHz, 32GB memory, equipped with an NVIDIA GeForce GTX 1080. The simulation time for each trial is about 20 minutes, including the statistics evaluation codes. In our experiments, λ (or β^{-1}), i.e., 100, seems somewhat large. This is caused by the use of the mean square error per dimension as the reconstruction loss, whereas the KL divergence is the sum over all dimensions, which can be thought of as a rate for the whole input. Considering the number of input dimensions, β = (λ/16)^{-1} = 16/λ = 0.16 is thought of as β in the general form of VAE.
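The relation between the λ-scaled form of Eq. 65 and the original β form is a one-line identity: the two objectives differ only by the constant factor λ = β^{-1} and therefore share the same minimizer. A tiny sketch with made-up loss values:

```python
# Hypothetical per-sample distortion D and KL divergence values
D, KL, lam = 0.037, 12.4, 100.0
beta = 1.0 / lam

L_lambda = lam * D + KL  # the form of Eq. 65 used in the code
L_beta = D + beta * KL   # the original beta-VAE form in Eq. 1
# identical objectives up to the constant factor lam, hence the same optimum
assert abs(L_lambda - lam * L_beta) < 1e-9
print(L_lambda)
```

Because the scale factor is constant, only the interpretation changes: the minimum of Eq. 65 approximates -ELBO_true directly rather than -β ELBO_true.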

D.2 CELEBA DATA SET

This appendix explains the details of the networks and training conditions in the experiment on the CelebA dataset in Section 4.2.

Network configurations:

CNN(w, h, s, c, f) denotes a CNN layer with kernel size (w, h), stride size s, dimension c, and activation function f. GDN and IGDN are activation functions designed for image compression (Ballé et al., 2016). These activation functions are effective and popular in deep image compression studies. The encoder network is composed of CNN(9, 9, 2, 64, GDN) - CNN(5, 5, 2, 64, GDN) - CNN(5, 5, 2, 64, GDN) - CNN(5, 5, 2, 64, GDN) - FC(1024, 1024, softplus) - FC(1024, 32, None)×2 (for µ and σ). The decoder network is composed of FC(32, 1024, softplus) - FC(1024, 1024, softplus) - CNN(5, 5, 2, 64, IGDN) - CNN(5, 5, 2, 64, IGDN) - CNN(5, 5, 2, 64, IGDN) - CNN(9, 9, 2, 3, IGDN).

Training conditions:

In this experiment, SSIM, explained in Appendix H.2, is used as the reconstruction loss. The reconstruction loss D(·, ·) is derived as follows. Let SSIM be the SSIM calculated from the two input images. Then 1 - SSIM is set to D(·, ·). The KL divergence is derived as a summation of the D_KL(j)(·), as explained in Eq. 2. We also use the loss form of Eq. 65 in our code. In the case of the decomposed loss, the loss function L_x is set to λ(D(x, x̂) + D(x̂, x̃)) + D_KL(·) in our code. Then, the network is trained with λ = β^{-1} = 1,000 using a batch size of 64 for 300,000 iterations. Here, the Adam optimizer is used with a learning rate of 1e-3. We use a PC with an Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz, 12GB memory, equipped with an NVIDIA GeForce GTX 1080. The simulation time for each trial is about 180 minutes, including the statistics evaluation codes. In our experiments, λ = β^{-1} = 1,000 seems large. This is caused by the use of SSIM. As explained in Appendix H.2, SSIM is measured for a whole image, and its range is between 0 and 1. The order of 1 - SSIM is almost equivalent to the mean square error per pixel, as shown in Eq. 74. As explained in Appendix D.1, the KL divergence is thought of as a rate for the whole image. Considering the number of pixels in an image, β = (λ/(64 × 64))^{-1} = 4096/λ = 4.096 is comparable to β in the general form of VAE.

First, the estimated norm of the implicit transform in the figures (a) is discussed. In all conditions, the norms are close to 1, as described in Eq. 20, in the λ range 50 to 1000. These results show consistency with our theoretical analysis, supporting the existence of the implicit orthonormal transform. The values in the Norm dataset are the closest to 1, and those in the Ramp dataset are the most different, which seems consistent with the difficulty of the fitting. Second, the ratio of the estimated variances is discussed. In the figures (b), Var(z_j) denotes the estimated variance, given by the average of σ_j(x)^{-2}.
Then, Var(z_2)/Var(z_1) and Var(z_3)/Var(z_1) are plotted. In all conditions, the ratios Var(z_2)/Var(z_1) and Var(z_3)/Var(z_1) are close to the variance ratios of the input variables, i.e., 4 and 16, in the λ range 5 to 500. Figure 20 shows the detailed comparison of the ratio for the three datasets and three coding losses at λ = 100. In most cases, the estimated variances with the downward-convex loss are the smallest, and those with the upward-convex loss are the largest, which is more distinct for Var(z_3)/Var(z_1). This can be explained as follows. When using the downward-convex loss, the space region with a large norm is thought of as shrinking in the inner product space, as described in Section 4.1. This makes the variance smaller. In contrast, when using the upward-convex loss, the space region with a large norm is thought of as expanding in the inner product space, making the variance larger. Here, the dependency of the ratio changes on the losses is smaller in the Norm dataset. The possible reason is that data in the normal distribution concentrate around the center, having less effect on the loss scale factor in the downward-convex and upward-convex losses. Third, the correlation coefficients between p(x) and the estimated data probabilities in the figures (c) are discussed. In the Mix dataset and Ramp dataset, the correlation coefficients are around 0.9 in the λ range from 20 to 200 when the estimated probabilities a_x^{n/2} p(µ_(x)) Π_{j=1}^n σ_j(x) and a_x^{n/2} exp(-(1/β)L_x) in Eq. 23 are used. When using p(µ_(x)) Π_{j=1}^n σ_j(x) and exp(-(1/β)L_x) with the downward-convex and upward-convex losses, the correlation coefficients become worse. In addition, when using the prior probability p(µ_(x)), the correlation coefficients are always the worst. In the Norm dataset, the correlation coefficients are close to 1.0 in a wider range of λ when using the estimated distribution in Eq. 23.
When using p(µ_(x)) Π_{j=1}^n σ_j(x) and exp(-(1/β)L_x) with the downward-convex and upward-convex losses, the correlation coefficients also become worse. When using the prior probability p(µ_(x)), however, the correlation coefficients are close to 1, in contrast to the other two datasets. This can be explained because both the input distribution and the prior distribution are the same normal distribution, keeping the posterior variances almost constant. These results also show consistency with our theoretical analysis. Figure 21 shows the dependency of the coding loss on β for the Mix, Ramp, and Norm datasets using the square error loss. From D_G in Eq. 14 and n = 3, the theoretical value of the coding loss is 3β/2, as also shown in the figure. Unlike Figs. 11-19, the x-axis is β = λ^{-1} to evaluate the linearity. As expected in Section 3.3, the coding losses are close to the theoretical value where β < 0.1, i.e., λ > 10. Figure 22 shows the dependency of the ratio of the transform loss to the coding loss on β for the Mix, Ramp, and Norm datasets using the square error loss. From Eq. 54, the estimated transform loss is Σ_{i=1}^3 (β/2)²/Var(s_i) = 63β²/32. Thus the theoretical value of the ratio is (63β²/32)/(3β/2) = 21β/16, as also shown in the figure. The x-axis is also β = λ^{-1}, as in Figure 21. Considering the correlation coefficients discussed above, the useful range of β seems to be 0.005-0.05 (20-200 for λ). In this range, the ratio is less than 0.1, implying the transform loss is almost negligible. As expected in Section 3.2 and Appendix B, the ratio is close to the theoretical value where β > 0.01, i.e., λ < 100. For β < 0.01, the transform loss is still negligibly small, but the ratio is somewhat off the theoretical value. The reason is presumably that the transform loss is too small to fit the network. As shown above, this ablation study strongly supports our theoretical analysis in Section 3.
Figure 23 shows decoder outputs for all the components, where each latent variable is traversed from -2 to 2. The estimated variance of each y_j, i.e., the average of σ_j(x)^{-2}, is also shown in these figures. The latent variables z_j are numbered in descending order of the estimated variances. Figure 23a is a result using the conventional loss form, i.e., L_x = D(x, x̂) + β D_KL(·). The degrees of change seem to descend in accordance with the estimated variances. In the range where j is 1 to 10, the degrees of change are large. In the range j > 10, the degrees of change become gradually smaller. Furthermore, almost no change is observed in the range j > 27. As shown in Figure 4, D_KL(j)(·) is close to zero for j > 27, meaning no information. Thus, this result is clearly consistent with our theoretical analysis. Figure 23b is a result using the decomposed loss form, i.e., L_x = D(x, x̂) + D(x̂, x̃) + β D_KL(·). The degrees of change also seem to descend in accordance with the estimated variances. Looking at the details, there are still minor changes even at j = 32. As shown in Figure 5, the KL divergences D_KL(j)(·) for all the components are larger than zero. This implies that all of the dimensional components have meaningful information. Therefore, we can see a minor change even at j = 32. Thus, this result is also consistent with our theoretical analysis. Another minor difference is sharpness. Although a quantitative comparison is difficult, the decoded images in Figure 23b seem somewhat sharper than those in Figure 23a. A possible reason for this minor difference is as follows. The transform loss D(x, x̂) serves to bring the decoded image of µ_(x) closer to the input. In conventional image coding, the orthonormal transform and its inverse transform are used for encoding and decoding, respectively. Therefore, the input and the decoded output are equivalent when quantization is not used. If not, the quality of the decoded image suffers from degradation.
Considering this analogy, the use of the decomposed loss might improve the decoded images for µ_(x), encouraging an improvement of the orthonormality of the encoder/decoder in VAE.

F.2 ADDITIONAL EXPERIMENTAL RESULT WITH ANOTHER CONDITION

In this section, we provide the experimental results with another condition. We use essentially the same conditions as described in Appendix D.2, except for the following. The bottleneck size and λ are set to 256 and 10,000, respectively. The encoder network is composed of CNN(9, 9, 2, 64, GDN) - CNN(5, 5, 2, 64, GDN) - CNN(5, 5, 2, 64, GDN) - CNN(5, 5, 2, 64, GDN) - FC(1024, 2048, softplus) - FC(2048, 256, None)×2 (for µ and σ). The decoder network is composed of FC(256, 2048, softplus) - FC(2048, 1024, softplus) - CNN(5, 5, 2, 64, IGDN) - CNN(5, 5, 2, 64, IGDN) - CNN(5, 5, 2, 64, IGDN) - CNN(9, 9, 2, 3, IGDN).

Figures 24a and 24b show the averages of σ_j(x)^{-2} as well as the average and the standard deviation of (2/β) σ_j(x)² D_j(z) in the conventional loss form and the decomposed loss form, respectively. When using the conventional loss form, the mean of (2/β) σ_j(x)² D_j(z) is 1.25, which is closer to 1 than the mean 1.83 in Section 4.2. This suggests that the implicit transform is closer to orthonormal. The possible reason is that a bigger reconstruction error is likely to interfere with the RD trade-off and cause a slight violation of the theory, which might be compensated by a larger λ. When using the decomposed loss form, the mean of (2/β) σ_j(x)² D_j(z) is 0.95, meaning an almost unit norm. These results also support that VAE provides the implicit orthonormal transform even if λ or the bottleneck size is varied.

G ADDITIONAL EXPERIMENTAL RESULT WITH MNIST DATASET

In this appendix, we provide the experimental result of Section 4.2 with the MNIST dataset, which consists of binary hand-written digits with a dimension of 784 (= 28 × 28). We use the standard training split, which includes 50,000 data points. For the reconstruction loss, we use the binary cross entropy (BCE) loss for the Bernoulli distribution. We average the BCE over the number of pixels. The encoder network is composed of FC(784, 1024, relu) - FC(1024, 1024, relu) - FC(1024, bottleneck size). The decoder network is composed of FC(bottleneck size, 1024, relu) - FC(1024, 1024, relu) - FC(1024, 784, sigmoid). The batch size is 256 and the number of training iterations is 50,000. In this section, results with two parameter sets, (bottleneck size=32, λ=2000) and (bottleneck size=64, λ=10000), are provided. Note that since we average the BCE loss over the number of pixels, β in the conventional β-VAE is derived as 784/λ. Then, the model is optimized by the Adam optimizer with a learning rate of 1e-3, using the conventional (not decomposed) loss form. We use a PC with an Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz, 12GB memory, equipped with an NVIDIA GeForce GTX 1080. The simulation time for each trial is about 10 minutes, including the statistics evaluation codes. Figure 25 shows the averages of σ_j(x)^{-2} as well as the average and the standard deviation of (2/β) σ_j(x)² D_j(z). In both conditions, the means of the (2/β) σ_j(x)² D_j(z) averages are also close to 1, except in the dimensions where σ_j(x)^{-2} is less than 10. These results suggest that the theoretical property still holds when using the BCE loss. In the dimensions where σ_j(x)^{-2} is less than 10, (2/β) σ_j(x)² D_j(z) is somewhat lower than 1. The possible reason is that D_KL(j)(·) in such a dimension is 0 for some inputs and larger than 0 for others. The understanding of this transition region needs further study.

Let T be a quantization step. The quantized value ẑ_j is derived as kT, where k = round(z_j/T).
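The T²/12 distortion of this round-to-nearest quantizer, derived next, is easy to confirm by simulation; a minimal sketch with an arbitrary step T:

```python
import math, random

random.seed(3)
T = 0.05                            # quantization step (arbitrary small value)
zs = [random.gauss(0.0, 1.0) for _ in range(500_000)]
# quantize each z_j to k*T with k = round(z_j / T), then measure the distortion
d = sum((z - T * round(z / T)) ** 2 for z in zs) / len(zs)
print(round(d / (T * T / 12), 2))   # ratio close to 1
```

The ratio approaches 1 because, for a step much smaller than the data scale, the quantization error is nearly uniform on [-T/2, T/2], whose variance is T²/12.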
Then d, the distortion per channel, is approximated by

d = \sum_k \int_{(k-1/2)T}^{(k+1/2)T} p(z_j)(z_j - kT)^2 \, dz_j \approx \sum_k T p(kT) \cdot \frac{1}{T} \int_{(k-1/2)T}^{(k+1/2)T} (z_j - kT)^2 \, dz_j = \frac{T^2}{12} \sum_k T p(kT) \approx \frac{T^2}{12}.

Here, \sum_k T p(kT) \approx \int_{-\infty}^{\infty} p(z_j) \, dz_j = 1 is used. The distortion for a given quantized value is also estimated as T^2/12, because this value is approximated by \frac{1}{T} \int_{(k-1/2)T}^{(k+1/2)T} (z_j - kT)^2 \, dz_j.

In this appendix, the approximations of the reconstruction losses as a quadratic form ^t δx G_x δx + C_x are explained for the sum square error (SSE), binary cross entropy (BCE), and structural similarity (SSIM). Here, we have borrowed the derivations for BCE and SSIM from Kato et al. (2020), and add some explanation and clarification to them for convenience. We also describe the log-likelihood of the Gaussian distribution. Let \hat{x} and \hat{x}_i be the decoded sample Dec_θ(z) and its i-th dimensional component, respectively. δx and δx_i denote \hat{x} - x and \hat{x}_i - x_i, respectively. It is also assumed that δx and δx_i are infinitesimal. The details of the approximations are described as follows.
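The estimate d ≈ T²/12 for the distortion of uniform quantization above can be verified with a short Monte Carlo check (a sketch; the step size T = 0.1 is an arbitrary choice that is small relative to the unit-variance prior):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 0.1                                   # quantization step size
z = rng.standard_normal(1_000_000)        # z_j ~ N(0, 1)
z_hat = np.round(z / T) * T               # uniform quantization: z_hat = k*T
d = np.mean((z - z_hat) ** 2)             # empirical distortion per channel
print(d, T**2 / 12.0)                     # the two values nearly coincide
```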

Sum square error:

In the case of the sum square error, G_x is equal to I_m. This can be derived as:

\sum_{i=1}^{m} (x_i - \hat{x}_i)^2 = \sum_{i=1}^{m} δx_i^2 = {}^t δx \, I_m \, δx.

Binary cross entropy: Binary cross entropy is the log-likelihood of the Bernoulli distribution. The Bernoulli distribution is described as:

p_θ(x|z) = \prod_{i=1}^{m} \hat{x}_i^{x_i} (1 - \hat{x}_i)^{(1 - x_i)}.

Then, the binary cross entropy (BCE) can be expanded as:

-\log p_θ(x|z) = -\log \prod_{i=1}^{m} \hat{x}_i^{x_i} (1 - \hat{x}_i)^{(1 - x_i)} = \sum_{i=1}^{m} \left( -x_i \log \hat{x}_i - (1 - x_i) \log(1 - \hat{x}_i) \right) = \sum_i \left( -x_i \log(x_i + δx_i) - (1 - x_i) \log(1 - x_i - δx_i) \right) = \sum_i \left( -x_i \log\left(1 + \frac{δx_i}{x_i}\right) - (1 - x_i) \log\left(1 - \frac{δx_i}{1 - x_i}\right) \right) + \sum_i \left( -x_i \log x_i - (1 - x_i) \log(1 - x_i) \right). (70)

Here, the second term of the last equation is a constant C_x depending only on x. Using \log(1 + x) = x - x^2/2 + O(x^3), the first term of the last equation is further expanded as follows:

\sum_i \left( -δx_i + \frac{δx_i^2}{2 x_i} + δx_i + \frac{δx_i^2}{2 (1 - x_i)} \right) + O(δx_i^3) = \sum_i \frac{1}{2} \left( \frac{1}{x_i} + \frac{1}{1 - x_i} \right) δx_i^2 + O(δx_i^3).

As a result, the metric tensor G_x can be approximated by the following positive definite Hermitian (diagonal) matrix:

G_x = \mathrm{diag}\left( \frac{1}{2}\left(\frac{1}{x_1} + \frac{1}{1 - x_1}\right), \ldots, \frac{1}{2}\left(\frac{1}{x_m} + \frac{1}{1 - x_m}\right) \right).
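The diagonal quadratic approximation of BCE can be verified numerically. The sketch below compares BCE(x, x + δx) − C_x against ^t δx G_x δx for a small random perturbation; the dimension m = 50 and the range [0.2, 0.8] for x are arbitrary choices that keep pixel values away from 0 and 1:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 50
x = rng.uniform(0.2, 0.8, m)              # target pixels, kept away from 0 and 1
dx = 1e-3 * rng.standard_normal(m)        # small perturbation: x_hat = x + dx
x_hat = x + dx

def bce(x, x_hat):
    return np.sum(-x * np.log(x_hat) - (1.0 - x) * np.log(1.0 - x_hat))

# Quadratic model: BCE(x, x_hat) ≈ dx^T G_x dx + C_x, with
# G_x = diag((1/2) * (1/x_i + 1/(1 - x_i))) and C_x = BCE(x, x).
g = 0.5 * (1.0 / x + 1.0 / (1.0 - x))
quad = np.sum(g * dx**2)
exact = bce(x, x_hat) - bce(x, x)         # subtract the constant term C_x
print(exact, quad)                        # agree up to O(dx^3)
```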

Structural similarity (SSIM):

Structural similarity (SSIM) (Wang et al., 2001) is widely used as a picture quality metric that is close to human subjective quality. Let SSIM be the SSIM value between two pictures. The range of the SSIM value is between 0 and 1; the higher the value, the better the quality. In this appendix, we show that (1 - SSIM) can also be approximated by a quadratic form ^t δx G_x δx. SSIM_{N×N(h,v)}(x, y) denotes the SSIM value between N × N windows in pictures X and Y, where x ∈ R^{N^2} and y ∈ R^{N^2} denote the N × N pixels cropped from the top-left coordinate (h, v) in the images X and Y, respectively. Let μ_x, μ_y be the averages and σ_x^2, σ_y^2 the variances of all dimensional components of x and y in the N × N windows, and σ_xy their covariance. Then, SSIM_{N×N(h,v)}(x, y) is derived as

SSIM_{N×N(h,v)}(x, y) = \frac{2 μ_x μ_y}{μ_x^2 + μ_y^2} \cdot \frac{2 σ_{xy}}{σ_x^2 + σ_y^2}.

In order to calculate the SSIM value for a whole picture, the window is shifted over the picture and all of the SSIM values are averaged. Therefore, if 1 - SSIM_{N×N(h,v)}(x, y) is expressed as a quadratic form ^t δx G_{(h,v)x} δx, then (1 - SSIM) can also be expressed in the quadratic form ^t δx G_x δx. Let δx be a minute displacement of x, and let μ_δx and σ_δx^2 denote the average and variance of all dimensional components of δx, respectively. Then, the SSIM between x and x + δx can be approximated as:

SSIM_{N×N(h,v)}(x, x + δx) ≈ 1 - \frac{μ_{δx}^2}{2 μ_x^2} - \frac{σ_{δx}^2}{2 σ_x^2},

where μ_δx^2 = ^t δx M δx and σ_δx^2 = ^t δx V δx, with V = (1/N^2) I_{N^2} - M and M the matrix satisfying μ_δx^2 = ^t δx M δx. As a result, 1 - SSIM_{N×N(h,v)}(x, x + δx) can be expressed in the following quadratic form:

1 - SSIM_{N×N(h,v)}(x, x + δx) ≈ {}^t δx \, G_{(h,v)x} \, δx, where G_{(h,v)x} = \frac{1}{2 μ_x^2} M + \frac{1}{2 σ_x^2} V. (77)

It is noted that M and V are positive semidefinite Hermitian matrices whose null spaces intersect only at 0; therefore, G_{(h,v)x} is a positive definite Hermitian matrix. As a result, (1 - SSIM) can also be expressed in the quadratic form ^t δx G_x δx, where G_x is a positive definite Hermitian matrix.
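The quadratic approximation of 1 − SSIM can likewise be checked numerically. The sketch below builds M and V explicitly for a window of n = N² pixels, assuming M = (1/n²) 1 1^t (the all-ones matrix scaled so that μ_δx² = ^t δx M δx), and compares 1 − SSIM(x, x + δx) against ^t δx G δx; the window size and pixel range are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 64                                    # n = N^2 pixels in an N x N window (N = 8)
x = rng.uniform(0.3, 0.7, n)
dx = 1e-3 * rng.standard_normal(n)

def ssim(a, b):
    # Luminance and structure terms of SSIM without stabilizing constants,
    # matching the windowed formula in the text.
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    lum = 2.0 * mu_a * mu_b / (mu_a**2 + mu_b**2)
    struct = 2.0 * cov / (a.var() + b.var())
    return lum * struct

# Quadratic model: 1 - SSIM ≈ dx^T G dx, with
#   M = (1/n^2) * ones(n, n)   (so that mu_dx^2   = dx^T M dx),
#   V = (1/n) * I - M          (so that sigma_dx^2 = dx^T V dx),
#   G = M / (2 mu_x^2) + V / (2 sigma_x^2).
M = np.full((n, n), 1.0 / n**2)
V = np.eye(n) / n - M
G = M / (2.0 * x.mean() ** 2) + V / (2.0 * x.var())
exact = 1.0 - ssim(x, x + dx)
quad = dx @ G @ dx
print(exact, quad)                        # agree up to third-order terms
```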
Log-likelihood of the Gaussian distribution: The Gaussian distribution is described as:

p_θ(x|z) = \prod_{i=1}^{m} \frac{1}{\sqrt{2πσ^2}} e^{-(x_i - \hat{x}_i)^2 / 2σ^2} = \prod_{i=1}^{m} \frac{1}{\sqrt{2πσ^2}} e^{-δx_i^2 / 2σ^2},

where σ^2 is a variance given as a hyperparameter. Then, the log-likelihood of the Gaussian distribution is denoted as:

-\log p_θ(x|z) = -\log \prod_{i=1}^{m} \frac{1}{\sqrt{2πσ^2}} e^{-δx_i^2 / 2σ^2} = \frac{1}{2σ^2} \sum_{i=1}^{m} δx_i^2 + \frac{m}{2} \log(2πσ^2). (78)

The first term can be rewritten as (1/2σ^2) ^t δx I_m δx. Thus, G_x = (1/2σ^2) I_m holds. C_x is derived as the second term of the last equation in Eq. 78.

H.3 DETAILED EXPLANATION OF KL DIVERGENCE AS A RATE OF ENTROPY CODING.

This appendix explains in detail how KL divergence can be interpreted as a rate in transform coding. In transform coding, input data is transformed by an orthonormal transform. Then, the transformed data is quantized, and an entropy code is assigned to each quantized symbol, such that the length of the entropy code is equivalent to the negative logarithm of the estimated symbol probability. It is generally intractable to derive the rate and distortion of individual symbols in ideal information coding. Thus, we first discuss the case of uniform quantization. Let P_zj and R_zj be the probability and rate in the uniform quantization coding of z_j ∼ N(z_j; 0, 1). Here, μ_j(x) and σ_j(x)^2 are regarded as a quantized value and a coding noise after uniform quantization, respectively. Let T be the quantization step size. The coding noise after quantization is T^2/12 for the quantization step size T, as explained in Appendix H.1. Thus, T is derived as T = 2√3 σ_j(x) from σ_j(x)^2 = T^2/12. We also assume σ_j(x)^2 ≪ 1. As shown in Fig. 27a, P_zj is denoted by \int_{μ_j(x) - T/2}^{μ_j(x) + T/2} p(z_j) \, dz_j, where p(z_j) is N(z_j; 0, 1). Using Simpson's numerical integration method and the expansion e^x = 1 + x + O(x^2), P_zj is approximated as:



Footnotes:
- CelebA dataset: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
- Google provides a code in the official TensorFlow library: https://github.com/tensorflow/compression
- MNIST dataset: http://yann.lecun.com/exdb/mnist/



Figure 1: Mapping of VAE to implicit isometric embedding.

Figure 2: PDFs of three variables to generate a toy dataset.

Figure 3: Scattering plots of the data generation probability (x-axis) versus four estimated probabilities (y-axes) for the downward-convex loss. The y-axes are (a) p(μ_(x)), (b) exp(-L_x/β), (c) a_x^{3/2}

Figure 4: Graph of the σ_j(x)^{-2} average and (2/β) σ_j(x)^2 D_j(z) in VAE for the CelebA dataset.

Figure 5: Graph of the σ_j(x)^{-2} average and (2/β) σ_j(x)^2 D_j(z) in VAE for the CelebA dataset with the explicit decomposed loss.

Figure 6: Dependency of decoded image changes with z_j = -2 to 2 on the average of σ_j(x)^{-2}.

RELATION TO ALEMI ET AL. (2018)

Alemi et al. (2018) discuss the rate-distortion trade-off by a theoretical entropy analysis. We presume that their work also did not properly distinguish the objective L_x from ELBO, which leads to an incorrect discussion. In their work, the differential entropy of the input H, the distortion D, and the rate R are derived carefully. They suggest that VAE with β = 1 is sensitive (unstable) because D and R can take arbitrary values on the line R = H - βD = H - D. Furthermore, they also suggest that R ≥ H, D = 0 at β → 0 and R = 0, D ≥ H at β → ∞ will hold, as shown in Figure 1 of their work.

R = \int dx \, p^*(x) \int dy \, e(y|x) \log \frac{e(y|x)}{m(y)} is examined, where m(y) is the marginal probability of y and e(y|x) ≈ N(y; μ_y, (β/2) I_n). Using the relations dz \, e(z|x) = dy \, e(y|x) and e(z|x)/m(z) = (e(y|x)(dy/dz)) / (m(y)(dy/dz)) = e(y|x)/m(y), Eq. 37 can be rewritten in terms of y. Here, e(y|x) ≈ p(\hat{x}|x) ≈ N(\hat{x}; x, (β/2) I_m) will approximately hold, where \hat{x} denotes the decoder output. Thus, m(y) can be approximated by: m(y) ≈ \int dx \, p^*(x) \, e(y|x) ≈ \int dx \, p^*(x) \, N(\hat{x}; x, (β/2) I_m).

R ≈ \int dx \, p^*(x) \int dy \, e(y|x) \log \frac{e(y|x)}{p^*(x)} = -\int dx \, p^*(x) \log p^*(x) - \left( -\int dx \, p^*(x) \int dy \, e(y|x) \log e(y|x) \right). As discussed above, R and D can be specifically derived from β. In addition, the Shannon lower bound discussed in Alemi et al. (2018) can be roughly verified in the optimized VAE with clearer notations using β. From the discussion above, we presume Alemi et al. (2018) might treat D incorrectly in their work. They suggest that VAE with β = 1 is sensitive (unstable) because D and R can take arbitrary values on the line R = H - βD = H - D; however, our work as well as Tishby et al. (1999) (Appendix A.2) and Dai & Wipf (2019) (Appendix A.5) show that the distortion and rate, i.e., D and R, are specifically determined by β after optimization, and R = H - D will hold for any β, regardless of whether β = 1 or not.

Figure 7: Projection of the volume element from the implicit orthonormal space to the isometric space and the input space. V_n(·) denotes the n-dimensional volume.

Figures 11-19 show the property measurements for all combinations of the datasets and coding losses, with λ varied. In each figure, the estimated norms of the implicit transform are shown in panel (a), the ratios of the estimated variances in panel (b), and the correlation coefficients between p(x) and the estimated data probabilities in panel (c), respectively.

Figure 9: PDFs of three variables to generate a Ramp dataset.

Figure 10: Scale factor a_x for the downward-convex loss and the upward-convex loss.

Figure 11: Property measurements of the Mix dataset using the square error loss. λ is changed from 1 to 1,000. Var(z_j) denotes the estimated variance, given by the average of σ_j(x)^{-2}.

Figure 13: Property measurements of the Mix dataset using the upward-convex loss. λ is changed from 1 to 1,000. Var(z_j) denotes the estimated variance, given by the average of σ_j(x)^{-2}.

Figure 14: Property measurements of the Ramp dataset using the square error loss. λ is changed from 1 to 1,000. Var(z_j) denotes the estimated variance, given by the average of σ_j(x)^{-2}.

Figure 16: Property measurements of the Ramp dataset using the upward-convex loss. λ is changed from 1 to 1,000. Var(z_j) denotes the estimated variance, given by the average of σ_j(x)^{-2}.

Figure 17: Property measurements of the Norm dataset using the square error loss. λ is changed from 1 to 1,000. Var(z_j) denotes the estimated variance, given by the average of σ_j(x)^{-2}.

Figure 19: Property measurements of the Norm dataset using the upward-convex loss. λ is changed from 1 to 1,000. Var(z_j) denotes the estimated variance, given by the average of σ_j(x)^{-2}.

(a) Var(z_2)/Var(z_1). (b) Var(z_3)/Var(z_1).

Figure 20: Ratio of the estimated variances Var(z_3)/Var(z_1) and Var(z_2)/Var(z_1) for the three datasets and three coding losses at λ = 100. Var(z_j) denotes the estimated variance, given by the average of σ_j(x)^{-2}.

Figure 21: Dependency of the coding loss on β for the Mix, Norm, and Ramp datasets using the square error loss.

Figure 22: Dependency of the transform loss / coding loss ratio on β for the Mix, Norm, and Ramp datasets using the square error loss.

Figure 23: Traversed outputs for all the components, changing z_j from -2 to 2. The latent variables z_j are numbered in descending order of the estimated variance σ_j(x)^{-2} shown in Figures 4 and 5.

Figure 24: Graph of the σ_j(x)^{-2} average and (2/β) σ_j(x)^2 D_j(z) for the CelebA dataset. The bottleneck size and λ are set to 256 and 10000, respectively.

Figure 25: Graph of the σ_j(x)^{-2} average and (2/β) σ_j(x)^2 D_j(z) for the MNIST dataset.

H.2 APPROXIMATION OF RECONSTRUCTION LOSS AS A QUADRATIC FORM.

Figure 26: Graph of (1/2)(1/x + 1/(1 - x)).

The coefficient (1/2)(1/x + 1/(1 - x)) in the BCE approximation is a downward-convex function, as shown in Figure 26.

P_{z_j} \approx \frac{T}{6} \left[ p\!\left(μ_{j(x)} - \tfrac{T}{2}\right) + 4 p(μ_{j(x)}) + p\!\left(μ_{j(x)} + \tfrac{T}{2}\right) \right] = \frac{T \, p(μ_{j(x)})}{6} \left[ 4 + e^{(4 μ_{j(x)} T - T^2)/8} + e^{(-4 μ_{j(x)} T - T^2)/8} \right] \approx T \, p(μ_{j(x)}) \left( 1 - \tfrac{T^2}{24} \right) = \sqrt{\tfrac{6}{\pi}} \, σ_{j(x)} \, e^{-μ_{j(x)}^2 / 2} \left( 1 - \tfrac{σ_{j(x)}^2}{2} \right). (80)
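The flat approximation P_zj ≈ 2√3 σ_j(x) p(μ_j(x)) can be checked against the exact Gaussian probability mass on the quantization interval (a sketch; μ = 0.5 and σ = 0.05 are arbitrary values satisfying σ² ≪ 1):

```python
import math

mu, sigma = 0.5, 0.05                 # posterior mean and (small) std of z_j
T = 2.0 * math.sqrt(3.0) * sigma      # step size giving coding noise T^2/12 = sigma^2

def std_normal_cdf(a):
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

# Exact probability mass of the prior N(0, 1) on [mu - T/2, mu + T/2].
p_exact = std_normal_cdf(mu + T / 2.0) - std_normal_cdf(mu - T / 2.0)

# Flat approximation used in the text: P_zj ≈ T * p(mu) = 2*sqrt(3)*sigma*p(mu).
p_approx = T * math.exp(-mu**2 / 2.0) / math.sqrt(2.0 * math.pi)
print(p_exact, p_approx)              # nearly identical when sigma^2 << 1
```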

Property measurements of the toy dataset trained with the square error loss.

Property measurements of the toy dataset trained with the downward-convex loss.

derivation, they use E_{\hat{x} ∼ p(\hat{x}|x)}[p(x|\hat{x})] as a reconstruction loss, without discussing what kind of properties the distortion probability p(\hat{x}|x) should have. In training VAE with a real dataset, by contrast, a predetermined distortion metric D(x, \hat{x}) such as BCE or SSE is used as a reconstruction loss instead of the log-likelihood of the distortion probability, without discussing what the distortion probability should be after optimization. In fact, the distortion probability p(\hat{x}|x) after training is determined by β and the metric as p_θ(\hat{x}|x) ≈ N(\hat{x}; x, (β/2) I_m) in the metric-defined space.

E ADDITIONAL RESULTS IN THE TOY DATASETS

E.1 SCATTERING PLOTS FOR THE SQUARE ERROR LOSS IN SECTION 4.1

Figure 8a shows the plots of p(x) and the estimated probabilities for the square error coding loss in Section 4.1, where the scale factor a_x in Eq. 23 is 1. Thus, both exp(-L_x/β) and p(μ_(x)) \prod_j σ_j(x) show a high correlation, allowing easy estimation of the data probability in the input space. In contrast, p(μ_(x)) still shows a low correlation. These results are consistent with our theory.

In this appendix, we also explain the ablation study for the toy datasets. We introduce three toy datasets and three coding losses, including those used in Section 4.1. We also change β^{-1} = λ from 1 to 1,000 in training. The details of the experimental conditions are as follows.

Datasets: First, we call the toy dataset used in Section 4.1 the Mix dataset in order to distinguish the three datasets. The second dataset is generated such that the three dimensional variables s_1, s_2, and s_3 are sampled in accordance with the distributions p(s_1), p(s_2), and p(s_3) in Figure 9. The variances of the variables are the same as those of the Mix dataset, i.e., 1/6, 2/3, and 8/3, respectively. We call this the Ramp dataset. Because the PDF shape of this dataset is quite different from the prior N(z; 0, I_3), the fitting will be the most difficult among the three. The third dataset is generated such that the three dimensional variables s_1, s_2, and s_3 are sampled in accordance with the normal distributions N(s_1; 0, 1/6), N(s_2; 0, 2/3), and N(s_3; 0, 8/3), respectively. We call this the Norm dataset. The fitting will be the easiest, because both the prior and the input have normal distributions, and the posterior standard deviation, given by the PDF ratio at the same CDF, can be a constant.

Coding losses: Two of the three coding losses are the square error loss and the downward-convex loss described in Section 4.1. The third coding loss is an upward-convex loss which we design as Eq. 66 such that the scale factor a_x becomes the reciprocal of the scale factor in Eq.
24. Figure 10 shows the scale factors a_x in Eqs. 24 and 66, where s_1 in x = (s_1, 0, 0) moves within ±5.

Parameters: As explained in Appendix D.1, λ = 1/β is used as a hyperparameter. Specifically, λ = 1, 2, 5, 10, 20, 50, 100, 200, 500, and 1,000 are used.

Using the expansion \log(1 + x) = x + O(x^2), R_{µσ} is derived as:

R_{µσ} = -\log P_{z_j} \approx \frac{μ_{j(x)}^2}{2} - \frac{1}{2} \log σ_{j(x)}^2 - \frac{1}{2} \log \frac{6}{\pi} + \frac{σ_{j(x)}^2}{2}. (81)

When R_{µσ} and D_{KL(j)}(·) in Eq. 2 are compared, both equations agree except for a small constant difference \frac{1}{2} \log \frac{\pi e}{6} ≈ 0.176 for each dimension. As a result, the KL divergence for the j-th dimension is equivalent to the rate of the uniform quantization coding, allowing a small constant difference.

To make the theoretical analysis easier, we use the simpler approximation P_{z_j} = 2\sqrt{3} σ_{j(x)} p(μ_{j(x)}) instead of Eq. 80, as shown in Fig. 27b. Then, R_{z_j} is derived as:

R_{z_j} = -\log\left( 2\sqrt{3} σ_{j(x)} p(μ_{j(x)}) \right) = \frac{μ_{j(x)}^2}{2} - \frac{1}{2} \log σ_{j(x)}^2 - \frac{1}{2} \log \frac{6}{\pi}. (82)

This equation also means that the approximation of KL divergence in Eq. 6 is equivalent to the rate of the uniform quantization coding with the approximation P_{z_j} = 2\sqrt{3} σ_{j(x)} p(μ_{j(x)}), allowing the same small constant difference as in Eq. 81. It is noted that the approximation P_{z_j} = 2\sqrt{3} σ_{j(x)} p(μ_{j(x)}) in Figure 27b can be applied to any kind of prior PDF, because there is no explicit assumption on the prior PDF. This implies that the theoretical discussion after Eq. 6 in the main text will hold for arbitrary prior PDFs.

Finally, the meaning of the small constant difference \frac{1}{2} \log \frac{\pi e}{6} in Eqs. 81 and 82 is shown. Pearlman & Said (2011) explain that the difference in rate between ideal information coding and uniform quantization is \frac{1}{2} \log \frac{\pi e}{6}. This is caused by the entropy difference of the noise distributions. In the ideal case, the noise distribution is known to be Gaussian; when the noise variance is σ^2, the entropy of the Gaussian noise is \frac{1}{2} \log(2\pi e σ^2). For uniform quantization, with a uniform noise distribution, the entropy is \frac{1}{2} \log(12 σ^2). As a result, the difference is just \frac{1}{2} \log \frac{\pi e}{6}.
Because the rate estimation in this appendix uses uniform quantization, the small offset \frac{1}{2} \log \frac{\pi e}{6} can be regarded as the difference between ideal information coding and uniform quantization. As a result, the KL divergence in Eq. 2 and Eq. 6 can be regarded as the rate of ideal information coding for a symbol with mean μ_j(x) and variance σ_j(x)^2.
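The constant offset (1/2) log(πe/6) ≈ 0.176 between the uniform-quantization rate and the KL divergence can be confirmed numerically (a sketch assuming a Gaussian posterior N(μ, σ²), a standard normal prior, and σ² ≪ 1):

```python
import math

def kl_gauss(mu, sigma2):
    # KL( N(mu, sigma2) || N(0, 1) ): the per-dimension KL divergence of VAE.
    return 0.5 * (mu**2 + sigma2 - math.log(sigma2) - 1.0)

def rate_uniform_quant(mu, sigma2):
    # Rate -log(P_zj) of uniform quantization with step T = 2*sqrt(3)*sigma,
    # using the flat approximation P_zj ≈ 2*sqrt(3)*sigma*p(mu), p = N(0, 1) pdf.
    sigma = math.sqrt(sigma2)
    p_mu = math.exp(-mu**2 / 2.0) / math.sqrt(2.0 * math.pi)
    return -math.log(2.0 * math.sqrt(3.0) * sigma * p_mu)

offset = 0.5 * math.log(math.pi * math.e / 6.0)     # ≈ 0.176 nats
sigma2 = 1e-4                                        # assume sigma^2 << 1
gaps = [rate_uniform_quant(mu, sigma2) - kl_gauss(mu, sigma2)
        for mu in (-1.0, 0.0, 0.5, 2.0)]
print(gaps, offset)    # the gap is ~0.176 nats, independently of mu
```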

