QUANTITATIVE UNDERSTANDING OF VAE AS A NON-LINEARLY SCALED ISOMETRIC EMBEDDING

Abstract

Variational autoencoder (VAE) estimates the posterior parameters (mean and variance) of the latent variables corresponding to each input data point. While it is used for many tasks, the transparency of the model is still an open issue. This paper provides a quantitative understanding of VAE properties by interpreting VAE as a non-linearly scaled isometric embedding. According to rate-distortion theory, optimal transform coding is achieved by a PCA-like orthonormal transform whose transform space is isometric to the input. From this analogy, we show theoretically and experimentally that VAE can be mapped to an implicit isometric embedding with a scale factor derived from the posterior parameters. As a result, we can estimate the data probabilities in the input space from the prior, the loss metric, and the corresponding posterior parameters. In addition, the quantitative importance of each latent variable can be evaluated like the eigenvalues of PCA.

1. INTRODUCTION

Variational autoencoder (VAE) (Kingma & Welling, 2014) is one of the most successful generative models, estimating the posterior parameters of latent variables for each input data point. In VAE, the latent representation is obtained by maximizing an evidence lower bound (ELBO). A number of studies (Higgins et al., 2017; Kim & Mnih, 2018; Lopez et al., 2018; Chen et al., 2018; Locatello et al., 2019; Alemi et al., 2018; Rolínek et al., 2019) have tried to reveal the properties of the latent variables. However, the quantitative behavior of VAE is still not well clarified. For example, there has been no theoretical formulation of the reconstruction loss and KL divergence in ELBO after optimization. More specifically, although the conditional distribution p_θ(x|z) in the reconstruction loss of ELBO is predetermined, e.g., as a Gaussian or Bernoulli distribution, it has not been discussed well whether the true conditional distribution after optimization matches the predetermined one. Rate-distortion (RD) theory (Berger, 1971), an important part of Shannon information theory that has been successfully applied to image compression, quantitatively formulates the optimal RD trade-off in lossy compression. To realize a quantitative data analysis, an RD-theory-based autoencoder, RaDOGAGA (Kato et al., 2020), has been proposed with isometric embedding (Han & Hong, 2006), in which the distance between two arbitrary points of the input space in a given metric is always the same as the L2 distance in the isometric embedding space. In this paper, by mapping the VAE latent space to an implicit isometric space like RaDOGAGA on a variable-by-variable basis and analysing VAE quantitatively as a well-examined lossy compression, we thoroughly clarify the quantitative properties of VAE theoretically and experimentally as follows. 1) An implicit isometric embedding is derived in the space defined by the loss metric such that the entropy of the data representation becomes minimum.
A scaling factor between the VAE latent space and the implicit isometric space is formulated from the posterior for each input. In the case of β-VAE, the posterior variance of each dimensional component in the implicit isometric embedding space is a constant β/2, which is analogous to the rate-distortion optimum of transform coding in RD theory. As a result, the reconstruction loss and KL divergence in ELBO can be quantitatively formulated. 2) From these properties, VAE can provide a practical quantitative analysis of input data. First, the data probabilities in the input space can be estimated from the prior, the loss metric, and the posterior parameters. In addition, the quantitative importance of each latent variable, analogous to the eigenvalues of PCA, can be evaluated from the posterior variance of VAE. This work will help guide information-theoretic generative models in the right direction.
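The isometric-embedding property referenced above — pairwise distances in the input space being preserved as L2 distances in the embedding space — can be illustrated with a minimal numerical sketch. The orthonormal map and sample points below are hypothetical stand-ins, not outputs of a VAE:

```python
import numpy as np

# Under an orthonormal (orthogonal, unit-norm) transform, the L2 distance
# between any two input points is preserved in the transformed space.
rng = np.random.default_rng(0)

# Build a random orthonormal matrix via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))

x1, x2 = rng.standard_normal(5), rng.standard_normal(5)
z1, z2 = Q @ x1, Q @ x2

d_input = np.linalg.norm(x1 - x2)  # distance in the input space
d_embed = np.linalg.norm(z1 - z2)  # distance in the transformed space
assert np.isclose(d_input, d_embed)  # the map is isometric
```

A VAE encoder is of course non-linear; the paper's claim is that an isometric map of this kind exists implicitly, up to a per-variable scale factor.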

2.1. VARIATIONAL AUTOENCODER AND THEORETICAL ANALYSIS

In VAE, ELBO is maximized instead of maximizing the log-likelihood directly. Let x ∈ R^m be a point in a dataset. The original VAE model consists of a latent variable with a fixed prior z ∼ p(z) = N(z; 0, I_n) ∈ R^n, a parametric encoder Enc_φ : x → z, and a parametric decoder Dec_θ : z → x̂. In the encoder, q_φ(z|x) = N(z; μ_(x), diag(σ_(x)²)) is provided by estimating the parameters μ_(x) and σ_(x). Let L_x be a local cost at data x. Then, ELBO is described by

ELBO = E_{x∼p(x)} [ E_{z∼q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) ‖ p(z)) ].   (1)

In E_{x∼p(x)}[·], the first term E_{z∼q_φ(z|x)}[·] is called the reconstruction loss. The second term D_KL(·) is a Kullback-Leibler (KL) divergence. Let μ_j(x), σ_j(x), and D_KLj(x) be the j-th dimensional values of μ_(x), σ_(x), and the KL divergence. Then D_KL(·) is derived as:

D_KL(·) = Σ_{j=1}^{n} D_KLj(x),  where  D_KLj(x) = (1/2) (μ_j(x)² + σ_j(x)² − log σ_j(x)² − 1).   (2)

D(x, x̂) denotes a metric such as the sum square error (SSE) or binary cross-entropy (BCE), corresponding to the log-likelihoods of the Gaussian and Bernoulli distributions, respectively. In training VAE, the following objective is used instead of Eq. 1, where β is a parameter to control the trade-off (Higgins et al., 2017):

L_x = E_{z∼q_φ(z|x)}[D(x, x̂)] + β D_KL(·).

However, it has not been fully discussed whether the true conditional distribution matches the predetermined distribution, or how the value of the KL divergence is derived after training. There have been several studies that analyse VAE theoretically. Alemi et al. (2018) introduced the RD trade-off based on an information-theoretic framework to analyse β-VAE. However, they did not clarify the quantitative property after optimization. Dai et al. (2018) showed that a VAE restricted to a linear transform can be considered a robust PCA. However, their model is limited for the analysis on each latent variable basis because of the linearity assumption. Rolínek et al.
(2019) showed that the Jacobian matrix of VAE at each latent variable is orthogonal, which makes the latent variables implicitly disentangled. However, they did not uncover the orthonormality and quantitative properties because they simplified the KL divergence as a constant. Dai & Wipf (2019) also showed that the expected rate of VAE for an r-dimensional manifold is close to −(r/2) log γ + O(1) as γ → 0 when p_θ(x̂|x) = N(x̂; x, γI_m) holds. The remaining challenge is to clearly figure out what latent space is obtained for a given dataset, loss metric, and β in the model.
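The per-dimension KL term of Eq. 2 and the β-VAE objective L_x can be written directly in code as a minimal sketch. The values of μ, σ, x, and x̂ below are illustrative placeholders, not outputs of a trained model:

```python
import numpy as np

# Per-dimension KL divergence between the Gaussian posterior N(mu_j, sigma_j^2)
# and the standard-normal prior N(0, 1), as in Eq. 2:
#   D_KLj = (mu_j^2 + sigma_j^2 - log sigma_j^2 - 1) / 2
def kl_per_dim(mu, sigma):
    return 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

mu = np.array([0.5, -1.0, 0.0])      # hypothetical posterior means
sigma = np.array([0.8, 0.5, 1.0])    # hypothetical posterior std deviations
d_kl = kl_per_dim(mu, sigma).sum()   # D_KL(.) summed over dimensions

# beta-VAE objective L_x = D(x, x_hat) + beta * D_KL, with D as sum-square error.
x = np.array([0.2, 0.7])
x_hat = np.array([0.1, 0.9])
sse = np.sum((x - x_hat)**2)
beta = 4.0
loss = sse + beta * d_kl
```

Note that a dimension whose posterior matches the prior (μ_j = 0, σ_j = 1) contributes zero KL, which is the "collapsed" case where the latent variable carries no information.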

2.2. RATE-DISTORTION THEORY, TRANSFORM CODING, AND ISOMETRIC EMBEDDING

RD theory (Berger, 1971) formulated the optimal transform coding (Goyal, 2001) for the Gaussian source with the square error metric as follows. Let x ∈ R^m be a point in a dataset. First, the data are transformed deterministically with an orthonormal transform (orthogonal and unit norm) such as the Karhunen-Loève transform (KLT) (Rao & Yip, 2000). Let z ∈ R^m be the point transformed from x. Then, z is entropy-coded while allowing an equivalent stochastic distortion (or a posterior with constant variance) in each dimension. The lower bound of the rate R at a distortion D is denoted by R(D). The derivation of R(D) is as follows. Let z_j be the j-th dimensional component of z and σ_zj² be the variance of z_j in the dataset. Note that σ_zj² is equivalent to the eigenvalue of PCA for the dataset. Let d be the distortion equally allowed in each dimensional channel. At the optimal condition, the distortion D_opt and the rate R_opt on the curve R(D) are calculated as functions of d:

D_opt = Σ_{j=1}^{m} d = m d,   R_opt = Σ_{j=1}^{m} (1/2) log(σ_zj² / d).

The simplest way to allow equivalent distortion is to use a uniform quantization (Goyal, 2001). Let T be a quantization step, and round(·) be a round function. The quantized value ẑ_j is derived as kT, where k = round(z_j / T). Then, d is approximated by T²/12, as explained in Appendix H.1.
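The T²/12 approximation for uniform quantization can be checked empirically in a few lines. A standard-normal source and a step of T = 0.25 are assumed purely for illustration:

```python
import numpy as np

# High-rate approximation: uniform quantization with step T yields a
# mean-square distortion of about T^2 / 12, because the quantization error
# is approximately uniform on [-T/2, T/2].
rng = np.random.default_rng(0)
T = 0.25
z = rng.standard_normal(1_000_000)   # source samples z_j
z_hat = T * np.round(z / T)          # quantized values k*T, k = round(z_j / T)
d = np.mean((z - z_hat)**2)          # empirical distortion
assert abs(d - T**2 / 12) < 2e-4     # matches T^2/12 ~= 0.00521
```

The approximation is accurate when T is small relative to the source's standard deviation, which is the high-rate regime assumed in the derivation above.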

