TRADING INFORMATION BETWEEN LATENTS IN HIERARCHICAL VARIATIONAL AUTOENCODERS

Abstract

Variational Autoencoders (VAEs) were originally motivated (Kingma & Welling, 2014) as probabilistic generative models in which one performs approximate Bayesian inference. The proposal of β-VAEs (Higgins et al., 2017) breaks this interpretation and generalizes VAEs to application domains beyond generative modeling (e.g., representation learning, clustering, or lossy data compression) by introducing an objective function that allows practitioners to trade off between the information content ("bit rate") of the latent representation and the distortion of reconstructed data (Alemi et al., 2018). In this paper, we reconsider this rate/distortion trade-off in the context of hierarchical VAEs, i.e., VAEs with more than one layer of latent variables. We identify a general class of inference models for which one can split the rate into contributions from each layer, which can then be tuned independently. We derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates and verify our theoretical findings in large-scale experiments. Our results provide guidance for practitioners on which region in rate-space to target for a given application.

1. INTRODUCTION

Variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) are a class of deep generative models that are used, e.g., for density modeling (Takahashi et al., 2018), clustering (Jiang et al., 2017), nonlinear dimensionality reduction of scientific measurements (Laloy et al., 2017), data compression (Ballé et al., 2017), anomaly detection (Xu et al., 2018), and image generation (Razavi et al., 2019). VAEs (more precisely, β-VAEs (Higgins et al., 2017)) span such a diverse set of application domains in part because they can be tuned to a specific task without changing the network architecture, in a way that is well understood from information theory (Alemi et al., 2018).

The original proposal of VAEs (Kingma & Welling, 2014) motivates them from the perspective of generative probabilistic modeling and approximate Bayesian inference. However, the generalization to β-VAEs breaks this interpretation as they are no longer trained by maximizing a lower bound on the marginal data likelihood. These models are better described as neural networks that are trained to learn the identity function, i.e., to make their output resemble the input as closely as possible. This task is made nontrivial by introducing a so-called (variational) information bottleneck (Alemi et al., 2017; Tishby & Zaslavsky, 2015) at one or more layers, which restricts the information content that passes through these layers. The network activations at the information bottleneck are called latent representations (or simply "latents"), and they split the network into an encoder part (from input to latents) and a decoder part (from latents to output). This separation of the model into an encoder and a decoder allows us to categorize the wide variety of applications of VAEs into three domains:

1. data reconstruction tasks, i.e., applications that involve both the encoder and the decoder; these include various nonlinear inter- and extrapolations (e.g., image upscaling, denoising, or inpainting), and VAE-based methods for lossy data compression;
2. representation learning tasks, i.e., applications that involve only the encoder; they serve a downstream task that operates on the (typically lower dimensional) latent representation, e.g., classification, regression, visualization, clustering, or anomaly detection; and
3. generative modeling tasks, i.e., applications that involve only the decoder; these are less common but include generating new samples that resemble training data.

Figure 1: Left: trade-off between performance in the three application domains of VAEs, using GHVAE trained on the SVHN data set (details: Section 5); higher is better for all three metrics; gray dots on walls show 2d-projections. Right: color code, corresponding layer-wise rates (Eq. 7), and individual performance landscapes (size of dots ∝ performance). The hyperparameters $\beta_2$ and $\beta_1$ allow us to tune the HVAE for best data reconstruction (△), best representation learning (⋄), or best generative modeling. Note that performance landscapes differ strongly across the three applications, and neither a standard VAE ($\beta_2 = \beta_1 = 1$; marked "•" in right panels) nor a conventional β-VAE ($\beta_2 = \beta_1$; dashed red lines) results in optimal models for any of the three applications.

The information bottleneck incentivizes the VAE to encode information into the latents efficiently by removing any redundancies from the input. How aggressively this is done can be controlled by tuning the strength β of the information bottleneck (Alemi et al., 2018). Unfortunately, information theory distinguishes relevant from redundant information only in a quantitative way that is agnostic to the qualitative features that each piece of information represents about some data point.
In practice, many VAE-architectures (Deng et al., 2017; Yingzhen & Mandt, 2018; Ballé et al., 2018) try to separate qualitatively different features into different parts of the latent representation by making the model architecture reflect some prior assumptions about the semantic structure of the data. This allows downstream applications from the three domains discussed above to more precisely target specific qualitative aspects of the data by using or manipulating only the corresponding part of the latent representation. However, in this approach, the degree of detail to which each qualitative aspect is encoded in the latents can be controlled at most indirectly by tuning network layer sizes.

In this paper, we argue both theoretically and empirically that the three different application domains of VAEs identified above require different trade-offs in the amount of information that is encoded in each part of the latent representation. We propose a method to independently control the information content (or "rate") of each layer of latent representations, generalizing the rate/distortion theory of β-VAEs (Alemi et al., 2018) for VAEs with more than one layer of latents ("hierarchical VAEs" or HVAEs for short). We identify the most general model architecture that is compatible with our proposal and analyze how both theoretical performance bounds and empirically measured performances in each of the above three application domains depend on how rate is distributed across layers.

Our approach is summarized in Figure 1. The 3d-plot shows empirically measured performance metrics (discussed in detail in Section 5.2) for the three application domains identified above. Each point on the colored surface corresponds to different layer-wise rates in an HVAE with two layers of latents.
Crucially, the rates that lead to optimal performance are different for each of the three application domains (see the three markers in Figure 1), and none of these three optimal models coincides with a conventional β-VAE (dashed red lines in right panels). Thus, being able to control each layer's individual rate allows practitioners to train VAEs that target a specific application.

The paper is structured as follows. Section 2 summarizes related work. Section 3 introduces the proposed information-trading method. We then analyze how controlling individual layers' rates can be used to tune HVAEs for specific tasks, i.e., how performance in each of the three application domains identified above depends on the allocation of rates across layers. This analysis is done theoretically in Section 4 and empirically in Section 5. Section 6 provides concluding remarks.

3. A HIERARCHICAL INFORMATION TRADING FRAMEWORK

We propose a refinement of the rate/distortion theory of β-VAEs (Alemi et al., 2018) that admits controlling individual layers' rates in VAEs with more than one layer of latents (hierarchical VAEs).

3.1. CONVENTIONAL β-VAE WITH HIERARCHICAL LATENT REPRESENTATIONS

We consider a hierarchical VAE (HVAE) for data $x$ with $L$ layers of latent representations $\{z_\ell\}_{\ell=1}^{L}$. Figure 2, discussed further in Section 3.2 below, illustrates various model architectures for the example of $L = 2$. Solid arrows depict the generative model $p_\theta(\{z_\ell\}, x)$, where $\theta$ are model parameters (neural network weights). We assume that the implementation factorizes $p_\theta(\{z_\ell\}, x)$ as follows,

$$p_\theta(\{z_\ell\}, x) = p_\theta(z_L)\, p_\theta(z_{L-1} \mid z_L)\, p_\theta(z_{L-2} \mid z_{L-1}, z_L) \cdots p_\theta(z_1 \mid z_{\geq 2})\, p_\theta(x \mid z_{\geq 1}) \tag{1}$$

where the notation $z_{\geq n}$ for any $n$ is short for the collection of latents $\{z_\ell\}_{\ell=n}^{L}$ (thus, $z_{\geq 1}$ and $\{z_\ell\}$ are synonymous), and the numbering of latents from $L$ down to 1 follows the common convention in the literature (Sønderby et al., 2016; Gulrajani et al., 2017; Child, 2021). The loss function of a normal β-VAE (Higgins et al., 2017) with this generic architecture would be

$$\mathcal{L}_\beta(\theta, \phi) = \mathbb{E}_{x \sim X_\text{train}} \Big[ \underbrace{\mathbb{E}_{q_\phi(\{z_\ell\} \mid x)} \big[ -\log p_\theta(x \mid \{z_\ell\}) \big]}_{=\, \text{"distortion" } D} + \beta\, \underbrace{D_\text{KL}\big[ q_\phi(\{z_\ell\} \mid x) \,\big\|\, p_\theta(\{z_\ell\}) \big]}_{=\, \text{"rate" } R} \Big]. \tag{2}$$

Here, $q_\phi(\{z_\ell\} \mid x)$ is the inference (or "encoder") model with parameters $\phi$, $X_\text{train}$ is the training set, $D_\text{KL}[\,\cdot\, \| \,\cdot\,]$ denotes Kullback-Leibler divergence, and the Lagrange parameter $\beta > 0$ trades off between a (total) rate $R$ and a distortion $D$ (Alemi et al., 2018). Setting $\beta = 1$ turns Eq. 2 into the negative ELBO objective of a regular VAE (Kingma & Welling, 2014). The rate $R$ obtains its name as it measures the (total) information content that $q_\phi$ encodes into the latent representations $\{z_\ell\}$, which would manifest itself in the expected bit rate when one optimally encodes a random draw $\{z_\ell\} \sim q_\phi(\{z_\ell\} \mid x)$ using $p_\theta(\{z_\ell\})$ as an entropy model (Agustsson & Theis, 2020; Bennett et al., 2002).
An important observation pointed out in (Alemi et al., 2017) is that, regardless of how rate $R$ is traded off against distortion $D$ by tuning β, their sum $R + D$ is, in expectation under any data distribution $p_\text{data}(x)$, always lower bounded by the entropy $H[p_\text{data}(x)] := \mathbb{E}_{p_\text{data}(x)}[-\log p_\text{data}(x)]$,

$$\mathbb{E}_{p_\text{data}(x)}[R + D] \geq H[p_\text{data}(x)] \quad \forall\, p_\text{data}. \tag{3}$$

Limitations. The rate $R$ in Eq. 2 is a property of the collection $\{z_\ell\}$ of all latents, which can limit its interpretability for some inference models. For example, the common convention of enumerating layers $z_\ell$ from $\ell = L$ down to 1 in Eq. 1 is reminiscent of a naive architecture for the inference model that factorizes in reverse order compared to Eq. 1 ("bottom up", see dashed arrows in Figure 2(a)), i.e.,

$$q_\phi(\{z_\ell\} \mid x) = q_\phi(z_1 \mid x)\, q_\phi(z_2 \mid z_1) \cdots q_\phi(z_L \mid z_{L-1}).$$

Using a HVAE with such a "bottom-up" inference model to reconstruct some given data point $x$ would map $x$ to $z_1$ using $q_\phi(z_1 \mid x)$ and then map $z_1$ back to the data space using $p_\theta(x \mid z_1)$, thus ignoring all latents $z_\ell$ with $\ell > 1$. Yet, the rate term in Eq. 2 still depends on all latents, including the ones not needed to reconstruct any data (practical VAE-based compression methods using bits-back coding (Frey & Hinton, 1997) would, however, indeed use $z_\ell$ with $\ell > 1$ as auxiliary variables for computational efficiency).

3.2. TRADING INFORMATION BETWEEN LATENTS

Many HVAEs used in the literature allow us to resolve the limitations identified in Section 3.1. For example, the popular LVAE architecture (Sønderby et al., 2016) (Figure 2(b)) uses an inference model (dashed arrows) that traverses the latents $\{z_\ell\}$ in the same order as the generative model (solid arrows). We consider the following generalization of this architecture (see Figure 2(c)),

$$q_\phi(\{z_\ell\} \mid x) = q_\phi(z_L \mid x)\, q_\phi(z_{L-1} \mid z_L, x)\, q_\phi(z_{L-2} \mid z_{L-1}, z_L, x) \cdots q_\phi(z_1 \mid z_{\geq 2}, x). \tag{4}$$

Formally, Eq. 4 is just the product rule of probability theory and therefore holds for arbitrary inference models $q_\phi(\{z_\ell\} \mid x)$. More practically, however, we make the assumption that the actual implementation of $q_\phi(\{z_\ell\} \mid x)$ follows the structure in Eq. 4.
This means that, using the trained model, the most efficient way to map a given data point $x$ to its reconstruction $\hat{x}$ now involves all latents $z_\ell$ (either drawing a sample or taking the mode at each step):

$$x \xrightarrow{\, q_\phi(z_L \mid x) \,} z_L \xrightarrow{\, q_\phi(z_{L-1} \mid z_L, x) \,} z_{L-1} \longrightarrow \cdots \longrightarrow z_2 \xrightarrow{\, q_\phi(z_1 \mid z_{\geq 2}, x) \,} z_1 \xrightarrow{\, p_\theta(x \mid \{z_\ell\}) \,} \hat{x}. \tag{5}$$

Layer-wise Rates. We can interpret Eq. 5 as first mapping $x$ to a "crude" representation $z_L$, which gets iteratively refined to $z_1$, and finally to a reconstruction $\hat{x}$. Note that each factor $q_\phi(z_\ell \mid z_{\geq \ell+1}, x)$ of the inference model in Eq. 4 is conditioned not only on the previous layers $z_{\geq \ell+1}$ but also on the original data $x$. This allows the inference model to target each refinement step in Eq. 5 such that the reconstruction $\hat{x}$ becomes close to $x$. More formally, we chose the inference architecture in Eq. 4 such that it factorizes over $\{z_\ell\}$ in the same order as the generative model (Eq. 1). This allows us to split the total rate $R$ into a sum of layer-wise rates as follows,

$$\begin{aligned} R &= \mathbb{E}_{q_\phi(\{z_\ell\} \mid x)} \left[ \log \frac{q_\phi(z_L \mid x)}{p_\theta(z_L)} + \log \frac{q_\phi(z_{L-1} \mid z_L, x)}{p_\theta(z_{L-1} \mid z_L)} + \ldots + \log \frac{q_\phi(z_1 \mid z_{\geq 2}, x)}{p_\theta(z_1 \mid z_{\geq 2})} \right] \\ &= R(z_L) + R(z_{L-1} \mid z_L) + R(z_{L-2} \mid z_{L-1}, z_L) + \ldots + R(z_1 \mid z_{\geq 2}). \end{aligned} \tag{6}$$

Here,

$$R(z_L) = D_\text{KL}\big[ q_\phi(z_L \mid x) \,\big\|\, p_\theta(z_L) \big] \quad \text{and} \quad R(z_\ell \mid z_{\geq \ell+1}) = \mathbb{E}_{q_\phi(z_{\geq \ell+1} \mid x)} \Big[ D_\text{KL}\big[ q_\phi(z_\ell \mid z_{\geq \ell+1}, x) \,\big\|\, p_\theta(z_\ell \mid z_{\geq \ell+1}) \big] \Big] \tag{7}$$

quantify the information content of the highest-order latent representation $z_L$ and the (expected) increase in information content in each refinement step $z_{\ell+1} \to z_\ell$ in Eq. 5, respectively.

Controlling Each Layer's Rate. Using Eqs. 6-7, we generalize the rate/distortion trade-off from Section 3.1 by introducing $L$ individual Lagrange multipliers $\beta_L, \beta_{L-1}, \ldots, \beta_1$, collectively denoted as boldface $\boldsymbol{\beta}$. This leads to a new loss function that generalizes Eq. 2 as follows,

$$\mathcal{L}_{\boldsymbol{\beta}}(\theta, \phi) = \mathbb{E}_{x \sim X_\text{train}} \big[ D + \beta_L R(z_L) + \beta_{L-1} R(z_{L-1} \mid z_L) + \ldots + \beta_1 R(z_1 \mid z_{\geq 2}) \big]. \tag{8}$$
Setting all βs to the same value recovers the conventional β-VAE (Eq. 2), which trades off distortion against total information content in $\{z_\ell\}$. Tuning each β-hyperparameter individually allows trading off information content across latents. (In a very deep HVAE, i.e., for large $L$, it may be more practical to group layers into only a few bins and to use the same β-value for all layers within a bin.) We analyze how to tune βs for various applications theoretically in Section 4 and empirically in Section 5.
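The layer-wise loss of Eq. 8 can be sketched for $L = 2$ with diagonal Gaussian distributions. The following is a minimal illustration under stated assumptions, not the implementation used in the experiments: the helper names `kl_diag_gauss` and `layerwise_loss` are hypothetical, all numbers are made up, and $R(z_1 \mid z_2)$ is approximated with a single sample $z_2 \sim q_\phi(z_2 \mid x)$.

```python
import math

def kl_diag_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """KL[N(mu_q, diag sigma_q^2) || N(mu_p, diag sigma_p^2)] in nats."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        var_ratio = (sq / sp) ** 2
        total += 0.5 * (var_ratio + ((mq - mp) / sp) ** 2 - 1.0 - math.log(var_ratio))
    return total

def layerwise_loss(distortion, rate_z2, rate_z1_given_z2, beta2, beta1):
    """Per-data-point integrand of Eq. 8 for L = 2 (hypothetical helper)."""
    return distortion + beta2 * rate_z2 + beta1 * rate_z1_given_z2

# Made-up encoder and prior outputs for a single data point x.
# R(z2): KL from q(z2|x) to the standard Gaussian prior p(z2) = N(0, I).
rate_z2 = kl_diag_gauss([0.8, -0.3], [0.5, 0.6], [0.0, 0.0], [1.0, 1.0])
# R(z1|z2): single-sample estimate, both distributions evaluated at one z2 ~ q(z2|x).
rate_z1 = kl_diag_gauss([0.2, 1.1, -0.4], [0.3, 0.4, 0.5],
                        [0.0, 0.9, -0.1], [0.8, 0.7, 1.0])
distortion = 12.5  # made-up value of -log p(x | z1, z2)

# Equal betas give the conventional beta-VAE loss (Eq. 2); unequal betas trade
# rate between the two layers.
loss_beta_vae = layerwise_loss(distortion, rate_z2, rate_z1, beta2=2.0, beta1=2.0)
loss_traded = layerwise_loss(distortion, rate_z2, rate_z1, beta2=0.5, beta1=2.0)
```

With `beta2 == beta1` the weighted sum collapses to `distortion + beta * (rate_z2 + rate_z1)`, which is the conventional β-VAE case mentioned above.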

4. INFORMATION-THEORETICAL PERFORMANCE BOUNDS FOR HVAES

In this section, we analyze theoretically how various performance metrics for HVAEs are restricted by the individual layers' rates $R(z_L)$ and $R(z_\ell \mid z_{\geq \ell+1})$ identified in Eq. 7 for a HVAE with a "top-down" inference model. Our analysis motivates the use of the information-trading loss function in Eq. 8 for training HVAEs, following the argument from the introduction that VAEs are commonly used for a vast variety of tasks. As we show, different tasks require different trade-offs that can be targeted by tuning the Lagrange multipliers $\boldsymbol{\beta}$ in Eq. 8. We group tasks into the application domains of (i) data reconstruction and manipulation, (ii) representation learning, and (iii) data generation.

Data Reconstruction and Manipulation. The most obvious class of application domains of VAEs includes tasks that combine encoder and decoder to map some data point $x$ to representations $\{z_\ell\}$ and then back to the data space. The simplest performance metric for such data reconstruction tasks is the expected distortion $\mathbb{E}_{p_\text{data}(x)}[D]$, which we can bound by combining Eq. 3 with Eqs. 6-7,

$$\mathbb{E}_{p_\text{data}(x)}[D] \geq H[p_\text{data}(x)] - \mathbb{E}_{p_\text{data}(x)} \big[ R(z_L) + R(z_{L-1} \mid z_L) + \cdots + R(z_1 \mid z_{\geq 2}) \big]. \tag{9}$$

Eq. 9 would suggest that higher rates (i.e., lower β's) are always better for data reconstruction tasks. However, in many practical tasks (e.g., image upscaling, denoising, or inpainting) the goal is not solely to reconstruct the original data but also to manipulate the latent representations $\{z_\ell\}$ in a meaningful way. Here, lower rates can lead to more semantically meaningful representation spaces (see, e.g., Section 5.6 below). Controlling how rate is distributed across layers via Eq. 8 may allow practitioners to have a semantically meaningful high-level representation $z_L$ with low rate $R(z_L)$ while still retaining a high total rate $R$, thus allowing for low distortion $D$ without violating Eq. 9.

Representation Learning.
In many practical applications, VAEs are used as nonlinear dimensionality reduction methods to prepare some complicated high-dimensional data $x$ for downstream tasks such as classification, regression, visualization, clustering, or anomaly detection. We consider a classifier $p_\text{cls.}(y \mid z_\ell)$ operating on the latents $z_\ell$ at some level $\ell$. We assume that the (unknown) true data generative process $p_\text{data}(y, x) = p_\text{data}(y)\, p_\text{data}(x \mid y)$ generates data $x$ conditioned on some true label $y$, thus defining a Markov chain

$$y \xrightarrow{\, p_\text{data} \,} x \xrightarrow{\, q_\phi \,} z_\ell \xrightarrow{\, p_\text{cls.} \,} \hat{y}$$

where $\hat{y} := \arg\max_y p_\text{cls.}(y \mid z_\ell)$. Classification accuracy is bounded (Meyen, 2016) by a function of the mutual information $I_q(y; z_\ell)$,

$$\begin{aligned} I_q(y; z_\ell) \leq I_q(x; z_\ell) &\equiv \mathbb{E}_{p_\text{data}(x)} \left[ \mathbb{E}_{q_\phi(z_\ell \mid x)} \left[ \log \frac{q_\phi(z_\ell \mid x)}{q_\phi(z_\ell)} \right] \right] \\ &= \mathbb{E}_{p_\text{data}(x)} \left[ \mathbb{E}_{q_\phi(z_\ell \mid x)} \left[ \log \frac{q_\phi(z_\ell \mid x)}{p_\theta(z_\ell)} \right] \right] - D_\text{KL}\big[ q_\phi(z_\ell) \,\big\|\, p_\theta(z_\ell) \big] \\ &\leq \mathbb{E}_{p_\text{data}(x)} \left[ \mathbb{E}_{q_\phi(z_{\geq \ell} \mid x)} \left[ \log \frac{q_\phi(z_{\geq \ell} \mid x)}{p_\theta(z_{\geq \ell})} \right] - \mathbb{E}_{q_\phi(z_\ell \mid x)} \Big[ D_\text{KL}\big[ q_\phi(z_{\geq \ell+1} \mid x, z_\ell) \,\big\|\, p_\theta(z_{\geq \ell+1} \mid z_\ell) \big] \Big] \right] \\ &\leq \mathbb{E}_{p_\text{data}(x)} \Big[ \underbrace{R(z_L) + R(z_{L-1} \mid z_L) + \ldots + R(z_\ell \mid z_{\geq \ell+1})}_{=:\, R(z_{\geq \ell}) \;\; (\leq R)} \Big]. \end{aligned} \tag{10}$$

Here, $q_\phi(z_\ell) := \mathbb{E}_{p_\text{data}(x)}[q_\phi(z_\ell \mid x)]$ and we identify $R(z_{\geq \ell})$ as the rate accumulated in all layers from $z_L$ to $z_\ell$. The first inequality in Eq. 10 comes from the data processing inequality (MacKay, 2003), and the other two inequalities result from discarding the (nonnegative) KL-terms. The classification accuracy is thus bounded by (Meyen, 2016) (see also proof in Appendix B)

$$\text{class. accuracy} \leq f^{-1}\big( I_q(y; z_\ell) \big) \leq f^{-1}\big( \mathbb{E}_{p_\text{data}(x)}[R(z_{\geq \ell})] \big) \leq f^{-1}\big( \mathbb{E}_{p_\text{data}(x)}[R] \big) \tag{11}$$

where $f^{-1}$ is the inverse of the monotonic function $f(\alpha) = H[p_\text{data}(y)] + \alpha \log \alpha + (1 - \alpha) \log \frac{1 - \alpha}{M - 1}$ with $M$ being the number of classes and $H[p_\text{data}(y)] \leq \log M$ the marginal label entropy.

Eq. 11 suggests that the accuracy of an optimal classifier on $z_\ell$ would increase as the rate $R(z_{\geq \ell})$ accumulated from $z_L$ to $z_\ell$ grows (i.e., as $\beta_{\geq \ell} \to 0$), and that the rate added in downstream layers $z_{<\ell}$ would be irrelevant. Practical classifiers, however, have a limited expressiveness, which a very high rate $R(z_{\geq \ell})$ might exceed by encoding too many details into $z_\ell$ that are not necessary for classification. We observe in Section 5.6 that, in such cases, increasing the rates of downstream layers $z_{<\ell}$ improves classification accuracy as it allows keeping $z_\ell$ simpler by deferring details to $z_{<\ell}$.

Data Generation. The original proposal of VAEs (Kingma & Welling, 2014) motivated them from a generative modeling perspective, using the fact that, for $\beta = 1$, the negative of the loss function in Eq. 2 is a lower bound on the log marginal data likelihood. This suggests setting all β-hyperparameters in Eq. 8 to values close to 1 if a HVAE is used primarily for its generative model $p_\theta$.

In summary, our theoretical analysis suggests that optimally tuned layer-wise rates depend on whether a HVAE is used for data reconstruction, representation learning, or data generation. The next section tests our theoretical predictions empirically for the same three application domains.
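To make the accuracy bound of Eq. 11 concrete, the following sketch evaluates $f$ and inverts it numerically by bisection. It is an illustration under stated assumptions rather than part of the original analysis: the function names are hypothetical, rates are in nats, and classes are assumed to be balanced by default, so that $H[p_\text{data}(y)] = \log M$ and $f$ is increasing on $[1/M, 1]$.

```python
import math

def f(alpha, num_classes, label_entropy=None):
    """f(alpha) = H[p_data(y)] + alpha*log(alpha) + (1-alpha)*log((1-alpha)/(M-1))."""
    if label_entropy is None:
        label_entropy = math.log(num_classes)  # assumption: balanced classes
    value = label_entropy
    if alpha > 0.0:
        value += alpha * math.log(alpha)
    if alpha < 1.0:
        value += (1.0 - alpha) * math.log((1.0 - alpha) / (num_classes - 1))
    return value

def accuracy_upper_bound(rate_nats, num_classes, tol=1e-12):
    """Largest alpha in [1/M, 1] with f(alpha) <= rate, i.e. f^{-1}(rate) capped at 1."""
    lo, hi = 1.0 / num_classes, 1.0
    if f(hi, num_classes) <= rate_nats:
        return 1.0  # the rate is large enough that the bound is vacuous
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid, num_classes) <= rate_nats:
            lo = mid
        else:
            hi = mid
    return lo

# Bound on 10-class accuracy for a few accumulated rates R(z_{>=l}) in nats.
bounds = {rate: accuracy_upper_bound(rate, num_classes=10) for rate in (0.0, 0.5, 1.0, 2.0)}
```

At zero rate the bound equals chance level $1/M$, and it becomes vacuous (equal to 1) once the rate reaches $\log M$ nats, which matches the reading of Eq. 11 that only sufficiently low rates constrain the achievable accuracy.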

5. EXPERIMENTS

To demonstrate the features of our hierarchical information trading framework, we run large-scale grid searches over a two-dimensional rate space using two different implementations of HVAEs and three different data sets. Although the proposed framework is applicable for HVAEs with $L \geq 2$, we only use HVAEs with $L = 2$ in our experiments for simplicity and visualization purposes.

Model Architectures. For the generative model (Eq. 1), we assume a (fixed) standard Gaussian prior $p(z_2) = \mathcal{N}(0, I)$, and we use diagonal Gaussian models for $p_\theta(z_1 \mid z_2) = \mathcal{N}(g_\mu(z_2), g_\sigma(z_2)^2)$ and (for SVHN and CIFAR-10) $p_\theta(x \mid z_1) = \mathcal{N}(g_{\mu'}(z_1), \sigma_x^2 I)$ (this is similar to, e.g., Minnen et al. (2018)). Here, $g_\mu$, $g_\sigma$, and $g_{\mu'}$ denote neural networks (see details below). Since MNIST has binary pixel values, we model it with a Bernoulli distribution, $p_\theta(x \mid z_1) = \text{Bern}(g_{\mu'}(z_1))$. For the inference model, we also use diagonal Gaussian models for $q_\phi(z_2 \mid x) = \mathcal{N}(f_\mu(x), f_\sigma(x)^2)$ and for $q_\phi(z_1 \mid x, z_2) = \mathcal{N}(f_{\mu'}(x, z_2), f_{\sigma'}(x, z_2)^2)$, where $f_\mu$, $f_\sigma$, $f_{\mu'}$, and $f_{\sigma'}$ are again neural networks. We examine both LVAE (Figure 2(b)) and our generalized top-down HVAEs (GHVAEs; see Figure 2(c)), using simple network architectures with only 2 to 3 convolutional and 1 fully connected layers (see Appendix A.1 for details) so that we can scan a large rate-space efficiently. Note that we are not trying to find the new state-of-the-art HVAEs. Results for LVAE are in Appendix A.2.2.

We trained 441 different HVAEs for each data set/model combination, scanning the rate-hyperparameters $(\beta_2, \beta_1)$ over a 21 × 21 grid ranging from 0.1 to 10 on a log scale in both directions (see Figure 1, right panels). Each model took about 2 hours to train on an RTX-2080Ti GPU (∼ 27 hours in total for each data set/model combination using 32 GPUs in parallel).

Baselines. Our proposed framework (Eq. 8) generalizes over both VAEs and β-VAEs (Eq. 2), which we obtain in the cases $\beta_2 = \beta_1 = 1$ and $\beta_2 = \beta_1$, respectively. These baselines are indicated as black and red circles, respectively, in Figures 3, 5, 6, and 7, discussed below.

Metrics. Performance metrics for the three application domains of VAEs mentioned in the introduction are introduced at the beginnings of the corresponding Sections 5.4-5.6. In addition, we evaluate the individual rates $R(z_2)$ and $R(z_1 \mid z_2)$ (Eq. 7), which we report in nats (i.e., to base $e$).
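The hyperparameter grid described above can be sketched as follows. This is only an illustration: the helper `log_grid` and the variable names are hypothetical, and the exact grid values used in the experiments are assumed to be evenly spaced in $\log_{10} \beta$ between 0.1 and 10.

```python
import math

def log_grid(log10_low, log10_high, num):
    """`num` values evenly spaced on a log scale from 10**log10_low to 10**log10_high."""
    step = (log10_high - log10_low) / (num - 1)
    return [10.0 ** (log10_low + i * step) for i in range(num)]

betas = log_grid(-1.0, 1.0, 21)  # 0.1 ... 10, with beta = 1 in the middle
grid = [(beta2, beta1) for beta2 in betas for beta1 in betas]  # 21 x 21 settings

# Baselines contained in the grid: beta-VAEs on the diagonal, the standard VAE at (1, 1).
beta_vae_settings = [(b2, b1) for (b2, b1) in grid if b2 == b1]
standard_vae_settings = [(b2, b1) for (b2, b1) in grid
                         if math.isclose(b2, 1.0) and math.isclose(b1, 1.0)]

# Rough cost estimate from the numbers quoted above (about 2 hours per model, 32 GPUs).
gpu_hours = 2 * len(grid)
wall_clock_hours = gpu_hours / 32
```

Because both baselines lie on the grid, every comparison against a VAE or β-VAE in the following sections uses models trained with the same procedure as the off-diagonal models.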

5.2. THERE IS NO "ONE HVAE FITS ALL"

Figure 1 summarizes our results. The 21 × 21 GHVAEs trained with the grid of hyperparameters $\beta_2$ and $\beta_1$ map out a surface in a 3d-space spanned by suitable metrics for the three application domains (metrics defined in Sections 5.4-5.6 below). The two upper right panels map colors on this surface to βs used for training and to the resulting layer-wise rates, respectively. The lower right panels show performance landscapes and identify the optimal models for the three application domains of data reconstruction (△), representation learning (⋄), and generative modeling. The figure shows that moving away from a conventional β-VAE ($\beta_2 = \beta_1$; dashed red lines in Figure 1) allows us to find better models for a given application domain as the three application domains favor vastly different regions in β-space. Thus, there is no single HVAE that is optimal for all tasks, and a HVAE that has been optimized for one task can perform poorly on a different task.

5.3. DEFINITION OF THE OPTIMAL MODEL FOR A GIVEN TOTAL RATE

One of the questions we study in Sections 5.4-5.6 below is: "Which allocation of rates across layers results in best model performance if we keep the total rate $R$ fixed?" Unfortunately, it is difficult to keep $R$ fixed at training time since we control rates only indirectly via their Lagrange multipliers $\beta_2$ and $\beta_1$. We instead use the following definition, illustrated in Figure 6 for a performance metric introduced in Section 5.6 below. The figure plots the performance metric over $R$ for all 21 × 21 β-settings and highlights with purple circles all points on the upper convex hull. These highlighted models are optimal for a small interval of total rates in the following sense: if we use the total rates $R$ of all highlighted models to partition the horizontal axis into intervals then, by definition of the convex hull, each highlighted model represents the model with highest performance in either the interval to its left or the one to its right.
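The selection of models on the upper convex hull can be sketched with the upper half of Andrew's monotone chain algorithm. This is one possible way to compute the hull, assumed here for illustration; the function names and the toy (total rate, performance) pairs are hypothetical and do not come from the experiments.

```python
def _cross(o, a, b):
    """z-component of (a - o) x (b - o); negative for a clockwise turn o -> a -> b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def upper_convex_hull(points):
    """Points on the upper convex hull of (total_rate, performance) pairs, sorted by rate."""
    # Keep only the best-performing model for each distinct total rate.
    best = {}
    for rate, perf in points:
        best[rate] = max(perf, best.get(rate, float("-inf")))
    hull = []
    for p in sorted(best.items()):
        # Drop the last hull point while it lies on or below the segment hull[-2] -> p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

# Toy (total rate, performance) pairs, one per trained model.
models = [(10, 0.50), (20, 0.80), (30, 0.70), (40, 0.95), (50, 0.90), (60, 0.97)]
optimal = upper_convex_hull(models)
```

In this toy example the models at total rates 30 and 50 are dropped because a neighboring hull point achieves better performance in the surrounding rate interval.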

5.4. PERFORMANCE ON DATA RECONSTRUCTION

Reconstruction is a popular task for VAEs, e.g., in the area of lossy compression (Ballé et al., 2017). We measure reconstruction quality using the common peak signal-to-noise ratio (PSNR), which is equal to $\mathbb{E}_{x \sim X_\text{test}}[-\log D]$ up to rescaling and shifting. Higher PSNR means better reconstruction.

Figure 5: Sample generation performance, measured in Inception Score (IS, see Eq. 12) and its factorization into diversity and sharpness as a function of layer-wise rates for GHVAEs trained using SVHN data. Crosses in left panel correspond to samples shown in Figure 4. Markers same as in Figure 3.

Dashed line shows theoretical bound (Eq. 11). Other markers as in Figure 3.

Unsurprisingly and consistent with Eq. 9, reconstruction performance improves as total rate grows. However, minimizing distortion without any constraints is not useful in practice as we can simply use the original data, which has no distortion. To simulate a practical constraint in, e.g., a data-compression application, we consider models with optimal PSNR for a given total rate $R$ (as defined in Section 5.3), which are marked as purple circles in Figure 3(b). We see for both SVHN and CIFAR-10 that conventional β-VAEs ($\beta_2 = \beta_1$; red circles) perform somewhat suboptimally for a given total rate and can be improved by trading some rate in $z_2$ for some rate in $z_1$. Reconstruction examples for the three models marked with crosses in Figure 3(b) are shown in Figure 4 (bottom). Visual reconstruction quality improves from "3" to "2" to "1", consistent with reported PSNRs.
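For reference, PSNR in its standard form can be computed as below. This is a generic sketch and not necessarily the exact convention used for the reported numbers: it assumes images flattened into lists of pixel values in the range $[0, \texttt{max\_value}]$, and the function name `psnr` is hypothetical.

```python
import math

def psnr(original, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally long pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)
    if mse == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(max_value ** 2 / mse)

x = [0.0, 1.0, 0.5, 0.25]
x_good = [0.1, 0.9, 0.6, 0.15]  # every pixel off by 0.1
x_bad = [0.3, 0.7, 0.8, -0.05]  # every pixel off by 0.3
psnr_good = psnr(x, x_good)
psnr_bad = psnr(x, x_bad)
```

Since PSNR is a decreasing function of the mean squared error, a lower distortion always translates into a higher PSNR, which is why higher is better in Figure 3.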

5.5. PERFORMANCE ON SAMPLE GENERATION

We next evaluate how tuning layer-wise rates affects the quality of samples from the generative model. We measure sample quality by the widely used Inception Score (IS) (Salimans et al., 2016),

$$\text{IS} = \exp\Big( \mathbb{E}_{p_\theta(x)} \big[ D_\text{KL}[ p_\text{cls.}(y \mid x) \,\|\, p_\text{cls.}(y) ] \big] \Big) = e^{H[p_\text{cls.}(y)]} \times e^{-\mathbb{E}_{p_\theta(x)} [ H[p_\text{cls.}(y \mid x)] ]} \tag{12}$$

Here, $p_\theta$ is the trained generative model (Eq. 1), $p_\text{cls.}$ […] shows IS for GHVAEs trained on SVHN. Unlike the results for PSNR, here, higher rate does not always lead to better sample quality: for very high $R(z_2)$ and low $R(z_1 \mid z_2)$, IS eventually drops. The region of high IS is in the area where $\beta_2 < \beta_1$, i.e., where $R(z_2)$ is higher than in a comparable conventional β-VAE. The center and right panels of Figure 5 show diversity and sharpness, indicating that IS is mainly driven here by sharpness, which depends mostly on $R(z_2)$, possibly because $z_2$ captures higher-level concepts than $z_1$ that may be more important to the classifier in Eq. 12. Samples from the three models marked with crosses in Figure 5 are shown in Figure 4 (top). Visual sample quality improves from "1" to "3" to "2", consistent with reported IS.

Figure 7: Mutual information (MI) $I_q(y; z_2)$ and classification accuracies of four classifiers (see column labels) as a function of layer-wise rates $R(z_2)$ and $R(z_1 \mid z_2)$. Classifiers are conditioned on $\mu_2 := \arg\max_{z_2} q(z_2 \mid x)$ learned from GHVAEs trained with SVHN (top) and CIFAR-10 (bottom). Markers same as in Figure 3.
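The factorization of the Inception Score in Eq. 12 into the two factors $e^{H[p_\text{cls.}(y)]}$ and $e^{-\mathbb{E}[H[p_\text{cls.}(y \mid x)]]}$ can be checked with a short sketch. The code assumes that classifier outputs $p_\text{cls.}(y \mid x)$ for a set of generated samples are already available as probability vectors and that $p_\text{cls.}(y)$ is their average; the function names and the toy inputs are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a discrete distribution given as a list."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def inception_score(class_probs):
    """Return (IS, diversity factor, sharpness factor) for a list of p_cls(y|x) vectors."""
    n, m = len(class_probs), len(class_probs[0])
    marginal = [sum(p[k] for p in class_probs) / n for k in range(m)]  # p_cls(y)
    diversity = math.exp(entropy(marginal))                        # e^{H[p_cls(y)]}
    sharpness = math.exp(-sum(entropy(p) for p in class_probs) / n)  # e^{-E[H[p_cls(y|x)]]}
    return diversity * sharpness, diversity, sharpness

# Confident predictions spread evenly over 3 classes: best case, IS equals the class count.
confident = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Maximally uncertain predictions: worst case, IS equals 1.
uncertain = [[1 / 3, 1 / 3, 1 / 3]] * 3
is_confident, div_confident, sharp_confident = inception_score(confident)
is_uncertain, div_uncertain, sharp_uncertain = inception_score(uncertain)
```

In both toy cases the first factor is the same (the marginal label distribution is uniform), so the difference in IS is carried entirely by the second factor, analogous to the observation above that IS is mainly driven by sharpness.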

5.6. PERFORMANCE ON REPRESENTATION LEARNING FOR DOWNSTREAM CLASSIFICATION

VAEs are very popular for representation learning as they map complicated high-dimensional data $x$ to typically lower dimensional representations $\{z_\ell\}$. To measure the quality of learned representations, we train two sets of classifiers on a labeled test set for each trained HVAE, each consisting of: logistic regression, a Support Vector Machine (SVM) (Boser et al., 1992) with linear kernel, an SVM with RBF kernel, and k-nearest neighbors (kNN) with $k = 5$. One set of classifiers is conditioned on the mode $\mu_2$ of $q_\phi(z_2 \mid x)$ and the other one on the mode $\mu_1$ of $q_\phi(z_1 \mid z_2, x)$, where $z_2 \sim q_\phi(z_2 \mid x)$. We use the implementations from scikit-learn (Pedregosa et al., 2011) for all classifiers.

Figure 7 shows the classification accuracies (columns 2-5) for all classifiers trained on $\mu_2$. The first column shows the mutual information $I_q(y; z_2)$, which depends mainly on $R(z_2)$ as expected from Eq. 10. As long as the classifier is expressive enough (e.g., RBF-SVM or kNN) and the data set is simple (SVHN; top row), higher mutual information (≈ higher $R(z_2)$) corresponds to higher classification accuracies, consistent with Eq. 11. But for less expressive (e.g., linear) classifiers or more complex data (CIFAR-10; bottom row), increasing $R(z_1 \mid z_2)$ improves classification accuracy (see purple circles in corresponding panels), consistent with the discussion below Eq. 11. We see a similar effect (Table 1) for most classifier/data set combinations when replacing $\mu_2$ by $\mu_1$, which has more information about $x$ but is also higher dimensional.

6. CONCLUSIONS

We classified the various tasks that can be performed with Variational Autoencoders (VAEs) into three application domains and argued that each domain has different trade-offs, such that a good VAE for one domain is not necessarily good for another. This observation motivated us to propose a refinement of the rate/distortion theory of VAEs that allows trading off rates across individual layers of latents in hierarchical VAEs. We showed both theoretically and empirically that the proposal indeed provides practitioners better control for tuning VAEs for the three application domains. In the future, it would be interesting to explore adaptive schedules for the Lagrange parameters $\boldsymbol{\beta}$ that would make it possible to target a specific given rate for each layer in a single training run, for example by using the method proposed by Rezende & Viola (2018).

Figure 9: Trade-offs between rates and all metrics we used in Section 5 from LVAE trained with SVHN. The results from the standard VAE (i.e., $\beta_2 = \beta_1 = 1$) and the β-VAE (i.e., $\beta_2 = \beta_1$) are marked separately, and further markers highlight the optimal models selected using the convex hull (see Figure 6 for details). The diagonal grid lines are references for equivalent total rates, i.e., points on the same line have the same total rates.

B PROOF OF THE BOUND ON CLASSIFICATION ACCURACY

This section provides a proof of Eq. 11 by reformulating the proof of Proposition 5 in the thesis by Meyen (2016) into the notation used in the present paper. We stress that this section contains no original contribution and is provided only as a convenience to the reader, motivated by reviewer feedback. All credits for this section belong to Meyen (2016).

We consider an (unknown) true data generative distribution $p_\text{data}(y, x)$ for data $x$ with (unobserved) true labels $y$, and a hierarchical VAE with an inference model $q_\phi(\{z_\ell\} \mid x)$ of the form of Eq. 4. Focusing on a single layer $\ell$ of latents, we denote the joint probability over $y$, $x$, and $z_\ell$ as

$$q(y, x, z_\ell) := p_\text{data}(y, x)\, q_\phi(z_\ell \mid x) \tag{13}$$

where the marginal $q_\phi(z_\ell \mid x)$ of $q_\phi(\{z_\ell\} \mid x)$ is defined as usual. We further consider a classifier $p_\text{cls.}(y \mid z_\ell)$ that operates on $z_\ell$. Denoting its top prediction as $\hat{y} := \arg\max_y p_\text{cls.}(y \mid z_\ell)$, the classification accuracy is $\alpha := \mathbb{E}_q[\delta_{y, \hat{y}}]$, where $\delta$ is the Kronecker delta.

Theorem 1. The mutual information $I_q(y; z_\ell)$ between the latent representation $z_\ell$ and the true label $y$ under the distribution $q$ defined in Eq. 13 is lower bounded as follows,

$$I_q(y; z_\ell) \geq f(\alpha) = H[p_\text{data}(y)] + \alpha \log \alpha + (1 - \alpha) \log \frac{1 - \alpha}{M - 1}, \tag{14}$$

with $f$ and $M$ as defined below Eq. 11 of the main text.

Before we prove Theorem 1, we note that the function $f$ is strictly monotonically increasing on the relevant interval $[\max_y p_\text{data}(y), 1]$. Thus, $f$ is invertible and we obtain the following corollary:

Corollary 1. The classification accuracy $\alpha$ is upper bounded as in Eq. 11 of the main text, i.e.,

$$\alpha \leq f^{-1}\big( I_q(y; z_\ell) \big) \leq f^{-1}\big( \mathbb{E}_{p_\text{data}(x)}[R(z_{\geq \ell})] \big). \tag{15}$$

The second inequality in Eq. 15 results from the bound $I_q(y; z_\ell) \leq \mathbb{E}_{p_\text{data}(x)}[R(z_{\geq \ell})]$ derived in Eq. 10, using the fact that $f^{-1}$ is monotonically increasing (since $f$ is).

Proof of Theorem 1.
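We split the mutual information into two contributions,

$$I_q(y; z_\ell) = H_{p_{\text{data}}}[y] - H_q[y \mid z_\ell] = H_{p_{\text{data}}}[y] - \mathbb{E}_{z_\ell \sim q(z_\ell)}\Big[\mathbb{E}_{y \sim q(y \mid z_\ell)}\big[-\log q(y \mid z_\ell)\big]\Big] \qquad (16)$$

where, as clarified in the second equality, $H_q[y \mid z_\ell]$ is the expectation over $z_\ell$ of the conditional entropy of $y$ given $z_\ell$, and $q(z_\ell)$ and $q(y \mid z_\ell)$ are marginals and conditionals of $q$ (Eq. 13) as usual. Since $H_{p_{\text{data}}}[y]$ is fixed by the problem at hand, finding a lower bound on $I_q(y; z_\ell)$ for a given classification accuracy $\alpha$ is equivalent to finding an upper bound on the second term on the right-hand side of Eq. 16, $H_q[y \mid z_\ell] = \mathbb{E}_{z_\ell \sim q(z_\ell)}\big[\mathbb{E}_{y \sim q(y \mid z_\ell)}[-\log q(y \mid z_\ell)]\big]$, under the constraint $\mathbb{E}_q[\delta_{y,\hat{y}}] = \alpha$. We do this by upper bounding the conditional entropy $\mathbb{E}_{y \sim q(y \mid z_\ell)}[-\log q(y \mid z_\ell)]$ of $y$ given $z_\ell$ for all $z_\ell$ independently, and then taking the expectation over $z_\ell \sim q(z_\ell)$.

For a fixed latent representation $z_\ell$, we first split off the contribution to $\mathbb{E}_{y \sim q(y \mid z_\ell)}[-\log q(y \mid z_\ell)]$ from $y = \hat{y}$, where $\hat{y} = \arg\max_y p_{\text{cls.}}(y \mid z_\ell)$ is the label that our classifier would predict for $z_\ell$,

$$\mathbb{E}_{y \sim q(y \mid z_\ell)}[-\log q(y \mid z_\ell)] = -q(y = \hat{y} \mid z_\ell) \log q(y = \hat{y} \mid z_\ell) - \sum_{y \ne \hat{y}} q(y \mid z_\ell) \log q(y \mid z_\ell). \qquad (17)$$

Here, the second term on the right-hand side resembles the entropy of a distribution over the remaining $(M - 1)$ labels ($y \ne \hat{y}$), except that the probabilities sum to $(1 - q(y = \hat{y} \mid z_\ell))$ rather than one. Thus, regardless of the value of $q(y = \hat{y} \mid z_\ell)$, this term is maximized if $q(y \mid z_\ell)$ distributes the remaining probability mass $(1 - q(y = \hat{y} \mid z_\ell))$ uniformly over the remaining $(M - 1)$ labels, i.e.,

$$\mathbb{E}_{y \sim q(y \mid z_\ell)}[-\log q(y \mid z_\ell)] \le -q(y = \hat{y} \mid z_\ell) \log q(y = \hat{y} \mid z_\ell) - \big(1 - q(y = \hat{y} \mid z_\ell)\big) \log \frac{1 - q(y = \hat{y} \mid z_\ell)}{M - 1} = H_2\big(q(y = \hat{y} \mid z_\ell)\big) + \big(1 - q(y = \hat{y} \mid z_\ell)\big) \log(M - 1). \qquad (18)$$

Plugging Eq. 18 back into Eq. 16, we obtain the bound

$$I_q(y; z_\ell) \ge H_{p_{\text{data}}}[y] - \mathbb{E}_{z_\ell \sim q(z_\ell)}\big[H_2\big(q(y = \hat{y} \mid z_\ell)\big)\big] - \mathbb{E}_{z_\ell \sim q(z_\ell)}\big[1 - q(y = \hat{y} \mid z_\ell)\big] \log(M - 1). \qquad (19)$$

We arrive at the claim of Theorem 1 (Eq. 14) by pulling the concave function $H_2$ out of the expectation using Jensen's inequality, i.e., $\mathbb{E}_{z_\ell \sim q(z_\ell)}[H_2(q(y = \hat{y} \mid z_\ell))] \le H_2\big(\mathbb{E}_{z_\ell \sim q(z_\ell)}[q(y = \hat{y} \mid z_\ell)]\big)$, and by then identifying $\mathbb{E}_{z_\ell \sim q(z_\ell)}[q(y = \hat{y} \mid z_\ell)] = q(y = \hat{y}) = \alpha$.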
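As a numerical illustration of Theorem 1 and Corollary 1, the following sketch evaluates $f$ from Eq. 14, inverts it by bisection to turn a rate into an upper bound on accuracy, and checks the inequality $I_q(y; z_\ell) \ge f(\alpha)$ on a randomly drawn discrete joint distribution. The function names, the use of nats, the discrete stand-in for $z_\ell$, and the fallback of $1/M$ for the largest class prior are assumptions made for illustration; none of this is taken from the authors' code.

```python
import numpy as np


def binary_entropy(a):
    """Entropy H_2(a) of a Bernoulli(a) variable in nats."""
    a = np.clip(a, 1e-12, 1.0 - 1e-12)  # avoids log(0) at the endpoints
    return float(-a * np.log(a) - (1.0 - a) * np.log(1.0 - a))


def f(alpha, label_entropy, num_classes):
    """Right-hand side of Eq. 14: lower bound on I_q(y; z_l) at accuracy alpha."""
    return (label_entropy - binary_entropy(alpha)
            - (1.0 - alpha) * np.log(num_classes - 1))


def accuracy_upper_bound(rate, label_entropy, num_classes, p_max=None, tol=1e-10):
    """Invert f by bisection on [p_max, 1], where f is increasing (Corollary 1).

    `rate` and `label_entropy` must both be in nats. If `p_max` (the largest
    class prior) is not given, 1/num_classes is used (an assumption).
    """
    lo = 1.0 / num_classes if p_max is None else p_max
    hi = 1.0
    if rate >= f(hi, label_entropy, num_classes):
        return 1.0  # the bound is vacuous at this rate
    if rate <= f(lo, label_entropy, num_classes):
        return lo
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid, label_entropy, num_classes) < rate:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def check_theorem_on_random_joint(num_classes=10, num_latents=50, seed=0):
    """Draw a random discrete joint q(y, z) and return (I_q(y; z), f(alpha))."""
    rng = np.random.default_rng(seed)
    joint = rng.dirichlet(np.ones(num_classes * num_latents))
    joint = joint.reshape(num_classes, num_latents)
    p_y, p_z = joint.sum(axis=1), joint.sum(axis=0)
    mutual_info = float(np.sum(joint * np.log(joint / np.outer(p_y, p_z))))
    label_entropy = float(-np.sum(p_y * np.log(p_y)))
    # Accuracy of the classifier that predicts argmax_y q(y | z) for each z.
    alpha = float(joint.max(axis=0).sum())
    return mutual_info, f(alpha, label_entropy, num_classes)
```

For example, `accuracy_upper_bound(rate, np.log(10), 10)` gives the largest accuracy compatible with a given rate for ten equiprobable classes, provided the rate is expressed in nats.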



Figure 2: Inference (dashed arrows) and generative (solid arrows) models for hierarchical VAEs (HVAEs) with two layers of latent variables. White/gray circles denote latent/observed random variables, respectively; the diamond $d_1$ in (b) is the result of a deterministic transformation of $x$.

TRADING INFORMATION BETWEEN LATENTS

Many HVAEs used in the literature allow us to resolve the limitations identified in Section 3.1. For example, the popular LVAE architecture (Sønderby et al., 2016) (Figure 2(b))

EXPERIMENTAL SETUP

Data sets. We used the SVHN (Netzer et al., 2011) and CIFAR-10 (Krizhevsky, 2009) data sets (both 32 × 32 pixel color images), and MNIST (LeCun et al., 1998) (28 × 28 binary pixel images). SVHN consists of photographed house numbers from 0 to 9, which are geometrically simpler than the 10 classes of objects from CIFAR-10 but more complex than MNIST digits. Most results shown in the main paper use SVHN; comprehensive results for CIFAR-10 and MNIST are shown in Appendix A.2 and tell a similar story except where explicitly discussed.


(a) Rate/rate/distortion surface for SVHN.

(b) PSNR-rates comparison in 2d.

Figure 3: PSNR-rate trade-off for GHVAEs trained on SVHN and CIFAR-10. Panel (a) visualizes the same data as the left panel of (b) in 3d. Black circles mark standard VAEs ($\beta_2 = \beta_1 = 1$), red circles mark β-VAEs ($\beta_2 = \beta_1$), and purple circles mark optimal models along lines of constant total rate (dashed diagonal lines) as defined in Section 5.3. Crosses point to columns in Figure 4.

Figure 3(a) shows a 3d-plot of PSNR as a function of both $R(z_1 \mid z_2)$ and $R(z_2)$ for SVHN, thus generalizing the rate/distortion curve of a conventional β-VAE to a rate/rate/distortion surface. Figure 3(b) introduces a more compact 2d-representation of the same data that we use for all remaining metrics in the rest of this section and in Appendix A.2, and it also shows results for CIFAR-10.

Figure 4: Samples (top) and reconstructions (bottom) from 3 different models (blue column labels "1", "2", and "3" from left to right correspond to crosses "1", "2", and "3" in Figures 3(b) & 5). Consistent with the PSNR and IS metrics, model "1" produces the poorest samples but the best reconstructions.

Figure 6: RBF-SVM classification on $\mu_2$. The dashed line shows the theoretical bound (Eq. 11). Other markers as in Figure 3.

Here, $p_{\text{cls.}}(y \mid x)$ is the predictive distribution of a classifier trained on the same training set, and $p_{\text{cls.}}(y) := \mathbb{E}_{p_\theta(x)}[p_{\text{cls.}}(y \mid x)]$. The second equality in Eq. 12 follows Barratt & Sharma (2018) to split IS into a product of a diversity score and a sharpness score. Higher is better for all scores. The classifier is a ResNet-18 (He et al., 2016) for SVHN (test accuracy 95.02 %) and a DenseNet-121 (Huang et al., 2017) for CIFAR-10 (test accuracy 94.34 %).
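The split of IS into diversity and sharpness can be sketched as follows. The sketch assumes that Eq. 12 takes the standard form in which IS is the exponential of the expected KL divergence between the classifier's per-sample prediction and its marginal prediction, so that diversity is the exponentiated entropy of the marginal and sharpness is the exponentiated negative mean entropy of the per-sample predictions. The function name, the `eps` smoothing constant, and the input layout (one row of class probabilities per generated sample) are hypothetical choices for illustration.

```python
import numpy as np


def inception_score_parts(probs, eps=1e-12):
    """Return (IS, diversity, sharpness) from classifier outputs.

    probs: array of shape (N, M); row n holds p_cls(y | x_n) for the n-th
    generated sample x_n. `eps` guards against log(0) (an assumption).
    """
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0)  # Monte Carlo estimate of p_cls(y)
    marginal_entropy = -np.sum(marginal * np.log(marginal + eps))
    conditional_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    diversity = float(np.exp(marginal_entropy))      # high if all classes occur
    sharpness = float(np.exp(-conditional_entropy))  # high if predictions are confident
    return diversity * sharpness, diversity, sharpness
```

With confident predictions spread evenly over $M$ classes, this returns a diversity close to $M$ and a sharpness close to one; with uniform predictions for every sample, the two factors cancel and the score is close to one.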

Figure 5 (left) shows IS for GHVAEs trained on SVHN. Unlike the results for PSNR, here, higher rate does not always lead to better sample quality: for very high $R(z_2)$ and low $R(z_1 \mid z_2)$, IS eventually drops. The region of high IS is in the area where $\beta_2 < \beta_1$, i.e., where $R(z_2)$ is higher than in a comparable conventional β-VAE. The center and right panels of Figure 5 show diversity and sharpness, indicating that IS is mainly driven here by sharpness, which depends mostly on $R(z_2)$, possibly because $z_2$ captures higher-level concepts than $z_1$ that may be more important to the classifier in Eq. 12. Samples from the three models marked with crosses in Figure 5 are shown in Figure 4 (top). Visual sample quality improves from "1" to "3" to "2", consistent with the reported IS.



Figure 10: Trade-offs between rates and all metrics we used in Section 5 for the generalized top-down HVAEs trained on CIFAR-10. Markers highlight the standard VAE (i.e., $\beta_2 = \beta_1 = 1$), the β-VAE (i.e., $\beta_2 = \beta_1$), and the optimal models selected using the convex hull (see Figure 6 for details). The diagonal grid lines are references for equivalent total rates, i.e., points on the same line have the same total rate.

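$$I_q(y; z_\ell) \ge f(\alpha) \quad \text{with} \quad f(\alpha) = H_{p_{\text{data}}}[y] - H_2(\alpha) - (1 - \alpha) \log(M - 1) \qquad (14)$$

where $H_2(\alpha) = -\alpha \log \alpha - (1 - \alpha) \log(1 - \alpha)$ is the entropy of a Bernoulli distribution, $H_{p_{\text{data}}}[y] \le \log M$ is the marginal entropy of the true labels, and $M$ denotes the number of classes.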
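The monotonicity of $f$ that Corollary 1 relies on can be checked by differentiating Eq. 14. The derivation below is a short sketch; the closing numerical example assumes $M = 10$ equiprobable classes (so that $H_{p_{\text{data}}}[y] = \log 10$) and natural logarithms, which are illustrative choices rather than settings taken from the experiments.

```latex
\begin{align*}
f'(\alpha)
  &= -\frac{\mathrm{d} H_2(\alpha)}{\mathrm{d}\alpha} + \log(M-1)
   = \log\frac{\alpha}{1-\alpha} + \log(M-1)
   = \log\frac{\alpha\,(M-1)}{1-\alpha}, \\
f'(\alpha) > 0
  &\iff \alpha\,(M-1) > 1-\alpha
   \iff \alpha > \tfrac{1}{M}. \\[4pt]
\text{Example } (M = 10,\ \alpha = 0.9):\quad
f(0.9) &= \log 10 - H_2(0.9) - 0.1 \log 9
        \approx 2.3026 - 0.3251 - 0.2197
        \approx 1.758 \text{ nats}.
\end{align*}
```

Since $\max_y p_{\text{data}}(y) \ge 1/M$, the derivative is positive on the interior of $[\max_y p_{\text{data}}(y), 1]$, so $f$ is strictly increasing there. In the example, a classifier that reaches 90 % accuracy on ten equiprobable classes needs a representation carrying at least about 1.76 nats (roughly 2.54 bits) of label information.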

Optimal classification accuracies (across all $(\beta_2, \beta_1)$-settings) using either $\mu_2$ or $\mu_1$.

ACKNOWLEDGMENTS

The authors would like to thank Johannes Zenn, Zicong Fan, and Zhen Liu for helpful discussions. Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy, EXC number 2064/1, Project number 390727645. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Tim Z. Xiao.

Reproducibility Statement. All code necessary to reproduce the results in this paper is available at https://github.com/timxzz/HIT/.

A EXPERIMENT SUPPLEMENTARIES

A.1 IMPLEMENTATION DETAILS

Table 2: Model architecture details for the generalized top-down HVAEs (GHVAEs) used in Section 5. Conv and ConvTransp denote convolutional and transposed convolutional layers, whose arguments are, in order: input channels, output channels, kernel size, stride, and padding. FC denotes a fully connected layer.

For mean: FC(In=256, Out=20). For variance: FC(In=256, Out=20).

A.2 ADDITIONAL RESULTS

Here we provide the results for MNIST, as well as the full results for LVAE on SVHN and for generalized top-down HVAEs on CIFAR-10.

A.2.1 RESULTS FOR GENERALIZED TOP-DOWN HVAES ON MNIST

We also evaluate our proposed framework using generalized top-down HVAEs trained on binary MNIST data (i.e., black and white images rather than grayscale). We note that the inception score (IS) behaves differently in our MNIST models compared to SVHN (see Figure 5) in that optimal IS on MNIST occurs for high $R(z_1 \mid z_2)$ rather than high $R(z_2)$. This indicates that semantically low-level properties (handwriting style) of MNIST might have more variation than high-level properties (the digit), whereas SVHN images show variation in additional high-level properties such as the background color.

Trade-offs between rates and all metrics we used in Section 5 for the generalized top-down HVAEs trained on MNIST. Markers highlight the standard VAE (i.e., $\beta_2 = \beta_1 = 1$), the β-VAE (i.e., $\beta_2 = \beta_1$), and the optimal models selected using the convex hull (see Figure 6 for details). The diagonal grid lines are references for equivalent total rates, i.e., points on the same line have the same total rate.

