TRADING INFORMATION BETWEEN LATENTS IN HIERARCHICAL VARIATIONAL AUTOENCODERS

Abstract

Variational Autoencoders (VAEs) were originally motivated (Kingma & Welling, 2014) as probabilistic generative models in which one performs approximate Bayesian inference. The proposal of β-VAEs (Higgins et al., 2017) breaks this interpretation and generalizes VAEs to application domains beyond generative modeling (e.g., representation learning, clustering, or lossy data compression) by introducing an objective function that allows practitioners to trade off between the information content ("bit rate") of the latent representation and the distortion of reconstructed data (Alemi et al., 2018). In this paper, we reconsider this rate/distortion trade-off in the context of hierarchical VAEs, i.e., VAEs with more than one layer of latent variables. We identify a general class of inference models for which one can split the rate into contributions from each layer, which can then be tuned independently. We derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates and verify our theoretical findings in large-scale experiments. Our results provide guidance for practitioners on which region in rate-space to target for a given application.

1. INTRODUCTION

Variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) are a class of deep generative models that are used, e.g., for density modeling (Takahashi et al., 2018), clustering (Jiang et al., 2017), nonlinear dimensionality reduction of scientific measurements (Laloy et al., 2017), data compression (Ballé et al., 2017), anomaly detection (Xu et al., 2018), and image generation (Razavi et al., 2019). VAEs (more precisely, β-VAEs (Higgins et al., 2017)) span such a diverse set of application domains in part because they can be tuned to a specific task without changing the network architecture, in a way that is well understood from information theory (Alemi et al., 2018).

The original proposal of VAEs (Kingma & Welling, 2014) motivates them from the perspective of generative probabilistic modeling and approximate Bayesian inference. However, the generalization to β-VAEs breaks this interpretation, as these models are no longer trained by maximizing a lower bound on the marginal data likelihood. They are better described as neural networks that are trained to learn the identity function, i.e., to make their output resemble the input as closely as possible. This task is made nontrivial by introducing a so-called (variational) information bottleneck (Alemi et al., 2017; Tishby & Zaslavsky, 2015) at one or more layers, which restricts the information content that passes through these layers. The network activations at the information bottleneck are called latent representations (or simply "latents"), and they split the network into an encoder part (from input to latents) and a decoder part (from latents to output). This separation of the model into an encoder and a decoder allows us to categorize the wide variety of applications of VAEs into three domains:

1. data reconstruction tasks, i.e., applications that involve both the encoder and the decoder; these include various nonlinear inter- and extrapolations (e.g., image upscaling, denoising, or inpainting) and VAE-based methods for lossy data compression;
2. representation learning tasks, i.e., applications that involve only the encoder; they serve a downstream task that operates on the (typically lower-dimensional) latent representation, e.g., classification, regression, visualization, clustering, or anomaly detection; and
3. generative modeling tasks, i.e., applications that involve only the decoder; these are less common but include generating new samples that resemble the training data.

Figure 1: Left: trade-off between performance in the three application domains of VAEs, using a GHVAE trained on the SVHN data set (details: Section 5); higher is better for all three metrics; gray dots on the walls show 2d projections. Right: color code, corresponding layer-wise rates (Eq. 7), and individual performance landscapes (size of dots ∝ performance). The hyperparameters β2 and β1 allow us to tune the HVAE for best data reconstruction (△), best representation learning (⋄), or best generative modeling ( ). Note that the performance landscapes differ strongly across the three applications, and neither a standard VAE (β2 = β1 = 1; marked "•" in the right panels) nor a conventional β-VAE (β2 = β1; dashed red lines) results in an optimal model for any of the three applications.

The information bottleneck incentivizes the VAE to encode information into the latents efficiently by removing any redundancies from the input. How aggressively this is done can be controlled by tuning the strength β of the information bottleneck (Alemi et al., 2018). Unfortunately, information theory distinguishes relevant from redundant information only in a quantitative way that is agnostic to the qualitative features that each piece of information represents about some data point.
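The rate/distortion trade-off controlled by β can be made concrete with a small numerical sketch. The snippet below is an illustrative toy, not the paper's implementation: it assumes a diagonal-Gaussian encoder and a standard-normal prior, so the rate term is the closed-form KL divergence; the function names and all numerical values are hypothetical.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def beta_vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Distortion + beta * rate; beta sets the strength of the bottleneck."""
    distortion = np.sum((x - x_hat) ** 2)  # reconstruction error (toy: squared loss)
    rate = gaussian_kl(mu, log_var)        # information encoded in the latents
    return distortion + beta * rate, distortion, rate

# Toy encoder outputs for a single data point:
x = np.array([0.5, -1.0]); x_hat = np.array([0.4, -0.9])
mu = np.array([0.2, 0.1]); log_var = np.array([-0.5, -0.3])
loss, d, r = beta_vae_loss(x, x_hat, mu, log_var, beta=2.0)
```

With β = 1 this is the usual negative ELBO (up to constants); increasing β penalizes rate more heavily, pushing the encoder to discard information, while β < 1 admits a higher-rate, lower-distortion representation.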
In practice, many VAE architectures (Deng et al., 2017; Yingzhen & Mandt, 2018; Ballé et al., 2018) try to separate qualitatively different features into different parts of the latent representation by making the model architecture reflect prior assumptions about the semantic structure of the data. This allows downstream applications from the three domains discussed above to target specific qualitative aspects of the data more precisely by using or manipulating only the corresponding part of the latent representation. However, in this approach, the degree of detail to which each qualitative aspect is encoded in the latents can be controlled at most indirectly, by tuning network layer sizes.

In this paper, we argue both theoretically and empirically that the three application domains of VAEs identified above require different trade-offs in the amount of information that is encoded in each part of the latent representation. We propose a method to independently control the information content (or "rate") of each layer of latent representations, generalizing the rate/distortion theory of β-VAEs (Alemi et al., 2018) to VAEs with more than one layer of latents ("hierarchical VAEs", or HVAEs for short). We identify the most general model architecture that is compatible with our proposal and analyze how both theoretical performance bounds and empirically measured performances in each of the three application domains depend on how rate is distributed across layers.

Our approach is summarized in Figure 1. The 3d plot shows empirically measured performance metrics (discussed in detail in Section 5.2) for the three application domains identified above. Each point on the colored surface corresponds to different layer-wise rates in an HVAE with two layers of latents.
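To illustrate the idea of layer-wise rate control, the sketch below extends a β-VAE-style objective with one independently weighted KL term per latent layer. It is a simplified toy under stated assumptions, not the paper's method: each layer's approximate posterior is taken to be diagonal-Gaussian, and each rate is measured against a standard-normal prior (real HVAEs typically condition lower-layer priors on higher layers). The function names and values are hypothetical.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def hvae_loss(x, x_hat, layer_stats, betas):
    """Distortion plus one separately weighted rate per latent layer.

    layer_stats: per-layer (mu, log_var) pairs from the encoder (toy values);
    betas: one weight beta_i per layer; setting all beta_i = 1 recovers the
    standard VAE bound, while unequal weights allocate rate across layers.
    """
    distortion = np.sum((x - x_hat) ** 2)
    rates = [gaussian_kl(mu, lv) for mu, lv in layer_stats]
    loss = distortion + sum(b * r for b, r in zip(betas, rates))
    return loss, distortion, rates

x = np.array([0.5, -1.0]); x_hat = np.array([0.45, -0.95])
stats = [(np.array([0.2]), np.array([-0.4])),   # layer 1 (closer to the data)
         (np.array([0.1]), np.array([-0.1]))]   # layer 2 (more abstract)
loss, d, (r1, r2) = hvae_loss(x, x_hat, stats, betas=(0.5, 2.0))
```

Sweeping the pair (β1, β2) independently, rather than a single shared β, traces out the two-dimensional performance landscapes shown in Figure 1.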
Crucially, the rates that lead to optimal performance are different for each of the three application domains (see markers △, , and ⋄ in Figure 1), and none of these three optimal models coincides with a conventional β-VAE (dashed red lines in the right panels). Thus, being able to control each layer's individual rate allows practitioners to train VAEs that target a specific application.

The paper is structured as follows. Section 2 summarizes related work. Section 3 introduces the proposed information-trading method. We then analyze how controlling individual layers' rates can be used to tune HVAEs for specific tasks, i.e., how performance in each of the three application domains identified above depends on the allocation of rates across layers. This analysis is done theoretically in Section 4 and empirically in Section 5. Section 6 provides concluding remarks.

