INFORMATION THEORETIC REGULARIZATION FOR LEARNING GLOBAL FEATURES BY SEQUENTIAL VAE

Anonymous

Abstract

Sequential variational autoencoders (VAEs) with a global latent variable z have been studied for the purpose of disentangling the global features of data, which is useful in many downstream tasks. To further assist sequential VAEs in obtaining a meaningful z, an auxiliary loss that maximizes the mutual information (MI) between the observation and z is often employed. However, by analyzing sequential VAEs from the information theoretic perspective, we claim that simply maximizing the MI encourages the latent variables to have redundant information and prevents the disentanglement of global and local features. Based on this analysis, we derive a novel regularization method that makes z informative while encouraging disentanglement. Specifically, the proposed method removes redundant information by minimizing the MI between z and the local features via adversarial training. In the experiments, we trained state-space and autoregressive model variants using speech and image datasets. The results indicate that the proposed method improves the performance of downstream classification and data generation tasks, thereby supporting our information theoretic perspective on the learning of global representations.

1. INTRODUCTION

Uncovering the global factors of variation from high-dimensional data is a significant and relevant problem in representation learning (Bengio et al., 2013). For example, a global representation of images that captures only the identity of the objects and is invariant to detailed texture would assist in downstream semi-supervised classification (Ma et al., 2019). In addition, such a representation is known to be useful for the controlled generation of data: it allows us to manipulate the voice of the speaker in speech (Yingzhen & Mandt, 2018), or to generate images that share similar global structures (e.g., the structure of objects) but varying details (Razavi et al., 2019).

Sequential variational autoencoders (VAEs) with a global latent variable z have played an important role in the unsupervised learning of global features. Specifically, we consider sequential VAEs with a structured data generating process in which an observation x at time t (denoted as x_t) is generated from a global feature z and a local feature s_t. The z of such sequential VAEs can then acquire only global information invariant to t. For example, Yingzhen & Mandt (2018) demonstrated that the disentangled sequential autoencoder (DSAE), which combines state-space models (SSMs) with a global latent variable z, can uncover speaker information from speech. Furthermore, Chen et al. (2017) and Gulrajani et al. (2017) proposed a VAE with a PixelCNN decoder (denoted as PixelCNN-VAE), which combines autoregressive models (ARMs) with z. In both methods, the hidden state of the sequential model (either SSMs or ARMs) is designed to capture local information, while the additional latent variable z captures global information. Unfortunately, the design of the aforementioned structured data generating process alone is insufficient to uncover the global features in practice.
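As a rough sketch (the notation here is our own shorthand, not taken verbatim from any single cited model), the structured data generating process described above can be written as a factorized joint distribution:

```latex
p_\theta(x_{1:T}, z, s_{1:T}) \;=\; p(z)\,\prod_{t=1}^{T} p(s_t \mid s_{<t})\; p_\theta(x_t \mid z, s_t)
```

In the SSM variant the local states form a Markov chain, so p(s_t | s_{<t}) reduces to p(s_t | s_{t-1}); in the ARM variant the decoder additionally conditions on past observations x_{<t}. In both cases z is shared across all time steps, which is what biases it toward time-invariant (global) information.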
A typical issue is that the latent variable z is ignored by the decoder (an SSM or ARM) and becomes uninformative. This phenomenon occurs as follows: with expressive decoders such as SSMs or ARMs, the additional latent variable z cannot assist in improving the evidence lower bound (ELBO), which is the objective function of VAEs; therefore, the decoders will not use z (Chen et al., 2017; Alemi et al., 2018). The phenomenon in which the latent variables are ignored is referred to as posterior collapse (PC). To alleviate this issue, several studies have proposed regularizing the mutual information (MI) between x and z to be large.

In this paper, we further analyze MI maximization and claim that merely maximizing I(x; z) is insufficient to uncover the global factors of variation. Figure 1-(a) summarizes the issue of MI maximization. As illustrated in the Venn diagram, the MI can be decomposed into I(x; z) = I(x; z|s) + I(x; z; s). Maximizing the first term I(x; z|s) is beneficial, as it measures the informativeness of z about x given a local feature s. However, maximizing the second term I(x; z; s) might cause a negative effect, because it would also increase I(z; s). In other words, maximizing I(x; z) would encourage the latent variables to have redundant information. For example, when I(x; z) becomes so large that z retains all (local and global) information of x, the downstream classification performance would be degraded. Also, when the local variables still contain global information due to a large I(z; s), it becomes difficult to control speaker information in speech using a DSAE. See Appendix A for empirical evidence that MI maximization increases I(z; s), as discussed above.

Based on this analysis, we propose a new information theoretic regularization method for disentangling the global factors. Specifically, our method minimizes I(z; s), in addition to maximizing I(x; z) similar to prior work (Figure 1-(b)).
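The decomposition above can be checked numerically. The following sketch (our own illustration, not from the paper) builds a small arbitrary discrete joint distribution over (x, z, s) and verifies that I(x; z) = I(x; z|s) + I(x; z; s), where the interaction term I(x; z; s) is computed through its symmetric form I(z; s) - I(z; s|x):

```python
import numpy as np

# An arbitrary joint distribution p(x, z, s) over three binary
# variables (illustrative only; not tied to any model in the paper).
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()

def H(table):
    """Shannon entropy (in nats) of a probability table."""
    q = table[table > 0]
    return float(-(q * np.log(q)).sum())

# Entropies of all marginals; axes are (x, z, s).
H_x   = H(p.sum(axis=(1, 2)))
H_z   = H(p.sum(axis=(0, 2)))
H_s   = H(p.sum(axis=(0, 1)))
H_xz  = H(p.sum(axis=2))
H_xs  = H(p.sum(axis=1))
H_zs  = H(p.sum(axis=0))
H_xzs = H(p)

# Mutual and conditional mutual informations via entropy identities.
I_xz         = H_x + H_z - H_xz
I_xz_given_s = H_xs + H_zs - H_s - H_xzs
I_zs         = H_z + H_s - H_zs
I_zs_given_x = H_xz + H_xs - H_x - H_xzs

# Interaction information I(x; z; s), computed from the (z, s) side.
interaction = I_zs - I_zs_given_x

# The decomposition used in the text: I(x; z) = I(x; z|s) + I(x; z; s).
assert abs(I_xz - (I_xz_given_s + interaction)) < 1e-9
```

One caveat worth noting: unlike the two-variable MI terms, the interaction information I(x; z; s) can be negative in general, so the Venn diagram should be read as an intuition rather than a strict set-theoretic picture.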
As I(z; s) measures the dependence between z and s, our method encourages z and s to have different information, i.e., the disentanglement of global and local factors. We call our method CMI-maximizing regularization, as it maximizes a lower bound of the conditional mutual information (CMI) I(x; z|s). Furthermore, we introduce an adversarial training technique for estimating the CMI. A simple way to estimate it would be to treat I(x; z) and I(z; s) independently, but this might result in compounding approximation errors. Instead, we use the formulation of β-VAE and adversarial training (Ganin et al., 2016), which reduces the number of terms to be approximated. Specifically, we approximate an upper bound of I(z; s) using the density ratio trick (DRT) (Nguyen et al., 2008), where an adversarial classifier models the density ratio. Once we estimate the bound, I(z; s) can be minimized via backpropagation through the classifier.

In our experiments, we used DSAE and PixelCNN-VAE as illustrative examples of the SSM and ARM variants. In addition to evaluating the quality of the global latent variable as in previous studies, we also evaluated the ability to perform controlled generation using a novel evaluation method inspired by Ravuri & Vinyals (2019). In the experiments, the CMI-maximizing regularization consistently outperformed the MI-maximizing one on image and speech datasets. These results support (i) our information theoretic view of learning global features: sequential VAEs can end up with redundant features when merely maximizing the MI; and (ii) the claim that regularizing I(x; z) and I(z; s) is complementary: learning global features is facilitated not only by making z informative, but also by controlling which aspect of the information in x (global or local) goes into z.
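To make the density ratio trick concrete, the sketch below is a toy setup entirely of our own construction (the variable shapes, the quadratic features, and the optimizer are assumptions, not the paper's architecture). A logistic-regression "adversary" is trained to distinguish joint samples (z, s) from shuffled pairs drawn from the product of marginals; its logit then approximates log p(z, s) / (p(z)p(s)), and averaging the logit over joint samples yields a DRT estimate of I(z; s):

```python
import numpy as np

# Toy data: z and s are 1-D correlated Gaussians, so the ground-truth
# mutual information is known in closed form: -0.5 * log(1 - rho**2).
rng = np.random.default_rng(0)
n, rho = 20000, 0.8
z = rng.standard_normal(n)
s = rho * z + np.sqrt(1 - rho**2) * rng.standard_normal(n)

def feats(z, s):
    # Quadratic features: for Gaussians the exact log density ratio
    # lies in this family, so logistic regression is well specified.
    return np.stack([z, s, z * s, z**2, s**2, np.ones_like(z)], axis=1)

# "Joint" samples (z_i, s_i) vs "marginal" samples (z_i, s_perm(i)):
# shuffling s breaks the dependence, giving samples from p(z)p(s).
X = np.concatenate([feats(z, s), feats(z, rng.permutation(s))])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Fit the discriminator by Newton's method (a few steps suffice).
w = np.zeros(X.shape[1])
for _ in range(25):
    prob = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (y - prob) / len(y)
    curv = (X * (prob * (1 - prob))[:, None]).T @ X / len(y)
    w += np.linalg.solve(curv + 1e-6 * np.eye(len(w)), grad)

# DRT: the classifier logit approximates log p(z,s) / (p(z)p(s));
# its mean over joint samples estimates I(z; s).
mi_est = float(np.mean(feats(z, s) @ w))
mi_true = -0.5 * np.log(1 - rho**2)
```

In the actual regularizer, the classifier would be trained adversarially against the encoder, and the gradient of this estimate would be backpropagated into the encoder so as to reduce the dependence between z and s.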
Our contributions can be summarized as follows: (i) through our analysis, we reveal a potential negative side-effect of MI-maximizing regularization, which has been standard in learning global representations with sequential VAEs. This analysis encourages the sequential VAE community to seek new regularization approaches. (ii) To learn good global representations, we propose regularizing I(x; z) and I(z; s) at the same time. Regularizing these two terms is robustly shown to be complementary by our experiments using two models and two domains (speech and image datasets). This finding can help improve various sequential VAEs proposed previously.



Figure 1: Comparison of (a) MI-maximizing regularization and (b) the proposed method, using a Venn diagram of information theoretic measures of x, z, and s.

