INFORMATION THEORETIC REGULARIZATION FOR LEARNING GLOBAL FEATURES BY SEQUENTIAL VAE

Anonymous

Abstract

Sequential variational autoencoders (VAEs) with a global latent variable z have been studied for disentangling the global features of data, which is useful in many downstream tasks. To further assist sequential VAEs in obtaining a meaningful z, an auxiliary loss that maximizes the mutual information (MI) between the observation and z is often employed. However, by analyzing sequential VAEs from the information theoretic perspective, we can claim that simply maximizing the MI encourages the latent variables to carry redundant information and prevents the disentanglement of global and local features. Based on this analysis, we derive a novel regularization method that makes z informative while encouraging disentanglement. Specifically, the proposed method removes redundant information by minimizing the MI between z and the local features using adversarial training. In the experiments, we trained state-space and autoregressive model variants on speech and image datasets. The results indicate that the proposed method improves performance on downstream classification and data generation tasks, thereby supporting our information theoretic perspective on the learning of global representations.

1. INTRODUCTION

Uncovering the global factors of variation in high-dimensional data is a significant and relevant problem in representation learning (Bengio et al., 2013). For example, a global representation of images that captures only the identity of the objects and is invariant to detailed texture would assist in downstream semi-supervised classification (Ma et al., 2019). Such representations are also known to be useful in the controlled generation of data: obtaining them allows us to manipulate the voice of the speaker in speech (Yingzhen & Mandt, 2018), or to generate images that share a similar global structure (e.g., the arrangement of objects) but vary in detail (Razavi et al., 2019).

Sequential variational autoencoders (VAEs) with a global latent variable z have played an important role in the unsupervised learning of global features. Specifically, we consider sequential VAEs with a structured data generating process in which an observation x at time t (denoted as x_t) is generated from a global feature z and a local feature s_t. The z of such sequential VAEs can then acquire only global information invariant to t. For example, Yingzhen & Mandt (2018) demonstrated that a disentangled sequential autoencoder (DSAE), which combines state-space models (SSMs) with a global latent variable z, can uncover speaker information from speech. Furthermore, Chen et al. (2017) and Gulrajani et al. (2017) proposed a VAE with a PixelCNN decoder (denoted as PixelCNN-VAE), which combines autoregressive models (ARMs) with z. In both methods, the hidden state of the sequential model (either an SSM or an ARM) is designed to capture local information, while the additional latent variable z captures global information. Unfortunately, the design of the aforementioned structured data generating process alone is insufficient to uncover global features in practice.
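To make the structured data generating process concrete, the following is a minimal numpy sketch of it: a single global latent z is drawn once and shared across all time steps, while local latents s_t evolve through a simple linear state transition. This is an illustrative toy model, not the authors' implementation; the matrices A, Wz, and Ws are random stand-ins for the learned networks of an SSM or ARM decoder.

```python
import numpy as np

def sample_sequence(T=10, dz=4, ds=3, dx=8, seed=0):
    """Toy sketch of the generative process p(z) * prod_t p(s_t | s_{t-1}) p(x_t | z, s_t).

    z is time-invariant (global feature, e.g. speaker identity); s_t carries
    the time-varying local feature. All parameters are random placeholders.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(dz)            # global latent: sampled once per sequence
    A = 0.9 * np.eye(ds)                   # stand-in state-transition matrix
    Wz = rng.standard_normal((dx, dz))     # stand-in emission weights for z
    Ws = rng.standard_normal((dx, ds))     # stand-in emission weights for s_t
    s_t = np.zeros(ds)
    xs = []
    for _ in range(T):
        # Local latent evolves over time (captures per-step dynamics).
        s_t = A @ s_t + rng.standard_normal(ds)
        # Observation depends on both the global z and the local s_t.
        x_t = Wz @ z + Ws @ s_t + 0.1 * rng.standard_normal(dx)
        xs.append(x_t)
    return z, np.stack(xs)

z, x = sample_sequence()  # z: shape (4,), x: shape (10, 8)
```

Because z enters every emission p(x_t | z, s_t), it is the only variable positioned to carry information shared across the whole sequence; the disentanglement question is whether training actually pushes global information into z rather than into the expressive s_t path.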
A typical issue is that the latent variable z is ignored by the decoder (the SSM or ARM) and becomes uninformative. This phenomenon occurs as follows: with expressive decoders such as SSMs or ARMs, the additional latent variable z cannot help improve the evidence lower bound (ELBO), which is the objective function of VAEs; therefore, the decoder learns not to use z (Chen et al., 2017; Alemi et al., 2018). The phenomenon in which the latent variables are ignored is referred to as posterior collapse (PC). To alleviate this issue, several studies have proposed regularizing the mutual information (MI) between x and z to be large, e.g.,

