DISENTANGLED RECURRENT WASSERSTEIN AUTOENCODER

Abstract

Learning disentangled representations leads to interpretable models and facilitates data generation with style transfer, which has been extensively studied on static data such as images in an unsupervised learning framework. However, only a few works have explored unsupervised disentangled sequential representation learning, due to the challenges of generating sequential data. In this paper, we propose the recurrent Wasserstein Autoencoder (R-WAE), a new framework for generative modeling of sequential data. R-WAE disentangles the representation of an input sequence into static and dynamic factors (i.e., time-invariant and time-varying parts). Our theoretical analysis shows that R-WAE minimizes an upper bound of a penalized form of the Wasserstein distance between the model distribution and the sequential data distribution, and simultaneously maximizes the mutual information between the input data and each of the disentangled latent factors. This is superior to (recurrent) VAE, which does not explicitly enforce mutual information maximization between input data and disentangled latent representations. When the number of actions in sequential data is available as weak supervision, R-WAE is extended to learn a categorical latent representation of actions to improve its disentanglement. Experiments on a variety of datasets show that our models outperform other baselines under the same settings in terms of disentanglement and unconditional video generation, both quantitatively and qualitatively.

1. INTRODUCTION

Unsupervised representation learning is an important research topic in machine learning. It embeds high-dimensional sensory data such as images and videos into a low-dimensional latent space in an unsupervised learning framework, aiming at extracting the essential factors of data variation to help downstream tasks such as classification and prediction (Bengio et al., 2013). In the last several years, disentangled representation learning, which further separates the latent embedding space into exclusive, explainable factors such that each factor interprets only one of the semantic attributes of sensory data, has received a lot of interest and achieved many empirical successes on static data such as images (Chen et al., 2016; Higgins et al., 2017; Dupont, 2018; Chen et al., 2018; Rubenstein et al., 2018b;a; Kim & Mnih, 2018). For example, the latent representation of handwritten digits can be disentangled into a content factor encoding digit identity and a style factor encoding handwriting style. In spite of these successes on static data, only a few works have explored unsupervised representation disentanglement of sequential data, due to the challenges of developing generative models of sequential data. Learning disentangled representations of sequential data is important and has many applications. For example, the latent representation of a smiling-face video can be disentangled into a static part encoding the identity of the person (content factor) and a dynamic part encoding the smiling motion of the face (motion factor). The disentangled representation of the video can then be used for many downstream tasks such as classification, retrieval, and synthetic video generation with style transfer.
Most previous unsupervised representation disentanglement models for static data rely heavily on the KL-divergence regularization of the VAE framework (Higgins et al., 2017; Dupont, 2018; Chen et al., 2018; Kim & Mnih, 2018), which has been shown to be problematic because it matches each individual posterior distribution of the latent code to the prior, rather than the aggregated posterior (Tolstikhin et al., 2018; Rubenstein et al., 2018b;a). Therefore, extending VAE or recurrent VAE (Chung et al., 2015) to disentangle sequential data in a generative-model framework (Hsu et al., 2017; Yingzhen & Mandt, 2018) is not ideal. In addition, recent research (Locatello et al., 2019) has shown theoretically that unsupervised disentangled representation learning is impossible without inductive biases on both models and data, especially for static data. Fortunately, sequential data such as videos often carry clear inductive biases for disentangling the content factor and the motion factor, as noted in (Locatello et al., 2019). Unlike for static data, the learned static and dynamic factors of sequential data are not exchangeable. In this paper, we propose a recurrent Wasserstein Autoencoder (R-WAE) to learn disentangled representations of sequential data. We employ a Wasserstein metric (Arjovsky et al., 2018; Gulrajani et al., 2017; Bellemare et al., 2017) induced by the optimal transport between the model distribution and the underlying data distribution, which has several desirable properties that the KL divergence in VAE (Kingma & Welling, 2014) and β-VAE (Higgins et al., 2017) lacks: sum invariance, scale sensitivity, applicability to distributions with non-overlapping supports, and better out-of-sample performance in the worst-case expectation (Esfahani & Kuhn, 2018).
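To make the contrast with KL divergence concrete, here is a minimal numerical sketch (not from the paper): for two empirical distributions with disjoint supports, the KL divergence is infinite, whereas the 1-Wasserstein distance stays finite and, being scale-sensitive, grows with the separation between the supports. In one dimension the optimal transport plan simply matches sorted samples.

```python
import numpy as np

def w1_empirical(x, y):
    """1-Wasserstein distance between two equal-size 1-D samples.
    In 1-D, optimal transport matches the sorted samples."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

x = np.zeros(100)            # all mass at 0
y_near = np.full(100, 1.0)   # all mass at 1 (disjoint support from x)
y_far = np.full(100, 10.0)   # all mass at 10

# KL(P_x || P_y) is infinite here (disjoint supports), but W1 is finite
# and scale-sensitive: it grows with the separation of the supports.
print(w1_empirical(x, y_near))  # 1.0
print(w1_empirical(x, y_far))   # 10.0
```

This is exactly the regime where a KL-based objective provides no useful gradient signal, while an optimal-transport cost still does.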
Leveraging explicit inductive biases in both sequential data and model, we encode an input sequence into two parts, a shared static latent code and a dynamic latent code, and sequentially decode each element of the sequence by combining both codes. We enforce a fixed prior distribution for the static code and learn a prior for the dynamic code to ensure the consistency of the sequence. The disentangled representations are learned by separately regularizing the posteriors of the latent codes with their corresponding priors. Our main contributions are summarized as follows: (1) We draw the first connection between minimizing a Wasserstein distance and maximizing mutual information for unsupervised representation disentanglement of sequential data, from an information-theoretic perspective; (2) We propose two sets of effective regularizers to learn the disentangled representation in a completely unsupervised manner with explicit inductive biases in both sequential data and models; (3) We incorporate a relaxed discrete latent variable to improve the disentangled learning of actions on real data. Experiments show that our models achieve state-of-the-art performance in both disentanglement of static and dynamic latent representations and unconditional video generation under the same settings as the baselines (Yingzhen & Mandt, 2018; Tulyakov et al., 2018).
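The static/dynamic factorization described above can be sketched schematically as follows. This is a toy illustration with hypothetical names and linear maps standing in for the actual encoders and decoder (a real model would use recurrent and convolutional networks); only the shape of the computation — one content code shared across the sequence, one motion code per time step, frames decoded from their concatenation — reflects the text.

```python
import numpy as np

rng = np.random.default_rng(0)
T, x_dim, c_dim, m_dim = 8, 16, 4, 3  # sequence length and dims (illustrative)

# Toy linear "encoders" and "decoder" (hypothetical stand-ins).
W_c = rng.standard_normal((c_dim, x_dim))           # static (content) encoder
W_m = rng.standard_normal((m_dim, x_dim))           # dynamic (motion) encoder
W_d = rng.standard_normal((x_dim, c_dim + m_dim))   # per-frame decoder

x = rng.standard_normal((T, x_dim))  # input sequence x_1, ..., x_T

# One static code shared by the whole sequence (here: mean-pooled) ...
z_c = W_c @ x.mean(axis=0)
# ... and one dynamic code per time step.
z_m = x @ W_m.T  # shape (T, m_dim)

# Decode each frame from the concatenation [z_c ; z_m_t].
x_hat = np.stack([W_d @ np.concatenate([z_c, z_m[t]]) for t in range(T)])
assert x_hat.shape == x.shape
```

In the actual model, the posterior of z_c is regularized toward a fixed prior and the posterior of each z_m_t toward a learned sequential prior, which is what separates content from motion.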

2. BACKGROUND AND RELATED WORK

Notation. Let calligraphic letters (e.g., 𝒳) denote sets, capital letters (e.g., X) denote random variables, and lowercase letters denote their values. Let D(P_X, P_G) be the divergence between the true (but unknown) data distribution P_X (with density p(x)) and the latent-variable generative model distribution P_G, specified by a prior distribution P_Z (with density p(z)) over the latent variable Z. Let D_KL denote the KL divergence, D_JS the Jensen-Shannon divergence, and MMD the Maximum Mean Discrepancy (Gretton et al., 2007).

Optimal Transport Between Distributions. The optimal transport cost, which induces a rich class of divergences between the distribution P_X and the distribution P_G, is defined as

W(P_X, P_G) := inf_{Γ ∈ 𝒫(X ∼ P_X, Y ∼ P_G)} E_{(X,Y)∼Γ}[c(X, Y)],

where c(X, Y) is any measurable cost function and 𝒫(X ∼ P_X, Y ∼ P_G) is the set of joint distributions of (X, Y) with marginals P_X and P_G, respectively.

Comparison between WAE (Tolstikhin et al., 2018) and VAE (Kingma & Welling, 2014). Instead of optimizing over all couplings Γ between two random variables in 𝒳, Bousquet et al.
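Since MMD (Gretton et al., 2007) is the divergence WAE-style methods typically use to match the aggregated posterior of the latent code to the prior, a minimal estimator is worth sketching. The following is a standard biased empirical MMD² with an RBF kernel (the bandwidth choice is illustrative, not from the paper):

```python
import numpy as np

def mmd_rbf(x, y, sigma=1.0):
    """Biased empirical MMD^2 (Gretton et al., 2007) with an RBF kernel.
    x, y: arrays of shape (n, d) and (m, d) of samples from two distributions."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
prior = rng.standard_normal((500, 2))       # samples from the prior P_Z
good = rng.standard_normal((500, 2))        # aggregated posterior close to prior
bad = rng.standard_normal((500, 2)) + 3.0   # mismatched aggregated posterior

# MMD^2 is near zero for matching distributions and large for mismatched ones.
assert mmd_rbf(prior, good) < mmd_rbf(prior, bad)
```

Note that the estimator is computed between *batches* of latent samples, which is precisely how it regularizes the aggregated posterior rather than each individual posterior.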

