DISENTANGLED RECURRENT WASSERSTEIN AUTOENCODER

Abstract

Learning disentangled representations leads to interpretable models and facilitates data generation with style transfer, which has been extensively studied on static data such as images in an unsupervised learning framework. However, only a few works have explored unsupervised disentangled sequential representation learning due to the challenges of generating sequential data. In this paper, we propose the recurrent Wasserstein Autoencoder (R-WAE), a new framework for generative modeling of sequential data. R-WAE disentangles the representation of an input sequence into static and dynamic factors (i.e., time-invariant and time-varying parts). Our theoretical analysis shows that R-WAE minimizes an upper bound of a penalized form of the Wasserstein distance between the model distribution and the sequential data distribution, and simultaneously maximizes the mutual information between the input data and each of the disentangled latent factors. This is superior to (recurrent) VAE, which does not explicitly enforce mutual information maximization between input data and disentangled latent representations. When the number of actions in the sequential data is available as weak supervision, R-WAE is extended to learn a categorical latent representation of actions to improve its disentanglement. Experiments on a variety of datasets show that our models outperform other baselines under the same settings in terms of disentanglement and unconditional video generation, both quantitatively and qualitatively.

1. INTRODUCTION

Unsupervised representation learning is an important research topic in machine learning. It embeds high-dimensional sensory data such as images and videos into a low-dimensional latent space in an unsupervised learning framework, aiming to extract the essential variation factors of the data to help downstream tasks such as classification and prediction (Bengio et al., 2013). In the last several years, disentangled representation learning, which further separates the latent embedding space into exclusive explainable factors such that each factor interprets only one of the semantic attributes of the sensory data, has received a lot of interest and achieved many empirical successes on static data such as images (Chen et al., 2016; Higgins et al., 2017; Dupont, 2018; Chen et al., 2018; Rubenstein et al., 2018b;a; Kim & Mnih, 2018). For example, the latent representation of handwritten digits can be disentangled into a content factor encoding digit identity and a style factor encoding handwriting style. In spite of these successes on static data, only a few works have explored unsupervised representation disentanglement of sequential data due to the challenges of developing generative models for it. Learning disentangled representations of sequential data is important and has many applications. For example, the latent representation of a smiling-face video can be disentangled into a static part encoding the identity of the person (content factor) and a dynamic part encoding the smiling motion of the face (motion factor). The disentangled representation of the video can potentially be used for many downstream tasks such as classification, retrieval, and synthetic video generation with style transfer.
Most previous unsupervised representation disentanglement models for static data rely heavily on the KL-divergence regularization in a VAE framework (Higgins et al., 2017; Dupont, 2018; Chen et al., 2018; Kim & Mnih, 2018), which has been shown to be problematic because it matches the posterior distribution of each individual input, rather than the aggregated posterior distribution of the latent code, to the same prior (Tolstikhin et al., 2018; Rubenstein et al., 2018b;a). Therefore, extending VAE or recurrent VAE (Chung et al., 2015) to disentangle sequential data in a generative model framework (Hsu et al., 2017; Yingzhen & Mandt, 2018) is not ideal. In addition, recent research (Locatello et al., 2019) has theoretically shown that it is impossible to perform unsupervised disentangled representation learning without inductive biases on both models and data, especially on static data. Fortunately, sequential data such as videos often have clear inductive biases for the disentanglement of content and motion factors, as mentioned in (Locatello et al., 2019). Unlike in static data, the learned static and dynamic factors of sequential data are not exchangeable. In this paper, we propose a recurrent Wasserstein Autoencoder (R-WAE) to learn disentangled representations of sequential data. We employ a Wasserstein metric (Arjovsky et al., 2018; Gulrajani et al., 2017; Bellemare et al., 2017) induced by the optimal transport between the model distribution and the underlying data distribution, which has several properties (e.g., sum invariance, scale sensitivity, applicability to distributions with non-overlapping supports, and better out-of-sample performance in the worst-case expectation (Esfahani & Kuhn, 2018)) that make it preferable to the KL divergence used in VAE (Kingma & Welling, 2014) and β-VAE (Higgins et al., 2017).
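As a toy numerical illustration of one of these properties (our example, not from the paper): for two distributions with disjoint supports, the KL divergence is infinite, while the Wasserstein distance stays finite and reflects how far apart the supports are.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same index set.
    Returns inf when q assigns zero mass somewhere p does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D samples:
    the optimal coupling matches sorted samples."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Two point masses with non-overlapping supports on a shared grid {0, 1}.
p = [1.0, 0.0]  # all mass at grid point 0
q = [0.0, 1.0]  # all mass at grid point 1

print(kl_divergence(p, q))           # inf: KL gives no useful training signal
print(wasserstein_1d([0.0], [1.0]))  # 1.0: the distance between the supports
```

The Wasserstein value degrades gracefully as the supports approach each other, which is what makes it usable as a loss.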
Leveraging explicit inductive biases in both sequential data and model, we encode an input sequence into two parts, a shared static latent code and a dynamic latent code, and sequentially decode each element of the sequence by combining both codes. We enforce a fixed prior distribution for the static code and learn a prior for the dynamic code to ensure the consistency of the sequence. The disentangled representations are learned by separately regularizing the posteriors of the latent codes with their corresponding priors. Our main contributions are summarized as follows: (1) We draw the first connection between minimizing a Wasserstein distance and maximizing mutual information for unsupervised representation disentanglement of sequential data from an information-theoretic perspective; (2) We propose two sets of effective regularizers to learn the disentangled representation in a completely unsupervised manner with explicit inductive biases in both sequential data and models; (3) We incorporate a relaxed discrete latent variable to improve the disentangled learning of actions on real data. Experiments show that our models achieve state-of-the-art performance in both the disentanglement of static and dynamic latent representations and unconditional video generation under the same settings as the baselines (Yingzhen & Mandt, 2018; Tulyakov et al., 2018).

2. BACKGROUND AND RELATED WORK

Notation Let calligraphic letters (e.g., 𝒳) denote sets, capital letters (e.g., X) denote random variables, and lowercase letters denote their values. Let D(P_X, P_G) be the divergence between the true (but unknown) data distribution P_X (with density p(x)) and the latent-variable generative model distribution P_G specified by a prior distribution P_Z (with density p(z)) over the latent variable Z. Let D_KL denote the KL divergence, D_JS the Jensen-Shannon divergence, and MMD the Maximum Mean Discrepancy (Gretton et al., 2007).

Optimal Transport Between Distributions

The optimal transport cost, which induces a rich class of divergences between the distribution P_X and the distribution P_G, is defined as

W(P_X, P_G) := inf_{Γ ∈ P(X∼P_X, Y∼P_G)} E_{(X,Y)∼Γ}[c(X, Y)],

where c(X, Y) is any measurable cost function and P(X∼P_X, Y∼P_G) is the set of joint distributions of (X, Y) with respective marginals P_X and P_G.

Comparison between WAE (Tolstikhin et al., 2018) and VAE (Kingma & Welling, 2014) Instead of optimizing over all couplings Γ between two random variables in 𝒳, Bousquet et al. (2017) and Tolstikhin et al. (2018) show that it is sufficient to find Q(Z|X) such that the marginal Q(Z) := E_{X∼P_X}[Q(Z|X)] is identical to the prior P(Z), as stated in the following definition.

Definition 1. For any deterministic P_G(X|Z) and any function G : 𝒵 → 𝒳,

W(P_X, P_G) = inf_{Q: Q_Z = P_Z} E_{P_X} E_{Q(Z|X)}[c(X, G(Z))].

Definition 1 leads to the following WAE loss D_WAE based on a Wasserstein distance,

inf_{Q(Z|X)} E_{P_X} E_{Q(Z|X)}[c(X, G(Z))] + β D(Q_Z, P_Z),

where the first term is the data reconstruction loss, and the second is a regularizer that forces the aggregated posterior Q_Z = ∫ Q(Z|X) dP_X to match the prior P_Z (Adversarial Autoencoders (AAE) (Makhzani et al., 2015) share a similar idea with WAE). In contrast, VAE has a different regularizer, E_X[D_KL(Q(Z|X), P_Z)], which enforces the latent posterior distribution of each individual input to match P_Z. Rubenstein et al. (2018a;b) show that WAE achieves better disentanglement than β-VAE (Higgins et al., 2017) on images, which inspires us to design a new representation disentanglement framework for sequential data with several innovations.

Unsupervised disentangled representation learning Several generative models have been proposed to learn disentangled representations of sequential data (Denton et al., 2017; Hsu et al., 2017; Yingzhen & Mandt, 2018; Hsieh et al., 2018; Sun et al., 2018; Tulyakov et al., 2018).
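The two-term structure of the WAE loss can be sketched numerically. The toy below is our illustration, not the paper's implementation: scalar data, encoder/decoder passed in as plain functions, a squared-error cost, and a simple sample-based RBF-kernel MMD standing in for the divergence D.

```python
import math, random

def rbf_mmd2(xs, ys, gamma=1.0):
    """Biased (V-statistic) sample estimate of squared MMD with an RBF kernel."""
    k = lambda a, b: math.exp(-gamma * (a - b) ** 2)
    m, n = len(xs), len(ys)
    xx = sum(k(a, b) for a in xs for b in xs) / (m * m)
    yy = sum(k(a, b) for a in ys for b in ys) / (n * n)
    xy = sum(k(a, b) for a in xs for b in ys) / (m * n)
    return xx + yy - 2.0 * xy

def wae_loss(data, encode, decode, beta=1.0):
    """Reconstruction cost + beta * D(Q_Z, P_Z), where D is estimated by MMD
    between the encoded data (aggregated posterior) and prior N(0, 1) samples."""
    codes = [encode(x) for x in data]
    recon = sum((x - decode(z)) ** 2 for x, z in zip(data, codes)) / len(data)
    prior = [random.gauss(0.0, 1.0) for _ in data]
    return recon + beta * rbf_mmd2(codes, prior)

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)]
# Identity encoder/decoder on N(0, 1) data: zero reconstruction error, and the
# aggregated posterior already matches the prior, so the loss is near zero.
loss = wae_loss(data, encode=lambda x: x, decode=lambda z: z)
print(loss)
```

Note that the penalty acts on the aggregated posterior (all codes pooled together), which is the key difference from the per-input VAE regularizer described above.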
FHVAE (Hsu et al., 2017) is a VAE-based hierarchical graphical model with factorized Gaussian priors that focuses only on speech and audio data. Our R-WAE, which employs a more powerful recurrent prior, can be applied to both speech and video data. The models in (Sun et al., 2018; Denton et al., 2017; Hsieh et al., 2018) rely on the first several elements of a sequence to design disentanglement architectures for future sequence prediction. In terms of representation learning by mutual information maximization, our work empirically demonstrates that explicit inductive biases in data and model architecture are necessary for the success of learning meaningful disentangled representations of sequential data, while the works in (Locatello et al., 2019; Poole et al., 2019; Tschannen et al., 2020; Ozair et al., 2019) address general representation learning, especially on static data. The most related works to ours are MoCoGAN (Tulyakov et al., 2018) and DS-VAE (Yingzhen & Mandt, 2018), which can disentangle the variant and invariant parts of sequential data and perform unconditional sequence generation.

3. PROPOSED APPROACH: DISENTANGLED RECURRENT WASSERSTEIN AUTOENCODER (R-WAE)

Given a high-dimensional sequence x_{1:T}, our goal is to learn a disentangled representation consisting of a time-invariant latent code z^c and time-varying latent codes z^m_t along the sequence. Let z_t = (z^c, z^m_t) be the latent code of x_t. Let X_t, Z_t, Z^c and Z^m_t be random variables with realizations x_t, z_t, z^c and z^m_t, respectively, and denote D = X_{1:T}. To achieve this goal, we define the following probabilistic generative model by assuming Z^m_t and Z^c are independent,

P(X_{1:T}, Z_{1:T}) = P(Z^c) ∏_{t=1}^T P_ψ(Z^m_t | Z^m_{<t}) P_θ(X_t | Z_t),   (4)

where P(Z_{1:T}) = P(Z^c) ∏_{t=1}^T P_ψ(Z^m_t | Z^m_{<t}) is the prior, Z_t = (Z^c, Z^m_t), and the decoder model P_θ(X_t | Z_t) is a Dirac delta distribution. In practice, P(Z^c) is chosen as N(0, I) and P_ψ(Z^m_t | Z^m_{<t}) = N(μ_ψ(Z^m_{<t}), σ²_ψ(Z^m_{<t})), where μ_ψ and σ_ψ are parameterized by recurrent neural networks (RNNs). The inference model Q is defined as

Q_φ(Z^c, Z^m_{1:T} | X_{1:T}) = Q_φ(Z^c | X_{1:T}) ∏_{t=1}^T Q_φ(Z^m_t | Z^m_{<t}, X_t),   (5)

where Q_φ(Z^c | X_{1:T}) and Q_φ(Z^m_t | Z^m_{<t}, X_t) are also Gaussian distributions parameterized by RNNs. The structures of the generative model (4) and the inference model (5) are illustrated in Fig. 1.
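Ancestral sampling from a generative model of this shape can be sketched as follows. This is our toy illustration only: scalar latents, a hand-rolled one-step recurrence standing in for the RNN prior (μ_ψ, σ_ψ), and a caller-supplied function standing in for the deterministic decoder.

```python
import random

def sample_sequence(T, decode, seed=None):
    """Ancestral sampling: z_c ~ N(0, 1) once; then for each t,
    z_m_t ~ N(mu(z_m_<t), sigma^2) from a recurrent prior; x_t = decode(z_c, z_m_t)."""
    rng = random.Random(seed)
    z_c = rng.gauss(0.0, 1.0)       # static (content) code, shared by all frames
    h = 0.0                         # recurrent state summarizing z_m_<t
    xs = []
    for _ in range(T):
        mu, sigma = 0.9 * h, 0.5    # toy stand-ins for mu_psi and sigma_psi
        z_m = rng.gauss(mu, sigma)  # dynamic (motion) code at step t
        h = z_m                     # state update (a real model uses an RNN)
        xs.append(decode(z_c, z_m)) # deterministic decoder P_theta(X_t | Z_t)
    return xs

frames = sample_sequence(T=8, decode=lambda c, m: c + m, seed=0)
print(len(frames))  # 8
```

Because z^c is drawn once and reused at every step while z^m_t evolves through the recurrence, content stays fixed across the generated sequence and only the motion varies.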

3.1. R-WAE MINIMIZES A PENALIZED FORM OF A WASSERSTEIN DISTANCE

The optimal transport cost between two distributions P_D and P_G over sequential variables X_{1:T} ∼ P_D and Y_{1:T} ∼ P_G is given by

W(P_D, P_G) := inf_{Γ ∈ P(X_{1:T}∼P_D, Y_{1:T}∼P_G)} E_{(X_{1:T}, Y_{1:T})∼Γ}[c(X_{1:T}, Y_{1:T})],

where P(X_{1:T}∼P_D, Y_{1:T}∼P_G) is the set of all joint distributions with marginals P_D and P_G, respectively. When we choose c(x, y) = ‖x − y‖² (2-Wasserstein distance) and c(X_{1:T}, Y_{1:T}) = Σ_t ‖X_t − Y_t‖² by linearity, it is easy to derive the optimal transport cost for disentangled sequential variables.

Theorem 1. With deterministic P(X_t | Z_t) and any function Y_t = G(Z_t), we have

W(P_D, P_G) = inf_{Q: Q_{Z^c} = P_{Z^c}, Q_{Z^m_{1:T}} = P_{Z^m_{1:T}}} Σ_t E_{P_D} E_{Q(Z_t | Z_{<t}, X_t)}[c(X_t, G(Z_t))],

where Q_{Z_{1:T}} = Q_{Z^c} Q_{Z^m_{1:T}} is the marginal distribution of Z_{1:T} when X_{1:T} ∼ P_D and Z_{1:T} ∼ Q(Z_{1:T} | X_{1:T}), and P_{Z_{1:T}} is the prior. Under our assumptions, we have the upper bound

W(P_D, P_G) ≤ inf_{Q∈S} Σ_t E_{P_D} E_{Q(Z_t | Z_{<t}, X_t)}[c(X_t, G(Z_t))],

where the subset S is S = {Q : Q_{Z^c} = P_{Z^c}, Q_{Z^m_1} = P_{Z^m_1}, Q_{Z^m_t | Z^m_{<t}} = P_{Z^m_t | Z^m_{<t}}}.

In practice, based on Theorem 1, we minimize the following objective function of our proposed R-WAE,

Σ_{t=1}^T E_{Q(Z_t | Z_{<t}, X_t)}[c(X_t, G(Z_t))] + β_1 D(Q_{Z^c}, P_{Z^c}) + β_2 Σ_{t=1}^T D(Q_{Z^m_t | Z^m_{<t}}, P_{Z^m_t | Z^m_{<t}}),   (9)

where D is a divergence between two distributions, and the second and third terms are regularization terms for Z^c and Z^m_t, respectively. In the following, we present two different approaches to computing the regularization terms, in Sections 3.2 and 3.3. Because we cannot straightforwardly estimate the marginals Q_φ(Z^c) and Q_φ(Z^m_t | Z^m_{<t}), we cannot directly use the KL divergence in the two regularization terms; instead, we optimize the RHS of (9) by likelihood-free optimization (Gretton et al., 2007; Goodfellow et al., 2014; Nowozin et al., 2016; Arjovsky et al., 2018), since samples from all distributions are available.
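The three-term structure of the objective in (9) can be sketched schematically (this is our illustration, not the paper's training code: per-frame reconstruction costs and divergence estimates are assumed to have been computed elsewhere and are passed in as plain numbers).

```python
def rwae_objective(recon_costs, d_static, d_dynamic_per_step, beta1=1.0, beta2=1.0):
    """Eq. (9): sum_t reconstruction cost
               + beta1 * D(Q_{Z^c}, P_{Z^c})
               + beta2 * sum_t D(Q_{Z^m_t | Z^m_<t}, P_{Z^m_t | Z^m_<t})."""
    return sum(recon_costs) + beta1 * d_static + beta2 * sum(d_dynamic_per_step)

# Toy numbers: per-frame squared errors and precomputed divergence estimates.
loss = rwae_objective(
    recon_costs=[0.2, 0.1, 0.3],            # c(x_t, G(z_t)) for t = 1..3
    d_static=0.05,                          # divergence for the content code z^c
    d_dynamic_per_step=[0.01, 0.02, 0.02],  # divergences for each z^m_t
    beta1=10.0, beta2=5.0)
print(loss)  # 0.6 + 0.5 + 0.25 ≈ 1.35
```

The static code is penalized once per sequence while the dynamic codes are penalized once per time step, mirroring the two regularizers in (9).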

3.2. D_JS PENALTY FOR Z^c AND MMD PENALTY FOR Z^m

The prior distribution of Z^c is chosen as a multivariate unit-variance Gaussian, N(0, I). We choose the penalty D_JS(Q_{Z^c}, P_{Z^c}) for Z^c and apply min-max optimization by introducing a discriminator D_γ (Goodfellow et al., 2014). Instead of performing optimization in the high-dimensional input data space, we move the adversarial optimization to the much lower-dimensional latent representation space of the content. Because the prior distribution of {Z^m_t} is dynamically learned during training, it is challenging to use D_JS to regularize {Z^m_t} with a discriminator, as this would induce a third minimization within a min-max optimization. Therefore, we use MMD to regularize {Z^m_t}, since samples from both distributions are easy to obtain (the dimension of z^m_t is less than 20 in our experiments on videos). With a kernel k, MMD_k(Q, P) is approximated by samples from Q and P (Gretton et al., 2007). The regularization terms are summarized as follows, and we call the resulting model R-WAE(GAN) (see Algorithm 1 in the Appendix for details):

D(Q_{Z^c}, P_{Z^c}) = D_JS(Q_{Z^c}, P_{Z^c});  D(Q_{Z^m_t | Z^m_{<t}}, P_{Z^m_t | Z^m_{<t}}) = MMD_k(Q_{Z^m_t | Z^m_{<t}}, P_{Z^m_t | Z^m_{<t}}).   (10)

3.3. SCALED MMD PENALTY FOR Z^c AND MMD PENALTY FOR Z^m

The success of MMD with neural kernels for generative modeling of real-world data (Li et al., 2017; Bińkowski et al., 2018; Arbel et al., 2018) motivates us to use only MMD as regularization in Eq. (9),

D(Q_{Z^c}, P_{Z^c}) = MMD_{k_γ}(Q_{Z^c}, P_{Z^c});  D(Q_{Z^m_t | Z^m_{<t}}, P_{Z^m_t | Z^m_{<t}}) = MMD_k(Q_{Z^m_t | Z^m_{<t}}, P_{Z^m_t | Z^m_{<t}}),   (11)

where k_γ is a parameterized family of kernels (Li et al., 2017; Bińkowski et al., 2018; Arbel et al., 2018) defined as k_γ(x, y) = k(f_γ(x), f_γ(y)), with f_γ(x) a feature map. This more expressive kernel is used for Z^c, whose dimension is equal to or higher than that of Z^m_t.
The details of optimizing the first term MMD_{k_γ}(Q_{Z^c}, P_{Z^c}) in Eq. (11) are provided in Appendix D, based on the scaled MMD (Arbel et al., 2018), a principled and stable technique for training an MMD-based critic. We call the resulting model R-WAE(MMD) (see Algorithm 2 in the Appendix for details).

3.4. WEAKLY SUPERVISED DISENTANGLEMENT WITH A KNOWN NUMBER OF ACTIONS

When the number of actions (motions) in sequential data, denoted by A, is available, we incorporate a categorical latent variable a (a one-hot vector of dimension A) to enhance the disentanglement of the dynamic latent codes of the motions. The inference model for a is designed as q_φ(a | x_{1:T}, z^m_{1:T}). Intuitively, the action is inferred from the motion sequence to recognize its label. Learning such a categorical distribution requires a continuous relaxation of the discrete random variable in order to backpropagate gradients through it. Let α_1, …, α_A be the class probabilities. We obtain a sample a = (y_1, …, y_A) from its continuous relaxation by first sampling g = (g_1, …, g_A) with g_j ∼ Gumbel(0, 1) and then applying the transformation

y_j = exp((log α_j + g_j)/τ) / Σ_i exp((log α_i + g_i)/τ),

where τ is a temperature parameter controlling the tightness of the approximation. To learn the categorical distribution with the reparameterization trick, we use a regularizer D_KL(q_φ(a | x_{1:T}, z^m_{1:T}), p(a)), where p(a) is a uniform Gumbel-Softmax prior distribution (Jang et al., 2016; Maddison et al., 2016). The motion variable is augmented as z^R_t = (z^m_t, a); learning joint continuous and discrete latent representations of image data has been extensively discussed in (Dupont, 2018) (see Fig. 1(c, d) for illustrations).
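The Gumbel-Softmax sampling step above can be sketched directly (a minimal illustration, not the paper's code; a real implementation would operate on logits inside an autodiff framework so the reparameterized sample is differentiable).

```python
import math, random

def gumbel_softmax_sample(alphas, tau, rng=random):
    """Draw a relaxed one-hot sample from class probabilities alphas at
    temperature tau: y_j ∝ exp((log alpha_j + g_j) / tau), g_j ~ Gumbel(0, 1)."""
    # Gumbel(0, 1) samples via the inverse-CDF: -log(-log(U)), U ~ Uniform(0, 1).
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in alphas]
    scores = [(math.log(a) + g) / tau for a, g in zip(alphas, gumbels)]
    m = max(scores)                     # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
y = gumbel_softmax_sample([0.2, 0.3, 0.5], tau=0.1, rng=rng)
print(round(sum(y), 6))  # 1.0: the sample lies on the probability simplex
```

As τ → 0 the samples approach one-hot vectors (matching the discrete categorical), while larger τ produces smoother, lower-variance samples.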

4. ANALYZING R-WAE FROM AN INFORMATION THEORY PERSPECTIVE

Theorem 2. If the mutual information (MI) between Z_{1:T} and X_{1:T} is defined in terms of the inference model Q, I(Z_{1:T}; X_{1:T}) = E_{Q(X_{1:T}, Z_{1:T})}[log Q(Z_{1:T} | X_{1:T}) − log Q(Z_{1:T})], where Q(X_{1:T}, Z_{1:T}) = Q(Z_{1:T} | X_{1:T}) P(X_{1:T}) and Q(Z_{1:T}) = Σ_{X_{1:T}} Q(X_{1:T}, Z_{1:T}), then we have the lower bound

I(Z_{1:T}; X_{1:T}) ≥ Σ_{t=1}^T E_{P_D} E_{Q_φ}[log P_θ(X_t | Z_t) − log P(D)] − E_{P_D} E_{Q_φ(Z^c | X_{1:T})}[log Q_φ(Z^c) − log P(Z^c)] − Σ_{t=1}^T E_{P_D} E_{Q_φ(Z^m_t | Z^m_{<t}, X_t)}[log Q_φ(Z^m_t | Z^m_{<t}) − log P(Z^m_t | Z^m_{<t})].   (12)

Theorem 2 shows that R-WAE maximizes a lower bound of the mutual information between X_{1:T} and Z_{1:T}, which theoretically guarantees that R-WAE learns semantically meaningful latent representations of input sequences. With constants removed, the RHS of (9) and (12) are the same if D is the KL divergence. Although theoretically important, Theorem 2 with KL divergence cannot be directly used for the regularization terms of R-WAE in practice, because we cannot straightforwardly estimate the marginals Q_φ(Z^c) and Q_φ(Z^m_t | Z^m_{<t}), as discussed previously. From Eq. (9) and (12), we obtain the following theorem.

Theorem 3. When the distribution divergence is chosen as the KL divergence, the regularization terms in Eq. (9) jointly minimize the KL divergence between the inference model Q(Z_{1:T} | X_{1:T}) and the prior model P(Z_{1:T}) and maximize the mutual information between X_{1:T} and Z_{1:T},

KL(Q(Z^c) || P(Z^c)) = E_{p_D}[KL(Q(Z^c | X_{1:T}) || P(Z^c))] − I(X_{1:T}; Z^c),   (13)
KL(Q(Z^m_t | Z^m_{<t}) || P(Z^m_t | Z^m_{<t})) = E_{p_D}[KL(Q(Z^m_t | Z^m_{<t}, X_{1:T}) || P(Z^m_t | Z^m_{<t}))] − I(X_{1:T}; Z^m_t | Z^m_{<t}),

where the mutual information is defined in terms of the inference model as in Theorem 2. Theorem 3 shows that, even when adopting the KL divergence, the regularization in the loss of R-WAE still improves over that of the vanilla VAE, which only has the first term on the RHS of Eq. (13).
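The identity in Eq. (13) can be checked numerically on a small discrete example (our illustration: a two-point data distribution, a binary latent code, and arbitrary toy probabilities).

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Data distribution over two inputs, posteriors Q(Z^c | x) over a binary code,
# and a prior P(Z^c). All values are arbitrary toy numbers.
p_data = [0.4, 0.6]
q_post = [[0.9, 0.1], [0.2, 0.8]]
p_prior = [0.5, 0.5]

# Aggregated posterior Q(Z^c) = E_{p_D}[Q(Z^c | X)].
q_agg = [sum(p_data[i] * q_post[i][z] for i in range(2)) for z in range(2)]

# Mutual information I(X; Z^c) under the inference model.
mi = sum(p_data[i] * q_post[i][z] * math.log(q_post[i][z] / q_agg[z])
         for i in range(2) for z in range(2))

# Eq. (13): KL(Q(Z^c) || P(Z^c)) = E_{p_D}[KL(Q(Z^c|X) || P(Z^c))] - I(X; Z^c)
lhs = kl(q_agg, p_prior)
rhs = sum(p_data[i] * kl(q_post[i], p_prior) for i in range(2)) - mi
print(abs(lhs - rhs) < 1e-12)  # True: the decomposition holds exactly
```

This makes the theorem's point concrete: the aggregated-posterior KL (what R-WAE penalizes) equals the per-input VAE penalty minus the mutual information, so penalizing it never discourages informative codes.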
The two mutual information terms explicitly enforce mutual information maximization between the input data and the non-exchangeable disentangled latent representations Z^c and Z^m_t. Therefore, R-WAE is superior to recurrent VAE (DS-VAE).

5. EXPERIMENTS

We conduct extensive experiments on four datasets to quantitatively and qualitatively validate our methods. The baseline methods for comparison are DS-VAE (Yingzhen & Mandt, 2018) and MoCoGAN (Tulyakov et al., 2018). We train our models on the Stochastic Moving MNIST (SM-MNIST), Sprites, and TIMIT datasets in a completely unsupervised setting. The number of actions (motions) is used as prior information for all methods on the MUG facial dataset. Detailed descriptions of the datasets, architectures, and hyperparameters are provided in Appendices C, D, and G, respectively.

5.1. QUALITATIVE RESULTS ON DISENTANGLEMENT

We encode two original videos, shown in the first and fourth rows of Fig. 2, and generate the videos in the second and third rows by swapping the corresponding {z^c} and {z^m_{1:T}} between the videos for style transfer. Fig. 2 (left) shows that even when testing on long sequences (trained with T = 100), our R-WAE disentangles content and motion exactly. In Fig. 2 (right), we perform the same swapping on Sprites. We can see that the generated swapped videos have exactly the same appearances and actions as the corresponding original ones. On the MUG dataset, it is interesting to see that we can swap different motions between different persons. We cannot quantitatively verify the motion disentanglement on SM-MNIST.

SM-MNIST and Sprites Datasets

We quantitatively evaluate the disentanglement of our R-WAE(MMD). In Table 2, "S" denotes a simple encoder/decoder architecture, where the encoders in both our model and DS-VAE (Yingzhen & Mandt, 2018) only use 5 layers of convolutional and deconvolutional networks adopted from DS-VAE. "C" denotes a complex encoder/decoder architecture in which we use Ladder networks (Sønderby et al., 2016; Zhao et al., 2017) and ResBlocks (He et al., 2016), described in Appendix E. On SM-MNIST, we obtain the labeled latent codes {z^c} of test videos {x_{1:T}} with T = 10 and randomly sample motion variables {z^m_{1:T}} to obtain labeled new samples. We pretrain a classifier and report its accuracy on these labeled new samples. The accuracy on the SM-MNIST dataset is evaluated on 10000 test samples. On Sprites, the labels of each attribute (skin colors, pants, hair styles, and tops) are available. We obtain the latent codes by fixing one attribute and randomly sampling the other attributes. We train a classifier for each attribute and evaluate the disentanglement of each attribute. The accuracy is computed on 296 × 9 test samples. Both DS-VAE and R-WAE(MMD) have extremely high accuracy (99.94%) when fixing the hair-style attribute, which is not included in Table 2 due to the space limit. As R-WAE(GAN) and R-WAE(MMD) have similar performance on these datasets, we only provide the results and parameters of R-WAE(MMD) to save space. There are two interesting observations in Table 2. First, the simple architecture has better disentanglement than the complex architecture overall. The reason is that the simple architecture has sufficient capacity to extract features and generate clear samples that the pretrained classifiers can recognize; however, it cannot generate high-quality samples when applied to real data. Second, our proposed R-WAE(MMD) achieves better disentanglement than DS-VAE (Yingzhen & Mandt, 2018) with both corresponding architectures.
Since the attributes within the content latent variables are independent, our model can further disentangle these factors. Compared to DS-VAE, these results demonstrate the advantages of R-WAE with its implicit mutual information maximization terms. Due to the space limit, we also include similar comparisons on a new Moving-Shape dataset in Appendix I. As the number of possible motions in SM-MNIST is infinite and random, we cannot evaluate the disentanglement of motions by training a classifier. Instead, we fix the encoded motions {z^m_{1:T}} and randomly sample content variables {z^c}; we also randomly sample a motion sequence {z^m_{1:T}} together with randomly sampled contents {z^c}. We manually check the motions of these samples, and almost all exhibit the same corresponding motion even when the sequence is long (T = 100).

TIMIT Speech Dataset We quantitatively compare our R-WAE with FHVAE and DS-VAE on the speaker verification task under the same setting as (Hsu et al., 2017; Yingzhen & Mandt, 2018). The evaluation metric is the equal error rate (EER), which is explained in detail in Appendix C. A lower EER on z^c, which encodes the timbre of speakers, is better, and a higher EER on z^m, which encodes linguistic content, is better. Table 1 shows that our model disentangles z^c and z^m well: our R-WAE(MMD) has the best EER performance on both the content attribute and the motion attribute. In Appendix H, we show through style transfer experiments that the learned dynamic factor encodes semantic content comparably to DS-VAE.

MUG Facial Dataset We quantitatively evaluate the disentanglement and the quality of generated samples. We train a 3D classifier on the MUG facial dataset with accuracy 95.11% and Inception Score 5.20 on test data (Salimans et al., 2016). We calculate the Inception Score, the intra-entropy H(y|v), where y is the predicted label and v is the generated video, and the inter-entropy H(y) (He et al., 2018).
For a comprehensive quantitative evaluation, the frame-level FID score, introduced by (Heusel et al., 2017), is also provided. From Table 2, our R-WAE(MMD) and R-WAE(GAN) achieve higher accuracy, while DS-VAE (Yingzhen & Mandt, 2018), which does not leverage the number of actions, performs worst; this shows that incorporating the number of actions as prior information does enhance the disentanglement of actions.

5.3. UNCONDITIONAL VIDEO GENERATION

SM-MNIST dataset Fig. 4 in Appendix E provides generated samples on the SM-MNIST dataset, obtained by randomly sampling the content {z^c} from the prior p(z^c) and the motions {z^m_{1:T}} from the learned prior p_ψ(z^m_t | z^m_{<t}). The length of our generated videos is T = 100, and we only show randomly chosen videos of T = 20 to save file size. Our R-WAE(MMD) achieves the most consistent and visually best sequences even when T = 100. Samples from MoCoGAN (Tulyakov et al., 2018) usually change digit identity along the sequence; the reason is that MoCoGAN requires the number of actions to be finite. Our generated Sprites videos also have the best results but are not provided due to the page limit.

MUG Facial Dataset Fig. 3 and Fig. 5 in Appendix E show generated samples on the MUG dataset, obtained by randomly sampling the content {z^c} from the prior p(z^c) and the motions z^R_t = (a, z^m_t) from the categorical prior p(a) (the latent action variable a is a one-hot vector of dimension 6) and the learned prior p_ψ(z^m_t | z^m_{<t}). We show generated videos of length T = 10. DS-VAE (Yingzhen & Mandt, 2018) uses the same structure as ours. Fig. 5 shows that DS-VAE (Yingzhen & Mandt, 2018) and MoCoGAN (Tulyakov et al., 2018) produce blurry beginning frames {x_t} and even blurrier frames as time t evolves, while our R-WAE(GAN) has much better frame quality and more consistent video sequences. For a clear comparison among all three methods, we show the samples at time step T = 10 in Fig. 3, where DS-VAE produces very blurry samples at large time steps.

6. CONCLUSION

In this paper, we propose recurrent Wasserstein Autoencoder (R-WAE) to learn disentangled representations of sequential data based on the optimal transport between distributions with sequential variables. Our theoretical analysis shows that R-WAE simultaneously maximizes the mutual information between input sequential data and different disentangled latent factors. Experiments on a variety of datasets demonstrate that our models achieve state-of-the-art results on the disentanglement of static and dynamic latent representations and unconditional video generation. Future research includes exploring our framework in self-supervised learning and conditional settings for text-to-video and video-to-video synthesis.

APPENDIX FOR RECURRENT WASSERSTEIN AUTOENCODER APPENDIX A: PROOF OF THEOREM 1

In the following, we provide the proof of Theorem 1.

Theorem 1. For P_G defined with deterministic P_G(X|Z) and any function Y = G(Z),

W(P_D, P_G) = inf_{Q: Q_{Z^c} = P_{Z^c}, Q_{Z^m_{1:T}} = P_{Z^m_{1:T}}} Σ_{t=1}^T E_{P_D} E_{Q(Z_t | X_t)}[c(X_t, G(Z_t))],

where Q_{Z_{1:T}} is the marginal distribution of Z_{1:T} when X_{1:T} ∼ P_D and Z_{1:T} ∼ Q(Z_{1:T} | X_{1:T}), and P_{Z_{1:T}} is the prior. Under our assumptions, restricting to the constraint set S yields the upper bound

W(P_D, P_G) ≤ inf_{Q∈S} Σ_t E_{P_D} E_{Q(Z_t | X_t)}[c(X_t, G(Z_t))],   (15)

where the set S = {Q : Q_{Z^c} = P_{Z^c}, Q_{Z^m_1} = P_{Z^m_1}, Q_{Z^m_t | Z^m_{<t}} = P_{Z^m_t | Z^m_{<t}}}.

Proof: Consider the sequential random variables D = X_{1:T} and Y_{1:T}. The optimal transport between the distribution of D = X_{1:T} and the distribution of Y_{1:T} induces a rich class of divergences,

W(P_D, P_G) := inf_{Γ ∈ P(X_{1:T}∼P_D, Y_{1:T}∼P_G)} E_{(X_{1:T}, Y_{1:T})∼Γ}[c(X_{1:T}, Y_{1:T})],

where P(X_{1:T}∼P_D, Y_{1:T}∼P_G) is the set of all joint distributions of (X_{1:T}, Y_{1:T}) with marginals P_D and P_G, respectively. When we choose c(x, y) = ‖x − y‖², we have c(X_{1:T}, Y_{1:T}) = Σ_t ‖X_t − Y_t‖² by linearity. It is then easy to derive the optimal transport cost for distributions over sequential random variables,

W(P_D, P_G) = inf_{Q: Q_{Z_{1:T}} = P_{Z_{1:T}}} Σ_t E_{P_D} E_{Q(Z_t | X_t)}[c(X_t, G(Z_t))].

Based on our assumption, the marginal distribution of the inference model factorizes as Q(Z_1, …, Z_T) = Q(Z^c) Q(Z^m_1, …, Z^m_T) = Q(Z^c) ∏_t Q(Z^m_t | Z^m_{<t}), and the prior distribution factorizes as P(Z_1, …, Z_T) = P(Z^c) P(Z^m_1, …, Z^m_T) = P(Z^c) ∏_t P(Z^m_t | Z^m_{<t}). Since the set S is a subset of {Q : Q_{Z_{1:T}} = P_{Z_{1:T}}}, we obtain the inequality (15). ∎

APPENDIX B: PROOF OF THEOREM 2

In the following, we provide the proof of Theorem 2. To keep the notation readable, we use the density functions of the corresponding distributions. The joint generative distribution is p(x_{1:T}, z_{1:T}) = p_ψ(z_{1:T}) p_θ(x_{1:T} | z_{1:T}), where p_ψ(z_{1:T}) is the prior distribution and p_θ(x_{1:T} | z_{1:T}) is the decoder model. The corresponding joint inference distribution is q_φ(x_{1:T}, z_{1:T}) = p_D(x_{1:T}) q_φ(z_{1:T} | x_{1:T}). If the MI between z_{1:T} and x_{1:T} is defined in terms of the inference model q, we have the following lower bound, derived step by step:

I(z_{1:T}; x_{1:T}) = E_{q(x_{1:T}, z_{1:T})}[log (q_φ(z_{1:T} | x_{1:T}) / q_φ(z_{1:T}))]   (20)
= E_{q(x_{1:T}, z_{1:T})}[D_KL(q_φ(z_{1:T} | x_{1:T}), p(z_{1:T} | x_{1:T})) + log p(z_{1:T} | x_{1:T}) − log q_φ(z_{1:T})]
≥ E_{p_D}[E_{q(z_{1:T} | x_{1:T})}[log p(x_{1:T} | z_{1:T}) + log p(z_{1:T}) − log q_φ(z_{1:T}) − log p(D)]]
≥ Σ_{t=1}^T E_{p_D} E_{q_φ(z_t | x_t)}[log p_θ(x_t | z_t) − log p(D)] − E_{p_D} E_{q_φ(z^c | x_{1:T})}[log q_φ(z^c) − log p(z^c)] − Σ_{t=1}^T E_{p_D} E_{q_φ(z^m_t | x_t)}[log q_φ(z^m_t | z^m_{<t}) − log p(z^m_t | z^m_{<t})],

where we use Bayes' rule p(z_{1:T} | x_{1:T}) = p_θ(x_{1:T} | z_{1:T}) p(z_{1:T}) / p(D).

Maximizing the MI between z_{1:T} and x_{1:T} achieves state-of-the-art results in disentangled latent representation learning by using different regularizers for the static and dynamic latent variables with different priors (Hjelm et al., 2018). In practice, incorporating the mutual information I(z^m_t; x_t) between element x_t and motion z^m_t may further facilitate the disentanglement of the dynamic latent variable z^m_t.

Theorem 3. When the distribution divergence is chosen as the KL divergence, the regularization terms in Eq. (9) jointly minimize the KL divergence between the inference model Q(Z_{1:T} | X_{1:T}) and the prior model P(Z_{1:T}) and maximize the mutual information between X_{1:T} and Z_{1:T},

KL(Q(Z^c) || P(Z^c)) = E_{p_D}[KL(Q(Z^c | X_{1:T}) || P(Z^c))] − I(X_{1:T}; Z^c),
KL(Q(Z^m_t | Z^m_{<t}) || P(Z^m_t | Z^m_{<t})) = E_{p_D}[KL(Q(Z^m_t | Z^m_{<t}, X_{1:T}) || P(Z^m_t | Z^m_{<t}))] − I(X_{1:T}; Z^m_t | Z^m_{<t}).

Proof: Denote X_D = X_{1:T}. As in the proof of Theorem 2, the mutual information between Z_{1:T} and X_{1:T} is defined in terms of the inference model Q, and we use density functions to keep the notation readable. Thus Q(Z_{1:T}) = E_{p_D} q(z_{1:T} | x_{1:T}). By the definition of mutual information,

I(X_{1:T}; Z^c) = E_{p_D} Σ_{z^c} q(z^c | x_{1:T}) log [q(z^c | x_{1:T}) / q(z^c)]
= E_{p_D} Σ_{z^c} q(z^c | x_{1:T}) log [q(z^c | x_{1:T}) / p(z^c)] − E_{p_D} Σ_{z^c} q(z^c | x_{1:T}) log [q(z^c) / p(z^c)]
= E_{p_D} Σ_{z^c} q(z^c | x_{1:T}) log [q(z^c | x_{1:T}) / p(z^c)] − Σ_{z^c} q(z^c) log [q(z^c) / p(z^c)]
= E_{p_D}[KL(Q(Z^c | X_{1:T}) || P(Z^c))] − KL(Q(Z^c) || P(Z^c)),

where the third line uses E_{p_D}[q(z^c | x_{1:T})] = q(z^c). Therefore,

KL(Q(Z^c) || P(Z^c)) = E_{p_D}[KL(Q(Z^c | X_{1:T}) || P(Z^c))] − I(X_{1:T}; Z^c).

The second equality in the theorem is proved similarly. ∎

APPENDIX C: DATASETS

Stochastic Moving MNIST (SM-MNIST) Dataset Stochastic Moving MNIST consists of sequences of frames of size 64 × 64 × 1, each containing one MNIST digit moving and bouncing off the edges of the frame (walls). We use one digit instead of two because two moving digits may collide, which changes the content of the dynamics and is inconsistent with our assumption. Each digit moves with a constant velocity along a trajectory until it hits a wall, at which point it bounces off with a random speed and direction.
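A trajectory with these bounce dynamics can be sketched as follows. The velocity range and digit size below are illustrative assumptions, not the dataset's exact generation parameters.

```python
import numpy as np

def bounce_trajectory(T=20, frame=64, digit=28, seed=0):
    """Sketch of an SM-MNIST-style trajectory: the digit's top-left corner moves
    with constant velocity and, on hitting a wall, bounces off with a random
    speed and reversed direction along the hit axis."""
    rng = np.random.default_rng(seed)
    lim = frame - digit                       # largest valid top-left coordinate
    pos = rng.uniform(0, lim, size=2)
    vel = rng.uniform(-4, 4, size=2)
    traj = []
    for _ in range(T):
        traj.append(pos.copy())
        pos = pos + vel
        for d in range(2):                    # handle each wall independently
            if pos[d] < 0 or pos[d] > lim:
                pos[d] = np.clip(pos[d], 0, lim)
                # new random speed, direction reversed on the hit axis
                vel[d] = -np.copysign(rng.uniform(1, 4), vel[d])
    return np.array(traj)
```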

Sprites Dataset

We follow the same steps as in Yingzhen & Mandt (2018) to process the Sprites dataset, which consists of animated cartoon characters whose clothing, hairstyle, skin color and action can be fully controlled. We use 6 variants in each of 4 attribute categories (skin color, tops, pants and hairstyle), giving 6^4 = 1296 unique characters in total, of which 1000 are used for training and the rest for testing. We use 9 action categories: 3 actions (walking, casting spells and slashing), each with 3 different viewing angles. The resulting dataset consists of video sequences with T = 8 frames of size 64 × 64 × 3.

MUG Facial Dataset

We use the MUG Facial Expression Database (Aifanti et al., 2010) for this experiment. The dataset consists of 86 subjects, and each video consists of 50 to 160 frames. To use the same network architecture for all video datasets in this paper, we cropped the face regions and scaled them to the same size 64 × 64 × 3. We use six facial expressions (anger, fear, disgust, happiness,

The scaled MMD regularizer \mathrm{MMD}_{k_\gamma}(Q(Z^c), \hat{P}(Z^c)) is chosen as

\mathrm{MMD}_{k_\gamma}(Q_{Z^c}, \hat{P}_{Z^c}) = \frac{\mathrm{MMD}_{k_\gamma}(Q_{Z^c}, P_{Z^c})}{1 + 10\,\mathbb{E}_P\big[\|\nabla f_\gamma(z^c)\|_F^2\big]},

where the function f_\gamma(z^c) is the kernel feature map and \mathrm{MMD}_{k_\gamma}(Q_{Z^c}, P_{Z^c}) is defined as follows. Given samples \{\tilde{z}^c_i\}_{i=1}^n from Q(Z^c) and samples \{z^c_i\}_{i=1}^n from P(Z^c),

\mathrm{MMD}_{k_\gamma}(Q(Z^c), P(Z^c)) = \frac{1}{n(n-1)} \sum_{i \neq j} k\big(f_\gamma(z^c_i), f_\gamma(z^c_j)\big) + \frac{1}{n(n-1)} \sum_{i \neq j} k\big(f_\gamma(\tilde{z}^c_i), f_\gamma(\tilde{z}^c_j)\big) - \frac{2}{n^2} \sum_{i,j} k\big(f_\gamma(\tilde{z}^c_i), f_\gamma(z^c_j)\big), \tag{22}

where the RBF kernel k is defined on scalar variables, k(x, y) = \exp(-(x - y)^2 / 2). To avoid the generator getting stuck at a local optimum, we apply spectral parametrization to the weight matrices (Miyato et al., 2018). The feature map f_\gamma is updated for L steps at each iteration. To overcome posterior collapse and inference lagging, we update the inference model for L steps per iteration of updating the decoder model during training (He et al., 2019). See Algorithm 1 for details.
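For concreteness, the unbiased estimator in Eq. (22) can be sketched as below, assuming the scalar features f_γ(z) have already been computed; the kernel matches k(x, y) = exp(-(x - y)²/2), and the sample distributions are illustrative.

```python
import numpy as np

def rbf(a, b):
    """RBF kernel on scalar features, k(x, y) = exp(-(x - y)^2 / 2)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

def mmd_unbiased(f_q, f_p):
    """Unbiased MMD^2 estimate between two equal-size sets of scalar features."""
    n = len(f_q)
    kqq, kpp, kqp = rbf(f_q, f_q), rbf(f_p, f_p), rbf(f_q, f_p)
    off = 1.0 / (n * (n - 1))
    return (off * (kqq.sum() - np.trace(kqq))      # within-Q terms, i != j
            + off * (kpp.sum() - np.trace(kpp))    # within-P terms, i != j
            - 2.0 / n**2 * kqp.sum())              # cross terms

rng = np.random.default_rng(1)
same = mmd_unbiased(rng.normal(size=500), rng.normal(size=500))        # ~0
diff = mmd_unbiased(rng.normal(size=500), rng.normal(3.0, 1.0, size=500))  # clearly positive
assert abs(same) < 0.05 and diff > 0.5
```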

R-WAE(GAN)

For the regularizer D_{JS}(Q_{Z^c}, \hat{P}_{Z^c}), we introduce a discriminator D_\gamma. The loss is

\mathcal{L} = \mathbb{E}_{z^c \sim p(z^c)}[\log D_\gamma(z^c)] + \mathbb{E}_{\tilde{z}^c \sim q(\tilde{z}^c)}[\log(1 - D_\gamma(\tilde{z}^c))], \tag{23}

where p(z^c) is the prior distribution and q(\tilde{z}^c) is the posterior distribution of the inference model. Many techniques have been proposed to stabilize the training of the min-max problem in the GAN-based optimization (23) (Thanh-Tung et al., 2019; Mescheder et al., 2018; Gulrajani et al., 2017; Petzka et al., 2017; Roth et al., 2017; Qi, 2017). Let samples \{z^c\} be drawn from the prior p(z^c) and \{\tilde{z}^c\} from the inference posterior q(\tilde{z}^c). In our R-WAE(GAN), we adopt the regularization from Mescheder et al. (2018) and Thanh-Tung et al. (2019),

\mathcal{L} - \lambda \mathbb{E}\big[\|(\nabla D_\gamma)(\hat{z}^c)\|^2\big],

where \hat{z}^c = \alpha z^c + (1 - \alpha)\tilde{z}^c with \alpha \sim \mathcal{U}(0, 1), and (\nabla D_\gamma)(\hat{z}^c) is the gradient of D_\gamma evaluated at the point \hat{z}^c.
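A minimal numeric sketch of this loss and gradient penalty, using a toy linear-sigmoid discriminator so that the gradient with respect to ẑ^c is available in closed form. The discriminator, dimensions, and λ value are illustrative, not the paper's network.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gan_losses(z_prior, z_post, w, b, lam=10.0, seed=0):
    """Discriminator loss of Eq. (23) plus a gradient penalty at interpolates
    z_hat = alpha*z + (1 - alpha)*z_tilde, for toy D(z) = sigmoid(w.z + b)."""
    d_real = sigmoid(z_prior @ w + b)            # D_gamma on prior samples z^c
    d_fake = sigmoid(z_post @ w + b)             # D_gamma on inferred samples
    loss = np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

    alpha = np.random.default_rng(seed).uniform(size=(len(z_prior), 1))
    z_hat = alpha * z_prior + (1.0 - alpha) * z_post
    s = sigmoid(z_hat @ w + b)
    grad = (s * (1.0 - s))[:, None] * w          # closed-form grad of D at z_hat
    penalty = lam * np.mean(np.sum(grad ** 2, axis=1))
    return loss, penalty

rng = np.random.default_rng(1)
loss, penalty = gan_losses(rng.normal(size=(64, 8)), rng.normal(size=(64, 8)),
                           w=rng.normal(size=8), b=0.0)
```

In practice the gradient at ẑ^c comes from automatic differentiation rather than a closed form.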

Algorithm 1 R-WAE(GAN)

Input: regularization coefficient β and content prior p(z^c)
Goal: learn encoders q_φ(z^c | x_{1:T}) and q_φ(z^m_t | x_t, z^m_{<t}), prior p_ψ(z^m_t | z^m_{<t}), discriminator D_γ, and decoder p_θ(x_t | z_t), where z_t = (z^c, z^m_t)
while not converged do
    for step 1 to L do
        Sample batch X = {x_t}
        Sample {z^c} from prior p(z^c) and {z^m_t} from prior p_ψ
        Sample {z̃^c, z̃^m_t} from encoders q_φ
        Update discriminator D_γ and encoders q_φ with the loss given by (9), (10)
    end for
    Update p_θ and prior p_ψ with the loss given by (9) and (10)
end while

Algorithm 2 R-WAE(MMD)

Input: regularization coefficient β and content prior p(z^c)
Goal: learn encoders q_φ(z^c | x_{1:T}) and q_φ(z^m_t | x_t, z^m_{<t}), prior p_ψ(z^m_t | z^m_{<t}), feature map f_γ, and decoder p_θ(x_t | z_t), where z_t = (z^c, z^m_t)
while not converged do
    for step 1 to L do
        Sample batch X = {x_t}
        Sample {z^c} from prior p(z^c) and {z^m_t} from prior p_ψ
        Sample {z̃^c, z̃^m_t} from encoders q_φ
        Update feature map f_γ and encoders q_φ with the loss given by (9), (11)
    end for
    Update p_θ and prior p_ψ with the loss given by (9) and (11)
end while

APPENDIX E: UNCONDITIONAL VIDEO GENERATION

The reason is that MoCoGAN (Tulyakov et al., 2018) requires the number of actions to be finite. [h_5, h_4, h_3, h_2, h_1, h_0] are concatenated into the latent feature h_t, where h_t is defined in Fig. 1. We use a deconvolutional network adopted from Brock et al. (2019), named "ResBlock up". In (b), the hidden state h_t of an LSTM, defined in Fig. 1, is evenly split into [h_5, h_4, h_3, h_2, h_1, h_0], and the ResBlocks in the decoder network consist of deconvolutional layers adopted from Brock et al. (2019). We use leaky ReLU activations for all ResBlocks. In the inference model, we use an encoder network, defined in Fig. 7(a), to extract the latent feature h_t defined in Fig. 1. We use a decoder network to reconstruct x_t from the hidden state h_t defined in Fig. 1.
For the discriminator D_γ in R-WAE(GAN), we use a 4-layer fully-connected neural network (FC NN) with respective dimensions (256, 256, 128, 1). For the feature map f_γ with a scalar output for the RBF kernel of R-WAE(MMD), we use a 4-layer fully-connected neural network with the same respective dimensions (256, 256, 128, 1). After encoding x_t, we obtain the extracted latent feature h_t. We use Fig. 8(a) and Fig. 8(b) to infer the content variable z^c and the motion variables z^m_t. When the Gumbel latent variable is incorporated into our weakly-supervised inference model, we use Fig. 8(c) to infer the Gumbel latent variable a. The latent content variable z^c and latent motion variable z^m_t are concatenated and passed through an FC NN into an LSTM to output the hidden state h_t for reconstructing x_t with the decoder. For our weakly-supervised model, the latent content variable z^c, latent motion variable z^m_t and latent action variable a are concatenated and passed through an FC NN into an LSTM to output the hidden state h_t for reconstructing x_t with the decoder. We use the Adam optimizer (Kingma & Ba, 2015) with β_1 = 0.5 and β_2 = 0.9.
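Sampling a categorical latent variable a differentiably is typically done with the Gumbel-softmax (concrete) relaxation, which the inference of a relies on. A minimal sketch; the temperature value here is an illustrative assumption.

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Relaxed one-hot sample of a categorical latent via the Gumbel-softmax
    trick: softmax((logits + Gumbel(0,1) noise) / tau)."""
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())                               # stable softmax
    return y / y.sum()

sample = gumbel_softmax(np.log([0.7, 0.2, 0.1]), tau=0.5,
                        rng=np.random.default_rng(0))
assert abs(sample.sum() - 1.0) < 1e-9 and (sample >= 0).all()
```

As tau → 0 the samples approach one-hot vectors; larger tau gives smoother, more uniform samples.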

Methods            actions   content
R-WAE(GAN) (S)     3.73%     2.00%
R-WAE(MMD) (S)     5.83%     2.45%
R-WAE(GAN) (C)     3.13%     3.31%
R-WAE(MMD) (C)     7.72%     3.31%

We only provide the results and parameters of R-WAE(MMD) to save space. At each iteration of training the decoder p_θ(x_t | z_t) and the prior p_ψ(z^m_t | z^m_{<t}), we train the encoder parameters φ and the feature map f_γ of R-WAE(MMD) for L steps. The results on the SM-MNIST and Sprites datasets are evaluated after 500 epochs. On the SM-MNIST dataset, we use a Bernoulli cross-entropy loss and choose L = 5. The penalty coefficients β_1 and β_2 are, respectively, 5 and 20. The learning rate for the decoder model is 5 × 10^-4 and the learning rate for the encoder is 1 × 10^-4. The learning rate for f_γ is 1 × 10^-4. On the Sprites dataset, we use an L2 reconstruction loss and choose L = 5 steps. The penalty coefficients β_1 and β_2 are, respectively, 10 and 60. The learning rate for the decoder model is 3 × 10^-4 and the learning rate for the encoder is 1 × 10^-4. The learning rate for D_γ in R-WAE(GAN) or f_γ in R-WAE(MMD) is 1 × 10^-4. We use a decayed learning rate schedule on both datasets: after 50 epochs, we decrease all learning rates by a factor of 2, and after 80 epochs we decrease them further by a factor of 5. On the TIMIT speech dataset, we use the same encoder and decoder architecture as DS-VAE. The dimension of the hidden states is 256 and the dimensions of z^c and z^m_t are both 16.
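The decayed learning rate schedule described above can be written as a small helper, a direct sketch of the stated factors applied per epoch:

```python
def lr_schedule(base_lr, epoch):
    """Decayed schedule used on SM-MNIST/Sprites: divide by 2 after epoch 50
    and by a further factor of 5 (i.e., 10x total) after epoch 80."""
    if epoch >= 80:
        return base_lr / 10.0
    if epoch >= 50:
        return base_lr / 2.0
    return base_lr

assert lr_schedule(5e-4, 10) == 5e-4    # before any decay
assert lr_schedule(5e-4, 60) == 2.5e-4  # halved after epoch 50
assert lr_schedule(5e-4, 90) == 5e-5    # further factor of 5 after epoch 80
```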



Figure 1: Structures of our proposed sequential probabilistic models. Sequence x_{1:T} is disentangled into a static part z^c and dynamic parts {z^m_t}. (a) A sequence is generated by randomly sampling {z^c, z^m_t} from the priors and concatenating them as input to an LSTM to get the hidden state h_t for the decoder; (b) z^c is inferred from x_{1:T} with an LSTM, and z^m_t is inferred from h_t and z^m_{t-1} with another LSTM; (c) is the same as (a) except that an additional categorical latent variable a is concatenated; (d) A categorical latent variable a is inferred from the dynamic latent codes. The detailed structures of the encoder and decoder are in the supplementary material.

Figure 2: Illustration of disentangling the motions and contents of two videos on the test data of SM-MNIST (T = 100), Sprites (T = 8) and MUG (T = 8). The first and fourth rows are the original videos. The second and third rows are sequences generated by swapping the respective motion variables while keeping the content variables the same (sampled at 4 time steps for illustration).

Figure 3: Unconditional video generation on MUG dataset, where the sample at time step T = 10 is chosen for clear comparison. DS-VAE in (b) is improved by incorporating categorical latent variables. Samples of the video sequence are given in Appendix E.

Fig. 4 provides generated samples on the SM-MNIST dataset, obtained by randomly sampling content {z^c} from the prior p(z^c) and motions {z^m_{1:T}} from the learned prior p_ψ(z^m_t | z^m_{<t}). The length of our

Fig. 5 shows unconditional video generation with T = 10 on MUG facial dataset. DS-VAE in (b) is improved by incorporating categorical latent variables. The figures should be viewed with Adobe Reader to see video.

Figure 5: Unconditional video generation with T = 10 on MUG facial dataset. DS-VAE in (b) is improved by incorporating categorical latent variables. The figures should be viewed with Adobe Reader to see video.

Figure 6: Visualizing 2D manifold of content code {z c } encoded from R-WAE(MMD) on SM-MNIST by t-SNE (Maaten & Hinton, 2008).

Architecture on SM-MNIST, Sprites and TIMIT Datasets We use the same architecture on the SM-MNIST and Sprites datasets, as shown in Fig. 9. The details of the network parameters are also provided in Fig. 9. R-WAE(GAN) and R-WAE(MMD) have similar performance on SM-MNIST and Sprites (see the Sprites results in Table 4).

(a) infer z^c; (b) infer z^m_t; (c) infer a; (d) output h_t for the decoder; (e) output h_t for the weakly-supervised decoder

Figure 8: Network architectures in addition to the encoder/decoder network, with h_t defined in Fig. 7. (a) Network structure to infer the content variable z^c from sequence x_{1:T}; (b) Network structure to infer the motion variable z^m_t; (c) In the inference model, we introduce an additional Gumbel random variable a inferred from the motion sequences {z^m_t}; (d) The content variable z^c and motion variable z^m_t are concatenated as input to an LSTM for the decoder model; (e) In the weakly-supervised inference model, the content variable z^c, motion variable z^m_t and Gumbel random variable a are concatenated as input to an LSTM for the decoder model.

Figure 11: Cross generation of 16 audio clips forms a 17 × 17 matrix. The first column and the first row are spectrum visualizations of the original sequences. The subplot at the (i + 1)-th row and (j + 1)-th column represents the reconstruction from the i-th static factor and the j-th dynamic factor.

Comparison of averaged classification errors. On the Sprites dataset, we fix one encoded attribute and randomly sample the others. On the SM-MNIST dataset, we fix the encoded z^c and randomly sample the motion sequence from the learned prior p_ψ(z^m_t | z^m_{<t}).

the categorical variable best captures the actions, which indicates that our models achieve the best disentanglement. In Table 2, the Inception Score of R-WAE(GAN) is very close to the Inception Score of the real test data, which means our models have the best sample quality. Our proposed

Results of R-WAE(GAN) and R-WAE(MMD) on Sprites dataset.

Acknowledgement

Jun Han thanks Dr. Chen Fang at Tencent for insightful discussions and Prof. Qiang Liu at UT Austin for invaluable support.


sadness, and surprise). To ensure there is sufficient change in the facial expression along a video sequence, we choose every other frame of the original videos to form training and test sequences of length T = 10. 80% of the videos are used for training and 20% for testing.

TIMIT Speech Dataset The TIMIT dataset (Garofolo, 1993) contains broadband 16 kHz recordings of phonetically balanced read speech. A total of 6300 utterances (5.4 hours) are presented, with 10 sentences from each of 630 speakers. The data is preprocessed in the same way as in Yingzhen & Mandt (2018) and Hsu et al. (2017): the raw speech waveforms are first split into sub-sequences of 200 ms, and then preprocessed with a sparse fast Fourier transform to obtain a 200-dimensional log-magnitude spectrum computed every 10 ms, i.e., we use T = 20 for sequence x_{1:T}, and the dimension of x_t is 200. We now explain the evaluation metric used on the TIMIT dataset, the equal error rate (EER). Let w_test be the feature of a test utterance x^test_{1:T} and w_target be the feature of a target utterance x^target_{1:T}. The predicted identity is confirmed if the cosine similarity cos(w_test, w_target) is greater than a threshold, as in Dehak et al. (2010). The equal error rate is the operating point at which the false rejection rate equals the false acceptance rate (Dehak et al., 2010). There are two choices of the feature w_test for evaluating all methods: one used to evaluate the disentanglement of z^c and one used to evaluate the disentanglement of z^m. For more details, please refer to Dehak et al. (2010); Yingzhen & Mandt (2018); Hsu et al. (2017). We use the same network architecture as in Yingzhen & Mandt (2018) for a fair comparison on the speech dataset. As the input dimension of speech is low, the encoder/decoder network is a 2-hidden-layer MLP with hidden dimension 256.
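The EER can be computed from cosine-similarity scores by scanning thresholds and taking the point where the false rejection rate and false acceptance rate are closest. A sketch with synthetic, illustrative score distributions:

```python
import numpy as np

def eer(genuine, impostor):
    """Equal error rate: scan candidate thresholds and return the rate at the
    point where FRR (genuine rejected) is closest to FAR (impostor accepted)."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best, best_gap = 0.5, np.inf
    for t in thresholds:
        frr = np.mean(genuine < t)     # genuine pairs rejected
        far = np.mean(impostor >= t)   # impostor pairs accepted
        if abs(frr - far) < best_gap:
            best, best_gap = (frr + far) / 2.0, abs(frr - far)
    return best

rng = np.random.default_rng(0)
gen = rng.normal(0.8, 0.1, 1000)   # cosine scores, same-speaker pairs
imp = rng.normal(0.2, 0.1, 1000)   # cosine scores, different-speaker pairs
assert eer(gen, imp) < 0.01        # well-separated scores give a tiny EER
```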

APPENDIX D: CHOICES OF REGULARIZERS

In the following, we discuss the choice of regularizers in R-WAE. To keep the notation readable, we use density functions for the corresponding distributions. In both R-WAE(GAN) and R-WAE(MMD), we use the same regularizer for D(q(z^m_t | z^m_{<t}), p(z^m_t | z^m_{<t})). We also add a KL-divergence regularization term on z^m to stabilize training. In the experiments, we assume the inference model q(z^c | x_{1:T}) is a Gaussian distribution with mean µ^c and diagonal covariance matrix σ^c, and the inference model q(z^m_t | x_t, z^m_{<t}) is a Gaussian distribution with mean µ^m and diagonal covariance matrix σ^m. For the prior distribution, we assume p(z^m_t | z^m_{<t}) is a Gaussian distribution with mean µ^ψ_m and diagonal covariance matrix σ^ψ_m. For regularizing the motion variables, we use MMD without introducing any additional parameters, MMD_k(q(z^m_t | z^m_{<t}), p(z^m_t | z^m_{<t})), and we choose a mixture of RBF kernels (Li et al., 2017), where the RBF kernel is defined as k(x, y) = exp(-||x - y||^2 / (2σ^2)). With samples {z̃_i}_{i=1}^n from the posterior q(z̃^c) and samples {z_i}_{i=1}^n from the prior p(z^c), MMD_k(q(z̃^c), p(z^c)) is defined as in Eq. (22). The difference between R-WAE(MMD) and R-WAE(GAN) is the choice of metric for the regularizer D(Q_{Z^c}, \hat{P}_{Z^c}), where P_{Z^c} is the prior distribution and Q_{Z^c} is the posterior distribution of the inference model.
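Since both the inference models and the priors above are Gaussians with diagonal covariance, the KL-divergence regularization term on z^m has a standard closed form, sketched below (variances are the diagonal covariance entries):

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)))."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p
                        - 1.0)

# KL of a distribution with itself is zero
assert abs(kl_diag_gauss(np.zeros(3), np.ones(3),
                         np.zeros(3), np.ones(3))) < 1e-12
# Unit-variance Gaussians shifted by 1 in each of 2 dims: KL = 0.5 * 2 = 1
assert abs(kl_diag_gauss(np.ones(2), np.ones(2),
                         np.zeros(2), np.ones(2)) - 1.0) < 1e-12
```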

R-WAE(MMD)

The regularizer is the scaled MMD given in Eq. (22).

Architecture on MUG Facial Dataset The details of the architecture parameters of the networks for the MUG facial dataset are provided in Fig. 9. The results on the MUG facial dataset are evaluated after 800 epochs. For the regularizer D_KL(q_φ(a | x_{1:T}, z^m_{1:T}), p(a)), we choose the coefficient of this categorical regularizer to be 50. We use an L2 reconstruction loss and choose L = 5 steps. For R-WAE(MMD), the penalty coefficients β_1 and β_2 are, respectively, 10 and 50. For R-WAE(GAN), the penalty coefficients β_1 and β_2 are, respectively, 5 and 60. The learning rate for the decoder model is 5 × 10^-4 and the learning rate for the encoder is 2 × 10^-4. The learning rate for D_γ in R-WAE(GAN) or f_γ in R-WAE(MMD) is 2 × 10^-4. We use the same decayed learning rate schedule as on the SM-MNIST and Sprites datasets. This architecture can also be applied to improve the compression rate (?).

APPENDIX H: ADDITIONAL RESULTS ON AUDIO DATA

Swapping Static and Dynamic Factors on Audio Data Here we present results of swapping the static and dynamic factors of given audio sequences. Results are given in Figure 11. Each heatmap subplot has dimension 80 × 20 and visualizes the spectrum of 200 ms of an audio clip, in which the mel-scale filter bank features are plotted in the frequency domain (the x-axis represents the temporal domain with 20 timesteps and the y-axis the frequency values). We collect these heatmaps in a matrix where the static factors in a row are kept the same and each column shares the same dynamic factor. It can be observed that in each column, the linguistic phonetic contents, as reflected by the formants along the x-axis, are kept almost the same after swapping. Likewise, the timbres are reflected as the harmonics in the spectrum plot; this can be seen by observing that the horizontal light stripes, which represent the harmonics, remain consistent within a row. Moreover, we perform an identity verification experiment as conducted in DS-VAE (Yingzhen & Mandt, 2018). Similar to cross reconstruction, z^c_female and z^c_male (or f_female and f_male in DS-VAE) are swapped for two sequences {x_female} and {x_male}. By an informal listening test of the original-swapped speech sequence pairs, we confirm that the speech content is preserved and the identity is transferred (i.e., a female voice usually has higher frequency).

Table 9: Prediction accuracy on generated video data; the experimental setting is similar to Table 2 in the main text. For predicting the static factor, we fix the static latent representation z^c and randomly sample z^m, and examine whether the static information is preserved in the generated video (if so, the static attributes should be correctly predicted by a pretrained video classifier). For predicting the dynamic factor, we perform the corresponding experiments analogously.

Generation Results on Moving Shapes

We report results on a Moving-Shape dataset in Table 9 and Fig. 12. The Moving-Shape synthetic dataset was introduced in Balaji et al. (2018) and has 5 control parameters: shape type (e.g., triangle and square), size (small and large), color (e.g., white and red), motion type (e.g., zig-zag, straight line and diagonal) and motion direction. In Table 9, the TFGAN (Balaji et al., 2018) encoder and decoder architectures are considered less expressive than the BigGAN (Brock et al., 2019) architectures. Similar to the results in Table 2, with more complex and expressive architectures, learning disentangled representations is harder. The results in Table 9 and Fig. 12 demonstrate that R-WAE achieves better disentanglement and generation performance than DS-VAE both quantitatively and qualitatively. The qualitative difference between fixing z^m and sampling z^c for DS-VAE and R-WAE is not obvious and is thus not shown.

