MULTIFACTOR SEQUENTIAL DISENTANGLEMENT VIA STRUCTURED KOOPMAN AUTOENCODERS

Abstract

Disentangling complex data into its latent factors of variation is a fundamental task in representation learning. Existing work on sequential disentanglement mostly provides two-factor representations, i.e., it separates the data into time-varying and time-invariant factors. In contrast, we consider multifactor disentanglement, in which multiple (more than two) semantic disentangled components are generated. Key to our approach is a strong inductive bias: we assume that the underlying dynamics can be represented linearly in the latent space. Under this assumption, it becomes natural to exploit the recently introduced Koopman autoencoder models. However, disentangled representations are not guaranteed in Koopman approaches, and thus we propose a novel spectral loss term which leads to structured Koopman matrices and disentanglement. Overall, we propose a simple and easy-to-code new deep model that is fully unsupervised and supports multifactor disentanglement. We showcase new disentangling abilities, such as swapping individual static factors between characters and incrementally swapping disentangled factors from a source to a target. Moreover, we evaluate our method extensively on standard two-factor benchmark tasks, where we significantly improve over competing unsupervised approaches and perform competitively in comparison to weakly- and self-supervised state-of-the-art approaches. The code is available on GitHub.

1. INTRODUCTION

Representation learning deals with the study of encoding complex and typically high-dimensional data in a meaningful way for various downstream tasks (Goodfellow et al., 2016). Deciding whether a certain representation is better than others is often task- and domain-dependent. However, disentangling data into its underlying explanatory factors is viewed by many as a fundamental challenge in representation learning that may lead to preferred encodings (Bengio et al., 2013). Recently, several works considered two-factor disentanglement of sequential data, in which time-varying features and time-invariant features are encoded in two separate subspaces. In this work, we contribute to the latter line of work by proposing a simple and efficient unsupervised deep learning model that performs multifactor disentanglement of sequential data. Namely, our method disentangles sequential data into more than two semantic components. One of the main challenges in disentanglement learning is the limited access to labeled samples, particularly in real-world scenarios. Thus, prior work on sequential disentanglement focused on unsupervised models which uncover the time-varying and time-invariant features with no available labels (Hsu et al., 2017; Li & Mandt, 2018). Specifically, two feature vectors are produced, representing the dynamic and static components in the data, e.g., the motion of a character and its identity, respectively. Subsequent works introduced two-factor self-supervised models which incorporate supervisory signals and a mutual information loss (Zhu et al., 2020) or data augmentation and a contrastive penalty (Bai et al., 2021), and thus improve the disentanglement abilities of prior baseline models. Yamada et al. (2020) proposed a probabilistic model with a ladder module, allowing certain multifactor disentanglement capabilities.
Still, to the best of our knowledge, the majority of existing work does not explore the problem of unsupervised multifactor sequential disentanglement. In the case of static images, multiple disentanglement approaches have been proposed (Kulkarni et al., 2015; Higgins et al., 2017; Kim & Mnih, 2018; Chen et al., 2018; 2016; Burgess et al., 2018; Kumar et al., 2017; Bouchacourt et al., 2018). In addition, several approaches support disentanglement of an image into multiple distinct factors. For instance, Li et al. (2020) design an architecture which learns the shape, pose, texture, and background of natural images, allowing new images to be generated from combinations of disentangled factors. In (Xiang et al., 2021), the authors introduce a weakly-supervised framework where N factors can be disentangled, given N − 1 labels. In comparison, our approach is fully unsupervised, deals with sequential data, and the number of distinct components is determined by a hyperparameter. Recently, Locatello et al. (2019) showed that unsupervised disentanglement is impossible without inductive biases on models and datasets. While exploiting the underlying temporal structure has been shown to be a strong inductive bias in existing disentanglement approaches, we argue in this work that a stronger assumption should be considered. Specifically, based on Koopman theory (Koopman, 1931) and practice (Budišić et al., 2012; Brunton et al., 2021), we assume that there exists a learnable representation in which the dynamics of input sequences becomes linear. Namely, the temporal change between subsequent latent feature vectors can be encoded with a matrix that approximates the Koopman operator. Indeed, the same assumption was shown to be effective in challenging scenarios such as fluid flows (Rowley et al., 2009), as well as in other application domains (Rustamov et al., 2013; Kutz et al., 2016). However, it has been barely explored in the context of disentangled representations.
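To make the linearity assumption concrete: given a latent sequence z_1, …, z_T, a matrix approximating the Koopman operator can be fit by linear least squares, as in standard dynamic mode decomposition. The sketch below is illustrative only (the function name and setup are ours, not the authors' procedure):

```python
import numpy as np

def fit_koopman_operator(Z):
    """Least-squares Koopman approximation: find K with z_{t+1} ~= K z_t.

    Z : (T, k) array whose rows are latent feature vectors over time.
    Returns the (k, k) minimizer of ||Z[1:] - Z[:-1] K^T||_F, as in
    standard dynamic mode decomposition (DMD).
    """
    Z_past, Z_future = Z[:-1], Z[1:]
    # Solve the linear system Z_past @ K^T ~= Z_future in the least-squares sense.
    K_T, *_ = np.linalg.lstsq(Z_past, Z_future, rcond=None)
    return K_T.T

# Sanity check: when the latent dynamics really is linear, K is recovered.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
K_true = 0.9 * Q                       # well-conditioned, stable dynamics
Z = [rng.standard_normal(4)]
for _ in range(12):
    Z.append(K_true @ Z[-1])           # z_{t+1} = K z_t
Z = np.asarray(Z)
K_est = fit_koopman_operator(Z)
```

In a Koopman autoencoder, the encoder is trained jointly so that its latent sequences admit such a linear fit with small residual.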
In this paper, we design an autoencoder network (Hinton & Zemel, 1993) that is similar to previous Koopman methods (Takeishi et al., 2017; Morton et al., 2018), and which facilitates the learning of linear temporal representations. However, while the dynamics is encoded in a Koopman operator, disentanglement is not guaranteed. To promote disentanglement, we make the following key observation: eigenvectors of the approximate Koopman operator represent time-invariant and time-varying factors. Motivated by this understanding, we propose a novel spectral penalty term which splits the operator's spectrum into separate and clearly-defined sets of static and dynamic eigenvectors. Importantly, our framework naturally supports multifactor disentanglement: every eigenvector represents a unique disentangled factor, and it is considered static or dynamic based on its eigenvalue.

Contributions. Our main contributions can be summarized as follows.

1. We introduce a strong inductive bias for disentanglement tasks, namely, that the dynamics of input sequences can be encapsulated in a matrix. This assumption is backed by the rich Koopman theory and practice.

2. We propose a new unsupervised Koopman autoencoder learning model with a novel spectral penalty on the eigenvalues of the Koopman operator. Our approach allows straightforward multifactor disentanglement via the eigendecomposition of the Koopman operator.

3. We extensively evaluate our method on new multifactor disentanglement tasks and on several two-factor benchmark tasks, and we compare our work to state-of-the-art unsupervised and weakly-supervised techniques. The results show that our approach outperforms baseline methods in various quantitative metrics and in terms of computational resources.
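To illustrate the kind of spectral structure such a penalty encourages, the toy sketch below (our own NumPy code, not the authors' implementation; the exact penalty form and the `margin` hyperparameter are assumptions) pulls a chosen number of eigenvalues toward 1, whose eigenvectors then act as static factors, while pushing the remaining "dynamic" eigenvalues away from 1:

```python
import numpy as np

def spectral_penalty(K, n_static, margin=0.3):
    """Illustrative spectral loss on the eigenvalues of a Koopman matrix K.

    The n_static eigenvalues closest to 1 are treated as static factors and
    pulled toward lambda = 1 (a factor multiplied by 1 at every step is
    time-invariant); the remaining dynamic eigenvalues are pushed at least
    `margin` away from 1. A differentiable variant (e.g., via
    torch.linalg.eigvals) would be used during training.
    """
    dist = np.sort(np.abs(np.linalg.eigvals(K) - 1.0))  # distance to 1 in C
    static_term = dist[:n_static].sum()                 # zero when static eigs equal 1
    dynamic_term = np.maximum(0.0, margin - dist[n_static:]).sum()
    return static_term + dynamic_term

# A well-structured spectrum (two eigenvalues at 1, two far from 1) incurs
# no penalty, whereas the identity matrix, where every factor would look
# static, is penalized.
K_structured = np.diag([1.0, 1.0, 0.5, 0.4])
K_unstructured = np.eye(4)
```

Once the spectrum is split this way, projecting latent codes onto individual eigenvectors yields per-factor coordinates that can be manipulated, e.g., swapped between sequences.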

2. RELATED WORK

Sequential Disentanglement. Most existing work on sequential disentanglement is based on the dynamical variational autoencoder (VAE) architecture (Girin et al., 2020). Initial attempts focused on probabilistic models that separate between static and dynamic factors, where in (Hsu et al., 2017) the joint distribution is conditioned on the mean, and in (Li & Mandt, 2018) conditioning is defined on past features. Subsequent works proposed self-supervised approaches that depend on auxiliary tasks and supervisory signals (Zhu et al., 2020), or on additional data and contrastive penalty terms (Bai et al., 2021). In Han et al. (2021a), the authors replace the common Kullback-Leibler divergence with the Wasserstein distance between distributions. Some approaches tailored to video disentanglement use generative adversarial network (GAN) architectures (Villegas et al., 2017; Tulyakov et al., 2018) and a recurrent model with an adversarial loss (Denton & Birodkar, 2017). Finally, Yamada et al. (2020) proposed a variational autoencoder model including a ladder module (Zhao et al., 2017), which allows multiple factors to be disentangled. The authors demonstrated qualitative results of multifactor latent traversal between two static features and three dynamic features on the Sprites dataset.

