MULTIFACTOR SEQUENTIAL DISENTANGLEMENT VIA STRUCTURED KOOPMAN AUTOENCODERS

Abstract

Disentangling complex data into its latent factors of variation is a fundamental task in representation learning. Existing work on sequential disentanglement mostly provides two factor representations, i.e., it separates the data into time-varying and time-invariant factors. In contrast, we consider multifactor disentanglement in which multiple (more than two) semantic disentangled components are generated. Key to our approach is a strong inductive bias where we assume that the underlying dynamics can be represented linearly in the latent space. Under this assumption, it becomes natural to exploit the recently introduced Koopman autoencoder models. However, disentangled representations are not guaranteed in Koopman approaches, and thus we propose a novel spectral loss term which leads to structured Koopman matrices and disentanglement. Overall, we propose a simple, easy-to-code deep model that is fully unsupervised and supports multifactor disentanglement. We showcase new disentangling abilities such as swapping individual static factors between characters, and an incremental swap of disentangled factors from the source to the target. Moreover, we evaluate our method extensively on standard two factor benchmark tasks where we significantly improve over competing unsupervised approaches, and we perform competitively in comparison to weakly- and self-supervised state-of-the-art approaches. The code is available at GitHub.

1. INTRODUCTION

Representation learning deals with the study of encoding complex and typically high-dimensional data in a meaningful way for various downstream tasks (Goodfellow et al., 2016). Deciding whether a certain representation is better than others is often task- and domain-dependent. However, disentangling data into its underlying explanatory factors is viewed by many as a fundamental challenge in representation learning that may lead to preferred encodings (Bengio et al., 2013). Recently, several works considered two factor disentanglement of sequential data in which time-varying features and time-invariant features are encoded in two separate subspaces. In this work, we contribute to the latter line of work by proposing a simple and efficient unsupervised deep learning model that performs multifactor disentanglement of sequential data. Namely, our method disentangles sequential data into more than two semantic components. One of the main challenges in disentanglement learning is the limited access to labeled samples, particularly in real-world scenarios. Thus, prior work on sequential disentanglement focused on unsupervised models which uncover the time-varying and time-invariant features with no available labels (Hsu et al., 2017; Li & Mandt, 2018). Specifically, two feature vectors are produced, representing the dynamic and static components in the data, e.g., the motion of a character and its identity, respectively. Subsequent works introduced two factor self-supervised models which incorporate supervisory signals and a mutual information loss (Zhu et al., 2020) or data augmentation and a contrastive penalty (Bai et al., 2021), and thus improve the disentanglement abilities of prior baseline models. Yamada et al. (2020) proposed a probabilistic model with a ladder module, allowing certain multifactor disentanglement capabilities.
Still, to the best of our knowledge, the majority of existing work does not explore the problem of unsupervised multifactor sequential disentanglement.


Figure 1: Our architecture is based on a Koopman autoencoder network which includes an encoder χ_enc, a decoder χ_dec, and a Koopman module that computes the Koopman operator C via a least-squares solve. We augment this model with a novel spectral penalty term L_eig which facilitates the learning of spectrally structured C matrices, thus supporting multifactor disentanglement by construction.

2. RELATED WORK

Dynamics Learning. Over the past few years, increasing interest has been geared toward learning and representing dynamical systems using deep learning techniques. Two factor disentanglement methods based on Kalman filters (Fraccaro et al., 2017) and state-space models (Miladinović et al., 2019) focus on ordinary differential equation systems. Other methods utilize the mutual information between past and future to estimate predictive information (Clark et al., 2019; Bai et al., 2020). Most related to our approach are Koopman autoencoders (Lusch et al., 2018; Yeung et al., 2019; Otto & Rowley, 2019; Li et al., 2019; Azencot et al., 2020; Han et al., 2021b), which are related to classical learning methods, e.g., Azencot et al. (2019); Cohen et al. (2021). Specifically, in (Takeishi et al., 2017; Morton et al., 2018; Iwata & Kawahara, 2020) the Koopman operator is learned via a least-squares solve per batch, allowing a single neural model to be trained on multiple initial conditions. We base our architecture on the latter works, and we augment it with a novel spectral loss term which promotes disentanglement. Recently, an intricate model for video disentanglement was proposed in (Comas et al., 2021). While the authors employ Koopman techniques in that work, it is only partially related to ours since they explicitly model pose and appearance components, whereas our approach can model an arbitrary number of disentangled factors. In addition, their architecture is based on the attention network (Bahdanau et al., 2014), where the Koopman module is mostly related to prediction.
In comparison, in our work the Koopman module is directly responsible for the unsupervised disentanglement of sequential data. Koopman Spectral Analysis. Our method is based on learning Koopman operators with structured spectra. Spectral analysis of Koopman operators is an active research topic (Mezić, 2013; Arbabi & Mezic, 2017; Mezic, 2017; Das & Giannakis, 2019; Naiman & Azencot, 2023). We explore Koopman eigenfunctions associated with the eigenvalue 1. These eigenfunctions are related to global stability (Mauroy & Mezić, 2016) and to orbits of the system (Mauroy & Mezić, 2013; Azencot et al., 2013; 2014). Other attempts focused on computing eigenfunctions for a known spectrum (Mohr & Mezić, 2014). Recently, pruning weights of neural networks using eigenfunctions with eigenvalue 1 was introduced in (Redman et al., 2021). However, to the best of our knowledge, our work is among the few to propose a deep learning model for generating spectrally-structured Koopman operators.

3. KOOPMAN AUTOENCODER MODELS

We recall the Koopman autoencoder (KAE) architecture introduced in (Takeishi et al., 2017) as it is the basis of our model. The KAE model consists of encoder and decoder modules, similarly to standard autoencoders, with a Koopman module in between. The general idea behind this architecture is that the encoder and decoder are responsible for generating effective representations and their reconstructions, driven by the Koopman layer which penalizes nonlinear encodings. We denote by X ∈ R^{b×(t+1)×m} a batch of sequence data {x_ij} ⊂ R^m, where i ∈ {1, . . . , b} and j ∈ {1, . . . , t+1} represent the batch sample and time indices, respectively. The tensor X is encoded to its latent representation Z ∈ R^{b×(t+1)×k} via Z = χ_enc(X). The Koopman layer splits the latent variables into past Z_p and future Z_f observations, and then it finds the best linear map C such that Z_p C ≈ Z_f. Formally, Z_p = (z_ij) ∈ R^{b·t×k} for j ∈ {1, . . . , t} and any i, and Z_f = (z_ij) ∈ R^{b·t×k} for j ∈ {2, . . . , t+1} and any i, i.e., Z_p holds the first t latent variables per sample, and Z_f holds the last t variables. Then, C = arg min_C ‖Z_p C − Z_f‖²_F = Z_p⁺ Z_f, where A⁺ denotes the pseudo-inverse of the matrix A. Importantly, the matrix C is computed per Z during both training and inference; in particular, C is not parameterized by network weights. Additionally, the pseudo-inverse computation supports backpropagation, and thus it can be used during training (Ionescu et al., 2015). Lastly, the latent samples are reconstructed with the decoder, X_rec = χ_dec(Z). The above architecture employs reconstruction and prediction loss terms: the reconstruction loss promotes autoencoder learning, and the prediction loss aims to capture the dynamics in C. We use the notation L_MSE(X, Y) = (1/(b·t)) Σ_{i,j} ‖Y(i, j) − X(i, j)‖²₂ for the average distance between tensors X, Y ∈ R^{b×t×k}, with i ∈ {1, . . . , b} and j ∈ {1, . . . , t}.
Then, the losses are given by L_rec(X_rec, X) = L_MSE(X_rec, X) and L_pred(Ẑ_f, Z_f, X̂_f, X_f) = L_MSE(Ẑ_f, Z_f) + L_MSE(X̂_f, X_f), where Ẑ_f := Z_p C, X̂_f := χ_dec(Ẑ_f), and X_f are the inputs corresponding to the Z_f latent variables. The network loss is taken to be L = λ_rec L_rec + λ_pred L_pred, where λ_rec, λ_pred ∈ R_+ balance between the reconstruction and prediction contributions. We show in Fig. 1 an illustration of the Koopman autoencoder architecture using the notations above.
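The least-squares Koopman step described above can be sketched in a few lines of NumPy. This is a minimal illustration under toy shapes and names of our own, not the authors' implementation; in the actual model, this computation runs inside the network with a differentiable pseudo-inverse:

```python
import numpy as np

def koopman_operator(Z):
    """Compute the approximate Koopman matrix C from a latent batch.

    Z has shape (b, t+1, k). Past/future pairs are stacked over the
    batch, and C is the least-squares solution of Z_p C ~ Z_f,
    i.e. C = pinv(Z_p) @ Z_f.
    """
    b, t1, k = Z.shape
    Z_p = Z[:, :-1, :].reshape(b * (t1 - 1), k)  # first t frames per sample
    Z_f = Z[:, 1:, :].reshape(b * (t1 - 1), k)   # last t frames per sample
    C = np.linalg.pinv(Z_p) @ Z_f                # (k, k)
    return C, Z_p, Z_f

# Sanity check: on latents generated by an exactly linear map A,
# the least-squares fit recovers A, and the latent prediction error vanishes.
rng = np.random.default_rng(0)
k = 4
A = 0.3 * rng.normal(size=(k, k))
Z = np.empty((8, 6, k))
Z[:, 0] = rng.normal(size=(8, k))
for j in range(5):
    Z[:, j + 1] = Z[:, j] @ A
C, Z_p, Z_f = koopman_operator(Z)
pred_err = np.mean((Z_p @ C - Z_f) ** 2)         # analogue of L_pred's latent term
assert pred_err < 1e-12
```

Since the batch stacks b·t snapshot pairs, a single C is shared by all samples in the batch, which is what allows training one model over many initial conditions.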

4. MULTIFACTOR DISENTANGLING KOOPMAN AUTOENCODERS

How can disentanglement be achieved given the Koopman autoencoder architecture? For comparison, other disentanglement approaches typically represent the disentangled factors explicitly. In contrast, the batch dynamics in KAE models is encoded in the approximate Koopman operator matrix C, where C propagates latent variables through time while carrying the static as well as the dynamic information. Thus, the time-varying and time-invariant factors are still entangled in the Koopman matrix. We now show that KAE theoretically enables disentanglement under the following analysis. Koopman Disentanglement. In general, one of the key advantages of Koopman theory and practice is the linearity of the Koopman operator, allowing us to exploit tools from linear analysis. Specifically, our approach depends heavily on the spectral analysis of the Koopman operator (Mezić, 2005). In what follows, we perform our analysis directly on C, and we refer the reader to App. A and the references therein for a detailed treatment of the full Koopman operator. The eigendecomposition of C consists of a set of left eigenvectors {ϕ_i ∈ C^k} and a set of eigenvalues {λ_i ∈ C} such that ϕ_i^T C = λ_i ϕ_i^T, i = 1, . . . , k. The eigenvectors can be viewed as approximate Koopman eigenfunctions, and thus they hold fundamental information related to the underlying dynamics. For instance, the eigenvectors describe the temporal change in latent variables. Formally, z_j^T C = Σ_{i=1}^k ⟨z_j^T, ϕ_i^T⟩ ϕ_i^T C = Σ_i z̃_j^i λ_i ϕ_i^T ≈ z_{j+1}^T, j = 1, . . . , t, where z̃_j^i := ⟨z_j^T, ϕ_i^T⟩ is the projection of z_j^T on the eigenvector ϕ_i^T. The approximation follows from C being the best (and not necessarily exact) linear fit between past and future features. Moreover, it follows that predicting step j + r from j is achieved simply by applying powers of the Koopman matrix to z_j^T, i.e., z_j^T C^r = Σ_i z̃_j^i λ_i^r ϕ_i^T ≈ z_{j+r}^T.
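The spectral prediction identity above is straightforward to verify numerically. The following sketch uses a toy matrix C of our own; the coefficients are taken with respect to the right-eigenvector basis, whose dual rows play the role of the left eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(1)
k, r = 5, 3
C = 0.4 * rng.normal(size=(k, k))
lam, W = np.linalg.eig(C)     # columns of W are right eigenvectors of C
Winv = np.linalg.inv(W)       # rows of Winv act as the (dual) left eigenvectors

z = rng.normal(size=k)
coeff = z @ W                 # projection coefficients of z on the eigenbasis
# Spectral prediction: z^T C^r = sum_i coeff_i * lam_i^r * (row i of Winv)
pred = (coeff * lam**r) @ Winv
assert np.allclose(pred, z @ np.linalg.matrix_power(C, r))
```

Applying powers of C thus reduces to scaling each projection coefficient by λ_i^r, which is the property the disentanglement argument below builds on.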
Our approach to multifactor disentanglement is based on the following key observation: eigenvectors of the matrix C whose eigenvalue is 1 represent time-invariant factors. For instance, assume C has a single eigenvector ϕ_1 with λ_1 = 1 and λ_i ≠ 1 for i ≠ 1; then it follows from Eq. 4 that z_j^T C^r = z̃_j^1 ϕ_1^T + Σ_{i=2}^k z̃_j^i λ_i^r ϕ_i^T. Essentially, the contribution of ϕ_1 is not affected by the dynamics, and thus the first addend remains constant throughout time and is related to static features of the dynamics. In contrast, every element in the sum in Eq. 5 is scaled by its respective λ_i^r, and thus the sum changes throughout time, and these eigenvectors are related to dynamic features. We conclude that the KAE architecture in principle allows disentanglement via the eigendecomposition of the Koopman matrix, where the static factors are eigenvectors with eigenvalue 1, and the rest are dynamic factors. Multifactor Koopman Disentanglement. Unfortunately, the vanilla KAE model is not suitable for disentanglement as the learned Koopman matrices can generally have arbitrary spectra, with multiple static factors or no static components at all. Moreover, KAE does not allow explicit control over the balance between the number of static vs. dynamic factors. To alleviate the shortcomings of KAE, we propose to augment the Koopman autoencoder with a spectral loss term L_eig which explicitly manipulates the structure of the Koopman spectrum and its separation into static and dynamic factors. Formally, L_stat = (1/k_s) Σ_{i=1}^{k_s} |λ_i − (1 + ı0)|², L_dyn = (1/k_d) Σ_{i=1}^{k_d} ξ(|λ_i|, ϵ), L_eig = L_stat + L_dyn, where k_s and k_d represent the number of static and dynamic components, respectively, and thus k = k_s + k_d. The term L_stat measures the average distance of every static eigenvalue from the complex value 1. The role of L_dyn is to encourage separation between the static and dynamic factors.
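The penalty terms can be evaluated as in the following NumPy sketch, with names of our own; in training, these quantities would be computed with a differentiable eigendecomposition rather than NumPy. Here ξ(|λ|, ϵ) denotes the threshold that returns |λ| when |λ| > ϵ and zero otherwise:

```python
import numpy as np

def eig_loss(C, k_s, eps=0.5):
    """Evaluate the spectral penalty L_eig = L_stat + L_dyn (sketch).

    The k_s eigenvalues of largest modulus are treated as static and
    pulled toward 1 + 0i; the remaining k_d = k - k_s dynamic
    eigenvalues are penalized whenever their modulus exceeds eps.
    """
    lam = np.linalg.eigvals(C)
    order = np.argsort(np.abs(lam))                  # ascending modulus
    lam_dyn, lam_stat = lam[order[:-k_s]], lam[order[-k_s:]]
    L_stat = np.mean(np.abs(lam_stat - 1.0) ** 2)
    mod = np.abs(lam_dyn)
    L_dyn = np.mean(np.where(mod > eps, mod, 0.0))   # threshold xi(|lam|, eps)
    return L_stat + L_dyn

# A matrix with one eigenvalue at 1 and a contracting dynamic spectrum
# incurs a (numerically) zero penalty.
C = np.diag([1.0, 0.3, 0.2, -0.1])
assert eig_loss(C, k_s=1, eps=0.5) < 1e-10
```

A spectrum with its largest eigenvalue away from 1 would instead incur a nonzero L_stat, which is exactly the signal that pushes the learned C toward the desired static/dynamic structure.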
In practice, this is achieved with a threshold function ξ which takes the modulus of λ_i and a user parameter ϵ ∈ (0, 1), and returns |λ_i| if |λ_i| > ϵ and zero otherwise. Thus, L_dyn penalizes dynamic factors whose modulus is outside an ϵ-ball. The inset figure shows an example spectrum obtained using our loss penalties, where blue and red denote static and dynamic factors, respectively. Method Summary. Given a batch X ∈ R^{b×t×m}, we feed it to the encoder. Our encoder is similar to the one used in C-DSVAE (Bai et al., 2021), having five convolutional layers followed by a uni-directional LSTM module. The output of the encoder is denoted by Z ∈ R^{b×t×k}, and it is passed to the Koopman module. Then, Z is split into past Z_p and future Z_f observations, allowing us to compute the approximate Koopman operator via C = Z_p⁺ Z_f. In addition, we compute Ẑ_f := Z_p C, which will be used to compute L_pred. After the Koopman module, we apply the decoder, whose structure mimics the encoder but in reverse, having an LSTM component and de-convolutional layers. Additional details on the encoder and decoder are given in Tab. 5. We decode Z to obtain the reconstructed signal X_rec, and we decode Ẑ_f to approximate the future recovered signals X̂_f. The total loss is given by L = λ_rec L_rec + λ_pred L_pred + λ_eig L_eig, where the balance weights λ_rec, λ_pred and λ_eig scale the loss penalty terms; the exact values are given in Tab. 6. To compute L_eig, we identify the static and dynamic subspaces. This is done by simply sorting the eigenvalues based on their modulus and taking the last k_s eigenvectors as static, whereas the remaining k_d are dynamic factors. Identifying multiple factors is more involved and can be done by manual inspection or via an automatic procedure using a pre-trained classifier, see App. B.5. Multifactor Static and Dynamic Swap. Similar to previous methods, our approach allows swapping, e.g., the static factors of two different input samples.
In addition, our framework naturally supports multifactor swaps, as we describe next. For simplicity, we first consider the swap of a single factor (e.g., hair color in Sprites (Reed et al., 2015)) for the given latent codes of two samples, z_j(u) and z_j(v), j = 1, . . . , t+1. Denote by ϕ_1 the eigenvector of the factor we wish to swap; then a single swap is obtained by switching the Koopman projection coefficients of ϕ_1, i.e., ẑ_j(u) = z̃_j^1(v) ϕ_1 + Σ_{i=2}^k z̃_j^i(u) ϕ_i, ẑ_j(v) = z̃_j^1(u) ϕ_1 + Σ_{i=2}^k z̃_j^i(v) ϕ_i, where ẑ_j(u) denotes the new code of z_j(u) using the swapped factor from the v sample, and similarly for ẑ_j(v). If several factors indexed by a set I are to be swapped, then ẑ_j(u) = Σ_{i∈I} z̃_j^i(v) ϕ_i + Σ_{i∈I^c} z̃_j^i(u) ϕ_i, where I^c denotes the complement of I, and ẑ_j(v) is defined symmetrically with the roles of u and v exchanged.
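The coefficient swap can be sketched as follows, using our own minimal setup: a toy matrix C, latent codes z_u, z_v, and an index set I. In practice, I should be closed under complex conjugation so that the swapped codes are real:

```python
import numpy as np

def swap_factors(z_u, z_v, W, Winv, I):
    """Swap the Koopman projection coefficients indexed by I (sketch).

    z_u, z_v are latent codes of shape (k,). Columns of W are the right
    eigenvectors of C and rows of Winv the dual left ones, so the
    projection coefficients are c = z @ W and codes are recovered via
    c @ Winv. The returned codes are complex in general; an I closed
    under complex conjugation keeps them (numerically) real.
    """
    c_u, c_v = z_u @ W, z_v @ W
    c_u_new, c_v_new = c_u.copy(), c_v.copy()
    c_u_new[I], c_v_new[I] = c_v[I], c_u[I]      # exchange the factors in I
    return c_u_new @ Winv, c_v_new @ Winv

rng = np.random.default_rng(2)
k = 6
C = 0.3 * rng.normal(size=(k, k))
_, W = np.linalg.eig(C)
Winv = np.linalg.inv(W)
z_u, z_v = rng.normal(size=k), rng.normal(size=k)
I = [0, 1]
zu_hat, zv_hat = swap_factors(z_u, z_v, W, Winv, I)
# Swapping the same factors twice recovers the original codes.
zu_back, zv_back = swap_factors(zu_hat, zv_hat, W, Winv, I)
assert np.allclose(zu_back, z_u) and np.allclose(zv_back, z_v)
```

The involution check at the end mirrors the symmetry of the swap equations: exchanging the same coefficient set twice is the identity.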

5. RESULTS

We evaluate our model on several two- and multi-factor disentanglement tasks. For every dataset, we train our model, and for evaluation, we additionally train a vanilla classifier on the label sequences. In all experiments, we apply our model on mini-batches, extracting the latent codes Z and the Koopman matrix C. Disentanglement tests use the eigendecomposition of C, where we identify the subspaces corresponding to the dynamic and static factors, denoted by I_dyn and I_stat, respectively. We may label other subspaces, such as I_h, to note that they correspond to, e.g., hair color change in Sprites. To identify the subspace corresponding to a specific factor we use manual or automatic procedures (App. B). Importantly, the subspace dimension of a single factor may be larger than one. We provide further details regarding the network architectures, hyperparameters, datasets, data pre-processing, and a comparison of computational resources in App. B. Additional results are provided in App. C.

5.1. MULTIFACTOR DISENTANGLEMENT

We will demonstrate that our method disentangles sequential data into multiple distinct factors, thus extending the toolbox of competitive sequential disentanglement approaches, which only support two factor disentanglement. Specifically, while prior techniques separate data into static and dynamic factors, we show that our model identifies several semantic static factors, allowing finer control over the factored items for downstream tasks. We perform qualitative and quantitative tasks on the Sprites (Reed et al., 2015) and MUG (Aifanti et al., 2010) datasets to show these advantages. Factorial Swap. This experiment demonstrates that our method is capable of swapping individual content components between sprite characters. We extract a batch of 32 samples, and we identify by manual inspection the subspaces responsible for hair color, skin color, and top color, labeled I_h, I_s, I_t. We select two samples from the test batch, shown as the source and target in Fig. 2. To swap individual static factors between the source and target, we follow Eq. 9. Specifically, we gradually change the static features of the source to those of the target. For example, the top row in Fig. 2 shows the source being modified to have the hair color, followed by the skin color, and then the top color of the target, from left to right. In practice, this is achieved by setting Z̃_h = Z̃_hs = Z̃_hst = Z_src and assigning the corresponding target coefficients, e.g., Z̃_h[:, :, I_h] = Z_tgt[:, I_h]. To quantitatively assess the performance of our approach on the factorial swap task, we consider the following experiment. We iterate over test batches of size 256, and for every batch we automatically identify its hair color and skin color subspaces, I_h, I_s. Then, we draw a random sampling of the batch indices, denoted by J, and separately swap the hair color and the skin color. In practice, this boils down to setting Z̃_h = Z̃_s = Z and assigning Z̃_h[:, :, I_h] = Z[J, :, I_h] and, similarly, Z̃_s[:, :, I_s] = Z[J, :, I_s].
The new latent codes are reconstructed and fed to the pre-trained classifier, and we compare the predicted labels to the true labels of Z[J]. The results are reported in Tab. 1, where we list the accuracy measures for every factor. For most non-swapped factors, we obtain an accuracy score close to a random guess, e.g., the skin accuracy in the hair swap is 16.25%, which is very close to 1/6. Moreover, the swapped factors yield high accuracy scores (marked in bold), validating the successful swap of individual factors. Latent Embedding. We now explore the effect of our model on the latent representation of samples. To this end, we consider a batch X of sprites where the motion, skin and hair colors are arbitrary, and the top and pants colors are fixed, for a total of 324 examples. Following the above experiment, we automatically identify the subspaces responsible for changing the hair and skin color, I_h, I_s. To explore the distribution of the latent code, we visualize the Koopman projection coefficients of the 4-dimensional subspace I_hs = I_h ∪ I_s, given by Z[:, :, I_hs] ∈ C^{324×8×4}. We plot in Fig. 3 the 2D embedding obtained using t-SNE (Van der Maaten & Hinton, 2008). To distinguish between skin and hair labels, we paint the face of every 2D point based on its true hair label, and we paint the point's edge with the true skin color. The plot resembles a grid-like pattern, showing a perfect separation into all 36 unique combinations of (skin, hair) colors. We conclude that the Koopman subspace I_hs indeed disentangles the samples based on either their skin or hair. Incremental Swap. In this test we explore multifactor features of time-varying Koopman subspaces on the MUG dataset. Given a source image u, we gradually modify its dynamic factors to be those of the target v. In practice, we compute Z[u, :, I_q] = Z[v, :, I_q], where I_q ⊂ I_dyn is an index set such that q ∈ {1, 2, 3} and I_1 ⊂ I_2 ⊂ I_3 ⊂ I_dyn.
Specifically, |I_1| = 4, |I_2| = 6, |I_3| = 32. Fig. 4 shows the incremental swap results of two examples changing from disgust to happiness (left), and from happiness to anger (right). The three rows below the source row are the reconstructions of the gradual swap, denoted by X̂(I_q) := χ_dec(Z̃[u, :, :] · Φ), where Z̃[u] holds the projection coefficients after swapping the subspace I_q and Φ stacks the eigenvectors. Our results demonstrate in both cases a non-trivial gradual change from the source expression to the target, as more dynamic features are swapped. For instance, the left source is mapped to a smiling character over all time samples in X̂(I_2), and then it is refined to better match the happiness trajectory of the target in X̂(I_3).

5.2. TWO FACTOR DISENTANGLEMENT OF IMAGE DATA

We perform two factor disentanglement on the Sprites and MUG datasets, and we compare with state-of-the-art methods. Evaluation is performed by fixing the time-varying features of a test batch while randomly sampling its time-invariant features. Then, a pre-trained classifier generates predicted labels for the new samples, comparing them to the true labels. We use metrics such as accuracy (Acc), inception score (IS), intra-entropy H(y|x) and inter-entropy H(y) (Bai et al., 2021). We extract batches of size 256, and we identify their static and dynamic subspaces automatically. In contrast to most existing work, our approach is not based on a variational autoencoder model, and thus the sampling process in our approach is performed differently. Specifically, for every test sequence, we randomly sample static features by generating a new latent code based on random sampling in the convex hull of the batch. That is, we generate random coefficients {α_i} for every sample in the batch such that they form a partition of unity, i.e., α_i ∈ [0, 1] and Σ_i α_i = 1. Then, we swap the static features of the batch with those of the new samples, Z[:, :, I_stat] = Σ_i α_i Z[i, :, I_stat]. We perform 300 epochs of random sampling, and we report the average results in Tab. 2, 3. Notably, our method outperforms previous SOTA methods on the Sprites dataset across all metrics. On the MUG dataset, we achieve competitive accuracy results and better results on the IS and H(y|x) metrics. In comparison to the unsupervised methods MoCoGAN, DSVAE and R-WAE, our results are the best on all metrics.
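The convex-hull sampling step can be sketched as follows, a minimal NumPy illustration with array names of our own (the actual pipeline decodes the modified batch before classification):

```python
import numpy as np

def sample_static(Z, I_stat, rng):
    """Resample static features via a random convex combination (sketch).

    Z is a latent batch of shape (b, t, k); I_stat indexes the static
    coordinates. The coefficients alpha form a partition of unity
    (alpha_i >= 0, sum_i alpha_i = 1), so the new static code is a
    point in the convex hull of the batch's static codes.
    """
    b = Z.shape[0]
    alpha = rng.random(b)
    alpha /= alpha.sum()                                   # partition of unity
    static = np.tensordot(alpha, Z[:, :, I_stat], axes=1)  # shape (t, |I_stat|)
    Z_new = Z.copy()
    Z_new[:, :, I_stat] = static                           # broadcast over the batch
    return Z_new

rng = np.random.default_rng(3)
Z = rng.normal(size=(4, 8, 10))
Z_new = sample_static(Z, np.arange(3), rng)
assert np.allclose(Z_new[:, :, 3:], Z[:, :, 3:])           # dynamic part untouched
assert np.allclose(Z_new[0, :, :3], Z_new[1, :, :3])       # shared static code
```

Because the sampled point lies in the convex hull of observed static codes, the generated static features stay on (or near) the data manifold without requiring a variational prior.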

5.3. TWO FACTOR DISENTANGLEMENT OF AUDIO DATA

We additionally evaluate our model on a different data modality, utilizing a benchmark downstream speaker verification task (Hsu et al., 2017) on the TIMIT dataset (Garofolo et al., 1992). In this task, we aim to distinguish between speakers, independently of the text they read. We compute for each test sample its latent representation Z, and its dynamic and static sub-representations Z_dyn, Z_stat, respectively. In an ideal two factor disentanglement, we expect Z_stat to encode the speaker identity, whereas Z_dyn should be agnostic to this information. To quantify the disentanglement we employ the Equal Error Rate (EER) test. Namely, we compute the cosine similarity between all pairs of latent sub-representations in Z_stat. A pair is assumed to encode the same speaker if its cosine similarity is higher than a threshold ϵ ∈ [0, 1], and to encode different speakers otherwise. The threshold ϵ needs to be calibrated to obtain the EER (Chenafa et al., 2008). If Z_stat indeed holds the speaker identity, then its EER score should be low. The same test is also repeated on Z_dyn, for which we expect high EER scores as it should not contain speaker information. We report the results in Tab. 4. Our method achieves the third best overall EER on the static and dynamic tests. However, S3VAE and C-DSVAE either use significantly more data or self-supervision signals. We label by C-DSVAE* and C-DSVAE† the approach C-DSVAE without content and dynamic augmentation, respectively. When comparing to unsupervised approaches that do not use additional data (FHVAE, DSVAE, and R-WAE), we achieve the best results with margins of 0.27% and 3.37% on the static and dynamic tests, respectively.

Figure 5: Our ablation study shows that the full model with L_eig disentangles data well, whereas models using only the L_stat loss, only the L_dyn loss, or no L_eig loss at all struggle with swapping static features.
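The EER computation described above can be sketched by sweeping the threshold until the two error rates balance. This is a minimal illustration with synthetic scores of our own; production implementations typically interpolate the ROC curve rather than scanning raw scores:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Find the EER by sweeping a threshold over similarity scores (sketch).

    scores holds cosine similarities of all pairs; labels is 1 when the
    pair comes from the same speaker and 0 otherwise. The EER is the
    error rate at the threshold where the false positive rate and the
    false negative rate balance.
    """
    best_gap, eer = np.inf, 1.0
    for th in np.sort(np.unique(scores)):
        pred = scores >= th
        fpr = np.mean(pred[labels == 0])     # impostor pairs accepted
        fnr = np.mean(~pred[labels == 1])    # genuine pairs rejected
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return eer

# Perfectly separated scores yield an EER of zero.
scores = np.array([0.9, 0.85, 0.8, 0.2, 0.15, 0.1])
labels = np.array([1, 1, 1, 0, 0, 0])
assert equal_error_rate(scores, labels) == 0.0
```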

5.4. ABLATION STUDY

We train different models to evaluate the effect of our loss term on the KAE architecture: the full model with L_eig, KAE + L_stat, KAE + L_dyn, and the baseline KAE without L_eig. All other parameters are left fixed. In Fig. 5, we show a qualitative example of static and dynamic swaps between the source and the target. Each of the bottom four rows in the plot is associated with a different model. The full model (L_eig) yields clean disentanglement results on both swaps. In contrast, the static features are not perfectly swapped when removing the dynamic penalty (L_stat). Moreover, the model without the static loss (L_dyn) does not swap the static features at all. Finally, the baseline KAE model generates somewhat random samples. We note that in all cases (even for the KAE model), the motion is swapped relatively well, which can be attributed to the good encoding of the dynamics via the Koopman matrix.

6. DISCUSSION

We have proposed a novel approach for multifactor disentanglement of sequential data, extending existing two factor methods. Our model is based on a strong inductive bias where we assume that the underlying dynamics can be encoded linearly. The latter assumption calls for exploiting recent Koopman autoencoders, which we further enhance with a novel spectral loss term, leading to an effective disentangling model. Through an extensive evaluation, we have shown new sequential disentanglement tasks such as the factorial swap and the incremental swap. In addition, our approach achieves state-of-the-art results on two factor tasks in comparison to baseline unsupervised approaches, and it performs similarly to self-supervised and weakly-supervised techniques. There are multiple directions for future research. First, our approach is complementary to most existing VAE approaches, and thus merging features of our method with variational sampling, mutual information and contrastive losses could be fruitful. Second, theoretical aspects such as disentanglement guarantees could potentially be shown in our framework using Koopman theory.

A KOOPMAN THEORY

We briefly introduce the key ingredients of Koopman theory (Koopman, 1931) which are related to our work. Consider a dynamical system φ : M → M over the domain M, given via the update rule x_{t+1} = φ(x_t), where x_t ∈ M ⊂ R^m and t ∈ N is the time index. Koopman theory proposes an alternative representation of the dynamical system φ by a linear yet infinite-dimensional Koopman operator K_φ. Formally, K_φ f(x_t) = f ∘ φ(x_t), where f : M → C is a complex-valued observable function, and f ∘ φ denotes the composition of transformations. Due to the linearity of K_φ, we can discuss its eigendecomposition, when it exists. Specifically, let λ_j ∈ C, ϕ_j : M → C be a pair of eigenvalue and eigenfunction, respectively, of K_φ, i.e., it holds that K_φ ϕ_j = λ_j ϕ_j for any j. From a theoretical viewpoint, there is no loss of information in representing the dynamics with φ or with K_φ (Eisner et al., 2015). Namely, one can recover the dynamics φ from a given K_φ operator. Moreover, the Hartman–Grobman theorem states that the linearization around hyperbolic fixed points is conjugate to the full, nonlinear system (Wiggins et al., 2003). The latter result was further extended to the entirety of the basin (Lan & Mezić, 2013). In practice, various tools were recently developed to approximate the infinite-dimensional Koopman operator using a finite-dimensional Koopman matrix. In particular, the Dynamic Mode Decomposition (DMD) (Schmid, 2010) is a popular technique for approximating dynamical systems and their modes. DMD was shown to be intimately related to Koopman mode decomposition in (Rowley et al., 2009), which deals with the extraction of Koopman eigenvalues and eigenfunctions in a data-driven setting. Thus, the above discussion establishes the link between our work and Koopman theory since, in practice, our Koopman module is similar in spirit to DMD. Moreover, it justifies our use of the Koopman matrix to encode the dynamics as well as to disentangle it.
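For completeness, a minimal exact-DMD sketch in NumPy follows (a toy example of our own; see Schmid (2010) and Rowley et al. (2009) for the full treatment):

```python
import numpy as np

def dmd(X, Y, rank):
    """Exact DMD (sketch): estimate Koopman eigenvalues and modes from
    snapshot pairs, where the columns of Y are the one-step successors
    of the columns of X."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    A_tilde = U.conj().T @ Y @ Vh.conj().T @ np.diag(1.0 / s)  # reduced operator
    lam, W = np.linalg.eig(A_tilde)                            # DMD eigenvalues
    modes = Y @ Vh.conj().T @ np.diag(1.0 / s) @ W / lam       # exact DMD modes
    return lam, modes

# On data generated by a linear system, DMD recovers its spectrum.
rng = np.random.default_rng(4)
A = np.diag([0.9, 0.5])
X = rng.normal(size=(2, 50))
Y = A @ X
lam, _ = dmd(X, Y, rank=2)
assert np.allclose(np.sort(np.abs(lam)), [0.5, 0.9])
```

Our Koopman module plays the same role as the reduced operator above, except that the snapshots are latent codes produced by the encoder rather than raw states.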
B.1 DATASETS

MUG. Aifanti et al. (2010) share a facial expression dataset which contains image sequences of 52 subjects. Each subject performs six facial expressions: anger, fear, disgust, happiness, sadness and surprise. Each video in the dataset consists of 50 to 160 frames. To create sequences of length 15 as described in previous work (Bai et al., 2021), we randomly sample 15 frames from the original sequence. Then, we crop the faces using Haar Cascades face detection, and we resize to 64 × 64, resulting in sequences x ∈ R^{15×3×64×64} for a total of 3429 samples. Finally, we split the dataset such that 75% of it is used for the train set and 25% for the test set. TIMIT. Garofolo et al. (1992) made TIMIT available; it contains 16kHz audio recordings of American English speakers reading short texts. In total, the dataset has 6300 utterances (5.4 hours) aggregated from 630 speakers reading 10 phonetically rich sentences each. For each batch of samples, the data pre-processing procedure goes as follows: First, we take the maximum raw audio length in the batch, and we zero-pad all samples to match that length. Second, we calculate for each sample its log spectrogram with 201 frequency features, computed with a window of 10ms using the Short Time Fourier Transform (STFT). Thus, each batch has its own number of time steps t, with an average length after padding of t = 450. The resulting sequences are of dimension x ∈ R^{t×201}.

B.2 DISENTANGLEMENT METRICS

Accuracy (Acc) measures how well a model preserves the fixed features while sampling the others. We compute it using a pre-trained classifier C (also called a judge) which is trained on the same train set and tested on the same test set as our model. The classifier outputs the probability measures per feature of the dataset. For instance, for the Sprites dataset, C outputs one label for the pose and additional labels for each of the static factors (hair, skin, top and pants). Inception Score (IS) measures the performance of a generator. The score is calculated by first applying the judge C to every generated sequence x_{1:t}, yielding the conditional predicted label distribution p(y|x_{1:t}). Then, given the marginal predicted label distribution p(y), we compute the Kullback–Leibler (KL) divergence KL(p(y|x_{1:t}) || p(y)). The inception score is given by IS = exp(E_x[KL(p(y|x_{1:t}) || p(y))]). Intra-Entropy H(y|x) measures the conditional predicted label entropy of the generated sequences. To obtain the predicted labels we use the judge C, and we compute (1/b) Σ_{i=1}^b H(p(y|x^i_{1:t})), where b is the number of generated sequences. A lower intra-entropy score reflects higher confidence of the classifier C. Inter-Entropy H(y) measures the marginal predicted label entropy of the generated sequences. We compute H(p(y)) using the judge's output on the predicted labels {y}. A higher inter-entropy score reflects higher diversity among the generated sequences. Equal Error Rate (EER) is used in the speaker verification task on the TIMIT dataset. It is the value of the false positive rate (equivalently, the false negative rate) on the speaker verification task at the threshold where the two rates are equal.
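The inception score can be computed directly from the judge's conditional label distributions, as in this minimal NumPy sketch with array names of our own:

```python
import numpy as np

def inception_score(p_yx):
    """IS = exp(E_x[KL(p(y|x) || p(y))]), computed from an array of
    conditional label distributions p_yx of shape (n, num_classes)."""
    p_y = p_yx.mean(axis=0)                                   # marginal p(y)
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)  # KL per sample
    return np.exp(kl.mean())

# Uniform (maximally unsure) predictions give the minimal score of 1,
# while confident and diverse predictions push the score toward the
# number of classes.
uniform = np.full((4, 3), 1.0 / 3)
assert np.isclose(inception_score(uniform), 1.0)
eps = 1e-6
sharp = np.full((3, 3), eps)
np.fill_diagonal(sharp, 1.0 - 2.0 * eps)
assert inception_score(sharp) > 2.9
```

The distributions are assumed strictly positive (e.g., softmax outputs), so the logarithms are well defined.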

B.3 ARCHITECTURE AND HYPERPARAMETERS

Our models are implemented in the PyTorch (Paszke et al., 2019) framework. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 and no weight decay for all models. Regarding hyper-parameters, in our experiments, k is tuned between 40 and 200, and λ_rec, λ_pred and λ_eig are tuned over {1, 3, 5, 10, 15, 20}. k_s is tuned between 4 and 20, and the ε threshold for the dynamic loss is tuned over {0.4, 0.5, 0.55, 0.6, 0.65}. The hyper-parameters are chosen through a standard grid search.

B.3.1 ENCODER AND DECODER

Sprites and MUG. Our encoder and decoder follow the same general structure as in Bai et al. (2021). First, we have the same convolutional encoder as in C-DSVAE, followed by a uni-directional LSTM. The architecture is described in detail in Tab. 5, where Conv2D and Conv2DT denote a 2D convolution layer and its transpose, and BN2D is a 2D batch normalization layer. Additionally, the hyperparameters are listed in Tab. 6, where b is the batch size, k is the size of the Koopman matrix, h is the dimension of the LSTM hidden state, and #epochs is the number of epochs used for training. The balance weights λ_rec, λ_pred and λ_eig scale the loss penalty terms of the Koopman layer, L_rec, L_pred and L_eig, respectively. Finally, k_s is the number of static factors, and ϵ is the dynamic threshold, see Eqs. 6 and 7 in the main text. TIMIT. We design a neural network related to the DSVAE architecture, but we use a uni-directional LSTM module instead of a bi-directional layer. The encoder LSTM input dimension is 201, which is the spectrogram feature dimension, and its output dimension is k. The decoder LSTM input dimension is k and its output dimension is 201. The hyperparameter values are detailed in Tab. 6.

Table 5, encoder (input: 64 × 64 × 3 image):
Conv2D(3, 32, 4, 2, 1) → BN2D(32) → LeakyReLU
Conv2D(32, 64, 4, 2, 1) → BN2D(64) → LeakyReLU
Conv2D(64, 128, 4, 2, 1) → BN2D(128) → LeakyReLU
Conv2D(128, 256, 4, 2, 1) → BN2D(256) → LeakyReLU
Conv2D(256, k, 4, 2, 1) → BN2D(k) → LeakyReLU
LSTM(k, k)

Table 5, decoder (input: Z):
LSTM(k, h)
Conv2DT(h, 256, 4, 1, 0) → BN2D(256) → LeakyReLU
Conv2DT(256, 128, 4, 1, 0) → BN2D(128) → LeakyReLU
Conv2DT(128, 64, 4, 1, 0) → BN2D(64) → LeakyReLU
Conv2DT(64, 32, 4, 1, 0) → BN2D(32) → LeakyReLU
Conv2DT(32, 3, 4, 1, 0) → Sigmoid
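A minimal PyTorch sketch of the convolutional encoder of Tab. 5 is given below. The reshaping between the last convolution and the LSTM is an assumption on our part (we pool the remaining spatial dimensions to 1 × 1), since the exact flattening is not specified here; the class name is ours.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Per-frame CNN followed by a uni-directional LSTM (cf. Tab. 5)."""
    def __init__(self, k=40):
        super().__init__()
        chans = [3, 32, 64, 128, 256]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, 4, 2, 1),
                       nn.BatchNorm2d(c_out), nn.LeakyReLU()]
        layers += [nn.Conv2d(256, k, 4, 2, 1), nn.BatchNorm2d(k),
                   nn.LeakyReLU(), nn.AdaptiveAvgPool2d(1)]  # pooling: assumption
        self.cnn = nn.Sequential(*layers)
        self.lstm = nn.LSTM(k, k, batch_first=True)

    def forward(self, x):                           # x: (b, t, 3, 64, 64)
        b, t = x.shape[:2]
        f = self.cnn(x.flatten(0, 1)).flatten(1)    # per-frame codes (b*t, k)
        z, _ = self.lstm(f.view(b, t, -1))
        return z                                    # (b, t, k)
```

The five stride-2 convolutions reduce the 64 × 64 frames to 2 × 2 before pooling, and the LSTM then mixes the per-frame codes along time.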

B.3.2 KOOPMAN LAYER

The Koopman layer in our architecture is responsible for computing the Koopman matrix C, and it is associated with the accompanying losses L_rec, L_pred, L_eig. It may happen that the latent codes provided to the Koopman module are very similar, leading to numerically unstable computations. To alleviate this issue, we consider two options. One, apply a blur filter to the image before feeding it to the encoder (used for the Sprites dataset). Two, add small random uniform noise sampled from [0, 1] to the latent code Z, i.e., Z + 0.005N, where N denotes the noise (used on TIMIT). Both options yield more diverse latent encodings, which in turn stabilize the computation of C and the training procedure. Finally, we note that our spectral penalty terms L_stat and L_dyn which compose L_eig are stable over a large regime of hyperparameter ranges.
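The computation of C and the noise stabilization can be sketched as follows: C is the least-squares map from the past latent codes Z_p to the future codes Z_f, obtained via the pseudo-inverse as in the main text (the variable names below are ours).

```python
import numpy as np

def koopman_matrix(Z, noise=0.0, rng=None):
    """Z: (b, t, k) latent codes. Fit C so that z_{i+1} ≈ z_i C.
    Optionally add small uniform noise to stabilize the pseudo-inverse."""
    if noise > 0:
        rng = rng or np.random.default_rng(0)
        Z = Z + noise * rng.uniform(0.0, 1.0, size=Z.shape)
    Zp = Z[:, :-1].reshape(-1, Z.shape[-1])   # past codes, (b*(t-1), k)
    Zf = Z[:, 1:].reshape(-1, Z.shape[-1])    # future codes, shifted by one
    return np.linalg.pinv(Zp) @ Zf            # C ∈ R^{k×k}
```

When the latent codes follow exactly linear dynamics, this least-squares fit recovers the true transition matrix; the small noise term only matters when Z_p is nearly rank deficient, which is precisely the instability discussed above.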

B.3.3 ADDITIONAL DYNAMIC LOSS OPTIONS

The proposed form of L_dyn in Eq. 7 constrains the modulus of the dynamic factors to an ϵ-ball to promote separation between the static factors, located at the point 1 + 0ı, and the dynamic factors. However, there are settings for which L_dyn may not be optimal. For instance, a dataset may contain measure-preserving dynamic factors, e.g., as in the motion of a pendulum. Another example includes growing dynamic factors, e.g., as in a ball moving from the center of the frame towards its boundaries. If one has additional knowledge regarding the underlying dynamics, one may adapt L_dyn accordingly. We consider the following options:

1. Set ϵ = 1 while adding the dynamic loss term to L_eig. In this case, L_dyn penalizes dynamic factors that are inside a δ-ball around the point 1 + 0ı. This option addresses measure-preserving dynamic oscillations in the data.

2. Set ϵ = 1 + η, η > 0, while adding the dynamic loss term to L_eig. In this case, L_dyn penalizes dynamic factors that are inside a δ-ball around the point 1 + 0ı. This option addresses growing dynamic factors.
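A sketch of the eigenvalue penalties, under our reading of Eqs. 6-7: the k_s static eigenvalues (those closest to 1 + 0ı) are pulled toward 1, and the moduli of the remaining dynamic eigenvalues are constrained to the ϵ-ball. The exact functional form (squared distance for the static term, hinge for the dynamic term) is an assumption for illustration.

```python
import numpy as np

def spectral_loss(C, k_s, eps=0.5):
    """Hedged sketch of L_eig = L_stat + L_dyn. C: (k, k) Koopman matrix,
    k_s: number of static factors, eps: dynamic threshold (Eq. 7)."""
    lam = np.linalg.eigvals(C)
    order = np.argsort(np.abs(lam - 1.0))    # eigenvalues closest to 1 first
    lam = lam[order]
    stat, dyn = lam[:k_s], lam[k_s:]
    l_stat = np.sum(np.abs(stat - 1.0) ** 2)            # pull statics to 1+0i
    l_dyn = np.sum(np.maximum(np.abs(dyn) - eps, 0.0))  # dynamics into eps-ball
    return l_stat + l_dyn
```

With this form, a matrix whose spectrum is already separated (static eigenvalues at 1, dynamic moduli below ϵ) incurs zero penalty.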

B.4 DISENTANGLEMENT PROCESS USING MULTIFACTOR DISENTANGLING KOOPMAN AUTOENCODERS

In what follows, we detail the process of extracting the multifactor latent representation of a sample, and in addition, we demonstrate a general swap of a factor between two arbitrary samples. Let X ∈ R^{b×t×m} be our input batch and let x ∈ R^m be a single sample whose multifactor disentangled latent representation we wish to compute. The disentanglement process of x into its multiple latent factor representations contains the following steps:

1. We feed X to the model encoder and obtain the encoder output Z ∈ R^{b×t×k}.

2. We compute the Koopman matrix C for the batch X using the Koopman layer, as described in the main text.

3. We compute the eigendecomposition of C to get the eigenvectors matrix V ∈ C^{k×k}, and we calculate U = V^{-1}. We then compute z̃^T := z^T V for every z ∈ R^k. The vector z̃ stores the coefficients in the Koopman space, and these coefficients are the disentangled latent representation in our method. Notice that z^T = z^T V U = z̃^T U.

4. We identify the indices that correspond to each latent factor; several indices may represent one factor. We use the identification method of subspaces described in Appendix B.5.

To conclude, these four steps describe the process of disentangling arbitrary factors in our setup. To demonstrate a swap, let us assume we use the Sprites dataset. Let x_1, x_2 be two samples in X, and assume we want to swap their hair and skin attributes. We use steps 1, 2 and 3 to extract the multifactor latent representations z̃_1, z̃_2 of x_1, x_2. Then, using Step 4, we identify and extract I_K = I_s ∪ I_h, where I_h denotes the hair indices and I_s the skin indices. Now, we want to swap the latent representations of the hair and skin factors between the samples. To do so, we simply perform z̃_1[I_K] = z̃_2[I_K] and, in parallel, z̃_2[I_K] = z̃_1[I_K]. To get back to the pixel space, we repeat our steps backward. First, we compute the new z_i after the swap.
We do this using the matrix U computed in step 3, i.e., z_1^T = z̃_1^T U and z_2^T = z̃_2^T U. Finally, we feed Re(z_1), Re(z_2) as inputs to the model decoder and obtain the desired swapped samples x̃_1, x̃_2. As a final note, if z is an input to the model decoder then z must be real-valued; however, z̃^T U is typically complex-valued since V, U ∈ C^{k×k}. Thus, we keep the real part of z, and we eliminate its imaginary component. In what follows, we show that Re(z^T V U) = z^T, and thus feeding the real part to the decoder as mentioned above is well justified. Moreover, a similar argument holds for swapped latent vectors. Finally, we validated that standard numerical packages such as NumPy and PyTorch satisfy this property up to machine precision.

Theorem 1. If C ∈ R^{k×k} is full rank, then Re(z^T V U) = z^T for any z ∈ R^k, where V is the matrix of eigenvectors of C, and U = V^{-1}.

Proof. It follows that z^T V U = Σ_{j=1}^{k} ⟨z, v_j⟩ u_j, where v_j is the j-th column of V, and u_j is the j-th row of U. To prove that Im(z^T V U) = 0, it is sufficient to show that if v_1 and v_2 are a complex conjugate pair of columns of V, i.e., v_1 = v̄_2, then ⟨z, v_1⟩ u_1 is the complex conjugate of ⟨z, v_2⟩ u_2. First, we have that a_1 = ⟨z, v_1⟩ = Σ_i z[i] v_1[i] = Σ_i z[i] v̄_2[i] = conj(⟨z, v_2⟩) = ā_2, where the third equality holds since v_1 = v̄_2, and the last equality holds since z is real-valued. The proof is complete if we show that u_1 = ū_2, since then ⟨z, v_1⟩ u_1 = conj(⟨z, v_2⟩ u_2). To verify that complex conjugate column pairs of V transform to complex conjugate row pairs of U, we assume w.l.o.g. that the columns of V are organized such that nearby columns are complex conjugates, i.e., v_1 = v̄_2, v_3 = v̄_4, and so on. Let P be the permutation matrix that exchanges the columns of V with their complex conjugates, i.e., it switches the i-th column with the (i + 1)-th column, where i is odd. Then V P = V̄. It follows that Ū = (V P)^{-1} = P^{-1} V^{-1} = P^T U, namely, the i-th row of U is the complex conjugate of the (i + 1)-th row, where i is odd.

B.5 IDENTIFICATION OF SUBSPACES

There are two scenarios in which we need to identify semantic Koopman subspaces in the eigenvectors of the Koopman matrix C: (1) separating static from dynamic information (two factor separation); (2) identifying individual factors, e.g., the hair color in Sprites (multifactor separation).

Two factor separation. To distinguish between time-invariant and time-varying factors, we sort the eigenvalues based on their distance from the complex value 1 + ı0. Then, the subspace of static features I_stat is defined by the eigenvalue indices of the first k_s elements in the sorted array, and the dynamic features subspace I_dyn holds the remaining indices, i.e., I_dyn = I \ I_stat, where I is the set of all indices, and S_1 \ S_2 denotes the set difference of the sets S_1 and S_2.

Multifactor separation. The identification of individual features such as the hair color or skin color in Sprites is less straightforward, unfortunately. Essentially, the key difficulty lies in that the Koopman matrix may encode an individual factor using a subspace whose dimension is unknown a priori. In addition, the subspace related to, e.g., hair color may depend on the particular batch sample. For instance, we observed cases where the hair color subspace was of dimension 1, 2 and 3 for three different batches. Nevertheless, manual inspection of I_stat typically reveals the role of the eigenfunctions, and it can be performed efficiently since k_s ≤ 15 in our experiments. Still, we opt for an automatic approach, and thus we propose the following simple procedure. We consider the power set of I_stat, denoted by P(I_stat). Let J be an element of P(I_stat); then we swap the content of the batch with respect to J, and check the accuracy of the factor in question (e.g., hair color) using the pre-trained classifier. The subspace J which corresponds to a single factor change is the one for which the accuracy of the factor decreases the most with respect to the original samples. In practice, we noticed that often the subspace of a factor is composed of subsequent eigenvectors in the sorting described for the two factor separation. Thus, many subsets J of the power set P(I_stat) can be ignored. We leave further exploration of this aspect for future work.
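The automatic search can be sketched as follows. Here `judge_accuracy` is a hypothetical stand-in for the pre-trained classifier evaluated on the batch after swapping subset J, and, following the observation above, we restrict the search to contiguous subsets of the sorted static indices by default.

```python
from itertools import combinations

def candidate_subsets(I_stat, contiguous_only=True, max_size=3):
    """Enumerate candidate subsets J of the static index set I_stat."""
    if contiguous_only:  # consecutive eigenvectors in the sorted order
        return [tuple(I_stat[i:j]) for i in range(len(I_stat))
                for j in range(i + 1, min(i + 1 + max_size, len(I_stat) + 1))]
    subs = []
    for r in range(1, max_size + 1):
        subs += list(combinations(I_stat, r))
    return subs

def identify_factor(I_stat, judge_accuracy):
    """Pick the subset whose swap lowers the factor accuracy the most.
    judge_accuracy(J) -> accuracy after swapping subset J (hypothetical)."""
    return min(candidate_subsets(I_stat), key=judge_accuracy)
```

The subset that minimizes the post-swap accuracy is returned as the subspace encoding the factor in question.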

B.6 SPEAKER VERIFICATION EXPERIMENT DETAILS

The speaker verification task in Sec. 5.3 is performed as follows. We use the test set of TIMIT which contains 24 unique speakers, with eight different sentences per speaker; in total there are 192 audio samples. We compute the latent representation Z for this data and its Koopman matrix C. Using the eigendecomposition of C, we identify the static and dynamic subspaces I_stat and I_dyn. We denote by Z_stat, Z_dyn the latent codes obtained when projecting Z onto I_stat, I_dyn, respectively. Formally, this is computed via Z_stat = Z · Φ^{-1}[:, I_stat] · Φ[I_stat], and similarly for the dynamic features. To perform the speaker verification task, we calculate the identity representation codes for the batch, given by Ẑ_stat = (1/t) Σ_{j=1}^{t} Z_stat[:, j, :] and Ẑ_dyn = (1/t) Σ_{j=1}^{t} Z_dyn[:, j, :], where Ẑ_stat, Ẑ_dyn ∈ R^{192×165}. The EER calculations are performed separately for Ẑ_stat and Ẑ_dyn over all of their (192 choose 2) = 18336 pair combinations.
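The EER computation can be sketched as a simple threshold sweep over pair-similarity scores; the variable names are ours, and the scoring of pairs (e.g., by cosine similarity of the identity codes) is assumed.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """genuine: similarity scores of same-speaker pairs,
    impostor: scores of different-speaker pairs. Sweep thresholds and
    return the rate where false positives and false negatives balance."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best, eer = np.inf, 1.0
    for th in thresholds:
        fpr = np.mean(impostor >= th)  # impostor pairs accepted
        fnr = np.mean(genuine < th)    # genuine pairs rejected
        if abs(fpr - fnr) < best:      # closest point to fpr == fnr
            best, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return eer
```

Perfectly separated score distributions yield an EER of 0, while fully overlapping ones yield 0.5; a low EER on Ẑ_stat together with a high EER on Ẑ_dyn indicates that speaker identity resides in the static subspace.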

C.1 MEAN AND STANDARD DEVIATION MEASURES

We report the mean and standard deviation computed over 300 runs for the results reported in Tabs. 1, 2, 3 in the main text. The results are detailed in Tabs. 7, 8. The low standard deviation highlights the robustness of our method across random seeds, and the overall stability of our trained models.

C.2 DATA GENERATION

We present qualitative results of our model's unconditional generation capabilities. To this end, we randomly sample static and dynamic features by producing a new latent code based on a random

Additionally, we extend the result in Fig. 2 to show an example in which we swap all multifactor combinations. Specifically, we show in Fig. 12 several multifactor swaps from the source sequence (top row) to the target sequence (bottom row). The text to the left of every sequence in between denotes the swapped factor(s). For instance, the second row with the text p shows how the pants color of the target is swapped to the source character. Similarly, the row with the text s+t+h is related to the swap of the skin, top, and hair colors. In each incremental step, more static features change towards the target samples. Specifically, the skin color, hair color and density, ears structure, nose structure, cheeks structure, cheeks texture, lips, and other physical characteristics change gradually to better match the physical appearance of the target. Additionally, we observe that the expression of the source is not altered during the transformation, highlighting the disentanglement capabilities of our approach.

typically similar to prior work (Hsu et al., 2017; Li & Mandt, 2018; Zhu et al., 2020; Bai et al., 2021), and thus we focus our analysis on the Koopman layer and the loss function. The dominant operation in the Koopman layer in terms of complexity is the computation of the pseudo-inverse of Z_p (please see Section 3). Computing the pseudo-inverse of a matrix is implemented in high-level deep learning frameworks such as PyTorch via the SVD. The textbook complexity of the SVD is O(min(mn^2, m^2 n)) for an m × n matrix. In addition, computing the loss function involves an eigendecomposition. The theoretic complexity of eigendecomposition is equivalent to that of matrix multiplication, which in our case is O(k^2.376), where the Koopman operator is of size k × k.
In comparison, the matrices Z_p for which we compute the pseudo-inverse are of size (b · t) × k, and typically k < b · t. Thus, the pseudo-inverse operation governs the complexity of the algorithm. The development of efficient SVD algorithms for the GPU is an ongoing research topic in itself. As far as we know, there is some parallelization in PyTorch's SVD computation, mainly affecting the decomposition of large matrices. The Koopman matrices we use are typically small (e.g., 100 × 100), and thus the effective computation time is short.



Figure 2: In the factorial swap experiment we modify individual static factors of the source character to match those of the target. The top row shows the gradual change of the hair, skin, and top colors.

Z̃_hs[:, I_hs] = Z̃_tgt[:, I_hs], and Z̃_hst[:, I_hst] = Z̃_tgt[:, I_hst], where Z̃_src, Z̃_tgt ∈ C^{8×40} are the Koopman projection values of the source and target, respectively. The set I_hs := I_h ∪ I_s, and similarly for I_hst. The tensor Z̃_h represents the new character obtained by borrowing the hair color of the target, and similarly for Z̃_hs and Z̃_hst. In total, we demonstrate in Fig. 2 the changes: h→s→t (top), h→t→s (middle), and s→h→t (bottom). We additionally show in Fig. 12 an example of individual swaps including all possible combinations. Our results display good multifactor separation and transfer of individual static factors between different characters.

Figure 3: We show the t-SNE plot of the 4D Koopman static subspace which encodes the skin and hair colors. The embedding perfectly clusters all (skin, hair) color combinations.

Figure 4: Our method allows swapping the dynamic features incrementally, thus achieving a relatively smooth transition between the source and target expressions.

Sprites. Reed et al. (2015) introduced a dataset of animated cartoon characters. Each character is composed of static and dynamic attributes. The static attributes include the colors of the skin, tops, pants and hair; each has six possible variants. The dynamic attributes include three different motions: walking, casting spells and slashing, where each motion admits three different orientations: left, right, and forward. In total, there are nine motions a character can perform and 1296 unique characters. A sequence is composed of eight RGB image frames of size 64 × 64. We use 9000 samples for training and 2664 samples for testing. MUG. Aifanti et al. (

We use the identification method of Appendix B.5 to extract the indices set. Let I_k be some latent factor index set. Then, the latent representation of factor I_k for the input x is z̃[I_k]. For instance, I_k can be the hair color factor. If we want to take a group of factors, we can aggregate several factors together, e.g., I = I_s ∪ I_t ∪ I_h ∪ I_p, where I_s denotes the skin indices, I_t the top indices, I_h the hair indices, and I_p the pants indices. In practice, I encodes a character identity in the Sprites dataset.

Σ_{j∈J} α_j = 1, where J denotes the sample indices, and |J| = 2 is the number of samples in the combination. Then, we replace the static or dynamic features of the source (src) sample with the convex combination, Z[src, :, I_stat] = Σ_{j∈J} α_j Z[j, :, I_stat] or Z[src, :, I_dyn] = Σ_{j∈J} α_j Z[j, :, I_dyn], respectively. The reconstructions of the latent codes for which static or dynamic factors are swapped are shown on the right panels in Figs. 6, 7, 13, 14, respectively. Our results on both the Sprites and MUG datasets demonstrate a non-trivial generation of factors while preserving the dynamic/static factors shown on the left panels.

C.3 TWO FACTOR AND MULTIFACTOR SWAPS

We present several qualitative results of two factor swapping between static and dynamic factors of two given samples. In Figs. 8 and 9, each odd indexed row i ∈ {1, 3, 5, 7} shows the source sequence on the left and the target sequence on the right. Even indexed rows j ∈ {2, 4, 6, 8} represent the reconstructed samples after the swap, where on the left we show the static swap, and on the right the dynamic swap. Notably, all examples show clean swaps while preserving non-swapped features.
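The convex-combination sampling for unconditional generation can be sketched as follows. Shapes follow the notation above ((b, t, k) latent codes); the index sets and function name are placeholders for illustration.

```python
import numpy as np

def mix_static(Z, src, others, I_stat, rng=None):
    """Replace the static features of sample `src` with a convex
    combination of the static features of the samples in `others`."""
    rng = rng or np.random.default_rng(0)
    alpha = rng.uniform(size=len(others))
    alpha = alpha / alpha.sum()  # partition of unity: sum_j alpha_j = 1
    Z = Z.copy()
    Z[src][:, I_stat] = sum(a * Z[j][:, I_stat]
                            for a, j in zip(alpha, others))
    return Z
```

The dynamic indices of the source are left untouched, so the sampled character keeps the original motion; swapping the roles of I_stat and I_dyn gives the dual experiment in which identities are preserved and motions are sampled.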

Figure 6: Unconditional generation of Sprite characters. The left panel shows the source sequences, and the right panel demonstrates the sampled characters where time-varying features are preserved.

Figure 7: Unconditional generation for the MUG dataset. The left panel shows the source sequences, and the right panel demonstrates the sampled identities where time-varying features are preserved.

Figure 8: Several static and dynamic swap results on the Sprites dataset.

Figure 11: The Koopman matrix spectrum of different models.

Figure 12: Multifactor swap of individual static factors and their combinations on the Sprites dataset.

Figure 13: Unconditional generation of Sprite characters where the static factors are kept fixed, and the dynamic features are randomly sampled.

Figure 14: Unconditional generation of MUG images where the static factors are kept fixed, and the dynamic features are randomly sampled.

Accuracy measures of factorial swap experiments.

Disentanglement metrics on Sprites.

Disentanglement metrics on TIMIT.

Architecture details.

Hyperparameter details. The columns of the table are: Dataset, b, k, h, #epochs, λ_rec, λ_pred, λ_eig, k_s, ϵ.

Accuracy measures of factorial swap experiments, see Tab. 1.
(row label truncated): …35% ± 0.65% | 17.40% ± 0.79% | 17.07% ± 0.77% | 36.29% ± 0.88% | 90.20% ± 0.52%
skin swap: 11.35% ± 0.65% | 72.72% ± 0.68% | 17.23% ± 0.89% | 31.22% ± 0.84% | 16.92% ± 0.77%

Disentanglement metrics on Sprites and MUG, see Tabs. 2 and 3.

sampling in the convex hull of two randomly chosen samples from the batch. That is, for every sample in the batch we generate random coefficients {α_j ∈ [0, 1]} which form a partition of unity

Computational resources comparison.

ACKNOWLEDGEMENTS

This research was partially supported by the Lynn and William Frankel Center of the Computer Science Department, Ben-Gurion University of the Negev, an ISF grant 668/21, an ISF equipment grant, and by the Israeli Council for Higher Education (CHE) via the Data Science Research Center, Ben-Gurion University of the Negev, Israel.

C.5 KOOPMAN MATRIX SPECTRUM ABLATION STUDY

We explore the impact of our spectral loss on the spectrum and the eigenvalue scattering of the Koopman matrix C. To this end, we train four different models: the full model with L_eig, KAE + L_stat, KAE + L_dyn, and a baseline KAE without L_eig. We show in Fig. 11 the obtained spectra for the various models, where eigenvalues associated with static factors are marked in blue, and the dynamic components are highlighted in red. Our model shows a clear separation between the static and dynamic factors, allowing us to easily disentangle the data in practice. In contrast, the KAE and KAE + L_stat models yield spectra in which the static and dynamic components are very close to each other, leading to challenging disentanglement. Finally, the KAE + L_dyn model shows separation in its spectrum; however, some of the static factors drift away from the eigenvalue 1.

C.6 COMPUTATIONAL RESOURCES COMPARISON

We compare our method in terms of network memory footprint and the amount of data used for the Sprites dataset. We show in Tab. 9 the comparison of our method with respect to the other methods. All other approaches use significantly more parameters than our method, which uses 2 million weights. In addition, S3VAE and C-DSVAE utilize additional information during training. S3VAE exploits supervisory signals to an unknown extent, as the details do not appear in the paper and the code is proprietary. C-DSVAE uses data augmentation of size sixteen times the train set; that is, for the content augmentation they generated eight times more train data, and the same amount for the motion augmentation. In comparison, our method and DSVAE do not use any additional data on top of the train set.

The time complexity analysis of our method is governed by the complexities of the encoder, decoder, the Koopman layer and the loss function. The encoder and decoder can be chosen freely and are

