LEARNING DISENTANGLEMENT IN AUTOENCODERS THROUGH EULER ENCODING

Abstract

Noting the importance of factorizing (or disentangling) the latent space, we propose a novel, non-probabilistic disentangling framework for autoencoders, based on the principles of symmetry transformations that are independent of one another. To the best of our knowledge, this is the first deterministic model that aims to achieve disentanglement with autoencoders, without pairs of images or labels, by explicitly introducing inductive biases into the model architecture through Euler encoding. The proposed model is then compared with a number of state-of-the-art models relevant to disentanglement, including symmetry-based models and generative models based on autoencoders. Our evaluation using six different disentanglement metrics, including the unsupervised disentanglement metric we propose in this paper, shows that the proposed model can offer better disentanglement, especially when the variances of the features differ, where other methods may struggle. We believe that this model opens several opportunities for linear disentangled representation learning based on deterministic autoencoders.

1. INTRODUCTION

Learning generalizable representations of data is one of the fundamental aspects of modern machine learning (Rudin et al., 2022). In fact, better representations are more than a luxury now, and are key to achieving generalization, interpretability, and robustness of machine learning models (Bengio et al., 2013; Brakel & Bengio, 2017; Spurek et al., 2020). One of the primary and desired characteristics of a learned representation is factorizability or disentanglement, so that the latent representation can be composed of multiple, independent generative factors of variation. The disentanglement process renders the latent space features independent of one another, providing a basis for a set of novel applications, including scene rendering, interpretability, and unsupervised deep learning (Eslami et al., 2018; Iten et al., 2020; Higgins et al., 2021). Deep generative models, particularly those built on variational autoencoders (VAEs) (Kingma & Welling, 2013; Kumar et al., 2017; Higgins et al., 2017; Tolstikhin et al., 2018; Burgess et al., 2018; Chen et al., 2018; Kim & Mnih, 2018; Zhao et al., 2019), have been shown to be effective in learning factored representations. Although these approaches have advanced disentangled representation learning by regularizing the latent spaces, there are a number of issues that limit their full potential: (a) VAE-based models consist of two loss components, and balancing these loss components is a well-known issue (Asperti & Trentin, 2020); (b) it is almost impossible to honor the idealized notion of having a known prior distribution for VAEs in practical settings (Takahashi et al., 2019; Asperti & Trentin, 2020; Zhang et al., 2020; Aneja et al., 2021); and (c) factorizing the aggregated posterior in the latent space does not guarantee corresponding uncorrelated representations (Locatello et al., 2019).
An alternative approach for achieving disentangled representations is to seek irreducible representations of symmetry groups (Cohen & Welling, 2014; Higgins et al., 2018; Painter et al., 2020; Tonnaer et al., 2022), where the aim is to find latent space transformations that are independent of one another, underpinned by well-defined mathematical frameworks based on group theory. As this group of methods exploits the notion of transitions between samples, they require pairs of images representing the transitions (Cohen & Welling, 2014; Painter et al., 2020) or equivalent labels (Tonnaer et al., 2022). Regardless of the approach, as shown in Locatello et al. (2019), it is fundamentally impossible to learn disentangled representations without inductive biases on either the model or the dataset, and both VAE- and symmetry-based approaches exemplify implicitly embedding inductive bias (Higgins et al., 2018).

[Figure 1: The first, second and third rows show the latent spaces learned from three datasets, namely, the XY, 2D Arrow, and 3D Airplane datasets, respectively. Columns correspond to the different models stated at the bottom of each column. It can be seen that the proposed model, DAE, achieves the best disentanglement.]

Despite these advances, we note a number of issues in how the existing approaches address disentanglement. Firstly, the majority of the VAE-based approaches are probabilistic, and as such, the quality of the disentanglement depends on ideal or near-ideal priors, and on the process of learning the correct posteriors for the given data. Secondly, the majority of symmetry-based disentangling approaches need pairs of images or labels, even in the unsupervised setting, owing to the requirements around inductive bias. Thirdly, none of these models conform to the formal definition of a linear disentangled representation proposed by Higgins et al. (2018).
Finally, and most importantly, none of the existing approaches introduce the inductive biases required for disentanglement on both the model and the dataset in an unsupervised manner, essentially demanding labels or image pairs. Motivated by these shortcomings, in this paper we propose a novel approach to disentangled representation learning, with the following key contributions. We:
• propose a totally unsupervised approach for introducing inductive bias into the model and data, without requiring pairs of images or labels,
• propose a non-probabilistic approach that does not involve any priors or learning of posteriors,
• ensure that our approach conforms to the formal definition of a linearly disentangled representation as defined in Higgins et al. (2018),
• propose a new unsupervised metric, namely, the Grid Fitting Score (GF-Score), to quantify disentanglement, echoing the aspiration of an ideal disentanglement measure outlined in Higgins et al. (2018), and
• demonstrate the implementation of a formally defined disentanglement using autoencoders.
As such, the proposed approach, which we name Disentangling Auto-Encoder (DAE), offers a theoretically sound framework for learning independent multi-dimensional vector subspaces, and hence for learning disentangled representations. To the best of our knowledge, this is the first attempt to actually implement a disentanglement approach using deterministic autoencoders, especially without pairs of images or labels, and hence in a truly unsupervised manner. In Figure 1, we provide a glimpse into the capability of the proposed model for disentanglement using three datasets, compared against ten other models, which are either autoencoder-based probabilistic models or symmetry-based disentangled models that do not require any labels or pairs of inputs. The rest of this paper is organized as follows. In Section 2 we review the related work, focusing on VAE-based and symmetry-based approaches.
This is then followed by a derivation of an AE-based, non-probabilistic approach for deriving disentangled representations in Section 3. In Section 4, we perform a detailed evaluation of the overall performance of the proposed model, using ten baseline models, six datasets, and six disentanglement metrics, and discuss our findings. We then conclude the paper in Section 5 with directions for further research. Given the space constraints, we highlight the prominent results in the main part of the paper, while providing the remaining results and relevant material as part of the Appendix.

2. RELATED WORK

2.1. DISENTANGLEMENT

Disentangled representation learning focuses on learning a set of independent factors containing useful but minimal information for a given task, such that their variations are orthogonal to each other while accounting for the entire dataset (Bengio et al., 2013; Higgins et al., 2018). This essentially entails a method, or a set of methods, for decoupling correlations between latent variables. A large body of work around disentanglement, and the ideal properties of a disentangled representation, can be found in Ridgeway (2016); Eastwood & Williams (2018); Ridgeway & Mozer (2018); Zaidi et al. (2020). Among the desirable properties of a disentangled representation, modularity, compactness and explicitness are three critically important ones. A number of metrics have been proposed in the literature to quantify these properties (Higgins et al., 2017; Kim & Mnih, 2018; Eastwood & Williams, 2018; Chen et al., 2018; Do & Tran, 2019; Sepliarskaia et al., 2019). In our work, we use the notions outlined in Zaidi et al. (2020), where the metrics are divided into three classes, namely, intervention-based, predictor-based, and information-based metrics. These metrics are all used in a supervised manner and, in addition to quantifying the three properties outlined above, can serve as indicators of the robustness of the representation to noise, and of the non-linearity of the relationships between learnt representations and ground-truth factors.

2.2. AUTOENCODER-BASED PROBABILISTIC MODELS

AE-based probabilistic generative models are realized by replacing the conventional encoder E_ϕ and decoder D_θ with probabilistic counterparts. The probabilistic encoder, denoted by q_ϕ(z|x), is used to approximate the intractable true posterior, and the probabilistic decoder, denoted by p_θ(x|z), is used to reconstruct x from z (Kingma & Welling, 2013). The majority of the previous work on disentangled representation learning is based on probabilistic models, particularly building on VAEs. They enforce regularization in the latent space that regularizes either the approximate posterior q_ϕ(z|x) or the aggregate posterior q(z) = (1/N) Σ_{i=1}^{N} q_ϕ(z|x^(i)), as summarized in Tschannen et al. (2018). The overall objective of the majority of the VAE-based methods can be expressed as L_recon(ϕ, θ) + L_reg(ϕ), where L_reg(ϕ) is a regularizer of the concerned generative model. A carefully designed regularizer should enable the model to achieve better disentanglement, either by controlling the capacity of the latent space, or by measuring the total correlation between latent variables. However, it is worth noting that factorizing the aggregated posterior using regularizers does not guarantee linear disentangled representations (Locatello et al., 2019). We summarize the regularization terms of seven state-of-the-art generative models in Appendix A (see Columns 2 and 3 of Table 3).
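As an illustration of the generic objective L_recon(ϕ, θ) + L_reg(ϕ), the sketch below instantiates it with a β-weighted KL regularizer, as in β-VAE (one of the models summarized in Appendix A). The function name, the squared-error reconstruction term, and the use of NumPy rather than an autodiff framework are illustrative choices, not the paper's implementation.

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, logvar, beta=4.0):
    """Illustrative L_recon + beta * L_reg objective for a diagonal-Gaussian
    encoder q(z|x) = N(mu, exp(logvar)) against the prior N(0, I)."""
    recon = np.sum((x - x_hat) ** 2)                        # reconstruction term
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))  # KL(q(z|x) || N(0, I))
    return recon + beta * kl

# Perfect reconstruction and a posterior equal to the prior give zero loss:
loss = beta_vae_loss(np.zeros(4), np.zeros(4), np.zeros(2), np.zeros(2))
```

Setting β = 1 recovers the vanilla VAE objective; the other regularizers in Table 3 replace or augment the KL term.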

2.3. SYMMETRY-BASED DISENTANGLING MODEL

While Higgins et al. (2018) proposed a formal definition of linear disentangled representations, it was generic, in that no specific architecture, model or technique was defined. As such, it does not provide an actual mechanism for learning such disentangled representations, albeit providing a formal definition, which is essential for this purpose. From the definitions in Higgins et al. (2018), a symmetry group can be decomposed as a product of multiple subgroups, if suitable subgroups can be identified. This can provide an intuitive method for disentangling the latent space, if subgroups that independently act on subspaces of a latent space can be found. If the actions applied by each of the subgroups affect only the corresponding subspace, these actions are called disentangled group actions. In other words, disentangled group actions only change a specific property of the state of an object, and leave the other properties invariant. If there is a transformation in a vector space of representations corresponding to a disentangled group action, the representation is called a disentangled representation. The concept and implementation of symmetry-based disentangled representations were first proposed, using pairs of images, in Cohen & Welling (2014). However, owing to the limitation around commutative Lie groups, upon which this model is built, the real-world applicability of the technique from Cohen & Welling (2014), especially across a range of diverse datasets, is limited. Following the formal definition of linear disentangled representations in Higgins et al. (2018), there has been a considerable amount of effort to learn the transitions between images (Caselles-Dupré et al., 2019; Quessard et al., 2020; Painter et al., 2020). These approaches learn the transitions by decomposing each transition into a sequence of base transitions, relying on pairs of images and on additional networks.
An alternative approach is to rely on labels, for example, as in Tonnaer et al. (2022), where they propose two Diffusion VAE-based methods (Rey et al., 2019), one semi-supervised and one unsupervised, along with a new metric called LSBD (Linear Symmetry-Based Disentanglement metric). The former model relies on labels, while the latter does not. As such, the latter model is directly relevant to our work, and we use it as one of the baselines for our evaluation (see Section 4).

3. FRAMEWORK FOR DAE

The deterministic, and hence non-probabilistic, approach we propose here builds on the autoencoder architecture (rather than variational autoencoders). We provide the relevant background on disentangled representations from Higgins et al. (2018) in Appendix A.2. In this section, we define the necessary mathematical framework and a corresponding neural network architecture implementing the proposed disentangling autoencoder.

3.1. ASSOCIATION BETWEEN THE DISENTANGLED REPRESENTATION AND AUTOENCODER

The definition of disentangled group actions from Higgins et al. (2018) assumes that a group G decomposes as a direct product G = G_1 × ⋯ × G_n. To relax this condition, we consider a group G generated by S = {s_1, s_2, ..., s_n} subject to a set R of relations among the elements of S. Let W be a set of world-states and suppose we have a group action • : G × W → W. Then, we say that the action is disentangled by the relations R if there is a decomposition W = W_1 × ⋯ × W_n and actions •_i : ⟨s_i⟩ × W_i → W_i, i ∈ {1, ..., n}. (1)

With the definition of an equivariant map in place (A.2), disentangling a latent space relies on finding a corresponding group action • : G × Z → Z so that the symmetry structure of W is reflected in an agent's representations, Z. This can be achieved if the following condition is satisfied:

g • f(w) = f(g • w), ∀g ∈ G, w ∈ W, (2)

where f : W → Z is a mapping from world-states to an agent's representations. However, in general, one cannot control the nature of the generative process b : W → O leading from world-states to observations, O. In addition, without loss of generality, we can assume that the generative process b is an equivariant map.

Theorem 3.1. Suppose a generative process b is an equivariant map satisfying

g • b(w) = b(g • w), ∀g ∈ G, w ∈ W. (3)

Then, there exists a function f that satisfies (2) if an inference process h : O → Z is an equivariant map satisfying g • h(o) = h(g • o), ∀g ∈ G, o ∈ O.

Proof. See Appendix A.3.

Following Theorem 3.1, the goal of disentangling is the same as finding an inference process h : O → Z satisfying

g • h(o) = h(g • o), ∀g ∈ G, o ∈ O. (4)
Although there is no guarantee that one can find a compatible action • : G × Z → Z satisfying (4), if h is bijective then (4) can be expressed as follows:

g • z = h(g • h⁻¹(z)). (5)

However, if h is required to be a bijective function, simple neural network-based models cannot learn the overall equivariant map. Yet, an equivariant map such as the one outlined in equation (5) can be learned by autoencoders with inductive biases both on the model and on the datasets, which is the central contribution of this paper. To show this mapping, let h and h⁻¹ be an encoder, E_ϕ, and a decoder, D_θ, of an autoencoder. Then, the group action • : G × Z → Z can be defined via the composition

G × Z → G × O → O → Z, i.e., g • z = E_ϕ(g •_O D_θ(z)),

where id_G × D_θ maps (g, z) to (g, D_θ(z)) and •_O denotes the group action on the observation space. This shows that the equivariant map can, indeed, be learned by an autoencoder. However, this is not without a number of challenges, which we discuss in Section 3.2 below.
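The induced action g • z = h(g •_O h⁻¹(z)) of equation (5) can be sanity-checked numerically with a toy bijective "encoder"; here h, its inverse, and the translation action on observations are all illustrative stand-ins for the actual networks, not the paper's model.

```python
import numpy as np

def h(o):            # toy bijective encoder O -> Z
    return 2.0 * o + 1.0

def h_inv(z):        # its exact inverse, playing the role of the decoder
    return (z - 1.0) / 2.0

def act_on_obs(g, o):     # group action on the observation space: translation by g
    return o + g

def act_on_latent(g, z):  # induced latent action: g . z = h(g . h^{-1}(z))
    return h(act_on_obs(g, h_inv(z)))

z = np.array([0.5, -1.0])
g = 0.3
# For this particular h, the induced latent action works out to z -> z + 2g,
# so the decode-act-encode composition is a well-defined action on Z:
assert np.allclose(act_on_latent(g, z), z + 2 * g)
```

With an autoencoder, E_ϕ and D_θ replace h and h⁻¹, and equality only holds approximately, to the extent that the reconstruction is faithful.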

3.2. INTRODUCING DISENTANGLED REPRESENTATIONS INTO AUTOENCODERS

In deriving a disentangled representation, it is worth noting that vector addition, a basic and natural transformation in the latent space, enables natural transitions between latent variables, and that the nth roots of unity form a cyclic group. We achieve a group action disentangled by the relations R in the latent space by mapping (s_1^{ϵ_1}, ..., s_n^{ϵ_n}) to (e^{iα_1 ϵ_1}, ..., e^{iα_n ϵ_n}) and W_i to Z_i, for some α_i ∈ (0, 1). However, complex numbers are undesirable in machine learning, and so this is achieved by introducing an Euler encoding, E on R^n, into the AE architecture. We define E as follows:

E(z) = (cos(2πz_1), sin(2πz_1), cos(2πz_2), sin(2πz_2), ..., cos(2πz_n), sin(2πz_n)) (6)

where n is the number of dimensions of the latent space.

Theorem 3.2. Let E be an Euler encoding and A : R^{2n} → R^m be an injective linear transformation, where m > 2n. For α ∈ (0, 1) and i ∈ {1, ..., n}, define T_i^α : R^n → R^n by T_i^α(x) = (x_1, ..., x_i + α, ..., x_n). Then A ∘ E(T_i^α(z)) = A ∘ E(T_j^β(z)) if and only if i = j and α = β.

Proof. For z ∈ R^n, let A ∘ E(T_i^α(z)) − A ∘ E(T_j^β(z)) = 0 and define the block-diagonal matrix

S_i^α = diag(I_{2(i−1)}, R(2πα), I_{2(n−i)}),

where R(θ) = [cos θ, −sin θ; sin θ, cos θ] is the 2×2 rotation by θ. Since E(T_i^α(z)) = S_i^α · E(z), we have A · (S_i^α − S_j^β) · E(z) = 0. (a) If i ≠ j, then A is a zero transformation, which is a contradiction. (b) If i = j, then α = β ± k, where k ∈ Z; since α, β ∈ (0, 1), α = β.

Since E(T_i^α(z)) = S_i^α · E(z) and S_i^α is an orthogonal transformation, the Euler encoding after a translation on Z can be considered an orthogonal transformation of E(z), which enables the changes of output arising from changes in different latent dimensions to be orthogonal. Nevertheless, there are still a number of practical challenges to overcome. These are: (a) Number of Elements in a Subgroup: The number of possible elements in the subgroups N_j (j = 1, ..., n), or at least the relative ratio of the number of elements between the subgroups, is not known a priori.
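The Euler encoding of equation (6), and the identity E(T_i^α(z)) = S_i^α · E(z) used in the proof of Theorem 3.2, can be checked numerically. This is a minimal sketch (with 0-indexed dimensions), not the paper's implementation:

```python
import numpy as np

def euler_encode(z):
    """Euler encoding E : R^n -> R^{2n} of equation (6)."""
    out = np.empty(2 * len(z))
    out[0::2] = np.cos(2 * np.pi * z)   # even slots: cos(2*pi*z_i)
    out[1::2] = np.sin(2 * np.pi * z)   # odd slots:  sin(2*pi*z_i)
    return out

def rotation_block(n, i, alpha):
    """S_i^alpha: identity except a 2x2 rotation by 2*pi*alpha in the i-th pair."""
    S = np.eye(2 * n)
    c, s = np.cos(2 * np.pi * alpha), np.sin(2 * np.pi * alpha)
    S[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return S

def translate(z, i, alpha):
    """T_i^alpha: add alpha to the i-th coordinate of z."""
    out = z.copy()
    out[i] += alpha
    return out

z = np.array([0.2, 0.7, 0.1])
i, alpha = 1, 0.35
lhs = euler_encode(translate(z, i, alpha))
rhs = rotation_block(len(z), i, alpha) @ euler_encode(z)
assert np.allclose(lhs, rhs)   # E(T_i^a(z)) == S_i^a E(z)
```

The identity follows directly from the angle-addition formulas for sine and cosine, which is why a latent translation acts as an orthogonal rotation on the encoded vector.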
This has a crucial role in introducing inductive biases on datasets. (b) Robustness to Small Perturbations: Because the proposed approach to disentanglement is deterministic, the model is not resilient to small perturbations (e.g., noise) (Camuto et al., 2021), which is essential for the model to behave in a robust manner when presented with unseen examples. (c) Spatial Distribution of Features: An ideal factorized latent space must have the features spatially distributed in an equally likely manner. However, the equivariant map we discussed above may not, alone, be sufficient to address this issue. Although it is possible to address some of these concerns from a theoretical standpoint, nearly all of them are addressable by carefully designing an architecture that exploits both the AE and the equivariant-map principle discussed above, achieving the best possible disentanglement. We discuss these details in Section 3.3 below.

3.3. ARCHITECTURE OF THE DAE

In mapping our theory to an architecture, we build on the AE model, which constitutes an encoder that maps the observation space O to a latent space Z, followed by the disentangling process that factorizes/disentangles the latent space Z into Z′, and finally the decoding layer, which maps the factorized latent space Z′ back to the regenerated observation space O. Each of the concerns discussed in Section 3.2, (a) through (c), is handled by a network of layers in our architecture. We show this model in Figure 2 and describe next how it addresses the relevant concerns. (a) Number of Elements in a Subgroup: Although the number of elements in a subgroup is not known a priori, the number, or the relative ratio of the possible number of elements across subgroups, can be estimated using techniques that can extract variance information from compressed information, such as principal component analysis (PCA) (Jolliffe, 2002), independent component analysis (ICA) (Hyvärinen & Oja, 2000), or even neural networks (Kingma & Welling, 2013; Burgess et al., 2018; Mondal et al., 2021). In this paper, for reasons of simplification, we use the PCA technique. Since the singular values from PCA are proportional to the variances of the principal components of the compressed data, these values are used to obtain a relative ratio of the number of possible elements in the subgroups (Wall et al., 2003). All singular values are divided by the maximum value, rounded to one decimal place, and values smaller than 1 are then replaced with the hyperparameter α. The relevant algorithm is shown in Algorithm 1 in the Appendix. Let Λ be the relative ratio of the possible number of elements across subgroups obtained from Algorithm 1.
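A minimal sketch of the singular-value procedure just described (our reading of Algorithm 1): the centering step and the literal "values smaller than 1 are replaced with α" rule are assumptions, and the paper's actual algorithm may differ.

```python
import numpy as np

def subgroup_ratio(latents, alpha=0.1):
    """Estimate the relative ratio Lambda of the number of elements across
    subgroups from the PCA singular values of encoded (compressed) data.
    `alpha` is the hyperparameter from the text."""
    centered = latents - latents.mean(axis=0)            # assumption: center first
    s = np.linalg.svd(centered, compute_uv=False)        # singular values, descending
    ratio = np.round(s / s.max(), 1)                     # normalize, round to 1 d.p.
    ratio[ratio < 1.0] = alpha                           # literal reading of the text
    return ratio

# Toy data: one dominant direction and one small one.
lat = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
lam = subgroup_ratio(lat)   # -> the dominant component keeps 1.0, the other gets alpha
```

The resulting Λ is used to scale the normalized latent dimensions in the architecture, so that subgroups with more elements span a proportionally larger range.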
(b) Uniform Spatial Distribution of Features: To ensure that each feature is distributed in an equally likely manner across the latent space and falls within (0, 1) (which guarantees that A ∘ E(T_i^α(z)) = A ∘ E(T_j^β(z)) if and only if i = j and α = β, as in Theorem 3.2), we introduce a normalization layer, where we apply batch min-max normalization to the outputs of the encoder. As the minimum and maximum values vary from mini-batch to mini-batch, we update moving minimum and maximum values during the training process, and use them during the test phase, akin to a batch normalization layer (Ioffe & Szegedy, 2015). The minimum and maximum values are also initialized close to the middle point of [0, 1) to facilitate learning. This is then followed by scaling by Λ to account for the different numbers of possible elements for different features. (c) Robustness to Small Perturbations: Robustness is achieved by introducing an interpolation layer that performs Gaussian interpolation on the output of the normalized latent space, following Vincent et al. (2010); Berthelot et al. (2018). Gaussian interpolation is used to map unseen examples to known examples, and to make the latent space locally smooth. Since the proposed model is deterministic, it is important to map unseen examples to the learned representations. This is achieved by adding a weight-sensitive Gaussian noise to the outputs of the previous layer during training, obtained based on the closest proximal distance of each dimension of the representations. The relevant algorithm is shown in Algorithm 2 in the Appendix. It is worth noting that this layer is not used during the inference phase.
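The batch min-max normalization with moving statistics described in (b) can be sketched as follows; the momentum value, the initialization near the middle of [0, 1), and the class interface are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

class MinMaxNorm:
    """Batch min-max normalization with moving statistics, akin to batch norm:
    batch min/max are used in training, moving min/max at test time."""

    def __init__(self, dim, momentum=0.9):
        self.momentum = momentum
        # initialize close to the middle of [0, 1) to facilitate learning
        self.moving_min = np.full(dim, 0.45)
        self.moving_max = np.full(dim, 0.55)

    def __call__(self, z, training=True):
        if training:
            bmin, bmax = z.min(axis=0), z.max(axis=0)
            self.moving_min = self.momentum * self.moving_min + (1 - self.momentum) * bmin
            self.moving_max = self.momentum * self.moving_max + (1 - self.momentum) * bmax
        else:
            bmin, bmax = self.moving_min, self.moving_max
        return (z - bmin) / (bmax - bmin + 1e-8)   # map each feature into [0, 1)

norm = MinMaxNorm(dim=2)
out = norm(np.array([[0.0, 2.0], [4.0, 6.0]]))   # training step: batch stats used
```

In the full architecture this layer sits between the encoder and the scaling by Λ, and the Gaussian interpolation layer of (c) follows it during training.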

3.4. A NOVEL METRIC FOR QUANTIFYING DISENTANGLEMENT: GF-SCORE

Nearly all of the existing metrics outlined in the literature for quantifying linear symmetry-based disentanglement require ground-truth labels. Here, we propose a new metric, namely, the Grid Fitting Score (GF-Score), to achieve the same purpose without the need for labels. Our hypothesis is that performing independent disentangled actions on a symmetry group causes the corresponding subspace to form a grid-shaped latent space. This can be exploited by generating a square grid to include the latent variables, and by measuring the mean of the minimum distances from the square grid to the latent variables to signify the quality of disentanglement. If the latent variables fit perfectly into the square grid, this implies that the model achieves perfect linear disentanglement, and we mark this with a score of zero. Therefore, the lower the GF-Score, the better the disentanglement. The relevant algorithm is shown in the Appendix.
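A minimal sketch of the GF-Score idea, not the paper's algorithm: given a grid spacing, snap each latent point to its nearest grid node and average the distances. How the grid spacing and offset are chosen is simplified away here.

```python
import numpy as np

def gf_score(latents, spacing):
    """Mean minimum distance from latent points to the nearest node of a square
    grid with the given spacing; 0 means a perfect grid fit (lower is better)."""
    nearest = np.round(latents / spacing) * spacing     # closest grid node per point
    return np.mean(np.linalg.norm(latents - nearest, axis=1))

# Points exactly on a 0.5-spaced grid score 0; off-grid points score > 0.
on_grid = np.array([[0.0, 0.0], [0.5, 1.0]])
off_grid = np.array([[0.25, 0.0]])
```

A full implementation would also search over grid spacings and offsets per latent dimension, since the best-fitting grid is not known in advance.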

4. EVALUATION

4.1. EVALUATION METHOD

Our evaluation involves comparing the performance of the proposed approach against ten baseline models across six datasets using six disentanglement metrics. The code will be made publicly available when the paper is published. We outline these details below. Datasets: One of the critical challenges around evaluating disentanglement is identifying suitable datasets, and it is difficult to identify common datasets for studying this problem. In the literature, different datasets have been used for different purposes. For example, the dSprite (Matthey et al., 2017), 3D Chair (Burgess & Kim, 2018) and CelebA (Liu et al., 2015) datasets have been used in β-VAE, β-TCVAE, and FVAE. Although these datasets are useful for understanding the traversal order of the latent space, they lack a clear underlying group structure. As such, here, we utilize the datasets that were first utilized in Higgins et al. (2018), with relevant enhancements, which we describe in the Appendix (see A.10). In addition to this dataset, we use five more datasets containing a clear underlying group structure, namely, the 2D Arrow, 3D Airplane (Tonnaer et al., 2022), 3D Teapots (Eastwood & Williams, 2018), 3D Shape (Burgess & Kim, 2018) and 3D Face Model (Paysan et al., 2009) datasets. Finally, to demonstrate the performance on complex datasets, we use the Blood Cell (Acevedo et al., 2020) and Sprites (Reed et al., 2015) datasets (see results in the Appendix). Baseline Models: We considered ten different baseline models for our evaluation, namely, plain AE, vanilla VAE, β-VAE, β-TCVAE, CCI-VAE, FVAE, InfoVAE, DIPVAE, WAE and LSBDVAE. For DIPVAE, we only test DIPVAE-I, because DIPVAE-II works better only in cases where the dimension of the latent space is larger than the number of ground-truth factors, which is not the case for us. Furthermore, as the proposed technique is purely an AE-based method, we have not included any GAN-specific baselines.
To render a fair evaluation mechanism, we used the same encoder and decoder architectures, and the same latent space dimensions (for each baseline model), as used in Higgins et al. (2017); Kim & Mnih (2018); Quessard et al. (2020); Tonnaer et al. (2022) throughout the evaluation. Please see the Appendix (A.8) for additional details. Performance Metrics: As discussed in Section 2.1, a large number of metrics can be used to study the performance of disentanglement, depending on the nature of the dataset, access to ground truth, availability of latent factors, and the number of dimensions in the latent space. We use two types of metrics: (a) visualization of the latent space, and (b) numerical disentanglement scores. The former permits one to visualize the orthogonality between features, and can be used to demonstrate how latent traversal is achieved by the model, as well as grid structures in the latent space. The latter provides a quantifiable method for assessing disentanglement. Collectively, we have used six metrics, including five supervised metrics accounting for each of the disentanglement metric classes (see Section 2.1), namely, z-diff and z-min from the intervention-based, dci-rf from the predictor-based, and jemmig and dcimig from the information-based metric classes, in order to measure disentanglement, completeness and informativeness, along with the GF-Score (see 3.4). In Locatello et al. (2019), it was shown that the variances of all metrics are large across random seeds, which disturbs the comparison between different models. Hence, we run all the models on each dataset with 20 different random seeds and select the random seed with the highest total score over these metrics.

4.2. RESULTS AND DISCUSSIONS

Our evaluation has produced a considerable volume of results and, for reasons of brevity, we present two sets of results here: (i) the reconstructions of latent traversals for the 2D Arrow, 3D Airplane, XY CS, 3D Shape, 3D Teapots and 3D Face Model datasets in Figure 3, along with the percentage changes when color and shape features are added to the XY dataset in Table 1, and (ii) the disentanglement scores of the top two performing models across all datasets for all metrics in Tables 1 and 2. We provide the remaining set of results (reconstructions of latent traversals and disentanglement scores of all models), and other relevant details (such as hyper-parameters and network architectures, A.7 and A.8), as part of the Appendix.

Table 1: Disentanglement scores on the XY / XY CS datasets (percentage change in parentheses).

Model    | z-diff ↑           | z-var ↑            | dci-rf ↑           | jemmig ↑           | dcimig ↑
DAE      | 1.00/1.00 (0.0%)   | 1.00/1.00 (0.0%)   | 0.99/0.94 (-5.0%)  | 0.81/0.67 (-17.2%) | 0.80/0.75 (-6.2%)
β-VAE    | 1.00/0.79 (-21.0%) | 1.00/0.46 (-54.0%) | 0.91/0.08 (-91.2%) | 0.65/0.18 (-72.3%) | 0.63/0.13 (-79.3%)
β-TCVAE  | 1.00/0.72 (-28.0%) | 1.00/0.41 (-59.0%) | 0.93/0.15 (-83.8%) | 0.70/0.24 (-65.7%) | 0.69/0.14 (-79.7%)
CCI-VAE  | 1.00/1.00 (0.0%)   | 1.00/1.00 (0.0%)   | 0.98/0.63 (-35.0%) | 0.78/0.47 (-39.7%) | 0.76/0.46 (-39.4%)
FVAE     | 1.00/1.00 (0.0%)   | 1.00/0.91 (-9.0%)  | 0.96/0.18 (-80.8%) | 0.73/0.21 (-69.5%) | 0.70/0.13 (-80.5%)
InfoVAE  | 1.00/0.92 (-8.0%)  | 1.00/0.54 (-46.0%) | 0.90/0.21 (-76.6%) | 0.64/0.27 (-57.8%) | 0.67/0.13 (-76.6%)
DIPVAE   | 1.00/1.00 (0.0%)   | 1.00/0.44 (-56.0%) | 0.98/0.32 (-67.3%) | 0.78/0.28 (-64.1%) | 0.78/0.11 (-85.8%)
LSBDVAE  | 1.00/1.00 (0.0%)   | 1.00/0.76 (-24.0%) | 0.96/0.38 (-60.4%) | 0.72/0.30 (-58.3%) | 0.70/0.28 (-60.0%)

4.2.1. LATENT SPACE VISUALIZATION

We show the disentangled (2D) latent spaces for the XY, 2D Arrow and 3D Airplane datasets, which have two underlying factors, in Figure 1 (also see Table 14 in Appendix A for details of the relevant hyperparameters). As can be seen in Figure 1, the proposed model, in general, provides the ideal grid shape outlined in Higgins et al. (2018). The plain AE, vanilla VAE, InfoVAE and WAE models offer the worst performance. Other models, such as the β-VAE, β-TCVAE, CCI-VAE, and DIPVAE models, also come closer to the ideal pattern on the three datasets, and thus most models are able to disentangle the x and y positions in the XY dataset, and the rotation and color factors in the 2D Arrow and 3D Airplane datasets. However, when a color or shape feature is added to the XY dataset (i.e., for the XY C, XY S and XY CS datasets), disentanglement becomes a significant challenge for all but the proposed model (see Figure 11 in the Appendix). In addition to these, pairs of latent spaces, and reconstructions of latent traversals (across each latent dimension), of six datasets for the DAE are shown in Figure 3 (see also Appendix A.11 for more results).

4.2.2. DISENTANGLEMENT SCORES

We present the supervised disentanglement scores and their percentage changes when color and shape features are added to the XY dataset in Table 1, with the changes measured relative to the scores on the XY dataset. The disentanglement scores for the top two performing models for all datasets (except the XY S dataset) are shown in Tables 1 and 2, with the best performing model highlighted in boldface. From these results (including those in the Appendix), the following observations can be drawn. Firstly, the proposed model outperforms all models across all metrics for the 2D Toy (covering the XY, XY C, XY S and XY CS datasets), 3D Airplane and 3D Teapots datasets (see Tables 7 and 8). Secondly, for the remaining datasets, DAE offered the best score on four of them (2D Arrow, 3D Airplane, 3D Teapots and 3D Face Model), while offering the second best performance on the 3D Shape dataset. Where DAE offered the second best performance, it still achieved higher jemmig and dcimig scores, and the same z-diff and z-min scores. Thirdly, for the 2D Toy dataset, the proposed model keeps the reconstruction loss as small as possible whilst offering improved disentanglement scores (see Figures 25-28). In contrast, the reconstruction losses for the β-VAE, β-TCVAE and CCI-VAE models increase along with their disentanglement scores. Finally, the GF-Score shows that, across all datasets and baseline models, the proposed model fits the latent space into a grid structure best. Based on the GF-Scores in Tables 4 to 10, a model without a regularizer, such as the plain AE, fails to form a grid structure in the latent space.

5. CONCLUSIONS

In the context of representation learning, being able to factorize or disentangle the latent space dimensions is crucial for obtaining latent representations that are composed of multiple, independent factors of variation. The literature around disentanglement methods is predominantly supervised or semi-supervised, and as such, either labels or pairs of images are required, or disentanglement is achieved via factorizing the aggregated posterior in the latent space. In this paper, we presented a non-probabilistic, deterministic model, namely, the disentangling autoencoder or DAE, addressing a number of issues found in the literature. We also demonstrated how to realize the disentanglement conceptualized in Higgins et al. (2018) for the first time, especially without requiring labels or pairs of images. Our approach exploits the Euler encoding, which makes the subspaces of the latent space independent of one another. Along with the architectural details, we also presented a novel metric for quantifying disentanglement, namely, the GF-Score. Our detailed evaluations, performed against a large number of AE-based models using a considerably large number of metrics, show that our model can offer superior disentanglement performance when compared against a number of existing methods across a number of datasets. Although the results are encouraging, a number of aspects remain to be investigated, including the evaluation of the proposed model on datasets that lack an underlying group structure, understanding the effect of the choice of the latent dimension on the outcomes, and the evaluation of different latent space smoothing algorithms, to mention a few.

A APPENDIX

A.1 A REVIEW OF GROUP THEORY

More details of the definitions and theorems can be found in Dummit & Foote (2004).

Definition A.1. A group is an ordered pair (G, ⋆), where G is a set and ⋆ is a binary operation on G, satisfying the following axioms:
1. Associativity: (a ⋆ b) ⋆ c = a ⋆ (b ⋆ c) for all a, b, c ∈ G,
2. Identity: there exists an element e ∈ G, called an identity of G, such that a ⋆ e = e ⋆ a = a for all a ∈ G,
3. Inverses: for each a ∈ G there is an element a⁻¹ ∈ G, called an inverse of a, such that a ⋆ a⁻¹ = a⁻¹ ⋆ a = e.
We shall write the operation a ⋆ b as ab.

Definition A.2. A group action of a group G on a set A is a map • : G × A → A, written •(g, a) = g • a, satisfying the following properties:
1. g₁ • (g₂ • a) = (g₁g₂) • a for all g₁, g₂ ∈ G and a ∈ A,
2. e • a = a for all a ∈ A.
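As a concrete illustration (ours, not from the paper), the cyclic group Z_n under addition modulo n satisfies the axioms of Definition A.1, which can be checked exhaustively for a small n:

```python
# Illustrative check (not from the paper): the cyclic group Z_n under
# addition modulo n satisfies the group axioms of Definition A.1.
from itertools import product

def check_group(elements, op, identity):
    """Exhaustively verify associativity, identity, and inverses."""
    assoc = all(op(op(a, b), c) == op(a, op(b, c))
                for a, b, c in product(elements, repeat=3))
    ident = all(op(a, identity) == a and op(identity, a) == a
                for a in elements)
    inv = all(any(op(a, b) == identity and op(b, a) == identity
                  for b in elements)
              for a in elements)
    return assoc and ident and inv

n = 6
Zn = list(range(n))
add_mod_n = lambda a, b: (a + b) % n
print(check_group(Zn, add_mod_n, 0))  # True
```

Dropping any element (and hence some inverses) makes the check fail, mirroring how each axiom is necessary.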

A.2 DISENTANGLED REPRESENTATION

The notion of a disentangled representation is mathematically defined using the concept of symmetry in Higgins et al. (2018). For example, horizontal and vertical translations are symmetry transformations on a two-dimensional grid, and such transformations change the location of an object in this grid. From the definitions of a symmetry group in Higgins et al. (2018), a symmetry group can be decomposed as a product of multiple subgroups, if suitable subgroups can be identified. This renders an intuitive method to disentangle the latent space, if subgroups that act independently on subspaces of the latent space can be found. If the actions of each subgroup only affect the corresponding subspace, the actions are called disentangled group actions. In other words, disentangled group actions only change a specific property of the state of an object, leaving the other properties invariant. If there is a transformation in a vector space of representations corresponding to a disentangled group action, the representation is called a disentangled representation. We reproduce the formal definitions of disentangled group action and disentangled representation from Higgins et al. (2018) as Definitions A.5 and A.6, respectively.

Definition A.5. Suppose that we have a group action • : G × X → X, and the group G decomposes as a direct product G = G₁ × ··· × Gₙ. Let the action of the full group and the actions of each subgroup be referred to as • and •ᵢ, respectively. Then the action is disentangled if there is a decomposition X = X₁ × ··· × Xₙ and actions •ᵢ : Gᵢ × Xᵢ → Xᵢ, i ∈ {1, ..., n}, such that

(g₁, ..., gₙ) • (x₁, ..., xₙ) = (g₁ •₁ x₁, ..., gₙ •ₙ xₙ)   (7)

for all gᵢ ∈ Gᵢ and xᵢ ∈ Xᵢ.

The overarching goal of disentangling the latent space now relies on finding a corresponding action • : G × Z → Z, so that the symmetry structure of W is reflected in Z.
In other words, an action on Z corresponding to the action on W is desirable. This can be achieved if the following condition is satisfied:


g • f(w) = f(g • w)   ∀g ∈ G, w ∈ W.   (8)

In other words, the action • should commute with f; this adheres to the definition of an equivariant map, and thus f is an equivariant map, as summarized by the commutative diagram:

G × W --•_W--> W
  |id_G × f     |f
  v             v
G × Z --•_Z--> Z

From Higgins et al. (2018), a disentangled representation can be defined as follows:

Definition A.6. The representation Z is disentangled with respect to G = G₁ × ··· × Gₙ if:
1. There is an action • : G × Z → Z,
2. The map f : W → Z is equivariant between the actions on W and Z, and
3. There is a decomposition Z = Z₁ × ··· × Zₙ or Z = Z₁ ⊕ ··· ⊕ Zₙ such that each Zᵢ is fixed by the action of all G_j, j ≠ i, and affected only by Gᵢ.
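The equivariance condition can be verified numerically for a toy example (ours, not from the paper): let W = Z = R², let f scale world-states by 2, let G act on W by translation, and let the corresponding action on Z translate by the scaled amount.

```python
# Illustrative numeric check (not from the paper) of the equivariance
# condition g . f(w) = f(g . w) for a toy translation group on R^2.
# f scales by 2; the induced action on Z must translate by 2g so that
# the diagram commutes.
import numpy as np

def f(w):                # representation map W -> Z (toy choice)
    return 2.0 * w

def act_W(g, w):         # action of g in G on a world-state
    return w + g

def act_Z(g, z):         # corresponding action on the representation
    return z + 2.0 * g

rng = np.random.default_rng(0)
for _ in range(100):
    g = rng.normal(size=2)
    w = rng.normal(size=2)
    assert np.allclose(act_Z(g, f(w)), f(act_W(g, w)))
print("equivariance holds on all sampled (g, w) pairs")
```

Choosing act_Z as plain translation by g (instead of 2g) would break the condition, illustrating that the action on Z must be matched to f.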

A.3 PROOF

Proof of Theorem 3.1.

Proof. Suppose that b is the equivariant map defined in Theorem 3.1 and that h is an equivariant map. Then, for all g ∈ G and w ∈ W,

g • f(w) = g • h(b(w))   (9)
         = h(g • b(w))   (10)   (by the equivariance of h)
         = h(b(g • w))   (11)   (by the equivariance of b)
         = f(g • w).     (12)

A.4 COMPARISON OF DIFFERENT VAE-BASED MODELS

In our evaluation, we compare the proposed model against other VAE-based derivatives, namely, vanilla VAE, β-VAE, β-TCVAE, CCI-VAE, FVAE, InfoVAE, DIPVAE, WAE and LSBDVAE. All these models vary based on the underlying regularizer L_reg(ϕ). For example, the β-VAE model constrains the latent space using β to limit its capacity, which encourages the model to learn the most efficient representation of the data. The regularization terms of these models (Column 2) are summarized in Table 3, along with relevant notes (Column 3).

Table 3 (reconstructed):
VAE: D_KL(q_ϕ(z|x) ∥ p(z)).
β-VAE: β D_KL(q_ϕ(z|x) ∥ p(z)); usually, β is greater than 1.
β-TCVAE: I(z, x) + β D_KL(q(z) ∥ ∏_j q(z_j)) + Σ_j D_KL(q(z_j) ∥ p(z_j)); I(·, ·) is mutual information.
CCI-VAE: β |D_KL(q_ϕ(z|x) ∥ p(z)) − C|; C is a capacity.
FVAE: D_KL(q_ϕ(z|x) ∥ p(z)) + γ D_KL(q(z) ∥ ∏_j q(z_j)).
InfoVAE: D_KL(q_ϕ(z|x) ∥ p(z)) + λ MMD(q_ϕ(z|x), p(z)); MMD(·, ·) is the Maximum Mean Discrepancy.
DIPVAE: D_KL(q_ϕ(z|x) ∥ p(z)) + λ_od Σ_{i≠j} [Cov_{p(x)}[μ_ϕ(x)]]²_{i,j} + λ_d Σ_i ([Cov_{p(x)}[μ_ϕ(x)]]_{i,i} − 1)²; Cov is a covariance and μ_ϕ(x) is the output of the encoder. We set λ_d = 10 λ_od, as suggested in Kumar et al. (2017).
WAE: λ MMD(q_ϕ(z|x), p(z)); λ is a regularization coefficient.
LSBDVAE: Δ_KL(q_ϕ(z|x) ∥ p(z)) + λ D_LSBD; Δ_KL is the KL divergence used in the Diffusion VAE.

A.5 ALGORITHMS

Algorithm 1: Obtaining Λ using PCA
Input: X, the entire dataset, and α, a hyperparameter less than 1
Output: Λ = [w_1, w_2, ..., w_n]
If n > 2:
    S = [s_1, s_2, ..., s_n]: the singular values from PCA(X)
    S̃ = [s̃_1, s̃_2, ..., s̃_n] = S / max(S)
    Λ = [w_1, w_2, ..., w_n]: S̃ rounded to one decimal place (S̃ for each dataset is shown in Table 12)
    If there exists i such that w_i < 1, then w_i = α
Otherwise:
    Λ = [1, α]

Algorithm 2: Interpolation layer
Input: a mini-batch B = {x_1, ..., x_m}
Output: {y_1, ..., y_m}
Let x_i = (x_i^k), k = 1, ..., n.
for k in {1, 2, ..., n}:
    for i in {1, 2, ..., n} − {k}:
        Denote the weight w_i^k = min_{j ∈ {1, ..., m}} d(x_i^k, x_j^k)
        y_i^k = x_i^k + w_i^k · ε, where ε ∼ N(0, 1)

Algorithm 3: Grid fitting score method
Input: Z = [Z_{:,1}, ..., Z_{:,n}], a matrix consisting of all latent variables obtained by a model; each row corresponds to one latent variable.
Output: S
for i in {1, 2, ..., n − 1}:
    for j in {i + 1, ..., n}:
        Denote Z^{i,j} = [Z_{:,i}, Z_{:,j}], a two-dimensional subspace of Z
        Create G^{i,j}, a set of variables from a square grid that fits Z^{i,j}
        Let d_{i,j} = 0
        for k in range(len(G^{i,j})):
            d_{i,j} += distance(G^{i,j}_{k,:}, Z^{i,j}_{k,:})
        S_{i,j} = d_{i,j} / k
S: the average of the S_{i,j}

A.6 DISENTANGLEMENT SCORES

A.9 SYSTEM AND MODEL CONFIGURATIONS

All of our experiments were run on hardware consisting of two DGX-2 nodes, collectively comprising 32 V100 GPUs, 1.5GB GPU RAM, and 3TB of system RAM. The encoder and decoder architectures are the same in all experiments. The encoder has two convolutional layers, each followed by a Batch Normalization layer and a LeakyReLU activation. After the convolutional layers, there is one fully-connected layer with 64 nodes and another layer which maps to the latent space. The decoder is symmetric to the encoder. C for CCI-VAE is set to 25 for the 2D Toy dataset and to 50 for all other datasets.

A.10 DATASET

Figure 24: Reconstructions of latent traversals across each latent dimension in the Blood Cell dataset. Some of the models independently learn the size and shape of blood cells. For example, for DAE, the first dimension in the latent space corresponds to the size of cells, and the second and third dimensions correspond to changes in shape in different directions.
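The grid-fitting idea behind the GF-Score can be sketched in a few lines (our own minimal reading, not the authors' code); here `fit_square_grid` is a simplified stand-in for the paper's grid-fitting step that snaps each latent point to the nearest node of a regular square grid spanning the data:

```python
# Minimal sketch (not the authors' code) of a grid-fitting (GF) score for
# the 2D subspaces of a latent matrix Z (columns = latent dimensions).
# fit_square_grid is a simplified stand-in: it builds a regular square grid
# over the data range and snaps each point to its nearest grid node.
# A lower score means a better grid fit.
import numpy as np

def fit_square_grid(Z2, n_cells=4):
    """Return, for each point in Z2 (N x 2), the nearest square-grid node."""
    lo, hi = Z2.min(axis=0), Z2.max(axis=0)
    ticks = [np.linspace(lo[d], hi[d], n_cells + 1) for d in range(2)]
    return np.stack(
        [ticks[d][np.abs(Z2[:, [d]] - ticks[d]).argmin(axis=1)]
         for d in range(2)],
        axis=1)

def gf_score(Z):
    """Average point-to-grid distance over all 2D subspaces of Z."""
    n = Z.shape[1]
    scores = []
    for i in range(n - 1):
        for j in range(i + 1, n):
            Z2 = Z[:, [i, j]]
            G = fit_square_grid(Z2)
            scores.append(np.linalg.norm(Z2 - G, axis=1).mean())
    return float(np.mean(scores))

# A latent space lying exactly on a grid scores zero.
grid = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)
print(gf_score(grid))  # → 0.0
```

Jittering the grid points raises the score, matching the intended use: a regularized model such as DAE should score near zero, while an unregularized AE should not.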



Figure 1: Latent spaces learned by different models. An ideally learned latent space should cover a two-dimensional grid (Higgins et al., 2018). The first, second and third rows show the latent spaces learned from three datasets, namely, the XY, 2D Arrow, and 3D Airplane datasets, respectively. Columns correspond to the different models stated at the bottom of every column. It can be seen that the proposed model, DAE, achieves the best disentanglement.

Figure 2: Illustration of the DAE architecture. The model includes the Euler encoding, and the outputs from the interpolation layer are mapped to cosine and sine values.
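The core idea of the Euler encoding described above — mapping each latent coordinate to a point on the unit circle via its cosine and sine — can be sketched as follows (a minimal illustration in our own notation, not the authors' implementation; the per-dimension scaling `weights` is assumed to come from Λ in Algorithm 1):

```python
# Minimal sketch (not the authors' code) of an Euler-encoding layer:
# each latent coordinate theta_i is mapped to (cos(w_i*theta_i),
# sin(w_i*theta_i)), i.e., a point on the unit circle, so that each
# latent dimension lives on its own circular subspace.
import numpy as np

def euler_encode(theta, weights):
    """Map latent angles (batch, n) to (batch, 2n) cosine/sine pairs."""
    scaled = theta * np.asarray(weights)  # per-dimension scaling, e.g. from Lambda
    return np.concatenate([np.cos(scaled), np.sin(scaled)], axis=1)

z = euler_encode(np.array([[0.0, np.pi / 2]]), weights=[1.0, 1.0])
# Each (cos, sin) pair has unit norm regardless of the input angle.
print(z)
```

By construction, each (cos, sin) pair is constrained to the unit circle independently of the other dimensions, which is the sense in which the subspaces are made independent of one another.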

Figure 3: Reconstructions of latent traversals across each latent dimension obtained by the DAE for the (a) 2D Arrow (color and shape), (b) 3D Airplane (color and shape), (c) XY CS (x position, y position, color and shape), (d) 3D Shape (floor hue, wall hue, object hue, scale, shape and orientation), (e) 3D Teapots (azimuth, elevation, red, green, blue and extra) and (f) 3D Face Model datasets (face id, azimuth, elevation and lighting).

Differences in scores between the XY and XY CS datasets. Absolute and percentage changes from XY to XY CS are shown; percentage changes closer to zero are desirable.

To derive the definition of a disentangled representation from the definition of a disentangled group action, consider a set of world-states, denoted by W. Furthermore, assume that: (a) there is a generative process b : W → O leading from world-states to observations O, and (b) an inference process h : O → Z leading from observations to an agent's representations Z. With these, consider the composition f : W → Z, f = h ∘ b. In terms of transformations, assume that these are represented by a group G of symmetries acting on W via an action • : G × W → W.

Figure 4: Four factors in datasets. x and y positions have 53 elements, color has 5 elements and shape has 3 elements.

Figure 6: Relationships between X-Y, X-C, X-S and C-S features in DAE.

Figure 7: Relationships between X-Y, X-C, X-S and C-S features in AE.

Figure 8: Relationships between X-Y, X-C, X-S and C-S features in VAE.

Figure 9: Relationships between X-Y, X-C, X-S and C-S features in β-VAE.

Figure 10: Relationships between X-Y, X-C, X-S and C-S features in β-TCVAE.

Figure 11: Relationships between X-Y, X-C, X-S and C-S features in CCI-VAE.

Figure 12: Relationships between X-Y, X-C, X-S and C-S features in FVAE.

Figure 13: Relationships between X-Y, X-C, X-S and C-S features in InfoVAE.

Figure 14: Relationships between X-Y, X-C, X-S and C-S features in DIPVAE.

Figure 15: Relationships between X-Y, X-C, X-S and C-S features in WAE.

Figure 16: Relationships between X-Y, X-C, X-S and C-S features in LSBDVAE.

Figure 18: Reconstructions of latent traversals across each latent dimension in the 2D Arrow dataset.

Figure 21: Reconstructions of latent traversals across each latent dimension in the 3D Shape dataset. Due to the space limit, we omit the results from AE, InfoVAE and WAE, which have low scores on the 3D Shape dataset.

Figure 22: Reconstructions of latent traversals across each latent dimension in the 3D Face Model dataset.

Disentanglement scores for the 2D Arrow, 3D Airplane, 3D Teapots, 3D Shape and 3D Face Model datasets.

The color and shape features added to the XY dataset have smaller variances than the x and y positions. As can be seen, in general, nearly all models suffer a performance drop, except a few. The CCI-VAE is the only model that performs as well as the proposed model on the z-diff and z-min metrics. The proposed model shows the smallest drop across the three remaining metrics, namely, dci-rf, jemmig and dcimig. While the largest drop for the proposed model is 17.2%, the scores fall by between 35.0% and 91.2% for the other models.

Comparison of different VAE-based models w.r.t. the regularizers they employ.

Disentanglement scores for the XY dataset. Columns: z-diff ↑, z-var ↑, dci-rf ↑, jemmig ↑, dcimig ↑, GF (e−100) ↓, MSE ↓.

Disentanglement scores for the XY CS dataset

Disentanglement scores for the 2D Arrow dataset

Disentanglement scores for the 3D Airplane dataset. Columns: z-diff ↑, z-var ↑, dci-rf ↑, jemmig ↑, dcimig ↑, GF (e−100) ↓, MSE ↓.

Disentanglement scores for the 3D Teapots dataset

Disentanglement scores for the 3D Shape dataset

Disentanglement scores for the 3D Face Model dataset. Columns: z-diff ↑, z-var ↑, dci-rf ↑, jemmig ↑, dcimig ↑, GF (e−100) ↓, MSE ↓.

Performance effects when removing the Euler layer (E) or the normalization layer (N)

Best hyperparameters for models for the 2D datasets.

Best hyperparameters for models for the 3D datasets.

A.7 HYPERPARAMETERS

[1.0, 1.0, 0.8]
XY S: [1.0, 1.0, 0.8]
XY CS: [1.0, 1.0, 0.8, 0.8]
2D Arrow: [1.0, 1.0]
3D Airplane: [1.0, 1.0]
3D Teapots: [1.0, 0.8, 0.8, 0.4, 0.3, 0.3]
3D Shape: [1.0, 1.0, 1.0, 1.0, 0.5, 0.5]
3D Face Model: [1.0, 0.4, 0.4, 0.3]

Table 13: All hyperparameters for models.

Model / Values for the XY CS dataset / Extra values for the other datasets:
DAE (α): [1.0, 0.5, 0.1, 0.05, 0.01]
β-VAE (β): [2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
β-TCVAE (β): [2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]

Hence, the desirable value for both w_1 and w_2 is 1. The results show that when an α lower than 0.05 is assigned to the second dimension, the disentanglement scores also become lower.
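The normalization-and-rounding step of Algorithm 1, which produces the S̃ values listed in Table 12, can be sketched as follows (our own reading, not the authors' code; the replacement of sub-1 weights by α follows the algorithm's wording as given in Appendix A.5):

```python
# Sketch (not the authors' code) of Algorithm 1: derive the weight vector
# Lambda from the singular values of the dataset, normalizing by the largest
# singular value and rounding to one decimal place. Following the algorithm's
# stated rule, any weight below 1 is replaced by the hyperparameter alpha.
import numpy as np

def obtain_lambda(X, alpha=0.1):
    """Compute Lambda = [w_1, ..., w_n] from the PCA singular values of X."""
    n = X.shape[1]
    if n <= 2:
        return [1.0, alpha]
    # singular values of the mean-centered data, largest first
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)[:n]
    w = np.round(s / s.max(), 1)  # normalize by max and round to 1 d.p.
    return [alpha if wi < 1.0 else float(wi) for wi in w]
```

For example, a dataset whose first dimension dominates the variance yields a leading weight of 1.0 with the remaining weights collapsed to α.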

