LEARNING DISENTANGLEMENT IN AUTOENCODERS THROUGH EULER ENCODING

Abstract

Noting the importance of factorizing (or disentangling) the latent space, we propose a novel, non-probabilistic disentangling framework for autoencoders, based on the principle of symmetry transformations that are independent of one another. To the best of our knowledge, this is the first deterministic model that aims to achieve disentanglement with autoencoders, without pairs of images or labels, by explicitly introducing inductive biases into the model architecture through Euler encoding. The proposed model is compared with a number of state-of-the-art disentanglement models, including symmetry-based models and generative models built on autoencoders. Our evaluation using six disentanglement metrics, including the new unsupervised disentanglement metric proposed in this paper, shows that the proposed model offers better disentanglement, especially when the variances of the generative factors differ, a setting in which other methods tend to struggle. We believe this model opens several opportunities for linear disentangled representation learning based on deterministic autoencoders.

1. INTRODUCTION

Learning generalizable representations of data is one of the fundamental aspects of modern machine learning (Rudin et al., 2022). In fact, better representations are no longer a luxury: they are key to achieving generalization, interpretability, and robustness in machine learning models (Bengio et al., 2013; Brakel & Bengio, 2017; Spurek et al., 2020). One of the primary desired characteristics of a learned representation is factorizability, or disentanglement, so that the latent representation is composed of multiple independent generative factors of variation. Disentanglement makes the latent features independent of one another, providing a basis for a set of novel applications, including scene rendering, interpretability, and unsupervised deep learning (Eslami et al., 2018; Iten et al., 2020; Higgins et al., 2021). Deep generative models, particularly those built on variational autoencoders (VAEs) (Kingma & Welling, 2013; Kumar et al., 2017; Higgins et al., 2017; Tolstikhin et al., 2018; Burgess et al., 2018; Chen et al., 2018; Kim & Mnih, 2018; Zhao et al., 2019), have been shown to be effective in learning factored representations. Although these approaches have advanced disentangled representation learning by regularizing the latent space, a number of issues limit their full potential: (a) VAE-based models consist of two loss components, and balancing them is a well-known issue (Asperti & Trentin, 2020); (b) it is almost impossible to honor the idealized notion of a known prior distribution for VAEs in practical settings (Takahashi et al., 2019; Asperti & Trentin, 2020; Zhang et al., 2020; Aneja et al., 2021); and (c) factorizing the aggregated posterior in the latent space does not guarantee correspondingly uncorrelated representations (Locatello et al., 2019).
An alternative approach to disentangled representations is to seek irreducible representations of symmetry groups (Cohen & Welling, 2014; Higgins et al., 2018; Painter et al., 2020; Tonnaer et al., 2022), where the aim is to find latent-space transformations that are independent of one another, underpinned by a well-defined mathematical framework based on group theory. As this family of methods exploits the notion of transitions between samples, it requires pairs of images representing the transitions (Cohen & Welling, 2014; Painter et al., 2020) or equivalent labels (Tonnaer et al., 2022). Regardless of the approach, as shown by Locatello et al. (2019), it is fundamentally impossible to learn disentangled representations without inductive biases on either the model or the dataset, and both VAE- and symmetry-based approaches embed such biases implicitly. Despite these advances, we note a number of issues in how the existing approaches address disentanglement. First, the majority of VAE-based approaches are probabilistic, and as such, the quality of the disentanglement depends on ideal or near-ideal priors and on learning correct posteriors for the given data. Second, the majority of symmetry-based disentangling approaches need pairs of images or labels, even in the unsupervised setting, owing to their requirements around inductive bias. Third, none of these models conform to the formal definition of a linear disentangled representation proposed by Higgins et al. (2018). Finally, and most importantly, none of the existing approaches introduce the inductive biases required for disentanglement on both the model and the dataset in an unsupervised manner, essentially demanding labels or image pairs.
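To make the definition of Higgins et al. (2018) concrete: a representation is linearly disentangled when the latent space decomposes into a direct sum of subspaces, each acted on by exactly one subgroup of the symmetry group, leaving the other subspaces fixed. The following minimal NumPy sketch (our own illustration, not the proposed model) shows such a block-diagonal action of SO(2) x SO(2) on a 4-D latent code:

```python
import numpy as np

def rotation_2d(theta):
    """2x2 rotation matrix, an element of SO(2)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def block_diag_action(thetas):
    """Block-diagonal action of SO(2) x ... x SO(2) on R^{2k}:
    the i-th subgroup rotates only the i-th 2-D subspace."""
    k = len(thetas)
    g = np.zeros((2 * k, 2 * k))
    for i, t in enumerate(thetas):
        g[2 * i:2 * i + 2, 2 * i:2 * i + 2] = rotation_2d(t)
    return g

# A 4-D latent code partitioned into two 2-D subspaces.
z = np.array([1.0, 0.0, 0.0, 1.0])

# Acting with a rotation in the first subgroup only...
g = block_diag_action([np.pi / 2, 0.0])
z_new = g @ z
# ...transforms the first subspace and leaves the second untouched.
```

Here each 2-D block plays the role of one independent generative factor; disentanglement amounts to learning a latent basis in which the group action takes this block-diagonal form.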
Motivated by these shortcomings, in this paper we propose a novel approach to disentangled representation learning, with the following key contributions. We:
• propose a fully unsupervised approach for introducing inductive bias into the model and data, without requiring pairs of images or labels;
• propose a non-probabilistic approach that involves neither priors nor learned posteriors;
• ensure that our approach conforms to the formal definition of a linearly disentangled representation as defined in Higgins et al. (2018);
• propose a new unsupervised metric, the Grid Fitting Score (GF-Score), to quantify disentanglement, echoing the aspiration for an ideal disentanglement measure outlined in Higgins et al. (2018); and
• demonstrate the implementation of a formally defined disentanglement using autoencoders.
As such, the proposed approach, which we name the Disentangling Auto-Encoder (DAE), offers a theoretically sound framework for learning independent multi-dimensional vector subspaces, and hence disentangled representations. To the best of our knowledge, this is the first attempt to implement a disentanglement approach using deterministic autoencoders without pairs of images or labels, and hence in a truly unsupervised manner. Figure 1 provides a glimpse of the proposed model's disentanglement capability on three datasets, compared against ten other models, all of which are either autoencoder-based probabilistic models or symmetry-based disentangled models that do not require labels or paired inputs. The rest of this paper is organized as follows. In Section 2 we review the related work, focusing on VAE-based and symmetry-based approaches. This is followed in Section 3 by the derivation of an AE-based, non-probabilistic approach for disentangled representations.
In Section 4, we perform a detailed evaluation of the overall performance of the proposed model, using ten baseline models, six datasets, and six disentanglement metrics, and discuss our findings. We conclude in Section 5 with directions for further research. Given the space constraints, we highlight the prominent results in the main part of the paper, while providing the remaining results and relevant material in the Appendix.
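The Euler-encoding layer itself is derived in Section 3; as a rough illustration of the underlying idea (the function name and shapes below are our own hypothetical choices, not the paper's implementation), Euler's formula e^{i*theta} = cos(theta) + i*sin(theta) maps an unconstrained angle to a point on the unit circle, so a latent code assembled from such cosine-sine pairs is naturally partitioned into 2-D subspaces on which planar rotations act independently:

```python
import numpy as np

def euler_encode(thetas):
    """Map unconstrained angles to points on unit circles via Euler's
    formula. Hypothetical sketch of angle-based encoding, not the
    paper's exact Euler-encoding layer."""
    thetas = np.asarray(thetas, dtype=float)
    # Each angle becomes one (cos, sin) pair, i.e. one 2-D subspace.
    return np.stack([np.cos(thetas), np.sin(thetas)], axis=-1)

# Two independent angles -> a 4-D latent code of two unit-circle points.
z = euler_encode([0.0, np.pi / 2]).reshape(-1)
```

By construction, every 2-D subspace of such a code has unit norm, which is one way an architecture can bake a rotational (symmetry-group) inductive bias directly into the latent space.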



Figure 1: Latent spaces learned by different models. An ideally learned latent space should cover a two-dimensional grid (Higgins et al., 2018). The first, second, and third rows show the latent spaces learned from the XY, 2D Arrow, and 3D Airplane datasets, respectively; columns correspond to the models indicated at the bottom of each column. The proposed model, DAE, achieves the best disentanglement.
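The grid-coverage criterion in the caption suggests how a grid-fitting measure might be computed; the GF-Score itself is defined later in the paper, so the following is a purely illustrative sketch under our own assumptions: snap each coordinate of the 2-D latent codes to the nearest of k uniformly spaced grid lines and report the mean squared residual, with lower values indicating a better grid fit.

```python
import numpy as np

def grid_fit_residual(codes, k=8):
    """Mean squared distance from 2-D latent codes to the nearest line of
    a k x k axis-aligned grid spanning the data. Illustrative only: this
    is NOT the GF-Score definition from the paper."""
    codes = np.asarray(codes, dtype=float)
    residual = 0.0
    for d in range(codes.shape[1]):
        col = codes[:, d]
        lines = np.linspace(col.min(), col.max(), k)
        # Snap each coordinate to its nearest grid line.
        nearest = lines[np.argmin(np.abs(col[:, None] - lines[None, :]), axis=1)]
        residual += np.mean((col - nearest) ** 2)
    return residual

# Codes lying exactly on an 8 x 8 grid yield zero residual.
xs, ys = np.meshgrid(np.linspace(-1, 1, 8), np.linspace(-1, 1, 8))
perfect = np.stack([xs.ravel(), ys.ravel()], axis=1)
```

A latent space like the ideal one in Figure 1 would score near zero under such a measure, while entangled, curved layouts would leave large residuals.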

