QUANTIFYING AND LEARNING DISENTANGLED REPRESENTATIONS WITH LIMITED SUPERVISION

Abstract

Learning low-dimensional representations that disentangle the underlying factors of variation in data has been posited as an important step towards interpretable machine learning with good generalization. To address the fact that there is no consensus on what disentanglement entails, Higgins et al. (2018) propose a formal definition for Linear Symmetry-Based Disentanglement, or LSBD, arguing that underlying real-world transformations give exploitable structure to data. Although several works focus on learning LSBD representations, such methods require supervision on the underlying transformations for the entire dataset, and cannot deal with unlabeled data. Moreover, none of these works provide a metric to quantify LSBD. We propose a metric to quantify LSBD representations that is easy to compute under certain well-defined assumptions. Furthermore, we present a method that can leverage unlabeled data, such that LSBD representations can be learned with limited supervision on transformations. Using our LSBD metric, our results show that limited supervision is indeed sufficient to learn LSBD representations.

1. INTRODUCTION

Disentangled representation learning aims to create low-dimensional representations of data that separate the underlying explanatory factors of variation in the data. These representations provide an interpretable (Sarhan et al., 2019) and useful tool for various purposes, such as noise removal (Lopez et al., 2018), continual learning (Achille et al., 2018), and visual reasoning (van Steenkiste et al., 2019). However, there is no consensus about the exact properties that characterize a disentangled representation. Higgins et al. (2018) provide a formal definition for Symmetry-Based Disentangled (SBD) and Linearly SBD (LSBD) data representations, building upon the idea that representations should reflect the underlying structure of the data. In particular, they argue that variability in the data comes from transformations in the real world from which the data is observed. Having a formal definition of disentanglement can serve as a paradigm for the evaluation of disentangled representations. Although several methods have been proposed to learn SBD or LSBD representations, none of them provide a clear metric for quantifying the level of disentanglement in these representations. Quessard et al. (2020) introduce a loss term that measures the complexity of the transformations acting on their learned representations, based on the number of parameters needed, but this term does not directly characterize disentanglement. Caselles-Dupré et al. (2019) only evaluate the performance of their learned representations in a particular downstream task. Moreover, existing methods require information about the transformation relationships among data points for the entire training dataset. This information is used to build models that enforce the properties of SBD or LSBD representations, and can be considered a form of supervision.
For example, this supervision can consist of the parameters of the transformation that connects a pair of data points, such as a rotation angle. Obtaining such supervision on the transformations for an entire dataset can be expensive and may require expert knowledge. In this work, we focus on characterizing and quantifying LSBD, and on developing a method capable of obtaining LSBD representations using only a limited amount of supervision on the transformation properties of a dataset. The main contributions of this paper are:

1. An easy-to-compute metric to quantify LSBD under certain well-defined assumptions (see Section 4), which acts as an upper bound to a more general metric (derived in Appendix D).
2. A partially supervised method to obtain LSBD representations that, during training, can also use data without supervision on the transformation relationships.
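The kind of transformation supervision described above can be sketched as a simple data layout. The following snippet is purely illustrative (the field names and array shapes are assumptions, not the paper's actual data format): a labeled pair of observations carries the parameter of the transformation relating them, e.g. a rotation angle, while an unlabeled pair omits it.

```python
import numpy as np

# Hypothetical supervision format: a labeled pair stores two observations
# together with the parameter of the transformation relating them (here a
# rotation angle); an unlabeled pair stores only the observations.
labeled_pair = {
    "x1": np.zeros((32, 32)),   # observation before the transformation
    "x2": np.zeros((32, 32)),   # observation after the transformation
    "angle": np.pi / 4,         # supervision: rotation relating x1 to x2
}
unlabeled_pair = {
    "x1": np.zeros((32, 32)),
    "x2": np.zeros((32, 32)),   # transformation parameter unknown
}

# Limited supervision: only a small fraction of pairs come with angle labels.
dataset = [labeled_pair] + [unlabeled_pair] * 99
n_labeled = sum("angle" in p for p in dataset)
print(n_labeled / len(dataset))  # fraction of supervised pairs: 0.01
```

A partially supervised method can then apply a transformation-based loss only to the labeled fraction, while the unlabeled pairs still contribute to, e.g., a reconstruction objective.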

2. RELATED WORK

The concept of disentanglement comes from the intuition that data can be described in terms of a set of independent explanatory factors that constitute the variability of the data. In probabilistic modeling, these factors are interpreted as independent unobserved latent variables that condition the data generation process (Kulkarni et al., 2015; Higgins et al., 2016; Chen et al., 2016; 2018). Nevertheless, there is no consensus about the exact properties that disentangled representations should have, as evidenced by the wide range of metrics used to characterize such representations (Locatello et al., 2018). A reasonable expectation, however, is that such representations should capture and separate the variations and properties of the data. Recent work has turned attention to capturing not only the independent explanatory factors of a dataset, but also the transformations that those factors undergo during data generation; examples of methods that attempt to produce representations with transformation properties similar to those of the explanatory factors are (Cohen & Welling, 2015; Worrall et al., 2017; Sosnovik et al., 2019). Transformations of the real world determine the variability of data and its structure. These so-called symmetry transformations have long been studied in physics (Gross, 1996) and have been formalized with group theory. Such transformations often affect only a subset of the properties that describe the real world and leave the rest invariant.

The connection between disentanglement and symmetries has recently been formalized by Higgins et al. (2018) into the definitions of Symmetry-Based Disentangled (SBD) and Linearly SBD (LSBD) representations. Quessard et al. (2020) and Caselles-Dupré et al. (2019) propose methods to obtain such SBD and/or LSBD representations, but their methods require supervision on the transformation relationships among data points for the entire training dataset. Moreover, neither work provides a clear metric for the level of disentanglement achieved.

3. SYMMETRY-BASED DISENTANGLEMENT

Higgins et al. (2018) provide a formal description of disentanglement that connects the symmetry transformations affecting the real world (from which data is generated) to the internal representations of a model. The definitions are grounded in concepts from group theory; for a more detailed description of these concepts, please refer to Appendix A. The definitions assume the following setting.

W is the set of possible world states, with underlying symmetry transformations described by a group G and its action • : G × W → W on W. In particular, G can be decomposed as the direct product of K groups, G = G_1 × ... × G_K. Data is obtained via an observation function b : W → X that maps world states to observations in a data space X. A model's internal representation of data is modeled with an inference function h : X → Z that maps data to the representation space Z. Together, the observation and inference functions constitute the model's internal representation of the real world, f : W → Z with f(w) = h(b(w)). The definitions of Symmetry-Based Disentangled (SBD) and Linearly SBD (LSBD) representations formalize the requirement that a model's internal representation f : W → Z should reflect and disentangle the transformation properties of the real world. The definition of SBD can be found in Appendix B. Our work focuses on LSBD representations, in which the transformation properties of the model's internal representations should be linear. The exact definition is as follows.

Linearly Symmetry-Based Disentangled (LSBD) Representations. A model's internal representation f : W → Z, where Z is a vector space, is LSBD with respect to the group decomposition G = G_1 × ... × G_K if

1. there is a decomposition of the representation space into linear subspaces, Z = Z_1 ⊕ ... ⊕ Z_K,
2. each subgroup G_k acts linearly on its subspace Z_k, i.e. there is a group representation ρ_k : G_k → GL(Z_k), and
3. f is equivariant with respect to these actions: for all g = (g_1, ..., g_K) ∈ G and all w ∈ W, f(g • w) = ρ(g) f(w), where ρ(g) = ρ_1(g_1) ⊕ ... ⊕ ρ_K(g_K) acts block-wise on Z.
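To make the LSBD definition concrete, the following toy construction (an illustrative sketch, not a method from the paper) builds an LSBD representation for two circular factors, G = SO(2) × SO(2), acting additively on world states w = (θ1, θ2). Each factor is embedded on the unit circle in its own subspace Z_k = R^2, and the linear action ρ is block-diagonal with 2×2 rotation matrices; the function names and the toy map f are assumptions made for this example.

```python
import numpy as np

def rot(a):
    """2x2 rotation matrix for angle a: a linear representation of SO(2)."""
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

def f(w):
    """Toy internal representation f: W -> Z for world states w = (theta1, theta2).
    Each factor is encoded on the unit circle in its own subspace Z_k = R^2."""
    t1, t2 = w
    return np.array([np.cos(t1), np.sin(t1), np.cos(t2), np.sin(t2)])

def rho(g):
    """Block-diagonal linear action rho(g) = rho_1(g_1) (+) rho_2(g_2) on Z."""
    a1, a2 = g
    out = np.zeros((4, 4))
    out[:2, :2] = rot(a1)
    out[2:, 2:] = rot(a2)
    return out

# Equivariance check f(g . w) = rho(g) f(w), where the group acts on W by
# adding angles componentwise.
rng = np.random.default_rng(0)
w = rng.uniform(0, 2 * np.pi, size=2)
g = rng.uniform(0, 2 * np.pi, size=2)
err = np.linalg.norm(f(w + g) - rho(g) @ f(w))
print(err)  # ~0 up to floating-point error: this f is LSBD for this G
```

Note how the block-diagonal structure of ρ is exactly what makes the representation disentangled: each subgroup G_k transforms only its own subspace Z_k and leaves the other subspace invariant.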

