DCI-ES: AN EXTENDED DISENTANGLEMENT FRAMEWORK WITH CONNECTIONS TO IDENTIFIABILITY

Abstract

In representation learning, a common approach is to seek representations which disentangle the underlying factors of variation. Eastwood & Williams (2018) proposed three metrics for quantifying the quality of such disentangled representations: disentanglement (D), completeness (C) and informativeness (I). In this work, we first connect this DCI framework to two common notions of linear and nonlinear identifiability, thereby establishing a formal link between disentanglement and the closely-related field of independent component analysis. We then propose an extended DCI-ES framework with two new measures of representation quality, explicitness (E) and size (S), and point out how D and C can be computed for black-box predictors. Our main idea is that the functional capacity required to use a representation is an important but thus-far neglected aspect of representation quality, which we quantify using explicitness or ease-of-use (E). We illustrate the relevance of our extensions on the MPI3D and Cars3D datasets.

1. INTRODUCTION

A primary goal of representation learning is to learn representations r(x) of complex data x that "make it easier to extract useful information when building classifiers or other predictors" (Bengio et al., 2013). Disentangled representations, which aim to recover and separate (or, more formally, identify) the underlying factors of variation z that generate the data as x = g(z), are a promising step in this direction. In particular, it has been argued that such representations are not only interpretable (Kulkarni et al., 2015; Chen et al., 2016) but also make it easier to extract useful information for downstream tasks by recombining previously-learnt factors in novel ways (Lake et al., 2017). While there is no single, widely-accepted definition, many evaluation protocols have been proposed to capture different notions of disentanglement based on the relationship between the learnt representation or code c = r(x) and the ground-truth data-generative factors z (Higgins et al., 2017; Eastwood & Williams, 2018; Ridgeway & Mozer, 2018; Kim & Mnih, 2018; Chen et al., 2018; Suter et al., 2019; Shu et al., 2020). In particular, the metrics of Eastwood & Williams (2018), namely disentanglement (D), completeness (C) and informativeness (I), estimate this relationship by learning a probe f to predict z from c, and can be used to relate many other notions of disentanglement (see Locatello et al. 2020, § 6). In this work, we extend this DCI framework in several ways. Our main idea is that the functional capacity required to recover z from c is an important but thus-far neglected aspect of representation quality. For example, consider the case of recovering z from: (i) a noisy version thereof; (ii) raw, high-dimensional data (e.g. images); and (iii) a linearly-mixed version thereof, with each c_i containing the same amount of information about each z_j (precise definition in § 6.1).
The noisy version (i) will do quite well with just linear capacity, but is fundamentally limited by the noise corruption; the raw data (ii) will likely do quite poorly with linear capacity, but eventually outperform (i) given sufficient capacity; and the linearly-mixed version (iii) will perfectly recover z with just linear capacity, yet achieve the worst-possible disentanglement score of D = 0. Motivated by this observation, we introduce a measure of explicitness or ease-of-use based on a representation's loss-capacity curve (see Fig. 1).
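To make the D and C scores concrete, the following is a minimal sketch (not the authors' exact implementation, which derives importances from trained probes such as random forests or lasso regressors) of how D and C can be computed from a non-negative importance matrix R, where R[i, j] measures how important code dimension c_i is for predicting factor z_j. Each score is one minus a normalised entropy: D rewards codes that matter for only one factor, C rewards factors captured by only one code.

```python
import numpy as np

def entropy(p, base):
    """Entropy of a distribution p, in the given log base."""
    p = p[p > 0]
    return -np.sum(p * np.log(p) / np.log(base))

def dci_scores(R):
    """Disentanglement (D) and completeness (C) from an importance matrix.

    R[i, j] >= 0 is the importance of code dimension c_i for factor z_j.
    """
    n_codes, n_factors = R.shape
    # Disentanglement: normalise each row; a one-hot row (code predicts a
    # single factor) has zero entropy and scores 1.
    P = R / (R.sum(axis=1, keepdims=True) + 1e-12)
    D_i = 1.0 - np.array([entropy(P[i], n_factors) for i in range(n_codes)])
    rho = R.sum(axis=1) / R.sum()  # weight codes by total importance
    D = float(np.sum(rho * D_i))
    # Completeness: normalise each column; a one-hot column (factor captured
    # by a single code) scores 1.
    Q = R / (R.sum(axis=0, keepdims=True) + 1e-12)
    C_j = 1.0 - np.array([entropy(Q[:, j], n_codes) for j in range(n_factors)])
    C = float(np.mean(C_j))
    return D, C

# A one-to-one code-factor correspondence scores D = C = 1 ...
R_perfect = np.eye(3)
# ... while equal importance everywhere, as in case (iii), scores D = C = 0.
R_mixed = np.ones((3, 3))
```

The uniform matrix `R_mixed` corresponds to case (iii) in the text: every code dimension is equally informative about every factor, yielding the worst-possible D = 0 even though the representation may be perfectly decodable.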
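The contrast between cases (i) and (iii) can be checked numerically. Below is a small illustrative sketch (with arbitrary choices of noise level, dimensionality and mixing matrix) showing that a linear probe recovers z almost perfectly from a linearly-mixed code, while the noisy code is bounded by an irreducible noise floor at every capacity level:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 5
z = rng.normal(size=(n, d))                  # ground-truth factors

# Case (i): a noisy copy of z -- easy to use, but noise-limited.
c_noisy = z + 0.5 * rng.normal(size=(n, d))
# Case (iii): an invertible linear mix -- D = 0, yet linearly decodable.
A = rng.normal(size=(d, d))
c_mixed = z @ A

def linear_probe_mse(c, z):
    """Fit a least-squares probe f(c) = c W and return its MSE on z."""
    W, *_ = np.linalg.lstsq(c, z, rcond=None)
    return float(np.mean((c @ W - z) ** 2))

mse_mixed = linear_probe_mse(c_mixed, z)  # ~0: z perfectly recoverable
mse_noisy = linear_probe_mse(c_noisy, z)  # stuck near the noise floor
```

Tracing such probe losses over a family of probes of increasing capacity (linear, then progressively larger nonlinear models) yields the loss-capacity curve on which the explicitness measure is based.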

