DCI-ES: AN EXTENDED DISENTANGLEMENT FRAMEWORK WITH CONNECTIONS TO IDENTIFIABILITY

Abstract

In representation learning, a common approach is to seek representations which disentangle the underlying factors of variation. Eastwood & Williams (2018) proposed three metrics for quantifying the quality of such disentangled representations: disentanglement (D), completeness (C) and informativeness (I). In this work, we first connect this DCI framework to two common notions of linear and nonlinear identifiability, thereby establishing a formal link between disentanglement and the closely-related field of independent component analysis. We then propose an extended DCI-ES framework with two new measures of representation quality, explicitness (E) and size (S), and point out how D and C can be computed for black-box predictors. Our main idea is that the functional capacity required to use a representation is an important but thus-far neglected aspect of representation quality, which we quantify using explicitness or ease-of-use (E). We illustrate the relevance of our extensions on the MPI3D and Cars3D datasets.

1. INTRODUCTION

A primary goal of representation learning is to learn representations r(x) of complex data x that "make it easier to extract useful information when building classifiers or other predictors" (Bengio et al., 2013). Disentangled representations, which aim to recover and separate (or, more formally, identify) the underlying factors of variation z that generate the data as x = g(z), are a promising step in this direction. In particular, it has been argued that such representations are not only interpretable (Kulkarni et al., 2015; Chen et al., 2016) but also make it easier to extract useful information for downstream tasks by recombining previously-learnt factors in novel ways (Lake et al., 2017). While there is no single, widely-accepted definition, many evaluation protocols have been proposed to capture different notions of disentanglement based on the relationship between the learnt representation or code c = r(x) and the ground-truth data-generative factors z (Higgins et al., 2017; Eastwood & Williams, 2018; Ridgeway & Mozer, 2018; Kim & Mnih, 2018; Chen et al., 2018; Suter et al., 2019; Shu et al., 2020). In particular, the metrics of Eastwood & Williams (2018), disentanglement (D), completeness (C) and informativeness (I), estimate this relationship by learning a probe f to predict z from c, and can be used to relate many other notions of disentanglement (see Locatello et al. 2020, § 6). In this work, we extend this DCI framework in several ways. Our main idea is that the functional capacity required to recover z from c is an important but thus-far neglected aspect of representation quality. For example, consider the case of recovering z from: (i) a noisy version thereof; (ii) raw, high-dimensional data (e.g. images); and (iii) a linearly-mixed version thereof, with each c_i containing the same amount of information about each z_j (precise definition in § 6.1).
The noisy version (i) will do quite well with just linear capacity, but is fundamentally limited by the noise corruption; the raw data (ii) will likely do quite poorly with linear capacity, but eventually outperform (i) given sufficient capacity; and the linearly-mixed version (iii) will perfectly recover z with just linear capacity, yet achieve the worst-possible disentanglement score of D = 0. Motivated by this observation, we introduce a measure of explicitness or ease-of-use based on a representation's loss-capacity curve (see Fig. 1).

Structure and contributions. First, we connect the DCI metrics to two common notions of linear and nonlinear identifiability (§ 3). Next, we propose an extended DCI-ES framework (§ 4) in which we: (i) introduce two new complementary measures of representation quality: explicitness (E), derived from a representation's loss-capacity curve, and size (S); and (ii) elucidate a means to compute the D and C scores for arbitrary black-box probes (e.g., MLPs). Finally, in our experiments (§ 6), we use our extended framework to compare different representations on the MPI3D-Real (Gondal et al., 2019) and Cars3D (Reed et al., 2015) datasets, illustrating the practical usefulness of our E score through its strong correlation with downstream performance.

2. BACKGROUND

Given a synthetic dataset of observations x = g(z) along with the corresponding K-dimensional data-generating factors z ∈ R^K, the DCI framework quantitatively evaluates an L-dimensional data representation or code c = r(x) ∈ R^L using two steps: (i) train a probe f to predict z from c, i.e., ẑ = f(c) = f(r(x)) = f(r(g(z))); and then (ii) quantify f's prediction error and its deviation from the ideal one-to-one mapping, namely a permutation matrix (with extra "dead" units in c whenever L > K).foot_0 For step (i), Eastwood & Williams (2018) use Lasso (Tibshirani, 1996) or Random Forests (RFs, Breiman 2001) as linear or nonlinear predictors, respectively, for which it is straightforward to read off suitable "relative feature importances".

Definition 2.1. R ∈ R^{L×K} is a matrix of relative importances for predicting z from c via ẑ = f(c) if R_ij captures some notion of the contribution of c_i to predicting z_j s.t. ∀i, j: R_ij ≥ 0 and ∑_{i=1}^L R_ij = 1.

For step (ii), Eastwood & Williams use R and the prediction error to define and quantify three desiderata of disentangled representations: disentanglement (D), completeness (C), and informativeness (I).

Disentanglement. Disentanglement (D) measures the average number of data-generating factors z_j that are captured by any single code c_i. The score D_i is given by D_i = 1 − H_K(P_i·), where H_K(P_i·) = −∑_{k=1}^K P_ik log_K P_ik denotes the entropy of the distribution P_i· over row i of R, with P_ij = R_ij / ∑_{k=1}^K R_ik. If c_i is only important for predicting a single z_j, we get a perfect score of D_i = 1. If c_i is equally important for predicting all z_j (for j = 1, ..., K), we get the worst score of D_i = 0. The overall score D is then given by the weighted average D = ∑_{i=1}^L ρ_i D_i, with ρ_i = (1/K) ∑_{k=1}^K R_ik.

Completeness. Completeness (C) measures the average number of code variables c_i required to capture any single z_j; it has also been called compactness (Ridgeway & Mozer, 2018).
The score C_j for capturing z_j is given by C_j = 1 − H_L(P̃_·j), where H_L(P̃_·j) = −∑_{ℓ=1}^L P̃_ℓj log_L P̃_ℓj denotes the entropy of the distribution P̃_·j over column j of R, with P̃_ij = R_ij (recall from Defn. 2.1 that columns of R sum to one). If a single c_i contributes to z_j's prediction, we get a perfect score of C_j = 1. If all c_i contribute equally to z_j's prediction (for i = 1, ..., L), we get the worst score of C_j = 0. The overall completeness score is given by C = (1/K) ∑_{j=1}^K C_j.

Remark 2.2. Together, D and C quantify the degree of "mixing" between c and z, i.e., the deviation from a one-to-one mapping. They are reported separately as they capture distinct criteria.

Informativeness. The informativeness (I) of representation c about data-generative factor z_j is quantified by the prediction error, i.e., I_j = 1 − E[ℓ(z_j, f_j(c))], where ℓ is an appropriate loss function.foot_1 Note that I_j depends on the capacity of f_j, as depicted in Fig. 1. Thus, for I_j to accurately capture the informativeness of c about z_j, f_j must have sufficient capacity to extract all of the information in c about z_j. This capacity-informativeness dependency motivates a separate measure of representation explicitness in § 4.1. The overall informativeness score is given by I = (1/K) ∑_{j=1}^K I_j.
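As a concrete reference, the D and C computations above can be sketched in a few lines of NumPy. This is a minimal sketch of the definitions, not the original DCI implementation; the function names are ours, and we assume R is given with columns summing to one (Defn. 2.1). Informativeness is omitted since it requires a trained probe's prediction error.

```python
import numpy as np

def entropy(p, base):
    """Entropy of a discrete distribution p, with 0 log 0 := 0."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p) / np.log(base)))

def dci_dc(R):
    """Disentanglement (D) and completeness (C) from an L x K matrix R of
    relative importances whose columns sum to one (Defn. 2.1)."""
    R = np.asarray(R, dtype=float)
    L, K = R.shape
    # D: normalise each row, P_ij = R_ij / sum_k R_ik, then D_i = 1 - H_K(P_i.)
    row_sums = R.sum(axis=1, keepdims=True)
    P = np.divide(R, row_sums, out=np.zeros_like(R), where=row_sums > 0)
    D_i = 1.0 - np.array([entropy(P[i], K) for i in range(L)])
    rho = R.sum(axis=1) / R.sum()  # equals (1/K) sum_k R_ik, as columns sum to 1
    D = float(np.sum(rho * D_i))
    # C: columns of R already sum to one, so C_j = 1 - H_L(R_.j)
    C_j = 1.0 - np.array([entropy(R[:, j], L) for j in range(K)])
    C = float(np.mean(C_j))
    return D, C
```

For a permutation matrix R, `dci_dc` returns (1.0, 1.0); for a uniform R, it returns (0.0, 0.0), matching the best and worst cases described above.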

3. CONNECTION TO IDENTIFIABILITY

The goal of learning a data representation which recovers the underlying data-generating factors is closely related to blind source separation and independent component analysis (ICA, Comon 1994; Hyvärinen & Pajunen 1999; Hyvarinen et al. 2019). Whether a given learning algorithm provably achieves this goal up to acceptable ambiguities, subject to certain assumptions on the data-generating process, is typically formalised using the notion of identifiability. Two common types of identifiability, for linear and nonlinear settings respectively, are the following.

Definition 3.1. We say that c = r(x) = r(g(z)) identifies z up to sign and permutation if c = Pz for some signed permutation matrix P (i.e., |P| is a permutation matrix).

Definition 3.2. We say c identifies z up to permutation and element-wise reparametrisation if there exists a permutation π of {1, ..., K} and invertible scalar functions {h_k}_{k=1}^K s.t. ∀j: c_j = h_j(z_{π(j)}).

We now establish theoretical connections between the DCI framework and these identifiability types.

Proposition 3.3. If D = C = 1 and K = L (i.e., dim(c) = dim(z)), then R is a permutation matrix.

All proofs are provided in Appendix A. Using Prop. 3.3, we can establish links to identifiability, provided the inferred representation c perfectly predicts the true data-generating factors z, i.e., I = 1.

Corollary 3.4. Under the same conditions as Prop. 3.3, if z = W^⊤c (so that I = 1) for some W with R_ij = |w_ij| / ∑_{i=1}^L |w_ij|, then c identifies z up to sign and permutation (Defn. 3.1).

For nonlinear f, we give a more general statement for suitably-chosen feature-importance matrices R.

Corollary 3.5. Under the same conditions as Prop. 3.3, let z = f(c) (so that I = 1) with f an invertible and differentiable nonlinear function, and let R be a matrix of relative feature importances for f (Defn. 2.1) with the property that R_ij = 0 if and only if f_j does not depend on c_i, i.e., ∂f_j/∂c_i ≡ 0.
Then c identifies z up to permutation and element-wise reparametrisation (Defn. 3.2).

Remark 3.6. While the if part of Corollary 3.5 holds for most feature-importance measures, the only if part, in general, does not: not using a feature c_i is typically a sufficient condition for R_ij = 0, but it need not be a necessary condition (as required for Corollary 3.5). E.g., measures based on average performance may not satisfy this, since a feature may not contribute on average but still be used, sometimes helping and sometimes hurting performance (see § 7 for further discussion). In contrast, Gini importances, as used in random forests, do satisfy the necessary condition. While the non-invertibility of random forests prevents an explicit link to identifiability (typically studied for continuous features), they can still be a principled choice in practice (where features are often categorical).

Summary. We have established that the learnt representation c identifies the ground-truth z up to:
• sign and permutation if D = C = I = 1 and f is linear;
• permutation and element-wise reparametrisation if D = C = I = 1 and R_ij = 0 ⇔ ∂f_j/∂c_i ≡ 0.
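To make the linear case (Corollary 3.4) concrete, the following self-contained numerical check, with illustrative dimensions, seed, and permutation of our own choosing, mixes factors with a signed permutation, fits a least-squares linear probe, and verifies that the resulting relative-importance matrix is a permutation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
z = rng.normal(size=(10_000, K))                 # ground-truth factors (rows = samples)
P = np.zeros((K, K))
P[[0, 1, 2, 3], [2, 0, 3, 1]] = [1, -1, 1, -1]   # a signed permutation matrix
c = z @ P.T                                      # c = P z for each sample

# Linear probe z = W^T c, fit by least squares; the data is noiseless, so the fit
# is exact and W recovers P (up to numerical error)
W, *_ = np.linalg.lstsq(c, z, rcond=None)        # solves c @ W ~= z
R = np.abs(W) / np.abs(W).sum(axis=0, keepdims=True)  # Defn. 2.1: columns sum to 1
```

Since |P| is a permutation matrix, R equals |P| exactly, so D = C = 1 and c identifies z up to sign and permutation, as Corollary 3.4 states.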

4. EXTENDED DCI-ES FRAMEWORK

Motivated by our theoretical insights from § 3 (considering different probe function classes provides links to different types of identifiability) and the empirically-observed performance differences between representations when probed with different-capacity probes (Fig. 1), we now propose several extensions of the DCI framework.

4.1. EXPLICITNESS (E)

We first introduce a new complementary notion of disentanglement based on the functional capacity required to recover or predict z from c. The key idea is to measure the explicitness or ease-of-use (E) of a representation using its loss-capacity curve.

Notation. Let F be a probe function class (e.g., MLPs or RFs), let f*_j ∈ arg min_{f∈F} E[ℓ(z_j, f(c))] be a minimum-loss probe for factor z_j on a held-out data split,foot_2 and let Cap(·) be a suitable capacity measure on F; e.g., for RFs, Cap(f) could correspond to the maximum tree depth of f.

Loss-capacity curves. A loss-capacity curve for representation c, factor z_j, and probe class F displays test-set loss against probe capacity for increasing-capacity probes f ∈ F (see Fig. 1). To plot such a curve, we must train T predictors with capacities κ_1, ..., κ_T to predict z_j, with

f^t_j ∈ arg min_{f∈F} E[ℓ(z_j, f(c))] s.t. Cap(f) = κ_t.    (4.1)

Here κ_1, ..., κ_T is a list of T increasing probe capacities, ideallyfoot_3 shared by all representations, with suitable choices for κ_1 and κ_T depending on both F and the dataset. For example, we may choose κ_T to be large enough for all representations to achieve their lowest loss and, for random forests, we may choose an initial tree depth of κ_1 = 1 and then T − 2 tree depths between 1 and κ_T.

AULCC. We next define the Area Under the Loss-Capacity Curve (AULCC) for representation c, factor z_j, and probe class F as the (approximate) area between the corresponding loss-capacity curve and the loss-line of our best predictor, ℓ^{*,c}_j = E[ℓ(z_j, f*_j(c))]. To compute this area, depicted in Fig. 2, we use the trapezoidal rule

AULCC(z_j, c; F) = ∑_{t=2}^{t*,c} [½(ℓ^{t−1,c}_j + ℓ^{t,c}_j) − ℓ^{*,c}_j] · Δκ_t,

where t*,c denotes the index of c's lowest-loss capacity κ^{*,c}; ℓ^{t,c}_j = E[ℓ(z_j, f^t_j(c))] the test-set loss with predictor f^t_j of Eq. (4.1); and Δκ_t = κ_t − κ_{t−1} the size of the capacity interval at step t. If the lowest loss is achieved at the lowest capacity, i.e. t*,c = 1, we set AULCC = 0.

Explicitness. We define the explicitness (E) of representation c for predicting factor z_j with probe class F as

E(z_j, c; F) = 1 − AULCC(z_j, c; F) / [½(κ_T − κ_1)(ℓ^b_j − ℓ*_j)],

where ℓ^b_j is a suitable baseline loss (e.g., that of predicting E[z_j]) and ℓ*_j a suitable lowest loss (e.g., 0) for F. The denominator represents the area of the light-blue triangle in Fig. 2, normalizing the AULCC such that E_j ∈ [−1, 1] so long as ℓ*_j < ℓ^b_j. The best score E_j = 1 means that the best loss was achieved with the lowest-capacity probe f^1_j, i.e., ℓ^{*,c}_j = ℓ^{1,c}_j and κ^{*,c} = κ_1, and thus our representation c was explicit or easy-to-use for predicting z_j with f ∈ F, since no surplus capacity (beyond κ_1) was required to achieve our lowest loss. In contrast, E_j = 0 means that the loss decreased linearly from ℓ^b_j to ℓ*_j with increased probe capacity, i.e., AULCC = Normalizer in Fig. 2. More generally, if ℓ^{*,c} = ℓ*, i.e. the lowest loss for F can be reached with representation c, then E_j < 0 implies that the loss decreased sub-linearly with increased capacity, while E_j > 0 implies it decreased super-linearly. The overall explicitness score is given by E = (1/K) ∑_{j=1}^K E_j.

E vs. I. While the informativeness score I_j captures the (total) amount of information in c about z_j, the explicitness score E_j captures the ease-of-use of this information. In particular, while I_j is quantified by the lowest prediction error achieved with any capacity, ℓ^{*,c}, corresponding to a single point on c's loss-capacity curve, E_j is quantified by the area under this curve.

A fine-grained picture of identifiability. Compared to the commonly-used mean correlation coefficient (MCC) or Amari distance (Amari et al., 1996; Yang & Amari, 1997), the D, C, I, E scores represent empirical measures which: (i) easily extend to mismatches in dimensionality, i.e., L > K; and (ii) provide a more fine-grained picture of identifiability (violations). If the initial probe capacity κ_1 is linear and R satisfies Corollary 3.5, we have that:
• D = C = I = E = 1 ⟹ identified up to sign and permutation (Defn. 3.1);
• D = C = I = 1 ⟹ identified up to permutation and element-wise reparametrisation (Defn. 3.2);
• I = E = 1 ⟹ identified up to invertible linear transformation (cf. Khemakhem et al., 2020).
Thus, if D = C = I = E = 1 does not hold exactly, which score deviates the most from 1 may provide valuable insight into the type of identifiability violation.

Probe classes. As emphasized above, whether or not a representation c is explicit or easy-to-use for predicting factor z_j depends on the probe class F used, e.g., MLPs or RFs. More generally, the explicitness of a representation depends on the way in which it is used in downstream applications, with different downstream uses or probe classes resulting in different definitions of explicit or easy-to-use information. We thus conduct experiments with different probe classes in § 6.
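A minimal sketch of the AULCC and explicitness computations, following the trapezoidal-rule definition above. Function names are ours, and we assume test-set losses are given at each capacity κ_1, ..., κ_T:

```python
import numpy as np

def aulcc(losses, capacities):
    """Area between the loss-capacity curve and the best-loss line, computed
    with the trapezoidal rule up to the lowest-loss capacity."""
    losses = np.asarray(losses, dtype=float)
    caps = np.asarray(capacities, dtype=float)
    t_star = int(np.argmin(losses))      # index of the lowest-loss capacity
    l_star = losses[t_star]
    if t_star == 0:                      # best loss already at lowest capacity
        return 0.0
    area = 0.0
    for t in range(1, t_star + 1):
        area += (0.5 * (losses[t - 1] + losses[t]) - l_star) * (caps[t] - caps[t - 1])
    return area

def explicitness(losses, capacities, baseline_loss, best_loss=0.0):
    """E_j = 1 - AULCC / [0.5 * (kappa_T - kappa_1) * (l_b - l_*)]."""
    normalizer = 0.5 * (capacities[-1] - capacities[0]) * (baseline_loss - best_loss)
    return 1.0 - aulcc(losses, capacities) / normalizer
```

As a sanity check on the definition, a loss that decreases linearly from the baseline to 0 gives E_j = 0, while achieving the best loss already at the lowest capacity gives E_j = 1.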

4.2. SIZE (S)

We next introduce a measure of representation size (S), motivated by the observation that larger representations tend to be both more informative and more explicit (see Tab. 1; more details below). Reporting S thus allows size-informativeness and size-explicitness trade-offs to be analysed.

A measure of size. We measure representation size (S) relative to the ground truth as S = K/L = dim(z)/dim(c). When L ≥ K, as is often the case, we have S ∈ (0, 1], with the perfect score being S = 1. However, if we also consider the L < K case, which would likely sacrifice some informativeness, we have S ∈ (1, K].

Larger representations are often more informative. When L < K, it is intuitive that larger representations are more informative: they can simply preserve more information about z. When L > K, however, it is also common for larger representations to be more informative, perhaps due to an easier optimization landscape (Frankle & Carbin, 2019; Golubeva et al., 2021). Tab. 1 illustrates this point, where AE-5 denotes an autoencoder with L = 5. Note that K = 7 for MPI3D-Real (see § 6).

Larger representations are often more explicit. The explicitness of a representation also depends on its size: larger representations tend to be more explicit, as is apparent from the second column of Tab. 1. To explain this, we plot the corresponding loss-capacity curves in Fig. 3. Here we see that the increased explicitness (i.e., smaller AULCC) of larger representations stems from a substantially lower initial loss when using a linear-capacity MLP probe. The fact that larger representations perform better with linear-capacity MLPs is unsurprising, since they afford the probe more parameters.

4.3. PROBE-AGNOSTIC FEATURE IMPORTANCES

Finally, to meaningfully discuss more flexible probe-function choices within the DCI-ES framework, we point out that the D and C scores can be computed for arbitrary black-box probes f by using probe-agnostic feature-importance measures. In particular, in our experiments (§ 6), we use SAGE (Covert et al., 2020), a probe-agnostic measure of global feature importance, to compute R for MLP and RFF probes.

5. RELATED WORK

Explicit representations. Eastwood & Williams (2018, § 2) noted that the informativeness score with a linear probe quantifies the amount of information in c about z that is "explicitly represented", while Ridgeway & Mozer (2018, § 3) proposed a measure of "explicitness" which simply reports the informativeness score with a linear probe. In contrast, our DCI-ES framework differentiates between the amount of information in c about z (informativeness) and the ease-of-use of this information (explicitness). This allows a more fine-grained analysis of the relationship between c and z, both theoretically (distinguishing between more identifiability equivalence classes; § 3) and empirically (§ 6).

Loss-capacity curves. Plotting loss against model complexity or capacity has long been used in statistical learning theory, e.g., for studying the bias-variance trade-off (Hastie et al., 2009, Fig. 7.1). More recently, such loss-capacity curves have been used to study the double-descent phenomenon of neural networks (Belkin et al., 2019; Nakkiran et al., 2021) as well as the scaling laws of large language models (Kaplan et al., 2020). However, they have yet to be used for assessing the quality or explicitness of representations.

Loss-data curves. Whitney et al. (2020) use loss-data curves, which plot loss against dataset size, to assess representations. They measure the quality of a representation by the sample complexity of learning probes that achieve low loss on a task of interest. Loss-data curves are also studied under the term learning curves in standard/purely supervised-learning settings (see, e.g., Viering & Loog, 2021, for a recent review). In contrast, we focus on functional complexity and the task of predicting the data-generative factors z; we discuss explicitness for other tasks y in § 7.

6. EXPERIMENTS

6.1. SETUP

Representations. We use the following synthetic baselines and standard models as representations:
• Noisy labels: c = z + ϵ, with ϵ ∼ N(0, 0.01 · I_K).
• Linearly-mixed labels: c = Wz, with W_ij = 1/(LK) + ϵ_ij and ϵ_ij ∼ N(0, 0.001), to achieve "uniform mixing" (each z_j evenly distributed across the c_i's) while also ensuring the invertibility of W a.s.
• Raw data (pixels): c = x = g(z).
• Others: We also use VAEs (Kingma & Welling, 2014) with 10 latents (L = 10), β-VAEs (Higgins et al. 2017, L = 10), and an ImageNet-pretrained ResNet18 (He et al. 2016, L = 512).

Probes. We use MLPs, RFs and Random Fourier Features (RFFs, Rahimi & Recht 2007) to predict z from c, with RFFs having a linear classifier on top. For MLPs, we start with linear probes (no hidden layers) and then increase capacity by adding two hidden layers and varying their widths from 2 × K to 512 × K. We then measure capacity based on the number of "extra" parameters beyond those of the linear probe, and compute feature importances using SAGE with permutation-sampling estimators and marginal sampling of masked values (see https://github.com/iancovert/sage). For RFs, we use ensembles of 100 trees, control capacity by varying the maximum depth between 1 and 32, and compute feature importances using Gini importance. For RFFs, we control capacity by exponentially increasing the number of random features from 2^4 to 2^17, and compute feature importances using SAGE.

Implementation details. We split the data into training, validation and test sets of size 295k, 16k, and 726k respectively for MPI3D-Real, and 12.6k, 1.4k and 3.4k for Cars3D. We use the validation split for hyperparameter selection and report results on the test split. We train MLP probes using the Adam optimizer (Kingma & Ba, 2015) for 100 epochs. We use mean-squared-error and cross-entropy losses for continuous and discrete factors z_j, respectively. To compute E_j, we use the baseline losses of predicting E[z_j] and of a random classifier for continuous and discrete z_j, respectively.
Further details can be found in our open-source code: https://github.com/andreinicolicioiu/DCI-ES.
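For concreteness, the noisy-labels and linearly-mixed-labels baselines above can be generated as follows. This is an illustrative sketch with sample size and seed of our own choosing, and we treat the stated values 0.01 and 0.001 as variances:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 1000, 7          # K = 7 ground-truth factors for MPI3D-Real
L = K                   # square mixing for the synthetic baselines
z = rng.normal(size=(N, K))          # stand-in for the (normalised) factors

# Noisy labels: c = z + eps, eps ~ N(0, 0.01 * I_K)
c_noisy = z + rng.normal(scale=np.sqrt(0.01), size=(N, K))

# Linearly-mixed labels: c = W z, with W_ij = 1/(L*K) + eps_ij and
# eps_ij ~ N(0, 0.001), giving near-uniform mixing with W invertible a.s.
W = 1.0 / (L * K) + rng.normal(scale=np.sqrt(0.001), size=(L, K))
c_mixed = z @ W.T
```

The raw-data baseline simply uses the images x = g(z) themselves as the representation, so no construction is needed.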

6.2. EVALUATION RESULTS: CURVES AND SCORES

Loss-capacity curves. Fig. 1 depicts loss-capacity curves for the three probes and two datasets, averaged over factors z_j. In all six plots, the noisy-labels baseline performs well at low capacity and is then surpassed by other representations given sufficient capacity, as expected. Note that the linearly-mixed-labels baseline immediately achieves ≈ 0 loss with MLP probes but not with RFF or RF probes, supporting the idea that the explicitness or ease-of-use of a representation depends on the way in which it is used. Also note that, with MLP probes and log(excess #params) as the capacity measure, larger input representations are afforded more parameters with a linear probe and are thus more expressive. This further explains why larger representations are often more explicit, and highlights the difficulty of measuring the capacity of MLPs, an active area of research in its own right which we discuss in § 7. Finally, in Appendix B.2, we investigate the effect of dataset size by plotting loss-capacity curves for different dataset sizes, observing that larger datasets yield smaller performance gaps between: (i) synthetic and learned representations; and (ii) small and large representations (see Fig. 10).

DCI-ES scores. Tab. 2 reports the corresponding DCI-ES scores, along with some oracle scores for MLPs.
Note that: (i) the GT labels z get perfect scores of 1 for all metrics; (ii) by attaining very low D and C scores but near-perfect E scores, the linearly-mixed labels expose the key difference between mixing-based (D,C) and functional-capacity-based (E) measures of the simplicity of the c-z relationship; (iii) larger representations (ImgNet-pretr, raw data) tend to be more explicit than smaller ones (VAE, β-VAE), with S and E together capturing this size-explicitness trade-off; and (iv) β-VAE achieves better mixing-based scores (D,C) but similar E scores compared to the VAE, illustrating that these two "disentanglement" notions are indeed orthogonal and complementary.

6.3. DOWNSTREAM RESULTS: SCORE CORRELATIONS

Setup. To illustrate the practical usefulness of our explicitness score, we calculate its correlation with downstream performance when using low-capacity probes. Using MPI3D, we create 14 synthetic downstream tasks: 7 regression tasks with y_i = M_i z and (M_i)_jk ∼ U(0, 1), and 7 classification tasks with y_i = 1{z_i > m_i}, where m_i is the median value of factor z_i. For representations, we use AEs, VAEs and β-VAEs, 2 latent dimensionalities (i.e. dim(c)) of 10 and 50, and 5 random seeds, resulting in a total of 30 different representations c. To compute the correlations, we first compute the DCIE scores as before, training MLP and RF probes f to predict z from c, i.e. ẑ_j = f_j(c), and then compute the downstream performance by training new low-capacity MLP and RF probes f̃ to predict y from c, i.e. ŷ_i = f̃_i(c) (see Fig. 4c). For MLP probes, low capacity means linear. For RF probes, low capacity means a maximum tree depth of 10. Next, we average the downstream performances across all 14 tasks before computing the correlation coefficient between this average and each of the D, C, I, and E scores.

Table 2: DCI-ES scores for different probes, datasets and representations. Empirical scores using MLP, RFF and RF probes trained on the MPI3D-Real and Cars3D datasets, as well as theoretical/oracle scores for some simple representations with MLPs (MLP*). We show averages over 3 random seeds; standard deviations were all < 0.05. Note that which representation is deemed "best" depends on the application of interest: some are more disentangled, some more informative, some more explicit, etc.

Representation  Probe   MPI3D: D C I E S    Cars3D: D C I E S
GT labels z     MLP*    1  1  1   1  1      1  1  1  1  1
Noisy labels    MLP*    1  1  0.9 1  1.0    1  1  0.
(remaining rows of Tab. 2 not recovered)

Analysis. Figs. 4a and 4b show that E is strongly correlated with downstream performance when using both MLP (ρ = 0.96, p = 8e-18) and RF probes (ρ = 0.88, p = 2e-10). In contrast, mixing-based disentanglement scores (D, C) exhibit much weaker correlations with MLP probes, corroborating the results of Träuble et al. (2022, Fig. 8), who also found a weak correlation between D and downstream performance on reinforcement learning tasks with MLPs. See App. B.1 for further details and results.
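The 14 synthetic downstream tasks described above can be sketched as follows, with an illustrative sample size and seed; M_i and m_i follow the definitions in the setup:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5000, 7
z = rng.uniform(size=(N, K))         # stand-in for the MPI3D factors

# 7 regression tasks: y_i = M_i z, with entries of M_i drawn from U(0, 1)
regression_targets = [z @ rng.uniform(size=K) for _ in range(K)]

# 7 classification tasks: y_i = 1{z_i > m_i}, with m_i the median of factor z_i
classification_targets = [(z[:, i] > np.median(z[:, i])).astype(int)
                          for i in range(K)]
```

Splitting each factor at its median makes every classification task (approximately) balanced by construction.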

7. DISCUSSION

Why connect disentanglement and identifiability? Connecting prediction-based evaluation in the disentanglement literature to the more theoretical notion of identifiability has several benefits. Firstly, it provides a concrete link between two often-separate communities. Secondly, it endows the often empirically-driven or practice-focused disentanglement metrics with a solid and well-studied theoretical foundation. Thirdly, compared to the commonly-used MCC or Amari distance, it provides the ICA or identifiability community with more fine-grained empirical measures, as discussed in § 4.1.

Measuring probe capacity. Our measure of explicitness E depends strongly on the choice of capacity measure for a probe or function class. For some probes, like RFs or RFFs, there exist natural measures of capacity. However, for other probes, like MLPs, coming up with a good capacity measure is itself an important and active area of research (Jiang et al., 2020; Dziugaite et al., 2020). Another difficulty arises from choosing a capacity scale, with different scales (e.g., log, linear, etc.) leading to loss-capacity curves with different shapes, areas and thus explicitness scores. To investigate the extent of this issue, i.e., the sensitivity of our explicitness measure to the choice of capacity scale, Fig. 5 compares the explicitness scores when using logarithmic and linear scaling. Here we see that the ranking essentially remains the same, except for the raw-data representation with MLP probes.

Measuring feature importance. Similarly, the choice of feature-importance measure has a strong influence on the D and C scores, with some probes having natural or in-built measures (e.g., random forests) and others not (e.g., MLPs). For the latter, we proposed the use of probe-agnostic feature-importance measures like SAGE, and specified the conditions (Corollary 3.5) that importance measures must satisfy if the resulting D and C scores are to be connected to identifiability.
As with probe capacity, coming up with good measures of feature importance is its own orthogonal field of study (e.g., model explainability), with future advances likely to improve the DCI-ES framework. What about explicitness for other tasks y? While we focused on the explicitness or ease-of-use of a representation for predicting the data-generative factors z, one may also be interested in its ease-of-use for other tasks/labels y. While it is often implicitly assumed that the ease-of-use for predicting z correlates with the ease-of-use for common tasks of interest (e.g., object classification, segmentation, etc.), future work could directly evaluate the explicitness of a representation for particular tasks y. For example, one could consider the entire loss-capacity curve when benchmarking self-supervised representations on ImageNet, rather than just linear-probe performance (a single slice). Future work could also explore the trade-off between explicit but task-specific and implicit but task-agnostic representations.

8. CONCLUSION

We have presented DCI-ES, an extended disentanglement framework with two new complementary measures of representation quality, and proven its connections to identifiability. In particular, we have advocated for additionally measuring the explicitness (E) of a representation by the functional capacity required to use it, and proposed to quantify this explicitness using a representation's loss-capacity curve. Together with the size (S) of a representation, we believe that our extended DCI-ES framework allows for a more fine-grained and nuanced benchmarking of representation quality.

A. PROOFS

Corollary 3.5 (restated). Under the same conditions as Prop. 3.3, let z = f(c) (so that I = 1) with f an invertible and differentiable nonlinear function, and let R be a matrix of relative feature importances for f (Defn. 2.1) with the property that R_ij = 0 if and only if f_j does not depend on c_i, i.e., ∂f_j/∂c_i ≡ 0. Then c identifies z up to permutation and element-wise reparametrisation (Defn. 3.2).

Proof. For any j, consider z_j = f_j(c). By Prop. 3.3, R is a permutation matrix, so column j of R contains exactly one non-zero entry, in row π(j) for some permutation π of {1, ..., K}. Hence, by the assumed property of R, f_j(c) does not depend on c_i for all i ≠ π(j), and thus z_j = f_j(c_{π(j)}) for all j. By invertibility of f, we obtain c_j = h_j(z_{j′}) with h_j = f_{j′}^{−1} and j′ = π^{−1}(j).

B.1 DOWNSTREAM CORRELATIONS

Here we present the full results of the correlations between the DCIE scores and downstream performance, the latter with low-capacity probes (as discussed in § 6.3). In Tab. 3 and Tab. 4 we show the values of the Pearson and Spearman correlations alongside the corresponding p-valuesfoot_4. Note that some of the assumptions behind these p-values, e.g. that the DCIE scores and downstream performances are normally distributed, likely do not hold. Thus, these p-values should not be interpreted as precise probabilities but rather as rough indications of statistical significance. In Tab. 5 we show the correlations for regression and classification tasks separately, with both task types exhibiting similar correlations. We note that E has the strongest correlation with downstream performance (when using low-capacity probes for the downstream task).

Probe f   D                C                 I                E
MLP       0.12 (5×10^-1)   -0.07 (7×10^-1)   0.55 (2×10^-3)   0.94 (1×10^-14)
RF        0.81 (6×10^-8)   0.75 (2×10^-6)    0.28 (1×10^-1)   0.78 (3×10^-7)

Score-by-score analysis. To get a deeper insight into the correlations reported in Tabs. 3 and 4, we plot each of the D, C, I and E scores against downstream performance for each of the 30 models considered in § 6.3. As shown in Figs. 6 to 9, only E correlates strongly with downstream performance for both probe types, again highlighting: (i) the value that E adds to the existing DCI framework; and (ii) the practical usefulness of reporting E when comparing/evaluating learned representations.

B.2 EFFECT OF DATASET SIZE

In Fig. 10 we present loss-capacity curves obtained when using different amounts of data to train the MLP probes. As shown, larger datasets have smaller performance gaps between: (i) synthetic and learned representations; and (ii) small and large representations.



Footnotes:

1. W.l.o.g., it can be assumed that z_i and c_j are normalised to have mean zero and variance one for all i, j, for otherwise such normalisation can be "absorbed" into g(·) and r(·).
2. Here we deviate from Eastwood & Williams (who had I_j = E[ℓ(z_j, f_j(c))]) such that 1 is now the best score.
3. In practice, all expectations are taken w.r.t. the corresponding empirical (train/validation/test) distributions.
4. True for RFs but not for input-size-dependent MLPs (see § 6).
5. Computed using https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html



Figure 1: Loss-capacity curves. Empirical loss-capacity curves (see § 4.1) for various representations (see legend), datasets (top: MPI3D-Real; bottom: Cars3D), and probe types (left: multi-layer perceptrons / MLPs; middle: Random Fourier Features / RFFs; right: Random Forests / RFs). The loss was first averaged over factors z_j, and then means and 95% confidence intervals were computed over 3 random seeds. Details in § 6.

Figure 2: Explicitness via the area under the loss-capacity curve (AULCC). Here, κ_1, ..., κ_T (x-axis) are a sequence of increasing function capacities and ℓ_{1,c}, ..., ℓ_{T,c} (y-axis) are the losses achieved by the corresponding optimal predictors for c. The lowest loss ℓ_{*,c} is achieved at capacity κ_{*,c}, while ℓ_b and ℓ_* are suitable baseline and best-possible losses for the probe class.
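The AULCC of Fig. 2 can be computed with a simple trapezoidal rule. The sketch below normalises the area by the rectangle (κ_T − κ_1)(ℓ_b − ℓ_*); this normalisation is an illustrative assumption, not necessarily the paper's exact definition of E (see § 4.1).

```python
def aulcc(capacities, losses, best_loss):
    # Trapezoidal area between the loss-capacity curve and the best-possible
    # loss l_*, accumulated over the capacity range [kappa_1, kappa_T].
    area = 0.0
    for k in range(len(capacities) - 1):
        width = capacities[k + 1] - capacities[k]
        mean_height = 0.5 * ((losses[k] - best_loss) + (losses[k + 1] - best_loss))
        area += width * mean_height
    return area

def explicitness(capacities, losses, baseline_loss, best_loss):
    # Normalise by the maximal possible area (a flat curve at the baseline
    # loss), so a representation whose loss drops immediately gets E near 1.
    max_area = (capacities[-1] - capacities[0]) * (baseline_loss - best_loss)
    return 1.0 - aulcc(capacities, losses, best_loss) / max_area

caps = [1, 2, 3, 4]          # kappa_1 .. kappa_T
easy = [0.1, 0.0, 0.0, 0.0]  # loss drops immediately: high explicitness
hard = [1.0, 0.8, 0.5, 0.0]  # loss drops only at high capacity: low E
E_easy = explicitness(caps, easy, baseline_loss=1.0, best_loss=0.0)
E_hard = explicitness(caps, hard, baseline_loss=1.0, best_loss=0.0)  # 0.4
```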

Figure 3: Loss-capacity curves for auto-encoders of various sizes on MPI3D-Real with MLP probes.

We perform our analysis of loss-capacity curves on the MPI3D-Real (Gondal et al., 2019) and Cars3D (Reed et al., 2015) datasets. MPI3D-Real contains ≈ 1M real-world images of a robotic arm holding different objects, with seven annotated ground-truth factors: object colour (6), object shape (6), object size (2), camera height (3), background colour (3), and two degrees of rotation of the arm (40 × 40); numbers in brackets indicate the number of possible values for each factor. Cars3D contains ≈ 17.5k rendered images of cars with three annotated ground-truth factors: camera elevation (4), azimuth (24) and car type (183).
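The quoted dataset sizes follow directly from the factor cardinalities listed above, as a quick check confirms:

```python
from math import prod

# Factor cardinalities as listed above.
mpi3d_factors = [6, 6, 2, 3, 3, 40, 40]  # MPI3D-Real
cars3d_factors = [4, 24, 183]            # Cars3D

print(prod(mpi3d_factors))   # 1036800, i.e. about 1M images
print(prod(cars3d_factors))  # 17568, i.e. about 17.5k images
```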

Figure 4: (a) Correlation coefficients ρ between DCIE scores and downstream performance with low-capacity probes. (b) E vs. downstream performance with linear MLPs. (c) DCIE scores are computed by predicting z from c with probes f; downstream tasks y = h(z) are then solved by predicting y from c with low-capacity probes.

Figure 5: Explicitness scores on MPI3D-Real (top) and Cars3D (bottom) for different representations (see legend). Each pair of points represents the score with logarithmic (left) and linear (right) capacity-scaling.

Table 3: Pearson correlation coefficient ρ between the D, C, I and E scores and downstream performance, along with the corresponding p-values (in parentheses). See § 6.3 for experimental details.

Probe f | D             | C              | I             | E
MLP     | … (4 × 10⁻¹)   | 0.28 (9 × 10⁻¹) | 0.47 (9 × 10⁻³) | 0.96 (8 × 10⁻¹⁸)
RF      | 0.80 (1 × 10⁻⁷) | 0.70 (2 × 10⁻⁵) | 0.40 (3 × 10⁻²) | 0.88 (2 × 10⁻¹⁰)

Table 4: Spearman rank correlation between the D, C, I and E scores and downstream performance, along with the corresponding p-values (in parentheses).

Figure 6: Explicitness (E) vs. downstream performance. Scatter plots show 30 data points: 3 models (AEs, VAEs, β-VAEs) × 2 latent dimensionalities (L = 10 and L = 50) × 5 random seeds.

Figure 7: Disentanglement (D) vs. downstream performance. Scatter plots show 30 data points: 3 models (AEs, VAEs, β-VAEs) × 2 latent dimensionalities (L = 10 and L = 50) × 5 random seeds.

I, E and S scores for auto-encoders of various sizes on MPI3D-Real with MLP probes.

which summarises each feature's importance based on its contribution to predictive performance, making use of Shapley values (Shapley, 1953) to account for complex feature interactions. Such probe-agnostic measures allow the D and C scores to be computed for probes with no inherent or built-in notion of feature importance (e.g., MLPs), thereby generalising the Lasso and RF examples of Eastwood & Williams (2018, § 4.3). While SAGE has several practical advantages over other probe-agnostic methods (see, e.g., Covert et al., 2020, Table 1), it may not satisfy the conditions required to link the D and C scores to different identifiability equivalence classes (see Remark 3.6). Future work may explore alternative methods which do, e.g., by looking at a feature's mean absolute attribution value (Lundberg & Lee, 2017), since, intuitively, absolute contributions do not allow positive and negative attributions to cancel on average (cf. Remark 3.6).
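As a simplified stand-in for SAGE (which requires Shapley-value estimation), the sketch below builds a probe-agnostic importance matrix R via permutation importance: the importance of code dimension c_i for factor z_j is the increase in the probe's loss when c_i is shuffled. This is a hypothetical illustration of the idea, not the paper's method.

```python
import random

def mse_per_factor(preds, targets):
    # Mean squared error for each factor z_j separately.
    n_factors = len(targets[0])
    return [sum((p[j] - t[j]) ** 2 for p, t in zip(preds, targets)) / len(targets)
            for j in range(n_factors)]

def permutation_importance(predict, codes, targets):
    # R[i][j]: loss increase for factor j when code dimension i is shuffled,
    # column-normalised so that R is a valid feature importance matrix.
    rng = random.Random(0)
    base = mse_per_factor([predict(c) for c in codes], targets)
    n_codes, n_factors = len(codes[0]), len(targets[0])
    R = [[0.0] * n_factors for _ in range(n_codes)]
    for i in range(n_codes):
        column = [c[i] for c in codes]
        rng.shuffle(column)
        perturbed = [c[:i] + [v] + c[i + 1:] for c, v in zip(codes, column)]
        loss = mse_per_factor([predict(c) for c in perturbed], targets)
        for j in range(n_factors):
            R[i][j] = max(loss[j] - base[j], 0.0)
    for j in range(n_factors):  # column-normalise
        total = sum(R[i][j] for i in range(n_codes)) or 1.0
        for i in range(n_codes):
            R[i][j] /= total
    return R

# Toy black-box probe: z_0 depends only on c_0, z_1 only on c_1; c_2 is unused.
predict = lambda c: [2 * c[0], -3 * c[1]]
codes = [[k / 10, (7 * k % 10) / 10, 0.0] for k in range(10)]
targets = [predict(c) for c in codes]
R = permutation_importance(predict, codes, targets)  # R[0][0] = R[1][1] = 1
```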

Table 5: Pearson correlation coefficient ρ between the D, C, I and E scores and downstream performance for each task type (regression and classification). Correlations are similar across both task types.

ACKNOWLEDGMENTS

The authors would like to thank Chris Williams, Francesco Locatello, Nasim Rahaman, Sidak Singh and Yash Sharma for helpful discussions and comments. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A, 01IS18039B; and by the Machine Learning Cluster of Excellence, EXC number 2064/1 -Project number 390727645.

A PROOFS

A.1 PROOF OF PROPOSITION 3.3

Proposition 3.3. If D = C = 1 and K = L (i.e., dim(c) = dim(z)), then R is a permutation matrix.

Proof. First, by Defn. 2.1, we have R_ij ≥ 0 and ∑_{i=1}^L R_ij = 1 for all j, so 0 ≤ R_ij ≤ 1. It follows that for all i, j, P_i• and P̃_•j lie in ∆^{K-1}, where ∆^{K-1} denotes the K-dimensional probability simplex; i.e., P_i• and P̃_•j are valid probability vectors. Hence, the Shannon entropies H_K(P_i•) and H_K(P̃_•j) are well-defined for all i, j and, due to using log base K in the definition of H_K (see § 2), are bounded in [0, 1]. It follows that 0 ≤ D_i ≤ 1 and 0 ≤ C_j ≤ 1 for all i, j. Since D and C are convex combinations of the D_i and C_j, D = C = 1 implies D_i = C_j = 1, and hence H_K(P_i•) = H_K(P̃_•j) = 0, for all i, j. For any probability vector p ∈ ∆^{K-1}, H_K(p) = 0 implies p_k log p_k = 0 for all k, where p_k log p_k := 0 for p_k = 0, consistent with lim_{x→0⁺} x log x = 0. Together with the simplex constraint, this implies that p must be a standard basis vector p = e_l for some l, i.e., p_l = 1 and p_k = 0 for k ≠ l. Hence, P_i• and P̃_•j must be standard basis vectors for all i, j, and so each row and column of R contains exactly one non-zero element. Since the columns of R sum to one, these non-zero elements must all be one.
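The quantities in this proof are easy to check numerically. The sketch below computes D_i and C_j from an importance matrix R via base-K entropies (row- and column-normalisation as recalled above; see § 2 for the exact definitions), and confirms that a permutation matrix attains D = C = 1 while a uniform matrix attains D = C = 0.

```python
import math

def entropy_base_K(p, K):
    # H_K(p) = -sum_k p_k log_K p_k, with p_k log p_k := 0 for p_k = 0,
    # so H_K is bounded in [0, 1] on the probability simplex.
    return -sum(q * math.log(q, K) for q in p if q > 0)

def D_and_C(R):
    # D_i = 1 - H_K(P_i.) over row-normalised R;
    # C_j = 1 - H_L(P~_.j) over column-normalised R.
    L, K = len(R), len(R[0])
    D = [1 - entropy_base_K([R[i][j] / sum(R[i]) for j in range(K)], K)
         for i in range(L)]
    C = [1 - entropy_base_K([R[i][j] / sum(R[k][j] for k in range(L))
                             for i in range(L)], L)
         for j in range(K)]
    return D, C

P = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]  # permutation matrix
U = [[1 / 3] * 3 for _ in range(3)]    # uniform importance matrix
Dp, Cp = D_and_C(P)  # all ones
Du, Cu = D_and_C(U)  # all (numerically) zero
```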

A.2 PROOF OF COROLLARY 3.4

Corollary 3.4. Under the same conditions as Prop. 3.3, let z = W⊤c (so that I = 1), and let R_il = |w_il| / ∑_{i'=1}^L |w_{i'l}| be the matrix of relative feature importances. Then c identifies z up to permutation and sign (Defn. 3.1).

Proof. First, we show that R is a well-defined feature importance matrix. Suppose, for a contradiction, that ∑_{i=1}^L |w_il| = 0 for some l. Since |w_il| ≥ 0, this implies w_il = 0 for all i. Consider z_l = ∑_{i=1}^L w_il c_i. Taking the variance, we obtain Var[z_l] = ∑_{i,j=1}^L w_il w_jl Cov(c_i, c_j) = 0, which is a contradiction since z_l has positive (unit) variance by the normalisation assumption (see footnote 1). Hence, ∑_{i=1}^L |w_il| > 0 for all l. Thus R is well-defined; its elements are non-negative and its columns sum to one by construction, so it is a valid feature importance matrix.

Next, note that we can write R = |W|D, where D is the invertible diagonal matrix with positive diagonal entries D_ll = (∑_{i=1}^L |w_il|)^{-1}. By Prop. 3.3, R is a permutation matrix, so R = P = |W|D for some permutation matrix P. Right-multiplication by D^{-1} yields PD^{-1} = |W|; that is, |W| has exactly one non-zero, positive element in each row and each column (and zeros elsewhere). Thus W, and therefore also W⊤, are generalised permutation matrices. Hence (W⊤)^{-1} exists and is also a generalised permutation matrix.

Finally, consider c = (W⊤)^{-1} z. Since all but one element in each row of (W⊤)^{-1} are zero, we have, for any i, c_i = w̃_ij z_j for some j, where w̃_ij denotes the (i, j) element of (W⊤)^{-1}. By considering the variances of both sides and recalling that all c_i and z_j are normalised to unit variance, it follows that 1 = Var(c_i) = w̃²_ij Var(z_j) = w̃²_ij. Hence, w̃_ij = ±1, and so (W⊤)^{-1} is, in fact, a signed permutation matrix, which concludes the proof.

A.3 PROOF OF COROLLARY 3.5

Corollary 3.5. Under the same conditions as Prop. 3.3, let z = f(c) (so that I = 1) with f an invertible and differentiable nonlinear function, and let R be a matrix of relative feature importances

