DISENTANGLEMENT OF CORRELATED FACTORS VIA HAUSDORFF FACTORIZED SUPPORT

Abstract

A grand goal in deep learning research is to learn representations capable of generalizing across distribution shifts. Disentanglement is one promising direction aimed at aligning a model's representation with the underlying factors generating the data (e.g. color or background). Existing disentanglement methods, however, rely on an often unrealistic assumption: that factors are statistically independent. In reality, factors (like object color and shape) are correlated. To address this limitation, we consider the use of a relaxed disentanglement criterion, the Hausdorff Factorized Support (HFS) criterion, that encourages only pairwise factorized support, rather than a factorial distribution, by minimizing a Hausdorff distance. This allows for arbitrary distributions of the factors over their support, including correlations between them. We show that the use of HFS consistently facilitates disentanglement and recovery of ground-truth factors across a variety of correlation settings and benchmarks, even under severe training correlations and correlation shifts, in some settings with more than +60% relative improvement over existing disentanglement methods. In addition, we find that leveraging HFS for representation learning can even facilitate transfer to downstream tasks such as classification under distribution shifts. We hope our original approach and positive empirical results inspire further progress on the open problem of robust generalization. Code available at https://github.com/facebookresearch/disentangling-correlated-factors.

1. INTRODUCTION

Figure 1: Real data exhibits correlations between generative factors: cows are likely on grass, camels on sand. This contradicts disentanglement methods assuming statistically independent factors. Instead, we show that merely assuming and aiming for a factorized support can yield robust disentanglement even under correlated factors.

Disentangled representation learning (Bengio et al., 2013; Higgins et al., 2018) is a promising path to facilitate reliable generalization to in- and out-of-distribution downstream tasks (Bengio et al., 2013; Higgins et al., 2018; Milbich et al., 2020; Dittadi et al., 2021; Horan et al., 2021), on top of being more interpretable and fair (Locatello et al., 2019a; Träuble et al., 2021). While Higgins et al. (2018) propose a formal definition based on group equivariance, and various metrics have been proposed to measure disentanglement (Higgins et al., 2017; Chen et al., 2018; Eastwood & Williams, 2018), the most commonly understood definition is as follows:

Definition 1.1 (Disentanglement) Assuming data generated by a set of unknown ground-truth latent factors, a representation is said to be disentangled if there exists a one-to-one correspondence between each factor and dimension of the representation.

The method by which to achieve this goal, however, remains an open research question. Weak and semi-supervised settings, e.g. using data pairs or auxiliary variables, can provably offer disentanglement (Bouchacourt et al., 2018; Locatello et al., 2020b; Khemakhem et al., 2020; Klindt et al., 2021). But fully unsupervised disentanglement, our focus in this study, is in theory impossible to achieve in the general unconstrained nonlinear case (Hyvärinen & Pajunen, 1999; Locatello et al., 2019b). In practice, however, the inductive biases embodied in common autoencoder architectures allow for effective practical disentanglement (Rolinek et al., 2019). Perhaps more problematic, standard unsupervised disentanglement methods (s.a. Higgins et al. (2017); Kim & Mnih (2018); Chen et al. (2018)) rely on an unrealistic assumption of statistical independence of ground truth factors. Real data, however, contains correlations (Träuble et al., 2021). Even with well defined factors (s.a. shape, color or background), correlations are pervasive: yellow bananas are more frequent than red ones; cows are more often on grass than on sand. In more realistic settings with correlations, prior work (e.g. Träuble et al. (2021); Dittadi et al. (2021)) has shown existing disentanglement methods to fail.

To address this limitation, we propose to relax the unrealistic assumption of statistical independence of factors (i.e. that they have a factorial distribution), and only assume that the (bounded) support of the factors' distribution factorizes, a much weaker but more realistic constraint. For example, in a dataset of animal images (Fig. 1), background and animal are heavily correlated (camels most likely on sand, cows on grass), resulting in most datapoints being distributed along the diagonal as opposed to uniformly. Under the original assumption of factor independence, a model likely learns a shortcut solution where animal and landscape share the same latent correspondence (Beery et al., 2018). With a factorized support, on the other hand, learned factors should be such that any combination of their values has some grounding in reality: a cow on sand is an unlikely, yet not impossible combination. We still rely, just as standard unsupervised disentanglement methods do, on the inductive bias of encoder-decoder architectures to recover factors (Rolinek et al., 2019). However, we expect our method to facilitate robustness to any distribution shifts within the support (Träuble et al., 2021; Dittadi et al., 2021), as it makes no assumptions on the distribution beyond its factorized support.
We arrived at this factorized support principle from the perspective of relaxing the independence assumption to be robust to factor correlations, while remaining agnostic to how they may arise. Remarkably, the same principle was derived independently in Wang & Jordan (2021) from a causal perspective and a formal definition of causal disentanglement (Suter et al., 2019) that makes explicit how factor correlations can arise. To ensure a computationally tractable and efficient criterion even with many factors, we further relax the full factorized support assumption to that of only a pairwise factorized support, i.e. factorized support for all pairs of factors. On this basis, we propose a concrete pairwise Hausdorff Factorized Support (HFS) training criterion to disentangle correlated factors, by aiming for all pairs of latents to have a factorized support. Specifically, we encourage a factorized support by minimizing a Hausdorff set-distance between the finite sample approximation of the actual support and its factorization (Huttenlocher et al., 1993; Rockafellar & Wets, 1998). Across large-scale experiments on standard disentanglement benchmarks and novel extensions with correlated factors, HFS consistently facilitates disentanglement. We also show that HFS can be implemented as a regularizer for other methods to reliably improve disentanglement, by up to +61% over baselines as measured by DCI-D (Eastwood & Williams, 2018) (§4.1, Tab. 1). On downstream classification tasks, we improve generalization to more severe distribution shifts and sample efficiency (§4.2, Fig. 2).
To summarize our contributions:

[1] We motivate and investigate a principle for learning disentangled representations under correlated factors: we relax the assumption of statistically independent factors into that of a factorized support only (independently also derived in Wang & Jordan (2021) from a causal perspective), and further relax it to a more practical pairwise factorized support.

[2] We develop a concrete training criterion through a pairwise Hausdorff distance term, which can also be combined with existing disentanglement methods (§2.3).

[3] Extensive experiments on three main benchmarks and up to 14 increasingly difficult correlation settings, covering more than 20k models, show HFS systematically improving disentanglement (as measured by DCI-D) by up to +61% over standard methods (β/TC/Factor/Annealed-VAE, c.f. §4.1).

[4] We show that HFS improves robustness to factor distribution shifts between train and test over disentanglement baselines on classification tasks by up to +28%, and also improves sample efficiency.

2.1. DISENTANGLEMENT VERSUS INDEPENDENCE

We are given a dataset $\mathcal{D} = \{x_i\}_{i=1}^{N}$, where each $x_i$ is a realization of a random variable, e.g., an image. We consider that each $x_i$ is generated by an unknown generative process, involving a ground truth latent random vector $z$ whose components correspond to the dataset's underlying factors of variation (s.a. object shape, color, background, ...). This process generates an observation $x$ by first drawing a realization $z = (z_1, \dots, z_k)$ from a distribution $p(z)$, i.e. $z \sim p(z)$. Observation $x$ is then obtained by drawing $x \sim p(x|z)$. Given $\mathcal{D}$, the goal of disentangled representation learning can be stated as learning a mapping $f_\phi$ that for any $x$ recovers as best as possible the associated $z$, i.e. $f_\phi(x) \approx \mathbb{E}[z|x]$, up to a permutation of elements and an elementwise bijective transformation. In unsupervised disentanglement, the $z$ are unobserved, and both $p(z)$ and $p(x|z)$ are a priori unknown to us, though we might assume specific properties and functional forms. Most unsupervised disentanglement methods follow the formalization of VAEs and employ parameterized probabilistic generative models of the form $p_\theta(x, z) = p_\theta(z)\,p_\theta(x|z)$ to estimate the ground truth generative model over $z, x$. As in VAEs, these methods make the strong assumption that ground truth factors are statistically independent:

$$p(z) = p(z_1)\,p(z_2) \cdots p(z_k), \tag{1}$$

and conflate the goal of learning a disentangled representation with that of learning a representation with statistically independent components. This assumption naturally translates to a factorial model prior $p_\theta(z)$. Successful variants of VAE for disentanglement (Higgins et al., 2017; Kim & Mnih, 2018; Chen et al., 2018) further modify the original VAE objective to even more strongly enforce elementwise independence of the aggregate posterior (i.e. the encoder outputs) than afforded by the VAE's optimized evidence lower bound.
However, as explained in the introduction, the assumption of factor independence clearly doesn't hold for realistic data distributions. Consequently, methods that enforce this unrealistic assumption suffer from that discrepancy, as shown in Träuble et al. (2021) ; Dittadi et al. (2021) and confirmed in our own experiments. To address this shortcoming, we develop a novel method to relax the unrealistic assumption of factor independence.

2.2. RELAXING THE INDEPENDENCE ASSUMPTION INTO THAT OF FACTORIZED SUPPORT

Instead of assuming independent factors (i.e. a factorial distribution on $z$ as in Eq. 1), we will only assume that the support of the distribution factorizes. Let us denote by $S(p(z))$ the support of $p(z)$, i.e. the set $\{z \in \mathcal{Z} \mid p(z) > 0\}$. We say that $S(p(z))$ is factorized if it equals the Cartesian product of the supports of the individual dimensions' marginals, i.e. if:

$$S(p(z)) = S(p(z_1)) \times S(p(z_2)) \times \dots \times S(p(z_k)) \overset{\text{def}}{=} S^{\times}(p(z)), \tag{2}$$

where $\times$ denotes the Cartesian product. Of course, Eq. 1 (independence) $\Rightarrow$ Eq. 2 (factorized support), but Eq. 2 (factorized support) $\nRightarrow$ Eq. 1 (independence). Assuming a factorized support is thus a relaxation of the (unrealistic) assumption of a factorial distribution (i.e. of statistical independence) of disentangled factors. Refer to the cartoon example in Fig. 1, where the distribution of the two disentangled factors would not satisfy an independence assumption, but does have a factorized support. Informally, the factorized support assumption merely states that whatever values $z_1$, $z_2$, etc. may take individually, any combination of these is possible (even when not very likely). In the next section we will develop a concrete training criterion that encourages the obtained latent representation to have a factorized support rather than a factorial distribution.
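To make the gap between independence and factorized support concrete, the following minimal sketch checks both properties on small discrete joint distributions. The distributions and all function names are hypothetical illustrations in the spirit of Fig. 1, not part of our experiments.

```python
from itertools import product

def marginal_support(joint, axis):
    """Values of one coordinate that occur with non-zero probability."""
    return {z[axis] for z, p in joint.items() if p > 0}

def has_factorized_support(joint):
    """True if the support equals the Cartesian product of the marginal supports."""
    support = {z for z, p in joint.items() if p > 0}
    k = len(next(iter(joint)))
    product_support = set(product(*(marginal_support(joint, a) for a in range(k))))
    return support == product_support

def is_independent(joint, tol=1e-12):
    """True if p(z1, z2) = p(z1) p(z2) for all value pairs (two factors only)."""
    p1, p2 = {}, {}
    for (a, b), p in joint.items():
        p1[a] = p1.get(a, 0.0) + p
        p2[b] = p2.get(b, 0.0) + p
    return all(abs(joint.get((a, b), 0.0) - p1[a] * p2[b]) <= tol
               for a in p1 for b in p2)

# Hypothetical toy distributions over (animal, background).
# Strongly correlated, but every combination remains possible.
correlated = {("cow", "grass"): 0.45, ("cow", "sand"): 0.05,
              ("camel", "grass"): 0.05, ("camel", "sand"): 0.45}
# Same marginals, but two combinations never occur.
missing = {("cow", "grass"): 0.5, ("cow", "sand"): 0.0,
           ("camel", "grass"): 0.0, ("camel", "sand"): 0.5}

print(has_factorized_support(correlated), is_independent(correlated))
print(has_factorized_support(missing), is_independent(missing))
```

The first distribution is correlated yet has a factorized support, which is the situation our relaxed assumption is meant to cover; the second violates even the relaxed assumption.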

2.3. A PRACTICAL CRITERION FOR FACTORIZED SUPPORT

Based on our relaxed hypothesis, we now define a concrete training criterion that encourages a factorized support. Let us consider deterministic representations obtained by the encoder, $z = f_\phi(x)$. We enforce the factorized support criterion on the aggregate distribution $\bar{q}_\phi(z) = \mathbb{E}_x[f_\phi(x)]$, where $\bar{q}_\phi(z)$ is conceptually similar to the aggregate posterior $q_\phi(z)$ in e.g. TCVAE, though we consider points produced by a deterministic mapping $f_\phi$ rather than a stochastic one. To match our factorized support assumption on the ground truth, we want to encourage the support of $\bar{q}_\phi(z)$ to factorize, i.e. that $S(\bar{q}_\phi(z))$ and the Cartesian product of each dimension's support, $S^{\times}(\bar{q}_\phi(z))$, are equal. For clarity, we use the shorthand notations $S$ and $S^{\times}$ to denote $S(\bar{q}_\phi(z))$ and $S^{\times}(\bar{q}_\phi(z))$ respectively when it is clear from context. To guide the learning, we thus need a divergence or metric to tell us how far $S$ is from $S^{\times}$. Supports are sets, so it is natural to use a set distance such as the Hausdorff distance.

Hausdorff distance between sets. Given a base distance metric $d(z, z')$ between any two points in $\mathcal{Z}$ (e.g. the Euclidean metric in $\mathcal{Z} = \mathbb{R}^k$), the Hausdorff distance between sets (here, $S^{\times}$ and $S$) is defined as

$$d_H(S, S^{\times}) = \max\left( \sup_{z \in S^{\times}} \Big[ \inf_{z' \in S} d(z, z') \Big],\; \sup_{z \in S} \Big[ \inf_{z' \in S^{\times}} d(z, z') \Big] \right) = \sup_{z \in S^{\times}} \Big[ \inf_{z' \in S} d(z, z') \Big], \tag{3}$$

with the second part of the Hausdorff distance equating to zero since $S \subset S^{\times}$.
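As a small worked illustration of the Hausdorff distance for finite point sets, the sketch below uses the Euclidean base metric and a hand-picked two-point "diagonal" support; the point sets and function names are assumptions made for illustration only.

```python
import math

def directed_hausdorff(A, B):
    """max over a in A of the distance from a to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

# A support concentrated on the diagonal, and its Cartesian factorization.
S = [(0.0, 0.0), (1.0, 1.0)]
S_prod = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

print(directed_hausdorff(S, S_prod))   # S is contained in S_prod
print(directed_hausdorff(S_prod, S))   # the off-diagonal corners are far from S
print(hausdorff(S, S_prod))
```

Because every point of S also lies in its factorization, the direction from S to the product set contributes nothing, and only the direction from the product set to S carries a training signal.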

Monte-Carlo Hausdorff Distance Estimation

In practice we only have a finite sample of observations $\{x_i\}_{i=1}^{N}$, and can only estimate the support and Hausdorff distances from the finite number of representations $\{f_\phi(x_i)\}_{i=1}^{N}$. We thus introduce a practical Monte-Carlo batch approximation: with access to a batch of $b$ inputs $X$ yielding $b$ $k$-dimensional latent representations $Z = f_\phi(X) \in \mathbb{R}^{b \times k}$, we estimate Hausdorff distances using sample-based approximations to the support: $S \approx Z$ and $S^{\times} \approx Z_{:,1} \times Z_{:,2} \times \dots \times Z_{:,k} = \{(z_1, \dots, z_k),\; z_1 \in Z_{:,1}, \dots, z_k \in Z_{:,k}\}$. Here $Z_{:,j}$ must be understood as the set (not vector) of all elements in the $j$-th column of $Z$. Plugging into Eq. 3 yields:

$$\hat{d}_H(Z) = \max_{z \in Z_{:,1} \times Z_{:,2} \times \dots \times Z_{:,k}} \Big[ \min_{z' \in Z} d(z, z') \Big], \tag{4}$$

where by writing $z' \in Z$ we consider the matrix $Z$ as a set of rows, over which we find the min.

Further relaxing the assumption to pairwise factorization. In high dimension, with many factors, the assumption that every combination of all latent values is possible might still be too strong. And even if we assumed all combinations to be possible in principle, we can never hope to observe all of them in a finite dataset of realistic size, due to the combinatorial explosion of conceivable combinations. However, it is statistically reasonable to expect evidence of a factorized support for all pairs of elements. To encourage such a pairwise factorized support, we can minimize a sliced/pairwise Hausdorff estimate, with the additional benefit of keeping computation tractable when $k$ is large:

$$\hat{d}^{(2)}_H(Z) = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \max_{z \in Z_{:,i} \times Z_{:,j}} \Big[ \min_{z' \in Z_{:,(i,j)}} d(z, z') \Big], \tag{5}$$

where $Z_{:,(i,j)}$ denotes the concatenation of column $i$ and column $j$, yielding a set of rows.

Avoiding collapse and retaining input information. We will be learning representations $z = f_\phi(x)$ by learning parameters $\phi$ that optimize a training objective.
Because the Hausdorff distance builds on a base distance $d(z, z')$, if we were to minimize only this, it could be trivially minimized to 0 by collapsing all representations to a single point. Avoiding this can be achieved in several ways, s.a. by including a term that encourages the variance of $z_{:,i}$ to be above 1 (a technique used e.g. in the self-supervised learning method VICReg (Bardes et al., 2022)) or, more in line with traditional VAE variants for disentanglement, by using a stochastic autoencoder (SAE) reconstruction error:

$$\ell_{\text{SAE}}(x; \phi, \theta) = -\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big], \tag{6}$$

where typically $q_\phi(z|x) = \mathcal{N}(f_\phi(x), \Sigma_\phi(x))$ with mean given by our deterministic mapping $f_\phi$, $\Sigma_\phi(x)$ producing a diagonal covariance parameter, and e.g. $\log p_\theta(x|z) = -\|r_\theta(z) - x\|^2$ (up to scaling and additive constants), with $r_\theta$ a parameterized decoder. The autoencoder term ensures that representations $f_\phi(x)$ retain as much information as possible about $x$ for reconstruction, preventing collapse of representations to a single point. A minimum scale can also be ensured by constraining $\Sigma_\phi(x)$ to be above a minimal threshold.
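The batch estimates above (the full product version and the pairwise version of Eq. 5) can be sketched in a few lines. This is a plain-Python illustration that assumes a Euclidean base distance and represents a batch as a list of tuples; it is not the released implementation, and the function names are hypothetical. The last example also shows the collapse problem: a batch collapsed to a single point attains a value of zero.

```python
import math
from itertools import combinations, product

def hausdorff_to_product(Z):
    """Max over the product of column value sets of the distance to the nearest row of Z."""
    columns = [sorted(set(col)) for col in zip(*Z)]   # each column treated as a set
    return max(min(math.dist(z, row) for row in Z) for z in product(*columns))

def pairwise_hfs(Z):
    """Sum of the 2D estimates over all latent pairs (i, j) with i < j."""
    k = len(Z[0])
    total = 0.0
    for i, j in combinations(range(k), 2):
        pair_rows = [(row[i], row[j]) for row in Z]   # columns i and j as a set of rows
        total += hausdorff_to_product(pair_rows)
    return total

Z_diag = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]           # perfectly correlated latents
Z_grid = list(product([0.0, 1.0], repeat=3))          # fully factorized support
Z_collapsed = [(0.5, 0.5, 0.5)] * 4                   # degenerate solution

print(pairwise_hfs(Z_diag), pairwise_hfs(Z_grid), pairwise_hfs(Z_collapsed))
```

The full product set has up to b^k elements, whereas each pairwise term only enumerates up to b^2 candidate points, which is why the pairwise form stays tractable as k grows.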

2.4. PUTTING IT ALL TOGETHER

Our basic training objective for Hausdorff Factorized Support (HFS) can thus be formed by simply combining the stochastic autoencoder loss of Eq. 6 and our Hausdorff estimate of Eq. 5:

$$\mathcal{L}_{\text{HFS}}(\mathcal{D}; \phi, \theta) = \mathbb{E}_{X_b \sim \mathcal{D}}\left[ \gamma\, \hat{d}^{(2)}_H(f_\phi(X)) + \frac{1}{b} \sum_{x \in X} \ell_{\text{SAE}}(x; \phi, \theta) \right], \tag{7}$$

where $X_b \sim \mathcal{D}$ denotes a batch of $b$ inputs, $f_\phi(X)$ the batch representations $Z$, and $\gamma$ the tradeoff between the Hausdorff and SAE terms. To compare with existing VAE-based disentanglement methods s.a. β-VAE (Higgins et al., 2017), we can also use Eq. 5 as a regularizer on top:

$$\mathcal{L}_{\beta\text{VAE}}^{\text{HFS}}(\mathcal{D}; \phi, \theta) = \mathbb{E}_{X_b \sim \mathcal{D}}\left[ \gamma\, \hat{d}^{(2)}_H(Z) + \frac{1}{b} \sum_{x \in X} \Big( \ell_{\text{SAE}}(x; \phi, \theta) + \beta\, D_{\text{KL}}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) \Big) \right], \tag{8}$$

where $D_{\text{KL}}$ is the Kullback-Leibler divergence, and $p_\theta(z)$ the usual VAE factorial unit Gaussian prior. This hybrid objective recovers the original β-VAE with $\gamma = 0$, and $\mathcal{L}_{\text{HFS}}$ (Eq. 7) with $\beta = 0$, showing that the plain HFS objective replaces the β-VAE KL term by our factorized-support-encouraging Hausdorff term and removes the factorial prior $p_\theta(z)$. We can similarly extend other VAE-based variants (Chen et al., 2018; Kim & Mnih, 2018; Burgess et al., 2018) by adding our Hausdorff term as a regularizer, to focus more on the support than on a precise factorial distribution.
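The way the terms of the hybrid objective (Eq. 8) combine on a single batch can be sketched as follows. The sketch takes the Hausdorff term and the per-sample reconstruction errors as precomputed numbers, uses the standard closed form of the KL divergence between a diagonal Gaussian and a unit Gaussian, and assumes the KL term is averaged over the batch together with the reconstruction error. All names are hypothetical.

```python
import math

def kl_to_unit_gaussian(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var))

def hybrid_objective(hausdorff_term, recon_errors, mus, variances, gamma, beta):
    """gamma * Hausdorff term + batch mean of (reconstruction error + beta * KL)."""
    b = len(recon_errors)
    per_sample = sum(r + beta * kl_to_unit_gaussian(m, v)
                     for r, m, v in zip(recon_errors, mus, variances)) / b
    return gamma * hausdorff_term + per_sample

# Hypothetical batch of two samples with two latent dimensions each.
recon = [1.0, 3.0]
mus = [(1.0, 0.0), (0.0, 0.0)]
variances = [(1.0, 1.0), (1.0, 1.0)]

plain_hfs = hybrid_objective(2.0, recon, mus, variances, gamma=0.5, beta=0.0)
beta_vae = hybrid_objective(2.0, recon, mus, variances, gamma=0.0, beta=4.0)
print(plain_hfs, beta_vae)
```

Setting beta to zero leaves only the Hausdorff and reconstruction terms, and setting gamma to zero leaves the β-VAE terms, mirroring the two special cases discussed above.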

3. RELATED WORK

Disentangled Representation Learning aims to recover representation spaces where each ground-truth generative factor is encoded in a unique entry or subspace (Bengio et al., 2013; Higgins et al., 2018) to benefit subsequent downstream transfer (Bengio et al., 2013; Peters et al., 2017; Tschannen et al., 2018; Locatello et al., 2019b; Montero et al., 2021; Mancini et al., 2021; Roth et al., 2020; Funke et al., 2022), interpretability (Chen et al., 2016; Esser et al., 2018; Niemeyer & Geiger, 2021) and fairness (Locatello et al., 2019a; Träuble et al., 2021; Dullerud et al., 2022) via compositionality of representations. Methods often rely on Variational AutoEncoder (VAE) variants (Kingma & Welling, 2014; Rezende et al., 2014) to constrain the (aggregate) posterior of the encoder, e.g. via penalties on the bottleneck capacity (β-VAE (Higgins et al., 2017)) with progressive constraints or network growing (AnnealedVAE (Burgess et al., 2018), ProVAE (Li et al., 2020)), the total correlation (β-TCVAE (Chen et al., 2018), FactorVAE (Kim & Mnih, 2018)) or the mismatch to some factorized prior (DIP-VAE (Kumar et al., 2018), DoubleVAE (Mita et al., 2020)). These approaches assume statistically independent factors, which is invalid for realistic data as motivated in §1.

Disentanglement under correlated factors. Consequently, while most methods have been shown to perform well on toy datasets and ones with known independent factors such as Shapes3D (Kim & Mnih, 2018), MPI3D (Gondal et al., 2019), DSprites (Higgins et al., 2017), SmallNorb (LeCun et al., 2004) or Cars3D (Reed et al., 2015), recent research (Montero et al., 2021; Träuble et al., 2021; Montero et al., 2022; Funke et al., 2022; Dittadi et al., 2021) shows that unsupervised disentanglement methods that assume independent factors fail to disentangle, with potentially negative impact on OOD generalization. Suter et al. (2019) propose a causal metric to evaluate disentanglement when assuming confounders between the ground-truth factors. Choi et al. (2020) introduce a Gaussian mixture model for dependencies between continuous and discrete variables in a structured setup with the number of mixtures known. By contrast, we investigate a generic remedy without explicit auxiliary variables or prior models by relaxing the independence assumption to only a pairwise factorized support. Pfau et al. (2020) propose geometrically motivated non-parametric unsupervised disentanglement following the symmetry-based definition in Higgins et al. (2018), leveraging holonomy of manifold geometries as a learning signal to find disentangled subspaces. This does not assume statistical independence, but requires non-trivial holonomy for each factor manifold, and struggles in high-dimensional spaces and with generalization to new data. For domain adaptation shifts, Tong et al. (2022) propose adversarial support matching, highlighting that operating on the support can be beneficial for related settings as well.

To evaluate disentanglement, we utilize DCI-D (part of DCI: Disentanglement, Completeness, Informativeness, see Eastwood & Williams (2018)) as the leading metric. As opposed to other metrics s.a. Beta-/FactorVAE scores (Higgins et al., 2017; Kim & Mnih, 2018), MI Gap (Chen et al., 2018), Modularity (Ridgeway & Mozer, 2018) or SAP (Kumar et al., 2018) [...] 2021)), with generally strong correlation between metrics (Locatello et al., 2019b).

Finally, Wang & Jordan (2021) independently also arrived at the idea of support factorization for disentanglement from a causal perspective, for which they propose a similar Hausdorff distance objective, providing orthogonal validation of support factorization for disentanglement. In contrast, we derive it from relaxing the assumption of factor independence, and propose a further pairwise relaxation, which performs and scales much better (see Supp. A), alongside a much more expansive experimental study of the impact on downstream disentanglement, adaptation and generalization under various correlation shifts.

4. EXPERIMENTS

We start with experimental details listed below, before studying HFS on benchmarks with and without training correlations (§4.1). These results are extended in §4.2 to evaluate the transfer and downstream adaptability (§4.3) of learned representations across different correlation shifts and link HFS during training to various downstream metrics (§4.4). We include variant, qualitative and hyperparameter robustness studies in appendix §B, §E and §D, all favouring our HFS objective. Across experiments, we re-implemented baselines (β-VAE (Higgins et al., 2017), FactorVAE (Kim & Mnih, 2018), AnnealedVAE (Burgess et al., 2018) [...] $\propto \exp\!\big(-(z_1 - f(z_2))^2 / (2\sigma^2)\big)$, where higher $\sigma$ denotes weaker correlation between normalized factors $z_1$ and $z_2$, and $f(z) = z$ or $f(z) = 1 - z$ for inverted correlations when necessary. We extend this framework to include correlations between multiple factor pairs (either 1, 2 or 3 pairs) and shared confounders (one factor correlated to all others). All reported numbers are computed on at least 6 seeds (with ≥ 10 seeds used for key experiments s.a. Tab. 1 or Fig. 2). Similar to existing literature (Locatello et al., 2019b; 2020b; Träuble et al., 2021; Dittadi et al., 2021), we cover at least 7 hyperparameter settings for each baseline. Further experimental details are provided in §H.
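The correlation weighting above can be illustrated with a small sampling sketch. It assumes the expression is used as an unnormalized joint weight over a pair of factor values normalized to [0, 1]; the grid of values, the sampler and the function names are hypothetical and only meant to show how σ controls correlation strength.

```python
import math
import random

def pair_weights(values_1, values_2, sigma, inverted=False):
    """Unnormalized weights exp(-(z1 - f(z2))^2 / (2 sigma^2)) on a grid of
    normalized factor values, with f(z) = z, or f(z) = 1 - z if inverted."""
    f = (lambda z: 1.0 - z) if inverted else (lambda z: z)
    return {(z1, z2): math.exp(-(z1 - f(z2)) ** 2 / (2.0 * sigma ** 2))
            for z1 in values_1 for z2 in values_2}

def sample_pairs(weights, n, seed=0):
    """Draw n factor-value pairs in proportion to their weights."""
    rng = random.Random(seed)
    pairs = list(weights)
    return rng.choices(pairs, weights=[weights[p] for p in pairs], k=n)

grid = [0.0, 0.5, 1.0]
strong = pair_weights(grid, grid, sigma=0.2)   # mass concentrates near z1 = z2
weak = pair_weights(grid, grid, sigma=1.0)     # much closer to uniform
print(strong[(0.0, 1.0)], weak[(0.0, 1.0)])
print(sample_pairs(strong, 5))
```

Every combination keeps a strictly positive weight for any finite σ on this grid, so such correlated training sets still have a factorized support even when off-diagonal combinations become rare.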

4.1. FACTORIZATION OF SUPPORTS FOR DISENTANGLEMENT ON STANDARD BENCHMARK

We study the behaviour of HFS and baselines on standard disentanglement learning benchmarks and correlated variants thereof (see §4): Shapes3D (Kim & Mnih, 2018), MPI3D (Gondal et al., 2019) and DSprites (Higgins et al., 2017). For each setting, we report results averaged over ≥ 10 seeds in Tab. 1. Each column denotes the test performance on uncorrelated data for all methods trained on a particular correlation setting. As DSprites only has five effective factors of variation, no three-pair setting is possible. Values reported denote median DCI-D with 25th and 75th percentiles in grey. Our results indicate that a factorization of the support via HFS encourages disentanglement (as measured via DCI-D) without relying on a factorial distribution objective (s.a. β-VAE and its variants), consistently matching or outperforming the comparable β-VAE setting, both when no correlation is encountered during training ("No Corr.") and for much more severe correlations ("Conf."). Moreover, we find that extending existing disentanglement objectives s.a. β-VAE (or stronger extensions like β-TCVAE) with explicit factorization of the support (+HFS) can provide further, significant improvements. For example, without correlations we find relative improvements of nearly +30%, while for some correlated settings, e.g. with a shared confounder, these exceed +60%. In addition, relative increases of up to +140% on β-TCVAE further highlight both the general importance of an explicit factorization of the support for disentanglement, even under training correlations, and the fact that this property has generally been neglected until now.

4.2. OUT-OF-DISTRIBUTION GENERALIZATION UNDER CORRELATION SHIFTS

As β-VAE and β-VAE + HFS models in §4.1 were trained on correlated and evaluated on uncorrelated data, the performance differences provide a first indication that encouraging a factorized support can benefit transfer under correlation shifts. Such a change in correlation from train to test data is commonly referred to as "distributional shift" (Quinonero-Candela et al., 2009), as the test data becomes out-of-distribution for the model, and marks a key issue interfering with generalization in realistic settings (Arjovsky et al., 2019; Koh et al., 2021; Milbich et al., 2021; Roth et al., 2022; Funke et al., 2022). While some works point to initial benefits of disentangled representations for out-of-distribution (OOD) generalization (e.g. [...] of all factors, disentangled, enables effective subsequent feature selection (e.g. L1-regularized logistic regression or shallow decision trees). A downstream predictor can thus be far more sample efficient (Ng, 2004) in learning to ignore irrelevant factors that may be spuriously correlated with the target than if they were entangled in the representation. As we showed that explicit support factorization provides stronger relative disentanglement, we leverage this for further insights into its benefits on OOD tasks. Given the strong performance of HFS on Shapes3D, we extend our experiments from §4.1 on this dataset with more training and now also test data correlations. This gives transfer grids (Fig. 2) across diverse, increasingly severe correlation shifts. For each grid, the y- and x-axis indicate training and test correlations, increasing from top to bottom and left to right, respectively. Darker colors refer to a score increase. We use these grids to see (1) how different correlation shifts impact disentanglement, and (2) if improvements in disentanglement via explicitly aiming for a factorized support impact downstream transferability of the learned representations.
(1) We first evaluate disentanglement of a standard β-VAE (leftmost grid; each square uses optimal parameters for a given correlation and seed), and find an expected drop with increased correlation on the training data. The subsequent grid shows changes when adding HFS to β-VAE, with consistent improvements in disentanglement of test data over β-VAE across all correlation shifts (only positive changes), extending our insights from Tab. 1.

[Figure caption, beginning missing] ... 2) receiving either the entire latent vector (black) or only the most expressive entry (blue). The increased disentanglement through HFS gives consistent improvements in all cases, and gets more pronounced in the low data regime for full latents, indicating higher sample efficiency, as expected from better disentanglement. Relative improvements up to +80% in the single entry case across correlation shifts highlight the better reflection of ground truth factors across correlations.

(2) To understand the usefulness for practical transfer tasks under correlation shifts, we train a Gradient Boosted Tree (GBT, sklearn (Pedregosa et al., 2011), c.f. Locatello et al. (2019b; 2020b)) to take representations of the test data and predict the exact ground truth factor values; this is reflected in the third grid for the β-VAE baseline and measured by the DCI-I metric (Eastwood & Williams, 2018). We see that the downstream classification performance, while saturated for small shifts, drops notably both with increased training correlation and with more variation in the test data (drop towards the bottom-left corner). If we now measure the change in downstream classification performance with HFS (last grid), we see that while the benefits are small for small shifts, they increase for larger ones (increase towards the same bottom-left corner).
This indicates that changes in disentanglement through our support factorization become increasingly important as distributional shifts increase, and highlights that benefits for OOD generalization drawn from improvements in disentanglement may be particularly evident for harder shifts.

4.3. BENEFITS UNDER VARYING DOWNSTREAM ADAPTATION METHODS

This section investigates how generalization improvements hold when the amount of downstream test data changes, and revisits the recovery of ground truth factors under correlation shifts by looking at the performance with only the single most important latent entry. We train GBTs (as we care about relative changes, we use xgboost (Chen & Guestrin, 2016) for faster training) on embeddings extracted either from an optimal β-VAE or β-VAE + HFS. For the single-latent-entry training, we first train a GBT to select the most important entry to predict each respective ground truth factor, and then use said entry to train a second ground truth factor predictor. See Supp. §F.2 for experiments with linear probes. In all cases (Fig. 3), explicit support factorization via HFS facilitates downstream adaptation, with particular benefits when only little data is provided at test time. For example, in the standard uncorrelated setting, relative improvements increase from 4% to 45%, with similar trends across correlation shifts. Finally, our experiments reveal that increased disentanglement expectedly results in a better reflection of ground truth factors in single latent entries, shown in nearly +80% relative improvement when training and predicting on the most expressive entry. These insights reinforce that explicit support factorization via HFS encourages disentanglement also under correlation shifts, and show potential benefits in downstream generalization, especially in the low data regime.

4.4. FACTORIZATION AS A PERFORMANCE METRIC

We now explore the relationship of HFS to existing metrics across correlations. We utilize Eq. 5 as a separate evaluation metric for the factorization of the support, computed across the whole training data. When an increased HFS weighting γ drives this Hausdorff estimate on the training data down, i.e. towards a more factorized support (Fig. 4 (left, orange)), the disentanglement on the test data consistently goes up, verifying again the connection between support factorization and disentanglement. Fig. 4 (center, black) shows β-VAE implicitly encouraging a factorized support by pushing towards independence, which however hurts disentanglement and generalization, see experiments above. Finally, Fig. 4 (right) shows that support factorization on the training data correlates with disentanglement metrics; this also holds under stronger training correlations, albeit to a lesser degree. This is useful, as HFS neither requires access to ground truth factors nor a specific prior distribution over the support, and can thus serve as a proxy for development and training evaluation in future work.

5. CONCLUSION

To avoid the unrealistic assumption of factor independence (i.e. a factorial distribution) made in traditional disentanglement, which stands in contrast to realistic data being correlated, we thoroughly investigate an approach that only aims at recovering a factorized support. Doing so achieves disentanglement by ensuring the model can encode many possible combinations of generative factors in the learned latent space, while allowing for arbitrary distributions over the support, in particular those with correlations. Indeed, through a practical criterion using pairwise Hausdorff set-distances (HFS), we show that encouraging a pairwise factorized support is sufficient to match traditional disentanglement methods. Furthermore, we show that HFS can steer existing disentanglement methods towards a more factorized support, giving large relative improvements of over +60% on common benchmarks across a large variety of correlation shifts. We find this improvement in disentanglement across correlation shifts to be also reflected in improved out-of-distribution generalization, especially as these shifts become more severe, tackling a key promise of disentangled representation learning.

REPRODUCIBILITY STATEMENT

To reproduce the results from this paper and avoid implementation- and library-related differences, we have released our codebase here: https://github.com/facebookresearch/disentangling-correlated-factors. To reproduce Tab. 1, we first refer to Tab. 5, which contains all Tab. 1 results with additional details on the exact correlations used (as well as other correlation settings). For each of the correlation settings, the associated factor correlation pairs are provided in §H.1, with the training, model, and grid-search details all noted in §H. The correlation formula used to introduce artificial correlations between respective factors follows the setup described in the experimental details at the beginning of §4. For the correlation shift transfer experiments in §4.2, the same training and correlation settings are used. For our downstream adaptability results, we provide all relevant details in §4.3 and §H.

A ALTERNATE HAUSDORFF VARIANTS

In this section, we introduce several variants of our Hausdorff distance approximation from §2.3, and particularly Eq. 5, which we then experimentally evaluate in §B.

A.1 AVERAGED HAUSDORFF

First, due to the sensitivity of the max-operation to outliers, one can also utilize the average Hausdorff distance, which simply gives

$$d^{(2)}_{H,\text{avg}}(Z) = \sum_{i=1}^{k} \sum_{j=i+1}^{k} \frac{1}{|Z_{:,i} \times Z_{:,j}|} \sum_{z \in Z_{:,i} \times Z_{:,j}} \min_{z' \in Z_{:,(i,j)}} d(z, z') \qquad (9)$$

using the same pair-based approximation introduced in Eq. 5.

A.2 SUBSAMPLING

One can also operate on the full approximated support $\hat{S}^{\times}$ with $S^{\times} \approx \hat{S}^{\times} = Z_{:,1} \times Z_{:,2} \times \dots \times Z_{:,k}$ instead of a collection of $\hat{S}^{\times}_{i,j} = Z_{:,i} \times Z_{:,j}$. However, in practice (as shown in §B), we found $d^{(2)}_H$ to work better, as the max-operation over a collection of 2D subspaces provides a less sparse training signal than a single backpropagated distance pair in $d_{H,\text{sub}}$.
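To make the difference between the max-based estimate of Eq. 5 and the averaged variant of Eq. 9 concrete, the following is a minimal pure-Python sketch, not the released implementation. The function name `pairwise_hausdorff`, the `reduce` argument, and the choice of the squared Euclidean distance for d are assumptions made for illustration only.

```python
from itertools import combinations, product


def sq_dist(a, b):
    """Squared Euclidean distance between two points (assumed choice of d)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise_hausdorff(Z, reduce="max"):
    """Sum of per-pair distances from the factorized 2D support to the data.

    Z is a list of latent vectors (one per sample).
    reduce="max" mimics the outer max of Eq. 5, reduce="mean" the average of Eq. 9.
    """
    k = len(Z[0])
    total = 0.0
    for i, j in combinations(range(k), 2):
        observed = [(z[i], z[j]) for z in Z]  # Z_{:,(i,j)}
        # Factorized support Z_{:,i} x Z_{:,j}: all combinations of observed values.
        factorized = list(product([z[i] for z in Z], [z[j] for z in Z]))
        # Inner min: distance of each support point to its nearest observed point.
        nearest = [min(sq_dist(p, q) for q in observed) for p in factorized]
        if reduce == "max":
            total += max(nearest)
        else:
            total += sum(nearest) / len(nearest)
    return total
```

For two perfectly correlated latents such as `Z = [[0, 0], [1, 1]]`, the two off-diagonal corners of the factorized support are unobserved, so under this distance the max-based value is 1 and the averaged value is 0.5; both drop to 0 once all four corners are observed.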

A.3 SAMPLING-BASED SOFTMIN

In addition, as the latent representations and the corresponding support change during training, one can encourage some degree of exploration instead of relying on hard max and min operations, for example through a probabilistic selection of the final distance to minimize, which allows for a controllable degree of exploration during training:

$$d^{(2)}_{H,\text{prob}} = \sum_{i=1}^{k} \sum_{j=i+1}^{k} \max_{z \in \hat{S}^{\times}(Z)} \mathbb{E}_{z' \sim p_{\text{softmin}}(\cdot \mid z, Z_{:,(i,j)}, \tau)} \left[ d(z, z') \right]$$

with the SoftMin distribution

$$p_{\text{softmin}}(z' \mid z, Z, \tau) = \frac{\exp(-d(z, z')/\tau)}{\sum_{z^* \in Z} \exp(-d(z, z^*)/\tau)},$$

though, as shown in the following experimental section, we found minimal benefits in doing so.
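The SoftMin selection above can be illustrated with a short pure-Python sketch. It assumes the distances d(z, z') from one support point z to all observed points have already been computed; the helper names `softmin_weights`, `expected_softmin_distance`, and `sample_softmin_index` are hypothetical and only serve this example.

```python
import math
import random


def softmin_weights(dists, tau):
    """SoftMin probabilities exp(-d / tau) / sum(exp(-d* / tau)) over a list of distances."""
    shift = min(dists)  # subtracting the minimum keeps the exponentials numerically stable
    weights = [math.exp(-(d - shift) / tau) for d in dists]
    norm = sum(weights)
    return [w / norm for w in weights]


def expected_softmin_distance(dists, tau):
    """Expected distance under the SoftMin distribution."""
    return sum(p * d for p, d in zip(softmin_weights(dists, tau), dists))


def sample_softmin_index(dists, tau, rng):
    """Draw the index of one observed point according to the SoftMin distribution."""
    probs = softmin_weights(dists, tau)
    return rng.choices(range(len(dists)), weights=probs, k=1)[0]
```

A small temperature recovers the hard min-selection, while a large temperature approaches a uniform average over all distances, which is the "exploration" knob described in the text.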

A.4 SOFTENED HAUSDORFF DISTANCE

To potentially better align the Hausdorff distance objective with the differentiable optimization process, it may be beneficial to look into soft variants that relax the hard minimization and maximization, respectively (note the $\tilde{d}$ instead of $d$):

$$\sigma_{\min}(z, z', Z) = \frac{\exp(-d(z, z')/\tau_1)}{\sum_{z^* \in Z} \exp(-d(z, z^*)/\tau_1)}$$

$$\tilde{d}_{\text{softmin}}(z, Z) = \sum_{z' \in Z} \sigma_{\min}(z, z', Z)\, d(z, z')$$

$$\tilde{d}^{(2)}_H(Z) = \sum_{i=1}^{k} \sum_{j=i+1}^{k} \sum_{z \in Z_{:,(i,j)}} \frac{\exp(\tilde{d}_{\text{softmin}}(z, Z_{:,(i,j)})/\tau_2)}{\sum_{z^* \in Z_{:,(i,j)}} \exp(\tilde{d}_{\text{softmin}}(z^*, Z_{:,(i,j)})/\tau_2)}\, \tilde{d}_{\text{softmin}}(z, Z_{:,(i,j)})$$

Here, the temperature τ1 controls the transition between putting more weight on the minimal distance (smaller τ1) and a more uniform distribution (larger τ1), which moves further away from the corresponding min operation. A secondary temperature τ2 then controls the transition between the max-operation over our soft distances and a more uniform weighting over all (non-zero) soft distances, with the limit case τ2 → ∞ approximating the simple mean over soft distances.

Table 2 (caption, continued): [...] $d^{(2)}_{H,\text{prob}}$) as well as subsampling of the full-dimensional factorized support without any pairwise approximations. In all cases, entries are selected with at least 5 seeds and optimal values chosen from a gridsearch over γ ∈ {1, 3, 10, 30, 100, 300, 1000, 3000, 10000}.
9 [74.3, 81.4] | 55.9 [47.6, 59.9] | 48.0 [39.1, 48.9]
$d^{(2)}_{H,\text{avg}}$ (Eq. 9): 61.1 [55.6, 65.5] | 1 [48.8, 54.7] | 32.1 [27.0, 33.1]
$d^{\text{sub}}_H$ (Eq. 10), Subs. 8 · 10^4: 66.9 [61.3, 71.2] | 54.5 [49.3, 59.7] | 39.2 [31.9, 43.8]
$d^{\text{sub}}_H$ (Eq. 10), Subs. 8 · 10^5: 66.6 [62.2, 72.4] | 56.8 [50.1, 59.4] | 40.9 [37.3, 45.2]

Figure 5: Results for our soft approximation to Eq. 5. Blue horizontal line denotes the default Eq. 5 objective, while orange denotes a replacement of the max-operation with a mean. We find that generally, a convergence of the soft approach to our default hard variant performs best, with large choices in the outer temperature converging towards our mean approximation.
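The role of the two temperatures can be checked numerically with a small pure-Python sketch for a single latent pair. This is an illustration under assumptions, not the released code: the squared Euclidean distance stands in for d, the outer weighting is applied to an explicitly passed list `points`, and all function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance (assumed choice of d)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def softmax_weights(values, tau):
    """Softmax over values / tau, shifted by the maximum for numerical stability."""
    shift = max(values)
    weights = [math.exp((v - shift) / tau) for v in values]
    norm = sum(weights)
    return [w / norm for w in weights]


def soft_min_distance(z, observed, tau1):
    """Soft minimum of d(z, z') over the observed points, with temperature tau1."""
    dists = [sq_dist(z, q) for q in observed]
    probs = softmax_weights([-d for d in dists], tau1)  # SoftMin = Softmax of negated distances
    return sum(p * d for p, d in zip(probs, dists))


def soft_hausdorff_2d(points, observed, tau1, tau2):
    """Softmax-weighted (temperature tau2) aggregate of the soft-min distances of `points`."""
    soft = [soft_min_distance(p, observed, tau1) for p in points]
    probs = softmax_weights(soft, tau2)
    return sum(p * s for p, s in zip(probs, soft))
```

With both temperatures small, the value approaches the hard max-min estimate; with a large outer temperature τ2 it approaches the mean over soft distances, matching the limit behaviour described above.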

B EVALUATION OF HAUSDORFF DISTANCE APPROXIMATION VARIANTS

As our utilized distance function d(2) H only approximates the Hausdorff distance to the factorized support, we now move to a variant study of alternative distance measures as described above. In particular, we investigate (1) a replacement of the max-operation with a corresponding mean over support samples to better address potential outliers, (2) a probabilistic approximation to our min-operation over d(z, z') (see Eq. 11), and (3) a fully soft approximation to both max and min using a respective Softmax and Softmin formulation (§A.4). (4) Finally, we also revisit the impact of explicit scale regularization as introduced in §2.3.

Method ablation.

Ablation studies across the default as well as two different training correlation settings can be found in Tab. 2, with each entry computed over at least 6 seeds, and a gridsearch over γ ∈ {1, 3, 10, 30, 100, 300, 1000, 3000, 10000}. Our results show that for optimization purposes, approximating the Hausdorff distances in a "sliced", pairwise fashion as suggested in Eq. 5 is noticeably better than subsampling from the very high-dimensional factorized support: instead of a single distance entry that is optimized (after the max-min selection), we have various latent subsets that incur a training gradient, and in two dimensions we can cheaply compute the full factorized support. Similarly, we find that replacing the min-selection over latent entries with a probabilistic variant, or replacing the max-selection over factorized support elements, offers no notable benefits. In particular, the replacement of the outer max-operation can severely impact the disentanglement performance. These insights are additionally supported when utilizing a soft variant (see §A.4), which replaces both min and max operations with a respective Softmin and Softmax operation, with respective temperatures τ1 and τ2. When utilizing this objective, we see that small temperature choices on both soft approximations are beneficial, and approximate the hard variant. Similarly, we find a consistent drop in performance when either one of these temperatures is increased, with the soft performance converging towards the Mean variant when increasing the outer temperature τ2. Overall, we do not see any major benefits in a soft approximation, which also introduces two additional hyperparameters that would need to be optimized.

Table 3: Impact of the number of 2D approximations in d(2) H. Our experiments reveal that the use of multiple 2D approximations to the full Hausdorff distances has notable merits up to a certain degree (change in disentanglement performance from e.g. 64% to 75% in the uncorrelated transfer setting). Each entry was chosen as the highest value in a gridsearch over γ ∈ {0.01, 0.1, 1, 10, 100}.
25.0 [23.9, 35.4] | 28.7 [24.5, 45.6] | 54.2 [44.2, 67.0] | 51.4 [42.5, 58.3] | 53.6 [48.3, 57.9] | 53.5 [51.0, 59.1]
Correlated Pairs: 3: 40.9 [38.5, 44.3] | 44.3 [41.3, 48.0] | 46.8 [44.9, 48.9] | 46.5 [45.7, 47.3] | 48.7 [47.8, 50.2] | 48.8 [46.8, 52.0] | 49.2 [45.6, 51.8]

Table 4: Impact of scale regularization (as detailed in §C) using VAE + d H with γ = 100. For each setting, we perform a gridsearch over either the weight scale δ ∈ {0, 1, 3, 10, 30, 100, 300, 1000, 3000, 10000} or the L2 regularization weight ∈ {10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1}. Results show that scale regularization, while not necessarily detrimental, does not provide any consistent benefits.

Finally, we ablate the key parameter for our Hausdorff distance approximation of choice, d(2) H (Eq. 5): the number of pairs over which we compute a sliced 2D variant. Given a total latent dimensionality k, there are $\binom{k}{2}$ usable combinations, which we can choose to subsample all the way down to a single pair of latent entries, which is what we do in Tab. 3. The results showcase that while a minimal number of latent pairs is crucial, not all combinations are needed, with diminishing returns for more pairs included. For practical purposes, we therefore choose 25 pairs as our default setting to strike a balance between performance and compute cost, which can however be easily increased if needed. On the latter note, we also highlight that while the addition of L HFS does incur a higher epoch training time (60s for a standard VAE as used in Locatello et al. (2020b) on an NVIDIA Quadro GV100) than β-VAE (52s), it still compares favourably to e.g. β-TCVAE (70s) or FactorVAE (96s). In addition, the impact on the training time diminishes when larger backbone networks are utilized.
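The pair-subsampling step discussed above can be sketched in a few lines of pure Python. The helper name `sample_latent_pairs`, the seed handling, and the example latent dimensionality of 10 are assumptions for illustration; only the default of 25 pairs is taken from the text.

```python
import math
import random
from itertools import combinations


def sample_latent_pairs(k, n_pairs_to_use, seed=0):
    """Draw distinct latent index pairs (i, j), i < j, from all k-choose-2 combinations."""
    all_pairs = list(combinations(range(k), 2))
    n = min(n_pairs_to_use, len(all_pairs))  # cannot use more pairs than exist
    return random.Random(seed).sample(all_pairs, n)


# Example with a hypothetical latent dimensionality of 10:
# there are math.comb(10, 2) = 45 candidate pairs, of which 25 are kept.
example_pairs = sample_latent_pairs(k=10, n_pairs_to_use=25)
```

Sampling without replacement ensures that no 2D slice is counted twice, so the number of pairs directly controls the cost of the per-batch distance computation.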

C REGULARIZING SCALE TO AVOID COLLAPSE

In the limit case, the standard Hausdorff matching problem is solved by collapsing all representations into a single point. In addition, the actual scale of the latent entries directly impacts the distance scale. One can therefore add regularization to ensure a scale-invariant measure and to work against a potential collapse, for example by enforcing a minimal variance or a minimal range of the support for each latent entry.

In all cases, we perform a grid-search over an additional scale regularization weight parameter δ (with δ ∈ {0, 1, 3, 10, 30, 100, 300, 1000, 3000, 10000}) or the L2 regularization weight (∈ {10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1}). Our results show no improvements that are both significant and consistent across correlation settings. While these regularizers may become relevant for future variants and extensions, in this work we choose to forgo a scale regularizer, with the benefit of having one less hyperparameter to tune.

D HYPERPARAMETER EVALUATION

To understand to what extent the factorization of support parameter γ impacts the learning and performance of the model, we also compare grid searches over γ and the standard β-VAE prior matching weight β. The results in Fig. 6 indicate that a factorization of support is much less dependent on the exact choice of weighting γ as opposed to the standard KL-Divergence to the normal prior used in β-VAE frameworks (notice the logarithmic value grid). This stands to reason, as a factorization of the support instead of distributions is both a more realistic property as well as a much weaker constraint on the overall training dynamics.

E SAMPLE RECONSTRUCTIONS AND QUALITATIVE EVALUATION

We also provide a qualitative impression of the impact an explicit factorization has on the overall disentanglement across different correlations. In particular, Figure 7 visualizes latent traversals both for the β-VAE baseline (top) as well as the HFS-augmented variant (bottom) for the latent entry most expressive for the first mentioned latent entry in ("Correlations addressed"). To generate these figures, we select the best performing seed for each setup, and report the respective DCI-D score within each subplot. As can be seen, beyond the increase in maximally achievable DCI-D, an explicit factorization of the support helps the disentangling method separate factors it initially struggled with, both when correlations exist in the training data and for general failure modes in which the β-VAE fails to fully disentangle in the uncorrelated setting.

F DETAILED FIGURES AND TABLES

In this section, we provide detailed variants of figures and tables utilised in the main paper.

F.1 ADDITIONAL DISENTANGLEMENT RESULTS

For Tab. 1 studying the impact of HFS both as a standalone objective and as a regularizer on disentanglement of test data across varying degrees of training correlations, we include a more detailed variant highlighting the exact splits utilised in Tab. 5, as well as additional correlation settings.

F.2 FURTHER ADAPTATIONS

Extending our adaptation experiments done in §4.3, we also investigate the average classification performance of ground truth factors of a weaker, L1-regularized linear probe in Fig. 8 . Similar to 

G DISENTANGLEMENT METRICS

In this section, we provide a brief introduction to various disentanglement metrics, with particular emphasis on the DCI-D metric (Eastwood & Williams, 2018) used as our leading measure of disentanglement.

G.1 DCI-D AND DCI-I

DCI-Disentanglement was introduced in Eastwood & Williams (2018) as part of a three-property description of learned representation spaces, alongside Completeness and Informativeness. In this work, we primarily utilize DCI-D as a measure of disentanglement, and DCI-I as a measure of generalization performance. In particular, each submetric utilizes multiple classification models (e.g. a logistic regressor (Eastwood & Williams, 2018) or a boosted decision tree (Locatello et al., 2019b)), which are each trained to predict one underlying ground-truth factor from representations extracted from the dataset of interest. DCI-I is then simply computed as the average prediction error (on a test-split). To compute DCI-D, for each ground-truth factor and consequently each prediction model, predictive importance scores for each dimension of the representation space are extracted from the classification model, given as $R \in \mathbb{R}^{d \times k}$ with representation dimensionality d and number of factors k. For each row, the entropy value is then computed and subtracted from 1, being high if a dimension is predictive for only one factor, and low if it is used to predict multiple factors. Finally, each entropy score is weighted with the relative overall importance of the respective dimension to predict any of the ground-truth factors, giving

$$\text{DCI-D} = \sum_{i}^{d} \left(1 - H(\text{Norm}(R_{i,:}))\right) \frac{\sum_{j}^{k} R_{i,j}}{\sum_{i^*} \sum_{j^*} R_{i^*,j^*}}$$

G.2 MUTUAL INFORMATION GAP (MIG)

The Mutual Information Gap (MIG) was introduced in Chen et al. (2018) to measure the mutual information difference of the two representation entries that have the highest mutual information with a respective ground-truth factor, normalized by the respective entropy, which is then averaged over all ground-truth factors.
For our work, we follow the particular formulation and implementation introduced in Locatello et al. (2019b), taking the mean representations produced by the encoder network and estimating a discrete mutual information score, such that the overall MIG can be computed as

$$\text{MIG} = \frac{1}{k} \sum_{i=1}^{k} \frac{I(\hat{z}_{m(i,1)}, z_i) - I(\hat{z}_{m(i,2)}, z_i)}{H(z_i)}$$

where k denotes the number of ground-truth factors of variation, $z_i$ the i-th ground-truth factor, $H(z_i)$ its entropy, $\hat{z}_j$ the j-th entry of the generated latent space, and m(i, n) a function that returns the representation index with the n-th highest mutual information to ground-truth factor i.

G.3 MODULARITY

The Modularity of Ridgeway & Mozer (2018) is computed as

$$\text{Modularity} = \frac{1}{d} \sum_{i}^{d} \frac{\sum_{j} \left(m_{i,j} \cdot \mathbb{I}\left[j \neq \arg\max_{g} m_{i,g}\right]\right)^2}{\left(\max_{g} m_{i,g}\right)^2 (k - 1)}$$

which, per latent dimension i, measures the average normalized squared mutual information scores between the factors that do not share the highest mutual information with the latent entry i. Here, $m_{i,j}$ denotes the discretized mutual information between latent entry i and factor j, similar to our implementation of the Mutual Information Gap and Locatello et al. (2019b), where we utilize a discretized approximation by binning each latent entry into 20 bins over 10000 samples to compute the discretized mutual information scores.
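As a rough illustration of how the DCI-D and MIG definitions above turn matrices of scores into a single number, consider the following pure-Python sketch. It assumes the importance matrix R (latent dimensions × factors), the mutual information matrix, and the factor entropies have already been estimated; the function names are hypothetical, and taking the entropy with logarithm base k (so that it lies in [0, 1]) is an assumption of this sketch.

```python
import math


def dci_disentanglement(R):
    """DCI-D from a non-negative importance matrix R with d rows (latents) and k >= 2 columns (factors)."""
    k = len(R[0])
    total = sum(sum(row) for row in R)
    score = 0.0
    for row in R:
        row_sum = sum(row)
        if row_sum == 0:
            continue  # a latent without any importance carries zero weight
        probs = [r / row_sum for r in row]  # Norm(R_{i,:})
        entropy = -sum(p * math.log(p, k) for p in probs if p > 0)
        score += (1.0 - entropy) * row_sum / total
    return score


def mutual_information_gap(mi, entropies):
    """MIG from mi[i][j] (latent i, factor j) and the entropies of the k factors."""
    d, k = len(mi), len(mi[0])
    gaps = []
    for j in range(k):
        ranked = sorted((mi[i][j] for i in range(d)), reverse=True)
        gaps.append((ranked[0] - ranked[1]) / entropies[j])
    return sum(gaps) / k
```

A representation in which every latent is important for exactly one factor reaches a DCI-D of 1 under this sketch, whereas uniform importances across factors give 0.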

G.4 SAP SCORE

The Separated Attribute Predictability (SAP) score was introduced in Kumar et al. (2018) as another disentanglement measure, in which the authors suggest to train a linear regressor (in the case of Locatello et al. (2019b) a linear SVM with C = 0.01 and again 10000 training samples and 5000 test points) to predict each ground-truth factor from each dimension of the learned representation space, and then taking the average difference in prediction errors between the two most predictive latent entries for each respective ground-truth factor.
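The final aggregation step of the SAP score can be sketched as follows, assuming a matrix of per-latent predictive scores has already been obtained from the per-dimension predictors (higher meaning more predictive). The function name `sap_score` and the input layout are assumptions for illustration and do not reproduce the evaluation code of Locatello et al. (2019b).

```python
def sap_score(scores):
    """Average gap between the two most predictive latents per factor.

    scores[i][j] is the predictive score of latent dimension i for factor j.
    """
    d, k = len(scores), len(scores[0])
    gaps = []
    for j in range(k):
        ranked = sorted((scores[i][j] for i in range(d)), reverse=True)
        gaps.append(ranked[0] - ranked[1])  # best minus second-best latent for factor j
    return sum(gaps) / k
```

A large gap indicates that a single latent entry captures the factor, while a small gap indicates that the information is spread over several entries.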

G.5 BETA-AND FACTORVAE SCORES

The FactorVAE Score (Kim & Mnih, 2018) is an extension of the BetaVAE Score introduced in Higgins et al. (2017). In both cases, a ground-truth factor of variation is fixed, and two sets of observations are then sampled. The BetaVAE score then measures disentanglement as the classification accuracy of a linear classifier to predict the index of the fixed factor based on the average absolute differences between set pairs. In Locatello et al. (2019b), this process is repeated 10000 times to train a logistic regressor, and evaluated on 5000 test pairs. The FactorVAE score improves on this metric through the use of a majority vote classifier that instead predicts based on the index of the representation entry with least variance.

• No Corr.: No Correlation during training. This constitutes the default evaluation setting.
• Pair: 1 [V1]: floorCol and wallCol.
• Pair: 1 [V2]: objType and objSize.
• Pair: 1 [V3]: objType and wallCol.
• Pair: 1 [V4]: objType and objCol.
• Pair: 1 [inv, V4]: objType and objCol, but inverse correlation.
• Pairs: 2 [V1]: objSize and floorCol as well as objType and wallCol.
• Pairs: 2 [V2]: objSize and objType as well as floorCol and wallCol.
• Pairs: 2 [V3]: objType and objCol as well as objType and objSize.
• Pairs: 2 [inv, V3]: objType and objCol as well as objType and objSize, but with inverse correlation.
• Pairs: 3 [V1]: objSize and objAzimuth as well as objType and wallCol, and objCol and floorCol.
• Pairs: 3 [V2]: objCol and objAzimuth as well as objType and objSize, and wallCol and floorCol.
• Shared Conf. [V1]: We correlate (confound) objType against all other factors.
• Shared Conf. [V2]: We correlate (confound) wallCol against all other factors.

For MPI3D (Gondal et al., 2019), we introduce the following correlations:

• No Corr.: No Correlation during training. This constitutes the default evaluation setting.
• Pair: 1 [V1]: cameraHeight and backgroundCol.
• Pair: 1 [V2]: objCol and objSize.
• Pair: 1 [V3]: posX and posY.
• Pairs: 2 [V1]: objCol and objShape as well as posX and posY.
• Pairs: 2 [V2]: objCol and posX as well as objShape and posY.



We were initially not aware of this work, whose preprint predates ours. We consider it a strong positive sign when the same principle (of support factorization) is arrived at independently from two quite different angles (causality versus relaxed factor independence assumption).

One can alternatively use softened max and min operations, as defined in Appendix A.4. In practice, we saw no robustness benefit to this, likely because we compute dH over batches, not the entire dataset.

Straightforward to generalize to larger tuples, but computational and statistical benefits shrink accordingly.

Though this enforces an implicit assumption on the density within each latent factor.



has started to connect these setups to more realistic scenarios with factor correlations: Träuble et al. (2021) introduce artificial correlations between two factors, and Montero et al. (2021) exclude value combinations for recombination studies. In such settings, Montero et al. (2021; 2022); Träuble et al. (2021); Dittadi et al. (2021)

, Locatello et al. (2020a); Dittadi et al. (2021) have indicated DCI-D as the potentially most suitable disentanglement metric (and as also done e.g. in Locatello et al. (2019b; 2020b); Träuble et al. (

Figure 2: Out-of-Distribution Disentanglement and Generalization across large ranges of correlation shifts between train and test data on Shapes3D. We evaluate the impact of encouraging factorized support on disentanglement (DCI-D) and classification performance of test ground truth factors (DCI-I) via HFS. Y-axis denotes source correlations increasing from top to bottom, x-axis target correlations (left to right). Darker blue and green mean higher scores and absolute improvements, respectively. [Leftmost]: DCI-D β-VAE for all shifts, dropping with higher training correlations. [Left]: Consistent and in parts high improvements in DCI-D when explicitly encouraging factorized support via β-VAE + HFS across shifts. [Right]: DCI-I using a GBT over embeddings generated by a β-VAE model trained on respective source correlations. Drop in performance with higher training correlation or test data variation (bottom left corner). [Rightmost]: Absolute changes in DCI-I with HFS reveal higher generalization particularly when shifts are large (cf. bottom-left). This shows that explicitly encouraging factorized support benefits generalization as shifts become more severe.

Figure 3: Increased accuracy and sample efficiency on downstream classification. We plot relative improvement (%) in average ground truth factor classification accuracy by using HFS on top of a β-VAE, as a function of the amount of labeled training data. Classifier is a GBT (for linear probe see §F.2) receiving either the entire latent vector (black) or only the most expressive entry (blue). The increased disentanglement through HFS gives consistent improvements in all cases, and gets more pronounced in the low data regime for full latents, indicating higher sample efficiency, as expected from better disentanglement. Relative improvements up to +80% in the single entry case across correlation shifts highlight the better reflection of ground truth factors across correlations.

Figure 4: [Left, orange]: Increased support factorization on train data, measured by a lower Hausdorff estimate d(2) H obtained by increasing the HFS weight γ (Eq. 7), improves disentanglement (DCI-D) across correlations on Shapes3D (more results in Supp. Fig. 9). [Center, black]: Minimizing β-VAE KL-Divergence by increasing β implicitly encourages a factorized support by pushing towards full independence, but hurts disentanglement because of the incorrect assumption. [Right]: We find strong correlation between HFS on train data and standard disentanglement metrics on test data ([%], darker → higher) even under training correlations (top to bottom). Detailed figure in §F.5.

$1 - \mathrm{Var}[z_{:,i}]$ (14)

or simply enforcing a minimal range [a, b] of the support:

$$L_{\text{scale}} = \sum_{i=1}^{d} \Big( \max\left[0,\, b - \max(z_{:,i})\right] + \max\left[0,\, \min(z_{:,i}) - a\right] \Big) \qquad (15)$$

In real setups however, we have found regularization of scale to not be necessary in the majority of cases, as the use of the additional autoencoding term alongside the Hausdorff Factorized Support objective is sufficient to avoid collapse (see the part on collapse in §2.3), as shown in Tab. 4. For Tab. 4, we also investigate what happens if we apply L2 regularization on the decoder.
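The range-based regularizer of Eq. 15 can be written down directly; the following pure-Python sketch is only illustrative, and the function name `range_scale_penalty` as well as the list-of-lists input format are assumptions.

```python
def range_scale_penalty(Z, a, b):
    """Hinge penalty that is zero once every latent dimension covers at least [a, b].

    Z is a list of latent vectors (one per sample in the batch).
    """
    k = len(Z[0])
    penalty = 0.0
    for i in range(k):
        column = [z[i] for z in Z]  # z_{:,i}
        penalty += max(0.0, b - max(column))  # largest value does not reach b
        penalty += max(0.0, min(column) - a)  # smallest value does not reach a
    return penalty
```

For a fully collapsed batch in which all latents equal 0 and with [a, b] = [-1, 1], each dimension contributes a penalty of 2, which is the kind of collapse this term is meant to discourage.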

Figure 6: Robustness to factorized support weighting γ -much less detrimental to overall training dynamics.

Figure 7: Sample traversals for the Shapes3D benchmark (Kim & Mnih, 2018) in latent space for the latent entry most closely associated with various ground truth factors of variation across different correlation shifts. In all cases, the best seed (out of 10) was selected to perform these qualitative studies. Each image also reports the associated overall DCI-D score of each respective best seed for β-VAE and β-VAE + HFS.

Figure 8: This figure shows adaptation behaviour across different amounts of test data for an L1-optimal linear probe (i.e. for each seed and entry, we selected the optimal L1-regularization values). Reported values show relative improvement in average ground truth factor classification performance of β-VAE + HFS versus standard β-VAE. As can be seen, the increased disentanglement through an explicitly factorized support gives expected improvements increasing with the severity of training correlations encountered.

Figure 9: This is the full figure for Fig. 4, showcasing that a factorization of the support on the training data is consistently linked to improved downstream disentanglement (top), and that a minimization of the standard β-VAE KLD-objective for a factorial distribution implicitly minimizes for a factorized support across settings.

Figure 10: This is the detailed correlation shift transfer grid utilized in Fig. 2, indicating the exact correlation settings used for training and test data.

Figure 11: Correlation of Hausdorff distance to factorized support on the training data to various disentanglement metrics (in particular DCI and MIG) across correlation shifts. We find factorization of supports on the training data to strongly relate to downstream disentanglement even when experiencing strong correlation during training.

• Pairs: 3 [V1]: objCol and backgroundCol as well as cameraHeight and posX, and objShape and posY.
• Pairs: 3 [V2]: objCol and posX as well as objShape and posY, and backgroundCol and cameraHeight.
• Shared Conf. [V1]: We correlate (confound) objShape against all other factors.
• Shared Conf. [V2]: We correlate (confound) posX against all other factors.

For DSprites (Higgins et al., 2017), we introduce the following correlations:

• No Corr.: No Correlation during training. This constitutes the default evaluation setting.
• Pair: 1 [V1]: shape and scale.
• Pair: 1 [V2]: posX and posY.
• Pair: 1 [V3]: shape and posY.
• Pairs: 2 [V1]: shape and scale as well as posX and posY.
• Pairs: 2 [V2]: shape and posX as well as scale and posY.
• Shared Conf. [V1]: We correlate (confound) shape against all other factors.
• Shared Conf. [V2]: We correlate (confound) posX against all other factors.

I PSEUDOCODE

Finally, we provide PyTorch-style pseudocode to quickly re-implement and apply the factorization objective following Eq. 5.

    import itertools as it

    import numpy as np
    import torch


    def hfs_distance(z, n_pairs_to_use):
        # Inputs:
        # * Batch of latents <z> [bs x dim]
        # * Number of latent pairs to use for approximation <n_pairs_to_use>.
        dim = z.shape[-1]

        # Get available latent pairs and subsample them without replacement.
        pairs = np.array(list(it.combinations(range(dim), 2)))
        n_pairs = len(pairs)
        n_pairs_to_use = min(n_pairs_to_use, n_pairs)
        pairs = pairs[np.random.choice(n_pairs, n_pairs_to_use, replace=False)]
        pairs = torch.as_tensor(pairs, device=z.device)

        # Subsample batch <z> [bs x latent_dim] into <s_z> [bs x num_latent_pairs x 2]
        s_z = z[:, pairs]

        # ixs_a = [0, ..., bs-1, 0, 1, ...., bs-1]
        ref_range = torch.arange(len(z), device=z.device)
        ixs_a = torch.tile(ref_range, dims=(len(z),))
        # ixs_b = [0, 0, 0, ..., 1, 1, ..., bs-1]
        ixs_b = torch.repeat_interleave(ref_range, len(z))

        # Aggregate factorized support:
        # For every latent pair, we select all possible batch pairwise
        # combinations, giving our factorized support <fact_z>:
        # dim(fact_z) = bs ** 2 x num_latent_pairs x 2
        fact_z = torch.cat([s_z[ixs_a, :, 0:1], s_z[ixs_b, :, 1:2]], dim=-1)

        # Compute (squared Euclidean) distance between factorized support
        # and 2D batch embeddings:
        # dim(dists) = bs ** 2 x bs x num_pairs
        dists = ((fact_z.unsqueeze(1) - s_z.unsqueeze(0)) ** 2).sum(-1)

        # Compute Hausdorff distance for each pair, then sum up each pair contribution.
        return dists.min(1)[0].max(0)[0].sum()

Listing 1: Sample PyTorch implementation of $d^{(2)}_H$ (Eq. 5).

, β-TCVAE (Chen et al., 2018)) as done e.g. in Locatello et al. (2019b; 2020b); Träuble et al. (2021). To investigate methods under correlated ground truth factors, we use and extend the correlation framework introduced in Träuble et al. (2021), who introduce correlation between pairs of factors as p(z 1 , z 2

Disentanglement by explicitly factorizing the support using HFS on 3 benchmarks across various numbers of correlated factors (columns) and correlation increasing from left (no correlation) to right (every factor correlated to one confounder; for DSprites, three pairs are impossible, see text). Scores denote DCI-D metric computed on uncorrelated test data. (Bold) blue denotes (second) best performance per benchmark/correlation. [a, b] indicate 25/75th percentiles. The results show that relaxing the goal of a factorial latent distribution to a factorized support with standalone HFS already offers competitive disentanglement. Adding HFS as regularizer over standard methods (β-VAE/TCVAE) to target a more factorized support yields even higher scores, beating other approaches with optimally tuned hyperparameters such as β. Remarkably on MPI3D, optimal tuning turned β and other TCVAE terms to 0, leaving only HFS which consistently worked best.

by simply utilising a randomly subsampled version of Ŝ× , denoted Ŝ×

Method ablations to d(2)H . We compare our default pairwise Hausdorff approximation against a variant using averaging instead of max (

(Eq. 10), Subs. 8 • 10 3 62.2 [59.5, 69.3] 52.

Full table for Tab. 1 with detailed and extended correlation settings. To understand the exact factors correlated, please check the associated pairings from §H.

Ridgeway & Mozer (2018) introduce the notions of Modularity and Expressiveness as key components of a disentangled representation, with the former evaluating whether each representation dimension depends on at most a single ground-truth factor of variation, and the latter the predictiveness of the overall representation to predict ground-truth factor values. Similarly to Locatello et al. (2019b), we mainly focus on the property of Modularity, which Ridgeway & Mozer (2018) define for a d-dimensional representation space with k ground-truth factors as Modularity

ACKNOWLEDGEMENTS

The authors would like to thank Léon Bottou for useful and encouraging early discussions, as well as David Lopez-Paz, Mike Rabbat, and Badr Youbi Idrissi for their careful reading and feedback that helped improve the paper. We also want to extend special thanks to Kartik Ahuja for later making us aware of the work of Wang & Jordan (2021) and its close connection with our approach. Karsten Roth thanks the International Max Planck Research School for Intelligent Systems (IMPRS-IS) and the European Laboratory for Learning and Intelligent Systems (ELLIS) PhD program for support. Zeynep Akata acknowledges partial funding by the ERC (853489 - DEXIM) and DFG (2064/1 - Project number 390727645) under Germany's Excellence Strategy.

The training details are as follows:

• Optimization: Batchsize = 64, Optimizer = Adam (β1 = 0.9, β2 = 0.999, ε = 10^-8), Learning rate = 10
• Architecture: [MLP(1000), leakyReLU] x 6, MLP(2)
• Optimization: Batchsize = 64, Optimizer = Adam (β1 = 0.5, β2 = 0.9, ε = 10^-8)

Finally, we provide the hyperparameter gridsearches performed for every baseline method, which mostly follow Locatello et al. (2020b), as well as for HFS (though for some ablation studies more coarse-grained grids were utilised) and β-VAE + HFS:

Method | Parameter | Values
, 2, 3, 4, 6, 8, 10, 12, 16]
β-TCVAE | β | [1, 2, 3, 4, 6, 8, 10, 12, 16]
AnnealedVAE | c max | [2, 5, 10, 25, 50, 75, 100, 150]
FactorVAE | β | [2, 5, 10, 25, 50, 75, 100, 150]
HFS | γ | [20, 40, 80, 100, 200, 400, 800, 1000, 2000, 4000]
30, 60, 100, 300, 600, 1000, 3000, 6000]

Note that for AnnealedVAE, we also leverage an iteration threshold of 10^5 and γ = 10^3.

H.1 CORRELATION SETTINGS

We now provide more detailed information regarding the specific abbreviations used throughout the main text and the following appendix to denote various correlation setups during training. We note that to introduce multiple correlated factor pairs, we simply multiply the respective p(c i , c j ) entries. For Shapes3D (Kim & Mnih, 2018), we introduce the following correlations:

