DISENTANGLEMENT OF CORRELATED FACTORS VIA HAUSDORFF FACTORIZED SUPPORT

Abstract

A grand goal in deep learning research is to learn representations capable of generalizing across distribution shifts. Disentanglement is one promising direction, aimed at aligning a model's representation with the underlying factors generating the data (e.g. color or background). Existing disentanglement methods, however, rely on an often unrealistic assumption: that factors are statistically independent. In reality, factors (like object color and shape) are correlated. To address this limitation, we consider a relaxed disentanglement criterion, the Hausdorff Factorized Support (HFS) criterion, which encourages only a pairwise factorized support, rather than a factorial distribution, by minimizing a Hausdorff distance. This allows for arbitrary distributions of the factors over their support, including correlations between them. We show that HFS consistently facilitates disentanglement and recovery of ground-truth factors across a variety of correlation settings and benchmarks, even under severe training correlations and correlation shifts, with relative improvements of over 60% in parts over existing disentanglement methods. In addition, we find that leveraging HFS for representation learning can even facilitate transfer to downstream tasks such as classification under distribution shifts. We hope our original approach and positive empirical results inspire further progress on the open problem of robust generalization. Code is available at https://github.com/facebookresearch/disentangling-correlated-factors.

1. INTRODUCTION

Figure 1: Real data exhibits correlations between generative factors: cows are likely on grass, camels on sand. This contradicts disentanglement methods that assume statistically independent factors. Instead, we show that merely assuming and aiming for a factorized support can yield robust disentanglement even under correlated factors.

Disentangled representation learning (Bengio et al., 2013; Higgins et al., 2018) is a promising path to facilitate reliable generalization to in- and out-of-distribution downstream tasks (Bengio et al., 2013; Higgins et al., 2018; Milbich et al., 2020; Dittadi et al., 2021; Horan et al., 2021), on top of being more interpretable and fair (Locatello et al., 2019a; Träuble et al., 2021). While Higgins et al. (2018) propose a formal definition based on group equivariance, and various metrics have been proposed to measure disentanglement (Higgins et al., 2017; Chen et al., 2018; Eastwood & Williams, 2018), the most commonly understood definition is as follows:

Definition 1.1 (Disentanglement) Assuming data generated by a set of unknown ground-truth latent factors, a representation is said to be disentangled if there exists a one-to-one correspondence between each factor and a dimension of the representation.

The method by which to achieve this goal, however, remains an open research question. Weakly- and semi-supervised settings, e.g. using data pairs or auxiliary variables, can provably offer disentanglement (Bouchacourt et al., 2018; Locatello et al., 2020b; Khemakhem et al., 2020; Klindt et al., 2021). But fully unsupervised disentanglement, our focus in this study, is in theory impossible to achieve in the general unconstrained nonlinear case (Hyvärinen & Pajunen, 1999; Locatello et al., 2019b). In practice, however, the inductive biases embodied in common autoencoder architectures allow for effective practical disentanglement (Rolinek et al., 2019). Perhaps more problematic, standard unsupervised disentanglement methods (s.a. Higgins et al. (2017); Kim & Mnih (2018); Chen et al. (2018)) rely on an unrealistic assumption of statistical independence of the ground-truth factors. Real data, however, contains correlations (Träuble et al., 2021). Even with well-defined factors (s.a. shape, color or background), correlations are pervasive: yellow bananas are more frequent than red ones; cows appear more often on grass than on sand. In more realistic settings with correlations, prior work (e.g. Träuble et al. (2021); Dittadi et al. (2021)) has shown existing disentanglement methods to fail.

To address this limitation, we propose to relax the unrealistic assumption of statistical independence of factors (i.e. that they have a factorial distribution), and only assume that the (bounded) support of the factors' distribution factorizes, a much weaker but more realistic constraint. For example, in a dataset of animal images (Fig. 1), background and animal are heavily correlated (camels most likely on sand, cows on grass), resulting in most datapoints being distributed along the diagonal as opposed to uniformly. Under the original assumption of factor independence, a model likely learns a shortcut solution where animal and landscape share the same latent correspondence (Beery et al., 2018). With a factorized support, on the other hand, learned factors should be such that any combination of their values has some grounding in reality: a cow on sand is an unlikely, yet not impossible, combination. Just as standard unsupervised disentanglement methods, we still rely on the inductive bias of encoder-decoder architectures to recover factors (Rolinek et al., 2019). However, we expect our method to facilitate robustness to any distribution shift within the support (Träuble et al., 2021; Dittadi et al., 2021), as it makes no assumptions on the distribution beyond its factorized support. We arrived at this factorized support principle from the perspective of relaxing the independence assumption to be robust to factor correlations, while remaining agnostic to how they may arise. Remarkably, the same principle was derived independently in Wang & Jordan (2021)¹ from a causal perspective and a formal definition of causal disentanglement (Suter et al., 2019), which makes explicit how factor correlations can arise. To ensure a computationally tractable and efficient criterion even with many factors, we further relax the full factorized support assumption to that of only a pairwise factorized support, i.e. factorized support for all pairs of factors.

On this basis, we propose a concrete pairwise Hausdorff Factorized Support (HFS) training criterion to disentangle correlated factors, by aiming for all pairs of latents to have a factorized support. Specifically, we encourage a factorized support by minimizing a Hausdorff set-distance between the finite-sample approximation of the actual support and its factorization (Huttenlocher et al., 1993; Rockafellar & Wets, 1998). Across large-scale experiments on standard disentanglement benchmarks and novel extensions with correlated factors, HFS consistently facilitates disentanglement. We also show that HFS can be implemented as a regularizer for other methods to reliably improve disentanglement, by up to +61% in disentanglement performance over baselines as measured by DCI-D (Eastwood & Williams, 2018) (§4.1, Tab. 1). On downstream classification tasks, we improve generalization to more severe distribution shifts as well as sample efficiency (§4.2, Fig. 2).

To summarize our contributions: [1] We motivate and investigate a principle for learning disentangled representations under correlated factors: we relax the assumption of statistically independent factors into that of a factorized support only (independently also derived in Wang & Jordan (2021) from a causal perspective), and further relax it to a more practical pairwise factorized support. [2] We develop a concrete training criterion based on a pairwise Hausdorff distance term, which can also be combined with existing disentanglement methods (§2.3). [3] Extensive experiments on three main benchmarks and up to 14 increasingly difficult correlation settings, over more than 20k models, show HFS systematically improving disentanglement (as measured by DCI-D) by up to +61% over standard methods (β/TC/Factor/Annealed-VAE, c.f. §4.1). [4] We show that HFS improves robustness to factor distribution shifts between train and test over disentanglement baselines on classification tasks by up to +28%, as well as sample efficiency.

¹ We were initially not aware of this work, whose preprint predates ours. We consider it a strong positive sign when the same principle (of support factorization) is arrived at independently from two quite different angles (causality versus a relaxed factor-independence assumption).
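To make the pairwise factorized-support idea concrete, the snippet below is a minimal, non-differentiable NumPy sketch (the function name `pairwise_hfs_penalty` is ours for illustration, not from the paper's codebase): for every pair of latent dimensions, it measures the directed Hausdorff distance from the product of the observed marginal supports to the observed joint support. The penalty is zero when every combination of observed per-dimension values also occurs as an observed joint sample, and grows when latents only co-occur along, e.g., a diagonal. The paper's actual HFS criterion is a differentiable minibatch estimator of this quantity, suitable for gradient-based training.

```python
import numpy as np

def pairwise_hfs_penalty(z):
    """Directed Hausdorff distance from the factorized (product) support to
    the empirical joint support, summed over all pairs of latent dimensions.

    z: array of shape (n_samples, n_latents), a batch of latent codes.
    Returns ~0 when the support factorizes pairwise, i.e. every combination
    of observed marginal values also appears as an observed joint sample.
    """
    n, d = z.shape
    total = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            # Observed joint support for this pair: (n, 2).
            joint = np.stack([z[:, i], z[:, j]], axis=1)
            # Factorized support: all n*n combinations of observed marginal values.
            gi, gj = np.meshgrid(z[:, i], z[:, j], indexing="ij")
            prod = np.stack([gi.ravel(), gj.ravel()], axis=1)        # (n*n, 2)
            # For each product point, distance to its nearest observed joint sample;
            # the worst case over product points is the directed Hausdorff term.
            dists = np.linalg.norm(prod[:, None, :] - joint[None, :, :], axis=-1)
            total += dists.min(axis=1).max()
    return total

# A full grid of latent combinations has factorized support (penalty 0),
# while perfectly correlated latents along the diagonal do not.
z_factorized = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
z_correlated = np.array([[0., 0.], [1., 1.]])
print(pairwise_hfs_penalty(z_factorized))  # 0.0
print(pairwise_hfs_penalty(z_correlated))  # 1.0
```

Minimizing this quantity pushes the encoder to place observed codes so that unobserved-but-plausible combinations (a cow on sand) remain inside the representable support, without constraining how probability mass is distributed over that support.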

