ON THE TRANSFER OF DISENTANGLED REPRESENTATIONS IN REALISTIC SETTINGS

Abstract

Learning meaningful representations that disentangle the underlying structure of the data generating process is considered to be of key importance in machine learning. While disentangled representations have been found useful for diverse tasks such as abstract reasoning and fair classification, their scalability and real-world impact remain questionable. We introduce a new high-resolution dataset with 1M simulated images and over 1,800 annotated real-world images of the same setup. In contrast to previous work, this new dataset exhibits correlations and a complex underlying structure, and it enables evaluating transfer to unseen simulated and real-world settings where the encoder i) remains in distribution or ii) is out of distribution. We propose new architectures to scale disentangled representation learning to realistic high-resolution settings and conduct a large-scale empirical study of disentangled representations on this dataset. We observe that disentanglement is a good predictor of out-of-distribution (OOD) task performance.

1. INTRODUCTION

Figure 1: Images from the simulated dataset (left) and from the real-world setup (right).

Disentangled representations hold the promise of generalization to unseen scenarios (Higgins et al., 2017b), increased interpretability (Adel et al., 2018; Higgins et al., 2018), and faster learning on downstream tasks (van Steenkiste et al., 2019; Locatello et al., 2019a). However, most of the focus in learning disentangled representations has been on small synthetic datasets whose ground-truth factors exhibit perfect independence by design; more realistic settings remain largely unexplored. We hypothesize that this is because real-world scenarios present several challenges that have not been extensively studied to date. Important challenges are scaling (much higher resolution in observations and factors), occlusions, and correlations between factors. Consider, for instance, a robotic arm moving a cube: here, the arm can occlude parts of the cube, and its end-effector position exhibits correlations with the cube's position and orientation, which can be problematic for common disentanglement learners (Träuble et al., 2020). Another difficulty is that we typically have only limited access to ground-truth labels in the real world, which requires robust frameworks for model selection when no or only weak labels are available.

The goal of this work is to provide a path towards disentangled representation learning in realistic settings. First, we argue that this requires a new dataset that captures the challenges mentioned above. We propose a dataset consisting of simulated observations from a scene in which a robotic arm interacts with a cube in a stage (see Fig. 1). This setting exhibits correlations and occlusions that are typical in real-world robotics. Second, we show how to scale the architecture of disentanglement methods to perform well on this dataset.
Third, we extensively analyze the usefulness of disentangled representations for out-of-distribution downstream generalization, both in terms of held-out factors of variation and sim2real transfer. Our dataset is based on the TriFinger robot from Wüthrich et al. (2020), which can be built to test the deployment of models in the real world. While the analysis in this paper focuses on the transfer and generalization of predictive models, we hope that our dataset may serve as a benchmark for exploring the usefulness of disentangled representations in real-world control tasks.

The contributions of this paper can be summarized as follows:

• We propose a new dataset for disentangled representation learning, containing 1M simulated high-resolution images from a robotic setup with seven partly correlated factors of variation. Additionally, we provide a dataset of over 1,800 annotated images from the corresponding real-world setup that can be used for challenging sim2real transfer tasks. These datasets are made publicly available.[1]

• We propose a new neural architecture to successfully scale VAE-based disentanglement learning approaches to complex datasets.

• We conduct a large-scale empirical study of generalization across various transfer scenarios on this challenging dataset. We train 1,080 models using state-of-the-art disentanglement methods and find that disentanglement is a good predictor of out-of-distribution (OOD) performance on downstream tasks.
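To make the downstream OOD evaluation protocol concrete, the following sketch fits a simple ridge regressor on frozen representations from an in-distribution split and reports its R² on an out-of-distribution split. This is a hypothetical minimal protocol, not the paper's actual pipeline: the function name, the regularization constant, and the choice of a linear downstream model are illustrative assumptions.

```python
import numpy as np

def ood_transfer_score(z_train, y_train, z_ood, y_ood, reg=1e-3):
    """Fit ridge regression on in-distribution representations z_train,
    then score R^2 on an out-of-distribution split (z_ood, y_ood).

    Hypothetical protocol sketch: reg and the linear model are
    illustrative choices, not the paper's downstream setup."""
    # Append a bias column and solve the regularized normal equations.
    Z = np.c_[z_train, np.ones(len(z_train))]
    w = np.linalg.solve(Z.T @ Z + reg * np.eye(Z.shape[1]), Z.T @ y_train)
    pred = np.c_[z_ood, np.ones(len(z_ood))] @ w
    # R^2 on the OOD split: 1 - residual variance / total variance.
    ss_res = np.sum((y_ood - pred) ** 2)
    ss_tot = np.sum((y_ood - y_ood.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

A representation whose coordinates align with the ground-truth factors should let even this simple regressor extrapolate to unseen factor ranges, which is the intuition behind using OOD downstream performance as an evaluation signal.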

2. RELATED WORK

Disentanglement methods. Most state-of-the-art disentangled representation learning approaches are based on the framework of variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014). A (high-dimensional) observation $x$ is assumed to be generated according to the latent variable model $p_\theta(x|z)p(z)$, where the latent variables $z$ have a fixed prior $p(z)$. The generative model $p_\theta(x|z)$ and the approximate posterior distribution $q_\phi(z|x)$ are typically parameterized by neural networks, which are optimized by maximizing the evidence lower bound (ELBO):

$$\mathcal{L}_{\mathrm{VAE}} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big) \le \log p(x)$$

As the above objective does not enforce any structure on the latent space except for some similarity to $p(z)$, different regularization strategies have been proposed, along with evaluation metrics to gauge the disentanglement of the learned representations (Higgins et al., 2017a; Kim & Mnih, 2018; Burgess et al., 2018; Kumar et al., 2018; Chen et al., 2018; Eastwood & Williams, 2018). Recently, Locatello et al. (2019b, Theorem 1) showed that the purely unsupervised learning of disentangled representations is impossible. This limitation can be overcome without the need for explicitly labeled data by introducing weak labels (Locatello et al., 2020; Shu et al., 2019). Ideas related to disentangling the factors of variation date back to the non-linear ICA literature (Comon, 1994; Hyvärinen & Pajunen, 1999; Bach & Jordan, 2002; Jutten & Karhunen, 2003; Hyvarinen & Morioka, 2016; Hyvarinen et al., 2019; Gresele et al., 2019). Recent work combines non-linear ICA with disentanglement (Khemakhem et al., 2020; Sorrenson et al., 2020; Klindt et al., 2020).
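As a minimal numerical illustration of the ELBO above, the sketch below computes a single-sample Monte Carlo estimate with a diagonal-Gaussian posterior and a Bernoulli decoder. The `decode` callback and the `beta` weight (which recovers the β-VAE regularizer for beta > 1) are illustrative assumptions; real implementations use neural encoders/decoders and automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def elbo(x, mu, log_var, decode, beta=1.0):
    """Single-sample Monte Carlo ELBO estimate for binary observations x.

    mu, log_var parameterize the diagonal-Gaussian posterior q(z|x);
    decode(z) must return Bernoulli pixel probabilities in (0, 1).
    beta=1 gives the plain ELBO; beta>1 the beta-VAE objective."""
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps          # reparameterization trick
    p = np.clip(decode(z), 1e-7, 1 - 1e-7)
    # Bernoulli log-likelihood log p(x|z), summed over pixels.
    log_px_z = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p), axis=-1)
    return log_px_z - beta * gaussian_kl(mu, log_var)
```

Note that the KL term vanishes exactly when the posterior equals the prior (mu = 0, log_var = 0), which is the structure the regularization strategies above exploit.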



[1] http://people.tuebingen.mpg.de/ei-datasets/iclr_transfer_paper/robot_finger_datasets.tar (6.18 GB)



Evaluating disentangled representations. The BetaVAE (Higgins et al., 2017a) and FactorVAE (Kim & Mnih, 2018) scores measure disentanglement by performing an intervention on the factors of variation and predicting which factor was intervened on. The Mutual Information Gap (MIG) (Chen et al., 2018), Modularity (Ridgeway & Mozer, 2018), DCI Disentanglement (Eastwood & Williams, 2018), and SAP (Kumar et al., 2018) scores are based on matrices relating factors of variation and codes (e.g., pairwise mutual information, feature importance, and predictability).
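To make the flavor of these matrix-based metrics concrete, here is a simplified MIG computation: discretize each latent dimension, build the factor-code mutual-information matrix, and average the normalized gap between the two most informative latents per factor. The histogram discretization, bin count, and helper names are illustrative choices; published implementations differ in details.

```python
import numpy as np

def discrete_mi(a, b):
    """Mutual information (in nats) between two non-negative integer label arrays."""
    pab = np.histogram2d(a, b, bins=[np.arange(a.max() + 2), np.arange(b.max() + 2)])[0]
    pab /= pab.sum()
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    nz = pab > 0
    return np.sum(pab[nz] * np.log(pab[nz] / (pa @ pb)[nz]))

def entropy(a):
    """Entropy (in nats) of a non-negative integer label array."""
    p = np.bincount(a) / a.size
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mig(codes, factors, n_bins=20):
    """Simplified Mutual Information Gap.

    codes: (n, d) continuous latent codes; factors: (n, k) integer labels.
    For each factor, compute the gap between the two latents with the
    highest mutual information, normalized by the factor's entropy."""
    # Discretize each latent dimension into equal-width bins.
    disc = np.stack(
        [np.digitize(c, np.linspace(c.min(), c.max(), n_bins + 1)[1:-1]) for c in codes.T],
        axis=1,
    )
    gaps = []
    for k in range(factors.shape[1]):
        mis = np.sort([discrete_mi(disc[:, j], factors[:, k]) for j in range(disc.shape[1])])
        gaps.append((mis[-1] - mis[-2]) / entropy(factors[:, k]))
    return float(np.mean(gaps))
```

A perfectly disentangled representation, where each factor is captured by exactly one latent, yields a MIG near 1; a representation where several latents share a factor (or none captures it) yields a MIG near 0.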

