ON THE NECESSITY OF DISENTANGLED REPRESENTATIONS FOR DOWNSTREAM TASKS

Anonymous

Abstract

A disentangled representation encodes the generative factors of data in a separable and compact pattern, and it is therefore widely believed that such a representation format benefits downstream tasks. In this paper, we challenge the necessity of disentangled representations in downstream applications. Specifically, we show that dimension-wise disentangled representations are not necessary for downstream tasks that train neural networks on learned representations. We provide extensive empirical evidence against the necessity of disentanglement, covering multiple datasets, representation learning methods, and downstream network architectures. Moreover, our study reveals that the informativeness of representations best accounts for downstream performance. The positive correlation between informativeness and disentanglement explains the usefulness attributed to disentangled representations in previous works.

1. INTRODUCTION

Disentanglement has been considered an essential property of representation learning (Bengio et al., 2013; Peters et al., 2017; Goodfellow et al., 2016; Bengio et al., 2007; Schmidhuber, 1992; Lake et al., 2017; Tschannen et al., 2018). Though there is no widely accepted formal definition yet, the fundamental intuition is that a disentangled representation should separately and distinctly capture information from the generative factors of data (Bengio et al., 2013). In practice, disentanglement is often implemented to emphasize a dimension-wise relationship, i.e., a representation dimension should capture information from exactly one factor and vice versa (Locatello et al., 2019b; Higgins et al., 2016; Kim & Mnih, 2018; Chen et al., 2018; Eastwood & Williams, 2018; Ridgeway & Mozer, 2018; Kumar et al., 2017; Do & Tran, 2019). Disentangled representations offer human-interpretable factor dependencies; in theory, they are therefore robust to variations in natural data and are expected to benefit downstream performance (Bengio et al., 2013). Researchers are interested in empirically verifying these purported advantages. In particular, they focus on the following two-staged tasks: (1) extracting representations from data in an unsupervised manner, and (2) training downstream neural networks on the learned representations (van Steenkiste et al., 2019; Locatello et al., 2019a; Dittadi et al., 2020; Locatello et al., 2020). Among various downstream tasks, aside from those that explicitly require disentanglement (Higgins et al., 2018b; Gabbay & Hoshen, 2021; Schölkopf et al., 2021), abstract visual reasoning is widely recognized as a popular testbed (van Steenkiste et al., 2019; Locatello et al., 2020; Schölkopf et al., 2021). Its premise aligns with the goals of machine intelligence (Snow et al., 1984; Carpenter et al., 1990).
Moreover, its mechanism ensures valid measurement of representations' downstream performance (Fleuret et al., 2011; Barrett et al., 2018). In the abstract visual reasoning task, intelligent agents are asked to take human IQ tests, i.e., to predict the missing panel of Raven's Progressive Matrices (RPMs) (Raven, 1941). It is indeed a challenging task for representation learning (Barrett et al., 2018; van Steenkiste et al., 2019). The disentanglement literature often takes this task as an encouraging example showing that disentanglement leads to quicker learning and better final performance (van Steenkiste et al., 2019; Locatello et al., 2020; Schölkopf et al., 2021). However, on the abstract visual reasoning task, we find that rotating disentangled representations, i.e., multiplying the representations by an orthonormal matrix, has no impact on sample efficiency or final accuracy. We construct the most disentangled representations possible, i.e., normalized true factors, and then solve the downstream tasks from both them and their rotated variants. As shown in Figure 2a, there is little difference between the accuracy curves of the original and rotated representations throughout the learning process. On one hand, this phenomenon is surprising, since the rotation decreases dimension-wise disentanglement by destroying axis alignment (Locatello et al., 2019b); indeed, in Figure 2b we observe notable drops in disentanglement metric scores (first 5 columns). Our finding demonstrates that disentanglement does not affect the downstream learning trajectory, which contradicts the commonly believed usefulness of disentanglement. On the other hand, it is not surprising, since we apply an invertible linear transform: Logistic Regression (LR) accuracy remains 100% before and after rotation, indicating that a simple linear layer can eliminate the effects of rotation. Given these facts, some questions arise: Are disentangled representations necessary for two-staged tasks?
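The rotation described above can be sketched as follows. This is a minimal illustration, not the paper's actual code; the helper name `random_orthonormal` and the toy dimensions are ours. Sampling a Gaussian matrix and taking its QR decomposition (with a sign correction) is a standard way to draw a uniformly distributed orthogonal matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthonormal(dim, rng):
    # QR of a Gaussian matrix yields an orthogonal Q; flipping column
    # signs by the diagonal of R makes the draw uniform over rotations.
    a = rng.standard_normal((dim, dim))
    q, r = np.linalg.qr(a)
    return q * np.sign(np.diag(r))

n, d = 1000, 6
z = rng.standard_normal((n, d))    # stand-in for (normalized) factor representations

q = random_orthonormal(d, rng)
z_rot = z @ q                      # "entangled" variant: axis alignment destroyed

# The transform is invertible (q.T is its inverse), so no information is lost.
z_back = z_rot @ q.T
```

Because the map is orthonormal and hence invertible, a single linear layer in the downstream network can undo it, which is consistent with the unchanged accuracy curves in Figure 2a.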
If not, which property matters? To address these questions, we conduct an extensive empirical study based on abstract reasoning tasks. Our contributions are as follows.

• We challenge the necessity of disentanglement for abstract reasoning tasks. We find that (1) entangling representations by random rotation has little impact, and (2) general-purpose representation learning methods can reach better or competitive performance compared with disentanglement methods.

• Following Eastwood & Williams (2018), we use informativeness to denote what information the representation has learned. We show that informativeness matters most for downstream performance: (1) Logistic regression (LR) accuracy on factor classification correlates with downstream performance more strongly than disentanglement metrics do. (2) When conditioning on similar LR accuracy, disentanglement correlates only mildly with downstream performance. (3) Informativeness underlies the previously argued usefulness of disentanglement, since we observe a positive correlation between LR accuracy and disentanglement metrics.

• We conduct a large-scale empirical study supporting our claims. We train 720 representation learning models covering two datasets, including both disentanglement and general-purpose methods. We then train 5 WReNs (Barrett et al., 2018) and 5 Transformers (Vaswani et al., 2017; Hahne et al., 2019) on the outputs of each representation learning model to perform abstract reasoning, yielding a total of 7200 abstract reasoning models.
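The informativeness probe above can be made concrete with a small sketch. It is illustrative only (the helper names and toy data are ours, and we use a hand-rolled gradient-descent logistic regression rather than any particular library): a linear probe's accuracy on a factor-decoding task is unchanged by an orthonormal rotation of the representation, even though the rotation entangles dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logreg(x, y, lr=0.5, steps=800):
    # Minimal logistic-regression probe trained by plain gradient descent.
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        logits = np.clip(x @ w + b, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-logits))   # predicted P(y = 1)
        g = p - y
        w -= lr * x.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def accuracy(w, b, x, y):
    return float(((x @ w + b > 0).astype(int) == y).mean())

# Toy data: one binary "factor" that is linearly decodable from z.
n, d = 2000, 10
z = rng.standard_normal((n, d))
y = (z @ rng.standard_normal(d) > 0).astype(int)

acc = accuracy(*fit_logreg(z, y), z, y)

# An orthonormal rotation entangles the dimensions but is linearly
# invertible, so the probe's achievable accuracy is unchanged.
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
acc_rot = accuracy(*fit_logreg(z @ q, y), z @ q, y)
```

Gradient descent from a zero initialization is equivariant to orthogonal transforms of the input, so `acc` and `acc_rot` agree up to numerical noise; only a probe restricted to single dimensions (as disentanglement metrics effectively are) would see a difference.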

2. RELATED WORK

Disentangled representation learning. There is no agreed-upon formal definition of disentanglement. Therefore, in practice, disentanglement is often interpreted as a one-to-one mapping between representation dimensions and generative factors of data, which we term "dimension-wise disentanglement": each representation dimension should encode exactly one factor and vice versa (Locatello et al., 2019b; Eastwood & Williams, 2018; Kumar et al., 2017; Do & Tran, 2019). Besides dimension-wise disentanglement, Higgins et al. (2018a) propose a definition from the group-theory perspective. However, its requirement of interaction with the environment precludes applicable learning methods on existing disentanglement benchmarks (Caselles-Dupré et al., 2019). Adopting the dimension-wise definition, researchers have developed methods and metrics. State-of-the-art disentanglement methods are mainly variants of generative methods (Higgins et al., 2016; Kim & Mnih, 2018; Burgess et al., 2018; Kumar et al., 2017; Chen et al., 2018; 2016; Jeon et al., 2018; Lin et al., 2020). Corresponding metrics are designed in the following ways (Zaidi et al., 2020): intervening on factors (Higgins et al., 2016; Kim & Mnih, 2018), estimating mutual information (Chen et al., 2018), and training classifiers (Eastwood & Williams, 2018; Kumar et al., 2017). Another line of work related to disentangled representation learning is Independent Component Analysis (ICA) (Comon, 1994). ICA aims to recover the independent components of the data, using the mean correlation coefficient (MCC) as the metric. However, ICA models require access to auxiliary variables (Hyvarinen et al., 2019), leading to inevitable supervision when training on image datasets (Hyvarinen & Morioka, 2016; Khemakhem et al., 2020a; b; Klindt et al., 2020). In this paper, we focus on the downstream performance of unsupervised representation learning.

Downstream tasks. It is widely believed that disentangled representations benefit downstream tasks.
Intuitively, they offer a human-understandable structure with ready access to salient factors, and hence should enjoy robust generalization capacity (Bengio et al., 2013; Do & Tran, 2019). Several works conduct empirical studies on downstream tasks to support the notions above, includ-
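As context for the ICA metric mentioned in the related work above, the mean correlation coefficient (MCC) can be sketched as follows. This is our own minimal formulation, assuming the common convention of matching estimated dimensions to true sources via the Hungarian algorithm over absolute Pearson correlations; the function name `mcc` is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mcc(z_true, z_est):
    # Match estimated dimensions one-to-one to true sources so that the
    # mean absolute Pearson correlation along the matching is maximized.
    d = z_true.shape[1]
    c = np.abs(np.corrcoef(z_true.T, z_est.T)[:d, d:])  # cross-correlation block
    rows, cols = linear_sum_assignment(-c)              # maximize total |corr|
    return float(c[rows, cols].mean())

rng = np.random.default_rng(0)
s = rng.standard_normal((1000, 4))                  # "true" independent sources
recovered = s[:, rng.permutation(4)] * 3.0 + 1.0    # permuted, scaled, shifted
score = mcc(s, recovered)                           # ~1.0: perfect recovery up to
                                                    # permutation and affine scaling
```

Because Pearson correlation is invariant to positive scaling and shifts, MCC scores recovery only up to permutation and per-dimension affine transforms, which is exactly the identifiability notion used in the ICA literature cited above.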

