ON THE NECESSITY OF DISENTANGLED REPRESENTATIONS FOR DOWNSTREAM TASKS

Anonymous

Abstract

A disentangled representation encodes the generative factors of data in a separable and compact pattern, and it is therefore widely believed that such a representation format benefits downstream tasks. In this paper, we challenge the necessity of disentangled representations in downstream applications. Specifically, we show that dimension-wise disentangled representations are unnecessary for downstream tasks that use neural networks taking learned representations as input. We provide extensive empirical evidence against the necessity of disentanglement, covering multiple datasets, representation learning methods, and downstream network architectures. Moreover, our study reveals that the informativeness of representations best accounts for downstream performance, and that the positive correlation between informativeness and disentanglement explains the usefulness of disentangled representations claimed in previous works.

1. INTRODUCTION

Disentanglement has been considered an essential property of representation learning (Bengio et al., 2013; Peters et al., 2017; Goodfellow et al., 2016; Bengio et al., 2007; Schmidhuber, 1992; Lake et al., 2017; Tschannen et al., 2018). Though there is no widely accepted formal definition yet, the fundamental intuition is that a disentangled representation should separately and distinctly capture information from the generative factors of data (Bengio et al., 2013). In practice, disentanglement is often implemented as a dimension-wise relationship, i.e., each representation dimension should capture information from exactly one factor and vice versa (Locatello et al., 2019b; Higgins et al., 2016; Kim & Mnih, 2018; Chen et al., 2018; Eastwood & Williams, 2018; Ridgeway & Mozer, 2018; Kumar et al., 2017; Do & Tran, 2019). Disentangled representations offer human-interpretable factor dependencies; in theory, they are therefore robust to variations in natural data and are expected to benefit downstream performance (Bengio et al., 2013). Researchers have sought to verify these purported advantages empirically. In particular, they focus on the following two-stage tasks: (1) extracting representations from data in an unsupervised manner, then (2) training downstream neural networks on the learned representations (van Steenkiste et al., 2019; Locatello et al., 2019a; Dittadi et al., 2020; Locatello et al., 2020). Among downstream tasks, aside from those that explicitly require disentanglement (Higgins et al., 2018b; Gabbay & Hoshen, 2021; Schölkopf et al., 2021), abstract visual reasoning is widely recognized as a popular testbed (van Steenkiste et al., 2019; Locatello et al., 2020; Schölkopf et al., 2021). The premise behind it aligns with the goals of machine intelligence (Snow et al., 1984; Carpenter et al., 1990).
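The two-stage evaluation protocol above can be sketched in a few lines. Everything here is an illustrative stand-in, not the paper's actual models or datasets: a fixed random map plays the role of an unsupervised encoder (e.g., a beta-VAE encoder), and a logistic-regression head plays the role of the downstream network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (stand-in): a frozen, unsupervisedly pre-trained encoder.
w_enc = rng.normal(size=(16, 8))

def encode(x):
    return np.tanh(x @ w_enc)  # the encoder is never updated

# Toy data: 16-dim inputs whose label depends on one generative factor.
x = rng.normal(size=(500, 16))
y = (x[:, 0] > 0).astype(float)
z = encode(x)  # learned representations fed to the downstream model

# Stage 2: train a small downstream model (logistic regression via
# full-batch gradient descent) on the frozen representations only.
w2, b2 = np.zeros(8), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(z @ w2 + b2)))
    grad = p - y
    w2 -= 0.1 * z.T @ grad / len(y)
    b2 -= 0.1 * grad.mean()

p = 1.0 / (1.0 + np.exp(-(z @ w2 + b2)))
loss = -np.mean(np.where(y == 1.0, np.log(p), np.log(1.0 - p)))
acc = ((p > 0.5).astype(float) == y).mean()
```

Downstream performance (here, `acc`) is then compared across different encoders; the question at issue is whether more disentangled encoders systematically yield better such scores.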
Moreover, its mechanism ensures a valid measurement of representations' downstream performance (Fleuret et al., 2011; Barrett et al., 2018). In the abstract visual reasoning task, intelligent agents are asked to take human IQ tests, i.e., to predict the missing panel of Raven's Progressive Matrices (RPMs) (Raven, 1941). It is indeed a challenging task for representation learning (Barrett et al., 2018; van Steenkiste et al., 2019). The disentanglement literature often takes this task as an encouraging example showing that disentanglement leads to quicker learning and better final performance (van Steenkiste et al., 2019; Locatello et al., 2020; Schölkopf et al., 2021). However, on the abstract visual reasoning task, we find that rotating disentangled representations, i.e., multiplying the representations by an orthonormal matrix, has no impact on sample efficiency or final accuracy. We construct the most disentangled representations possible, i.e., normalized true factors.
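The rotation operation described above can be sketched as follows. The representation dimensionality and the toy factors are hypothetical; only the operation itself, multiplication by a random orthonormal matrix (sampled here via QR decomposition), is from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "perfectly disentangled" representations: each column holds
# exactly one (normalized) generative factor. Sizes are illustrative.
n_samples, n_dims = 1000, 5
z = rng.uniform(-1.0, 1.0, size=(n_samples, n_dims))

# Sample a random orthonormal matrix via QR decomposition.
q, r = np.linalg.qr(rng.normal(size=(n_dims, n_dims)))
q = q * np.sign(np.diag(r))  # sign fix for a uniform (Haar) rotation

# Rotating mixes every factor into every dimension, destroying
# dimension-wise disentanglement ...
z_rot = z @ q

# ... yet loses no information: the map is orthonormal, hence
# invertible, so a downstream network could in principle undo it.
z_rec = z_rot @ q.T
assert np.allclose(z_rec, z)
```

One intuition for the empirical finding: a downstream network whose first layer is linear can absorb `q` into its weights, so the rotation should leave sample efficiency and accuracy unchanged.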

