CLEARING THE PATH FOR TRULY SEMANTIC REPRESENTATION LEARNING

Abstract

The performance of β-Variational Autoencoders (β-VAEs) and their variants on learning semantically meaningful, disentangled representations is unparalleled. On the other hand, there are theoretical arguments suggesting the impossibility of unsupervised disentanglement. In this work, we show that small perturbations of existing datasets hide the convenient correlation structure that is easily exploited by VAE-based architectures. To demonstrate this, we construct modified versions of standard datasets on which (i) the generative factors are perfectly preserved; (ii) each image undergoes a transformation barely visible to the human eye; and (iii) the leading disentanglement architectures fail to produce disentangled representations. We intend for these datasets to play a role in separating correlation-based models from those that discover the true causal structure. The construction of the modifications is non-trivial and relies on recent progress in the mechanistic understanding of β-VAEs and their connection to PCA, while also providing additional insights that may be of stand-alone interest.

1. INTRODUCTION

The task of unsupervised learning of interpretable data representations has a long history, ranging from classical approaches based on linear algebra, e.g. Principal Component Analysis (PCA) (Pearson, 1901), or on statistics, such as Independent Component Analysis (ICA) (Comon, 1994), all the way to more recent approaches relying on deep learning architectures. The cornerstone architecture is the Variational Autoencoder (VAE) (Kingma & Welling, 2014), which clearly demonstrated both high semantic quality and good performance in terms of disentanglement. To this day, derivatives of VAEs (Higgins et al., 2017; Kim & Mnih, 2018a; Chen et al., 2018; Kumar et al., 2017) excel over other architectures in terms of disentanglement metrics. The extent of VAEs' success even prompted recent deeper analyses of their inner workings (Rolinek et al., 2019; Burgess et al., 2018; Chen et al., 2018; Mathieu et al., 2018).

If we hold the overloaded term disentanglement to the highest of its aspirations, namely the ability to recover the true generating factors of data, fundamental problems arise. As explained by Locatello et al. (2019), already the concept of generative factors is compromised from a statistical perspective: two (or, in fact, infinitely many) sets of generative factors can generate statistically indistinguishable datasets. Yet the scores on the disentanglement benchmarks are high and continue to rise. This apparent contradiction needs a resolution.

In this work, we claim that all leading disentanglement architectures can be fooled by the same trick: introducing a small change to the correlation structure that perfectly preserves the set of generative factors. To that end, we provide an alternative version of the standard disentanglement datasets in which each image undergoes a modification barely visible to the human eye. We report drastic drops in disentanglement performance on the altered datasets.
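For reference, the objective underlying most of these VAE derivatives is the β-VAE loss (Higgins et al., 2017), which weights the KL term of the standard evidence lower bound by a coefficient β:

```latex
\mathcal{L}_{\beta\text{-VAE}}(x) \;=\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
\;-\; \beta \, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
```

Setting β = 1 recovers the vanilla VAE; β > 1 strengthens the pressure towards the factorized prior p(z), which is the mechanism the analyses cited above connect to PCA-like behavior.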
On a technical level, we build on the findings of Rolinek et al. (2019), who argued that VAEs recover the nonlinear principal components of a dataset; in other words, nonlinearly computed scalars that are the sources of variance in the sense of classical PCA. The small modifications of the datasets we propose aim to change the leading principal component by adding modest variance to an alternative candidate.¹ The "to-be" leading principal component is specific to each dataset, but it is determined in a consistent fashion. With the greedy algorithm for discovering the linear PCA components in mind, we can see that any change in the leading principal component is also reflected in the others. As a result, the overall alignment changes and the generating factors become entangled, leading to low disentanglement scores. We demonstrate that, even though the viewpoint of Rolinek et al. (2019) only has theoretical support for the (β-)VAE, it empirically transfers to other architectures.

Overall, we want to encourage evaluating new disentanglement approaches on the proposed datasets, in which the generative factors are intact but the correlation structure is less favorable for the principal component discovery ingrained in VAE-style architectures. We hope that providing a more sound experimental setup will clear the path for a new set of disentanglement approaches.

¹Datasets will be released here: https://sites.google.com/view/sem-rep-learning
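The linear analogue of this mechanism can be illustrated in a few lines: injecting extra variance along a previously subdominant direction flips the leading principal component, and the whole greedy PCA alignment changes with it. This is a toy sketch in plain numpy (the perturbation magnitude is exaggerated relative to our image-space modifications, and the 2-D setup is purely illustrative, not the construction used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two latent "generative factors" with unequal variance,
# embedded into a 2-D observation space via a rotation R.
n = 10_000
factors = rng.normal(size=(n, 2)) * np.array([3.0, 1.0])  # stds 3 and 1
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = factors @ R.T  # each row is R @ f

def leading_pc(X):
    """Eigenvector of the sample covariance with the largest eigenvalue."""
    Xc = X - X.mean(axis=0)
    w, V = np.linalg.eigh(Xc.T @ Xc / len(Xc))  # eigenvalues ascending
    return V[:, -1]

pc_before = leading_pc(X)

# Perturbation: inject independent extra variance along the *second*
# factor's direction until that direction dominates (1 + 3.5^2 > 3^2).
X_mod = X + rng.normal(size=(n, 1)) * 3.5 * R[:, 1]

pc_after = leading_pc(X_mod)

# The leading principal component rotates from factor 1 to factor 2,
# so every component found by the greedy procedure reorients.
print(abs(pc_before @ R[:, 0]))  # close to 1: aligned with factor 1
print(abs(pc_after  @ R[:, 1]))  # close to 1: now aligned with factor 2
```

The same effect in the nonlinear setting is what entangles the generating factors: the network still extracts variance-ordered components, but they no longer coincide with the factors.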

2. RELATED WORK

The related work can be categorized into three research questions: i) defining disentanglement and designing metrics that capture the quality of latent representations; ii) developing architectures for unsupervised learning of disentangled representations; and iii) understanding the inner workings of existing architectures, for example of β-VAEs. This paper builds upon results from all three lines of work.

Defining disentanglement. Defining the term disentangled representation is an open question (Higgins et al., 2018). The presence of learned representations in machine learning downstream tasks, such as object recognition, natural language processing, and others, created the need to "disentangle the factors of variation" (Bengio et al., 2013) very early on. This vague interpretation of disentanglement is inspired by the existence of a low-dimensional manifold that captures the variance of higher-dimensional data. As such, finding a factorized, statistically independent representation became a core ingredient of disentangled representation learning and dates back to classical ICA models (Comon, 1994; Bell & Sejnowski, 1995). For some tasks, the desired feature of a disentangled representation is that it is semantically meaningful. Prominent examples can be found in computer vision (Shu et al., 2017; Liao et al., 2020) and in research addressing the interpretability of machine learning models (Adel et al., 2018; Kim, 2019). Based on group theory and symmetry transformations, Higgins et al. (2018) provide the "first principled definition of a disentangled representation". Closely related to this concept is the field of causality in machine learning (Schölkopf, 2019; Suter et al., 2019), more specifically the search for causal generative models (Besserve et al., 2018; 2020).

Architecture development. The leading architectures for disentangled representation learning are based on VAEs (Kingma & Welling, 2014).
Although originally developed as a generative modeling architecture, VAE variants have proven to excel at representation learning tasks. First among them, the β-VAE (Higgins et al., 2017), which exposes the trade-off between reconstruction and regularization via an additional hyperparameter, performs remarkably well. Other architectures have been proposed that additionally encourage statistical independence in the latent space, e.g. FactorVAE (Kim & Mnih, 2018b) and β-TC-VAE (Chen et al., 2018). The DIP-VAE (Kumar et al., 2017) suggests using moment matching to close the distribution gap introduced in the original VAE paper. As the architectures developed, so did the metrics used for measuring the disentanglement quality of representations (Chen et al., 2018; Kim & Mnih, 2018b; Higgins et al., 2017; Kumar et al., 2017).

Understanding inner workings. With the rising success and development of VAE-based architectures, the question of understanding their inner working principles became dominant in the community. One line of work seeks to answer the question of why these models disentangle (Burgess et al., 2018). Another, closely related line of work establishes and tightens the connection between the vanilla (β-)VAE objective and (probabilistic) PCA (Tipping & Bishop, 1999) (Rolinek et al., 2019; Lucas et al., 2019). Building on such findings, novel approaches for model selection were proposed (Duan et al., 2020), emphasizing the value of thoroughly understanding the working principles of these methods. On a less technical side, Locatello et al. (2019) conducted a broad set of experiments, questioning the relevance of models given the variance over restarts and the choice of hyperparameters. They also formalized the necessity of inductive bias as a strict requirement for unsupervised learning of disentangled representations. Our experiments are built on their code-base.
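The statistical independence that FactorVAE and β-TC-VAE encourage is typically quantified as the total correlation of the latent distribution, i.e. the sum of marginal entropies minus the joint entropy, which vanishes exactly when the dimensions are independent. For a Gaussian this quantity has a simple closed form; the sketch below is an illustrative helper of our own (not code from any of the cited papers):

```python
import numpy as np

def gaussian_total_correlation(cov):
    """Total correlation of a zero-mean Gaussian with covariance `cov`:
    sum of marginal entropies minus the joint entropy,
    i.e. 0.5 * (sum_i log cov_ii - log det cov).
    Zero iff the covariance is diagonal (independent dimensions)."""
    cov = np.asarray(cov, dtype=float)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - np.linalg.slogdet(cov)[1])

# Independent latents -> zero penalty.
print(gaussian_total_correlation(np.eye(3)))        # 0.0

# Correlated latents -> positive penalty.
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
print(gaussian_total_correlation(cov))              # about 0.51
```

In the actual architectures this term is estimated from minibatches of the aggregate posterior (via a discriminator in FactorVAE, via minibatch-weighted sampling in β-TC-VAE) rather than computed in closed form.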

