CLEARING THE PATH FOR TRULY SEMANTIC REPRESENTATION LEARNING

Abstract

The performance of β-Variational Autoencoders (β-VAEs) and their variants in learning semantically meaningful, disentangled representations is unparalleled. On the other hand, there are theoretical arguments suggesting the impossibility of unsupervised disentanglement. In this work, we show that small perturbations of existing datasets hide the convenient correlation structure that is easily exploited by VAE-based architectures. To demonstrate this, we construct modified versions of the standard datasets on which (i) the generative factors are perfectly preserved; (ii) each image undergoes a transformation barely visible to the human eye; (iii) the leading disentanglement architectures fail to produce disentangled representations. We intend for these datasets to play a role in separating correlation-based models from those that discover the true causal structure. The construction of the modifications is non-trivial and relies on recent progress in the mechanistic understanding of β-VAEs and their connection to PCA, while also providing additional insights that may be of standalone interest.

1. INTRODUCTION

The task of unsupervised learning of interpretable data representations has a long history, ranging from classical approaches based on linear algebra, e.g., Principal Component Analysis (PCA) (Pearson, 1901), and statistical methods such as Independent Component Analysis (ICA) (Comon, 1994), all the way to more recent approaches that rely on deep learning architectures. The cornerstone architecture is the Variational Autoencoder (VAE) (Kingma & Welling, 2014), which clearly demonstrated both high semantic quality and good performance in terms of disentanglement. To this day, derivatives of VAEs (Higgins et al., 2017; Kim & Mnih, 2018a; Chen et al., 2018; Kumar et al., 2017) excel over other architectures on disentanglement metrics. The extent of the VAE's success has even prompted recent deeper analyses of its inner workings (Rolinek et al., 2019; Burgess et al., 2018; Chen et al., 2018; Mathieu et al., 2018).

If we hold the overloaded term disentanglement to the highest of its aspirations, namely the ability to recover the true generating factors of the data, fundamental problems arise. As explained by Locatello et al. (2019), already the concept of generative factors is compromised from a statistical perspective: two (or in fact infinitely many) sets of generative factors can generate statistically indistinguishable datasets. Yet, the scores on the disentanglement benchmarks are high and continue to rise. This apparent contradiction needs a resolution.

In this work, we claim that all leading disentanglement architectures can be fooled by the same trick: introducing a small change in the correlation structure which, however, perfectly preserves the set of generative factors. To that end, we provide alternative versions of the standard disentanglement datasets in which each image undergoes a modification barely visible to the human eye. We report drastic drops in disentanglement performance on the altered datasets.
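The non-identifiability argument can be made concrete with a toy sketch (our own illustration; the rotation construction is a standard example and not taken from this paper): if the factors are isotropic Gaussians, any rotation of them yields a second, different set of statistically independent factors that produces exactly the same data distribution, so no unsupervised method can tell which set is the "true" one.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(100_000, 2))           # one candidate set of factors

# Any rotation of isotropic Gaussian factors is again a set of
# independent standard-normal factors -- a distinct "ground truth"
# generating the identical distribution.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
z_alt = z @ R.T                             # an alternative set of factors

# Both factor sets have (approximately) identity covariance:
print(np.cov(z, rowvar=False).round(2))
print(np.cov(z_alt, rowvar=False).round(2))
```

Any observation model applied on top of either factor set yields statistically indistinguishable datasets, which is the crux of the impossibility result.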
On a technical level, we build on the findings of Rolinek et al. (2019), who argued that VAEs recover the nonlinear principal components of a dataset; in other words, nonlinearly computed scalars that are the sources of variance in the sense of classical PCA. The small modifications of the datasets we propose aim to change the leading principal component by adding modest variance to an alternative

Datasets will be released here: https://sites.google.com/view/sem-rep-learning
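The mechanism can be shown in miniature (a numpy sketch of our own, not the paper's actual image-space construction): injecting modest variance along a previously silent direction is enough to change which direction classical PCA identifies as the leading component.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Original data: most variance lies along axis 0 (the "old" leading PC).
X = rng.normal(size=(n, 2)) * np.array([1.0, 0.3])

# Perturbation: modest extra variance along a new, third axis -- the
# analogue of a barely visible image modification that introduces an
# alternative source of variance.
z = rng.normal(scale=1.2, size=(n, 1))
X_mod = np.hstack([X, z])

def leading_pc(data):
    # Principal components are the eigenvectors of the covariance matrix;
    # the leading one belongs to the largest eigenvalue.
    cov = np.cov(data, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    return vecs[:, np.argmax(vals)]

print(leading_pc(np.hstack([X, np.zeros((n, 1))])))  # dominated by axis 0
print(leading_pc(X_mod))                             # now dominated by axis 2
```

Since the injected direction carries variance 1.44 versus 1.0 for the original leading axis, the ordering of the principal components flips even though the perturbation itself is small per sample.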

