CLEARING THE PATH FOR TRULY SEMANTIC REPRESENTATION LEARNING

Abstract

The performance of β Variational Autoencoders (β-VAEs) and their variants on learning semantically meaningful, disentangled representations is unparalleled. On the other hand, there are theoretical arguments suggesting the impossibility of unsupervised disentanglement. In this work, we show that small perturbations of existing datasets remove the convenient correlation structure that is easily exploited by VAE-based architectures. To demonstrate this, we construct modified versions of the standard datasets on which (i) the generative factors are perfectly preserved; (ii) each image undergoes a transformation barely visible to the human eye; (iii) the leading disentanglement architectures fail to produce disentangled representations. We intend for these datasets to play a role in separating correlation-based models from those that discover the true causal structure. The construction of the modifications is non-trivial and relies on recent progress in the mechanistic understanding of β-VAEs and their connection to PCA, while also providing additional insights that might be of stand-alone interest.

1. INTRODUCTION

The task of unsupervised learning of interpretable data representations has a long history, from classical approaches based on linear algebra, e.g. Principal Component Analysis (PCA) (Pearson, 1901), or statistical methods such as Independent Component Analysis (ICA) (Comon, 1994), all the way to more recent approaches that rely on deep learning architectures. The cornerstone architecture is the Variational Autoencoder (VAE) (Kingma & Welling, 2014), which clearly demonstrated both high semantic quality and good performance in terms of disentanglement. To this day, derivatives of VAEs (Higgins et al., 2017; Kim & Mnih, 2018a; Chen et al., 2018; Kumar et al., 2017) excel over other architectures in terms of disentanglement metrics. The extent of the VAE's success even prompted recent deeper analyses of its inner workings (Rolinek et al., 2019; Burgess et al., 2018; Chen et al., 2018; Mathieu et al., 2018).

If we hold the overloaded term disentanglement to the highest of its aspirations, namely the ability to recover the true generating factors of the data, fundamental problems arise. As explained by Locatello et al. (2019), already the concept of generative factors is compromised from a statistical perspective: two (or in fact infinitely many) sets of generative factors can generate statistically indistinguishable datasets. Yet, the scores on disentanglement benchmarks are high and continue to rise. This apparent contradiction needs a resolution.

In this work, we claim that all leading disentanglement architectures can be fooled by the same trick: introducing a small change to the correlation structure that, however, perfectly preserves the set of generative factors. To that end, we provide an alternative version of the standard disentanglement datasets in which each image undergoes a modification barely visible to the human eye. We report drastic drops of disentanglement performance on the altered datasets.
On a technical level, we build on the findings of Rolinek et al. (2019), who argued that VAEs recover the nonlinear principal components of a dataset; in other words, nonlinearly computed scalars that are the sources of variance in the sense of classical PCA. The small modifications of the datasets we propose aim to change the leading principal component by adding modest variance to an alternative candidate. The "to-be" leading principal component is specific to each dataset, but it is determined in a consistent fashion. With the greedy algorithm for discovering the linear PCA components in mind, we can see that any change in the leading principal component is also reflected in the others. As a result, the overall alignment changes and the generating factors get entangled, leading to low disentanglement scores. We demonstrate that, even though the viewpoint of Rolinek et al. (2019) only has theoretical support for the (β-)VAE, it empirically transfers to other architectures.

Overall, we want to encourage evaluating new disentanglement approaches on the proposed datasets, in which the generative factors are intact but the correlation structure is less favorable to the principal component discovery ingrained in VAE-style architectures. We hope that providing a sounder experimental setup will clear the path for a new set of disentanglement approaches.

2. RELATED WORK

The related work can be categorized into three research questions: (i) defining disentanglement and metrics capturing the quality of latent representations; (ii) architecture development for unsupervised learning of disentangled representations; and (iii) understanding the inner workings of existing architectures, for example of β-VAEs. This paper builds upon results from all three lines of work.

Defining disentanglement. Defining the term disentangled representation is an open question (Higgins et al., 2018). The presence of learned representations in machine learning downstream tasks, such as object recognition, natural language processing and others, created the need to "disentangle the factors of variation" (Bengio et al., 2013) very early on. This vague interpretation of disentanglement is inspired by the existence of a low-dimensional manifold that captures the variance of higher-dimensional data. As such, finding a factorized, statistically independent representation became a core ingredient of disentangled representation learning and dates back to classical ICA models (Comon, 1994; Bell & Sejnowski, 1995). For some tasks, the desired feature of a disentangled representation is that it is semantically meaningful. Prominent examples can be found in computer vision (Shu et al., 2017; Liao et al., 2020) and in research addressing interpretability of machine learning models (Adel et al., 2018; Kim, 2019). Based on group theory and symmetry transformations, Higgins et al. (2018) provide the "first principled definition of a disentangled representation". Closely related to this concept is also the field of causality in machine learning (Schölkopf, 2019; Suter et al., 2019), more specifically the search for causal generative models (Besserve et al., 2018; 2020).

Architecture development. The leading architectures for disentangled representation learning are based on VAEs (Kingma & Welling, 2014). Although originally developed as a generative modeling architecture, its variants have proven to excel at representation learning tasks. First of all, the β-VAE (Higgins et al., 2017), which exposes the trade-off between reconstruction and regularization via an additional hyperparameter, performs remarkably well. Other architectures have been proposed that additionally encourage statistical independence in the latent space, e.g. FactorVAE (Kim & Mnih, 2018b) and β-TC-VAE (Chen et al., 2018). The DIP-VAE (Kumar et al., 2017) suggests using moment matching to close the distribution gap introduced in the original VAE paper. As the architectures developed, so did the metrics used for measuring the disentanglement quality of representations (Chen et al., 2018; Kim & Mnih, 2018b; Higgins et al., 2017; Kumar et al., 2017).

Understanding inner workings. With the rising success and development of VAE-based architectures, the question of understanding their inner working principles became dominant in the community. One line of work searches for an answer to the question of why these models disentangle (Burgess et al., 2018). Another, closely related line of work establishes and tightens the connection between the vanilla (β-)VAE objective and (probabilistic) PCA (Tipping & Bishop, 1999) (Rolinek et al., 2019; Lucas et al., 2019). Building on such findings, novel approaches for model selection were proposed (Duan et al., 2020), emphasizing the value of thoroughly understanding the working principles of these methods. On a less technical side, Locatello et al. (2019) conducted a broad set of experiments questioning the relevance of models given the variance over restarts and the choice of hyperparameters. They also formalized the necessity of inductive bias as a strict requirement for unsupervised learning of disentangled representations. Our experiments are built on their code base.

3. BACKGROUND

3.1. QUANTIFYING DISENTANGLEMENT

Among the different viewpoints on disentanglement, we follow recent literature and focus on the connection between the discovered data representation and a set of generative factors. Multiple metrics have been proposed to quantify this connection. Most of them are based on the understanding that, ideally, each generative factor is encoded in precisely one latent variable. This was captured concisely by Chen et al. (2018), who proposed the Mutual Information Gap (MIG): the mean (over the $N_w$ generative factors) normalized difference between the two highest mutual information values between a latent coordinate and a single generating factor,

$$\mathrm{MIG} = \frac{1}{N_w} \sum_{i=1}^{N_w} \frac{1}{H(w_i)} \Big( \max_{k} I(w_i; z_k) - \max_{k \neq k^*} I(w_i; z_k) \Big), \quad \text{where } k^* = \arg\max_{\kappa} I(w_i; z_\kappa).$$

More details about MIG, its implementations, and an extension to discrete variables can be found in (Chen et al., 2018; Rolinek et al., 2019). While multiple other metrics have been proposed, such as the SAP score (Kumar et al., 2017), the FactorVAE score (Kim & Mnih, 2018a) and the DCI score (Eastwood & Williams, 2018) (see the supplementary material of Klindt et al. (2020)), in this work we focus primarily on MIG.
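As a concrete illustration, a minimal MIG computation might look as follows. This is a sketch under the assumption that the mutual information matrix $I(w_i; z_k)$ and the factor entropies $H(w_i)$ have already been estimated (e.g. by discretizing the latent means); the function name is ours, not from any library:

```python
import numpy as np

def mig(mi, entropies):
    """Mutual Information Gap (Chen et al., 2018).

    mi[i, k]  : estimated mutual information I(w_i; z_k) between
                generative factor w_i and latent coordinate z_k.
    entropies : H(w_i) for each generative factor, used for normalization.
    """
    mi = np.asarray(mi, dtype=float)
    sorted_mi = np.sort(mi, axis=1)[:, ::-1]  # descending per factor
    gaps = (sorted_mi[:, 0] - sorted_mi[:, 1]) / np.asarray(entropies)
    return gaps.mean()
```

A perfectly disentangled code (each factor captured by exactly one latent) yields a MIG of 1, while a code where two latents share a factor's information equally yields 0.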

3.2. VARIATIONAL AUTOENCODERS AND THE MYSTERY OF A SPECIFIC ALIGNMENT

Variational autoencoders hide many intricacies, and attempting to compress their exposition would not do them justice. For this reason, we limit ourselves to what is crucial for understanding this work: the objective functions. For a well-presented full description of VAEs, we refer the reader to Doersch (2016). As is common in generative models, VAEs aim to maximize the log-likelihood objective $\sum_{i=1}^{N} \log p(x^{(i)})$, in which $\mathcal{X} = \{x^{(i)}\}_{i=1}^{N}$ is a dataset consisting of $N$ i.i.d. samples of a multivariate random variable $X$ that follows the true data distribution. The quantity $p(x^{(i)})$ captures the probability density of generating the training data point $x^{(i)}$ under the current parameters of the model. This objective is, however, intractable in its general form. For this reason, Kingma & Welling (2014) follow the standard technique of variational inference and introduce a tractable Evidence Lower Bound (ELBO):

$$\mathbb{E}_{q(z \mid x^{(i)})} \big[ \log p(x^{(i)} \mid z) \big] - D_{\mathrm{KL}}\big( q(z \mid x^{(i)}) \,\|\, p(z) \big). \tag{3}$$

Here, $z$ are the latent variables used to generate samples from $X$ via a parameterized stochastic decoder $p(x^{(i)} \mid z)$, while $q(z \mid x^{(i)})$ is the approximate posterior produced by the encoder. The fundamental question of "How do these objectives promote disentanglement?" was first asked by Burgess et al. (2018). This is indeed far from obvious; in disentanglement, the aim is to encode a fixed generative factor in precisely one latent variable. From a geometric viewpoint, this requires the latent representation to be axis-aligned (one axis corresponding to one generative factor). The question becomes yet more intriguing after noticing (and formally proving) that both objective functions (2) and (3) are invariant under rotations of the latent space (Burgess et al., 2018; Rolinek et al., 2019). In other words, any rotation of a fixed latent representation results in the same value of the objective function, and yet β-VAEs consistently produce representations that are axis-aligned and, in effect, isolate the generative factors into individual latent variables.
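For the diagonal Gaussian posterior used by β-VAEs, the KL term of the ELBO has a well-known closed form, $D_{\mathrm{KL}} = \frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)$. A minimal numerical sketch of the (negative) β-weighted objective, with function names of our choosing:

```python
import numpy as np

def gaussian_kl(mu, sigma_sq):
    """Closed-form KL(q(z|x) || p(z)) for a diagonal Gaussian posterior
    N(mu, diag(sigma_sq)) against the standard-normal prior N(0, I)."""
    return 0.5 * np.sum(mu ** 2 + sigma_sq - np.log(sigma_sq) - 1.0, axis=-1)

def beta_vae_loss(recon_error, mu, sigma_sq, beta=4.0):
    """Negative ELBO with the beta weighting of Higgins et al. (2017):
    reconstruction term plus beta times the KL regularizer."""
    return recon_error + beta * gaussian_kl(mu, sigma_sq)
```

Note that the KL vanishes exactly when the posterior matches the prior ($\mu = 0$, $\sigma^2 = 1$), which is the state of the "passive" latent variables discussed in the next subsection.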

3.3. RESOLUTION VIA NON-LINEAR CONNECTIONS TO PCA

A mechanistic answer to the question raised in the previous subsection was given by Rolinek et al. (2019). The formal argument showed that under specific conditions which are typical for β-VAEs (called the polarized regime), the model locally performs PCA in the sense of aligning the "sources of variance" with the local axes. The resulting global alignment often coincides with finding nonlinear principal components of the dataset (see Fig. 1). This behavior stems from the convenient but uninformed choice of a diagonal posterior, which breaks the rotational symmetry of (2) and (3). This connection with PCA was also reported by Stuehmer et al. (2020), alternatively formalized by Lucas et al. (2019), and converted into performance improvements in an unsupervised setting by Duan et al. (2020). Since our dataset perturbation method is based on this interpretation of the VAE, we offer further context.

In more technical terms, Rolinek et al. (2019) simplify the closed-form KL term of objective (3) under the assumption of the polarized regime into the form

$$L_{\mathrm{KL}}\big(x^{(i)}\big) \approx \frac{1}{2} \sum_{j \in V_a} \mu_j^2\big(x^{(i)}\big) - \log\big(\sigma_j^2\big(x^{(i)}\big)\big),$$

where $V_a$ is the set of active latent variables, $\mu_j(x^{(i)})$ is the mean embedding of input $x^{(i)}$, and $\sigma_j^2(x^{(i)})$ is the corresponding term in the diagonal posterior, which also plays the role of the noise applied to latent variable $j$. This form of the objective, when studied independently, lends itself to an alternative viewpoint. When the only "moving part" in the model is the alignment of the latent representation, i.e. the application of a rotation matrix, the induced optimization problem has the following interpretation¹: distribute a fixed amount of noise among the latent variables such that the $L_2$ reconstruction error increases the least. This extracted problem can be solved in closed form, and the solution is based on isolating sources of variance.
Intuitively, important scalar factors whose preservation is crucial for reconstruction need to be captured with high precision (low noise). For that, it is economical to isolate them from other factors. This high-level connection with PCA is then further formalized by Rolinek et al. (2019). One less obvious observation is that the "isolation" of different sources of variance relies on the accuracy of the linearization of the decoder around a fixed $\mu(x^{(i)})$. Since in many datasets the local and global correlation structures are nearly identical, β-VAEs recover sound global principal components. If, however, the local structure obeys a different "natural" alignment, the VAE prefers it to the global one, as captured in the synthetic experiment displayed in Fig. 1. The sensitivity of the β-VAE's global representation to small local changes is precisely our point of attack.
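The simplified KL term above can be sketched numerically. This is our own illustration, not code from Rolinek et al. (2019); in particular, the 0.8 threshold for splitting active from passive variables is an assumption, mirroring the criterion used later in the experiments:

```python
import numpy as np

def polarized_kl(mu, sigma_sq, passive_threshold=0.8):
    """Simplified KL term under the polarized regime: passive variables
    (sigma^2 ~ 1, mu ~ 0) contribute ~0 and are dropped; only the active
    set V_a enters via 0.5 * (mu_j^2 - log sigma_j^2).

    mu, sigma_sq : arrays of shape (num_points, latent_dim).
    Returns the per-point KL estimate and the boolean active mask."""
    active = sigma_sq.mean(axis=0) < passive_threshold  # estimate of V_a
    kl = 0.5 * np.sum(mu[:, active] ** 2 - np.log(sigma_sq[:, active]), axis=1)
    return kl, active
```

The active mask doubles as a count of "used" latent dimensions, which is the quantity tracked in the over-pruning discussion of Sec. 5.4.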

4. METHODS

The standard datasets for evaluating disentanglement all have an explicit generation procedure. Each data point $x^{(i)} \in \mathcal{X}$ is the outcome of a generative process $g$ applied to an input $w^{(i)} \in \mathcal{W}$. Imagine that $g$ is a function rendering a simple scene from its specification $w$, containing as its coordinates the background color, foreground color, object shape, object size, etc. By design, the individual generative factors are statistically independent in $\mathcal{W}$. All in all, the dataset $\mathcal{X} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ is constructed with $x^{(i)} = g(w^{(i)})$, where $g$ is a mapping from the generative factors to the corresponding data points.

In this section, we introduce a modification $\tilde{g}$ of the generative procedure $g$ that barely distorts the resulting data points. In particular, for each $x^{(i)} \in \mathcal{X}$, we have

$$d\big(x^{(i)}, \tilde{g}(w^{(i)})\big) \leq \varepsilon \tag{5}$$

under some distance $d(\cdot, \cdot)$. How can we design $\tilde{g}$ such that, despite an $\varepsilon$-small modification, VAE-based architectures will create an entangled representation? Following the intuition from Sec. 3.3 and Fig. 1, we "misalign" the local variance with respect to the global variance in order to promote an alternative (entangled) latent code. To avoid hand-crafting this process, we can exploit the following observation: VAE-based architectures suffer from large performance variance over e.g. random initializations. This hints at an existing ambiguity: two or more candidates for the latent coordinate system are competing minima of the optimization problem. Some of these solutions are "bad" in terms of disentanglement. Below we elaborate on how to foster these bad solutions. It should be noted that our dataset modifications are not an implementation of (Locatello et al., 2019, Theorem 1). We do not modify the set of generative factors. Rather, we slightly perturb the generation process in order to target a specific subtlety in the inner workings of VAEs.
Overall, our modification process has three steps: (i) find the most entangled alignment that a β-VAE produces over multiple restarts and retrieve its first principal component, denoted by s; (ii) fix a "noise pattern" that is highly tied to s; (iii) add the noise pattern with suitable magnitude ε to each image.

4.1. CHOICE OF FOSTERED LOCAL PRINCIPAL COMPONENT

Over multiple restarts of the β-VAE, we pick the one with the lowest MIG score. This gives us an entangled alignment that is expressible by the architecture. Its first principal component is captured by $s \colon \mathcal{X} \to \mathbb{R}$, computed as the value of the latent coordinate $j$ that has the least noise $\sigma_j^2$ injected (averaged over the dataset):

$$s\big(x^{(i)}\big) = \big[\mathrm{enc}\big(x^{(i)}\big)\big]_j, \quad j = \arg\min_k \overline{\sigma_k^2}. \tag{6}$$

This procedure of retrieving the most "important" latent coordinate is consistent with (Higgins et al., 2017) and (Rolinek et al., 2019). The analogy to PCA is that the mapping $s(x^{(i)})$ gives the first coordinate of $x^{(i)}$ in the new (non-linear) coordinate system.
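Given encoder outputs over the dataset, Eq. (6) amounts to a single argmin. A minimal sketch, assuming the means and posterior variances have already been collected into arrays (the function name is ours):

```python
import numpy as np

def first_principal_coordinate(mu, sigma_sq):
    """Eq. (6): pick the latent index j whose injected noise sigma_j^2 is
    smallest on average over the dataset, and return its mean embedding
    as the scalar map s.

    mu, sigma_sq : arrays of shape (num_points, latent_dim)."""
    j = int(np.argmin(sigma_sq.mean(axis=0)))
    return mu[:, j], j
```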

4.2. MANIPULATION PATTERNS

We will now describe the modification procedure assuming the data points are $r \times r$ images. The manipulated data point $\tilde{x}^{(i)}$ is of the form $\tilde{x}^{(i)} = x^{(i)} + \varepsilon f\big(s(x^{(i)})\big)$, where the mapping $f \colon \mathbb{R} \to \mathbb{R}^{r \times r}$ is constrained by $\|f(s)\|_\infty \leq 1$ for every $s$. Then inequality (5) is naturally satisfied for the maximum norm. At the same time, we aim to produce large local changes. More technically, the local expansiveness factor

$$\alpha_\delta(s) = \inf_{|\delta'| < \delta} \frac{\big\|f(s) - f(s + \delta')\big\|}{|\delta'|}$$

should be as high as possible everywhere. Additionally, the mapping should be injective so that the resulting modification contains information about $s$. For the sake of simplicity, we accomplish the above requirements (with high probability) by vertically shifting a fixed random pattern. Specifically, for fixed randomly sampled $p_{i,j} \in \{-1, 1\}$ for $i, j \in \mathbb{N}$ and "column factors" $c_j \in \{1, 2, 3, 4\}$, we set $f$ as $(f(s))_{i,j} = p_{i',j}$, where $i' = \mathrm{round}(s / c_j) + i$. This random pattern moves along the first coordinate with $s$ while creating consecutive blocks of length $c_j$ in column $j$, in order to incorporate information about $s$ on multiple scales.
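The pattern construction can be sketched as follows. We use a finite pattern with wrap-around indexing in place of the unbounded index set $i \in \mathbb{N}$; that wrap-around, the pattern's oversizing factor, and the function names are implementation choices of ours:

```python
import numpy as np

def make_pattern(r, rng):
    """Fixed random ingredients: a +/-1 pattern p (oversized along rows so
    shifted reads stay in range) and per-column factors c_j in {1,...,4}."""
    p = rng.choice([-1.0, 1.0], size=(8 * r, r))
    c = rng.integers(1, 5, size=r)
    return p, c

def manipulation(s, p, c, r):
    """(f(s))_{i,j} = p_{i',j} with i' = round(s / c_j) + i: column j reads
    the fixed pattern shifted by s at its own scale c_j. Entries are +/-1,
    so the constraint ||f(s)||_inf <= 1 holds automatically."""
    out = np.empty((r, r))
    for j in range(r):
        rows = (int(round(s / c[j])) + np.arange(r)) % p.shape[0]  # wrap-around
        out[:, j] = p[rows, j]
    return out
```

Because each column shifts at a different rate `c[j]`, nearby values of `s` already produce visibly different patterns in the fast columns, which is what drives the large local expansiveness.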

5. EXPERIMENTS

In order to experimentally validate the soundness of the manipulations, we need to demonstrate the following:

1. Effectiveness of manipulations. Disentanglement metrics should drop on the modified datasets across architectures and datasets.
2. Qualitative analysis. The changes in local and global statistics of the dataset should match the intuition fleshed out in Sec. 3.3.
3. Comparison to a trivial modification. Instead of the proposed method, we modify with uniform noise of the same magnitude. The disentanglement scores for the algorithms on the resulting datasets should not drop significantly.
4. Robustness. The new datasets should be hard to disentangle even after retuning the hyperparameters of the original architectures.

5.1. EFFECTIVENESS OF MANIPULATIONS

We use the scalar $s = s(x^{(i)})$ as described in Sec. 4.1 and embed it in the data space via the manipulation $f(s)$ described in Sec. 4.2. We deploy this approach on two datasets, Shapes3D (Burgess & Kim, 2018) and dSprites (Matthey et al., 2017), leading to manipulations as depicted in Fig. 3(b, c). Four VAE-based architectures and a regular autoencoder are evaluated on both the original and the manipulated datasets using the regularization strength from the literature (or better). Other hyperparameters are taken from the disentanglement library (Locatello et al., 2019). For the sake of simplicity and clarity, we restricted the latent space dimension to be equal to the number of ground-truth generative factors. The resulting MIG scores are listed in Tab. 1. Over all models and datasets, the disentanglement quality is significantly reduced.

5.2. QUALITATIVE ANALYSIS

The proposed manipulations are, by definition of constraint (5), small in terms of overall variance, but chosen to have a large local expansiveness factor. We want to quantify both, as well as the order in which the β-VAE encodes the generating factors. In addition to the existing generating factors and their influence on the dataset, we also want to evaluate the manipulation applied in the previous experiment. By adding an additional generating factor $s$, a uniform random variable with zero mean and unit variance, we isolate the effect of the manipulation from the other factors. We approximate the explained global variance of a generating factor $w_j$ by extensively sampling

$$\mathrm{var}_{w_j}\Big( \mathbb{E}_{w_j^{(i)} = w_j}\big[ x^{(i)} \big] \Big).$$

Due to the discrete nature of the generating factors, the local expansiveness factor is calculated for $\delta$ equal to the step size of the position generating factors. We train a β-VAE on the same modified dataset and report the "information level" (which determines the order of the principal components) of the latent coordinate that encodes each individual generating factor, averaged over 10 restarts. The reported quantity is $1 - \mathbb{E}[\sigma_i^2]$, i.e. the lack of noise in the latent coordinate. The results are shown in Fig. 4. In conclusion, the β-VAE chooses to encode $s$ as the first principal component for $\varepsilon \approx 0.1$, which is exactly when $s$ starts to dominate the local statistics but long before it dominates the global statistics. This is consistent with our prediction in Sec. 3.3.
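For discrete generating factors, the explained-global-variance estimate reduces to grouping images by factor value and taking the variance of the conditional means. A sketch under that reading of the sampling procedure (the function name is ours):

```python
import numpy as np

def explained_global_variance(factor_values, images):
    """Estimate var_{w_j} E[x | w_j]: average the images within each value
    of the factor, then take the variance of these conditional means over
    the factor values, summed over pixels.

    factor_values : shape (num_points,), the value of w_j per image.
    images        : shape (num_points, ...) flattened or not."""
    values = np.unique(factor_values)
    cond_means = np.stack([images[factor_values == v].mean(axis=0)
                           for v in values])
    return float(cond_means.var(axis=0).sum())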

5.3. NOISY DATASETS

We replace our modification by contaminating each image with pixel-wise uniform noise on $[-\varepsilon, \varepsilon]$. The value of $\varepsilon$ is fixed to the level of the presented manipulations. Table 2 provides the results for the same five architectures as used before. Lacking structure, the contamination does not affect the performance in a targeted way and has very little effect on Shapes3D. The impact on dSprites is, however, noticeable. Due to the comparatively small variance among dSprites images, the noise shifts the balance between the reconstruction loss and the regularizers. We performed a grid search over β and recovered the performance on the noisy dataset (β = 16); the same can be expected for the other architectures.
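The trivial baseline is a one-liner; the point is that its magnitude matches the structured manipulation while carrying no information about any latent coordinate (function name ours):

```python
import numpy as np

def add_uniform_noise(images, eps, rng):
    """Trivial baseline of Sec. 5.3: pixel-wise i.i.d. uniform noise on
    [-eps, eps], matched in magnitude to the structured manipulation but
    independent of the images' content."""
    return images + rng.uniform(-eps, eps, size=images.shape)
```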

5.4. GRIDSEARCH ON HYPERPARAMETERS

We run a grid search over the hyperparameters for each architecture. The results are illustrated in Fig. 5. Overall, our modifications seem mostly robust to adjusted hyperparameters. On both datasets, a significant increase in the regularization strength allows for some recovery. A more thorough analysis reveals that this effect starts only once the models reach a level of over-pruning, a behavior well known to practitioners. The dashed curves mark the region in which the active latent space dimension (the number of coordinates for which $\mathbb{E}[\sigma_i^2] < 0.8$) shrunk significantly. This effect goes along with decreased reconstruction quality and also intrinsically prevents the models from recovering all true generative factors, rendering this area uninteresting.
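The active-unit count used for the dashed curves can be computed directly from the posterior variances; the 0.8 threshold is the one stated in the text, the function name is ours:

```python
import numpy as np

def num_active_units(sigma_sq, threshold=0.8):
    """Number of active latent coordinates: those with mean posterior
    variance E[sigma_i^2] below the threshold; coordinates above it are
    treated as pruned (passive).

    sigma_sq : array of shape (num_points, latent_dim)."""
    return int(np.sum(sigma_sq.mean(axis=0) < threshold))
```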

6. CONCLUSION

We have shown that the success of β-VAE-based architectures is to a large extent based on the convenient correlation structure of the datasets they are evaluated on. The goal of truly semantic representation learning requires an alternative benchmark to excel at, free of similar correlation-based artifacts. We propose to the community a set of novel datasets and hope that future architectures become capable of learning truly semantic representations.



¹ The precise formal statement can be found in (Rolinek et al., 2019, Equations (22) and (23)).



Figure 1: (a) The axes of the VAE's latent representation, when decoded back to data space, represent non-linear principal components, as formalized by Rolinek et al. (2019) (figure used with the permission of the authors). (b) Distribution of latent encodings for input data distributed as depicted in the inset. The linear VAE's encoding matches the PCA encoding remarkably well (left); both focus on aligning with axes based on explained (global) variance. The non-linear VAE is, however, more sensitive to local variance: it picks up on the natural axis alignment of the microscopic structure. This discrepancy is the driver of the proposed modifications.

Figure 2: A schematic visualization of the image generation process. The set of ground truth generating factors stays untouched by the modification.

Figure 3: Dataset modification. (a) Construction of the manipulation pattern (transposed visualization). (b,c) Visual comparison between original and perturbed samples side-by-side.

(a) and (b) show the local expansiveness factor and the explained global variance for different values of ε. The local expansiveness of s exceeds that of all other generating factors at ε ≈ 0.1, long before its explained global variance becomes dominant. At the same point, s becomes the dominant coordinate in the latent encoding, as shown in Fig. 4(c).

Figure 4: A comparison of the local expansiveness factor, the explained global variance, and the β-VAE noise distribution. (a) Sampled local variance for different values of ε, restricted to the position factors (the curves overlap), since the sampling in the generating-factor space is discrete and varying step sizes are not expressive. (b) Explained variance for all generating factors and s. (c) The mean latent noise standard deviation is an indicator of the relevance for reconstruction. Increasing ε makes s the prime encoded coordinate. For ε ≤ 0.075, s was not encoded due to its noisy nature.

Figure 5: MIG scores for scaled literature hyperparameters over 10 restarts. The dashed lines show the average number of active units. Fewer than 5 (left, dSprites) or 6 (right, Shapes3D) active units means the architectures over-pruned and cannot recover the true generative factors.

Table 1: MIG scores for the unmodified and the modified datasets. Each setting was run with 10 distinct random seeds.

Table 2: MIG scores on the original and a noisy version of the datasets with literature hyperparameters. β was tuned to the noisy dataset. Noise cannot explain the full loss in disentanglement.

Availability

https://sites.google.com/view/

