DISENTANGLING LEARNING REPRESENTATIONS WITH DENSITY ESTIMATION

Abstract

Disentangled learning representations have promising utility in many applications, but they currently suffer from serious reliability issues. We present Gaussian Channel Autoencoder (GCAE), a method which achieves reliable disentanglement via flexible density estimation of the latent space. GCAE avoids the curse of dimensionality of density estimation by disentangling subsets of its latent space with the Dual Total Correlation (DTC) metric, thereby representing its high-dimensional latent joint distribution as a collection of many low-dimensional conditional distributions. In our experiments, GCAE achieves highly competitive and reliable disentanglement scores compared with state-of-the-art baselines.

1. INTRODUCTION

The notion of disentangled learning representations was introduced by Bengio et al. (2013); it is meant to be a robust approach to feature learning when trying to learn more about a distribution of data X or when downstream tasks for learned features are unknown. Since then, disentangled learning representations have proven extremely useful in applications such as natural language processing Jain et al. (2018), content and style separation John et al. (2018), drug discovery Polykovskiy et al. (2018); Du et al. (2020), fairness Sarhan et al. (2020), and more.

Density estimation of learned representations is an important ingredient of competitive disentanglement methods. Bengio et al. (2013) state that representations z ∼ Z which are disentangled should maintain as much information of the input as possible while having components which are mutually invariant to one another. Mutual invariance motivates seeking representations of Z which have independent components extracted from the data, necessitating some notion of p_Z(z). Leading unsupervised disentanglement methods, namely β-VAE Higgins et al. (2016), FactorVAE Kim & Mnih (2018), and β-TCVAE Chen et al. (2018), all learn p_Z(z) via the same variational Bayesian framework Kingma & Welling (2013), but they approach making p_Z(z) independent from different angles. β-VAE indirectly promotes independence in p_Z(z) by enforcing a low D_KL between the representation and a factorized Gaussian prior; β-TCVAE encourages representations to have low Total Correlation (TC) via an ELBO decomposition and an importance-weighted sampling technique; and FactorVAE reduces TC with the help of a monolithic neural-network density-ratio estimate. Other well-known unsupervised methods are Annealed β-VAE Burgess et al. (2018), which imposes a careful relaxation of the information bottleneck through the VAE D_KL term during training, and DIP-VAE I & II Kumar et al. (2017), which directly regularize the covariance of the learned representation. For a more in-depth description of related work, please see Appendix D.

While these VAE-based disentanglement methods have been the most successful in the field, Locatello et al. (2019) point out serious reliability issues shared by all of them. In particular, increasing disentanglement pressure during training does not tend to lead to more independent representations, there currently are no good unsupervised indicators of disentanglement, and no method consistently dominates the others across all datasets. Locatello et al. (2019) stress the need to find the right inductive biases in order for unsupervised disentanglement to truly deliver.

We seek to make disentanglement more reliable and higher-performing by incorporating new inductive biases into our proposed method, Gaussian Channel Autoencoder (GCAE). We explain these in more detail in the following sections, but to summarize: GCAE avoids the challenge of representing the high-dimensional p_Z(z) by disentangling with Dual Total Correlation (DTC, rather than TC), and the DTC criterion is augmented with a scale-dependent latent variable arbitration mechanism. This work makes the following contributions:

• Analysis of the TC and DTC metrics with regard to the curse of dimensionality, which motivates the use of DTC and a new feature-stabilizing arbitration mechanism
• GCAE, a new form of noisy autoencoder (AE) inspired by the Gaussian channel problem, which permits the application of flexible density estimation methods in the latent space
• Experiments¹ which demonstrate competitive performance of GCAE against leading disentanglement baselines on multiple datasets using existing metrics

2. BACKGROUND AND INITIAL FINDINGS

To estimate p_Z(z), we introduce a discriminator-based method which applies the density-ratio trick and the Radon–Nikodym theorem to estimate the density of samples from an unknown distribution. In this section we demonstrate the curse of dimensionality in density estimation and the consequent necessity of representing p_Z(z) as a collection of conditional distributions. The optimal discriminator neural network introduced by Goodfellow et al. (2014a) satisfies

$$D^*(x) \triangleq \arg\max_{D(\cdot)} \; \mathbb{E}_{x_r \sim X_{\mathrm{real}}}\left[\log D(x_r)\right] + \mathbb{E}_{x_f \sim X_{\mathrm{fake}}}\left[\log\left(1 - D(x_f)\right)\right] = \frac{p_{\mathrm{real}}(x)}{p_{\mathrm{real}}(x) + p_{\mathrm{fake}}(x)},$$

where D(x) is a discriminator network trained to differentiate between "real" samples x_r and "fake" samples x_f. Given the optimal discriminator D^*(x), the density-ratio trick can be applied to yield $\frac{p_{\mathrm{real}}(x)}{p_{\mathrm{fake}}(x)} = \frac{D^*(x)}{1 - D^*(x)}$. Furthermore, the discriminator can be supplied conditioning variables to represent a ratio of conditional distributions Goodfellow et al. (2014b); Makhzani et al. (2015). Consider the case where the "real" samples come from an unknown distribution z ∼ Z and the "fake" samples come from a known distribution u ∼ U. Provided that both p_Z(z) and p_U(u) are finite and p_U(u) is nonzero on the sample space of p_Z(z), the optimal discriminator can be used to retrieve the unknown density: $p_Z(z) = \frac{D^*(z)}{1 - D^*(z)}\, p_U(z)$. In our case, where u is uniformly distributed, this "transfer" of density through the optimal discriminator can be seen as an application of the Radon–Nikodym derivative of p_Z(z) with respect to the Lebesgue measure. Throughout the rest of this work, we employ discriminators with uniform noise and the density-ratio trick in this way to recover unknown distributions. This technique can be employed to recover the probability density of an m-dimensional isotropic Gaussian distribution. While it works well in low dimensions (m ≤ 8), the method inevitably fails as m increases.
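As an illustrative sketch (our own, not the paper's implementation), the density-ratio recovery can be reproduced in one dimension with a simple parametric discriminator. We assume a standard normal plays the role of the "unknown" distribution so the result can be checked, and use a uniform reference on [-4, 4]; for Gaussian-vs-uniform the optimal logit log(p_real/p_fake) is quadratic in x, so logistic regression on the features (1, x, x²) already contains D*:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# "Real" samples from the distribution we pretend is unknown (here N(0, 1),
# so the recovered density can be checked) and "fake" samples from a known
# uniform reference U(-4, 4) with density p_U = 1/8 on its support.
z = rng.normal(0.0, 1.0, n)
u = rng.uniform(-4.0, 4.0, n)

# Logistic-regression discriminator on features (1, x, x^2).
x = np.concatenate([z, u])
X = np.stack([np.ones(2 * n), x, x ** 2], axis=1)
y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = real, 0 = fake

w = np.zeros(3)
for _ in range(30):  # Newton's method on the logistic log-likelihood
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (y - p)
    hess = (X * (p * (1.0 - p))[:, None]).T @ X
    w += np.linalg.solve(hess, grad)

def estimated_density(t):
    """Density-ratio trick: p_Z(t) = D*(t) / (1 - D*(t)) * p_U(t)."""
    logit = w[0] + w[1] * t + w[2] * t ** 2  # = log D* - log(1 - D*)
    return np.exp(logit) / 8.0               # ratio times the uniform density

true_density = 1.0 / np.sqrt(2.0 * np.pi)    # N(0, 1) density at t = 0
print(estimated_density(0.0), true_density)  # the two should be close
```

The closed-form feature map is a convenience of this toy setting; in higher dimensions (and in the paper's setting) the discriminator is a neural network, which is exactly where the neighborhood problem described next takes hold.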
Figure 1a depicts several experiments with increasing m in which the KL divergence between the true and estimated distributions is plotted against training iteration. When the number of data samples is finite and the dimension m exceeds a certain threshold, the probability of any uniform sample lying in the neighborhood of the Gaussian samples swiftly approaches zero, causing the density-ratio trick to fail. This is the well-known curse of dimensionality of density estimation: as the dimensionality of a joint distribution increases, concentrated joint data quickly become isolated within an extremely large space. The limit m ≤ 8 is consistent with the limits of other methods such as kernel density estimation (the Parzen–Rosenblatt window). Fortunately, the same limitation does not apply to conditional distributions of many jointly distributed variables. Figure 1b depicts an experiment similar to the first in which m - 1 variables are independent Gaussian distributed, but the last variable z_m follows the distribution $z_m \sim \mathcal{N}\!\left(\mu = (m-1)^{-\frac{1}{2}} \sum_{i=1}^{m-1} z_i,\; \sigma^2 = \tfrac{1}{m}\right)$ (i.e., the last variable is Gaussian distributed with its mean a scaled sum of the observations of the other variables). The marginal distribution of each component is
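The neighborhood argument above can be made concrete with a short Monte-Carlo sketch (our own illustration, under the same Gaussian-vs-uniform setup): the fraction of uniform reference samples from the cube [-4, 4]^m that land inside the ball holding 99% of the isotropic Gaussian's mass collapses as m grows, which is why the discriminator receives essentially no "fake" samples near the data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100000

fractions = {}
for m in (1, 2, 4, 8, 16):
    # Radius of the ball containing ~99% of N(0, I_m) mass, via the
    # empirical 0.99 quantile of Gaussian sample norms.
    g = rng.normal(size=(n, m))
    r = np.quantile(np.linalg.norm(g, axis=1), 0.99)

    # Fraction of uniform reference samples from [-4, 4]^m inside that ball.
    u = rng.uniform(-4.0, 4.0, size=(n, m))
    fractions[m] = np.mean(np.linalg.norm(u, axis=1) <= r)
    print(m, round(r, 2), fractions[m])
```

The overlap fraction decays roughly geometrically with m and is effectively zero by m = 16, consistent with the empirical failure threshold of m ≤ 8 reported above.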



¹ Code available at https://github.com/ericyeats/gcae-disentanglement


