IDENTIFYING INFORMATIVE LATENT VARIABLES LEARNED BY GIN VIA MUTUAL INFORMATION

Anonymous authors
Paper under double-blind review

Abstract

How to learn a good representation of data is one of the most important topics in machine learning. Disentanglement of representations, though believed to be a core feature of good representations, has prompted much debate and discussion in recent years. Sorrenson et al. (2020), using techniques developed in nonlinear independent component analysis theory, show that general incompressible-flow networks (GIN) can recover the underlying latent variables that generate the data, and thus can provide a compact and disentangled representation. However, in this paper, we point out that the method GIN uses to select informative latent variables is not theoretically supported and can be disproved by experiments. We propose to use the mutual information between each learned latent variable and the auxiliary variable to correctly identify informative latent variables. We directly verify the improvement brought by our method in experiments on synthetic data. We further show the advantage of our method on various downstream tasks, including classification, outlier detection and adversarial attack defence, on both synthetic and real data.

1. INTRODUCTION

Representation learning is arguably one of the most important areas in machine learning. Many researchers believe that being able to extract useful and interpretable features is a crucial advantage of deep networks over other learning models. A data representation can be obtained either via a supervised learning task or an unsupervised one. The former includes the popular ImageNet-pretrained backbones in computer vision, while the latter mainly consists of generative models such as variants of VAEs, GANs and flow-based models. Among these generative models, VAEs and flow-based models can naturally output the representation of data, or even their density, which is convenient for representation learning purposes. Moreover, label supervision can be integrated into generative models to further improve their performance. General Incompressible-flow Networks (GIN, Sorrenson et al. (2020)), the model we consider in this paper, fall into this case.

Disentanglement is a concept widely discussed in representation learning works. However, to the best of our knowledge, it has not been given a widely accepted definition (Bengio et al., 2013; Higgins et al., 2018). Many disentangled representation learning algorithms focus on recovering the independent latent variables that generate the data (Burgess et al., 2018; Chen et al., 2018b). However, Locatello et al. (2018) show that, without assumptions beyond independence among the latent variables, it is impossible to recover them from observations of the data. This result is equivalent to the non-identifiability of nonlinear independent component analysis (ICA) (Comon, 1994). In fact, any assumption made solely on the latent variables' distribution, without referring to the observable data, is insufficient for identifiability (Hyvarinen and Morioka, 2016; Khemakhem et al., 2020). A set of sufficient conditions is proposed under the framework of nonlinear ICA by Khemakhem et al.
(2020). The core condition requires that the data, denoted by x, are generated from latent vectors z through a generative model p(x | z), and that, conditioned on an auxiliary variable u, the entries of z are independent and each follows an exponential family distribution. This can be expressed by the following formulas:

$$p_{T,\lambda}(z \mid u) = \prod_{i=1}^{n} \frac{Q_i(z_i)}{Z_i(u)} \exp\Big( \sum_{j=1}^{k} T_{i,j}(z_i)\,\lambda_{i,j}(u) \Big), \quad (1)$$

$$x = f(z) + \varepsilon. \quad (2)$$

In Eq. 1 and 2, z ∈ R^n and x ∈ R^d. Usually people believe that n ≪ d; that is, the data are distributed near a low-dimensional manifold embedded in a very high-dimensional space. With both u and x being observable, the identifiability of z can be established. Moreover, Khemakhem et al. also show that the latent variables and the generative function f can be learned by optimizing the ELBO of a special VAE, called iVAE (i for identifiable). However, iVAE assumes that all the latent variables are correlated with u, and it requires the correct number of latent variables to work, which is usually not known. Sorrenson et al. (2020) consider the generative model with Eq. 2 replaced by

$$x = f(z, \varepsilon), \quad (3)$$

where f is invertible, p_{T,λ}(z | u) is a Gaussian density supported on R^n with unconstrained means and diagonal covariance, and p(ε), supported on R^{d−n}, does not depend on u. Entries of z are referred to as informative latent variables, while entries of ε are treated as noise. They prove that z is identifiable up to a permutation and scaling. Based on this identifiability result, Sorrenson et al. propose to learn the informative latent variables z using a flow-based model, called the General Incompressible-flow Network (GIN). The output of a GIN model, w = g(x), is a vector whose dimension equals dim(z) + dim(ε) = d, since both f and g are invertible. We therefore need a method to identify the entries of w that estimate z.
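To make the setting of Eq. 3 concrete, the following sketch samples from a toy instance of this generative model. The linear mixing A, the mean functions of u, and all numerical values are illustrative assumptions of ours; GIN learns a nonlinear volume-preserving map in place of this linear f.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2, 5                       # n informative latents, d observed dims
A = rng.standard_normal((d, d))   # hypothetical invertible linear mixing (stand-in for f)

def sample_toy_generative(u, n_samples=1000):
    """Sample x = f(z, eps) as in Eq. 3: z | u is Gaussian with a
    u-dependent mean and diagonal covariance; eps is independent of u."""
    mu = np.array([u, -u])                                # assumed mean functions of u
    sigma = np.array([0.5, 1.0])                          # diagonal std of z | u
    z = mu + sigma * rng.standard_normal((n_samples, n))  # informative latents
    eps = 0.1 * rng.standard_normal((n_samples, d - n))   # u-independent noise
    return np.concatenate([z, eps], axis=1) @ A.T

x0 = sample_toy_generative(u=0.0)
x1 = sample_toy_generative(u=3.0)
```

Because the means of z depend on u, the conditional distributions of x differ across values of u; this dependence on u is precisely what the informative latents carry and the noise entries lack.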

Sorrenson et al. propose a variance-based criterion for informative latent variable selection.

Specifically, they select the w_i's with large variances as estimates of z and treat the remaining entries as solely related to ε. Sorrenson et al. do not clearly explain the logic behind this criterion. For the following reasons, we think this criterion is not guaranteed to work all the time:

1. A volume-preserving transform need not preserve the variances of its input, because volume preservation only constrains the determinant of the Jacobian, not each individual singular value.

2. According to the identifiability theorem in Sorrenson et al. (2020), each z_i can only be estimated up to an affine transform, which means that the variances of informative w_i's can be smaller than the variances of non-informative ones.

Nevertheless, they show that the variance-based criterion correctly identifies the ground-truth informative latent variables in experiments on synthetic data. However, in their experiments, the noise latent variables all have very small variances. When reproducing their experiments, we find that if we increase the variances of ε (while keeping them smaller than the variances of z), this criterion fails. On the other hand, the identifiability result in Sorrenson et al. (2020) implies that the entries of w can be categorized into two classes: entries in one class are linearly related to entries of z, while entries in the other class have distributions that are irrelevant to u. This observation inspires us to scrutinise the notion of informative latent variables under the generative framework described by Eq. 3. The key difference between informative and non-informative latent variables is not their variances, but whether they are correlated with the auxiliary variable u. Following this new perspective, it is natural to choose the mutual information between the w_i's and u as the criterion for identifying informative latent variables, which we call the mutual information (MI) criterion.
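The contrast between the two criteria can be illustrated with a minimal NumPy sketch. The histogram-based MI estimator and the toy representation below are our own illustrative choices, not the exact estimator used in our experiments: the noise entry is given a larger variance than the informative one, so the VAR criterion picks the wrong entry while MI ranks them correctly.

```python
import numpy as np

rng = np.random.default_rng(1)

def mi_with_labels(w_i, u, n_bins=20):
    """Histogram estimate (in nats) of I(w_i; u) for a one-dimensional
    latent w_i and a discrete auxiliary variable u."""
    edges = np.histogram_bin_edges(w_i, bins=n_bins)
    b = np.clip(np.digitize(w_i, edges[1:-1]), 0, n_bins - 1)
    p_b = np.bincount(b, minlength=n_bins) / len(w_i)
    mi = 0.0
    for c in np.unique(u):
        mask = u == c
        p_c = mask.mean()                                       # p(u = c)
        p_bc = np.bincount(b[mask], minlength=n_bins) / mask.sum()  # p(bin | u = c)
        nz = p_bc > 0
        mi += p_c * np.sum(p_bc[nz] * np.log(p_bc[nz] / p_b[nz]))
    return mi

# A toy "learned representation" w with two entries:
# w0 depends on the label u (informative) but has SMALL variance;
# w1 is pure noise (independent of u) with LARGE variance.
u = rng.integers(0, 2, size=5000)
w0 = 2.0 * u + 0.5 * rng.standard_normal(5000)
w1 = 3.0 * rng.standard_normal(5000)
```

Here Var(w1) > Var(w0), so the VAR criterion would select the noise entry w1, whereas I(w0; u) is close to its maximum of log 2 nats and I(w1; u) is near zero, so the MI criterion selects w0.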
In this paper, we compare the MI criterion with the previous variance-based criterion (VAR) through extensive experiments on various tasks, including disentanglement quality on synthetic data, classification performance on real data, outlier detection capability, and robustness under adversarial attacks. In all these cases, the MI criterion shows superior performance over the VAR criterion. The main content of the paper is organised as follows. In Section 2, we briefly review the identifiability result of Sorrenson et al. (2020), and show why the mutual information can work for

