UNSUPERVISED LEARNING OF GLOBAL FACTORS IN DEEP GENERATIVE MODELS

Abstract

We present a novel deep generative model based on non i.i.d. variational autoencoders that captures global dependencies among observations in a fully unsupervised fashion. In contrast to recent semi-supervised alternatives for global modeling in deep generative models, our approach combines a mixture model in the local or data-dependent space and a global Gaussian latent variable, which leads us to three particular insights. First, the induced latent global space captures interpretable disentangled representations with no user-defined regularization in the evidence lower bound (as in beta-VAE and its generalizations). Second, we show that the model performs domain alignment to find correlations and interpolate between different databases. Finally, we study the ability of the global space to discriminate between groups of observations with non-trivial underlying structures, such as face images with shared attributes or defined sequences of digit images.

1. INTRODUCTION

Since their first proposal by Kingma & Welling (2013), Variational Autoencoders (VAEs) have evolved into a vast number of variants. To name some representative examples, there are VAEs with latent mixture model priors (Dilokthanakul et al. (2016)), adapted to model time series (Chung et al. (2015)), trained via deep hierarchical variational families (Ranganath et al. (2016), Tomczak & Welling (2018)), or that naturally handle heterogeneous data types and missing data (Nazabal et al. (2020)). The large majority of VAE-like models are designed under the assumption that data are i.i.d., which remains a valid strategy for simplifying the learning and inference processes in generative models with latent variables. A different modelling approach drops the i.i.d. assumption with the goal of capturing a higher level of dependence between samples. Inferring such higher-level dependencies can directly improve current approaches for finding interpretable disentangled generative models (Bouchacourt et al. (2018)), performing domain alignment (Heinze-Deml & Meinshausen (2017)), or ensuring fairness and unbiased data (Barocas et al. (2017)). The main contribution of this paper is to show that a deep probabilistic non i.i.d. VAE model with both local and global latent variables can capture meaningful and interpretable correlations among data points in a completely unsupervised fashion; namely, weak supervision to group the data samples is not required. In the following we refer to our model as Unsupervised Global VAE (UG-VAE). We combine a clustering-inducing mixture model prior in the local space, which separates the fundamental data features that an i.i.d. VAE would capture, with a global latent variable that modulates the properties of such latent clusters depending on the observed samples, capturing fundamental and interpretable data features. We demonstrate this result using CelebA, MNIST, and the 3D FACES dataset of Paysan et al. (2009).
Furthermore, we show that the global latent space can explain common features in samples coming from two different databases without requiring a domain label for each sample, establishing a probabilistic unsupervised framework for domain alignment. To the best of our knowledge, UG-VAE is the first VAE model in the literature that performs unsupervised domain alignment using global latent variables. Finally, we demonstrate that, even when the model parameters have been trained in an unsupervised manner, the global latent space in UG-VAE can discriminate groups of samples with non-trivial structure, separating groups of people with black and blond hair in CelebA, or series of numbers in MNIST. In other words, if weak supervision is applied at test time, the posterior distribution of the global latent variable provides an informative representation of the user-defined groups of correlated data.

2. RELATED WORK

Non i.i.d. deep generative models are receiving attention recently, but the literature is still scarce. First, we find VAE models that implement non-parametric priors: in Gyawali et al. (2019) the authors make use of a global latent variable that induces a non-parametric Beta process prior, and more efficient variational mechanisms for this kind of IBP prior are introduced in Xu et al. (2019). Second, both Tang et al. (2019) and Korshunova et al. (2018) propose non i.i.d. exchangeable models by including correlation information between datapoints via an undirected graph. Finally, some other works rely on simpler generative models (compared to these previous approaches), including global variables with fixed-complexity priors, typically a multivariate Gaussian distribution, that aim at modelling the correlation between user-specified groups of correlated samples (e.g. images of the same class in MNIST, or faces of the same person). In Bouchacourt et al. (2018) and Hosoya (2019), the authors apply weak supervision by grouping image samples by identity, and include in the probabilistic model a global latent variable for each of these groups, along with a local latent variable that models the distribution of each individual sample. Below we describe the two most relevant lines of research in relation to our work.

VAEs with mixture priors. Several previous works have demonstrated that incorporating a mixture in the latent space leads to significantly better models. In Johnson et al. (2016), the authors introduce a latent GMM prior with nonlinear observations, where the means are learned but do not depend on the data. The GMVAE proposal of Dilokthanakul et al. (2016) aims at incorporating unsupervised clustering in deep generative models to increase interpretability. In the VAMP VAE model (Tomczak & Welling (2018)), the authors define the prior as a mixture whose components are approximated variational posteriors, conditioned on learnable pseudo-inputs.
This approach leads to improved performance, avoiding the typical local optima difficulties that might be related to irrelevant latent dimensions.

Semi-supervised deep models for grouped data. In contrast to the i.i.d. vanilla VAE model in Figure 1 (a) and its augmented version for unsupervised clustering, GMVAE, in Figure 1 (b), the semi-supervised variants for grouped data, ML-VAE in Figure 1 (c) and NestedVAE in Figure 1 (d), share global latent information among user-specified groups of samples.

3. UNSUPERVISED GLOBAL VAE (UG-VAE)

3.1. GENERATIVE MODEL

The graphical generative model of UG-VAE is shown in Figure 2 (a). Each batch of B samples X = {x_1, ..., x_B} shares a single global latent variable β. Conditioned on β, data samples are independent and distributed according to a Gaussian mixture in the local (one per datum) latent variables Z = {z_1, ..., z_B} ⊆ R^d, where d = {d_1, ..., d_B} ⊆ {1, ..., K} are independent discrete categorical variables with uniform prior distributions. This prior, along with the conditional distribution p(z_i|d_i, β), defines a Gaussian mixture latent space, which helps to infer similarities between samples from different batches (by assigning them to the same cluster); thus, d_i plays a role similar to that of the semi-supervised grouping in Bouchacourt et al. (2018). Our experimental results demonstrate that this level of structure in the local space is crucial for acquiring interpretable information in the global space, and especially that, if we fix d_i for all the samples within a batch, the global variable β is able to tune different generative factors for each cluster. The joint distribution for a single group is therefore defined by

$p_\theta(X, Z, d, \beta) = p(X|Z, \beta)\, p(Z|d, \beta)\, p(d)\, p(\beta), \qquad (1)$

where the likelihood term of each sample is a Gaussian distribution whose parameters are obtained by feeding the concatenation of z_i and β to a decoder network:

$p(X|Z, \beta) = \prod_{i=1}^{B} p(x_i|z_i, \beta) = \prod_{i=1}^{B} \mathcal{N}\big(\mu_{\theta_x}([z_i, \beta]),\, \Sigma_{\theta_x}([z_i, \beta])\big). \qquad (2)$

In contrast with Johnson et al. (2016), where the parameters of the clusters are learned but shared by all the observations, in UG-VAE the parameters of each component are obtained with networks fed with β.
Thus, the prior of each local continuous latent variable is a mixture of Gaussians, where d_i selects the component and β is the input of a NN that outputs its parameters:

$p(Z|d, \beta) = \prod_{i=1}^{B} p(z_i|d_i, \beta) = \prod_{i=1}^{B} \mathcal{N}\big(\mu_{\theta_z}^{(d_i)}(\beta),\, \Sigma_{\theta_z}^{(d_i)}(\beta)\big); \qquad (3)$

hence we train as many NNs as discrete categories. This local space encodes samples into representative clusters that model local factors of variation. The prior of the discrete latent variable is uniform,

$p(d) = \prod_{i=1}^{B} \text{Cat}(\pi), \qquad \pi_k = 1/K, \qquad (4)$

and the prior over the continuous global latent variable β is an isotropic Gaussian, $p(\beta) = \mathcal{N}(0, I)$.
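The generative process above (sample β once per batch, then a cluster assignment, a local latent, and an observation per sample) can be sketched with ancestral sampling. In this sketch the decoder and the per-cluster prior networks are replaced by random linear maps, and all sizes and variable names are illustrative assumptions, not the architecture used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

B, K = 8, 20                          # batch size and number of clusters (illustrative)
dim_beta, dim_z, dim_x = 10, 20, 64   # latent/observation sizes (assumptions)

# Stand-ins for the trained networks: one mean map per cluster for
# p(z | d, beta), and a single decoder map for p(x | z, beta).
W_mu = rng.normal(size=(K, dim_beta, dim_z)) * 0.1
W_dec = rng.normal(size=(dim_z + dim_beta, dim_x)) * 0.1

# 1) One global variable per batch: beta ~ N(0, I).
beta = rng.normal(size=dim_beta)

# 2) Per-sample cluster assignments: d_i ~ Cat(1/K, ..., 1/K).
d = rng.integers(0, K, size=B)

# 3) Local latents: z_i ~ N(mu^{(d_i)}(beta), I)  (unit variance for brevity).
z = np.stack([beta @ W_mu[d_i] + rng.normal(size=dim_z) for d_i in d])

# 4) Decode each concatenation [z_i, beta] into the observation mean.
x_mean = np.stack([np.concatenate([z_i, beta]) @ W_dec for z_i in z])

print(x_mean.shape)  # (8, 64): one decoded mean per sample in the batch
```

Note how β enters twice, matching the model: it parameterizes the mixture components of the local prior and is concatenated with each z_i at the decoder input.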

3.2. INFERENCE MODEL

The graphical model of the proposed variational family is shown in Figure 2 (b):

$q_\phi(Z, d, \beta|X) = q(Z|X)\, q(d|Z)\, q(\beta|X, Z), \qquad (5)$

where we employ an encoder network that maps the input data into the local latent posterior distribution, which is defined as a Gaussian:

$q(Z|X) = \prod_{i=1}^{B} q(z_i|x_i) = \prod_{i=1}^{B} \mathcal{N}\big(\mu_{\phi_z}(x_i),\, \Sigma_{\phi_z}(x_i)\big). \qquad (6)$

Given the posterior distribution of z, the categorical posterior distribution of d_i is parameterized by a NN that takes z_i as input:

$q(d|Z) = \prod_{i=1}^{B} q(d_i|z_i) = \prod_{i=1}^{B} \text{Cat}\big(\pi_{\phi_d}(z_i)\big). \qquad (7)$

The approximate posterior distribution of the global variable β is computed as a product of local contributions, one per datapoint. This strategy, as demonstrated by Bouchacourt et al. (2018), outperforms alternatives such as a mixture of local contributions, since it allows group evidence to be accumulated. For each sample, a NN encodes x_i and the categorical parameters π_{φ_d}(z_i) into a local Gaussian contribution:

$q(\beta|X, Z) = \mathcal{N}(\mu_\beta, \Sigma_\beta) = \prod_{i=1}^{B} \mathcal{N}\big(\mu_{\phi_\beta}([x_i, \pi_{\phi_d}(z_i)]),\, \Sigma_{\phi_\beta}([x_i, \pi_{\phi_d}(z_i)])\big). \qquad (8)$

If we denote by µ_i and Σ_i the parameters obtained by the networks µ_{φ_β} and Σ_{φ_β}, respectively, the parameters of the global Gaussian distribution are given, following Bromiley (2003), by

$\Lambda_\beta = \Sigma_\beta^{-1} = \sum_{i=1}^{B} \Lambda_i, \qquad \mu_\beta = \Lambda_\beta^{-1} \sum_{i=1}^{B} \Lambda_i \mu_i, \qquad (9)$

where $\Lambda_i = \Sigma_i^{-1}$ are the precision matrices, which we model as diagonal.
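The precision-weighted combination above (Bromiley (2003)) is straightforward to implement for diagonal Gaussians; a minimal sketch with our own variable names, not code from the paper:

```python
import numpy as np

def product_of_gaussians(mus, sigmas2):
    """Combine B diagonal Gaussians N(mu_i, Sigma_i) into one Gaussian.

    mus, sigmas2: arrays of shape (B, dim) with per-sample means and variances.
    Returns the mean and variance of the (renormalized) product, i.e. the
    precision-weighted average used for q(beta | X, Z).
    """
    lambdas = 1.0 / sigmas2              # per-sample precisions Lambda_i
    lambda_beta = lambdas.sum(axis=0)    # Lambda_beta = sum_i Lambda_i
    mu_beta = (lambdas * mus).sum(axis=0) / lambda_beta
    return mu_beta, 1.0 / lambda_beta

# Two unit-variance contributions at -1 and +1 combine into a Gaussian
# centered at 0 with half the variance.
mu, var = product_of_gaussians(np.array([[-1.0], [1.0]]),
                               np.array([[1.0], [1.0]]))
print(mu, var)  # [0.] [0.5]
```

The shrinking variance illustrates the group-evidence accumulation mentioned above: each additional sample sharpens the posterior over β.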

3.3. EVIDENCE LOWER BOUND

Overall, the evidence lower bound reads as follows:

$\mathcal{L}(\theta, \phi; X) = \sum_{i=1}^{B} \mathbb{E}_{q(\beta|X, Z)}\big[\mathcal{L}_i(\theta, \phi; x_i, \beta)\big] - D_{KL}\big(q(\beta|X, Z)\,\|\,p(\beta)\big). \qquad (10)$

The resulting ELBO is an expansion of the ELBO for a standard GMVAE with a new regularizer for the global variable. As the reader may appreciate, the ELBO for UG-VAE does not include extra hyperparameters to enforce disentanglement, unlike previous works such as β-VAE; thus, no extra validation is needed apart from the parameters of the network architecture, the number of clusters, and the latent dimensions. We denote by $\mathcal{L}_i$ each local contribution to the ELBO:

$\mathcal{L}_i(\theta, \phi; x_i, \beta) = \mathbb{E}_{q(z_i|x_i)}\big[\log p(x_i|z_i, \beta)\big] - \mathbb{E}_{q(d_i|z_i)}\big[D_{KL}\big(q(z_i|x_i)\,\|\,p(z_i|d_i, \beta)\big)\big] - D_{KL}\big(q(d_i|z_i)\,\|\,p(d_i)\big). \qquad (11)$

The first part of equation 10 is an expectation, over the global approximate posterior, of this so-called local ELBO. The local ELBO differs from the vanilla ELBO proposed by Kingma & Welling (2013) in the regularizer for the discrete variable d_i: it is composed of the typical reconstruction term of each sample and two KL regularizers, one for z_i (taken in expectation over d_i) and one for d_i itself. The second part of equation 10 is a regularizer on the global posterior. The expectations over the discrete variable d_i are tractable and thus analytically marginalized. In contrast with GMVAE (Figure 1 (b)), in UG-VAE β is shared by a group of observations, so the parameters of the mixture are the same for all the samples in a batch. In this manner, within each optimization step, the encoder q(β|X, Z) only learns from the global information obtained from the product of Gaussian contributions of every observation, with the aim of configuring the mixture to improve the representation of each datapoint in the batch by means of p(Z|d, β) and p(X|Z, β). Hence, the control of the mixture is performed using global information.
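Both Gaussian KL regularizers in the bound (the one over z_i against each mixture component, and the one over β against the standard-normal prior) have the standard closed form for diagonal Gaussians. A sketch of that helper (our own, not code from the paper):

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dims."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# KL to the standard-normal prior p(beta) = N(0, I), as in the global
# regularizer of the ELBO.
mu_q = np.array([0.5, -0.5])
var_q = np.array([0.8, 1.2])
print(kl_diag_gaussians(mu_q, var_q, np.zeros(2), np.ones(2)))

# Identical distributions give zero divergence.
print(kl_diag_gaussians(mu_q, var_q, mu_q, var_q))  # 0.0
```

In the local term, this helper would be evaluated once per mixture component and averaged under the categorical weights q(d_i|z_i), which is what makes the expectation over d_i analytically tractable.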
In contrast with ML-VAE (whose encoder q(C_G|X) is also global, but whose model does not include a mixture), in UG-VAE the β encoder incorporates information about which component each observation belongs to, as the mixture weights inferred by q(d|Z) are used to obtain q(β|X, Z). Thus, while each cluster represents different local features, moving β affects all the clusters: modifying β has some effect on each local cluster. As training progresses, the encoder q(β|X, Z) learns which information emerging from each batch of data allows the clusters to be moved in a way that increases the ELBO.

4. EXPERIMENTS

In this section we demonstrate the ability of the UG-VAE model to infer global factors of variation that are common among samples, even when they come from different datasets. In all cases, we have not validated the networks in depth; we have merely relied on encoder/decoder architectures proposed in state-of-the-art VAE papers such as Kingma & Welling (2013), Bouchacourt et al. (2018), or Higgins et al. (2016). Our results must hence be regarded as a proof of concept of the flexibility and representation power of UG-VAE, rather than as fine-tuned results for each case, and there is room for improvement in all of them. Details about the network architectures and training parameters are provided in the supplementary material.

4.1. UNSUPERVISED LEARNING OF GLOBAL FACTORS

In this section we first assess the interpretability of the global disentanglement features inferred by UG-VAE over both CelebA and MNIST. In Figure 3 we show samples of the generative model as we explore both the global and local latent spaces. We perform a linear interpolation aimed at exploring the hypersphere centered at the mean of the distribution and with radius σ_i in each dimension i. To maximize the variation range across every dimension, we move diagonally through the latent space. Rows correspond to an interpolation of the global β between [-1, 1] in every dimension (p(β) is a standard Gaussian). As the local prior p(z|d, β) (equation 3) depends on d and β, if we denote by µ_z the mean of the selected component, the local interpolation (columns) runs from µ_z - 3 to µ_z + 3 in every dimension. The total number of clusters is set to K = 20 for CelebA and K = 10 for MNIST. Three of these components are presented in Figure 3. We can observe that each row (each value of β) induces a shared generative factor, while z is in charge of variations within this common feature. For instance, in CelebA (top), features like skin color, presence of beard, or face contrast are encoded by the global variable, while local variations like hair style or light direction are controlled by the local variable. In a simpler dataset like MNIST (bottom), results show that global handwriting features such as cursive style, contrast, or thickness are encoded by β, while the local z defines the shape of the digit. The characterization of these generative factors as local/global is based on an interpretation of the effect that varying z and β provokes in each image within a batch, and in the whole batch of images, respectively. In the supplementary material, we reproduce the same figures for all the clusters, where we can appreciate that a significant fraction of clusters present visually interpretable global/local features.
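The diagonal traversal described above can be sketched as a small grid-construction routine; grid size, dimensions, and the toy cluster mean are illustrative assumptions:

```python
import numpy as np

def diagonal_traversal(low, high, steps, dim):
    """Interpolate diagonally through a latent space: every dimension moves
    together from `low` to `high`, maximizing the range of variation."""
    return np.linspace(low, high, steps)[:, None] * np.ones(dim)

n_rows, n_cols = 7, 7
betas = diagonal_traversal(-1.0, 1.0, n_rows, dim=10)      # global sweep (rows)
mu_z = np.zeros(20)                                        # cluster mean (toy)
zs = mu_z + diagonal_traversal(-3.0, 3.0, n_cols, dim=20)  # local sweep (cols)

# Each (row, col) pair [z, beta] would be fed to the decoder to render a cell.
grid = [(z, b) for b in betas for z in zs]
print(len(grid))  # 49 decoder inputs for a 7x7 figure
```

Repeating this grid for each fixed value of d (one grid per cluster) yields figures in the style of Figure 3.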
We stress here again that UG-VAE training is fully unsupervised: data batches during training are chosen completely at random from the training dataset, with no structured correlation whatsoever. Unlike other approaches for disentanglement, see Higgins et al. (2016) or Mathieu et al. (2019), variational training in UG-VAE does not come with additional ELBO hyperparameters that need to be tuned to find a proper balance among terms in the ELBO.

Figure 3: Sampling from UG-VAE for CelebA (top) and MNIST (bottom). We include samples from 3 local clusters out of a total of K = 20 for CelebA and K = 10 for MNIST. In CelebA (top), the global latent variable disentangles skin color, beard, and face contrast, while the local latent variable controls hair and light orientation. In MNIST (bottom), β controls the cursive grade, contrast, and thickness of the handwriting, while z varies the digit shape.

One of the main contributions in the design of UG-VAE is the fact that, unless we include a clustering mixture prior in the local space controlled by the global variable β, unsupervised learning of global factors is non-informative. To illustrate this result, in Figure 4 we reproduce the results in Figure 3 for a probabilistic model in which the discrete local variable d is not included. Namely, we use the ML-VAE in Figure 1 (c), but we train it with random data batches. In this case, the local space is uni-modal given β, and we show interpolated values between -1 and 1. Note that the disentanglement effect of variations in both β and z is mild and hard to interpret.

4.2. DOMAIN ALIGNMENT

In this section, we evaluate UG-VAE in an unsupervised domain alignment setup. During training, the model is fed with data batches that include random samples coming from two different datasets. In particular, we train our model on a mixed dataset combining CelebA and 3D FACES (Paysan et al. (2009)), a dataset of 3D scanned faces, with a proportion of 50% of samples from each dataset inside each batch. Upon training with random batches, in Figure 5 we perform the following experiment, using domain supervision to create test data batches. We create two batches containing only images from CelebA and 3D FACES, respectively. Let β_1 and β_2 be the mean global posteriors computed using (8) for each batch. For two particular images in these two batches, let z_1 and z_2 be the mean local posteriors of these two images, computed using (6). Figure 5 (a) shows samples of the UG-VAE model when we linearly interpolate between β_1 and β_2 (rows) and between z_1 and z_2 (columns). Certainly, β is capturing the domain knowledge. For fixed z, e.g. z_1 in the first column, the interpolation between β_1 and β_2 transfers the CelebA image into the 3D FACES domain (note that the background turns white, and the image is rotated to obtain a 3D effect). Alternatively, for fixed β, e.g. β_1 in the first row, interpolating between z_1 and z_2 modifies the first image into one that keeps the domain but resembles features of the image in the second domain, such as face rotation. In Figure 5 (b) we show the 2D t-SNE plot of the posterior distribution of β for batches that are random mixtures of the two datasets (grey points), batches that contain only CelebA faces (blue squares), and batches that contain only 3D faces (green triangles). We also add the points corresponding to the β_1-to-β_2 interpolation in Figure 5 (a).
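The two-way interpolation between per-batch global means and per-image local means can be sketched as follows; the endpoints here are toy vectors standing in for the real posterior means:

```python
import numpy as np

def lerp(a, b, t):
    """Linear interpolation between two latent vectors."""
    return (1.0 - t) * a + t * b

beta_1, beta_2 = np.full(10, -0.5), np.full(10, 0.5)   # per-batch global means
z_1, z_2 = np.full(20, -1.0), np.full(20, 1.0)         # per-image local means

steps = 5
cells = [(lerp(z_1, z_2, tc), lerp(beta_1, beta_2, tr))
         for tr in np.linspace(0, 1, steps)    # rows: global interpolation
         for tc in np.linspace(0, 1, steps)]   # columns: local interpolation
print(len(cells))  # 25 decoder inputs for a 5x5 grid
```

Decoding each cell renders figures in the style of Figure 5 (a): the top-left cell reconstructs the first image in its own domain, the bottom-right the second image in the other domain.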
In Figure 5 (c), we reproduce the experiment in (a) but interpolate between two images and values of β that correspond to the same domain (brown interpolation line in Figure 5 (b)). As expected, the interpolation of β in this case does not change the domain, which suggests that the domain structure in the global space is smooth, and the interpolation along the local space z modifies image features to translate one image into the other. In Figure 6, experiments with more datasets are included. When mixing the 3D Cars dataset (Fidler et al. (2012)) with the 3D Chairs dataset (Aubry et al. (2014)), we find that certain correlations between cars and chairs are captured. In Figure 6 (a), interpolating between a racing car and an office desk chair leads to a white car in the first domain (top right) and to a couch in the second (bottom left). In Figure 6 (b), when using 3D Cars along with the Cars dataset (Krause et al. (2013)), rotations in the cars are induced. Finally, in the supplementary material we show that, as expected, the rich structure captured by UG-VAE illustrated in Figure 5 is lost when we do not include the clustering effect in the local space, i.e., if we use ML-VAE with unsupervised random data batches, in which case the whole transition between domains is performed within the local space.

4.3. UG-VAE REPRESENTATION OF STRUCTURED NON-TRIVIAL DATA BATCHES

In the previous subsection, we showed that the UG-VAE global space is able to separate certain structure in the data batches (e.g. data domain) even though, during training, batches did not present such explicit correlation. Using UG-VAE trained over CelebA with unsupervised random batches of 128 images as a running example, in this section we further demonstrate this result. In Figure 7 we show the t-SNE 2D projection of structured batches using the posterior β distribution in (8) over CelebA test images. In Figure 7 (a), we display the distribution of batches containing only men and only women, while in Figure 7 (b) we display the distribution of batches containing people with black or blond hair. In both cases we also show the distribution of randomly constructed batches, like the ones in the training set. To some extent, in both cases we obtain separable distributions among the different kinds of batches. A quantitative evaluation can be found in Table 1, for which we have used samples from the β posterior to train a supervised classifier that differentiates between the different types of batches. When random batches are not taken as a class, the separability is evident. When random batches are included, it is expected that the classifier struggles to differentiate between a batch that contains 90% male images and a batch that contains only male images, hence the drop in accuracy for the multi-class problem. An extension with similar results and figures, illustrating another way in which global information is captured, is provided in the supplementary material using structured, grouped batches from the MNIST dataset. In this experiment, the groups are digits that belong to certain mathematical series, including even numbers, odd numbers, the Fibonacci sequence, and prime numbers, and we show that the model is able to discriminate among their global posterior representations.
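As a toy illustration of the batch-classification protocol behind Table 1, any off-the-shelf classifier can be fitted on per-batch samples of β. Below is a nearest-centroid stand-in on synthetic 2D "posterior samples"; both the data and the classifier choice are our assumptions, not those used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for posterior beta samples of two batch types,
# e.g. "black hair" batches vs "blond hair" batches.
class_a = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(100, 2))
class_b = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(100, 2))

# Nearest-centroid classifier: label a point by the closer class mean.
centroid_a, centroid_b = class_a.mean(axis=0), class_b.mean(axis=0)

def predict(x):
    return int(np.linalg.norm(x - centroid_b) < np.linalg.norm(x - centroid_a))

preds_a = [predict(x) for x in class_a]   # should mostly be 0
preds_b = [predict(x) for x in class_b]   # should mostly be 1
accuracy = (preds_a.count(0) + preds_b.count(1)) / 200
print(accuracy)  # close to 1.0 for well-separated batch types
```

The drop in accuracy reported for the multi-class case with random batches would show up here as overlapping clouds: batches mixing both attributes fall between the centroids.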



Footnote: since both β and z are deterministically interpolated, the discrete variable d plays no role when sampling from the model.



Figure 1: Comparison of four deep generative models. Dashed lines represent the graphical model of the associated variational family. The Vanilla VAE (a), the GMVAE (b), and semi-supervised variants for grouped data; ML-VAE (c) and NestedVAE (d).

Figure 2: Generative (left) and inference (right) of UG-VAE.

Note on Figure 3: the local interpolation goes from [µ_{z_0} - 3, µ_{z_1} - 3, ..., µ_{z_d} - 3] to [µ_{z_0} + 3, µ_{z_1} + 3, ..., µ_{z_d} + 3]. The range of ±3 for the local interpolation is chosen to cover the variances Σ_z^{(d)}(β) that we observe upon training the model on MNIST and CelebA. Every image grid in Figure 3 corresponds to samples from a different cluster (a fixed value of d), in order to facilitate the interpretability of the information captured at both local and global levels. With this setup, we demonstrate that the global information tuned by β is different and clearly interpretable inside each cluster.

Figure 4: Sampling from ML-VAE, trained over unsupervised data.

Figure 5: UG-VAE interpolation in the local (columns) and global (rows) posterior spaces, fusing the CelebA and 3D FACES datasets. In (a) the interpolation goes between the posteriors of a sample from CelebA and a sample from 3D FACES. In (c) the interpolation goes between the posteriors of two samples from the 3D FACES dataset.

Figure 7: 2D t-SNE projection of the UG-VAE β posterior distribution of structured batches of 128 CelebA images. UG-VAE is trained with completely random batches of 128 train images.

Table 1: Batch classification accuracy using samples of the posterior β distribution.

5. CONCLUSIONS

In this paper we have presented UG-VAE, an unsupervised generative probabilistic model able to capture both local data features and global features among batches of data samples. Unlike similar approaches in the literature, by combining a structured clustering prior in the local latent space with a global latent space with a Gaussian prior and a more structured variational family, we have demonstrated that interpretable group features can be inferred from the global latent space in a completely unsupervised fashion. Model training does not require artificial manipulation of the ELBO to force latent interpretability, which makes UG-VAE stand out with respect to most current disentanglement approaches using VAEs. The ability of UG-VAE to infer diverse features from the training set is further demonstrated in a domain alignment setup, where we show that the global space allows interpolation between domains, and also by showing that images in correlated batches of data, related by non-trivial features such as hair color or gender in CelebA, define identifiable structures in the posterior global latent space distribution.

