IMPROVING THE UNSUPERVISED DISENTANGLED REPRESENTATION LEARNING WITH VAE ENSEMBLE

Abstract

Variational Autoencoder (VAE) based frameworks have achieved state-of-the-art performance on unsupervised disentangled representation learning. A recent theoretical analysis shows that this success is mainly due to VAE implementation choices that encourage a PCA-like behavior locally on data samples. Despite this implied model identifiability, VAE based disentanglement frameworks still face a trade-off between local orthogonality and data reconstruction. As a result, models with the same architecture and hyperparameter setting can sometimes learn entangled representations. To address this challenge, we propose a simple yet effective VAE ensemble framework consisting of multiple VAEs. It is based on the assumption that entangled representations are unique in their own ways, while disentangled representations are "alike" (similar up to a signed permutation transformation). In the proposed VAE ensemble, each model not only maintains its original objective, but also encodes to and decodes from the other models through pair-wise linear transformations between the latent representations. We show, both theoretically and experimentally, that the VAE ensemble objective encourages the linear transformations connecting the VAEs to converge to trivial transformations, aligning the latent representations of different models to be "alike". We compare our approach with state-of-the-art unsupervised disentangled representation learning approaches and show improved performance.

1. INTRODUCTION

Disentangled representation learning aims to capture the semantically meaningful compositional representation of data (Higgins et al., 2018; Mathieu et al., 2018), and has been shown to improve the efficiency and generalization of supervised learning (Locatello et al., 2019), reinforcement learning (Watters et al., 2019), and reasoning tasks (van Steenkiste et al., 2019). The current state-of-the-art unsupervised disentangled representation learning approaches deploy the Variational Autoencoder (VAE) (Kingma & Welling, 2013; Rezende et al., 2014). The main challenge is to reduce the trade-off between learning a disentangled representation and reconstructing the input data. Most recent works extend the original VAE objective with carefully designed augmented objectives to address this trade-off (Higgins et al., 2017; Burgess et al., 2017; Kim & Mnih, 2018; Chen et al., 2018; Kumar et al., 2017). A recent study (Locatello et al., 2018) compared these methods and showed that their performance is sensitive to the initialization and the hyperparameter setting of the augmented objective function.
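As a concrete instance of such an augmented objective, β-VAE (Higgins et al., 2017) reweights the KL term of the VAE objective with a single constant β > 1 (the other cited variants augment the objective differently; β-VAE is shown here only as the simplest example, written in the paper's notation with decoder q_θ(x|z)):

```latex
% beta-VAE objective (Higgins et al., 2017): a hyperparameter \beta > 1
% strengthens the KL penalty pressuring the posterior towards the factorized prior.
\mathcal{L}_{\beta\text{-VAE}}(\phi, \theta; x) =
    \mathbb{E}_{q_\phi(z|x)}\!\left[\log q_\theta(x|z)\right]
    - \beta \, D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right)
```

With β = 1 this reduces to the standard ELBO; larger β trades reconstruction quality for a more factorized posterior.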


a conference paper at ICLR 2021

Recently, Duan et al. (2019) developed an unsupervised model selection method named Unsupervised Disentanglement Ranking (UDR) to address the challenge of hyperparameter search and model selection. UDR leverages the finding in (Rolinek et al., 2019) that the implementation choices of VAE encourage a PCA-like behavior locally on data samples. As a result, disentangled representations learned by VAEs are "alike", as they are similar up to signed permutation transformations. On the contrary, entangled representations learned by VAEs are "unique", as they match each other only up to non-degenerate rotation matrices. UDR uses multiple models trained with different initializations and hyperparameter settings, and builds a similarity matrix measuring the pair-wise similarity between the latent variables of different models. A higher score is given to a model that can match its representations to those of many other models. The results show a close match between UDR and commonly used supervised metrics, as well as the performance of downstream tasks using the latent representations.

Inspired by the findings from these studies, we propose a simple yet effective VAE ensemble framework to improve the disentangled representation learned by VAE. The proposed VAE ensemble consists of multiple VAEs. The latent variables in every pair of these models are connected through linear layers to force the latent representations in the ensemble to be similar up to a linear transformation.

We show that the VAE ensemble objective encourages these pair-wise linear transformations to converge to trivial transformations, making the latent representations of different VAEs in the ensemble "alike", and thus disentangled. In this paper, we make the following contributions: (1) We introduce

Our approach builds on the VAE (Kingma & Welling, 2013; Rezende et al., 2014). The encoder q_φ(z|x) maps the input data x ∈ ℝⁿ to a probability distribution over the latent representation z ∈ ℝᵈ, and the decoder q_θ(x|z) maps the latent representation back to the data space, where φ and θ denote the model parameters. The VAE objective is to maximize the marginalized log-likelihood of the data. Direct optimization of this objective is not tractable, and it is approximated by the evidence lower bound (ELBO):

ELBO(φ, θ; x) = 𝔼_{q_φ(z|x)}[log q_θ(x|z)] − D_KL(q_φ(z|x) ‖ p(z)).

In practice, the first term is estimated by the reconstruction error. The second term is the Kullback-Leibler divergence between the posterior q_φ(z|x) and the prior p(z), commonly chosen as an isotropic Gaussian.

Polarized regime Rolinek et al. (2019) observed that the latent variables of trained VAEs work in "polarized" modes. The "passive" mode is defined by μ_j²(x) ≪ 1 and σ_j²(x) ≈ 1, while the "active" mode is defined by σ_j²(x) ≪ 1. The "passive" latent variables closely approximate the prior and have little effect on the decoder. The "active" latent variables, on the other hand, are closely related to both the per-sample KL loss and the decoder output. The "polarized regime" enables a reformulated VAE objective showing that VAEs optimize a trade-off between data reconstruction and orthogonality of the linear approximation of the decoder Jacobian locally around a data sample. This PCA-like behavior near data points encourages an identifiable disentangled latent space by VAE. Furthermore, it was suggested that finding an appropriate "polarized regime" depends on the initialization and the hyperparameter tuning of the state-of-the-art approaches. In this study, we show that the proposed VAE ensemble aligns the "polarized regimes" of individual VAE models towards the disentangled representation.

Model selection In practice, we often observe that neural networks achieve similar performance with different internal representations when trained with the same hyperparameters (Raghu et al., 2017; Wang et al., 2018; Morcos et al., 2018). For unsupervised disentangled representation learning, as discussed in (Locatello et al., 2018; Duan et al., 2019), we often observe high variance in the performance of models trained with the same architecture and hyperparameter setting. This poses a challenge for choosing a model in practice. Duan et al. (2019) proposed Unsupervised Disentanglement Ranking (UDR) to address this challenge. Extensive empirical evaluations of UDR, using both supervised metrics and the performance of downstream tasks, validate its effectiveness. They also confirm that disentangled representations are "alike" and entangled representations are unique in their own ways. The proposed VAE ensemble leverages this finding.

Identifiable VAE Building on recent breakthroughs in the nonlinear Independent Component Analysis (ICA) literature (Hyvarinen & Morioka, 2016; 2017; Hyvarinen et al., 2019), Khemakhem et al. show that the identification of the true joint distribution over observed and latent variables is


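To make the ensemble coupling concrete, the following is a minimal sketch (not the authors' implementation) of the pair-wise alignment term between two models, assuming hypothetical latent codes `z_a`, `z_b` and connecting linear maps `W_ab`, `W_ba`. When two models learn disentangled codes that agree up to a signed permutation, the alignment term is driven to zero by connecting maps that equal that signed permutation, i.e., a trivial transformation:

```python
import numpy as np

def alignment_loss(z_a, z_b, W_ab, W_ba):
    """Pair-wise latent alignment term (sketch): each model's code, mapped
    through the connecting linear layer, should reproduce the other's code."""
    return (np.mean((z_a @ W_ab.T - z_b) ** 2)
            + np.mean((z_b @ W_ba.T - z_a) ** 2))

rng = np.random.default_rng(0)
d = 4
z_a = rng.normal(size=(128, d))          # latent codes from VAE a

# Suppose VAE b learned the same factors up to a signed permutation P.
P = np.zeros((d, d))
P[[0, 1, 2, 3], [2, 0, 3, 1]] = [1, -1, 1, -1]
z_b = z_a @ P.T

# With the connecting maps set to the signed permutation (whose inverse is
# its transpose), the alignment term vanishes: the codes are "alike".
loss = alignment_loss(z_a, z_b, W_ab=P, W_ba=P.T)
print(loss)
```

In the full ensemble each VAE would additionally keep its own reconstruction and KL terms, and the cross terms would pass through the other models' decoders; this sketch isolates only the latent-space alignment.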
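The polarized-regime statistics described above can be read directly off the encoder outputs. Below is a small illustrative sketch of our own (not from the paper): a latent dimension is flagged "passive" when its mean squared posterior mean is small and its posterior variance stays near the prior's, and "active" when its posterior variance collapses well below 1. The threshold `tol` is a hypothetical choice for illustration:

```python
import numpy as np

def polarized_modes(mu, sigma2, tol=0.1):
    """Classify each latent dimension as 'active', 'passive', or 'mixed' from
    per-sample posterior means mu and variances sigma2, both of shape (N, d).
    passive: mu_j^2(x) << 1 and sigma_j^2(x) ~ 1; active: sigma_j^2(x) << 1."""
    mean_mu2 = (mu ** 2).mean(axis=0)
    mean_s2 = sigma2.mean(axis=0)
    modes = []
    for m2, s2 in zip(mean_mu2, mean_s2):
        if s2 < tol:
            modes.append("active")
        elif m2 < tol and abs(s2 - 1.0) < tol:
            modes.append("passive")
        else:
            modes.append("mixed")
    return modes

rng = np.random.default_rng(1)
N = 1000
# dim 0: active (informative, collapsed posterior variance)
# dim 1: passive (posterior ~ prior, ignored by the decoder)
mu = np.stack([rng.normal(0, 1, N), rng.normal(0, 0.01, N)], axis=1)
sigma2 = np.stack([np.full(N, 0.01), np.full(N, 1.0)], axis=1)
print(polarized_modes(mu, sigma2))  # ['active', 'passive']
```

Such a diagnostic is one way to inspect whether the ensemble members settle into compatible polarized regimes.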