VARIATIONAL AUTOENCODERS WITH DECREMENTAL INFORMATION BOTTLENECK FOR DISENTANGLEMENT

Abstract

One major challenge of disentanglement learning with variational autoencoders is the trade-off between disentanglement and reconstruction fidelity. Previous incremental methods, which rely on a single latent space, cannot optimize these two targets simultaneously, so they expand the information bottleneck during training to shift the objective from disentanglement to reconstruction. However, a large bottleneck loses the constraint of disentanglement, causing the information diffusion problem. To tackle this issue, we present a novel decremental variational autoencoder with disentanglement-invariant transformations that optimizes multiple objectives in different layers, termed DeVAE, which balances disentanglement and reconstruction fidelity by gradually decreasing the information bottlenecks of diverse latent spaces. Benefiting from the multiple latent spaces, DeVAE allows simultaneous optimization of multiple objectives, optimizing reconstruction while keeping the constraint of disentanglement and thus avoiding information diffusion. DeVAE is also compatible with large models with high-dimensional latent spaces. Experimental results on dSprites and Shapes3D show that DeVAE achieves a good balance between disentanglement and reconstruction, and that it is highly tolerant of hyperparameter choices and high-dimensional latent spaces.

1. INTRODUCTION

Unsupervised learning for sensing the properties of objects is crucial to reduce the gap between human and machine intelligence. In line with human intelligence, disentanglement learning (Bengio et al., 2013) is considered a promising direction to obtain explanatory representations from observations, enabling understanding and reasoning about objects without any supervision. In recent years, various approaches (Higgins et al., 2017; Chen et al., 2018; Kim & Mnih, 2018; Burgess et al., 2018; Chen et al., 2016) have successfully extracted basic properties of objects, such as position, color, orientation, and scale (Burgess & Kim, 2018; Matthey et al., 2017). The most commonly used methods are based on the variational autoencoder (VAE) (Kingma & Welling, 2014). In particular, β-VAE (Higgins et al., 2017) introduced an extra weight β on the Kullback-Leibler (KL) divergence to promote disentanglement. However, β-VAE exhibits a trade-off between disentanglement and reconstruction fidelity, which subsequent works have sought to resolve. One common direction for dealing with this trade-off is to penalize the Total Correlation (TC) between latent variables while avoiding a reduction of mutual information, as in FactorVAE (Kim & Mnih, 2018), β-TCVAE (Chen et al., 2018), and DIPVAE (Kumar et al., 2018). As pointed out in (Träuble et al., 2020; Dittadi et al., 2020), TC-based VAEs rest on the strong prior assumption that the factors are statistically independent. Beyond that, in high-dimensional latent spaces the estimation of TC becomes inaccurate due to the curse of dimensionality, as we observe in our experiments in Section 3.2. Realistic problems usually involve numerous factors and therefore require large models with high-dimensional latent spaces to extract representations; for example, the popular deep model ResNet50 (He et al., 2016) has a 2048-dimensional feature space.
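For context, the Total Correlation penalized by these methods is the KL divergence between the aggregate posterior and the product of its marginals (the standard definition from the TC-VAE literature, not restated from this paper):

```latex
\mathrm{TC}(z) \;=\; D_{\mathrm{KL}}\!\left( q(z) \,\Big\|\, \prod_{j} q(z_j) \right)
```

This quantity is zero exactly when the latent dimensions are statistically independent; its sampling-based estimators degrade as the dimensionality of z grows, which underlies the scaling issue noted above.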
However, current TC estimators do not scale to high-dimensional problems, causing the low performance of TC-based methods in practice. In this work, instead of calculating TC, we leverage the information bottleneck (IB) (Tishby et al., 1999; Burgess et al., 2018) to promote disentanglement. Meanwhile, previous IB-based methods (Burgess et al., 2018; Shao et al., 2022; Wu et al., 2022) have tried to resolve the trade-off between disentanglement and reconstruction fidelity. A narrow IB forces the model to find efficient codes for representing the data, which encourages disentanglement. These methods therefore first apply high pressure with a narrow IB and then expand the IB gradually, shifting the objective from disentanglement to reconstruction fidelity; we term them incremental methods. For example, DynamicVAE (Shao et al., 2022) initializes β with a large value at the beginning of training for disentanglement and then stably increases the KL divergence for reconstruction via a non-linear PI controller (Åström & Hägglund, 2006). However, these methods lose the constraint of disentanglement when expanding the IB, which causes the information diffusion problem (Wu et al., 2022). In this work, to avoid information diffusion, we aim to optimize reconstruction while keeping the constraint of disentanglement. Different from the IB-incremental approaches listed above, our key motivation is to optimize disentanglement and reconstruction simultaneously. Previous methods have only one latent space and thus cannot optimize disentanglement and reconstruction at the same time, which forces them to switch the target from disentanglement to reconstruction during training. Instead, we propose a novel multi-layer framework in which each layer has its own latent space and objective, allowing multiple targets to be optimized at a time.
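As a minimal, hypothetical sketch of the incremental strategy described above (in the spirit of the capacity-annealed objective of Burgess et al. (2018), not the authors' implementation; the function name, `gamma`, and schedule constants are illustrative assumptions), the expanding IB can be written as a KL penalty toward a target capacity C that grows during training:

```python
def capacity_annealed_kl(kl, step, c_max=25.0, anneal_steps=100_000, gamma=1000.0):
    """Incremental-IB penalty: gamma * |KL - C(t)|, with the target capacity
    C(t) growing linearly from 0 to c_max over anneal_steps training steps.

    Expanding C widens the information bottleneck as training proceeds, which
    is exactly the behavior that incremental methods rely on (and that the
    decremental design in this paper reverses).
    """
    c = min(c_max, c_max * step / anneal_steps)  # linearly growing capacity
    return gamma * abs(kl - c)
```

Once the capacity has fully expanded (C = c_max), the KL term is no longer pushed toward a narrow bottleneck, which is where the constraint of disentanglement is lost.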
In this framework, the first layer is a vanilla VAE that rebuilds high-quality images, and the subsequent layers distill the important variables through narrow IBs to promote disentanglement. To inherit disentanglement from the subsequent layers, we introduce disentanglement-invariant transformations that connect the layers one by one; these extra layers can be seen as disentanglement regularizers that constrain the representation. To achieve this, we propose a simple yet effective VAE framework composed of multiple continuous latent sub-spaces with a novel IB-decremental strategy and disentanglement-invariant transform operators, which we call DeVAE. Specifically, we decrease the information bottleneck of each latent space layer by layer: the first space is constrained for informativeness to recover the input image, while the other, disentangled spaces learn the factors of the image through narrow IBs. Furthermore, the disentanglement-invariant transform operator ensures simultaneous optimization of disentanglement across the continuous latent sub-spaces, which avoids information diffusion: our decremental model keeps the constraints of disentanglement while optimizing reconstruction. We also conducted comprehensive quantitative and qualitative comparisons with popular methods. Experimental results demonstrate that DeVAE is robust to hyperparameters and the size of the latent spaces. Our contributions can be summarized as follows:
• We introduce several latent spaces that share disentanglement through disentanglement-invariant transformations.
• We propose a novel paradigm for disentanglement learning by decreasing the IB, termed decremental VAE (DeVAE). Our decremental model can handle large-scale problems and shows robustness on several datasets.
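Schematically, the decremental design assigns each latent layer its own IB pressure and sums the per-layer objectives so that reconstruction and disentanglement are optimized simultaneously. The sketch below is purely illustrative (the function name, the number of layers, and the specific β values are our assumptions, not the paper's configuration); it only shows the shape of a multi-layer objective whose bottlenecks shrink layer by layer:

```python
def multilayer_ib_loss(recon_loss, layer_kls, betas=(1.0, 4.0, 16.0, 64.0)):
    """Sum per-layer objectives for a decremental-IB model (illustrative).

    Layer 0 acts as a vanilla VAE (beta = 1) preserving reconstruction
    fidelity; each subsequent layer uses a larger beta, i.e. a narrower
    information bottleneck, to enforce disentanglement. One KL value is
    assumed per latent sub-space.
    """
    assert len(layer_kls) == len(betas), "one beta per latent sub-space"
    return recon_loss + sum(b * kl for b, kl in zip(betas, layer_kls))
```

Because every layer's term is present in a single loss, the disentanglement constraint is never dropped in favor of reconstruction, unlike schedules that change a single β over time.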

2. METHODOLOGY

2.1 PRELIMINARIES

Problem Setup & Notations. Disentanglement learning aims to learn the factors of variation that give rise to changes in observations. Given a set of samples x ∈ X, each sample can be uniquely described by a set of ground-truth factors c ∈ C; the generation process g(·), with x = g(c), is generally unobserved. We say a representation for factor c_i is disentangled if it is invariant for samples that differ only in the other factors c_j (j ≠ i). We use variational inference to learn the disentangled representation for a given problem: p(z|x) denotes the probability of z = f(x), and p(x|z) denotes the probability of x = g(z). The representation function is a conditional Bayesian network of the form q_ϕ(z|x) that estimates p(z|x); the generative model is another network of the form p_θ(x|z)p(z), where ϕ and θ are trainable parameters.

Revisit VAE & β-VAE. The VAE framework (Kingma & Welling, 2014) computes the representation function by introducing q_ϕ(z|x) and optimizing the evidence lower bound (ELBO).
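The ELBO and its β-weighted variant take the standard form (as in Kingma & Welling, 2014, and Higgins et al., 2017):

```latex
\mathcal{L}(\theta,\phi;x)
  \;=\; \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]
  \;-\; \beta\, D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big)
```

where β = 1 recovers the vanilla VAE, and β > 1 (β-VAE) narrows the information bottleneck on z, promoting disentanglement at the cost of reconstruction fidelity.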

