VARIATIONAL AUTOENCODERS WITH DECREMENTAL INFORMATION BOTTLENECK FOR DISENTANGLEMENT

Abstract

One major challenge of disentanglement learning with variational autoencoders is the trade-off between disentanglement and reconstruction fidelity. Previous incremental methods with only one latent space cannot optimize these two targets simultaneously, so they expand the information bottleneck during training, shifting the optimization emphasis from disentanglement to reconstruction. However, a large bottleneck weakens the disentanglement constraint, causing the information diffusion problem. To tackle this issue, we present a novel decremental variational autoencoder with disentanglement-invariant transformations that optimizes multiple objectives in different layers, termed DeVAE, which balances disentanglement and reconstruction fidelity by gradually decreasing the information bottlenecks of diverse latent spaces. Benefiting from the multiple latent spaces, DeVAE allows simultaneous optimization of multiple objectives, improving reconstruction while preserving the disentanglement constraint and avoiding information diffusion. DeVAE is also compatible with large models with high-dimensional latent spaces. Experimental results on dSprites and Shapes3D show that DeVAE achieves a good balance between disentanglement and reconstruction. DeVAE is also highly tolerant to hyperparameter choices and scales to high-dimensional latent spaces.

1. INTRODUCTION

Unsupervised learning for sensing the properties of objects is crucial to narrowing the gap between human and machine intelligence. In line with human intelligence, disentanglement learning (Bengio et al., 2013) is considered a promising direction for obtaining explanatory representations from observations, enabling machines to understand and reason about objects without any supervision. In recent years, various approaches (Higgins et al., 2017; Chen et al., 2018; Kim & Mnih, 2018; Burgess et al., 2018; Chen et al., 2016) have successfully extracted basic properties of objects, such as position, color, orientation, and scale (Burgess & Kim, 2018; Matthey et al., 2017). The commonly used methods are based on the variational autoencoder (VAE) (Kingma & Welling, 2014). In particular, β-VAE (Higgins et al., 2017) introduced an extra weight β on the Kullback-Leibler (KL) divergence term to promote disentanglement. However, β-VAE exhibits a trade-off between disentanglement and reconstruction fidelity, a problem addressed by subsequent works. One common direction for dealing with the trade-off is to penalize the Total Correlation (TC) between latent variables while avoiding a reduction of the mutual information between data and latents, as in FactorVAE (Kim & Mnih, 2018), β-TCVAE (Chen et al., 2018), and DIPVAE (Kumar et al., 2018). As pointed out in (Träuble et al., 2020; Dittadi et al., 2020), TC-based VAEs rest on the strong prior assumption that the factors are statistically independent. Beyond that, in high-dimensional latent spaces the estimation of TC becomes inaccurate due to the curse of dimensionality, as observed in our experiments in Section 3.2. Realistic problems usually involve numerous factors and therefore require large models with high-dimensional latent spaces to extract representations. For example, the popular deep model ResNet50 (He et al., 2016) has a 2048-dimensional feature space.
However, current TC estimators do not scale to high-dimensional problems, causing the low performance of TC-based methods in practice. In this work, instead of calculating TC, we leverage the information bottleneck (IB) (Tishby et al., 1999; Burgess et al., 2018) to promote disentanglement.
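For concreteness, the regularizers discussed above can be sketched as follows. The closed-form KL divergence for a diagonal-Gaussian posterior against a standard-normal prior is standard; the β, γ, and capacity values below are illustrative placeholders, not tuned settings from any of the cited papers.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions:
    0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=-1)

def beta_vae_penalty(mu, log_var, beta=4.0):
    """beta-VAE regularizer (Higgins et al., 2017): beta * KL(q(z|x) || p(z)).
    beta > 1 tightens the information bottleneck at the cost of reconstruction."""
    return beta * gaussian_kl(mu, log_var)

def capacity_penalty(mu, log_var, gamma=100.0, capacity=5.0):
    """Capacity-controlled IB objective in the style of Burgess et al. (2018):
    the KL is pulled toward a target capacity C (in nats) that incremental
    methods increase during training, widening the bottleneck."""
    return gamma * np.abs(gaussian_kl(mu, log_var) - capacity)
```

Under this view, expanding the bottleneck corresponds to raising `capacity` over training, which is exactly the step that can relax the disentanglement constraint.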

