DYNAMICVAE: DECOUPLING RECONSTRUCTION ERROR AND DISENTANGLED REPRESENTATION LEARNING

Abstract

This paper challenges the common assumption that the weight β in β-VAE must be larger than 1 in order to effectively disentangle latent factors. We demonstrate that β-VAE with β < 1 can not only attain good disentanglement but also significantly improve reconstruction accuracy via dynamic control. The paper thus removes the inherent trade-off between reconstruction accuracy and disentanglement for β-VAE. Existing methods, such as β-VAE and FactorVAE, assign a large weight to the KL-divergence term in the objective function, leading to high reconstruction errors for the sake of better disentanglement. To mitigate this problem, ControlVAE was recently developed to dynamically tune the KL-divergence weight in an attempt to shift the trade-off to a more favorable point. However, ControlVAE fails to eliminate the conflict between the need for a large β (for disentanglement) and the need for a small β (for smaller reconstruction error). Instead, we propose DynamicVAE, which maintains a different β at different stages of training, thereby decoupling disentanglement and reconstruction accuracy. In order to evolve the weight β along a trajectory that enables such decoupling, DynamicVAE leverages a modified incremental PI (proportional-integral) controller, a variant of the proportional-integral-derivative (PID) control algorithm, and employs a moving average as well as a hybrid annealing method to evolve the value of the KL-divergence smoothly in a tightly controlled fashion. We theoretically prove the stability of the proposed approach. Evaluation results on three benchmark datasets demonstrate that DynamicVAE significantly improves the reconstruction accuracy while achieving disentanglement comparable to the best of existing methods. The results verify that our method can separate disentangled representation learning from reconstruction, removing the inherent tension between the two.
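The control loop sketched in the abstract can be illustrated with a minimal incremental PI controller that drives the observed KL-divergence toward a set point by adjusting β. This is a sketch under assumptions: the gains, window size, clamping bounds, and sign convention below are illustrative placeholders, not the paper's tuned configuration.

```python
from collections import deque


class IncrementalPIBeta:
    """Illustrative incremental PI controller that adjusts the KL weight
    beta toward a target KL value. The incremental form updates beta by a
    delta each step, and a moving average smooths the noisy per-batch KL.
    Gains and bounds here are assumed, not the paper's values."""

    def __init__(self, kp=0.01, ki=0.001, window=10,
                 beta_min=0.0, beta_max=1.0, beta_init=0.0):
        self.kp, self.ki = kp, ki
        # moving-average buffer smooths the observed KL before control
        self.kl_history = deque(maxlen=window)
        self.beta = beta_init
        self.prev_error = 0.0
        self.beta_min, self.beta_max = beta_min, beta_max

    def step(self, observed_kl, target_kl):
        self.kl_history.append(observed_kl)
        smoothed_kl = sum(self.kl_history) / len(self.kl_history)
        # positive error: KL is above the set point, so raise beta
        error = smoothed_kl - target_kl
        # incremental PI: the *change* in beta depends on the error increment
        # (proportional part) and the current error (integral part)
        self.beta += self.kp * (error - self.prev_error) + self.ki * error
        self.prev_error = error
        # clamp beta; the paper's claim is that beta < 1 can suffice
        self.beta = min(max(self.beta, self.beta_min), self.beta_max)
        return self.beta
```

In a training loop, `step` would be called once per batch with the batch's measured KL-divergence, and the returned β used to weight the KL term in the ELBO for the next update.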

1. INTRODUCTION

The goal of disentangled representation learning is to encode input data into a low-dimensional space that preserves information about the salient factors of variation, so that each dimension of the representation corresponds to a distinct factor in the data (Bengio et al., 2013; Locatello et al., 2020; van Steenkiste et al., 2019). Learning disentangled representations benefits a variety of downstream tasks (Higgins et al., 2018; Lake et al., 2017; Locatello et al., 2019c;a; Denton et al., 2017; Mathieu et al., 2019), including abstract visual reasoning (van Steenkiste et al., 2019), zero-shot transfer learning (Burgess et al., 2018; Lake et al., 2017; Higgins et al., 2017a), and image generation (Nie et al., 2020), just to name a few. Due to its central importance in various downstream applications, there is abundant literature on learning disentangled representations. Roughly speaking, there are two lines of methods toward this goal. The first category comprises supervised methods (Chen & Batmanghelich, 2019; Locatello et al., 2019c; Shu et al., 2019; Bouchacourt et al., 2018; Nie et al., 2020; Yang et al., 2015), where external supervision (e.g., data generative factors) is available during training to guide the learning of disentangled representations. The second line of work focuses on unsupervised methods (Chen et al., 2016; 2018; Burgess et al., 2018; Kim & Mnih, 2018; Denton et al., 2017; Kumar et al., 2018; Fraccaro et al., 2017), which substantially relieve the need for external supervision. For this reason, in this paper, we mainly focus on unsupervised disentangled representation learning.

