EVALUATING THE DISENTANGLEMENT OF DEEP GENERATIVE MODELS WITH MANIFOLD TOPOLOGY

Abstract

Learning disentangled representations is regarded as a fundamental task for improving the generalization, robustness, and interpretability of generative models. However, measuring disentanglement has been challenging and inconsistent, often depending on an ad-hoc external model or being specific to a certain dataset. To address this, we present a method for quantifying disentanglement that uses only the generative model, by measuring the topological similarity of conditional submanifolds in the learned representation. The method has both unsupervised and supervised variants. To illustrate its effectiveness and applicability, we empirically evaluate several state-of-the-art models across multiple datasets. We find that our method ranks models similarly to existing methods. We make our code publicly available at https://github.com/stanfordmlgroup/disentanglement.

1. INTRODUCTION

Learning disentangled representations is important for a variety of tasks, including adversarial robustness, generalization to novel tasks, and interpretability (Stutz et al., 2019; Alemi et al., 2017; Ridgeway, 2016; Bengio et al., 2013). Recently, deep generative models have shown marked improvement in disentanglement across an increasing number of datasets and a variety of training objectives (Chen et al., 2016; Lin et al., 2020; Higgins et al., 2017; Kim and Mnih, 2018; Chen et al., 2018b; Burgess et al., 2018; Karras et al., 2019). Nevertheless, quantifying the extent of this disentanglement has remained challenging and inconsistent. As a result, evaluation has often resorted to qualitative inspection for comparisons between models. Existing evaluation metrics are rigid: some rely on training additional ad-hoc models that depend on the generative model, such as a classifier, regressor, or encoder (Eastwood and Williams, 2018; Kim and Mnih, 2018; Higgins et al., 2017; Chen et al., 2018b; Glorot et al., 2011; Grathwohl and Wilson, 2016; Karaletsos et al., 2015; Duan et al., 2020), while others are tuned for a particular dataset (Karras et al., 2019). Both approaches compromise a metric's reliability, its relevance to different models and tasks, and consequently its applicable scope. Specifically, reliance



Figure 1: Factors in the dSprites dataset displaying topological similarity and semantic correspondence to respective latent dimensions in a disentangled generative model, as shown through Wasserstein RLT distributions (vectorizations of the persistent homology of submanifolds conditioned on a latent dimension) and latent interpolations along the respective latent dimensions.
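To make the pipeline in the caption concrete, the following is a minimal sketch, not the paper's implementation: `toy_generator` is a hypothetical stand-in for a trained generative model, and the pairwise-distance distribution is a crude proxy for an RLT vectorization of persistent homology. What it preserves is the overall structure: fix one latent dimension to obtain a conditional submanifold, summarize each submanifold's geometry as a distribution, and compare those distributions with the 1-Wasserstein distance.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

def toy_generator(z):
    """Hypothetical stand-in for a trained generator mapping latents to data."""
    # A fixed nonlinear map; a real model would be a trained network.
    w = np.array([[1.0, 0.5], [-0.3, 1.2], [0.8, -0.7]])
    return np.tanh(z @ w.T)

def conditional_samples(dim, value, n=256, latent_dim=2):
    """Sample the submanifold obtained by fixing one latent dimension."""
    z = rng.normal(size=(n, latent_dim))
    z[:, dim] = value
    return toy_generator(z)

def topology_proxy(x):
    """Crude proxy for an RLT vectorization: the distribution of pairwise
    distances on the sampled submanifold. The actual method instead uses
    persistent homology of the submanifold."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return d[np.triu_indices_from(d, k=1)]

# Compare the summaries of two conditional submanifolds along latent dim 0
# with the 1-Wasserstein distance, analogous to comparing RLT distributions.
a = topology_proxy(conditional_samples(dim=0, value=-1.0))
b = topology_proxy(conditional_samples(dim=0, value=+1.0))
score = wasserstein_distance(a, b)
print(f"Wasserstein distance between conditional summaries: {score:.4f}")
```

Smaller values of `score` indicate that the conditional submanifolds look alike under the chosen summary; the paper aggregates such comparisons across latent dimensions to score disentanglement.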

