FONDUE: AN ALGORITHM TO AUTOMATICALLY FIND THE DIMENSIONALITY OF THE LATENT REPRESENTATIONS OF VARIATIONAL AUTOENCODERS

Abstract

When training a variational autoencoder (VAE) on a given dataset, the number of latent variables is mostly determined by grid search: a costly process in terms of computational time and carbon footprint. In this paper, we explore the intrinsic dimension estimates (IDEs) of the data and of the latent representations learned by VAEs. We show that the discrepancy between the IDEs of the mean and sampled representations of a VAE, after only a few steps of training, reveals the presence of passive variables in the latent space, which, in well-behaved VAEs, indicates a superfluous number of dimensions. Using this property, we propose FONDUE: an algorithm which quickly finds the number of latent dimensions after which the mean and sampled representations start to diverge (i.e., when passive variables are introduced), providing a principled method for selecting the number of latent dimensions for VAEs and autoencoders.

1. INTRODUCTION

"How many latent variables should I use for this model?" is a question that many practitioners using variational autoencoders (VAEs) or autoencoders (AEs) have to deal with. When the task has been studied before, this information is available in the literature for the specific architecture and dataset used. However, when it has not, answering this question becomes more complicated. Indeed, the dimensionality of the latent representation is currently determined empirically by increasing the number of latent dimensions until the reconstruction loss or the accuracy on a downstream task stops improving (Doersch, 2016; Mai Ngoc & Hwang, 2020). This is a costly process requiring multiple models to be fully trained, increasing the carbon footprint and time needed for an experiment. One could wonder whether it would be sufficient to use a very large number of latent dimensions in all cases. However, besides defeating the purpose of learning compressed representations, this may lead to a range of issues. For example, one would obtain lower accuracy on downstream tasks (Mai Ngoc & Hwang, 2020) and, if the number of dimensions is sufficiently large, a very high reconstruction loss (Doersch, 2016). This would also hinder the interpretability of downstream task models such as linear regression, prevent investigating the learned representation with latent traversals, and increase the correlation of the latent representations (Bonheme & Grzes, 2021).

Intrinsic dimension (ID) estimation, i.e., estimating the minimum number of variables needed to describe the data, is an active area of research in topology, and various estimation methods have been proposed (Facco et al., 2017; Levina & Bickel, 2004).
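To make the notion of an ID estimate concrete, the sketch below implements the Two-NN estimator of Facco et al. (2017): for each point, the ratio of the distances to its two nearest neighbours follows a distribution that depends only on the intrinsic dimension, yielding a simple maximum-likelihood estimate. This is an illustrative, minimal implementation (brute-force distances, no discarding of outlier ratios), not the exact setup used in the experiments of this paper.

```python
import numpy as np

def twonn_ide(x: np.ndarray) -> float:
    """Two-NN intrinsic dimension estimate (Facco et al., 2017).

    For each point, take the ratio mu = r2 / r1 of the distances to its
    first and second nearest neighbours; the maximum-likelihood estimate
    of the intrinsic dimension is then N / sum(log mu).
    """
    n = x.shape[0]
    # Brute-force pairwise squared Euclidean distances.
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)  # exclude self-distances
    # Distances to the two nearest neighbours of each point.
    r = np.sqrt(np.sort(d2, axis=1)[:, :2])
    mu = r[:, 1] / r[:, 0]
    return n / np.sum(np.log(mu))

# Toy check: points lying on a 2-D linear subspace embedded in 10-D
# space should have an IDE close to 2, even though their extrinsic
# dimension is 10.
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))
print(twonn_ide(points))  # close to 2
```

The same idea explains the gap between intrinsic and extrinsic dimension of images mentioned below: the estimator only depends on local neighbour distances, not on the ambient number of coordinates.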
In recent years, these techniques have successfully been applied to deep learning to empirically show that the intrinsic dimension of images is much lower than their extrinsic dimension (i.e., the number of pixels) (Gong et al., 2019; Ansuini et al., 2019; Pope et al., 2021), and that the ID estimates (IDEs) of neural network classifiers with good generalisation tend to first increase, then decrease until reaching a very low IDE in the last layer (Ansuini et al., 2019). However, to the best of our knowledge, ID estimation techniques have never been applied to VAEs. After exploring the IDEs of the representations learned by VAEs at different layers, we will show that by combining this technique with knowledge of the properties of VAEs, we can design a simple yet efficient algorithm which fulfils the criteria of the current methods (i.e., low reconstruction loss, high accuracy on downstream tasks), without requiring multiple models to be fully trained.
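The search idea can be sketched as follows: treat the gap between the IDEs of the mean and sampled representations as a function of the number of latent dimensions, and look for the largest dimensionality at which no gap (i.e., no passive variable) has yet appeared. The sketch below is a generic doubling-then-bisection search under stated assumptions, not the paper's exact algorithm; in particular, `ide_gap` is a hypothetical callable assumed to train a VAE with `n` latent dimensions for a few steps and return |IDE(mean) - IDE(sampled)|.

```python
from typing import Callable

def find_latent_dim(ide_gap: Callable[[int], float],
                    threshold: float = 0.5,
                    start: int = 2,
                    max_dim: int = 1024) -> int:
    """Find the largest number of latent dimensions for which the mean
    and sampled IDEs have not yet diverged.

    `ide_gap(n)` is a hypothetical user-supplied callable returning the
    IDE discrepancy for a VAE with n latent dimensions after a few
    training steps (an assumption for this sketch).
    """
    # Doubling phase: grow n until the gap exceeds the threshold.
    lo, hi = start, start
    while ide_gap(hi) <= threshold and hi < max_dim:
        lo, hi = hi, hi * 2
    if ide_gap(hi) <= threshold:
        return hi  # no divergence observed up to max_dim
    # Bisection phase: locate the boundary between lo (no gap) and hi (gap).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ide_gap(mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

# Toy gap function: a gap (passive variables) appears once n exceeds a
# hypothetical intrinsic dimensionality of 12.
toy_gap = lambda n: max(0.0, n - 12)
print(find_latent_dim(toy_gap))  # -> 12
```

The doubling phase keeps the number of partially trained models logarithmic in the selected dimensionality, which is what makes this kind of search cheap compared with a full grid search over fully trained models.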

