FONDUE: AN ALGORITHM TO AUTOMATICALLY FIND THE DIMENSIONALITY OF THE LATENT REPRESENTATIONS OF VARIATIONAL AUTOENCODERS

Abstract

When training a variational autoencoder (VAE) on a given dataset, determining the number of latent variables is mostly done by grid search: a costly process in terms of computational time and carbon footprint. In this paper, we explore the intrinsic dimension estimates (IDEs) of the data and of the latent representations learned by VAEs. We show that the discrepancies between the IDEs of the mean and sampled representations of a VAE after only a few steps of training reveal the presence of passive variables in the latent space, which, in well-behaved VAEs, indicates a superfluous number of dimensions. Using this property, we propose FONDUE: an algorithm which quickly finds the number of latent dimensions after which the mean and sampled representations start to diverge (i.e., when passive variables are introduced), providing a principled method for selecting the number of latent dimensions for VAEs and autoencoders.

1. INTRODUCTION

"How many latent variables should I use for this model?" is a question that many practitioners using variational autoencoders (VAEs) or autoencoders (AEs) have to deal with. When the task has been studied before, this information is available in the literature for the specific architecture and dataset used. When it has not, answering this question becomes more complicated. Indeed, the dimensionality of the latent representation is currently determined empirically by increasing the number of latent dimensions until the reconstruction loss, or the accuracy on a downstream task, stops improving (Doersch, 2016; Mai Ngoc & Hwang, 2020). This is a costly process which requires fully training multiple models, increasing the carbon footprint and the time needed for an experiment. One could wonder whether it would be sufficient to simply use a very large number of latent dimensions in all cases. However, besides defeating the purpose of learning compressed representations, this may lead to a range of issues. For example, one would obtain lower accuracy on downstream tasks (Mai Ngoc & Hwang, 2020) and, if the number of dimensions is sufficiently large, a very high reconstruction loss (Doersch, 2016). It would also hinder the interpretability of downstream task models such as linear regression, prevent investigating the learned representation with latent traversals, and increase the correlation of the latent representations (Bonheme & Grzes, 2021).

Intrinsic dimension (ID) estimation, the estimation of the minimum number of variables needed to describe the data, is an active area of research in topology, and various estimation methods have been proposed (Facco et al., 2017; Levina & Bickel, 2004).
In recent years, these techniques have successfully been applied to deep learning to show empirically that the intrinsic dimension of images is much lower than their extrinsic dimension (i.e., the number of pixels) (Gong et al., 2019; Ansuini et al., 2019; Pope et al., 2021), and that the ID estimates (IDEs) of neural network classifiers with good generalisation tend to first increase, then decrease until reaching a very low IDE in the last layer (Ansuini et al., 2019). However, to the best of our knowledge, ID estimation techniques have never been applied to VAEs. After exploring the IDEs of the representations learned by VAEs at different layers, we will show that by combining this technique with knowledge of the properties of VAEs, we can design a simple yet efficient algorithm which fulfils the criteria of the current methods (i.e., low reconstruction loss, high accuracy on downstream tasks) without requiring multiple models to be fully trained.

Our contributions are as follows. (i) We provide an experimental study of the IDEs of VAEs, and find that (1) the layers of VAEs reach stable IDEs very early in the training, and (2) the IDEs of the mean and sampled representations differ when some latent variables collapse. (ii) Based on these observations, we propose FONDUE: an algorithm which automatically finds the number of latent dimensions that leads to a low reconstruction loss and good accuracy.

Due to their size and to preserve anonymity, the 300 models trained for this experiment will be released after the review.

2.1. VARIATIONAL AUTOENCODERS

Variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende & Mohamed, 2015) are deep probabilistic generative models based on variational inference. The encoder maps an input x to a latent representation z, and the decoder attempts to reconstruct x from z. This can be optimised by maximising L, the evidence lower bound (ELBO)

L(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) || p(z)),  (1)

where the first term is the reconstruction term and the second the regularisation term. The prior p(z) is generally modelled as a standard multivariate Gaussian distribution N(0, I) to permit a closed-form computation of the regularisation term (Doersch, 2016). The regularisation term can be further penalised by a weight β (Higgins et al., 2017) such that

L(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x|z)] − β D_KL(q_φ(z|x) || p(z)),  (2)

which reduces to equation 1 when β = 1 and to a deterministic autoencoder (AE) when β = 0. Note that we mention AEs here only as a way of explaining β, and refer the reader to Appendix J for an overview of AEs.

Polarised regime. When β ≥ 1, VAEs are encouraged to have a high precision (i.e., low variance) on the latent variables that they use (the active variables) while keeping the remaining, passive, variables close to N(0, I) (Rolinek et al., 2019). This behaviour, typical of VAEs, is known as the polarised regime or posterior collapse, and is necessary for VAEs to provide good reconstructions (Dai & Wipf, 2018; Dai et al., 2020). Because they are close to a standard Gaussian distribution, the passive variables can be used by the model to lower the KL divergence and compensate for the increased divergence generated by the active variables. Moreover, their mean representation will generally be close to zero regardless of the input, while their sampled representation will have a variance close to 1 (Bonheme & Grzes, 2021) (see also Appendices E and F).

2.2. INTRINSIC DIMENSION ESTIMATION

It is generally assumed that a dataset X of m i.i.d. data examples X_i ∈ R^n is a locally smooth non-linear transformation g of a lower-dimensional dataset Y of m i.i.d. samples Y_i ∈ R^d, where d ≪ n (Campadelli et al., 2015; Chollet, 2021). The goal of ID estimation is to recover d given X. In this section, we detail two ID estimation techniques which use the statistical properties of the neighbourhood of each data point to estimate d, and which provide good results for approximating the ID of deep neural network representations and deep learning datasets (Ansuini et al., 2019; Gong et al., 2019; Pope et al., 2021). See Appendix H and Campadelli et al. (2015) for more details on ID estimation techniques.
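One of the neighbourhood-based estimators of this kind is TwoNN (Facco et al., 2017), which only uses the ratio of the distances to each point's two nearest neighbours. As a concrete illustration, here is a minimal NumPy sketch of the estimator; the truncation of the largest ratios and the through-the-origin linear fit follow the original paper, but the function name and default fraction are our own choices, not taken from this paper.

```python
import numpy as np

def twonn_id(X, fraction=0.9):
    """Estimate the intrinsic dimension with TwoNN (Facco et al., 2017).

    For each point, the ratio mu = r2 / r1 of the distances to its two
    nearest neighbours follows F(mu) = 1 - mu**(-d) on a locally uniform
    d-dimensional manifold, so d is recovered by a through-the-origin
    linear fit of -log(1 - F(mu)) against log(mu).
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Brute-force pairwise distances; fine for small n, use a KD-tree otherwise.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    r = np.sqrt(np.sort(d2, axis=1)[:, :2])   # r1 and r2 for each point
    mu = r[:, 1] / np.maximum(r[:, 0], 1e-12)
    mu = np.sort(mu)[: int(fraction * n)]     # discard the largest, noisiest ratios
    f = np.arange(1, len(mu) + 1) / n         # empirical CDF of mu
    x, y = np.log(mu), -np.log(1.0 - f)
    return float(np.sum(x * y) / np.sum(x * x))  # least-squares slope, no intercept
```

For data sampled from a 2-dimensional subspace linearly embedded in a higher-dimensional ambient space, the estimate should land close to 2 even though the extrinsic dimension is much larger.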

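For readers who prefer code, the β-weighted ELBO of equation 2 can be written down directly for the usual diagonal-Gaussian encoder and standard-normal prior, since the KL term then has a closed form. The sketch below assumes a unit-variance Gaussian decoder (so the reconstruction term reduces to a squared error up to an additive constant); the function and argument names are illustrative, not part of the paper.

```python
import numpy as np

def beta_elbo_terms(x, x_hat, mu, log_var, beta=1.0):
    """Per-example β-ELBO pieces for q(z|x) = N(mu, diag(exp(log_var)))
    and prior p(z) = N(0, I).

    Returns (elbo, reconstruction term, KL term), each of shape (batch,).
    """
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims.
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
    # Unit-variance Gaussian log-likelihood, up to an additive constant.
    recon = -0.5 * np.sum((x - x_hat) ** 2, axis=-1)
    return recon - beta * kl, recon, kl
```

When the posterior matches the prior exactly (mu = 0, log_var = 0), the KL term vanishes, which is the situation passive variables approach under the polarised regime.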

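The polarised regime described in Section 2.1 also suggests a simple diagnostic: a passive variable has a mean representation close to zero for every input (so its variance across a batch is near 0) and a sampled representation with variance close to 1. The sketch below flags such dimensions from a batch of encoder outputs; the thresholds are illustrative assumptions of ours, not values taken from the paper.

```python
import numpy as np

def passive_dims(mu, log_var, var_threshold=0.05, sigma2_threshold=0.8):
    """Flag latent dimensions that look passive under the polarised regime.

    mu, log_var: encoder outputs of shape (batch, latent_dim).
    Returns a boolean mask of shape (latent_dim,), True for passive dims.
    """
    mean_var = np.var(mu, axis=0)                  # spread of mu_j across the batch
    avg_sigma2 = np.mean(np.exp(log_var), axis=0)  # average posterior variance
    return (mean_var < var_threshold) & (avg_sigma2 > sigma2_threshold)
```

On synthetic encoder outputs where the first dimensions are active (input-dependent means, tiny posterior variance) and the last ones are passive (near-zero means, unit posterior variance), the mask should pick out exactly the passive dimensions.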