VERIFYING THE UNION OF MANIFOLDS HYPOTHESIS FOR IMAGE DATA

Abstract

Deep learning has had tremendous success at learning low-dimensional representations of high-dimensional data. This success would be impossible if there were no hidden low-dimensional structure in data of interest; this existence is posited by the manifold hypothesis, which states that the data lies on an unknown manifold of low intrinsic dimension. In this paper, we argue that this hypothesis does not properly capture the low-dimensional structure typically present in image data. Assuming that data lies on a single manifold implies that intrinsic dimension is identical across the entire data space, and does not allow for subregions of this space to have a different number of factors of variation. To address this deficiency, we consider the union of manifolds hypothesis, which states that data lies on a disjoint union of manifolds of varying intrinsic dimensions. We empirically verify this hypothesis on commonly-used image datasets, finding that indeed, observed data lies on a disconnected set and that intrinsic dimension is not constant. We also provide insights into the implications of the union of manifolds hypothesis in deep learning, both supervised and unsupervised, showing that designing models with an inductive bias for this structure improves performance across classification and generative modelling tasks. Our code is available at https://github.com/layer6ai-labs/UoMH.

1. INTRODUCTION

The manifold hypothesis (Bengio et al., 2013) states that high-dimensional data of interest often lies on an unknown lower-dimensional manifold embedded in ambient space, and there is strong evidence supporting this hypothesis. From a theoretical perspective, it is known that both manifold learning and density estimation scale exponentially with the (low) intrinsic dimension when such structure exists (Ozakin & Gray, 2009; Narayanan & Mitter, 2010), while scaling exponentially with the (high) ambient dimension otherwise (Cacoullos, 1966). Thus, the most plausible explanation for the success of machine learning methods on high-dimensional data is the existence of far lower intrinsic dimension, which facilitates learning on datasets of reasonable size. This is verified empirically by Pope et al. (2021), who perform a comprehensive study estimating the intrinsic dimension of commonly-used image datasets and clearly find low-dimensional structure. However, thinking of observed data as lying on a single unknown low-dimensional manifold is quite limiting, as this implies that the intrinsic dimension throughout the dataset is constant. If we consider the intrinsic dimensionality to be the number of factors of variation generating the data, we can see that this formulation prevents distinct regions of the data's support from having differing numbers of factors of variation. Yet this seems unrealistic: for example, we should not expect the number of factors needed to describe 8s and 1s in the MNIST dataset (LeCun et al., 1998) to be equal.
To accommodate this intuition, in this paper we consider the union of manifolds hypothesis: that high-dimensional image data often lies not on a single manifold, but on a disjoint union of manifolds of different intrinsic dimensions.¹ While this hypothesis has motivated work in the clustering literature (Vidal, 2011; Elhamifar & Vidal, 2011; 2012; 2013; Zhang et al., 2019; Abdolali & Gillis, 2021; Cai et al., 2022), to the best of our knowledge it has never been empirically explored analogously to the way that Pope et al. (2021) probe the manifold hypothesis. In this work we carry out this verification on commonly-used image datasets, first by confirming their supports are disconnected, and then by estimating the intrinsic dimension on each component, finding that there is indeed variation in these estimates. In order to verify that the support of the data is disconnected, we leverage pushforward deep generative models (Salmona et al., 2022; Ross et al., 2022), which generate samples by transforming noise through a neural network G. We first prove that these deep generative models (DGMs) are incapable of modelling disconnected supports. We then argue that the class labels provided in our considered datasets approximately identify connected components (i.e. different classes are mostly disconnected from each other), and show that training a pushforward model on each class outperforms training a single such model on the entire dataset, even when using the same computational budget: this improvement is a firm indicator that the support is truly disconnected. After empirically verifying the union of manifolds hypothesis, we turn our attention to some of its implications in deep learning.
We establish that classes with higher intrinsic dimension are harder to classify, and guided by this insight, we show that classification accuracy can be improved by more heavily weighting the cross-entropy terms corresponding to classes of higher intrinsic dimension. Finally, we show that the same DGMs we used to confirm the disconnectedness of the data support, which we call disconnected DGMs, provide a performant class of models that is competitive with non-disconnected baselines across a wide range of datasets and model types, and thus point to a promising direction for improving generative models.

2. BACKGROUND AND RELATED WORK

Throughout our paper, we consider the setup where we have access to a dataset D = {x_i}_{i=1}^n, generated i.i.d. from some distribution P* in a high-dimensional ambient space X = R^D.

Pushforward DGMs. As mentioned in the introduction, we leverage DGMs, in particular pushforward DGMs, in order to verify the disconnectedness of the support of P*. We call a pushforward model any DGM whose samples X are given by Z ∼ P_Z and X = G(Z), where P_Z is a (potentially trainable) base distribution in some latent space Z, and G : Z → X is a neural network. We highlight that many popular DGMs fall into this category, including (Gaussian) variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014), normalizing flows (NFs) (Dinh et al., 2017; Kingma & Dhariwal, 2018; Behrmann et al., 2019; Chen et al., 2019; Durkan et al., 2019), generative adversarial networks (GANs) (Goodfellow et al., 2014), and Wasserstein autoencoders (WAEs) (Tolstikhin et al., 2018).
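The sampling procedure Z ∼ P_Z, X = G(Z) can be illustrated with a minimal NumPy sketch. Everything below is a toy stand-in: the two-layer generator, its random (untrained) weights, and the chosen dimensions are purely illustrative; in any real pushforward DGM, G would be a trained network such as a GAN generator or a VAE/WAE decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not tied to any dataset in the paper).
latent_dim, ambient_dim, hidden_dim = 2, 8, 16

# Toy generator G: a randomly initialized two-layer MLP standing in for
# a trained network mapping the latent space Z to the ambient space X.
W1 = rng.normal(size=(hidden_dim, latent_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(size=(ambient_dim, hidden_dim))
b2 = np.zeros(ambient_dim)

def G(z):
    h = np.tanh(W1 @ z + b1)
    return W2 @ h + b2

def sample(n):
    """Pushforward sampling: draw Z ~ P_Z (here a standard Gaussian)
    and return X = G(Z)."""
    zs = rng.normal(size=(n, latent_dim))
    return np.stack([G(z) for z in zs])

samples = sample(5)
print(samples.shape)  # (5, 8)
```

Since G is continuous, the samples are supported on the image of the latent space under G, which is connected whenever the support of P_Z is; this is the property exploited later to argue that such models cannot represent disconnected supports.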

Intrinsic dimension estimation. If we assume that P* is supported on the closure of a d-dimensional manifold embedded in X for some unknown d < D, a natural question is how to estimate d from the dataset D.² We follow Pope et al. (2021) and use the Levina & Bickel (2004) estimator with the MacKay & Ghahramani (2005) extension, given by:

\hat{d}_k := \left[ \frac{1}{n(k-1)} \sum_{i=1}^{n} \sum_{j=1}^{k-1} \log \frac{T_k(x_i)}{T_j(x_i)} \right]^{-1},    (2)

where T_j(x) is the Euclidean distance from x to its j-th-nearest neighbour in D \ {x}, and k is a hyperparameter specifying the maximum number of nearest neighbours to consider. While other estimators have recently been proposed (Block et al., 2021; Lim et al., 2021; Tempczyk et al., 2022), we stick with (2) throughout this work as it is well-established in the literature. As we will see, popular image datasets exhibit different intrinsic dimensions in separate regions of data space.

¹The disjoint union of d-dimensional manifolds is a d-dimensional manifold (Lee, 2013): the possibility of having different intrinsic dimensions is what separates the union of manifolds hypothesis from the manifold hypothesis.
²Requiring the support to be the closure of a manifold (or a union thereof) rather than the manifold itself is merely a technicality due to the formal definition of support, which is always a closed set. See Appendix A.
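The estimator in (2) can be implemented compactly in NumPy. This is an illustrative sketch rather than the code used for our experiments: the function name is our own, and the dense O(n^2) pairwise-distance computation would be replaced by a kd-tree or similar nearest-neighbour structure at the scale of real image datasets.

```python
import numpy as np

def mle_id_estimate(X, k=5):
    """Levina-Bickel MLE of intrinsic dimension with the MacKay-Ghahramani
    extension: pools log-ratios of nearest-neighbour distances over all
    points and inverts the average, as in (2)."""
    n = len(X)
    # Dense pairwise Euclidean distances; O(n^2) memory, fine for a sketch.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dists.sort(axis=1)
    T = dists[:, 1:k + 1]                        # T_j(x_i), j = 1..k (drop self-distance)
    log_ratios = np.log(T[:, -1:] / T[:, :-1])   # log(T_k(x_i) / T_j(x_i)), j < k
    return 1.0 / (log_ratios.sum() / (n * (k - 1)))

# Sanity check on synthetic data: points drawn uniformly on a 2-dimensional
# plane embedded in R^5 should yield an estimate close to 2.
rng = np.random.default_rng(0)
X = np.zeros((500, 5))
X[:, :2] = rng.uniform(size=(500, 2))
print(mle_id_estimate(X, k=5))
```

Applying such an estimator separately to each connected component, rather than to the pooled dataset, is precisely how varying intrinsic dimension across components can be detected.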

