THE INTRINSIC DIMENSION OF IMAGES AND ITS IMPACT ON LEARNING

Abstract

It is widely believed that natural image data exhibits low-dimensional structure despite the high dimensionality of conventional pixel representations. This idea underlies a common intuition for the remarkable success of deep learning in computer vision. In this work, we apply dimension estimation tools to popular datasets and investigate the role of low-dimensional structure in deep learning. We find that common natural image datasets indeed have very low intrinsic dimension relative to the high number of pixels in the images. Additionally, we find that low-dimensional datasets are easier for neural networks to learn, and models solving these tasks generalize better from training to test data. Along the way, we develop a technique for validating our dimension estimation tools on synthetic data generated by GANs, which allows us to actively manipulate the intrinsic dimension by controlling the image generation process. Code for our experiments may be found here.

1. INTRODUCTION

The idea that real-world data distributions can be described by very few variables underpins machine learning research from manifold learning to dimension reduction (Besold & Spokoiny, 2019; Fodor, 2002). The number of variables needed to describe a data distribution is known as its intrinsic dimension (ID). In applications such as crystallography, computer graphics, and ecology, practitioners depend on data having low intrinsic dimension (Valle & Oganov, 2010; Desbrun et al., 2002; Laughlin, 2014). The utility of low-dimensional representations has motivated a variety of deep learning techniques, including autoencoders and regularization methods (Hinton & Salakhutdinov, 2006; Vincent et al., 2010; Gonzalez & Balajewicz, 2018; Zhu et al., 2018).

It is also known that dimensionality plays a strong role in learning function approximations and non-linear class boundaries. The exponential cost of learning in high dimensions is easily captured by the trivial case of sampling a function on a cube; in d dimensions, sampling only the cube vertices would require 2^d measurements. Similar behaviors emerge in learning theory. It is known that learning a manifold requires a number of samples that grows exponentially with the manifold's intrinsic dimension (Narayanan & Mitter, 2010). Similarly, the number of samples needed to learn a well-conditioned decision boundary between two classes is an exponential function of the intrinsic dimension of the manifold on which the classes lie (Narayanan & Niyogi, 2009). Furthermore, these learning bounds have no dependence on the ambient dimension in which manifold-structured datasets live.

In light of the exponentially large sample complexity of learning high-dimensional functions, the ability of neural networks to learn from image data is remarkable. Networks learn complex decision boundaries from small amounts of image data (often just a few hundred or thousand samples per class).
At the same time, generative adversarial networks (GANs) are able to learn image "manifolds" from merely a few thousand samples. The seemingly low number of samples needed to learn these manifolds strongly suggests that image datasets have extremely low-dimensional structure.

Despite the established role of low-dimensional data in deep learning, little is known about the intrinsic dimension of popular datasets and the impact of dimensionality on the performance of neural networks. Computational methods for estimating intrinsic dimension enable these measurements. We adopt tools from the dimension estimation literature to shed light on dimensionality in settings of interest to the deep learning community. Our contributions can be summarized as follows:

• We verify the reliability of intrinsic dimension estimation on high-dimensional data using generative adversarial networks (GANs), a setting in which we can a priori upper-bound the intrinsic dimension of generated data by the dimension of the latent noise vector.
• We measure the dimensionality of popular datasets such as MNIST, CIFAR-10, and ImageNet. In our experiments, we find that natural image datasets whose images contain thousands of pixels can, in fact, be described by orders of magnitude fewer variables. For example, we estimate that ImageNet, despite containing 224 × 224 × 3 = 150528 pixels per image, only has intrinsic dimension between 26 and 43; see Figure 1.
• We train classifiers on data, synthetic and real, of various intrinsic dimension and find that this variable correlates closely with the number of samples needed for learning. On the other hand, we find that extrinsic dimension, the dimension of the ambient space in which data is embedded, has little impact on generalization.

Together, these results put experimental weight behind the hypothesis that the unintuitively low dimensionality of natural images is being exploited by deep networks, and suggest that a characterization of this structure is an essential building block for a successful theory of deep learning.
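
The upper bound used in the first contribution can be illustrated with a small sketch. It assumes a fixed random two-layer network as a hypothetical stand-in for a trained GAN generator; the names make_generator, sample_images, and local_rank are illustrative and are not taken from the paper's code. Zeroing all but free_dims latent coordinates yields data whose intrinsic dimension is at most free_dims, however large the output is. The finite-difference Jacobian rank is only a local sanity check of that bound, not a dimension estimator.

```python
import numpy as np


def make_generator(latent_dim, out_dim, hidden=64, seed=0):
    """Return a fixed random map R^latent_dim -> R^out_dim.

    Hypothetical stand-in for a GAN generator (assumption, not the paper's model).
    """
    rng = np.random.default_rng(seed)
    w1 = rng.normal(size=(latent_dim, hidden)) / np.sqrt(latent_dim)
    w2 = rng.normal(size=(hidden, out_dim)) / np.sqrt(hidden)

    def generate(z):
        # Works for a single latent vector or a batch of them.
        return np.tanh(z @ w1) @ w2

    return generate


def sample_images(generate, n, latent_dim, free_dims, seed=1):
    """Sample n outputs using only the first free_dims latent coordinates."""
    rng = np.random.default_rng(seed)
    z = np.zeros((n, latent_dim))
    z[:, :free_dims] = rng.normal(size=(n, free_dims))
    return generate(z)


def local_rank(generate, z0, free_dims, eps=1e-5, tol=1e-6):
    """Rank of the central-difference Jacobian w.r.t. the free coordinates."""
    cols = []
    for i in range(free_dims):
        dz = np.zeros_like(z0)
        dz[i] = eps
        cols.append((generate(z0 + dz) - generate(z0 - dz)) / (2 * eps))
    jac = np.stack(cols, axis=1)  # shape: (out_dim, free_dims)
    return int(np.linalg.matrix_rank(jac, tol=tol))


if __name__ == "__main__":
    gen = make_generator(latent_dim=128, out_dim=32 * 32 * 3)
    images = sample_images(gen, n=16, latent_dim=128, free_dims=8)
    print("ambient dimension:", images.shape[1])
    print("local Jacobian rank:", local_rank(gen, np.zeros(128), free_dims=8))
```

The Jacobian has only free_dims columns, so its rank cannot exceed free_dims; this is the same argument that bounds the intrinsic dimension of generated data by the number of latent coordinates allowed to vary.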

2. RELATED WORK

While the hypothesis that natural images lie on or near a low-dimensional manifold is controversial, Goodfellow et al. (2016) argue that the low-dimensional manifold assumption is at least approximately correct for images, supported by two observations. First, natural images are locally connected, with each image surrounded by other highly similar images reachable through image transformations (e.g., contrast, brightness). Second, natural images seem to lie on a low-dimensional structure, as the probability distribution of images is highly concentrated; uniformly sampled pixels can hardly assemble a meaningful image. It is widely believed that the combination of natural scenes and sensor properties yields very sparse and concentrated image distributions, as has been supported by several empirical studies on image patches (Lee et al., 2003; Donoho & Grimes, 2005; Carlsson et al., 2008). This observation motivated work on efficient coding (Olshausen & Field, 1996) and served as a prior in computer vision (Peyré, 2009). Further, rigorous experiments have been conducted clearly supporting the low-dimensional manifold hypothesis for many image datasets (Ruderman, 1994; Schölkopf et al., 1998; Roweis & Saul, 2000; Tenenbaum et al., 2000; Brand, 2003); see also Fefferman et al. (2016) for principled algorithms on verifying the manifold hypothesis.

The generalization literature seeks to understand why some models generalize better from training data to test data than others. One line of work suggests that the loss landscape geometry explains why neural networks generalize well (Huang et al., 2019). Other generalization work predicts that data with low dimension, along with other properties which do not include extrinsic dimension,



Figure 1: Estimates of the intrinsic dimension of commonly used datasets obtained using the MLE method with k = 3, 5, 10, 20 nearest neighbors (left to right). The trends are consistent using different k's.
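
For readers unfamiliar with nearest-neighbor dimension estimation, the following is a minimal sketch of one such estimator. It assumes that "the MLE method" in the caption refers to the maximum-likelihood estimator of Levina and Bickel (2004), with local estimates aggregated by averaging their inverses as suggested by MacKay and Ghahramani; this section does not confirm that choice, so treat it as an assumption. The function name mle_intrinsic_dimension and the brute-force distance computation are illustrative only.

```python
import numpy as np


def mle_intrinsic_dimension(x, k):
    """Nearest-neighbor MLE of intrinsic dimension (Levina-Bickel style sketch).

    x: array of shape (n_samples, ambient_dim), assumed to contain no duplicates.
    k: number of nearest neighbors, with 2 <= k < n_samples.
    """
    x = np.asarray(x, dtype=np.float64)
    n = x.shape[0]
    if k < 2 or k >= n:
        raise ValueError("k must satisfy 2 <= k < n_samples")

    # Brute-force squared Euclidean distances via the Gram matrix.
    sq_norms = np.sum(x * x, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    d2 = np.maximum(d2, 0.0)          # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)      # exclude each point from its own neighbors

    # Distances T_1 <= ... <= T_k from every point to its k nearest neighbors.
    t = np.sqrt(np.sort(d2, axis=1)[:, :k])

    # Inverse local estimate: (1 / (k - 1)) * sum_{j<k} log(T_k / T_j).
    inv_local = np.mean(np.log(t[:, -1:] / t[:, :-1]), axis=1)

    # Aggregate by averaging the inverses, then invert once.
    return float(1.0 / np.mean(inv_local))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(2000, 5))                   # 5 underlying variables
    basis, _ = np.linalg.qr(rng.normal(size=(100, 5)))    # isometric embedding
    data = latent @ basis.T                               # 100 ambient dimensions
    for k in (3, 5, 10, 20):
        print(k, mle_intrinsic_dimension(data, k))
```

The demo embeds 5-dimensional Gaussian samples isometrically into 100 dimensions, so the estimate should track the number of underlying variables rather than the ambient dimension. The quadratic-memory distance matrix is a simplification; larger datasets would need an approximate or chunked neighbor search.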

