MODEL-CENTRIC DATA MANIFOLD: THE DATA THROUGH THE EYES OF THE MODEL

Abstract

We discover that deep ReLU neural network classifiers can see a low-dimensional Riemannian manifold structure on data. Such structure comes via the local data matrix, a variation of the Fisher information matrix, where the role of the model parameters is taken by the data variables. We obtain a foliation of the data domain, and we show that the dataset on which the model is trained lies on a leaf, the data leaf, whose dimension is bounded by the number of classification labels. We validate our results with experiments on the MNIST dataset: paths on the data leaf connect valid images, while other leaves cover noisy images.

1. INTRODUCTION

In machine learning, models are categorized as discriminative or generative. From its inception, deep learning has focused on classification and discriminative models (Krizhevsky et al., 2012; Hinton et al., 2012; Collobert et al., 2011). Another perspective came with the construction of generative models based on neural networks (Kingma & Welling, 2014; Goodfellow et al., 2014; Van den Oord et al., 2016; Kingma & Dhariwal, 2018). Both kinds of models give us information about the data and the similarity between examples. In particular, generative models introduce a geometric structure on generated data: such models transform a random low-dimensional vector into an example sampled from a probability distribution approximating that of the training dataset. As proved by Arjovsky & Bottou (2017), generated data lie on a countable union of manifolds. This fact supports the human intuition that data have a low-dimensional manifold structure, but in generative models the dimension of such a manifold is usually a hyperparameter fixed by the experimenter. A recent algorithm by Peebles et al. (2020) provides a way to approximate the dimension of the data manifold by deactivating irrelevant dimensions in a GAN.

In a similar spirit, here we try to understand whether a discriminative model can be used to detect a manifold structure on the space containing the data and to provide tools to navigate this manifold. The implicit definition of such a manifold, together with the possibility of tracing paths between points on it, opens up many possible applications. In particular, we could use paths to define a system of coordinates on the manifold (more precisely, on a chart of the manifold). Such coordinates would immediately give us a low-dimensional parametrization of our data, allowing us to perform dimensionality reduction.

In supervised learning, a model is trained on a labeled dataset to identify the correct label on unseen data.
A trained neural network classifier builds a hierarchy of representations that encodes increasingly complex features of the input data (Olah et al., 2017). Through the representation function, a distance (e.g., Euclidean or cosine) on the representation space of a layer endows the input data with a distance. This pyramid of distances on examples is increasingly class-aware: the deeper the layer, the better its metric reflects the similarity of data according to the task at hand. This observation suggests that the model is implicitly organizing the data according to a suitable structure. Unfortunately, these intermediate representations and metrics are insufficient to understand the geometric structure of the data. First of all, representation functions are not invertible, so we cannot recover the original example from its intermediate representation or interpolate between data points. Moreover, the domain of the representation functions is the entire data domain R^n. This domain is mostly composed of meaningless noise, and data occupy only a thin region inside it. So, even though representation functions provide us with a distance, these metrics are incapable of distinguishing between meaningful data and noise.

We find that a ReLU neural network implicitly identifies a low-dimensional submanifold of the data domain that contains the real data. We prove that if the activation function is piecewise-linear (e.g., ReLU), the neural network decomposes the data domain R^n into a disjoint union of submanifolds (the leaves of a foliation, in the terminology of differential geometry). The dimension of every submanifold (every leaf of the foliation) is bounded by the number of classes of our classification model, so it is much smaller than n, the dimension of the data domain. Our main theoretical result, Theorem 3.1, stems from the study of the properties of a variant of the Fisher information matrix, the local data matrix.
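To make this object concrete, the following sketch builds a local data matrix for a toy ReLU classifier with random weights. It assumes, following the description of G(x, w) as a variation of the Fisher information matrix in which the data variables take the role of the parameters, that G(x, w) = sum_k p_k(x, w) grad_x log p_k(x, w) grad_x log p_k(x, w)^T; the network sizes and weights below are illustrative, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, C = 64, 32, 10                       # input dim, hidden dim, classes
W1 = rng.normal(size=(m, n)) / np.sqrt(n)  # toy 2-layer ReLU classifier
b1 = rng.normal(size=m) * 0.1
W2 = rng.normal(size=(C, m)) / np.sqrt(m)
b2 = rng.normal(size=C) * 0.1

x = rng.normal(size=n)
a = W1 @ x + b1
h = np.maximum(a, 0.0)                     # ReLU hidden layer
z = W2 @ h + b2
p = np.exp(z - z.max())
p /= p.sum()                               # softmax probabilities

# Within a linear region of the ReLU network the Jacobian dz/dx is constant.
J = W2 @ ((a > 0)[:, None] * W1)           # (C, n)
# grad_x log p_k = J[k] - sum_j p_j J[j]
grads = J - p @ J                          # (C, n)
G = grads.T @ (p[:, None] * grads)         # (n, n) local data matrix

rank = np.linalg.matrix_rank(G, tol=1e-8)
print(rank, n)                             # rank is bounded by C, far below n
```

Even with random weights the rank is at most C - 1, far below n, because the p-weighted sum of the gradients grad_x log p_k vanishes identically.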
However, Theorem 3.1 cannot tell us which leaves of this foliation are meaningful, i.e., which of the submanifolds composing the foliation admit interesting practical applications. The interpretation of this geometric structure can only come from experiments. We report experiments performed on the MNIST dataset; we focus on MNIST because it is easily interpretable and because small networks suffice to reach high accuracy. Our experiments suggest that all valid data points lie on a single leaf of the foliation, the data leaf. To observe this phenomenon, we take an example from the dataset and try to connect it to another random example by following a path along the leaf containing the starting point. If such a path exists, the destination example belongs to the same leaf of the foliation. Visualizing the intermediate points on these joining paths, we see that the low-dimensional data manifold defined by the model is not the anthropocentric data manifold composed of data meaningful to a human observer. The model-centric data manifold comprises images that do not belong to a precise class. The model needs these transition points to connect points with different labels; at the same time, it understands that such transition points represent an ambiguous digit: at such points, the model assigns a low probability to every class. The experiments also show that moving orthogonally to the data leaf we find noisy images. This means that the other leaves of the foliation contain images with a level of noise that increases with the distance from the data leaf. These noisy images soon become meaningless to the human eye, while the model still classifies them with high confidence. This fact is a consequence of a property of the local data matrix: equation (8) prescribes that the model output does not change if we move in a direction orthogonal to the tangent space of the leaf on which our data point is located.
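A path along a leaf, of the kind used in these experiments, can be sketched by repeatedly projecting the direction toward the target onto the tangent space of the leaf, i.e., onto the span of the eigenvectors of G(x, w) with non-zero eigenvalues. The sketch below is a toy, not the paper's procedure: `local_data_matrix` is a hypothetical stand-in for a routine computing G(x, w) on a trained network, and here it returns a constant low-rank matrix so that the leaf is flat.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 16, 3
B = rng.normal(size=(n, k))

def local_data_matrix(x):
    # Placeholder for G(x, w): a constant PSD matrix of rank k,
    # so every leaf is a flat k-dimensional affine subspace.
    return B @ B.T

def leaf_step(x, target, lr=0.1, eps=1e-8):
    G = local_data_matrix(x)
    eigval, eigvec = np.linalg.eigh(G)
    T = eigvec[:, eigval > eps]        # basis of the tangent space
    v = T @ (T.T @ (target - x))       # project the direction onto the leaf
    return x + lr * v

x0 = rng.normal(size=n)
target = rng.normal(size=n)
x = x0
for _ in range(200):
    x = leaf_step(x, target)
# x approaches the point of the leaf through x0 closest to `target`
```

With a genuine, point-dependent G(x, w), small steps of this kind trace a curve on the leaf; the toy only illustrates the projection logic.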
This remark points us to other possible applications of the model-centric data manifold: we could project a noisy point onto the data leaf to perform denoising, or use the distance from the data leaf to recognize out-of-distribution examples.

The main contributions of the paper are:
1. the definition of the local data matrix G(x, w) at a point x of the data domain for a given model w, and the study of its properties;
2. the proof that the subspace spanned by the eigenvectors of the local data matrix G(x, w) with non-zero eigenvalues can be interpreted as the tangent space of a Riemannian manifold, whose dimension is bounded by the number of classes on which our model is trained;
3. the identification and visualization of the model-centric data manifold through paths, obtained via experiments on MNIST.

Organization of the paper. In Section 2, we review the fundamentals of information geometry from a novel perspective that aims to facilitate the comprehension of the key concepts of the paper. We introduce the local data matrix G(x, w) and summarize its properties in Prop. 2.1. In Section 3, we show that, through the local data matrix and under some mild hypotheses, the data domain foliates as a disjoint union of leaves, all of which are Riemannian submanifolds of R^n with metric given by G(x, w). In Section 4, we provide evidence that our entire dataset lies on one leaf of the foliation and that moving along directions orthogonal to the data leaf amounts to adding noise to the data.
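The orthogonality property underlying these applications can be checked numerically on a toy ReLU network: inside a linear region, a step along a zero-eigenvalue direction of G(x, w) leaves the class probabilities essentially unchanged, while a step along a tangent direction does not. As before, the sizes and random weights are illustrative assumptions, and G(x, w) is taken to be the Fisher-type matrix sum_k p_k grad_x log p_k grad_x log p_k^T.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, C = 64, 32, 10
W1 = rng.normal(size=(m, n)) / np.sqrt(n)
b1 = rng.normal(size=m) * 0.1
W2 = rng.normal(size=(C, m)) / np.sqrt(m)
b2 = rng.normal(size=C) * 0.1

def probs(x):
    h = np.maximum(W1 @ x + b1, 0.0)
    z = W2 @ h + b2
    p = np.exp(z - z.max())
    return p / p.sum()

x = rng.normal(size=n)
p = probs(x)
J = W2 @ ((W1 @ x + b1 > 0)[:, None] * W1)   # Jacobian dz/dx in this region
grads = J - p @ J                            # rows: grad_x log p_k
G = grads.T @ (p[:, None] * grads)           # local data matrix

val, vec = np.linalg.eigh(G)
u_orth = vec[:, 0]       # zero-eigenvalue direction: orthogonal to the leaf
u_tan = vec[:, -1]       # largest-eigenvalue direction: tangent to the leaf

eps = 1e-4               # small step, staying inside one linear region
d_orth = np.abs(probs(x + eps * u_orth) - p).max()
d_tan = np.abs(probs(x + eps * u_tan) - p).max()
print(d_orth, d_tan)     # d_orth is negligible compared to d_tan
```

This mirrors, in miniature, the experimental observation that orthogonal moves change the image (adding noise) without changing the model's confident output.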



Figure 1: Simplified summary of our experiments: images from MNIST are connected by paths on the data leaf, while images on other leaves are noisy.

