CURVED DATA REPRESENTATIONS IN DEEP LEARNING

Abstract

The phenomenal success of deep neural networks inspires many to understand the inner mechanisms of these models. To this end, several works have studied geometric properties such as the intrinsic dimension of latent data representations produced by the layers of the network. In this paper, we investigate the curvature of data manifolds, i.e., the deviation of the manifold from being flat in its principal directions. We find that state-of-the-art trained convolutional neural networks have a characteristic curvature profile along layers: an initial increase, followed by a long plateau, and a final increase. In contrast, untrained networks exhibit qualitatively and quantitatively different curvature profiles. We also show that the curvature gap between the last two layers is strongly correlated with the performance of the network. Further, we find that the intrinsic dimension of latent data along the network layers is not necessarily indicative of curvature. Finally, we evaluate the effect of common regularizers such as weight decay and mixup on curvature, and we find that mixup-based methods flatten intermediate layers, whereas the final layers still feature high curvatures. Our results indicate that relatively flat manifolds which transform into highly-curved manifolds toward the last layers generalize well to unseen data.

1. INTRODUCTION

Real-world data arising from scientific and engineering problems is often high-dimensional and complex. Using such data for downstream tasks may seem hopeless at first glance. Nevertheless, the widely accepted manifold hypothesis (Cayton, 2005), stating that complex high-dimensional data is intrinsically low-dimensional, suggests that not all hope is lost. Indeed, significant efforts in machine learning have been dedicated to developing tools for extracting meaningful low-dimensional features from real-world information (Khalid et al., 2014; Bengio et al., 2013). Particularly successful in several challenging tasks such as classification (Krizhevsky et al., 2017) and recognition (Girshick et al., 2014) are deep learning approaches which manipulate data via nonlinear neural networks. Unfortunately, the inner mechanisms of deep models are largely not well understood.

Motivated by the manifold hypothesis and, more generally, manifold learning (Belkin & Niyogi, 2003), several recent approaches proposed to analyze deep models by their latent representations. Essentially, a manifold is a topological space locally similar to a Euclidean domain at each of its points (Lee, 2013). A key property of a manifold is its intrinsic dimension, defined as the dimension of the related Euclidean domain. Recent studies estimated the intrinsic dimension (ID) along layers of trained neural networks using neighborhood information (Ansuini et al., 2019) and topological data analysis (Birdal et al., 2021). Remarkably, it has been shown that the ID admits a characteristic "hunchback" profile (Ansuini et al., 2019), i.e., it increases in the first layers and then decreases progressively. Moreover, the ID was found to be strongly correlated with the network performance. Still, the intrinsic dimension is only a single measure, providing limited knowledge of the manifold. To consider other properties, the manifold has to be equipped with additional structure.
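For concreteness, the neighborhood-based ID estimation of Ansuini et al. (2019) builds on the TwoNN estimator (Facco et al., 2017), which uses only the ratio of distances to each point's two nearest neighbors. The following is a minimal NumPy sketch of that idea; the brute-force distance computation and the function name are our own simplifications, not the reference implementation.

```python
import numpy as np

def two_nn_id(X):
    """MLE intrinsic-dimension estimate from the ratio of each point's
    second- to first-nearest-neighbor distance (TwoNN-style sketch)."""
    # brute-force pairwise Euclidean distances; mask out self-distances
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    D.sort(axis=1)
    mu = D[:, 1] / D[:, 0]  # ratio of 2nd to 1st neighbor distance
    # under the TwoNN model, mu follows a Pareto law with exponent d
    return len(X) / np.sum(np.log(mu))
```

On data sampled from a d-dimensional manifold, the estimate concentrates around d regardless of the ambient dimension, which is what makes it useful for probing latent representations layer by layer.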
In this work, we focus on Riemannian manifolds, which are differentiable manifolds equipped with an inner product (Lee, 2006). Riemannian manifolds can be described using properties such as angles, distances, and curvatures. For instance, the curvature in two dimensions is the amount by which a surface deviates from being a plane, which is completely flat. Ansuini et al. (2019) conjectured that while the intrinsic dimension decreases with network depth, the underlying manifold is highly curved. Our study confirms the latter conjecture empirically by estimating the principal curvatures of latent representations of popular deep convolutional classification models trained on benchmark datasets.

Previously, curvature estimates were used in the analysis of trained deep models to compare between two neural networks (Yu et al., 2018), and to explore the decision boundary profile of classification models (Kaul & Lall, 2019). However, there has not been an extensive and systematic investigation that characterizes the curvature profile of data representations along layers of deep neural networks, similarly to existing studies on the intrinsic dimension. In this paper, we take a step toward bridging this gap. To estimate principal curvatures per sample, we compute the eigenvalues of the manifold's Hessian, following the algorithm introduced in (Li, 2018). Our evaluation focuses on convolutional neural network (CNN) architectures such as VGG (Simonyan & Zisserman, 2015) and ResNet (He et al., 2016), and on image classification benchmark datasets such as CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). We address the following questions:

• How does curvature vary along the layers of CNNs? Do CNNs learn flat manifolds or, alternatively, highly-curved data representations? How do common regularizers such as weight decay and mixup affect the curvature profile?

• Are curvature estimates of a trained network indicative of its performance? Is there an indicator that generalizes across different architectures and datasets?

• Is there a correlation between curvature and other geometric properties of the manifold, such as the intrinsic dimension? Can we deduce the curvature behavior along layers using dimensionality estimation tools?

Our results show that learned representations span manifolds whose curvature is mostly fixed with relatively small values (on the order of 1e-1), except for the output layer of the network where curvature increases significantly (on the order of 1). Moreover, this curvature profile is shared among several different convolutional architectures when considered as a function of the relative depth of the network. In particular, highly-curved data manifolds at the output layer have been observed in all cases, even in mixup-based models (Zhang et al., 2018), which flatten intermediate manifolds more strongly in comparison to non-mixup-based networks. In contrast, untrained models whose weights are randomly initialized present a different curvature profile, yielding completely flat (i.e., zero curvature) manifolds across the last half of the layers. Further, our analysis suggests that estimates of dimensionality based on principal component analysis or more advanced methods need not reveal the actual characteristics of the curvature profile. Finally, and similarly to indicators based on the intrinsic dimension (Ansuini et al., 2019; Birdal et al., 2021), we have found that the curvature gap between the last two layers of the network predicts its accuracy: smaller gaps are associated with inferior performance, and larger gaps are related to more accurate models.
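The per-sample estimation pipeline described above can be sketched as follows: around a given point, local PCA separates tangent from normal directions, a quadratic function of the tangent coordinates is fitted to the normal coordinates, and the eigenvalues of the fitted Hessian serve as principal curvature estimates. This is a minimal NumPy sketch in the spirit of (Li, 2018), not the exact algorithm of that paper; the function name, the fixed neighborhood size k, and the assumption that the intrinsic dimension d is known in advance are all ours.

```python
import numpy as np

def principal_curvatures(X, idx, d, k=30):
    """Estimate principal curvatures of the data manifold at X[idx]
    via a local quadratic fit; d is the assumed intrinsic dimension."""
    p = X[idx]
    nbrs = np.argsort(np.linalg.norm(X - p, axis=1))[:k]
    Y = X[nbrs] - X[nbrs].mean(axis=0)
    # local PCA: top-d right singular vectors span the tangent space,
    # the remaining directions span the normal space
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    t = Y @ Vt[:d].T   # tangent coordinates
    h = Y @ Vt[d:].T   # normal coordinates ("heights")
    # least-squares fit: h ~ intercept + linear + quadratic terms in t
    pairs = [(i, j) for i in range(d) for j in range(i, d)]
    quad = np.stack([t[:, i] * t[:, j] for i, j in pairs], axis=1)
    A = np.hstack([np.ones((k, 1)), t, quad])
    coef, *_ = np.linalg.lstsq(A, h, rcond=None)
    curvatures = []
    for c in coef[1 + d:].T:  # rebuild one Hessian per normal direction
        H = np.zeros((d, d))
        for (i, j), cij in zip(pairs, c):
            H[i, j] = H[j, i] = 2.0 * cij if i == j else cij
        curvatures.append(np.linalg.eigvalsh(H))
    return np.array(curvatures)  # shape: (ambient_dim - d, d)
```

On a densely sampled unit sphere, for example, the recovered eigenvalues have magnitude close to 1 at every point, matching the sphere's known principal curvatures; applying the same procedure to latent activations yields the per-layer curvature profiles studied in this paper.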

2. RELATED WORK

Geometric approaches commonly appear in learning-related tasks. In what follows, we narrow our discussion to manifold-aware learning and manifold-aware analysis works, and we refer the reader to surveys on geometric learning (Shuman et al., 2013; Bronstein et al., 2017).

Manifold-aware learning. Exploiting the intrinsic structure of data dates back to at least (Belkin & Niyogi, 2004), where the authors utilize the graph Laplacian to approximate the Laplace-Beltrami operator, which in turn allows improving classification tools. More recently, several approaches that use geometric properties of the underlying manifold have been proposed. For instance, the intrinsic dimension (ID) was used to regularize the training of deep models, and it was proven to be effective in comparison to weight decay and dropout regularizers (Zhu et al., 2018), as well as in the context of noisy inputs (Ma et al., 2018b). Another work (Gong et al., 2019) used the low dimension of image manifolds to construct a deep model. Focusing on symmetric manifolds, Jensen et al. (2020) propose a generative Gaussian process model which allows non-Euclidean inference. Similarly, Goldt et al. (2020) suggest a generative model that is amenable to analytic treatment if data is concentrated on a low-dimensional manifold. Other approaches aim for a flat latent manifold by penalizing the metric tensor (Chen et al., 2020), and by incorporating neighborhood penalty terms (Lee et al., 2021). Additional approaches modify neural networks to account for metric information (Hoffer & Ailon, 2015; Karaletsos et al., 2016; Gruffaz et al., 2021). A recent work (Chan et al., 2022) showed that mapping distributions of real data onto multiple nonlinear submanifolds can improve robustness against label noise and data corruptions.
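To illustrate the flat-latent-manifold idea mentioned above: penalizing the metric tensor amounts to driving the pullback metric J^T J of the encoder's Jacobian J toward the identity, so that the map is locally an isometry. The sketch below computes this quantity with a finite-difference Jacobian; it is a toy illustration of the principle, not the method of Chen et al. (2020), and in practice the Jacobian would come from automatic differentiation during training.

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-5):
    """Central finite-difference Jacobian of f at x."""
    cols = []
    for i in range(x.shape[0]):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2.0 * eps))
    return np.stack(cols, axis=1)

def flatness_penalty(f, x):
    """Squared Frobenius deviation of the pullback metric J^T J from
    the identity, i.e. how far f is from a local isometry at x."""
    J = jacobian_fd(f, x)
    G = J.T @ J
    return float(np.sum((G - np.eye(G.shape[0])) ** 2))
```

A map that merely rotates and embeds its input incurs (numerically) zero penalty, while any local stretching or contraction is penalized, which is the sense in which such regularizers encourage flat latent manifolds.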

