DEEP NETWORKS AND THE MULTIPLE MANIFOLD PROBLEM

Abstract

We study the multiple manifold problem, a binary classification task modeled on applications in machine vision, in which a deep fully-connected neural network is trained to separate two low-dimensional submanifolds of the unit sphere. We provide an analysis of the one-dimensional case, proving for a simple manifold configuration that when the network depth L is large relative to certain geometric and statistical properties of the data, the network width n is a sufficiently large polynomial in L, and the number of i.i.d. samples from the manifolds is polynomial in L, randomly-initialized gradient descent rapidly learns to classify the two manifolds perfectly with high probability. Our analysis demonstrates concrete benefits of depth and width in the context of a practically-motivated model problem: the depth acts as a fitting resource, with larger depths corresponding to smoother networks that can more readily separate the class manifolds, and the width acts as a statistical resource, enabling concentration of the randomly-initialized network and its gradients. The argument centers around the "neural tangent kernel" of Jacot et al. and its role in the nonasymptotic analysis of training overparameterized neural networks; to this literature, we contribute essentially optimal rates of concentration for the neural tangent kernel of deep fully-connected ReLU networks, requiring width n ≥ L poly(d_0) to achieve uniform concentration of the initial kernel over a d_0-dimensional submanifold of the unit sphere S^{n_0-1}, and a nonasymptotic framework for establishing generalization of networks trained in the "NTK regime" with structured data. The proof makes heavy use of martingale concentration to optimally treat statistical dependencies across layers of the initial random network. This approach should be of use in establishing similar results for other network architectures.

1. INTRODUCTION

Data in many applications in machine learning and computer vision exhibit low-dimensional structure (Fig. 1a). Although deep neural networks achieve state-of-the-art performance on tasks in these areas, rigorous explanations for their performance remain elusive, in part due to the complex interaction between models, architectures, data, and algorithms in neural network training. There is a need for model problems that capture essential features of applications (such as low dimensionality), but are simple enough to admit rigorous end-to-end performance guarantees. In addition to helping to elucidate the mechanisms by which deep networks succeed, this approach has the potential to clarify the roles of various network properties and how these should reflect the properties of the data. These considerations lead us to formulate the multiple manifold problem (Fig. 1b), a binary classification problem in which the classes are two disjoint submanifolds of the unit sphere S^{n_0-1}, and the classifier is a deep fully-connected ReLU network of depth L and width n trained on N i.i.d. samples from a distribution supported on the manifolds. The goal is to articulate conditions on the network architecture and number of samples under which the learned classifier provably separates the two manifolds, guaranteeing perfect generalization to unseen data. The difficulty of an instance of the multiple manifold problem is controlled by the dimension of the manifolds d_0, their separation ∆, and their curvature κ; this allows us to study how these intrinsic properties of the data constrain the settings of the network's architectural hyperparameters under which the two manifolds can be separated by training with a gradient-based method.
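To make the setup concrete, the following sketch (our own illustration, not from the paper; the function names and the specific circle geometry are chosen for simplicity) generates an instance of the d_0 = 1 problem: two disjoint circles on S^{n_0-1}, with N i.i.d. samples drawn from each and labeled ±1.

```python
import numpy as np

def sample_circle(n0, N, axis, angle, rng):
    """Sample N points from a one-dimensional circle on the unit sphere S^{n0-1}.

    The circle sits at fixed angular distance `angle` from the coordinate
    axis `axis` and traces a curve in the plane of the next two axes.
    """
    t = rng.uniform(0.0, 2.0 * np.pi, size=N)     # intrinsic curve parameter
    X = np.zeros((N, n0))
    X[:, axis] = np.cos(angle)
    X[:, axis + 1] = np.sin(angle) * np.cos(t)
    X[:, axis + 2] = np.sin(angle) * np.sin(t)
    return X                                      # every row has unit norm

rng = np.random.default_rng(0)
n0, N = 50, 200
# Two circles supported on disjoint coordinate blocks; their extrinsic
# distance plays the role of the separation Delta in the text.
X_plus = sample_circle(n0, N, axis=0, angle=0.3, rng=rng)
X_minus = sample_circle(n0, N, axis=3, angle=0.3, rng=rng)
X = np.vstack([X_plus, X_minus])                  # training inputs on the sphere
y = np.concatenate([np.ones(N), -np.ones(N)])     # class labels +1 / -1
```

Any configuration of two disjoint, smooth, low-curvature curves on the sphere would serve equally well; this block-coordinate construction simply makes the separation easy to verify.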
Our main result is an analysis of the one-dimensional case of the multiple manifold problem, which reduces the analysis of the gradient descent dynamics to the construction of a certificate: a demonstration that a certain deterministic integral equation involving the network architecture and the structure of the data admits a solution of small norm. We construct such a certificate for the simple geometry in Fig. 3, guaranteeing generalization in this setting.

Theorem 1 (informal). Let d_0 = 1. Suppose a certificate for M exists. Then if the network depth satisfies L ≥ poly(κ, C_ρ, log(n_0)), the width satisfies n ≥ poly(L, log(L n_0)), and the number of training samples satisfies N ≥ poly(L), randomly-initialized gradient descent on N i.i.d. samples rapidly learns a network that separates the two manifolds with overwhelming probability. The constants C_ρ and κ depend only on the data density and the regularity of the manifolds. In addition, if L is sufficiently large relative to ∆^{-1}, then a certificate exists for the configuration of M shown in Fig. 3.

Theorem 1 gives a provable generalization guarantee for a model classification problem with deep networks on structured data that depends only on the architectural hyperparameters and properties of the data. In addition, it provides an interpretable tradeoff between the architectural settings necessary to separate the two manifolds: the network depth needs to be set according to the intrinsic difficulty of the problem, and the network width needs to grow with the depth. Our analysis gives further insight into the independent roles played by each of these parameters in solving the problem, with the depth acting as a 'fitting resource', making the network's output more regular and easier to change, and the width acting as a 'statistical resource', granting concentration of the network over the random initialization around a well-behaved object that we can analyze.
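The object underlying this concentration phenomenon is the neural tangent kernel Θ(x, x') = ⟨∇_θ f(x), ∇_θ f(x')⟩ evaluated at the random initialization. As a minimal sketch (our own illustration; the He-style initialization scaling and all names are our assumptions, not the paper's precise parameterization), the empirical NTK of a depth-L, width-n fully-connected ReLU network can be computed by manual backpropagation:

```python
import numpy as np

def init_params(n0, n, L, rng):
    """He-style random initialization of a depth-L, width-n fully-connected ReLU net."""
    Ws = [rng.normal(0.0, np.sqrt(2.0 / n0), size=(n, n0))]
    Ws += [rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n)) for _ in range(L - 1)]
    w_out = rng.normal(0.0, 1.0 / np.sqrt(n), size=n)
    return Ws, w_out

def param_gradient(x, Ws, w_out):
    """Flattened gradient of the scalar output f(x) with respect to all weights,
    computed by manual backpropagation through the ReLU layers."""
    hs = [x]                                   # hs[l] is the input to layer l
    for W in Ws:
        hs.append(np.maximum(W @ hs[-1], 0.0))
    grads = [None] * len(Ws)
    delta = w_out * (hs[-1] > 0)               # d f / d (last pre-activation)
    for l in range(len(Ws) - 1, -1, -1):
        grads[l] = np.outer(delta, hs[l]).ravel()        # d f / d W_l
        if l > 0:
            delta = (Ws[l].T @ delta) * (hs[l] > 0)
    grads.append(hs[-1])                       # d f / d w_out
    return np.concatenate(grads)

def empirical_ntk(X, Ws, w_out):
    """Gram matrix Theta[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>."""
    G = np.stack([param_gradient(x, Ws, w_out) for x in X])
    return G @ G.T
```

At large width, Gram matrices of this form concentrate around a deterministic limit kernel; the rates of that concentration over a submanifold of the sphere are among the paper's contributions.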
Moreover, the sample complexity of Theorem 1 is dictated by the intrinsic difficulty of the problem instance, which is set by the geometry of the data. As a consequence, we avoid any dependence of the width of the network on the number of samples, which is common in deep network convergence results in the literature (e.g. (Allen-Zhu et al., 2019b; Du et al., 2019), (Chen et al., 2021, Theorem 3.4)). As is the case in practice, given a fixed architecture, more data does not have a detrimental effect on fitting.¹ Theorem 1 is modular, in the sense that a generalization guarantee is ensured for any geometry for which one can construct a certificate. The key to our approach will be to approximate the gradient



¹ When using data augmentation, for example, the number of samples is effectively infinite yet highly structured, enabling convergence and generalization.



Figure 1: (a) Data in image classification with standard augmentation techniques, as well as other domains in which neural networks are commonly used, lie on low-dimensional class manifolds: in this case, those generated by the action of continuous transformations on images in the training set. Tangent vectors at a point on the manifold corresponding to an application of a rotation or a translation are illustrated in green. The dimension of the manifold is determined by the dimension of the symmetry group, and is typically small. (b) The multiple manifold problem. Our model problem, capturing this low-dimensional structure, is the classification of low-dimensional submanifolds of a sphere S^{n_0-1}. The difficulty of the problem is set by the inter-manifold separation ∆ and the curvature κ. The depth and width of the network required to provably and efficiently reduce the generalization error are set by these parameters.

