FEATURE SELECTION AND LOW TEST ERROR IN SHALLOW LOW-ROTATION RELU NETWORKS

Abstract

This work establishes low test error of gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization scale, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint), using margins as the core analysis technique. The first regime is near initialization, specifically until the weights have moved by O(√m), where m denotes the network width; this is in sharp contrast to the O(1) weight motion allowed by the Neural Tangent Kernel (NTK). Here it is shown that GF and SGD need only a network width and number of samples inversely proportional to the NTK margin, and moreover that GF attains at least the NTK margin itself and in particular escapes bad KKT points of the margin objective, whereas prior work could only establish nondecreasing but arbitrarily small margins. The second regime is the Neural Collapse (NC) setting, where data lie in well-separated groups and the sample complexity scales with the number of groups; here the contribution over prior work is an analysis of the entire GF trajectory from initialization. Lastly, if the inner-layer weights are constrained to change in norm only and cannot rotate, then GF with large widths achieves globally maximal margins, and its sample complexity scales with their inverse; this is in contrast to prior work, which required infinite width and a tricky dual convergence assumption.

1. INTRODUCTION

A key promise of deep learning is automatic feature learning: standard gradient methods are able to adjust network parameters so that lower layers become meaningful feature extractors, which in turn implies low sample complexity. As a running illustrative (albeit technical) example throughout this work, in the 2-sparse parity problem (cf. Figure 1), networks near initialization require d^2/ϵ samples to achieve ϵ test error, whereas powerful optimization techniques are able to learn more compact networks which need only d/ϵ samples (Wei et al., 2018). It is not clear how to establish this improved feature learning ability with a standard gradient-based optimization method; for example, despite the incredible success of the Neural Tangent Kernel (NTK) in proving various training and test error guarantees (Jacot et al., 2018; Du et al., 2018b; Allen-Zhu et al., 2018; Zou et al., 2018; Arora et al., 2019; Li & Liang, 2018; Ji & Telgarsky, 2020b; Oymak & Soltanolkotabi, 2019), ultimately the NTK corresponds to learning with frozen initial random features. The goal of this work is to establish low test error from random initialization in an intermediate regime where parameters of individual nodes do not rotate much, but where their change in norm leads to selection of certain pre-existing features. This perspective suffices to establish the best known sample complexities from random initialization in a variety of scenarios, for instance matching the d^2/ϵ within-kernel sample complexity with a computationally efficient stochastic gradient descent (SGD) method, and the beyond-kernel d/ϵ sample complexity with an inefficient gradient flow (GF) method. The different results are tied together through their analyses, which establish not merely low training error but large margins, a classical approach to low sample complexity within overparameterized models (Bartlett, 1996).
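To make the running 2-sparse parity example concrete, the following is a minimal sketch of sampling from that distribution: inputs are uniform sign vectors in {-1, +1}^d, and the label is the product (parity) of two hidden coordinates. The ±1 encoding, the choice of the two relevant coordinates, and all names are illustrative assumptions, not specifics from the text.

```python
import numpy as np

def sample_two_sparse_parity(n, d, idx=(0, 1), rng=None):
    """Draw n uniform sign vectors in {-1,+1}^d; the label is the parity
    (product) of the two hidden coordinates in `idx`.  The learner does not
    know `idx`; feature learning amounts to discovering these coordinates."""
    rng = np.random.default_rng(rng)
    X = rng.choice([-1.0, 1.0], size=(n, d))
    y = X[:, idx[0]] * X[:, idx[1]]
    return X, y

X, y = sample_two_sparse_parity(n=8, d=10, rng=0)
```

Under this sketch, the d^2/ϵ versus d/ϵ contrast is over how many such (X, y) pairs a method needs before its test error on fresh draws falls below ϵ.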
These results will use standard gradient methods from standard initialization, in contrast to existing work on feature learning, which adjusts the optimization method in some way (Shi et al., 2022; Wei et al., 2018), most commonly by training the inner layer for only one iteration (Daniely & Malach, 2020; Abbe et al., 2022; Barak et al., 2022; Damian et al., 2022), and typically does not beat the within-kernel d^2/ϵ sample complexity on the 2-sparse parity problem (cf. Table 1).

Contributions. There are four high-level contributions of this work. The first two consider networks of reasonable width (e.g., O(d^2) for 2-sparse parity), and are the more tractable of the four. In these results, the network parameters can move up to O(√m), where m is the width of the network; this is in sharp contrast to the NTK, where weights can only move by O(1). The performance of these first two results is measured in terms of the NTK margin γ_ntk, a quantity formally defined in Assumption 1.2. These first two contributions are as follows.

1. Non-trivial margin KKT points. Prior work established that features converge in a strong sense: features and parameters converge to a KKT point of a natural margin objective (cf. Section 1.1; Lyu & Li, 2019; Ji & Telgarsky, 2020a). Those works, however, left open the possibility that the limiting KKT point is arbitrarily bad; instead, Theorem 2.1 guarantees that the limiting GF margin is at least γ_ntk/4096, where γ_ntk is a distribution-dependent constant.

2. Simultaneous low test error and low computational complexity. Replacing GF with SGD in the preceding approach yields a computationally efficient method. Applying the resulting guarantees in Theorem 2.3 to the 2-sparse parity problem gives, as detailed in Table 1, a method which saves a factor of d^8 against prior work with sample complexity d^2/ϵ, and a factor of 1/ϵ in computation against work with sample complexity d^4/ϵ^2. Moreover, Theorem 2.3 guarantees that the first gradient step moves parameters by √m and thereby formally exits the NTK regime.

The second two high-level contributions require intractable widths (e.g., 2^d), but achieve much better global margins γ_gl, which, as detailed in Sections 1.1 and 1.2, were previously only attainable under strong assumptions or unrealistic algorithmic modifications.

3. Neural collapse. Theorem 3.2 establishes low sample complexity in the neural collapse (NC) regime (Papyan et al., 2020), where data are organized in well-separated clusters of common label. By contrast, prior work did not analyze gradient methods from initialization, but instead the relationship between various optimality conditions (Papyan et al., 2020; Yaras et al., 2022; Thrampoulidis et al., 2022). The method of proof is to establish global margin maximization of GF; previously, for any type of data, this was only proved in the literature under strong assumptions and with modified algorithms (Wei et al., 2018; Chizat & Bach, 2020; Lyu et al., 2021).

4. Global margin maximization for rotation-free networks. To investigate what could be possible, Theorem 3.3 establishes global margin maximization with GF under a restriction that the inner weights can only change in norm and cannot rotate; this analysis suffices to achieve d/ϵ sample complexity on 2-sparse parity, as in Table 1, and the low-rotation assumption is backed by preliminary empirical evidence in Figure 2.

As purely technical contributions, this work provides new tools to analyze low-width networks near initialization (cf. Lemmas B.4 and C.4), a new versatile generalization bound technique (cf. Lemma C.5), and a new potential function technique for global margin maximization far from initialization (cf. Lemma B.7 and applications thereof). This introduction concludes with notation and related work; Section 2 collects the KKT point and low-computation guarantees, Section 3 collects the global margin guarantees, Section 4 provides concluding remarks and open problems, and the appendices contain full proofs and additional technical discussion.

1.1 NOTATION

Architecture and initialization. With the exception of Theorem 3.3, the architecture will be a 2-layer ReLU network of the form x ↦ F(x; W) = Σ_j a_j σ(v_j^T x) = a^T σ(V x), where σ(z) = max{0, z} is the ReLU, and where a ∈ R^m and V ∈ R^{m×d} have initialization scale roughly matching pytorch defaults: a ∼ N_m / m^{1/4} (m iid Gaussians with variance 1/√m) and V ∼ N_{m×d} / (d√m)^{1/2} (m×d iid Gaussians with variance 1/(d√m)); in contrast with pytorch, the layers are approximately balanced. These parameters (a, V) will be collected into a tuple W = (a, V).
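The initialization scale just described can be sketched as follows; this is an illustrative reading of the stated variances (entries of a iid with variance 1/√m, entries of V iid with variance 1/(d√m)), with the function names our own, not from the text.

```python
import numpy as np

def init_network(m, d, rng=None):
    """Initialize a 2-layer ReLU network a^T sigma(V x) at the described scale:
    a_j ~ N(0, 1/sqrt(m)) and V_jk ~ N(0, 1/(d*sqrt(m))), so that the two
    layers are approximately balanced."""
    rng = np.random.default_rng(rng)
    a = rng.normal(0.0, 1.0, size=m) / m**0.25           # variance 1/sqrt(m)
    V = rng.normal(0.0, 1.0, size=(m, d)) / (d * np.sqrt(m))**0.5  # variance 1/(d*sqrt(m))
    return a, V

def F(x, a, V):
    # x -> a^T sigma(V x), with sigma(z) = max{0, z} applied entrywise.
    return a @ np.maximum(V @ x, 0.0)
```

A quick way to see the balancing is that E‖a‖² = m · 1/√m = √m while E‖V‖_F² = m·d · 1/(d√m) = √m as well, so both layers carry comparable norm at initialization.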

