NEURAL NETWORKS EFFICIENTLY LEARN LOW-DIMENSIONAL REPRESENTATIONS WITH SGD

Abstract

We study the problem of training a two-layer neural network (NN) of arbitrary width using stochastic gradient descent (SGD) where the input x ∈ R^d is Gaussian and the target y ∈ R follows a multiple-index model, i.e., y = g(⟨u_1, x⟩, . . . , ⟨u_k, x⟩) with a noisy link function g. We prove that the first-layer weights of the NN converge to the k-dimensional principal subspace spanned by the vectors u_1, . . . , u_k of the true model, when online SGD with weight decay is used for training. This phenomenon has several important consequences when k ≪ d. First, by employing uniform convergence on this smaller subspace, we establish a generalization error bound of O(√(kd/T)) after T iterations of SGD, which is independent of the width of the NN. We further demonstrate that SGD-trained ReLU NNs can learn a single-index target of the form y = f(⟨u, x⟩) + ϵ by recovering the principal direction, with a sample complexity linear in d (up to log factors), where f is a monotonic function with at most polynomial growth, and ϵ is the noise. This is in contrast to the known d^Ω(p) sample requirement to learn any degree-p polynomial in the kernel regime, and it shows that NNs trained with SGD can outperform the neural tangent kernel at initialization. Finally, we also provide compressibility guarantees for NNs using the approximate low-rank structure produced by SGD.
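To make the setup concrete, the following is a minimal numpy sketch, not the paper's actual experiment: the dimensions, step size, weight-decay strength, and the choice f = tanh are illustrative assumptions. It runs online SGD with weight decay on the first layer of a two-layer ReLU network whose target is a single-index model y = f(⟨u, x⟩) + ϵ, and tracks how strongly the first-layer rows align with the hidden direction u.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 20, 50                   # input dimension, network width (illustrative)
T, lr, wd = 5000, 0.05, 1e-3    # SGD steps, step size, weight decay (illustrative)

# Single-index teacher: y = f(<u, x>) + noise, with a monotone link f.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
f = np.tanh

# Two-layer ReLU student; only the first layer W is trained in this sketch,
# the second-layer signs a are fixed at initialization.
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def mean_alignment(W):
    # Average |cosine| between each first-layer row and the true direction u.
    rows = W / np.linalg.norm(W, axis=1, keepdims=True)
    return np.abs(rows @ u).mean()

align0 = mean_alignment(W)

for _ in range(T):                       # online SGD: one fresh sample per step
    x = rng.standard_normal(d)
    y = f(u @ x) + 0.01 * rng.standard_normal()
    z = W @ x
    yhat = a @ np.maximum(z, 0.0)        # ReLU network output
    g = (yhat - y) * a * (z > 0)         # gradient of squared loss w.r.t. z
    W -= lr * (np.outer(g, x) + wd * W)  # SGD step with weight decay

align1 = mean_alignment(W)
print(f"mean |cos(row, u)|: init {align0:.3f} -> trained {align1:.3f}")
```

In this one-dimensional case (k = 1), the "principal subspace" is just the span of u, so the alignment statistic directly measures the representation-learning effect the abstract describes: weight decay shrinks the components of each row orthogonal to u, while the gradient signal sustains the component along u.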

1. INTRODUCTION

The task of learning an unknown statistical (teacher) model using data is fundamental in many areas of learning theory. A considerable amount of research has been dedicated to this task, especially when the trained (student) model is a neural network (NN), providing precise and non-asymptotic guarantees in various settings (Zhong et al., 2017; Goldt et al., 2019; Ba et al., 2019; Sarao Mannelli et al., 2020; Zhou et al., 2021; Akiyama & Suzuki, 2021; Abbe et al., 2022; Ba et al., 2022; Damian et al., 2022; Veiga et al., 2022). As evident from these works, explaining the remarkable learning capabilities of NNs requires arguments beyond classical learning theory (Zhang et al., 2021). The connection between NNs and kernel methods has been particularly useful in this pursuit (Jacot et al., 2018; Chizat et al., 2019). In particular, a two-layer NN with randomly initialized and untrained weights is an example of a random features model (Rahimi & Recht, 2007), and regression on the second layer captures several interesting phenomena that NNs exhibit in practice (Louart et al., 2018; Mei & Montanari, 2022), e.g., the cusp in the learning curve. However, NNs also inherit favorable characteristics from the optimization procedure (Ghorbani et al., 2019; Allen-Zhu & Li, 2019; Yehudai & Shamir, 2019; Li et al., 2020; Refinetti et al., 2021), which cannot be captured by associating NNs with regression on random features. Indeed, recent works have established a separation between NNs and kernel methods, relying on the emergence of representation learning as a consequence of gradient-based training (Abbe et al., 2022; Ba et al., 2022; Barak et al., 2022; Damian et al., 2022), which often exhibits a natural bias towards low-complexity models. A theme that has emerged repeatedly in modern learning theory is the implicit regularization effect provided by the training dynamics (Neyshabur et al., 2014). The work by Soudry et al. (2018) has inspired an abundance of recent works focusing on the implicit bias of gradient descent favoring, in some sense, low-complexity models, e.g. by achieving min-norm and/or max-margin solutions

