HOW MUCH OVER-PARAMETERIZATION IS SUFFICIENT TO LEARN DEEP RELU NETWORKS?

Abstract

A recent line of research on deep learning studies the extremely over-parameterized setting, and shows that when the network width is larger than a high-degree polynomial of the training sample size n and the inverse target error ε^{-1}, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees. Very recently, it was shown that under certain margin assumptions on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to converge and generalize (Ji and Telgarsky, 2020). However, whether deep neural networks can be learned with such a mild over-parameterization remained an open question. In this work, we answer this question affirmatively and establish sharper learning guarantees for deep ReLU networks trained by (stochastic) gradient descent. Specifically, under certain assumptions made in previous work, our optimization and generalization guarantees hold with network width polylogarithmic in n and ε^{-1}. Our results push the study of over-parameterized deep neural networks towards more practical settings.

1. INTRODUCTION

Deep neural networks have become one of the most important and prevalent machine learning models due to their remarkable power in many real-world applications. However, the success of deep learning has not been well explained in theory. It remains mysterious why standard optimization algorithms tend to find a globally optimal solution despite the highly non-convex landscape of the training loss function. Moreover, despite the extremely large number of parameters, deep neural networks rarely over-fit, and can often generalize well to unseen data and achieve good test accuracy. Understanding these mysterious phenomena regarding the optimization and generalization of deep neural networks is one of the most fundamental problems in deep learning theory.

Recent breakthroughs have shed light on the optimization and generalization of deep neural networks (DNNs) in the over-parameterized setting, where the hidden layer width is extremely large (much larger than the number of training examples). It has been shown that with standard random initialization, the training of over-parameterized deep neural networks can be characterized by a kernel function called the neural tangent kernel (NTK) (Jacot et al., 2018; Arora et al., 2019b). In the neural tangent kernel regime (or lazy training regime (Chizat et al., 2019)), the neural network function behaves similarly to its first-order Taylor expansion at initialization (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019b; Cao and Gu, 2019), which enables feasible optimization and generalization analysis. In terms of optimization, a line of work (Du et al., 2019b; Allen-Zhu et al., 2019b; Zou et al., 2019; Zou and Gu, 2019) proved that for sufficiently wide neural networks, (stochastic) gradient descent (GD/SGD) can successfully find a global optimum of the training loss function. In terms of generalization, Allen-Zhu et al. (2019a); Arora et al. (2019a); Cao and Gu (2019) established generalization bounds for neural networks trained with (stochastic) gradient descent, and showed that neural networks can learn target functions in a certain reproducing kernel Hilbert space (RKHS) or the corresponding random feature function class. Although existing results in the neural tangent kernel regime have provided important insights into the learning of deep neural networks, they require the neural network to be extremely wide.
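To make the lazy-training picture concrete, the following minimal NumPy sketch (our illustration, not the paper's construction) compares a two-layer ReLU network under the standard 1/√m NTK scaling to its first-order Taylor expansion at initialization; for a perturbation of fixed Frobenius norm, the linearization error shrinks as the width m grows:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net(W, a, x):
    """Two-layer ReLU net f_W(x) = (1/sqrt(m)) a^T ReLU(W x), NTK scaling."""
    return a @ relu(W @ x) / np.sqrt(W.shape[0])

def linearized(W, W0, a, x):
    """First-order Taylor expansion at W0: f(W0) + <grad f(W0), W - W0>."""
    m = W0.shape[0]
    grad = (a * (W0 @ x > 0))[:, None] * x[None, :] / np.sqrt(m)
    return net(W0, a, x) + np.sum(grad * (W - W0))

rng = np.random.default_rng(0)
d, tau = 10, 0.1
x = rng.normal(size=d)
x /= np.linalg.norm(x)

errs = {}
for m in (100, 10000):
    W0 = rng.normal(size=(m, d))                         # standard Gaussian init
    a = rng.choice([-1.0, 1.0], size=m)                  # fixed second layer
    W = W0 + tau * rng.normal(size=(m, d)) / np.sqrt(m)  # perturbation of constant Frobenius norm
    errs[m] = abs(net(W, a, x) - linearized(W, W0, a, x))

print(errs)  # linearization error, typically much smaller at larger width m
```

The error comes only from neurons whose activation pattern flips under the perturbation, and the fraction of such neurons vanishes as m grows; this is the sense in which very wide networks stay close to a linear model during training.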
The typical requirement on the network width is a high-degree polynomial of the training sample size n and the inverse target error ε^{-1}. As there remains a huge gap between such width requirements and practice, many attempts have been made to improve the over-parameterization condition under various assumptions on the training data and model initialization (Oymak and Soltanolkotabi, 2019; Zou and Gu, 2019; Kawaguchi and Huang, 2019; Bai and Lee, 2019). For two-layer ReLU networks, a recent work (Ji and Telgarsky, 2020) showed that when the training data are well separated, polylogarithmic width is sufficient to guarantee good optimization and generalization performance. However, their results cannot be extended to deep ReLU networks, since their proof technique relies heavily on the fact that the network model is 1-homogeneous, which is not satisfied by DNNs. Therefore, whether deep neural networks can be learned with such a mild over-parameterization has remained an open problem.

In this paper, we resolve this open problem by showing that polylogarithmic network width is sufficient to learn DNNs. In particular, unlike existing works that require the DNNs to behave very closely to a linear model (up to some small approximation error), we show that a constant linear approximation error is sufficient to establish nice optimization and generalization guarantees for DNNs. Thanks to the relaxed requirement on the linear approximation error, a milder condition on the network width and tighter bounds on the convergence rate and generalization error can be proved. We summarize our contributions as follows:

• We establish a global convergence guarantee of GD for training deep ReLU networks based on the so-called NTRF function class (Cao and Gu, 2019), a set of linear functions over random features.
Specifically, we prove that GD can learn deep ReLU networks with width m = poly(R) to compete with the best function in the NTRF function class, where R is the radius of the NTRF function class.

• We also establish generalization guarantees for both GD and SGD in the same setting. Specifically, we prove a diminishing statistical error for a wide range of network widths m ∈ (Ω̃(1), ∞), while most previous generalization bounds in the NTK regime only hold when the network width m is much greater than the sample size n. Moreover, we establish Õ(ε^{-2}) and Õ(ε^{-1}) sample complexities for GD and SGD respectively, which are tighter than existing bounds for learning deep ReLU networks (Cao and Gu, 2019), and match the best results when reduced to the two-layer case (Arora et al., 2019b; Ji and Telgarsky, 2020).

• We further generalize our theoretical analysis to scenarios with different data separability assumptions in the literature. We show that if a large fraction of the training data are well separated, the best function in the NTRF function class with radius R = Õ(1) can learn the training data with error up to ε. Together with our optimization and generalization guarantees, this immediately suggests that deep ReLU networks can be learned with network width m = Ω̃(1), which depends only polylogarithmically on the inverse target error ε^{-1} and the sample size n. Compared with existing results (Cao and Gu, 2020; Ji and Telgarsky, 2020), which require all training data points to be separated in the NTK regime, our result is stronger since it allows the NTRF function class to misclassify a small proportion of the training data.

For ease of comparison, we summarize our results along with the most related previous results in Table 1, in terms of the data assumption, the over-parameterization condition, and the sample complexity.
It can be seen that under the data separation assumptions (see Sections 4.1 and 4.2), our result improves existing results for learning deep neural networks by requiring only a polylog(n, ε^{-1}) network width.

Notation. For two scalars a and b, we denote a ∧ b = min{a, b}. For a vector x ∈ R^d, we use ‖x‖_2 to denote its Euclidean norm. For a matrix X, we use ‖X‖_2 and ‖X‖_F to denote its spectral norm and Frobenius norm respectively, and denote by X_ij the entry of X in the i-th row and j-th column. Given two matrices X and Y of the same dimension, we denote ⟨X, Y⟩ = ∑_{i,j} X_ij Y_ij. Given a collection of matrices W = {W_1, …, W_L} ∈ ⊗_{l=1}^L R^{m_l × m'_l} and a function f(W) over ⊗_{l=1}^L R^{m_l × m'_l}, we denote by ∇_{W_l} f(W) the partial gradient of f(W) with respect to W_l, and write ∇_W f(W) = {∇_{W_l} f(W)}_{l=1}^L. We also denote B(W, τ) = {W' : max_{l ∈ [L]} ‖W'_l − W_l‖_F ≤ τ} for τ ≥ 0. For two collections of matrices A = {A_1, …, A_n} and B = {B_1, …, B_n}, we denote ⟨A, B⟩ = ∑_{i=1}^n ⟨A_i, B_i⟩ and ‖A‖_F^2 = ∑_{i=1}^n ‖A_i‖_F^2.
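As a concrete reading of this notation, the short NumPy sketch below (the function names are ours, for illustration only) implements the collection-level inner product ⟨A, B⟩, the squared norm ‖A‖_F^2, and membership in the ball B(W, τ):

```python
import numpy as np

def collection_inner(A, B):
    """<A, B> = sum_i <A_i, B_i> for two collections of same-shape matrices."""
    return sum(float(np.sum(Ai * Bi)) for Ai, Bi in zip(A, B))

def collection_fro_sq(A):
    """||A||_F^2 = sum_i ||A_i||_F^2, i.e. <A, A>."""
    return collection_inner(A, A)

def in_ball(W_new, W, tau):
    """W_new lies in B(W, tau) iff max_l ||W_new_l - W_l||_F <= tau."""
    return max(np.linalg.norm(Wn - Wl) for Wn, Wl in zip(W_new, W)) <= tau

rng = np.random.default_rng(0)
L = 4
W = [rng.normal(size=(3, 3)) for _ in range(L)]            # a collection W = {W_1, ..., W_L}
W_new = [Wl + 0.01 * rng.normal(size=(3, 3)) for Wl in W]  # small per-layer perturbation

print(in_ball(W_new, W, 0.1))  # True: every layer moves far less than 0.1 in Frobenius norm
```

The ball B(W, τ) measured layer-wise in Frobenius norm is the region in which the trained weights are shown to remain close to their initialization.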



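To illustrate the NTRF function class introduced above concretely: for a two-layer ReLU network, the class consists of linear functions over the gradient features at random initialization. The sketch below is a simplified two-layer instance; the least-squares fit and all names are our illustrative choices, not the paper's algorithm:

```python
import numpy as np

def ntrf_features(W0, a, X):
    """Per-example gradient features grad_W f_{W0}(x), flattened to rows,
    for the two-layer net f_W(x) = (1/sqrt(m)) a^T ReLU(W x)."""
    m = W0.shape[0]
    act = (X @ W0.T > 0).astype(float)                   # n x m activation indicators
    feats = (act * a[None, :])[:, :, None] * X[:, None, :] / np.sqrt(m)
    return feats.reshape(X.shape[0], -1)                 # n x (m * d)

rng = np.random.default_rng(0)
n, d, m = 200, 5, 50
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)            # unit-norm inputs
y = np.sign(X[:, 0] + 0.1)                               # toy labels
W0 = rng.normal(size=(m, d))                             # random initialization
a = rng.choice([-1.0, 1.0], size=m)                      # fixed second layer

Phi = ntrf_features(W0, a, X)
# Best linear function over the random features (least squares as a simple
# proxy for minimizing training loss over the NTRF class).
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
train_acc = float(np.mean(np.sign(Phi @ theta) == y))
```

Competing with the best function of this form with radius R = Õ(1) is exactly what the paper's width condition m = Ω̃(1) refers to: the wider the network, the richer this random feature class becomes.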

