HOW MUCH OVER-PARAMETERIZATION IS SUFFICIENT TO LEARN DEEP RELU NETWORKS?

Abstract

A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high-degree polynomial of the training sample size $n$ and the inverse target error $\epsilon^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees. Very recently, it was shown that under certain margin assumptions on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to converge and generalize (Ji and Telgarsky, 2020). However, whether deep neural networks can be learned with such a mild over-parameterization is still an open question. In this work, we answer this question affirmatively and establish sharper learning guarantees for deep ReLU networks trained by (stochastic) gradient descent. Specifically, under certain assumptions made in previous work, our optimization and generalization guarantees hold with network width polylogarithmic in $n$ and $\epsilon^{-1}$. Our results push the study of over-parameterized deep neural networks towards more practical settings.

1. INTRODUCTION

Deep neural networks have become one of the most important and prevalent machine learning models due to their remarkable power in many real-world applications. However, the success of deep learning has not been well explained in theory. It remains mysterious why standard optimization algorithms tend to find a globally optimal solution, despite the highly non-convex landscape of the training loss function. Moreover, despite their extremely large number of parameters, deep neural networks rarely over-fit, and can often generalize well to unseen data and achieve good test accuracy. Understanding these phenomena regarding the optimization and generalization of deep neural networks is one of the most fundamental problems in deep learning theory.

Recent breakthroughs have shed light on the optimization and generalization of deep neural networks (DNNs) in the over-parameterized setting, where the hidden layer width is extremely large (much larger than the number of training examples). It has been shown that with standard random initialization, the training of over-parameterized deep neural networks can be characterized by a kernel function called the neural tangent kernel (NTK) (Jacot et al., 2018; Arora et al., 2019b). In the neural tangent kernel regime (or lazy training regime (Chizat et al., 2019)), the neural network function behaves similarly to its first-order Taylor expansion at initialization (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019b; Cao and Gu, 2019), which enables feasible optimization and generalization analysis. In terms of optimization, a line of work (Du et al., 2019b; Allen-Zhu et al., 2019b; Zou et al., 2019; Zou and Gu, 2019) proved that for sufficiently wide neural networks, (stochastic) gradient descent (GD/SGD) can successfully find a global optimum of the training loss function. For generalization, Allen-Zhu et al. (2019a); Arora et al.
(2019a); Cao and Gu (2019) established generalization bounds for neural networks trained with (stochastic) gradient descent, and showed that neural networks can learn target functions in certain reproducing kernel Hilbert spaces (RKHS) or the corresponding random feature function classes.

Although existing results in the neural tangent kernel regime have provided important insights into the learning of deep neural networks, they require the neural network to be extremely wide. The typical requirement on the network width is a high-degree polynomial of the training sample size $n$ and the inverse target error $\epsilon^{-1}$. As there still remains a huge gap between such width requirements and practice, many attempts have been made to improve the over-parameterization condition under various assumptions on the training data and model initialization (Oymak and Soltanolkotabi, 2019; Zou and Gu, 2019; Kawaguchi and Huang, 2019; Bai and Lee, 2019). For two-layer ReLU networks, a recent work (Ji and Telgarsky, 2020) showed that when the training data are well separated, polylogarithmic width is sufficient to guarantee good optimization and generalization performance. However, their results cannot be extended to deep ReLU networks, since their proof technique relies heavily on the fact that the network model is 1-homogeneous, which does not hold for DNNs. Therefore, whether deep neural networks can be learned with such a mild over-parameterization has remained an open problem. In this paper, we resolve this open problem by showing that polylogarithmic network width is sufficient to learn DNNs. In particular, unlike existing works that require the DNN to behave very closely to a linear model (up to some small approximation error), we show that a constant linear approximation error is sufficient to establish nice optimization and generalization guarantees for DNNs.
Thanks to the relaxed requirement on the linear approximation error, a milder condition on the network width and tighter bounds on the convergence rate and generalization error can be proved. We summarize our contributions as follows:

• We establish a global convergence guarantee of GD for training deep ReLU networks based on the so-called NTRF function class (Cao and Gu, 2019), a set of linear functions over random features. Specifically, we prove that GD can learn deep ReLU networks with width $m = \mathrm{poly}(R)$ to compete with the best function in the NTRF function class, where $R$ is the radius of the NTRF function class.

• We also establish generalization guarantees for both GD and SGD in the same setting. Specifically, we prove a diminishing statistical error for a wide range of network widths $m \in (\tilde{\Omega}(1), \infty)$, while most previous generalization bounds in the NTK regime only hold when the network width $m$ is much larger than the sample size $n$. Moreover, we establish $\tilde{O}(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-1})$ sample complexities for GD and SGD respectively, which are tighter than existing bounds for learning deep ReLU networks (Cao and Gu, 2019), and match the best results when reduced to the two-layer case (Arora et al., 2019b; Ji and Telgarsky, 2020).

• We further generalize our theoretical analysis to scenarios with different data separability assumptions in the literature. We show that if a large fraction of the training data are well separated, the best function in the NTRF function class with radius $R = \tilde{O}(1)$ can fit the training data with error up to $\epsilon$. Together with our optimization and generalization guarantees, this immediately implies that deep ReLU networks can be learned with network width $m = \tilde{\Omega}(1)$, which depends only logarithmically on the inverse target error $\epsilon^{-1}$ and the sample size $n$.
Compared with existing results (Cao and Gu, 2020; Ji and Telgarsky, 2020), which require all training data points to be separated in the NTK regime, our result is stronger since it allows the NTRF function class to misclassify a small proportion of the training data. For ease of comparison, we summarize our results along with the most related previous results in Table 1, in terms of the data assumption, the over-parameterization condition, and the sample complexity. It can be seen that under data separation assumptions (see Sections 4.1, 4.2), our result improves existing results for learning deep neural networks by only requiring a $\mathrm{polylog}(n, \epsilon^{-1})$ network width.

Notation. For two scalars $a$ and $b$, we denote $a \wedge b = \min\{a, b\}$. For a vector $\mathbf{x} \in \mathbb{R}^d$ we use $\|\mathbf{x}\|_2$ to denote its Euclidean norm. For a matrix $\mathbf{X}$, we use $\|\mathbf{X}\|_2$ and $\|\mathbf{X}\|_F$ to denote its spectral norm and Frobenius norm respectively, and denote by $X_{ij}$ the entry of $\mathbf{X}$ in the $i$-th row and $j$-th column. Given two matrices $\mathbf{X}$ and $\mathbf{Y}$ of the same dimension, we denote $\langle \mathbf{X}, \mathbf{Y} \rangle = \sum_{i,j} X_{ij} Y_{ij}$. Given a collection of matrices $\mathbf{W} = \{\mathbf{W}_1, \dots, \mathbf{W}_L\} \in \bigotimes_{l=1}^L \mathbb{R}^{m_l \times m'_l}$ and a function $f(\mathbf{W})$ over $\bigotimes_{l=1}^L \mathbb{R}^{m_l \times m'_l}$, we denote by $\nabla_{\mathbf{W}_l} f(\mathbf{W})$ the partial gradient of $f(\mathbf{W})$ with respect to $\mathbf{W}_l$, and denote $\nabla_{\mathbf{W}} f(\mathbf{W}) = \{\nabla_{\mathbf{W}_l} f(\mathbf{W})\}_{l=1}^L$. We also denote $\mathcal{B}(\mathbf{W}, \tau) = \{\mathbf{W}' : \max_{l \in [L]} \|\mathbf{W}'_l - \mathbf{W}_l\|_F \le \tau\}$ for $\tau \ge 0$. For two collections of matrices $\mathbf{A} = \{\mathbf{A}_1, \dots, \mathbf{A}_n\}$ and $\mathbf{B} = \{\mathbf{B}_1, \dots, \mathbf{B}_n\}$, we denote $\langle \mathbf{A}, \mathbf{B} \rangle = \sum_{i=1}^n \langle \mathbf{A}_i, \mathbf{B}_i \rangle$ and $\|\mathbf{A}\|_F^2 = \sum_{i=1}^n \|\mathbf{A}_i\|_F^2$.

Algorithm 1 Gradient descent (GD) with random initialization
Input: number of iterations $T$, step size $\eta$, training set $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, initialization $\mathbf{W}^{(0)}$
for $t = 1, 2, \dots, T$ do
  Update $\mathbf{W}^{(t)} = \mathbf{W}^{(t-1)} - \eta \cdot \nabla_{\mathbf{W}} L_S(\mathbf{W}^{(t-1)})$.
end for
Output: $\mathbf{W}^{(0)}, \dots, \mathbf{W}^{(T)}$.
Given two sequences $\{x_n\}$ and $\{y_n\}$, we write $x_n = O(y_n)$ if $|x_n| \le C_1 |y_n|$ for some absolute positive constant $C_1$, $x_n = \Omega(y_n)$ if $|x_n| \ge C_2 |y_n|$ for some absolute positive constant $C_2$, and $x_n = \Theta(y_n)$ if $C_3 |y_n| \le |x_n| \le C_4 |y_n|$ for some absolute constants $C_3, C_4 > 0$. We use $\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ to hide logarithmic factors in $O(\cdot)$ and $\Omega(\cdot)$ respectively. Additionally, we write $x_n = \mathrm{poly}(y_n)$ if $x_n = O(y_n^D)$ for some positive constant $D$, and $x_n = \mathrm{polylog}(y_n)$ if $x_n = \mathrm{poly}(\log(y_n))$.

2. PRELIMINARIES ON LEARNING NEURAL NETWORKS

In this section, we introduce the problem setting of this paper, including the definitions of the neural network and loss functions, and the training algorithms, i.e., GD and SGD with random initialization.

Neural network function. Given an input $\mathbf{x} \in \mathbb{R}^d$, the output of a deep fully-connected ReLU network is defined as
$$f_{\mathbf{W}}(\mathbf{x}) = m^{1/2} \mathbf{W}_L \sigma(\mathbf{W}_{L-1} \cdots \sigma(\mathbf{W}_1 \mathbf{x}) \cdots),$$
where $\mathbf{W}_1 \in \mathbb{R}^{m \times d}$, $\mathbf{W}_2, \dots, \mathbf{W}_{L-1} \in \mathbb{R}^{m \times m}$, $\mathbf{W}_L \in \mathbb{R}^{1 \times m}$, and $\sigma(x) = \max\{0, x\}$ is the ReLU activation function. Here, without loss of generality, we assume every hidden layer has equal width $m$; our theoretical results can be easily generalized to layers of unequal width, as long as the smallest width satisfies our over-parameterization condition. We denote the collection of all weight matrices as $\mathbf{W} = \{\mathbf{W}_1, \dots, \mathbf{W}_L\}$.
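The forward pass just defined can be sketched in a few lines of NumPy. This is a minimal illustration of the paper's parameterization (entries drawn i.i.d. from $N(0, 1/m)$, output scaled by $m^{1/2}$); the function names are ours, not the authors'.

```python
import numpy as np

def init_weights(d, m, L, rng):
    """Gaussian initialization used in the paper: each entry i.i.d. N(0, 1/m)."""
    shapes = [(m, d)] + [(m, m)] * (L - 2) + [(1, m)]
    return [rng.normal(0.0, np.sqrt(1.0 / m), size=s) for s in shapes]

def forward(W, x):
    """f_W(x) = m^{1/2} * W_L sigma(W_{L-1} ... sigma(W_1 x) ...)."""
    m = W[0].shape[0]
    h = x
    for Wl in W[:-1]:
        h = np.maximum(Wl @ h, 0.0)  # ReLU activation
    return np.sqrt(m) * (W[-1] @ h).item()
```

With this scaling and initialization, the output at a unit-norm input is $O(1)$ at initialization, which is the regime the analysis works in.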

Loss function.

Given a training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1,\dots,n}$ with inputs $\mathbf{x}_i \in \mathbb{R}^d$ and labels $y_i \in \{-1, +1\}$, we define the training loss function as
$$L_S(\mathbf{W}) = \frac{1}{n} \sum_{i=1}^n L_i(\mathbf{W}), \quad \text{where } L_i(\mathbf{W}) = \ell(y_i f_{\mathbf{W}}(\mathbf{x}_i)) = \log\big(1 + \exp(-y_i f_{\mathbf{W}}(\mathbf{x}_i))\big)$$
and $\ell(z) = \log(1 + \exp(-z))$ is the cross-entropy loss.

Algorithms. We consider both GD and SGD with Gaussian random initialization, displayed in Algorithms 1 and 2 respectively. Specifically, the entries of $\mathbf{W}^{(0)}$ are generated independently from $N(0, 1/m)$. GD uses the full gradient to update the model parameters, while SGD uses a fresh training data point in each iteration. Note that our initialization method in Algorithms 1 and 2 is the same as the widely used He initialization (He et al., 2015). Our neural network parameterization is also consistent with the parameterization used in prior work on the NTK (Jacot et al., 2018; Allen-Zhu et al., 2019b; Du et al., 2019a; Arora et al., 2019b; Cao and Gu, 2019).
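Algorithm 1 with this loss can be sketched as follows, specialized to a two-layer network ($L = 2$) with manually derived gradients. This is an illustrative sketch under our own choices of width, step size, and iteration count, not the authors' code.

```python
import numpy as np

def loss(z):
    """Cross-entropy loss ell(z) = log(1 + exp(-z)), in a numerically stable form."""
    return np.logaddexp(0.0, -z)

def train_gd(X, y, m=64, T=500, eta=0.01, seed=0):
    """Full-batch GD (Algorithm 1) on L_S(W) = (1/n) sum_i ell(y_i f_W(x_i)),
    for a two-layer net f_W(x) = sqrt(m) * W2 sigma(W1 x)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))
    W2 = rng.normal(0.0, np.sqrt(1.0 / m), size=(1, m))
    losses = []
    for _ in range(T):
        H = np.maximum(X @ W1.T, 0.0)           # (n, m) hidden activations
        f = np.sqrt(m) * (H @ W2.ravel())       # (n,) network outputs
        z = y * f
        losses.append(loss(z).mean())
        g = -y / (1.0 + np.exp(z)) / n          # ell'(z_i) * y_i / n
        gW2 = np.sqrt(m) * (g @ H)[None, :]     # gradient w.r.t. W2
        gW1 = np.sqrt(m) * ((g[:, None] * (H > 0) * W2.ravel()).T @ X)
        W1 -= eta * gW1
        W2 -= eta * gW2
    return W1, W2, losses
```

On a small linearly separable toy set, the training loss decreases steadily, consistent with the convergence guarantee of Theorem 3.3.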

Algorithm 2 Stochastic gradient descent (SGD) with random initialization

Input: number of iterations $n$, step size $\eta$, initialization $\mathbf{W}^{(0)}$
for $i = 1, 2, \dots, n$ do
  Draw $(\mathbf{x}_i, y_i)$ from $\mathcal{D}$ and compute the corresponding gradient $\nabla_{\mathbf{W}} L_i(\mathbf{W}^{(i-1)})$.
  Update $\mathbf{W}^{(i)} = \mathbf{W}^{(i-1)} - \eta \cdot \nabla_{\mathbf{W}} L_i(\mathbf{W}^{(i-1)})$.
end for
Output: $\widehat{\mathbf{W}}$ chosen uniformly at random from $\{\mathbf{W}^{(0)}, \dots, \mathbf{W}^{(n-1)}\}$.
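Algorithm 2 can be sketched as follows for a two-layer network, where `sample` draws a fresh example from the data distribution $\mathcal{D}$ at every step. The sketch, the distribution used in the usage example, and all hyperparameter values are ours, for illustration only.

```python
import numpy as np

def train_sgd(sample, n_iters=2000, d=2, m=64, eta=0.01, seed=0):
    """Online SGD (Algorithm 2): one fresh example (x_i, y_i) ~ D per iteration,
    two-layer net f_W(x) = sqrt(m) * W2 sigma(W1 x)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))
    W2 = rng.normal(0.0, np.sqrt(1.0 / m), size=(1, m))
    for _ in range(n_iters):
        x, y = sample(rng)
        h = np.maximum(W1 @ x, 0.0)
        z = y * np.sqrt(m) * (W2 @ h).item()
        g = -y / (1.0 + np.exp(z))              # ell'(z) * y
        gW2 = g * np.sqrt(m) * h[None, :]
        gW1 = g * np.sqrt(m) * np.outer(W2.ravel() * (h > 0), x)
        W1 -= eta * gW1
        W2 -= eta * gW2
    return W1, W2
```

For instance, on unit-norm inputs labeled by the sign of the first coordinate, a few thousand online steps already give a small test error on fresh draws.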

3. MAIN THEORY

In this section, we present the optimization and generalization guarantees of GD and SGD for learning deep ReLU networks. We first make the following assumption on the training data points.

Assumption 3.1. All training data points satisfy $\|\mathbf{x}_i\|_2 = 1$, $i = 1, \dots, n$.

This assumption has been widely made in many previous works (Allen-Zhu et al., 2019b;c; Du et al., 2019b;a; Zou et al., 2019) in order to simplify the theoretical analysis; it can be relaxed so that all $\|\mathbf{x}_i\|_2$ are upper and lower bounded by absolute constants. In the following, we give the definition of the Neural Tangent Random Feature (NTRF) class (Cao and Gu, 2019), which characterizes the functions learnable by over-parameterized ReLU networks.

Definition 3.2 (Neural Tangent Random Feature, Cao and Gu (2019)). Let $\mathbf{W}^{(0)}$ be the initialization weights, and let $F_{\mathbf{W}^{(0)}, \mathbf{W}}(\mathbf{x}) = f_{\mathbf{W}^{(0)}}(\mathbf{x}) + \langle \nabla f_{\mathbf{W}^{(0)}}(\mathbf{x}), \mathbf{W} - \mathbf{W}^{(0)} \rangle$ be a function of the input $\mathbf{x}$. The NTRF function class is then defined as
$$\mathcal{F}(\mathbf{W}^{(0)}, R) = \big\{ F_{\mathbf{W}^{(0)}, \mathbf{W}}(\cdot) : \mathbf{W} \in \mathcal{B}(\mathbf{W}^{(0)}, R \cdot m^{-1/2}) \big\}.$$

The NTRF function class consists of linear models over random features defined based on the network gradients at initialization. It therefore captures the key "almost linear" property of wide neural networks in the NTK regime (Lee et al., 2019; Cao and Gu, 2019). In this paper, we use the NTRF function class as a reference class to measure the difficulty of a learning problem. In what follows, we present our main theoretical results regarding the optimization and generalization guarantees for learning deep ReLU networks, studying both GD and SGD with random initialization (presented in Algorithms 1 and 2).
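The NTRF function of Definition 3.2 can be computed explicitly. The sketch below (ours, not the authors') does so for a two-layer network and lets one check numerically that $F_{\mathbf{W}^{(0)},\mathbf{W}}$ stays close to $f_{\mathbf{W}}$ for $\mathbf{W}$ in the ball $\mathcal{B}(\mathbf{W}^{(0)}, R \cdot m^{-1/2})$ when $m$ is large:

```python
import numpy as np

def net_and_grads(W1, W2, x, m):
    """Two-layer f_W(x) = sqrt(m) * W2 sigma(W1 x) and its gradients w.r.t. W."""
    h = np.maximum(W1 @ x, 0.0)
    f = np.sqrt(m) * (W2 @ h).item()
    gW2 = np.sqrt(m) * h[None, :]
    gW1 = np.sqrt(m) * np.outer(W2.ravel() * (h > 0), x)
    return f, gW1, gW2

def ntrf(W0, W, x, m):
    """F_{W0,W}(x) = f_{W0}(x) + <grad f_{W0}(x), W - W0>  (Definition 3.2)."""
    f0, g1, g2 = net_and_grads(W0[0], W0[1], x, m)
    return f0 + np.sum(g1 * (W[0] - W0[0])) + np.sum(g2 * (W[1] - W0[1]))
```

In the test below, each layer of $\mathbf{W}$ is perturbed from $\mathbf{W}^{(0)}$ by Frobenius norm exactly $R m^{-1/2}$ with $R = 1$; the linearized function tracks the true network closely, reflecting the "almost linear" behavior discussed above.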

3.1. GRADIENT DESCENT

The following theorem establishes the optimization guarantee of GD for training deep ReLU networks for binary classification.

Theorem 3.3. For $\delta, R > 0$, let $\epsilon_{\mathrm{NTRF}} = \inf_{F \in \mathcal{F}(\mathbf{W}^{(0)}, R)} n^{-1} \sum_{i=1}^n \ell(y_i F(\mathbf{x}_i))$ be the minimum training loss achievable by functions in $\mathcal{F}(\mathbf{W}^{(0)}, R)$. Then there exists $m^*(\delta, R, L) = \tilde{O}(\mathrm{poly}(R, L) \cdot \log^{4/3}(n/\delta))$ such that if $m \ge m^*(\delta, R, L)$, with probability at least $1 - \delta$ over the initialization, GD with step size $\eta = \Theta(L^{-1} m^{-1})$ can train a neural network to achieve at most $3\epsilon_{\mathrm{NTRF}}$ training loss within $T = O(L^2 R^2 \epsilon_{\mathrm{NTRF}}^{-1})$ iterations.

Theorem 3.3 shows that a deep ReLU network trained by GD can compete with the best function in the NTRF function class $\mathcal{F}(\mathbf{W}^{(0)}, R)$ if the network width is polynomial in $R$ and $L$ and logarithmic in $n$ and $1/\delta$. Moreover, if the NTRF function class with $R = \tilde{O}(1)$ can learn the training data well (i.e., $\epsilon_{\mathrm{NTRF}}$ is smaller than a small target error $\epsilon$), a network width polylogarithmic in $n$ and $\epsilon^{-1}$ suffices to guarantee the global convergence of GD, which directly improves the over-parameterization condition in the most related work (Cao and Gu, 2019). We remark that this assumption on the NTRF function class is easily satisfied when the training data admit certain separability conditions, which we discuss in detail in Section 4. Compared with the results in Ji and Telgarsky (2020), which give similar network width requirements for two-layer networks, our result works for deep networks. Moreover, while Ji and Telgarsky (2020) essentially required all training data to be separable by a function in the NTRF function class with a constant margin, our result does not require such a data separation assumption, and allows the NTRF function class to misclassify a small proportion of the training data points*. We now characterize the generalization performance of neural networks trained by GD.
We denote by $L_{\mathcal{D}}^{0\text{-}1}(\mathbf{W}) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}[\mathbf{1}\{f_{\mathbf{W}}(\mathbf{x}) \cdot y < 0\}]$ the expected 0-1 loss (i.e., expected error) of $f_{\mathbf{W}}(\mathbf{x})$.

Theorem 3.4. Under the same assumptions as in Theorem 3.3, with probability at least $1 - \delta$, the iterates $\mathbf{W}^{(t)}$ of Algorithm 1 satisfy
$$L_{\mathcal{D}}^{0\text{-}1}(\mathbf{W}^{(t)}) \le 2 L_S(\mathbf{W}^{(t)}) + \tilde{O}\bigg( 4^L L^2 R \sqrt{\frac{m}{n}} \wedge \bigg( \frac{L^{3/2} R}{\sqrt{n}} + \frac{L^{11/3} R^{4/3}}{m^{1/6}} \bigg) \bigg) + O\bigg( \sqrt{\frac{\log(1/\delta)}{n}} \bigg)$$
for all $t = 0, \dots, T$.

Theorem 3.4 shows that the test error of the trained neural network can be bounded by its training error plus statistical error terms. Note that the statistical error is the minimum of two terms, $4^L L^2 R \sqrt{m/n}$ and $L^{3/2} R/\sqrt{n} + L^{11/3} R^{4/3}/m^{1/6}$. Depending on the network width $m$, one of the two dominates and diminishes for large $n$: (1) if $m = o(n)$, the statistical error is $4^L L^2 R \sqrt{m/n}$, which diminishes as $n$ increases; and (2) if $m = \Omega(n)$, the statistical error is $L^{3/2} R/\sqrt{n} + L^{11/3} R^{4/3}/m^{1/6}$, which again goes to zero as $n$ increases. Moreover, in this paper we focus specifically on the setting $m = \tilde{O}(1)$, under which Theorem 3.4 gives a statistical error of order $\tilde{O}(n^{-1/2})$. This distinguishes our result from previous generalization bounds for deep networks (Cao and Gu, 2020; 2019), which cannot be applied to the setting $m = \tilde{O}(1)$. We note that for two-layer ReLU networks (i.e., $L = 2$), Ji and Telgarsky (2020) prove a tighter $\tilde{O}(n^{-1/2})$ generalization error bound regardless of the network width $m$, while our result (Theorem 3.4), in the two-layer case, only gives an $\tilde{O}(n^{-1/2})$ generalization error bound when $m = \tilde{O}(1)$ or $m = \tilde{\Omega}(n^3)$. However, different from our proof technique, which essentially uses the approximate linearity of the neural network function, their proof relies heavily on the 1-homogeneity of the neural network, which restricts their theory to the two-layer case.

An interesting research direction is to explore whether an $\tilde{O}(n^{-1/2})$ generalization error bound can also be established for deep networks, regardless of the network width; we leave this as future work.

3.2. STOCHASTIC GRADIENT DESCENT

Here we study the performance of SGD for training deep ReLU networks. The following theorem establishes a generalization error bound for the output of SGD.

Theorem 3.5. For $\delta, R > 0$, let $\epsilon_{\mathrm{NTRF}} = \inf_{F \in \mathcal{F}(\mathbf{W}^{(0)}, R)} n^{-1} \sum_{i=1}^n \ell(y_i F(\mathbf{x}_i))$ be the minimum training loss achievable by functions in $\mathcal{F}(\mathbf{W}^{(0)}, R)$. Then there exists $m^*(\delta, R, L) = \tilde{O}(\mathrm{poly}(R, L) \cdot \log^{4/3}(n/\delta))$ such that if $m \ge m^*(\delta, R, L)$, with probability at least $1 - \delta$, SGD with step size $\eta = \Theta(m^{-1} (L R^2 n^{-1} \epsilon_{\mathrm{NTRF}}^{-1} \wedge L^{-1}))$ achieves
$$\mathbb{E}\big[L_{\mathcal{D}}^{0\text{-}1}(\widehat{\mathbf{W}})\big] \le \frac{8 L^2 R^2}{n} + \frac{8 \log(2/\delta)}{n} + 24 \epsilon_{\mathrm{NTRF}},$$
where the expectation is taken over the uniform draw of $\widehat{\mathbf{W}}$ from $\{\mathbf{W}^{(0)}, \dots, \mathbf{W}^{(n-1)}\}$.

For any $\epsilon > 0$, Theorem 3.5 gives an $\tilde{O}(\epsilon^{-1})$ sample complexity for deep ReLU networks trained with SGD to achieve $O(\epsilon_{\mathrm{NTRF}} + \epsilon)$ test error. Our result extends the result for two-layer networks proved in Ji and Telgarsky (2020) to multi-layer networks. Theorem 3.5 also sharpens the results of Allen-Zhu et al. (2019a) and Cao and Gu (2019) in two aspects: (1) the sample complexity is improved from $n = \tilde{O}(\epsilon^{-2})$ to $n = \tilde{O}(\epsilon^{-1})$; and (2) the over-parameterization condition is improved from $m \ge \mathrm{poly}(\epsilon^{-1})$ to $m = \tilde{\Omega}(1)$.

4. DISCUSSION ON THE NTRF CLASS

Our theoretical results in Section 3 rely on the radius $R$ of the NTRF function class $\mathcal{F}(\mathbf{W}^{(0)}, R)$ and the minimum training loss $\epsilon_{\mathrm{NTRF}}$ achievable by functions in $\mathcal{F}(\mathbf{W}^{(0)}, R)$. Note that a larger $R$ naturally implies a smaller $\epsilon_{\mathrm{NTRF}}$, but also leads to a worse condition on $m$. In this section, for any (arbitrarily small) target error rate $\epsilon > 0$, we discuss various data assumptions studied in the literature under which our results lead to $O(\epsilon)$ training/test errors, and we specify the corresponding network width requirement.

4.1. DATA SEPARABILITY BY NEURAL TANGENT RANDOM FEATURE

In this subsection, we consider the setting where a large fraction of the training data can be linearly separated by the neural tangent random features. The assumption is stated as follows.

Assumption 4.1. There exists a collection of matrices $\mathbf{U}^* = \{\mathbf{U}_1^*, \dots, \mathbf{U}_L^*\}$ satisfying $\sum_{l=1}^L \|\mathbf{U}_l^*\|_F^2 = 1$ such that for at least a $(1 - \rho)$ fraction of the training data we have $y_i \langle \nabla f_{\mathbf{W}^{(0)}}(\mathbf{x}_i), \mathbf{U}^* \rangle \ge m^{1/2} \gamma$, where $\gamma$ is an absolute positive constant† and $\rho \in [0, 1)$.

The following proposition provides an upper bound on $\epsilon_{\mathrm{NTRF}}$ under Assumption 4.1 for a suitable $R$.

Proposition 4.2. Under Assumption 4.1, for any $\epsilon, \delta > 0$, if $R \ge C[\log^{1/2}(n/\delta) + \log(1/\epsilon)]/\gamma$ for some absolute constant $C$, then with probability at least $1 - \delta$,
$$\epsilon_{\mathrm{NTRF}} := \inf_{F \in \mathcal{F}(\mathbf{W}^{(0)}, R)} n^{-1} \sum_{i=1}^n \ell(y_i F(\mathbf{x}_i)) \le \epsilon + \rho \cdot O(R).$$

Proposition 4.2 covers the setting where the NTRF function class is allowed to misclassify some training data, while most existing work typically assumes that all training data can be perfectly separated with a constant margin (i.e., $\rho = 0$) (Ji and Telgarsky, 2020; Shamir, 2020). Our results show that for a sufficiently small misclassification ratio $\rho = O(\epsilon)$, we have $\epsilon_{\mathrm{NTRF}} = \tilde{O}(\epsilon)$ by choosing the radius parameter $R$ logarithmic in $n$, $\delta^{-1}$, and $\epsilon^{-1}$. Substituting this result into Theorems 3.3, 3.4, and 3.5 shows that a neural network of width $m = \mathrm{poly}(L, \log(n/\delta), \log(1/\epsilon))$ suffices to guarantee good optimization and generalization performance for both GD and SGD. Consequently, the bounds on the test error for GD and SGD are $\tilde{O}(n^{-1/2})$ and $\tilde{O}(n^{-1})$ respectively.

4.2. DATA SEPARABILITY BY SHALLOW NEURAL TANGENT MODEL

In this subsection, we study the data separation assumption made in Ji and Telgarsky (2020) and show that our results cover this setting. We first restate the assumption as follows.

Assumption 4.3. There exist $u(\cdot): \mathbb{R}^d \to \mathbb{R}^d$ and $\gamma \ge 0$ such that $\|u(\mathbf{z})\|_2 \le 1$ for all $\mathbf{z} \in \mathbb{R}^d$, and
$$y_i \int_{\mathbb{R}^d} \sigma'(\langle \mathbf{z}, \mathbf{x}_i \rangle) \cdot \langle u(\mathbf{z}), \mathbf{x}_i \rangle \, d\mu_N(\mathbf{z}) \ge \gamma$$
for all $i \in [n]$, where $\mu_N(\cdot)$ denotes the standard normal distribution.

Assumption 4.3 concerns the linear separability of the gradients with respect to the first-layer parameters at random initialization, with the randomness replaced by an integral in the infinite-width limit. Similar assumptions have been studied in (Cao and Gu, 2020; Nitanda and Suzuki, 2019; Frei et al., 2019); the assumptions made in (Cao and Gu, 2020; Frei et al., 2019) use gradients with respect to the second-layer weights instead of the first-layer ones. In the following, we mainly focus on Assumption 4.3, though our result can also be generalized to cover the settings in (Cao and Gu, 2020; Frei et al., 2019). To make a fair comparison, we reduce our results for multi-layer networks to the two-layer setting, in which the network function takes the form $f_{\mathbf{W}}(\mathbf{x}) = m^{1/2} \mathbf{W}_2 \sigma(\mathbf{W}_1 \mathbf{x})$. We then have the following proposition, which states that Assumption 4.3 implies a certain choice of $R = \tilde{O}(1)$ such that the minimum training loss achieved by functions in the NTRF function class $\mathcal{F}(\mathbf{W}^{(0)}, R)$ satisfies $\epsilon_{\mathrm{NTRF}} = O(\epsilon)$, where $\epsilon$ is the target error.

Proposition 4.4. Suppose the training data satisfy Assumption 4.3. For any $\epsilon, \delta > 0$, let $R = C[\log(n/\delta) + \log(1/\epsilon)]/\gamma$ for some large enough absolute constant $C$. If the network width satisfies $m = \Omega(\log(n/\delta)/\gamma^2)$, then with probability at least $1 - \delta$, there exists $F_{\mathbf{W}^{(0)}, \mathbf{W}}(\cdot) \in \mathcal{F}(\mathbf{W}^{(0)}, R)$ such that $\ell(y_i \cdot F_{\mathbf{W}^{(0)}, \mathbf{W}}(\mathbf{x}_i)) \le \epsilon$ for all $i \in [n]$.
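The integral margin in Assumption 4.3 can be estimated by Monte Carlo sampling over $\mathbf{z} \sim N(0, \mathbf{I})$. The sketch below is ours, for illustration; the constant map $u$ used in the test is an assumed toy choice, not from the paper.

```python
import numpy as np

def ntk_margin(x, y, u, n_samples=20000, seed=0):
    """Monte Carlo estimate of y * int sigma'(<z, x>) <u(z), x> dmu_N(z),
    the first-layer NTK margin of Assumption 4.3. u: R^d -> R^d, ||u(z)|| <= 1."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    Z = rng.standard_normal((n_samples, d))
    act = (Z @ x > 0).astype(float)            # sigma'(<z, x>) for ReLU
    proj = np.array([u(z) @ x for z in Z])     # <u(z), x>
    return y * float((act * proj).mean())
```

For example, with $\mathbf{x} = \mathbf{e}_1$ and the constant map $u(\mathbf{z}) = \mathbf{e}_1$, the integrand is $\mathbf{1}\{z_1 > 0\}$, so the margin is $1/2$ for $y = +1$.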
Proposition 4.4 shows that under Assumption 4.3, there exists $F_{\mathbf{W}^{(0)}, \mathbf{W}}(\cdot) \in \mathcal{F}(\mathbf{W}^{(0)}, R)$ with $R = \tilde{O}(1/\gamma)$ such that the cross-entropy loss of $F_{\mathbf{W}^{(0)}, \mathbf{W}}(\cdot)$ at each training data point is bounded by $\epsilon$, which implies $\epsilon_{\mathrm{NTRF}} \le \epsilon$. Moreover, applying Theorem 3.3 with $L = 2$, the condition on the network width becomes $m = \tilde{\Omega}(1/\gamma^8)$‡, which matches the result proved in Ji and Telgarsky (2020). Furthermore, plugging these bounds on $m$ and $\epsilon_{\mathrm{NTRF}}$ into Theorems 3.4 and 3.5, we conclude that the bounds on the test error for GD and SGD are $\tilde{O}(n^{-1/2})$ and $\tilde{O}(n^{-1})$ respectively.

4.3. CLASS-DEPENDENT DATA NONDEGENERATION

In the previous subsections, we showed that under certain data separation conditions, $\epsilon_{\mathrm{NTRF}}$ can be sufficiently small while the corresponding NTRF function class has radius $R = \tilde{O}(1)$, so that neural networks with polylogarithmic width enjoy nice optimization and generalization guarantees. In this part, we consider the following much milder data separability assumption made in Zou et al. (2019).

Assumption 4.5. For all $i \ne i'$, if $y_i \ne y_{i'}$, then $\|\mathbf{x}_i - \mathbf{x}_{i'}\|_2 \ge \phi$ for some absolute constant $\phi$.

In contrast to the conventional data nondegeneration assumptions (i.e., no duplicate data points) made in Allen-Zhu et al. (2019b); Zou and Gu (2019)§, Assumption 4.5 only requires data points from different classes to be separated by a positive distance. We have the following proposition, which shows that Assumption 4.5 also implies the existence of a function in the NTRF function class, with a certain choice of $R$, that achieves $\epsilon$ training error.

Proposition 4.6. Under Assumption 4.5, if $R = \Omega(n^{3/2} \phi^{-1/2} \log(n \delta^{-1} \epsilon^{-1}))$ and $m = \tilde{\Omega}(L^{22} n^{12} \phi^{-4})$, then $\epsilon_{\mathrm{NTRF}} \le \epsilon$ with probability at least $1 - \delta$.

Proposition 4.6 suggests that under Assumption 4.5, in order to guarantee $\epsilon_{\mathrm{NTRF}} \le \epsilon$, the radius of the NTRF function class needs to be $\Omega(n^{3/2})$. Plugging this into Theorems 3.4 and 3.5 leads to vacuous bounds on the test error. This makes sense, since Assumption 4.5 essentially covers the "random label" setting, which is impossible to learn with small generalization error. Nevertheless, we point out that our theoretical analysis leads to a sharper over-parameterization condition than that proved in Zou et al. (2019), i.e., $m = \tilde{\Omega}(n^{14} L^{16} \phi^{-4} + n^{12} L^{16} \phi^{-4} \epsilon^{-1})$, whenever the network depth satisfies $L \le \tilde{O}(n^{1/3} \vee \epsilon^{-1/6})$.
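Assumption 4.5 is straightforward to check on a concrete dataset: compute the smallest distance between examples carrying different labels. The helper below is an illustrative sketch of ours, not part of the paper.

```python
import numpy as np

def min_cross_class_gap(X, y):
    """Smallest Euclidean distance between examples with different labels
    (Assumption 4.5 asks this to be at least some absolute constant phi)."""
    gaps = [np.linalg.norm(X[i] - X[j])
            for i in range(len(y)) for j in range(len(y)) if y[i] != y[j]]
    return min(gaps)
```

Note that, unlike the assumptions of Sections 4.1 and 4.2, this check places no restriction on pairs of examples sharing the same label, which is why it also covers near-degenerate labelings.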

5. PROOF SKETCH OF THE MAIN THEORY

In this section, we introduce a key technical lemma in Section 5.1, based on which we provide a proof sketch of Theorem 3.3. The full proofs of all our results can be found in the appendix.

‡ We have shown in the proof of Theorem 3.3 that $m = \tilde{\Omega}(R^8)$ suffices (see (A.1) for more detail).

§ Specifically, Allen-Zhu et al. (2019b); Zou and Gu (2019) require that any two data points (rather than only data points from different classes) are separated by a positive distance. Zou and Gu (2019) show that this assumption is equivalent to those made in Du et al. (2019b;a), which require the composite kernel matrix to be strictly positive definite.

Published as a conference paper at ICLR 2021

5.1. A KEY TECHNICAL LEMMA

Here we introduce the key technical lemma used in the proof of Theorem 3.3. Our proof is based on the observation that near initialization, the neural network function can be approximated by its first-order Taylor expansion. We first define the linear approximation error in a $\tau$-neighborhood of the initialization:
$$\epsilon_{\mathrm{app}}(\tau) := \sup_{i = 1, \dots, n} \; \sup_{\mathbf{W}', \mathbf{W} \in \mathcal{B}(\mathbf{W}^{(0)}, \tau)} \big| f_{\mathbf{W}'}(\mathbf{x}_i) - f_{\mathbf{W}}(\mathbf{x}_i) - \langle \nabla f_{\mathbf{W}}(\mathbf{x}_i), \mathbf{W}' - \mathbf{W} \rangle \big|.$$
If all iterates of GD stay inside a neighborhood of the initialization with small linear approximation error, we may expect the training of the neural network to resemble the training of the corresponding linear model, where standard optimization techniques apply. Motivated by this, we also define the following gradient upper bound of neural networks around initialization, which is related to the Lipschitz constant of the optimization objective:
$$M(\tau) := \sup_{i = 1, \dots, n} \; \sup_{l = 1, \dots, L} \; \sup_{\mathbf{W} \in \mathcal{B}(\mathbf{W}^{(0)}, \tau)} \| \nabla_{\mathbf{W}_l} f_{\mathbf{W}}(\mathbf{x}_i) \|_F.$$
By definition, we can choose $\mathbf{W}^* \in \mathcal{B}(\mathbf{W}^{(0)}, R m^{-1/2})$ such that $n^{-1} \sum_{i=1}^n \ell(y_i F_{\mathbf{W}^{(0)}, \mathbf{W}^*}(\mathbf{x}_i)) = \epsilon_{\mathrm{NTRF}}$. Then we have the following lemma.

Lemma 5.1. Set $\eta = O(L^{-1} M(\tau)^{-2})$. Suppose that $\mathbf{W}^* \in \mathcal{B}(\mathbf{W}^{(0)}, \tau)$ and $\mathbf{W}^{(t)} \in \mathcal{B}(\mathbf{W}^{(0)}, \tau)$ for all $0 \le t \le t' - 1$. Then it holds that
$$\frac{1}{t'} \sum_{t=0}^{t'-1} L_S(\mathbf{W}^{(t)}) \le \frac{\|\mathbf{W}^{(0)} - \mathbf{W}^*\|_F^2 - \|\mathbf{W}^{(t')} - \mathbf{W}^*\|_F^2 + 2 t' \eta \epsilon_{\mathrm{NTRF}}}{t' \eta \big( 3/2 - 4 \epsilon_{\mathrm{app}}(\tau) \big)}.$$

Lemma 5.1 plays a central role in our proof. Specifically, if $\mathbf{W}^{(t)} \in \mathcal{B}(\mathbf{W}^{(0)}, \tau)$ for all $t \le t'$, then Lemma 5.1 implies that the average training loss is of the same order as $\epsilon_{\mathrm{NTRF}}$ as long as the linear approximation error $\epsilon_{\mathrm{app}}(\tau)$ is bounded by a positive constant. This is in contrast to the proof in Cao and Gu (2019), where $\epsilon_{\mathrm{app}}(\tau)$ appears as an additive term in the upper bound of the training loss, thus requiring $\epsilon_{\mathrm{app}}(\tau) = O(\epsilon_{\mathrm{NTRF}})$ to achieve the same error bound as in Lemma 5.1.
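The supremum defining $\epsilon_{\mathrm{app}}(\tau)$ cannot be computed exactly, but sampling pairs of weights in $\mathcal{B}(\mathbf{W}^{(0)}, \tau)$ gives a Monte Carlo lower bound on it. The sketch below (ours, for a two-layer network, with illustrative widths and sample counts) shows that this estimate is already small at moderate widths:

```python
import numpy as np

def f2(W1, W2, x, m):
    return np.sqrt(m) * (W2 @ np.maximum(W1 @ x, 0.0)).item()

def grads2(W1, W2, x, m):
    h = np.maximum(W1 @ x, 0.0)
    return np.sqrt(m) * np.outer(W2.ravel() * (h > 0), x), np.sqrt(m) * h[None, :]

def approx_error_estimate(m, tau, xs, n_pairs=20, seed=0):
    """Monte Carlo lower bound on eps_app(tau): max over sampled pairs
    W, W' in B(W0, tau) of |f_{W'}(x) - f_W(x) - <grad f_W(x), W' - W>|."""
    rng = np.random.default_rng(seed)
    d = xs[0].shape[0]
    W0 = [rng.normal(0, np.sqrt(1 / m), (m, d)), rng.normal(0, np.sqrt(1 / m), (1, m))]

    def rand_in_ball():
        # random point on the sphere of Frobenius radius tau around W0, per layer
        out = []
        for W0l in W0:
            D = rng.standard_normal(W0l.shape)
            out.append(W0l + D * tau / np.linalg.norm(D))
        return out

    worst = 0.0
    for _ in range(n_pairs):
        Wa, Wb = rand_in_ball(), rand_in_ball()
        for x in xs:
            g1, g2 = grads2(Wa[0], Wa[1], x, m)
            lin = f2(*Wa, x, m) + np.sum(g1 * (Wb[0] - Wa[0])) + np.sum(g2 * (Wb[1] - Wa[1]))
            worst = max(worst, abs(f2(*Wb, x, m) - lin))
    return worst
```

At $\tau = m^{-1/2}$ and $m$ in the thousands, the sampled approximation error is far below the constant threshold (such as $1/8$) that the analysis requires.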
Since we can show that $\epsilon_{\mathrm{app}} = \tilde{O}(m^{-1/6})$ (see Section A.1), this suggests that $m = \tilde{\Omega}(1)$ is sufficient to make the average training loss of the same order as $\epsilon_{\mathrm{NTRF}}$. Compared with the recent results for two-layer networks by Ji and Telgarsky (2020), Lemma 5.1 is proved with different techniques. Specifically, the proof of Ji and Telgarsky (2020) relies on the 1-homogeneity of the ReLU activation function, which limits their analysis to two-layer networks with fixed second-layer weights. In comparison, our proof does not rely on homogeneity; it is based purely on the linear approximation property of neural networks and certain properties of the loss function. Therefore, our proof technique can handle deep networks, and is potentially applicable to non-ReLU activation functions and other network architectures (e.g., convolutional neural networks and residual networks).

5.2. PROOF SKETCH OF THEOREM 3.3

Here we provide a proof sketch of Theorem 3.3. The proof consists of two steps: (i) showing that all $T$ iterates stay close to the initialization, and (ii) bounding the empirical loss achieved by gradient descent. Both steps are based on Lemma 5.1.

Proof sketch of Theorem 3.3. Recall that we choose $\mathbf{W}^* \in \mathcal{B}(\mathbf{W}^{(0)}, R m^{-1/2})$ such that $n^{-1} \sum_{i=1}^n \ell(y_i F_{\mathbf{W}^{(0)}, \mathbf{W}^*}(\mathbf{x}_i)) = \epsilon_{\mathrm{NTRF}}$. We set $\tau = \tilde{O}(L^{1/2} m^{-1/2} R)$, which is chosen slightly larger than $m^{-1/2} R$ since Lemma 5.1 requires the region $\mathcal{B}(\mathbf{W}^{(0)}, \tau)$ to contain both $\mathbf{W}^*$ and $\{\mathbf{W}^{(t)}\}_{t = 0, \dots, t'}$. By Lemmas 4.1 and B.3 in Cao and Gu (2019), we know that $\epsilon_{\mathrm{app}}(\tau) = \tilde{O}(\tau^{4/3} m^{1/2} L^3) = \tilde{O}(R^{4/3} L^{11/3} m^{-1/6})$. Therefore, we can set $m = \tilde{\Omega}(R^8 L^{22})$ to ensure $\epsilon_{\mathrm{app}}(\tau) \le 1/8$.

We first show that all iterates stay inside the region $\mathcal{B}(\mathbf{W}^{(0)}, \tau)$. Since the left-hand side of Lemma 5.1 is strictly positive and $\epsilon_{\mathrm{app}}(\tau) \le 1/8$, we have for all $t \le T$,
$$\|\mathbf{W}^{(0)} - \mathbf{W}^*\|_F^2 - \|\mathbf{W}^{(t)} - \mathbf{W}^*\|_F^2 \ge -2 t \eta \epsilon_{\mathrm{NTRF}},$$
which gives an upper bound on $\|\mathbf{W}^{(t)} - \mathbf{W}^*\|_F$. Then, by the choice of $\eta$ and $T$, the triangle inequality, and a simple induction argument, we obtain
$$\|\mathbf{W}^{(t)} - \mathbf{W}^{(0)}\|_F \le m^{-1/2} R + \sqrt{2 T \eta \epsilon_{\mathrm{NTRF}}} = \tilde{O}(L^{1/2} m^{-1/2} R),$$
which verifies that $\mathbf{W}^{(t)} \in \mathcal{B}(\mathbf{W}^{(0)}, \tau)$ for $t = 0, \dots, T - 1$.

The second step is to show that GD finds a neural network with at most $3 \epsilon_{\mathrm{NTRF}}$ training loss within $T$ iterations. By the bound in Lemma 5.1 with $\epsilon_{\mathrm{app}}(\tau) \le 1/8$, we drop the term $\|\mathbf{W}^{(t')} - \mathbf{W}^*\|_F^2$ and rearrange the inequality to obtain
$$\frac{1}{T} \sum_{t=0}^{T-1} L_S(\mathbf{W}^{(t)}) \le \frac{1}{\eta T} \|\mathbf{W}^{(0)} - \mathbf{W}^*\|_F^2 + 2 \epsilon_{\mathrm{NTRF}}.$$
Since $T$ is large enough, the first term above is smaller than $\epsilon_{\mathrm{NTRF}}$, which implies that the best iterate among $\mathbf{W}^{(0)}, \dots, \mathbf{W}^{(T-1)}$ achieves empirical loss at most $3 \epsilon_{\mathrm{NTRF}}$.

6. CONCLUSION

In this paper, we established global convergence and generalization error bounds of GD and SGD for training deep ReLU networks for the binary classification problem. We showed that a network width condition that is polylogarithmic in the sample size $n$ and the inverse target error $\epsilon^{-1}$ is sufficient to guarantee the learning of deep ReLU networks. Our results resolve an open question raised in Ji and Telgarsky (2020).

Convergence of gradient descent. Inequality (A.2) implies
$$\|\mathbf{W}^{(0)} - \mathbf{W}^*\|_F^2 - \|\mathbf{W}^{(T)} - \mathbf{W}^*\|_F^2 \ge \eta \bigg( \sum_{t=0}^{T-1} L_S(\mathbf{W}^{(t)}) - 2 T \epsilon_{\mathrm{NTRF}} \bigg).$$
Dividing both sides by $\eta T$, we get
$$\frac{1}{T} \sum_{t=0}^{T-1} L_S(\mathbf{W}^{(t)}) \le \frac{\|\mathbf{W}^{(0)} - \mathbf{W}^*\|_F^2}{\eta T} + 2 \epsilon_{\mathrm{NTRF}} \le \frac{L R^2 m^{-1}}{\eta T} + 2 \epsilon_{\mathrm{NTRF}} \le 3 \epsilon_{\mathrm{NTRF}},$$
where the second inequality is by the fact that $\mathbf{W}^* \in \mathcal{B}(\mathbf{W}^{(0)}, R \cdot m^{-1/2})$ and the last inequality is by our choices of $T$ and $\eta$, which ensure $T \eta \ge L R^2 m^{-1} \epsilon_{\mathrm{NTRF}}^{-1}$. Notice that $T = \lceil L R^2 m^{-1} \eta^{-1} \epsilon_{\mathrm{NTRF}}^{-1} \rceil = O(L^2 R^2 \epsilon_{\mathrm{NTRF}}^{-1})$. This completes the proof of the second part, and hence the proof of the theorem.

A.2 PROOF OF THEOREM 3.4

Following Cao and Gu (2020), we first introduce the surrogate loss of the network, which is defined via the derivative of the loss function.

Definition A.2. We define the empirical surrogate error $\mathcal{E}_S(\mathbf{W})$ and the population surrogate error $\mathcal{E}_{\mathcal{D}}(\mathbf{W})$ as follows:
$$\mathcal{E}_S(\mathbf{W}) := -\frac{1}{n} \sum_{i=1}^n \ell'\big[ y_i \cdot f_{\mathbf{W}}(\mathbf{x}_i) \big], \qquad \mathcal{E}_{\mathcal{D}}(\mathbf{W}) := \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \big\{ -\ell'\big[ y \cdot f_{\mathbf{W}}(\mathbf{x}) \big] \big\}.$$

The following lemma gives a uniform-convergence-type result for $\mathcal{E}_S(\mathbf{W})$, utilizing the fact that $-\ell'(\cdot)$ is bounded and Lipschitz continuous.

Lemma A.3. For any $\tilde{R}, \delta > 0$, suppose that $m = \tilde{\Omega}(L^{12} \tilde{R}^2) \cdot [\log(1/\delta)]^{3/2}$. Then with probability at least $1 - \delta$, it holds that
$$|\mathcal{E}_{\mathcal{D}}(\mathbf{W}) - \mathcal{E}_S(\mathbf{W})| \le \tilde{O}\bigg( \min\bigg\{ 4^L L^{3/2} \tilde{R} \sqrt{\frac{m}{n}}, \; \frac{L \tilde{R}}{\sqrt{n}} + \frac{L^3 \tilde{R}^{4/3}}{m^{1/6}} \bigg\} \bigg) + O\bigg( \sqrt{\frac{\log(1/\delta)}{n}} \bigg)$$
for all $\mathbf{W} \in \mathcal{B}(\mathbf{W}^{(0)}, \tilde{R} \cdot m^{-1/2})$.

We are now ready to prove Theorem 3.4, combining the trajectory distance analysis in the proof of Theorem 3.3 with Lemma A.3.

Proof of Theorem 3.4.
With exactly the same proof as Theorem 3.3, by (A.3) and induction we have $W^{(0)}, W^{(1)},\dots,W^{(T)} \in \mathcal{B}(W^{(0)}, \tilde R\cdot m^{-1/2})$ with $\tilde R = O(\sqrt{L}R)$. Therefore by Lemma A.3, we have
$$\big|\mathcal{E}_{\mathcal{D}}(W^{(t)}) - \mathcal{E}_S(W^{(t)})\big| \le \tilde O\bigg(\min\bigg\{4^L L^2 R\sqrt{\frac{m}{n}},\ \frac{L^{3/2}R}{\sqrt{n}} + L^{11/3}R^{4/3}m^{-1/6}\bigg\}\bigg) + O\bigg(\sqrt{\frac{\log(1/\delta)}{n}}\bigg)$$
for all $t = 0,1,\dots,T$. Note that $\mathbf{1}\{z < 0\} \le -2\ell'(z)$. Therefore,
$$L_{\mathcal{D}}^{0\text{-}1}(W^{(t)}) \le 2\mathcal{E}_{\mathcal{D}}(W^{(t)}) \le 2L_S(W^{(t)}) + \tilde O\bigg(\min\bigg\{4^L L^2 R\sqrt{\frac{m}{n}},\ \frac{L^{3/2}R}{\sqrt{n}} + L^{11/3}R^{4/3}m^{-1/6}\bigg\}\bigg) + O\bigg(\sqrt{\frac{\log(1/\delta)}{n}}\bigg)$$
for $t = 0,1,\dots,T$, where the last inequality is by $\mathcal{E}_S(W) \le L_S(W)$, which holds because $-\ell'(z) \le \ell(z)$ for all $z\in\mathbb{R}$. This finishes the proof.

Convergence of online SGD. By (A.4), we have
$$\|W^{(0)}-W^*\|_F^2 - \|W^{(n)}-W^*\|_F^2 \ge \eta\bigg(\sum_{i=1}^n L_i(W^{(i-1)}) - 2n\epsilon_{\mathrm{NTRF}}\bigg).$$
Dividing both sides by $\eta n$ and rearranging terms, we get
$$\frac{1}{n}\sum_{i=1}^n L_i(W^{(i-1)}) \le \frac{\|W^{(0)}-W^*\|_F^2 - \|W^{(n)}-W^*\|_F^2}{\eta n} + 2\epsilon_{\mathrm{NTRF}} \le \frac{L^2R^2}{n} + 3\epsilon_{\mathrm{NTRF}},$$
where the second inequality follows from the facts that $W^*\in\mathcal{B}(W^{(0)}, R\cdot m^{-1/2})$ and $\eta = \Theta\big(m^{-1}\big(LR^2 n^{-1}\epsilon_{\mathrm{NTRF}}^{-1}\wedge L^{-1}\big)\big)$. By Lemma 4.3 in (Ji and Telgarsky, 2020) and the fact that $\mathcal{E}_i(W^{(i-1)}) \le L_i(W^{(i-1)})$, we have
$$\frac{1}{n}\sum_{i=1}^n L_{\mathcal{D}}^{0\text{-}1}(W^{(i-1)}) \le \frac{2}{n}\sum_{i=1}^n \mathcal{E}_{\mathcal{D}}(W^{(i-1)}) \le \frac{8}{n}\sum_{i=1}^n \mathcal{E}_i(W^{(i-1)}) + \frac{8\log(1/\delta)}{n} \le \frac{8L^2R^2}{n} + \frac{8\log(1/\delta)}{n} + 24\epsilon_{\mathrm{NTRF}}.$$
This completes the proof of the second part.

B.1 PROOF OF PROPOSITION 4.2

Proof of Proposition 4.2. By assumption, there exists $U^* = \{U_1,\dots,U_L\}$ with $\sum_{l=1}^L \|U_l\|_F^2 = 1$ such that $y_i\langle\nabla f_{W^{(0)}}(x_i), U^*\rangle \ge m^{1/2}\gamma$ for at least a $1-\rho$ fraction of the training data. By Lemma B.1, for all $i\in[n]$ we have $|f_{W^{(0)}}(x_i)| \le C\sqrt{\log(n/\delta)}$ for some absolute constant $C$. Then for any positive constant $\lambda$, we have for at least a $1-\rho$ portion of the data,
$$y_i\big(f_{W^{(0)}}(x_i) + \langle\nabla f_{W^{(0)}}(x_i), \lambda U^*\rangle\big) \ge m^{1/2}\lambda\gamma - C\sqrt{\log(n/\delta)}.$$
For this fraction of the data, we can set
$$\lambda = \frac{C_1\big[\log^{1/2}(n/\delta) + \log(1/\epsilon)\big]}{m^{1/2}\gamma},$$
where $C_1$ is an absolute constant, and get $m^{1/2}\lambda\gamma - C\sqrt{\log(n/\delta)} \ge \log(1/\epsilon)$. Now we let $W^* = W^{(0)} + \lambda U^*$. By the choice of $R$ in Proposition 4.2, we have $W^* \in \mathcal{B}(W^{(0)}, R\cdot m^{-1/2})$.
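The choice of $\lambda$ above can be sanity-checked by direct arithmetic. The sketch below is our own check; the value of the absolute constant $C$ is hypothetical, and we take $C_1 = C + 1$, which suffices because the slack is $(C_1 - C)\log^{1/2}(n/\delta) + C_1\log(1/\epsilon) \ge \log(1/\epsilon)$ whenever $\epsilon \le 1$.

```python
import math

# Sanity check (ours) of the choice of lambda in the proof above: with
# C1 = C + 1, the prescribed lambda gives
#   m^{1/2} * lambda * gamma - C * sqrt(log(n/delta)) >= log(1/eps)
# for every eps <= 1. C below is a hypothetical value of the constant.
C = 3.0
C1 = C + 1.0
ok = True
for m in (10**2, 10**4, 10**6):
    for gamma in (1e-3, 1e-2, 0.5):
        for n_over_delta in (10.0, 1e3, 1e8):
            for eps in (1e-6, 1e-3, 0.5, 1.0):
                lam = (C1 * (math.sqrt(math.log(n_over_delta))
                             + math.log(1.0 / eps))
                       / (math.sqrt(m) * gamma))
                margin = (math.sqrt(m) * lam * gamma
                          - C * math.sqrt(math.log(n_over_delta)))
                ok = ok and (margin >= math.log(1.0 / eps) - 1e-9)
```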
The above inequality implies that for at least a $1-\rho$ fraction of the data, we have $\ell\big(y_i F_{W^{(0)},W^*}(x_i)\big) \le \epsilon$. For the remaining data, we have
$$y_i\big(f_{W^{(0)}}(x_i) + \langle\nabla f_{W^{(0)}}(x_i), \lambda U^*\rangle\big) \ge -C\sqrt{\log(n/\delta)} - \lambda\big\|\nabla f_{W^{(0)}}(x_i)\big\|_2 \ge -C_1 R$$
for some absolute positive constant $C_1$, where the last inequality follows from the fact that $\|\nabla f_{W^{(0)}}(x_i)\|_2 = \tilde O(m^{1/2})$ (see Lemma A.1 for details). Then, noting that we use the cross-entropy loss, it follows that for this fraction of the training data we have $\ell\big(y_i F_{W^{(0)},W^*}(x_i)\big) \le C_2 R$ for some constant $C_2$. Combining the results for these two fractions of the training data, we conclude that
$$\epsilon_{\mathrm{NTRF}} \le \frac{1}{n}\sum_{i=1}^n \ell\big(y_i F_{W^{(0)},W^*}(x_i)\big) \le (1-\rho)\epsilon + \rho\cdot O(R).$$
This completes the proof.

B.2 PROOF OF PROPOSITION 4.4

Proof of Proposition 4.4. We are going to prove that Assumption 4.3 implies the existence of a good function in the NTRF function class. By Definition 3.2 and the definition of the cross-entropy loss, our goal is to prove that there exists a collection of matrices $W = \{W_1, W_2\}$ satisfying $\max\{\|W_1 - W^{(0)}_1\|_F, \|W_2 - W^{(0)}_2\|_2\} \le R\cdot m^{-1/2}$ such that
$$y_i\cdot\big[f_{W^{(0)}}(x_i) + \langle\nabla_{W_1} f_{W^{(0)}}(x_i), W_1 - W^{(0)}_1\rangle + \langle\nabla_{W_2} f_{W^{(0)}}(x_i), W_2 - W^{(0)}_2\rangle\big] \ge \log(2/\epsilon).$$
We first consider $\nabla_{W_1} f_{W^{(0)}}(x_i)$, whose $j$-th row has the form
$$\big(\nabla_{W_1} f_{W^{(0)}}(x_i)\big)_j = m^{1/2}\cdot w^{(0)}_{2,j}\cdot\sigma'\big(\langle w^{(0)}_{1,j}, x_i\rangle\big)\cdot x_i.$$
Note that $w^{(0)}_{2,j}$ and $w^{(0)}_{1,j}$ are independently generated from $N(0, 1/m)$ and $N(0, 2I/m)$ respectively; thus we have $\mathbb{P}\big(|w^{(0)}_{2,j}| \ge 0.47 m^{-1/2}\big) \ge 1/2$. By Hoeffding's inequality, we know that with probability at least $1-\exp(-m/8)$, there are at least $m/4$ nodes, whose union is denoted by $S$, satisfying $|w^{(0)}_{2,j}| \ge 0.47 m^{-1/2}$. Define $v_j = u(w^{(0)}_{1,j})/w^{(0)}_{2,j}$ if $|w^{(0)}_{2,j}| \ge 0.47 m^{-1/2}$ and $v_j = 0$ otherwise. Then we have
$$\sum_{j=1}^m y_i\cdot w^{(0)}_{2,j}\cdot\langle v_j, x_i\rangle\cdot\sigma'\big(\langle w^{(0)}_{1,j}, x_i\rangle\big) = \sum_{j\in S} y_i\cdot\big\langle u(w^{(0)}_{1,j}), x_i\big\rangle\cdot\sigma'\big(\langle w^{(0)}_{1,j}, x_i\rangle\big) \ge |S|\gamma - \sqrt{2|S|\log(1/\delta')}.$$
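The two probabilistic facts invoked here are easy to check numerically. The sketch below is our own sanity check, with the constants as in the text.

```python
import math
import numpy as np

# (i) For w ~ N(0, 1/m), P(|w| >= 0.47 m^{-1/2}) = P(|Z| >= 0.47) for
#     standard normal Z, which equals erfc(0.47/sqrt(2)) ~ 0.64 >= 1/2.
p = math.erfc(0.47 / math.sqrt(2.0))

# (ii) Hoeffding: among m independent coordinates, the set S of those with
#      |w_j| >= 0.47 m^{-1/2} has |S| >= m/4 except with prob. <= exp(-m/8);
#      we check a single large draw.
m = 20000
rng = np.random.default_rng(0)
w2 = rng.normal(0.0, 1.0 / math.sqrt(m), size=m)
frac_in_S = float(np.mean(np.abs(w2) >= 0.47 / math.sqrt(m)))
```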
Setting $\delta = 2n\delta'$ and applying a union bound, we have with probability at least $1-\delta/2$,
$$\sum_{j=1}^m y_i\cdot w^{(0)}_{2,j}\cdot\langle v_j, x_i\rangle\cdot\sigma'\big(\langle w^{(0)}_{1,j}, x_i\rangle\big) \ge |S|\gamma - \sqrt{2|S|\log(2n/\delta)}.$$
Moreover, note that with probability at least $1-\exp(-m/8)$, we have $|S| \ge m/4$. In Assumption 4.3, since $y_i\in\{\pm1\}$ and $|\sigma'(\cdot)|, \|u(\cdot)\|_2, \|x_i\|_2 \le 1$ for $i = 1,\dots,n$, we see that $\gamma \le 1$. Then if $m \ge 32\log(n/\delta)/\gamma^2$, with probability at least $1-\delta/2-\exp\big(-4\log(n/\delta)/\gamma^2\big) \ge 1-\delta$,
$$\sum_{j=1}^m y_i\cdot w^{(0)}_{2,j}\cdot\langle v_j, x_i\rangle\cdot\sigma'\big(\langle w^{(0)}_{1,j}, x_i\rangle\big) \ge |S|\gamma/2.$$
Let $U = (v_1, v_2,\dots,v_m)^\top/\sqrt{m|S|}$; then
$$y_i\big\langle\nabla_{W_1} f_{W^{(0)}}(x_i), U\big\rangle = \frac{1}{\sqrt{|S|}}\sum_{j=1}^m y_i\cdot w^{(0)}_{2,j}\cdot\langle v_j, x_i\rangle\cdot\sigma'\big(\langle w^{(0)}_{1,j}, x_i\rangle\big) \ge \frac{\sqrt{|S|}\gamma}{2} \ge \frac{m^{1/2}\gamma}{4}.$$

Proof of Proposition 4.6. Recall that we only consider training the last hidden layer weights, i.e., $W_{L-1}$, via gradient flow with the squared hinge loss, and our goal is to prove that gradient flow is able to find an NTRF model within the function class $\mathcal{F}(W^{(0)}, R)$ around the initialization, i.e., achieving $n^{-1}\sum_{i=1}^n \ell\big(y_i F_{W^{(0)},W^*}(x_i)\big) \le \epsilon$. Let $W(t)$ be the weights at time $t$. Gradient flow implies that
$$\frac{d\tilde L_S(W(t))}{dt} = -\big\|\nabla_{W_{L-1}}\tilde L_S(W(t))\big\|_F^2 \le -\frac{Cm\phi}{n^5}\bigg(\sum_{i=1}^n \tilde\ell'\big(y_i F_{W^{(0)},W(t)}(x_i)\big)\bigg)^2 \le -\frac{4Cm\phi\,\tilde L_S(W(t))}{n^3},$$
where the first equality is due to the fact that we only train the last hidden layer, the first inequality is by Lemma B.2, and the second inequality follows from the fact that $\tilde\ell'(\cdot) = -2\sqrt{\tilde\ell(\cdot)}$. Solving the above differential inequality gives
$$\tilde L_S(W(t)) \le \tilde L_S(W(0))\cdot\exp\bigg(-\frac{4Cm\phi t}{n^3}\bigg). \tag{B.3}$$
Then, setting $T = O\big(n^3 m^{-1}\phi^{-1}\cdot\log\big(\tilde L_S(W(0))/\epsilon'\big)\big)$ and $\epsilon' = 1/n$, we have $\tilde L_S(W(T)) \le \epsilon'$. It then follows that $\tilde\ell\big(y_i F_{W^{(0)},W(T)}(x_i)\big) \le 1$, which implies that $y_i F_{W^{(0)},W(T)}(x_i) \ge \log(1/\epsilon)$ and thus $n^{-1}\sum_{i=1}^n \ell\big(y_i F_{W^{(0)},W(T)}(x_i)\big) \le \epsilon$. Therefore, $W(T)$ is exactly the NTRF model we are looking for. The next step is to characterize the distance between $W(T)$ and $W(0)$, in order to bound the quantity $R$.
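The exponential decay in (B.3) can be illustrated in one dimension. The sketch below is our own toy, not the paper's model: for $\tilde\ell(w) = \max\{\lambda - w, 0\}^2$ one has $\tilde\ell' = -2\sqrt{\tilde\ell}$, so gradient flow satisfies $d\tilde\ell/dt = -(\tilde\ell')^2 = -4\tilde\ell$, i.e. $\tilde\ell(w(t)) = \tilde\ell(w(0))e^{-4t}$.

```python
import math

lam = 2.0            # target margin (arbitrary toy value)

def sq_hinge(w):     # squared hinge loss
    return max(lam - w, 0.0) ** 2

def sq_hinge_prime(w):
    return -2.0 * max(lam - w, 0.0)

# The key identity used in the proof: l'(w) = -2 * sqrt(l(w)).
identity_ok = all(
    abs(sq_hinge_prime(w) + 2.0 * math.sqrt(sq_hinge(w))) < 1e-12
    for w in (-1.0, 0.0, 1.0, 1.9, lam, lam + 3.0)
)

# Euler-integrate gradient flow dw/dt = -l'(w) up to t = 1 and compare
# the loss ratio with the closed form exp(-4t).
w, h, steps = 0.0, 1e-4, 10000
l0 = sq_hinge(w)
for _ in range(steps):
    w -= h * sq_hinge_prime(w)
ratio = sq_hinge(w) / l0
```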
Noting that $\|\nabla_{W_{L-1}}\tilde L_S(W(t))\|_F^2 \ge 4Cm\phi\,\tilde L_S(W(t))/n^3$, we have
$$\frac{d\sqrt{\tilde L_S(W(t))}}{dt} = -\frac{\big\|\nabla_{W_{L-1}}\tilde L_S(W(t))\big\|_F^2}{2\sqrt{\tilde L_S(W(t))}} \le -\big\|\nabla_{W_{L-1}}\tilde L_S(W(t))\big\|_F\cdot\frac{C^{1/2}m^{1/2}\phi^{1/2}}{n^{3/2}}.$$
Taking the integral of both sides and rearranging terms, we have
$$\int_0^T \big\|\nabla_{W_{L-1}}\tilde L_S(W(t))\big\|_F\,dt \le \frac{n^{3/2}}{C^{1/2}m^{1/2}\phi^{1/2}}\cdot\bigg(\sqrt{\tilde L_S(W(0))} - \sqrt{\tilde L_S(W(T))}\bigg).$$
Note that the left-hand side of the above inequality is an upper bound on $\|W(t) - W(0)\|_F$; hence for any $t \ge 0$,
$$\|W(t) - W(0)\|_F \le \frac{n^{3/2}}{C^{1/2}m^{1/2}\phi^{1/2}}\cdot\sqrt{\tilde L_S(W(0))} = O\bigg(\frac{n^{3/2}\log\big(n/(\delta\epsilon)\big)}{m^{1/2}\phi^{1/2}}\bigg),$$
where the second equality is by Lemma B.1 and our choice of $\lambda = \log(1/\epsilon) + 1$. This implies that there exists a point $W^*$ within the class $\mathcal{F}(W^{(0)}, R)$ with
$$R = O\bigg(\frac{n^{3/2}\log\big(n/(\delta\epsilon)\big)}{\phi^{1/2}}\bigg)$$
such that $\epsilon_{\mathrm{NTRF}} := n^{-1}\sum_{i=1}^n \ell\big(y_i F_{W^{(0)},W^*}(x_i)\big) \le \epsilon$. Then by Theorem 3.3, and more specifically (A.1), we can compute the minimal required neural network width as
$$m = \tilde\Omega(R^8 L^{22}) = \tilde\Omega\bigg(\frac{L^{22} n^{12}}{\phi^4}\bigg).$$
This completes the proof.
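The path-length argument above (bounding the distance from initialization by the integral of the gradient norm) also has a clean one-dimensional illustration. This is our own toy, continuing the squared-hinge example: the total movement of gradient flow equals $\sqrt{\tilde\ell(w(0))}$ in the limit.

```python
import math

# 1-D illustration (our toy) of the distance bound above: for
# l(w) = max(lam - w, 0)^2, gradient flow moves w by
#   |w(T) - w(0)| = integral of |l'(w(t))| dt <= sqrt(l(w(0))),
# with equality as T -> infinity (here the decay constant 4*C*m*phi/n^3
# of the general argument equals 4).
lam, w0 = 2.0, 0.0
w, h, steps = w0, 1e-4, 100000           # integrate far past convergence
path_len = 0.0
for _ in range(steps):
    step = h * 2.0 * max(lam - w, 0.0)   # -l'(w) * h
    w += step
    path_len += step
movement = abs(w - w0)
limit = math.sqrt(max(lam - w0, 0.0) ** 2)   # sqrt(l(w(0))) = lam - w0
```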

C PROOF OF TECHNICAL LEMMAS

Here we provide the proofs of Lemmas 5.1, A.3 and A.4.

$$\le O\big(LM(\tau)^2\big)\cdot L_{i+1}(W^{(i)}). \tag{C.11}$$
Then plugging (C.10) and (C.11) into (C.9) gives
$$\|W^{(i)}-W^*\|_F^2 - \|W^{(i+1)}-W^*\|_F^2 \ge \big(2-4\epsilon_{\mathrm{app}}(\tau)\big)\eta L_{i+1}(W^{(i)}) - 2\eta\,\ell\big(y_{i+1} F_{W^{(0)},W^*}(x_{i+1})\big) - O\big(\eta^2 LM(\tau)^2\big)L_{i+1}(W^{(i)})$$
$$\ge \Big(\frac{3}{2}-4\epsilon_{\mathrm{app}}(\tau)\Big)\eta L_{i+1}(W^{(i)}) - 2\eta\,\ell\big(y_{i+1} F_{W^{(0)},W^*}(x_{i+1})\big),$$
where the last inequality is by $\eta = O\big(L^{-1}M(\tau)^{-2}\big)$ and merging the third term on the first line into the first term. Taking the telescoping sum over $i = 0,\dots,n'-1$, we obtain
$$\|W^{(0)}-W^*\|_F^2 - \|W^{(n')}-W^*\|_F^2 \ge \Big(\frac{3}{2}-4\epsilon_{\mathrm{app}}(\tau)\Big)\eta\sum_{i=1}^{n'} L_i(W^{(i-1)}) - 2\eta\sum_{i=1}^{n'}\ell\big(y_i F_{W^{(0)},W^*}(x_i)\big)$$
$$\ge \Big(\frac{3}{2}-4\epsilon_{\mathrm{app}}(\tau)\Big)\eta\sum_{i=1}^{n'} L_i(W^{(i-1)}) - 2\eta\sum_{i=1}^{n}\ell\big(y_i F_{W^{(0)},W^*}(x_i)\big) = \Big(\frac{3}{2}-4\epsilon_{\mathrm{app}}(\tau)\Big)\eta\sum_{i=1}^{n'} L_i(W^{(i-1)}) - 2n\eta\,\epsilon_{\mathrm{NTRF}}.$$
This finishes the proof.

D EXPERIMENTS

In this section, we conduct simple experiments to validate our theory. Since our paper mainly focuses on binary classification, we use a subset of the original CIFAR10 dataset (Krizhevsky et al., 2009) containing only two classes of images. We train a 5-layer fully-connected ReLU network on this binary classification dataset with different sample sizes ($n \in \{100, 200, 500, 1000, 2000, 5000, 10000\}$), and plot the minimal neural network width that is required to achieve zero training error in Figure 1 (solid line). We also plot $O(n)$, $O(\log^3(n))$, $O(\log^2(n))$ and $O(\log(n))$ as dashed lines for reference. It is evident that the network width required to achieve zero training error grows polylogarithmically in the sample size $n$, which is consistent with our theory.
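For readers who want to reproduce the protocol at small scale, the sketch below is a drastically simplified, hypothetical stand-in (our own construction, not the paper's exact setup): synthetic linearly separable data in place of the CIFAR10 subset, and a two-layer ReLU network in which only the output layer is trained (an NTRF-style linearization), fitted by GD on squared loss until the training error reaches zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 16, 10, 256                      # sample size, input dim, width
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = np.sign(X @ rng.normal(size=d))        # labels from a planted linear teacher

W1 = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))   # NTK-style random init
H = np.maximum(X @ W1.T, 0.0)              # fixed ReLU features, shape (n, m)

w2 = np.zeros(m)                           # only the output layer is trained
lr = 1.0
for _ in range(8000):
    resid = H @ w2 - y                     # squared-loss residual
    w2 -= lr * (H.T @ resid) / n           # gradient descent step

train_err = float(np.mean(np.sign(H @ w2) != y))
mse = float(np.mean((H @ w2 - y) ** 2))
```

The full experiment replaces the random-feature linearization by end-to-end training of the 5-layer network and performs a search over widths for each $n$.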



* A detailed discussion is given in Section 4.2. † The factor $m^{1/2}$ is introduced here since $\|\nabla_{W^{(0)}} f(x_i)\|_F$ is typically of order $O(m^{1/2})$.



…, $W^{(0)}_{L-1}$ are generated independently from the univariate Gaussian distribution $N(0, 2/m)$, and the entries in $W^{(0)}_L$ …

(2019b); Du et al. (2019b;a); Zou and Gu (2019)§, Assumption 4.5 only requires that the data points from different classes are nondegenerate; we therefore call it class-dependent data nondegeneration.

$|w^{(0)}_{2,j}| \ge 0.47 m^{-1/2}$. Then we only focus on the nodes in the set $S$. Note that $W^{(0)}_1$ and $W^{(0)}_2$ are independently generated. Then by Assumption 4.3 and Hoeffding's inequality, there exists a function $u(\cdot): \mathbb{R}^d \to \mathbb{R}^d$ such that with probability at least $1-\delta'$,
$$\frac{1}{|S|}\sum_{j\in S}\big\langle u(w^{(0)}_{1,j}), x_i\big\rangle\cdot\sigma'\big(\langle w^{(0)}_{1,j}, x_i\rangle\big) \ge \gamma - \sqrt{\frac{2\log(1/\delta')}{|S|}}.$$

Figure 1: Minimum network width that is required to achieve zero training error with respect to training sample size (blue solid line). The hidden constants in all Op¨q notations are adjusted to ensure their plots (dashed lines) start from the same point.

Comparison of neural network learning results in terms of over-parameterization condition and sample complexity. Here $\epsilon$ is the target error rate, $n$ is the sample size, and $L$ is the network depth.

ACKNOWLEDGEMENT

We would like to thank the anonymous reviewers for their helpful comments. ZC, YC and QG are partially supported by the National Science Foundation CAREER Award 1906169, IIS-2008981 and Salesforce Deep Learning Research Award. DZ is supported by the Bloomberg Data Science Ph.D. Fellowship. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

A PROOF OF MAIN THEOREMS

In this section we provide the full proofs of Theorems 3.3, 3.4 and 3.5.

A.1 PROOF OF THEOREM 3.3

We first provide the following lemma, which is useful in the subsequent proof.

Lemma A.1 (Lemmas 4.1 and B.3 in Cao and Gu (2019)). There exists an absolute constant $\kappa$ such that, with probability at least $1 - O(nL^2)\exp\big[-\Omega(m\tau^{2/3}L)\big]$, for any $\tau \le \kappa L^{-6}[\log(m)]^{-3/2}$, it holds that $\epsilon_{\mathrm{app}}(\tau) \le \tilde O\big(\tau^{4/3}L^3 m^{1/2}\big)$ and $M(\tau) \le \tilde O(\sqrt{m})$.

Proof of Theorem 3.3. Recall that $W^*$ is chosen such that $n^{-1}\sum_{i=1}^n\ell\big(y_i F_{W^{(0)},W^*}(x_i)\big) = \epsilon_{\mathrm{NTRF}}$ and $W^* \in \mathcal{B}(W^{(0)}, Rm^{-1/2})$. Note that to apply Lemma 5.1, we need the region $\mathcal{B}(W^{(0)},\tau)$ to include both $W^*$ and $\{W^{(t)}\}_{t=0,\dots,t'}$. This motivates us to set $\tau = \tilde O(L^{1/2}m^{-1/2}R)$, which is slightly larger than $m^{-1/2}R$. With this choice of $\tau$, by Lemma A.1 we have $\epsilon_{\mathrm{app}}(\tau) = \tilde O(\tau^{4/3}m^{1/2}L^3) = \tilde O(R^{4/3}L^{11/3}m^{-1/6})$. Therefore, we can set
$$m = \tilde\Omega(R^8 L^{22}) \tag{A.1}$$
to ensure that $\epsilon_{\mathrm{app}}(\tau) \le 1/8$, where $\tilde\Omega(\cdot)$ hides polylogarithmic dependencies on the network depth $L$, the NTRF function class size $R$, and the failure probability parameter $\delta$. Then by Lemma 5.1, with probability at least $1-\delta$ we have
$$\|W^{(0)}-W^*\|_F^2 - \|W^{(t')}-W^*\|_F^2 \ge \eta\bigg(\sum_{t=0}^{t'-1} L_S(W^{(t)}) - 2t'\epsilon_{\mathrm{NTRF}}\bigg) \tag{A.2}$$
as long as $W^{(0)},\dots,W^{(t'-1)} \in \mathcal{B}(W^{(0)},\tau)$. In the following proof we choose $\eta = \Theta(L^{-1}m^{-1})$ and $T = \lceil LR^2m^{-1}\eta^{-1}\epsilon_{\mathrm{NTRF}}^{-1}\rceil$.

We prove the theorem in two steps: 1) we show that all iterates $\{W^{(0)},\dots,W^{(T)}\}$ stay inside the region $\mathcal{B}(W^{(0)},\tau)$; and 2) we show that GD can find a neural network with at most $3\epsilon_{\mathrm{NTRF}}$ training loss within $T$ iterations.

All iterates stay inside $\mathcal{B}(W^{(0)},\tau)$. We prove this part by induction. Specifically, given $t' \le T$, we assume the hypothesis $W^{(t)}\in\mathcal{B}(W^{(0)},\tau)$ holds for all $t < t'$ and prove that $W^{(t')}\in\mathcal{B}(W^{(0)},\tau)$. First, it is clear that $W^{(0)}\in\mathcal{B}(W^{(0)},\tau)$. Then by (A.2) and the fact that $L_S(W)\ge 0$, we have
$$\|W^{(t')}-W^*\|_F^2 \le \|W^{(0)}-W^*\|_F^2 + 2T\eta\,\epsilon_{\mathrm{NTRF}} \le C^2 LR^2 m^{-1},$$
where $C \ge 4$ is an absolute constant.
Therefore, by the triangle inequality, we further have for all $l\in[L]$,
$$\big\|W^{(t')}_l - W^{(0)}_l\big\|_F \le \|W^{(t')}-W^*\|_F + \|W^*-W^{(0)}\|_F \le CL^{1/2}Rm^{-1/2} \le \tau,$$
based on our choice of $\tau$ previously. This completes the proof of the first part.

A.3 PROOF OF THEOREM 3.5

In this section we provide the full proof of Theorem 3.5. We first give the following result, which is the counterpart of Lemma 5.1 for SGD. Again we pick $W^*\in\mathcal{B}(W^{(0)}, Rm^{-1/2})$ such that the loss of the corresponding NTRF model $F_{W^{(0)},W^*}(x)$ achieves $\epsilon_{\mathrm{NTRF}}$.

Lemma A.4. Set $\eta = O\big(L^{-1}M(\tau)^{-2}\big)$. Suppose that $W^*\in\mathcal{B}(W^{(0)},\tau)$ and $W^{(n')}\in\mathcal{B}(W^{(0)},\tau)$ for all $0\le n'\le n-1$. Then it holds that
$$\|W^{(0)}-W^*\|_F^2 - \|W^{(n')}-W^*\|_F^2 \ge \eta\bigg(\sum_{i=1}^{n'} L_i(W^{(i-1)}) - 2n\epsilon_{\mathrm{NTRF}}\bigg). \tag{A.4}$$

We introduce a surrogate loss $\mathcal{E}_i(W) = -\ell'[y_i\cdot f_W(x_i)]$ and its population version $\mathcal{E}_{\mathcal{D}}(W) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[-\ell'(y\cdot f_W(x))]$, which have been used in (Ji and Telgarsky, 2018; Cao and Gu, 2019; Nitanda and Suzuki, 2019; Ji and Telgarsky, 2020). Our proof is based on the application of Lemma A.4 and an online-to-batch conversion argument (Cesa-Bianchi et al., 2004; Cao and Gu, 2019; Ji and Telgarsky, 2020).

Proof of Theorem 3.5. Recall that $W^*$ is chosen such that $n^{-1}\sum_{i=1}^n\ell\big(y_i F_{W^{(0)},W^*}(x_i)\big) = \epsilon_{\mathrm{NTRF}}$ and $W^*\in\mathcal{B}(W^{(0)}, Rm^{-1/2})$. To apply Lemma A.4, we need the region $\mathcal{B}(W^{(0)},\tau)$ to include both $W^*$ and $\{W^{(i)}\}_{i=0,\dots,n'}$. This motivates us to set $\tau = \tilde O(L^{1/2}m^{-1/2}R)$, which is slightly larger than $m^{-1/2}R$. With this choice of $\tau$, by Lemma A.1 we have $\epsilon_{\mathrm{app}}(\tau) = \tilde O(\tau^{4/3}m^{1/2}L^3) = \tilde O(R^{4/3}L^{11/3}m^{-1/6})$. Therefore, we can set $m = \tilde\Omega(R^8L^{22})$ to ensure that $\epsilon_{\mathrm{app}}(\tau)\le 1/8$, where $\tilde\Omega(\cdot)$ hides polylogarithmic dependencies on the network depth $L$, the NTRF function class size $R$, and the failure probability parameter $\delta$. Then by Lemma A.4, we have with probability at least $1-\delta$ that (A.4) holds as long as $W^{(0)}, \dots$
$\dots, W^{(n'-1)}\in\mathcal{B}(W^{(0)},\tau)$.

We then prove Theorem 3.5 in two steps: 1) all iterates stay inside $\mathcal{B}(W^{(0)},\tau)$; and 2) convergence of online SGD.

All iterates stay inside $\mathcal{B}(W^{(0)},\tau)$. Similar to the proof of Theorem 3.3, we prove this part by induction. Assuming $W^{(i)}\in\mathcal{B}(W^{(0)},\tau)$ for all $i\le n'-1$, by (A.4) we have
$$\|W^{(n')}-W^*\|_F^2 \le \|W^{(0)}-W^*\|_F^2 + 2n\eta\,\epsilon_{\mathrm{NTRF}} \le LR^2m^{-1} + 2n\eta\,\epsilon_{\mathrm{NTRF}},$$
where the last inequality is by $W^*\in\mathcal{B}(W^{(0)}, R\cdot m^{-1/2})$. Then by the triangle inequality, we further get
$$\|W^{(n')}-W^{(0)}\|_F \le \|W^{(n')}-W^*\|_F + \|W^*-W^{(0)}\|_F.$$
Then by our choice of $\eta = \Theta\big(m^{-1}\cdot\big(LR^2n^{-1}\epsilon_{\mathrm{NTRF}}^{-1}\wedge L^{-1}\big)\big)$, we have $\|W^{(n')}-W^{(0)}\|_F \le 2\sqrt{L}Rm^{-1/2} \le \tau$. This completes the proof of the first part.

where the last inequality is by the fact that $|S|\ge m/4$. Besides, note that by concentration and the Gaussian tail bound, we have $|f_{W^{(0)}}(x_i)| \le C\log(n/\delta)$ for some absolute constant $C$. Therefore, letting
$$W_1 = W^{(0)}_1 + 4\big(\log(2/\epsilon) + C\log(n/\delta)\big)m^{-1/2}\,U/\gamma, \qquad W_2 = W^{(0)}_2,$$
we have
$$y_i\cdot F_{W^{(0)},W}(x_i) \ge \log(2/\epsilon). \tag{B.1}$$
Note that $\|u(\cdot)\|_2 \le 1$ implies $\|U\|_F \le 1/0.47 \le 2.2$. Therefore, we further have $\|W_1 - W^{(0)}_1\|_F \le 8.8\gamma^{-1}\big(\log(2/\epsilon) + C\log(n/\delta)\big)\cdot m^{-1/2}$. This implies that $W \in \mathcal{B}(W^{(0)}, R\cdot m^{-1/2})$ with $R = O\big(\log\big(n/(\delta\epsilon)\big)/\gamma\big)$. Applying the inequality $\ell\big(\log(2/\epsilon)\big) \le \epsilon$ to (B.1) gives $\ell\big(y_i\cdot F_{W^{(0)},W}(x_i)\big) \le \epsilon$ for all $i = 1,\dots,n$. This completes the proof.

B.3 PROOF OF PROPOSITION 4.6

Based on our theoretical analysis, the major goal is to show that there exist certain choices of $R$ and $m$ such that the best NTRF model in the function class $\mathcal{F}(W^{(0)}, R)$ can achieve $\epsilon$ training error. In this proof, we prove a stronger result by showing that, given the quantities of $R$ and $m$ specified in Proposition 4.6, there exists an NTRF model with parameter $W^*$ that achieves this.

In order to do so, we consider training the NTRF model via a different surrogate loss function. Specifically, we consider the squared hinge loss $\tilde\ell(x) = \big(\max\{\lambda - x, 0\}\big)^2$, where $\lambda$ denotes the target margin. In the later proof, we choose $\lambda = \log(1/\epsilon) + 1$ so that the condition $\tilde\ell(x) \le 1$ guarantees that $x \ge \log(1/\epsilon)$. Moreover, we consider using gradient flow, i.e., gradient descent with infinitesimal step size, to train the NTRF model. Therefore, in the remaining part of the proof, we consider optimizing the NTRF parameter $W$ with the loss function
$$\tilde L_S(W) := \frac{1}{n}\sum_{i=1}^n \tilde\ell\big(y_i F_{W^{(0)},W}(x_i)\big).$$
Moreover, for simplicity, we only consider optimizing the parameters in the last hidden layer (i.e., $W_{L-1}$). The gradient flow can then be formulated as
$$\frac{dW_{L-1}(t)}{dt} = -\nabla_{W_{L-1}}\tilde L_S(W(t)).$$
Note that the NTRF model is a linear model in $W$; thus by Definition 3.2, we have $\nabla_{W_{L-1}} F_{W^{(0)},W}(x_i) = \nabla_{W_{L-1}} f_{W^{(0)}}(x_i)$ for each $i$. Then it is clear that $\nabla_{W_{L-1}}\tilde L_S(W(t))$ has fixed direction throughout the optimization.

In order to prove the convergence of gradient flow and characterize the quantity $R$, we first use Lemma B.1, which gives an upper bound on the NTRF model output at the initialization. Then we provide the following lemma, which gives a lower bound on the Frobenius norm of the partial gradient $\nabla_{W_{L-1}}\tilde L_S(W)$.

Lemma B.2 (Lemma B.5 in Zou et al. (2019)). Under Assumptions 3.1 and 4.5, if $m = \tilde\Omega(n^2\phi^{-1})$, then for all $t\ge0$, with probability at least $1-\exp\big(-O(m\phi/n)\big)$, there exists a positive constant $C$ such that
$$\big\|\nabla_{W_{L-1}}\tilde L_S(W(t))\big\|_F^2 \ge \frac{Cm\phi}{n^5}\bigg(\sum_{i=1}^n \tilde\ell'\big(y_i F_{W^{(0)},W(t)}(x_i)\big)\bigg)^2.$$

We slightly modified the original version of this lemma, since we use a different model (we consider the NTRF model while Zou et al. (2019) consider the neural network model).
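The margin bookkeeping for this surrogate can be checked directly. This is our own sanity check: with $\lambda = \log(1/\epsilon) + 1$, the condition $\tilde\ell(x) \le 1$ forces $x \ge \lambda - 1 = \log(1/\epsilon)$, at which point the cross-entropy loss is $\log(1+e^{-x}) = \log(1+\epsilon) \le \epsilon$.

```python
import math

eps = 0.01
lam = math.log(1.0 / eps) + 1.0            # target margin lambda

def sq_hinge(x):                           # surrogate squared hinge loss
    return max(lam - x, 0.0) ** 2

# Boundary case: sq_hinge(x) = 1 exactly when x = lam - 1 = log(1/eps).
x = lam - 1.0
ce = math.log1p(math.exp(-x))              # cross-entropy loss at margin x
```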
However, by (B.2), it is clear that the gradient $\nabla\tilde L_S(W)$ can be regarded as a type of gradient for the neural network model at the initialization (i.e., $\nabla_{W_{L-1}} L_S(W^{(0)})$), so the lemma remains valid in our setting. Now we are ready to present the proof.

C.1 PROOF OF LEMMA 5.1

The detailed proof of Lemma 5.1 is given as follows.

Proof of Lemma 5.1. Based on the update rule of gradient descent, i.e., $W^{(t+1)} = W^{(t)} - \eta\nabla_W L_S(W^{(t)})$, we have
$$\|W^{(t)}-W^*\|_F^2 - \|W^{(t+1)}-W^*\|_F^2 = \underbrace{\frac{2\eta}{n}\sum_{i=1}^n\big\langle W^{(t)}-W^*,\ \nabla_W L_i(W^{(t)})\big\rangle}_{I_1} - \underbrace{\eta^2\big\|\nabla_W L_S(W^{(t)})\big\|_F^2}_{I_2}, \tag{C.1}$$
where the equation follows from the fact that $L_S(W^{(t)}) = n^{-1}\sum_{i=1}^n L_i(W^{(t)})$. In what follows, we first bound the term $I_1$ on the right-hand side of (C.1) by approximating the neural network functions with linear models. By assumption, for $t = 0,\dots,t'-1$, we have $W^{(t)}, W^*\in\mathcal{B}(W^{(0)},\tau)$. Therefore, by the definition of $\epsilon_{\mathrm{app}}(\tau)$,
$$y_i\cdot\big\langle\nabla f_{W^{(t)}}(x_i),\ W^{(t)}-W^*\big\rangle \le y_i\cdot\big(f_{W^{(t)}}(x_i) - f_{W^*}(x_i)\big) + \epsilon_{\mathrm{app}}(\tau). \tag{C.2}$$
Moreover, we also have
$$y_i\cdot f_{W^*}(x_i) \ge y_i\cdot F_{W^{(0)},W^*}(x_i) - \epsilon_{\mathrm{app}}(\tau), \tag{C.3}$$
where the inequality follows by the definition of $F_{W^{(0)},W^*}(x)$. Adding (C.3) to (C.2) and canceling the terms $y_i\cdot f_{W^*}(x_i)$, we obtain
$$y_i\cdot\big\langle\nabla f_{W^{(t)}}(x_i),\ W^{(t)}-W^*\big\rangle \le y_i\cdot f_{W^{(t)}}(x_i) - y_i\cdot F_{W^{(0)},W^*}(x_i) + 2\epsilon_{\mathrm{app}}(\tau). \tag{C.4}$$
We can now give a lower bound on the first term on the right-hand side of (C.1). For $i = 1,\dots,n$, applying the chain rule to the loss gradients and utilizing (C.4), we have
$$\big\langle W^{(t)}-W^*,\ \nabla_W L_i(W^{(t)})\big\rangle = \ell'\big(y_i f_{W^{(t)}}(x_i)\big)\cdot y_i\cdot\big\langle\nabla f_{W^{(t)}}(x_i),\ W^{(t)}-W^*\big\rangle$$
$$\ge \ell'\big(y_i f_{W^{(t)}}(x_i)\big)\cdot\big[y_i f_{W^{(t)}}(x_i) - y_i F_{W^{(0)},W^*}(x_i) + 2\epsilon_{\mathrm{app}}(\tau)\big] \ge \big(1-2\epsilon_{\mathrm{app}}(\tau)\big)\,\ell\big(y_i f_{W^{(t)}}(x_i)\big) - \ell\big(y_i F_{W^{(0)},W^*}(x_i)\big), \tag{C.5}$$
where the first inequality is by the fact that $\ell'\big(y_i f_{W^{(t)}}(x_i)\big) < 0$, and the second inequality is by the convexity of $\ell(\cdot)$ and the fact that $-\ell'\big(y_i f_{W^{(t)}}(x_i)\big) \le \ell\big(y_i f_{W^{(t)}}(x_i)\big)$.

We now proceed to bound the term $I_2$ on the right-hand side of (C.1). Note that we have $\ell'(\cdot) < 0$; therefore, the Frobenius norm of the gradient $\nabla_{W_l} L_S(W^{(t)})$ can be upper bounded as follows:
$$\big\|\nabla_{W_l} L_S(W^{(t)})\big\|_F \le \frac{1}{n}\sum_{i=1}^n\Big(-\ell'\big(y_i f_{W^{(t)}}(x_i)\big)\Big)\cdot\big\|\nabla_{W_l} f_{W^{(t)}}(x_i)\big\|_F,$$
where the inequality follows by the triangle inequality. We now utilize the fact that the cross-entropy loss satisfies the inequalities $-\ell'(\cdot) \le \ell(\cdot)$ and $-\ell'(\cdot) \le 1$.
Therefore, by the definition of $M(\tau)$, we have
$$\big\|\nabla_W L_S(W^{(t)})\big\|_F^2 \le O\big(LM(\tau)^2\big)\cdot\bigg(\frac{1}{n}\sum_{i=1}^n -\ell'\big(y_i f_{W^{(t)}}(x_i)\big)\bigg)^2 \le O\big(LM(\tau)^2\big)\cdot L_S(W^{(t)}). \tag{C.6}$$
Then we can plug (C.5) and (C.6) into (C.1) and obtain
$$\|W^{(t)}-W^*\|_F^2 - \|W^{(t+1)}-W^*\|_F^2 \ge \big(2-4\epsilon_{\mathrm{app}}(\tau)\big)\eta L_S(W^{(t)}) - \frac{2\eta}{n}\sum_{i=1}^n\ell\big(y_i F_{W^{(0)},W^*}(x_i)\big) - O\big(\eta^2 LM(\tau)^2\big)L_S(W^{(t)})$$
$$\ge \Big(\frac{3}{2}-4\epsilon_{\mathrm{app}}(\tau)\Big)\eta L_S(W^{(t)}) - \frac{2\eta}{n}\sum_{i=1}^n\ell\big(y_i F_{W^{(0)},W^*}(x_i)\big),$$
where the last inequality is by $\eta = O\big(L^{-1}M(\tau)^{-2}\big)$ and merging the third term into the first. Taking the telescoping sum from $t = 0$ to $t = t'-1$ and plugging in the definition of $\epsilon_{\mathrm{NTRF}}$ completes the proof.

C.2 PROOF OF LEMMA A.3

Proof of Lemma A.3. We first denote $\mathcal{W} = \mathcal{B}(W^{(0)}, \tilde R\cdot m^{-1/2})$, and define the corresponding neural network function class and surrogate loss function class as $\mathcal{F} = \{f_W(x): W\in\mathcal{W}\}$ and $\mathcal{G} = \{-\ell'[y\cdot f_W(x)]: W\in\mathcal{W}\}$, respectively. By standard uniform convergence results in terms of the empirical Rademacher complexity (Bartlett and Mendelson, 2002; Mohri et al., 2018; Shalev-Shwartz and Ben-David, 2014), with probability at least $1-\delta$ we have
$$\sup_{W\in\mathcal{W}}\big|\mathcal{E}_{\mathcal{D}}(W) - \mathcal{E}_S(W)\big| \le 2\widehat{\mathfrak{R}}_n(\mathcal{G}) + C_1\sqrt{\frac{\log(1/\delta)}{n}}, \tag{C.7}$$
where $C_1$ is an absolute constant and
$$\widehat{\mathfrak{R}}_n(\mathcal{G}) = \mathbb{E}_{\boldsymbol{\xi}}\bigg[\sup_{W\in\mathcal{W}}\frac{1}{n}\sum_{i=1}^n \xi_i\cdot\big(-\ell'[y_i\cdot f_W(x_i)]\big)\bigg]$$
is the empirical Rademacher complexity of the function class $\mathcal{G}$ (the $\xi_i$ are i.i.d. Rademacher random variables). We now provide two bounds on $\widehat{\mathfrak{R}}_n(\mathcal{G})$, whose combination gives the final result of Lemma A.3.

First, by Corollary 5.35 in (Vershynin, 2010), with probability at least $1-L\cdot\exp(-\Omega(m))$ we have $\|W^{(0)}_l\|_2 \le 3$ for all $l\in[L]$. Therefore, for all $W\in\mathcal{W}$ we have $\|W_l\|_2 \le 4$. Moreover, standard concentration inequalities on the norm of the first row of $W^{(0)}_l$ also imply that $\|W_l\|_2 \ge 0.5$ for all $W\in\mathcal{W}$ and $l\in[L]$. Therefore, an adaptation of the bound in (Bartlett et al., 2017) gives the first bound on $\widehat{\mathfrak{R}}_n(\mathcal{G})$, corresponding to the first term inside the minimum in Lemma A.3.

We now derive the second bound on $\widehat{\mathfrak{R}}_n(\mathcal{G})$, which is inspired by the proof provided in (Cao and Gu, 2020). Since $y\in\{+1,-1\}$, $|\ell'(z)| \le 1$ and $\ell'(z)$ is 1-Lipschitz continuous, standard empirical Rademacher complexity bounds (Bartlett and Mendelson, 2002; Mohri et al., 2018; Shalev-Shwartz and Ben-David, 2014) apply. Note that Bartlett et al. (2017) only proved the Rademacher complexity bound for the composition of the ramp loss and the neural network function; in our setting the ramp loss is essentially replaced with the $-\ell'(\cdot)$ function, which is bounded and 1-Lipschitz continuous.
The proof in our setting is therefore exactly the same as the proof given in (Bartlett et al., 2017), and we can apply Theorem 3.3 and Lemma A.5 in (Bartlett et al., 2017) to obtain the desired bound. In particular, $\widehat{\mathfrak{R}}_n(\mathcal{G}) \le \widehat{\mathfrak{R}}_n(\mathcal{F})$, where $\widehat{\mathfrak{R}}_n(\mathcal{F})$ is the empirical Rademacher complexity of the function class $\mathcal{F}$. We have
$$\widehat{\mathfrak{R}}_n(\mathcal{F}) \le \underbrace{\mathbb{E}_{\boldsymbol{\xi}}\bigg[\sup_{W\in\mathcal{W}}\frac{1}{n}\sum_{i=1}^n\xi_i\big(f_W(x_i) - F_{W^{(0)},W}(x_i)\big)\bigg]}_{I_1} + \underbrace{\mathbb{E}_{\boldsymbol{\xi}}\bigg[\sup_{W\in\mathcal{W}}\frac{1}{n}\sum_{i=1}^n\xi_i F_{W^{(0)},W}(x_i)\bigg]}_{I_2},$$
where $F_{W^{(0)},W}(x) = f_{W^{(0)}}(x) + \big\langle\nabla_W f_{W^{(0)}}(x),\ W - W^{(0)}\big\rangle$. For $I_1$, by Lemma 4.1 in (Cao and Gu, 2019), with probability at least $1-\delta/2$ we have $I_1 \le \tilde O\big(L^3\tilde R^{4/3} m^{-1/6}\big)$. For $I_2$, applying the Cauchy–Schwarz inequality and then Jensen's inequality, together with Lemma B.3 in (Cao and Gu, 2019), we obtain $I_2 \le \tilde O\big(L\tilde R/\sqrt{n}\big)$. Combining the bounds of $I_1$ and $I_2$ gives
$$\widehat{\mathfrak{R}}_n(\mathcal{F}) \le \tilde O\bigg(\frac{L\tilde R}{\sqrt{n}} + L^3\tilde R^{4/3} m^{-1/6}\bigg).$$
Further combining this bound with (C.7) and rescaling $\delta$ completes the proof.
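To make the empirical Rademacher complexity concrete, the following Monte Carlo sketch (our own toy) estimates it for the simplest case of a norm-bounded linear class, where the supremum has a closed form, and compares it with the standard bound obtained from Jensen's inequality; this is the same type of step used for the $I_2$ term above.

```python
import numpy as np

# Empirical Rademacher complexity of {x -> <w, x> : ||w||_2 <= B}:
#   sup_{||w||<=B} (1/n) sum_i xi_i <w, x_i> = (B/n) * || sum_i xi_i x_i ||_2,
# so the expectation over signs can be estimated by Monte Carlo, and by
# Jensen's inequality it is at most (B/n) * sqrt(sum_i ||x_i||_2^2).
rng = np.random.default_rng(0)
n, d, B = 100, 5, 2.0
X = rng.normal(size=(n, d)) / np.sqrt(d)

draws = []
for _ in range(4000):
    xi = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
    draws.append(B * np.linalg.norm(xi @ X) / n)
rad_hat = float(np.mean(draws))
jensen_bound = float(B * np.sqrt(np.sum(X**2)) / n)
```

The estimate sits just below the Jensen bound, illustrating the (mild) slack of this step; for the neural network class in Lemma A.3 the supremum has no closed form and the bounds above are used instead.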

C.3 PROOF OF LEMMA A.4

Proof of Lemma A.4. Different from the proof of Lemma 5.1, online SGD queries only one data point to update the model parameters at each iteration, i.e., $W^{(i+1)} = W^{(i)} - \eta\nabla_W L_{i+1}(W^{(i)})$. By this update rule, we have
$$\|W^{(i)}-W^*\|_F^2 - \|W^{(i+1)}-W^*\|_F^2 = 2\eta\big\langle W^{(i)}-W^*,\ \nabla_W L_{i+1}(W^{(i)})\big\rangle - \eta^2\big\|\nabla_W L_{i+1}(W^{(i)})\big\|_F^2. \tag{C.9}$$
With exactly the same proof as (C.5) in the proof of Lemma 5.1, we have
$$\big\langle W^{(i)}-W^*,\ \nabla_W L_{i+1}(W^{(i)})\big\rangle \ge \big(1-2\epsilon_{\mathrm{app}}(\tau)\big)\,\ell\big(y_{i+1} f_{W^{(i)}}(x_{i+1})\big) - \ell\big(y_{i+1} F_{W^{(0)},W^*}(x_{i+1})\big) \tag{C.10}$$
for all $i = 0,\dots,n'-1$. By the facts that $-\ell'(\cdot) \le \ell(\cdot)$ and $-\ell'(\cdot) \le 1$, we have
$$\big\|\nabla_W L_{i+1}(W^{(i)})\big\|_F^2 \le \sum_{l=1}^L \Big(-\ell'\big(y_{i+1} f_{W^{(i)}}(x_{i+1})\big)\Big)^2\cdot\big\|\nabla_{W_l} f_{W^{(i)}}(x_{i+1})\big\|_F^2$$

