SHARPER ANALYSIS OF SPARSELY ACTIVATED WIDE NEURAL NETWORKS WITH TRAINABLE BIASES

Abstract

This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, in contrast to previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The tantalizing benefit of such initialization is that the neural network provably has a sparse activation pattern before, during and after training, which can enable fast training procedures and, therefore, reduce the training cost. The first set of results characterizes the convergence of the network's gradient descent dynamics: the width required to ensure that gradient descent can drive the training error towards zero at a linear rate is provided. The contribution over previous work is that not only is the bias allowed to be updated by gradient descent under our setting, but a finer analysis is also given such that the required width to ensure the network's closeness to its NTK is improved. Secondly, the network's generalization bound after training is provided. A width-sparsity dependence is presented, which yields a sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analyses (up to logarithmic factors). To our knowledge, this is the first sparsity-dependent generalization result via localized Rademacher complexity. As a by-product, if the bias initialization is chosen to be zero, the width requirement improves upon the previous bound for shallow networks' generalization. Lastly, since the generalization bound depends on the smallest eigenvalue of the limiting NTK, and the bounds from previous works yield vacuous generalization, this work further studies the least eigenvalue of the limiting NTK.
Surprisingly, while it is not shown that trainable biases are necessary, the trainable bias helps to identify a nice data-dependent region where a much finer analysis of the NTK's smallest eigenvalue can be conducted, which leads to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound. Experiments are provided to corroborate our results.

1. INTRODUCTION

The literature on sparse neural networks dates back to the early work of LeCun et al. (1989), where they showed that a fully-trained neural network can be pruned while preserving generalization. Recently, training sparse neural networks has been receiving increasing attention since the discovery of the lottery ticket hypothesis (Frankle & Carbin, 2018). In that work, they showed that if we repeatedly train and prune a neural network and then rewind the weights to the initialization, we are able to find a sparse neural network that can be trained to match the performance of its dense counterpart. However, this method is more of a proof of concept and is computationally expensive for any practical purpose. Nonetheless, it inspires further interest in the machine learning community in developing efficient methods to find the sparse pattern at initialization such that the performance of the sparse network matches the dense network after training (Lee et al., 2018; Wang et al., 2019; Tanaka et al., 2020; Liu & Zenke, 2020; Chen et al., 2021; He et al., 2017; Liu et al., 2021b). On the other hand, instead of trying to find some desired sparsity pattern at initialization, another line of research has been focusing on inducing the sparsity pattern naturally and then creatively utilizing such sparse structure via high-dimensional geometric data structures as well as sketching or even quantum algorithms to speed up per-step gradient descent training (Song et al., 2021a;b; Hu et al., 2022; Gao et al., 2022). In this line of theoretical studies, the sparsity is induced by the shifted ReLU, which is the same as initializing the bias of the network's linear layer to some large constant instead of zero and holding the bias fixed throughout the entire training.
By Gaussian concentration, at initialization, the total number of activated neurons (i.e., neurons for which ReLU outputs a non-zero value) will be sublinear in the total number m of neurons, as long as the bias is initialized to be C√(log m) for some appropriate constant C. We call this sparsity-inducing initialization. If the network is in the NTK regime, each neuron weight will exhibit only microscopic change after training, and thus the sparsity can be preserved throughout the entire training process. Therefore, during the entire training process, only a sublinear number of the neuron weights need to be updated, which can significantly speed up training. The focus of this work is along the above line of theoretical studies of sparsely trained overparameterized neural networks, and we address the two main limitations of the aforementioned studies. (1) The bias parameters used in the previous works are not trainable, contrary to common practice. (2) The previous works only provided convergence guarantees, lacking the generalization analysis which is of central interest in deep learning theory. Thus, our study fills these important gaps by providing a comprehensive study of training one-hidden-layer sparsely activated neural networks in the NTK regime with (a) trainable biases incorporated in the analysis; (b) a finer analysis of the convergence; and (c) the first generalization bound for such sparsely activated neural networks after training, with a sharp bound on the restricted smallest eigenvalue of the limiting NTK. We further elaborate our technical contributions as follows: 1. Convergence. Theorem 3.1 provides the width required to ensure that gradient descent can drive the training error towards zero at a linear rate. Our convergence result contains two novel ingredients compared to the existing study.
(1) Our analysis handles trainable bias, and shows that even though the biases are allowed to be updated from their initialization, the network's activation remains sparse during the entire training. This relies on our development of a new result showing that the change of bias is also diminishing, with an O(1/√m) dependence on the network width m. (2) A finer analysis is provided such that the required network width to ensure convergence can be much smaller, with an improvement upon the previous result by a factor of Θ(n^{8/3}) under appropriate bias initialization, where n is the sample size. This relies on our novel development of (1) a better characterization of the activation flipping probability via an analysis of Gaussian anti-concentration based on the location of the strip and (2) a finer analysis of the initial training error. 2. Generalization. Theorem 3.8 studies the generalization of the network after gradient descent training, where we characterize how the network width should depend on the activation sparsity, which leads to a sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analyses (up to logarithmic factors). To our knowledge, this is the first sparsity-dependent generalization result via localized Rademacher complexity. In addition, compared with previous works, our result improves the width's dependence on n by a factor of n^{10}. This relies on (1) the usage of symmetric initialization and (2) a finer analysis of the weight matrix change in Frobenius norm in Lemma 3.13. 3. Restricted Smallest Eigenvalue. Theorem 3.8 shows that the generalization bound heavily depends on the smallest eigenvalue λ_min of the limiting NTK. However, the previously known worst-case lower bounds on λ_min under data separation have an explicit 1/n^2 dependence (Oymak & Soltanolkotabi, 2020; Song et al., 2021a), making the generalization bound vacuous.
Instead, our Theorem 3.11 establishes a much sharper lower bound restricted to a data-dependent region, which is sample-size-independent. This yields a desirable generalization bound that vanishes as fast as O(1/√n), provided that the label vector is in this region, which can be ensured by simple label shifting.
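As a rough numerical illustration of the sparsity-inducing initialization (a minimal numpy sketch of ours, not code from the paper; the constant C = 1 and the dimensions are arbitrary choices), the fraction of activated neurons at initialization stays below the Gaussian tail bound exp(-B^2/2) when b_r = B = C√(log m):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 100_000
C = 1.0
B = C * np.sqrt(np.log(m))            # sparsity-inducing bias initialization

W = rng.standard_normal((m, d))       # w_r ~ N(0, I_d)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                # with ||x||_2 = 1, w_r^T x ~ N(0, 1)

frac = np.mean(W @ x >= B)            # fraction of activated neurons at init
tail = np.exp(-B**2 / 2)              # Gaussian tail bound: P[g >= B] <= exp(-B^2/2)
print(f"activated fraction: {frac:.2e}  (bound exp(-B^2/2) = {tail:.2e})")
```

With C = 1 this bound equals 1/√m, i.e., the number of activated neurons is roughly √m out of m, sublinear in the width.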

1.1. FURTHER RELATED WORKS

Besides the works mentioned in the introduction, another work related to ours is (Liao & Kyrillidis, 2022), where they also considered training a one-hidden-layer neural network with sparse activation and studied its convergence. However, different from our work, their sparsity is induced by sampling a random mask at each step of gradient descent, whereas our sparsity is induced by non-zero initialization of the bias terms. Also, their network has no bias term, and they only focus on training convergence but not generalization. We discuss additional related works here. Training Overparameterized Neural Networks. Over the past few years, a tremendous amount of effort has been made to study training overparameterized neural networks. A series of works have shown that if the neural network is wide enough (polynomial in depth, number of samples, etc.), gradient descent can drive the training error towards zero at a fast rate, either explicitly (Du et al., 2018; 2019; Ji & Telgarsky, 2019) or implicitly (Allen-Zhu et al., 2019; Zou & Gu, 2019; Zou et al., 2020), using the neural tangent kernel (NTK) (Jacot et al., 2018). Further, under some conditions, the networks can generalize (Cao & Gu, 2019). In the NTK regime, the trained neural network can be well-approximated by its first-order Taylor approximation around the initialization, and Liu et al. (2020) showed that this transition-to-linearity phenomenon results from a Hessian 2-norm that diminishes with width. Later on, Frei & Gu (2021) and Liu et al. (2022) showed that closeness to initialization is sufficient but not necessary for gradient descent to achieve fast convergence, as long as the non-linear system satisfies some variant of the Polyak-Łojasiewicz condition. On the other hand, although the NTK offers a good explanation of convergence, it contradicts practice since (1) the neural networks need to be unrealistically wide and (2) the neuron weights merely change from the initialization.
As Chizat et al. (2019) pointed out, this "lazy training" regime can be explained as a mere effect of scaling. Other works have considered the mean-field limit (Chizat & Bach, 2018; Mei et al., 2019; Chen et al., 2020) and feature learning (Allen-Zhu & Li, 2020; 2022; Shi et al., 2021; Telgarsky, 2022), which allow the weights to travel far away from the initialization. Sparse Neural Networks in Practice. Besides finding a fixed sparse mask at initialization as mentioned in the introduction, dynamic sparse training allows the sparse mask to be updated during training, e.g., (Mocanu et al., 2018; Mostafa & Wang, 2019; Evci et al., 2020; Jayakumar et al., 2020; Liu et al., 2021a;c;d).

2. PRELIMINARIES

Notations. We use ∥•∥_2 to denote the vector or matrix 2-norm and ∥•∥_F to denote the Frobenius norm of a matrix. When the subscript of ∥•∥ is unspecified, it defaults to the 2-norm. For matrices A ∈ R^{m×n_1} and B ∈ R^{m×n_2}, we use [A, B] to denote the row-wise concatenation of A and B, so that [A, B] is an m × (n_1 + n_2) matrix. For a matrix X ∈ R^{m×n}, the row-wise vectorization of X is denoted by vec(X) = [x_1, x_2, . . . , x_m]^⊤, where x_i is the i-th row of X. For a given integer n ∈ N, we use [n] to denote the set {0, . . . , n}, i.e., the set of integers from 0 to n. For a set S, we use S̄ to denote the complement of S. We use N(µ, σ^2) to denote the Gaussian distribution with mean µ and standard deviation σ. In addition, we use Õ, Θ̃, Ω̃ to suppress (poly-)logarithmic factors in O, Θ, Ω.

2.1. PROBLEM FORMULATION

Let the training set be (X, y), where X = (x_1, x_2, . . . , x_n) ∈ R^{d×n} denotes the feature matrix consisting of n d-dimensional vectors, and y = (y_1, y_2, . . . , y_n) ∈ R^n consists of the corresponding n response variables. We assume ∥x_i∥_2 ≤ 1 and y_i = O(1) for all i ∈ [n]. We use a one-hidden-layer neural network and consider the regression problem with the square loss function:

f(x; W, b) := (1/√m) Σ_{r=1}^m a_r σ(⟨w_r, x⟩ - b_r),    L(W, b) := (1/2) Σ_{i=1}^n (f(x_i; W, b) - y_i)^2,

where W ∈ R^{m×d} with its r-th row being w_r, b ∈ R^m is a vector with b_r being the bias of the r-th neuron, a_r is the second-layer weight, and σ(•) denotes the ReLU activation function. We initialize the neural network by W_{r,i} ∼ N(0, 1), a_r ∼ Uniform({±1}) and b_r = B for some value B ≥ 0 of choice, for all r ∈ [m], i ∈ [d]. We train only the parameters W and b (i.e., the linear layer a_r for r ∈ [m] is not trained) via gradient descent, whose updates are given by

w_r(t + 1) = w_r(t) - η ∂L(W(t), b(t))/∂w_r,    b_r(t + 1) = b_r(t) - η ∂L(W(t), b(t))/∂b_r.

By the chain rule, we have ∂L/∂w_r = (∂L/∂f)(∂f/∂w_r). The gradient of the loss with respect to the network output is ∂L/∂f = Σ_{i=1}^n (f(x_i; W, b) - y_i), and the network gradients with respect to the weights and bias are

∂f(x; W, b)/∂w_r = (1/√m) a_r x I(w_r^⊤ x ≥ b_r),    ∂f(x; W, b)/∂b_r = -(1/√m) a_r I(w_r^⊤ x ≥ b_r),

where I(•) is the indicator function. We further define H as the NTK matrix of this network with

H_{i,j}(W, b) := ⟨∂f(x_i; W, b)/∂W, ∂f(x_j; W, b)/∂W⟩ + ⟨∂f(x_i; W, b)/∂b, ∂f(x_j; W, b)/∂b⟩ = (1/m) Σ_{r=1}^m (⟨x_i, x_j⟩ + 1) I(w_r^⊤ x_i ≥ b_r, w_r^⊤ x_j ≥ b_r),    (1)

and the infinite-width version H^∞(B) of the NTK matrix H is given by

H^∞_{ij}(B) := E_{w∼N(0,I)}[(⟨x_i, x_j⟩ + 1) I(w^⊤ x_i ≥ B, w^⊤ x_j ≥ B)].

Let λ(B) := λ_min(H^∞(B)). We define I_{r,i}(W, b) := I(w_r^⊤ x_i ≥ b_r) and the matrix Z(W, b) ∈ R^{m(d+1)×n} whose (r, i)-th block is

Z(W, b) := (1/√m) [ I_{r,i}(W, b) a_r [x_i^⊤, -1]^⊤ ]_{r∈[m], i∈[n]}.

Note that H(W, b) = Z(W, b)^⊤ Z(W, b). Hence, the gradient descent step can be written as

vec([W, b](t + 1)) = vec([W, b](t)) - η Z(t)(f(t) - y),

where [W, b](t) ∈ R^{m×(d+1)} denotes the row-wise concatenation of W(t) and b(t) at the t-th step of gradient descent, and Z(t) := Z(W(t), b(t)).
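To make the formulation concrete, the following is a minimal numpy sketch (our illustration, with arbitrary toy dimensions and synthetic data, not the paper's code) of the forward pass f, the gradients ∂f/∂w_r and ∂f/∂b_r, and one gradient descent step on W and b:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 8, 5, 512                        # toy sizes (illustrative)
X = rng.standard_normal((d, n))
X /= np.linalg.norm(X, axis=0)             # enforce ||x_i||_2 <= 1
y = rng.standard_normal(n)

W = rng.standard_normal((m, d))            # W_{r,i} ~ N(0, 1)
a = rng.choice([-1.0, 1.0], size=m)        # a_r ~ Uniform({+-1}), not trained
B = 0.5                                    # constant bias initialization
b = np.full(m, B)

def f(W, b):
    # f(x_i; W, b) = (1/sqrt(m)) sum_r a_r * ReLU(<w_r, x_i> - b_r)
    return a @ np.maximum(W @ X - b[:, None], 0.0) / np.sqrt(m)

def loss(W, b):
    return 0.5 * np.sum((f(W, b) - y) ** 2)

# one gradient descent step on W and b
eta = 0.05
act = ((W @ X - b[:, None]) >= 0).astype(float)          # I(w_r^T x_i >= b_r)
err = f(W, b) - y                                        # per-sample residual
gW = (a[:, None] * act * err) @ X.T / np.sqrt(m)         # via df/dw_r = a_r x I(.)/sqrt(m)
gb = -(a[:, None] * act * err).sum(axis=1) / np.sqrt(m)  # via df/db_r = -a_r I(.)/sqrt(m)

L0 = loss(W, b)
W, b = W - eta * gW, b - eta * gb
print(f"loss before: {L0:.4f}, after one step: {loss(W, b):.4f}")
```

With a small step size in this overparameterized setting, one step already decreases the loss, consistent with the linear convergence analyzed next.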

3.1. CONVERGENCE AND SPARSITY

We present the convergence of gradient descent for the sparsely activated neural networks. Compared to the existing convergence result in (Song et al., 2021a), our study handles the trainable bias with constant initialization in the convergence analysis (which is the first of this type). Also, our bound is sharper and yields a much smaller bound on the width of neural networks required to guarantee convergence. Theorem 3.1 (Convergence). Let the learning rate be η ≤ O(λ exp(B^2)/n^2) and the bias initialization B ∈ [0, √(0.5 log m)]. Assume λ(B) = λ_0 exp(-B^2/2) for some λ_0 > 0 independent of B. Then, if the network width satisfies m ≥ Ω(λ_0^{-4} n^4 exp(B^2)), over the randomness in the initialization,

P[∀t : L(W(t), b(t)) ≤ (1 - ηλ(B)/4)^t L(W(0), b(0))] ≥ 1 - δ - e^{-Ω(n)}.

This theorem shows that the training loss decreases linearly, at a rate that depends on the smallest eigenvalue of the NTK. The assumption on λ(B) in Theorem 3.1 can be justified by (Song et al., 2021a, Theorem F.1), which shows that under some mild conditions, the NTK's least eigenvalue λ(B) is positive and has an exp(-B^2/2) dependence. Remark 3.2. Theorem 3.1 establishes a much sharper bound on the width of the neural network than previous work to guarantee the linear convergence. To elaborate, our bound only requires m ≥ Ω(λ_0^{-4} n^4 exp(B^2)), as opposed to the bound m ≥ Ω(λ_0^{-4} n^4 B^2 exp(2B^2)) in (Song et al., 2021a, Lemma D.9). If we take B = √(0.25 log m) (as allowed by the theorem), then our lower bound yields a polynomial improvement by a factor of Θ((n/λ_0)^{8/3}), which implies that the neural network width can be much smaller to achieve the same linear convergence. Key ideas in the proof of Theorem 3.1. The proof mainly consists of developing a novel bound on the activation flipping probability and a novel upper bound on the initial error, as we elaborate below. Like previous works, in order to prove convergence, we need to show that the NTK during training stays close to its initialization.
Inspecting the expression of the NTK in Equation (1), observe that training affects the NTK by changing the output of each indicator function. We say that the r-th neuron flips its activation with respect to input x_i at the k-th step of gradient descent if I(w_r(k)^⊤ x_i - b_r(k) > 0) ≠ I(w_r(k-1)^⊤ x_i - b_r(k-1) > 0). The central idea is that for each neuron, as long as the weight and bias movements R_w, R_b from its initialization are small, the probability of activation flipping (with respect to the random initialization) should not be large. We first present the bound on the probability that a given neuron flips its activation. Lemma 3.3 (Activation flipping probability). Let R_w, R_b > 0. For i ∈ [n] and r ∈ [m], define the event

A_{i,r} = {∃ w̃_r, b̃_r : ∥w̃_r - w_r∥_2 ≤ R_w, |b̃_r - b_r| ≤ R_b, I(x_i^⊤ w̃_r ≥ b̃_r) ≠ I(x_i^⊤ w_r ≥ b_r)}.

Then, for some constant c, P[A_{i,r}] ≤ c(R_w + R_b) exp(-B^2/2). (Song et al., 2021a, Claim C.11) presents an O(min{R, exp(-B^2/2)}) bound on P[A_{i,r}]. Their bound involves the min operation because P[A_{i,r}] can be bounded by the standard Gaussian tail bound and the Gaussian anti-concentration bound separately, taking whichever is smaller. In contrast, our bound replaces the min operation by the product, which creates a more convenient (and tighter) interpolation between the two bounds. Later, we will show that the maximum movements of neuron weights and biases, R_w and R_b, both have an O(1/√m) dependence on the network width, and thus our bound offers an exp(-B^2/2) improvement, where exp(-B^2/2) can be as small as 1/m^{1/4} when we take B = √(0.5 log m). Proof idea of Lemma 3.3. First notice that P[A_{i,r}] = P_{x∼N(0,1)}[|x - B| ≤ R_w + R_b]. Thus, we are solving a fine-grained Gaussian anti-concentration problem with the strip centered at B. The problem with the standard Gaussian anti-concentration bound is that it only provides a worst-case bound and is thus location-oblivious.
Central to our proof is a Gaussian anti-concentration bound based on the location of the strip, which we describe as follows. Let us first assume B > R_w + R_b. A simple probability argument yields a bound of 2(R_w + R_b) · (1/√(2π)) exp(-(B - R_w - R_b)^2/2). Since later in the Appendix we show that R_w and R_b have an O(1/√m) dependence (Lemma A.9 bounds the movement for gradient descent and Lemma A.10 for gradient flow) and we only take B = O(√(log m)), by making m sufficiently large, we can safely assume that R_w and R_b are sufficiently small. Thus, the probability can be bounded by O((R_w + R_b) exp(-B^2/2)). However, when B < R_w + R_b the above bound no longer holds. But a closer look tells us that in this case B is close to zero, and thus (R_w + R_b) · (1/√(2π)) exp(-B^2/2) ≈ (R_w + R_b)/√(2π), which yields roughly the same bound as the standard Gaussian anti-concentration. Next, our proof of Theorem 3.1 develops the following initial error bound. Lemma 3.4 (Initial error upper bound). Let B > 0 be the initialization value of the biases and let all the weights be initialized from the standard Gaussian. Let δ ∈ (0, 1) be the failure probability. Then, with probability at least 1 - δ over the randomness in the initialization, we have

L(W(0), b(0)) = O(n + (n exp(-B^2/2) + 1/m) log^3(2mn/δ)).

(Song et al., 2021a, Claim D.1) gives a rough estimate of the initial error with an O(n(1 + B^2) log^2(n/δ) log(m/δ)) bound. When we set B = C√(log m) for some constant C, our bound improves the previous result by a polylogarithmic factor. The previous bound is not tight in the following two senses: (1) the bias only decreases the magnitude of the neuron activation instead of increasing it, and (2) when the bias is initialized as B, only roughly O(exp(-B^2/2)) · m neurons are activated. Thus, we can improve the B^2 dependence to exp(-B^2/2). By combining the above two improved results, we can prove our convergence result with the improved lower bound on m as in Remark 3.2.
We provide the complete proof in Appendix A. Lastly, since the total movement of each neuron's bias has an O(1/√m) dependence (shown in Lemma A.9), combining this with the number of activated neurons at initialization, we can show that during the entire training, the number of activated neurons remains small. Lemma 3.5 (Number of Activated Neurons per Iteration). Assume the parameter settings in Theorem 3.1. With probability at least 1 - e^{-Ω(n)} over the random initialization, we have |S_on(i, t)| = O(m · exp(-B^2/2)) for all 0 ≤ t ≤ T and i ∈ [n], where S_on(i, t) := {r ∈ [m] : w_r(t)^⊤ x_i ≥ b_r(t)}.
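The location-aware anti-concentration step in the proof of Lemma 3.3 is easy to probe numerically; below is a small Monte Carlo sketch of ours (the half-width r and the constant c are arbitrary illustrative choices standing in for R_w + R_b and the constant in the lemma), checking P_{g~N(0,1)}[|g - B| ≤ r] ≤ c · r · exp(-B^2/2):

```python
import numpy as np

rng = np.random.default_rng(2)
g = rng.standard_normal(2_000_000)      # samples of g ~ N(0, 1)

def strip_prob(B, r):
    # flipping probability: a strip of half-width r = R_w + R_b centered at B
    return np.mean(np.abs(g - B) <= r)

r = 0.05                                # stands in for R_w + R_b (small, O(1/sqrt(m)))
c = 4.0 / np.sqrt(2 * np.pi)            # illustrative constant c for this r
for B in [0.0, 1.0, 2.0, 3.0]:
    p = strip_prob(B, r)
    bound = c * r * np.exp(-B**2 / 2)
    assert p <= bound                   # location-aware bound decays with exp(-B^2/2)
    print(f"B={B}: P[|g-B|<=r] ~ {p:.5f} <= {bound:.5f}")
```

Unlike the location-oblivious worst-case bound c·r, the right-hand side shrinks with the strip location B, which is exactly the exp(-B^2/2) factor exploited in the lemma.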

3.2. GENERALIZATION AND RESTRICTED LEAST EIGENVALUE

In this section, we present the sparsity-dependent generalization of our neural networks after gradient descent training. However, for technical reasons stated in Section 3.3, we use the symmetric initialization defined below. Further, we adopt the setting in (Arora et al., 2019) and use a non-degenerate data distribution to make sure the infinite-width NTK is positive definite. Definition 3.6 (Symmetric Initialization). For a one-hidden-layer neural network with 2m neurons, the network is initialized as follows: 1. For r ∈ [m], independently initialize w_r ∼ N(0, I) and a_r ∼ Uniform({-1, 1}). 2. For r ∈ {m + 1, . . . , 2m}, let w_r = w_{r-m} and a_r = -a_{r-m}. Definition 3.7 ((λ_0, δ, n)-non-degenerate distribution, (Arora et al., 2019)). A distribution D over R^d × R is (λ_0, δ, n)-non-degenerate if, for n i.i.d. samples {(x_i, y_i)}_{i=1}^n from D, with probability 1 - δ we have λ_min(H^∞(B)) ≥ λ_0 > 0. Theorem 3.8. Fix a failure probability δ ∈ (0, 1) and an accuracy parameter ϵ ∈ (0, 1). Suppose the training data S = {(x_i, y_i)}_{i=1}^n are i.i.d. samples from a (λ, δ, n)-non-degenerate distribution D defined in Definition 3.7. Assume the one-hidden-layer neural network is initialized by the symmetric initialization in Definition 3.6. Further, assume the parameter settings in Theorem 3.1 except that we let m ≥ Ω(λ(B)^{-6} n^6 exp(-B^2)). Consider any loss function ℓ : R × R → [0, 1] that is 1-Lipschitz in its first argument.
Then, with probability at least 1 - 2δ - e^{-Ω(n)} over the randomness in the symmetric initialization of W(0) ∈ R^{m×d} and a ∈ R^m and over the training samples, the two-layer neural network f(W(t), b(t), a) trained by gradient descent for t ≥ Ω((1/(ηλ(B))) log(n log(1/δ)/ϵ)) iterations has empirical Rademacher complexity (see its formal definition in Definition C.1 in the Appendix) bounded as

R_S(F) ≤ √(y^⊤(H^∞(B))^{-1}y · 8 exp(-B^2/2)/n) + Õ(exp(-B^2/4)/n^{1/2}),

and the population loss L_D(f) = E_{(x,y)∼D}[ℓ(f(x), y)] can be upper bounded as

L_D(f(W(t), b(t), a)) ≤ √(y^⊤(H^∞(B))^{-1}y · 32 exp(-B^2/2)/n) + Õ(1/n^{1/2}).    (2)

To show good generalization, we need a larger width: the second term in the Rademacher complexity bound diminishes with m, and to make this term O(1/√n), the width needs to have a (n/λ(B))^6 dependence, as opposed to (n/λ(B))^4 for convergence. At first glance of our generalization result, it seems we can make the Rademacher complexity arbitrarily small by increasing B. However, recall from the discussion of Theorem 3.1 that the smallest eigenvalue of H^∞(B) also has an exp(-B^2/2) dependence. Thus, in the worst case, the exp(-B^2/2) factor gets canceled, and sparsity will not hurt the network's generalization. Before we present the proof, we state a corollary of Theorem 3.8 for the zero-initialized bias case. Corollary 3.9. Take the same setting as in Theorem 3.8 except that the biases are now initialized as zero, i.e., B = 0. Then, if we let m ≥ Ω(λ(0)^{-6} n^6), the empirical Rademacher complexity and population loss are both bounded by

R_S(F), L_D(f(W(t), b(t), a)) ≤ √(y^⊤(H^∞(0))^{-1}y · 32/n) + Õ(1/n^{1/2}).

Corollary 3.9 requires the network width m ≥ Ω((n/λ(0))^6), which significantly improves upon the previous requirement m ≥ Ω(n^{16} poly(1/λ(0))) in (Song & Yang, 2019, Theorem G.7) (including the dependence on the rescaling factor κ), a much wider network. Generalization Bound via Least Eigenvalue.
Note that in Theorem 3.8, the worst case of the first term in the generalization bound in Equation (2) is given by O(√(1/(λ(B) · n))). Hence, the least eigenvalue λ(B) of the NTK matrix can significantly affect the generalization bound. Previous works (Oymak & Soltanolkotabi, 2020; Song et al., 2021a) established lower bounds on λ(B) with an explicit 1/n^2 dependence on n under the δ data separation assumption (see Theorem 3.11), which clearly makes the generalization bound a vacuous O(√n). This motivates us to provide a tighter bound (desirably independent of n) on the least eigenvalue of the infinite-width NTK in order to make the generalization bound in Theorem 3.8 valid and useful. However, it turns out that there are major difficulties in proving a better lower bound in the general case, and thus we are only able to present a better lower bound when we restrict the domain to some (data-dependent) region. Definition 3.10 (Data-dependent Region). Let p_{ij} = P_{w∼N(0,I)}[w^⊤x_i ≥ B, w^⊤x_j ≥ B] for i ≠ j. Define the (data-dependent) region

R = {a ∈ R^n : Σ_{i≠j} a_i a_j p_{ij} ≥ min_{i′≠j′} p_{i′j′} · Σ_{i≠j} a_i a_j}.

Notice that R is non-empty for any input dataset, since R^n_+ ⊂ R, where R^n_+ denotes the set of vectors with non-negative entries; moreover, R = R^n if all the p_{ij} with i ≠ j are equal. Theorem 3.11 (Restricted Least Eigenvalue). Let X = (x_1, . . . , x_n) be points in R^d with ∥x_i∥_2 = 1 for all i ∈ [n], and let w ∼ N(0, I_d). Suppose that there exists δ ∈ [0, √2] such that min_{i≠j∈[n]} min(∥x_i - x_j∥_2, ∥x_i + x_j∥_2) ≥ δ. Let B ≥ 0. Consider the minimal eigenvalue of H^∞ over the data-dependent region R defined above, i.e., let λ := min_{∥a∥_2=1, a∈R} a^⊤ H^∞ a. Then, λ ≥ max(0, λ′), where

λ′ ≥ max(1/2 - B/√(2π), (1/B - 1/B^3) · e^{-B^2/2}/√(2π)) - e^{-B^2/(2-δ^2/2)} · (π - arctan(δ√(1-δ^2/4)/(1-δ^2/2)))/(2π).    (3)
To demonstrate the usefulness of our result, if we take the bias initialization B = 0 in Equation (3), the bound yields (1/(2π)) · arctan(δ√(1-δ^2/4)/(1-δ^2/2)) ≈ δ/(2π) when δ is close to 0, whereas (Song et al., 2021a) yields a bound of δ/n^2. On the other hand, if the data has maximal separation, i.e., δ = √2, we get a max(1/2 - B/√(2π), (1/B - 1/B^3) e^{-B^2/2}/√(2π)) lower bound, whereas (Song et al., 2021a) yields a bound of exp(-B^2/2)√2/n^2. Connecting to our convergence result in Theorem 3.1, if f(t) - y ∈ R, then the error can be reduced at a much faster rate than the (pessimistic) rate with 1/n^2 dependence in the previous studies, as long as the error vector stays in the region. Remark 3.12. The lower bound on the restricted smallest eigenvalue λ in Theorem 3.11 is independent of n, which makes the generalization bound in Theorem 3.8 vanish as fast as O(1/√n). Such a lower bound is much sharper than the previous results with an explicit 1/n^2 dependence, which yield vacuous generalization. This improvement relies on the fact that the label vector should lie in the region R, which can be ensured by a simple label-shifting strategy as follows. Since R^n_+ ⊂ R, the condition can be easily achieved by training the neural network on the shifted labels y + C (with appropriate broadcast), where C is a constant such that min_i y_i + C ≥ 0. Careful readers may notice that in the proof of Theorem 3.11 in Appendix B, the restricted least eigenvalue on R^n_+ is always positive even if the data separation is zero. However, we would like to point out that the generalization bound in Theorem 3.8 is meaningful only when the training is successful: when the data separation is zero, the limiting NTK is no longer positive definite and the training loss cannot be minimized toward zero.
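The label-shifting strategy of Remark 3.12 is a one-liner; below is a hypothetical sketch of ours (the training step itself is elided), choosing C so that the shifted labels lie in R^n_+ ⊂ R and undoing the shift at prediction time:

```python
import numpy as np

def shift_labels(y):
    # choose the smallest C >= 0 with min_i y_i + C >= 0, so y + C lies in R^n_+
    C = max(0.0, -float(np.min(y)))
    return y + C, C

y = np.array([-1.2, 0.3, 2.0, -0.5])
y_shifted, C = shift_labels(y)
print(y_shifted, C)        # all entries >= 0, so y_shifted is in the region R

# ... train the network on (X, y_shifted) instead of (X, y) ...
# at prediction time, undo the shift: y_pred = f(x_test) - C
```

Since the shift is a constant, subtracting C from the trained network's output recovers predictions for the original labels.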

3.3. KEY IDEAS IN THE PROOF OF THEOREM 3.8

Since each neuron weight and bias move little from their initialization, a natural approach is to bound the generalization via localized Rademacher complexity. After that, we can apply appropriate concentration bounds to derive generalization. The main effort of our proof is devoted to bounding the weight movement in order to bound the localized Rademacher complexity. If we directly take the setting in Theorem 3.1 and compute the network's localized Rademacher complexity, we encounter a term that does not diminish with the number of samples n and can be as large as O(√n), since the network outputs non-zero values at initialization. Arora et al. (2019) and Song & Yang (2019) resolved this issue by instead initializing the neural network weights from N(0, κ^2 I) to force the neural network to output something close to zero at initialization. The magnitude of κ is chosen to balance different terms in the final Rademacher complexity bound. A similar approach can be adapted to our case by initializing the weights from N(0, κ^2 I) and the biases as κB. However, the drawback of such an approach is that the effect of κ on all the previously established convergence results needs to be carefully tracked or re-derived. In particular, in order to guarantee convergence, the neural network's width needs to have a polynomial dependence on 1/κ, where 1/κ has a polynomial dependence on n and 1/λ, which means their network width needs to be larger to compensate for the initialization scaling. We resolve this issue by the symmetric initialization of Definition 3.6, which has no effect (up to constant factors) on the previously established convergence results; see (Munteanu et al., 2022). Symmetric initialization allows us to organically reuse the results derived for convergence in the generalization analysis, which leads to a more succinct analysis. Further, we replace the ℓ_1-ℓ_2 norm upper bound by finer inequalities in various places in the original analysis.
All these improvements lead to the following upper bound on the weight matrix change in Frobenius norm. Combined with our sparsity-inducing initialization, we obtain the following sparsity-dependent Frobenius norm bound on the weight matrix change. Lemma 3.13. Assume the one-hidden-layer neural network is initialized by the symmetric initialization in Definition 3.6. Further, assume the parameter settings in Theorem 3.1. Then, with probability at least 1 - δ - e^{-Ω(n)} over the random initialization, we have for all t ≥ 0,

∥[W, b](t) - [W, b](0)∥_F ≤ √(y^⊤(H^∞)^{-1}y) + O(n√(exp(-B^2/2) log(n/δ))/(λ m^{1/4})) + O(n√(R exp(-B^2/2))/λ) + (n/λ^2) · O(√(exp(-B^2/4) log(n^2/δ)/m) + R exp(-B^2/2)),

where R = R_w + R_b denotes the maximum magnitude of the neuron weight and bias change. By Lemma A.9 and Lemma A.11 in the Appendix, we have R = O(n/(λ√m)). Plugging in and setting B = 0, we get

∥[W, b](t) - [W, b](0)∥_F ≤ √(y^⊤(H^∞)^{-1}y) + O(n/(λ m^{1/4}) + n^{3/2}/(λ^{3/2} m^{1/4}) + n/(λ^2 √m) + n^2/(λ^3 √m)).

On the other hand, taking κ = 1, (Song & Yang, 2019, Lemma G.6) yields a bound of ∥W(t) - W(0)∥_F ≤ √(y^⊤(H^∞)^{-1}y) + O(n/λ + n^{7/2} poly(1/λ)/m^{1/4}). Notice that the O(n/λ) term has no dependence on 1/m and is removed by symmetric initialization in our analysis, and we improve the upper bound's dependence on n by a factor of n^2. We defer the full proofs of Theorem 3.8 and Lemma 3.13 to Appendix C.
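For intuition, symmetric initialization (Definition 3.6) can be sketched as follows (a minimal numpy illustration of ours, with arbitrary sizes): each neuron is paired with a copy whose output weight is negated, so the network output is identically zero at initialization, removing the non-diminishing O(√n) term without any κ-rescaling:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, B = 256, 10, 0.5

W_half = rng.standard_normal((m, d))
a_half = rng.choice([-1.0, 1.0], size=m)
W = np.vstack([W_half, W_half])           # w_{m+r} = w_r
a = np.concatenate([a_half, -a_half])     # a_{m+r} = -a_r
b = np.full(2 * m, B)                     # biases initialized to the constant B

def f(x):
    # paired neurons produce identical ReLU outputs with opposite signs
    return a @ np.maximum(W @ x - b, 0.0) / np.sqrt(2 * m)

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
print(f(x))                               # 0 (up to rounding) for every input
```

The pairing only affects the output, not the distribution of each individual neuron, which is why the convergence results carry over up to constant factors.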

3.4. KEY IDEAS IN THE PROOF OF THEOREM 3.11

In this section, we analyze the smallest eigenvalue λ := λ_min(H^∞) of the limiting NTK H^∞ under δ data separation. We first note that H^∞ ⪰ E_{w∼N(0,I)}[I(X^⊤w ≥ B) I(X^⊤w ≥ B)^⊤], and for a fixed vector a, we are interested in a lower bound on E_{w∼N(0,I)}[|a^⊤ I(X^⊤w ≥ B)|^2]. In previous works, Oymak & Soltanolkotabi (2020) showed a lower bound Ω(δ/n^2) for zero-initialized bias, and later Song et al. (2021a) generalized this result to a lower bound Ω(e^{-B^2/2} δ/n^2) for non-zero initialized bias. Both lower bounds have a dependence of 1/n^2. Their approach uses an intricate Markov's inequality argument and then proves a lower bound on P[|a^⊤ I(X^⊤w ≥ B)| ≥ c∥a∥_∞]. The lower bound is proved by only considering the contribution from the largest coordinate of a and treating all other values as noise. It is not surprising that the lower bound has a factor of 1/n, since a can have identical entries. On the other hand, the diagonal entries give an exp(-B^2/2) upper bound, and thus there is a 1/n^2 gap between the two. Now, we give some evidence suggesting that the 1/n^2 dependence may not be tight in some cases. Consider the following scenario: assume n ≪ d and the data set is orthonormal. For a fixed unit vector a, we have

a^⊤ E_{w∼N(0,I)}[I(X^⊤w ≥ B) I(X^⊤w ≥ B)^⊤] a = Σ_{i,j∈[n]} a_i a_j P[w^⊤x_i ≥ B, w^⊤x_j ≥ B] = p_0 ∥a∥_2^2 + p_1 Σ_{i≠j} a_i a_j = p_0 - p_1 + p_1 (Σ_i a_i)^2 ≥ p_0 - p_1,

where p_0, p_1 ∈ [0, 1] are well-defined, thanks to the spherical symmetry of the standard Gaussian, by p_0 = P[w^⊤x_i ≥ B] for all i ∈ [n] and p_1 = P[w^⊤x_i ≥ B, w^⊤x_j ≥ B] for all i, j ∈ [n], i ≠ j. Notice that p_0 > p_1. Since this holds for all unit vectors a ∈ R^n, we get a lower bound of p_0 - p_1 with no explicit dependence on n, and this holds for all n ≤ d. When d is large and n = d/2, this bound is better than the previous bound by a factor of Θ(d^2). However, it turns out that the product terms with i ≠ j above create major difficulties in analyzing the general case.
Due to these technical difficulties, we are only able to prove a better lower bound by restricting the domain to a certain data-dependent region and utilizing the extra constant term in the NTK contributed by the trainable bias. We defer the proof of Theorem 3.11 to Appendix B.
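To make the orthonormal-data heuristic above concrete, the following sketch (our own illustration, not part of the paper's analysis; the function names are ours) Monte-Carlo estimates E_w[I(Xw ≥ B)I(Xw ≥ B)^⊤] for an orthonormal data set and checks that its smallest eigenvalue stays near p_0 − p_1 regardless of n:

```python
import numpy as np

def min_eig_indicator_gram(n, B, trials=20000, seed=0):
    """Monte-Carlo estimate of the smallest eigenvalue of
    E_w[I(Xw >= B) I(Xw >= B)^T] when X has orthonormal rows,
    so the coordinates of Xw are i.i.d. N(0, 1)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((trials, n))       # rows play the role of Xw
    A = (G >= B).astype(float)                 # indicator vectors I(Xw >= B)
    M = A.T @ A / trials                       # estimate of the expectation
    return float(np.linalg.eigvalsh(M)[0])

def p0_minus_p1(B, trials=200000, seed=1):
    """p_0 - p_1 for orthonormal data: the coordinates of Xw are
    independent, so p_1 = p_0^2 and the lower bound is p_0 (1 - p_0)."""
    rng = np.random.default_rng(seed)
    p0 = float((rng.standard_normal(trials) >= B).mean())
    return p0 - p0 ** 2
```

At B = 0 both quantities are close to 1/4 for any n ≤ d, with no visible 1/n² decay.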

4. EXPERIMENTS

In this section, we study how the activation sparsity patterns of multi-layer neural networks change during training when the bias parameters are initialized to non-zero values.

Settings. We train a 6-layer multi-layer perceptron (MLP) of width 1024 with trainable bias terms on MNIST image classification (LeCun et al., 2010). The biases of the fully-connected layers are initialized to 0, -0.5 and -1. The weights of the linear layers use Kaiming initialization (He et al., 2015), i.e., they are sampled from an appropriately scaled Gaussian distribution. The traditional MLP architecture only has linear layers with ReLU activation. However, we found that under the sparsity-inducing initialization the magnitude of the activations decreases geometrically layer by layer, which leads to vanishing gradients and an untrainable network. Thus, we make a slight modification to the MLP architecture and insert an extra Batch Normalization after each ReLU to normalize the activations. Our MLP implementation is based on (Zhu et al., 2021). We train the neural network by stochastic gradient descent with a small learning rate 5e-3 to make sure training stays in the NTK regime. Sparsity is measured as the total number of activated neurons (i.e., neurons whose ReLU outputs a positive value) divided by the total number of neurons, averaged over every SGD batch. We plot how the sparsity pattern of each layer changes during training.

Observation and Implication. As demonstrated in Figure 1, for all three bias initialization values the sparsity patterns are stable across all layers during training: when the bias is initialized to 0 or -0.5, the sparsity change is within 2.5%; when the bias is initialized to -1.0, the sparsity change is within 10%. Meanwhile, increasing the magnitude of the bias initialization increases the sparsity level with only a marginal drop in accuracy.
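As a toy stand-in for the measurement above (a single Kaiming-initialized fully connected layer on Gaussian inputs, rather than the actual 6-layer MLP on MNIST; sizes are hypothetical), the activation sparsity induced by a constant negative bias can be estimated as follows:

```python
import numpy as np

def activation_sparsity(width=1024, d=784, bias_init=-0.5, n_samples=256, seed=0):
    """Fraction of active ReLU units in one Kaiming-initialized
    fully connected layer with a constant bias, on Gaussian inputs."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((width, d)) * np.sqrt(2.0 / d)  # Kaiming init
    X = rng.standard_normal((n_samples, d))
    pre = X @ W.T + bias_init                               # pre-activations
    return float((pre > 0).mean())
```

Here each pre-activation is approximately N(bias_init, 2), so bias 0 gives roughly 50% activity while more negative biases give sparser activations, mirroring the trend observed in the experiments.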
This implies that our theory can be extended to the multi-layer setting (with some extra care to cope with vanishing gradients) and that multi-layer neural networks can also benefit from the sparsity-inducing initialization and enjoy a reduction of the computational cost. Another interesting observation is that the input layer (layer 0) has a different sparsity pattern from the other layers, while all the remaining layers behave similarly.

A CONVERGENCE

Notation simplification. Since every occurrence of the smallest eigenvalue of the limiting NTK in this proof depends on the bias initialization parameter B, for ease of notation we suppress this dependence and write λ := λ(B) = λ_min(H∞(B)).

A.1 DIFFERENCE BETWEEN LIMIT NTK AND SAMPLED NTK

Lemma A.1. For a given bias vector b ∈ R^m with b_r ≥ 0 for all r ∈ [m], the limit NTK H∞ and the sampled NTK H are given as H∞_ij := (1/m) Σ_{r=1}^m E_{w∼N(0,I)}[(⟨x_i, x_j⟩ + 1) I(w^⊤x_i ≥ b_r, w^⊤x_j ≥ b_r)] and H_ij := (1/m) Σ_{r=1}^m (⟨x_i, x_j⟩ + 1) I(w_r^⊤x_i ≥ b_r, w_r^⊤x_j ≥ b_r). Define λ := λ_min(H∞) and assume λ > 0. If the network width satisfies m = Ω(λ^{-1} n log(n/δ)), then P[λ_min(H) ≥ (3/4)λ] ≥ 1 − δ.

Proof. Let H_r := (1/m) X(w_r)^⊤X(w_r), where X(w_r) ∈ R^{(d+1)×n} is defined as X(w_r) := [I(w_r^⊤x_1 ≥ b_r)·(x_1, 1), …, I(w_r^⊤x_n ≥ b_r)·(x_n, 1)] and (x_i, 1) denotes the vector x_i appended by 1. Hence H_r ⪰ 0. Since each entry satisfies (H_r)_ij = (1/m)(⟨x_i, x_j⟩ + 1) I(w_r^⊤x_i ≥ b_r, w_r^⊤x_j ≥ b_r) ≤ (1/m)(⟨x_i, x_j⟩ + 1) ≤ 2/m, we can naively upper bound ∥H_r∥_2 by ∥H_r∥_2 ≤ ∥H_r∥_F ≤ √(n² · 4/m²) = 2n/m. Then H = Σ_{r=1}^m H_r and E[H] = H∞. Hence, by the Matrix Chernoff bound in Lemma D.2 and choosing m = Ω(λ^{-1} n log(n/δ)), we can show that P[λ_min(H) ≤ (3/4)λ] ≤ n·exp(−(1/16)·λ/(4n/m)) = n·exp(−λm/(64n)) ≤ δ.

Lemma A.2. Assume m = n^{O(1)} and exp(B²/2) = O(√m), where we recall that B is the initialization value of the biases.
With probability at least 1 − δ, we have ∥H(0) − H∞∥_F ≤ 4n exp(−B²/4)·√(log(n²/δ)/m).

Proof. First, we have E[((⟨x_i, x_j⟩ + 1) I_{r,i}(0) I_{r,j}(0))²] ≤ 4 exp(−B²/2). Then, by Bernstein's inequality in Lemma D.1, with probability at least 1 − δ/n², |H_ij(0) − H∞_ij| ≤ 2 exp(−B²/4)·√(2 log(n²/δ)/m) + (2/m)·2 log(n²/δ) ≤ 4 exp(−B²/4)·√(log(n²/δ)/m), where the last inequality uses exp(B²/2) = O(√m) and m = n^{O(1)}. By a union bound, this holds simultaneously for all i, j ∈ [n] with probability at least 1 − δ, which implies ∥H(0) − H∞∥_F ≤ 4n exp(−B²/4)·√(log(n²/δ)/m).
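The concentration statements of Lemmas A.1 and A.2 are easy to check empirically. The sketch below (our own illustration; names and sizes are ours) builds the sampled NTK H for the special case b_r ≡ B and compares its smallest eigenvalue with the limiting value:

```python
import numpy as np

def sampled_ntk(X, B, m, seed=0):
    """Finite-width NTK of Lemma A.1 with every bias fixed at B:
    H_ij = (1/m) sum_r (<x_i, x_j> + 1) I(w_r^T x_i >= B, w_r^T x_j >= B)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((m, d))
    A = (W @ X.T >= B).astype(float)           # m x n activation pattern
    return (X @ X.T + 1.0) * (A.T @ A) / m
```

For orthonormal data and B = 0, the limit NTK has diagonal 2·p_0 = 1 and off-diagonal entries p_1 = 1/4, so λ_min(H∞) = 3/4; with large m the sampled λ_min(H) lands close to this value, as Lemma A.1 predicts.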

A.2 BOUNDING THE NUMBER OF FLIPPED NEURONS

Lemma A.4. For i ∈ [n] and r ∈ [m], define the event A_{i,r} = {∃ w_r, b_r : ∥w̃_r − w_r∥_2 ≤ R_w, |b_r − b̃_r| ≤ R_b, I(x_i^⊤w̃_r ≥ b̃_r) ≠ I(x_i^⊤w_r ≥ b_r)}. Then, P[A_{i,r}] ≤ c(R_w + R_b) exp(−B²/2) for some constant c.

Proof. Notice that the event A_{i,r} happens if and only if |w̃_r^⊤x_i − b̃_r| < R_w + R_b. First, if B > 1, then by Lemma D.3 we have P[A_{i,r}] ≤ (R_w + R_b)·(1/√(2π)) exp(−(B − R_w − R_b)²/2) ≤ c_1(R_w + R_b) exp(−B²/2) for some constant c_1. If 0 ≤ B < 1, the above analysis does not apply since it is possible that B − R_w − R_b ≤ 0. In this case, the probability is at most P[A_{i,r}] ≤ 2(R_w + R_b)·(1/√(2π)) exp(−0²/2) = 2(R_w + R_b)/√(2π). However, since 0 ≤ B < 1 in this case, we have exp(−1²/2) ≤ exp(−B²/2) ≤ exp(−0²/2). Therefore, P[A_{i,r}] ≤ c_2(R_w + R_b) exp(−B²/2) for c_2 = 2 exp(1/2)/√(2π). Taking c = max{c_1, c_2} finishes the proof.

Corollary A.5. We have P[r ∈ S_i] ≤ c(R_w + R_b) exp(−B²/2) for some constant c (recall that S_i, defined in Definition A.3, is the set of neurons that flip for input x_i), which implies P[∀i ∈ [n] : |S_i| ≤ 2mc(R_w + R_b) exp(−B²/2)] ≥ 1 − n·exp(−(2/3) mc(R_w + R_b) exp(−B²/2)).

Proof. The proof is by observing that P[r ∈ S_i] ≤ P[A_{i,r}]. Then, by Bernstein's inequality, P[|S_i| > t] ≤ exp(−(t²/2)/(mc(R_w + R_b) exp(−B²/2) + t/3)). Taking t = 2mc(R_w + R_b) exp(−B²/2) and a union bound over [n], we have P[∀i ∈ [n] : |S_i| ≤ 2mc(R_w + R_b) exp(−B²/2)] ≥ 1 − n·exp(−(2/3) mc(R_w + R_b) exp(−B²/2)).
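Lemma A.4 can be sanity-checked by Monte Carlo: the flip event reduces to |g − B| < R_w + R_b for g ∼ N(0, 1). The sketch below (our own illustration) compares the empirical probability against the claimed bound, using the constant c_2 = 2e^{1/2}/√(2π) from the proof:

```python
import numpy as np
from math import exp, sqrt, pi

def flip_probability(B, R, trials=500000, seed=0):
    """Monte-Carlo estimate of P[|g - B| < R] for g ~ N(0, 1): the chance
    that a perturbation of total size R = R_w + R_b can flip a neuron."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(trials)
    return float((np.abs(g - B) < R).mean())

def flip_bound(B, R):
    """The claimed bound c (R_w + R_b) exp(-B^2/2) with c = 2 e^{1/2}/sqrt(2 pi)."""
    c = 2.0 * exp(0.5) / sqrt(2.0 * pi)
    return c * R * exp(-B * B / 2.0)
```

The bound dominates the empirical flip probability across bias levels, and both decay like exp(−B²/2) as B grows.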

A.3 BOUNDING NTK IF PERTURBING WEIGHTS AND BIASES

Lemma A.6. Let W̃, b̃ be the values at initialization and let W, b satisfy ∥w_r − w̃_r∥_2 ≤ R_w and |b_r − b̃_r| ≤ R_b for all r ∈ [m], where H_ij(W, b) = (1/m) Σ_{r=1}^m (⟨x_i, x_j⟩ + 1) I(w_r^⊤x_i ≥ b_r, w_r^⊤x_j ≥ b_r). It satisfies that for some small positive constant c: 1. With probability at least 1 − n² exp(−(2/3) cm(R_w + R_b) exp(−B²/2)), we have ∥H(W̃, b̃) − H(W, b)∥_F ≤ n·8c(R_w + R_b) exp(−B²/2) and ∥Z(W̃, b̃) − Z(W, b)∥_F ≤ √(n·8c(R_w + R_b) exp(−B²/2)).

2. With probability at least

1 − δ − n² exp(−(2/3) cm(R_w + R_b) exp(−B²/2)), we have λ_min(H(W, b)) > 0.75λ − n·8c(R_w + R_b) exp(−B²/2).

Proof. We have ∥Z(W, b) − Z(W̃, b̃)∥²_F = Σ_{i∈[n]} (2/m) Σ_{r∈[m]} t_{r,i} and ∥H(W, b) − H(W̃, b̃)∥²_F = Σ_{i,j∈[n]} (H_ij(W, b) − H_ij(W̃, b̃))² ≤ (4/m²) Σ_{i,j∈[n]} (Σ_{r∈[m]} s_{r,i,j})², where we define s_{r,i,j} := |I(w_r^⊤x_i ≥ b_r, w_r^⊤x_j ≥ b_r) − I(w̃_r^⊤x_i ≥ b̃_r, w̃_r^⊤x_j ≥ b̃_r)| and t_{r,i} := (I(w_r^⊤x_i ≥ b_r) − I(w̃_r^⊤x_i ≥ b̃_r))². Notice that t_{r,i} = 1 only if the event A_{i,r} happens (recall the definition of A_{i,r} in Lemma A.4) and s_{r,i,j} = 1 only if the event A_{i,r} or A_{j,r} happens. Thus, Σ_{r∈[m]} t_{r,i} ≤ Σ_{r∈[m]} I(A_{i,r}) and Σ_{r∈[m]} s_{r,i,j} ≤ Σ_{r∈[m]} (I(A_{i,r}) + I(A_{j,r})). By Lemma A.4, we have E_{w_r}[s_{r,i,j}] ≤ E_{w_r}[s²_{r,i,j}] ≤ P_{w_r}[A_{i,r}] + P_{w_r}[A_{j,r}] ≤ 2c(R_w + R_b) exp(−B²/2). Define s_{i,j} = Σ_{r=1}^m (I(A_{i,r}) + I(A_{j,r})). By Bernstein's inequality in Lemma D.1, for all t ≥ 0, P[s_{i,j} ≥ m·2c(R_w + R_b) exp(−B²/2) + mt] ≤ exp(−(m²t²/2)/(m·2c(R_w + R_b) exp(−B²/2) + mt/3)). Let t = 2c(R_w + R_b) exp(−B²/2). We get P[s_{i,j} ≥ m·4c(R_w + R_b) exp(−B²/2)] ≤ exp(−(2/3) cm(R_w + R_b) exp(−B²/2)). Thus, with probability at least 1 − n² exp(−(2/3) cm(R_w + R_b) exp(−B²/2)), ∥H(W̃, b̃) − H(W, b)∥_F ≤ n·8c(R_w + R_b) exp(−B²/2) and ∥Z(W̃, b̃) − Z(W, b)∥_F ≤ √(n·8c(R_w + R_b) exp(−B²/2)). For the second result, by Lemma A.1, P[λ_min(H(W̃, b̃)) ≥ 0.75λ] ≥ 1 − δ. Hence, with probability at least 1 − δ − n² exp(−(2/3) cm(R_w + R_b) exp(−B²/2)), λ_min(H(W, b)) ≥ λ_min(H(W̃, b̃)) − ∥H(W, b) − H(W̃, b̃)∥_2 ≥ λ_min(H(W̃, b̃)) − ∥H(W, b) − H(W̃, b̃)∥_F ≥ 0.75λ − n·8c(R_w + R_b) exp(−B²/2).

A.4 TOTAL MOVEMENT OF WEIGHTS AND BIASES

Definition A.7 (NTK at time t). For t ≥ 0, let H(t) be the n × n matrix with (i, j)-th entry H_ij(t) := ⟨∂f(x_i; θ(t))/∂θ(t), ∂f(x_j; θ(t))/∂θ(t)⟩ = (1/m) Σ_{r=1}^m (⟨x_i, x_j⟩ + 1) I(w_r(t)^⊤x_i ≥ b_r(t), w_r(t)^⊤x_j ≥ b_r(t)).

We follow the proof strategy of (Du et al., 2018) and derive the total movement of the weights and biases. Let f(t) = f(X; θ(t)), where f_i(t) = f(x_i; θ(t)). The dynamics of each prediction is given by (d/dt) f_i(t) = ⟨∂f(x_i; θ(t))/∂θ(t), dθ(t)/dt⟩ = Σ_{j=1}^n (y_j − f_j(t)) ⟨∂f(x_i; θ(t))/∂θ(t), ∂f(x_j; θ(t))/∂θ(t)⟩ = Σ_{j=1}^n (y_j − f_j(t)) H_ij(t), which implies (d/dt) f(t) = H(t)(y − f(t)).

Lemma A.8 (Gradient Bounds). For any 0 ≤ s ≤ t, we have ∥∂L(W(s), b(s))/∂w_r(s)∥_2 ≤ √(n/m) ∥f(s) − y∥_2 and |∂L(W(s), b(s))/∂b_r(s)| ≤ √(n/m) ∥f(s) − y∥_2.

Proof. We have ∥∂L(W(s), b(s))/∂w_r(s)∥_2 = ∥(1/√m) Σ_{i=1}^n (f(x_i; W(s), b(s)) − y_i) a_r x_i I(w_r(s)^⊤x_i ≥ b_r(s))∥_2 ≤ (1/√m) Σ_{i=1}^n |f(x_i; W(s), b(s)) − y_i| ≤ √(n/m) ∥f(s) − y∥_2, where the first inequality follows from the triangle inequality and the second from the Cauchy-Schwarz inequality. Similarly, |∂L(W(s), b(s))/∂b_r(s)| = |(1/√m) Σ_{i=1}^n (f(x_i; W(s), b(s)) − y_i) a_r I(w_r(s)^⊤x_i ≥ b_r(s))| ≤ (1/√m) Σ_{i=1}^n |f(x_i; W(s), b(s)) − y_i| ≤ √(n/m) ∥f(s) − y∥_2.

A.4.1 GRADIENT DESCENT

Lemma A.9. Assume λ > 0 and that ∥y − f(k′)∥²_2 ≤ (1 − ηλ/4)^{k′} ∥y − f(0)∥²_2 holds for all k′ ≤ k. Then for every r ∈ [m], ∥w_r(k+1) − w_r(0)∥_2 ≤ 8√n ∥y − f(0)∥_2/(√m λ) =: D_w and |b_r(k+1) − b_r(0)| ≤ 8√n ∥y − f(0)∥_2/(√m λ) =: D_b.

Proof.
∥w_r(k+1) − w_r(0)∥_2 ≤ η Σ_{k′=0}^k ∥∂L(W(k′), b(k′))/∂w_r(k′)∥_2 ≤ η Σ_{k′=0}^k √(n/m) ∥y − f(k′)∥_2 ≤ η Σ_{k′=0}^k √(n/m) (1 − ηλ/4)^{k′/2} ∥y − f(0)∥_2 ≤ η Σ_{k′=0}^k √(n/m) (1 − ηλ/8)^{k′} ∥y − f(0)∥_2 ≤ η Σ_{k′=0}^∞ √(n/m) (1 − ηλ/8)^{k′} ∥y − f(0)∥_2 = 8√n ∥y − f(0)∥_2/(√m λ), where the first inequality is by the triangle inequality, the second by Lemma A.8, the third by our assumption, and the fourth by (1 − x)^{1/2} ≤ 1 − x/2 for x ∈ [0, 1]. The proof for b_r is similar.

A.4.2 GRADIENT FLOW

Lemma A.10. Suppose that for 0 ≤ s ≤ t we have λ_min(H(s)) ≥ λ_0/2 > 0. Then ∥y − f(t)∥²_2 ≤ exp(−λ_0 t) ∥y − f(0)∥²_2, and for any r ∈ [m], ∥w_r(t) − w_r(0)∥_2 ≤ 2√n ∥y − f(0)∥_2/(√m λ_0) and |b_r(t) − b_r(0)| ≤ 2√n ∥y − f(0)∥_2/(√m λ_0).

Proof. By the dynamics of the predictions in Equation (4), we have (d/dt) ∥y − f(t)∥²_2 = −2(y − f(t))^⊤H(t)(y − f(t)) ≤ −λ_0 ∥y − f(t)∥²_2, which implies ∥y − f(t)∥²_2 ≤ exp(−λ_0 t) ∥y − f(0)∥²_2.

Now we bound the gradient norm of the weights

∥(d/ds) w_r(s)∥_2 = ∥Σ_{i=1}^n (y_i − f_i(s)) (1/√m) a_r x_i I(w_r(s)^⊤x_i ≥ b_r(s))∥_2 ≤ (1/√m) Σ_{i=1}^n |y_i − f_i(s)| ≤ (√n/√m) ∥y − f(s)∥_2 ≤ (√n/√m) exp(−λ_0 s/2) ∥y − f(0)∥_2. Integrating the gradient, the change of the weights can be bounded as ∥w_r(t) − w_r(0)∥_2 ≤ ∫_0^t ∥(d/ds) w_r(s)∥_2 ds ≤ 2√n ∥y − f(0)∥_2/(√m λ_0). For the bias, we have |(d/ds) b_r(s)| = |Σ_{i=1}^n (y_i − f_i(s)) (1/√m) a_r I(w_r(s)^⊤x_i ≥ b_r(s))| ≤ (1/√m) Σ_{i=1}^n |y_i − f_i(s)| ≤ (√n/√m) ∥y − f(s)∥_2 ≤ (√n/√m) exp(−λ_0 s/2) ∥y − f(0)∥_2. Now, the change of the bias can be bounded as |b_r(t) − b_r(0)| ≤ ∫_0^t |(d/ds) b_r(s)| ds ≤ 2√n ∥y − f(0)∥_2/(√m λ_0).

A.5 GRADIENT DESCENT CONVERGENCE ANALYSIS

A.5.1 UPPER BOUND OF THE INITIAL ERROR

Lemma A.11 (Initial error upper bound). Let B ≥ 0 be the initialization value of the biases and let all the weights be initialized from the standard Gaussian. Let δ ∈ (0, 1) be the failure probability. Then, with probability at least 1 − δ, we have ∥f(0)∥²_2 = O(n(exp(−B²/2) + 1/m) log³(mn/δ)) and ∥f(0) − y∥²_2 = O(n(1 + (exp(−B²/2) + 1/m) log³(2mn/δ))).

Proof. Since we only analyze the initialization stage, we omit the dependence on time for notational ease. We compute ∥y − f∥²_2 = Σ_{i=1}^n (y_i − f(x_i))² = Σ_{i=1}^n (y_i − (1/√m) Σ_{r=1}^m a_r σ(w_r^⊤x_i − B))² = Σ_{i=1}^n (y_i² − (2y_i/√m) Σ_{r=1}^m a_r σ(w_r^⊤x_i − B) + ((1/√m) Σ_{r=1}^m a_r σ(w_r^⊤x_i − B))²). Since w_r^⊤x_i ∼ N(0, 1) for all r ∈ [m] and i ∈ [n], by the Gaussian tail bound and a union bound over r, i, we have P[∀i ∈ [n], r ∈ [m] : w_r^⊤x_i ≤ √(2 log(2mn/δ))] ≥ 1 − δ/2. Let E_1 denote this event. Conditioning on the event E_1, let z_{i,r} := (1/√m)·a_r·min{σ(w_r^⊤x_i − B), √(2 log(2mn/δ))}. Notice that z_{i,r} ≠ 0 with probability at most exp(−B²/2). Thus, E_{a_r,w_r}[z²_{i,r}] ≤ exp(−B²/2)·(1/m)·2 log(2mn/δ). By the randomness of a_r, we know E[z_{i,r}] = 0. Now applying Bernstein's inequality in Lemma D.1, we have for all t > 0, P[Σ_{r=1}^m z_{i,r} > t] ≤ exp(−min{(t²/2)/(4 exp(−B²/2) log(2mn/δ)), √m t/(2√(2 log(2mn/δ)))}). Thus, by a union bound, with probability at least 1 − δ/2, for all i ∈ [n], Σ_{r=1}^m z_{i,r} ≤ √(2 exp(−B²/2)·2 log(2mn/δ) log(2n/δ)) + 2√(2 log(2mn/δ)/m)·log(2n/δ) ≤ (2 exp(−B²/4) + 2√(2/m)) log^{3/2}(2mn/δ). Let E_2 denote this event. Thus, conditioning on the events E_1, E_2, with probability 1 − δ, ∥f(0)∥²_2 = Σ_{i=1}^n (Σ_{r=1}^m z_{i,r})² = O(n(exp(−B²/2) + 1/m) log³(mn/δ)) and ∥y − f(0)∥²_2 = Σ_{i=1}^n y_i² − 2 Σ_{i=1}^n y_i Σ_{r=1}^m z_{i,r} + Σ_{i=1}^n (Σ_{r=1}^m z_{i,r})² ≤ Σ_{i=1}^n y_i² + 2 Σ_{i=1}^n |y_i|·(2 exp(−B²/4) + 2√(2/m)) log^{3/2}(2mn/δ) + Σ_{i=1}^n ((2 exp(−B²/4) + 2√(2/m)) log^{3/2}(2mn/δ))² = O(n(1 + (exp(−B²/2) + 1/m) log³(2mn/δ))), where we assume y_i = O(1) for all i ∈ [n].
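The exp(−B²/2) scaling of the initial output in Lemma A.11 can be observed numerically. The following sketch (our own illustration with hypothetical sizes) estimates the average squared network output at initialization:

```python
import numpy as np

def init_output_sq(B, m=20000, d=200, n=100, seed=0):
    """Average of f(x_i)^2 at initialization for
    f(x) = (1/sqrt(m)) sum_r a_r ReLU(w_r^T x - B), over n unit-norm inputs."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    F = np.maximum(W @ X.T - B, 0.0).T @ a / np.sqrt(m)   # outputs f(x_i)
    return float(np.mean(F ** 2))
```

At B = 0 the per-input second moment is E[ReLU(g)²] = 1/2 for g ∼ N(0, 1), while at B = 2 it drops by roughly two orders of magnitude, consistent with the exp(−B²/2) factor in the lemma.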

A.5.2 ERROR DECOMPOSITION

We follow the proof outline of (Song & Yang, 2019; Song et al., 2021a) and generalize it to networks with trainable biases b. Define the matrix H⊥, which is similar to H except that it only collects the contribution of the flipped neurons, by H⊥_ij(k) := (1/m) Σ_{r∈S_i} (⟨x_i, x_j⟩ + 1) I(w_r(k)^⊤x_i ≥ b_r(k), w_r(k)^⊤x_j ≥ b_r(k)), and define the vectors v_1, v_2 by v_{1,i} := (1/√m) Σ_{r∉S_i} a_r (σ(⟨w_r(k+1), x_i⟩ − b_r(k+1)) − σ(⟨w_r(k), x_i⟩ − b_r(k))) and v_{2,i} := (1/√m) Σ_{r∈S_i} a_r (σ(⟨w_r(k+1), x_i⟩ − b_r(k+1)) − σ(⟨w_r(k), x_i⟩ − b_r(k))), so that v_1 collects the non-flipped neurons and v_2 the flipped ones. Now we give our error update.

Claim A.12. ∥y − f(k+1)∥²_2 = ∥y − f(k)∥²_2 + B_1 + B_2 + B_3 + B_4, where B_1 := −2η(y − f(k))^⊤H(k)(y − f(k)), B_2 := 2η(y − f(k))^⊤H⊥(k)(y − f(k)), B_3 := −2(y − f(k))^⊤v_2, B_4 := ∥f(k+1) − f(k)∥²_2.

Proof. First, since the activation pattern of a neuron r ∉ S_i does not change between steps k and k+1, we can write v_{1,i} = (1/√m) Σ_{r∉S_i} a_r (σ(⟨w_r(k) − η ∂L/∂w_r, x_i⟩ − (b_r(k) − η ∂L/∂b_r)) − σ(⟨w_r(k), x_i⟩ − b_r(k))) = (1/√m) Σ_{r∉S_i} a_r (−η⟨∂L/∂w_r, x_i⟩ + η ∂L/∂b_r) I(⟨w_r(k), x_i⟩ − b_r(k) ≥ 0) = (1/√m) Σ_{r∉S_i} a_r (η (1/√m) Σ_{j=1}^n (y_j − f_j(k)) a_r (⟨x_j, x_i⟩ + 1) I(w_r(k)^⊤x_j ≥ b_r(k))) I(⟨w_r(k), x_i⟩ − b_r(k) ≥ 0) = η Σ_{j=1}^n (y_j − f_j(k))(H_ij(k) − H⊥_ij(k)), which means v_1 = η(H(k) − H⊥(k))(y − f(k)).

Now we compute

∥y − f(k+1)∥²_2 = ∥y − f(k) − (f(k+1) − f(k))∥²_2 = ∥y − f(k)∥²_2 − 2(y − f(k))^⊤(f(k+1) − f(k)) + ∥f(k+1) − f(k)∥²_2. Since f(k+1) − f(k) = v_1 + v_2, we can write the cross term as (y − f(k))^⊤(f(k+1) − f(k)) = (y − f(k))^⊤(v_1 + v_2) = (y − f(k))^⊤v_1 + (y − f(k))^⊤v_2 = η(y − f(k))^⊤H(k)(y − f(k)) − η(y − f(k))^⊤H⊥(k)(y − f(k)) + (y − f(k))^⊤v_2, which gives the claimed decomposition.

A.5.3 BOUNDING THE DECREASE OF THE ERROR

Lemma A.13. Assume λ > 0. Assume we choose R_w, R_b, B with R_w, R_b ≤ min{1/B, 1} such that 8cn(R_w + R_b) exp(−B²/2) ≤ λ/8. Denote δ_0 = δ + n² exp(−(2/3) cm(R_w + R_b) exp(−B²/2)). Then, P[B_1 ≤ −(5ηλ/8)∥y − f(k)∥²_2] ≥ 1 − δ_0.

Proof. By Lemma A.6 and our assumption, λ_min(H(k)) > 0.75λ − n·8c(R_w + R_b) exp(−B²/2) ≥ 5λ/8 with probability at least 1 − δ_0. Thus, (y − f(k))^⊤H(k)(y − f(k)) ≥ (5λ/8)∥y − f(k)∥²_2, and therefore B_1 = −2η(y − f(k))^⊤H(k)(y − f(k)) ≤ −(5ηλ/4)∥y − f(k)∥²_2 ≤ −(5ηλ/8)∥y − f(k)∥²_2.

A.5.4 BOUNDING THE EFFECT OF FLIPPED NEURONS

Here we bound the terms B_2 and B_3. First, we introduce a fact.

Fact A.14. ∥H⊥(k)∥²_F ≤ (4n/m²) Σ_{i=1}^n |S_i|².

Proof. ∥H⊥(k)∥²_F = Σ_{i,j∈[n]} ((1/m) Σ_{r∈S_i} (x_i^⊤x_j + 1) I(w_r(k)^⊤x_i ≥ b_r(k), w_r(k)^⊤x_j ≥ b_r(k)))² ≤ Σ_{i,j∈[n]} (1/m²)·4|S_i|² ≤ (4n/m²) Σ_{i=1}^n |S_i|².

Lemma A.15. Denote δ_0 = n exp(−(2/3) cm(R_w + R_b) exp(−B²/2)). Then, P[B_2 ≤ 8ηnc(R_w + R_b) exp(−B²/2)·∥y − f(k)∥²_2] ≥ 1 − δ_0.

Proof. First, we have B_2 ≤ 2η ∥y − f(k)∥²_2 ∥H⊥(k)∥_2. Then, by Fact A.14, ∥H⊥(k)∥²_2 ≤ ∥H⊥(k)∥²_F ≤ (4n/m²) Σ_{i=1}^n |S_i|². By Corollary A.5, we have P[∀i ∈ [n] : |S_i| ≤ 2mc(R_w + R_b) exp(−B²/2)] ≥ 1 − δ_0. Thus, with probability at least 1 − δ_0, ∥H⊥(k)∥_2 ≤ 4nc(R_w + R_b) exp(−B²/2), which gives the claim.

Lemma A.16. Denote δ_0 = n exp(−(2/3) cm(R_w + R_b) exp(−B²/2)). Then, P[B_3 ≤ 4cηn(R_w + R_b) exp(−B²/2) ∥y − f(k)∥²_2] ≥ 1 − δ_0.

Proof. By the Cauchy-Schwarz inequality, we have B_3 ≤ 2∥y − f(k)∥_2 ∥v_2∥_2. We have ∥v_2∥²_2 ≤ Σ_{i=1}^n ((η/√m) Σ_{r∈S_i} (|⟨∂L/∂w_r, x_i⟩| + |∂L/∂b_r|))² ≤ Σ_{i=1}^n (η²/m) max_{i∈[n]} (|⟨∂L/∂w_r, x_i⟩| + |∂L/∂b_r|)² |S_i|² ≤ n·(η²/m)·(2√(n/m) ∥f(k) − y∥_2)²·(2mc(R_w + R_b) exp(−B²/2))² = 16c²η²n² ∥y − f(k)∥²_2 (R_w + R_b)² exp(−B²), where the last inequality is by Lemma A.8 and Corollary A.5, which hold with probability at least 1 − δ_0.

A.5.5 BOUNDING THE NETWORK UPDATE

Lemma A.17. For some constant C_2, B_4 ≤ C_2²η²n² ∥y − f(k)∥²_2 exp(−B²). (A naive bound that does not use sparsity is B_4 ≤ 4η²n² ∥y − f(k)∥²_2.)

Proof. The naive bound follows from ∥f(k+1) − f(k)∥²_2 ≤ Σ_{i=1}^n ((η/√m) Σ_{r=1}^m (|⟨∂L/∂w_r, x_i⟩| + |∂L/∂b_r|))² ≤ 4η²n² ∥y − f(k)∥²_2. For the sharper bound, recall the definition S_on(i, t) = {r ∈ [m] : w_r(t)^⊤x_i ≥ b_r(t)}, i.e., the set of neurons that are activated for input x_i at the t-th step of gradient descent. Then ∥f(k+1) − f(k)∥²_2 ≤ Σ_{i=1}^n ((η/√m) Σ_{r∈S_on(i,k+1)∪S_on(i,k)} (|⟨∂L/∂w_r, x_i⟩| + |∂L/∂b_r|))² ≤ Σ_{i=1}^n (η²/m)(|S_on(i,k+1)| + |S_on(i,k)|)² max_{i∈[n]} (|⟨∂L/∂w_r, x_i⟩| + |∂L/∂b_r|)² ≤ n·(η²/m)·(C m exp(−B²/2))²·(4n/m) ∥y − f(k)∥²_2 = C_2²η²n² ∥y − f(k)∥²_2 exp(−B²), where the third inequality bounds |S_on(i,k+1)| + |S_on(i,k)| ≤ C m exp(−B²/2) for some constant C by Lemma A.19 together with Corollary A.5, bounds the gradients by Lemma A.8, and sets C_2 := 2C.

A.5.6 PUTTING IT ALL TOGETHER

Theorem A.18 (Convergence). Assume λ > 0 and λ = λ_0 exp(−B²/2) for some constant λ_0. Let η ≤ λ exp(B²)/(5C_2²n²), B ∈ [0, √(0.5 log m)] and m ≥ Ω(λ⁻⁴n⁴(1 + (exp(−B²/2) + 1/m) log³(2mn/δ)) exp(−B²)). Then, P[∀t : ∥y − f(t)∥²_2 ≤ (1 − ηλ/4)^t ∥y − f(0)∥²_2] ≥ 1 − δ − e^{−Ω(n)}.

Proof. From Lemma A.13, Lemma A.15, Lemma A.16 and Lemma A.17, we know that with probability at least 1 − 2n² exp(−(2/3) cm(R_w + R_b) exp(−B²/2)) − δ, we have ∥y − f(k+1)∥²_2 ≤ ∥y − f(k)∥²_2 (1 − 5ηλ/8 + 12ηnc(R_w + R_b) exp(−B²/2) + C_2²η²n² exp(−B²)). By Lemma A.9, we need D_w = 8√n ∥y − f(0)∥_2/(√m λ) ≤ R_w and D_b = 8√n ∥y − f(0)∥_2/(√m λ) ≤ R_b; the choice of m above guarantees both, and the choice of η ensures the contraction factor is at most 1 − ηλ/4, which proves the claim by induction over k.

Lemma A.19 (Number of Activated Neurons at Initialization). With probability at least 1 − n exp(−m exp(−B²/2)/4), we have |S_on(i, 0)| ≤ 2m exp(−B²/2) for all i ∈ [n], and consequently ∥Z(0)∥²_F ≤ 8n exp(−B²/2).

Proof. Since E[|S_on(i, 0)|] = m·P[w^⊤x_i ≥ B] ≤ m exp(−B²/2), taking t = m exp(−B²/2) in a Chernoff bound, we have P[|S_on(i, 0)| ≥ 2m exp(−B²/2)] ≤ exp(−m exp(−B²/2)/4). By a union bound over i ∈ [n], we have P[∀i ∈ [n] : |S_on(i, 0)| ≤ 2m exp(−B²/2)] ≥ 1 − n exp(−m exp(−B²/2)/4). Notice that ∥Z(0)∥²_F ≤ (4/m) Σ_{r=1}^m Σ_{i=1}^n I_{r,i}(0) ≤ 8n exp(−B²/2).

Lemma A.20 (Number of Activated Neurons per Iteration). Assume the parameter settings in Theorem A.18. With probability at least 1 − e^{−Ω(n)} over the random initialization, we have |S_on(i, t)| = O(m exp(−B²/2)) for all 0 ≤ t ≤ T and i ∈ [n].

Proof. By Corollary A.5 and Theorem A.18, we have P[∀i ∈ [n] : |S_i| ≤ 4mc exp(−B²/2)] ≥ 1 − e^{−Ω(n)}. Recall that S_i is the set of neurons that flip during the entire training process. Notice that |S_on(i, t)| ≤ |S_on(i, 0)| + |S_i|. Thus, by Lemma A.19, P[∀i ∈ [n] : |S_on(i, t)| = O(m exp(−B²/2))] ≥ 1 − e^{−Ω(n)}.
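Theorem A.18 predicts linear convergence of gradient descent when both the weights and the biases (initialized at B) are trained. The toy simulation below (our own, with hypothetical sizes far from the theorem's asymptotic regime) trains such a network on random data and checks that the training error drops by orders of magnitude:

```python
import numpy as np

def train_relu_ntk(n=8, d=20, m=4096, B=0.5, eta=0.1, steps=500, seed=0):
    """Full-batch gradient descent on the squared loss L = ||f - y||^2 / 2 for
    f(x) = (1/sqrt(m)) sum_r a_r ReLU(w_r^T x - b_r), with both the weights
    and the biases (all initialized at B) trainable.
    Returns the sequence of training errors ||f - y||_2^2."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
    y = rng.choice([-1.0, 1.0], size=n)
    W = rng.standard_normal((m, d))
    b = np.full(m, float(B))
    a = rng.choice([-1.0, 1.0], size=m)
    losses = []
    for _ in range(steps):
        pre = W @ X.T - b[:, None]                  # m x n pre-activations
        act = (pre > 0).astype(float)
        f = a @ np.maximum(pre, 0.0) / np.sqrt(m)   # network outputs
        r = f - y
        losses.append(float(r @ r))
        G = (a[:, None] * act) * r[None, :] / np.sqrt(m)
        W -= eta * (G @ X)                          # dL/dw_r
        b -= eta * (-G.sum(axis=1))                 # dL/db_r carries a minus sign
    return losses
```

With a width this large relative to n, the dynamics stay close to the NTK regime and the error decays geometrically, as in the theorem.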

B BOUNDING THE RESTRICTED SMALLEST EIGENVALUE WITH DATA SEPARATION

Theorem B.1. Let X = (x_1, …, x_n) be points in R^d with ∥x_i∥_2 = 1 for all i ∈ [n] and w ∼ N(0, I_d). Suppose that there exists δ ∈ [0, √2] such that min_{i≠j∈[n]} min(∥x_i − x_j∥_2, ∥x_i + x_j∥_2) ≥ δ. Let B ≥ 0. Recall the limit NTK matrix H∞ defined as H∞_ij := E_{w∼N(0,I)}[(⟨x_i, x_j⟩ + 1) I(w^⊤x_i ≥ B, w^⊤x_j ≥ B)]. Define p_0 = P[w^⊤x_1 ≥ B] and p_ij = P[w^⊤x_i ≥ B, w^⊤x_j ≥ B] for i ≠ j. Define the (data-dependent) region R = {a ∈ R^n : Σ_{i≠j} a_i a_j p_ij ≥ (min_{i′≠j′} p_{i′j′}) Σ_{i≠j} a_i a_j} and let λ := min_{∥a∥_2=1, a∈R} a^⊤H∞a. Then λ ≥ max(0, λ′), where λ′ ≥ p_0 − min_{i≠j} p_ij ≥ max(1/2 − B/√(2π), (1/B − 1/B³)·e^{−B²/2}/√(2π)) − e^{−B²/(2−δ²/2)}·(π − arctan(δ√(1−δ²/4)/(1−δ²/2)))/(2π).

Proof. Define ∆ := max_{i≠j} |⟨x_i, x_j⟩|. Then, by our assumption, 1 − ∆ = 1 − max_{i≠j} |⟨x_i, x_j⟩| = min_{i≠j} min(∥x_i − x_j∥²_2, ∥x_i + x_j∥²_2)/2 ≥ δ²/2, so ∆ ≤ 1 − δ²/2. Further, we define Z(w) := [x_1 I(w^⊤x_1 ≥ B), x_2 I(w^⊤x_2 ≥ B), …, x_n I(w^⊤x_n ≥ B)] ∈ R^{d×n}. Notice that H∞ = E_{w∼N(0,I)}[Z(w)^⊤Z(w) + I(Xw ≥ B)I(Xw ≥ B)^⊤]. We need to lower bound min_{∥a∥_2=1, a∈R} a^⊤H∞a = min_{∥a∥_2=1, a∈R} (a^⊤ E_w[Z(w)^⊤Z(w)] a + a^⊤ E_w[I(Xw ≥ B)I(Xw ≥ B)^⊤] a) ≥ min_{∥a∥_2=1, a∈R} a^⊤ E_w[I(Xw ≥ B)I(Xw ≥ B)^⊤] a. Now, for a fixed a, a^⊤ E_w[I(Xw ≥ B)I(Xw ≥ B)^⊤] a = Σ_{i=1}^n a_i² P[w^⊤x_i ≥ B] + Σ_{i≠j} a_i a_j P[w^⊤x_i ≥ B, w^⊤x_j ≥ B] = p_0 ∥a∥²_2 + Σ_{i≠j} a_i a_j p_ij, where the last equality uses P[w^⊤x_1 ≥ B] = … = P[w^⊤x_n ≥ B] = p_0, which holds by the spherical symmetry of the standard Gaussian. Notice that max_{i≠j} p_ij ≤ p_0. Since a ∈ R, E_w[(a^⊤I(Xw ≥ B))²] ≥ (p_0 − min_{i≠j} p_ij) ∥a∥²_2 + (min_{i≠j} p_ij) ∥a∥²_2 + (min_{i≠j} p_ij) Σ_{i≠j} a_i a_j = (p_0 − min_{i≠j} p_ij) ∥a∥²_2 + (min_{i≠j} p_ij)(Σ_i a_i)².
Thus, λ ≥ min_{∥a∥_2=1, a∈R} E_w[(a^⊤I(Xw ≥ B))²] ≥ min_{∥a∥_2=1, a∈R} (p_0 − min_{i≠j} p_ij) ∥a∥²_2 + min_{∥a∥_2=1, a∈R} (min_{i≠j} p_ij)(Σ_i a_i)² ≥ p_0 − min_{i≠j} p_ij. Now we need to upper bound min_{i≠j} p_ij ≤ max_{i≠j} p_ij. We divide into two cases: B = 0 and B > 0. Consider two fixed examples x_1, x_2, let v = (I − x_1x_1^⊤)x_2/∥(I − x_1x_1^⊤)x_2∥_2 and c = |⟨x_1, x_2⟩|.

Case 1: B = 0. First, define the region A_0 = {(g_1, g_2) ∈ R² : g_1 ≥ 0, g_1 ≥ −(√(1−c²)/c) g_2}. Then, P[w^⊤x_1 ≥ 0, w^⊤x_2 ≥ 0] = P[w^⊤x_1 ≥ 0, w^⊤(cx_1 + √(1−c²)v) ≥ 0] = P[g_1 ≥ 0, cg_1 + √(1−c²)g_2 ≥ 0] = P[A_0] = (π − arctan(√(1−c²)/|c|))/(2π) ≤ (π − arctan(√(1−∆²)/∆))/(2π), where we define g_1 := w^⊤x_1 and g_2 := w^⊤v; the second equality holds because x_1 and v are orthonormal, so g_1 and g_2 are two independent standard Gaussian random variables; the last inequality holds because arctan is monotonically increasing, √(1−c²)/|c| is decreasing in |c|, and |c| ≤ ∆. Thus, min_{i≠j} p_ij ≤ max_{i≠j} p_ij ≤ (π − arctan(√(1−∆²)/∆))/(2π).

Case 2: B > 0. First, define the region A = {(g_1, g_2) ∈ R² : g_1 ≥ B, g_1 ≥ B/c − (√(1−c²)/c)g_2}. Then, following the same steps as in Case 1, we have P[w^⊤x_1 ≥ B, w^⊤x_2 ≥ B] = P[g_1 ≥ B, cg_1 + √(1−c²)g_2 ≥ B] = P[A]. Let B_1 = B and B_2 = B√((1−c)/(1+c)). Further, notice that A = A_0 + (B_1, B_2). Then, P[A] = ∫_A (1/2π) exp(−(g_1² + g_2²)/2) dg_1 dg_2 = ∫_{A_0} (1/2π) exp(−((g_1 + B_1)² + (g_2 + B_2)²)/2) dg_1 dg_2 = e^{−(B_1² + B_2²)/2} ∫_{A_0} (1/2π) exp(−B_1g_1 − B_2g_2) exp(−(g_1² + g_2²)/2) dg_1 dg_2. Now, B_1g_1 + B_2g_2 = Bg_1 + B√((1−c)/(1+c))g_2 ≥ 0 holds if and only if g_1 ≥ −√((1−c)/(1+c))g_2. Define the region A_+ = {(g_1, g_2) ∈ R² : g_1 ≥ 0, g_1 ≥ −√((1−c)/(1+c))g_2}. Observe that √((1−c)/(1+c)) ≤ √(1−c²)/c = √((1−c)(1+c))/c ⇔ c ≤ 1 + c. Thus, A_0 ⊂ A_+, and B_1g_1 + B_2g_2 ≥ 0 on A_0.
Therefore, P[A] ≤ e^{−(B_1² + B_2²)/2} ∫_{A_0} (1/2π) exp(−(g_1² + g_2²)/2) dg_1 dg_2 = e^{−(B_1² + B_2²)/2} P[A_0] = e^{−(B_1² + B_2²)/2} (π − arctan(√(1−c²)/|c|))/(2π) ≤ e^{−B²/(1+∆)} (π − arctan(√(1−∆²)/∆))/(2π), where the last step uses B_1² + B_2² = 2B²/(1+c) ≥ 2B²/(1+∆). Finally, we need to lower bound p_0. This can be done in two ways: when B is small we apply the Gaussian anti-concentration bound, and when B is large we apply the Gaussian tail bound. Thus, p_0 = P[w^⊤x_1 ≥ B] ≥ max(1/2 − B/√(2π), (1/B − 1/B³)·e^{−B²/2}/√(2π)). Combining the lower bound on p_0 and the upper bound on min_{i≠j} p_ij, we have λ′ := p_0 − min_{i≠j} p_ij ≥ max(1/2 − B/√(2π), (1/B − 1/B³)·e^{−B²/2}/√(2π)) − e^{−B²/(1+∆)} (π − arctan(√(1−∆²)/∆))/(2π). Applying ∆ ≤ 1 − δ²/2 and noticing that H∞ is positive semi-definite gives our final result.
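The quantities p_0 and p_ij in Theorem B.1 have simple two-dimensional reductions, which makes the Case 1 formula easy to verify. The sketch below (our own illustration) checks the closed form at B = 0 and the e^{−B²/(1+c)} damping for B > 0:

```python
import numpy as np
from math import atan, exp, sqrt, pi

def p_joint_mc(c, B, trials=400000, seed=0):
    """Monte Carlo for p_ij = P[w^T x_1 >= B, w^T x_2 >= B] with <x_1, x_2> = c,
    using the reduction w^T x_1 = g_1, w^T x_2 = c g_1 + sqrt(1-c^2) g_2."""
    rng = np.random.default_rng(seed)
    g1 = rng.standard_normal(trials)
    g2 = rng.standard_normal(trials)
    u, v = g1, c * g1 + sqrt(1.0 - c * c) * g2
    return float(((u >= B) & (v >= B)).mean())

def p_joint_B0(c):
    """Case 1 of Theorem B.1: (pi - arctan(sqrt(1-c^2)/|c|)) / (2 pi)."""
    return (pi - atan(sqrt(1.0 - c * c) / abs(c))) / (2.0 * pi)

def p_joint_upper(c, B):
    """Case 2 bound: exp(-B^2/(1+c)) times the B = 0 value."""
    return exp(-B * B / (1.0 + c)) * p_joint_B0(c)
```

For c = 1/2 the two unit vectors are at angle π/3, and the B = 0 formula evaluates to exactly (π − π/3)/(2π) = 1/3.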

C GENERALIZATION C.1 RADEMACHER COMPLEXITY

In this section, we compute the Rademacher complexity of our network. Rademacher complexity is often used to bound the deviation of the true risk from the empirical risk (see, e.g., (Shalev-Shwartz & Ben-David, 2014)).

Definition C.1 (Empirical Rademacher Complexity). Given n samples S, the empirical Rademacher complexity of a function class F, where f : R^d → R for f ∈ F, is defined as R_S(F) = (1/n) E_ϵ[sup_{f∈F} Σ_{i=1}^n ϵ_i f(x_i)], where ϵ = (ϵ_1, …, ϵ_n) and the ϵ_i are i.i.d. Rademacher random variables, uniform over {−1, +1}.

Theorem C.2 (Generalization via Rademacher complexity; see, e.g., (Shalev-Shwartz & Ben-David, 2014)). Suppose the loss ℓ is ρ-Lipschitz in its first argument and bounded in [0, c]. Then, with probability at least 1 − δ over the draw of the n samples, sup_{f∈F} (L_D(f) − L_S(f)) ≤ 2ρ R_S(F) + 3c√(log(2/δ)/(2n)).

In order to get a meaningful generalization bound via Rademacher complexity, previous results such as (Arora et al., 2019; Song & Yang, 2019) multiply the neural network by a scaling factor κ to make sure the network outputs something small at initialization, which would require modifying all the lemmas we have already established. We avoid repeating our arguments by utilizing symmetric initialization, which forces the neural network to output exactly zero for any input at initialization.²

Definition C.3 (Symmetric Initialization). For a one-hidden-layer neural network with 2m neurons, the network is initialized as follows: 1. For r ∈ [m], initialize w_r ∼ N(0, I) and a_r ∼ Uniform({−1, 1}). 2. For r ∈ {m+1, …, 2m}, let w_r = w_{r−m} and a_r = −a_{r−m}.

It is not hard to see that all of our previously established lemmas continue to hold, including the expectation and concentration arguments. The only effect of symmetric initialization is to worsen the concentration by a constant factor of 2, which is easily addressed. For a detailed analysis, see (Munteanu et al., 2022). In order to state our final theorem, we need Definition 3.7. Now we can state our theorem for generalization.

Theorem C.4. Fix a failure probability δ ∈ (0, 1) and an accuracy parameter ϵ ∈ (0, 1). Suppose the training data S = {(x_i, y_i)}_{i=1}^n are i.i.d. samples from a (λ, δ, n)-non-degenerate distribution D.
Assume the settings in Theorem A.18 except now we let m ≥ Ω λ -4 n 6 1 + exp(-B 2 /2) + 1/m log 3 (2mn/δ) exp(-B 2 ) . Consider any loss function ℓ : R × R → [0, 1] that is 1-Lipschitz in its first argument. Then with probability at least 1 -2δ -e -Ω(n) over the symmetric initialization of W (0) ∈ R m×d and a ∈ R m and the training samples, the two layer neural network f (W (k), b(k), a) trained by gradient descent for k ≥ Ω( 1 ηλ log n log(1/δ) ϵ ) iterations has population loss L D (f ) = E (x,y)∼D [ℓ(f (x), y)] upper bounded as L D (f (W (k), b(k), a)) ≤ y ⊤ (H ∞ ) -1 y • 32 exp(-B 2 /2) n + Õ 1 n 1/2 . 2 While preparing the manuscript, the authors notice that this can be alternatively solved by reparameterized the neural network by f (x; W ) -f (x; W0) and thus minimizing the following objective L = 1 2 n i=1 (f (xi; W ) -f (xi; W0) -yi) 2 . The corresponding generalization is the same since Rademacher complexity is invariant to translation. However, since the symmetric initialization is widely adopted in theory literature, we go with symmetric initialization here. The rest of the proof depends on the results from Lemma C.6 and Lemma C.8. Let R : = ∥[W, b](k) -[W, b](0)∥ F . By Lemma C.6 we have R S (F Rw,R b ,R ) ≤ R 8 exp(-B 2 /2) n + 4c(R w + R b ) 2 √ m exp(-B 2 /2) ≤ R 8 exp(-B 2 /2) n + O n 2 (1 + (exp(-B 2 /2) + 1/m) log 3 (2mn/δ)) exp(-B 2 /2) √ mλ 2 . Lemma C.8 gives that R ≤ y ⊤ (H ∞ ) -1 y + O n λ exp(-B 2 /2) log(n/δ) m 1/4 + O n (R w + R b ) exp(-B 2 /2) λ + n λ 2 • O exp(-B 2 /4) log(n 2 /δ) m + (R w + R b ) exp(-B 2 /2) . Combining the above results and using the choice of m, R, B in Theorem A.18 gives us R(F) ≤ y ⊤ (H ∞ ) -1 y • 8 exp(-B 2 /2) n + O n exp(-B 2 /2) λ exp(-B 2 /2) log(n/δ) m 1/4 + O n(R w + R b ) λ exp(B 2 /2) + √ n λ 2 • O exp(-B 2 /2) log(n 2 /δ) m + (R w + R b ) exp(-3B 2 /4) + O n 2 (1 + (exp(-B 2 /2) + 1/m) log 3 (2mn/δ)) exp(-B 2 /2) √ mλ 2 . 
Now, we analyze the terms one by one by plugging in the bound of m and R w , R b and show that they can be bounded by Õ(exp(-B 2 /4)/n 1/2 ). For the second term, we have O n exp(-B 2 /2) λ exp(-B 2 /2) log(n/δ) m 1/4 = O √ λ exp(-B 2 /8) log 1/4 (n/δ) n . For the third term, we have O n(R w + R b ) λ exp(B 2 /2) = O √ n λ exp(B 2 /2) √ n(1 + (exp(-B 2 /2) + 1/m) log 3 (2mn/δ)) 1/4 m 1/4 λ 1/2 = O n exp(B 2 /2)n 6/4 exp(-B 2 /4) = O exp(-B 2 /4) n 1/2 . For the fourth term, we have √ n λ 2 • O exp(-B 2 /2) log(n 2 /δ) m + (R w + R b ) exp(-3B 2 /4) = O λ log(n/δ) n 2.5 + O exp(-B 2 /4) n 1.5 . For the last term, we have O n 2 (1 + (exp(-B 2 /2) + 1/m) log 3 (2mn/δ)) exp(-B 2 /2) √ mλ 2 = O   λ 1 + (exp(-B 2 /2) + 1/m) log 3 (2mn/δ) n   . Recall our discussion on λ in Section 3.4 that λ = λ 0 exp(-B 2 /2) ≤ 1 for some λ 0 independent of B. Putting them together, we get the desired upper bound for R(F), and the theorem is then proved. Lemma C.6. Assume the choice of R w , R b , m in Theorem A.18. Given R > 0, with probability at least 1 -e -Ω(n) over the random initialization of W (0), a, the following function class F Rw,R b ,R = {f (W, a, b) : ∥W -W (0)∥ 2,∞ ≤ R w , ∥b -b(0)∥ ∞ ≤ R b , ∥vec([W, b] -[W (0), b(0)])∥ ≤ R} has empirical Rademacher complexity bounded as R S (F Rw,R b ,R ) ≤ R 8 exp(-B 2 /2) n + 4c(R w + R b ) 2 √ m exp(-B 2 /2). Proof. We need to upper bound R S (F Rw,R b ,R ). Define the events A r,i = {|w r (0) ⊤ x i -b r (0)| ≤ R w + R b }, i ∈ [n], r ∈ [m] and a shorthand I(w r (0 ) ⊤ x i -B ≥ 0) = I r,i (0). 
Then, n i=1 ϵ i m r=1 a r σ(w ⊤ r x i -b r ) - n i=1 ϵ i m r=1 a r I r,i (0)(w ⊤ r x i -b r ) = n i=1 m r=1 ϵ i a r σ(w ⊤ r x i -b r ) -I r,i (0)(w ⊤ r x i -b r ) = n i=1 m r=1 I(A r,i )ϵ i a r σ(w ⊤ r x i -b r ) -I r,i (0)(w ⊤ r x i b r ) = n i=1 m r=1 I(A r,i )ϵ i a r σ(w ⊤ r x i -b r ) -I r,i (0)(w r (0) ⊤ x i -b r (0)) -I r,i (0)((w r -w r (0)) ⊤ x i -(b r -b r (0))) = n i=1 m r=1 I(A r,i )ϵ i a r σ(w ⊤ r x i -b r ) -σ(w r (0) ⊤ x i -b r (0)) -I r,i (0)((w r -w r (0)) ⊤ x i -(b r -b r (0))) ≤ n i=1 m r=1 I(A r,i )2(R w + R b ), where the second equality is due to the fact that σ(w ⊤ r x i -b r ) = I r,i (0)(w ⊤ r x i -b r ) if r / ∈ A r,i . Thus, the Rademacher complexity can be bounded as R S (F Rw,R b ,R ) = 1 n E ϵ     sup ∥W -W (0)∥ 2,∞ ≤Rw, ∥b-b(0)∥ ∞ ≤R b , ∥vec([W,b]-[W (0),b(0)])∥≤R n i=1 ϵ i m r=1 a r √ m σ(w ⊤ r x i -b r )     ≤ 1 n E ϵ     sup ∥W -W (0)∥ 2,∞ ≤Rw, ∥b-b(0)∥ ∞ ≤R b , ∥vec([W,b]-[W (0),b(0)])∥≤R n i=1 ϵ i m r=1 a r √ m I r,i (0)(w ⊤ r x i -b r )     + 2(R w + R b ) n √ m n i=1 m r=1 I(A r,i ) = 1 n E ϵ sup ∥vec([W,b]-[W (0),b(0)])∥≤R vec([W, b]) ⊤ Z(0)ϵ + 2(R w + R b ) n √ m n i=1 m r=1 I(A r,i ) = 1 n E ϵ sup ∥vec([W,b]-[W (0),b(0)])∥≤R vec([W, b] -[W (0), b(0)]) ⊤ Z(0)ϵ + 2(R w + R b ) n √ m n i=1 m r=1 I(A r,i ) ≤ 1 n E ϵ [R ∥Z(0)ϵ∥ 2 ] + 2(R w + R b ) n √ m n i=1 m r=1 I(A r,i ) ≤ R n E ϵ [∥Z(0)ϵ∥ 2 2 ] + 2(R w + R b ) n √ m n i=1 m r=1 I(A r,i ) = R n ∥Z(0)∥ F + 2(R w + R b ) n √ m n i=1 m r=1 I(A r,i ), where we recall the definition of the matrix Z(0) = 1 √ m    I 1,1 (0)a 1 [x ⊤ 1 , -1] ⊤ . . . I 1,n (0)a 1 [x ⊤ n , -1] ⊤ . . . . . . I m,1 (0)a m [x ⊤ 1 , -1] ⊤ . . . I m,n (0)a m [x ⊤ n , -1] ⊤    ∈ R m(d+1)×n . By Lemma A.19, we have ∥Z(0)∥ F ≤ 8n exp(-B 2 /2) and by Corollary A.5, we have P ∀i ∈ [n] : m r=1 I(A r,i ) ≤ 2mc(R w + R b ) exp(-B 2 /2) ≥ 1 -e -Ω(n) . Thus, with probability at least 1 -e -Ω(n) , we have R S (F Rw,R b ,R ) ≤ R 8 exp(-B 2 /2) n + 4c(R w + R b ) 2 √ m exp(-B 2 /2).
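The symmetric initialization of Definition C.3 is straightforward to verify numerically: pairing each neuron with a sign-flipped copy makes the network output (numerically) zero at initialization. A minimal sketch (our own):

```python
import numpy as np

def symmetric_init_output(m=64, d=10, seed=0):
    """Network output at a random input under the symmetric initialization
    of Definition C.3: neurons r and m + r share w_r but have opposite a_r."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    W2 = np.vstack([W, W])                    # w_{m+r} = w_r
    a2 = np.concatenate([a, -a])              # a_{m+r} = -a_r
    x = rng.standard_normal(d)
    return float(a2 @ np.maximum(W2 @ x, 0.0) / np.sqrt(2 * m))
```

Since each pair contributes a_r σ(w_r^⊤x) − a_r σ(w_r^⊤x), the output cancels for every input, which is why no scaling factor κ is needed.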

C.2 ANALYSIS OF RADIUS

Theorem C.7. Assume the parameter settings in Theorem A.18. With probability at least 1 -δe -Ω(n) over the initialization we have f (k) -y = -(I -ηH ∞ ) k y ± e(k), where ∥e(k)∥ 2 = k(1 -ηλ/4) (k-1)/2 ηn 3/2 • O exp(-B 2 /4) log(n 2 /δ) m + (R w + R b ) exp(-B 2 /2) . Proof. Before we start, we assume all the events needed in Theorem A.18 succeed, which happens with probability at least 1 -δ -e -Ω(n) . Recall the no-flipping set S i in Definition A.3. We have f i (k + 1) -f i (k) = 1 √ m m r=1 a r [σ(w r (k + 1) ⊤ x i -b r (k + 1)) -σ(w r (k) ⊤ x i -b r (k))] = 1 √ m r∈Si a r [σ(w r (k + 1) ⊤ x i -b r (k + 1)) -σ(w r (k) ⊤ x i -b r (k))] + 1 √ m r∈Si a r [σ(w r (k + 1) ⊤ x i -b r (k + 1)) -σ(w r (k) ⊤ x i -b r (k))] ϵi(k) . (5) Now, to upper bound the second term ϵ i (k), |ϵ i (k)| = 1 √ m r∈Si a r [σ(w r (k + 1) ⊤ x i -b r (k + 1)) -σ(w r (k) ⊤ x i -b r (k))] ≤ 1 √ m r∈Si |w r (k + 1) ⊤ x i -b r (k + 1) -(w r (k) ⊤ x i -b r (k))| ≤ 1 √ m r∈Si ∥w r (k + 1) -w r (k)∥ 2 + |b r (k + 1) -b r (k)| = 1 √ m r∈Si η √ m a r n j=1 (f j (k) -y j )I r,j (k)x j 2 + η √ m a r n j=1 (f j (k) -y j )I r,j (k) ≤ 2η m r∈Si n j=1 |f j (k) -y j | ≤ 2η √ n|S i | m ∥f (k) -y∥ 2 ⇒ ∥ϵ∥ 2 = n i=1 4η 2 n|S i | 2 m 2 ∥f (k) -y∥ 2 2 ≤ ηnO((R w + R b ) exp(-B 2 /2)) ∥f (k) -y∥ 2 where we apply Corollary A.5 in the last inequality. 
To bound the first term,
\begin{align*}
&\frac{1}{\sqrt m}\sum_{r\in S_i} a_r\big[\sigma(w_r(k+1)^\top x_i - b_r(k+1)) - \sigma(w_r(k)^\top x_i - b_r(k))\big]\\
&= \frac{1}{\sqrt m}\sum_{r\in S_i} a_r \mathbb{I}_{r,i}(k)\big[(w_r(k+1)-w_r(k))^\top x_i - (b_r(k+1)-b_r(k))\big]\\
&= \frac{1}{\sqrt m}\sum_{r\in S_i} a_r \mathbb{I}_{r,i}(k)\Big[\Big(-\frac{\eta}{\sqrt m}a_r\sum_{j=1}^n (f_j(k)-y_j)\mathbb{I}_{r,j}(k)x_j\Big)^\top x_i - \frac{\eta}{\sqrt m}a_r\sum_{j=1}^n (f_j(k)-y_j)\mathbb{I}_{r,j}(k)\Big]\\
&= \frac{1}{\sqrt m}\sum_{r\in S_i} a_r \mathbb{I}_{r,i}(k)\Big(-\frac{\eta}{\sqrt m}a_r\sum_{j=1}^n (f_j(k)-y_j)\mathbb{I}_{r,j}(k)(x_j^\top x_i + 1)\Big)\\
&= -\eta\sum_{j=1}^n (f_j(k)-y_j)\,\frac{1}{m}\sum_{r\in S_i}\mathbb{I}_{r,i}(k)\mathbb{I}_{r,j}(k)(x_j^\top x_i + 1)\\
&= -\eta\sum_{j=1}^n (f_j(k)-y_j)H_{ij}(k) + \underbrace{\eta\sum_{j=1}^n (f_j(k)-y_j)\,\frac{1}{m}\sum_{r\notin S_i}\mathbb{I}_{r,i}(k)\mathbb{I}_{r,j}(k)(x_j^\top x_i + 1)}_{\epsilon_i'(k)}, \tag{7}
\end{align*}
where we can upper bound $|\epsilon_i'(k)|$ as
\[
|\epsilon_i'(k)| \le \frac{2\eta}{m}|\bar S_i|\sum_{j=1}^n|f_j(k)-y_j| \le \frac{2\eta\sqrt n\,|\bar S_i|}{m}\|f(k)-y\|_2
\;\Rightarrow\;
\|\epsilon'(k)\|_2 \le \eta n\,O\big((R_w+R_b)\exp(-B^2/2)\big)\,\|f(k)-y\|_2. \tag{8}
\]
Combining Equations (5), (6), (7) and (8), we have
\begin{align*}
f_i(k+1)-f_i(k) &= -\eta\sum_{j=1}^n (f_j(k)-y_j)H_{ij}(k) + \epsilon_i(k) + \epsilon_i'(k)\\
\Rightarrow\quad f(k+1)-f(k) &= -\eta H(k)(f(k)-y) + \epsilon(k) + \epsilon'(k)\\
&= -\eta H^\infty(f(k)-y) + \underbrace{\eta(H^\infty - H(k))(f(k)-y) + \epsilon(k) + \epsilon'(k)}_{\zeta(k)}\\
\Rightarrow\quad f(k)-y &= (I-\eta H^\infty)^k (f(0)-y) + \sum_{t=0}^{k-1}(I-\eta H^\infty)^t \zeta(k-1-t)\\
&= -(I-\eta H^\infty)^k y + \underbrace{(I-\eta H^\infty)^k f(0) + \sum_{t=0}^{k-1}(I-\eta H^\infty)^t \zeta(k-1-t)}_{e(k)}.
\end{align*}
The rest of the proof bounds the magnitude of $e(k)$. From Lemma A.2 and Lemma A.6, we have
\[
\|H^\infty - H(k)\|_2 \le \|H(0)-H^\infty\|_2 + \|H(0)-H(k)\|_2 = O\Big(n\exp(-B^2/4)\sqrt{\tfrac{\log(n^2/\delta)}{m}}\Big) + O\big(n(R_w+R_b)\exp(-B^2/2)\big).
\]
Thus, we can bound $\zeta(k)$ as
\[
\|\zeta(k)\|_2 \le \eta\|H^\infty-H(k)\|_2\,\|f(k)-y\|_2 + \|\epsilon(k)\|_2 + \|\epsilon'(k)\|_2 = O\Big(\eta n\Big(\exp(-B^2/4)\sqrt{\tfrac{\log(n^2/\delta)}{m}} + (R_w+R_b)\exp(-B^2/2)\Big)\Big)\|f(k)-y\|_2.
\]
Notice that $\|H^\infty\|_2 \le \mathrm{Tr}(H^\infty) \le n$ since $H^\infty$ is positive semi-definite. By Theorem A.18, we pick $\eta = O(\lambda/n^2) \ll 1/\|H^\infty\|_2$, and, with probability at least $1-\delta-e^{-\Omega(n)}$ over the random initialization, we have $\|f(k)-y\|_2 \le (1-\eta\lambda/4)^{k/2}\|f(0)-y\|_2$. Since we use symmetric initialization, $f(0) = 0$ and hence $(I-\eta H^\infty)^k f(0) = 0$.
Thus,
\begin{align*}
\|e(k)\|_2 &= \Big\|\sum_{t=0}^{k-1}(I-\eta H^\infty)^t \zeta(k-1-t)\Big\|_2 \le \sum_{t=0}^{k-1}\|I-\eta H^\infty\|_2^t\,\|\zeta(k-1-t)\|_2\\
&\le \sum_{t=0}^{k-1}(1-\eta\lambda)^t\,\eta n\,O\Big(\exp(-B^2/4)\sqrt{\tfrac{\log(n^2/\delta)}{m}} + (R_w+R_b)\exp(-B^2/2)\Big)\,\|f(k-1-t)-y\|_2\\
&\le \sum_{t=0}^{k-1}(1-\eta\lambda)^t\,\eta n\,O\Big(\exp(-B^2/4)\sqrt{\tfrac{\log(n^2/\delta)}{m}} + (R_w+R_b)\exp(-B^2/2)\Big)\,(1-\eta\lambda/4)^{(k-1-t)/2}\,\|f(0)-y\|_2\\
&\le k(1-\eta\lambda/4)^{(k-1)/2}\,\eta n\,O\Big(\exp(-B^2/4)\sqrt{\tfrac{\log(n^2/\delta)}{m}} + (R_w+R_b)\exp(-B^2/2)\Big)\,\|f(0)-y\|_2\\
&\le k(1-\eta\lambda/4)^{(k-1)/2}\,\eta n^{3/2}\,O\Big(\exp(-B^2/4)\sqrt{\tfrac{\log(n^2/\delta)}{m}} + (R_w+R_b)\exp(-B^2/2)\Big)\cdot\sqrt{1 + (\exp(-B^2/2)+1/m)\log^3(2mn/\delta)}\\
&= k(1-\eta\lambda/8)^{k-1}\,\eta n^{3/2}\,O\Big(\exp(-B^2/4)\sqrt{\tfrac{\log(n^2/\delta)}{m}} + (R_w+R_b)\exp(-B^2/2)\Big),
\end{align*}
where the last step uses $(1-\eta\lambda/4)^{1/2} \le 1-\eta\lambda/8$ and absorbs the lower-order factor into the $O(\cdot)$ notation.

Lemma C.8. Assume the parameter settings in Theorem A.18. Then, with probability at least $1-\delta-e^{-\Omega(n)}$ over the random initialization, we have for all $k \ge 0$,
\[
\|[W,b](k)-[W,b](0)\|_F \le \sqrt{y^\top (H^\infty)^{-1} y} + O\Big(\frac{n}{\lambda}\Big(\frac{\exp(-B^2/2)\log(n/\delta)}{m}\Big)^{1/4}\Big) + O\Big(\frac{n\sqrt{R\exp(-B^2/2)}}{\lambda}\Big) + \frac{n}{\lambda^2}\cdot O\Big(\exp(-B^2/4)\sqrt{\tfrac{\log(n^2/\delta)}{m}} + R\exp(-B^2/2)\Big),
\]
where $R = R_w + R_b$.

Proof. Before we start, we assume that all the events needed in Theorem A.18 succeed, which happens with probability at least $1-\delta-e^{-\Omega(n)}$. Let $H^\infty = U\Sigma U^\top$ be the eigendecomposition. Then
\[
T = U\Big(\eta\sum_{k=0}^{K-1}(I-\eta\Sigma)^k\Big)U^\top = U\big((I-(I-\eta\Sigma)^K)\Sigma^{-1}\big)U^\top
\;\Rightarrow\;
T H^\infty T = U\big((I-(I-\eta\Sigma)^K)\Sigma^{-1}\big)^2\Sigma\,U^\top = U\big(I-(I-\eta\Sigma)^K\big)^2\Sigma^{-1}U^\top \preceq U\Sigma^{-1}U^\top = (H^\infty)^{-1}.
\]
Thus,
\[
\|T_1\|_2^2 = \Big\|\sum_{k=0}^{K-1}\eta Z(0)(I-\eta H^\infty)^k y\Big\|_2^2 \le y^\top (H^\infty)^{-1}y + O\Big(\frac{n^2\exp(-B^2/4)}{\lambda^2}\sqrt{\frac{\log(n/\delta)}{m}}\Big) \le y^\top (H^\infty)^{-1}y + O\Big(\frac{n}{\lambda}\Big(\frac{\exp(-B^2/2)\log(n/\delta)}{m}\Big)^{1/4}\Big).
\]
Finally, plugging in the bounds in Equations (9), (10), (11) and (12) yields the claimed bound.

Lemma D.2 (Matrix Chernoff bound, (Tropp et al., 2015)). Let $X_1,\dots,X_m\in\mathbb{R}^{n\times n}$ be $m$ independent random Hermitian matrices. Assume that $0\preceq X_i\preceq L\cdot I$ for some $L>0$ and all $i\in[m]$. Let $X := \sum_{i=1}^m X_i$. Then, for $\epsilon\in(0,1]$, we have
\[
\Pr\big[\lambda_{\min}(X)\le\epsilon\,\lambda_{\min}(\mathbb{E}[X])\big] \le n\cdot\exp\big(-(1-\epsilon)^2\lambda_{\min}(\mathbb{E}[X])/(2L)\big).
\]
Lemma D.4 (Anti-concentration of the Gaussian). Let $Z\sim N(0,\sigma^2)$. Then, for $t>0$,
\[
\Pr[|Z|\le t] \le \frac{2t}{\sqrt{2\pi}\,\sigma}.
\]
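The driver of Theorem C.7 is the linearized recursion $f(k)-y \approx (I-\eta H^\infty)^k(f(0)-y)$, which contracts at a linear rate when $\eta < 1/\|H^\infty\|_2$. The minimal sketch below (a synthetic positive-definite matrix standing in for $H^\infty$; hypothetical sizes, not the paper's experiment) verifies the contraction $\|f(k)-y\|_2 \le (1-\eta\lambda)^k\|f(0)-y\|_2$ that the $e(k)$ bound rests on.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30

# synthetic well-conditioned PSD "limiting NTK" standing in for H^infty
G = rng.standard_normal((n, n))
H = G @ G.T / n + 0.5 * np.eye(n)

lam = np.linalg.eigvalsh(H)[0]          # smallest eigenvalue (ascending order)
eta = 0.5 / np.linalg.eigvalsh(H)[-1]   # step size below 1/||H||_2

y = rng.standard_normal(n)
r = -y.copy()                           # residual f(0) - y; f(0) = 0 under symmetric init
for k in range(200):
    r = r - eta * (H @ r)               # residual recursion of the linearized GD

# after k steps the residual equals -(I - eta H)^k y, so it decays linearly
decay = np.linalg.norm(r) / np.linalg.norm(y)
print(decay, (1 - eta * lam) ** 200)
```

The observed ratio sits below $(1-\eta\lambda)^{200}$, exactly as the spectral bound predicts; the nonlinear network matches this up to the $e(k)$ perturbation.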

E THE BENEFIT OF CONSTANT INITIALIZATION OF BIASES

In short, the benefit of constant initialization of biases lies in inducing sparsity in the activations and thus reducing the per-step training cost; this is the main motivation of our work on studying sparsity from a deep learning theory perspective. Since our convergence result shows that sparsity does not change the convergence rate, the total training cost is also reduced. To address the width's dependence on $B$, our argument goes as follows. In practice, one sets up a neural network model by first picking a network of some pre-chosen size and then choosing the other hyperparameters, such as the learning rate and the initialization scale; in our case, the hyperparameter of interest is the bias initialization. Thus, the network width is picked before $B$. Suppose we want to apply our theoretical result to guide practice. Since we usually do not know the exact data separation or the minimum eigenvalue of the NTK, we do not have a good estimate of the exact width needed for the network to converge and generalize. We may therefore pick a network whose width is much larger than needed (e.g., a network of width $\Omega(n^{12})$ when only $\Omega(n^4)$ is needed; this is possible because the smallest eigenvalue of the NTK can range over $[\Omega(1/n^2), O(1)]$). Moreover, it is an empirical observation that the neural networks used in practice are heavily overparameterized, so there is always room for sparsification. If the network width is very large, then each gradient descent step is costly, since the cost scales linearly with the width; it can be improved to scale linearly with the number of active neurons if implemented smartly. If the bias is initialized to zero (as is usually done in practice), then the number of active neurons is $O(m)$. However, since we can sparsify the network's activations via a non-zero bias initialization, the number of active neurons can scale sublinearly in $m$.
Thus, if the network width we choose at the beginning is much larger than needed, then we can indeed obtain a reduction in total training cost from this initialization. The above is an informal description of the result proven in (Song et al., 2021a), and the message is that sparsity can help reduce the per-step training cost. If the network width is pre-chosen, then the lower bound on the width $m \ge \Omega(\lambda_0^{-4} n^4 \exp(B^2))$ in Theorem 3.1 can be translated into an upper bound on the bias initialization: $B \le \tilde O\big(\sqrt{\log(\lambda_0^4 m / n^4)}\big)$ provided $m \ge \Omega(\lambda_0^{-4} n^4)$. This is a more appropriate interpretation of our result. Note that this differs from how Theorem 3.1 is presented: there, one first picks $B$ and then chooses $m$; since $m$ is picked later, $m$ can always satisfy $B \le \sqrt{0.5\log m}$ and $m \ge \Omega(\lambda_0^{-4} n^4\exp(B^2))$. Of course, we do not know the best (largest) possible $B$ that works, but as long as some $B$ works, we obtain a computational gain from sparsity. In summary, sparsity can reduce the per-step training cost precisely because we do not know the exact width needed for the network to converge and generalize; our result should be interpreted as an upper bound on $B$, since in practice the width is always chosen before $B$.
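The cost argument above can be illustrated with a quick count of active neurons at initialization (a hypothetical NumPy sketch with made-up sizes): raising $B$ shrinks the active set roughly like $m\exp(-B^2/2)$, which is what makes a sparsity-aware implementation's per-step cost sublinear in $m$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 100000, 20  # hypothetical width and input dimension

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                 # a unit-norm input
pre = rng.standard_normal((m, d)) @ x  # preactivations w_r^T x ~ N(0, 1)

counts = {}
for B in [0.0, 1.0, 2.0, 3.0]:
    counts[B] = int((pre >= B).sum())  # neurons active under bias initialization B
    # a sparsity-aware gradient step touches only these neurons instead of all m
    print(B, counts[B], round(m * np.exp(-B**2 / 2)))
```

At $B=0$ roughly half the neurons fire, while at $B=3$ only a fraction of a percent do, so the per-step work drops by orders of magnitude without changing the convergence rate.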



DISCUSSION

In this work, we studied training one-hidden-layer overparameterized ReLU networks in the NTK regime with biases that are trainable and initialized to some constant rather than zero. We showed sparsity-dependent results on convergence, the restricted least eigenvalue, and generalization. A future direction is to generalize our analysis to multi-layer neural networks. In practice, label shifting is unnecessary for achieving good generalization. An open problem is whether the dependence on the sample size in the lower bound of the infinite-width NTK's least eigenvalue can be improved, or even whether a lower bound depending purely on the data separation is possible, so that the generalization bound is no longer vacuous for all labels.



Lemma 3.3 (Bound on activation flipping probability). Let $B \ge 0$ and $R_w, R_b \le \min\{1/B, 1\}$. Let $\widetilde W = (\widetilde w_1,\dots,\widetilde w_m)$ be vectors generated i.i.d. from $N(0, I)$ and $\widetilde b = (\widetilde b_1,\dots,\widetilde b_m) = (B,\dots,B)$, and let weights $W = (w_1,\dots,w_m)$ and biases $b = (b_1,\dots,b_m)$ satisfy, for all $r\in[m]$, $\|\widetilde w_r - w_r\|_2 \le R_w$ and $|\widetilde b_r - b_r| \le R_b$. Define the event

Figure 1: Sparsity pattern of different layers across training iterations for three different bias initializations. The x and y axes denote the iteration number and the sparsity level, respectively. The models achieve 97.9%, 97.7% and 97.3% accuracy after training, respectively. Note that, in panel (a), the curves of layers 1-5 overlap, except for layer 0.

Definition A.3 (No-flipping set). For each $i\in[n]$, let $S_i\subseteq[m]$ denote the set of neurons that are never flipped during the entire training process:
\[
S_i := \big\{r\in[m] : \forall t\in[T],\ \mathrm{sign}(\langle w_r(t), x_i\rangle - b_r(t)) = \mathrm{sign}(\langle w_r(0), x_i\rangle - b_r(0))\big\}.
\]
Thus, the flipping set is $\bar S_i := [m]\setminus S_i$ for $i\in[n]$.

Lemma A.4 (Bound on flipping probability). Let $B \ge 0$ and $R_w, R_b \le \min\{1/B, 1\}$. Let $\widetilde W = (\widetilde w_1,\dots,\widetilde w_m)$ be vectors generated i.i.d. from $N(0, I)$ and $\widetilde b = (\widetilde b_1,\dots,\widetilde b_m) = (B,\dots,B)$, and let weights $W = (w_1,\dots,w_m)$ and biases $b = (b_1,\dots,b_m)$ satisfy, for all $r\in[m]$, $\|\widetilde w_r - w_r\|_2 \le R_w$ and $|\widetilde b_r - b_r| \le R_b$. Define the event
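The mechanism behind Lemma A.4 is Gaussian anti-concentration: a neuron can flip only if its initial preactivation lies within $R_w + R_b$ of the threshold $B$, an event of probability $O\big((R_w+R_b)\exp(-B^2/2)\big)$. A small Monte Carlo sketch of this (hypothetical parameters, not the paper's constants):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, B, R = 200000, 15, 2.0, 0.1   # R plays the role of R_w + R_b

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
z = rng.standard_normal((m, d)) @ x  # w_r^T x ~ N(0, 1) since ||x||_2 = 1

# a neuron can flip only if its preactivation lies within R of the threshold B
flip_prob = np.mean(np.abs(z - B) <= R)
# density bound: interval of length 2R times the max Gaussian density on it
bound = 2 * R * np.exp(-(B - R) ** 2 / 2) / np.sqrt(2 * np.pi)
print(flip_prob, bound)
```

The empirical flipping probability stays below the density bound, and both scale linearly in $R$ and like $\exp(-B^2/2)$ in the bias, matching the lemma.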

Corollary A.5. Let $B > 0$ and $R_w, R_b \le \min\{1/B, 1\}$. Assume that $\|w_r(t) - w_r(0)\|_2 \le R_w$ and $|b_r(t) - b_r(0)| \le R_b$ for all $t\in[T]$ and $r\in[m]$. For $i\in[n]$, the flipping set $\bar S_i$ satisfies

Assume $\lambda > 0$. Let $B > 0$ and $R_b, R_w \le \min\{1/B, 1\}$. Let $\widetilde W = (\widetilde w_1,\dots,\widetilde w_m)$ be vectors generated i.i.d. from $N(0, I)$ and $\widetilde b = (\widetilde b_1,\dots,\widetilde b_m) = (B,\dots,B)$. For any set of weights $W = (w_1,\dots,w_m)$ and biases $b = (b_1,\dots,b_m)$ that satisfy, for all $r\in[m]$, $\|\widetilde w_r - w_r\|_2 \le R_w$ and $|\widetilde b_r - b_r| \le R_b$, we define the matrix $H(W, b)\in\mathbb{R}^{n\times n}$ by

Decompose $\sum_{k=0}^{K-1}\eta Z(k)(I-\eta H^\infty)^k y = T_1 + T_2$, where $T_1 := \sum_{k=0}^{K-1}\eta Z(0)(I-\eta H^\infty)^k y$ and $T_2 := \sum_{k=0}^{K-1}\eta\big(Z(k)-Z(0)\big)(I-\eta H^\infty)^k y$. By Lemma A.6, we have $\|Z(k)-Z(0)\|_F \le O\big(\sqrt{nR\exp(-B^2/2)}\big)$, which implies $\|T_2\|_2 \le \sum_{k=0}^{K-1}\eta\,\|Z(k)-Z(0)\|_2\,\big\|(I-\eta H^\infty)^k y\big\|_2$. By Lemma A.2, we know $\|H(0)-H^\infty\|_2 \le O\big(n\exp(-B^2/4)\sqrt{\log(n/\delta)/m}\big)$. Moreover,
\[
\|Z(0)Ty\|_2^2 = y^\top T Z(0)^\top Z(0) T y = y^\top T H(0) T y \le y^\top T H^\infty T y + \|H(0)-H^\infty\|_2\,\|Ty\|_2^2.
\]

Plugging in the bounds in Equations (9), (10), (11) and (12), we have
\[
\|[W,b](K)-[W,b](0)\|_F = \big\|\mathrm{vec}([W,b](K)) - \mathrm{vec}([W,b](0))\big\|_2 \le \sqrt{y^\top (H^\infty)^{-1} y} + O\Big(\frac{n}{\lambda}\Big(\frac{\exp(-B^2/2)\log(n/\delta)}{m}\Big)^{1/4}\Big).
\]

\[
\Pr\big[\lambda_{\min}(X) \le \epsilon\,\lambda_{\min}(\mathbb{E}[X])\big] \le n\cdot\exp\big(-(1-\epsilon)^2\lambda_{\min}(\mathbb{E}[X])/(2L)\big).
\]
Lemma D.3 ((Li & Shao, 2001, Theorem 3.1), with an improved upper bound for the Gaussian case). Let $b > 0$ and $r > 0$. Then, the stated probability is at most $\exp\big(-(\max\{b-r, 0\})^2/2\big)$.
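The Gaussian anti-concentration bound of Lemma D.4, $\Pr[|Z|\le t] \le \frac{2t}{\sqrt{2\pi}\,\sigma}$, can be checked directly by simulation (a quick hypothetical sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, t, N = 1.0, 1.0, 500000  # hypothetical scale, threshold, sample count

z = sigma * rng.standard_normal(N)
emp = np.mean(np.abs(z) <= t)                  # empirical P[|Z| <= t]
bound = 2 * t / (np.sqrt(2 * np.pi) * sigma)   # Lemma D.4's bound
print(emp, bound)
```

The bound simply multiplies the interval length $2t$ by the Gaussian density's maximum $1/(\sqrt{2\pi}\sigma)$, so it is tight as $t\to 0$ and loosens for large $t$.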

where $\epsilon = (\epsilon_1,\dots,\epsilon_n)^\top$ and the $\epsilon_i$ are i.i.d. Rademacher random variables.

Theorem C.2 ((Shalev-Shwartz & Ben-David, 2014)). Suppose the loss function $\ell(\cdot,\cdot)$ is bounded in $[0, c]$ and is $\rho$-Lipschitz in its first argument. Then, with probability at least $1-\delta$ over a sample $S$ of size $n$:

(Bernstein's inequality). Assume $Z_1,\dots,Z_n$ are $n$ i.i.d. random variables with $\mathbb{E}[Z_i] = 0$ and $|Z_i| \le M$ for all $i\in[n]$ almost surely. Let $Z := \sum_{i=1}^n Z_i$.


By Lemma A.11, combining the results, we have
\[
R > \Omega\Big(\lambda^{-1} m^{-1/2}\, n\,\sqrt{1 + (\exp(-B^2/2) + 1/m)\log^3(2mn/\delta)}\Big).
\]
Lemma A.13 requires a further condition, which implies a lower bound on $m$. Lemma A.1 further requires a lower bound of $m = \Omega(\lambda^{-1} n\log(n/\delta))$, which can be ignored. Lemma A.6 further requires $R < \min\{1/B, 1\}$, which implies another lower bound on $m$. From Theorem F.1 in (Song et al., 2021a), we know that $\lambda = \lambda_0\exp(-B^2/2)$ for some $\lambda_0$ with no dependence on $B$, and that $\lambda\exp(B^2/2) \le 1$; thus, by our constraints on $m$ and $B$, this is always satisfied. Finally, the remaining requirement involves $\lambda^2/n^2$ and is satisfied by our choice of $m$ and $B$.

A.6 BOUNDING THE NUMBER OF ACTIVATED NEURONS PER ITERATION

First, we define the set of activated neurons at iteration $t$ for training point $x_i$ to be the set of neurons whose preactivation on $x_i$ is nonnegative at iteration $t$.

Lemma A.19 (Number of activated neurons at initialization). Assume the choice of $m$ in Theorem A.18. With probability at least $1-e^{-\Omega(n)}$ over the random initialization, the number of activated neurons at initialization is $O(m\exp(-B^2/2))$ for every training point; as a by-product, $\|Z(0)\|_F \le \sqrt{8n\exp(-B^2/2)}$.

Proof. First, we bound the number of activated neurons at initialization. We have $\Pr[w_r^\top x_i \ge B] \le \exp(-B^2/2)$; the claim then follows from Bernstein's inequality.

Proof. First, we need to bound $L_S$. After training, we have $\|f(k)-y\|_2 \le \epsilon < 1$. By Theorem C.2 and then Theorem C.5, we get that, for sufficiently large $m$, the stated generalization bound holds, where the last step follows from $B > 0$. Therefore, we conclude the claimed result.

Theorem C.5. Fix a failure probability $\delta\in(0,1)$. Suppose the training data $S = \{(x_i, y_i)\}_{i=1}^n$ are i.i.d. samples from a $(\lambda, \delta, n)$-non-degenerate distribution $\mathcal{D}$. Assume the settings in Theorem A.18, except that now we let $m$ be larger as stated below. Denote by $\mathcal{F}$ the set of one-hidden-layer neural networks trained by gradient descent. Then, with probability at least $1-2\delta-e^{-\Omega(n)}$ over the randomness in the symmetric initialization and the training data, the set $\mathcal{F}$ has empirical Rademacher complexity bounded as stated.

Note that the only extra requirement we place on $m$ is the $(n/\lambda)^6$ dependence instead of the $(n/\lambda)^4$ needed for convergence. The dependence of $m$ on $n$ is significantly better than in previous work (Song & Yang, 2019), where the dependence is $n^{14}$; we take advantage of our initialization and a new analysis to improve the dependence on $n$.

Proof. Let $R_w$ ($R_b$) denote the maximum distance moved by any neuron weight (bias), playing the same role as $D_w$ ($D_b$) in Lemma A.9. From Lemma A.9 and Lemma A.

