IMPLICIT BIAS IN LEAKY RELU NETWORKS TRAINED ON HIGH-DIMENSIONAL DATA

Abstract

The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two. Moreover, this network is an $\ell_2$-max-margin solution (in parameter space), and has a linear decision boundary that corresponds to an approximate-max-margin linear predictor. For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training. We provide experiments which suggest that a small initialization scale is important for finding low-rank neural networks with gradient descent.

1. INTRODUCTION

Neural networks trained by gradient descent appear to generalize well in many settings, even when trained without explicit regularization. It is thus understood that the usage of gradient-based optimization imposes an implicit bias towards particular solutions which enjoy favorable properties. The nature of this implicit regularization effect, and its dependence on the structure of the training data, the architecture of the network, and the particular gradient-based optimization algorithm, is thus a central object of study in the theory of deep learning.

In this work, we examine the implicit bias of gradient descent when the training data is such that the pairwise correlations $|\langle x_i, x_j \rangle|$ between distinct samples $x_i, x_j \in \mathbb{R}^d$ are much smaller than the squared Euclidean norms of each sample: that is, the samples are nearly-orthogonal. As we shall show, this property is often satisfied when the training data is sampled i.i.d. from a $d$-dimensional distribution and $d$ is significantly larger than the number of samples $n$. We will thus refer to such training data with the descriptors 'high-dimensional' and 'nearly-orthogonal' interchangeably.

We consider fully-connected two-layer networks with $m$ neurons where the first-layer weights are trained and the second-layer weights are fixed at their random initialization. If we denote the first-layer weights by $W \in \mathbb{R}^{m \times d}$, with rows $w_j^\top \in \mathbb{R}^d$, then the network output is given by
\[ f(x; W) := \sum_{j=1}^{m} a_j \phi(\langle w_j, x \rangle), \]
where $a_j \in \mathbb{R}$, $j = 1, \dots, m$, are fixed. We consider the implicit bias in two different settings: gradient flow, which corresponds to gradient descent where the step-size tends to zero, and standard gradient descent.

[...] a certain function norm known as the variation norm.
Phuong & Lampert (2020) studied the implicit bias in two-layer ReLU networks trained on orthogonally separable data (i.e., where for every pair of labeled examples $(x_i, y_i), (x_j, y_j)$ we have $x_i^\top x_j > 0$ if $y_i = y_j$ and $x_i^\top x_j \le 0$ otherwise). Safran et al. (2022) proved implicit bias towards minimizing the number of linear regions in univariate two-layer ReLU networks. Implicit bias in neural networks trained with nearly-orthogonal data was previously studied in Vardi et al. (2022). Their assumptions on the training data are similar to ours, but they consider ReLU networks and prove bias towards non-robust networks. Their results do not have any clear implications for our setting.

Implicit bias towards rank minimization was also studied in several other papers. Ji & Telgarsky (2018; 2020) showed that in linear networks of output dimension 1, gradient flow with exponentially-tailed losses converges to networks where the weight matrix of every layer is of rank 1. Timor et al. (2022) showed that the bias towards margin maximization in homogeneous ReLU networks may induce a certain bias towards rank minimization in the weight matrices of sufficiently deep ReLU networks. Finally, implicit bias towards rank minimization was also studied in regression settings. See, e.g., Arora et al. (2019); Razin & Cohen (2020); Li et al. (2020); Timor et al. (2022); Ergen & Pilanci (2021; 2020).

Training dynamics of neural networks for linearly separable training data. A series of works have explored the training dynamics of gradient descent when the data is linearly separable (such as is the case when the input dimension is larger than the number of samples, as we consider here). Brutzkus et al. (2017) showed that in two-layer leaky ReLU networks, SGD on the hinge loss for linearly separable data converges to zero loss. Frei et al. (2021) showed that even when a constant fraction of the training labels are corrupted by an adversary, in two-layer leaky ReLU networks, SGD on the logistic loss produces neural networks that have generalization error close to the label noise rate. As we mentioned above, both Lyu et al. (2021) and Sarussi et al. (2021) considered two-layer leaky ReLU networks trained by gradient-based methods on linearly separable datasets. Wang et al. (2019) and Yang et al. (2021) considered the dynamics of variants of GD/SGD algorithms on the hinge loss for ReLU networks for linearly separable distributions.

Another line of work has explored the dynamics of neural network training when the data is sampled i.i.d. from a distribution which is not linearly separable but the training data is linearly separable due to the number of samples being smaller than the input dimension. Cao et al. (2022) studied two-layer convolutional networks trained on an image-patch data model and showed how a low signal-to-noise ratio can result in harmful overfitting, while a high signal-to-noise ratio allows for good generalization performance. Shen et al. (2022) considered a similar image-patch signal model and studied how data augmentation can improve generalization performance of two-layer convolutional networks. Frei et al. (2022a) showed that two-layer fully connected networks trained on high-dimensional mixture model data can exhibit a 'benign overfitting' phenomenon. Frei et al. (2022b) studied the feature-learning process for two-layer ReLU networks trained on noisy 2-xor clustered data and showed that early-stopped networks can generalize well even in high-dimensional settings. Boursier et al. (2022) studied the dynamics of gradient flow on the squared loss for two-layer ReLU networks with orthogonal inputs.

2. PRELIMINARIES

Notation. For a vector $x$ we denote by $\|x\|$ the Euclidean norm. For a matrix $W$ we denote by $\|W\|_F$ the Frobenius norm, and by $\|W\|_2$ the spectral norm. We denote by $\mathbb{1}[\cdot]$ the indicator function, for example $\mathbb{1}[t \ge 5]$ equals 1 if $t \ge 5$ and 0 otherwise. We denote $\mathrm{sign}(z) = 1$ for $z > 0$ and $\mathrm{sign}(z) = -1$ otherwise. For an integer $d \ge 1$ we denote $[d] = \{1, \dots, d\}$. We denote by $\mathsf{N}(\mu, \sigma^2)$ the Gaussian distribution. We denote the maximum of two real numbers $a, b$ as $a \vee b$, and their minimum as $a \wedge b$. We denote by $\log$ the logarithm with base $e$. We use the standard $O(\cdot)$ and $\Omega(\cdot)$ notation to only hide universal constant factors, and use $\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ to hide poly-logarithmic factors in the argument.

Neural networks. In this work we consider depth-2 neural networks, where the second layer is fixed and only the first layer is trained. Thus, a neural network with parameters $W$ is defined as
\[ f(x; W) = \sum_{j=1}^{m} a_j \phi(w_j^\top x), \]
where $x \in \mathbb{R}^d$ is an input, $W \in \mathbb{R}^{m \times d}$ is a weight matrix with rows $w_1^\top, \dots, w_m^\top$, the weights in the second layer are $a_j \in \{\pm 1/\sqrt{m}\}$ for $j \in [m]$, and $\phi : \mathbb{R} \to \mathbb{R}$ is an activation function. We focus on the leaky ReLU activation function, defined by $\phi(z) = \max\{z, \gamma z\}$ for some constant $\gamma \in (0, 1)$, and on a smooth approximation of leaky ReLU (defined later).

Gradient descent and gradient flow. Let $S = \{(x_i, y_i)\}_{i=1}^{n} \subseteq \mathbb{R}^d \times \{\pm 1\}$ be a binary-classification training dataset. Let $f(\cdot; W) : \mathbb{R}^d \to \mathbb{R}$ be a neural network parameterized by $W$. For a loss function $\ell : \mathbb{R} \to \mathbb{R}$ the empirical loss of $f(\cdot; W)$ on the dataset $S$ is
\[ L(W) := \frac{1}{n} \sum_{i=1}^{n} \ell(y_i f(x_i; W)). \]
We focus on the exponential loss $\ell(q) = e^{-q}$ and the logistic loss $\ell(q) = \log(1 + e^{-q})$.

In gradient descent, we initialize $[W^{(0)}]_{i,j} \overset{\text{i.i.d.}}{\sim} \mathsf{N}(0, \omega_{\text{init}}^2)$ for some $\omega_{\text{init}} \ge 0$, and in each iteration we update
\[ W^{(t+1)} = W^{(t)} - \alpha \nabla_W L(W^{(t)}), \]
where $\alpha > 0$ is a fixed step size.

Gradient flow captures the behavior of gradient descent with an infinitesimally small step size. The trajectory $W(t)$ of gradient flow is defined such that starting from an initial point $W(0)$, the dynamics of $W(t)$ obeys the differential equation $\frac{dW(t)}{dt} = -\nabla_W L(W(t))$. When $L(W)$ is non-differentiable, the dynamics of gradient flow obeys the differential inclusion $\frac{dW(t)}{dt} \in -\partial^\circ L(W(t))$, where $\partial^\circ$ denotes the Clarke subdifferential, which is a generalization of the derivative for non-differentiable functions (see Appendix A for a formal definition).
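The definitions above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names, the problem sizes, and the choice of subgradient at zero (we pick $\gamma$, one valid element of $[\gamma, 1]$) are assumptions made here for the example.

```python
import numpy as np

def leaky_relu(z, gamma):
    """phi(z) = max(z, gamma * z) for gamma in (0, 1)."""
    return np.maximum(z, gamma * z)

def leaky_relu_subgrad(z, gamma):
    """One element of the subdifferential: 1 if z > 0, gamma otherwise (assumed choice at 0)."""
    return np.where(z > 0, 1.0, gamma)

def forward(X, W, a, gamma):
    """f(x_i; W) = sum_j a_j * phi(<w_j, x_i>) for every row x_i of X."""
    return leaky_relu(X @ W.T, gamma) @ a

def empirical_loss(X, y, W, a, gamma):
    """Logistic empirical loss L(W) = mean_i log(1 + exp(-y_i f(x_i; W)))."""
    margins = y * forward(X, W, a, gamma)
    return np.mean(np.logaddexp(0.0, -margins))

def gd_step(X, y, W, a, gamma, alpha):
    """One update W <- W - alpha * grad L(W); only the first layer is trained."""
    pre = X @ W.T                                       # pre-activations, shape (n, m)
    margins = y * (leaky_relu(pre, gamma) @ a)
    neg_lprime = 0.5 * (1.0 - np.tanh(margins / 2.0))   # -l'(q) = 1 / (1 + e^q), stable form
    coeff = (neg_lprime * y)[:, None] * leaky_relu_subgrad(pre, gamma) * a[None, :]
    grad = -(coeff.T @ X) / X.shape[0]                  # shape (m, d)
    return W - alpha * grad

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n, d, m, gamma, alpha, omega_init = 20, 500, 16, 0.5, 0.01, 1e-3
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)        # fixed second layer
W0 = omega_init * rng.standard_normal((m, d))           # N(0, omega_init^2) entries
W1 = gd_step(X, y, W0, a, gamma, alpha)
print(empirical_loss(X, y, W0, a, gamma), empirical_loss(X, y, W1, a, gamma))
```

Gradient flow has no exact discrete counterpart; repeating `gd_step` with a very small `alpha` is only a rough numerical stand-in for it.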

3. ASYMPTOTIC ANALYSIS OF THE IMPLICIT BIAS

In this section, we study the implicit bias of gradient flow in the limit $t \to \infty$. Our results build on a theorem by Lyu & Li (2019) and Ji & Telgarsky (2020), which considers the implicit bias in homogeneous neural networks. Let $f(x; \theta)$ be a neural network parameterized by $\theta$, where we view $\theta$ as a vector. The network $f$ is homogeneous if there exists $L > 0$ such that for every $\beta > 0$ and $x, \theta$ we have $f(x; \beta\theta) = \beta^{L} f(x; \theta)$. We say that a trajectory $\theta(t)$ of gradient flow converges in direction to $\theta^*$ if $\lim_{t \to \infty} \frac{\theta(t)}{\|\theta(t)\|} = \frac{\theta^*}{\|\theta^*\|}$. Their theorem can be stated as follows.

Theorem 3.1 (Paraphrased from Lyu & Li (2019); Ji & Telgarsky (2020)). Let $f$ be a homogeneous ReLU or leaky ReLU neural network parameterized by $\theta$. Consider minimizing either the exponential or the logistic loss over a binary classification dataset $\{(x_i, y_i)\}_{i=1}^{n}$ using gradient flow. Assume that there exists time $t_0$ such that $L(\theta(t_0)) < \frac{\log(2)}{n}$. Then, gradient flow converges in direction to a first order stationary point (KKT point) of the following maximum-margin problem in parameter space:
\[ \min_{\theta} \tfrac{1}{2}\|\theta\|^2 \quad \text{s.t.} \quad \forall i \in [n] \;\; y_i f(x_i; \theta) \ge 1. \]
Moreover, $L(\theta(t)) \to 0$ and $\|\theta(t)\| \to \infty$ as $t \to \infty$.

We focus here on depth-2 leaky ReLU networks where the trained parameters are the entries of the weight matrix $W \in \mathbb{R}^{m \times d}$ of the first layer. Such networks are homogeneous (with $L = 1$), and hence the above theorem guarantees that if there exists time $t_0$ such that $L(W(t_0)) < \frac{\log(2)}{n}$, then gradient flow converges in direction to a KKT point of the problem
\[ \min_{W} \tfrac{1}{2}\|W\|_F^2 \quad \text{s.t.} \quad \forall i \in [n] \;\; y_i f(x_i; W) \ge 1. \tag{1} \]
Note that in leaky ReLU networks Problem (1) is non-smooth. Hence, the KKT conditions are defined using the Clarke subdifferential. See Appendix A for more details of the KKT conditions. The theorem implies that even though there might be many possible directions $\frac{W}{\|W\|_F}$ that classify the dataset correctly, gradient flow converges only to directions that are KKT points of Problem (1).
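As a quick numerical sanity check of the homogeneity claim (with $L = 1$) for depth-2 leaky ReLU networks, the following sketch compares $f(x; \beta W)$ with $\beta f(x; W)$ for a few positive $\beta$. The sizes, the seed, and the value of $\gamma$ are arbitrary choices made for this illustration.

```python
import numpy as np

def leaky_relu(z, gamma=0.1):
    return np.maximum(z, gamma * z)

def network(x, W, a, gamma=0.1):
    """f(x; W) = sum_j a_j * phi(w_j^T x) with a fixed second layer a."""
    return a @ leaky_relu(W @ x, gamma)

rng = np.random.default_rng(0)
m, d = 8, 5                                        # hypothetical sizes
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)
x = rng.standard_normal(d)

for beta in (0.5, 2.0, 10.0):
    # Homogeneity of degree 1: scaling W by beta > 0 scales the output by beta.
    print(beta, np.isclose(network(x, beta * W, a), beta * network(x, W, a)))
```

The identity relies on $\phi(\beta z) = \beta \phi(z)$ for $\beta > 0$, which holds for the leaky ReLU but not for the smoothed activation used in Section 4.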
We note that such a KKT point is not necessarily a global/local optimum (cf. Vardi et al. (2021); Lyu et al. (2021)). Thus, under the theorem's assumptions, gradient flow may not converge to an optimum of Problem (1), but it is guaranteed to converge to a KKT point.

We now state our main result for this section. For convenience, we will use different notations for positive neurons (i.e., where $a_j = 1/\sqrt{m}$) and negative neurons (i.e., where $a_j = -1/\sqrt{m}$). Namely,
\[ f(x; W) = \sum_{j=1}^{m} a_j \phi(w_j^\top x) = \sum_{j=1}^{m_1} \frac{1}{\sqrt{m}} \phi(v_j^\top x) - \sum_{j=1}^{m_2} \frac{1}{\sqrt{m}} \phi(u_j^\top x). \tag{2} \]
Note that $m = m_1 + m_2$. We assume that $m_1, m_2 \ge 1$.

Theorem 3.2. Let $\{(x_i, y_i)\}_{i=1}^{n} \subseteq \mathbb{R}^d \times \{\pm 1\}$ be a training dataset, and let $R_{\max} := \max_i \|x_i\|$, $R_{\min} := \min_i \|x_i\|$ and $R = R_{\max}/R_{\min}$. We denote $I := [n]$, $I_+ := \{i \in I : y_i = 1\}$ and $I_- := \{i \in I : y_i = -1\}$. Assume that
\[ R_{\min}^2 \ge 3\gamma^{-3} R^2 n \max_{i \ne j} |\langle x_i, x_j \rangle|. \]
Let $f$ be the leaky ReLU network from (2) and let $W$ be a KKT point of Problem (1). Then, the following hold:

1. $y_i f(x_i; W) = 1$ for all $i \in I$.

2. $v_1 = \dots = v_{m_1} := v$ and $u_1 = \dots = u_{m_2} := u$. Hence, $\mathrm{rank}(W) \le 2$.

3. $v = \frac{1}{\sqrt{m}} \sum_{i \in I_+} \lambda_i x_i - \frac{\gamma}{\sqrt{m}} \sum_{i \in I_-} \lambda_i x_i$ and $u = \frac{1}{\sqrt{m}} \sum_{i \in I_-} \lambda_i x_i - \frac{\gamma}{\sqrt{m}} \sum_{i \in I_+} \lambda_i x_i$, where $\lambda_i \in \left( \frac{1}{2R_{\max}^2}, \frac{3}{2\gamma^2 R_{\min}^2} \right)$ for every $i \in I$. Furthermore, for all $i \in I$ we have $y_i v^\top x_i > 0$ and $y_i u^\top x_i < 0$.

4. $W$ is a global optimum of Problem (1). Moreover, this global optimum is unique.

5. $v, u$ is the global optimum of the following convex problem:
\[ \min_{v, u \in \mathbb{R}^d} \frac{m_1}{2}\|v\|^2 + \frac{m_2}{2}\|u\|^2 \tag{3} \]
\[ \text{s.t.} \quad \forall i \in I_+ : \frac{m_1}{\sqrt{m}} v^\top x_i - \gamma \frac{m_2}{\sqrt{m}} u^\top x_i \ge 1, \qquad \forall i \in I_- : \frac{m_2}{\sqrt{m}} u^\top x_i - \gamma \frac{m_1}{\sqrt{m}} v^\top x_i \ge 1. \]

6. Let $z = \frac{m_1}{\sqrt{m}} v - \frac{m_2}{\sqrt{m}} u$. For every $x \in \mathbb{R}^d$ we have $\mathrm{sign}(f(x; W)) = \mathrm{sign}(z^\top x)$. Thus, the network $f(\cdot; W)$ has a linear decision boundary.

7. The vector $z$ may not be an $\ell_2$-max-margin linear predictor, but it maximizes the margin approximately in the following sense. For all $i \in I$ we have $y_i z^\top x_i \ge 1$, and $\|z\| \le \frac{2}{\kappa + \gamma} \|z^*\|$, where $\kappa := \frac{\min\{m_1, m_2\}}{\max\{m_1, m_2\}}$, and $z^* := \operatorname{argmin}_{z} \|z\|$ s.t. $y_i z^\top x_i \ge 1$ for all $i \in I$.

Note that by the above theorem, the KKT points possess very strong properties: the weight matrix is of rank at most 2; there is margin maximization in parameter space; in function space the predictor has a linear decision boundary; and although there may not be margin maximization in predictor space, the predictor maximizes the margin approximately within a factor of $\frac{2}{\kappa + \gamma}$. Note that if $\kappa = 1$ (i.e., $m_1 = m_2$) and $\gamma$ is roughly 1, then we get margin maximization also in predictor space. We remark that variants of items 2, 5 and 6 were shown in Sarussi et al. (2021) under a different assumption called Neural Agreement Regime (as we discussed in the related work section).

The proof of Theorem 3.2 is given in Appendix B. We now briefly discuss the proof idea. Since $W$ satisfies the KKT conditions of Problem (1), there are $\lambda_1, \dots, \lambda_n$ such that for every $j \in [m]$ we have
\[ w_j = \sum_{i \in I} \lambda_i \nabla_{w_j} (y_i f(x_i; W)) = a_j \sum_{i \in I} \lambda_i y_i \phi'_{i,w_j} x_i, \]
where $\phi'_{i,w_j}$ is a subgradient of $\phi$ at $w_j^\top x_i$. Also we have $\lambda_i \ge 0$ for all $i$, and $\lambda_i = 0$ if $y_i f(x_i; W) \ne 1$. We prove strictly positive upper and lower bounds for each of the $\lambda_i$'s. Since the $\lambda_i$'s are strictly positive, the KKT conditions show that the margin constraints are satisfied with equalities, i.e., part 1 of the theorem. By leveraging these bounds on the $\lambda_i$'s we also derive the remaining parts of the theorem.

The main assumption in Theorem 3.2 is that $R_{\min}^2 \ge 3\gamma^{-3} R^2 n \max_{i \ne j} |\langle x_i, x_j \rangle|$. In words, this means the squared norms of samples are much larger than the pairwise correlations between different samples, i.e. the training data are nearly orthogonal. Lemma 3.3 below implies that if the inputs $x_i$ are drawn from a well-conditioned Gaussian distribution (e.g., $\mathsf{N}(0, I_d)$), then it suffices to require $n \le O\big(\gamma^3 \sqrt{d / \log n}\big)$, i.e., $d \ge \tilde{\Omega}(n^2)$ if $\gamma = \Omega(1)$. Lemma 3.3 holds more generally for a class of subgaussian distributions (see, e.g., Hu et al. (2020, Claim 3.1)), and we state the result for Gaussians here for simplicity.

Lemma 3.3. Suppose that $x_1, \dots, x_n$ are drawn i. [...] $L(W(t_0)) < \frac{\log(2)}{n}$. In the following theorem we show that such $t_0$ exists, regardless of the initialization of gradient flow (the theorem holds both for the logistic and the exponential losses).

Theorem 3.4. Consider gradient flow on the network from (2) w.r.t. a dataset that satisfies the assumption from Theorem 3.2. Then, there exists a finite time $t_0$ such that for all $t \ge t_0$ we have $L(W(t)) < \log(2)/n$.

We prove the theorem in Appendix D. Combining Theorems 3.1, 3.2 and 3.4, we get the following:

Corollary 3.5. Consider gradient flow on the network from (2) w.r.t. a dataset that satisfies the assumption from Theorem 3.2. Then, gradient flow converges to zero loss, and converges in direction to a weight matrix $W$ that satisfies items 1-7 from Theorem 3.2.
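Items 2 and 6 of Theorem 3.2 can be illustrated numerically. The sketch below builds a network whose positive neurons all share a vector `v` and whose negative neurons all share a vector `u`, then compares the sign of the network output with the sign of the linear predictor $z^\top x$. The vectors `v` and `u` here are random placeholders rather than an actual KKT point of Problem (1); this is an assumption of the example, made because a case analysis on the signs of $v^\top x$ and $u^\top x$ suggests that the sign identity uses only the shared-weight structure. All names and sizes are hypothetical.

```python
import numpy as np

def leaky_relu(z, gamma):
    return np.maximum(z, gamma * z)

def shared_weight_network(X, v, u, m1, m2, gamma):
    """f(x; W) from (2) when v_1 = ... = v_{m1} = v and u_1 = ... = u_{m2} = u."""
    m = m1 + m2
    return (m1 * leaky_relu(X @ v, gamma) - m2 * leaky_relu(X @ u, gamma)) / np.sqrt(m)

def linear_predictor(v, u, m1, m2):
    """z = (m1 / sqrt(m)) v - (m2 / sqrt(m)) u, as in item 6."""
    m = m1 + m2
    return (m1 * v - m2 * u) / np.sqrt(m)

def sign(t):
    """Paper's convention: sign(t) = 1 for t > 0 and -1 otherwise."""
    return np.where(t > 0, 1, -1)

rng = np.random.default_rng(0)
d, m1, m2, gamma = 20, 7, 3, 0.3
v, u = rng.standard_normal(d), rng.standard_normal(d)    # placeholders, not KKT points
X_test = rng.standard_normal((10_000, d))

net_sign = sign(shared_weight_network(X_test, v, u, m1, m2, gamma))
lin_sign = sign(X_test @ linear_predictor(v, u, m1, m2))
W = np.vstack([np.tile(v, (m1, 1)), np.tile(u, (m2, 1))])  # first-layer weight matrix
print(np.array_equal(net_sign, lin_sign), np.linalg.matrix_rank(W))
```

The remaining items of the theorem (the bounds on $\lambda_i$, optimality, and the approximate margin guarantee) do depend on the near-orthogonality assumption and are not exercised by this check.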

4. NON-ASYMPTOTIC ANALYSIS OF THE IMPLICIT BIAS

In this section, we study the implicit bias of gradient descent with a fixed step size following random initialization (refer to Section 2 for the definition of gradient descent). Our results in this section are for the logistic loss $\ell(z) = \log(1 + \exp(-z))$ but could be extended to the exponential loss as well. We shall assume the activation function $\phi$ satisfies $\phi(0) = 0$ and is twice differentiable and there exist constants $\gamma \in (0, 1]$, $H > 0$ such that
\[ 0 < \gamma \le \phi'(z) \le 1, \quad \text{and} \quad |\phi''(z)| \le H. \]
We shall refer to functions satisfying the above properties as $\gamma$-leaky, $H$-smooth. Note that such functions are not necessarily homogeneous. Examples of such functions are any smoothed approximation to the leaky ReLU that is zero at the origin. One such example is $\phi(z) = \gamma z + (1 - \gamma) \log\left(\frac{1}{2}(1 + \exp(z))\right)$, which is $\gamma$-leaky and $1/4$-smooth (see Figure 3 in the appendix for a side-by-side plot of this activation with the standard leaky ReLU).

We next introduce the definition of stable rank (Rudelson & Vershynin, 2007).

Definition 4.1. The stable rank of a matrix $W \in \mathbb{R}^{m \times d}$ is $\mathrm{StableRank}(W) = \|W\|_F^2 / \|W\|_2^2$.

The stable rank is in many ways analogous to the classical rank of a matrix but is considerably more well-behaved. For instance, consider the diagonal matrix $W \in \mathbb{R}^{d \times d}$ with diagonal entries equal to 1 except for the first entry which is equal to $\varepsilon \ge 0$. As $\varepsilon \to 0$, the classical rank of the matrix is equal to $d$ until $\varepsilon$ exactly equals 0, while on the other hand the stable rank smoothly decreases from $d$ to $d - 1$. For another example, suppose again $W \in \mathbb{R}^{d \times d}$ is diagonal with $W_{1,1} = 1$ and $W_{i,i} = \exp(-d)$ for $i \ge 2$. The classical rank of this matrix is exactly equal to $d$, while the stable rank of this matrix is $1 + o_d(1)$.

With the above conditions in hand, we can state our main theorem for this section.

Theorem 4.2. Suppose that $\phi$ is a $\gamma$-leaky, $H$-smooth activation. For training data $\{(x_i, y_i)\}_{i=1}^{n} \subset \mathbb{R}^d \times \{\pm 1\}$, let $R_{\max} = \max_i \|x_i\|$ and $R_{\min} = \min_i \|x_i\|$, and suppose $R = R_{\max}/R_{\min}$ is at most an absolute constant. Denote by $C_R := 10R^2/\gamma^2 + 10$. Assume the training data satisfies
\[ R_{\min}^2 \ge 5\gamma^{-2} C_R n \max_{i \ne j} |\langle x_i, x_j \rangle|. \]
There exist absolute constants $C_1, C_2 > 1$ (independent of $m$, $d$, and $n$) such that the following holds. For any $\delta \in (0, 1)$, if the step-size satisfies $\alpha \le \gamma^2 (5 n R_{\max}^2 R^2 C_R \max(1, H))^{-1}$, and $\omega_{\text{init}} \le \alpha\gamma^2 R_{\min} (72 R C_R n\, md \log(4m/\delta))^{-1}$, then with probability at least $1 - \delta$ over the random initialization of gradient descent, the trained network satisfies:

1. The empirical risk under the logistic loss satisfies $L(W^{(t)}) \le C_1 n / (R_{\min}^2 \alpha t)$ for $t \ge 1$.

2. The $\ell_2$ norm of each neuron grows to infinity: for all $j$, $\|w_j^{(t)}\|_2 \to \infty$.

3. The stable rank of the weights is bounded: $\sup_{t \ge 1} \mathrm{StableRank}(W^{(t)}) \le C_2$.

We now make a few remarks on the above theorem. We note that the assumption on the training data is the same as in Theorem 3.2 up to constants (treating $\gamma$ as a constant), and is satisfied in many settings when $d \gg n^2$ (see Lemma 3.3).

For the first part of the theorem, we show that despite the non-convexity of the underlying optimization problem, gradient descent can efficiently minimize the training error, driving the empirical risk to zero.

For the second part of the theorem, note that since the empirical risk under the logistic loss is driven to zero and the logistic loss is decreasing and satisfies $\ell(z) > 0$ for all $z$, it is necessarily the case that the spectral norm of the first layer weights $\|W^{(t)}\|_2 \to \infty$. (Otherwise, $L(W^{(t)})$ would be bounded from below by a constant.) This leaves open the question of whether only a few neurons in the network are responsible for the growth of the magnitude of the spectral norm, and part (2) of the theorem resolves this question.

The third part of the theorem is perhaps the most interesting one.
In Theorem 3.2, we showed that for the standard leaky ReLU activation trained on nearly-orthogonal data with gradient flow, the asymptotic true rank of the network is at most 2. By contrast, Theorem 4.2 shows that neural networks with $\gamma$-leaky, $H$-smooth activations trained by gradient descent have a constant stable rank after the first step of gradient descent, and the stable rank remains bounded by a constant throughout the trajectory. Note that at initialization, by standard concentration bounds for random matrices (see, e.g., Vershynin (2010)), the stable rank satisfies $\mathrm{StableRank}(W^{(0)}) \approx \Theta\big( md / (\sqrt{m} + \sqrt{d})^2 \big) = \Omega(m \wedge d)$, so that Theorem 4.2 implies that gradient descent drastically reduces the rank of the matrix after just one step.

The details for the proof of Theorem 4.2 are provided in Appendix E, but we provide some of the main ideas for the proofs of parts 1 and 3 of the theorem here. For the first part, note that training data satisfying the assumptions in the theorem are linearly separable with a large margin (take, for instance, the vector $\sum_{i=1}^{n} y_i x_i$). We use this to establish a proxy Polyak-Lojasiewicz (PL) inequality (Frei & Gu, 2021) that takes the form $\|\nabla L(W^{(t)})\|_F \ge c\, G(W^{(t)})$ for some $c > 0$, where $G(W^{(t)})$ is the empirical risk under the sigmoid loss $-\ell'(z) = 1/(1 + \exp(z))$. Because we consider smoothed leaky ReLU activations, we can use a smoothness-based analysis of gradient descent to show $\|\nabla L(W^{(t)})\|_F \to 0$, which implies $G(W^{(t)}) \to 0$ by the proxy PL inequality. We then translate guarantees for $G(W^{(t)})$ into guarantees for $L(W^{(t)})$ by comparing the sigmoid and logistic losses.

For the third part of the theorem, we need to establish two things: (i) an upper bound for the Frobenius norm, and (ii) a lower bound for the spectral norm. A loose approach for bounding the Frobenius norm via an application of the triangle inequality (over time steps) results in a stable rank bound that grows with the number of samples.
To develop a tighter upper bound, we first establish a structural condition we refer to as a loss ratio bound (see Lemma E.4). In the gradient descent updates, each sample is weighted by a quantity that scales with $-\ell'(y_i f(x_i; W^{(t)})) \in (0, 1)$. We show that these $-\ell'$ losses grow at approximately the same rate for each sample throughout training, and that this allows for a tighter upper bound for the Frobenius norm. Loss ratio bounds were key to the generalization analysis of two previous works on benign overfitting (Chatterji & Long, 2021; Frei et al., 2022a) and may be of independent interest. In Proposition E.10 we provide a general approach for proving loss ratio bounds that can hold for more general settings than the ones we consider in this work (i.e., data which are not high-dimensional, and networks with non-leaky activations).

The lower bound on the spectral norm follows by identifying a single direction $\mu := \sum_{i=1}^{n} y_i x_i$ that is strongly correlated with every neuron's weight $w_j$, in the sense that $|\langle w_j^{(t)} / \|w_j^{(t)}\|, \mu \rangle|$ is relatively large for each $j \in [m]$. Since every neuron is strongly correlated with this direction, this allows for a good lower bound on the spectral norm.

[Figure caption, beginning truncated: ... (4)). The rank reduction happens more quickly as the dimension grows (left; initialization scale 50× smaller than default TensorFlow, $\alpha = 0.01$) and as the initialization scale decreases (right; $d = 10^4$, $\alpha = 0.16$).]
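The stable rank of Definition 4.1 and the one-step rank reduction described in this section can be sketched as follows. This is an illustration under assumptions chosen here, not the paper's experimental setup: standard Gaussian inputs, random labels, and hypothetical values for `n`, `d`, `m`, `gamma`, `alpha` and `omega_init` that are not derived from the conditions of Theorem 4.2.

```python
import numpy as np

def stable_rank(W):
    """StableRank(W) = ||W||_F^2 / ||W||_2^2 (Definition 4.1)."""
    return np.linalg.norm(W, "fro") ** 2 / np.linalg.norm(W, 2) ** 2

def smooth_leaky(z, gamma):
    """phi(z) = gamma * z + (1 - gamma) * log((1 + exp(z)) / 2); phi(0) = 0."""
    return gamma * z + (1.0 - gamma) * (np.logaddexp(0.0, z) - np.log(2.0))

def smooth_leaky_grad(z, gamma):
    """phi'(z) = gamma + (1 - gamma) * sigmoid(z), which lies in (gamma, 1)."""
    return gamma + (1.0 - gamma) * 0.5 * (1.0 + np.tanh(z / 2.0))

def gd_step(X, y, W, a, gamma, alpha):
    """One gradient descent step on the logistic empirical risk (first layer only)."""
    pre = X @ W.T
    margins = y * (smooth_leaky(pre, gamma) @ a)
    neg_lprime = 0.5 * (1.0 - np.tanh(margins / 2.0))   # -l'(q) = 1 / (1 + e^q)
    coeff = (neg_lprime * y)[:, None] * smooth_leaky_grad(pre, gamma) * a[None, :]
    return W + alpha * (coeff.T @ X) / X.shape[0]

# The diagonal example from the text: entries 1 except the first, which is eps.
for eps in (1.0, 0.5, 0.0):
    print(eps, stable_rank(np.diag([eps] + [1.0] * 9)))  # equals 9 + eps**2

# Hypothetical high-dimensional setup with a small initialization scale.
rng = np.random.default_rng(0)
n, d, m, gamma, alpha, omega_init = 20, 1000, 50, 0.5, 0.01, 1e-6
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)
W0 = omega_init * rng.standard_normal((m, d))
W1 = gd_step(X, y, W0, a, gamma, alpha)
print(stable_rank(W0), stable_rank(W1))
```

With a tiny initialization, the first update is dominated by the gradient term, whose rows are all close to multiples of $\sum_i y_i x_i$, which is why the stable rank in this sketch collapses after one step.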

5. IMPLICATIONS OF THE IMPLICIT BIAS AND EMPIRICAL OBSERVATIONS

The results in the preceding sections show a remarkable simplicity bias of gradient-based optimization when training two-layer networks with leaky activations on sufficiently high-dimensional data. For gradient flow, regardless of the initialization, the learned network has a linear decision boundary, even when the labels $y$ are some nonlinear function of the input features and when the network has the capacity to approximate any continuous function. With our analysis of gradient descent, we showed that the bias towards producing low-complexity networks (as measured by the stable rank of the network) is something that occurs quickly following random initialization, provided the initialization scale is small enough.

In some distributional settings, this bias towards rather simple classifiers may be beneficial, while in others it may be harmful. To see where it may be beneficial, consider a Gaussian mixture model distribution $P$, parameterized by a mean vector $\mu \in \mathbb{R}^d$, where samples $(x, y) \sim P$ have a distribution as follows:
\[ y \sim \mathrm{Uniform}(\{\pm 1\}), \quad x \mid y \sim y\mu + z, \quad z \sim \mathsf{N}(0, I_d). \tag{4} \]
The linear classifier $x \mapsto \mathrm{sign}(\langle \mu, x \rangle)$ performs optimally for this distribution, and so the implicit bias of gradient descent towards low-rank classifiers (and of gradient flow towards linear decision boundaries) for high-dimensional data could in principle be helpful for allowing neural networks trained on such data to generalize well for this distribution. Indeed, as shown by Chatterji & Long (2021), since $\|x_i\|^2 \approx d + \|\mu\|^2$ while $|\langle x_i, x_j \rangle| \approx \|\mu\|^2 + \sqrt{d}$ for $i \ne j$, provided $\|\mu\| = \Theta(d^{\beta})$ and $d \gg n^{\frac{1}{1-2\beta}} \vee n^2$ for $\beta \in (0, 1/2)$, the assumptions in Theorem 4.2 hold. Thus, for gradient descent on two-layer networks with $\gamma$-leaky, $H$-smooth activations, the empirical risk is driven to zero and the stable rank of the network is constant after the first step of gradient descent. In this setting, Frei et al. (2022a) recently showed that such networks have small generalization error.
This shows that the implicit bias towards classifiers with constant rank can be beneficial in distributional settings where linear classifiers can perform well.

On the other hand, the same implicit bias can be harmful if the training data come from a distribution that does not align with this bias. [...] preceding paragraph the assumptions needed for Theorem 3.2 are satisfied provided $d \gg n^{\frac{1}{1-2\beta}} \vee n^2$. In this setting, regardless of the initialization, by Theorem 3.2 the limit of gradient flow produces a neural network which has a linear decision boundary and thus achieves 50% test error. In the appendix (see Fig. 6) we verify this with experiments. Thus, the implicit bias can be beneficial in some settings and harmful in others.

Theorem 4.2 and Lemma 3.3 suggest that the relationship between the input dimension and the number of samples, as well as the initialization variance, can influence how quickly gradient descent finds low-rank networks. In Figure 1 we examine these factors for two-layer nets trained on a Gaussian mixture model distribution (see Appendix F for experimental details). We see that the bias towards rank reduction increases as the dimension increases and the initialization scale decreases, as suggested by our theory. Moreover, it appears that the initialization scale has more influence on the rank reduction than running gradient descent for longer. In Appendix F we provide more detailed empirical investigations into this phenomenon.

In Figure 2, we investigate whether or not the initialization scale's effect on the rank reduction of gradient descent occurs in settings not covered by our theory, namely in two-layer ReLU networks with bias terms trained by SGD on CIFAR-10.
We consider two different initialization schemes: (1) Glorot uniform, the default TensorFlow initialization scheme with standard deviation of order $1/\sqrt{m + d}$, and (2) a uniform initialization scheme with 50× smaller standard deviation than that of the Glorot uniform initialization. In the default initialization scheme, it appears that a reduction in the rank of the network only comes in the late stages of training, and the smallest stable rank achieved by the network within $10^6$ steps is 74.0. On the other hand, with the smaller initialization scheme, the rank reduction comes rapidly, and the smallest stable rank achieved by the network is 3.25. It is also interesting to note that in the small initialization setting, after gradient descent rapidly produces low-rank weights, the rank of the trained network begins to increase only when the train and test accuracies begin to diverge.
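The approximations $\|x_i\|^2 \approx d + \|\mu\|^2$ and $|\langle x_i, x_j \rangle| \approx \|\mu\|^2 + \sqrt{d}$ quoted above for the Gaussian mixture model (4) can be probed numerically. The sketch below samples from (4) and evaluates the data assumption of Theorem 4.2 as stated. The function names, the direction chosen for $\mu$, and the sizes are assumptions of this example; because the constants in the theorem are large, the condition may well evaluate to false at moderate dimensions even when the data look nearly orthogonal.

```python
import numpy as np

def sample_gmm(n, mu, rng):
    """Draw (x_i, y_i) from (4): y uniform on {-1, +1}, x = y * mu + z, z ~ N(0, I_d)."""
    y = rng.choice([-1.0, 1.0], size=n)
    X = y[:, None] * mu[None, :] + rng.standard_normal((n, mu.shape[0]))
    return X, y

def orthogonality_stats(X):
    """Return (R_min^2, R_max^2, max_{i != j} |<x_i, x_j>|)."""
    G = X @ X.T
    sq_norms = np.diag(G)
    off_diag = np.abs(G - np.diag(sq_norms))
    return sq_norms.min(), sq_norms.max(), off_diag.max()

def theorem_4_2_data_condition(X, gamma):
    """Check R_min^2 >= 5 * gamma^-2 * C_R * n * max_{i != j} |<x_i, x_j>|."""
    r2_min, r2_max, p = orthogonality_stats(X)
    R2 = r2_max / r2_min                      # R^2 with R = R_max / R_min
    C_R = 10.0 * R2 / gamma ** 2 + 10.0
    return bool(r2_min >= 5.0 * gamma ** -2 * C_R * X.shape[0] * p)

rng = np.random.default_rng(0)
n, d, beta, gamma = 10, 10_000, 0.25, 0.5     # hypothetical values
mu = np.zeros(d)
mu[0] = d ** beta                             # ||mu|| = d^beta; the direction is arbitrary
X, y = sample_gmm(n, mu, rng)
r2_min, r2_max, p = orthogonality_stats(X)
print(r2_min, r2_max, d + mu @ mu)            # squared norms vs. d + ||mu||^2
print(p, mu @ mu + np.sqrt(d))                # largest correlation vs. ||mu||^2 + sqrt(d)
print(theorem_4_2_data_condition(X, gamma))
```

Increasing `d` while holding `n` fixed makes the ratio of the largest off-diagonal correlation to the smallest squared norm shrink, which is the regime the theory targets.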

6. CONCLUSION

In this work, we characterized the implicit bias of common gradient-based optimization algorithms for two-layer leaky ReLU networks when trained on high-dimensional datasets. For both gradient flow and gradient descent, we proved convergence to near-zero training loss and that there is an implicit bias towards low-rank networks. For gradient flow, we showed a number of additional implicit biases: the weights are (unique) global maxima of the associated margin maximization problem, and the decision boundary of the learned network is linear. For gradient descent, we provided experimental evidence which suggests that small initialization variance is important for gradient descent's ability to quickly produce low-rank networks.

There are many natural directions to pursue following this work. One question is whether or not a similar implicit bias towards low-rank weights in fully connected networks exists for networks with different activation functions or for data which is not nearly orthogonal. Our proofs relied heavily upon the near-orthogonality of the data, and the 'leaky' behavior of the leaky ReLU, namely that there is some $\gamma > 0$ such that $\phi'(z) \ge \gamma$ for all $z \in \mathbb{R}$. We conjecture that some of the properties we showed in Theorem 3.2 (e.g., a linear decision boundary) may not hold for non-leaky activations, like the ReLU, or without the near-orthogonality assumption.

[...]
\[ \partial^\circ f(x) := \mathrm{conv}\left\{ \lim_{i \to \infty} \nabla f(x_i) \;\middle|\; \lim_{i \to \infty} x_i = x,\ f \text{ is differentiable at } x_i \right\}. \]
If $f$ is continuously differentiable at $x$ then $\partial^\circ f(x) = \{\nabla f(x)\}$. For the Clarke subdifferential the chain rule holds as an inclusion rather than an equation. That is, for locally Lipschitz functions $z_1, \dots, z_n : \mathbb{R}^d \to \mathbb{R}$ and $f : \mathbb{R}^n \to \mathbb{R}$, we have
\[ \partial^\circ (f \circ z)(x) \subseteq \mathrm{conv}\left\{ \sum_{i=1}^{n} \alpha_i h_i : \alpha \in \partial^\circ f(z_1(x), \dots, z_n(x)),\ h_i \in \partial^\circ z_i(x) \right\}. \]
Consider the following optimization problem
\[ \min f(x) \quad \text{s.t.} \quad \forall n \in [N] \;\; g_n(x) \le 0, \tag{5} \]
where $f, g_1, \dots, g_N : \mathbb{R}^d \to \mathbb{R}$ are locally Lipschitz functions.
We say that $x \in \mathbb{R}^d$ is a feasible point of Problem (5) if $x$ satisfies $g_n(x) \le 0$ for all $n \in [N]$. We say that a feasible point $x$ is a KKT point if there exist $\lambda_1, \dots, \lambda_N \ge 0$ such that

1. $0 \in \partial^\circ f(x) + \sum_{n \in [N]} \lambda_n \partial^\circ g_n(x)$;

2. For all $n \in [N]$ we have $\lambda_n g_n(x) = 0$.

Published as a conference paper at ICLR 2023

B PROOF OF THEOREM 3.2

We start with some notations. We denote $p = \max_{i \ne j} |\langle x_i, x_j \rangle|$. Thus, our assumption on $n$ can be written as
\[ n \le \frac{\gamma^3}{3} \cdot \frac{R_{\min}^2}{p} \cdot \frac{R_{\min}^2}{R_{\max}^2}. \]
Since $W$ satisfies the KKT conditions of Problem (1), there are $\lambda_1, \dots, \lambda_n$ such that for every $j \in [m_1]$ we have
\[ v_j = \sum_{i \in I} \lambda_i \nabla_{v_j} (y_i f(x_i; W)) = \frac{1}{\sqrt{m}} \sum_{i \in I} \lambda_i y_i \phi'_{i,v_j} x_i, \tag{6} \]
where $\phi'_{i,v_j}$ is a subgradient of $\phi$ at $v_j^\top x_i$, i.e., if $v_j^\top x_i > 0$ then $\phi'_{i,v_j} = 1$, if $v_j^\top x_i < 0$ then $\phi'_{i,v_j} = \gamma$, and otherwise $\phi'_{i,v_j}$ is some value in $[\gamma, 1]$. Also we have $\lambda_i \ge 0$ for all $i$, and $\lambda_i = 0$ if $y_i f(x_i; W) \ne 1$. Likewise, for all $j \in [m_2]$ we have
\[ u_j = \sum_{i \in I} \lambda_i \nabla_{u_j} (y_i f(x_i; W)) = \frac{1}{\sqrt{m}} \sum_{i \in I} \lambda_i (-y_i) \phi'_{i,u_j} x_i, \tag{7} \]
where $\phi'_{i,u_j}$ is defined similarly to $\phi'_{i,v_j}$. The proof of the theorem follows from the following lemmas.

Lemma B.1. For all $i \in I$ we have
\[ \sum_{j \in [m_1]} \lambda_i \phi'_{i,v_j} + \sum_{j \in [m_2]} \lambda_i \phi'_{i,u_j} < \frac{3m}{2\gamma R_{\min}^2}. \]
Furthermore, $\lambda_i < \frac{3}{2\gamma^2 R_{\min}^2}$ for all $i \in I$.

Proof. Let $\xi = \max_{q \in I} \left( \sum_{j \in [m_1]} \lambda_q \phi'_{q,v_j} + \sum_{j \in [m_2]} \lambda_q \phi'_{q,u_j} \right)$ and suppose that $\xi \ge \frac{3m}{2\gamma R_{\min}^2}$. Let $r = \operatorname{argmax}_{q \in I} \left( \sum_{j \in [m_1]} \lambda_q \phi'_{q,v_j} + \sum_{j \in [m_2]} \lambda_q \phi'_{q,u_j} \right)$. Since $\xi \ge \frac{3m}{2\gamma R_{\min}^2} > 0$ then $\lambda_r > 0$, and hence by the KKT conditions we must have $y_r f(x_r; W) = 1$. We consider two cases:

Case 1: Assume that $r \in I_-$.
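Using (6) and (7), we have
\[
\begin{aligned}
\sqrt{m} f(x_r; W) &= \sum_{j \in [m_1]} \phi(v_j^\top x_r) - \sum_{j \in [m_2]} \phi(u_j^\top x_r) \\
&= \sum_{j \in [m_1]} \phi\Big( \frac{1}{\sqrt{m}} \sum_{q \in I} \lambda_q y_q \phi'_{q,v_j} x_q^\top x_r \Big) - \sum_{j \in [m_2]} \phi\Big( \frac{1}{\sqrt{m}} \sum_{q \in I} \lambda_q (-y_q) \phi'_{q,u_j} x_q^\top x_r \Big) \\
&= \sum_{j \in [m_1]} \phi\Big( \frac{1}{\sqrt{m}} \lambda_r y_r \phi'_{r,v_j} x_r^\top x_r + \frac{1}{\sqrt{m}} \sum_{q \in I \setminus \{r\}} \lambda_q y_q \phi'_{q,v_j} x_q^\top x_r \Big) \\
&\quad - \sum_{j \in [m_2]} \phi\Big( \frac{1}{\sqrt{m}} \lambda_r (-y_r) \phi'_{r,u_j} x_r^\top x_r + \frac{1}{\sqrt{m}} \sum_{q \in I \setminus \{r\}} \lambda_q (-y_q) \phi'_{q,u_j} x_q^\top x_r \Big) \\
&\le \sum_{j \in [m_1]} \phi\Big( -\frac{1}{\sqrt{m}} \lambda_r \phi'_{r,v_j} R_{\min}^2 + \frac{1}{\sqrt{m}} \sum_{q \in I \setminus \{r\}} \lambda_q y_q \phi'_{q,v_j} x_q^\top x_r \Big) \\
&\quad - \sum_{j \in [m_2]} \phi\Big( \frac{1}{\sqrt{m}} \lambda_r \phi'_{r,u_j} R_{\min}^2 + \frac{1}{\sqrt{m}} \sum_{q \in I \setminus \{r\}} \lambda_q (-y_q) \phi'_{q,u_j} x_q^\top x_r \Big).
\end{aligned}
\]
Since the derivative of $\phi$ is lower bounded by $\gamma$, we know $\phi(z_1) - \phi(z_2) \ge \gamma(z_1 - z_2)$ for all $z_1 \ge z_2$. Using this and the definition of $\xi$, the above is at most
\[
\begin{aligned}
&\sum_{j \in [m_1]} \Big[ \phi\Big( \frac{1}{\sqrt{m}} \sum_{q \in I \setminus \{r\}} \lambda_q y_q \phi'_{q,v_j} x_q^\top x_r \Big) - \frac{1}{\sqrt{m}} \gamma \cdot \lambda_r \phi'_{r,v_j} R_{\min}^2 \Big] \\
&\quad - \sum_{j \in [m_2]} \Big[ \phi\Big( \frac{1}{\sqrt{m}} \sum_{q \in I \setminus \{r\}} \lambda_q (-y_q) \phi'_{q,u_j} x_q^\top x_r \Big) + \frac{1}{\sqrt{m}} \gamma \cdot \lambda_r \phi'_{r,u_j} R_{\min}^2 \Big] \\
&\le -\frac{1}{\sqrt{m}} \gamma \xi R_{\min}^2 + \sum_{j \in [m_1]} \Big| \frac{1}{\sqrt{m}} \sum_{q \in I \setminus \{r\}} \lambda_q y_q \phi'_{q,v_j} x_q^\top x_r \Big| + \sum_{j \in [m_2]} \Big| \frac{1}{\sqrt{m}} \sum_{q \in I \setminus \{r\}} \lambda_q (-y_q) \phi'_{q,u_j} x_q^\top x_r \Big| \\
&\le -\frac{1}{\sqrt{m}} \gamma \xi R_{\min}^2 + \frac{1}{\sqrt{m}} \sum_{j \in [m_1]} \sum_{q \in I \setminus \{r\}} \big| \lambda_q y_q \phi'_{q,v_j} x_q^\top x_r \big| + \frac{1}{\sqrt{m}} \sum_{j \in [m_2]} \sum_{q \in I \setminus \{r\}} \big| \lambda_q (-y_q) \phi'_{q,u_j} x_q^\top x_r \big|.
\end{aligned}
\]
Using $|x_q^\top x_r| \le p$ for $q \ne r$, the above is at most
\[
\begin{aligned}
&-\frac{1}{\sqrt{m}} \gamma \xi R_{\min}^2 + \frac{1}{\sqrt{m}} \sum_{j \in [m_1]} \sum_{q \in I \setminus \{r\}} \lambda_q \phi'_{q,v_j} p + \frac{1}{\sqrt{m}} \sum_{j \in [m_2]} \sum_{q \in I \setminus \{r\}} \lambda_q \phi'_{q,u_j} p \\
&= -\frac{1}{\sqrt{m}} \gamma \xi R_{\min}^2 + \frac{p}{\sqrt{m}} \sum_{q \in I \setminus \{r\}} \Big( \sum_{j \in [m_1]} \lambda_q \phi'_{q,v_j} + \sum_{j \in [m_2]} \lambda_q \phi'_{q,u_j} \Big) \\
&\le -\frac{1}{\sqrt{m}} \gamma \xi R_{\min}^2 + \frac{p}{\sqrt{m}} \cdot |I| \cdot \max_{q \in I} \Big( \sum_{j \in [m_1]} \lambda_q \phi'_{q,v_j} + \sum_{j \in [m_2]} \lambda_q \phi'_{q,u_j} \Big) \\
&= -\frac{1}{\sqrt{m}} \gamma \xi R_{\min}^2 + \frac{p}{\sqrt{m}} n \xi = -\frac{\xi}{\sqrt{m}} \left( \gamma R_{\min}^2 - np \right).
\end{aligned}
\]
By our assumption on $n$, we can bound the above expression by
\[
\begin{aligned}
-\frac{\xi}{\sqrt{m}} \Big( \gamma R_{\min}^2 - p \cdot \frac{\gamma^3}{3} \cdot \frac{R_{\min}^2}{p} \cdot \frac{R_{\min}^2}{R_{\max}^2} \Big)
&= -\frac{\xi R_{\min}^2}{\sqrt{m}} \Big( \gamma - \frac{\gamma^3}{3} \cdot \frac{R_{\min}^2}{R_{\max}^2} \Big)
< -\frac{\xi R_{\min}^2}{\sqrt{m}} \Big( \gamma - \frac{\gamma}{3} \Big) \\
&= -\frac{\xi R_{\min}^2}{\sqrt{m}} \cdot \frac{2\gamma}{3}
\le -\frac{3m}{2\gamma R_{\min}^2} \cdot \frac{R_{\min}^2}{\sqrt{m}} \cdot \frac{2\gamma}{3} = -\sqrt{m}.
\end{aligned}
\]
Thus, we obtain $f(x_r; W) < -1$, in contradiction to $y_r f(x_r; W) = 1$.

Case 2: Assume that $r \in I_+$.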
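A calculation similar to the one given in Case 1 (which we do not repeat for conciseness) implies that $f(x_r; W) > 1$, in contradiction to $y_r f(x_r; W) = 1$. This concludes the proof that $\xi < \frac{3m}{2\gamma R_{\min}^2}$.

Finally, since $\xi < \frac{3m}{2\gamma R_{\min}^2}$ and the derivative of $\phi$ is lower bounded by $\gamma$, for all $i \in I$ we have
\[ \frac{3m}{2\gamma R_{\min}^2} > \sum_{j \in [m_1]} \lambda_i \phi'_{i,v_j} + \sum_{j \in [m_2]} \lambda_i \phi'_{i,u_j} \ge m \lambda_i \gamma, \]
and hence $\lambda_i < \frac{3}{2\gamma^2 R_{\min}^2}$.

Lemma B.2. For all $i \in I$ we have
\[ \sum_{j \in [m_1]} \lambda_i \phi'_{i,v_j} + \sum_{j \in [m_2]} \lambda_i \phi'_{i,u_j} > \frac{m}{2 R_{\max}^2}. \]
Furthermore, $\lambda_i > \frac{1}{2 R_{\max}^2}$ for all $i \in I$.

Proof. Suppose that there is $i \in I$ such that $\sum_{j \in [m_1]} \lambda_i \phi'_{i,v_j} + \sum_{j \in [m_2]} \lambda_i \phi'_{i,u_j} \le \frac{m}{2 R_{\max}^2}$. Since $W$ is feasible we have $|f(x_i; W)| \ge 1$, so using (6) and (7) we get
\begin{align*}
\sqrt{m} \le \big| \sqrt{m} f(x_i; W) \big| &= \Big| \sum_{j \in [m_1]} \phi(v_j^\top x_i) - \sum_{j \in [m_2]} \phi(u_j^\top x_i) \Big| \le \sum_{j \in [m_1]} \big| v_j^\top x_i \big| + \sum_{j \in [m_2]} \big| u_j^\top x_i \big| \\
&= \sum_{j \in [m_1]} \Big| \frac{1}{\sqrt{m}} \sum_{q \in I} \lambda_q y_q \phi'_{q,v_j} x_q^\top x_i \Big| + \sum_{j \in [m_2]} \Big| \frac{1}{\sqrt{m}} \sum_{q \in I} \lambda_q (-y_q) \phi'_{q,u_j} x_q^\top x_i \Big| \\
&\le \frac{1}{\sqrt{m}} \sum_{j \in [m_1]} \Big( \big| \lambda_i y_i \phi'_{i,v_j} x_i^\top x_i \big| + \sum_{q \in I \setminus \{i\}} \big| \lambda_q y_q \phi'_{q,v_j} x_q^\top x_i \big| \Big) \\
&\quad + \frac{1}{\sqrt{m}} \sum_{j \in [m_2]} \Big( \big| \lambda_i (-y_i) \phi'_{i,u_j} x_i^\top x_i \big| + \sum_{q \in I \setminus \{i\}} \big| \lambda_q (-y_q) \phi'_{q,u_j} x_q^\top x_i \big| \Big).
\end{align*}
Using $|x_q^\top x_i| \le p$ for $q \ne i$ and $x_i^\top x_i \le R_{\max}^2$, the above is at most
\begin{align*}
&\frac{1}{\sqrt{m}} \sum_{j \in [m_1]} \Big( \lambda_i \phi'_{i,v_j} R_{\max}^2 + \sum_{q \in I \setminus \{i\}} \lambda_q \phi'_{q,v_j} p \Big) + \frac{1}{\sqrt{m}} \sum_{j \in [m_2]} \Big( \lambda_i \phi'_{i,u_j} R_{\max}^2 + \sum_{q \in I \setminus \{i\}} \lambda_q \phi'_{q,u_j} p \Big) \\
&\quad = \frac{R_{\max}^2}{\sqrt{m}} \Big( \sum_{j \in [m_1]} \lambda_i \phi'_{i,v_j} + \sum_{j \in [m_2]} \lambda_i \phi'_{i,u_j} \Big) + \frac{p}{\sqrt{m}} \sum_{q \in I \setminus \{i\}} \Big( \sum_{j \in [m_1]} \lambda_q \phi'_{q,v_j} + \sum_{j \in [m_2]} \lambda_q \phi'_{q,u_j} \Big) \\
&\quad \le \frac{R_{\max}^2}{\sqrt{m}} \cdot \frac{m}{2 R_{\max}^2} + \frac{p}{\sqrt{m}} \cdot |I| \cdot \max_{q \in I} \Big( \sum_{j \in [m_1]} \lambda_q \phi'_{q,v_j} + \sum_{j \in [m_2]} \lambda_q \phi'_{q,u_j} \Big).
\end{align*}
Rearranging and combining with our assumption on $n$, we get
\[ \max_{q \in I} \Big( \sum_{j \in [m_1]} \lambda_q \phi'_{q,v_j} + \sum_{j \in [m_2]} \lambda_q \phi'_{q,u_j} \Big) \ge \frac{m}{2np} \ge \frac{m}{2p} \cdot \frac{3p}{\gamma^3 R_{\min}^2} \cdot \frac{R_{\max}^2}{R_{\min}^2} > \frac{3m}{2\gamma R_{\min}^2}, \]
in contradiction to Lemma B.1. This concludes the proof that $\sum_{j \in [m_1]} \lambda_i \phi'_{i,v_j} + \sum_{j \in [m_2]} \lambda_i \phi'_{i,u_j} > \frac{m}{2 R_{\max}^2}$.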
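Finally, since $\sum_{j \in [m_1]} \lambda_i \phi'_{i,v_j} + \sum_{j \in [m_2]} \lambda_i \phi'_{i,u_j} > \frac{m}{2 R_{\max}^2}$ and the derivative of $\phi$ is upper bounded by $1$, for all $i \in I$ we have
\[ \frac{m}{2 R_{\max}^2} < \sum_{j \in [m_1]} \lambda_i \phi'_{i,v_j} + \sum_{j \in [m_2]} \lambda_i \phi'_{i,u_j} \le m \lambda_i, \]
and hence $\lambda_i > \frac{1}{2 R_{\max}^2}$.

Lemma B.3. For all $i \in I$ we have $y_i f(x_i; W) = 1$.

Proof. By Lemma B.2 we have $\lambda_i > 0$ for all $i \in I$, and hence by the KKT conditions we must have $y_i f(x_i; W) = 1$.

Lemma B.4. We have
\[ v_1 = \ldots = v_{m_1} = \frac{1}{\sqrt{m}} \sum_{i \in I_+} \lambda_i x_i - \frac{\gamma}{\sqrt{m}} \sum_{i \in I_-} \lambda_i x_i, \qquad u_1 = \ldots = u_{m_2} = \frac{1}{\sqrt{m}} \sum_{i \in I_-} \lambda_i x_i - \frac{\gamma}{\sqrt{m}} \sum_{i \in I_+} \lambda_i x_i. \]
Moreover, for all $i \in I$ we have $y_i v_j^\top x_i > 0$ for every $j \in [m_1]$, and $y_i u_j^\top x_i < 0$ for every $j \in [m_2]$.

Proof. Fix $j \in [m_1]$. By (6), for all $i \in I_+$ we have
\begin{align*}
v_j^\top x_i = \frac{1}{\sqrt{m}} \sum_{q \in I} \lambda_q y_q \phi'_{q,v_j} x_q^\top x_i &= \frac{1}{\sqrt{m}} \lambda_i y_i \phi'_{i,v_j} x_i^\top x_i + \frac{1}{\sqrt{m}} \sum_{q \in I \setminus \{i\}} \lambda_q y_q \phi'_{q,v_j} x_q^\top x_i \\
&\ge \frac{1}{\sqrt{m}} \lambda_i \phi'_{i,v_j} R_{\min}^2 - \frac{1}{\sqrt{m}} \sum_{q \in I \setminus \{i\}} \lambda_q \phi'_{q,v_j} p.
\end{align*}
By Lemma B.1 and Lemma B.2, and using $\phi'_{q,v_j} \in [\gamma, 1]$ for all $q \in I$, the above is larger than
\begin{align*}
\frac{1}{\sqrt{m}} \cdot \frac{1}{2 R_{\max}^2} \cdot \gamma R_{\min}^2 - \frac{1}{\sqrt{m}} \cdot n \cdot \frac{3}{2\gamma^2 R_{\min}^2} \cdot p &\ge \frac{\gamma R_{\min}^2}{2\sqrt{m} R_{\max}^2} - \frac{1}{\sqrt{m}} \cdot \frac{\gamma^3}{3} \cdot \frac{R_{\min}^2}{p} \cdot \frac{R_{\min}^2}{R_{\max}^2} \cdot \frac{3}{2\gamma^2 R_{\min}^2} \cdot p \\
&= \frac{\gamma R_{\min}^2}{2\sqrt{m} R_{\max}^2} - \frac{\gamma R_{\min}^2}{2\sqrt{m} R_{\max}^2} = 0.
\end{align*}
Thus, $v_j^\top x_i > 0$, which implies $\phi'_{i,v_j} = 1$. Similarly, for all $i \in I_-$ we have
\begin{align*}
v_j^\top x_i = \frac{1}{\sqrt{m}} \sum_{q \in I} \lambda_q y_q \phi'_{q,v_j} x_q^\top x_i &= \frac{1}{\sqrt{m}} \lambda_i y_i \phi'_{i,v_j} x_i^\top x_i + \frac{1}{\sqrt{m}} \sum_{q \in I \setminus \{i\}} \lambda_q y_q \phi'_{q,v_j} x_q^\top x_i \\
&\le -\frac{1}{\sqrt{m}} \lambda_i \phi'_{i,v_j} R_{\min}^2 + \frac{1}{\sqrt{m}} \sum_{q \in I \setminus \{i\}} \lambda_q \phi'_{q,v_j} p.
\end{align*}
By Lemma B.1 and Lemma B.2, and using $\phi'_{q,v_j} \in [\gamma, 1]$ for all $q \in I$, the above is smaller than
\begin{align*}
-\frac{1}{\sqrt{m}} \cdot \frac{1}{2 R_{\max}^2} \cdot \gamma R_{\min}^2 + \frac{1}{\sqrt{m}} \cdot n \cdot \frac{3}{2\gamma^2 R_{\min}^2} \cdot p &\le -\frac{\gamma R_{\min}^2}{2\sqrt{m} R_{\max}^2} + \frac{1}{\sqrt{m}} \cdot \frac{\gamma^3}{3} \cdot \frac{R_{\min}^2}{p} \cdot \frac{R_{\min}^2}{R_{\max}^2} \cdot \frac{3}{2\gamma^2 R_{\min}^2} \cdot p \\
&= -\frac{\gamma R_{\min}^2}{2\sqrt{m} R_{\max}^2} + \frac{\gamma R_{\min}^2}{2\sqrt{m} R_{\max}^2} = 0.
\end{align*}
Thus, $v_j^\top x_i < 0$, which implies $\phi'_{i,v_j} = \gamma$. Using (6) again we conclude that
\[ v_j = \frac{1}{\sqrt{m}} \sum_{i \in I} \lambda_i y_i \phi'_{i,v_j} x_i = \frac{1}{\sqrt{m}} \sum_{i \in I_+} \lambda_i x_i - \frac{\gamma}{\sqrt{m}} \sum_{i \in I_-} \lambda_i x_i. \]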
Since the above expression holds for all $j \in [m_1]$, we have $v_1 = \ldots = v_{m_1}$. By similar arguments (which we do not repeat for conciseness) we also get
\[ u_1 = \ldots = u_{m_2} = \frac{1}{\sqrt{m}} \sum_{i \in I_-} \lambda_i x_i - \frac{\gamma}{\sqrt{m}} \sum_{i \in I_+} \lambda_i x_i, \]
and $y_i u_j^\top x_i < 0$ for all $i \in I$ and $j \in [m_2]$.

By the above lemma, we may denote $v := v_1 = \ldots = v_{m_1}$ and $u := u_1 = \ldots = u_{m_2}$, and denote $z := \frac{m_1}{\sqrt{m}} v - \frac{m_2}{\sqrt{m}} u$.

Lemma B.5. The pair $v, u$ is the unique global optimum of Problem (3).

Proof. First, we remark that a variant of this lemma appears in Sarussi et al. (2021). They proved the claim under an assumption called the Neural Agreement Regime (NAR), and Lemma B.4 implies that this assumption holds in our setting.

Note that the objective in Problem (3) is strictly convex and the constraints are affine. Hence, its KKT conditions are sufficient for global optimality, and the global optimum is unique. It remains to show that $v, u$ satisfy the KKT conditions. First, note that $v, u$ satisfy the constraints. Indeed, by Lemma B.4, for every $i \in I_+$ we have $v^\top x_i > 0$ and $u^\top x_i < 0$. Combining this with Lemma B.3 we get
\[ 1 = f(x_i; W) = \frac{m_1}{\sqrt{m}} \phi(v^\top x_i) - \frac{m_2}{\sqrt{m}} \phi(u^\top x_i) = \frac{m_1}{\sqrt{m}} v^\top x_i - \gamma \frac{m_2}{\sqrt{m}} u^\top x_i. \tag{8} \]
Similarly, for every $i \in I_-$ we have $v^\top x_i < 0$ and $u^\top x_i > 0$. Together with Lemma B.3 we get
\[ -1 = f(x_i; W) = \frac{m_1}{\sqrt{m}} \phi(v^\top x_i) - \frac{m_2}{\sqrt{m}} \phi(u^\top x_i) = \gamma \frac{m_1}{\sqrt{m}} v^\top x_i - \frac{m_2}{\sqrt{m}} u^\top x_i. \tag{9} \]
Next, we need to show that there are $\mu_1, \ldots, \mu_n \ge 0$ such that
\begin{align*}
m_1 v &= \sum_{i \in I_+} \mu_i \frac{m_1}{\sqrt{m}} x_i + \sum_{i \in I_-} \mu_i \Big( -\gamma \frac{m_1}{\sqrt{m}} x_i \Big), \\
m_2 u &= \sum_{i \in I_+} \mu_i \Big( -\gamma \frac{m_2}{\sqrt{m}} x_i \Big) + \sum_{i \in I_-} \mu_i \frac{m_2}{\sqrt{m}} x_i.
\end{align*}
By setting $\mu_i = \lambda_i$ for all $i \in I$, Lemma B.4 implies that the above equations hold. Finally, we need to show that $\mu_i = 0$ for all $i \in I$ where the corresponding constraint holds with a strict inequality. However, by (8) and (9) all constraints hold with equality.

Lemma B.6. The weight matrix $W$ is the unique global optimum of Problem (1).

Proof.
Let W be a weight matrix that satisfies the KKT conditions of Problem (1), and let ṽ1 , . . . , ṽm1 , ũ1 , . . . , ũm2 be the corresponding positive and negative weight vectors. We first show that W = W , i.e., there is a unique KKT point for Problem (1). Indeed, by Lemma B.4, for every such W we have ṽ1 = . . . = ṽm1 := ṽ and ũ1 = . . . Proof. First, We remark that a variant of the this lemma appears in Sarussi et al. (2021) . They proved the claim under an assumption called Neural Agreement Regime (NAR), and Lemma B.4 implies that this assumption holds in our setting. Let x ∈ R d . Consider the following cases: Case 1: If v ⊤ x ≥ 0 and u ⊤ x ≥ 0 then f (x; W ) = m1 √ m v ⊤ x -m2 √ m u ⊤ x = z ⊤ x, and thus sign (f (x; W )) = sign(z ⊤ x). Case 2: If v ⊤ x ≥ 0 and u ⊤ x < 0 then f (x; W ) = m1 √ m v ⊤ x -m2 √ m γu ⊤ x > 0 and z ⊤ x = m1 √ m v ⊤ x -m2 √ m u ⊤ x > 0. Case 3: If v ⊤ x < 0 and u ⊤ x ≥ 0 then f (x; W ) = m1 √ m γv ⊤ x -m2 √ m u ⊤ x < 0 and z ⊤ x = m1 √ m v ⊤ x -m2 √ m u ⊤ x < 0. Case 4: If v ⊤ x < 0 and u ⊤ x < 0 then f (x; W ) = m1 √ m γv ⊤ x -m2 √ m γu ⊤ x = γz ⊤ x, and thus sign (f (x; W )) = sign(z ⊤ x). Lemma B.8. The vector z may not be an ℓ 2 -max-margin linear predictor. Proof. We give an example of a setting that satisfies the theorem's assumptions, but the corresponding vector z is not an ℓ 2 -max-margin linear predictor. Let γ = 1 2 and suppose that m 1 = m 2 := m ′ . Let x 1 = (-1, 0, 0) ⊤ , x 2 = (ϵ, √ 1 -ϵ 2 , 0) ⊤ , and x 3 = (0, 0, 1) ⊤ , where ϵ > 0 is sufficiently small such that the theorem's assumption holds. Namely, since we need n  ≤ γ 3 3 • R 2 min p • R 2 min R v = 1 √ 2m ′ (λ 2 x 2 + λ 3 x 3 -γλ 1 x 1 ) = 1 √ 2m ′ λ 2 x 2 + λ 3 x 3 - 1 2 • λ 1 x 1 , u = 1 √ 2m ′ (λ 1 x 1 -γλ 2 x 2 -γλ 3 x 3 ) = 1 √ 2m ′ λ 1 x 1 - 1 2 • λ 2 x 2 - 1 2 • λ 3 x 3 , where λ i > 0 for all i. Since x 1 , x 2 , x 3 are linearly independent, then given v, u there is a unique choice of λ 1 , λ 2 , λ 3 that satisfy the above equations. 
Since v, u satisfy the KKT conditions of Problem (3), we can find λ 1 , λ 2 , λ 3 as follows. Let µ 1 , µ 2 , µ 3 ≥ 0 be such that the KKT conditions of Problem (3) hold. From the stationarity condition we have m ′ v = µ 2 m ′ √ 2m ′ x 2 + µ 3 m ′ √ 2m ′ x 3 -γµ 1 m ′ √ 2m ′ x 1 , m ′ u = µ 1 m ′ √ 2m ′ x 1 -γµ 2 m ′ √ 2m ′ x 2 -γµ 3 m ′ √ 2m ′ x 3 . Since x 1 , x 2 , x 3 are linearly independent, combining the above with ( 10) and ( 11) implies µ i = λ i > 0 for all i. Therefore, all constraints in Problem (3) must hold with an equality. Namely, we have √ 2m ′ m ′ = u ⊤ - 1 2 v ⊤ x 1 = 1 √ 2m ′ λ 1 x 1 - 1 2 • λ 2 x 2 - 1 2 • λ 3 x 3 - 1 2 λ 2 x 2 + λ 3 x 3 - 1 2 • λ 1 x 1 ⊤ x 1 = 1 √ 2m ′ 5 4 • λ 1 x 1 -λ 2 x 2 -λ 3 x 3 ⊤ x 1 = 1 √ 2m ′ 5 4 • λ 1 • 1 -λ 2 (-ϵ) -λ 3 • 0 = 1 √ 2m ′ 5 4 • λ 1 + λ 2 ϵ , √ 2m ′ m ′ = v ⊤ - 1 2 u ⊤ x 2 = 1 √ 2m ′ λ 2 x 2 + λ 3 x 3 - 1 2 • λ 1 x 1 - 1 2 λ 1 x 1 - 1 2 • λ 2 x 2 - 1 2 • λ 3 x 3 ⊤ x 2 = 1 √ 2m ′ 5 4 • λ 2 x 2 + 5 4 • λ 3 x 3 -λ 1 x 1 ⊤ x 2 = 1 √ 2m ′ 5 4 • λ 2 + 0 -λ 1 (-ϵ) = 1 √ 2m ′ 5 4 • λ 2 + λ 1 ϵ , √ 2m ′ m ′ = v ⊤ - 1 2 u ⊤ x 3 = 1 √ 2m ′ 5 4 • λ 2 x 2 + 5 4 • λ 3 x 3 -λ 1 x 1 ⊤ x 3 = 1 √ 2m ′ • 5 4 • λ 3 . Solving the above equations, we get λ 1 = λ 2 = 8 4ϵ+5 , and λ 3 = 8 5 . Published as a conference paper at ICLR 2023 Thus, a KKT point of Problem (1) must satisfy ( 10) and (11) with the above λ i 's. Now, consider z = m ′ √ 2m ′ v - m ′ √ 2m ′ u = m ′ √ 2m ′ (v -u) = m ′ √ 2m ′ • 1 √ 2m ′   i∈I+ λ i x i -γ i∈I- λ i x i - i∈I- λ i x i + γ i∈I+ λ i x i   = 1 + γ 2   i∈I+ λ i x i - i∈I- λ i x i   = 3 4 8 4ϵ + 5 • x 2 + 8 5 • x 3 - 8 4ϵ + 5 • x 1 = 6 4ϵ + 5 • x 2 + 6 5 • x 3 - 6 4ϵ + 5 • x 1 . We need to show that z does not satisfy the KKT conditions of the problem min z 1 2 ∥z∥ 2 s.t. ∀i ∈ {1, 2, 3} y i z⊤ x i ≥ β , for any margin β > 0. A KKT point z of the above problem must satisfy z = -λ ′ 1 x 1 +λ ′ 2 x 2 +λ ′ 3 x 3 , where λ ′ i ≥ 0 for all i, and λ ′ i = 0 if y i z⊤ x i ̸ = β. 
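Since $z$ is a linear combination of the three linearly independent vectors $x_1, x_2, x_3$ in which all coefficients are non-zero, if $z$ is a KKT point of Problem (12) we must have $\lambda'_i \ne 0$ for all $i$, which implies $y_i z^\top x_i = \beta$ for all $i$. Therefore, in order to conclude that $z$ is not a KKT point, it suffices to show that $z^\top x_2 \ne z^\top x_3$. We have
\begin{align*}
z^\top x_2 &= \Big( \frac{6}{4\epsilon + 5} x_2 + \frac{6}{5} x_3 - \frac{6}{4\epsilon + 5} x_1 \Big)^\top x_2 = \frac{6}{4\epsilon + 5} + 0 + \frac{6\epsilon}{4\epsilon + 5} = \frac{6(\epsilon + 1)}{4\epsilon + 5}, \\
z^\top x_3 &= \Big( \frac{6}{4\epsilon + 5} x_2 + \frac{6}{5} x_3 - \frac{6}{4\epsilon + 5} x_1 \Big)^\top x_3 = \frac{6}{5}.
\end{align*}
The two quantities coincide only if $30(\epsilon + 1) = 6(4\epsilon + 5)$, i.e., only if $\epsilon = 0$, so $z^\top x_2 \ne z^\top x_3$ for all $\epsilon > 0$.

Lemma B.9. For all $i \in I$ we have $y_i z^\top x_i \ge 1$, and $\|z\| \le \frac{2}{\kappa + \gamma} \|z^*\|$, where $z^* = \operatorname{argmin}_{\tilde z} \|\tilde z\|$ s.t. $y_i \tilde z^\top x_i \ge 1$ for all $i \in I$.

Proof. By Lemma B.4, for all $i \in I_+$ we have $v^\top x_i > 0$ and $u^\top x_i < 0$. Hence
\[ 1 \le f(x_i; W) = \frac{m_1}{\sqrt{m}} \phi(v^\top x_i) - \frac{m_2}{\sqrt{m}} \phi(u^\top x_i) = \frac{m_1}{\sqrt{m}} v^\top x_i - \frac{m_2}{\sqrt{m}} \gamma u^\top x_i \le \frac{m_1}{\sqrt{m}} v^\top x_i - \frac{m_2}{\sqrt{m}} u^\top x_i = z^\top x_i. \]
Likewise, by Lemma B.4, for all $i \in I_-$ we have $v^\top x_i < 0$ and $u^\top x_i > 0$. Hence
\[ -1 \ge f(x_i; W) = \frac{m_1}{\sqrt{m}} \phi(v^\top x_i) - \frac{m_2}{\sqrt{m}} \phi(u^\top x_i) = \frac{m_1}{\sqrt{m}} \gamma v^\top x_i - \frac{m_2}{\sqrt{m}} u^\top x_i \ge \frac{m_1}{\sqrt{m}} v^\top x_i - \frac{m_2}{\sqrt{m}} u^\top x_i = z^\top x_i. \]
Thus, it remains to obtain an upper bound for $\|z\|$. Assume w.l.o.g. that $m_1 \ge m_2$ (the proof for the case $m_1 \le m_2$ is similar). Thus, $\kappa = \frac{m_2}{m_1}$. Let $z^* \in \mathbb{R}^d$ be such that $y_i (z^*)^\top x_i \ge 1$ for all $i \in I$. Let
\[ v^* = z^* \cdot \frac{\sqrt{m}}{m_1} \cdot \frac{1}{\kappa + \gamma}, \qquad u^* = -z^* \cdot \frac{\sqrt{m}}{m_2} \cdot \frac{\kappa}{\kappa + \gamma}. \]
Note that $v^*, u^*$ satisfy the constraints in Problem (3). Indeed, for $i \in I_-$ we have
\[ \frac{m_2}{\sqrt{m}} (u^*)^\top x_i - \gamma \frac{m_1}{\sqrt{m}} (v^*)^\top x_i = -\frac{\kappa (z^*)^\top x_i}{\kappa + \gamma} - \gamma \cdot \frac{(z^*)^\top x_i}{\kappa + \gamma} \ge \frac{\kappa}{\kappa + \gamma} + \gamma \cdot \frac{1}{\kappa + \gamma} = 1. \]
For $i \in I_+$ we have
\[ \frac{m_1}{\sqrt{m}} (v^*)^\top x_i - \gamma \frac{m_2}{\sqrt{m}} (u^*)^\top x_i = \frac{(z^*)^\top x_i}{\kappa + \gamma} + \gamma \cdot \frac{\kappa (z^*)^\top x_i}{\kappa + \gamma} \ge \frac{1}{\kappa + \gamma} + \gamma \cdot \frac{\kappa}{\kappa + \gamma} = \frac{1 + \gamma \kappa}{\kappa + \gamma} \ge 1, \]
where the last inequality holds since $0 \le (1 - \kappa)(1 - \gamma) = 1 + \kappa\gamma - \kappa - \gamma$. By Lemma B.5 the pair $v, u$ is a global optimum of Problem (3).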
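Hence
\begin{align*}
m_1 \|v\|^2 + m_2 \|u\|^2 \le m_1 \|v^*\|^2 + m_2 \|u^*\|^2 &= m_1 \cdot \frac{m}{m_1^2} \cdot \frac{1}{(\kappa + \gamma)^2} \|z^*\|^2 + m_2 \cdot \frac{m}{m_2^2} \cdot \frac{\kappa^2}{(\kappa + \gamma)^2} \|z^*\|^2 \\
&= \frac{m \|z^*\|^2}{(\kappa + \gamma)^2} \Big( \frac{1}{m_1} + \frac{\kappa^2}{m_2} \Big) \le \frac{m \|z^*\|^2}{(\kappa + \gamma)^2} \cdot \frac{2}{m_1},
\end{align*}
where the last step uses $\frac{\kappa^2}{m_2} = \frac{\kappa}{m_1} \le \frac{1}{m_1}$. Therefore, using $m_2 \le m_1$, we have
\[ \|m_1 v\|^2 + \|m_2 u\|^2 \le m_1^2 \|v\|^2 + m_1 m_2 \|u\|^2 \le \frac{m \|z^*\|^2}{(\kappa + \gamma)^2} \cdot 2. \]
Hence,
\[ \|z\|^2 = \Big\| \frac{m_1}{\sqrt{m}} v - \frac{m_2}{\sqrt{m}} u \Big\|^2 \le 2 \Big( \Big\| \frac{m_1}{\sqrt{m}} v \Big\|^2 + \Big\| \frac{m_2}{\sqrt{m}} u \Big\|^2 \Big) = \frac{2}{m} \big( \|m_1 v\|^2 + \|m_2 u\|^2 \big) \le \frac{4 \|z^*\|^2}{(\kappa + \gamma)^2}, \]
which implies $\|z\| \le \frac{2 \|z^*\|}{\kappa + \gamma}$ as required.

C PROOF OF LEMMA 3.3

Proof of Lemma 3.3. According to the distribution assumption in the lemma, we can write $x_i = \Sigma^{1/2} \bar{x}_i$ where $\bar{x}_i \sim N(0, I_d)$. By the Hanson-Wright inequality (Rudelson & Vershynin, 2013, Theorem 2.1), we have for any $t \ge 0$,
\[ \Pr\Big( \Big| \|\Sigma^{1/2} \bar{x}_i\| - \|\Sigma^{1/2}\|_F \Big| > t \Big) \le 2 \exp\Big( -\Omega\Big( \frac{t^2}{\|\Sigma^{1/2}\|_2^2} \Big) \Big), \]
i.e.,
\[ \Pr\Big( \big| \|x_i\| - \sqrt{d} \big| > t \Big) \le 2 \exp\big( -\Omega(t^2) \big). \]
Let $t = C \sqrt{\log n}$ for a sufficiently large constant $C > 0$. Taking a union bound over all $i \in [n]$, we have that with probability at least $1 - n^{-20}$, $\|x_i\| = \sqrt{d} \pm O(\sqrt{\log n})$ for all $i \in [n]$ simultaneously.

For $i \ne j$, we have $\langle x_i, x_j \rangle \mid x_j \sim N(0, x_j^\top \Sigma x_j)$. Hence we can apply a standard Gaussian tail bound to obtain
\[ \Pr\big[ |\langle x_i, x_j \rangle| > t \mid x_j \big] \le 2 \exp\Big( -\frac{t^2}{2 x_j^\top \Sigma x_j} \Big). \]
Since we already know that $x_j^\top \Sigma x_j = O(\|x_j\|^2) = O(d + \log n) = O(d)$ with probability at least $1 - n^{-20}$, we have
\[ \Pr\big[ |\langle x_i, x_j \rangle| > t \big] \le n^{-20} + 2 \exp\Big( -\Omega\Big( \frac{t^2}{d} \Big) \Big). \]
Then we can take $t = C \sqrt{d \log n}$ for a sufficiently large constant $C$ and apply a union bound over all $i, j$, which gives $|\langle x_i, x_j \rangle| = O(\sqrt{d \log n})$ for all $i \ne j$ with probability at least $1 - n^2 \big( n^{-20} + 2 \exp(-\Omega(C^2 \log n)) \big) \ge 1 - n^{-15}$. This completes the proof.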
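As an informal numerical companion to Lemma 3.3 (not part of the proof), the following sketch draws isotropic Gaussian samples and reports how the squared norms and pairwise correlations scale. It assumes $\Sigma = I_d$; the sample sizes, the seed, and the helper names `gaussian_samples` and `near_orthogonality_stats` are illustrative choices of ours, not taken from the paper.

```python
import math
import random


def gaussian_samples(n, d, seed=0):
    """Draw n samples from N(0, I_d) (assumption: Sigma is the identity)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def near_orthogonality_stats(xs):
    """Return (min_i ||x_i||^2, max_i ||x_i||^2, max_{i != j} |<x_i, x_j>|)."""
    norms_sq = [dot(x, x) for x in xs]
    max_corr = max(
        abs(dot(xs[i], xs[j])) for i in range(len(xs)) for j in range(i)
    )
    return min(norms_sq), max(norms_sq), max_corr


n, d = 10, 4000
r2_min, r2_max, p = near_orthogonality_stats(gaussian_samples(n, d))
# Squared norms should be close to d, and the largest correlation should be
# on the order of sqrt(d log n), i.e. much smaller than the squared norms.
print(r2_min / d, r2_max / d, p / math.sqrt(d * math.log(n)))
```

When $d$ is much larger than $n$, the first two printed ratios should be close to 1 and the third should be a modest constant, which is the regime in which the samples are nearly orthogonal.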

D PROOF OF THEOREM 3.4

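To prove Theorem 3.4, we need to show that for some $t_0 > 0$, $\widehat{L}(W(t)) < \log 2 / n$ for all $t \ge t_0$. To do so, we will first show a proxy PL inequality (Frei & Gu, 2021), and then use this to argue that the loss must eventually be smaller than $\log 2 / n$.

We begin by showing that the vector $\hat{\mu} := \sum_{i=1}^n y_i x_i$ correctly classifies the training data with a positive margin. To see this, note that for any $k \in [n]$,
\begin{align*}
\Big\langle \sum_{i=1}^n y_i x_i, y_k x_k \Big\rangle = \|x_k\|^2 + \sum_{i \ne k} \langle y_i x_i, y_k x_k \rangle &\ge \min_i \|x_i\|^2 - n \max_{i \ne j} |\langle x_i, x_j \rangle| \\
&\overset{(i)}{\ge} \Big( 1 - \frac{\gamma^3}{3} \Big) \min_i \|x_i\|^2 \overset{(ii)}{\ge} \frac{2}{3} \min_i \|x_i\|^2. \tag{13}
\end{align*}
Inequality (i) uses the theorem's assumption that $3n \max_{i \ne j} |\langle x_i, x_j \rangle| \le \gamma^3 \min_i \|x_i\|^2$. Inequality (ii) uses that $\gamma \le 1$. To show how large a margin $\hat{\mu}$ attains on the training data, we bound its norm. We have
\begin{align*}
\Big\| \sum_{i=1}^n y_i x_i \Big\|^2 &\le \sum_{i=1}^n \|x_i\|^2 + \sum_{i \ne j} |\langle x_i, x_j \rangle| = \sum_{i=1}^n \Big( \|x_i\|^2 + \sum_{j \ne i} |\langle x_i, x_j \rangle| \Big) \\
&\le \sum_{i=1}^n \Big( \|x_i\|^2 + n \max_{i \ne j} |\langle x_i, x_j \rangle| \Big) \le \sum_{i=1}^n \Big( \|x_i\|^2 + \frac{\gamma^3}{3} \min_j \|x_j\|^2 \Big) \le 2n \max_i \|x_i\|^2.
\end{align*}
Denoting $R_{\min} := \min_i \|x_i\|$, $R_{\max} := \max_i \|x_i\|$, and $R := R_{\max} / R_{\min}$, and substituting the above display into (13), we get for any $k \in [n]$,
\[ \Big\langle \frac{\hat{\mu}}{\|\hat{\mu}\|}, y_k x_k \Big\rangle \ge \frac{\tfrac{2}{3} R_{\min}^2}{\sqrt{2n R_{\max}^2}} = \frac{\sqrt{2} R_{\min}}{3 R \sqrt{n}}. \tag{14} \]
Let us now define the matrix $Z \in \mathbb{R}^{m \times d}$ with rows $z_j := \frac{\hat{\mu}}{\|\hat{\mu}\|} a_j$. Since $a_j^2 = 1/m$ for each $j$, we have $\|Z\|_F^2 = 1$, and moreover we have for any $k \in [n]$ and $W \in \mathbb{R}^{m \times d}$,
\[ y_k \langle \nabla f(x_k; W), Z \rangle = \sum_{j=1}^m a_j^2 \phi'(\langle w_j, x_k \rangle) \Big\langle \frac{\hat{\mu}}{\|\hat{\mu}\|}, y_k x_k \Big\rangle \ge \frac{\sqrt{2} R_{\min}}{3 R \sqrt{n}} \cdot \frac{1}{m} \sum_{j=1}^m \phi'(\langle w_j, x_k \rangle) \ge \frac{\sqrt{2} R_{\min} \gamma}{3 R \sqrt{n}}, \]
where the first inequality uses (14) and the last inequality uses that $\phi'(z) \ge \gamma$. If $\ell$ is the logistic or exponential loss and we define
\[ g(z) = -\ell'(z), \qquad \widehat{G}(W(t)) := \frac{1}{n} \sum_{k=1}^n g\big( y_k f(x_k; W(t)) \big), \]
then since $g(z) > 0$ the above allows for the following proxy-PL inequality,
\[ \|\nabla \widehat{L}(W(t))\|_F \ge \big\langle \nabla \widehat{L}(W(t)), -Z \big\rangle = \frac{1}{n} \sum_{k=1}^n -\ell'\big( y_k f(x_k; W(t)) \big) y_k \langle \nabla f(x_k; W(t)), Z \rangle \ge \frac{\sqrt{2} R_{\min} \gamma}{3 R \sqrt{n}} \widehat{G}(W(t)). \]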
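By the chain rule, the above implies
\[ \frac{d}{dt} \widehat{L}(W(t)) = -\|\nabla \widehat{L}(W(t))\|_F^2 \le -\Big( \frac{\sqrt{2} R_{\min} \gamma}{3 R \sqrt{n}} \widehat{G}(W(t)) \Big)^2. \]
Let us now calculate how long it takes until we reach the point where $\widehat{G}(W(t)) < \log 2 / (3n)$. Define $\tau = \inf\{ t : \widehat{G}(W(t)) < \log 2 / (3n) \}$. Then for any $t < \tau$ we have
\[ \frac{d}{dt} \widehat{L}(W(t)) \le -\Big( \frac{\sqrt{2} R_{\min} \gamma}{3 R \sqrt{n}} \cdot \frac{\log 2}{3n} \Big)^2. \]
Integrating, we see that
\[ \widehat{L}(W(t)) \le \widehat{L}(W(0)) - \frac{2 R_{\min}^2 \gamma^2 \log^2(2)\, t}{81 R^2 n^3}. \]
Since $\widehat{L}(W(t)) \ge 0$, this means that
\[ \tau \le \frac{81 \widehat{L}(W(0)) R^2 n^3}{2 \gamma^2 R_{\min}^2 \log^2(2)} \le \frac{85 \widehat{L}(W(0)) R^2 n^3}{\gamma^2 R_{\min}^2}. \]
At time $\tau$, we know that $\widehat{G}(W(\tau)) \le \log 2 / (3n)$. Each summand then satisfies $g(y_i f(x_i; W(\tau))) \le n \widehat{G}(W(\tau)) \le (\log 2)/3 < 1/2$, and since $g$ is decreasing with $g(0) \ge 1/2$ for both losses, we get $y_i f(x_i; W(\tau)) > 0$ for each $i$. For $z > 0$, both the logistic loss and the exponential loss satisfy $\ell(z) \le 2 \cdot (-\ell'(z))$, and so for either loss, we have
\[ \widehat{L}(W(\tau)) = \frac{1}{n} \sum_{i=1}^n \ell\big( y_i f(x_i; W(\tau)) \big) \le \frac{2}{n} \sum_{i=1}^n -\ell'\big( y_i f(x_i; W(\tau)) \big) = 2 \widehat{G}(W(\tau)) \le \frac{2}{3} \cdot \frac{\log 2}{n}. \]
Since $\widehat{L}(W(t))$ is decreasing, for all times $t \ge \tau$ we have $\widehat{L}(W(t)) \le \widehat{L}(W(\tau)) < \log(2) / n$.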
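The first step of the argument, that the vector $\sum_i y_i x_i$ classifies nearly-orthogonal data with a positive margin, can be checked numerically. The sketch below is only an illustration under assumptions of ours: isotropic Gaussian inputs, random labels, and the hypothetical helper name `margin_of_mu`. It compares the worst-case margin against the deterministic lower bound $\min_i \|x_i\|^2 - n \max_{i \ne j} |\langle x_i, x_j \rangle|$ that appears in the proof.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def margin_of_mu(xs, ys):
    """Return (min_k <mu, y_k x_k>, R_min^2 - n * p) for mu = sum_i y_i x_i."""
    n, d = len(xs), len(xs[0])
    mu = [sum(ys[i] * xs[i][c] for i in range(n)) for c in range(d)]
    worst_margin = min(ys[k] * dot(mu, xs[k]) for k in range(n))
    r2_min = min(dot(x, x) for x in xs)
    p = max(abs(dot(xs[i], xs[j])) for i in range(n) for j in range(i))
    return worst_margin, r2_min - n * p


rng = random.Random(0)
n, d = 5, 5000  # d much larger than n, so the samples are nearly orthogonal
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
ys = [rng.choice([-1, 1]) for _ in range(n)]  # labels may be arbitrary
worst_margin, lower_bound = margin_of_mu(xs, ys)
print(worst_margin, lower_bound)
```

The inequality `worst_margin >= lower_bound` holds for any data set; near-orthogonality is what makes the lower bound positive, regardless of how the labels are chosen.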

E PROOF OF THEOREM 4.2

In this section, we provide a proof of Theorem 4.2. An overview of our proof is as follows.
1. In Section E.1 we provide basic concentration arguments about the random initialization.
2. In Section E.2 we show that the neural network output and the logistic loss objective function are smooth as a function of the parameters.
3. In Section E.3 we prove a structural result on how gradient descent weights the samples throughout the training trajectory. In particular, we show that throughout gradient descent, the sigmoid losses $-\ell'(y_i f(x_i; W^{(t)}))$ grow at approximately the same rate for all samples.
4. In Section E.4 we leverage the above structural result to provide a tighter upper bound on $\|W^{(t)}\|_F$ than is possible with a naïve application of the triangle inequality.
5. In Section E.5 we provide a lower bound for $\|W^{(t)}\|_2$.
6. In Section E.6 we show that a proxy-PL inequality is satisfied.
7. We conclude the proof of Theorem 4.2 in Section E.7 by putting together the preceding items to bound the stable rank $\mathrm{StableRank}(W^{(t)}) = \|W^{(t)}\|_F^2 / \|W^{(t)}\|_2^2$ and to show that $\widehat{L}(W^{(t)}) \to 0$.

Let us denote $C_R := 10 R^2 / \gamma^2 + 10$, where $R = R_{\max} / R_{\min}$, $R_{\max} = \max_i \|x_i\|$, and $R_{\min} = \min_i \|x_i\|$. For a given probability threshold $\delta \in (0, 1)$, we make the following assumptions moving forward:
(A1) The step-size satisfies $\alpha \le \gamma^2 \big( 5 n R_{\max}^2 R^2 C_R \max(1, H) \big)^{-1}$, where $\phi$ is $H$-smooth and $\gamma$-leaky.
(A2) The initialization variance satisfies $\omega_{\mathrm{init}} \le \alpha \gamma^2 R_{\min} \big( 72 R C_R n \sqrt{m d \log(4m/\delta)} \big)^{-1}$.

We shall also use the following notation to refer to the sigmoid losses that appear throughout the analysis of gradient descent training for the logistic loss,
\[ g(z) = -\ell'(z) = \frac{1}{1 + \exp(z)}, \qquad \widehat{G}(W) = \frac{1}{n} \sum_{i=1}^n g\big( y_i f(x_i; W) \big), \qquad g_i^{(t)} := g\big( y_i f(x_i; W^{(t)}) \big). \tag{16} \]
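To make the quantity in item 7 concrete, here is a minimal sketch that evaluates the stable rank $\|W\|_F^2 / \|W\|_2^2$ of a small matrix stored as a list of rows. The spectral norm is estimated by power iteration on $W^\top W$; that numerical method, the iteration count, and the function names are our own illustrative choices and are not part of the paper's analysis.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_singular_value_sq(W, iters=100, seed=0):
    """Estimate ||W||_2^2 by power iteration on W^T W."""
    rng = random.Random(seed)
    d = len(W[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        Wv = [dot(row, v) for row in W]                       # W v
        u = [sum(W[j][c] * Wv[j] for j in range(len(W)))      # W^T (W v)
             for c in range(d)]
        norm = math.sqrt(dot(u, u))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in u]                             # renormalize
    Wv = [dot(row, v) for row in W]
    return dot(Wv, Wv)                                        # ||W v||^2 with ||v|| = 1


def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2 (undefined for the zero matrix)."""
    fro_sq = sum(dot(row, row) for row in W)
    return fro_sq / top_singular_value_sq(W)


rank_one = [[a * b for b in (1.0, 2.0, 3.0)] for a in (1.0, -1.0, 0.5, 2.0)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
print(stable_rank(rank_one), stable_rank(identity))
```

A rank-one matrix has stable rank 1, while the $3 \times 3$ identity has stable rank 3, so the stable rank behaves like a smooth surrogate for the rank. Power iteration can converge slowly when the two largest singular values are close; an SVD routine would be the more robust choice in practice.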

E.1 CONCENTRATION FOR RANDOM INITIALIZATION

The following lemma characterizes the $\ell_2$-norm of each neuron at initialization. It also characterizes how large the projection of each neuron along the direction $\mu := \sum_{i=1}^n y_i x_i$ can be at initialization. We shall see in Lemma E.13 that gradient descent forces the weights to align with this direction. In the proof of Theorem 4.2, we will argue that by taking a single step of gradient descent with a sufficiently large step-size and small initialization variance, the gradient descent update dominates the behavior of each neuron at initialization, so that after one step the $\mu$ direction becomes dominant for each neuron. This will form the basis of showing that $W^{(t)}$ has small stable rank for $t \ge 1$.

Lemma E.1. With probability at least $1 - \delta$ over the random initialization, the following holds. First, we have the following upper bounds for the spectral norm and per-neuron norms at initialization,
\[ \|W^{(0)}\|_2 \le C_0 \omega_{\mathrm{init}} (\sqrt{m} + \sqrt{d}), \quad \text{and for all } j \in [m], \quad \|w_j^{(0)}\|^2 \le 5 \omega_{\mathrm{init}}^2 d \log(4m/\delta). \]
Second, if we denote by $\bar{\mu} \in \mathbb{R}^d$ the vector $\sum_{i=1}^n y_i x_i / \| \sum_{i=1}^n y_i x_i \|$, then for all $j \in [m]$ we have $|\langle w_j^{(0)}, \bar{\mu} \rangle| \le 2 \omega_{\mathrm{init}} \sqrt{\log(4m/\delta)}$.

Proof. For the first part of the lemma, note that for fixed $j \in [m]$, there are i.i.d. $z_i \sim N(0, 1)$ such that
\[ \|w_j^{(0)}\|^2 = \sum_{i=1}^d \big( w_j^{(0)} \big)_i^2 = \omega_{\mathrm{init}}^2 \sum_{i=1}^d z_i^2 \sim \omega_{\mathrm{init}}^2 \cdot \chi^2(d). \]
By concentration of the $\chi^2$ distribution (Laurent & Massart, 2000, Lemma 1), for any $t > 0$,
\[ \mathbb{P}\Big( \frac{1}{\omega_{\mathrm{init}}^2} \|w_j^{(0)}\|^2 - d \ge 2\sqrt{dt} + 2t \Big) \le \exp(-t). \]
In particular, if we let $t = \log(4m/\delta)$ and take a union bound over $j \in [m]$, we have that with probability at least $1 - \delta/4$, for all $j \in [m]$,
\[ \|w_j^{(0)}\|^2 \le \omega_{\mathrm{init}}^2 \Big( d + 2\sqrt{d \log(4m/\delta)} + 2 \log(4m/\delta) \Big) \le 5 \omega_{\mathrm{init}}^2 d \log(4m/\delta). \]
For the second part, note that $\langle w_j^{(0)}, \bar{\mu} \rangle \sim N(0, \omega_{\mathrm{init}}^2)$. We therefore have $\mathbb{P}(|\langle w_j^{(0)}, \bar{\mu} \rangle| \ge t) \le 2 \exp(-t^2 / 2\omega_{\mathrm{init}}^2)$. Choosing $t = 2 \omega_{\mathrm{init}} \sqrt{\log(4m/\delta)}$ and taking a union bound over $j \in [m]$, we see that with probability at least $1 - \delta/2$, for all $j$, $|\langle w_j^{(0)}, \bar{\mu} \rangle| \le 2 \omega_{\mathrm{init}} \sqrt{\log(4m/\delta)}$. Taking a union bound over both events completes the proof.
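The two per-neuron bounds can be sanity-checked by simulation. The sketch below samples Gaussian neurons and tests them against the thresholds $5 \omega_{\mathrm{init}}^2 d \log(4m/\delta)$ and $2 \omega_{\mathrm{init}} \sqrt{\log(4m/\delta)}$, which is how we read the bounds of Lemma E.1. The values of $m$, $d$, $\omega_{\mathrm{init}}$, $\delta$, the fixed unit vector standing in for $\bar{\mu}$, and the function name `check_init` are assumptions made purely for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def check_init(m, d, omega_init, delta, mu_bar, seed=0):
    """Return True if every sampled neuron satisfies both bounds."""
    rng = random.Random(seed)
    log_term = math.log(4 * m / delta)
    norm_bound = 5 * omega_init ** 2 * d * log_term        # bound on ||w_j||^2
    proj_bound = 2 * omega_init * math.sqrt(log_term)      # bound on |<w_j, mu_bar>|
    ok = True
    for _ in range(m):
        # Each coordinate is N(0, omega_init^2), as in the random initialization.
        w = [rng.gauss(0.0, omega_init) for _ in range(d)]
        ok = ok and dot(w, w) <= norm_bound and abs(dot(w, mu_bar)) <= proj_bound
    return ok


d = 500
# Any fixed unit vector works here, because the Gaussian initialization is
# rotation invariant and independent of the data.
mu_bar = [1.0 / math.sqrt(d)] * d
print(check_init(m=20, d=d, omega_init=0.01, delta=0.01, mu_bar=mu_bar))
```

Both thresholds scale with $\omega_{\mathrm{init}}$, so shrinking the initialization scale shrinks every neuron and its projection onto $\bar{\mu}$ proportionally; this is what lets a single gradient step dominate the initialization.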

E.2 SMOOTHNESS OF NETWORK OUTPUT AND LOSS

In this sub-section, we show that the network output and the logistic loss satisfy a number of smoothness properties, owing to the fact that $\phi$ is $H$-smooth (i.e., $\phi''$ exists and $|\phi''(z)| \le H$).

Lemma E.2. For an $H$-smooth activation $\phi$ and any $W, V \in \mathbb{R}^{m \times d}$ and $x \in \mathbb{R}^d$,
\[ \big| f(x; W) - f(x; V) - \langle \nabla f(x; V), W - V \rangle \big| \le \frac{H \|x\|^2}{2\sqrt{m}} \|W - V\|_2^2. \]
Proof. This was shown in Frei et al. (2022a, Lemma 4.5).

We next show that the empirical risk is smooth, in the sense that the gradient norm is bounded by the loss itself and that the gradients are Lipschitz.

Lemma E.3. For an $H$-smooth, 1-Lipschitz activation $\phi$ and any $W, V \in \mathbb{R}^{m \times d}$, if $\|x_i\| \le R_{\max}$ for all $i$,
\[ \frac{1}{R_{\max}} \|\nabla \widehat{L}(W)\|_F \le \widehat{G}(W) \le \widehat{L}(W) \wedge 1, \]
where $\widehat{G}(W)$ is defined in (16). Additionally,
\[ \|\nabla \widehat{L}(W) - \nabla \widehat{L}(V)\|_F \le R_{\max}^2 \Big( 1 + \frac{H}{\sqrt{m}} \Big) \|W - V\|_2. \]
Proof. This follows by Frei et al. (2022a, Lemma 4.6). The only difference is that in that paper, the authors use $\|x_i\|^2 \le C_1 p$ (in their work, $x_i \in \mathbb{R}^p$) to go from their equations (5) and (6) to their equation (7), while we instead use that $\|x_i\|^2 \le R_{\max}^2$.

E.3 LOSS RATIO BOUND

In this section, we prove a key structural result which we will refer to as a 'loss ratio bound'.

Lemma E.4. Let $\phi$ be a $\gamma$-leaky, $H$-smooth activation. Define $R = \max_{i,j} \|x_i\| / \|x_j\|$, and let us denote $C_R = 10 R^2 \gamma^{-2} + 10$. Suppose that for all $i \in [n]$, we have
\[ \|x_i\|^2 \ge 5 \gamma^{-2} C_R n \max_{k \ne i} |\langle x_i, x_k \rangle|. \]
Then under Assumptions (A1) and (A2), we have with probability at least $1 - \delta$,
\[ \sup_{t \ge 0} \max_{i,j \in [n]} \frac{\ell'\big( y_i f(x_i; W^{(t)}) \big)}{\ell'\big( y_j f(x_j; W^{(t)}) \big)} \le C_R. \]
This lemma shows that regardless of the relationship between $x$ and $y$, the sigmoid losses $-\ell'(y_i f(x_i; W^{(t)}))$, where $-\ell'(z) = 1/(1 + \exp(z))$, grow at essentially the same rate for all examples. Our proof largely follows that used by Frei et al. (2022a), who showed a loss ratio bound for gradient descent-trained two-layer networks with $\gamma$-leaky, $H$-smooth activations when the data comes from a mixture of isotropic log-concave distributions. We generalize their proof technique to accommodate general training data for which the samples are nearly orthogonal in the sense that $\|x_i\|^2 \gg n \max_{k \ne i} |\langle x_i, x_k \rangle|$. Additionally, we provide a more general proof technique that illustrates how a loss ratio bound could hold for activations $\phi$ for which $\phi'(z)$ is not bounded from below by an absolute constant (like the ReLU), as well as for training data which are not necessarily nearly-orthogonal.

We begin by describing two conditions which form the basis of this more general proof technique. The first condition concerns near-orthogonality of the gradients of the network, rather than of the samples as in the assumption for Theorem 4.2.

Condition E.5 (Near-orthogonality of gradients). We say that near-orthogonality of gradients holds at time $t$ if, for some absolute constant $C' > 1$, for any $i \in [n]$,
\[ \|\nabla f(x_i; W^{(t)})\|_F^2 \ge C' n \max_{k \ne i} \big| \langle \nabla f(x_i; W^{(t)}), \nabla f(x_k; W^{(t)}) \rangle \big|. \]
Note that for linear classifiers (i.e., $m = 1$ with $\phi(z) = z$), near-orthogonality of gradients is equivalent to near-orthogonality of samples, since in this setting $\nabla f(x_i; W) = x_i$. Near-orthogonality of gradients is thus a more general condition than near-orthogonality of samples.

The next condition, which we call gradient persistence, roughly states that the gradient of the network with respect to a sample has large norm whenever that sample has large norm.

Condition E.6 (Gradient persistence). We say that gradient persistence holds at time $t$ if there is a constant $c > 0$ such that for all $i \in [n]$,
\[ \|\nabla f(x_i; W^{(t)})\|_F^2 \ge c \|x_i\|^2. \]
Gradient persistence essentially states that there is no possibility of a 'vanishing gradient' problem. Next, we show that Lipschitz activation functions that are also 'leaky', in the sense that $\phi'(z) \ge \gamma > 0$ everywhere, allow for both gradient persistence and, when the samples are nearly-orthogonal, near-orthogonality of gradients.

Fact E.7. Suppose $\phi$ is such that $\phi'(z) \in [\gamma, 1]$ for all $z$, for some absolute constant $\gamma > 0$. Suppose that for some $C > \gamma^{-2}$, for all $i \in [n]$ we have
\[ \|x_i\|^2 \ge C n \max_{k \ne i} |\langle x_i, x_k \rangle|. \]
Then for all times $t \ge 0$, the gradients are nearly-orthogonal (Condition E.5) with $C' = C \gamma^2$ and gradient persistence (Condition E.6) holds for $c = \gamma^2$.

Proof. For any samples $i, k \in [n]$ and any $W \in \mathbb{R}^{m \times d}$,
\[ \langle \nabla f(x_i; W), \nabla f(x_k; W) \rangle = \langle x_i, x_k \rangle \cdot \frac{1}{m} \sum_{j=1}^m \phi'(\langle w_j, x_i \rangle) \phi'(\langle w_j, x_k \rangle). \]
Since $\phi'(z) \in [\gamma, 1]$ for all $z$, we therefore see that gradient persistence holds with $c = \gamma^2$:
\[ \|\nabla f(x_k; W)\|_F^2 = \|x_k\|^2 \cdot \frac{1}{m} \sum_{j=1}^m \phi'(\langle w_j, x_k \rangle)^2 \ge \gamma^2 \|x_k\|^2. \]
Similarly, we see that the gradients are nearly-orthogonal, since
\[ C n \max_{i \ne k} |\langle \nabla f(x_i; W), \nabla f(x_k; W) \rangle| \overset{(i)}{\le} C n \max_{i \ne k} |\langle x_i, x_k \rangle| \overset{(ii)}{\le} \|x_k\|^2 \le \gamma^{-2} \|\nabla f(x_k; W)\|_F^2, \]
where (i) uses that $\phi$ is 1-Lipschitz and (ii) uses the assumption on the near-orthogonality of the samples; the last inequality is gradient persistence.

We can now begin to prove Lemma E.4.
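The gradient identity used in the proof of Fact E.7 can be verified numerically on a small random network. The sketch below assumes second-layer weights $a_j = \pm 1/\sqrt{m}$ and, as one concrete example of a smooth leaky activation, $\phi(z) = \gamma z + (1 - \gamma) \log\big(\tfrac{1}{2}(1 + e^z)\big)$, whose derivative lies in $(\gamma, 1)$. The value of $\gamma$, the dimensions, and all function names are illustrative assumptions of ours.

```python
import math
import random

GAMMA = 0.1  # leakiness parameter; an illustrative choice


def phi_prime(z):
    """Derivative of gamma*z + (1-gamma)*log((1+e^z)/2); lies in (GAMMA, 1)."""
    return GAMMA + (1.0 - GAMMA) / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def grad_f(x, W, a):
    """Gradient of f(x; W) w.r.t. W: row j equals a_j * phi'(<w_j, x>) * x."""
    return [[a_j * phi_prime(dot(w_j, x)) * c for c in x]
            for w_j, a_j in zip(W, a)]


def frob_inner(A, B):
    return sum(dot(ra, rb) for ra, rb in zip(A, B))


rng = random.Random(0)
m, d = 6, 8
W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
a = [rng.choice([-1.0, 1.0]) / math.sqrt(m) for _ in range(m)]
x_i = [rng.gauss(0.0, 1.0) for _ in range(d)]
x_k = [rng.gauss(0.0, 1.0) for _ in range(d)]

# Left side: Frobenius inner product of the two gradients.
lhs = frob_inner(grad_f(x_i, W, a), grad_f(x_k, W, a))
# Right side: <x_i, x_k> times the average product of activation derivatives.
rhs = dot(x_i, x_k) * sum(
    phi_prime(dot(w, x_i)) * phi_prime(dot(w, x_k)) for w in W
) / m
# Gradient persistence: gamma^2 ||x_i||^2 <= ||grad f(x_i)||_F^2 <= ||x_i||^2.
norm_sq = frob_inner(grad_f(x_i, W, a), grad_f(x_i, W, a))
print(lhs, rhs, norm_sq, dot(x_i, x_i))
```

Because the derivative of a leaky activation never drops below $\gamma$, the gradient norm is sandwiched between $\gamma \|x_i\|$ and $\|x_i\|$ for every weight matrix, which is why Conditions E.5 and E.6 hold at all times rather than only at initialization.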
We remind the reader of the notation for the sigmoid loss,
\[ g(z) := -\ell'(z) = \frac{1}{1 + \exp(z)}, \qquad g_i^{(t)} := g\big( y_i f(x_i; W^{(t)}) \big). \]
We follow the same proof technique as Frei et al. (2022a): in order to control the ratio of the sigmoid losses, we show instead that the ratio of the exponential losses is small, and that this suffices for showing that the ratio of the sigmoid losses is small. As we mention above, we generalize their analysis to emphasize that near-orthogonality of gradients and gradient persistence suffice for showing the loss ratio does not grow significantly.

Lemma E.8. Denote $R := R_{\max} / R_{\min}$ where $R_{\max} = \max_i \|x_i\|$ and $R_{\min} = \min_i \|x_i\|$, and let $\phi$ be an arbitrary 1-Lipschitz and $H$-smooth activation. Suppose that near-orthogonality of gradients (Condition E.5) holds at time $t$ for some $C' > 1$ and gradient persistence (Condition E.6) holds at time $t$ for some $c > 0$. Provided $\alpha \le [5 H R_{\max}^2 n (10 R^2 / c + 10)]^{-1}$ and $C' \ge 25 R^2 / c + 25$, then for any $i, j \in [n]$ we have
\begin{align*}
\frac{\exp\big( -y_i f(x_i; W^{(t+1)}) \big)}{\exp\big( -y_j f(x_j; W^{(t+1)}) \big)} &\le \frac{\exp\big( -y_i f(x_i; W^{(t)}) \big)}{\exp\big( -y_j f(x_j; W^{(t)}) \big)} \times \exp\Big( -\frac{g_j^{(t)} \alpha c R_{\min}^2}{n} \Big( \frac{g_i^{(t)}}{g_j^{(t)}} - \frac{R^2}{c} \Big) \Big) \\
&\quad \times \exp\Big( \frac{\alpha R_{\max}^2}{(10 R^2 / c + 10) n} \cdot \widehat{G}(W^{(t)}) \Big).
\end{align*}
Proof. It suffices to consider $i = 1$ and $j = 2$. For notational simplicity denote
\[ A_t := \frac{\exp\big( -y_1 f(x_1; W^{(t)}) \big)}{\exp\big( -y_2 f(x_2; W^{(t)}) \big)}. \]
We now calculate the exponential loss ratio between two samples at time $t + 1$ in terms of the exponential loss ratio at time $t$.

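Recall the notation $g_i^{(t)} := -\ell'(y_i f(x_i; W^{(t)}))$, and introduce the notation $\nabla f_i^{(t)} := \nabla f(x_i; W^{(t)})$. We can calculate,
\begin{align*}
A_{t+1} &= \frac{\exp\big( -y_1 f(x_1; W^{(t+1)}) \big)}{\exp\big( -y_2 f(x_2; W^{(t+1)}) \big)} = \frac{\exp\big( -y_1 f\big( x_1; W^{(t)} - \alpha \nabla \widehat{L}(W^{(t)}) \big) \big)}{\exp\big( -y_2 f\big( x_2; W^{(t)} - \alpha \nabla \widehat{L}(W^{(t)}) \big) \big)} \\
&\overset{(i)}{\le} \frac{\exp\big( -y_1 f(x_1; W^{(t)}) + y_1 \alpha \langle \nabla f_1^{(t)}, \nabla \widehat{L}(W^{(t)}) \rangle \big)}{\exp\big( -y_2 f(x_2; W^{(t)}) + y_2 \alpha \langle \nabla f_2^{(t)}, \nabla \widehat{L}(W^{(t)}) \rangle \big)} \exp\Big( \frac{H R_{\max}^2 \alpha^2}{\sqrt{m}} \|\nabla \widehat{L}(W^{(t)})\|^2 \Big) \\
&\overset{(ii)}{=} A_t \cdot \frac{\exp\big( y_1 \alpha \langle \nabla f_1^{(t)}, \nabla \widehat{L}(W^{(t)}) \rangle \big)}{\exp\big( y_2 \alpha \langle \nabla f_2^{(t)}, \nabla \widehat{L}(W^{(t)}) \rangle \big)} \exp\Big( \frac{H R_{\max}^2 \alpha^2}{\sqrt{m}} \|\nabla \widehat{L}(W^{(t)})\|^2 \Big) \\
&= A_t \cdot \frac{\exp\big( -\frac{\alpha}{n} \sum_{k=1}^n g_k^{(t)} \langle y_1 \nabla f_1^{(t)}, y_k \nabla f_k^{(t)} \rangle \big)}{\exp\big( -\frac{\alpha}{n} \sum_{k=1}^n g_k^{(t)} \langle y_2 \nabla f_2^{(t)}, y_k \nabla f_k^{(t)} \rangle \big)} \exp\Big( \frac{H R_{\max}^2 \alpha^2}{\sqrt{m}} \|\nabla \widehat{L}(W^{(t)})\|^2 \Big) \\
&= A_t \cdot \exp\Big( -\frac{\alpha}{n} \Big( g_1^{(t)} \|\nabla f_1^{(t)}\|_F^2 - g_2^{(t)} \|\nabla f_2^{(t)}\|_F^2 \Big) \Big) \\
&\quad \times \exp\Big( -\frac{\alpha}{n} \Big( \sum_{k \ne 2} g_k^{(t)} \langle y_2 \nabla f_2^{(t)}, y_k \nabla f_k^{(t)} \rangle - \sum_{k \ne 1} g_k^{(t)} \langle y_1 \nabla f_1^{(t)}, y_k \nabla f_k^{(t)} \rangle \Big) \Big) \\
&\quad \times \exp\Big( \frac{H R_{\max}^2 \alpha^2}{\sqrt{m}} \|\nabla \widehat{L}(W^{(t)})\|^2 \Big). \tag{17}
\end{align*}
Inequality (i) uses Lemma E.2 while (ii) uses the definition of $A_t$. We now proceed in a manner similar to Frei et al. (2022a) to bound each of the three terms in the product separately.

For the first term, since gradient persistence (Condition E.6) holds at time $t$, we have for any $i \in [n]$,
\[ \|\nabla f_i^{(t)}\|_F^2 \ge c \|x_i\|^2 \ge c R_{\min}^2. \]
On the other hand, since $\phi$ is 1-Lipschitz we also have
\[ \|\nabla f_i^{(t)}\|_F^2 = \|x_i\|^2 \cdot \frac{1}{m} \sum_{j=1}^m \phi'\big( \langle w_j^{(t)}, x_i \rangle \big)^2 \le \|x_i\|^2 \le R_{\max}^2. \]
Putting the preceding two displays together, we get
\[ c R_{\min}^2 \le \|\nabla f_i^{(t)}\|_F^2 \le R_{\max}^2. \tag{18} \]
Therefore, we have
\begin{align*}
\exp\Big( -\frac{\alpha}{n} \Big( g_1^{(t)} \|\nabla f_1^{(t)}\|_F^2 - g_2^{(t)} \|\nabla f_2^{(t)}\|_F^2 \Big) \Big) &= \exp\Big( -\frac{g_2^{(t)} \alpha}{n} \Big( \frac{g_1^{(t)}}{g_2^{(t)}} \|\nabla f_1^{(t)}\|_F^2 - \|\nabla f_2^{(t)}\|_F^2 \Big) \Big) \\
&\overset{(i)}{\le} \exp\Big( -\frac{g_2^{(t)} \alpha}{n} \Big( \frac{g_1^{(t)}}{g_2^{(t)}} \cdot c R_{\min}^2 - R_{\max}^2 \Big) \Big) \\
&= \exp\Big( -\frac{g_2^{(t)} \alpha c R_{\min}^2}{n} \Big( \frac{g_1^{(t)}}{g_2^{(t)}} - \frac{R^2}{c} \Big) \Big). \tag{19}
\end{align*}
Inequality (i) uses (18), and the final equality uses the definition $R = R_{\max} / R_{\min}$. This bounds the first term in (17).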
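For the second term, we use the fact that the gradients are nearly orthogonal at time $t$ (Condition E.5) and the lemma's assumption on $C'$ to get for any $i \in [n]$,
\[ \|\nabla f_i^{(t)}\|_F^2 \ge C' n \max_{k \ne i} |\langle \nabla f_i^{(t)}, \nabla f_k^{(t)} \rangle| \ge (25 R^2 / c + 25) n \max_{k \ne i} |\langle \nabla f_i^{(t)}, \nabla f_k^{(t)} \rangle|. \tag{20} \]
This allows us to bound,
\begin{align*}
&\exp\Big( -\frac{\alpha}{n} \Big( \sum_{k \ne 2} g_k^{(t)} \langle y_2 \nabla f_2^{(t)}, y_k \nabla f_k^{(t)} \rangle - \sum_{k \ne 1} g_k^{(t)} \langle y_1 \nabla f_1^{(t)}, y_k \nabla f_k^{(t)} \rangle \Big) \Big) \\
&\quad \overset{(i)}{\le} \exp\Big( \frac{\alpha}{n} \sum_{k \ne 1} g_k^{(t)} |\langle \nabla f_1^{(t)}, \nabla f_k^{(t)} \rangle| + \frac{\alpha}{n} \sum_{k \ne 2} g_k^{(t)} |\langle \nabla f_2^{(t)}, \nabla f_k^{(t)} \rangle| \Big) \\
&\quad \overset{(ii)}{\le} \exp\Big( \frac{\alpha}{n} \sum_{k \ne 1} g_k^{(t)} \cdot \frac{\|\nabla f_1^{(t)}\|_F^2}{(25 R^2 / c + 25) n} + \frac{\alpha}{n} \sum_{k \ne 2} g_k^{(t)} \cdot \frac{\|\nabla f_2^{(t)}\|_F^2}{(25 R^2 / c + 25) n} \Big) \\
&\quad \overset{(iii)}{\le} \exp\Big( \frac{\alpha}{n} \sum_{k \ne 1} g_k^{(t)} \cdot \frac{R_{\max}^2}{(25 R^2 / c + 25) n} + \frac{\alpha}{n} \sum_{k \ne 2} g_k^{(t)} \cdot \frac{R_{\max}^2}{(25 R^2 / c + 25) n} \Big) \\
&\quad \le \exp\Big( \frac{2 \alpha R_{\max}^2}{(25 R^2 / c + 25) n} \cdot \widehat{G}(W^{(t)}) \Big). \tag{21}
\end{align*}
Inequality (i) uses the triangle inequality. Inequality (ii) uses (20). Inequality (iii) uses (18). Finally, for the third term of (17), we have
\[ \exp\Big( \frac{H R_{\max}^2 \alpha^2}{\sqrt{m}} \|\nabla \widehat{L}(W^{(t)})\|^2 \Big) \overset{(i)}{\le} \exp\Big( \frac{H R_{\max}^4 \alpha^2}{\sqrt{m}} \widehat{G}(W^{(t)}) \Big) \overset{(ii)}{\le} \exp\Big( \frac{\alpha R_{\max}^2}{2 (25 R^2 / c + 25) n} \cdot \widehat{G}(W^{(t)}) \Big). \tag{22} \]
Inequality (i) uses Lemma E.3, while (ii) uses the lemma's assumption that $\alpha$ is smaller than $[5 H R_{\max}^2 n (10 R^2 / c + 10)]^{-1}$. Putting (19), (21) and (22) into (17), we get
\begin{align*}
A_{t+1} &\le A_t \cdot \exp\Big( -\frac{g_2^{(t)} \alpha c R_{\min}^2}{n} \Big( \frac{g_1^{(t)}}{g_2^{(t)}} - \frac{R^2}{c} \Big) \Big) \times \exp\Big( \frac{2 \alpha R_{\max}^2}{(25 R^2 / c + 25) n} \cdot \widehat{G}(W^{(t)}) \Big) \times \exp\Big( \frac{\alpha R_{\max}^2}{2 (25 R^2 / c + 25) n} \cdot \widehat{G}(W^{(t)}) \Big) \\
&= A_t \cdot \exp\Big( -\frac{g_2^{(t)} \alpha c R_{\min}^2}{n} \Big( \frac{g_1^{(t)}}{g_2^{(t)}} - \frac{R^2}{c} \Big) \Big) \cdot \exp\Big( \frac{5 \alpha R_{\max}^2}{2 (25 R^2 / c + 25) n} \cdot \widehat{G}(W^{(t)}) \Big).
\end{align*}
Since $\frac{5}{2 (25 R^2 / c + 25)} = \frac{1}{10 R^2 / c + 10}$, this completes the proof.

Lemma E.8 shows that if the sigmoid loss ratio $g_i^{(t)} / g_j^{(t)}$ is large, then for a small-enough step-size, the exponential loss ratio will contract at the following iteration. This motivates understanding how the exponential loss ratios relate to the sigmoid loss ratios. We recall the following fact, shown in Frei et al. (2022a, Fact A.2).

Fact E.9.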
For any $z_1, z_2 \in \mathbb{R}$,
\[ \frac{g(z_1)}{g(z_2)} \le \max\Big( 2, \; 2 \frac{\exp(-z_1)}{\exp(-z_2)} \Big), \]
and if $z_1, z_2 > 0$, then we also have
\[ \frac{\exp(-z_1)}{\exp(-z_2)} \le 2 \frac{g(z_1)}{g(z_2)}. \]
This fact demonstrates that if we can ensure that the inputs to the losses are positive, then we can essentially treat the sigmoid and exponential losses interchangeably. Thus, if the network is able to interpolate the training data at a given time $t$, we can swap the sigmoid loss ratio appearing in Lemma E.8 with the exponential loss ratio, and argue that if the exponential loss ratio is too large at a given iteration, it will contract at the following one. This allows for the exponential loss ratios to be bounded throughout gradient descent. We formalize this in the following proposition.

Proposition E.10. Denote $R := R_{\max} / R_{\min}$ where $R_{\max} = \max_i \|x_i\|$ and $R_{\min} = \min_i \|x_i\|$. Let $\phi$ be an arbitrary 1-Lipschitz and $H$-smooth activation. Suppose that:
• Gradient persistence (Condition E.6) holds at time $t$ for some $c > 0$.
• Near-orthogonality of gradients (Condition E.5) holds at time $t$ for some $C' > 25 R^2 / c + 25$.
• For some $\rho \ge 5 R^2 / c + 5$, an exponential loss ratio bound holds at time $t$ with
\[ \max_{i,j} \frac{\exp\big( -y_i f(x_i; W^{(t)}) \big)}{\exp\big( -y_j f(x_j; W^{(t)}) \big)} \le \rho. \tag{23} \]
• The network interpolates the training data at time $t$: $y_i f(x_i; W^{(t)}) > 0$ for all $i$.
Then, provided the learning rate satisfies $\alpha \le [5 H R_{\max}^2 n (10 R^2 / c + 10)]^{-1}$, we have an exponential loss ratio bound at time $t + 1$ as well,
\[ \max_{i,j} \frac{\exp\big( -y_i f(x_i; W^{(t+1)}) \big)}{\exp\big( -y_j f(x_j; W^{(t+1)}) \big)} \le \rho. \]
Proof. As in the proof of Lemma E.8, it suffices to prove that the ratio of the exponential loss for the first sample to the exponential loss for the second sample is bounded by $\rho$. Let us again denote
\[ A_t := \frac{\exp\big( -y_1 f(x_1; W^{(t)}) \big)}{\exp\big( -y_2 f(x_2; W^{(t)}) \big)}, \]
and recall the notation $g_i^{(t)} := -\ell'(y_i f(x_i; W^{(t)}))$ and $\nabla f_i^{(t)} := \nabla f(x_i; W^{(t)})$.
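By Lemma E.8, we have,

$$A_{t+1} \le A_t \cdot \exp\left(-\frac{g_2^{(t)} \alpha c R_{\min}^2}{n}\left[\frac{g_1^{(t)}}{g_2^{(t)}} - \frac{R^2}{c}\right]\right) \cdot \exp\left(\frac{\alpha R_{\max}^2}{(10R^2/c + 10)n} \cdot G(W^{(t)})\right). \tag{24}$$

We now consider two cases.

Case 1: $g_1^{(t)}/g_2^{(t)} \le \frac{2}{5}\rho$. Continuing from (24), we have,

$$\begin{aligned}
A_{t+1} &\overset{(i)}{\le} A_t \cdot \exp\left(\frac{g_2^{(t)} \alpha R_{\min}^2 R^2}{n}\right) \cdot \exp\left(\frac{\alpha R_{\max}^2}{(10R^2/c + 10)n}\, G(W^{(t)})\right) \\
&= A_t \cdot \exp\left(\alpha \cdot \left[\frac{g_2^{(t)} R_{\max}^2}{n} + \frac{R_{\max}^2\, G(W^{(t)})}{(10R^2/c + 10)n}\right]\right) \\
&\overset{(ii)}{\le} 1.2\, A_t \overset{(iii)}{\le} 2.4\, \frac{g_1^{(t)}}{g_2^{(t)}} \overset{(iv)}{\le} 2.4 \cdot \frac{2}{5}\rho \le \rho.
\end{aligned}$$

Above, inequality (i) follows since $g_1^{(t)}/g_2^{(t)} > 0$. The equality uses that $R = R_{\max}/R_{\min}$. Inequality (ii) uses that $g_i^{(t)} < 1$, the proposition's assumption on the step-size, $\alpha \le [5 H R_{\max}^2 n (10R^2/c + 10)]^{-1}$, and that $\exp(0.1) \le 1.2$. Inequality (iii) uses the proposition's assumption that the network interpolates the training data at time $t$, so that the ratio of exponential losses is at most twice the ratio of the sigmoid losses by Fact E.9. The final inequality (iv) follows by the case assumption that $g_1^{(t)}/g_2^{(t)} \le \frac{2}{5}\rho$.

Case 2: $g_1^{(t)}/g_2^{(t)} > \frac{2}{5}\rho$. Continuing from (24), we have,

$$\begin{aligned}
A_{t+1} &\le A_t \cdot \exp\left(-\frac{g_2^{(t)} \alpha c R_{\min}^2}{n}\left[\frac{g_1^{(t)}}{g_2^{(t)}} - \frac{R^2}{c}\right]\right) \cdot \exp\left(\frac{\alpha R_{\max}^2}{(10R^2/c + 10)n} \cdot G(W^{(t)})\right) \\
&\overset{(i)}{\le} A_t \cdot \exp\left(-\frac{g_2^{(t)} \alpha c R_{\min}^2}{n}\left[\frac{2}{5}\rho - \frac{R^2}{c}\right]\right) \cdot \exp\left(\frac{\alpha R_{\max}^2}{(10R^2/c + 10)n} \cdot G(W^{(t)})\right) \\
&= A_t \cdot \exp\left(-\frac{g_2^{(t)} \alpha c R_{\min}^2}{n}\left[\frac{2}{5}\rho - \frac{R^2}{c}\right]\right) \cdot \exp\left(\frac{\alpha R_{\max}^2}{(10R^2/c + 10)n} \cdot g_2^{(t)} \cdot \frac{1}{n}\sum_{i=1}^n \frac{g_i^{(t)}}{g_2^{(t)}}\right) \\
&\overset{(ii)}{\le} A_t \cdot \exp\left(-\frac{g_2^{(t)} \alpha c R_{\min}^2}{n}\left[\frac{2}{5}\rho - \frac{R^2}{c}\right]\right) \cdot \exp\left(\frac{\alpha R_{\max}^2}{(10R^2/c + 10)n} \cdot g_2^{(t)} \cdot 2\rho\right) \\
&= A_t \exp\left(-\frac{g_2^{(t)} \alpha c R_{\min}^2}{n}\left[\frac{2}{5}\rho - \frac{R^2}{c} - \frac{R^2}{c} \cdot \frac{1}{5R^2/c + 5} \cdot \rho\right]\right) \\
&\overset{(iii)}{\le} A_t \le \rho.
\end{aligned}$$

Inequality (i) uses the Case 2 assumption that $g_1^{(t)}/g_2^{(t)} > \frac{2}{5}\rho$. Inequality (ii) uses the proposition's assumption that the exponential loss ratio at time $t$ is at most $\rho$, so that the sigmoid loss ratio is at most $2\rho$ by Fact E.9 (note that the sigmoid loss ratio is at least $2\rho/5 > 2$ by the case assumption and as $\rho > 5$). The equality uses that $R = R_{\max}/R_{\min}$.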
The final inequality (iii) follows as we can write

$$\frac{2}{5}\rho - \frac{R^2}{c} - \frac{R^2}{c} \cdot \frac{1}{5R^2/c + 5} \cdot \rho = \frac{2}{5}\rho\left(1 - \frac{1}{2} \cdot \frac{5R^2/c}{5(R^2/c + 1)}\right) - \frac{R^2}{c} \ge \frac{2}{5}\rho \cdot \frac{1}{2} - \frac{R^2}{c} > 0.$$

The first inequality above uses that $|x/(1 + x)| \le 1$ for $x > 0$, and the final inequality follows by the assumption that $\rho \ge 5R^2/c + 5 > 5R^2/c$. This proves (iii) above, so that in Case 2, the exponential loss ratio decreases at the following iteration.

In summary, the preceding proposition demonstrates that a loss ratio bound can hold for general Lipschitz and smooth activations provided the following four conditions hold for some time $t_0$: (1) an exponential loss ratio bound holds at time $t_0$; (2) near-orthogonality of the gradients holds for all times $t \ge t_0$; (3) gradient persistence holds at all times $t \ge t_0$; and (4) the network interpolates the training data for all times $t \ge t_0$. This is because the proposition guarantees that once the network interpolates the training data, if the gradients are nearly-orthogonal and gradient persistence holds, the maximum ratio of the exponential losses does not become any larger than the maximum ratio at time $t_0$. Note that the above proof outline does not rely upon the training data being nearly orthogonal, nor on the activations being 'leaky', and thus may be applicable to more general settings than the ones we consider in this work.

On the other hand, when the training data is nearly-orthogonal and the activation is $\gamma$-leaky and $H$-smooth, Fact E.7 shows that (2) and (3) above hold for all times $t \ge 0$. Thus, to show a loss ratio bound in this setting, the main task is to show items (1) and (4) above. Towards this end, we present the final auxiliary lemma that will be used in the proof of Lemma E.4. A similar lemma appeared in Frei et al. (2022a, Lemma A.3), and our proof is only a small modification of their proof. For completeness, we provide its proof in detail here.

Lemma E.11. Let $\phi$ be a $\gamma$-leaky, $H$-smooth activation.
Then the following hold with probability at least 1 -δ over the random initialization. (a) An exponential loss ratio bound holds at initialization: max i,j exp(-y i f (x i ; W (0) )) exp(-y j f (x j ; W (0) )) ≤ exp(2). (b) If there is an absolute constant C ′ R > 1 such that at time t we have max i,j {g t) ). (t) i /g (t) j } ≤ C ′ R , and if for all k ∈ [n] we have ∥x k ∥ 2 ≥ 2γ -2 C ′ R n max i̸ =k |⟨x i , x k ⟩|, then for α ≤ γ 2 /(2HC ′ R R 2 R 2 max n), we have for all k ∈ [n], y k [f (x k ; W (t+1) ) -f (x k ; W (t) )] ≥ αγ 2 R 2 min 4C ′ R n G(W ( (c) If for all k ∈ [n] we have ∥x k ∥ 2 ≥ 8γ -2 n max i̸ =k |⟨x i , x k ⟩|, then under Assumptions (A1) and (A2), at time t = 1 and for all samples k ∈ [n], we have y k f (x k ; W (t) ) > 0. Proof. We shall prove each part of the lemma in sequence. Applying this bound to the network output for each sample at initialization, we get |f (x i ; W (0) )| ≤ ∥W (0) ∥ F ∥x i ∥ (i) ≤ √ 5ω init md log(4m/δ)R max (ii) ≤ √ 5αR 2 max 72n (iii) ≤ 1 50 . Inequality (i) uses Lemma E.1, while inequality (ii) and (iii) follow by Assumptions (A2) and (A1), respectively. We therefore have, (t) ). By Lemma E.2, we know max i,j=1,...,n exp(-y i f (x i ; W (0) )) exp(-y j f (x j ; W (0) )) ≤ exp(2). y k [f (x k ; W (t+1) ) -f (x k ; W (t) )] ≥ α n n i=1 g (t) i ⟨y i ∇f (t) i , y k ∇f (t) k ⟩ - HR 2 max α 2 2 √ m ∥∇ L(W (t) )∥ 2 2 . By definition, ⟨∇f i , ∇f (t) k ⟩ = ⟨x i , x k ⟩ • 1 m m j=1 ϕ ′ (⟨w (t) j , x i ⟩)ϕ ′ (⟨w (t) j , x k ⟩) ∈[γ 2 ,1] . 
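We can thus calculate,

$$\begin{aligned}
y_k[f(x_k; W^{(t+1)}) - f(x_k; W^{(t)})] &\overset{(i)}{\ge} \frac{\alpha}{n}\left[\sum_{i=1}^n g_i^{(t)} \langle y_i \nabla f_i^{(t)}, y_k \nabla f_k^{(t)}\rangle - \frac{H R_{\max}^4 \alpha n}{2\sqrt{m}}\, G(W^{(t)})\right] \\
&= \frac{\alpha}{n}\left[g_k^{(t)} \|\nabla f_k^{(t)}\|_F^2 + \sum_{i \neq k} g_i^{(t)} \langle y_i \nabla f_i^{(t)}, y_k \nabla f_k^{(t)}\rangle - \frac{H R_{\max}^4 \alpha n}{2\sqrt{m}}\, G(W^{(t)})\right] \\
&\ge \frac{\alpha}{n}\left[g_k^{(t)} \|\nabla f_k^{(t)}\|_F^2 - \max_j g_j^{(t)} \sum_{i \neq k} |\langle \nabla f_i^{(t)}, \nabla f_k^{(t)}\rangle| - \frac{H R_{\max}^4 \alpha n}{2\sqrt{m}}\, G(W^{(t)})\right] \\
&= \frac{\alpha}{n}\left[g_k^{(t)}\left(\|\nabla f_k^{(t)}\|_F^2 - \frac{\max_j g_j^{(t)}}{g_k^{(t)}} \sum_{i \neq k} |\langle \nabla f_i^{(t)}, \nabla f_k^{(t)}\rangle|\right) - \frac{H R_{\max}^4 \alpha n}{2\sqrt{m}}\, G(W^{(t)})\right],
\end{aligned}$$

where inequality (i) uses Lemma E.3. Continuing, we get that

$$\begin{aligned}
y_k[f(x_k; W^{(t+1)}) - f(x_k; W^{(t)})] &\overset{(i)}{\ge} \frac{\alpha}{n}\left[g_k^{(t)}\left(\|\nabla f_k^{(t)}\|_F^2 - C'_R \sum_{i \neq k} |\langle \nabla f_i^{(t)}, \nabla f_k^{(t)}\rangle|\right) - \frac{H R_{\max}^4 \alpha n}{2\sqrt{m}}\, G(W^{(t)})\right] \\
&\overset{(ii)}{\ge} \frac{\alpha}{n}\left[g_k^{(t)}\left(\gamma^2 \|x_k\|^2 - C'_R \sum_{i \neq k} |\langle x_i, x_k\rangle|\right) - \frac{H R_{\max}^4 \alpha n}{2\sqrt{m}}\, G(W^{(t)})\right] \\
&\overset{(iii)}{\ge} \frac{\alpha}{n}\left[g_k^{(t)} \cdot \frac{1}{2}\gamma^2 \|x_k\|^2 - \frac{H R_{\max}^4 \alpha n}{2\sqrt{m}}\, G(W^{(t)})\right] \\
&\overset{(iv)}{\ge} \frac{\alpha}{n}\left[g_k^{(t)} \cdot \frac{1}{2}\gamma^2 R_{\min}^2 - \frac{H R_{\max}^4 \alpha n}{2\sqrt{m}}\, G(W^{(t)})\right] \\
&\overset{(v)}{\ge} \frac{\alpha}{n}\left[\frac{\gamma^2 R_{\min}^2}{2C'_R}\, G(W^{(t)}) - \frac{H R_{\max}^4 \alpha n}{2\sqrt{m}}\, G(W^{(t)})\right] \\
&\overset{(vi)}{\ge} \frac{\alpha \gamma^2 R_{\min}^2}{4 C'_R n}\, G(W^{(t)}).
\end{aligned}$$

Inequality (i) uses the lemma's assumption that $\max_{i,j}\{g_i^{(t)}/g_j^{(t)}\} \le C'_R$. Inequality (ii) uses that $\phi$ is $\gamma$-leaky and 1-Lipschitz (see eq. (27)). Inequality (iii) uses the assumption that the samples are nearly-orthogonal,

$$\|x_k\|^2 \ge 2\gamma^{-2} C'_R\, n \max_{i \neq k} |\langle x_i, x_k\rangle| \ge 2\gamma^{-2} C'_R \sum_{i \neq k} |\langle x_i, x_k\rangle|.$$

Inequality (iv) uses the definition $R_{\min} = \min_i \|x_i\|$. Inequality (v) again uses the lemma's assumption of a sigmoid loss ratio bound, so that

$$g_k^{(t)} = \frac{1}{n}\sum_{i=1}^n \frac{g_k^{(t)}}{g_i^{(t)}}\, g_i^{(t)} \ge \frac{1}{C'_R} \cdot \frac{1}{n}\sum_{i=1}^n g_i^{(t)} = \frac{1}{C'_R}\, G(W^{(t)}).$$

The final inequality (vi) follows since the step-size $\alpha \le \gamma^2/(2 H C'_R R^2 R_{\max}^2 n)$ is small enough. This completes part (b) of this lemma.

Part (c). Note that by (25), $|f(x_k; W^{(0)})| \le 1/50$.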
Since g is monotone this implies the sigmoid losses at initialization satisfy g  Thus, the assumption that ∥x k ∥ 2 ≥ 8γ -2 n max i̸ =k |⟨x i , x k ⟩| and Assumption (A1) allow for us to apply part (b) of this lemma as follows, y k f (x k ; W (1) ) = y k f (x k ; W (1) ) -y k f (x k ; W (0) ) + f (x k ; W (0) ) ≥ y k f (x k ; W (1) ) -y k f (x k ; W (0) ) -|f (x k ; W (0) )| (i) ≥ αγ 2 R 2 min 4n • 51 /49 • G(W (0) ) - √ 5ω init md log(4m/δ)R max (ii) ≥ γ 2 αR 2 min 16n - √ 5ω init md log(4m/δ)R max = γ 2 αR 2 min 16n 1 - 16 √ 5ω init Rn md log(4m/δ) γ 2 αR min (iii) ≥ γ 2 αR 2 min 32n . The first term in inequality (i) uses the lower bound provided in part (b) of this lemma as well as (28), while the second term uses the upper bound on |f (x k ; W (0) )| in (25). Inequality (ii) uses (28). Inequality (iii) uses Assumption (A2) so that ω init ≤ αγ 2 R min • (72RC R n md log(4m/δ)) -1 and that 16 √ 5 < 36. We now have all of the pieces necessary to prove Lemma E.4. Proof of Lemma E.4. In order to show that the ratio of the g(•) losses is bounded, it suffices to show that the ratio of exponential losses exp(-(•)) is bounded, since by Fact E.9, max i,j=1,...,n g(y i f (x i ; W (t) )) g(y j f (x j ; W (t) )) ≤ max 2, 2 • max i,j=1,...,n exp(-y i f (x i ; W (t) )) exp(-y j f (x j ; W (t) )) . We will prove the lemma by first showing an exponential loss ratio holds at time t = 0 and t = 1, and then use an inductive argument based on Proposition E.10 with ρ = 5R 2 /γ 5 + 5 = 1 2 C R . By part (a) of Lemma E.11, the exponential loss ratio at time t = 0 is at most exp(2). To see the loss ratio holds at time t = 1, first note that by assumption, we have that the samples satisfy, ∥x i ∥ 2 ≥ 5γ -2 C R n max k̸ =i |⟨x i , x k ⟩| = 2γ -2 (25R 2 γ -2 + 25)n max k̸ =i |⟨x i , x k ⟩|. 
( ) Because ϕ is a γ-leaky, H-smooth activation, by Fact E.7 this implies that gradient persistence (Condition E.6) holds with c = γ 2 and near-orthogonality of gradients (Condition E.5) holds for all times t ≥ 0 with C ′ > 2(25R 2 γ -2 + 25). By Assumption (A1), we can therefore apply Lemma E.8 at time t = 0, so that we have for any i, j, exp -y i f (x i ; W (1) ) exp -y j f (x j ; W (1) ) ≤ exp(2) • exp - g (0) j αcR 2 min n g (0) i g (0) j - R 2 γ 2 × exp αR 2 max (10R 2 /γ 2 + 10)n • G(W (0) ) (i) ≤ exp(2) • exp R 2 R 2 min α n • exp αR 2 max (10R 2 /γ 2 + 10)n = exp(2) • exp α R 2 max n + R 2 max (10R 2 /γ 2 + 10)n (ii) ≤ exp(2.1) ≤ 9. Inequality (i) uses that g (t) i < 1, while inequality (ii) uses that the step-size is sufficiently small α ≤ 1/20R 2 max by Assumption (A1). Therefore, the exponential loss ratio at times t = 0 and t = 1 is at most 9 ≤ 5R 2 /γ 2 + 5. Now suppose by induction that at times τ = 1, . . . , t, the exponential loss ratio is at most 5R 2 /γ 2 +5, and consider t + 1. (The cases t = 0 and t = 1 were just proved above.) By the induction hypothesis and (29), the sigmoid loss ratio from times 0, . . . , t is at most 10R 2 /γ 2 + 10. By Assumption (A1), the step-size satisfies α ≤ γ 2 [5nR 2 max R 2 (10R 2 γ -2 + 10) max(1, H)] -1 ≤ γ 2 [2HC R R 2 max R 2 n] -1 . Further, the samples satisfy (30), so that ∥x k ∥ 2 ≥ 2γ -2 (10R 2 γ -2 + 10)n max i̸ =k |⟨x i , x k ⟩| = 2γ -2 C R n max i̸ =k |⟨x i , x k ⟩|. Thus all parts of Lemma E.11 hold with C ′ R = C R = 10R 2 γ -2 + 10. By part (b) of that lemma, the unnormalized margin for each sample increased for every time τ = 0, . . . , t: for all τ = 1, . . . , t, y k [f (x k ; W (τ +1) ) -f (x k ; W (τ ) )] > 0. ( ) Since the network interpolates the training data at time t = 1 by part (c) of Lemma E.11, this implies for all τ = 1, . . . , t, y k f (x k ; W (τ ) ) > 0. 
Finally, since the learning rate satisfies $\alpha \le \gamma^2 [5 n R_{\max}^2 R^2 C_R \max(1, H)]^{-1}$, all of the conditions necessary to apply Proposition E.10 hold. This proposition shows that the exponential loss ratio at time $t + 1$ is at most $5R^2/\gamma^2 + 5$. This completes the induction so that the exponential loss ratio is at most $5R^2/\gamma^2 + 5$ throughout gradient descent, which by (29) implies that the sigmoid loss ratio is at most $10R^2/\gamma^2 + 10$.
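Fact E.9, which converts between sigmoid and exponential loss ratios throughout this section, can be sanity-checked numerically. The following is a minimal sketch, assuming the logistic loss $\ell(z) = \log(1 + \exp(-z))$ so that $g(z) = -\ell'(z) = 1/(1 + \exp(z))$; the function name `check_fact_e9`, the grid range, and the tolerance are illustrative choices and not part of the paper.

```python
import numpy as np

def g(z):
    """Sigmoid loss g(z) = -l'(z) = 1 / (1 + exp(z)) for l(z) = log(1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(z))

def check_fact_e9(lo=-20.0, hi=20.0, num=401, tol=1e-9):
    """Check both inequalities of Fact E.9 on a grid of (z1, z2) pairs.

    The grid and tolerance are assumptions made for this illustration.
    """
    z = np.linspace(lo, hi, num)
    z1, z2 = np.meshgrid(z, z, indexing="ij")
    sig_ratio = g(z1) / g(z2)
    exp_ratio = np.exp(z2 - z1)  # equals exp(-z1) / exp(-z2)
    # First claim: holds for all real z1, z2.
    first = np.all(sig_ratio <= np.maximum(2.0, 2.0 * exp_ratio) * (1.0 + tol))
    # Second claim: holds when both inputs are positive.
    pos = (z1 > 0) & (z2 > 0)
    second = np.all(exp_ratio[pos] <= 2.0 * sig_ratio[pos] * (1.0 + tol))
    return bool(first), bool(second)

result = check_fact_e9()
```

A grid check of this kind is not a proof, but it makes concrete why interpolation (positive margins) is what allows the two loss ratios to be exchanged at the cost of a factor of two.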

E.4 UPPER BOUND FOR THE FROBENIUS NORM

In this section we prove an upper bound for the Frobenius norm of the first-layer weights (recall that $\mathrm{StableRank}(W) = \|W\|_F^2 / \|W\|_2^2$). The proof is a modification of Frei et al. (2022a, Lemma 4.10) to accommodate more general data. Note that the lemma is a strict improvement over the triangle inequality, as we are able to reduce the growth term by a factor of $1/\sqrt{n}$. The proof crucially relies upon the loss ratio bound proved in Lemma E.4.

Lemma E.12. Let $R_{\min} = \min_i \|x_i\|$, $R_{\max} := \max_i \|x_i\|$, $R = R_{\max}/R_{\min}$, and denote $C_R = 10R^2/\gamma^2 + 10$ as the upper bound on the sigmoid loss ratio from Lemma E.4. Suppose that for all $i \in [n]$ the training data satisfy,

$$\|x_i\|^2 \ge 5\gamma^{-2} C_R\, n \max_{k \neq i} |\langle x_i, x_k\rangle|.$$

Then under Assumptions (A1) and (A2), with probability at least $1 - \delta$, for any $t \ge 1$,

$$\|W^{(t)}\|_F \le \|W^{(0)}\|_F + \frac{\sqrt{2 C_R}\, R_{\max}\, \alpha}{\sqrt{n}} \sum_{s=0}^{t-1} G(W^{(s)}).$$

Proof. We prove an upper bound on the $\ell_2$ norm of each neuron and then use this to derive an upper bound on the Frobenius norm of the first layer weight matrix. First note that the lemma's assumptions guarantee that Lemma E.4 holds. Next, by the gradient descent update and the triangle inequality, we have

$$\|w_j^{(t)}\| = \left\|w_j^{(0)} - \alpha \sum_{s=0}^{t-1} \nabla_j L(W^{(s)})\right\|_F \le \|w_j^{(0)}\| + \alpha \sum_{s=0}^{t-1} \|\nabla_j L(W^{(s)})\|_F. \tag{32}$$
We now consider the squared gradient norm with respect to the j-th neuron: ∥∇ j L(W (s) )∥ 2 = 1 n 2 n i=1 g (s) i y i ∇ j f (x i ; W (s) ) 2 = 1 n 2   n i=1 g (s) i 2 ∇ j f (x i ; W (s) ) 2 + k̸ =i∈[n] g (s) i g (s) k y i y j ⟨∇ j f (x i ; W (s) ), ∇ j f (x k ; W (s) )⟩   ≤ 1 n 2   n i=1 g (s) i 2 ∇ j f (x i ; W (s) ) 2 + k̸ =i∈[n] g (s) i g (s) k ⟨∇ j f (x i ; W (s) ), ∇ j f (x k ; W (s) )⟩   (i) ≤ a 2 j n 2   n i=1 g (s) i 2 ϕ ′ (⟨w (t) j , x i ⟩) 2 ∥x i ∥ 2 + k̸ =i∈[n] g (s) i g (s) k ϕ ′ (⟨w (t) j , x i ⟩)ϕ ′ (⟨w (t) j , x k ⟩)|⟨x i , x k ⟩|   (ii) ≤ a 2 j n 2   n i=1 g (s) i 2 ∥x i ∥ 2 + k̸ =i∈[n] g (s) i g (s) k |⟨x i , x k ⟩|   = a 2 j n 2 n i=1   g (s) i 2   ∥x i ∥ 2 + k̸ =i g (s) k g (s) i |⟨x i , x k ⟩|     (iii) ≤ a 2 j n 2 n i=1   g (s) i 2   ∥x i ∥ 2 + C R k̸ =i |⟨x i , x k ⟩|     (iv) ≤ 2a 2 j n 2 n i=1 g (s) i 2 ∥x i ∥ 2 . Above, inequality (i) uses that ∇ j f (x i ; W ) = a j ϕ ′ (⟨w j , x i ⟩)x i . Inequality (ii) uses that ϕ is 1-Lipschitz. Inequality (iii) uses the loss ratio bound in Lemma E.4, and inequality (iv) uses the lemma's assumption about the near-orthogonality of the samples. We can thus continue, ∥∇ j L(W (s) )∥ 2 ≤ 2a 2 j n 2 n i=1 g (s) i 2 ∥x i ∥ 2 ≤ 2a 2 j R 2 max n 2 • max k g (s) k • n i=1 g (s) i = 2a 2 j R 2 max n • max k g (s) k G(W (s) ) (i) ≤ 2a 2 j R 2 max C R n G(W (s) ) 2 . ( ) The final inequality uses the loss ratio bound so that we have max k∈[n] g (s) k = 1 n n i=1 max k g (s) k g (s) i g (s) i ≤ C R n n i=1 g (s) i = C R G(W (s) ). Finally, taking square roots of (33) and applying this bound on the norm in Inequality (32) above we conclude that ∥w (t) j ∥ ≤ ∥w (0) j ∥ + √ 2C R |a j |R max α √ n t-1 s=0 G(W (s) ), establishing our claim for the upper bound on ∥w (t) j ∥. For the bound on the Frobenius norm, we have an analogue of (32), ∥W (t) ∥ F ≤ ∥W (0) ∥ F + α t-1 s=0 ∥∇ L(W (s) )∥ F , On the other hand, by Lemma E.1, we know that |⟨w (0) j , µ⟩| ≤ 2ω init ∥ µ∥ log(4m/δ). 
By the lemma's assumption that ∥x i ∥ 2 ≫ n max k̸ =i |⟨x i , x k ⟩|, we have ∥ µ∥ 2 = n i=1   ∥x i ∥ 2 + n k:k̸ =i ⟨y i x i , y k x k ⟩   ≤ n i=1 ∥x i ∥ 2 + n max k̸ =i |⟨x i , x k ⟩| ≤ 2nR 2 max . ( ) Substituting this inequality into the previous display, we get |⟨w (0) j , µ⟩| ≤ 4R max ω init n log(4m/δ). We thus have αγR 2 min 2 √ m G(W (0) ) (i) ≥ αγR 2 min 8 √ m (ii) ≥ 8R max ω init n log(4m/δ) (iii) ≥ 2|⟨w (0) j , µ⟩|. where (i) uses ( 35), (ii) uses Assumption (A2) and that C R > 1 so that, α ≥ 64ω init γ -1 C R (R max /R 2 min ) nm log(4m/δ) = 64ω init γ -1 C R (R/R min ) nm log(4m/δ), and (iii) uses (37). Continuing from (34) we get ⟨w (t) j , µ⟩ ≥ ⟨w (t) j -w (0) j , µ⟩ -|⟨w (0) j , µ⟩| ≥ αγ|a j |R 2 min 2 t-1 s=0 G(W (s) ) -|⟨w (0) j , µ⟩| ≥ αγ|a j |R 2 min 4 t-1 s=0 G(W (s) ), where the last inequality uses (38). Negative neurons. The argument in this case is essentially identical. If a j < 0, then ⟨w (t+1) j -w (t) j , µ⟩ ≤ - α|a j | n n i=1 g (t) i ϕ ′ (⟨w (t) j , x i ⟩)   ∥x i ∥ 2 - k̸ =i |⟨x i , x k ⟩|   (i) ≤ - α|a j |R 2 min 2n n i=1 g (t) i ϕ ′ (⟨w (t) j , x i ⟩) (iii) ≤ - αγ|a j |R 2 min 2 G(W (t) ), where the inequalities (i) and (ii) follow using an identical logic to the positive neuron case. We therefore have for negative neurons, ⟨w (t) j -w (0) j , µ⟩ ≤ - αγR 2 min 2 √ m t-1 s=0 G(W (s) ). An identical argument used for the positive neurons to derive (39) shows a similar bound for ⟨w (t) j , µ⟩. To see the claim about the spectral norm, first note that since R min > 0, |⟨w j , µ⟩| > 0 and hence µ ̸ = 0. We thus can calculate, ∥W (t) ∥ 2 2 (i) ≥ ∥W (t) µ/∥ µ∥∥ 2 2 = ∥ µ∥ -2 m j=1 ⟨w (t) j , µ⟩ 2 ≥ ∥ µ∥ -2 m j=1 αγ|a j |R 2 min 4 t-1 s=0 G(W (s) ) 2 = ∥ µ∥ -2 αγR 2 min 4 t-1 s=0 G(W (s) ) 2 (ii) ≥ αγR 2 min 4 √ 2R max √ n t-1 s=0 G(W (s) ) Inequality (i) uses ( 34) and ( 40), and inequality (ii) uses the upper bound for ∥ µ∥ given in (36). This completes the proof.

E.6 PROXY PL INEQUALITY

Our final task for the proof of Theorem 4.2 is to show that $L(W^{(t)}) \to 0$. We do so by establishing a variant of the Polyak-Lojasiewicz (PL) inequality called a proxy PL inequality (Frei & Gu, 2021, Definition 1.2).

Lemma E.14. Let $R_{\max} = \max_i \|x_i\|$, $R_{\min} = \min_i \|x_i\|$, and $R := R_{\max}/R_{\min}$. Let $C_R = 10R^2\gamma^{-2} + 10$. Suppose the training data satisfy, for all $i \in [n]$,

$$\|x_i\|^2 \ge 5\gamma^{-2} C_R\, n \max_{k \neq i} |\langle x_i, x_k\rangle|.$$

For a $\gamma$-leaky activation, the following proxy-PL inequality holds for any $t \ge 0$:

$$\|\nabla L(W^{(t)})\| \ge \frac{\gamma R_{\min}}{2\sqrt{2} R \sqrt{n}}\, G(W^{(t)}).$$

Proof. By definition, for any matrix $V \in \mathbb{R}^{m \times d}$ with $\|V\|_F \le 1$ we have

$$\|\nabla L(W)\| \ge \langle \nabla L(W), -V\rangle = \frac{1}{n}\sum_{i=1}^n g_i^{(t)} y_i \langle \nabla f(x_i; W), V\rangle.$$

Let $\mu := \sum_{i=1}^n y_i x_i$ and define the matrix $V$ as having rows $a_j \mu/\|\mu\|$. Then $\|V\|_F^2 = \sum_{j=1}^m a_j^2 = 1$, and we have for each $i$,

$$\begin{aligned}
y_i \langle \nabla f(x_i; W), V\rangle &= \frac{1}{m}\sum_{j=1}^m \phi'(\langle w_j, x_i\rangle)\, \langle y_i x_i, \mu/\|\mu\|\rangle \\
&= \frac{1}{\|\mu\| m}\sum_{j=1}^m \phi'(\langle w_j, x_i\rangle)\left[\|x_i\|^2 + \sum_{k \neq i} \langle y_i x_i, y_k x_k\rangle\right] \\
&\overset{(i)}{\ge} \frac{1}{\|\mu\| m}\sum_{j=1}^m \phi'(\langle w_j, x_i\rangle) \cdot \frac{1}{2}\|x_i\|^2 \\
&\overset{(ii)}{\ge} \frac{R_{\min}^2}{2\|\mu\| m}\sum_{j=1}^m \phi'(\langle w_j, x_i\rangle) \\
&\overset{(iii)}{\ge} \frac{\gamma R_{\min}^2}{2\|\mu\|}.
\end{aligned}$$

Inequality (i) uses the lemma's assumption that $\|x_i\|^2 \gg n \max_{k \neq i} |\langle x_i, x_k\rangle|$. Inequality (ii) uses that $\|x_i\|^2 \ge R_{\min}^2$, and inequality (iii) uses that $\phi'(z) \ge \gamma$. We therefore have,

$$\|\nabla L(W^{(t)})\| \ge \frac{1}{n}\sum_{i=1}^n g_i^{(t)} y_i \langle \nabla f(x_i; W^{(t)}), V\rangle \ge \frac{\gamma R_{\min}^2}{2\|\mu\|}\, G(W^{(t)}) \ge \frac{\gamma R_{\min}^2}{2\sqrt{2} R_{\max} \sqrt{n}}\, G(W^{(t)}) = \frac{\gamma R_{\min}}{2\sqrt{2} R \sqrt{n}}\, G(W^{(t)}),$$

where the final inequality uses the calculation (36).

E.7 PROOF OF THEOREM 4.2

We are now in a position to provide the proof of Theorem 4.2. For the reader's convenience, we re-state the theorem below.

Theorem 4.2. Suppose that $\phi$ is a $\gamma$-leaky, $H$-smooth activation. For training data $\{(x_i, y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \{\pm 1\}$, let $R_{\max} = \max_i \|x_i\|$ and $R_{\min} = \min_i \|x_i\|$, and suppose $R = R_{\max}/R_{\min}$ is at most an absolute constant. Denote by $C_R := 10R^2/\gamma^2 + 10$.
Assume the training data satisfies,

$$R_{\min}^2 \ge 5\gamma^{-2} C_R\, n \max_{i \neq j} |\langle x_i, x_j\rangle|.$$

There exist absolute constants $C_1, C_2 > 1$ (independent of $m$, $d$, and $n$) such that the following holds. For any $\delta \in (0, 1)$, if the step-size satisfies $\alpha \le \gamma^2 (5 n R_{\max}^2 R^2 C_R \max(1, H))^{-1}$, and $\omega_{\mathrm{init}} \le \alpha \gamma^2 R_{\min} (72 R C_R n \sqrt{md \log(4m/\delta)})^{-1}$, then with probability at least $1 - \delta$ over the random initialization of gradient descent, the trained network satisfies:

1. The empirical risk under the logistic loss satisfies $L(W^{(t)}) \le C_1 \sqrt{n/(R_{\min}^2 \alpha t)}$ for $t \ge 1$.
2. The $\ell_2$ norm of each neuron grows to infinity: for all $j$, $\|w_j^{(t)}\|_2 \to \infty$.
3. The stable rank of the weights is bounded: $\sup_{t \ge 1} \mathrm{StableRank}(W^{(t)}) \le C_2$.

Proof. We prove the theorem in parts.

Empirical risk driven to zero. This is a simple consequence of the proxy-PL inequality given in Lemma E.14 since $\phi$ is smooth; a small modification of the proof of Frei et al. (2022a, Lemma 4.12) suffices. In particular, since by Lemma E.3 the loss $L(W)$ has $R_{\max}^2(1 + H/\sqrt{m})$-Lipschitz gradients, we have

$$L(W^{(t+1)}) \le L(W^{(t)}) - \alpha \|\nabla L(W^{(t)})\|_F^2 + R_{\max}^2 \max(1, H/\sqrt{m})\, \alpha^2 \|\nabla L(W^{(t)})\|_F^2.$$

Applying the proxy-PL inequality of Lemma E.14 and using that $\alpha \le [2 \max(1, H/\sqrt{m}) R_{\max}^2]^{-1}$ we thus have

$$\frac{\gamma^2 R_{\min}^2}{8 R^2 n}\, G(W^{(t)})^2 \le \|\nabla L(W^{(t)})\|_F^2 \le \frac{2}{\alpha}\left[L(W^{(t)}) - L(W^{(t+1)})\right].$$

Telescoping the above, we get

$$\min_{t < T} G(W^{(t)})^2 \le \frac{1}{T}\sum_{t=0}^{T-1} G(W^{(t)})^2 \le \frac{2 L(W^{(0)})}{\alpha T} \cdot \frac{8 n R^2}{\gamma^2 R_{\min}^2}.$$

We know from the proof of Lemma E.4 (see (31)) that the unnormalized margin increases for each sample for all times. Since $g$ is monotone, this implies $G(W^{(t)})$ is decreasing and hence so is $G(W^{(t)})^2$, which implies

$$G(W^{(T-1)}) = \min_{t < T} G(W^{(t)}) \le \sqrt{\frac{16 L(W^{(0)})\, n R^2}{\gamma^2 R_{\min}^2 \alpha T}}.$$

Since $\ell(z) \le 2g(z)$ for $z > 0$ and we know that the network interpolates the training data for all times $t \ge 1$, we know that $L(W^{(t)}) \le 2G(W^{(t)})$ for $t \ge 1$, so that for $T \ge 2$,

$$L(W^{(T-1)}) \le 2G(W^{(T-1)}) \le 2\sqrt{\frac{16 L(W^{(0)})\, n R^2}{\gamma^2 R_{\min}^2 \alpha T}}.$$
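The proxy-PL inequality of Lemma E.14 that drives this part of the proof can be illustrated numerically. The following is a minimal sketch, assuming exactly orthogonal samples (so the near-orthogonality condition holds trivially), the smooth leaky activation $\phi(z) = \gamma z + (1-\gamma)\log(\tfrac12(1+\exp(z)))$ from the experiments, and second-layer weights $\pm 1/\sqrt{m}$; the function names and all problem sizes are hypothetical choices made for this illustration.

```python
import numpy as np

def phi(z, gamma):
    # Smooth leaky activation: gamma*z + (1-gamma)*log((1+exp(z))/2).
    return gamma * z + (1.0 - gamma) * (np.logaddexp(0.0, z) - np.log(2.0))

def phi_prime(z, gamma):
    # Derivative gamma + (1-gamma)*sigmoid(z), which lies in (gamma, 1).
    return gamma + (1.0 - gamma) * 0.5 * (1.0 + np.tanh(0.5 * z))

def proxy_pl_check(n=10, d=50, m=16, gamma=0.1, seed=0):
    """Return (||grad L(W)||_F, proxy-PL lower bound) for a random W."""
    rng = np.random.default_rng(seed)
    # Assumption: exactly orthogonal samples with norms in [1, 2].
    norms = rng.uniform(1.0, 2.0, size=n)
    X = np.zeros((n, d))
    X[np.arange(n), np.arange(n)] = norms
    y = rng.choice([-1.0, 1.0], size=n)
    a = np.concatenate([np.ones(m // 2), -np.ones(m // 2)]) / np.sqrt(m)
    W = rng.normal(size=(m, d))

    Z = X @ W.T                          # pre-activations, shape (n, m)
    f = phi(Z, gamma) @ a                # network outputs, shape (n,)
    g = 1.0 / (1.0 + np.exp(y * f))      # sigmoid losses g_i
    # Row j of the gradient is -(1/n) sum_i g_i y_i a_j phi'(<w_j, x_i>) x_i.
    grad = -((g * y)[:, None] * phi_prime(Z, gamma) * a[None, :]).T @ X / n

    r_min, r_max = norms.min(), norms.max()
    ratio = r_max / r_min
    bound = gamma * r_min / (2.0 * np.sqrt(2.0) * ratio * np.sqrt(n)) * g.mean()
    return float(np.linalg.norm(grad)), float(bound)

grad_norm, lower_bound = proxy_pl_check()
```

With orthogonal data the cross terms in the gradient vanish, so the gradient norm dominates the bound regardless of the weights; the near-orthogonality assumption in the lemma is what keeps those cross terms small for general high-dimensional data.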
Norms driven to infinity. We showed in Lemma E.13 (see ( 39) and ( 36)) that for each t ≥ 1 and for each j, ∥w G(W (s) ). It therefore suffices to show that t-1 s=0 G(W (s) ) → ∞. Suppose this is not the case, so that there exists some β > 0 such that t-1 s=0 G(W (s) ) ≤ β for all t. By Lemma E.12, this implies that for all t, ∥W (t) ∥ F ≤ ∥W (0) ∥ F + 2 C R R max α/nβ. In particular, ∥W (t) ∥ F is bounded independently of t. But this contradicts the fact that L(W (t) ) → 0 and ℓ > 0 everywhere, and thus ∥w (t) j ∥ ≳ t-1 s=0 G(W (s) ) → ∞. Stable rank is constant. By definition, StableRank(W (t) ) = ∥W (t) ∥ 2 F ∥W (t) ∥ 2 2 . We consider two cases. Case 1: ∥W (t) ∥ F > 2∥W (0) ∥ F . In this instance, by Lemma E.12, we have the chain of inequalities, 2∥W (0) ∥ F < ∥W (t) ∥ F ≤ ∥W (0) ∥ F + √ 2C R R max α √ n t-1 s=0 G(W (s) ). In particular, we have ∥W (0) ∥ F < √ 2C R R max α √ n t-1 s=0 G(W (s) ). We can thus use Lemma E.13 and Lemma E.12 to bound the ratio of the Frobenius norm to the spectral norm: ∥W (t) ∥ F ∥W (t) ∥ 2 ≤ ∥W (0) ∥ F + √ 2C R Rmaxα √ n t-1 s=0 G(W (t) ) αγRmin 4 √ 2R √ n t-1 s=0 G(W (t) ) ≤ 2 √ 2C R Rmaxα √ n t-1 s=0 G(W (t) ) αγRmin 4 √ 2R √ n t-1 s=0 G(W (t) ) = 16C 1/2 R R 2 γ -1 . ( ) Case 2: ∥W (t) ∥ F ≤ 2∥W (0) ∥ F . Again using Lemma E.13, we have 4)), we see that using a small initialization scale leads to a rapid decrease in the stable rank of the network. A similar phenomenon occurs with CIFAR-10 (see Figure 2 ). ∥W (t) ∥ F ∥W (t) ∥ 2 ≤ 2∥W (0) ∥ F αγRmin 4 √ 2R √ n t-1 s=0 G(W (t) ) (i) ≤ √ 5ω init md log(4m/δ) αγRmin 4 √ 2R √ n G(W (0) ) (ii) ≤ 4 √ 5ω init md log(4m/δ) αγRmin 4 √ 2R √ n = 16 √ 10C R γ -1 RR -1 min √ nα -1 ω init md log(4m/δ) (iii) ≤ γ/ √ n ≤ 16C 1/2 R R 2 γ -1 . overfitting phenomenon as the training accuracy is 100% and the test accuracy is eventually the (optimal) 85%. 
Finally, in Figure 6 , we examine the training and test error of two-layer leaky ReLU networks trained by gradient descent with learning rate α = 0.01 for the 2-XOR distribution described in Section 5 (with n = 80). We fix the number of neurons to m = 512. Theorem 3.2 suggests that if the leaky parameter γ and d are large enough relative to the number of samples, then the network will achieve a linear decision boundary. For the 2-XOR distribution, every classifier with a linear decision boundary achieves 50% test accuracy. We see that as γ and d increase, the test accuracy is indeed close to 50%, but for small γ and d the network achieves better performance and thus learns a nonlinear decision boundary.

F.2 CIFAR10

We use the standard 10-class CIFAR10 dataset with pixel values normalized to be between 0 and 1 (dividing each pixel value by 255). We consider a standard two-layer network with 512 neurons with ReLU activations with biases and with second-layer weights trained. We train for $T = 10^6$ steps with SGD with batch size 128 and a learning rate of $\alpha = 0.01$. Figure 2 shows the average over 5 independent random initializations with shaded area corresponding to plus or minus one standard deviation. For the second-layer initialization we use the standard TensorFlow Dense layer initialization, which uses Glorot Uniform with standard deviation $\sqrt{2/(m + 10)}$ (since the network has 10 outputs). For the first-layer initialization, we consider two different initialization schemes.

Default initialization. We use the standard Dense layer initialization in TensorFlow Keras. In this case the 'Glorot Uniform' initialization has standard deviation $\omega^{\mathrm{TF}}_{\mathrm{init}} = \sqrt{2/(m + d)}$.

Small initialization. We use $\omega_{\mathrm{init}} = \omega^{\mathrm{TF}}_{\mathrm{init}}/50$.
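The two first-layer initialization scales can be made concrete with a short sketch. It uses NumPy rather than TensorFlow and relies only on the standard fact that Glorot Uniform draws from $U(-\ell, \ell)$ with $\ell = \sqrt{6/(\text{fan\_in} + \text{fan\_out})}$, whose standard deviation is $\sqrt{2/(\text{fan\_in} + \text{fan\_out})}$. The helper names are hypothetical, and scaling the uniform draws by $1/50$ is an assumption about how the small initialization could be realized; the text above only fixes its standard deviation.

```python
import numpy as np

def glorot_uniform_std(fan_in, fan_out):
    """Standard deviation of the Glorot Uniform distribution."""
    return np.sqrt(2.0 / (fan_in + fan_out))

def sample_first_layer(m, d, scale=1.0, seed=0):
    """Draw an (m, d) first-layer weight matrix from a scaled Glorot Uniform.

    scale=1.0 mimics the default initialization; scale=1/50 is one possible
    realization of the small initialization (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (m + d))
    return scale * rng.uniform(-limit, limit, size=(m, d))

m, d = 512, 32 * 32 * 3          # 512 neurons, flattened CIFAR-10 images
omega_tf = glorot_uniform_std(d, m)
W_default = sample_first_layer(m, d)
W_small = sample_first_layer(m, d, scale=1.0 / 50.0)
```

The empirical standard deviation of `W_default` should be close to `omega_tf`, and that of `W_small` close to `omega_tf / 50`.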



In fact, the main challenge in our proof is to show that a property similar to their assumption holds in every KKT point in our setting.

The proof below holds more generally when $x_i$ has independent subgaussian entries.



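i.d. from a $d$-dimensional Gaussian distribution $N(0, \Sigma)$, where $\mathrm{Tr}[\Sigma] = d$ and $\|\Sigma\|_2 = O(1)$. Suppose $n \le d^{O(1)}$. Then, with probability at least $1 - n^{-10}$ we have $\frac{\|x_i\|^2}{d} = 1 \pm O\left(\sqrt{\frac{\log n}{d}}\right)$ for all $i$, and $\frac{|\langle x_i, x_j\rangle|}{d} = O\left(\sqrt{\frac{\log n}{d}}\right)$ for all $i \neq j$.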

Figure 1: Relative reduction in the stable rank of two-layer nets trained by gradient descent for Gaussian mixture model data (cf. (4)). The rank reduction happens more quickly as the dimension grows (left; initialization scale 50× smaller than default TensorFlow, $\alpha = 0.01$) and as the initialization scale decreases (right; $d = 10^4$, $\alpha = 0.16$).
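The quantity plotted in Figure 1 is the stable rank $\mathrm{StableRank}(W) = \|W\|_F^2/\|W\|_2^2$, normalized by its value at initialization. A minimal NumPy sketch of this computation follows; the function names and the test matrices are illustrative and not taken from the paper's code.

```python
import numpy as np

def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2 of a matrix W."""
    fro_sq = np.linalg.norm(W, "fro") ** 2
    spec_sq = np.linalg.norm(W, 2) ** 2   # largest singular value, squared
    return fro_sq / spec_sq

def relative_stable_rank(W_t, W_0):
    """Stable rank at time t relative to initialization."""
    return stable_rank(W_t) / stable_rank(W_0)

rng = np.random.default_rng(0)
rank_one = np.outer(rng.normal(size=64), rng.normal(size=200))
identity = np.eye(8)
sr_rank_one = stable_rank(rank_one)   # a rank-one matrix has stable rank 1
sr_identity = stable_rank(identity)   # the 8x8 identity has stable rank 8
```

The stable rank is always at most the rank and is insensitive to rescaling of $W$, which is why the normalized curves can be compared across initialization scales.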

Figure 2: Stable rank of SGD-trained two-layer ReLU networks on CIFAR-10. Compared to the default TensorFlow initialization (left), a smaller initialization (right) results in a smaller stable rank, and this effect is especially pronounced before the very late stages of training. Remarkably, the train (blue) and test (black) accuracy behavior is essentially the same.

A PRELIMINARIES ON THE CLARKE SUBDIFFERENTIAL AND THE KKT CONDITIONS

Below we review the definition of the KKT conditions for non-smooth optimization problems (cf. Lyu & Li (2019); Dutta et al. (2013)). Let $f : \mathbb{R}^d \to \mathbb{R}$ be a locally Lipschitz function. The Clarke subdifferential (Clarke et al., 2008) at $x \in \mathbb{R}^d$ is the convex set

= ũm2 := ũ, and by Lemma B.5 the vectors ṽ, ũ are a unique global optimum of Problem (3). Since by Lemma B.5 the vectors v, u are also a unique global optimum of Problem (3), then we must have v = ṽ and u = ũ. Now, let W * be a global optimum of Problem (1). By Lyu & Li (2019), the KKT conditions of this problem are necessary for optimality, and hence they are satisfied by W * . Therefore, we have W * = W . Thus, W is a unique global optimum.

Lemma B.7. For every x ∈ R d we have sign (f (x; W )) = sign(z ⊤ x).

max and we have R min = R max = 1 and p = ϵ, then ϵ should satisfy 3 ≤ 1 8•3ϵ . We also let y 1 = -1, y 2 = y 3 = 1. Let W be a KKT point of Problem (1) w.r.t. the dataset {(x i , y i )} 3 i=1 , and let v 1 , . . . , v m ′ , u 1 , . . . , u m ′ be the corresponding weight vectors. By Lemma B.4 and Lemma B.5 we have v = v 1 = . . . = v m ′ and u = u 1 = . . . = u m ′ where v, u are a solution of Problem (3). Moreover, by Lemma B.4 and Lemma B.2 we have

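Part (a). Since $\phi$ is 1-Lipschitz and $\phi(0) = 0$, Cauchy-Schwarz implies

$$|f(x; W)| = \left|\sum_{j=1}^m a_j \phi(\langle w_j, x\rangle)\right| \le \sqrt{\sum_{j=1}^m a_j^2}\, \sqrt{\sum_{j=1}^m \phi(\langle w_j, x\rangle)^2} \le \sqrt{\sum_{j=1}^m \langle w_j, x\rangle^2} = \|W x\|_2.$$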

Part (b). Let $k \in [n]$. Let us re-introduce the notation $\nabla f_i^{(t)} := \nabla f(x_i; W^{(t)})$.

[(1 + exp(0.02)) -1 , (1 + exp(-0.02)) -1 ] ⊂ [0.49, 0.51] and so G(W (0) )

Figure 4: With larger learning rates, most of the rank reduction occurs in the first step of gradient descent. With smaller learning rates, training for longer can reduce the rank at most initialization scales.
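The first-step rank reduction described in the caption of Figure 4 can be reproduced in a small simulation. The sketch below is illustrative only: it assumes Gaussian inputs with random labels rather than the binary cluster distribution of (4), uses the smooth leaky activation $\phi(z) = \gamma z + (1-\gamma)\log(\tfrac12(1+\exp(z)))$ with fixed second-layer weights $\pm 1/\sqrt{m}$, and picks hypothetical values for $n$, $d$, $m$, $\alpha$ and $\omega_{\mathrm{init}}$.

```python
import numpy as np

def phi(z, gamma):
    # Smooth leaky activation: gamma*z + (1-gamma)*log((1+exp(z))/2).
    return gamma * z + (1.0 - gamma) * (np.logaddexp(0.0, z) - np.log(2.0))

def phi_prime(z, gamma):
    # Derivative gamma + (1-gamma)*sigmoid(z), which lies in (gamma, 1).
    return gamma + (1.0 - gamma) * 0.5 * (1.0 + np.tanh(0.5 * z))

def stable_rank(W):
    return np.linalg.norm(W, "fro") ** 2 / np.linalg.norm(W, 2) ** 2

def one_gd_step(W, X, y, a, alpha, gamma):
    """One gradient descent step on the logistic loss over first-layer weights."""
    Z = X @ W.T                          # pre-activations, shape (n, m)
    f = phi(Z, gamma) @ a                # network outputs, shape (n,)
    g = 1.0 / (1.0 + np.exp(y * f))      # sigmoid losses g_i = -l'(y_i f_i)
    grad = -((g * y)[:, None] * phi_prime(Z, gamma) * a[None, :]).T @ X / len(y)
    return W - alpha * grad

def first_step_demo(n=20, d=2000, m=64, gamma=0.1, alpha=1e-3,
                    omega_init=1e-7, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))          # assumption: isotropic Gaussian inputs
    y = rng.choice([-1.0, 1.0], size=n)  # assumption: random labels
    a = np.concatenate([np.ones(m // 2), -np.ones(m // 2)]) / np.sqrt(m)
    W0 = omega_init * rng.normal(size=(m, d))
    W1 = one_gd_step(W0, X, y, a, alpha, gamma)
    return float(stable_rank(W0)), float(stable_rank(W1))

sr_init, sr_after_one_step = first_step_demo()
```

Because the initialization is tiny relative to the step, the first update dominates the weights, and each row of that update is a scaled copy of nearly the same combination of the training samples, so the stable rank collapses towards one. Increasing `omega_init` relative to `alpha` weakens this effect.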

Figure 6: As γ and d increase, leaky ReLU networks trained by gradient descent fail to generalize well for the XOR distribution, as predicted by Theorem 3.2.

The proof of Lemma 3.3 is provided in Appendix C. We thus see that for data sampled i.i.d. from a well-conditioned Gaussian, near-orthogonality of training data holds when the training data is sufficiently high-dimensional, i.e. the dimension is much larger than the number of samples.
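The scaling in Lemma 3.3 is easy to observe empirically. The following sketch draws samples from $N(0, I_d)$ (the special case $\Sigma = I_d$, chosen here as an assumption for simplicity) and reports the largest deviation of $\|x_i\|^2/d$ from one and the largest normalized correlation $|\langle x_i, x_j\rangle|/d$; the function name and sample sizes are illustrative.

```python
import numpy as np

def near_orthogonality_stats(n, d, seed=0):
    """Return (max_i | ||x_i||^2/d - 1 |, max_{i != j} |<x_i, x_j>|/d)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))              # Sigma = I_d, so Tr[Sigma] = d
    gram = X @ X.T / d
    sq_norms = np.diag(gram)
    off_diag = np.abs(gram - np.diag(sq_norms))
    return float(np.abs(sq_norms - 1.0).max()), float(off_diag.max())

n, d = 20, 20000
norm_dev, max_corr = near_orthogonality_stats(n, d)
rate = float(np.sqrt(np.log(n) / d))         # the sqrt(log n / d) scale
```

Both statistics are on the order of `rate`, so the squared norms exceed the pairwise correlations by a factor of roughly $\sqrt{d/\log n}$, which is the sense in which high-dimensional data is nearly orthogonal.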

ACKNOWLEDGEMENTS

We thank Matus Telgarsky for helpful discussions. This work was done in part while the authors were visiting the Simons Institute for the Theory of Computing as a part of the Deep Learning Theory Summer Cluster. SF, GV, PB, and NS acknowledge the support of the NSF and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning through awards DMS-2031883 and #814639.


and we can simply use that a 2 j = 1/m and ∥∇ L(W (s) )∥ 2 F = m j=1 ∥∇ j L(W (s) )∥ 2 F .

E.5 LOWER BOUND FOR THE SPECTRAL NORM

We next show that the spectral norm is large. The proof follows by showing that after the first step of gradient descent, every neuron is highly correlated with the vector µ := n i=1 y i x i . Lemma E.13. Let R max = max i ∥x i ∥, R min = min i ∥x i ∥ and R := R max /R min . Let C R = 10R 2 γ -2 + 10. Suppose that for all i ∈ [n] the training data satisfy,Then, under Assumptions (A1) and (A2), we have with probability at least 1 -δ, we have the following lower bound for the spectral norm of the weights for any t ≥ 1:Proof. We shall show that every neuron is highly correlated with the vector µ :=i ϕ ′ (⟨wPositive neurons. If a j > 0, then we have,.i ≥ 0. Telescoping, we getWe now show that we can ignore the ⟨w (0) j , µ⟩ term by taking α large relative to ω init . By the calculation in (25), we know that |f (x i ; W (0) )| ≤ 1 for each i and thusPublished as a conference paper at ICLR 2023 Inequality (ii) uses that G(W (0) ) ≥ 1/4 by the calculation (35).The final inequality (iii) uses Assumption (A2) so that ω init ≤ αγ 2 R min (72RC R n md log(4m/δ)) -1 . Thus, (42) yields the following upper bound for the stable rank,

F EXPERIMENT DETAILS

We describe below the two experimental settings we consider.

F.1 BINARY CLUSTER DATA

In Figure 1, we consider the binary cluster distribution described in (4). We consider a neural network with $m = 512$ neurons with activation $\phi(z) = \gamma z + (1 - \gamma)\log\left(\tfrac{1}{2}(1 + \exp(z))\right)$ for $\gamma = 0.1$, which is a 0.1-leaky, 1/4-smooth leaky ReLU activation (see Figure 3). We fix $n = 100$ samples with mean separation $\|\mu\| = d^{0.26}$ with each entry of $\mu$ identical and positive. We introduce label noise by making 15% of the labels in each cluster share the opposing cluster label (i.e., samples from cluster mean $+\mu_1$ have label +1 with probability 0.85 and -1 with probability 0.15). Consistent with the set-up in Section 4, we do not use biases and we keep the second layer fixed at the values $\pm 1/\sqrt{m}$, with exactly half of the second-layer weights positive and the other half negative. For the figure on the left, the initialization is a mean-zero normal distribution with standard deviation that is 50× smaller than the TensorFlow default initialization, that is, $\omega_{\mathrm{init}} = \omega^{\mathrm{TF}}_{\mathrm{init}}/50$. For the figure on the right, we fix $d = 10^4$ and vary the initialization standard deviation for different multiples of $\omega^{\mathrm{TF}}_{\mathrm{init}}$, so that the variance is between $(10^{-2}\omega^{\mathrm{TF}}_{\mathrm{init}})^2$ and $(10^{2}\omega^{\mathrm{TF}}_{\mathrm{init}})^2$. For the experiment on the effect of dimension, we use a fixed learning rate of $\alpha = 0.01$, while for the experiment on the effect of the initialization scale we use a learning rate of $\alpha = 0.16$.

In Figure 1, we show the stable rank of the first-layer weights scaled by the initial stable rank of the network (i.e., we plot $\mathrm{StableRank}(W^{(t)})/\mathrm{StableRank}(W^{(0)})$). The line shows the average over 5 independent random initializations with error bars (barely visible) corresponding to plus or minus one standard deviation.

In Figure 4, we provide additional empirical observations on how the learning rate can affect the initialization scale's influence on the stable rank of the trained network as we showed in Figure 1. We fix $d = 10^4$ and otherwise use the same setup for Figure 1 described in the previous paragraph.
When the learning rate is the smaller value of $\alpha = 0.01$, training for longer can reduce the (stable) rank of the network, while for the larger learning rate of $\alpha = 0.32$ most of the rank reduction occurs in the first step of gradient descent.

In Figure 5, we examine the training accuracy, test accuracy, and stable rank of networks trained on the binary cluster distribution described above. Here we fix $d = 10^4$ and $\alpha = 0.01$ and otherwise use the same setup described in the first paragraph. We again consider two settings of the initialization scale: either a standard deviation of $\omega^{\mathrm{TF}}_{\mathrm{init}}$ or $\tfrac{1}{50} \times \omega^{\mathrm{TF}}_{\mathrm{init}}$. We again see that the stable rank decreases much more rapidly when using a small initialization. Note that in both settings we observe a benign

