SHARP CONVERGENCE ANALYSIS OF GRADIENT DESCENT FOR OVERPARAMETERIZED DEEP LINEAR NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

This paper presents sharp rates of convergence of the gradient descent (GD) method for overparameterized deep linear neural networks under different random initializations. This study touches upon a major open theoretical problem in machine learning: why are deep neural networks trained with GD methods effective in so many practical applications? While a solution to this problem is still beyond reach for general nonlinear deep neural networks, extensive efforts have been invested in studying relevant questions for deep linear neural networks, and many interesting results have been reported to date. For example, recent results on the loss landscape show that even though the loss function of deep linear neural networks is non-convex, every local minimizer is also a global minimizer. When the GD method is applied to train deep linear networks, its convergence behavior depends on the initialization. In this study, we obtain a sharp rate of convergence of GD for deep linear networks and demonstrate that this rate does not depend on the type of random initialization. Furthermore, we show that the depth of the network does not affect the optimal rate of convergence, provided the width of each hidden layer is appropriately large. Finally, we explain why GD for an overparameterized deep linear network automatically avoids bad saddles.

1. INTRODUCTION

Deep linear neural networks, as a class of toy models, are frequently used to understand loss surfaces and gradient-based optimization methods for non-convex problems. Dauphin et al. (2014) and Choromanska et al. (2015a) explored the loss function of deep nonlinear networks based on random matrix theory (such as a spherical spin-glass model). This theory essentially converts the loss surface of deep nonlinear neural networks into that of deep linear neural networks under certain assumptions, some of which are unrealistic. Choromanska et al. (2015b) posed an open problem: to establish a connection between the loss function of neural networks and the Hamiltonian of spherical spin-glass models under milder assumptions. Later, Kawaguchi (2016) successfully discarded most of these assumptions by analyzing the loss surface of deep linear neural networks. The landscape analysis for deep linear neural networks (Kawaguchi, 2016; Kawaguchi & Lu, 2017; Laurent & Brecht, 2018) focuses on several properties of the critical points: (i) every local minimum is a global minimum; (ii) every critical point that is not a local minimum is a saddle point; and (iii) there exists a saddle such that all eigenvalues of its Hessian are zero if the network has more than three layers. Thus, for deep linear neural networks, convergence to a global minimum can be impeded by the existence of poor saddles. Lee et al. (2016) showed that the gradient method almost surely never converges to a strict saddle point, although the time cost can depend exponentially on the dimension (Du et al., 2017). Gradient descent (GD) with perturbations (Ge et al., 2015; Jin et al., 2017) can find a local minimizer in polynomial time. Thus, the trajectory approach, combined with random initialization or a randomized algorithm, circumvents the obstacle posed by the existence of poor saddles.
According to studies on the continuous-time dynamics of the gradient flow (Du et al., 2018; Arora et al., 2018b), the balance property of a deep linear network is preserved if the initialization is balanced. Arora et al. (2018a;b), Du & Hu (2019), and Hu et al. (2020) successfully proved that GD with their corresponding initialization schemes converges to a global minimizer of deep linear neural networks with high probability. Furthermore, the rate of convergence is linear and behaves like that of GD for a convex problem. Hu et al. (2020) established that the convergence for Gaussian initialization can be very slow for deep linear neural networks with large depths, unless the width is almost linear in the depth. They also showed that orthogonal initialization in deep linear neural networks accelerates convergence. Thus, the convergence behavior of the GD method for training deep linear neural networks crucially depends on the initialization. Recent studies have demonstrated the connection between deep learning and kernel methods (Daniely, 2017; Arora et al., 2019a;b; Chizat et al., 2019; Lee et al., 2019; Du et al., 2019; Cao & Gu, 2019; Woodworth et al., 2020), especially the neural tangent kernel (NTK), introduced by Jacot et al. (2018). For most common neural networks, the NTK becomes constant (Jacot et al., 2018; Liu et al., 2020) and remains so throughout training in the limit of large layer width. Throughout training, such neural networks are well described by their first-order Taylor expansion around the parameters at initialization (Lee et al., 2019). In this paper, we first evaluate the convergence region, i.e., the set of initialization parameters that lead to linear convergence of GD for deep linear neural networks (see Lemma 4.1 or Lemma D.1).
Next, we demonstrate that if the minimum width among all the hidden layers is sufficiently large, then a random initialization falls into the convergence region with high probability (see Theorem 3.1, Theorem B.1, Theorem B.2, and Theorem B.3). Furthermore, the worst-case convergence rate of GD for deep linear neural networks is almost the same as that for the original convex problem with a corresponding learning rate. We also demonstrate that the GD trajectories for deep linear neural networks are arbitrarily close to those for the convex problem. The precise statements are given in Remark 3, Theorem 3.2, Corollary 1, and Lemma 4.4 (also see Lemma D.5). The present study was inspired by recent work of Du & Hu (2019) and Hu et al. (2020), in which the authors carefully constructed upper and lower bounds on the eigenvalues of the Gram matrix along the GD trajectory and established linear convergence. In this paper, we generalize their results to strongly convex loss functions with varying layer widths and obtain sharper results. We also show that our rate of convergence for GD on deep linear neural networks is sharp in the sense that it matches the worst-case convergence rate for the original convex problem. The trajectories of GD for deep linear neural networks and for the original convex problem (1) can be arbitrarily close. Furthermore, we show that if the width of each hidden layer is appropriately large, then the optimal rate depends neither on the type of random initialization nor on the network depth. Lastly, we elucidate the mechanism underlying the observed automatic avoidance of bad saddles by GD for overparameterized deep linear networks.

2. PRELIMINARIES

2.1. PROBLEM SETUP

Let x ∈ R^{n_x} and y ∈ R^{n_y} be an input vector and a target vector, respectively. Given training data {(x_i, y_i)}_{i=1}^m, with data matrices X = [x_1, …, x_m] ∈ R^{n_x×m} and Y = [y_1, …, y_m] ∈ R^{n_y×m}, consider the convex problem

minimize_W L(W) = (1/m) ∑_{i=1}^m l(W x_i, y_i).   (1)

The GD for convex problem (1) with a learning rate of η* is given by

W(t+1) = W(t) − η* ∇L(W(t)),  t = 0, 1, 2, ….   (2)

For any matrix A, let σ_max(A) and σ_min(A) be the largest and smallest singular values of A, respectively. We consider two matrix norms and one semi-norm for A: ∥A∥ := σ_max(A), ∥A∥_F² := tr(AAᵀ), and ∥A∥_X := ∥A P_X∥_F, where P_X = X(XᵀX)†Xᵀ is the orthogonal projection matrix onto the column space of X, and (XᵀX)† is the Moore–Penrose inverse. For two real matrices A, B of the same size, we consider their Frobenius inner product as well as the corresponding semi-inner product: ⟨A, B⟩ = ⟨A, B⟩_F := tr(AᵀB) and ⟨A, B⟩_X := ⟨A P_X, B P_X⟩.

We list some basic properties of the semi-norm and semi-inner product.

Lemma 2.1. The loss function L(W) defined in (1) satisfies the following properties for any W, V ∈ R^{n_y×n_x}: 1. L(W) = L(W P_X); 2. ∇L(W) = ∇L(W P_X) P_X; 3. ⟨∇L(W), V⟩_F = ⟨∇L(W), V⟩_X; 4. ∥∇L(W)∥_F = ∥∇L(W)∥_X; 5. ∥W∥_X ≡ ∥W∥_F if and only if X has full row rank.

The next lemma demonstrates the importance of the semi-norm ∥·∥_X in our analysis.

Lemma 2.2. Assume that l(·, y) is α(l)-strongly convex. Then the following statements hold. 1. If X is not of full row rank, then L(W) is neither strictly convex nor strongly convex with respect to ∥·∥_F. 2. L(W) is (α(l)λ_min(XXᵀ)/m)-strongly convex with respect to ∥·∥_X, where λ_min(XXᵀ) is the smallest non-zero eigenvalue of XXᵀ.

The proofs of these two lemmas are provided in appendix A. Hereafter, when the inner product is not specified, we consider the semi-inner product.
Assume that L is α-strongly convex (α > 0) and that ∇L is β-Lipschitz (with respect to the semi-norm ∥·∥_X); that is, for any W, V ∈ R^{n_y×n_x},

L(W) ≥ L(V) + ⟨∇L(V), W − V⟩_X + (α/2)∥W − V∥²_X,
∥∇L(W) − ∇L(V)∥_X = ∥∇L(W) − ∇L(V)∥_F ≤ β ∥W − V∥_X.

Without loss of generality, we assume that α and β are the best such constants. Lemma 2.2 then implies that α ≥ α(l)λ_min(XXᵀ)/m. Similarly, one can show that β ≤ β(l)λ_max(XXᵀ)/m, where ∇l(·, y) is β(l)-Lipschitz and λ_max(XXᵀ) is the largest eigenvalue of XXᵀ. Define the effective condition number of the convex function L by κ = κ(L) = β/α < ∞; κ appears naturally in the rate of convergence of GD. Let W* be a global minimizer of L(W), that is, L(W*) = min_W L(W). Note that W* might not be unique, but W* P_X is unique. The well-known results on the rate of convergence of GD (2) state that

η* = 1/β  ⟹  E(t) ≤ (1 − 1/κ)^t E(0),  t = 1, 2, …,   (3)

as well as

η* = 2/(α+β)  ⟹  E(t) ≤ (β/2)(1 − 4κ/(1+κ)²)^t ∥W(0) − W*∥²_X,  t = 1, 2, …,   (4)

where E(t) = L(W(t)) − L(W*).
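As a sanity check on rate (3), the following sketch (our illustration, not from the paper; it assumes the ℓ2 loss of Example 2.1 and a Gaussian X with full row rank, so that the constants α, β, κ are computable in closed form) runs GD (2) with η* = 1/β and verifies the bound numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
m, nx, ny = 50, 10, 3
X = rng.standard_normal((nx, m))           # columns are the x_i; full row rank a.s.
Y = rng.standard_normal((ny, m))

def L(W):                                  # l2 loss of Example 2.1
    return np.sum((W @ X - Y) ** 2) / m

def gradL(W):
    return 2.0 * (W @ X - Y) @ X.T / m

eigs = np.linalg.eigvalsh(X @ X.T)
alpha, beta = 2 * eigs[0] / m, 2 * eigs[-1] / m   # strong-convexity / smoothness moduli
kappa = beta / alpha

W_star = Y @ X.T @ np.linalg.inv(X @ X.T)  # the global minimizer
W = np.zeros((ny, nx))
E0 = L(W) - L(W_star)
for t in range(1, 201):
    W = W - (1.0 / beta) * gradL(W)        # GD (2) with eta* = 1/beta
    # rate (3): E(t) <= (1 - 1/kappa)^t E(0)
    assert L(W) - L(W_star) <= (1 - 1 / kappa) ** t * E0 + 1e-12
```

For a quadratic loss the per-mode contraction factor is (1 − λ_i/β)², so the bound holds with room to spare.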

2.2. DEEP LINEAR NETWORK SETUP

Let N − 1 be the number of hidden layers and assume rank(X) = r. Denote the weight matrices by W_k ∈ R^{n_k×n_{k−1}}, k = 1, 2, …, N, with n_N = n_y and n_0 = n_x, where n_k is the width of the k-th layer. Set n_min = min{n_1, n_2, …, n_{N−1}} and n_max = max{n_1, n_2, …, n_{N−1}}. For notational convenience, we write n_{j:i} = ∏_{i≤k≤j} n_k and W_{j:i} = W_j W_{j−1} ⋯ W_i for each 1 ≤ i ≤ j ≤ N, and define n_{i−1:i} = 1 and W_{i−1:i} = I (of appropriate dimension) for completeness. Considering the factorization W = W_{N:1} for the convex problem (1), we obtain the following non-convex optimization problem for deep linear neural networks:

minimize_{W_1,…,W_N} L(W_{N:1}) = (1/m) ∑_{i=1}^m l(W_{N:1} x_i, y_i).   (5)

Example 2.1. If we take the loss l(W x_i, y_i) = ∥W x_i − y_i∥²_2, then L(W) = (1/m)∥W X − Y∥²_F is (2λ_min(XXᵀ)/m)-strongly convex, and ∇L is (2λ_max(XXᵀ)/m)-Lipschitz.

Example 2.2. Deep linear neural networks with regularization λ∥W_N ⋯ W_1 P_X∥²_F can be converted into the new optimization problem minimize_{W_1,…,W_N} L(W_{N:1}) + λ∥W_{N:1}∥²_X. Let L_λ(W) = L(W) + λ∥W∥²_X. Then L_λ(·) is (α + 2λ)-strongly convex, and ∇L_λ(·) is (β + 2λ)-Lipschitz. More generally, if we consider a regularizer of the form R(W) = λ·g(W P_X), where g(·) is α′-strongly convex and ∇g is β′-Lipschitz, then for the optimization problem minimize_{W_1,…,W_N} L(W_{N:1}) + R(W_N ⋯ W_1) =: L_R(W_N ⋯ W_1), the function L_R(·) is (α + λα′)-strongly convex, and ∇L_R(·) is (β + λβ′)-Lipschitz.
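The gradient of objective (5) with respect to each factor W_j is (W_{N:j+1})ᵀ ∇L(W_{N:1}) (W_{j−1:1})ᵀ, which is the quantity the GD updates of the next subsection use. A minimal sketch (ours, with the ℓ2 loss and arbitrarily chosen small widths) checks this chain rule against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)
m, widths = 20, [4, 6, 5, 3]               # n_0, n_1, n_2, n_N with N = 3
N = len(widths) - 1
X = rng.standard_normal((widths[0], m))
Y = rng.standard_normal((widths[-1], m))
Ws = [0.3 * rng.standard_normal((widths[k + 1], widths[k])) for k in range(N)]

def W_range(lo, hi):                       # product W_{hi:lo} = W_hi ... W_lo (1-indexed)
    P = np.eye(widths[lo - 1])
    for k in range(lo - 1, hi):
        P = Ws[k] @ P
    return P

def loss(ws):                              # objective (5) with the l2 loss
    P = np.eye(widths[0])
    for w in ws:
        P = w @ P
    return np.sum((P @ X - Y) ** 2) / m

def gradL(W):                              # gradient of the convex loss (1)
    return 2.0 * (W @ X - Y) @ X.T / m

# chain rule: dL/dW_j = (W_{N:j+1})^T grad L(W_{N:1}) (W_{j-1:1})^T
j = 2
analytic = W_range(j + 1, N).T @ gradL(W_range(1, N)) @ W_range(1, j - 1).T

eps = 1e-6                                 # finite-difference check of one entry
Wp = [w.copy() for w in Ws]; Wp[j - 1][0, 0] += eps
Wm = [w.copy() for w in Ws]; Wm[j - 1][0, 0] -= eps
fd = (loss(Wp) - loss(Wm)) / (2 * eps)
assert abs(analytic[0, 0] - fd) < 1e-6
```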

2.3. INITIALIZATION SCHEMES

In previous studies, the following form of deep linear networks was considered instead of (5):

minimize_{W_1,…,W_N} L(a_N W_{N:1}) = (1/m) ∑_{i=1}^m l(a_N W_{N:1} x_i, y_i),   (6)

where a_N = 1/√(n_1 n_2 ⋯ n_N) is a normalization constant. Applying GD to (6), where all W_j are updated simultaneously, yields

W_j(t+1) = W_j(t) − η·a_N (W_{N:j+1}(t))ᵀ ∇L(a_N W_{N:1}(t)) (W_{j−1:1}(t))ᵀ,  j = 1, …, N.   (7)

Recent studies considered GD (7) and adopted a Gaussian initialization (Du & Hu, 2019) or a scaled orthogonal initialization (Hu et al., 2020) for W_j(0). In this paper, we consider the following three kinds of random initialization, which generalize their idea.

Gaussian initialization: Let W_1(0), …, W_N(0) be the weight matrices at initialization. We assume that all the entries of W_j(0), 1 ≤ j ≤ N, are independent Gaussian random variables with zero mean and unit variance. Then a_N is a normalization constant in the sense that for any x ∈ R^{n_0}, we have

E ∥a_N W_{N:1}(0) x∥²_2 = ∥x∥²_2.   (8)

In fact, all the initializations discussed in this paper satisfy (8).

Remark 1. Let V_i(t) = (1/√n_i) W_i(t) for 1 ≤ i ≤ N. Then GD (7) with a unit-variance Gaussian initialization is equivalent to

V_j(t+1) = V_j(t) − (η/n_j) (V_{N:j+1}(t))ᵀ ∇L(V_{N:1}(t)) (V_{j−1:1}(t))ᵀ,   (9)

with a zero-mean, variance-1/n_i Gaussian initialization for V_i, i = 1, …, N. GD (9) for loss (5) is equivalent to GD (7) for loss (6). Hereafter, we only consider GD (7) for the deep linear neural network (6).
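The normalization property (8) can be checked by Monte Carlo. The sketch below (our illustration, with arbitrarily chosen widths and a deliberately loose tolerance, since the product of Gaussian matrices is heavy-tailed) averages ∥a_N W_{N:1}(0)x∥² over fresh Gaussian initializations:

```python
import numpy as np

rng = np.random.default_rng(2)
widths = [5, 30, 30, 4]                    # n_0, n_1, n_2, n_N with N = 3
aN = 1.0 / np.sqrt(np.prod(widths[1:]))    # a_N = 1/sqrt(n_1 n_2 ... n_N)
x = rng.standard_normal(widths[0])

vals = []
for _ in range(3000):                      # fresh Gaussian initialization each trial
    v = x
    for k in range(len(widths) - 1):
        v = rng.standard_normal((widths[k + 1], widths[k])) @ v
    vals.append(np.sum((aN * v) ** 2))

ratio = np.mean(vals) / np.sum(x ** 2)     # should concentrate around 1 by (8)
assert abs(ratio - 1.0) < 0.2
```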

Orthogonal initialization:

We consider the so-called one-peak random orthogonal projection and embedding initialization, which generalizes the idea of orthogonal initialization (Hu et al., 2020).

Definition 2.1. An initialization W_{N:1}(0) = W_N(0) W_{N−1}(0) ⋯ W_1(0) is said to be a one-peak random orthogonal projection and embedding initialization if there exists 1 ≤ p < N such that n_0 ≤ n_1 ≤ n_2 ≤ ⋯ ≤ n_p, n_p ≥ n_{p+1} ≥ n_{p+2} ≥ ⋯ ≥ n_{N−1} ≥ n_N, and W_1(0), W_2(0), …, W_p(0), W_{p+1}(0), …, W_N(0) are independent and uniformly distributed over the rectangular matrices satisfying

W_i(0)ᵀ W_i(0) = n_i I_{n_{i−1}},  1 ≤ i ≤ p,   and   W_j(0) W_j(0)ᵀ = n_{j−1} I_{n_j},  p+1 ≤ j ≤ N.

Remark 2. In this definition, (1/√n_i) W_i(0), 1 ≤ i ≤ p, are random embeddings and (1/√n_{j−1}) W_j(0), p+1 ≤ j ≤ N, are random orthogonal projections. Notably, A is a random orthogonal projection if and only if Aᵀ is a random embedding.

Arora et al. (2018a) studied the rate of convergence of GD to a global optimum for training a deep linear neural network with a balanced initialization. Here, we consider a special case of balanced initialization, described as follows.

Special balanced initialization: Assume n_1 = ⋯ = n_{N−1} = n. Consider the initialization W_N(0) = √n U_N [I_{n_y}, 0_{n_y×(n−n_y)}] V_Nᵀ, W_1(0) = √n U_1 [I_{n_x}, 0_{n_x×(n−n_x)}]ᵀ V_1ᵀ, and W_i(0) = √n U_i I_n V_iᵀ, 2 ≤ i ≤ N−1, where U_1, …, U_N and V_1 are orthogonal matrices (random or deterministic), V_i = U_{i−1} for 2 ≤ i ≤ N−1, and V_N has a uniform distribution over the orthogonal matrices. Note that only V_N is required to be random.

A simple estimate of the loss at initialization is given by the following lemma.

Lemma 2.3. If the initialization satisfies (8) for all x, then with probability at least 1 − δ/2, we have L(a_N W_{N:1}(0)) − L(W*) ≤ β B_δ, where B_δ = 2·rank(X)/δ + ∥W*∥²_X. The bound B_δ can be improved by using a sharper concentration inequality.
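A one-peak initialization in the sense of Definition 2.1 can be built from QR factorizations of Gaussian matrices. The sketch below is our illustration (exact Haar uniformity would additionally require a sign correction of the QR factors, but the isometry identities hold regardless); it verifies the two defining identities:

```python
import numpy as np

rng = np.random.default_rng(3)
widths = [4, 8, 12, 8, 3]                  # one peak at p = 2: 4 <= 8 <= 12 >= 8 >= 3
N, p = len(widths) - 1, 2

def random_embedding(rows, cols):          # rows >= cols; returns W with W^T W = rows * I
    Q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return np.sqrt(rows) * Q

Ws = []
for i in range(1, N + 1):
    ni, nim1 = widths[i], widths[i - 1]
    if i <= p:                             # embedding: W_i(0)^T W_i(0) = n_i I_{n_{i-1}}
        Ws.append(random_embedding(ni, nim1))
    else:                                  # projection: W_j(0) W_j(0)^T = n_{j-1} I_{n_j}
        Ws.append(random_embedding(nim1, ni).T)

for i in range(p):                         # check the embedding identity
    assert np.allclose(Ws[i].T @ Ws[i], widths[i + 1] * np.eye(widths[i]))
for j in range(p, N):                      # check the projection identity
    assert np.allclose(Ws[j] @ Ws[j].T, widths[j] * np.eye(widths[j + 1]))
```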

3. MAIN THEOREMS

Assume that the thinnest layer is either the input layer or the output layer, that is, n_min ≥ max{n_0, n_N}, and that the ratio between the widths of any two hidden layers is bounded from above: n_max/n_min ≤ C_0 < ∞. The quantities C_2, C_5, and C_6 are defined in appendix B and depend on the hyperparameters n_N, κ, δ, rank(X), C_0, and N. For notational convenience, we write E(t) = L(W(t)) − L(W*) and E_DLN(t) = L(a_N W_{N:1}(t)) − L(W*). With our assumptions and notation in place, we next state our main theorems.

3.1. LINEAR CONVERGENCE OF DEEP LINEAR NEURAL NETWORKS

In appendix B, we present a sharp estimate of the linear convergence of GD for deep linear neural networks: Theorem B.1 for Gaussian initialization, Theorem B.2 for orthogonal initialization, and Theorem B.3 for the special balanced initialization. In particular, with the specific learning rate η = n_N/(βN), Theorem B.1 and Theorem B.2 yield the following optimal rate of convergence.

Theorem 3.1. Given any δ, ε ∈ (0, 1/2), there exists a constant C := C(ε) such that if one of the following two overparameterization conditions holds: 1. n_min ≥ C·C_2·N with the Gaussian initialization; 2. n_min ≥ C·C_5 with the one-peak random orthogonal projection and embedding initialization; then with probability at least 1 − δ, we have

E_DLN(t) ≤ (1 − (1−ε)/κ)^t E_DLN(0),  t = 1, 2, ….

Remark 3. Consider GD (2) with a learning rate of η* = 1/β and initialization W(0) = a_N W_{N:1}(0). The well-known rate of convergence (3) for GD (2) on the convex problem (1) matches the rates obtained from Theorem B.1 and Theorem B.2.

Remark 4. Du & Hu (2019) and Hu et al. (2020) showed that the number of iterations required to reach precision ε is O(κ log(1/ε)) for the l_2 loss. We improve the rate of convergence and generalize their results to any strongly convex loss.

3.2. RESULTS OF TRAJECTORIES

Theorem 3.1 and Remark 3 establish that, with high probability, the rate of convergence to a global optimum of GD for training a deep linear neural network is almost the same as that of GD for the corresponding convex problem, if the width is sufficiently large. Moreover, GD for the fully-connected deep linear neural network (7) and GD (2) have almost the same trajectories. Let η_1 = 2n_N/(βN) be an upper bound on the learning rate η. We can show that the trajectories of GD (7) for the deep linear network (6) with a learning rate of η < η_1 are close to those of GD (2) with a learning rate of η* = (N/n_N)η for the corresponding convex problem (1) with high probability, if the width of each hidden layer is sufficiently large. The precise statement is as follows.

Theorem 3.2. Consider GD (7) for the deep linear network (6) with a learning rate of η < η_1, producing a_N W_{N:1}(t), t = 0, 1, …, and GD (2) with a learning rate of η* = (N/n_N)η, producing W(t), t = 0, 1, …. Given τ, δ ∈ (0, 1), there exists a constant C := C(τ, η/η_1) such that if one of the following three overparameterization conditions holds: 1. n_min ≥ C·C_2·N with the Gaussian initialization; 2. n_min ≥ C·C_5 with the one-peak random orthogonal projection and embedding initialization; 3. n_min ≥ C·C_6 with the special balanced initialization; then with probability at least 1 − δ, we obtain

∥a_N W_{N:1}(t) − W(t)∥²_X ≤ D(τ, q, t) ∥a_N W_{N:1}(0) − W*∥²_X,

|E_DLN(t) − E(t)| ≤ β (q^{t/2} √(D(τ, q, t)) + (1/2) D(τ, q, t)) ∥a_N W_{N:1}(0) − W*∥²_X,

E_DLN(t) ≤ 3β (q + τ)^t ∥a_N W_{N:1}(0) − W*∥²_X,

where D(τ, q, t) = min{τ/(1−q), 2(q+τ)^t}, with 0 < q < 1 defined in (15).

Remark 5. To the best of our knowledge, this is the first paper to reveal that the trajectory of GD for overparameterized deep linear neural networks is close to that for the original convex problem with an appropriately rescaled learning rate.

Corollary 1.
According to Theorem 3.2, if we set η = 2n_N/((α+β)N), then the following inequality holds with high probability:

E_DLN(t) ≤ 3β (1 − 4κ/(1+κ)² + τ)^t ∥a_N W_{N:1}(0) − W*∥²_X.   (11)

Notably, the rate of convergence in (11) is better than that in Theorem 3.1, because if κ > 1, then we can choose a sufficiently small τ such that 1 − 4κ/(1+κ)² + τ < 1 − 1/κ. Theorem 3.1, Theorem 3.2, Theorem B.1, and Theorem B.2 indicate that the implicit regularization induced by GD for a convex problem recovers the convex problem itself in terms of optimization, at the cost of achieving linear convergence only with high probability over the random initialization.

Remark 6. Recall the constants C_2, C_5, and C_6 defined in appendix B. The term rank(X)/δ is not optimal, since our concentration inequality depends only on the second moment. By using stronger concentration inequalities in Lemma 2.3, similar to the proofs of Proposition 6.5 in Du & Hu (2019) and Lemma 4.2 in Hu et al. (2020), the factor rank(X)/δ can be improved to 1 + log(rank(X)/δ). C_2 is proportional to κ², which is slightly better than the corresponding constant in Du & Hu (2019), which is proportional to κ³. C_5 is also slightly better than the constant reported by Hu et al. (2020), since we do not have the extra term ∥X∥²_F/∥X∥². The improvement of the constants is mainly due to the introduction of the semi-norm ∥·∥_X.
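The matched-learning-rate correspondence of Theorem 3.2 can be observed in a small run. The sketch below is our illustration, not the paper's experiment: it uses the ℓ2 loss, width 100, and loosely chosen tolerances, runs GD (7) from a Gaussian initialization alongside GD (2) with η* = (N/n_N)η, and tracks the gap ∥a_N W_{N:1}(t) − W(t)∥_F (here P_X = I because X has full row rank):

```python
import numpy as np

rng = np.random.default_rng(4)
m, widths = 30, [3, 100, 100, 2]           # n_0, n_1, n_2, n_N with N = 3
N = len(widths) - 1
aN = 1.0 / np.sqrt(np.prod(widths[1:]))
X = rng.standard_normal((widths[0], m))    # full row rank a.s., so P_X = I
Y = rng.standard_normal((widths[-1], m))

def L(W): return np.sum((W @ X - Y) ** 2) / m
def gradL(W): return 2.0 * (W @ X - Y) @ X.T / m

beta = 2 * np.linalg.eigvalsh(X @ X.T)[-1] / m
W_star = Y @ X.T @ np.linalg.inv(X @ X.T)

def prods(Ws):                             # pre[j] = W_{j:1}, suf[j] = W_{N:j+1}
    pre = [np.eye(widths[0])]
    for w in Ws:
        pre.append(w @ pre[-1])
    suf = [np.eye(widths[-1])]
    for w in reversed(Ws):
        suf.insert(0, suf[0] @ w)
    return pre, suf

Ws = [rng.standard_normal((widths[k + 1], widths[k])) for k in range(N)]
eta = widths[-1] / (beta * N)              # eta = n_N/(beta N), as in Theorem 3.1
eta_star = N * eta / widths[-1]            # matched rate eta* = (N/n_N) eta = 1/beta

W = aN * prods(Ws)[0][-1]                  # GD (2) started at W(0) = a_N W_{N:1}(0)
d0 = np.linalg.norm(W - W_star)
E0 = L(W) - L(W_star)
gap_max = 0.0
for t in range(150):
    pre, suf = prods(Ws)
    G = gradL(aN * pre[-1])
    Ws = [Ws[j] - eta * aN * suf[j + 1].T @ G @ pre[j].T for j in range(N)]  # GD (7)
    W = W - eta_star * gradL(W)                                             # GD (2)
    gap_max = max(gap_max, np.linalg.norm(aN * prods(Ws)[0][-1] - W))

E_dln = L(aN * prods(Ws)[0][-1]) - L(W_star)
assert gap_max <= 0.5 * d0                 # the two trajectories stay close
assert E_dln <= 1e-2 * E0                  # linear convergence to the global minimum
```

The 0.5·d0 tolerance is deliberately loose; Theorem 3.2 predicts that the gap shrinks further as the width grows.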

4. INSIGHTS FOR THEOREM 3.2

Initialization and convergence region: Arora et al. (2018a) showed that if the initialization is approximately balanced and the product matrix W_{N:1}(0) is very close to a global minimizer, then GD converges linearly to the global minimum of the deep linear network without any width requirement. However, the convergence region in Arora et al. (2018a) is very small, because W_{N:1}(0) needs to be very close to W*. Later, Du & Hu (2019) and Hu et al. (2020) successfully proved that GD with a Gaussian or orthogonal initialization converges linearly to a global minimizer of the overparameterized deep linear neural network with high probability. They introduced a technique to analyze the trajectories of GD with large widths for any deterministic initialization. The following lemma describes the linear convergence result for a deep linear network with a deterministic initialization.

Lemma 4.1. Under the setting of Lemma D.1, GD for a deep linear network satisfies E_DLN(t) ≤ (1 − ηγ)^t E_DLN(0), t = 1, 2, ….

Our convergence region (see (31) in Lemma D.1 and Definition D.1) originates from the analysis of Du & Hu (2019) and Hu et al. (2020) and can be viewed as a neighborhood of the special balanced initialization when n_1 = n_2 = ⋯ = n_{N−1}; both the Gaussian and orthogonal initializations are approximately balanced. For the l_2 loss, one can assume without loss of generality that X is a full-rank matrix and L(W*) = 0, by the decomposition method in Claim B.1 of Du & Hu (2019). However, when considering a general strongly convex loss, we must confront a low-rank X directly in our analysis. Thus, ∥·∥_X appears naturally and aids in achieving the sharp rate of convergence in our main theorems. In addition to the technique reported in Du & Hu (2019) and Hu et al.
(2020), we also used classical convex optimization techniques (such as the inequalities in Lemma C.1 and the Polyak–Łojasiewicz inequality in (26)) as well as classical concentration inequalities for the beta distribution (such as the Chernoff-type bound in Lemma F.3).

Why are GD trajectories for overparameterized deep linear neural networks with approximately balanced initialization close to those for the convex problem? The underlying mechanism can be understood as follows. Even though the recent results of Ziyin et al. (2022) describe the exact global minimizer of a deep linear network (with a regularization term such as l_2), the evolution of each W_j is still difficult to track. Instead, we consider the discrete dynamics of the product matrix (see (41) and (42)):

a_N W_{N:1}(t+1) = a_N W_{N:1}(t) − η·P(t)[∇L(a_N W_{N:1}(t) P_X)] + a_N E(t).

For their own linear operator P_t, Du & Hu (2019) showed that λ_max(P_t) ≤ O(N/n_N)·λ_max(XᵀX) and λ_min(P_t) ≥ Ω(N/n_N)·λ_min(XᵀX). To the best of our knowledge, the present paper is the first to prove that our operator satisfies P(t)[·] ≈ (N/n_N) I (also see (44)), where I is the identity operator. Moreover, E(t) is negligible, which leads to the following result on the discrete dynamics (see Lemma D.3).

Lemma 4.2. Under the setting of Lemma D.3, we have

a_N W_{N:1}(t+1) = a_N W_{N:1}(t) − (N/n_N) η ∇L(a_N W_{N:1}(t)) + R(t),

with ∥R(t)∥_X ≤ τ ∥a_N W_{N:1}(t) − W*∥_X.

Without the R(t) term, the discrete dynamics is exactly GD for a convex function. To control the distance between the two trajectories, we introduce the following lemma (also see Lemma D.4).

Lemma 4.3. Assume τ ∈ [0, 1), and consider a discrete dynamical system V(t) such that V(t+1) = V(t) − η* ∇L(V(t)) + R(t), where ∥R(t)∥_X ≤ τ ∥V(t) − W*∥_X. If η* ≤ 2/β, then we have ∥V(t) − W*∥²_X ≤ (q + 7τ)^t ∥V(0) − W*∥²_X, where q is defined in (15).
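Lemma 4.3 is easy to probe numerically: inject a perturbation R(t) of norm exactly τ∥V(t) − W*∥ into GD for the ℓ2 loss and check the contraction bound. A minimal sketch (ours; with η* = 1/β the bound follows deterministically from ∥V(t+1) − W*∥ ≤ (√q + τ)∥V(t) − W*∥, since (√q + τ)² ≤ q + 3τ):

```python
import numpy as np

rng = np.random.default_rng(5)
m, nx, ny = 40, 6, 3
X = rng.standard_normal((nx, m))           # full row rank a.s., so the X-semi-norm is Frobenius
Y = rng.standard_normal((ny, m))

def gradL(W): return 2.0 * (W @ X - Y) @ X.T / m

e = np.linalg.eigvalsh(X @ X.T)
alpha, beta = 2 * e[0] / m, 2 * e[-1] / m
W_star = Y @ X.T @ np.linalg.inv(X @ X.T)

eta, tau = 1.0 / beta, 0.02                # eta = 1/beta <= 2/(alpha+beta)
q = 1 - alpha * eta * (2 - eta * alpha)    # the first branch of (15): q = (1 - alpha/beta)^2
V = np.zeros((ny, nx))
d0 = np.linalg.norm(V - W_star) ** 2
for t in range(1, 100):
    R = rng.standard_normal((ny, nx))      # adversarial-size perturbation, random direction
    R *= tau * np.linalg.norm(V - W_star) / np.linalg.norm(R)
    V = V - eta * gradL(V) + R
    # Lemma 4.3: ||V(t) - W*||^2 <= (q + 7 tau)^t ||V(0) - W*||^2
    assert np.linalg.norm(V - W_star) ** 2 <= (q + 7 * tau) ** t * d0
```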
With the help of Lemma 4.3, we further obtain the following trajectory comparison lemma (also see Lemma D.5), which leads to the main conclusions of Theorem 3.2.

Lemma 4.4. Under the setting of Lemma D.5, we have

∥a_N W_{N:1}(t) − W(t)∥²_X ≤ D(τ, q, t) ∥a_N W_{N:1}(0) − W*∥²_X,   (12a)

|E_DLN(t) − E(t)| ≤ β (q^{t/2} √(D(τ, q, t)) + (1/2) D(τ, q, t)) ∥a_N W_{N:1}(0) − W*∥²_X,   (12b)

E_DLN(t) ≤ 3β (q + τ)^t ∥a_N W_{N:1}(0) − W*∥²_X,   (12c)

where D(τ, q, t) = min{τ/(1−q), 2(q+τ)^t}, with 0 < q < 1 defined in (15).

Why do bad saddles not affect GD for overparameterized deep linear neural networks? A critical point x* of f is a bad saddle if λ_min(∇²f(x*)) = 0. Kawaguchi (2016) showed that deep linear networks have bad saddles, and thus, in general, a vanishing Hessian can hinder optimization. Theorem 2.3 in Kawaguchi (2016) shows that every bad saddle satisfies that W_{N−1:2} is rank-deficient. Thus, to show that the trajectories of GD stay away from bad saddle points, it suffices to show that inf_t σ_min(W_{N−1:2}(t)) > 0. According to previous studies, there are two main ways for GD to avoid bad saddles when training deep linear networks. On the one hand, following Arora et al. (2018b), it can be shown that if the approximately balanced initialization satisfies ∥W_{N:1}(0) − W*∥_F ≤ σ_min(W*) − c for some 0 < c < σ_min(W*), then σ_min(W_{N:1}(t)) ≥ c throughout training, as well as ∥W_1(t)∥ ≤ (4∥W*∥_F)^{1/N} and ∥W_N(t)∥ ≤ (4∥W*∥_F)^{1/N}, so that

σ_min(W_{N−1:2}(t)) ≥ σ_min(W_{N:1}(t)) / (∥W_1(t)∥ ∥W_N(t)∥) ≥ c / (4∥W*∥_F)^{2/N}.

On the other hand, if our rescaled and overparameterized weight initialization falls into the convergence region (31), then we can show (see B(t) in the proof of Lemma D.1) that

σ_min(W_{N−1:2}(t)) ≥ max{ σ_min(W_{N:2}(t)) / σ_max(W_N(t)), σ_min(W_{N−1:1}(t)) / σ_max(W_1(t)) }.

Thus, σ_min(W_{N−1:2} / (n_{N−1:2})^{1/2}) ≥ e^{−c_1−c_2} / max{n_1/n_{N−1}, n_{N−1}/n_1} > 0.
In conclusion, based on Arora et al. (2018b), we first conjecture that for a non-overparameterized deep linear network there are no bad saddles satisfying ∥W_{N:1}(0) − W*∥_F < σ_min(W*); thus the set {∥W_{N:1}(0) − W*∥_F < σ_min(W*)} is indeed a convergence region. However, this region is in general very small, and can even be empty if σ_min(W*) = 0. For an overparameterized deep linear network, GD initialized in the convergence region keeps the trajectories away from all bad saddles.

Why does the width have to be large? We now discuss the overparameterization phenomenon in deep linear networks. For simplicity, we consider the special balanced initialization. First, we know that ∥W_i(t) − W_i(0)∥_F = O(1/N) (see C(t) in the proof of Lemma D.1), provided η = h n_N/(βN) and γ = O(αN/n_N), where h ∈ (0, 2). An overparameterized deep linear network around the special balanced initialization is full of global minimizers; i.e., the trajectory limit (W*_1, W*_2, …, W*_N) lies in the O(1/N) neighborhood of the special balanced initialization (W_1(0), W_2(0), …, W_N(0)). Notably,

σ_min(W_{N−1:2}(0)) = ∏_{j=2}^{N−1} σ_min(W_j(0)) = σ_max(W_{N−1:2}(0)) = ∏_{j=2}^{N−1} σ_max(W_j(0)) = n^{(N−2)/2},

and for any (W_1, W_2, …, W_N) in the O(1/N) neighborhood of any given initialization (W_1(0), W_2(0), …, W_N(0)), we have (a detailed argument can be found in the proof of B(t) in Lemma D.1):

∥W_{N−1:2} − W_{N−1:2}(0)∥_F ≤ ∑_{s=1}^{N−2} \binom{N−1}{s} O(1/N)^s n^{(N−2−s)/2} ≤ n^{(N−2)/2} O(1/√n).

Thus, in terms of the landscape, we have

σ_min(W_{N−1:2}) / σ_min(W_{N−1:2}(0)) ≥ (σ_min(W_{N−1:2}(0)) − ∥W_{N−1:2} − W_{N−1:2}(0)∥_F) / σ_min(W_{N−1:2}(0)) ≥ 1 − O(1/√n),

which implies that no bad saddle is present in the O(1/N) neighborhood of the special balanced initialization, if the width n is sufficiently large.
In terms of training, we have (see the proof of Lemma D.1) ∥W_i(t) − W_i(0)∥_F = O(1/N) and ∥W_i(0)∥_F = n for 2 ≤ i ≤ N−1, so that ∥W_i(t) − W_i(0)∥_F / ∥W_i(0)∥_F = O(1/(Nn)). Thus, for an overparameterized deep linear network, GD with an approximately balanced initialization essentially trains only W_1 and W_N, and the other weight matrices remain almost constant; we provide empirical evidence supporting this argument in appendix G. On the other hand, the sharp rate of convergence depends on the trajectory limit, and when the minimum width is sufficiently large, the trajectory limit and the initialization are not far from each other. For deep linear networks with small widths, the results of Ziyin et al. (2022) might shed light on the convergence analysis, because the exact global minimizer can be described for a deep linear network with l_2 regularization.
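The claim that only W_1 and W_N train appreciably can be seen in a small run. The sketch below is our illustration (Gaussian rather than exactly balanced initialization, depth N = 4, width 100); it compares the relative Frobenius change of the middle layers with that of the outer layers after GD (7):

```python
import numpy as np

rng = np.random.default_rng(6)
m, widths = 30, [3, 100, 100, 100, 2]      # n_0, n_1, n_2, n_3, n_N with N = 4
N = len(widths) - 1
aN = 1.0 / np.sqrt(np.prod(widths[1:]))
X = rng.standard_normal((widths[0], m))
Y = rng.standard_normal((widths[-1], m))

def gradL(W): return 2.0 * (W @ X - Y) @ X.T / m
beta = 2 * np.linalg.eigvalsh(X @ X.T)[-1] / m

Ws0 = [rng.standard_normal((widths[k + 1], widths[k])) for k in range(N)]
Ws = [w.copy() for w in Ws0]
eta = widths[-1] / (beta * N)
for t in range(150):                       # GD (7)
    pre = [np.eye(widths[0])]              # pre[j] = W_{j:1}
    for w in Ws:
        pre.append(w @ pre[-1])
    suf = [np.eye(widths[-1])]             # suf[j] = W_{N:j+1}
    for w in reversed(Ws):
        suf.insert(0, suf[0] @ w)
    G = gradL(aN * pre[-1])
    Ws = [Ws[j] - eta * aN * suf[j + 1].T @ G @ pre[j].T for j in range(N)]

rel = [np.linalg.norm(Ws[j] - Ws0[j]) / np.linalg.norm(Ws0[j]) for j in range(N)]
# the middle layers move far less, relatively, than W_1 and W_N
assert max(rel[1:-1]) < max(rel[0], rel[-1])
```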

Numerical Experiments:

In appendix H, we discuss empirical evidence supporting the main results of Section 3. Figures 1 and 2 in appendix H plot the logarithm of the loss as a function of the number of iterations. When n is small, the loss trajectories for deep linear neural networks fail to decrease in some iterations. However, when n is large, the loss trajectories are close to those for the corresponding convex problem.

5. OVERVIEW OF THE PROOFS OF MAIN THEOREMS AND LEMMAS

In this section, we provide an overview of the proofs of all the theorems in the main results. Since Theorem 3.1 is a special case of the general theorems with non-optimal learning rates (Theorem B.1 and Theorem B.2), we only need to focus on the proofs of the general theorems (Theorem B.1, Theorem B.2, Theorem B.3, and Theorem 3.2). We begin with the convergence region of deep linear neural networks, which is essentially the set of initializations that lead to the convergence of GD for deep linear neural networks; the precise definition can be found in appendix D. Lemma 4.1 and Lemma 4.4 (also see Lemma D.1 and Lemma D.5) prove that this convergence region satisfies the following properties: if the initialization falls into the convergence region, then (i) GD is guaranteed to converge to a global minimizer of the deep linear neural network; (ii) the worst-case rate of convergence of GD for the deep linear neural network, a non-convex problem, is almost the same as that for the corresponding convex problem with a corresponding learning rate; and (iii) the trajectories of GD for the deep linear neural network are arbitrarily close to those for the corresponding convex problem. More precisely, Lemma 4.1 (also see Lemma D.1) establishes the convergence region for a deterministic initialization and demonstrates the first two properties, (i) and (ii). Additionally, in appendix E and appendix F, we prove spectral properties of products of random matrices, which show that overparameterization, realized by increasing the width of each hidden layer, guarantees that a random initialization falls into the convergence region with high probability. These results provide the foundation for the main linear convergence theorems for random initialization (Theorems B.1, B.2, and B.3).
By contrast, Lemma 4.2 (also see Lemma D.3) shows that if the initialization falls into the convergence region, then the update rule for the product of the weight matrices in GD for deep linear neural networks is approximately given by (2). This result is used to establish both Lemma 4.4 (also see Lemma D.5) and Theorem 3.2, which state precisely property (iii) of the convergence region for deterministic and random initializations, respectively.

A PROOFS OF BASIC PROPERTIES OF THE SEMI-NORM

Proof of Lemma 2.1. The first property is a direct consequence of the definition of the projection matrix P_X. Notice that (1/ε)(L(W + ε∆W) − L(W)) = (1/ε)(L(W P_X + ε∆W P_X) − L(W P_X)). Letting ε → 0, the definition of the directional derivative implies

⟨∇L(W), ∆W⟩_F = ⟨∇L(W P_X), ∆W P_X⟩_F = ⟨∇L(W P_X) P_X, ∆W⟩_F,  ∀∆W ∈ R^{n_y×n_x},

since P_X = P_Xᵀ. This proves the second property. The third property follows from the fact that an orthogonal projection matrix satisfies P_X = P_Xᵀ = P_X² = P_X³:

⟨∇L(W), V⟩_F = ⟨∇L(W P_X) P_X, V⟩_F = ⟨∇L(W P_X) P_X², V P_X⟩_F = ⟨∇L(W P_X) P_X, V⟩_X = ⟨∇L(W), V⟩_X.

Setting V = ∇L(W), the fourth property follows from the third. For the last property, recall that ∥W∥_X = ∥W P_X∥_F and P_X = X(XᵀX)†Xᵀ; X has full row rank if and only if P_X is the identity matrix, which completes the proof.

Proof of Lemma 2.2. Because X is not of full row rank, we know that I − P_X ≠ 0, so there exists W such that W(I − P_X) ≠ 0, i.e., W ≠ W P_X. Applying the first property in Lemma 2.1, we obtain

L((1/2)W + (1/2)W P_X) = L(((1/2)W + (1/2)W P_X) P_X) = L(W P_X) = (1/2)L(W) + (1/2)L(W P_X).

Hence, L is not strictly convex, which implies that L is not strongly convex. To prove the second property, it suffices to show that g(W) = L(W) − (α(l)λ_min(XXᵀ)/(2m))∥W∥²_X is convex. Obviously,

g(W) = L(W) − (α(l)/(2m)) ∑_{i=1}^m ∥W x_i − y_i∥²_2 + (α(l)/(2m))(∥W X − Y∥²_F − λ_min(XXᵀ)∥W∥²_X).   (14)

The function L(W) − (α(l)/(2m)) ∑_{i=1}^m ∥W x_i − y_i∥²_2 is convex because l(·, y_i) is α(l)-strongly convex. The Hessian of ∥W X − Y∥²_F − λ_min(XXᵀ)∥W P_X∥²_F has no negative eigenvalues; thus the second term in (14) is also convex. This completes the proof.
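The projection identities used throughout this proof can be verified numerically. The sketch below (our illustration, with a deliberately rank-deficient X) checks that P_X is an orthogonal projection and that properties 1 and 2 of Lemma 2.1 hold for the ℓ2 loss:

```python
import numpy as np

rng = np.random.default_rng(7)
m, nx, ny, r = 20, 8, 3, 4
X = rng.standard_normal((nx, r)) @ rng.standard_normal((r, m))   # rank r < nx
P = X @ np.linalg.pinv(X.T @ X) @ X.T      # P_X = X (X^T X)^+ X^T

assert np.allclose(P, P.T) and np.allclose(P @ P, P)   # symmetric idempotent projection
assert np.isclose(np.trace(P), r)          # rank of P_X equals rank(X)

Y = rng.standard_normal((ny, m))

def L(W): return np.sum((W @ X - Y) ** 2) / m
def gradL(W): return 2.0 * (W @ X - Y) @ X.T / m

W = rng.standard_normal((ny, nx))
assert np.isclose(L(W), L(W @ P))          # property 1: L(W) = L(W P_X)
assert np.allclose(gradL(W), gradL(W @ P) @ P)   # property 2: grad L(W) = grad L(W P_X) P_X
```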

B EXACT STATEMENTS OF THE MAIN THEOREMS

Definitions of some quantities:

q = 1 − αη*(2 − η*α), if 0 < η* ≤ 2/(α+β);  q = 1 − βη*(2 − η*β), if 2/(α+β) < η* < 2/β,

B_δ = 2·rank(X)/δ + ∥W*∥²_X,

C_1 = n_N κ² B_δ C_0 η_0²/(η_0 − η)² + ln Ñ,  C_2 = n_N κ² B_δ C_0 + ln Ñ,
C_3 = n_N κ² B_δ C_0 η_0²/(η_0 − η)² + C_0 ln Ñ,  C_4 = n_N κ² B_δ η_0²/(η_0 − η)²,
C_5 = n_N κ² B_δ C_0 + C_0 ln Ñ,  C_6 = n_N κ² B_δ,

where Ñ denotes the number of distinct elements in the set {n_1, …, n_{N−1}}, η_1 = 2n_N/(Nβ), and η_0 = 2n_N/(e^{2c}Nβ) with c > 0.

Theorem B.1. Given any c > 0 and 0 < δ < 1/2, define η_0 = 2n_N/(e^{2c}Nβ), and consider a learning rate η < η_0. There exists a constant C := C(c) such that if n_min ≥ C·C_1·N, then with probability at least 1 − δ over the random Gaussian initialization, we have E_DLN(t) ≤ (1 − 4e^{−c}(η/η_0)(1 − η/η_0)/κ)^t E_DLN(0).

Theorem B.2. Given any c > 0 and 0 < δ < 1/2, define η_0 = 2n_N/(e^{2c}Nβ), and consider a learning rate η < η_0. There exists a constant C := C(c) such that if

n_min ≥ C·C_3,   (17)

then with probability at least 1 − δ over the random one-peak projection and embedding initialization, we have E_DLN(t) ≤ (1 − 4e^{−c}(η/η_0)(1 − η/η_0)/κ)^t E_DLN(0). In particular, if n_1 = n_2 = ⋯ = n_{N−1} = n ≥ min{n_N, n_0}, then requirement (17) can be replaced by n ≥ C·C_4.

Remark 7. Assume L(a_N W_N ⋯ W_1) = (1/2)∥a_N W_N ⋯ W_1 X − Y∥²_F and n_1 = ⋯ = n_{N−1} = n. Then, for Gaussian initialization, our Theorem B.1 recovers Theorem 4.1 of Du & Hu (2019). Similarly, for orthogonal initialization, our Theorem B.2 recovers Theorem 4.1 of Hu et al. (2020).

Next, we present a version of the theorem for balanced initialization.

Theorem B.3. Assume n_1 = ⋯ = n_{N−1} = n. Given any c > 0 and 0 < δ < 1/2, define η_0 = 2n_N/(e^{2c}Nβ), and consider a learning rate η < η_0. There exists a constant C := C(c) such that, as long as n ≥ C·C_4, then with probability at least 1 − δ over the special balanced initialization, we have E_DLN(t) ≤ (1 − 4e^{−c}(η/η_0)(1 − η/η_0)/κ)^t E_DLN(0).
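To make the bound concrete, the following sketch (hypothetical values for c, κ, and the tolerance; not from the paper) evaluates the contraction factor 1 − 4e^{−c}(η/η_0)(1 − η/η_0)/κ from the theorems and the number of iterations the bound needs to reach a given accuracy. The factor (η/η_0)(1 − η/η_0) is maximized at η = η_0/2, and the depth N enters only through the overparameterization requirement, not the rate.

```python
import math

def contraction(eta_ratio, c, kappa):
    """Contraction factor in Theorems B.1-B.3, with eta_ratio = eta / eta_0."""
    return 1.0 - 4.0 * math.exp(-c) * eta_ratio * (1.0 - eta_ratio) / kappa

def iters_to_tol(eta_ratio, c, kappa, tol=1e-6):
    """Iterations until the geometric bound drops below tol."""
    rate = contraction(eta_ratio, c, kappa)
    assert 0.0 < rate < 1.0
    return math.ceil(math.log(tol) / math.log(rate))

# eta = eta_0 / 2 maximizes eta_ratio * (1 - eta_ratio), hence gives the fastest bound
assert contraction(0.5, 0.1, 10.0) < contraction(0.25, 0.1, 10.0)
assert iters_to_tol(0.5, 0.1, 10.0) <= iters_to_tol(0.25, 0.1, 10.0)
```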

C INEQUALITIES IN CONVEX OPTIMIZATION

Convex optimization has been studied for about a century. We recall the definitions and basic inequalities for α-strongly convex functions and functions with β-Lipschitz gradients.

Definition C.1. A continuously differentiable function f is said to be β-Lipschitz if its gradient ∇f is β-Lipschitz, that is, if for all x, y, ∥∇f(y) − ∇f(x)∥ ≤ β∥y − x∥. f is said to be α-strongly convex if for all x, y, we have f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (α/2)∥y − x∥².

Proposition C.1. If f is α-strongly convex and ∇f is β-Lipschitz with respect to a (semi-)norm, then α ≤ β and

⟨∇f(x), y − x⟩ + (α/2)∥y − x∥² ≤ f(y) − f(x) ≤ ⟨∇f(x), y − x⟩ + (β/2)∥y − x∥²,   (22)
⟨∇f(x) − ∇f(y), x − y⟩ ≥ (αβ/(α+β))∥x − y∥² + (1/(α+β))∥∇f(x) − ∇f(y)∥²,   (23)
∥∇f(x) − ∇f(y)∥ ≥ α∥x − y∥,
f(x) − f(y) ≤ ⟨∇f(x), x − y⟩ − (1/(2β))∥∇f(x) − ∇f(y)∥².

Proof of Proposition C.1. We only prove the last inequality. Let z = y − (1/β)(∇f(y) − ∇f(x)). Since f is convex and ∇f is β-Lipschitz, we have f(z) − f(x) ≥ ⟨∇f(x), z − x⟩ and f(z) − f(y) ≤ ⟨∇f(y), z − y⟩ + (β/2)∥z − y∥². Thus,

f(x) − f(y) = f(x) − f(z) + f(z) − f(y) ≤ ⟨∇f(x), x − z⟩ + ⟨∇f(y), z − y⟩ + (β/2)∥z − y∥² = ⟨∇f(x), x − y⟩ − (1/(2β))∥∇f(x) − ∇f(y)∥².

Before we prove Lemma D.1, let us first state and prove the following result.

Lemma C.2. 1. Assume L is α-strongly convex, α > 0, and denote a global minimizer of L by W*. Then, for any W,

L(W*) − L(W) ≥ −(1/(2α))∥∇L(W)∥²_X.   (26)

2. Assume ∇L is β-Lipschitz. Then L(W*) − L(W) ≤ −(1/(2β))∥∇L(W)∥²_X.

Proof of Lemma C.2. 1. First, we know that ∇L(W*) = 0. Since L is α-strongly convex, inequality (22) holds. Thus L(V) − L(W) ≥ ⟨∇L(W), V − W⟩_X + (α/2)∥V − W∥²_X =: g(V). Minimizing both sides over V gives (26). Indeed, since g ∈ C¹ and a global minimizer exists, we have ∇g(V*) = ∇L(W)P_X + α(V* − W)P_X = 0, where V* is a global minimizer of g. Thus, g(V*) = −(1/(2α))∥∇L(W)∥²_X.

2. Applying Proposition C.1 with the β-Lipschitz gradient ∇L, we obtain L(W*) − L(W) ≤ ⟨∇L(W*), W* − W⟩_X − (1/(2β))∥∇L(W) − ∇L(W*)∥²_X = −(1/(2β))∥∇L(W)∥²_X.
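The inequalities of Proposition C.1 can be checked numerically on a strongly convex quadratic, for which the constants α and β are the extreme eigenvalues of the Hessian (our own toy instance):

```python
import numpy as np

rng = np.random.default_rng(2)
evals = np.array([1.0, 2.0, 5.0])        # eigenvalues of A: alpha = 1, beta = 5
A = np.diag(evals)                       # f(x) = 0.5 x^T A x
alpha, beta = evals.min(), evals.max()
grad = lambda x: A @ x

x, y = rng.standard_normal(3), rng.standard_normal(3)
d = x - y
g = grad(x) - grad(y)

# co-coercivity-type inequality (23) from Proposition C.1
lhs = d @ g
rhs = (alpha * beta / (alpha + beta)) * (d @ d) + (1.0 / (alpha + beta)) * (g @ g)
assert lhs >= rhs - 1e-12

# two-sided bound: alpha ||x - y|| <= ||grad f(x) - grad f(y)|| <= beta ||x - y||
assert alpha * np.linalg.norm(d) <= np.linalg.norm(g) <= beta * np.linalg.norm(d) + 1e-12
```

For a quadratic, equality in (23) holds direction-by-direction exactly when the eigenvalue along that direction is α or β.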

D CONVERGENCE REGION

In this section, we exhibit a class of convergence regions for deep linear neural networks with deterministic initialization. Define A|_{R(X)} = AX(XᵀX)†Xᵀ = AP_X, and view A|_{R(X)} as a linear operator on R(X). Recall the optimization problem

minimize_{W_1,…,W_N} L_N(W_1, …, W_N) := (1/m) Σ_{i=1}^m l(a_N W_{N:1} x_i, y_i) = L(a_N W_{N:1}),   (29)

and the GD iteration

W_j(t+1) = W_j(t) − η (∂L_N/∂W_j)(W_1(t), …, W_N(t)), j = 1, …, N,   (30)

where ∂L_N/∂W_j(W_1, …, W_N) = a_N (W_{N:j+1})ᵀ ∇L(a_N W_{N:1}) (W_{j−1:1})ᵀ, and the normalization factor is a_N = 1/√(n_1 n_2 ⋯ n_{N−1} n_N). The following lemma generalizes ideas from recent works (Du & Hu, 2019; Hu et al., 2020). For notational convenience, we write W_{j:i}(t) = W_j(t) ⋯ W_i(t), L_t = L(a_N W_{N:1}(t)), ∇L_t = ∇L(a_N W_{N:1}(t)), etc.

Lemma D.1. Assume the initialization simultaneously satisfies the following conditions:

σ_max(W_{N:i+1}(0)) ≤ e^{c_1/2} (n_{N−1:i})^{1/2}, 1 ≤ i ≤ N−1,
σ_min(W_{N:i+1}(0)) ≥ e^{−c_2/2} (n_{N−1:i})^{1/2}, 1 ≤ i ≤ N−1,
σ_max(W_{i−1:1}(0)|_{R(X)}) ≤ e^{c_1/2} (n_{i−1:1})^{1/2}, 2 ≤ i ≤ N,
σ_min(W_{i−1:1}(0)|_{R(X)}) ≥ e^{−c_2/2} (n_{i−1:1})^{1/2}, 2 ≤ i ≤ N,
∥W_{j:i}(0)∥ ≤ (M/2)·N^θ (∏_{i≤k≤j−1} n_k · max{n_{i−1}, n_j})^{1/2}, 1 < i ≤ j < N,
L_0 − L(W*) ≤ βB_0 =: B,   (31)

where c_1, c_2, M are positive constants, θ ≥ 0, and B_0 is a proper upper bound for ∥a_N W_{N:1}(0)∥²_X + ∥W*∥²_X. Set the learning rate to η = (1−ε)·2n_N/(e^{6c_1+3c_2}βN), where 0 < ε < 1, and define γ = 2εαN/(e^{3c_2} n_N). Assume that

n_min ≥ C(c_1, c_2) M² κ² B_0 N^{2θ} n_N / ε².   (32)

Then, GD (30) satisfies L_t − L(W*) ≤ (1 − ηγ)^t (L_0 − L(W*)), t = 1, 2, ….

Definition D.1. For given c_1, c_2, M, B_0 > 0 and θ ≥ 0, we define the convergence region R(c_1, c_2, θ, M, B_0) as the set of initializations that satisfy the inequality system (31).

Remark 8.
Condition (31) describes the convergence region for the initialization, and condition (32) describes the overparameterization of the deep linear neural network. At this point, it is not clear how large this convergence region is. Later, we will show that properly scaled random initializations, under some extra mild overparameterization conditions, fall into this convergence region with high probability.

Proof of Lemma D.1. It suffices to show that the following three properties A(t), B(t), and C(t) hold for all t = 0, 1, ….

1. A(t): L_t − L(W*) ≤ (1 − ηγ)^t (L_0 − L(W*)).

2. B(t):

σ_max(W_{N:i+1}(t)) ≤ e^{c_1} (n_{N−1:i})^{1/2}, 1 ≤ i ≤ N−1,
σ_min(W_{N:i+1}(t)) ≥ e^{−c_2} (n_{N−1:i})^{1/2}, 1 ≤ i ≤ N−1,
σ_max(W_{i−1:1}(t)|_{R(X)}) ≤ e^{c_1} (n_{i−1:1})^{1/2}, 2 ≤ i ≤ N,
σ_min(W_{i−1:1}(t)|_{R(X)}) ≥ e^{−c_2} (n_{i−1:1})^{1/2}, 2 ≤ i ≤ N,
∥W_{j:i}(t)∥ ≤ M·N^θ ((1/n_min) ∏_{i−1≤k≤j} n_k)^{1/2}, 1 < i ≤ j < N.

3. C(t): ∥W_i(t) − W_i(0)∥_F ≤ 2e^{2c_1}√(2βB)/(√(n_N) γ) =: R, 1 ≤ i ≤ N.

By simultaneous induction, the proof of Lemma D.1 reduces to the following three claims.

Claim 1. A(0), …, A(t), B(0), …, B(t) ⟹ C(t+1).
Claim 2. C(t) ⟹ B(t), if n_min ≥ C(c_1, c_2)M²κ²B_0 N^{2θ} n_N/ε², where C(c_1, c_2) is a positive constant that depends only on c_1, c_2.
Claim 3. A(t), B(t) ⟹ A(t+1), if n_min ≥ C(c_1, c_2)M²B_0 N^{2θ} n_N, where C(c_1, c_2) is a positive constant that depends only on c_1, c_2.

Proof of Claim 1. As a consequence of Lemma C.2, Lemma 2.1, and A(s), s ≤ t, we have

∥∇L(a_N W_{N:1}(s))∥²_F = ∥∇L_s − ∇L(W*P_X)∥²_X ≤ 2β[L_s − L(W*)] ≤ 2β(1 − ηγ)^s B.   (33)

From A(0), …, A(t), B(0), …, B(t), we have for any 0 ≤ s ≤ t,

∥∂L/∂W_i(s)∥_F ≤ a_N ∥W_{N:i+1}(s)∥ · ∥∇L(a_N W_{N:1}(s))∥_F · ∥W_{i−1:1}(s)|_{R(X)}∥ ≤ (e^{2c_1}/√(n_N)) ∥∇L(a_N W_{N:1}(s))∥_F ≤ (e^{2c_1}/√(n_N)) √(2β(1 − ηγ)^s B).   (34)

Then,

∥W_i(t+1) − W_i(0)∥_F ≤ Σ_{s=0}^t ∥W_i(s+1) − W_i(s)∥_F = Σ_{s=0}^t η∥∂L/∂W_i(s)∥_F ≤ η(e^{2c_1}/√(n_N))√(2βB) Σ_{s=0}^t (1 − ηγ)^{s/2} ≤ η(e^{2c_1}/√(n_N))√(2βB) Σ_{s=0}^t (1 − ηγ/2)^s ≤ 2e^{2c_1}√(2βB)/(√(n_N) γ) = R.

This proves C(t+1).

Proof of Claim 2. Case 1. We first prove (37). Let δ_i = W_i(t) − W_i(0), 1 ≤ i ≤ N. Using C(t), we have ∥δ_i∥_F ≤ R, 1 ≤ i ≤ N. Set ε_1 = e^{−c_1/2} min{e^{c_1} − e^{c_1/2}, e^{−c_2/2} − e^{−c_2}, 1/2}.
It suffices to show that

∥W_{N:i}(t) − W_{N:i}(0)∥ ≤ e^{c_1/2} ε_1 (n_{i−1} n_i ⋯ n_{N−1})^{1/2}, 1 < i ≤ N,   (35)
∥(W_{i:1}(t) − W_{i:1}(0))|_{R(X)}∥ ≤ e^{c_1/2} ε_1 (n_1 n_2 ⋯ n_i)^{1/2}, 1 ≤ i < N,   (36)
∥W_{j:i}(t) − W_{j:i}(0)∥ ≤ (M/2)·N^θ ((1/n_min) ∏_{i−1≤k≤j} n_k)^{1/2}, 1 < i ≤ j < N.   (37)

For 1 ≤ i < j ≤ N, we can write W_{j:i}(t) = (W_j(0) + δ_j) ⋯ (W_i(0) + δ_i). Expanding this product, each term has the form

W_{j:(k_s+1)}(0) · δ_{k_s} · W_{(k_s−1):(k_{s−1}+1)}(0) · δ_{k_{s−1}} ⋯ δ_{k_1} · W_{(k_1−1):i}(0),

where i ≤ k_1 < ⋯ < k_s ≤ j are the positions at which perturbation terms δ_{k_l} are taken. Notice that the convergence region assumption (31) implies that for any 1 < i ≤ j < N,

∥W_{j:i}(0)∥ ≤ (M/2)·N^θ (∏_{i≤k≤j−1} n_k · max{n_{i−1}, n_j})^{1/2} ≤ M·N^θ (∏_{i−1≤k≤j} n_k / n_min)^{1/2}.   (39)

Without loss of generality, assume M ≥ 1. If i = j+1, then ∥W_{j:i}(0)∥ = ∥I∥ ≤ M·N^θ(n_j/n_min)^{1/2}. Assuming i > 1 and j < N, and applying inequality (39) together with the inequality

Σ_{s=1}^{j−i+1} C(j−i+1, s) x^s = (1+x)^{j−i+1} − 1 ≤ (1+x)^N − 1, ∀x ≥ 0,

we obtain

∥W_{j:i}(t) − W_{j:i}(0)∥ ≤ Σ_{s=1}^{j−i+1} C(j−i+1, s) R^s (M·N^θ)^{s+1} n_min^{−s/2} (n_{i−1} ⋯ n_j/n_min)^{1/2} ≤ M·N^θ (n_{i−1} ⋯ n_j/n_min)^{1/2} [(1 + R·M·N^θ/√(n_min))^N − 1] ≤ ε_1 M·N^θ (n_{i−1} ⋯ n_j/n_min)^{1/2}.

The last line holds for the following reason: there exist absolute constants A_1, A_2 > 0 such that (1+x)^N − 1 ≤ A_2 xN whenever x ≥ 0, N ≥ 1, and xN ≤ A_1. There exists a positive constant C(c_1, c_2), depending only on c_1, c_2, such that when

n_min ≥ C(c_1, c_2) M² κ² B_0 N^{2θ} n_N / ε²,   (40)

we have R·M·N^{θ+1}/√(n_min) ≤ A_1, as well as (1 + R·M·N^θ/√(n_min))^N − 1 ≤ A_2·M·R·N^{θ+1}/√(n_min) ≤ ε_1 = ε_1(c_1, c_2).

Case 2. The proof of (35) is similar.
Setting j = N, we can drop one factor of M·N^θ from the previous calculation, which gives

∥W_{N:i}(t) − W_{N:i}(0)∥ ≤ e^{c_1/2} Σ_{s=1}^{N−i+1} C(N−i+1, s) R^s (M·N^θ)^s n_min^{−s/2} (n_{i−1} ⋯ n_{N−1})^{1/2} ≤ e^{c_1/2} (n_{i−1} ⋯ n_{N−1})^{1/2} [(1 + R·M·N^θ/√(n_min))^N − 1] ≤ e^{c_1/2} ε_1 (n_{i−1} ⋯ n_{N−1})^{1/2}, i ≥ 2,

where the last line is implied by (40).

Case 3. Similarly, we have

∥(W_{j:1}(t) − W_{j:1}(0))|_{R(X)}∥ ≤ e^{c_1/2} Σ_{s=1}^{j} C(j, s) R^s (M·N^θ)^s n_min^{−s/2} (n_1 ⋯ n_j)^{1/2} ≤ e^{c_1/2} (n_1 ⋯ n_j)^{1/2} [(1 + R·M·N^θ/√(n_min))^N − 1] ≤ e^{c_1/2} ε_1 (n_1 ⋯ n_j)^{1/2}, j ≤ N−1.

This proves B(t).

Proof of Claim 3. GD (30) implies

W_{N:1}(t+1) = (W_N(t) − η ∂L_N/∂W_N(t)) (W_{N−1}(t) − η ∂L_N/∂W_{N−1}(t)) ⋯ (W_1(t) − η ∂L_N/∂W_1(t)) = W_{N:1}(t) − η·a_N Σ_{i=1}^N W_{N:i+1}(t) W_{N:i+1}ᵀ(t) ∇L(a_N W_{N:1}(t)) (W_{i−1:1}(t))ᵀ (W_{i−1:1}(t)) + E(t),

where E(t) collects all higher-order terms (those with η² or higher). We define a linear operator

P(t)[A] = a_N² Σ_{i=1}^N W_{N:i+1}(t) W_{N:i+1}ᵀ(t) (AP_X) (W_{i−1:1}(t)|_{R(X)})ᵀ W_{i−1:1}(t)|_{R(X)}, for any A ∈ R^{n_N×n_0}.

Now we have

a_N W_{N:1}(t+1) = a_N W_{N:1}(t) − η·P(t)[∇L(a_N W_{N:1}(t)P_X)] + a_N E(t).   (42)

It is easy to check that P(t)[·] is a sum of positive semidefinite linear operators. The following proposition describes the eigenvalues of the linear operator P(t)[·].

Proposition D.2. Let S_1, S_2 be symmetric matrices. Suppose S_1 = UΛ_1Uᵀ and S_2 = VΛ_2Vᵀ, where U = [u_1, u_2, …, u_m] and V = [v_1, v_2, …, v_n] are orthogonal matrices, and Λ_1 = diag(λ_1, λ_2, …, λ_m) and Λ_2 = diag(μ_1, μ_2, …, μ_n) are diagonal. Then the linear operator L(A) := S_1 A S_2 is orthogonally diagonalizable, and L(A_{ij}) = λ_i μ_j A_{ij}, where the λ_i μ_j are all the eigenvalues with corresponding eigenvectors A_{ij} = u_i v_jᵀ.

Applying this proposition and assumption B(t), we obtain upper and lower bounds for the maximum and minimum eigenvalues of the positive semidefinite operator P(t), respectively:

λ_max(P(t)) ≤ a_N² Σ_{i=1}^N σ²_max(W_{i−1:1}(t)|_{R(X)}) · σ²_max(W_{N:i+1}(t)) ≤ (N/n_N) e^{2c_1},
λ_min(P(t)) ≥ a_N² Σ_{i=1}^N σ²_min(W_{i−1:1}(t)|_{R(X)}) · σ²_min(W_{N:i+1}(t)) ≥ (N/n_N) e^{−2c_2}.

In conclusion, we have λ_max(P(t)) ≤ (N/n_N)e^{2c_1} and λ_min(P(t)) ≥ (N/n_N)e^{−2c_2}.   (44)

With the learning rate η = η_ε = (1−ε)·2n_N/(e^{6c_1+3c_2}βN), 0 < ε < 1, we have

L_{t+1} − L_t ≤ ⟨∇L_t, −ηP(t)[∇L_t]⟩_X + ⟨∇L_t, a_N E(t)⟩_X + (β/2)∥ηP(t)[∇L_t] − a_N E(t)∥²_X
= ⟨∇L_t, −ηP(t)[∇L_t]⟩ + (β/2)η²∥P(t)[∇L_t]∥²_X + F(t)
≤ −(ηλ_min(P(t)) − (β/2)η²λ²_max(P(t)))∥∇L_t∥²_X + F(t)
≤ −e^{−2c_2}(N/n_N)η (1 − e^{4c_1+2c_2}(β/2)η(N/n_N)) ∥∇L_t∥²_X + F(t),

where F(t) = ⟨∇L_t, a_N E(t)⟩_X + (β/2)∥ηP(t)[∇L_t] − a_N E(t)∥²_X − (β/2)η²∥P(t)[∇L_t]∥²_X. We claim that F(t) is sufficiently small that

L_{t+1} − L_t ≤ −e^{−2c_2}(N/n_N)η(1 − e^{4c_1+2c_2}(β/2)η(N/n_N))∥∇L_t∥²_X + F(t) ≤ −e^{−3c_2}(N/n_N)η(1 − e^{6c_1+3c_2}(β/2)η(N/n_N))∥∇L_t∥²_X = −e^{−6(c_1+c_2)}·(2ε(1−ε)/β)·∥∇L_t∥²_X.   (46)

Assuming this claim for the moment, we complete the proof.
Combining (26) and (46), we have L_{t+1} − L_t ≤ −e^{−6(c_1+c_2)}(2ε(1−ε)/β)∥∇L_t∥²_X and L(W*) − L_t ≥ −(1/(2α))∥∇L_t∥²_X, which implies

L_{t+1} − L(W*) ≤ (1 − e^{−6(c_1+c_2)}·4ε(1−ε)/κ)(L_t − L(W*)),

that is,

L_t − L(W*) ≤ (1 − e^{−6(c_1+c_2)}·4ε(1−ε)/κ)^t (L_0 − L(W*)) = (1 − ηγ)^t (L_0 − L(W*)).   (48)

To estimate F(t), observe that

|F(t)| ≤ ∥∇L_t∥_X ∥a_N E(t)∥_X + (β/2)(2ηλ_max(P(t))∥∇L_t∥_X ∥a_N E(t)∥_X + ∥a_N E(t)∥²_X) =: I_1 + I_2.

From (34), we have ∥∂L/∂W_i(t)∥_F ≤ (e^{2c_1}/√(n_N))∥∇L(a_N W_{N:1}(t))∥_F = (e^{2c_1}/√(n_N))∥∇L(a_N W_{N:1}(t))∥_X =: K. Expanding the product

W_{N:1}(t+1) = (W_N(t) − η ∂L_N/∂W_N(t)) (W_{N−1}(t) − η ∂L_N/∂W_{N−1}(t)) ⋯ (W_1(t) − η ∂L_N/∂W_1(t)),

each higher-order term has the form

∆ = W_{N:(k_s+1)}(t) · (η ∂L/∂W_{k_s}(t)) · W_{(k_s−1):(k_{s−1}+1)}(t) · (η ∂L/∂W_{k_{s−1}}(t)) ⋯ (η ∂L/∂W_{k_1}(t)) · W_{(k_1−1):1}(t),

where 1 ≤ k_1 < k_2 < ⋯ < k_s ≤ N. As a direct consequence of B(t) and inequality (39), we obtain

∥∆∥_X = ∥∆P_X∥_F ≤ (1/(a_N √(n_N))) e^{2c_1} (ηK)^s (M·N^θ/√(n_min))^{s−1}.

Recall that E(t) collects all higher-order terms (those with η² or higher) in the expansion, so E(t) can be expressed as

Σ_{s=2}^N Σ_{1≤k_1<k_2<⋯<k_s≤N} W_{N:(k_s+1)}(t)·(η ∂L/∂W_{k_s}(t))·W_{(k_s−1):(k_{s−1}+1)}(t)·(η ∂L/∂W_{k_{s−1}}(t)) ⋯ (η ∂L/∂W_{k_1}(t))·W_{(k_1−1):1}(t).

Set ξ = min{(e^{−2c_2} − e^{−3c_2})/e^{4c_1+1}, (1/4)(e^{6c_1} − e^{4c_1})/e^{6c_1+1}, (1/2)(e^{6c_1} − e^{4c_1})^{1/2}/e^{4c_1+1}, 1}. Recall the binomial inequality C(N, s) ≤ (eN)^s. Thus, we have

a_N∥E(t)∥_X ≤ (1/√(n_N)) e^{2c_1} Σ_{s=2}^N C(N, s) (ηK)^s (M·N^θ/√(n_min))^{s−1}
≤ (1/√(n_N)) (M·N^θ/√(n_min))^{−1} e^{2c_1} Σ_{s=2}^N (eN)^s (ηK)^s (M·N^θ/√(n_min))^s
≤ (1/√(n_N)) e^{2c_1} (ηeKN) · (ηeKM·N^{θ+1}/√(n_min)) / (1 − ηeKM·N^{θ+1}/√(n_min))
≤ ξ·e^{4c_1+1}·η(N/n_N)·∥∇L(a_N W_{N:1}(t))∥_X   (if ηeKM·N^{θ+1}/√(n_min) < ξ/(1+ξ)).   (49)
Using (33) and the upper bound on η, we know that there exists a constant C(c_1, c_2) such that when n_min ≥ C(c_1, c_2)M²·B_0 N^{2θ} n_N,

ηeKM·N^{θ+1}/√(n_min) ≤ 2√2·M·e^{1+2c_1}·√(B_0)·N^θ·√(n_N)/√(n_min) = 1/C′(c_1, c_2) ≤ ξ/2 ≤ ξ/(1+ξ).

Using (49), we have

I_1 ≤ ξ·e^{4c_1+1}·η(N/n_N)∥∇L_t∥²_X ≤ (e^{−2c_2} − e^{−3c_2})·η(N/n_N)∥∇L_t∥²_X,

and

I_2 ≤ (β/2)(2ξ·e^{6c_1+1}·η²(N²/n_N²)∥∇L_t∥²_X + ξ²·e^{8c_1+2}·η²(N²/n_N²)∥∇L_t∥²_X) ≤ (e^{6c_1} − e^{4c_1})(β/2)η²(N²/n_N²)∥∇L_t∥²_X.

Thus, (46) is valid. This proves A(t).

As a direct consequence of the proof of Lemma D.1, we obtain the following lemma.

Lemma D.3. Assume all assumptions in Lemma D.1 hold. For any τ > 0, we can choose new constants c_1, c_2, as well as C := C(c_1, c_2), such that the overparameterization assumption (32) in Lemma D.1 holds and

∥R(t)∥_X ≤ τ∥a_N W_{N:1}(t) − W*∥_X, where a_N W_{N:1}(t+1) = a_N W_{N:1}(t) − (N/n_N)η∇L(a_N W_{N:1}(t)) + R(t).   (51)

Proof of Lemma D.3. Due to (33), (42), (44), (49), and Lemma C.2, we have

∥R(t)∥_X = ∥a_N E(t) + η((N/n_N)∇L_t − P(t)[∇L_t])∥_X ≤ ∥a_N E(t)∥_X + η·max{λ_max(P(t)) − N/n_N, N/n_N − λ_min(P(t))}·∥∇L_t∥_X ≤ (C′·ξ + max{e^{2c_1} − 1, 1 − e^{−2c_2}})·η(N/n_N)·∥∇L_t∥_X ≤ (2√(2β(L_t − L(W*)))/(e^{6c_1+3c_2}β))·β·(C′·ξ + max{e^{2c_1} − 1, 1 − e^{−2c_2}}).

Because L_t − L(W*) is non-increasing in t and C′ is a constant that depends only on c_1, c_2, we can choose sufficiently small positive c_1, c_2 and ξ, depending on τ, such that

∥R(t)∥_X ≤ τ√(2β(L_t − L(W*)))/β ≤ τ∥a_N W_{N:1}(t) − W*∥_X.

Lemma D.4. Assume τ ∈ [0, 1). Consider a discrete dynamical system V(t) such that V(t+1) = V(t) − η*∇L(V(t)) + R(t), where ∥R(t)∥_X ≤ τ∥V(t) − W*∥_X. If η* ≤ 2/β, we have ∥V(t) − W*∥²_X ≤ (q + 7τ)^t ∥V(0) − W*∥²_X, where q is defined in (15).

Proof of Lemma D.4. Set ∆(t) = V(t) − W* and τ′ = τ∥∆(t)∥_X.
Notice that ∆(t+1) = ∆(t) − η*(∇L(V(t)) − ∇L(W*)) + R(t), and

∥∆(t+1)∥²_X ≤ η*²∥∇L(V(t)) − ∇L(W*)∥²_X − 2η*⟨∆(t), ∇L(V(t)) − ∇L(W*)⟩_X + ∥∆(t)∥²_X + (2∥∆(t)∥_X + 2η*∥∇L(V(t)) − ∇L(W*)∥_X + τ′)τ′.

By inequality (23),

∥∆(t+1)∥²_X ≤ ∥∆(t)∥²_X − 2η*⟨∆(t), ∇L(V(t)) − ∇L(W*)⟩_X + η*²∥∇L(V(t)) − ∇L(W*)∥²_X + 7τ∥∆(t)∥²_X
= (1 + 7τ)∥∆(t)∥²_X − 2η*⟨∆(t), ∇L(V(t)) − ∇L(W*)⟩_X + η*²∥∇L(V(t)) − ∇L(W*)∥²_X
≤ (1 + 7τ)∥∆(t)∥²_X − 2η*(αβ/(α+β))∥∆(t)∥²_X + (η*² − 2η*/(α+β))∥∇L(V(t)) − ∇L(W*)∥²_X.

Case 1: 2/(α+β) < η* < 2/β. In this case,

∥∆(t+1)∥²_X ≤ (1 + 7τ)∥∆(t)∥²_X − 2η*(αβ/(α+β))∥∆(t)∥²_X + (η*² − 2η*/(α+β))β²∥∆(t)∥²_X ≤ (1 + 7τ − βη*(2 − η*β))∥∆(t)∥²_X = (q + 7τ)∥∆(t)∥²_X.

Case 2: 0 < η* ≤ 2/(α+β). Similarly, we have ∥∆(t+1)∥²_X ≤ (1 + 7τ − αη*(2 − η*α))∥∆(t)∥²_X = (q + 7τ)∥∆(t)∥²_X.

In both cases, ∥∆(t+1)∥²_X ≤ (q + 7τ)∥∆(t)∥²_X; thus ∥∆(t)∥²_X ≤ (q + 7τ)^t ∥∆(0)∥²_X.

Next, we show that the trajectories of GD (30) for deep linear neural networks (29) are close to those of GD (2) for the corresponding convex problem (1).

Lemma D.5. Consider GD for the deep linear neural network (30) with learning rate η < η_1, producing a_N W_{N:1}(t), t = 0, 1, …, and GD (2) with learning rate η* = (N/n_N)η, producing W(t), t = 0, 1, …. Assume C(c_1, c_2) in Lemma D.1 exists for any c_1, c_2 > 0. For any τ ∈ (0, 1) and η < η_1 (η_1 defined in Appendix B), we can choose c_1, c_2 > 0 and the constant C = C(c_1, c_2) = C′(τ, η/η_1) such that inequality (51) holds, given initialization condition (31) and the overparameterization condition

n_min ≥ C M² κ² B_0 N^{2θ} n_N.   (52)
Furthermore, we have

∥a_N W_{N:1}(t) − W(t)∥²_X ≤ D(τ, q, t)∥a_N W_{N:1}(0) − W*∥²_X,
|E_DLN(t) − E(t)| ≤ β(q^{t/2}√(D(τ, q, t)) + (1/2)D(τ, q, t))∥a_N W_{N:1}(0) − W*∥²_X,
E_DLN(t) ≤ 3β(q + τ)^t ∥a_N W_{N:1}(0) − W*∥²_X,   (53)

where D(τ, q, t) = min{τ/(1−q), 2(q+τ)^t}, with q defined in (15).

Proof of Lemma D.5. Using Lemma D.3, for any τ ∈ (0, 1) and η < η_1, we can find sufficiently small positive constants c_1, c_2, depending only on τ and η/η_1, and a constant C = C(c_1, c_2) = C″(τ, η/η_1) as in Lemma D.3, such that η = (1−ε)·2n_N/(e^{6c_1+3c_2}βN) with 0 < ε < 1, as well as V(t+1) = V(t) − η*∇L(V(t)) + R(t), where V(t) = a_N W_{N:1}(t), η* = (N/n_N)η, and ∥R(t)∥_X ≤ τ′ = τ∥V(t) − W*∥_X.

Notice that θ_0 := η/η_1 = (1−ε)/e^{6c_1+3c_2} and η/η_0 = 1−ε, where η_0 = 2n_N/(e^{6c_1+3c_2}βN). For the right-hand side of inequality (32), we have C(c_1, c_2)M²κ²B_0 N^{2θ}n_N/ε² = C″(τ, η/η_1)M²κ²B_0 N^{2θ}n_N/ε². To show that inequality (32) is equivalent to inequality (52), it suffices to show that ε depends only on τ and η/η_1. Notice that ε = 1 − η/η_0 = 1 − θ_0 e^{6c_1+3c_2}, and c_1, c_2 depend only on τ and η/η_1, which implies that ε depends only on τ and η/η_1.

Now we prove the three inequalities in (53). Recall GD (2) for W(t), and define ∆(t) = V(t) − W(t) = a_N W_{N:1}(t) − W(t). Notice that ∆(t+1) = ∆(t) − η*(∇L(V(t)) − ∇L(W(t))) + R(t), and

∥∆(t+1)∥²_X ≤ η*²∥∇L(V(t)) − ∇L(W(t))∥²_X − 2η*⟨∆(t), ∇L(V(t)) − ∇L(W(t))⟩_X + ∥∆(t)∥²_X + (2∥∆(t)∥_X + 2η*∥∇L(V(t)) − ∇L(W(t))∥_X + τ′)τ′.

Let l_t = 2∥∆(t)∥_X + 2η*∥∇L(V(t)) − ∇L(W(t))∥_X + τ′. We now find an upper bound for l_t. Applying Lemma C.2 with the assumption 0 < η* = (N/n_N)η < 2/β, we know that l_t ≤ 6∥∆(t)∥_X + τ′ ≤ 7(∥W(t) − W*∥_X + ∥V(t) − W*∥_X). Thus

l_t τ′ ≤ 7τ∥V(t) − W*∥_X (∥V(t) − W*∥_X + ∥W(t) − W*∥_X) =: U_t τ.
By inequality (23),

∥∆(t+1)∥²_X ≤ ∥∆(t)∥²_X − 2η*⟨∆(t), ∇L(V(t)) − ∇L(W(t))⟩_X + η*²∥∇L(V(t)) − ∇L(W(t))∥²_X + U_t τ ≤ ∥∆(t)∥²_X − 2η*(αβ/(α+β))∥∆(t)∥²_X + (η*² − 2η*/(α+β))∥∇L(V(t)) − ∇L(W(t))∥²_X + U_t τ.

Case 1: 2/(α+β) < η* < 2/β. In this case,

∥∆(t+1)∥²_X ≤ ∥∆(t)∥²_X − 2η*(αβ/(α+β))∥∆(t)∥²_X + (η*² − 2η*/(α+β))β²∥∆(t)∥²_X + U_t τ ≤ (1 − βη*(2 − η*β))∥∆(t)∥²_X + U_t τ =: q∥∆(t)∥²_X + U_t τ.

Case 2: 0 < η* ≤ 2/(α+β). Similarly, we have ∥∆(t+1)∥²_X ≤ (1 − αη*(2 − η*α))∥∆(t)∥²_X + U_t τ =: q∥∆(t)∥²_X + U_t τ.   (55)

In both cases, 0 < q < 1. First, since U_t ≤ U_0 and ∥∆(0)∥_X = 0, we obtain

∥∆(t)∥²_X ≤ U_0τ/(1−q) + q^t(∥∆(0)∥²_X − U_0τ/(1−q)) ≤ U_0τ/(1−q) ≤ (14τ/(1−q))∥V(0) − W*∥²_X.

Applying Lemma D.4 to V(t) and W(t), we obtain ∥V(t) − W*∥²_X ≤ (q + 7τ)^t∥V(0) − W*∥²_X and ∥W(t) − W*∥²_X ≤ q^t∥W(0) − W*∥²_X, respectively. Thus,

|L(W(t)) − L(a_N W_{N:1}(t))| ≤ |⟨∇L(W(t)), ∆(t)⟩_X| + (β/2)∥∆(t)∥²_X ≤ β∥W(t) − W*∥_X·∥∆(t)∥_X + (β/2)∥∆(t)∥²_X ≤ β(q^{t/2}√(14τ/(1−q)) + 7τ/(1−q))∥V(0) − W*∥²_X.

More generally, (55) implies ∥∆(t)∥²_X ≤ τ Σ_{j=0}^{t−1} q^{t−1−j}U_j, so

∥∆(t)∥²_X ≤ 14τ Σ_{j=0}^{t−1} (q + 7τ)^j q^{t−1−j} ∥V(0) − W*∥²_X ≤ 2(q + 7τ)^t ∥V(0) − W*∥²_X.

Thus, we have ∥a_N W_{N:1}(t) − W(t)∥²_X ≤ min{14τ/(1−q), 2(q + 7τ)^t}∥V(0) − W*∥²_X, as well as

|L(W(t)) − L(a_N W_{N:1}(t))| ≤ β∥W(t) − W*∥_X·∥∆(t)∥_X + (β/2)∥∆(t)∥²_X ≤ β(√(min{14τ/(1−q), 2(q+7τ)^t})·q^{t/2} + (1/2)min{14τ/(1−q), 2(q+7τ)^t})∥V(0) − W*∥²_X.

By the triangle inequality together with L(W(t)) − L(W*) ≤ (β/2)q^t∥V(0) − W*∥²_X, we have |L(a_N W_{N:1}(t)) − L(W*)| ≤ 3β(q + 7τ)^t∥V(0) − W*∥²_X.
Without loss of generality, we may replace 14τ and 7τ by τ throughout (a rescaling of τ), which completes the proof.
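The contraction mechanism of Lemma D.4 can be observed directly on a strongly convex quadratic. The sketch below (our own toy instance: W* = 0, and perturbations R(t) rescaled so that ∥R(t)∥ = τ∥V(t) − W*∥) checks that the perturbed iterates obey the (q + 7τ)^t bound.

```python
import numpy as np

rng = np.random.default_rng(5)
A = np.diag([1.0, 3.0, 5.0])     # f(x) = 0.5 x^T A x, so alpha = 1, beta = 5, W* = 0
alpha, beta = 1.0, 5.0
eta = 2.0 / (alpha + beta)       # Case 2 of Lemma D.4: 0 < eta* <= 2/(alpha + beta)
q = 1.0 - alpha * eta * (2.0 - eta * alpha)
tau = 0.01

v = rng.standard_normal(3)
d0 = v @ v
for t in range(1, 60):
    R = rng.standard_normal(3)
    R *= tau * np.linalg.norm(v) / np.linalg.norm(R)   # ||R(t)|| = tau ||V(t) - W*||
    v = v - eta * (A @ v) + R
    # perturbed GD still contracts at rate at most q + 7*tau
    assert v @ v <= (q + 7.0 * tau) ** t * d0 * (1.0 + 1e-9)
```

Here each step satisfies ∥V(t+1)∥ ≤ (max|1 − ηλ_i| + τ)∥V(t)∥, which is strictly below √(q + 7τ) for this instance, so the assertion holds at every iterate.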

E GAUSSIAN INITIALIZATION FALLS INTO THE CONVERGENCE REGION

In this section, we first establish some spectral properties of products of random Gaussian matrices. These spectral properties lead to the conclusion that overparameterization guarantees that the random initialization falls into the convergence region with high probability.

Proof of Lemma E.1. For a random matrix A ∈ R^{n_i×n_{i−1}} with i.i.d. N(0,1) entries and any vector 0 ≠ v ∈ R^{n_{i−1}}, the distribution of ∥Av∥²₂/∥v∥²₂ is χ²_{n_i}. We rewrite ∥W_{N:1}(0)x∥²₂/∥x∥²₂ = Z_N Z_{N−1} ⋯ Z_1, where Z_i = ∥W_{i:1}(0)x∥²₂/∥W_{i−1:1}(0)x∥²₂. Then the random variable Z_1 ∼ χ²_{n_1}, and the conditional distribution of Z_i given (Z_1, …, Z_{i−1}) is χ²_{n_i} (1 < i ≤ N). Thus, Z_1, …, Z_N are independent. By the law of iterated expectations, we have E[∥W_{N:1}(0)x∥²₂/∥x∥²₂] = ∏_{j=1}^N n_j.

Define ∆_1 = Σ_{j=1}^{N−1} 1/n_j. We introduce the notation Ω(1/∆_1), which means that there exists k > 0 such that Ω(1/∆_1) ≥ k/∆_1.

Lemma E.2. Consider real random matrices A_j ∈ R^{n_j×n_{j−1}}, 1 ≤ j ≤ q, with i.i.d. N(0,1) entries and any vector 0 ≠ x ∈ R^{n_0}. Define ∆_1(q) = Σ_{j=1}^q 1/n_j and n_min = min_{1≤j≤q} n_j. Then

P(∥A_q A_{q−1} ⋯ A_1 x∥²₂/∥x∥²₂ > e^c n_1 ⋯ n_q) ≤ exp(−c²/(8∆_1(q))) =: f_1(c), ∀c > 0.   (56)

When 0 < c ≤ 3 ln 2 and ∆_1(q) ≤ c/(12 ln 2), we have

P(∥A_q A_{q−1} ⋯ A_1 x∥²₂/∥x∥²₂ < e^{−c} n_1 ⋯ n_q) ≤ exp(−c²/(36 ln(2)∆_1(q))) =: f_2(c).   (57)

Hence, for any x ∈ S^{n_0−1}, with probability at least 1 − e^{−Ω(1/∆_1(q))}, we have e^{−c_2/2}(n_1 ⋯ n_q)^{1/2} ≤ ∥A_q ⋯ A_1 x∥₂ ≤ e^{c_1/2}(n_1 ⋯ n_q)^{1/2}, when 0 < c_2 ≤ 3 ln 2 and ∆_1(q) ≤ c_2/(12 ln 2).

Proof of Lemma E.2. For a random matrix A_i ∈ R^{n_i×n_{i−1}} with i.i.d. N(0,1) entries and any vector 0 ≠ v ∈ R^{n_{i−1}}, the random variable ∥A_iv∥²₂/∥v∥²₂ is distributed as χ²_{n_i}. We rewrite ∥A_q ⋯ A_1 x∥²₂/∥x∥²₂ = Z_q Z_{q−1} ⋯ Z_1, where Z_i = ∥A_{i:1}x∥²₂/∥A_{i−1:1}x∥²₂.
We have Z_1 ∼ χ²_{n_1} and Z_i | (Z_1, …, Z_{i−1}) ∼ χ²_{n_i} (1 < i ≤ q). Recall the moments of Z ∼ χ²_m: E[Z^λ] = 2^λ Γ(m/2 + λ)/Γ(m/2) for all λ > −m/2. We now derive Chernoff-type bounds.

Case 1. Define the ratio of Gamma functions R(x, λ) = Γ(x + λ)/Γ(x), λ > 0, x > 0. Jameson (2013) shows that

R(x, λ) ≤ x(x + λ)^{λ−1} ≤ (x + λ)^λ, λ > 0, x > 0.   (58)

Fix c > 0. For any λ > 0 we have

P(Z_q ⋯ Z_1 > e^c n_1 ⋯ n_q) ≤ P((Z_q ⋯ Z_1)^λ > e^{λc}(n_1 ⋯ n_q)^λ)
≤ e^{−λc}(n_1 ⋯ n_q)^{−λ} E[(Z_q ⋯ Z_1)^λ]   (Markov's inequality)
= exp{−λ(c + ln(n_1 ⋯ n_q))} ∏_{j=1}^q 2^λ R(n_j/2, λ)   (law of total expectation)
≤ exp{−λ(c + ln(n_1 ⋯ n_q)) + qλ ln 2 + Σ_{j=1}^q λ ln(n_j/2 + λ)}   (inequality (58))
= exp{−λc + λ Σ_{j=1}^q ln(1 + 2λ/n_j)} ≤ exp{−λc + 2λ² Σ_{j=1}^q 1/n_j}.

With the constant ∆_1(q) = Σ_{j=1}^q 1/n_j and λ = c/(4∆_1(q)), we obtain (56).

Case 2. Let n_min = min_{1≤j≤q} n_j. For λ < 0,

P(Z_q ⋯ Z_1 < e^{−c} n_1 ⋯ n_q) ≤ P((Z_q ⋯ Z_1)^λ > e^{−λc}(n_1 ⋯ n_q)^λ) ≤ exp{λ(c − ln(n_1 ⋯ n_q)) + qλ ln 2 + Σ_{j=1}^q ln R(n_j/2, λ)}.

Define f(λ) = λ(c − ln(n_1 ⋯ n_q)) + qλ ln 2 + Σ_{j=1}^q ln R(n_j/2, λ), −n_min/2 < λ ≤ 0; notice that f(0) = 0. Define the digamma function ψ(x) = (d/dx) ln Γ(x) = Γ′(x)/Γ(x). Qi et al. (2006) proved the sharp inequality

ln(x + 1/2) − 1/x < ψ(x) < ln(x + e^{−γ}) − 1/x, x > 0,

where γ is the Euler–Mascheroni constant and e^{−γ} ≈ 0.561459. Thus,

f′(λ) = c + Σ_{j=1}^q (−ln(n_j/2) + ψ(n_j/2 + λ)) ≥ c + Σ_{j=1}^q ln(1 + (λ + 1/2)/(n_j/2)) − Σ_{j=1}^q 1/(n_j/2 + λ).

Since ln(1 + x) is concave, we have ln(1 + x) ≥ 2 ln(2)·x for x ∈ [−1/2, 0]. If −n_min/4 ≤ λ ≤ 0, then

f(λ) = f(0) − ∫_λ^0 f′(x)dx ≤ cλ + ∫_0^λ (Σ_{j=1}^q ln(1 + (x + 1/2)/(n_j/2)) − Σ_{j=1}^q 1/(n_j/2 + x)) dx
= cλ + Σ_{j=1}^q [λ ln(1 + (λ + 1/2)/(n_j/2)) + (n_j/2 + 1/2) ln(1 + λ/(n_j/2 + 1/2)) − λ − ln(1 + λ/(n_j/2))]
≤ cλ + Σ_{j=1}^q (λ − 1) ln(1 + λ/(n_j/2)) ≤ cλ + 4 ln(2)·λ(λ − 1)∆_1(q).

Assume 0 < c ≤ 3 ln 2. Let A = 12 ln 2 and λ* = −c/(A∆_1(q)).
Since n_min ∆_1(q) ≥ 1, we have λ* ≥ −n_min/4. Assume ∆_1(q) ≤ c/(12 ln 2). Thus

f(λ*) ≤ −c²/(A∆_1(q)) + 4 ln 2·(c²/(A∆_1(q)))·(∆_1(q)/c + 1/A) ≤ −c²/(36 ln(2)∆_1(q)).

Thus, we obtain (57).

Lemma E.3. There exists a positive constant C(c_1, c_2), depending only on c_1, c_2, such that if n_N ∆_1 ≤ C(c_1, c_2), then for any fixed 1 < i ≤ N, with probability at least 1 − exp(−Ω(1/∆_1)) we have

σ_max(W_{N:i}(0)) ≤ e^{c_1}(n_{i−1} n_i ⋯ n_{N−1})^{1/2}   (59)

and

σ_min(W_{N:i}(0)) ≥ e^{−c_2}(n_{i−1} n_i ⋯ n_{N−1})^{1/2}.   (60)

Proof of Lemma E.3. Let A = W_{N:i}ᵀ(0). We know that σ_max(A) = ∥A∥ = sup_{v∈S^{n_N−1}} ∥Av∥₂.

Lemma E.5. Set C = n_max/n_min < ∞ and θ = 1/2. Assume Ω(1/∆_1) ≥ k/∆_1, where 0 < k < 1 is a constant and ∆_1 satisfies

∆_1 ≤ min{k/(5 ln 6), k/(5 ln(5 ln(6)e/k))},
∆_1 ln(C) ≤ min{k/(5 ln(5 ln(6)e/k)), k/5},
∆_1 ln(N^{2θ}) ≤ k/5.

Given 1 < i ≤ j < N, with probability at least 1 − 2e^{−k/(5∆_1)} = 1 − e^{−Ω(1/∆_1)}, we have

∥W_{j:i}(0)∥ ≤ M_k √C N^θ (n_i ⋯ n_{j−1} · max{n_{i−1}, n_j})^{1/2},

where M_k is a positive constant that depends only on k.

Proof of Lemma E.5. Without loss of generality, assume n_{i−1} ≤ n_j. Let A = W_{j:i}(0). From Lemma E.2, we know that for fixed v ∈ S^{n_{i−1}−1}, with probability at least 1 − e^{−Ω(1/∆_1)}, we have ∥Av∥₂ ≤ (4/3)(n_i ⋯ n_j)^{1/2}. Take a small constant c = kN^{2θ}/(5 ln(6)∆_1 n_{i−1}) ≥ k/(5 ln(6)C). Let v_1, …, v_{n_{i−1}} be an orthonormal basis for R^{n_{i−1}}. Partition the index set {1, 2, …, n_{i−1}} = S_1 ∪ S_2 ∪ ⋯ ∪ S_{⌈N^{2θ}/c⌉}, where |S_l| ≤ ⌈cn_{i−1}/N^{2θ}⌉ for each 1 ≤ l ≤ ⌈N^{2θ}/c⌉. The following argument is similar to the proof of Lemma E.3, so we omit some details. For each l, taking a 1/2-net N_l of the set V_{S_l} = {v ∈ S^{n_{i−1}−1} : v ∈ span{v_i ; i ∈ S_l}}, we get ∥Au∥₂ ≤ 4(n_i ⋯ n_j)^{1/2} for all u ∈ V_{S_l}, with probability at least 1 − |N_l|e^{−k/∆_1} ≥ 1 − exp{−k/∆_1 + (cn_{i−1}/N^{2θ} + 1) ln 6} ≥ 1 − e^{−3k/(5∆_1)}, since ∆_1 ≤ k/(5 ln 6).
Therefore, any v ∈ R^{n_{i−1}} can be written as a sum v = Σ_l α_l v_l, where α_l ∈ R and v_l ∈ V_{S_l} for each l, and ∥v∥²₂ = Σ_{l≥1} |α_l|². Then, by the Cauchy–Schwarz inequality,

∥Av∥₂ ≤ Σ_l |α_l| ∥Av_l∥₂ ≤ 4(n_i ⋯ n_j)^{1/2} √(⌈N^{2θ}/c⌉) √(Σ_l |α_l|²) ≤ M_k √C N^θ (n_i ⋯ n_j)^{1/2} ∥v∥₂.

Thus, ∥A∥ ≤ M_k √C N^θ (n_i ⋯ n_j)^{1/2}. Notice that when C ≤ e, the condition ∆_1 ≤ k/(5 ln(5 ln(6)e/k)) suffices. The success probability is at least

1 − ⌈N^{2θ}/c⌉·e^{−3k/(5∆_1)} ≥ 1 − exp{ln(5 ln(6)·C/k) + ln(N^{2θ}) − 3k/(5∆_1)} − e^{−3k/(5∆_1)} ≥ 1 − 2e^{−k/(5∆_1)},

since ∆_1 ≤ k/(5 ln(5 ln(6)·C/k)) and ∆_1 ln(N^{2θ}) ≤ k/5.

The Hoeffding bound for a random variable X with mean μ and sub-Gaussian parameter σ is P[|X − μ| ≥ t] ≤ 2 exp(−t²/(2σ²)), ∀t ≥ 0. Applying the Chernoff bound for the Beta distribution B(a, b), we obtain the following lemma.

Lemma F.3. Assume a random variable B is distributed as a Beta distribution B(a, b) with two positive shape parameters a and b. Then

P(|B − a/(a+b)| ≥ y) ≤ 2 exp(−2(a+b)y²), y ≥ 0.

Hence, P(|B − a/(a+b)| ≤ ε·a/(a+b)) ≥ 1 − exp{−Ω(a²/(a+b))}, where Ω(·) depends only on ε. For the upper tail, we can obtain a better bound:

P(B ≥ (1+ε)·a/(a+b)) ≤ exp{−(ε − ln(ε+1))a}.   (66)

Proof of Lemma F.3. We only need to prove the third inequality. Assume B ∼ B(a, b). Set v = a + b, (1+t)a/v ≤ y < 1, t > 0, and r > 0. We estimate the Chernoff bound for B:

P(B ≥ y) ≤ e^{−(ry − ln E e^{rB})} =: e^{−I_r(y)}.

The moment generating function of B satisfies

E e^{rB} = 1 + Σ_{k=1}^∞ [a(a+1)⋯(a+k−1)]/[v(v+1)⋯(v+k−1)]·r^k/k! ≤ 1 + Σ_{k=1}^∞ [a(a+1)⋯(a+k−1)]/v^k·r^k/k!, r > 0.

Recall that the Maclaurin series of (1 − r/v)^{−a} over (−v, v) is given by (1 − r/v)^{−a} = 1 + Σ_{k=1}^∞ [a(a+1)⋯(a+k−1)]/v^k·r^k/k!. Thus, I_r(y) = ry − ln E e^{rB} ≥ ry + a ln(1 − r/v). Set r = v − a/y ∈ (0, v). We obtain

P(B ≥ y) ≤ exp{−(vy − a + a ln(a/(vy)))} =: exp{−vy·g(a/(vy))}, (1+t)a/v ≤ y < 1,

where g(x) = 1 − x + x ln x, x = a/(vy) ∈ (0, 1/(1+t)].
Notice that g(1) = 0 and g′(x) = ln x < 0 over x ∈ (0, 1). We know that g(x) ≥ g(1/(1+t)) = (t − ln(1+t))/(t+1), t > 0. Thus,

P(B ≥ y) ≤ exp{−vy·(t − ln(1+t))/(t+1)} = exp{−(t − ln(1+t))a}, y = (1+t)a/v < 1.

Setting y = (1+ε)a/(a+b), we obtain inequality (66).

Remark 10. It is straightforward to check that

∥W_{j:i}(0)∥ = (n_i n_{i+1} ⋯ n_j)^{1/2}, 1 ≤ i ≤ j ≤ p,
∥W_{j:i}(0)∥ = (n_{i−1} n_i ⋯ n_{j−1})^{1/2}, p+1 ≤ i ≤ j ≤ N,
∥W_{j:i}(0)∥ ≤ (n_i n_{i+1} ⋯ n_{j−1})^{1/2}(n_p)^{1/2} ≤ (n_max/n_min)^{1/2}(n_i n_{i+1} ⋯ n_{j−1}·max{n_{i−1}, n_j})^{1/2}, 1 ≤ i < p < j ≤ N, (i, j) ≠ (1, N).

Remark 11. As a special case, if n_1 = n_2 = ⋯ = n_{N−1} = n, we know that ∥W_{j:i}(0)∥ = (n_{i−1} n_i ⋯ n_{N−1})^{1/2} = n^{(N−i+1)/2}.

Lemma F.4. Assume n_p/min{n_1, n_{N−1}} ≤ C_0 < ∞. Set ε > 0, and let C(ε) represent a constant depending only on ε. If n_1/C_0 ≥ C(ε)n_N, then with probability at least 1 − e^{−Ω(n_1/C_0)},

σ_max(W_{N:i}(0)) ≤ (1+ε)(n_{i−1} n_i ⋯ n_{N−1})^{1/2}, 2 ≤ i ≤ p,
σ_min(W_{N:i}(0)) ≥ (1−ε)(n_{i−1} n_i ⋯ n_{N−1})^{1/2}, 2 ≤ i ≤ p.

Similarly, if n_{N−1}/C_0 ≥ C(ε)rank(X), then with probability at least 1 − e^{−Ω(n_{N−1}/C_0)},

σ_max(W_{j:1}(0)|_{R(X)}) ≤ (1+ε)(n_1 n_2 ⋯ n_j)^{1/2}, p+1 ≤ j ≤ N,
σ_min(W_{j:1}(0)|_{R(X)}) ≥ (1−ε)(n_1 n_2 ⋯ n_j)^{1/2}, p+1 ≤ j ≤ N.

Proof of Lemma F.4. Let D = (n_{N−1} n_{N−2} ⋯ n_p)^{−1/2} W_{N:p+1}ᵀ(0) and A_i = (n_p n_{p−1} ⋯ n_i)^{−1/2} W_{p:i}ᵀ(0). Assume v ∈ S^{n_N−1}. It is easy to see that A_i is a product of random orthogonal projections and D is a random embedding. Let e_1 = (1, 0, 0, …, 0)ᵀ ∈ R^{n_p}. There exists an orthogonal matrix T such that TDv = e_1 and ∥e_1∥₂ = ∥TDv∥₂ = ∥v∥₂ = 1. Since the distribution of a random orthogonal projection is right invariant, we have

P(∥A_iDv∥₂ ≥ y) = E[E[I_{∥A_iTᵀe_1∥₂ ≥ y} | D]] = E[E[I_{∥A_ie_1∥₂ ≥ y} | D]] = P(∥A_ie_1∥₂ ≥ y).

This proves that the distribution of ∥A_iDv∥₂ does not depend on v. Recall that the moments of B ∼ B(a, b) are given by E[B^k] = (a/(a+b))·((a+1)/(a+b+1)) ⋯ ((a+k−1)/(a+b+k−1)).
By the law of total expectation, we have
$$\mathbb{E}[B_i B_{i+1} \cdots B_p] = \frac{n_{i-1}}{n_i} \cdot \frac{n_i}{n_{i+1}} \cdots \frac{n_{p-1}}{n_p} = \frac{n_{i-1}}{n_p},$$
as well as
$$\mathbb{E}\big[(B_i B_{i+1} \cdots B_p)^k\big] = \frac{n_{i-1}/2}{n_p/2} \cdot \frac{n_{i-1}/2 + 1}{n_p/2 + 1} \cdots \frac{n_{i-1}/2 + k - 1}{n_p/2 + k - 1}.$$
Notice that all the integer moments of $B_i B_{i+1} \cdots B_p$ match those of $B(n_{i-1}/2, (n_p - n_{i-1})/2)$. One can verify that the beta distribution satisfies Carleman's condition, which implies that $B_i B_{i+1} \cdots B_p \sim B(n_{i-1}/2, (n_p - n_{i-1})/2)$. Thus, $\|A_i D v\|_2^2 / \|v\|_2^2 \sim B(n_{i-1}/2, (n_p - n_{i-1})/2)$, which proves the claim.

With probability at least $1 - \exp\{-\Omega(n_1/C_0)\}$, we have
$$(1-\varepsilon)^2 \frac{n_{i-1}}{n_p} \le \|A D v\|_2^2 \le (1+\varepsilon)^2 \frac{n_{i-1}}{n_p}, \quad \|v\|_2 = 1.$$
Using the $\phi$-net technique already used to prove Lemma E.3, we know that
$$\sigma_{\min}(AD) \ge (1-\varepsilon) \Big( \frac{n_{i-1}}{n_p} \Big)^{1/2} \quad \text{and} \quad \sigma_{\max}(AD) \le (1+\varepsilon) \Big( \frac{n_{i-1}}{n_p} \Big)^{1/2},$$
with probability at least $1 - \exp\{n_N \ln(3/\phi(\varepsilon))\} \exp\{-\Omega(n_1/C_0)\} \ge 1 - \exp\{-\Omega(n_1/C_0)\}$, since $n_1/C_0 \ge C(\varepsilon) n_N$, for $2 \le i \le p$. Hence, with probability at least $1 - e^{-\Omega(n_1/C_0)}$, we have
$$\sigma_{\min}(W_{N:i}(0)) \ge (1-\varepsilon)(n_{i-1} \cdots n_{N-1})^{1/2} \quad \text{and} \quad \sigma_{\max}(W_{N:i}(0)) \le (1+\varepsilon)(n_{i-1} \cdots n_{N-1})^{1/2}.$$
The other part of the proof is similar to that of Lemma E.4, so we omit it. $\square$

Proof of Theorem B.2. Set $c > 0$, $c_1 = c/6$, $c_2 = c/3$. In Lemma F.4, we can pick $\varepsilon > 0$ such that $1 + \varepsilon \le e^{c_1/2}$ and $1 - \varepsilon \ge e^{-c_2/2}$. Set $M = 2\sqrt{C_0}$, $\theta = 0$, $B_0 = B_\delta$, and $\eta = \frac{(1-\varepsilon)^2 n_N}{e^{2c} \beta N}$. The requirements on the sizes $\{n_1, n_2, \cdots, n_{N-1}, N\}$ in (17) ensure that Remark 10, Lemma F.4, Lemma 2.3, and Lemma D.1 all hold. Notice that even though we need the conclusions of Lemma F.4 to hold simultaneously for $2 \le i \le p$ and $p+1 \le j \le N$, it suffices to apply Lemma F.4 over $i \in I$ and $j \in J$ such that $\{n_i; i \in I\}$ and $\{n_j; j \in J\}$ both have distinct values.
Since $|I| \le N$ and $|J| \le N$, with probability at least $1 - 2N e^{-\Omega(n_{\min}/C_0)} - \delta/2 \ge 1 - \delta$, the one-peak random orthogonal projection and embedding initialization satisfies the initialization assumption (31) and the overparameterization assumption (32). Under the assumption $n_1 = n_2 = \cdots = n_{N-1}$, we can use Remark 11 in place of Lemma F.4; thus, with probability at least $1 - \delta/2 \ge 1 - \delta$, (31) holds. Applying Lemma 2.3 and Lemma D.1 completes the proof. $\square$

Now, we want to verify (31). A simple calculation gives
$$\sigma_{\max}(W_{N:i+1}(0)) = \sigma_{\min}(W_{N:i+1}(0)) = n^{(N-i)/2}, \quad 1 \le i \le N-1,$$
$$\sigma_{\max}(W_{i-1:1}(0)|_{R(X)}) = \sigma_{\min}(W_{i-1:1}(0)|_{R(X)}) = n^{(i-1)/2}, \quad 2 \le i \le N,$$
$$\|W_{j:i}(0)\| = n^{(j-i+1)/2}, \quad 1 < i \le j < N.$$
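The distributional claim at the heart of the proof of Lemma F.4 — that the squared norm of a fixed unit vector pushed through a chain of random orthogonal projections follows a beta distribution — can be checked by simulation. The sketch below is illustrative only: the dimensions $12 \to 8 \to 4$ are our assumptions, not the paper's settings.

```python
import numpy as np

# Monte Carlo check: pushing e_1 in R^12 through random orthogonal
# projections onto 8 and then 4 dimensions should give a squared norm
# distributed as Beta(4/2, (12-4)/2) = Beta(2, 4), per Lemma F.4's proof.
rng = np.random.default_rng(0)

def rand_proj(q, n):
    """q x n matrix with orthonormal rows spanning a uniformly random
    q-dimensional subspace of R^n (QR of a Gaussian matrix)."""
    G = rng.standard_normal((n, q))
    Q, _ = np.linalg.qr(G)     # n x q, orthonormal columns spanning col(G)
    return Q.T

n_samples = 20_000
sq = np.empty(n_samples)
e1 = np.zeros(12)
e1[0] = 1.0
for s in range(n_samples):
    u = rand_proj(8, 12) @ e1            # first projection, R^12 -> R^8
    sq[s] = np.sum((rand_proj(4, 8) @ u) ** 2)   # second, R^8 -> R^4

# Beta(2, 4): mean a/(a+b) = 1/3, variance ab/((a+b)^2 (a+b+1)) = 8/252.
assert abs(sq.mean() - 1.0 / 3.0) < 0.01
assert abs(sq.var() - 8.0 / 252.0) < 0.006
```

The telescoping of the conditional beta factors is exactly why the empirical mean matches $n_{i-1}/n_p = 4/12$ here.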



Let $\{(x_i, y_i)\}_{i=1}^m$ be a training dataset of size $m$, and let $X = [x_1, x_2, \cdots, x_m] \ne 0$ and $Y = [y_1, y_2, \cdots, y_m]$. Denote the weight parameters by $W \in \mathbb{R}^{n_y \times n_x}$. Consider the well-studied convex optimization problem:
$$\min_{W} \ \frac{1}{2} \|W X - Y\|_F^2. \tag{1}$$



because $\sigma_{\min}(A+B) \ge \sigma_{\min}(A) - \sigma_{\max}(B) = \sigma_{\min}(A) - \|B\|$ and $\sigma_{\max}(A+B) \le \sigma_{\max}(A) + \sigma_{\max}(B) = \|A\| + \|B\|$ (see, e.g., Theorem 1.3 in Chafaï et al. (2009)).
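These perturbation bounds are easy to exercise numerically; the sketch below (matrix sizes are arbitrary assumptions) checks both inequalities over random pairs:

```python
import numpy as np

# Sanity check of the singular-value perturbation bounds:
#   sigma_min(A + B) >= sigma_min(A) - ||B||
#   sigma_max(A + B) <= sigma_max(A) + ||B||
rng = np.random.default_rng(2)

def weyl_bounds_hold(A, B, tol=1e-10):
    """Check both perturbation bounds for a single pair (A, B)."""
    s_sum = np.linalg.svd(A + B, compute_uv=False)   # sorted descending
    s_A = np.linalg.svd(A, compute_uv=False)
    nB = np.linalg.norm(B, 2)                        # spectral norm ||B||
    return (s_sum[-1] >= s_A[-1] - nB - tol) and (s_sum[0] <= s_A[0] + nB + tol)

assert all(weyl_bounds_hold(rng.standard_normal((6, 4)),
                            rng.standard_normal((6, 4))) for _ in range(200))
```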

Gaussian initialization: Denote by $N(0, 1)$ the standard Gaussian distribution, and by $\chi_k^2$ the chi-square distribution with $k$ degrees of freedom. Let $S^{d-1} = \{x \in \mathbb{R}^d; \|x\|_2 = 1\}$ be the unit sphere in $\mathbb{R}^d$. The scaling factor $a_N = \frac{1}{\sqrt{n_1 n_2 \cdots n_N}}$ ensures that the network at initialization preserves the norm of every input in expectation.

Lemma E.1. For any $x \in \mathbb{R}^{n_0}$, the Gaussian initialization satisfies $\mathbb{E}\|a_N W_{N:1}(0) x\|_2^2 = \|x\|_2^2$.
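This norm-preservation property can be checked by Monte Carlo. The sketch below (the widths $(8, 8, 4)$ and the sample count are illustrative assumptions) averages $\|a_N W_{N:1}(0) x\|_2^2$ over fresh Gaussian initializations:

```python
import numpy as np

# Monte Carlo check of Lemma E.1: with i.i.d. N(0,1) weights and the
# scaling a_N = 1/sqrt(n_1 ... n_N), the squared norm of any input is
# preserved in expectation: E ||a_N W_{N:1}(0) x||^2 = ||x||^2.
rng = np.random.default_rng(3)

def avg_sq_norm(x, widths, n_samples=10_000):
    """Average ||a_N W_N ... W_1 x||^2 over fresh Gaussian initializations."""
    a_N = 1.0 / np.sqrt(np.prod(widths, dtype=float))
    total = 0.0
    for _ in range(n_samples):
        h = x
        n_prev = x.shape[0]
        for width in widths:               # widths = (n_1, ..., n_N)
            h = rng.standard_normal((width, n_prev)) @ h
            n_prev = width
        total += a_N**2 * np.sum(h**2)
    return total / n_samples

x = np.array([1.0, -2.0, 0.5])
est = avg_sq_norm(x, widths=(8, 8, 4))
assert abs(est - np.sum(x**2)) / np.sum(x**2) < 0.1
```

Note that only the expectation is preserved; the variance of the squared norm grows with depth, which is why the tolerance is loose.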

When $C \le e$, $\Delta_1 \le \frac{k}{5 \ln(5\ln(6) \cdot C/k)}$; when $C > e$, we have $\Delta_1 \ln(C) \le \min\Big\{ \frac{k}{5 \ln(5\ln(6) e/k)}, \ \frac{k}{5} \Big\} \le \frac{k \ln(C)}{5 \ln(5\ln(6) \cdot C/k)}$.

If $v \ne 0$, then $\|A_i D v\|_2^2 / \|v\|_2^2$ follows the beta distribution $B(n_{i-1}/2, (n_p - n_{i-1})/2)$. Define $B_p = \|A_p e_1\|_2^2$ and $B_i = \|A_i e_1\|_2^2 / \|A_{i+1} e_1\|_2^2$ for $i = p-1, p-2, \cdots, 1$. Then $B_p \sim B(n_{p-1}/2, (n_p - n_{p-1})/2)$, $B_{p-1} \mid B_p \sim B(n_{p-2}/2, (n_{p-1} - n_{p-2})/2)$, $\cdots$, $B_i \mid (B_p, \cdots, B_{i+1}) \sim B(n_{i-1}/2, (n_i - n_{i-1})/2)$. If $n_{i+1} = n_i$, we know that $B_i \mid (B_p, \cdots, B_{i+1}) = 1$ a.s. If $B \sim B(a, b)$, then the moments are given by
$$\mathbb{E} B^k = \frac{a}{a+b} \cdot \frac{a+1}{a+b+1} \cdots \frac{a+k-1}{a+b+k-1}, \quad k \ge 1.$$

Proof of Theorem B.3. Let $W_N(0) = \sqrt{n}\, U_N [I_{n_y}, 0] V_N^T$, $W_i(0) = \sqrt{n}\, U_i I_n V_i^T$ for $2 \le i \le N-1$, and $W_1(0) = \sqrt{n}\, U_1 [I_{n_x}, 0]^T V_1^T$.

For any $x$, the quantity $\|a_N W_{N:1}(0) x\|_2^2$ reduces, after the orthogonal factors cancel, to $\|[I_{n_y}, 0] V_N^T x'\|_2^2$, where $x' = U_N [I_{n_x}, 0]^T V_1^T x$ and $\|x\|_2 = \|x'\|_2$. Since the distribution of $U_N [I_{n_y}, 0] V_N^T$ is right invariant under multiplication by orthogonal matrices, the required singular value bounds hold with probability at least $1 - \delta/2$. Applying Lemma D.1 with $c > 0$, $c_1 = c/6$, $c_2 = c/3$, $\theta = 0$ completes the proof. $\square$

Proof of Theorem 3.1. Theorem 3.1 is a special case of Theorem B.1 and Theorem B.2; hence, we omit the proof. $\square$

The following procedure produces the plots of the logarithm of the loss, as a function of the number of iterations, for GD on the deep linear networks and on the related convex optimization problem under different initialization schemes:
a) We choose $X \in \mathbb{R}^{128 \times 1000}$ and $W^* \in \mathbb{R}^{10 \times 128}$ and set $Y = W^* X + \epsilon$, where the entries of $X$, $W^*$, and $\epsilon$ are drawn i.i.d. from $N(0, 1)$.
b) We consider the loss function $\frac{1}{2} \|a_N W_{N:1} X - Y\|_F^2$.
c) For the given linear networks, we apply the Gaussian initialization and the one-peak random orthogonal projection and embedding initialization, denoted $W_j(0)$, $1 \le j \le N$.
d) For the convex optimization problem (1), we set the initialization to be $W(0) = a_N W_N(0) \cdots W_1(0)$.
e) We set the learning rates as $\eta = \frac{n_N}{N \cdot \|X\|^2}$ and $\eta^* = \frac{N}{n_N} \eta$ for the deep linear neural networks and the convex problem, respectively.
f) We plot the loss over 25 iterations.
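The procedure above can be sketched in a few lines. This is a minimal illustration at reduced size: the dimensions below, and the explicit layer-wise gradient formulas, are our assumptions rather than the paper's code (the paper's plots use $n_x = 128$, $m = 1000$, $n_y = 10$).

```python
import numpy as np

# Sketch of the plotting experiment at reduced size (dimensions are
# assumptions for speed).
rng = np.random.default_rng(4)

n_x, n_y, m, N, n = 16, 4, 200, 3, 64     # depth N, hidden width n
X = rng.standard_normal((n_x, m))
W_star = rng.standard_normal((n_y, n_x))
Y = W_star @ X + rng.standard_normal((n_y, m))

widths = [n] * (N - 1) + [n_y]            # n_1 = ... = n_{N-1} = n, n_N = n_y
a_N = 1.0 / np.sqrt(np.prod(widths, dtype=float))
Ws, n_prev = [], n_x
for width in widths:                      # Gaussian initialization, step c)
    Ws.append(rng.standard_normal((width, n_prev)))
    n_prev = width

eta = n_y / (N * np.linalg.norm(X, 2)**2) # eta = n_N / (N ||X||^2), step e)

def loss(Ws):
    P = np.eye(n_x)
    for W in Ws:
        P = W @ P
    return 0.5 * np.linalg.norm(a_N * P @ X - Y, "fro")**2

losses = [loss(Ws)]
for _ in range(25):                       # 25 iterations, step f)
    P, prefixes = np.eye(n_x), [np.eye(n_x)]
    for W in Ws:                          # prefixes[i] = W_i ... W_1
        P = W @ P
        prefixes.append(P)
    E = (a_N * P @ X - Y) @ X.T           # residual times X^T
    suffix, grads = np.eye(n_y), [None] * N
    for i in reversed(range(N)):          # suffix = W_N ... W_{i+2}
        grads[i] = a_N * suffix.T @ E @ prefixes[i].T
        suffix = suffix @ Ws[i]
    for i in range(N):
        Ws[i] -= eta * grads[i]
    losses.append(loss(Ws))

assert losses[-1] < losses[0]             # the loss decreases
```

Plotting `np.log(losses)` against the iteration index gives curves of the same qualitative shape as those in Figure 1.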

Figure 1: Loss as a function of the number of iterations, with $n_1 = n_2 = n_3 = 128$ (first), $200$ (second), and $2000$ (third), for the Gaussian initialization.


and $\sigma_{\min}(A) = \inf_{\|v\|_2 = 1} \|Av\|_2$. Applying Lemma E.2, we know that with probability at least $1 - \exp\{-\Omega(\frac{1}{\Delta_1})\}$, $\|Av\|_2 / \|v\|_2 \in [e^{-c_2/2} P, e^{c_1/2} P]$, where $P = (n_{i-1} \cdots n_{N-1})^{1/2}$. Set $\phi = \min\{1 - e^{-c_1/2}, (e^{-c_2/2} - e^{-c_2}) / (e^{-c_2/2} + e^{c_1})\}$. Take a $\phi$-net $\mathcal{N}_\phi$ for $S^{n_N - 1}$ with size $|\mathcal{N}_\phi| \le (3/\phi)^{n_N}$; notice that with this size we can actually cover the unit ball, not only the unit sphere. Thus, with probability at least $1 - (3/\phi)^{n_N} \exp\{-\Omega(\frac{1}{\Delta_1})\}$, for all $u \in \mathcal{N}_\phi$ simultaneously we have $\|Au\|_2 / \|u\|_2 \in [e^{-c_2/2} P, e^{c_1/2} P]$. Fix $v \in S^{n_N - 1}$; there exists $u \in \mathcal{N}_\phi$ such that $\|u - v\|_2 \le \phi$. WLOG, we assume $1 - \phi \le \|u\|_2 \le 1$. We obtain
$$\|Av\|_2 \le \|Au\|_2 + \|A(u - v)\|_2 \le e^{c_1/2} P + \phi \|A\|.$$
Taking the supremum over $\|v\|_2 = 1$, we obtain
$$\sigma_{\max}(A) = \|A\| \le \frac{e^{c_1/2}}{1 - \phi} P \le e^{c_1} P.$$
For the lower bound, we have
$$\|Av\|_2 \ge \|Au\|_2 - \|A(u - v)\|_2 \ge e^{-c_2/2}(1 - \phi) P - \phi e^{c_1} P \ge e^{-c_2} P.$$
Taking the infimum over $\|v\|_2 = 1$, we get $\sigma_{\min}(A) \ge e^{-c_2} P$. The conclusions hold with probability at least $1 - (3/\phi)^{n_N} \exp\{-\Omega(\frac{1}{\Delta_1})\}$.

Lemma E.4. There exists a positive constant $C(c_1, c_2)$, depending only on $c_1$ and $c_2$, such that if $\mathrm{rank}(X) \Delta_1 \le C(c_1, c_2)$, then for any fixed $1 \le j < N$, with probability at least $1 - \exp\{-\Omega(\frac{1}{\Delta_1})\}$, we have
$$e^{-c_2} (n_1 \cdots n_j)^{1/2} \le \sigma_{\min}(W_{j:1}(0)|_{R(X)}) \le \sigma_{\max}(W_{j:1}(0)|_{R(X)}) \le e^{c_1} (n_1 \cdots n_j)^{1/2}.$$

Proof of Lemma E.4. The proof is similar to that of the previous lemma. The only difference is that we now take the $\phi$-net to cover the unit sphere in $R(X) \cap \mathbb{R}^{n_0}$, with $\dim(R(X) \cap \mathbb{R}^{n_0}) = \mathrm{rank}(X)$, where $R(X)$ represents the column space of $X$. $\square$

Proof of Lemma 2.3. Set $r = \mathrm{rank}(X)$, and let $u_1, \cdots, u_r$ be an orthonormal basis of the column space of $X$. By the assumption on the initialization, the Markov inequality yields a bound on the initial loss value that holds with probability at least $1 - \delta/2$. $\square$

Setting $\eta =: \frac{(1-\varepsilon)^2 n_N}{e^{2c} \beta N}$, with probability at least $1 - \delta$ the random initialization satisfies the initialization assumption (31) and the overparameterization assumption (32). By applying Lemma D.1, we complete the proof. $\square$

F ORTHOGONAL INITIALIZATIONS FALL INTO THE CONVERGENCE REGION

The following are some basic facts about random projections and embeddings; most of them can be found in Eaton (1989).

Proposition F.1.
1. $A$ is a random embedding if and only if $A^T$ is a random projection.
2. If $A$ is a square matrix, then random projection, random embedding, and random orthogonal matrix are equivalent.
3. The uniform distribution on the orthogonal group is a left- and right-invariant probability measure: that is, if $A$ is a random orthogonal matrix, then $A$, $UA$, and $AU$ are all random orthogonal matrices, where $U$ is a non-random orthogonal matrix.
4. Assume $X$ is an $n \times q$ ($q \le n$) random matrix whose entries are i.i.d. $N(0, 1)$ random variables. Then $A := X (X^T X)^{-1/2}$ is a random embedding, since $A^T A = I_q$ and the distribution of $A$ is left invariant, which means that $A$ and $UA$ have the same distribution for any non-random orthogonal matrix $U$.
5. If $A$ is uniformly distributed over the orthogonal group of order $n$ and is partitioned as $A = [A_1, A_2]$, where $A_1$ is $n \times q$ and $A_2$ is $n \times (n-q)$, then $A_1^T$ and $A_2^T$ are both random projections.
6. Each column of a matrix uniformly distributed over the orthogonal group of order $n$ is uniformly distributed on the unit sphere $S^{n-1}$.

Remark 9. There are several ways to construct a random matrix $A = (a_{ij})_{q \times n}$, $q \le n$, uniformly distributed over rectangular matrices with $A A^T = c^2 I_q$, $c > 0$. Let $O_n$ be uniformly distributed over the real orthogonal group of order $n$, partitioned so that $A_1$ is the $q \times n$ block of its first $q$ rows. Assume $X = (x_{ij})_{q \times n}$, where the $x_{ij}$ are independent standard normal random variables. Then $A$, $cA_1$, and $c(X X^T)^{-1/2} X$ all have the same distribution.

Lemma F.2. For any $x \in \mathbb{R}^{n_0}$, the one-peak random projection and embedding initialization satisfies $\mathbb{E}\|a_N W_{N:1}(0) x\|_2^2 = \|x\|_2^2$. In the proof, $D$ is an embedding matrix, so $\|Dx\|_2 = \|x\|_2$; the factors $A_i$ with $i \ge p+1$ are random projections, and $A_p = I$. Thus, by the law of total expectation, the claim follows. This completes the proof.

Next, we introduce sub-Gaussian random variables, which quantify how a random variable deviates from its expected value.

Definition F.1.
A random variable $X$ with finite mean $\mu = \mathbb{E}X$ is sub-Gaussian if there is a positive number $\sigma$ such that
$$\mathbb{E}\, e^{r(X - \mu)} \le e^{\sigma^2 r^2 / 2}, \quad \forall r \in \mathbb{R}.$$
Such a constant $\sigma^2$ is called a proxy variance, and we say that $X$ is $\sigma^2$-sub-Gaussian, written $X \sim \mathrm{subG}(\sigma^2)$. For the beta distribution, Elder (2016) showed that $B(a, b)$ is $\frac{1}{4(a+b+1)}$-sub-Gaussian.

Proof of Theorem 3.2. In Theorems B.1, B.2, and B.3, we proved that for given constants $c_1, c_2 > 0$ and $0 < \varepsilon, \delta/2 < 1/2$, as well as a learning rate $\eta$, there exists a constant $C = C(c_1, c_2)$ such that all three kinds of random initializations fall into the convergence region with probability at least $1 - \delta$. Applying Lemma 2.3 completes the proof. $\square$
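The construction in Proposition F.1(4) and Remark 9 can be reproduced directly. The sketch below builds $A = X(X^TX)^{-1/2}$ from a Gaussian matrix and verifies the embedding property $A^TA = I_q$ (matrix sizes are illustrative assumptions):

```python
import numpy as np

# Proposition F.1(4)/Remark 9: for an n x q matrix X with i.i.d. N(0,1)
# entries, A = X (X^T X)^{-1/2} is a random embedding: A^T A = I_q, and
# the distribution of A is left invariant.
rng = np.random.default_rng(5)

def random_embedding(n, q):
    """Return an n x q random embedding built from a Gaussian matrix."""
    X = rng.standard_normal((n, q))
    # (X^T X)^{-1/2} via the symmetric eigendecomposition of X^T X.
    w, V = np.linalg.eigh(X.T @ X)
    return X @ (V @ np.diag(w ** -0.5) @ V.T)

A = random_embedding(8, 3)
assert np.allclose(A.T @ A, np.eye(3), atol=1e-8)
# An embedding preserves norms: ||A v||_2 = ||v||_2.
v = rng.standard_normal(3)
assert np.isclose(np.linalg.norm(A @ v), np.linalg.norm(v))
```

The eigendecomposition route avoids any dependence on QR sign conventions: $X^TX$ is symmetric positive definite almost surely, so its inverse square root is well defined.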

G TABLES

In this section, we provide some empirical evidence to support the argument expressed in Section 4: why do bad saddles not affect GD for overparameterized deep linear neural networks? Consider the following procedure for the tables:
a) We choose $X \in \mathbb{R}^{128 \times 1000}$ and $W^* \in \mathbb{R}^{10 \times 128}$ and set $Y = W^* X + \epsilon$, where the entries of $X$ and $\epsilon$ are drawn i.i.d. from $N(0, 1)$.

b) We consider the loss function $\frac{1}{2} \|a_N W_{N:1} X - Y\|_F^2$.

c) For the given deep linear networks, we apply the orthogonal initializations, denoted $W_j(0)$, $1 \le j \le N$.
d) We set the learning rate as $\eta = \frac{n_N}{N \cdot \|X\|^2}$ for the deep linear neural networks.
e) We prepare the tables below for iterations $t = 1, \cdots, 10$.

Let $n_1 = n_2 = n_3 = 2000$, $N = 4$, and assume the entries of $W^*$ are drawn i.i.d. from $N(0, 25)$. We obtain the following table:

         i = 1     i = 2     i = 3     i = 4
t = 1    0.05161   0.00826   0.00826   0.18464
t = 2    0.08779   0.01389   0.01389   0.31396
t = 3    0.11335   0.01781   0.01779   0.40435
t = 4    0.12109   0.01894   0.01889   0.42920
t = 5    0.12527   0.01956   0.01948   0.44282
t = 6    0.12611   0.01967   0.01958   0.44476
t = 7    0.12755   0.01988   0.01978   0.44955
t = 8    0.12745   0.01986   0.01975   0.44876
t = 9    0.12819   0.01997   0.01987   0.45136
t = 10   0.12793   0.01992   0.01982   0.45018

Let $n_1 = n_2 = 10000$, $N = 2$, and assume the entries of $W^*$ are drawn i.i.d. from $N(0, 4)$. We obtain the following table:

         i = 1     i = 2     i = 3
t = 1    0.02708   0.00153   0.04844
t = 2    0.04319   0.00244   0.07727
t = 3    0.05296   0.00299   0.09474
t = 4    0.05888   0.00333   0.10533
t = 5    0.06248   0.00353   0.11176
t = 6    0.06468   0.00365   0.11569
t = 7    0.06603   0.00373   0.11811
t = 8    0.06688   0.00377   0.11962
t = 9    0.06741   0.00380   0.12057
t = 10   0.06775   0.00382   0.12117

Let $n_1 = n_2 = 4000$, $N = 2$, and assume the entries of $W^*$ are drawn i.i.d. from $N(0, 1)$. We obtain the following table:

         i = 1     i = 2     i = 3
t = 1    0.01622   0.00290   0.05802
t = 2    0.02684   0.00480   0.09601
t = 3    0.03411   0.00609   0.12202
t = 4    0.03919   0.00700   0.14018
t = 5    0.04280   0.00764   0.15306
t = 6    0.04539   0.00810   0.16232
t = 7    0.04729   0.00844   0.16908
t = 8    0.04869   0.00868   0.17408
t = 9    0.04974   0.00887   0.17782
t = 10   0.05054   0.00901   0.18066

Let $n_1 = n_2 = 8000$, $N = 2$, and assume the entries of $W^*$ are drawn i.i.d. from $N(0, 1)$. We obtain the following table:

         i = 1     i = 2     i = 3
t = 1    0.01173   0.00148   0.04195
t = 2    0.01944   0.00246   0.06955
t = 3    0.02470   0.00312   0.08838
t = 4    0.02838   0.00358   0.10151
t = 5    0.03098   0.00391   0.11083
t = 6    0.03287   0.00415   0.11758
t = 7    0.03426   0.00432   0.12253
t = 8    0.03530   0.00445   0.12624
t = 9    0.03608   0.00455   0.12904
t = 10   0.03668   0.00463   0.13118

Let $n_1 = n_2 = 12000$, $N = 2$, and assume the entries of $W^*$ are drawn i.i.d. from $N(0, 1)$. We obtain the following table:

         i = 1     i = 2     i = 3
t = 1    0.00965   0.00099   0.03453
t = 2    0.01597   0.00164   0.05712
t = 3    0.02025   0.00208   0.07244
t = 4    0.02323   0.00239   0.08310
t = 5    0.02535   0.00261   0.09069
t = 6    0.02690   0.00277   0.09621
t = 7    0.02804   0.00289   0.10029
t = 8    0.02890   0.00297   0.10336
t = 9    0.02955   0.00304   0.10570
t = 10   0.03006   0.00309   0.10750

Let $n_1 = n_2 = 20000$, $N = 2$, and assume the entries of $W^*$ are drawn i.i.d. from $N(0, 1)$. We obtain the following table:

         i = 1     i = 2     i = 3
t = 1    0.00713   0.00057   0.02551
t = 2    0.01181   0.00095   0.04225
t = 3    0.01499   0.00121   0.05362
t = 4    0.01720   0.00138   0.06154
t = 5    0.01878   0.00151   0.06720
t = 6    0.01994   0.00161   0.07132
t = 7    0.02079   0.00168   0.07438
t = 8    0.02144   0.00173   0.07668
t = 9    0.02193   0.00177   0.07844
t = 10   0.02231   0.00179   0.07981
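A reduced-size sketch of this experiment for $N = 2$ follows. The exact quantity tabulated above is not recoverable from this version of the text; as an assumption, the sketch tracks the relative deviation $\|W_i(t) - W_i(0)\|_F / \|W_i(0)\|_F$, which stays bounded well below 1 and shrinks as the width grows, matching the trend across the tables. The dimensions are our own (the paper uses $n_x = 128$, $m = 1000$, widths up to 20000).

```python
import numpy as np

rng = np.random.default_rng(6)

def haar(n):
    """Haar-distributed random orthogonal matrix (sign-corrected QR)."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

n_x, n_y, m, n = 16, 4, 200, 256          # reduced sizes (assumptions)
X = rng.standard_normal((n_x, m))
Y = rng.standard_normal((n_y, n_x)) @ X + rng.standard_normal((n_y, m))

# Orthogonal initialization for N = 2, as in the proof of Theorem B.3.
W1 = np.sqrt(n) * haar(n)[:, :n_x]        # n x n_x, scaled orthonormal columns
W2 = np.sqrt(n) * haar(n)[:n_y, :]        # n_y x n, scaled orthonormal rows
W1_0, W2_0 = W1.copy(), W2.copy()

a_N = 1.0 / np.sqrt(n * n_y)              # a_N = 1/sqrt(n_1 n_2)
eta = n_y / (2 * np.linalg.norm(X, 2)**2) # eta = n_N / (N ||X||^2), N = 2

def current_loss():
    return 0.5 * np.linalg.norm(a_N * W2 @ W1 @ X - Y, "fro")**2

loss0 = current_loss()
for t in range(10):                       # t = 1, ..., 10 as in the tables
    R = a_N * W2 @ W1 @ X - Y
    g1 = a_N * W2.T @ R @ X.T             # dL/dW_1
    g2 = a_N * R @ (W1 @ X).T             # dL/dW_2
    W1 -= eta * g1
    W2 -= eta * g2
loss_final = current_loss()

rel1 = np.linalg.norm(W1 - W1_0) / np.linalg.norm(W1_0)
rel2 = np.linalg.norm(W2 - W2_0) / np.linalg.norm(W2_0)
assert loss_final < loss0
assert rel1 < 1.0 and rel2 < 1.0
```

Repeating this with larger `n` makes `rel1` and `rel2` smaller, consistent with the narrative that overparameterized GD stays in a benign neighborhood of its initialization and never approaches the bad saddles.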

H FIGURES

In this section, we provide some empirical evidence to support the results in Section 4: Numerical Experiments. We will show how the trajectories of GD for the non-convex deep linear neural networks are related to those of a convex optimization problem under different initialization schemes.

