ON THE EXPLICIT ROLE OF INITIALIZATION ON THE CONVERGENCE AND GENERALIZATION PROPERTIES OF OVERPARAMETRIZED LINEAR NETWORKS

Anonymous

Abstract

Neural networks trained via gradient descent with random initialization and without any regularization enjoy good generalization performance in practice despite being highly overparametrized. A promising direction to explain this phenomenon is the Neural Tangent Kernel (NTK), which characterizes the implicit regularization effect of gradient flow/descent on infinitely wide neural networks with random initialization. However, a non-asymptotic analysis that connects generalization performance, initialization, and optimization for finite width networks remains elusive. In this paper, we present a novel analysis of overparametrized single-hidden-layer linear networks, which formally connects initialization, optimization, and overparametrization with generalization performance. We exploit the fact that gradient flow preserves a certain matrix that characterizes the imbalance of the network weights, to show that the squared loss converges exponentially at a rate that depends on the level of imbalance of the initialization. Such guarantees on the convergence rate allow us to show that large hidden layer width, together with (properly scaled) random initialization, implicitly constrains the dynamics of the network parameters to be close to a low-dimensional manifold. In turn, minimizing the loss over this manifold leads to solutions with good generalization, which correspond to the min-norm solution in the linear case. Finally, we derive a novel O(h^{-1/2}) upper bound on the operator norm distance between the trained network and the min-norm solution, where h is the hidden layer width.

1. INTRODUCTION

Neural networks have shown excellent empirical performance in many application domains such as vision (Krizhevsky et al., 2012; Rawat & Wang, 2017), speech (Hinton et al., 2012; Graves et al., 2013) and video games (Silver et al., 2016; Vinyals et al., 2017). Among the many unexplained mysteries behind this success is the fact that gradient descent with random initialization and without explicit regularization enjoys good generalization performance despite being highly overparametrized. A promising attempt to explain this phenomenon is the Neural Tangent Kernel (NTK) (Jacot et al., 2018), which characterizes the implicit regularization effect of gradient flow/descent on infinitely wide neural networks with random initialization. Precisely, under this infinite width assumption, a proper initialization, together with gradient flow training, can be understood as a kernel gradient flow (NTK flow) of a functional constrained to a manifold that guarantees good generalization performance. The analysis further admits extensions to finite width networks (Arora et al., 2019b; Buchanan et al., 2020). The core of the argument, illustrated in Figure 1, amounts to showing that (properly scaled) random initialization of networks with sufficiently large width leads to trajectories that are, in some sense, initialized close to the aforementioned manifold. Thus, approximately good initialization, together with acceleration, ensures that the training stays close to the NTK flow.

Figure 1: Comparing our analysis to asymptotic/non-asymptotic NTK analysis.

Such analysis, however, leads to bounds on the network width that significantly exceed practical values (Buchanan et al., 2020), and seems to suggest the need for acceleration to achieve generalization. This motivates the following questions:
• Is the kernel regime, which requires impractical bounds on the network width, necessary to achieve good generalization?
• Does generalization depend explicitly on acceleration?
Or is acceleration required only due to choosing an initialization outside the good generalization manifold?

For the simplified, yet certainly non-trivial, setting of single-hidden-layer linear networks, this paper answers these questions.

Contributions. We present a novel analysis of the gradient flow dynamics of overparametrized single-hidden-layer linear networks, which provides disentangled conditions on initialization that lead to acceleration and generalization. Specifically, we show that imbalanced initialization ensures acceleration, while orthogonal initialization ensures that trajectories remain close to the generalization manifold. Interestingly, properly scaled random initialization of moderately wide networks is sufficient to ensure that the initialization is approximately imbalanced and orthogonal, yet it is necessary for neither. More specifically, as illustrated in Figure 1, this paper makes the following contributions:
1. We first show that gradient flow on the squared ℓ2 loss preserves a certain matrix-valued quantity, akin to constants of motion in mechanics or conservation laws in physics, that measures the imbalance of the network weights. Notably, some level of imbalance, measured by a certain singular value of the imbalance matrix and fixed at initialization, is sufficient to guarantee an exponential rate of convergence of the loss. Our analysis is non-probabilistic and valid under very mild assumptions, satisfied by moderately wide single-hidden-layer linear networks.
2. We characterize the existence of a low-dimensional manifold, defined by a specific orthogonality condition on the parameters, which is invariant under the gradient flow. All trajectories within this manifold lead to a unique (w.r.t. the end-to-end function) minimizer with good generalization performance, which corresponds to the min-norm solution. As a result, initializing the network within this manifold guarantees good generalization.
3.
We further show that by randomly initializing the network weights as N(0, 1/h) (where h is the hidden layer width), an initialization setting related to the kernel regime (see Appendix E), one can approximately satisfy both our sufficient imbalance and orthogonality conditions with high probability. Notably, the inexact initialization relative to the good generalization manifold requires acceleration to control the generalization error. In the context of NTK, our result further provides, for linear networks, a novel O(h^{-1/2}) upper bound on the operator norm distance between the trained network and the min-norm solution. To the best of our knowledge, this is the first non-asymptotic bound on the generalization properties of wide linear networks under random initialization in the global sense.

Notation. For a matrix A, we let A^T denote its transpose, tr(A) its trace, and λ_i(A) and σ_i(A) its i-th eigenvalue and i-th singular value, respectively, in decreasing order (when adequate). We let [A]_{ij}, [A]_{i,:}, and [A]_{:,j} denote the (i,j)-th element, the i-th row and the j-th column of A, respectively. We also let ‖A‖_2 and ‖A‖_F denote the spectral norm and the Frobenius norm of A, respectively. For a scalar- or matrix-valued function of time F(t), we let Ḟ = Ḟ(t) = (d/dt)F(t) denote its time derivative. Additionally, we let I_n denote the identity matrix of order n and N(µ, σ²) the normal distribution with mean µ and variance σ².

2. RELATED WORK

Wide neural networks. There is a rich line of research that studies the convergence (Du et al., 2019b; a; Du & Hu, 2019; Allen-Zhu et al., 2019b) and generalization (Allen-Zhu et al., 2019a; Arora et al., 2019a; b; Li & Liang, 2018; Cao & Gu, 2019; Buchanan et al., 2020) of wide neural networks with random initialization. The behavior of such networks in the infinite width limit can be characterized by the Neural Tangent Kernel (NTK) (Jacot et al., 2018). With the concept of NTK, heuristically, training wide neural networks can be approximately viewed as kernel regression under gradient flow/descent (Arora et al., 2019b); hence their convergence and generalization can be understood by studying non-asymptotic results on the equivalence of finite width networks to their infinite width limit (Du et al., 2019b; a; Du & Hu, 2019; Allen-Zhu et al., 2019b; Arora et al., 2019a; b; Buchanan et al., 2020). More generally, such non-asymptotic results are related to "lazy training" (Chizat et al., 2019; Du et al., 2019a; Allen-Zhu et al., 2019b), where the network weights do not deviate much from their initialization during training. Our results for wide linear networks, presented in Section 4.2, do not follow the NTK analysis, but provide an alternative, presumably more general, view of the effect of random initialization when the hidden layer is sufficiently wide.

Convergence of linear networks. Convergence in overparametrized linear networks has been studied for both gradient flow (Saxe et al., 2014) and gradient descent (Bartlett et al., 2018; Arora et al., 2018a; b). In the kernel regime, Du & Hu (2019) applied the analysis of convergence of wide neural networks (Du et al., 2019b) to deep linear networks. Outside the large width and random initialization regime, Saxe et al. (2014) analyzed the trajectory of network parameters under spectral initialization, while Bartlett et al. (2018) studied the case of identity initialization.
Although the fact that the imbalance is conserved under gradient flow has been exploited in Arora et al. (2018b; a), they consider only the case of balanced initialization, in order to simplify the learning dynamics, and additional conditions are required for convergence. Most works mentioned above consider specific, often data-dependent, types of initialization that make the learning dynamics tractable. Our result, based on an imbalance measure, is data-agnostic and is satisfied under a wide range of random initialization schemes; see Lemmas 1 and F.2.

Min-norm solution in high-dimensional linear regression. For high-dimensional under-determined linear regression, the asymptotic generalization error of the min-norm solution has been studied in Hastie et al. (2019). In Bartlett et al. (2020); Mei & Montanari (2019), the min-norm solution was proved to have near-optimal generalization performance under mild assumptions on the data model. For our purposes, we study the generalization of a trained linear network via its distance to the min-norm solution. In this way, we refer to solutions with good generalization performance as those with small distance to the min-norm solution.

3. CONVERGENCE RATE OF GRADIENT FLOW FOR SINGLE-HIDDEN-LAYER LINEAR NETWORKS

In this section we study the convergence of gradient flow on the squared ℓ2 loss for single-hidden-layer linear networks. Given training data of n samples {x^{(i)}, y^{(i)}}_{i=1}^n with x^{(i)} ∈ R^D, y^{(i)} ∈ R^m, we aim to solve the linear regression problem

min_{Θ ∈ R^{D×m}} L = (1/2) Σ_{i=1}^n ‖y^{(i)} − Θ^T x^{(i)}‖²,    (1)

by training a single-hidden-layer linear network y = f(x; V, U) = V U^T x, with V ∈ R^{m×h}, U ∈ R^{D×h}, via gradient flow, or equivalently, gradient descent with "infinitesimal step size"; here h is the hidden layer width. We consider an overparametrized model such that h ≥ min{m, D}. We rewrite the loss function with respect to our parameters V, U as

L(V, U) = (1/2) Σ_{i=1}^n ‖y^{(i)} − V U^T x^{(i)}‖² = (1/2) ‖Y − X U V^T‖_F²,    (2)

where Y = [y^{(1)}, …, y^{(n)}]^T and X = [x^{(1)}, …, x^{(n)}]^T are row-wise concatenations of the training data. Assuming that the input data X has full rank, we consider the under-determined case D > n for our regression problem, i.e., there are infinitely many solutions Θ* achieving zero loss in (1). With a minor reformulation, our convergence result also covers the case where the input data X is rank deficient; we refer the reader to Appendix B. We will show that, under certain conditions, the trajectory of the loss function L(t) = L(V(t), U(t)) under the gradient flow of (2), i.e.,

V̇(t) = −(∂/∂V) L(V(t), U(t)),    U̇(t) = −(∂/∂U) L(V(t), U(t)),    (3)

converges to 0 exponentially, and that proper initialization of U(0), V(0) controls the convergence rate via a time-invariant matrix-valued term, the imbalance of the network.
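As a concrete illustration, the training procedure just described can be sketched in a few lines of numpy. The problem sizes and scales below are illustrative assumptions, not the paper's experimental settings, and gradient flow is approximated by gradient descent with a small step size.

```python
import numpy as np

# Minimal sketch (illustrative sizes): train y = V U^T x on the squared loss
# by discretizing the gradient flow (3) with a small step size.
rng = np.random.default_rng(0)
n, D, m, h = 20, 50, 3, 100          # under-determined: D > n; h >= min{m, D}
X = rng.standard_normal((n, D))
Y = rng.standard_normal((n, m))
U = rng.standard_normal((D, h)) / np.sqrt(h)
V = rng.standard_normal((m, h)) / np.sqrt(h)

def loss(V, U):
    return 0.5 * np.linalg.norm(Y - X @ U @ V.T, 'fro') ** 2

eta = 1e-3                           # "infinitesimal" step size
for _ in range(50000):
    R = Y - X @ U @ V.T              # residual Y - X U V^T
    # simultaneous update: V_dot = R^T X U, U_dot = X^T R V
    V, U = V + eta * R.T @ (X @ U), U + eta * X.T @ R @ V

print(loss(V, U))                    # near zero: an interpolating solution
```

With the 1/√h scaling of the initialization used here, the loss decays to (numerical) zero, in line with the convergence analysis below.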

3.1. REPARAMETRIZATION OF GRADIENT FLOW

Given D > n = rank(X), the singular value decomposition of X is given by

X = W [Σ_x^{1/2}  0] [Φ_1  Φ_2]^T,  W ∈ R^{n×n}, Φ_1 ∈ R^{D×n}, Φ_2 ∈ R^{D×(D−n)},    (4)

with W W^T = W^T W = Φ_1^T Φ_1 = I_n, Φ_2^T Φ_2 = I_{D−n}, Φ_1^T Φ_2 = 0, and Φ_1 Φ_1^T + Φ_2 Φ_2^T = I_D. Notice that U = I_D U = (Φ_1 Φ_1^T + Φ_2 Φ_2^T) U = Φ_1 Φ_1^T U + Φ_2 Φ_2^T U, hence we can reparametrize U as (U_1, U_2) using the bijection g: R^{n×h} × R^{(D−n)×h} → R^{D×h},

U = g(U_1, U_2) = Φ_1 U_1 + Φ_2 U_2,  with inverse  (U_1, U_2) = g^{−1}(U) = (Φ_1^T U, Φ_2^T U).

Based on the SVD of the data X in (4), we write the gradient flow (3) explicitly as

V̇(t) = (Y − X U(t) V^T(t))^T X U(t) = E^T(t) Σ_x^{1/2} Φ_1^T U(t),    (5a)
U̇(t) = X^T (Y − X U(t) V^T(t)) V(t) = Φ_1 Σ_x^{1/2} E(t) V(t),    (5b)

where

E(t) = E(V(t), U_1(t)) = W^T Y − Σ_x^{1/2} U_1(t) V^T(t)    (6)

is defined to be the error. Then from (5a)-(5b) we obtain the dynamics in the parameter space (V, U_1, U_2) as

V̇(t) = E^T(t) Σ_x^{1/2} U_1(t),   U̇_1(t) = Σ_x^{1/2} E(t) V(t),   U̇_2(t) = 0.    (7)

Notice that since W is orthogonal, we have

L(t) = (1/2) ‖Y − X U(t) V^T(t)‖_F² = (1/2) ‖W E(t)‖_F² = (1/2) ‖E(t)‖_F².    (8)

Therefore it suffices to analyze the convergence rate of the error E(t) under the dynamics of V(t), U_1(t) in (7). As mentioned in Section 1, the exponential convergence of E(t), or equivalently of the loss function L(t), is crucial for our generalization analysis: exponential convergence ensures that the parameters do not deviate far from the manifold of interest, which we discuss in Section 4, so that good properties of the initialization are approximately preserved during training.
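The reparametrization can be checked numerically. The sketch below (illustrative sizes, hypothetical random data) builds Φ_1, Φ_2 from the SVD of X, forms (U_1, U_2) = g^{−1}(U), and verifies that the loss equals (1/2)‖E‖_F² for the error E defined in (6).

```python
import numpy as np

# Sketch (illustrative sizes): split U into components in span(Phi1), span(Phi2)
# via the SVD of X, and verify that the loss equals 0.5 * ||E||_F^2.
rng = np.random.default_rng(1)
n, D, m, h = 10, 30, 2, 25
X = rng.standard_normal((n, D))
Y = rng.standard_normal((n, m))
U = rng.standard_normal((D, h))
V = rng.standard_normal((m, h))

W, s, Phi1T = np.linalg.svd(X, full_matrices=False)      # s holds sqrt(Sigma_x)
Phi1 = Phi1T.T                                           # D x n
Phi2 = np.linalg.svd(X, full_matrices=True)[2].T[:, n:]  # D x (D - n)

U1, U2 = Phi1.T @ U, Phi2.T @ U                          # (U1, U2) = g^{-1}(U)
E = W.T @ Y - np.diag(s) @ U1 @ V.T                      # error E(V, U1) in (6)

loss = 0.5 * np.linalg.norm(Y - X @ U @ V.T, 'fro') ** 2
assert np.isclose(loss, 0.5 * np.linalg.norm(E, 'fro') ** 2)
```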

3.2. IMBALANCE AND CONVERGENCE RATE OF THE ERROR

We define the imbalance of the single-hidden-layer linear network under input data X as

Imbalance:  U_1^T U_1 − V^T V.    (9)

The imbalance is time-invariant under gradient flow, as stated in the following claim.

Claim. Under the continuous dynamics (7), we have (d/dt)[U_1^T(t) U_1(t) − V^T(t) V(t)] ≡ 0.

Proof. Under (7), we compute the time derivatives of U_1^T(t) U_1(t) and V^T(t) V(t) as

(d/dt)[U_1^T(t) U_1(t)] = U̇_1^T(t) U_1(t) + U_1^T(t) U̇_1(t) = V^T(t) E^T(t) Σ_x^{1/2} U_1(t) + U_1^T(t) Σ_x^{1/2} E(t) V(t),
(d/dt)[V^T(t) V(t)] = V̇^T(t) V(t) + V^T(t) V̇(t) = V^T(t) E^T(t) Σ_x^{1/2} U_1(t) + U_1^T(t) Σ_x^{1/2} E(t) V(t).

The right-hand sides of the two equations are identical, hence (d/dt)[U_1^T(t) U_1(t) − V^T(t) V(t)] ≡ 0.

The imbalance is an h × h matrix with rank at most m + n, and its rank characterizes how much the row spaces of U_1 and V are misaligned. We show that a rank-(m + n − 1) imbalance is sufficient for exponential convergence of the error E(t), or equivalently, of the loss function. We now present our result regarding convergence of the error (see Appendix D for the proof).

Theorem 1 (Convergence of linear networks with sufficient rank of imbalance). Suppose h ≥ m + n − 1. Let V(t), U_1(t), t > 0, be the trajectory of the continuous dynamics (7) starting from some V(0), U_1(0). If

σ_{n+m−1}(U_1^T(0) U_1(0) − V^T(0) V(0)) = c > 0,    (10)

then for E(t) defined in (6), we have

‖E(t)‖_F² ≤ exp(−2 σ_n(Σ_x) c t) ‖E(0)‖_F²,  ∀t > 0.    (11)

Additionally, V(t), U_1(t), t > 0, converges to an equilibrium point (V(∞), U_1(∞)) such that E(V(∞), U_1(∞)) = 0.

The fact that the imbalance is preserved under gradient flow has been exploited in Arora et al. (2018a; b), where the imbalance is assumed to be zero (or small), so that the learning dynamics can be expressed in closed form with respect to the end-to-end matrix. This analysis requires, however, additional assumptions on the initialization of the end-to-end matrix for acceleration.
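The conservation claim above is easy to verify numerically. In the sketch below (illustrative sizes; Σ_x^{1/2} is a hypothetical positive diagonal matrix standing in for the data spectrum), a small-step discretization of (7) keeps the imbalance of (9) constant up to discretization error.

```python
import numpy as np

# Numerical check of the claim: the imbalance U1^T U1 - V^T V of (9) is
# (approximately) conserved by small-step gradient descent on (7).
rng = np.random.default_rng(2)
n, m, h = 8, 2, 15
Sx_half = np.diag(np.sqrt(rng.uniform(0.5, 2.0, n)))  # Sigma_x^{1/2} (assumed)
Yw = rng.standard_normal((n, m))                      # plays the role of W^T Y
U1 = rng.standard_normal((n, h)) / np.sqrt(h)
V = rng.standard_normal((m, h)) / np.sqrt(h)

imbalance0 = U1.T @ U1 - V.T @ V
eta = 1e-4
for _ in range(5000):
    E = Yw - Sx_half @ U1 @ V.T
    # simultaneous update of V and U1 following (7)
    V, U1 = V + eta * E.T @ Sx_half @ U1, U1 + eta * Sx_half @ E @ V

drift = np.linalg.norm(U1.T @ U1 - V.T @ V - imbalance0)
print(drift)    # tiny: the imbalance is a constant of motion of the flow
```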
Similarly, though in a more general setting, Du et al. (2018) showed that the imbalance is preserved, and proved convergence under a small imbalance assumption; acceleration (an exponential rate), however, is not guaranteed. Exploiting imbalance to guarantee acceleration was first presented in Saxe et al. (2014), under a spectral initialization assumption. In contrast, Theorem 1 shows acceleration without the spectral initialization condition. Rather, we show that acceleration is achieved under very mild conditions on the alignment between the initialization and the data. Such a good choice of initialization, at first glance, seems to depend heavily on the data, given the definition of the imbalance. However, we show in the next section that for sufficiently wide networks with random initialization, the imbalance has rank at least n + m with high probability, for any data matrix X; hence exponential convergence is almost guaranteed when training such networks. Later we illustrate how such convergence also affects the generalization of the trained network. The dependence on σ_n(Σ_x) has also appeared in Du & Hu (2019), where a linear convergence rate of gradient descent was shown for multi-layer linear networks. Their proof followed the same high-level procedure as in showing convergence for networks with nonlinear activations (Du et al., 2019b; a), which relied on showing that the Gram matrix stays close to its initialization during training. For our result, although it is stated for single-hidden-layer linear networks, we essentially lower bound the smallest eigenvalue of the Gram matrix at any time t by a fixed constant that depends only on the initialization. We end the section by noting that our result is not restricted to the case where X is full rank. In Appendix B, we show that a similar result holds for the case where X is rank deficient, with minor reformulations.
In that case, we only require h ≥ d + m − 1, and the singular value σ_{d+m−1}(U_1^T(0) U_1(0) − V^T(0) V(0)) replaces the quantity in (10), where d is the rank of X. In addition, we present and discuss the numerical simulations regarding Theorem 1 in Appendix A.

4. GENERALIZATION OF SINGLE-HIDDEN-LAYER LINEAR NETWORK

In this section, we study the generalization properties of trained single-hidden-layer linear networks under gradient flow. Assuming that D > n = rank(X), the regression problem (1) has infinitely many solutions Θ* that achieve zero loss. Among all these solutions, one of particular interest in high-dimensional linear regression is the minimum norm solution (min-norm solution)

Θ̂ = argmin_{Θ ∈ R^{D×m}} {‖Θ‖_2 : Y − XΘ = 0} = X^T (X X^T)^{−1} Y,

which has near-optimal generalization error for suitable data models, as shown in (Bartlett et al., 2020; Mei & Montanari, 2019). Here, we study conditions under which our trained network is equal or close to the min-norm solution, by showing how the initialization explicitly controls the trajectory of the training parameters to be exactly (or approximately) confined within some low-dimensional manifold. In turn, minimizing the loss over this manifold leads to the min-norm solution. Moreover, our analysis of constrained learning applies to wide single-hidden-layer linear networks with random initialization, whose infinite width limit is equivalent to the kernel predictor with the linear kernel K(x, x′) = x^T x′, as suggested by Jacot et al. (2018). One can easily check that such a kernel predictor is the min-norm solution Θ̂. In addition, we show that the operator norm distance between a trained finite width single-hidden-layer linear network and the min-norm solution is upper bounded by an O(h^{-1/2}) term with high probability over the random initialization.
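For reference, the min-norm solution above is exactly what standard pseudo-inverse routines compute; a quick sketch (illustrative sizes, hypothetical random data):

```python
import numpy as np

# In the under-determined case D > n, the min-norm interpolant is
# X^T (X X^T)^{-1} Y, which coincides with the pseudo-inverse solution.
rng = np.random.default_rng(3)
n, D, m = 15, 40, 2
X = rng.standard_normal((n, D))
Y = rng.standard_normal((n, m))

Theta_mn = X.T @ np.linalg.solve(X @ X.T, Y)          # X^T (X X^T)^{-1} Y

assert np.allclose(X @ Theta_mn, Y)                   # interpolates the data
assert np.allclose(Theta_mn, np.linalg.pinv(X) @ Y)   # same as pinv(X) @ Y
```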

4.1. DECOMPOSITION OF TRAINED NETWORK

To begin with, notice that the linear operator U V^T ∈ R^{D×m} associated with the single-hidden-layer linear network can be decomposed according to the data matrix X as

U V^T = (Φ_1 Φ_1^T + Φ_2 Φ_2^T) U V^T = Φ_1 U_1 V^T + Φ_2 U_2 V^T,

where Φ_1, Φ_2, U_1, U_2 are as defined in Section 3. Here [U V^T]_{:,j}, i.e., the j-th column of U V^T, is the linear predictor for the j-th output y_j, and is decomposed into two components within the complementary subspaces span(Φ_1) and span(Φ_2). Moreover, [U_1 V^T]_{:,j} is the coordinate vector of [U V^T]_{:,j} w.r.t. the orthonormal basis given by the columns of Φ_1, and similarly [U_2 V^T]_{:,j} is the coordinate vector w.r.t. the basis Φ_2. Clearly, under the gradient flow (3), the trajectory (U(t) V^T(t), t > 0) is fully determined by the trajectory (U_1(t) V^T(t), U_2(t) V^T(t), t > 0), which is governed by the dynamics (7).

Convergence of Training Parameters. We derived useful results regarding U_1(t) V^T(t), t > 0, in Section 3. By Theorem 1, if the rank of the imbalance matrix is large enough, U_1(t) V^T(t) converges to some U_1(∞) V^T(∞), and the stationary point satisfies W^T Y − Σ_x^{1/2} U_1(∞) V^T(∞) = 0, which implies U_1(∞) V^T(∞) = Σ_x^{−1/2} W^T Y. Then it is easy to check that

Φ_1 U_1(∞) V^T(∞) = Φ_1 Σ_x^{−1/2} W^T Y = X^T (X X^T)^{−1} Y = Θ̂.    (14)

In other words, the projection (in columns) of the trajectory U(t) V^T(t) onto span(Φ_1) converges exactly to the min-norm solution. For U_2(t) V^T(t), notice that U̇_2(t) = 0 in the dynamics (7), hence U_2(t) = U_2(0) for all t > 0. Then, under sufficient rank of the imbalance, U(t) V^T(t) converges to some U(∞) V^T(∞) with

U(∞) V^T(∞) = Φ_1 U_1(∞) V^T(∞) + Φ_2 U_2(0) V^T(∞) = Θ̂ + Φ_2 U_2(0) V^T(∞).

Therefore U_2(0) V^T(∞) quantifies how much the trained network U(∞) V^T(∞) deviates from the min-norm solution Θ̂, and since Φ_2^T Φ_2 = I_{D−n}, we have

‖U(∞) V^T(∞) − Θ̂‖_2 = ‖Φ_2 U_2(0) V^T(∞)‖_2 = ‖U_2(0) V^T(∞)‖_2.    (15)

Constrained Training via Initialization.
Based on our analysis above, initializing U_2(0) such that U_2(0) V^T(∞) = 0 in the limit guarantees convergence to the min-norm solution via (15). However, this is not easily achievable, as one would need to know V(∞) a priori. Instead, we can show that by choosing a proper initialization, one can constrain the trajectory of the matrix U(t) V^T(t) to satisfy Φ_2^T U(t) V^T(t) ≡ 0 for all t ≥ 0, which is equivalent to saying that the columns of U(t) V^T(t) lie in span(Φ_1). Indeed, using the fact that for all t ≥ 0, U_2(t) = U_2(0) and Φ_2^T U(t) V^T(t) ≡ 0, we obtain

0 = Φ_2^T U(t) V^T(t) = Φ_2^T U(∞) V^T(∞) = U_2(∞) V^T(∞) = U_2(0) V^T(∞).

Therefore, using (15), it follows that U(∞) V^T(∞) = Θ̂, as desired. Here, the constraint that all columns of U(t) V^T(t) lie in span(Φ_1) is equivalent to the constraint that (V, U) lies within some low-dimensional manifold in the parameter space. More importantly, such a constraint on U V^T is relevant to generalization: when a column of U V^T has a component in span(Φ_2), predictions are made based on features in span(Φ_2). However, those features are not present in the data X, whose rows lie in span(Φ_1); intuitively, this hurts generalization performance. To enforce the constraint Φ_2^T U(t) V^T(t) ≡ 0, consider the dynamics of U_2(0) V^T(t), or equivalently V(t) U_2^T(0). From (7) we have

(d/dt) [ V(t) U_2^T(0) ; U_1(t) U_2^T(0) ] = [ 0 , E^T(t) Σ_x^{1/2} ; Σ_x^{1/2} E(t) , 0 ] [ V(t) U_2^T(0) ; U_1(t) U_2^T(0) ].    (17)

The most straightforward way to enforce V(t) U_2^T(0) = 0 for all t > 0 is to initialize V(0), U(0) such that V(0) U_2^T(0) = 0 and U_1(0) U_2^T(0) = 0. To obtain such a proper initialization, one can:
• Initialize the columns of U(0) in span(Φ_1), which leads to U_2(0) = 0;
• Initialize U(0) and V(0) to enforce the orthogonality conditions on the rows, i.e., V(0) U^T(0) = 0 and U(0) U^T(0) = I_D.
Such an initialization guarantees that the gradient flow is constrained within some low-dimensional manifold in the parameter space, such that any global minimizer of the loss in that manifold corresponds to the min-norm solution. Therefore, whenever the network parameters in this manifold converge with E(∞) = 0, the solution must be the minimum-norm one. While in practice we can construct the initialization exactly as above, such a choice is data-dependent and requires the SVD of the data matrix X. Moreover, we note that while zero initialization works in the standard linear regression case, the initialization V(0) = 0, U(0) = 0 is a bad choice in the overparametrized case, because it is a saddle point of the gradient flow, even though it satisfies the orthogonality conditions V(0) U_2^T(0) = 0 and U_1(0) U_2^T(0) = 0. In the next section, we show that under random initialization and sufficiently large hidden layer width h, these conditions on the initialization are approximately satisfied: with high probability, the rank of the imbalance is m + n and ‖V(0) U_2^T(0)‖_F, ‖U_1(0) U_2^T(0)‖_F ∼ O(h^{-1/2}), so that the trajectory U(t) V^T(t), t > 0, is approximately constrained to the subspace mentioned above.
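The first recipe above (columns of U(0) in span(Φ_1), so that U_2(0) = 0) is simple to implement. The sketch below (illustrative sizes, hypothetical random data, gradient flow approximated by small-step gradient descent) runs from such an initialization and recovers the min-norm solution, as predicted.

```python
import numpy as np

# Sketch: initialize the columns of U(0) in span(Phi1), so U2(0) = 0 and the
# trajectory stays on the manifold; gradient descent then reaches min-norm.
rng = np.random.default_rng(4)
n, D, m, h = 10, 30, 2, 60
X = rng.standard_normal((n, D))
Y = rng.standard_normal((n, m))
Phi1 = np.linalg.svd(X, full_matrices=False)[2].T    # D x n, basis of span(Phi1)

U = Phi1 @ rng.standard_normal((n, h)) / np.sqrt(h)  # columns of U(0) in span(Phi1)
V = rng.standard_normal((m, h)) / np.sqrt(h)

eta = 2e-3
for _ in range(60000):
    R = Y - X @ U @ V.T
    V, U = V + eta * R.T @ (X @ U), U + eta * X.T @ R @ V

Theta_mn = X.T @ np.linalg.solve(X @ X.T, Y)
print(np.linalg.norm(U @ V.T - Theta_mn, 2))         # ~0: min-norm recovered
```

Note that the gradient step for U, X^T R V, has columns in span(Φ_1), so the constraint U_2(t) = 0 is preserved exactly along the discrete trajectory as well.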

4.2. WIDE SINGLE-HIDDEN-LAYER LINEAR NETWORK

We now discuss the generalization of wide single-hidden-layer linear networks with random initialization. In particular, we show how the previously mentioned conditions for convergence and good generalization, i.e., high imbalance and orthogonality, are approximately satisfied with high probability under the following initialization:

[U(0)]_{ij} ∼ N(0, 1/h), 1 ≤ i ≤ D, 1 ≤ j ≤ h,    [V(0)]_{ij} ∼ N(0, 1/h), 1 ≤ i ≤ m, 1 ≤ j ≤ h,

where all the entries are independent. Our analysis indeed highlights the need, within this regime, for exponential convergence to ensure good generalization.

Remark 1. Our analysis can be extended to the more general case where all entries of U(0), V(0) are sampled from N(0, h^{−2α}) with 1/4 < α ≤ 1/2. For simplicity of presentation, we consider the particular case α = 1/2 in this section. Please see Appendix F for the more general result.

Previous work (Jacot et al., 2018) has suggested that in the limit h → ∞, the trained network is equivalent to the kernel predictor with the NTK. For linear networks, the NTK is the linear kernel K(x, x′) = x^T x′, whose corresponding kernel predictor is the min-norm solution. Therefore, as h → ∞, we should expect the trained network to converge to the min-norm solution, given proper scaling of the network (Arora et al., 2019b). Combining what we have discussed regarding the convergence and generalization of linear networks with basic random matrix theory, we derive a non-asymptotic O(h^{-1/2}) bound on the operator norm distance between a trained width-h network under random initialization and the min-norm solution.

Remark 2. We note that both our parametrization and our initialization are at first sight different from those used in previous works (Jacot et al., 2018; Du & Hu, 2019; Arora et al., 2019b) on NTK analysis. However, one can relate our model assumptions to the NTK ones by a rescaling of the parameters and of time.
In the context of this comparison, our setting achieves the same limiting end-to-end function, but with a rate of convergence h times faster (due to the rescaling of time). Further, our result does not rely on studying the tangent kernel of the network, hence our approach differs significantly from the NTK one. Recall from the last section that one can obtain exactly the min-norm solution via proper initialization of the single-hidden-layer network. In particular, this requires 1) convergence of the error E(t) to zero; and 2) the orthogonality conditions V(0) U_2^T(0) = 0 and U_1(0) U_2^T(0) = 0. Under random initialization and sufficiently large hidden layer width h, these two conditions are approximately satisfied. Using basic random matrix theory, one can show that with high probability, the rank of the imbalance is m + n, which leads to (exponential) convergence of E(t), and ‖V(0) U_2^T(0)‖_F, ‖U_1(0) U_2^T(0)‖_F ∼ O(h^{-1/2}), as stated in the following lemma (see Appendix F for the proof).

Lemma 1. Given a data matrix X, ∀δ ∈ (0, 1), ∀h > h_0 = poly(m, D, 1/δ), with probability at least 1 − δ over random initializations with [U(0)]_{ij}, [V(0)]_{ij} ∼ N(0, h^{−1}), all of the following hold:
1. (Sufficient rank of imbalance)
σ_{n+m}(U_1^T(0) U_1(0) − V^T(0) V(0)) > 1 − 2(√m + √D + √((1/2) log(2/δ)))/√h;
2. (Approximate orthogonality conditions)
‖[V(0) U_2^T(0) ; U_1(0) U_2^T(0)]‖_F ≤ 2√(m+n) (√m + √D + √((1/2) log(2/δ)))/√h,
‖U_1(0) V^T(0)‖_F ≤ 2√m (√m + √D + √((1/2) log(2/δ)))/√h.

Clearly, ‖V(0) U_2^T(0)‖ ∼ O(h^{-1/2}) alone is insufficient to obtain the final bound on the trained network.
However, from (17) we show that as long as the error E(t) converges to 0 exponentially, which is guaranteed by sufficient rank of the imbalance with high probability, the final deviation from the min-norm solution, ‖V(∞) U_2^T(0)‖_F, cannot exceed C(‖V(0) U_2^T(0)‖_F + ‖U_1(0) U_2^T(0)‖_F) for some constant C that depends on the data and the convergence rate of E(t), leading to the desired bound. The formal statement is summarized in the following theorem.

Theorem 2 (Generalization of wide single-hidden-layer linear networks). Let (V(t), U(t), t > 0) be a trajectory of the continuous dynamics (7). Then ∃C > 0 such that ∀δ ∈ (0, 1), ∀h > h_0 = poly(m, D, 1/δ, σ_1(Σ_x)/σ_n³(Σ_x)), with probability at least 1 − δ over random initializations with [U(0)]_{ij}, [V(0)]_{ij} ∼ N(0, h^{−1}), we have

‖U(∞) V^T(∞) − Θ̂‖_2 ≤ 2C √(m+n) (√m + √D + √((1/2) log(2/δ)))/√h,

where C depends on the data X, Y. The proof is given in Appendix F. This is, to the best of our knowledge, the first non-asymptotic bound in the global (operator norm) sense for gradient flow trained wide neural networks under random initialization. Although we understand that this cannot be directly compared to previous works (Arora et al., 2019b; Buchanan et al., 2020), which show non-asymptotic results connecting a trained network to the kernel predictor given by the NTK for more general network structures and activation functions than linear networks, we believe this theorem is a clear illustration of how overparametrization, in particular the hidden layer width, together with random initialization, affects convergence and generalization beyond the kernel regime.
To be specific, regarding the non-asymptotic analysis of wide neural networks, the concept of constrained learning presented in this section and used to prove Theorem 2 is more general than in previous works (Arora et al., 2019b), where one requires a sufficiently large hidden layer width such that the trajectory of the network in function space approximately matches the trajectory in the infinite width limit. Loosely speaking, such a large width h forces the trajectory to be approximately confined within a one-dimensional manifold, parametrized only by t and independent of the initialization. We have shown here, however, that even for relatively smaller widths h, there is a higher-dimensional manifold that provides good generalization performance. This shows that, while the kernel regime may be sufficient, it is certainly, at least for the single-hidden-layer linear network, not necessary to guarantee good generalization. To verify Theorem 2, we present numerical simulations regarding the implicit regularization of gradient descent on wide linear networks in Appendix A. The simulation results show that ‖U(∞) V^T(∞) − Θ̂‖_2 decays approximately as O(h^{-1/2}) as h grows, suggesting that our non-asymptotic bound is tight in its order w.r.t. h. We refer interested readers to the appendix for the details of the simulation settings.
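The O(h^{-1/2}) behavior of the orthogonality defect in Lemma 1 is also easy to probe directly at initialization. In the sketch below (illustrative sizes, hypothetical random data), quadrupling h should roughly halve ‖V(0) U_2^T(0)‖_F.

```python
import numpy as np

# Empirical check of the h^{-1/2} scaling in Lemma 1: under N(0, 1/h)
# initialization, the orthogonality defect ||V(0) U2(0)^T||_F shrinks by
# about half when h is multiplied by four.
rng = np.random.default_rng(5)
n, D, m = 10, 30, 2
X = rng.standard_normal((n, D))
Phi2 = np.linalg.svd(X, full_matrices=True)[2].T[:, n:]   # D x (D - n)

def defect(h):
    U = rng.standard_normal((D, h)) / np.sqrt(h)
    V = rng.standard_normal((m, h)) / np.sqrt(h)
    return np.linalg.norm(V @ (Phi2.T @ U).T, 'fro')      # ||V(0) U2(0)^T||_F

d1 = np.mean([defect(100) for _ in range(50)])
d2 = np.mean([defect(400) for _ in range(50)])
print(d1 / d2)    # close to 2, consistent with the h^{-1/2} rate
```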

A NUMERICAL VERIFICATION

The scale of the linear regression we consider in the numerical section is D = 400, n = 100, and m = 1.

A.1 CONVERGENCE OF SINGLE-HIDDEN-LAYER LINEAR NETWORK

Generating training data. The synthetic training data is generated as follows: 1) For the data matrix X, we first generate X_0 ∈ R^{n×D} with all entries sampled from N(0, 1), and take its SVD X_0 = W Σ^{1/2} Φ_1^T. Then we let X = W Φ_1^T, so that all singular values of X equal 1. 2) For Y, we first sample Θ ∼ N(0, D^{−1} I_D) and ε ∼ N(0, 0.01² I_n), then we let Y = XΘ + ε.
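As a sketch, the recipe above can be transcribed directly (the random seed is an arbitrary assumption):

```python
import numpy as np

# Transcription of the data-generation recipe: whiten X so that all its
# singular values are 1, then build noisy labels Y = X Theta + eps.
rng = np.random.default_rng(6)
D, n, m = 400, 100, 1
X0 = rng.standard_normal((n, D))
W, _, Phi1T = np.linalg.svd(X0, full_matrices=False)
X = W @ Phi1T                                        # singular values all 1
Theta = rng.standard_normal((D, m)) / np.sqrt(D)     # Theta ~ N(0, D^{-1} I_D)
eps = 0.01 * rng.standard_normal((n, m))             # eps ~ N(0, 0.01^2 I_n)
Y = X @ Theta + eps

assert np.allclose(np.linalg.svd(X, compute_uv=False), 1.0)
```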

Initialization and Training

We set the hidden layer width h = 500. We initialize U(0), V(0) with [U(0)]_{ij} ∼ N(0, σ_U²), [V(0)]_{ij} ∼ N(0, σ_V²), and we consider two cases of such initialization: 1) σ_U = 0.1, σ_V = 0.1; 2) σ_U = 0.5, σ_V = 0.02. For these two cases, we run gradient descent on the averaged loss L = (1/n)‖Y − X U V^T‖_F² with step size η = 5 × 10^{−4}. Case 2 has a much larger c = σ_{n+m−1}(U_1^T(0) U_1(0) − V^T(0) V(0)) than case 1, and as a consequence, the loss converges much faster in case 2, as shown in Fig. 2. We see from the right log plot that for case 1, the bound in Theorem 1 is not a tight characterization of the asymptotic convergence rate, while for case 2, when c is large, the bound in Theorem 1 is almost tight regarding the asymptotic rate. Clearly, for case 1 there are additional factors that contribute to the linear convergence, which would be an interesting topic for future work.

A.2 IMPLICIT REGULARIZATION ON WIDE SINGLE-HIDDEN-LAYER LINEAR NETWORK

Generating training data The synthetic training data is generated as follows: 1) For the data matrix $X$, we generate $X \in \mathbb{R}^{n\times D}$ with all entries sampled from $N(0, D^{-1})$; 2) For $Y$, we first sample $\Theta \sim N(0, D^{-1}I_D)$ and $\epsilon \sim N(0, 0.01^2 I_n)$, and then let $Y = X\Theta + \epsilon$.

Initialization and Training

We initialize $U(0), V(0)$ with $[U(0)]_{ij} \sim N(0, h^{-1})$, $[V(0)]_{ij} \sim N(0, h^{-1})$ and run gradient descent on the averaged loss $L = \frac{1}{n}\|Y - XUV^T\|_F^2$ with step size $\eta = 5\times10^{-4}$. Training stops when the loss falls below $10^{-7}$. We run the algorithm for various $h$ from 500 to 10000, with 10 repeated runs for each $h$. In Fig. 3, $\|U_2(0)V^T(0)\|_F$ is the initial distance from the end-to-end function to the desired manifold discussed in Section 4.1, and $\|U(t_f)V^T(t_f) - \Theta\|$ is the distance between the end-to-end function and the min-norm solution when the algorithm stops. Clearly, Fig. 3 shows that as $h$ increases, the distance between the trained network and the min-norm solution decreases; the middle plot verifies that the distance is indeed $O(h^{-1/2})$. Lastly, we note that the right plot implies the convergence rate approaches a constant as $h$ increases, which verifies the result in Lemma 1 regarding the imbalance singular values.
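A scaled-down version of this experiment (smaller $n$, $D$ and a larger step size than the paper's setting, with our own seed; a sketch, not the paper's exact configuration) already exhibits the trend of Fig. 3:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, m = 20, 80, 1                  # scaled-down sizes (our choice)
X = rng.normal(0, (1 / D) ** 0.5, (n, D))
Theta = rng.normal(0, (1 / D) ** 0.5, (D, m))
Y = X @ Theta + rng.normal(0, 0.01, (n, m))
Theta_mn = np.linalg.pinv(X) @ Y     # min-norm solution

def dist_to_min_norm(h, eta=0.3, tol=1e-7, max_steps=4000):
    # N(0, 1/h) initialization, gradient descent on the averaged loss
    U = rng.normal(0, h ** -0.5, (D, h))
    V = rng.normal(0, h ** -0.5, (m, h))
    for _ in range(max_steps):
        R = Y - X @ (U @ V.T)
        if np.sum(R ** 2) / n < tol:   # stop once the averaged loss is tiny
            break
        gU = -(2 / n) * X.T @ R @ V
        gV = -(2 / n) * R.T @ X @ U
        U, V = U - eta * gU, V - eta * gV
    # operator-norm distance between the trained network and the min-norm solution
    return np.linalg.norm(U @ V.T - Theta_mn, 2)

d_small, d_large = dist_to_min_norm(100), dist_to_min_norm(3200)
```

As predicted by Theorem 2, the distance shrinks as the width grows.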

B CONVERGENCE RATE ANALYSIS FOR LINEAR REGRESSION: GENERAL CASE

Suppose the input data matrix $X$ has rank $d \le \min\{D, n\}$. We write the compact SVD of $X$ as
$$X = W\Sigma_x^{1/2}\Phi_1^T\,, \qquad W \in \mathbb{R}^{n\times d}\,,\; \Phi_1 \in \mathbb{R}^{D\times d}\,.$$
Given the compact SVD, we still define $U_1 = \Phi_1^TU$ and write the loss function as
$$\begin{aligned}
L(V, U) &= \tfrac12\|Y - XUV^T\|_F^2 = \tfrac12\|Y - W\Sigma_x^{1/2}U_1V^T\|_F^2 \\
&= \tfrac12\|(I_n - WW^T + WW^T)Y - W\Sigma_x^{1/2}U_1V^T\|_F^2 \\
&= \tfrac12\|(I_n - WW^T)Y + W(W^TY - \Sigma_x^{1/2}U_1V^T)\|_F^2 \\
&= \tfrac12\|(I_n - WW^T)Y\|_F^2 + \tfrac12\|W(W^TY - \Sigma_x^{1/2}U_1V^T)\|_F^2 + \big\langle (I_n - WW^T)Y,\; W(W^TY - \Sigma_x^{1/2}U_1V^T)\big\rangle_F \\
&= \tfrac12\|(I_n - WW^T)Y\|_F^2 + \tfrac12\|W(W^TY - \Sigma_x^{1/2}U_1V^T)\|_F^2 \\
&= \tfrac12\|(I_n - WW^T)Y\|_F^2 + \tfrac12\|W^TY - \Sigma_x^{1/2}U_1V^T\|_F^2\,,
\end{aligned}$$
where the last equality holds because $W^TW = I_d$, and the second-to-last equality holds because the cross term vanishes due to $W^T(I_n - WW^T) = 0$. It is easy to see that $\min_{V,U} L(V,U) = \tfrac12\|(I_n - WW^T)Y\|_F^2 := L^*$, which is usually referred to as the residual. Similarly to Section 3, we still define the error as $E = W^TY - \Sigma_x^{1/2}U_1V^T$, and one can check that gradient flow on $L(V,U)$ yields
$$\dot V(t) = E^T(t)\Sigma_x^{1/2}U_1(t)\,, \qquad \dot U_1(t) = \Sigma_x^{1/2}E(t)V(t)\,, \tag{22}$$
in the parameter space $(V, U_1)$. Similar to Theorem 1, we have:

Theorem B.1. Suppose $h \ge d + m - 1$. Let $V(t), U_1(t), t > 0$ be the trajectory of the continuous dynamics (22) starting from some $V(0), U_1(0)$. If
$$\sigma_{d+m-1}\big(U_1^T(0)U_1(0) - V^T(0)V(0)\big) = c > 0\,,$$
then for $E(t) = W^TY - \Sigma_x^{1/2}U_1(t)V^T(t)$, we have
$$\|E(t)\|_F^2 \le \exp\big(-2\sigma_d(\Sigma_x)\,c\,t\big)\|E(0)\|_F^2\,, \quad \forall t > 0\,.$$
Additionally, $V(t), U_1(t), t > 0$ converges to some equilibrium point $(V(\infty), U_1(\infty))$ such that $E(V(\infty), U_1(\infty)) = 0$.

The proof is exactly the same as that of Theorem 1, shown in Appendix D, except that the sizes of $U_1$ and $E$ are now $d\times h$ and $d\times m$, respectively. To summarize, for any linear regression problem, Theorem 1 shows that sufficient rank of the imbalance guarantees exponential convergence of $L(t) - L^*$, where $L^* = \tfrac12\|(I_n - WW^T)Y\|_F^2$.
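The key invariance used throughout (the imbalance $U_1^TU_1 - V^TV$ is preserved by gradient flow) is easy to probe numerically by running small-step gradient descent as an Euler approximation of the flow (a sketch with our own sizes and seed; under the exact flow the drift would be zero, while discretization leaves a small residual drift):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, m, h = 10, 30, 2, 15
X = rng.standard_normal((n, D))
Y = rng.standard_normal((n, m))
_, _, Phi1T = np.linalg.svd(X, full_matrices=False)  # compact SVD; rank d = n here
U = 0.1 * rng.standard_normal((D, h))
V = 0.1 * rng.standard_normal((m, h))

def imbalance(U, V):
    U1 = Phi1T @ U
    return U1.T @ U1 - V.T @ V

G0 = imbalance(U, V)
eta = 1e-4                           # small step: Euler approximation of the flow
for _ in range(1000):
    R = Y - X @ U @ V.T
    gU, gV = -X.T @ R @ V, -R.T @ X @ U   # gradients of (1/2)||R||_F^2
    U, V = U - eta * gU, V - eta * gV
drift = np.linalg.norm(imbalance(U, V) - G0) / np.linalg.norm(G0)
```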

C USEFUL LEMMAS

Before proving Theorems 1 and 2, we state several lemmas that will be used in the proofs. The first is Grönwall's inequality (Grönwall, 1919) in differential form.

Lemma C.1 (Grönwall's inequality). Let $u(t), \beta(t): [0, +\infty) \to \mathbb{R}$ be continuous, with $u(t)$ differentiable on $(0, +\infty)$. If
$$\frac{d}{dt}u(t) \le \beta(t)u(t)\,, \quad \forall t > 0\,,$$
then $u(t) \le u(0)\exp\big(\int_0^t\beta(\tau)\,d\tau\big)$ for all $t > 0$.

Proof (of Theorem 1). For readability we simply write $V(t), U_1(t), E(t)$ as $V, U_1, E$ for most of the proof. Under (7), the time derivative of the error is given by
$$\dot E = -\Sigma_x^{1/2}U_1U_1^T\Sigma_x^{1/2}E - \Sigma_x EVV^T\,.$$
Consider the time derivative of $\|E\|_F^2$:
$$\frac{d}{dt}\|E\|_F^2 = \frac{d}{dt}\operatorname{tr}(E^TE) = -2\operatorname{tr}\big(E^T\Sigma_x^{1/2}U_1U_1^T\Sigma_x^{1/2}E + E^T\Sigma_x EVV^T\big)\,. \tag{23}$$
Use the trace inequality in Lemma C.4 to lower-bound the two traces respectively as
$$\operatorname{tr}\big(E^T\Sigma_x^{1/2}U_1U_1^T\Sigma_x^{1/2}E\big) = \operatorname{tr}\big(\Sigma_x^{1/2}EE^T\Sigma_x^{1/2}\,U_1U_1^T\big) \ge \sigma_n(U_1U_1^T)\operatorname{tr}\big(\Sigma_x^{1/2}EE^T\Sigma_x^{1/2}\big) = \sigma_n(U_1U_1^T)\operatorname{tr}(\Sigma_x EE^T) \ge \sigma_n(U_1U_1^T)\sigma_n(\Sigma_x)\|E\|_F^2\,, \tag{24}$$
and
$$\operatorname{tr}\big(E^T\Sigma_x EVV^T\big) \ge \sigma_m(VV^T)\operatorname{tr}(E^T\Sigma_x E) = \sigma_m(VV^T)\operatorname{tr}(\Sigma_x EE^T) \ge \sigma_m(VV^T)\sigma_n(\Sigma_x)\|E\|_F^2\,. \tag{25}$$
Combining (23) with (24) and (25), we have
$$\frac{d}{dt}\|E\|_F^2 \le -2\sigma_n(\Sigma_x)\big(\sigma_n(U_1U_1^T) + \sigma_m(VV^T)\big)\|E\|_F^2\,. \tag{26}$$
Moreover, we have
$$\sigma_n(U_1U_1^T) + \sigma_m(VV^T) = \sigma_n(U_1^TU_1) + \sigma_m(V^TV) = \sigma_n(U_1^TU_1) + \sigma_m(-V^TV) \ge \sigma_{n+m-1}(U_1^TU_1 - V^TV) = \sigma_{n+m-1}\big(U_1^T(0)U_1(0) - V^T(0)V(0)\big) = c\,,$$
where the first equality uses the fact that $U_1U_1^T$ ($VV^T$ resp.) has the same non-zero singular values as $U_1^TU_1$ ($V^TV$ resp.), the inequality follows from Weyl's inequality (Lemma C.2), and the last two equalities hold because the imbalance is time-invariant. Finally, we have
$$\frac{d}{dt}\|E\|_F^2 \le -2\sigma_n(\Sigma_x)\,c\,\|E\|_F^2\,. \tag{27}$$
The result follows by applying Grönwall's inequality (Lemma C.1), which gives
$$\|E(t)\|_F^2 \le \exp\big(-2\sigma_n(\Sigma_x)ct\big)\|E(0)\|_F^2\,, \quad \forall t > 0\,,$$
proving the exponential convergence of $E(t)$.
Regarding the second statement: for the gradient system (7), the parameters $(U_1(t), V(t))$ converge either to an equilibrium point which minimizes the potential $\|E(t)\|_F^2$ or to infinity (Hirsch et al., 1974). Consider the following dynamics,
$$\frac{d}{dt}\underbrace{\begin{bmatrix} V(t) \\ U_1(t)\end{bmatrix}}_{:=Z(t)} = \underbrace{\begin{bmatrix} 0 & E^T(t)\Sigma_x^{1/2} \\ \Sigma_x^{1/2}E(t) & 0 \end{bmatrix}}_{:=A_Z(t)}\begin{bmatrix} V(t) \\ U_1(t)\end{bmatrix}\,, \tag{28}$$
which is a time-varying linear system. Notice that by Horn & Johnson (2012, Theorem 7.3.3), we have $\|A_Z(t)\|_2 = \|\Sigma_x^{1/2}E(t)\|_2$. From (28), we have
$$\frac{d}{dt}\|Z(t)\|_F^2 = 2\operatorname{tr}\big(Z^T(t)A_Z(t)Z(t)\big) = 2\operatorname{tr}\big(Z(t)Z^T(t)A_Z(t)\big) \le 2\|A_Z(t)\|_2\operatorname{tr}\big(Z(t)Z^T(t)\big) = 2\|\Sigma_x^{1/2}E(t)\|_2\|Z(t)\|_F^2 \le 2\sigma_1^{1/2}(\Sigma_x)\|E(t)\|_2\|Z(t)\|_F^2 \le 2\sigma_1^{1/2}(\Sigma_x)\|E(t)\|_F\|Z(t)\|_F^2\,.$$
By Grönwall's inequality (Lemma C.1), we have
$$\|Z(t)\|_F^2 \le \exp\Big(\int_0^t 2\sigma_1^{1/2}(\Sigma_x)\|E(\tau)\|_F\,d\tau\Big)\|Z(0)\|_F^2\,.$$
Finally, from (27), we have $\|E(t)\|_F \le \exp(-\sigma_n(\Sigma_x)ct)\|E(0)\|_F$ for all $t > 0$, which leads to
$$\|Z(t)\|_F^2 \le \exp\Big(2\sigma_1^{1/2}(\Sigma_x)\|E(0)\|_F\int_0^t\exp(-\sigma_n(\Sigma_x)c\tau)\,d\tau\Big)\|Z(0)\|_F^2 \le \exp\Big(2\sigma_1^{1/2}(\Sigma_x)\|E(0)\|_F\int_0^\infty\exp(-\sigma_n(\Sigma_x)c\tau)\,d\tau\Big)\|Z(0)\|_F^2 = \exp\Big(\frac{2\sigma_1^{1/2}(\Sigma_x)}{c\,\sigma_n(\Sigma_x)}\|E(0)\|_F\Big)\|Z(0)\|_F^2\,.$$
Therefore the trajectory $V(t), U_1(t), t > 0$ is bounded, i.e. it cannot converge to infinity; hence it has to converge to some equilibrium point $(V(\infty), U_1(\infty))$ such that $E(V(\infty), U_1(\infty)) = 0$.
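Theorem 1's guarantee can be sanity-checked numerically by discretizing the dynamics (7) with a small forward-Euler step (a sketch; sizes, step size, and seed are our own choices, and the discretization only approximates the flow):

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, m, h = 8, 20, 2, 12             # h >= n + m - 1 = 9
X = rng.standard_normal((n, D))
Y = rng.standard_normal((n, m))
W, s, Phi1T = np.linalg.svd(X, full_matrices=False)
Sig_half = np.diag(s)                 # Sigma_x^{1/2}
U1 = Phi1T @ rng.standard_normal((D, h))
V = 0.1 * rng.standard_normal((m, h))
# c = sigma_{n+m-1} of the (time-invariant) imbalance at initialization
c = np.linalg.svd(U1.T @ U1 - V.T @ V, compute_uv=False)[n + m - 2]
sig_n = s[-1] ** 2                    # sigma_n(Sigma_x)

def err(U1, V):
    return W.T @ Y - Sig_half @ U1 @ V.T

E0 = np.sum(err(U1, V) ** 2)          # ||E(0)||_F^2
eta, T = 1e-4, 2000                   # forward-Euler discretization of (7)
for _ in range(T):
    E = err(U1, V)
    # tuple assignment evaluates both right-hand sides with the old U1, V
    U1, V = U1 + eta * Sig_half @ E @ V, V + eta * E.T @ Sig_half @ U1
ET = np.sum(err(U1, V) ** 2)          # ||E(t)||_F^2 at t = eta * T
bound = np.exp(-2 * sig_n * c * eta * T) * E0
```

In practice the observed decay is much faster than the bound, consistent with the discussion of Fig. 2.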

E COMPARISON WITH THE NTK INITIALIZATION FOR WIDE SINGLE-HIDDEN-LAYER LINEAR NETWORKS

In Section 4.2, we analyzed the generalization properties of wide single-hidden-layer linear networks under properly scaled random initialization. Our initialization of the network weights $U, V$ differs from the typical setting in previous works (Jacot et al., 2018; Du & Hu, 2019; Arora et al., 2019b). In this section, we show that under our setting, the gradient flow is related to the NTK flow by 1) reparametrization and rescaling in time; 2) proper scaling of the network output. The necessity of output scaling is also shown in Arora et al. (2019b).

In this paper we work with a single-hidden-layer linear network defined as $f: \mathbb{R}^D \to \mathbb{R}^m$, $f(x; V, U) = VU^Tx$, which is parametrized by $U, V$. We then analyze the gradient flow on the loss function $L(V, U) = \tfrac12\|Y - XUV^T\|_F^2$, given the data and output matrices $X, Y$. Lastly, in Section 4.2, we initialize $U(0), V(0)$ such that all entries are randomly drawn from $N(0, h^{-1})$, where $h$ is the hidden layer width. Now we define $\tilde U := \sqrt h\,U$, $\tilde V := \sqrt h\,V$; the loss function can then be written as
$$L(V, U) = L(\tilde V, \tilde U) = \tfrac12\Big\|Y - \tfrac1h X\tilde U\tilde V^T\Big\|_F^2 = \tfrac12\Big\|Y - \tfrac{\sqrt m}{\sqrt h}\cdot\tfrac{1}{\sqrt{mh}}X\tilde U\tilde V^T\Big\|_F^2 = \tfrac12\sum_{i=1}^n\Big\|y^{(i)} - \tfrac{\sqrt m}{\sqrt h}\tilde f\big(x^{(i)}; \tilde V, \tilde U\big)\Big\|_2^2\,.$$
Notice that $\tilde f(x; \tilde V, \tilde U) = \tfrac{1}{\sqrt{mh}}\tilde V\tilde U^Tx$ is the typical network discussed in previous works (Jacot et al., 2018; Du & Hu, 2019; Arora et al., 2019b). When all entries of $U(0), V(0)$ are initialized randomly as $N(0, h^{-1})$, the entries of $\tilde U(0), \tilde V(0)$ are random samples from $N(0, 1)$, which is the typical choice of initialization for NTK analysis. However, the difference is that $\tilde f(x; \tilde V, \tilde U)$ is scaled by $\tfrac{\sqrt m}{\sqrt h}$. In previous work showing a non-asymptotic bound between wide neural networks and their infinite width limit (Arora et al., 2019b, Theorem 3.2), the wide neural network is scaled by a small constant $\kappa$ such that the prediction of the trained network is within $\epsilon$-distance of the one given by the kernel predictor of its NTK. Moreover, Arora et al.
(2019b) suggests that $\tfrac1\kappa$ should scale as $\mathrm{poly}(\tfrac1\epsilon)$, i.e., to make sure the trained network is arbitrarily close to the kernel predictor, $\kappa$ should be vanishingly small. In our setting, the random initialization implicitly enforces such a vanishing scaling $\tfrac{\sqrt m}{\sqrt h}$ as the width of the network increases.

Lastly, we show that the gradient flow on $L(V, U)$ only differs from the flow on $L(\tilde V, \tilde U)$ by the time scale. Suppose $U, V$ follow the gradient flow on $L(V, U)$ w.r.t. time $t$. Define $\tilde t := ht$; we have
$$\frac{d}{d\tilde t}\tilde U = \sqrt h\frac{d}{d\tilde t}U = \sqrt h\frac{dt}{d\tilde t}\frac{d}{dt}U = \frac{1}{\sqrt h}\frac{d}{dt}U = -\frac{1}{\sqrt h}\frac{\partial}{\partial U}L(V, U)\,,$$
and similarly $\frac{d}{d\tilde t}\tilde V = -\frac{1}{\sqrt h}\frac{\partial}{\partial V}L(V, U)$. Now notice that
$$\frac{d}{d\tilde t}\tilde U = -\frac{1}{\sqrt h}\frac{\partial}{\partial U}L(V, U) = \frac{1}{\sqrt h}X^T(Y - XUV^T)V = \frac1h X^T\Big(Y - \frac1h X\tilde U\tilde V^T\Big)\tilde V = -\frac{\partial}{\partial \tilde U}L(\tilde V, \tilde U)\,,$$
$$\frac{d}{d\tilde t}\tilde V = -\frac{1}{\sqrt h}\frac{\partial}{\partial V}L(V, U) = \frac{1}{\sqrt h}(Y - XUV^T)^TXU = \frac1h\Big(Y - \frac1h X\tilde U\tilde V^T\Big)^TX\tilde U = -\frac{\partial}{\partial \tilde V}L(\tilde V, \tilde U)\,.$$
Therefore, the gradient flow of $U, V$ on $L(V, U)$ w.r.t. time $t$ is equivalent to the gradient flow of $\tilde U, \tilde V$ on $L(\tilde V, \tilde U)$ w.r.t. the rescaled time $\tilde t = ht$. Another way to see the time-scale difference is the following: consider the gradient flow on $L(V, U)$ w.r.t. time $t$; we have
$$\frac{d}{dt}U(t) = -\frac{\partial}{\partial U}L(V(t), U(t)) \;\Leftrightarrow\; \frac{1}{\sqrt h}\frac{d}{dt}\tilde U(t) = -\frac{\partial}{\partial U}L(V(t), U(t)) \;\Leftrightarrow\; \frac{1}{\sqrt h}\frac{d}{dt}\tilde U(t) = -\sqrt h\frac{\partial}{\partial \tilde U}L(\tilde V(t), \tilde U(t)) \;\Leftrightarrow\; \frac{d}{dt}\tilde U(t) = -h\frac{\partial}{\partial \tilde U}L(\tilde V(t), \tilde U(t))\,, \tag{30}$$
where we used the fact that
$$\frac{\partial}{\partial U}L(V(t), U(t)) = -X^T\big(Y - XU(t)V^T(t)\big)V(t) = -\frac{1}{\sqrt h}X^T\Big(Y - \frac1h X\tilde U(t)\tilde V^T(t)\Big)\tilde V(t) = \sqrt h\frac{\partial}{\partial \tilde U}L(\tilde V(t), \tilde U(t))\,.$$
Similarly we have
$$\frac{d}{dt}V(t) = -\frac{\partial}{\partial V}L(V(t), U(t)) \;\Leftrightarrow\; \frac{d}{dt}\tilde V(t) = -h\frac{\partial}{\partial \tilde V}L(\tilde V(t), \tilde U(t))\,. \tag{31}$$
From (30) and (31) we see that the gradient flow on $L(V, U)$ w.r.t. time $t$ essentially runs the gradient flow on $L(\tilde V, \tilde U)$ at a rate accelerated by $h$. Such equivalence through time rescaling suggests that running gradient flow in our setting is $h$ times faster than in the NTK one. In Arora et al.
(2019b), as mentioned above, the network is scaled by a small constant $\kappa$ such that the trained network is within $\epsilon$-distance of the kernel predictor of its NTK in terms of prediction. As a consequence, the convergence rate is scaled by $\kappa^2$, which makes convergence slower. Therefore, our initialization scheme yields a similar result to Arora et al. (2019b), but the gradient flow is faster. Also, we note that this gap in the rate of convergence is not present in Du & Hu (2019), which focuses only on providing convergence guarantees for the algorithm; in that case the network is not scaled by a small $\kappa$, but generalization properties are not studied there.

F PROOF OF LEMMA 1 AND THEOREM 2

To prove Lemma 1 and Theorem 2, we use a basic result from random matrix theory.

Lemma F.1. Given $m, n \in \mathbb{N}$ with $m \le n$, let $A$ be an $n\times m$ random matrix with i.i.d. standard normal entries $A_{ij} \sim N(0, 1)$. For $\delta > 0$, with probability at least $1 - 2\exp(-\delta^2)$, we have
$$\sqrt n - (\sqrt m + \delta) \le \sigma_m(A) \le \sigma_1(A) \le \sqrt n + (\sqrt m + \delta)\,.$$
The proof can be found in Davidson & Szarek (2001, Theorem 2.13).

In this section, we show more general results under the following random initialization:
$$[U(0)]_{ij} \sim N\Big(0, \frac{1}{h^{2\alpha}}\Big)\,,\; 1\le i\le D,\, 1\le j\le h\,, \qquad [V(0)]_{ij} \sim N\Big(0, \frac{1}{h^{2\alpha}}\Big)\,,\; 1\le i\le m,\, 1\le j\le h\,,$$
where $\tfrac14 < \alpha \le \tfrac12$. It is easy to see that $\alpha = \tfrac12$ corresponds to the random initialization scheme shown in Section 4.2, i.e. all entries of $U(0), V(0)$ are mean-zero Gaussian with variance $h^{-1}$. Regarding the imbalance and the orthogonality condition, we have the following.

Lemma F.2. Let $\tfrac14 < \alpha \le \tfrac12$ and let the data matrix $X$ be given. $\forall\delta\in(0,1)$, $\forall h > h_0 = \mathrm{poly}\big(m, D, \tfrac1\delta\big)$, with probability at least $1-\delta$ over random initializations with $[U(0)]_{ij}, [V(0)]_{ij} \sim N(0, h^{-2\alpha})$, all of the following hold:

1. (Sufficient rank of imbalance)
$$\sigma_{n+m}\big(U_1^T(0)U_1(0) - V^T(0)V(0)\big) > h^{1-2\alpha} - 2\Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big)h^{\frac{1-4\alpha}{2}}\,,$$

2.
(Approximate orthogonality condition)
$$\Big\|\begin{bmatrix} V(0)U_2^T(0) \\ U_1(0)U_2^T(0)\end{bmatrix}\Big\|_F \le 2\sqrt{m+n}\Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big)h^{\frac{1-4\alpha}{2}}\,, \qquad \|U_1(0)V^T(0)\|_F \le 2\sqrt m\Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big)h^{\frac{1-4\alpha}{2}}\,.$$

From the lemma, we can see why our analysis only applies to the case $\tfrac14 < \alpha \le \tfrac12$: 1) if $\alpha > \tfrac12$, the lower bound we can obtain for $\sigma_{n+m}\big(U_1^T(0)U_1(0) - V^T(0)V(0)\big)$ decreases to zero as $h$ increases; 2) if $\alpha \le \tfrac14$, the orthogonality condition is not asymptotically satisfied as $h$ increases. From Lemma F.2 with $\alpha = \tfrac12$, we recover Lemma 1.

Proof of Lemma F.2. Consider the matrix $[V^T\; U^T]$, which is $h\times(m+D)$. Applying Lemma F.1 to the matrix $A = h^\alpha[V^T\; U^T]$, with probability at least $1-\delta$ we have
$$\sqrt h - \Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big) \le \sigma_{m+D}\big(h^\alpha[V^T\;U^T]\big) \le \sigma_1\big(h^\alpha[V^T\;U^T]\big) \le \sqrt h + \Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big)\,,$$
which leads to
$$h^{\frac12-\alpha} - \Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big)h^{-\alpha} \le \sigma_{m+D}\big([V^T\;U^T]\big) \le \sigma_1\big([V^T\;U^T]\big) \le h^{\frac12-\alpha} + \Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big)h^{-\alpha}\,. \tag{32}$$
Regarding the first inequality, write the imbalance as
$$U_1^TU_1 - V^TV = \begin{bmatrix} V^T & U^T\end{bmatrix}\begin{bmatrix} -I_m & 0 \\ 0 & \Phi_1\Phi_1^T\end{bmatrix}\begin{bmatrix} V \\ U\end{bmatrix}\,.$$
For $h > \big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\big)^2$, on the event (32) we have $\sigma_{m+D}\big([V^T\;U^T]\big) \ge h^{\frac12-\alpha} - \big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\big)h^{-\alpha} > 0$, and then, applying Lemma C.3 twice,
$$\sigma_{n+m}\big(U_1^TU_1 - V^TV\big) \ge \sigma_{n+m}\Big(\begin{bmatrix} -I_m & 0\\ 0 & \Phi_1\Phi_1^T\end{bmatrix}\begin{bmatrix} V\\U\end{bmatrix}\Big)\,\sigma_{m+D}\Big(\begin{bmatrix} V^T & U^T\end{bmatrix}\Big) \ge \sigma_{n+m}\Big(\begin{bmatrix} -I_m & 0\\ 0 & \Phi_1\Phi_1^T\end{bmatrix}\Big)\,\sigma_{m+D}^2\Big(\begin{bmatrix} V\\U\end{bmatrix}\Big) = \sigma_{m+D}^2\big([V^T\;U^T]\big)\,,$$
where the last equality is due to the fact that $\begin{bmatrix} -I_m & 0\\ 0 & \Phi_1\Phi_1^T\end{bmatrix}$ has exactly $n+m$ non-zero singular values, all equal to 1. Therefore, when $h > \big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\big)^2$, conditioned on the event (32), with probability 1 we have
$$\sigma_{n+m}\big(U_1^TU_1 - V^TV\big) \ge \sigma_{m+D}^2\big([V^T\;U^T]\big) \ge \Big(h^{\frac12-\alpha} - \big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\big)h^{-\alpha}\Big)^2 = h^{1-2\alpha} - 2\Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big)h^{\frac{1-4\alpha}{2}} + \Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big)^2h^{-2\alpha} > h^{1-2\alpha} - 2\Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big)h^{\frac{1-4\alpha}{2}}\,. \tag{33}$$
Regarding the second and third inequalities, using the fact that $\|A\|_F \le \sqrt{\min\{n, m\}}\,\|A\|_2$ for $A \in \mathbb{R}^{n\times m}$, we have
$$\frac{1}{\sqrt{m+n}}\Big\|\begin{bmatrix} VU_2^T \\ U_1U_2^T\end{bmatrix}\Big\|_F \le \Big\|\begin{bmatrix} VU_2^T\\ U_1U_2^T\end{bmatrix}\Big\|_2 = \Big\|\begin{bmatrix} I_m & 0\\ 0 & \Phi_1^T\end{bmatrix}\begin{bmatrix} V\\U\end{bmatrix}\begin{bmatrix} V^T & U^T\end{bmatrix}\begin{bmatrix} 0\\ \Phi_2\end{bmatrix}\Big\|_2 = \Big\|\begin{bmatrix} I_m & 0\\ 0 & \Phi_1^T\end{bmatrix}\Big(\begin{bmatrix} V\\U\end{bmatrix}\begin{bmatrix} V^T & U^T\end{bmatrix} - \eta I_{m+D}\Big)\begin{bmatrix} 0\\\Phi_2\end{bmatrix}\Big\|_2 \le \Big\|\begin{bmatrix} V\\U\end{bmatrix}\begin{bmatrix} V^T & U^T\end{bmatrix} - \eta I_{m+D}\Big\|_2 \quad \text{for any } \eta \in \mathbb{R}\,,$$
where the second equality uses the fact that $\begin{bmatrix} I_m & 0\\ 0 & \Phi_1^T\end{bmatrix}\begin{bmatrix} 0\\\Phi_2\end{bmatrix} = 0$, and
$$\frac{1}{\sqrt m}\|U_1V^T\|_F \le \|U_1V^T\|_2 = \Big\|\begin{bmatrix} 0 & \Phi_1^T\end{bmatrix}\begin{bmatrix} V\\U\end{bmatrix}\begin{bmatrix} V^T & U^T\end{bmatrix}\begin{bmatrix} I_m\\0\end{bmatrix}\Big\|_2 = \Big\|\begin{bmatrix} 0 & \Phi_1^T\end{bmatrix}\Big(\begin{bmatrix} V\\U\end{bmatrix}\begin{bmatrix} V^T & U^T\end{bmatrix} - \eta I_{m+D}\Big)\begin{bmatrix} I_m\\0\end{bmatrix}\Big\|_2 \le \Big\|\begin{bmatrix} V\\U\end{bmatrix}\begin{bmatrix} V^T & U^T\end{bmatrix} - \eta I_{m+D}\Big\|_2 \quad \text{for any } \eta \in \mathbb{R}\,,$$
where the second equality uses the fact that $\begin{bmatrix} 0 & \Phi_1^T\end{bmatrix}\begin{bmatrix} I_m\\0\end{bmatrix} = 0$. Notice that
$$\Big\|\begin{bmatrix} V\\U\end{bmatrix}\begin{bmatrix} V^T & U^T\end{bmatrix} - \eta I_{m+D}\Big\|_2 = \max_i\big|\sigma_i^2\big([V^T\;U^T]\big) - \eta\big|\,.$$
Again let $h > \big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\big)^2$. When the event (32) happens, all $\sigma_i^2\big([V^T\;U^T]\big)$ lie in the interval
$$\bigg[\Big(h^{\frac12-\alpha} - \big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\big)h^{-\alpha}\Big)^2\,,\; \Big(h^{\frac12-\alpha} + \big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\big)h^{-\alpha}\Big)^2\bigg]\,.$$
Since the choice of $\eta$ is arbitrary, we pick
$$\eta = h^{1-2\alpha} + \Big(\big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\big)h^{-\alpha}\Big)^2\,, \tag{34}$$
the midpoint of this interval, and then
$$\max_i\big|\sigma_i^2\big([V^T\;U^T]\big) - \eta\big| \le 2\Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big)h^{\frac{1-4\alpha}{2}}\,, \tag{35}$$
which yields the second and third inequalities. Conditioned on the event (32), events (33) and (35) happen with probability 1; hence the probability that both (33) and (35) happen is at least the probability of the event (32), which is at least $1-\delta$.

More generally, for the readers' interest, we show that all the non-zero singular values of the imbalance concentrate around $h^{1-2\alpha}$ as $h$ increases; for the case $\alpha = \tfrac12$, they concentrate around 1, as suggested by the following claim.

Claim F.1. Let $\tfrac14 < \alpha \le \tfrac12$ and let the data matrix $X$ be given. $\forall\delta\in(0,1)$, $\forall h > h_0 = \mathrm{poly}\big(m, D, \tfrac1\delta\big)$, with probability at least $1-\delta$ over random initializations with $[U(0)]_{ij}, [V(0)]_{ij} \sim N(0, h^{-2\alpha})$, both of the following hold:
$$\sigma_{n+m}\big(U_1^T(0)U_1(0) - V^T(0)V(0)\big) > \Big(h^{\frac12-\alpha} - \big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\big)h^{-\alpha}\Big)^2\,, \tag{36}$$
$$\sigma_1\big(U_1^T(0)U_1(0) - V^T(0)V(0)\big) \le \Big(h^{\frac12-\alpha} + \big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\big)h^{-\alpha}\Big)^2\,. \tag{37}$$

Proof.
For readability we simply write $U(0), U_1(0), V(0)$ as $U, U_1, V$, and assume the width condition $h > h_0 = \mathrm{poly}\big(m, D, \tfrac1\delta\big)$ is satisfied. Condition on the event (32). The lower bound (36) on the $(n+m)$-th singular value has been shown in (33). For the upper bound (37), notice that
$$\sigma_1\big(U_1^TU_1 - V^TV\big) = \sigma_1\Big(\begin{bmatrix} V^T & U^T\end{bmatrix}\begin{bmatrix} -I_m & 0\\ 0 & \Phi_1\Phi_1^T\end{bmatrix}\begin{bmatrix} V\\U\end{bmatrix}\Big) \le \sigma_1\Big(\begin{bmatrix} -I_m & 0\\ 0 & \Phi_1\Phi_1^T\end{bmatrix}\Big)\,\sigma_1^2\big([V^T\;U^T]\big) \le \sigma_1^2\big([V^T\;U^T]\big)\,,$$
where again we use the fact that $\begin{bmatrix} -I_m & 0\\ 0 & \Phi_1\Phi_1^T\end{bmatrix}$ has exactly $n+m$ non-zero singular values, all equal to 1. Conditioned on the event (32), we have
$$\sigma_1\big(U_1^TU_1 - V^TV\big) \le \sigma_1^2\big([V^T\;U^T]\big) \le \Big(h^{\frac12-\alpha} + \big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\big)h^{-\alpha}\Big)^2\,,$$
which is (37). Therefore (36) and (37) hold with probability at least $1-\delta$.

With Lemma F.2, we have the following result regarding the generalization properties of wide single-hidden-layer linear networks. Notice that the result is presented under the random initialization where all entries of $U(0), V(0)$ are sampled from $N(0, h^{-2\alpha})$, $\tfrac14 < \alpha \le \tfrac12$.

Theorem F.1. Let $\tfrac14 < \alpha \le \tfrac12$ and let $(V(t), U(t), t > 0)$ be a trajectory of the continuous dynamics (7). Then $\exists C > 0$ such that $\forall\delta\in(0,1)$, $\forall h > h_0^{1/(4\alpha-1)}$ with $h_0 = \mathrm{poly}\big(m, D, \tfrac1\delta, \tfrac{\sigma_1(\Sigma_x)}{\sigma_n^3(\Sigma_x)}\big)$, with probability $1-\delta$ over random initializations with $[U(0)]_{ij}, [V(0)]_{ij} \sim N(0, h^{-2\alpha})$, we have
$$\|U(\infty)V^T(\infty) - \Theta\|_2 \le 2C^{1/h^{1-2\alpha}}\sqrt{m+n}\Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big)h^{\frac{1-4\alpha}{2}}\,, \tag{38}$$
where $C$ depends on the data $X, Y$.

From Theorem F.1 with $\alpha = \tfrac12$, we recover Theorem 2.

Theorem 2 (Generalization of wide single-hidden-layer linear networks, restated). Let $(V(t), U(t), t > 0)$ be a trajectory of the continuous dynamics (7). Then $\exists C > 0$ such that $\forall\delta\in(0,1)$, $\forall h > h_0 = \mathrm{poly}\big(m, D, \tfrac1\delta, \tfrac{\sigma_1(\Sigma_x)}{\sigma_n^3(\Sigma_x)}\big)$, with probability $1-\delta$ over random initializations with $[U(0)]_{ij}, [V(0)]_{ij} \sim N(0, h^{-1})$, we have
$$\|U(\infty)V^T(\infty) - \Theta\|_2 \le \frac{2C\sqrt{m+n}\big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\big)}{\sqrt h}\,,$$
where $C$ depends on the data $X, Y$. It now remains only to prove Theorem F.1.

Proof of Theorem F.1.
From the continuous dynamics (7) and Theorem 1, the stationary point $U(\infty), V(\infty)$ satisfies $U_1(\infty)V^T(\infty) = \Phi_1^T\Theta$ and $U_2(\infty) = U_2(0)$. Hence we have
$$\|U(\infty)V^T(\infty) - \Theta\|_2 = \|\Phi_1U_1(\infty)V^T(\infty) + \Phi_2U_2(\infty)V^T(\infty) - \Theta\|_2 = \|\Phi_1\Phi_1^T\Theta + \Phi_2U_2(\infty)V^T(\infty) - \Theta\|_2 = \|\Phi_2U_2(0)V^T(\infty)\|_2 = \|U_2(0)V^T(\infty)\|_2 \le \|U_2(0)V^T(\infty)\|_F\,.$$
Consider the following dynamics,
$$\frac{d}{dt}\underbrace{\begin{bmatrix} V(t)U_2^T(0)\\ U_1(t)U_2^T(0)\end{bmatrix}}_{:=Z(t)} = \underbrace{\begin{bmatrix} 0 & E^T(t)\Sigma_x^{1/2}\\ \Sigma_x^{1/2}E(t) & 0\end{bmatrix}}_{:=A_Z(t)}\begin{bmatrix} V(t)U_2^T(0)\\ U_1(t)U_2^T(0)\end{bmatrix}\,,$$
which is a time-varying linear system; note that this $Z(t)$ is different from the one in the proof of Theorem 1, while by Horn & Johnson (2012, Theorem 7.3.3) we again have $\|A_Z(t)\|_2 = \|\Sigma_x^{1/2}E(t)\|_2$. By Grönwall's inequality (Lemma C.1), we have
$$\|Z(t)\|_F \le \exp\Big(\int_0^t \sigma_1^{1/2}(\Sigma_x)\|E(\tau)\|_F\,d\tau\Big)\|Z(0)\|_F\,. \tag{40}$$
Moreover, for $h > h_0$ with $h_0$ from Lemma F.2, the least non-zero singular value of the imbalance can be further lower-bounded as
$$c = \sigma_{n+m-1}\big(U_1^T(0)U_1(0) - V^T(0)V(0)\big) \ge \tfrac12 h^{1-2\alpha}\,; \tag{41}$$
here $h_0$ is larger than the one from Lemma 1 because in (41) we want the least non-zero singular value of the imbalance to be further bounded below by $\tfrac12 h^{1-2\alpha}$. From Theorem 1, we have $\|E(t)\|_F^2 \le \exp(-2\sigma_n(\Sigma_x)ct)\|E(0)\|_F^2$, and then by (41),
$$\|E(t)\|_F^2 \le \exp\big(-h^{1-2\alpha}\sigma_n(\Sigma_x)t\big)\|E(0)\|_F^2 \;\Rightarrow\; \|E(t)\|_F \le \exp\big(-h^{1-2\alpha}\sigma_n(\Sigma_x)t/2\big)\|E(0)\|_F\,, \tag{42}$$
while by Lemma F.2 the initial condition satisfies
$$\|Z(0)\|_F = \Big\|\begin{bmatrix} V(0)U_2^T(0)\\ U_1(0)U_2^T(0)\end{bmatrix}\Big\|_F \le 2\sqrt{m+n}\Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big)h^{\frac{1-4\alpha}{2}}\,. \tag{43}$$
Finally, from (40), we have
$$\|Z(t)\|_F \le \exp\Big(\sigma_1^{1/2}(\Sigma_x)\|E(0)\|_F\int_0^\infty\exp\big(-h^{1-2\alpha}\sigma_n(\Sigma_x)\tau/2\big)\,d\tau\Big)\|Z(0)\|_F = \exp\Big(\frac{2\,\sigma_1^{1/2}(\Sigma_x)}{h^{1-2\alpha}\,\sigma_n(\Sigma_x)}\|E(0)\|_F\Big)\|Z(0)\|_F\,. \tag{44}$$
The initial error depends on the initialization, but can be upper bounded as
$$\|E(0)\|_F = \big\|W^TY - \Sigma_x^{-1/2}U_1(0)V^T(0)\big\|_F \le \|W^TY\|_F + \big\|\Sigma_x^{-1/2}U_1(0)V^T(0)\big\|_F \le \|Y\|_F + \sigma_n^{-1/2}(\Sigma_x)\|U_1(0)V^T(0)\|_F\,,$$
so we can write (44) as
$$\|Z(t)\|_F \le \exp\Big(\frac{2\,\sigma_1^{1/2}(\Sigma_x)}{\sigma_n(\Sigma_x)}\|Y\|_F\Big)^{1/h^{1-2\alpha}}\exp\Big(\frac{2\,\sigma_1^{1/2}(\Sigma_x)}{\sigma_n^{3/2}(\Sigma_x)}\|U_1(0)V^T(0)\|_F\Big)^{1/h^{1-2\alpha}}\|Z(0)\|_F\,. \tag{45}$$
Note that $h > h_0^{1/(4\alpha-1)} \ge h_0$, hence the width condition for (41), (42), and (43) to hold is satisfied.
Finally, by (42) and (46), we can write (45) as
$$\|Z(t)\|_F \le \underbrace{\exp\Big(1 + \frac{2\,\sigma_1^{1/2}(\Sigma_x)}{\sigma_n(\Sigma_x)}\|Y\|_F\Big)}_{:=C}{}^{1/h^{1-2\alpha}}\,\|Z(0)\|_F \le C^{1/h^{1-2\alpha}}\cdot 2\sqrt{m+n}\Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big)h^{\frac{1-4\alpha}{2}} = 2C^{1/h^{1-2\alpha}}\sqrt{m+n}\Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big)h^{\frac{1-4\alpha}{2}}\,.$$
Therefore, for some $C > 0$ that depends on the data $(X, Y)$, given any $0 < \delta < 1$, when $h > h_0^{1/(4\alpha-1)}$ as defined above, with probability at least $1-\delta$ we have
$$\|U(\infty)V^T(\infty) - \Theta\|_2 \le \|U_2(0)V^T(\infty)\|_F \le \sup_{t>0}\|U_2(0)V^T(t)\|_F \le \sup_{t>0}\|Z(t)\|_F \le 2C^{1/h^{1-2\alpha}}\sqrt{m+n}\Big(\sqrt{m+D} + \sqrt{\tfrac12\log\tfrac2\delta}\Big)h^{\frac{1-4\alpha}{2}}\,.$$
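The concentration statements behind this proof (Lemma F.2 and Claim F.1 with $\alpha = \tfrac12$) are easy to probe numerically: as $h$ grows, $\sigma_{n+m}$ of the initial imbalance approaches 1 and $\|U_1(0)V^T(0)\|_F$ shrinks. A sketch (our own sizes and seed; to avoid an $h\times h$ decomposition, the non-zero singular values of the imbalance are obtained from the small matrix $S\,BB^T$, which shares its non-zero eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, m = 20, 60, 3
Phi1T = np.linalg.svd(rng.standard_normal((n, D)), full_matrices=False)[2]
S = np.diag(np.r_[np.ones(n), -np.ones(m)])   # signature matrix of the imbalance

def init_stats(h):
    U = rng.normal(0, h ** -0.5, (D, h))
    V = rng.normal(0, h ** -0.5, (m, h))
    B = np.vstack([Phi1T @ U, V])             # (n+m) x h
    # nonzero eigenvalues of U1^T U1 - V^T V = B^T S B equal those of S B B^T;
    # since the imbalance is symmetric, its singular values are their moduli
    lam = np.sort(np.abs(np.linalg.eigvals(S @ B @ B.T)))
    sigma_nm = lam[0]                         # sigma_{n+m} of the imbalance
    return sigma_nm, np.linalg.norm(Phi1T @ U @ V.T)  # and ||U1(0) V(0)^T||_F

s_small, orth_small = init_stats(200)
s_big, orth_big = init_stats(20000)
```

At larger width the smallest non-zero imbalance singular value is closer to 1 and the orthogonality defect is smaller, as Lemma F.2 predicts.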



CONCLUSION

In this paper, we studied the explicit role of initialization in controlling the convergence and generalization of single-hidden-layer linear networks trained under gradient flow. First, initializing the imbalance to have sufficient rank leads to exponential convergence of the loss. Second, proper initialization forces the trajectory of the network parameters to be exactly (or approximately) constrained to a low-dimensional manifold, over which minimizing the loss yields the min-norm solution. Combining these results, we obtain an $O(h^{-1/2})$ non-asymptotic bound on the distance between trained wide linear networks under random initialization and the min-norm solution. Our analysis, although carried out on a simpler overparametrized model, formally connects overparametrization, initialization, and optimization with generalization performance. We believe it is promising to translate concepts such as the imbalance and constrained learning to multi-layer linear networks, and eventually to neural networks with nonlinear activations.

(We write $U(t), V(t)$ as $U, V$ for simplicity; the same applies to $\tilde U(t), \tilde V(t)$.)



Figure 2: Convergence of gradient descent with different initial levels of imbalance, $c := \sigma_{n+m-1}\big(U_1^T(0)U_1(0) - V^T(0)V(0)\big)$.

Figure 3: Implicit regularization of wide single-hidden-layer linear networks. $\|U_2(0)V^T(0)\|_F$ is the initial distance from the end-to-end function to the desired manifold discussed in Section 4.1; $\|U(t_f)V^T(t_f) - \Theta\|$ is the distance between the end-to-end function and the min-norm solution when the algorithm stops. Each line plots the average over 10 runs for each $h$, and the error bars show the standard deviation.

In Appendix B, $d = \operatorname{rank}(X)$, and in Section 3 we assume that $d = n < D$. Notice that we always have $W^TW = I$.

To bound the second exponential in (45), we let $h_0 := \max\big\{h_0,\ 16\,\sigma_1(\Sigma_x)/\sigma_n^3(\Sigma_x)\big\}$, so that
$$\exp\Big(\frac{2\,\sigma_1^{1/2}(\Sigma_x)}{\sigma_n^{3/2}(\Sigma_x)}\|U_1(0)V^T(0)\|_F\Big)^{1/h^{1-2\alpha}} \le \exp(1)\,. \tag{46}$$

Here we have $\|A_Z(t)\|_2 = \|\Sigma_x^{1/2}E(t)\|_2$; notice that this $Z(t)$ is different from the one in the proof of Theorem 1. By Grönwall's inequality (Lemma C.1), we obtain (40).


then $u(t) \le u(0)\exp\big(\int_0^t\beta(\tau)\,d\tau\big)$, $\forall t > 0$.

The next lemma is known as Weyl's inequality for singular values.

Lemma C.2 (Weyl's inequality for singular values). Let $A, B \in \mathbb{R}^{n\times m}$ and let $q = \min\{n, m\}$. Then
$$\sigma_{i+j-1}(A + B) \le \sigma_i(A) + \sigma_j(B)$$
for any $i, j$ satisfying $1 \le i, j \le q$ and $i + j - 1 \le q$. The proof can be found in Horn & Johnson (1994, Theorem 3.3.16).

Using Weyl's inequality, we state and prove a lemma (Lemma C.3) that is used for proving Theorem 2. Its proof treats the cases of $k$ separately: in the first case we get the desired inequality directly; when $k > n$, the same argument still leads to the desired result; and when $k < n$, one considers replacing $A$ with $\begin{bmatrix} A & 0_{(n-k)\times n}\end{bmatrix}$.

We also state a trace inequality widely used in control problems.

Lemma C.4. Suppose $A, B \in \mathbb{R}^{n\times n}$, where $A$ is symmetric and $B$ is positive semidefinite. Then
$$\lambda_{\min}(A)\operatorname{tr}(B) \le \operatorname{tr}(AB) \le \lambda_{\max}(A)\operatorname{tr}(B)\,.$$
If both $A$ and $B$ are positive semidefinite, then $\operatorname{tr}(AB) \ge \sigma_n(A)\operatorname{tr}(B)$. The proof can be found in Sheng-De Wang et al. (1986, Lemma 1).
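Both matrix inequalities are easy to sanity-check numerically; the following sketch (our own, using NumPy) draws random matrices and verifies Weyl's inequality for one admissible index pair and the trace bounds of Lemma C.4:

```python
import numpy as np

rng = np.random.default_rng(0)
ok = True
for _ in range(200):
    # Weyl's inequality for singular values (Lemma C.2), here with i=2, j=3
    A = rng.standard_normal((5, 7))
    B = rng.standard_normal((5, 7))
    i, j = 2, 3                       # i + j - 1 = 4 <= q = 5
    sAB = np.linalg.svd(A + B, compute_uv=False)
    sA = np.linalg.svd(A, compute_uv=False)
    sB = np.linalg.svd(B, compute_uv=False)
    ok &= bool(sAB[i + j - 2] <= sA[i - 1] + sB[j - 1] + 1e-9)

    # Trace bounds (Lemma C.4): M symmetric, P positive semidefinite
    M = rng.standard_normal((6, 6)); M = (M + M.T) / 2
    C = rng.standard_normal((6, 6)); P = C @ C.T
    lam = np.linalg.eigvalsh(M)       # eigenvalues of M, ascending
    t = np.trace(M @ P)
    ok &= bool(lam[0] * np.trace(P) - 1e-9 <= t <= lam[-1] * np.trace(P) + 1e-9)
```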

D PROOF OF THEOREM 1

We begin by restating the theorem.

Theorem 1 (Convergence of linear networks with sufficient rank of imbalance, restated). Suppose $h \ge m + n - 1$. Let $V(t), U_1(t), t > 0$ be the trajectory of the continuous dynamics (7) starting from some $V(0), U_1(0)$. If
$$\sigma_{n+m-1}\big(U_1^T(0)U_1(0) - V^T(0)V(0)\big) = c > 0\,,$$
then
$$\|E(t)\|_F^2 \le \exp\big(-2\sigma_n(\Sigma_x)ct\big)\|E(0)\|_F^2\,, \quad \forall t > 0\,.$$
Additionally, $V(t), U_1(t), t > 0$ converges to some equilibrium point $(V(\infty), U_1(\infty))$ such that $E(V(\infty), U_1(\infty)) = 0$.

