TOWARDS RESOLVING THE IMPLICIT BIAS OF GRADIENT DESCENT FOR MATRIX FACTORIZATION: GREEDY LOW-RANK LEARNING

Abstract

Matrix factorization is a simple and natural test-bed for investigating the implicit regularization of gradient descent. Gunasekar et al. (2017) conjectured that Gradient Flow with infinitesimal initialization converges to the solution that minimizes the nuclear norm, but a series of recent papers argued that the language of norm minimization is not sufficient to give a full characterization of the implicit regularization. In this work, we provide theoretical and empirical evidence that, for depth-2 matrix factorization, gradient flow with infinitesimal initialization is mathematically equivalent to a simple heuristic rank minimization algorithm, Greedy Low-Rank Learning, under some reasonable assumptions. This generalizes the rank minimization view from previous works to a much broader setting and enables us to construct counter-examples to refute the conjecture from Gunasekar et al. (2017). We also extend the results to the case where the depth is ≥ 3, and we show that the benefit of being deeper is that the above convergence has a much weaker dependence on the initialization magnitude, so that this rank minimization is more likely to take effect for initialization with practical scale. *Alphabetical author ordering.

1. INTRODUCTION

There are usually far more learnable parameters in deep neural nets than the number of training data, but deep learning still works well on real-world tasks. Even with explicit regularization, the model complexity of state-of-the-art neural nets is so large that they can easily fit randomly labeled data (Zhang et al., 2017). Towards explaining the mystery of generalization, we must understand what kind of implicit regularization Gradient Descent (GD) imposes during training. Ideally, we hope for a clean mathematical characterization of how GD constrains the set of functions that can be expressed by a trained neural net. As a direct analysis of deep neural nets could be quite hard, a line of works turned to studying the implicit regularization on simpler problems for inspiration, for example, low-rank matrix factorization, a fundamental problem in machine learning and information processing. Given a set of observations about an unknown matrix W* ∈ R^{d×d} of rank r* ≪ d, one needs to find a low-rank solution W that is compatible with the given observations. Examples include matrix sensing, matrix completion, phase retrieval, and robust principal component analysis, just to name a few (see Chi et al. 2019 for a survey). When W* is symmetric and positive semidefinite, one way to solve all these problems is to parameterize W as W = UUᵀ for U ∈ R^{d×r} and optimize L(U) := ½ f(UUᵀ), where f(·) is some empirical risk function depending on the observations, and r is the rank constraint. In theory, if the rank constraint is too loose, the solutions do not have to be low-rank and we may fail to recover W*. However, even in the case where the rank is unconstrained (i.e., r = d), GD with small initialization can still achieve good performance in practice. This empirical observation reveals that the implicit regularization of GD exists even in this simple matrix factorization problem, but its mechanism is still under debate. Gunasekar et al.
(2017) proved that Gradient Flow (GD with infinitesimal step size, a.k.a. GF) with infinitesimal initialization finds the minimum nuclear norm solution in a special case of matrix sensing, and further conjectured that this holds in general.

Conjecture 1.1 (Gunasekar et al. 2017, informal). With sufficiently small initialization, GF converges to the minimum nuclear norm solution of matrix sensing.

Subsequently, Arora et al. (2019a) challenged this view by arguing that a simple mathematical norm may not be a sufficient language for characterizing implicit regularization. One example illustrated in Arora et al. (2019a) concerns matrix sensing with a single observation. They showed that GD with small initialization enhances the growth of large singular values of the solution and attenuates that of smaller ones. This enhancement/attenuation effect encourages low rank, and it is further intensified with depth in deep matrix factorization (i.e., GD optimizes f(U₁⋯U_L) for L ≥ 2). However, these effects are not captured by the nuclear norm alone. Gidel et al. (2019); Gissin et al. (2020) further exploited this idea and showed in the special case of full-observation matrix sensing that GF learns solutions with gradually increasing rank. Razin and Cohen (2020) showed in a simple class of matrix completion problems that GF decreases the rank along the trajectory while any norm grows towards infinity. More aggressively, they conjectured that the implicit regularization can be explained by rank minimization rather than norm minimization.

Our Contributions.

In this paper, we move one step further towards resolving the implicit regularization in the matrix factorization problem. Our theoretical results show that GD performs rank minimization via a greedy process in a broader setting. Specifically, we provide theoretical evidence that GF with infinitesimal initialization is in general mathematically equivalent to another algorithm called Greedy Low-Rank Learning (GLRL). At a high level, GLRL is a greedy algorithm that performs rank-constrained optimization and relaxes the rank constraint by 1 whenever it fails to reach a global minimizer of f(·) with the current rank constraint. As a by-product, we refute Conjecture 1.1 by demonstrating a counterexample (Example 5.9). We also extend our results to deep matrix factorization in Section 6, where we prove that the trajectory of GF with infinitesimal identity initialization converges to a deep version of GLRL, at least in the early stage of the optimization. We also use this result to confirm the intuition obtained from toy models (Gissin et al., 2020) that the benefit of depth in matrix factorization is to encourage rank minimization even for initialization with a relatively larger scale, so that it is more likely to happen in practice. This shows that describing the implicit regularization using GLRL is more expressive than using the language of norm minimization. We validate all our results with experiments in Appendix E.

2. RELATED WORKS

Norm Minimization. The view of norm minimization, or the closely related view of margin maximization, has been explored in different settings. Besides the nuclear norm minimization for matrix factorization (Gunasekar et al., 2017) discussed in the introduction, previous works have also studied norm minimization/margin maximization for linear regression (Wilson et al., 2017; Soudry et al., 2018a;b; Nacson et al., 2019b;c; Ji and Telgarsky, 2019b), deep linear neural nets (Ji and Telgarsky, 2019a; Gunasekar et al., 2018), homogeneous neural nets (Nacson et al., 2019a; Lyu and Li, 2020), and ultra-wide neural nets (Jacot et al., 2018; Arora et al., 2019b; Chizat and Bach, 2020).

Small Initialization and Rank Minimization. The initialization scale can greatly influence the implicit regularization. A sufficiently large initialization can make the training dynamics fall into the lazy training regime defined by Chizat et al. (2019) and diminish test accuracy. Using small initialization is particularly important for biasing gradient descent towards low-rank solutions in matrix factorization, as empirically observed by Gunasekar et al. (2017). Arora et al. (2019a); Gidel et al. (2019); Gissin et al. (2020); Razin and Cohen (2020) studied how gradient flow with small initialization encourages low rank in simple settings, as discussed in the introduction. Li et al. (2018) proved recovery guarantees for gradient flow solving matrix sensing under the Restricted Isometry Property (RIP), but the proof cannot be generalized easily to the case without RIP. Belabbas (2020) made attempts to prove that gradient flow is approximately rank-1 in the very early phase of training, but this does not exclude the possibility that the approximation error explodes later and gradient flow does not converge to low-rank solutions. Compared to these works, the current paper studies how GF encourages low rank in a much broader setting.

3. BACKGROUND

Notations. For two matrices A, B, we define ⟨A, B⟩ := Tr(ABᵀ) as their inner product. We use ‖A‖_F, ‖A‖_* and ‖A‖₂ to denote the Frobenius norm, the nuclear norm and the largest singular value of A respectively. For a matrix A ∈ R^{d×d}, we use λ₁(A), …, λ_d(A) to denote the eigenvalues of A in decreasing order (if they are all real). We define S_d as the set of symmetric d × d matrices and S_d⁺ ⊆ S_d as the set of positive semidefinite (PSD) matrices. We write A ⪰ B or B ⪯ A iff A − B is PSD. We use S⁺_{d,r} and S⁺_{d,≤r} to denote the sets of d × d PSD matrices with rank equal to r and rank at most r respectively.

Matrix Factorization. The matrix factorization problem asks one to optimize L(U, V) := ½ f(UVᵀ) over U, V ∈ R^{d×r}, where f : R^{d×d} → R is a convex function, and in this paper we assume f is C³-smooth. A notable example is matrix sensing. There is an unknown rank-r* matrix W* ∈ R^{d×d} with r* ≪ d. Given m measurements X₁, …, X_m ∈ R^{d×d}, one observes y_i := ⟨X_i, W*⟩ through each measurement. The goal of matrix sensing is to reconstruct W* by minimizing f(W) := ½ Σ_{i=1}^m (⟨W, X_i⟩ − y_i)². Matrix completion is a notable special case of matrix sensing in which every measurement has the form X_i = e_{p_i} e_{q_i}ᵀ, where {e₁, …, e_d} is the standard basis (i.e., exactly one entry is observed through each measurement). Note that matrix factorization in the general case can be reduced to the symmetric case: let Ū = [U; V] ∈ R^{2d×r} and f̄([A B; C D]) := ½ f(B) + ½ f(Cᵀ); then f̄(ŪŪᵀ) = f(UVᵀ). So in this paper we focus on the symmetric case as in previous works (Gunasekar et al., 2017), i.e., finding a low-rank solution for the convex optimization problem min_{W ⪰ 0} f(W). For this, we parameterize W as W = UUᵀ for U ∈ R^{d×r} and optimize L(U) := ½ f(UUᵀ). We assume WLOG throughout this paper that f(Wᵀ) = f(W); otherwise, we can set f̂(W) := ½(f(W) + f(Wᵀ)) so that f̂(Wᵀ) = f̂(W) while L(U) = ½ f̂(UUᵀ) is unaffected.
This assumption makes ∇f (W ) symmetric for every symmetric W .
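The reduction above from the asymmetric to the symmetric case can be checked numerically. A minimal sketch (the linear test loss f(W) = ⟨W, Q⟩ and the dimensions are our own illustrative choices, not from the paper):

```python
import numpy as np

# Check the reduction from asymmetric factorization f(U V^T) to the symmetric
# problem fbar(Ubar Ubar^T), where Ubar = [U; V] stacks the two factors and
# fbar([[A, B], [C, D]]) = (f(B) + f(C^T)) / 2.
rng = np.random.default_rng(0)
d, r = 4, 2
U, V = rng.standard_normal((d, r)), rng.standard_normal((d, r))
Q = rng.standard_normal((d, d))          # toy loss: f(W) = <W, Q>

def f(W):
    return np.sum(W * Q)

def fbar(Wbar):
    B = Wbar[:d, d:]                     # top-right block = U V^T
    C = Wbar[d:, :d]                     # bottom-left block = V U^T
    return 0.5 * f(B) + 0.5 * f(C.T)

Ubar = np.vstack([U, V])                 # (2d) x r stacked factor
assert np.isclose(fbar(Ubar @ Ubar.T), f(U @ V.T))
```

Since the off-diagonal blocks of ŪŪᵀ are exactly UVᵀ and VUᵀ, the two objectives agree for any f.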

Gradient Flow.

In this paper, we analyze Gradient Flow (GF) for symmetric matrix factorization, defined as the solution of the following ODE for U(t) ∈ R^{d×r}:

dU/dt = −∇L(U) = −∇f(UUᵀ)U.    (1)

Let W(t) = U(t)U(t)ᵀ ∈ R^{d×d}. Then the following end-to-end dynamics holds for W(t):

dW/dt = −W∇f(W) − ∇f(W)W =: g(W).    (2)

We use φ(W₀, t) to denote the matrix W(t) in (2) when W(0) = W₀ ⪰ 0. Throughout this paper, we assume φ(W₀, t) exists for all t ∈ R and W₀ ⪰ 0. It is easy to prove that U is a stationary point of L(·) (i.e., ∇L(U) = 0) iff W = UUᵀ is a critical point of (2) (i.e., g(W) = 0); see Lemma C.1 for a proof. If W is a minimizer of f(·) in S_d⁺, then W is a critical point of (2), but the reverse may not be true, e.g., g(0) = 0, but 0 is not necessarily a minimizer. In this paper, we particularly focus on the overparameterized case, where r = d, to understand the implicit regularization of GF when there is no rank constraint on the matrix W.
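The consistency between the parameter dynamics (1) and the end-to-end dynamics (2) can be sanity-checked with one Euler step. A sketch, using the full-observation loss f(W) = ½‖W − W*‖²_F as an illustrative choice (step size and dimensions are arbitrary):

```python
import numpy as np

# One Euler step of dU/dt = -grad_f(U U^T) U, checking that the induced change
# in W = U U^T matches g(W) = -W grad_f(W) - grad_f(W) W to first order in eta.
rng = np.random.default_rng(1)
d = 5
A = rng.standard_normal((d, d))
Wstar = A @ A.T                          # PSD target for the toy loss

def grad_f(W):                           # gradient of f(W) = 0.5*||W - Wstar||_F^2
    return W - Wstar

U = 0.1 * rng.standard_normal((d, d))
eta = 1e-5
W0 = U @ U.T
U1 = U - eta * grad_f(W0) @ U            # Euler step of (1)
W1 = U1 @ U1.T
g_W0 = -W0 @ grad_f(W0) - grad_f(W0) @ W0
# finite-difference derivative of W matches g(W) up to O(eta)
assert np.allclose((W1 - W0) / eta, g_W0, atol=1e-2)
```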

4. WARMUP EXAMPLES

First, we illustrate how GD performs greedy learning using two warmup examples.

Linearization Around the Origin. In general, for a loss function L(U) = ½ f(UUᵀ), we can always apply the Taylor expansion f(W) ≈ f(0) + ⟨W, ∇f(0)⟩ around the origin to approximate it with a linear function. This motivates us to study the linear case: f(W) := f₀ − ⟨W, Q⟩ for some symmetric matrix Q. In this case, the matrix U follows the ODE dU/dt = QU, which can be understood as a continuous version of the classical power iteration method for finding the top eigenvector. Let Q := Σ_{i=1}^d µ_i v_i v_iᵀ be the eigendecomposition of Q, where µ₁ ≥ µ₂ ≥ ⋯ ≥ µ_d and v₁, …, v_d are orthogonal to each other. Then we can write the solution as U(t) = e^{tQ} U(0) = Σ_{i=1}^d e^{µ_i t} v_i v_iᵀ U(0). When µ₁ > µ₂, the ratio between e^{µ₁t} and e^{µ_i t} for i ≠ 1 increases exponentially fast. As t → +∞, U(t) and W(t) become approximately rank-1 as long as v₁ᵀ U(0) ≠ 0, i.e.,

lim_{t→∞} e^{−µ₁t} U(t) = v₁v₁ᵀ U(0),    lim_{t→∞} e^{−2µ₁t} W(t) = (v₁ᵀ W(0) v₁) v₁v₁ᵀ.

The analysis of this simple linear case reveals that GD encourages low rank through a process similar to power iteration. However, f(W) is non-linear in general, and the linear approximation is close to f(W) only if W is very small. With sufficiently small initialization, we can imagine that GD still resembles the above power iteration in the early phase of the optimization. But what if W(t) grows so large that the linear approximation is far from the actual f(W)?

Full-observation Matrix Sensing. To understand the dynamics of GD when the linearization fails, we now consider a well-studied special case (Gissin et al., 2020): L(U) = ½ f(UUᵀ), f(W) = ½ ‖W − W*‖²_F for some unknown PSD matrix W*. GF in this case can be written as dU/dt = (W* − UUᵀ)U, or dW/dt = (W* − W)W + W(W* − W). Let W* := Σ_{i=1}^d µ_i v_i v_iᵀ be the eigendecomposition of W*.
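The power-iteration behavior of the linearized warmup example can be reproduced numerically. A sketch (the symmetric Q and the initialization scale are our own illustrative choices):

```python
import numpy as np

# For f(W) = f0 - <W, Q>, gradient flow on U is dU/dt = Q U, with closed-form
# solution U(t) = e^{tQ} U(0). With an eigengap mu1 > mu2, the rescaled iterate
# e^{-mu1 t} U(t) converges to v1 v1^T U(0), so W(t) = U(t) U(t)^T becomes
# numerically rank-1.
rng = np.random.default_rng(2)
d = 4
S = rng.standard_normal((d, d))
Q = (S + S.T) / 2                        # arbitrary symmetric matrix
mu, V = np.linalg.eigh(Q)                # eigenvalues in ascending order
mu, V = mu[::-1], V[:, ::-1]             # re-sort so mu[0] = mu1 is the largest

U0 = 0.01 * rng.standard_normal((d, d))
t = 20.0 / (mu[0] - mu[1])               # long enough for the gap to dominate
# e^{-t*mu1} U(t), computed stably by shifting the spectrum by mu1
Ut_scaled = V @ np.diag(np.exp(t * (mu - mu[0]))) @ V.T @ U0
v1 = V[:, :1]
assert np.allclose(Ut_scaled, v1 @ v1.T @ U0, atol=1e-6)
W_scaled = Ut_scaled @ Ut_scaled.T       # e^{-2 mu1 t} W(t)
s = np.linalg.svd(W_scaled, compute_uv=False)
assert s[1] / s[0] < 1e-10               # approximately rank-1
```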
Our previous analysis shows that the dynamics is approximately dU/dt = W*U in the early phase and thus encourages low rank. To get a sense of the later phases, we simplify the setting by specifying U(0) = √α I for a small number α. We can write W(0) and W* as diagonal matrices W(0) = diag(α, α, …, α), W* = diag(µ₁, µ₂, …, µ_d) with respect to the basis v₁, …, v_d. It is easy to see that W(t) is always a diagonal matrix, since the time derivatives of the off-diagonal coordinates stay 0 during training. Let W(t) = diag(σ₁(t), σ₂(t), …, σ_d(t)); then σ_i(t) satisfies the dynamical equation dσ_i/dt = 2σ_i(t)(µ_i − σ_i(t)), and thus σ_i(t) = αµ_i / (α + (µ_i − α)e^{−2µ_i t}). This shows that every σ_i(t) increases from α to µ_i over time. As α → 0, every σ_i(t) has a sharp transition from near 0 to near µ_i at time roughly (1/(2µ_i) + o(1)) log(1/α), which can be seen from the following limit:

lim_{α→0} σ_i((1/(2µ_i) + c) log(1/α)) = lim_{α→0} αµ_i / (α + (µ_i − α)α^{1+2cµ_i}) = 0 for c ∈ (−1/(2µ_i), 0), and = µ_i for c ∈ (0, +∞).

This means that for every q ∈ (1/(2µ_i), 1/(2µ_{i+1})) for i = 1, …, d−1 (or q ∈ (1/(2µ_d), +∞) for i = d),

lim_{α→0} W(q log(1/α)) = diag(µ₁, µ₂, …, µ_i, 0, 0, …, 0).

Therefore, when the initialization is sufficiently small, GF learns each component of W* one by one, according to the relative order of the eigenvalues. At a high level, this shows a greedy nature of GD: GD starts learning with simple models, and whenever it underfits, it increases the model complexity (which is rank in our case). This is also called sequential learning or incremental learning (Gidel et al., 2019; Gissin et al., 2020). However, it is unclear how and why this sequential learning/incremental learning occurs in general. Through the first warmup example, we may understand why GD learns a rank-1 matrix in the early phase, but does GD always learn solutions with rank 2, 3, 4, … sequentially? If true, what is the mechanism behind this?
The current paper answers these questions by providing both theoretical and empirical evidence that the greedy learning behavior does occur in general, for a reason similar to that in the first warmup example.
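The closed form for σ_i(t) and the sharp sequential transitions can be verified numerically. A sketch with hypothetical eigenvalues (4, 2, 1) and α = 1e-8:

```python
import numpy as np

# sigma_i(t) = alpha*mu_i / (alpha + (mu_i - alpha)*exp(-2*mu_i*t)) solves
# d sigma_i/dt = 2*sigma_i*(mu_i - sigma_i) with sigma_i(0) = alpha.
alpha = 1e-8
mus = np.array([4.0, 2.0, 1.0])          # toy eigenvalues of W*

def sigma(t):
    return alpha * mus / (alpha + (mus - alpha) * np.exp(-2.0 * mus * t))

# check the ODE via a central finite difference at an arbitrary time t0
t0, h = 2.0, 1e-6
lhs = (sigma(t0 + h) - sigma(t0 - h)) / (2 * h)
rhs = 2 * sigma(t0) * (mus - sigma(t0))
assert np.allclose(lhs, rhs, rtol=1e-4)

# sequential learning: at time q*log(1/alpha) with 1/(2*mu_1) < q < 1/(2*mu_2),
# the top component is learned while the others are still near zero
q = 0.5 * (1 / (2 * mus[0]) + 1 / (2 * mus[1]))
W_q = sigma(q * np.log(1 / alpha))
assert abs(W_q[0] - mus[0]) < 0.05 and W_q[1] < 0.05 and W_q[2] < 0.05
```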

5. GREEDY LOW-RANK LEARNING (GLRL)

In this section, we present a trajectory-based analysis of the implicit bias of GF on matrix factorization. Our main result is that the trajectory of GF with infinitesimal initialization is generically the same as that of a simple greedy algorithm, Greedy Low-Rank Learning (GLRL, Algorithm 1). See Appendix A for a comparison with existing greedy algorithms for rank-constrained optimization.

Algorithm 1: Greedy Low-Rank Learning (GLRL)
  parameters: step size η > 0; small ε > 0
  r ← 0, W_0 ← 0 ∈ R^{d×d}, U_0(∞) ∈ R^{d×0}
  while λ₁(−∇f(W_r)) > 0 do
    r ← r + 1
    u_r ← unit top eigenvector of −∇f(W_{r−1})
    U_r(0) ← [U_{r−1}(∞), √ε u_r] ∈ R^{d×r}
    for t = 0, 1, … do
      U_r(t+1) ← U_r(t) − η ∇L(U_r(t))
    W_r ← U_r(∞) U_r(∞)ᵀ   (in practice, we approximate the infinite time limit by running sufficiently many steps)
  return W_r

The GLRL algorithm consists of several phases, numbered from 1. In phase r, GLRL increases the rank constraint to r and optimizes L(U_r) := ½ f(U_r U_rᵀ) over U_r ∈ R^{d×r} via GD until it reaches a stationary point U_r(∞), i.e., ∇L(U_r(∞)) = 0. At convergence, W_r := U_r(∞) U_r(∞)ᵀ is a critical point of (2), and we call it the r-th critical point of GLRL. If W_r is further a minimizer of f(·) in S_d⁺, or equivalently, λ₁(−∇f(W_r)) ≤ 0 (see Lemma C.2), GLRL returns W_r; otherwise GLRL enters phase r + 1. To set the initial point of GD in phase r, GLRL appends a small column vector δ_r ∈ R^d to the stationary point U_{r−1}(∞) from the last phase, i.e., U_r(0) ← [U_{r−1}(∞), δ_r] ∈ R^{d×r} (in the case of r = 1, U_1(0) ← [δ_1] ∈ R^{d×1}). In this way, U_r(0) U_r(0)ᵀ = W_{r−1} + δ_r δ_rᵀ is perturbed away from the (r−1)-th critical point. In GLRL, we set δ_r = √ε u_r, where u_r is the top eigenvector of −∇f(W_{r−1}) with unit norm ‖u_r‖₂ = 1, and ε > 0 is a parameter controlling the magnitude of the perturbation (preferably very small).
Note that it is guaranteed that λ₁(−∇f(W_{r−1})) > 0; otherwise W_{r−1} would be a minimizer of the convex function f(·) in S_d⁺ and GLRL would have exited before phase r.

Trajectory of GLRL. We define the (limiting) trajectory of GLRL by taking the learning rate η → 0. The goal is to show that the trajectory of GLRL is close to that of GF with infinitesimal initialization. Recall that φ(W₀, t) stands for the solution W(t) of (2) when W(0) = W₀.

Definition 5.1 (Trajectory of GLRL). Let W_{0,ε} := 0 be the 0-th critical point of GLRL. For every r ≥ 1, if the (r−1)-th critical point W_{r−1,ε} exists and is not a minimizer of f(·) in S_d⁺, we define W^G_{r,ε}(t) := φ(W_{r−1,ε} + ε u_{r,ε} u_{r,ε}ᵀ, t), where u_{r,ε} is a top eigenvector of −∇f(W_{r−1,ε}) with unit norm, ‖u_{r,ε}‖₂ = 1. We define W_{r,ε} := lim_{t→+∞} W^G_{r,ε}(t) to be the r-th critical point of GLRL if the limit exists.

Throughout this paper, we focus on the case where the top eigenvalue of every −∇f(W_{r−1,ε}) is unique. In this case, the trajectory of GLRL is unique for every ε > 0, since the normalized top eigenvectors can only be ±u_{r,ε}, and both of them lead to the same W^G_{r,ε}(t).
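A minimal implementation of Algorithm 1 is sketched below for the full-observation loss f(W) = ½‖W − W*‖²_F (the ground truth, step size, step counts and tolerances are our own illustrative choices, not from the paper):

```python
import numpy as np

# Sketch of GLRL: grow the rank by one whenever lambda_1(-grad f(W_r)) > 0,
# seed the new column with sqrt(eps) times the top eigenvector, then run GD.
rng = np.random.default_rng(3)
d, eps, eta = 6, 1e-6, 0.1
Qmat, _ = np.linalg.qr(rng.standard_normal((d, d)))
Wstar = 4 * np.outer(Qmat[:, 0], Qmat[:, 0]) + 2 * np.outer(Qmat[:, 1], Qmat[:, 1])

def grad_f(W):                           # gradient of 0.5*||W - Wstar||_F^2
    return W - Wstar

def glrl():
    U = np.zeros((d, 0))                 # U_0: d x 0 matrix, so W_0 = 0
    while True:
        lam, vecs = np.linalg.eigh(-grad_f(U @ U.T))   # ascending eigenvalues
        if lam[-1] <= 1e-8:              # lambda_1(-grad f(W_r)) <= 0: done
            return U @ U.T
        u = vecs[:, -1:]                 # unit top eigenvector of -grad f(W_r)
        U = np.hstack([U, np.sqrt(eps) * u])           # append sqrt(eps) * u_r
        for _ in range(5000):            # inner GD on L(U) = 0.5 f(U U^T)
            U = U - eta * grad_f(U @ U.T) @ U

W = glrl()
assert np.linalg.matrix_rank(W, tol=1e-4) == 2        # recovers the rank of W*
assert np.allclose(W, Wstar, atol=1e-6)
```

On this rank-2 ground truth, the sketch runs exactly two phases and returns the global minimizer.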

5.1. THE LIMITING TRAJECTORY: A GENERAL THEOREM FOR DYNAMICAL SYSTEM

To prove the equivalence between GF and GLRL, we first introduce our high-level idea by analyzing the behavior of a more general dynamical system around a critical point, say 0:

dθ/dt = g(θ), where g(0) = 0.    (6)

A specific example is (2) if we set θ to be the vectorization of W. We use φ(θ₀, t) to denote the value of θ(t) in the case of θ(0) = θ₀. We assume that g(θ) is C²-smooth with J(θ) being its Jacobian matrix, and that φ(θ₀, t) exists for all θ₀ and t. For ease of presentation, in the main text we assume J(0) is diagonalizable over R and defer the same result for the general case to Appendix G.3. Let J(0) = Ṽ D̃ Ṽ⁻¹ be the eigendecomposition, where Ṽ is an invertible matrix and D̃ = diag(µ̃₁, …, µ̃_d) is the diagonal matrix consisting of the eigenvalues µ̃₁ ≥ µ̃₂ ≥ ⋯ ≥ µ̃_d. Let Ṽ = (ṽ₁, …, ṽ_d) and Ṽ⁻¹ = (ũ₁, …, ũ_d)ᵀ; then ũ_i, ṽ_i are the left and right eigenvectors associated with µ̃_i, and ũ_iᵀ ṽ_j = δ_ij. We can rewrite the eigendecomposition as J(0) = Σ_{i=1}^d µ̃_i ṽ_i ũ_iᵀ. We also assume the top eigenvalue µ̃₁ is positive and unique. Note that µ̃₁ > 0 means the critical point θ = 0 is unstable, and in matrix factorization it means 0 is a strict saddle point of L(·).

The key observation is that if the initialization is infinitesimal, the trajectory is almost uniquely determined. To be more precise, we need the following definition:

Definition 5.2. For any θ₀ ∈ R^d and u ∈ R^d, we say that {θ_α}_{α∈(0,1)} converges to θ₀ with positive alignment with u if lim_{α→0} θ_α = θ₀ and lim inf_{α→0} ⟨(θ_α − θ₀)/‖θ_α − θ₀‖₂, u⟩ > 0.

A special case is when the direction of θ_α − θ₀ converges, i.e., θ̄ := lim_{α→0} (θ_α − θ₀)/‖θ_α − θ₀‖₂ exists. In this case, {θ_α} has positive alignment with either u or −u except for a zero-measure subset of θ̄. This means any convergent sequence generically falls into one of these two categories.
The following theorem shows that if the initial point θ_α converges to 0 with positive alignment with ũ₁ as α → 0, then the trajectory starting from θ_α converges to a unique trajectory z(t) := lim_{α→0} φ(αṽ₁, t + (1/µ̃₁) log(1/α)). By symmetry, there is another unique trajectory for sequences {θ_α} with positive alignment with −ũ₁, which is z′(t) := lim_{α→0} φ(−αṽ₁, t + (1/µ̃₁) log(1/α)). This is somewhat surprising: different initial points should lead to very different trajectories, but our analysis shows that generically there are only two limiting trajectories for infinitesimal initialization. We will soon see how this theorem helps in our analysis for matrix factorization in Sections 5.2 and 5.3.

Theorem 5.3. Let z_α(t) := φ(αṽ₁, t + (1/µ̃₁) log(1/α)) for every α > 0. Then z(t) := lim_{α→0} z_α(t) exists and is also a solution of (6), i.e., z(t) = φ(z(0), t). If δ_α converges to 0 with positive alignment with ũ₁ as α → 0, then for all t ∈ R there is a constant C > 0 such that

‖φ(δ_α, t + (1/µ̃₁) log(1/⟨δ_α, ũ₁⟩)) − z(t)‖₂ ≤ C · ‖δ_α‖₂^{γ/(µ̃₁+γ)}

for every sufficiently small α, where γ := µ̃₁ − µ̃₂ > 0 is the eigenvalue gap.

Proof sketch. The main idea is to linearize the dynamics near the origin, as we did for the first warmup example. For sufficiently small θ, by Taylor expansion of g(θ), the dynamics is approximately dθ/dt ≈ J(0)θ, which can be understood as a continuous version of power iteration. If the linear approximation were exact, then θ(t) = e^{tJ(0)} θ(0). For large enough t₀, e^{t₀J(0)} = Σ_{i=1}^d e^{µ̃_i t₀} ṽ_i ũ_iᵀ = e^{µ̃₁t₀} ṽ₁ũ₁ᵀ + O(e^{µ̃₂t₀}). Therefore, as long as the initial point θ(0) has a positive inner product with ũ₁, θ(t₀) should be very close to εṽ₁ for some ε > 0, and the rest of the trajectory after t₀ should be close to the trajectory starting from εṽ₁. However, there is a tradeoff: we should choose t₀ to be large enough so that the power iteration takes effect; but if t₀ is so large that the norm of θ(t₀) reaches a constant scale, then the linearization fails unavoidably. Nevertheless, if the initialization scale is sufficiently small, we show via a careful error analysis that there is always a suitable choice of t₀ such that θ(t₀) is well approximated by εṽ₁ and the difference between θ(t₀ + t) and φ(εṽ₁, t) is bounded as well. We defer the details to Appendix G.
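Theorem 5.3 can be illustrated on a toy two-dimensional system. In the sketch below (the cubic saturation term and all constants are our own choices, not from the paper), two tiny initial points with positive alignment with ũ₁ = e₁ land near the same point after the time shift (1/µ̃₁) log(1/⟨θ₀, ũ₁⟩):

```python
import numpy as np

# Toy instance of the general dynamics: g(theta) = J0 theta - ||theta||^2 theta,
# with J0 = diag(mu1, mu2), so v1 = u1 = e1 and the eigengap is mu1 - mu2.
mu1, mu2 = 1.0, 0.5
J0 = np.diag([mu1, mu2])

def flow(theta, T, dt=1e-3):
    # forward-Euler integration of d theta/dt = J0 theta - ||theta||^2 theta
    for _ in range(int(T / dt)):
        theta = theta + dt * (J0 @ theta - np.dot(theta, theta) * theta)
    return theta

def shifted(theta0, t):
    # phi(theta0, t + (1/mu1) * log(1/<theta0, u1>)) with u1 = e1
    return flow(theta0, t + np.log(1.0 / theta0[0]) / mu1)

a = shifted(np.array([1e-5, 0.0]), 2.0)   # initialized exactly along v1
b = shifted(np.array([1e-6, 1e-6]), 2.0)  # different scale and direction
assert np.linalg.norm(a - b) < 1e-2       # both are close to the same z(2)
```

Despite differing in scale and direction, both time-shifted trajectories collapse onto the single limiting trajectory z(t), as the theorem predicts.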

5.2. EQUIVALENCE BETWEEN GD AND GLRL: RANK-ONE CASE

Now we establish the equivalence between GF and GLRL in the first phase. The main idea is to apply Theorem 5.3 to (2). For this, we need the following lemma on the eigenvalues and eigenvectors.

Lemma 5.4. Let g(W) := −W∇f(W) − ∇f(W)W and let J(W) be its Jacobian. Then J(0) is symmetric and thus diagonalizable. Let −∇f(0) = Σ_{i=1}^d µ_i u_{1[i]} u_{1[i]}ᵀ be the eigendecomposition of the symmetric matrix −∇f(0), where µ₁ ≥ µ₂ ≥ ⋯ ≥ µ_d. Then J(0) has the form J(0)[∆] = Σ_{i=1}^d Σ_{j=1}^d (µ_i + µ_j) ⟨∆, u_{1[i]} u_{1[j]}ᵀ⟩ u_{1[i]} u_{1[j]}ᵀ, where J(0)[∆] stands for the matrix produced by left-multiplying J(0) with the vectorization of ∆. For every pair 1 ≤ i ≤ j ≤ d, µ_i + µ_j is an eigenvalue of J(0) and u_{1[i]} u_{1[j]}ᵀ + u_{1[j]} u_{1[i]}ᵀ is a corresponding eigenvector. All the other eigenvalues are 0.

We simplify the notation by letting u₁ := u_{1[1]}. A direct corollary of Lemma 5.4 is that u₁u₁ᵀ is the top eigenvector of J(0). According to Theorem 5.3, there are now only two types of trajectories, corresponding to infinitesimal initialization W_α → 0 with positive alignment with u₁u₁ᵀ or −u₁u₁ᵀ. As the initialization must be PSD, W_α → 0 cannot have positive alignment with −u₁u₁ᵀ. For the former case, Theorem 5.6 below states that, for every fixed time t, the GF solution φ(W_α, T(W_α) + t) after shifting by a time offset T(W_α) := (1/(2µ₁)) log(⟨W_α, u₁u₁ᵀ⟩⁻¹) converges to the GLRL solution W^G₁(t) as W_α → 0. The only assumption for this result is that 0 is not a minimizer of f(·) in S_d⁺ (which is equivalent to λ₁(−∇f(0)) > 0) and that −∇f(0) has an eigenvalue gap. In the full-observation case, this assumption is satisfied as long as the ground-truth matrix has a unique top eigenvalue. The proof of Theorem 5.6 is deferred to Appendix I.1.

Assumption 5.5. µ₁ > max{µ₂, 0}, where µ₁ := λ₁(−∇f(0)), µ₂ := λ₂(−∇f(0)).

Theorem 5.6. Under Assumption 5.5, the following limit W^G₁(t) exists and is a solution of (2).
W^G₁(t) := lim_{ε→0} W^G_{1,ε}((1/(2µ₁)) log(1/ε) + t) = lim_{ε→0} φ(ε u₁u₁ᵀ, (1/(2µ₁)) log(1/ε) + t).

Let {W_α} ⊆ S_d⁺ be PSD matrices converging to 0 with positive alignment with u₁u₁ᵀ as α → 0, that is, lim_{α→0} W_α = 0 and ∃ α₀, q > 0 such that ⟨W_α, u₁u₁ᵀ⟩ ≥ q‖W_α‖_F for all α < α₀. Then for all t ∈ R, there is a constant C > 0 such that

‖φ(W_α, (1/(2µ₁)) log(1/⟨W_α, u₁u₁ᵀ⟩) + t) − W^G₁(t)‖_F ≤ C‖W_α‖_F^{γ/(2µ₁+γ)}    (10)

for every sufficiently small α, where γ := 2µ₁ − (µ₁ + µ₂) = µ₁ − µ₂.

It is worth noting that W^G₁(t) has rank ≤ 1 for every t ∈ R, since every W^G_{1,ε}(t) has rank ≤ 1 and the set S⁺_{d,≤1} is closed. This matches the first warmup example: GD does start learning with rank-1 solutions. Interestingly, in the case where the limit W₁ := lim_{t→+∞} W^G₁(t) happens to be a minimizer of f(·) in S_d⁺, GLRL exits with the rank-1 solution W₁ after the first phase, and the following theorem shows that this is also the solution found by GF.

Assumption 5.7. f(W) is locally analytic at each point.

Theorem 5.8. Under Assumptions 5.5 and 5.7, if ‖W^G₁(t)‖_F is bounded for all t ≥ 0, then the limit W₁ := lim_{t→+∞} W^G₁(t) exists. Further, if W₁ is a minimizer of f(·) in S_d⁺, then for PSD matrices {W_α} ⊆ S_d⁺ converging to 0 with positive alignment with u₁u₁ᵀ as α → 0, it holds that lim_{α→0} lim_{t→+∞} φ(W_α, t) = W₁.

Assumption 5.7 is natural, since f(·) in most cases of matrix factorization is a quadratic or polynomial function (e.g., matrix sensing, matrix completion). In general, it is unlikely for a gradient-based optimization process to get stuck at saddle points (Lee et al., 2017; Panageas et al., 2019). Thus, we should expect to see in general that GLRL finds a rank-1 solution whenever the problem is feasible with rank-1 matrices. This means that, at least for this subclass of problems, the implicit regularization of GD is rather unrelated to norm minimization.
Below is a concrete example.

Example 5.9 (Counterexample to Conjecture 1.1 of Gunasekar et al. 2017). Theorem 5.8 enables us to construct counterexamples to the implicit nuclear norm regularization conjecture in (Gunasekar et al., 2017). The idea is to construct a loss L : R^{d×d} → R where every rank-1 stationary point of L(U) attains the global minimum but none of them minimizes the nuclear norm. Below we give a concrete matrix completion problem that meets this requirement. Let M be a partially observed matrix to be recovered, where the entries in Ω = {(1, 3), (1, 4), (2, 3), (3, 1), (3, 2), (4, 1)} are observed and the others (marked with "?") are unobserved. The optimization problem is defined formally by L(U) = ½ f(UUᵀ), f(W) = ½ Σ_{(i,j)∈Ω} (W_ij − M_ij)², where

M = [ ?  ?  1  R
      ?  ?  R  ?
      1  R  ?  ?
      R  ?  ?  ? ],

M_norm = [ R  1  1  R
           1  R  R  1
           1  R  R  1
           R  1  1  R ],

M_rank = [ 1  R   1  R
           R  R²  R  R²
           1  R   1  R
           R  R²  R  R² ].

Here R > 1 is a large constant, e.g., R = 100. The minimum nuclear norm solution is the rank-2 matrix M_norm, which has ‖M_norm‖_* = 4R (which is 400 when R = 100). M_rank is a rank-1 solution with a much larger nuclear norm, ‖M_rank‖_* = 2R² + 2 (which is 20002 when R = 100). We can verify that f(·) satisfies Assumptions 5.5 and 5.7 and that W^G₁(t) converges to the rank-1 solution M_rank. Therefore, GF with infinitesimal initialization converges to M_rank rather than M_norm, which refutes the conjecture in (Gunasekar et al., 2017). See Appendix D for a formal statement.
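The claims in Example 5.9 are easy to verify numerically; a sketch (using 0-indexed entries):

```python
import numpy as np

# Check: both candidate completions fit the observed entries, M_rank has rank 1,
# and its nuclear norm (2R^2 + 2) far exceeds the minimum value 4R attained by
# the rank-2 matrix M_norm.
R = 100.0
obs = {(0, 2): 1, (0, 3): R, (1, 2): R, (2, 0): 1, (2, 1): R, (3, 0): R}

M_norm = np.array([[R, 1, 1, R],
                   [1, R, R, 1],
                   [1, R, R, 1],
                   [R, 1, 1, R]])
v = np.array([1.0, R, 1.0, R])
M_rank = np.outer(v, v)                  # rank-1 completion v v^T

for (i, j), val in obs.items():
    assert M_norm[i, j] == val and M_rank[i, j] == val

nuc = lambda M: np.linalg.svd(M, compute_uv=False).sum()
assert np.isclose(nuc(M_norm), 4 * R)               # = 400
assert np.isclose(nuc(M_rank), 2 * R**2 + 2)        # = 20002
assert np.linalg.matrix_rank(M_rank) == 1
```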

5.3. EQUIVALENCE BETWEEN GD AND GLRL: GENERAL CASE

Theorem 5.6 shows that for any fixed time t, the trajectory of GLRL in the first phase approximates GF with infinitesimal initialization, i.e., W^G₁(t) = lim_{α→0} W_α(t), where W_α(t) := φ(W_α, (1/(2µ₁)) log(⟨W_α, u₁u₁ᵀ⟩⁻¹) + t). However, W^G₁(∞) = lim_{α→0} W_α(∞) does not hold in general, unless the prerequisite of Theorem 5.8 is satisfied, i.e., unless W₁ = W^G₁(∞) is a minimizer of f(·) in S_d⁺. This is because of the well-known result that GD only converges to local minimizers (Lee et al., 2016; 2017). We adapt Theorem 2 of Lee et al. (2017) to the setting of GF (Theorem I.5) and obtain the following result (Theorem 5.10); see Appendix I.4 for the proof.

Theorem 5.10. Let f : R^{d×d} → R be a convex C²-smooth function. (1) All stationary points of L : R^{d×d} → R, L(U) = ½ f(UUᵀ), are either strict saddles or global minimizers. (2) For any random initialization, GF (1) converges to strict saddles of L(U) with probability 0.

Therefore, for convex f(·) such as matrix sensing and completion, if f(·) has no rank-1 PSD minimizer, then no matter how small α is, W_α(∞) (if it exists) is a minimizer of f(·) with a higher rank and is thus away from the rank-1 matrix W₁. In other words, W^G₁(t) only describes the limiting trajectory of GF in the first phase, i.e., when GF goes from near 0 to near W₁. After a sufficiently long time (depending on α), GF escapes the critical point W₁, but this is not described by W^G₁(t). To understand how GF escapes W₁, a priori, we need to know how GF approaches W₁. Using an argument similar to that of Theorem 5.3, Theorem 5.11 shows that generically GF only escapes in the direction of v₁v₁ᵀ, where v₁ is the (unique) top eigenvector of −∇f(W₁), and thus the limiting trajectory exactly matches that of GLRL in the second phase until GF gets close to another critical point W₂ ∈ S⁺_{d,≤2}.
If W₂ is still not a minimizer of f(·) in S_d⁺ (though generically it is a local minimizer in S⁺_{d,≤2}), then GF escapes W₂ and the above process repeats until W_K is a minimizer in S_d⁺ for some K. Here by "generically" we hide some technical assumptions, which we elaborate on in Appendix J. See Figure 1 and Figure 2 for experimental verification of the equivalence between GD and GLRL.

Figure 1: We plot dist(t) = min_{t′∈T} ‖W_GD(t) − W_GLRL(t′)‖_F for different initialization scales ‖W(0)‖_F, where T is a discrete subset of R that δ-covers the entire trajectory of GLRL: max_t min_{t′∈T} ‖W_GLRL(t) − W_GLRL(t′)‖_F ≤ δ for δ ≈ 0.00042. For each ‖W(0)‖_F, we run 20 random seeds and plot them separately. The ground truth W* ∈ R^{20×20} is a randomly generated rank-3 matrix with ‖W*‖_F = 20; 30% of the entries are observed. See more in Appendix E.1.

Theorem 5.11 (Theorem I.2, informal). Let W be a critical point of (2) such that W is a local minimizer of f(·) in S⁺_{d,≤r} for some r ≥ 1 but not a minimizer in S_d⁺. Let −∇f(W) = Σ_{i=1}^d µ_i v_i v_iᵀ be the eigendecomposition of −∇f(W). If µ₁ > µ₂ and if there exists a time T_α ∈ R for every α so that φ(W_α, T_α) converges to W with positive alignment with the top principal component v₁v₁ᵀ as α → 0, then for every fixed t, lim_{α→0} φ(W_α, T_α + (1/(2µ₁)) log(1/⟨φ(W_α, T_α), v₁v₁ᵀ⟩) + t) exists and is equal to W^G(t) := lim_{ε→0} φ(W + ε v₁v₁ᵀ, (1/(2µ₁)) log(1/ε) + t).

Characterization of the trajectory of GF. We end this section with the following characterization: generically, the trajectory of GF with small initialization can be split into K phases by K + 1 critical points of (2), {W_r}_{r=0}^K (W_0 = 0), where in phase r GF escapes from W_{r−1} in the direction of the top principal component of −∇f(W_{r−1}) and gets close to W_r. Each W_r is a local minimizer of f(·) in S⁺_{d,≤r}, but none of them is a minimizer of f(·) in S_d⁺ except W_K. The smaller the initialization is, the longer GF stays around each W_r.
Moreover, {W_r}_{r=0}^K corresponds to {W_{r,ε}}_{r=0}^K in Definition 5.1 with infinitesimal ε > 0.

6. BENEFITS OF DEPTH: A VIEW FROM GLRL

In this section, we consider matrix factorization problems with depth L ≥ 3. Our goal is to understand the effect of the depth-L parametrization W = U_1U_2···U_L on the implicit bias: how does depth encourage GF to find low-rank solutions? We adopt the standard assumption in existing analyses of the end-to-end dynamics that the weight matrices have a balanced initialization, i.e., U_{i+1}(0)^⊤U_{i+1}(0) = U_i(0)U_i(0)^⊤ for all 1 ≤ i ≤ L-1. Arora et al. (2018) showed that if {U_i}_{i=1}^L is balanced at initialization, then we have the following end-to-end dynamics. Similar to the depth-2 case, we use φ(W(0), t) to denote W(t), where

dW/dt = -Σ_{i=0}^{L-1} (WW^⊤)^{i/L} ∇f(W) (W^⊤W)^{(L-1-i)/L}. (11)

The lemma below is the foundation of our analysis for the deep case, and it greatly simplifies (11). Due to the space limit, we defer its derivation and applications to Appendix K.

Lemma 6.1. For M(t) := W(t)^{2/L}, we have dM/dt = -∇f(M^{L/2})M^{L/2} - M^{L/2}∇f(M^{L/2}).

Our main result, Theorem 6.2, characterizes the limiting trajectory for deep matrix factorization with infinitesimal identity initialization. Here W̄(t) := lim_{α→0} W^G_α(t) is the trajectory of deep GLRL, where W^G_α(t) := φ(αe_1e_1^⊤, α^{-(1-1/P)}/(2μ_1(P-1)) + t) and μ_1 := λ_1(-∇f(0)) (see Algorithm 2). The dynamics for general initialization is more complicated; see the discussion in Appendix L.

Theorem 6.2. Let P = L/2, L ≥ 3. Suppose ||∇f(0)||_2 = λ_1(-∇f(0)) > max{λ_2(-∇f(0)), 0}. Then for every fixed t ∈ R,

||φ(αI, α^{-(1-1/P)}/(2μ_1(P-1)) + t) - W̄(t)||_F = O(α^{1/(P(P+1))}),

and for any 2 ≤ k ≤ d and every fixed t ∈ R,

λ_k(φ(αI, α^{-(1-1/P)}/(2μ_1(P-1)) + t)) = O(α).

So how does depth encourage GF to find low-rank solutions? When the ground truth is low-rank, say rank-k, our experiments (Figure 2) suggest that GF with small initialization finds solutions with smaller k-low-rankness compared to the depth-2 case, thus achieving better generalization.
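To make the end-to-end dynamics (11) concrete, here is a minimal numerical sketch of its right-hand side, assuming a symmetric PSD W so that the fractional matrix powers can be taken via an eigendecomposition; the function names and the setup are ours, not the authors':

```python
import numpy as np

def frac_power(S, p):
    # p-th power of a symmetric PSD matrix via its eigendecomposition
    w, V = np.linalg.eigh(S)
    w = np.clip(w, 0.0, None)
    return (V * w ** p) @ V.T

def end_to_end_rhs(W, grad_f, L):
    # Right-hand side of the end-to-end dynamics (11):
    #   dW/dt = -sum_{i=0}^{L-1} (W W^T)^{i/L} grad_f(W) (W^T W)^{(L-1-i)/L}
    G = grad_f(W)
    WWt, WtW = W @ W.T, W.T @ W
    return -sum(frac_power(WWt, i / L) @ G @ frac_power(WtW, (L - 1 - i) / L)
                for i in range(L))
```

As a quick sanity check, for L = 2 and symmetric PSD W this reduces to -(∇f(W)W + W∇f(W)), i.e., the depth-2 flow (2).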
At first glance, this seems to contradict what Theorem 6.2 suggests, i.e., that the convergence rate of deep GLRL at a constant time gets slower as the depth increases. However, it turns out the uniform upper bound on the distance between GF and GLRL is not the right metric for the eventual k-low-rankness of the learned solution. Below we illustrate why the r-low-rankness of GF within each phase r is a better metric and how the two differ.

Definition 6.3 (r-low-rankness). For a matrix M ∈ R^{d×d}, we define the r-low-rankness of M as Σ_{i=r+1}^d σ_i^2(M), where σ_i(M) is the i-th largest singular value of M.

Figure 2: GD passes by the same set of critical points as GLRL when the initialization scale is small, and gets much closer to the critical points when L ≥ 3. Depth-2 GD requires a much smaller initialization scale to maintain small low-rankness. Here the ground truth matrix W* ∈ R^{20×20} is of rank 3 as stated in Appendix E.1. In this case, GLRL has 3 phases and 4 critical points {W_r}_{r=0}^3, where W_0 = 0 and W_3 = W*. For each depth L and initialization scale ||W(0)||_F, we plot the distance between the current step of GD and the closest critical point of GLRL, ||W_GD(t) - W_r||_F, the norm of the full gradient, ||∇_{U_{1:L}} L(U_{1:L})||_F, and the (r+1)-low-rankness of W_GD(t), where r := argmin_{0≤i≤3} ||W_GD(t) - W_i||_F.

Suppose f(·) admits a unique minimizer W_0 in S^+_{d,1}, and we run GF from αI for both the depth-2 and depth-L cases. Intuitively, the 1-low-rankness of the depth-2 solution is Ω(α^{1-μ_2/μ_1}), which can be seen from the second warmup example in Section 4. For the depth-L solution, though it may diverge from the trajectory of deep GLRL more than the depth-2 solution does, its 1-low-rankness is only O(α), as shown in Theorem 6.4. The key idea is to show that there is a basin in the manifold of rank-1 matrices around W_0 such that any GF starting within the basin converges to W_0.
Based on this, we can prove that starting from any matrix O(α)-close to the basin, GF converges to a solution O(α)-close to W_0. See Appendix M for more details.

Theorem 6.4. In the same setting as Theorem 6.2, if W̄(∞) exists and is a minimizer of f(·) in S^+_{d,≤1}, then under regularity assumption M.1 we have inf_{t∈R} ||φ(αI, t) - W̄(∞)||_F = O(α).

Interpretation of the advantage of depth with multiple phases. For depth-2 GLRL, the low-rankness is raised to some power less than 1 per phase (depending on the eigengap). For deep GLRL, we show that the low-rankness is only multiplied by some constant in the first phase, and we conjecture the same holds for later phases. This conjecture is supported by our experiments; see Figure 2. Interestingly, our theory and experiments (Figure 5) suggest that while being deep is good for generalization, being much deeper may not be much better: once L ≥ 3, increasing the depth does not improve the order of low-rankness significantly. While this theoretical result is only for identity initialization, Theorem F.1 and Corollary F.2 further show that the dynamics of GF (11) with any initialization converges pointwise as L → ∞, under a suitable time rescaling. See Figure 6 for experimental verification.
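Definition 6.3 is straightforward to compute from a singular value decomposition; a small helper following the definition as stated (the function name is ours):

```python
import numpy as np

def low_rankness(M, r):
    # r-low-rankness per Definition 6.3: the sum of the squared singular
    # values of M beyond the r largest; it is zero iff rank(M) <= r
    s = np.linalg.svd(np.asarray(M, dtype=float), compute_uv=False)
    return float(np.sum(s[r:] ** 2))
```

For example, the 1-low-rankness of any rank-1 matrix is 0, while for the 3×3 identity it equals 2.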

7. CONCLUSION AND FUTURE DIRECTIONS

In this work, we connect gradient descent to Greedy Low-Rank Learning (GLRL) to explain the success of using gradient descent to find low-rank solutions in the matrix factorization problem. This enables us to construct counterexamples to the implicit nuclear norm conjecture in (Gunasekar et al., 2017) . Taking the view of GLRL can also help us understand the benefits of depth.

A COMPARISON TO EXISTING GREEDY ALGORITHMS FOR RANK-CONSTRAINED OPTIMIZATION

The algorithm most closely related to GLRL (Algorithm 1) is probably Rank-1 Matrix Pursuit (R1MP), proposed by Wang et al. (2014) for matrix completion and later generalized to general convex losses by Yao and Kwok (2016). R1MP maintains a set of rank-1 matrices as its basis; in phase r, R1MP adds the same u_ru_r^⊤ as defined in Algorithm 1 into its basis and solves min_α f(Σ_{i=1}^r α_i u_iu_i^⊤) for the rank-r estimation. The main difference between R1MP and GLRL is that the optimization in each phase of R1MP is performed over the coefficients α only, while the entire U_r evolves with GD in each phase of GLRL. In Figure 3, we provide empirical evidence that GLRL generalizes better than R1MP when the ground truth is low-rank, although GLRL may have a higher computational cost depending on η and ε. Similar to R1MP, Greedy Efficient Component Optimization (GECO, Shalev-Shwartz and Singer 2010) also chooses the r-th component of its basis as the top eigenvector of -∇f(W_r), but it solves min_β f(Σ_{1≤i,j≤r} β_{ij} u_iu_j^⊤) for the rank-r estimation. Khanna et al. (2017) provided a convergence guarantee for GECO assuming strong convexity. Haeffele and Vidal (2019) proposed a local-descent meta-algorithm, of which GLRL can be viewed as a specific realization.
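The coefficient-refitting step that distinguishes R1MP from GLRL can be sketched in a few lines. Below is an illustrative implementation in the symmetric setting of Algorithm 1 (matrix completion loss, eigenvector basis); it is our own sketch, not the authors' or Wang et al.'s implementation:

```python
import numpy as np

def r1mp(M_obs, mask, phases):
    # Sketch of Rank-1 Matrix Pursuit: grow a basis of rank-1 matrices
    # u_r u_r^T and, unlike GLRL, refit only the mixing coefficients alpha
    # in each phase. Loss: f(W) = 1/2 sum_{(i,j) in Omega} (W_ij - M_ij)^2.
    d = M_obs.shape[0]
    W = np.zeros((d, d))
    basis = []
    for _ in range(phases):
        G = mask * (W - M_obs)             # gradient of f, supported on Omega
        evals, evecs = np.linalg.eigh(-G)  # top eigenvector of -grad f(W)
        if evals[-1] <= 1e-12:
            break
        u = evecs[:, -1]
        basis.append(np.outer(u, u))
        # least squares over the coefficients alpha on the observed entries
        A = np.stack([B[mask] for B in basis], axis=1)
        alpha, *_ = np.linalg.lstsq(A, M_obs[mask], rcond=None)
        W = sum(a * B for a, B in zip(alpha, basis))
    return W
```

With a fully observed rank-1 PSD target, a single phase already recovers the matrix exactly, which is a convenient sanity check.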

B DEEP GLRL ALGORITHM

Algorithm 2: Deep Greedy Low-Rank Learning (Deep GLRL)
parameters: step size η > 0; small ε > 0
ε̄ ← ε^{1/L}; L(U_1, ..., U_L) := f(U_1···U_L)
r ← 0; W_0 ← 0 ∈ R^{d×d}, and U_{0,1}(∞), ..., U_{0,L}(∞) ∈ R^{d×0} are empty matrices
while λ_1(-∇f(W_r)) > 0 do
  r ← r + 1
  let u_r be a top (unit) eigenvector of -∇f(W_{r-1})
  U_{r,1}(0) ← [U_{r-1,1}(∞)  ε̄u_r] ∈ R^{d×r}
  U_{r,k}(0) ← [[U_{r-1,k}(∞), 0], [0, ε̄]] ∈ R^{r×r} for all 2 ≤ k ≤ L-1
  U_{r,L}(0) ← [U_{r-1,L}(∞); ε̄u_r^⊤] ∈ R^{r×d}
  for t = 0, 1, ... do
    U_{r,i}(t+1) ← U_{r,i}(t) - η∇_{U_i}L(U_{r,1}(t), ..., U_{r,L}(t)), ∀1 ≤ i ≤ L
  W_r ← U_{r,1}(∞)···U_{r,L}(∞)
return W_r

C PRELIMINARY LEMMAS

Lemma C.1. For U_0 ∈ R^{d×r} and W_0 := U_0U_0^⊤, the following statements are equivalent: (1) U_0 is a stationary point of L(U) = ½f(UU^⊤); (2) ∇f(W_0)W_0 = 0; (3) W_0 := U_0U_0^⊤ is a critical point of (2).

Proof. (2) ⇒ (3) is trivial. We only prove (1) ⇒ (2) and (3) ⇒ (1).

Proof of (1) ⇒ (2). If U_0 is a stationary point, then 0 = ∇L(U_0) = ∇f(W_0)U_0. So ∇f(W_0)W_0 = (∇f(W_0)U_0)U_0^⊤ = 0.

Proof of (3) ⇒ (1). If W_0 is a critical point, then 0 = ⟨g(W_0), ∇f(W_0)⟩ = -2 Tr(∇f(W_0)W_0∇f(W_0)) = -2||∇f(W_0)U_0||_F^2, which implies ∇L(U_0) = 0.

Lemma C.2. For a stationary point U_0 ∈ R^{d×r} of L(U) = ½f(UU^⊤), where f(·) is convex, W_0 := U_0U_0^⊤ attains the global minimum of f(·) in S^+_d := {W : W ⪰ 0} iff ∇f(W_0) ⪰ 0.

Proof. Since f(W) is a convex function and S^+_d is convex, we know that W_0 is a global minimizer of f(W) in S^+_d iff

⟨∇f(W_0), W - W_0⟩ ≥ 0, ∀W ⪰ 0. (14)

Note that ⟨∇f(W_0), W_0⟩ = Tr(∇f(W_0)W_0). By Lemma C.1, ⟨∇f(W_0), W_0⟩ = 0. Combining this with (14), we know that W_0 is a global minimizer iff

⟨∇f(W_0), W⟩ ≥ 0, ∀W ⪰ 0. (15)

It is easy to check that this condition is equivalent to ∇f(W_0) ⪰ 0.
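A minimal runnable sketch of Deep GLRL can be written as follows. The inner gradient flow is simulated by plain GD for a fixed number of steps, and all hyperparameters and helper names below are our own illustrative choices, not the authors' implementation:

```python
import numpy as np
from functools import reduce

def deep_glrl(grad_f, d, L=3, eps=1e-3, eta=0.05, steps=4000, max_rank=None):
    # Sketch of Algorithm 2: each phase appends eps^{1/L} u to every factor
    # so the end-to-end product gains eps * u u^T, then runs GD on all factors.
    s = eps ** (1.0 / L)
    Us = [np.zeros((d, 0))] + [np.zeros((0, 0)) for _ in range(L - 2)] \
         + [np.zeros((0, d))]
    r = 0
    while max_rank is None or r < max_rank:
        W = reduce(np.matmul, Us)
        lam, vec = np.linalg.eigh(-grad_f(W))
        if lam[-1] <= 1e-6:                 # lambda_1(-grad f(W_r)) <= 0: stop
            break
        u = vec[:, -1]                      # top eigenvector of -grad f(W_{r-1})
        r += 1
        Us[0] = np.hstack([Us[0], s * u[:, None]])
        for k in range(1, L - 1):           # middle factors grow by a diagonal s
            new = np.zeros((r, r))
            new[:r - 1, :r - 1] = Us[k]
            new[-1, -1] = s
            Us[k] = new
        Us[-1] = np.vstack([Us[-1], s * u[None, :]])
        for _ in range(steps):              # GD on f(U_1 ... U_L)
            G = grad_f(reduce(np.matmul, Us))
            # grad w.r.t. U_i is (U_1...U_{i-1})^T G (U_{i+1}...U_L)^T
            Us = [U - eta * reduce(np.matmul, Us[:i], np.eye(d)).T @ G
                  @ reduce(np.matmul, Us[i + 1:], np.eye(U.shape[1])).T
                  for i, U in enumerate(Us)]
    return reduce(np.matmul, Us)
```

For a rank-1 PSD target under the full-observation loss f(W) = ½||W - M||_F², a single phase recovers M.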

D PROOFS FOR COUNTER-EXAMPLE

Conjecture D.1 (Formal statement, Gunasekar et al. 2017). Suppose f : R^{d×d} → R is a quadratic function and min_{W⪰0} f(W) = 0. Then for any W_init ≻ 0, if W̄ := lim_{α→0} lim_{t→+∞} φ(αW_init, t) exists and f(W̄) = 0, then ||W̄||_* = min_{W⪰0} ||W||_* s.t. f(W) = 0.

Proposition D.2 (Formal statement for Example 5.9). For a constant R > 1, let

M = [[?, ?, 1, R], [?, ?, R, ?], [1, R, ?, ?], [R, ?, ?, ?]],
M_norm = [[R, 1, 1, R], [1, R, R, 1], [1, R, R, 1], [R, 1, 1, R]],
M_rank = [[1, R, 1, R], [R, R², R, R²], [1, R, 1, R], [R, R², R, R²]],

and L(U) = ½f(UU^⊤) with f(W) = ½Σ_{(i,j)∈Ω}(W_ij - M_ij)², where Ω = {(1,3), (1,4), (2,3), (3,1), (3,2), (4,1)}. Then for any W_init ⪰ 0 such that u_1^⊤W_init u_1 > 0,

lim_{α→0} lim_{t→+∞} φ(αW_init, t) = M_rank.

Moreover, we have ||M_rank||_* = 2R² + 2 > 4R = ||M_norm||_* = min_{W⪰0, f(W)=0} ||W||_*.

Proof. We define W^G_{1,ε}(t) and W^G_1(t) in the same way as in Definition 5.1 and Theorem 5.6: W^G_{1,ε}(t) := φ(εu_1u_1^⊤, t), W^G_1(t) := lim_{ε→0} W^G_{1,ε}((1/(2μ_1)) log(1/ε) + t). Below we will show:

1. Assumption 5.7 and Assumption 5.5 are satisfied;
2. ||W^G_1(t)||_F is bounded for t ≥ 0;
3. lim_{t→+∞} W^G_1(t) = M_rank;
4. M_norm = argmin_{W⪰0, f(W)=0} ||W||_*.

Since M_rank is a global minimizer of f(·), applying Theorem 5.8 then finishes the proof.

Proof of Item 1. Let M_0 := -∇f(0); then

M_0 = [[0, 0, 1, R], [0, 0, R, 0], [1, R, 0, 0], [R, 0, 0, 0]].

Let A := [[1, R], [R, 0]]. Then λ_1(A) = (1 + √(1+4R²))/2 and λ_2(A) = (1 - √(1+4R²))/2, so λ_1(A) > |λ_2(A)| > 0 > λ_2(A). As a result, λ_1(A) = ||A||_2. Let v_1 ∈ R² be the top eigenvector of A. We claim that u_1 ∝ [v_1; v_1] ∈ R^4 is the top eigenvector of M_0. First, by definition it is easy to check that M_0u_1 = λ_1(A)u_1. Further, noticing that M_0² = diag(A², A²), we know λ_i²(M_0) ∈ {λ_1²(A), λ_2²(A)} for all eigenvalues λ_i(M_0). That is, λ_1(M_0) = λ_1(A), λ_2(M_0) = -λ_2(A), λ_3(M_0) = λ_2(A), and λ_4(M_0) = -λ_1(A). Thus Assumption 5.5 is satisfied.
Also note that f is quadratic, hence analytic, so Assumption 5.7 is also satisfied.

Proof of Item 2. Let (x_ε(t), y_ε(t)) ∈ R² be the gradient flow of g(x, y) = ½(x²-1)² + (xy-R)² starting from (x_ε(0), y_ε(0)) = √(ε/2)·v_1:

dx(t)/dt = (1 - x(t)²)x(t) - 2y(t)(x(t)y(t) - R),
dy(t)/dt = -2x(t)(x(t)y(t) - R).

Let W_ε(t) be the following matrix:

W_ε(t) := (x_ε(t), y_ε(t), x_ε(t), y_ε(t))^⊤ (x_ε(t), y_ε(t), x_ε(t), y_ε(t)).

Then it is easy to verify that W_ε(0) = W^G_{1,ε}(0) and that W_ε(t) satisfies (2). Thus, by the existence and uniqueness theorem, we have W_ε(t) = W^G_{1,ε}(t) for all t. Taking the limit ε → 0, we know that W^G_1(t) can also be written in the form

W^G_1(t) = (x(t), y(t), x(t), y(t))^⊤ (x(t), y(t), x(t), y(t)),

where (x(t), y(t)) ∈ R² is a gradient flow of g(x, y) = ½(x²-1)² + (xy-R)². Since g(x(t), y(t)) is non-increasing over time and lim_{t→-∞} g(x(t), y(t)) = g(0, 0) = R² + 0.5, we know |x(t)y(t)| ≤ 3R for all t. So whenever y²(t) - x²(t) ≥ 9R², we have x²(t) ≤ 9R²/y²(t) ≤ 9R²/(y²(t) - x²(t)) ≤ 1. In this case, d(y²(t) - x²(t))/dt = 2x²(t)(x²(t) - 1) ≤ 0. Combining this with y²(-∞) - x²(-∞) = 0 ≤ 9R², we have y²(t) - x²(t) ≤ 9R² for all t, which also implies that y(t) is bounded. Noticing that 9R² ≥ g(x(t), y(t)) ≥ (x²(t) - 1)², we know x²(t) is also bounded. Therefore, W^G_1(t) is bounded.

Proof of Item 3. Note that (x(∞), y(∞)) is a stationary point of g(x, y). It is clear that g(x, y) has only 3 stationary points: (0, 0), (1, R) and (-1, -R). Thus W_1 can only be 0 or M_rank. However, since f(W^G_1(t)) < f(0) for all t, W_1 = lim_{t→∞} W^G_1(t) cannot be 0. So W_1 must be M_rank.

Proof of Item 4. Let m_ij be the (i,j)-th entry of M.
Suppose M ⪰ 0. Then

(e_1 - e_4)^⊤M(e_1 - e_4) ≥ 0 ⟹ m_11 + m_44 ≥ m_14 + m_41 = 2R,
(e_2 - e_3)^⊤M(e_2 - e_3) ≥ 0 ⟹ m_22 + m_33 ≥ m_23 + m_32 = 2R.

Since ||M||_* = Tr(M) for M ⪰ 0, this gives ||M||_* ≥ 4R, and M_norm (which is PSD and satisfies f(M_norm) = 0 with Tr(M_norm) = 4R) attains it. Thus 4R = min_{W⪰0, f(W)=0} ||W||_*. Below we show that the remaining unknown off-diagonal entries must be 1. Let

V = [[1, -1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]].

Then M ⪰ 0 ⟹ VMV^⊤ ⪰ 0, i.e.,

[[0, m_13 - m_23, m_14 - m_24], [m_31 - m_32, R, R], [m_41 - m_42, R, R]] ⪰ 0,

which implies m_13 = m_23 and m_14 = m_24. By the same argument with V = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, -1]], we have m_13 = m_14 and m_23 = m_24. Also note that M is symmetric and m_13 = 1; thus m_ij = m_ji = 1 for all i ∈ {1, 2}, j ∈ {3, 4}. Thus M_norm = argmin_{W⪰0, f(W)=0} ||W||_*, and it is unique.

Table 1: Choice of hyperparameters for simulating gradient flow. For L = 2, gradient descent escapes saddles in O(log(1/ε)) time, where ε is the distance between the initialization and the saddle.

Depth (L) | Simulation method
2 | Constant LR, η = 10^{-3} for 10^6 iterations
3 | Adaptive LR, η = 2×10^{-5} and ε = 10^{-4} for 10^6 iterations
4 | Adaptive LR, η = 3×10^{-4} and ε = 10^{-3} for 10^6 iterations
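The nuclear norm comparison in Proposition D.2 is easy to check numerically; here is a quick sanity check for a sample value of R (the particular choice R = 2 is ours):

```python
import numpy as np

R = 2.0  # any constant R > 1 works; this particular value is our choice
M_rank = np.outer([1.0, R, 1.0, R], [1.0, R, 1.0, R])
M_norm = np.array([[R, 1, 1, R],
                   [1, R, R, 1],
                   [1, R, R, 1],
                   [R, 1, 1, R]])
observed = {(0, 2): 1.0, (0, 3): R, (1, 2): R}  # Omega, 0-indexed (upper part)

nuclear = lambda W: float(np.linalg.svd(W, compute_uv=False).sum())

for W in (M_rank, M_norm):
    # both are PSD completions that fit all observed entries exactly
    assert all(abs(W[i, j] - v) < 1e-12 for (i, j), v in observed.items())
    assert np.linalg.eigvalsh(W).min() > -1e-9
print(nuclear(M_rank), nuclear(M_norm))  # 2R^2 + 2 = 10 vs 4R = 8
```

So M_rank has the larger nuclear norm, exactly as the counter-example requires.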

E EXPERIMENTS E.1 GENERAL SETUP

The code is written in Julia (Bezanson et al., 2012) and PyTorch (Paszke et al., 2019). The ground-truth matrix W* is low-rank by construction: we sample a random orthogonal matrix U and a diagonal matrix S with Frobenius norm ||S||_F = 1, and set W* = USU^⊤. Each measurement X in X_1, ..., X_m is generated by sampling two one-hot vectors u and v uniformly and setting X = ½uv^⊤ + ½vu^⊤. In Figures 1, 2, 3 to 5 and 7, the ground truth matrix W* has shape 20×20 and rank 3, where ||W*||_F = 20, λ_1(W*) = 17.41, λ_2(W*) = 8.85, λ_3(W*) = 4.31, and λ_1(-∇f(0)) = 6.23, λ_2(-∇f(0)) = 5.41. p = 0.3 is used for generating measurements, except p = 0.25 in Figure 3; i.e., each pair of entries W*_ij and W*_ji is observed with probability p.

Gradient Descent. Let ε̃ > 0 be the Frobenius norm of the target random initialization. For the depth-2 case, we sample 2 orthogonal matrices V_1, V_2 and a diagonal matrix D with Frobenius norm ε̃, and we set U = V_1D^{1/2}V_2^⊤; for the depth-L case with L ≥ 3, we sample L orthogonal matrices V_1, ..., V_L and a diagonal matrix D with Frobenius norm ε̃, and we set U_i := V_iD^{1/L}V_{i+1}^⊤ (with V_{L+1} = V_1). In this way, we can guarantee that the end-to-end matrix W = U_1···U_L is symmetric and that the initialization is balanced for L ≥ 3. We discretize time to simulate gradient flow. When L > 2, gradient flow stays around saddle points for most of the time; therefore we use full-batch GD with an adaptive learning rate η̂_t, inspired by RMSprop (Tieleman and Hinton, 2012), for faster convergence:

v_{t+1} = αv_t + (1-α)||∇L(θ_t)||_2², η̂_t = η/(v_{t+1}/(1-α^{t+1}) + ε), θ_{t+1} = θ_t - η̂_t∇L(θ_t),

where α = 0.99 and η is the (unadjusted) learning rate. The choices of hyperparameters are summarized in Table 1. The continuous time for θ_t is measured as Σ_{i=0}^{t-1} η̂_i.

GLRL. In Figures 1, 2, 3 and 4, the GLRL trajectory is obtained by running Algorithm 1 with ε = 10^{-7} and η = 10^{-3}.
The stopping criterion is that the loop has been iterated 10^7 times.
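The adaptive learning rate above fits in a few lines; a sketch (the function name is ours), following the update rule exactly as displayed, i.e., with the bias-corrected second-moment estimate entering the denominator directly rather than through a square root as in standard RMSprop:

```python
import numpy as np

def adaptive_lr_step(theta, grad, v, t, eta, alpha=0.99, eps=1e-4):
    # v_{t+1} = alpha v_t + (1 - alpha) ||grad L(theta_t)||_2^2
    v_new = alpha * v + (1.0 - alpha) * float(np.sum(grad ** 2))
    # eta_t = eta / (v_{t+1} / (1 - alpha^{t+1}) + eps)
    lr = eta / (v_new / (1.0 - alpha ** (t + 1)) + eps)
    return theta - lr * grad, v_new
```

At t = 0 the bias correction makes v_{t+1}/(1-α^{t+1}) equal ||∇L(θ_0)||² exactly, so the first step uses η/(||∇L(θ_0)||² + ε).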

E.2 EXPERIMENTAL EQUIVALENCE BETWEEN GLRL AND GRADIENT DESCENT

Here we provide experimental evidence supporting our theoretical claims about the equivalence between GLRL and GF in both cases, L = 2 and L ≥ 3. In Figure 1, we show that every point on the trajectory of GF (simulated by GD) from random initialization is close to the trajectory of GLRL. In Figure 2, we first run GLRL to obtain the critical points {W_r}_{r=0}^3 passed by GLRL. We then define the distance of a matrix W to the critical points to be min_{0≤r≤3} ||W - W_r||_F.

E.3 HOW WELL DOES GLRL WORK?

We compare GLRL with gradient descent (with not-so-small initialization), nuclear norm minimization, and R1MP (Wang et al., 2014). We use CVXPY (Diamond and Boyd, 2016; Agrawal et al., 2018) to find the minimum nuclear norm solution. The results are shown in Figure 3. GLRL can fully recover the ground truth, while the others have difficulty doing so. We use the general setting in Appendix E.1. In these experiments, we use the constant learning rate 10^{-5} for 4×10^7 iterations. The reference matrix W_ref is obtained by running the first phase of GLRL with ||W(0)||_F = 10^{-48} and picking one matrix on the trajectory with ||W_ref||_F about 0.6. For every ε = 10^i, i ∈ {-1, -2, -3, -4, -5}, we run both gradient descent and the first phase of GLRL with ||W(0)||_F = ε. For gradient descent, we use random initialization, so W(0) is full-rank w.p. 1. The distance of a trajectory to W_ref is defined as min_{t≥0} ||W(t) - W_ref||_F. In practice, as we discretize time to simulate gradient flow, we check every t during simulation to compute the distance. As a result, the estimate might be inaccurate when a trajectory gets really close to W_ref. The results are shown in Figure 4. We observe that GLRL trajectories are closer to the reference matrix W_ref by orders of magnitude. The take-home message is that GLRL is in general a more computationally efficient way to simulate the trajectory of GF (GD) with infinitesimal initialization, as one can start GLRL with a much larger initialization while still maintaining high precision.

Figure 4: Using εv_1v_1^⊤ (denoted by "rank 1") as initialization makes GD much closer to GLRL than using random initialization (denoted by "rank d"), where v_1 is the top eigenvector of -∇f(0). We take a fixed reference matrix on the trajectory of GLRL with constant norm and plot the distance of GD with each initialization to it, respectively. We observe that the performance of GD is quite robust to its initialization.
Note that for L > 2, the shaded area with initialization scale 10^{-7} is large, as the sudden drop in loss occurs at quite different continuous times for different random seeds in this case.

E.5 BENEFIT OF DEPTH: POLYNOMIAL VS EXPONENTIAL DEPENDENCE ON INITIALIZATION

To verify our theory in Section 6, we run gradient descent with different depths and initializations. The results are shown in Figure 5. We can see that as the initialization becomes smaller, the final solution gets closer to the ground truth. However, a depth-2 model requires exponentially small initialization, while deeper models only require polynomially small initialization, though they take much longer to converge.

F THE MARGINAL VALUE OF BEING DEEPER

Theorem F.1 shows that the end-to-end dynamics (17) converges pointwise as L → ∞ if the product ηL of learning rate and depth is fixed as a constant. Interestingly, (17) also allows us to simulate the dynamics of W(t) for all depths L with computation time independent of L. In Figure 6, we compare the effect of depth while fixing the initialization and ηL. We can see that deeper models converge faster. The differences between L = 1, 2, and 4 are large, while the differences among L ≥ 16 are marginal.

Theorem F.1. Suppose W = ŨΣṼ^⊤ is the SVD of W, where Σ = diag(σ_1, ..., σ_d). The dynamics of the L-layer linear net is the following, where ∘ denotes entry-wise multiplication:

dW/dt = -LŨ[(Ũ^⊤∇f(W)Ṽ) ∘ K^(L)]Ṽ^⊤,

where K^(L)_{i,i} = σ_i^{2-2/L} and K^(L)_{i,j} = (σ_i² - σ_j²)/(Lσ_i^{2/L} - Lσ_j^{2/L}) for i ≠ j.

Proof. We start from (11):

dW/dt = -Σ_{l=0}^{L-1} (WW^⊤)^{l/L} ∇f(W) (W^⊤W)^{(L-1-l)/L}
      = -Σ_{l=0}^{L-1} ŨΣ^{2l/L}Ũ^⊤∇f(W)ṼΣ^{2(L-1-l)/L}Ṽ^⊤
      = -LŨ[L^{-1}Σ_{l=0}^{L-1} Σ^{2l/L}(Ũ^⊤∇f(W)Ṽ)Σ^{2(L-1-l)/L}]Ṽ^⊤.

Note that Σ is diagonal, so Σ^{2l/L}(Ũ^⊤∇f(W)Ṽ)Σ^{2(L-1-l)/L} = (Ũ^⊤∇f(W)Ṽ) ∘ H^(l), where H^(l)_{i,j} = σ_i^{2l/L}σ_j^{2(L-1-l)/L}. Therefore,

L^{-1}Σ_{l=0}^{L-1} Σ^{2l/L}(Ũ^⊤∇f(W)Ṽ)Σ^{2(L-1-l)/L} = (Ũ^⊤∇f(W)Ṽ) ∘ K^(L), where K^(L) = L^{-1}Σ_{l=0}^{L-1} H^(l).

Hence dW/dt = -LŨ[(Ũ^⊤∇f(W)Ṽ) ∘ K^(L)]Ṽ^⊤. The entries of K^(L) can be computed directly:

K^(L)_{i,j} = L^{-1}Σ_{l=0}^{L-1} σ_i^{2l/L}σ_j^{2(L-1-l)/L} = σ_i^{2-2/L} if i = j, and (σ_i² - σ_j²)/(Lσ_i^{2/L} - Lσ_j^{2/L}) if i ≠ j.

Corollary F.2. As L → ∞, K^(L) converges to K*, where K*_{i,i} = σ_i² and K*_{i,j} = (σ_i² - σ_j²)/(ln σ_i² - ln σ_j²) for i ≠ j.

Experiment details. We follow the general setting in Appendix E.1. The ground truth W* is different but is generated in the same manner, has the same shape 20×20, and p = 0.3 is used for observation generation.
We directly apply (17), computing Ũ and Ṽ through SVD, to simulate the trajectory, with a constant learning rate of 10^{-3}/L for depth L. W(0) is sampled from 10^{-3}×N(0, I_d).
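Theorem F.1 and Corollary F.2 are straightforward to check numerically; the helper names below are ours:

```python
import numpy as np

def K_L(sigma, L):
    # K^(L) from Theorem F.1: K_ii = sigma_i^{2-2/L},
    # K_ij = (sigma_i^2 - sigma_j^2) / (L sigma_i^{2/L} - L sigma_j^{2/L})
    s = np.asarray(sigma, float)
    K = np.empty((s.size, s.size))
    for i in range(s.size):
        for j in range(s.size):
            K[i, j] = (s[i] ** (2 - 2 / L) if i == j else
                       (s[i] ** 2 - s[j] ** 2)
                       / (L * s[i] ** (2 / L) - L * s[j] ** (2 / L)))
    return K

def K_star(sigma):
    # Corollary F.2: the L -> infinity limit of K^(L)
    s = np.asarray(sigma, float)
    K = np.empty((s.size, s.size))
    for i in range(s.size):
        for j in range(s.size):
            K[i, j] = (s[i] ** 2 if i == j else
                       (s[i] ** 2 - s[j] ** 2)
                       / (np.log(s[i] ** 2) - np.log(s[j] ** 2)))
    return K
```

One can confirm that K_L agrees with its defining average L^{-1}Σ_l σ_i^{2l/L}σ_j^{2(L-1-l)/L} and that K_L approaches K_star as L grows.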

G PROOFS FOR DYNAMICAL SYSTEM

In this section, we prove Theorem 5.3 in Section 5.1. In Appendix G.1, we show how to reduce Theorem 5.3 to the case where J (0) is exactly a diagonal matrix, then we prove this diagonal case in Appendix G.2. Finally, in Appendix G.3, we discuss how to extend it to the case where J (0) is non-diagonalizable.

G.1 REDUCTION TO THE DIAGONAL CASE

Theorem G.1. If J(0) = diag(μ_1, ..., μ_d) is diagonal, then the statement of Theorem 5.3 holds.

Proof of Theorem 5.3. We show how to prove Theorem 5.3 based on Theorem G.1. Let dθ/dt = g(θ) be the dynamical system in Theorem 5.3. Let J(0) = ṼDṼ^{-1} be the eigendecomposition, where Ṽ is an invertible matrix and D = diag(μ_1, ..., μ_d). We define the following new dynamics by changing the basis: θ̂(t) = Ṽ^{-1}θ(t). Then dθ̂(t)/dt = ĝ(θ̂) for ĝ(θ̂) := Ṽ^{-1}g(Ṽθ̂), the associated Jacobian matrix is Ĵ(θ̂) := Ṽ^{-1}J(Ṽθ̂)Ṽ, and thus Ĵ(0) = diag(μ_1, ..., μ_d). Now we apply Theorem G.1 to θ̂(t). Then ẑ_α(t) := Ṽ^{-1}z_α(t) converges to the limit ẑ(t) := lim_{α→0} ẑ_α(t). This shows that the limit z(t) = Ṽẑ(t) exists in Theorem 5.3, and we can also verify that z(t) is a solution of (6). Given δ_α converging to 0 with positive alignment with ũ_1 as α → 0, we can define δ̂_α := Ṽ^{-1}δ_α; then δ̂_α converges to 0 with positive alignment with e_1, where e_1 is the first vector of the standard basis and is also the top eigenvector of Ĵ(0). Therefore, for every t ∈ (-∞, +∞), there is a constant C > 0 such that

||Ṽ^{-1}φ(δ_α, t + (1/μ_1) log(1/⟨δ_α, ũ_1⟩)) - ẑ(t)||_2 ≤ C·||δ̂_α||_2^{γ/(μ_1+γ)} (18)

for every sufficiently small α. As Ṽ is invertible, this directly implies (7).

G.2 PROOF FOR THE DIAGONAL CASE

Now we only need to prove Theorem G.1. Let e_1, ..., e_d be the standard basis. In this diagonal case ũ_1 = ṽ_1 = e_1, so we only write e_1 for ũ_1 and ṽ_1 in the rest of the analysis. Let R > 0. Since g(θ) is C²-smooth, there exists β > 0 such that

||J(θ) - J(θ+h)||_2 ≤ β||h||_2 (19)

for all ||θ||_2, ||θ+h||_2 ≤ R. Then the following can be proved by integration:

g(θ+h) - g(θ) = (∫_0^1 J(θ+ξh)dξ)h, (20)
||g(θ+h) - g(θ) - J(θ)h||_2 ≤ β||h||_2². (21)

By (21), we also have

||g(θ) - J(0)θ||_2 = ||g(θ) - g(0) - J(0)θ||_2 ≤ β||θ||_2². (22)

Let κ := β/μ_1. We assume WLOG that R ≤ 1/κ. Let F(x) = log x - log(1+κx). It is easy to see that F'(x) = 1/(x+κx²) and that F(x) is an increasing function with range (-∞, log(1/κ)). We use F^{-1}(y) to denote the inverse function of F(x). Define T_α(r) := (F(r) - F(α))/μ_1 = (1/μ_1)(log(r/α) - log((1+κr)/(1+κα))). Our proof relies only on the following properties of J(0) (besides that μ_1, e_1 are its top eigenvalue and eigenvector):

Lemma G.2. For J(0) := diag(μ_1, ..., μ_d), we have: 1. For any h ∈ R^d, h^⊤J(0)h ≤ μ_1||h||_2²; 2. For any t ≥ 0, ||e^{tJ(0)} - e^{μ_1t}e_1e_1^⊤||_2 = e^{μ_2t}.

Proof. For Item 1, h^⊤J(0)h = Σ_{i=1}^d μ_ih_i² ≤ μ_1||h||_2². For Item 2, ||e^{tJ(0)} - e^{μ_1t}e_1e_1^⊤||_2 = ||diag(0, e^{μ_2t}, ..., e^{μ_dt})||_2 = e^{μ_2t}.

Lemma G.3. For θ(t) = φ(θ_0, t) with ||θ_0||_2 ≤ α and t ≤ T_α(r),

||θ(t)||_2 ≤ ((1+κr)/(1+κα))·α·e^{μ_1t} ≤ r.

Proof. By (22) and Lemma G.2, we have

½ d||θ(t)||_2²/dt = ⟨θ(t), g(θ(t))⟩ ≤ θ(t)^⊤J(0)θ(t) + β||θ(t)||_2³ ≤ μ_1||θ(t)||_2² + β||θ(t)||_2³.

This implies d||θ(t)||_2/dt ≤ μ_1(||θ(t)||_2 + κ||θ(t)||_2²). Since F'(x) = 1/(x+κx²), we further have dF(||θ(t)||_2)/dt ≤ μ_1, so F(||θ(t)||_2) ≤ F(α) + μ_1t. By the definition of T_α(r), we then know that ||θ(t)||_2 ≤ r for all t ≤ T_α(r). So log||θ(t)||_2 ≤ F(||θ(t)||_2) + log(1+κr) ≤ F(α) + μ_1t + log(1+κr). Expanding F(α) proves the lemma.

Lemma G.4. For θ(t) = φ(θ_0, t) with ||θ_0||_2 ≤ α and t ≤ T_α(r), we have θ(t) = e^{tJ(0)}θ_0 + O(r²).

Proof. Let θ̄(t) := e^{tJ(0)}θ_0.
Then (recall θ̄(t) := e^{tJ(0)}θ_0) we have

½ d/dt ||θ(t) - θ̄(t)||_2² ≤ ⟨g(θ(t)) - J(0)θ̄(t), θ(t) - θ̄(t)⟩
= ⟨g(θ(t)) - J(0)θ(t), θ(t) - θ̄(t)⟩ + (θ(t) - θ̄(t))^⊤J(0)(θ(t) - θ̄(t))
≤ ||g(θ(t)) - J(0)θ(t)||_2·||θ(t) - θ̄(t)||_2 + μ_1||θ(t) - θ̄(t)||_2²,

where the last inequality is due to Lemma G.2. By (22) and Lemma G.3, we have

||g(θ(t)) - J(0)θ(t)||_2 ≤ β||θ(t)||_2² ≤ β(((1+κr)/(1+κα))·α)²·e^{2μ_1t}.

So d/dt ||θ(t) - θ̄(t)||_2 ≤ β(((1+κr)/(1+κα))·α)²·e^{2μ_1t} + μ_1||θ(t) - θ̄(t)||_2. By Grönwall's inequality,

||θ(t) - θ̄(t)||_2 ≤ ∫_0^t β(((1+κr)/(1+κα))·α)²·e^{2μ_1τ}·e^{μ_1(t-τ)}dτ.

Evaluating the integral gives

||θ(t) - θ̄(t)||_2 ≤ β(((1+κr)/(1+κα))·α)²·e^{μ_1t}·(e^{μ_1t}-1)/μ_1 ≤ κ(((1+κr)/(1+κα))·α·e^{μ_1t})² ≤ κr²,

which proves the lemma.

Lemma G.5. Let θ(t) = φ(θ_0, t) and θ̃(t) = φ(θ̃_0, t). If max{||θ_0||_2, ||θ̃_0||_2} ≤ α, then for t ≤ T_α(r),

||θ(t) - θ̃(t)||_2 ≤ e^{μ_1t+κr}||θ_0 - θ̃_0||_2.

Proof. For t ≤ T_α(r), by (20),

½ d/dt ||θ(t) - θ̃(t)||_2² = ⟨g(θ(t)) - g(θ̃(t)), θ(t) - θ̃(t)⟩ = (θ(t) - θ̃(t))^⊤(∫_0^1 J(θ_ξ(t))dξ)(θ(t) - θ̃(t)),

where θ_ξ(t) := ξθ(t) + (1-ξ)θ̃(t). By Lemma G.3, max{||θ(t)||_2, ||θ̃(t)||_2} ≤ ((1+κr)/(1+κα))·αe^{μ_1t} for all t ≤ T_α(r), so ||θ_ξ(t)||_2 ≤ ((1+κr)/(1+κα))·αe^{μ_1t}. Combining these with (19) and Lemma G.2, we have

h^⊤J(θ_ξ(t))h = h^⊤J(0)h + h^⊤(J(θ_ξ(t)) - J(0))h ≤ (μ_1 + β·((1+κr)/(1+κα))·αe^{μ_1t})||h||_2²

for all h ∈ R^d. Thus d/dt ||θ(t) - θ̃(t)||_2 ≤ (μ_1 + β·((1+κr)/(1+κα))·αe^{μ_1t})·||θ(t) - θ̃(t)||_2. This implies

log(||θ(t) - θ̃(t)||_2 / ||θ(0) - θ̃(0)||_2) ≤ ∫_0^t (μ_1 + β·((1+κr)/(1+κα))·αe^{μ_1τ})dτ ≤ μ_1t + κ·((1+κr)/(1+κα))·αe^{μ_1t} ≤ μ_1t + κr.

Therefore, ||θ(t) - θ̃(t)||_2 ≤ e^{μ_1t+κr}||θ(0) - θ̃(0)||_2.

Lemma G.6. For every t ∈ (-∞, +∞), z(t) exists and z_α(t) converges to z(t) at the following rate: ||z_α(t) - z(t)||_2 = O(α), where O(·) hides constants depending on g(θ) and t.

Proof. We prove the lemma for t ∈ (-∞, F(R)/μ_1] and t > F(R)/μ_1 respectively.

Case 1. Fix t ∈ (-∞, F(R)/μ_1]. Let ᾱ be the unique number such that ᾱ/(1+κᾱ) = α (i.e., F(ᾱ) = log α). Let α' be an arbitrary number less than α. Let t_0 := (1/μ_1) log(α/α'). Then t_0 = (1/μ_1)(F(ᾱ) - log α') ≤ T_{α'}(ᾱ).
By Lemma G.4, we have ||φ(α'e_1, t_0) - αe_1||_2 = ||φ(α'e_1, t_0) - e^{t_0J(0)}α'e_1||_2 = O(ᾱ²). Let r := F^{-1}(μ_1t) ≤ R. Then t + (1/μ_1)log(1/α) = T_ᾱ(r) if ᾱ < r. By Lemma G.3, ||φ(α'e_1, t_0)||_2 ≤ ᾱ. Also, ||αe_1||_2 = ᾱ/(1+κᾱ) ≤ ᾱ. By Lemma G.5,

||z_{α'}(t) - z_α(t)||_2 = ||φ(α'e_1, t + (1/μ_1)log(1/α')) - φ(αe_1, t + (1/μ_1)log(1/α))||_2
= ||φ(φ(α'e_1, t_0), t + (1/μ_1)log(1/α)) - φ(αe_1, t + (1/μ_1)log(1/α))||_2
≤ O(ᾱ²·e^{μ_1(t + (1/μ_1)log(1/α)) + κr}) ≤ O(ᾱ²/α).

For α small enough, we have ᾱ = O(α), so for any α' ∈ (0, α), ||z_{α'}(t) - z_α(t)||_2 = O(α). This implies that {z_α(t)} satisfies Cauchy's criterion for every t, and thus the limit z(t) exists for t ≤ F(R)/μ_1. The convergence rate is deduced by taking the limit α' → 0 on both sides.

Case 2. For t = F(R)/μ_1 + τ with τ > 0, φ(θ, τ) is locally Lipschitz with respect to θ. So

||z_α(t) - z_{α'}(t)||_2 = ||φ(z_α(F(R)/μ_1), τ) - φ(z_{α'}(F(R)/μ_1), τ)||_2 = O(||z_α(F(R)/μ_1) - z_{α'}(F(R)/μ_1)||_2) = O(α),

which proves the lemma for t > F(R)/μ_1.

Proof of Theorem G.1. The existence of z(t) := lim_{α→0} z_α(t) = lim_{α→0} φ(αe_1, t + (1/μ_1)log(1/α)) has already been proved in Lemma G.6, where we showed ||z_α(t) - z(t)||_2 = O(α). By the continuity of φ(·, t) for every t ∈ R, we have

z(t) = lim_{α→0} φ(αṽ_1, t + (1/μ_1)log(1/α)) = φ(lim_{α→0} φ(αṽ_1, (1/μ_1)log(1/α)), t) = φ(z(0), t).

It remains to prove (7). WLOG we can assume that ||δ_α||_2 is decreasing and α² ≤ ||δ_α||_2 ≤ α (otherwise we can reparameterize). Then our goal becomes proving

||θ_α(t) - z(t)||_2 = O(α^{γ/(μ_1+γ)}), (23)

where θ_α(t) := φ(δ_α, t + (1/μ_1)log(1/⟨δ_α, e_1⟩)). We prove (23) for t ∈ (-∞, (F(R) + log q)/μ_1] and t > (F(R) + log q)/μ_1 respectively.

Case 1. Fix t ∈ (-∞, (F(R) + log q)/μ_1]. Let ᾱ_1 = α^{γ/(μ_1+γ)} and α_1 := e^{F(ᾱ_1)} = ᾱ_1/(1+κᾱ_1). Let t_0 := (1/μ_1)(F(ᾱ_1) - log α) ≤ T_{||δ_α||_2}(ᾱ_1). At time t_0, by Lemma G.2 we have

||e^{t_0J(0)} - e^{μ_1t_0}e_1e_1^⊤||_2 = e^{μ_2t_0} = e^{(μ_2/μ_1)(F(ᾱ_1) - log α)} = (α_1/α)^{μ_2/μ_1}. (24)
Let q_α := ⟨δ_α/α, e_1⟩. By Definition 5.2, there exists q > 0 such that q_α ≥ q for all sufficiently small α. Then we have

||φ(δ_α, t_0) - α_1q_αe_1||_2 ≤ ||φ(δ_α, t_0) - e^{t_0J(0)}δ_α||_2 + ||(e^{t_0J(0)} - e^{μ_1t_0}e_1e_1^⊤)δ_α||_2
= O(ᾱ_1²) + (α_1/α)^{μ_2/μ_1}||δ_α||_2 = O(ᾱ_1² + α_1^{μ_2/μ_1}α^{1-μ_2/μ_1}) = O(ᾱ_1²).

Let r := F^{-1}(μ_1t + log(1/q_α)) ≤ R. Then t + (1/μ_1)log(1/(α_1q_α)) = T_{ᾱ_1}(r) if ᾱ_1 < r. By Lemma G.3, ||φ(δ_α, t_0)||_2 ≤ ᾱ_1. Also, ||α_1q_αe_1||_2 ≤ α_1 = ᾱ_1/(1+κᾱ_1) ≤ ᾱ_1. By Lemma G.5,

||θ_α(t) - z_{α_1q_α}(t)||_2 ≤ ||φ(φ(δ_α, t_0), t + (1/μ_1)log(1/(α_1q_α))) - φ(α_1q_αe_1, t + (1/μ_1)log(1/(α_1q_α)))||_2
= O(ᾱ_1²·e^{μ_1(t + (1/μ_1)log(1/(α_1q_α))) + κr}) = O(ᾱ_1).

Combining this with the convergence rate for z_{α_1q_α}(t), we have ||θ_α(t) - z(t)||_2 ≤ ||θ_α(t) - z_{α_1q_α}(t)||_2 + ||z_{α_1q_α}(t) - z(t)||_2 = O(ᾱ_1).

Case 2. For t = (F(R) + log q)/μ_1 + τ with τ > 0, φ(θ, τ) is locally Lipschitz with respect to θ. So

||θ_α(t) - z(t)||_2 = ||φ(θ_α((F(R) + log q)/μ_1), τ) - φ(z((F(R) + log q)/μ_1), τ)||_2 = O(||θ_α((F(R) + log q)/μ_1) - z((F(R) + log q)/μ_1)||_2) = O(ᾱ_1),

which proves (23) for t > (F(R) + log q)/μ_1.

G.3 EXTENSION TO NON-DIAGONALIZABLE CASE

The proof in Appendix G.2 can be generalized to the case where J(0) is non-diagonalizable. We now state the theorem formally and sketch the proof idea. We use the notations g(θ), φ(θ_0, t), J(θ) as in Section 5.1, but we do not assume that J(0) is diagonalizable. Instead, we use μ_1, μ_2, ..., μ_d ∈ C to denote the eigenvalues of J(0), repeated according to algebraic multiplicity. We sort the eigenvalues in descending order of their real parts, i.e., ℜ(μ_1) ≥ ℜ(μ_2) ≥ ... ≥ ℜ(μ_d), where ℜ(z) stands for the real part of a complex number z ∈ C. We call the eigenvalue with the largest real part the top eigenvalue.

Theorem G.7. Assume that θ = 0 is a critical point and the following regularity conditions hold: 1. g(θ) is C²-smooth; 2. φ(θ_0, t) exists for all θ_0 and t; 3. The top eigenvalue of J(0) is unique and is a positive real number, i.e., μ_1 > max{ℜ(μ_2), 0}. Let ṽ_1, ũ_1 be the right and left eigenvectors associated with μ_1, satisfying ũ_1^⊤ṽ_1 = 1. Let z_α(t) := φ(αṽ_1, t + (1/μ_1)log(1/α)) for every α > 0. Then for all t ∈ R, z(t) := lim_{α→0} z_α(t) exists and z(t) = φ(z(0), t). If δ_α converges to 0 with positive alignment with ũ_1 as α → 0, then for any t ∈ R and any ε > 0, there is a constant C > 0 such that for every sufficiently small α,

||φ(δ_α, t + (1/μ_1)log(1/⟨δ_α, ũ_1⟩)) - z(t)||_2 ≤ C·||δ_α||_2^{γ/(μ_1+γ)-ε}, (25)

where γ := μ_1 - ℜ(μ_2) is the eigenvalue gap.

Proof Sketch. Define the following two types of matrices. For r ≥ 1 and a, δ ∈ R, let J^(r)_{a,δ} ∈ R^{r×r} be the bidiagonal matrix with a on the diagonal and δ on the superdiagonal. For r ≥ 1 and a, b, δ ∈ R, let J^(r)_{a,b,δ} ∈ R^{2r×2r} be the block-bidiagonal matrix with the 2×2 block C = [[a, -b], [b, a]] repeated along the diagonal and δI on the block superdiagonal. By linear algebra, the real matrix J(0) can be written in the real Jordan normal form, i.e., J(0) = Ṽ diag(J_[1], ..., J_[m]) Ṽ^{-1}, where Ṽ ∈ R^{d×d} is an invertible matrix and each J_[j] is a real Jordan block.
Recall that there are two types of real Jordan blocks, J^(r)_{a,1} and J^(r)_{a,b,1}. The former is associated with a real eigenvalue a, and the latter with a pair of complex eigenvalues a ± bi. The sum of sizes of all Jordan blocks corresponding to a real eigenvalue a is its algebraic multiplicity; the sum of sizes of all Jordan blocks corresponding to a pair of complex eigenvalues a ± bi is two times the algebraic multiplicity of a + bi (note that a ± bi have the same multiplicity). It is easy to see that J^(r)_{a,b,δ} = DJ^(r)_{a,b,1}D^{-1} for D = diag(δ^r, δ^r, δ^{r-1}, δ^{r-1}, ..., δ, δ) ∈ R^{2r×2r}, and similarly for J^(r)_{a,δ}. This means that for every δ > 0 there exists Ṽ_δ such that J(0) = Ṽ_δJ_δṼ_δ^{-1}, where J_δ := diag(J_{δ[1]}, ..., J_{δ[m]}), with J_{δ[j]} := J^(r)_{a,δ} if J_[j] = J^(r)_{a,1}, and J_{δ[j]} := J^(r)_{a,b,δ} if J_[j] = J^(r)_{a,b,1}. Since the top eigenvalue of J(0) is positive and unique, μ_1 corresponds to only one block, and WLOG we let J_[1] = [μ_1] ∈ R^{1×1}, so that J_{δ[1]} = [μ_1]. We only need to select a parameter δ > 0 and prove the theorem in the case J(0) = J_δ, since we can change the basis in the same way as in Appendix G.1. By scrutinizing the proof of Theorem G.1, we find that we only need to reprove Lemma G.2. However, Lemma G.2 may no longer hold since J(0) is not diagonal anymore. Instead, we prove the following:

1. If δ ∈ (0, γ), then h^⊤J_δh ≤ μ_1||h||_2² for all h ∈ R^d;
2. For any μ'_2 ∈ (ℜ(μ_2), μ_1), if δ ∈ (0, μ'_2 - ℜ(μ_2)), then ||e^{tJ_δ} - e^{μ_1t}e_1e_1^⊤||_2 ≤ e^{μ'_2t} for all t ≥ 0.

Proof of Item 1. Let K be the set of pairs (k_1, k_2) such that k_1 ≠ k_2 and the entry of J_δ at the k_1-th row and k_2-th column is non-zero. Then we have

h^⊤J_δh = h^⊤((J_δ + J_δ^⊤)/2)h = Σ_{k=1}^d ℜ(μ_k)h_k² + Σ_{(k_1,k_2)∈K} h_{k_1}h_{k_2}δ ≤ Σ_{k=1}^d ℜ(μ_k)h_k² + Σ_{(k_1,k_2)∈K} ((h_{k_1}² + h_{k_2}²)/2)δ.

Note that ℜ(μ_k) ≤ ℜ(μ_2) for k ≥ 2.
Also note that no pair in $K$ has $k_1 = 1$ or $k_2 = 1$, and for every $k \ge 2$ there are at most two pairs in $K$ with $k_1 = k$ or $k_2 = k$. Combining all these together gives
$$h^\top J_\delta h \le \mu_1 h_1^2 + \left(\Re(\mu_2) + \delta\right)\sum_{k=2}^{d} h_k^2 \le \mu_1 \|h\|_2^2,$$
which proves Item 1.

Proof for Item 2. Since $J_\delta$ is block diagonal, we only need to prove that $\|e^{tJ_{\delta[j]}}\|_2 \le e^{\mu_2' t}$ for every $j \ge 2$. If $J_{\delta[j]} = J^{(r)}_{a,\delta} = aI + \delta N$, where $N$ is the nilpotent matrix with ones on the superdiagonal, then $e^{tJ_{\delta[j]}} = e^{atI + \delta t N} = e^{atI} e^{\delta t N} = e^{at} e^{\delta t N}$, where the second equality uses the fact that $I$ and $N$ commute. So we have
$$\|e^{tJ_{\delta[j]}}\|_2 \le e^{at}\|e^{\delta t N}\|_2 \le e^{at} e^{\delta t \|N\|_2} \le e^{(a+\delta)t}.$$
If $J_{\delta[j]} = J^{(r)}_{a,b,\delta} = D + \delta N_2$, where $D = \operatorname{diag}(C, C, \dots, C)$ and $N_2$ is the nilpotent matrix with identity blocks on the superdiagonal, then $e^{tJ_{\delta[j]}} = e^{tD + \delta t N_2} = e^{tD} e^{\delta t N_2}$, where the second equality uses the fact that $D$ and $N_2$ commute. Note that
$$e^{tC} = e^{at}\begin{pmatrix} \cos(bt) & -\sin(bt) \\ \sin(bt) & \cos(bt) \end{pmatrix},$$
which implies $\|e^{tD}\|_2 = \|e^{tC}\|_2 = e^{at}$. So we have
$$\|e^{tJ_{\delta[j]}}\|_2 \le \|e^{tD}\|_2 \cdot \|e^{\delta t N_2}\|_2 \le e^{at} e^{\delta t \|N_2\|_2} \le e^{(a+\delta)t}.$$
Since $\delta \in (0, \mu_2' - \Re(\mu_2))$ and $a \le \Re(\mu_2)$ for $j \ge 2$, we know that $a + \delta < \mu_2'$, which completes the proof.

Proof for a fixed $\delta$. Since Item 1 continues to hold for $\delta \in (0, \gamma)$, Lemmas G.3 to G.6 also hold. This proves that $z(t)$ exists and satisfies (6). It remains to prove (25) for any $\epsilon > 0$. Let $\gamma' \in (0, \gamma)$ be a number such that $\frac{\gamma'}{\mu_1+\gamma'} \ge \frac{\gamma}{\mu_1+\gamma} - \epsilon$. Fix $\mu_2' = \mu_1 - \gamma'$ and $\delta = \mu_2' - \Re(\mu_2)$. By Item 2, we have $\|e^{tJ_\delta} - e^{\mu_1 t} e_1 e_1^\top\|_2 \le e^{\mu_2' t}$ for all $t \ge 0$. By scrutinizing the proof of Theorem G.1, we can find that the only place we use Item 2 (through Lemma G.2) is in (24). For proving (25), we can repeat the proof while replacing all occurrences of $\mu_2$ by $\mu_2'$. Then we know that for every $t \in \mathbb{R}$, there is a constant $C > 0$ such that
$$\left\| \tilde{V}_\delta^{-1}\phi\left(\delta_\alpha,\ t + \tfrac{1}{\mu_1}\log\tfrac{1}{\langle\delta_\alpha, \tilde{u}_1\rangle}\right) - \tilde{V}_\delta^{-1} z(t) \right\|_2 \le C \cdot \left\|\tilde{V}_\delta^{-1}\delta_\alpha\right\|_2^{\frac{\gamma'}{\mu_1+\gamma'}}$$
for every sufficiently small $\alpha$. By definition of $\gamma'$, $\frac{\gamma'}{\mu_1+\gamma'} \ge \frac{\gamma}{\mu_1+\gamma} - \epsilon$. Since $\delta_\alpha \to 0$ as $\alpha \to 0$, we have $\|\tilde{V}_\delta^{-1}\delta_\alpha\|_2 < 1$ for sufficiently small $\alpha$.
Then we have
$$\left\| \phi\left(\delta_\alpha,\ t + \tfrac{1}{\mu_1}\log\tfrac{1}{\langle\delta_\alpha,\tilde{u}_1\rangle}\right) - z(t) \right\|_2 \le \|\tilde{V}_\delta\|_2 \cdot \left\| \tilde{V}_\delta^{-1}\phi\left(\delta_\alpha,\ t + \tfrac{1}{\mu_1}\log\tfrac{1}{\langle\delta_\alpha,\tilde{u}_1\rangle}\right) - \tilde{V}_\delta^{-1} z(t) \right\|_2 \le \|\tilde{V}_\delta\|_2 \cdot C \cdot \left\|\tilde{V}_\delta^{-1}\delta_\alpha\right\|_2^{\frac{\gamma'}{\mu_1+\gamma'}} \le C \cdot \|\tilde{V}_\delta\|_2 \cdot \|\tilde{V}_\delta^{-1}\|_2^{\frac{\gamma}{\mu_1+\gamma}} \cdot \|\delta_\alpha\|_2^{\frac{\gamma}{\mu_1+\gamma}-\epsilon}.$$
Absorbing $\|\tilde{V}_\delta\|_2 \cdot \|\tilde{V}_\delta^{-1}\|_2^{\frac{\gamma}{\mu_1+\gamma}}$ into $C$ proves (25).
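The two ingredients of the proof sketch are easy to sanity-check numerically. Below is a small sketch (hand-picked constants, not from the paper's setting) verifying the $\delta$-rescaling identity $D J^{(r)}_{a,1} D^{-1} = J^{(r)}_{a,\delta}$ and the bound $\|e^{tJ_\delta} - e^{\mu_1 t}e_1e_1^\top\|_2 \le e^{\mu_2' t}$ of Item 2 for $J_\delta = \operatorname{diag}([\mu_1], J^{(2)}_{a,\delta})$:

```python
import numpy as np

def expm_series(M, terms=60):
    """Matrix exponential e^M via its power series (fine for small ‖M‖)."""
    out, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

def J(r, a, d):
    """J^{(r)}_{a,δ}: a on the diagonal, δ on the superdiagonal."""
    return a * np.eye(r) + d * np.diag(np.ones(r - 1), 1)

mu1, a, delta, mu2p = 1.0, 0.2, 0.25, 0.5   # μ₁ > μ₂' > a and a + δ < μ₂'

# δ-rescaling of a real Jordan block: D J^{(r)}_{a,1} D^{-1} = J^{(r)}_{a,δ}
r = 4
D = np.diag(delta ** np.arange(r, 0, -1))   # diag(δ^r, ..., δ)
assert np.allclose(D @ J(r, a, 1.0) @ np.linalg.inv(D), J(r, a, delta))

# Item 2: ‖e^{tJ_δ} − e^{μ₁t} e₁e₁ᵀ‖₂ ≤ e^{μ₂'t} for J_δ = diag([μ₁], J^{(2)}_{a,δ})
Jd = np.zeros((3, 3))
Jd[0, 0] = mu1
Jd[1:, 1:] = J(2, a, delta)
e1 = np.zeros(3); e1[0] = 1.0
for t in np.linspace(0.0, 5.0, 100):
    E = expm_series(t * Jd) - np.exp(mu1 * t) * np.outer(e1, e1)
    assert np.linalg.norm(E, 2) <= np.exp(mu2p * t) + 1e-9
```

The block-diagonal structure makes the check decompose exactly as in the proof: the top $1\times 1$ block is removed by subtracting $e^{\mu_1 t}e_1e_1^\top$, and the remaining Jordan block obeys $\|e^{t(aI+\delta N)}\|_2 \le e^{(a+\delta)t}$.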

H EIGENVALUES OF JACOBIANS AND HESSIANS

In this section we analyze the eigenvalues of the Jacobian J (W ) at critical points of (2). For notation simplicity, we write sz(A) := A + A to denote the symmetric matrix produced by adding up A and its transpose, and write ac{A, B} = AB + BA to denote the anticommutator of two matrices A, B. Then g(W ) can be written as g(W ) := -ac{∇f (W ), W }. Let U 0 ∈ R d×r be a stationary point of the function L : R d×r → R, L(U ) = 1 2 f (U U ), i.e., ∇L(U 0 ) = ∇f (U 0 U 0 )U 0 = 0. By Lemma C.1, this implies ∇f (W 0 )W 0 = 0 for W 0 := U 0 U 0 , and thus W 0 is a critical point of (2). For a real-valued or vector-valued function F (θ), we use DF (θ)[δ], D 2 F (θ)[δ 1 , δ 2 ] to denote the first-and second-order directional derivatives of F ( • ) at θ. Let X be a linear space, which can be R d×d or R d×r . For a function F : X → X , we use DF (θ) to denote the directional derivative of F at θ, represented by the linear operator DF (θ)[∆] : X → X , ∆ → DF (θ)[∆] = lim t→0 F (θ + t∆) -F (θ) t . We also write DF (θ)[∆ 1 , ∆ 2 ] := DF (θ)[∆ 1 ], ∆ 2 . For a function F : X → R, we use D 2 F (θ) = D(∇F (θ)) to denote the second directional derivative of F at θ, i.e., D 2 F (θ)[∆] = D(∇F (θ))[∆], D 2 F (θ)[∆ 1 , ∆ 2 ] = D(∇F (θ))[∆ 1 , ∆ 2 ]. Define J (W ) := Dg(W ). By simple calculus, we can compute the formula for J (W 0 ): J (W 0 )[∆] = -ac{∇f (W 0 ), ∆} -ac{D 2 f (W 0 )[∆], W 0 }, J (W 0 )[∆ 1 , ∆ 2 ] = -∇f (W 0 ), sz(∆ 1 ∆ 2 ) -D 2 f (W 0 )[∆ 1 , sz(W 0 ∆ 2 )], where ∆, ∆ 1 , ∆ 2 ∈ R d×d . We can also compute the formula for D 2 L(U 0 ): D 2 L(U 0 )[∆] = ∇f (W 0 )∆ + D 2 f (W 0 )[sz(∆U 0 )]U 0 , D 2 L(U 0 )[∆ 1 , ∆ 2 ] = 1 2 ∇f (W 0 ), sz(∆ 1 ∆ 2 ) + D 2 f (W 0 )[sz(∆ 1 U 0 ), sz(∆ 2 U 0 )] , where ∆, ∆ 1 , ∆ 2 ∈ R d×r .

H.1 EIGENVALUES AT THE ORIGIN

The eigenvalues of $J(0)$ are given in Lemma 5.4. Now we provide the proof.

Proof of Lemma 5.4. For $W_0 = 0$, we have
$$J(0)[\Delta] = -\nabla f(0)\Delta - \Delta\nabla f(0), \qquad J(0)[\Delta_1, \Delta_2] = -\left\langle \nabla f(0),\ \operatorname{sz}(\Delta_1\Delta_2^\top) \right\rangle.$$
It is easy to see from the second equation that $J(0)$ is symmetric. Let $-\nabla f(0) = \sum_{i=1}^{d}\mu_i u_{1[i]} u_{1[i]}^\top$ be the eigendecomposition of the symmetric matrix $-\nabla f(0)$. Then we have
$$J(0)[\Delta] = \sum_{i=1}^{d}\mu_i\left( u_{1[i]} u_{1[i]}^\top \Delta + \Delta\, u_{1[i]} u_{1[i]}^\top \right) = \sum_{i=1}^{d}\sum_{j=1}^{d}\mu_i\left( u_{1[i]} u_{1[i]}^\top \Delta\, u_{1[j]} u_{1[j]}^\top + u_{1[j]} u_{1[j]}^\top \Delta\, u_{1[i]} u_{1[i]}^\top \right) = \sum_{i=1}^{d}\sum_{j=1}^{d}(\mu_i+\mu_j)\, u_{1[i]} u_{1[i]}^\top \Delta\, u_{1[j]} u_{1[j]}^\top = \sum_{i=1}^{d}\sum_{j=1}^{d}(\mu_i+\mu_j)\left\langle \Delta,\ u_{1[i]} u_{1[j]}^\top \right\rangle u_{1[i]} u_{1[j]}^\top,$$
which proves (8). For $\Delta = u_{1[i]} u_{1[j]}^\top + u_{1[j]} u_{1[i]}^\top$, we have
$$J(0)[\Delta] = (\mu_i+\mu_j)\, u_{1[i]} u_{1[j]}^\top + (\mu_i+\mu_j)\, u_{1[j]} u_{1[i]}^\top = (\mu_i+\mu_j)\,\Delta.$$
So $u_{1[i]} u_{1[j]}^\top + u_{1[j]} u_{1[i]}^\top$ is an eigenvector of $J(0)$ associated with eigenvalue $\mu_i + \mu_j$. Note that $\{u_{1[i]} u_{1[j]}^\top + u_{1[j]} u_{1[i]}^\top : i, j \in [d]\}$ spans the space of all symmetric matrices, so these are all the eigenvectors in the space of symmetric matrices. For every antisymmetric matrix $\Delta$ (i.e., $\Delta = -\Delta^\top$), we have $J(0)[\Delta] = J(0)[\Delta^\top] = J(0)[-\Delta]$. So $J(0)[\Delta] = 0$ and every antisymmetric matrix is an eigenvector associated with eigenvalue $0$. Since every matrix can be expressed as the sum of a symmetric matrix and an antisymmetric matrix, we have found all the eigenvalues.
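The eigen-structure in the symmetric subspace is easy to verify numerically. Below is a minimal sketch in which a random symmetric matrix `G` stands in for $-\nabla f(0)$ (an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
S = rng.standard_normal((d, d))
G = (S + S.T) / 2                 # G plays the role of -∇f(0) (symmetric)
mu, u = np.linalg.eigh(G)         # eigenvalues μ_i and eigenvectors u_i

def J0(Delta):
    """J(0)[Δ] = -∇f(0)Δ - Δ∇f(0) = GΔ + ΔG."""
    return G @ Delta + Delta @ G

# u_i u_jᵀ + u_j u_iᵀ is an eigenvector with eigenvalue μ_i + μ_j
i, j = 0, 3
Delta = np.outer(u[:, i], u[:, j]) + np.outer(u[:, j], u[:, i])
assert np.allclose(J0(Delta), (mu[i] + mu[j]) * Delta)
```

Running the same check over all pairs $(i, j)$ confirms that these $\frac{d(d+1)}{2}$ symmetric eigenvectors account for the whole symmetric subspace, matching the counting argument in the proof.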

H.2 EIGENVALUES AT SECOND-ORDER STATIONARY POINTS

Now we study the eigenvalues of $J(W_0)$ when $U_0$ is a second-order stationary point of $L(\cdot)$, i.e., $\nabla L(U_0) = 0$ and $D^2 L(U_0)[\Delta, \Delta] \ge 0$ for all $\Delta \in \mathbb{R}^{d\times r}$. We further assume that $U_0$ is full-rank, i.e., $\operatorname{rank}(U_0) = r$. By Lemma H.1, this condition is met if $W_0 := U_0 U_0^\top$ is a local minimizer of $f(\cdot)$ in $\mathcal{S}^+_{d,\le r}$ but not a minimizer in $\mathcal{S}^+_d$.

Lemma H.1. For $r \le d$, if $U_0 \in \mathbb{R}^{d\times r}$ is a second-order stationary point of $L(\cdot)$, then either $\operatorname{rank}(U_0) = \operatorname{rank}(W_0) = r$, or $W_0$ is a minimizer of $f(\cdot)$ in $\mathcal{S}^+_d$, where $W_0 = U_0 U_0^\top$.

Proof. Assume to the contrary that $U_0$ has rank $< r$ and $W_0$ is not a minimizer of $f(\cdot)$ in $\mathcal{S}^+_d$. The former implies that there exists a unit vector $q \in \mathbb{R}^r$ such that $U_0 q = 0$, and the latter implies that there exists $v \in \mathbb{R}^d$ such that $v^\top \nabla f(W_0) v < 0$ by Lemma C.2.

Now we define

$$P := \begin{pmatrix} \hat{U} & (\tilde{V}^+)^\top \end{pmatrix}, \qquad Q := \begin{pmatrix} \hat{U}^+ \\ \tilde{V}^\top \end{pmatrix}.$$
Then the blocks of $QP$ are $\hat{U}^+\hat{U}$, $\hat{U}^+(\tilde{V}^+)^\top$, $\tilde{V}^\top\hat{U}$, and $\tilde{V}^\top(\tilde{V}^+)^\top$. Note that $\hat{U}^+\hat{U} = I_K$, $\tilde{V}^\top\hat{U} = 0$, $\hat{U}^+(\tilde{V}^+)^\top = (\hat{U}^\top\hat{U})^{-1}(\tilde{V}^\top\hat{U})^\top(\tilde{V}^\top\tilde{V})^{-1} = 0$, and $\tilde{V}^\top(\tilde{V}^+)^\top = I_{D-K}$. So $QP = I_D$, or equivalently $Q = P^{-1}$. Then we have
$$P^{-1} A P = \begin{pmatrix} \operatorname{diag}(\hat{\lambda}_1, \dots, \hat{\lambda}_K) & * \\ 0 & \operatorname{diag}(\tilde{\lambda}_1, \dots, \tilde{\lambda}_{D-K}) \end{pmatrix},$$
where $*$ can be any $K \times (D-K)$ matrix. Since $P^{-1}AP$ is block upper-triangular, we know that $P^{-1}AP$ has eigenvalues $\hat{\lambda}_1, \dots, \hat{\lambda}_K, \tilde{\lambda}_1, \dots, \tilde{\lambda}_{D-K}$, and so does $A$.

Theorem H.5. The eigenvalues of $J(W_0)$ can be fully classified into the following 3 types:
1. $\mu_i + \mu_j$ is an eigenvalue for every $1 \le i \le j \le d - r$, and $\hat{U}_{ij} := v_i v_j^\top + v_j v_i^\top$ is an associated left eigenvector.
2. $\xi_p$ is an eigenvalue for every $1 \le p \le rd - \frac{r(r-1)}{2}$, and $\tilde{V}_p := E_p U_0^\top + U_0 E_p^\top$ is an associated right eigenvector.
3. $0$ is an eigenvalue, and any antisymmetric matrix is an associated right eigenvector; these span a linear space of dimension $\frac{d(d-1)}{2}$.

Proof of Theorem H.5. We first prove each item respectively, and then prove that these are all the eigenvalues of $J(W_0)$.

Proof for Item 1. For $\hat{U}_{ij} = v_i v_j^\top + v_j v_i^\top$, it is easy to check:
$$\operatorname{ac}\{-\nabla f(W_0), \hat{U}_{ij}\} = (\mu_i + \mu_j)\hat{U}_{ij}, \qquad \hat{U}_{ij} W_0 = 0, \qquad W_0 \hat{U}_{ij} = 0.$$
So we have
$$J(W_0)[\Delta, \hat{U}_{ij}] = (\mu_i + \mu_j)\langle\Delta, \hat{U}_{ij}\rangle - D^2 f(W_0)[\Delta, 0] = (\mu_i + \mu_j)\langle\Delta, \hat{U}_{ij}\rangle,$$
which shows that $\hat{U}_{ij}$ is a left eigenvector associated with eigenvalue $\mu_i + \mu_j$.

Proof for Item 2. By definition of eigenvector, we have $-D^2 L(U_0)[E_p] = \xi_p E_p$, so
$$\xi_p E_p = -\nabla f(W_0) E_p - D^2 f(W_0)[\operatorname{sz}(E_p U_0^\top)]\, U_0.$$
Right-multiplying both sides by $U_0^\top$, we get
$$\xi_p E_p U_0^\top = -\nabla f(W_0) E_p U_0^\top - D^2 f(W_0)[\tilde{V}_p]\, W_0 = -\nabla f(W_0)\left(E_p U_0^\top + U_0 E_p^\top\right) - D^2 f(W_0)[\tilde{V}_p]\, W_0 = -\nabla f(W_0)\tilde{V}_p - D^2 f(W_0)[\tilde{V}_p]\, W_0,$$
where the second equality uses the fact that $\nabla f(W_0) U_0 = 0$ since $U_0$ is a critical point. Applying $\operatorname{sz}(\cdot)$ to both sides gives
$$\xi_p \tilde{V}_p = -\operatorname{sz}\left(\nabla f(W_0)\tilde{V}_p\right) - \operatorname{sz}\left(D^2 f(W_0)[\tilde{V}_p]\, W_0\right) = J(W_0)[\tilde{V}_p],$$
which proves that $\tilde{V}_p$ is a right eigenvector associated with eigenvalue $\xi_p$.

Proof of Lemma I.1. For every $t \ge 0$, if $\|\theta(t) - \tilde{\theta}\|_2 \le r$,
$$\frac{d}{dt}\left( L(\theta(t)) - L(\tilde{\theta}) \right)^{1-\mu} = (1-\mu)\left( L(\theta(t)) - L(\tilde{\theta}) \right)^{-\mu}\left\langle \nabla L, \frac{d\theta}{dt} \right\rangle = -(1-\mu)\left( L(\theta(t)) - L(\tilde{\theta}) \right)^{-\mu}\|\nabla L\|_2 \left\|\frac{d\theta}{dt}\right\|_2 \le -(1-\mu)c\left\|\frac{d\theta}{dt}\right\|_2.$$
Therefore,
$$\|\theta(t) - \theta_0\|_2 \le \int_0^t \left\|\frac{d\theta}{dt}\right\|_2 dt \le \frac{1}{(1-\mu)c}\left( L(\theta_0) - L(\tilde{\theta}) \right)^{1-\mu} = O\left(\|\theta_0 - \tilde{\theta}\|_2^{2(1-\mu)}\right).$$
If we choose $\|\theta_0 - \tilde{\theta}\|_2$ small enough, then $\|\theta(t) - \tilde{\theta}\|_2 \le \|\theta(t) - \theta_0\|_2 + \|\theta_0 - \tilde{\theta}\|_2 = O(\|\theta_0 - \tilde{\theta}\|_2^{2(1-\mu)}) < r$, and thus $\int_0^{+\infty}\|\frac{d\theta}{dt}\|_2\, dt$ is convergent and finite. This implies that $\theta_\infty := \lim_{t\to+\infty}\theta(t)$ exists and $\|\theta_\infty - \tilde{\theta}\|_2 = O(\|\theta_0 - \tilde{\theta}\|_2^{2(1-\mu)})$.

Proof for Theorem 5.8. Since $W^G_1(t) \in \mathcal{S}^+_{d,\le 1}$ satisfies (2), there exists $u(t) \in \mathbb{R}^d$ such that $u(t)u(t)^\top = W^G_1(t)$ and $u(t)$ satisfies (1), i.e., $\frac{du}{dt} = -\nabla L(u)$, where $L : \mathbb{R}^d \to \mathbb{R},\ u \mapsto \frac{1}{2}f(uu^\top)$.
If $W^G_1(t)$ does not diverge to infinity, then neither does $u(t)$. This implies that there is a limit point $\bar{u}$ of the set $\{u(t) : t \ge 0\}$. Let $\mathcal{U} := \{u : L(u) \ge L(\bar{u})\}$. Since $L(u(t))$ is non-increasing, we have $u(t) \in \mathcal{U}$ for all $t$. Note that $\bar{u}$ is a local minimizer of $L(\cdot)$ in $\mathcal{U}$. By analyticity of $f(\cdot)$, the Łojasiewicz inequality holds for $L(\cdot)$ around $\bar{u}$ (Łojasiewicz, 1965). Applying Lemma I.1 for $L$ restricted on $\mathcal{U}$, we know that if $u(t_0)$ is sufficiently close to $\bar{u}$, the remaining length of the trajectory of $u(t)$ ($t \ge t_0$) is finite, and thus $\lim_{t\to+\infty} u(t)$ exists. As $\bar{u}$ is a limit point, this limit can only be $\bar{u}$. Therefore,
$$W_1 := \lim_{t\to+\infty} W^G_1(t) = \bar{u}\bar{u}^\top$$
exists. If $W_1$ is a minimizer of $f(\cdot)$, then $U = (\bar{u}, 0, \cdots, 0) \in \mathbb{R}^{d\times d}$ is also a minimizer of $\mathcal{L} : \mathbb{R}^{d\times d} \to \mathbb{R},\ U \mapsto \frac{1}{2}f(UU^\top)$. By analyticity of $f(\cdot)$, the Łojasiewicz inequality holds for $\mathcal{L}(\cdot)$ around $U$. For every $\epsilon > 0$, we can always find a time $t_\epsilon$ such that $\|u(t_\epsilon) - \bar{u}\|_2 \le \epsilon/2$. On the other hand, by Theorem 5.6, there exists a number $\alpha_\epsilon$ such that for every $\alpha < \alpha_\epsilon$, $\|\phi(W_\alpha, T(W_\alpha) + t_\epsilon) - W^G_1(t_\epsilon)\|_2 \le \epsilon/2$, where $T(W) := \frac{1}{2\mu_1}\log\frac{1}{\langle W, u_1 u_1^\top\rangle}$. Combining these together, we have $\|\phi(W_\alpha, T(W_\alpha) + t_\epsilon) - W_1\|_2 \le \epsilon$. It is easy to construct a factorization $\phi(W_\alpha, T(W_\alpha) + t_\epsilon) = U_{\alpha,\epsilon} U_{\alpha,\epsilon}^\top$ such that $\|U_{\alpha,\epsilon} - U\|_2 = O(\epsilon)$; e.g., we can find an arbitrary factorization and then right-multiply an orthogonal matrix so that the column vector with the largest norm aligns with the direction of $\bar{u}$. Applying Lemma I.1, we know that gradient flow starting with $U_{\alpha,\epsilon}$ converges to a point that is only $O(\epsilon^{2(1-\mu)})$ far from $U$. So we have
$$\lim_{t\to+\infty}\left\| \phi(W_\alpha, T(W_\alpha) + t) - W_1 \right\|_2 = O\left(\epsilon^{2(1-\mu)}\right).$$
Taking $\epsilon \to 0$ completes the proof.

I.3 PROOF FOR THEOREM 5.11

Theorem I.2. Let $W$ be a critical point of (2) such that $W$ is a local minimizer of $f(\cdot)$ in $\mathcal{S}^+_{d,\le r}$ for some $r \ge 1$ but not a minimizer in $\mathcal{S}^+_d$. Let $-\nabla f(W) = \sum_{i=1}^{d}\mu_i v_i v_i^\top$ be the eigendecomposition of $-\nabla f(W)$. If $\mu_1 > \mu_2$, the following limit exists and is a solution of (2):
$$W^G(t) := \lim_{\epsilon\to 0}\phi\left( W + \epsilon v_1 v_1^\top,\ \frac{1}{2\mu_1}\log\frac{1}{\epsilon} + t \right).$$
For $\{W_\alpha\} \subseteq \mathcal{S}^+_d$, if there exists a time $T_\alpha \in \mathbb{R}$ for every $\alpha$ so that $\phi(W_\alpha, T_\alpha)$ converges to $W$ with positive alignment with the top principal component $v_1 v_1^\top$ as $\alpha \to 0$, then for all $t \in \mathbb{R}$,
$$\lim_{\alpha\to 0}\phi\left( W_\alpha,\ T_\alpha + \frac{1}{2\mu_1}\log\frac{1}{\langle \phi(W_\alpha, T_\alpha), v_1 v_1^\top\rangle} + t \right) = W^G(t).$$
Moreover, there exists a constant $C > 0$ such that
$$\left\| \phi\left( W_\alpha,\ T_\alpha + \frac{1}{2\mu_1}\log\frac{1}{\langle \phi(W_\alpha, T_\alpha), v_1 v_1^\top\rangle} + t \right) - W^G(t) \right\|_F \le C\,\left\| \phi(W_\alpha, T_\alpha) - W \right\|_F^{\frac{\gamma}{2\mu_1+\gamma}}$$
for every sufficiently small $\alpha$, where $\gamma := 2\mu_1 - \max\{\mu_1 + \mu_2, 0\}$.

Proof. Following Appendix I.1, we view $\operatorname{vec}_{\mathrm{LT}}(W(t))$ as a dynamical system:
$$\frac{d}{dt}\operatorname{vec}_{\mathrm{LT}}(W(t)) = g(\operatorname{vec}_{\mathrm{LT}}(W(t))).$$
Let $W = UU^\top$ be a factorization of $W$, where $U \in \mathbb{R}^{d\times r}$. Since $W$ is a local minimizer of $f(\cdot)$ in $\mathcal{S}^+_{d,\le r}$, $U$ is also a local minimizer of $L : \mathbb{R}^{d\times r} \to \mathbb{R},\ U \mapsto \frac{1}{2}f(UU^\top)$. Since $W$ is not a minimizer of $f(\cdot)$ in $\mathcal{S}^+_d$, by Lemma H.1, $U$ is full-rank. By Theorem H.5, $J(W)$ has eigenvalues $\mu_i + \mu_j$, $\xi_p$, and $0$. By a similar argument as in Appendix I.1, the Jacobian of $g$ at $\operatorname{vec}_{\mathrm{LT}}(W(t))$ has eigenvalues $\mu_i + \mu_j$ and $\xi_p$. Since $U$ is a local minimizer, $\xi_p \le 0$ for all $p$. If $\mu_1 > \mu_2$, then $2\mu_1$ is the unique largest eigenvalue, and Theorem H.5 shows that $\operatorname{vec}_{\mathrm{LT}}(v_1 v_1^\top)$ is a left eigenvector associated with $2\mu_1$. The eigenvalue gap satisfies
$$\gamma := 2\mu_1 - \max\left\{ \mu_1 + \mu_2,\ \max\left\{\xi_p : 1 \le p \le rd - \tfrac{r(r-1)}{2}\right\} \right\} \ge 2\mu_1 - \max\{\mu_1 + \mu_2, 0\}.$$
Also note that $\langle \phi(W_\alpha, T_\alpha) - W, v_1 v_1^\top\rangle = \langle \phi(W_\alpha, T_\alpha), v_1 v_1^\top\rangle$, because $\langle W, v_1 v_1^\top\rangle = 0$ by (27). If $\phi(W_\alpha, T_\alpha)$ converges to $W$ as $\alpha \to 0$, then it has positive alignment with $v_1 v_1^\top$ iff
$$\liminf_{\alpha\to 0}\frac{\langle \phi(W_\alpha, T_\alpha), v_1 v_1^\top\rangle}{\|\phi(W_\alpha, T_\alpha) - W\|_F} > 0.$$
The claims then follow by applying Theorem G.7 to this dynamical system.

Theorem I.3. Suppose $f$ is convex. Then (1) all local minimizers of $\mathcal{L} : U \mapsto \frac{1}{2}f(UU^\top)$ are global minimizers, and (2) all saddles are strict. Here saddles denote those stationary points whose Hessian is not positive semidefinite (thus including local maximizers).

Theorem I.4 (Theorem 2 in Lee et al. 2017). Let $g$ be a $\mathcal{C}^1$ mapping from $\mathcal{X} \to \mathcal{X}$ with $\det(Dg(x)) \ne 0$ for all $x \in \mathcal{X}$. Then the set of initial points that converge to an unstable fixed point has measure zero:
$$\mu\left\{ x_0 : \lim_{k\to\infty} g^k(x_0) \in A^*_g \right\} = 0, \qquad \text{where } A^*_g = \left\{ x : g(x) = x,\ \max_i |\lambda_i(Dg(x))| > 1 \right\}.$$

Theorem I.5 (GF only finds minimizers; a continuous analog of Theorem I.4). Let $f : \mathbb{R}^d \to \mathbb{R}^d$ be a $\mathcal{C}^1$-smooth function and $\phi : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d$ be the solution of the differential equation
$$\frac{d\phi(x,t)}{dt} = f(\phi(x,t)), \qquad \phi(x, 0) = x, \quad \forall x \in \mathbb{R}^d,\ t \in \mathbb{R}.$$
Then the set of initial points that converge to an unstable critical point has measure zero:
$$\mu\left\{ x_0 : \lim_{t\to\infty}\phi(x_0, t) \in U^*_f \right\} = 0, \qquad \text{where } U^*_f = \left\{ x : f(x) = 0,\ \lambda_1(Df(x)) > 0 \right\}$$
and $Df$ is the Jacobian matrix of $f$.

Proof of Theorem I.5. By Theorem 1 in Section 2.3 of Perko (2013), we know $\phi(\cdot,\cdot)$ is $\mathcal{C}^1$-smooth in both $x$ and $t$. Let $g(x) = \phi(x, 1)$; then $g^{-1}(x) = \phi(x, -1)$ and both $g$ and $g^{-1}$ are $\mathcal{C}^1$-smooth. Note that $Dg^{-1}(x)$ is the inverse matrix of $Dg(x)$, so both matrices are invertible. Thus we can apply Theorem I.4 and conclude $\mu\{x_0 : \lim_{k\to\infty} g^k(x_0) \in A^*_g\} = 0$. Note that if $\lim_{t\to\infty}\phi(x, t)$ exists, then $\lim_{k\to\infty} g^k(x) = \lim_{t\to\infty}\phi(x, t)$. It remains to show that $U^*_f \subseteq A^*_g$. For $f(x_0) = 0$, we have $\phi(x_0, t) = x_0$ and thus $g(x_0) = x_0$. Now it suffices to prove that $\lambda_1(Dg(x_0)) > 1$. For every $t \in [0, 1]$, by the Corollary of Theorem 1 in Section 2.3 of Perko (2013), we have
$$\frac{\partial}{\partial t} D\phi(x, t) = Df(\phi(x, t))\, D\phi(x, t), \qquad \forall x, t.$$
Thus $\frac{\partial}{\partial t} D\phi(x_0, t) = Df(\phi(x_0, t))\, D\phi(x_0, t) = Df(x_0)\, D\phi(x_0, t)$. Solving this ODE gives $Dg(x_0) = D\phi(x_0, 1) = e^{Df(x_0)} D\phi(x_0, 0) = e^{Df(x_0)}$, where the last equality is due to $D\phi(x, 0) \equiv I$ for all $x$. Combining this with $\lambda_1(Df(x_0)) > 0$, we have $\lambda_1(Dg(x_0)) > 1$. Thus $U^*_f := \{x_0 : f(x_0) = 0,\ \lambda_1(Df(x_0)) > 0\} \subseteq A^*_g$, which implies $\{x_0 : \lim_{t\to\infty}\phi(x_0, t) \in U^*_f\} \subseteq \{x_0 : \lim_{k\to\infty} g^k(x_0) \in A^*_g\}$.

Theorem 5.10. Let $f : \mathbb{R}^{d\times d} \to \mathbb{R}$ be a convex $\mathcal{C}^2$-smooth function. (1)
All stationary points of $\mathcal{L} : \mathbb{R}^{d\times d} \to \mathbb{R},\ \mathcal{L}(U) = \frac{1}{2}f(UU^\top)$ are either strict saddles or global minimizers; (2) for any random initialization, GF (1) converges to strict saddles of $\mathcal{L}(U)$ with probability $0$.

Proof of Theorem 5.10. For (1), by Theorem I.3, we immediately know all the stationary points of $\mathcal{L}(\cdot)$ are either global minimizers or strict saddles. (2) is a direct consequence of Theorem I.5, obtained by setting the $f$ in the above proof to $-\nabla\mathcal{L}$.
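The eigen-structure of $J(W_0)$ from Theorem H.5, which drives the proof of Theorem I.2 above, can be spot-checked numerically. Below is a sketch for the quadratic loss $f(W) = \frac{1}{2}\|W - W^*\|_F^2$ (so that $D^2 f(W_0)[\Delta] = \Delta$), with $W_0$ the best rank-1 PSD approximation of a random PSD $W^*$; all variable names are our own:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
X = rng.standard_normal((d, d))
W_star = X @ X.T                                 # f(W) = ½‖W − W*‖_F²
lam, q = np.linalg.eigh(W_star)
lam, q = lam[::-1], q[:, ::-1]                   # sort eigenvalues descending
W0 = lam[0] * np.outer(q[:, 0], q[:, 0])         # best rank-1 PSD approximation
grad_f = W0 - W_star                             # ∇f(W0); here D²f(W0)[Δ] = Δ

def Jop(D):
    """J(W0)[Δ] = -ac{∇f(W0), Δ} - ac{D²f(W0)[Δ], W0}."""
    return -(grad_f @ D + D @ grad_f) - (D @ W0 + W0 @ D)

# Item 1 of Theorem H.5: for eigenvectors q_i, q_j of -∇f(W0) orthogonal to
# col(W0), U_ij = q_i q_jᵀ + q_j q_iᵀ is a LEFT eigenvector, eigenvalue μ_i+μ_j.
i, j = 1, 2
U_ij = np.outer(q[:, i], q[:, j]) + np.outer(q[:, j], q[:, i])
Delta = rng.standard_normal((d, d))
lhs = np.sum(Jop(Delta) * U_ij)                  # ⟨J(W0)[Δ], U_ij⟩
rhs = (lam[i] + lam[j]) * np.sum(Delta * U_ij)   # (μ_i + μ_j)⟨Δ, U_ij⟩
assert np.isclose(lhs, rhs, atol=1e-8)
```

Here $\mu_i = \lambda_i(W^* - W_0)$ for $i \ge 2$, since $-\nabla f(W_0) = W^* - W_0$, so the left-eigenvector identity holds exactly up to floating-point error for any test direction $\Delta$.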

J EQUIVALENCE BETWEEN GF AND GLRL

In this section we elaborate on the theoretical evidence that GF and GLRL are equivalent generically, including the case where GLRL does not end in the first phase. The word "generically" is used when we want to assume one of the following regularity conditions:
1. We want to assume that GF converges to a local minimizer (i.e., GF does not get stuck on saddle points);
2. We want to assume that the top eigenvalue $\lambda_1(-\nabla f(W))$ is unique for a critical point $W$ of (2) that is not a minimizer of $f(\cdot)$ in $\mathcal{S}^+_d$;
3. We want to assume that a convergent sequence of PSD matrices $W_\alpha \to W$ has positive alignment with $vv^\top$ for some fixed vector $v$ with $\langle W, vv^\top\rangle = 0$. That is, for such a convergent sequence it always holds that $\liminf_{\alpha\to 0}\frac{\langle W_\alpha - W, vv^\top\rangle}{\|W_\alpha - W\|_F} \ge 0$, and we further assume that the inequality is strict generically.

Theorem I.2 uncovers how GF with infinitesimal initialization generically behaves. Let $W_0 := 0$. For every $r \ge 1$, if $W_{r-1}$ is a local minimizer in $\mathcal{S}^+_{d,\le r-1}$ but not a minimizer in $\mathcal{S}^+_d$, then $\lambda_1(-\nabla f(W_{r-1})) > 0$ by Lemma C.2. Generically, the top eigenvalue $\lambda_1(-\nabla f(W_{r-1}))$ should be unique, i.e., $\lambda_1(-\nabla f(W_{r-1})) > \lambda_2(-\nabla f(W_{r-1}))$. This enables us to apply Theorem I.2 and deduce that the limiting trajectory
$$W^G_r(t) := \lim_{\epsilon\to 0}\phi\left( W_{r-1} + \epsilon u_r u_r^\top,\ \frac{1}{2\lambda_1(-\nabla f(W_{r-1}))}\log\frac{1}{\epsilon} + t \right)$$
exists, where $u_r$ is the top eigenvector of $-\nabla f(W_{r-1})$. This $W^G_r(\cdot)$ is exactly the trajectory of GLRL in phase $r$ as $\epsilon \to 0$. Note that $W^G_r(\cdot)$ corresponds to a trajectory of GF minimizing $L(\cdot)$ in $\mathbb{R}^{d\times r}$, which should generically converge to a local minimizer of $L(\cdot)$ in $\mathbb{R}^{d\times r}$. This means the limit $W_r := \lim_{t\to+\infty} W^G_r(t)$ should generically be a local minimizer of $f(\cdot)$ in $\mathcal{S}^+_{d,\le r}$. If $W_r$ is further a minimizer in $\mathcal{S}^+_d$, then $\lambda_1(-\nabla f(W_r)) \le 0$ and GLRL exits with $W_r$; otherwise GLRL enters phase $r+1$.
If GF aligns well with GLRL in the beginning of phase $r$ (defined below), then by Theorem I.2, as $\alpha \to 0$, the minimum distance from GF to $W^G_r(t)$ converges to $0$ for every $t \in \mathbb{R}$. Therefore, GF can get arbitrarily close to the $r$-th critical point $W_r$ of GLRL; generically, there should exist a suitable choice of $T^{(r+1)}_\alpha$ so that $\phi(W_\alpha, T^{(r+1)}_\alpha)$ not only converges to $W_r$ but also has positive alignment with $u_{r+1}u_{r+1}^\top$. That is, GF should generically align well with GLRL in the beginning of phase $r+1$.

Definition J.1. We say that GF aligns well with GLRL in the beginning of phase $r$ if there exists $T^{(r)}_\alpha$ for every $\alpha > 0$ such that $\phi(W_\alpha, T^{(r)}_\alpha)$ converges to $W_{r-1}$ with positive alignment with $u_r u_r^\top$ as $\alpha \to 0$.

If the initialization satisfies that $W_\alpha$ converges to $0$ with positive alignment with $u_1 u_1^\top$ as $\alpha \to 0$, then GF aligns well with GLRL in the beginning of phase 1, which can be seen by taking $T^{(1)}_\alpha = 0$. Now assume that GF aligns well with GLRL in the beginning of phase $r-1$. If GLRL does not exit in phase $r-1$, the above argument shows that GF should generically align well with GLRL in the beginning of phase $r$. In the other case, we can use a similar argument as in Theorem 5.8 to show that GF converges to a solution near the minimizer $W_{r-1}$ of $f(\cdot)$ as $t \to \infty$, and the distance between the solution and $W_{r-1}$ converges to $0$ as $\alpha \to 0$. By this induction, we prove that GF with infinitesimal initialization is equivalent to GLRL generically.
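The phase structure described above can be sketched in a few lines of code. Below is a minimal GD-based sketch of GLRL (not the authors' implementation; `eps`, `lr`, and `n_steps` are illustrative choices) for the toy objective $f(W) = \frac{1}{2}\|W - W^*\|_F^2$:

```python
import numpy as np

def glrl(grad_f, d, eps=1e-3, lr=0.05, n_steps=2000, tol=1e-6):
    """Sketch of Greedy Low-Rank Learning: in phase r, append eps times the
    top eigenvector of -grad_f(W_{r-1}) as a new column, then re-run GD
    (a discretization of gradient flow) on the rank-r factorization."""
    U = np.zeros((d, 0))
    for r in range(1, d + 1):
        lam, vecs = np.linalg.eigh(-grad_f(U @ U.T))
        if lam[-1] <= tol:                 # first-order optimal over the PSD cone
            break
        U = np.hstack([U, eps * vecs[:, -1:]])   # enter phase r
        for _ in range(n_steps):           # minimize ½ f(U Uᵀ); ∇_U = ∇f(W) U
            U = U - lr * grad_f(U @ U.T) @ U
    return U @ U.T

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
W_star = Q @ np.diag([2.0, 1.0, 0, 0, 0, 0]) @ Q.T   # rank-2 ground truth
W_fit = glrl(lambda W: W - W_star, d=6)              # f(W) = ½‖W − W*‖_F²
```

On this instance the sketch runs exactly two phases (recovering the rank-1 then the rank-2 component) and stops when $\lambda_1(-\nabla f(W_2)) \le \mathrm{tol}$, so the returned solution has rank 2, matching the greedy rank-increment picture above.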

K PROOFS FOR DEEP MATRIX FACTORIZATION

K.1 PRELIMINARY LEMMAS

Lemma K.1. If $W(0) \succeq 0$, then $W(t) \succeq 0$ and $\operatorname{rank}(W(t)) = \operatorname{rank}(W(0))$ for all $t$.

Proof. Note that we can always find a set of balanced $U_i(t)$ such that $U_1(t)\cdots U_L(t) = W(t)$ with $d_2 = d_3 = \cdots = d_L = \operatorname{rank}(W(t))$, and write the dynamics of $W(t)$ in the space of $\{U_i\}_{i=1}^{L}$. Thus it is clear that $\operatorname{rank}(W(t')) \le \operatorname{rank}(W(t))$ for all $t'$. Applying the same argument with $t$ and $t'$ exchanged gives $\operatorname{rank}(W(t)) \le \operatorname{rank}(W(t'))$. Thus $\operatorname{rank}(W(t))$ is constant over time, and we denote it by $k$. Since eigenvalues are continuous matrix functions and $\lambda_i(W(t)) \ne 0$ for all $t$ and $i \in [k]$, the nonzero eigenvalues cannot change their signs, and it must hold that $W(t) \succeq 0$.

Lemma K.2. For all $a, b, P \in \mathbb{R}$, if $a > b \ge 0$ and $P \ge 1$, then $\frac{a^P - b^P}{a - b} \le P a^{P-1}$.

Proof. Let $f(x) = P(1-x) - (1 - x^P)$. Since $f'(x) = -P + P x^{P-1} < 0$ for all $x \in [0, 1)$, $f$ is decreasing on $[0, 1]$, so $f(x) \ge f(1) = 0$ for $x \in [0, 1]$. Taking $x = b/a$ and multiplying both sides of $P(1 - b/a) \ge 1 - (b/a)^P$ by $\frac{a^P}{a-b}$ proves the claim.

Lemma K.3. Let $F(N) := N^P$ for $N \in \mathcal{S}^+_d$, where $P \in \mathbb{Q}$ and $P \ge 1$. Then $\|DF(N)[M]\|_F \le P\|N\|_2^{P-1}\|M\|_F$, where $DF(N)[M] := \lim_{t\to 0}\frac{F(N+tM) - F(N)}{t}$ is the directional derivative of $F$ along $M$.

Proof. Let $N = U\Sigma U^\top$, where $U^\top U = I$ and $\Sigma = \operatorname{diag}(\sigma_1, \cdots, \sigma_d)$. Note that $F(UMU^\top) = UF(M)U^\top$ for any $M \in \mathcal{S}^+_d$. Then we have
$$\|DF(N)[M]\|_F = \lim_{t\to 0}\frac{\|F(N+tM) - F(N)\|_F}{t} = \lim_{t\to 0}\frac{\|F(\Sigma + tU^\top M U) - F(\Sigma)\|_F}{t} = \|DF(\Sigma)[U^\top M U]\|_F.$$
Therefore, it suffices to prove the lemma for the case where $N$ is diagonal, i.e., $N = \Sigma$. Assume $P = \frac{q}{p}$, where $p, q \in \mathbb{N}$ and $q \ge p > 0$. Define $G(N) = N^{1/p}$, so that $G(\Sigma)^p = \Sigma$. Taking the directional derivative on both sides along direction $M$, we have
$$\sum_{i=1}^{p} G(\Sigma)^{i-1}\, DG(\Sigma)[M]\, G(\Sigma)^{p-i} = M,$$
so $[DG(\Sigma)[M]]_{ij} = \frac{m_{ij}}{\sum_{k=1}^{p}\sigma_i^{\frac{k-1}{p}}\sigma_j^{\frac{p-k}{p}}}$. Let $H(G) = G^q$. By the same argument, $[DH(G(\Sigma))[M]]_{ij} = m_{ij}\sum_{k=1}^{q}\sigma_i^{\frac{k-1}{p}}\sigma_j^{\frac{q-k}{p}}$. Note that $H(G(\Sigma)) = F(\Sigma)$. By the chain rule, $DF(\Sigma)[M] = DH(G(\Sigma))[DG(\Sigma)[M]]$. That is,
$$[DF(\Sigma)[M]]_{ij} = m_{ij}\,\frac{\sum_{k=1}^{q}\sigma_i^{\frac{k-1}{p}}\sigma_j^{\frac{q-k}{p}}}{\sum_{k=1}^{p}\sigma_i^{\frac{k-1}{p}}\sigma_j^{\frac{p-k}{p}}}.$$
When $\sigma_i = \sigma_j$, clearly $[DF(\Sigma)[M]]_{ij} = m_{ij}\cdot\frac{q}{p}\cdot\sigma_i^{\frac{q-p}{p}} = P m_{ij}\sigma_i^{P-1}$.
Otherwise, we assume WLOG that $\sigma_i > \sigma_j$. Multiplying both the numerator and the denominator by $\sigma_i^{1/p} - \sigma_j^{1/p}$ (so that both sums telescope), we have
$$\left|[DF(\Sigma)[M]]_{ij}\right| = |m_{ij}|\,\frac{\sigma_i^P - \sigma_j^P}{\sigma_i - \sigma_j} \le |m_{ij}|\, P\sigma_i^{P-1} \le |m_{ij}|\, P\|\Sigma\|_2^{P-1},$$
where the first inequality is by Lemma K.2. Thus we conclude the proof.

Lemma K.4. For any $A, B \succeq 0$ and $P \in \mathbb{R}$, $P \ge 1$,
$$\|A^P - B^P\|_F \le P\,\|A - B\|_F\,\max\left\{\|A\|_2^{P-1}, \|B\|_2^{P-1}\right\}.$$

Proof. Since both sides are continuous in $P$ and $\mathbb{Q}$ is dense in $\mathbb{R}$, it suffices to prove the lemma for $P \in \mathbb{Q}$. Let $\rho := \max\{\|A\|_2, \|B\|_2\}$ and $F(M) = M^P$. Define $N : [0,1] \to \mathcal{S}^+_d$, $N(t) = (1-t)A + tB$. We have:
1. $\|N(t)\|_2 \le \rho$, since $\|\cdot\|_2$ is convex;
2. $\|DF(N(t))[B-A]\|_F \le P\|N(t)\|_2^{P-1}\|B-A\|_F$ by Lemma K.3.
Therefore,
$$\|F(N(1)) - F(N(0))\|_F \le \int_0^1 \left\|\frac{dF(N(t))}{dt}\right\|_F dt = \int_0^1 \|DF(N(t))[B-A]\|_F\, dt \le P\|A-B\|_F\,\rho^{P-1},$$
which completes the proof.

For a locally Lipschitz function $f(\cdot)$, the Clarke subdifferential (Clarke, 1975; 1990; Clarke et al., 2008) of $f$ at any point $x$ is the convex set
$$\frac{\partial^\circ f(x)}{\partial x} := \operatorname{co}\left\{ \lim_{k\to\infty}\nabla f(x_k) : x_k \to x,\ f \text{ is differentiable at } x_k \right\},$$
where $\operatorname{co}$ denotes the convex hull. The Clarke subdifferential generalizes the standard notion of gradients in the sense that, when $f$ is smooth, $\frac{\partial^\circ f(x)}{\partial x} = \{\nabla f(x)\}$. The Clarke subdifferential satisfies the chain rule.

Lemma K.7. Suppose $M(t)$ satisfies (32). Then for any $T' > T$ and $k \in [d]$, we have
$$\lambda_k(M(T')) - \lambda_k(M(T)) \le \int_T^{T'} 2\lambda_k(M^P(t))\,\|\nabla f(M^P(t))\|_2\, dt \tag{33}$$
and
$$\frac{1}{P-1}\left( \lambda_k^{1-P}(M(T)) - \lambda_k^{1-P}(M(T')) \right) \le \int_T^{T'} 2\|\nabla f(M^P(t))\|_2\, dt. \tag{34}$$

Proof. Since $\lambda_k(M(t))$ is locally Lipschitz in $t$, by Rademacher's theorem, $\lambda_k(M(t))$ is differentiable almost everywhere, and the following holds:
$$\lambda_k(M(T')) - \lambda_k(M(T)) = \int_T^{T'}\frac{d\lambda_k(M(t))}{dt}\, dt.$$
When $\frac{d\lambda_k(M(t))}{dt}$ exists, we have
$$\frac{d\lambda_k(M(t))}{dt} \in \left\{ \left\langle G, \frac{dM(t)}{dt}\right\rangle : G \in \frac{\partial^\circ\lambda_k(M)}{\partial M} \right\} = \left\{ 2\lambda_k(M^P(t))\left\langle G, -\nabla f(M^P(t))\right\rangle : G \in \frac{\partial^\circ\lambda_k(M)}{\partial M} \right\}.$$
Note that $\|G\|_F \le \|G\|_* = 1$. So $\langle G, -\nabla f(M^P(t))\rangle \le \|\nabla f(M^P(t))\|_2$, which proves (33). We can prove (34) with a similar argument.
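Lemma K.4 is easy to stress-test numerically. Below is a sketch (the helper `psd_power` and the random PSD trials are our own choices) checking the perturbation bound for fractional and integer matrix powers:

```python
import numpy as np

def psd_power(A, P):
    """Matrix power A^P of a PSD matrix via its eigendecomposition."""
    lam, Q = np.linalg.eigh(A)
    return (Q * np.clip(lam, 0.0, None) ** P) @ Q.T

rng = np.random.default_rng(0)
for P in (1.5, 2.0, 3.7):
    for _ in range(100):
        X = rng.standard_normal((5, 5)); A = X @ X.T   # random PSD matrices
        Y = rng.standard_normal((5, 5)); B = Y @ Y.T
        lhs = np.linalg.norm(psd_power(A, P) - psd_power(B, P))
        rho = max(np.linalg.norm(A, 2), np.linalg.norm(B, 2))
        rhs = P * np.linalg.norm(A - B) * rho ** (P - 1)
        # Lemma K.4: ‖A^P − B^P‖_F ≤ P ‖A − B‖_F · max(‖A‖₂, ‖B‖₂)^{P−1}
        assert lhs <= rhs + 1e-8
```

The $P = 2$ case can also be checked by hand ($A^2 - B^2 = A(A-B) + (A-B)B$), which is a useful sanity check that the constant $P$ in the bound is not improvable by much.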
To prove Theorem 6.2, it suffices to consider the case $M(0) = \bar\alpha I$, where $\bar\alpha := \alpha^{1/P}$. WLOG we can assume $-\nabla f(0) = \operatorname{diag}(\mu_1, \dots, \mu_d)$ by choosing a suitable standard basis. By the assumption in Theorem 6.2, we have $\mu_1 > \max\{\mu_2, 0\}$ and $\mu_1 = \|\nabla f(0)\|_2$. We use $\phi_m(M_0, t)$ to denote the solution $M(t)$ when $M(0) = M_0$. Let $R > 0$. Since $f(\cdot)$ is $\mathcal{C}^3$-smooth, there exists $\beta > 0$ such that $\|\nabla f(W_1) - \nabla f(W_2)\|_F \le \beta\|W_1 - W_2\|_2$ for all $W_1, W_2$ with $\|W_1\|_2, \|W_2\|_2 \le R$. Let $\kappa = \beta/\mu_1$. We assume WLOG that $R \le \frac{1}{\kappa(P-1)}$. Let
$$F_{\bar\alpha}(x) := \int_{x^{-(P-1)}}^{\bar\alpha^{-(P-1)}} \frac{dz}{1 + \kappa z^{-P/(P-1)}}, \qquad \text{so that} \qquad F_{\bar\alpha}'(x) = \frac{(P-1)x^{-P}}{1 + \kappa x^P} = \frac{P-1}{(1+\kappa x^P)x^P}.$$
We will use this function to bound norm growth. Let
$$g_{\bar\alpha,c}(t) = \frac{1}{\bar\alpha^{-(P-1)} - \kappa(P-1)c - 2\mu_1(P-1)t}, \qquad T_{\bar\alpha}(r) = \frac{\bar\alpha^{-(P-1)} - \kappa(P-1)r - r^{-(P-1)}}{2\mu_1(P-1)}.$$
It is easy to verify that $g_{\bar\alpha,r}(T_{\bar\alpha}(r)) = r^{P-1}$.

Lemma K.8. For any $x \in [\bar\alpha, R]$, we have $\bar\alpha^{-(P-1)} - x^{-(P-1)} - F_{\bar\alpha}(x) \in [0, \kappa(P-1)x]$.

Proof. On the one hand, we have
$$\bar\alpha^{-(P-1)} - x^{-(P-1)} - F_{\bar\alpha}(x) = \int_{x^{-(P-1)}}^{\bar\alpha^{-(P-1)}}\left( 1 - \frac{1}{1 + \kappa z^{-P/(P-1)}} \right) dz \ge 0.$$
On the other hand,
$$\bar\alpha^{-(P-1)} - x^{-(P-1)} - F_{\bar\alpha}(x) = \int_{x^{-(P-1)}}^{\bar\alpha^{-(P-1)}}\frac{\kappa}{z^{P/(P-1)} + \kappa}\, dz \le \kappa\int_{x^{-(P-1)}}^{\bar\alpha^{-(P-1)}}\frac{dz}{z^{P/(P-1)}} = \kappa(P-1)\left[\frac{-1}{z^{1/(P-1)}}\right]_{x^{-(P-1)}}^{\bar\alpha^{-(P-1)}} \le \kappa(P-1)x,$$
which completes the proof.
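The closed forms for $g$, $T$, and $F'$ are easy to mis-track through the exponents, so here is a quick numerical check (arbitrary small constants of our own choosing, not from the paper):

```python
import numpy as np

P, kappa, mu1, alpha, r = 1.5, 0.3, 1.0, 0.05, 0.4

def g(t, c):
    """g_{α,c}(t) = 1 / (α^{-(P-1)} − κ(P-1)c − 2μ₁(P-1)t)."""
    return 1.0 / (alpha ** -(P - 1) - kappa * (P - 1) * c - 2 * mu1 * (P - 1) * t)

T_r = (alpha ** -(P - 1) - kappa * (P - 1) * r - r ** -(P - 1)) / (2 * mu1 * (P - 1))
assert np.isclose(g(T_r, r), r ** (P - 1))     # g_{α,r}(T_α(r)) = r^{P-1}

def F(x):
    """F_α(x) = ∫_{x^{-(P-1)}}^{α^{-(P-1)}} dz / (1 + κ z^{-P/(P-1)})."""
    z = np.linspace(x ** -(P - 1), alpha ** -(P - 1), 200001)
    y = 1.0 / (1.0 + kappa * z ** (-P / (P - 1)))
    return 0.5 * np.sum((y[1:] + y[:-1]) * np.diff(z))   # trapezoid rule

x, h = 0.5, 1e-5
deriv = (F(x + h) - F(x - h)) / (2 * h)        # finite-difference derivative
assert np.isclose(deriv, (P - 1) / ((1 + kappa * x ** P) * x ** P), rtol=1e-3)
```

The first assertion verifies the algebraic identity $g_{\bar\alpha,r}(T_{\bar\alpha}(r)) = r^{P-1}$ stated above, and the second confirms that differentiating the integral's lower limit indeed yields $F'_{\bar\alpha}(x) = \frac{P-1}{(1+\kappa x^P)x^P}$.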

So
$$\|D(T_{\bar\alpha}(r))\|_F \le \int_0^{T_{\bar\alpha}(r)} 2\beta\, g_{\bar\alpha,r}(t)^{\frac{2P}{P-1}}\exp\left( 2\mu_1 P\int_t^{T_{\bar\alpha}(r)} g_{\bar\alpha,r}(\tau)\, d\tau \right) dt = \int_0^{T_{\bar\alpha}(r)} 2\beta\, g_{\bar\alpha,r}(t)^{\frac{2P}{P-1}}\exp\left( \frac{P}{P-1}\ln\frac{g_{\bar\alpha,r}(T_{\bar\alpha}(r))}{g_{\bar\alpha,r}(t)} \right) dt = \int_0^{T_{\bar\alpha}(r)} 2\beta\, g_{\bar\alpha,r}(t)^{\frac{P}{P-1}}\, g_{\bar\alpha,r}(T_{\bar\alpha}(r))^{\frac{P}{P-1}}\, dt \le 2\beta\cdot\frac{1}{2\mu_1}\, g_{\bar\alpha,r}(T_{\bar\alpha}(r))^{\frac{1}{P-1}}\cdot g_{\bar\alpha,r}(T_{\bar\alpha}(r))^{\frac{P}{P-1}} = \kappa\, g_{\bar\alpha,r}(T_{\bar\alpha}(r))^{\frac{P+1}{P-1}} = \kappa r^{P+1},$$
which proves the bound.

Lemma K.12. Let $M(t) = \phi_m(\bar\alpha M_0, t)$ and $\hat M(t) = \phi_m(\bar\alpha \hat M_0, t)$, where $\max\{\|M_0\|_2, \|\hat M_0\|_2\} \le 1$. For $t \le T_{\bar\alpha}(r)$, we have
$$\|M(t) - \hat M(t)\|_F \le \left(\frac{r}{\bar\alpha}\right)^P e^{2\kappa r^P}\,\|M(0) - \hat M(0)\|_F.$$

Proof. Define $D(t) = M(t) - \hat M(t)$. Then we have
$$\left\|\frac{dD}{dt}\right\|_F = 2\left\| \nabla f(M^P)\left(M^P - \hat M^P\right) + \left(\nabla f(M^P) - \nabla f(\hat M^P)\right)\hat M^P \right\|_F \le 2\|\nabla f(M^P)\|_2\,\|M^P - \hat M^P\|_F + 2\beta\,\|M^P - \hat M^P\|_F\,\|\hat M^P\|_2 \le 2\left( \mu_1 + \beta\|M^P\|_2 + \beta\|\hat M^P\|_2 \right) P\max\left\{\|M\|_2^{P-1}, \|\hat M\|_2^{P-1}\right\}\|D\|_F,$$
where the last step is by Lemma K.4. So
$$\|D(T_{\bar\alpha}(r))\|_F \le \|D(0)\|_F\cdot\exp\left( 2P\mu_1\int_0^{T_{\bar\alpha}(r)}\left( 1 + 2\kappa\, g_{\bar\alpha,r}(t)^{\frac{P}{P-1}} \right) g_{\bar\alpha,r}(t)\, dt \right) \le \|D(0)\|_F\cdot\exp\left( \frac{P}{P-1}\ln\frac{g_{\bar\alpha,r}(T_{\bar\alpha}(r))}{g_{\bar\alpha,r}(0)} + 2\kappa\, g_{\bar\alpha,r}(T_{\bar\alpha}(r))^{\frac{P}{P-1}} \right) \le \|D(0)\|_F\left(\frac{r}{\bar\alpha}\right)^P e^{2\kappa r^P},$$
which proves the bound.

Let $M^G_{\bar\alpha}(t) := \phi_m\left( \bar\alpha e_1 e_1^\top,\ \frac{\bar\alpha^{-(P-1)}}{2\mu_1(P-1)} + t \right)$ and $M(t) := \lim_{\bar\alpha\to 0} M^G_{\bar\alpha}(t)$.

Lemma K.13. For every $t \in (-\infty, +\infty)$, $M(t)$ exists and $M^G_{\bar\alpha}(t)$ converges to $M(t)$ at the rate $\|M^G_{\bar\alpha}(t) - M(t)\|_F = O(\bar\alpha)$.

Proof. Let $c$ be a sufficiently small constant and let
$$\bar T := \frac{-\kappa(P-1)c - c^{-(P-1)}}{2\mu_1(P-1)}.$$
We prove this lemma in the cases $t \in (-\infty, \bar T]$ and $t > \bar T$ respectively.

Case 1. Fix $t \in (-\infty, \bar T]$. Then $\frac{\bar\alpha^{-(P-1)}}{2\mu_1(P-1)} + t \le T_{\bar\alpha}(c)$. Let $\hat\alpha$ be the unique number such that $\kappa(P-1)\hat\alpha + \hat\alpha^{-(P-1)} = \bar\alpha^{-(P-1)}$. Let $\alpha' < \hat\alpha$ be an arbitrarily small number, and let $t_0 := T_{\alpha'}(\hat\alpha) = \frac{(\alpha')^{-(P-1)} - \bar\alpha^{-(P-1)}}{2\mu_1(P-1)}$. By Lemma K.11 and (35), we have
$$\left\| \phi_m(\alpha' e_1 e_1^\top, t_0) - \bar\alpha e_1 e_1^\top \right\|_F \le O(\bar\alpha^{P+1}).$$
By Lemma K.9, $\|\phi_m(\alpha' e_1 e_1^\top, t_0)\|_2 \le \bar\alpha$. Then by Lemma K.12, we have
$$\left\| \phi_m(\alpha' e_1 e_1^\top, t_0 + t') - \phi_m(\bar\alpha e_1 e_1^\top, t') \right\|_F \le \left(\frac{c}{\bar\alpha}\right)^P e^{2\kappa c^P}\cdot O(\bar\alpha^{P+1}) = O(\bar\alpha).$$
This implies that $\{M^G_{\bar\alpha}(t)\}$ satisfies the Cauchy criterion for every $t$, and thus the limit $M(t)$ exists for $t \le \bar T$. The convergence rate can be deduced by taking limits as $\alpha' \to 0$ on both sides.

Case 2. For $t = \bar T + \tau$ with $\tau > 0$, $\phi_m(M, \tau)$ is locally Lipschitz with respect to $M$. So
$$\|M^G_{\bar\alpha}(t) - M^G_{\alpha'}(t)\|_F = \|\phi_m(M^G_{\bar\alpha}(\bar T), \tau) - \phi_m(M^G_{\alpha'}(\bar T), \tau)\|_F = O\left(\|M^G_{\bar\alpha}(\bar T) - M^G_{\alpha'}(\bar T)\|_F\right) = O(\bar\alpha),$$
which proves the lemma for $t > \bar T$.

Theorem K.14. For every $t \in (-\infty, +\infty)$, as $\alpha \to 0$, we have
$$\left\| \phi_m\left( \bar\alpha I,\ \frac{\bar\alpha^{-(P-1)}}{2\mu_1(P-1)} + t \right) - M(t) \right\|_F = O\left(\bar\alpha^{\frac{1}{P+1}}\right), \tag{36}$$
and for any $2 \le k \le d$,
$$\lambda_k\left( \phi_m\left( \bar\alpha I,\ \frac{\bar\alpha^{-(P-1)}}{2\mu_1(P-1)} + t \right) \right) = O(\bar\alpha). \tag{37}$$

Proof. Let $\hat M_{\bar\alpha}(t) := \phi_m\left( \bar\alpha I,\ \frac{\bar\alpha^{-(P-1)}}{2\mu_1(P-1)} + t \right)$. Again we let $c$ be a sufficiently small constant and $\bar T := \frac{-\kappa(P-1)c - c^{-(P-1)}}{2\mu_1(P-1)}$. We prove in the cases $t \in (-\infty, \bar T]$ and $t > \bar T$ respectively. Combining this with the convergence rate for $M^G_{\bar\alpha_1}(t)$ proves the bound (36). For (37), by Lemma K.7, we have
$$\lambda_k^{1-P}(\hat M_{\bar\alpha}(\bar T)) \ge \Omega(\bar\alpha^{-(P-1)}) - \frac{\bar\alpha^{-\frac{P-1}{P+1}} - c^{-(P-1)}}{2\mu_1(P-1)} - O(c) \ge \Omega(\bar\alpha^{-(P-1)}).$$
Thus $\lambda_k(\hat M_{\bar\alpha}(\bar T)) \le O(\bar\alpha)$. Case 2. For $t = \bar T + \tau$ with $\tau > 0$, $\phi_m(M, \tau)$ is locally Lipschitz with respect to $M$. So

L ESCAPING DIRECTION FOR DEEP MATRIX FACTORIZATION

For deep matrix factorization, recall that we only prove that GF with infinitesimal identity initialization escapes in the direction of the top eigenvector. The main obstacle to generalizing this proof to general initialization is that we do not know how to analyze the early-phase dynamics of (29) when $L \ge 3$, i.e., the analytical solution of (39) is difficult to compute. Intuitively, the direction in which infinitesimal initialization escapes $0$ is exactly $\bar M := \lim_{t\to\infty}\frac{M(t)}{\|M(t)\|_F}$, where $M(t)$ is the solution of (39),
$$\frac{dM}{dt} = -\nabla f(0)\, M^{L/2} - M^{L/2}\,\nabla f(0).$$
Showing $\bar M = v_1 v_1^\top$ is a critical step in our analysis towards convergence to GLRL.

If further Assumption M.1 holds, then we know $J(W_0)|_{\mathcal{T}} : \mathcal{T} \to \mathcal{T}$ can be diagonalized as $J(W_0)|_{\mathcal{T}}[\,\cdot\,] = \mathcal{V}(\Sigma\mathcal{V}^{-1}(\cdot))$, where $\Sigma = \operatorname{diag}(-\mu_1, \dots, -\mu_{\dim(\mathcal{T})})$, $\mathcal{V} : \mathbb{R}^{\dim(\mathcal{T})} \to \mathcal{T},\ \mathcal{V}(x) = \sum_{i=1}^{\dim(\mathcal{T})} x_i V_i$, and $V_i$ is the eigenvector associated with eigenvalue $-\mu_i$. As shown in Theorem M.3 below, this assumption implies that if $W(0)$ is rank-$k$ and sufficiently close to $W_0$, then $\|W(t) - W_0\|_F \le Ce^{-\mu_1 t}$ for some constant $C$. For the depth-2 case, the above assumption is equivalent to saying that $L(U_0)$ is "strongly convex" at $U_0$, except for the zero eigenvalues due to symmetry (by property 2 of Theorem H.5). For the case where $L \ge 3$, because this dynamics is not a gradient flow, in general it does not correspond to a loss function, and strong convexity does not make sense. Nevertheless, in experiments we do observe linear convergence to $W_0$, so this assumption is reasonable.
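The escape-direction claim is easy to visualize with a small simulation of (39). Below is a sketch using forward Euler with $L = 4$ (so $M^{L/2} = M^2$), identity initialization, and an arbitrary well-gapped choice of $-\nabla f(0)$ (all constants are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha, dt = 6, 0.01, 0.005
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
G = Q @ np.diag([1.0, 0.5, 0.3, 0.2, 0.1, 0.05]) @ Q.T   # G = -∇f(0), gapped
v1 = Q[:, 0]                                              # top eigenvector

M = alpha * np.eye(d)               # infinitesimal identity initialization
for _ in range(20000):              # forward Euler on dM/dt = G M² + M² G
    M2 = M @ M                      # M^{L/2} with L = 4
    M = M + dt * (G @ M2 + M2 @ G)
    if np.trace(M) > 1.0:           # stop once the flow has escaped 0
        break
_, V = np.linalg.eigh(M)
alignment = abs(V[:, -1] @ v1)      # close to 1: escape along v₁v₁ᵀ
```

With identity initialization, $M(t)$ commutes with $G$ for all $t$, so each eigen-component grows as $\dot m_i = 2\mu_i m_i^{L/2}$; the component along $v_1$ blows up first, which is exactly why the normalized limit $\bar M$ should be $v_1 v_1^\top$.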

M.1 RANK-k INITIALIZATION

For convenience, we define for all W ∈ S d , W V := V -1 Π d 2 1 (W ) F , W F,1 := Π d 2 1 (W ) F , W F,2 := Π d 2 2 (W ) F . The reason for such definition of norms, as we will see later, is that the norm (or the difference) in the tangent space of the manifold of symmetric rank-r matrices, W -W F,1 , dominates that in the orthogonal complement of the tangent space, W -W F,2 , when both W , W get very close to the W 0 (see a more rigorous statement in Lemma M.2). WLOG, we can assume • F,1 K ≤ • V ≤ • F,1 , for some constant K, which may depend on f and W 0 . This also implies that • V ≤ • F . Below we also assume for sufficiently small R, and any W such that W -W 0 F ≤ R, we have ∇f (W ) 2 ≤ ρ and J (W )[∆] F ≤ β ∆ F for any ∆. In the proof below, we assume such properties hold as long as we can show the boundedness of W (t) -W 0 . Lemma M.2. Let max{ W -W 0 F,1 , W -W 0 F,1 } = r, when r ≤ m 2 , we have W -W F,2 ≤ 5r m W -W F,1 . As a special case, we have W -W 0 F,2 ≤ 5 W -W 2 F,1 m . Proof. WLOG we can assume W 0 is only non-zero in the first k dimension, i.e., [W 0 ] ij = 0, for all i ≥ k + 1, j ≥ k + 1. We further denote W and W by d-k) . By definition, we have W = A B B C and W = A B B C , where A, A ∈ R k×k , B, B ∈ R (d-k)×k , C, C ∈ R (d-k)×( A -A F , B -B F ≤ W -W F,1 and W -W F,2 = C -C F . Moreover, we have λ min (A) ≥ m -A -W 0 F ≥ m -W -W 0 F,1 ≥ m 2 . Since W , W is rank-k, we have C = BA -1 B , C = B A -1 B . Thus W -W F,2 = C -C F = BA -1 B -B A -1 B F ≤ B -B F A -1 B F + BA -1 F A -A F A -1 B F + B A -1 F B -B F ≤ W -W F,1 2r m + W -W F,1 2r m 2 + W -W F,1 2r m ≤ W -W F,1 • The "small terms" in the RHS of (42) satisfies that 4βK 2 W 1 (t) V + 4β W 2 (t) 2 F + 2ρ W 2 (t) F W 1 (t) V ≤ C 1 W 1 (t) V + C 2 N (t) F ≤ µ 1 for some C 1 and C 2 independent of t. • The spectral norm 1 2 ∇f (W (t)) 2 ≤ ∇f (W 0 ) 2 =: ρ for all t ≥ 0. • ∀x < r, κ L x 2 L -1 (L-2)ρ > 2 µ1 ln 2r C0x , where κ L = 1 -0.5 L-2 L . 
Note these conditions can always be satisfied by some $C_0$ and $r$: we can first find three pairs $(C_0, r)$ satisfying each individual condition, and then take the maximal $C_0$ and the minimal $r$; it is easy to check that all conditions are still satisfied. Let $T_{C_0,r}$ be the earliest time at which the condition $C_0\|N(t)\|_F \le \|W_1(t)\|_{\mathcal{V}} \le r$ fails. Thus by (42), for $t \in [0, T_{C_0,r})$, we have
$$\|W(t)\|_{\mathcal{V}} = \|W_1(t)\|_{\mathcal{V}} \le \|W_1(0)\|_{\mathcal{V}}\, e^{-\frac{\mu_1 t}{2}} = \|W(0)\|_{\mathcal{V}}\, e^{-\frac{\mu_1 t}{2}}.$$
Thus (1) holds for any $T$ smaller than $T_{C_0,r}$. If $T_{C_0,r} = \infty$, then clearly we can pick a sufficiently large $T$ such that (2) holds. Therefore, below it suffices to consider the case where $T_{C_0,r}$ is finite, and we know the condition that fails must be $C_0\|N(t)\|_F \le \|W_1(t)\|_{\mathcal{V}}$, i.e., $C_0\|N(T_{C_0,r})\|_F = \|W_1(T_{C_0,r})\|_{\mathcal{V}}$. By (34) in Lemma K.7, we have
$$\|N(0)\|_2^{\frac{2}{L}-1} - \|N(t)\|_2^{\frac{2}{L}-1} \le (L-2)\rho t.$$
Define $T' := \frac{\kappa_L\|N(0)\|_2^{\frac{2}{L}-1}}{(L-2)\rho}$. Then for any $t < T'$, we have $\|N(0)\|_2^{\frac{2}{L}-1} - \|N(t)\|_2^{\frac{2}{L}-1} \le \kappa_L\|N(0)\|_2^{\frac{2}{L}-1}$. That is,
$$\frac{\|N(t)\|_2^{\frac{2}{L}-1}}{\|N(0)\|_2^{\frac{2}{L}-1}} \in \left[ 1-\kappa_L,\ \frac{1}{1-\kappa_L} \right] = \left[ 0.5^{\frac{L-2}{L}},\ 0.5^{-\frac{L-2}{L}} \right] \implies \frac{\|N(t)\|_2}{\|N(0)\|_2} \in [1/2, 2].$$
Now we claim it must hold that $T' \ge T_{C_0,r}$. Otherwise, we have
$$\frac{C_0}{2}\|N(0)\|_2 \le C_0\|N(T')\|_F \le \|W_1(T')\|_{\mathcal{V}} \le e^{-\mu_1 T'/2}\|W_1(0)\|_{\mathcal{V}} \le e^{-\mu_1 T'/2}\, r.$$
Therefore,
$$\frac{\kappa_L\|N(0)\|_2^{\frac{2}{L}-1}}{(L-2)\rho} = T' \le \frac{2}{\mu_1}\ln\frac{2r}{C_0\|N(0)\|_2},$$
which contradicts the definition of $C_0$ and $r$. As a result, we have
$$\frac{C_0}{2\sqrt{d}}\|N(0)\|_F \le \frac{C_0}{2}\|N(0)\|_2 \le C_0\|N(T_{C_0,r})\|_2 \le C_0\|N(T_{C_0,r})\|_F = \|W_1(T_{C_0,r})\|_{\mathcal{V}} \le \|W_1(0)\|_{\mathcal{V}}\, e^{-\mu_1 T_{C_0,r}/2},$$
and therefore
$$T_{C_0,r} \le \frac{2}{\mu_1}\ln\frac{2\sqrt{d}\,\|W_1(0)\|_{\mathcal{V}}}{C_0\|N(0)\|_F}.$$
Thus by Lemma M.2, we know

M.3 PROOF FOR THEOREM 6.4

Proof for Theorem 6.4. Let $C_0, r$ be the constants given by Theorem M.4 w.r.t. $W(\infty)$. We claim that we can pick a large enough constant $T$ and a sufficiently small $\alpha_0$, such that for all $\alpha \le \alpha_0$, the initial condition in Theorem M.4 holds, i.e., $C_0\|N(0)\|_F \le \|W_1(0)\|_{\mathcal{V}} \le r$, where $W(0) := \phi\left( \alpha I,\ \frac{\alpha^{-(P-1)}}{2\mu_1(P-1)} + T \right)$.
This is because we can first ensure that $\|W(\bar T) - W(\infty)\|_2$ is sufficiently small, i.e., smaller than $\frac{r}{2}$. By Theorem 6.2, we know that when $\alpha \to 0$, $\|W(\bar T) - W(0)\|_V \le K\|W(\bar T) - W(0)\|_F = o(1)$ and $\|N(0)\|_F = O(\alpha)$. By Theorem M.4, we know there is a time $T$ (either $T_{C_0,r}$, or some sufficiently large number when $T_{C_0,r} = \infty$) such that $\|W(T) - W_0\|_F = O(\|N(0)\|_F) = O(\alpha)$.
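The key step in the proof of Lemma M.2 is the Schur-complement identity $C = BA^{-1}B^\top$ for a rank-$k$ symmetric matrix whose top-left $k \times k$ block $A$ is invertible. A quick numerical sanity check (a sketch; all sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3

# A random symmetric rank-k matrix W = U U^T, partitioned as
# W = [[A, B^T], [B, C]] with A the top-left k x k block.
U = rng.standard_normal((d, k))
W = U @ U.T
A, B, C = W[:k, :k], W[k:, :k], W[k:, k:]

# Rank k plus invertible A forces C to be the completion
# C = B A^{-1} B^T, which is what the proof of Lemma M.2 uses.
C_from_A_B = B @ np.linalg.solve(A, B.T)
assert np.allclose(C, C_from_A_B, atol=1e-6)
```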



We believe the assumption $\|\nabla f(0)\|_2 = \lambda_1(-\nabla f(0))$ could be removed with a more refined analysis. Though the original theorem is proven for convex functions of the form $\sum_{i=1}^n \ell(x_i^\top U U^\top x_i, y_i)$, where $\ell(\cdot, \cdot)$ is $C^2$ and convex in its first argument, by scrutinizing their proof we can see that the assumption can be relaxed to $f$ being $C^2$ and convex.



Figure 1: The trajectory of depth-2 GD, $W_{\mathrm{GD}}(t)$, converges to the trajectory of GLRL, $W_{\mathrm{GLRL}}(t)$, as the initialization scale goes to 0.

where the equality is only attained at $m_{ii} = R$, $i = 1, 2, 3, 4$. Otherwise, either $\begin{pmatrix} m_{11} & m_{14} \\ m_{41} & m_{44} \end{pmatrix}$ or $\begin{pmatrix} m_{22} & m_{23} \\ m_{32} & m_{33} \end{pmatrix}$ will have a negative eigenvalue, contradicting $M \succeq 0$.
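The contradiction above rests on the fact that every principal submatrix of a PSD matrix is again PSD (by Cauchy interlacing, its smallest eigenvalue is at least $\lambda_{\min}(M) \ge 0$). A minimal numerical check (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# M = G G^T is PSD; Cauchy interlacing implies every principal submatrix
# has smallest eigenvalue >= lambda_min(M) >= 0, hence is PSD too.
G = rng.standard_normal((d, d))
M = G @ G.T

for idx in ([0, 3], [1, 2]):  # the two 2x2 principal minors in the argument
    sub = M[np.ix_(idx, idx)]
    assert np.linalg.eigvalsh(sub).min() >= -1e-10
```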

Figure 3: GD with small initialization outperforms R1MP and the minimal nuclear norm solution on synthetic data with a low-rank ground truth. Solid (dotted) curves correspond to test (training) loss. Here the loss is $f(W) := \frac{1}{d^2}\|W - W^*\|_F^2$ and $f(0) = 1$. We run 10 random seeds for GD and plot them separately (most of them overlap).
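The phenomenon in Figure 3 can be reproduced in spirit by a minimal depth-2 sketch: GD on $L(U) = f(UU^\top)$ with a tiny random initialization recovers a low-rank ground truth. All sizes and hyperparameters below are illustrative, not the paper's exact experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, lr, alpha = 20, 3, 0.2, 1e-4  # illustrative sizes/hyperparameters

G = rng.standard_normal((d, rank))
W_star = G @ G.T / rank                  # rank-3 ground truth
U = alpha * rng.standard_normal((d, d))  # full-width factor, tiny init

def loss(U):
    return np.sum((U @ U.T - W_star) ** 2) / d**2

# Plain GD on L(U) = f(U U^T) with f(W) = ||W - W_star||_F^2 / d^2.
for _ in range(30000):
    grad_W = 2.0 / d**2 * (U @ U.T - W_star)  # df/dW at W = U U^T
    U -= lr * 2.0 * grad_W @ U                # dL/dU = (grad_W + grad_W^T) U

s = np.linalg.svd(U @ U.T, compute_uv=False)
print("final loss:", loss(U))
print("top-4 singular values:", s[:4])
```

Despite the factorization being full-width ($U \in \mathbb{R}^{d \times d}$), the trailing singular values of the learned $UU^\top$ stay tiny: the implicit low-rank bias discussed in the main text.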

Figure 5: Deep matrix factorization encourages GF to find low-rank solutions at a much more practical initialization scale, e.g., $10^{-3}$. Here the ground truth is rank-3. For each setting, we run 5 different random seeds. The solid curves are the mean and the shaded area indicates one standard deviation. We observe that the performance of GD is quite robust to its initialization. Note that for $L > 2$, the shaded area with initialization scale $10^{-7}$ is large, as the sudden decrease of the loss occurs at quite different continuous times across random seeds in this case.

Figure 6: The marginal value of being deeper. The trajectory of GD converges as the depth goes to infinity. Solid (dotted) curves correspond to test (train) loss. The x-axis stands for the normalized continuous time $t$ (multiplied by $L$).

$J^{(r)}_{a,\delta} = D J^{(r)}_{a,1} D^{-1}$ for $D = \mathrm{diag}(\delta^r, \delta^{r-1}, \ldots, \delta) \in \mathbb{R}^{r \times r}$, and $J^{(r)}_{a,b,\delta} = D J^{(r)}_{a,b,1} D^{-1}$.

Then it is easy to translate Theorem 5.3 to Theorem I.2.

I.4 GRADIENT FLOW ONLY FINDS MINIMIZERS (PROOF FOR THEOREM 5.10)

The proof for Theorem 5.10 is based on the following two theorems from the literature.

Theorem I.3 (Theorem 3.1 in Du and Lee 2018). Let $f: \mathbb{R}^{d \times d} \to \mathbb{R}$ be a $C^2$ convex function. Then $L: \mathbb{R}^{d \times k} \to \mathbb{R}$, $L(U) = f(UU^\top)$, $k \ge d$, satisfies: (1) every local minimizer of $L$ is also a global minimizer;

$(W^{(r)}_\alpha) = W_r$. Note that $\langle W_r, u_r u_r^\top \rangle = 0$ by (27) and thus

$f(x) \ge f(0) = 0$. Then substituting $x$ by $\frac{b}{a}$ completes the proof.

Recall that we use $DF(N)[M]$ to denote the directional derivative of $F$ at $N$ along $M$.

Lemma K.3. Let $F: \mathbb{S}_d^+ \to \mathbb{S}_d^+$, $M \mapsto M^P$, where $P \ge 1$ and $P \in \mathbb{Q}$. Then $\forall M, N \succeq 0$, $DF$
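For integer $P$, the directional derivative of $F(M) = M^P$ is given by the product rule, $DF(N)[M] = \sum_{i=0}^{P-1} N^i M N^{P-1-i}$, and can be sanity-checked against finite differences (a sketch with $P = 3$; sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, P, h = 5, 3, 1e-6

A = rng.standard_normal((d, d))
N = A @ A.T                  # a PSD base point
M = rng.standard_normal((d, d))
M = (M + M.T) / 2            # a symmetric direction

# Product rule: D(M -> M^P)(N)[M] = sum_i N^i M N^{P-1-i}
analytic = sum(
    np.linalg.matrix_power(N, i) @ M @ np.linalg.matrix_power(N, P - 1 - i)
    for i in range(P)
)
numeric = (np.linalg.matrix_power(N + h * M, P) - np.linalg.matrix_power(N, P)) / h
assert np.allclose(analytic, numeric, atol=1e-3 * np.abs(analytic).max())
```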

Fix $t \in (-\infty, T]$. Let $\alpha_1 := \alpha^{\frac{1}{P+1}}$. Let $\bar\alpha_1$ be the unique number such that $\kappa(P-1)\bar\alpha_1 + \alpha$
$$\|\phi_m(\alpha I, t_0) - \alpha_1 e_1 e_1^\top\|_F \le \|\phi_m(\alpha I, t_0) - \bar\phi_m(\alpha I, t_0)\|_F + \|\bar\phi_m(\alpha I, t_0) - \alpha_1 e_1 e_1^\top\|_F = O(\alpha_1^{P+1} + \alpha) = O(\alpha).$$
By Lemma K.9, $\|\phi_m(\alpha I, t_0)\|_2 \le \alpha_1$. Then by Lemma K.12, we have $\|M_\alpha(t) - M^G_{\alpha_1}(t)\|_F = \|\phi_m(\alpha I, t_0 + t) - \phi_m(\alpha_1 e_1 e_1^\top, t)\|$

By Lemma K.11, $\lambda_1(M_\alpha(\bar T)) = \|M_\alpha(\bar T)\|_2 = c + O(c^{P+1})$. For $k \ge 2$, $\lambda_k(M_\alpha(\bar T))^{-(P-1)} \ge \Omega(\alpha^{-(P-1)}) - 2(P-1)\mu_1(\bar T - T_1) + \kappa_2 \cdot c$

$$\|M_\alpha(t) - M^G_{\alpha_1}(t)\|_F = \|\phi_m(M_\alpha(\bar T), \tau) - \phi_m(M^G_{\alpha_1}(\bar T), \tau)\|_F = O(\|M_\alpha(\bar T) - M^G_{\alpha_1}(\bar T)\|_F),$$
$\beta\|M_\alpha^P(t) - (M^G)^P(t)\|_2 + \|\nabla f((M^G)^P(t))\|$; and $\lambda_k(M_\alpha(\bar T + \tau))^{-(P-1)} = \Omega(\alpha^{-(P-1)})$, that is, $\lambda_k(M_\alpha(\bar T + \tau)) = O(\alpha)$ for all $k \ge 2$.

Proof of Theorem 6.2. Note that $M(t)^P = W(t)$ and $\phi$

By Lemma M.2, we have
$$\|W(T_{C_0,r}) - W_0\|_F \le \|W(T_{C_0,r}) - W_0\|_{F,1} + \|W(T_{C_0,r}) - W_0\|_{F,2} \le K\|W_1(T_{C_0,r})\|_V + \|M(T_{C_0,r}) - W_0\|_{F,2} + \|N(T_{C_0,r})\|_{F,2}$$
$$\le O(\|N(0)\|_F) + O(\|N(0)\|_F^2) + O(\|N(0)\|_F) = O(\|N(0)\|_F).$$

ACKNOWLEDGMENTS

The authors thank Sanjeev Arora and Jason D. Lee for helpful discussions. The authors also thank Runzhe Wang for useful suggestions on writing. ZL and YL acknowledge support from NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC. ZL is also supported by Microsoft PhD Fellowship.


Let $\Delta = vq^\top$. Then we have
So $D^2L(U_0)[\Delta, \Delta] < 0$, which leads to a contradiction.

By (26), the symmetric matrices $-\nabla f(W_0)$ and $W_0$ commute, so they can be simultaneously diagonalized. Since (26) also implies that they have different column spans, we can have the following diagonalization: $-\nabla f$

First we prove the following lemma on the eigenvalues and eigenvectors of the linear operator $-D^2L(U_0)$:

Lemma H.2. For every $\Delta \in \mathbb{R}^{d \times r}$, if
then $\Delta$ is an eigenvector of the linear operator $-D^2L(U_0)[\,\cdot\,]: \mathbb{R}^{d \times r} \to \mathbb{R}^{d \times r}$ associated with eigenvalue 0. Moreover, the solutions of (28) span a linear space of dimension $\frac{r(r-1)}{2}$.

Proof. Suppose $U_0\Delta^\top + \Delta U_0^\top = 0$. Then we have $U_0\Delta^\top = -\Delta U_0^\top$, and thus $\Delta^\top = -U_0^{+}\Delta U_0^\top$, where $U_0^{+}$ is the pseudoinverse of the full-rank matrix $U_0$. This implies that there is a matrix $R \in \mathbb{R}^{r \times r}$ such that $\Delta = U_0 R$. Replacing $\Delta$ with $U_0R$ in (28) gives $U_0(R + R^\top)U_0^\top = 0$, which is equivalent to $R^\top = -R$ since $U_0$ is full-rank. Since the space of $r \times r$ antisymmetric matrices has dimension $\frac{r(r-1)}{2}$, the space spanned by the solutions of (28) also has dimension $\frac{r(r-1)}{2}$.

Let $-D^2L(U_0)[\Delta] = \sum_{p=1}^{rd} \xi_p \langle E_p, \Delta\rangle E_p$ be the eigendecomposition of the symmetric linear operator $-D^2L(U_0)[\,\cdot\,]: \mathbb{R}^{d \times r} \to \mathbb{R}^{d \times r}$, where $\xi_1, \ldots, \xi_{rd} \in \mathbb{R}$ are eigenvalues and $E_1, \ldots, E_{rd} \in \mathbb{R}^{d \times r}$ are eigenvectors satisfying $\langle E_p, E_q\rangle = \delta_{pq}$. We enforce $\xi_p$ to be 0 and $E_p$ to be a solution of (28) for every $rd - \frac{r(r-1)}{2} < p \le rd$.

Lemma H.4. Let $A \in \mathbb{R}^{D \times D}$ be a matrix. If $\{\hat u_1, \ldots, \hat u_K\}$ is a set of linearly independent left eigenvectors associated with eigenvalues $\hat\lambda_1, \ldots, \hat\lambda_K$, $\{\tilde v_1, \ldots, \tilde v_{D-K}\}$ is a set of linearly independent right eigenvectors associated with eigenvalues $\tilde\lambda_1, \ldots, \tilde\lambda_{D-K}$, and $\langle \hat u_i, \tilde v_j\rangle = 0$ for all $1 \le i \le K$, $1 \le j \le D-K$, then $\hat\lambda_1, \ldots, \hat\lambda_K, \tilde\lambda_1, \ldots, \tilde\lambda_{D-K}$ are all the eigenvalues of $A$.

Proof. Let $\hat U := (\hat u_1, \ldots, \hat u_K)^\top \in \mathbb{R}^{K \times D}$ and $\tilde V := (\tilde v_1, \ldots, \tilde v_{D-K}) \in \mathbb{R}^{D \times (D-K)}$. Then both $\hat U$ and $\tilde V$ are full-rank. Let $\hat U^+ = \hat U^\top(\hat U \hat U^\top)^{-1}$ and $\tilde V^+ = (\tilde V^\top \tilde V)^{-1}\tilde V^\top$ be the pseudoinverses of $\hat U$ and $\tilde V$.

Proof for Item 3. Since $\nabla f(W)$ is symmetric, $g(W)$ is also symmetric. For any $\Delta = -\Delta^\top$,
So $\mathcal{J}(W_0)[\Delta] = 0$ and $\Delta$ is an eigenvector associated with eigenvalue 0.

No other eigenvalues. Let $\mathbb{S}_d$ be the space of symmetric matrices and $\mathbb{A}_d$ be the space of antisymmetric matrices. It is easy to see that $\mathbb{S}_d$ and $\mathbb{A}_d$ are orthogonal to each other, and $\mathbb{S}_d$ and
be the linear operator $\mathcal{J}(W_0)[\Delta]$ restricted on symmetric matrices. We only need to prove that $h$ is diagonalizable.

It is easy to see that $\{\hat U_{ij}\}$ are linearly independent of each other and thus span a subspace of $\mathbb{S}_d$ with dimension $\frac{(d-r)(d-r+1)}{2}$.
We can also prove that $\{\tilde V_p\}$ spans a subspace of $\mathbb{S}_d$ with dimension $rd - \frac{r(r-1)}{2}$ by contradiction. Assume to the contrary that there exist scalars $\alpha_p$ for $1 \le p \le rd - r(r-1)/2$, not all zero, such that $\sum_{p=1}^{rd - r(r-1)/2} \alpha_p \tilde V_p = 0$. Then $\sum_{p=1}^{rd - r(r-1)/2} \alpha_p E_p$ is a solution of (28). However, this implies that $\sum_{p=1}^{rd - r(r-1)/2} \alpha_p E_p$ lies in the span of $\{E_p\}_{rd - r(r-1)/2 < p \le rd}$, which contradicts the linear independence of $\{E_p\}_{1 \le p \le rd}$.

Let $\mathrm{vec}_{\mathrm{LT}}(W)$ be the vector consisting of the $\frac{d(d+1)}{2}$ entries of $W$ in the lower triangle, permuted according to some fixed order.
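The zero-eigenvalue directions constructed above can be checked numerically: $\Delta = U_0R$ with antisymmetric $R$ satisfies $U_0\Delta^\top + \Delta U_0^\top = 0$, and such $R$ form an $\frac{r(r-1)}{2}$-dimensional space (a sketch; sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 3

U0 = rng.standard_normal((d, r))  # a generic full-rank factor
R = rng.standard_normal((r, r))
R = R - R.T                       # antisymmetric: R^T = -R

# Delta = U0 R satisfies U0 Delta^T + Delta U0^T = U0 (R^T + R) U0^T = 0,
# i.e. it leaves W = U0 U0^T unchanged to first order.
Delta = U0 @ R
assert np.allclose(U0 @ Delta.T + Delta @ U0.T, 0)

# Dimension count: antisymmetric r x r matrices have r(r-1)/2 free entries.
assert r * (r - 1) // 2 == 3
```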

Note that

Let $g(W)$ be the function defined in (2), which always maps symmetric matrices to symmetric matrices. Let $\tilde g: \mathbb{R}^{\frac{d(d+1)}{2}} \to \mathbb{R}^{\frac{d(d+1)}{2}}$ be the function such that $\tilde g(\mathrm{vec}_{\mathrm{LT}}(W)) = \mathrm{vec}_{\mathrm{LT}}(g(W))$ for any $W \in \mathbb{S}_d$. For $W(t)$ evolving with (2), we view $\mathrm{vec}_{\mathrm{LT}}(W(t))$ as a dynamical system. By Lemma 5.4, the space of symmetric matrices $\mathbb{S}_d$ and the space of antisymmetric matrices $\mathbb{A}_d$ are invariant subspaces of $\mathcal{J}(0)$, and
is the set of all the eigenvalues and eigenvectors in the invariant subspace $\mathbb{S}_d$. Thus, $\tilde\mu_1 := 2\mu_1$ and $\tilde\mu_2 := \mu_1 + \mu_2$ are the largest and second largest eigenvalues of the Jacobian of $\tilde g(\cdot)$ at $\mathrm{vec}_{\mathrm{LT}}(W) = 0$, and $\tilde u_1 = \tilde v_1 = \mathrm{vec}_{\mathrm{LT}}(u_1 u_1^\top)$ are the corresponding left and right eigenvectors of the top eigenvalue. Then it is easy to translate Theorem 5.3 to Theorem 5.6.

I.2 PROOF FOR THEOREM 5.8

The proof for Theorem 5.8 relies on the following lemma on the gradient flow around a local minimizer:

Lemma I.1. If $\theta^*$ is a local minimizer of $L(\theta)$ and every $\theta$ with $\|\theta - \theta^*\|_2 \le r$ satisfies the Łojasiewicz inequality:

Published as a conference paper at ICLR 2021

Theorem K.5 (Theorem 2.3.10, Clarke 1990). Let $F: \mathbb{R}^k \to \mathbb{R}^d$ be a differentiable function and $g: \mathbb{R}^d \to \mathbb{R}$ Lipschitz around $F(x)$. Then $f = g \circ F$ is Lipschitz around $x$ and one has

Let $\lambda_m(M)$ be the $m$-th largest eigenvalue of a symmetric matrix $M$. The following theorem gives the Clarke subdifferential of the eigenvalue:

Theorem K.6 (Theorem 5.3, Hiriart-Urruty and Lewis 1999). The Clarke subdifferential of the eigenvalue function $\lambda_m$ is given below, where $\mathrm{co}$ denotes the convex hull:

K.2 PROOF OF LEMMA 6.1

The equation to be proved is:
Since $W(t) \succeq 0$ by Lemma K.1, (11) can be rewritten as the following:

Proof for Lemma 6.1. Suppose $W(t)$ is a symmetric solution of (11). By Lemma K.1, we know $W(t)$ also satisfies (30). Now we let $R(t)$ be the solution of the following ODE with $R(0) := (W(0))^{1/L}$:
The calculation below shows that $R^L(t)$ also satisfies (30):
which completes the proof.

K.3 PROOF FOR THEOREM 6.2

Now we turn to prove Theorem 6.2. Let $P = L/2$.
Then (29) can be rewritten as

The following lemma about the growth rate of $\lambda_k(M)$ is used later in the proof.

Lemma K.9. Let $M_0$ be a PSD matrix with $\|M_0\|_2 \le 1$. For $M(t) := \phi_m(\alpha M_0, t)$ and $t \le T_\alpha(c)$,
so $\|M(t)\|_2 \le c$ for all $t \le T_\alpha(c)$. Applying Lemma K.8 again, we have $\alpha^{-(P-1)} - M(t)$

Consider the following ODE:
We use $\bar\phi_m(M_0, t)$ to denote the solution $\bar M(t)$ when $\bar M(0) = M_0$. For diagonal $M_0$, $\bar M(t)$ is also diagonal for any $t$, and it is easy to show that
$$e_i^\top \bar M(t) e_i = \begin{cases} \left((e_i^\top M_0 e_i)^{-(P-1)} - 2\mu_i(P-1)t\right)^{-\frac{1}{P-1}}, & e_i^\top M_0 e_i \neq 0, \\ 0, & e_i^\top M_0 e_i = 0. \end{cases}$$

Remark K.10. Unlike the depth-2 case, the closed-form solution $\bar M(t)$ is only tractable for diagonal initialization, i.e., (35) (note that the identity matrix is diagonal). This is the main barrier to extending our two-phase analysis to the case of general initialization when $L \ge 3$. In Appendix L, we give a more detailed discussion of this barrier.

The following lemma shows that the trajectory of $M(t)$ is close to $\bar M(t)$.

Lemma K.11. Let $M_0$ be a diagonal PSD matrix with $\|M_0\|_2 \le 1$. For $M(t) := \phi_m(\alpha M_0, t)$ and $\bar M(t) := \bar\phi_m(\alpha M_0, t)$, we have

Proof. We bound the difference $D := M - \bar M$ between $M$ and $\bar M$,
where the last step is by Lemma K.4. This implies that
$\frac{2P}{P-1}\,d\tau$.

However, unlike the depth-2 case, $\bar M$ can be different from $v_1 v_1^\top$ even if $v_1^\top M(0) v_1 > 0$. We give an example with diagonal $M(0)$ and $\nabla f(0)$ in Appendix L.2. Nevertheless, we still conjecture that, except for a zero-measure set of $M(0)$, $\bar M = v_1 v_1^\top$, based on the following theoretical and experimental evidence:

• For the counter-example, we show experimentally that even with a perturbation of magnitude only $10^{-5}$, $\bar M = v_1 v_1^\top$. The results are shown in Figure 7. The y-axis indicates $\langle v_1, u_1(t)\rangle$, where $u_1(t)$ is the top eigenvector of $M(t)$. As $\|W(t)\|_F$ becomes larger, $u_1(t)$ aligns better with $v_1$, which means the noise helps $M$ escape along $v_1$. The larger the noise is, the faster $u_1(t)$ converges to $v_1$.
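The diagonal closed form above says each nonzero diagonal entry follows $\dot m = 2\mu m^P$, with solution $m(t) = (m_0^{-(P-1)} - 2\mu(P-1)t)^{-\frac{1}{P-1}}$ before blow-up. A sanity check against direct Euler integration (the values of $P$, $\mu$, $m_0$ are illustrative):

```python
import numpy as np

P, mu, m0, T, n = 2.0, 1.0, 0.01, 10.0, 100_000  # illustrative values

# Closed form for m' = 2*mu*m^P with m(0) = m0, valid before blow-up
# at t* = m0^{-(P-1)} / (2*mu*(P-1)); here t* = 50 > T.
def m_closed(t):
    return (m0 ** (-(P - 1)) - 2 * mu * (P - 1) * t) ** (-1.0 / (P - 1))

# Explicit Euler integration of the same ODE for comparison.
dt, m = T / n, m0
for _ in range(n):
    m += dt * 2 * mu * m ** P

assert abs(m - m_closed(T)) / m_closed(T) < 1e-2
```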

L.1 RANK-ONE CASE

Theorem L.1 (rank-1 initialization escapes along the top eigenvector). When $\mathrm{rank}(M(0)) = 1$, $\lim_{t\to\infty}$

Proof. Let $u(0)$ be the vector such that $M(0) = u(0)u(0)^\top$ and let $u(t) \in \mathbb{R}^d$ be the solution of
It is easy to check that $M(t) = u(t)u(t)^\top$ is the solution of (39), because
That is, under the time rescaling $t \to \tau(t)$, the trajectory of $u(t)$ still follows the power iteration, regardless of the depth $L$.
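A linear flow of the form $\dot u = Au$ (with $A$ playing the role of $-\nabla f(0)$) has solution $u(t) = e^{tA}u(0)$, whose direction converges to the top eigenvector of $A$ exactly as power iteration does; each Euler step multiplies $u$ by $I + \mathrm{d}t\,A$, which shares the same top eigenvector. A discretized sketch (the matrix $A$ below is illustrative, chosen with a well-separated top eigenvalue):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6

# A symmetric matrix with a known, well-separated top eigenvector.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2, 0.1]) @ Q.T
v_top = Q[:, 0]

# Euler-discretized flow u' = A u: a power iteration with I + dt*A.
u = rng.standard_normal(d)
dt = 1e-2
for _ in range(5000):
    u = u + dt * (A @ u)
    u = u / np.linalg.norm(u)  # only the direction matters

assert abs(u @ v_top) > 1 - 1e-8
```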

L.2 COUNTER-EXAMPLE FOR ESCAPING DIRECTION

Let $\nabla f(0) = \mathrm{diag}(2, 0.9, 0.8, \ldots, 0.1) \in \mathbb{R}^{10 \times 10}$ be diagonal. Let $W(0)$ be also diagonal, where $\alpha = 10^{-16}$ is a small constant. Let the depth be 4.

Lemma L.2. With $\nabla f(0)$ and $W(0)$
As both $W(0)$ and $\nabla f(0)$ are diagonal, $W(t)$ is always diagonal and has dynamics
therefore we have a closed form for $M(t)$:
and the time for $M(t)_{i,i}$ to go to infinity is $(2M(0)_{i,i}\nabla f(0)_{i,i})^{-1}$. By simple calculation, $M(t)_{2,2}$ goes to infinity the fastest, thus $\bar M = e_2 e_2^\top \neq v_1 v_1^\top$.

We remark that the scales of $W(0)$ and $\nabla f(0)$ do not matter in gradient flow, as scaling $\nabla f(0)$ is equivalent to scaling time (by Lemma L.3 below), and for this reason the x-axis is chosen as the relative growth rate.

Figure 7: The y-axis indicates the alignment between the top eigenvector $u_1(t)$ of $W(t)$ and $v_1$; different curves correspond to different relative magnitudes of noise. The initialization we use in this experiment is $W_{\mathrm{noise}}(0) = W(0) + \frac{\alpha}{2}(Z + Z^\top)$, where $W(0)$ is what we construct in Appendix L.2, and $Z$ is a matrix whose entries are i.i.d. samples from the standard Gaussian distribution $N(0,1)$. We run 5 fixed random seeds (the noise matrices) for each setting. The trajectory of $W$ is calculated by simulating gradient flow on $M$ with a small timestep, using RMSprop (Tieleman and Hinton, 2012) for faster convergence.

Lemma L.3. Suppose $g: \mathbb{R}^d \to \mathbb{R}^d$ is a $P$-homogeneous function, that is, $g(\alpha\theta) = \alpha^P g(\theta)$ for any $\alpha > 0$, and $\frac{d\theta(t)}{dt}$

Proof. Simply plug in $\theta(t) = \alpha\bar\theta(\alpha^{P-1}t)$; then we have
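Since each diagonal entry blows up at time $(2M(0)_{ii}\nabla f(0)_{ii})^{-1}$, the escaping coordinate is the one with the earliest blow-up, which need not be the coordinate maximizing $\nabla f(0)_{ii}$. The initialization values below are illustrative (not the exact construction above), chosen only to show that a larger initial entry can beat a larger eigenvalue:

```python
import numpy as np

# Depth 4 => P = L/2 = 2, so each diagonal entry follows m' = 2*mu*m^2,
# which blows up at time t_i = 1 / (2 * m_i(0) * mu_i).
mu = np.array([2.0, 0.9, 0.8])     # leading diagonal entries of grad f(0) above
m0 = np.array([1e-8, 1e-4, 1e-5])  # hypothetical diagonal initialization

t_blowup = 1.0 / (2.0 * m0 * mu)
# Coordinate 2 (index 1) escapes first even though mu_0 = 2 is largest:
assert np.argmin(t_blowup) == 1
```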

M PROOF OF LINEAR CONVERGENCE TO MINIMIZER

In this section, we present the theorems that guarantee linear convergence to a minimizer $W_0$ of $f(\cdot)$ if the dynamics (41) is initialized sufficiently close to $W_0$, i.e., $\|W(0) - W_0\|_F$ is sufficiently small. In Appendix M.3, we apply this result to prove Theorem 6.4.

Throughout this section, we assume $\mathrm{rank}(W_0) = k$ and use $m := \lambda_k(W_0)$ to denote the smallest non-zero eigenvalue of $W_0$. The tangent space of the manifold of rank-$k$ symmetric matrices at
Let $\mathcal{J}(W)$ be the Jacobian of $g(W)$ in (41). For the depth-2 case, we have shown in Theorem H.5, property 2, that $T$ is an invariant subspace of $\mathcal{J}(W_0)$. This can be generalized to the deep case where $L \ge 3$. Therefore, we can use $\mathcal{J}(W_0)|_T: T \to T$ to denote the linear operator $\mathcal{J}(W_0)$ restricted on $T$. We also define $\Pi_1^{d^2}(W)$ as the projection of $W \in \mathbb{R}^{d \times d}$ onto $T$, and $\Pi_2^{d^2}(W) := W - \Pi_1^{d^2}(W)$.

Towards showing the main convergence result of this section, we make the following assumption.

Assumption M.1. Suppose $\mathcal{J}(W_0)|_T$ is diagonalizable and all its eigenvalues are negative real numbers.

$W_0$ is a minimizer, so it is clear that $\mathcal{J}(W_0)|_T$ has no eigenvalues with positive real parts (otherwise there would be a descending direction of $f(\cdot)$ from $W_0$, since the loss $f(\cdot)$ strictly decreases along

Theorem M.3 (Linear convergence of rank-$k$ matrices). Suppose that $\mathrm{rank}(W(0)) = \mathrm{rank}(W_0) = k$ and
we have $\|W(t) - W_0\|_V \le Ce^{-\mu_1 t}\|W(0) - W_0\|_V$ for some constant $C$ depending on $W_0$, where $W(t)$ satisfies (41).

Proof. For convenience, we define
For the first term $\langle \Pi_1^{d^2}(\mathcal{J}(W_0)[W_1(t)]), W_1(t)\rangle_{V^{-1}}$, we know $W_1(t) \in T$, and $T$ is an invariant subspace of
Thus we have shown the following. Note that so far we have not used the assumption that $W$ is rank-$k$.
Thus, from (42) we can derive that
Thus from (43), we have
which completes the proof.
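The projections $\Pi_1^{d^2}$ and $\Pi_2^{d^2}$ admit a standard closed form: with $U$ an orthonormal basis of the column space of $W_0$ and $P_\perp = I - UU^\top$, the tangent-space projection is $\Pi_1^{d^2}(\Delta) = \Delta - P_\perp \Delta P_\perp$. This formula is assumed here for illustration rather than taken from the text; a numerical check that it is an idempotent projection with orthogonal complement:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 2

# Orthonormal basis U of the column space of a rank-k symmetric W0.
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
P_perp = np.eye(d) - U @ U.T

def proj_tangent(Delta):
    # Tangent space of rank-k symmetric matrices at W0:
    # kill the block orthogonal to the column space on both sides.
    return Delta - P_perp @ Delta @ P_perp

D = rng.standard_normal((d, d))
D = (D + D.T) / 2
D1 = proj_tangent(D)   # plays the role of Pi_1(D)
D2 = D - D1            # plays the role of Pi_2(D)

assert np.allclose(proj_tangent(D1), D1)      # idempotent
assert np.allclose(P_perp @ D1 @ P_perp, 0)   # D1 lies in the tangent space
assert abs(np.sum(D1 * D2)) < 1e-10           # orthogonal decomposition
```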

M.2 ALMOST RANK-k INITIALIZATION

We use $M(t)$ to denote the top-$k$ component of $W(t)$ in SVD, and $N(t)$ to denote the rest, i.e., $W(t) - M(t)$. One can think of $M(t)$ as the main part and $N(t)$ as the negligible part.

Below we show that for deep overparametrized matrix factorization, where $W(t)$ satisfies (41), if the trajectory is initialized at some $W(0)$ in a small neighborhood of the $k$-th critical point $W_0$ of deep GLRL, and $W(0)$ is approximately rank-$k$ in the sense that $N(0)$ is very small, then $\inf_{t \ge 0}\|W(t) - W_0\|_V$ is roughly of the same magnitude as $N(0)$.

Theorem M.4 (Linear convergence of almost rank-$k$ matrices, deep case). Suppose $W_0$ is a critical point of rank $k$ and $W_0$ satisfies Assumption M.1. There exist constants $C_0$ and $r$ such that if $C_0\|N(0)\|_F \le \|W_1(0)\|_V \le r$, then there exist a time $T$ and constants $C, C'$ such that

(1). $\|W(t) - W_0\|_V \le Ce^{-\mu_1 t/2}\|W(0) - W_0\|_V$ for $t \le T$.

(2). $\|W(T) - W_0\|_F \le C'\|N(0)\|_F$.

Proof. When $\|W(t) - W_0\|_F \le \frac{\lambda_{\min}(W_0)}{4}$, we have $\|N(t)\|_F \le \frac{\lambda_{\min}(W_0)}{4}$. Thus we can pick a constant $C_0$ large enough and $r$ small enough such that for any $t \ge 0$, if $C_0\|N(t)\|_F \le \|W_1(t)\|_V \le r$, then it holds that:

