Implicit Acceleration of Gradient Flow in Overparameterized Linear Models

Abstract

We study the implicit acceleration of gradient flow in over-parameterized two-layer linear models. We show that implicit acceleration emerges from a conservation law that constrains the dynamics to follow certain trajectories. More precisely, gradient flow preserves the difference of the Gramian matrices of the input and output weights, and we show that the amount of acceleration depends on both the magnitude of that difference (which is fixed at initialization) and the spectrum of the data. In addition, generalizing prior work, we prove our results without assuming small, balanced, or spectral initialization for the weights, and establish interesting connections between the matrix factorization problem and Riccati-type differential equations.

1. Introduction

Understanding over-parameterization in deep learning is a puzzling question. Contrary to the common belief that over-parameterization may hurt generalization and optimization, recent work suggests that over-parameterization may actually bias the optimization algorithm towards solutions that generalize well, a phenomenon known as implicit regularization or implicit bias, and may even accelerate convergence, a phenomenon known as implicit acceleration. Recent work on the implicit bias in the over-parameterized regime (e.g. Gunasekar et al. (2018a;b); Chizat & Bach (2020); Ji & Telgarsky (2019b)) shows that gradient descent on unregularized problems finds minimum-norm solutions. For instance, Soudry et al. (2018); Ji & Telgarsky (2019a) analyze linear networks trained for binary classification on linearly separable data and show that the predictor converges to a max-margin solution. Similar ideas have been developed for matrix factorization, yielding solutions with minimum nuclear norm (Gunasekar et al., 2017; Li et al., 2018) or low rank (Arora et al., 2019a). It has also been shown that optimization methods which introduce multiplicative stochastic noise, such as dropout and dropblock, induce nuclear norm regularization (Cavazza et al., 2018) and spectral k-support norm regularization (Pal et al., 2020), respectively.

Recent work on the implicit acceleration of gradient descent for matrix factorization and deep linear networks (Arora et al., 2018) shows that when the initialization is sufficiently small and balanced (see Definition 2), over-parameterization acts as a pre-conditioning of the gradient that can be interpreted as a combination of momentum and an adaptive learning rate. They claim that acceleration for ℓ_p-regression is possible only if p > 2, though there is no theory supporting such a claim. Saxe et al. (2014) focused on ℓ_2-regression with balanced spectral initializations (see Definition 3) and similarly concluded that depth may actually slow down convergence. For two-layer linear networks, Saxe et al. (2019); Gidel et al. (2019) analyzed the dynamics of gradient flow and obtained explicit solutions under the assumption of vanishing spectral initialization, highlighting the sequential learning of the hierarchical components as a phenomenon that could improve generalization. Several recent papers (e.g. Arora et al. (2019b); Du & Hu (2019); Du et al. (2018b)) have also analyzed the convergence behaviour of gradient descent in the over-parameterized setting, particularly for very wide networks, and have established linear convergence when the initialization is Gaussian or balanced. While a precise study of the connections between gradient descent and gradient flow dynamics for non-convex problems remains elusive, recent work (Franca et al., 2020) shows that discrete-time convergence rates can be derived from their continuous-time counterparts via symplectic integrators. Therefore, our work focuses on the analysis of gradient flow as a stepping stone for future analysis of gradient descent.

In this paper, we present a new analysis of the implicit acceleration of gradient flow for over-parameterized two-layer linear networks that applies not only in the case of small, balanced, or spectral initialization but also extends to imbalanced and non-spectral initializations. We show that the key reason for the implicit acceleration of gradient flow is the existence of a conservation law that constrains the dynamics to follow a particular path. More precisely, the quantity preserved by gradient flow is the difference of the Gramians of the input and output weight matrices, which in turn implies that the difference of the squared norms of the weight matrices is preserved.
The particular case where this difference is zero corresponds to balanced weights, but the more general case of imbalanced weights also yields a conserved quantity that plays an important role. In particular, we show that acceleration can occur even for ℓ_2-regression as a result of imbalanced initialization. The reason this phenomenon was not previously observed in (Saxe et al., 2014; 2019; Gidel et al., 2019) is precisely the assumption of balanced initialization, which follows as a particular case of our analysis. Our work also establishes interesting connections with Riccati-type differential equations. Indeed, some of our results have a similar flavor to those in (Fukumizu, 1998), while others are more general and provide an explicit characterization of the continuous-time convergence rate.

In short, our work makes the following contributions.

1. In Section 2, we analyze the implicit acceleration properties of gradient flow for symmetric matrix factorization, providing a closed-form solution and a convergence rate that depends on the eigenvalues of the data, without the assumptions of spectral and small initialization.

2. In Section 3, we analyze the implicit acceleration properties of gradient flow for asymmetric matrix factorization with spectral initialization. We show that implicit acceleration emerges as a consequence of conservation laws that only appear in over-parameterized settings due to an underlying rotational symmetry.

3. In Section 4, we analyze the implicit acceleration properties of gradient flow for asymmetric matrix factorization with an arbitrary initialization. We make connections with Riccati differential equations, obtain a more general characterization of the convergence rate, and establish an interesting link with explicit regularization.

2. Gradient Flow Dynamics for Symmetric Matrix Factorization

In this section, we analyze and compare the dynamics of gradient flow,

Ẋ(t) = -∇_X ℓ(X(t)),  (1)

when applied to two problems. The first one is learning a symmetric one-layer linear model,

min_{X ∈ R^{m×m}} ℓ(X) ≡ ½ ||Y - X||_F^2,  (2)

where Y ∈ R^{m×m} is a given data matrix that one wishes to approximate by X ∈ R^{m×m}. The second one is learning its over-parameterized symmetric matrix factorization counterpart,

min_{U ∈ R^{m×k}} ℓ(U) ≡ ½ ||Y - UU^T||_F^2.  (3)

We show that the dynamics of the linear model converge at a rate O(e^{-t}), while the over-parameterized model has a rate of O(e^{-4t|σ_i|}), where σ_i is the ith eigenvalue of the data matrix Y. Therefore, different spectral components are learned at different rates, which can be faster or slower than the non-overparameterized model depending on the eigenvalues of Y.

Linear model. Let us start with the trivial problem of learning a non-overparameterized linear model (2). Applying the gradient flow (1) to problem (2) yields Ẋ(t) + X(t) = Y with X(0) = X_0. This is a linear differential equation whose unique solution is given by

X(t) = Y + (X_0 - Y) e^{-t}.  (4)

Thus, ||X(t) - Y||_F = e^{-t} ||X_0 - Y||_F and lim_{t→∞} X(t) = Y at an exponential rate of O(e^{-t}). For completeness, it is instructive to consider the particular case in which X is constrained to be positive semidefinite (PSD), i.e., X ⪰ 0. In this case, notice that if X_0 ⪰ 0 and Y ⪰ 0, then X(t) ⪰ 0 for all t > 0, hence the same dynamics and convergence rate still apply without having to enforce the PSD constraint. Otherwise, if Y is not PSD, gradient flow is not directly applicable.

Symmetric matrix factorization model. Consider the more interesting case of learning a two-layer linear model with tied weights U ∈ R^{m×k}, formulated as the symmetric matrix factorization problem in (3). In classical low-rank matrix factorization, one assumes k < m. Here, we consider an over-parameterized formulation where k > m plays the role of the number of hidden units (width).
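As a sanity check, the closed-form trajectory of the one-layer model can be verified numerically. The sketch below (NumPy; dimensions and seed are our own illustrative choices) confirms that X(t) = Y + (X_0 - Y)e^{-t} satisfies Ẋ = Y - X and contracts the error at rate exactly e^{-t}:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
Y = rng.standard_normal((m, m))
X0 = rng.standard_normal((m, m))

def X(t):
    # closed-form gradient-flow solution of the one-layer problem (2)
    return Y + (X0 - Y) * np.exp(-t)

# the trajectory satisfies dX/dt = Y - X (central finite difference)
t, h = 0.8, 1e-6
Xdot = (X(t + h) - X(t - h)) / (2 * h)
ode_residual = np.max(np.abs(Xdot - (Y - X(t))))

# and the approximation error contracts at rate e^{-t}
rate_gap = abs(np.linalg.norm(X(t) - Y, 'fro')
               - np.exp(-t) * np.linalg.norm(X0 - Y, 'fro'))
```

Both residuals are at the level of numerical round-off, as expected for an exact solution.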
The gradient flow (1) on problem (3), now over U, yields U̇ = 2(Y - UU^T)U with U(0) ≡ U_0. Letting X(t) ≡ U(t)U(t)^T ⪰ 0 and X(0) = U_0U_0^T ⪰ 0, one can easily verify that

Ẋ = U̇U^T + UU̇^T = 2(Y - X)X + 2X(Y - X) = 2YX + 2XY - 4X^2.  (5)

Equation (5) is known to be rank preserving, i.e., if rank(X_0) = r ≤ m and X* = lim_{t→∞} X(t) exists, then rank(X*) ≤ r. It is thus impossible to recover a solution of rank higher than the rank of the initialization. We also note that (5) is a (matrix) differential equation of the Riccati type. Such equations often characterize dynamical systems behind least squares problems and have been extensively studied in the context of optimal control. Using results from this literature, we obtain (see Appendix A for the proof):

Proposition 1. For any X_0 ∈ R^{m×m}, the solution to (5) exists and is given by

X(t) = e^{2tY} X_0 [I + Y^{-1}(e^{4tY} - I) X_0]^{-1} e^{2tY},  (6)

provided that Y and the matrix inside the brackets above are invertible.

This solution is derived for any X_0, while the over-parameterized model requires X_0 = U_0U_0^T ⪰ 0. Thus, in using (6) as an analysis tool, it is important to keep in mind the set of allowable initializations. In what follows, we consider the spectral initialization (Saxe et al., 2019; Gidel et al., 2019), and show that the eigenspace of the data is preserved throughout the entire evolution of the learning dynamics.

Definition 1 (Symmetric spectral initialization). Let Y = ΦΣΦ^T be the eigendecomposition of the data. A spectral initialization is defined as U_0 ≡ ΦΣ_0^{1/2} and X_0 ≡ U_0U_0^T = ΦΣ_0Φ^T, where Σ_0 ⪰ 0 is a diagonal matrix.

From the explicit solution (6), we can readily obtain a convergence rate for a spectral initialization (see Appendix B for a short proof).

Corollary 1.
If Y = ΦΣΦ^T = Σ_{i=1}^m σ_i φ_i φ_i^T is invertible and X_0 = ΦΣ_0Φ^T = Σ_{i=1}^m σ_{0,i} φ_i φ_i^T is a spectral initialization, the solution to (5) is X(t) = ΦΣ(t)Φ^T = Σ_{i=1}^m σ_i(t) φ_i φ_i^T, where

σ_i(t) = σ_{0,i} σ_i e^{4tσ_i} / (σ_i + σ_{0,i}(e^{4tσ_i} - 1)) = σ_i + σ_i(σ_{0,i} - σ_i) / (σ_i + σ_{0,i}(e^{4tσ_i} - 1)),  (7)

provided the denominator is nonzero. Moreover, if Ỹ = Σ_{i=1}^m max(σ_i, 0) φ_i φ_i^T = ΦΣ̃Φ^T is the projection of Y onto the PSD cone, then for all initializations X_0 ⪰ 0 such that rank(Σ_0Σ̃) = rank(Σ̃), X(t) converges to Ỹ at a rate O(e^{-4tσ_min(Y)}), where σ_min(Y) = min_i |σ_i| is the smallest eigenvalue of Y in magnitude.

Note from (7) that the ith eigencomponent of X converges at a rate of O(e^{-4t|σ_i|}), so that components with 4|σ_i| > 1 are accelerated compared to the non-overparameterized case, components with 4|σ_i| < 1 are slowed down, and if 4σ_min(Y) > 1 all components are accelerated. This result about different components of the network being learned at different rates by gradient flow is related in spirit to the result in (Saxe et al., 2019; Gidel et al., 2019) about sequential learning in the asymmetric case with spectral balanced initialization. Here the balancedness is enforced by construction. Next, we derive the same convergence rate with a more general (non-spectral) initialization. The proof is in Appendix C and makes use of several interesting relations for Riccati differential equations.

Proposition 2 (Convergence rate). Consider the eigenvalue decomposition Y = Σ_{i=1}^m σ_i φ_i φ_i^T. Let Ỹ = Σ_{i=1}^m max(σ_i, 0) φ_i φ_i^T be the projection of Y onto the PSD cone and Ŷ = Σ_{i=1}^m |σ_i| φ_i φ_i^T. For any initialization X_0 ⪰ 0, assume that I + Ŷ^{-1}(X_0 - Ỹ) and Y are nonsingular. Then the solution X(t) of (5) converges exponentially to Ỹ at a rate

||X(t) - Ỹ||_F ≤ C e^{-4tσ_min(Y)},

where σ_min(Y) is the smallest eigenvalue of Y in absolute value and C > 0 is a constant.
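The per-component formula (7) can be checked directly: under spectral initialization, (5) decouples in the eigenbasis of Y and each eigencomponent obeys the scalar flow dσ_i(t)/dt = 4σ_i σ_i(t) - 4σ_i(t)^2. A numerical sketch (NumPy; the test values of σ_i and σ_{0,i} are our own illustrative choices):

```python
import numpy as np

def sigma_t(t, s, s0):
    # per-component solution (7): s = sigma_i, s0 = sigma_{0,i}
    return s0 * s * np.exp(4 * t * s) / (s + s0 * (np.exp(4 * t * s) - 1.0))

# check that (7) solves the scalar flow  d/dt sigma_i = 4 s sigma_i - 4 sigma_i^2
worst = 0.0
for s, s0 in [(2.0, 0.1), (0.5, 0.3), (-1.0, 0.2)]:   # includes a negative eigenvalue
    t, h = 1.0, 1e-5
    num = (sigma_t(t + h, s, s0) - sigma_t(t - h, s, s0)) / (2 * h)
    ana = 4 * s * sigma_t(t, s, s0) - 4 * sigma_t(t, s, s0) ** 2
    worst = max(worst, abs(num - ana))
```

For the negative eigenvalue, the component decays to zero, consistent with convergence to the PSD projection Ỹ.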
It follows from Proposition 2 that the implicit acceleration for symmetric matrix factorization with spectral initialization extends to any positive semidefinite initialization X_0 provided I + Ŷ^{-1}(X_0 - Ỹ) is invertible, which generalizes the previous assumption on X_0, namely that rank(Σ_0Σ̃) = rank(Σ̃). The primary difference is that in the spectral initialization case we can derive the convergence rate for each eigenvalue of X(t), while in general we can only obtain a global convergence rate for the solution.
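The closed form (6) of Proposition 1, which underlies both results above, can also be checked for a generic (non-spectral) PSD initialization by differencing it against the right-hand side of (5). A sketch (NumPy; the spectrum, sizes, and seed are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 3
d = np.array([0.5, 1.0, 2.0])                    # chosen eigenvalues of Y
Qm, _ = np.linalg.qr(rng.standard_normal((m, m)))
Y = Qm @ np.diag(d) @ Qm.T                       # symmetric, invertible
U0 = 0.3 * rng.standard_normal((m, m))
X0 = U0 @ U0.T                                   # generic PSD initialization

def mexp(a):
    # e^{aY} via the eigendecomposition of Y
    return Qm @ np.diag(np.exp(a * d)) @ Qm.T

def X(t):
    # Proposition 1: X(t) = e^{2tY} X0 [I + Y^{-1}(e^{4tY} - I) X0]^{-1} e^{2tY}
    inner = np.eye(m) + np.linalg.solve(Y, mexp(4 * t) - np.eye(m)) @ X0
    return mexp(2 * t) @ X0 @ np.linalg.inv(inner) @ mexp(2 * t)

t, h = 0.6, 1e-5
Xdot = (X(t + h) - X(t - h)) / (2 * h)
Xt = X(t)
riccati_residual = np.max(np.abs(Xdot - (2 * Y @ Xt + 2 * Xt @ Y - 4 * Xt @ Xt)))
```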

3. Gradient Flow Dynamics for Asymmetric Matrix Factorization with Spectral Initialization

In this section, we analyze the dynamics of gradient flow for the more general asymmetric matrix factorization problem. We show that the implicit acceleration phenomenon is still present and provide an explanation for it based on a conservation law for the difference of the Gramians of the factors. We transform the dynamics to a canonical form and show that the solutions under the spectral initialization are diagonal and can be computed in closed form. The closed-form solution reveals a convergence rate of O(e^{-t√(4σ_i^2 + λ_{0,i}^2)}), where σ_i is the ith singular value of Y and λ_{0,i} defines the level of imbalance in the initialization for the ith component. As in the symmetric case, data matrices with large singular values induce implicit acceleration. However, in the asymmetric formulation, additional acceleration can be gained by choosing an imbalanced initialization.

Asymmetric matrix factorization model. Consider the asymmetric factorization problem

min_{U,V} ℓ(U, V) ≡ ½ ||Y - UV^T||_F^2,  (9)

where U ∈ R^{m×k}, V ∈ R^{n×k} and k ≥ n ≥ m. The gradient flow for this problem takes the form

U̇ = -∇_U ℓ = (Y - UV^T)V,  V̇ = -∇_V ℓ = (Y - UV^T)^T U.  (10)

In what follows, we will make use of a conservation law for the difference of the Gramian matrices U^TU and V^TV, which has been previously identified in (Du et al., 2018a; Arora et al., 2018). Previous works have used this conservation law to ensure balancedness under vanishingly small initialization. In contrast, our analysis is the first to highlight the role of imbalance and the resulting acceleration of gradient flow.

Conservation law. A straightforward calculation shows that (10) admits an invariant:

Q ≡ U^TU - V^TV,  dQ/dt = U̇^TU + U^TU̇ - V̇^TV - V^TV̇ = 0  ⟹  Q(t) = Q(0).  (11)

The origin of this conserved quantity Q is a global rotational symmetry of the system in (10), i.e., the system is invariant under the orthogonal group O(k).
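The cancellation in (11) can be verified mechanically: at any point (U, V), the velocities prescribed by the flow (10) leave Q unchanged. A minimal sketch (NumPy; sizes and seed are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 3, 4, 6
Y = rng.standard_normal((m, n))
U = rng.standard_normal((m, k))
V = rng.standard_normal((n, k))

E = Y - U @ V.T
dU = E @ V            # gradient-flow velocity of U, from (10)
dV = E.T @ U          # gradient-flow velocity of V, from (10)

# dQ/dt from (11): the four terms cancel pairwise at any (U, V)
dQdt = dU.T @ U + U.T @ dU - dV.T @ V - V.T @ dV
drift = np.max(np.abs(dQdt))
```

The drift is zero up to floating-point round-off, for an arbitrary (not necessarily balanced) point.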
To see this, consider the singular value decomposition Y = ΦΣΨ^T and, following (Saxe et al., 2019), define matrices Ū and V̄ through

U = ΦŪG^T,  V = ΨV̄G^T,  (12)

where G is an arbitrary element of O(k). These transformed variables obey

dŪ/dt = (Σ - ŪV̄^T)V̄,  dV̄/dt = (Σ - ŪV̄^T)^T Ū.  (13)

Note that these equations have exactly the same form as (10), up to a gauge freedom in the choice of G. Since Q is real and symmetric, it is diagonalizable by an orthogonal matrix. Therefore, we can choose G to be the matrix that diagonalizes Q; this is a gauge choice. Hence, from (11) we have

Ū^TŪ - V̄^TV̄ = G^T Q(t) G = Λ_Q = Λ_{Q0} = Ū_0^TŪ_0 - V̄_0^TV̄_0,  (14)

where Λ_{Q0} is the (constant) diagonal matrix containing the k eigenvalues of Q(0) ≡ Q_0 (or Q), which is completely specified by the initial conditions U_0 and V_0 alone. Note that the number of conserved quantities in Λ_{Q0} depends on k, which is equal to the degree of over-parameterization.

Though we do not assume balanced initialization in this paper, for further reference and comparison with prior work (Arora et al., 2018; Saxe et al., 2014; 2019) let us state its precise meaning, since it relates to the conservation law.

Definition 2 (Balanced initialization). (U_0, V_0) is said to be balanced if ||Q||_F = ||Q_0||_F ≤ ε for sufficiently small ε > 0, i.e., the conserved quantity in (11), or equivalently (14), almost vanishes.

Under the above transformation, the matrix problem with spectral initialization can be reduced to solving k one-dimensional systems (one for each component). Proposition 3 provides a closed-form solution and explicitly characterizes the evolution of each component along with its implicit acceleration (see Appendix D for the proof).

Definition 3 (Asymmetric spectral initialization). Let Y = ΦΣΨ^T be the SVD of the data. The spectral initialization is defined as U_0 = ΦŪ_0G, V_0 = ΨV̄_0G, and X_0 = U_0V_0^T, where Ū_0 and V̄_0 are rectangular diagonal matrices and G is any orthogonal matrix.
Proposition 3 (Implicit acceleration of asymmetric factorization with spectral initialization). Let Y = ΦΣΨ^T = Σ_{i=1}^m σ_i φ_i ψ_i^T be the SVD of the data. The solution to (10) with spectral initialization X_0 = ΦΣ_0Ψ^T = Σ_{i=1}^m σ_{0,i} φ_i ψ_i^T yields X(t) = U(t)V(t)^T = ΦΣ(t)Ψ^T = Σ_{i=1}^m σ_i(t) φ_i ψ_i^T, where

σ_i(t) = [σ_i e^{2t√(4σ_i^2+λ_{0,i}^2)} - 2C_i λ_{0,i}^2 e^{t√(4σ_i^2+λ_{0,i}^2)} - 4σ_i λ_{0,i}^2 C_i^2] / [e^{2t√(4σ_i^2+λ_{0,i}^2)} + 8σ_i C_i e^{t√(4σ_i^2+λ_{0,i}^2)} - 4λ_{0,i}^2 C_i^2],  (15)

λ_0 = diag(Ū_0^TŪ_0 - V̄_0^TV̄_0), and C_i = C_i(σ_i, λ_{0,i}, σ_{0,i}) is a constant. Moreover, the ith eigencomponent of X(t) converges to the ith eigencomponent of Y at a rate O(e^{-t√(4σ_i^2+λ_{0,i}^2)}); indeed, asymptotically,

|σ_i(t) - σ_i| ≲ 2C_i(4σ_i^2 + λ_{0,i}^2) e^{-t√(4σ_i^2+λ_{0,i}^2)},

which depends on both σ_i (singular values of the data) and λ_{0,i} (level of initialization imbalance).

[Table 2: qualitative convergence behavior for balanced (λ_0 ≈ 0) vs. imbalanced (λ_0 ≫ 0) initialization and small vs. large singular values σ.]

Implicit acceleration. Recall from (4) that the convergence rate for the non-overparameterized problem in (2) is O(e^{-t}), which does not depend on the data or the initialization. It is immediate to see from (15) that over-parameterization leads to implicit acceleration when 4σ_i^2 + λ_{0,i}^2 > 1. As illustrated in Table 2 and previously observed in the literature (Saxe et al., 2014), acceleration is possible under a balanced initialization, i.e., λ_{0,i} ≈ 0, when the singular values are sufficiently large. This is similar to the symmetric case, which is not surprising since a symmetric factorization is balanced by construction. A new phenomenon that emerges from our analysis is that acceleration can also be achieved by increasing the level of imbalance; in fact, acceleration is achieved regardless of the data whenever |λ_{0,i}| > 1. Moreover, faster convergence can be achieved by using a more imbalanced initialization.
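The role of the imbalance λ_{0,i} can be illustrated with a scalar (one-component) version of the flow (10), du/dt = (σ - uv)v, dv/dt = (σ - uv)u, whose imbalance u_0^2 - v_0^2 is conserved. The sketch below (NumPy; the value of σ, the initializations, and the integration horizon are our own illustrative choices) integrates the system for a balanced and an imbalanced initialization sharing the same initial product u_0 v_0 and compares the final residuals:

```python
import numpy as np

def residual(u0, v0, sigma=0.3, h=1e-2, steps=2000):
    # RK4 integration of  du/dt = (sigma - u v) v,  dv/dt = (sigma - u v) u
    f = lambda s: np.array([(sigma - s[0] * s[1]) * s[1],
                            (sigma - s[0] * s[1]) * s[0]])
    s = np.array([u0, v0], dtype=float)
    for _ in range(steps):
        k1 = f(s); k2 = f(s + h/2 * k1); k3 = f(s + h/2 * k2); k4 = f(s + h * k3)
        s = s + h/6 * (k1 + 2*k2 + 2*k3 + k4)
    return abs(sigma - s[0] * s[1])   # distance of the component uv from sigma

res_balanced = residual(0.1, 0.1)            # lambda_0 = 0, small sigma: slow
res_imbalanced = residual(1.5, 0.01 / 1.5)   # lambda_0 ~ 2.25, same u0*v0: fast
```

Consistent with the rate O(e^{-t√(4σ^2 + λ_0^2)}), the imbalanced run should reach a much smaller residual over the same horizon even though σ is small.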
4. Dynamics of Gradient Flow for Asymmetric Matrix Factorization without Spectral Initialization

We now relax the assumption of spectral initialization (Definition 3). Defining the quantities

R(t) ≡ [Ū(t); V̄(t)],  S ≡ [0, Σ; Σ^T, 0],  S̄ ≡ ½ [I_m, 0; 0, -I_n],

one can immediately obtain from (13) and (14) the Riccati-like differential equation

Ṙ = SR - ½ RR^TR + S̄RΛ_{Q0},  (18)

where from (14) we conclude that 2R_0^T S̄ R_0 = Λ_{Q0} with R(0) ≡ R_0. However, in general, one cannot go back from (18) to (13) unless the conservation law (14) is explicitly imposed for all times t. The natural question is then: when are they equivalent? Our next result provides the answer and additionally reveals an interesting relation between (18) and a matrix factorization problem with explicit regularization (proof in Appendix F).

Proposition 4 (Explicit regularization). The differential equation (18) is equivalent to

dŪ/dt = (Σ - ŪV̄^T)V̄ - ½ Ū(Ū^TŪ - V̄^TV̄ - Λ_{Q0}),  dV̄/dt = (Σ - ŪV̄^T)^T Ū + ½ V̄(Ū^TŪ - V̄^TV̄ - Λ_{Q0}).  (19)

This system corresponds to the dynamics of gradient flow applied to the regularized problem

min_{Ū,V̄} ½ ||Σ - ŪV̄^T||_F^2 + (1/8) ||Ū^TŪ - V̄^TV̄ - Λ_{Q0}||_F^2.  (20)

Moreover, if Q̄(t) ≡ Ū^T(t)Ū(t) - V̄^T(t)V̄(t) obeys Q̄(t_0) = Λ_{Q0} at some t = t_0, then Q̄(t) = Λ_{Q0} for all t ≥ t_0. In particular, if we initialize (19), or equivalently (18), such that Q̄(0) = 2R_0^T S̄ R_0 = Λ_{Q0}, then the conservation law (14) holds for all t and the dynamics of (19) and (13) coincide.
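The correspondence in Proposition 4 between the flow (19) and the regularized objective (20) can be sanity-checked by finite differences: the right-hand side of (19) should be the negative gradient of (20). A sketch for the Ū variable (NumPy; sizes, the diagonal Λ_{Q0}, and the seed are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 3, 4, 6
Sig = np.zeros((m, n)); Sig[:, :m] = np.diag([2.0, 1.0, 0.5])  # rectangular Sigma
Lam = np.diag(rng.standard_normal(k))                          # diagonal Lambda_Q0
U = rng.standard_normal((m, k)); V = rng.standard_normal((n, k))

def objective(U, V):
    # the regularized problem (20)
    Q = U.T @ U - V.T @ V
    return 0.5 * np.sum((Sig - U @ V.T) ** 2) + 0.125 * np.sum((Q - Lam) ** 2)

# analytic gradient w.r.t. U-bar: the negated right-hand side of (19)
E = Sig - U @ V.T
Q = U.T @ U - V.T @ V
gU = -E @ V + 0.5 * U @ (Q - Lam)

# central finite-difference gradient
num = np.zeros_like(U)
h = 1e-6
for i in range(m):
    for j in range(k):
        Up, Um = U.copy(), U.copy()
        Up[i, j] += h; Um[i, j] -= h
        num[i, j] = (objective(Up, V) - objective(Um, V)) / (2 * h)
max_err = np.max(np.abs(gU - num))
```

The same check goes through for the V̄ variable, with the sign of the regularization term flipped as in (19).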

Some remarks are appropriate:

• A solution to (13) yields a solution to (19), because when (11) holds the second terms on the right-hand side of (19) vanish, while the first terms are exactly (13). However, the converse is not necessarily true, unless (19) is initialized in the same way as (13). The above result relates implicit acceleration (as will be shown in Proposition 5) to an explicit regularization; namely, one can either select a particular initialization and solve an unregularized problem, or start at an arbitrary initialization and explicitly regularize.

• The specific weight of 1/8 in (20) is special: if one replaces 1/8 by another constant α > 0, the gradient flow dynamics, i.e., the analog of (19), will not be equivalent to (18). Problem (20) also appeared in (Du et al., 2018a), but without such connections.

Equation (18) is reminiscent of a Riccati differential equation due to the cubic term in R (similar to the gradient flow in the symmetric case), but we believe that, in general, it cannot be solved exactly due to the last term. However, it can be solved exactly in a particular case (proof in Appendix E).

Proposition 5 (Exact solution and convergence rate). If Λ_{Q0} = λ_0 I_k for some constant λ_0, then the differential equation (18) reduces to the following equation with a closed-form solution for RR^T:

Ṙ = S̃R - ½ RR^TR,  R(t)R^T(t) = e^{tS̃} R_0R_0^T [I + ½ S̃^{-1}(e^{2tS̃} - I) R_0R_0^T]^{-1} e^{tS̃},  (21)

where S̃ ≡ S + λ_0 S̄ = Φ̃Σ̃Φ̃^T and R_0 ≡ R(0). Moreover, if S̃ and I + ½ Ŝ^{-1}(R_0R_0^T - R̄R̄^T) are invertible, where Ŝ = Φ̃|Σ̃|Φ̃^T, then R(t)R^T(t) converges exponentially to R̄R̄^T, defined as the projection of the matrix 2S̃ onto the PSD cone. More precisely, if Y is a square matrix the convergence rate is O(e^{-t√(4σ_min^2+λ_0^2)}), where σ_min is the smallest singular value of Y, and otherwise the rate is O(e^{-|λ_0|t}).

Note that (21) is the gradient flow for the symmetric factorization problem min_R (1/8) ||2S̃ - RR^T||_F^2.
The particular case Λ_{Q0} = λ_0 I_k is mathematically interesting because it is amenable to an analytical treatment. However, it may not be realizable in practice, because the conserved quantity Q_0 (or Λ_{Q0}) must have low rank; indeed,

rank(Q_0) = rank(U_0^TU_0 - V_0^TV_0) ≤ rank(U_0^TU_0) + rank(V_0^TV_0) ≤ m + n.  (22)

Since rank(λ_0 I_k) = k, choosing Ū_0 and V̄_0 such that Ū_0^TŪ_0 - V̄_0^TV̄_0 = λ_0 I_k is not generally possible in an over-parameterized setting with k > m + n. On the other hand, the choice Λ_{Q0} = λ_0 I_k does not present a problem if we consider the system (19), where we have the freedom to choose any initialization. The experiments in Section 5 illustrate that (19), or equivalently (21), is actually enough to capture the general behaviour of (13).
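The rate in Proposition 5 can be traced to the spectrum of S̃ = S + λ_0 S̄: for a square Y, the eigenvalues of 2S̃ come in pairs ±√(4σ_i^2 + λ_0^2), matching the exponent in the convergence rate. A numerical sketch (NumPy; the size, seed, and value of λ_0 are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
m = 3
Y = rng.standard_normal((m, m))
sv = np.linalg.svd(Y, compute_uv=False)    # singular values sigma_i of Y
Sig = np.diag(sv)
lam0 = 1.5

# S and S-bar as defined at the start of Section 4 (square case, n = m)
S = np.block([[np.zeros((m, m)), Sig], [Sig, np.zeros((m, m))]])
Sb = 0.5 * np.block([[np.eye(m), np.zeros((m, m))],
                     [np.zeros((m, m)), -np.eye(m)]])
St = S + lam0 * Sb                          # S-tilde of Proposition 5

eigs = np.sort(np.abs(np.linalg.eigvalsh(2 * St)))
expected = np.sort(np.concatenate([np.sqrt(4 * sv**2 + lam0**2)] * 2))
spectrum_gap = np.max(np.abs(eigs - expected))
```

Increasing |λ_0| shifts every eigenvalue pair away from zero, which is the spectral picture behind data-independent acceleration.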

5. Numerical Experiments

Imbalanced initialization. We now provide numerical evidence supporting our theoretical results. First, we generate a random matrix Y with Y_ij ∼ N(0, 1), and set m = 5, n = 10 and k = 50. We approximate the dynamics of gradient flow for one-layer and two-layer linear models by using gradient descent with a fixed small step size η = 10^{-3} (smaller step sizes did not lead to a discernible change). We evaluate the reconstruction error ||Y - X(t)||_F / ||Y||_F, where X(t) = U(t)V(t)^T, and compare the evolution of the singular values of X(t). We consider Gaussian initializations, i.e., U_0 and V_0 have entries ∼ N(0, σ^2), where σ is varied to obtain different degrees of imbalance. In order for the two models to start from the same state, we choose X(0) = U_0V_0^T for the one-layer case. The results are shown in Fig. 1. From our theoretical analysis, we expect a different behaviour of the convergence rate depending on whether the initialization is balanced or imbalanced, i.e., whether ||Q||_F = ||Q_0||_F = ||U_0^TU_0 - V_0^TV_0||_F is small or large, respectively. When it is very small (Fig. 1a), the strength of the singular value dominates and we expect the components to be learned sequentially from the largest to the smallest, which agrees with the findings in (Saxe et al., 2019; Gidel et al., 2019). As we make the weights more imbalanced (Fig. 1b), the singular values are learned faster, even the smaller ones. Finally, as ||Q||_F becomes very large, the implicit acceleration becomes more prominent and the solution of the factorized problem converges significantly faster (Fig. 1c). In other words, these numerical results are consistent with Proposition 2 and Proposition 5.

Λ_{Q0} = λ_0 I is general enough. Since Proposition 5 covers the case where an exact solution is available, we want to investigate whether this case is general enough to capture the qualitative behaviour of the general problem, i.e., the general dynamics of (13).
To avoid confusion, let us refer to Ū^I and V̄^I as the variables of (19), with Λ^I_{Q0} ≡ λ_0 I_k; here I stands for "identity." The variables Ū and V̄ refer to the original dynamical system (13), with its conserved quantity Λ_Q completely fixed by the initial conditions; see (14). We want to show that it is possible to find an "optimal" λ_0 ∈ R such that the two dynamics are very close. Hence, we initialize U_0 and V_0 (and thus, equivalently, Ū_0 and V̄_0) with entries sampled from N(0, σ^2), and we fix σ = 1. The same initial condition is also used for (19), i.e., Ū^I_0 = Ū_0 and V̄^I_0 = V̄_0. We set η = 10^{-5}, Y_ij ∼ N(0, 1), m = 5, n = 10, and vary k. We then look for a λ_0 that minimizes the error ||X^I(t) - X(t)||_F. In Fig. 2 we illustrate that, indeed, this can be done. Note that in these three cases the evolutions of both dynamical systems are nearly indistinguishable.

Extension to nonlinear networks. Our analysis so far has shown that the acceleration phenomenon is a result of imbalance and is induced through a conservation law, whose definition is expected to change when introducing nonlinearities. In fact, both the architecture of the network and the objective function should affect the invariances. As such, conducting a full analysis of the symmetries and conservation laws for more complex nonlinear networks is necessary to characterize the implicit bias in such cases, which we leave for future work. Nonetheless, we provide numerical evidence in a more realistic setting than that of linear networks, where we only add a nonlinearity (sigmoid) to the final layer in order to preserve some of the structure of the linear model. We train the two networks (one layer vs. two layers) on synthetic data, i.e.,
we compare the dynamics of gradient descent for the objectives ℓ_1(W) = ½ ||Y - φ(XW)||_F^2 and ℓ_2(U, V) = ½ ||Y - φ(XUV^T)||_F^2, where φ is the sigmoid function, X ∈ R^{n×d} and Y ∈ R^{n×d} represent the training samples and labels, respectively (n = 10^3, d = 10), W ∈ R^{d×d} and U, V ∈ R^{d×k} are the weight matrices, and the width is k = 100. We generated the matrices W* and X with entries drawn from N(0, 1) and set Y = φ(XW*) + ε, where ε ∼ 10^{-3} N(0, I). Our results, shown in Figure 3, interestingly suggest that our conclusions about the role of imbalance hold in this case as well.
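For concreteness, the two-layer nonlinear objective ℓ_2 above can be trained with a plain gradient-descent loop as sketched below (NumPy; the step size, number of steps, and seed are our own illustrative choices, and the loop only shows the setup rather than reproducing Figure 3):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d, k = 1000, 10, 100
phi = lambda z: 1.0 / (1.0 + np.exp(-z))    # sigmoid

# synthetic data: Y = phi(X W*) + small Gaussian noise
X = rng.standard_normal((n_samples, d))
W_star = rng.standard_normal((d, d))
Y = phi(X @ W_star) + 1e-3 * rng.standard_normal((n_samples, d))

# two-layer model: ell_2(U, V) = 0.5 || Y - phi(X U V^T) ||_F^2
U = 0.1 * rng.standard_normal((d, k))
V = 0.1 * rng.standard_normal((d, k))
eta, steps = 1e-4, 500

def ell2(U, V):
    return 0.5 * np.sum((Y - phi(X @ U @ V.T)) ** 2)

loss_init = ell2(U, V)
for _ in range(steps):
    P = phi(X @ U @ V.T)
    G = (P - Y) * P * (1.0 - P)             # chain rule through the sigmoid
    U, V = U - eta * X.T @ G @ V, V - eta * G.T @ X @ U
loss_final = ell2(U, V)
```

The one-layer baseline ℓ_1 is obtained by replacing UV^T with a single matrix W and using the gradient X^T G.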

6. Conclusion

We considered the gradient flow dynamics of two-layer linear neural networks, providing a detailed analytical treatment. Our results establish a precise characterization of the "implicit acceleration" phenomenon in this case, without assuming balanced or vanishingly small initializations, which so far have been present in all prior work in this vein. More specifically, our analysis shows that the implicit acceleration of gradient flow is a consequence of a rotational symmetry induced by over-parameterization, which gives rise to several conservation laws that constrain the dynamics to follow specific trajectories. Moreover, the conserved quantities are completely fixed by the initialization, which has profound effects on the convergence rate. Although our analysis focuses on the simple case of linear networks, it reveals that a potential key to understanding implicit bias lies in the conservation laws that arise from the symmetries of the problem. These symmetries depend on the network architecture, the objective, and the optimization algorithm, and they constrain the dynamics to an invariant manifold that encapsulates the implicit regularization and acceleration effects. Understanding this in more complex models may thus reduce to finding dynamical invariants, for which our results provide a foundational starting point.

A Solution to the Matrix Riccati Differential Equation

Here we prove Proposition 1. First, we note that a general matrix Riccati differential equation takes the form

Ṗ(t) = AP(t) + P(t)A^T - P(t)RP(t) + Q,  P(0) = P_0,  (23)

where P(t) ∈ R^{n×n}, and A, R, Q, P_0 ∈ R^{n×n} are constant matrices. Associated to (23) one has the linear system

d/dt [X_1(t); X_2(t)] = [-A^T, R; Q, A] [X_1(t); X_2(t)],  [X_1(0); X_2(0)] = [I_n; P_0].  (24)

The closed-form solution of (23) follows as a consequence of Lemma 6 below.

Lemma 6 (Sasagawa, 1982). Consider the two initial value problems (23) and (24). We have:

• The initial value problem (23) has a solution on the interval [0, t_1] if and only if the matrix X_1(t) in the solution of the linear differential equation (24) is invertible for all t ∈ [0, t_1). Moreover, the solution to (23) is unique and given by

P(t) = X_2(t) X_1(t)^{-1}.  (25)

• Let P̄ be a solution to the algebraic Riccati equation (ARE)

AP̄ + P̄A^T - P̄RP̄ + Q = 0.  (26)

Then the solution of (24) is given by (27) below, where Ã = A - P̄R and Â = A^T - RP̄:

X_1(t) = e^{-tÂ} + ∫_0^t ds e^{-(t-s)Â} R e^{sÃ} (P_0 - P̄),
X_2(t) = P̄ e^{-tÂ} + P̄ ∫_0^t ds e^{-(t-s)Â} R e^{sÃ} (P_0 - P̄) + e^{tÃ}(P_0 - P̄).  (27)

Now, we apply Lemma 6 to the Riccati equation induced by a symmetric factorization, namely

Ẋ(t) = 2X(t)Y + 2YX(t) - 4X(t)^2.  (28)

The associated linear system is

d/dt [X_1(t); X_2(t)] = [-2Y, 4I; 0, 2Y] [X_1(t); X_2(t)],  [X_1(0); X_2(0)] = [I_n; X_0],  (29)

and the algebraic Riccati equation is given by

XY + YX - 2X^2 = 0.  (30)

This equation admits the trivial solution P̄ = 0. Thus, one can verify that, as long as the given matrix Y is invertible, we have

X_1(t) = e^{-2tY} + e^{-2tY} Y^{-1}(e^{4tY} - I) X_0,  X_2(t) = e^{2tY} X_0.  (31)

Therefore, the unique solution to (5) is explicitly given by

X(t) = X_2(t) X_1(t)^{-1} = e^{2tY} X_0 [I + Y^{-1}(e^{4tY} - I) X_0]^{-1} e^{2tY},  (32)

as long as the matrix in brackets is invertible. Note that Y^{-1}(e^{4tY} - I) is always positive semidefinite for t ≥ 0.
Thus, for all X_0 ⪰ 0, the matrix I + Y^{-1}(e^{4tY} - I) X_0 is invertible and the solution of the Riccati equation is well defined for all t ≥ 0.
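The reduction in Lemma 6 can be illustrated numerically: integrating the linear system (29) and forming X_2(t) X_1(t)^{-1} should reproduce a direct numerical integration of the Riccati equation (28). A sketch (NumPy, with a fixed-step Runge-Kutta integrator; sizes, spectrum, and seed are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 3
Qm, _ = np.linalg.qr(rng.standard_normal((m, m)))
Y = Qm @ np.diag([0.5, 1.0, 1.5]) @ Qm.T        # symmetric, invertible
U0 = 0.2 * rng.standard_normal((m, m))
X0 = U0 @ U0.T

def rk4(f, z, h, steps):
    # classical fourth-order Runge-Kutta with a fixed step
    for _ in range(steps):
        k1 = f(z); k2 = f(z + h/2 * k1); k3 = f(z + h/2 * k2); k4 = f(z + h * k3)
        z = z + h/6 * (k1 + 2*k2 + 2*k3 + k4)
    return z

T, steps = 0.8, 800
M = np.block([[-2 * Y, 4 * np.eye(m)],
              [np.zeros((m, m)), 2 * Y]])        # block matrix of (29)
Z = rk4(lambda Z: M @ Z, np.vstack([np.eye(m), X0]), T / steps, steps)
X_lin = Z[m:] @ np.linalg.inv(Z[:m])             # P(T) = X2 X1^{-1}, per Lemma 6

# direct integration of the Riccati equation (28)
X_rk = rk4(lambda X: 2 * X @ Y + 2 * Y @ X - 4 * X @ X, X0, T / steps, steps)
lemma_gap = np.max(np.abs(X_rk - X_lin))
```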

B Proof of Corollary 1

The first part follows by substituting into (6) and verifying the invertibility of the matrix in brackets. For the second part, note from (7) that if σ_i > 0, then σ_i(t) → σ_i as t → ∞ at a rate O(e^{-4tσ_i}), and if σ_i < 0, then σ_i(t) → 0 at a rate O(e^{4tσ_i}). Therefore, Σ(t) → max(Σ, 0) and X(t) → Φ max(Σ, 0) Φ^T at a rate O(e^{-4tσ_min(Y)}), as claimed.

C Rate of Convergence in the Symmetric Case

In this section, we show that the solution of the Riccati equation Ẋ(t) = 2YX(t) + 2X(t)Y - 4X(t)^2 converges exponentially to Ỹ at a rate O(e^{-4tσ_min(Y)}), where σ_min(Y) = min_i |σ_i| is the smallest eigenvalue of Y in magnitude. We already know that if X(t) converges to X*, then:

• X* is positive semidefinite if X_0 ⪰ 0, and rank(X*) ≤ rank(X_0);

• X* is a solution to the algebraic Riccati equation 2YX + 2XY - 4X^2 = 0.

Our proof of Proposition 2 is inspired by the proofs in (Molinari, 1977; Callier et al., 1992; 1994), which studied the solution of the Riccati equation in a more general setting. The strategy is to show that the algebraic Riccati equation has a unique PSD solution X_+ such that the eigenvalues of Ã = Â = 2Y - 4X_+ have negative real parts. Such a solution is usually referred to as the strong or stabilizing solution in the optimal control literature, because it is the only solution of the algebraic Riccati equation such that the matrix Ã is exponentially stable, i.e., exp(tÃ) converges to 0. The stability of the matrix Ã is important because it appears in the solution of the Riccati equation, as can be observed in (27).

We start by proving that Ỹ is a solution to the algebraic Riccati equation, i.e., that it is a critical point of the problem. Due to the symmetric and positive semidefinite nature of the matrices X and Y, in our case the algebraic equation (30) can be reduced to X(X - Y) = 0. For X = Ỹ, we thus have

Ỹ(Ỹ - Y) = [Σ_{i=1}^m max{σ_i, 0} φ_i φ_i^T][Σ_{i=1}^m (max{σ_i, 0} - σ_i) φ_i φ_i^T] = -[Σ_{i=1}^m max{σ_i, 0} φ_i φ_i^T][Σ_{i=1}^m min{σ_i, 0} φ_i φ_i^T].

The first sum contains only the positive eigenvalues, while the second sum contains only the negative ones. Therefore, no eigenvalue appears in both sums, and using the orthogonality of the vectors φ_i we conclude that Ỹ(Ỹ - Y) = 0.
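This first step can be confirmed numerically: the PSD projection Ỹ of a random symmetric (in general indefinite) Y satisfies the algebraic Riccati equation. A sketch (NumPy; size and seed are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
m = 4
A = rng.standard_normal((m, m))
Y = 0.5 * (A + A.T)                                # symmetric, generally indefinite
w, Phi = np.linalg.eigh(Y)
Ytil = Phi @ np.diag(np.maximum(w, 0.0)) @ Phi.T   # projection onto the PSD cone

# Y-tilde solves the ARE  2YX + 2XY - 4X^2 = 0
are_residual = np.max(np.abs(2 * Y @ Ytil + 2 * Ytil @ Y - 4 * Ytil @ Ytil))
```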
Next, we prove that P = Ỹ is the unique symmetric positive semidefinite solution of the algebraic Riccati equation such that the eigenvalues of Ã = A − RP have negative real parts. Note that Ã = 2Y − 4Ỹ = −2Ŷ, where Ŷ = Σ_{i=1}^m |σ_i| φ_i φ_i^T ≻ 0. We proceed by contradiction. Let X_2 ≠ Ỹ be a PSD solution of the (ARE) such that the eigenvalues of Ã_2 = 2Y − 4X_2 have negative real parts, and consider ∆ = Ỹ − X_2. A straightforward calculation shows that ∆ is a solution of

Ã∆ + ∆Ã + 4∆^2 = 0. (35)

Since ∆ is not necessarily invertible, we choose a basis in which

∆ = Z [0 0; 0 D] Z^T = Z ∆̄ Z^T,

where D is invertible. (Our proof still holds, and is more direct, if ∆ itself is invertible.) In the same basis we write

Ã = Z [W_1 W_2; W_3 W_4] Z^T = Z Ā Z^T.

Substituting these decompositions into (35), we can deduce the following for the block matrices: W_2 = 0, W_3 = 0, and DW_4 + W_4D + 4D^2 = 0. Note that the last equation is similar to equation (35) with the invertible block D; moreover, in the new basis Ā is block diagonal. Using the change of variable T = D^{-1}, we obtain the Lyapunov equation

TW_4 + W_4T + 4I = 0. (39)

Since Ã = −2Ŷ ≺ 0 is invertible, the block W_4 is also invertible and its eigenvalues are a subset of the eigenvalues of −2Ŷ. As a result, the Lyapunov equation has the unique solution T = −2W_4^{-1} ≻ 0, and therefore D = −(1/2)W_4 ≻ 0. Using the solution of (35), we can obtain a new expression for Ã_2 as follows:

Ã_2 = 2Y − 4X_2 = 2Y − 4(Ỹ − ∆) = Ã + 4∆ = Z [W_1 0; 0 W_4 + 4D] Z^T = Z [W_1 0; 0 −W_4] Z^T, (40)

with W_1 ≺ 0 and −W_4 ≻ 0. This contradicts the initial assumption that the eigenvalues of Ã_2 have negative real parts.

Now we can use Lemma 6 with P = Ỹ to obtain a new expression for the solution. We have Ã = Â = −2Ŷ and

X_1(t) = e^{2tŶ} [I + Ŷ^{-1}(I − e^{−4tŶ})(X_0 − Ỹ)],   X_2(t) = Ỹ X_1(t) + e^{−2tŶ}(X_0 − Ỹ).

Therefore, the solution of the Riccati equation is also given by

X(t) = X_2(t) X_1(t)^{-1} = Ỹ + e^{−2tŶ}(X_0 − Ỹ) [I + Ŷ^{-1}(I − e^{−4tŶ})(X_0 − Ỹ)]^{-1} e^{−2tŶ}. (41)

Note that the inverse exists for t ≥ 0 when X_0 ⪰ 0, because we have proven the existence of the solution using the previous expression (32) (see Lemma 2 in (Sasagawa, 1982) for a detailed proof). We introduce the function

H(t) = (X_0 − Ỹ) [I + Ŷ^{-1}(I − e^{−4tŶ})(X_0 − Ỹ)]^{-1},

so that

X(t) − Ỹ = e^{−2tŶ} H(t) e^{−2tŶ}. (42)

The function H has the following properties for t ≥ 0:

• It is decreasing: dH(t)/dt = −4H(t) e^{−4tŶ} H(t) ⪯ 0. (43)
• If I + Ŷ^{-1}(X_0 − Ỹ) is invertible, then lim_{t→∞} H(t) = (X_0 − Ỹ)[I + Ŷ^{-1}(X_0 − Ỹ)]^{-1} ≡ H̄.
• H is bounded on ℝ_+: H̄ ⪯ H(t) ⪯ H(0) = X_0 − Ỹ.

Therefore, we can conclude that there exists a constant C > 0 such that

‖X(t) − Ỹ‖_F ≤ C e^{−4σ_min t}, (46)

where σ_min = min_i |σ_i| is the smallest eigenvalue of Y in absolute value.

A quantity Q(x(t)) is said to be conserved by the flow ẋ(t) = f(x(t)) if it remains constant through the dynamical evolution, i.e., (d/dt) Q(x(t)) = 0. For example, in mechanics the sum of potential and kinetic energies remains constant for a conservative system. A conservation law is usually a consequence of an underlying symmetry (Noether's theorem). In optimization, this can be seen as a constraint Q(x) = Q_0 that is automatically satisfied without having to enforce it explicitly. Gradient descent is simply an explicit Euler discretization of (1). In a linear neural network, Y plays the role of the input-output data correlation matrix and X plays the role of the model's input-output map. In this trivial model, the input correlation matrix is assumed to be the identity, as is the case when the data is whitened.

Figure 1: Top row: reconstruction error for one- vs. two-layer linear models. Bottom row: evolution of the singular values of the solution. From left to right we use σ = 10^{-2}, σ = 10^{-1}, and σ = 1, respectively.

Figure 3: Evolution of the training loss for nonlinear one-layer and two-layer models. Top row: ‖Q_0‖_2 = 0. Bottom row: ‖Q_0‖_2 = 4.6. Initial weights are drawn from a normal distribution N(0, 10^{-1}).
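The closed-form expression (41) for the solution of the Riccati flow can also be checked against a direct numerical integration. Below is a minimal sketch (the test matrices, step sizes and helper names are ours) that compares (41) with a forward-Euler integration of Ẋ = 2YX + 2XY − 4X²:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 4

# Y = Phi Sigma Phi^T with nonzero eigenvalues of both signs.
Phi, _ = np.linalg.qr(rng.standard_normal((m, m)))
sigma = np.array([1.2, 0.7, -0.4, -1.5])
Y = Phi @ np.diag(sigma) @ Phi.T
Y_tilde = Phi @ np.diag(np.maximum(sigma, 0.0)) @ Phi.T  # PSD projection of Y
Y_hat = Phi @ np.diag(np.abs(sigma)) @ Phi.T             # hat(Y) = sum_i |sigma_i| phi_i phi_i^T

def mexp(c, t):
    # exp(c * t * Y_hat), computed in the eigenbasis of Y_hat.
    return Phi @ np.diag(np.exp(c * t * np.abs(sigma))) @ Phi.T

B = 0.2 * rng.standard_normal((m, m))
X0 = B @ B.T  # generic PSD initialization

def X_closed(t):
    """Closed-form solution (41) of the Riccati flow."""
    E = mexp(-2.0, t)
    M = np.eye(m) + np.linalg.inv(Y_hat) @ (np.eye(m) - mexp(-4.0, t)) @ (X0 - Y_tilde)
    return Y_tilde + E @ (X0 - Y_tilde) @ np.linalg.inv(M) @ E

# Forward Euler integration of dX/dt = 2YX + 2XY - 4X^2.
X, dt, T = X0.copy(), 1e-4, 3.0
for _ in range(int(T / dt)):
    X = X + dt * (2 * Y @ X + 2 * X @ Y - 4 * X @ X)

err = np.linalg.norm(X - X_closed(T))
print(err)  # small: Euler discretization error only
```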




D Closed Form Solution for the Asymmetric Case under Spectral Initialization

Under the spectral initialization, Ū_0 and V̄_0 are diagonal, and hence so are U(0) and V(0). As a result, U(t) and V(t) remain diagonal for all t ≥ 0, because the components of (13) can be decoupled and the evolution of (13) induces no change in the off-diagonal elements. To see this, observe that the diagonal entries obey

(d/dt) ū_ii = v̄_ii (σ_i − ū_ii v̄_ii),   (d/dt) v̄_ii = ū_ii (σ_i − ū_ii v̄_ii), (47)

whereas the off-diagonal terms obey (d/dt) ū_ij = 0 and (d/dt) v̄_ij = 0 (i ≠ j). Thus (47) describes the evolution of the singular values of the solution. This decouples the problem into a set of independent one-dimensional equations, so it suffices to consider the scalar system

(d/dt) ū = v̄ (σ − ū v̄),   (d/dt) v̄ = ū (σ − ū v̄), (48)

where we drop the index i = 1, . . . , m for simplicity. In (Saxe et al., 2019) there is a strong assumption, namely ū_ii(0) = v̄_ii(0) for all i, which is a balanced initialization (Definition 2). Here we solve (47) without such an assumption. From (48), it is immediate that the conservation law (11) becomes (d/dt)(ū² − v̄²) = 0. Trajectories (ū(t), v̄(t)) are thus constrained to lie on hyperbolas defined by

ū²(t) − v̄²(t) = ū²(0) − v̄²(0) ≡ λ_0. (49)

We are mostly interested in the behavior of the product x(t) ≡ ū(t)v̄(t). Making explicit use of the conservation law (49), specifically λ_0² = ū⁴ − 2ū²v̄² + v̄⁴ = ū⁴ + v̄⁴ − 2x², so that (ū² + v̄²)² = λ_0² + 4x², we obtain the following first-order differential equation for x ≡ x(t):

ẋ = (ū² + v̄²)(σ − x) = √(λ_0² + 4x²) (σ − x).

Even though this is a nonlinear differential equation, it is separable; integrating both sides yields precisely (15) (after restoring the index i, so that x → σ_i is the component associated with the singular value σ_i and the conserved quantity λ_{0,i}), where C > 0 is a constant. Above, only m out of the k ≥ m conserved quantities are used. Hence, there is degeneracy in the solution, and only m effective degrees of freedom regardless of how large k is. Note that if k < m (the under-parameterized case), then (47) becomes under-determined.
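The scalar picture is easy to reproduce numerically. The sketch below assumes the decoupled dynamics u̇ = v(σ − uv), v̇ = u(σ − uv) (as reconstructed from the conservation law above; initial values are ours), and checks that u² − v² stays constant while the product uv converges to σ even from an unbalanced start:

```python
import numpy as np

# Scalar two-layer dynamics: u' = v (sigma - u v), v' = u (sigma - u v).
sigma = 1.0
u, v = 0.9, 0.3          # unbalanced initialization: u(0) != v(0)
lam0 = u**2 - v**2       # conserved quantity lambda_0

# Forward Euler integration of the scalar flow.
dt, T = 1e-4, 20.0
for _ in range(int(T / dt)):
    r = sigma - u * v
    u, v = u + dt * v * r, v + dt * u * r

print(u * v)              # approaches sigma
print(u**2 - v**2 - lam0) # stays ~0: u^2 - v^2 is conserved along the flow
```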

E Rate of Convergence in the Asymmetric Case

Here we prove Proposition 5. First, note that P ≡ RR^T satisfies a matrix Riccati differential equation with initial condition P(0) = R_0R_0^T. By the arguments used in Appendix A, the exact solution of this equation can be obtained in closed form. One can show that RR^T converges exponentially to the matrix R̄R̄^T, defined as the projection of 2S̄ onto the positive semidefinite cone; this is derived by the same arguments leading to Proposition 2 (see Appendix C). The convergence rate depends on the eigenvalues of 2S̄, which can be determined using Schur's complement formula: for a block matrix M = [A B; C D] with A invertible, det M = det(A) det(D − CA^{-1}B). Applying this formula to S̄ − λI, we find that the eigenvalues of S̄ are:

1. the 2m eigenvalues ±s_i, where s_i = (1/2)√(4σ_i² + λ_0²) (i = 1, . . . , m);
2. the eigenvalue −λ_0/2, of multiplicity at least n − m.

In general, the smallest-magnitude eigenvalue will be |λ_0|/2. However, if Y is a square matrix, then only the eigenvalues ±s_i above will be present.
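The claimed spectrum can be verified numerically. The sketch below assumes the block form S̄ = [−(λ_0/2)I_n, Y; Y^T, (λ_0/2)I_m]; this particular block structure is our reconstruction (it is consistent with the eigenvalues stated above, but the exact definition of S̄ is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 6, 3
lam0 = 0.8

# Y (n x m) with known singular values sigma_i.
sigma = np.array([2.0, 1.0, 0.4])
Phi, _ = np.linalg.qr(rng.standard_normal((n, n)))
Psi, _ = np.linalg.qr(rng.standard_normal((m, m)))
Y = Phi[:, :m] @ np.diag(sigma) @ Psi.T

# Hypothetical block matrix consistent with the stated spectrum.
S = np.block([[-lam0 / 2 * np.eye(n), Y],
              [Y.T, lam0 / 2 * np.eye(m)]])
eigs = np.sort(np.linalg.eigvalsh(S))

# Predicted spectrum: +/- s_i with s_i = (1/2) sqrt(4 sigma_i^2 + lam0^2),
# plus -lam0/2 with multiplicity n - m.
s = 0.5 * np.sqrt(4 * sigma**2 + lam0**2)
expected = np.sort(np.concatenate([s, -s, np.full(n - m, -lam0 / 2)]))

print(np.max(np.abs(eigs - expected)))  # ~0
```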

F From Implicit Acceleration to Explicit Regularization

We now prove Proposition 4. Replacing (17) into (18) immediately yields (19). It is also straightforward to verify that applying the gradient flow, (d/dt) Ū = −∇_Ū ℒ and (d/dt) V̄ = −∇_V̄ ℒ, with ℒ the objective function in (20), yields (19). Next, consider the formal Taylor series (56) of Q̄(t) around t_0. Since the dynamics are "linear" in (Q̄ − Λ_{Q_0}), the higher-order derivatives take the form (58), where the functions Z_i and W_i contain sums of powers and time derivatives of P(t); the second-order derivative, for instance, yields an expression of this form. Therefore, if Q̄(t_0) = Λ_{Q_0} at t = t_0, then all the derivatives (58) vanish identically. As a consequence, the expansion (56) implies Q̄(t) = Q̄(t_0) = Λ_{Q_0} for any other t ≥ t_0 as well.

G On Matching Rates between Gradient Flow and Gradient Descent

We provide an explicit example to illustrate why studying the continuous-time dynamics of the gradient flow is expected to reproduce the behaviour of its discretization, i.e., gradient descent. For the sake of simplicity, let us limit the discussion to the case considered in Section 2 and to the scalar case, Y = σ ∈ ℝ. What we would really like to do is compare two different algorithms, namely gradient descent (GD) applied to problem (2),

X_{k+1} = X_k − η(X_k − σ), (59)

versus GD applied to the factorized problem (3),

U_{k+1} = U_k + 2ηU_k(σ − U_k²). (60)

The respective continuous-time limits of these algorithms are

Ẋ = −(X − σ) and U̇ = 2U(σ − U²), (61)

for X = X(t) and U = U(t). The solution of the first equation in (61) is given by X(t) − σ = (X_0 − σ)e^{−t}, yielding a rate O(e^{−t}). Let us introduce the perturbed variable X̃_k ≡ X_k − σ. Hence (59) readily gives X̃_{k+1} = (1 − η)X̃_k, and thus a matching rate of O(e^{−ηk}). (This example is trivial because both systems are linear.)

Now let us move on to the more interesting nonlinear case. Consider the second equation in (61) and let X(t) ≡ U²(t). Thus Ẋ = 4σX − 4X², whose solution is

X(t) = σX_0 / (X_0 + (σ − X_0)e^{−4σt}),

so that X(t) → σ at a rate O(e^{−4σt}). Now we do the following trick. Let X̃_k ≡ X_k − σ be a perturbation from equilibrium. For a small enough η we can neglect terms of O(η²); hence (60) gives X̃_{k+1} ≈ X̃_k − 4ηX̃_k(σ + X̃_k) ≤ (1 − 4ησ)X̃_k. This implies that X̃_k → 0, or equivalently X_k → σ, at a discrete-time rate of e^{−4σηk}, which matches the continuous-time rate.

It is not hard to see how such nonlinear recurrence relations quickly become intractable for more complicated problems. On the other hand, even though the continuous-time limit provided by the gradient flow consists of a nonlinear ODE, the analysis is much more feasible, besides introducing connections with interesting areas of mathematics such as Riccati differential equations.
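The matching rates are straightforward to confirm numerically. A minimal sketch (the step size and initial values are ours) running both discrete iterations side by side:

```python
import numpy as np

sigma, eta, X0 = 1.0, 1e-3, 2.0
K = 5000  # number of GD steps, so the elapsed flow time is t = eta * K = 5

# GD on the one-layer problem: X_{k+1} = X_k - eta (X_k - sigma).
X = X0
for _ in range(K):
    X = X - eta * (X - sigma)

# GD on the factorized problem: U_{k+1} = U_k + 2 eta U_k (sigma - U_k^2),
# started so that U_0^2 = X_0.
U = np.sqrt(X0)
for _ in range(K):
    U = U + 2 * eta * U * (sigma - U**2)

t = eta * K
print(abs(X - sigma), abs(X0 - sigma) * np.exp(-t))  # GD error vs flow rate e^{-t}
print(abs(U**2 - sigma))                             # decays like e^{-4 sigma t}: much smaller
```

With σ = 1 the factorized iteration contracts roughly four times faster per unit of flow time, which is the implicit acceleration discussed in the main text.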

