A UNIFYING VIEW ON IMPLICIT BIAS IN TRAINING LINEAR NEURAL NETWORKS

Abstract

We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases, and investigate the linear version of the formulation called linear tensor networks. With this formulation, we can characterize the convergence direction of the network parameters as singular vectors of a tensor defined by the network. For $L$-layer linear tensor networks that are orthogonally decomposable, we show that gradient flow on separable classification finds a stationary point of the $\ell_{2/L}$ max-margin problem in a "transformed" input space defined by the network. For underdetermined regression, we prove that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted $\ell_1$ and $\ell_2$ norms in the transformed input space. Our theorems subsume existing results in the literature while removing standard convergence assumptions. We also provide experiments that corroborate our analysis.

1. INTRODUCTION

Overparametrized neural networks have infinitely many solutions that achieve zero training error, and such global minima have different generalization performance. Moreover, training a neural network is a high-dimensional nonconvex problem, which is typically intractable to solve. However, the success of deep learning indicates that first-order methods such as gradient descent or stochastic gradient descent (GD/SGD) not only (a) succeed in finding global minima, but also (b) are biased towards solutions that generalize well; why this happens has largely remained a mystery in the literature.

To explain part (a) of the phenomenon, there is a growing literature studying the convergence of GD/SGD on overparametrized neural networks (e.g., Du et al. (2018a;b); Allen-Zhu et al. (2018); Zou et al. (2018); Jacot et al. (2018); Oymak & Soltanolkotabi (2020), and many more). There are also convergence results that focus on linear networks, without nonlinear activations (Bartlett et al., 2018; Arora et al., 2019a; Wu et al., 2019; Du & Hu, 2019; Hu et al., 2020). These results typically focus on the convergence of the loss, hence do not address which of the many global minima is reached.

Another line of results tackles part (b), by studying the implicit bias or regularization of gradient-based methods on neural networks or related problems (Gunasekar et al., 2017; 2018a;b; Arora et al., 2018; Soudry et al., 2018; Ji & Telgarsky, 2019a; Arora et al., 2019b; Woodworth et al., 2020; Chizat & Bach, 2020; Gissin et al., 2020). These results have shown interesting progress: even without explicit regularization terms in the training objective, algorithms such as GD applied to neural networks have an implicit bias towards certain solutions among the many global minima. However, many of these results rely on convergence assumptions such as global convergence of the loss to zero and/or directional convergence of parameters and gradients. Ideally, such convergence assumptions should be removed, because they cannot be tested a priori and there are known examples where GD does not converge to global minima under certain initializations (Bartlett et al., 2018; Arora et al., 2019a).

Figure 1: Networks are initialized at the same coefficients (circles on purple lines), but follow different trajectories due to implicit biases of networks induced from their architecture. The figures show that our theoretical predictions on limit points (circles on yellow line, the set of global minima) agree with the solution found by GD. For details of the experimental setup, see Section 6.

1.1. SUMMARY OF OUR CONTRIBUTIONS

We study the implicit bias of gradient flow (GD with infinitesimal step size) on linear neural networks. Following recent progress on this topic, we consider classification and regression problems that have multiple solutions with zero training error. Our analyses apply to a general class of networks, and prove both convergence and implicit bias, providing a more complete characterization of the algorithm trajectory without relying on convergence assumptions.

• We propose a general tensor formulation of nonlinear neural networks which includes many network architectures considered in the literature. In this paper, we focus on the linear version of this formulation (i.e., no nonlinear activations), called linear tensor networks.
• For linearly separable classification, we prove that linear tensor network parameters converge in direction to singular vectors of a tensor defined by the network. As a corollary, we show that linear fully-connected networks converge to the $\ell_2$ max-margin solution (Ji & Telgarsky, 2020).
• For separable classification, we further show that if the linear tensor network is orthogonally decomposable (Assumption 1), gradient flow finds the $\ell_{2/\text{depth}}$ max-margin solution in the singular value space, leading the parameters to converge to the top singular vectors of the tensor when depth $= 2$. This theorem subsumes known results on linear convolutional networks and diagonal networks proved in Gunasekar et al. (2018b), without using convergence assumptions.
• For underdetermined linear regression, we study the limit points of gradient flow on orthogonally decomposable networks (Assumption 1), and provide a full characterization of the limit points. This theorem covers results on deep matrix sensing (Arora et al., 2019b) as a special case, and extends a similar recent result (Woodworth et al., 2020) to a broader class of networks.
• For underdetermined linear regression with deep linear fully-connected networks, we prove that the network converges to the minimum $\ell_2$ norm solution as we scale the initialization to zero.
• Lastly, we present simple experiments that corroborate our theoretical analysis. Figure 1 shows that our predictions of limit points match the solutions found by GD.

2. PROBLEM SETTINGS AND RELATED WORKS

We first define notation used in the paper. Given a positive integer $a$, let $[a] := \{1, \dots, a\}$. We use $I_d$ to denote the $d \times d$ identity matrix. Given a matrix $A$, we use $\mathrm{vec}(A)$ to denote its vectorization, i.e., the concatenation of all columns of $A$. For two vectors $a$ and $b$, let $a \otimes b$ be their tensor product, $a \odot b$ be their element-wise product, and $a^{\odot k}$ be the element-wise $k$-th power of $a$. Given an order-$L$ tensor $\mathbf{A} \in \mathbb{R}^{k_1 \times \cdots \times k_L}$, we use $[\mathbf{A}]_{j_1, \dots, j_L}$ to denote the $(j_1, j_2, \dots, j_L)$-th element of $\mathbf{A}$, where $j_l \in [k_l]$ for all $l \in [L]$. In element indexing, we use $\bullet$ to denote all indices in the corresponding dimension, and $a:b$ to denote all indices from $a$ to $b$. For example, for a matrix $A$, $[A]_{\bullet, 4:6}$ denotes the submatrix consisting of the 4th-6th columns of $A$. The square bracket notation for indexing overloads with $[a]$ for $a \in \mathbb{N}$, but the two will be distinguishable from the context. Since element indices start from 1, we re-define the modulo operation $a \bmod d := a - \lfloor \frac{a-1}{d} \rfloor d \in [d]$ for $a > 0$. We use $e^k_j$ to denote the $j$-th standard basis vector of the vector space $\mathbb{R}^k$. Lastly, we define the multilinear multiplication between a tensor and linear maps, which can be viewed as a generalization of left- and right-multiplication on a matrix. Given a tensor $\mathbf{A} \in \mathbb{R}^{k_1 \times \cdots \times k_L}$ and linear maps $B_l \in \mathbb{R}^{p_l \times k_l}$ for $l \in [L]$, we define the multilinear multiplication $\bullet$ between them as

$$\mathbf{A} \bullet (B_1^T, B_2^T, \dots, B_L^T) = \Big( \textstyle\sum_{j_1, \dots, j_L} [\mathbf{A}]_{j_1, \dots, j_L} \big( e^{k_1}_{j_1} \otimes \cdots \otimes e^{k_L}_{j_L} \big) \Big) \bullet (B_1^T, \dots, B_L^T) := \textstyle\sum_{j_1, \dots, j_L} [\mathbf{A}]_{j_1, \dots, j_L} \big( B_1 e^{k_1}_{j_1} \otimes \cdots \otimes B_L e^{k_L}_{j_L} \big) \in \mathbb{R}^{p_1 \times \cdots \times p_L}.$$
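As a concrete reference, here is a minimal numpy sketch of the multilinear multiplication $\bullet$ defined above (the helper name `multilinear` and the test dimensions are our own); for $L = 2$ it reduces to $B_1 A B_2^T$, which the snippet uses as a sanity check.

```python
import numpy as np

def multilinear(A, Bs):
    """Multilinear multiplication A . (B_1^T, ..., B_L^T).

    A  : order-L tensor of shape (k_1, ..., k_L)
    Bs : list of L matrices, B_l of shape (p_l, k_l)
    Returns a tensor of shape (p_1, ..., p_L), obtained by
    multiplying B_l along the l-th mode of A.
    """
    out = A
    for l, B in enumerate(Bs):
        # contract the columns of B_l with the l-th mode of `out`
        out = np.tensordot(B, out, axes=(1, l))
        # tensordot puts the new mode first; move it back to position l
        out = np.moveaxis(out, 0, l)
    return out

# sanity check against the matrix case (L = 2): A . (B1^T, B2^T) = B1 A B2^T
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5))
B1, B2 = rng.normal(size=(3, 4)), rng.normal(size=(2, 5))
assert np.allclose(multilinear(A, [B1, B2]), B1 @ A @ B2.T)
```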

2.1. PROBLEM SETTINGS

We are given a dataset $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. We let $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$ be the data matrix and the label vector, respectively. We study binary classification and linear regression in this paper, focusing on the settings where there exist many global solutions. For binary classification, we assume $y_i \in \{\pm 1\}$ and that the data is separable: there exist a unit vector $z$ and a constant $\gamma > 0$ such that $y_i x_i^T z \ge \gamma$ for all $i \in [n]$. For regression, we consider the underdetermined case ($n \le d$) where there are many parameters $z \in \mathbb{R}^d$ such that $Xz = y$. Throughout the paper, we assume that $X$ has full row rank. We use $f(\cdot\,; \Theta): \mathbb{R}^d \to \mathbb{R}$ to denote a neural network parametrized by $\Theta$. Given the network and the dataset, we consider minimizing the training loss $\mathcal{L}(\Theta) := \sum_{i=1}^n \ell(f(x_i; \Theta), y_i)$ over $\Theta$. Following previous results (e.g., Lyu & Li (2020); Ji & Telgarsky (2020)), we use the exponential loss $\ell(\hat y, y) = \exp(-\hat y y)$ for classification problems. For regression, we use the squared error loss $\ell(\hat y, y) = \frac{1}{2}(\hat y - y)^2$. On the algorithm side, we minimize $\mathcal{L}$ using gradient flow, which can be viewed as GD with infinitesimal step size. The gradient flow dynamics is defined as $\frac{d}{dt}\Theta = -\nabla_\Theta \mathcal{L}(\Theta)$.
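To make the dynamics concrete, the following sketch runs GD with a small step size as a proxy for gradient flow, on the exponential loss for the plain linear model $f(x; z) = x^T z$ (the two separable data points are made up for illustration; as discussed in Section 2.2, $z$ diverges while its direction approaches the $\ell_2$ max-margin classifier).

```python
import numpy as np

# GD with a small step size as a proxy for gradient flow, on the
# exponential loss L(z) = sum_i exp(-y_i x_i^T z) of a linear model.
X = np.array([[1.0, 2.0], [0.0, -3.0]])   # toy separable data (made up)
y = np.array([1.0, -1.0])
z = np.zeros(2)
eta = 1e-3
for _ in range(200_000):
    r = -y * np.exp(-y * (X @ z))   # residual vector, [r]_i = l'(f(x_i), y_i)
    z -= eta * (X.T @ r)            # dz/dt = -grad L(z)
print(z / np.linalg.norm(z))        # direction slowly approaches the max-margin classifier
```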

2.2. RELATED WORKS

Gradient flow/descent in separable classification. For linear models $f(x; z) = x^T z$ with separable data, Soudry et al. (2018) show that GD run on $\mathcal{L}$ drives $\|z\|$ to $\infty$, but $z$ converges in direction to the $\ell_2$ max-margin classifier. The limit direction of $z$ is aligned with the solution of

$$\text{minimize}_{z \in \mathbb{R}^d} \ \|z\| \quad \text{subject to} \quad y_i x_i^T z \ge 1 \text{ for } i \in [n], \tag{1}$$

where the norm $\|\cdot\|$ in the cost is the $\ell_2$ norm. Nacson et al. (2019b;c); Gunasekar et al. (2018a); Ji & Telgarsky (2019b;c) extend these results to other (stochastic) algorithms and non-separable settings. Gunasekar et al. (2018b) study the same problem on linear neural networks and show that GD exhibits different implicit biases depending on the architecture. The authors show that the linear coefficients of the network converge in direction to the solution of (1) with different norms: the $\ell_2$ norm for linear fully-connected networks, the $\ell_{2/L}$ (quasi-)norm for diagonal networks, and the DFT-domain $\ell_{2/L}$ (quasi-)norm for convolutional networks with full-length filters. Here, $L$ denotes the depth. We note that Gunasekar et al. (2018b) assume that GD globally minimizes the loss, and that the network parameters and the gradient with respect to the linear coefficients converge in direction. Subsequent results (Ji & Telgarsky, 2019a; 2020) remove such assumptions for linear fully-connected networks. A recent line of results (Nacson et al., 2019a; Lyu & Li, 2020; Ji & Telgarsky, 2020) studies general homogeneous models and shows divergence of parameters to infinity, monotone increase of the smoothed margin, and directional convergence and alignment of parameters (see Section 4 for details). Lyu & Li (2020) also characterize the limit direction of parameters as a KKT point of a nonconvex max-margin problem similar to (1), but this characterization does not provide useful insights for the functions $f(\cdot\,;\Theta)$ represented by specific architectures, because the formulation is in the parameter space $\Theta$. Also, these results require that gradient flow/descent has already reached 100% training accuracy. Although we study a more restrictive set of networks (i.e., deep linear), we provide a more complete characterization of the implicit bias for the functions $f(\cdot\,;\Theta)$, without assuming 100% training accuracy.

Gradient flow/descent in linear regression. It is known that for linear models $f(x; z) = x^T z$, GD converges to the global minimum that is closest in $\ell_2$ distance to the initialization (see e.g., Gunasekar et al. (2018a)). However, relatively less is known for deep networks, even for linear networks. This is partly because the parameters do not diverge to infinity, which makes the limit points highly dependent on the initialization; this dependency renders analysis difficult. A related problem of matrix sensing aims to minimize $\sum_{i=1}^n (y_i - \langle A_i, W_1 \cdots W_L \rangle)^2$ over $W_1, \dots, W_L \in \mathbb{R}^{d \times d}$. It is shown in Gunasekar et al. (2017); Arora et al. (2019b) that if the sensor matrices $A_i$ commute and we initialize all $W_l$'s to $\alpha I$, GD finds the minimum nuclear norm solution as $\alpha \to 0$. Chizat et al. (2019) show that if a network is zero at initialization, and we scale the network output by a factor of $\alpha \to \infty$, then the GD dynamics enters a "lazy regime" where the network behaves like a first-order approximation at its initialization, as also seen in results studying kernel approximations of neural networks and convergence of GD in the corresponding RKHS (e.g., Jacot et al. (2018)). Woodworth et al. (2020) study linear regression with a diagonal network of the form $f(x; w_+, w_-) = x^T (w_+^{\odot L} - w_-^{\odot L})$, where $w_+$ and $w_-$ are identically initialized: $w_+(0) = w_-(0) = \alpha \bar w$. The authors show that the global minimum reached by GD minimizes a norm-like function which interpolates between the (weighted) $\ell_1$ norm ($\alpha \to 0$) and the $\ell_2$ norm ($\alpha \to \infty$). In our paper, we consider a more general class of orthogonally decomposable networks, and obtain similar results interpolating between weighted $\ell_1$ and $\ell_2$ norms. We also remark that our results include the results in Arora et al. (2019b) as a special case, and we do not assume convergence to global minima, as done in Gunasekar et al. (2017); Arora et al. (2019b); Woodworth et al. (2020).

3. TENSOR FORMULATION OF NEURAL NETWORKS

In this section, we present a general tensor formulation of neural networks. Given an input $x \in \mathbb{R}^d$, the network uses a linear map $\mathbf{M}$ that maps $x$ to an order-$L$ tensor $\mathbf{M}(x) \in \mathbb{R}^{k_1 \times \cdots \times k_L}$, where $L \ge 2$. Using parameters $v_l \in \mathbb{R}^{k_l}$ and activation $\phi$, the network computes its layers as follows:

$$H_1(x) = \phi\big(\mathbf{M}(x) \bullet (v_1, I_{k_2}, \dots, I_{k_L})\big) \in \mathbb{R}^{k_2 \times \cdots \times k_L},$$
$$H_l(x) = \phi\big(H_{l-1}(x) \bullet (v_l, I_{k_{l+1}}, \dots, I_{k_L})\big) \in \mathbb{R}^{k_{l+1} \times \cdots \times k_L}, \quad \text{for } l = 2, \dots, L-1,$$
$$f(x; \Theta) = H_{L-1}(x) \bullet v_L \in \mathbb{R}. \tag{2}$$

We use $\Theta$ to denote the collection of all parameters $(v_1, \dots, v_L)$. We call $\mathbf{M}(x)$ the data tensor. Although this new formulation may look a bit odd at first glance, it is general enough to capture many network architectures considered in the literature, including fully-connected networks, diagonal networks, and circular convolutional networks. We formally define these architectures below.

Diagonal networks. An $L$-layer diagonal network is written as

$$f_{\mathrm{diag}}(x; \Theta_{\mathrm{diag}}) = \phi(\cdots \phi(\phi(x \odot w_1) \odot w_2) \cdots \odot w_{L-1})^T w_L, \tag{3}$$

where $w_l \in \mathbb{R}^d$ for $l \in [L]$. The representation of $f_{\mathrm{diag}}$ in the tensor form (2) is straightforward. Let $\mathbf{M}_{\mathrm{diag}}(x) \in \mathbb{R}^{d \times \cdots \times d}$ have $[\mathbf{M}_{\mathrm{diag}}(x)]_{j,j,\dots,j} = [x]_j$, while all the remaining entries of $\mathbf{M}_{\mathrm{diag}}(x)$ are set to zero. We can set $v_l = w_l$ for all $l$, and $\mathbf{M} = \mathbf{M}_{\mathrm{diag}}$, to verify that (2) and (3) are equivalent.

Circular convolutional networks. The tensor formulation (2) includes convolutional networks

$$f_{\mathrm{conv}}(x; \Theta_{\mathrm{conv}}) = \phi(\cdots \phi(\phi(x \star w_1) \star w_2) \cdots \star w_{L-1})^T w_L, \tag{4}$$

where $w_l \in \mathbb{R}^{k_l}$ with $k_l \le d$ and $k_L = d$. Here, $\star$ denotes circular convolution: for $a \in \mathbb{R}^d$ and $b \in \mathbb{R}^k$ ($k \le d$), we have $a \star b \in \mathbb{R}^d$ defined as $[a \star b]_i = \sum_{j=1}^k [a]_{(i+j-1) \bmod d} [b]_j$, for $i \in [d]$. Define $\mathbf{M}_{\mathrm{conv}}(x) \in \mathbb{R}^{k_1 \times \cdots \times k_L}$ as $[\mathbf{M}_{\mathrm{conv}}(x)]_{j_1, j_2, \dots, j_L} = [x]_{(\sum_{l=1}^L j_l - L + 1) \bmod d}$ for $j_l \in [k_l]$, $l \in [L]$. Setting $v_l = w_l$ and $\mathbf{M} = \mathbf{M}_{\mathrm{conv}}$, we can verify that (2) and (4) are identical.

Fully-connected networks. An $L$-layer fully-connected network is defined as

$$f_{\mathrm{fc}}(x; \Theta_{\mathrm{fc}}) = \phi(\cdots \phi(\phi(x^T W_1) W_2) \cdots W_{L-1}) w_L, \tag{5}$$

where $W_l \in \mathbb{R}^{d_l \times d_{l+1}}$ for $l \in [L-1]$ (we use $d_1 = d$) and $w_L \in \mathbb{R}^{d_L}$. One can represent $f_{\mathrm{fc}}$ in the tensor form (2) by defining parameters $v_l = \mathrm{vec}(W_l)$ for $l \in [L-1]$ and $v_L = w_L$, and constructing the tensor $\mathbf{M}_{\mathrm{fc}}(x)$ in a recursive "block diagonal" manner. For example, if $L = 2$, we can define $\mathbf{M}_{\mathrm{fc}}(x) \in \mathbb{R}^{d_1 d_2 \times d_2}$ to be the Kronecker product of $I_{d_2}$ and $x$. For deeper networks, we defer the full description of $\mathbf{M}_{\mathrm{fc}}(x)$ to Appendix B.

Our focus: linear tensor networks. Throughout this section, we have used the activation $\phi$ to motivate our tensor formulation (2) for neural networks with nonlinear activations. For the remainder of the paper, we study the case where the activation is linear, i.e., $\phi(t) = t$. In this case,

$$f(x; \Theta) = \mathbf{M}(x) \bullet (v_1, v_2, \dots, v_L). \tag{6}$$

We will refer to (6) as linear tensor networks, where "linear" indicates that the activation is linear. Note that as a function of the parameters $v_1, \dots, v_L$, $f(x; \Theta)$ is in fact multilinear. We also remark that when the depth $L = 2$, the data tensor $\mathbf{M}(x)$ is a $k_1 \times k_2$ matrix and the network formulation boils down to $f(x; \Theta) = v_1^T \mathbf{M}(x) v_2$. Since the data tensor $\mathbf{M}(x)$ is a linear function of $x$, the linear tensor network is also a linear function of $x$. Thus, the output of the network can be written as $f(x; \Theta) = x^T \beta(\Theta)$, where $\beta(\Theta) \in \mathbb{R}^d$ denotes the linear coefficients computed as a function of the network parameters $\Theta$. Since the linear tensor network $f(x; \Theta)$ is linear in $x$, the expressive power of $f$ is at best that of a linear model $x \mapsto x^T z$.
However, even though the models have the same expressive power, their architectural differences lead to different implicit biases in training, which is the focus of our investigation in this paper. Studying separable classification and underdetermined regression is useful for highlighting such biases because there are infinitely many coefficients that perfectly classify or fit the dataset. For our linear tensor network, the evolution of the parameters $v_l$ via gradient flow reads

$$\dot v_l = -\nabla_{v_l} \mathcal{L}(\Theta) = -\textstyle\sum_{i=1}^n \ell'(f(x_i; \Theta), y_i)\, \mathbf{M}(x_i) \bullet (v_1, \dots, v_{l-1}, I_{k_l}, v_{l+1}, \dots, v_L) = \mathbf{M}(-X^T r) \bullet (v_1, \dots, v_{l-1}, I_{k_l}, v_{l+1}, \dots, v_L), \quad \forall l \in [L],$$

where we initialize $v_l(0) = \alpha \bar v_l$ for $l \in [L]$. We refer to $\alpha$ and $\bar v_l$ as the initial scale and initial direction, respectively. We note that we do not restrict the $\bar v_l$'s to be unit vectors, in order to allow different scaling (at initialization) over different layers. The vector $r \in \mathbb{R}^n$ is the residual vector, and each component of $r$ is defined as

$$[r]_i = \ell'(f(x_i; \Theta), y_i) = \begin{cases} -y_i \exp(-y_i f(x_i; \Theta)) & \text{for classification}, \\ f(x_i; \Theta) - y_i & \text{for regression}. \end{cases} \tag{7}$$
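As a quick sanity check of the formulation (6), the following sketch builds the diagonal data tensor $\mathbf{M}_{\mathrm{diag}}(x)$ for $L = 3$ and confirms that $\mathbf{M}(x) \bullet (v_1, v_2, v_3)$ matches the linear diagonal network $(x \odot w_1 \odot w_2)^T w_3$ (dimensions and random data are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)
w1, w2, w3 = (rng.normal(size=d) for _ in range(3))

# data tensor of a diagonal network: [M(x)]_{j,j,j} = [x]_j, zero elsewhere
M = np.zeros((d, d, d))
for j in range(d):
    M[j, j, j] = x[j]

# linear tensor network output M(x) . (v1, v2, v3), computed by einsum
f_tensor = np.einsum('abc,a,b,c->', M, w1, w2, w3)
f_diag = (x * w1 * w2) @ w3          # (x o w1 o w2)^T w3, linear activation
print(np.isclose(f_tensor, f_diag))  # True
```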

4. IMPLICIT BIAS OF GRADIENT FLOW IN SEPARABLE CLASSIFICATION

In this section, we present our results on the implicit bias of gradient flow in binary classification with linearly separable data. Recent papers (Lyu & Li, 2020; Ji & Telgarsky, 2020) on this separable classification setup prove that after 100% training accuracy has been achieved by gradient flow (along with other technical conditions), the parameters of $L$-homogeneous models diverge to infinity, while converging to a direction that aligns with the direction of the negative gradient. Mathematically,

$$\lim_{t \to \infty} \|\Theta(t)\| = \infty, \quad \lim_{t \to \infty} \frac{\Theta(t)}{\|\Theta(t)\|} = \Theta^\infty, \quad \lim_{t \to \infty} \frac{\Theta(t)^T \nabla_\Theta \mathcal{L}(\Theta(t))}{\|\Theta(t)\| \, \|\nabla_\Theta \mathcal{L}(\Theta(t))\|} = -1.$$

Since the linear tensor network satisfies the technical assumptions in the prior works, we apply these results to our setting and develop a new characterization of the limit directions of the parameters. Here, we present theorems on separable classification with general linear tensor networks. Corollaries for specific networks are deferred to Appendix A.

4.1. LIMIT DIRECTIONS OF PARAMETERS ARE SINGULAR VECTORS

Consider the singular value decomposition (SVD) of a matrix $A = \sum_{j=1}^m s_j (u_j \otimes v_j)$, where $m$ is the rank of $A$. Note that the tuples $(u_j, v_j, s_j)$ are solutions to the system of equations $su = Av$ and $sv = A^T u$. Lim (2005) generalizes this definition of singular vectors and singular values to higher-order tensors: given an order-$L$ tensor $\mathbf{A} \in \mathbb{R}^{k_1 \times \cdots \times k_L}$, we define the singular vectors $u_1, u_2, \dots, u_L$ and singular value $s$ to be the solutions of the following system of equations:

$$s u_l = \mathbf{A} \bullet (u_1, \dots, u_{l-1}, I_{k_l}, u_{l+1}, \dots, u_L), \quad \text{for } l \in [L]. \tag{8}$$

Using this definition of singular vectors of tensors, we can characterize the limit direction of parameters after reaching 100% training accuracy. In Appendix C, we prove the following:

Theorem 1. Assume that the gradient flow satisfies $\mathcal{L}(\Theta(t_0)) < 1$ for some $t_0 \ge 0$ and that $X^T r(t)$ converges in direction, say $u^\infty := \lim_{t\to\infty} \frac{X^T r(t)}{\|X^T r(t)\|_2}$. Then, $v_1, \dots, v_L$ converge in direction to singular vectors of $\mathbf{M}(-u^\infty)$.

For this theorem, we make some convergence assumptions, because the network is fully general; this is the only result where we assume convergence. In fact, for the special case of linear fully-connected networks, the directional convergence assumption is not required, and the linear coefficients $\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}})$ converge in direction to the $\ell_2$ max-margin classifier. We state this corollary in Appendix A.1; this result also appears in Ji & Telgarsky (2020), but we provide an alternative proof.
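The singular vector equations (8) can be checked numerically. Below is a sketch of a higher-order power iteration on a random order-3 tensor (a heuristic of our own for illustration; it finds some solution of (8), with no guarantee of reaching the top singular vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 5, 6))
u = [rng.normal(size=k) for k in A.shape]
u = [v / np.linalg.norm(v) for v in u]
for _ in range(1000):
    # update each factor to A . (u_1, ..., I, ..., u_3), then normalize
    u[0] = np.einsum('ijk,j,k->i', A, u[1], u[2]); u[0] /= np.linalg.norm(u[0])
    u[1] = np.einsum('ijk,i,k->j', A, u[0], u[2]); u[1] /= np.linalg.norm(u[1])
    u[2] = np.einsum('ijk,i,j->k', A, u[0], u[1]); u[2] /= np.linalg.norm(u[2])
s = np.einsum('ijk,i,j,k->', A, *u)

# residuals of the defining equations (8); all should be near zero
print(np.linalg.norm(s * u[0] - np.einsum('ijk,j,k->i', A, u[1], u[2])))
print(np.linalg.norm(s * u[1] - np.einsum('ijk,i,k->j', A, u[0], u[2])))
print(np.linalg.norm(s * u[2] - np.einsum('ijk,i,j->k', A, u[0], u[1])))
```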

4.2. LIMIT DIRECTIONS FOR ORTHOGONALLY DECOMPOSABLE NETWORKS

Admittedly, Theorem 1 is not a full characterization of the limit directions, because there are usually multiple solutions that satisfy (8). For example, in case of $L = 2$, the data tensor $\mathbf{M}(-u^\infty)$ is a matrix and the number of possible limit directions (up to scaling) of $(v_1, v_2)$ is at least the rank of $\mathbf{M}(-u^\infty)$. Singular vectors of higher-order tensors are much less understood than their matrix counterparts, and are much harder to deal with. Although their existence follows from the variational formulation (Lim, 2005), they are intractable to compute. Testing whether a given number is a singular value, approximating the corresponding singular vectors, and computing the best rank-1 approximation are all NP-hard (Hillar & Lim, 2013), let alone orthogonal decompositions. Given this intractability, it is reasonable to make some assumptions on the "structure" of the data tensor $\mathbf{M}(x)$, so that it is easier to handle. The following assumption defines a class of orthogonally decomposable data tensors, which includes linear diagonal networks and linear full-length convolutional networks as special cases (for the proof, see Appendices D.2 and D.3).

Assumption 1. For the data tensor $\mathbf{M}(x) \in \mathbb{R}^{k_1 \times \cdots \times k_L}$ of a linear tensor network (6), there exist a full column rank matrix $S \in \mathbb{C}^{m \times d}$ ($d \le m \le \min_l k_l$) and matrices $U_1 \in \mathbb{C}^{k_1 \times m}, \dots, U_L \in \mathbb{C}^{k_L \times m}$ such that $U_l^H U_l = I_m$ for all $l \in [L]$, and the data tensor $\mathbf{M}(x)$ can be written as

$$\mathbf{M}(x) = \textstyle\sum_{j=1}^m [Sx]_j \big( [U_1]_{\bullet,j} \otimes [U_2]_{\bullet,j} \otimes \cdots \otimes [U_L]_{\bullet,j} \big). \tag{9}$$

In this assumption, we allow $U_1, \dots, U_L$ and $S$ to be complex matrices, although $\mathbf{M}(x)$ and the parameters $v_l$ stay real, as defined earlier. For a complex matrix $A$, we use $A^*$ to denote its entry-wise complex conjugate, $A^T$ to denote its transpose (without conjugating), and $A^H$ to denote its conjugate transpose. In case of $L = 2$, Assumption 1 requires that the data tensor $\mathbf{M}(x)$ (now a matrix) has the singular value decomposition $\mathbf{M}(x) = U_1 \operatorname{diag}(Sx) U_2^T$; i.e., the left and right singular vectors are independent of $x$, and the singular values are linear in $x$. Using Assumption 1, the following theorem characterizes the limit directions.

Theorem 2. Suppose a linear tensor network satisfies Assumption 1. If there exists $\lambda > 0$ such that the initial directions $\bar v_1, \dots, \bar v_L$ of the network parameters satisfy $|[U_l^T \bar v_l]_j|^2 - |[U_L^T \bar v_L]_j|^2 \ge \lambda$ for all $l \in [L-1]$ and $j \in [m]$, then $\beta(\Theta(t))$ converges in a direction that aligns with $S^T \rho^\infty$, where $\rho^\infty \in \mathbb{C}^m$ denotes a stationary point of the following optimization problem:

$$\text{minimize}_{\rho \in \mathbb{C}^m} \ \|\rho\|_{2/L} \quad \text{subject to} \quad y_i x_i^T S^T \rho \ge 1, \ \forall i \in [n].$$

If $S$ is invertible, then $\beta(\Theta(t))$ converges in a direction that aligns with a stationary point $z^\infty$ of

$$\text{minimize}_{z \in \mathbb{R}^d} \ \|S^{-T} z\|_{2/L} \quad \text{subject to} \quad y_i x_i^T z \ge 1, \ \forall i \in [n].$$

Theorem 2 shows that gradient flow finds a sparse $\rho^\infty$ that minimizes the $\ell_{2/L}$ norm in the "singular value space," where the data points $x_i$ are transformed into vectors $Sx_i$ consisting of singular values of $\mathbf{M}(x_i)$. Also, the proof of Theorem 2 reveals that in case of $L = 2$, the parameters $v_l(t)$ in fact converge to the top singular vectors of the data tensor $\mathbf{M}(-X^T r)$; thus, compared to Theorem 1, we have a more complete characterization of "which" singular vectors the parameters converge to. The proof of Theorem 2 is in Appendix D. Since the orthogonal decomposition (Assumption 1) of $\mathbf{M}(x)$ tells us that the singular vectors of $\mathbf{M}(x)$, i.e., the columns of $U_1, \dots, U_L$, are independent of $x$, we can transform the network parameters $v_l$ to $U_l^T v_l$ and show that the network behaves similarly to a linear diagonal network. This observation comes in handy in the characterization of limit directions.

Remark 1 (Necessity of initialization assumptions). In order to remove the assumption that the loss converges to zero, at least some condition on initialization is necessary, because there are examples showing non-convergence of gradient flow for certain initializations (Bartlett et al., 2018; Arora et al., 2019a). In our theorems, we pose assumptions on the initial directions $\bar v_l$ that are sufficient conditions for the loss $\mathcal{L}(\Theta(t))$ to converge to zero. Although such sufficient conditions are "stronger" than assuming $\mathcal{L}(\Theta(t)) \to 0$, they are useful because they can be easily checked a priori, i.e., before running gradient flow. We note an important fact: in Theorems 2 and onwards, the conditions on initialization are used solely to prove convergence of the loss to zero, and our statements on the implicit bias hold whenever the loss converges to zero, even for initializations that do not satisfy our conditions. In addition, we argue that our assumptions are not too restrictive; $\lambda$ can be arbitrarily small, so the conditions are satisfied with probability 1 if we set $\bar v_L = 0$ and randomly sample the other $\bar v_l$'s. Setting one layer to zero to prove convergence is also studied in Wu et al. (2019). Lastly, the condition that $\bar v_L$ is "small" can be imposed on any layer instead; e.g., convergence still holds if $|[U_l^T \bar v_l]_j|^2 - |[U_1^T \bar v_1]_j|^2 \ge \lambda$ for all $l = 2, \dots, L$ and $j \in [m]$.

Remark 2 (Comparison to existing results). Theorem 2 leads to corollaries (stated in Appendix A.2) on linear diagonal and full-length convolutional networks, showing that diagonal (or convolutional) networks converge to a stationary point of the max-margin problem with respect to the $\ell_{2/L}$ norm (or DFT-domain $\ell_{2/L}$ norm). Theorem 2 recovers the results in Gunasekar et al. (2018b) without relying on assumptions such as directional convergence of parameters and gradients.

Remark 3 (Implications for architecture design). Theorem 2 shows that gradient flow finds a solution that is sparse in a "transformed" input space where all data points are transformed with $S$. This implies something interesting about architecture design: if sparsity of the solution under a certain linear transformation $T$ is desired, one can design a network using Assumption 1 by setting $S = T$. Training such a network will give us a solution that has the desired sparsity property.

Other than Assumption 1, there is another setting where we can prove a full characterization of limit directions: when there is one data point ($n = 1$) and the network is 2-layer ($L = 2$). This "extremely overparametrized" case is motivated by an experimental paper (Zhang et al., 2019) which studies the generalization performance of different architectures when there is only one training data point.

Theorem 3. Suppose we have a 2-layer linear tensor network (6) and a single data point $(x, y)$. Consider the compact SVD $\mathbf{M}(x) = U_1 \operatorname{diag}(s) U_2^T$, where $U_1 \in \mathbb{R}^{k_1 \times m}$, $U_2 \in \mathbb{R}^{k_2 \times m}$, and $s \in \mathbb{R}^m$ for $m \le \min\{k_1, k_2\}$. Let $\rho^\infty \in \mathbb{R}^m$ be a solution of the following optimization problem:

$$\text{minimize}_{\rho \in \mathbb{R}^m} \ \|\rho\|_1 \quad \text{subject to} \quad y s^T \rho \ge 1.$$

Assume that there exists $\lambda > 0$ such that the initial directions $\bar v_1, \bar v_2$ of the network parameters satisfy $[U_1^T \bar v_1]_j^2 - [U_2^T \bar v_2]_j^2 \ge \lambda$ for all $j \in [m]$.
Then, $v_1$ and $v_2$ converge in direction to $U_1 \eta_1^\infty$ and $U_2 \eta_2^\infty$, where $|\eta_1^\infty| = |\eta_2^\infty| = |\rho^\infty|^{\odot 1/2}$ and $\operatorname{sign}(\eta_1^\infty) = \operatorname{sign}(y) \operatorname{sign}(\eta_2^\infty)$.

The proof of Theorem 3 can be found in Appendix E. Since $\rho^\infty$ is the minimum $\ell_1$ norm solution in the singular value space, the parameters $v_1$ and $v_2$ converge in direction to the top singular vectors. We would like to emphasize that this theorem can be applied to any network architecture that can be represented as a linear tensor network. Recall that the previous result (Gunasekar et al., 2018b) only considers full-length filters ($k_1 = d$), hence providing limited insight on networks with small filters, e.g., $k_1 = 2$. In light of this, we present a corollary in Appendix A.3 showing that the linear coefficients of convolutional networks converge in direction to a "filtered" version of $x$.
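As a numerical illustration of Assumption 1, the sketch below verifies the $L = 2$ decomposition $\mathbf{M}_{\mathrm{conv}}(x) = U_1 \operatorname{diag}(Sx) U_2^T$ for a full-length circular convolutional network, with $U_1 = U_2 = F^*$ and $S = \sqrt d\, F$ (see Corollary 3 and Appendix D.3; the dimension and random data are our own):

```python
import numpy as np

d = 5
rng = np.random.default_rng(2)
x = rng.normal(size=d)

# [M_conv(x)]_{j1,j2} = [x]_{(j1+j2-1) mod d}  (1-indexed; 0-indexed below)
M = np.array([[x[(j1 + j2) % d] for j2 in range(d)] for j1 in range(d)])

# DFT matrix [F]_{j,k} = exp(-2 pi i (j-1)(k-1) / d) / sqrt(d)
jj, kk = np.meshgrid(np.arange(d), np.arange(d), indexing='ij')
F = np.exp(-2j * np.pi * jj * kk / d) / np.sqrt(d)

U = F.conj()                      # U_1 = U_2 = F^*
S = np.sqrt(d) * F                # S = d^{(L-1)/2} F with L = 2
print(np.allclose(M, U @ np.diag(S @ x) @ U.T))   # True
```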

5. IMPLICIT BIAS OF GRADIENT FLOW IN UNDERDETERMINED REGRESSION

In Section 4, the limit directions of parameters we characterized do not depend on the initialization. This is due to the fact that the parameters diverge to infinity in separable classification problems, so the initialization becomes unimportant in the limit. This is not the case in the regression setting, because the parameters do not diverge to infinity. As we show in this section, the limit points are closely tied to the initialization, and our analyses characterize the dependency between them.

5.1. LIMIT POINT CHARACTERIZATION FOR ORTHOGONALLY DECOMPOSABLE NETWORKS

For the orthogonally decomposable networks satisfying Assumption 1 with real $S$ and $U_l$'s, we consider how the limit points of gradient flow change according to initialization. We consider a specific initialization scheme that, in the special case of diagonal networks, corresponds to setting $w_l(0) = \alpha \bar w$ for $l \in [L-1]$ and $w_L(0) = 0$. We use the following lemma on a relevant system of ODEs:

Lemma 4. Consider the system of ODEs, where $p, q: \mathbb{R} \to \mathbb{R}$:

$$\dot p = p^{L-2} q, \quad \dot q = p^{L-1}, \quad p(0) = 1, \quad q(0) = 0.$$

Then, the solutions $p_L(t)$ and $q_L(t)$ are continuous on their maximal interval of existence, which is of the form $(-c, c) \subset \mathbb{R}$ for some $c \in (0, \infty]$. Define $h_L(t) = p_L(t)^{L-1} q_L(t)$; then, $h_L(t)$ is odd and strictly increasing, satisfying $\lim_{t \uparrow c} h_L(t) = \infty$ and $\lim_{t \downarrow -c} h_L(t) = -\infty$.

Using the function $h_L(t)$ from Lemma 4, we can obtain the following theorem that characterizes the limit points as the minimizer of a norm-like function $Q_{L,\alpha,\bar\eta}$ among the global minima.

Theorem 5. Suppose a linear tensor network satisfies Assumption 1. Assume further that the matrices $U_1, \dots, U_L$ and $S$ from Assumption 1 are all real. For some $\lambda > 0$, choose any vector $\bar\eta \in \mathbb{R}^m$ satisfying $[\bar\eta]_j^2 \ge \lambda$ for all $j \in [m]$, and choose initial directions $\bar v_l = U_l \bar\eta$ for $l \in [L-1]$ and $\bar v_L = 0$. Then, the linear coefficients $\beta(\Theta(t))$ converge to $S^T \rho^\infty$, where $\rho^\infty$ is the solution of

$$\text{minimize}_{\rho \in \mathbb{R}^m} \ Q_{L,\alpha,\bar\eta}(\rho) := \alpha^2 \textstyle\sum_{j=1}^m [\bar\eta]_j^2 \, H_L\!\left( \frac{[\rho]_j}{\alpha^L |[\bar\eta]_j|^L} \right) \quad \text{subject to} \quad X S^T \rho = y,$$

where $Q_{L,\alpha,\bar\eta}: \mathbb{R}^m \to \mathbb{R}$ is a norm-like function defined using $H_L(t) := \int_0^t h_L^{-1}(\tau)\, d\tau$. If $S$ is invertible, then $\beta(\Theta(t))$ converges to the solution $z^\infty$ of

$$\text{minimize}_{z \in \mathbb{R}^d} \ Q_{L,\alpha,\bar\eta}(S^{-T} z) \quad \text{subject to} \quad Xz = y.$$

The proofs of Lemma 4 and Theorem 5 are deferred to Appendix F.

Remark 4 (Interpolation between $\ell_1$ and $\ell_2$). It can be checked that $H_L(t)$ grows like the absolute value function if $t$ is large, and like a quadratic function if $t$ is close to zero. This means that

$$\lim_{\alpha \to 0} Q_{L,\alpha,\bar\eta}(\rho) \propto \textstyle\sum_{j=1}^m \frac{|[\rho]_j|}{|[\bar\eta]_j|^{L-2}}, \qquad \lim_{\alpha \to \infty} Q_{L,\alpha,\bar\eta}(\rho) \propto \textstyle\sum_{j=1}^m \frac{[\rho]_j^2}{[\bar\eta]_j^{2L-2}},$$

so $Q_{L,\alpha,\bar\eta}$ interpolates between the weighted $\ell_1$ and weighted $\ell_2$ norms of $\rho$. Also, the weights in the norm depend on the initialization direction $\bar\eta$ unless $L = 2$ and $\alpha \to 0$. In general, $Q_{L,\alpha,\bar\eta}$ interpolates between the standard $\ell_1$ and $\ell_2$ norms only if $|[\bar\eta]_j|$ is the same for all $j \in [m]$. This result is similar to the observations made in Woodworth et al. (2020), which considers a diagonal network with a "differential" structure $f(x; w_+, w_-) = x^T(w_+^{\odot L} - w_-^{\odot L})$. In contrast, our results apply to a more general class of networks, without the need for the differential structure. In Appendix A.4, we state corollaries of Theorem 5 for linear diagonal networks and linear full-length convolutional networks with even data points. There, we also show that deep matrix sensing with commutative sensor matrices (Arora et al., 2019b) is a special case of our setting.

Next, we present the regression counterpart of Theorem 3, for 2-layer linear tensor networks with a single data point. For this extremely overparametrized setup, we can fully characterize the limit points as functions of the initialization $v_1(0) = \alpha \bar v_1$ and $v_2(0) = \alpha \bar v_2$, for any linear tensor network, including linear convolutional networks with filter size smaller than the input dimension.

Theorem 6. Suppose we have a 2-layer linear tensor network (6) and a single data point $(x, y)$. Consider the compact SVD $\mathbf{M}(x) = U_1 \operatorname{diag}(s) U_2^T$, where $U_1 \in \mathbb{R}^{k_1 \times m}$, $U_2 \in \mathbb{R}^{k_2 \times m}$, and $s \in \mathbb{R}^m$ for $m \le \min\{k_1, k_2\}$. Assume that there exists $\lambda > 0$ such that the initial directions $\bar v_1, \bar v_2$ of the network parameters satisfy $[U_1^T \bar v_1]_j^2 - [U_2^T \bar v_2]_j^2 \ge \lambda$ for all $j \in [m]$. Then, gradient flow converges to a global minimizer of the loss $\mathcal{L}$, and $v_1(t)$ and $v_2(t)$ converge to the limit points

$$v_1^\infty = \alpha U_1 \left( U_1^T \bar v_1 \odot \cosh\!\big( g^{-1}\big( \tfrac{y}{\alpha^2} \big)\, s \big) + U_2^T \bar v_2 \odot \sinh\!\big( g^{-1}\big( \tfrac{y}{\alpha^2} \big)\, s \big) \right) + \alpha (I_{k_1} - U_1 U_1^T) \bar v_1,$$

$$v_2^\infty = \alpha U_2 \left( U_1^T \bar v_1 \odot \sinh\!\big( g^{-1}\big( \tfrac{y}{\alpha^2} \big)\, s \big) + U_2^T \bar v_2 \odot \cosh\!\big( g^{-1}\big( \tfrac{y}{\alpha^2} \big)\, s \big) \right) + \alpha (I_{k_2} - U_2 U_2^T) \bar v_2,$$

where $\cosh$ and $\sinh$ are applied entry-wise and $g^{-1}$ is the inverse of the following strictly increasing function:

$$g(\nu) = \textstyle\sum_{j=1}^m [s]_j \left( \frac{[U_1^T \bar v_1]_j^2 + [U_2^T \bar v_2]_j^2}{2} \sinh(2[s]_j \nu) + [U_1^T \bar v_1]_j [U_2^T \bar v_2]_j \cosh(2[s]_j \nu) \right).$$

The proof can be found in Appendix G. We can observe that as $\alpha \to 0$, we have $g^{-1}(\frac{y}{\alpha^2}) \to \infty$, which results in exponentially faster growth of the $\sinh(\cdot)$ and $\cosh(\cdot)$ terms for the top singular values. As a result, the top singular vectors dominate the limit points $v_1^\infty$ and $v_2^\infty$ as $\alpha \to 0$, and the limit points do not depend on the initial directions $\bar v_1, \bar v_2$. Experimental results in Section 6 support this observation.
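For $L = 2$, Lemma 4 can be made fully explicit: the ODE is solved by $p = \cosh$ and $q = \sinh$ (the same pair appearing in the limit points of Theorem 6), so $h_2(t) = \cosh(t)\sinh(t) = \frac{1}{2}\sinh(2t)$ and $h_2^{-1}(\tau) = \frac{1}{2}\operatorname{arcsinh}(2\tau)$. The sketch below (our own numerics) checks this against a direct ODE solve and evaluates $H_2$ by quadrature, illustrating the growth behavior behind Remark 4:

```python
import numpy as np
from scipy.integrate import solve_ivp, quad

# solve p' = q, q' = p with p(0) = 1, q(0) = 0 and compare to cosh/sinh
sol = solve_ivp(lambda t, y: [y[1], y[0]], (0.0, 2.0), [1.0, 0.0],
                dense_output=True, rtol=1e-10, atol=1e-12)
t = 1.3
print(np.allclose(sol.sol(t), [np.cosh(t), np.sinh(t)]))   # True

def H2(t):
    # H_2(t) = int_0^t h_2^{-1}(tau) d tau, with h_2^{-1}(tau) = arcsinh(2 tau)/2
    return quad(lambda tau: 0.5 * np.arcsinh(2.0 * tau), 0.0, t)[0]

for s in [1e-2, 1e-1, 1.0, 10.0, 100.0]:
    print(f"t={s:7.2f}  H2(t)/t^2={H2(s)/s**2:8.4f}  H2(t)/t={H2(s)/s:8.4f}")
# H2 grows like t^2/2 near zero and almost linearly (up to a log factor)
# for large t, driving the l2 <-> (weighted) l1 interpolation of Q.
```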

5.2. IMPLICIT BIAS IN FULLY-CONNECTED NETWORKS: THE α → 0 LIMIT

We state the last theoretical element of this paper, which proves that the linear coefficients $\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}})$ of deep linear fully-connected networks converge to the minimum $\ell_2$ norm solution as $\alpha \to 0$. We assume for simplicity that $d_1 = d_2 = \cdots = d_L = d$ in this section, but the result can be extended to $d_l \ge d$ without much difficulty. Recall $f_{\mathrm{fc}}(x; \Theta_{\mathrm{fc}}) = x^T W_1 \cdots W_{L-1} w_L$. We minimize the training loss $\mathcal{L}$ with initialization $W_l(0) = \alpha \bar W_l$ for $l \in [L-1]$ and $w_L(0) = \alpha \bar w_L$.

Theorem 7. Assume that the initial directions $\bar W_1, \dots, \bar W_{L-1}, \bar w_L$ satisfy (1) $\bar W_l^T \bar W_l \succeq \bar W_{l+1} \bar W_{l+1}^T$ for $l \in [L-2]$, and (2) there exists $\lambda > 0$ such that $\bar W_{L-1}^T \bar W_{L-1} - \bar w_L \bar w_L^T \succeq \lambda I_d$. Then, the gradient flow converges to a global minimum, and

$$\lim_{\alpha \to 0} \lim_{t \to \infty} \beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}(t)) = X^T (XX^T)^{-1} y.$$

The proof is presented in Appendix H. Theorem 7 shows that in the limit $\alpha \to 0$, linear fully-connected networks have a bias towards the minimum $\ell_2$ norm solution, regardless of the depth. This is consistent with the results shown for classification. We also note that the convergence to a global minimum holds for any $\alpha > 0$, and that our sufficient conditions ($\bar W_l^T \bar W_l \succeq \bar W_{l+1} \bar W_{l+1}^T$ and $\bar W_{L-1}^T \bar W_{L-1} - \bar w_L \bar w_L^T \succeq \lambda I_d$) are conditions on the initial directions alone, so they can be checked before running gradient flow.
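A quick numerical check of Theorem 7 (our own toy instance): a 3-layer linear fully-connected network trained by small-step GD from $\bar W_l = I_d$ and $\bar w_L = 0$, which satisfies conditions (1) and (2) with $\lambda = 1$. For small $\alpha$, the learned coefficients should approach $X^T(XX^T)^{-1}y$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(2, 3))          # n = 2 < d = 3: underdetermined
y = rng.normal(size=2)
d, alpha, eta = 3, 0.1, 5e-3

W1, W2 = alpha * np.eye(d), alpha * np.eye(d)   # W_l(0) = alpha * I_d
w3 = np.zeros(d)                                # w_L(0) = 0
for _ in range(200_000):
    r = X @ (W1 @ W2 @ w3) - y                  # residual (regression)
    g = X.T @ r                                 # gradient w.r.t. beta
    g1 = np.outer(g, W2 @ w3)                   # chain rule, per layer
    g2 = np.outer(W1.T @ g, w3)
    g3 = W2.T @ (W1.T @ g)
    W1 -= eta * g1; W2 -= eta * g2; w3 -= eta * g3
print(W1 @ W2 @ w3)                       # ~ min l2-norm solution for small alpha
print(X.T @ np.linalg.solve(X @ X.T, y))  # X^T (X X^T)^{-1} y
```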

6. EXPERIMENTS

Regression. To fully visualize the trajectory of the linear coefficients, we run simple experiments with 2-layer linear fully-connected/diagonal/convolutional networks and a single 2-dimensional data point $(x, y) = ([1 \; 2], 1)$. For this dataset, the minimum $\ell_2$ norm solution (corresponding to fully-connected networks) of the regression problem is $[0.2 \; 0.4]$, whereas the minimum $\ell_1$ norm solution (corresponding to diagonal networks) is $[0 \; 0.5]$ and the minimum DFT-domain $\ell_1$ norm solution (corresponding to convolutional networks) is $[0.33 \; 0.33]$. We randomly pick four directions $\bar z_1, \dots, \bar z_4 \in \mathbb{R}^2$, and choose the initial directions of the network parameters in a way that their linear coefficients at initialization are exactly $\beta(\Theta(0)) = \alpha^2 \bar z_j$. With varying initial scales $\alpha \in \{0.01, 0.5, 1\}$, we run GD with a small step size $\eta = 10^{-3}$ for a sufficiently large number of iterations $T = 5 \times 10^3$. Figures 1 and 2 plot the trajectories of $\beta(\Theta)$ (appropriately clipped for visual clarity) as well as the predicted limit points (Theorem 6). We observe that even though the networks start at the same linear coefficients $\alpha^2 \bar z_j$, they evolve differently due to their different architectures. Note that the prediction of the limit points is accurate, and the solution found by GD is less dependent on the initial directions when $\alpha$ is small.

Classification. It is shown in existing works as well as in Section 4 that the limit directions of the linear coefficients are independent of the initialization. Is this also true in practice? To see this, we run a set of toy experiments on classification with two data points $(x_1, y_1) = ([1 \; 2], +1)$ and $(x_2, y_2) = ([0 \; {-3}], -1)$. One can check that the max-margin classifiers for this problem are in the same directions as the corresponding min-norm solutions in the regression problem above. We use the same networks as in regression, and the same set of initial directions satisfying $\beta(\Theta(0)) = \alpha^2 \bar z_j$. With initial scales $\alpha \in \{0.01, 0.5, 1\}$, we run GD with step size $\eta = 5 \times 10^{-4}$ for $T = 2 \times 10^6$ iterations. All experiments reached $\mathcal{L}(\Theta) \le 10^{-5}$ at the end. The trajectories are plotted in Figure 2 in the Appendix. We find that, in contrast to our theoretical characterization, the actual coefficients are quite dependent on the initialization, because we do not train the network all the way to zero loss. This observation is consistent with a recent analysis (Moroshko et al., 2020) for diagonal networks, and suggests that understanding the behavior of the iterates after a finite number of steps is an important direction for future work.
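For reference, here is a minimal script along the lines of the diagonal-network regression experiment (our own simplification; the imbalance factor $c$ is an assumption we add so that the initialization condition of Theorem 6 holds while keeping $\beta(\Theta(0)) = \alpha^2 \bar z$):

```python
import numpy as np

x = np.array([1.0, 2.0]); y = 1.0            # single data point from Section 6
alpha, eta, T, c = 0.01, 1e-3, 500_000, 1.5
rng = np.random.default_rng(4)
zbar = rng.normal(size=2)                    # random initial direction

# 2-layer diagonal net f = x^T (w1 o w2), with w1 o w2 = alpha^2 * zbar at t = 0
w1 = alpha * c * np.sqrt(np.abs(zbar)) * np.sign(zbar)
w2 = (alpha / c) * np.sqrt(np.abs(zbar))
for _ in range(T):
    r = x @ (w1 * w2) - y                    # residual
    g1, g2 = r * x * w2, r * x * w1          # gradients w.r.t. w1, w2
    w1 -= eta * g1; w2 -= eta * g2
print(w1 * w2)    # ~ [0, 0.5], the minimum l1-norm solution, for small alpha
```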

7. CONCLUSION

This paper studies the implicit bias of gradient flow on training linear tensor networks. Under a general tensor formulation of linear networks, we provide theorems characterizing how the network architectures and initializations affect the limit directions/points of gradient flow. Our work provides a unified framework that connects multiple existing results on the implicit bias of gradient flow as special cases.

Figure 2: Gradient descent trajectories of linear coefficients of linear fully-connected, diagonal, and convolutional networks on a regression task with initial scale α = 0.5 (top left), and networks on a classification task with initial scales α = 0.01, 0.5, 1 (rest). Networks are initialized at the same coefficients (circles on purple lines), but follow different trajectories due to different implicit biases of networks induced from their architecture. The top left figure shows that our theoretical predictions on limit points (circles on yellow line, the set of global minima) agree with the solution found by GD. For details of the experimental setup, please refer to Section 6.

A COROLLARIES ON SPECIFIC NETWORK ARCHITECTURES

We present corollaries obtained by specializing the theorems in the main text to specific network architectures. We briefly review the linear neural network architectures studied in this section.

Linear fully-connected networks. An $L$-layer linear fully-connected network is defined as

$$f_{\mathrm{fc}}(x; \Theta_{\mathrm{fc}}) = x^T W_1 \cdots W_{L-1} w_L, \tag{10}$$

where $W_l \in \mathbb{R}^{d_l \times d_{l+1}}$ for $l \in [L-1]$ (we use $d_1 = d$) and $w_L \in \mathbb{R}^{d_L}$.

Linear diagonal networks. An $L$-layer linear diagonal network is written as

$$f_{\mathrm{diag}}(x; \Theta_{\mathrm{diag}}) = (x \odot w_1 \odot \cdots \odot w_{L-1})^T w_L, \tag{11}$$

where $w_l \in \mathbb{R}^d$ for $l \in [L]$.

Linear (circular) convolutional networks. An $L$-layer linear convolutional network is written as

$$f_{\mathrm{conv}}(x; \Theta_{\mathrm{conv}}) = (\cdots((x \star w_1) \star w_2) \cdots \star w_{L-1})^T w_L. \tag{12}$$

Deep matrix sensing. The deep matrix sensing problem minimizes

$$\mathcal{L}_{\mathrm{ms}}(W_1 \cdots W_L) := \textstyle\sum_{i=1}^n (y_i - \langle A_i, W_1 \cdots W_L \rangle)^2, \tag{13}$$

where the sensor matrices $A_1, \dots, A_n \in \mathbb{R}^{d \times d}$ are symmetric. Following Gunasekar et al. (2017); Arora et al. (2019b), we consider sensor matrices $A_1, \dots, A_n$ that commute. To make the problem underdetermined, we assume that $n \le d$ and that the $A_i$'s are linearly independent.

A.1 COROLLARY OF THEOREM 1

Corollary 1. Consider an $L$-layer linear fully-connected network (10). If the training loss satisfies $\mathcal{L}(\Theta_{\mathrm{fc}}(t_0)) < 1$ for some $t_0 \ge 0$, then $\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}(t))$ converges in a direction that aligns with the solution of the following optimization problem:

$$\text{minimize}_{z \in \mathbb{R}^d} \ \|z\|_2^2 \quad \text{subject to} \quad y_i x_i^T z \ge 1, \ \forall i \in [n].$$

Corollary 1 shows that whenever the network separates the data correctly, the linear coefficients $\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}})$ converge in direction to the $\ell_2$ max-margin classifier. Note that this corollary does not require the directional convergence of $X^T r$, which is different from Theorem 1. In fact, this corollary also appears in Ji & Telgarsky (2020), but we provide an alternative proof based on our tensor formulation. The proof of Corollary 1 can be found in Appendix C.

A.2 COROLLARIES OF THEOREM 2

Corollary 2. Consider an $L$-layer linear diagonal network (11). If there exists $\lambda > 0$ such that the initial directions $\bar w_1, \dots, \bar w_L$ of the network parameters satisfy $[\bar w_l]_j^2 - [\bar w_L]_j^2 \ge \lambda$ for all $l \in [L-1]$ and $j \in [d]$, then $\beta_{\mathrm{diag}}(\Theta_{\mathrm{diag}}(t))$ converges in a direction that aligns with a stationary point $z^\infty$ of

$$\text{minimize}_{z \in \mathbb{R}^d} \ \|z\|_{2/L} \quad \text{subject to} \quad y_i x_i^T z \ge 1, \ \forall i \in [n].$$

For full-length convolutional networks, we define $F \in \mathbb{C}^{d \times d}$ to be the matrix of discrete Fourier transform basis vectors, $[F]_{j,k} = \frac{1}{\sqrt d} \exp(-\frac{\sqrt{-1} \cdot 2\pi (j-1)(k-1)}{d})$. Note that $F^* = F^{-1}$, and both $F$ and $F^*$ are symmetric, but not Hermitian.

Corollary 3. Consider an $L$-layer linear full-length convolutional network (12). If there exists $\lambda > 0$ such that the initial directions $\bar w_1, \dots, \bar w_L$ of the network parameters satisfy $|[F \bar w_l]_j|^2 - |[F \bar w_L]_j|^2 \ge \lambda$ for all $l \in [L-1]$ and $j \in [d]$, then $\beta_{\mathrm{conv}}(\Theta_{\mathrm{conv}}(t))$ converges in a direction that aligns with a stationary point $z^\infty$ of

$$\text{minimize}_{z \in \mathbb{R}^d} \ \|F z\|_{2/L} \quad \text{subject to} \quad y_i x_i^T z \ge 1, \ \forall i \in [n].$$

Corollary 2 shows that in the limit, the linear diagonal network finds a sparse solution $z$ that is a stationary point of the $\ell_{2/L}$ max-margin classification problem. Corollary 3 has a similar conclusion, except that the standard $\ell_{2/L}$ norm is replaced with the DFT-domain $\ell_{2/L}$ norm. By specifying mild conditions on initialization (see Remark 1), these corollaries remove the convergence assumptions required in Gunasekar et al. (2018b). The proofs of Corollaries 2 and 3 are in Appendix D.

A.3 COROLLARY OF THEOREM 3

Recall that Theorem 3 can be applied to any 2-layer network that can be represented as a linear tensor network. Examples include convolutional networks whose filters are not full-length (i.e., filter size $k_1 < d$), which are not covered by the previous result (Gunasekar et al., 2018b). Here, we present the characterization of the convergence directions of $\beta_{\mathrm{conv}}(\Theta_{\mathrm{conv}}(t))$ for small-filter cases: $k_1 = 1$ and $k_1 = 2$.

Corollary 4. Consider a 2-layer linear convolutional network (12) with $k_1 = 1$ and a single data point $(x, y)$. If there exists $\lambda > 0$ such that the initial directions $\bar v_1$ and $\bar v_2$ of the network parameters satisfy $\|x\|_2^2 \bar v_1^2 - (x^T \bar v_2)^2 \ge \|x\|_2^2 \lambda$, then $\beta_{\mathrm{conv}}(\Theta_{\mathrm{conv}}(t))$ converges in a direction that aligns with $yx$.

Consider a 2-layer linear convolutional network (12) with $k_1 = 2$ and a single data point $(x, y)$. Let $\overleftarrow{x} := [[x]_2 \; \cdots \; [x]_d \; [x]_1]$ and $\overrightarrow{x} := [[x]_d \; [x]_1 \; \cdots \; [x]_{d-1}]$. If there exists $\lambda > 0$ such that the initial directions $\bar v_1$ and $\bar v_2$ of the network parameters satisfy

$$([\bar v_1]_1 + [\bar v_1]_2)^2 - \frac{((x + \overleftarrow{x})^T \bar v_2)^2}{\|x\|_2^2 + x^T \overleftarrow{x}} \ge \lambda, \quad \text{and} \quad ([\bar v_1]_1 - [\bar v_1]_2)^2 - \frac{((x - \overleftarrow{x})^T \bar v_2)^2}{\|x\|_2^2 - x^T \overleftarrow{x}} \ge \lambda,$$

then $\beta_{\mathrm{conv}}(\Theta_{\mathrm{conv}}(t))$ converges in a direction that aligns with a filtered version of $x$:

$$\lim_{t \to \infty} \frac{\beta_{\mathrm{conv}}(\Theta_{\mathrm{conv}}(t))}{\|\beta_{\mathrm{conv}}(\Theta_{\mathrm{conv}}(t))\|_2} \propto \begin{cases} 2yx + y\overleftarrow{x} + y\overrightarrow{x} & \text{if } x^T \overleftarrow{x} > 0, \\ 2yx - y\overleftarrow{x} - y\overrightarrow{x} & \text{if } x^T \overleftarrow{x} < 0. \end{cases}$$

Corollary 4 shows that if the filter size is $k_1 = 1$, then the limit direction of $\beta_{\mathrm{conv}}(\Theta_{\mathrm{conv}})$ is always the $\ell_2$ max-margin classifier. Note that this is quite different from the case $k_1 = d$, which converges to the DFT-domain $\ell_1$ max-margin classifier. However, for $1 < k_1 < d$, it is difficult to characterize the limit direction as the max-margin classifier of some common norm. Rather, the limit direction of $\beta_{\mathrm{conv}}(\Theta_{\mathrm{conv}})$ corresponds to a "filtered" version of the data point, and the weights of the filter depend on the data point $x$. For $k_1 = 2$, the filter is a low-pass filter if the autocorrelation $x^T \overleftarrow{x}$ of $x$ is positive, and a high-pass filter if the autocorrelation is negative. For $k_1 > 2$, the filter weights are more complicated to characterize in terms of $x$, and the filter length increases as $k_1$ increases. We prove Corollary 4 in Appendix E.

A.4 COROLLARIES OF THEOREM 5

In this subsection, we apply Theorem 5 to linear diagonal networks, linear full-length convolutional networks with even data, and deep matrix sensing. The proofs of the corollaries can be found in Appendix F.

Corollary 5. Consider an $L$-layer linear diagonal network (11). For some $\lambda > 0$, choose any vector $\bar w \in \mathbb{R}^d$ satisfying $[\bar w]_j^2 \ge \lambda$ for all $j \in [d]$, and choose initial directions $\bar w_l = \bar w$ for $l \in [L-1]$ and $\bar w_L = 0$. Then, the linear coefficients $\beta_{\mathrm{diag}}(\Theta_{\mathrm{diag}}(t))$ converge to the solution $z^\infty$ of

$$\text{minimize}_{z \in \mathbb{R}^d} \ Q_{L,\alpha,\bar w}(z) := \alpha^2 \textstyle\sum_{j=1}^d [\bar w]_j^2 \, H_L\!\left( \frac{[z]_j}{\alpha^L |[\bar w]_j|^L} \right) \quad \text{subject to} \quad Xz = y.$$

Recall that the original statement of Assumption 1 allows the matrices $S, U_1, \dots, U_L$ to be complex, but Theorem 5 poses the additional assumption that these matrices are real. In applying Theorem 2 to convolutional networks to get Corollary 3, we used the fact that the data tensor $\mathbf{M}_{\mathrm{conv}}(x)$ of a linear full-length convolutional network satisfies Assumption 1 with $S = d^{\frac{L-1}{2}} F$ and $U_1 = \cdots = U_L = F^*$, where $F \in \mathbb{C}^{d \times d}$ is the matrix of discrete Fourier transform basis vectors $[F]_{j,k} = \frac{1}{\sqrt d}\exp(-\frac{\sqrt{-1} \cdot 2\pi (j-1)(k-1)}{d})$ and $F^*$ is the complex conjugate of $F$. Note that these are complex matrices, so one cannot directly apply Theorem 5 to convolutional networks. However, it turns out that if the data and initialization are even, we can derive a corollary for convolutional networks. We say that a vector is even when it satisfies even symmetry, as in even functions. More concretely, a vector $x \in \mathbb{R}^d$ is even if $[x]_{j+2} = [x]_{d-j}$ for $j = 0, \dots, \frac{d-3}{2}$; i.e., the vector has even symmetry around its "origin" $[x]_1$. From the definition of the matrix $F \in \mathbb{C}^{d \times d}$, it is straightforward to check that if $x$ is real and even, then its DFT $Fx$ is also real and even (see Appendix F.4 for details).

Corollary 6. Consider an $L$-layer linear full-length convolutional network (12). Assume that the data points $\{x_i\}_{i=1}^n$ are all even. For some $\lambda > 0$, choose any even vector $\bar w$ satisfying $[F\bar w]_j^2 \ge \lambda$ for all $j \in [d]$, and choose initial directions $\bar w_l = \bar w$ for $l \in [L-1]$ and $\bar w_L = 0$. Then, the linear coefficients $\beta_{\mathrm{conv}}(\Theta_{\mathrm{conv}}(t))$ converge to the solution $z^\infty$ of

$$\text{minimize}_{z \in \mathbb{R}^d, \ z \text{ even}} \ Q_{L,\alpha,F\bar w}(Fz) := \alpha^2 \textstyle\sum_{j=1}^d [F\bar w]_j^2 \, H_L\!\left( \frac{[Fz]_j}{\alpha^L |[F\bar w]_j|^L} \right) \quad \text{subject to} \quad Xz = y.$$

Corollaries 5 and 6 show that the interpolation between minimum weighted $\ell_1$ and weighted $\ell_2$ solutions occurs for diagonal networks, and also for convolutional networks (in the DFT domain, with the restriction of even symmetry). The conclusion of Corollary 5 is similar to the results in Woodworth et al. (2020), but the network architecture (11) considered in our corollary is slightly different from the "differential" network $f(x; w_+, w_-) = x^T(w_+^{\odot L} - w_-^{\odot L})$ in Woodworth et al. (2020).

As mentioned in the main text, we can actually show that the matrix sensing result in Arora et al. (2019b) is a special case of our Theorem 5. Given any symmetric matrix $M \in \mathbb{R}^{d \times d}$, let $\operatorname{eig}(M) \in \mathbb{R}^d$ be the $d$-dimensional vector of eigenvalues of $M$.

Corollary 7. Consider the depth-$L$ deep matrix sensing problem (13). Let the $A_i$'s be symmetric, and assume $A_1, \dots, A_n$ commute. For $\alpha > 0$, choose initialization $W_l(0) = \alpha I_d$ for $l \in [L-1]$ and $W_L(0) = 0$. Then, the product $W_L(t) \cdots W_1(t)$ converges to the solution $M^\infty$ of

$$\text{minimize}_{M \in \mathbb{R}^{d \times d}, \ \text{symmetric}} \ Q_{L,\alpha}(\operatorname{eig}(M)) := \alpha^2 \textstyle\sum_{j=1}^d H_L\!\left( \frac{[\operatorname{eig}(M)]_j}{\alpha^L} \right) \quad \text{subject to} \quad \mathcal{L}_{\mathrm{ms}}(M) = 0.$$

Under an additional assumption that the $A_i$'s are positive semidefinite, Theorem 2 in Arora et al. (2019b) studies the initialization $W_l(0) = \alpha I_d$ for all $l \in [L]$, and shows that the limit point of $W_L \cdots W_1$ converges to the minimum nuclear norm solution as $\alpha \to 0$. We remove the assumption of positive semidefiniteness of the $A_i$'s and let $W_L(0) = 0$, to show a complete characterization of the solution found by gradient flow, which interpolates between the minimum nuclear norm (i.e., Schatten 1-norm) solution (when $\alpha \to 0$) and the minimum Frobenius norm (i.e., Schatten 2-norm) solution (when $\alpha \to \infty$).
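A small numerical illustration of Corollary 7 (our own toy instance, with diagonal and hence commuting symmetric sensors, depth $L = 2$): with $W_1(0) = \alpha I$, $W_2(0) = 0$ and small $\alpha$, GD should approximately find the minimum nuclear norm solution of the sensing constraints.

```python
import numpy as np

d, n, alpha, eta = 3, 2, 0.01, 1e-2
rng = np.random.default_rng(6)
A = [np.diag(rng.normal(size=d)) for _ in range(n)]   # diagonal => symmetric, commuting
y = rng.normal(size=n)

W1, W2 = alpha * np.eye(d), np.zeros((d, d))          # W_1(0) = alpha*I, W_L(0) = 0
for _ in range(300_000):
    P = W1 @ W2
    r = np.array([np.sum(Ai * P) for Ai in A]) - y    # <A_i, P> - y_i
    G = sum(ri * Ai for ri, Ai in zip(r, A))          # gradient w.r.t. the product P
    g1, g2 = G @ W2.T, W1.T @ G
    W1 -= eta * g1; W2 -= eta * g2
# everything stays diagonal here, so eigenvalues = diagonal entries;
# for small alpha they should be sparse (~ min nuclear norm solution)
print(np.round(np.diag(W1 @ W2), 4))
```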

B TENSOR REPRESENTATION OF FULLY-CONNECTED NETWORKS

In Section 3, we only defined the data tensor $\mathbf{M}_{\mathrm{fc}}(x)$ of fully-connected networks for $L = 2$. Here, we describe an iterative procedure constructing the data tensor for deep fully-connected networks. We start with $T_1(x) := x \in \mathbb{R}^{d_1}$. Next, define a block diagonal matrix $T_2(x) \in \mathbb{R}^{d_1 d_2 \times d_2}$ whose "diagonals" are $[T_2(x)]_{d_1(j-1)+1:d_1 j, \, j} = T_1(x)$ for $j \in [d_2]$, while all the other entries are filled with 0. We continue this "block diagonal" procedure as follows. Having defined $T_{l-1}(x) \in \mathbb{R}^{d_1 d_2 \times \cdots \times d_{l-2} d_{l-1} \times d_{l-1}}$:

1. Define $T_l(x) \in \mathbb{R}^{d_1 d_2 \times \cdots \times d_{l-1} d_l \times d_l}$.
2. Set $[T_l(x)]_{\bullet, \dots, \bullet, \, d_{l-1}(j-1)+1:d_{l-1} j, \, j} = T_{l-1}(x)$ for all $j \in [d_l]$.
3. Set all the remaining entries of $T_l(x)$ to zero.

We repeat this process for $l = 2, \dots, L$, and set $\mathbf{M}_{\mathrm{fc}}(x) := T_L(x)$. By defining the parameters of the tensor formulation $v_l = \mathrm{vec}(W_l)$ for $l \in [L-1]$ and $v_L = w_L$, and using the tensor $\mathbf{M}(x) = \mathbf{M}_{\mathrm{fc}}(x)$, we can check the equivalence of (2) and (5).
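The recursion above is easy to mechanize; the sketch below (our own dimensions) builds $T_3(x) = \mathbf{M}_{\mathrm{fc}}(x)$ for $L = 3$ and checks the claimed equivalence against $x^T W_1 W_2 w_3$:

```python
import numpy as np

d1, d2, d3 = 3, 4, 2
rng = np.random.default_rng(5)
x = rng.normal(size=d1)

# recursive "block diagonal" construction of M_fc(x) for L = 3
T = x                                            # T_1(x)
for dl in [d2, d3]:
    prev = T.shape[-1]
    Tl = np.zeros(T.shape[:-1] + (prev * dl, dl))
    for j in range(dl):
        Tl[..., prev * j:prev * (j + 1), j] = T  # copy T_{l-1} into block j
    T = Tl
M = T                                            # M_fc(x), shape (d1*d2, d2*d3, d3)

W1, W2 = rng.normal(size=(d1, d2)), rng.normal(size=(d2, d3))
w3 = rng.normal(size=d3)
v1 = W1.reshape(-1, order='F')                   # vec() concatenates columns
v2 = W2.reshape(-1, order='F')
out = np.einsum('abc,a,b,c->', M, v1, v2, w3)    # M(x) . (v1, v2, v3)
print(np.isclose(out, x @ W1 @ W2 @ w3))         # True
```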

C PROOFS OF THEOREM 1 AND COROLLARY 1

C.1 PROOF OF THEOREM 1

The proof of Theorem 1 is outlined as follows. First, using the directional convergence and alignment results in Ji & Telgarsky (2020), we prove that each of our network parameters $v_l$ converges in direction, and that it aligns with its corresponding negative gradient $-\nabla_{v_l} \mathcal{L}$. Then, we prove that the directions of the $v_l$'s are actually singular vectors of $\mathbf{M}(-u^\infty)$, where $u^\infty := \lim_{t\to\infty} \frac{X^T r(t)}{\|X^T r(t)\|_2}$.

Since a linear tensor network is an $L$-homogeneous polynomial of $v_1, \dots, v_L$, it satisfies the assumptions required for Theorems 3.1 and 4.1 in Ji & Telgarsky (2020). These theorems imply that if the gradient flow satisfies $\mathcal{L}(\Theta(t_0)) < 1$ for some $t_0 \ge 0$, then $\Theta(t)$ converges in direction, and the direction aligns with $-\nabla_\Theta \mathcal{L}(\Theta(t))$; that is,

$$\lim_{t\to\infty} \|\Theta(t)\|_2 = \infty, \quad \lim_{t\to\infty} \frac{\Theta(t)}{\|\Theta(t)\|_2} = \Theta^\infty, \quad \lim_{t\to\infty} \frac{\Theta(t)^T \nabla_\Theta \mathcal{L}(\Theta(t))}{\|\Theta(t)\|_2 \|\nabla_\Theta \mathcal{L}(\Theta(t))\|_2} = -1. \tag{14}$$

For linear tensor networks (6), the parameter $\Theta$ is the concatenation of all parameter vectors $v_1, \dots, v_L$, so (14) holds for $\Theta = [v_1^T \ \cdots \ v_L^T]^T$. Now, recall that by the definition of the linear tensor network, we have the gradient flow dynamics

$$\dot v_l = \mathbf{M}(-X^T r) \bullet (v_1, \dots, v_{l-1}, I_{k_l}, v_{l+1}, \dots, v_L).$$

We can apply this to calculate the rate of growth of $\|v_l\|_2^2$: for any $l, l' \in [L]$,

$$\frac{d}{dt}\|v_l\|_2^2 = 2 v_l^T \dot v_l = 2 v_l^T \big( \mathbf{M}(-X^T r) \bullet (v_1, \dots, v_{l-1}, I_{k_l}, v_{l+1}, \dots, v_L) \big) = 2\, \mathbf{M}(-X^T r) \bullet (v_1, \dots, v_L) = \frac{d}{dt}\|v_{l'}\|_2^2,$$

so the rate at which $\|v_l\|_2^2$ grows over time is the same for all layers $l \in [L]$. By the definition of $\Theta$ and (14), we have $\|\Theta\|_2^2 = \sum_{l=1}^L \|v_l\|_2^2 \to \infty$, which combined with the equal growth rates implies

$$\lim_{t\to\infty} \|v_l(t)\|_2 = \infty, \quad \lim_{t\to\infty} \frac{\|\Theta(t)\|_2^2}{\|v_l(t)\|_2^2} = L, \quad \text{hence} \quad \lim_{t\to\infty} \frac{\|\Theta(t)\|_2}{\|v_l(t)\|_2} = \sqrt{L}, \quad \text{for all } l \in [L].$$

Now, let $\mathcal{I}_l$ be the set of indices that correspond to the components of $v_l$ in $\Theta$. It follows from (14) that

$$\lim_{t\to\infty} \frac{v_l(t)}{\|v_l(t)\|_2} = \lim_{t\to\infty} \frac{v_l(t)}{\|\Theta(t)\|_2} \cdot \frac{\|\Theta(t)\|_2}{\|v_l(t)\|_2} = \lim_{t\to\infty} \frac{[\Theta(t)]_{\mathcal{I}_l}}{\|\Theta(t)\|_2} \cdot \frac{\|\Theta(t)\|_2}{\|v_l(t)\|_2} = \sqrt{L}\, [\Theta^\infty]_{\mathcal{I}_l},$$

thus showing the directional convergence of the $v_l$'s. Next, it follows from the directional convergence of $\Theta$ and its alignment with $-\nabla_\Theta \mathcal{L}(\Theta)$ (14) that $\nabla_\Theta \mathcal{L}(\Theta)$ also converges in direction, in the opposite direction of $\Theta$. By comparing the components in the $\mathcal{I}_l$'s, we get that $\nabla_{v_l} \mathcal{L}(\Theta)$ converges in the opposite direction of $v_l$. For any $l \in [L]$, now let $v_l^\infty := \lim_{t\to\infty} \frac{v_l(t)}{\|v_l(t)\|_2}$. Also recall the assumption that $X^T r(t)$ converges in direction; let the unit vector $u^\infty := \lim_{t\to\infty} \frac{X^T r(t)}{\|X^T r(t)\|_2}$ be the limit direction. By the gradient flow dynamics of $v_l$, we have

$$v_l^\infty \propto -\nabla_{v_l} \mathcal{L}(\Theta^\infty) = \mathbf{M}(-u^\infty) \bullet (v_1^\infty, \dots, v_{l-1}^\infty, I_{k_l}, v_{l+1}^\infty, \dots, v_L^\infty), \quad \text{for all } l \in [L].$$

Note that this equation has the same form as (8), the definition of singular vectors of tensors. This proves that $(v_1^\infty, \dots, v_L^\infty)$ are singular vectors of $\mathbf{M}(-u^\infty)$.

C.2 PROOF OF COROLLARY 1

The proof proceeds as follows. First, we will show, using the structure of the data tensor $\mathbf{M}_{\mathrm{fc}}$, that the limit direction of the linear coefficients $\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty)$ is proportional to $c u^\infty$, where $c$ is a nonzero scalar and $u^\infty$ is the limit direction of $X^T r$. Then, through a closer look at $u^\infty$ and $c$, we will prove that $\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty)$ is in fact a conic combination of the support vectors (i.e., the data points with the minimum margins). Finally, we will compare $\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty)$ with the KKT conditions of the $\ell_2$ max-margin classification problem and conclude that $\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty)$ must be in the same direction as the $\ell_2$ max-margin classifier.

Due to the way the data tensor $\mathbf{M}_{\mathrm{fc}}$ is constructed for fully-connected networks (Appendix B), we always have

$$-\nabla_{v_1} \mathcal{L}(\Theta_{\mathrm{fc}}) = \mathbf{M}_{\mathrm{fc}}(-X^T r) \bullet (I_{k_1}, v_2, \dots, v_L) \in \operatorname{span}\left\{ \begin{bmatrix} X^T r \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ X^T r \\ \vdots \\ 0 \end{bmatrix}, \dots, \begin{bmatrix} 0 \\ 0 \\ \vdots \\ X^T r \end{bmatrix} \right\}.$$

From Theorem 1, we have directional convergence of $v_1$ and its alignment with $-\nabla_{v_1} \mathcal{L}(\Theta_{\mathrm{fc}})$. This means that the limit direction $v_1^\infty$, which is a fixed vector, must also be in the span of the vectors written above. This implies that $X^T r$ must also converge to some direction, say $u^\infty := \lim_{t\to\infty} \frac{X^T r(t)}{\|X^T r(t)\|_2}$. Now recall the definition of $v_1$ in case of the fully-connected network: $v_1 = \mathrm{vec}(W_1)$. By reshaping $v_1^\infty$ into its original $d_1 \times d_2$ matrix form $W_1^\infty$, we have $W_1^\infty \propto u^\infty q^T$ for some $q \in \mathbb{R}^{d_2}$. This implies that the linear coefficients $\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}})$ of the network converge in direction to

$$\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty) = W_1^\infty W_2^\infty \cdots W_{L-1}^\infty w_L^\infty \propto u^\infty q^T W_2^\infty \cdots W_{L-1}^\infty w_L^\infty = c u^\infty, \tag{15}$$

where $c$ is some nonzero real number.

Let us now take a closer look at the vector $u^\infty$, the limit direction of $X^T r$. Recall from Section 2.1 that for any $i \in [n]$, $[r]_i = -y_i \exp(-y_i f_{\mathrm{fc}}(x_i; \Theta_{\mathrm{fc}})) = -y_i \exp(-y_i x_i^T \beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}))$ in case of classification. Recall that $\|\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}(t))\|_2 \to \infty$ while converging to a certain direction $\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty)$. This means that if $y_j x_j^T \beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty) > y_i x_i^T \beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty)$ for any $i, j \in [n]$, then

$$\lim_{t\to\infty} \frac{\exp(-y_j x_j^T \beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}(t)))}{\exp(-y_i x_i^T \beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}(t)))} = 0. \tag{16}$$

Take $i$ to be the index of any support vector, i.e., any $i$ that attains the minimum $y_i x_i^T \beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty)$ among all data points. Using such an $i$, the observation (16) implies that, relative to the support-vector components, $[r(t)]_j$ vanishes for any $x_j$ that is not a support vector. Thus, by the argument above, $u^\infty$ can in fact be written as

$$u^\infty = \lim_{t\to\infty} \frac{\sum_{i=1}^n x_i [r(t)]_i}{\|\sum_{i=1}^n x_i [r(t)]_i\|_2} = -\sum_{i=1}^n \nu_i y_i x_i, \tag{17}$$

where $\nu_i \ge 0$ for all $i \in [n]$, and $\nu_j = 0$ for the $x_j$'s that are not support vectors. Combining (17) and (15),

$$\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty) \propto -c \sum_{i=1}^n \nu_i y_i x_i. \tag{18}$$

Recall that we do not yet know whether $c$, introduced in (15), is positive or negative; we now show that $c$ has to be negative. From Lyu & Li (2020), we know that $\mathcal{L}(\Theta_{\mathrm{fc}}(t)) \to 0$, which implies that $y_i x_i^T \beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty) > 0$ for all $i \in [n]$. However, if $c > 0$, then (18) implies that $\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty)$ is inside a cone $K$ defined as

$$K := \left\{ \textstyle\sum_{i=1}^n \gamma_i y_i x_i \;\middle|\; \gamma_i \le 0, \ \forall i \in [n] \right\}.$$

Note that the polar cone of $K$, denoted $K^\circ$, is

$$K^\circ := \{ z \mid \beta^T z \le 0, \ \forall \beta \in K \} = \{ z \mid y_i x_i^T z \ge 0, \ \forall i \in [n] \}.$$

It is known that $K \cap K^\circ = \{0\}$ for any convex cone $K$ and its polar cone $K^\circ$. Therefore, having $c > 0$ implies that $\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty) \in K \setminus K^\circ$, which means that there exists some $i \in [n]$ such that $y_i x_i^T \beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty) < 0$; this contradicts the fact that the loss goes to zero as $t \to \infty$.
Therefore, $c$ in (15) and (18) must be negative:

$$\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty) \propto \textstyle\sum_{i=1}^n \nu_i y_i x_i, \tag{19}$$

where $\nu_i \ge 0$ for all $i \in [n]$ and $\nu_j = 0$ for all $x_j$'s that are not support vectors. Finally, compare (19) with the KKT conditions of the following optimization problem:

$$\text{minimize}_z \ \|z\|_2^2 \quad \text{subject to} \quad y_i x_i^T z \ge 1, \ \forall i \in [n].$$

The KKT conditions of this problem are

$$z = \textstyle\sum_{i=1}^n \mu_i y_i x_i, \quad \mu_i \ge 0, \quad \mu_i (1 - y_i x_i^T z) = 0 \quad \text{for all } i \in [n],$$

where $\mu_1, \dots, \mu_n$ are the dual variables. Note that these are (up to scaling) satisfied by $\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty)$ (19), if we replace the $\mu_i$'s with the $\nu_i$'s. This finishes the proof that $\beta_{\mathrm{fc}}(\Theta_{\mathrm{fc}}^\infty)$ is aligned with the $\ell_2$ max-margin classifier.

D PROOFS OF THEOREM 2 AND COROLLARIES 2 & 3

D.1 PROOF OF THEOREM 2

D.1.1 CONVERGENCE OF LOSS TO ZERO

Since Theorem 2 does not assume the existence of $t_0 \ge 0$ satisfying $\mathcal{L}(\Theta(t_0)) < 1$, we first need to show that, given the conditions on initialization, the training loss $\mathcal{L}(\Theta(t))$ converges to zero. Recall from Section 2.1 that

$$\dot v_l = -\nabla_{v_l} \mathcal{L}(\Theta) = \mathbf{M}(-X^T r) \bullet (v_1, \dots, v_{l-1}, I_{k_l}, v_{l+1}, \dots, v_L).$$

Applying the structure (9) in Assumption 1, we get

$$\dot v_l = -\textstyle\sum_{j=1}^m [SX^T r]_j \big( v_1^T [U_1]_{\bullet,j} \otimes \cdots \otimes v_{l-1}^T [U_{l-1}]_{\bullet,j} \otimes [U_l]_{\bullet,j} \otimes v_{l+1}^T [U_{l+1}]_{\bullet,j} \otimes \cdots \otimes v_L^T [U_L]_{\bullet,j} \big) = -\textstyle\sum_{j=1}^m [SX^T r]_j \prod_{k \ne l} [U_k^T v_k]_j \, [U_l]_{\bullet,j}.$$

Left-multiplying $U_l^H$ (the conjugate transpose of $U_l$) on both sides, we get

$$U_l^H \dot v_l = -SX^T r \odot \bigodot_{k \ne l} U_k^T v_k, \tag{20}$$

where $\bigodot$ denotes the product taken with respect to entry-wise multiplication $\odot$. Now consider the rate of growth of the squared absolute value of the $j$-th component of $U_l^T v_l$:

$$\frac{d}{dt} |[U_l^T v_l]_j|^2 = \frac{d}{dt} \big( [U_l^T v_l]_j [U_l^T v_l]_j^* \big) = [U_l^T \dot v_l]_j [U_l^H v_l]_j + [U_l^H \dot v_l]_j [U_l^T v_l]_j = 2 \operatorname{Re}\big( [U_l^H \dot v_l]_j [U_l^T v_l]_j \big) = 2 \operatorname{Re}\Big( -[SX^T r]_j \prod_{k=1}^L [U_k^T v_k]_j \Big),$$

which is identical for every layer; so for any $j \in [m]$, the squared absolute values of the $j$-th components of $U_l^T v_l$ grow at the same rate for all layers $l \in [L]$. This means that the gap between any two different layers stays constant for all $t \ge 0$. Combining this with our conditions on the initial directions, we have

$$|[U_l^T v_l(t)]_j|^2 - |[U_L^T v_L(t)]_j|^2 = |[U_l^T v_l(0)]_j|^2 - |[U_L^T v_L(0)]_j|^2 = \alpha^2 |[U_l^T \bar v_l]_j|^2 - \alpha^2 |[U_L^T \bar v_L]_j|^2 \ge \alpha^2 \lambda, \tag{21}$$

for any $j \in [m]$, $l \in [L-1]$, and $t \ge 0$. This inequality also implies

$$|[U_l^T v_l(t)]_j|^2 \ge |[U_L^T v_L(t)]_j|^2 + \alpha^2 \lambda \ge \alpha^2 \lambda. \tag{22}$$

Let us now consider the time derivative of $\mathcal{L}(\Theta(t))$. We have the following chain of upper bounds:

$$\frac{d}{dt}\mathcal{L}(\Theta(t)) = \nabla_\Theta \mathcal{L}(\Theta(t))^T \dot\Theta(t) = -\|\nabla_\Theta \mathcal{L}(\Theta(t))\|_2^2 \le -\|\nabla_{v_L} \mathcal{L}(\Theta(t))\|_2^2 = -\|\dot v_L(t)\|_2^2 \overset{(a)}{\le} -\|U_L^H \dot v_L(t)\|_2^2 \overset{(b)}{=} -\Big\| SX^T r(t) \odot \bigodot_{k \ne L} U_k^T v_k(t) \Big\|_2^2 = -\sum_{j=1}^m |[SX^T r(t)]_j|^2 \prod_{k \ne L} |[U_k^T v_k(t)]_j|^2 \overset{(c)}{\le} -\alpha^{2L-2} \lambda^{L-1} \|SX^T r(t)\|_2^2 \overset{(d)}{\le} -\alpha^{2L-2} \lambda^{L-1} s_{\min}(S)^2 \|X^T r(t)\|_2^2, \tag{23}$$

where (a) used the fact that $\|\dot v_L(t)\|_2^2 \ge \|U_L U_L^H \dot v_L(t)\|_2^2$ because $U_L U_L^H$ is a projection onto a subspace, together with $\|U_L U_L^H \dot v_L(t)\|_2^2 = \|U_L^H \dot v_L(t)\|_2^2$ because $U_L^H U_L = I_m$; (b) is due to (20); (c) is due to (22); and (d) used the fact that $S \in \mathbb{C}^{m \times d}$ has full column rank, so for any $z \in \mathbb{C}^d$ we have $\|Sz\|_2 \ge s_{\min}(S) \|z\|_2$, where $s_{\min}(S)$ is the minimum singular value of $S$.

We now prove a lower bound on the quantity $\|X^T r(t)\|_2^2$. Recall from Section 2.1 the definition $[r(t)]_i = -y_i \exp(-y_i f(x_i; \Theta(t)))$ for classification problems. Also recall the assumption that the dataset is linearly separable, which means that there exist a unit vector $z \in \mathbb{R}^d$ and $\gamma > 0$ such that $y_i x_i^T z \ge \gamma$ holds for all $i \in [n]$. Using these,

$$\|X^T r(t)\|_2^2 = \Big\| \sum_{i=1}^n y_i x_i \exp(-y_i f(x_i; \Theta(t))) \Big\|_2^2 \ge \Big[ z^T \sum_{i=1}^n y_i x_i \exp(-y_i f(x_i; \Theta(t))) \Big]^2 \ge \gamma^2 \Big[ \sum_{i=1}^n \exp(-y_i f(x_i; \Theta(t))) \Big]^2 = \gamma^2 \mathcal{L}(\Theta(t))^2.$$

Combining this with (23), we get

$$\frac{d}{dt}\mathcal{L}(\Theta(t)) \le -\alpha^{2L-2} \lambda^{L-1} s_{\min}(S)^2 \gamma^2 \, \mathcal{L}(\Theta(t))^2,$$

which implies

$$\mathcal{L}(\Theta(t)) \le \frac{\mathcal{L}(\Theta(0))}{1 + \alpha^{2L-2} \lambda^{L-1} s_{\min}(S)^2 \gamma^2 \, \mathcal{L}(\Theta(0))\, t}.$$

Therefore, $\mathcal{L}(\Theta(t)) \to 0$ as $t \to \infty$.

D.1.2 CHARACTERIZING THE LIMIT DIRECTION

Since we proved that $\mathcal{L}(\Theta(t)) \to 0$, the argument in the proof of Theorem 1 applies to this case and shows that the parameters $v_l$ converge in direction and align with $\dot v_l = -\nabla_{v_l}\mathcal{L}(\Theta)$. Let $v_l^\infty := \lim_{t\to\infty}\frac{v_l(t)}{\|v_l(t)\|_2}$ be the limit direction of $v_l$. The remaining steps of the proof are as follows. We first prove that $SX^Tr(t)$ converges in direction to a limit $u^\infty$. Using this $u^\infty$, we derive a number of conditions that have to be satisfied by the limit directions of the parameters. Finally, we compare these conditions with the KKT conditions of a minimization problem, which finishes the proof.

By Assumption 1, we have
$$f(x;\Theta) = M(x)\bullet(v_1,\dots,v_L) = \sum_{j=1}^m[Sx]_j\prod_{l=1}^L[U_l^Tv_l]_j = x^TS^T\Big(\bigodot_{l\in[L]}U_l^Tv_l\Big) = x^TS^T\rho,$$
where we defined $\rho := \bigodot_{l\in[L]}U_l^Tv_l \in \mathbb{C}^m$. Since the linear coefficients must be real, we have $S^T\rho \in \mathbb{R}^d$ for any real $v_l$'s. Since the $v_l$'s converge in direction, $\rho$ also converges in direction, to $\rho^\infty := \bigodot_{l\in[L]}U_l^Tv_l^\infty$. So we can express the limit direction of $\beta(\Theta)$ as
$$\beta(\Theta^\infty) \propto S^T\bigodot_{l\in[L]}U_l^Tv_l^\infty = S^T\rho^\infty. \tag{24}$$
For $L = 2$, note also that
$$\lim_{t\to\infty}\frac{\big|[U_1^Tv_1(t)]_j\big|^2 + \big|[U_2^Tv_2(t)]_j\big|^2}{\big|[\rho(t)]_j\big|} = 2 \quad\text{if } \lim_{t\to\infty}\big|[U_2^Tv_2(t)]_j\big| = \infty.$$
Recall that we want to prove that (32) must necessarily hold. For the sake of contradiction, suppose that there exists $j\in[m]$ satisfying $[\rho^\infty]_j = 0$ but $|[u^\infty]_j| > |[u^\infty]_{j'}|$ for some $j'\in[m]$ satisfying $[\rho^\infty]_{j'} \neq 0$. Note that $[\rho^\infty]_{j'} \neq 0$ and $[\rho^\infty]_j = 0$ imply $|[\rho(t)]_{j'}| \to \infty$ and $\frac{|[\rho(t)]_j|}{|[\rho(t)]_{j'}|} \to 0$. We now compare the ratio (34) for $j$ and $j'$. First, note that
$$\lim_{t\to\infty}\frac{\big|[SX^Tr(t)]_j\big| / \|SX^Tr(t)\|_2}{\big|[SX^Tr(t)]_{j'}\big| / \|SX^Tr(t)\|_2} = \frac{|[u^\infty]_j|}{|[u^\infty]_{j'}|} > 1. \tag{36}$$
Next, using $\frac{|[\rho(t)]_j|}{|[\rho(t)]_{j'}|} \to 0$ and the fact that $x \mapsto \frac{2x^2+\delta}{x\sqrt{x^2+\delta}}$ is a decreasing function of $x \ge 0$ for any $\delta > 0$, we have
$$\frac{\big(|[U_1^Tv_1(t)]_j|^2 + |[U_2^Tv_2(t)]_j|^2\big)/|[\rho(t)]_j|}{\big(|[U_1^Tv_1(t)]_{j'}|^2 + |[U_2^Tv_2(t)]_{j'}|^2\big)/|[\rho(t)]_{j'}|} \ge 1 \tag{37}$$
for any $t \ge t_0$, when $t_0$ is large enough. Combining (36) and (37) to compare the ratio (34) for $j$ and $j'$, there exists some $t_0 \ge 0$ such that for any $t \ge t_0$, we have
$$\frac{\big|\frac{d}{dt}[\rho(t)]_j\big| \big/ |[\rho(t)]_j|}{\big|\frac{d}{dt}[\rho(t)]_{j'}\big| \big/ |[\rho(t)]_{j'}|} > 1. \tag{38}$$
That is, the ratio of the absolute value of the time derivative of $[\rho(t)]_j$ to the absolute value of $[\rho(t)]_j$ is strictly bigger than the corresponding ratio for $[\rho(t)]_{j'}$. Moreover, we saw in (33) that the phase of $\frac{d}{dt}[\rho(t)]_j$ converges to that of $-[u^\infty]_j^*$. Since this holds for all $t \ge t_0$, (38) results in a growth of $|[\rho(t)]_j|$ that is exponentially faster than that of $|[\rho(t)]_{j'}|$, so $[\rho(t)]_j$ becomes a dominant component of $\rho(t)$ as $t \to \infty$. This contradicts $[\rho^\infty]_j = 0$, hence condition (32) has to be satisfied.

So far, we have characterized a number of conditions (26), (28), (31), (32) that have to be satisfied by the limit directions $u^\infty$ and $\rho^\infty$ of $SX^Tr$ and $\rho$. We now consider the following optimization problem and prove that these conditions are in fact the KKT conditions of the problem:
$$\text{minimize}_{\rho\in\mathbb{C}^m} \ \|\rho\|_{2/L} \quad\text{subject to}\quad y_ix_i^TS^T\rho \ge 1, \ \forall i\in[n]. \tag{39}$$
The KKT conditions of this problem are
$$\partial\|\rho\|_{2/L} \ni S^*\sum_{i=1}^n\mu_iy_ix_i, \qquad \mu_i \ge 0, \qquad \mu_i(1 - y_ix_i^TS^T\rho) = 0 \ \text{ for all } i\in[n],$$
where $\mu_1,\dots,\mu_n$ are the dual variables.
The symbol $\partial\|\cdot\|_{2/L}$ denotes the (local) subdifferential of the $\ell_{2/L}$ norm¹, which can be written as
$$\partial\|\rho\|_1 = \big\{u\in\mathbb{C}^m \,\big|\, |[u]_j| \le 1 \text{ for all } j\in[m], \text{ and } [\rho]_j \neq 0 \implies [u]_j = \exp\!\big(\sqrt{-1}\arg([\rho]_j)\big)\big\}$$
if $L = 2$ (in this case $\partial\|\rho\|_1$ is the global subdifferential), and
$$\partial\|\rho\|_{2/L} = \Big\{u\in\mathbb{C}^m \,\Big|\, [\rho]_j \neq 0 \implies [u]_j = \tfrac{2}{L}\,|[\rho]_j|^{\frac{2}{L}-1}\exp\!\big(\sqrt{-1}\arg([\rho]_j)\big)\Big\}$$
if $L > 2$. By replacing the $\mu_i$'s with the $\nu_i$'s defined in (26), we can check from (26), (28), (31), (32) that $\rho^\infty$ and $u^\infty$ satisfy the KKT conditions up to scaling. Therefore, by (24), $\beta(\Theta(t))$ converges in a direction aligned with $S^T\rho^\infty$, where $\rho^\infty$ is in turn aligned with a stationary point (a global minimum in case of $L = 2$) of the optimization problem (39). If $S$ is invertible, we get $S^{-T}\beta(\Theta^\infty) \propto \rho^\infty$; plugging this into (39) gives the last statement of the theorem.
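For the $L = 2$ diagonal case ($S = U_l = I$), problem (39) reduces to an $\ell_1$ max-margin problem, which can be solved directly as a linear program. The sketch below (ours, with hypothetical data) does so via the standard variable splitting $\rho = p - q$ with $p, q \ge 0$.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 30, 10
X = rng.normal(size=(n, d))
X[:, 0] = np.abs(X[:, 0]) + 1.0      # make the data separable by e_1
y = np.ones(n)

# min ||rho||_1 s.t. y_i x_i^T rho >= 1, with rho = p - q and p, q >= 0
c = np.ones(2 * d)                                 # objective: 1^T (p + q)
A_ub = -y[:, None] * np.hstack([X, -X])            # -(y_i x_i^T (p - q)) <= -1
b_ub = -np.ones(n)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * d))
rho = res.x[:d] - res.x[d:]
print(rho / np.linalg.norm(rho))    # sparse l1 max-margin direction
```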

D.2 PROOF OF COROLLARY 2

It suffices to prove that linear diagonal networks satisfy Assumption 1 with $S = I_d$. This is immediate: $M_{\rm diag}(x) \in \mathbb{R}^{d\times\cdots\times d}$ has $[M_{\rm diag}(x)]_{j,j,\dots,j} = [x]_j$ while all remaining entries are zero, so $M_{\rm diag}(x)$ satisfies Assumption 1 with $S = U_1 = \cdots = U_L = I_d$. A direct substitution into Theorem 2 gives the corollary.
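A quick numerical check (ours, with ad hoc sizes) of this structure for $L = 3$: contracting the diagonal data tensor $M_{\rm diag}(x)$ with $(v_1, v_2, v_3)$ gives exactly $x^T(v_1 \odot v_2 \odot v_3)$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x, v1, v2, v3 = (rng.normal(size=d) for _ in range(4))

M = np.zeros((d, d, d))
M[np.arange(d), np.arange(d), np.arange(d)] = x   # [M]_{j,j,j} = [x]_j

lhs = np.einsum("abc,a,b,c->", M, v1, v2, v3)     # M_diag(x) • (v_1, v_2, v_3)
rhs = x @ (v1 * v2 * v3)
print(np.isclose(lhs, rhs))                        # True
```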

D.3 PROOF OF COROLLARY 3

For full-length convolutional networks ($k_1 = \cdots = k_L = d$), we will prove that they satisfy Assumption 1 with $S = d^{\frac{L-1}{2}}F$ and $U_1 = \cdots = U_L = F^*$, where $F \in \mathbb{C}^{d\times d}$ is the matrix of the discrete Fourier transform basis $[F]_{j,k} = \frac{1}{\sqrt d}\exp\!\big({-\frac{\sqrt{-1}\cdot2\pi(j-1)(k-1)}{d}}\big)$ and $F^*$ is the complex conjugate of $F$. For simplicity of notation, define $\psi := \exp\!\big({-\frac{\sqrt{-1}\cdot2\pi}{d}}\big)$. With such matrices $S$ and $U_1,\dots,U_L$, we can write $M(x)$ as
$$M(x) = \sum_{j=1}^d[Sx]_j\big([U_1]_{\cdot,j}\otimes[U_2]_{\cdot,j}\otimes\cdots\otimes[U_L]_{\cdot,j}\big) = \sum_{j=1}^d d^{\frac{L-2}{2}}\sum_{k=1}^d[x]_k\,\psi^{(j-1)(k-1)}\begin{bmatrix}\psi^{0}/\sqrt d\\ \psi^{-(j-1)}/\sqrt d\\ \psi^{-2(j-1)}/\sqrt d\\ \vdots\\ \psi^{-(d-1)(j-1)}/\sqrt d\end{bmatrix}^{\otimes L},$$
where $a^{\otimes L}$ denotes the $L$-fold tensor product of $a$. We will show that $M(x) = M_{\rm conv}(x)$. For any $j_1,\dots,j_L\in[d]$,
$$[M(x)]_{j_1,\dots,j_L} = \frac1d\sum_{l=1}^d\sum_{k=1}^d[x]_k\,\psi^{(l-1)(k-1)}\,\psi^{-(l-1)(\sum_{q=1}^Lj_q-L)} = \frac1d\sum_{k=1}^d[x]_k\sum_{l=1}^d\psi^{(l-1)(k-1-\sum_{q=1}^Lj_q+L)}.$$
Recall that $\sum_{l=1}^d\psi^{(l-1)(k-1-\sum_{q=1}^Lj_q+L)} = d$ if $k-1-\sum_{q=1}^Lj_q+L$ is a multiple of $d$, and $0$ otherwise. Using this, we have
$$[M(x)]_{j_1,\dots,j_L} = [x]_{\left(\sum_{q=1}^Lj_q-L\right)\bmod d\,+\,1} = [M_{\rm conv}(x)]_{j_1,\dots,j_L}.$$
Hence, linear full-length convolutional networks satisfy Assumption 1 with $S = d^{\frac{L-1}{2}}F$. A direct substitution into Theorem 2, together with the fact that $|[Fz]_j| = |[F^*z]_j|$ for any real vector $z\in\mathbb{R}^d$, gives the corollary.

Since Theorem 3 does not assume the existence of $t_0\ge0$ satisfying $\mathcal{L}(\Theta(t_0)) < 1$, its proof likewise first shows that, given the conditions on initialization, the training loss converges to zero (see Appendix E.1.1). We now record the steps characterizing the limit direction. For $L = 2$ and $j\in[m]$, writing $\delta_j := [U_1^Tv_1(t)]_j^2 - [U_2^Tv_2(t)]_j^2 > 0$ for the time-invariant gap, we have
$$\frac{[U_1^Tv_1(t)]_j^2 + [U_2^Tv_2(t)]_j^2}{|[\rho(t)]_j|} = \frac{2[U_2^Tv_2(t)]_j^2+\delta_j}{\big|[U_2^Tv_2(t)]_j\big|\sqrt{[U_2^Tv_2(t)]_j^2+\delta_j}} \ge 2, \qquad \lim_{t\to\infty}\frac{[U_1^Tv_1(t)]_j^2+[U_2^Tv_2(t)]_j^2}{|[\rho(t)]_j|} = 2 \ \text{ if } \lim_{t\to\infty}\big|[U_2^Tv_2(t)]_j\big| = \infty.$$
Recall that we want to prove that (50) must necessarily hold. For the sake of contradiction, suppose that there exists $j\in[m]$ satisfying $[\rho^\infty]_j = 0$ but $|[s]_j| > |[s]_{j'}|$ for some $j'\in[m]$ satisfying $[\rho^\infty]_{j'} \neq 0$. Note that $[\rho^\infty]_{j'}\neq0$ and $[\rho^\infty]_j = 0$ imply $|[\rho(t)]_{j'}| \to \infty$ and $\frac{|[\rho(t)]_j|}{|[\rho(t)]_{j'}|} \to 0$. We now compare the ratio (51) for $j$ and $j'$. Using $\frac{|[\rho(t)]_j|}{|[\rho(t)]_{j'}|} \to 0$ and the fact that $x\mapsto\frac{2x^2+\delta}{x\sqrt{x^2+\delta}}$ is a decreasing function of $x\ge0$ for any $\delta>0$, we have
$$\frac{\big([U_1^Tv_1(t)]_j^2+[U_2^Tv_2(t)]_j^2\big)/|[\rho(t)]_j|}{\big([U_1^Tv_1(t)]_{j'}^2+[U_2^Tv_2(t)]_{j'}^2\big)/|[\rho(t)]_{j'}|} \ge 1 \tag{53}$$
for any $t\ge t_0$, when $t_0$ is large enough. Combining $\frac{|[s]_j|}{|[s]_{j'}|} > 1$ and (53) to compare the ratio (51) for $j$ and $j'$, there exists some $t_0\ge0$ such that for any $t\ge t_0$, we have
$$\frac{\big|\frac{d}{dt}[\rho(t)]_j\big| \big/ |[\rho(t)]_j|}{\big|\frac{d}{dt}[\rho(t)]_{j'}\big| \big/ |[\rho(t)]_{j'}|} > 1. \tag{54}$$
This implies that the ratio of the absolute value of the time derivative of $[\rho(t)]_j$ to the absolute value of $[\rho(t)]_j$ is strictly bigger than the corresponding ratio for $[\rho(t)]_{j'}$. Moreover, by the definition of $r(t)$, $\frac{d}{dt}[\rho(t)]_j$ does not change sign over time. Since this holds for all $t\ge t_0$, (54) results in a growth of $|[\rho(t)]_j|$ that is exponentially faster than that of $|[\rho(t)]_{j'}|$, so $[\rho(t)]_j$ becomes a dominant component of $\rho(t)$ as $t\to\infty$. This contradicts $[\rho^\infty]_j = 0$, hence condition (50) has to be satisfied.

So far, we have characterized conditions (47), (49), (50) that have to be satisfied by the limit direction $\rho^\infty$ of $\rho$. We now consider the following optimization problem and prove that these conditions are in fact the KKT conditions of the problem:
$$\text{minimize}_{\rho\in\mathbb{R}^m} \ \|\rho\|_1 \quad\text{subject to}\quad ys^T\rho \ge 1.$$
The KKT condition of this problem is $\partial\|\rho\|_1 \ni \mu ys$ for some dual variable $\mu \ge 0$, where the global subdifferential $\partial\|\cdot\|_1$ is defined as
$$\partial\|\rho\|_1 = \big\{u\in\mathbb{R}^m \,\big|\, |[u]_j| \le 1 \text{ for all } j\in[m], \text{ and } [\rho]_j \neq 0 \implies [u]_j = \operatorname{sign}([\rho]_j)\big\}.$$
We can check from (47), (49), (50) that $\rho^\infty$ satisfies the KKT condition up to scaling.
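The computation above rests on the standard fact that the DFT diagonalizes circular convolution; the paper's convolution differs from the textbook one by an index reflection, which is why the conjugate matrices $F^*$ appear. The sketch below (ours) checks the textbook identity numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
a, b = rng.normal(size=d), rng.normal(size=d)

# textbook circular convolution: [a * b]_i = sum_j a_j b_{(i - j) mod d}
conv = np.array([sum(a[j] * b[(i - j) % d] for j in range(d)) for i in range(d)])
print(np.allclose(np.fft.fft(conv), np.fft.fft(a) * np.fft.fft(b)))  # True
```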
Now, how do we characterize $v_1^\infty$ and $v_2^\infty$ in terms of $\rho^\infty$? Let $\eta_1^\infty := U_1^Tv_1^\infty$ and $\eta_2^\infty := U_2^Tv_2^\infty$. Then, $v_l^\infty = U_l\eta_l^\infty = U_lU_l^Tv_l^\infty$ holds because any component orthogonal to the column space of $U_l$ stays unchanged while the component in the column space of $U_l$ diverges to infinity. By (42), $|\eta_1^\infty| = |\eta_2^\infty| = |\rho^\infty|^{\odot1/2}$ (entry-wise), and by (45) we have $\operatorname{sign}(\eta_1^\infty) = \operatorname{sign}(y)\operatorname{sign}(\eta_2^\infty)$.

E.2 PROOF OF COROLLARY 4

The proof of Corollary 4 boils down to characterizing the SVD of $M_{\rm conv}(x)$.

E.2.1 THE $k_1 = 1$ CASE

First, it is straightforward to check that for $L = 2$ and $k_1 = 1$, we have $\beta_{\rm conv}(\Theta_{\rm conv}) = v_1v_2$. For $k_1 = 1$, the data tensor is simply $M_{\rm conv}(x) = x^T$. Thus, we have $U_1 = 1$, $U_2 = \frac{x}{\|x\|_2}$, and $s = \|x\|_2$. Substituting $U_1$ and $U_2$ into the theorem gives the condition on initial directions in Corollary 4. Also, the theorem tells us that the limit direction $v_2^\infty$ of $v_2$ satisfies $v_2^\infty \propto yv_1^\infty x$. Using this, it is easy to check that $\beta_{\rm conv}(\Theta_{\rm conv}^\infty) \propto v_1^\infty v_2^\infty \propto yx$.

E.2.2 THE $k_1 = 2$ CASE

First, it is straightforward to check that for $L = 2$ and $k_1 = 2$, we have
$$\beta_{\rm conv}(\Theta_{\rm conv}) = \begin{bmatrix}[v_1]_1 & 0 & 0 & \cdots & 0 & [v_1]_2\\ [v_1]_2 & [v_1]_1 & 0 & \cdots & 0 & 0\\ 0 & [v_1]_2 & [v_1]_1 & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & 0 & \cdots & [v_1]_1 & 0\\ 0 & 0 & 0 & \cdots & [v_1]_2 & [v_1]_1\end{bmatrix}v_2. \tag{56}$$
For $k_1 = 2$, by definition, the data tensor is $M_{\rm conv}(x) = \begin{bmatrix}x^T\\ \overleftarrow{x}^T\end{bmatrix}$, and it is straightforward to check that the SVD of this matrix is
$$M_{\rm conv}(x) = \begin{bmatrix}x^T\\ \overleftarrow{x}^T\end{bmatrix} = \begin{bmatrix}1/\sqrt2 & 1/\sqrt2\\ 1/\sqrt2 & -1/\sqrt2\end{bmatrix}\begin{bmatrix}\sqrt{\|x\|_2^2+x^T\overleftarrow{x}} & 0\\ 0 & \sqrt{\|x\|_2^2-x^T\overleftarrow{x}}\end{bmatrix}\begin{bmatrix}\frac{x^T+\overleftarrow{x}^T}{\sqrt2\sqrt{\|x\|_2^2+x^T\overleftarrow{x}}}\\[2pt] \frac{x^T-\overleftarrow{x}^T}{\sqrt2\sqrt{\|x\|_2^2-x^T\overleftarrow{x}}}\end{bmatrix},$$
so
$$U_1 = \begin{bmatrix}1/\sqrt2 & 1/\sqrt2\\ 1/\sqrt2 & -1/\sqrt2\end{bmatrix}, \qquad U_2 = \begin{bmatrix}\frac{x+\overleftarrow{x}}{\sqrt2\sqrt{\|x\|_2^2+x^T\overleftarrow{x}}} & \frac{x-\overleftarrow{x}}{\sqrt2\sqrt{\|x\|_2^2-x^T\overleftarrow{x}}}\end{bmatrix}, \qquad s = \begin{bmatrix}\sqrt{\|x\|_2^2+x^T\overleftarrow{x}}\\ \sqrt{\|x\|_2^2-x^T\overleftarrow{x}}\end{bmatrix}.$$
Substituting $U_1$ and $U_2$ into the theorem gives the conditions on initial directions. Also, note that the maximum singular value depends on the sign of $x^T\overleftarrow{x}$. Consider the optimization problem in the theorem statement:
$$\text{minimize}_{\rho\in\mathbb{R}^m} \ \|\rho\|_1 \quad\text{subject to}\quad ys^T\rho \ge 1.$$
If $x^T\overleftarrow{x} > 0$, then the solution $\rho^\infty$ of this problem is in the direction of $[y \ \ 0]$. Therefore, the limit directions $v_1^\infty$ and $v_2^\infty$ are of the form
$$v_1^\infty \propto c_1\begin{bmatrix}1\\1\end{bmatrix}, \qquad v_2^\infty \propto c_2(x+\overleftarrow{x}), \qquad\text{where } \operatorname{sign}(c_1)\operatorname{sign}(c_2) = \operatorname{sign}(y).$$
Using (56), it is straightforward to check that
$$\beta_{\rm conv}(\Theta_{\rm conv}^\infty) \propto y\begin{bmatrix}1 & 0 & 0 & \cdots & 0 & 1\\ 1 & 1 & 0 & \cdots & 0 & 0\\ 0 & 1 & 1 & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & 0 & \cdots & 1 & 0\\ 0 & 0 & 0 & \cdots & 1 & 1\end{bmatrix}(x+\overleftarrow{x}) = y(2x+\overleftarrow{x}+\overrightarrow{x}).$$
Similarly, if $x^T\overleftarrow{x} < 0$, then the solution $\rho^\infty$ is in the direction of $[0 \ \ y]$. Using (56), we have
$$\beta_{\rm conv}(\Theta_{\rm conv}^\infty) \propto y\begin{bmatrix}1 & 0 & 0 & \cdots & 0 & -1\\ -1 & 1 & 0 & \cdots & 0 & 0\\ 0 & -1 & 1 & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & 0 & \cdots & 1 & 0\\ 0 & 0 & 0 & \cdots & -1 & 1\end{bmatrix}(x-\overleftarrow{x}) = y(2x-\overleftarrow{x}-\overrightarrow{x}).$$

F PROOFS OF THEOREM 5, COROLLARIES 5, 6 & 7, AND LEMMA 4

F.1 PROOF OF LEMMA 4

In this subsection, we restate Lemma 4 and prove it.

Lemma 4. Consider the system of ODEs, where $p, q : \mathbb{R}\to\mathbb{R}$:
$$\dot p = p^{L-2}q, \qquad \dot q = p^{L-1}, \qquad p(0) = 1, \qquad q(0) = 0.$$
Then, the solutions $p_L(t)$ and $q_L(t)$ are continuous on their maximal interval of existence, which is of the form $(-c, c) \subset \mathbb{R}$ for some $c \in (0, \infty]$. Define $h_L(t) = p_L(t)^{L-1}q_L(t)$; then $h_L(t)$ is odd and strictly increasing, satisfying $\lim_{t\uparrow c}h_L(t) = \infty$ and $\lim_{t\downarrow-c}h_L(t) = -\infty$.

Proof. First, continuity (indeed, continuous differentiability) of $p(t)$ and $q(t)$ is immediate because the RHSs of the ODEs are differentiable in $p$ and $q$. Next, define $\tilde p(t) = p(-t)$ and $\tilde q(t) = -q(-t)$. Then, $\tilde p$ and $\tilde q$ are also solutions of the ODE because
$$\frac{d}{dt}\tilde p(t) = -\dot p(-t) = -p(-t)^{L-2}q(-t) = \tilde p(t)^{L-2}\tilde q(t), \qquad \frac{d}{dt}\tilde q(t) = \dot q(-t) = p(-t)^{L-1} = \tilde p(t)^{L-1}.$$
However, by the Picard-Lindelöf theorem, the solution is unique; this means that $p(t) = \tilde p(t) = p(-t)$ and $q(t) = \tilde q(t) = -q(-t)$, which proves that $p$ is even and $q$ is odd. It also implies that the domain of $p$ and $q$ has to be of the form $(-c, c)$ (i.e., symmetric around the origin) and that $h = p^{L-1}q$ is odd.

To show that $h$ is strictly increasing, it suffices to show that $p$ and $q$ are both strictly increasing on $[0, c)$. To this end, we show that $p(t) \ge 1$ for all $t\in[0, c)$. First, due to the initial condition $p(0) = 1$ and continuity of $p$, there exists $\epsilon_1 > 0$ such that $p(t) > 0$ for all $t\in[0,\epsilon_1) =: I_1$. This implies that $\dot q(t) = p(t)^{L-1} > 0$ for $t\in I_1\setminus\{0\}$, so $q$ is strictly increasing on $I_1$. Since $q(0) = 0$, we have $q(t) > 0$ for $t\in I_1\setminus\{0\}$, which then implies that $\dot p(t) = p(t)^{L-2}q(t) > 0$. Therefore, $p$ is also strictly increasing on $I_1$; this means $p(t) \ge 1$ for $t\in[0,\epsilon_1]$ because $p(0) = 1$. Now, due to $p(\epsilon_1) \ge 1$ and continuity of $p$, there exists $\epsilon_2 > \epsilon_1$ such that $p(t) > 0$ for all $t\in[\epsilon_1,\epsilon_2) =: I_2$. Applying the argument above to $I_2$ yields $p(t) \ge 1$ for $t\in[0,\epsilon_2]$. Repeating this until the end of the domain shows that $p(t) \ge 1$ holds for all $t\in[0,c)$. By $p \ge 1$, we have $\dot q = p^{L-1} \ge 1$ on $[0,c)$, so $q$ is strictly increasing on $[0,c)$. Also, $q(t) > 0$ on $(0,c)$, so $\dot p = p^{L-2}q > 0$ on $(0,c)$ and $p$ is also strictly increasing on $[0,c)$. This proves that $h$ is strictly increasing on $[0,c)$, and also on $(-c,c)$ by oddness of $h$.

Finally, it is left to show $\lim_{t\uparrow c}h(t) = \infty$ and $\lim_{t\downarrow-c}h(t) = -\infty$. If $c < \infty$, suppose for contradiction that $\lim_{t\uparrow c}h(t) < \infty$; then $p$ and $q$ could be extended beyond $t \ge c$, which contradicts the fact that $(-c,c)$ is the maximal interval of existence of the solution. Next, consider the case $c = \infty$. From $p(t) \ge 1$, we have $\dot q(t) \ge 1$ for $t \ge 0$, which implies $q(t) \ge t$ for $t \ge 0$. Then $\dot p(t) = p(t)^{L-2}q(t) \ge t$, which gives $p(t) \ge \frac{t^2}{2}+1$ for $t \ge 0$. Therefore, we have
$$\lim_{t\to\infty}h(t) = \lim_{t\to\infty}p(t)^{L-1}q(t) \ge \lim_{t\to\infty}\Big(\frac{t^2}{2}+1\Big)^{L-1}t = \infty,$$
hence finishing the proof.
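The following sketch (ours; the depth $L$ and the time horizon are ad hoc, with the horizon chosen to stay inside the maximal interval of existence) integrates the ODE of Lemma 4 numerically and checks that $h_L = p_L^{L-1}q_L$ is odd and strictly increasing.

```python
import numpy as np
from scipy.integrate import solve_ivp

L = 4
def rhs(t, z):
    p, q = z
    return [p ** (L - 2) * q, p ** (L - 1)]

T = 0.8  # for L = 4 the solution blows up at t = 1, so stay inside (-1, 1)
kw = dict(dense_output=True, rtol=1e-10, atol=1e-12)
sol_fwd = solve_ivp(rhs, (0, T), [1.0, 0.0], **kw)
sol_bwd = solve_ivp(rhs, (0, -T), [1.0, 0.0], **kw)

def h(t):
    p, q = (sol_fwd if t >= 0 else sol_bwd).sol(t)
    return p ** (L - 1) * q

ts = np.linspace(0.05, T, 50)
print(all(abs(h(t) + h(-t)) < 1e-6 for t in ts))            # h is odd
print(all(h(t2) > h(t1) for t1, t2 in zip(ts, ts[1:])))     # h is increasing
```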

F.2 PROOF OF THEOREM 5

F.2.1 CONVERGENCE OF LOSS TO ZERO

We first show that, given the conditions on initialization, the training loss $\mathcal{L}(\Theta(t))$ converges to zero. Recall from Section 2.1 the gradient flow dynamics of the parameters; applying the structure of Assumption 1 as in (20) yields the analogue (57), and the conservation argument of Appendix D.1.1 gives the lower bound (58). Let us now consider the time derivative of $\mathcal{L}(\Theta(t))$. We have the following chain of upper bounds:
$$\frac{d}{dt}\mathcal{L}(\Theta(t)) = \nabla_\Theta\mathcal{L}(\Theta(t))^T\dot\Theta(t) = -\|\nabla_\Theta\mathcal{L}(\Theta(t))\|_2^2 \le -\|\nabla_{v_L}\mathcal{L}(\Theta(t))\|_2^2 = -\|\dot v_L(t)\|_2^2 \overset{(a)}{\le} -\|U_L^T\dot v_L(t)\|_2^2 \overset{(b)}{=} -\Big\|SX^Tr(t)\odot\bigodot_{k\neq L}U_k^Tv_k(t)\Big\|_2^2 = -\sum_{j=1}^m[SX^Tr(t)]_j^2\prod_{k\neq L}[U_k^Tv_k(t)]_j^2 \overset{(c)}{\le} -\alpha^{2L-2}\lambda^{L-1}\sum_{j=1}^m[SX^Tr(t)]_j^2 = -\alpha^{2L-2}\lambda^{L-1}\|SX^Tr(t)\|_2^2 \overset{(d)}{\le} -\alpha^{2L-2}\lambda^{L-1}s_{\min}(S)^2s_{\min}(X)^2\|r(t)\|_2^2 = -2\alpha^{2L-2}\lambda^{L-1}s_{\min}(S)^2s_{\min}(X)^2\mathcal{L}(\Theta(t)), \tag{59}$$
where (a) used the fact that $\|\dot v_L(t)\|_2^2 \ge \|U_LU_L^T\dot v_L(t)\|_2^2$ because it is a projection onto a subspace, and $\|U_LU_L^T\dot v_L(t)\|_2^2 = \|U_L^T\dot v_L(t)\|_2^2$ because $U_L^TU_L = I_{k_L}$; (b) is due to (57); (c) is due to (58); and (d) used the fact that $S\in\mathbb{R}^{m\times d}$ and $X^T\in\mathbb{R}^{d\times n}$ are matrices with full column rank, so for any $z\in\mathbb{R}^n$ we can use $\|SX^Tz\|_2 \ge s_{\min}(S)s_{\min}(X)\|z\|_2$, where $s_{\min}(\cdot)$ denotes the minimum singular value of a matrix. From (59), we get
$$\mathcal{L}(\Theta(t)) \le \mathcal{L}(\Theta(0))\exp\!\big({-2\alpha^{2L-2}\lambda^{L-1}s_{\min}(S)^2s_{\min}(X)^2t}\big),$$
so that $\mathcal{L}(\Theta(t)) \to 0$ as $t\to\infty$.

F.2.2 CHARACTERIZING THE LIMIT POINT

Now, we move on to characterize the limit points of the gradient flow. First, by defining a "transformed" version of the parameters $\eta_l(t) := U_l^Tv_l(t)$ and using (57), one can define an equivalent system of ODEs:
$$\dot\eta_l = -SX^Tr\odot\bigodot_{k\neq l}\eta_k \ \text{ for } l\in[L], \qquad \eta_l(0) = \alpha\bar\eta \ \text{ for } l\in[L-1], \qquad \eta_L(0) = 0. \tag{61}$$
Using Lemma 4, it is straightforward to verify that the solution to (61) has the following form, where $p_L$ and $q_L$ are applied entry-wise. For odd $L$, we have
$$\eta_l(t) = \alpha\bar\eta\odot p_L\Big({-\alpha^{L-2}|\bar\eta|^{\odot(L-2)}\odot SX^T\textstyle\int_0^tr(\tau)d\tau}\Big) \ \text{ for } l\in[L-1], \qquad \eta_L(t) = \alpha|\bar\eta|\odot q_L\Big({-\alpha^{L-2}|\bar\eta|^{\odot(L-2)}\odot SX^T\textstyle\int_0^tr(\tau)d\tau}\Big). \tag{62}$$
Similarly, for even $L$, the solution of (61) satisfies
$$\eta_l(t) = \alpha\bar\eta\odot p_L\Big({-\alpha^{L-2}\bar\eta^{\odot(L-2)}\odot SX^T\textstyle\int_0^tr(\tau)d\tau}\Big) \ \text{ for } l\in[L-1], \qquad \eta_L(t) = \alpha\bar\eta\odot q_L\Big({-\alpha^{L-2}\bar\eta^{\odot(L-2)}\odot SX^T\textstyle\int_0^tr(\tau)d\tau}\Big). \tag{63}$$
Now that we know the form of the solutions $\eta_l$, let us see how they relate to the linear coefficients of the network. By Assumption 1, we have
$$f(x;\Theta) = M(x)\bullet(v_1,\dots,v_L) = \sum_{j=1}^m[Sx]_j\prod_{l=1}^L[U_l^Tv_l]_j = x^TS^T\bigodot_{l\in[L]}\eta_l.$$
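For $L = 2$, the Lemma 4 system reduces to $\dot p = q$, $\dot q = p$ with $p(0) = 1$, $q(0) = 0$, whose solution is $p_2 = \cosh$, $q_2 = \sinh$; this is how (62) and (63) specialize to the hyperbolic form used in Appendix G.2. A short numerical confirmation (ours):

```python
import numpy as np
from scipy.integrate import solve_ivp

rhs = lambda t, z: [z[1], z[0]]   # p' = q, q' = p  (the L = 2 case)
sol = solve_ivp(rhs, (0, 2), [1.0, 0.0], dense_output=True,
                rtol=1e-10, atol=1e-12)
ts = np.linspace(0, 2, 9)
p, q = sol.sol(ts)
print(np.allclose(p, np.cosh(ts), atol=1e-6),
      np.allclose(q, np.sinh(ts), atol=1e-6))   # True True
```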

F.5 PROOF OF COROLLARY 7

Since the sensor matrices $A_1,\dots,A_n$ commute, they are simultaneously diagonalizable by a real orthogonal matrix $U\in\mathbb{R}^{d\times d}$, i.e., the $U^TA_iU$'s are diagonal matrices. From the deep matrix sensing problem (13), we can compute $\nabla_{W_l}\mathcal{L}_{\rm ms}$, which gives the gradient flow dynamics of $W_l$:
$$\dot W_l = -\nabla_{W_l}\mathcal{L}_{\rm ms} = -W_{l-1}^T\cdots W_1^T\Big(\sum_{i=1}^nr_iA_i\Big)W_L^T\cdots W_{l+1}^T,$$
where $r_i = \langle A_i, W_1\cdots W_L\rangle - y_i$ is the residual for the $i$-th sensor matrix. Left-multiplying by $U^T$ and right-multiplying by $U$ on both sides, we get
$$U^T\dot W_lU = -U^TW_{l-1}^TU\cdots U^TW_1^TU\Big(\sum_{i=1}^nr_iU^TA_iU\Big)U^TW_L^TU\cdots U^TW_{l+1}^TU. \tag{67}$$
If $U^TW_k^TU$ is a diagonal matrix for all $k\neq l$, then $U^T\dot W_lU$ is also a diagonal matrix. Note also that, since $W_l(0) = \alpha I_d = \alpha UU^T$ for $l\in[L-1]$, the product $U^TW_lU$ is a diagonal matrix at initialization. These observations imply that the $W_l(t)$'s remain simultaneously diagonalizable by $U$ for all $t\ge0$. Now, define $v_l(t) := \operatorname{eig}(W_l(t))$, i.e., $U^TW_lU = \operatorname{diag}(v_l)$, and let $x_i := \operatorname{eig}(A_i)$. Then, (67) can be written as $\dot v_l = -\big(\sum_{i=1}^nr_ix_i\big)\odot\bigodot_{k\neq l}v_k$. Therefore, this is equivalent to the regression problem with linear diagonal networks, initialized at $v_l(0) = \alpha\mathbf{1}$ for $l\in[L-1]$ and $v_L(0) = 0$. Given this equivalence, Corollary 7 follows from Corollary 5.
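A small simulation (ours; sizes, initialization scale, and step size are ad hoc) of the invariance just proved: with commuting symmetric sensor matrices $A_i = UD_iU^T$, gradient descent on deep matrix sensing keeps every $U^TW_lU$ diagonal up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, L, alpha, lr = 5, 3, 3, 0.1, 1e-3
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))         # common eigenbasis U
A = [Q @ np.diag(rng.normal(size=d)) @ Q.T for _ in range(n)]
y = rng.normal(size=n)
W = [alpha * np.eye(d) for _ in range(L)]

def prod(mats):
    out = np.eye(d)
    for M in mats:
        out = out @ M
    return out

for _ in range(500):
    P = prod(W)                                       # W_1 ... W_L
    r = np.array([np.sum(Ai * P) - yi for Ai, yi in zip(A, y)])
    G = sum(ri * Ai for ri, Ai in zip(r, A))          # sum_i r_i A_i
    grads = [prod(W[:l]).T @ G @ prod(W[l + 1:]).T for l in range(L)]
    for l in range(L):
        W[l] = W[l] - lr * grads[l]

off = max(np.abs(Q.T @ Wl @ Q - np.diag(np.diag(Q.T @ Wl @ Q))).max()
          for Wl in W)
print(off)   # floating-point tiny: every U^T W_l U stays diagonal
```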

G.2 LIMIT POINT

Now, we move on to characterize the limit points of the gradient flow. First, note that any changes made in $v_l$ over time lie in the subspace spanned by the columns of $U_l$. Therefore, any component of the initialization $v_l(0) = \alpha\bar v_l$ that is orthogonal to the column space of $U_l$ stays constant, and we can focus on the evolution of $v_l$ inside the column space of $U_l$. This can be done by defining a "transformed" version of the parameters $\eta_l(t) := U_l^Tv_l(t)$; using (69), one can define an equivalent system of ODEs:
$$\dot\eta_1 = -rs\odot\eta_2, \qquad \dot\eta_2 = -rs\odot\eta_1, \qquad \eta_1(0) = \alpha\bar\eta_1, \qquad \eta_2(0) = \alpha\bar\eta_2, \tag{72}$$
where $\bar\eta_1 := U_1^T\bar v_1$ and $\bar\eta_2 := U_2^T\bar v_2$. It is straightforward to verify that the solution to (72) has the following form, with $\cosh$ and $\sinh$ applied entry-wise:
$$\eta_1(t) = \alpha\bar\eta_1\odot\cosh\Big({-s\textstyle\int_0^tr(\tau)d\tau}\Big) + \alpha\bar\eta_2\odot\sinh\Big({-s\textstyle\int_0^tr(\tau)d\tau}\Big), \qquad \eta_2(t) = \alpha\bar\eta_1\odot\sinh\Big({-s\textstyle\int_0^tr(\tau)d\tau}\Big) + \alpha\bar\eta_2\odot\cosh\Big({-s\textstyle\int_0^tr(\tau)d\tau}\Big). \tag{73}$$
By the convergence of the loss to zero (71), we have $\lim_{t\to\infty}f(x;\Theta(t)) = y$. Note that $f(x;\Theta(t))$ can be written as
$$f(x;\Theta(t)) = M(x)\bullet(v_1(t),v_2(t)) = v_1(t)^TM(x)v_2(t) = v_1(t)^TU_1\operatorname{diag}(s)U_2^Tv_2(t) = s^T(\eta_1(t)\odot\eta_2(t)).$$
Therefore, using the identities $2\sinh\theta\cosh\theta = \sinh2\theta$ and $\cosh^2\theta+\sinh^2\theta = \cosh2\theta$,
$$\lim_{t\to\infty}f(x;\Theta(t)) = \alpha^2\sum_{j=1}^m[s]_j\Big(\frac{[\bar\eta_1]_j^2+[\bar\eta_2]_j^2}{2}\sinh(2[s]_j\nu) + [\bar\eta_1]_j[\bar\eta_2]_j\cosh(2[s]_j\nu)\Big) = y, \tag{74}$$
where we defined $\nu := -\int_0^\infty r(\tau)d\tau$. Consider the function $\nu\mapsto a\sinh(\nu)+b\cosh(\nu)$, which is strictly increasing if $a > |b|$. Note also that
$$\frac{[\bar\eta_1]_j^2+[\bar\eta_2]_j^2}{2} \ge \big|[\bar\eta_1]_j[\bar\eta_2]_j\big|, \tag{75}$$
which holds with equality if and only if $|[\bar\eta_1]_j| = |[\bar\eta_2]_j|$. However, recall from our assumptions on initialization that $[\bar\eta_1]_j^2-[\bar\eta_2]_j^2 \ge \lambda > 0$, so (75) can only hold with strict inequality. Therefore,
$$g(\nu) := \sum_{j=1}^m[s]_j\Big(\frac{[\bar\eta_1]_j^2+[\bar\eta_2]_j^2}{2}\sinh(2[s]_j\nu)+[\bar\eta_1]_j[\bar\eta_2]_j\cosh(2[s]_j\nu)\Big)$$
is a strictly increasing (hence invertible) function, being a sum of $m$ strictly increasing functions. Using $g(\nu)$, (74) can be written as $\alpha^2g(\nu) = y$, and by inverting $g$ we have
$$\nu = -\int_0^\infty r(\tau)d\tau = g^{-1}\Big(\frac{y}{\alpha^2}\Big). \tag{76}$$
Plugging (76) into (73),
$$\lim_{t\to\infty}v_1(t) = U_1\lim_{t\to\infty}\eta_1(t) + \alpha(I_{k_1}-U_1U_1^T)\bar v_1 = \alpha U_1\Big(\bar\eta_1\odot\cosh\big(g^{-1}\big(\tfrac{y}{\alpha^2}\big)s\big)+\bar\eta_2\odot\sinh\big(g^{-1}\big(\tfrac{y}{\alpha^2}\big)s\big)\Big)+\alpha(I_{k_1}-U_1U_1^T)\bar v_1,$$
$$\lim_{t\to\infty}v_2(t) = U_2\lim_{t\to\infty}\eta_2(t) + \alpha(I_{k_2}-U_2U_2^T)\bar v_2 = \alpha U_2\Big(\bar\eta_1\odot\sinh\big(g^{-1}\big(\tfrac{y}{\alpha^2}\big)s\big)+\bar\eta_2\odot\cosh\big(g^{-1}\big(\tfrac{y}{\alpha^2}\big)s\big)\Big)+\alpha(I_{k_2}-U_2U_2^T)\bar v_2.$$
This finishes the proof.

H PROOF OF THEOREM 7

H.1 CONVERGENCE OF LOSS TO ZERO

We first show that, given the conditions on initialization, the training loss $\mathcal{L}(\Theta(t))$ converges to zero. Recall from (10) that the linear fully-connected network can be written as $f_{\rm fc}(x;\Theta_{\rm fc}) = x^TW_1W_2\cdots W_{L-1}w_L$. From the definition of the training loss $\mathcal{L}$, it is straightforward to check that the gradient flow dynamics read
$$\dot W_l = -\nabla_{W_l}\mathcal{L}(\Theta_{\rm fc}) = -W_{l-1}^T\cdots W_1^TX^Trw_L^TW_{L-1}^T\cdots W_{l+1}^T \ \text{ for } l\in[L-1], \qquad \dot w_L = -\nabla_{w_L}\mathcal{L}(\Theta_{\rm fc}) = -W_{L-1}^T\cdots W_1^TX^Tr, \tag{77}$$
with $W_l(0) = \alpha\bar W_l$ for $l\in[L-1]$ and $w_L(0) = \alpha\bar w_L$, where $r\in\mathbb{R}^n$ is the residual vector satisfying $[r]_i = f_{\rm fc}(x_i;\Theta_{\rm fc})-y_i$, as defined in Section 2.1. From (77), we have
$$W_l^T\dot W_l = \dot W_{l+1}W_{l+1}^T = -W_l^T\cdots W_1^TX^Trw_L^TW_{L-1}^T\cdots W_{l+1}^T, \qquad \dot W_l^TW_l = W_{l+1}\dot W_{l+1}^T = -W_{l+1}\cdots W_{L-1}w_Lr^TXW_1\cdots W_l,$$
for any $l\in[L-2]$.
From this, we have $\frac{d}{dt}W_l^TW_l = \frac{d}{dt}W_{l+1}W_{l+1}^T$, and thus
$$W_l(t)^TW_l(t) - W_{l+1}(t)W_{l+1}(t)^T = W_l(0)^TW_l(0) - W_{l+1}(0)W_{l+1}(0)^T = \alpha^2\bar W_l^T\bar W_l - \alpha^2\bar W_{l+1}\bar W_{l+1}^T \tag{78}$$
for any $l\in[L-2]$. Similarly, we have
$$W_{L-1}(t)^TW_{L-1}(t) - w_L(t)w_L(t)^T = W_{L-1}(0)^TW_{L-1}(0) - w_L(0)w_L(0)^T = \alpha^2\bar W_{L-1}^T\bar W_{L-1} - \alpha^2\bar w_L\bar w_L^T. \tag{79}$$
Let us now consider the time derivative of $\mathcal{L}(\Theta_{\rm fc}(t))$. We have the following chain of upper bounds:
$$\frac{d}{dt}\mathcal{L}(\Theta_{\rm fc}(t)) = \nabla_{\Theta_{\rm fc}}\mathcal{L}(\Theta_{\rm fc}(t))^T\dot\Theta_{\rm fc}(t) = -\|\nabla_{\Theta_{\rm fc}}\mathcal{L}(\Theta_{\rm fc}(t))\|_2^2 \le -\|\nabla_{w_L}\mathcal{L}(\Theta_{\rm fc}(t))\|_2^2 = -\|\dot w_L(t)\|_2^2 = -\|W_{L-1}^T\cdots W_1^TX^Tr\|_2^2. \tag{80}$$
Taking traces of both sides of (78) and (79) gives $\|W_l\|_F^2-\|W_{l+1}\|_F^2 = \alpha^2(\|\bar W_l\|_F^2-\|\bar W_{l+1}\|_F^2)$ and $\|W_{L-1}\|_F^2-\|w_L\|_2^2 = \alpha^2(\|\bar W_{L-1}\|_F^2-\|\bar w_L\|_2^2)$. Summing these equations for $l, l+1, \dots, L-1$, we get
$$\|W_l\|_F^2 - \|w_L\|_2^2 = \alpha^2\big(\|\bar W_l\|_F^2-\|\bar w_L\|_2^2\big). \tag{83}$$
Next, consider the operator norms (i.e., the maximum singular values), denoted $\|\cdot\|_2$, of the matrices:
$$\|W_l\|_2^2 \ge u_{l+1}^TW_l^TW_lu_{l+1} \overset{(e)}{=} u_{l+1}^TW_{l+1}W_{l+1}^Tu_{l+1} + \alpha^2u_{l+1}^T\big(\bar W_l^T\bar W_l-\bar W_{l+1}\bar W_{l+1}^T\big)u_{l+1} \ge \|W_{l+1}\|_2^2 - \alpha^2\big\|\bar W_l^T\bar W_l-\bar W_{l+1}\bar W_{l+1}^T\big\|_2 \quad\text{for } l\in[L-2],$$
$$\|W_{L-1}\|_2^2 \ge \frac{w_L^T}{\|w_L\|_2}W_{L-1}^TW_{L-1}\frac{w_L}{\|w_L\|_2} \overset{(f)}{=} \frac{w_L^T}{\|w_L\|_2}w_Lw_L^T\frac{w_L}{\|w_L\|_2} + \alpha^2\frac{w_L^T}{\|w_L\|_2}\big(\bar W_{L-1}^T\bar W_{L-1}-\bar w_L\bar w_L^T\big)\frac{w_L}{\|w_L\|_2} \ge \|w_L\|_2^2 - \alpha^2\big\|\bar W_{L-1}^T\bar W_{L-1}-\bar w_L\bar w_L^T\big\|_2,$$
where (e) used (78) and (f) used (79). Summing the inequalities gives
$$\|W_l\|_2^2 \ge \|w_L\|_2^2 - \alpha^2\sum_{k=l}^{L-1}\big\|\bar W_k^T\bar W_k-\bar W_{k+1}\bar W_{k+1}^T\big\|_2, \tag{84}$$
with the convention $\bar W_L := \bar w_L$ in the last summand. From (83) and (84), we get a bound on the gap between the squared Frobenius norm (the $\ell_2$ norm of the singular values) and the squared operator norm (the maximum singular value $s_l$) of $W_l$:
$$\|W_l(t)\|_F^2 - \|W_l(t)\|_2^2 \le \alpha^2\big(\|\bar W_l\|_F^2-\|\bar w_L\|_2^2\big) + \alpha^2\sum_{k=l}^{L-1}\big\|\bar W_k^T\bar W_k-\bar W_{k+1}\bar W_{k+1}^T\big\|_2, \tag{85}$$
which holds for any $t\ge0$. The gap (85) implies that each $W_l$, for $l\in[L-1]$, can be written as
$$W_l(t) = s_l(t)u_l(t)v_l(t)^T + O(\alpha^2). \tag{86}$$
Next, we show that the "adjacent" singular vectors $v_l$ and $u_{l+1}$ align with each other as $\alpha\to0$. To this end, we derive lower and upper bounds on the quantity $v_l^TW_{l+1}W_{l+1}^Tv_l$:
$$v_l^TW_{l+1}W_{l+1}^Tv_l = v_l^TW_l^TW_lv_l - \alpha^2v_l^T\big(\bar W_l^T\bar W_l-\bar W_{l+1}\bar W_{l+1}^T\big)v_l \ge \|W_l\|_2^2 - \alpha^2\big\|\bar W_l^T\bar W_l-\bar W_{l+1}\bar W_{l+1}^T\big\|_2 = s_l^2 - \alpha^2\big\|\bar W_l^T\bar W_l-\bar W_{l+1}\bar W_{l+1}^T\big\|_2, \tag{87}$$
$$v_l^TW_{l+1}W_{l+1}^Tv_l = s_{l+1}^2(v_l^Tu_{l+1})^2 + v_l^T\big(W_{l+1}W_{l+1}^T-s_{l+1}^2u_{l+1}u_{l+1}^T\big)v_l \le s_{l+1}^2(v_l^Tu_{l+1})^2 + \|W_{l+1}\|_F^2 - \|W_{l+1}\|_2^2. \tag{88}$$
Combining (87), (88), and (85), we get
$$s_l^2 \le s_{l+1}^2(v_l^Tu_{l+1})^2 + \alpha^2\big\|\bar W_l^T\bar W_l-\bar W_{l+1}\bar W_{l+1}^T\big\|_2 + \alpha^2\big(\|\bar W_{l+1}\|_F^2-\|\bar w_L\|_2^2\big) + \alpha^2\sum_{k=l+1}^{L-1}\big\|\bar W_k^T\bar W_k-\bar W_{k+1}\bar W_{k+1}^T\big\|_2. \tag{89}$$
Next, by a similar reasoning as (87), we have
$$s_l^2 \ge u_{l+1}^TW_l^TW_lu_{l+1} \ge s_{l+1}^2 - \alpha^2\big\|\bar W_l^T\bar W_l-\bar W_{l+1}\bar W_{l+1}^T\big\|_2. \tag{90}$$
Combining (89) and (90) and dividing both sides by $s_{l+1}^2$, we get
$$(v_l(t)^Tu_{l+1}(t))^2 \ge 1 - \frac{\alpha^2G_l}{s_{l+1}(t)^2} \quad\text{for } t\ge0, \tag{91}$$
where $G_l := 2\|\bar W_l^T\bar W_l-\bar W_{l+1}\bar W_{l+1}^T\|_2 + (\|\bar W_{l+1}\|_F^2-\|\bar w_L\|_2^2) + \sum_{k=l+1}^{L-1}\|\bar W_k^T\bar W_k-\bar W_{k+1}\bar W_{k+1}^T\|_2$. By a similar argument, we can also get
$$\frac{(v_{L-1}(t)^Tw_L(t))^2}{\|w_L(t)\|_2^2} \ge 1 - \frac{\alpha^2G_{L-1}}{\|w_L(t)\|_2^2}, \tag{92}$$
where $G_{L-1} := 2\|\bar W_{L-1}^T\bar W_{L-1}-\bar w_L\bar w_L^T\|_2$. From (91) and (92), we can see that as $\alpha\to0$, the inner products between adjacent singular vectors converge to $\pm1$, unless $s_2,\dots,s_{L-1},\|w_L\|_2$ also diminish to zero. So it remains to show that the singular values do not diminish to zero as $\alpha\to0$. To this end, recall that we proved in the previous subsection that $\lim_{t\to\infty}XW_1(t)\cdots W_{L-1}(t)w_L(t) = y$.
A necessary condition for this to hold is that
$$\frac{\|y\|_2}{\|X\|_2} \le \lim_{t\to\infty}\big\|W_1(t)\cdots W_{L-1}(t)w_L(t)\big\|_2 \le \lim_{t\to\infty}\prod_{l=1}^{L-1}s_l(t)\,\|w_L(t)\|_2.$$
This means that after converging to a global minimum of the problem (i.e., as $t\to\infty$), the product of the maximum singular values must be bounded from below by a constant independent of $\alpha$. Moreover, we can see from (87) and (90) that the gap between the squared maximum singular values of adjacent layers is bounded by $O(\alpha^2)$ for all $t\ge0$, so the maximum singular values become closer and closer to each other as $\alpha$ diminishes. This implies that the singular values stay bounded away from zero, and we have the alignment of singular vectors at convergence as $\alpha\to0$:
$$\lim_{\alpha\to0}\lim_{t\to\infty}(v_l(t)^Tu_{l+1}(t))^2 = 1 \ \text{ for } l\in[L-2], \qquad \lim_{\alpha\to0}\lim_{t\to\infty}\frac{(v_{L-1}(t)^Tw_L(t))^2}{\|w_L(t)\|_2^2} = 1. \tag{93}$$
So far, we saw from (86) that the $W_l(t)$'s become rank-1 matrices as $\alpha\to0$, and from (93) that the top singular vectors align with each other as $t\to\infty$ and $\alpha\to0$. These imply that, as $t\to\infty$ and $\alpha\to0$, $\beta_{\rm fc}(\Theta_{\rm fc})$ is a scalar multiple of $u_1$, the top left singular vector of $W_1$:
$$\lim_{\alpha\to0}\lim_{t\to\infty}\beta_{\rm fc}(\Theta_{\rm fc}(t)) = c\cdot\lim_{\alpha\to0}\lim_{t\to\infty}u_1(t), \ \text{ for some } c\in\mathbb{R}. \tag{94}$$
In light of (94), it remains to take a closer look at $u_1(t)$. Note from the gradient flow dynamics of $W_1$ that $\dot W_1$ is always a rank-1 matrix whose columns are in the row space of $X$, since $X^Tr\in\operatorname{row}(X)$. This implies that, if we decompose $W_1$ into two orthogonal components $W_1^{\perp}$ and $W_1^{\parallel}$, so that the columns of $W_1^{\parallel}$ are in $\operatorname{row}(X)$ and the columns of $W_1^{\perp}$ are in the orthogonal subspace $\operatorname{row}(X)^{\perp}$, we have
$$\dot W_1^{\perp} = 0, \qquad \dot W_1^{\parallel} = \dot W_1.$$
That is, any component $W_1^{\perp}(0)$ orthogonal to $\operatorname{row}(X)$ remains unchanged for all $t\ge0$, while the component $W_1^{\parallel}$ evolves under the gradient flow. Since $\|W_1^{\perp}(t)\|_F = \|W_1^{\perp}(0)\|_F \le \alpha\|\bar W_1\|_F$, the component of $W_1$ orthogonal to $\operatorname{row}(X)$ diminishes to zero as $\alpha\to0$. This means that in the limit $\alpha\to0$, the columns of $W_1$ lie entirely in $\operatorname{row}(X)$, which also means that
$$\lim_{\alpha\to0}\lim_{t\to\infty}\beta_{\rm fc}(\Theta_{\rm fc}(t)) \in \operatorname{row}(X).$$
However, recall that there is only one solution of $Xz = y$ in $\operatorname{row}(X)$: namely, $z = X^T(XX^T)^{-1}y$, the minimum $\ell_2$ norm solution. This finishes the proof.
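A minimal sketch (ours) of the theorem's prediction for $L = 3$: trained from a small near-orthogonal initialization with $w_L(0) = 0$ (in the spirit of the zero-asymmetric scheme), the network's linear coefficients land close to the minimum $\ell_2$-norm interpolator $X^T(XX^T)^{-1}y$. The match becomes exact only in the limit $\alpha\to0$; the scale, step size, and iteration count below are ad hoc.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha, lr = 5, 20, 0.3, 5e-3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
W1 = alpha * np.linalg.qr(rng.normal(size=(d, d)))[0]
W2 = alpha * np.linalg.qr(rng.normal(size=(d, d)))[0]
wL = np.zeros(d)                      # zero-asymmetric-style last layer

for _ in range(50000):
    r = X @ W1 @ W2 @ wL - y
    G = X.T @ r
    gW1 = np.outer(G, W2 @ wL)        # grad wrt W1
    gW2 = W1.T @ np.outer(G, wL)      # grad wrt W2
    gwL = W2.T @ W1.T @ G             # grad wrt w3
    W1 -= lr * gW1; W2 -= lr * gW2; wL -= lr * gwL

beta = W1 @ W2 @ wL
beta_min = X.T @ np.linalg.solve(X @ X.T, y)
print(np.linalg.norm(beta - beta_min) / np.linalg.norm(beta_min))
# small (a few percent at alpha = 0.3; tends to 0 as alpha -> 0)
```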



¹ The definition of subdifferentials used here is taken from Gunasekar et al. (2018b).





(… $\succeq \lambda I_d$) for global convergence is a generalization of the zero-asymmetric initialization scheme ($\bar W_1 = \cdots = \bar W_{L-1} = I_d$ and $\bar w_L = 0$) proposed in Wu et al. (2019).

where $w_l \in \mathbb{R}^{k_l}$ with $k_l \le d$ and $k_L = d$, and $\circledast$ denotes the circular convolution: for any $a\in\mathbb{R}^d$ and $b\in\mathbb{R}^k$ ($k \le d$), we have $a\circledast b\in\mathbb{R}^d$ defined as $[a\circledast b]_i = \sum_{j=1}^k[a]_{(i+j-1)\bmod d}[b]_j$ for $i\in[d]$. In case of $k_l = d$ for all $l\in[L]$, we refer to this network as a full-length convolutional network.

Deep matrix sensing. The deep matrix sensing problem considered in Gunasekar et al. (2017); Arora et al. (2019b) aims to minimize the following problem: minimize over $W_1,\dots,W_L\in\mathbb{R}^{d\times d}$ …

E PROOFS OF THEOREM 3 AND COROLLARY 4

E.1 PROOF OF THEOREM 3

E.1.1 CONVERGENCE OF LOSS TO ZERO




From (20) and the alignment of $v_l$ and $\dot v_l$, we have (25). Since all vectors $U_l^Tv_l(t)$ converge in direction, the term $SX^Tr(t)$ must also converge in direction. Let $u^\infty := \lim_{t\to\infty}\frac{SX^Tr(t)}{\|SX^Tr(t)\|_2}$. One can use the same argument as in Appendix C.2, more specifically (16) and (17), to show that $u^\infty$ can be written as a nonnegative combination of the (transformed) support vectors, with coefficients $\nu_i \ge 0$ for all $i\in[n]$, and $\nu_j = 0$ for the $x_j$'s that are not support vectors, i.e., those satisfying $y_jx_j^TS^T\rho^\infty > \min_{i\in[n]}y_ix_i^TS^T\rho^\infty$. Using $u^\infty$, we can rewrite (25) as (26), which holds for all $l\in[L]$. Element-wise multiplying $U_l^Tv_l^\infty$ to both sides gives (27), where $a^{\odot b}$ denotes the element-wise $b$-th power of the vector $a$. Since the LHS of (27) is a positive real number, we have (28); using this, (27) becomes (29). Now element-wise multiplying (29) over all $l\in[L]$ gives (30). A close look at (30) reveals that if $L \ge 2$, then $\rho^\infty$ and $u^\infty$ must satisfy (31) for all $j\in[m]$. There is another condition (32) that has to be satisfied when $L = 2$, for any $j, j'\in[m]$; let us prove why. First, consider the time derivative of $[\rho(t)]_j$ in (33), where step (a) used (20). Now consider the ratio (34); we want to compare this quantity for different $j, j'\in[m]$. Before we do that, we take a look at the last term on the RHS of (34). Recall from (21) that the squared components of different layers differ by time-invariant gaps; for simplicity, let $\delta_j$ denote this gap, which is a positive number due to our assumption on initialization. Then, we can use (35) to obtain the bound needed for the comparison above.

For the proof of Theorem 3 (classification with a single data point $(x, y)$), we can write the gradient flow dynamics from Section 2.1 as $\dot v_1 = -M(X^Tr)\bullet(I_{k_1}, v_2)$ and $\dot v_2 = -M(X^Tr)\bullet(v_1, I_{k_2})$ (40), where $r(t) = -y\exp(-yf(x;\Theta(t)))$ is the residual of the data point $(x, y)$. From (40) we get (41). Now consider the rate of growth of the $j$-th components: as in (21), $[U_1^Tv_1]_j^2$ and $[U_2^Tv_2]_j^2$ grow at the same rate, so the gap between the two layers stays constant for all $t\ge0$. Combining this with our conditions on initial directions gives (42) for any $j\in[m]$ and $t\ge0$; this inequality implies (43). The time derivative of $\mathcal{L}(\Theta(t))$ then admits a chain of upper bounds analogous to (23), which shows that the loss converges to zero.

E.1.2 CHARACTERIZING THE LIMIT DIRECTION

Since we proved that $\mathcal{L}(\Theta(t))\to0$, the argument in the proof of Theorem 1 applies to this case, and shows that the parameters $v_l$ converge in direction and align with $\dot v_l$. It also follows from the definition of $r(t)$ that $\operatorname{sign}(r(t)) = -\operatorname{sign}(y)$. Using this, (41), and the alignment of $v_l$ and $\dot v_l$, we have (44) and (45). Element-wise multiplying $U_l^Tv_l^\infty$ to both sides gives (46). Since the LHSs are positive and $s$ is positive, the following has to be satisfied for all $j\in[m]$: $\operatorname{sign}(y) = \operatorname{sign}([\rho^\infty]_j)$. (47) Now, multiplying both sides of the two equations (46), we get (48). From (48), $\rho^\infty$ must satisfy (49) for all $j, j'\in[m]$. As in the proof of Theorem 2, there is another condition (50) that has to be satisfied for any $j, j'\in[m]$; let us prove why. First, consider the time derivative of $[\rho(t)]_j$ in (51), where step (a) used (41). We want to compare this quantity for different $j, j'\in[m]$. Before doing so, we take a look at the last term on the RHS of (51): recall from (43) that the layer gaps $\delta_j$ are positive due to our assumption on initialization. Then, we can use (52) to carry out the comparison.

Applying the structure (9) in Assumption 1 to the gradient flow dynamics and left-multiplying $U_l^T$ to both sides, we get (57), where $\odot$ denotes the entry-wise product. As in (21), for any $j\in[m]$ the squared $j$-th components of $U_l^Tv_l$ grow at the same rate for every layer $l\in[L]$, so the gaps between any two layers stay constant for all $t\ge0$; combining this with our conditions on initial directions yields (58).

Here, we defined $\rho := \bigodot_{l\in[L]}\eta_l\in\mathbb{R}^m$. Therefore, the linear coefficients of the network can be written as $\beta(\Theta(t)) = S^T\rho(t)$. From the solutions (62) and (63), we can write $\rho(t)$ in terms of $h_L := p_L^{L-1}q_L$, defined in Lemma 4. By the convergence of the loss to zero (60), we have $\lim_{t\to\infty}X\beta(\Theta(t)) = y$; therefore (64) holds. Next, we show that $\rho^\infty$ is in fact the solution of the optimization problem (65). Note that the KKT conditions for (65) consist of primal feasibility and a stationarity condition for some dual variable $\nu\in\mathbb{R}^n$. It is clear from (64) that $\rho^\infty$ satisfies the first condition (primal feasibility), so let us check the other one. Through a straightforward calculation and equating the resulting quantity with $SX^T\nu$, we see that setting $\nu = -\int_0^\infty r(\tau)d\tau$ satisfies this condition as well. Also, if $S$ is invertible, we can substitute $\rho = S^{-T}z$ into (65) to get the last statement of the theorem. This finishes the proof.

F.3 PROOF OF COROLLARY 5

The proof is a direct consequence of the fact that Assumption 1 holds with $S = U_1 = \cdots = U_L = I_d$ for linear diagonal networks. Hence, the proof is the same as that of Corollary 2, given in Appendix D.2.

F.4 PROOF OF COROLLARY 6

We start by showing that the DFT of a real and even vector is also real and even. Suppose that $x\in\mathbb{R}^d$ is real and even. First,
$$[Fx]_j = \frac1{\sqrt d}\sum_{k=1}^d[x]_k\cos\Big({-\frac{2\pi(j-1)(k-1)}{d}}\Big) + \frac{\sqrt{-1}}{\sqrt d}\sum_{k=1}^d[x]_k\sin\Big({-\frac{2\pi(j-1)(k-1)}{d}}\Big) = \frac1{\sqrt d}\sum_{k=1}^d[x]_k\cos\Big({-\frac{2\pi(j-1)(k-1)}{d}}\Big) \in \mathbb{R}$$
for all $j\in[d]$, where the sine terms cancel by the even symmetry of $x$. To prove that $Fx$ is even, for $j = 0,\dots,\frac{d-3}{2}$ we have
$$[Fx]_{j+2} = \frac1{\sqrt d}\sum_{k=1}^d[x]_k\cos\Big({-\frac{2\pi(j+1)(k-1)}{d}}\Big) = \frac1{\sqrt d}\sum_{k=1}^d[x]_k\cos\Big(2\pi(k-1)-\frac{2\pi(j+1)(k-1)}{d}\Big) = \frac1{\sqrt d}\sum_{k=1}^d[x]_k\cos\Big({-\frac{2\pi(d-j-1)(k-1)}{d}}\Big) = [Fx]_{d-j}.$$
It is proved in Appendix D.3 that linear full-length convolutional networks ($k_1 = \cdots = k_L = d$) satisfy Assumption 1 with $S = d^{\frac{L-1}{2}}F$ and $U_1 = \cdots = U_L = F^*$, where $F\in\mathbb{C}^{d\times d}$ is the matrix of the discrete Fourier transform basis $[F]_{j,k} = \frac1{\sqrt d}\exp\!\big({-\frac{\sqrt{-1}\cdot2\pi(j-1)(k-1)}{d}}\big)$ and $F^*$ is the complex conjugate of $F$. The proof of convergence of the loss to zero in Appendix F.2.1 is written for real matrices $S, U_1,\dots,U_L$, but we can apply the same argument as in Appendix D.1.1 to prove that the loss converges to zero even when $S, U_1,\dots,U_L$ are complex. Next, since the $U_l$'s are complex, we can write the system of ODEs as (see (20) for its derivation)
$$F\dot w_l = -d^{\frac{L-1}{2}}FX^Tr\odot\bigodot_{k\neq l}F^*w_k. \tag{66}$$
Since all data points $x_i$ and the initialization $w_l(0)$ are real and even, $FX^Tr$ is real and even, and the $F^*w_l(0) = Fw_l(0)$'s are real and even. By (66), the time derivatives of $Fw_l$ are then also real and even; thus the parameters $w_l(t)$ remain real and even for all $t\ge0$. From this observation, we can define $\eta_l(t) := Fw_l(t)$, $\bar\eta := F\bar w$, and $S := d^{\frac{L-1}{2}}\operatorname{Re}(F)$, which are all real by the even symmetry. Then, starting from (61), the proof goes through.
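A two-line numerical check (ours) of the fact proved above: the unitary DFT of a real, even vector ($[x]_k = [x]_{(d-k)\bmod d}$ in 0-indexed notation) is real and even.

```python
import numpy as np

d = 8
x = np.random.default_rng(0).normal(size=d)
x = (x + np.roll(x[::-1], 1)) / 2          # symmetrize: x_k = x_{(d-k) mod d}
Fx = np.fft.fft(x) / np.sqrt(d)            # unitary-normalized DFT
print(np.allclose(Fx.imag, 0),             # Fx is real
      np.allclose(Fx, np.roll(Fx[::-1], 1)))   # Fx is even
```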

G PROOF OF THEOREM 6

G.1 CONVERGENCE OF LOSS TO ZERO

We first show that, given the conditions on initialization, the training loss $\mathcal{L}(\Theta(t))$ converges to zero. Since $L = 2$ and $M(x) = U_1\operatorname{diag}(s)U_2^T$, we can write the gradient flow dynamics from Section 2.1 as
$$\dot v_1 = -M(X^Tr)\bullet(I_{k_1}, v_2) = -r\,U_1\operatorname{diag}(s)U_2^Tv_2, \qquad \dot v_2 = -r\,U_2\operatorname{diag}(s)U_1^Tv_1, \tag{68}$$
where $r(t) = f(x;\Theta(t)) - y$ is the residual of the data point $(x, y)$. From (68) we get
$$U_1^T\dot v_1 = -r\,s\odot U_2^Tv_2, \qquad U_2^T\dot v_2 = -r\,s\odot U_1^Tv_1. \tag{69}$$
Now consider the rate of growth of the $j$-th components:
$$\frac{d}{dt}[U_1^Tv_1]_j^2 = \frac{d}{dt}[U_2^Tv_2]_j^2 = -2r[s]_j[U_1^Tv_1]_j[U_2^Tv_2]_j,$$
so $[U_1^Tv_1]_j^2$ and $[U_2^Tv_2]_j^2$ grow at the same rate. This means that the gap between the two layers stays constant for all $t\ge0$. Combining this with our conditions on initial directions,
$$[U_1^Tv_1(t)]_j^2 - [U_2^Tv_2(t)]_j^2 = \alpha^2\big([\bar\eta_1]_j^2-[\bar\eta_2]_j^2\big) \ge \alpha^2\lambda$$
for any $j\in[m]$ and $t\ge0$. This inequality implies $[U_1^Tv_1(t)]_j^2 \ge \alpha^2\lambda$ (70). Let us now consider the time derivative of $\mathcal{L}(\Theta(t))$. We have the following chain of upper bounds:
$$\frac{d}{dt}\mathcal{L}(\Theta(t)) = -\|\nabla_\Theta\mathcal{L}(\Theta(t))\|_2^2 \le -\|\dot v_2(t)\|_2^2 \overset{(a)}{\le} -\|U_2^T\dot v_2(t)\|_2^2 = -r(t)^2\sum_{j=1}^m[s]_j^2[U_1^Tv_1(t)]_j^2 \le -\alpha^2\lambda\big(\min_j[s]_j\big)^2r(t)^2, \tag{71}$$
where (a) used the fact that $\|\dot v_2(t)\|_2^2 \ge \|U_2U_2^T\dot v_2(t)\|_2^2$ because it is a projection onto a subspace, and $\|U_2U_2^T\dot v_2(t)\|_2^2 = \|U_2^T\dot v_2(t)\|_2^2$ because $U_2^TU_2 = I_{k_2}$. Therefore, $\mathcal{L}(\Theta(t))\to0$ as $t\to\infty$.

Note that if $W_{L-1}^T\cdots W_1^T$ is full rank, its minimum singular value is positive, and one can bound
$$\|W_{L-1}^T\cdots W_1^TX^Tr\|_2^2 \ge \sigma_{\min}\big(W_{L-1}^T\cdots W_1^T\big)^2\|X^Tr\|_2^2. \tag{81}$$
We now prove that the matrix $W_{L-1}^T\cdots W_1^T$ is full rank, with minimum singular value bounded from below by $\alpha^{L-1}\lambda^{(L-1)/2}$ for any $t\ge0$. To show this, it suffices to show that
$$W_1\cdots W_{L-1}W_{L-1}^T\cdots W_1^T \succeq \alpha^{2L-2}\lambda^{L-1}I_d. \tag{82}$$
Now, for $l\in[L-2]$,
$$W_l^TW_l \overset{(a)}{=} W_{l+1}W_{l+1}^T + \alpha^2\big(\bar W_l^T\bar W_l-\bar W_{l+1}\bar W_{l+1}^T\big) \overset{(b)}{\succeq} W_{l+1}W_{l+1}^T,$$
where equalities marked (a) used (78), and inequalities marked (b) used the initialization conditions $\bar W_l^T\bar W_l \succeq \bar W_{l+1}\bar W_{l+1}^T$. Next, it follows from (79) that
$$W_{L-1}^TW_{L-1} = w_Lw_L^T + \alpha^2\big(\bar W_{L-1}^T\bar W_{L-1}-\bar w_L\bar w_L^T\big) \overset{(c)}{\succeq} \alpha^2\lambda I_d,$$
where (c) used the assumption that $\bar W_{L-1}^T\bar W_{L-1}-\bar w_L\bar w_L^T \succeq \lambda I_d$. Chaining these bounds proves (82). Applying (82) to (80) then gives
$$\frac{d}{dt}\mathcal{L}(\Theta_{\rm fc}(t)) \le -\alpha^{2L-2}\lambda^{L-1}\sigma_{\min}(X)^2\|r\|_2^2 \le -\alpha^{2L-2}\lambda^{L-1}\sigma_{\min}(X)^2\mathcal{L}(\Theta_{\rm fc}(t)),$$
where we also used the fact that $X^T$ has full column rank to apply a bound similar to (81). From this, we get $\mathcal{L}(\Theta_{\rm fc}(t)) \le \mathcal{L}(\Theta_{\rm fc}(0))\exp\!\big({-\alpha^{2L-2}\lambda^{L-1}\sigma_{\min}(X)^2t}\big)$, hence proving $\mathcal{L}(\Theta_{\rm fc}(t))\to0$ as $t\to\infty$.

H.2 CHARACTERIZING THE LIMIT POINT: THE α → 0 CASE

Now, we move on to characterize the limit points of the gradient flow in the "active regime" case $\alpha\to0$. This part of the proof is motivated by the analysis in Ji & Telgarsky (2019a). Let $u_l$ and $v_l$ be the top left and right singular vectors of $W_l$, for $l\in[L-1]$; since $W_l$ varies over time, the singular vectors and singular values also vary over time. Similarly, let $s_l$ be the largest singular value of $W_l$. We will show that the linear coefficients $\beta_{\rm fc}(\Theta_{\rm fc}) = W_1\cdots W_{L-1}w_L$ align with $u_1$ as $\alpha\to0$, and that $u_1$ lies in $\operatorname{row}(X)$ in the limit $\alpha\to0$, hence proving that $\beta_{\rm fc}(\Theta_{\rm fc})$ is the minimum $\ell_2$ norm solution in the limit $\alpha\to0$. First, taking traces of both sides of (78) and (79) yields the Frobenius-norm relations leading to (83).

