IMPLICIT REGULARIZATION VIA SPECTRAL NEURAL NETWORKS AND NON-LINEAR MATRIX SENSING

Abstract

The phenomenon of implicit regularization has attracted interest in recent years as a fundamental aspect of the remarkable generalizing ability of neural networks. In a nutshell, it entails that gradient flow dynamics in many neural nets, even without any explicit regularizer in the loss function, converges to the solution of a regularized learning problem. However, known results attempting to theoretically explain this phenomenon focus overwhelmingly on the setting of linear neural nets, and the simplicity of the linear structure is particularly crucial to existing arguments. In this paper, we explore this problem in the context of more realistic neural networks with a general class of non-linear activation functions, and rigorously demonstrate the implicit regularization phenomenon for such networks in the setting of matrix sensing problems. This is coupled with rigorous rate guarantees that ensure exponentially fast convergence of gradient descent, complemented by matching lower bounds which stipulate that the exponential rate is the best achievable. In this vein, we contribute a network architecture called Spectral Neural Networks (abbrv. SNN) that is particularly suitable for matrix learning problems. Conceptually, this entails coordinatizing the space of matrices by their singular values and singular vectors, as opposed to by their entries, a potentially fruitful perspective for matrix learning. We demonstrate that the SNN architecture is inherently much more amenable to theoretical analysis than vanilla neural nets and confirm its effectiveness in the context of matrix sensing, supported via both mathematical guarantees and empirical investigations. We believe that the SNN architecture has the potential to be of wide applicability in a broad class of matrix learning scenarios.

1. INTRODUCTION

A longstanding pursuit of deep learning theory is to explain the astonishing ability of neural networks to generalize despite having far more learnable parameters than training data, even in the absence of any explicit regularization. An established understanding of this phenomenon is that the gradient descent algorithm induces a so-called implicit regularization effect. In very general terms, implicit regularization entails that gradient flow dynamics in many neural nets, even without any explicit regularizer in the loss function, converges to the solution of a regularized learning problem. In a sense, this creates a learning paradigm that automatically favors models characterized by "low complexity". A standard test-bed for mathematical analysis in studying implicit regularization in deep learning is the matrix sensing problem. The goal is to approximate a matrix X⋆ from a set of measurement matrices A_1, …, A_m and observations y_1, …, y_m, where y_i = ⟨A_i, X⋆⟩. A common approach, matrix factorization, parameterizes the solution as a product matrix, i.e., X = UV^⊤, and optimizes the resulting non-convex objective to fit the data. This is equivalent to training a depth-2 neural network with a linear activation function. In an attempt to explain the generalizing ability of over-parameterized neural networks, Neyshabur et al. (2014) first suggested the idea of an implicit regularization effect of the optimizer, which entails a bias towards solutions that generalize well. Gunasekar et al. (2017) investigated the possibility of an implicit norm-regularization effect of gradient descent in the context of shallow matrix factorization. In particular, they studied the standard Burer-Monteiro approach (Burer & Monteiro, 2003) to matrix factorization, which may be viewed as a depth-2 linear neural network.
They were able to theoretically demonstrate an implicit norm-regularization phenomenon, whereby a suitably initialized gradient flow dynamics approaches a solution to the nuclear-norm minimization approach to low-rank matrix recovery (Recht et al., 2010), in a setting where the involved measurement matrices commute with each other. They also conjectured that this latter restriction on the measurement matrices is unnecessary. This conjecture was later resolved by Li et al. (2018) in the setting where the measurement matrices satisfy a restricted isometry property. Other aspects of implicit regularization in matrix factorization problems were investigated in several follow-up papers (Neyshabur et al., 2017; Arora et al., 2019; Razin & Cohen, 2020; Tarmoun et al., 2021; Razin et al., 2021). For instance, Arora et al. (2019) showed that the implicit norm-regularization property of gradient flow, as studied by Gunasekar et al. (2017), does not hold in the context of deep matrix factorization. Razin & Cohen (2020) constructed a simple 2 × 2 example where the gradient flow dynamics lead to an eventual blow-up of any matrix norm, while a certain relaxation of rank, the so-called e-rank, is minimized in the limit. These works suggest that implicit regularization in deep networks should be interpreted through the lens of rank minimization, not norm minimization. Incidentally, Razin et al. (2021) have recently demonstrated similar phenomena in the context of tensor factorization. Researchers have also studied implicit regularization in several other learning problems, including linear models (Soudry et al., 2018; Zhao et al., 2019; Du & Hu, 2019) and neural networks with one or two hidden layers (Li et al., 2018; Blanc et al., 2020; Gidel et al., 2019; Kubo et al., 2019; Saxe et al., 2019).
Besides norm-regularization, several of these works demonstrate the implicit regularization effect of gradient descent in terms of other relevant quantities, such as margin (Soudry et al., 2018), the number of times the model changes its convexity (Blanc et al., 2020), linear interpolation (Kubo et al., 2019), or structural bias (Gidel et al., 2019). A natural use case for investigating the implicit regularization phenomenon is the problem of matrix sensing. Classical works in matrix sensing and matrix factorization utilize convex relaxation approaches, i.e., minimizing the nuclear norm subject to agreement with the observations, and derive tight sample complexity bounds (Srebro & Shraibman, 2005; Candès & Recht, 2009; Recht et al., 2010; Candès & Tao, 2010; Keshavan et al., 2010; Recht, 2011). More recently, many works have analyzed gradient descent on the matrix sensing problem. Ge et al. (2016) and Bhojanapalli et al. (2016) show that the non-convex objectives for matrix sensing and matrix completion with low-rank parameterization do not have any spurious local minima; consequently, the gradient descent algorithm converges to the global minimum. Despite the large body of work studying implicit regularization, most of it considers the linear setting. It remains an open question to understand the behavior of gradient descent in the presence of non-linearities, which are more realistic representations of the neural nets employed in practice. In this paper, we make an initial foray into this problem and investigate the implicit regularization phenomenon in more realistic neural networks with a general class of non-linear activation functions. We rigorously demonstrate the occurrence of an implicit regularization phenomenon in this setting for matrix sensing problems, reinforced with quantitative rate guarantees ensuring exponentially fast convergence of gradient descent to the best approximation in a suitable class of matrices.
Our convergence upper bounds are complemented by matching lower bounds which demonstrate the optimality of the exponential rate of convergence. In the bigger picture, we contribute a network architecture that we refer to as the Spectral Neural Network architecture (abbrv. SNN), which is particularly suitable for matrix learning scenarios. Conceptually, this entails coordinatizing the space of matrices by their singular values and singular vectors, as opposed to by their entries. We believe that this point of view can be beneficial for tackling matrix learning problems in a neural network setup. SNNs are particularly well-suited for theoretical analysis due to the spectral nature of their non-linearities, as opposed to vanilla neural nets, while at the same time provably guaranteeing effectiveness in matrix learning problems. We also introduce a much more general counterpart of the near-zero initialization that is popular in related literature, and our methods are able to handle a much more robust class of initializing setups that are constrained only via certain inequalities. Our theoretical contributions include a compact analytical representation of the gradient flow dynamics, accorded by the spectral nature of our network architecture. We demonstrate the efficacy of the SNN architecture through its application to the matrix sensing problem, bolstered via both theoretical guarantees and extensive empirical studies. We believe that the SNN architecture has the potential to be of wide applicability in a broad class of matrix learning problems. In particular, we believe that the SNN architecture would be natural for the study of rank (or e-rank) minimization effect of implicit regularization in deep matrix/tensor factorization problems, especially given the fact that e-rank is essentially a spectrally defined concept.
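As a point of reference for the linear case discussed above, here is a minimal sketch of gradient descent on the Burer-Monteiro factorization X = UV^⊤ for matrix sensing. This is our own toy example, not the paper's code: the dimensions, the Gaussian measurement ensemble, the learning rate, and the near-zero initialization scale are all illustrative assumptions.

```python
import numpy as np

# Toy Burer-Monteiro matrix sensing (all settings are illustrative choices):
# recover a rank-2 X* from linear measurements by gradient descent on the
# non-convex loss l(U, V) = 1/(2m) * sum_i (y_i - <A_i, U V^T>)^2.
rng = np.random.default_rng(0)
d, r, m = 6, 2, 50
X_star = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))  # rank-2 target
A = rng.standard_normal((m, d, d))                                  # measurements
y = np.tensordot(A, X_star, axes=([1, 2], [0, 1]))                  # y_i = <A_i, X*>

U = 1e-3 * rng.standard_normal((d, d))      # small ("near-zero") initialization
V = 1e-3 * rng.standard_normal((d, d))
eta = 0.02
for _ in range(8000):
    resid = y - np.tensordot(A, U @ V.T, axes=([1, 2], [0, 1]))
    G = -np.tensordot(resid, A, axes=1) / m          # gradient w.r.t. U V^T
    U, V = U - eta * G @ V, V - eta * G.T @ U

rel_err = np.linalg.norm(U @ V.T - X_star) / np.linalg.norm(X_star)
print(rel_err)    # small relative error: the low-rank target is recovered
```

Here m is chosen large enough that the measurement operator is injective; in the under-determined regime m ≪ d², which is the regime studied in this paper, the implicit bias of the optimizer becomes the deciding factor.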

2. PROBLEM SETUP

Let X⋆ ∈ R^{d1×d2} be an unknown rectangular matrix that we aim to recover. Let A_1, …, A_m ∈ R^{d1×d2} be m measurement matrices, and let the label vector y ∈ R^m be generated by y_i = ⟨A_i, X⋆⟩, where ⟨A, B⟩ = tr(A^⊤B) denotes the Frobenius inner product. We consider the following squared loss objective:

ℓ(X) := (1/2) Σ_{i=1}^m (y_i − ⟨A_i, X⟩)². (2)

This setting covers problems including matrix completion (where the A_i's are indicator matrices), matrix sensing from linear measurements, and multi-task learning (in which the columns of X are predictors for the tasks, and A_i has only one non-zero column). We are interested in the regime where m ≪ d_1 × d_2, i.e., the number of measurements is much less than the number of entries of X⋆, in which case Eq. 2 is under-determined with many global minima. Therefore, merely minimizing Eq. 2 does not guarantee correct recovery or good generalization. Following previous works, instead of working with X directly, we consider a non-linear factorization of X as follows:

X = Σ_{k=1}^K α_k Γ(U_k V_k^⊤), (3)

where α_k ∈ R, U_k ∈ R^{d1×d}, V_k ∈ R^{d2×d}, and the matrix-valued function Γ : R^{d1×d2} → R^{d1×d2} transforms a matrix by applying a non-linear real-valued function γ(·) to its singular values. We focus on the over-parameterized setting d ≥ d_2 ≥ d_1, i.e., the factorization does not impose any rank constraint on X. Let α = {α_1, …, α_K} be the collection of the α_k's. Similarly, we define U and V to be the collections of the U_k's and V_k's.
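To make the measurement model concrete, here is a small self-contained sketch of the squared loss objective above; the dimensions and the Gaussian measurement ensemble are our own illustrative choices, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, m = 5, 6, 12                       # m < d1*d2: under-determined regime

X_star = rng.standard_normal((d1, d2))     # unknown target X*
A = rng.standard_normal((m, d1, d2))       # measurement matrices A_1, ..., A_m
y = np.tensordot(A, X_star, axes=([1, 2], [0, 1]))   # y_i = <A_i, X*>

def loss(X):
    """Squared loss l(X) = 1/2 * sum_i (y_i - <A_i, X>)^2."""
    preds = np.tensordot(A, X, axes=([1, 2], [0, 1]))
    return 0.5 * np.sum((y - preds) ** 2)

print(loss(X_star))    # the target matrix attains zero loss
```

Since m < d1*d2, many matrices besides X⋆ attain zero loss here, which is exactly why the implicit bias of the optimization dynamics matters.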

2.1. GRADIENT FLOW

For each k ∈ [K], let α_k(t), U_k(t), V_k(t) denote the trajectories of gradient flow, where α_k(0), U_k(0), V_k(0) are the initial conditions. Consequently, X(t) = Σ_{k=1}^K α_k(t) Γ(U_k(t) V_k(t)^⊤). The dynamics of gradient flow are given by the following differential equations, for k ∈ [K]:

∂_t α_k = −∇_{α_k} ℓ(X(t)), ∂_t U_k = −∇_{U_k} ℓ(X(t)), ∂_t V_k = −∇_{V_k} ℓ(X(t)). (4)
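For intuition on the relationship between gradient flow and gradient descent, the toy sketch below (a plain least-squares problem of our own choosing, not the paper's objective) treats the learning rate as the forward-Euler step size of the flow; with a sufficiently small step, gradient descent tracks the flow to the least-squares solution.

```python
import numpy as np

# Gradient descent on a toy quadratic l(x) = 0.5*||Ax - b||^2, viewed as a
# forward-Euler discretization of the gradient flow dx/dt = -grad l(x).
rng = np.random.default_rng(7)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)

x = np.zeros(3)
eta = 1e-2                     # learning rate = Euler step size
for _ in range(5000):
    x -= eta * A.T @ (A @ x - b)

x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_ls, atol=1e-6))    # the iterates reach the flow's limit
```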

3. THE SNN ARCHITECTURE

In this work, we contribute a novel neural network architecture, called the Spectral Neural Network (abbrv. SNN), that is particularly suitable for matrix learning problems. At the fundamental level, the SNN architecture entails an application of a non-linear activation function to a matrix-valued input in the spectral domain. This may be followed by a linear combination of several such spectrally transformed matrix-structured data. To be precise, let us focus on an elemental neuron, which manipulates a single matrix-valued input X. Suppose X has the singular value decomposition X = Φ X̄ Ψ^⊤, where Φ, Ψ are orthogonal matrices and X̄ is the diagonal matrix of singular values of X. Let γ be any activation function of choice. Then the elemental neuron acts on X as follows:

X ↦ Φ γ(X̄) Ψ^⊤,

where γ(X̄) is a diagonal matrix with the non-linearity γ applied entrywise to the diagonal of X̄. A block in the SNN architecture comprises K ≥ 1 elemental neurons as above, taking in K matrix-valued inputs X_1, …, X_K. Each input matrix X_i is then individually operated upon by an elemental neuron, and finally, the resulting matrices are aggregated linearly to produce a matrix-valued output for the block. The coefficients of this linear combination are also parameters in the SNN architecture, and are to be learned during the process of training the network.

Figure 1: Visualization of the anatomy of an SNN block and a depth-D SNN architecture. Each SNN block takes as input K matrices and outputs one matrix; both the input and output matrices are of size R^{d1×d2}. In layer i of the SNN, there are L_i blocks, which aggregate matrices from the previous layer to produce L_i output matrices as inputs for the next layer. The number of input matrices to a block equals the number of neurons in the previous layer. For example, blocks in layer 1 have K = L_0, blocks in layer 2 have K = L_1, and blocks in layer i have K = L_{i−1}.
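The elemental neuron and the SNN block described above can be sketched in a few lines; this is a minimal illustration under the construction just given (the helper names and dimensions are ours).

```python
import numpy as np

def spectral_neuron(X, gamma):
    """Elemental SNN neuron: apply gamma to the singular values of X while
    keeping the singular vectors intact (X -> Phi gamma(Xbar) Psi^T)."""
    Phi, s, PsiT = np.linalg.svd(X, full_matrices=False)
    return Phi @ np.diag(gamma(s)) @ PsiT

def snn_block(Xs, coeffs, gamma):
    """An SNN block: spectrally transform each of the K input matrices, then
    combine them linearly with learnable coefficients."""
    return sum(a * spectral_neuron(X, gamma) for a, X in zip(coeffs, Xs))

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
Xs = [rng.standard_normal((4, 5)) for _ in range(3)]    # K = 3 inputs
out = snn_block(Xs, coeffs=[0.5, -1.0, 2.0], gamma=sigmoid)
print(out.shape)
```

With a monotone non-negative γ such as the sigmoid, the singular values of the neuron's output are exactly γ applied to the singular values of its input.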
The comprehensive SNN architecture is finally obtained by combining such blocks into multiple layers of a deep network, as illustrated in Figure 1.

4. MAIN RESULTS

For the purposes of theoretical analysis, in the present article, we specialize the SNN architecture to focus on the setting of (quasi-)commuting measurement matrices A_i and spectral near-zero initialization; c.f. Assumptions 1 and 2 below. Similar settings have attracted considerable attention in the literature, including the foundational works of Gunasekar et al. (2017) and Arora et al. (2019). Furthermore, our analysis holds under very general analytical requirements on the activation function γ; see Assumption 3 in the following.

Assumption 1. The measurement matrices A_1, …, A_m share the same left and right singular vectors. Specifically, there exist two orthogonal matrices Φ ∈ R^{d1×d1} and Ψ ∈ R^{d2×d2}, and a sequence of (rectangular) diagonal matrices Ā_1, …, Ā_m ∈ R^{d1×d2} such that for any i ∈ [m], we have

A_i = Φ Ā_i Ψ^⊤. (6)

Let σ^(i) be the vector containing the singular values of A_i, i.e., σ^(i) = diag(Ā_i). Furthermore, we assume that there exist real coefficients a_1, …, a_m such that

a_1 σ^(1) + ⋯ + a_m σ^(m) = 1. (7)

We let X̄⋆ = Φ^⊤ X⋆ Ψ, and let σ⋆ be the vector containing the diagonal elements of X̄⋆, i.e., σ⋆ = diag(X̄⋆). Without loss of generality, we may also assume that σ⋆ is coordinatewise non-zero. This can be easily ensured by adding the rectangular identity matrix (c.f. Appendix D) cI_{d1×d2} to X⋆ for some large enough positive number c.

Eq. 6 postulates that the measurement matrices share the same (left- and right-) singular vectors. This holds if and only if the measurement matrices pairwise quasi-commute, in the sense that for any i, j ∈ [m], we have A_i A_j^⊤ = A_j A_i^⊤ and A_i^⊤ A_j = A_j^⊤ A_i. A natural class of examples of such quasi-commuting measurement matrices comes from families of commuting projections. In such a scenario, Eq. 7 stipulates that these projections cover all the coordinate directions, which may be related conceptually to a notion of the measurements being sufficiently informative.
For example, in this setting, Eq. 7 would entail that the trace of X⋆ can be directly computed on the basis of the measurements. Eq. 7 also acts as a regularity condition on the singular values of the measurement matrices. For example, it prohibits peculiar scenarios where σ^(i)_1 = 0 for all i, i.e., all measurement matrices have 0 as their smallest singular value, which would make it impossible to sense the smallest singular value of X⋆ from linear measurements.
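Both the quasi-commutation property and the solvability condition in Eq. 7 can be checked numerically. The construction below builds measurement matrices sharing singular vectors as in Assumption 1; the dimensions and random draws are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, m = 4, 6, 5

# Shared singular vectors: random orthogonal Phi, Psi via QR
Phi, _ = np.linalg.qr(rng.standard_normal((d1, d1)))
Psi, _ = np.linalg.qr(rng.standard_normal((d2, d2)))

# Rectangular diagonal A_bar_i with positive entries, A_i = Phi A_bar_i Psi^T
sigmas = rng.uniform(0.1, 1.0, size=(m, d1))       # rows are sigma^(i)
A = np.stack([Phi @ np.hstack([np.diag(s), np.zeros((d1, d2 - d1))]) @ Psi.T
              for s in sigmas])

# Pairwise quasi-commutation: A_i A_j^T = A_j A_i^T and A_i^T A_j = A_j^T A_i
ok = all(np.allclose(A[i] @ A[j].T, A[j] @ A[i].T) and
         np.allclose(A[i].T @ A[j], A[j].T @ A[i])
         for i in range(m) for j in range(m))
print(ok)

# Eq. 7: coefficients a with sum_i a_i sigma^(i) = 1 (solvable since the
# random sigma^(i) span R^{d1} here)
a, *_ = np.linalg.lstsq(sigmas.T, np.ones(d1), rcond=None)
print(np.allclose(sigmas.T @ a, np.ones(d1)))
```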

Note that

y_i = ⟨A_i, X⋆⟩ = Tr(A_i^⊤ X⋆) = Tr(Ψ Ā_i^⊤ Φ^⊤ X⋆) = Tr(Ā_i^⊤ Φ^⊤ X⋆ Ψ) = ⟨Ā_i, Φ^⊤ X⋆ Ψ⟩,

where in the above we use the cyclic property of the trace. Consequently,

y_i = ⟨Ā_i, X̄⋆⟩ = ⟨σ^(i), σ⋆⟩ = σ^(i)⊤ σ⋆, (10)

where the second equality is due to Ā_i being diagonal. We further define the vectors z^(k) and three matrices B, C, and H as follows:

z^(k)_i = [Ū_k]_ii [V̄_k]_ii,
B = [σ^(1) | … | σ^(m)] ∈ R^{d1×m},
C = BB^⊤ ∈ R^{d1×d1},
H = [γ(z^(1)) | … | γ(z^(K))] ∈ R^{d1×K}.

Under these new notations, we can write the label vector y as y = B^⊤ σ⋆.

Assumption 2. (Spectral Initialization) Let Φ and Ψ be the matrices containing the left and right singular vectors of the measurement matrices from Assumption 1, and let G ∈ R^{d×d} be an arbitrary orthogonal matrix. We initialize X(0) such that the following conditions hold for every k = 1, …, K: (a) U_k(0) = Φ Ū_k(0) G and V_k(0) = Ψ V̄_k(0) G; (b) Ū_k(0) and V̄_k(0) are (rectangular) diagonal matrices; and (c) Σ_{k=1}^K α_k γ([Ū_k(0)]_ii [V̄_k(0)]_ii) ≤ σ⋆_i for any i = 1, …, d_1.

Assumption 2, especially part (c) therein, may be thought of as a much more generalized counterpart of the "near-zero" initialization that is widely used in the related literature (Gunasekar et al. (2017); Li et al. (2018); Arora et al. (2019)). A direct consequence of Assumption 2 is that at initialization, the matrix X(0) has Φ and Ψ as its left and right singular vectors. As we will see later, this initialization imposes a distinctive structure on the gradient flow dynamics, allowing for an explicit analytical expression for the flow of each component of X.

Assumption 3. The function γ : R → R is bounded between [0, 1], and is differentiable and non-decreasing on R.

Assumption 3 imposes regularity conditions on the non-linearity γ.
Common non-linearities that are used in deep learning such as sigmoid, ReLU or tanh satisfy the differentiability and non-decreasing conditions, while boundedness can be achieved by truncating the outputs of these functions if necessary. Our first result provides a compact representation of the gradient flow dynamics in suitable coordinates. The derivation of these dynamics involves matrix differentials utilizing the Khatri-Rao product Ψ ⊠ Φ of the matrices Ψ and Φ (see Eq. 21 in Appendix A of the supplement).

Theorem 1. Suppose Assumptions 1, 2, and 3 hold. Then the gradient flow dynamics in Eq. 4 are

∂_t α = H^⊤ C (σ⋆ − Hα), ∂_t U_k = Φ L_k Ψ^⊤ V_k, ∂_t V_k = (Φ L_k Ψ^⊤)^⊤ U_k,

where L_k ∈ R^{d1×d2} is a diagonal matrix whose diagonal is given by

diag(L_k) = λ^(k) = [λ^(k)_1, …, λ^(k)_{d1}]^⊤ = α_k γ′(z^(k)) • C(σ⋆ − Hα),

with • denoting the entrywise product.

Proof. (Main ideas; full details in Appendix A.) We leverage the fact that the non-linearity γ(·) only changes the singular values of the product matrix U_k V_k^⊤ while keeping the singular vectors intact. Therefore, the gradient flow in Eq. 4 preserves the left and right singular vectors. Furthermore, by Assumption 2, U_k V_k^⊤ has Φ and Ψ as its left and right singular vectors at initialization, and this remains the case throughout. This property also percolates to ∇_{U_k V_k^⊤} ℓ. Mathematically speaking, ∇_{U_k V_k^⊤} ℓ becomes diagonalizable by Φ and Ψ, i.e., ∇_{U_k V_k^⊤} ℓ = Φ Λ_k Ψ^⊤ for some diagonal matrix Λ_k. It turns out that Λ_k = L_k as given in the statement of the theorem. In view of Eq. 4, this explains the expressions for ∂_t U_k and ∂_t V_k. Finally, since α_k is a scalar, the partial derivative of ℓ with respect to α_k is relatively straightforward to compute.

Theorem 1 provides closed-form expressions for the dynamics of the individual components of X, namely α_k, U_k and V_k.
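In spectral coordinates, y_i − ⟨A_i, X⟩ = σ^(i)⊤(σ⋆ − Hα), so the loss as a function of α alone is (1/2)‖B^⊤(σ⋆ − Hα)‖². The α-equation of Theorem 1 can therefore be sanity-checked against a finite-difference gradient; the sketch below uses small, arbitrary dimensions of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(4)
d1, m, K = 4, 5, 3
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

B = rng.uniform(0.1, 1.0, size=(d1, m))      # columns sigma^(i)
C = B @ B.T
sigma_star = rng.uniform(0.5, 2.0, size=d1)
z = rng.uniform(0, 1e-3, size=(K, d1))       # near-zero z^(k)
H = sigmoid(z).T                             # d1 x K, columns gamma(z^(k))
alpha = rng.uniform(0, 1e-3, size=K)

def loss_alpha(a):
    # In spectral coordinates, y_i - <A_i, X> = sigma^(i).T (sigma* - H a),
    # so the loss reduces to 0.5 * || B^T (sigma* - H a) ||^2.
    r = B.T @ (sigma_star - H @ a)
    return 0.5 * r @ r

flow = H.T @ C @ (sigma_star - H @ alpha)    # Theorem 1: d(alpha)/dt

eps = 1e-6                                    # central finite differences
num_grad = np.array([(loss_alpha(alpha + eps * np.eye(K)[k]) -
                      loss_alpha(alpha - eps * np.eye(K)[k])) / (2 * eps)
                     for k in range(K)])
print(np.allclose(flow, -num_grad, atol=1e-4))   # flow = -grad_alpha l
```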
We want to highlight that the compact analytical expression and the simplicity of the gradient flow dynamics on the components are a direct consequence of the spectral non-linearity. In other words, if we were to use the conventional element-wise non-linearity commonly used in deep learning, the above dynamics would be substantially more complicated, containing several Hadamard products and becoming prohibitively harder for theoretical analysis. As a direct corollary of Theorem 1, the gradient flow dynamics on Ū_k and V̄_k are

∂_t Ū_k = L_k V̄_k, ∂_t V̄_k = L_k^⊤ Ū_k. (12)

Under Assumption 2, Ū_k(0) and V̄_k(0) are diagonal matrices. From the gradient flow dynamics in Eq. 12, and recalling that the L_k's are diagonal, we infer that ∂_t Ū_k(0) and ∂_t V̄_k(0) are also diagonal. Consequently, Ū_k(t) and V̄_k(t) remain diagonal for all t ≥ 0, since the gradient flow dynamics in Eq. 12 do not induce any change in the off-diagonal elements. Thus, Ū_k(t) V̄_k(t)^⊤ also remains diagonal throughout. A consequence of the spectral initialization is that the left and right singular vectors of X(t) stay constant at Φ and Ψ throughout the entire gradient flow procedure. To this end, the gradient flow dynamics are completely determined by the evolution of the singular values of X(t), i.e., Hα. The next result characterizes the convergence of the singular values of X(t).

Theorem 2. Under Assumptions 1 and 2, for any i = 1, …, d_1, there are constants η_i, C_i > 0 such that

0 ≤ σ⋆_i − (H(t)α(t))_i ≤ C_i e^{−η_i t}.

On the other hand, we have the lower bound

‖σ⋆ − H(t)α(t)‖_2 ≥ C e^{−ηt},

for some constants η, C > 0.

Proof. (Main ideas; full details in Appendix B.) By part (c) of Assumption 2, at initialization we have H(0)α(0) ≤_e σ⋆, where the symbol ≤_e denotes the element-wise less-than-or-equal relation. Therefore, to prove Theorem 2, it suffices to show that H(t)α(t) increases to σ⋆ element-wise at an exponential rate.
To achieve this, we show that the evolution of Hα over time can be expressed as

∂_t(Hα) = Σ_{k=1}^K α_k² γ′(z^(k))² • (ū^(k)² + v̄^(k)²) • C(σ⋆ − Hα) + HH^⊤ C(σ⋆ − Hα),

where ū^(k) = diag(Ū_k), v̄^(k) = diag(V̄_k), and the squares and • are taken entrywise. By definition, the matrix B contains the singular values of the A_i's, and therefore its entries are non-negative. Consequently, since C = BB^⊤, the entries of C are also non-negative. By Assumption 3, we have γ(·) ∈ [0, 1], so the entries of H are non-negative. Finally, by Assumption 2, we have H(0)α(0) ≤ σ⋆ entry-wise. For these reasons, each entry of ∂_t(Hα) is non-negative at initialization, and indeed, for each i, the quantity (Hα)_i is increasing as long as (Hα)_i < σ⋆_i. As Hα approaches σ⋆, the derivative ∂_t(Hα) decreases. If Hα were to equal σ⋆ at some finite time, then ∂_t(Hα) would be exactly 0, which would cause Hα to stay constant at σ⋆ from then on. Thus, each (Hα)_i is non-decreasing and bounded above by σ⋆_i, and therefore must converge to a limit ≤ σ⋆_i. If this limit were strictly smaller than σ⋆_i, then by the above argument (Hα)_i would still be increasing, so this cannot be the case. Consequently, we may deduce that lim_{t→∞} H(t)α(t) = σ⋆. It remains to show that the convergence is exponentially fast. To this end, we show in the detailed proof that each entry of ∂_t(Hα) is not only non-negative but also satisfies

∂_t(Hα)_i ≥ η_i (σ⋆_i − (Hα)_i)

for some constant η_i > 0, which implies that Hα converges to σ⋆ at an exponential rate.

The limiting matrix output by the network is, therefore, Φ Diag(σ⋆) Ψ^⊤, and given the fact that σ⋆ = diag(Φ^⊤ X⋆ Ψ), this is the best approximation of X⋆ among matrices with (the columns of) Φ and Ψ as their left and right singular vectors. This is perhaps reasonable, given the fact that the sensing matrices A_i also share the same singular vectors, and it is natural to expect an approximation that is limited by their properties.
In particular, when the A_i are symmetric and hence commuting, under mild linear independence assumptions, Φ Diag(σ⋆) Ψ^⊤ is the best approximation of X⋆ in the algebra generated by the A_i's, which is again a natural class given the nature of the measurements. We are now ready to rigorously demonstrate the phenomenon of implicit regularization in our setting. To this end, following the gradient flow dynamics, we are interested in the behavior of X in the limit as time goes to infinity.

Theorem 3. Let X_∞ = lim_{t→∞} X(t). Under Assumptions 1 and 2, the following hold: (a) ℓ(X_∞) = 0, and (b) X_∞ solves the optimization problem

min_{X ∈ R^{d1×d2}} ‖X‖_* subject to y_i = ⟨A_i, X⟩ for all i ∈ [m].

Proof. (Main ideas; full details in Appendix C.) A direct corollary of Theorem 2 is that H(∞)α(∞) = σ⋆. By some algebraic manipulations, we can show that the limit of X takes the form X_∞ = Φ Diag(σ⋆) Ψ^⊤. Now, let us look at the prediction given by X_∞. For any i = 1, …, m, we have

⟨A_i, X_∞⟩ = ⟨Φ Ā_i Ψ^⊤, Φ Diag(σ⋆) Ψ^⊤⟩ = ⟨σ^(i), σ⋆⟩ = y_i,

where the last equality holds due to Eq. 10. This implies that ℓ(X_∞) = 0, proving (a). To prove (b), we show that X_∞ satisfies the Karush-Kuhn-Tucker (KKT) conditions of the above optimization problem. The conditions are: there exists ν ∈ R^m such that ⟨A_i, X⟩ = y_i for all i ∈ [m], and

∇_X ‖X‖_* + Σ_{i=1}^m ν_i A_i = 0.

The solution matrix X_∞ satisfies the first condition, as proved in part (a). As for the second condition, note first that the gradient of the nuclear norm at X_∞ is given by ∇‖X_∞‖_* = Φ I_{d1×d2} Ψ^⊤. Therefore, writing ν̃ = −ν, the second condition becomes

Φ I_{d1×d2} Ψ^⊤ − Σ_{i=1}^m ν̃_i A_i = 0 ⟺ Φ (I_{d1×d2} − Σ_{i=1}^m ν̃_i Ā_i) Ψ^⊤ = 0 ⟺ Bν̃ = 1.

However, by Assumption 1, the vector 1 lies in the column space of B, which implies the existence of such a vector ν̃. This concludes the proof of part (b).
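Parts (a) and (b) of Theorem 3 can be probed numerically: the limit Φ Diag(σ⋆) Ψ^⊤ matches every measurement, and the KKT multiplier exists because 1 lies in the column space of B. The sketch below uses Assumption 1's structure with dimensions and random draws of our own choosing (the sign of the multiplier is absorbed, as in the proof).

```python
import numpy as np

rng = np.random.default_rng(6)
d1, d2, m = 4, 5, 8

Phi, _ = np.linalg.qr(rng.standard_normal((d1, d1)))
Psi, _ = np.linalg.qr(rng.standard_normal((d2, d2)))
Irect = np.eye(d1, d2)                         # rectangular identity I_{d1 x d2}

sigmas = rng.uniform(0.1, 1.0, size=(m, d1))   # positive sigma^(i)
A = np.stack([Phi @ (s[:, None] * Irect) @ Psi.T for s in sigmas])
B = sigmas.T                                   # columns sigma^(i)

# Target with strictly positive sigma* = diag(Phi^T X* Psi), as assumed WLOG
R = rng.standard_normal((d1, d2))
R[np.arange(d1), np.arange(d1)] = rng.uniform(0.5, 2.0, size=d1)
X_star = Phi @ R @ Psi.T
sigma_star = np.diag(Phi.T @ X_star @ Psi)
y = np.tensordot(A, X_star, axes=([1, 2], [0, 1]))

# Part (a): X_inf = Phi Diag(sigma*) Psi^T fits every measurement exactly
X_inf = Phi @ (sigma_star[:, None] * Irect) @ Psi.T
y_inf = np.tensordot(A, X_inf, axes=([1, 2], [0, 1]))
print(np.allclose(y_inf, y))

# Part (b): stationarity, grad ||X_inf||_* = Phi Irect Psi^T = sum_i nu_i A_i
# for nu solving B nu = 1, which exists since 1 is in the column space of B
nu, *_ = np.linalg.lstsq(B, np.ones(d1), rcond=None)
grad_nuc = Phi @ Irect @ Psi.T
print(np.allclose(grad_nuc, np.tensordot(nu, A, axes=1)))
```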

5. NUMERICAL STUDIES

In this section, we present numerical studies to complement our theoretical analysis. Additional experiments on the multi-layer SNN architecture, as well as with relaxed assumptions, can be found in Appendix E. We highlight that gradient flow can be viewed as gradient descent with an infinitesimal learning rate. Therefore, the gradient flow model only acts as a good proxy for gradient descent when the learning rate is sufficiently small. Throughout our experiments, we consider gradient descent with varying learning rates, and demonstrate that the behavior suggested by our theory is best realized with small learning rates. We generate the true matrix by sampling each entry of X⋆ independently from a standard Gaussian distribution, suitably normalized. For every measurement matrix A_i, i = 1, …, m, we sample each entry of the diagonal matrix Ā_i from the uniform distribution on (0, 1), sort them in decreasing order, and set A_i = Φ Ā_i Ψ^⊤, where Φ and Ψ are randomly generated orthogonal matrices. We then record the measurements y_i = ⟨A_i, X⋆⟩, i = 1, …, m. For every k = 1, …, K, we initialize Ū_k(0) and V̄_k(0) to be diagonal matrices whose diagonal entries are sampled uniformly from (0, 10^{−3}) and sorted in descending order. Similarly, each α_k is sampled uniformly from (0, 10^{−3}). We take d_1 = d_2 = 10 (thus X⋆ has 100 entries), m = 60 measurement matrices, and K = 3. As for the non-linearity, we use the sigmoid γ(x) = 1/(1 + e^{−x}). In the first experiment (c.f. Fig. 2), we empirically demonstrate that the singular values of the solution matrices converge to σ⋆ at an exponential rate, as suggested by Theorem 2. From the leftmost plot of Fig. 2, we observe that when running gradient descent with a small learning rate, i.e., 10^{−4}, the singular values of X converge to σ⋆ exponentially fast. By visual inspection, it takes fewer than 4000 iterations of gradient descent for the singular values of X to converge.
As we increase the learning rate, the convergence rate slows down significantly, as demonstrated by the middle and rightmost plots of Fig. 2 . For the learning rates of 10 -3 and 10 -2 , it takes approximately 6000 and more than 10000 iterations respectively to converge. We re-emphasize that our theoretical results are for gradient flow, which only acts as a good surrogate to study gradient descent when the learning rates are infinitesimally small. As a result, our theory cannot characterize the behavior of gradient descent algorithm with substantially large learning rates. In the second experiment (c.f. Fig. 3 ), we show the evolution of the nuclear norm over time. Interestingly, but perhaps not surprisingly, the choice of the learning rate dictates the speed of convergence. Moderate values of the learning rate seem to yield the quickest convergence to stationarity. 
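The gradient-flow picture can also be checked directly by integrating the dynamics of Theorem 1 with forward Euler. The sketch below uses small, arbitrary dimensions of our own choosing (not the experimental settings above) and a well-conditioned non-negative measurement spectrum; per Theorem 2, Hα should increase towards σ⋆.

```python
import numpy as np

rng = np.random.default_rng(5)
d1, m, K = 4, 6, 8
sig = lambda x: 1.0 / (1.0 + np.exp(-x))
dsig = lambda x: sig(x) * (1.0 - sig(x))

# Well-conditioned non-negative B (columns sigma^(i)); 1 lies in col span of B
B = rng.uniform(0.05, 0.2, size=(d1, m))
B[:, :d1] += 2.0 * np.eye(d1)
C = B @ B.T

sigma_star = rng.uniform(0.8, 1.5, size=d1)
alpha = rng.uniform(0, 1e-3, size=K)           # near-zero alpha: Assumption 2(c)
u = rng.uniform(0, 2.0, size=(K, d1))          # diag(U_bar_k)
v = rng.uniform(0, 2.0, size=(K, d1))          # diag(V_bar_k)

dt, steps = 1e-3, 150_000                      # forward Euler on the flow
for _ in range(steps):
    z = u * v                                  # z^(k)
    H = sig(z).T                               # d1 x K, columns gamma(z^(k))
    e = sigma_star - H @ alpha
    Ce = C @ e
    lam = alpha[:, None] * dsig(z) * Ce[None, :]   # lambda^(k), Theorem 1
    alpha = alpha + dt * (H.T @ Ce)            # d(alpha)/dt = H^T C (sigma* - H alpha)
    u, v = u + dt * lam * v, v + dt * lam * u  # d(U_bar)/dt = L_k V_bar, etc.

final_err = sigma_star - sig(u * v).T @ alpha
print(np.max(np.abs(final_err)))               # near zero: H alpha -> sigma*
```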

6. SUMMARY OF CONTRIBUTIONS AND FUTURE DIRECTIONS

In this work, we investigate the phenomenon of implicit regularization via gradient flow in neural networks, using the problem of matrix sensing as a canonical test bed. We undertake our investigations in the more realistic scenario of non-linear activation functions, compared to the mostly linear structure that has been explored in the literature. In this endeavor, we contribute a novel neural network architecture called the Spectral Neural Network (SNN) that is particularly well-suited for matrix learning problems. SNNs are characterized by a spectral application of a non-linear activation function to matrix-valued input, rather than an entrywise one. Conceptually, this entails coordinatizing the space of matrices by their singular values and singular vectors, as opposed to by their entries. We believe that this perspective has the potential to gain increasing salience in a wide array of matrix learning scenarios. SNNs are particularly well-suited for theoretical analysis due to the spectral nature of their non-linearities, as opposed to vanilla neural nets, while at the same time provably guaranteeing effectiveness in matrix learning problems. We also introduce a much more general counterpart of the near-zero initialization that is popular in related literature, and our methods are able to handle a much more robust class of initializing setups that are constrained only via certain inequalities. Our theoretical contributions include a compact analytical representation of the gradient flow dynamics, accorded by the spectral nature of our network architecture.
We provide a rigorous proof of exponentially fast convergence of gradient descent to an approximation of the original matrix that is best in a certain class, complemented by a matching lower bound. Finally, we demonstrate that the matrix-valued limit of the gradient flow dynamics achieves zero training loss and is a minimizer of the matrix nuclear norm, thereby rigorously establishing the phenomenon of implicit regularization in this setting. Our work raises several exciting possibilities for follow-up and future research. A natural direction is to extend our detailed analysis to the most general setting, in which the sensing matrices A_i are non-commuting. An investigation of the dynamics in the setting of discrete-time gradient descent (as opposed to continuous-time gradient flow) is an important question, wherein the optimal choice of the learning rate appears to be an intriguing issue, especially in the context of our numerical studies (c.f. Fig. 3). Finally, it would be of great interest to develop a general theory of SNNs for applications of neural network-based techniques to matrix learning problems.

7. REPRODUCIBILITY STATEMENT

In our paper, we dedicate substantial effort to improving the reproducibility and comprehensibility of both our theoretical results and numerical studies. We formally state and discuss the necessity and implications of our assumptions (please see the paragraphs following each assumption) before presenting our theoretical results. We also provide proof sketches of our main theoretical results. In these sketches, we present the key ideas and high-level directions, and refer the reader to the more detailed and complete proofs in the Appendices. For the numerical studies, we provide details of the different settings in Section 5 and Appendix E. The Python code used to conduct our experiments is included in the supplementary material as a zip file and is also publicly available at https://github.com/porichoy-gupto/spectral-neural-nets.

Supplementary Material

Implicit regularization via Spectral Neural Networks and non-linear matrix sensing

A PROOF OF THEOREM 1

Before presenting the proof of Theorem 1, we define a few notations. Let A ∈ R^{d1×d2}, d1 < d2, be a rectangular matrix and a ∈ R^{d1} be a vector. We let A_{ij} denote the (i, j)-entry of A, A_{i*} denote the i-th row of A, and A_{*j} denote the j-th column of A. We define the following functions on the matrix A:

vec : R^{d1×d2} → R^{d1 d2}, vec(A) = [A_{*1}^⊤, …, A_{*d2}^⊤]^⊤, (14)
diag : R^{d1×d2} → R^{d1}, diag(A) = [A_{11}, …, A_{d1 d1}]^⊤, (15)
Diag : R^{d1} → R^{d1×d2}, Diag(a) = the rectangular diagonal matrix with a_1, …, a_{d1} on its main diagonal and zeros elsewhere. (16)

We are now ready to present the proof of Theorem 1. We first recall the definition of X in Eq. 3:

X = Σ_{k=1}^K α_k Γ(U_k V_k^⊤),

where Γ(·) is a matrix-valued function that applies a non-linear scalar-valued function γ(·) to the matrix's singular values. Under Assumption 2, we can write U_k V_k^⊤ as

U_k V_k^⊤ = (Φ Ū_k G)(G^⊤ V̄_k^⊤ Ψ^⊤) = Φ Ū_k V̄_k^⊤ Ψ^⊤.

Since both Ū_k and V̄_k are diagonal matrices, their product Ū_k V̄_k^⊤ is also diagonal. Consequently, we can write Γ(U_k V_k^⊤) = Φ γ(Ū_k V̄_k^⊤) Ψ^⊤, where γ(·) is applied entry-wise to the diagonal of Ū_k V̄_k^⊤. We can now write the matrix X as

X = Σ_{k=1}^K α_k Φ γ(Ū_k V̄_k^⊤) Ψ^⊤ = Φ (Σ_{k=1}^K α_k γ(Ū_k V̄_k^⊤)) Ψ^⊤.

For notational convenience, we define the following objects, to be used throughout this section:

X̄ = Σ_{k=1}^K α_k γ(Ū_k V̄_k^⊤) ∈ R^{d1×d2} : diagonal matrix containing the singular values of X, (18)
x̄ = diag(X̄) ∈ R^{d1} : vector containing the singular values of X, (19)
x = vec(X) ∈ R^{d1 d2} : the matrix X expressed as a vector, (20)
Θ = Ψ ⊠ Φ ∈ R^{d1 d2 × d1} : Khatri-Rao product between Ψ and Φ, (21)
G = ∂ℓ(X)/∂X = −Σ_{j=1}^m (y_j − ⟨A_j, X⟩) A_j ∈ R^{d1×d2} : gradient of ℓ(X) with respect to X, (22)
g = vec(G) ∈ R^{d1 d2} : the gradient G expressed as a vector, (23)
Z_k = U_k V_k^⊤ ∈ R^{d1×d2} : product matrix between U_k and V_k^⊤. (24)

The Khatri-Rao product in Eq.
21 is defined as follows: the columns of the matrix $\Theta$ are Kronecker products of the corresponding columns of $\Psi$ and $\Phi$. In other words, the $i$-th column of $\Theta$ can be expressed as the vectorization of the outer product between the $i$-th column of $\Phi$ and the $i$-th column of $\Psi$, i.e.,
$$\Theta_{*i} = \mathrm{vec}(\phi_i \psi_i^\top). \tag{25}$$
We shall see in the next paragraph that, by leveraging the Khatri--Rao product, we can write the differentials of many quantities of interest compactly, facilitating the derivation of the gradient flow dynamics in Theorem 4. In particular, we can use the Khatri--Rao product to expand $\mathrm{vec}(X)$ as follows:
$$X = \Phi \bar{X} \Psi^\top = \sum_{i=1}^{d_1} \bar{X}_{ii} \, \phi_i \psi_i^\top, \tag{26}$$
$$x = \mathrm{vec}(X) = \sum_{i=1}^{d_1} \bar{X}_{ii} \, \mathrm{vec}(\phi_i \psi_i^\top) = \sum_{i=1}^{d_1} \bar{X}_{ii} \, \Theta_{*i} = \Theta \bar{x}. \tag{27}$$
From here, we can write the differential of $X$ as
$$dx = \Theta \, d\bar{x}. \tag{28}$$
Since $\bar{x}_i$ is the $i$-th singular value of the matrix $X$, we can express its differential with respect to $Z_k$ as
$$d\bar{x}_i = \langle \phi_i \psi_i^\top, dX \rangle = \langle \phi_i \psi_i^\top, \alpha_k \gamma'(z_i^{(k)}) \, dZ_k \rangle = \langle \alpha_k \gamma'(z_i^{(k)}) \phi_i \psi_i^\top, dZ_k \rangle, \tag{29}$$
where the first equality is due to Eq. 3, and the second equality is due to $\alpha_k$ and $\gamma'(z_i^{(k)})$ being scalars. Notice that we can write the vector $\bar{x}$ as a sum over its entries, $\bar{x} = \sum_{i=1}^{d_1} \bar{x}_i \, e_i$, where $e_i$ denotes the $i$-th canonical basis vector of $\mathbb{R}^{d_1}$. We have
$$d\bar{x} = \sum_{i=1}^{d_1} d\bar{x}_i \, e_i = \sum_{i=1}^{d_1} \langle \alpha_k \gamma'(z_i^{(k)}) \phi_i \psi_i^\top, dZ_k \rangle \, e_i = \Big\langle \sum_{i=1}^{d_1} \alpha_k \gamma'(z_i^{(k)}) \, e_i \star \phi_i \psi_i^\top, \; dZ_k \Big\rangle, \tag{30}$$
where $\star$ denotes the tensor product, i.e., $e_i \star \phi_i \psi_i^\top \in \mathbb{R}^{d_1 \times d_1 \times d_2}$ is a third-order tensor. Since $dZ_k$ has dimension $d_1 \times d_2$, the above Frobenius product returns a vector of dimension $d_1$, which matches that of $d\bar{x}$. Substituting Eq. 30 into the differential of $\ell(X)$ gives
$$d\ell(X) = \Big\langle \frac{\partial \ell}{\partial X}, dX \Big\rangle = \langle G, dX \rangle = g^\top dx = g^\top \Theta \, d\bar{x} = \Big\langle \sum_{i=1}^{d_1} \alpha_k \gamma'(z_i^{(k)}) \, (g^\top \Theta e_i) \, \phi_i \psi_i^\top, \; dZ_k \Big\rangle. \tag{31}$$
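The identity $\mathrm{vec}(X) = \Theta \bar{x}$ in Eq. 27 is easy to verify numerically with a column-major `vec` and an explicitly constructed column-wise Khatri--Rao product; the dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 3, 5

# Orthonormal factors: Phi is d1 x d1, Psi is d2 x d1 (orthonormal columns).
Phi, _ = np.linalg.qr(rng.standard_normal((d1, d1)))
Psi, _ = np.linalg.qr(rng.standard_normal((d2, d1)))
xbar = rng.standard_normal(d1)           # "singular values" of X
X = Phi @ np.diag(xbar) @ Psi.T          # X = Phi Xbar Psi^T

# Column-wise Khatri-Rao product: i-th column is vec(phi_i psi_i^T) = psi_i ⊗ phi_i
# (column-major vec).  scipy.linalg.khatri_rao(Psi, Phi) builds the same matrix.
Theta = np.stack([np.kron(Psi[:, i], Phi[:, i]) for i in range(d1)], axis=1)

# Eq. 27: vec(X) = Theta @ xbar.
assert Theta.shape == (d1 * d2, d1)
assert np.allclose(X.flatten(order="F"), Theta @ xbar)
```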
Let us define the scalar $\lambda_i^{(k)}$ as
$$\lambda_i^{(k)} = -\alpha_k \gamma'(z_i^{(k)}) \, g^\top \Theta e_i \overset{(a)}{=} -\alpha_k \gamma'(z_i^{(k)}) \, g^\top \Theta_{*i} \overset{(b)}{=} -\alpha_k \gamma'(z_i^{(k)}) \, g^\top \mathrm{vec}(\phi_i \psi_i^\top) = -\alpha_k \gamma'(z_i^{(k)}) \, \langle G, \phi_i \psi_i^\top \rangle$$
$$\overset{(c)}{=} \alpha_k \gamma'(z_i^{(k)}) \Big\langle \sum_{j=1}^{m} (y_j - \langle A_j, X \rangle) A_j, \; \phi_i \psi_i^\top \Big\rangle = \alpha_k \gamma'(z_i^{(k)}) \sum_{j=1}^{m} (y_j - \langle A_j, X \rangle) \, \langle A_j, \phi_i \psi_i^\top \rangle$$
$$\overset{(d)}{=} \alpha_k \gamma'(z_i^{(k)}) \sum_{j=1}^{m} \Big( \langle \sigma^{(j)}, \sigma^\star \rangle - \sum_{l=1}^{K} \alpha_l \langle \sigma^{(j)}, \gamma(z^{(l)}) \rangle \Big) \sigma_i^{(j)} \overset{(e)}{=} \alpha_k \gamma'(z_i^{(k)}) \cdot \mathrm{row}_i(B) \, B^\top (\sigma^\star - H\alpha),$$
where (a) is because $\Theta e_i$ equals the $i$-th column of $\Theta$, (b) is due to Eq. 25, (c) is due to the definition of the matrix $G$ in Eq. 22, (d) is due to Eq. 10, and (e) is due to the definitions of $B$ and $H$. Letting $\lambda^{(k)} \in \mathbb{R}^{d_1}$ denote the vector containing the $\lambda_i^{(k)}$, we can write
$$\lambda^{(k)} = \alpha_k \cdot \gamma'(z^{(k)}) \odot BB^\top (\sigma^\star - H\alpha) = \alpha_k \cdot \gamma'(z^{(k)}) \odot C(\sigma^\star - H\alpha), \tag{32}$$
where $\odot$ denotes the entry-wise product. The differential $d\ell(X)$ becomes
$$d\ell(X) = -\Big\langle \sum_{i=1}^{d_1} \lambda_i^{(k)} \phi_i \psi_i^\top, \; dZ_k \Big\rangle, \qquad \frac{\partial \ell(X)}{\partial Z_k} = -\sum_{i=1}^{d_1} \lambda_i^{(k)} \phi_i \psi_i^\top = -\Phi L^{(k)} \Psi^\top, \tag{33}$$
where $L^{(k)}$ is a diagonal matrix whose diagonal entries are the $\lambda_i^{(k)}$, i.e., $L^{(k)} = \mathrm{Diag}(\lambda^{(k)})$. Since $Z_k = U_k V_k^\top$, we have
$$\frac{\partial \ell(X)}{\partial U_k} = -\Phi L^{(k)} \Psi^\top V_k, \tag{34}$$
$$\frac{\partial \ell(X)}{\partial V_k} = -(\Phi L^{(k)} \Psi^\top)^\top U_k. \tag{35}$$
This concludes the proof for the gradient flow dynamics on $U_k$ and $V_k$. In the remainder, we derive the gradient of $\ell(X)$ with respect to the scalar $\alpha_k$.

$$\frac{\partial \ell(X)}{\partial \alpha_k} = -\sum_{j=1}^{m} (y_j - \langle A_j, X \rangle) \, \langle \Gamma(U_k V_k^\top), A_j \rangle = -\sum_{j=1}^{m} \Big( \langle \sigma^{(j)}, \sigma^\star \rangle - \sum_{l=1}^{K} \alpha_l \langle \sigma^{(j)}, \gamma(z^{(l)}) \rangle \Big) \langle \sigma^{(j)}, \gamma(z^{(k)}) \rangle$$
$$= -\gamma(z^{(k)})^\top C \Big( \sigma^\star - \sum_{l=1}^{K} \alpha_l \gamma(z^{(l)}) \Big) = -\gamma(z^{(k)})^\top C (\sigma^\star - H\alpha),$$
where the second equality is due to Eq. 10. Consequently, the gradient of $\ell(X)$ with respect to the vector $\alpha$ is
$$\frac{\partial \ell(X)}{\partial \alpha} = -H^\top C (\sigma^\star - H\alpha),$$
which concludes the proof of Theorem 1.
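As a sanity check on the closed form $\partial \ell / \partial \alpha = -H^\top C(\sigma^\star - H\alpha)$, one can compare it against central finite differences of the reduced loss $\ell(\alpha) = \tfrac{1}{2}\|B^\top(\sigma^\star - H\alpha)\|^2$, which is the form the loss takes in the commuting setting via $\langle A_j, X \rangle = \langle \sigma^{(j)}, H\alpha \rangle$, with $H$ held fixed. All data below are synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, m, K = 4, 6, 3

B = np.abs(rng.standard_normal((d1, m)))   # columns are the sigma^{(j)} (non-negative)
C = B @ B.T
H = np.abs(rng.standard_normal((d1, K)))   # H held fixed for this check
sigma_star = rng.uniform(1.0, 2.0, size=d1)
alpha = rng.standard_normal(K)

# Reduced loss in the commuting setting: <A_j, X> = <sigma^{(j)}, H alpha>.
def loss(a):
    r = sigma_star - H @ a
    return 0.5 * np.sum((B.T @ r) ** 2)

grad_closed = -H.T @ C @ (sigma_star - H @ alpha)   # closed form from Theorem 1

# Central finite differences, coordinate by coordinate.
eps = 1e-6
grad_fd = np.array([
    (loss(alpha + eps * e) - loss(alpha - eps * e)) / (2 * eps)
    for e in np.eye(K)
])
assert np.allclose(grad_closed, grad_fd, rtol=1e-3, atol=1e-5)
```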

B PROOF OF THEOREM 2

Let us direct our attention to the evolution of the diagonal elements. Restricting Eq. 12 to the diagonal elements gives a system of differential equations: for each $i \in [d_1]$,
$$[\partial_t \bar{U}_k]_{ii} = \lambda_i^{(k)} [\bar{V}_k]_{ii}, \qquad [\partial_t \bar{V}_k]_{ii} = \lambda_i^{(k)} [\bar{U}_k]_{ii}.$$
We can rewrite the above as a single matrix differential equation:
$$\partial_t \begin{bmatrix} [\bar{U}_k]_{ii} \\ [\bar{V}_k]_{ii} \end{bmatrix} = \lambda_i^{(k)} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} [\bar{U}_k]_{ii} \\ [\bar{V}_k]_{ii} \end{bmatrix}.$$
For the remainder of this section, we define the following notations for ease of presentation:
$$x_i^{(k)} = \begin{bmatrix} [\bar{U}_k]_{ii} \\ [\bar{V}_k]_{ii} \end{bmatrix}, \qquad A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \tag{40}$$
$$w_i^{(k)} = \tfrac{1}{2} x_i^{(k)\top} x_i^{(k)} = \tfrac{1}{2} \big( [\bar{U}_k]_{ii}^2 + [\bar{V}_k]_{ii}^2 \big), \qquad \partial_t w_i^{(k)} = x_i^{(k)\top} \partial_t x_i^{(k)}, \tag{41, 42}$$
$$z_i^{(k)} = \tfrac{1}{2} x_i^{(k)\top} A x_i^{(k)} = [\bar{U}_k]_{ii} [\bar{V}_k]_{ii}, \qquad \partial_t z_i^{(k)} = x_i^{(k)\top} A \, \partial_t x_i^{(k)}. \tag{43, 44}$$
The matrix differential equation above becomes $\partial_t x_i^{(k)} = \lambda_i^{(k)} A x_i^{(k)}$, whence
$$x_i^{(k)\top} \partial_t x_i^{(k)} = \lambda_i^{(k)} x_i^{(k)\top} A x_i^{(k)}, \qquad \text{i.e.,} \qquad \partial_t w_i^{(k)} = 2 \lambda_i^{(k)} z_i^{(k)}. \tag{45}$$
On another note, since $A^2 = I$,
$$x_i^{(k)\top} A \, \partial_t x_i^{(k)} = \lambda_i^{(k)} x_i^{(k)\top} A A x_i^{(k)}, \qquad \text{i.e.,} \qquad \partial_t z_i^{(k)} = 2 \lambda_i^{(k)} w_i^{(k)}. \tag{46}$$
We are now ready to prove the main result. In the remaining proof, we derive the differential equation for $H\alpha$. By the product rule of calculus,
$$\partial_t (H\alpha) = (\partial_t H)\alpha + H (\partial_t \alpha).$$
We derive $(\partial_t H)\alpha$ and $H(\partial_t \alpha)$ separately. First, consider the evolution of $H$ over time:
$$\partial_t H = \Big[ \cdots \,\Big|\, \gamma'(z^{(k)}) \odot \partial_t z^{(k)} \,\Big|\, \cdots \Big] = \Big[ \cdots \,\Big|\, \gamma'(z^{(k)}) \odot 2\lambda^{(k)} \odot w^{(k)} \,\Big|\, \cdots \Big] = \Big[ \cdots \,\Big|\, 2\, \gamma'(z^{(k)})^{\odot 2} \odot C(\sigma^\star - H\alpha) \odot w^{(k)} \,\Big|\, \cdots \Big] \, \mathrm{Diag}(\alpha),$$
where the first equality follows from the definition of $H$ and the chain rule, the second equality is due to Eq. 46, and the last equality follows from the expression for $\lambda^{(k)}$ in Theorem 1. Multiplying by the vector $\alpha$ from the right on both sides gives
$$(\partial_t H)\alpha = 2 \sum_{k=1}^{K} \alpha_k^2 \cdot \gamma'(z^{(k)})^{\odot 2} \odot C(\sigma^\star - H\alpha) \odot w^{(k)}. \tag{51}$$
Recall from Theorem 1 that
$$\partial_t \alpha = H^\top C (\sigma^\star - H\alpha). \tag{52}$$
Multiplying by the matrix $H$ from the left on both sides gives
$$H (\partial_t \alpha) = H H^\top C (\sigma^\star - H\alpha). \tag{53}$$
Combining Eq. 51 and Eq. 53 gives
$$\partial_t (H\alpha) = 2 \sum_{k=1}^{K} \alpha_k^2 \cdot \gamma'(z^{(k)})^{\odot 2} \odot C(\sigma^\star - H\alpha) \odot w^{(k)} + H H^\top C (\sigma^\star - H\alpha). \tag{54}$$
Notice that, by definition, the matrix $B$ contains the singular values of the $A_i$'s, and therefore its entries are non-negative. Consequently, since $C = BB^\top$, the entries of $C$ are also non-negative. Finally, by definition in Eq. 41, $w^{(k)}$ has non-negative entries. Therefore, all quantities in Eq. 54 are non-negative entry-wise, except for the vector $(\sigma^\star - H\alpha)$. Consequently, both $(\partial_t H)\alpha$ and $H(\partial_t \alpha)$ have the same sign as $(\sigma^\star - H\alpha)$, and by our initialization this sign is non-negative. Furthermore, this non-negativity implies that
$$\partial_t (H\alpha) \geq H H^\top C (\sigma^\star - H\alpha), \tag{55}$$
$$\partial_t (H\alpha) \geq 2 \sum_{k=1}^{K} \alpha_k^2 \cdot \gamma'(z^{(k)})^{\odot 2} \odot C(\sigma^\star - H\alpha) \odot w^{(k)}. \tag{56}$$
We will have occasion to use both inequalities, depending on the situation. Finally, from Eq. 50 we can also write down a similar differential equation for each entry $H_{ij}$:
$$\partial_t H_{ij} = 2\, \alpha_j \, \gamma'(z_i^{(j)})^2 \, \big[ C(\sigma^\star - H\alpha) \big]_i \, w_i^{(j)}. \tag{57}$$
By Assumption 2, part (c), at initialization we have $H(0)\alpha(0) < \sigma^\star$ entry-wise. This implies that each entry of $\partial_t (H\alpha)$ is positive at initialization, and therefore $H\alpha$ is increasing in a neighborhood of $0$. As $H\alpha$ approaches $\sigma^\star$, the gradient $\partial_t (H\alpha)$ decreases and reaches $0$ exactly when $H\alpha = \sigma^\star$, which then causes $H\alpha$ to stay constant from then on. Thus, we have shown that
$$\lim_{t \to \infty} H(t)\alpha(t) = \sigma^\star \qquad \text{and} \qquad \partial_t (H\alpha) \geq 0.$$
Combining Eq. 45 and Eq. 46, we have
$$\partial_t w_i^{(k)} \cdot w_i^{(k)} - \partial_t z_i^{(k)} \cdot z_i^{(k)} = 0.$$
Integrating both sides with respect to time, we obtain that for any $t > 0$,
$$\big( w_i^{(k)}(t) \big)^2 - \big( z_i^{(k)}(t) \big)^2 = Q, \tag{58}$$
for some constant $Q \geq 0$ which does not depend on time. Since $w_i^{(k)}$ is non-negative by definition, the above implies that $w_i^{(k)}(t) \geq \sqrt{Q}$ for all $t > 0$; note that $Q > 0$ can be ensured via initialization, as discussed below. To this end, notice that
$$\big( w_i^{(k)}(0) \big)^2 - \big( z_i^{(k)}(0) \big)^2 = \tfrac{1}{4} \big( \bar{U}_{k,ii}(0)^2 + \bar{V}_{k,ii}(0)^2 \big)^2 - \bar{U}_{k,ii}(0)^2 \, \bar{V}_{k,ii}(0)^2 = \tfrac{1}{4} \big( \bar{U}_{k,ii}(0)^2 - \bar{V}_{k,ii}(0)^2 \big)^2.$$
Thus $Q > 0$ can always be arranged simply by initializing $\bar{U}_k(0)$ and $\bar{V}_k(0)$ with $|\bar{U}_{k,ii}(0)| \neq |\bar{V}_{k,ii}(0)|$.
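The conservation law of Eq. 58 can also be observed numerically: integrating the $2 \times 2$ system with any bounded, time-varying $\lambda(t)$ leaves $w^2 - z^2$ (equivalently $u^2 - v^2$) essentially unchanged. The midpoint integrator and the particular $\lambda(t)$ below are our choices for illustration.

```python
import numpy as np

# Integrate du/dt = lam(t) v, dv/dt = lam(t) u with a time-varying lam,
# and check that w^2 - z^2, with w = (u^2+v^2)/2 and z = uv, stays constant (Eq. 58).
u, v = 1.0, 0.3                      # distinct magnitudes so that Q > 0
Q0 = ((u**2 + v**2) / 2) ** 2 - (u * v) ** 2

dt, T = 1e-4, 50_000
for step in range(T):
    lam = np.sin(0.01 * step)        # any bounded lambda(t) works here
    # midpoint (RK2) step keeps the discretization drift of the invariant tiny
    um = u + 0.5 * dt * lam * v
    vm = v + 0.5 * dt * lam * u
    u, v = u + dt * lam * vm, v + dt * lam * um

Q_T = ((u**2 + v**2) / 2) ** 2 - (u * v) ** 2
assert abs(Q_T - Q0) < 1e-4          # the conserved quantity barely drifts
```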
This, in particular, implies that $w_i^{(k)}$ is bounded away from $0$ in time. In the remainder of this section, we show that the convergence rate is exponential. Below, we establish a lower bound on $\partial_t (H\alpha)$.

Case 1: the $\alpha_k$ are upper bounded by a finite constant $\alpha_{\max} > 0$ for all $k \in [K]$ and $t \in \mathbb{R}_+$. Let $h_1, \ldots, h_{d_1}$ denote the rows of $H$, and $b_1, \ldots, b_{d_1}$ denote the rows of $B$. We have
$$H \partial_t \alpha = H H^\top C (\sigma^\star - H\alpha) = (H H^\top)(B B^\top)(\sigma^\star - H\alpha) \tag{61}$$
$$\geq \mathrm{Diag}\big( h_1^\top h_1, \ldots, h_{d_1}^\top h_{d_1} \big) \, \mathrm{Diag}\big( b_1^\top b_1, \ldots, b_{d_1}^\top b_{d_1} \big) (\sigma^\star - H\alpha) \tag{63}$$
$$= \big[ h_1^\top h_1 \cdot b_1^\top b_1, \; \ldots, \; h_{d_1}^\top h_{d_1} \cdot b_{d_1}^\top b_{d_1} \big]^\top \odot (\sigma^\star - H\alpha), \tag{64}$$
where the inequality holds entry-wise because the entries of $H$ and $B$ are non-negative (so the discarded off-diagonal entries of $HH^\top$ and $BB^\top$ are non-negative) and $(\sigma^\star - H\alpha)$ is entry-wise non-negative. Let us focus on the evolution of the $i$-th entry of $H\alpha$:
$$\partial_t (H\alpha)_i = [(\partial_t H)\alpha]_i + [H(\partial_t \alpha)]_i \geq [H(\partial_t \alpha)]_i \geq \big( h_i^\top h_i \cdot b_i^\top b_i \big) \big( \sigma^\star_i - (H\alpha)_i \big), \tag{66}$$
where the first inequality holds because $[H(t)\alpha(t)]_i \leq \sigma^\star_i$, which makes $[(\partial_t H)\alpha]_i$ non-negative. Notice that the $b_i$ are constants with respect to time, and are non-zero because of the condition that the all-ones vector lies in the range of $B$. Therefore, to establish a lower bound on $\partial_t (H\alpha)_i$, it remains to show that $h_i^\top h_i$ (equivalently, $\|h_i\|$) is bounded away from $0$. In the previous part, we have shown that $H(t)\alpha(t)$ approaches $\sigma^\star$ as $t \to \infty$. Therefore, for any $\epsilon > 0$, there exists a time $t_0$ after which $|\sigma^\star_i - (H\alpha)_i| \leq \epsilon$. Notice that $(H\alpha)_i = h_i^\top \alpha$.
By the Cauchy--Schwarz inequality, for $t > t_0$ we have
$$\|h_i(t)\| \cdot \|\alpha(t)\| \geq |h_i(t)^\top \alpha(t)| \geq |\sigma^\star_i| - |\sigma^\star_i - h_i(t)^\top \alpha(t)| \geq |\sigma^\star_i| - \epsilon,$$
where the second inequality is due to the triangle inequality. Choosing $\epsilon = |\sigma^\star_i|/2$, we obtain
$$\|h_i(t)\| \geq \frac{|\sigma^\star_i|}{2 \|\alpha(t)\|} \geq \frac{|\sigma^\star_i|}{2 \sqrt{K} \, \alpha_{\max}}. \tag{68}$$
Let us define the constants $\eta_i = \big( |\sigma^\star_i| / (2\sqrt{K}\,\alpha_{\max}) \big)^2 \cdot b_i^\top b_i$ and $\beta_i = \sigma^\star_i - (H\alpha)_i$. Notice that $\eta_i$ is a constant with respect to $t$. Combining with Eq. 66, we obtain the differential inequality
$$\partial_t \beta_i = -\partial_t (H\alpha)_i \leq -\eta_i \beta_i. \tag{69}$$
Integrating this differential inequality, we get that
$$\beta_i(t) = \sigma^\star_i - (H(t)\alpha(t))_i \leq C e^{-\eta_i t},$$
for some constant $C > 0$ and all large enough $t$. This shows that $H\alpha$ converges to $\sigma^\star$ at an exponential rate.

Definition. In the complement of Case 1, we define the subset $S \subseteq [K]$ such that $j \in S$ if and only if $\lim_{t \to \infty} \alpha_j(t) = +\infty$.

In order to deal with the complement of Case 1, we now proceed to handle the convergence of $H\alpha$ to $\sigma^\star$ coordinate-wise. To this end, we consider the coordinates $i \in [d_1]$ according to the limiting behavior of the $H_{ij}$ in tandem with that of $\alpha_j$ (as $j$ ranges over $S$ for this $i$).

Case 2: the index $i \in [d_1]$ is such that $H_{ij} \alpha_j \to 0$ as $t \to \infty$ for all $j \in S$. In this case, we consider the left and right sides of Eq. 54 for the $i$-th coordinate, and notice that in fact we have $\sigma^\star_i = \lim_{t \to \infty} \sum_{j \notin S} H_{ij} \alpha_j$. Since $\alpha_j$ for each $j \notin S$ converges to a finite real number, for this particular index $i \in [d_1]$ we can employ the argument of Case 1, with the lower-dimensional vector $(\alpha_j)_{j \notin S}$ in place of $(\alpha_j)_{j \in [K]}$.

Case 3: the index $i \in [d_1]$ is such that there exists $j \in S$ for which $H_{ij} \alpha_j$ does not converge to zero as $t \to \infty$ and $H_{ij}$ remains bounded away from $0$. In this case, the $i$-th row of $H$, denoted $h_i$, satisfies the condition that $\|h_i\|^2$ is bounded away from $0$ as $t \to \infty$, thanks to this coordinate $j$.
As such, we can apply the exponential decay argument of Case 1 by combining Eq. 66 and Eq. 69.

Case 4: the index $i \in [d_1]$ is such that there exists $j \in S$ for which $H_{ij} \alpha_j$ does not converge to zero as $t \to \infty$ and $H_{ij} \to 0$. For the proof of Case 4 in Theorem 2, we further assume that if $x_0$ is a zero of the activation function $\gamma$, then $\gamma'(x) \geq c \gamma(x)$ for $x$ sufficiently close to $x_0$, with $c > 0$ a constant. This covers the case $x_0 = -\infty$, as for sigmoid functions, where the condition is taken to hold for $x$ sufficiently large and negative. This condition is satisfied by nearly all of the activation functions of interest to us, including sigmoid functions, the tanh function, and the truncated ReLU with quadratic smoothing, among others. We believe that this condition is a technical artifact of our proof, and endeavor to remove it in an upcoming version of the manuscript.

In this setting, we invoke Eq. 54 in its $i$-th component and lower bound its right-hand side by the $k = j$ summand therein; in other words, we write
$$\partial_t (H\alpha)_i \geq 2\, \alpha_j^2 \, \gamma'(z_i^{(j)})^2 \, \big[ C(\sigma^\star - H\alpha) \big]_i \, w_i^{(j)}. \tag{72}$$
Since $H_{ij} = \gamma(z_i^{(j)})$, for large enough $t$ the quantity $z_i^{(j)}$ is close to $z_0$, the zero of $\gamma$ (in the event $z_0 = -\infty$, this means that $z_i^{(j)}$ is large and negative). By the properties of the activation function $\gamma$, this implies that the inequality $\gamma'(z_i^{(j)}) \geq c \gamma(z_i^{(j)})$ holds with an appropriate constant $c > 0$ for large enough $t$. Notice that $\alpha_j \uparrow +\infty$, and from Eq. 57 we have $\partial_t H_{ij} \geq 0$ for large enough time $t$, which implies that $H_{ij}$ is non-decreasing for large time. As a result, $H_{ij} \alpha_j$ is non-decreasing for large time. But recall that $H_{ij} \alpha_j$ does not converge to zero as $t \to \infty$ by the defining condition of this case, which, combined with the non-decreasing property established above, implies that $H_{ij} \alpha_j$ is bounded away from $0$ for large time. Finally, recall that $w_i^{(j)} \geq \sqrt{Q}$.
Combining these observations, we deduce from Eq. 72 that, for appropriate constants $c_1, c_2 > 0$,
$$\partial_t (H\alpha)_i \geq c_1 \, \alpha_j^2 \, \gamma(z_i^{(j)})^2 \, \big[ C(\sigma^\star - H\alpha) \big]_i \, w_i^{(j)} \geq c_2 \cdot \big[ C(\sigma^\star - H\alpha) \big]_i.$$
We then proceed as in Eq. 69 and obtain exponentially fast convergence, as desired.

Establishing a matching exponential lower bound on $\beta_i(t)$ is less important from a practical point of view, so we only show such a bound under the assumption that $\alpha_k(t)$ remains bounded by some constant $\alpha_{\max} > 0$ for all $k$. For simplicity, we also assume that the non-linearity $\gamma$ is such that $\gamma'(x) = O\big( 1/\sqrt{|x|} \big)$ as $x \to \pm\infty$ (see Eq. 60); this assumption ensures that $\gamma'(z_i^{(k)})^2 \, w_i^{(k)}$ remains uniformly bounded for all $i$ and $k$. Further, the entries of $H$ are uniformly bounded, and $C$ is a fixed matrix. Therefore, for some constant $\zeta > 0$, we have, for all $i \in [d_1]$,
$$\partial_t (H\alpha)_i = 2 \sum_{k=1}^{K} \alpha_k^2 \big( \gamma'(z_i^{(k)}) \big)^2 w_i^{(k)} \, \big[ C(\sigma^\star - H\alpha) \big]_i + \sum_j (H H^\top)_{ij} \big[ C(\sigma^\star - H\alpha) \big]_j \leq \zeta \sum_{\ell} (\sigma^\star - H\alpha)_\ell.$$
A fortiori,
$$\partial_t \sum_i \beta_i = \sum_i \partial_t \beta_i = -\sum_i \partial_t (H\alpha)_i \geq -\zeta \sum_i \sum_\ell \beta_\ell = -d_1 \zeta \sum_i \beta_i.$$
Integrating this differential inequality, we conclude that for some constant $C_1 > 0$,
$$\sum_i \beta_i(t) \geq C_1 e^{-d_1 \zeta t}, \qquad \text{for all } t > 0.$$
The Cauchy--Schwarz bound $\sum_i \beta_i(t) \leq \sqrt{d_1} \, \|\beta(t)\|_2$ then implies that, for all $t > 0$,
$$\|\sigma^\star - H(t)\alpha(t)\|_2 = \|\beta(t)\|_2 \geq \frac{C_1}{\sqrt{d_1}} e^{-d_1 \zeta t} = C e^{-\eta t},$$
where $C = C_1/\sqrt{d_1}$ and $\eta = d_1 \zeta$. This completes the proof of the lower bound.
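A small simulation of the diagonal gradient-flow dynamics (Euler-discretized, with an illustrative non-linearity $\gamma(x) = \tanh(x) + 1$ and a synthetic $B$, $\sigma^\star$, and initialization of our choosing) shows the loss driven rapidly toward zero, consistent with the convergence established above. This is a sketch, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, m, K = 3, 5, 2

B = np.abs(rng.standard_normal((d1, m)))     # non-negative "singular value" matrix
C = B @ B.T
sigma_star = np.array([1.0, 0.8, 0.5])

gamma  = lambda x: np.tanh(x) + 1.0          # illustrative smooth, increasing gamma
dgamma = lambda x: 1.0 / np.cosh(x) ** 2

# Spectral-style initialization: small positive diagonals with distinct
# magnitudes (so Q > 0), small alpha so that H(0) alpha(0) < sigma_star.
U = 0.2 + 0.1 * rng.random((K, d1))          # row k holds the diagonal of bar-U_k
V = 0.1 + 0.1 * rng.random((K, d1))
alpha = 1e-3 * np.ones(K)

dt, T = 1e-3, 50_000
loss0 = None
for step in range(T):
    Z = U * V                                # z^{(k)} in row k
    H = gamma(Z).T                           # d1 x K
    r = sigma_star - H @ alpha
    if step == 0:
        loss0 = 0.5 * np.sum((B.T @ r) ** 2)
    Cr = C @ r
    lam = alpha[:, None] * dgamma(Z) * Cr[None, :]   # lambda^{(k)} (Eq. 32)
    U, V = U + dt * lam * V, V + dt * lam * U        # diagonal dynamics
    alpha = alpha + dt * (H.T @ Cr)                  # Theorem 1 flow on alpha

lossT = 0.5 * np.sum((B.T @ (sigma_star - gamma(U * V).T @ alpha)) ** 2)
assert lossT < 1e-2 * loss0                  # loss collapses toward zero
```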

C PROOF OF THEOREM 3

By our definition of $X_\infty$, we have
$$X_\infty = \lim_{t \to \infty} X(t) = \lim_{t \to \infty} \sum_{k=1}^{K} \alpha_k(t) \, \Gamma\big( U_k(t) V_k(t)^\top \big) = \lim_{t \to \infty} \sum_{k=1}^{K} \alpha_k(t) \, \Gamma\big( \Phi \bar{U}_k(t) G \cdot G^\top \bar{V}_k(t)^\top \Psi^\top \big)$$
$$= \lim_{t \to \infty} \sum_{k=1}^{K} \alpha_k(t) \, \Gamma\big( \Phi \bar{U}_k(t) \bar{V}_k(t)^\top \Psi^\top \big) = \lim_{t \to \infty} \sum_{k=1}^{K} \alpha_k(t) \, \Phi \, \gamma\big( \bar{U}_k(t) \bar{V}_k(t)^\top \big) \Psi^\top$$
$$= \Phi \Big( \lim_{t \to \infty} \sum_{k=1}^{K} \alpha_k(t) \, \gamma\big( \bar{U}_k(t) \bar{V}_k(t)^\top \big) \Big) \Psi^\top = \Phi \Big( \lim_{t \to \infty} \mathrm{Diag}\big( H(t)\alpha(t) \big) \Big) \Psi^\top = \Phi \, \mathrm{Diag}\Big( \lim_{t \to \infty} H(t)\alpha(t) \Big) \Psi^\top = \Phi \, \mathrm{Diag}(\sigma^\star) \, \Psi^\top,$$
where we have used the orthogonality of $G$, and the last equality is due to Theorem 2. Now, let us look at the prediction given by $X_\infty$: for any $i = 1, \ldots, m$, we have
$$\langle A_i, X_\infty \rangle = \langle \Phi \bar{A}_i \Psi^\top, \, \Phi \, \mathrm{Diag}(\sigma^\star) \, \Psi^\top \rangle = \langle \sigma^{(i)}, \sigma^\star \rangle = y_i,$$
where the last equality is due to Eq. 10. This implies that $\ell(X_\infty) = 0$, proving (a).

To prove (b), we show that $X_\infty$ satisfies the Karush--Kuhn--Tucker (KKT) conditions of the following optimization problem:
$$\min_{X \in \mathbb{R}^{d_1 \times d_2}} \|X\|_* \qquad \text{subject to} \qquad y_i = \langle A_i, X \rangle, \;\; \forall i \in [m]. \tag{74}$$
The KKT optimality conditions for the optimization problem in Eq. 74 are
$$\exists \nu \in \mathbb{R}^m \;\text{ s.t. }\; \forall i \in [m], \; \langle A_i, X \rangle = y_i \qquad \text{and} \qquad \nabla_X \|X\|_* + \sum_{i=1}^{m} \nu_i A_i = 0. \tag{75}$$
The solution matrix $X_\infty$ satisfies the first condition by claim (a) of Theorem 3; it remains to prove that $X_\infty$ also satisfies the second condition. The gradient of the nuclear norm of $X_\infty$ is given by
$$\nabla \|X_\infty\|_* = \Phi \Psi^\top. \tag{76}$$
Therefore, the second condition becomes
$$\Phi \Psi^\top + \sum_{i=1}^{m} \nu_i A_i = 0 \;\Leftrightarrow\; \Phi \Psi^\top + \sum_{i=1}^{m} \nu_i \Phi \bar{A}_i \Psi^\top = 0 \;\Leftrightarrow\; \Phi \Big( I + \sum_{i=1}^{m} \nu_i \bar{A}_i \Big) \Psi^\top = 0 \;\Leftrightarrow\; \sum_{i=1}^{m} \nu_i \sigma^{(i)}_j = -1, \; \forall j \in [d_1] \;\Leftrightarrow\; B\nu = -\mathbf{1}.$$
By Assumption 1, the all-ones vector $\mathbf{1}$ lies in the column space of $B$, and hence so does $-\mathbf{1}$, which implies the existence of a vector $\nu$ satisfying the condition above. This proves (b) and concludes the proof of Theorem 3.
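Both KKT conditions can be checked numerically on a synthetic commuting ensemble: the sketch below verifies that $X_\infty = \Phi \, \mathrm{Diag}(\sigma^\star) \Psi^\top$ interpolates the measurements, and that solving $B\nu = -\mathbf{1}$ produces multipliers annihilating $\Phi\Psi^\top + \sum_i \nu_i A_i$. All dimensions and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, m = 3, 5, 6

# Commuting measurement ensemble A_i = Phi Abar_i Psi^T (Assumption 1).
Phi, _ = np.linalg.qr(rng.standard_normal((d1, d1)))
Psi, _ = np.linalg.qr(rng.standard_normal((d2, d1)))
S = rng.uniform(0.1, 1.0, size=(m, d1))          # row i holds sigma^{(i)}
A = [Phi @ np.diag(S[i]) @ Psi.T for i in range(m)]
B = S.T                                          # columns of B are the sigma^{(i)}

sigma_star = np.array([1.0, 0.7, 0.4])
X_inf = Phi @ np.diag(sigma_star) @ Psi.T

# (a) X_inf interpolates: <A_i, X_inf> = <sigma^{(i)}, sigma_star>.
y = np.array([np.sum(Ai * X_inf) for Ai in A])
assert np.allclose(y, S @ sigma_star)

# (b) stationarity: find nu with Phi Psi^T + sum_i nu_i A_i = 0, i.e. B nu = -1.
nu = np.linalg.lstsq(B, -np.ones(d1), rcond=None)[0]
residual = Phi @ Psi.T + sum(n * Ai for n, Ai in zip(nu, A))
assert np.linalg.norm(residual) < 1e-10
```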

D SINGULAR VALUE DECOMPOSITION

In this appendix, we explain in detail the singular value decomposition (SVD) of a rectangular matrix. In doing so, we also explain some of the non-standard notations used in the paper (e.g., a rectangular diagonal matrix). In our paper, we consider a rectangular measurement matrix $A$ of dimension $d_1 \times d_2$ and rank $r$. Without loss of generality, we assume that $r \leq d_1 \leq d_2$. The singular value decomposition of the matrix $A$ is then given by
$$A = \Phi \bar{A} \Psi^\top,$$
where $\Phi \in \mathbb{R}^{d_1 \times d_1}$ and $\Psi \in \mathbb{R}^{d_2 \times d_1}$ are two matrices with orthonormal columns, representing the left and right singular vectors of $A$, and $\bar{A} \in \mathbb{R}^{d_1 \times d_1}$ is a diagonal matrix whose diagonal entries are the singular values of $A$. This is similar to the notion of compact SVD commonly used in the literature, except that we truncate the matrix $\Psi$ to $d_1$ columns, instead of $r$ columns as in the compact SVD. This choice is not made lightly; it is crucial for the proof of Theorem 1 via the Khatri--Rao product. More specifically, the Khatri--Rao product in Eq. 21 requires that $\Phi$ and $\Psi$ have the same number of columns, which is generally not true in the standard singular value decomposition.

Next, we precisely define the notion of a rectangular diagonal matrix. Suppose $A \in \mathbb{R}^{d_1 \times d_2}$ with $d_1 < d_2$ is a diagonal matrix; then $A$ takes the following form:
$$A = \begin{bmatrix} a_1 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & a_{d_1} & 0 & \cdots & 0 \end{bmatrix}.$$
That is, we start with a standard square diagonal matrix of size $d_1 \times d_1$ and append $d_2 - d_1$ columns of all $0$'s on the right. Similarly, we define a rectangular identity matrix of size $d_1 \times d_2$ by setting $a_1 = \cdots = a_{d_1} = 1$.
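This truncated SVD convention coincides, when $d_1 \leq d_2$, with the "reduced" SVD returned by standard numerical libraries; for instance, in NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)
d1, d2 = 3, 5
A = rng.standard_normal((d1, d2))

# Reduced SVD: Phi is d1 x d1, Psi is d2 x d1, Abar is d1 x d1 diagonal,
# matching the convention of this appendix (Psi truncated to d1 columns).
Phi, s, Psi_T = np.linalg.svd(A, full_matrices=False)
Psi = Psi_T.T                       # d2 x d1, orthonormal columns
Abar = np.diag(s)                   # d1 x d1 diagonal matrix of singular values

assert Phi.shape == (d1, d1) and Psi.shape == (d2, d1)
assert np.allclose(A, Phi @ Abar @ Psi.T)                 # A = Phi Abar Psi^T
assert np.allclose(Psi.T @ Psi, np.eye(d1))               # same column count as Phi
```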

E ADDITIONAL NUMERICAL STUDIES

In this section, we complement the numerical studies in the main paper with some simulation results on the SNN architecture of Fig. 1. We empirically demonstrate that gradient descent dynamics in SNN minimizes the nuclear norm of the solution matrix and leads to favorable generalization performance. We also investigate some settings in which the assumptions made in our theoretical study are relaxed.

Do the results in Theorem 2 and Theorem 3 extend to multi-layer SNN? We generate the true matrix by sampling each entry of $X^\star$ independently from a standard Gaussian distribution, suitably normalized. For every measurement matrix $A_i$, $i = 1, \ldots, m$, we sample each entry of the diagonal matrix $\bar{A}_i$ from the uniform distribution on $(0,1)$, sort them in decreasing order, and set $A_i = \Phi \bar{A}_i \Psi^\top$, where $\Phi$ and $\Psi$ are randomly generated orthogonal matrices. We then record the measurements $y_i = \langle A_i, X^\star \rangle$, $i = 1, \ldots, m$. A total of 120 measurement matrices are generated; half of them are used for training and the other half for testing (so $m = 60$).

In this setting, we still observe that the nuclear norm generally decreases over time, although not strictly as before. We again observe that there are advantages in going from a single block to a multi-layer SNN from the generalization perspective. We use a learning rate of $10^{-5}$. The middle and right plots show that the single-block 1-layer SNN is the easiest to optimize but does not achieve the best generalization performance: its error decreases quickly at first but then plateaus; for the more complex SNNs, on the other hand, the loss improves slowly but reaches better solutions.
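A minimal sketch of this data-generation procedure (the normalization of $X^\star$ by $\sqrt{d_1 d_2}$ is our reading of "suitably normalized"; the split sizes follow the text):

```python
import numpy as np

rng = np.random.default_rng(4)
d1 = d2 = 10
n_total, m = 120, 60                 # half for training, half for testing

# Shared singular vectors make the A_i commute in the sense of Assumption 1.
Phi, _ = np.linalg.qr(rng.standard_normal((d1, d1)))
Psi, _ = np.linalg.qr(rng.standard_normal((d2, d1)))

X_star = rng.standard_normal((d1, d2)) / np.sqrt(d1 * d2)   # normalized ground truth

A_all, y_all = [], []
for _ in range(n_total):
    s = np.sort(rng.uniform(0.0, 1.0, size=d1))[::-1]       # Abar_i, decreasing order
    A_i = Phi @ np.diag(s) @ Psi.T
    A_all.append(A_i)
    y_all.append(np.sum(A_i * X_star))                      # y_i = <A_i, X_star>

A_train, y_train = A_all[:m], y_all[:m]
A_test,  y_test  = A_all[m:], y_all[m:]
assert len(A_train) == len(A_test) == m
```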

F CODE AVAILABILITY

The Python code used to conduct our experiments is publicly available at https://github.com/porichoy-gupto/spectral-neural-nets.



Rectangular diagonal matrices arise in the singular value decomposition of rectangular matrices, see Appendix D.




Figure 2: The evolution of X's three largest singular values when running gradient descent on the matrix sensing problem with learning rates of different magnitudes. The first 100 iterations have vastly different values and are omitted for clarity of presentation.

Figure 3: The evolution of X's training error (left panel) and nuclear norm (right panel) over time with learning rates (LR) of different magnitudes.


Figure 6: The evolution of SNN's nuclear norm, train error, and test error over time with learning rates (LR) of different magnitudes. From the left plot, the nuclear norm generally decreases over time and converges to the same value, albeit with some erratic movements in the early iterations.


We use a two-layer SNN of size [5, 3, 1], i.e., the input layer has 5 blocks, the hidden layer has 3 blocks, and the output layer has 1 block. These blocks are initialized with spectral initialization (see Assumption 2). We take $d_1 = d_2 = 10$ (thus $X^\star$ has 100 entries), $m = 60$ measurement matrices, and $K = 3$. As for the non-linearity, we use the shifted and scaled sigmoid. In the figures below, we demonstrate the evolution of the nuclear norm, the train error, and the test error of our SNN over time as we run gradient descent. Even though we only analyze gradient flow dynamics for a single block in our theoretical investigation, the empirical results suggest that the same phenomenon extends to more general multi-layer SNNs. We use a learning rate of $10^{-5}$. The left plot shows that gradient descent minimizes the nuclear norm of the solution matrix for all architectures, although they converge to slightly different final solutions. In the right plot, we observe that there is a substantial difference in the test error between different architectures. In this particular simulation, a 2-layer SNN achieves the lowest test error. This hints at the advantage of going beyond a single block to the multi-layer SNN architecture proposed in Section 3.

Do the results in Theorem 2 and Theorem 3 still hold when we relax the assumptions? We investigate a scenario in which Assumption 1 (commuting measurement matrices) and Assumption 2 (spectral initialization) are relaxed. We generate the true matrix by sampling each entry of $X^\star$ independently from a standard Gaussian distribution, suitably normalized. For every measurement matrix $A_i$, $i = 1, \ldots, m$, we sample each entry of $A_i$ from a standard Gaussian distribution. We then record the measurements $y_i = \langle A_i, X^\star \rangle$, $i = 1, \ldots, m$. A total of 120 measurement matrices are generated; half of them are used for training and the other half for testing (so $m = 60$).

