UNDERSTANDING INCREMENTAL LEARNING OF GRADIENT DESCENT: A FINE-GRAINED ANALYSIS OF MATRIX SENSING

Abstract

The implicit bias of optimization algorithms such as gradient descent (GD) is believed to play an important role in the generalization of modern machine learning methods such as deep learning. This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem, whose goal is to recover a low-rank ground-truth matrix from near-isotropic linear measurements. With small initialization, we show that GD behaves similarly to the greedy low-rank learning heuristic (Li et al., 2020) and follows an incremental learning procedure (Gissin et al., 2019). That is, GD sequentially learns solutions with increasing ranks until it recovers the ground-truth matrix. Compared to existing works, which only analyze the first learning phase for rank-1 solutions, our result is stronger because it characterizes the whole learning process. Moreover, our analysis of the incremental learning procedure applies to the under-parameterized regime as well. As a key ingredient of our analysis, we observe that GD always follows an approximately low-rank trajectory, and we develop novel landscape properties for matrix sensing with low-rank parameterization. Finally, we conduct numerical experiments that confirm our theoretical findings.

1. INTRODUCTION

Understanding the optimization and generalization properties of optimization algorithms is one of the central topics in deep learning theory (Zhang et al., 2021; Sun, 2019). It has long been a mystery why simple algorithms such as Gradient Descent (GD) or Stochastic Gradient Descent (SGD) can find global minima even for highly non-convex functions (Du et al., 2019), and why the global minima found can generalize well (Hardt et al., 2016). One influential line of works provides theoretical analyses of the implicit bias of GD/SGD. These results typically exhibit theoretical settings where the low-loss solutions found by GD/SGD attain certain optimality conditions of a particular generalization metric, e.g., the parameter norm (or the classifier margin) (Soudry et al., 2018; Gunasekar et al., 2018; Nacson et al., 2019; Lyu & Li, 2020; Ji & Telgarsky, 2020) or the sharpness of the local loss landscape (Blanc et al., 2020; Damian et al., 2021; Li et al., 2022; Lyu et al., 2022). Among these, a line of works seeks to characterize the implicit bias even when training is far from convergence. Kalimeris et al. (2019) empirically observed that SGD learns models of increasing complexity, starting from simple ones such as linear classifiers. This behavior, usually referred to as the simplicity bias or incremental learning of GD/SGD, can help prevent overfitting for highly over-parameterized models, since the algorithm tries to fit the training data with minimal complexity. Hu et al. (2020); Lyu et al. (2021); Frei et al. (2021) theoretically establish that GD on two-layer nets learns linear classifiers first.
The goal of this paper is to demonstrate this simplicity bias/incremental learning in the matrix sensing problem, a non-convex optimization problem that arises in a wide range of real-world applications, e.g., image reconstruction (Zhao et al., 2010; Peng et al., 2014), object detection (Shen & Wu, 2012; Zou et al., 2013) and array processing systems (Kalogerias & Petropulu, 2013). Moreover, this problem serves as a standard test-bed for the implicit bias of GD/SGD in deep learning theory, since it retains many of the key phenomena of deep learning while being simpler to analyze. Formally, the matrix sensing problem asks to recover a ground-truth matrix Z* ∈ R^{d×d} given m observations y_1, …, y_m. Each observation y_i results from a linear measurement y_i = ⟨A_i, Z*⟩, where {A_i}_{1≤i≤m} is a collection of symmetric measurement matrices. In this paper, we focus on the case where Z* is positive semi-definite (PSD) and low-rank: Z* ⪰ 0 and rank(Z*) = r* ≪ d. An intriguing approach to this matrix sensing problem is to use the Burer-Monteiro type decomposition Z = UU^⊤ with U ∈ R^{d×r} and minimize the squared loss with GD:

    min_{U∈R^{d×r}} f(U) := (1/(4m)) Σ_{i=1}^m (y_i − ⟨A_i, UU^⊤⟩)².  (1)

In the ideal case, the number of columns of U, denoted r above, should be set to r*, but r* may not be known in advance. This leads to two training regimes that are more likely to occur in practice: the under-parameterized regime where r < r*, and the over-parameterized regime where r > r*. The over-parameterized regime may seem prone to overfitting at first glance, but surprisingly, with small initialization, GD induces a good implicit bias towards solutions with exact or approximate recovery of the ground truth. It was first conjectured in Gunasekar et al. (2017) that GD with small initialization finds the matrix with minimum nuclear norm.
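To make the setup concrete, here is a minimal NumPy sketch of the loss (1) with symmetric Gaussian measurements. All dimensions and the choice of measurement ensemble are illustrative assumptions for this sketch, not the paper's experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r_star, r, m = 20, 2, 5, 400   # illustrative sizes

# Low-rank PSD ground truth Z* = X X^T.
X = rng.standard_normal((d, r_star))
Z_star = X @ X.T

# Symmetric Gaussian measurement matrices A_i and observations y_i = <A_i, Z*>.
A = rng.standard_normal((m, d, d))
A = (A + A.transpose(0, 2, 1)) / 2
y = np.einsum('mij,ij->m', A, Z_star)

def loss(U):
    """Squared loss (1): f(U) = 1/(4m) * sum_i (y_i - <A_i, U U^T>)^2."""
    residuals = y - np.einsum('mij,ij->m', A, U @ U.T)
    return np.sum(residuals ** 2) / (4 * m)

# A factor that reproduces Z* exactly: pad X with zero columns up to width r.
U_exact = np.hstack([X, np.zeros((d, r - r_star))])
print(loss(U_exact))   # 0 up to floating point
```

Note that any U whose Gram matrix UU^⊤ equals Z* attains zero loss, which is why the loss alone cannot distinguish factors that differ by a rotation.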
However, a series of works point out that this nuclear norm minimization view cannot capture the simplicity bias/incremental learning behavior of GD. In the matrix sensing setting, this term refers to the phenomenon that GD tends to learn solutions whose rank gradually increases with the number of training steps. Arora et al. (2019) exhibit this phenomenon when there is only one observation (m = 1). Gissin et al. (2019); Jiang et al. (2022) study the full-observation case, where every entry of the ground truth is measured independently, f(U) = (1/(4d²))∥Z* − UU^⊤∥_F², and GD is shown to sequentially recover the singular components of the ground truth from the largest singular value to the smallest. Li et al. (2020) provide theoretical evidence that the incremental learning behavior generally occurs for matrix sensing. They also give a concrete counterexample to Gunasekar et al. (2017)'s conjecture, in which the simplicity bias drives GD to a rank-1 solution that has a large nuclear norm. Despite this progress, theoretical understanding of the simplicity bias of GD remains limited. Indeed, the vast majority of existing analyses only show that GD is initially biased towards learning a rank-1 solution, and they cannot be generalized to higher ranks unless additional assumptions on the GD dynamics are made (Li et al., 2020, Appendix H; Belabbas, 2020; Jacot et al., 2021; Razin et al., 2021; 2022).

1.1. OUR CONTRIBUTIONS

In this paper, we take a step towards understanding the generalization of GD with small initialization by firmly demonstrating the simplicity bias/incremental learning behavior in the matrix sensing setting, assuming the Restricted Isometry Property (RIP). Our main result is informally stated below; see Theorem 4.1 for the formal version.

Definition 1.1 (Best Rank-s Solution) We define the best rank-s solution as the unique global minimizer Z*_s of the following constrained optimization problem:

    min_{Z∈R^{d×d}} (1/(4m)) Σ_{i=1}^m (y_i − ⟨A_i, Z⟩)²  s.t.  Z ⪰ 0, rank(Z) ≤ s.  (2)

Theorem 1.1 (Informal version of Theorem 4.1) Consider the matrix sensing problem (1) with rank-r* ground-truth matrix Z* and measurements {A_i}_{i=1}^m. Assume that the measurements satisfy the RIP condition (Definition 3.2). With a small learning rate µ > 0 and small initialization U_{α,0} = αŪ ∈ R^{d×r}, the trajectory of U_{α,t}U_{α,t}^⊤ during GD training enters an o(1)-neighbourhood of each of the best rank-s solutions in the order s = 1, 2, …, r ∧ r* as α → 0.

It is shown in Li et al. (2018); Stöger & Soltanolkotabi (2021) that GD exactly recovers the ground truth under the RIP condition, but our theorem goes beyond this result in a number of ways. First, in the over-parameterized regime (i.e., r ≥ r*), it implies that the trajectory of GD exhibits an incremental learning phenomenon: GD learns solutions with increasing ranks until it finds the ground truth. Second, it shows that in the under-parameterized regime (i.e., r < r*), GD exhibits the same implicit bias but finally converges to the best low-rank solution of the matrix sensing loss. By contrast, to the best of our knowledge, only the over-parameterized setting is analyzed in the existing literature.

Theorem 1.1 can also be viewed as a generalization of previous results in Gissin et al. (2019); Jiang et al. (2022), which show that U_{α,t}U_{α,t}^⊤ passes by the best low-rank solutions one by one in the full-observation case of matrix sensing, f(U) = (1/(4d²))∥Z* − UU^⊤∥_F². However, our setting has two major challenges which significantly complicate our analysis. First, since our setting only gives partial measurements, the decomposition into signal and error terms in Gissin et al. (2019); Jiang et al. (2022) cannot be applied. Instead, we adopt a different approach, motivated by Stöger & Soltanolkotabi (2021); intuitive explanations of our approach are discussed in Appendix B. Second, it is well known that the optimal rank-s solution of matrix factorization is given by X_s (defined in Section 3), but little is known about Z*_s. In Section 5 we analyze the landscape of (2), establishing the uniqueness of Z*_s and local landscape properties under the RIP condition. We find that when U_{α,t}U_{α,t}^⊤ ≈ Z*_s, GD follows an approximately low-rank trajectory, so that it behaves similarly to GD in the under-parameterized regime. Using our landscape results, we can finally prove Theorem 1.1.

Organization. We review additional related works in Section 2. In Section 3, we provide an overview of the necessary background and notations. We then present our main results in Section 4, where we also give a proof sketch. In Section 5, we outline the key landscape results that we use to prove Theorem 4.1. Experimental results that verify our theoretical findings are presented in Section 6. Finally, in Section 7, we summarize our main contributions and discuss some promising future directions. Complete proofs of all results in this paper are given in the Appendix.
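As a concrete reference point for Definition 1.1: in the idealized full-observation case (A*A = I), the best rank-s solution reduces to keeping the top-s eigenpairs of Z* (Eckart-Young), while for general measurement operators Z*_s need not equal X_sX_s^⊤. A sketch under this full-observation assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r_star = 10, 3

# PSD ground truth with distinct, well-separated eigenvalues 9 > 4 > 1.
Q, _ = np.linalg.qr(rng.standard_normal((d, r_star)))
Z_star = (Q * np.array([9.0, 4.0, 1.0])) @ Q.T

def best_rank_s_full_observation(Z, s):
    """Best PSD rank-s approximation in Frobenius norm: keep the top-s
    eigenpairs (Eckart-Young). Only valid in the full-observation case."""
    w, V = np.linalg.eigh(Z)               # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:s]
    return (V[:, top] * np.maximum(w[top], 0.0)) @ V[:, top].T

Z1 = best_rank_s_full_observation(Z_star, 1)
print(np.linalg.matrix_rank(Z1))           # 1
```

With s = r* the truncation returns Z* itself, mirroring the fact that the best rank-r* solution recovers the ground truth.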

2. RELATED WORK

Low-rank matrix recovery. The goal of low-rank matrix recovery is to recover an unknown low-rank matrix from a finite number of (possibly noisy) measurements. Examples include matrix sensing (Recht et al., 2010), matrix completion (Candès & Recht, 2009; Candes & Plan, 2010) and robust PCA (Xu et al., 2010; Candès et al., 2011). Fornasier et al. (2011); Ngo & Saad (2012); Wei et al. (2016); Tong et al. (2021) study efficient optimization algorithms with convergence guarantees. Interested readers can refer to Davenport & Romberg (2016) for an overview of low-rank matrix recovery.

Simplicity bias/incremental learning of gradient descent. Besides the works mentioned in the introduction, other works study the simplicity bias/incremental learning of GD/SGD on tensor factorization (Razin et al., 2021; 2022), deep linear networks (Gidel et al., 2019), and two-layer nets with orthogonal inputs (Boursier et al., 2022).

Landscape analysis of non-convex low-rank problems. The strict saddle property (Ge et al., 2015; 2016; Lee et al., 2016) was established for non-convex low-rank problems in a unified framework by Ge et al. (2017). Tu et al. (2016) proved a local PL property for matrix sensing with exact parameterization (i.e., the rank of the parameterization and that of the ground-truth matrix are the same). The optimization geometry of general objective functions with Burer-Monteiro type factorization is studied in Zhu et al. (2018); Li et al. (2019); Zhu et al. (2021). We provide a comprehensive analysis of the under-parameterized regime for matrix factorization as well as matrix sensing that improves over their results.

3. PRELIMINARIES

In this section, we first list the notations used in this paper, and then provide details of our theoretical setup and necessary preliminary results.

3.1. NOTATIONS

We write min{a, b} as a ∧ b for short. For any matrix A, we use ∥A∥_F to denote the Frobenius norm of A, ∥A∥ to denote the spectral norm ∥A∥_2, and σ_min(A) to denote the smallest singular value of A. We use the following notation for Singular Value Decomposition (SVD):

Definition 3.1 (Singular Value Decomposition) For any matrix A ∈ R^{d_1×d_2} of rank r, we use A = V_A Σ_A W_A^⊤ to denote an SVD of A, where V_A ∈ R^{d_1×r}, W_A ∈ R^{d_2×r} satisfy V_A^⊤V_A = I, W_A^⊤W_A = I, and Σ_A ∈ R^{r×r} is diagonal.

For the matrix sensing problem (1), we write the ground-truth matrix as Z* = XX^⊤ for some X = [v_1, v_2, …, v_{r*}] ∈ R^{d×r*} with orthogonal columns. We denote the singular values of X by σ_1, σ_2, …, σ_{r*}; the singular values of Z* are then σ_1², σ_2², …, σ_{r*}². We set σ_{r*+1} := 0 for convenience. For simplicity, we only consider the case where Z* has distinct singular values, i.e., σ_1² > σ_2² > ⋯ > σ_{r*}² > 0. We use κ := σ_1² / min_{1≤s≤r*}{σ_s² − σ_{s+1}²} to quantify the degeneracy of the singular values of Z*. We also write X_s = [v_1, v_2, …, v_s] for the matrix consisting of the first s columns of X. Note that Z*_s (Definition 1.1) does not equal X_sX_s^⊤ in general.

We collect the measurements {A_i}_{i=1}^m into a linear mapping A: R^{d×d} → R^m, where [A(Z)]_i = (1/√m)⟨A_i, Z⟩ for all 1 ≤ i ≤ m. We use A*: R^m → R^{d×d}, A*(w) = (1/√m) Σ_{i=1}^m w_iA_i, to denote the adjoint operator of A. Our loss function (1) can then be written as f(U) = (1/4)∥A(Z* − UU^⊤)∥_2². The gradient is given by ∇f(U) = −A*A(XX^⊤ − UU^⊤)U. In this paper, we consider GD with learning rate µ > 0 starting from U_0. The update rule is

    U_{t+1} = U_t − µ∇f(U_t) =: (I + µM_t)U_t,  where M_t = A*A(XX^⊤ − U_tU_t^⊤).  (3)
We specifically focus on GD with small initialization: letting U 0 = α Ū for some matrix Ū ∈ R d×r with ∥ Ū ∥ = 1, we are interested in the trajectory of GD when α → 0. Sometimes we write U t as U α,t to highlight the dependence of the trajectory on α.
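The recursion (3) with small initialization can be simulated directly. The sketch below uses illustrative toy dimensions, an ad hoc step size, and an ad hoc iteration count, none of which are the constants required by the theory:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r_star, r, m = 15, 2, 4, 800   # over-parameterized: r > r_star
mu, alpha, T = 0.1, 1e-3, 3000

X = rng.standard_normal((d, r_star))
X /= np.sqrt(np.linalg.norm(X @ X.T, 2))        # normalize so ||Z*|| = 1
Z_star = X @ X.T

A = rng.standard_normal((m, d, d))
A = (A + A.transpose(0, 2, 1)) / 2              # symmetric measurements

def A_star_A(Z):
    """A*A(Z) = (1/m) sum_i <A_i, Z> A_i."""
    coeffs = np.einsum('mij,ij->m', A, Z) / m
    return np.einsum('m,mij->ij', coeffs, A)

U_bar = rng.standard_normal((d, r))
U = alpha * U_bar / np.linalg.norm(U_bar, 2)    # ||U_0|| = alpha
for t in range(T):
    M_t = A_star_A(Z_star - U @ U.T)
    U = U + mu * (M_t @ U)                      # U_{t+1} = (I + mu M_t) U_t

print(np.linalg.norm(U @ U.T - Z_star, 'fro'))  # small: approximate recovery
```

Despite r > r*, the iterate UU^⊤ approaches Z* in this run, which is the implicit bias of small initialization discussed above.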

3.2. ASSUMPTIONS

For our theoretical analysis of the matrix sensing problem, we make the following standard assumption from the matrix sensing literature:

Definition 3.2 (Restricted Isometry Property) We say that a measurement operator A satisfies the (δ, r)-RIP condition if (1 − δ)∥Z∥_F² ≤ ∥A(Z)∥_2² ≤ (1 + δ)∥Z∥_F² for all matrices Z ∈ R^{d×d} with rank(Z) ≤ r.

Assumption 3.1 The measurement operator A satisfies the (δ, 2r* + 1)-RIP condition, where r* = rank(Z*) and δ ≤ 10^{-7}κ^{-4}r_*^{-1}.

The RIP condition is the key to ensuring that the ground truth is recoverable from partial observations. An important consequence of RIP is that it guarantees A*A(Z) = (1/m) Σ_{i=1}^m ⟨A_i, Z⟩A_i ≈ Z when Z is low-rank. This is made rigorous in the following proposition.

Proposition 3.1 (Stöger & Soltanolkotabi, 2021, Lemma 7.3) Suppose that A satisfies the (δ, r)-RIP condition with r ≥ 2. Then for all symmetric Z: (1) if rank(Z) ≤ r − 1, we have ∥(A*A − I)(Z)∥ ≤ √r δ∥Z∥; (2) ∥(A*A − I)(Z)∥ ≤ δ∥Z∥_*, where ∥·∥_* is the nuclear norm.

We also need the following regularity condition on the initialization.

Assumption 3.2 For all 1 ≤ s ≤ r ∧ r*, σ_min(V_{X_s}^⊤Ū) ≥ ρ for some positive constant ρ, where V_{X_s} is defined as in Definition 3.1.

The following proposition implies that Assumption 3.2 is satisfied with high probability under Gaussian initialization.

Proposition 3.2 Suppose that all entries of Ū ∈ R^{d×r} are independently drawn from N(0, 1/√r) and ρ = ε(√r − √(r ∧ r* − 1))/√r ≥ ε/(2r*). Then σ_min(V_{X_s}^⊤Ū) ≥ ρ holds for all 1 ≤ s ≤ r ∧ r* with probability at least 1 − r(Cε + e^{-cr}), where c, C > 0 are universal constants.
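A quick numerical sanity check of the approximation A*A(Z) ≈ Z for low-rank Z, using a symmetrized i.i.d. Gaussian ensemble (an illustrative choice of measurements for this sketch; the paper's results only assume RIP):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 12, 3000   # m well above d^2 so the approximation is visible

# Symmetrized i.i.d. Gaussian measurement matrices.
A = rng.standard_normal((m, d, d))
A = (A + A.transpose(0, 2, 1)) / 2

def A_star_A(Z):
    """A*A(Z) = (1/m) sum_i <A_i, Z> A_i."""
    coeffs = np.einsum('mij,ij->m', A, Z) / m
    return np.einsum('m,mij->ij', coeffs, A)

v = rng.standard_normal((d, 1))
Z = v @ v.T                                   # symmetric rank-1 test matrix
rel_err = np.linalg.norm(A_star_A(Z) - Z, 2) / np.linalg.norm(Z, 2)
print(rel_err)                                # small relative spectral error
```

For this ensemble one can check that E[A*A(Z)] = Z for symmetric Z, so the printed quantity is pure sampling fluctuation, shrinking as m grows.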

4. MAIN RESULTS

In this section, we present our main theorem, following the theoretical setup in Section 3.

Theorem 4.1 Under Assumptions 3.1 and 3.2, consider GD (3) with learning rate µ ≤ 1/(10³∥Z*∥) and initialization U_{α,0} = αŪ for solving the matrix sensing problem (1). There exist universal constants c, M > 0, a constant C (depending on r and κ), and a sequence of time points T_α¹ < T_α² < ⋯ < T_α^{r∧r*} such that for all 1 ≤ s ≤ r ∧ r*, the following holds when α = O((ρr*)^{-cκ}):

    ∥U_{α,T_α^s}U_{α,T_α^s}^⊤ − Z*_s∥_F ≤ Cα^{1/(Mκ²)},  (4)

where we recall that Z*_s is the best rank-s solution defined in Definition 1.1. Moreover, GD follows an incremental learning procedure: we have lim_{α→0} max_{1≤t≤T_α^s} σ_{s+1}(U_{α,t}) = 0 for all 1 ≤ s ≤ r ∧ r*, where σ_i(A) denotes the i-th largest singular value of a matrix A.

Compared with existing works (Li et al., 2018; Stöger & Soltanolkotabi, 2021) in the same setting, our result characterizes the complete learning dynamics of GD and reveals an incremental learning mechanism: GD starts from learning simple solutions and then gradually increases the complexity of the search space until it finds the ground truth.

Now we outline the proof of our main theorem. In the following, we fix an integer 1 ≤ s ≤ r ∧ r* and show the existence of T_α^s > 0 satisfying lim_{α→0} ∥U_{α,T_α^s}U_{α,T_α^s}^⊤ − Z*_s∥_F = 0. We then show that T_α^s is monotonically increasing in s for any fixed α. Our first result states that with small initialization, GD gets into a small neighbourhood of Z*_s.

Lemma 4.1 Under Assumptions 3.1 and 3.2, for all α > 0 and 1 ≤ s ≤ r ∧ r* there exists T̃_α^s > 0 such that lim_{α→0} max_{1≤t≤T̃_α^s} σ_{s+1}(U_{α,t}) = 0. Furthermore, it holds that ∥U_{T̃_α^s}U_{T̃_α^s}^⊤ − Z*_s∥_F = O(κ³√r* δ∥X∥²).

Proof sketch: The proof is motivated by the three-phase analysis in Stöger & Soltanolkotabi (2021) but requires some technical modifications.
Starting from a small initialization, GD initially behaves similarly to power iteration, since U_{t+1} = (I + µM_t)U_t ≈ (I + µM)U_t, where M := A*A(XX^⊤) is a symmetric matrix. Let M = Σ_{k=1}^d σ̃_k² ṽ_kṽ_k^⊤ be the eigendecomposition of M. Then we have

    U_T ≈ (I + µM)^T U_0 = Σ_{i=1}^d (1 + µσ̃_i²)^T ṽ_iṽ_i^⊤U_0 ≈ Σ_{i=1}^s (1 + µσ̃_i²)^T ṽ_iṽ_i^⊤U_0,  (5)

where the last step holds because it can be shown that 1 + µσ̃_s² > 1 + µσ̃_{s+1}², causing an exponential separation between the magnitudes of the top-s and the remaining components. When (5) no longer holds, we enter a new phase which we call the parallel improvement phase. We consider the decomposition U_t = U_tW_tW_t^⊤ + U_tW_{t,⊥}W_{t,⊥}^⊤, where W_t := W_{V_{X_s}^⊤U_t} ∈ R^{r×s} is the matrix consisting of the right singular vectors of V_{X_s}^⊤U_t (Definition 3.1) and W_{t,⊥} ∈ R^{r×(r−s)} is an orthogonal complement of W_t. Assume X_s = [σ_1e_1, …, σ_se_s] without loss of generality. The columns of W_t form an orthogonal basis of the subspace spanned by the first s rows of U_t. Each vector in R^r can be decomposed into a parallel component and an orthogonal component w.r.t. this subspace. Intuitively, the row vectors of U_tW_t are the parallel components of the row vectors of U_t; we call U_tW_t the parallel component. For similar reasons, U_tW_{t,⊥} is referred to as the orthogonal component. More discussion of this decomposition is given in Appendix B. By the end of the spectral phase we have σ_min(U_tW_t) ≫ ∥U_tW_{t,⊥}∥. We show in Appendix C.2 that afterwards, σ_min(U_tW_t)/∥U_tW_{t,⊥}∥ grows exponentially in t until the former reaches a constant scale, while the latter stays o(1) (α → 0). After σ_min(U_tW_t) = Θ(1), we enter the refinement phase, where we show in Appendix C.4 that ∥X_sX_s^⊤ − U_tU_t^⊤∥_F keeps decreasing until it is O(δκ²√r*∥X∥²) (see Lemma 5.1). On the other hand, we can show that the best rank-s solution is close to the matrix factorization minimizer, i.e., ∥Z*_s − X_sX_s^⊤∥_F = O(δκ√r*∥X∥²). We thus obtain ∥Z*_s − U_tU_t^⊤∥_F = O(δκ²√r*∥X∥²). Finally, since rank(U_tW_t) ≤ s, we have σ_{s+1}(U_t) ≤ ∥U_tW_{t,⊥}∥ = o(1), as desired. □

Lemma 4.1 shows that U_tU_t^⊤ enters a neighbourhood of Z*_s with constant radius. However, there is still a gap between Lemma 4.1 and Theorem 4.1, since the latter states that U_tU_t^⊤ actually gets o(1)-close to Z*_s. To illustrate our proof idea for this result, we first consider the simpler setting where the model is under-parameterized, i.e., r ≤ r*, and show that U_tU_t^⊤ eventually converges to Z*_r.

Proposition 4.1 (Convergence in the under-parameterized regime) Suppose that r ≤ r*. Then there exists a constant c = c(r, κ) > 0 such that when α < c, we have lim_{t→+∞} U_{α,t}U_{α,t}^⊤ = Z*_r.

Proof sketch: Taking s = r in Lemma 4.1, we can deduce that there exists a global minimizer U*_r of (1) (equivalently, a matrix in R^{d×r} satisfying U*_rU*_r^⊤ = Z*_r) such that ∥U_{α,T̃_α^r} − U*_r∥ = O(κ³√r* δ∥X∥) (cf. Corollary F.1). On the other hand, by taking s = r in Theorem 5.1, we can see that within a neighbourhood of U*_r with constant radius, f satisfies a Polyak-Łojasiewicz (PL) type condition with respect to the Procrustes distance defined in Definition 5.1, and the global minimizer of f is unique up to orthogonal transformation. When δ is sufficiently small, U_{α,T̃_α^r} lies in this neighbourhood, and the PL condition implies that GD converges linearly to the set of global minimizers, which yields the desired conclusion. □

Now we turn to the over-parameterized regime, where f is not necessarily locally PL, and thus we cannot directly derive convergence as in Proposition 4.1.
To proceed, we use a low-rank approximation of U_t and associate the dynamics in this neighbourhood with the GD dynamics of the following under-parameterized matrix sensing loss:

    f_s(U) = (1/4)∥A(Z* − UU^⊤)∥_2²,  U ∈ R^{d×s}.  (6)

It can be shown that when δ is sufficiently small, the global minimizer of f_s is unique up to rotation, i.e., if Û*_s is a global minimizer of f_s, then any other global minimizer can be written as Û*_sR for some orthogonal matrix R ∈ R^{s×s}. The main observation is that GD follows an approximately low-rank trajectory until it gets into a small neighbourhood of Z*_s, so that we can still use our landscape results in the low-rank regime.

Proof sketch of Theorem 4.1: For all t ≥ 0, define Û_{α,t} = U_{α,t+T̃_α^s}W_{α,t+T̃_α^s} ∈ R^{d×s}, where W_{α,t} = W_{V_{X_s}^⊤U_t} is defined as in the proof sketch of Lemma 4.1. Lemma 4.1 implies that U_{α,T̃_α^s} is approximately rank-s, so within a time period after T̃_α^s, the GD trajectory remains close to another GD run initialized at Û_{α,0} for the rank-s matrix sensing loss (6) until t = Θ(log(1/α)), i.e.,

    Û_{α,t}Û_{α,t}^⊤ ≈ U_{α,T̃_α^s+t}U_{α,T̃_α^s+t}^⊤.  (7)

Again by Lemma 4.1, the initialization Û_{α,0} is within a small neighbourhood of the global minima of f_s. Furthermore, Theorem 5.1 implies that f_s satisfies a local PL-like condition, so that GD with good initialization converges linearly to its global minima (Karimi et al., 2016). We need to choose a time t such that (7) remains true while this linear convergence takes place for sufficiently many steps. We can show that there always exists some t = t_α^s such that both ∥Û_{α,t}Û_{α,t}^⊤ − U_{α,T̃_α^s+t}U_{α,T̃_α^s+t}^⊤∥_F and ∥Û_{α,t} − Û*_s∥_F are O(α^{1/(Mκ²)}). Hence ∥U_{α,t}U_{α,t}^⊤ − Z*_s∥_F ≲ α^{1/(Mκ²)} for t = T_α^s := T̃_α^s + t_α^s. For all 1 ≤ s < r ∧ r*, since (7) always holds for t ≤ t_α^s and rank(Û_{α,t}) ≤ s, we must have max_{1≤t≤T_α^s} σ_{s+1}(U_{α,t}) = o(1) as α → 0. Finally, ∥Z*_{s+1} − X_{s+1}X_{s+1}^⊤∥ = O(δ) (cf. Lemma E.5), so σ_{s+1}(Z*_{s+1}) = Θ(1). Therefore, U_{α,t}U_{α,t}^⊤ cannot be close to Z*_{s+1} when t ≤ T_α^s, so we must have T_α^{s+1} > T_α^s. This completes the proof of Theorem 4.1. □
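The incremental learning behavior characterized by Theorem 4.1 can be observed in a small simulation: with well-separated ground-truth eigenvalues and small initialization, the singular values of U_t emerge one after another. All constants below are illustrative choices for this sketch, not the values required by the theorem.

```python
import numpy as np

rng = np.random.default_rng(4)
d, r_star, r, m = 12, 3, 6, 3000
mu, alpha, T = 0.05, 1e-4, 2000

# Ground truth with well-separated eigenvalues 1 > 0.49 > 0.2025.
Q, _ = np.linalg.qr(rng.standard_normal((d, r_star)))
X = Q * np.array([1.0, 0.7, 0.45])
Z_star = X @ X.T

A = rng.standard_normal((m, d, d))
A = (A + A.transpose(0, 2, 1)) / 2

def A_star_A(Z):
    coeffs = np.einsum('mij,ij->m', A, Z) / m
    return np.einsum('m,mij->ij', coeffs, A)

U = alpha * rng.standard_normal((d, r))
sv = np.empty((T, r))
for t in range(T):
    U = U + mu * (A_star_A(Z_star - U @ U.T) @ U)
    sv[t] = np.linalg.svd(U, compute_uv=False)

# Step at which each of the first r* singular values of U_t reaches half its
# final value; incremental learning predicts these times are strictly ordered.
emergence = [int(np.argmax(sv[:, s] >= 0.5 * sv[-1, s])) for s in range(r_star)]
print(emergence)
```

The larger the gap between consecutive eigenvalues (i.e., the smaller κ), the wider the separation between these emergence times, matching the power-iteration intuition in the proof sketch of Lemma 4.1.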

5. CONVERGENCE IN UNDER-PARAMETERIZED MATRIX SENSING

In this section, we analyze the landscape of the matrix sensing loss (1), which plays a crucial role in proving the results in Section 4. While the landscape of over-parameterized matrix sensing (i.e., r ≥ r*) is well studied (Tu et al., 2016; Ge et al., 2017), few results are known for the under-parameterized case. Our results provide useful tools for analyzing the convergence of gradient-based algorithms for problems like low-rank matrix approximation and might be of independent interest. To prove convergence results like Proposition 4.1, we first study the landscape of f_s defined in (6) near the set of global minimizers. One major difficulty here is that f_s is not locally strongly convex, because if U ∈ R^{d×s} is a global minimizer of f_s, then UR is also a global minimizer for any orthogonal matrix R ∈ R^{s×s}. Nonetheless, we can establish a Polyak-Łojasiewicz (PL) like condition, Theorem 5.1, which is the main result of this section.

Definition 5.1 (Procrustes distance) For any d, s ∈ N_+ and U_1, U_2 ∈ R^{d×s}, we define dist(U_1, U_2) = min{∥U_1 − U_2R∥_F : R ∈ R^{s×s} is orthogonal}. We note that the Procrustes distance is well-defined because the s × s orthogonal matrices form a compact set and ∥U_1 − U_2R∥_F is continuous in R. It can be verified that the Procrustes distance is a pseudometric, i.e., it is symmetric and satisfies the triangle inequality.

Theorem 5.1 (Landscape of under-parameterized matrix sensing) The global minimizer of f_s is unique up to an orthogonal transformation, i.e., the set of global minimizers of f_s is {U*_sR : R ∈ R^{s×s} is orthogonal}, where U*_s is an arbitrary global minimizer. Moreover, let U ∈ R^{d×s} and let U*_s be a global minimizer of f_s such that ∥U − U*_s∥_F = dist(U, U*_s). Suppose that Assumption 3.1 holds and ∥U − U*_s∥ ≤ 10^{-2}κ^{-1}∥X∥. Then ⟨∇f_s(U), U − U*_s⟩ ≥ 0.1τ dist²(U, U*_s).
Remark 5.1 Recall that the PL condition means that there exists a constant µ > 0 such that ∥∇g(x)∥² ≥ 2µ(g(x) − min_{y∈R^n} g(y)) holds for all x. Since f_s is locally smooth around U*_s, there exists a constant c_1 > 0 such that f_s(U) − f_s(U*_s) ≤ c_1∥U − U*_s∥_F². Moreover, Theorem 5.1 implies that ∥U − U*_s∥_F² ≤ 100τ^{-2}∥∇f_s(U)∥_F², so we have ∥∇f_s(U)∥_F² ≥ 10^{-2}τ²c_1^{-1}(f_s(U) − f_s(U*_s)). In other words, Theorem 5.1 implies that the matrix sensing loss (1) is locally PL.

When δ = 0, f_s reduces to the matrix factorization loss F_s: R^{d×s} → R, F_s(U) = (1/4)∥UU^⊤ − Z*∥_F². The following corollary immediately follows from Theorem 5.1.

Corollary 5.1 (Landscape of under-parameterized matrix factorization) The set of global minimizers of F_s is {X_sR : R ∈ R^{s×s} is orthogonal}. Moreover, under Assumption 3.1, let U ∈ R^{d×s} and let R be an orthogonal matrix such that ∥U − X_sR∥_F = dist(U, X_s). If dist(U, X_s) ≤ 10^{-2}κ^{-1}∥X∥, then ⟨∇F_s(U), U − X_sR⟩ ≥ 0.1τ dist²(U, X_s).

We end this section with the following lemma, which formalizes the intuition that all global minimizers of f_s must be close to X_s under the Procrustes distance; it is used in the proof sketch of Theorem 4.1 in Section 4.

Lemma 5.1 Under Assumption 3.1, we have dist(U*_s, X_s) ≤ 40δκ∥X∥_F for any global minimizer U*_s of f_s. Moreover, ∥Z*_s − X_sX_s^⊤∥_F ≤ 80δκ√r*∥X∥².
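The Procrustes distance of Definition 5.1 has a classical closed form: the minimizing R comes from the SVD of U_2^⊤U_1 (the orthogonal Procrustes problem). A sketch:

```python
import numpy as np

def procrustes_dist(U1, U2):
    """dist(U1, U2) = min_R ||U1 - U2 R||_F over orthogonal R in R^{s x s}.
    Orthogonal-Procrustes solution: if U2^T U1 = V S W^T, take R = V W^T."""
    V, _, Wt = np.linalg.svd(U2.T @ U1)
    return np.linalg.norm(U1 - U2 @ (V @ Wt), 'fro')

rng = np.random.default_rng(6)
U = rng.standard_normal((8, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # a random orthogonal matrix

print(procrustes_dist(U @ Q, U))                   # ~0: equal up to rotation
```

The first print illustrates the rotational invariance that makes f_s non-strongly-convex: U and UQ represent the same point under this distance.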

6. EXPERIMENTS

In this section, we perform numerical experiments to illustrate our theoretical findings. Experimental setup. Consistent with our theory, we consider the matrix sensing problem (1) with d = 50, r* = 5, α ∈ {1, 0.1, 0.01, 0.001}, and m ∈ {1000, 2000, 5000}. We consider different choices of r in the experiments. The ground truth Z* = XX^⊤ is randomly generated. We solve problem (1) by running GD for T = 10⁴ iterations starting from a small initialization with scale α. Specifically, we choose U_0 = αŪ, where the entries of Ū ∈ R^{d×r} are drawn i.i.d. from the standard Gaussian distribution. We consider both the over-parameterized and the exact/under-parameterized regimes. The learning rate of GD is set to µ = 0.005.

6.1. IMPLICIT LOW-RANK BIAS

In this subsection, we consider the over-parameterized setting with r = 50. For each iteration t ∈ [T] and rank s ∈ [r*], we define the relative error E_s(t) = ∥U_tU_t^⊤ − X_sX_s^⊤∥_F² / ∥X_sX_s^⊤∥_F² to measure the proximity of the GD iterates to X_s. We plot the relative error in Figure 1 for different choices of α and m (which affects the measurement error δ). Small initialization. The implicit low-rank bias of GD is evident when the initialization scale α is small. Indeed, one can observe that GD first visits a small neighbourhood of X_1, spends a long period of time near it, and then moves towards X_2. It then proceeds to learn X_3, X_4, … in a similar way, until it finally fits the ground truth. This is in line with Theorem 4.1. By contrast, for large initialization we do not observe this implicit bias. The effect of measurement error. For fixed α, one can observe that the relative error becomes smaller as the number of measurements increases. This is in line with Lemma 4.1, in which the bound depends on δ. In particular, for the case s = r*, although GD with a fixed initialization scale does not converge to the global minima, the distance to the set of global minima scales as poly(α).
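For reference, the relative error E_s(t) used in this section can be computed with a small helper (U_t and X_s are stand-ins for the quantities defined above):

```python
import numpy as np

def relative_error(U_t, X_s):
    """E_s(t) = ||U_t U_t^T - X_s X_s^T||_F^2 / ||X_s X_s^T||_F^2."""
    num = np.linalg.norm(U_t @ U_t.T - X_s @ X_s.T, 'fro') ** 2
    return num / np.linalg.norm(X_s @ X_s.T, 'fro') ** 2

rng = np.random.default_rng(8)
X_s = rng.standard_normal((50, 2))
print(relative_error(X_s, X_s))                # 0: exact match
print(relative_error(np.zeros((50, 7)), X_s))  # 1: zero iterate as baseline
```

The metric is 0 when U_tU_t^⊤ matches X_sX_s^⊤ exactly and 1 at the zero iterate, so a plateau of E_s(t) near 0 signals that GD is dwelling near the rank-s solution.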

6.2. MATRIX SENSING WITH EXACT PARAMETERIZATION

Now we study the behavior of GD in the exact parameterization regime (r = r*). We fix m = 1000 and r = r* = 5 and run GD for T = 500 iterations. We plot the relative error in Figure 2. We can see that GD exhibits an implicit low-rank bias when α is small. However, choosing a very small α slows down convergence: GD gets into a poly(α)-neighbourhood of the saddle points (near the low-rank solutions Z*_s with s < r*) and spends a long time escaping them. Also, convergence to global minimizers is guaranteed as long as α is below a certain threshold (see Proposition 4.1).

7. CONCLUSION

In this paper, we study the matrix sensing problem with RIP measurements and show that GD with small initialization follows an incremental learning procedure, in which GD finds near-optimal solutions with increasing ranks until it finds the ground truth. We take a step towards understanding the optimization and generalization aspects of simple optimization methods, thereby providing insights into their success in modern applications such as deep learning (Goodfellow et al., 2016). We also provide a detailed landscape analysis in the under-parameterized regime, which to the best of our knowledge is the first analysis of this kind. Although we focus on matrix sensing in this paper, a line of works has revealed that the implicit regularization effect may vary across models, including deep matrix factorization (Arora et al., 2019) and nonlinear ReLU/LeakyReLU networks (Lyu et al., 2021; Timor et al., 2022). Also, it is shown in Woodworth et al. (2020) that different initialization scales can lead to distinct inductive biases and affect the generalization and optimization behaviors. All these results indicate that further studies are needed to comprehensively understand gradient-based optimization methods from the generalization perspective.

The appendix is organized as follows: in Appendix A we present a number of results that will be used in later proofs. Appendix B sketches the main idea for proving our main results. Appendix C is devoted to a rigorous proof of Lemma B.1, with some auxiliary lemmas proved in Appendix D. In Appendix E we analyze the landscape of low-rank matrix sensing and prove our results in Section 5. These results are then used in Appendix F to prove Theorem 4.1. Finally, Appendix G studies the landscape of rank-1 matrix sensing, which enjoys a strong convexity property, as mentioned in Section 5 without proof.

A PRELIMINARIES

In this section, we present some useful results that are needed in the subsequent analysis.

A.1 THE RIP CONDITION AND ITS PROPERTIES

In this subsection, we collect a few useful properties of the RIP condition, which we recall below:

Definition A.1 We say that the measurement operator A satisfies the (δ, r)-RIP condition if for all matrices Z ∈ R^{d×d} with rank(Z) ≤ r, we have (1 − δ)∥Z∥_F² ≤ ∥A(Z)∥_2² ≤ (1 + δ)∥Z∥_F².

The key intuition behind RIP is that A*A ≈ I, where A*: w ↦ (1/√m) Σ_{i=1}^m w_iA_i is the adjoint of A. This intuition is made rigorous by the following proposition:

Proposition A.1 (Stöger & Soltanolkotabi, 2021, Lemma 7.3) Suppose that A satisfies the (δ, r)-RIP condition. Then for all symmetric matrices Z with rank(Z) ≤ r − 1, we have ∥(A*A − I)(Z)∥ ≤ √r δ∥Z∥.

A.2 MATRIX ANALYSIS

The following lemma is a direct corollary of Proposition A.1 and will be used frequently in our proof.

Lemma A.1 Suppose that the measurement operator A satisfies the (δ, 2r* + 1)-RIP condition. Then for all matrices U ∈ R^{d×r} with rank(U) ≤ r*, we have ∥(A*A − I)(XX^⊤ − UU^⊤)∥ ≤ δ√r* (∥X∥² + ∥U∥²).

In our proof we will frequently make use of Weyl's inequality for singular values:

Lemma A.2 (Weyl's inequality) Let A, ∆ ∈ R^{d×d} be two matrices. Then for all 1 ≤ k ≤ d, we have |σ_k(A) − σ_k(A + ∆)| ≤ ∥∆∥.

We will also need Wedin's sin-theta theorem for the singular value decomposition:

Lemma A.3 (Wedin, 1972, Section 3) Define R(·) to be the column space of a matrix. Suppose that B = A + T, let A_1, B_1 be the top-s components in the SVD of A and B respectively, and set A_0 = A − A_1, B_0 = B − B_1. If δ = σ_min(B_1) − σ_max(A_0) > 0, then ∥sin Θ(R(A_1), R(B_1))∥ ≤ ∥T∥/δ, where Θ(·,·) denotes the angle between two subspaces.

Equipped with Lemma A.1, we have the following characterization of the eigenvalues of M (recall that M = A*A(XX^⊤)):

Lemma A.4 Let M := A*A(XX^⊤) and let M = Σ_{k=1}^d σ̃_k²ṽ_kṽ_k^⊤ be the eigendecomposition of M. For 1 ≤ i ≤ d we have |σ_i² − σ̃_i²| ≤ δ∥X∥².

Proof: By Weyl's inequality, we have |σ_i² − σ̃_i²| ≤ ∥M − XX^⊤∥ ≤ δ∥X∥², as desired. □

A.3 OPTIMIZATION

Lemma A.5 Suppose that a smooth function f: R^m → R with minimum value f* > −∞ satisfies the following conditions for some ε > 0: (1) lim_{∥x∥→+∞} f(x) = +∞; (2) there exists an open subset S ⊂ R^m such that the set S* of global minima of f is contained in S, and for all stationary points x of f in R^m − S, we have f(x) − f* ≥ 2ε. Moreover, we also have f(x) − f* ≥ 2ε on ∂S.

Then we have

$\{x \in \mathbb{R}^m : f(x) - f^* \le \epsilon\} \subset S$.

Proof: Let $x^*$ be a minimizer of $f$ on $\mathbb{R}^m \setminus S$. By condition (1), we can deduce that $x^*$ always exists. Moreover, since any local minimizer of a function defined on a compact set must either be a stationary point or lie on the boundary of its domain, we can see that either $x^* \in \partial S$ or $\nabla f(x^*) = 0$ holds. By condition (2), either case implies that $f(x^*) - f^* \ge 2\epsilon$, as desired. □

Lemma A.6 Let $\{x_k\}, \{y_k\} \subset \mathbb{R}^n$ be two sequences generated by $x_{k+1} = x_k - \mu\nabla f(x_k)$ and $y_{k+1} = y_k - \mu\nabla f(y_k)$. Suppose that $\|x_k\| \le B$ and $\|y_k\| \le B$ for all $k$, and that $f$ is $L$-smooth on $\{x \in \mathbb{R}^n : \|x\| \le B\}$; then we have $\|x_k - y_k\| \le (1+\mu L)^k\|x_0 - y_0\|$.

Proof: The update rule implies that $\|x_{k+1} - y_{k+1}\| = \|x_k - y_k - \mu\nabla f(x_k) + \mu\nabla f(y_k)\| \le \|x_k - y_k\| + \mu\|\nabla f(x_k) - \nabla f(y_k)\| \le (1+\mu L)\|x_k - y_k\|$, which yields the desired inequality. □

A.4 PROOF OF PROPOSITION 3.2

Proposition 3.2 immediately follows from the following result:

Proposition A.2 (Rudelson & Vershynin, 2009) Suppose that all entries of $U \in \mathbb{R}^{d \times r}$ are independently drawn from $\mathcal{N}\left(0, \frac{1}{\sqrt{r}}\right)$ and $\rho = \frac{\epsilon(\sqrt{r} - \sqrt{s-1})}{\sqrt{r}}$; then $\sigma_{\min}(V_{X_s}^\top U) \ge \rho$ with probability at least $1 - e^{-cr} - (C\epsilon)^{r-s+1}$. Here $c, C > 0$ are universal constants.

By Proposition A.2, we have

$\mathbb{P}\left[\exists\, 1 \le s \le r \wedge r^* \text{ s.t. } \sigma_{\min}(V_{X_s}^\top U) < \frac{\epsilon}{2r}\right] \le \sum_{s=1}^{r \wedge r^*} \mathbb{P}\left[\sigma_{\min}(V_{X_s}^\top U) < \frac{\epsilon(\sqrt{r} - \sqrt{s-1})}{\sqrt{r}}\right] \le \sum_{s=1}^{r \wedge r^*}\left(e^{-cr} + (C\epsilon)^{r-s+1}\right) \le r\left(e^{-cr} + C\epsilon\right),$

which concludes the proof of Proposition 3.2.

B MAIN IDEA FOR THE PROOF OF THEOREM 4.1

In this section, we briefly introduce our main ideas for proving Theorem 4.1. Motivated by Stöger & Soltanolkotabi (2021), we decompose the matrix $U_t$ into a parallel component and an orthogonal component. Specifically, we write

$U_t = \underbrace{U_tW_tW_t^\top}_{\text{parallel component}} + \underbrace{U_tW_{t,\perp}W_{t,\perp}^\top}_{\text{orthogonal component}}, \qquad (8)$

where $W_t := W_{V_{X_s}^\top U_t} \in \mathbb{R}^{r \times s}$ is the matrix consisting of the right singular vectors of $V_{X_s}^\top U_t$ (Definition 3.1) and $W_{t,\perp} \in \mathbb{R}^{r \times (r-s)}$ is an orthogonal complement of $W_t$.
Our goal is to prove that at some time $t$, we have $V_{X_s}^\top\left(U_tU_t^\top - X_sX_s^\top\right) \approx 0$ and $\|U_tW_{t,\perp}\| \approx 0$. As we will see later, these imply that $U_tU_t^\top - X_sX_s^\top \approx 0$. The remaining part of this section is organized as follows: in Appendix B.1 we give a heuristic explanation for considering (8), and in Appendix B.2 we present our proof outline.

Additional Notations. Let $V_{X_s,\perp} \in \mathbb{R}^{d \times (d-s)}$ be an orthogonal complement of $V_{X_s} \in \mathbb{R}^{d \times s}$. Let $\Sigma_s = \operatorname{diag}(\sigma_1, \dots, \sigma_s)$ and $\Sigma_{s,\perp} = \operatorname{diag}(\sigma_{s+1}, \dots, \sigma_d)$. We write $\Delta_t := (\mathcal{A}^*\mathcal{A} - \mathcal{I})(XX^\top - U_tU_t^\top)$ for the measurement-error term associated with $XX^\top - U_tU_t^\top$.
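The decomposition (8) and the fact $V_{X_s}^\top U_tW_{t,\perp} = 0$ are purely linear-algebraic and can be checked mechanically. The following sketch (with arbitrary dimensions and hypothetical variable names of our own choosing) verifies both identities for a random matrix:

```python
import numpy as np

# U = U W W^T + U W_perp W_perp^T, where W holds the right singular
# vectors of V_Xs^T U and W_perp is its orthogonal complement.
rng = np.random.default_rng(2)
d, r, s = 8, 5, 2

V_Xs = np.linalg.qr(rng.standard_normal((d, s)))[0]  # orthonormal columns
U = rng.standard_normal((d, r))

_, _, Vt = np.linalg.svd(V_Xs.T @ U)   # V_Xs^T U is s x r
W = Vt[:s].T                           # r x s: top right singular vectors
W_perp = Vt[s:].T                      # r x (r-s): orthogonal complement

parallel = U @ W @ W.T
orthogonal = U @ W_perp @ W_perp.T
print(np.allclose(parallel + orthogonal, U))   # exact decomposition
print(np.allclose(V_Xs.T @ U @ W_perp, 0))     # V_Xs^T U W_perp = 0
```

The second identity holds because the rows of $V_{X_s}^\top U$ lie in the span of the top-$s$ right singular vectors, which is exactly why the proofs below can use $V_{X_s}^\top U_tW_{t,\perp} = 0$ repeatedly.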

B.1 HEURISTIC EXPLANATIONS OF THE DECOMPOSITION

A simple and intuitive approach for showing the implicit low-rank bias is to directly analyze the growth of $V_{X_s}^\top U_t$ versus $V_{X_s,\perp}^\top U_t$. Ideally, the former grows faster than the latter, so that GD only learns the components in $X_s$. By the update rule of GD (3),

$V_{X_s,\perp}^\top U_{t+1} = V_{X_s,\perp}^\top\left(I + \mu\,\mathcal{A}^*\mathcal{A}(XX^\top - U_tU_t^\top)\right)U_t = \underbrace{V_{X_s,\perp}^\top\left(I + \mu XX^\top - \mu U_tU_t^\top\right)U_t}_{=:G_{t,1}} + \mu\underbrace{V_{X_s,\perp}^\top\Delta_tU_t}_{=:G_{t,2}} = G_{t,1} + \mu G_{t,2}.$

For the first term $G_{t,1}$, we have

$G_{t,1} = (I + \mu\Sigma_{s,\perp}^2)V_{X_s,\perp}^\top U_t - \mu V_{X_s,\perp}^\top U_tU_t^\top U_t = (I + \mu\Sigma_{s,\perp}^2)V_{X_s,\perp}^\top U_t(I - \mu U_t^\top U_t) + O(\mu^2),$

where the last term $O(\mu^2)$ is negligible when $\mu$ is sufficiently small. Since $\|\Sigma_{s,\perp}\| = \sigma_{s+1}$, the spectral norm of $G_{t,1}$ can be bounded by

$\|G_{t,1}\| \le \|I + \mu\Sigma_{s,\perp}^2\| \cdot \|V_{X_s,\perp}^\top U_t\| \cdot \|I - \mu U_t^\top U_t\| + O(\mu^2) \le (1 + \mu\sigma_{s+1}^2)\|V_{X_s,\perp}^\top U_t\| + O(\mu^2).$

However, the main difference from the full-observation case (Jiang et al., 2022) is the second term $G_{t,2} := V_{X_s,\perp}^\top\Delta_tU_t$. Since the measurement errors $\Delta_t$ are small but arbitrary, it is hard to compare this term with $V_{X_s,\perp}^\top U_{t+1}$. As a result, we cannot directly bound the growth of $\|V_{X_s,\perp}^\top U_t\|$. This problem disappears, however, if we instead bound the growth of $\|V_{X_s,\perp}^\top U_{t+1}W_{t,\perp}\|$. To see this, we first deduce the following by repeatedly using $V_{X_s}^\top U_tW_{t,\perp} = 0$, which holds by the definition of $W_{t,\perp}$.
$G_{t,1}W_{t,\perp} = V_{X_s,\perp}^\top\left(I + \mu XX^\top - \mu U_tU_t^\top\right)U_tW_{t,\perp} = V_{X_s,\perp}^\top(I + \mu XX^\top)U_tW_{t,\perp} - \mu V_{X_s,\perp}^\top U_tU_t^\top U_tW_{t,\perp}$
$= (I + \mu\Sigma_{s,\perp}^2)V_{X_s,\perp}^\top U_tW_{t,\perp} - \mu V_{X_s,\perp}^\top U_t\left(W_tW_t^\top + W_{t,\perp}W_{t,\perp}^\top\right)U_t^\top U_tW_{t,\perp}$
$= (I + \mu\Sigma_{s,\perp}^2)V_{X_s,\perp}^\top U_tW_{t,\perp}\left(I - \mu W_{t,\perp}^\top U_t^\top U_tW_{t,\perp}\right) - \mu V_{X_s,\perp}^\top U_tW_tW_t^\top U_t^\top U_tW_{t,\perp} + O(\mu^2),$

$G_{t,2}W_{t,\perp} = V_{X_s,\perp}^\top\Delta_tU_tW_{t,\perp} = V_{X_s,\perp}^\top\Delta_tV_{X_s,\perp}V_{X_s,\perp}^\top U_tW_{t,\perp}.$

So we have the following recursion:

$V_{X_s,\perp}^\top U_{t+1}W_{t,\perp} = \left(I + \mu\Sigma_{s,\perp}^2 + \mu V_{X_s,\perp}^\top\Delta_tV_{X_s,\perp}\right)V_{X_s,\perp}^\top U_tW_{t,\perp}\left(I - \mu W_{t,\perp}^\top U_t^\top U_tW_{t,\perp}\right) - \mu V_{X_s,\perp}^\top U_tW_tW_t^\top U_t^\top U_tW_{t,\perp} + O(\mu^2). \qquad (9)$

We further note that

$V_{X_s,\perp}^\top U_{t+1}W_{t+1,\perp} = V_{X_s,\perp}^\top U_{t+1}W_tW_t^\top W_{t+1,\perp} + V_{X_s,\perp}^\top U_{t+1}W_{t,\perp}W_{t,\perp}^\top W_{t+1,\perp},$

which establishes the relationship between $V_{X_s,\perp}^\top U_{t+1}W_{t,\perp}$ and $V_{X_s,\perp}^\top U_{t+1}W_{t+1,\perp}$. To complete the proof, we need to prove the following:

• The minimal eigenvalue of the parallel component $U_tW_tW_t^\top$ grows at a linear rate with speed strictly faster than $\sigma_{s+1}$.

• The term $\|V_{X_s,\perp}^\top V_{U_tW_t}\| \ll 1$, which implies that the first term in (9) is negligible.

B.2 PROOF OUTLINE

An important intermediate step of our proof is the following result:

Lemma B.1 Under Assumptions 3.1 and 3.2, if the initialization scale $\alpha$ is sufficiently small, then for all $1 \le s \le r \wedge r^*$ there exists a time $T_\alpha^s \in \mathbb{Z}_+$ such that $\left\|X_sX_s^\top - U_{T_\alpha^s}U_{T_\alpha^s}^\top\right\|_F \le \kappa^2\sqrt{r^*}\|X\|^2\delta$.

We begin with the spectral alignment phase, where we can make the following approximation

$U_{t+1} \approx (I + \mu M)U_t \qquad (10)$

since $U_t$ is initially small. At some time $t = T_0$, $U_t$ becomes approximately aligned with the first $s$ components, as long as there is a positive gap between the $s$-th and $(s+1)$-th largest eigenvalues of $M$. The choice of $T_0$ is subject to a trade-off: the alignment must take effect before the approximation (10) induces a large error.

We then enter the second phase, which we call the parallel matching phase, in which the parallel component grows to a constant magnitude and becomes well-matched with the ground truth (while the orthogonal component remains small). Specifically, for small constants $c_i$, $1 \le i \le 3$, we show that the following statements hold in this phase:

(1) $\sigma_{\min}(V_{X_s}^\top U_t)$ grows exponentially fast until it reaches $c \cdot \sigma_s$ for some constant $c$ at some time $T_s$. Specifically, we have $\sigma_{\min}(V_{X_s}^\top U_{t+1}W_t) \ge \sigma_{\min}(V_{X_s}^\top U_t)\left(1 + \mu\sigma_s^2 - c_1 - \mu\sigma_{\min}^2(V_{X_s}^\top U_t)\right)$.

(2) When $t \le T_s$, the growth speed of $\|U_tW_{t,\perp}\|$ is slower than that of $\sigma_{\min}(V_{X_s}^\top U_t)$: $\|U_{t+1}W_{t+1,\perp}\| \le \left(1 + \mu\sigma_{s+1}^2 + c_2\right)\|U_tW_{t,\perp}\|$.

(3) $\|V_{X_s,\perp}^\top V_{U_tW_t}\| \le c_3$ remains true until $t = T_s$.

These statements will be proven by induction. At $t = T_s$, we have $\sigma_{\min}(V_{X_s}^\top U_t) = \Theta(1)$, and we enter the refinement phase, in which we show that the quantity $\left\|V_{X_s}^\top\left(U_tU_t^\top - X_sX_s^\top\right)\right\|_F$ decreases exponentially until it reaches $O(\delta)$. Note that in general GD would not converge to an $o(1)$-neighbourhood of $X_sX_s^\top$ as the initialization scale $\alpha \to 0$, because $X_sX_s^\top$ is not the rank-$s$ minimizer of the RIP loss.
As a result, in the refinement phase we can only expect to obtain $\left\|U_tU_t^\top - X_sX_s^\top\right\|_F \le \operatorname{poly}(r) \cdot \delta$. To conclude the proof of Theorem 4.1, we prove in Section 5 that the landscape of the matrix sensing loss with rank-$s$ parameterization, though non-convex, satisfies a local Polyak-Lojasiewicz (PL) condition within a neighborhood of constant radius. As a result, for sufficiently small $\delta$, GD converges linearly to a global minimum from a good initialization. We then show that for a period of time after the refinement phase, the GD trajectory stays close to the trajectory of well-initialized GD for rank-$s$ parameterized matrix sensing. The length of this period goes to infinity as $\alpha \to 0$, thereby implying that GD finds the rank-$s$ minimizer with $o(1)$ error. The details are given in Appendix F.
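The three-phase picture sketched above can already be observed in a toy simulation. The sketch below is our own illustration, not the paper's experiment: it uses full observations (i.e. $\delta = 0$, so $\mathcal{A}^*\mathcal{A} = \mathcal{I}$) and hypothetical parameter choices, and runs GD on $\frac14\|UU^\top - XX^\top\|_F^2$ from a small initialization to check that the top direction is fitted strictly before the second one:

```python
import numpy as np

# Full-observation caricature of incremental learning: with small
# initialization, the eigenvalue sigma_1^2 = 4 of UU^T is fitted
# before sigma_2^2 = 1, giving a saddle-to-saddle staircase.
rng = np.random.default_rng(3)
d, r = 10, 2
V = np.linalg.qr(rng.standard_normal((d, r)))[0]
X = V @ np.diag([2.0, 1.0])
XXt = X @ X.T

alpha, mu = 1e-6, 0.02
U = alpha * rng.standard_normal((d, r))

t1 = t2 = None     # first times the two eigenvalues are 90% fitted
for t in range(20000):
    U = U + mu * (XXt - U @ U.T) @ U   # GD on (1/4)||UU^T - XX^T||_F^2
    ev = np.sort(np.linalg.eigvalsh(U @ U.T))[::-1]
    if t1 is None and ev[0] > 0.9 * 4.0:
        t1 = t
    if t2 is None and ev[1] > 0.9 * 1.0:
        t2 = t
    if t1 is not None and t2 is not None:
        break

print(t1, t2)   # t1 << t2: the rank-1 solution is learned first
```

The gap between the two times grows as $\alpha \to 0$, consistent with the claim that the length of each plateau diverges with the initialization scale.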

C PROOF OF LEMMA B.1

In this section, we give the full proof of Lemma B.1, with some additional technical lemmas left to Appendix D. Appendices C.1 and C.2 are devoted to analyzing the spectral phase and parallel improvement phase, respectively. Appendix C.3 uses induction to characterize the low-rank GD trajectory in the parallel improvement phase. In Appendix C.4 we study the refinement phase, which allows us to derive Lemma B.1.

C.1 THE SPECTRAL PHASE

Starting from a small initialization $U_0 = \alpha U$ with $\alpha \ll 1$, GD first goes through a phase in which it behaves similarly to power iteration; following Stöger & Soltanolkotabi (2021), we refer to it as the spectral phase. Specifically, in the spectral phase we have

$U_{t+1} = \left(I + \mu\,\mathcal{A}^*\mathcal{A}(XX^\top - U_tU_t^\top)\right)U_t \approx \left(I + \mu\,\mathcal{A}^*\mathcal{A}(XX^\top)\right)U_t.$

The approximation holds with high accuracy as long as $\|U_t\| \ll 1$. Moreover, we have $M := \mathcal{A}^*\mathcal{A}(XX^\top) \approx XX^\top$ by the RIP condition; when $\delta$ is sufficiently small, we can still ensure a positive eigen-gap of $M$. As a result, with small initialization, $U_t$ becomes approximately aligned with the top eigenvector $\hat v_1$ of $M$. Since $\|M - XX^\top\| = O(\delta\sqrt{r^*})$ by Proposition A.1, we have $\|\hat v_1 - v_1\| = O(\delta\sqrt{r^*})$, so that $\|V_{X_s,\perp}^\top V_{U_tW_t}\| = O(\delta\sqrt{r^*})$. This proves the base case for the induction.

Formally, we define $M = \mathcal{A}^*\mathcal{A}(XX^\top)$, $Z_t = (I + \mu M)^t$ and $\tilde U_t = Z_tU_0$. Suppose that $M = \sum_{i=1}^{\operatorname{rank}(M)}\hat\sigma_i^2\hat v_i\hat v_i^\top$ is the spectral decomposition of $M$. We additionally define $M_s = \sum_{i=1}^{\min\{s,\operatorname{rank}(M)\}}\hat\sigma_i^2\hat v_i\hat v_i^\top$. By Lemma A.4 and $\delta\sqrt{r^*} \le 10^{-3}\kappa$ as stated in Lemma B.1, we have $\hat\sigma_s \ge \sigma_s - 0.01\tau$ and $\hat\sigma_{s+1} \le \sigma_{s+1} + 0.01\tau$, where $\tau = \sigma_s - \sigma_{s+1} > 0$. Additionally, let $L_t$ be the span of the top-$s$ left singular vectors of $\tilde U_t$. We make the following assumption on the initialization, which holds with high probability when it is i.i.d. Gaussian:

Assumption C.1 The matrix $V_{M_s}^\top U \in \mathbb{R}^{s \times r}$ has full row rank, i.e., $\rho = \sigma_{\min}(V_{M_s}^\top U) > 0$.

Let $t^\star := \min\left\{i \in \mathbb{N} : \|U_{i-1} - \tilde U_{i-1}\| > \|\tilde U_{i-1}\|\right\}$. The following lemma bounds the error of approximating $U_t$ by $\tilde U_t$:

Lemma C.1 (Stöger & Soltanolkotabi, 2021, Lemma 8.1) Suppose that $\mathcal{A}$ satisfies the rank-1 RIP condition with constant $\delta_1$. For all integers $t$ with $1 \le t \le t^\star$, it holds that

$\|E_t\| = \|U_t - \tilde U_t\| \le 4\sigma_1^{-2}\alpha^3r^*(1+\delta_1)\left(1+\mu\sigma_1^2\right)^{3t}\|U\|^3.$

Corollary C.1 We have $t^\star \ge \frac{\log\alpha^{-1} + \frac12\log\frac{\rho\sigma_1^2}{4(1+\delta_1)r^*\|U\|^3}}{\log(1+\mu\sigma_1^2)}$.

Proof: By Lemma C.1 we have $\|E_t\| \le 4\sigma_1^{-2}\alpha^3r^*(1+\delta_1)\left(1+\mu\sigma_1^2\right)^{3t}\|U\|^3$
for all $t \le t^\star$. On the other hand, we have $\|\tilde U_t\| = \alpha\|(I+\mu M)^tU\| \ge \alpha(1+\mu\hat\sigma_1^2)^t\|\hat v_1\hat v_1^\top U\| \ge (1+\mu\hat\sigma_1^2)^t\alpha\rho$. Thus, it follows from $\|E_{t^\star}\| \ge \|\tilde U_{t^\star}\|$ that

$\left(1+\mu\sigma_1^2\right)^{t^\star} \ge \left(\frac{\rho\sigma_1^2}{4(1+\delta_1)r^*\|U\|^3}\right)^{\frac12}\cdot\alpha^{-1} \;\Longrightarrow\; t^\star \ge \frac{\log\alpha^{-1} + \frac12\log\frac{\rho\sigma_1^2}{4(1+\delta_1)r^*\|U\|^3}}{\log(1+\mu\sigma_1^2)}$

as desired. □

Lemma C.2 There exists a time

$t = T_0 := \frac{2\log\alpha^{-1} + \log\frac{\rho\sigma_1^2}{4r^*(1+\delta)}}{3\log(1+\mu\hat\sigma_1^2) - \log(1+\mu\hat\sigma_{s+1}^2)} \le t^\star$

such that

$\left\|U_t - \sum_{i=1}^s\alpha(1+\mu\hat\sigma_i^2)^t\hat v_i\hat v_i^\top U\right\| \lesssim \alpha^\gamma, \quad\text{where } \gamma = 1 - \frac{2\log(1+\mu\hat\sigma_1^2)}{3\log(1+\mu\hat\sigma_1^2) - \log(1+\mu\hat\sigma_{s+1}^2)}.$

Proof: It is easy to check that $T_0 \le t^\star$ by applying Corollary C.1. We consider the following decomposition:

$\left\|U_t - \sum_{i=1}^s\alpha(1+\mu\hat\sigma_i^2)^t\hat v_i\hat v_i^\top U\right\| \le \left\|U_t - \tilde U_t\right\| + \left\|\tilde U_t - \sum_{i=1}^s\alpha(1+\mu\hat\sigma_i^2)^t\hat v_i\hat v_i^\top U\right\|.$

When $t \le t^\star$, the first term can be bounded as $\|E_t\| \le 4\sigma_1^{-2}\alpha^3r^*(1+\delta_1)(1+\mu\sigma_1^2)^{3t}$. For the second term, we have

$\left\|\tilde U_t - \sum_{i=1}^s\alpha(1+\mu\hat\sigma_i^2)^t\hat v_i\hat v_i^\top U\right\| \le \left\|\sum_{i=s+1}^{r^*}\alpha(1+\mu\hat\sigma_i^2)^t\hat v_i\hat v_i^\top U\right\| \le \alpha\left(1+\mu\hat\sigma_{s+1}^2\right)^t\|U\|.$

In particular, the definition of $T_0$ implies that $\left\|U_{T_0} - \sum_{i=1}^s\alpha(1+\mu\hat\sigma_i^2)^{T_0}\hat v_i\hat v_i^\top U\right\| \lesssim \alpha^\gamma$, as desired. □

We conclude this section with the following lemma, which states that initially the parallel component $U_tW_t$ grows much faster than the noise term and becomes well-aligned with $X_s$.

Lemma C.3

The following inequalities hold for t = T 0 when α ≲ ρ -4κ is sufficiently small: ∥U t ∥ ≤ ∥X∥ (12a) σ min (U T0 W T0 ) ≥ ρ • poly(r * ) -1 • α 1- 2 log(1+µσ 2 s ) 3 log(1+µσ 2 1 )-log(1+µσ 2 s+1 ) (12b) ∥U T0 W T0,⊥ ∥ ≤ poly(r * ) • α 1- 2 log(1+µσ 2 s+1 ) 3 log(1+µσ 2 1 )-log(1+µσ 2 s+1 ) (12c) V ⊤ Xs,⊥ V U T 0 W T 0 ≤ 200δ (12d) Proof : We prove this lemma by applying Corollary D.1 to t = T 0 defined in the previous lemma. The inequality (12a) can be directly verified by using Lemma C.2: ∥U t ∥ ≤ α 1 + µσ 2 i T0 + α γ ≲ poly(r * ) • α γ/3 ≤ ∥X∥. For the remaining inequalities, we first verify that the assumption in Corollary D.1: ασ s (Z t ) > 10 (ασ s+1 (Z t ) + ∥E t ∥) . ( ) By definition of Z t , we can see that ασ s+1 (Z T0 ) + ∥E T0 ∥ ≤ α 1 + µσ 2 s+1 T0 + ∥E T0 ∥ ≲ α γ ≲ 0.1ρα 1- 2 log(1+µσ 2 s ) 3 log(1+µσ 2 1 )-log(1+µσ 2 s+1 ) ≤ 0.1ασ s (Z T0 ) Proof : Let M t = A * A(XX ⊤ -U t U ⊤ t ) , so the update rule of GD implies that U t+1 W t+1 = (I + µM t )U t W t+1 = (I + µM t ) U t W t W ⊤ t W t+1 + U t W t,⊥ W ⊤ t,⊥ W t+1 = (I + µM t ) V UtWt V ⊤ UtW U t W t W ⊤ t W t+1 + U t W t,⊥ W ⊤ t,⊥ W t+1 = (I + µM t )(I + P )V UtWt := Ẑ V ⊤ UtWt U t W t W ⊤ t W t+1 , where P = U t W t,⊥ W ⊤ t,⊥ W t+1 V ⊤ UtWt U t W t W ⊤ t W t+1 -1 V ⊤ UtWt and V ⊤ UtWt U t W t W ⊤ t W t+1 is invertible since V ⊤ UtWt U t W t is invertible by our as- sumption that V ⊤ Xs U t is of full rank and rank (U t W t ) ≥ rank V ⊤ Xs U t W t = rank V ⊤ Xs U t = s, and W ⊤ t W t+1 is invertible by Lemma D.6 and the assumptions on µ and (A * A -I)(XX ⊤ -U t U ⊤ t ) . The key observation here is that because the (square) matrix V ⊤ UtWt U t W t W ⊤ t W t+1 is invertible, so that the column space of U t+1 W t+1 is the same as that of Z. Following the line of proof of (Stöger & Soltanolkotabi, 2021, Lemma 9. 
3) (for completeness, we provide details in Lemma D.7), we deduce that  V ⊤ Xs,⊥ V Ut+1Wt+1 = V ⊤ Xs,⊥ V Ẑ W ⊤ Ẑ ≤ V ⊤ Xs,⊥ I + B - 1 2 V UtWt V ⊤ UtWt B + B ⊤ V UtWt -BV UtWt V ⊤ UtWt B + B ⊤ V UtWt + D ≤ V ⊤ Xs,⊥ I + B - 1 2 V UtWt V ⊤ UtWt B + B ⊤ V UtWt + 2∥B∥ 2 + ∥D∥ ∥P ∥ ≤ ∥U t W t,⊥ ∥ ∥W t,⊥ W t+1 ∥ σ min (U t W t )σ min (W ⊤ t W t+1 ) ≤ 2 ∥W t,⊥ W t+1 ∥ , so that B -µ(XX ⊤ -U t U ⊤ t ) ≤ µ∥M t -(XX ⊤ -U t U ⊤ t )∥ + ∥P ∥ + µ∥M t ∥∥P ∥ ≤ µ (A * A -I)(XX ⊤ -U t U ⊤ t ) + 2 ∥W t,⊥ W t+1 ∥ + 4µ∥X∥ 2 ∥W t,⊥ W t+1 ∥ ≤ µ (A * A -I)(XX ⊤ -U t U ⊤ t ) + 6 ∥W t,⊥ W t+1 ∥ ≤ 18µ 10µ∥X∥ 3 + c 4 c 3 ∥X∥ + 7µ (A * A -I)(XX ⊤ -U t U ⊤ t ) ≤ 18µ 10µ∥X∥ 3 + c 4 c 3 ∥X∥ + 0.01µκ -1 c 3 ∥X∥ 2 (28) where we use Lemma D.6 to bound W ⊤ t,⊥ W t+1 . Let B 1 = µ(XX ⊤ -U t U ⊤ t ) and R 1 = V ⊤ Xs,⊥ I + B 1 -V UtWt V ⊤ UtWt B 1 V UtWt , then we have R 1 = V ⊤ Xs,⊥ I + µ I -V UtWt V ⊤ UtWt XX ⊤ -U t U ⊤ t V UtWt = I + µΣ 2 s,⊥ V ⊤ Xs,⊥ V UtWt I -µV ⊤ UtWt XX ⊤ V UtWt -µV ⊤ Xs,⊥ I -V UtWt V ⊤ UtWt U t W t,⊥ W ⊤ t,⊥ U ⊤ t V Xs,⊥ V ⊤ Xs,⊥ V UtWt + µ 2 Σ 2 s,⊥ V ⊤ Xs,⊥ V UtWt V ⊤ UtWt XX ⊤ V UtWt . By Weyl's inequality (cf. Lemma A.2) and our assumption on c 3 , σ min V ⊤ UtWt XX ⊤ V UtWt ≥ σ min V ⊤ UtWt X s X ⊤ s V UtWt -V ⊤ UtWt X s,⊥ X ⊤ s,⊥ V UtWt 2 ≥ σ min V ⊤ UtWt X s X ⊤ s V UtWt -σ 2 s+1 V ⊤ Xs,⊥ V UtWt 2 ≥ σ 2 s V ⊤ UtWt V Xs 2 -σ 2 s+1 c 2 3 = σ 2 s -(σ 2 s + σ 2 s+1 )c 2 3 > 1 2 σ 2 s + σ 2 s+1 . So we have ∥R 1 ∥ ≤ 1 - µ 2 (σ 2 s -σ 2 s+1 ) V ⊤ Xs,⊥ V UtWt + 3µ∥X∥c 3 c 4 + µ 2 ∥X∥ 4 . It thus follows from ( 27) that V ⊤ X ⊥ s V Ut+1Wt+1 ≤ ∥R 1 ∥ + 2∥B -B 1 ∥ + 102∥B∥ 2 ≤ 1 - µ 2 (σ 2 s -σ 2 s+1 ) V ⊤ Xs,⊥ V UtWt + 40µc 3 c 4 ∥X∥ + 0.02µκ -1 c 3 ∥X∥ 2 + 10 3 µ 2 ∥X∥ 4 . Since V ⊤ Xs,⊥ V UtWt ≤ c 3 , it follows from our assumption on c 3 , c 4 and µ that V ⊤ X ⊥ s V Ut+1Wt+1 ≤ c 3 as well, which concludes the proof. □ C.3 INDUCTION Let T s α = min t ⩾ 0 : σ 2 min V ⊤ Xs U α,t+1 > 0.3 σ 2 s -σ 2 s+1 =: τ s . 
In this section, we show that when T 0 ≤ t < T s α , the parallel component grows exponentially faster than the orthogonal component. We prove this via induction and the base case is already shown in Lemma C.3. Lemma C.7 Let max{c 3 , c 4 ∥X∥ -1 } ≤ 10 -3 κ -1 , δ ≤ 10 -4 κ -1 r -1 2 * c 3 and µ ≤ 10 -4 κ -1 ∥X∥ -2 . Then the following holds for all T 0 ≤ t < T α,s as long as α ≤ poly(r) -1 : σ min V ⊤ Xs U t+1 ≥ σ min V ⊤ Xs U t+1 W t ≥ 1 + 0.5µ σ 2 s + σ 2 s+1 σ min V ⊤ Xs U α,t (30a) ∥U t+1 W t+1,⊥ ∥ ≤ min 1 + µ 0.4σ 2 s + 0.6σ 2 s+1 ∥U t W t,⊥ ∥ , c 4 (30b) V ⊤ Xs,⊥ V Ut+1Wt+1 ≤ c 3 . (30c) rank V ⊤ Xs U t+1 = rank V ⊤ Xs U t+1 W t = s. Proof : The base case t = T 0 is already proved in (12). Now suppose that the lemma holds for t, we now show that it holds for t + 1 as well. To begin with, we bound the term ∥∆ t ∥ as follows: ∥∆ t ∥ = (A * A -I)(XX ⊤ -U t U ⊤ t ) ≤ (A * A -I)(XX ⊤ -U t W t W ⊤ t U ⊤ t ) + (A * A -I)U t W t,⊥ W ⊤ t,⊥ U ⊤ t ≤ 10δ √ r * ∥X∥ 2 + δ U t W t,⊥ W ⊤ t,⊥ U ⊤ t * ≤ 10δ √ r * ∥X∥ 2 + δ √ d 1 + µ(0.4σ 2 s + 0.6σ 2 s+1 ) t-T0 ∥U T0 W T0,⊥ ∥ By induction hypothesis, it's easy to see that σ min V ⊤ Xs U t ∥U t W t,⊥ ∥ ≥ σ min V ⊤ Xs U T0 ∥U T0 W T0,⊥ ∥ ≥ poly(r) • α -γs where γ s = 2 log 1 + µσ 2 s -log 1 + µσ 2 s+1 3 log (1 + µσ 2 1 ) -log 1 + µσ 2 s+1 ≥ 1 4κ . Since we must have σ min V ⊤ Xs U t ≤ 0.3τ = O(1) by definition of T α,s , it follows that ∥U t W t,⊥ ∥ ≤ poly(r)α 1 4κ , so for sufficiently small α ≤ poly(r) -1 , ∥∆ t ∥ ≤ 11δ √ r∥X∥ 2 holds. The above inequality combined with our assumption on δ implies that the conditions on ∥∆ t ∥ in Lemmas C.4 to C.6 hold. We now show that (30a) to (30d) hold for t + 1, which completes the induction step. First, since t < T s α , we have σ min V ⊤ Xs U t+1 ≤ τ . Moreover, the induction hypothesis implies that V ⊤ Xs,⊥ V Ut-1Wt-1 ≤ c 3 and that V ⊤ Xs U α,t is of full rank. Thus the conditions of Corollary C.2 are all satisfied, and we deduce that (30a) holds. 
Second, the assumptions on $c_3$, $c_4$ and $\delta$, combined with Lemma C.5, immediately imply

$\|U_{t+1}W_{t+1,\perp}\| \le \left(1 + \mu\left(0.4\sigma_s^2 + 0.6\sigma_{s+1}^2\right)\right)\|U_tW_{t,\perp}\|.$

As a result, similar to (31), we observe that

$\frac{\sigma_{\min}\left(V_{X_s}^\top U_{t+1}\right)}{\|U_{t+1}W_{t+1,\perp}\|} \ge \frac{\sigma_{\min}\left(V_{X_s}^\top U_{T_0}\right)}{\|U_{T_0}W_{T_0,\perp}\|} \ge \operatorname{poly}(r)\cdot\alpha^{-\frac{1}{4\kappa}}.$

Since $\sigma_{\min}(V_{X_s}^\top U_{t+1}) \le \|X\|$, when $\alpha < \operatorname{poly}(r)^{-1}$ we must have $\|U_{t+1}W_{t+1,\perp}\| \le c_4$. Finally, Lemma C.6 implies that (30c) is true, and (30d) follows from our application of Lemma C.4. This concludes the proof. □
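The message of the induction — the parallel component grows roughly like $(1+\mu\sigma_s^2)^t$ while the orthogonal component grows at most like $(1+\mu\sigma_{s+1}^2)^t$ — can be seen in a small experiment. The sketch below is our own toy (full observations, i.e. $\delta = 0$, $s = 1$, and parameters of our choosing), comparing the two growth factors over 150 GD steps:

```python
import numpy as np

# Parallel vs. orthogonal growth in the toy full-observation model:
# the component aligned with V_Xs should outgrow U_t W_{t,perp}.
rng = np.random.default_rng(8)
d, r, s = 10, 3, 1
V = np.linalg.qr(rng.standard_normal((d, 3)))[0]
XXt = V @ np.diag([4.0, 1.0, 0.25]) @ V.T   # sigma_1^2 = 4 > sigma_2^2 = 1
V_Xs = V[:, :s]

mu, alpha = 0.02, 1e-6
U = alpha * rng.standard_normal((d, r))

def components(U):
    P = V_Xs.T @ U                          # s x r
    _, svals, Vt = np.linalg.svd(P)
    W_perp = Vt[s:].T                       # complement of top-s right vectors
    return svals[-1], np.linalg.norm(U @ W_perp, 2)

par0, orth0 = components(U)
for _ in range(150):
    U = U + mu * (XXt - U @ U.T) @ U        # GD step
par1, orth1 = components(U)
print(par1 / par0 > orth1 / orth0)   # True: parallel part grows much faster
```

With these numbers the two growth factors are roughly $(1.08)^{150}$ versus $(1.02)^{150}$, so the ratio separates by several orders of magnitude, mirroring the $\alpha^{-1/(4\kappa)}$ separation in the proof.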

C.4 THE REFINEMENT PHASE AND CONCLUDING THE PROOF OF LEMMA B.1

We have shown that the parallel component $\sigma_{\min}(V_{X_s}^\top U_t)$ grows exponentially faster than the orthogonal component $\|U_tW_{t,\perp}\|$. In this section, we characterize the GD dynamics after $T_\alpha^s$. We begin with the following lemma, which is straightforward from the proof of Lemma C.7.

Lemma C.8

The following inequality holds when α ≤ poly(r) -1 is sufficiently small: U Ts W Ts,⊥ ≤ poly(r) • α 1 4κ . The following lemma states that in a certain time period after T s α , the parallel and orthogonal components still behave similarly to the second (parallel improvement) phase. Lemma C.9 There exists t s α = Θ log 1 α when α → 0 (here we omit the dependence of T s α on α for simplicity) such that when 0 ≤ t -T s α ≤ ts α , we have σ min V ⊤ Xs U t ≥ σ min (U t W t ) ≥ 0.3τ (32a) ∥U t W t ∥ ≤ 1 + µ(0.4σ 2 s + 0.6σ 2 s+1 ) t-T s α U T s α W T s α (32b) ∥V Xs,⊥ V UtWt ∥ ≤ c 3 . Proof : We choose t s α = min t ≥ 0 : ∥U t+1 W t+1,⊥ ∥ 2 ≤ c 5 where c 5 = 10 -4 d -1 2 κ -1 c 3 ∥X∥ 2 (33) We prove (32) by induction. The proof follows the idea of Lemma C.7, except that we need to bound ∥∆ t ∥ in each induction step. Concretely, suppose that (32) holds at time t, then ∥∆ t ∥ = (A * A -I)(XX ⊤ -U t U ⊤ t ) ≤ (A * A -I)(XX ⊤ -U t W t W ⊤ t U ⊤ t ) + (A * A -I)U t W t,⊥ W ⊤ t,⊥ U ⊤ t ≤ 10δ √ r * ∥X∥ 2 + δ U t W t,⊥ W ⊤ t,⊥ U ⊤ t * ≤ 10δ √ r * ∥X∥ 2 + δc 5 √ d ≤ 10 -3 κ -1 c 3 ∥X∥ 2 and lastly (35c) is obtained from σ 2 min V ⊤ Xs V UtWt ≥ 1 -V ⊤ Xs,⊥ V UtWt 2 ≥ 1 -c 2 3 . For the second and the third terms, we have (I -A * A) XX ⊤ -U t U t U t U ⊤ t + U t U ⊤ t (I -A * A) XX ⊤ -U t U ⊤ t ≤ 0.1κ -1 c 3 ∥X∥ 4 (36) where we use the estimate in (34). Combining ( 35) and (36) yields V ⊤ Xs (XX ⊤ -U t+1 U ⊤ t+1 ) ≤ 1 - 1 2 µτ V ⊤ Xs XX ⊤ -U t U ⊤ t + 200µ∥X∥ 4 c 3 + 110µ 2 √ r * ∥X∥ 6 .

□

To apply the result of Lemma C.10, we need to verify that ∥V Xs,⊥ V UtWt ∥ ≤ c 3 still holds when t ≥ T s α . In fact, this is true as long as t -T s α ≤ O log 1 α . We are now ready to present our first main result, which states that with small initialization, GD would visit the O(δ)-neighbourhood of the rank-s minimizer of the full observation loss i.e. X s X ⊤ s . Theorem C.1 When α < poly(r * ) -1 and δ = 10 -4 κ -1 c 3 with c 3 < 10 -3 κ -1 r -1 2 * , there exists a time t = T s α ∈ Z + such that ∥X s X ⊤ s -U t U ⊤ t ∥ ≤ 10 3 τ -2 ∥X∥ 6 √ r * c 3 . Proof : First, observe that for all t ≥ 0, X s X ⊤ s -U t U ⊤ t F ≤ X s X ⊤ s -U t U ⊤ t V Xs V ⊤ Xs F + U t U ⊤ t V X ⊥ s V ⊤ X ⊥ s F ≤ X s X ⊤ s -U t U ⊤ t V Xs V ⊤ Xs F + V ⊤ X ⊥ s U t U ⊤ t V X ⊥ s F ≤ V ⊤ Xs X s X ⊤ s -U t U ⊤ t F + √ r * V ⊤ X ⊥ s U t W t 2 + √ d V ⊤ X ⊥ s U t W t,⊥ 2 ≤ V ⊤ Xs X s X ⊤ s -U t U ⊤ t + 9 √ r * ∥X∥ 2 ∥V Xs,⊥ V UtWt ∥ 2 + √ d∥U t W t,⊥ ∥ 2 . (37) We set δ = 10 -3 ∥X∥ -2 τ c 3 and T s α = T s α - log 10 -2 ∥X∥ -2 τ c -1 3 log 1 -1 2 µτ , then for small α we have T s α ≤ T s α + T s α (defined in Lemma C.9). Hence for T s α ≤ t < T s α we always have ∥V Xs,⊥ V UtWt ∥ ≤ c 3 . By Lemma C.10 and the choice of c 3 and δ, we have for T s α ≤ t < T s that V ⊤ Xs XX ⊤ -U t+1 U ⊤ t+1 F ≤ 1 - 1 2 µτ s V ⊤ Xs XX ⊤ -U t U ⊤ t F + 150µ∥X∥ 4 √ r * c 3 which implies that V ⊤ Xs XX ⊤ -U Ts U ⊤ Ts F ≤ 400τ -1 ∥X∥ 4 √ r * c 3 . Meanwhile, by Lemma C.9 we have ∥U t W t,⊥ ∥ ≤ c 5 and V ⊤ Xs,⊥ V UtWt ≤ c 3 at t = T s α . Plugging into (37) yields X s X ⊤ s -U T s α U ⊤ T s α F ≤ 400τ -1 ∥X∥ 4 √ r * c 3 + 9∥X∥ 2 c 2 3 √ r * + c 2 5 √ d. By definition of c 3 and c 5 we deduce that X s X ⊤ s -U T s α U ⊤ T s α F ≤ 10 3 τ -2 ∥X∥ 6 √ r * c 3 , as de- sired. □ Corollary C.4 There exists a constant C 1 = C 1 (κ, r * ) such that max 0≤t≤ T s α ∥U t W t,⊥ ∥ ≤ C 1 α 1 4κ . Proof : The case of T s α directly follows from Lemma C.8. 
For t > T s α , we know from Lemma C.9 that ∥U t W t,⊥ ∥ ≤ U T s α W T s α ,⊥ • 1 + µσ 2 s T s α -T s α . By (38), the second term is a constant independent of α, so the conclusion follows. □

D AUXILIARY RESULTS FOR PROVING LEMMA B.1

This section contains a collection of auxiliary results that are used in the previous section.

D.1 THE SPECTRAL PHASE

In this section, we provide auxiliary results for the analysis of the spectral phase. Recall that $N_t = (I+\mu M)^t$, $U_t = \tilde U_t + E_t = N_tU_0 + E_t$ and $U_0 = \alpha U$. Also recall that $M = \sum_{i=1}^{\operatorname{rank}(M)}\hat\lambda_i\hat v_i\hat v_i^\top$; we additionally define $M_s = \sum_{i=1}^{\min\{s,\operatorname{rank}(M)\}}\hat\lambda_i\hat v_i\hat v_i^\top$. Similarly, let $L_t$ be the span of the top-$s$ left singular vectors of $\tilde U_t$. The following lemma shows that power iteration results in a large eigen-gap of $\tilde U_t$.

Lemma D.1 Let $\rho = \sigma_{\min}(V_{M_s}^\top U) > 0$; then the following three inequalities hold, given that the denominator of the third is positive:

$\sigma_s(U_t) \ge \alpha\left(\rho\sigma_s(\hat Z_t) - \sigma_{s+1}(\hat Z_t)\|U\|\right) - \|E_t\|,$ (39a)
$\sigma_{s+1}(U_t) \le \alpha\sigma_{s+1}(\hat Z_t)\|U\| + \|E_t\|,$ (39b)
$\left\|V_{M_s^\perp}^\top V_{L_t}\right\| \le \frac{\alpha\sigma_{s+1}(\hat Z_t)\|U\| + \|E_t\|}{\alpha\rho\sigma_s(\hat Z_t) - 2\left(\alpha\sigma_{s+1}(\hat Z_t)\|U\| + \|E_t\|\right)}.$ (39c)

Proof: By Weyl's inequality we have

$\sigma_{s+1}(U_t) \le \sigma_{s+1}\left((I+\mu M)^tU_0\right) + \|E_t\| = \alpha\sigma_{s+1}\left((I+\mu M)^tU\right) + \|E_t\| \le \alpha\sigma_{s+1}\left((I+\mu M_s)^tU\right) + \alpha\left\|\left((I+\mu M)^t - (I+\mu M_s)^t\right)U\right\| + \|E_t\| \le \alpha\left(1+\mu\hat\lambda_{s+1}\right)^t\|U\| + \|E_t\|.$

Thus (39b) holds. Similarly,

$\sigma_s(U_t) \ge \alpha\sigma_s\left(N_tV_{M_s}V_{M_s}^\top U\right) - \alpha\left(1+\mu\hat\lambda_{s+1}\right)^t\|U\| - \|E_t\| \ge \alpha\sigma_s\left(N_tV_{M_s}\right)\sigma_{\min}\left(V_{M_s}^\top U\right) - \alpha\left(1+\mu\hat\lambda_{s+1}\right)^t\|U\| - \|E_t\| \ge \alpha\rho\left(1+\mu\hat\lambda_s\right)^t - \alpha\left(1+\mu\hat\lambda_{s+1}\right)^t\|U\| - \|E_t\|.$

Finally, note that we can write $\alpha(I+\mu M_s)^tU = V_{M_s}(I+\mu\Sigma_{M_s})^tV_{M_s}^\top U$, so that the subspace spanned by the left singular vectors of $\alpha(I+\mu M_s)^tU$ coincides with the column span of $V_{M_s}$. Since $L_t$ is the span of the top-$s$ left singular vectors of $\tilde U_t$, we apply Wedin's $\sin\Theta$ theorem (Wedin, 1972) and deduce (39c). □

The next lemma relates the quantities studied in Lemma D.1 with those that are needed in the induction. The proof is the same as (Stöger & Soltanolkotabi, 2021, Lemma 8.4), so we omit it here.

where $\|D\| \le 30\|B\|^2$. Proof: By definition of $\hat Z$ we have

$\hat Z\left(\hat Z^\top\hat Z\right)^{-\frac12} = (I+\mu M)(I+P)V_{U_tW_t}\left(V_{U_tW_t}^\top(I+P^\top)(I+\mu M)^2(I+P)V_{U_tW_t}\right)^{-\frac12} = (I+B)V_{U_tW_t}\left(V_{U_tW_t}^\top\left(I+B^\top+B+B^\top B\right)V_{U_tW_t}\right)^{-\frac12} = (I+B)V_{U_tW_t}\Big(I+\underbrace{V_{U_tW_t}^\top\left(B^\top+B+B^\top B\right)V_{U_tW_t}}_{=:\Delta}\Big)^{-\frac12}.$
It follows from ( 28) and our assumptions on c 3 and c 4 that ∥B∥ ≤ µ XX ⊤ -U t U ⊤ t + 6µ c 3 c 4 ∥X∥ + 50∥X∥ 2 δ ≤ 10µ∥X∥ 2 + 6µc 3 (c 4 + 1) ∥X∥ < 0.1 (note that this step is independent and does not rely on earlier derivations in the proof of Lemma C.6), so by Taylor's formula, we have (I + ∆) -1 2 -I + 1 2 ∆ ≤ 3∥∆∥ 2 . Hence, Ẑ( Ẑ⊤ Ẑ) -1 2 -V UtWt + BV UtWt - 1 2 (I + B)V UtWt V ⊤ UtWt B + B ⊤ V UtWt = (I + B)V UtWt (I + ∆) -1 2 -I + 1 2 ∆ - 1 2 V ⊤ UtWt B ⊤ BV UtWt ≤ (1 + ∥B∥) 3∥∆∥ 2 + 1 2 ∥B∥ 2 < 30∥B∥ 2 as desired. □
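The Taylor step used above, $\left\|(I+\Delta)^{-1/2} - \left(I - \frac12\Delta\right)\right\| \le 3\|\Delta\|^2$ for a small symmetric perturbation, is easy to sanity-check numerically. The following standalone sketch (our own, with arbitrary sizes) verifies the bound:

```python
import numpy as np

# For symmetric D with small spectral norm, (I+D)^{-1/2} ~= I - D/2
# with a second-order error bounded by 3 * ||D||^2.
rng = np.random.default_rng(5)
n = 6
S = rng.standard_normal((n, n))
D = 0.05 * (S + S.T) / 2                   # small symmetric perturbation

w, Q = np.linalg.eigh(np.eye(n) + D)       # I + D is positive definite
inv_sqrt = Q @ np.diag(w ** -0.5) @ Q.T    # (I + D)^{-1/2}

lhs = np.linalg.norm(inv_sqrt - (np.eye(n) - D / 2), 2)
rhs = 3 * np.linalg.norm(D, 2) ** 2
print(lhs <= rhs)   # True
```

Since $(1+\lambda)^{-1/2} - (1-\lambda/2) = \frac38\lambda^2 + O(\lambda^3)$ eigenvalue-wise, the constant 3 leaves ample slack for small $\|\Delta\|$.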

E PROOF OF RESULTS IN SECTION 5

Theorem 1.1 states that GD approximately learns the rank-$s$ constrained minimizer of the matrix sensing loss. However, Theorem C.1 only implies that GD gets into an $O(\delta)$ neighborhood of $X_sX_s^\top$; as a result, a more fine-grained analysis is needed. Note that it is not even clear whether the rank-$s$ minimizer of (1) is unique. If it is not, then we may naturally ask which minimizer GD converges to. In this section, we study the landscape of the under-parameterized matrix sensing problem

$f_s(U) = \frac12\left\|\mathcal{A}\left(UU^\top - XX^\top\right)\right\|_2^2, \quad U \in \mathbb{R}^{d \times s},$

and establish local convergence of gradient descent. Our key result in this section is Lemma E.6, which states a local PL condition for the matrix sensing loss. Most existing results only study the landscape of (1) in the exact- and over-parameterized cases. Zhu et al. (2021) study the landscape of the under-parameterized matrix factorization problem, but they only prove a strict saddle property, without an asymptotic convergence rate for GD. When the measurement satisfies the RIP condition, we can expect the landscape of $f_s$ to be similar to that of the (under-parameterized) matrix factorization loss

$F_s(U) = \frac12\left\|UU^\top - XX^\top\right\|_F^2, \quad U \in \mathbb{R}^{d \times s},$

for some $s < r$. Recall that $XX^\top = \sum_{i=1}^{r^*}\sigma_i^2v_iv_i^\top$. The critical points of $F_s$ are characterized by the following lemma:

Lemma E.1 $U \in \mathbb{R}^{d \times s}$ is a critical point of $F_s$ if and only if there exists an orthogonal matrix $R \in \mathbb{R}^{s \times s}$ such that all columns of $UR$ are in $\{\sigma_iv_i : 1 \le i \le r^*\}$.

Proof: Assume WLOG that $XX^\top = \operatorname{diag}(\sigma_1^2, \sigma_2^2, \dots, \sigma_r^2, 0, \dots, 0) =: \Sigma$. Let $U$ be a critical point of $F_s$; then we have $(UU^\top - XX^\top)U = 0$. Let $W = UU^\top$; then $(\Sigma - W)W = 0$. Since $W$ is symmetric, so is $W^2$, and we obtain that $\Sigma W$ is also symmetric. It is then easy to see that if $\Sigma = \operatorname{diag}(\lambda_1I_{m_1}, \dots, \lambda_tI_{m_t})$ with $\lambda_1 > \lambda_2 > \dots > \lambda_t \ge 0$, then $W$ is also in block-diagonal form: $W = \operatorname{diag}(W_1, W_2, \dots, W_t)$ where $W_i \in \mathbb{R}^{m_i \times m_i}$.
For each 1 ≤ i ≤ t, we then have the equation (λ i I mi -W i ) W i = 0. Hence, there exists an orthogonal matrix R i such that R ⊤ i W i R i is a diagonal matrix where the diagonal entries are either 0 or √ λ i = σ i . Let so that ⟨∇F s (U ), U -X s ⟩ = (H + X s )(H + X s ) ⊤ -XX ⊤ (H + X s ), H = tr H ⊤ (H + X s )(H + X s ) ⊤ -XX ⊤ H + H ⊤ HH ⊤ + HX ⊤ s + X s H ⊤ X s ≥ -tr H ⊤ X s,⊥ X ⊤ s,⊥ H -3∥X∥∥H∥ 3 F + tr H ⊤ HX ⊤ s X s (43a) ≥ σ 2 s -σ 2 s+1 ∥H∥ 2 F -3∥X∥∥H∥ 3 F ≥ 0.1τ ∥H∥ 2 F (43b) where in (43a) we use tr (H ⊤ X s ) 2 ≥ 0 (since H ⊤ X s is symmetric as noticed in the beginning of the proof), and (43b) is because of tr H ⊤ HX ⊤ s X s = tr H ⊤ HW Xs Σ 2 Xs W ⊤ Xs = tr W ⊤ Xs H ⊤ HW Xs Σ 2 Xs ≥ σ 2 s tr W ⊤ Xs H ⊤ HW Xs = σ 2 s ∥H∥ 2 F and tr H ⊤ X s,⊥ X ⊤ s,⊥ H = tr H ⊤ V Xs,⊥ Σ s,⊥ V ⊤ Xs,⊥ H ≤ ∥Σ s,⊥ ∥ • H ⊤ V Xs,⊥ 2 F ≤ σ 2 s+1 ∥H∥ 2 F . □ Corollary E.1 Under the conditions of Lemma E.4, we have ∥∇F s (U )∥ F ≥ 0.1τ dist(U , X s ). The following lemma shows that the rank-s global minima of matrix sensing must lie in an O(δ)neighbourhood of the minima of F s . Lemma E.5 Suppose that Assumption 3.1 holds. Let U * s be a global minimizer of f s , then we have dist(U * s , X s ) ≤ 40δκ∥X∥ F . where we recall that κ = τ -1 ∥X∥ is the condition number of XX ⊤ . Proof : Define S = U ∈ R d×s : dist(U , X s ) ≤ 0.1κ -1 ∥X∥ . First we can show that U * s ∈ S. The main idea is to apply Lemma A.5. Indeed, it's easy to see that lim ∥U ∥ F →+∞ F s (U ) = +∞. Let U ∈ ∂S i.e. dist 2 (U , X s ) = 0.1∥X∥ -1 τ . Assume WLOG that dist(U , X s ) = ∥U -X s ∥ F , then by Lemma E.4 we have F s (U ) -F s (X s ) = 1 0 t ⟨∇F s (tU + (1 -t)X s ), U -X s ⟩ dt ≥ 1 0 0.1τ t 2 ∥U -X s ∥ 2 F dt ≥ 10 -3 ∥X∥ -2 τ 3 . Recall that all the stationary points of F s are characterized in Lemma E.1, so that for all U / ∈ S with ∇F s (U ) = 0, we have F s (U ) -F * s ≥ 0.5 σ 4 s -σ 4 s+1 ≥ 0.5τ 2 . 
On the other hand, we know from Lemma E.3 that F s (U * s ) -F * s ≤ 5δr∥X∥ 4 < 0.5τ 2 , so Lemma A.5 implies that U * s ∈ S. Inside S, we can apply the local PL property that we previously derived. Indeed, note that ∥∇F s (U * s )∥ F = XX ⊤ -U * s (U * s ) ⊤ U * s F = (A * A -I) XX ⊤ -U * s (U * s ) ⊤ U * s F ≤ δ XX ⊤ -U * s (U * s ) ⊤ F ∥U * s ∥ ≤ 4δ∥X∥ • XX ⊤ F . Hence we have that dist(U , X s ) ≤ 40δτ -1 ∥X∥ 2 ∥X∥ F = 40δκ∥X∥ F . □ Corollary E.2 Suppose that Assumption 3.1 holds, then we have σ min (U * s ) ⊤ U * s ≥ σ 2 s - 80δκ∥X∥∥X∥ F . Proof : We assume WLOG that ∥U * s -X s ∥ F = dist(U * s , X s ) i.e. R = I in Definition 5.1. By Lemma E.5, we have that (U * s ) ⊤ U * s -X ⊤ s X s ≤ (U * s ) ⊤ U * s -X ⊤ s X s ≤ max {∥U * s ∥ , ∥X s ∥} • ∥U * s -X s ∥ ≤ 80δκ∥X∥∥X∥ F . □ Lemma E.6 Suppose that Assumption 3.1 holds. Given U ∈ R d×s , let U * s ∈ R d×s be a minimizer of f s , and U * s R be the rank-s minimizer which is closest to U (R ∈ R s×s is orthogonal). When ∥U -U * s R∥ ≤ 10 -2 κ -1 ∥X∥, we have ⟨∇f s (U ), U -U * s R⟩ ≥ 0.1τ dist(U , U * s ) 2 . Proof : We assume WLOG that R = I, then U ⊤ U * s is positive semi-definite. Let H = U -U * s , then ∇f s (U ) = (A * A) (U U ⊤ -XX ⊤ )U = (A * A) (H + U * s )(H + U * s ) ⊤ -XX ⊤ (H + U * s ) = (A * A) HH ⊤ + U * s H ⊤ + H (U * s ) ⊤ (H + U * s ) -A * A XX ⊤ -U * s (U * s ) ⊤ H where we use the first-order optimality condition A * A XX ⊤ -U * s (U * s ) ⊤ U * s = 0. 
Since ∥U * s ∥ ≤ 2∥X∥ by Lemma E.5, we may thus deduce that ∇f s (U ) -HH ⊤ + U * s H ⊤ + H (U * s ) ⊤ (H + U * s ) -XX ⊤ -U * s (U * s ) ⊤ H F ≤ (A * A -I) HH ⊤ + U * s H ⊤ + H (U * s ) ⊤ (H + U * s ) F + (A * A -I) XX ⊤ -U * s (U * s ) ⊤ H ≤ 50δ∥X∥ 2 ∥H∥ F Hence ⟨∇f s (U ), U -U * s ⟩ ≥ HH ⊤ + U * s H ⊤ + H (U * s ) ⊤ (H + U * s ) -XX ⊤ -U * s (U * s ) ⊤ H, H -50δ∥X∥ 2 ∥H∥ 2 F ≥ tr H(H + U * s ) ⊤ (H + U * s )H ⊤ + H ⊤ U * s H ⊤ H + (U * s ) ⊤ H 2 -H ⊤ XX ⊤ -U * s (U * s ) ⊤ H -50δ∥X∥ 2 ∥H∥ 2 F ≥ σ min (U * s ) ⊤ U * s -XX ⊤ -U * s (U * s ) ⊤ -50δ∥X∥ 2 -3∥U * s ∥∥H∥ -∥H∥ 2 ∥H∥ 2 F . By Corollary E.2 we have σ min (U * s ) ⊤ U * s ≥ σ 2 s -80δκ∥X∥∥X∥ F and XX ⊤ -U * s (U * s ) ⊤ ≤ σ 2 s+1 + 80δκ∥X∥ 2 F , so that ⟨∇f s (U ), U -U * s ⟩ ≥ σ 2 s -σ 2 s+1 -160δκ∥X∥∥X∥ F -50δ∥X∥ 2 -3∥U * s ∥∥H∥ -∥H∥ 2 ∥H∥ 2 F . When δ ≤ 10 -3 r -1 2 * κ -2 and ∥H∥ ≤ 10 -2 τ ∥X∥ -1 , the above implies that ⟨∇f s (U ), U -U * s ⟩ ≥ 0.5τ ∥H∥ 2 F , as desired. □ F PROOF OF THEOREM 4.1 With the landscape results in hand, we are now ready to characterize the saddle-to-saddle dynamics of GD. We first note the following proposition, with is straightforward from Lemma C.9 and Theorem C.1. In the following we use U α,t to denote the t-th iteration of GD when initialized at U 0 = αU . Proposition F.1 There exists matrices Ūα,t , -T s α ≤ t ≤ 0 with rank ≤ s such that max . We choose Ūt = U T s α +t W T s α +t W ⊤ T s α +t , then rank Ūt ≤ s and moreover by Theorem C.1 we have X s X ⊤ s -Ū0 Ū ⊤ 0 F ≤ 2κ 2 ∥X∥ 2 √ r * δ. On the other hand, similar to Corollary E.2 we have that Z s -X s X ⊤ s F ≤ 80δκ∥X∥ 2 F . Thus Ū0 Ū ⊤ 0 -Z s ≤ 100κ 2 ∥X∥ 2 r * δ as desired. □ Let Ûα,t = U α,t W α,t ∈ R d×s , then it satisfies Û0 Û ⊤ 0 = Ū0 Ū ⊤ 0 . The following corollary shows that Ûα,0 is close to U * s in terms of the procrutes distance. Corollary F.1 We have dist( Û0 , U * s ) ≤ 80κ 3 r 1 2 * ∥X∥δ. Proof : We know from Lemma E.5 that dist(U * s , X s ) ≤ 40δκ∥X∥ F , so it remains to bound dist( Û0 , X s ). 
The proof idea is the same as that of Lemma E.5, so we only provide a proof sketch here. It has been shown in the proof of Proposition F.1 that

$F_s(\hat U_0) := \frac12\left\|X_sX_s^\top - \hat U_0\hat U_0^\top\right\|_F^2 \le r^*\left\|X_sX_s^\top - \hat U_0\hat U_0^\top\right\|^2 \le 4\kappa^4r^*\|X\|^4\delta^2 \le 0.5\tau^2.$

Note that $F_s$ is the matrix factorization loss with $X_sX_s^\top$ as the ground truth, so the local PL property (cf. Lemma E.4) still holds here, and by the same reasoning as (44), we deduce that $\operatorname{dist}(\hat U_0, X_s) \le 0.1\|X\|^{-1}\tau$, i.e., $\hat U_0$ is in the local PL region around $X_s$. Finally, it follows from the PL property that

$\operatorname{dist}(\hat U_0, X_s) \le 10\tau^{-1}\left\|\nabla F_s(\hat U_0)\right\|_F \le 10\tau^{-1}\|\hat U_0\|\left\|X_sX_s^\top - \hat U_0\hat U_0^\top\right\|_F \le 40\kappa^3r_*^{1/2}\|X\|\delta.$

The conclusion follows. □
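For readers who want to experiment numerically, the optimal rotation in the Procrustes distance used throughout this section has a closed form via the SVD (cf. Lemma E.2 and Tu et al., 2016). Below is a minimal sketch; the helper name `procrustes_dist` is ours, not from the paper.

```python
import numpy as np

def procrustes_dist(U, V):
    """Procrustes distance min_R ||U - V R||_F over orthogonal R.

    The optimal rotation is R = P Q^T, where V^T U = P diag(s) Q^T
    is a singular value decomposition (see e.g. Tu et al., 2016).
    """
    P, _, Qt = np.linalg.svd(V.T @ U)
    R = P @ Qt                       # optimal orthogonal matrix
    return np.linalg.norm(U - V @ R)

# sanity check: U and U O are at distance ~0 for any orthogonal O
rng = np.random.default_rng(0)
U = rng.standard_normal((8, 3))
O, _ = np.linalg.qr(rng.standard_normal((3, 3)))
print(procrustes_dist(U @ O, U))     # ~0
```

Note that minimizing over rotations is what makes the distance well defined despite the $U \mapsto UR$ symmetry of the factorized loss.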

Recall that the matrix sensing loss satisfies a local PL property (cf. Lemma E.6). As a result, when $\delta$ is small, we can show that GD initialized at $\hat U_0$ converges linearly to the ground truth.

Lemma F.1 Let $\hat U_t$ be the $t$-th iterate of GD initialized at $\hat U_0$. Suppose that $\delta \le 10^{-2} r_*^{-\frac12}\kappa^{-3}$ and $\mu \le 10^{-3}\|X\|^{-2}$; then we have $\operatorname{dist}^2(\hat U_t, U_s^*) \le (1 - 0.05\tau\mu)^t \operatorname{dist}^2(\hat U_0, U_s^*)$.

Proof: We know from Corollary F.1 that $\|\hat U_0\|_F \le \|X\|_F + 40\kappa^3 r_*^{1/2}\|X\|\delta$. We will prove the following claim, which immediately implies Lemma F.1: if $\operatorname{dist}(\hat U_t, U_s^*) \le \operatorname{dist}(\hat U_0, U_s^*)$, then
$$\operatorname{dist}^2(\hat U_{t+1}, U_s^*) \le (1 - 0.05\tau\mu)\operatorname{dist}^2(\hat U_t, U_s^*). \tag{46}$$
Let $R$ be the orthogonal matrix satisfying $\|\hat U_t - U_s^* R\|_F = \operatorname{dist}(\hat U_t, U_s^*)$, and assume WLOG that $R = I$. We first bound the gradient $\nabla f(\hat U_t)$ as follows:
$$\left\|\nabla f(\hat U_t)\right\|_F = \left\|\mathcal{A}^*\mathcal{A}\left(XX^\top - \hat U_t\hat U_t^\top\right)\hat U_t\right\|_F \le \left\|\mathcal{A}^*\mathcal{A}\left(XX^\top - \hat U_t\hat U_t^\top\right)\left(\hat U_t - U_s^*\right)\right\|_F + \left\|\mathcal{A}^*\mathcal{A}\left(\hat U_t\hat U_t^\top - U_s^*(U_s^*)^\top\right)U_s^*\right\|_F \le 20\|X\|^2\left\|\hat U_t - U_s^*\right\|_F \tag{47}$$
where we use $\|\hat U_t\| \le \|X\| + 40\kappa^3 r_*^{1/2}\|X\|\delta \le 2\|X\|$ and the RIP property. It follows that
$$\begin{aligned}
\operatorname{dist}^2(\hat U_{t+1}, U_s^*) &\le \left\|\hat U_{t+1} - U_s^* R\right\|_F^2 \quad (48\text{a})\\
&= \left\|\hat U_t - \mu\nabla f(\hat U_t) - U_s^* R\right\|_F^2 \\
&= \left\|\hat U_t - U_s^* R\right\|_F^2 - 2\mu\left\langle \nabla f(\hat U_t), \hat U_t - U_s^* R\right\rangle + \mu^2\left\|\nabla f(\hat U_t)\right\|_F^2 \\
&\le \left(1 - 0.1\tau\mu + 400\|X\|^4\mu^2\right)\left\|\hat U_t - U_s^* R\right\|_F^2 \quad (48\text{b})
\end{aligned}$$
where (48a) follows from the definition of dist, and (48b) is due to Lemma E.6 and (47). Finally, (46) follows from the condition $\mu \le 10^{-3}\kappa^{-2}$. □

We are now ready to complete the proof of Theorem 4.1.

Theorem F.1 (Restatement of Theorem 4.1) Under Assumptions 3.1 and 3.2, consider GD (3) with learning rate $\mu \le \frac{1}{10^3\|Z^*\|}$ and initialization $U_{\alpha,0} = \alpha\bar U$ for solving the matrix sensing problem (1). There exist a universal constant $c > 0$, a constant $C$ (depending on $r$ and $\kappa$) and a sequence of time points $T_\alpha^1 < T_\alpha^2 < \cdots < T_\alpha^{r\wedge r_*}$ such that for all $1 \le s \le r\wedge r_*$, the following holds when $\alpha = O\left((\rho r_*)^{-c\kappa}\right)$:
$$\left\|U_{\alpha,T_\alpha^s}U_{\alpha,T_\alpha^s}^\top - Z_s^*\right\|_F \le C\alpha^{\frac{1}{10\kappa}},$$
where we recall that $Z_s^*$ is the best rank-$s$ solution defined in Definition 1.1.
Moreover, GD follows an incremental learning procedure: for all $1 \le s \le r\wedge r_*$, we have $\lim_{\alpha\to 0}\max_{1\le t\le T_\alpha^s}\sigma_{s+1}(U_{\alpha,t}) = 0$, where $\sigma_{s+1}(A)$ denotes the $(s+1)$-th largest singular value of a matrix $A$.

Proof: Recall that $\left\|U_{T_\alpha^s} - \bar U_0\right\|_F = o(1)$ as $\alpha \to 0$, where $T_\alpha^s$ is defined in Proposition F.1; we omit the dependence on $\alpha$ to simplify notation. We also note that by the update rule of GD, we have $\bar U_t \bar U_t^\top = \hat U_t \hat U_t^\top$ for all $t \ge 0$. By Lemma F.1, we have $\operatorname{dist}^2(\hat U_t, U_s^*) \le (1 - 0.05\tau\mu)^t \operatorname{dist}^2(\hat U_0, U_s^*)$ and, in particular, $\|\hat U_t\| \le 2\|X\|$ for all $t$; thus $\|\bar U_t\| \le 2\|X\|$ as well. Moreover, recall that $\|U_t\| \le 3\|X\|$ for all $t$. It is easy to see that the matrix sensing loss $f$ is $L$-smooth over $\{U \in \mathbb{R}^{d\times r} : \|U\| \le 3\|X\|\}$ for some constant $L = O(\|X\|^2)$, so it follows from Lemma A.6 that $\left\|U_{T_\alpha^s+t} - \bar U_t\right\|_F \le (1 + \mu L)^t \left\|U_{T_\alpha^s} - \bar U_0\right\|_F$. On the other hand, $\operatorname{dist}^2(\hat U_t, U_s^*) \le (1 - 0.05\tau\mu)^t \operatorname{dist}^2(\hat U_0, U_s^*)$. Recall that $\operatorname{rank}(\bar U_t) \le s$, so that $\max_{0\le t\le T_\alpha^s}\sigma_{s+1}(U_t) = o(1)$. Finally, for all $0 \le s < r\wedge r_*$, we need to show that $T_\alpha^s < T_\alpha^{s+1}$
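Lemma F.1's local linear convergence is easy to visualize numerically. The sketch below is our illustration, not the paper's code: it runs GD on the full-observation loss (the $\delta = 0$ idealization of matrix sensing) from a small perturbation of the minimizer, and uses a larger step size than the conservative constant $10^{-3}\|X\|^{-2}$ in the lemma, purely for speed.

```python
import numpy as np

rng = np.random.default_rng(1)
d, s = 12, 2

# ground truth X_s X_s^T with X_s in R^{d x s} (delta = 0, i.e. full
# observations, as an idealization of small-RIP matrix sensing)
Xs = rng.standard_normal((d, s))
Z = Xs @ Xs.T

def dist(U, V):
    # Procrustes distance min_R ||U - V R||_F via the SVD of V^T U
    P, _, Qt = np.linalg.svd(V.T @ U)
    return np.linalg.norm(U - V @ (P @ Qt))

# initialize in a small neighborhood of the minimizer, as in Lemma F.1
U = Xs + 0.01 * rng.standard_normal((d, s))
mu = 0.05 / np.linalg.norm(Z, 2)   # larger than the lemma's constant, for speed

errs = []
for t in range(2000):
    errs.append(dist(U, Xs))
    U = U - mu * (U @ U.T - Z) @ U  # gradient step on 1/4 ||UU^T - Z||_F^2

print(errs[0], errs[-1])            # geometric decay of the distance
```

The observed decay is geometric, matching the $(1 - 0.05\tau\mu)^t$ contraction in the lemma (with different constants, since the step size here is larger).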



Figure 1: The evolution of the relative error against the best solution of different ranks over time. Panels: (a) α = 1, 1000 measurements; (b) α = 0.1, 1000 measurements; (c) α = 0.01, 1000 measurements; (d) α = 0.001, 1000 measurements; (e) α = 0.001, 2000 measurements; (f) α = 0.001, 5000 measurements.
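The qualitative behavior in Figure 1 can be reproduced with a short simulation. The sketch below is our illustrative reconstruction, not the paper's experiment code: it runs GD on the full-observation ($\delta = 0$) loss from a small initialization scale $\alpha$ and tracks the singular values of the iterate, which grow one at a time in the order of the ground-truth spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha, mu = 20, 5, 1e-3, 0.01

# rank-2 ground truth with sigma_1 = 2, sigma_2 = 1
V = np.linalg.qr(rng.standard_normal((d, 2)))[0]
Z = V @ np.diag([4.0, 1.0]) @ V.T           # X X^T, eigenvalues (4, 1)

U = alpha * rng.standard_normal((d, r))     # small initialization
sv_hist = []
for t in range(3000):
    sv_hist.append(np.linalg.svd(U, compute_uv=False))
    U = U - mu * (U @ U.T - Z) @ U          # GD on 1/4 ||UU^T - Z||_F^2

early, late = sv_hist[300], sv_hist[-1]
print("t=300: ", early[:2])   # first direction learned, second still tiny
print("t=2999:", late[:2])    # both directions recovered, ~ (2, 1)
rel_err = np.linalg.norm(U @ U.T - Z) / np.linalg.norm(Z)
print("final relative error:", rel_err)
```

The intermediate plateau at an approximately rank-1 iterate, followed by the growth of the second singular value, is exactly the incremental learning behavior that Theorem 4.1 formalizes.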

Figure 2: The evolution of the loss and the relative error against the best solution of different ranks in the exact-parameterized case r = 5. Panels: (a) α = 1, 1000 measurements; (b) α = 0.1, 1000 measurements; (c) α = 0.01, 1000 measurements; (d) α = 0.001, 1000 measurements.

(27) where $B = (I + \mu M_t)(I + P) - I$ and $\|D\| \le 100\|B\|^2$. By assumption we have

where $T_\alpha^s$ is defined in Theorem C.1 (we omit the dependence on $\alpha$ there), and moreover $\left\|\bar U_0 \bar U_0^\top - Z_s\right\| \le 100\kappa^2\|X\|^2 r_*\delta$, where
$$Z_s = U_s^*(U_s^*)^\top \text{ is the rank-}s\text{ minimizer of the matrix sensing loss over } \left\{Z \in \mathbb{R}^{d\times d} : Z \text{ is positive semi-definite and } \operatorname{rank}(Z) \le s\right\}. \tag{45}$$

Remark F.1 We omit the dependence on $\alpha$ for simplicity of notation when it is clear from context.

Proof: It follows from Lemma C.9 that $\max_{1\le t\le T_\alpha^s}\left\|U_t W_{t,\perp}\right\| = O\left(\alpha^{\frac{1}{4\kappa}}\right)$

$c_1$ is a universal constant. Let $T_\alpha^s = \tilde T_\alpha^s + t_\alpha^s$; then $\left\|U_{T_\alpha^s} U_{T_\alpha^s}^\top - Z_s\right\|_F = o(1)$ holds.

$(1 - 0.05\tau\mu)^t \operatorname{dist}^2(\hat U_0, U_s^*)$, we can deduce that $U_{T_\alpha^s + t} U_{T_\alpha^s + t}^\top$


when $\alpha \le \operatorname{poly}(r_*)^{-1}$, so that (13) holds. As a result, we have

In the following we estimate $\sigma_{\min}\left(V_{X_s}^\top U_{t+1} W_t\right)$. We state the main result of this section in the lemma below.

Proof: The update rule of GD implies that

where (15a) follows from

We now relate the last three terms in (15c) to $V_{X_s}^\top U_t W_t$. Since $V_{X_s}^\top U_t W_t$ is assumed to be invertible, so are $V_{X_s}^\top V_{U_t W_t}$, $\Sigma_{U_t W_t}$ and $W_{U_t W_t}$; thus we have

Plugging (16) into the second and third terms of (15) and re-arranging, we deduce that

By assumption we have

and by our assumption we have

Moreover, note that $\|\Sigma_s\|^2 = \|X\|^2$, and when $\mu$ is small enough, equation (18) implies that

Recall that $P_1$ and $P_2$ are bounded in (19) and (20) respectively, so we have that

The conclusion follows. □

The corollaries below immediately follow from Lemma C.4.

C.2.2 THE NOISE TERM

In this section we turn to analyze the noise term. The main result of this section is presented in the following.

The latter can be decomposed as follows:

In the following, we are going to show that the term (a) is bounded by $c \cdot \mu$ where $c$ is a small constant, while (b) grows linearly at a slow rate.

Bounding summand (a). Since

the summand (a) can be rewritten as follows:

where

By Lemma D.4 we have $\left\|V_{X_s,\perp}^\top V_{U_{t+1}W_t}\right\| \le 0.01$, which implies that

Lastly, we bound $M_1$ as follows:

where the second inequality follows from our assumption on

Combining (23), (24) and (25) yields

Bounding summand (b). This is the main component of the error term. We will see that although this term can grow exponentially fast, its growth rate is slower than that of the minimal eigenvalue of the parallel component. We have

In (26), (26a) follows from the update rule of GD, (26b) is obtained from

and lastly in (26d) we use

To summarize, we have

To bound the growth rate of the orthogonal component, we need to show that the quantity $\left\|V_{X_s,\perp}^\top V_{U_t W_t}\right\|$ remains small. The following lemma serves to complete an induction step from $t$ to $t+1$:

(where $\kappa$ is the condition number defined in Section 3.1) and $\mu \le 10^{-4}\kappa^{-1}\|X\|^{-2} c_3$; then we have

where we used the definition of $c_5$ in the last step. As a result, we can apply the conclusions of Lemmas C.4 to C.6, which implies that (32) holds for $t+1$. Finally, combining Lemma C.8 and (32b) yields $T_\alpha^s = \omega(1)$. □

We now present the main result of this section:

Lemma C.10 Suppose that $0 \le t - T_\alpha^s \le t_\alpha^s$, $\left\|V_{X_s,\perp}^\top V_{U_t W_t}\right\| \le c_3$, and the conditions on $c_3$, $c_4$, $\delta$ and $\mu$ in Lemma C.7 hold; then we have

where we recall that $\tau = \min_{1\le s\le r\wedge r_*}\left(\sigma_s^2 - \sigma_{s+1}^2\right) > 0$.

Proof: The update rule of GD implies that

Note that we would like to bound

We deal with the above three terms separately.
For the first term, we have

where in (35a) we use

Combining the above two lemmas, we directly obtain the following corollary:

Corollary D.1 Suppose that $\alpha\sigma_s(N_t) > 10\left(\alpha\sigma_{s+1}(N_t) + \|E_t\|\right)$; then we have that

D.2 THE PARALLEL IMPROVEMENT PHASE

In this section, we provide auxiliary results for the analysis of the parallel improvement phase.

Lemma D.3 (Stöger & Soltanolkotabi, 2021, Lemma 9.4) For sufficiently small $\mu$ and $\delta$, suppose that $\|U_t\| \le 3\|X\|$; then we also have $\|U_{t+1}\| \le 3\|X\|$.

Lemma D.4 Under the assumptions in Lemma C.5, we have

Proof: The proof of this lemma is essentially the same as that of (Stöger & Soltanolkotabi, 2021, Lemma B.1), and we omit it here. □

Lemma D.5 Under the assumptions in Lemma C.6, we have

Proof: We have

where the last step follows from

The conclusion follows. □
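Lemma D.3's invariant can be sanity-checked numerically in the full-observation ($\delta = 0$) surrogate: for a small step size, one GD step maps the ball $\{\|U\| \le 3\|X\|\}$ into itself. The script below is an illustration of the statement, not its RIP-based proof, and all names in it are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 10, 4
X = rng.standard_normal((d, 3))
Z = X @ X.T
normX = np.linalg.norm(X, 2)              # spectral norm ||X||
mu = 1e-3 / normX**2                      # small step size, as in the lemma

for trial in range(200):
    U = rng.standard_normal((d, r))
    # rescale so that ||U_t|| lies anywhere in (0, 3||X||]
    U *= rng.uniform(0, 1) * 3 * normX / np.linalg.norm(U, 2)
    U_next = U - mu * (U @ U.T - Z) @ U   # one GD step (delta = 0 surrogate)
    assert np.linalg.norm(U_next, 2) <= 3 * normX + 1e-9
print("invariant ||U_{t+1}|| <= 3||X|| held in all trials")
```

Intuitively, on the boundary of the ball the cubic term $-\mu UU^\top U$ contracts the largest singular values faster than the $\mu Z U$ term can expand them, which is why the ball is forward-invariant for small $\mu$.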

Lemma D.6 Under the assumptions in Lemma C.6, we have

Proof: The proof roughly follows (Stöger & Soltanolkotabi, 2021, Lemma B.3), but we include it here for completeness.

where (42a) follows from

Plugging into (42), we deduce that

where in the last step we use Lemma D.5 and the induction hypothesis, which implies that

The matrix $\hat Z$ defined in the proof of Lemma C.6 satisfies the following: $R$ is diagonal and its nonzero diagonal entries form an $s$-subset of the multi-set $\{\sigma_i : 1 \le i \le r_*\}$. The conclusion follows. □

In the case of $s = 1$, the global minimizers of $F_s$ are $\pm\sigma_1 v_1$, and we can show that $F_s$ is locally strongly convex around these minimizers. Therefore, we can deduce that $f$ is locally strongly convex as well. Since our main focus is on $s > 1$, we defer these details to Appendix G. When $s > 1$, $F_s(U)$ is not locally strongly convex due to rotational invariance: if $U$ is a global minimizer, then so is $UR$ for any orthogonal matrix $R \in \mathbb{R}^{s\times s}$. Instead, we will establish a local PL property with respect to the Procrustes distance.

The following characterization of the optimal $R$ in Definition 5.1 is known in the literature (see e.g. (Tu et al., 2016, Section 5.2.1)), but we provide a proof of it for completeness.

Lemma E.2 Let $R$ be the orthogonal matrix which minimizes

where the final step is due to the orthogonality of $B^\top R^\top A \in \mathbb{R}^{s\times s}$, and equality holds if and only if

The following lemma states that the minimizer of the matrix sensing loss is also near-optimal for the matrix factorization loss. The main idea for proving this result is to utilize

Lemma E.3 Let $Z_s^*$ be a best rank-$s$ solution as defined in (1); then we have

Proof: Since $U_s^*$ is the global minimizer of $f$, by the RIP property we have

The next lemma shows that $F_s$ satisfies a local PL property:

Lemma E.4 Given $U \in \mathbb{R}^{d\times s}$, let $R$ be an orthogonal matrix such that $\|U -$
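For completeness, here is a worked version of the trace argument behind Lemma E.2, reconstructed from the standard references (e.g. Tu et al., 2016): with the SVD $B^\top A = P\Sigma Q^\top$,

```latex
\begin{aligned}
\|A - BR\|_F^2 &= \|A\|_F^2 + \|B\|_F^2 - 2\operatorname{tr}\!\left(R^\top B^\top A\right),\\
\operatorname{tr}\!\left(R^\top B^\top A\right)
  &= \operatorname{tr}\!\left(R^\top P \Sigma Q^\top\right)
   = \operatorname{tr}\!\left(Q^\top R^\top P\, \Sigma\right)
   \le \sum_{i=1}^{s} \sigma_i\!\left(B^\top A\right),
\end{aligned}
```

where the inequality holds because $Q^\top R^\top P$ is orthogonal, so its diagonal entries are at most $1$ in absolute value. Equality is attained at $R = PQ^\top$, since then $Q^\top R^\top P = I$; this $R$ is therefore the optimal rotation in the Procrustes distance.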
Indeed, by Corollary E.2 and Assumption 3.1 we have

In this section, we establish a local strong-convexity result (Lemma G.2) for rank-1 parameterized matrix sensing. This result is stronger than the PL condition we established for general ranks, though the latter is sufficient for our analysis.

Lemma G.1 Define the full-observation loss with rank-1 parameterization
$$g_1(u) = \frac14\left\|uu^\top - XX^\top\right\|_F^2.$$
Then the global minima of $g_1$ are $u^* = \sigma_1 v_1$ and $-u^*$. Moreover, suppose that $g_1(u) - g_1(u^*) \le 0.5\tau_1$, where $\tau_1 = \sigma_1^2 - \sigma_2^2$; then

where equality holds if and only if $u_2 = \cdots = u_d = 0$ and $\|u\|^2 = \sigma_1^2$, i.e., $u = \pm\sigma_1 e_1$. Moreover, suppose that $g_1(u) - g_1(u^*) \le 0.5\tau_1$; then it follows from (50b) that $\tau_1 \sum_{i=2}^d u_i^2 \le 2\left(g_1(u) - g_1(u^*)\right)$, which implies that $\sum_{i=2}^d u_i^2 \le 2\tau_1^{-1}\left(g_1(u) - g_1(u^*)\right)$. Also, (50c) yields $\left|\|u\|^2 - \sigma_1^2\right| \le 4\sqrt{g_1(u) - g_1(u^*)}$. Assume WLOG that $u_1 > 0$; then we have

Lemma G.2 Suppose that $\delta \le 10^{-3}\|X\|^{-2}\tau_1$. Then there exist constants $a_1$ and $\iota$ such that $f_1$ is locally $\iota$-strongly convex in $B_1 = B(\sigma_1 v_1, a_1) \subset \mathbb{R}^d$. Furthermore, there is a unique global minimizer of $f_1$ inside $B_1$.

Proof: Recall that we defined the full-observation loss $g_1(u) = \frac14\left\|uu^\top - XX^\top\right\|_F^2$. Let $h_1 = f_1 - g_1$; then

When $\|u - \sigma_1 v_1\|^2 \le 0.1\min\{\sigma_1^2, \tau_1\}$ (recall $\tau_1 = \sigma_1^2 - \sigma_2^2$),
$$\sigma_{\min}\left(\nabla^2 g_1(u)\right) = \frac12\sigma_{\min}\left(\|u\|^2 I + 2uu^\top - XX^\top\right) \ge 0.4\tau_1.$$
Hence we have

i.e., strong convexity holds with $a_1^2 = 0.1\min\{\sigma_1^2, \tau_1\}$ and $\iota = 0.2\tau_1$. Let $u^*$ be a global minimizer of $f_1$; then we must have $\|u^*\| \le 2\|X\|$ (otherwise $f_1(u^*) > f_1(0)$). We can thus deduce that

It follows from Lemma G.1 and our assumption on $\delta$ that $\min\left\{\|u^* - \sigma_1 v_1\|^2, \|u^* + \sigma_1 v_1\|^2\right\} \le \frac12 a_1^2$. Moreover, by strong convexity, there exists only one global minimizer in $B_1$, which concludes the proof. □
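The eigenvalue claim in the proof above admits a quick numerical sanity check. Note a caveat on scaling: with the $\frac14$ normalization of $g_1$, our own derivation gives the Hessian as $\|u\|^2 I + 2uu^\top - XX^\top$ (the text's expression may differ by a constant-factor convention), and at $u = \sigma_1 v_1$ its smallest eigenvalue equals exactly $\tau_1 = \sigma_1^2 - \sigma_2^2$. The sketch below verifies this and checks positive-definiteness slightly away from the minimizer; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
sigmas = np.array([2.0, 1.0, 0.5, 0.3, 0.2, 0.1])
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = V @ np.diag(sigmas**2) @ V.T        # X X^T with singular values sigmas
tau1 = sigmas[0]**2 - sigmas[1]**2      # eigengap tau_1 = sigma_1^2 - sigma_2^2

def hess_g1(u):
    # Hessian of g_1(u) = 1/4 ||u u^T - X X^T||_F^2 under our 1/4 convention
    return np.dot(u, u) * np.eye(d) + 2 * np.outer(u, u) - A

u_star = sigmas[0] * V[:, 0]            # global minimizer sigma_1 v_1
lam = np.linalg.eigvalsh(hess_g1(u_star))
print(lam.min())                        # ~tau_1 (= 3.0 for these sigmas)

# positive definite (locally strongly convex) slightly away from u_star too
u = u_star + 0.05 * rng.standard_normal(d)
assert np.linalg.eigvalsh(hess_g1(u)).min() > 0.4 * tau1
```

The minimal eigenvalue at $u^*$ is attained along $v_2$, where the Hessian acts as $\sigma_1^2 - \sigma_2^2$; along $v_1$ it acts as $2\sigma_1^2$, which is why the landscape is well conditioned only when the eigengap $\tau_1$ is bounded away from zero.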

