UNDERSTANDING INCREMENTAL LEARNING OF GRADIENT DESCENT: A FINE-GRAINED ANALYSIS OF MATRIX SENSING

Abstract

The implicit bias of optimization algorithms such as gradient descent (GD) is believed to play an important role in the generalization of modern machine learning methods such as deep learning. This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem, whose goal is to recover a low-rank ground-truth matrix from near-isotropic linear measurements. With small initialization, we show that GD behaves similarly to the greedy low-rank learning heuristics (Li et al., 2020) and follows an incremental learning procedure (Gissin et al., 2019). That is, GD sequentially learns solutions with increasing ranks until it recovers the ground-truth matrix. Compared to existing works, which only analyze the first learning phase for rank-1 solutions, our result is stronger because it characterizes the whole learning process. Moreover, our analysis of the incremental learning procedure applies to the under-parameterized regime as well. As a key ingredient of our analysis, we observe that GD always follows an approximately low-rank trajectory, and we develop novel landscape properties for matrix sensing with low-rank parameterization. Finally, we conduct numerical experiments which confirm our theoretical findings.

1. INTRODUCTION

Understanding the optimization and generalization properties of optimization algorithms is one of the central topics in deep learning theory (Zhang et al., 2021; Sun, 2019). It has long been a mystery why simple algorithms such as Gradient Descent (GD) or Stochastic Gradient Descent (SGD) can find global minima even for highly non-convex functions (Du et al., 2019), and why the global minima so found generalize well (Hardt et al., 2016). One influential line of works provides theoretical analysis of the implicit bias of GD/SGD. These results typically exhibit theoretical settings where the low-loss solutions found by GD/SGD attain certain optimality conditions of a particular generalization metric, e.g., the parameter norm (or the classifier margin) (Soudry et al., 2018; Gunasekar et al., 2018; Nacson et al., 2019; Lyu & Li, 2020; Ji & Telgarsky, 2020) or the sharpness of the local loss landscape (Blanc et al., 2020; Damian et al., 2021; Li et al., 2022; Lyu et al., 2022). Among these works, a line of works seeks to characterize the implicit bias even when training is away from convergence. Kalimeris et al. (2019) empirically observed that SGD learns models of increasing complexity, from simple ones, such as linear classifiers, to more complex ones. This behavior, usually referred to as the simplicity bias/incremental learning of GD/SGD, can help prevent overfitting for highly over-parameterized models since it tries to fit the training data with minimal complexity. Hu et al. (2020); Lyu et al. (2021); Frei et al. (2021) theoretically establish that GD on two-layer nets learns linear classifiers first. The goal of this paper is to demonstrate this simplicity bias/incremental learning in the matrix sensing problem, a non-convex optimization problem that arises in a wide range of real-world applications, e.g., image reconstruction (Zhao et al., 2010; Peng et al., 2014), object detection (Shen & Wu, 2012; Zou et al., 2013) and array processing systems (Kalogerias & Petropulu, 2013).
Moreover, this problem serves as a standard test-bed for the implicit bias of GD/SGD in deep learning theory, since it retains many of the key phenomena of deep learning while being simpler to analyze. Formally, the matrix sensing problem asks to recover a ground-truth matrix $Z^* \in \mathbb{R}^{d \times d}$ given $m$ observations $y_1, \ldots, y_m$. Each observation $y_i$ results from a linear measurement $y_i = \langle A_i, Z^* \rangle$, where $\{A_i\}_{1 \le i \le m}$ is a collection of symmetric measurement matrices. In this paper, we focus on the case where $Z^*$ is positive semi-definite (PSD) and low-rank: $Z^* \succeq 0$ and $\mathrm{rank}(Z^*) = r^* \ll d$. An intriguing approach to this matrix sensing problem is to use the Burer-Monteiro type decomposition $Z = U U^\top$ with $U \in \mathbb{R}^{d \times r}$, and minimize the squared loss with GD:

$$\min_{U \in \mathbb{R}^{d \times r}} f(U) := \frac{1}{4m} \sum_{i=1}^m \left( y_i - \langle A_i, U U^\top \rangle \right)^2. \qquad (1)$$

In the ideal case, the number of columns of $U$, denoted $r$ above, should be set to $r^*$, but $r^*$ may not be known in advance. This leads to two training regimes that are more likely to occur in practice: the under-parameterized regime where $r < r^*$, and the over-parameterized regime where $r > r^*$. The over-parameterized regime may appear prone to overfitting at first glance, but surprisingly, with small initialization, GD induces a good implicit bias towards solutions that exactly or approximately recover the ground truth. It was first conjectured in Gunasekar et al. (2017) that GD with small initialization finds the matrix with minimum nuclear norm. However, a series of works point out that this nuclear norm minimization view cannot capture the simplicity bias/incremental learning behavior of GD. In the matrix sensing setting, this term refers specifically to the phenomenon that GD tends to learn solutions whose rank gradually increases with the number of training steps. Arora et al. (2019) exhibit this phenomenon when there is only one observation ($m = 1$). Gissin et al. (2019); Jiang et al.
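As a concrete illustration of objective (1), the following sketch runs GD on the Burer-Monteiro parameterization with random symmetric Gaussian measurements. All problem sizes, the initialization scale, and the step size are illustrative choices of our own, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r_star, r, m = 20, 2, 5, 600  # illustrative sizes (over-parameterized: r > r*)

# Low-rank PSD ground truth Z* = V V^T with rank r*, normalized so ||Z*|| = O(1).
V = rng.standard_normal((d, r_star)) / np.sqrt(d)
Z_star = V @ V.T

# Symmetric Gaussian measurement matrices A_i and observations y_i = <A_i, Z*>.
A = rng.standard_normal((m, d, d))
A = (A + A.transpose(0, 2, 1)) / 2
y = np.einsum('mij,ij->m', A, Z_star)

def loss_and_grad(U):
    """f(U) = 1/(4m) * sum_i (y_i - <A_i, U U^T>)^2 and its gradient in U."""
    resid = y - np.einsum('mij,ij->m', A, U @ U.T)
    f = np.sum(resid ** 2) / (4 * m)
    # For symmetric A_i: grad f(U) = -(1/m) * sum_i resid_i * A_i @ U.
    grad = -np.einsum('m,mij->ij', resid, A) @ U / m
    return f, grad

alpha, mu = 1e-3, 0.05  # small initialization scale and step size
U = alpha * rng.standard_normal((d, r))
for t in range(4000):
    _, g = loss_and_grad(U)
    U -= mu * g

# With small initialization, GD approximately recovers the ground truth.
print(np.linalg.norm(U @ U.T - Z_star) / np.linalg.norm(Z_star))
```

Despite the over-parameterization ($r = 5 > r^* = 2$), the small initialization biases GD towards the low-rank ground truth rather than an arbitrary interpolating solution.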
(2022) study the full-observation case, where every entry of the ground truth is measured independently, so that $f(U) = \frac{1}{4d^2} \| Z^* - U U^\top \|_F^2$, and GD is shown to sequentially recover the singular components of the ground truth from the largest singular value to the smallest. Li et al. (2020) provide theoretical evidence that this incremental learning behavior occurs for matrix sensing in general. They also give a concrete counterexample to Gunasekar et al. (2017)'s conjecture, in which the simplicity bias drives GD to a rank-1 solution that has a large nuclear norm. Despite this progress, theoretical understanding of the simplicity bias of GD remains limited. Indeed, the vast majority of existing analyses only show that GD is initially biased towards learning a rank-1 solution, and these analyses cannot be generalized to higher ranks unless additional assumptions on the GD dynamics are made (Li et al., 2020, Appendix H; Belabbas, 2020; Jacot et al., 2021; Razin et al., 2021; 2022).
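The sequential recovery of singular components in the full-observation case can be simulated directly. The sketch below (with dimensions, spectrum, and hyperparameters of our own choosing) tracks the numerical rank of $U U^\top$ along the GD trajectory; with small initialization and well-separated singular values, the rank is expected to increase one step at a time.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 30, 6  # illustrative sizes (rank-3 ground truth, r = 6 columns)

# Ground truth with well-separated eigenvalues 5, 3, 1.
Q, _ = np.linalg.qr(rng.standard_normal((d, 3)))
Z_star = Q @ np.diag([5.0, 3.0, 1.0]) @ Q.T

alpha, mu = 1e-7, 0.005  # very small init separates the learning phases
U = alpha * rng.standard_normal((d, r))
ranks = []
for t in range(5001):
    if t % 250 == 0:
        s = np.linalg.svd(U @ U.T, compute_uv=False)
        ranks.append(int(np.sum(s > 0.05)))  # numerical rank (absolute cutoff)
    # GD on f(U) = ||Z* - U U^T||_F^2 / 4; the paper's 1/d^2 factor is
    # absorbed into the step size for brevity.
    U = U + mu * (Z_star - U @ U.T) @ U

# The recorded ranks should increase one step at a time: 0, 1, 2, then 3.
print(ranks)
```

Each plateau in `ranks` corresponds to GD lingering near a low-rank solution before the next eigencomponent grows out of the small initialization.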

1.1. OUR CONTRIBUTIONS

In this paper, we take a step towards understanding the generalization of GD with small initialization by firmly demonstrating the simplicity bias/incremental learning behavior in the matrix sensing setting, assuming the Restricted Isometry Property (RIP). Our main result is informally stated below; see Theorem 4.1 for the formal version.

Definition 1.1 (Best Rank-s Solution) We define the best rank-$s$ solution as the unique global minimizer $Z^*_s$ of the following constrained optimization problem:

$$\min_{Z \in \mathbb{R}^{d \times d}} \frac{1}{4m} \sum_{i=1}^m \left( y_i - \langle A_i, Z \rangle \right)^2 \quad \text{s.t.} \quad Z \succeq 0, \ \mathrm{rank}(Z) \le s.$$

Theorem 1.1 (Informal version of Theorem 4.1) Consider the matrix sensing problem (1) with rank-$r^*$ ground-truth matrix $Z^*$ and measurements $\{A_i\}_{i=1}^m$. Assume that the measurements satisfy the RIP condition (Definition 3.2). With a small learning rate $\mu > 0$ and small initialization $U_{\alpha,0} = \alpha U \in \mathbb{R}^{d \times r}$, the trajectory of $U_{\alpha,t} U_{\alpha,t}^\top$ during GD training enters an $o(1)$-neighbourhood of each of the best rank-$s$ solutions in the order $s = 1, 2, \ldots, r \wedge r^*$ as $\alpha \to 0$.

It is known from Li et al. (2018); Stöger & Soltanolkotabi (2021) that GD exactly recovers the ground truth under the RIP condition, but our theorem goes beyond this result in a number of ways. First, in the over-parameterized regime (i.e., $r \ge r^*$), it implies that the trajectory of GD exhibits an incremental learning phenomenon: it learns solutions with increasing ranks until it finds the ground truth.
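The statement of Theorem 1.1 can be sanity-checked numerically in the full-observation special case, where the best rank-$s$ solution reduces, by the Eckart-Young theorem for PSD matrices, to the top-$s$ truncated eigendecomposition of $Z^*$. The sketch below, with hyperparameters of our own choosing, measures how close the GD trajectory comes to each $Z^*_s$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 30, 6  # illustrative sizes
Q, _ = np.linalg.qr(rng.standard_normal((d, 3)))
sig = np.array([5.0, 3.0, 1.0])
Z_star = Q @ np.diag(sig) @ Q.T

# Best rank-s solutions: in the full-observation case these are the
# top-s truncated eigendecompositions of Z* (Eckart-Young, PSD case).
Z_s = {s: Q[:, :s] @ np.diag(sig[:s]) @ Q[:, :s].T for s in (1, 2, 3)}

alpha, mu = 1e-7, 0.005
U = alpha * rng.standard_normal((d, r))
min_dist = {s: np.inf for s in (1, 2, 3)}
for t in range(6000):
    # GD on f(U) = ||Z* - U U^T||_F^2 / 4 (constant factor folded into mu).
    U = U + mu * (Z_star - U @ U.T) @ U
    Z = U @ U.T
    for s in (1, 2, 3):
        min_dist[s] = min(min_dist[s], np.linalg.norm(Z - Z_s[s]))

# The trajectory should pass close to every Z*_s in turn, not just the final one.
for s in (1, 2, 3):
    print(s, min_dist[s] / np.linalg.norm(Z_s[s]))
```

That the minimum distance to $Z^*_1$ and $Z^*_2$ is small, even though GD ultimately converges to $Z^*_3 = Z^*$, is exactly the trajectory behavior Theorem 1.1 describes.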

