TOWARDS RESOLVING THE IMPLICIT BIAS OF GRADIENT DESCENT FOR MATRIX FACTORIZATION: GREEDY LOW-RANK LEARNING

Abstract

Matrix factorization is a simple and natural test-bed for investigating the implicit regularization of gradient descent. Gunasekar et al. (2017) conjectured that Gradient Flow with infinitesimal initialization converges to the solution that minimizes the nuclear norm, but a series of recent papers argued that the language of norm minimization is not sufficient to fully characterize the implicit regularization. In this work, we provide theoretical and empirical evidence that, for depth-2 matrix factorization, gradient flow with infinitesimal initialization is mathematically equivalent to a simple heuristic rank minimization algorithm, Greedy Low-Rank Learning, under some reasonable assumptions. This generalizes the rank minimization view from previous works to a much broader setting and enables us to construct counterexamples refuting the conjecture from Gunasekar et al. (2017). We also extend the results to the case of depth ≥ 3, and we show that the benefit of being deeper is that the above convergence depends much more weakly on the initialization magnitude, so that this rank minimization is more likely to take effect for initialization at practical scales.

1. INTRODUCTION

There are usually far more learnable parameters in deep neural nets than training data points, yet deep learning still works well on real-world tasks. Even with explicit regularization, the model complexity of state-of-the-art neural nets is so large that they can easily fit randomly labeled data (Zhang et al., 2017). Towards explaining this mystery of generalization, we must understand what kind of implicit regularization Gradient Descent (GD) imposes during training. Ideally, we hope for a clean mathematical characterization of how GD constrains the set of functions that can be expressed by a trained neural net. As a direct analysis of deep neural nets can be quite hard, a line of works turned to studying the implicit regularization on simpler problems for inspiration, for example, low-rank matrix factorization, a fundamental problem in machine learning and information processing. Given a set of observations about an unknown matrix W* ∈ R^{d×d} of rank r* ≪ d, one needs to find a low-rank solution W that is compatible with the given observations. Examples include matrix sensing, matrix completion, phase retrieval, and robust principal component analysis, just to name a few (see Chi et al. 2019 for a survey). When W* is symmetric and positive semidefinite, one way to solve all these problems is to parameterize W as W = UU^⊤ for U ∈ R^{d×r} and optimize L(U) := (1/2) f(UU^⊤), where f(·) is some empirical risk function depending on the observations and r is the rank constraint. In theory, if the rank constraint is too loose, the solutions need not be low-rank and we may fail to recover W*. However, even when the rank is unconstrained (i.e., r = d), GD with small initialization still achieves good performance in practice. This empirical observation reveals that the implicit regularization of GD exists even in this simple matrix factorization problem, but its mechanism is still under debate. Gunasekar et al.
(2017) proved that Gradient Flow (GD with infinitesimal step size, a.k.a. GF) with infinitesimal initialization finds the minimum nuclear norm solution in a special case of matrix sensing, and further conjectured that this holds in general.

Conjecture 1.1 (Gunasekar et al. 2017, informal). With sufficiently small initialization, GF converges to the minimum nuclear norm solution of matrix sensing.

Subsequently, Arora et al. (2019a) challenged this view by arguing that a simple mathematical norm may not be a sufficient language for characterizing implicit regularization. One example illustrated in Arora et al. (2019a) concerns matrix sensing with a single observation. They showed that GD with small initialization enhances the growth of large singular values of the solution and attenuates that of smaller ones. This enhancement/attenuation effect encourages low rank, and it is further intensified with depth in deep matrix factorization (i.e., GD optimizes f(U_1 ··· U_L) for L ≥ 2). However, these effects are not captured by the nuclear norm alone. Gidel et al. (2019) and Gissin et al. (2020) further exploited this idea and showed, in the special case of full-observation matrix sensing, that GF learns solutions with gradually increasing rank. Razin and Cohen (2020) showed, in a simple class of matrix completion problems, that GF decreases the rank along the trajectory while every norm grows towards infinity. More aggressively, they conjectured that the implicit regularization can be explained by rank minimization rather than norm minimization.
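To make this setting concrete, the following minimal sketch (our own illustration; the dimensions, step size, and initialization scale are arbitrary choices, not taken from any of the cited papers) runs gradient descent on the unconstrained depth-2 factorization W = UU^⊤ for the full-observation loss f(W) = (1/2)‖W − W*‖_F². Even with r = d, small initialization drives GD to a solution of numerical rank r*:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r_star = 20, 2

# Ground-truth PSD matrix W* of rank r* << d, eigenvalues of order 1.
V = rng.standard_normal((d, r_star)) / np.sqrt(d)
W_star = V @ V.T

alpha, lr = 1e-4, 5e-2       # small initialization scale, step size
U = alpha * rng.standard_normal((d, d))   # unconstrained width: r = d

for _ in range(5000):
    # gradient of L(U) = 1/2 ||U U^T - W*||_F^2 is 2 (U U^T - W*) U
    U -= lr * 2.0 * (U @ U.T - W_star) @ U

W = U @ U.T
err = np.linalg.norm(W - W_star) / np.linalg.norm(W_star)
num_rank = int(np.sum(np.linalg.svd(W, compute_uv=False) > 1e-2))
print(err, num_rank)
```

Shrinking `alpha` strengthens the low-rank bias, while a large `alpha` pushes the dynamics toward the lazy regime where this bias weakens.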

Our Contributions.

In this paper, we take one further step towards resolving the implicit regularization in the matrix factorization problem. Our theoretical results show that GD performs rank minimization via a greedy process in a broader setting. Specifically, we provide theoretical evidence that GF with infinitesimal initialization is in general mathematically equivalent to another algorithm called Greedy Low-Rank Learning (GLRL). At a high level, GLRL is a greedy algorithm that performs rank-constrained optimization and relaxes the rank constraint by 1 whenever it fails to reach a global minimizer of f(·) under the current rank constraint. As a by-product, we refute Conjecture 1.1 by demonstrating a counterexample (Example 5.9). This shows that describing the implicit regularization via GLRL is more expressive than the language of norm minimization. We also extend our results to deep matrix factorization (Section 6), where we prove that the trajectory of GF with infinitesimal identity initialization converges to a deep version of GLRL, at least in the early stage of optimization. We also use this result to confirm the intuition gained from toy models (Gissin et al., 2020) that a benefit of depth in matrix factorization is to encourage rank minimization even for initialization with a relatively larger scale, so that it is more likely to happen in practice. We validate all our results with experiments in Appendix E.
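The greedy process described above can be sketched as follows. This is a schematic rendering under illustrative assumptions (full-observation loss f(W) = (1/2)‖W − W*‖_F², finite step sizes, and a small but finite scale `eps` for each newly added direction, whereas the theory concerns infinitesimal initialization); it is our own simplified version, not the paper's exact GLRL procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r_star = 10, 3
V = rng.standard_normal((d, r_star)) / np.sqrt(d)
W_star = V @ V.T          # ground truth: PSD, rank r*

def grad_f(W):
    # f(W) = 1/2 ||W - W*||_F^2, so grad f(W) = W - W*
    return W - W_star

def run_rank_constrained(U, lr=5e-2, steps=5000):
    # gradient descent on L(U) = 1/2 f(U U^T) at fixed width (rank constraint)
    for _ in range(steps):
        U = U - lr * 2.0 * grad_f(U @ U.T) @ U
    return U

eps = 1e-4                # scale of each newly added direction
U = np.zeros((d, 0))      # start from rank 0
for r in range(1, d + 1):
    if np.linalg.norm(grad_f(U @ U.T)) < 1e-8:
        break             # reached a global minimizer of f: stop
    # best rank-1 descent direction: top eigenvector of -grad f
    _, Q = np.linalg.eigh(-grad_f(U @ U.T))
    U = np.hstack([U, eps * Q[:, [-1]]])   # relax rank constraint by 1
    U = run_rank_constrained(U)

rank_learned = U.shape[1]
err = np.linalg.norm(U @ U.T - W_star) / np.linalg.norm(W_star)
print(rank_learned, err)
```

Since rank-1 and rank-2 constrained optimization cannot fit the rank-3 target, the sketch widens twice before the gradient vanishes, ending with exactly r* columns.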

2. RELATED WORKS

Norm Minimization. The view of norm minimization, or the closely related view of margin maximization, has been explored in different settings. Besides the nuclear norm minimization for matrix factorization (Gunasekar et al., 2017) discussed in the introduction, previous works have also studied norm minimization/margin maximization for linear regression (Wilson et al., 2017; Soudry et al., 2018a;b; Nacson et al., 2019b;c; Ji and Telgarsky, 2019b), deep linear neural nets (Ji and Telgarsky, 2019a; Gunasekar et al., 2018), homogeneous neural nets (Nacson et al., 2019a; Lyu and Li, 2020), and ultra-wide neural nets (Jacot et al., 2018; Arora et al., 2019b; Chizat and Bach, 2020). Closer to our setting, Li et al. (2018) proved recovery guarantees for gradient flow solving matrix sensing under the Restricted Isometry Property (RIP), but the proof does not generalize easily to the case without RIP. Belabbas (2020) made attempts to prove that gradient flow is approximately rank-1 in the very early phase of training, but this does not exclude the possibility that the approximation error explodes later and gradient flow fails to converge to low-rank solutions. Compared to these works, the current paper studies how GF encourages low rank in a much broader setting.

3. BACKGROUND

Notations. For two matrices A, B, we define ⟨A, B⟩ := Tr(AB^⊤) as their inner product. We use ‖A‖_F, ‖A‖_* and ‖A‖_2 to denote the Frobenius norm, nuclear norm and largest singular value of A, respectively. For a matrix A ∈ R^{d×d}, we use λ_1(A), ..., λ_d(A) to denote the eigenvalues of A in decreasing order (if they are all real). We define S_d as the set of symmetric d × d matrices and S_d^+ ⊆ S_d as the set of positive semidefinite (PSD) matrices. We write A ⪰ B or B ⪯ A iff A − B is PSD. We use S_{d,r}^+ and S_{d,≤r}^+ to denote the sets of d × d PSD matrices with rank exactly r and rank at most r, respectively.



Initialization and Rank Minimization. The initialization scale can greatly influence the implicit regularization. A sufficiently large initialization can make the training dynamics fall into the lazy training regime defined by Chizat et al. (2019) and diminish test accuracy. Using small initialization is particularly important for biasing gradient descent towards low-rank solutions in matrix factorization, as empirically observed by Gunasekar et al. (2017). Arora et al. (2019a); Gidel et al. (2019); Gissin et al. (2020); Razin and Cohen (2020) studied how gradient flow with small initialization encourages low rank in simple settings, as discussed in the introduction.
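The rank-incremental behavior that these works describe can be reproduced on a toy example. In the sketch below (our own construction; the target spectrum, rank threshold, and hyperparameters are arbitrary choices), gradient descent on W = UU^⊤ with very small initialization learns the eigendirections of a rank-3 target sequentially, so the numerical rank along the trajectory steps up one at a time:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 15
# PSD target of rank 3 with well-separated eigenvalues 9, 3, 1,
# so the three directions are learned at clearly different speeds.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
W_star = Q[:, :3] @ np.diag([9.0, 3.0, 1.0]) @ Q[:, :3].T

alpha, lr = 1e-6, 1e-2     # tiny initialization, step size
U = alpha * rng.standard_normal((d, d))

ranks_seen = []
for step in range(5000):
    U -= lr * 2.0 * (U @ U.T - W_star) @ U
    if step % 100 == 0:
        # numerical rank of the current iterate W = U U^T
        s = np.linalg.svd(U @ U.T, compute_uv=False)
        ranks_seen.append(int(np.sum(s > 0.1)))

print(sorted(set(ranks_seen)))
```

The printed set of observed ranks contains 0, 1, 2 and 3: each eigenvalue λ of W* is picked up roughly when its component, growing from scale alpha², reaches order one, which happens later for smaller λ.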

