IMPLICIT REGULARIZATION VIA SPECTRAL NEURAL NETWORKS AND NON-LINEAR MATRIX SENSING

Abstract

The phenomenon of implicit regularization has attracted interest in recent years as a fundamental aspect of the remarkable generalizing ability of neural networks. In a nutshell, it entails that gradient flow dynamics in many neural nets, even without any explicit regularizer in the loss function, converges to the solution of a regularized learning problem. However, known results attempting to theoretically explain this phenomenon focus overwhelmingly on the setting of linear neural nets, and the simplicity of the linear structure is particularly crucial to existing arguments. In this paper, we explore this problem in the context of more realistic neural networks with a general class of non-linear activation functions, and rigorously demonstrate the implicit regularization phenomenon for such networks in the setting of matrix sensing problems. This is coupled with rate guarantees ensuring exponentially fast convergence of gradient descent, complemented by matching lower bounds which stipulate that the exponential rate is the best achievable. In this vein, we contribute a network architecture called Spectral Neural Networks (abbrv. SNN) that is particularly suitable for matrix learning problems. Conceptually, this entails coordinatizing the space of matrices by their singular values and singular vectors, as opposed to by their entries, a potentially fruitful perspective for matrix learning. We demonstrate that the SNN architecture is inherently much more amenable to theoretical analysis than vanilla neural nets, and we confirm its effectiveness in the context of matrix sensing via both mathematical guarantees and empirical investigations. We believe that the SNN architecture has the potential to be of wide applicability in a broad class of matrix learning scenarios.

1. INTRODUCTION

A longstanding pursuit of deep learning theory is to explain the astonishing ability of neural networks to generalize despite having far more learnable parameters than training data, even in the absence of any explicit regularization. An established understanding of this phenomenon is that the gradient descent algorithm induces a so-called implicit regularization effect. In very general terms, implicit regularization entails that gradient flow dynamics in many neural nets, even without any explicit regularizer in the loss function, converges to the solution of a regularized learning problem. In a sense, this creates a learning paradigm that automatically favors models characterized by "low complexity". A standard test-bed for mathematical analysis in studying implicit regularization in deep learning is the matrix sensing problem. The goal is to approximate a matrix X⋆ from a set of measurement matrices A_1, …, A_m and observations y_1, …, y_m, where y_i = ⟨A_i, X⋆⟩. A common approach, matrix factorization, parameterizes the solution as a product matrix, i.e., X = UV⊤, and optimizes the resulting non-convex objective to fit the data. This is equivalent to training a depth-2 neural network with a linear activation function. In an attempt to explain the generalizing ability of over-parameterized neural networks, Neyshabur et al. (2014) first suggested the idea of an implicit regularization effect of the optimizer, which entails a bias towards solutions that generalize well. Gunasekar et al. (2017) investigated the possibility of an implicit norm-regularization effect of gradient descent in the context of shallow matrix factorization. In particular, they studied the standard Burer-Monteiro approach (Burer & Monteiro, 2003) to matrix factorization, which may be viewed as a depth-2 linear neural network.
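The factorized gradient descent approach described above can be sketched in a few lines. The following is a minimal illustration, not the paper's setup: the dimensions, step size, number of iterations, and the symmetric rank-1 ground truth X⋆ = uu⊤ (so that a single factor U with X = UU⊤ suffices) are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy instance: recover a rank-1 PSD ground truth X*
# from m linear measurements y_i = <A_i, X*>.
n, m = 8, 50
u = rng.normal(size=(n, 1))
X_star = u @ u.T
A = rng.normal(size=(m, n, n))
y = np.einsum("mij,ij->m", A, X_star)  # y_i = <A_i, X*>

# Burer-Monteiro parameterization X = U U^T, trained by plain
# gradient descent on the squared loss -- no explicit regularizer.
U = 1e-3 * rng.normal(size=(n, 1))  # small ("near-zero") initialization
lr = 2e-4                           # hand-picked step size
for _ in range(3000):
    resid = np.einsum("mij,ij->m", A, U @ U.T) - y
    G = np.einsum("m,mij->ij", resid, A)  # gradient of the loss w.r.t. X
    U = U - lr * (G + G.T) @ U            # chain rule through X = U U^T

rel_err = np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star)
```

With enough generic Gaussian measurements (here m = 50 for an 8 × 8 rank-1 target), gradient descent from a small random initialization drives the relative recovery error to near zero, despite the objective being non-convex.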
Gunasekar et al. (2017) were able to theoretically demonstrate an implicit norm-regularization phenomenon, where a suitably initialized gradient flow dynamics approaches a solution of the nuclear-norm minimization approach to low-rank matrix recovery (Recht et al., 2010), in a setting where the involved measurement matrices commute with each other. They also conjectured that this latter restriction on the measurement matrices is unnecessary. This conjecture was later resolved by Li et al. (2018) in the setting where the measurement matrices satisfy a restricted isometry property. Other aspects of implicit regularization in matrix factorization problems were investigated in several follow-up papers (Neyshabur et al., 2017; Arora et al., 2019; Razin & Cohen, 2020; Tarmoun et al., 2021; Razin et al., 2021). For instance, Arora et al. (2019) showed that the implicit norm-regularization property of gradient flow, as studied by Gunasekar et al. (2017), does not hold in the context of deep matrix factorization. Razin & Cohen (2020) constructed a simple 2 × 2 example where the gradient flow dynamics lead to an eventual blow-up of any matrix norm, while a certain relaxation of rank, the so-called e-rank, is minimized in the limit. These works suggest that implicit regularization in deep networks should be interpreted through the lens of rank minimization rather than norm minimization. Incidentally, Razin et al. (2021) have recently demonstrated similar phenomena in the context of tensor factorization. Researchers have also studied implicit regularization in several other learning problems, including linear models (Soudry et al., 2018; Zhao et al., 2019; Du & Hu, 2019) and neural networks with one or two hidden layers (Li et al., 2018; Blanc et al., 2020; Gidel et al., 2019; Kubo et al., 2019; Saxe et al., 2019).
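For concreteness, the nuclear-norm minimization problem referred to above is the standard convex program, in the notation of the matrix sensing setup:

```latex
\min_{X} \; \|X\|_{*}
\quad \text{subject to} \quad
\langle A_i, X \rangle = y_i, \qquad i = 1, \ldots, m,
```

where the nuclear norm ‖X‖_* denotes the sum of the singular values of X, a convex surrogate for the rank.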
Besides norm-regularization, several of these works demonstrate the implicit regularization effect of gradient descent in terms of other relevant quantities, such as margin (Soudry et al., 2018), the number of times the model changes its convexity (Blanc et al., 2020), linear interpolation (Kubo et al., 2019), or structural bias (Gidel et al., 2019). A natural setting for investigating the implicit regularization phenomenon is the matrix sensing problem. Classical works in matrix sensing and matrix factorization utilize convex relaxation approaches, i.e., minimizing the nuclear norm subject to agreement with the observations, and derive tight sample complexity bounds (Srebro & Shraibman, 2005; Candès & Recht, 2009; Recht et al., 2010; Candès & Tao, 2010; Keshavan et al., 2010; Recht, 2011). More recently, many works have analyzed gradient descent on the matrix sensing problem. Ge et al. (2016) and Bhojanapalli et al. (2016) showed that the non-convex objectives for matrix sensing and matrix completion with low-rank parameterization do not have any spurious local minima; consequently, the gradient descent algorithm converges to the global minimum. Despite the large body of work studying implicit regularization, most of it considers the linear setting. It remains an open question to understand the behavior of gradient descent in the presence of non-linearities, which more realistically represent the neural nets employed in practice. In this paper, we make an initial foray into this problem, and investigate the implicit regularization phenomenon in more realistic neural networks with a general class of non-linear activation functions. We rigorously demonstrate the occurrence of an implicit regularization phenomenon in this setting for matrix sensing problems, reinforced with quantitative rate guarantees ensuring exponentially fast convergence of gradient descent to the best approximation in a suitable class of matrices.
Our convergence upper bounds are complemented by matching lower bounds which demonstrate the optimality of the exponential rate of convergence. In the bigger picture, we contribute a network architecture that we refer to as the Spectral Neural Network architecture (abbrv. SNN), which is particularly suitable for matrix learning scenarios. Conceptually, this entails coordinatizing the space of matrices by their singular values and singular vectors, as opposed to by their entries. We believe that this point of view can be beneficial for tackling matrix learning problems in a neural network setup. SNNs are particularly well-suited for theoretical analysis due to the spectral nature of their non-linearities, as opposed to vanilla neural nets, while at the same time provably guaranteeing effectiveness in matrix learning problems. We also introduce a much more general counterpart of the near-zero initialization that is popular in the related literature; our methods can handle a considerably broader class of initialization schemes, constrained only via certain inequalities. Our theoretical contributions include a compact analytical representation of the gradient flow dynamics, afforded by the spectral nature of our network architecture. We demonstrate the efficacy of the SNN architecture through its application to the matrix sensing problem, bolstered via both theoretical guarantees and extensive empirical studies. We believe that the SNN architecture has the potential to be of wide applicability in a broad class of matrix learning problems. In particular, we believe that the SNN architecture would be natural for the study of rank (or e-rank)
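To illustrate the idea of coordinatizing a matrix by its spectrum, the following sketch applies a scalar non-linearity to the singular values of a matrix while leaving the singular vectors untouched. This is our own illustrative construction, not necessarily the precise SNN layer: the function name and the choice of tanh as the activation are assumptions for the example.

```python
import numpy as np

def spectral_activation(X, act=np.tanh):
    """Apply a scalar non-linearity to the singular values of X,
    keeping the singular vectors fixed (illustrative sketch)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(act(s)) @ Vt

# On a diagonal matrix the singular vectors are axis-aligned, so the
# result is again diagonal, with the activation applied entrywise to
# the singular values.
X = np.diag([2.0, 0.5])
Y = spectral_activation(X)  # diagonal with entries tanh(2), tanh(0.5)
```

Note that any activation with act(0) = 0 (such as tanh) maps rank-r matrices to rank-at-most-r matrices, which is one reason a spectral parameterization is natural when rank, rather than entrywise magnitude, is the quantity being implicitly controlled.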

