A UNIFYING VIEW ON IMPLICIT BIAS IN TRAINING LINEAR NEURAL NETWORKS

Abstract

We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases, and investigate the linear version of this formulation, called linear tensor networks. With this formulation, we can characterize the convergence direction of the network parameters as singular vectors of a tensor defined by the network. For L-layer linear tensor networks that are orthogonally decomposable, we show that gradient flow on separable classification finds a stationary point of the ℓ2/L max-margin problem in a "transformed" input space defined by the network. For underdetermined regression, we prove that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted ℓ1 and ℓ2 norms in the transformed input space. Our theorems subsume existing results in the literature while removing standard convergence assumptions. We also provide experiments that corroborate our analysis.
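Stated schematically, the two implicit-bias results above take the following form. This is a sketch only: here z denotes the coefficients of the end-to-end linear predictor, x̄_i the transformed inputs, and Q the norm-like function interpolating between weighted ℓ1 and ℓ2 norms; the exact objects are given by the tensor formulation in the paper.

```latex
% Separable classification: gradient flow converges (in direction) to a
% stationary point of the \ell_{2/L} max-margin problem
\min_{z} \; \|z\|_{2/L}
\quad \text{subject to} \quad
y_i \, \langle z, \bar{x}_i \rangle \ge 1 \;\; \forall i.

% Underdetermined regression: gradient flow converges to the interpolant
% that minimizes a norm-like function Q
\min_{z} \; Q(z)
\quad \text{subject to} \quad
\langle z, \bar{x}_i \rangle = y_i \;\; \forall i.
```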

1. INTRODUCTION

Overparametrized neural networks have infinitely many solutions that achieve zero training error, and different global minima can have very different generalization performance. Moreover, training a neural network is a high-dimensional nonconvex problem, which is typically intractable to solve. However, the success of deep learning indicates that first-order methods such as gradient descent or stochastic gradient descent (GD/SGD) not only (a) succeed in finding global minima, but also (b) are biased towards solutions that generalize well, which has largely remained a mystery in the literature. To explain part (a) of the phenomenon, there is a growing literature studying the convergence of GD/SGD on overparametrized neural networks (e.g., Du et al. (2018a;b); Allen-Zhu et al. (2018); Zou et al. (2018); Jacot et al. (2018); Oymak & Soltanolkotabi (2020), and many more). There are also convergence results that focus on linear networks, without nonlinear activations (Bartlett et al., 2018; Arora et al., 2019a; Wu et al., 2019; Du & Hu, 2019; Hu et al., 2020). These results typically focus on the convergence of the loss, hence do not address which of the many global minima is reached. Another line of results tackles part (b) by studying the implicit bias or regularization of gradient-based methods on neural networks or related problems (Gunasekar et al., 2017; 2018a;b; Arora et al., 2018; Soudry et al., 2018; Ji & Telgarsky, 2019a; Arora et al., 2019b; Woodworth et al., 2020; Chizat & Bach, 2020; Gissin et al., 2020). These results show that even without explicit regularization terms in the training objective, algorithms such as GD applied to neural networks have an implicit bias towards certain solutions among the many global minima. However, many of these results rely on convergence assumptions, such as global convergence of the loss to zero and/or directional convergence of the parameters and gradients.

Ideally, such convergence assumptions should be removed, because they cannot be verified a priori, and there are known examples where GD does not converge to global minima under certain initializations (Bartlett et al., 2018; Arora et al., 2019a).
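As a concrete elementary instance of phenomenon (b), gradient descent on an underdetermined least-squares problem, initialized at zero, converges not to an arbitrary interpolant but to the minimum ℓ2-norm one. The sketch below (plain NumPy; the problem sizes and step size are illustrative choices, not taken from the paper) demonstrates this classical fact:

```python
# Implicit bias in its simplest form (not the paper's tensor-network setting):
# gradient descent on underdetermined least squares, started at zero, converges
# to the minimum-l2-norm interpolant -- one global minimum among infinitely many.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))   # 5 examples, 20 parameters: overparametrized
y = rng.standard_normal(5)

w = np.zeros(20)                   # zero init keeps iterates in the row space of X
lr = 0.01
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y)    # gradient of 0.5 * ||X w - y||^2

w_min_norm = np.linalg.pinv(X) @ y  # minimum-norm solution among all interpolants
print(np.allclose(X @ w, y, atol=1e-5))       # training loss driven to ~0
print(np.allclose(w, w_min_norm, atol=1e-5))  # GD picked the min-norm solution
```

The zero initialization confines every iterate to the row space of X, which is exactly why the limit is the minimum-norm interpolant; the paper characterizes analogous (and richer) biases for deep linear tensor networks.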

