A UNIFYING VIEW ON IMPLICIT BIAS IN TRAINING LINEAR NEURAL NETWORKS

Abstract

We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases, and investigate the linear version of the formulation, called linear tensor networks. With this formulation, we can characterize the convergence direction of the network parameters as singular vectors of a tensor defined by the network. For L-layer linear tensor networks that are orthogonally decomposable, we show that gradient flow on separable classification finds a stationary point of the ℓ_{2/L} max-margin problem in a "transformed" input space defined by the network. For underdetermined regression, we prove that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted ℓ_1 and ℓ_2 norms in the transformed input space. Our theorems subsume existing results in the literature while removing standard convergence assumptions. We also provide experiments that corroborate our analysis.

1. INTRODUCTION

Overparametrized neural networks have infinitely many solutions that achieve zero training error, and such global minima can have very different generalization performance. Moreover, training a neural network is a high-dimensional nonconvex problem, which is typically intractable to solve. However, the success of deep learning indicates that first-order methods such as gradient descent or stochastic gradient descent (GD/SGD) not only (a) succeed in finding global minima, but also (b) are biased towards solutions that generalize well; why this happens has largely remained a mystery in the literature.

To explain part (a) of the phenomenon, there is a growing literature studying the convergence of GD/SGD on overparametrized neural networks (e.g., Du et al., 2018a;b; Allen-Zhu et al., 2018; Zou et al., 2018; Jacot et al., 2018; Oymak & Soltanolkotabi, 2020, and many more). There are also convergence results that focus on linear networks, without nonlinear activations (Bartlett et al., 2018; Arora et al., 2019a; Wu et al., 2019; Du & Hu, 2019; Hu et al., 2020). These results typically focus on the convergence of the loss, and hence do not address which of the many global minima is reached.

Another line of results tackles part (b) by studying the implicit bias or regularization of gradient-based methods on neural networks or related problems (Gunasekar et al., 2017; 2018a;b; Arora et al., 2018; Soudry et al., 2018; Ji & Telgarsky, 2019a; Arora et al., 2019b; Woodworth et al., 2020; Chizat & Bach, 2020; Gissin et al., 2020). These results show that even without explicit regularization terms in the training objective, algorithms such as GD applied to neural networks have an implicit bias towards certain solutions among the many global minima. However, many such results rely on convergence assumptions, such as global convergence of the loss to zero and/or directional convergence of parameters and gradients.
Ideally, such convergence assumptions should be removed, because they cannot be tested a priori, and there are known examples where GD does not converge to global minima under certain initializations (Bartlett et al., 2018; Arora et al., 2019a).

1.1. SUMMARY OF OUR CONTRIBUTIONS

We study the implicit bias of gradient flow (GD with infinitesimal step size) on linear neural networks. Following recent progress on this topic, we consider classification and regression problems that have multiple solutions with zero training error. Our analyses apply to a general class of networks, and prove both convergence and implicit bias, providing a more complete characterization of the algorithm trajectory without relying on convergence assumptions.

• We propose a general tensor formulation of nonlinear neural networks which includes many network architectures considered in the literature. In this paper, we focus on the linear version of this formulation (i.e., no nonlinear activations), called linear tensor networks.

• For linearly separable classification, we prove that linear tensor network parameters converge in direction to singular vectors of a tensor defined by the network. As a corollary, we show that linear fully-connected networks converge to the ℓ_2 max-margin solution (Ji & Telgarsky, 2020).

• For separable classification, we further show that if the linear tensor network is orthogonally decomposable (Assumption 1), gradient flow finds the ℓ_{2/L} max-margin solution (where L is the depth) in the singular value space, leading the parameters to converge to the top singular vectors of the tensor when L = 2. This theorem subsumes known results on linear convolutional networks and diagonal networks proved in Gunasekar et al. (2018b), without using convergence assumptions.

• For underdetermined linear regression, we study the limit points of gradient flow on orthogonally decomposable networks (Assumption 1), and provide a full characterization of the limit points. This theorem covers results on deep matrix sensing (Arora et al., 2019b) as a special case, and extends a similar recent result (Woodworth et al., 2020) to a broader class of networks.
• For underdetermined linear regression with deep linear fully-connected networks, we prove that the network converges to the minimum ℓ_2 norm solution as we scale the initialization to zero.

• Lastly, we present simple experiments that corroborate our theoretical analysis. Figure 1 shows that our predictions of the limit points match the solutions found by GD.
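As a small numerical illustration of the regression-side claims above (our own sketch, not the paper's exact setting), consider a depth-2 "diagonal" linear network whose effective linear coefficients are β = u ⊙ u − v ⊙ v, trained by plain gradient descent with a small step size as a surrogate for gradient flow. All names and hyperparameters below are illustrative choices; the qualitative prediction is that a small initialization scale α biases the fitted coefficients toward small ℓ_1 norm, while a large α behaves more like a minimum ℓ_2 norm fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2, 6                          # underdetermined: fewer samples than dimensions
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def train_diagonal(alpha, lr=1e-3, steps=200_000):
    """Depth-2 diagonal linear network: effective coefficients beta = u*u - v*v."""
    u = alpha * np.ones(d)
    v = alpha * np.ones(d)
    for _ in range(steps):
        beta = u * u - v * v
        g = X.T @ (X @ beta - y)     # gradient of 0.5*||X beta - y||^2 w.r.t. beta
        # chain rule: d(beta_i)/d(u_i) = 2 u_i, d(beta_i)/d(v_i) = -2 v_i
        u, v = u - lr * 2 * g * u, v + lr * 2 * g * v
    return u * u - v * v

beta_small = train_diagonal(alpha=0.01)   # small init: sparsity-inducing (l1-like) bias
beta_large = train_diagonal(alpha=1.0)    # large init: closer to the l2-like regime
print("l1 norms:", np.abs(beta_small).sum(), np.abs(beta_large).sum())
```

Both runs drive the training error to (near) zero, yet they land at different global minima of the same underdetermined problem, which is exactly the implicit-bias phenomenon Figure 1 depicts.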

2. PROBLEM SETTINGS AND RELATED WORKS

We first define the notation used in the paper. Given a positive integer a, let [a] := {1, . . . , a}. We use I_d to denote the d × d identity matrix. Given a matrix A, we use vec(A) to denote its vectorization, i.e., the concatenation of all columns of A. For two vectors a and b, let a ⊗ b be their tensor product, a ⊙ b be their element-wise product, and a^{⊙k} be the element-wise k-th power of a. Given an order-L tensor A ∈ R^{k_1 × ··· × k_L}, we use [A]_{j_1,...,j_L} to denote the (j_1, j_2, . . . , j_L)-th element of A, where j_l ∈ [k_l] for all l ∈ [L]. In element indexing, we use • to denote all indices in the corresponding dimension, and a : b to denote all indices from a to b. For example, for a matrix A, [A]_{•,4:6} denotes the submatrix consisting of the 4th to 6th columns of A. The square bracket notation for element indexing overloads the notation [a] = {1, . . . , a} defined above, but the distinction will be clear from context.
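For concreteness, these conventions can be mirrored in code; the following NumPy sketch is our own, and uses NumPy's 0-indexed, half-open slices in place of the paper's 1-indexed ranges.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)      # a 3x4 matrix

# vec(A): concatenation of the columns of A, i.e. column-major ("Fortran") order
vec_A = A.flatten(order="F")

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

tensor_prod = np.outer(a, b)         # a (tensor) b: here an order-2 tensor (3x3 matrix)
elem_prod = a * b                    # element-wise product of a and b
elem_pow = a ** 3                    # element-wise k-th power of a, with k = 3

# The paper's [A]_{.,4:6} (4th-6th columns, 1-indexed) would be A[:, 3:6] in NumPy;
# here we take [A]_{.,2:3}, i.e. the 2nd and 3rd columns of A.
sub = A[:, 1:3]
```

The `order="F"` argument is what makes `flatten` stack columns rather than rows, matching the definition of vec(A) above.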



Figure 1: Gradient descent trajectories of the linear coefficients of linear fully-connected, diagonal, and convolutional networks on a regression task, initialized with two different initial scales α = 0.01 and α = 1. Networks are initialized at the same coefficients (circles on purple lines), but follow different trajectories due to the implicit biases induced by their architectures. The figures show that our theoretical predictions of the limit points (circles on the yellow line, the set of global minima) agree with the solutions found by GD. For details of the experimental setup, see Section 6.

