SHARP CONVERGENCE ANALYSIS OF GRADIENT DESCENT FOR OVERPARAMETERIZED DEEP LINEAR NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

This paper presents sharp rates of convergence of the gradient descent (GD) method for overparameterized deep linear neural networks under different random initializations. The study touches on a major open theoretical problem in machine learning: why are deep neural networks trained with GD methods efficient in so many practical applications? While a solution to this problem is still beyond reach for general nonlinear deep neural networks, extensive effort has been invested in studying related questions for deep linear neural networks, and many interesting results have been reported to date. For example, recent results on the loss landscape show that even though the loss function of a deep linear neural network is non-convex, every local minimizer is also a global minimizer. When the GD method is applied to train deep linear networks, its convergence behavior depends on the initialization. In this study, we obtain a sharp rate of convergence of GD for deep linear networks and demonstrate that this rate does not depend on the type of random initialization. Furthermore, we show that the depth of the network does not affect the optimal rate of convergence, provided the width of each hidden layer is appropriately large. Finally, we explain why GD for an overparameterized deep linear network automatically avoids bad saddles.

1. INTRODUCTION

Deep linear neural networks, as a class of toy models, are frequently used to understand loss surfaces and gradient-based optimization methods for non-convex problems. Dauphin et al. (2014) and Choromanska et al. (2015a) explored the loss function of deep nonlinear networks using random matrix theory (specifically, a spherical spin-glass model). This theory essentially converts the loss surface of deep nonlinear neural networks into that of deep linear neural networks under certain assumptions, some of which are unrealistic. Choromanska et al. (2015b) posed the open problem of establishing a connection between the loss function of neural networks and the Hamiltonian of spherical spin-glass models under milder assumptions. Later, Kawaguchi (2016) successfully discarded most of these assumptions by analyzing the loss surface of deep linear neural networks. The landscape analysis of deep linear neural networks (Kawaguchi, 2016; Kawaguchi & Lu, 2017; Laurent & Brecht, 2018) focuses on several properties of the critical points: (i) every local minimum is a global minimum; (ii) every critical point that is not a local minimum is a saddle point; and (iii) if the network is deeper than three layers, there exists a saddle point at which all eigenvalues of the Hessian are zero. Thus, for deep linear neural networks, convergence to a global minimum is impeded by the existence of poor saddles. Lee et al. (2016) showed that the gradient method almost surely never converges to a strict saddle point, although the time cost can depend exponentially on the dimension (Du et al., 2017). Gradient descent (GD) with perturbations (Ge et al., 2015; Jin et al., 2017) can find a local minimizer in polynomial time. Thus, the trajectory approach, combined with random initialization or a randomized algorithm, circumvents the obstacle posed by poor saddles.
According to studies of the continuous-time dynamics of gradient flow (Du et al., 2018; Arora et al., 2018b), the balance property of a deep linear network is preserved if the initialization is balanced. Arora et al. (2018a;b), Du & Hu (2019), and Hu et al. (2020) successfully proved that GD with the corresponding initialization schemes converges to a global minimizer of deep linear neural networks with high probability. Furthermore, the rate of convergence is linear and behaves like that of GD for a convex problem. Hu et al. (2020) established that convergence under Gaussian initialization can be very slow for deep linear neural networks of large depth unless the width is nearly linear in the depth. They also showed that orthogonal initialization accelerates convergence for deep linear neural networks. Thus, the convergence behavior of the GD method for training deep linear neural networks crucially depends on the initialization. Recent studies have demonstrated the connection between deep learning and kernel methods (Daniely, 2017; Arora et al., 2019a;b; Chizat et al., 2019; Lee et al., 2019; Du et al., 2019; Cao & Gu, 2019; Woodworth et al., 2020), especially the neural tangent kernel (NTK) introduced by Jacot et al. (2018). For most common neural networks, the NTK becomes constant in the limit of large layer width and remains so throughout training (Jacot et al., 2018; Liu et al., 2020). Throughout training, such networks are well described by their first-order Taylor expansion around the parameters at initialization (Lee et al., 2019). In this paper, we first characterize the convergence region, i.e., the set of initialization parameters that lead to linear convergence of GD for deep linear neural networks (see Lemma 4.1 or Lemma D.1).
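The preservation of the balance property along the GD trajectory can be checked numerically. The sketch below is a minimal illustration with the squared loss; the width, depth, sample size, and learning rate are illustrative choices, not values from the paper. It starts from a balanced (identity-scaled) initialization and measures the balance gap W_{j+1}^T W_{j+1} − W_j W_j^T after many GD steps; for gradient flow the gap is exactly zero, and for discrete GD the drift is only O(η²) per step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, m = 8, 3, 20                 # width, depth, samples (illustrative)
X = rng.standard_normal((n, m))
Y = rng.standard_normal((n, m))

# Balanced initialization: scaled identities satisfy
# W_{j+1}^T W_{j+1} = W_j W_j^T exactly.
Ws = [0.5 * np.eye(n) for _ in range(L)]

def prod(mats):
    # End-to-end product W_L ... W_1 (mats listed first-layer first).
    P = np.eye(n)
    for M in mats:
        P = M @ P
    return P

eta = 1e-3
for _ in range(200):
    # Gradient of L(P) = ||P X - Y||_F^2 / (2m) at the current product.
    G = (prod(Ws) @ X - Y) @ X.T / m
    # Chain rule: dL/dW_j = (suffix)^T G (prefix)^T.
    grads = [prod(Ws[j + 1:]).T @ G @ prod(Ws[:j]).T for j in range(L)]
    Ws = [W - eta * g for W, g in zip(Ws, grads)]

# The balance gap stays near zero along the discrete trajectory.
gap = max(np.linalg.norm(Ws[j + 1].T @ Ws[j + 1] - Ws[j] @ Ws[j].T)
          for j in range(L - 1))
print(gap)
```

The first-order terms in the balance gap cancel exactly for these gradients, which is why the discrete-time drift is only second order in the step size.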
Next, we demonstrate that if the minimum width among all the hidden layers is sufficiently large, then a random initialization falls into the convergence region with high probability (see Theorem 3.1 and Theorems B.1, B.2, and B.3). Furthermore, the worst-case convergence rate of GD for deep linear neural networks is almost the same as that for the original convex problem with a corresponding learning rate. We also demonstrate that the GD trajectories for deep linear neural networks are arbitrarily close to those for the convex problem; the precise statements are Remark 3, Theorem 3.2, Corollary 1, and Lemma 4.4 (see also Lemma D.5). The present study was inspired by recent work of Du & Hu (2019) and Hu et al. (2020), in which the authors carefully constructed upper and lower bounds on the eigenvalues of the Gram matrix along the GD trajectory and established linear convergence. In this paper, we generalize their results to strongly convex loss functions and layers of varying widths, and we obtain sharper results. We also show that our rate of convergence for GD on deep linear neural networks is sharp in the sense that it matches the worst-case convergence rate for the original convex problem. The trajectories of GD for deep linear neural networks and for the original convex problem (1) can be arbitrarily close. Furthermore, we show that if the width of each hidden layer is appropriately large, then the optimal rate depends neither on the type of random initialization nor on the network depth. Lastly, we elucidate the mechanism by which GD for overparameterized deep linear networks automatically avoids bad saddles.
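The trajectory comparison can be made concrete with a small simulation. The sketch below is illustrative only (squared loss, identity initialization, and arbitrary dimensions and learning rate, none taken from the paper): it runs GD on the convex problem with rate η* and GD on a depth-L factorization with per-factor rate η*/L, so that at initialization the end-to-end product takes the same step as the convex iterate to first order, and it records the largest gap between the two trajectories.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, L = 8, 20, 3                 # dims and depth (illustrative)
X = rng.standard_normal((n, m))
Y = rng.standard_normal((n, m))

def loss(P):
    return np.linalg.norm(P @ X - Y) ** 2 / (2 * m)

def grad(P):                       # gradient of the convex loss at P
    return (P @ X - Y) @ X.T / m

def prod(mats):                    # end-to-end product W_L ... W_1
    P = np.eye(n)
    for M in mats:
        P = M @ P
    return P

eta = 0.2                          # eta* for the convex problem (2)
W = np.eye(n)                      # convex iterate, W(0) = I
Ws = [np.eye(n) for _ in range(L)] # deep factors; their product is also I

max_gap = 0.0
for _ in range(300):
    W = W - eta * grad(W)
    G = grad(prod(Ws))
    grads = [prod(Ws[j + 1:]).T @ G @ prod(Ws[:j]).T for j in range(L)]
    Ws = [M - (eta / L) * g for M, g in zip(Ws, grads)]
    max_gap = max(max_gap, np.linalg.norm(prod(Ws) - W))

Wstar = Y @ np.linalg.pinv(X)      # unique minimizer of the convex loss
print(loss(W) - loss(Wstar), loss(prod(Ws)) - loss(Wstar), max_gap)
```

Both iterates approach the same global minimizer of the convex loss, and the printed gap indicates how closely the product trajectory shadows the convex one in this toy setting.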

2. PRELIMINARIES

2.1 PROBLEM SETUP

Let x ∈ R^{n_x} and y ∈ R^{n_y} be an input vector and a target vector, respectively. Define {(x_i, y_i)}_{i=1}^m as a training dataset of size m, and let X = [x_1, x_2, ..., x_m] ≠ 0 and Y = [y_1, y_2, ..., y_m]. Denote the weight parameters by W ∈ R^{n_y × n_x}. Consider the well-studied convex optimization problem

minimize_W  L(W) := (1/m) ∑_{i=1}^m l(W x_i, y_i).   (1)

GD for the convex problem (1) with learning rate η* is given by

W(t+1) = W(t) − η* ∇L(W(t)),   t = 0, 1, 2, ....   (2)

For any matrix A, let σ_max(A) and σ_min(A) denote the largest and smallest singular values of A, respectively. Here, we consider two matrix norms and one semi-norm for A:

∥A∥ := σ_max(A),   ∥A∥_F^2 := tr(AA^T),   ∥A∥_X := ∥A P_X∥_F,

where P_X = X(X^T X)^† X^T is the orthogonal projection matrix onto the column space of X, and (X^T X)^† is the Moore–Penrose inverse of X^T X.
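These quantities are straightforward to compute directly. A minimal NumPy sketch (the dimensions below are illustrative choices, not from the paper; we take m < n_x so that the column space of X is a proper subspace and the semi-norm differs from the Frobenius norm):

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, m = 10, 3, 4               # m < nx: col(X) is a proper subspace
X = rng.standard_normal((nx, m))   # columns are the inputs x_i
A = rng.standard_normal((ny, nx))

# P_X = X (X^T X)^dagger X^T, the orthogonal projection onto col(X).
PX = X @ np.linalg.pinv(X.T @ X) @ X.T

op_norm  = np.linalg.norm(A, 2)          # ||A||   = sigma_max(A)
fro_norm = np.linalg.norm(A, 'fro')      # ||A||_F
semi     = np.linalg.norm(A @ PX, 'fro') # ||A||_X, a semi-norm

# P_X is symmetric and idempotent, and ||.||_X only sees the action of
# A on col(X), so it is dominated by the Frobenius norm.
assert np.allclose(PX, PX.T) and np.allclose(PX @ PX, PX)
assert semi <= fro_norm
```

Since ∥·∥_X vanishes on any matrix whose rows are orthogonal to the column space of X, it is a semi-norm rather than a norm, which is exactly why the projection P_X appears in its definition.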

