SHARP CONVERGENCE ANALYSIS OF GRADIENT DESCENT FOR OVERPARAMETERIZED DEEP LINEAR NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

This paper presents sharp rates of convergence of the gradient descent (GD) method for overparameterized deep linear neural networks under different random initializations. The study touches upon a major open theoretical problem in machine learning: why are deep neural networks trained with GD methods effective in so many practical applications? While a solution to this problem is still beyond reach for general nonlinear deep neural networks, extensive efforts have been invested in studying related questions for deep linear neural networks, and many interesting results have been reported to date. For example, recent results on the loss landscape show that even though the loss function of a deep linear neural network is non-convex, every local minimizer is also a global minimizer. When the GD method is applied to train deep linear networks, its convergence behavior depends on the initialization. In this study, we obtain a sharp rate of convergence of GD for deep linear networks and demonstrate that this rate does not depend on the type of random initialization. Furthermore, we show that the depth of the network does not affect the optimal rate of convergence, provided the width of each hidden layer is sufficiently large. Finally, we explain why GD for an overparameterized deep linear network automatically avoids bad saddles.

1. INTRODUCTION

Deep linear neural networks, as a class of toy models, are frequently used to understand loss surfaces and gradient-based optimization methods for non-convex problems. Dauphin et al. (2014) and Choromanska et al. (2015a) explored the loss function of deep nonlinear networks using random matrix theory (specifically, a spherical spin-glass model). This theory essentially converts the loss surface of deep nonlinear neural networks into that of deep linear neural networks under certain assumptions, some of which are unrealistic. Choromanska et al. (2015b) posed an open problem: to establish a connection between the loss function of neural networks and the Hamiltonian of spherical spin-glass models under milder assumptions. Later, Kawaguchi (2016) successfully discarded most of these assumptions by analyzing the loss surface of deep linear neural networks. The landscape analysis for deep linear neural networks (Kawaguchi, 2016; Kawaguchi & Lu, 2017; Laurent & Brecht, 2018) focuses on several properties of the critical points: (i) every local minimum is a global minimum; (ii) every critical point that is not a local minimum is a saddle point; and (iii) if the network is deeper than three layers, there exists a saddle point at which all eigenvalues of the Hessian are zero. Thus, for deep linear neural networks, convergence to a global minimum is impeded by the existence of poor saddles. Lee et al. (2016) showed that the gradient method almost surely never converges to a strict saddle point, although the time required can depend exponentially on the dimension (Du et al., 2017). Gradient descent (GD) with perturbations (Ge et al., 2015; Jin et al., 2017) can find a local minimizer in polynomial time. Thus, the trajectory approach, combined with random initialization or a randomized algorithm, circumvents the obstacle posed by poor saddles.
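These landscape properties can be illustrated numerically. The sketch below is not taken from the paper; it is a minimal hypothetical setup in which plain GD trains a two-layer linear network from a small random initialization. The loss is non-convex in the factors (W1, W2), yet GD drives it to (near) zero, consistent with the claim that every local minimum is global and that random initialization avoids the strict saddle at the origin.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic regression task: targets come from a ground-truth
# linear map W_star (orthogonal here, so the problem is well-conditioned).
d, n = 5, 100
X = rng.standard_normal((d, n))
W_star, _ = np.linalg.qr(rng.standard_normal((d, d)))
Y = W_star @ X

# Two-layer linear network f(x) = W2 @ W1 @ x with hidden width h > d
# (overparameterized). The loss below is non-convex in (W1, W2), but every
# local minimum is a global minimum (Kawaguchi, 2016).
h = 10
W1 = 0.1 * rng.standard_normal((h, d))  # small random init: GD escapes
W2 = 0.1 * rng.standard_normal((d, h))  # the saddle at (0, 0)

lr = 0.05
for _ in range(5000):
    E = (W2 @ W1 @ X - Y) / n   # scaled residual
    g1 = W2.T @ E @ X.T         # dL/dW1
    g2 = E @ X.T @ W1.T         # dL/dW2
    W1 -= lr * g1
    W2 -= lr * g2

loss = 0.5 * np.sum((W2 @ W1 @ X - Y) ** 2) / n
print(f"final loss: {loss:.2e}")  # close to zero: GD found a global minimum
```

The small random initialization plays the role of the random initialization discussed above: started exactly at (W1, W2) = (0, 0), GD would remain stuck at that saddle, whereas a generic perturbation lets the trajectory escape and converge.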
According to studies on the continuous-time dynamics of gradient flow (Du et al., 2018; Arora et al., 2018b), the balance property of a deep linear network is preserved if the initialization is balanced. Arora et al. (2018a; b), Du & Hu (2019), and Hu et al. (2020) successfully proved that GD with its corresponding initialization schemes con-

