ON THE CONVERGENCE OF GRADIENT FLOW ON MULTI-LAYER LINEAR MODELS

Anonymous authors
Paper under double-blind review

Abstract

In this paper, we analyze the convergence of gradient flow on a multi-layer linear model with a loss function of the form f(W_1 W_2 ⋯ W_L). We show that when f satisfies the gradient dominance property, proper weight initialization leads to exponential convergence of the gradient flow to a global minimum of the loss. Moreover, the convergence rate depends on two trajectory-specific quantities that are controlled by the weight initialization: the imbalance matrices, which measure the difference between the weights of adjacent layers, and the least singular value of the weight product W = W_1 W_2 ⋯ W_L. Our analysis provides improved rate bounds for several multi-layer network models studied in the literature, leading to novel characterizations of the effect of weight imbalance on the rate of convergence. Our results apply to most regression losses and extend to classification losses.

1. INTRODUCTION

The mysterious ability of gradient-based optimization algorithms to solve the non-convex neural network training problem is one of the many unexplained puzzles behind the success of deep learning in various applications (Krizhevsky et al., 2012; Hinton et al., 2012; Silver et al., 2016). A vast body of work has tried to theoretically understand this phenomenon by analyzing either the loss landscape or the dynamics of the training parameters.

The landscape-based analysis is motivated by the empirical observation that deep neural networks used in practice often have a benign landscape (Li et al., 2018a), which can facilitate convergence. Existing theoretical analysis (Lee et al., 2016; Sun et al., 2015; Jin et al., 2017) shows that gradient descent converges when the loss function satisfies the following properties: 1) all local minima are global minima; and 2) every saddle point has a Hessian with at least one strictly negative eigenvalue. Prior work suggests that the matrix factorization model (Ge et al., 2017), shallow networks (Kawaguchi, 2016), and certain positively homogeneous networks (Haeffele & Vidal, 2015; 2017) have such a landscape property, but unfortunately condition 2) does not hold for networks with multiple hidden layers (Kawaguchi, 2016). Moreover, the landscape-based analysis generally fails to provide a good characterization of the convergence rate, except for a local rate around the equilibrium (Lee et al., 2016; Ge et al., 2017). In fact, during the early stages of training, gradient descent can take exponential time to escape some saddle points if not initialized properly (Du et al., 2017).

The trajectory-based analyses study the training dynamics of the weights given a specific initialization. For example, the case of small initialization has been studied for various models (Arora et al., 2019a; Gidel et al., 2019; Li et al., 2018b; Stöger & Soltanolkotabi, 2021; Li et al., 2021b;a).
Under this type of initialization, the trained model is implicitly biased towards low-rank (Arora et al., 2019a; Gidel et al., 2019; Li et al., 2018b; Stöger & Soltanolkotabi, 2021; Li et al., 2021b) and sparse (Li et al., 2021a) models. While the analysis for small initialization gives rich insights into the generalization of neural networks, the number of iterations required for gradient descent to find a good model often increases as the initialization scale decreases. Such dependence proves to be logarithmic in the scale for the symmetric matrix factorization model (Li et al., 2018b; Stöger & Soltanolkotabi, 2021; Li et al., 2021b), but for deep networks, existing analysis at best shows a polynomial dependence (Li et al., 2021a). Therefore, the analysis for small initialization, while insightful for understanding the implicit bias of neural network training, is not suitable for understanding training efficiency in practice, since small initialization is rarely implemented due to its slow convergence.

Another line of work studies initialization in the kernel regime, where a randomly initialized, sufficiently wide neural network can be well approximated by its linearization at initialization (Jacot et al., 2018; Chizat et al., 2019; Arora et al., 2019b). In this regime, gradient descent enjoys a linear rate of convergence toward the global minimum (Du et al., 2019; Allen-Zhu et al., 2019; Du & Hu, 2019). However, the width requirement in the analysis is often unrealistic, and empirical evidence has shown that practical neural networks generally do not operate in the kernel regime (Chizat et al., 2019).

The study of non-small, non-kernel-regime initialization has been mostly centered around linear models. For matrix factorization models, spectral initialization (Saxe et al., 2014; Gidel et al., 2019; Tarmoun et al., 2021) allows for decoupling the training dynamics into several scalar dynamics.
For non-spectral initialization, the notion of weight imbalance, a quantity that depends on the differences between the weight matrices of adjacent layers, is crucial in most analyses. When the initialization is balanced, i.e., when the imbalance matrices are zero, the convergence relies on the initial end-to-end linear model being close to its optimum (Arora et al., 2018a;b). It has been shown that a non-zero imbalance can potentially improve the convergence rate (Tarmoun et al., 2021; Min et al., 2021), but the analysis only works for two-layer models. For deep linear networks, the effect of weight imbalance on convergence has only been studied in the case where all imbalance matrices are positive semi-definite (Yun et al., 2020), which is often unrealistic in practice. Lastly, most of the aforementioned analyses study the ℓ_2 loss for regression tasks, and it remains unknown whether they can be generalized to other types of losses commonly used in classification tasks.

Our contribution: This paper aims to provide a general framework for analyzing the convergence of gradient flow on multi-layer linear models. We consider the gradient flow on a loss function of the form L = f(W_1 W_2 ⋯ W_L), where f satisfies the gradient dominance property. We show that with proper initialization, the loss converges to its global minimum exponentially. More specifically:

• Our analysis shows that the convergence rate depends on two trajectory-specific quantities: 1) the imbalance matrices, which measure the difference between the weights of adjacent layers, and 2) a lower bound on the least singular value of the weight product W = W_1 W_2 ⋯ W_L. The former is time-invariant under gradient flow, thus fully determined by the initialization, while the latter can be controlled by initializing the product sufficiently close to its optimum.
• Our analysis covers most initialization schemes used in prior work (Saxe et al., 2014; Tarmoun et al., 2021; Arora et al., 2018a;b; Min et al., 2021; Yun et al., 2020) for both multi-layer linear networks and diagonal linear networks, while providing convergence guarantees for a wider range of initializations. Furthermore, our rate bounds characterize the general effect of weight imbalance on convergence.

• Our convergence results directly apply to loss functions commonly used in regression tasks, and can be extended to loss functions used in classification tasks with an alternative assumption on f, under which we show O(1/t) convergence of the loss.

Notations: For an n × m matrix A, we let A^T denote the transpose of A, let σ_i(A) denote its i-th singular value in decreasing order, conveniently write σ_min(A) = σ_{min{n,m}}(A), and let σ_k(A) = 0 if k > min{n, m}. We also let ∥A∥_2 = σ_1(A) and ∥A∥_F = √(tr(A^T A)). For a square matrix A of size n, we let tr(A) denote its trace and let diag{a_i}_{i=1}^n be the diagonal matrix with a_i as its i-th diagonal entry. For a Hermitian matrix A of size n, we let λ_i(A) denote its i-th eigenvalue and write A ⪰ 0 (A ⪯ 0) when A is positive semi-definite (negative semi-definite). For two square matrices A, B of the same size, we let ⟨A, B⟩_F = tr(A^T B). For a scalar- or matrix-valued function of time F(t), we write Ḟ, Ḟ(t), or (d/dt)F(t) for its time derivative. Additionally, we use I_n to denote the identity matrix of order n and O(n) to denote the set of n × n orthogonal matrices. Lastly, we use [·]_+ := max{·, 0}.
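The time-invariance of the imbalance matrices under gradient flow can be checked numerically. The sketch below uses a two-layer model with a quadratic loss and the sign convention D = W_1^T W_1 − W_2 W_2^T; these choices, and the forward-Euler discretization, are illustrative assumptions, not the paper's exact setting. Under gradient flow D is exactly conserved, so a small-step discretization should show only a small drift while the loss decays to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, m = 4, 5, 3

# Two-layer linear model W = W1 @ W2 with quadratic loss
# f(W) = 0.5 * ||W - A||_F^2 (a gradient-dominated f).
A = rng.standard_normal((n, m))
W1 = rng.standard_normal((n, h))
W2 = rng.standard_normal((h, m))

def imbalance(W1, W2):
    # One common sign convention: D = W1^T W1 - W2 W2^T.
    return W1.T @ W1 - W2 @ W2.T

D0 = imbalance(W1, W2)

# Forward-Euler discretization of the gradient flow with a small step.
eta, steps = 1e-3, 50000
for _ in range(steps):
    g = W1 @ W2 - A  # gradient of f at the product W = W1 W2
    W1, W2 = W1 - eta * g @ W2.T, W2 - eta * W1.T @ g

drift = np.linalg.norm(imbalance(W1, W2) - D0)
final_loss = 0.5 * np.linalg.norm(W1 @ W2 - A, "fro") ** 2

print(f"imbalance drift: {drift:.3e}")      # small relative to ||D0||
print(f"final loss:      {final_loss:.3e}")  # near zero
```

The drift is nonzero only because of the discretization error of the Euler step; shrinking eta (and increasing steps accordingly) shrinks it further.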

2. OVERVIEW OF THE ANALYSIS

This paper considers the problem of finding a matrix W that solves min_{W ∈ ℝ^{n×m}} f(W), under the following assumption on f.
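The formal statement of the assumption does not survive in this excerpt, but the abstract identifies it as the gradient dominance (Polyak–Łojasiewicz) property. For a gradient flow taken directly on W, that property yields exponential convergence via a one-line Grönwall argument; a sketch (in the multi-layer case, the paper's rates replace the constant μ by trajectory-dependent quantities involving the imbalance matrices and σ_min(W)):

```latex
% Gradient dominance (PL): \|\nabla f(W)\|_F^2 \ge 2\mu\,\big(f(W) - f^\star\big) for some \mu > 0.
% Along the gradient flow \dot{W} = -\nabla f(W):
\frac{d}{dt}\, f\big(W(t)\big)
  = \big\langle \nabla f(W), \dot{W} \big\rangle_F
  = -\big\|\nabla f(W)\big\|_F^2
  \le -2\mu\,\big(f(W(t)) - f^\star\big),
% so Gronwall's inequality gives exponential convergence:
f\big(W(t)\big) - f^\star \;\le\; \big(f(W(0)) - f^\star\big)\, e^{-2\mu t}.
```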




