Implicit Acceleration of Gradient Flow in Overparameterized Linear Models

Abstract

We study the implicit acceleration of gradient flow in over-parameterized two-layer linear models. We show that implicit acceleration emerges from a conservation law that constrains the dynamics to follow certain trajectories. More precisely, gradient flow preserves the difference of the Gramian matrices of the input and output weights, and we show that the amount of acceleration depends on both the magnitude of that difference (which is fixed at initialization) and the spectrum of the data. In addition, generalizing prior work, we prove our results without assuming small, balanced, or spectral initialization for the weights, and establish interesting connections between the matrix factorization problem and Riccati-type differential equations.

1. Introduction

Understanding the role of over-parameterization in deep learning remains a puzzling question. Contrary to the common belief that over-parameterization may hurt generalization and optimization, recent work suggests that over-parameterization may actually bias the optimization algorithm towards solutions that generalize well, a phenomenon known as implicit regularization or implicit bias, and may even accelerate convergence, a phenomenon known as implicit acceleration. Recent work on the implicit bias in the over-parameterized regime (e.g., Gunasekar et al. (2018a;b); Chizat & Bach (2020); Ji & Telgarsky (2019b)) shows that gradient descent on unregularized problems finds minimum-norm solutions. For instance, Soudry et al. (2018) and Ji & Telgarsky (2019a) analyze linear networks trained for binary classification on linearly separable data and show that the predictor converges to a max-margin solution. Similar ideas have been developed for matrix factorization, yielding solutions with minimum nuclear norm (Gunasekar et al., 2017; Li et al., 2018) or low rank (Arora et al., 2019a). It has also been shown that optimization methods which introduce multiplicative stochastic noise, such as dropout and dropblock, induce nuclear norm regularization (Cavazza et al., 2018) and spectral k-support norm regularization (Pal et al., 2020), respectively.

Recent work on the implicit acceleration of gradient descent for matrix factorization and deep linear networks (Arora et al., 2018) shows that when the initialization is sufficiently small and balanced (see Definition 2), over-parameterization acts as a pre-conditioning of the gradient that can be interpreted as a combination of momentum and an adaptive learning rate. They claim that acceleration for $\ell_p$ regression is possible only if $p > 2$, though there is no theory supporting this claim.
Several recent papers (e.g., Arora et al. (2019b); Du & Hu (2019); Du et al. (2018b)) have also analyzed the convergence behavior of gradient descent in the over-parameterized setting, particularly for very wide networks, and have established linear convergence when the initialization is Gaussian or balanced. While a precise study of the connections between gradient descent and gradient flow dynamics for non-convex problems remains elusive, recent work (Franca et al., 2020) shows that discrete-time convergence rates can be derived from their continuous-time counterparts via symplectic integrators. Therefore, our work focuses on the analysis of gradient flow as a stepping stone for future analyses of gradient descent.

In this paper, we present a new analysis of the implicit acceleration of gradient flow for over-parameterized two-layer neural networks that applies not only in the case of small, balanced, or spectral initialization but also extends to imbalanced and non-spectral initializations. We show that the key reason for the implicit acceleration of gradient flow is the existence of a conservation law that constrains the dynamics to follow a particular path.¹ More precisely, the quantity preserved by gradient flow is the difference of the Gramians of the input and output weight matrices, which in turn implies that the difference of the squared norms of the weight matrices is preserved. The particular case where this difference is zero corresponds to the case of balanced weights, but in the more general case of imbalanced weights the difference also emerges as a conserved quantity and plays an important role. In particular, we show that acceleration can occur even in the case of $\ell_2$ regression as a result of imbalanced initialization. The reason this phenomenon was not previously observed in (Saxe et al., 2014; 2019; Gidel et al., 2019) is precisely the assumption of balanced initialization, which follows as a particular case of our analysis. Our work also establishes interesting connections with Riccati-type differential equations.
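To give a concrete flavor of this phenomenon, consider the scalar version of $\ell_2$ regression with a two-layer model, fitting a target $y$ by the product $uv$. Gradient flow on $\tfrac{1}{2}(y - uv)^2$ conserves $u^2 - v^2$, and a large imbalance speeds up convergence. The following sketch (our own illustrative code, not from the paper; the step size, target, and tolerance are arbitrary choices) integrates the flow with explicit Euler steps and compares a small balanced initialization against an imbalanced one:

```python
def steps_to_fit(u0, v0, y=1.0, h=1e-3, tol=1e-3, max_steps=200_000):
    """Euler-integrate the gradient flow of L(u, v) = 0.5 * (y - u*v)**2
    and return the number of steps until |y - u*v| < tol."""
    u, v = u0, v0
    for k in range(max_steps):
        r = y - u * v                          # residual
        if abs(r) < tol:
            return k
        u, v = u + h * r * v, v + h * r * u    # simultaneous Euler update
    return max_steps

# Balanced small initialization (u0 = v0): slow escape from the saddle at 0.
balanced = steps_to_fit(0.1, 0.1)
# Imbalanced initialization (u0 >> v0): the conserved gap u^2 - v^2 acts as
# a large effective learning rate for v, accelerating convergence.
imbalanced = steps_to_fit(3.0, 0.1)
print(balanced, imbalanced)
```

With these (arbitrary) values the imbalanced run reaches the tolerance in far fewer Euler steps than the balanced one, consistent with the claim that imbalance alone can accelerate $\ell_2$ regression.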
Indeed, some of our results have a similar flavor to those in (Fukumizu, 1998), while others are more general and provide an explicit characterization of the continuous-time convergence rate.

In short, our work makes the following contributions.

1. In Section 2, we analyze the implicit acceleration properties of gradient flow for symmetric matrix factorization, providing a closed-form solution and a convergence rate that depends on the eigenvalues of the data, without the assumptions of spectral and small initialization.

2. In Section 3, we analyze the implicit acceleration properties of gradient flow for asymmetric matrix factorization with spectral initialization. We show that implicit acceleration emerges as a consequence of conservation laws that appear only in over-parameterized settings, due to an underlying rotational symmetry.

3. In Section 4, we analyze the implicit acceleration properties of gradient flow for asymmetric matrix factorization with an arbitrary initialization. We make connections with Riccati differential equations, obtaining a more general characterization of the convergence rate and establishing an interesting link with explicit regularization.
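The conservation law at the heart of these results is easy to check numerically. For a two-layer linear model $W = U V^T$ trained by gradient flow on $\tfrac{1}{2}\|Y - U V^T\|_F^2$, the difference of the Gramians, $U^T U - V^T V$, stays constant along the trajectory. The sketch below (illustrative code with arbitrary dimensions, step size, and seed; not from the paper) integrates the flow with small explicit Euler steps from an imbalanced, non-spectral initialization and measures the drift of this quantity:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 4, 6, 3
Y = rng.standard_normal((m, n))
# Arbitrary (imbalanced, non-spectral) initialization.
U = rng.standard_normal((m, k))          # output weights
V = 0.3 * rng.standard_normal((n, k))    # input weights

def gramian_gap(U, V):
    # Candidate conserved quantity: difference of the Gramians.
    return U.T @ U - V.T @ V

h, n_steps = 1e-4, 20_000
Q0 = gramian_gap(U, V)
loss0 = 0.5 * np.linalg.norm(Y - U @ V.T) ** 2
for _ in range(n_steps):
    E = Y - U @ V.T                      # residual
    U, V = U + h * E @ V, V + h * E.T @ U
drift = np.linalg.norm(gramian_gap(U, V) - Q0)
loss1 = 0.5 * np.linalg.norm(Y - U @ V.T) ** 2
```

The continuous-time flow conserves the quantity exactly; explicit Euler only preserves it up to $O(h^2)$ error per step, so the measured drift shrinks as the step size decreases.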

2. Gradient Flow Dynamics for Symmetric Matrix Factorization

In this section, we analyze and compare the dynamics of gradient flow,²
$$\dot{X}(t) = -\nabla \mathcal{L}(X(t)), \qquad (1)$$
when applied to two problems. The first one is learning a symmetric one-layer linear model
$$\min_{X \in \mathbb{R}^{m \times m}} \mathcal{L}(X) \equiv \tfrac{1}{2} \|Y - X\|_F^2, \qquad (2)$$
where $Y \in \mathbb{R}^{m \times m}$ is a given data matrix that one wishes to approximate by $X \in \mathbb{R}^{m \times m}$. The second one is learning its over-parameterized symmetric matrix factorization counterpart
$$\min_{U \in \mathbb{R}^{m \times k}} \mathcal{L}(U) \equiv \tfrac{1}{2} \|Y - U U^T\|_F^2. \qquad (3)$$
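As a sanity check on the two problems above, one can integrate both flows with explicit Euler steps (i.e., gradient descent with a small step size). For problem (2) the flow is $\dot{X} = Y - X$, so the residual decays as $e^{-t}$ independently of the data, while for problem (3) the flow is $\dot{U} = 2(Y - U U^T) U$ when $Y$ is symmetric. The sketch below (our own illustrative code; the dimensions, step size, and seed are arbitrary) compares the two:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 5, 5
A = rng.standard_normal((m, m))
Y = A @ A.T            # symmetric PSD target, exactly representable as U U^T

h, T = 1e-3, 1.0
n_steps = int(T / h)

# One-layer model (2): Xdot = Y - X, hence ||Y - X(t)||_F = e^{-t} ||Y - X(0)||_F.
X = np.zeros((m, m))
r0 = np.linalg.norm(Y - X)
for _ in range(n_steps):
    X = X + h * (Y - X)
decay = np.linalg.norm(Y - X) / r0       # should be close to e^{-T}

# Factorized model (3): Udot = 2 (Y - U U^T) U for symmetric Y.
U = 0.1 * rng.standard_normal((m, k))
loss_start = 0.5 * np.linalg.norm(Y - U @ U.T) ** 2
for _ in range(n_steps):
    U = U + 2.0 * h * (Y - U @ U.T) @ U
loss_end = 0.5 * np.linalg.norm(Y - U @ U.T) ** 2
```

Note the contrast: the one-layer residual decays at the fixed rate $e^{-t}$ regardless of $Y$, whereas the factorized flow fits different directions at rates governed by the eigenvalues of $Y$, which is the data-dependent behavior analyzed in this section.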



¹ A quantity $Q(x(t))$ is said to be conserved by the flow $\dot{x}(t) = f(x(t))$ if it remains constant through the dynamical evolution, i.e., $\frac{d}{dt} Q(x(t)) = 0$. For example, in mechanics the sum of the potential and kinetic energies remains constant for a conservative system. A conservation law is usually a consequence of an underlying symmetry (Noether's theorem). In optimization, this can be seen as a constraint $Q(x) = Q_0$ that is automatically satisfied without having to enforce it explicitly.
² Gradient descent, $X_{k+1} = X_k - \eta \nabla \mathcal{L}(X_k)$, is simply an explicit Euler discretization of (1).



Relationships between our work and the state of the art. Saxe et al. (2014) focused on $\ell_2$ regression with balanced spectral initializations (see Definition 3) and similarly concluded that depth may actually slow down convergence. For two-layer linear networks, Saxe et al. (2019) and Gidel et al. (2019) analyzed the dynamics of gradient flow and obtained explicit solutions under the assumption of vanishing spectral initialization, highlighting the sequential learning of the hierarchical components as a phenomenon that could improve generalization.

