Implicit Acceleration of Gradient Flow in Overparameterized Linear Models

Abstract

We study the implicit acceleration of gradient flow in over-parameterized two-layer linear models. We show that implicit acceleration emerges from a conservation law that constrains the dynamics to follow certain trajectories. More precisely, gradient flow preserves the difference of the Gramian matrices of the input and output weights, and we show that the amount of acceleration depends on both the magnitude of that difference (which is fixed at initialization) and the spectrum of the data. In addition, and generalizing prior work, we prove our results without assuming small, balanced, or spectral initialization of the weights, and establish interesting connections between the matrix factorization problem and Riccati-type differential equations.
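The conservation law stated above can be checked numerically. The sketch below is our own illustrative construction (all dimensions, the step size, and the iteration count are arbitrary choices, not taken from the paper): it approximates gradient flow on a two-layer linear model by small-step gradient descent on the squared loss and verifies that the Gramian difference W1 W1^T - W2^T W2 barely drifts from its value at initialization.

```python
# Numerical check of the conservation law for a two-layer linear model
# f(x) = W2 @ W1 @ x trained on the squared loss. Gradient flow preserves
# D = W1 @ W1.T - W2.T @ W2; here the flow is approximated by gradient
# descent with a small step size, so D should drift only by O(eta).
import numpy as np

rng = np.random.default_rng(0)
n, h, m = 5, 4, 3                        # input, hidden, output dimensions (illustrative)
X = rng.standard_normal((n, 20))         # 20 synthetic training inputs
Y = rng.standard_normal((m, 20))         # synthetic targets
W1 = rng.standard_normal((h, n))
W2 = rng.standard_normal((m, h))

def gramian_diff(W1, W2):
    return W1 @ W1.T - W2.T @ W2

D0 = gramian_diff(W1, W2)                # fixed at initialization under the flow
eta = 1e-5                               # small step approximates continuous time
for _ in range(20000):
    R = W2 @ W1 @ X - Y                  # residual
    g1 = W2.T @ R @ X.T                  # dL/dW1 for L = 0.5 * ||W2 W1 X - Y||_F^2
    g2 = R @ X.T @ W1.T                  # dL/dW2
    W1 -= eta * g1
    W2 -= eta * g2

drift = np.linalg.norm(gramian_diff(W1, W2) - D0) / np.linalg.norm(D0)
print(f"relative drift of W1 W1^T - W2^T W2: {drift:.2e}")  # stays small
```

Shrinking `eta` (while increasing the number of steps accordingly) makes the drift smaller still, consistent with exact conservation in the continuous-time limit.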

1. Introduction

Understanding over-parameterization in deep learning is a puzzling question. Contrary to the common belief that over-parameterization may hurt generalization and optimization, recent work suggests that over-parameterization may actually bias the optimization algorithm towards solutions that generalize well, a phenomenon known as implicit regularization or implicit bias, and may even accelerate convergence, a phenomenon known as implicit acceleration.

Recent work on the implicit bias in the over-parameterized regime (e.g., Gunasekar et al. (2018a;b); Chizat & Bach (2020); Ji & Telgarsky (2019b)) shows that gradient descent on unregularized problems finds minimum-norm solutions. For instance, Soudry et al. (2018); Ji & Telgarsky (2019a) analyze linear networks trained for binary classification on linearly separable data and show that the predictor converges to a max-margin solution. Similar ideas have been developed for matrix factorization, yielding solutions with minimum nuclear norm (Gunasekar et al., 2017; Li et al., 2018) or low rank (Arora et al., 2019a). It has also been shown that optimization methods which introduce multiplicative stochastic noise, such as dropout and dropblock, induce nuclear norm regularization (Cavazza et al., 2018) and spectral k-support norm regularization (Pal et al., 2020), respectively.

Recent work on the implicit acceleration of gradient descent for matrix factorization and deep linear networks (Arora et al., 2018) shows that when the initialization is sufficiently small and balanced (see Definition 2), over-parameterization acts as a pre-conditioning of the gradient that can be interpreted as a combination of momentum and an adaptive learning rate. The authors claim that acceleration for $\ell_p$-regression is possible only if $p > 2$, although there is no theory supporting such a claim. Saxe et al. (2014) focused on $\ell_2$-regression with balanced spectral initializations (see Definition 3) and similarly concluded that depth may actually slow down convergence. For two-layer linear networks, Saxe et al. (2019); Gidel et al. (2019) analyzed the dynamics of gradient flow and obtained explicit solutions under the assumption of vanishing spectral initialization, highlighting the sequential learning of the hierarchical components as a phenomenon that could improve generalization. Several recent papers (e.g., Arora et al. (2019b); Du & Hu (2019); Du et al. (2018b)) have also analyzed the convergence behaviour of gradient descent in the over-parameterized setting, particularly for very wide networks, and have established linear convergence when the initialization is Gaussian or balanced.

While a precise study of the connections between gradient descent and gradient flow dynamics for non-convex problems remains elusive, recent work (Franca et al., 2020) shows that discrete-time convergence rates can be derived from their continuous-time counterparts.
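As a minimal illustration of how a conserved quantity can speed up convergence, consider a toy scalar construction (our own sketch, not an experiment from any of the cited works): fitting a scalar target directly versus through the over-parameterized factorization w = a·b. Under gradient flow the imbalance a² − b² is conserved, and the residual of the factorized model contracts at rate a² + b² ≥ |a² − b²|, whereas the direct model contracts at rate 1; all initial values and the step size below are illustrative.

```python
# Toy scalar sketch (our construction): gradient descent on 0.5*(w - 2)^2 for a
# direct model w, versus the factorization w = a*b with an imbalanced start.
# Both predictors begin at the same value 0.3, but the factorized model's
# residual contracts roughly (a^2 + b^2) ~ 9 times faster per step.
eta, target, steps = 0.01, 2.0, 200

w = 0.3                          # direct model
a, b = 3.0, 0.1                  # factorized model: a*b = 0.3, a^2 - b^2 = 8.99
for _ in range(steps):
    w -= eta * (w - target)
    r = a * b - target           # shared residual of the factorized predictor
    a, b = a - eta * b * r, b - eta * a * r   # simultaneous gradient step

err_direct = abs(w - target)
err_deep = abs(a * b - target)
print(err_direct, err_deep)      # the factorized model ends far closer to the target
```

In this sketch the acceleration comes entirely from the magnitude of the conserved imbalance set at initialization, which is the mechanism the present paper quantifies for matrix-valued two-layer models.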

