BEHIND THE SCENES OF GRADIENT DESCENT: A TRAJECTORY ANALYSIS VIA BASIS FUNCTION DECOMPOSITION

Abstract

This work analyzes the solution trajectory of gradient-based algorithms via a novel basis function decomposition. We show that, although solution trajectories of gradient-based algorithms may vary depending on the learning task, they behave almost monotonically when projected onto an appropriate orthonormal function basis. Such a projection gives rise to a basis function decomposition of the solution trajectory. Theoretically, we use our proposed basis function decomposition to establish the convergence of gradient descent (GD) on several representative learning tasks. In particular, we sharpen existing convergence guarantees for GD on symmetric matrix factorization and provide a completely new convergence result for orthogonal symmetric tensor decomposition. Empirically, we illustrate the promise of our proposed framework on realistic deep neural networks (DNNs) across different architectures, gradient-based solvers, and datasets. Our key finding is that gradient-based algorithms monotonically learn the coefficients of a particular orthonormal function basis of DNNs, defined as the eigenvectors of the conjugate kernel after training. Our code is available at github.com/jianhaoma/function-basis-decomposition.
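The decomposition described above can be illustrated in the simplest possible setting. The sketch below is not the paper's implementation: it replaces the DNN conjugate kernel with its linear-model analogue K = XXᵀ and runs plain GD on least squares, where each basis coefficient of the prediction trajectory provably converges monotonically to its limit (the factor per step is 1 − ηλᵢ in the i-th eigendirection). All variable names are illustrative.

```python
import numpy as np

# Minimal sketch of a basis function decomposition of the GD trajectory,
# assuming a linear model so that the conjugate kernel is K = X X^T.
rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)          # realizable targets

K = X @ X.T                             # conjugate kernel (linear-model analogue)
eigvals, V = np.linalg.eigh(K)          # orthonormal function basis (eigenvectors)

eta = 0.9 / eigvals.max()               # step size with eta * lambda_max < 1
theta = np.zeros(d)
coeffs = []                             # basis coefficients of f_t = X theta_t
for _ in range(200):
    preds = X @ theta
    coeffs.append(V.T @ preds)          # project the trajectory onto the basis
    theta -= eta * X.T @ (preds - y)    # gradient descent on least squares

coeffs = np.array(coeffs)               # shape (steps, n)
targets = V.T @ y                       # limiting coefficients
gaps = np.abs(coeffs - targets)
# Each coefficient approaches its limit monotonically, up to numerical noise:
assert np.all(np.diff(gaps, axis=0) <= 1e-9)
```

In the eigenbasis of K the prediction dynamics decouple, c_i(t+1) − c_i* = (1 − ηλ_i)(c_i(t) − c_i*), which is exactly the almost-monotone behavior the abstract refers to; for nonlinear models the paper's point is that a similar picture emerges empirically with the trained conjugate kernel.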

1. INTRODUCTION

Learning highly nonlinear models amounts to solving a nonconvex optimization problem, which is typically done via different variants of gradient descent (GD). But how does GD learn nonlinear models? Classical optimization theory asserts that, in the face of nonconvexity, GD and its variants may lack any meaningful optimality guarantee; they produce solutions that, while being first- or second-order optimal (Nesterov, 1998; Jin et al., 2017), may not be globally optimal. In the rare event where GD does recover a globally optimal solution, the recovered solution may correspond to an overfitted model rather than one with desirable generalization. Inspired by the large empirical success of gradient-based algorithms in learning complex models, recent work has postulated that typical training losses have benign landscapes: they are devoid of spurious local minima and their global solutions coincide with true solutions, i.e., solutions corresponding to the true model. For instance, different variants of low-rank matrix factorization (Ge et al., 2016; 2017) and deep linear NNs (Kawaguchi, 2016) have benign landscapes. However, when spurious solutions do exist (Safran & Shamir, 2018) or global and true solutions do not coincide (Ma & Fattahi, 2022b), such a holistic view of the optimization landscape cannot explain the success of gradient-based algorithms.

To address this issue, another line of research has focused on analyzing the solution trajectory of different algorithms. Analyzing the solution trajectory has been shown to be extremely powerful in sparse recovery (Vaskevicius et al., 2019), low-rank matrix factorization (Li et al., 2018; Stöger & Soltanolkotabi, 2021), and linear DNNs (Arora et al., 2018; Ma & Fattahi, 2022a). However, these analyses are tailored to specific models and thereby cannot be directly generalized. In this work, we propose a unifying framework for analyzing the optimization trajectory of GD based on a novel basis function decomposition.
We show that, although the dynamics of GD may vary

