BEHIND THE SCENES OF GRADIENT DESCENT: A TRAJECTORY ANALYSIS VIA BASIS FUNCTION DECOMPOSITION

Abstract

This work analyzes the solution trajectory of gradient-based algorithms via a novel basis function decomposition. We show that, although solution trajectories of gradient-based algorithms may vary depending on the learning task, they behave almost monotonically when projected onto an appropriate orthonormal function basis. Such a projection gives rise to a basis function decomposition of the solution trajectory. Theoretically, we use our proposed basis function decomposition to establish the convergence of gradient descent (GD) on several representative learning tasks. In particular, we improve the convergence rate of GD on symmetric matrix factorization and provide a completely new convergence result for orthogonal symmetric tensor decomposition. Empirically, we illustrate the promise of our proposed framework on realistic deep neural networks (DNNs) across different architectures, gradient-based solvers, and datasets. Our key finding is that gradient-based algorithms monotonically learn the coefficients of a particular orthonormal function basis of DNNs defined as the eigenvectors of the conjugate kernel after training. Our code is available at github.com/jianhaoma/function-basis-decomposition.

1. INTRODUCTION

Learning highly nonlinear models amounts to solving a nonconvex optimization problem, which is typically done via different variants of gradient descent (GD). But how does GD learn nonlinear models? Classical optimization theory asserts that, in the face of nonconvexity, GD and its variants may lack any meaningful optimality guarantee; they produce solutions that, while being first- or second-order optimal (Nesterov, 1998; Jin et al., 2017), may not be globally optimal. In the rare event that GD does recover a globally optimal solution, the recovered solution may correspond to an overfitted model rather than one with desirable generalization. Inspired by the large empirical success of gradient-based algorithms in learning complex models, recent work has postulated that typical training losses have benign landscapes: they are devoid of spurious local minima and their global solutions coincide with true solutions, i.e., solutions corresponding to the true model. For instance, different variants of low-rank matrix factorization (Ge et al., 2016; 2017) and deep linear NNs (Kawaguchi, 2016) have benign landscapes. However, when spurious solutions do exist (Safran & Shamir, 2018) or global and true solutions do not coincide (Ma & Fattahi, 2022b), such a holistic view of the optimization landscape cannot explain the success of gradient-based algorithms. To address this issue, another line of research has focused on analyzing the solution trajectory of different algorithms. Analyzing the solution trajectory has been shown to be extremely powerful in sparse recovery (Vaskevicius et al., 2019), low-rank matrix factorization (Li et al., 2018; Stöger & Soltanolkotabi, 2021), and linear DNNs (Arora et al., 2018; Ma & Fattahi, 2022a). However, these analyses are tailored to specific models and therefore cannot be directly generalized. In this work, we propose a unifying framework for analyzing the optimization trajectory of GD based on a novel basis function decomposition.
We show that, although the dynamics of GD may vary depending on the learning task, they behave almost monotonically when projected onto an appropriate orthonormal function basis. The first row of Figure 1 shows the top-5 coefficients of the solution trajectory when projected onto a randomly generated orthonormal basis. We see that the trajectories of the coefficients are highly non-monotonic and almost indistinguishable (they range between -0.04 and 0.06), implying that the energy of the obtained model is spread out over different orthogonal components. The second row of Figure 1 shows the same trajectory after projecting onto an orthonormal basis defined as the eigenvectors of the conjugate kernel after training (Long, 2021) (see Section 3.4 and Appendix B for more details). Unlike the previous case, the top-5 coefficients carry more energy and behave monotonically (modulo the small fluctuations induced by the stochasticity in the algorithm) in all three architectures, until they plateau around their steady state. In other words, the algorithm behaves more monotonically after projecting onto a correct choice of orthonormal basis.
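The projection described above can be sketched numerically. The snippet below is a hedged toy illustration, not the paper's implementation: the output trajectory and the PSD kernel surrogate are synthetic assumptions. It builds an orthonormal basis from the eigenvectors of a kernel matrix and projects a simulated trajectory onto it, producing coefficient curves that move monotonically toward their steady-state values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a training trajectory: f_traj[t] holds the model's
# outputs on a fixed evaluation set at epoch t (all names illustrative).
T, n = 50, 200
target = rng.standard_normal(n)

K = rng.standard_normal((n, n))
K = K @ K.T                       # PSD surrogate for the conjugate kernel
eigvals, V = np.linalg.eigh(K)    # columns of V form an orthonormal basis

# Simulate outputs converging to the target at eigenvalue-dependent speeds.
c_star = V.T @ target
rates = eigvals / eigvals.max()
f_traj = np.stack(
    [V @ (c_star * (1 - (1 - 0.9 * rates) ** t)) for t in range(T)]
)

# Basis function decomposition: coefficients of each iterate in the basis V.
coeffs = f_traj @ V               # coeffs[t, i] = <f_t, v_i>, shape (T, n)

# Each coefficient moves monotonically toward its steady-state value c_star[i].
top = np.argsort(np.abs(c_star))[-5:]
for i in top:
    assert np.all(np.diff(coeffs[:, i]) * np.sign(c_star[i]) >= -1e-9)
```

In the paper's experiments the analogous basis comes from the conjugate kernel of the trained network rather than a random PSD matrix; the point of the sketch is only the mechanics of the projection and the resulting monotone coefficient curves.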

1.1. MAIN CONTRIBUTIONS

The monotonicity of the projected solution trajectory motivates the use of an appropriate basis function decomposition to analyze the behavior of gradient-based algorithms. In this paper, we show how an appropriate basis function decomposition can be used to provide a much simpler convergence analysis for gradient-based algorithms on several representative learning problems, from simple kernel regression to complex DNNs. Our main contributions are summarized below:

- Global convergence of GD via basis function decomposition: We prove that GD learns the coefficients of an appropriate function basis that forms the true model. In particular, we show that GD learns the true model when applied to the expected ℓ2-loss under certain gradient independence and gradient dominance conditions. Moreover, we characterize the convergence rate of GD, identifying conditions under which it enjoys a linear or sublinear convergence rate.
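The simplest instance of this phenomenon can be sketched in a few lines. The snippet below is an illustrative toy (the data, dimensions, and step size are assumptions for the sketch, not the paper's setting): for GD on a quadratic ℓ2-loss, projecting the error onto the eigenvectors of the Hessian decouples the dynamics, and each coefficient of the true model is learned monotonically at a linear rate.

```python
import numpy as np

rng = np.random.default_rng(1)

# GD on the quadratic loss L(w) = ||X w - y||^2 / (2m) with y = X w*.
m, d = 500, 10
X = rng.standard_normal((m, d))
w_star = rng.standard_normal(d)
y = X @ w_star

H = X.T @ X / m                   # Hessian of the loss
eigvals, V = np.linalg.eigh(H)    # orthonormal eigenbasis of the Hessian
eta = 1.0 / eigvals.max()         # step size ensuring contraction

# In this basis the error coefficients decouple:
#   c_{t+1, i} = (1 - eta * lambda_i) * c_{t, i},
# so each |c_{t, i}| decays geometrically (linear convergence).
w = np.zeros(d)
errs = []
for _ in range(200):
    errs.append(V.T @ (w - w_star))        # error coefficients in the basis
    w -= eta * X.T @ (X @ w - y) / m       # gradient step

errs = np.abs(np.array(errs))
# Every projected error coefficient is non-increasing along the trajectory.
assert np.all(np.diff(errs, axis=0) <= 1e-9)
```

The sublinear regimes in the paper arise in settings where this decoupling is only approximate; the quadratic case above is the cleanest example of the gradient dominance condition holding along every basis direction.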



Figure 1: The solution trajectories of LARS on AlexNet and ResNet-18 and AdamW on ViT with ℓ2-loss after projecting onto two different orthonormal bases. The first row shows the trajectories of the top-5 coefficients after projecting onto a randomly generated orthonormal basis. The second row shows the trajectories of the top-5 coefficients after projecting onto the eigenvectors of the conjugate kernel evaluated at the last epoch. More details on our implementation can be found in Appendix B.

