GRADIENT FLOW IN THE GAUSSIAN COVARIATE MODEL: EXACT SOLUTION OF LEARNING CURVES AND MULTIPLE DESCENT STRUCTURES

Anonymous authors
Paper under double-blind review

Abstract

A recent line of work has shown remarkable behaviors of the generalization error curve in simple learning models. Even least-squares regression exhibits atypical features such as model-wise double descent, and further works have observed triple or multiple descents. Another important characteristic is the epoch-wise descent structure which emerges during training. The observations of model-wise and epoch-wise descents have been analytically derived in limited theoretical settings (such as the random feature model) and are otherwise experimental. In this work, we provide a full and unified analysis of the whole time-evolution of the generalization curve, in the asymptotic large-dimensional regime and under gradient flow, within a wider theoretical setting stemming from a Gaussian Covariate model. In particular, we cover most cases already disparately observed in the literature, and also provide examples of multiple descent structures as a function of a model parameter or of time. Furthermore, we show that our theoretical predictions adequately match the learning curves obtained by gradient descent over realistic datasets. Technically, we compute averages of rational expressions involving random matrices using recent developments in random matrix theory based on "linear pencils". Another contribution, also of independent interest in random matrix theory, is a new derivation of related fixed-point equations (and an extension thereof) using Dyson Brownian motions.

1. INTRODUCTION

1.1 PRELIMINARIES

With growing computational resources, it has become customary for machine learning models to use a huge number of parameters (billions of parameters in Brown et al. (2020)), and the need for scaling laws has become of utmost importance (Hoffmann et al. (2022)). It is therefore of great relevance to study the asymptotic (or "thermodynamic") limit of simple models, in which the numbers of parameters and data samples are sent to infinity. A landmark progress made by considering these theoretical limits is the analytical (oftentimes rigorous) calculation of precise double-descent curves for the generalization error, starting with Belkin et al. (2020); Hastie et al. (2019); Mei & Montanari (2019); Advani et al. (2020); d'Ascoli et al. (2020); Gerace et al. (2020); Deng et al. (2021); Kini & Thrampoulidis (2020), confirming in a precise (albeit limited) theoretical setting the experimental phenomenon initially observed in Belkin et al. (2019); Geiger et al. (2019); Spigler et al. (2019); Nakkiran et al. (2020a). Further derivations of triple or even multiple descents for the generalization error have also been performed in d'Ascoli et al. (2020); Nakkiran et al. (2020b); Chen et al. (2021); Richards et al. (2021); Wu & Xu (2020). Other aspects of multiple descents have been explored in Lin & Dobriban (2021); Adlam & Pennington (2020b), and also for the neural tangent kernel in Adlam & Pennington (2020a). The tools in use come from modern random matrix theory (Pennington & Worah (2017); Rashidi Far et al. (2006); Mingo & Speicher (2017)) and from statistical physics methods such as the replica method (Engel & Van den Broeck (2001)).

In this paper we are concerned with a line of research dedicated to the precise time-evolution of the generalization error under gradient flow, corroborating, among other things, the presence of epoch-wise descent structures (Crisanti & Sompolinsky (2018); Bodin & Macris (2021)) observed in Nakkiran et al. (2020a). We consider the gradient-flow dynamics for the training and generalization errors in the setting of a Gaussian Covariate model, and develop analytical methods to track the whole time evolution. In particular, at infinite times we recover the predictions for the least-squares estimator, which have been thoroughly described in a similar model by Loureiro et al. (2021).

In the next paragraphs we set up the model, together with a list of special realizations, and describe our main contributions.

1.2 MODEL DESCRIPTION

Generative Data Model: In this paper, we use the so-called Gaussian Covariate model in a teacher-student setting. An observation in our data model is defined through the realization of a Gaussian vector z ∼ N(0, (1/d) I_d). The teacher and the student obtain their observations (or two different views of the world) through the vectors x̃ ∈ R^{p_B} and x ∈ R^{p_A} respectively, which are given by the application of two linear operations on z. In other words, there exist two matrices B ∈ R^{d×p_B} and A ∈ R^{d×p_A} such that x̃ = B^T z and x = A^T z. Note that the generated data can also be seen as the output of a generative one-layer linear network. In the following, the structure of A and B is quite general as long as it remains independent of the realization of z: the matrices may be random matrices or block matrices of different natures and structures, so as to capture more sophisticated models. While the models we treat are defined through appropriate A and B, we will often only need the structure of U = AA^T and V = BB^T. A direct connection can be made with the Gaussian Covariate model described in Loureiro et al. (2021), which suggests considering directly the observations x̄ = (x^T, x̃^T)^T ∼ N(0, Σ) for a given covariance structure Σ. The spectral theorem provides the existence of an orthonormal matrix O and a diagonal matrix D such that Σ = O^T D O, where D contains d non-zero eigenvalues in a square block D_1 and p_A + p_B − d zero eigenvalues. We can write D = J^T D_1 J with J = (I_d | 0_{d×(p_A+p_B−d)}). Therefore, if we let z = (1/√d) D_1^{−1/2} J O x̄, which has covariance (1/d) I_d, then upon noticing that J J^T = I_d and defining (A|B)^T = √d O^T J^T D_1^{1/2}, we find (A|B)^T z ∼ N(0, Σ). The Gaussian Covariate model unifies many different models, as shown in Table 1. These special cases are all discussed in Section 3 and Appendix D.
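Before turning to Table 1, the construction above can be checked numerically. The following minimal sketch (an illustration, not from the paper; the dimensions d, p_A, p_B and the particular rank-d covariance Σ are arbitrary assumptions) builds (A|B) from the spectral decomposition of Σ and verifies empirically that (A|B)^T z ∼ N(0, Σ) when z ∼ N(0, (1/d) I_d).

```python
import numpy as np

rng = np.random.default_rng(0)
d, p_A, p_B = 30, 40, 35                       # arbitrary dimensions (assumptions)
p = p_A + p_B

# An arbitrary covariance with exactly d non-zero eigenvalues: Sigma = M M^T
# with M of shape (p, d), mimicking the rank-d structure described above.
M = rng.standard_normal((p, d)) / np.sqrt(d)
Sigma = M @ M.T

# Spectral theorem: Sigma = O^T D O; collect the d non-zero eigenvalues in D_1.
eigvals, eigvecs = np.linalg.eigh(Sigma)       # ascending eigenvalues
top = np.argsort(eigvals)[::-1][:d]
D1 = eigvals[top]                              # diagonal block D_1, shape (d,)
JO = eigvecs[:, top].T                         # J O: the d rows of O tied to D_1, shape (d, p)

# (A|B)^T = sqrt(d) O^T J^T D_1^{1/2}, i.e. (A|B) = sqrt(d) D_1^{1/2} J O.
AB = np.sqrt(d) * np.sqrt(D1)[:, None] * JO    # shape (d, p)
A, B = AB[:, :p_A], AB[:, p_A:]                # student view A, teacher view B

# Sample z ~ N(0, I_d / d); the covariance of x_bar = (A|B)^T z should be Sigma.
n = 100_000
Z = rng.standard_normal((n, d)) / np.sqrt(d)   # rows are z^T
X_bar = Z @ AB                                 # rows are x_bar^T = (x^T, x_tilde^T)
Sigma_hat = X_bar.T @ X_bar / n
print(np.abs(Sigma_hat - Sigma).max())         # small, vanishing as O(1/sqrt(n))
```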

Table 1: Different matrices and corresponding models. Each row pairs a choice of the matrices B and A with the special case of the Gaussian Covariate model it realizes:
- Mismatched ridgeless regression, with signal r, noise σ, and mismatch parameter γ (with γ + γ̄ = 1);
- Non-isotropic ridgeless regression, noiseless, with an α-polynomial distortion of the input scalings;
- Random features regression of a noisy linear function, with W the random weight matrix and (μ, ν) describing the non-linear activation function;
- Kernel methods, with A = B = diag(√ω_1, …, √ω_d).
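As one illustration of how the rows of Table 1 instantiate A and B, the toy sketch below builds the diagonal pair from the kernel-methods row; the particular weights ω_i (a power-law decay) are an arbitrary assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 100

# Kernel-methods row of Table 1: A = B = diag(sqrt(omega_1), ..., sqrt(omega_d)).
omega = (1 + np.arange(d)) ** -1.5             # assumed spectrum (illustration only)
A = B = np.diag(np.sqrt(omega))

U = A @ A.T                                    # = diag(omega)
V = B @ B.T                                    # = diag(omega)

# Here teacher and student views coincide, both rescaled coordinate-wise by sqrt(omega):
z = rng.standard_normal(d) / np.sqrt(d)        # z ~ N(0, I_d / d)
x, x_tilde = A.T @ z, B.T @ z
assert np.allclose(x, x_tilde)
```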

Learning task: We consider the problem of learning a linear teacher function f_d(x̃) = β*^T x̃, with x and x̃ sampled as defined above, and with β* ∈ R^{p_B} a column vector. This hidden vector β* is unknown to the student.
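The gradient-flow dynamics analyzed in this paper can be mimicked numerically by plain gradient descent on the empirical square loss. The sketch below is a toy illustration only: the dimensions, the isotropic choice A = B = I_d (so that x = x̃ = z), and the step size standing in for the continuous-time flow are all assumptions. It trains a student vector on n teacher-labeled samples and tracks the training and generalization errors along the trajectory.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 200, 100                                # overparameterized toy: n < d (assumption)

# Simplest isotropic special case A = B = I_d, so x = x_tilde = z.
Z = rng.standard_normal((n, d)) / np.sqrt(d)   # training inputs z_i ~ N(0, I_d / d)
beta_star = rng.standard_normal(d)             # hidden teacher vector beta*
y = Z @ beta_star                              # noiseless labels f_d(x_i) = beta*^T x_i

beta = np.zeros(d)                             # student weights, initialized at zero
lr = 1.0                                       # step size well below the stability threshold

for t in range(10_001):
    grad = Z.T @ (Z @ beta - y) / n            # gradient of (1/2n) ||Z beta - y||^2
    beta -= lr * grad
    if t % 2000 == 0:
        # With x ~ N(0, I_d/d), the generalization error is
        # E[(x^T (beta - beta*))^2] = ||beta - beta*||^2 / d.
        gen = np.sum((beta - beta_star) ** 2) / d
        train = np.mean((Z @ beta - y) ** 2)
        print(f"t={t:6d}  train={train:.2e}  gen={gen:.4f}")
```

The training error decays to zero (the model interpolates since n < d), while the generalization error settles at the value of the minimum-norm least-squares estimator, consistent with the infinite-time limit discussed in Section 1.1.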






