GRADIENT FLOW IN THE GAUSSIAN COVARIATE MODEL: EXACT SOLUTION OF LEARNING CURVES AND MULTIPLE DESCENT STRUCTURES

Anonymous authors
Paper under double-blind review

Abstract

A recent line of work has shown remarkable behaviors of the generalization error curves in simple learning models. Even least-squares regression exhibits atypical features such as model-wise double descent, and further works have observed triple or multiple descents. Another important characteristic is the epoch-wise descent structure which emerges during training. Model-wise and epoch-wise descents have been analytically derived only in limited theoretical settings (such as the random feature model) and are otherwise experimental observations. In this work, we provide a full and unified analysis of the whole time-evolution of the generalization curve, in the asymptotic large-dimensional regime and under gradient flow, within a wider theoretical setting stemming from a Gaussian covariate model. In particular, we cover most cases observed disparately in the literature, and also provide examples of multiple descent structures as a function of a model parameter or of time. Furthermore, we show that our theoretical predictions adequately match the learning curves obtained by gradient descent over realistic datasets. Technically, we compute averages of rational expressions involving random matrices using recent developments in random matrix theory based on "linear pencils". Another contribution, which is also of independent interest in random matrix theory, is a new derivation of related fixed point equations (and an extension thereof) using Dyson Brownian motions.

1. INTRODUCTION

1.1 PRELIMINARIES

With growing computational resources, it has become customary for machine learning models to use a huge number of parameters (billions of parameters in Brown et al. (2020)), and the need for scaling laws has become of utmost importance Hoffmann et al. (2022). It is therefore of great relevance to study the asymptotic (or "thermodynamic") limit of simple models in which the number of parameters and data samples are sent to infinity. A landmark progress made by considering these theoretical limits is the analytical (oftentimes rigorous) calculation of precise double-descent curves for the generalization error, starting with Belkin et al. (2020); Hastie et al. (2019); Mei & Montanari (2019); Advani et al. (2020); d'Ascoli et al. (2020); Gerace et al. (2020); Deng et al. (2021); Kini & Thrampoulidis (2020), confirming in a precise (albeit limited) theoretical setting the experimental phenomenon initially observed in Belkin et al. (2019); Geiger et al. (2019); Spigler et al. (2019); Nakkiran et al. (2020a). Further derivations of triple or even multiple descents for the generalization error have also been performed in d'Ascoli et al. (2020); Nakkiran et al. (2020b); Chen et al. (2021); Richards et al. (2021); Wu & Xu (2020). Other aspects of multiple descents have been explored in Lin & Dobriban (2021); Adlam & Pennington (2020b), and also for the neural tangent kernel in Adlam & Pennington (2020a). The tools in use come from modern random matrix theory Pennington & Worah (2017); Rashidi Far et al. (2006); Mingo & Speicher (2017), and from statistical physics methods such as the replica method Engel & Van den Broeck (2001). In this paper we are concerned with a line of research dedicated to the precise time-evolution of the generalization error under gradient flow, corroborating, among other things, the presence of epoch-wise descent structures Crisanti & Sompolinsky (2018); Bodin & Macris (2021) observed in
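As a purely illustrative aside (not the Gaussian covariate model analyzed in this paper), the following minimal sketch reproduces the model-wise double descent discussed above: minimum-norm least squares on random ReLU features, with the test error peaking near the interpolation threshold where the number of features P matches the number of samples n. All parameter values and the ReLU feature map are assumptions chosen for illustration.

```python
# Minimal illustrative sketch of model-wise double descent (assumed setup,
# not this paper's model): min-norm least squares on random ReLU features.
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test, noise = 20, 100, 2000, 0.5          # input dim, train/test sizes, label noise
beta = rng.standard_normal(d) / np.sqrt(d)        # ground-truth linear teacher

X_train = rng.standard_normal((n, d))
X_test = rng.standard_normal((n_test, d))
y_train = X_train @ beta + noise * rng.standard_normal(n)
y_test = X_test @ beta

for P in (10, 50, 90, 100, 110, 200, 500, 1000):  # number of random features
    W = rng.standard_normal((d, P)) / np.sqrt(d)  # fixed random first-layer weights
    F_train = np.maximum(X_train @ W, 0.0)        # ReLU random features
    F_test = np.maximum(X_test @ W, 0.0)
    a = np.linalg.pinv(F_train) @ y_train         # minimum-norm least-squares fit
    mse = np.mean((F_test @ a - y_test) ** 2)
    print(f"P = {P:5d}   test MSE = {mse:.3f}")   # error peaks near P ~ n, then descends again
```

Running this sketch, the test error first decreases with P, peaks near the interpolation threshold P ~ n, and decreases again in the overparameterized regime, which is the qualitative shape of the double-descent curves derived analytically in the works cited above.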

