GRADIENT FLOW IN THE GAUSSIAN COVARIATE MODEL: EXACT SOLUTION OF LEARNING CURVES AND MULTIPLE DESCENT STRUCTURES

Anonymous authors
Paper under double-blind review

Abstract

A recent line of work has shown remarkable behaviors of the generalization error curves in simple learning models. Even least-squares regression exhibits atypical features such as model-wise double descent, and further works have observed triple or multiple descents. Another important characteristic is the epoch-wise descent structure which emerges during training. The observations of model-wise and epoch-wise descents have been analytically derived in limited theoretical settings (such as the random feature model) and are otherwise experimental. In this work, we provide a full and unified analysis of the whole time-evolution of the generalization curve, in the asymptotic large-dimensional regime and under gradient flow, within a wider theoretical setting stemming from a Gaussian covariate model. In particular, we cover most cases already disparately observed in the literature, and also provide examples of the existence of multiple descent structures as a function of a model parameter or time. Furthermore, we show that our theoretical predictions adequately match the learning curves obtained by gradient descent over realistic datasets. Technically, we compute averages of rational expressions involving random matrices using recent developments in random matrix theory based on "linear pencils". Another contribution, which is also of independent interest in random matrix theory, is a new derivation of related fixed point equations (and an extension thereof) using Dyson Brownian motions.

1. INTRODUCTION

1.1 PRELIMINARIES

With growing computational resources, it has become customary for machine learning models to use a huge number of parameters (billions of parameters in Brown et al. (2020)), and the need for scaling laws has become of utmost importance (Hoffmann et al., 2022). It is therefore of great relevance to study the asymptotic (or "thermodynamic") limit of simple models in which the number of parameters and data samples are sent to infinity. A landmark progress made by considering these theoretical limits is the analytical (oftentimes rigorous) calculation of precise double-descent curves for the generalization error, starting with Belkin et al. (2020); Hastie et al. (2019); Mei & Montanari (2019); Advani et al. (2020); d'Ascoli et al. (2020); Gerace et al. (2020); Deng et al. (2021); Kini & Thrampoulidis (2020), confirming in a precise (albeit limited) theoretical setting the experimental phenomenon initially observed in Belkin et al. (2019); Geiger et al. (2019); Spigler et al. (2019); Nakkiran et al. (2020a). Further derivations of triple or even multiple descents for the generalization error have also been performed in d'Ascoli et al. (2020); Nakkiran et al. (2020b); Chen et al. (2021); Richards et al. (2021); Wu & Xu (2020). Other aspects of multiple descents have been explored in Lin & Dobriban (2021); Adlam & Pennington (2020b), and for the neural tangent kernel in Adlam & Pennington (2020a). The tools in use come from modern random matrix theory (Pennington & Worah, 2017; Rashidi Far et al., 2006; Mingo & Speicher, 2017) and statistical physics methods such as the replica method (Engel & Van den Broeck, 2001). In this paper we are concerned with a line of research dedicated to the precise time-evolution of the generalization error under gradient flow, corroborating, among other things, the presence of epoch-wise descent structures (Crisanti & Sompolinsky, 2018; Bodin & Macris, 2021) observed in Nakkiran et al. (2020a).
We consider the gradient flow dynamics for the training and generalization errors in the setting of a Gaussian covariate model, and develop analytical methods to track the whole time evolution. In particular, for infinite times we recover the predictions of the least-squares estimator which have been thoroughly described in a similar model by Loureiro et al. (2021). In the next paragraphs we set up the model together with a list of special realizations, and describe our main contributions.

1.2. MODEL DESCRIPTION

Generative Data Model: In this paper, we use the so-called Gaussian covariate model in a teacher-student setting. An observation in our data model is defined through the realization of a Gaussian vector z ∼ N(0, (1/d) I_d). The teacher and the student obtain their observations (or two different views of the world) through the vectors x ∈ R^{p_B} and x̂ ∈ R^{p_A} respectively, which are given by the application of two linear operations on z. In other words, there exist two matrices B ∈ R^{d×p_B} and A ∈ R^{d×p_A} such that x = B^T z and x̂ = A^T z. Note that the generated data can also be seen as the output of a generative 1-layer linear network. In the following, the structure of A and B is fairly general as long as it remains independent of the realization z: the matrices may be random matrices or block-matrices of different natures and structures, so as to capture more sophisticated models. While the models we treat are defined through appropriate A and B, we will often only need the structure of U = AA^T and V = BB^T. A direct connection can be made with the Gaussian covariate model described in Loureiro et al. (2021): we find (A|B)^T z ∼ N(0, Σ). The Gaussian covariate model unifies many different models, as shown in Table 1; these special cases are all discussed in Section 3 and Appendix D.

[Table 1: special cases and their structural matrices A and B. Ridgeless regression of a noisy linear function: B block-diagonal with blocks r√(d/p) I_p and σ√(d/q) I_q. Non-isotropic ridgeless regression of a noiseless model with a polynomial distortion α of the input scalings: A and B diagonal, of the form diag(√ω_1, ..., √ω_d). Random features regression of a noisy linear function: A built from the blocks µ√(d/p) W and ν√(d/p) I_N, with W the random weights and (µ, ν) describing a non-linear activation function. Further kernel methods.]

Learning task: We consider the problem of learning a linear teacher function f_d(x) = β*^T x, with x and x̂ sampled as defined above, and with β* ∈ R^{p_B} a column vector. This hidden vector β* (to be learned) can potentially be a deterministic vector. We suppose that we have n data-points (z_i, y_i)_{1≤i≤n} with x_i = B^T z_i, x̂_i = A^T z_i. This data can be represented as the n × d matrix Z ∈ R^{n×d}, where z_i^T is the i-th row of Z, and the column vector Y ∈ R^n with i-th entry y_i. Therefore, we have the matrix notation Y = ZBβ*. We can also set X = ZB so that Y = Xβ*. In the same spirit, we define the estimator of the student ŷ_β(z) = β^T x̂ = z^T Aβ. We note that in general the dimensions of β and β* (i.e., p_A and p_B) are not necessarily equal, as this depends on the matrices B and A. We have Ŷ = ZAβ = X̂β for X̂ = ZA.

Training and test error: We will consider the training error E^λ_train and the test error E_gen with a regularization coefficient λ ∈ R*_+, defined as

E^λ_train(β) = (1/n) ‖Ŷ − Y‖²_2 + (λ/n) ‖β‖²_2,   E_gen(β) = E_{z∼N(0, I_d/d)}[(z^T Aβ − z^T Bβ*)²]   (1)

It is well known that the least-squares estimator β̂_λ = arg min_β E^λ_train(β) is given by the Tikhonov regression formula β̂_λ = (X̂^T X̂ + λI)^{−1} X̂^T Y, and that in the limit λ → 0 this estimator converges towards β̂_0 given by the Moore-Penrose inverse, β̂_0 = (X̂^T X̂)^+ X̂^T Y.
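As a quick numerical illustration of the two estimators above, the following sketch (with made-up sizes, and taking A = B = I_d so that teacher and student share the same view, an assumption made only for this example) verifies that the Tikhonov estimator approaches the Moore-Penrose solution as λ → 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
Z = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n, d))   # rows ~ N(0, I_d / d)
beta_star = rng.normal(size=d)
X_hat = Z                                            # X^ = Z A with A = I
Y = Z @ beta_star                                    # Y = Z B beta* with B = I

def tikhonov(X, Y, lam):
    """Ridge (Tikhonov) estimator (X^T X + lam I)^{-1} X^T Y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

beta_lam = tikhonov(X_hat, Y, 1e-10)
beta_0 = np.linalg.pinv(X_hat) @ Y    # Moore-Penrose limit of lambda -> 0
assert np.allclose(beta_lam, beta_0, atol=1e-5)
```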

Gradient-flow:

We use the gradient-flow algorithm to explore the evolution of the test error through time, with ∂β_t/∂t = −(n/2) ∇_β E^λ_train(β_t). In practice, for numerical calculations we use the discrete-time version, gradient descent, which is known to converge towards the aforementioned least-squares estimator provided the time-step is sufficiently small (of the order of 1/λ_max, where λ_max is the maximum eigenvalue of X̂^T X̂). The upfront coefficient n on the gradient is used so that the test error scales with the dimension of the model and allows for considering the evolution in the limit n, d, p_A, p_B → +∞ with fixed ratios n/d, p_A/d, p_B/d. We write φ = n/d.
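A minimal sketch (with made-up sizes) of this discrete-time dynamics, with the step size set from the maximum eigenvalue as indicated above, converging to the Tikhonov estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 120, 40, 0.1
X = rng.normal(0.0, 1.0 / np.sqrt(p), size=(n, p))
Y = X @ rng.normal(size=p)

H = X.T @ X + lam * np.eye(p)                 # Hessian of (n/2) E_train^lam
target = np.linalg.solve(H, X.T @ Y)          # Tikhonov estimator
step = 1.0 / np.linalg.eigvalsh(H).max()      # step ~ 1 / lambda_max

beta = np.zeros(p)
for _ in range(20000):
    beta -= step * (H @ beta - X.T @ Y)       # (n/2) gradient of E_train^lam

assert np.linalg.norm(beta - target) < 1e-6
```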

1.3. CONTRIBUTIONS

1. We provide a general unified framework covering multiple models in which we derive, in the asymptotic large size regime, the full time-evolution under gradient flow dynamics of the training and generalization errors for teacher-student settings. In particular, in the infinite-time limit we check that our equations reduce to those of Loureiro et al. (2021) (as should be expected). With our results we now have the possibility to explore quantitatively the potential advantages of different stopping times: indeed, our formalism allows us to compute the time derivative of the generalization curve at any point in time.
2. Various special cases are illustrated in Section 3, among them a simpler re-derivation of the whole dynamics of the random features model Bodin & Macris (2021), the full dynamics for kernel methods, and situations exhibiting multiple descent curves both as a function of model parameters and of time (see Section 3.2 and Appendix D.2). In particular, our analysis allows us to design multiple descents with respect to the training epochs.
3. We show that our equations can also capture the learning curves over realistic datasets such as MNIST with gradient descent (see Section 3.4 and Appendix D.5), extending further the results of Loureiro et al. (2021) to the time dependence of the curves. This could be an interesting guideline for deriving scaling laws for large learning models.
4. We use modern random matrix techniques, namely an improved version of the linear-pencil method, recently introduced in the machine learning community by Adlam et al. (2019), to derive asymptotic limits of traces of rational expressions involving random matrices. Furthermore, we propose a new derivation of an important fixed point equation using Dyson Brownian motion which, although non-rigorous, should be of independent interest (see Appendix E).

2. MAIN RESULTS

We resort to the following high-dimensional assumptions (see Bodin & Macris (2021) for similar assumptions). 2. There exists a sequence of complex contours Γ_d ⊂ C enclosing the eigenvalues of the random matrix X̂^T X̂ ∈ R^{d×d} but not enclosing −λ, and there also exists a fixed contour Γ enclosing the support of the limiting (when d → +∞) eigenvalue distribution of X̂^T X̂ but not enclosing −λ. With these assumptions in mind, we derive the precise time evolution of the test error in the high-dimensional limit (see Result 2.1), and similarly for the training error (see Result 2.4). We will also assume that the results remain valid in the case λ = 0, as suggested in Mei & Montanari (2019).

2.1. TIME EVOLUTION FORMULA FOR THE TEST ERROR

Result 2.1 The limiting test error time evolution, for a random initialization β_0 such that N_d(β_0) = r_0 and E[β_0] = 0, is given by the following expression:

Ē_gen(t) = c_0 + r_0² B_0(t) + B_1(t)   (2)

with V* = Bβ*β*^T B^T and c_0 = Tr_d[V*], and:

B_1(t) = −1/(4π²) ∮_Γ ∮_Γ [(1 − e^{−t(x+λ)})(1 − e^{−t(y+λ)})] / [(x + λ)(y + λ)] f_1(x, y) dx dy + 1/(iπ) ∮_Γ [(1 − e^{−t(z+λ)}) / (z + λ)] f_2(z) dz   (3)

B_0(t) = −1/(2iπ) ∮_Γ e^{−2t(z+λ)} f_0(z) dz   (4)

where f_1(x, y) = f_2(x) + f_2(y) + f̃_1(x, y) − c_0 and:

f̃_1(x, y) = Tr_d[(φU + ζ_x I)^{−1} (ζ_x ζ_y V* + f̃_1(x, y) φU²) (φU + ζ_y I)^{−1}]   (5)

f_2(z) = c_0 − Tr_d[ζ_z V* (φU + ζ_z I)^{−1}]   (6)

f_0(z) = −1 + ζ_z / z   (7)

and ζ_z given by the self-consistent equation:

ζ_z = −z + Tr_d[ζ_z U (φU + ζ_z I)^{−1}]   (8)

The former result can be expressed in terms of expectations w.r.t. the joint limiting eigenvalue distributions of U and V* when they commute with each other.

Result 2.2 When U and V* commute, let u, v* be jointly distributed according to the eigenvalues of U and V* respectively. Then:

f̃_1(x, y) = E_{u,v*}[(ζ_x ζ_y v* + f̃_1(x, y) φu²) / ((φu + ζ_x)(φu + ζ_y))],   f_2(z) = c_0 − E_{u,v*}[ζ_z v* / (φu + ζ_z)]   (9)

ζ_z = −z + E_u[ζ_z u / (φu + ζ_z)]   (10)

Notice also that in the limit t → ∞:

B_1(+∞) = f_1(−λ, −λ) − 2 f_2(−λ) = f̃_1(−λ, −λ) − c_0,   B_0(+∞) = 0   (11)

which leads to the next result.

Result 2.3 In the limit t → ∞, the limiting test error is given by Ē_gen(+∞) = f̃_1(−λ, −λ).

Remark 1 Notice that the matrix V* is of rank one, depending on the hidden vector β*. However, it is also possible to calculate the average generalization (and training) error over a prior distribution P* of β*:

E_{β*∼P*}[f̃_1(x, y)] = Tr_d[(φU + ζ_x I)^{−1} (ζ_x ζ_y E_{P*}[V*] + E_{P*}[f̃_1(x, y)] φU²) (φU + ζ_y I)^{−1}]   (12)

E_{β*∼P*}[f_2(z)] = c_0 − Tr_d[ζ_z E_{P*}[V*] (φU + ζ_z I)^{−1}]   (13)

In conclusion, we find that E_{β*∼P*}[Ē_gen] follows the same equations as Ē_gen in Result 2.1 with E_{β*∼P*}[V*] instead of V*.
In the following, we will consider V* without any distinction whether it comes from a specific vector β* or is averaged over a sample distribution P*.

Remark 2 In the particular case where U is diagonal, the matrix V* can be replaced by the following diagonal matrix Ṽ*, which, in fact, commutes with U: Ṽ* = diag([V*]_{11}, [V*]_{22}, ..., [V*]_{dd}). This comes essentially from the fact that, given a diagonal matrix D and a non-diagonal matrix A, [DA]_{ii} = [D]_{ii}[A]_{ii}. This is particularly helpful, and shows that in many cases the calculations of f̃_1 or f_2 remain tractable even for a deterministic β* (see the example in Appendix D.3).

Remark 3 Sometimes U = AA^T and V = BB^T are more difficult to handle than their dual counterparts Ū = φA^T A and V̄ = φB^T B, together with the additional matrix Ξ = φA^T B. The following expressions are thus very useful (see Appendix C):

f̃_1(x, y) = Tr_n[(Ū + ζ_x I)^{−1} ((Ξβ*β*^T Ξ^T) + f̃_1(x, y) Ū) Ū (Ū + ζ_y I)^{−1}]   (15)

f_2(z) = Tr_n[(Ξβ*β*^T Ξ^T) (Ū + ζ_z I)^{−1}]   (16)

ζ_z = −z + Tr_n[ζ_z Ū (Ū + ζ_z I)^{−1}]   (17)

In fact, when x = y = −λ (which corresponds to the limit t → ∞), these are the same expressions as (59) in Loureiro et al. (2021) with the appropriate change of variables λ(1 + V) → ζ and f̃_1 → ρ + q − 2m.
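In the simplest isotropic case U = V* = I (so u = v* = 1 in Result 2.2), the self-consistent equation (10) and the linear equation for f̃_1 become scalar and can be solved by plain fixed-point iteration. A minimal sketch (the isotropic reduction is an assumption made for this illustration only):

```python
def zeta_fixed_point(z, phi, n_iter=2000):
    """Iterate the isotropic reduction zeta = -z + zeta / (phi + zeta)."""
    zeta = 1.0
    for _ in range(n_iter):
        zeta = -z + zeta / (phi + zeta)
    return zeta

def gen_error_infty(phi, lam):
    """E_gen(+inf) = f1~(-lam, -lam), solving the linear scalar version of
    equation (5): f1~ = (zeta^2 + f1~ * phi) / (phi + zeta)^2 with u = v* = 1."""
    zeta = zeta_fixed_point(-lam, phi)
    return zeta**2 / ((phi + zeta) ** 2 - phi)

# Ridgeless over-parameterized check: zeta = 1 - phi and E_gen(+inf) = 1 - phi
phi = 0.5
assert abs(zeta_fixed_point(0.0, phi) - (1 - phi)) < 1e-8
assert abs(gen_error_infty(phi, 0.0) - (1 - phi)) < 1e-8
```

The ridgeless check reproduces the classical noiseless min-norm result, which is also recovered as a special case in Section 3.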

2.2. TIME EVOLUTION FORMULA FOR THE TRAINING ERROR

Result 2.4 The limiting training error time evolution is given by the following expression:

Ē⁰_train(t) = c_0 + r_0² H_0(t) + H_1(t)   (18)

with:

H_1(t) = −1/(4π²) ∮_Γ ∮_Γ [(1 − e^{−t(x+λ)})(1 − e^{−t(y+λ)})] / [(x + λ)(y + λ)] h_1(x, y) dx dy + 1/(iπ) ∮_Γ [(1 − e^{−t(z+λ)}) / (z + λ)] h_2(z) dz   (19)

H_0(t) = −1/(2iπ) ∮_Γ e^{−2t(z+λ)} h_0(z) dz   (20)

where h_1(x, y) = h_2(x) + h_2(y) + h̃_1(x, y) − c_0 and, with η_z = −z/ζ_z:

h̃_1(x, y) = η_x η_y f̃_1(x, y),   h_2(z) = η_z (c_0 f_0(z) + f_2(z)),   h_0(z) = η_z f_0(z)   (21)

Eventually, in the limit t → ∞ we find:

H_1(+∞) = h_1(−λ, −λ) − 2 h_2(−λ) = h̃_1(−λ, −λ) − c_0,   H_0(+∞) = 0   (22)

Result 2.5 In the limit t → ∞, we have the relation Ē⁰_train(+∞) = η²_{−λ} Ē_gen(+∞). We notice the same proportionality factor η²_{−λ} = (λ/ζ_{−λ})² as already stated in Loureiro et al. (2021); interestingly, however, in the time evolution of the training error such a factor is not valid, since h_2(z) ≠ η_z f_2(z).

3. APPLICATIONS AND EXAMPLES

We discuss some of the models provided in Table 1, and some others in Appendix D.

3.1. RIDGELESS REGRESSION OF A NOISY LINEAR FUNCTION

Target function: Consider the noisy linear function y(x) = r x^T β*_0 + σε for some constant σ ∈ R_+ and ε ∼ N(0, 1), with a hidden vector β*_0 ∼ N(0, I_p). Assume we have a data matrix X ∈ R^{n×p}. In order to incorporate the noise in our structural matrix B, we consider an additional parameter q(d) that grows linearly with d and such that d = p + q. Let φ_0 = n/p and ψ = p/d, so that φ = n/d = (n/p)(p/d) = φ_0 ψ. Also, we let β*^T = (β*^T_0 | β^T_1) ∼ N(0, I_{p+q}) and we consider an average of V* over β*. We construct the following block-matrix B and compute the averaged V* as follows:

B = [ r√(d/p) I_p, 0; 0, σ√(d/q) I_q ]   =⇒   V* = [ r² (1/ψ) I_p, 0; 0, σ² (1/(1−ψ)) I_q ]   (23)

Now let us consider the random matrix Z ∈ R^{n×d} and split it into two sub-blocks Z = [√(p/d) X | √(q/d) Σ]. The framework of the paper yields the following output vector:

Y = ZBβ* = rXβ*_0 + σξ   (24)

where ξ = Σβ*_1 is used as a proxy for the noise.

Estimator: Now let us consider the linear estimator ŷ_t = x̂^T β_t. To capture the structure of this model, we use the following block-matrix A and compute the resulting matrix U:

A = [ √(d/p) I_p; 0_{q×p} ]   =⇒   U = [ (1/ψ) I_p, 0; 0, 0_{q×q} ]   (25)

Therefore, it is straightforward to check that we have indeed Ŷ_t = ZAβ_t = Xβ_t.

Analytic result: In this specific example, U and V* obviously commute and Result 2.2 can thus be used. First we derive the joint distribution of the eigenvalues:

P(u = 1/ψ, v* = r²/ψ) = ψ,   P(u = 0, v* = σ²/(1−ψ)) = 1 − ψ   (26)

In this specific example, we focus only on re-deriving the high-dimensional generalization error without any regularization term (λ = 0) for the minimum least-squares estimator. So we calculate ζ = ζ(0) as follows: ζ = ψ · (ζ/ψ)/(φ/ψ + ζ) + 0, which implies ζ² + φ_0 ζ = ζ, so ζ ∈ {0, 1 − φ_0}.
For f̃_1 we get:

f̃_1 = ψ · [f̃_1 φ/ψ²] / (φ/ψ + ζ)² + ψ · [(r²/ψ) ζ²] / (φ/ψ + ζ)² + (1 − ψ) · [σ²/(1 − ψ)] · ζ²/ζ²   (27)

In fact, the expression can be simplified as follows (eliminating the constants φ, ψ in favor of φ_0):

(1 − φ_0/(φ_0 + ζ)²) f̃_1 = r² ζ²/(φ_0 + ζ)² + σ²   (28)

Ē_gen(+∞) = σ² φ_0/(φ_0 − 1)   (ζ = 0),      Ē_gen(+∞) = r² (1 − φ_0) + σ²/(1 − φ_0)   (ζ = 1 − φ_0)   (29)
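A sketch comparing the ζ = 0 branch of (29) to a finite-size simulation of the minimum-norm least-squares estimator (the sizes below are made up). Note that in this model the test error includes the irreducible noise σ², since the noise direction is part of B and hence of V*; the empirical estimate below adds that σ² term explicitly.

```python
import numpy as np

def theory_gen_error(phi0, r, sigma):
    """Limiting ridgeless test error of equation (29)."""
    if phi0 > 1:                                        # zeta = 0 branch
        return sigma**2 * phi0 / (phi0 - 1)
    return r**2 * (1 - phi0) + sigma**2 / (1 - phi0)    # zeta = 1 - phi0

def empirical_gen_error(n, p, r, sigma, n_trials=20, seed=0):
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_trials):
        X = rng.normal(0.0, 1.0 / np.sqrt(p), size=(n, p))  # rows ~ N(0, I_p/p)
        beta0 = rng.normal(size=p)
        Y = r * X @ beta0 + sigma * rng.normal(size=n)
        beta = np.linalg.pinv(X) @ Y                    # min-norm least squares
        # E_x[(x^T beta - r x^T beta0)^2] + irreducible noise sigma^2
        errs.append(np.sum((beta - r * beta0) ** 2) / p + sigma**2)
    return float(np.mean(errs))

phi0, r, sigma, p = 2.0, 1.0, 0.5, 400
emp = empirical_gen_error(int(phi0 * p), p, r, sigma)
assert abs(emp - theory_gen_error(phi0, r, sigma)) < 0.05
```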

3.2. NON-ISOTROPIC RIDGELESS REGRESSION OF A NOISELESS LINEAR MODEL

Non-isotropic models have been studied in Dobriban & Wager (2018), and then also in Wu & Xu (2020); Richards et al. (2021); Nakkiran et al. (2020b); Chen et al. (2021), where multiple-descent curves have been observed or engineered. In this section, we extend this idea to show that any number of descents can be generated, and derive the precise curve of the generalization error as in Figure 1. Target function: We use the standard linear model y(z) = z^T β* for a random β* ∼ N(0, I_d). Therefore, we consider the matrix B = I_d and thus V* = I_d, such that Y = ZBβ* = Zβ*. Estimator: Following the structure provided in Table 1, the matrix A is a diagonal matrix with p ∈ N* sub-spaces of different scales, spaced by a polynomial progression α^{−i/2}. In other words, the student is trained on a dataset with different scalings. We thus have U = A² and Ŷ_t = ZAβ_t.

Analytic results

We refer the reader to Appendix D.2 for the calculation. Depending on whether φ is above or below 1, ζ is the solution of one of the following equations: ζ = 0, or 1 = (1/p) Σ_{i=0}^{p−1} 1/(φ + α^i ζ). In the over-parameterized regime (φ < 1), the generalization error is fully characterized by the equation:

Ē_gen(+∞) = φ(1 − φ) [ (1/p) Σ_{i=0}^{p−1} φ α^i ζ / (φ + α^i ζ)² ]^{−1} − φ   (30)

In the asymptotic limit α → ∞, ζ can be approximated, and thus we can derive an asymptotic expansion of Ē_gen(+∞) for φ ∈ [0, 1] \ (1/p)Z, where the multiple descents clearly appear as roots of the denominator of the sum:

Ē_gen(+∞) = (1/p) Σ_{k=0}^{p−1} [φ(1 − φ) / ((φ − k/p)((k+1)/p − φ))] 1_{]k/p; (k+1)/p[}(φ) − φ + o_α(1)   (31)

Interestingly, we can see how these peaks are formed along the time-evolution of the gradient flow, as in Figure 2, with one peak close to φ = 1/3 and a second one at φ = 2/3. (Note that small λ requires more computational resources to obtain a finer resolution at long times; hence here the second peak develops fully after t = 10⁴.) It is also worth noticing the existence of multiple time-descents, in particular at φ = 1, with some "ripples" that can be observed even in the training error. The eigenvalue distribution (see Appendix D.2.1) provides some insight into the existence of these phenomena. As seen in Figure 3, the emergence of a spike is related to the rise of a new "bulk" of eigenvalues, which can be clearly seen around φ = 1/3 and φ = 2/3 here. Note that there is some analogy with the generic double-descent phenomenon described in Hastie et al. (2019), where instead of two bulks there is a mass arising at 0. Furthermore, the existence of multiple bulks allows for evolutions at multiple different scales (through the e^{−(z+λ)t} terms), and thus enables the emergence of multiple epoch-wise peaks.
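The recipe above can be sketched numerically: solve the self-consistent equation for ζ by bisection, then evaluate the closed form of Ē_gen(+∞). (The exact placement of the φ factor inside the bracket of (30) is an assumption here, chosen so that the p = 1 case reduces to the isotropic noiseless result 1 − φ.)

```python
import numpy as np

def zeta_solve(phi, alpha, p, tol=1e-12):
    """Bisection for 1 = (1/p) * sum_i 1/(phi + alpha^i * zeta), with phi < 1."""
    scales = alpha ** np.arange(p)
    S = lambda z: np.mean(1.0 / (phi + scales * z))
    lo, hi = 1e-15, 1.0                 # S is decreasing, S(lo) > 1 > S(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if S(mid) > 1.0 else (lo, mid)
    return 0.5 * (lo + hi)

def gen_error_infty(phi, alpha, p):
    """Closed form of the limiting test error, over-parameterized regime."""
    zeta = zeta_solve(phi, alpha, p)
    scales = alpha ** np.arange(p)
    denom = np.mean(phi * scales * zeta / (phi + scales * zeta) ** 2)
    return phi * (1 - phi) / denom - phi

# p = 1 recovers the isotropic noiseless error 1 - phi
assert abs(gen_error_infty(0.5, 1.0, 1) - 0.5) < 1e-6
# With p = 3 well-separated scales, the error peaks near phi = 1/3
assert gen_error_infty(0.34, 1e4, 3) > gen_error_infty(0.5, 1e4, 3) > 0
```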

3.3. RANDOM FEATURES REGRESSION

In this section, we show that we can derive the learning curves for the random features model introduced in Rahimi & Recht (2008), in the setting described in Bodin & Macris (2021). We define the random weight-matrix W ∈ R^{p×N} with W_ij ∼ N(0, 1/p), and set ψ_0 = N/p, d = p + N + q, φ = n/d, ψ = p/d, and φ_0 = n/p = (n/d)(d/p) = φ/ψ (thus q/d = 1 − (1 + ψ_0)ψ). So with Z = [√(p/d) X | √(p/d) Ω | √(q/d) ξ], using the structures A and B from Table 1, we have ZA = µXW + νΩ and ZB = X + σξ, hence the model:

Ŷ = ZAβ = (µXW + νΩ)β   (32)
Y = ZBβ* = Xβ*_0 + σξβ*_1   (33)

With further calculations that can be found in Appendix D.4, a similarly complete time derivation of the random features regression can be performed with a much smaller linear-pencil than the one suggested in Bodin & Macris (2021). As stated in that former work, the curves derived from this formula track the same training and test errors in the high-dimensional limit as the model with the point-wise application of a centered non-linear activation function f ∈ L²(e^{−x²/2} dx) with Ŷ = (1/√p) f(√p XW)β. More precisely, with the inner-product defined such that for any function g ∈ L²(e^{−x²/2} dx), ⟨f, g⟩ = E_{x∼N(0,1)}[f(x)g(x)], we derive the equivalent model parameters (µ, ν) with µ = ⟨f, He_1⟩ and ν² = ⟨f, f⟩ − µ², while imposing the centering condition ⟨f, He_0⟩ = 0, where (He_n) is the Hermite polynomial basis. This transformation is dubbed the Gaussian equivalence principle; it has been observed and rigorously proved under weaker conditions in Pennington & Worah (2017); Péché (2019); Hu & Lu (2022), and has since been applied more broadly, for instance in Adlam & Pennington (2020a).
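The map f → (µ, ν) above can be sketched with probabilists' Gauss-Hermite quadrature (weight e^{−x²/2}), which computes the required inner products. The cubic "activation" below is a made-up example chosen because the quadrature is then exact:

```python
import numpy as np

x, w = np.polynomial.hermite_e.hermegauss(60)
w = w / np.sqrt(2 * np.pi)                 # normalize so that E[1] = 1

def equivalent_params(f):
    f0 = np.sum(w * f(x))                  # <f, He_0>: centering, should be 0
    mu = np.sum(w * f(x) * x)              # <f, He_1> = E[f(x) x]
    nu2 = np.sum(w * f(x) ** 2) - mu**2    # <f, f> - mu^2
    return f0, mu, nu2

f0, mu, nu2 = equivalent_params(lambda t: t**3)
assert abs(f0) < 1e-8        # odd function: already centered
assert abs(mu - 3.0) < 1e-6  # E[x^4] = 3
assert abs(nu2 - 6.0) < 1e-6 # E[x^6] - mu^2 = 15 - 9 = 6
```

For non-polynomial activations (ReLU, tanh, ...) the same quadrature gives the parameters approximately, after subtracting ⟨f, He_0⟩ to enforce the centering condition.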

3.4. TOWARDS REALISTIC DATASETS

As stated in Loureiro et al. (2021), the training and test errors of realistic datasets can also be captured. In this example we consider the MNIST dataset and focus on learning the parity of the images (y = +1 for even digits and y = −1 for odd digits). We refer to Appendix D.5 for a thorough discussion of Figures 4 and 5, the technical details used to obtain them, and other examples. Besides the learning curve profile at t = +∞, the full theoretical time evolution is predicted and matches the experimental runs. In particular, the rise of the double-descent phenomenon is observed through time.

We refer to the dynamical mean-field theory (DMFT) literature (2020; 2021) for an overview of this tool. This method is a priori unrelated to ours and yields a set of non-linear integro-differential equations for time correlation functions which are in general not solvable analytically, so that one has to resort to a numerical solution. It would be interesting to understand whether, for the present model, the DMFT equations can be reduced to our set of algebraic equations. We believe it can be a fruitful endeavor to compare the two approaches in detail: the one based on DMFT and the one based on random matrix theory tools and Cauchy integration formulas. Another interesting direction, which came to our knowledge recently, is the one taken in Lu & Yau (2022); Hu & Lu (2022) and in Misiakiewicz (2022); Xiao & Pennington (2022), who study the high-dimensional polynomial regime where n ∝ d^κ for a fixed κ. In particular, it is becoming well known that changing the scaling can yield additional descents. This regime is out of the scope of the present work, but it would be desirable to explore whether the linear-pencils and the random matrix tools that we extensively use in this work extend to these cases.

A GRADIENT FLOW CALCULATIONS

In this section, we derive the main equations for the gradient flow algorithm, and derive a set of Cauchy integration formulas involving the limiting traces of large matrices. The calculation factoring out Z in the limit d → ∞ is pursued in the next section. First, we recall and expand the training error function in (1):

E^λ_train(β_t) = (1/n) ‖Y − X̂β_t‖²_2 + (λ/n) ‖β_t‖²_2   (34)
= (1/n) ‖Y‖²_2 − (2/n) Y^T X̂β_t + (1/n) ‖X̂β_t‖²_2 + (λ/n) ‖β_t‖²_2   (35)
= (1/n) ‖ZBβ*‖²_2 − (2/n) β*^T B^T Z^T ZAβ_t + (1/n) β_t^T A^T Z^T ZAβ_t + (λ/n) ‖β_t‖²_2   (36)

Let K = (X̂^T X̂ + λI)^{−1} = (A^T Z^T ZA + λI)^{−1}, which is invertible for λ > 0. Therefore, we can write the gradient of the training error for any β as:

(n/2) ∇_β E^λ_train(β) = X̂^T (X̂β − Y) + λβ = (X̂^T X̂ + λI)β − X̂^T Y = K^{−1}β − X̂^T Y   (37)

The gradient flow equation reduces to a first-order ODE:

∂β_t/∂t = −(n/2) ∇_β E^λ_train(β_t) = X̂^T Y − K^{−1}β_t   (38)

The solution can be completely expressed using L_t = I − exp(−tK^{−1}) as:

β_t = exp(−tK^{−1})β_0 + (I − exp(−tK^{−1})) K X̂^T Y   (39)
= (I − L_t)β_0 + L_t K X̂^T Xβ*   (40)

In the following two subsections, we will focus on deriving an expression for the time evolution of the test and training errors using these equations, averaged over a centered random vector β_0 such that r_0² = N_d(β_0)².
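A sketch (with made-up sizes) checking the closed-form solution (39) against a small-step Euler integration of the gradient flow ODE (38):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam, t = 60, 20, 0.1, 2.0
X = rng.normal(0.0, 1.0 / np.sqrt(p), size=(n, p))
Y = X @ rng.normal(size=p)
beta0 = rng.normal(size=p)

Kinv = X.T @ X + lam * np.eye(p)                    # K^{-1}
vals, vecs = np.linalg.eigh(Kinv)
E = vecs @ np.diag(np.exp(-t * vals)) @ vecs.T      # exp(-t K^{-1})
beta_closed = E @ beta0 + (np.eye(p) - E) @ np.linalg.solve(Kinv, X.T @ Y)

dt, beta, XtY = 1e-5, beta0.copy(), X.T @ Y
for _ in range(int(t / dt)):                        # Euler steps of (38)
    beta = beta + dt * (XtY - Kinv @ beta)

assert np.linalg.norm(beta - beta_closed) < 1e-2
```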

A.1 TEST ERROR

As above, the test error can be expanded using the fact that, for N_0 = N(0, (1/d)I_d), we have the identity E_{z∼N_0}[zz^T] = (1/d) I_d:

E_gen(β_t) = E_{z∼N_0}[(z^T Aβ_t − z^T Bβ*)²]   (41)
= (Aβ_t − Bβ*)^T E_{z∼N_0}[zz^T] (Aβ_t − Bβ*)   (42)
= (1/d) β_t^T A^T Aβ_t − (2/d) β*^T B^T Aβ_t + (1/d) β*^T B^T Bβ*   (43)

Expanding the first term yields:

β_t^T A^T Aβ_t = (β_0^T (I − L_t) + β*^T X^T X̂ K L_t) A^T A ((I − L_t)β_0 + L_t K X̂^T Xβ*)   (44)
= β_0^T (I − L_t) A^T A (I − L_t) β_0   (45)
+ β*^T (B^T Z^T ZA) K L_t A^T A L_t K (A^T Z^T ZB) β*   (46)
+ 2 β_0^T (I − L_t) A^T A L_t K (A^T Z^T ZB) β*   (47)

while the second term yields:

β*^T B^T Aβ_t = β*^T B^T A ((I − L_t)β_0 + L_t K X̂^T Xβ*)   (48)
= β*^T B^T A (I − L_t) β_0 + β*^T B^T A L_t K (A^T Z^T ZB) β*   (49)

Let us now consider the high-dimensional limit Ē_gen(t) = lim_{d→+∞} E_gen(β_t). We further make the underlying assumption that the generalization error concentrates on its mean with respect to β_0, that is to say Ē_gen(t) = lim_{d→+∞} E_{β_0}[E_gen(β_t)]. Let V* = Bβ*β*^T B^T and c_0 = Tr_d[V*]; then, using the former expanded terms in (41), we find the expression:

Ē_gen(t) = c_0 + r_0² Tr_d[A^T (I − L_t)² A]   (50)
+ Tr_d[Z^T ZA K L_t A^T A L_t K A^T Z^T Z V*] − 2 Tr_d[A L_t K A^T Z^T Z V*]   (51)

So Ē_gen(t) = c_0 + r_0² B_0(t) + B_1(t) with:

B_0(t) = Tr_d[A^T (I − L_t)² A]   (52)
B_1(t) = Tr_d[Z^T ZA K L_t A^T A L_t K A^T Z^T Z V*] − 2 Tr_d[A L_t K A^T Z^T Z V*]   (53)

Let K(z) = (X̂^T X̂ − zI)^{−1} be the resolvent of X̂^T X̂, with the convention K = K(−λ) to remain consistent with the previous formula. Then, for any holomorphic function f : U → C defined on an open set U containing the spectrum of X̂^T X̂, with Γ a contour in C enclosing the spectrum of X̂^T X̂ but not the poles of f, we have, with the extension of f onto C^{n×n}: f(X̂^T X̂) = −1/(2iπ) ∮_Γ f(z) K(z) dz.
For instance, we can apply it to the following expression:

K L_t = L_t K = (I − exp(−t(X̂^T X̂ + λI))) (X̂^T X̂ + λI)^{−1}   (54)
= −1/(2iπ) ∮_Γ [(1 − e^{−t(z+λ)}) / (z + λ)] (X̂^T X̂ − zI)^{−1} dz   (55)
= −1/(2iπ) ∮_Γ [(1 − e^{−t(z+λ)}) / (z + λ)] K(z) dz   (56)

So we can generalize this idea to each trace and rewrite B_1(t) and B_0(t) as:

B_1(t) = −1/(4π²) ∮_Γ ∮_Γ [(1 − e^{−t(x+λ)})(1 − e^{−t(y+λ)})] / [(x + λ)(y + λ)] f_1(x, y) dx dy + 1/(iπ) ∮_Γ [(1 − e^{−t(z+λ)}) / (z + λ)] f_2(z) dz   (57)
B_0(t) = −1/(2iπ) ∮_Γ e^{−2t(z+λ)} f_0(z) dz   (58)

where we introduce the set of functions f_1(x, y), f_2(z) and f_0(z):

f_1(x, y) = Tr_d[Z^T ZA K(x) A^T A K(y) A^T Z^T Z V*]   (59)
f_2(z) = Tr_d[A K(z) A^T Z^T Z V*]   (60)
f_0(z) = Tr_d[A K(z) A^T]   (61)

Let G(x) = (U Z^T Z − xI)^{−1}; using the push-through identity, it is straightforward that A K(z) A^T = G(z)U = U G(z)^T. This helps us reduce the expression of f_1 further into smaller terms, which will be easier to handle with linear-pencils later on:

f_1(x, y) = Tr_d[Z^T Z U G(x)^T G(y) U Z^T Z V*]   (62)
= Tr_d[(G(x)^{−1} + xI)^T G(x)^T G(y) (G(y)^{−1} + yI) V*]   (63)
= Tr_d[(I + yG(y)) V* (I + xG(x))^T]   (64)
= c_0 + y Tr_d[G(y)V*] + x Tr_d[G(x)V*] + xy Tr_d[G(x) V* G(y)^T]   (65)

Similarly, f_2 and f_0 can be rewritten as:

f_2(z) = Tr_d[G(z) U Z^T Z V*]   (66)
= Tr_d[G(z) (G(z)^{−1} + zI) V*]   (67)
= c_0 + z Tr_d[G(z) V*]   (68)
f_0(z) = Tr_d[G(z) U]   (69)

Hence, in fact, the definition f̃_1(x, y) = xy Tr_d[G(x) V* G(y)^T], such that f_1(x, y) = f_2(x) + f_2(y) + f̃_1(x, y) − c_0   (70)

At this point, the equations (57)-(58) are valid for any realization of Z in the limit d → ∞. We will see in the next section how to simplify these terms by factoring out Z.
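A numerical sketch of this Cauchy representation (with made-up sizes): the matrix function K L_t is recovered by a discretized contour integral over a circle Γ enclosing the spectrum of X̂^T X̂ but not −λ.

```python
import numpy as np

rng = np.random.default_rng(4)
p, t, lam = 10, 1.5, 1.0
X = rng.normal(0.0, 1.0 / np.sqrt(p), size=(3 * p, p))
M = X.T @ X
vals, vecs = np.linalg.eigh(M)

# Direct evaluation of K L_t = (I - e^{-t(M + lam I)}) (M + lam I)^{-1}
direct = vecs @ np.diag((1 - np.exp(-t * (vals + lam))) / (vals + lam)) @ vecs.T

# -1/(2 i pi) * contour integral of f(z) (M - zI)^{-1} dz, trapezoid rule
center = vals.max() / 2
radius = vals.max() / 2 + lam / 2    # encloses the spectrum, leaves out -lam
m = 400
integral = np.zeros((p, p), dtype=complex)
for k in range(m):
    u = np.exp(2j * np.pi * k / m)
    z = center + radius * u
    dz = 2j * np.pi * radius * u / m
    fz = (1 - np.exp(-t * (z + lam))) / (z + lam)
    integral += fz * np.linalg.inv(M - z * np.eye(p)) * dz
integral *= -1 / (2j * np.pi)

assert np.linalg.norm(integral.real - direct) < 1e-6
assert np.linalg.norm(integral.imag) < 1e-6
```

The trapezoid rule on a circle converges exponentially fast here because the integrand is analytic in an annulus around the contour.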

A.2 TRAINING ERROR

Similar formulas can be derived for the training error. For the sake of simplicity, we provide a formula that tracks the training error without the regularization term, that is E⁰_train(β_t) (as in Loureiro et al. (2021)), while still minimizing the loss E^λ_train(β_t). Using the expanded expression (34), and considering the high-dimensional assumption with concentration Ē⁰_train(t) := lim_{d→+∞} E⁰_train(β_t) = lim_{d→+∞} E_{β_0}[E⁰_train(β_t)], we have:

Ē⁰_train(t) = Tr_n[Z^T Z V*] + r_0² Tr_n[A^T Z^T ZA (I − L_t)²]   (71)
+ Tr_n[Z^T ZA K L_t (A^T Z^T ZA) L_t K A^T Z^T Z V*]   (72)
− 2 Tr_n[Z^T ZA L_t K A^T Z^T Z V*]   (73)

First of all, standard random matrix results (for instance, see Rubio & Mestre (2011)) assert that Tr_d[Z^T Z V*] = Tr_d[Z^T Z] Tr_d[V*] = φ c_0, so that Tr_n[Z^T Z V*] = c_0. This result can also be derived within our random matrix theory framework; for completeness we provide this calculation in C.2. Therefore, we can define H_0(t) and H_1(t) such that Ē⁰_train(t) = c_0 + r_0² H_0(t) + H_1(t)   (74), where we have the traces:

H_0(t) = Tr_n[A^T Z^T ZA (I − L_t)²]   (75)
H_1(t) = Tr_n[Z^T ZA K L_t (A^T Z^T ZA) L_t K A^T Z^T Z V*] − 2 Tr_n[Z^T ZA L_t K A^T Z^T Z V*]   (76)

And using the functional calculus argument with the Cauchy integration formula over the same contour Γ, we find:

H_1(t) = −1/(4π²) ∮_Γ ∮_Γ [(1 − e^{−t(x+λ)})(1 − e^{−t(y+λ)})] / [(x + λ)(y + λ)] h_1(x, y) dx dy + 1/(iπ) ∮_Γ [(1 − e^{−t(z+λ)}) / (z + λ)] h_2(z) dz   (77)
H_0(t) = −1/(2iπ) ∮_Γ e^{−2t(z+λ)} h_0(z) dz   (78)

where we use the traces (which only contain algebraic expressions of matrices):

h_1(x, y) = Tr_n[Z^T ZA K(x) (A^T Z^T ZA) K(y) A^T Z^T Z V*]   (79)
h_2(z) = Tr_n[Z^T ZA K(z) A^T Z^T Z V*]   (80)
h_0(z) = Tr_n[Z^T ZA K(z) A^T]   (81)

The expression of h_1 can be reduced to smaller terms as before with f_1:

φ h_1(x, y) = Tr_d[Z^T Z U G(x)^T Z^T Z G(y) U Z^T Z V*]   (82)
= Tr_d[(G(x)^{−1} + xI)^T G(x)^T Z^T Z G(y) (G(y)^{−1} + yI) V*]   (83)
= Tr_d[Z^T Z V*] + x Tr_d[G(x)^T Z^T Z V*] + y Tr_d[Z^T Z G(y) V*]   (84)
+ xy Tr_d[Z^T Z G(y) V* G(x)^T]   (85)
= c_0 φ + x Tr_d[Z^T Z G(x) V*] + y Tr_d[Z^T Z G(y) V*]   (86)
+ xy Tr_d[Z^T Z G(y) V* G(x)^T]   (87)

and similarly with h_2:

φ h_2(z) = Tr_d[Z^T Z G(z) U Z^T Z V*]   (88)
= Tr_d[Z^T Z G(z) (G(z)^{−1} + zI) V*]   (89)
= Tr_d[Z^T Z V*] + z Tr_d[Z^T Z G(z) V*]   (90)
= c_0 φ + z Tr_d[Z^T Z G(z) V*]   (91)

and similarly with h_0:

φ h_0(z) = Tr_d[Z^T Z G(z) U]   (92)
= Tr_d[G(z) (G(z)^{−1} + zI)]   (93)
= 1 + z Tr_d[G(z)]   (94)

We can also define the term h̃_1(x, y) = xy Tr_n[Z G(y) V* G(x)^T Z^T], so that:

h_1(x, y) = h_2(x) + h_2(y) + h̃_1(x, y) − c_0   (95)

B TEST ERROR AND TRAINING ERROR LIMITS WITH LINEAR PENCILS

In this section we compute a set of self-consistent equations used to derive the high-dimensional evolution of the training and test errors. We refer to Appendix E for the definitions and result statements concerning linear pencils. We will essentially derive two linear-pencils, of sizes 6 × 6 and 4 × 4, which will enable us to calculate the limiting values of f̃_1, f_2, f_0 for the test error, and h̃_1, h_2, h_0 for the training error. Note that these block-matrices are derived essentially by observing the recursive application of the block-matrix inversion formula and manipulating it so as to obtain the desired result. Compared to other works such as Bodin & Macris (2021); Adlam & Pennington (2020a), our approach yields smaller linear-pencils to handle, which in turn yields a smaller set of algebraic equations. One of the ingredients of our method consists in a multiple-stage approach where the traces of some random blocks can be calculated in different parts (see the random features model in Appendix D.4). However, the question of finding the simplest linear-pencil remains open and interesting to investigate.

B.1 LIMITING TRACES OF THE TEST ERROR

LIMITING TRACES FOR f̃_1 AND f_0

We construct a linear-pencil M_1 as follows (with Z the random matrix under consideration):

M_1 =
[  0     0     0    −yI     0     Z^T ]
[  0     0     0     0      Z     I   ]
[  0     0     0     U      I     0   ]
[ −xI    0     U    −xyV*   0     0   ]
[  0     Z^T   I     0      0     0   ]
[  Z     I     0     0      0     0   ]   (96)

The inverse of this block-matrix contains the terms appearing in the traces of f̃_1 and f_0. To see this, let us calculate the inverse of M_1 by first splitting it into "flattened" blocks:

M_1 = [ 0, B_y^T; B_x, D ]   =⇒   M_1^{−1} = [ −B_x^{−1} D (B_y^T)^{−1}, B_x^{−1}; (B_y^T)^{−1}, 0 ]   (97)

where B_x and D are given by:

B_x = [ −xI, 0, U; 0, Z^T, I; Z, I, 0 ],   D = [ −xyV*, 0, 0; 0, 0, 0; 0, 0, 0 ]   (98)

To calculate the inverse of B_x, notice first that its lower right-hand sub-block has inverse:

[ Z^T, I; I, 0 ]^{−1} = [ 0, I; I, −Z^T ]   (99)

which leads us to the following inverse using the block-matrix inversion formula (the dotted entries are not required):

B_x^{−1} =
[  G(x)        −G(x)U    G(x)U Z^T         ]
[ −Z G(x)       ...      I_n − Z G(x) U Z^T ]
[  Z^T Z G(x)   ...       ...              ]   (100)

Let us now consider g, the limiting value of g_d, and calculate the mapping η(g):

η(g) =
[ 0        0      0   0    φg_26    0    ]
[ 0        0      0   0    0        g_15 ]
[ 0        0      0   0    0        0    ]
[ 0        0      0   0    0        0    ]
[ φg_62    0      0   0    φg_22    0    ]
[ 0        g_51   0   0    0        g_11 ]   (101)

So we can calculate the matrix Π(M_1) such that the elements of g are the limiting traces of the square sub-blocks of (Π(M_1))^{−1} (divided by the block-size), following the steps of the result in App. E:

Π(M_1) =
[  0          0           0    −yI      −φg_26 I    0          ]
[  0          0           0     0        0          (1−g_15)I  ]
[  0          0           0     U        I          0          ]
[ −xI         0           U    −xyV*     0          0          ]
[ −φg_62 I    0           I     0       −φg_22 I    0          ]
[  0         (1−g_51)I    0     0        0         −g_11 I     ]   (104)

Therefore, there remains to compute the inverse of Π(M_1).
We split $\Pi(M_1)$ again into flattened sub-blocks to make the calculation easier:
$$
\Pi(M_1) = \begin{pmatrix} 0 & \tilde B_y^T \\ \tilde B_x & \tilde D \end{pmatrix}
\implies
\Pi(M_1)^{-1} = \begin{pmatrix} -\tilde B_x^{-1} \tilde D (\tilde B_y^{-1})^T & \tilde B_x^{-1} \\ (\tilde B_y^{-1})^T & 0 \end{pmatrix}
\quad (105)
$$
with the three block matrices
$$
\tilde B_x = \begin{pmatrix} -xI & 0 & U \\ -\phi g_{62} I & 0 & I \\ 0 & (1 - g_{51})I & 0 \end{pmatrix},
\qquad
\tilde B_y = \begin{pmatrix} -yI & 0 & U \\ -\phi g_{26} I & 0 & I \\ 0 & (1 - g_{15})I & 0 \end{pmatrix},
\quad (106)
$$
$$
\tilde D = \begin{pmatrix} -xyV^* & 0 & 0 \\ 0 & -\phi g_{22} I & 0 \\ 0 & 0 & -g_{11} I \end{pmatrix}
$$
A straightforward application of the block-matrix inversion formula yields the inverse of $\tilde B_x$:
$$
\tilde B_x^{-1} = \begin{pmatrix}
(\phi g_{62} U - xI)^{-1} & -U(\phi g_{62} U - xI)^{-1} & 0 \\
0 & 0 & (1 - g_{51})^{-1} I \\
\phi g_{62}(\phi g_{62} U - xI)^{-1} & -x(\phi g_{62} U - xI)^{-1} & 0
\end{pmatrix}
$$
Therefore, we retrieve the following closed set of equations:
$$
g_{11} = \mathrm{Tr}_d\big[(g_{62}\phi U - xI)^{-1}(xyV^* + g_{22}\phi U^2)(g_{26}\phi U - yI)^{-1}\big] \quad (109)
$$
$$
g_{22} = g_{11}(1 - g_{15})^{-1}(1 - g_{51})^{-1}, \qquad g_{26} = (1 - g_{51})^{-1} \quad (111)
$$
$$
g_{15} = -\mathrm{Tr}_d\big[U(g_{62}\phi U - xI)^{-1}\big]
$$
These equations can be simplified slightly by removing $g_{22}, g_{26}$ and introducing $q_{15} = 1 - g_{15}$ (and, symmetrically, $q_{51} = 1 - g_{51}$):
$$
g_{11} = \mathrm{Tr}_d\big[(\phi U - xq_{15} I)^{-1}(xy\,q_{15}q_{51}V^* + g_{11}\phi U^2)(\phi U - yq_{51} I)^{-1}\big]
$$
$$
q_{15} = \mathrm{Tr}_d\big[(\phi U - xq_{15}I + q_{15}U)(\phi U - xq_{15}I)^{-1}\big] \quad (114)
$$
$$
g_{15} = 1 - q_{15} \quad (115)
$$
Let $\zeta_x = -xq_{15}$, or by symmetry $\zeta_y = -yq_{51}$; then, using the facts that $\bar f_1(x,y) = -g_{11}$ and $f_0(x) = -g_{15}$, we find the system of equations
$$
\bar f_1(x,y) = \mathrm{Tr}_d\big[(\phi U + \zeta_x I)^{-1}(\zeta_x\zeta_y V^* + \bar f_1(x,y)\phi U^2)(\phi U + \zeta_y I)^{-1}\big] \quad (116)
$$
$$
f_0(x) = -\Big(1 + \frac{\zeta_x}{x}\Big), \qquad
\zeta_z = -z + \mathrm{Tr}_d\big[\zeta_z U(\phi U + \zeta_z I)^{-1}\big]
$$
Remark: as a byproduct of this analysis, notice the term $g_{62} = (q_{15})^{-1} = -\frac{x}{\zeta_x}$. In fact we have:
$$
g_{62} = \mathrm{Tr}_n\big[I_n - ZG(x)UZ^T\big] \quad (119)
$$
$$
= 1 - \mathrm{Tr}_n\big[Z(AA^TZ^TZ - xI)^{-1}AA^TZ^T\big] \quad (120)
$$
$$
= 1 - \mathrm{Tr}_n\big[(ZAA^TZ^T - xI)^{-1}ZAA^TZ^T\big] \quad (121)
$$
$$
= 1 - \mathrm{Tr}_n\big[(\hat X\hat X^T - xI)^{-1}(\hat X\hat X^T - xI_n + xI_n)\big] \quad (122)
$$
$$
= -x\,\mathrm{Tr}_n\big[(\hat X\hat X^T - xI)^{-1}\big]
$$
So if we let $m(x) = \mathrm{Tr}_n[(\hat X\hat X^T - xI)^{-1}]$, the trace of the resolvent of the student data matrix, we find that $m(x) = \zeta_x^{-1}$. This can be useful for analyzing the eigenvalues as in Appendix D.2.1.
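As a numerical sanity check on the self-consistent equation $\zeta_z = -z + \mathrm{Tr}_d[\zeta_z U(\phi U + \zeta_z I)^{-1}]$ and the remark $m(x) = \zeta_x^{-1}$, the fixed point can be iterated for a diagonal $U$ and compared against the empirical resolvent trace. The sketch below is ours, not the paper's code; it assumes $Z$ has i.i.d. $N(0, 1/d)$ entries and a two-atom spectrum for $U = AA^T$:

```python
import numpy as np

def zeta_fixed_point(z, u_spec, phi, iters=500):
    """Iterate zeta = -z + (1/d) sum_i zeta*u_i/(phi*u_i + zeta), for real z < 0."""
    zeta = -z
    for _ in range(iters):
        zeta = -z + np.mean(zeta * u_spec / (phi * u_spec + zeta))
    return zeta

rng = np.random.default_rng(0)
d, n = 2000, 1000
phi = n / d
u_spec = np.where(np.arange(d) % 2 == 0, 0.5, 2.0)  # spectrum of U = A A^T (two atoms)
A = np.diag(np.sqrt(u_spec))
Z = rng.standard_normal((n, d)) / np.sqrt(d)        # entries N(0, 1/d)
Xhat = Z @ A                                        # student data matrix
x = -1.0
zeta = zeta_fixed_point(x, u_spec, phi)
m_theory = 1.0 / zeta
# empirical resolvent trace m(x) = (1/n) Tr[(Xhat Xhat^T - x I)^{-1}]
m_emp = np.trace(np.linalg.inv(Xhat @ Xhat.T - x * np.eye(n))) / n
print(m_theory, m_emp)
```

The two printed values agree to within a percent already at these moderate sizes, which illustrates the concentration assumed throughout.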

LIMITING TRACE FOR f 2

As before, we construct a second linear pencil $M_2$, with $Z$ the random matrix component under consideration:
$$
M_2 = \begin{pmatrix}
I & 0 & 0 & 0 \\
-zV^* & -zI & 0 & U \\
0 & 0 & Z^T & I \\
0 & Z & I & 0
\end{pmatrix}
$$
The former flattened block $B_z$ can be recognized in the lower right-hand side of $M_2$, thus we can use the block-matrix inversion formula and get:
$$
M_2^{-1} = \begin{pmatrix}
I & 0 \\
\begin{pmatrix} zG(z)V^* \\ -zZG(z)V^* \\ zZ^TZG(z)V^* \end{pmatrix} & B_z^{-1}
\end{pmatrix}
$$
Now it is clear that we can express $f_2(z) = c_0 + \lim_{d\to+\infty} g_{21}$. Following the steps of App. E we calculate the mapping
$$
\eta(g) = \begin{pmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & \phi g_{34} & 0 & 0 \\
0 & 0 & g_{23} & 0
\end{pmatrix}
\quad (126)
$$
which in turn enables us to calculate $\Pi(M_2)$:
$$
\Pi(M_2) = \begin{pmatrix}
I & 0 & 0 & 0 \\
-zV^* & -zI & 0 & U \\
0 & -\phi g_{34} I & 0 & I \\
0 & 0 & (1 - g_{23})I & 0
\end{pmatrix}
\quad (127)
$$
To compute the inverse of $\Pi(M_2)$, the block matrix is first split with the sub-block $\tilde B_z$ defined as follows:
$$
\tilde B_z = \begin{pmatrix} -zI & 0 & U \\ -\phi g_{34} I & 0 & I \\ 0 & (1 - g_{23})I & 0 \end{pmatrix},
\qquad
\Pi(M_2) = \begin{pmatrix}
I & 0 \\
\begin{pmatrix} -zV^* \\ 0 \\ 0 \end{pmatrix} & \tilde B_z
\end{pmatrix}
\quad (128)
$$
A straightforward application of the block-matrix inversion formula yields the inverse of $\tilde B_z$:
$$
\tilde B_z^{-1} = \begin{pmatrix}
(\phi g_{34} U - zI)^{-1} & -U(\phi g_{34} U - zI)^{-1} & 0 \\
0 & 0 & (1 - g_{23})^{-1} I \\
\phi g_{34}(\phi g_{34} U - zI)^{-1} & -z(\phi g_{34} U - zI)^{-1} & 0
\end{pmatrix}
\quad (129)
$$
Hence we can derive the inverse
$$
\Pi(M_2)^{-1} = \begin{pmatrix}
I & 0 \\
\begin{pmatrix} z(\phi g_{34} U - zI)^{-1}V^* \\ 0 \\ z\phi g_{34}(\phi g_{34} U - zI)^{-1}V^* \end{pmatrix} & \tilde B_z^{-1}
\end{pmatrix}
\quad (130)
$$
Eventually, using the fixed-point result on linear pencils, we derive the set of equations
$$
g_{21} = \mathrm{Tr}_d\big[zV^*(g_{34}\phi U - zI)^{-1}\big] \quad (131)
$$
$$
g_{34} = (1 - g_{23})^{-1} \quad (132)
$$
$$
g_{23} = -\mathrm{Tr}_d\big[U(g_{34}\phi U - zI)^{-1}\big] \quad (133)
$$
$$
g_{41} = \mathrm{Tr}_d\big[zg_{34}\phi(g_{34}\phi U - zI)^{-1}V^*\big] \quad (134)
$$
$$
g_{22} = \mathrm{Tr}_d\big[(g_{34}\phi U - zI)^{-1}\big] \quad (135)
$$
In fact, it is straightforward to see that $g_{23}, g_{34}$ follow the same equations as the former $g_{15}, g_{26}$ in the previous subsection; therefore $g_{23} = g_{15} = 1 - q_{15} = 1 + \frac{\zeta_z}{z}$, and thus $g_{34} = -\frac{z}{\zeta_z}$. Eventually we get $g_{21} = -\mathrm{Tr}_d\big[\zeta_z V^*(\phi U + \zeta_z I)^{-1}\big]$, so in the limit $d\to\infty$:
$$
f_2(z) = c_0 - \mathrm{Tr}_d\big[\zeta_z V^*(\phi U + \zeta_z I)^{-1}\big] \quad (137)
$$

B.2 LIMITING TRACES FOR THE TRAINING ERROR

LIMITING TRACE FOR $\bar h_1$

Careful attention to the linear pencil $M_1$ shows that the terms in the trace of $\bar h_1$ are actually given by the entry $g_{22}$. One also has to be careful about the fact that $(M_1^{-1})_{22}$ is a block matrix of size $n \times n$, so it is already normalized by the size $n$ (and not $d$). Hence we simply have, with $\eta_z = -\frac{z}{\zeta_z}$:
$$
\bar h_1(x,y) = -g_{22} = \Big(-\frac{x}{\zeta_x}\Big)\Big(-\frac{y}{\zeta_y}\Big)\bar f_1(x,y) = \eta_x\eta_y\,\bar f_1(x,y)
$$

LIMITING TRACE FOR $h_2$

In the case of $h_2$, we need the specific term provided by the linear pencil $M_2$ at the location $g_{41}$, with $\phi h_2(z) = c_0\phi + g_{41}$. That is, instead of using $g_{21}$ as for $f_2$, we use $g_{41}$. We find:
$$
g_{41} = z\phi\,\mathrm{Tr}_d\big[V^*(\phi U + \zeta_z I)^{-1}\big] \quad (139)
$$
$$
= \phi\,\frac{z}{\zeta_z}\,\mathrm{Tr}_d\big[\zeta_z V^*(\phi U + \zeta_z I)^{-1}\big] \quad (140)
$$
$$
= \phi\,\frac{z}{\zeta_z}\,\big(c_0 - f_2(z)\big)
$$
Hence:
$$
h_2(z) = c_0\Big(1 + \frac{z}{\zeta_z}\Big) - \frac{z}{\zeta_z} f_2(z) = \eta_z\big(c_0 f_0(z) + f_2(z)\big)
$$

LIMITING TRACE FOR $h_0$

Finally, for $h_0$ we use again the linear pencil $M_2$, with:
$$
\mathrm{Tr}_d[zG(z)] = zg_{22} = -\mathrm{Tr}_d\big[\zeta_z(\phi U + \zeta_z I)^{-1}\big] \quad (143)
$$
$$
= -\mathrm{Tr}_d\big[(\zeta_z I + \phi U - \phi U)(\phi U + \zeta_z I)^{-1}\big] \quad (144)
$$
$$
= -1 + \phi\,\mathrm{Tr}_d\big[U(\phi U + \zeta_z I)^{-1}\big] \quad (145)
$$
$$
= -1 + \frac{\phi}{\zeta_z}\,\mathrm{Tr}_d\big[\zeta_z U(\phi U + \zeta_z I)^{-1}\big] \quad (146)
$$
$$
= -1 + \frac{\phi}{\zeta_z}(\zeta_z + z)
$$
Therefore:
$$
h_0(z) = 1 + \frac{z}{\zeta_z} = \Big(-\Big(1 + \frac{\zeta_z}{z}\Big)\Big)\Big(-\frac{z}{\zeta_z}\Big) = \eta_z f_0(z)
$$

C OTHER LIMITING EXPRESSIONS

In this section we sketch the proofs of additional expressions appearing in the main results.

C.1 EXPRESSION WITH THE DUAL COUNTERPART MATRICES

The former functionals $f_2$ and $\bar f_1$ can be rewritten, with the dual matrices $\bar U = \phi A^TA$ and $\Xi = \phi A^TB$, as:
$$
f_2(z) = c_0 - \mathrm{Tr}_d\big[\zeta_z V^*(\phi U + \zeta_z I)^{-1}\big] \quad (149)
$$
$$
= c_0 - \mathrm{Tr}_d\big[(\zeta_z I + \phi U - \phi U)V^*(\phi U + \zeta_z I)^{-1}\big] \quad (150)
$$
$$
= c_0 - \mathrm{Tr}_d[V^*] + \mathrm{Tr}_d\big[\phi A^T V^* A(\bar U + \zeta_z I)^{-1}\big] \quad (151)
$$
$$
= c_0 - c_0 + \mathrm{Tr}_d\big[\phi A^T B\beta^*\beta^{*T} B^T A(\bar U + \zeta_z I)^{-1}\big] \quad (152)
$$
$$
= \mathrm{Tr}_n\big[(\Xi\beta^*\beta^{*T}\Xi^T)(\bar U + \zeta_z I)^{-1}\big] \quad (153)
$$
With similar steps, using
$$
\zeta_x V^*\zeta_y = -(\zeta_x I + \phi U)V^*(\zeta_y I + \phi U) + \zeta_x V^*(\zeta_y I + \phi U) + (\zeta_x I + \phi U)V^*\zeta_y + \phi^2 UV^*U \quad (154)
$$
we find:
$$
\bar f_1(x,y) = -c_0 + \mathrm{Tr}_d\big[\zeta_x V^*(\zeta_y I + \phi U)^{-1}\big] + \mathrm{Tr}_d\big[(\zeta_x I + \phi U)^{-1}V^*\zeta_y\big] \quad (155)
$$
$$
+ \mathrm{Tr}_d\big[(\phi U + \zeta_x I)^{-1}(\phi^2 UV^*U + \bar f_1(x,y)\phi U^2)(\phi U + \zeta_y I)^{-1}\big] \quad (156)
$$
$$
= c_0 - f_2(x) - f_2(y) + \mathrm{Tr}_n\big[(\bar U + \zeta_x I)^{-1}\big((\Xi\beta^*\beta^{*T}\Xi^T) + \bar f_1(x,y)\bar U\big)\bar U(\bar U + \zeta_y I)^{-1}\big]
$$
Hence in fact:
$$
f_1(x,y) = \mathrm{Tr}_n\big[(\bar U + \zeta_x I)^{-1}\big((\Xi\beta^*\beta^{*T}\Xi^T) + \bar f_1(x,y)\bar U\big)\bar U(\bar U + \zeta_y I)^{-1}\big] \quad (159)
$$
Finally, we have, using the push-through identity and the cyclicity of the trace:
$$
\zeta_z = -z + \mathrm{Tr}_d\big[\zeta_z AA^T(\phi AA^T + \zeta_z I)^{-1}\big] \quad (160)
$$
$$
= -z + \mathrm{Tr}_d\big[\zeta_z A(\phi A^TA + \zeta_z I)^{-1}A^T\big] \quad (161)
$$
$$
= -z + \mathrm{Tr}_n\big[\zeta_z \bar U(\bar U + \zeta_z I)^{-1}\big] \quad (162)
$$

C.2 LIMITING TRACE OF $Z^TZV^*$

Here we show another way in which our random matrix result can be used to infer the result on the limiting trace $\mathrm{Tr}_d[Z^TZV^*]$. To this end, we can design the linear pencil:
$$
M_3 = \begin{pmatrix}
I & -V^* & 0 & 0 \\
0 & I & Z^T & 0 \\
0 & 0 & I & Z \\
0 & 0 & 0 & I
\end{pmatrix}
$$
It is straightforward to calculate the inverse of the sub-matrix:
$$
\begin{pmatrix} I & Z^T & 0 \\ 0 & I & Z \\ 0 & 0 & I \end{pmatrix}^{-1}
= \begin{pmatrix} I & -Z^T & Z^TZ \\ 0 & I & -Z \\ 0 & 0 & I \end{pmatrix} \quad (164)
$$
So that:
$$
M_3^{-1} = \begin{pmatrix}
I & V^* & -V^*Z^T & V^*Z^TZ \\
0 & I & -Z^T & Z^TZ \\
0 & 0 & I & -Z \\
0 & 0 & 0 & I
\end{pmatrix}
$$
At this point, it is clear that the quantity of interest is provided by the term $g_{14}$ of the linear pencil $M_3$. We calculate further:
$$
\eta(g) = \begin{pmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & \phi g_{33} \\
0 & 0 & g_{42} & 0 \\
0 & 0 & 0 & 0
\end{pmatrix} \quad (166)
$$
Based on the inverse of $M_3$, we can already predict that $g_{33} = 1$ and $g_{42} = 0$.
Hence:
$$
\Pi(M_3) = \begin{pmatrix}
I & -V^* & 0 & 0 \\
0 & I & 0 & -\phi I \\
0 & 0 & I & 0 \\
0 & 0 & 0 & I
\end{pmatrix}
\implies
\Pi(M_3)^{-1} = \begin{pmatrix}
I & V^* & 0 & \phi V^* \\
0 & I & 0 & \phi I \\
0 & 0 & I & 0 \\
0 & 0 & 0 & I
\end{pmatrix}
$$

D APPLICATIONS AND CALCULATION DETAILS

D.1 MISMATCHED RIDGELESS REGRESSION OF A NOISY LINEAR FUNCTION

Target function Here we consider a slightly more complicated version of the former example where we let y(x 0 , x 1 ) = r x T 0 β * 0 + x T 1 β * 1 + σ and still averaged over β 0 ∼ N (0, I γp ) and β 1 ∼ N (0, I (1-γ)p ) with x 0 ∈ R γp , x 1 ∈ R (1-γ)p . We let again d = p + q and ψ = p d and φ 0 = p q . Therefore the former relation still holds φ = n d = n p p d = φ 0 ψ. Similarly, we derive a block-matrix B and compute V * : B =      r d p I γp 0 0 0 r d p I (1-γ)p 0 0 0 σ d q I q      =⇒ V * =    r 2 ψ I γp 0 0 0 r 2 ψ I (1-γ)q 0 0 0 σ 2 1-ψ I q    So that with the splitting Z = p d X 0 | p d X 1 | q d Σ , and β * T = β * T 0 |β * T 1 |β * T 2 , and with ξ = Σβ * 2 : Y = ZBβ * = r(X 0 β * 0 + X 1 β * 1 ) + σξ Estimator Following the same steps, we construct A and U with A =    d γp I γp 0 (1-γ)p×γd 0 q×γd    =⇒ U =   1 γψ I γp 0 0 0 0 0 0 0 0   So that we get the linear estimator Ŷt Ŷt = ZAβ t = 1 √ γ X 0 β t Analytic result as U and V * commute again, the joint probability distribution can be derived: P u = 1 γψ , v = r 2 ψ = γψ P u = 0, v = r 2 ψ = (1 -γ)ψ P u = 0, v = σ 2 (1 -ψ) = 1 -ψ Therefore, in the regime λ = 0, with κ = φ0 γ , a calculation leads to the following result (dubbed the "mismatched model" in Hastie et al. ( 2019)) E gen (+∞) = f1 = κ κ-1 (σ 2 + (1 -γ)r 2 ) (κ > 1) 1 1-κ σ 2 + r 2 γ(1 -κ) (κ < 1)
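The limiting formula above can be checked against a direct Monte Carlo simulation of the mismatched least-squares problem: in the underparameterized regime $\kappa > 1$, the unfitted part of the signal behaves as extra label noise of variance $(1-\gamma)r^2$. The sketch below is ours (an equivalent unit-variance-feature formulation rather than the paper's exact scalings), and only the $\kappa > 1$ branch is exercised; the $\kappa < 1$ branch is implemented as we read the printed result:

```python
import numpy as np

def egen_mismatched(kappa, r, gamma, sigma):
    """Limiting test error of the mismatched ridgeless model (our reading of the text)."""
    s_eff = sigma**2 + (1 - gamma) * r**2   # unfitted signal acts as additional noise
    if kappa > 1:
        return kappa / (kappa - 1) * s_eff
    return s_eff / (1 - kappa) + r**2 * gamma * (1 - kappa)

rng = np.random.default_rng(1)
r, gamma, sigma = 1.0, 0.5, 0.5
p0, p1 = 300, 300            # fitted / unfitted feature counts (gamma = 1/2)
n = 600                      # kappa = n / p0 = 2 > 1 (underparameterized)
kappa = n / p0
b0 = rng.standard_normal(p0); b0 *= r * np.sqrt(gamma) / np.linalg.norm(b0)
b1 = rng.standard_normal(p1); b1 *= r * np.sqrt(1 - gamma) / np.linalg.norm(b1)
errs = []
for _ in range(4):
    X0 = rng.standard_normal((n, p0)); X1 = rng.standard_normal((n, p1))
    y = X0 @ b0 + X1 @ b1 + sigma * rng.standard_normal(n)
    bhat = np.linalg.lstsq(X0, y, rcond=None)[0]
    # conditional test risk for isotropic Gaussian test features:
    # sigma^2 + ||b1||^2 (irreducible) + ||bhat - b0||^2 (estimation error)
    errs.append(sigma**2 + np.linalg.norm(b1)**2 + np.linalg.norm(bhat - b0)**2)
mc = float(np.mean(errs))
th = egen_mismatched(kappa, r, gamma, sigma)
print(mc, th)
```

At these sizes the Monte Carlo estimate lands within a few percent of the asymptotic value $\kappa\,(\sigma^2 + (1-\gamma)r^2)/(\kappa - 1)$.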

D.2 NON ISOTROPIC MODEL

We have the joint probabilities P (u = α -i , v = 1) = 1 p = γ for i ∈ {0, . . . , p -1} and λ = 0. Then: f1 = 1 p p-1 i=0 f1 φ + (α i ζ) 2 (φ + α i ζ) 2 (176) ζ = 1 p p-1 i=0 ζ φ + α i ζ (177) f 2 = c 0 - 1 p p-1 i=0 ζα i φ + α i ζ So either ζ = 0 and thus f1 = 0, or ζ = 0 and: f1 = 1 - 1 p p-1 i=0 φ (φ + α i ζ) 2 -1 1 p p-1 i=0 (α i ζ) 2 (φ + α i ζ) 2 (179) 1 = 1 p p-1 i=0 1 φ + α i ζ (180) Writing further down (α i ζ) 2 = (α i ζ + φ -φ) 2 = (α i ζ + φ) 2 -2φ(α i ζ + φ) + φ 2 we get: 1 p p-1 i=0 (α i ζ) 2 (φ + α i ζ) 2 = 1 -2φ 1 p p-1 i=0 1 φ + α i ζ + φ 2 1 p p-1 i=0 1 (φ + α i ζ) 2 (181) = 1 -2φ + φ 2 1 p p-1 i=0 1 (φ + α i ζ) 2 (182) = (1 -φ) -φ 1 - 1 p p-1 i=0 φ (φ + α i ζ) 2 So: f1 = (1 -φ) 1 - 1 p p-1 i=0 φ (φ + α i ζ) 2 -1 -φ Now injecting the expression for ζ: 1 - 1 p p-1 i=0 φ (φ + α i ζ) 2 = 1 p p-1 i=0 1 φ + α i ζ - φ (φ + α i ζ) 2 (185) = 1 p p-1 i=0 α i ζ (φ + α i ζ) 2 Hence the formula E gen (∞) = (1 -φ) 1 p p-1 i=0 α i ζ (φ + α i ζ) 2 -1 -φ Asymptotic limit: Let's consider the behavior of the generalisation error when α → ∞. Let's consider the potential solution for some k ∈ {0, . . . , p -1}: ζ k = c k α k (1 + o α (1)) for some constant c k . Then: p = p-1 i=0 1 φ + c k α i-k (1 + o α (1)) = 1 φ + c k + k φ + o α (1) Hence we choose: c k = φ 1 pφ -k -1 Because E gen (∞) ≥ 0, we need to enforce ζ k > 0 which leads to the condition 1 pφ-k -1 ≥ 0, that is 1 ≥ pφ -k > 0. So in fact it implies φ ∈ k p , k+1 p , so ζ k can only be a solution for φ in this range. Therefore we can consider the solution ζ(φ) = p-1 i=0 1 ] k p ; k+1 p [ (φ)ζ k (φ). Then notice: p-1 i=0 α i ζ k (φ + α i ζ k ) 2 = c k (c k + φ) 2 + o α (1) = -p 2 φ - k p φ - k + 1 p + o α (1) and thus for φ ∈ [0, 1] \ k p Z: E gen (∞) = p-1 k=0 φ(1 -φ) p φ -k p k+1 p -φ 1 ] k p ; k+1 p [ (φ) -φ + o α (1) So we clearly see that in the limit of α large, the test error approaches a function with two roots at the denominator. 
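The ridgeless equations above are easy to solve numerically: $\zeta$ is the root of $\frac{1}{p}\sum_i (\phi + \alpha_i\zeta)^{-1} = 1$ (for $\phi < 1$), after which the closed-form expression for $E_{gen}(\infty)$ follows. A minimal sketch (generic positive spectrum $a_i$ standing in for the $\alpha_i$; names are ours), checked against the isotropic case $p = 1$ where $E_{gen}(\infty) = 1 - \phi$:

```python
import numpy as np

def solve_zeta(alphas, phi, lo=1e-9, hi=1e9, iters=200):
    """Bisection on F(zeta) = (1/p) sum_i 1/(phi + a_i zeta) - 1, decreasing in zeta."""
    a = np.asarray(alphas, dtype=float)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.mean(1.0 / (phi + a * mid)) > 1.0:
            lo = mid      # F(mid) > 0: zeta is larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

def egen_ridgeless(alphas, phi):
    a = np.asarray(alphas, dtype=float)
    z = solve_zeta(a, phi)
    denom = np.mean(a * z / (phi + a * z) ** 2)
    # E_gen(inf) = (1 - phi) / [(1/p) sum_i a_i zeta/(phi + a_i zeta)^2] - phi
    return (1.0 - phi) / denom - phi

e_iso = egen_ridgeless([1.0], 0.5)              # isotropic check: 1 - phi = 0.5
e_aniso = egen_ridgeless([1.0, 10.0, 100.0], 0.6)
print(e_iso, e_aniso)
```

Sweeping $\phi$ over $(0,1)$ with a spread spectrum reproduces the multiple-descent shape discussed above.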
Evolution: f1 (x, y) = 1 p p-1 i=0 f1 (x, y)φ + α 2i ζ x ζ y (φ + α i ζ x )(φ + α i ζ y ) ζ z = -z + 1 p p-1 i=0 ζ z φ + α i ζ z (194) f 2 (z) = c 0 - 1 p p-1 i=0 α i ζ z φ + α i ζ z In particular f 2 is given by: f 2 (z) = c 0 -1 + φ p p-1 i=0 1 φ + α i ζ z = c 0 -1 + φζ z 1 + z ζ z ( ) and f1 is given by: f1 (x, y) = 1 p p-1 i=0 α 2i ζxζy (φ+α i ζx)(φ+α i ζy) 1 -φ p p-1 i=0 1 (φ+α i ζx)(φ+α i ζy) (197) D.2.1 EIGENVALUE DISTRIBUTION In our figures, we look at the log-eigenvalue distribution of the student data ρ log λ as it provides the most natural distributions on a log-scale basis. So in fact, if we plot the curve y(x) = ρ log λ (x) we have: y(x) = ρ log λ (x) = ∂ ∂x P(log λ ≤ x) (198) = ∂ ∂x P(λ ≤ e x ) = e x ρ λ (e x ) (200) So in a log-scale basis we have ρ log λ (log x) = xρ λ (x). It is interesting to notice the connection with η x for running computer simulations: ρ log λ (log x) = x π lim →0 + m(x + i ) = 1 π lim →0 + x + i ζ(x + i ) = - 1 π lim →0 + η x+i It is work mentioning that the bulks are further "detached" as α grows as it can be seen in figure 6 . Furthermore, bigger α makes the spike more distringuisable.  β = arg min β n i=1 θ T 0 φ(x i ) -β T φ(x i ) 2 + λ β 2 (202) Where φ(x) = (φ i (x)) i∈N = ( √ ω i e i (x)) for some orthogonal basis (e i ) i∈N . In fact we can consider: A = B =     √ ω 1 0 • • • 0 0 √ ω 2 • • • 0 . . . . . . . . . . . . 0 0 • • • √ ω d     and z i = (e 1 (x i ), . . . , e d (x i )). Then let's consider the following linear regression problem: β = arg min β Z (Bβ * -Aβ) 2 + λ β 2 (204) E gen ( β) = E z z T Bβ * -A β 2 ( ) This problem is identical to the kernel methods in the situation with a specific β * T = (θ 01 , . . . , θ 0d ). 
Although in general $V^*$ and $U$ need not commute, here both are diagonal. Notice that with $x = y = -\lambda$, due to the diagonal structure of $U$:
$$
\bar f_1 = \mathrm{Tr}_d\big[(\phi U + \zeta I)^{-1}(\zeta^2 V^* + \bar f_1\phi U^2)(\phi U + \zeta I)^{-1}\big] \quad (206)
$$
$$
= \frac{1}{d}\sum_{i=1}^d \big[(\zeta^2 V^* + \bar f_1\phi U^2)(\phi U + \zeta I)^{-2}\big]_{ii} \quad (207)
$$
$$
= \frac{1}{d}\sum_{i=1}^d \big(\zeta^2 [V^*]_{ii} + \bar f_1\phi[U^2]_{ii}\big)\big(\phi[U]_{ii} + \zeta\big)^{-2}
$$
So in fact we find the self-consistent set of equations, with $E_{gen}(+\infty) = \bar f_1$:
$$
\zeta = \lambda + \frac{1}{d}\sum_{i=1}^d \frac{\zeta\omega_i}{\phi\omega_i + \zeta} \quad (209)
$$
$$
\bar f_1 = \frac{1}{d}\sum_{i=1}^d \frac{\bar f_1\phi\omega_i^2 + \zeta^2\theta_{0i}^2\omega_i}{(\phi\omega_i + \zeta)^2} \quad (210)
$$
This is precisely the result of equation (78) in Loureiro et al. (2021) (see also Bordelon et al. (2020)) with the change of variables $\lambda(1+V) \to \zeta$ and $\rho + q - 2m \to \bar f_1$.
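The pair (209)-(210) is a one-dimensional fixed point for $\zeta$ followed by a linear equation for $\bar f_1$. A minimal solver sketch (ours; illustrative power-law spectrum), checked in the large-$\lambda$ limit where the predictor vanishes and $\bar f_1$ must approach the signal power $\frac{1}{d}\sum_i \theta_{0i}^2\omega_i$:

```python
import numpy as np

def kernel_fixed_point(omega, theta0, lam, phi, iters=500):
    """Iterate (209) for zeta, then solve the linear equation (210): f1 = S2/(1 - S1)."""
    w, t2 = np.asarray(omega), np.asarray(theta0) ** 2
    zeta = lam + 1.0
    for _ in range(iters):
        zeta = lam + np.mean(zeta * w / (phi * w + zeta))
    S1 = np.mean(phi * w**2 / (phi * w + zeta) ** 2)          # coefficient of f1
    S2 = np.mean(zeta**2 * t2 * w / (phi * w + zeta) ** 2)    # source term
    return zeta, S2 / (1.0 - S1)

rng = np.random.default_rng(2)
d = 500
omega = 1.0 / np.arange(1, d + 1) ** 2        # power-law spectrum (illustrative)
theta0 = rng.standard_normal(d)
zeta, f1 = kernel_fixed_point(omega, theta0, lam=1e3, phi=2.0)
signal = np.mean(theta0**2 * omega)           # expected limit of f1 as lam -> infinity
print(f1, signal)
```

For moderate $\lambda$ the same routine traces out the full ridge learning curve as $\phi$ varies.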

D.4 RANDOM FEATURES EXAMPLE

We get the following matrices U, V with μ2 = µ 2 ψ , ν2 = ν 2 ψ , r2 = r 2 ψ , σ2 = σ 2 1-(1+ψ0)ψ : U =   μ2 W W T μνW 0 μνW T ν2 I N 0 0 0 0   V =   r2 I p 0 0 0 0 0 0 0 σ2 I q   In fact, the matrices U and V do not commute with each other, so we have more involved calculations. First we consider the subspace F = Ker(V -σ2 I q ) ⊥ . Let's define the matrices: U F = μ2 W W T μνW μνW T ν2 I N V F = r2 I p 0 0 0 (212) U F ⊥ = (0) V F ⊥ = σ2 I q ) Then, although U and V can't be diagonalized in the same basis, they are still both block-diagonal matrices in the same direct-sum space R d = F ⊕ F ⊥ , so in fact the following split between the two subspaces F and F ⊥ holds: f1 = Tr d (φU F + ζ x I) -1 ζ x ζ y V F (φU F + ζ y I) -1 (214) + Tr d (φU F ⊥ + ζ x I) -1 ζ x ζ y V F ⊥ (φU F ⊥ + ζ y I) -1 (215) + Tr d (φU + ζ x I) -1 f1 φU 2 (φU + ζ y I) -1 (216) Now let's define κ 1 , κ 2 , κ 3 such that: f1 = r 2 κ 1 + f1 (1 -κ -1 2 ) + σ 2 κ 3 That is to say, we get directly f1 = (r 2 κ 1 + σ 2 κ 3 )κ 2 and by definition: r 2 κ 1 = Tr d (φU F + ζ x I) -1 ζ x ζ y V F (φU F + ζ y I) -1 (218) 1 - 1 κ 2 = Tr d (φU + ζ x I) -1 φU 2 (φU + ζ y I) -1 (219) σ 2 κ 3 = Tr d (φU F ⊥ + ζ x I q ) -1 ζ x ζ y V F ⊥ (φU F ⊥ + ζ y I q ) -1 = σ 2 (220) So we already know that κ 3 = 1. Let's focus on κ 1 , we can deal with a linear pencil M such that we would get the desired term. 
First we define similarly A T F , the restriction of A T on the subspace F : A F = μW νI N =⇒ U F = A F A T F (221) Then, following the structure of M 1 we can construct the following linear-pencil M : M =     0 0 ζ y I A F 0 0 A T F -1 φ I ζ x I A F -ζ x ζ y V F 0 A T F -1 φ I 0 0     =   0 B y B x -ζ x ζ y V F 0 0 0   So that: M -1 =   B -1 x -ζ x ζ y V F 0 0 0 B -1 y B -1 x B -1 y 0   where: B -1 x = (φU F + ζ x I) -1 φ(φU F + ζ x I) -1 A F A T F φ(φU F + ζ x I) -1 (-1 φ I -1 ζy A T F A F ) -1 In the above matrices, the sub-blocks A F and V F are implicitly flattened, so in fact M is given completely by: M =        0 0 0 ζ y I 0 μW 0 0 0 0 ζ y I νI 0 0 0 μW T νI -1 φ I ζ x I 0 μW -r 2 ζ x ζ y I p 0 0 0 ζ x I νI 0 0 0 μW T νI -1 φ I 0 0 0        and therefore, one has to pay attention on the quantity of interest which is given by a sum of two terms: r 2 κ 1 = lim d→+∞ p d g 11 + N d g 22 = ψ(g 11 + ψ 0 g 22 ) (226) Using a Computer-Algebra-System, we get the equations with γ x , γ y , δ x , δ y defined such that g 36 = -ψγ x ζ x , g 63 = -ψγ y ζ y , δ x = ζ x g 14 , δ y = ζ y g 41 : ψg 11 = (ζ x ζ y ) -1 (δ x δ y )(r 2 ζ x ζ y + µ 2 ψ 0 g 33 ) ( ) ψg 22 = φ -2 (γ x γ y )(ψg 11 µ 2 ν 2 φ 2 ) ( ) g 33 = (ζ x ζ y )(γ x γ y )(ψg 11 µ 2 ) ( ) δ y = (1 + γ y µ 2 ψ 0 ) -1 (230) γ y = (µ 2 δ y + φ -1 0 ζ y + ν 2 ) -1 So: (1 -µ 4 ψ 0 (δ x δ y )(γ x γ y ))ψg 11 = (δ x δ y )(r 2 ) (232) and: ψg 11 + ψ 0 ψg 22 = 1 + ψ 0 µ 2 ν 2 (γ x γ y ) (ψg 11 ) Hence the result: κ 1 = 1 + ν 2 µ 2 ψ 0 (γ x γ y ) 1 -µ 4 ψ 0 (δ x δ y )(γ x γ y ) (δ x δ y ) Also there remain to use the last equation regarding ζ x using the fact that: ζ y + y = Tr d ζ x U (φU + ζ x I) -1 Notice that we have g 63 = -γ y ψζ y = Tr N - 1 φ I - 1 ζ y A T F A F -1 (236) So because A T F A F = A T A: ζ y γ y = φ 0 ζ y Tr N (φA T A + ζ y I) -1 (237) = φ 0 Tr N (φA T A + ζ y I -φA T A)(φA T A + ζ y I) -1 (238) = φ 0 Tr N I -φA T A(φA T A + ζ y I) -1 (239) = φ 0 1 -Tr N φ(φU + ζ y I) -1 U 
(240) = φ 0 1 - φ 0 ψ 0 ζ y Tr d ζ y U (φU + ζ y I) -1 (241) = φ 0 1 - φ 0 ψ 0 ζ y (ζ y + y) Therefore: γ y φ 0 ζ y = 1 - φ 0 ψ 0 1 + y ζ y For κ 2 we can calculate the following expression -which in fact is general and doesn't depend on the specific design of U : 1 - 1 κ 2 = Tr d (φU + ζ x I) -1 φU 2 (φU + ζ y I) -1 (244) = Tr d (φU + ζ x I) -1 (φU + ζ x I -ζ x I)U (φU + ζ y I) -1 (245) = Tr d (I -ζ x (φU + ζ x I) -1 )U (φU + ζ y I) -1 (246) = Tr d U (φU + ζ y I) -1 -ζ x (φU + ζ x I) -1 )U (φU + ζ y I) -1 (247) = Tr d U (φU + ζ y I) -1 - ζ x ζ y -ζ x (U (φU + ζ x I) -1 -U (φU + ζ y I) -1 ) (248) = 1 ζ y -ζ x Tr d ζ y U (φU + ζ y I) -1 -ζ x U (φU + ζ x I) -1 (249) = 1 ζ y -ζ x (ζ y + y -ζ x -x) (250) = 1 + y -x ζ y -ζ x Hence the general formula: κ 2 = - ζ y -ζ x y -x One can check that the same formula applies for instance for the mismatched ridgeless regression. Also, we assume that it can be replaced by its continuous limit in y → x in the situation x = y. Finally for f 2 , we find f 2 = c 0 -Tr d ζ z V (φU + ζ z I) -1 (253) = c 0 -Tr d ζ z V F ⊥ (φU F ⊥ + ζ z I) -1 -Tr d ζ z V F (φU F + ζ z I) -1 (254) = c 0 -σ 2 -lim d→+∞ p d g 11 + N d g 22 (255) = c 0 -σ 2 -ψ(g 11 + φ 0 g 22 ) where we use g associated to a slightly different linear-pencil M : M =    0 0 I 0 0 0 0 I ζ z I A F -ζ z V F 0 A T F -1 φ I 0 0    from which we get using a Compute-Algebra-System ψg 11 + ψφ 0 g 22 = r 2 δ z (258) Another more straightforward way for obtaining the same result without the need for an additional linear-pencil is to notice that if we let E 1 = (I p |0 p×N ) such that V F = r2 E 1 E T 1 , then we have: Tr d ζ x V F (φU F + ζ x I) -1 = Tr d ζ x r2 E T 1 (φU F + ζ x I) -1 E 1 (259) = r2 ζ x Tr p E T 1 (φU F + ζ x I) -1 E 1 ( ) Therefore reusing the definition of δ x and the former linear-pencil M : Tr d ζ x V F (φU F + ζ x I) -1 = r2 ψζ x g 14 = r 2 δ x Conclusion we have the following equations where a non-rigorous proof is provided using the replica symmetry tool 
from statistical physics. f1 (x, y) = - ζ y -ζ x y -x r 2 1 + ν 2 µ 2 ψ 0 (γ x γ y ) 1 -µ 4 ψ 0 (δ x δ y )(γ x γ y ) (δ x δ y ) + σ 2 (262) f 2 (z) = c 0 -(r 2 δ z + σ 2 ) (263) δ z = (1 + γ z µ 2 ψ 0 ) -1 (264) γ z = (µ 2 δ z + φ -1 0 ζ z + ν 2 ) -1 (265) γ y φ 0 ζ y = 1 - φ 0 ψ 0 1 + y ζ y Here we propose to generalize even further the fixed-point equation where we let the sub-blocks be potentially of any form, and provide a non-rigorous proof of the proposition following the steps proposed in Bun et al. (2017) ; Potters & Bouchaud (2020) using Dyson brownian motions and Itô Lemma to derive the fixed-point equation.

E.1 NOTATIONS AND MAIN STATEMENT

Let us consider an invertible self-adjoint complex block matrix $M \in \mathbb{C}^{N\times N}$ with $N = p_1 + \dots + p_n$, such that $M^{ij}$ is the sub-matrix of size $p_i \times p_j$. We assume that $p_1, \dots, p_n \to \infty$ when $N \to \infty$, with fixed ratios $\gamma_i = \lim_{N\to\infty} \frac{p_i}{N}$, and we define the inverse $G = M^{-1}$. Now let $S = \{ij \mid p_i = p_j\}$. We define, for $(ij) \in S$ (and set to 0 outside of this set):
$$
g^{ij}_N = \frac{1}{p_i}\,\mathrm{Tr}\,G^{ij} = \frac{1}{p_i}\sum_{k=1}^{p_i} G^{ij}_{kk}
$$
We further decompose $M$ as the sum of two components $M = M_0 + \frac{1}{\sqrt N}H$, with $M_0$ and $H$ both self-adjoint, $M_0$ also invertible, and $H$ a block of random matrices independent of $M_0$. In particular, $\mathrm{Re}(H)$ and $\mathrm{Im}(H)$ are element-wise independent of each other, and we leave the possibility that the sub-blocks of $\mathrm{Re}(H)$ and $\mathrm{Im}(H)$ be either a Wigner random matrix, a Wishart random matrix, the adjoint of a Wishart random matrix, or a (real-)weighted sum of any of the three. For the sake of simplicity, we consider that the elements $\mathrm{Re}\,H^{ij}_{uv}$ or $\mathrm{Im}\,H^{ij}_{uv}$ within the block $ij$ are gaussian and identically distributed, although the gaussian assumption can certainly be weakened. Now let us define $\sigma^{kl}_{ij}$, the covariance between the elements of the sub-matrices $H^{ij}$ and $H^{kl}$, that is, for $(il, jk) \in S^2$, at off-diagonal positions $uv$ and transposed element-locations:
$$
\sigma^{kl}_{ij} = \mathbb{E}\big[H^{ij}_{uv}H^{kl}_{vu}\big] = \mathbb{E}\big[\mathrm{Re}\,H^{ij}_{uv}\,\mathrm{Re}\,H^{kl}_{vu}\big] - \mathbb{E}\big[\mathrm{Im}\,H^{ij}_{uv}\,\mathrm{Im}\,H^{kl}_{vu}\big]
$$
Also, there can be some covariances at similar element-locations, which we define, for $(ik, jl) \in S^2$, as $\tilde\sigma^{kl}_{ij} = \mathbb{E}[H^{ij}_{uv}H^{kl}_{uv}]$. With the mapping $\eta$ and $\Pi(M) = M_0 - \eta(g)\otimes I$ as defined in (272), and $\Pi(G) = \Pi(M)^{-1}$, we state that:
$$
g^{ij} = \mathrm{Tr}_{p_i}\big[\Pi(G)^{ij}\big]
$$
Remark 1: when $M_0 = Z \otimes I$ with $Z^{ij} = 0$ if $ij \notin S$, then we get $\Pi(M) = (Z - \eta(g))\otimes I$, and then $\Pi(M)^{-1} = (Z - \eta(g))^{-1}\otimes I$.
Therefore $g = (Z - \eta(g))^{-1}$, or, re-adjusting the terms, we find back the equation from Adlam & Pennington (2020a); Bodin & Macris (2021):
$$
Zg = I_n + \eta(g)g
$$
Remark 2: when considering the linear pencil of a block matrix $M$ which is not necessarily self-adjoint but still invertible, the amplified matrix $\tilde M$ can be considered:
$$
\tilde M = \begin{pmatrix} 0 & M \\ M^T & 0 \end{pmatrix}
$$
This implies that
$$
\tilde M^{-1} = \begin{pmatrix} 0 & (M^T)^{-1} \\ M^{-1} & 0 \end{pmatrix} \quad (278)
$$
So $g$ will also be of the form
$$
g = \begin{pmatrix} 0 & \bar g^T \\ \bar g & 0 \end{pmatrix}, \qquad
\eta(g) = \begin{pmatrix} 0 & \eta(\bar g)^T \\ \eta(\bar g) & 0 \end{pmatrix}
$$
So in fact the same equation still holds with $\bar g^{ij} = \mathrm{Tr}_{p_i}\big[\big((M_0 - \eta(\bar g)\otimes I)^{-1}\big)^{ij}\big]$, and thus the self-adjointness constraint can be relaxed.
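For the scalar Wigner case discussed in this appendix ($n = 1$, $M = \frac{H}{\sqrt N} - zI$, $\eta(g) = g$), the fixed point reduces to $-zg = 1 + g^2$, whose root can be compared with the empirical trace of the resolvent. A minimal check (ours), with the branch chosen so that $g \to -1/z$ as $z \to -\infty$:

```python
import numpy as np

# n = 1 pencil: M = H/sqrt(N) - z*I with H Wigner; eta(g) = g gives -z*g = 1 + g^2.
z = -3.0
g_theory = (-z - np.sqrt(z**2 - 4)) / 2   # root of g^2 + z*g + 1 = 0 on the right branch

rng = np.random.default_rng(3)
N = 2000
H = rng.standard_normal((N, N))
H = (H + H.T) / np.sqrt(2)                # symmetric, off-diagonal variance 1
M = H / np.sqrt(N) - z * np.eye(N)
g_emp = np.trace(np.linalg.inv(M)) / N    # (1/N) Tr M^{-1}
print(g_theory, g_emp)
```

The empirical trace matches the algebraic root to a percent or so at $N = 2000$, illustrating the concentration behind the fixed-point statement.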

E.2 NON-RIGOROUS PROOF VIA DYSON BROWNIAN MOTIONS

In order to show the former result, we extend the sketch of proof provided in Bun et al. (2017); Potters & Bouchaud (2020). First we introduce a time $t$ and a matrix $Z \in \mathbb{C}^{n\times n}$, with
$$
M(t, Z) = Z\otimes I + M_0 + \frac{1}{\sqrt N}H(t)
$$
where $H$ is a Dyson brownian motion. The matrix that we are interested in is then $M = M(1, 0_{n\times n})$, and the increments $dH^{ij}$ must be chosen so that $H(1)$ satisfies property (271). As a full justification would require a more in-depth analysis, we make the following two assumptions:

1. We assume that $g^{\alpha\beta}_N$ concentrates towards a constant value $g^{\alpha\beta}$ when $N \to \infty$;

2. that the corresponding fluctuation term concentrates towards 0 when $N \to \infty$.

With these assumptions in mind, we obtain the partial differential equation:
$$
\frac{\partial g^{\alpha\beta}}{\partial t} + \sum_{ij\in S}[\eta(g)]_{ij}\,\frac{\partial g^{\alpha\beta}}{\partial Z_{ij}} = 0
$$
Finally, using the change of variables $\hat g(s) = g(\hat t(s), \hat Z(s)) = g(t+s, Z + s\,\eta(g(t,Z)))$, we find that $\hat g^{\alpha\beta}(s)$ is constant, so $\hat g^{\alpha\beta}(-t) = \hat g^{\alpha\beta}(0)$, which implies
$$
g(0, Z - t\,\eta(g(t,Z))) = g(t,Z)
$$
Hence for $(t,Z) = (1, 0_{n\times n})$ we have $g(0, -\eta(g(1,0))) = g(1,0)$, which is the expected result. As a simple illustration, for $n = 1$ and a symmetric Wigner matrix $H$ with $M = \frac{H}{\sqrt N} - zI$, we find $\eta(g) = g$ and, using (276), directly $-zg = 1 + g^2$ (301).



which suggests considering directly observations x = (x T , xT ) T ∼ N (0, Σ) for a given covariance structure Σ. The spectral theorem provides the existence of orthonormal matrix O and diagonal D such that Σ = O T DO and D contains d non-zero eigenvalues in a squared block D 1 and p A + p B -d zero eigenvalues. We can write D = J T D 1 J with J = (I d |0 p A +p B -d ). Therefore if we let z = 1 JOx which has variance 1 d I d , then upon noticing JJ T = I d and defining (A|B) T = √ dO T J T D 1 2

withz signal r and noise σ and mismatch parameter γ with γ + γ = 1

We will use $\mathrm{Tr}_d[\cdot] \equiv \lim_{d\to+\infty}\frac{1}{d}\mathrm{Tr}[\cdot]$ and similarly for $\mathrm{Tr}_n[\cdot]$. We also occasionally use $N_d(v) = \lim_{d\to+\infty}\frac{1}{d}\|v\|^2$ for a vector $v$ (when the limit exists).

Assumptions 2.1 (High-dimensional assumptions). In the high-dimensional limit, i.e., when $d \to +\infty$ with all ratios $\frac{n}{d}, \frac{p_A}{d}, \frac{p_B}{d}$ fixed, we assume the following: 1. All the traces $\mathrm{Tr}_d[\cdot], \mathrm{Tr}_n[\cdot]$ concentrate on a deterministic value.

28) Using both solutions ζ = 0 or ζ = 1 -φ 0 yields the same results as in Hastie et al. (2019); Belkin et al. (2020) using 2.3:

Figure 1: Example of theoretical multiple descents in the least-squares solution for the non-isotropic ridgeless regression model with p = 3, λ = 10 -7 (left) and p = 4, λ = 10 -13 (right), and α = 10 4 in both of them.

Figure 2: Example of theoretical multiple descents evolution in the non-isotropic ridgeless regression model with p = 3, λ = 10 -5 , α = 100 with φ = 1 on the left and a range φ ∈ (0, 1) on the right heatmap.

Figure 3: Theoretical (log-)eigenvalue distribution in the non-isotropic ridgeless regression model with p = 3, λ = 10 -5 , α = 100 with φ = 1 on the left and a range φ ∈ (0, 1) on the right heatmap.

Figure 4: Comparison between the analytical and experimental learning profiles for the minimum least-squares estimator at λ = 10 -3 on the left (20 runs) and the time evolution at λ = 10 -2 , n = 700 on the right (10 runs).

Figure 5: Analytical training error and test error heat-maps for the theoretical gradient flow for λ = 10 -3 .

trace of the squared sub-block (M -1 1 ) ij divided by the size of the block (ij), we find the desired functions f1 (x, y) = lim d→+∞

Tr d [φV * ], and hence Tr d Z T ZV * = φTr d [V * ].

Figure 6: Theoretical (log-)eigenvalue distribution in the non-isotropic ridgeless regression model with p = 3, λ = 10 -5 , α = 10 4 with φ = 1 on the left and a range φ ∈ (0, 1) on the right heatmap.

Figure 7: Comparison between the analytical and experimental learning profiles for the minimum least-squares estimator at λ = 10 -3 on the left (average and ± 2-standard-deviations over 20 runs) and λ = 10 -2 , n = 700 on the right.

Figure 8: Comparison between the analytical and experimental learning evolution at λ = 10 -2 , n = 700 (10 runs).

the mapping η :C n×n → C n×n : [η(g)] ij = kl∈S γ k σ lj ik g kl(272)then let Π(M ) = M 0 -η(g) ⊗ I with the notation (η(g) ⊗ I) ij = η(g) ij I pi when ij ∈ S and (η(g) ⊗ I) ij = 0 pi×qj the null-matrix when ij / ∈ S. Similarly as G, with Π(G) = Π(M )

S 2 (ik, jl)δ ux δ vy σlk ij + δ S 2 (il, jk)δ uy δ vx σ kl ij )manipulations and the fact that G is analytic in M ij uv as a rational function, we can rewrite the above partial derivatives as:jk)∈S 2 p [G αk G jβ ] pp γ l g li N + [G αi G lβ ]pp γ j g matrix Z is now helpful upon noticing that (using again the analyticity of G) ∂Gαβ pp ∂Z kj = -G ∂M ∂Z kj G αβ pp = -[G(E kj ⊗ I)G] αβ pp = -G αk G jβ pp(290)Hence (using the fact that σ kl ij = σ ij kl )

g(1, 0)] ij = [g(0, -η(g(1, 0)))] ij = Tr pi (M 0 -η(g(1, 0)) ⊗ I) -1 ij = Tr pi Π(G) ijLet's consider n = 1 and the symmetric random matrix H ∈ R N and M = H √ N -zI.

Different matrices and corresponding models

bution β * ∼ P * . Averaging E β * ∼P * [ Ēgen ] propagates the expectation within E β * ∼P * [B 0 (t)] and E β * ∼P * [B 1 (t)], which propagates it further into the traces of E β * ∼P * [ f1 ] and E β * ∼P * [f 2 ]. In fact we find:

Notice that σ kl ij = σ ij kl and σkl ij = σji lk by symmetry, and also when H is real, we always have σ kl ij = σkl ij . So overall the random matrix H has to satisfy the following property at any off-diagonal locations (uv), (xy) and blocks (ij, kl):δ S 2 (jk, il)δ vx δ yu σ kl ij + δ S 2 (ik, jl)δ ux δ vy σlk ij = E H ij uv H kl xy

D.5 REALISTIC DATASETS

For the realistic datasets, we capture the time evolution for two different datasets: MNIST and Fashion-MNIST. To capture the dynamics over a realistic dataset $X \in \mathbb{R}^{n_{tot}\times d}$, it is more convenient to use the dual matrices $\bar U$, $\bar V$, $\Xi$. We only need to estimate $\bar U$ and $\Xi\beta^*$, with $\bar U \approx \frac{1}{n_{tot}}X^TX$ and $\Xi\beta^* \approx \frac{1}{n_{tot}}X^TY$. In both cases, we still sample a subset of $n < n_{tot}$ data samples for the training set. The theoretical equations remain subject to the high-dimensional limit assumption; in other words we need $n$ and $d$ "large enough", that is to say $1 \ll n$. At the same time, the approximation of $\bar U$ and $\Xi\beta^*$ requires $n_{tot}$ to be sufficiently large compared to the number of considered samples $n$. Hence we also need $n \ll n_{tot}$.

Numerically, for the two following datasets and as per Assumptions 2.1, the theoretical prediction relies on a contour enclosing the spectrum $\mathrm{Sp}(\hat X^T\hat X)$ of $\hat X^T\hat X$, but not enclosing $-\lambda$. Therefore, in order to proceed with our computations, we take a rectangle symmetric about the x-axis, crossing it at the particular values $-\frac{\lambda}{2}$ and $1.2\max\mathrm{Sp}(\hat X^T\hat X)$, after a preliminary computation of the spectrum. For our experiments, we discretized the contour and ran a numerical integration over the discretized set of points.

MNIST Dataset: we consider the MNIST dataset with $n_{tot} = 70\,000$ images of size $28\times 28$ representing digits between 0 and 9. In our setting, we consider the problem of estimating the parity of the digit, that is, the vector $Y$ with $Y_i = 1$ if image $i$ represents an even digit and $Y_i = -1$ for an odd digit. The dataset $X \in \mathbb{R}^{n_{tot}\times d}$ is further processed by centering each column to its mean, normalizing by the global standard deviation of $X$ (in other words, the standard deviation of $X$ seen as a flattened $n_{tot}\times d$ vector), and further dividing by $\sqrt d$ (for consistency with the theoretical random matrix $Z$). The results that we obtain are shown in Figure 4.
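The rectangular-contour discretization described above can be sketched as follows. This is a minimal illustration (variable names and sizes are ours, not the paper's code): by Cauchy's formula, integrating the trace of the resolvent $(zI - S)^{-1}$ along the discretized rectangle should count exactly the eigenvalues it encloses, which gives a direct test of the contour construction:

```python
import numpy as np

def rectangle_contour(x_left, x_right, height, m=400):
    """Counterclockwise rectangle crossing the real axis at x_left and x_right."""
    t = np.linspace(0.0, 1.0, m, endpoint=False)
    bottom = x_left + (x_right - x_left) * t - 1j * height
    right  = x_right - 1j * height + 2j * height * t
    top    = x_right + (x_left - x_right) * t + 1j * height
    left   = x_left + 1j * height - 2j * height * t
    return np.concatenate([bottom, right, top, left])

rng = np.random.default_rng(4)
n, d, lam = 50, 30, 0.1
X = rng.standard_normal((n, d)) / np.sqrt(n)
S = X.T @ X                                   # spectrum to enclose (as for X^T X)
pts = rectangle_contour(-lam / 2, 1.2 * np.linalg.eigvalsh(S).max(), 1.0)
f = np.array([np.trace(np.linalg.inv(z * np.eye(d) - S)) for z in pts])
# trapezoid rule along the closed contour: (1/2 pi i) ∮ Tr(zI - S)^{-1} dz = #eigenvalues
dz = np.roll(pts, -1) - pts
integral = np.sum((f + np.roll(f, -1)) / 2 * dz) / (2j * np.pi)
print(integral.real)                          # close to d
```

Replacing the integrand by the analytic expressions of the learning curves then yields the theoretical predictions over the same discretized contour.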
The left panel shows the theoretical prediction of the training and test error for the minimum least-squares estimator (equivalently, the limiting errors at $t = +\infty$). We make the following observations, which in fact mirror those of Figure 4 in Loureiro et al. (2021):

• There is an apparent larger deviation in the test error for smaller $n$, which tends to heal with an increasing number of data samples.

• A bias between the mean observation of the test error and the theoretical prediction emerges around the double-descent peak between $n = 100$ and $n = 1000$; in particular, the experiments lie slightly above the prediction. We notice that this bias is even more pronounced for smaller values of $\lambda$.

• Although it is not visible in the figure, increasing $n$ further tends to create another divergence between the theoretical prediction and the experimental runs, as expected with $n$ getting closer to $n_{tot}$.

Besides the limiting error, we draw the time evolution of the training and test error at $n = 700$, around the double descent, on the right side of Figure 4. This time, a gradient descent algorithm is executed for each of the 10 experimental runs with a constant learning rate $dt = 0.01$. Due to the log scale of the axis, it is interesting to notice that with such a basic non-adaptive learning rate, each tick on the graph entails 10 times more computational time to update the weights. By contrast, the theoretical curves can be calculated at any point in time, arbitrarily far ahead. Overall we see a good agreement between the evolution of the experimental runs and the theoretical predictions. However, as expected around the double-descent spike, the learning curves of the experimental runs appear slightly biased above the theoretical curves.

Fashion-MNIST Dataset: we provide another example with the Fashion-MNIST dataset, with $d = 784$ and $n_{tot} = 70\,000$. The dataset $X$ is processed as for the MNIST dataset.
We take the output vector $Y$ such that $Y_i = 1$ for items $i$ worn above the waist, and $Y_i = -1$ otherwise. We provide the results in Figure 7, where the training set is sampled randomly with $n$ elements out of the $n_{tot}$, and the test set is sampled among the remaining examples. As can be seen, the test error is slightly above the prediction for $n < 10^3$ but fits well with the predicted values for larger $n$. Furthermore, the learning curves through time in Figure 8 differ from those of the MNIST dataset in Figure 4, and we still observe a good match with the theoretical predictions. However, the mismatch in the learning curves seems to increase when $\lambda$ is lower, which amplifies the double-descent effect.

MARCHENKO-PASTUR LAW

Let's consider n = 2 and the random matrix X ∈ R d×N with φ = N d = γ1 γ2 and γ 1 = N N +d , γ 2 = d N +d and the random symmetric block matrix:Using Schur complement, it can be seen that gwhich is precisely the trace that is being looked for.A careful analysis shows that σ 

