PROVABLE MORE DATA HURT IN HIGH DIMENSIONAL LEAST SQUARES ESTIMATOR

Abstract

This paper investigates the finite-sample prediction risk of the high-dimensional least squares estimator. We derive the central limit theorem for the prediction risk when both the sample size and the number of features tend to infinity. Furthermore, the finite-sample distribution and the confidence interval of the prediction risk are provided. Our theoretical results demonstrate the sample-wise nonmonotonicity of the prediction risk and confirm the "more data hurt" phenomenon.

1. INTRODUCTION

More data hurt refers to the phenomenon that training on more data can hurt the prediction performance of the learned model, especially in some deep learning tasks. Loog et al. (2019) shows that various standard learners can exhibit sample-wise non-monotonicity. Nakkiran et al. (2019) experimentally confirms the sample-wise non-monotonicity of the test accuracy of deep neural networks. This challenges the conventional understanding of large-sample properties: if an estimator is consistent, more data makes the estimator more stable and improves its finite-sample performance. Nakkiran (2019) considers adding a single data point to a linear regression task and analyzes its marginal effect on the test risk. Dereziński et al. (2019) gives an exact non-asymptotic risk of the high-dimensional least squares estimator and observes sample-wise non-monotonicity of the mean squared error. For adversarially robust models, Min et al. (2020) proves that more data may increase the gap between the generalization error of adversarially-trained models and standard models. Chen et al. (2020) shows that more training data can cause the generalization error to increase in the strong adversary regime. In this work, we derive the finite-sample distribution of the prediction risk under linear models and prove the "more data hurt" phenomenon from an asymptotic point of view. Intuitively, "more data hurt" stems from the "double descent" risk curve: as the model complexity increases, the prediction risk of the learned model first decreases, then increases, and then decreases again. The double descent phenomenon can be precisely quantified for certain simple models (Hastie et al. (2019); Mei & Montanari (2019); Ba et al. (2019); Belkin et al. (2019); Bartlett et al. (2020); Xing et al. (2019)). Among these works, Hastie et al.
(2019) and Mei & Montanari (2019) use tools from random matrix theory to explicitly prove the double descent curve of the asymptotic risk of linear regression and random features regression in the high-dimensional setup. Ba et al. (2019) gives the asymptotic risk of two-layer neural networks when either the first or the second layer is trained using a gradient flow. The second decline of the prediction risk in the double descent curve is closely related to the more data hurt phenomenon. In the over-parameterized regime, when the model complexity is fixed while the sample size increases, the degree of over-parameterization decreases and approaches the interpolation boundary (for example, p/n = 1 in Hastie et al. (2019)), where a high prediction risk is attained. However, the existing asymptotic results, which focus on the first-order limit of the prediction risk, cannot fully describe the more data hurt phenomenon. In fact, the "double descent" curve is a function of the limiting ratio lim p/n, which may not characterize the empirical prediction risk in finite-sample situations. There will be a non-negligible discrepancy between the empirical prediction risk and its limit, especially when the sample size or dimension is small. Fine-grained second-order results are thus needed to fully characterize this discrepancy; furthermore, a confidence band for the prediction risk can be constructed to evaluate its finite-sample performance. We take Figure 1 as an example to illustrate this. According to the first-order limit, given a fixed dimension p = 100, the prediction risks at sample sizes n = 90 and n = 98 are about 10.20 and 49.02, so more data hurt seems true. However, the 95% confidence interval of the prediction risk at sample size 98 is [4.91, 142.12], which contains the risk for n = 90. Then more data hurt is not statistically significant.
Hence, in this work, we characterize the second-order fluctuations of the prediction risk and attempt to fill this gap. We employ the linear regression task in Hastie et al. (2019) and Nakkiran (2019), and introduce tools from random matrix theory, e.g. the central limit theorems for linear spectral statistics in Bai & Silverstein (2004) and Bai et al. (2007), to derive the central limit theorem of the prediction risk. Consider a linear regression task with n data points and p features; the setup of more data hurt is similar to that of the classical asymptotic analysis in Van der Vaart (2000). According to the classical asymptotics with p fixed and n → ∞, the least squares estimator is unbiased and √n-consistent for the ground truth. This implies that more data will not hurt and will even improve the prediction performance. However, the story is very different in the over-parameterized regime: the prediction risk does not decrease monotonically in n when p > n. More data does hurt in the over-parameterized case. In the following, we justify this phenomenon by developing second-order asymptotic results as both n and p tend to infinity. We assume p/n → c, and for 0 < n_1 < n_2 < +∞ denote c_1 = p/n_1 and c_2 = p/n_2. The direct comparison of the prediction risk between sample sizes n_1 and n_2 can then be decomposed into three parts: (i) the gap between the finite-sample risk at n = n_1 and the asymptotic risk with c = c_1; (ii) the gap between the finite-sample risk at n = n_2 and the asymptotic risk with c = c_2; (iii) the comparison between the two asymptotic risks at c = c_1 and c = c_2. Theorems 1 and 2 of Hastie et al. (2019) answer task (iii). For (i) and (ii), we develop in this paper the convergence rate and the limiting distribution of the prediction risk as n, p → +∞, p/n → c. Furthermore, a confidence interval for the finite-sample risk can be obtained as well.

Figure 1: Sample-wise double descent.
We take p = 100 and 1 ≤ n ≤ 200. Left: the conditional density of the prediction risk as the sample size varies from 1 to 200; the sample-wise double descent phenomenon is visible in the conditional distribution. Right: the point-wise 95%-confidence band of the prediction risk. In the over-parameterized regime 1 ≤ n < 100, there exist pairs (n_1, n_2), 1 ≤ n_1 < n_2 < 100, such that the upper boundary of the confidence interval at n_1 is smaller than the lower boundary of the confidence interval at n_2. This confirms the more data hurt phenomenon.

The main goal of this paper is to study the second-order asymptotic behavior of two different types of conditional prediction risk in the linear regression model. One is R_{X,β}(β̂, β), conditional on both the training data and the regression coefficient, while the other is R_X(β̂, β), conditional on the training data only. We summarize our main results as follows: (1) The regression coefficient is set to be either random or nonrandom to cover more cases. Different convergence rates and limiting distributions of both prediction risks are derived under the various scenarios. (2) In particular, the finite-sample distribution of the conditional prediction risk given both the training data and the regression coefficient is derived, and the sample-wise double descent is characterized in Theorem 4.2 and Theorem 4.5 (see Figure 1). Under certain assumptions, the more data hurt phenomenon can be confirmed by comparing the confidence intervals built via the central limit theorems. (3) Our results incorporate non-Gaussian observations. For Gaussian data, the limiting mean and variance in the central limit theorems have simpler forms; see Sections 4.2 and 4.3 for more details. The rest of this paper is organized as follows. Section 3 introduces the model settings and the two types of prediction risk. Section 4 presents the main results on CLTs for the two types of risk, with discussion.
Section 5 conducts simulation experiments to verify the main results. All the technical proofs and lemmas are relegated to the appendix in the supplementary file.
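The sample-wise non-monotonicity described above is easy to reproduce. The sketch below is our own illustration (not the paper's code): it draws X with i.i.d. standard normal entries (assumptions (A) and (B2)), draws β as in scenario (C2), and evaluates the conditional risk exactly through the bias-variance decomposition of Section 3.2. With p fixed, the average risk near the interpolation boundary n ≈ p exceeds the risk at a much smaller sample size.

```python
import numpy as np

def conditional_risk(n, p, sigma2=1.0, r2=1.0, rng=None):
    """One draw of the conditional risk R_{X,beta}: Sigma = I_p (B2),
    beta ~ N(0, (r2/p) I_p) (C2). The risk is computed exactly as
    bias + variance, not by Monte Carlo over test points."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, p))
    beta = rng.standard_normal(p) * np.sqrt(r2 / p)
    S = X.T @ X / n                    # sample covariance \hat\Sigma
    S_pinv = np.linalg.pinv(S)         # Moore-Penrose pseudoinverse
    Pi = np.eye(p) - S_pinv @ S        # projection onto the null space of X
    bias = beta @ Pi @ Pi @ beta       # beta^T Pi Sigma Pi beta with Sigma = I_p
    var = sigma2 / n * np.trace(S_pinv)  # (sigma^2/n) Tr(\hat\Sigma^+ Sigma)
    return bias + var

p, reps = 50, 20
risk_mid = np.mean([conditional_risk(25, p, rng=s) for s in range(reps)])
risk_near = np.mean([conditional_risk(48, p, rng=100 + s) for s in range(reps)])
# More data hurt: n = 48 (close to the boundary n = p) has a much larger
# average risk than n = 25, even though 48 > 25.
assert risk_near > risk_mid
```

The sample sizes 25 and 48 here are illustrative choices; any pair with n_2 much closer to p than n_1 exhibits the same reversal on average.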

Double Descent

The double descent curve describes how generalization ability changes as model capacity increases. It subsumes the classical bias-variance trade-off, a U-shaped curve, and further shows that the test error exhibits a second drop when the model capacity exceeds the interpolation threshold (Belkin et al. (2018); Geiger et al. (2019); Spigler et al. (2019); Advani & Saxe (2017)). The double descent phenomenon has been quantified for certain models, including two-layer neural networks, via non-asymptotic bounds or asymptotic risk (Belkin et al. (2019); Muthukumar et al. (2020); Hastie et al. (2019); Mei & Montanari (2019); Ba et al. (2019)). As our results are based on linear regression, we mainly focus on the literature on linear models. Muthukumar et al. (2020) and Bartlett et al. (2020) derive generalization bounds for over-parametrized linear models and show the benefits of interpolation. Hastie et al. (2019) gives the first-order limit of the generalization error for linear regression as n, p → +∞. Dereziński et al. (2019) provides an exact non-asymptotic expression for the double descent of the high-dimensional least squares estimator. Wu & Xu (2020) extends the first-order limit of the prediction error of the generalized weighted ridge estimator to a more general case with anisotropic features and signals. Montanari et al. (2019), Deng et al. (2019) and Kini & Thrampoulidis (2020) investigate the sharp asymptotics of binary classification tasks with the max-margin solution and the maximum likelihood solution. Emami et al. (2020) and Gerbelot et al. (2020a) consider double descent in generalized linear models. Furthermore, the double descent phenomenon has also been observed in linear tasks with various problems and assumptions, e.g. LeJeune et al. (2020); Gerbelot et al. (2020b); Javanmard et al. (2020); Dar & Baraniuk (2020); Xu & Hsu (2019); Dar et al. (2020). Xing et al.
(2019) sharply quantifies the benefit of interpolation in the nearest neighbors algorithm. Mei & Montanari (2019) derives the limiting risk of the random features model and shows that the minimum generalization error is achieved by highly overparametrized interpolators. Ba et al. (2019) gives the limiting risk of the regression problem under two-layer neural networks. However, the existing asymptotic results focus on the first-order limit of the prediction risk and do not indicate the convergence rate. There are very few second-order results in the literature: Shen & Bellec (2020) establishes the asymptotic normality of the derivatives of two-layer neural networks, but not the exact limiting distribution of the risk. In this work, we are the first to develop results on the second-order fluctuations of the prediction risk in linear regression and to provide the corresponding confidence intervals. The more data hurt phenomenon is further justified from the asymptotic point of view.

Random Matrix Theory

The primary tool for analyzing the second-order fluctuations of the prediction risk comes from random matrix theory. In particular, Bai & Silverstein (2004) refines the central limit theorem for linear spectral statistics of large-dimensional sample covariance matrices with a general population, which is not required to be Gaussian. Similar central limit theorems have been developed for other random matrix ensembles; see Sinai & Soshnikov (1998); Bai & Yao (2005); Zheng (2012). Beyond the central limit theorem for linear spectral statistics, Bai et al. (2007) and Pan & Zhou (2008) study the asymptotic fluctuation of eigenvectors of sample covariance matrices, and Bai & Yao (2008) considers the fluctuation of quadratic forms. All these technical tools and results are adopted and fully utilized in this paper, especially those related to the Stieltjes transform, which is closely connected to the prediction risk studied here.

3.1. PROBLEM, DATA AND ESTIMATOR

Suppose that the training data {(x_i, y_i) ∈ R^p × R, i = 1, 2, ..., n} is generated independently from the model (ground truth or teacher model):

y_i = βᵀx_i + ε_i, with (x_i, ε_i) ∼ (P_x, P_ε), i = 1, 2, ..., n.   (1)

Here, P_x is a distribution on R^p such that E(x_i) = 0, Cov(x_i) = Σ, and P_ε is a distribution on R such that E(ε_i) = 0, Var(ε_i) = σ². In particular, the coordinates of x_i are not necessarily independent, that is, Σ is not restricted to be diagonal. To proceed further, we denote X_{n×p} = (x_1, x_2, ..., x_n)ᵀ and y = (y_1, y_2, ..., y_n)ᵀ. The minimum ℓ₂-norm (min-norm) least squares estimator of y on X is defined by

β̂ = arg min_β ‖y − Xβ‖₂ = (XᵀX)⁺Xᵀy,   (2)

where (XᵀX)⁺ denotes the Moore-Penrose pseudoinverse of XᵀX.
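As a small numerical aside (ours, not the paper's code), the pseudoinverse formula in (2) agrees with NumPy's `lstsq`, which also returns the minimum-ℓ₂-norm least squares solution, and the estimator interpolates the training data when p > n:

```python
import numpy as np

# Min-norm least squares estimator beta_hat = (X^T X)^+ X^T y in the
# over-parameterized regime p > n; X and y are arbitrary synthetic data.
rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta_pinv = np.linalg.pinv(X.T @ X) @ X.T @ y
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # minimum-norm solution

assert np.allclose(beta_pinv, beta_lstsq, atol=1e-8)
assert np.allclose(X @ beta_pinv, y, atol=1e-8)    # interpolation when p > n
```

The same two lines work unchanged in the under-parameterized regime n > p, where the pseudoinverse reduces to the ordinary inverse almost surely.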

3.2. BIAS, VARIANCE AND RISK

Similar to Hastie et al. (2019), we define two different types of out-of-sample prediction risk. The first one is given by

R_X(β̂, β) = E[(x₀ᵀβ̂ − x₀ᵀβ)² | X] = E[‖β̂ − β‖²_Σ | X],   (3)

where x₀ ∼ P_x is a test point independent of the training data, and the notation ‖β‖²_Σ stands for βᵀΣβ. Here β is assumed to be a random vector independent of x₀. In this definition, the expectation E stands for the conditional expectation over x₀, β̂ and β when X is given. According to the bias-variance decomposition, we have

R_X(β̂, β) = B_X(β̂, β) + V_X(β̂, β),   (4)

where B_X(β̂, β) = E[‖E(β̂ | X) − β‖²_Σ | X] and V_X(β̂, β) = Tr{Cov(β̂ | X)Σ}. Plugging the model (1) into the min-norm estimator (2), the bias and variance terms can be rewritten as

B_X(β̂, β) = E[βᵀΠΣΠβ | X] and V_X(β̂, β) = (σ²/n) Tr(Σ̂⁺Σ),

where Σ̂ = XᵀX/n is the (uncentered) sample covariance matrix of X, and Π = I_p − Σ̂⁺Σ̂ is the projection onto the null space of X. The second type of out-of-sample prediction risk is defined as

R_{X,β}(β̂, β) = E[(x₀ᵀβ̂ − x₀ᵀβ)² | X, β] = E[‖β̂ − β‖²_Σ | X, β],

whose bias and variance terms are B_{X,β}(β̂, β) = βᵀΠΣΠβ and V_{X,β}(β̂, β) = V_X(β̂, β) = (σ²/n) Tr(Σ̂⁺Σ). In this definition, the parameter β is assumed to be given. The expectation E is the conditional expectation over x₀ and β̂ when X and β are given. This is consistent with the commonly-used testing procedure, in which a trained model is evaluated by the average loss on unseen test data.
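The decomposition R_{X,β} = B_{X,β} + V_{X,β} can be verified numerically. The sketch below (our own check, with Σ = I_p and arbitrary illustrative values for n, p, σ) computes the bias and variance terms in closed form and compares their sum with a Monte Carlo average of ‖β̂ − β‖² over fresh noise draws:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 40, 60, 0.5            # over-parameterized, so the bias is nonzero
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)

S = X.T @ X / n                      # \hat\Sigma
S_pinv = np.linalg.pinv(S)
Pi = np.eye(p) - S_pinv @ S          # projection onto the null space of X

B = beta @ Pi @ Pi @ beta            # beta^T Pi Sigma Pi beta with Sigma = I_p
V = sigma**2 / n * np.trace(S_pinv)  # (sigma^2/n) Tr(\hat\Sigma^+ Sigma)

# Monte Carlo over the noise only: average ||beta_hat - beta||^2.
Xp = np.linalg.pinv(X)               # (X^T X)^+ X^T
reps, err = 2000, 0.0
for _ in range(reps):
    eps = rng.normal(0.0, sigma, size=n)
    beta_hat = Xp @ (X @ beta + eps)
    err += np.sum((beta_hat - beta) ** 2)
err /= reps
assert abs(err - (B + V)) < 0.1 * (B + V)
```

The cross term between the bias and variance parts vanishes exactly here, since Πβ lies in the null space of X while the noise contribution Σ̂⁺Xᵀε/n lies in its row space.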

4. MAIN RESULTS

Before stating our main results, we briefly highlight the challenges in proving the more data hurt phenomenon. First, the finite-sample behavior of the prediction risk is required. Hastie et al. (2019) gives the first-order limits of both R_{X,β}(β̂, β) and R_X(β̂, β) as n, p → +∞ and p/n → c ∈ (0, +∞). However, to prove the more data hurt phenomenon, we must fix p and investigate the finite-sample risk as the sample size n varies. Knowing the first-order limit alone is therefore not enough; the convergence rate is also needed. To solve this problem, we derive central limit theorems for both R_{X,β}(β̂, β) and R_X(β̂, β), which characterize the second-order fluctuations of the risk. We can then describe the finite-sample behavior of the risk by computing the gap between the risk and its limit, and confidence intervals for the risk can further be obtained. Second, the parameter β also contributes randomness to the finite-sample risk, which in turn influences the convergence rate. To analyze the contribution of β, we make use of the technical tools and asymptotic results for eigenvectors and quadratic forms developed in Bai et al. (2007) and Bai & Yao (2008). Another interesting finding is that, in the over-parameterized regime p > n, the two types of out-of-sample prediction risk R_{X,β}(β̂, β) and R_X(β̂, β) enjoy different convergence rates.

4.1. ASSUMPTIONS AND MORE NOTATIONS

We first collect some notation used in this paper. The p × p identity matrix is denoted by I_p. For a symmetric matrix A ∈ R^{p×p}, we define its empirical spectral distribution as F^A(x) = (1/p) Σ_{i=1}^p 1{λ_i(A) ≤ x}, where 1{·} is the indicator function and λ_i(A), i = 1, 2, ..., p, are the eigenvalues of A. The notation →_d stands for convergence in distribution. Z_{α/2} is the upper α/2 quantile of the standard normal distribution, and λ_max(A) and λ_min(A) denote the largest and smallest eigenvalues of A, respectively. Here we list all the assumptions for X and β needed under the different scenarios:

(A) x_j ∼ P_x is of the form x_j = Σ^{1/2} z_j, where z_j is a p-dimensional random vector with i.i.d. entries that have zero mean, unit variance, and a finite fourth moment E(z_{ij}⁴) = ν₄, i = 1, ..., p, j = 1, ..., n.

(B1) Σ is a deterministic positive definite matrix such that 0 < c_0 ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ c_1 for all n, p and some constants c_0, c_1. As p → ∞, we assume that the empirical spectral distribution F^Σ converges weakly to a probability measure H.

(B2) Σ is the identity matrix, Σ = I_p.

(C1) β is a nonrandom constant vector with ‖β‖₂² = βᵀβ = r².

(C2) β ∼ P_β is independent of X and follows the multivariate Gaussian distribution N_p(0, (r²/p) I_p).

Throughout this paper, we consider the limiting distributions and convergence rates of the out-of-sample prediction risk when n, p → ∞ such that p/n = c_n → c ∈ (0, ∞). If c > 1, the sample size n is smaller than the number of parameters p, and we call this case "over-parametrized"; when c < 1, we call it "under-parametrized".

4.2. UNDER-PARAMETRIZED ASYMPTOTICS

In this section, we focus on the risk of the min-norm estimator (2) in the under-parametrized regime. According to Theorem 1 of Hastie et al. (2019), both R_{X,β}(β̂, β) and R_X(β̂, β) converge to σ²c/(1 − c) almost surely. The following Theorems 4.1 and 4.2 show that both risks converge to σ²c/(1 − c) at the rate 1/p. Furthermore, the limiting distributions are derived by making use of the CLT for linear spectral statistics of large-dimensional sample covariance matrices.

Theorem 4.1. Suppose that the training data is generated from the model (1), and assumptions (A) and (B1) hold. Then the first type of out-of-sample prediction risk R_X(β̂, β) of the min-norm estimator (2) satisfies, as n, p → ∞ such that p/n = c_n → c < 1,

p ( R_X(β̂, β) − c_n σ²/(1 − c_n) ) →_d N(μ_c, σ_c²),

where μ_c = c²σ²/(c − 1)² + σ²c²(ν₄ − 3)/(1 − c) and σ_c² = 2c³σ⁴/(c − 1)⁴ + c³σ⁴(ν₄ − 3)/(1 − c)². Consequently,

P(L_{α,c} ≤ R_X(β̂, β) ≤ U_{α,c}) → 1 − α,

where 1 − α is the confidence level and

L_{α,c} = c_n σ²/(1 − c_n) + (μ_c − Z_{α/2} σ_c)/p,  U_{α,c} = c_n σ²/(1 − c_n) + (μ_c + Z_{α/2} σ_c)/p.

Under the assumptions of Theorem 4.1, we know that Π = I_p − Σ̂⁺Σ̂ = 0, so B_X(β̂, β) = B_{X,β}(β̂, β) = 0 and V_X(β̂, β) = V_{X,β}(β̂, β) = (σ²/n) Tr(Σ̂⁺Σ). Thus R_X(β̂, β) equals R_{X,β}(β̂, β), and the two risks share the same asymptotic limit.

Theorem 4.2. Under the assumptions of Theorem 4.1, the second type of out-of-sample prediction risk R_{X,β}(β̂, β) of the min-norm estimator (2) satisfies, as n, p → ∞ such that p/n = c_n → c < 1,

p ( R_{X,β}(β̂, β) − c_n σ²/(1 − c_n) ) →_d N(μ_c, σ_c²),

and P(L_{α,c} ≤ R_{X,β}(β̂, β) ≤ U_{α,c}) → 1 − α, where μ_c, σ_c², L_{α,c} and U_{α,c} are the same as those in Theorem 4.1.
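The interval endpoints of Theorem 4.1 are straightforward to compute. The sketch below (ours; `underparam_ci` is a hypothetical helper name, hard-coded to the 95% level) plugs c_n in for c and returns (L, U); since the width is 2 Z_{α/2} σ_c / p, doubling p at a fixed ratio c halves the interval exactly:

```python
import math

def underparam_ci(n, p, sigma2=1.0, nu4=3.0):
    """95% CI for R_X from Theorem 4.1 (requires c = p/n < 1).
    nu4 is the fourth moment of the standardized entries (3 for Gaussian)."""
    c = p / n                                  # plug-in: use c_n for the limit c
    mu_c = c**2 * sigma2 / (c - 1)**2 + sigma2 * c**2 * (nu4 - 3) / (1 - c)
    var_c = (2 * c**3 * sigma2**2 / (c - 1)**4
             + c**3 * sigma2**2 * (nu4 - 3) / (1 - c)**2)
    z = 1.959963984540054                      # Z_{alpha/2} for alpha = 0.05
    center = c * sigma2 / (1 - c)              # first-order limit c_n sigma^2/(1-c_n)
    half = z * math.sqrt(var_c)
    return center + (mu_c - half) / p, center + (mu_c + half) / p

lo, hi = underparam_ci(200, 100)               # c = 1/2, Gaussian case
assert lo < 0.5 * 1.0 / 0.5 < hi               # interval contains the limit 1.0
```

For example, comparing (n, p) = (200, 100) against (400, 200) gives intervals around the same limit 1.0 whose widths differ by exactly a factor of two.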

4.3. OVER-PARAMETRIZED ASYMPTOTICS

In this section, we consider the min-norm estimator (2) in the over-parametrized case c > 1. In this regime the bias term also influences the asymptotic behavior of the prediction risk, including the convergence rate. Hence, to derive the CLT of the out-of-sample prediction risk, we need to consider both the bias and variance terms in (4). In the following, we investigate the asymptotic properties of the two prediction risks R_X(β̂, β) and R_{X,β}(β̂, β) under various combinations of assumptions (A), (B2) for X and scenarios (C1), (C2) for nonrandom and random β. We start with the case when β is a constant vector.

Theorem 4.3. Suppose that the training data is generated from the model (1), and assumptions (A), (B2) and (C1) hold. Then the first type of out-of-sample prediction risk R_X(β̂, β) of the min-norm estimator (2) satisfies, as n, p → ∞ such that p/n = c_n → c > 1,

√p ( R_X(β̂, β) − (1 − 1/c_n) r² − σ²/(c_n − 1) ) →_d N(μ_{c,1}, σ²_{c,1}),

where μ_{c,1} = 0 and σ²_{c,1} = 2(c − 1)r⁴/c². A more practical version is to replace μ_{c,1} and σ²_{c,1} with

μ̂_{c,1} = (1/√p) [ cσ²/(1 − c)² + σ²(ν₄ − 3)/(c − 1) ] and σ̂²_{c,1} = 2(c − 1)r⁴/c² + (1/p) [ 2c³σ⁴/(1 − c)⁴ + cσ⁴(ν₄ − 3)/(c − 1)² ].

Consequently, P(L_{α,c} ≤ R_X(β̂, β) ≤ U_{α,c}) → 1 − α, where 1 − α is the confidence level and

L_{α,c} = (1 − 1/c_n) r² + σ²/(c_n − 1) + (μ̂_{c,1} − Z_{α/2} σ̂_{c,1})/√p,  U_{α,c} = (1 − 1/c_n) r² + σ²/(c_n − 1) + (μ̂_{c,1} + Z_{α/2} σ̂_{c,1})/√p.

Remark 4.1. Under assumption (C1), B_X(β̂, β) = B_{X,β}(β̂, β) and R_X(β̂, β) = R_{X,β}(β̂, β). Thus Theorem 4.3 still holds if we replace R_X(β̂, β) with R_{X,β}(β̂, β).

Remark 4.2. Under assumption (B2), the eigenvectors of Σ̂ are asymptotically Haar distributed. Therefore, the bias term B_X(β̂, β) depends only on the length of β. However, in anisotropic settings with general Σ, the eigenvectors of Σ̂ are no longer asymptotically Haar distributed.
The limiting behavior of B_X(β̂, β) heavily relies on the interaction between β and the eigenvectors of Σ̂. Therefore, we conjecture that there is no universal convergence rate for the bias term B_X(β̂, β) that covers arbitrary nonrandom β and anisotropic Σ in the over-parametrized case, not to mention the prediction risk R_X(β̂, β). A small simulation experiment is conducted in Appendix G to confirm our conjecture on this point. Next we consider the case when β is a random vector that follows assumption (C2).

Theorem 4.4. Suppose that the training data is generated from the model (1), and assumptions (A), (B2) and (C2) hold. Then, as n, p → ∞ such that p/n = c_n → c > 1, the first type of out-of-sample prediction risk R_X(β̂, β) of the min-norm estimator (2) satisfies

p ( R_X(β̂, β) − (1 − 1/c_n) r² − σ²/(c_n − 1) ) →_d N(μ_{c,2}, σ²_{c,2}),

where μ_{c,2} = cσ²/(1 − c)² + σ²(ν₄ − 3)/(c − 1) and σ²_{c,2} = 2c³σ⁴/(1 − c)⁴ + cσ⁴(ν₄ − 3)/(c − 1)². Hence we have P(L_{α,c} ≤ R_X(β̂, β) ≤ U_{α,c}) → 1 − α, where

L_{α,c} = σ²/(c_n − 1) + (1 − 1/c_n) r² + (μ_{c,2} − Z_{α/2} σ_{c,2})/p,  U_{α,c} = σ²/(c_n − 1) + (1 − 1/c_n) r² + (μ_{c,2} + Z_{α/2} σ_{c,2})/p.

As for R_{X,β}(β̂, β), we have the following theorem.

Theorem 4.5. Suppose that the training data is generated from the model (1), and assumptions (A), (B2) and (C2) hold. Then, as n, p → ∞ such that p/n = c_n → c > 1, the second type of out-of-sample prediction risk R_{X,β}(β̂, β) of the min-norm estimator (2) satisfies

√p ( R_{X,β}(β̂, β) − (1 − 1/c_n) r² − σ²/(c_n − 1) ) →_d N(μ_{c,3}, σ²_{c,3}),

where μ_{c,3} = 0 and σ²_{c,3} = 2(1 − 1/c)r⁴. A more practical version is to replace μ_{c,3} and σ²_{c,3} with

μ̂_{c,3} = (1/√p) [ cσ²/(1 − c)² + σ²(ν₄ − 3)/(c − 1) ] and σ̂²_{c,3} = 2(1 − 1/c)r⁴ + (1/p) [ 2c³σ⁴/(1 − c)⁴ + cσ⁴(ν₄ − 3)/(c − 1)² ].
The corresponding (1 − α)-confidence interval is given by

P(L_{α,c} ≤ R_{X,β}(β̂, β) ≤ U_{α,c}) → 1 − α,   (11)

with

L_{α,c} = σ²/(c_n − 1) + (1 − 1/c_n) r² + (μ̂_{c,3} − Z_{α/2} σ̂_{c,3})/√p,  U_{α,c} = σ²/(c_n − 1) + (1 − 1/c_n) r² + (μ̂_{c,3} + Z_{α/2} σ̂_{c,3})/√p.

Remark 4.3. Note that besides the leading constants in (μ_{c,3}, σ_{c,3}), the version (μ̂_{c,3}, σ̂_{c,3}) also contains smaller-order terms: terms of order O(1/√p) in μ̂_{c,3} and of order O(1/p) in σ̂²_{c,3}. These smaller-order terms vanish as p and n grow large, but in finite-sample situations they provide a finer approximation to the finite-sample distribution of R_{X,β}(β̂, β). As shown in the following experiments, these terms indeed make non-negligible contributions to fitting the empirical distribution of R_{X,β}(β̂, β), which sheds new light for practitioners.

Remark 4.4. Comparing the results in Theorems 4.3 and 4.5, we find that R_X(β̂, β) with constant β and R_{X,β}(β̂, β) with random β share the same first-order limit and the same second-order error rate O(p^{−1/2}). This is intuitive because both risks treat β as a constant; their differences are reflected in the limiting variances. Nevertheless, it is interesting to observe from Theorem 4.4 that R_X(β̂, β) with random β in the over-parametrized case has a smaller second-order error rate O(p^{−1}), the same rate as in the under-parametrized case in Theorem 4.1. A possible explanation is that averaging over the randomness in β partially offsets the curse of dimensionality, so that R_X(β̂, β) achieves the same error rate for all p, n combinations.

Remark 4.5. It is worth mentioning that the only assumption regarding the data distribution is assumption (A), where only a finite fourth moment is required. Allowing non-Gaussianity makes our theoretical results more widely applicable.
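The practical interval (11) is again a one-liner to evaluate. The sketch below (our own helper, `overparam_ci_practical` being a hypothetical name, hard-coded to the 95% level) uses the (μ̂_{c,3}, σ̂_{c,3}) version of Theorem 4.5, so the smaller-order terms are included:

```python
import math

def overparam_ci_practical(n, p, sigma2=1.0, r2=1.0, nu4=3.0):
    """95% CI for R_{X,beta} from Theorem 4.5 and (11), using the practical
    (mu_hat_{c,3}, sigma_hat_{c,3}) that keep the smaller-order terms.
    Requires c = p/n > 1 (over-parametrized regime)."""
    c = p / n                                  # plug-in: use c_n for the limit c
    center = (1 - 1 / c) * r2 + sigma2 / (c - 1)
    mu_hat = (c * sigma2 / (1 - c)**2
              + sigma2 * (nu4 - 3) / (c - 1)) / math.sqrt(p)
    var_hat = (2 * (1 - 1 / c) * r2**2
               + (2 * c**3 * sigma2**2 / (1 - c)**4
                  + c * sigma2**2 * (nu4 - 3) / (c - 1)**2) / p)
    z = 1.959963984540054                      # Z_{alpha/2} for alpha = 0.05
    half = z * math.sqrt(var_hat)
    return center + (mu_hat - half) / math.sqrt(p), center + (mu_hat + half) / math.sqrt(p)

lo, hi = overparam_ci_practical(100, 150)      # c = 3/2, Gaussian case
assert lo < (1 - 2 / 3) + 2.0 < hi             # contains the first-order limit
```

At a fixed ratio c, the interval tightens as p grows, since both the O(1/√p) mean shift and the O(1/p) variance correction shrink.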

4.4. DISCUSSION

In this section, we first summarize what we have done theoretically in this paper and then discuss some possible directions for extension. We have systematically investigated the second-order fluctuations of two types of prediction risk, R_X(β̂, β) and R_{X,β}(β̂, β), for the high-dimensional least squares estimator β̂. Theorems 4.1 and 4.4 are for R_X(β̂, β), while Theorems 4.2 and 4.5 are for R_{X,β}(β̂, β). Both fixed and random effects of the regression coefficient β are discussed, following the settings in Hastie et al. (2019); R_X(β̂, β) and R_{X,β}(β̂, β) coincide when β is nonrandom, as established in Theorem 4.3. Asymptotic results are categorized into the under-parametrized case (p < n) and the over-parametrized case (p > n). Although the first-order limits of the prediction risk in high-dimensional linear models have already been well studied in recent years, including general extensions to anisotropic features and signals in Wu & Xu (2020), the "double descent" risk curve is just a function of the limiting ratio lim p/n. There is still a non-negligible discrepancy between the finite-sample prediction risk and its first-order limit on the "double descent" curve. How large is this discrepancy? How fast does the risk converge to its limit? Our CLTs answer such questions and give a fine-grained characterization of the second-order fluctuations of the prediction risks. Not only are explicit forms of the leading constants in the limiting means and variances given in the main theorems, but smaller-order terms are also derived to improve the empirical performance for practitioners. It is also important to recognize the limitations of our results. First, the present paper concerns only linear regression tasks; the linear regression task is simple but important. Some recent works linearize neural networks at initialization and employ Neural Tangent Kernels (Jacot et al.
(2018)) to approximate the training procedure of a strongly over-parametrized neural network by solving a linear regression task, e.g. Du et al. (2018); Arora et al. (2019); Lee et al. (2019). Though the setting considered in this paper is simple and limited, the problem has not been fully understood in the literature, and we are among the first to take up the task and develop second-order fluctuation results for the prediction risk. Second, we assume a general covariance Σ and non-Gaussianity in the under-parametrized case, which fits the most up-to-date and realistic settings in the literature; under over-parametrization, however, we investigate only the isotropic setting, while still allowing for non-Gaussianity. We have not yet extended the results to the more general anisotropic settings, for two reasons. On the one hand, according to Wu & Xu (2020), the first-order limits depend on the Stieltjes transforms of the unknown spectral distribution of Σ. Since Σ is unknown, we cannot obtain any explicit characterization of the first-order limits, not to mention the second-order fluctuations: the CLTs could only be written as complicated implicit functions of Σ and would be too abstract to evaluate in practice, and more restrictions would have to be imposed on Σ to guarantee the second-order convergence. On the other hand, from the technical perspective, the techniques required for the anisotropic over-parametrized case are very different from those for the isotropic case, owing to differences in the bias-variance decomposition in (4). The tools in random matrix theory have not yet been fully developed for the anisotropic case.
Since we have considered a variety of scenarios in this paper, including random and nonrandom signals β for both conditional and unconditional risks, extending all of them to the most general settings will require considerable further work, and should lead to many follow-up studies in the machine learning and random matrix theory literature.

5. EXPERIMENTS

In this section, we carry out simulation experiments to examine the central limit theorems and the corresponding confidence intervals in Theorem 4.2 and Theorem 4.5. We generate data points from the linear model (1) and compute the prediction risk directly via the bias-variance decomposition in (4). The generative distribution P_x is taken to be the standard normal distribution, and the noise distribution P_ε is taken to be N(0, 1). In the following, we present the gap between the finite-sample distribution of the prediction risk and the corresponding limiting distribution to check the central limit theorems, and use the cover rate to measure the effectiveness of the confidence intervals. More simulation results, including cases with non-Gaussian distributions for P_x and P_ε, are relegated to the Appendix due to space limitations.

Example 1. This example examines the results in Theorem 4.2. We define the statistic

T_n = (p/σ_c) ( R_{X,β}(β̂, β) − c_n σ²/(1 − c_n) ) − μ_c/σ_c.

According to Theorem 4.2, T_n weakly converges to the standard normal distribution as n, p → ∞. In this example, c = 1/2 and p = 50, 100, 200. The finite-sample distribution of T_n is presented by the histogram of T_n in Figure 2 over 1000 repetitions, where the solid blue curve is the standard normal density. The finite-sample distribution of T_n matches the standard normal distribution closely, especially when n, p become larger. When α = 0.05, the empirical cover rates of the 95%-confidence interval are 94.2%, 93.5% and 95.3% for p = 50, 100 and 200, respectively. These experiments verify the correctness of our theoretical results.

Example 2. This example verifies the results in Theorem 4.5.
Here we define two statistics:

T_{n,0} = (√p/σ_{c,3}) ( R_{X,β}(β̂, β) − (1 − 1/c_n) r² − σ²/(c_n − 1) ) − μ_{c,3}/σ_{c,3},
T_{n,1} = (√p/σ̂_{c,3}) ( R_{X,β}(β̂, β) − (1 − 1/c_n) r² − σ²/(c_n − 1) ) − μ̂_{c,3}/σ̂_{c,3}.

According to Theorem 4.5, both T_{n,0} and T_{n,1} weakly converge to the standard normal distribution as n, p → +∞. Compared to T_{n,0}, T_{n,1} provides a better approximation to the finite-sample distribution of R_{X,β}(β̂, β) because it contains the smaller-order terms in the asymptotic mean and variance. We take c = 3/2 and p = 150, 300, 450. Similarly, the finite-sample distributions of T_{n,0} and T_{n,1} are presented by their histograms over 1000 repetitions; the comparison between the two statistics is shown in Figure 3. The finite-sample distributions of T_{n,0} and T_{n,1} both match the standard normal distribution quite well, with T_{n,1} giving the more precise characterization. The empirical cover rates of the 95%-confidence interval (11) are 93.8%, 94.7% and 94.4% for p = 150, 300 and 450, respectively, which further shows the validity of our theoretical results.

A PROOF OF THEOREM 4.1 AND THEOREM 4.2

Let X = ZΣ^{1/2}. According to the Bai-Yin theorem (Bai & Yin (2008)), the smallest eigenvalue of ZᵀZ/n is almost surely larger than (1 − √c)²/2 for sufficiently large n. Thus

λ_min(XᵀX/n) ≥ c_0 λ_min(ZᵀZ/n) ≥ (c_0/2)(1 − √c)²,

which implies that the matrix XᵀX/n is almost surely invertible for large n. By Section 3.2, Π = 0, B_X(β̂, β) = B_{X,β}(β̂, β) = 0 and V_X(β̂, β) = V_{X,β}(β̂, β). Thus the CLT of R_X(β̂, β) is the same as that of R_{X,β}(β̂, β); for simplicity, we focus on R_X(β̂, β) in the following. Notice that

V_X(β̂, β) = (σ²/n) Tr(Σ̂⁻¹Σ) = (σ²/n) Tr( Σ^{-1/2} (ZᵀZ/n)⁻¹ Σ^{-1/2} Σ ) = (σ²/n) Σ_{i=1}^p 1/s_i = σ² (p/n) ∫ (1/s) dF^Z(s),

where s_1, ..., s_p are the eigenvalues of ZᵀZ/n and F^Z is its spectral measure. According to Theorem 1 of Hastie et al.
(2019), as $n, p \to \infty$ with $p/n = c_n \to c \in (0, \infty)$, $F_Z(x)$ converges weakly to the standard Marcenko-Pastur law $F_c(x)$ and
$$V_X(\hat\beta,\beta) \to \sigma^2 c \int \frac{1}{s}\, dF_c(s) = \frac{\sigma^2 c}{1-c}.$$
Here the standard Marcenko-Pastur law $F_c(x)$ has density
$$p_c(x) = \begin{cases} \dfrac{1}{2\pi c x}\sqrt{(b-x)(x-a)}, & a \le x \le b,\\[4pt] 0, & \text{otherwise},\end{cases}$$
where $a = (1-\sqrt{c})^2$, $b = (1+\sqrt{c})^2$, and $p_c(x)$ has a point mass $1 - 1/c$ at the origin if $c > 1$. Hence
$$R_X(\hat\beta,\beta) - \frac{\sigma^2 c_n}{1-c_n} = \frac{\sigma^2 p}{n}\int \frac{1}{s}\,dF_Z(s) - \sigma^2 c_n \int \frac{1}{s}\,dF_{c_n}(s) = \sigma^2 c_n \int \frac{1}{s}\,d\big(F_Z - F_{c_n}\big)(s).$$
According to Theorem 1.1 of Bai & Silverstein (2004),
$$p\left(R_X(\hat\beta,\beta) - \frac{\sigma^2 c_n}{1-c_n}\right) \xrightarrow{d} N(\mu_c, \sigma_c^2),$$
where
$$\mu_c = -\frac{\sigma^2 c}{2\pi i}\oint_\gamma \frac{1}{z}\,\frac{c\,m(z)^3(1+m(z))^{-3}}{\{1 - c\,m(z)^2(1+m(z))^{-2}\}^2}\,dz - \frac{\sigma^2 c(\nu_4-3)}{2\pi i}\oint_\gamma \frac{1}{z}\,\frac{c\,m(z)^3(1+m(z))^{-3}}{1 - c\,m(z)^2(1+m(z))^{-2}}\,dz, \tag{13}$$
$$\sigma_c^2 = -\frac{\sigma^4 c^2}{2\pi^2}\oint_{\mathcal{C}_1}\oint_{\mathcal{C}_2} \frac{1}{z_1 z_2}\,\frac{1}{(m(z_1)-m(z_2))^2}\,\frac{dm(z_1)}{dz_1}\frac{dm(z_2)}{dz_2}\,dz_1\,dz_2 - \frac{\sigma^4 c^3(\nu_4-3)}{4\pi^2}\oint_{\mathcal{C}_1}\oint_{\mathcal{C}_2} \frac{1}{z_1 z_2}\,\frac{dm(z_1)\,dm(z_2)}{(1+m(z_1))^2(1+m(z_2))^2}. \tag{14}$$
Here the contours in (13) and (14) are closed, taken in the positive direction in the complex plane, and enclose the support of $F_Z$, i.e. $[(1-\sqrt{c})^2, (1+\sqrt{c})^2]$. The Stieltjes transform $m(z)$ satisfies the equation
$$z = -\frac{1}{m} + \frac{c}{1+m}.$$
To simplify the integrals in $\mu_c$ and $\sigma_c^2$, let $z = 1 + \sqrt{c}\big(r\xi + \frac{1}{r\xi}\big) + c$ and perform the change of variables; then
$$m(z) = -\frac{1}{1+\sqrt{c}\,r\xi}, \qquad dz = \sqrt{c}\Big(r - \frac{1}{r\xi^2}\Big)d\xi, \qquad dm = \frac{\sqrt{c}\,r}{(1+\sqrt{c}\,r\xi)^2}\,d\xi,$$
and as $\xi$ moves along the unit circle $|\xi| = 1$ in the complex plane, $z$ orbits the center point $1+c$ along an ellipse enclosing the support of $F_Z$. Thus
$$\mu_c = -\frac{\sigma^2 c}{2\pi i}\oint_\gamma \frac{1}{z}\,\frac{c\,m(z)^3(1+m(z))^{-3}}{(1 - c\,m(z)^2(1+m(z))^{-2})^2}\,dz - \frac{\sigma^2 c(\nu_4-3)}{2\pi i}\oint_\gamma \frac{1}{z}\,\frac{c\,m(z)^3(1+m(z))^{-3}}{1 - c\,m(z)^2(1+m(z))^{-2}}\,dz$$
$$= \frac{\sigma^2 c}{2\pi i}\oint_{|\xi|=1} \frac{d\xi}{r(\sqrt{c}+r\xi)(1+\sqrt{c}\,r\xi)(\xi - \frac{1}{r})(\xi + \frac{1}{r})} + \frac{\sigma^2 c(\nu_4-3)}{2\pi i}\oint_{|\xi|=1} \frac{d\xi}{r\xi^2(\sqrt{c}+r\xi)(1+\sqrt{c}\,r\xi)} = \frac{\sigma^2 c^2}{(c-1)^2} + \frac{\sigma^2 c^2(\nu_4-3)}{1-c}.$$
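As a quick sanity check on the first-order quantities above, the normalization of $F_c$ and the inverse moment $\int s^{-1}\,dF_c(s) = 1/(1-c)$ for $c < 1$ can be verified numerically. The sketch below is our own illustration (not from the paper); it integrates the Marcenko-Pastur density for $c = 1/2$, using the substitution $x = \frac{a+b}{2} + \frac{b-a}{2}\sin t$ so that the square-root endpoint behavior becomes smooth:

```python
import numpy as np

# Numerical check of two Marcenko-Pastur facts used above (c = 1/2 < 1):
# the density integrates to 1, and the inverse moment equals 1/(1-c) = 2.
c = 0.5
a, b = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2

# Substitute x = (a+b)/2 + ((b-a)/2) sin(t): then
# sqrt((b-x)(x-a)) = ((b-a)/2) cos(t) and dx = ((b-a)/2) cos(t) dt,
# so p_c(x) dx becomes a smooth integrand in t on [-pi/2, pi/2].
t = np.linspace(-np.pi / 2, np.pi / 2, 200001)
x = (a + b) / 2 + (b - a) / 2 * np.sin(t)
w = ((b - a) / 2 * np.cos(t)) ** 2 / (2 * np.pi * c * x)

dt = t[1] - t[0]
mass = np.sum(w) * dt            # total mass of F_c: should be 1
inv_moment = np.sum(w / x) * dt  # int (1/s) dF_c(s): should be 1/(1-c) = 2

print(mass, inv_moment)
```

The endpoint values of the transformed integrand vanish, so a plain Riemann sum already coincides with the trapezoid rule here.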
As for $\sigma_c^2$, note that
$$\frac{1}{2\pi i}\oint_{\gamma_1} \frac{1}{z_1}\frac{1}{(m_1-m_2)^2}\,dm_1 = \frac{1}{2\pi i}\oint_{|\xi_1|=1} \frac{1}{1+\sqrt{c}\big(r_1\xi_1 + \frac{1}{r_1\xi_1}\big) + c}\cdot \frac{\sqrt{c}\,r_1}{\big(m_2 + \frac{1}{1+\sqrt{c}\,r_1\xi_1}\big)^2 (1+\sqrt{c}\,r_1\xi_1)^2}\,d\xi_1$$
$$= \frac{1}{2\pi i}\oint_{|\xi_1|=1} \frac{\sqrt{c}\,r_1\xi_1}{\big(\xi_1 + \frac{\sqrt{c}}{r_1}\big)(r_1\xi_1\sqrt{c}+1)\{(r_1\xi_1\sqrt{c}+1)m_2 + 1\}^2}\,d\xi_1 = \frac{c}{(c-1)\{(c-1)m_2 - 1\}^2},$$
therefore
$$-\frac{\sigma^4 c^2}{2\pi^2}\oint\oint \frac{1}{z_1 z_2}\frac{1}{(m_1-m_2)^2}\,dm_1\,dm_2 = \frac{2\sigma^4 c^2}{2\pi i}\oint_{|\xi_2|=1} \frac{c}{z_2(c-1)\{(c-1)m_2-1\}^2}\,dm_2 = \frac{2\sigma^4 c^2}{2\pi i}\oint_{|\xi_2|=1} \frac{\sqrt{c}\,r_2^2\,\xi_2}{(c-1)(1+\sqrt{c}\,r_2\xi_2)(\sqrt{c}+r_2\xi_2)^3}\,d\xi_2 = \frac{2c^3\sigma^4}{(c-1)^4}.$$
Meanwhile,
$$\frac{1}{2\pi i}\oint_{\gamma_1} \frac{1}{z_1}\frac{1}{(1+m(z_1))^2}\,dm(z_1) = \frac{1}{2\pi i}\oint_{|\xi|=1} \frac{d\xi}{\sqrt{c}\,\xi(1+\sqrt{c}\,r\xi)(\sqrt{c}+r\xi)} = \frac{1}{c-1},$$
hence
$$-\frac{\sigma^4 c^3(\nu_4-3)}{4\pi^2}\oint_{\mathcal{C}_1}\oint_{\mathcal{C}_2} \frac{1}{z_1 z_2}\frac{dm(z_1)\,dm(z_2)}{(1+m(z_1))^2(1+m(z_2))^2} = \frac{\sigma^4 c^3(\nu_4-3)}{(1-c)^2},$$
and
$$\sigma_c^2 = \frac{2c^3\sigma^4}{(c-1)^4} + \frac{\sigma^4 c^3(\nu_4-3)}{(1-c)^2}.$$
Let
$$T_n = \frac{p}{\sigma_c}\left(R_X(\hat\beta,\beta) - \frac{\sigma^2 c_n}{1-c_n} - \frac{\mu_c}{p}\right).$$
According to (12), we have
$$P(L_{\alpha,c} \le R_{X,\beta}(\hat\beta,\beta) \le U_{\alpha,c}) = P(-Z_{\alpha/2} \le T_n \le Z_{\alpha/2}) \to 1-\alpha,$$
where
$$L_{\alpha,c} = \frac{\sigma^2 c_n}{1-c_n} + \frac{1}{p}\big(\mu_c - Z_{\alpha/2}\sigma_c\big), \qquad U_{\alpha,c} = \frac{\sigma^2 c_n}{1-c_n} + \frac{1}{p}\big(\mu_c + Z_{\alpha/2}\sigma_c\big).$$

B PROOF OF THEOREM 4.3

Notice that
$$B_X(\hat\beta,\beta) = \beta^T(I_p - \hat\Sigma^+\hat\Sigma)\beta = \lim_{z\to 0^+} \beta^T\big(I_p - (\hat\Sigma + zI_p)^{-1}\hat\Sigma\big)\beta = \lim_{z\to 0^+} z\,\beta^T(\hat\Sigma + zI_p)^{-1}\beta.$$
Since $\beta$ is a constant vector, we can make use of the results on eigenvectors in Theorem 3 of Bai et al. (2007) and Theorem 1.3 of Pan & Zhou (2008). Their works investigate the sample covariance matrix $A_p = T_p^{1/2}X_p^TX_pT_p^{1/2}/n$, where $T_p$ is a $p\times p$ nonnegative definite Hermitian matrix with square root $T_p^{1/2}$ and $X_p$ is an $n\times p$ matrix with i.i.d. entries $(x_{ij})_{n\times p}$. Let $U_p\Lambda_pU_p^T$ denote the spectral decomposition of $A_p$, where $\Lambda_p = \operatorname{diag}(\lambda_1,\cdots,\lambda_p)$ and $U_p$ is a unitary matrix consisting of the orthonormal eigenvectors of $A_p$.
Assume that $x_p$ is an arbitrary nonrandom unit vector and let $y = (y_1, y_2, \cdots, y_p)^T = U_p^Tx_p$. Two empirical distribution functions based on the eigenvectors and eigenvalues are defined as
$$F_1^{A_p}(x) = \sum_{i=1}^p |y_i|^2\,\mathbf{1}(\lambda_i \le x), \qquad F^{A_p}(x) = \frac{1}{p}\sum_{i=1}^p \mathbf{1}(\lambda_i \le x).$$
Then for a bounded continuous function $g(x)$, we have
$$\sum_{j=1}^p |y_j|^2 g(\lambda_j) - \frac{1}{p}\sum_{j=1}^p g(\lambda_j) = \int g(x)\,dF_1^{A_p}(x) - \int g(x)\,dF^{A_p}(x).$$
The results in Bai et al. (2007) and Pan & Zhou (2008) show the following.

Lemma B.1. (Theorem 3 of Bai et al. (2007) and Theorem 1.3 of Pan & Zhou (2008)) Suppose that
(1) the $x_{ij}$'s are i.i.d. with $E(x_{ij}) = 0$, $E(|x_{ij}|^2) = 1$ and $E(|x_{ij}|^4) < \infty$;
(2) $x_p \in \mathbb{C}^p$, $\|x_p\| = 1$, and $\lim_{n,p\to\infty} p/n = c \in (0,\infty)$;
(3) $T_p$ is nonrandom Hermitian nonnegative definite with its spectral norm bounded in $p$, with $H_p = F^{T_p} \xrightarrow{d} H$ a proper distribution function and $x_p^T(T_p - zI_p)^{-1}x_p \to m_{F^H}(z)$, where $m_{F^H}(z)$ denotes the Stieltjes transform of $H(t)$;
(4) $g_1,\cdots,g_k$ are analytic functions on an open region of the complex plane which contains the real interval
$$\left[\liminf_p \lambda_{\min}(T_p)\,\mathbf{1}_{(0,1)}(c)\,(1-\sqrt{c})^2,\ \limsup_p \lambda_{\max}(T_p)\,\mathbf{1}_{(0,1)}(c)\,(1+\sqrt{c})^2\right];$$
(5) as $n, p \to \infty$,
$$\sup_z \sqrt{n}\,\left| x_p^T\big(m_{F^{c_n,H_p}}(z)T_p - I_p\big)^{-1}x_p - \int \frac{1}{1+t\,m_{F^{c_n,H_p}}(z)}\,dH_n(t)\right| \to 0.$$
Define $G_p(x) = \sqrt{n}\,\big(F_1^{A_p}(x) - F^{A_p}(x)\big)$. Then the random vector
$$\left(\int g_1(x)\,dG_p(x), \cdots, \int g_k(x)\,dG_p(x)\right)$$
forms a tight sequence and converges weakly to a Gaussian vector $(x_{g_1},\cdots,x_{g_k})$ with mean zero and covariance function
$$\operatorname{Cov}(x_{g_1}, x_{g_2}) = -\frac{1}{2\pi^2}\oint_{\mathcal{C}_1}\oint_{\mathcal{C}_2} g_1(z_1)g_2(z_2)\,\frac{(z_2m_2 - z_1m_1)^2}{c^2z_1z_2(z_2-z_1)(m_2-m_1)}\,dz_1\,dz_2.$$
The contours $\mathcal{C}_1, \mathcal{C}_2$ are disjoint, both contained in the analytic region of the functions $(g_1,\cdots,g_k)$, and enclose the support of $F^{c_n,H_p}$ for all large $p$.
(6) If $H(x)$ satisfies
$$\int \frac{dH(t)}{(1+tm(z_1))(1+tm(z_2))} = \int \frac{dH(t)}{1+tm(z_1)}\int \frac{dH(t)}{1+tm(z_2)},$$
then the covariance function can be further simplified to
$$\operatorname{Cov}(x_{g_1}, x_{g_2}) = \frac{2}{c}\left(\int g_1(x)g_2(x)\,dF_{c,H}(x) - \int g_1(x)\,dF_{c,H}(x)\int g_2(x)\,dF_{c,H}(x)\right).$$
Recall that $B_X(\hat\beta,\beta) = \lim_{z\to0^+} z\beta^T(\hat\Sigma + zI_p)^{-1}\beta$. Let $g(x) = 1/(x+z)$ and $x_p = \beta/r$. Then we have
$$\int g(x)\,dG_n(x) = \sqrt{n}\left(\frac{1}{r^2}\beta^T(\hat\Sigma+zI_p)^{-1}\beta - \int g(x)\,dF_{c_n}(x)\right),$$
where $F_{c_n}(x)$ is the standard Marcenko-Pastur law. It is not difficult to check that under Assumptions (A1), (B1) and (C1), all the conditions (1)-(6) in Lemma B.1 are satisfied. To proceed further, denote $a = (1-\sqrt{c})^2$ and $b = (1+\sqrt{c})^2$; if $c$ is replaced by $c_n$, $a$ and $b$ are denoted by $a_n$ and $b_n$, respectively. By some algebraic calculations, we have
$$\int g(x)\,dF_{c_n}(x) = \Big(1-\frac{1}{c_n}\Big)\frac{1}{z} + \int_{a_n}^{b_n} \frac{1}{x+z}\cdot\frac{1}{2\pi c_n x}\sqrt{(b_n-x)(x-a_n)}\,dx = \Big(1-\frac{1}{c_n}\Big)\frac{1}{z} - \frac{-1+c_n+z-\sqrt{c_n^2+2c_n(z-1)+(1+z)^2}}{2c_nz},$$
$$\operatorname{Var}(x_g) = \frac{2}{c}\left(\int \{g(x)\}^2\,dF_c(x) - \Big(\int g(x)\,dF_c(x)\Big)^2\right) = \frac{2}{c}\left(\Big(1-\frac{1}{c}\Big)\frac{1}{z^2} + \int_a^b \frac{1}{(x+z)^2}\cdot\frac{\sqrt{(b-x)(x-a)}}{2\pi cx}\,dx\right) - \frac{2}{c}\left(\Big(1-\frac{1}{c}\Big)\frac{1}{z} + \int_a^b \frac{1}{x+z}\cdot\frac{\sqrt{(b-x)(x-a)}}{2\pi cx}\,dx\right)^2.$$
Therefore
$$\lim_{z\to0^+} z\int g(x)\,dF_{c_n}(x) = 1-\frac{1}{c_n} \quad\text{and}\quad \lim_{z\to0^+} z^2\operatorname{Var}(x_g) = \frac{2(c-1)}{c^3}.$$
Furthermore, as $n, p\to\infty$, $p/n = c_n \to c > 1$,
$$\sqrt{n}\left(B_X(\hat\beta,\beta) - \Big(1-\frac{1}{c_n}\Big)r^2\right) \xrightarrow{d} N\left(0, \frac{2(c-1)}{c^3}r^4\right),$$
which can be rewritten as
$$\sqrt{p}\left(B_X(\hat\beta,\beta) - \Big(1-\frac{1}{c_n}\Big)r^2\right) \xrightarrow{d} N\left(0, \frac{2(c-1)}{c^2}r^4\right).$$
Next we deal with the variance term $V_X(\hat\beta,\beta)$. According to Assumption (B1), the variance term is
$$V_X(\hat\beta,\beta) = \frac{\sigma^2}{n}\operatorname{Tr}\{\hat\Sigma^+\} = \frac{\sigma^2}{n}\sum_{i=1}^n \frac{1}{s_i},$$
where $s_i$, $i = 1,\ldots,n$, are the nonzero eigenvalues of $X^TX/n$. Let $\{t_i, i=1,\ldots,n\}$ denote the nonzero eigenvalues of $XX^T/p$; then
$$V_X(\hat\beta,\beta) = \frac{\sigma^2}{p}\sum_{i=1}^n \frac{1}{t_i} = \frac{\sigma^2 n}{p}\int \frac{1}{t}\,dF^{XX^T/p}(t) \to \frac{\sigma^2}{c-1}.$$
By interchanging the roles of $p$ and $n$, from the result in Theorem 4.1, as $n, p \to \infty$, $p/n = c_n \to c > 1$, we have
$$\sum_{i=1}^n \frac{1}{t_i} - \frac{n}{1-\tilde c_n} \xrightarrow{d} N\left(\frac{\tilde c}{(\tilde c-1)^2} + \frac{\tilde c(\nu_4-3)}{1-\tilde c},\ \frac{2\tilde c}{(\tilde c-1)^4} + \frac{\tilde c(\nu_4-3)}{(1-\tilde c)^2}\right),$$
where $\tilde c_n = n/p = 1/c_n$ and $\tilde c = 1/c$. This result can be rewritten as
$$\sum_{i=1}^n \frac{1}{t_i} - \frac{p}{c_n-1} \xrightarrow{d} N\left(\frac{c}{(1-c)^2} + \frac{\nu_4-3}{c-1},\ \frac{2c^3}{(1-c)^4} + \frac{c(\nu_4-3)}{(c-1)^2}\right).$$
Hence the CLT of $V_X(\hat\beta,\beta)$ is given by
$$p\left(V_X(\hat\beta,\beta) - \frac{\sigma^2}{c_n-1}\right) \xrightarrow{d} N\left(\frac{c\sigma^2}{(1-c)^2} + \frac{\sigma^2(\nu_4-3)}{c-1},\ \frac{2c^3\sigma^4}{(1-c)^4} + \frac{c\sigma^4(\nu_4-3)}{(c-1)^2}\right).$$
Notice that $\operatorname{Cov}\big(B_X(\hat\beta,\beta), V_X(\hat\beta,\beta)\big) = 0$. According to the consistency rates and the limiting distributions of $B_X(\hat\beta,\beta)$ and $V_X(\hat\beta,\beta)$, the bias $B_X(\hat\beta,\beta)$ is the leading term of $R_X(\hat\beta,\beta)$. This implies that
$$\sqrt{p}\left(R_X(\hat\beta,\beta) - \Big(1-\frac{1}{c_n}\Big)\|\beta\|_2^2 - \frac{\sigma^2}{c_n-1}\right) \xrightarrow{d} N(0, \sigma_{c,1}^2),$$
where $\sigma_{c,1}^2 = 2(c-1)r^4/c^2$. A practical version of this CLT is given by
$$\sqrt{p}\left(R_X(\hat\beta,\beta) - \Big(1-\frac{1}{c_n}\Big)\|\beta\|_2^2 - \frac{\sigma^2}{c_n-1}\right) \xrightarrow{d} N(\hat\mu_{c,1}, \hat\sigma_{c,1}^2), \quad\text{where } \hat\mu_{c,1} = \frac{1}{\sqrt{p}}\left(\frac{c\sigma^2}{(1-c)^2} + \frac{\sigma^2(\nu_4-3)}{c-1}\right).$$
In particular, if $\beta$ follows a multivariate Gaussian distribution, i.e. $\beta \sim N_p(0, \frac{r^2}{p}I_p)$, then as $p\to\infty$,
$$\sqrt{p}\left(B_{X,\beta}(\hat\beta,\beta) - r^2\Big(1-\frac{n}{p}\Big)\right) \xrightarrow{d} N\left(0, 2\Big(1-\frac{1}{c}\Big)r^4\right).$$
Moreover, $V_{X,\beta}(\hat\beta,\beta) = V_X(\hat\beta,\beta)$, and we have already proved in Theorem 4.4 that
$$p\left(V_{X,\beta}(\hat\beta,\beta) - \frac{\sigma^2}{c_n-1}\right) \xrightarrow{d} N\left(\frac{c\sigma^2}{(1-c)^2} + \frac{\sigma^2(\nu_4-3)}{c-1},\ \frac{2c^3\sigma^4}{(1-c)^4} + \frac{c\sigma^4(\nu_4-3)}{(c-1)^2}\right).$$
Note that $\operatorname{Cov}\big(B_{X,\beta}(\hat\beta,\beta), V_{X,\beta}(\hat\beta,\beta)\big) = 0$. According to the consistency rates of $B_{X,\beta}(\hat\beta,\beta)$ and $V_{X,\beta}(\hat\beta,\beta)$, the bias $B_{X,\beta}(\hat\beta,\beta)$ is the leading term of $R_{X,\beta}(\hat\beta,\beta)$. This implies that
$$\sqrt{p}\left(R_{X,\beta}(\hat\beta,\beta) - r^2\Big(1-\frac{1}{c_n}\Big) - \frac{\sigma^2}{c_n-1}\right) \xrightarrow{d} N(0, \sigma_{c,3}^2),$$
where $\sigma_{c,3}^2 = 2r^4(1-1/c)$. A practical version of this CLT is given by
$$\sqrt{p}\left(R_{X,\beta}(\hat\beta,\beta) - r^2\Big(1-\frac{1}{c_n}\Big) - \frac{\sigma^2}{c_n-1}\right) \xrightarrow{d} N(\hat\mu_{c,3}, \hat\sigma_{c,3}^2),$$
where
$$\hat\mu_{c,3} = \frac{1}{\sqrt{p}}\left(\frac{c\sigma^2}{(1-c)^2} + \frac{\sigma^2(\nu_4-3)}{c-1}\right), \qquad \hat\sigma_{c,3}^2 = 2\Big(1-\frac{1}{c}\Big)r^4 + \frac{1}{p}\left(\frac{2c^3\sigma^4}{(1-c)^4} + \frac{c\sigma^4(\nu_4-3)}{(c-1)^2}\right).$$

E MORE EXPERIMENTS E.1 MORE RESULTS OF EXAMPLE 1

This example checks Theorem 4.2. We define the statistic
$$T_n = \frac{p}{\sigma_c}\left(R_X(\hat\beta,\beta) - \frac{\sigma^2 c_n}{1-c_n}\right) - \frac{\mu_c}{\sigma_c}.$$
According to Theorem 4.2, $T_n$ converges weakly to the standard normal distribution as $n, p\to\infty$. In this example, $c = 1/2$ and $p = 50, 100, 200$. To make sure Assumption (A) holds, the generative distribution $P_x$ is taken to be the standard normal distribution, the centered gamma distribution with shape 4.0 and scale 0.5, and the normalized Student-$t$ distribution with 6.0 degrees of freedom. The finite-sample distribution of $T_n$ is estimated by the histogram of $T_n$ under 1000 repetitions. The results are presented in Figure 4. One can find that the finite-sample distribution of $T_n$ tends to the standard normal distribution as $n, p \to +\infty$. When $\alpha = 0.05$, the empirical cover rates of the 95%-confidence interval are reported in Figure 5.
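The Monte Carlo recipe above can be sketched in a few lines. The code below is our own illustrative reconstruction, not the authors' code: it uses the Gaussian case only (so $\nu_4 = 3$ and the $(\nu_4-3)$ terms vanish), $\Sigma = I_p$, and a smaller replication count:

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo sketch of Example 1 (under-parameterized, Sigma = I_p, Gaussian
# data, so nu_4 = 3). Constants from the proof of Theorem 4.2:
# mu_c = sigma^2 c^2/(c-1)^2 = 1 and sigma_c^2 = 2 c^3 sigma^4/(c-1)^4 = 4.
sigma2, c = 1.0, 0.5
p, n = 100, 200                                  # c_n = p/n = 1/2
mu_c = sigma2 * c ** 2 / (c - 1) ** 2
sigma_c = np.sqrt(2 * c ** 3 * sigma2 ** 2 / (c - 1) ** 4)
center = sigma2 * (p / n) / (1 - p / n)          # sigma^2 c_n / (1 - c_n)

reps = 300
T = np.empty(reps)
for k in range(reps):
    X = rng.standard_normal((n, p))
    # With n > p the bias vanishes, so the risk reduces to the variance term
    # R = (sigma^2/n) tr((X^T X / n)^{-1}).
    s = np.linalg.eigvalsh(X.T @ X / n)
    R = sigma2 / n * np.sum(1.0 / s)
    T[k] = (p * (R - center) - mu_c) / sigma_c   # approximately N(0, 1)

print(T.mean(), T.std())
```

With larger `p` and more repetitions the sample mean and standard deviation of `T` approach 0 and 1, mirroring the histograms in Figure 4.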



Figure 2: The histogram of $T_n$. The solid line is the density of the standard normal distribution.

Figure 3: The histograms of $T_{n,0}$ and $T_{n,1}$. The solid line is the density of the standard normal distribution.

$\cdots/(1-c)^4 + c\sigma^4(\nu_4-3)/(c-1)^2$. According to the results in Lemma D.1, let $A_n = \Pi = I_p - \hat\Sigma^+\hat\Sigma$; then, as $p\to\infty$, $\sqrt{p}\,\beta^T\Pi\beta$ has asymptotic variance $(\theta-\omega)(\gamma_2+\gamma^2) = 2(\theta-\omega)r^4$. Since in the proof of Theorem 4.4 we have already shown that

Figure 4: The histogram of $T_n$. The solid line is the density of the standard normal distribution.

Figure 5: The cover rate of the confidence interval (7) as $p$ increases. The confidence level is 95%.

Figure 7: The histogram of $T_{n,1}$. The solid line is the density of the standard normal distribution.

Figure 8: The cover rate of the confidence interval (11) as $p$ increases. The confidence level is 95%.

Figure 10: The histogram of $T_{n,3}$. The solid line is the density of the standard normal distribution.

Figure 11: The coverage of the confidence interval (9) as $p$ increases. The confidence level is 95%.

The bias term, either $B_X(\hat\beta,\beta)$ or $B_{X,\beta}(\hat\beta,\beta)$, is generally nonzero. According to Lemma 2 in Hastie et al. (2019), both $B_X(\hat\beta,\beta)$ and $B_{X,\beta}(\hat\beta,\beta)$ converge to $r^2(1-1/c)$ as $n, p \to +\infty$ and $p/n \to c > 1$.

C PROOF OF THEOREM 4.4

First we consider the bias term $B_X(\hat\beta,\beta)$. By Assumptions (A1), (B1), and (C2), the bias can alternatively be rewritten in resolvent form. Define $f_n(z) = \frac{zr^2}{p}\operatorname{Tr}\big((\hat\Sigma + zI_p)^{-1}\big)$, and notice that $|f_n(z)|$ and $|f_n'(z)|$ are bounded above. By the Arzela-Ascoli theorem, we deduce that $f_n(z)$ converges uniformly to its limit, and under Assumption (C2) the Moore-Osgood theorem allows the limits to be interchanged almost surely, where $m_n(z)$ is the Stieltjes transform of the empirical spectral distribution of $\hat\Sigma = X^TX/n$. According to Theorem 2.1 in Zheng et al. (2015) and Lemma 1.1 in Bai & Silverstein (2004), the truncated version of $p(m_n(z) - m(z))$ converges weakly to a two-dimensional Gaussian process $M(\cdot)$, where $\underline{m}(z)$ represents the Stieltjes transform of the limiting spectral distribution of the companion matrix $XX^T/n$. When $p > n$, the equation for $\underline{m}(z)$ can actually be solved explicitly; by some algebraic calculations, and by substituting the explicit form of $\underline{m}(z)$, the limiting mean and covariance follow. On the other hand, by Assumption (B1), the variance term is handled as in the proof of Theorem 4.3. Combining the results for $B_X(\hat\beta,\beta)$ and $V_X(\hat\beta,\beta)$ yields the claimed CLT.

D PROOF OF THEOREM 4.5

Note that under Assumptions (B1) and (C2), assuming the following limits exist, the $K$-dimensional random vectors converge weakly to a zero-mean Gaussian vector with covariance matrix

E.2 MORE RESULTS OF EXAMPLE 2

Example 2 checks Theorem 4.5. Here we consider the standardized statistics $T_{n,0}$ and $T_{n,1}$ from Example 2. According to the central limit theorem (10) and its practical version, both $T_{n,0}$ and $T_{n,1}$ converge weakly to the standard normal distribution as $n, p \to +\infty$. We take $c = 2$ and $p = 100, 200, 400$. The finite-sample distributions of $T_{n,0}$ and $T_{n,1}$ are estimated by their histograms under 1000 repetitions. The results are presented in Figure 6 and Figure 7. When $\alpha = 0.05$, the empirical cover rates of the 95%-confidence interval (11) are reported in Figure 8.
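A scaled-down sketch of this experiment (our own reconstruction, not the authors' code) with $\Sigma = I_p$, Gaussian data, $c = 2$ and $p = 100$, using a $T_{n,1}$-style statistic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo sketch of Example 2 (Sigma = I_p, Gaussian data,
# beta ~ N_p(0, (r^2/p) I_p)). With Gaussian entries nu_4 = 3, so
# mu_hat_{c,3} = c sigma^2 / ((1-c)^2 sqrt(p)) and
# sigma_hat_{c,3}^2 = 2(1 - 1/c) r^4 + 2 c^3 sigma^4 / ((1-c)^4 p).
sigma2, r2, c = 1.0, 1.0, 2.0
p, n = 100, 50                                   # c_n = p/n = 2
mu_hat = c * sigma2 / (1 - c) ** 2 / np.sqrt(p)
sig_hat = np.sqrt(2 * (1 - 1 / c) * r2 ** 2
                  + 2 * c ** 3 * sigma2 ** 2 / (1 - c) ** 4 / p)
center = (1 - 1 / c) * r2 + sigma2 / (c - 1)     # centering, with c_n = c here

reps = 300
T = np.empty(reps)
for k in range(reps):
    X = rng.standard_normal((n, p))
    beta = rng.standard_normal(p) * np.sqrt(r2 / p)
    # Bias of the minimum-norm least squares estimator: beta^T (I - X^+ X) beta,
    # i.e. the squared norm of beta's projection onto the null space of X.
    null_part = beta - np.linalg.pinv(X) @ (X @ beta)
    B = null_part @ null_part
    # Variance term: (sigma^2/n) tr(Sigmahat^+) = sigma^2 tr((X X^T)^{-1}).
    V = sigma2 * np.sum(1.0 / np.linalg.eigvalsh(X @ X.T))
    T[k] = (np.sqrt(p) * (B + V - center) - mu_hat) / sig_hat

print(T.mean(), T.std())
```

The sample mean and standard deviation of `T` should be roughly 0 and 1, matching the histograms in Figure 7.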

E.3 EXAMPLE 3

This example checks Theorem 4.3. To proceed, we define two statistics $T_{n,2}$ and $T_{n,3}$, standardized analogously to $T_{n,0}$ and $T_{n,1}$ but with the constants of Theorem 4.3. According to the central limit theorem (8) and its practical version, both $T_{n,2}$ and $T_{n,3}$ converge weakly to the standard normal distribution as $n, p\to+\infty$. We take $c = 2$ and $p = 100, 200, 400$. The finite-sample distributions of $T_{n,2}$ and $T_{n,3}$ are estimated by their histograms under 1000 repetitions. The results are presented in Figure 9 and Figure 10. One can see that the finite-sample distributions of $T_{n,2}$ and $T_{n,3}$ are close to the standard normal distribution, and the finite-sample performance of $T_{n,3}$ is better than that of $T_{n,2}$. When $\alpha = 0.05$, the empirical cover rates of the 95%-confidence interval (9) are reported in Figure 11.
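The cover-rate computation can be sketched as follows (our own reconstruction, not the authors' code). We take the fixed delocalized vector $\beta = \mathbf{1}_p/\sqrt{p}$ as a hypothetical stand-in, since the specific $\beta$ satisfying Assumption (C1) is not given in this excerpt, and the $1/p$ correction inside `sig_hat` is included by analogy with $\hat\sigma^2_{c,3}$ — an assumption on our part:

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch of Example 3 (Sigma = I_p, Gaussian data, fixed delocalized
# beta = 1_p/sqrt(p), so r^2 = 1): empirical coverage of the 95% interval from
# a T_{n,3}-style statistic with sigma_{c,1}^2 = 2(c-1) r^4 / c^2.
# The 1/p term in sig_hat mirrors sigma_hat_{c,3}^2 (our assumption).
sigma2, r2, c = 1.0, 1.0, 2.0
p, n = 100, 50
mu_hat = c * sigma2 / (1 - c) ** 2 / np.sqrt(p)
sig_hat = np.sqrt(2 * (c - 1) * r2 ** 2 / c ** 2
                  + 2 * c ** 3 * sigma2 ** 2 / (1 - c) ** 4 / p)
center = (1 - 1 / c) * r2 + sigma2 / (c - 1)
beta = np.ones(p) / np.sqrt(p)                   # ||beta||^2 = r^2 = 1

reps, z = 400, 1.959964
covered = 0
for k in range(reps):
    X = rng.standard_normal((n, p))
    null_part = beta - np.linalg.pinv(X) @ (X @ beta)
    B = null_part @ null_part                    # bias term beta^T Pi beta
    V = sigma2 * np.sum(1.0 / np.linalg.eigvalsh(X @ X.T))
    T = (np.sqrt(p) * (B + V - center) - mu_hat) / sig_hat
    covered += abs(T) <= z
cov_rate = covered / reps

print(cov_rate)   # should be close to 0.95
```

Note the smaller asymptotic variance $2(c-1)r^4/c^2$ here versus $2(c-1)r^4/c$ in the random-$\beta$ case: fixing $\beta$ removes the fluctuation of $\|\beta\|_2^2$.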

F AN EXAMPLE FROM FIGURE 1

Figure 12: According to the first-order limit, for fixed dimension $p = 100$, the prediction risks at sample sizes $n = 90$ and $n = 98$ are about 10.20 and 49.02, so "more data hurt" seems true. However, the 95% confidence interval of the prediction risk at sample size $n = 98$ is $[4.91, 142.12]$, which contains the risk for $n = 90$; thus "more data hurt" is not statistically significant.

G AN ANISOTROPIC EXAMPLE FOR REMARK 4.2

In the over-parameterized case, the bias term $B_X(\hat\beta,\beta) = \beta^T\Pi\Sigma\Pi\beta$ is nonzero, while the variance term $V_X(\hat\beta,\beta)$ remains the same as in the under-parameterized case. Therefore, in this section we conduct a small simulation to examine the fluctuation of the bias $B_X$ for both isotropic and anisotropic $\Sigma$ in the over-parameterized case, with nonrandom $\beta$ satisfying Assumption (C1). In particular, in the following we set $r = 1$. We consider both a localized $\beta_1$ and a delocalized $\beta_2$, and both an isotropic and an anisotropic $\Sigma$: the identity case $\Sigma_1 = I_p$ and the compound symmetric case $\Sigma_2 = 0.5I_p + 0.5\,\mathbf{1}_p\mathbf{1}_p^T$. We fix $p/n = 2$ and let $p$ vary from 10 to 300. Figure 13 reports the empirical variance of $\sqrt{p}\,B_X$ and $p\,B_X$ under the various combinations of $\Sigma$ and $\beta$ with 1000 replications. From the plot in the top left panel of Figure 13, we can see that the variance of $\sqrt{p}\,B_X$ for both $\beta_1$ and $\beta_2$ remains constant as $p$ grows, which indicates that the convergence rate of $B_X$ is $1/\sqrt{p}$ in the isotropic case, regardless of whether $\beta$ is localized or delocalized. As for the anisotropic case in the top right panel, the variance of $\sqrt{p}\,B_X$ stabilizes for $\beta_1$ but decays for $\beta_2$, which indicates that the convergence rates of $B_X$ under $(\Sigma_2, \beta_2)$ and $(\Sigma_2, \beta_1)$ are different. This simulation further confirms our conjecture that in the over-parameterized case there is no universal CLT for the prediction risk $R_X(\hat\beta,\beta)$ under the anisotropic setting with nonrandom $\beta$.
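The anisotropic experiment above can be sketched in a few lines. Since the explicit definitions of $\beta_1$ and $\beta_2$ are elided in this excerpt, the choices below ($\beta_1 = e_1$ for the localized case, $\beta_2 = \mathbf{1}_p/\sqrt{p}$ for the delocalized case) are hypothetical stand-ins, and the problem sizes are scaled down from the paper's:

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirical variance of sqrt(p) * B_X with B_X = beta^T Pi Sigma Pi beta and
# Pi = I - Sigmahat^+ Sigmahat, under the compound-symmetric
# Sigma_2 = 0.5 I + 0.5 * 1 1^T.  beta_1 = e_1 (localized) and
# beta_2 = 1_p / sqrt(p) (delocalized) are hypothetical stand-ins.

def bias_samples(p, n, beta, reps=200):
    # Sigma_2^{1/2} acts on a row z as a*z + b*mean(z)*1, with a = sqrt(0.5)
    # and b chosen so the all-ones direction gets eigenvalue sqrt(0.5 + 0.5 p).
    a = np.sqrt(0.5)
    b = np.sqrt(0.5 + 0.5 * p) - a
    Sigma = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))
    out = np.empty(reps)
    for k in range(reps):
        Z = rng.standard_normal((n, p))
        X = a * Z + b * Z.mean(axis=1, keepdims=True)    # X = Z Sigma^{1/2}
        Pi_beta = beta - np.linalg.pinv(X) @ (X @ beta)  # Pi beta
        out[k] = np.sqrt(p) * (Pi_beta @ Sigma @ Pi_beta)
    return out

p, n = 80, 40                        # p/n = 2 as in the paper, smaller scale
beta1 = np.zeros(p); beta1[0] = 1.0  # localized
beta2 = np.ones(p) / np.sqrt(p)      # delocalized
var_loc = bias_samples(p, n, beta1).var()
var_deloc = bias_samples(p, n, beta2).var()

print(var_loc, var_deloc)
```

Tracking these two empirical variances as `p` grows (with `p/n` fixed at 2) reproduces the qualitative comparison in the top right panel of Figure 13.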

