PROVABLE MORE DATA HURT IN HIGH DIMENSIONAL LEAST SQUARES ESTIMATOR

Abstract

This paper investigates the finite-sample prediction risk of the high-dimensional least squares estimator. We derive the central limit theorem for the prediction risk when both the sample size and the number of features tend to infinity. Furthermore, the finite-sample distribution and the confidence interval of the prediction risk are provided. Our theoretical results demonstrate the sample-wise non-monotonicity of the prediction risk and confirm the "more data hurt" phenomenon.

1. INTRODUCTION

More data hurt refers to the phenomenon that training on more data can hurt the prediction performance of the learned model, especially in some deep learning tasks. Loog et al. (2019) shows that various standard learners can exhibit sample-wise non-monotonicity. Nakkiran et al. (2019) experimentally confirms the sample-wise non-monotonicity of the test accuracy of deep neural networks. This challenges the conventional understanding from large-sample theory: if an estimator is consistent, more data makes the estimator more stable and improves its finite-sample performance. Nakkiran (2019) considers adding a single data point to a linear regression task and analyzes its marginal effect on the test risk. Dereziński et al. (2019) gives an exact non-asymptotic risk of the high-dimensional least squares estimator and observes the sample-wise non-monotonicity of the mean squared error. For adversarially robust models, Min et al. (2020) proves that more data may increase the gap between the generalization errors of adversarially trained models and standard models. Chen et al. (2020) shows that more training data causes the generalization error to increase in the strong adversary regime. In this work, we derive the finite-sample distribution of the prediction risk under linear models and prove the "more data hurt" phenomenon from an asymptotic point of view.

Intuitively, "more data hurt" stems from the "double descent" risk curve: as the model complexity increases, the prediction risk of the learned model first decreases, then increases, and then decreases again. The double descent phenomenon can be precisely quantified for certain simple models (Hastie et al. (2019); Mei & Montanari (2019); Ba et al. (2019); Belkin et al. (2019); Bartlett et al. (2020); Xing et al. (2019)). Among these works, Hastie et al. (2019) and Mei & Montanari (2019) use tools from random matrix theory and explicitly prove the double descent curve of the asymptotic risk of linear regression and random features regression in the high-dimensional setup. Ba et al. (2019) gives the asymptotic risk of two-layer neural networks when either the first or the second layer is trained using a gradient flow. The second decline of the prediction risk in the double descent curve is closely related to the more data hurt phenomenon. In the over-parameterized regime, when the model complexity is fixed while the sample size increases, the degree of over-parameterization decreases and the model approaches the interpolation boundary (for example, p/n = 1 in Hastie et al. (2019)), where a high prediction risk is attained.

However, the existing asymptotic results, which focus on the first-order limit of the prediction risk, cannot fully describe the more data hurt phenomenon. In fact, the "double descent" curve is a function of the limiting ratio lim p/n, which may not be able to characterize the empirical prediction risk in finite-sample situations. There will be a non-negligible discrepancy between the empirical prediction risk and its limit, especially when the sample size or the dimension is small. Fine-grained second-order results are thus needed to fully characterize this discrepancy; moreover, a confidence band for the prediction risk can then be constructed to evaluate its finite-sample performance. We take Figure 1 as an example to illustrate this. According to the first-order limit, for a fixed dimension p = 100, the prediction risks at sample sizes n = 90 and n = 98 are about 10.20 and 49.02, so "more data hurt" seems to hold. However, the 95% confidence interval of the prediction risk at sample size 98 is [4.91, 142.12], which contains the risk at n = 90. Hence, the more data hurt effect is not statistically significant in this comparison. In this work, we therefore characterize the second-order fluctuations of the prediction risk and attempt to fill this gap.

We employ the linear regression task in Hastie et al. (2019) and Nakkiran (2019), and introduce new tools from random matrix theory, e.g., the central limit theorems for linear spectral statistics in Bai & Silverstein (2004) and Bai et al. (2007), to derive the central limit theorem of the prediction risk. Consider a linear regression task with n data points and p features; the setup of more data hurt is similar to that of the classical asymptotic analysis in Van der Vaart (2000). According to the classical asymptotic analysis with p fixed and $n \to \infty$, the least squares estimator is unbiased and $\sqrt{n}$-consistent for the ground truth. This implies that more data will not hurt and will even improve the prediction performance. However, the story is very different in the over-parameterized regime: the prediction risk does not decrease monotonically in n when p > n. More data does hurt in the over-parameterized case. In the following, we justify this phenomenon by developing second-order asymptotic results as both n and p tend to infinity.

We assume $p/n \to c$, and denote $0 < n_1 < n_2 < +\infty$, $c_1 = p/n_1$ and $c_2 = p/n_2$. The direct comparison of the prediction risk between sample sizes $n_1$ and $n_2$ can then be decomposed into three parts: (i) the gap between the finite-sample risk under $n = n_1$ and the asymptotic risk with $c = c_1$; (ii) the gap between the finite-sample risk under $n = n_2$ and the asymptotic risk with $c = c_2$; (iii) the comparison between the two asymptotic risks under $c = c_1$ and $c = c_2$. Theorems 1 and 2 of Hastie et al. (2019) give answers to task (iii). For (i) and (ii), we develop in this paper the convergence rate and the limiting distribution of the prediction risk as $n, p \to +\infty$, $p/n \to c$. Furthermore, the confidence interval of the finite-sample risk can be obtained as well.
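For orientation, and only as a sketch, the first-order limits entering part (iii) take the following well-known form in the isotropic special case of Hastie et al. (2019), with signal strength $r^2 = \|\beta\|_2^2$ and noise variance $\sigma^2$, and with the irreducible noise excluded; the notation $R(c)$ is shorthand for the limiting prediction risk at aspect ratio $c$, and the precise conditions are those of their Theorems 1 and 2:
\[
R(c) \;\to\;
\begin{cases}
\sigma^2\,\dfrac{c}{1-c}, & c < 1,\\[1.5ex]
r^2\Bigl(1-\dfrac{1}{c}\Bigr) + \sigma^2\,\dfrac{1}{c-1}, & c > 1.
\end{cases}
\]
In particular, for fixed $p$ in the over-parameterized regime $c > 1$, adding data pushes $c$ down toward the interpolation boundary $c = 1$ and inflates the variance term $\sigma^2/(c-1)$; this is the first-order mechanism behind more data hurt, and the results of this paper quantify the finite-sample fluctuations around these limits.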
Figure 1: Sample-wise double descent. We take p = 100 and 1 ≤ n ≤ 200. Left: the conditional density of the prediction risk as the sample size varies from 1 to 200; the conditional distribution of the prediction risk exhibits the sample-wise double descent phenomenon. Right: the point-wise 95% confidence band of the prediction risk. In the over-parameterized regime 1 ≤ n < 100, there exist pairs $(n_1, n_2)$ with $1 \le n_1 < n_2 < 100$ such that the upper boundary of the confidence interval at $n_1$ is smaller than the lower boundary of the confidence interval at $n_2$. This confirms the more data hurt phenomenon.

The main goal of this paper is to study the second-order asymptotic behavior of two different types of conditional prediction risk in the linear regression model. One is $R_{X,\beta}(\hat{\beta}, \beta)$, which conditions on both the training data and the regression coefficient, while the other is $R_{X}(\hat{\beta}, \beta)$, which conditions on the training data only.
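As a point of reference, both risks measure the expected squared prediction error at a new test point $x_0$ drawn from the feature distribution, independently of the training sample; the display below is only a sketch meant to fix the role of $x_0$ (introduced here for illustration), and the exact conditioning, in particular how the training noise and a random $\beta$ are handled, follows the formal definitions in Section 3:
\[
R_{\bullet}(\hat{\beta}, \beta) \;=\; \mathbb{E}\bigl[(x_0^{\top}\hat{\beta} - x_0^{\top}\beta)^2 \,\big|\, \bullet\,\bigr],
\qquad \bullet \in \{\,(X,\beta),\; X\,\},
\]
so that $R_X$ additionally averages over the randomness of the regression coefficient when $\beta$ is modeled as random.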
We summarize our main results as follows:
(1) The regression coefficient is set to be either random or nonrandom to cover more cases, and different convergence rates and limiting distributions of both prediction risks are derived under the various scenarios.
(2) In particular, the finite-sample distribution of the conditional prediction risk given both the training data and the regression coefficient is derived, and the sample-wise double descent is characterized in Theorem 4.2 and Theorem 4.5 (see Figure 1). Under certain assumptions, the more data hurt phenomenon can be confirmed by comparing the confidence intervals built via the central limit theorems.
(3) Our results incorporate non-Gaussian observations. For Gaussian data, the limiting mean and variance in the central limit theorems have simpler forms; see Sections 4.2 and 4.3 for more details.

The rest of this paper is organized as follows. Section 3 introduces the model settings and the two different prediction risks. Section 4 presents the main results on CLTs for the two types of risk with random and nonrandom regression coefficients.
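To complement Figure 1, the following minimal Monte Carlo sketch illustrates the sample-wise non-monotonicity of the conditional prediction risk of the minimum-norm least squares estimator under isotropic Gaussian features. All settings here (p = 50, unit noise variance, signal strength 2, the grid of sample sizes, and the number of replications) are illustrative assumptions and are not the configuration used to produce Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 50                       # number of features (illustrative choice)
sigma2 = 1.0                 # noise variance (illustrative choice)
beta = rng.standard_normal(p)
beta *= np.sqrt(2.0 / np.sum(beta ** 2))   # fix signal strength ||beta||^2 = 2
n_grid = range(5, 101, 5)    # sample sizes crossing the interpolation boundary n = p
n_rep = 200                  # Monte Carlo replications per sample size

for n in n_grid:
    risks = np.empty(n_rep)
    for r in range(n_rep):
        X = rng.standard_normal((n, p))                      # isotropic Gaussian design
        y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
        beta_hat = np.linalg.pinv(X) @ y                     # minimum-norm least squares
        # conditional prediction risk given (X, beta): E_{x0}[(x0' beta_hat - x0' beta)^2],
        # which equals ||beta_hat - beta||^2 for isotropic test features
        risks[r] = np.sum((beta_hat - beta) ** 2)
    lo, med, hi = np.percentile(risks, [2.5, 50, 97.5])
    print(f"n = {n:3d}:  median risk = {med:10.2f},  empirical 95% band = [{lo:10.2f}, {hi:10.2f}]")
```

In such a run, the median risk and the width of the empirical band typically blow up as n approaches p and shrink again afterwards, mirroring the sample-wise double descent and the confidence bands discussed above; the central limit theorems developed in this paper replace such Monte Carlo bands with analytic ones.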




