PROVABLE MORE DATA HURT IN HIGH-DIMENSIONAL LEAST SQUARES ESTIMATOR

Abstract

This paper investigates the finite-sample prediction risk of the high-dimensional least squares estimator. We derive a central limit theorem for the prediction risk when both the sample size and the number of features tend to infinity. Furthermore, the finite-sample distribution and a confidence interval for the prediction risk are provided. Our theoretical results demonstrate the sample-wise non-monotonicity of the prediction risk and confirm the "more data hurt" phenomenon.

1. INTRODUCTION

"More data hurt" refers to the phenomenon that training on more data can hurt the prediction performance of the learned model, especially in some deep learning tasks. Loog et al. (2019) show that various standard learners can exhibit sample-wise non-monotonicity. Nakkiran et al. (2019) experimentally confirm the sample-wise non-monotonicity of the test accuracy of deep neural networks. This challenges the conventional understanding of large-sample properties: if an estimator is consistent, more data makes the estimator more stable and improves its finite-sample performance. Nakkiran (2019) considers adding a single data point to a linear regression task and analyzes its marginal effect on the test risk. Dereziński et al. (2019) give an exact non-asymptotic risk of the high-dimensional least squares estimator and observe the sample-wise non-monotonicity of the mean squared error. For adversarially robust models, Min et al. (2020) prove that more data may increase the gap between the generalization errors of adversarially trained models and standard models. Chen et al. (2020) show that more training data can cause the generalization error to increase in the strong adversary regime. In this work, we derive the finite-sample distribution of the prediction risk under linear models and prove the "more data hurt" phenomenon from an asymptotic point of view. Intuitively, "more data hurt" stems from the "double descent" risk curve: as the model complexity increases, the prediction risk of the learned model first decreases, then increases, and then decreases again. The double descent phenomenon can be precisely quantified for certain simple models (Hastie et al. (2019); Mei & Montanari (2019); Ba et al. (2019); Belkin et al. (2019); Bartlett et al. (2020); Xing et al. (2019)). Among these works, Hastie et al. (2019) and Mei & Montanari (2019) use tools from random matrix theory to explicitly derive the double descent curve of the asymptotic risk of linear regression and random features regression in the high-dimensional setup. Ba et al. (2019) give the asymptotic risk of two-layer neural networks when either the first or the second layer is trained using a gradient flow. The second descent of the prediction risk in the double descent curve is closely related to the more data hurt phenomenon.

In the over-parameterized regime, when the model complexity is fixed while the sample size increases, the degree of over-parameterization decreases toward the interpolation boundary (for example, p/n = 1 in Hastie et al. (2019)), where a high prediction risk is attained. However, the existing asymptotic results, which focus on the first-order limit of the prediction risk, cannot fully describe the more data hurt phenomenon. In fact, the "double descent" curve is a function of the limiting ratio lim p/n, which may not characterize the empirical prediction risk in finite-sample situations. There will be a non-negligible discrepancy between the empirical prediction risk and its limit, especially when the sample size or the dimension is small. Fine-grained second-order results are thus needed to fully characterize this discrepancy; furthermore, a confidence band for the prediction risk can be constructed to evaluate its finite-sample performance. We take Figure 1 as an example to illustrate this. According to the first-order limit, given a fixed dimension p = 100, the prediction risks at sample size n = 90 and n = 98 are about 10.20 and
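The risk spike near the interpolation boundary can be reproduced with a small simulation. The following is a minimal sketch, not the paper's experimental code: it assumes an isotropic Gaussian design, unit signal strength ||beta|| = 1, noise level sigma = 1, and the minimum-norm least squares estimator, and it estimates the excess prediction risk by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
p, sigma = 100, 1.0                      # fixed dimension, noise level (assumed)
beta = rng.standard_normal(p)
beta /= np.linalg.norm(beta)             # unit signal strength (assumed)

def excess_risk(n, reps=50):
    """Monte Carlo estimate of the excess prediction risk E||beta_hat - beta||^2
    of the minimum-norm least squares estimator beta_hat = X^+ y.
    For an isotropic design this equals the excess risk at a fresh test point."""
    total = 0.0
    for _ in range(reps):
        X = rng.standard_normal((n, p))            # isotropic Gaussian design
        y = X @ beta + sigma * rng.standard_normal(n)
        beta_hat = np.linalg.pinv(X) @ y           # minimum-norm solution
        total += np.sum((beta_hat - beta) ** 2)
    return total / reps

# "More data hurt": with p = 100 fixed, the risk near the interpolation
# boundary (n = 90) exceeds the risk at a much smaller sample size (n = 50);
# it drops again once n exceeds p (n = 150).
for n in (50, 90, 150):
    print(n, excess_risk(n))
```

Under these assumptions, the known asymptotics for the isotropic case (Hastie et al. (2019)) put the excess risk near ||beta||^2 (1 - n/p) + sigma^2 n/(p - n - 1) for n < p, i.e., roughly 1.5 at n = 50 but about 10.1 at n = 90, consistent with the value of about 10.20 discussed around Figure 1; going from n = 50 to n = 90 hurts.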

