ON THE UNIVERSALITY OF THE DOUBLE DESCENT PEAK IN RIDGELESS REGRESSION

Abstract

We prove a non-asymptotic distribution-independent lower bound for the expected mean squared generalization error caused by label noise in ridgeless linear regression. Our lower bound generalizes a similar known result to the overparameterized (interpolating) regime. In contrast to most previous works, our analysis applies to a broad class of input distributions with almost surely full-rank feature matrices, which allows us to cover various types of deterministic or random feature maps. Our lower bound is asymptotically sharp and implies that in the presence of label noise, ridgeless linear regression does not perform well around the interpolation threshold for any of these feature maps. We analyze the imposed assumptions in detail and provide a theory for analytic (random) feature maps. Using this theory, we can show that our assumptions are satisfied for input distributions with a (Lebesgue) density and feature maps given by random deep neural networks with analytic activation functions like sigmoid, tanh, softplus or GELU. As further examples, we show that feature maps from random Fourier features and polynomial kernels also satisfy our assumptions. We complement our theory with further experimental and analytic results.

1. INTRODUCTION

Seeking a better understanding of the successes of deep learning, Zhang et al. (2016) pointed out that deep neural networks can achieve very good performance despite being able to fit random noise, which sparked the interest of many researchers in studying the performance of interpolating learning methods. Belkin et al. (2018) made a similar observation for kernel methods and showed that classical generalization bounds are unable to explain this phenomenon. Belkin et al. (2019a) observed a "double descent" phenomenon in various learning models, where the test error first decreases with increasing model complexity, then increases towards the "interpolation threshold" where the model is first able to fit the training data perfectly, and then decreases again in the "overparameterized" regime where the model capacity exceeds the size of the training set. This phenomenon has also been observed in several other works (Bös & Opper, 1997; Advani & Saxe, 2017; Neal et al., 2018; Spigler et al., 2019). Nakkiran et al. (2019) performed a large empirical study on deep neural networks and found that double descent can occur not only as a function of model capacity, but also as a function of the number of training epochs or of the number of training samples.

Theoretical investigations of the double descent phenomenon have mostly focused on specific unregularized ("ridgeless") or weakly regularized linear regression models. These linear models can be described via i.i.d. samples (x_1, y_1), ..., (x_n, y_n) ∈ R^d × R, where the covariates x_i are mapped to feature representations z_i = φ(x_i) ∈ R^p via a (potentially random) feature map φ, and (ridgeless) linear regression is then performed on the transformed samples (z_i, y_i).
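The ridgeless estimator in this setting is the minimum-ℓ2-norm least-squares solution, computable via the Moore-Penrose pseudoinverse; when the feature matrix has full rank and p ≥ n, it interpolates the training labels exactly. A minimal sketch (the random ReLU feature map and all dimensions here are illustrative placeholders, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 20, 5, 50  # overparameterized regime: p > n

def phi(X, W):
    """Toy random feature map x -> relu(Wx); stands in for a generic φ."""
    return np.maximum(X @ W.T, 0.0)

W = rng.standard_normal((p, d))   # random weights of the feature map
X = rng.standard_normal((n, d))   # covariates x_1, ..., x_n
y = rng.standard_normal(n)        # labels (pure noise, for illustration)

Z = phi(X, W)                     # n x p feature matrix with rows z_i
beta = np.linalg.pinv(Z) @ y      # minimum-norm least-squares solution

# With almost surely full-rank Z and p >= n, the fit interpolates:
print(np.max(np.abs(Z @ beta - y)))  # ≈ 0 (up to numerical precision)
```

In the underparameterized case p &lt; n, the same pseudoinverse expression reduces to the ordinary least-squares estimator, so one formula covers both regimes.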
While linear regression with random features can be understood as a simplified model of fully trained neural networks, it is also interesting in its own right: for example, random Fourier features (Rahimi & Recht, 2008) and random neural network features (see e.g. Cao et al., 2018; Scardapane & Wang, 2017) have gained a notable amount of attention. Unfortunately, existing theoretical investigations of double descent are usually limited in one or more of the following ways:

(1) They assume that the z_i (or a linear transformation thereof) have (centered) i.i.d. components. This assumption is made by Hastie et al. (2019), while Advani & Saxe (2017) and Belkin et al. (2019b) even assume that the z_i follow a Gaussian distribution. While the assumption of i.i.d. components facilitates the application of some random matrix theory results, it excludes most feature maps: for feature maps φ with d < p, the z_i will usually be concentrated on a d-dimensional submanifold of R^p, and will therefore usually not have i.i.d. components.

(2) They assume a (shallow) random feature model with a fixed distribution of the x_i, e.g. an isotropic Gaussian distribution or a uniform distribution on a sphere. Examples are the single-layer random neural network feature models by Hastie et al. (2019) in the unregularized case and by Mei & Montanari (2019) and d'Ascoli et al. (2020a) in the regularized case. A simple Fourier model with d = 1 has been studied by Belkin et al. (2019b). While these analyses provide insights for some practically relevant random feature models, the assumptions on the input distribution prevent them from applying to real-world data.

(3) Their analysis only applies in a high-dimensional limit where n, p → ∞ and n/p → γ for a constant γ ∈ (0, ∞). This applies to all works mentioned in (1) and (2) except the model by Belkin et al. (2019b), where the z_i follow a standard Gaussian distribution.

In this paper, we provide an analysis under significantly weaker assumptions. We introduce the basic setting of our paper in Section 2 and Section 3. Our main contributions are:

• In Section 4, we show a non-asymptotic distribution-independent lower bound for the expected excess risk of ridgeless linear regression with (random) features. While the underparameterized bound is adapted from a minimax lower bound in Mourtada (2019), the overparameterized bound is new and perfectly complements the underparameterized version. The obtained general lower bound relies on significantly weaker assumptions than most previous works and shows that there is only limited potential to reduce the sensitivity of unregularized linear models to label noise via engineering better feature maps.
• In Section 5, we show that our lower bound applies to a large class of input distributions and feature maps, including random deep neural networks, random Fourier features and polynomial kernels. This analysis is also relevant for related work where similar assumptions are not investigated (e.g. Mourtada, 2019; Muthukumar et al., 2020). For random deep neural networks, our result requires weaker assumptions than a related result by Nguyen & Hein (2017).

• In Section 6 and Appendix C, we compare our lower bound to new theoretical and experimental results for specific examples, including random neural network feature maps as well as finite-width Neural Tangent Kernels (Jacot et al., 2018). We also show that our lower bound is asymptotically sharp in the limit n, p → ∞.

Similar to this paper, Muthukumar et al. (2020) study the "fundamental price of interpolation" in the overparameterized regime, providing a probabilistic lower bound for the generalization error under the assumption of subgaussian or (suitably) bounded features. We explain the difference to our lower bound in detail in Appendix L, showing that our overparameterized lower bound for the expected generalization error requires significantly weaker assumptions, that it is uniform across feature maps, and that it yields a more extreme interpolation peak. Our lower bound also applies to a large class of kernels whenever they can be represented using a feature map with a finite-dimensional feature space, i.e. p < ∞. For ridgeless regression with certain classes of kernels, lower or upper bounds have been derived (Liang & Rakhlin, 2020; Rakhlin & Zhai, 2019; Liang et al., 2019). However, as explained in more detail in Appendix K, these analyses impose restrictions on the kernels that allow them to ignore "double descent" type phenomena in the feature space dimension p.
Beyond Double Descent, a series of papers have studied "Multiple Descent" phenomena theoretically and empirically, both with respect to the number of parameters p and the input dimension d. Adlam & Pennington (2020) and d'Ascoli et al. (2020b) theoretically investigate Triple Descent phenomena. Nakkiran et al. (2020) argue that Double Descent can be mitigated by optimal regularization. They also empirically observe a form of Triple Descent in an unregularized model. Liang


