ON THE UNIVERSALITY OF THE DOUBLE DESCENT PEAK IN RIDGELESS REGRESSION

Abstract

We prove a non-asymptotic distribution-independent lower bound for the expected mean squared generalization error caused by label noise in ridgeless linear regression. Our lower bound generalizes a similar known result to the overparameterized (interpolating) regime. In contrast to most previous works, our analysis applies to a broad class of input distributions with almost surely full-rank feature matrices, which allows us to cover various types of deterministic or random feature maps. Our lower bound is asymptotically sharp and implies that in the presence of label noise, ridgeless linear regression does not perform well around the interpolation threshold for any of these feature maps. We analyze the imposed assumptions in detail and provide a theory for analytic (random) feature maps. Using this theory, we can show that our assumptions are satisfied for input distributions with a (Lebesgue) density and feature maps given by random deep neural networks with analytic activation functions like sigmoid, tanh, softplus or GELU. As further examples, we show that feature maps from random Fourier features and polynomial kernels also satisfy our assumptions. We complement our theory with further experimental and analytic results.

1. INTRODUCTION

Seeking a better understanding of the successes of deep learning, Zhang et al. (2016) pointed out that deep neural networks can achieve very good performance despite being able to fit random noise, which sparked the interest of many researchers in studying the performance of interpolating learning methods. Belkin et al. (2018) made a similar observation for kernel methods and showed that classical generalization bounds are unable to explain this phenomenon. Belkin et al. (2019a) observed a "double descent" phenomenon in various learning models, where the test error first decreases with increasing model complexity, then increases towards the "interpolation threshold" where the model is first able to fit the training data perfectly, and then decreases again in the "overparameterized" regime where the model capacity exceeds the size of the training set. This phenomenon has also been discovered in several other works (Bös & Opper, 1997; Advani & Saxe, 2017; Neal et al., 2018; Spigler et al., 2019). Nakkiran et al. (2019) performed a large empirical study on deep neural networks and found that double descent can occur not only as a function of model capacity, but also as a function of the number of training epochs or of the number of training samples. Theoretical investigations of the double descent phenomenon have mostly focused on specific unregularized ("ridgeless") or weakly regularized linear regression models. These linear models can be described via i.i.d. samples (x_1, y_1), ..., (x_n, y_n) ∈ R^d × R, where the covariates x_i are mapped to feature representations z_i = φ(x_i) ∈ R^p via a (potentially random) feature map φ, and (ridgeless) linear regression is then performed on the transformed samples (z_i, y_i).
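To make this setup concrete, the following is a minimal sketch (not code from the paper) of ridgeless regression in the overparameterized regime: inputs are passed through a random feature map, and the minimum-norm least-squares solution is computed via the pseudoinverse. The specific feature map (random ReLU features) and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 50, 10, 200  # samples, input dim, feature dim (overparameterized: p > n)

# Illustrative random feature map phi: random ReLU features (one assumed choice of phi)
W = rng.standard_normal((d, p)) / np.sqrt(d)
phi = lambda X: np.maximum(X @ W, 0.0)

X = rng.standard_normal((n, d))
y = X[:, 0] + 0.1 * rng.standard_normal(n)  # labels with additive label noise

Z = phi(X)                    # feature matrix of the z_i, shape (n, p)
beta = np.linalg.pinv(Z) @ y  # minimum-norm least-squares ("ridgeless") solution

# With p > n and an almost surely full-rank feature matrix,
# the minimum-norm solution interpolates the training data.
train_mse = np.mean((Z @ beta - y) ** 2)
```

The pseudoinverse selects, among all interpolating solutions, the one with the smallest Euclidean norm; this is the standard meaning of "ridgeless" regression in the overparameterized regime.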
While linear regression with random features can be understood as a simplified model of fully trained neural networks, it is also interesting in its own right: for example, random Fourier features (Rahimi & Recht, 2008) and random neural network features (see e.g. Cao et al., 2018; Scardapane & Wang, 2017) have attracted notable attention.
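As an illustration of one such feature map, the following sketch (an assumed example, using the standard cosine construction of Rahimi & Recht) shows how random Fourier features approximate a Gaussian kernel: the inner product of two feature vectors concentrates around the kernel value as the number of features p grows.

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, sigma = 5, 2000, 1.0  # input dim, number of features, kernel bandwidth

# Random Fourier features approximating the Gaussian kernel
# k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)):
# frequencies W ~ N(0, sigma^{-2} I), phases b ~ Uniform[0, 2*pi)
W = rng.standard_normal((d, p)) / sigma
b = rng.uniform(0.0, 2.0 * np.pi, p)
rff = lambda X: np.sqrt(2.0 / p) * np.cos(X @ W + b)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
z1, z2 = rff(x1[None])[0], rff(x2[None])[0]

approx = z1 @ z2  # feature-space inner product
exact = np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))  # true kernel value
```

The approximation error decays like O(1/sqrt(p)), so linear regression on these features behaves like kernel regression with the Gaussian kernel once p is large.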

