MULTIPLE DESCENT: DESIGN YOUR OWN GENERALIZATION CURVE

Abstract

This paper explores the generalization loss of linear regression in variably parameterized families of models, both under-parameterized and over-parameterized. We show that the generalization curve can have an arbitrary number of peaks, and moreover, the locations of those peaks can be explicitly controlled. Our results highlight the fact that both the classical U-shaped generalization curve and the recently observed double-descent curve are not intrinsic properties of the model family. Instead, their emergence is due to the interaction between the properties of the data and the inductive biases of learning algorithms.

1. INTRODUCTION

The main goal of machine learning methods is to provide accurate out-of-sample prediction, known as generalization. For a fixed family of models, a common way to select a model from this family is through empirical risk minimization, i.e., algorithmically selecting models that minimize the risk on the training dataset. Given a variably parameterized family of models, statistical learning theory aims to identify the dependence between model complexity and model performance. The empirical risk usually decreases monotonically as the model complexity increases, and achieves its minimum when the model is rich enough to interpolate the training data, resulting in zero (or near-zero) training error. In contrast, the behavior of the test error as a function of model complexity is far more complicated. Indeed, in this paper we show how to construct a model family for which the generalization curve can be fully controlled (away from the interpolation threshold) in both under-parameterized and over-parameterized regimes.

Classical statistical learning theory supports a U-shaped curve of generalization versus model complexity (Geman et al., 1992; Hastie et al., 2009). Under such a framework, the best model is found at the bottom of the U-shaped curve, which corresponds to appropriately balancing under-fitting and over-fitting the training data. From the view of the bias-variance trade-off, a higher model complexity increases the variance while decreasing the bias. A good choice of model complexity achieves a relatively low bias while still keeping the variance under control. On the other hand, a model that interpolates the training data is deemed to over-fit and tends to worsen the generalization performance due to the soaring variance.
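The interpolation threshold itself is easy to reproduce numerically. The toy sketch below (illustrative only, not from the paper; the sample size, ambient dimension, and noise level are arbitrary choices) fits least squares on a growing prefix of the features and records the training and test errors. The training error decreases as features are added and hits (numerical) zero once the model can interpolate the training data, while the test error follows a different, non-monotone trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 20, 40                          # training points, max feature count
X = rng.standard_normal((n, D))
beta = rng.standard_normal(D) / np.sqrt(D)
y = X @ beta + 0.5 * rng.standard_normal(n)

X_test = rng.standard_normal((1000, D))
y_test = X_test @ beta

train_err, test_err = [], []
for d in range(1, D + 1):
    # Least-squares fit on the first d features; pinv gives the
    # minimum-norm solution once d > n (the overparameterized regime).
    w = np.linalg.pinv(X[:, :d]) @ y
    train_err.append(np.mean((X[:, :d] @ w - y) ** 2))
    test_err.append(np.mean((X_test[:, :d] @ w - y_test) ** 2))

# Training error is essentially zero at and beyond d = n,
# the interpolation threshold.
```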
Although classical statistical theory suggests a pattern of behavior for the generalization curve up to the interpolation threshold, it does not describe what happens beyond that threshold, commonly referred to as the over-parameterized regime. This is exactly the regime in which many modern machine learning models, especially deep neural networks, have achieved remarkable success. Indeed, neural networks generalize well even when the models are so complex that they can interpolate all the training data points (Zhang et al., 2017; Belkin et al., 2018b; Ghorbani et al., 2019; Hastie et al., 2019). Modern practitioners commonly deploy deep neural networks with hundreds of millions or even billions of parameters. It has become widely accepted that large models achieve performance superior to the smaller models that would be suggested by the classical U-shaped generalization curve (Bengio et al., 2003; Krizhevsky et al., 2012; Szegedy et al., 2015; He et al., 2016; Huang et al., 2019). This indicates that the test error decreases again once model complexity grows beyond the interpolation threshold, resulting in the so-called double-descent phenomenon described in Belkin et al. (2018a), which has been broadly supported by empirical evidence (Neyshabur et al., 2015; Neal et al., 2018; Geiger et al., 2019; 2020) and confirmed empirically on modern neural architectures by Nakkiran et al. (2019).

On the theoretical side, this phenomenon has recently been addressed by several works in various model settings. In particular, Belkin et al. (2019a) proved the existence of the double-descent phenomenon for linear regression with random feature selection and analyzed the random Fourier feature model (Rahimi & Recht, 2008). Mei & Montanari (2019) also studied the Fourier model and computed the asymptotic test error, which captures the double-descent phenomenon. Bartlett et al. (2020) and Tsigler & Bartlett (2020) analyzed and gave explicit conditions for "benign overfitting" in linear and ridge regression, respectively. In a recent work, Caron & Chretien (2020) provided a finite-sample analysis of nonlinear function estimation and showed that the parameter learned through empirical risk minimization converges to the true parameter with high probability as the model complexity tends to infinity, implying the existence of double descent.

Among all the aforementioned efforts, one particularly interesting question is whether one can observe more than two descents in the generalization curve. In a recent work, d'Ascoli et al. (2020) empirically showed a sample-wise triple-descent phenomenon under the random Fourier feature model. A similar triple descent was also observed for linear regression (Nakkiran et al., 2020). More rigorously, Liang et al. (2020) presented an upper bound on the risk of the minimum-norm interpolant versus the data dimension in Reproducing Kernel Hilbert Spaces (RKHS), which exhibits multiple descent. However, a multiple-descent upper bound without a properly matching lower bound does not imply the existence of a multiple-descent generalization curve.

In this work, we study the multiple descent phenomenon by addressing the following questions:

• Can the existence of a multiple descent generalization curve be rigorously proven?
• Can an arbitrary number of descents occur?
• Can the generalization curve and the locations of the descents be designed?

In this paper, we show that the answer to all three of these questions is yes. Further related work is presented in Appendix A.

Our Contribution. We consider the linear regression model and analyze how the risk changes as the dimension of the data grows. In the linear regression setting, the data dimension is equal to the dimension of the parameter space, which reflects the model complexity.
We rigorously show that a multiple descent generalization curve exists in this setting. To the best of our knowledge, this is the first work proving a multiple descent phenomenon for any learning model. Our analysis covers both the underparameterized and overparameterized regimes.

In the overparameterized regime, we show that one can control where a descent or an ascent occurs in the generalization curve. This is realized through our algorithmic construction of a feature-revealing process. To be more specific, we assume that the data lies in R^D, where D can be arbitrarily large or even essentially infinite. We view each dimension of the data as a feature, and consider the linear regression problem restricted to the first d features, where d < D. New features are revealed by increasing the dimension of the data. We then show that by specifying the distribution of each newly revealed feature to be either a standard Gaussian or a Gaussian mixture, one can determine whether an ascent or a descent occurs: for an ascent, it is sufficient that the new feature follows a suitable Gaussian mixture distribution; for a descent, it is sufficient that it follows a standard Gaussian distribution. Therefore, in the overparameterized regime, we can fully control the occurrence of descents and ascents. In contrast, in the underparameterized regime, the generalization loss always increases regardless of the feature distribution. We also consider a dimension-normalized version of the generalization loss, under which we show that the generalization curve exhibits multiple descent in the underparameterized regime as well.

Generally speaking, we show that we are able to design the generalization curve. On the one hand, we show theoretically that the generalization curve is malleable and can be constructed in an arbitrary fashion.
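The feature-revealing process can be sketched numerically. The simulation below is only an illustration of the setup, not the paper's actual construction: the sample size, target model, and mixture parameters are arbitrary choices. Features are revealed one at a time according to a schedule of distributions (standard Gaussian or a two-component Gaussian mixture), and the generalization loss of the minimum-norm interpolant on the first d features is recorded at each step of the overparameterized regime.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_test = 15, 2000                  # overparameterized once d > n

def feature(kind, size):
    """One revealed feature: standard Gaussian, or a symmetric
    two-component Gaussian mixture (illustrative parameters)."""
    if kind == "gauss":
        return rng.standard_normal(size)
    return rng.choice([-5.0, 5.0], size=size) + rng.standard_normal(size)

# Schedule: which feature distribution is revealed at each dimension.
schedule = ["gauss"] * 25 + ["mixture"] * 5 + ["gauss"] * 10
D = len(schedule)
Z = np.column_stack([feature(k, n + n_test) for k in schedule])
X, X_test = Z[:n], Z[n:]

beta = rng.standard_normal(D) / np.sqrt(D)
y, y_test = X @ beta, X_test @ beta   # noiseless target for simplicity

risk = []
for d in range(n + 1, D + 1):         # overparameterized regime only
    # Minimum-norm interpolant restricted to the first d features.
    w = np.linalg.pinv(X[:, :d]) @ y
    risk.append(np.mean((X_test[:, :d] @ w - y_test) ** 2))
```

The paper's result is that with the right choice of schedule and mixture parameters, each revealed standard Gaussian feature forces a descent and each mixture feature forces an ascent; the schedule above merely illustrates the mechanism of revealing features and tracking the resulting risk.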
On the other hand, we rarely observe complex generalization curves in practice, aside from carefully curated constructions. Putting these facts together, we conclude that realistic generalization curves arise from specific interactions between the properties of typical data and the inductive biases of learning algorithms. We should highlight that the nature of these interactions is far from understood and should be an area of further investigation.



Notation. For x ∈ R^D and d ≤ D, we let x[1:d] ∈ R^d denote the d-dimensional vector with x[1:d]_i = x_i for all 1 ≤ i ≤ d. For a matrix A ∈ R^{n×d}, we denote its Moore-Penrose pseudoinverse by A†.
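As a concrete illustration of this notation (NumPy indexing is 0-based, so the paper's x[1:d] corresponds to x[:d] in code; the vectors and matrix below are arbitrary examples), the snippet also checks the property of the pseudoinverse used throughout the overparameterized analysis: for a wide matrix A, the solution pinv(A) @ b interpolates and has the smallest norm among all interpolating solutions.

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])   # x in R^6
d = 3
prefix = x[:d]                                  # the paper's x[1:d]

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 5))                 # n = 2 < d = 5: underdetermined
b = rng.standard_normal(2)
w = np.linalg.pinv(A) @ b                       # minimum-norm least-squares solution
assert np.allclose(A @ w, b)                    # it interpolates

# Any other solution w + z, with z in the null space of A, has larger norm,
# since w lies in the row space of A, orthogonal to the null space.
z = (np.eye(5) - np.linalg.pinv(A) @ A) @ rng.standard_normal(5)
assert np.allclose(A @ (w + z), b)
assert np.linalg.norm(w + z) >= np.linalg.norm(w)
```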

