MULTIPLE DESCENT: DESIGN YOUR OWN GENERALIZATION CURVE

Abstract

This paper explores the generalization loss of linear regression in variably parameterized families of models, both under-parameterized and over-parameterized. We show that the generalization curve can have an arbitrary number of peaks and, moreover, that the locations of those peaks can be explicitly controlled. Our results highlight the fact that both the classical U-shaped generalization curve and the recently observed double descent curve are not intrinsic properties of the model family. Instead, their emergence is due to the interaction between the properties of the data and the inductive biases of learning algorithms.

1. INTRODUCTION

The main goal of machine learning methods is to provide an accurate out-of-sample prediction, known as generalization. For a fixed family of models, a common way to select a model from this family is through empirical risk minimization, i.e., algorithmically selecting models that minimize the risk on the training dataset. Given a variably parameterized family of models, statistical learning theory aims to identify the dependence between model complexity and model performance. The empirical risk usually decreases monotonically as the model complexity increases, and achieves its minimum when the model is rich enough to interpolate the training data, resulting in zero (or near-zero) training error. In contrast, the behavior of the test error as a function of model complexity is far more complicated. Indeed, in this paper we show how to construct a model family for which the generalization curve can be fully controlled (away from the interpolation threshold) in both under-parameterized and over-parameterized regimes.

Classical statistical learning theory supports a U-shaped curve of generalization versus model complexity (Geman et al., 1992; Hastie et al., 2009). Under such a framework, the best model is found at the bottom of the U-shaped curve, which corresponds to appropriately balancing under-fitting and over-fitting the training data. From the view of the bias-variance trade-off, a higher model complexity increases the variance while decreasing the bias. A good choice of model complexity achieves a relatively low bias while still keeping the variance under control. On the other hand, a model that interpolates the training data is deemed to over-fit and tends to worsen the generalization performance due to the soaring variance.
Although classical statistical theory suggests a pattern of behavior for the generalization curve up to the interpolation threshold, it does not describe what happens beyond the interpolation threshold, commonly referred to as the over-parameterized regime. This is the exact regime where many modern machine learning models, especially deep neural networks, have achieved remarkable success. Indeed, neural networks generalize well even when the models are so complex that they have the potential to interpolate all the training data points (Zhang et al., 2017; Belkin et al., 2018b; Ghorbani et al., 2019; Hastie et al., 2019).

Modern practitioners commonly deploy deep neural networks with hundreds of millions or even billions of parameters. It has become widely accepted that large models achieve performance superior to the small models that may be suggested by the classical U-shaped generalization curve (Bengio et al., 2003; Krizhevsky et al., 2012; Szegedy et al., 2015; He et al., 2016; Huang et al., 2019). This indicates that the test error decreases again once model complexity grows beyond the interpolation threshold, resulting in the so-called double-descent phenomenon described in (Belkin et al., 2018a), which has been broadly supported by empirical evidence (Neyshabur et al., 2015; Neal et al., 2018; Geiger et al., 2019; 2020) and confirmed empirically on modern neural architectures by Nakkiran et al. (2019).

On the theoretical side, this phenomenon has recently been addressed by several works in various model settings. In particular, Belkin et al. (2019a) proved the existence of the double-descent phenomenon for linear regression with random feature selection and analyzed the random Fourier feature model (Rahimi & Recht, 2008). Mei & Montanari (2019) also studied the Fourier model and computed the asymptotic test error, which captures the double-descent phenomenon. Bartlett et al.
(2020); Tsigler & Bartlett (2020) analyzed and gave explicit conditions for "benign overfitting" in linear and ridge regression, respectively. In a recent work, Caron & Chretien (2020) provided a finite sample analysis of nonlinear function estimation and showed that the parameter learned through empirical risk minimization converges to the true parameter with high probability as the model complexity tends to infinity, implying the existence of double descent.

Among all the aforementioned efforts, one particularly interesting question is whether one can observe more than two descents in the generalization curve. In a recent work, d'Ascoli et al. (2020) empirically showed a sample-wise triple-descent phenomenon under the random Fourier feature model. A similar triple descent was also observed for linear regression (Nakkiran et al., 2020). More rigorously, Liang et al. (2020) presented an upper bound on the risk of the minimum-norm interpolation versus the data dimension in Reproducing Kernel Hilbert Spaces (RKHS), which exhibits multiple descent. However, a multiple-descent upper bound without a properly matching lower bound does not imply the existence of a multiple-descent generalization curve. In this work, we study the multiple descent phenomenon by addressing the following questions:

• Can the existence of a multiple descent generalization curve be rigorously proven?
• Can an arbitrary number of descents occur?
• Can the generalization curve and the locations of descents be designed?

In this paper, we show that the answer to all three of these questions is yes. Further related work is presented in Appendix A.

Our Contribution. We consider the linear regression model and analyze how the risk changes as the dimension of the data grows. In the linear regression setting, the data dimension is equal to the dimension of the parameter space, which reflects the model complexity.
We rigorously show that the multiple descent generalization curve exists under this setting. To the best of our knowledge, this is the first work proving a multiple descent phenomenon for any learning model. Our analysis considers both the underparameterized and overparameterized regimes. In the overparameterized regime, we show that one can control where a descent or an ascent occurs in the generalization curve. This is realized through our algorithmic construction of a feature-revealing process. To be more specific, we assume that the data is in $\mathbb{R}^D$, where $D$ can be arbitrarily large or even essentially infinite. We view each dimension of the data as a feature. We consider a linear regression problem restricted to the first $d$ features, where $d < D$. New features are revealed by increasing the dimension of the data. We then show that by specifying the distribution of the newly revealed feature to be either a standard Gaussian or a Gaussian mixture, one can determine where an ascent or a descent occurs. To create an ascent when a new feature is revealed, it is sufficient that the feature follows a Gaussian mixture distribution. To create a descent, it is sufficient that the new feature follows a standard Gaussian distribution. Therefore, in the overparameterized regime, we can fully control the occurrence of a descent and an ascent. As a comparison, in the underparameterized regime, the generalization loss always increases regardless of the feature distribution. We also consider a dimension-normalized version of the generalization loss, under which we show that the generalization curve exhibits multiple descent in the underparameterized regime.

Generally speaking, we show that we are able to design the generalization curve. On the one hand, we show theoretically that the generalization curve is malleable and can be constructed in an arbitrary fashion.
On the other hand, we rarely observe complex generalization curves in practice, besides carefully curated constructions. Putting these facts together, we arrive at the conclusion that realistic generalization curves arise from specific interactions between properties of typical data and the inductive biases of algorithms. We should highlight that the nature of these interactions is far from being understood and should be an area of further investigations.

2. PRELIMINARIES AND PROBLEM FORMULATION

Notation. For $x \in \mathbb{R}^D$ and $d \le D$, we let $x[1:d] \in \mathbb{R}^d$ denote the $d$-dimensional vector with $x[1:d]_i = x_i$ for all $1 \le i \le d$. For a matrix $A \in \mathbb{R}^{n \times d}$, we denote its Moore-Penrose pseudoinverse by $A^+ \in \mathbb{R}^{d \times n}$. We use the big-$O$ notation $O$ and write variables in the subscript of $O$ if the implicit constant depends on them. For example, $O_{n,d,\sigma}(1)$ is a constant that depends only on $n$, $d$, and $\sigma$. If $f(\sigma)$ and $g(\sigma)$ are functions of $\sigma$, we write $f(\sigma) \sim g(\sigma)$ if $\lim \frac{f(\sigma)}{g(\sigma)} = 1$. How the limit is taken will be clear from the context.

Distributions. Let $\mathcal{N}(\mu, \sigma^2)$ ($\mu, \sigma \in \mathbb{R}$) and $\mathcal{N}(\mu, \Sigma)$ ($\mu \in \mathbb{R}^n$, $\Sigma \in \mathbb{R}^{n \times n}$ positive semi-definite) denote the univariate and multivariate Gaussian distributions, respectively. We define a family of trimodal Gaussian mixture distributions as follows:

$$\mathcal{N}^{\mathrm{mix}}_{\sigma,\mu} \triangleq \frac{1}{3}\mathcal{N}(0, \sigma^2) + \frac{1}{3}\mathcal{N}(-\mu, \sigma^2) + \frac{1}{3}\mathcal{N}(\mu, \sigma^2) .$$

For an illustration, please see Fig. 1.

Figure 1: Density functions of the $\mathcal{N}(0,1)$ and $\mathcal{N}^{\mathrm{mix}}_{\sigma,1}$ feature. A new entry is independently sampled from a 1-dimensional distribution that is either a standard Gaussian or a trimodal Gaussian mixture. Smaller $\sigma$ leads to higher concentration around each mode.

Let $\chi^2(k, \lambda)$ denote the noncentral chi-squared distribution with $k$ degrees of freedom and noncentrality parameter $\lambda$. For example, if $X_i \sim \mathcal{N}(\mu_i, 1)$ (for $i = 1, 2, \dots, k$) are independent Gaussian random variables, we have $\sum_{i=1}^{k} X_i^2 \sim \chi^2(k, \lambda)$, where $\lambda = \sum_{i=1}^{k} \mu_i^2$. We also denote by $\chi^2(k)$ the (central) chi-squared distribution with $k$ degrees of freedom and by $F(d_1, d_2)$ the $F$-distribution, where $d_1$ and $d_2$ are the degrees of freedom.

Problem Setup. Let $x_1, \dots, x_n \in \mathbb{R}^D$ be column vectors that represent the training data of size $n$ and let $x_{\mathrm{test}} \in \mathbb{R}^D$ be a column vector that represents the test data. We assume that they are all independently drawn from a distribution $\mathcal{D}$:

$$x_1, \dots, x_n, x_{\mathrm{test}} \overset{\mathrm{iid}}{\sim} \mathcal{D} .$$

Let us consider a linear regression problem on the first $d$ features, where $d \le D$ for some arbitrarily large $D$. Here, $d$ can be viewed as the number of features revealed. The design matrix is $A = [x_1[1:d], \dots, x_n[1:d]]^\top \in \mathbb{R}^{n \times d}$. The true linear model is $\beta^* \in \mathbb{R}^d$. The noise $\varepsilon \in \mathbb{R}^n$ follows the multivariate Gaussian distribution $\mathcal{N}(0, \eta^2 I_n)$. Let $x = x_{\mathrm{test}}[1:d]$ denote the first $d$ features of the test data. In the underparameterized regime where $d < n$, the least squares solution on the training data is $A^+(A\beta^* + \varepsilon)$. In the overparameterized regime where $d > n$, $A^+(A\beta^* + \varepsilon)$ is the minimum-norm solution. In both regimes we consider the solution $\hat\beta \triangleq A^+(A\beta^* + \varepsilon)$. The excess generalization loss on the test data is then given by

$$\begin{aligned}
L_d &\triangleq \mathbb{E}\left[(y - x^\top\hat\beta)^2 - (y - x^\top\beta^*)^2\right] \\
&= \mathbb{E}\left[(x^\top(\hat\beta - \beta^*))^2\right] \\
&= \mathbb{E}\left[\left(x^\top\left((A^+A - I)\beta^* + A^+\varepsilon\right)\right)^2\right] \\
&= \mathbb{E}\left[(x^\top(A^+A - I)\beta^*)^2\right] + \mathbb{E}\left[(x^\top A^+\varepsilon)^2\right] \\
&= \mathbb{E}\left[(x^\top(A^+A - I)\beta^*)^2\right] + \eta^2\,\mathbb{E}\left\|(A^\top)^+ x\right\|^2 ,
\end{aligned}$$

where $y = x^\top\beta^* + \varepsilon_{\mathrm{test}}$ and $\varepsilon_{\mathrm{test}} \sim \mathcal{N}(0, \eta^2)$. We call the term $\mathbb{E}[(x^\top(A^+A - I)\beta^*)^2]$ the bias and the term $\eta^2\,\mathbb{E}\|(A^\top)^+ x\|^2$ the variance. In this paper, we assume $\beta^* = 0$ and the noise level $\eta = 1$. In this setting, we get $L_d = \mathbb{E}\|(A^\top)^+ x\|^2$.

Remark 1. In the underparameterized regime, if $\mathcal{D}$ is a continuous distribution (our construction presented later satisfies this condition), the matrix $A$ has linearly independent columns almost surely. In this case, we have $A^+A = I$ and therefore the bias $\mathbb{E}[(x^\top(A^+A - I)\beta^*)^2]$ vanishes irrespective of the true linear model $\beta^*$. In other words, in the underparameterized regime, $L_d$ equals $\eta^2\,\mathbb{E}\|(A^\top)^+ x\|^2$ for all $\beta^*$.

We would like to study the change in the loss caused by the growth in the number of features revealed. Note that the product $(A^+)^\top x$ sums over $d$ dimensions. Once we reveal a new feature, which is equivalent to adding a new row $b^\top$ to $A^\top$ and a new component $y$ to $x$, the product $\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+ \begin{bmatrix} x \\ y \end{bmatrix}$ sums over $d + 1$ dimensions. As a result, to compare quantities of different dimensions, we need to normalize the generalization loss by the dimension. We define the dimension-normalized generalization loss $\tilde{L}_d$ as follows:

$$\tilde{L}_d \triangleq \mathbb{E}\left\|\frac{1}{d}(A^\top)^+ x\right\|^2 = \frac{1}{d^2} L_d .$$

Local Maximum and Multiple Descent. We say that a local maximum occurs at a dimension $d \ge 1$ if $L_{d-1} < L_d$ and $L_d > L_{d+1}$. Intuitively, a local maximum occurs if there is an increasing stage of the generalization loss, followed by a decreasing stage, as the dimension $d$ grows. Additionally, we define $L_0 \triangleq -\infty$. If the generalization loss exhibits a single descent, then, based on our definition, a unique local maximum occurs at $d = 1$. For a double-descent generalization curve, a local maximum occurs at two different dimensions. In general, if we observe a local maximum at $K$ different dimensions, we call it a $K$-descent.
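The quantities $L_d$ and $\tilde{L}_d$ defined above can be estimated by simulation. The following is a minimal sketch, assuming all features are i.i.d. $\mathcal{N}(0,1)$; the function names `excess_loss` and `normalized_loss` are illustrative and not part of the paper. For this purely Gaussian case, the standard inverse-Wishart identity gives $L_d = d/(n-d-1)$ when $n > d + 1$ (a known fact used here only as a sanity check, not a result stated in this paper).

```python
import numpy as np

def excess_loss(n, d, trials=4000, seed=0):
    """Monte Carlo estimate of L_d = E||(A^T)^+ x||^2 for i.i.d. N(0, 1) features."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        A = rng.standard_normal((n, d))   # design matrix, one training point per row
        x = rng.standard_normal(d)        # first d features of the test point
        v = np.linalg.pinv(A.T) @ x       # (A^T)^+ x, a vector in R^n
        total += v @ v
    return total / trials

def normalized_loss(n, d, **kwargs):
    """Dimension-normalized loss, i.e. L_d divided by d^2."""
    return excess_loss(n, d, **kwargs) / d**2

n = 30
for d in (1, 5, 10):
    # Second printed value is the closed form d / (n - d - 1) for Gaussian features.
    print(d, excess_loss(n, d), d / (n - d - 1))
```

With $d = 1$ the closed form reduces to $1/(n-2)$, which matches the value of $\tilde{L}_1$ used later in the proof of Theorem 2.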

3. UNDERPARAMETERIZED REGIME

First, we present a theorem for the underparameterized regime, whose proof is deferred to Appendix B.1. It states that the un-normalized generalization loss $L_d$ is always nondecreasing as $d$ grows. Moreover, it is possible to have an arbitrarily large ascent, i.e., $L_{d+1} - L_d > C$ for any $C > 0$.

Theorem 1 (Proof in Appendix B.1). If $d < n$, we have $L_{d+1} \ge L_d$ irrespective of the data distribution. Moreover, for any $C > 0$, there exists a distribution $\mathcal{D}$ such that $L_{d+1} - L_d > C$.

For the dimension-normalized generalization loss $\tilde{L}_d$, there can be both ascents and descents, and it is possible to specify where the local peaks in the generalization curve occur.

Theorem 2 (Underparameterized regime). Let $D + 2 < \sqrt{2n}$. For any $1 < d_1 < d_2 < \cdots < d_K < D$ where $d_{j+1} - d_j \ge 2$, there exists a distribution $\mathcal{D}$ such that a local maximum of the $\tilde{L}_d$ curve occurs at each $d_j$.

Note that the assumption $d_{j+1} - d_j \ge 2$ is necessary because two local maxima cannot be adjacent. We present an example in Fig. 2.

Figure 2: [Plot; horizontal axis: dimension of data, 1 to 15; three local maxima are marked.] A local maximum occurs at a dimension $d$ if $\tilde{L}_{d-1} < \tilde{L}_d > \tilde{L}_{d+1}$. The triplet $\tilde{L}_{d-1}, \tilde{L}_d, \tilde{L}_{d+1}$ then forms an ascent/descent, which is marked by the shaded area. Local maxima are marked by the dotted lines. Adding a new feature with a Gaussian mixture distribution increases the loss, while adding one with a univariate Gaussian distribution decreases the loss. Therefore, a Gaussian mixture feature followed by a Gaussian feature creates one ascent/descent.

Remark 2 ($\mathcal{D}$ can be a product distribution). As will be clear later in the proof of Theorem 2, the distribution $\mathcal{D}$ can be made as simple as a product distribution $\mathcal{D} = \mathcal{D}_1 \times \cdots \times \mathcal{D}_D$ such that $x_{i,j} \overset{\mathrm{iid}}{\sim} \mathcal{D}_j$ for all $1 \le i \le n$, where $\mathcal{D}_j$ is either $\mathcal{N}(0,1)$ or a Gaussian mixture $\mathcal{N}^{\mathrm{mix}}_{\sigma_j,1}$ for some $\sigma_j > 0$. As a consequence, by permuting the order of the $\mathcal{D}_i$'s in the product distribution, we can change the order of revealing the features.

Remark 3 (Kernel regression on Gaussian data). In light of Remark 2, $\mathcal{D}$ can be chosen to be a product distribution that consists of only $\mathcal{N}(0,1)$ and $\mathcal{N}^{\mathrm{mix}}_{\sigma_j,1}$. Note that one can simulate $\mathcal{N}^{\mathrm{mix}}_{\sigma,1}$ with $\mathcal{N}(0,1)$ through inverse transform sampling. To see this, let $F_{\mathcal{N}(0,1)}$ and $F_{\mathcal{N}^{\mathrm{mix}}_{\sigma,1}}$ be the cdfs of $\mathcal{N}(0,1)$ and $\mathcal{N}^{\mathrm{mix}}_{\sigma,1}$, respectively. If $X \sim \mathcal{N}(0,1)$, we have $F_{\mathcal{N}(0,1)}(X) \sim \mathrm{Unif}((0,1))$ and therefore $\phi_\sigma(X) \triangleq F^{-1}_{\mathcal{N}^{\mathrm{mix}}_{\sigma,1}}(F_{\mathcal{N}(0,1)}(X)) \sim \mathcal{N}^{\mathrm{mix}}_{\sigma,1}$. In fact, we can use a multivariate Gaussian $\mathcal{D} = \mathcal{N}(0, I_{D \times D})$ and a sequence of non-linear kernels $k_{[1:d]}(x, y) \triangleq \langle \varphi_{[1:d]}(x), \varphi_{[1:d]}(y) \rangle$, where the feature map is $\varphi_{[1:d]}(x) \triangleq [\varphi_1(x_1), \varphi_2(x_2), \dots, \varphi_d(x_d)]^\top \in \mathbb{R}^d$. Here is a simple rule for defining $\varphi_j$: if $\mathcal{D}_j = \mathcal{N}(0,1)$, we set $\varphi_j$ to the identity function; if $\mathcal{D}_j = \mathcal{N}^{\mathrm{mix}}_{\sigma_j,1}$, we set $\varphi_j$ to $\phi_{\sigma_j}$. Thus, the problem becomes a kernel regression problem on standard Gaussian data.

We consider the change in the (dimension-normalized) generalization loss as follows:

$$L_{d+1} - L_d = \mathbb{E}\left[\left\|\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+ \begin{bmatrix} x \\ y \end{bmatrix}\right\|^2 - \left\|(A^+)^\top x\right\|^2\right], \qquad \tilde{L}_{d+1} - \tilde{L}_d = \mathbb{E}\left[\left\|\frac{1}{d+1}\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+ \begin{bmatrix} x \\ y \end{bmatrix}\right\|^2 - \left\|\frac{1}{d}(A^+)^\top x\right\|^2\right]. \quad (1)$$

Lemma 3 (Proof in Appendix B.2). Let $A \in \mathbb{R}^{n \times d}$ and $0 \ne b \in \mathbb{R}^n$, and write $P \triangleq AA^+$, $Q \triangleq \frac{bb^\top}{\|b\|^2}$, and $z \triangleq \frac{b^\top(I - P)b}{\|b\|^2}$. If $z \ne 0$ and the columnwise partitioned matrix $[A, b]$ has linearly independent columns, we have

$$\begin{aligned}
\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+
&= \left[\left(I - \frac{bb^\top}{\|b\|^2}\right)\left(I + \frac{AA^+bb^\top}{\|b\|^2 - b^\top AA^+ b}\right)(A^+)^\top,\ \frac{(I - AA^+)b}{\|b\|^2 - b^\top AA^+ b}\right] \\
&= \left[(I - Q)\left(I + \frac{PQ}{1 - \operatorname{tr}(PQ)}\right)(A^+)^\top,\ \frac{(I - P)b}{b^\top(I - P)b}\right] \\
&= \left[(I - Q)\left(I + \frac{PQ}{z}\right)(A^+)^\top,\ \frac{(I - P)b}{b^\top(I - P)b}\right].
\end{aligned}$$

In our construction of $\mathcal{D}$, the components $\mathcal{D}_j$ are all continuous distributions. The matrix $I - P$ is an orthogonal projection matrix and therefore $\operatorname{rank}(I - P) = n - d$. As a result, it holds almost surely that $b \ne 0$, $z \ne 0$, and $[A, b]$ has linearly independent columns. Thus the assumptions of Lemma 3 are satisfied almost surely. In the sequel, we assume that these assumptions are always fulfilled.

Lemma 4 (Proof in Appendix B.3). Let $P$ and $z$ be defined as in Lemma 3, where $b = [b_1, \dots, b_n]^\top \in \mathbb{R}^n$. If $y, b_1, \dots, b_n \overset{\mathrm{iid}}{\sim} \mathcal{N}^{\mathrm{mix}}_{\sigma,1}$, we have $\mathbb{E}[1/z] = O_{n,d,\sigma}(1)$ and $\mathbb{E}\left[y^2 / b^\top(I - P)b\right] = O_{n,d,\sigma}(1)$.
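Theorem 5 provides an upper bound for the following quantity

$$\mathbb{E}_{b,y}\left[\left\|\frac{1}{d+1}\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+ \begin{bmatrix} x \\ y \end{bmatrix}\right\|^2 - \left\|\frac{1}{d}(A^+)^\top x\right\|^2 \,\middle|\, A, x\right]$$

if $b_1, \dots, b_n, y$ are i.i.d. according to $\mathcal{N}(0,1)$ or $\mathcal{N}^{\mathrm{mix}}_{\sigma,1}$. This quantity is similar to the difference $\tilde{L}_{d+1} - \tilde{L}_d$ between the dimension-normalized generalization losses, but with the expectation taken only over $b$ and $y$.

Theorem 5 (Proof in Appendix B.4). Conditioned on $A$ and $x$, the following statements hold:

(a) If $d + 2 < \sqrt{2n}$ and $b_1, \dots, b_n, y \overset{\mathrm{iid}}{\sim} \mathcal{N}(0,1)$, we have

$$\mathbb{E}_{b,y}\left[\left\|\frac{1}{d+1}\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+ \begin{bmatrix} x \\ y \end{bmatrix}\right\|^2 - \left\|\frac{1}{d}(A^+)^\top x\right\|^2\right] < \frac{d - \left\|(A^+)^\top x\right\|^2\left(2n - (d+2)^2\right)}{d(d+1)^2(n - d - 2)} . \quad (2)$$

(b) If $d + 2 < n$ and $b_1, \dots, b_n, y \overset{\mathrm{iid}}{\sim} \mathcal{N}^{\mathrm{mix}}_{\sigma,1}$, we have

$$\mathbb{E}_{b,y}\left\|\frac{1}{d+1}\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+ \begin{bmatrix} x \\ y \end{bmatrix}\right\|^2 \le \left\|\frac{1}{d}(A^+)^\top x\right\|^2 + O_{n,d,\sigma}(1) .$$

Corollary 6. If $b_1, \dots, b_n, y \overset{\mathrm{iid}}{\sim} \mathcal{N}^{\mathrm{mix}}_{\sigma,1}$, then by taking expectation over all random variables, we have

$$\mathbb{E}\left\|\frac{1}{d+1}\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+ \begin{bmatrix} x \\ y \end{bmatrix}\right\|^2 = O_{n,d,\sigma}\left(\mathbb{E}\left\|\frac{1}{d}(A^+)^\top x\right\|^2\right) .$$

We will use Theorem 5 in two different ways. The first way is presented in Corollary 6. We would like to show inductively (on $d$) that $\tilde{L}_d$ is finite for every $d$. Provided that we are able to guarantee a finite $\tilde{L}_1$, Corollary 6 implies that $\tilde{L}_d$ is finite for every $d$ if the components are always sampled from $\mathcal{N}(0,1)$ or $\mathcal{N}^{\mathrm{mix}}_{\sigma,1}$.

Alternatively, we can use Theorem 5 to create a descent, i.e., make $\tilde{L}_{d+1} < \tilde{L}_d$. In light of (2), to make the left-hand side negative, we need $d - \|(A^+)^\top x\|^2(2n - (d+2)^2) < 0$, which is equivalent to

$$\left\|\frac{1}{d}(A^+)^\top x\right\|^2 > \frac{1}{d\left(2n - (d+2)^2\right)} .$$

Once we take expectation over $A$ and $x$, we need the above inequality to hold in expectation in order to create a descent, i.e.,

$$\tilde{L}_d > \frac{1}{d\left(2n - (d+2)^2\right)} .$$

Provided that $\tilde{L}_d$ can be made sufficiently large, letting $\tilde{L}_d$ satisfy the above inequality and then adding an additional $\mathcal{N}(0,1)$ entry will lead to $\tilde{L}_{d+1} < \tilde{L}_d$. Making $\tilde{L}_d$ large, in turn, can be achieved by adding an entry sampled from $\mathcal{N}^{\mathrm{mix}}_{\sigma,1}$ when the data dimension increases from $d - 1$ to $d$ in the previous step.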
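Indeed, Theorem 7 shows that adding a $\mathcal{N}^{\mathrm{mix}}_{\sigma,1}$ feature can increase the loss by an arbitrary amount.

Theorem 7 (Proof in Appendix B.5). For any $C > 0$ and $\mathbb{E}\|(A^+)^\top x\|^2 < +\infty$, there exists a $\sigma > 0$ such that if $b_1, \dots, b_n, y \overset{\mathrm{iid}}{\sim} \mathcal{N}^{\mathrm{mix}}_{\sigma,1}$, we have

$$\mathbb{E}\left[\left\|\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+ \begin{bmatrix} x \\ y \end{bmatrix}\right\|^2 - \left\|(A^+)^\top x\right\|^2\right] > C , \qquad \mathbb{E}\left[\left\|\frac{1}{d+1}\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+ \begin{bmatrix} x \\ y \end{bmatrix}\right\|^2 - \left\|\frac{1}{d}(A^+)^\top x\right\|^2\right] > C .$$

We are now ready to prove Theorem 2.

Proof of Theorem 2. We construct $\mathcal{D}$ inductively. Let $\mathcal{D}_1 = \mathcal{N}(0,1)$. When $d = 1$, we have $A = [x_1[1:d], \dots, x_n[1:d]]^\top = [x_{1,1}, \dots, x_{n,1}]^\top \in \mathbb{R}^n$, which is a column vector. Therefore, $A^+ = \frac{A^\top}{\|A\|^2}$. As a result, we get

$$\tilde{L}_1 = \mathbb{E}\left\|(A^+)^\top x\right\|^2 = \mathbb{E}\left[\frac{|x|^2}{\|A\|^2}\right] = \frac{1}{n - 2} ,$$

where $x \sim \mathcal{N}(0,1)$ and $\|A\|^2 \sim \chi^2(n)$. Since we will set $\mathcal{D}_j$ (for $j \ge 2$) to either $\mathcal{N}(0,1)$ or $\mathcal{N}^{\mathrm{mix}}_{\sigma,1}$, by Corollary 6, we have $\tilde{L}_{j+1} = O_{n,j,\sigma_{j+1}}(\tilde{L}_j)$. By induction, we obtain that $\tilde{L}_j$ is finite for all $1 \le j \le D$.

We define $d_0 \triangleq 0$. Assume that we have determined the distributions $\mathcal{D}_1, \dots, \mathcal{D}_{d_j + 1}$, where $0 \le j < K$. We set $\mathcal{D}_{d_j + 2}, \dots, \mathcal{D}_{d_{j+1} - 1}$ to $\mathcal{N}(0,1)$. For $\mathcal{D}_{d_{j+1}}$, by Theorem 7, we pick $\sigma_{d_{j+1}}$ such that if $\mathcal{D}_{d_{j+1}} = \mathcal{N}^{\mathrm{mix}}_{\sigma_{d_{j+1}},1}$, we have

$$\tilde{L}_{d_{j+1}} > \max\left\{\tilde{L}_{d_{j+1} - 1},\ \frac{1}{d_{j+1}\left(2n - (d_{j+1} + 2)^2\right)}\right\} . \quad (3)$$

Next, we set $\mathcal{D}_{d_{j+1} + 1} = \mathcal{N}(0,1)$. Taking the expectation of (2) in Theorem 5 over all random variables, we have

$$\tilde{L}_{d_{j+1} + 1} - \tilde{L}_{d_{j+1}} \le \frac{d_{j+1} - d_{j+1}^2\,\tilde{L}_{d_{j+1}}\left(2n - (d_{j+1} + 2)^2\right)}{d_{j+1}(d_{j+1} + 1)^2(n - d_{j+1} - 2)} < 0 ,$$

where the last inequality is due to (3). So far we have constructed a local maximum at $d_{j+1}$. By induction, we conclude that a local maximum occurs at every $d_j$.

Remark 4. From Remark 2 and the proof of Theorem 2 it is clear that $\mathcal{D} = \mathcal{D}_1 \times \cdots \times \mathcal{D}_D$ is a product distribution. The construction in the proof also shows that the generalization curve is actually determined by the specific choice of the $\mathcal{D}_i$'s. Note that permuting the order of the $\mathcal{D}_i$'s is equivalent to changing the order in which the features are revealed (i.e., permuting the entries of the data $x_i$'s). Therefore, given the same data points $x_1, \dots, x_n \in \mathbb{R}^D$, we can create many different generalization curves simply by changing the order of the feature-revealing process.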
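The product-distribution construction of Remark 2 and the inverse transform map $\phi_\sigma$ of Remark 3 can be sketched in code. This is a minimal illustration under stated assumptions: the helper names (`sample_mix`, `mix_cdf`, `phi_sigma`, `reveal_features`) are hypothetical, the inverse cdf is computed by plain bisection rather than any method from the paper, and the bracket used is only reliable for moderate $|x|$.

```python
import math
import numpy as np

def sample_mix(rng, size, sigma, mu=1.0):
    """Draw from N^mix_{sigma,mu}: a uniform center in {-mu, 0, mu} plus N(0, sigma^2) noise."""
    centers = rng.choice([-mu, 0.0, mu], size=size)
    return centers + sigma * rng.standard_normal(size)

def std_normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def mix_cdf(t, sigma, mu=1.0):
    """cdf of N^mix_{sigma,mu}, the average of three Gaussian cdfs."""
    return sum(std_normal_cdf((t - m) / sigma) for m in (-mu, 0.0, mu)) / 3.0

def phi_sigma(x, sigma, mu=1.0, iters=100):
    """Inverse transform map F_mix^{-1}(F_normal(x)), located by bisection."""
    target = std_normal_cdf(x)
    lo, hi = -mu - 10.0 * sigma - 1.0, mu + 10.0 * sigma + 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mix_cdf(mid, sigma, mu) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def reveal_features(rng, n, spec):
    """Build an n x D data matrix; spec[j] is None for N(0,1) or a sigma for N^mix_{sigma,1}."""
    cols = [rng.standard_normal(n) if s is None else sample_mix(rng, n, s) for s in spec]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
X = reveal_features(rng, n=8, spec=[None, 0.1, None])  # Gaussian, mixture, Gaussian
print(X.shape, phi_sigma(0.7, sigma=0.3))
```

Reordering the entries of `spec` corresponds to permuting the $\mathcal{D}_i$'s, i.e., changing the order in which features are revealed.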

4. OVERPARAMETERIZED REGIME

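In this section, we study the multiple descent phenomenon in the overparameterized regime. Note that as stated in Section 2, we consider the minimum-norm solution here. As stated in the following theorem, we require $d \ge n + 8$, which means $d$ starts at roughly the same order as $n$. In other words, the result covers almost the entire spectrum of the overparameterized regime.

Theorem 8 (Overparameterized regime). Let $n < D - 9$. Given any sequence $\Delta_{n+8}, \Delta_{n+9}, \dots, \Delta_{D-1}$ where $\Delta_d \in \{\uparrow, \downarrow\}$, there exists a distribution $\mathcal{D}$ such that for every $n + 8 \le d \le D - 1$, we have

$$L_{d+1} \begin{cases} > L_d, & \text{if } \Delta_d = \uparrow \\ < L_d, & \text{if } \Delta_d = \downarrow \end{cases}, \qquad \tilde{L}_{d+1} \begin{cases} > \tilde{L}_d, & \text{if } \Delta_d = \uparrow \\ < \tilde{L}_d, & \text{if } \Delta_d = \downarrow \end{cases} .$$

In Theorem 8, the sequence $\Delta_{n+8}, \Delta_{n+9}, \dots, \Delta_{D-1}$ is just used to specify the increasing/decreasing behavior of the $L_d$ sequence for $d > n + 8$. Compared to Theorem 2 for the underparameterized regime, where one is able to fully control the ascents but only partially control the descents, Theorem 8 indicates that one is able to fully control both ascents and descents in the overparameterized regime by placing an ascent/descent wherever one desires. An illustration is given in Fig. 3.

Lemma 10 establishes finite expectation for several random variables. These finite expectation results are necessary for Theorem 11 and Theorem 12 to hold. Technically, they are the dominating random variables needed in Lebesgue's dominated convergence theorem. Lemma 10 indicates that to guarantee these finite expectations, it suffices to set the first $n + 8$ distributions to the standard normal distribution and then set $\mathcal{D}_{n+8}, \dots, \mathcal{D}_D$ to either a Gaussian or a Gaussian mixture distribution. In fact, in Theorem 11 and Theorem 12, we always add a Gaussian distribution or a Gaussian mixture.

Lemma 10 (Proof in Appendix C.2). Let $\mathcal{D} = \mathcal{D}_1 \times \cdots \times \mathcal{D}_D$ be a product distribution where (a) $\mathcal{D}_d = \mathcal{N}(0,1)$ if $d = 1, \dots, n + 8$; and (b) $\mathcal{D}_d$ is either $\mathcal{N}(0, \sigma_d^2)$ or $\mathcal{N}^{\mathrm{mix}}_{\sigma_d, \mu_d}$ for $d > n + 8$. Let $\mathcal{D}_{[1:d]}$ denote $\mathcal{D}_1 \times \cdots \times \mathcal{D}_d$. Then, for the training and test data drawn from $\mathcal{D}_{[1:d]}$, we have

$$\begin{aligned}
&\mathbb{E}\left[\left\|(A^+)^\top x\right\|^2\right] < +\infty , \qquad \mathbb{E}\left[\lambda^2_{\max}\left((AA^\top)^{-1}\right)\right] < +\infty , \\
&\mathbb{E}\left[\lambda_{\max}\left((AA^\top)^{-1}\right)\left\|(A^+)^\top x\right\|^2\right] < +\infty , \qquad \mathbb{E}\left[\lambda^2_{\max}\left((AA^\top)^{-1}\right)\left\|(A^+)^\top x\right\|^2\right] < +\infty .
\end{aligned} \quad (4)$$

Theorem 11 shows that in order to have $L_{d+1} < L_d$ and $\tilde{L}_{d+1} < \tilde{L}_d$, it suffices to add a Gaussian feature.

Theorem 11 (Proof in Appendix C.3). If $\mathbb{E}[\|(A^\top A)^+ x\|^2] > 0$ and all equations in (4) hold, there exists $\sigma > 0$ such that if $y, b_1, \dots, b_n \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, \sigma^2)$, we have

$$L_{d+1} - L_d = \mathbb{E}\left\|\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+ \begin{bmatrix} x \\ y \end{bmatrix}\right\|^2 - \mathbb{E}\left\|(A^+)^\top x\right\|^2 < 0 , \qquad \tilde{L}_{d+1} - \tilde{L}_d = \mathbb{E}\left\|\frac{1}{d+1}\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+ \begin{bmatrix} x \\ y \end{bmatrix}\right\|^2 - \mathbb{E}\left\|\frac{1}{d}(A^+)^\top x\right\|^2 < 0 .$$

Theorem 12 shows that adding a Gaussian mixture feature can make $L_{d+1} > L_d$ and $\tilde{L}_{d+1} > \tilde{L}_d$.

Theorem 12 (Proof in Appendix C.4). Assume $\mathbb{E}\|(A^+)^\top x\|^2 < +\infty$. For any $C > 0$, there exist $\mu, \sigma > 0$ such that if $y, b_1, \dots, b_n \overset{\mathrm{iid}}{\sim} \mathcal{N}^{\mathrm{mix}}_{\sigma,\mu}$, we have

$$L_{d+1} - L_d = \mathbb{E}\left\|\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+ \begin{bmatrix} x \\ y \end{bmatrix}\right\|^2 - \mathbb{E}\left\|(A^+)^\top x\right\|^2 > C , \qquad \tilde{L}_{d+1} - \tilde{L}_d = \mathbb{E}\left\|\frac{1}{d+1}\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+ \begin{bmatrix} x \\ y \end{bmatrix}\right\|^2 - \mathbb{E}\left\|\frac{1}{d}(A^+)^\top x\right\|^2 > C .$$

The proof of Theorem 8 immediately follows from Theorem 11 and Theorem 12.

Proof of Theorem 8. We construct the product distribution $\mathcal{D} = \prod_{d=1}^{D} \mathcal{D}_d$. We set $\mathcal{D}_d = \mathcal{N}(0,1)$ for $d = 1, \dots, n + 8$. For $n + 8 < d \le D$, $\mathcal{D}_d$ is either $\mathcal{N}(0, \sigma_d^2)$ or $\mathcal{N}^{\mathrm{mix}}_{\sigma_d, \mu_d}$ depending on $\Delta_d$ being either $\downarrow$ or $\uparrow$.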

First we show that for each step $d$, the assumption $\mathbb{E}[\|(A^\top A)^+ x\|^2] > 0$ of Theorem 11 is satisfied. If $\mathbb{E}[\|(A^\top A)^+ x\|^2] = 0$,

5. CONCLUSION

Our work proves that the expected risk of linear regression can manifest multiple descents when the number of features increases and sample size is fixed. This is carried out through an algorithmic construction of a feature-revealing process where the newly revealed feature follows either a Gaussian distribution or a Gaussian mixture distribution. Notably, the construction also enables us to control local maxima in the underparameterized regime and control ascents/descents freely in the overparameterized regime. Overall, this allows us to design the generalization curve away from the interpolation threshold. We conjecture that the same multiple-descent generalization curve can occur in non-linear neural networks and we humbly suggest that entities with infinite computational powers investigate this phenomenon.

A FURTHER RELATED WORK

Our work is directly related to the recent line of research in the theoretical understanding of the double descent (Belkin et al., 2019a; Hastie et al., 2019; Xu & Hsu, 2019; Mei & Montanari, 2019) and the multiple descent phenomenon (Liang et al., 2020) . Here we briefly discuss some other work that is closely related to this paper. Least Square Regression. In this paper we focus on the least square linear regression with no regularization. For the regularized least square regression, De Vito et al. (2005) proposed a selection procedure for the regularization parameter. Advani & Saxe (2017) analyzed the generalization of neural networks with mean squared error under the asymptotic regime where both the sample size and model complexity tend to infinity. Richards et al. (2020) proved for least square regression in the asymptotic regime that as the dimension-to-sample-size ratio d/n grows, an additional peak can occur in both the variance and bias due to the covariance structure of the features. As a comparison, in this paper the sample size is fixed and the model complexity increases. Rudi & Rosasco (2017) studied kernel ridge regression and gave an upper bound on the number of the random features to reach certain risk level. Our result shows that there exists a natural setting where by manipulating the random features one can control the risk curve. Over-Parameterization and Interpolation. The double descent occurs when the model complexity reaches and increases beyond the interpolation threshold. Most previous works focused on proving an upper bound or optimal rate for the risk. Caponnetto & De Vito (2007) gave the optimal rate for least square ridge regression via careful selection of the regularization parameter. Belkin et al. (2019b) showed that the optimal rate for risk can be achieved by a model that interpolates the training data. In a series of work on kernel regression with regularization parameter tending to zero (a.k.a. 
kernel ridgeless regression), Rakhlin & Zhai (2019) showed that the risk is bounded away from zero when the data dimension is fixed with respect to the sample size. Liang & Rakhlin (2019) then considered the case when $d \asymp n$ and proved a risk upper bound that can be small given favorable data and kernel assumptions. Instead of giving a bound, our paper presents an exact computation of the risk in the cases of underparameterized and overparameterized linear regression, and proves the existence of the multiple descent phenomenon. Wyner et al. (2017) analyzed AdaBoost and Random Forest from the perspective of interpolation. There has also been a line of work on wide neural networks (Arora et al., 2019a; b; c; Du et al., 2019; Allen-Zhu et al., 2019; Wei et al., 2019; Cao & Gu, 2019; Advani et al., 2020; Chen & Xu, 2020; Zou et al., 2020).

Sample-wise Double Descent and Non-monotonicity.

There has also been recent development beyond the model-complexity double-descent phenomenon. For example, regarding sample-wise non-monotonicity, Nakkiran et al. (2019) empirically observed epoch-wise double descent and sample-wise non-monotonicity for neural networks. Chen et al. (2020) and Min et al. (2020) identified and proved sample-wise double descent under the adversarial training setting, and Javanmard et al. (2020) discovered double descent under adversarially robust linear regression. Loog et al. (2019) showed that empirical risk minimization can lead to sample-wise non-monotonicity in the standard linear model setting under various loss functions including the absolute loss and the squared loss, which covers the range from classification to regression. We also refer the reader to their discussion of the earlier work on non-monotonicity of generalization curves. Dar et al. (2020) demonstrated the double descent curve of the generalization errors of subspace fitting problems.

B PROOFS FOR UNDERPARAMETRIZED REGIME

B.1 PROOF OF THEOREM 1

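Proof. We follow the notation convention in (1):

$$L_{d+1} - L_d = \mathbb{E}\left[\left\|\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+ \begin{bmatrix} x \\ y \end{bmatrix}\right\|^2 - \left\|(A^\top)^+ x\right\|^2\right] .$$

Recall that $d < n$ and that the matrix $B' \triangleq \begin{bmatrix} A^\top \\ b^\top \end{bmatrix}$ is of size $(d + 1) \times n$. Both matrices $B'$ and $B \triangleq A^\top$ are fat matrices. As a result, if $x' \triangleq \begin{bmatrix} x \\ y \end{bmatrix}$, we have

$$\left\|B'^+ x'\right\|^2 = \min_{z : B'z = x'} \|z\|^2 , \qquad \left\|B^+ x\right\|^2 = \min_{z : Bz = x} \|z\|^2 .$$

Since $\{z \mid B'z = x'\} \subseteq \{z \mid Bz = x\}$, we get $\|B'^+ x'\|^2 \ge \|B^+ x\|^2$. Therefore, we obtain $L_{d+1} \ge L_d$. The second part of the theorem, which says that for any $C > 0$ there exists a distribution such that $L_{d+1} - L_d > C$, follows directly from Theorem 7.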
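The inclusion argument above holds for every realization of the data, not only in expectation, so it can be checked numerically on a single draw. The following is a minimal sketch; the helper `min_norm_sq` and the specific sizes are illustrative choices, and standard Gaussian data is used only for convenience.

```python
import numpy as np

def min_norm_sq(B, target):
    """Squared norm of the minimum-norm solution z of B z = target (B fat, full row rank)."""
    z = np.linalg.pinv(B) @ target
    return float(z @ z)

rng = np.random.default_rng(0)
n, d = 12, 4
A = rng.standard_normal((n, d))
x = rng.standard_normal(d)
b = rng.standard_normal(n)   # values of the newly revealed feature on the training set
y = rng.standard_normal()    # value of the newly revealed feature on the test point

before = min_norm_sq(A.T, x)                                # ||B^+ x||^2
after = min_norm_sq(np.vstack([A.T, b]), np.append(x, y))   # ||B'^+ x'||^2
print(before, after)
```

Adding the constraint $b^\top z = y$ can only shrink the feasible set, so `after` is never smaller than `before` (up to floating-point error).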

B.2 PROOF OF LEMMA 3

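Proof. By (Baksalary & Baksalary, 2007, Theorem 1), we have

$$\begin{bmatrix} A^\top \\ b^\top \end{bmatrix}^+ = \left[(I - Q)A\left(A^\top(I - Q)A\right)^{-1},\ \frac{(I - P)b}{b^\top(I - P)b}\right] .$$

Define $r \triangleq A^\top b \in \mathbb{R}^d$. Since $A$ has linearly independent columns, the Gram matrix $G = A^\top A$ is non-singular. The Sherman-Morrison formula gives

$$\left(A^\top(I - Q)A\right)^{-1} = \left(A^\top A - \frac{rr^\top}{\|b\|^2}\right)^{-1} = G^{-1} + \frac{G^{-1}rr^\top G^{-1}}{\|b\|^2 - r^\top G^{-1} r} = G^{-1} + \frac{G^{-1}rb^\top(A^+)^\top}{\|b\|^2 - r^\top G^{-1} r} ,$$

where we use the facts $r = A^\top b$ and $AG^{-1} = (A^+)^\top$ in the last equality. Therefore, we deduce

$$\begin{aligned}
A\left(A^\top(I - Q)A\right)^{-1}
&= AG^{-1} + \frac{AG^{-1}rb^\top(A^+)^\top}{\|b\|^2 - r^\top G^{-1} r} \\
&= (A^+)^\top + \frac{AG^{-1}A^\top bb^\top(A^+)^\top}{\|b\|^2 - r^\top G^{-1} r} \\
&= \left(I + \frac{AA^+bb^\top}{\|b\|^2 - r^\top G^{-1} r}\right)(A^+)^\top \\
&= \left(I + \frac{PQ}{1 - \frac{r^\top G^{-1} r}{\|b\|^2}}\right)(A^+)^\top .
\end{aligned}$$

Observe that

$$1 - \frac{r^\top G^{-1} r}{\|b\|^2} = 1 - \frac{b^\top A(A^\top A)^{-1}A^\top b}{\|b\|^2} = 1 - \frac{b^\top Pb}{\|b\|^2} = z .$$

Therefore, we obtain the desired expression.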
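The block formula of Lemma 3 can be verified numerically against a direct pseudoinverse computation. The sketch below assumes a random Gaussian $A$ and $b$ (so the rank conditions hold almost surely); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 9, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

I = np.eye(n)
A_pinv = np.linalg.pinv(A)            # A^+, shape (d, n)
P = A @ A_pinv                        # orthogonal projection onto the column space of A
Q = np.outer(b, b) / (b @ b)          # orthogonal projection onto span{b}
z = b @ (I - P) @ b / (b @ b)

left = (I - Q) @ (I + P @ Q / z) @ A_pinv.T   # first block, shape (n, d)
right = (I - P) @ b / (b @ (I - P) @ b)       # last column, length n
formula = np.column_stack([left, right])      # shape (n, d + 1)

direct = np.linalg.pinv(np.vstack([A.T, b]))  # pseudoinverse of the (d + 1) x n matrix
print(np.max(np.abs(formula - direct)))
```

The printed maximum deviation should be at the level of floating-point round-off.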

B.3 PROOF OF LEMMA 4

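Lemma 13 shows that a noncentral $\chi^2$ distribution first-order stochastically dominates a central $\chi^2$ distribution with the same degrees of freedom. It will be needed in the proof of Lemma 4.

Lemma 13. Assume that random variables $X \sim \chi^2(k, \lambda)$ and $Y \sim \chi^2(k)$, where $\lambda > 0$. For any $c > 0$, we have $P(X \ge c) > P(Y \ge c)$. In other words, the random variable $X$ (first-order) stochastically dominates $Y$.

Proof. Let $Y_1, X_2, \dots, X_k \overset{iid}{\sim} N(0, 1)$ and $X_1 \sim N(\sqrt{\lambda}, 1)$, with all these random variables jointly independent. Then $X \triangleq \sum_{i=1}^{k} X_i^2 \sim \chi^2(k, \lambda)$ and $Y \triangleq Y_1^2 + \sum_{i=2}^{k} X_i^2 \sim \chi^2(k)$. Since $X$ and $Y$ share the summands $X_2^2, \dots, X_k^2$, it is enough to compare $X_1^2$ with $Y_1^2$. Writing $\mu \triangleq \sqrt{\lambda}$ and $F_c(\mu) \triangleq P(|N(\mu, 1)| \ge c)$, we have
$$F_c(\mu) = 1 - \frac{1}{\sqrt{2\pi}} \int_{-c}^{c} \exp\left(-\frac{(x-\mu)^2}{2}\right) dx = 1 - \frac{1}{\sqrt{2\pi}} \int_{-c-\mu}^{c-\mu} \exp\left(-\frac{x^2}{2}\right) dx,$$
and thus
$$\frac{dF_c(\mu)}{d\mu} = \frac{1}{\sqrt{2\pi}} \left[\exp\left(-\frac{(c-\mu)^2}{2}\right) - \exp\left(-\frac{(c+\mu)^2}{2}\right)\right] > 0.$$
This shows $P(|N(\mu, 1)| \ge c) > P(|N(0, 1)| \ge c)$ and we are done.

Proof of Lemma 4. Since $b_i \overset{iid}{\sim} N^{mix}_{\sigma,1}$, we can rewrite $b = u + w$, where $w \sim N(0, \sigma^2 I_n)$ and the entries of $u$ satisfy $u_i \overset{iid}{\sim} \mathrm{Unif}(\{-1, 0, 1\})$. Furthermore, $u$ and $w$ are independent. Note that for any fixed $n$ and $d$, the support of $u$ is finite and its cardinality only depends on $n$. Therefore, we only need to show that, conditioning on $u$, the expectation over $w$ is $O_{n,d,\sigma}(1)$. In other words, for any fixed $u$, we want to show
$$E_w[1/z \mid u] = O_{n,d,\sigma}(1) \quad \text{and} \quad E_w\left[\frac{y^2}{b^\top (I-P) b} \,\Big|\, u\right] = O_{n,d,\sigma}(1).$$
Note that since $y^2/\sigma^2$ is first-order stochastically dominated by $\chi^2(1, 1)$, we have $E[y^2 \mid u] = E[y^2] \le \sigma^2 E[\chi^2(1, 1)] = 2\sigma^2$. Therefore, it remains to show
$$E_w[1/z \mid u] = O_{n,d,\sigma}(1) \quad \text{and} \quad E_w\left[\frac{1}{b^\top (I-P) b} \,\Big|\, u\right] = O_{n,d,\sigma}(1).$$
Note that
$$\frac{1}{z} = \frac{b^\top I b}{b^\top (I-P) b} = 1 + \frac{(u+w)^\top P (u+w)}{(u+w)^\top (I-P)(u+w)}.$$
Since $P$ is an orthogonal projection, there exists an orthogonal transformation $O$ depending only on $P$ such that $(u+w)^\top P (u+w) = [O(u+w)]^\top D_d [O(u+w)]$, where $D_d = \mathrm{diag}([1, \dots, 1, 0, \dots, 0])$ with $d$ diagonal entries equal to 1 and the others equal to 0. We denote $\tilde{u} = O(u)$, which is fixed (as $u$ and $O$ are fixed), and $\tilde{w} = O(w) \sim N(0, \sigma^2 I_n)$.

It follows that
$$\frac{1}{z} = 1 + \frac{(\tilde{u}+\tilde{w})^\top D_d (\tilde{u}+\tilde{w})}{(\tilde{u}+\tilde{w})^\top (I-D_d)(\tilde{u}+\tilde{w})} = 1 + \frac{\sum_{i=1}^{d} (\tilde{u}_i+\tilde{w}_i)^2}{\sum_{i=d+1}^{n} (\tilde{u}_i+\tilde{w}_i)^2} = 1 + \frac{\sum_{i=1}^{d} (\tilde{u}_i+\tilde{w}_i)^2/\sigma^2}{\sum_{i=d+1}^{n} (\tilde{u}_i+\tilde{w}_i)^2/\sigma^2}.$$
Observe that
$$\sum_{i=1}^{d} (\tilde{u}_i+\tilde{w}_i)^2/\sigma^2 \sim \chi^2\left(d, \sum_{i=1}^{d} \tilde{u}_i^2\right), \qquad \sum_{i=d+1}^{n} (\tilde{u}_i+\tilde{w}_i)^2/\sigma^2 \sim \chi^2\left(n-d, \sum_{i=d+1}^{n} \tilde{u}_i^2\right),$$
and that these two quantities are independent. It follows that
$$E\left[\sum_{i=1}^{d} (\tilde{u}_i+\tilde{w}_i)^2/\sigma^2 \,\Big|\, u\right] = d + \sum_{i=1}^{d} \tilde{u}_i^2.$$
By Lemma 13, the denominator $\sum_{i=d+1}^{n} (\tilde{u}_i+\tilde{w}_i)^2/\sigma^2$ first-order stochastically dominates $\chi^2(n-d)$. Therefore, we have
$$E\left[\frac{1}{\sum_{i=d+1}^{n} (\tilde{u}_i+\tilde{w}_i)^2/\sigma^2} \,\Big|\, u\right] \le E\left[\frac{1}{\chi^2(n-d)}\right] = \frac{1}{n-d-2}.$$
Putting the numerator and denominator together yields
$$E\left[\frac{1}{z} \,\Big|\, u\right] \le 1 + \frac{d + \sum_{i=1}^{d} \tilde{u}_i^2}{n-d-2} \le 1 + \frac{d + \sqrt{d}}{n-d-2} = O_{n,d,\sigma}(1).$$
Similarly, we have
$$E\left[\frac{1}{b^\top (I-P) b} \,\Big|\, u\right] = E\left[\frac{1}{[O(u+w)]^\top (I-D_d)[O(u+w)]} \,\Big|\, u\right] = E\left[\frac{1/\sigma^2}{\sum_{i=d+1}^{n} (\tilde{u}_i+\tilde{w}_i)^2/\sigma^2} \,\Big|\, u\right] \le \frac{1}{\sigma^2} E\left[\frac{1}{\chi^2(n-d)}\right] = \frac{1}{\sigma^2} \cdot \frac{1}{n-d-2} = O_{n,d,\sigma}(1).$$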
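The two facts the proof of Lemma 4 relies on, namely $E[1/\chi^2(k)] = 1/(k-2)$ and the dominance stated in Lemma 13, can be sanity-checked by simulation. The following is a minimal sketch, not part of the original argument; the function name, sample size, and seed are illustrative assumptions.

```python
import numpy as np


def inverse_chi2_means(k, nonc, n_samples=400_000, seed=0):
    """Monte Carlo estimates of E[1/Y] and E[1/X].

    Y ~ chi^2(k) is central, X ~ chi^2(k, nonc) is noncentral.
    The function name and defaults are illustrative, not from the paper.
    """
    rng = np.random.default_rng(seed)
    central = rng.chisquare(k, size=n_samples)
    noncentral = rng.noncentral_chisquare(k, nonc, size=n_samples)
    return (1.0 / central).mean(), (1.0 / noncentral).mean()


if __name__ == "__main__":
    k = 10
    central_mean, noncentral_mean = inverse_chi2_means(k, nonc=3.0)
    # Closed form for the central case is 1 / (k - 2).
    print("E[1/chi2(k)] estimate:", central_mean, "closed form:", 1.0 / (k - 2))
    # Dominance (Lemma 13) implies the noncentral inverse mean is smaller.
    print("E[1/chi2(k, 3)] estimate:", noncentral_mean)
```

Because $t \mapsto 1/t$ is decreasing on $(0, \infty)$, stochastic dominance of the noncentral variable translates into a smaller expected inverse, which is the direction used in the proof.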

B.4 PROOF OF THEOREM 5

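Proof. First, we rewrite the expression as follows:
$$\left\| \frac{1}{d+1} ([A, b]^+)^\top \begin{bmatrix} x \\ y \end{bmatrix} \right\|^2 - \left\| \frac{1}{d} (A^+)^\top x \right\|^2 = \frac{1}{(d+1)^2} \left\| (I-Q)(I+PQ/z)(A^+)^\top x + \frac{(I-P)b}{b^\top (I-P) b}\, y \right\|^2 - \frac{1}{d^2} \left\| (A^+)^\top x \right\|^2, \tag{5}$$
where $P$, $Q$, $z$ are defined in Lemma 3. Since $y$ has mean 0 and is independent of the other random variables, the cross term vanishes under expectation over $b$ and $y$:
$$E_{b,y}\left[\left\langle (I-Q)(I+PQ/z)(A^+)^\top x,\ \frac{(I-P)b}{b^\top (I-P) b}\, y \right\rangle\right] = 0,$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product. Therefore, taking the expectation of (5) over $b$ and $y$ yields
$$E_{b,y}\left[\left\| \frac{1}{d+1} ([A, b]^+)^\top \begin{bmatrix} x \\ y \end{bmatrix} \right\|^2 - \left\| \frac{1}{d} (A^+)^\top x \right\|^2\right] \tag{6}$$
$$= E_{b,y}\left[\frac{1}{(d+1)^2} \left\| (I-Q)(I+PQ/z)(A^+)^\top x \right\|^2 - \frac{1}{d^2} \left\| (A^+)^\top x \right\|^2 + \frac{1}{(d+1)^2} \left\| \frac{(I-P)b}{b^\top (I-P) b}\, y \right\|^2\right] \tag{7}$$
$$= \frac{1}{(d+1)^2} E_{b,y}\left[\left\| (I-Q)(I+PQ/z)(A^+)^\top x \right\|^2 - \left(1 + \frac{1}{d}\right)^2 \left\| (A^+)^\top x \right\|^2 + \left\| \frac{(I-P)b}{b^\top (I-P) b}\, y \right\|^2\right].$$
We simplify the third term. Recall that $I - P = I - AA^+$ is an orthogonal projection matrix and thus idempotent, so that $\|(I-P)b\|^2 = b^\top (I-P) b$ and the third term equals $y^2 / (b^\top (I-P) b)$. Thus we have
$$E_{b,y}\left[\left\| \frac{1}{d+1} ([A, b]^+)^\top \begin{bmatrix} x \\ y \end{bmatrix} \right\|^2 - \left\| \frac{1}{d} (A^+)^\top x \right\|^2\right] \tag{10}$$
$$= \frac{1}{(d+1)^2} E_{b,y}\left[\left\| (I-Q)(I+PQ/z)(A^+)^\top x \right\|^2 - \left(1 + \frac{1}{d}\right)^2 \left\| (A^+)^\top x \right\|^2 + \frac{y^2}{b^\top (I-P) b}\right]. \tag{11}$$
We consider the first and second terms. We write $v = (A^+)^\top x$ and recall $z = \frac{b^\top (I-P) b}{\|b\|^2}$. The sum of the first and second terms equals
$$\left\| (I-Q)(I+PQ/z) v \right\|^2 - \left(1 + \frac{1}{d}\right)^2 \|v\|^2 = -v^\top (M + \delta I) v, \tag{12}$$
where $\delta = \frac{2}{d} + \frac{1}{d^2}$ and
$$M \triangleq Q - \frac{PQ + QP}{z} + \left(\frac{2}{z} - \frac{1}{z^2}\right) QPQ + \frac{QPQPQ}{z^2}.$$
The rank of $M$ is at most 2. To see this, we rewrite $M$ in the following way:
$$M = \underbrace{Q\left(I - \frac{P}{z} + \left(\frac{2}{z} - \frac{1}{z^2}\right) PQ + \frac{PQPQ}{z^2}\right)}_{M_1} + \underbrace{\left(-\frac{PQ}{z}\right)}_{M_2}.$$
Notice that $\mathrm{rank}(M_1) \le \mathrm{rank}(Q)$, $\mathrm{rank}(M_2) \le \mathrm{rank}(Q)$, and $\mathrm{rank}(Q) = 1$. It follows that $\mathrm{rank}(M) \le \mathrm{rank}(M_1) + \mathrm{rank}(M_2) = 2$, so $M$ has at least $n - 2$ zero eigenvalues. We claim that $M$ has two non-zero eigenvalues and that they are $1 - 1/z < 0$ and $1$. Since $\mathrm{rank}(PQ) \le \mathrm{rank}(Q) = 1$ and $\mathrm{tr}(PQ) = \frac{b^\top P b}{\|b\|^2} = 1 - z$, the matrix $PQ$ has a unique non-zero eigenvalue $1 - z$. Let $u \ne 0$ denote the corresponding eigenvector, so that $PQu = (1-z)u$. Since $u \in \mathrm{im}\, P$ and $P$ is a projection, we have $Pu = u$. Therefore we can verify that
$$Mu = \left(1 - \frac{1}{z}\right) u.$$

To show that the other non-zero eigenvalue of $M$ is 1, we compute the trace of $M$:
$$\mathrm{tr}(M) = \mathrm{tr}(Q) - \frac{2\,\mathrm{tr}(PQ)}{z} + \left(\frac{2}{z} - \frac{1}{z^2}\right) \mathrm{tr}(PQ) + \frac{\mathrm{tr}((PQ)^2)}{z^2} = 2 - \frac{1}{z},$$
where we use the facts that $\mathrm{tr}(Q) = 1$, $\mathrm{tr}(PQ) = 1 - z$, and
$$\mathrm{tr}((PQ)^2) = \frac{\mathrm{tr}(P b b^\top P b b^\top)}{\|b\|^4} = \frac{\mathrm{tr}((b^\top P b)(b^\top P b))}{\|b\|^4} = (1-z)^2.$$
We have shown that $M$ has eigenvalue $1 - 1/z$ and that $M$ has at most two non-zero eigenvalues. Therefore, the other non-zero eigenvalue is $\mathrm{tr}(M) - (1 - 1/z) = 1$. We are now in a position to upper bound (12) as follows:
$$-v^\top (M + \delta I) v \le -(1 - 1/z + \delta) \|v\|^2 < -(1 - 1/z + 2/d) \|v\|^2.$$
Putting together all three terms of the change in the dimension-normalized generalization loss yields
$$E_{b,y}\left[\left\| \frac{1}{d+1} ([A, b]^+)^\top \begin{bmatrix} x \\ y \end{bmatrix} \right\|^2 - \left\| \frac{1}{d} (A^+)^\top x \right\|^2\right] \le \frac{1}{(d+1)^2} E_{b,y}\left[-(1 - 1/z + 2/d) \|v\|^2 + \frac{y^2}{b^\top (I-P) b}\right].$$
For $b_1, \dots, b_n, y \overset{iid}{\sim} N(0, 1)$, where $E_{b,y}[1/z] = 1 + \frac{d}{n-d-2}$ and $E_{b,y}\left[\frac{y^2}{b^\top (I-P) b}\right] = \frac{1}{n-d-2}$, this gives
$$E_{b,y}\left[\left\| \frac{1}{d+1} ([A, b]^+)^\top \begin{bmatrix} x \\ y \end{bmatrix} \right\|^2 - \left\| \frac{1}{d} (A^+)^\top x \right\|^2\right] \le \frac{1}{(d+1)^2} \left[\left(\frac{d}{n-d-2} - \frac{2}{d}\right) \|v\|^2 + \frac{1}{n-d-2}\right] < \frac{d - \|v\|^2 \left(2n - (d+2)^2\right)}{d (d+1)^2 (n-d-2)}.$$
For $b_1, \dots, b_n, y \overset{iid}{\sim} N^{mix}_{\sigma,1}$, Lemma 4 implies that $E_{b,y}[1/z] \le O_{n,d,\sigma}(1)$ and $E_{b,y}\left[\frac{y^2}{b^\top (I-P) b}\right] \le O_{n,d,\sigma}(1)$. Therefore, we conclude that
$$E_{b,y}\left\| \frac{1}{d+1} ([A, b]^+)^\top \begin{bmatrix} x \\ y \end{bmatrix} \right\|^2 \le \left\| \frac{1}{d} (A^+)^\top x \right\|^2 + O_{n,d,\sigma}(1).$$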
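The spectral claim about $M$ (two non-zero eigenvalues, $1 - 1/z$ and $1$, and $n - 2$ zeros) lends itself to a quick numerical check. The sketch below assumes the form $M = Q - (PQ + QP)/z + (2/z - 1/z^2)QPQ + QPQPQ/z^2$ from the proof; the helper name and the random test matrices are illustrative assumptions.

```python
import numpy as np


def eigenvalues_of_M(A, b):
    """Return the sorted eigenvalues of M together with z.

    P projects onto the column space of A, Q onto span(b), and
    z = b^T (I - P) b / ||b||^2, as in the proof of Theorem 5.
    The helper name is hypothetical.
    """
    n = A.shape[0]
    P = A @ np.linalg.pinv(A)
    Q = np.outer(b, b) / (b @ b)
    z = b @ (np.eye(n) - P) @ b / (b @ b)
    QPQ = Q @ P @ Q
    M = Q - (P @ Q + Q @ P) / z + (2 / z - 1 / z**2) * QPQ + (QPQ @ P @ Q) / z**2
    M = (M + M.T) / 2  # remove floating-point asymmetry before eigvalsh
    return np.linalg.eigvalsh(M), z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((12, 4))
    b = rng.standard_normal(12)
    eigs, z = eigenvalues_of_M(A, b)
    print("smallest eigenvalue:", eigs[0], "expected 1 - 1/z:", 1 - 1 / z)
    print("largest eigenvalue:", eigs[-1], "expected 1")
```

Since $0 < z < 1$ for generic $A$ and $b$, the eigenvalue $1 - 1/z$ is negative and appears first in the ascending order returned by `eigvalsh`.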

B.5 PROOF OF THEOREM 7

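Proof. We start from (11). Taking expectation over all random variables gives
$$E\left[\left\| \frac{1}{d+1} ([A, b]^+)^\top \begin{bmatrix} x \\ y \end{bmatrix} \right\|^2 - \left\| \frac{1}{d} (A^+)^\top x \right\|^2\right] = \frac{1}{(d+1)^2} E\left[\left\| (I-Q)(I+PQ/z)(A^+)^\top x \right\|^2 - \left(1 + \frac{1}{d}\right)^2 \left\| (A^+)^\top x \right\|^2 + \frac{y^2}{b^\top (I-P) b}\right]$$
$$\ge \frac{1}{(d+1)^2} \left[-\left(1 + \frac{1}{d}\right)^2 E\left\| (A^+)^\top x \right\|^2 + E\left[\frac{y^2}{\sum_{i=1}^{n} b_i^2}\right]\right].$$
Our strategy is to choose $\sigma$ so that $E\left[\frac{y^2}{\sum_{i=1}^{n} b_i^2}\right]$ is sufficiently large. This is indeed possible, as we now show. Define independent random variables $u \sim \mathrm{Unif}(\{-1, 0, 1\})$ and $w \sim N(0, \sigma^2)$. Since $y$ has the same distribution as $u + w$, we have
$$E[y^2] = E[(u+w)^2] = E[u^2] + E[w^2] \ge \frac{2}{3}.$$
On the other hand,
$$E\left[\frac{1}{\sum_{i=1}^{n} b_i^2}\right] \ge P\left(\max_i |b_i| \le \sigma\right) E\left[\frac{1}{\sum_{i=1}^{n} b_i^2} \,\Big|\, \max_i |b_i| \le \sigma\right] = \left[P(|b_1| \le \sigma)\right]^n E\left[\frac{1}{\sum_{i=1}^{n} b_i^2} \,\Big|\, \max_i |b_i| \le \sigma\right]$$
$$\ge \left[\frac{1}{3\sqrt{2\pi\sigma^2}} \int_{-\sigma}^{\sigma} \exp\left(-\frac{t^2}{2\sigma^2}\right) dt\right]^n \frac{1}{n\sigma^2} \ge \frac{1}{5^n n \sigma^2}.$$

By our choice of $D_{[1:n+8]}$, the matrix $(A_{n+8} A_{n+8}^\top)^{-1}$ is an inverse Wishart matrix of size $n \times n$ with $(n + 8)$ degrees of freedom, and thus has finite fourth moment (see, for example, Theorem 4.1 in (von Rosen, 1988)). It then follows that
$$E[\lambda^4_{\max}(G_{n+8})] \le \mathrm{tr}\left(E[(A_{n+8} A_{n+8}^\top)^{-4}]\right) < +\infty.$$
For the inductive step, assume $E[\lambda^4_{\max}(G_d)] < +\infty$ for some $d \ge n + 8$. We claim that $\lambda_{\max}(G_{d+1}) \le \lambda_{\max}(G_d)$, or equivalently, $\lambda_{\min}(A_d A_d^\top) \le \lambda_{\min}(A_{d+1} A_{d+1}^\top)$. Indeed, this follows from the fact that
$$A_d A_d^\top \preceq A_d A_d^\top + b b^\top = A_{d+1} A_{d+1}^\top$$
under the Loewner order, where $b \in \mathbb{R}^{n \times 1}$ is the $(d+1)$-th column of $A$. Therefore, we have $E[\lambda^4_{\max}(G_{d+1})] \le E[\lambda^4_{\max}(G_d)]$ and, by induction, we conclude that $E[\lambda^4_{\max}(G)] < +\infty$ for all $d \ge n + 8$.

Now we proceed to show $E\|v\|^4 < +\infty$. We have
$$\|v\|^4 = \left\| (AA^\top)^{-1} A x \right\|^4 \le \left\| (AA^\top)^{-1} A \right\|^4_{op} \cdot \|x\|^4,$$
where $\|\cdot\|_{op}$ denotes the $\ell_2 \to \ell_2$ operator norm. Note that
$$\left\| (AA^\top)^{-1} A \right\|^4_{op} = \lambda^2_{\max}\left(\left((AA^\top)^{-1} A\right)^\top (AA^\top)^{-1} A\right) = \lambda^2_{\max}\left(A^\top (AA^\top)^{-2} A\right) = \lambda_{\max}\left(\left(A^\top (AA^\top)^{-2} A\right)^2\right),$$
where the last equality uses the fact that $A^\top (AA^\top)^{-2} A$ is positive semidefinite. Moreover, we deduce
$$\left\| (AA^\top)^{-1} A \right\|^4_{op} = \lambda_{\max}\left(A^\top (AA^\top)^{-3} A\right) \le \mathrm{tr}\left(A^\top (AA^\top)^{-3} A\right) = \mathrm{tr}\left((AA^\top)^{-3} AA^\top\right) = \mathrm{tr}\left((AA^\top)^{-2}\right).$$

Using the fact that $A_d A_d^\top \preceq A_{d+1} A_{d+1}^\top$ established above, induction gives $(AA^\top)^{-2} \preceq (A_{n+8} A_{n+8}^\top)^{-2}$. It follows that
$$E\left\| (AA^\top)^{-1} A \right\|^4_{op} \le E\,\mathrm{tr}\left((A_{n+8} A_{n+8}^\top)^{-2}\right) = \mathrm{tr}\, E\left[(A_{n+8} A_{n+8}^\top)^{-2}\right] < +\infty,$$
where again we use the fact that the inverse Wishart matrix $(A_{n+8} A_{n+8}^\top)^{-1}$ has finite second moment. Next, we demonstrate $E\|x\|^4 < +\infty$. Recall that every $D_i$ is either a Gaussian or a Gaussian mixture distribution. Therefore, every entry of $x$ has a subgaussian tail, and thus $E\|x\|^4 < +\infty$. Together with (13) and the fact that $x$ and $A$ are independent, we conclude that
$$E\|v\|^4 \le E\left\| (AA^\top)^{-1} A \right\|^4_{op} \cdot E\|x\|^4 < +\infty.$$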

C.3 PROOF OF THEOREM 11

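Proof. The randomness comes from $A$, $x$, $y$ and $b$. We first condition on $A$ and $x$ being fixed. Let $G \triangleq (AA^\top)^{-1} \in \mathbb{R}^{n \times n}$ and $u \triangleq \frac{b^\top G}{1 + b^\top G b} \in \mathbb{R}^{1 \times n}$. Define $v \triangleq (A^+)^\top x$, $r \triangleq 1 + b^\top G b$, and $H \triangleq b b^\top$. We compute the left-hand side but take the expectation over only $y$ for the moment:
$$E_y\left\| ([A, b]^+)^\top \begin{bmatrix} x \\ y \end{bmatrix} \right\|^2 - \left\| (A^+)^\top x \right\|^2 = E_y\left\| (I - bu)^\top v + u^\top y \right\|^2 - \|v\|^2$$
$$= \left\| (I - bu)^\top v \right\|^2 + E_y\left\| u^\top y \right\|^2 - \|v\|^2 \qquad (E[y] = 0)$$
$$= \left\| (I - bu)^\top v \right\|^2 + E_y[y^2] \frac{\|Gb\|^2}{r^2} - \|v\|^2.$$
Let us first consider the first and third terms of the above equation:
$$\left\| (I - bu)^\top v \right\|^2 - \|v\|^2 = v^\top \left[(I - bu)(I - bu)^\top - I\right] v = -v^\top \left(bu + u^\top b^\top - b u u^\top b^\top\right) v = -v^\top \left(\frac{HG + GH}{r} - \frac{H G^2 H}{r^2}\right) v.$$
Write $G = V \Lambda V^\top$, where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n) \in \mathbb{R}^{n \times n}$ is a diagonal matrix ($\lambda_i > 0$) and $V \in \mathbb{R}^{n \times n}$ is an orthogonal matrix. Recall $b \sim N(0, \sigma^2 I_n)$. Therefore $w \triangleq V^\top b \sim N(0, \sigma^2 I_n)$. Taking the expectation over $b$, we have
$$E_b\left[\frac{HG + GH}{r}\right] = E_b\left[V\, \frac{V^\top b b^\top V \Lambda + \Lambda V^\top b b^\top V}{1 + b^\top V \Lambda V^\top b}\, V^\top\right] = V\, E_w\left[\frac{w w^\top \Lambda + \Lambda w w^\top}{1 + w^\top \Lambda w}\right] V^\top.$$
Let $R \triangleq E_w\left[\frac{w w^\top \Lambda + \Lambda w w^\top}{1 + w^\top \Lambda w}\right]$. We have
$$R_{ii} = E_w\left[\frac{2\lambda_i w_i^2}{1 + \sum_{k=1}^{n} \lambda_k w_k^2}\right] = \sigma^2 E_{\nu \sim N(0, I_n)}\left[\frac{2\lambda_i \nu_i^2}{1 + \sigma^2 \sum_{k=1}^{n} \lambda_k \nu_k^2}\right] > 0,$$
and if $i \ne j$,
$$R_{ij} = E_w\left[\frac{(\lambda_i + \lambda_j) w_i w_j}{1 + \sum_{k=1}^{n} \lambda_k w_k^2}\right].$$
Notice that for any $j$, the distribution of $w$ is unchanged if we replace $w_j$ by $-w_j$. As a result,
$$R_{ij} = E_w\left[\frac{(\lambda_i + \lambda_j) w_i (-w_j)}{1 + \sum_{k=1}^{n} \lambda_k w_k^2}\right] = -R_{ij}.$$
Thus the matrix $R$ is diagonal and
$$R = 2\sigma^2 E_{\nu \sim N(0, I_n)}\left[\frac{\Lambda\, \mathrm{diag}(\nu)^2}{1 + \sigma^2 \nu^\top \Lambda \nu}\right].$$
Thus we get
$$E_{b,A}\left[\frac{HG + GH}{r}\right] = 2\sigma^2 E_{\nu \sim N(0, I_n), A}\left[\frac{G V \mathrm{diag}(\nu)^2 V^\top}{1 + \sigma^2 \nu^\top \Lambda \nu}\right].$$
Moreover, by the monotone convergence theorem, we deduce
$$\lim_{\sigma \to 0^+} E_{\nu \sim N(0, I_n), A, x}\left[-v^\top \frac{G V \mathrm{diag}(\nu)^2 V^\top}{1 + \sigma^2 \nu^\top \Lambda \nu} v\right] = E_{\nu \sim N(0, I_n), A, x}\left[-v^\top G V \mathrm{diag}(\nu)^2 V^\top v\right] = E[-v^\top G v].$$
It follows that as $\sigma \to 0^+$,
$$E\left[-v^\top \frac{HG + GH}{r} v\right] \sim -2\sigma^2 E[v^\top G v] = -2\sigma^2 E\left[v^\top (AA^\top)^{-1} v\right] = -2\sigma^2 E\left[\left\| (A^\top A)^+ x \right\|^2\right].$$
Moreover, by (4), we have
$$E\left[v^\top (AA^\top)^{-1} v\right] \le E\left[\lambda_{\max}\left((AA^\top)^{-1}\right) \left\| (A^+)^\top x \right\|^2\right] < +\infty.$$
Next, we study the term $H G^2 H / r^2$:
$$E_{b,A}\left[\frac{H G^2 H}{r^2}\right] = E_{b,A}\left[V\, \frac{V^\top b b^\top V \Lambda^2 V^\top b b^\top V}{(1 + b^\top V \Lambda V^\top b)^2}\, V^\top\right] = E_{w \sim N(0, \sigma^2 I_n), A}\left[V\, \frac{w w^\top \Lambda^2 w w^\top}{(1 + w^\top \Lambda w)^2}\, V^\top\right] = \sigma^4 E_{\nu \sim N(0, I_n), A}\left[V\, \frac{\nu \nu^\top \Lambda^2 \nu \nu^\top}{(1 + \sigma^2 \nu^\top \Lambda \nu)^2}\, V^\top\right].$$

Again, by the monotone convergence theorem, we have
$$\lim_{\sigma \to 0^+} E_{\nu \sim N(0, I_n), A, x}\left[v^\top V \frac{\nu \nu^\top \Lambda^2 \nu \nu^\top}{(1 + \sigma^2 \nu^\top \Lambda \nu)^2} V^\top v\right] = E_{\nu \sim N(0, I_n), A, x}\left[v^\top V \nu \nu^\top \Lambda^2 \nu \nu^\top V^\top v\right]$$
$$= E_{A,x}\left[v^\top V \left(2\Lambda^2 + I_n \sum_{i=1}^{n} \lambda_i^2\right) V^\top v\right] = E\left[v^\top \left(2G^2 + \mathrm{tr}(G^2) I_n\right) v\right].$$
It follows that as $\sigma \to 0^+$,
$$E_{b,A,x}\left[v^\top \frac{H G^2 H}{r^2} v\right] \sim \sigma^4 E\left[v^\top \left(2G^2 + \mathrm{tr}(G^2) I_n\right) v\right] = \sigma^4 E\left[2\left\| (AA^\top)^{-1} v \right\|^2 + \mathrm{tr}\left((AA^\top)^{-2}\right) \|v\|^2\right].$$
Moreover, by (4), we have
$$E\left[2\left\| (AA^\top)^{-1} v \right\|^2 + \mathrm{tr}\left((AA^\top)^{-2}\right) \|v\|^2\right] \le (n + 2) E\left[\lambda^2_{\max}\left((AA^\top)^{-1}\right) \left\| (A^+)^\top x \right\|^2\right] < +\infty.$$
We apply a similar method to the term $\frac{\|Gb\|^2}{r^2}$. We deduce
$$\frac{\|Gb\|^2}{r^2} = \frac{b^\top G^2 b}{(1 + b^\top G b)^2} = \frac{b^\top V \Lambda^2 V^\top b}{(1 + b^\top V \Lambda V^\top b)^2}.$$
It follows that
$$E\left[\frac{\|Gb\|^2}{r^2}\right] = E_{w \sim N(0, \sigma^2 I_n), A}\left[\frac{w^\top \Lambda^2 w}{(1 + w^\top \Lambda w)^2}\right] = \sigma^2 E_{\nu \sim N(0, I_n), A}\left[\frac{\nu^\top \Lambda^2 \nu}{(1 + \sigma^2 \nu^\top \Lambda \nu)^2}\right].$$
The monotone convergence theorem implies
$$\lim_{\sigma \to 0^+} E_{\nu \sim N(0, I_n), A}\left[\frac{\nu^\top \Lambda^2 \nu}{(1 + \sigma^2 \nu^\top \Lambda \nu)^2}\right] = E[\nu^\top \Lambda^2 \nu] = E[\mathrm{tr}(G^2)].$$
Thus we get, as $\sigma \to 0^+$,
$$E\left[E_y[y^2] \frac{\|Gb\|^2}{r^2}\right] \sim \sigma^4 E[\mathrm{tr}(G^2)],$$
where $E[\mathrm{tr}(G^2)] \le n E\left[\lambda^2_{\max}\left((AA^\top)^{-1}\right)\right] < +\infty$. Putting all three terms together, we have, as $\sigma \to 0^+$,
$$L_{d+1} - L_d \sim -2\sigma^2 E\left[\left\| (A^\top A)^+ x \right\|^2\right].$$
Therefore, there exists $\sigma > 0$ such that $L_{d+1} - L_d < 0$. Furthermore, since $L_{d+1} \ge 0$, we deduce
$$\tilde{L}_{d+1} - \tilde{L}_d \le \frac{1}{d^2} (L_{d+1} - L_d) < 0.$$

Proof. Again we first condition on $A$ and $x$ being fixed. Let $G \triangleq (AA^\top)^{-1} \in \mathbb{R}^{n \times n}$ and $u \triangleq \frac{b^\top G}{1 + b^\top G b} \in \mathbb{R}^{1 \times n}$ as defined in Lemma 9. We also define the following variables: $v \triangleq (A^+)^\top x$ and $r \triangleq 1 + b^\top G b$. We compute $\tilde{L}_{d+1} - \tilde{L}_d$ but take the expectation over only $y$ for the moment:
$$E_y\left\| \frac{1}{d+1} ([A, b]^+)^\top \begin{bmatrix} x \\ y \end{bmatrix} \right\|^2 - \left\| \frac{1}{d} (A^+)^\top x \right\|^2 = \frac{1}{(d+1)^2} \left[E_y\left\| (I - bu)^\top v + u^\top y \right\|^2 - (1 + 1/d)^2 \|v\|^2\right]$$
$$= \frac{1}{(d+1)^2} \left[\left\| (I - bu)^\top v \right\|^2 + E_y\left\| u^\top y \right\|^2 - (1 + 1/d)^2 \|v\|^2\right] \qquad (E[y] = 0)$$
$$= \frac{1}{(d+1)^2} \left[\left\| (I - bu)^\top v \right\|^2 + E_y[y^2] \frac{\|Gb\|^2}{r^2} - (1 + 1/d)^2 \|v\|^2\right]. \tag{14}$$
Our strategy is to make $E\left[y^2 \frac{\|Gb\|^2}{r^2}\right]$ arbitrarily large. To this end, by the independence of $y$ and $b$, we have
$$E_{y,b}\left[y^2 \frac{\|Gb\|^2}{r^2}\right] = E_y[y^2]\, E_b\left[\frac{\|Gb\|^2}{r^2}\right].$$
By the definition of $N^{mix}_{\sigma,\mu}$, with probability $2/3$, $y$ is sampled from either $N(\mu, \sigma^2)$ or $N(-\mu, \sigma^2)$, which implies $E[y^2] \ge \frac{1}{3}\mu^2$. For each $b_i$, we have $P(|b_i| \in [\sigma, 2\sigma]) \ge \frac{1}{3} \times \frac{1}{4}$.

Also note that $G$ is positive definite. It follows that
$$E_b\left[\frac{\|Gb\|^2}{r^2}\right] = E_b\left[\frac{\|Gb\|^2}{(1 + b^\top G b)^2}\right] \ge E_b\left[\frac{(\lambda_{\min}(G) \|b\|)^2}{(1 + \lambda_{\max}(G) \|b\|^2)^2}\right] \ge \frac{1}{12^n} \cdot \frac{\lambda^2_{\min}(G)\, n \sigma^2}{(1 + 4\lambda_{\max}(G)\, n \sigma^2)^2}.$$
Altogether we have
$$E_{y,b}\left[y^2 \frac{\|Gb\|^2}{r^2}\right] \ge \frac{1}{3 \cdot 12^n} \cdot \frac{n \lambda^2_{\min}(G) \mu^2 \sigma^2}{(1 + 4n \lambda_{\max}(G) \sigma^2)^2}.$$
Let $\mu = 1/\sigma^2$ and we have
$$\lim_{\sigma \to 0^+} E\left[y^2 \frac{\|Gb\|^2}{r^2}\right] \ge \lim_{\sigma \to 0^+} E_{A,x} E_{y,b}\left[\frac{1}{3 \cdot 12^n} \cdot \frac{n \lambda^2_{\min}(G)}{\sigma^2 (1 + 4n \lambda_{\max}(G) \sigma^2)^2}\right] = E_{A,x} E_{y,b}\left[\lim_{\sigma \to 0^+} \frac{1}{3 \cdot 12^n} \cdot \frac{n \lambda^2_{\min}(G)}{\sigma^2 (1 + 4n \lambda_{\max}(G) \sigma^2)^2}\right] = +\infty,$$
where we switch the order of expectation and limit using the monotone convergence theorem. Taking the full expectation of (14) over $A$, $x$, $b$ and $y$, and using the assumption that $E\|v\|^2 < +\infty$, we have
$$\tilde{L}_{d+1} - \tilde{L}_d = \frac{1}{(d+1)^2} \left[E_{A,x,b}\left\| (I - bu)^\top v \right\|^2 + E\left[y^2 \frac{\|Gb\|^2}{r^2}\right] - (1 + 1/d)^2 E_{A,x}\|v\|^2\right] \to +\infty$$
as $\sigma \to 0^+$. In addition, we have, as $\sigma \to 0^+$,
$$L_{d+1} - L_d \ge d^2 (\tilde{L}_{d+1} - \tilde{L}_d) \to +\infty.$$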
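The proof of Theorem 11 uses the Gaussian fourth-moment identity $E[\nu \nu^\top \Lambda^2 \nu \nu^\top] = 2\Lambda^2 + \mathrm{tr}(\Lambda^2) I_n$ without derivation. A short entrywise check, offered as a reader's sketch and not as part of the original argument, for $\nu \sim N(0, I_n)$ and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$:

```latex
\begin{aligned}
\left(E[\nu \nu^\top \Lambda^2 \nu \nu^\top]\right)_{ij}
  &= E\Big[\nu_i \nu_j \sum_{k=1}^{n} \lambda_k^2 \nu_k^2\Big]
   = \sum_{k=1}^{n} \lambda_k^2\, E[\nu_i \nu_j \nu_k^2]. \\
i \ne j:\quad & E[\nu_i \nu_j \nu_k^2] = 0 \ \text{for every } k
   \quad (\text{odd in } \nu_i \text{ or in } \nu_j). \\
i = j:\quad & E[\nu_i^2 \nu_k^2] = 1 \ (k \ne i), \qquad E[\nu_i^4] = 3, \\
  & \Rightarrow\ \sum_{k=1}^{n} \lambda_k^2\, E[\nu_i^2 \nu_k^2]
   = \sum_{k \ne i} \lambda_k^2 + 3\lambda_i^2
   = \mathrm{tr}(\Lambda^2) + 2\lambda_i^2 .
\end{aligned}
```

Collecting the diagonal entries gives $2\Lambda^2 + \mathrm{tr}(\Lambda^2) I_n$, and conjugating by $V$ turns this into $2G^2 + \mathrm{tr}(G^2) I_n$.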



Figure 2: Illustration of multiple descent for the dimension-normalized generalization loss $\tilde{L}_d$ as a function of the dimension $d$ (legend: $N^{mix}_{\sigma,1}$ feature, $\sigma = 0.2$). A local maximum occurs at $d$ if $\tilde{L}_{d-1} < \tilde{L}_d > \tilde{L}_{d+1}$. The triplet $\tilde{L}_{d-1}, \tilde{L}_d, \tilde{L}_{d+1}$ then forms an ascent/descent, which is marked by the shaded area. Local maxima are marked by the dotted lines. Adding a new feature with a Gaussian mixture distribution increases the loss, while adding one with a univariate Gaussian distribution decreases the loss. Therefore, a Gaussian mixture feature followed by a Gaussian feature creates one ascent/descent.

Returning to Theorem 2, let us discuss how to construct such a distribution $D$ inductively. We fix $d$. Again, denote the first $d$ features of $x_{test}$ by $x \triangleq x_{test}[1:d]$. Consider adding an additional component to the training data $x_1[1:d], \dots, x_n[1:d]$ and to the test data $x$, so that the dimension $d$ is incremented by 1. Let $b_i \in \mathbb{R}$ denote the additional component that we add to the vector $x_i$ (so that the new vector is $[x_i[1:d]^\top, b_i]^\top$). Similarly, let $y \in \mathbb{R}$ denote the additional component that we add to the vector $x$. We form the column vector $b = [b_1, \dots, b_n]^\top \in \mathbb{R}^n$ that collects all additional components added to the training data.
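The inductive construction can be explored numerically by appending one feature column at a time and estimating the loss by Monte Carlo. The sketch below is an illustration under stated assumptions: it takes the loss at dimension $d$ to be $E\|(A^+)^\top x\|^2$ (the quantity whose increments are analyzed in the appendix), draws Gaussian features from $N(0, 1)$, and draws mixture features from the equal-weight mixture of $N(-\mu, \sigma^2)$, $N(0, \sigma^2)$, $N(\mu, \sigma^2)$. The function names, the labels `"gauss"` and `"mix"`, and all defaults are hypothetical.

```python
import numpy as np


def sample_mixture(rng, size, sigma, mu=1.0):
    """Equal-weight mixture of N(-mu, sigma^2), N(0, sigma^2), N(mu, sigma^2)."""
    centers = rng.choice([-mu, 0.0, mu], size=size)
    return centers + sigma * rng.standard_normal(size)


def estimate_loss(kinds, n, sigma=0.2, trials=2000, seed=0):
    """Monte Carlo estimate of E||(A^+)^T x||^2 with d = len(kinds) features.

    kinds[j] is "gauss" (N(0, 1) feature) or "mix" (mixture feature).
    The n rows of A and the test point x share the same feature law.
    """
    rng = np.random.default_rng(seed)
    d = len(kinds)
    total = 0.0
    for _ in range(trials):
        data = np.empty((n + 1, d))
        for j, kind in enumerate(kinds):
            if kind == "gauss":
                data[:, j] = rng.standard_normal(n + 1)
            else:
                data[:, j] = sample_mixture(rng, n + 1, sigma)
        A, x = data[:n], data[n]
        v = np.linalg.pinv(A).T @ x  # v = (A^+)^T x
        total += v @ v
    return total / trials


if __name__ == "__main__":
    base = ["gauss"] * 5
    for extra in ("gauss", "mix"):
        kinds = base + [extra]
        loss = estimate_loss(kinds, n=30)
        print(extra, "raw:", loss, "dimension-normalized:", loss / len(kinds) ** 2)
```

Comparing the dimension-normalized values for the two choices of the appended feature gives an empirical view of the ascent/descent mechanism; the estimates are noisy, so the number of trials may need to be increased before drawing conclusions.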

Note that the components $b_1, \dots, b_n, y$ are i.i.d. Lemma 3 relates the pseudo-inverse of $[A, b]$ to that of $A$.

Lemma 3 (Proof in Appendix B.2). Let $A \in \mathbb{R}^{n \times d}$ and $0 \ne b \in \mathbb{R}^{n \times 1}$, where $n \ge d + 1$. Additionally, let $P = AA^+$ and $Q = bb^+ = \frac{b b^\top}{\|b\|^2}$, and define $z \triangleq \frac{b^\top (I-P) b}{\|b\|^2}$

Lemma 4 (Proof in Appendix B.3). Assume that $d$, $n > d + 2$, and $P$ are fixed, where $P \in \mathbb{R}^{n \times n}$ is an orthogonal projection matrix of rank $d$. Define $z \triangleq \frac{b^\top (I-P) b}{\|b\|^2}$

, where $O_{n,d,\sigma}(1)$ is a constant that depends only on $n$, $d$, and $\sigma$.

Corollary 6. Assume $d + 2 < \sqrt{2n}$. If either $b_1, \dots, b_n, y \overset{iid}{\sim} N(0, 1)$ or $b_1, \dots, b_n, y$

Figure 3: Illustration of the multiple descent phenomenon for the generalization loss $L_d$ (or the dimension-normalized generalization loss $\tilde{L}_d$) versus the data dimension $d$ in the overparameterized regime, starting from $d = n + 8$. One can fully control the generalization curve to increase or decrease as specified by the sequence $\Delta = \{\downarrow, \uparrow, \downarrow, \downarrow, \uparrow, \downarrow, \dots\}$. Adding a new feature with a Gaussian mixture distribution increases the loss, while adding one with a Gaussian distribution decreases the loss.

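the pseudo-inverse of $A$ when $d > n$.

Lemma 9 (Proof in Appendix C.1). Let $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^{n \times 1}$, where $n \le d$. Assume that the matrix $A$ and the columnwise partitioned matrix $B \triangleq [A, b]$ have linearly independent rows. Let $G \triangleq (AA^\top)^{-1} \in \mathbb{R}^{n \times n}$ and $u \triangleq \frac{b^\top G}{1 + b^\top G b} \in \mathbb{R}^{1 \times n}$. We have
$$([A, b]^+)^\top = \left[(I - bu)^\top (A^+)^\top,\ u^\top\right].$$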

we know that $(A^\top A)^+ x = 0$ almost surely. Since $D$ is a continuous distribution, the matrix $A$ has full row rank almost surely. Therefore, $\mathrm{rank}((A^\top A)^+) = \mathrm{rank}(A^\top A) = n$ almost surely. Thus $\dim \ker (A^\top A)^+ = d - n \le d - 1$ almost surely, which implies $x \notin \ker (A^\top A)^+$. In other words, $(A^\top A)^+ x \ne 0$ almost surely. We reach a contradiction. Moreover, by Lemma 10, the assumption $E\|(A^+)^\top x\|^2 < +\infty$ of Theorem 12 is also satisfied. If $\Delta_{d-1} = \downarrow$, by Theorem 11, there exists $\sigma_d > 0$ such that if $D_d = N(0, \sigma_d^2)$, then $L_d < L_{d-1}$ and $\tilde{L}_d < \tilde{L}_{d-1}$. Similarly, if $\Delta_{d-1} = \uparrow$, by Theorem 12, there exist $\sigma_d$ and $\mu_d$ such that $D_d = N^{mix}_{\sigma_d, \mu_d}$ guarantees $L_d > L_{d-1}$ and $\tilde{L}_d > \tilde{L}_{d-1}$.

It suffices to show that $P(X \ge c) > P(Y \ge c)$, or equivalently, $P(|N(\mu, 1)| \ge c) > P(|N(0, 1)| \ge c)$ for all $c > 0$ and $\mu \triangleq \sqrt{\lambda} > 0$. Denote $F_c(\mu) = P(|N(\mu, 1)| \ge c)$ and we have

$$\left\| \frac{(I-P)b}{b^\top (I-P) b}\, y \right\|^2 = \frac{y^2}{(b^\top (I-P) b)^2} \left\| (I-P)b \right\|^2 = \frac{y^2}{b^\top (I-P) b}.$$

$\frac{y^2}{b^\top (I-P) b}$. For $b_1, \dots, b_n, y \overset{iid}{\sim} N(0, 1)$, we have $E[y^2] = 1$. Moreover, $b^\top (I-P) b$ follows a $\chi^2(n-d)$ distribution. Thus $\frac{1}{b^\top (I-P) b}$ follows an inverse-chi-squared distribution with mean $\frac{1}{n-d-2}$. Therefore the expectation is $E\left[\frac{y^2}{b^\top (I-P) b}\right] = \frac{1}{n-d-2}$. Notice that $1/z$ follows a $1 + \frac{d}{n-d} F(d, n-d)$ distribution and thus $E[1/z] = 1 + \frac{d}{n-d-2}$. As a result, we obtain
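The closed form $E[1/z] = 1 + \frac{d}{n-d-2}$ for standard Gaussian $b$ can be checked by simulation. This is a minimal sketch under the assumption that $P$ is the projection onto the column space of an arbitrary full-column-rank $A$; the function name and sample sizes are illustrative.

```python
import numpy as np


def mean_inverse_z(n, d, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[1/z], z = b^T (I - P) b / ||b||^2.

    b ~ N(0, I_n); P projects onto the column space of a random
    n-by-d matrix A (any rank-d orthogonal projection would do).
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    P = A @ np.linalg.pinv(A)
    b = rng.standard_normal((n_samples, n))      # one sample of b per row
    norm_sq = np.einsum("ij,ij->i", b, b)        # ||b||^2 per sample
    proj_sq = np.einsum("ij,ij->i", b @ P, b)    # b^T P b per sample
    z = 1.0 - proj_sq / norm_sq
    return (1.0 / z).mean()


if __name__ == "__main__":
    n, d = 20, 5
    print("estimate:", mean_inverse_z(n, d), "closed form:", 1 + d / (n - d - 2))
```

The estimator has finite variance only when $n - d > 4$, so small values of $n - d$ call for more samples or a different check.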

Assume that every row of $A \in \mathbb{R}^{n \times d}$ and $x \in \mathbb{R}^{d \times 1}$ are i.i.d. and follow $D_{[1:d]}$. For any $d$ such that $n + 8 \le d \le D$, all of the following hold:


Together we have … we have … which completes the proof.

C PROOFS FOR OVERPARAMETRIZED REGIME

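C.1 PROOF OF LEMMA 9

Proof. Since $A$ and $B$ have full row rank, $(AA^\top)^{-1}$ and $(BB^\top)^{-1}$ exist. Therefore we have
$$B^+ = B^\top (BB^\top)^{-1} = \begin{bmatrix} A^\top \\ b^\top \end{bmatrix} (AA^\top + b b^\top)^{-1}.$$
The Sherman-Morrison formula gives
$$(AA^\top + b b^\top)^{-1} = G - \frac{G b b^\top G}{1 + b^\top G b} = G (I - bu).$$
Hence, we deduce
$$B^+ = \begin{bmatrix} A^\top G (I - bu) \\ b^\top G (I - bu) \end{bmatrix} = \begin{bmatrix} A^+ (I - bu) \\ u \end{bmatrix}.$$
Transposing the above equation yields the claimed identity.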
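The pseudo-inverse update of Lemma 9 can be compared against a direct computation. The sketch below assumes the untransposed form $[A, b]^+ = \begin{bmatrix} A^+ (I - bu) \\ u \end{bmatrix}$ with $G = (AA^\top)^{-1}$ and $u = b^\top G / (1 + b^\top G b)$; the function name and test dimensions are illustrative.

```python
import numpy as np


def pinv_after_appending_column(A, b):
    """Pseudo-inverse of [A, b] from quantities involving A only.

    Requires A (n-by-d, n <= d) to have full row rank, so that
    G = (A A^T)^{-1} exists. Returns a (d + 1)-by-n matrix.
    """
    n = A.shape[0]
    G = np.linalg.inv(A @ A.T)
    u = (b @ G) / (1.0 + b @ G @ b)                  # row vector u, shape (n,)
    top = np.linalg.pinv(A) @ (np.eye(n) - np.outer(b, u))
    return np.vstack([top, u])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 7))
    b = rng.standard_normal(4)
    direct = np.linalg.pinv(np.column_stack([A, b]))
    print("max abs difference:", np.abs(pinv_after_appending_column(A, b) - direct).max())
```

The update costs one $n \times n$ inverse that can be reused across candidate columns $b$, which is what makes the one-feature-at-a-time analysis tractable.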

C.2 PROOF OF LEMMA 10

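Proof. Let us first denote $G \triangleq (AA^\top)^{-1}$ and $v \triangleq (A^+)^\top x$. First note that, by the Cauchy-Schwarz inequality, it suffices to show there exists $D$ such that $E[\lambda^4_{\max}(G)] < +\infty$ and $E\|v\|^4 < +\infty$. We define $A_d \in \mathbb{R}^{n \times d}$ to be the submatrix of $A$ that consists of all $n$ rows and the first $d$ columns, and $G_d \triangleq (A_d A_d^\top)^{-1}$. We will prove $E[\lambda^4_{\max}(G)] < +\infty$ by induction. The base step is $d = n + 8$.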

