FOR INTERPOLATING KERNEL MACHINES, MINIMIZING THE NORM OF THE ERM SOLUTION MINIMIZES STABILITY

Abstract

We study the average CV_loo stability of kernel ridge-less regression and derive corresponding risk bounds. We show that the interpolating solution with minimum norm minimizes a bound on CV_loo stability, which in turn is controlled by the condition number of the empirical kernel matrix. The latter can be characterized in the asymptotic regime where both the dimension and cardinality of the data go to infinity. Under the assumption of random kernel matrices, the corresponding test error should be expected to follow a double descent curve.

1. INTRODUCTION

Statistical learning theory studies the learning properties of machine learning algorithms and, more fundamentally, the conditions under which learning from finite data is possible. In this context, classical learning theory focuses on the size of the hypothesis space in terms of different complexity measures, such as combinatorial dimensions, covering numbers, and Rademacher/Gaussian complexities (Shalev-Shwartz & Ben-David, 2014; Boucheron et al., 2005). Another, more recent, approach is based on defining suitable notions of stability with respect to perturbations of the data (Bousquet & Elisseeff, 2001; Kutin & Niyogi, 2002). In this view, the continuity of the process that maps data to estimators is crucial, rather than the complexity of the hypothesis space. Different notions of stability can be considered, depending on the data perturbation and metric considered (Kutin & Niyogi, 2002). Interestingly, the stability and complexity approaches to characterizing the learnability of problems are not at odds with each other, and can be shown to be equivalent, as shown in Poggio et al. (2004) and Shalev-Shwartz et al. (2010). In modern machine learning, overparameterized models, with a larger number of parameters than the size of the training data, have become common. The ability of these models to generalize is well explained by classical statistical learning theory as long as some form of regularization is used in the training process (Bühlmann & Van De Geer, 2011; Steinwart & Christmann, 2008). However, it was recently shown, first for deep networks (Zhang et al., 2017) and more recently for kernel methods (Belkin et al., 2019), that learning is possible in the absence of regularization, i.e., when perfectly fitting/interpolating the data. Much recent work in statistical learning theory has tried to find theoretical ground for this empirical finding.
Since learning using models that interpolate is not exclusive to deep neural networks, we study generalization in the presence of interpolation in the case of kernel methods. We study both linear and kernel least squares problems in this paper.

Our Contributions:

• We characterize the generalization properties of interpolating solutions for linear and kernel least squares problems using a stability approach. While the (uniform) stability properties of regularized kernel methods are well known (Bousquet & Elisseeff, 2001), we study interpolating solutions of the unregularized ("ridgeless") regression problem.

• We obtain an upper bound on the stability of interpolating solutions, and show that this upper bound is minimized by the minimum norm interpolating solution. This also means that among all interpolating solutions, the minimum norm solution has the best test error. In particular, the same conclusion also holds for gradient descent, since it converges to the minimum norm solution in the setting we consider, see e.g. Rosasco & Villa (2015).

• Our stability bounds show that the average stability of the minimum norm solution is controlled by the condition number of the empirical kernel matrix. It is well known that the numerical stability of the least squares solution is governed by the condition number of the associated kernel matrix (see the discussion of why overparametrization is "good" in Poggio et al. (2019)). Our results show that the condition number also controls stability (and hence, test error) in a statistical sense.

Organization: In section 2, we introduce basic ideas in statistical learning and empirical risk minimization, as well as the notation used in the rest of the paper. In section 3, we briefly recall some definitions of stability. In section 4, we study the stability of interpolating solutions to kernel least squares and show that the minimum norm solutions minimize an upper bound on the stability. In section 5, we discuss our results in the context of recent work on high dimensional regression. We conclude in section 6.

2. STATISTICAL LEARNING AND EMPIRICAL RISK MINIMIZATION

We begin by recalling the basic ideas in statistical learning theory. In this setting, X is the space of features, Y is the space of targets or labels, and there is an unknown probability distribution µ on the product space Z = X × Y. In the following, we consider X = R^d and Y = R. The distribution µ is fixed but unknown, and we are given a training set S consisting of n samples (thus |S| = n) drawn i.i.d. from the probability distribution on Z^n, S = (z_i)_{i=1}^n = (x_i, y_i)_{i=1}^n. Intuitively, the goal of supervised learning is to use the training set S to "learn" a function f_S that, evaluated at a new value x_new, should predict the associated value y_new, i.e. y_new ≈ f_S(x_new).

The loss is a function V : F × Z → [0, ∞), where F is the space of measurable functions from X to Y, that measures how well a function performs on a data point. We define a hypothesis space H ⊆ F where algorithms search for solutions. With the above notation, the expected risk of f is defined as I[f] = E_z V(f, z), which is the expected loss on a new sample drawn according to the data distribution µ. In this setting, statistical learning can be seen as the problem of finding an approximate minimizer of the expected risk given a training set S. A classical approach to derive an approximate solution is empirical risk minimization (ERM), where we minimize the empirical risk I_S[f] = (1/n) Σ_{i=1}^n V(f, z_i).

A natural error measure for our ERM solution f_S is the expected excess risk E_S[I[f_S] - min_{f∈H} I[f]]. Another common error measure is the expected generalization error/gap, given by E_S[I[f_S] - I_S[f_S]]. These two error measures are closely related, since the expected excess risk is easily bounded by the expected generalization error (see Lemma 5).
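To make these quantities concrete, here is a minimal sketch (our own toy setup, not from the paper: a one-dimensional linear model with Gaussian data, square loss, and ERM over linear functions f(x) = w·x) that computes the empirical risk of the ERM solution and a Monte Carlo estimate of its expected risk, hence the generalization gap:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: X = R, Y = R, y = 2x + noise; hypothesis space
# H = {f(x) = w*x}; V is the square loss.
n = 30
X = rng.normal(size=n)
y = 2.0 * X + 0.1 * rng.normal(size=n)

# ERM: minimize the empirical risk I_S[f] = (1/n) * sum_i (w*x_i - y_i)^2 over w.
w_erm = (X @ y) / (X @ X)

# Empirical risk of the ERM solution f_S.
empirical_risk = np.mean((w_erm * X - y) ** 2)

# Monte Carlo estimate of the expected risk I[f_S] on fresh samples from mu.
X_new = rng.normal(size=100_000)
y_new = 2.0 * X_new + 0.1 * rng.normal(size=100_000)
expected_risk = np.mean((w_erm * X_new - y_new) ** 2)

# Generalization gap I[f_S] - I_S[f_S] (here for a single draw of S).
generalization_gap = expected_risk - empirical_risk
```

With enough training data both risks concentrate near the noise level, so the gap is small; the point of the paper is what happens when the model instead interpolates.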

2.1. KERNEL LEAST SQUARES AND MINIMUM NORM SOLUTION

The focus of this paper is the kernel least squares problem. We assume the loss function V is the square loss, that is, V(f, z) = (y - f(x))^2. The hypothesis space is assumed to be a reproducing kernel Hilbert space, defined by a positive definite kernel K : X × X → R or an associated feature map Φ : X → H, such that K(x, x') = ⟨Φ(x), Φ(x')⟩_H for all x, x' ∈ X, where ⟨·, ·⟩_H is the inner product in H. In this setting, functions are linearly parameterized, that is, there exists w ∈ H such that f(x) = ⟨w, Φ(x)⟩_H for all x ∈ X.

The ERM problem typically has multiple solutions, one of which is the minimum norm solution:

f†_S = argmin_{f ∈ M} ||f||_H,    M = argmin_{f ∈ H} (1/n) Σ_{i=1}^n (f(x_i) - y_i)^2.    (1)

Here ||·||_H is the norm on H induced by the inner product. The minimum norm solution can be shown to be unique and to satisfy a representer theorem, that is, for all x ∈ X:

f†_S(x) = Σ_{i=1}^n K(x, x_i) c_S[i],    c_S = K† y    (2)

where c_S = (c_S[1], ..., c_S[n]), y = (y_1, ..., y_n) ∈ R^n, K is the n × n matrix with entries K_ij = K(x_i, x_j), i, j = 1, ..., n, and K† is the Moore-Penrose pseudoinverse of K. If we assume n ≤ d and that we have n linearly independent data features, that is, the rank of X is n, then it is possible to show that for many kernels one can replace K† by K^{-1} (see Remark 2). Note that invertibility is necessary and sufficient for interpolation: if K is invertible, then f†_S(x_i) = y_i for all i = 1, ..., n, in which case the training error in (1) is zero.

Remark 1 (Pseudoinverse for underdetermined linear systems) A simple yet relevant example is that of linear functions f(x) = w^T x, which correspond to H = R^d and Φ the identity map. If the rank of X ∈ R^{d×n} is n, then any interpolating solution w_S satisfies w_S^T x_i = y_i for all i = 1, ..., n, and the minimum norm solution, also called the Moore-Penrose solution, is given by (w†_S)^T = y^T X†, where the pseudoinverse takes the form X† = (X^T X)^{-1} X^T.
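A minimal numerical sketch of the minimum norm solution (2), under assumptions of our choosing (an RBF kernel on random distinct points, so that K is invertible, see Remark 2): compute c_S = K† y and check that the solution interpolates the training data.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_kernel(A, B, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), an RBF (hence Mercer) kernel.
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

n, d = 20, 5
X = rng.normal(size=(n, d))   # n distinct points, so K is non-singular
y = rng.normal(size=n)

K = rbf_kernel(X, X)
c_S = np.linalg.pinv(K) @ y   # c_S = K† y, the minimum norm coefficients in (2)

# f†_S(x_i) = sum_j K(x_i, x_j) c_S[j]; invertible K implies interpolation.
train_preds = K @ c_S
max_train_err = np.max(np.abs(train_preds - y))
```

Since K is invertible here, `np.linalg.solve(K, y)` would give the same coefficients; the pseudoinverse form is the one that remains valid when K is singular.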
Remark 2 (Invertibility of translation invariant kernels) Translation invariant kernels are a family of kernel functions given by K(x_1, x_2) = k(x_1 - x_2), where k is an even function on R^d. Translation invariant kernels are Mercer kernels (positive semidefinite) if the Fourier transform of k(·) is non-negative. For Radial Basis Function kernels (K(x_1, x_2) = k(||x_1 - x_2||)) we have the additional property, due to Theorem 2.3 of Micchelli (1986), that for distinct points x_1, x_2, ..., x_n ∈ R^d the kernel matrix K is non-singular and thus invertible.

The above discussion is directly related to regularization approaches.

Remark 3 (Stability and Tikhonov regularization) Tikhonov regularization is used to prevent potentially unstable behavior. In the above setting, it corresponds to replacing Problem (1) by

min_{f ∈ H} (1/n) Σ_{i=1}^n (f(x_i) - y_i)^2 + λ ||f||_H^2,

whose unique solution is given by f^λ_S(x) = Σ_{i=1}^n K(x, x_i) c[i], c = (K + λ I_n)^{-1} y. In contrast to ERM solutions, the above approach prevents interpolation. The properties of the corresponding estimator are well known. In this paper, we complement these results by focusing on the case λ → 0.

Finally, we end by recalling the connection between the minimum norm solution and gradient descent.

Remark 4 (Minimum norm and gradient descent) In our setting, it is well known that both batch and stochastic gradient iterations converge exactly to the minimum norm solution when multiple solutions exist, see e.g. Rosasco & Villa (2015). Thus, a study of the properties of the minimum norm solution explains the properties of the solution to which gradient descent converges. In particular, when ERM has multiple interpolating solutions, gradient descent converges to a solution that minimizes a bound on stability, as we show in this paper.
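The λ → 0 limit in Remark 3 can be illustrated numerically. In the sketch below (our own example with a rank-deficient linear kernel) the Tikhonov coefficients c_λ = (K + λI)^{-1} y are compared with the minimum norm coefficients K† y through the induced predictions Kc, which converge as λ → 0:

```python
import numpy as np

rng = np.random.default_rng(2)

n, d = 15, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

K = X @ X.T                      # linear kernel; rank <= 3 < n, so K is singular
c_min = np.linalg.pinv(K) @ y    # minimum norm coefficients c_S = K† y

# Tikhonov coefficients c_lambda = (K + lambda I)^{-1} y; compare the induced
# function values K c (the predictions), which converge as lambda -> 0.
gaps = []
for lam in [1e-1, 1e-3, 1e-5]:
    c_lam = np.linalg.solve(K + lam * np.eye(n), y)
    gaps.append(np.linalg.norm(K @ (c_lam - c_min)))
```

We compare predictions rather than coefficient vectors because, when y has a component outside the range of the singular K, the coefficients themselves blow up along the null space while the predictions Kc remain well behaved.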

3. ERROR BOUNDS VIA STABILITY

In this section, we recall basic results relating the learning and stability properties of Empirical Risk Minimization (ERM). Throughout the paper, we assume that ERM achieves a minimum, albeit the extension to almost minimizers is possible (Mukherjee et al., 2006) and important for exponential-type loss functions (Poggio, 2020). We do not assume the expected risk to achieve a minimum. Since we will be considering leave-one-out stability in this section, we look at solutions to ERM over the complete training set S = {z_1, z_2, ..., z_n} and the leave one out training set S^i = {z_1, z_2, ..., z_{i-1}, z_{i+1}, ..., z_n}.

The excess risk of ERM can be easily related to its stability properties. Here, we follow the definition laid out in Mukherjee et al. (2006) and say that an algorithm is Cross-Validation leave-one-out (CV_loo) stable in expectation if there exists β_CV > 0 such that for all i = 1, ..., n,

E_S[V(f_{S^i}, z_i) - V(f_S, z_i)] ≤ β_CV.    (3)

This definition is justified by the following result, which bounds the excess risk of a learning algorithm by its average CV_loo stability (Shalev-Shwartz et al., 2010; Mukherjee et al., 2006).

Lemma 5 (Excess Risk & CV_loo Stability) For all i = 1, ..., n,

E_S[I[f_{S^i}] - inf_{f∈H} I[f]] ≤ E_S[V(f_{S^i}, z_i) - V(f_S, z_i)].    (4)

Remark 6 (Connection to uniform stability and other notions of stability) Uniform stability, introduced by Bousquet & Elisseeff (2001), corresponds in our notation to the assumption that there exists β_u > 0 such that for all i = 1, ..., n, sup_{z∈Z} |V(f_{S^i}, z) - V(f_S, z)| ≤ β_u. Clearly this is a strong notion, implying most other definitions of stability. We note that there are a number of different notions of stability. We refer the interested reader to Kutin & Niyogi (2002) and Mukherjee et al. (2006).

We recall the proof of Lemma 5 in Appendix A.2 due to lack of space.
In Appendix A, we also discuss other definitions of stability and their connections to concepts in statistical learning theory like generalization and learnability.

4. CV loo STABILITY OF KERNEL LEAST SQUARES

In this section we analyze the expected CV_loo stability of interpolating solutions to the kernel least squares problem, and obtain an upper bound on their stability. We show that this upper bound on the expected CV_loo stability is smallest for the minimum norm interpolating solution (1), when compared to other interpolating solutions to the kernel least squares problem.

We have a dataset S = {(x_i, y_i)}_{i=1}^n and we want to find a mapping f ∈ H that minimizes the empirical least squares risk. Here H is a reproducing kernel Hilbert space (RKHS) defined by a positive definite kernel K : X × X → R. All interpolating solutions are of the form f̂_S(·) = Σ_{j=1}^n ĉ_S[j] K(x_j, ·), where ĉ_S = K† y + (I - K† K) v. Similarly, all interpolating solutions on the leave one out dataset S^i can be written as f̂_{S^i}(·) = Σ_{j=1, j≠i}^n ĉ_{S^i}[j] K(x_j, ·), where ĉ_{S^i} = (K_{S^i})† y^i + (I - (K_{S^i})† K_{S^i}) v^i. Here K, K_{S^i} are the empirical kernel matrices on the original and leave one out datasets, respectively. We note that when v = 0 and v^i = 0, we obtain the minimum norm interpolating solutions on the datasets S and S^i.

Theorem 7 (Main Theorem) Consider the kernel least squares problem with a bounded kernel and bounded outputs y, that is, there exist κ, M > 0 such that

K(x, x') ≤ κ^2,    |y| ≤ M,    (5)

almost surely. Then for any interpolating solutions f̂_{S^i}, f̂_S,

E_S[V(f̂_{S^i}, z_i) - V(f̂_S, z_i)] ≤ β_CV(K†, y, v, v^i).    (6)

This bound β_CV is minimized when v = v^i = 0, which corresponds to the minimum norm interpolating solutions f†_S, f†_{S^i}. For the minimum norm solutions we have β_CV = C_1 β_1 + C_2 β_2, where

β_1 = E_S[ ||K^{1/2}||_op ||K†||_op × cond(K) × ||y|| ],
β_2 = E_S[ ||K^{1/2}||_op^2 ||K†||_op^2 × (cond(K))^2 × ||y||^2 ],

and C_1, C_2 are absolute constants that do not depend on either d or n.
In the above theorem ||K||_op refers to the operator norm of the kernel matrix K, ||y|| refers to the standard ℓ_2 norm for y ∈ R^n, and cond(K) is the condition number of the matrix K. We can combine the above result with Lemma 5 to obtain the following bound on the excess risk of minimum norm interpolating solutions to the kernel least squares problem:

Corollary 8 The excess risk of the minimum norm interpolating kernel least squares solution can be bounded as E_S[I[f†_{S^i}] - inf_{f∈H} I[f]] ≤ C_1 β_1 + C_2 β_2, where β_1, β_2 are as defined previously.

Remark 9 (Underdetermined Linear Regression) In the case of underdetermined linear regression, i.e., linear regression where the dimensionality is larger than the number of samples in the training set, we can prove a version of Theorem 7 with β_1 = E_S[ ||X†||_op ||y|| ] and β_2 = E_S[ ||X†||_op^2 ||y||^2 ]. Due to space constraints, we present the proof of the results in the linear regression case in Appendix B.
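The quantities entering β_1 can be computed directly for a sampled kernel matrix. The sketch below is an illustration with an assumed RBF kernel; since the absolute constants C_1, C_2 are not given in closed form, we only assemble the product appearing in β_1 for a single draw of S:

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf_kernel(A, B, sigma=2.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

n, d = 30, 40
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

K = rbf_kernel(X, X)
eigs = np.linalg.eigvalsh(K)        # K is PSD; eigvalsh returns ascending eigenvalues
cond_K = eigs[-1] / eigs[0]         # cond(K)
op_K_half = np.sqrt(eigs[-1])       # ||K^{1/2}||_op
op_K_pinv = 1.0 / eigs[0]           # ||K†||_op (K is full rank here)

# beta_1-style quantity from Theorem 7, up to the absolute constant C_1
# (a single-sample surrogate for the expectation over S):
beta1_sample = op_K_half * op_K_pinv * cond_K * np.linalg.norm(y)
```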

4.1. KEY LEMMAS

In order to prove Theorem 7 we make use of the following lemmas, which bound the CV_loo stability using the norms and the difference of the solutions.

Lemma 10 Under assumption (5), for all i = 1, ..., n, it holds that

E_S[V(f̂_{S^i}, z_i) - V(f̂_S, z_i)] ≤ E_S[ (2M + κ(||f̂_S||_H + ||f̂_{S^i}||_H)) × κ ||f̂_S - f̂_{S^i}||_H ].

Proof We begin by recalling that the square loss is locally Lipschitz, that is, for all y, a, a' ∈ R,

|(y - a)^2 - (y - a')^2| ≤ (2|y| + |a| + |a'|) |a - a'|.

If we apply this result to f, f' in an RKHS H,

|(y - f(x))^2 - (y - f'(x))^2| ≤ κ(2M + κ(||f||_H + ||f'||_H)) ||f - f'||_H,

using the basic property of an RKHS that for all f ∈ H,

|f(x)| ≤ ||f||_∞ = sup_x |f(x)| = sup_x |⟨f, K_x⟩_H| ≤ κ ||f||_H.    (7)

In particular, we can plug f̂_{S^i} and f̂_S into the above inequality, and the almost positivity of ERM (Mukherjee et al., 2006) allows us to drop the absolute value on the left hand side. Finally, the desired result follows by taking the expectation over S.

Now that we have bounded the CV_loo stability using the norms and the difference of the solutions, we can find a bound on the difference between the solutions to the kernel least squares problem. This is our main stability estimate.

Lemma 11 Let f̂_S, f̂_{S^i} be any interpolating kernel least squares solutions on the full and leave one out datasets (as defined at the top of this section). Then

||f̂_S - f̂_{S^i}||_H ≤ B_CV(K†, y, v, v^i),

and B_CV is minimized when v = v^i = 0, which corresponds to the minimum norm interpolating solutions f†_S, f†_{S^i}. Also, for some absolute constant C,

||f†_S - f†_{S^i}||_H ≤ C × ||K^{1/2}||_op ||K†||_op × cond(K) × ||y||.    (8)

Since the minimum norm interpolating solutions minimize both ||f̂_S||_H + ||f̂_{S^i}||_H and ||f̂_S - f̂_{S^i}||_H (from Lemmas 10 and 11), we can put them together to prove Theorem 7. In the following section we provide the proof of Lemma 11.
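The locally Lipschitz inequality used at the start of the proof of Lemma 10 can be checked numerically on random triples (y, a, a'), since (y - a)^2 - (y - a')^2 = (a' - a)(2y - a - a'):

```python
import numpy as np

rng = np.random.default_rng(4)

# Check |(y - a)^2 - (y - a')^2| <= (2|y| + |a| + |a'|) * |a - a'| on random triples.
worst_slack = np.inf
for _ in range(10_000):
    yv, a, ap = rng.uniform(-5, 5, size=3)
    lhs = abs((yv - a) ** 2 - (yv - ap) ** 2)
    rhs = (2 * abs(yv) + abs(a) + abs(ap)) * abs(a - ap)
    worst_slack = min(worst_slack, rhs - lhs)   # should never be (meaningfully) negative
```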
Remark 12 (Zero training loss) In Lemma 10 we use the locally Lipschitz property of the squared loss function to bound the leave one out stability in terms of the difference between the norms of the solutions. Under interpolating conditions, if we set the term V(f̂_S, z_i) = 0, the leave one out stability reduces to

E_S[V(f̂_{S^i}, z_i) - V(f̂_S, z_i)] = E_S[V(f̂_{S^i}, z_i)] = E_S[(f̂_{S^i}(x_i) - y_i)^2] = E_S[(f̂_{S^i}(x_i) - f̂_S(x_i))^2] = E_S[⟨f̂_{S^i}(·) - f̂_S(·), K_{x_i}(·)⟩^2] ≤ E_S[||f̂_S - f̂_{S^i}||_H^2] × κ^2.

We can plug in the bound from Lemma 11 to obtain similar qualitative and quantitative (up to constant factors) results as in Theorem 7.

Simulation: In order to illustrate that the minimum norm interpolating solution is the best performing interpolating solution, we ran a simple experiment on a linear regression problem. We synthetically generated data from a linear model y = w^T X, where the entries of X ∈ R^{d×n} were i.i.d. N(0, 1). The dimension of the data was d = 1000 and there were n = 200 samples in the training dataset. A held out dataset of 50 samples was used to compute the test mean squared error (MSE). Interpolating solutions were computed as ŵ^T = y^T X† + v^T (I - XX†), and the norm of v was varied to obtain the plot. The results are shown in Figure 1, where we can see that the training loss is 0 for all interpolants, but the test MSE increases as ||v|| increases, with (w†)^T = y^T X† having the best performance. The figure reports results averaged over 100 trials.
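A scaled-down sketch of this simulation (we use d = 200, n = 50 rather than the paper's d = 1000, n = 200, and a single trial rather than 100):

```python
import numpy as np

rng = np.random.default_rng(5)

d, n, n_test = 200, 50, 50
w = rng.normal(size=d)
X = rng.normal(size=(d, n))          # training inputs as columns, entries N(0, 1)
y = w @ X                            # noiseless linear labels y = w^T X
X_test = rng.normal(size=(d, n_test))
y_test = w @ X_test

Xp = np.linalg.pinv(X)               # X†
w_min = Xp.T @ y                     # minimum norm interpolant, (w†)^T = y^T X†
P = np.eye(d) - X @ Xp               # projector onto the orthogonal complement of col(X)

mses, zero_train = [], True
for scale in [0.0, 2.0, 10.0]:
    v = scale * rng.normal(size=d)
    w_hat = w_min + P @ v            # interpolating solution w^T = y^T X† + v^T (I - XX†)
    zero_train &= np.allclose(w_hat @ X, y)   # training loss is 0 for every interpolant
    mses.append(np.mean((w_hat @ X_test - y_test) ** 2))
```

Every choice of v fits the training data exactly, yet the test MSE grows with ||v||, with the minimum norm solution (scale 0) performing best, matching Figure 1.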

4.2. PROOF OF LEMMA 11

We can write any interpolating solution to the kernel regression problem as f̂_S(x) = Σ_{i=1}^n ĉ_S[i] K(x_i, x), where ĉ_S = K† y + (I - K† K) v, K ∈ R^{n×n} is the kernel matrix on S, i.e. K_ij = K(x_i, x_j), v is any vector in R^n, and y ∈ R^n is the vector y = [y_1 ... y_n]^T. Similarly, the coefficient vector for the corresponding interpolating solution to the problem over the leave one out dataset S^i is ĉ_{S^i} = (K_{S^i})† y^i + (I - (K_{S^i})† K_{S^i}) v^i, where y^i = [y_1, ..., 0, ..., y_n]^T and K_{S^i} is the kernel matrix K with the i-th row and column set to zero, which is the kernel matrix for the leave one out training set.

We define a = [-K(x_1, x_i), ..., -K(x_n, x_i)]^T ∈ R^n and b ∈ R^n as a one-hot column vector with all zeros apart from the i-th component, which is 1. Let a* = a + K(x_i, x_i) b. Then we have:

K* = K + b a*^T,    K_{S^i} = K* + a b^T.    (9)

That is, we can write K_{S^i} as a rank-2 update to K. This can be verified by simple algebra, using the fact that K is a symmetric kernel. Now we are interested in bounding ||f̂_S - f̂_{S^i}||_H. For a function h(·) = Σ_{i=1}^m p_i K(x_i, ·) ∈ H we have ||h||_H = √(p^T K p) = ||K^{1/2} p||. So we have:

||f̂_S - f̂_{S^i}||_H = ||K^{1/2} (ĉ_S - ĉ_{S^i})||
= ||K^{1/2} (K† y + (I - K† K) v - (K_{S^i})† y^i - (I - (K_{S^i})† K_{S^i}) v^i)||
= ||K^{1/2} (K† y - (K_{S^i})† y + y_i (K_{S^i})† b + (I - K† K)(v - v^i) + (K† K - (K_{S^i})† K_{S^i}) v^i)||
= ||K^{1/2} [(K† - (K_{S^i})†) y + (I - K† K)(v - v^i) - (K† K - (K_{S^i})† K_{S^i}) v^i]||    (10)

Here we make use of the fact that (K_{S^i})† b = 0. If K has full rank (as in Remark 2), we see that b lies in the column space of K and a* lies in the column space of K^T. Furthermore,

β* = 1 + a*^T K† b = 1 + a^T K† b + K(x_i, x_i) b^T K† b = K_ii (K†)_ii ≠ 0.

Using equation 2.2 of Baksalary et al. (2003) we obtain:

K*† = K† - (K_ii (K†)_ii)^{-1} K† b a*^T K†
    = K† - (K_ii (K†)_ii)^{-1} K† b a^T K† - ((K†)_ii)^{-1} K† b b^T K†
    = K† + (K_ii (K†)_ii)^{-1} K† b b^T - ((K†)_ii)^{-1} K† b b^T K†    (11)

Here we make use of the fact that a^T K† = -b^T. Also, using the corresponding formula from List 2 of Baksalary et al. (2003), we obtain the expression for (K_{S^i})† in terms of K*†. Putting the two parts together we obtain the bound on ||(K_{S^i})† - K†||_op:

||K† - (K_{S^i})†||_op = ||K† - K*† + K*† - (K_{S^i})†||_op
≤ 3 ||K*†||_op + ||K† - K*†||_op
≤ 3 ||K†||_op + 4 (K_ii (K†)_ii)^{-1} ||K†||_op + 4 ((K†)_ii)^{-1} ||K†||_op^2
≤ ||K†||_op (3 + 8 ||K†||_op ||K||_op)    (14)

The last step follows from (K_ii)^{-1} ≤ ||K†||_op and ((K†)_ii)^{-1} ≤ ||K||_op. Plugging these calculations into equation (10) we get:

||f̂_S - f̂_{S^i}||_H = ||K^{1/2} [(K† - (K_{S^i})†) y + (I - K† K)(v - v^i) - (K† K - (K_{S^i})† K_{S^i}) v^i]||
≤ ||K^{1/2}||_op (||(K† - (K_{S^i})†) y|| + ||(I - K† K)(v - v^i)|| + ||k k† v^i||)
≤ ||K^{1/2}||_op (B_0 + ||I - K† K||_op ||v - v^i|| + ||v^i||)

We see that the right hand side is minimized when v = v^i = 0. We have also computed B_0 = C × ||K†||_op × cond(K) × ||y||, which concludes the proof of Lemma 11.
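The rank-2 decomposition (9) can be verified numerically for a sampled kernel matrix (an illustrative RBF kernel of our choosing): zeroing row and column i of K should coincide with K + b a*^T + a b^T.

```python
import numpy as np

rng = np.random.default_rng(6)

def rbf_kernel(A, B, sigma=1.5):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

n, d, i = 8, 3, 4
X = rng.normal(size=(n, d))
K = rbf_kernel(X, X)

# Leave one out kernel matrix: K with the i-th row and column set to zero.
K_loo = K.copy()
K_loo[i, :] = 0.0
K_loo[:, i] = 0.0

# Rank-2 update from the proof: a = -K e_i, b = e_i, a* = a + K_ii b.
a = -K[:, i]
b = np.zeros(n)
b[i] = 1.0
a_star = a + K[i, i] * b

K_star = K + np.outer(b, a_star)        # K* = K + b a*^T
K_loo_update = K_star + np.outer(a, b)  # K_Si = K* + a b^T

update_err = np.max(np.abs(K_loo_update - K_loo))
```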

5. REMARK AND RELATED WORK

In the previous section we obtained bounds on the CV_loo stability of interpolating solutions to the kernel least squares problem. Our kernel least squares results can be compared with stability bounds for regularized ERM (see Remark 3). Regularized ERM has a strong stability guarantee in terms of a uniform stability bound, which turns out to be inversely proportional to the regularization parameter λ and the sample size n (Bousquet & Elisseeff, 2001). However, this estimate becomes vacuous as λ → 0. In this paper, we establish a bound on average stability, and show that this bound is minimized when the minimum norm ERM solution is chosen. We study average stability since one can expect worst case scenarios where the minimum norm is arbitrarily large (when n ≈ d).

[Figure 2: condition number of a kernel matrix built from a random data matrix distributed as N(0, 1): as in the linear case, the condition number is worse when n = d, better if n > d (on the right of n = d), and also better if n < d (on the left of n = d). The parameter σ was chosen to be 5. From Poggio et al. (2019).]

One of our key findings is the relationship between minimizing the norm of the ERM solution and minimizing a bound on stability. This leads to a second observation, namely, that we can consider the limit of our risk bounds as both the sample size (n) and the dimensionality of the data (d) go to infinity, with the ratio d/n → γ > 1 as n, d → ∞. This is a classical setting in statistics which allows us to use results from random matrix theory (Marchenko & Pastur, 1967). In particular, for linear kernels the behavior of the smallest eigenvalue of the kernel matrix (which appears in our bounds) can be characterized in this asymptotic limit. In fact, under appropriate distributional assumptions, our bound for linear regression can be computed as ||X†|| × ||y|| ≈ √n / (√d - √n) → 1 / (√γ - 1). Here the dimension of the data coincides with the number of parameters in the model.
Interestingly, analogous results hold for more general kernels (inner product and RBF kernels) (El Karoui, 2010), where the asymptotics are taken with respect to the number and dimensionality of the data. These results predict a double descent curve for the condition number, as found in practice, see Figure 2. While it may seem that our bounds in Theorem 7 diverge if d is held constant and n → ∞, this case is not covered by our theorem, since when n > d we no longer have interpolating solutions. Recently, there has been a surge of interest in studying linear and kernel least squares models, since classical results focus on situations where constraints or penalties that prevent interpolation are added to the empirical risk. For example, high dimensional linear regression is considered in Mei & Montanari (2019); Hastie et al. (2019); Bartlett et al. (2019), and "ridgeless" kernel least squares is studied in Liang et al. (2019); Rakhlin & Zhai (2018); and Liang et al. (2020). While these papers study upper and lower bounds on the risk of interpolating solutions to the linear and kernel least squares problems, ours are the first to be derived using stability arguments. While it might be possible to obtain tighter excess risk bounds through a careful analysis of the minimum norm interpolant, our simple approach helps us establish a link between stability in the statistical sense and stability in the numerical sense. Finally, we can compare our results with observations made in Poggio et al. (2019) on the condition number of random kernel matrices. The condition number of the empirical kernel matrix is known to control the numerical stability of the solution to a kernel least squares problem. Our results show that the statistical stability is also controlled by the condition number of the kernel matrix, providing a natural link between numerical and statistical stability.
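The random matrix prediction underlying the computation above can be checked numerically: for X ∈ R^{d×n} with i.i.d. N(0, 1) entries and d/n = γ fixed, the smallest singular value of X concentrates around √d - √n (Bai-Yin, consistent with the edge of the Marchenko-Pastur support), so ||X†||_op ≈ 1/(√d - √n):

```python
import numpy as np

rng = np.random.default_rng(7)

gamma = 4.0
ratios = []
for n in [100, 400]:
    d = int(gamma * n)
    X = rng.normal(size=(d, n))                     # i.i.d. N(0, 1) entries
    s_min = np.linalg.svd(X, compute_uv=False)[-1]  # smallest singular value
    # Bai-Yin / Marchenko-Pastur: s_min ~ sqrt(d) - sqrt(n), so
    # ||X†||_op = 1/s_min ~ 1/(sqrt(d) - sqrt(n)) and, with ||y|| ~ sqrt(n),
    # ||X†||_op * ||y|| -> 1/(sqrt(gamma) - 1).
    ratios.append(s_min / (np.sqrt(d) - np.sqrt(n)))
```

The ratio approaches 1 as n grows with γ fixed, in line with the asymptotic limit quoted above.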

6. CONCLUSIONS

In summary, minimizing a bound on cross validation stability minimizes the expected error in both the classical and the modern regime of ERM. In the classical regime (d < n), CV_loo stability implies generalization and consistency as n → ∞. In the modern regime (d > n), as described in this paper, CV_loo stability can account for the double descent curve in kernel interpolants (Belkin et al., 2019) under appropriate distributional assumptions. The main contribution of this paper is characterizing the stability of interpolating solutions, in particular deriving excess risk bounds via a stability argument. In the process, we show that among all the interpolating solutions, the one with minimum norm also minimizes a bound on stability. Since the excess risk bounds of the minimum norm interpolant depend on the pseudoinverse of the kernel matrix, we establish an elegant link between numerical and statistical stability. This also holds for solutions computed by gradient descent, since gradient descent converges to minimum norm solutions in the case of "linear" kernel methods. Our approach is simple and combines basic stability results with matrix inequalities.

A EXCESS RISK, GENERALIZATION, AND STABILITY

We use the same notation introduced in Section 2 for the various quantities considered in this section. That is, in the supervised learning setup, V(f, z) is the loss incurred by hypothesis f on the sample z, and I[f] = E_z[V(f, z)] is the expected error of hypothesis f. Since we are interested in different forms of stability, we will consider learning problems over the original training set S = {z_1, z_2, ..., z_n}, the leave one out training set S^i = {z_1, ..., z_{i-1}, z_{i+1}, ..., z_n}, and the replace one training set (S^i, z) = {z_1, ..., z_{i-1}, z_{i+1}, ..., z_n, z}.

A.1 REPLACE ONE AND LEAVE ONE OUT ALGORITHMIC STABILITY

Similar to the definition of expected CV_loo stability in equation (3) of the main paper, we say an algorithm is cross validation replace one stable (in expectation), denoted CV_ro, if there exists β_ro > 0 such that E_{S,z}[V(f_S, z) - V(f_{(S^i,z)}, z)] ≤ β_ro.

We can strengthen the above stability definition by introducing the notion of replace one algorithmic stability (in expectation) (Bousquet & Elisseeff, 2001): there exists α_ro > 0 such that for all i = 1, ..., n, E_{S,z}[||f_S - f_{(S^i,z)}||_∞] ≤ α_ro.

We make two observations. First, if the loss is Lipschitz, that is, if there exists C_V > 0 such that for all f, f' ∈ H, |V(f, z) - V(f', z)| ≤ C_V ||f - f'||_∞, then replace one algorithmic stability implies CV_ro stability with β_ro = C_V α_ro. Moreover, the same result holds if the loss is locally Lipschitz and there exists R > 0 such that ||f_S||_∞ ≤ R almost surely. In this latter case the Lipschitz constant will depend on R. Later, we illustrate this situation for the square loss. Second, we have for all i = 1, ..., n, S and z,

E_{S,z}[||f_S - f_{(S^i,z)}||_∞] ≤ E_{S,z}[||f_S - f_{S^i}||_∞] + E_{S,z}[||f_{(S^i,z)} - f_{S^i}||_∞].

This observation motivates the notion of leave one out algorithmic stability (in expectation) (Bousquet & Elisseeff, 2001): there exists α_loo > 0 such that for all i = 1, ..., n, E_{S,z}[||f_S - f_{S^i}||_∞] ≤ α_loo. Clearly, leave one out algorithmic stability implies replace one algorithmic stability with α_ro = 2α_loo, and it also implies CV_ro stability with β_ro = 2 C_V α_loo.

A.2 EXCESS RISK AND CV_loo, CV_ro STABILITY

We recall the statement of Lemma 5 in Section 3, which bounds the excess risk using the CV_loo stability of a solution.

Lemma 13 (Excess Risk & CV_loo Stability) For all i = 1, ..., n, E_S[I[f_{S^i}] - inf_{f∈H} I[f]] ≤ E_S[V(f_{S^i}, z_i) - V(f_S, z_i)].

In this section, two properties of ERM are useful, namely symmetry and a form of unbiasedness.

Symmetry.

A key property of ERM is that it is symmetric with respect to the data set S, meaning that it does not depend on the order of the data in S. A second property relates the expected ERM with the minimum of the expected risk.

ERM Bias.

The following inequality holds:

E_S[I_S[f_S]] - min_{f∈H} I[f] ≤ 0.    (17)

To see this, note that I_S[f_S] ≤ I_S[f] for all f ∈ H by definition of ERM, so that taking the expectation of both sides, E_S[I_S[f_S]] ≤ E_S[I_S[f]] = I[f] for all f ∈ H. This implies E_S[I_S[f_S]] ≤ min_{f∈H} I[f], and hence (17) holds.

Remark 14 Note that the same argument gives, more generally, that

E_S[inf_{f∈H} I_S[f]] - inf_{f∈H} I[f] ≤ 0.    (18)

Given the above premise, the proof of Lemma 5 is simple.

Proof [of Lemma 5] Adding and subtracting E_S[I_S[f_S]] from the expected excess risk, we have that

E_S[I[f_{S^i}] - min_{f∈H} I[f]] = E_S[I[f_{S^i}] - I_S[f_S] + I_S[f_S] - min_{f∈H} I[f]],    (19)

and since E_S[I_S[f_S]] - min_{f∈H} I[f] is less than or equal to zero, see (18),

E_S[I[f_{S^i}] - min_{f∈H} I[f]] ≤ E_S[I[f_{S^i}] - I_S[f_S]].    (20)

Moreover, for all i = 1, ..., n,

E_S[I[f_{S^i}]] = E_S[E_{z_i} V(f_{S^i}, z_i)] = E_S[V(f_{S^i}, z_i)]

and

E_S[I_S[f_S]] = (1/n) Σ_{i=1}^n E_S[V(f_S, z_i)] = E_S[V(f_S, z_i)].

Plugging these last two expressions into (20) and (19) leads to (4).

We can prove a similar result relating excess risk with CV_ro stability.

Lemma 15 (Excess Risk & CV_ro Stability) Given the above definitions, the following inequality holds for all i = 1, ..., n,

E_S[I[f_S] - inf_{f∈H} I[f]] ≤ E_S[I[f_S] - I_S[f_S]] = E_{S,z}[V(f_S, z) - V(f_{(S^i,z)}, z)].

Proof The first inequality is clear from adding and subtracting I_S[f_S] from the expected risk I[f_S]: we have E_S[I[f_S] - min_{f∈H} I[f]] = E_S[I[f_S] - I_S[f_S] + I_S[f_S] - min_{f∈H} I[f]], and recalling (18). The main step in the proof is showing that for all i = 1, ..., n,

E[I_S[f_S]] = E[V(f_{(S^i,z)}, z)],    (22)

to be compared with the trivial equality E[I_S[f_S]] = E[V(f_S, z_i)]. To prove equation (22), we have for all i = 1, ..., n,

E_S[I_S[f_S]] = E_{S,z}[(1/n) Σ_{i=1}^n V(f_S, z_i)] = (1/n) Σ_{i=1}^n E_{S,z}[V(f_{(S^i,z)}, z)] = E_{S,z}[V(f_{(S^i,z)}, z)],

where we used the fact that, by the symmetry of the algorithm, E_{S,z}[V(f_{(S^i,z)}, z)] is the same for all i = 1, ..., n. The proof is concluded by noting that E_S[I[f_S]] = E_{S,z}[V(f_S, z)].

A.3 DISCUSSION ON STABILITY AND GENERALIZATION

Below we discuss some further aspects of stability and its connection to other quantities in statistical learning theory.

Remark 16 (CV_loo stability in expectation and in probability) In Mukherjee et al. (2006), CV_loo stability is defined in probability, that is, there exist β^P_CV > 0, 0 < δ^P_CV ≤ 1 such that P_S{|V(f_{S^i}, z_i) - V(f_S, z_i)| ≥ β^P_CV} ≤ δ^P_CV. Note that the absolute value is not needed for ERM, since almost positivity holds (Mukherjee et al., 2006), that is, V(f_{S^i}, z_i) - V(f_S, z_i) > 0. CV_loo stability in probability and in expectation are then clearly related, and indeed equivalent for bounded loss functions. CV_loo stability in expectation (3) is what we study in the preceding sections.

Remark 17 (Connection to uniform stability and other notions of stability) Uniform stability, introduced by Bousquet & Elisseeff (2001), corresponds in our notation to the assumption that there exists β_u > 0 such that for all i = 1, ..., n, sup_{z∈Z} |V(f_{S^i}, z) - V(f_S, z)| ≤ β_u. Clearly this is a strong notion, implying most other definitions of stability. We note that there are a number of different notions of stability. We refer the interested reader to Kutin & Niyogi (2002) and Mukherjee et al. (2006).

Remark 18 (CV_loo Stability & Learnability) A natural question is to what extent suitable notions of stability are not only sufficient but also necessary for controlling the excess risk of ERM.
Classically, the latter is characterized in terms of a uniform version of the law of large numbers, which itself can be characterized in terms of suitable complexity measures of the hypothesis class. Uniform stability is too strong to characterize consistency while CV loo stability turns out to provide a suitably weak definition as shown in Mukherjee et al. (2006) , see also Kutin & Niyogi (2002) , Mukherjee et al. (2006) . Indeed, a main result in Mukherjee et al. (2006) shows that CV loo stability is equivalent to consistency of ERM: Theorem 19 Mukherjee et al. (2006) For ERM and bounded loss functions, CV loo stability in probability with β P CV converging to zero for n → ∞ is equivalent to consistency and generalization of ERM. Remark 20 (CV loo stability & in-sample/out-of-sample error) Let (S, z) = {z 1 , . . . , z n , z}, (z is a data point drawn according to the same distribution) and the corresponding ERM solution f (S,z) , then (4) can be equivalently written as, E S [I[f S ] -inf f ∈F I[f ]] ≤ E S,z [V (f S , z) -V (f (S,z) , z)]. Thus CV loo stability measures how much the loss changes when we test on a point that is present in the training set and absent from it. In this view, it can be seen as an average measure of the difference between in-sample and out-of-sample error. Remark 21 (CV loo stability and generalization) A common error measure is the (expected) generalization gap E S [I[f S ]-I S [f S ]]. For non-ERM algorithms, CV loo stability by itself not sufficient to control this term, and further conditions are needed Mukherjee et al. (2006) , since E S [I[f S ] -I S [f S ]] = E S [I[f S ] -I S [f Si ]] + E S [I S [f Si ] -I S [f S ]]. The second term becomes for all i = 1, . . . , n, E S [I S [f Si ] -I S [f S ]] = 1 n n i=1 E S [V (f Si , z i ) -V (f S , z i )] = E S [V (f Si , z i ) -V (f S , z i )] and hence is controlled by CV stability. The first term is called expected leave one out error in Mukherjee et al. 
(2006) and is controlled in ERM as n → ∞, see Theorem 19 above. In the above equation we make use of the fact that b (X i ) † = 0. We use an old formula (Meyer, 1973; Baksalary et al., 2003) to compute (X i ) † from X † . We use the development of pseudo-inverses of perturbed matrices in Meyer (1973) . We see that a = -x i is a vector in the column space of X and b is in the range space of X T (provided X has full column rank), with β = 1 + b X † a = 1 -b X † x i = 0. This means we can use Theorem 6 in Meyer (1973) (equivalent to formula 2.1 in Baksalary et al. (2003) ) to obtain the expression for (X i ) † (X i ) † = X † -kk † X † -X † h † h + (k † X † h † )kh where k = X † a, and h = b X † , and u † = u ||u|| 2 for any non-zero vector u. (X i ) † -X † = (k † X † h † )kh -kk † X † -X † h † h = a (X † ) X † (X † ) b × kh ||k|| 2 ||h|| 2 -kk † X † -X † h † h =⇒ ||(X i ) † -X † || op ≤ |a (X † ) X † (X † ) b| ||X † a||||b X † || + 2||X † || op ≤ ||X † || op ||X † a||||b X † || ||X † a||||b X † || + 2||X † || op = 3||X † || op The above set of inequalities follows from the fact that the operator norm of a rank 1 matrix is given by ||uv || op = ||u|| × ||v|| Also, from List 2 of Baksalary et al. (2003) we have that X i (X i ) † = XX † -h † h. Plugging in these calculations into equation 28 we get: || ŵSi -ŵS || = ||y ((X i ) † -X † ) + (v i -v )(I -XX † ) -v i (XX † -X i (X i ) † )|| ≤ B 0 + ||I -XX † || op ||v -v i || + ||v i || × ||h † h|| op ≤ B 0 + 2||v -v i || + ||v i || We see that the right hand side is minimized when v = v i = 0. We can also compute B 0 = 3||X † || op ||y||, which concludes the proof of Lemma 22.
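The rank-one pseudoinverse update and the resulting operator-norm bound can be sanity-checked numerically. The sketch below is our illustration, not part of the proof; the dimensions `d, n`, the column index `i`, and the seed are arbitrary choices. It zeroes out one column of a random full-column-rank matrix, applies Theorem 6 of Meyer (1973) as stated above, and compares against a direct pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, i = 40, 10, 3                     # d > n, so X has full column rank a.s.
X = rng.standard_normal((d, n))

# Leave-one-out data matrix: X^i = X + a b^T zeroes out the i-th column.
a = -X[:, i].copy()                     # a = -x_i lies in the column space of X
b = np.zeros(n); b[i] = 1.0             # b = i-th canonical basis vector of R^n
X_i = X + np.outer(a, b)

Xp = np.linalg.pinv(X)                  # X^†, shape (n, d)
k = Xp @ a                              # k = X^† a, shape (n,)
h = b @ Xp                              # h = b^T X^†, shape (d,)

# Theorem 6 in Meyer (1973):
# (X^i)^† = X^† - k k^† X^† - X^† h^† h + (k^† X^† h^†) k h,
# with u^† = u^T / ||u||^2 for a non-zero vector u.
Xip = (Xp
       - np.outer(k, k @ Xp) / (k @ k)
       - np.outer(Xp @ h, h) / (h @ h)
       + (k @ Xp @ h) / ((k @ k) * (h @ h)) * np.outer(k, h))

direct = np.linalg.pinv(X_i)            # direct computation for comparison
err = np.abs(Xip - direct).max()

# Perturbation bound from the proof: ||(X^i)^† - X^†||_op <= 3 ||X^†||_op
lhs = np.linalg.norm(Xip - Xp, 2)
rhs = 3 * np.linalg.norm(Xp, 2)
```

The check `b @ Xip` being zero confirms the fact $b^\top (X^i)^\dagger = 0$ used in the proof.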



Figure 1: Plot of train and test mean squared error (MSE) vs. the distance between an interpolating solution $\hat{w}$ and the minimum-norm interpolant $w^\dagger$ of a linear regression problem. Data was synthetically generated as $y^\top = w^\top X$, where $X \in \mathbb{R}^{d \times n}$ has i.i.d. $\mathcal{N}(0, 1)$ entries, with $d = 1000$ and $n = 200$. Other interpolating solutions were computed as $\hat{w}^\top = y^\top X^\dagger + v^\top (I - X X^\dagger)$, and the norm of $v$ was varied to obtain the plot. Train MSE is 0 for all interpolants, but test MSE increases as $\|v\|$ increases, with $w^\dagger$ having the best performance. This plot shows results averaged over 100 trials.
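The experiment behind Figure 1 can be sketched in a few lines of NumPy. The version below is our illustration, with smaller dimensions ($d = 100$, $n = 20$ instead of the figure's $d = 1000$, $n = 200$), a single trial, and an arbitrary seed: every choice of $v$ interpolates the training data, while the test error grows with the scale of $v$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 100, 20                       # smaller than Figure 1's d = 1000, n = 200, for speed
w_true = rng.standard_normal(d)
X = rng.standard_normal((d, n))      # training inputs as columns
y = w_true @ X                       # noiseless targets, y^T = w^T X

Xp = np.linalg.pinv(X)
P = np.eye(d) - X @ Xp               # projector onto the orthogonal complement of col(X)
w_dag = y @ Xp                       # minimum-norm interpolant w^†

def mse(w, X_eval, y_eval):
    return np.mean((w @ X_eval - y_eval) ** 2)

X_test = rng.standard_normal((d, 2000))
y_test = w_true @ X_test

# Interpolants w^T = y^T X^† + v^T (I - X X^†) for growing ||v||:
# train MSE stays (numerically) zero, test MSE grows.
test_errs = []
for scale in [0.0, 1.0, 5.0]:
    w = w_dag + scale * rng.standard_normal(d) @ P
    assert mse(w, X, y) < 1e-10      # every choice of v interpolates the training data
    test_errs.append(mse(w, X_test, y_test))
```

With `scale = 0.0` the loop recovers $w^\dagger$ itself, which attains the smallest test error of the three interpolants.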

Figure 2: Typical double descent of the condition number (y axis) of a radial basis function kernel $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$ for $d = 50$.
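The quantity plotted in Figure 2 can be computed as follows. This sketch is ours; the bandwidth $\sigma = 5$ and the sample sizes are arbitrary choices, and no particular shape of the resulting curve is asserted here.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=5.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for columns x_i of X."""
    sq = np.sum(X**2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2 * X.T @ X   # pairwise squared distances
    return np.exp(-np.maximum(D2, 0.0) / (2 * sigma**2))

rng = np.random.default_rng(3)
d = 50
conds = {}
for n in (10, 50, 200):        # n < d, n = d, n > d
    X = rng.standard_normal((d, n))
    K = rbf_kernel_matrix(X)
    conds[n] = np.linalg.cond(K)
```

Sweeping `n` over a fine grid and plotting `conds` against `n` reproduces the kind of curve shown in the figure.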

Figure 3: Typical double descent of the pseudoinverse norm (y axis) of a random data matrix with entries distributed as $\mathcal{N}(0, 1)$: the norm is worst when $n = d$, better if $n > d$ (to the right of $n = d$), and also better if $n < d$ (to the left of $n = d$). From Poggio et al. (2019).
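The double descent of the pseudoinverse norm in Figure 3 is easy to observe directly. The sketch below is our illustration (the dimension $d = 50$, the trial count, and the three sample sizes are arbitrary): it averages $\|X^\dagger\|_{op}$ over random Gaussian matrices for $n < d$, $n = d$, and $n > d$, and the norm peaks sharply at $n = d$, where the smallest non-zero singular value of $X$ is smallest.

```python
import numpy as np

rng = np.random.default_rng(4)
d, trials = 50, 20

def avg_pinv_norm(n):
    """Average operator norm ||X^†||_op over random d x n Gaussian matrices."""
    total = 0.0
    for _ in range(trials):
        X = rng.standard_normal((d, n))
        # ||X^†||_op = 1 / (smallest non-zero singular value of X)
        total += np.linalg.norm(np.linalg.pinv(X), 2)
    return total / trials

norms = {n: avg_pinv_norm(n) for n in (20, 50, 80)}   # n < d, n = d, n > d
```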

B CV loo STABILITY OF LINEAR REGRESSION

We have a dataset $S = \{(x_i, y_i)\}_{i=1}^n$ and we want to find a mapping $w \in \mathbb{R}^d$ that minimizes the empirical least squares risk. All interpolating solutions are of the form $\hat{w}_S^\top = y^\top X^\dagger + v^\top (I - X X^\dagger)$. Similarly, all interpolating solutions on the leave-one-out dataset $S^i$ can be written as $\hat{w}_{S^i}^\top = y_i^\top (X^i)^\dagger + v_i^\top (I - X^i (X^i)^\dagger)$. Here $X, X^i \in \mathbb{R}^{d \times n}$ are the data matrices for the original and leave-one-out datasets, respectively. We note that when $v = 0$ and $v_i = 0$, we obtain the minimum-norm interpolating solutions on the datasets $S$ and $S^i$.

In this section we want to estimate the CV$_{loo}$ stability of the minimum-norm solution to the ERM problem in the linear regression case. This is the case outlined in Remark 9 of the main paper. In order to prove Remark 9, we only need to combine Lemma 10 with the linear regression analogue of Lemma 11. We state and prove that result in this section. This result predicts a double descent curve for the norm of the pseudoinverse, as found in practice; see Figure 3.

Lemma 22 Let $\hat{w}_S, \hat{w}_{S^i}$ be any interpolating least squares solutions on the full and leave-one-out datasets $S, S^i$. Then $\|\hat{w}_S - \hat{w}_{S^i}\| \le B_{CV}(X^\dagger, y, v, v_i)$, and $B_{CV}$ is minimized when $v = v_i = 0$, which corresponds to the minimum-norm interpolating solutions $w^\dagger_S, w^\dagger_{S^i}$. Also, $B_{CV}(X^\dagger, y, v, v_i) = 3\|X^\dagger\|_{op} \|y\| + 2\|v - v_i\| + \|v_i\|$.

As mentioned in section 2.1 of the main paper, linear regression can be viewed as a case of the kernel regression problem where $\mathcal{H} = \mathbb{R}^d$ and the feature map $\Phi$ is the identity map. The inner product and norms considered in this case are the usual Euclidean inner product and 2-norm for vectors in $\mathbb{R}^d$. The notation $\|\cdot\|$ denotes the Euclidean norm for vectors both in $\mathbb{R}^d$ and in $\mathbb{R}^n$; the usage should be clear from the context. Also, $\|A\|_{op}$ is the left operator norm for a matrix $A \in \mathbb{R}^{n \times d}$, that is, $\|A\|_{op} = \sup_{y \in \mathbb{R}^n, \|y\| = 1} \|y^\top A\|$.

We have $n$ samples in the training set for a linear regression problem, $\{(x_i, y_i)\}_{i=1}^n$. We collect all the samples into a single matrix/vector $X = [x_1 \; x_2 \; \dots \; x_n] \in \mathbb{R}^{d \times n}$ and $y = [y_1 \; y_2 \; \dots \; y_n]^\top \in \mathbb{R}^n$. Then any interpolating ERM solution $w_S$ satisfies the linear equation
$$w_S^\top X = y^\top.$$
Any interpolating solution can be written as
$$\hat{w}_S^\top = y^\top X^\dagger + v^\top (I - X X^\dagger).$$
If we consider the leave-one-out training set $S^i$, we can similarly write any interpolating solution as
$$\hat{w}_{S^i}^\top = y_i^\top (X^i)^\dagger + v_i^\top (I - X^i (X^i)^\dagger).$$
We can write $X^i$ as
$$X^i = X + a b^\top,$$
where $a \in \mathbb{R}^d$ is a column vector representing the additive change to the $i$-th column, i.e., $a = -x_i$, and $b \in \mathbb{R}^{n \times 1}$ is the $i$-th element of the canonical basis of $\mathbb{R}^n$ (all coefficients are zero except the $i$-th, which is one). Thus $a b^\top$ is a $d \times n$ matrix of all zeros apart from the $i$-th column, which is equal to $a$. We also have $y_i = y - y_i b$, where, with a slight abuse of notation, $y_i$ on the left denotes the leave-one-out label vector and $y_i$ on the right denotes the $i$-th label. Now, per Lemma 10, we are interested in bounding the quantity $\|\hat{w}_{S^i} - \hat{w}_S\| = \|\hat{w}_{S^i}^\top - \hat{w}_S^\top\|$. This simplifies to:
$$\|\hat{w}_{S^i} - \hat{w}_S\| = \big\|y^\top \big((X^i)^\dagger - X^\dagger\big) + (v_i^\top - v^\top)(I - X X^\dagger) - v_i^\top \big(X X^\dagger - X^i (X^i)^\dagger\big)\big\|. \qquad (28)$$
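The parameterization of interpolating solutions above can be checked numerically. In the sketch below (our illustration; dimensions and seed are arbitrary), every $w$ of the form $y^\top X^\dagger + v^\top (I - X X^\dagger)$ solves $w^\top X = y^\top$, and $v = 0$ gives the smallest norm.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 30, 8
X = rng.standard_normal((d, n))        # columns x_1, ..., x_n; d > n so rank(X) = n a.s.
y = rng.standard_normal(n)

Xp = np.linalg.pinv(X)
P = np.eye(d) - X @ Xp                 # projector onto null(X^T): the "free" directions
w_min = y @ Xp                         # v = 0 gives the minimum-norm interpolant

norms = []
for _ in range(5):
    v = rng.standard_normal(d)
    w = y @ Xp + v @ P                 # w^T = y^T X^† + v^T (I - X X^†)
    assert np.allclose(w @ X, y)       # every such w interpolates: w^T X = y^T
    norms.append(np.linalg.norm(w))
```

The minimum-norm property follows because $y^\top X^\dagger$ lies in the column space of $X$ while $v^\top (I - X X^\dagger)$ lies in its orthogonal complement, so the two components add in norm.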

