ON THE UNIVERSALITY OF THE DOUBLE DESCENT PEAK IN RIDGELESS REGRESSION

Abstract

We prove a non-asymptotic distribution-independent lower bound for the expected mean squared generalization error caused by label noise in ridgeless linear regression. Our lower bound generalizes a similar known result to the overparameterized (interpolating) regime. In contrast to most previous works, our analysis applies to a broad class of input distributions with almost surely full-rank feature matrices, which allows us to cover various types of deterministic or random feature maps. Our lower bound is asymptotically sharp and implies that in the presence of label noise, ridgeless linear regression does not perform well around the interpolation threshold for any of these feature maps. We analyze the imposed assumptions in detail and provide a theory for analytic (random) feature maps. Using this theory, we can show that our assumptions are satisfied for input distributions with a (Lebesgue) density and feature maps given by random deep neural networks with analytic activation functions like sigmoid, tanh, softplus or GELU. As further examples, we show that feature maps from random Fourier features and polynomial kernels also satisfy our assumptions. We complement our theory with further experimental and analytic results.

1. INTRODUCTION

Seeking a better understanding of the successes of deep learning, Zhang et al. (2016) pointed out that deep neural networks can achieve very good performance despite being able to fit random noise, which sparked the interest of many researchers in studying the performance of interpolating learning methods. Belkin et al. (2018) made a similar observation for kernel methods and showed that classical generalization bounds are unable to explain this phenomenon. Belkin et al. (2019a) observed a "double descent" phenomenon in various learning models, where the test error first decreases with increasing model complexity, then increases towards the "interpolation threshold" where the model is first able to fit the training data perfectly, and then decreases again in the "overparameterized" regime where the model capacity exceeds the size of the training set. This phenomenon has also been discovered in several other works (Bös & Opper, 1997; Advani & Saxe, 2017; Neal et al., 2018; Spigler et al., 2019). Nakkiran et al. (2019) performed a large empirical study on deep neural networks and found that double descent can occur not only as a function of model capacity, but also as a function of the number of training epochs or as a function of the number of training samples. Theoretical investigations of the double descent phenomenon have mostly focused on specific unregularized ("ridgeless") or weakly regularized linear regression models. These linear models can be described via i.i.d. samples (x_1, y_1), ..., (x_n, y_n) ∈ R^d × R, where the covariates x_i are mapped to feature representations z_i = φ(x_i) ∈ R^p via a (potentially random) feature map φ, and (ridgeless) linear regression is then performed on the transformed samples (z_i, y_i).
While linear regression with random features can be understood as a simplified model of fully trained neural networks, it is also interesting in its own right: For example, random Fourier features (Rahimi & Recht, 2008) and random neural network features (see e.g. Cao et al., 2018; Scardapane & Wang, 2017) have gained a notable amount of attention. Unfortunately, existing theoretical investigations of double descent are usually limited in one or more of the following ways: (1) They assume that the z_i (or a linear transformation thereof) have (centered) i.i.d. components. This assumption is made by Hastie et al. (2019), while Advani & Saxe (2017) and Belkin et al. (2019b) even assume that the z_i follow a Gaussian distribution. While the assumption of i.i.d. components facilitates the application of some random matrix theory results, it excludes most feature maps: For feature maps φ with d < p, the z_i will usually be concentrated on a d-dimensional submanifold of R^p, and will therefore usually not have i.i.d. components. (2) They assume a (shallow) random feature model with fixed distribution of the x_i, e.g. an isotropic Gaussian distribution or a uniform distribution on a sphere. Examples of this are the single-layer random neural network feature models by Hastie et al. (2019) in the unregularized case and by Mei & Montanari (2019) and d'Ascoli et al. (2020a) in the regularized case. A simple Fourier model with d = 1 has been studied by Belkin et al. (2019b). While these analyses provide insights for some practically relevant random feature models, the assumptions on the input distribution prevent them from applying to real-world data. (3) Their analysis only applies in a high-dimensional limit where n, p → ∞ and n/p → γ for a constant γ ∈ (0, ∞). This applies to all works mentioned in (1) and (2) except the model by Belkin et al. (2019b), where the z_i follow a standard Gaussian distribution.
In this paper, we provide an analysis under significantly weaker assumptions. We introduce the basic setting of our paper in Section 2 and Section 3. Our main contributions are:
• In Section 4, we show a non-asymptotic distribution-independent lower bound for the expected excess risk of ridgeless linear regression with (random) features. While the underparameterized bound is adapted from a minimax lower bound in Mourtada (2019), the overparameterized bound is new and perfectly complements the underparameterized version. The obtained general lower bound relies on significantly weaker assumptions than most previous works and shows that there is only limited potential to reduce the sensitivity of unregularized linear models to label noise via engineering better feature maps.
• In Section 5, we show that our lower bound applies to a large class of input distributions and feature maps including random deep neural networks, random Fourier features and polynomial kernels. This analysis is also relevant for related work where similar assumptions are not investigated (e.g. Mourtada, 2019; Muthukumar et al., 2020). For random deep neural networks, our result requires weaker assumptions than a related result by Nguyen & Hein (2017).
• In Section 6 and Appendix C, we compare our lower bound to new theoretical and experimental results for specific examples, including random neural network feature maps as well as finite-width Neural Tangent Kernels (Jacot et al., 2018). We also show that our lower bound is asymptotically sharp in the limit n, p → ∞.
Similar to this paper, Muthukumar et al. (2020) study the "fundamental price of interpolation" in the overparameterized regime, providing a probabilistic lower bound for the generalization error under the assumption of subgaussian features or (suitably) bounded features.
We explain the difference to our lower bound in detail in Appendix L, showing that our overparameterized lower bound for the expected generalization error requires significantly weaker assumptions, that it is uniform across feature maps and that it yields a more extreme interpolation peak. Our lower bound also applies to a large class of kernels if they can be represented using a feature map with finite-dimensional feature space, i.e. p < ∞. For ridgeless regression with certain classes of kernels, lower or upper bounds have been derived (Liang & Rakhlin, 2020; Rakhlin & Zhai, 2019; Liang et al., 2019) . However, as explained in more detail in Appendix K, these analyses impose restrictions on the kernels that allow them to ignore "double descent" type phenomena in the feature space dimension p. Beyond Double Descent, a series of papers have studied "Multiple Descent" phenomena theoretically and empirically, both with respect to the number of parameters p and the input dimension d. Adlam & Pennington (2020) and d'Ascoli et al. (2020b) theoretically investigate Triple Descent phenomena. Nakkiran et al. (2020) argue that Double Descent can be mitigated by optimal regularization. They also empirically observe a form of Triple Descent in an unregularized model. Liang et al. (2019) prove an upper bound exhibiting infinitely many peaks and empirically observe Multiple Descent. Chen et al. (2020) show that in ridgeless linear regression, the feature distributions can be designed to control the locations of ascents and descents in the double descent curve for a "dimension-normalized" noise-induced generalization error. Our lower bound provides a fundamental limit to this "designability" of the generalization curve for methods that can interpolate with probability one in the overparameterized regime. Proofs for our statements can be found in the appendix. 
We provide code to reproduce all of our experimental results at https://github.com/dholzmueller/universal_double_descent and we will provide the computed data at https://doi.org/10.18419/darus-1771.

2. BASIC SETTING AND NOTATION

Following Györfi et al. (2002), we consider the scenario where the samples (x_i, y_i) of a data set D = ((x_1, y_1), ..., (x_n, y_n)) ∈ (R^d × R)^n are sampled independently from a probability distribution P on R^d × R, i.e. D ∼ P^n. We define X ∈ R^{n×d} as the matrix with rows x_1^T, ..., x_n^T, and y := (y_1, ..., y_n)^T ∈ R^n. We also consider random variables (x, y) ∼ P that are independent of D and denote the distribution of x by P_X. The (least squares) population risk of a function f : R^d → R is defined as R_P(f) := E_{x,y} (y − f(x))^2. We assume Ey^2 < ∞. Then, R_P is minimized by the target function f*_P given by f*_P(x) = E(y|x), we have R_P(f*_P) < ∞, and the excess risk (a.k.a. generalization error) of a function f is R_P(f) − R_P(f*_P) = E_x (f(x) − f*_P(x))^2.

Notation. For two symmetric matrices A, B, we write A ⪰ B if A − B is positive semidefinite and A ≻ B if A − B is positive definite. For a symmetric matrix S ∈ R^{n×n}, we let λ_1(S) ≥ ... ≥ λ_n(S) be its eigenvalues in descending order. We denote the trace of A by tr(A) and the Moore-Penrose pseudoinverse of A by A^+. For φ : R^d → R^p and X ∈ R^{n×d}, we let φ(X) ∈ R^{n×p} be the matrix obtained by applying φ to each row of X individually. For a set A, we denote its indicator function by 1_A. For a random variable x, we say that x has a Lebesgue density if P_X can be represented by a probability density function (w.r.t. the Lebesgue measure). We say that x is nonatomic if P(x = x̃) = 0 for every fixed value x̃. We denote the uniform distribution on a set A, e.g. the unit sphere S^{p−1} ⊆ R^p, by U(A). We denote the normal (Gaussian) distribution with mean µ and covariance Σ by N(µ, Σ). For n ∈ N, we define [n] := {1, ..., n}. We review relevant matrix facts, e.g. concerning the Moore-Penrose pseudoinverse, in Appendix B.

3. LINEAR REGRESSION WITH (RANDOM) FEATURES

The most general setting that we will consider in this paper is ridgeless linear regression on random features: Given a random variable θ that is independent of the data set D and an associated random feature map φ_θ : R^d → R^p, we define the estimator f_{X,y,θ}(x) := φ_θ(x)^T φ_θ(X)^+ y, which simply performs unregularized linear regression on the random features. As a special case, the feature map φ_θ may be deterministic, in which case we drop the index θ. An even more specialized case is ordinary linear regression, where d = p and φ_θ = id, yielding f_{X,y}(x) = x^T X^+ y. As described in Hastie et al. (2019), the ridgeless linear regression parameter β := φ_θ(X)^+ y
• has minimal Euclidean norm among all parameters β minimizing ‖φ_θ(X)β − y‖_2^2,
• is the limit of gradient descent with sufficiently small step size on L(β) := ‖φ_θ(X)β − y‖_2^2 with initialization β^(0) := 0, and
• is the limit of ridge regression with regularization λ > 0 for λ → 0: β = lim_{λ→0} φ_θ(X)^T (φ_θ(X)φ_θ(X)^T + λI_n)^{−1} y.
For a fixed feature map φ, the kernel trick provides a correspondence between ridgeless linear regression with φ and ridgeless kernel regression with the kernel k(x, x̃) := φ(x)^T φ(x̃) via

f_{X,y}(x) = φ(x)^T φ(X)^+ y = φ(x)^T φ(X)^T (φ(X)φ(X)^T)^+ y = k(x, X)^T k(X, X)^+ y,

where k(x, X) := (k(x, x_1), ..., k(x, x_n))^T ∈ R^n and k(X, X) ∈ R^{n×n} is the kernel matrix with entries k(X, X)_{ij} = k(x_i, x_j).
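The three characterizations of β above are easy to check numerically. The following minimal numpy sketch (with a hypothetical tanh feature map chosen purely for illustration, not taken from the paper) verifies in the overparameterized case p > n that the pseudoinverse solution interpolates the data, agrees with the ridge limit λ → 0, and has minimal norm among interpolating parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 10, 3, 20                        # overparameterized: p > n

A = rng.standard_normal((p, d))
phi = lambda X: np.tanh(X @ A.T)           # hypothetical feature map (illustration only)

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
Z = phi(X)                                 # n x p feature matrix phi(X)

beta = np.linalg.pinv(Z) @ y               # ridgeless solution phi(X)^+ y

# limit of ridge regression: Z^T (Z Z^T + lam I_n)^{-1} y for lam -> 0
lam = 1e-10
beta_ridge = Z.T @ np.linalg.solve(Z @ Z.T + lam * np.eye(n), y)
assert np.allclose(beta, beta_ridge, atol=1e-5)

# beta interpolates; adding a null-space direction of Z keeps the fit
# but strictly increases the norm (minimal-norm property)
assert np.allclose(Z @ beta, y, atol=1e-8)
v = np.linalg.svd(Z)[2][-1]                # right singular vector with Z v = 0 (p > n)
assert np.allclose(Z @ (beta + v), y, atol=1e-6)
assert np.linalg.norm(beta + v) > np.linalg.norm(beta)
```

The same computation with p < n would generally not interpolate, but the pseudoinverse and the ridge limit still coincide.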

4. A LOWER BOUND

In this section, we prove our main theorem, which provides a non-asymptotic distribution-independent lower bound on the expected excess risk. The expected excess risk E_{X,y,θ} R_P(f_{X,y,θ}) − R_P(f*_P) can be decomposed into several different contributions (see e.g. d'Ascoli et al., 2020a). In the following, we will focus on the contribution of label noise to the expected excess risk for the estimators f_{X,y,θ} considered in Section 3. Using a bias-variance decomposition with respect to y, it is not hard to show that the label-noise-induced error provides a lower bound for the expected excess risk:

E_Noise := E_{X,y,θ,x} (f_{X,y,θ}(x) − E_{y|X} f_{X,y,θ}(x))^2
≤ E_{X,θ,x} [E_{y|X} (f_{X,y,θ}(x) − E_{y|X} f_{X,y,θ}(x))^2 + (E_{y|X} f_{X,y,θ}(x) − f*_P(x))^2]
= E_{X,y,θ,x} (f_{X,y,θ}(x) − f*_P(x))^2 = E_{X,y,θ} (R_P(f_{X,y,θ}) − R_P(f*_P)).

For linear models as considered here, it is not hard to see that E_Noise does not depend on f*_P and is equal to the expected excess risk in the special case f*_P ≡ 0. In the following, we first consider the setting where the feature map φ is deterministic. We will consider linear regression on z := φ(x) and Z := φ(X) and formulate our assumptions directly w.r.t. the distribution P_Z of z, hiding the dependence on the feature map φ. While the distribution P_X of x is usually fixed and determined by the problem, the distribution P_Z can be actively influenced by choosing a suitable feature map φ. We will analyze in Section 5 how the assumptions on P_Z can be translated back to assumptions on P_X and assumptions on φ.

Remark 1. For typical feature maps, we have p > d, P_Z is concentrated on a d-dimensional submanifold of R^p and the components of z are not independent. A simple example (cf. Proposition 8) is a polynomial feature map φ : R^1 → R^p, x ↦ (1, x, x^2, ..., x^{p−1}). The imposed assumptions on P_Z should hence allow for such distributions on submanifolds and not require independent components.

Definition 2. Assuming that E‖z‖_2^2 < ∞, i.e.
(MOM) in Theorem 3 holds, we can define the (positive semidefinite) second moment matrix Σ := E_{z∼P_Z} zz^T ∈ R^{p×p}. If Ez = 0, Σ is also the covariance matrix of z. If Σ is invertible, i.e. (COV) in Theorem 3 holds, the rows w_i := Σ^{−1/2} z_i of the "whitened" data matrix W := ZΣ^{−1/2} satisfy E w_i w_i^T = I_p. With these preparations, we can now state our main theorem. Its assumptions and the obtained lower bound will be discussed in Section 5 and Section 6, respectively.

Theorem 3 (Main result). Let n, p ≥ 1. Assume that P and φ satisfy:
(INT) Ey^2 < ∞ and hence R_P(f*_P) < ∞,
(NOI) Var(y|z) ≥ σ^2 almost surely over z,
(MOM) E‖z‖_2^2 < ∞, i.e. Σ exists and is finite,
(COV) Σ is invertible,
(FRK) Z ∈ R^{n×p} almost surely has full rank, i.e. rank Z = min{n, p}.
Then, for the ridgeless linear regression estimator f_{Z,y}(z) = z^T Z^+ y, the following holds:
If p ≥ n: E_Noise ≥(I) σ^2 E_Z tr((Z^+)^T ΣZ^+) ≥(II) σ^2 E_Z tr((WW^T)^{−1}) ≥(IV) σ^2 n/(p + 1 − n).
If p ≤ n: E_Noise ≥(I) σ^2 E_Z tr((Z^+)^T ΣZ^+) =(III) σ^2 E_Z tr((W^T W)^{−1}) ≥(V) σ^2 p/(n + 1 − p).
Here, the matrix inverses exist almost surely in the considered cases. Moreover, we have:
• If (NOI) holds with equality, then (I) holds with equality.
• If n = p or Σ = λI_p for some λ > 0, then (II) holds with equality.
For a discussion on how Σ influences E_Noise, we refer to Remark G.1. We can extend Theorem 3 to random features if it holds for almost all of the random feature maps:

Corollary 4 (Random features). Let θ ∼ P_Θ be a random variable such that φ_θ : R^d → R^p is a random feature map. Consider the random features regression estimator f_{X,y,θ}(x) = z_θ^T Z_θ^+ y with z_θ := φ_θ(x) and Z_θ := φ_θ(X). If for P_Θ-almost all θ, the assumptions of Theorem 3 are satisfied for z = z_θ and Z = Z_θ (with the corresponding matrix Σ = Σ_θ), then E_Noise ≥ σ^2 n/(p + 1 − n) if p ≥ n, and E_Noise ≥ σ^2 p/(n + 1 − p) if p ≤ n.
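As a numerical sanity check of the overparameterized bound, one can Monte Carlo estimate σ^2 E_Z tr((Z^+)^T ΣZ^+) and compare it to σ^2 n/(p + 1 − n). The sketch below (illustrative only; the anisotropic spectrum is an arbitrary choice, not from the paper) does this for Gaussian features z ∼ N(0, Σ):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 5, 20, 1.0                 # overparameterized case p >= n
evals = np.linspace(0.2, 3.0, p)          # arbitrary anisotropic spectrum of Sigma
Sigma = np.diag(evals)

trials = 2000
vals = np.empty(trials)
for t in range(trials):
    Z = rng.standard_normal((n, p)) * np.sqrt(evals)   # rows z_i ~ N(0, Sigma)
    Zp = np.linalg.pinv(Z)
    vals[t] = np.trace(Zp.T @ Sigma @ Zp)              # tr((Z^+)^T Sigma Z^+)

e_noise = sigma2 * vals.mean()            # Monte Carlo estimate of E_Noise (with (I) tight)
bound = sigma2 * n / (p + 1 - n)          # Theorem 3, lower bound (IV)
assert np.all(vals > 0)
assert e_noise >= bound
```

Repeating this with other spectra for Σ illustrates that the bound (IV) is uniform over the feature distribution, as claimed.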
The main novelty in Theorem 3 is the explicit uniform lower bound (IV) for p ≥ n. The lower bound (V) for p ≤ n follows by adapting Corollary 1 in Mourtada (2019). Statements similar to (I), (II) and (III) have also been proven, see e.g. Hastie et al. (2019) and Theorem 1 in Muthukumar et al. (2020). However, as discussed in Section 1, Hastie et al. (2019) make significantly stronger assumptions for computing the expectation. In Appendix L, we explain in more detail that the probabilistic overparameterized lower bound of Muthukumar et al. (2020) is not distribution-independent and only applies to a smaller class of distributions than our lower bound. For a discussion on how Theorem 3 applies to kernels with finite-dimensional feature space, we refer to Appendix K.

5. WHEN ARE THE ASSUMPTIONS SATISFIED?

In this section, we discuss the assumptions of Theorem 3 and provide several results that help to verify these assumptions for various input distributions and feature maps. The theory will be particularly nice for analytic feature maps, which we define now:

Definition 5 (Analytic function). A function f : R^d → R is called (real) analytic if for all z ∈ R^d, the Taylor series of f around z converges to f in a neighborhood of z. A function f : R^d → R^p, z ↦ (f_1(z), ..., f_p(z)) is called (real) analytic if f_1, ..., f_p are analytic.

Sums, products and compositions of analytic functions are analytic, cf. e.g. Section 2.2 in Krantz & Parks (2002). We will discuss examples of analytic functions later in this section.

Proposition 6 (Characterization of (COV) and (FRK)). Consider the setting of Theorem 3 and let FRK(n) be the statement that (FRK) holds for n. Then:
(i) Let n ≥ 1. Then, FRK(n) holds iff P(z ∈ U) = 0 for all linear subspaces U ⊆ R^p of dimension min{n, p} − 1.
(ii) Let (MOM) hold. Then, (COV) holds iff P(z ∈ U) < 1 for all linear subspaces U ⊆ R^p of dimension p − 1.

Assuming that (MOM) holds such that (COV) is well-defined, consider the following statements: With this in mind, we can characterize the assumptions now:
• The assumption (INT) is standard (see e.g. Section 1.6 in Györfi et al., 2002) and guarantees R_P(f*_P) < ∞, such that the excess risk is well-defined.
• The assumption (MOM) is likewise standard (cf. Györfi et al., 2002).
• The assumptions (COV) and FRK(n) are implied by FRK(p) and are even equivalent to FRK(p) in the underparameterized case p ≤ n. In the following, we will therefore focus on proving FRK(p) for various φ and P_X. In the case p = n, FRK(p) ensures that f_{X,y} almost surely interpolates the data, or equivalently that the kernel matrix k(X, X) almost surely has full rank.
Importantly, assuming FRK(p) is weaker than assuming a strictly positive definite kernel, since strictly positive definite kernels require p = ∞. Example D.1 shows that the assumption (FRK) in Theorem 3 cannot be removed. For the analytic function φ = id with d = p, Proposition 6 yields a simple sufficient criterion: If z has a Lebesgue density, then (FRK) holds for all n. This assumption is already more realistic than assuming i.i.d. components. However, Proposition 6 is also very useful for other analytic feature maps, as we will see in the remainder of this section.

Remark 7. Suppose that φ ≢ 0 is analytic, x has a Lebesgue density and (INT), (MOM) and (NOI) are satisfied. Let U := span{φ(x) | x ∈ R^d}. Since φ ≢ 0, p̃ := dim U ≥ 1, and (d) in Proposition 6 holds iff dim U = p. If (d) does not hold, the lower bound from Theorem 3 holds with p replaced by p̃ < p: Take any isometric isomorphism ψ : U → R^{p̃} and define the feature map φ̃ : R^d → R^{p̃}, x ↦ ψ(φ(x)). Then, φ̃ is analytic since ψ is linear, and φ̃ satisfies (d), hence Theorem 3 can be applied to φ̃. However, φ and φ̃ lead to the same kernel k since ψ is isometric, hence to the same estimator f_{X,y} by Eq. (2) and hence to the same E_Noise.

Proposition 8 (Polynomial kernel). Let m, d ≥ 1 and c > 0. For x, x̃ ∈ R^d, define the polynomial kernel k(x, x̃) := (x^T x̃ + c)^m. Then, there exists a feature map φ : R^d → R^p with p := \binom{m+d}{m} such that: (a) k(x, x̃) = φ(x)^T φ(x̃) for all x, x̃ ∈ R^d.

Proposition 8 says that the lower bound from Theorem 3 holds for ridgeless kernel regression with the polynomial kernel with p := \binom{m+d}{m} if x has a Lebesgue density and E‖z‖_2^2 = Ek(x, x) = E(‖x‖_2^2 + c)^m < ∞. The proof of Proposition 8 can be extended to the case c = 0, where one needs to choose p = \binom{m+d-1}{m}.
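For d = 1, the monomial feature map from Remark 1 makes (FRK) concrete: the feature matrix Z is a Vandermonde matrix, which has full rank exactly when the inputs x_i are distinct, and inputs with a Lebesgue density are almost surely distinct. A small numpy illustration (sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 8                               # overparameterized here; any n, p works

x = rng.standard_normal(n)                # x_i with a Lebesgue density -> a.s. distinct
Z = np.vander(x, N=p, increasing=True)    # rows phi(x_i) = (1, x_i, ..., x_i^{p-1})

assert np.linalg.matrix_rank(Z) == min(n, p)   # (FRK) holds

# full rank fails exactly when two inputs coincide,
# i.e. when the input distribution puts an atom on some point
x_bad = x.copy()
x_bad[1] = x_bad[0]                       # duplicate one input
Z_bad = np.vander(x_bad, N=p, increasing=True)
assert np.linalg.matrix_rank(Z_bad) < min(n, p)
```

The same rank check applies verbatim to any explicit feature map, which is how (FRK) can be probed numerically before invoking the analytic-function results below.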
In general, we discuss in Appendix K that Theorem 3 can often be applied to ridgeless kernel regression, where p needs to be chosen as the minimal feature space dimension for which k can still be represented. We can also extend our theory to analytic random feature maps:

Proposition 9 (Random feature maps). Consider feature maps φ_θ : R^d → R^p with (random) parameter θ ∈ R^q. Suppose the map (θ, x) ↦ φ_θ(x) is analytic and that θ and x are independent and have Lebesgue densities. If there exist fixed θ̃ ∈ R^q, X̃ ∈ R^{p×d} with det(φ_θ̃(X̃)) ≠ 0, then almost surely over θ, (FRK) holds for all n for z = φ_θ(x).

In Appendix C, we demonstrate that Proposition 9 can be used to computationally verify (FRK) for analytic random feature maps. Up until now, we have assumed that x has a Lebesgue density. It is desirable to weaken this assumption, such that x can, for example, be concentrated on a submanifold of R^d. It is necessary for FRK(p) with p ≥ 2 that x is nonatomic, such that the x_i are distinct with probability one. In general, this is not sufficient: For example, if φ = id and x lives on a proper linear subspace of R^d, FRK(p) is not satisfied. Perhaps surprisingly, we will show next that for random neural network feature maps, it is indeed sufficient that x is nonatomic.^2 Especially, our lower bound in Corollary 4 thus applies to a large class of feedforward neural networks where only the last layer is trained (and initialized to zero, such that gradient descent converges to the Moore-Penrose pseudoinverse).

Theorem 10 (Random neural networks). Let d, p, L ≥ 1, let σ : R → R be analytic and let the layer sizes be d_0 = d, d_1, ..., d_{L−1} ≥ 1 and d_L = p. Let W^(l) ∈ R^{d_{l+1}×d_l} for l ∈ {0, ..., L−1} be random variables and consider the following two cases:
(a) σ is not a polynomial with less than p nonzero coefficients, θ := (W^(0), ..., W^(L−1)), and the random feature map φ_θ : R^d → R^p is recursively defined by φ_θ(x^(0)) := x^(L), x^(l+1) := σ(W^(l) x^(l)).
(b) σ is not a polynomial of degree < p − 1, θ := (W^(0), ..., W^(L−1), b^(0), ..., b^(L−1)) with random variables b^(l) ∈ R^{d_{l+1}} for l ∈ {0, ..., L−1}, and the random feature map φ_θ : R^d → R^p is recursively defined by φ_θ(x^(0)) := x^(L), x^(l+1) := σ(W^(l) x^(l) + b^(l)).
In both cases, if θ has a Lebesgue density and x is nonatomic, then (FRK) holds for all n and almost surely over θ.

The assumptions of Theorem 10 on θ are satisfied by the classical initialization methods of He et al. (2015) and Glorot & Bengio (2010). Possible choices of σ are presented in Table 1. A statement similar to Theorem 10 has been proven in Lemma 4.4 in Nguyen & Hein (2017). However, their statement only applies to networks with bias and to a more restricted class of activation functions: For example, the activation functions RBF, GELU, SiLU/Swish, Mish, sin and cos are not covered by their assumptions. In Appendix E, we explain that the proofs of many other theorems in the literature similar to Theorem 10 for single-layer networks are incorrect.

Table 1: Some examples of analytic activation functions that are not polynomials. The CDF of N(0, 1) is denoted by Φ. Other non-polynomial analytic activation functions are sin, cos or erf.

Activation function — Equation
Sigmoid: sigmoid(x) = 1/(1 + e^{−x})
Hyperbolic tangent: tanh(x) = (e^x − e^{−x})/(e^x + e^{−x})
Softplus: softplus(x) = log(1 + e^x)
RBF: RBF(x) = exp(−βx^2)
GELU (Hendrycks & Gimpel, 2016): GELU(x) = xΦ(x)
SiLU (Elfwing et al., 2018): SiLU(x) = x sigmoid(x)
Swish (Ramachandran et al., 2017): Swish(x) = x sigmoid(βx)
Mish (Misra, 2019): Mish(x) = x tanh(softplus(x))

^2 This appears to be a convenient consequence of randomizing the feature map: For each fixed feature map φ_θ, there may be an exceptional set E_θ of nonatomic input distributions P_X for which FRK(p) is not satisfied. However, Theorem 10 shows that for each nonatomic input distribution P_X, the set {θ | P_X ∈ E_θ} is a Lebesgue null set.

For (deterministic) feature maps where it is not possible to prove FRK(p) for all nonatomic P_X, there is a trick that works in certain cases: If x lives on a submanifold (e.g. a sphere), we might be able to write x = ψ(v), where v has a Lebesgue density and ψ is analytic (e.g. the stereographic projection for the sphere). Then, we can apply our theory to the analytic feature map φ_θ ∘ ψ. Under the assumptions of Theorem 10, if ‖x‖_2 is bounded, (MOM) holds for all θ. This follows since any analytic function φ is also continuous, and continuous functions preserve boundedness. However, the activation functions from Table 1 even satisfy |σ(x)| ≤ a|x| + b for some a, b ≥ 0 and all x ∈ R. In this case, it is not hard to see that (MOM) already holds for all θ if E‖x‖_2^2 < ∞.

Theorem 10 does not hold for ReLU, ELU (Clevert et al., 2015), SELU (Klambauer et al., 2017) or other activation functions with a perfectly linear part. To see this, observe that with nonzero probability, all weights and inputs are positive. In this case, the feature map acts as a linear map from R^d to R^p, and if d < p = n, the output matrix φ(X) cannot be invertible. In Appendix I, we show that (FRK) is satisfied for random Fourier features if x is nonatomic and the frequency distribution (i.e. the Fourier transform of the shift-invariant kernel) has a Lebesgue density.
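The contrast between analytic activations and activations with a perfectly linear part can be observed directly. In the sketch below (a toy two-layer network in the spirit of case (a) of Theorem 10, with arbitrarily chosen sizes), the tanh feature matrix has full rank, while ReLU with all-positive weights and inputs acts linearly, so its feature matrix cannot have rank larger than d:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d1 = 2, 16
n = p = 12                                 # d < p = n, the critical case

def features(X, W0, W1, act):
    # two-layer network without biases, activation applied in every layer
    return act(act(X @ W0.T) @ W1.T)

X = rng.standard_normal((n, d))
W0 = rng.standard_normal((d1, d)) / np.sqrt(d)
W1 = rng.standard_normal((p, d1)) / np.sqrt(d1)

# analytic activation: phi(X) is almost surely invertible (Theorem 10)
Z = features(X, W0, W1, np.tanh)
assert np.linalg.matrix_rank(Z) == n

# ReLU with all-positive weights and inputs acts as a linear map R^d -> R^p,
# so the feature matrix has rank at most d < n and (FRK) fails on this event
relu = lambda t: np.maximum(t, 0)
Z_lin = features(np.abs(X), np.abs(W0), np.abs(W1), relu)
assert np.linalg.matrix_rank(Z_lin) <= d
```

The positive-weights event has nonzero probability under the usual initializations, which is exactly the obstruction described above for ReLU-type activations.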

6. QUALITY OF THE LOWER BOUND

In this section, we discuss how sharp the lower bound from Theorem 3 is. To this end, we assume that Var(y|z) = σ^2 almost surely over z. In their Lemma 3, Hastie et al. (2019) consider the case where z has i.i.d. entries with zero mean, unit variance and finite fourth moment. They then use the Marchenko-Pastur law to show in the limit p, n → ∞, p/n → γ > 1 that E_Noise → σ^2/(γ − 1). In this limit, our lower bound shows

E_Noise ≥ σ^2 n/(p + 1 − n) = σ^2/(p/n + 1/n − 1) → σ^2/(γ − 1),

hence, in this sense, our overparameterized bound is asymptotically sharp. An analogous argument shows that the underparameterized bound is also asymptotically sharp. In order to better understand to which extent our lower bound is non-asymptotically sharp in the over- and underparameterized regimes, we explicitly compute E_Noise = σ^2 E_Z tr((Z^+)^T ΣZ^+) (cf. Theorem 3) for some distributions P_Z:

Theorem 11. Let P_Z = U(S^{p−1}). Then, P_Z satisfies the assumptions (MOM), (COV) and (FRK) for all n with Σ = (1/p) I_p. Moreover, for n ≥ p = 1 or p ≥ n ≥ 1, we can compute

E_Z tr((Z^+)^T ΣZ^+) =
  1/n if n ≥ p = 1,
  1/p if p ≥ n = 1,
  ∞ if 2 ≤ n ≤ p ≤ n + 1,
  (n/(p − 1 − n)) · ((p − 2)/p) if 2 ≤ n ≤ n + 2 ≤ p.

The formulas in the next theorem have already been computed by Breiman & Freedman (1983) for p ≤ n − 2 and by Belkin et al. (2019b) for general p. Our alternative proof circumvents a technical problem in their proof for p ∈ {n − 1, n, n + 1}, cf. Appendix J.

Theorem 12. Let P_Z = N(0, I_p). Then, P_Z satisfies the assumptions (MOM), (COV) and (FRK) for all n with Σ = I_p. Moreover, for n, p ≥ 1,

E_Z tr((Z^+)^T ΣZ^+) =
  n/(p − 1 − n) if p ≥ n + 2,
  ∞ if p ∈ {n − 1, n, n + 1},
  p/(n − 1 − p) if p ≤ n − 2.

For P_Z = N(0, I_p) and the lower bound from Theorem 3, the formulas for the under- and overparameterized cases can be obtained from each other by switching the roles of n and p. Numerical data strongly suggests that this symmetry does not hold exactly for P_Z = U(S^{p−1}).
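The closed-form expressions of Theorem 12 are easy to confirm by Monte Carlo simulation in the cases where the expectation is finite. A small illustrative check (sample sizes and tolerances chosen ad hoc):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_trace(n, p, trials=4000):
    """Monte Carlo estimate of E_Z tr((Z^+)^T Z^+) for Z with i.i.d. N(0,1) entries."""
    vals = np.empty(trials)
    for t in range(trials):
        Zp = np.linalg.pinv(rng.standard_normal((n, p)))
        vals[t] = np.trace(Zp.T @ Zp)      # Sigma = I_p, so this is tr((Z^+)^T Sigma Z^+)
    return vals.mean()

# overparameterized, p >= n + 2: closed form n / (p - 1 - n)
assert abs(mc_trace(5, 20) - 5 / 14) < 0.05
# underparameterized, p <= n - 2: closed form p / (n - 1 - p)
assert abs(mc_trace(20, 5) - 5 / 14) < 0.05
```

For p ∈ {n − 1, n, n + 1}, the Monte Carlo estimate does not converge as the number of trials grows, which is consistent with the expectation being infinite near the interpolation threshold.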
For p ≥ n + 2, we can relate our lower bound (Theorem 3), the result for the sphere (Theorem 11) and the result for the Gaussian distribution (Theorem 12) as follows:

n/(p + 1 − n) = (n/(p − 1 − n)) · ((p + 1 − n) − 2)/(p + 1 − n) ≤ (n/(p − 1 − n)) · (p − 2)/p < n/(p − 1 − n).

These values are also shown in Figure 1 together with empirical values of NN feature maps specifically optimized to minimize E_Noise.

Figure 1: Various estimates and bounds for E_Noise relative to E_Noise for P_Z = U(S^{p−1}), using Var(y|z) = 1 and p = 30. The plotted curves are U(S^{p−1}) (empirical and, for n ≤ p, analytic), N(0, I_p) (analytic), feature maps optimized for n = 15 and for n = 60, and the lower bound. The optimized curves correspond to multi-layer NN feature maps whose parameters were trained to minimize E_Noise for n = 15 or n = 60. More experiments and details on the setup can be found in Appendix C. We do not plot estimates for n ∈ {28, ..., 32} since they have high estimation variances.

Since Σ affects E_Noise only for p > n (cf. Remark G.1), it is not surprising that the feature map optimized for n = 60 > 30 = p performs badly in the overparameterized regime. The results in Figure 1 support the hypothesis that among all P_Z satisfying (MOM), (COV) and (FRK), P_Z = U(S^{p−1}) minimizes E_Z tr((Z^+)^T ΣZ^+). The plausibility of this hypothesis is further discussed in Remark G.3. We can prove the hypothesis for n = 1 or p = 1 since in this case, the results from Theorem 11 are equal to the lower bound from Theorem 3. When using a continuous feature map φ : R^d → R^p with d ≤ p − 2, it seems to be unclear at first whether the low E_Noise of P_Z = U(S^{p−1}) can even be achieved. We show in Proposition J.2 that this is possible using space-filling curve feature maps.
The results in this section and in Appendix C show that while our lower bound presumably does not fully capture the height of the interpolation peak at p ≈ n, it is quite sharp in the practically relevant regimes p ≪ n and p ≫ n, irrespective of d.

A OVERVIEW

The appendix is structured as follows: In Appendix B, we provide an overview of some matrix identities used throughout the paper. We provide additional numerical experiments for various (random) feature maps in Appendix C. The counterexample given in Appendix D shows that the assumption (FRK) cannot be omitted from Theorem 3. In Appendix E, we provide an overview of full-rank theorems for random neural networks in the literature and explain why their proofs are incorrect. We then prove our main results in Appendix F before discussing consequences in Appendix G. In Appendix H, we give proofs for the theorems and propositions from Section 5. In addition, we prove (FRK) for random Fourier features in Appendix I. Finally, proofs for the statements from Section 6 are given in Appendix J. Whenever we prove theorems or propositions from the main paper (like Theorem 3) in the appendix, we repeat their statement before the proof for convenience. In contrast, new theorems or propositions are numbered according to the section they are stated in, e.g. Proposition I.1.

B MATRIX ALGEBRA

In the following, we will present some facts about matrices that are relevant to this paper. For a general reference, we refer to textbooks on the subject (e.g. Golub & Van Loan, 1989; Bhatia, 2013). Let n, p ≥ 1 and let k := min{n, p}. The singular value decomposition (SVD) of a matrix A ∈ R^{n×p} is a decomposition A = UDV^T into matrices U ∈ R^{n×k}, V ∈ R^{p×k} with orthonormal columns, i.e. U^T U = V^T V = I_k, and a diagonal matrix D ∈ R^{k×k} with non-negative diagonal elements s_1(A) ≥ ... ≥ s_k(A) called singular values. For a given symmetric square matrix A ∈ R^{n×n} with eigenvalues λ_1(A) ≥ ... ≥ λ_n(A), the trace satisfies tr(A) = Σ_{i=1}^n A_{ii} = Σ_{i=1}^n λ_i(A). The trace is linear and invariant under cyclic permutations. We use this multiple times in arguments of the following type: If v ∈ R^p and A ∈ R^{p×p} are stochastically independent (e.g. because A is constant), we can write

E_v v^T Av = E_v tr(v^T Av) = E_v tr(Avv^T) = tr(A E_v vv^T).

Moreover, if A ≻ 0, then λ_i(A) > 0 for all i ∈ [n] and λ_{n+1−i}(A^{−1}) = 1/λ_i(A). For a positive definite matrix Σ ∈ R^{p×p}, the SVD and the eigendecomposition coincide as Σ = U diag(λ_1(Σ), ..., λ_p(Σ)) U^T and we can define the inverse square root as

Σ^{−1/2} := U diag(λ_1(Σ)^{−1/2}, ..., λ_p(Σ)^{−1/2}) U^T ≻ 0,

which is the unique symmetric positive definite matrix satisfying (Σ^{−1/2})^2 = Σ^{−1}. By the Courant-Fischer-Weyl theorem, two symmetric matrices A, B ∈ R^{n×n} with A ⪯ B satisfy

λ_i(A) = max_{V ⊆ R^n subspace, dim V = i} min_{v ∈ V, ‖v‖_2 = 1} v^T Av ≤ max_{V ⊆ R^n subspace, dim V = i} min_{v ∈ V, ‖v‖_2 = 1} v^T Bv = λ_i(B).

Let A ∈ R^{n×p}. The Moore-Penrose pseudoinverse A^+ of A satisfies the following relations (see e.g. Section 1.1.1 in Wang et al., 2018):
• Suppose A has the SVD A = UDV^T, where D = diag(s_1, ..., s_k), k := min{n, p}. Using the convention 1/0 := 0, we can write A^+ = VD^+U^T, where D^+ := diag(1/s_1, ..., 1/s_k).
• A^+ = (A^T A)^+ A^T = A^T (AA^T)^+.
• If A is invertible, then A^+ = A^{−1}.
• A^+ (A^+)^T = (A^T A)^+.
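The pseudoinverse relations above can be verified numerically; the following numpy sketch (with arbitrarily chosen matrix sizes) checks them for a random rectangular matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 7))            # n = 4, p = 7
Ap = np.linalg.pinv(A)

# A^+ = (A^T A)^+ A^T = A^T (A A^T)^+
assert np.allclose(Ap, np.linalg.pinv(A.T @ A) @ A.T)
assert np.allclose(Ap, A.T @ np.linalg.pinv(A @ A.T))

# A^+ (A^+)^T = (A^T A)^+
assert np.allclose(Ap @ Ap.T, np.linalg.pinv(A.T @ A))

# for an invertible matrix, the pseudoinverse is the ordinary inverse
B = rng.standard_normal((5, 5))
assert np.allclose(np.linalg.pinv(B), np.linalg.inv(B))
```

The identity A^+ = A^T (AA^T)^+ is the one used in Section 3 to rewrite the ridgeless estimator in kernel form.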
We will also use a basic fact on the Schur complement that, for example, is outlined in Appendix A.5.5 in Boyd & Vandenberghe (2004): If

0 ≺ A = [ A_11  A_12 ; A_21  A_22 ] ∈ R^{(m_1+m_2)×(m_1+m_2)} ,

then A_22 ≻ 0 and A_11 − A_12 A_22^{−1} A_21 ≻ 0, and the top-left m_1 × m_1 block of A^{−1} is given by (A_11 − A_12 A_22^{−1} A_21)^{−1}.
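This block-inverse fact can also be verified numerically. A minimal sketch, with arbitrarily chosen block sizes m_1 = 2, m_2 = 3:

```python
# Check the Schur-complement formula for the top-left block of the inverse
# of a positive definite matrix.
import numpy as np

rng = np.random.default_rng(1)
m1, m2 = 2, 3
B = rng.standard_normal((m1 + m2, m1 + m2))
A = B @ B.T + (m1 + m2) * np.eye(m1 + m2)  # positive definite by construction

A11, A12 = A[:m1, :m1], A[:m1, m1:]
A21, A22 = A[m1:, :m1], A[m1:, m1:]

schur = A11 - A12 @ np.linalg.inv(A22) @ A21
assert np.all(np.linalg.eigvalsh(A22) > 0)    # A_22 is positive definite
assert np.all(np.linalg.eigvalsh(schur) > 0)  # Schur complement is positive definite
assert np.allclose(np.linalg.inv(A)[:m1, :m1], np.linalg.inv(schur))
```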

C EXPERIMENTS

In the following, we experimentally investigate E_Noise for different (random) feature maps. We will first give an overview of the plots and then explain the details of how they were generated and some implications. More details can be found in the provided code. All empirical curves show one estimated standard deviation of the mean estimator as a shaded area around the estimated mean, but this standard deviation is sometimes too small to be visible. We assume Var(y|z) = 1 almost surely over z, such that (I) in Theorem 3 holds with equality with σ² = 1. For the NTK feature maps considered below, it is unclear how to define such a sequence of feature maps with varying p. For this reason, we will plot their results only with varying n. Double descent as a function of the number of samples n has, for example, been pointed out by Nakkiran et al. (2019) and Nakkiran (2019). As the input distribution P_X, we use N(0, I_d). We initialize the NN weights independently as W^(l)_ij ∼ N(0, 1/V_l), where V_l := d_0 if l = 0 and V_l := d_l Var_{u∼N(0,1)}(σ(u)) if l ≥ 1. Here, Var_{u∼N(0,1)}(σ(u)) is approximated once using 10^4 samples for u. This initialization is similar (and for the ReLU activation essentially equivalent) to the initialization method by He et al. (2015). The initialization variances are chosen such that for a fixed input x with ||x||_2 ≈ 1, the pre-activations in each layer are approximately N(0, 1)-distributed. For the NTK feature maps, we consider a random network given by

φ̃_θ: R^d → R^1, x ↦ (1/√V_1) W^(1) σ((1/√V_0) W^(0) x)

with W^(l)_ij ∼ N(0, 1) i.i.d. Our input distribution is again P_X = N(0, I_d). The NTK feature map is then defined as

φ_θ: R^d → R^p, x ↦ (∂/∂θ) φ̃_θ(x) ,

where p = 6·4 + 1·6 = 30 is the number of parameters in θ. Note that moving the variances V_l outside of the W^(l) does not affect the forward pass but only the backward pass (i.e. the derivatives).
While we have not theoretically established the properties (FRK) and (COV) for such feature maps, we can do this experimentally for analytic activation functions like sigmoid, tanh, softplus and GELU: Since the random NTK feature map is a derivative of the analytic random NN feature map, it is also analytic. By Proposition 9, if (FRK) does not hold, then for every fixed θ, the range of φ_θ must be contained in a proper linear subspace of R^p, and therefore Z = φ_θ(X) never has full rank for n ≥ p. In this case, the singular value s_p(Z) would be zero, and even accounting for numerical errors, the "inverse condition number" s_p(Z)/s_1(Z) should at most be of the order of 64-bit float machine precision, i.e. around 10^{−16}. However, among 10^3 samples of Z for n := 90 ≥ 30 = p, the maximum observed "inverse condition number" was greater than 10^{−3} for all of the activation functions sigmoid, tanh, softplus and GELU. Hence, by Proposition 6 and Proposition 9, we can confidently conclude that (COV) and (FRK) hold for all n almost surely over θ for this network size and these activation functions.

Estimation of E_Noise. In order to estimate E_Noise, we proceed as follows: Recall from Section 3 that ridgeless regression is the limit of ridge regression for λ → 0. We use a small regularization of λ = 10^{−12} in order to improve numerical stability. Also for numerical stability, we use a singular value decomposition (SVD) Z = U diag(s_1, ..., s_k) V^T with k := min{n, p} as in Appendix B. The regularized approximation of Z^+ is then

Z^+ ≈ (Z^T Z + λ I_p)^{−1} Z^T = V diag(s_1/(s_1² + λ), ..., s_k/(s_k² + λ)) U^T .

We can then estimate

tr((Z^+)^T Σ Z^+) = tr(Z^+ (Z^+)^T Σ) ≈ tr(V diag(s_1²/(s_1² + λ)², ..., s_k²/(s_k² + λ)²) V^T Σ) .  (4)

We then obtain m = 10^4 (non-ReLU NNs, polynomial kernel, high-variance random Fourier features) or m = 10^5 (all other empirical results) sampled estimates of tr((Z^+)^T Σ Z^+), yielding a Monte-Carlo estimate of E tr((Z^+)^T Σ Z^+), by repeating the following procedure m times: (1) Sample a random parameter θ. (2) Sample random matrices X ∈ R^{n×d} and X̃ ∈ R^{l×d}, l = 10^4, with i.i.d. P_X-distributed rows. (3) Compute Z := φ_θ(X) and Z̃ := φ_θ(X̃). (4) Compute the estimate Σ̃ := (1/l) Z̃^T Z̃. (5) Compute a regularized estimate of tr((Z^+)^T Σ Z^+) using the SVD of Z and Eq. (4). For performance reasons, we make the following modification of steps (2) and (5): Since we perform the computation for all n ∈ [256], we sample X ∈ R^{256×d} and then, for all n ∈ [256], perform the computation for n using the first n rows of Z. Hence, the estimates for different n are not independent.
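The estimation procedure above can be sketched as follows. This is a minimal re-implementation, not the paper's code: it uses the identity feature map φ(x) = x (so there is no parameter θ and the true Σ is I_d), a single value of n, and much smaller values of m and l than in the actual experiments.

```python
# Monte-Carlo sketch of steps (1)-(5): estimate E tr((Z+)^T Sigma Z+) with the
# regularized SVD formula from Eq. (4). Sizes n, d, l, m are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, d, l, m, lam = 12, 4, 2000, 200, 1e-12

def reg_trace(Z, Sigma, lam):
    # tr(V diag(s_i^2 / (s_i^2 + lam)^2) V^T Sigma), as in Eq. (4)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    w = s**2 / (s**2 + lam)**2
    return float(np.trace((Vt.T * w) @ Vt @ Sigma))

estimates = []
for _ in range(m):
    X = rng.standard_normal((n, d))         # step (2): training inputs
    X_tilde = rng.standard_normal((l, d))   # step (2): inputs for estimating Sigma
    Z, Z_tilde = X, X_tilde                 # step (3): identity feature map
    Sigma_hat = Z_tilde.T @ Z_tilde / l     # step (4)
    estimates.append(reg_trace(Z, Sigma_hat, lam))  # step (5)

e_noise_hat = float(np.mean(estimates))
# For Gaussian features with Sigma = I_d and p = d <= n, E tr((W^T W)^{-1})
# equals d / (n - d - 1) exactly (inverse Wishart), here 4/7.
assert abs(e_noise_hat - 4 / 7) < 0.25
```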

Curious ReLU results

The curves for the ReLU NNs and ReLU NTKs in the underparameterized regime p ≤ n may seem odd. The locally almost constant "plateaus" are presumably an artefact of the non-independent estimates for different n, as explained in the last paragraph. As discussed in Section 5, networks with ReLU, ELU or SELU activations do not satisfy (FRK). It seems plausible that, since both "halves" of the ReLU function are linear, ReLU networks have a significantly higher chance than SELU or ELU networks of being initialized with "bad" parameters that are likely to generate "degenerate" feature matrices Z that do not have full rank at n = p and only become full-rank for some n > p. When inspecting the data underlying the plots, the estimate of E_Noise for ReLU networks in the underparameterized regime seems to be dominated by a few outliers. It seems that the distribution of tr((Z^+)^T Σ Z^+) for ReLU networks is often so heavy-tailed that computing more Monte Carlo samples does not significantly reduce the estimation uncertainty.

Whitening. For computing E((W W^T)^{−1}) = E((Z Σ^{−1} Z^T)^{−1}) in the overparameterized case in Figure C.2, we regularize both matrix inversions on the right-hand side as above: For a symmetric matrix A ∈ R^{m×m}, we use a symmetric eigendecomposition A = U diag(s_1, ..., s_m) U^T and approximate

A^{−1} ≈ U diag(s_1/(s_1² + λ), ..., s_m/(s_m² + λ)) U^T .

Optimization. For our optimized feature maps in Figure 1, we use feature maps of the form

φ_θ(x) = σ(b^(2) + V_2^{−1/2} W^(2) σ(b^(1) + V_1^{−1/2} W^(1) σ(b^(0) + V_0^{−1/2} W^(0) x)))

with independent initialization W^(l)_ij ∼ N(0, 1), b^(l)_i = 0. As the input distribution, we use P_X = N(0, I_d). For given θ, let Σ_θ := E_x φ_θ(x) φ_θ(x)^T, i.e. we define the second moment matrix Σ as depending on θ. We then optimize the loss function L(θ) := E_X tr((φ_θ(X)^+)^T Σ_θ φ_θ(X)^+) using AMSGrad (Reddi et al., 2018) with a learning rate that linearly decays from 10^{−3} to 0 over 1000 iterations.
In order to approximate L(θ) in each iteration, we approximate Σ_θ using 1000 Monte Carlo points and we draw 1024 different realizations of X (this can be considered as using batch size 1024). We also use a regularized version as in Eq. (3), but we omit the SVD for reasons of differentiability.

D A COUNTEREXAMPLE

Example D.1. Let p ≥ 2. Consider the uniform distribution P_Z on an orthonormal basis {e_1, ..., e_p} ⊆ R^p. Then, Σ = (1/p) ∑_{i=1}^p e_i e_i^T = (1/p) I_p and hence (COV) is satisfied. However, from Proposition 6 it is easy to see that for any n ≥ 2, (FRK) is not satisfied. Indeed, if the vector e_i occurs m_i times among the samples z_1, ..., z_n, then Z^T Z = diag(m_1, ..., m_p). Assuming Var(y|z) = σ² := 1 for all z, we then obtain from Theorem 3 with the convention 1/0 := 0:

E_Noise = E_Z tr((Z^+)^T Σ Z^+) = (1/p) E_Z tr(Z^+ (Z^+)^T) = (1/p) E_Z tr((Z^T Z)^+) = (1/p) ∑_{i=1}^p E[1/m_i] = E[1/m_1] ,

where m_1 follows a binomial distribution with parameters n and 1/p. Using E[1/m_1] ≤ E[1] = 1, it is easy to see that the lower bound is violated for some values of n. For example, Figure D.1 shows that for p = 30, the lower bound is violated in a large region, especially in the overparameterized regime. The underlying reason is that the function x ↦ 1/x is convex on (0, ∞), but not on [0, ∞). If Z has singular values s_i, the pseudoinverse Z^+ has singular values 1/s_i. If s_i = 0 is possible, we cannot use a convexity-based argument anymore. This example is related to histogram regression (see e.g. Györfi et al., 2002): For example, suppose that P_X is supported on a domain D ⊆ R^d and this domain is partitioned into disjoint sets A_1, ..., A_p. Then, performing histogram regression on this partition is equivalent to performing ridgeless linear regression with the feature map φ: R^d → R^p with φ(x) := e_i if x ∈ A_i. If all cells are equally likely, i.e. P_X(A_i) = 1/p for all i ∈ {1, ..., p}, then P_Z is the uniform distribution on {e_1, ..., e_p} as in Example D.1.
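Since m_1 is binomial, E[1/m_1] has a closed form as a finite sum, so the violation can be checked exactly. A small sketch (the choice n = 20, p = 30 is one point in the overparameterized region of Figure D.1):

```python
# Exact evaluation of E_Noise = E[1/m_1] (with the convention 1/0 := 0) from
# Example D.1, where m_1 ~ Bin(n, 1/p), versus the lower bound of Theorem 3.
from math import comb

def e_noise_orthonormal(n, p):
    q = 1.0 / p
    return sum(comb(n, m) * q**m * (1 - q)**(n - m) / m for m in range(1, n + 1))

def theorem3_bound(n, p, sigma2=1.0):
    return sigma2 * (n / (p + 1 - n) if p >= n else p / (n + 1 - p))

n, p = 20, 30  # overparameterized, as in Figure D.1 with p = 30
assert e_noise_orthonormal(n, p) < theorem3_bound(n, p)  # lower bound is violated
```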

E FULL-RANK RESULTS FOR RANDOM WEIGHT NEURAL NETWORKS

In the literature on neural networks with random weights (Schmidt et al., 1992) and the later, virtually identical Extreme Learning Machine (ELM) (Huang et al., 2006), properties similar to (FRK) have been investigated for random neural network feature maps. In the following, we review the relevant approaches known to us and explain why most of them are flawed. First, Sartori & Antsaklis (1991) state a result where the assumptions are not clearly specified, but which could look as follows:

Claim 1 (Sartori & Antsaklis (1991), informally discussed after Lemma 1). For n = p, consider a single-layer random neural network with weights W^(0) ∈ R^{(n−1)×d} and signum activation σ that appends a 1 to its output: φ_θ: R^d → R^p, x ↦ (σ(W^(0) x)^T, 1)^T. If x_1, ..., x_n ∈ R^d are distinct, then for almost all W^(0), φ_θ(X) is invertible.

This claim is evidently false for n ≥ 2. For example, the following argument works for n ≥ 3: For τ ∈ {−1, 0, 1}^n, let U_τ := {w ∈ R^d | σ(w^T x_j) = τ_j for all j}. Then, ∪_{τ ∈ {−1,0,1}^n} U_τ = R^d and since {−1, 0, 1}^n is finite, there exists τ* such that U_{τ*} is not a Lebesgue null set. Hence the set W := {W^(0) ∈ R^{(n−1)×d} | at least two rows of W^(0) are in U_{τ*}} is not a Lebesgue null set. But for all W^(0) ∈ W, the matrix φ_θ(X) = (σ(X (W^(0))^T) | 1_{n×1}) has two columns equal to τ* and is therefore not invertible, contradicting Claim 1. Second, Tamura & Tateishi (1997) state a result where the assumptions are not clearly formulated, but which could look as follows:

Claim 2 (Tamura & Tateishi (1997)). For n = p, consider a single-layer random neural network with weights W^(0) ∈ R^{(n−1)×d}, biases b^(0) ∈ R^{n−1} and sigmoid activation σ that appends a 1 to its output: φ_θ: R^d → R^p, x ↦ (σ(W^(0) x + b^(0))^T, 1)^T. If x_1, ..., x_n ∈ R^d are distinct, then for almost all W^(0) and all a < b, there exists b^(0) ∈ [a, b]^{n−1} such that φ_θ(X) is invertible.
Tamura & Tateishi (1997) attempt to prove this claim by showing that the curve c_i: [a, b] → R^n, b_i ↦ (σ((w_i^(0))^T x_j + b_i))_{j∈[n]} is not contained in any (n−1)-dimensional subspace of R^n, which would allow them to pick a suitable bias b_i for each curve c_i such that the c_i(b_i) and (1, ..., 1)^T are linearly independent. Although this part is formulated confusingly, it should work out. The major problem is that under the counterassumption that c_i lies in an (n−1)-dimensional subspace, they construct an infinite (overdetermined) system of equations involving derivatives of the sigmoid function, which they claim has no solution. However, in general, overdetermined systems can have solutions, and they do not prove why this would not be the case in their situation. Indeed, the only properties of the sigmoid function they use are that it is C^∞ and not a polynomial, and it is not hard to see that the exponential function has these properties as well and leads to a solvable overdetermined system, since its derivatives are all identical. This leads us to the next claim:

Claim 3 (Huang (2003), Theorem 2.1). Consider the setting of Claim 2 but with W^(0) ∈ R^{n×d} instead of appending a 1 to all feature vectors. If x_1, ..., x_n ∈ R^d are distinct and θ has a Lebesgue density, then φ_θ(X) is invertible almost surely over θ.

Huang (2003) uses the "proof" by Tamura & Tateishi (1997) to show that from any nontrivial intervals, one can choose W^(0), b^(0) such that φ_θ(X) is invertible, setting all rows of W^(0) to be equal. He then concludes without further reasoning that φ_θ(X) must be invertible almost surely over θ. This major unexplained step cannot be fixed by a continuity-based argument, since all b^(0)_i were chosen from the same interval and similarly, all rows of W^(0) were chosen from the same interval.
Since the sigmoid function is analytic, it would however be possible to prove this step using the multivariate identity theorem for analytic functions, Theorem H.3, in a fashion similar to the proof of (d) ⇒ (a) in Proposition 6. The approach pursued in Huang (2003) thus introduces an additional problem on top of the problem of Tamura & Tateishi (1997), although this additional problem is in principle fixable. A similar strategy is reused in the next claim, issued in a particularly popular paper:

Claim 4 (Huang et al. (2006), Theorem 2.1). Claim 3 holds for all C^∞ activation functions σ.

Not only does the "proof" in Huang et al. (2006) inherit the problems from the previous "proofs", but now the result is obviously false as well: For σ ≡ 0, the matrix φ_θ(X) can never be invertible, and for σ = id, it is also easy to find counterexamples. Moreover, it is not sufficient to exclude (low-order) polynomials σ: For example, the well-known construction (e.g. Remark 3.4 (d) in Chapter V.3 in Amann & Escher, 2005)

σ(x) := 0 for x ≤ 0,  σ(x) := e^{−1/x} for x > 0

yields a C^∞ function σ that is zero on (−∞, 0] but not a polynomial. For this function, it is not hard to see that φ_θ(X) would be singular with nonzero probability, since there is a chance that W^(0) x_i + b^(0) contains only negative components for all i ∈ [n]. Despite these evident problems and the paper's popularity, Claim 4 is restated years later as Theorem 1 in a survey paper by the same author (Huang et al., 2015). In his Theorem 1, Guo (2018) attempts to disprove Claim 4. However, this "disproof" is also invalid: For certain saturating activation functions like tanh, Guo (2018) lets θ → ∞ in a certain fashion, which causes φ_θ(X) to converge to a singular matrix. The problem with this approach is that just because the limiting matrix is singular, the matrices for finite θ do not need to be singular.
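The failure of Claim 4 for this C^∞ bump-type function is easy to observe numerically. A minimal sketch; the sizes, the number of trials, and the placement of the inputs in the negative orthant (which makes the degenerate event frequent) are illustrative choices.

```python
# Numerical illustration of the counterexample to Claim 4: for the C^infinity
# (non-analytic) function sigma(x) = exp(-1/x) for x > 0 and 0 otherwise, the
# n x n feature matrix sigma(X W^T + b) is singular whenever, e.g., all
# pre-activations of one neuron (one column) are negative.
import numpy as np

rng = np.random.default_rng(0)

def sigma(t):
    t_safe = np.where(t > 0, t, 1.0)  # avoid division by zero on the dead branch
    return np.where(t > 0, np.exp(-1.0 / t_safe), 0.0)

n = d = 3
X = -np.abs(rng.standard_normal((n, d)))  # distinct inputs in the negative orthant

singular_found = False
for _ in range(1000):
    W = rng.standard_normal((n, d))       # theta = (W, b) has a Lebesgue density
    b = rng.standard_normal(n)
    M = sigma(X @ W.T + b)                # feature matrix phi_theta(X)
    if np.linalg.matrix_rank(M) < n:
        singular_found = True
        break

assert singular_found  # phi_theta(X) is singular with substantial probability
```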
However, this limiting behavior might at least explain the empirical results of Henriquez & Ruz (2017), who find, in a certain setting with sigmoid activation, that numerically, φ_θ(X) is usually not a full-rank matrix. In contrast, Widrow et al. (2013) numerically reach the opposite conclusion. This is not a contradiction, considering that very small eigenvalues of φ_θ(X) may be numerically rounded to zero, and the probability of having such small eigenvalues depends on the chosen distributions of x and θ. The only previously published correct result known to us is the following one:

Lemma E.5 (Lemma 4.4 in Nguyen & Hein (2017)). Let σ: R → R be analytic, strictly increasing and let one of the following two conditions be satisfied: (a) σ is bounded, or (b) there exist positive constants ρ_1, ρ_2, ρ_3, ρ_4 such that |σ(t)| ≤ ρ_1 e^{ρ_2 t} for t < 0 and |σ(t)| ≤ ρ_3 t + ρ_4 for t ≥ 0. Let φ_θ be a random NN feature map with biases as in Theorem 10 (b) and let x_1, ..., x_n ∈ R^d be distinct. If p ≥ n − 1, then the n × (p + 1) feature matrix with rows (φ_θ(x_i)^T, 1), i ∈ [n], has rank n for (Lebesgue-) almost all θ.

In Theorem H.9, we generalize Lemma E.5 (without the appended ones in the feature matrix) to a substantially larger class of analytic activation functions. As argued above, it is not possible to replace the assumption that σ is analytic with σ ∈ C^∞, and as shown in Remark H.17, our exclusion of certain polynomials for σ is in general necessary.

F PROOFS FOR SECTION 4

In this section, we prove our main theorem and corollary from Section 4. Recall from Definition 2 that if E||z||_2² < ∞, we can define the second moment matrix Σ := E_{z∼P_Z} zz^T ∈ R^{p×p}, and if Σ is invertible, we can also define the "whitened" data matrix W := Z Σ^{−1/2}, whose rows w_i = Σ^{−1/2} z_i satisfy E w_i w_i^T = I_p.

Theorem 3 (Main result). Let n, p ≥ 1. Assume that P and φ satisfy:
(INT) Ey² < ∞ and hence R_P(f*_P) < ∞,
(NOI) Var(y|z) ≥ σ² almost surely over z,
(MOM) E||z||_2² < ∞, i.e. Σ exists and is finite,
(COV) Σ is invertible,
(FRK) Z ∈ R^{n×p} almost surely has full rank, i.e. rank Z = min{n, p}.
Then, for the ridgeless linear regression estimator f_{Z,y}(z) = z^T Z^+ y, the following holds: If p ≥ n,

E_Noise ≥(I) σ² E_Z tr((Z^+)^T Σ Z^+) ≥(II) σ² E_Z tr((W W^T)^{−1}) ≥(IV) σ² n/(p + 1 − n) .

If p ≤ n,

E_Noise ≥(I) σ² E_Z tr((Z^+)^T Σ Z^+) =(III) σ² E_Z tr((W^T W)^{−1}) ≥(V) σ² p/(n + 1 − p) .

Here, the matrix inverses exist almost surely in the considered cases. Moreover, we have:
• If (NOI) holds with equality, then (I) holds with equality.
• If n = p or Σ = λI_p for some λ > 0, then (II) holds with equality.

Proof. By (FRK), we only consider the case where Z has full rank. For further details on some matrix computations, we refer to Appendix B.

Step 1: Preparation. Since the pairs (z_1, y_1), ..., (z_n, y_n) are independent, the conditional covariance matrix satisfies

Cov(y|Z) = diag(Var(y_1|z_1), ..., Var(y_n|z_n)) ⪰(I) σ² I_n ,

with equality in (I) if (NOI) holds with equality. Therefore,

E_Noise := E_{Z,y,θ,z} (f_{Z,y,θ}(z) − E_{y|Z} f_{Z,y,θ}(z))²
= E_{Z,y,z} (z^T Z^+ y − E_{y|Z} z^T Z^+ y)²
= E_{Z,z} z^T Z^+ E_{y|Z}[(y − E_{y|Z} y)(y − E_{y|Z} y)^T] (Z^+)^T z
= E_{Z,z} z^T Z^+ Cov(y|Z) (Z^+)^T z
≥(I) σ² E_{Z,z} z^T Z^+ (Z^+)^T z = σ² E_{Z,z} tr((Z^+)^T z z^T Z^+) = σ² E_Z tr((Z^+)^T Σ Z^+) .

Step 2.1: Reformulation, underparameterized case.
In the case p ≤ n, we have Z^+ = (Z^T Z)^{−1} Z^T thanks to (FRK) and thus

tr((Z^+)^T Σ Z^+) = tr(Σ^{1/2} Z^+ (Z^+)^T Σ^{1/2}) = tr(Σ^{1/2} (Z^T Z)^{−1} Σ^{1/2}) = tr((W^T W)^{−1}) .

Step 2.2: Reformulation, overparameterized case. In the case p ≥ n, we have Z^+ = Z^T (Z Z^T)^{−1} thanks to (FRK). For Σ = λI_p, we can show tr((Z^+)^T Σ Z^+) = tr((W W^T)^{−1}) similarly to Step 2.1. For general Σ, we can obtain a lower bound by "removing a projection": First, let

S := (Z^+)^T Σ Z^+ = (Z Z^T)^{−1} Z Σ Z^T (Z Z^T)^{−1} ,  A := Σ^{1/2} Z^T .

Now, since Z and Σ have full rank, we can compute

S^{−1} = Z Z^T (Z Σ Z^T)^{−1} Z Z^T = W A (A^T A)^{−1} A^T W^T .

Since A (A^T A)^{−1} A^T is the orthogonal projection onto the column space of A, we have S^{−1} ⪯ W W^T and hence λ_i(S^{−1}) ≤ λ_i(W W^T) by the Courant-Fischer-Weyl theorem. This yields

tr((Z^+)^T Σ Z^+) = tr(S) = ∑_{i=1}^n λ_{n+1−i}(S) = ∑_{i=1}^n 1/λ_i(S^{−1}) ≥ ∑_{i=1}^n 1/λ_i(W W^T) = ∑_{i=1}^n λ_{n+1−i}((W W^T)^{−1}) = tr((W W^T)^{−1}) .

For n = p, A (A^T A)^{−1} A^T projects onto a p-dimensional space and is therefore the identity matrix, yielding equality.

Step 3.0: Random matrix bound, p = 1 or n = 1. If n ≥ p = 1, w_i is a scalar and we have

E tr((W^T W)^{−1}) = E[1/(W^T W)] = E[1/∑_{i=1}^n w_i²] ≥ 1/E[∑_{i=1}^n w_i²] = 1/∑_{i=1}^n tr(E w_i w_i^T) = 1/n = p/(n + 1 − p)

by Jensen's inequality. Similarly, for p ≥ n = 1, we obtain

E tr((W W^T)^{−1}) = E[1/(w_1^T w_1)] ≥ 1/E[w_1^T w_1] = 1/tr(E w_1 w_1^T) = 1/p = n/(p + 1 − n) .

Step 3.1: Random matrix bound, overparameterized case. We first consider the overparameterized case p ≥ n ≥ 2 and block-decompose

W =: [ w_1^T ; W_2 ] ∈ R^{(1+(n−1))×p}  ⇒  W W^T = [ w_1^T w_1 , w_1^T W_2^T ; W_2 w_1 , W_2 W_2^T ] .

Since Z has full rank, W has full rank. Because of n ≤ p, it follows that W W^T ≻ 0. Therefore,

((W W^T)^{−1})_{11} = (w_1^T w_1 − w_1^T W_2^T (W_2 W_2^T)^{−1} W_2 w_1)^{−1} = (w_1^T (I_p − P_2) w_1)^{−1} ,

where P_2 := W_2^T (W_2 W_2^T)^{−1} W_2 ∈ R^{p×p} is the orthogonal projection onto the column space of W_2^T. Thus, P_2 has the eigenvalues 1 with multiplicity n − 1 and 0 with multiplicity p − (n − 1), which yields tr(P_2) = n − 1.
Since the z_i are stochastically independent, w_1 and W_2 are also stochastically independent and we obtain

E w_1^T (I_p − P_2) w_1 = E tr((I_p − P_2)(E_{w_1} w_1 w_1^T)) = E tr(I_p − P_2) = p + 1 − n .

Using Jensen's inequality with the convex function (0, ∞) → (0, ∞), x ↦ 1/x, we thus find that

E ((W W^T)^{−1})_{11} = E (w_1^T (I_p − P_2) w_1)^{−1} ≥ 1/(E w_1^T (I_p − P_2) w_1) = 1/(p + 1 − n) .

Since tr((W W^T)^{−1}) = ∑_{i=1}^n ((W W^T)^{−1})_{ii} and all diagonal entries can be treated in the same fashion (e.g. via permutation of the w_i), it follows that E tr((W W^T)^{−1}) ≥ n/(p + 1 − n).

Step 3.2: Random matrix bound, underparameterized case. In the case n ≥ p ≥ 1, we follow the proof of Corollary 1 in Mourtada (2019), which can be applied in our setting as follows: Introduce a new random variable w_{n+1} such that w_1, ..., w_{n+1} are i.i.d. Then,

E tr((W^T W)^{−1}) = E_W tr((W^T W)^{−1} E_{w_{n+1}} w_{n+1} w_{n+1}^T) = E w_{n+1}^T (W^T W)^{−1} w_{n+1} .

Now, Lemma 1 in Mourtada (2019) uses the Sherman-Morrison formula to show that

w_{n+1}^T (W^T W)^{−1} w_{n+1} = ℓ̂_{n+1}/(1 − ℓ̂_{n+1})

with the so-called leverage score

ℓ̂_{n+1} := w_{n+1}^T (W^T W + w_{n+1} w_{n+1}^T)^{−1} w_{n+1} ∈ [0, 1) .

Since W^T W + w_{n+1} w_{n+1}^T = ∑_{i=1}^{n+1} w_i w_i^T and since w_1, ..., w_{n+1} are i.i.d., we obtain

E ℓ̂_{n+1} = E tr((∑_{i=1}^{n+1} w_i w_i^T)^{−1} w_{n+1} w_{n+1}^T) = (1/(n+1)) E tr((∑_{i=1}^{n+1} w_i w_i^T)^{−1} ∑_{i=1}^{n+1} w_i w_i^T) = (1/(n+1)) E tr(I_p) = p/(n + 1) .

Finally, the function [0, 1) → [0, ∞), x ↦ x/(1 − x) = 1/(1 − x) − 1 is convex, and Jensen's inequality yields

E tr((W^T W)^{−1}) = E[ℓ̂_{n+1}/(1 − ℓ̂_{n+1})] ≥ E ℓ̂_{n+1}/(1 − E ℓ̂_{n+1}) = p/(n + 1 − p) .

Note that it is not possible to analyze Step 3.2 like Step 3.1, since the corresponding matrix blocks are not stochastically independent.

Corollary 4 (Random features). Let θ ∼ P_Θ be a random variable such that φ_θ: R^d → R^p is a random feature map. Consider the random features regression estimator f_{X,y,θ}(x) = z_θ^T Z_θ^+ y with z_θ := φ_θ(x) and Z_θ := φ_θ(X).
If for P_Θ-almost all θ, the assumptions of Theorem 3 are satisfied for z = z_θ and Z = Z_θ (with the corresponding matrix Σ = Σ_θ), then

E_Noise ≥ σ² n/(p + 1 − n) if p ≥ n,  and  E_Noise ≥ σ² p/(n + 1 − p) if p ≤ n.

Proof. First, let p ≤ n. Since θ is independent of X, x, y,

E_Noise = E_{X,y,θ,x} (z_θ^T Z_θ^+ y − E_{y|X} z_θ^T Z_θ^+ y)² = E_θ E_{X,y,x} (z_θ^T Z_θ^+ y − E_{y|X} z_θ^T Z_θ^+ y)² ≥ E_θ σ² p/(n + 1 − p) = σ² p/(n + 1 − p) ,

where the inequality follows from Theorem 3. The case p ≥ n can be treated analogously.
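The two random-matrix bounds (IV) and (V) can be checked by simulation. A minimal sketch, assuming rows w_i ∼ N(0, I_p), so that Σ = I_p and W = Z; the sizes and the number of Monte Carlo samples are arbitrary choices.

```python
# Monte-Carlo sanity check of the bounds (IV) and (V) in Theorem 3 for
# Gaussian features.
import numpy as np

rng = np.random.default_rng(0)

def mean_trace_inv(n, p, m=2000):
    vals = []
    for _ in range(m):
        W = rng.standard_normal((n, p))
        G = W @ W.T if n <= p else W.T @ W  # the invertible Gram matrix
        vals.append(np.trace(np.linalg.inv(G)))
    return float(np.mean(vals))

n, p = 5, 15                                    # overparameterized case
assert mean_trace_inv(n, p) >= n / (p + 1 - n)  # (IV): bound 5/11, true value 5/9
n, p = 15, 5                                    # underparameterized case
assert mean_trace_inv(n, p) >= p / (n + 1 - p)  # (V): bound 5/11, true value 5/9
```

For Gaussian features, the exact values follow from inverse-Wishart moments, so both bounds hold with some slack here.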

G DISCUSSION OF THE MAIN THEOREM

Remark G.1 (Dependence on Σ). Hastie et al. (2019) discuss that Σ only influences the expected excess risk in the overparameterized regime. In the following, we will illustrate that Theorem 3 even implies that in the overparameterized case, Σ = λI_p yields the lowest E_Noise. This fact is also discussed in Muthukumar et al. (2020) in a slightly different setting. Note that since P_X is unknown in general, Σ is also unknown in general. Assume that (NOI) holds with equality, such that we have E_Noise = σ² E_Z tr((Z^+)^T Σ Z^+) by Theorem 3. Suppose that we perform linear regression on the whitened data z̃ := Σ^{−1/2} z. Then,

Z̃ = Z Σ^{−1/2} = W ,  Σ̃ = E_z̃ z̃ z̃^T = Σ^{−1/2} (E_z z z^T) Σ^{−1/2} = I_p ,  W̃ = Z̃ Σ̃^{−1/2} = Z̃ = W .

In the underparameterized case p ≤ n, we then obtain

E_Z̃ tr((Z̃^+)^T Σ̃ Z̃^+) = E_Z tr((W^T W)^{−1}) = E_Z tr((Z^+)^T Σ Z^+) .

Therefore, whitening the data does not make a difference if p ≤ n. In contrast, for p > n, we only know

E_Z̃ tr((Z̃^+)^T Σ̃ Z̃^+) =(II) E_Z tr((W W^T)^{−1}) ≤(II) E_Z tr((Z^+)^T Σ Z^+) ,

since (II) holds with equality for whitened features. From Step 2.2 in the proof of Theorem 3, it is obvious that (II) does in general not hold with equality. Hence, in the overparameterized case p > n, whitening the features often reduces and never increases E_Noise. Since E_Noise is just a lower bound for the expected excess risk, whitening does not necessarily reduce the expected excess risk.

This phenomenon also has a different kernel-based interpretation: Under the assumptions (MOM) and (COV), we can choose an ONB u_1, ..., u_p of eigenvectors of Σ with corresponding eigenvalues λ_1, ..., λ_p > 0. If z = φ(x) and k(x, x̃) = φ(x)^T φ(x̃), we can write

k(x, x̃) = φ(x)^T (∑_{i=1}^p u_i u_i^T) φ(x̃) = ∑_{i=1}^p λ_i ψ_i(x) ψ_i(x̃) ,  (5)

where the functions ψ_i: R^d → R, x ↦ (1/√λ_i) u_i^T φ(x) form an orthonormal system in L_2(P_X):

E_x ψ_i(x) ψ_j(x) = (1/√(λ_i λ_j)) u_i^T (E_x φ(x) φ(x)^T) u_j = (1/√(λ_i λ_j)) u_i^T Σ u_j = (λ_j/√(λ_i λ_j)) u_i^T u_j = δ_ij .  (6)

Therefore, Eq. (5) is a Mercer representation of k and the eigenvalues λ_i of Σ are also the eigenvalues of the integral operator T_k: L_2(P_X) → L_2(P_X) associated with k:

(T_k f)(x) := ∫ k(x, x̃) f(x̃) dP_X(x̃) = ∑_{i=1}^p λ_i ⟨ψ_i, f⟩_{L_2(P_X)} ψ_i(x) .

We can define a kernel k̃ with flattened eigenspectrum via k̃(x, x̃) := ∑_{i=1}^p ψ_i(x) ψ_i(x̃). By Eq. (6), its feature map x ↦ (ψ_i(x))_{i∈[p]} satisfies Σ̃ = I_p. By the above discussion, E_Noise(k̃) ≤ E_Noise(k). However, this needs to be taken with caution: Assume for simplicity that for independent x, x̃, the feature map φ yields φ(x), φ(x̃) ∼ N(0, (1/p) I_p). Then, k̃(x, x) = 1 but E k̃(x, x̃) = 0 and Var k̃(x, x̃) = (1/p²) ∑_{i=1}^p E_{u,v∼N(0,1)} u² v² = 1/p. In this sense, for p → ∞, k̃ converges to the Dirac kernel k̃(x, x̃) = δ_{x,x̃}, which satisfies E_Noise = 0 but provides bad interpolation performance if f*_P ≢ 0.

The next lemma and its proof are an adaptation of Lemma 4.14 in Bordenave & Chafaï (2012).

Lemma G.2. For i ∈ [n], let W_{−i} := Span{w_j | j ∈ [n] \ {i}}. Then, under the assumptions of Theorem 3, in the overparameterized case p ≥ n,

tr((W W^T)^{−1}) = ∑_{i=1}^n dist(w_i, W_{−i})^{−2} .  (7)

Proof. The case n = 1 is trivial, hence let n ≥ 2. In Step 3.1 in the proof of Theorem 3, since P_2 is the orthogonal projection onto W_{−1}, we have

w_1^T (I_p − P_2) w_1 = ||w_1||_2² − ||P_2 w_1||_2² = dist(w_1, W_{−1})² ,

where the second equality follows from the Pythagorean theorem and dist(w_1, W_{−1}) is the Euclidean distance between w_1 and W_{−1}.

Remark G.3 (Is U(S^{p−1}) optimal?). From (I) in Theorem 3, it is clear that the best possible lower bound for E_Noise under the assumptions of Theorem 3 given n, p ≥ 1 is

E_Noise ≥ σ² inf_{P_Z on R^p satisfying (MOM), (COV), (FRK)} E_{Z∼P_Z^n} tr((Z^+)^T Σ Z^+) .  (8)

Here, we want to discuss the hypothesis that the infimum in Eq. (8) is achieved (for example) by P_Z = U(S^{p−1}). Figure 1 shows that we were not able to obtain a lower E_Noise by optimizing a neural network feature map to minimize E_Noise.
In the following, we want to discuss some theoretical evidence as to why this is plausible in the overparameterized case p ≥ n. Lemma G.2 shows that Step 3.1 of the proof of Theorem 3 has a distance-based interpretation. In this interpretation, Step 3.1 applies Jensen's inequality to the convex function (0, ∞) → (0, ∞), x ↦ 1/x using

E dist(w_i, W_{−i})² = p + 1 − n .  (9)

We can use this perspective to gain insights into what distributions P_Z with small E_Noise in the overparameterized case p ≥ n look like. First of all, Remark G.1 suggests that for minimizing E_Noise in the overparameterized case, Σ should be a multiple of I_p, which is clearly satisfied for U(S^{p−1}) by Theorem 11. Since the lower bound obtained from (9) is independent of the distribution of W, minimizing E tr((W W^T)^{−1}) amounts to minimizing the error made by Jensen's inequality, which essentially amounts to reducing the variance of the random variables dist(w_i, W_{−i}). We can decompose dist(w_i, W_{−i}) = ||w_i||_2 · dist(w_i/||w_i||_2, W_{−i}), where dist(w_i/||w_i||_2, W_{−i}) only depends on the angular components w_j/||w_j||_2 for j ∈ [n]. This suggests that for a "good" distribution P_W,
• the radial component ||w_i||_2 should have low variance, and
• the distribution of the angular component w_i/||w_i||_2 should not contain "clusters", since clusters would increase the probability of dist(w_i, W_{−i}) being very small.
Clearly, both points are perfectly achieved for P_Z = U(S^{p−1}).
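This heuristic can be probed empirically. The following sketch compares Monte-Carlo estimates of E tr((W W^T)^{−1}) for rows drawn i.i.d. from N(0, I_p) against rows drawn uniformly from the sphere of radius √p (both have second moment matrix I_p); the sizes and sample counts are arbitrary choices, and the constant radial component of the sphere distribution should give the smaller value.

```python
# Compare E tr((W W^T)^{-1}) for Gaussian rows vs. rows uniform on sqrt(p)*S^{p-1}.
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 5, 15, 3000  # overparameterized: p >= n

def mean_trace(sample_rows):
    vals = []
    for _ in range(m):
        W = sample_rows()
        vals.append(np.trace(np.linalg.inv(W @ W.T)))
    return float(np.mean(vals))

def gaussian_rows():                      # w_i ~ N(0, I_p)
    return rng.standard_normal((n, p))

def sphere_rows():                        # w_i uniform on the sphere of radius sqrt(p)
    G = rng.standard_normal((n, p))
    return np.sqrt(p) * G / np.linalg.norm(G, axis=1, keepdims=True)

t_gauss, t_sphere = mean_trace(gaussian_rows), mean_trace(sphere_rows)
assert n / (p + 1 - n) <= t_sphere <= t_gauss  # sphere beats Gaussian, bound holds
```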

H PROOFS FOR SECTION 5

In this section, we prove all theorems and propositions from Section 5 as well as some additional results.

H.1 MISCELLANEOUS

First, we prove a statement about conditional variances.

Lemma H.1. In the setting of Theorem 3, we have

Var(y|z) = E(Var(y|x)|z) + Var(E(y|x)|z) .

Hence, if Var(y|x) ≥ σ² almost surely over x, then Var(y|z) ≥ σ² almost surely over z. The converse holds, for example, if φ is injective.

Proof. For properties of conditional expectations, we refer to the literature, e.g. Chapter 4.1 in Durrett (2019). Since z = φ(x) is a function of x, we have E[·|z] = E[E(·|x)|z]. Thus,

Var(y|z) = E((y − E(y|z))² | z)
= E(((y − E(y|x)) + (E(y|x) − E(E(y|x)|z)))² | z)
= E[E((y − E(y|x))² | x) | z] + 2 E[E((y − E(y|x))(E(y|x) − E(E(y|x)|z)) | x) | z] + E[E((E(y|x) − E(E(y|x)|z))² | x) | z] .

For the first term, we have E((y − E(y|x))² | x) = Var(y|x) by definition. The second term is zero: Since E(y|x) − E(E(y|x)|z) is already a function of x, we have

E((y − E(y|x))(E(y|x) − E(E(y|x)|z)) | x) = E((y − E(y|x)) | x) (E(y|x) − E(E(y|x)|z)) = (E(y|x) − E(y|x))(E(y|x) − E(E(y|x)|z)) = 0 .

Finally, the third term is by definition equal to Var(E(y|x)|z). Therefore, Var(y|z) = E(Var(y|x)|z) + Var(E(y|x)|z). If φ is injective, then x is also a function of z and we obtain Var(y|z) = Var(y|x).

Our main ingredient for analyzing analytic activation functions will be the univariate and multivariate identity theorems.

Theorem H.2 (Identity theorem for univariate real analytic functions). Let f: R → R be analytic. If f is not the zero function, the set N(f) := {z ∈ R | f(z) = 0} has no accumulation point. In particular, N(f) is countable.

(i) Denote the n (stochastically independent) rows of Z by z_1, ..., z_n. First, assume P(z ∈ U) > 0 for some subspace U of dimension min{n, p} − 1. Then,

P(rank(Z) ≤ min{n, p} − 1) ≥ P(z_1, ..., z_n ∈ U) = (P(z ∈ U))^n > 0 .

For the converse, it suffices to consider the case n ≤ p, since if n > p and an arbitrary p × p submatrix of Z ∈ R^{n×p} almost surely has full rank, then Z also almost surely has full rank. We prove the statement for n ≤ p by induction on n.
For n = 1, the claim is trivial. Thus, let n > 1 and let z_1, ..., z_n be the (stochastically independent) rows of Z. Assume P(z ∈ U) = 0 for all linear subspaces U ⊆ R^p of dimension n − 1. Then, we also have P(z ∈ U) = 0 for all linear subspaces U ⊆ R^p of dimension n − 2, and by the induction hypothesis, we obtain that z_1, ..., z_{n−1} are almost surely linearly independent. Hence, almost surely, z_1, ..., z_n are linearly independent iff z_n ∉ U_{n−1} := Span{z_1, ..., z_{n−1}}. Then, since z_n and U_{n−1} are stochastically independent,

P(Z ∈ R^{n×p} has full rank) = P(z_1, ..., z_n are linearly independent) = P(z_n ∉ U_{n−1}) = ∫∫ 1_{U_{n−1}^c}(z_n) dP_Z(z_n) dP(U_{n−1}) = ∫ P_Z(z_n ∉ U_{n−1}) dP(U_{n−1}) = ∫ 1 dP(U_{n−1}) = 1 .

For the case p ≤ n, (i) is also proven after Definition 1 in Mourtada (2019).

(ii) If (COV) does not hold, there exists a vector 0 ≠ v ∈ R^p with v^T Σ v = 0, hence 0 = v^T E[z z^T] v = E(v^T z)², and therefore z is almost surely orthogonal to v. If U is the orthogonal complement of Span{v} in R^p, then P(z ∈ U) = 1. For the converse, if there exists a (p−1)-dimensional subspace U with P(z ∈ U) = 1, then we can again take a vector 0 ≠ v ∈ R^p that is orthogonal to U, reverse the above computation and obtain v^T Σ v = 0, hence (COV) does not hold.

Step 2: (a) ⇔ (b). The implication (b) ⇒ (a) is trivial and the implication (a) ⇒ (b) follows immediately from (i).

Step 3: (b) ⇒ (c). This also follows from (i) and (ii).

Step 4: (c) ⇒ (d). We will prove by induction on n that for all n ∈ {0, ..., p}, there exist x_1, ..., x_n ∈ R^d such that φ(x_1), ..., φ(x_n) are linearly independent. For n = 0, the statement is trivial. Now assume that the statement holds for x_1, ..., x_{n−1}, where 0 ≤ n − 1 ≤ p − 1. Since (COV) holds, by (ii) the subspace U := Span{φ(x_1), ..., φ(x_{n−1})} satisfies P(φ(x) ∈ U) < 1, hence there is x_n such that φ(x_n) ∉ U, and for this choice, φ(x_1), ..., φ(x_n) are linearly independent.
Finally, the statement for n = p yields the existence of X ∈ R^{p×d} such that φ(X) has linearly independent rows, which implies det(φ(X)) ≠ 0.

Step 5: Analytic feature map. Assume that φ is analytic, x has a Lebesgue density and (d) holds. Let n = p. For X̃ ∈ R^{p×d}, consider the analytic function f(X̃) := det(φ(X̃)). (The determinant is analytic since it is a polynomial in the matrix entries.) Since (d) holds, f is not the zero function, and Theorem H.3 shows that f(X̃) ≠ 0 for (Lebesgue-) almost all X̃. Since x has a Lebesgue density and X has independent x-distributed rows, X has a Lebesgue density. Therefore, f(X) ≠ 0 almost surely over X, hence (a) holds.

Proposition 8 (Polynomial kernel). Let m, d ≥ 1 and c > 0. For x, x̃ ∈ R^d, define the polynomial kernel k(x, x̃) := (x^T x̃ + c)^m. Then, there exists a feature map φ: R^d → R^p, p := (m+d choose m), such that:
(a) k(x, x̃) = φ(x)^T φ(x̃) for all x, x̃ ∈ R^d, and
(b) if x has a Lebesgue density, then (FRK) is satisfied for z = φ(x) for all n.

Proof. Let M := {m̃ ∈ N_0^{d+1} | m̃_1 + ... + m̃_{d+1} = m} and for m̃ ∈ M, let C(m̃) := m!/(m̃_1! ··· m̃_{d+1}!) denote the multinomial coefficient. Define

φ(z) := (√C(m̃) · z_1^{m̃_1} ··· z_d^{m̃_d} · (√c)^{m̃_{d+1}})_{m̃∈M} .

(a) We have

φ(x)^T φ(x̃) = ∑_{m̃∈M} C(m̃) (x_1 x̃_1)^{m̃_1} ··· (x_d x̃_d)^{m̃_d} · c^{m̃_{d+1}} = (x_1 x̃_1 + ... + x_d x̃_d + c)^m = k(x, x̃) .

(b) Assume that x has a Lebesgue density. Let U be an arbitrary (p−1)-dimensional linear subspace of R^p. Then, there exists 0 ≠ v ∈ R^p such that U = (Span{v})^⊥. Since the monomials x_1^{m̃_1} ··· x_d^{m̃_d} for m̃ ∈ M are all distinct, the polynomial

v^T φ(x) = ∑_{m̃∈M} v_{m̃} √C(m̃) (√c)^{m̃_{d+1}} x_1^{m̃_1} ··· x_d^{m̃_d}

is not the zero polynomial. By the identity theorem (Theorem H.3), since x has a Lebesgue density and since polynomials are analytic, P(φ(x) ∈ U) = P(v^T φ(x) = 0) = 0. Hence, Proposition 6 shows that (FRK) is satisfied for n = p and hence for all n.
We want to remark at this point that the proof strategy of Proposition 8, where the identity theorem is applied to the functions $v^\top \phi(x)$ for all $0 \ne v \in \mathbb{R}^p$, does not work for random feature maps: the statements
• for all $0 \ne v \in \mathbb{R}^p$, for almost all $x$ and almost all $\theta$, $v^\top \phi_\theta(x) \ne 0$;
• for almost all $\theta$, for all $0 \ne v \in \mathbb{R}^p$ and almost all $x$, $v^\top \phi_\theta(x) \ne 0$
are not equivalent, since in the first statement the null set for $\theta$ may depend on $v$, and the union of the null sets over all $v$ is an uncountable union. Perhaps the simplest counterexample is $n = p = 2$, $\theta \sim \mathcal{N}(0, I_2)$ and $\phi_\theta(x) := \theta$, which satisfies the first but not the second statement. Fortunately, we can replace the uncountable family of analytic functions $(x \mapsto v^\top \phi_\theta(x))_{v \in \mathbb{R}^p \setminus \{0\}}$ by the single analytic function $(\theta, X) \mapsto \det(\phi_\theta(X))$:

Proposition 9 (Random feature maps). Consider feature maps $\phi_\theta: \mathbb{R}^d \to \mathbb{R}^p$ with (random) parameter $\theta \in \mathbb{R}^q$. Suppose the map $(\theta, x) \mapsto \phi_\theta(x)$ is analytic and that $\theta$ and $x$ are independent and have Lebesgue densities. If there exist fixed $\tilde\theta \in \mathbb{R}^q$, $\tilde X \in \mathbb{R}^{p \times d}$ with $\det(\phi_{\tilde\theta}(\tilde X)) \ne 0$, then almost surely over $\theta$, (FRK) holds for all $n$ for $z = \phi_\theta(x)$.

Proof. Consider the analytic map $(\theta, X) \mapsto \det(\phi_\theta(X))$. Suppose there exist $\tilde\theta \in \mathbb{R}^q$ and $\tilde X \in \mathbb{R}^{p \times d}$ with $\det(\phi_{\tilde\theta}(\tilde X)) \ne 0$. Then, Theorem H.3 tells us that $\det(\phi_\theta(X)) \ne 0$ for almost all $(\theta, X)$. This implies that for almost all $\theta$, we have $\det(\phi_\theta(X)) \ne 0$ for almost all $X$. Since by assumption $\theta$ has a density, this implies that almost surely over $\theta$, there exists $X$ with $\det(\phi_\theta(X)) \ne 0$. Since all $\phi_\theta$ are analytic, the claim now follows using (d) ⇒ (b) from Proposition 6. (The proof of (d) ⇒ (a) ⇔ (b) in Proposition 6 does not require (MOM).)
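Proposition 9 can be illustrated empirically: for a random analytic feature map (here a small two-layer tanh network; the sizes are arbitrary choices for this sketch, not taken from the paper), the $n = p$ feature matrix is full-rank in every sampled realization. A minimal sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, p, trials = 4, 16, 6, 200

full_rank = 0
for _ in range(trials):
    # theta = (W0, W1): random parameters with a (Gaussian, hence Lebesgue) density
    W0 = rng.normal(size=(h, d))
    W1 = rng.normal(size=(p, h))
    X = rng.normal(size=(p, d))          # n = p input rows with a Lebesgue density
    Z = np.tanh(X @ W0.T) @ W1.T         # feature matrix phi_theta(X), shape p x p
    full_rank += int(np.linalg.matrix_rank(Z) == p)

assert full_rank == trials               # det(phi_theta(X)) != 0 in every draw
```

Since $(\theta, x) \mapsto \phi_\theta(x)$ is analytic and a single $(\tilde\theta, \tilde X)$ with nonzero determinant clearly exists, the proposition predicts exactly this almost-sure behavior.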

H.2 RANDOM NETWORKS WITH BIASES

In order to prove (FRK) for random deep neural networks, we pursue slightly different approaches for networks with biases (this section) and without biases (Appendix H.3). In both approaches, we consider a property related to the diversity of the $x_1, \dots, x_n \in \mathbb{R}^d$ and proceed as follows: (1) Projections: ensure that random projections of $x_1, \dots, x_n$ almost surely preserve the diversity property, such that it is sufficient to consider the case $d = 1$. (2) Propagation: using the projection result from (1), prove that if the inputs to a layer have the diversity property, then the outputs also have the diversity property almost surely over the random parameters of the layer.

Proof. Step 1: Bias. By assumption, $\sigma$ is not a polynomial of degree less than $m = 2$. By Lemma H.5, there exist $(a_k)_{k \ge 0}$, $b \in \mathbb{R}$ and $\varepsilon > 0$ such that $a_0, a_1 \ne 0$ and $\sigma(x) = \sum_{k=0}^\infty a_k (x - b)^k$ for all $x \in \mathbb{R}$ with $|x - b| < \varepsilon$. Step 2: Weight. By Lemma H.6, we can choose $u \in \mathbb{R}^d$ such that $u^\top x_1, \dots, u^\top x_n$ are distinct. Now consider $i, j \in [n]$ with $i \ne j$. The function $f_{ij}: \mathbb{R} \to \mathbb{R}$, $\lambda \mapsto \sigma(\lambda u^\top x_i + b) - \sigma(\lambda u^\top x_j + b)$, satisfies $f_{ij}(\lambda) = \sum_{k=0}^\infty a_k \big((u^\top x_i)^k - (u^\top x_j)^k\big) \lambda^k$ for sufficiently small $|\lambda|$. Here, the coefficient $a_k\big((u^\top x_i)^k - (u^\top x_j)^k\big)$ is nonzero for $k = 1$, and hence $f_{ij}$ is not the zero function. Using the identity theorem again (as in the proof of Lemma H.6), we find that there exists $\lambda \in \mathbb{R}$ with $f_{ij}(\lambda) \ne 0$ for all $i, j \in [n]$ with $i \ne j$. Step 3: Generalization. Now choose $W := (\lambda u, \dots, \lambda u)^\top \in \mathbb{R}^{p \times d}$ and $b := (b, \dots, b)^\top \in \mathbb{R}^p$, i.e., all rows of $W$ equal $\lambda u^\top$ and all entries of $b$ equal $b$. Then, by construction, the first components $(\sigma(W x_i + b))_1$ for $i \in [n]$ are distinct. Hence, the analytic function
$$(W, b) \mapsto \prod_{\substack{i, j \in [n] \\ i \ne j}} \big((\sigma(W x_i + b))_1 - (\sigma(W x_j + b))_1\big)$$
is not the zero function. By the identity theorem (Theorem H.3), for (Lebesgue-)almost all $(W, b)$, the first components of the vectors $\sigma(W x_i + b)$ are distinct, and therefore the vectors themselves are also distinct.
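Both ingredients of this proof, random projections preserving distinctness (Step 2 via Lemma H.6) and a random layer preserving distinctness (Step 3), can be checked numerically. A sketch, assuming NumPy and tanh as an analytic, non-polynomial activation (the specific inputs are an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
# Distinct inputs that share coordinates, so distinctness after projection is nontrivial
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
n = len(X)

for _ in range(500):
    u = rng.normal(size=2)
    proj = X @ u
    assert len(np.unique(proj)) == n                  # u^T x_i almost surely distinct
    W, b = rng.normal(size=(3, 2)), rng.normal(size=3)
    Z = np.tanh(X @ W.T + b)                          # layer output, one row per input
    assert len(np.unique(Z.round(12), axis=0)) == n   # output rows stay distinct
```

Over 500 random draws, no projection collision and no output collision occurs, as the lemma predicts for Lebesgue-almost all parameters.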
Lemma H.8 (Independence from distinct variables). Let $p, d \ge 1$. Let $\sigma: \mathbb{R} \to \mathbb{R}$ be analytic but not a polynomial of degree less than $p - 1$. If $x_1, \dots, x_p \in \mathbb{R}^d$ are distinct, then for (Lebesgue-)almost all $W \in \mathbb{R}^{p \times d}$ and $b \in \mathbb{R}^p$, the vectors $\sigma(W x_1 + b), \dots, \sigma(W x_p + b)$ are linearly independent.

Proof. Step 1: Preparation. By Lemma H.5, there exist $(a_k)_{k \ge 0}$, $b \in \mathbb{R}$ and $\varepsilon > 0$ such that $a_0, \dots, a_{p-1} \ne 0$ and $\sigma(x) = \sum_{k=0}^\infty a_k (x - b)^k$ for all $x \in \mathbb{R}$ with $|x - b| < \varepsilon$. By Lemma H.6, there exists $u \in \mathbb{R}^d$ such that $\tilde x_i := u^\top x_i$, $i \in [p]$, are all distinct. Using the fact that Vandermonde matrices of distinct $\tilde x_i$ are invertible and using the Leibniz formula for the determinant (with the permutation group $S_p$ on $[p]$), we obtain
$$D_{\tilde x} := \sum_{\pi \in S_p} \operatorname{sgn}(\pi) \prod_{i=1}^p \tilde x_{\pi(i)}^{\,i-1} = \det \begin{pmatrix} \tilde x_1^{0} & \cdots & \tilde x_p^{0} \\ \vdots & & \vdots \\ \tilde x_1^{p-1} & \cdots & \tilde x_p^{p-1} \end{pmatrix} \ne 0\,.$$
Step 2: Determinant expansion. Define the analytic function
$$f: \mathbb{R}^p \to \mathbb{R}, \quad w \mapsto \det \begin{pmatrix} \sigma(w_1 \tilde x_1 + b) & \cdots & \sigma(w_p \tilde x_1 + b) \\ \vdots & & \vdots \\ \sigma(w_1 \tilde x_p + b) & \cdots & \sigma(w_p \tilde x_p + b) \end{pmatrix}.$$
For small enough $\|w\|_2$, we can use the Leibniz formula for the determinant to write
$$f(w) = \sum_{\pi \in S_p} \operatorname{sgn}(\pi) \prod_{i=1}^p \sum_{k=0}^\infty a_k (w_i \tilde x_{\pi(i)})^k = \sum_{k_1, \dots, k_p \ge 0} \sum_{\pi \in S_p} \operatorname{sgn}(\pi) \prod_{i=1}^p a_{k_i} w_i^{k_i} \tilde x_{\pi(i)}^{k_i} = \sum_{k_1, \dots, k_p \ge 0} \left( \prod_{i=1}^p a_{k_i} \right) \left( \sum_{\pi \in S_p} \operatorname{sgn}(\pi) \prod_{i=1}^p \tilde x_{\pi(i)}^{k_i} \right) w_1^{k_1} \cdots w_p^{k_p}\,.$$
For $k_i := i - 1$, we find the coefficient of $w_1^{k_1} \cdots w_p^{k_p}$ to be
$$\left( \prod_{i=1}^p a_{k_i} \right) \left( \sum_{\pi \in S_p} \operatorname{sgn}(\pi) \prod_{i=1}^p \tilde x_{\pi(i)}^{k_i} \right) = \left( \prod_{i=1}^p a_{k_i} \right) \cdot D_{\tilde x} \ne 0\,,$$
hence $f$ is not the zero function and there exists $w \in \mathbb{R}^p$ such that $f(w) \ne 0$. Step 3: Generalization. Consider the analytic function
$$g(W, b) := \det \begin{pmatrix} \sigma(W x_1 + b)^\top \\ \vdots \\ \sigma(W x_p + b)^\top \end{pmatrix}.$$
When setting $W := w u^\top \in \mathbb{R}^{p \times d}$ and $b := (b, \dots, b)^\top \in \mathbb{R}^p$, we obtain from Step 2 that $g(W, b) = f(w) \ne 0$, hence $g$ is not the zero function. But then, by the identity theorem (Theorem H.3), $g$ is nonzero for (Lebesgue-)almost all $(W, b)$. Theorem H.9.
Let $p, d \ge 1$, let $\sigma: \mathbb{R} \to \mathbb{R}$ be analytic but not a polynomial of degree less than $\max\{1, p-1\}$, and let $x_1, \dots, x_p \in \mathbb{R}^d$ be distinct. Let $L \ge 1$ and $d_0 := d$, $d_1, \dots, d_{L-1} \ge 1$, $d_L := p$. For $l \in \{0, \dots, L-1\}$, let $W^{(l)} \in \mathbb{R}^{d_{l+1} \times d_l}$ and $b^{(l)} \in \mathbb{R}^{d_{l+1}}$ be random variables such that $\theta := (W^{(0)}, \dots, W^{(L-1)}, b^{(0)}, \dots, b^{(L-1)})$ has a Lebesgue density. Consider the random feature map given by $\phi_\theta(x^{(0)}) := x^{(L)}$, where $x^{(l+1)} := \sigma(W^{(l)} x^{(l)} + b^{(l)})$. Then, almost surely over $\theta$, $\phi_\theta(x_1), \dots, \phi_\theta(x_p)$ are linearly independent.

Proof. By Lemma H.4, it suffices to consider the case where $\theta$ has a standard normal distribution, since a standard normal distribution has a nonzero probability density everywhere. In particular, in this case, all weights and biases are independent. Using Lemma H.7 and the fact that $\sigma$ is non-constant, it follows by induction on $l \in \{0, \dots, L-1\}$ that $x^{(l)}_1, \dots, x^{(l)}_p$ are distinct almost surely over $\theta$. But if $x^{(L-1)}_1, \dots, x^{(L-1)}_p$ are distinct, then $x^{(L)}_1, \dots, x^{(L)}_p$ are linearly independent almost surely over $\theta$ by Lemma H.8, which is what we wanted to show.
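The conclusion of Theorem H.9 can be sanity-checked numerically for a single layer with biases ($L = 1$): for distinct scalar inputs, the feature vectors $\sigma(W x_i + b)$ are linearly independent for random Gaussian $(W, b)$. A sketch, assuming NumPy and tanh (analytic, not a polynomial):

```python
import numpy as np

rng = np.random.default_rng(1)
p, d, trials = 5, 1, 200
x = np.linspace(-1.0, 1.0, p).reshape(p, d)   # p distinct inputs, d = 1

ok = 0
for _ in range(trials):
    W = rng.normal(size=(p, d))               # (W, b) has a Lebesgue density
    b = rng.normal(size=p)
    Z = np.tanh(x @ W.T + b)                  # row i is sigma(W x_i + b)^T
    ok += int(np.linalg.matrix_rank(Z) == p)

assert ok == trials                           # linearly independent in every draw
```

As predicted, the rank is $p$ in every realization, i.e., the null set of degenerate parameters is never hit.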

H.3 RANDOM NETWORKS WITHOUT BIASES

As discussed at the beginning of Appendix H.2, we will now consider the property of having independent nonatomic random variables $x_1, \dots, x_n$. We will again establish projection and propagation lemmas first. Lemma H.10 (Projections of nonatomic random variables). A random variable $x \in \mathbb{R}^d$ is nonatomic iff for (Lebesgue-)almost all $u \in \mathbb{R}^d$, $u^\top x$ is nonatomic.

Proof.

Step 1: Decompose into subspace contributions. For $k \in \{0, \dots, d\}$, let $S_k := \{w + V \mid w \in \mathbb{R}^d,\ V \text{ linear subspace of } \mathbb{R}^d,\ \dim V = k\}$ be the set of $k$-dimensional affine subspaces of $\mathbb{R}^d$. Let $\mathcal{A}_0 := \{A \in S_0 \mid P_X(A) > 0\}$, which corresponds to the set of atoms of $P_X$. We then recursively define "minimally contributing subspaces" for $k \in [d]$:
$$\mathcal{A}_k := \{A \in S_k \mid P_X(A) > 0 \text{ and } P_X(\tilde A) = 0 \text{ for all } \tilde A \in S_{k-1} \text{ with } \tilde A \subseteq A\}\,.$$
We can define corresponding sets of "annihilating" vectors as follows: for $A = w + V \in S_k$, let $A^\perp := V^\perp = \{u \in \mathbb{R}^d \mid u^\top v = 0 \text{ for all } v \in V\}$. This is well-defined since it is independent of the choice of $w$. Define
$$\mathcal{U} := \bigcup_{k=0}^d \bigcup_{A \in \mathcal{A}_k} A^\perp\,, \qquad \mathcal{N} := \{u \in \mathbb{R}^d \mid u^\top x \text{ is not nonatomic}\}\,.$$
We want to show $\mathcal{U} = \mathcal{N}$. Step 2: Show $\mathcal{U} \subseteq \mathcal{N}$. Let $A = w + V \in \mathcal{A}_k$ for some $k \in \{0, \dots, d\}$ and let $u \in A^\perp$. Then,
$$0 < P_X(A) = P(x \in A) \le P\big(u^\top x \in \{u^\top(w + v) \mid v \in V\}\big) = P(u^\top x = u^\top w)\,,$$
which shows $u \in \mathcal{N}$. Step 3: Show $\mathcal{N} \subseteq \mathcal{U}$. Let $u \in \mathcal{N}$, i.e. there exists $a \in \mathbb{R}$ such that $P(u^\top x = a) > 0$. Define $A := \{v \in \mathbb{R}^d \mid u^\top v = a\}$. Then, $A$ is an affine subspace of $\mathbb{R}^d$ and we have $P_X(A) > 0$ by construction of $A$. Among all affine subspaces $\tilde A$ of $A$ with $P_X(\tilde A) > 0$, there exists one with minimal dimension. This subspace then satisfies $\tilde A \in \mathcal{A}_{\dim \tilde A}$, and it is not hard to show that $u \in A^\perp \subseteq \tilde A^\perp$, hence $u \in \mathcal{U}$. Step 4: For all $k \in \{0, \dots, d\}$, $\mathcal{A}_k$ is countable. To derive a contradiction, assume that $\mathcal{A}_k$ is uncountable. Then, there exists $\varepsilon > 0$ such that $\mathcal{A}_{k,\varepsilon} := \{A \in \mathcal{A}_k \mid P_X(A) \ge \varepsilon\}$ is also uncountable. Pick an integer $n > 1/\varepsilon$ and pick $n$ distinct sets $A_1, \dots, A_n \in \mathcal{A}_{k,\varepsilon}$. We will show by induction on $l \in [n]$ that $\tilde A_l := A_1 \cup \dots \cup A_l$ satisfies $P_X(\tilde A_l) = P_X(A_1) + \dots + P_X(A_l)$, which will then yield the contradiction
$$1 \ge P_X(\tilde A_n) = P_X(A_1) + \dots + P_X(A_n) \ge n\varepsilon > 1\,.$$
Obviously, $P_X(\tilde A_1) = P_X(A_1)$.
Assuming that the statement holds for $l \in [n-1]$, we first derive
$$P_X(\tilde A_{l+1}) = P_X(\tilde A_l \cup A_{l+1}) = P_X(\tilde A_l) + P_X(A_{l+1}) - P_X(\tilde A_l \cap A_{l+1}) = P_X(A_1) + \dots + P_X(A_{l+1}) - P_X(\tilde A_l \cap A_{l+1})\,.$$
For $i \ne j$, the intersection $A_i \cap A_j$ of two distinct $k$-dimensional affine subspaces is either empty or an affine subspace of dimension less than $k$, hence the definition of $\mathcal{A}_k$ yields $P_X(A_i \cap A_j) = 0$. Therefore,
$$P_X(\tilde A_l \cap A_{l+1}) = P_X\big((A_1 \cap A_{l+1}) \cup \dots \cup (A_l \cap A_{l+1})\big) \le P_X(A_1 \cap A_{l+1}) + \dots + P_X(A_l \cap A_{l+1}) = 0\,,$$
which concludes the induction. Step 5: Conclusion. For $A \in S_k$, $A^\perp$ is a $(d-k)$-dimensional linear subspace of $\mathbb{R}^d$. If $x$ is not nonatomic, the set $\mathcal{A}_0$ is not empty, and hence $\mathcal{N} = \mathcal{U} \supseteq \bigcup_{A \in \mathcal{A}_0} A^\perp = \mathbb{R}^d$, which means that $\mathcal{N}$ is not a Lebesgue null set. Conversely, if $x$ is nonatomic, the set $\mathcal{A}_0$ is empty, and therefore $\mathcal{N} = \mathcal{U} = \bigcup_{k=1}^d \bigcup_{A \in \mathcal{A}_k} A^\perp$ is (by Step 4) a countable union of proper linear subspaces of $\mathbb{R}^d$, all of which are Lebesgue null sets. Therefore, $\mathcal{N}$ is a Lebesgue null set.

Proof. (a) By Lemma H.10, the set $\mathcal{N} := \{u \in \mathbb{R}^d \mid u^\top x \text{ is not nonatomic}\}$ is a Lebesgue null set. If $w_1$ is the first row of $W$, then clearly $\{W \in \mathbb{R}^{p \times d} \mid Wx \text{ is not nonatomic}\} \subseteq \{W \in \mathbb{R}^{p \times d} \mid w_1 \in \mathcal{N}\}$, where the right-hand side is a Lebesgue null set. This proves the claim. (b) This is trivial. (c) Let $z \in \mathbb{R}^d$. Since for each $i$ the function $\mathbb{R} \to \mathbb{R}$, $x \mapsto \sigma(x) - z_i$, is analytic and not the zero function by assumption, its zero set $\sigma^{-1}(\{z_i\})$ is countable by the identity theorem, Theorem H.2. Therefore, the set $\sigma^{-1}(\{z\}) = \{x \in \mathbb{R}^d \mid \sigma(x) = z\} = \sigma^{-1}(\{z_1\}) \times \dots \times \sigma^{-1}(\{z_d\})$ is also countable. Thus, since $x$ is nonatomic, we obtain
$$P(\sigma(x) = z) = P(x \in \sigma^{-1}(\{z\})) = \sum_{\tilde x \in \sigma^{-1}(\{z\})} P(x = \tilde x) = \sum_{\tilde x \in \sigma^{-1}(\{z\})} 0 = 0\,.$$
For proving independence from nonatomic random variables, we need some preparation. In the proof of the independence result for the case with biases (Lemma H.8), a Vandermonde matrix appeared.
In the case of networks without biases, we will not be able to use Lemma H.5 to control the power series coefficients of $\sigma$, which requires us to treat Vandermonde matrices with more general exponents.

Lemma H.12 (Random Vandermonde-type matrices). For $n \ge 1$, let $x_1, \dots, x_n$ be independent nonatomic $\mathbb{R}$-valued random variables and let $k_1, \dots, k_n$ be distinct non-negative integers. Then, the random Vandermonde-type matrix
$$V := V_{k_1, \dots, k_n}(x_1, \dots, x_n) := \begin{pmatrix} x_1^{k_1} & \cdots & x_n^{k_1} \\ \vdots & & \vdots \\ x_1^{k_n} & \cdots & x_n^{k_n} \end{pmatrix}$$
is invertible with probability one.

Proof. By swapping rows of $V$, we can assume without loss of generality that $0 \le k_1 < k_2 < \dots < k_n$. We prove the statement by induction on $n$. For $n = 1$, $V$ is invertible whenever $x_1 \ne 0$, which happens with probability one. Now assume that the statement holds for $n - 1 \ge 1$. For $i \in [n]$, define the submatrices $\hat V_i := V_{k_1, \dots, k_{n-1}}(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$. Since the statement holds for $n - 1$, $\hat V_n$ is invertible with probability one. Now, fix any such $x_1, \dots, x_{n-1}$ for which $\hat V_n$ is invertible. In particular, $C := \det(\hat V_n)$ is a non-zero constant. We will show that $V$ is then invertible almost surely over $x_n$. For this, we compute the determinant of $V$ using the Laplace expansion with respect to the last row of $V$ as
$$f(x_n) := \det(V) = \sum_{i=1}^n (-1)^{i+n} x_i^{k_n} \det(\hat V_i) = C x_n^{k_n} + \sum_{i=1}^{n-1} (-1)^{i+n} x_i^{k_n} \det(\hat V_i)\,.$$
By the Leibniz formula, for $i \in [n-1]$, $\det(\hat V_i)$ is a polynomial in $x_n$ of degree $\le k_{n-1} < k_n$. Hence, $f$ is a nonzero polynomial in $x_n$ of degree $k_n$ and has at most $k_n$ zeros. Since any finite set is a null set with respect to a nonatomic distribution, it follows that $\det(V) \ne 0$ almost surely over $x_n$. Since the assumptions on $x_1, \dots, x_{n-1}$ are also satisfied almost surely, the statement follows.

As in the case of networks with biases, we will prove the independence result by first reducing it to the case $d = 1$.
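Lemma H.12 is easy to probe empirically: for nonatomic (here uniform) independent entries and any fixed set of distinct exponents, the Vandermonde-type matrix is numerically invertible in every draw. A minimal sketch, assuming NumPy (the exponents are an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 4, 500
ks = [0, 2, 3, 7]                     # distinct non-negative exponents k_1, ..., k_n

for _ in range(trials):
    x = rng.uniform(-2.0, 2.0, size=n)                 # independent nonatomic entries
    V = np.array([[xi**k for xi in x] for k in ks])    # V[j, i] = x_i^{k_j}
    assert np.linalg.matrix_rank(V) == n               # invertible with probability one
```

Note that the exponents need not be consecutive, which is exactly the generality needed for activation functions whose power series has gaps.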
Since we also want to perform this reduction for random Fourier features in Proposition I.1, we state the reduction to the $d = 1$ case as a separate result, Lemma H.14, and capture the $d = 1$ case in the following definition.

Definition H.13 (Non-degenerate). Let $p, q \ge 1$. We call a function $f: \mathbb{R}^q \to \mathbb{R}^p$ non-degenerate if it is analytic and for arbitrary independent $\mathbb{R}$-valued nonatomic random variables $x_1, \dots, x_p$, there almost surely exists $w = w_{x_1, \dots, x_p} \in \mathbb{R}^q$ with
$$\det \begin{pmatrix} f(w x_1)^\top \\ \vdots \\ f(w x_p)^\top \end{pmatrix} \ne 0\,.$$

Lemma H.14. Let $f: \mathbb{R}^q \to \mathbb{R}^p$ be non-degenerate, let $W \in \mathbb{R}^{q \times d}$ be a random variable with a Lebesgue density and let $x_1, \dots, x_p \in \mathbb{R}^d$ be independent nonatomic random variables. Then,
$$\det \begin{pmatrix} f(W x_1)^\top \\ \vdots \\ f(W x_p)^\top \end{pmatrix} \ne 0$$
almost surely over $W$ and $x_1, \dots, x_p$.

Proof. By Lemma H.10, there exists $u \in \mathbb{R}^d$ such that for all $i \in [p]$, $\tilde x_i := u^\top x_i \in \mathbb{R}$ is nonatomic. Obviously, $\tilde x_1, \dots, \tilde x_p$ are independent. Fix $\tilde x_1, \dots, \tilde x_p$ such that $w = w_{\tilde x_1, \dots, \tilde x_p} \in \mathbb{R}^q$ as in Definition H.13 exists, which is true with probability one since $f$ is non-degenerate. Then, for $\tilde W := w u^\top$, we have
$$g(\tilde W) := \det \begin{pmatrix} f(\tilde W x_1)^\top \\ \vdots \\ f(\tilde W x_p)^\top \end{pmatrix} = \det \begin{pmatrix} f(w \tilde x_1)^\top \\ \vdots \\ f(w \tilde x_p)^\top \end{pmatrix} \ne 0\,.$$
Since $g$ is a non-zero analytic function, Theorem H.3 shows that $g$ is only zero on a Lebesgue null set, and this null set is also a null set with respect to the distribution of $W$ since $W$ has a Lebesgue density. Hence, $g(W) \ne 0$ almost surely over $W$.

The following lemma proves the $d = 1$ version of the independence result, which can then be upgraded to the general case using Lemma H.14.

Lemma H.15 (Independence from nonatomic random variables). Let $p \ge 1$. Let $\sigma: \mathbb{R} \to \mathbb{R}$ be analytic and not a polynomial with less than $p$ nonzero coefficients. Then, the elementwise application function $f: \mathbb{R}^p \to \mathbb{R}^p$, $x \mapsto (\sigma(x_1), \dots, \sigma(x_p))$, is non-degenerate in the sense of Definition H.13.

Proof. Let $x_1, \dots, x_p$ be independent scalar nonatomic random variables.
Step 1: Power series coefficients. Since $\sigma$ is analytic, there exist $\varepsilon > 0$ and coefficients $(a_k)_{k \ge 0}$ such that $\sigma(z) = \sum_{k=0}^\infty a_k z^k$ for $z \in \mathbb{R}$ with $|z| < \varepsilon$. Let $K := \{k \ge 0 \mid a_k \ne 0\}$. Then, $|K| \ge p$: assume that $|K| \le p - 1$; then the polynomial $h: \mathbb{R} \to \mathbb{R}$, $z \mapsto \sum_{k \in K} a_k z^k$, equals $\sigma$ on $(-\varepsilon, \varepsilon)$. Hence, the function $g := \sigma - h$ is zero on $(-\varepsilon, \varepsilon)$. By Theorem H.2, $g$ is the zero function, and hence $\sigma = h$ is a polynomial with less than $p$ nonzero coefficients, which we assumed not to be the case. Step 2: Condition on the $x_i$. By Step 1, we can choose indices $k_1 < k_2 < \dots < k_p$ with $k_1, \dots, k_p \in K$. Then, by Lemma H.12 and the Leibniz formula for the determinant, we have
$$D_x := \sum_{\pi \in S_p} \operatorname{sgn}(\pi)\, x_{\pi(1)}^{k_1} \cdots x_{\pi(p)}^{k_p} = \det \begin{pmatrix} x_1^{k_1} & \cdots & x_p^{k_1} \\ \vdots & & \vdots \\ x_1^{k_p} & \cdots & x_p^{k_p} \end{pmatrix} \ne 0$$
with probability one. Step 3: Determinant power series. Now, fix a realization of $x_1, \dots, x_p$ such that $D_x \ne 0$ holds. For $w \in \mathbb{R}^p$ with sufficiently small $\|w\|_\infty$, we can write
$$g(w) := \det \begin{pmatrix} f(w x_1)^\top \\ \vdots \\ f(w x_p)^\top \end{pmatrix} = \det \begin{pmatrix} \sigma(w_1 x_1) & \cdots & \sigma(w_p x_1) \\ \vdots & & \vdots \\ \sigma(w_1 x_p) & \cdots & \sigma(w_p x_p) \end{pmatrix} = \sum_{\pi \in S_p} \operatorname{sgn}(\pi) \prod_{i=1}^p \sum_{k=0}^\infty a_k (w_i x_{\pi(i)})^k = \sum_{k_1, \dots, k_p \ge 0} \left( \prod_{i=1}^p a_{k_i} \right) \left( \sum_{\pi \in S_p} \operatorname{sgn}(\pi) \prod_{i=1}^p x_{\pi(i)}^{k_i} \right) w_1^{k_1} \cdots w_p^{k_p}\,.$$
By Step 2, the coefficient of $w_1^{k_1} \cdots w_p^{k_p}$ for the chosen indices $k_1 < \dots < k_p$ equals $\left(\prod_{i=1}^p a_{k_i}\right) D_x \ne 0$. If $g$ were the zero function, all derivatives of $g$ would be zero and therefore the coefficients of all monomials would be zero, which is not the case. Hence, there exists $w \in \mathbb{R}^p$ with $g(w) \ne 0$. This shows that $f$ is non-degenerate.

H.4 RANDOM NETWORKS: CONCLUSION

In the following, we will prove Theorem 10 and discuss some possible extensions and limitations.

Theorem 10 (Random neural networks). Let $d, p, L \ge 1$, let $\sigma: \mathbb{R} \to \mathbb{R}$ be analytic and let the layer sizes be $d_0 = d$, $d_1, \dots, d_{L-1} \ge 1$ and $d_L = p$. Let $W^{(l)} \in \mathbb{R}^{d_{l+1} \times d_l}$ for $l \in \{0, \dots, L-1\}$ be random variables and consider the two cases where
(a) $\sigma$ is not a polynomial with less than $p$ nonzero coefficients, $\theta := (W^{(0)}, \dots, W^{(L-1)})$ and the random feature map $\phi_\theta: \mathbb{R}^d \to \mathbb{R}^p$ is recursively defined by $\phi_\theta(x^{(0)}) := x^{(L)}$, $x^{(l+1)} := \sigma(W^{(l)} x^{(l)})$;
(b) $\sigma$ is not a polynomial of degree $< p - 1$, $\theta := (W^{(0)}, \dots, W^{(L-1)}, b^{(0)}, \dots, b^{(L-1)})$ with random variables $b^{(l)} \in \mathbb{R}^{d_{l+1}}$ for $l \in \{0, \dots, L-1\}$, and the random feature map $\phi_\theta: \mathbb{R}^d \to \mathbb{R}^p$ is recursively defined by $\phi_\theta(x^{(0)}) := x^{(L)}$, $x^{(l+1)} := \sigma(W^{(l)} x^{(l)} + b^{(l)})$.
In both cases, if $\theta$ has a Lebesgue density and $x$ is nonatomic, then (FRK) holds for all $n$ almost surely over $\theta$.

Proof. By Lemma H.4, it suffices to consider the case where $\theta$ has a standard normal distribution, since a standard normal distribution has a nonzero probability density everywhere. In particular, we can assume that all parameters in $\theta$ are independent. By Proposition 6, we only need to prove (FRK) for $n = p$. Let $x^{(0)}_1, \dots, x^{(0)}_p \sim P_X$ be i.i.d. nonatomic random variables.
(a) If $p = 1$, $\sigma$ is allowed to be a non-zero constant function. In this case, the feature map $\phi_\theta$ is constant and non-zero with $p = 1$, which means that (FRK) holds. In the following, we thus assume that $\sigma$ is non-constant. Let $x^{(0)} \sim P_X$. Since $x^{(0)}$ is nonatomic and $\sigma$ is non-constant, an inductive application of Lemma H.11 yields that almost surely over $\theta$, $x^{(L-1)}$ is also nonatomic. Hence, $x^{(L-1)}_1, \dots, x^{(L-1)}_p$ are independent and nonatomic almost surely over $\theta$. But by Lemma H.15, the elementwise application of $\sigma$ is non-degenerate, and hence by Lemma H.14, we almost surely have
$$\det \begin{pmatrix} \sigma(W^{(L-1)} x^{(L-1)}_1)^\top \\ \vdots \\ \sigma(W^{(L-1)} x^{(L-1)}_p)^\top \end{pmatrix} \ne 0\,,$$
which implies that (FRK) holds for $n = p$ almost surely over $\theta$.
(b) If $p = 1$ and $\sigma$ is a polynomial of degree $p - 1 = 0$, then $\sigma$ is a non-zero constant function. As in case (a), this implies that $\phi_\theta$ is constant and non-zero with $p = 1$, which means that (FRK) holds. In the following, we thus assume again that $\sigma$ is non-constant, i.e. not a polynomial of degree less than $\max\{1, p-1\}$. Since the distribution of the $x^{(0)}_i := x_i$ is nonatomic, they are distinct almost surely. By Theorem H.9, $x^{(L)}_1, \dots, x^{(L)}_p$ are linearly independent almost surely over $\theta$, which proves (FRK) for $n = p$ almost surely over $\theta$.

Remark H.16 (Generalizations). The proof technique used in Theorem 10 is quite robust and can be generalized further. For example, it is easy to incorporate different activation functions for different neurons, and in all layers but the last, the activation functions only need to be analytic and non-constant, as required by the corresponding propagation lemmas. It is also possible to treat fixed but nonzero biases using a combination of Lemma H.11 (b) and shifted activation functions $\tilde\sigma_i(x) = \sigma(x + b_i)$ in Lemma H.15. Moreover, the propagation lemmas, which are used for all layers except the last one, can easily be extended to DenseNet-like structures where the input of a layer is concatenated to its output.

Remark H.17 (Necessity of the assumptions). For analytic $\sigma$, the assumptions in Theorem 10 that $\sigma$ is not a too simple polynomial are necessary. To see this, consider the case of $L = 1$ layer and $d = 1$.
(a) Assume that $\sigma$ is a polynomial with less than $p$ nonzero coefficients, i.e. $\sigma(x) = \sum_{k \in K} a_k x^k$ with $|K| \le p - 1$. For arbitrary weights $w \in \mathbb{R}^{p \times 1}$ and data points $x_1, \dots, x_p \in \mathbb{R}$, we obtain the feature matrix $\phi_w(X) \in \mathbb{R}^{p \times p}$ with $\phi_w(X)_{ij} = \sigma(w_j x_i) = \sum_{k \in K} a_k w_j^k x_i^k$, which means that $\phi_w(X)$ is the sum of the $|K| \le p - 1$ matrices $(a_k w_j^k x_i^k)_{i,j \in [p]}$, which have at most rank 1. Hence, $\phi_w(X)$ has at most rank $p - 1$ and is therefore not invertible.
(b) Assume that $\sigma$ is a polynomial of degree less than $p - 1$, i.e. $\sigma(x) = \sum_{k=0}^{p-2} a_k x^k$. For arbitrary weights $w \in \mathbb{R}^{p \times 1}$, biases $b \in \mathbb{R}^p$ and data points $x_1, \dots, x_p \in \mathbb{R}$, we obtain the feature matrix $\phi_{w,b}(X) \in \mathbb{R}^{p \times p}$ with
$$\phi_{w,b}(X)_{ij} = \sigma(w_j x_i + b_j) = \sum_{k=0}^{p-2} a_k (w_j x_i + b_j)^k = \sum_{k=0}^{p-2} \sum_{l=0}^k a_k \binom{k}{l} b_j^{k-l} w_j^l x_i^l = \sum_{l=0}^{p-2} \left( \sum_{k=l}^{p-2} a_k \binom{k}{l} b_j^{k-l} w_j^l \right) x_i^l = \sum_{l=0}^{p-2} u^{(l)}_j x_i^l\,,$$
where $u^{(l)}_j := \sum_{k=l}^{p-2} a_k \binom{k}{l} b_j^{k-l} w_j^l$ does not depend on $i$. Hence, $\phi_{w,b}(X)$ is the sum of the $p - 1$ matrices $(u^{(l)}_j x_i^l)_{i,j \in [p]}$, each of which has rank at most 1. Hence, $\phi_{w,b}(X)$ has at most rank $p - 1$ and is therefore not invertible.
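The rank-deficiency argument of Remark H.17 (a) is directly observable in floating point: for a polynomial activation with $|K| = 2$ nonzero coefficients and $p = 3$, the feature matrix is a sum of two rank-one matrices and hence numerically singular. A sketch, assuming NumPy (the specific polynomial is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 3
sigma = lambda t: t**2 + t**5       # polynomial with |K| = 2 < p nonzero coefficients

w = rng.normal(size=p)               # weights, d = 1, no biases
x = rng.normal(size=p)               # p data points
Z = sigma(np.outer(x, w))            # Z[i, j] = sigma(w_j x_i), sum of 2 rank-1 terms
s = np.linalg.svd(Z, compute_uv=False)

assert s[2] < 1e-8 * s[0]            # third singular value vanishes: rank <= p - 1
```

Replacing `sigma` by any analytic activation with at least $p$ nonzero power series coefficients (e.g. tanh) would make the smallest singular value bounded away from this rounding-error level, in line with Theorem 10.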

I RANDOM FOURIER FEATURES

In a celebrated paper, Rahimi & Recht (2008) propose to approximate a shift-invariant positive definite kernel $k(x, x') = \kappa(x - x')$ with a potentially infinite-dimensional feature map by a random finite-dimensional feature map, yielding so-called random Fourier features. If $\kappa$ is (up to scaling) the Fourier transform of a probability distribution $P_k$ on $\mathbb{R}^d$, two versions of random Fourier features are proposed:
(1) One version uses $\phi_{W,b}(x) = \sqrt{2} \cos(W x + b)$, where the rows of $W \in \mathbb{R}^{p \times d}$ are independently sampled from $P_k$ and the entries of $b \in \mathbb{R}^p$ are independently sampled from the uniform distribution on $[0, 2\pi]$. This feature map is covered by Theorem 10 and hence, if $P_k$ has a Lebesgue density and $x$ is nonatomic, (FRK) is satisfied for all $n$. For example, if $k$ is a Gaussian kernel, $P_k$ is a Gaussian distribution and therefore has a Lebesgue density.
(2) The other version uses $\phi_W(x) = \binom{\sin(W x)}{\cos(W x)}$ with the same distribution over $W$. It is not covered by Theorem 10 because of the different "activation functions" and the "weight sharing" between these activation functions. In the following proposition, we show that the proof of Theorem 10 can be adjusted to this setting and the conclusions still hold.

Proposition I.1. For $x \in \mathbb{R}^d$, $W \in \mathbb{R}^{q \times d}$ and $p := 2q$, define $\phi_W(x) := \binom{\sin(W x)}{\cos(W x)} \in \mathbb{R}^p$. If $W$ has a Lebesgue density and $x$ is nonatomic, then (FRK) holds for all $n$ almost surely over $W$.

Proof. Step 1: Reduction. According to Proposition 6, it suffices to consider the case $n = p$. By Lemma H.14, it is then sufficient to prove that the function $f: \mathbb{R}^q \to \mathbb{R}^{2q}$, $\tilde x \mapsto (\sin(\tilde x), \cos(\tilde x))$, is non-degenerate in the sense of Definition H.13.
Step 2: Condition on the $x_i$. We proceed similarly to Lemma H.15. Let $x_1, \dots, x_p$ be independent scalar nonatomic random variables. For $i \in [q]$, choose $k_i := 2i - 1$ and $k_{q+i} := 2i - 2$. Then, $k_1, \dots, k_p$ are distinct non-negative integers, and by Lemma H.12 and the Leibniz formula for the determinant, we have
$$D_x := \sum_{\pi \in S_p} \operatorname{sgn}(\pi)\, x_{\pi(1)}^{k_1} \cdots x_{\pi(p)}^{k_p} = \det \begin{pmatrix} x_1^{k_1} & \cdots & x_p^{k_1} \\ \vdots & & \vdots \\ x_1^{k_p} & \cdots & x_p^{k_p} \end{pmatrix} \ne 0$$
with probability one.
Step 3: Non-degeneracy. Now, suppose that we are indeed in the case where $D_x \ne 0$. Taking the power series of $\sin$ and $\cos$ and expanding the determinant $g(w) := \det(f(w x_1), \dots, f(w x_p))^\top$ as in the proof of Lemma H.15, we consider the coefficients $c_{k_1, \dots, k_{2q}}$ of the monomials $w_1^{k_1} \cdots w_{2q}^{k_{2q}}$. As in Lemma H.15, $c_{k_1, \dots, k_{2q}}$ contains the factor $\sum_{\pi \in S_p} \operatorname{sgn}(\pi) \prod_{i=1}^p x_{\pi(i)}^{k_i}$; if $k_i = k_j$ for some $i \ne j$, this alternating sum vanishes, since it is the determinant of a matrix with two equal rows. For the exponents chosen in Step 2, however, this factor equals $D_x \ne 0$, and the corresponding power series coefficients of $\sin$ (odd exponents) and $\cos$ (even exponents) are nonzero, so $c_{k_1, \dots, k_{2q}} \ne 0$ and $g$ is not the zero function. This shows the claim.
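Proposition I.1 can be verified empirically: stacking $\sin$ and $\cos$ features of the same random projections yields a full-rank $n = p = 2q$ feature matrix in every sampled realization. A minimal sketch, assuming NumPy and a Gaussian spectral distribution for $W$ (as for an approximated Gaussian kernel):

```python
import numpy as np

rng = np.random.default_rng(4)
d, q, trials = 3, 4, 200
p = 2 * q

ok = 0
for _ in range(trials):
    W = rng.normal(size=(q, d))              # rows sampled from a Gaussian P_k
    X = rng.normal(size=(p, d))              # n = p nonatomic inputs
    Z = np.hstack([np.sin(X @ W.T), np.cos(X @ W.T)])  # rows phi_W(x_i), p x p
    ok += int(np.linalg.matrix_rank(Z) == p)

assert ok == trials                           # full rank in every draw
```

Note the weight sharing: the same $q$ projections feed both the $\sin$ and the $\cos$ block, which is exactly the structure not covered by Theorem 10.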

J PROOFS FOR SECTION 6

In this section, we first prove the analytic formulas from Section 6 before discussing the case of low input dimension $d$.

Theorem 11. Let $P_Z = \mathcal{U}(S^{p-1})$. Then, $P_Z$ satisfies the assumptions (MOM), (COV) and (FRK) for all $n$ with $\Sigma = \frac{1}{p} I_p$. Moreover, for $n \ge p = 1$ or $p \ge n \ge 1$, we can compute
$$\mathbb{E}_Z \operatorname{tr}\big((Z^+)^\top \Sigma Z^+\big) = \begin{cases} \frac{1}{n} & \text{if } n \ge p = 1, \\ \frac{1}{p} & \text{if } p \ge n = 1, \\ \infty & \text{if } 2 \le n \le p \le n + 1, \\ \frac{n}{p-1-n} \cdot \frac{p-2}{p} & \text{if } 2 \le n \le n + 2 \le p. \end{cases}$$

Proof. Step 1: Verify (MOM). Let $x_i \sim \mathcal{N}(0, I_p)$ for $i \in [n]$ be independent. Then, $z_i := \frac{x_i}{\|x_i\|_2} \sim \mathcal{U}(S^{p-1})$. Since $\mathbb{E} \|z_i\|_2^2 = \mathbb{E} 1 = 1$, (MOM) is satisfied and thus $\Sigma$ is well-defined.
Step 2: Compute $\Sigma$. We can use rotational invariance as follows: let $V \in \mathbb{R}^{p \times p}$ be an arbitrary fixed orthogonal matrix. Then, $V x_i \sim \mathcal{N}(0, V V^\top) = \mathcal{N}(0, I_p)$ and hence $V z_i = \frac{V x_i}{\|x_i\|_2} = \frac{V x_i}{\|V x_i\|_2} \sim \mathcal{U}(S^{p-1})$. Therefore,
$$\Sigma = \mathbb{E}\, z_i z_i^\top = \mathbb{E}\, V z_i z_i^\top V^\top = V \Sigma V^\top\,.$$
If $0 \ne v \in \mathbb{R}^p$ is an eigenvector of $\Sigma$ with eigenvalue $\lambda$, then $V v$ must, by the above equation, also be an eigenvector of $\Sigma$ with eigenvalue $\lambda$. But since $V$ is an arbitrary orthogonal matrix, $V v$ can be an arbitrary rotation of $v$. From this it is easy to conclude that $\Sigma = \lambda I_p$, and from $p \lambda = \operatorname{tr}(\Sigma) = \mathbb{E} \operatorname{tr}(z_i z_i^\top) = \mathbb{E}\, z_i^\top z_i = \mathbb{E} 1 = 1$, it follows that $\Sigma = \frac{1}{p} I_p$. Hence, (COV) is satisfied and $w_i = \sqrt{p}\, z_i$.
Step 3: Verify (FRK) for all $n$. By Proposition 6, it is sufficient to verify (FRK) for $n = p$. Therefore, let $n = p$. It is obvious from Proposition 6 with $\phi = \operatorname{id}$ that $\mathcal{N}(0, I_p)$ satisfies (FRK). Hence, $X$ almost surely has full rank. But then, since $\|x_i\|_2 > 0$ almost surely, $Z = \operatorname{diag}\left(\frac{1}{\|x_1\|_2}, \dots, \frac{1}{\|x_n\|_2}\right) X$ almost surely has full rank as well, which proves (FRK).
Step 4.1: Computation for $n \ge p = 1$. In the underparameterized case $n \ge p = 1$, we can compute
$$\mathbb{E}_Z \operatorname{tr}\big((Z^+)^\top \Sigma Z^+\big) = \mathbb{E}_Z \operatorname{tr}\big((W^\top W)^{-1}\big) = \mathbb{E}_Z \frac{1}{\sum_{i=1}^n w_i^2} = \mathbb{E}_Z \frac{1}{n} = \frac{1}{n}\,.$$
Step 4.2: Computation for $p \ge n = 1$.
In the overparameterized case $p \ge n = 1$, we can compute
$$\mathbb{E}_Z \operatorname{tr}\big((Z^+)^\top \Sigma Z^+\big) = \mathbb{E}_Z \operatorname{tr}\big((W W^\top)^{-1}\big) = \mathbb{E}_Z \frac{1}{w_1^\top w_1} = \mathbb{E}_Z \frac{1}{p} = \frac{1}{p}\,,$$
where we used that since $\Sigma = \frac{1}{p} I_p$, $w_1^\top w_1 = \|w_1\|_2^2 = \|\sqrt{p}\, z_1\|_2^2 = p$.
Step 4.3: Computation for $p \ge n \ge 2$. Now, let $p \ge n \ge 2$. Since $\Sigma = \frac{1}{p} I_p$, we have $\mathbb{E}_Z \operatorname{tr}\big((Z^+)^\top \Sigma Z^+\big) = \mathbb{E}_Z \operatorname{tr}\big((W W^\top)^{-1}\big)$ by Theorem 3. Using that the $w_i$ are i.i.d., we obtain from Lemma G.2 that $\mathbb{E} \operatorname{tr}\big((W W^\top)^{-1}\big) = n\, \mathbb{E}\, \operatorname{dist}(w_1, W_{-1})^{-2}$, where $W_{-1}$ is the space spanned by $w_2, \dots, w_n$. Define the subspace $U_n := \{z \in \mathbb{R}^p \mid z_n = z_{n+1} = \dots = z_p = 0\}$. By (FRK), we almost surely have $\dim(W_{-1}) = n - 1$. Thus, there is an orthogonal matrix $U_{-1}$ depending only on $W_{-1}$ that rotates $W_{-1}$ to $U_n$: $U_n = U_{-1} W_{-1}$. Because $w_1$ is stochastically independent of $W_{-1}$ and $U_{-1}$ and its distribution is rotationally symmetric, we have the distributional equality (using the $z_i$ and $x_i$ from Step 1)
$$\operatorname{dist}(w_1, W_{-1})^2 = \operatorname{dist}(U_{-1} w_1, U_{-1} W_{-1})^2 \overset{\text{distrib.}}{=} \operatorname{dist}(w_1, U_n)^2 = p\,(z_{1,n}^2 + \dots + z_{1,p}^2)$$

(b) Let $(p_k, \phi)$ and $(p_k, \tilde\phi)$ be two $P_X$-representations of $k$ such that $z = \phi(x)$ satisfies the assumptions of Theorem 3. We need to show that $\tilde z = \tilde\phi(x)$ also satisfies the assumptions of Theorem 3: first of all, (INT) and (NOI) hold since they are independent of the feature map. Moreover, (COV) holds by (a). We find that (MOM) holds due to
$$\mathbb{E} \|\tilde z\|_2^2 = \mathbb{E}\, \tilde\phi(x)^\top \tilde\phi(x) = \mathbb{E}\, k(x, x) = \mathbb{E}\, \phi(x)^\top \phi(x) = \mathbb{E} \|z\|_2^2 < \infty\,,$$
and (FRK) holds since, almost surely,
$$\operatorname{rank} \tilde\phi(X) = \operatorname{rank}\big(\tilde\phi(X) \tilde\phi(X)^\top\big) = \operatorname{rank}\big(k(X, X)\big) = \operatorname{rank}\big(\phi(X) \phi(X)^\top\big) = \operatorname{rank} \phi(X) = \min\{n, p\}\,.$$
(c) Assume that there exists a $P_X$-representation $(p, \phi)$ of $k$ that satisfies the assumptions of Theorem 3. By (a), we have $p = p_k$. By Eq. (2) in Section 3, the ridgeless kernel regression estimator and the linear regression estimator with the feature map $\phi$ are almost surely equivalent, hence they have the same $E_{\mathrm{Noise}}$.

For kernels that cannot be represented with a finite-dimensional feature space, Theorem 3 cannot be applied.
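The closed-form value of Theorem 11 in the overparameterized regime can be checked by Monte Carlo. The sketch below (assuming NumPy; $n = 2$ is used so that $\operatorname{tr}\big((ZZ^\top)^{-1}\big)$ takes the simple form $2/(1 - c^2)$ with $c = z_1^\top z_2$) estimates $\mathbb{E}_Z \operatorname{tr}\big((Z^+)^\top \Sigma Z^+\big) = \operatorname{tr}\big((ZZ^\top)^{-1}\big)/p$ for $\Sigma = \frac{1}{p} I_p$ and compares it with $\frac{n}{p-1-n} \cdot \frac{p-2}{p}$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, N = 2, 6, 100_000       # overparameterized regime with p >= n + 2

# Sample N feature matrices Z with rows uniform on the unit sphere S^{p-1}
Z = rng.normal(size=(N, n, p))
Z /= np.linalg.norm(Z, axis=2, keepdims=True)

# For n = 2 and unit rows, tr((Z Z^T)^{-1}) = 2 / (1 - c^2) with c = z_1 . z_2,
# so tr((Z^+)^T Sigma Z^+) = 2 / (p (1 - c^2)) for Sigma = I_p / p.
c = np.sum(Z[:, 0, :] * Z[:, 1, :], axis=1)
mc = np.mean(2.0 / (p * (1.0 - c**2)))

exact = n / (p - 1 - n) * (p - 2) / p   # closed form for 2 <= n <= n + 2 <= p
assert abs(mc - exact) < 0.02           # Monte Carlo agrees with 4/9
```

The same simulation with $p = n$ or $p = n + 1$ would produce diverging averages, reflecting the $\infty$ case of the theorem, i.e., the double descent peak at the interpolation threshold.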
In fact, any distribution-independent lower bound for ridgeless kernel regression must be zero in this case: for example, the Kronecker delta kernel given by $k(x, \tilde x) = 1$ if $x = \tilde x$ and $k(x, \tilde x) = 0$ otherwise yields $E_{\mathrm{Noise}} = 0$ for any nonatomic input distribution $P_X$. Of course, this kernel is not well-suited for learning since the learned functions are zero almost everywhere. However, there exist results for ridgeless kernel regression with specific classes of kernels. For example, Rakhlin & Zhai (2019) show that in certain settings, ridgeless kernel regression with Laplace kernels is inconsistent because $E_{\mathrm{Noise}} = \Omega(1)$ as $n \to \infty$. Note that Laplace kernels in general do not allow for finite-dimensional feature map representations. Liang & Rakhlin (2020) derive upper bounds for a certain class of kernels and input distributions with (linearly transformed) i.i.d. components. Their analysis focuses on the high-dimensional limit $d, n \to \infty$ with $0 < c \le d/n \le C < \infty$ and ignores the "effective dimension" $p_k$ of the feature space. It appears that their analysis is not impacted by double descent w.r.t. $p_k$ since their assumptions on the kernel imply either $p_k = \infty$ or $p_k/n \to \infty$ as $n, d \to \infty$: in particular, they consider kernels of the form $k(x, \tilde x) = h\left(\frac{1}{d} \langle x, \tilde x \rangle\right)$ for a suitable smooth function $h$ that is independent of $d$. Due to the factor $\frac{1}{d}$ and the limit $d \to \infty$, the kernel behaves essentially like a quadratic kernel
$$k(x, \tilde x) \approx a_0 + a_1 \frac{1}{d} \langle x, \tilde x \rangle + a_2 \left( \frac{1}{d} \langle x, \tilde x \rangle \right)^2 =: k_{\mathrm{quad}}(x, \tilde x)\,,$$
where the curvature $a_2$ should be positive in order to obtain good upper bounds on $E_{\mathrm{Noise}}$ (the variance term). For $a_2, a_1, a_0 > 0$, it is possible to represent this quadratic kernel with a feature map analogous to that of the polynomial kernel in Proposition 8 with feature space dimension $p = 1 + d + \frac{d(d+1)}{2}$. An argument similar to the proof of Proposition 8 shows that this feature map satisfies FRK($p$) if $x$ has a Lebesgue density, hence (COV) is satisfied.
By Lemma K.2 (a), we have $p_{k_{\mathrm{quad}}} = p = \Theta(d^2) = \Theta(n^2)$, which shows $p_{k_{\mathrm{quad}}}/n \to \infty$ as $n, d \to \infty$. Liang et al. (2019) consider a similar setting with $d \sim n^\alpha$, $\alpha \in (0, 1)$, and find that $E_{\mathrm{Noise}}$ converges to zero under suitable assumptions as $d, n \to \infty$. Again, it appears that their assumptions on the kernel imply at least a strongly overparameterized regime with $p_k = \infty$ or $p_k/n \to \infty$ for $d, n \to \infty$, where our lower bound is vacuous.
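The effective dimension $p_{k_{\mathrm{quad}}} = 1 + d + \frac{d(d+1)}{2}$ can be observed directly: for inputs with a density and $n > p$, the kernel matrix of $k_{\mathrm{quad}}$ has rank exactly $p$. A sketch, assuming NumPy (the positive coefficients $a_0, a_1, a_2$ are arbitrary choices for this illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 3
p = 1 + d + d * (d + 1) // 2            # effective dimension of the quadratic kernel
n = p + 5                                # more samples than features
a0, a1, a2 = 1.0, 1.0, 1.0

X = rng.normal(size=(n, d))              # inputs with a Lebesgue density
G = X @ X.T / d                          # Gram matrix of <x_i, x_j> / d
K = a0 + a1 * G + a2 * G**2              # elementwise: k_quad(x_i, x_j)

assert np.linalg.matrix_rank(K) == p     # rank(k(X, X)) = p_k, not min(n, n)
```

The $n \times n$ kernel matrix is rank-deficient exactly by the amount predicted by the feature-space dimension, which is the quantity our lower bound is phrased in.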

L NOVELTY OF THE OVERPARAMETERIZED BOUND

In their Corollary 1, Muthukumar et al. (2020) provide a lower bound in the case $p \ge n$, holding with high probability, for $\varepsilon^\top (W W^\top)^{-1} \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, I_n)$ is a noise vector independent of



Footnotes:
• Although many of our theorems apply to general domains $x_i \in \mathcal{X}$ and not just $\mathcal{X} = \mathbb{R}^d$, we set $\mathcal{X} = \mathbb{R}^d$ for notational simplicity. We require $\mathcal{X} = \mathbb{R}^d$ whenever we assume that the distribution of the $x_i$ has a Lebesgue density or work with analytic feature maps.
• $\phi^{(p)}_\theta: \mathbb{R}^d \to \mathbb{R}^p$ for each value of $p$. For the following NTK and polynomial kernels, there is no canonical way to
• For random NN feature maps as in Theorem 10, one can interpret the linear regression as being an extra layer on top of the neural network, and therefore the last layer of the feature map should contain an activation function. For NTK feature maps, one can instead interpret the linear regression as performing a "linearized" training of the whole NN, and the whole NN usually does not contain an activation function in the last layer.
• We use $n > p$ since this usually improves the "inverse condition number" of $Z$.
• For $n = 1$, (FRK) holds and it is easy to see that the lower bound holds exactly in this case.
• It is in principle possible to extend our results to the case with appended ones in the feature matrix by choosing the activation function for one of the output neurons to be $\sigma \equiv 1$. As discussed in Remark H.16, our arguments have no problem handling different activation functions. In this case, we would only need to adapt the corresponding Taylor series coefficients in Lemma H.8.



(a) FRK($p$) holds. (b) FRK($n$) holds for all $n \ge 1$. (c) (COV) holds. (d) There exists a fixed deterministic matrix $\tilde X \in \mathbb{R}^{p \times d}$ such that $\det(\phi(\tilde X)) \ne 0$. We have (a) ⇔ (b) ⇒ (c) ⇒ (d). Furthermore, if $x \in \mathbb{R}^d$ has a Lebesgue density and $\phi$ is analytic, then (a)–(d) are equivalent.

, and (b) if $x \in \mathbb{R}^d$ has a Lebesgue density and we use $z = \phi(x)$, then (FRK) is satisfied for all $n$.

Figure C.1: Estimated E_Noise for random neural network feature maps (cf. Theorem 10) with different activation functions and d_0 = d = 10, d_1 = d_2 = 256, d_3 = p = 30. We include the results from P_Z = U(S^{p−1}) as comparison, cf. Section 6.

Figure C.1 shows E_Noise for random three-layer neural network feature maps with p = 30, different activation functions and varying n. Note that all neural networks produce higher E_Noise than P_Z = U(S^{p−1}). The effect of non-isotropic covariance matrices in the overparameterized regime can be clearly seen when comparing Figure C.1 to Figure C.2, where features have been whitened separately for each set of random parameters θ, cf. Remark G.1. Figure C.3 then shows E_Noise for n = 30 and varying p. Note that double descent is usually plotted as a function of the "model complexity" p as in Belkin et al. (2019a), but varying p is only possible when we have a (random) feature map φ^(p)_θ : R^d → R^p for each value of p.

Figure C.4 shows E_Noise for various random finite-width Neural Tangent Kernels (NTKs), cf. Jacot et al. (2018). These results mostly exhibit larger E_Noise than the random NN feature maps from Figure C.1, perhaps because of correlations between parameter gradients in different layers. However, this

Figure C.5 and Figure C.6 show E_Noise for two variants of random Fourier features for two different scalings of the random parameters. Figure C.6 shows that for random Gaussian parameters with large variance (corresponding to an approximated narrow Gaussian kernel), the values of E_Noise for random Fourier features are very close to the values for P_Z = U(S^{p−1}). We decided to plot these values relative to each other as in Figure 1, since the curves would overlap in a normal plot like Figure C.5. Note that the version of random Fourier features with sin and cos features automatically yields constant ‖z‖_2, like for P_Z = U(S^{p−1}). Finally, Figure C.7 shows that linear regression with the polynomial kernel is quite sensitive to label noise. We use p = 35 for the polynomial kernel since there are no particularly interesting polynomial kernels with p = 30.

Neural network feature maps. For Figures C.1, C.2 and C.3, we use random neural network feature maps without biases as in Theorem 10 with d_0 = d = 10, d_1 = d_2 = 256 and d_3 = p. As the input distribution P_X, we use N(0, I_d). We initialize the NN weights independently as W

Figure C.5: Estimated E_Noise for the two versions of random Fourier features described in Appendix I. We use d = 10, P_X = N(0, I_d), p = 30 and the weight vector distribution P_k = N(0, (1/p) I_d) (cf. Appendix I).

Figure C.6: Estimated E Noise for the two versions of random Fourier features described in Appendix I relative to P Z = U(S p-1 ). We use d = 10, P X = N (0, I d ), p = 30 and the weight vector distribution P k = N (0, I d ) (cf. Appendix I).
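The constant-norm property of the sin–cos variant of random Fourier features follows directly from cos² + sin² = 1. A minimal sketch under assumed scaling conventions (the exact construction in Appendix I may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 15                      # p = 2m = 30 features
W = rng.standard_normal((m, d))    # random frequencies (scaling convention is an assumption)

def rff(x):
    # paired cos/sin features: cos^2 + sin^2 = 1 makes ||z||_2 constant
    u = W @ x
    return np.concatenate([np.cos(u), np.sin(u)]) / np.sqrt(m)

for _ in range(5):
    z = rff(rng.standard_normal(d))
    assert np.isclose(np.linalg.norm(z), 1.0)   # constant norm, like U(S^{p-1})
```

This explains why the sin–cos variant behaves similarly to P_Z = U(S^{p−1}) in Figure C.6: its features also lie on a sphere of fixed radius.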

Figure C.7: Estimated E_Noise for the polynomial kernel (cf. Proposition 8) with d = 3, m = 4, c = 1, P_X = N(0, I_d), resulting in p = \binom{m+d}{m} = \binom{7}{4} = 35.
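The dimension p = \binom{m+d}{m} = 35 and the multinomial feature construction behind Proposition 8 can be verified numerically. A sketch assuming the polynomial kernel k(x, x̃) = (x^⊤x̃ + c)^m with d = 3, m = 4, c = 1 as in Figure C.7:

```python
import itertools
import math
import numpy as np

d, m, c = 3, 4, 1.0

# multi-indices (m_1, ..., m_{d+1}) with m_1 + ... + m_{d+1} = m
M = [k for k in itertools.product(range(m + 1), repeat=d + 1) if sum(k) == m]
assert len(M) == math.comb(m + d, d) == 35   # p = binom(m+d, d)

def multinom(k):
    out = math.factorial(m)
    for ki in k:
        out //= math.factorial(ki)
    return out

def phi(x):
    # phi_k(x) = sqrt(C(k)) * x_1^{k_1} ... x_d^{k_d} * c^{k_{d+1}/2}
    return np.array([math.sqrt(multinom(k))
                     * np.prod(np.asarray(x, dtype=float) ** np.asarray(k[:d]))
                     * c ** (k[d] / 2) for k in M])

rng = np.random.default_rng(0)
x, y = rng.standard_normal(d), rng.standard_normal(d)
assert np.isclose(phi(x) @ phi(y), (x @ y + c) ** m)   # multinomial theorem
```

The final assertion is exactly the multinomial expansion of (x^⊤y + c)^m, split symmetrically between φ(x) and φ(y).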

For Figure C.3, we perform an analogous optimization for p by taking the first p ∈ [256] of the d_3 = 256 output features in Z.

with p = 30, we use a neural network feature map with d 0 = d = p = 30, d 1 = d 2 = 256, d 3 = p = 30 and tanh activation function. We use NTK parameterization and zero-initialized biases, leading to

Figure D.1: The counterexample given in Example D.1 for p = 30 often has lower E_Noise than the lower bound from Theorem 3, but violates its assumption (FRK). For the counterexample, E_Noise was approximated as explained in Example D.1 using 10^6 Monte Carlo samples for each n. We assume Var(y|z) = 1 almost surely over z.

Remark D.2 (Histogram regression). The distribution P_Z in Example D.1 may seem contrived at first, but such a distribution can arise in histogram regression (cf. e.g. Chapter 4 in Györfi et al., 2002). For example, suppose that P_X is supported on a domain D ⊆ R^d and this domain is partitioned into disjoint sets A_1, ..., A_p. Then, performing histogram regression on this partition is equivalent to performing ridgeless linear regression with the feature map φ : R^d → R^p with φ(x) := e_i if x ∈ A_i. If all cells are equally likely, i.e. P_X(A_i) = 1/p for all i ∈ {1, ..., p}, then P_Z is the uniform distribution on {e_1, ..., e_p} as in Example D.1.
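The equivalence between histogram regression and ridgeless linear regression with one-hot features can be checked directly: the least-squares coefficient for each cell is the mean label in that cell. A minimal sketch (our own illustration, with arbitrary cell assignments and labels):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 200
cells = rng.integers(0, p, size=n)        # which cell A_i each sample falls into
y = rng.standard_normal(n) + cells        # arbitrary labels

Z = np.eye(p)[cells]                      # phi(x) = e_i for x in A_i
beta = np.linalg.lstsq(Z, y, rcond=None)[0]   # ridgeless (min-norm) solution

# the fitted coefficient for each cell is the mean label in that cell
for i in range(p):
    assert np.isclose(beta[i], y[cells == i].mean())
```

This works because Z^⊤Z is the diagonal matrix of cell counts, so the normal equations decouple into one averaging problem per cell.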

, and (b) if x ∈ R^d has a Lebesgue density and we use z = φ(x), then (FRK) is satisfied for all n.

Proof. Let M := {(m_1, ..., m_{d+1}) ∈ N_0^{d+1} | m_1 + ... + m_{d+1} = m} and for m = (m_1, ..., m_{d+1}) ∈ M, let C(m) := \binom{m}{m_1, ..., m_{d+1}} be the corresponding multinomial coefficient. Then, |M| = \binom{m+d}{d} = p. Define the feature map φ : R^d → R^p by φ(x) :

Lemma H.11 (Propagation of nonatomic random variables). Let the random variable x ∈ R^d be nonatomic.
(a) For all p ≥ 1 and (Lebesgue-) almost all W ∈ R^{p×d}, W x is nonatomic.
(b) For all b ∈ R^d, x + b is nonatomic.
(c) If σ : R → R is analytic and not constant, then σ(x) ∈ R^d is nonatomic, where σ is applied element-wise.

Now, consider the special values k_1, ..., k_p chosen in Step 2. Since k_i ∈ K, we have a_{k_i} ≠ 0. The coefficient of the multivariate monomial w_1^{k_1} · ... · w_p^{k_p} is therefore a_{k_1} · ... · a_{k_p} · D_x ≠ 0.

Let K := {(k_1, ..., k_{2q}) ∈ N_0^{2q} | for all i ∈ [q], k_i + k_{q+i} = 2i − 1}. Then, the coefficient c of the monomial w_1^1 w_2^3 · ... · w_q^{2q−1} in (13) can be written as

c := Σ_{(k_1,...,k_{2q}) ∈ K} c_{k_1,...,k_{2q}},    c_{k_1,...,k_{2q}} := a_{k_1} ··· a_{k_q} b_{k_{q+1}} ··· b_{k_{2q}},    (k_1, ..., k_{2q}) ∈ K.

Note that a_k ≠ 0 iff k is odd and b_k ≠ 0 iff k is even. For the choice k_i := 2i − 1, k_{q+i} := 2i − 2 for i ∈ [q] from Step 2, we have (k_1, ..., k_{2q}) ∈ K and c_{k_1,...,k_{2q}} = a_{k_1} ··· a_{k_q} b_{k_{q+1}} ··· b_{k_{2q}} ≠ 0.

since the i-th and j-th rows of the matrix are equal, and hence c_{k_1,...,k_{2q}} = 0. Now, suppose that (k_1, ..., k_{2q}) ∈ K with c_{k_1,...,k_{2q}} ≠ 0. By induction, it is easy to show that {k_i, k_{q+i}} = {2i − 1, 2i − 2} for all i ∈ [q]. But since a_{k_1} ··· a_{k_q} b_{k_{q+1}} ··· b_{k_{2q}} ≠ 0 and a_k = 0 for even k, we need to have k_i = 2i − 1, k_{q+i} = 2i − 2 for all i ∈ [q]

where A := x_{1,n}² + x_{1,n+1}² + ... + x_{1,p}² has a χ²_{p+1−n} distribution and B := x_{1,1}² + ... + x_{1,n−1}² has a χ²_{n−1} distribution. Hence,

E tr((W W^⊤)^{−1}) = n E(dist(w_1, W_{−1})^{−2}) = (n/p) (1 + E[B/A]).
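Both expectations in this display can be sanity-checked by Monte Carlo (our own check, with arbitrary small n and p; rows w_i = √p x_i/‖x_i‖_2 for x_i ∼ N(0, I_p), and E[B/A] = E[B] E[1/A] = (n−1)/(p−n−1) using E[1/χ²_k] = 1/(k−2)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 16

# E[B/A] = E[B] * E[1/A] = (n-1)/(p-n-1) for independent chi-squares
A = rng.chisquare(p + 1 - n, size=400_000)
B = rng.chisquare(n - 1, size=400_000)
assert np.isclose((B / A).mean(), (n - 1) / (p - n - 1), rtol=0.02)

# Monte Carlo check of E tr((W W^T)^{-1}) = (n/p) (1 + E[B/A])
# for rows w_i = sqrt(p) * x_i / ||x_i||_2 with x_i ~ N(0, I_p)
traces = np.empty(20_000)
for t in range(len(traces)):
    X = rng.standard_normal((n, p))
    W = np.sqrt(p) * X / np.linalg.norm(X, axis=1, keepdims=True)
    traces[t] = np.trace(np.linalg.inv(W @ W.T))
expected = (n / p) * (1 + (n - 1) / (p - n - 1))
assert np.isclose(traces.mean(), expected, rtol=0.05)
```

With n = 4, p = 16, both estimates should land close to 3/13 ≈ 0.273 and 0.25 · 14/11 ≈ 0.318 respectively.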

• The assumption (NOI) is required to ensure the existence of sufficient label noise. Importantly, Lemma H.1 shows that (NOI), i.e. Var(y|z) ≥ σ 2 almost surely over z, holds if Var(y|x) ≥ σ 2 almost surely over x. All Double Descent papers from Section 1 make the stronger assumption that the distribution of y -E(y|x) is independent of x or even a fixed

ACKNOWLEDGMENTS

The author would like to thank Ingo Steinwart for proof-reading most of this paper and for providing helpful comments. Funded by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy -EXC 2075 -390740016. The author thanks the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for support.


Proof. See e.g. Corollary 1.2.7 in Krantz & Parks (2002) for the first statement. If N(f) is uncountable, there exists k ∈ Z such that [k, k+1] ∩ N(f) is also uncountable, hence it contains a strictly increasing and bounded sequence of points, and the limit of this sequence is an accumulation point of N(f).

Theorem H.3 (Multivariate version of the identity theorem). Let f : R^d → R be analytic. If f is not the zero function, then N(f) := {x ∈ R^d | f(x) = 0} is a Lebesgue null set.

Proof. Although less well-known than the univariate version, this multivariate version has been proven several times in the literature. For example, different proofs are given in Section 3.1.24 in Federer (1969), Lemma 1.2 in Nguyen (2015) and Proposition 0 in Mityagin (2015). More proof strategies have been hinted at in Lemma 5.22 in Kuchment (2015). Here, we provide an elementary proof following the proof strategy briefly mentioned at the beginning of Section 4.1 in Krantz & Parks (2002).

Let λ_d be the Lebesgue measure on R^d and let λ := λ_1. We prove the statement by induction on d ≥ 1. For d = 1, if λ(N(f)) > 0, then N(f) is uncountable and hence f ≡ 0 by Theorem H.2. Now, let the statement hold for d − 1 ≥ 1 and assume λ_d(N(f)) > 0. For a ∈ R, define the functions f_a : R^{d−1} → R, f_a(x) = f(a, x). Then, by Tonelli's theorem, 0 < λ_d(N(f)) = ∫_R λ_{d−1}(N(f_a)) dλ(a). It follows that the set U := {x ∈ R | λ_{d−1}(N(f_x)) > 0} satisfies λ(U) > 0. By induction, for all x ∈ U, we have f_x ≡ 0. Then, for all x̃ ∈ R^{d−1}, we can conclude that the function a ↦ f(a, x̃) vanishes on the uncountable set U and is hence the zero function by Theorem H.2. This shows f ≡ 0, completing the induction.

The following lemma provides some intuition about null sets for readers less familiar with measure theory. Recall that a property Q holds almost surely with respect to a measure P on R^d if there exists a null set N, i.e. a measurable set with P(N) = 0, such that Q(x) holds for all x ∈ R^d \ N.

Lemma H.4. Let λ_d be the Lebesgue measure on R^d and let P be a measure on R^d with a Lebesgue density function (i.e. a probability density function) p.
Then, a null set with respect to λ_d is also a null set with respect to P. The converse holds if p(x) ≠ 0 for (almost) all x.

Proof. A well-known fact from measure and integration theory states that if a measure µ has a density with respect to a measure ν, then ν-null sets are also µ-null sets. Setting µ = P and ν = λ_d yields the first fact. If p(x) ≠ 0 for (almost) all x, then µ := λ_d has density 1/p with respect to ν := P, and hence the converse follows.

Proposition 6 (Characterization of (COV) and (FRK)). Consider the setting of Theorem 3 and let FRK(n) be the statement that (FRK) holds for n. Then, assuming that (MOM) holds such that (COV) is well-defined, consider the following statements:

Proof. Step 1: Prove (i) and (ii).

Published as a conference paper at ICLR 2021

(3) Independence: Prove that if the inputs to the last layer have the diversity property and n = p, then the outputs of the last layer are almost surely linearly independent.

Our main tools will be the identity theorems for analytic functions (Theorem H.2 and Theorem H.3), expanding σ into its power series around a point and the Leibniz formula for the determinant of an n × n matrix, which is based on the permutation group S_n on [n].

We consider two diversity properties:

(a) The first property is that x_1, ..., x_n are distinct. This is the weakest possible diversity property. However, it cannot always be used for networks without (random) biases: For example, if σ is an even function and x_i = −x_j for some i ≠ j, the propagation property is violated. As another example, if σ(0) = 0 and x_i = 0 for some i, the independence property is violated.

(b) The second property, which works for networks with and without bias, is the property that x_1, ..., x_n are independent nonatomic random variables.

Using the first property yields shorter proofs and a slightly stronger theorem for networks with bias, Theorem H.9.
If we only care about probability distributions P_X on x that almost surely generate distinct x_1, ..., x_n, these probability distributions are exactly the nonatomic distributions. Hence, from the viewpoint of probability distributions P_X on x, which we take in the main part of the paper, the first property (a) does not provide a benefit over the second property (b) except for the shorter proofs.

The advantage of having biases is in being able to choose the point in which σ is Taylor-expanded. The following lemma shows that this choice enables us to make certain coefficients of the Taylor expansion nonzero:

Lemma H.5. Let m ≥ 1. Let σ : R → R be analytic and not a polynomial of degree less than m. Then, there exists b ∈ R such that the coefficients a_0, ..., a_m of the Taylor expansion of σ around b are all nonzero.

Proof. Since σ is not a polynomial of degree less than m, none of the derivatives σ^(0), ..., σ^(m) is the zero function. Since all of these derivatives are analytic, the set N(σ^(0)) ∪ ... ∪ N(σ^(m)) is (by Theorem H.3) a finite union of Lebesgue null sets and hence a Lebesgue null set. Hence, there exists b ∈ R such that σ^(0)(b) ≠ 0, ..., σ^(m)(b) ≠ 0. This implies that the corresponding coefficients a_0, ..., a_m in the Taylor expansion around b are nonzero.

We now prove our three-step program (1)–(3) from above for the distinctness property.

Proof. We essentially follow the corresponding proof in Lemma 4.3 in Nguyen & Hein (2017). By the identity theorem (Theorem H.3), the functions u ↦ u^⊤ x_i − u^⊤ x_j for i, j ∈ [n], i ≠ j are nonzero almost everywhere, hence there exists u ∈ R^p such that u^⊤ x_1, ..., u^⊤ x_n are all distinct.

Lemma H.7 (Propagation of distinct variables). Let σ : R → R be analytic and non-constant. If x_1, ..., x_n ∈ R^d are distinct, then for (Lebesgue-) almost all (W, b) ∈ R^{p×d} × R^p, the vectors σ(W x_i + b) (with σ applied element-wise) are also distinct.

Since p ≥ n ≥ 2, n − 1 and p + 1 − n are positive.
Since A and B are independent, (B/(n−1))/(A/(p+1−n)) follows a variance-ratio F-distribution with parameters n − 1 and p + 1 − n, whose mean is known (see e.g. Chapter 20 in Forbes et al., 2011):

E[B/A] = ((n−1)/(p+1−n)) · (p+1−n)/(p−1−n) = (n−1)/(p−n−1) for p + 1 − n > 2, and E[B/A] = ∞ otherwise. (15)

The infinite expectation for p + 1 − n ≤ 2 is not explicitly specified in Forbes et al. (2011), but it is easy to obtain from the p.d.f. of the F-distribution: The p.d.f. of the F-distribution with parameters a and b is of the form f(x) = C_{a,b} x^{a/2−1} (1 + (a/b)x)^{−(a+b)/2} for x > 0 and some constant C_{a,b} (cf. Chapter 20 in Forbes et al., 2011), and the expected value is therefore infinite for b ≤ 2, since the integrand x f(x) decays only like x^{−b/2} as x → ∞.

In the following, we will prove Theorem 12 using the same proof idea as for Theorem 11. The formulas in Theorem 12 have in principle already been computed by Breiman & Freedman (1983). However, for p ≥ n − 1, the latter expectation is not specified in common literature on the inverse Wishart distribution (Mardia et al., 1979; Press, 2005; von Rosen, 1988), presumably because it is ∞ for diagonal elements but is not well-defined for off-diagonal matrix elements.

Theorem 12. Let P_Z = N(0, I_p). Then, P_Z satisfies the assumptions (MOM), (COV) and (FRK) for all n with Σ = I_p. Moreover, for n, p ≥ 1, the expectations can be computed in closed form.

Proof. Step 1: Assumptions. Verifying (MOM), (COV) and Σ = I_p is trivial, and (FRK) for all n follows from Proposition 6 with x = z and φ = id.

Step 2: Overparameterized case. For the expectation, we first follow Step 4.3 in the proof of Theorem 11 in the overparameterized case p ≥ n ≥ 1, the main difference being that instead of w_i = √p x_i/‖x_i‖_2, we now have w_i = x_i, which translates to a simpler equation. Letting B be independent of A, we can then compute similarly to Eq. (15). This proves the overparameterized case.

Step 3: Underparameterized case. Since the rows w_i of W ∈ R^{n×p} are independent and follow a N(0, I_p) distribution, the rows of W^⊤ ∈ R^{p×n} are independent and follow a N(0, I_n) distribution. Therefore, the underparameterized case p ≤ n follows from the overparameterized case n ≤ p by switching the roles of n and p.

Remark J.1.
An alternative (and presumably similar) way to prove Theorem 12 is to use that the diagonal elements of a matrix with an inverse Wishart distribution follow an inverse Gamma distribution, as specified in Example 5.2.2 in Press (2005).

The next proposition shows that a small input dimension d does not necessarily provide a limitation:

Proposition J.2. Let p, d ≥ 1. Then, there exists a probability distribution P_X on R^d (with bounded support) and a continuous feature map φ : R^d → R^p such that for x ∼ P_X, φ(x) ∼ U(S^{p−1}).

Proof. For p = 1, the result is trivial, we will therefore assume p ≥ 2. We will prove the result for any d by a reduction to the case d = 1, although substantially simpler constructions are possible for d ≥ p − 1. First, introduce the spaces X and the "half-sphere" S^{p−1}_+.

Step 1: Space-filling curve on the sphere. In this step, we show that there exists a continuous surjective map φ : X → S^{p−1}. First of all, let f_1 : [0, 1] → [0, 1]^{p−1} be continuous and surjective, e.g. a Hilbert or Peano curve (see e.g. Sagan, 2012). From f_1, we define further maps f_2 and f_3. It is not hard to verify that f_2 and f_3 are continuous and surjective as well. For example, f_2 is continuous in 0 since ‖u‖_∞ ≤ ‖u‖_2 for all u ∈ R^{p−1}. Thus, the map f_+ is continuous and surjective. By the previous considerations, it is not hard to verify that the resulting map g is continuous and surjective as well. We can therefore define the continuous and surjective map φ : X → S^{p−1}, x ↦ g(x_1).

Step 2: Existence of a pull-back measure. We consider the Borel σ-algebras B(X), B(S^{p−1}) on X and S^{p−1}. The uniform distribution P_Z = U(S^{p−1}) on the sphere is defined with respect to B(S^{p−1}) and is therefore a Borel measure. Since φ is continuous, it is Borel measurable. Moreover, since X and S^{p−1} are complete separable metric spaces, they are also Souslin spaces, cf. Section 6.6 in Bogachev (2007).
Since φ is surjective, Theorem 9.1.5 in Bogachev (2007) guarantees the existence of a measure P_X such that if x ∼ P_X, then φ(x) ∼ U(S^{p−1}). Since P_X(X) = P_Z(S^{p−1}) = 1, P_X is a probability measure.

Step 3: Continuation. We can arbitrarily extend the mapping φ : X → S^{p−1} to a continuous mapping φ : R^d → R^p. Moreover, the domain X of P_X can be extended to R^d via P_X(A) := P_X(A ∩ X), the support of P_X is still bounded, and we still have φ(x) ∼ U(S^{p−1}) if x ∼ P_X.

Remark J.3. The proof of Proposition J.2 could be slightly shorter if we required φ(x) ∼ U(S^{p−1}_+) instead of φ(x) ∼ U(S^{p−1}). This would be of similar interest, since the uniform distribution U(S^{p−1}_+) on the "half-sphere" leads to the same E_Z tr((Z^+)^⊤ Σ (Z^+)) as the uniform distribution U(S^{p−1}) on the full sphere: If z_i ∼ U(S^{p−1}_+) and ε_i ∼ U({−1, 1}) are stochastically independent, then z̃_i := ε_i z_i ∼ U(S^{p−1}). Therefore, Σ̃ = Σ, Z̃ = diag(ε_1, ..., ε_n) Z, and if U D V^⊤ is a SVD of Z, then (diag(ε_1, ..., ε_n) U) D V^⊤ is a SVD of Z̃. Therefore, Z and Z̃ have the same singular values, hence W and W̃ have the same singular values, hence tr((W W^⊤)^{−1}) = tr((W̃ W̃^⊤)^{−1}) for p ≥ n and tr((W^⊤ W)^{−1}) = tr((W̃^⊤ W̃)^{−1}) for p ≤ n.

Remark J.4. One might ask whether it is possible in Proposition J.2 to choose P_X as a "nice" distribution, like a uniform distribution on a cube or a Gaussian distribution. The answer to this question is affirmative if there exists an area-preserving space-filling curve φ : [0, volume(S^{p−1})] → S^{p−1}. For p = 3, such a construction is informally described by Purser et al. (2009) and it seems plausible that such a construction is possible for all p.
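For p = 2, the pull-back construction of Proposition J.2 can be made explicit without any measure-theoretic machinery: the curve t ↦ (cos 2πt, sin 2πt) pushes U([0, 1]) forward to U(S¹), and it is even area-preserving. A sketch (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, size=100_000)          # x ~ U([0, 1]), d = 1
Z = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)], axis=1)

assert np.allclose(np.linalg.norm(Z, axis=1), 1.0)   # lands on S^1
# by symmetry, both coordinates of U(S^1) have mean 0
assert np.allclose(Z.mean(axis=0), 0.0, atol=0.02)
```

For p ≥ 3, the angle parameterization is no longer one-dimensional, which is why the proposition resorts to space-filling curves and the pull-back theorem instead.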

K RELATION TO RIDGELESS KERNEL REGRESSION

In this section, we want to discuss the relation between this paper and recent work on ridgeless kernel regression. To this end, we need to introduce some terminology on representations of kernels with finite-dimensional feature maps.

Definition K.1. Let k : R^d × R^d → R be a kernel, let p be an integer with 1 ≤ p < ∞ and let φ : R^d → R^p be a (measurable) function. Then, (p, φ) is called a P_X-representation of k if
• k(x, x̃) = φ(x)^⊤ φ(x̃) almost surely for independent x, x̃ ∼ P_X, and
• k(x, x) = φ(x)^⊤ φ(x) almost surely for x ∼ P_X.
If k has a P_X-representation, then we define p_k as the smallest p for which a P_X-representation exists.

Usually, p_k corresponds to the dimension of the RKHS associated with the restriction of k to the support of P_X, but since the feature map φ only needs to represent the kernel P_X-almost surely, p_k may be smaller for pathological kernels. The following lemma states that Theorem 3, if applicable, should be applied to ridgeless kernel regression with p = p_k:

Lemma K.2. Let k be a kernel on R^d with P_X(k(x, x) ≠ 0) > 0. Let (p, φ) be a P_X-representation of k.

Proof. In the notation of Section 3, the definition of P_X-representation implies that k(X, X) = φ(X) φ(X)^⊤ and k(x, X) = φ(x)^⊤ φ(X)^⊤ almost surely.

(a) If (COV) is not satisfied and p ≥ 2, it is possible to construct a P_X-representation with smaller p using the construction from Remark 7, hence p > p_k. If p = 1, (COV) is satisfied due to the assumption on k. Conversely, assume p > p_k and let (p_k, φ̃) be another P_X-representation of k. Set n = p. Then, we almost surely have

rank φ(X) = rank(φ(X) φ(X)^⊤) = rank(k(X, X)) = rank(φ̃(X) φ̃(X)^⊤) ≤ p_k < p,

hence φ(X) has full rank with probability zero. Since the rows φ(x_i)^⊤ of φ(X) are i.i.d., this means that there must be a proper linear subspace U of R^p such that φ(x_i) ∈ U with probability one. But then, according to Proposition 6, (COV) is not satisfied.

W. Since E[ε^⊤ (W W^⊤)^{−1} ε] = E tr((W W^⊤)^{−1}), their lower bound yields a lower bound for E tr((W W^⊤)^{−1}). However, the resulting lower bound is weaker than ours and requires stronger assumptions:

(1) Assuming that the subgaussian norm ‖w_i‖_{ψ2} := sup_{v ∈ S^{p−1}} sup_{q ∈ N_+} q^{−1/2} (E|v^⊤ w_i|^q)^{1/q} (cf. Vershynin, 2010) is bounded by a constant K < ∞, they obtain a lower bound of the form c_K σ² n/p with a constant c_K > 0 that depends on K and is only explicitly specified for the case of centered Gaussian P_Z. They note that ‖w_i‖_{ψ2} ≤ K holds, for example, if the components w_{i,j} of w_i are independent and all satisfy ‖w_{i,j}‖_{ψ2} ≤ K. However, as discussed in Remark 1, such independence assumptions are not realistic. In contrast, our lower bound is explicit, independent of constants like K, and larger: For example, at n = p, our lower bound is σ²n and theirs is σ²c_K.

(2) Assuming ‖w_i‖_2² ≤ p almost surely, they obtain a lower bound of the form c σ² n/(p log(n)). First of all, this lower bound converges to zero as n = p → ∞. Moreover, since we always have E‖w_i‖_2² = E tr(w_i w_i^⊤) = tr(I_p) = p, the assumption implies ‖w_i‖_2² = p almost surely. Although we can sometimes guarantee constant ‖z_i‖_2², e.g. for certain random Fourier features, we cannot guarantee the same for w_i = Σ^{−1/2} z_i, since Σ depends on the unknown input distribution P_X.

By inspecting the proof behind (1), one finds that c_K → 0 as K → ∞. Hence, the lower bound of Muthukumar et al. (2020) might raise hope that it is possible to achieve low E_Noise by choosing features with a large (or even infinite) subgaussian norm. Our result shows that this is not possible: Essentially the only possibility to avoid a large E_Noise for ridgeless linear regression around n ≈ p is to violate the property (FRK) that guarantees the ability to interpolate the data in the overparameterized case, see Section 5.
Otherwise, in order to achieve E_Noise < εσ², ε ≪ 1, it is necessary to make the model either strongly underparameterized (p < εn) or strongly overparameterized (p > n/ε).

