STRONG INDUCTIVE BIASES PROVABLY PREVENT HARMLESS INTERPOLATION

Abstract

Classical wisdom suggests that estimators should avoid fitting noise to achieve good generalization. In contrast, modern overparameterized models can yield small test error despite interpolating noise, a phenomenon often called "benign overfitting" or "harmless interpolation". This paper argues that the degree to which interpolation is harmless hinges upon the strength of an estimator's inductive bias, i.e., how heavily the estimator favors solutions with a certain structure: while strong inductive biases prevent harmless interpolation, weak inductive biases can even require fitting noise to generalize well. Our main theoretical result establishes tight non-asymptotic bounds for high-dimensional kernel regression that reflect this phenomenon for convolutional kernels, where the filter size regulates the strength of the inductive bias. We further provide empirical evidence of the same behavior for deep neural networks with varying filter sizes and rotational invariance.

1. INTRODUCTION

According to classical wisdom (see, e.g., Hastie et al. (2001)), an estimator that fits noise suffers from "overfitting" and cannot generalize well. A typical remedy is to prevent interpolation, that is, to stop the estimator from achieving zero training error and thereby fit less noise. For example, one can use ridge regularization, or early stopping for iterative algorithms, to obtain a model whose training error is close to the noise level. However, large overparameterized models such as neural networks seem to behave differently: even on noisy data, they may achieve optimal test performance at convergence after interpolating the training data (Nakkiran et al., 2021; Belkin et al., 2019a), a phenomenon referred to as harmless interpolation (Muthukumar et al., 2020) or benign overfitting (Bartlett et al., 2020) and often discussed in the context of double descent (Belkin et al., 2019a). To date, we lack a general understanding of when interpolation is harmless for overparameterized models.

In this paper, we argue that the strength of an inductive bias critically influences whether an estimator exhibits harmless interpolation. An estimator with a strong inductive bias heavily favors "simple" solutions that structurally align with the ground truth (such as sparsity or rotational invariance). Based on well-established high-probability recovery results for sparse linear regression (Tibshirani, 1996; Candes, 2008; Donoho & Elad, 2006), we expect models with a stronger inductive bias to generalize better than ones with a weaker inductive bias, particularly on noiseless data. In contrast, the effects of inductive bias are much less studied for interpolators of noisy data. Recently, Donhauser et al. (2022) provided a first rigorous analysis of the effects of inductive bias strength on the generalization performance of linear max-ℓp-margin/min-ℓp-norm interpolators.
In particular, the authors prove that a stronger inductive bias (small p → 1) not only enhances a model's ability to generalize from noiseless data, but also increases a model's sensitivity to noise, eventually harming generalization when interpolating noisy data. As a consequence, their result suggests that interpolation might not be harmless when the inductive bias is too strong. In this paper, we confirm this hypothesis and show that strong inductive biases indeed prevent harmless interpolation, while also moving beyond sparse linear models. As one example, we consider data where the true labels depend nonlinearly only on input features in a local neighborhood, and we vary the strength of the inductive bias via the filter size of convolutional kernels or shallow convolutional neural networks: small filter sizes encourage functions that depend nonlinearly only on local neighborhoods of the input features. As a second example, we investigate classification of rotationally invariant data, where we encourage different degrees of rotational invariance in neural networks. In particular,

• we prove a phase transition between harmless and harmful interpolation that occurs when varying the strength of the inductive bias via the filter size of convolutional kernels for kernel regression in the high-dimensional setting (Theorem 1);

• we further show that, for a weak inductive bias, not only is interpolation harmless, but partially fitting the observation noise is in fact necessary (Theorem 2);

• we show the same phase transition experimentally for neural networks with two common inductive biases: varying convolutional filter size, and rotational invariance enforced via data augmentation (Section 4).

From a practical perspective, empirical evidence suggests that large neural networks do not necessarily benefit from early stopping.
Our results match those observations for typical networks with a weak inductive bias; however, we caution that strongly structured models must avoid interpolation, even if they are highly overparameterized.

2. RELATED WORK

We now discuss three groups of related work and explain why their theoretical results cannot reflect the phase transition between harmless and harmful interpolation for high-dimensional kernel learning.

Low-dimensional kernel learning: Many recent works (Bietti et al., 2021; Favero et al., 2021; Bietti, 2022; Cagnetta et al., 2022) prove statistical rates for kernel regression with convolutional kernels in low-dimensional settings, but crucially rely on ridge regularization. In general, one cannot expect harmless interpolation for such kernels in the low-dimensional regime (Rakhlin & Zhai, 2019; Mallinar et al., 2022; Buchholz, 2022); positive results exist only for very specific adaptive spiked kernels (Belkin et al., 2019b). Furthermore, techniques developed for low-dimensional settings (see, e.g., Schölkopf et al. (2018)) usually suffer from a curse of dimensionality, that is, the bounds become vacuous in high-dimensional settings where the input dimension grows with the number of samples.

High-dimensional kernel learning: One line of research (Liang et al., 2020; McRae et al., 2022; Liang & Rakhlin, 2020; Liu et al., 2021) tackles high-dimensional kernel learning and proves non-asymptotic bounds using advanced high-dimensional random matrix concentration tools from El Karoui (2010). However, those results heavily rely on a bounded Hilbert norm assumption. This assumption is natural in the low-dimensional regime, but misleading in the high-dimensional regime, as pointed out in Donhauser et al. (2021b). Another line of research (Ghorbani et al., 2021; 2020; Mei et al., 2021; Ghosh et al., 2022; Misiakiewicz & Mei, 2021; Mei et al., 2022) asymptotically characterizes the precise risk of kernel regression estimators in specific settings with access to a kernel's eigenfunctions and eigenvalues.
However, these asymptotic results are insufficient to investigate how varying the filter size of a convolutional kernel affects the risk of a kernel regression estimator. In contrast to both lines of research, we prove tight, matching non-asymptotic upper and lower bounds for high-dimensional kernel learning that precisely capture the phase transition described in Section 3.2.

Overfitting of structured interpolators: Several works question the generality of harmless interpolation for models that incorporate strong structural assumptions. Examples include structures enforced via data augmentation (Nishi et al., 2021), adversarial training (Rice et al., 2020; Kamath et al., 2021; Sanyal et al., 2021; Donhauser et al., 2021a), neural network architectures (Li et al., 2021), pruning-based sparsity (Chang et al., 2021), and sparse linear models (Wang et al., 2022; Muthukumar et al., 2020; Chatterji & Long, 2022). In this paper, we continue this line of research and offer a new theoretical perspective to characterize when interpolation is expected to be harmless. For this purpose, we derive and compare tight non-asymptotic bias and variance bounds as a function of filter size for min-norm interpolators and optimally ridge-regularized estimators (Theorem 1). Furthermore, we prove for large filter sizes that not only does harmless interpolation occur (Theorem 1), but fitting some degree of noise is even necessary to achieve optimal test performance (Theorem 2).

3.1. SETTING

We study kernel regression with a (cyclic) convolutional kernel in a high-dimensional setting where the number of training samples n scales with the dimension of the input data d as $n \in \Theta(d^\ell)$. We use the same setting as previous works on high-dimensional kernel learning such as Misiakiewicz & Mei (2021): we assume that the training samples $\{(x_i, y_i)\}_{i=1}^n$ are i.i.d. draws with $x_i \sim U(\{-1, 1\}^d)$ and $y_i = f^\star(x_i) + \epsilon_i$, with ground truth $f^\star$ and noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. For simplicity of exposition, we further assume that $f^\star(x) = x_1 x_2 \cdots x_{L^*}$, with $L^*$ specified in Theorem 1. While the assumptions on the noise and the ground truth can easily be relaxed by following the proof steps in Section 5, generalizing the feature distribution is challenging. Indeed, existing results that establish precise risk characterizations (see Section 2) crucially rely on hypercontractivity of the feature distribution, an assumption so far only proven for few high-dimensional distributions, including the hypersphere (Beckner, 1992) and the discrete hypercube (Beckner, 1975), which we use in this paper. Hypercontractivity is essential to tightly control the empirical kernel matrix in Lemma 3 in Section 5. Generalizations beyond this assumption require the development of new tools in random matrix theory, which we consider important future work.

We consider (cyclic) convolutional kernels with filter size $q \in \{1, \dots, d\}$ of the form
$$K(x, x') = \frac{1}{d} \sum_{k=1}^{d} \kappa\left(\frac{\langle x_{(k,q)}, x'_{(k,q)} \rangle}{q}\right), \qquad (1)$$
where $x_{(k,q)} := [x_{\mathrm{mod}(k,d)}, \dots, x_{\mathrm{mod}(k+q-1,d)}]$, and $\kappa : [-1,1] \to \mathbb{R}$ is a nonlinear function satisfying standard regularity assumptions (see Assumption 1 in Appendix B) that hold, for instance, for the exponential function. Decreasing the filter size q restricts kernel regression solutions to depend nonlinearly only on local neighborhoods instead of the entire input x.
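To make the kernel definition concrete, the following is a minimal numerical sketch of the cyclic convolutional kernel in Equation (1), assuming κ = exp (which satisfies the regularity assumptions); the function name and setup are illustrative, not from the paper:

```python
import numpy as np

def conv_kernel(x, xp, q, kappa=np.exp):
    """Cyclic convolutional kernel: K(x, x') = (1/d) sum_k kappa(<x_(k,q), x'_(k,q)> / q)."""
    d = len(x)
    total = 0.0
    for k in range(d):
        patch = [(k + j) % d for j in range(q)]  # cyclic neighborhood of size q
        total += kappa(np.dot(x[patch], xp[patch]) / q)
    return total / d

rng = np.random.default_rng(0)
d, q = 16, 4
x = rng.choice([-1.0, 1.0], size=d)    # x ~ U({-1, 1}^d)
xp = rng.choice([-1.0, 1.0], size=d)
```

On the hypercube, every patch satisfies ⟨x_(k,q), x_(k,q)⟩/q = 1, so K(x, x) = κ(1) = e for this choice of κ; the filter size q controls which coordinates interact nonlinearly.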
We analyze the kernel ridge regression (KRR) estimator, the minimizer of the convex optimization problem
$$\hat f_\lambda = \arg\min_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n}\big(f(x_i) - y_i\big)^2 + \frac{\lambda}{n}\|f\|_{\mathcal{H}}^2, \qquad (2)$$
where $\mathcal{H}$ is the reproducing kernel Hilbert space (RKHS) over $\{-1,1\}^d$ generated by the convolutional kernel K in Equation (1), $\|\cdot\|_{\mathcal{H}}$ the corresponding norm, and λ > 0 the ridge regularization penalty. In the interpolation limit (λ → 0), we obtain the min-RKHS-norm interpolator
$$\hat f_0 = \arg\min_{f \in \mathcal{H}} \|f\|_{\mathcal{H}} \quad \text{s.t.} \quad f(x_i) = y_i \;\; \forall i. \qquad (3)$$
For simplicity, we refer to $\hat f_0$ as the KRR estimator with λ = 0. We evaluate all estimators with the expected population risk over the noise, defined as
$$\mathrm{Risk}(\hat f_\lambda) := \underbrace{\mathbb{E}_x\big(\mathbb{E}_\epsilon[\hat f_\lambda(x)] - f^\star(x)\big)^2}_{=:\,\mathrm{Bias}^2(\hat f_\lambda)} + \underbrace{\mathbb{E}_{x,\epsilon}\big(\hat f_\lambda(x) - \mathbb{E}_\epsilon[\hat f_\lambda(x)]\big)^2}_{=:\,\mathrm{Variance}(\hat f_\lambda)}.$$
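The KRR estimator has the well-known closed form f̂λ(x) = y⊺(K + λI)⁻¹k(x) (derived in Appendix A.1). A minimal sketch of this closed form, assuming a simple exponential κ with full filter size q = d; the helper names are hypothetical. As λ → 0, the predictions on the training inputs interpolate the labels:

```python
import numpy as np

def krr_predict(X_train, y, lam, kernel, X_test):
    """Kernel ridge regression: f_hat(x) = y^T (K + lam * I)^{-1} k(x)."""
    n = len(X_train)
    K = np.array([[kernel(a, b) for b in X_train] for a in X_train])
    alpha = np.linalg.solve(K + lam * np.eye(n), y)
    k_test = np.array([[kernel(a, b) for a in X_train] for b in X_test])
    return k_test @ alpha

rng = np.random.default_rng(0)
n, d, sigma = 20, 30, 0.1
X = rng.choice([-1.0, 1.0], size=(n, d))
y = X[:, 0] * X[:, 1] + sigma * rng.standard_normal(n)   # f*(x) = x1 * x2 plus noise
kernel = lambda a, b: np.exp(a @ b / d)                  # kappa = exp, q = d (illustrative)

preds_interp = krr_predict(X, y, 1e-10, kernel, X)       # lam -> 0: min-RKHS-norm interpolator
```

With a vanishing ridge penalty the training predictions match y up to numerical precision, which is exactly the interpolation regime studied in Section 3.2.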

3.2. MAIN RESULT

We now present tight upper and lower bounds for the prediction error of kernel regression estimators in the setting of Section 3.1. The resulting rates hold in the high-dimensional regime, that is, when both the ambient dimension d and the filter size q scale with n. We defer the proof to Section 5.

Theorem 1 (Non-asymptotic prediction error rates). Let ℓ > 0, β ∈ (0, 1), ℓσ ∈ ℝ. Assume a dataset and a kernel as described in Section 3.1, with the kernel satisfying Assumption 1. Assume further $n \in \Theta(d^\ell)$, filter size $q \in \Theta(d^\beta)$, and $\sigma^2 \in \Theta(d^{-\ell_\sigma})$. Lastly, define
$$\bar\delta := \frac{\ell-1}{\beta} - \Big\lfloor \frac{\ell-1}{\beta} \Big\rfloor \quad\text{and}\quad \delta := \frac{\ell-\ell_\lambda-1}{\beta} - \Big\lfloor \frac{\ell-\ell_\lambda-1}{\beta} \Big\rfloor \quad\text{for any } \ell_\lambda.$$
Then, with probability at least $1 - c\, d^{-\beta\min\{\bar\delta,\, 1-\bar\delta\}}$ uniformly over all $\ell_\lambda \in [0, \ell-1)$, the KRR estimate $\hat f_\lambda$ in Equation (2) with $\max\{\lambda, 1\} \in \Theta(d^{\ell_\lambda})$ satisfies
$$\mathrm{Variance}(\hat f_\lambda) \in \Theta\Big(n^{-\frac{\ell_\sigma + \ell_\lambda}{\ell} - \frac{\beta}{\ell}\min\{\delta,\, 1-\delta\}}\Big).$$
Further, for a ground truth $f^\star(x) = x_1 x_2 \cdots x_{L^*}$ with $L^* \le \frac{\ell - \ell_\lambda - 1}{\beta}$, with probability at least $1 - c\, d^{-\beta\min\{\bar\delta,\, 1-\bar\delta\}}$, we have
$$\mathrm{Bias}^2(\hat f_\lambda) \in \Theta\Big(n^{-\frac{2}{\ell}\left(\ell - \ell_\lambda - 1 - \beta(L^*-1)\right)}\Big).$$
Finally, by setting ℓλ = 0, both rates also hold for the min-RKHS-norm interpolator $\hat f_0$ in Equation (3).

Note how the theorem reflects the usual intuition for the effects of noise and ridge regularization strength on bias and variance via the parameter ℓλ: with increasing ridge regularization ℓλ (and thus increasing λ), the bias increases and the variance decreases. Similarly, as the noise increases (and thus ℓσ decreases), the variance increases.

Phase transition as a function of β: In the following, we focus on the impact of the filter size $q \in \Theta(d^\beta)$ on the risk (sum of bias and variance) via the growth rate β. Recalling that a small filter size (small β) corresponds to a strong inductive bias, and vice versa, Figure 1 demonstrates how the strength of the inductive bias affects generalization. For illustration, we choose the ground truth $f^\star(x) = x_1 x_2$ so that the assumption on $L^*$ is satisfied for all β.
Specifically, Figure 1a shows the rates for the min-RKHS-norm interpolator $\hat f_0$ and the optimally ridge-regularized estimator $\hat f_{\lambda_{\mathrm{opt}}}$, where we choose $\lambda_{\mathrm{opt}}$ to minimize the expected population risk $\mathrm{Risk}(\hat f_{\lambda_{\mathrm{opt}}})$. Furthermore, Figure 1b depicts the (statistical) bias and variance of the interpolator $\hat f_0$. At the threshold β* ∈ (0, 1), implicitly defined as the β at which the rates of statistical bias and variance in Theorem 1 match, we observe the following phase transition:

• For β < β*, that is, for a strong inductive bias, the rates in Figure 1a for the optimally ridge-regularized estimator $\hat f_{\lambda_{\mathrm{opt}}}$ are strictly better than those for the corresponding interpolator $\hat f_0$. In other words, we observe harmful interpolation.

• For β > β*, that is, for a weak inductive bias, the rates in Figure 1a of the optimally ridge-regularized estimator $\hat f_{\lambda_{\mathrm{opt}}}$ and the min-RKHS-norm interpolator $\hat f_0$ match. Hence, we observe harmless interpolation.

In the following theorem, we additionally show that for β > β* interpolation is not only harmless, but the optimally ridge-regularized estimator $\hat f_{\lambda_{\mathrm{opt}}}$ necessarily fits part of the noise and has a training error strictly below the noise level. In contrast, we show that when interpolation is harmful in Figure 1a, that is, when β < β*, the training error of the optimally ridge-regularized model approaches the noise level.

Theorem 2 (Training error (informal)). Let $\lambda_{\mathrm{opt}}$ be such that the expected population risk $\mathrm{Risk}(\hat f_{\lambda_{\mathrm{opt}}})$ is minimal, and let β* be the unique threshold where the bias and variance bounds in Theorem 1 are of the same order for the interpolator $\hat f_0$ (setting ℓλ = 0). Then, the expected training error converges in probability:
$$\lim_{n,d\to\infty} \frac{1}{\sigma^2}\,\mathbb{E}_\epsilon\Big[\frac{1}{n}\sum_i \big(\hat f_{\lambda_{\mathrm{opt}}}(x_i) - y_i\big)^2\Big] \;\begin{cases} = 1 & \beta < \beta^*, \\ \le c_\beta & \beta \ge \beta^*, \end{cases}$$
where $c_\beta < 1$ for any β > β*. We refer to Appendix D.2 for the proof and a more general statement.
Bias-variance trade-off: We conclude by discussing how the phase transition arises from a (statistical) bias and variance trade-off for the min-RKHS-norm interpolator as a function of β, reflected in Theorem 1 when setting ℓ λ = 0 and illustrated in Figure 1b . While the statistical bias monotonically decreases with decreasing β (i.e., increasing strength of the inductive bias), the variance follows a multiple descent curve with increasing minima as β decreases. Hence, analogous to the observations in Donhauser et al. (2022) for linear max-ℓ p -margin/min-ℓ p -norm interpolators, the interpolator achieves its optimal performance at a β ∈ (0, 1), and therefore at a moderate inductive bias. Finally, we note that Liang et al. (2020) previously observed a multiple descent curve for the variance, but as a function of input dimension and without any connection to structural biases.
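The multiple descent of the variance can be traced directly to the fractional part δ in Theorem 1. The following sketch (with hypothetical helper names) evaluates the variance exponent of the interpolator (ℓλ = 0), which reads -(ℓσ + β min{δ, 1-δ})/ℓ with δ = (ℓ-1)/β - ⌊(ℓ-1)/β⌋; the exponent attains local maxima (descent peaks) whenever (ℓ-1)/β is an integer:

```python
import math

def variance_exponent(beta, ell, ell_sigma):
    """Exponent of n in Variance(f_hat_0) from Theorem 1 with ell_lambda = 0."""
    delta = math.modf((ell - 1) / beta)[0]  # fractional part, delta in [0, 1)
    return -(ell_sigma / ell) - (beta / ell) * min(delta, 1 - delta)

# Peaks of the descent curve occur where (ell - 1) / beta is an integer (delta = 0):
exps = {beta: variance_exponent(beta, ell=2.0, ell_sigma=0.5) for beta in (0.5, 0.45, 0.4)}
```

For instance, at β = 0.5 (so (ℓ-1)/β = 2 and δ = 0) the variance decays only as n^{-ℓσ/ℓ}, while slightly smaller or larger β yields a strictly faster rate, reproducing the oscillation in Figure 1b.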

4. EXPERIMENTS

We now empirically study whether the phase transition that we prove for kernel regression persists for deep neural networks with feature learning. More precisely, we present controlled experiments to investigate how the strength of a CNN's inductive bias influences whether interpolating noisy data is harmless. In practice, the inductive bias of a neural network varies with design choices such as the architecture (e.g., convolutional vs. fully-connected vs. graph networks) or the training procedure (e.g., data augmentation, adversarial training). We focus on two examples: convolutional filter size, which we vary via the architecture, and rotational invariance, which we vary via data augmentation. To isolate the effects of the inductive bias and provide conclusive results, we use datasets where we know a priori that the ground truth exhibits a simple structure matching the networks' inductive bias. See Appendix E for experimental details. Analogous to ridge regularization for kernels, we use early stopping as a mechanism to prevent noise fitting, and compare optimally early-stopped CNNs to their interpolating versions. The experiments highlight a trend that mirrors our theoretical results: the stronger the inductive bias of a neural network, the more harmful interpolation becomes. These results suggest exciting future work: proving this trend for models with feature learning.
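As a self-contained illustration of the comparison methodology (not the paper's CNN setup), the following sketch trains an overparameterized linear model with full-batch gradient descent, which converges to the min-norm interpolator, and compares the population risk of the best early-stopped iterate against the final, interpolating one; the data model is an assumption for illustration only:

```python
import numpy as np

# Hypothetical linear analogue: gradient descent on overparameterized least squares.
rng = np.random.default_rng(0)
n, d, sigma = 50, 200, 0.5
w_star = np.zeros(d)
w_star[:5] = 1.0                                  # assumed sparse ground truth
X = rng.standard_normal((n, d))
y = X @ w_star + sigma * rng.standard_normal(n)   # noisy labels

w = np.zeros(d)
lr = 1.0 / np.linalg.norm(X, 2) ** 2              # step size below 2 / ||X||_2^2 (stable)
risks = []
for _ in range(3000):
    w -= lr * X.T @ (X @ w - y)                   # full-batch gradient step
    risks.append(float(np.sum((w - w_star) ** 2)))  # excess risk E_x[(x^T (w - w*))^2]

best_risk, final_risk = min(risks), risks[-1]     # optimal early stopping vs. interpolation
```

The quantity of interest throughout Section 4 is exactly the gap between `best_risk` (optimal early stopping) and `final_risk` (training to convergence); by construction the early-stopped risk is never worse.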

4.1. FILTER SIZE OF CNNS ON SYNTHETIC IMAGES

In a first experiment, we study the impact of filter size on the generalization of interpolating CNNs. As a reminder, small filter sizes yield functions that depend nonlinearly only on local neighborhoods of the input features. To clearly isolate the effects of filter size, we choose a special architecture and a synthetic classification problem such that the true label function is itself a CNN with a small filter size. Concretely, we generate images of size 32 × 32 containing scattered circles (negative class) and crosses (positive class) of size at most 5 × 5. Thus, decreasing the filter size down to 5 × 5 corresponds to a stronger inductive bias. Motivated by our theory, we hypothesize that interpolating noisy data is harmful with a small filter size, but harmless with a large filter size.

Training setup: We use CNNs with a single convolutional layer, followed by global spatial max pooling and two dense layers. We train these CNNs with different filter sizes on 200 training samples (either noiseless or with 20% label flips) to minimize the logistic loss and achieve zero training error. We repeat all experiments over 5 random datasets with 15 optimization runs per dataset and filter size, and report the average 0-1 error on 100k test samples per dataset. For a detailed discussion of the choice of hyperparameters and more experimental details, see Appendix E.1.

Results: First, the noiseless error curves (dashed) in Figure 2a confirm the common intuition that the strongest inductive bias (matching the ground truth) at filter size 5 yields the lowest test error. More interestingly, for 20% training noise (solid), Figure 2a reveals a phase transition similar to Theorem 1 and confirms our hypothesis: models with weak inductive biases (large filter sizes) exhibit harmless interpolation, as indicated by the matching test errors of interpolating (blue) and optimally early-stopped (yellow) models.
In contrast, as filter size decreases, models with a strong inductive bias (small filter sizes) suffer from an increasing gap in test errors when interpolating versus using optimal early stopping. Furthermore, Figure 2b reflects the dual perspective of the phase transition as presented in Theorem 2 under optimal early stopping: models with a small filter size entirely avoid fitting training noise, such that the training error on the noisy training subset equals 100%, while models with a large filter size interpolate the noise.

Difference to double descent: One might suspect that our empirical observations simply reflect another form of double descent (Belkin et al., 2019a). As a CNN's filter size increases (the inductive bias becomes weaker), so does the number of parameters and the degree of overparameterization. Thus, double descent predicts vanishing benefits of regularization due to model size for weak inductive biases. Nevertheless, we argue that the phenomenon we observe here is distinct, and provide an extended discussion in Appendix E.3. In short, we choose sufficiently large networks and tune their hyperparameters to ensure that all models interpolate and yield small training loss, even for filter size 5 and 20% training noise. To justify this approach, we repeat a subset of the experiments while significantly increasing the convolutional layer width. As the number of parameters increases for a fixed filter size, double descent would predict that the benefits of optimal early stopping vanish. However, we observe that our phenomenon persists. In particular, for filter size 5 (the strongest inductive bias), the test error gap between interpolating and optimally early-stopped models remains large.
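The architecture described in the training setup above can be sketched as a forward pass; the weights here are random placeholders rather than trained parameters, and all names are illustrative:

```python
import numpy as np

def cnn_forward(img, filters, w1, w2):
    """Single conv layer -> ReLU -> global spatial max pooling -> two dense layers."""
    f, c = filters.shape[0], filters.shape[2]      # filter size f, c channels
    H, W = img.shape
    feat = np.empty((H - f + 1, W - f + 1, c))
    for i in range(H - f + 1):
        for j in range(W - f + 1):
            patch = img[i:i + f, j:j + f]
            # each channel responds only to an f x f neighborhood of the input
            feat[i, j] = np.maximum(np.tensordot(filters, patch, axes=([0, 1], [0, 1])), 0)
    pooled = feat.max(axis=(0, 1))                 # global spatial max pooling
    hidden = np.maximum(w1 @ pooled, 0)            # first dense layer
    return w2 @ hidden                             # scalar logit

rng = np.random.default_rng(0)
f, c = 5, 8                                        # filter size 5 matches the ground-truth scale
img = rng.standard_normal((32, 32))
filters = rng.standard_normal((f, f, c))
w1 = rng.standard_normal((16, c))
w2 = rng.standard_normal(16)
logit = cnn_forward(img, filters, w1, w2)
```

The filter size `f` is the single knob varied in the experiment: with `f = 5` each feature depends nonlinearly on at most a 5 × 5 neighborhood, while larger `f` weakens this locality bias.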

4.2. ROTATIONAL INVARIANCE OF WIDE RESIDUAL NETWORKS ON SATELLITE IMAGES

In a second experiment, we investigate rotational invariance as an inductive bias for CNNs, encouraged to varying degrees via data augmentation, on data whose true labels are independent of an image's rotation. As an example dataset with a rotationally invariant ground truth, we classify satellite images from the EuroSAT dataset (Helber et al., 2018) into 10 types of land usage. Because the true labels are independent of image orientation, we expect rotational invariance to be a particularly fitting inductive bias for this task.

Training and test setup: For computational reasons, we subsample the original EuroSAT training set to 7680 raw training and 10k raw test samples. In the noisy case, we replace 20% of the raw training labels with a wrong label chosen uniformly at random. We then vary the strength of the inductive bias towards rotational invariance by augmenting the dataset with an increasing number k of rotated versions of itself. For each sample, we use k equally spaced angles spanning 360°, plus a random offset. Note that training noise is applied before the rotations, so that all rotated versions of the same image share the same label. We then center-crop all rotated images such that they only contain valid pixels. In all experiments, we fit Wide Residual Networks (Zagoruyko & Komodakis, 2016).

Results: Similar to the previous subsection, Figure 3a corroborates our hypothesis under rotational invariance: stronger inductive biases result in lower test errors on noiseless data, but an increased gap between the test errors of interpolating and optimally early-stopped models. In contrast to filter size, the phase transition is more abrupt; invariance to 180° rotations already prevents harmless interpolation. Figure 3b confirms this from a dual perspective, since all models with some rotational invariance avoid fitting noisy samples at the optimal early-stopping point.
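The augmentation scheme can be sketched as follows; for simplicity this hypothetical version uses 90-degree multiples via np.rot90 instead of the k arbitrary equally spaced angles with random offset and center-cropping used in the actual experiments:

```python
import numpy as np

def augment_with_rotations(X, y, k):
    """Augment each image with k rotations sharing its (possibly noisy) label.
    Simplified sketch: 90-degree multiples only, so k must divide 4 here."""
    assert 4 % k == 0
    X_aug, y_aug = [], []
    for img, label in zip(X, y):
        for r in range(0, 4, 4 // k):
            X_aug.append(np.rot90(img, r))
            y_aug.append(label)     # label noise is applied before rotation
    return np.stack(X_aug), np.array(y_aug)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8, 8))           # 6 toy "images"
y = rng.integers(0, 10, size=6)              # 10 classes, as in EuroSAT
X_aug, y_aug = augment_with_rotations(X, y, k=4)
```

Because the label is copied to every rotated version of an image, increasing k pushes the network towards rotation-invariant predictions, which is exactly the knob varied in Figure 3.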

5. PROOF OF THE MAIN RESULT

The proof of the main result, Theorem 1, proceeds in two steps: first, Section 5.1 presents a fixed-design result that yields matching upper and lower bounds for the prediction error of general kernels under additional conditions. Second, in Section 5.2, we show that the setting of Theorem 1 satisfies those conditions with high probability over dataset draws.

Published as a conference paper at ICLR 2023

5.1. GENERALIZATION BOUND FOR FIXED-DESIGN

Notation: Assuming that inputs are draws from a data distribution ν (i.e., x, x' ∼ ν), we can decompose any continuous, positive semi-definite kernel function as
$$K(x, x') = \sum_{k=1}^{\infty} \lambda_k \psi_k(x)\psi_k(x') = \underbrace{\sum_{k=1}^{m} \lambda_k \psi_k(x)\psi_k(x')}_{=:\,K_{\le m}(x, x')} + \underbrace{\sum_{k=m+1}^{\infty} \lambda_k \psi_k(x)\psi_k(x')}_{=:\,K_{>m}(x, x')}, \qquad (4)$$
where $\{\psi_k\}_{k\ge 1}$ is an orthonormal basis of the RKHS with respect to $\langle f, g\rangle_\nu := \mathbb{E}_{x\sim\nu}[f(x)g(x)]$, and the eigenvalues $\lambda_k$ are sorted in descending order. In the following, we write $[\cdot]_{i,j}$ for the entry in row i and column j of a matrix. We define the empirical kernel matrix for K as $K \in \mathbb{R}^{n\times n}$ with $[K]_{i,j} := K(x_i, x_j)$, and analogously the truncated versions $K_{\le m}$ and $K_{>m}$ for $K_{\le m}$ and $K_{>m}$, respectively. Next, we utilize the matrix $\Psi_{\le m} \in \mathbb{R}^{n\times m}$ with $[\Psi_{\le m}]_{i,l} := \psi_l(x_i)$, the diagonal matrix $D_{\le m} := \mathrm{diag}(\lambda_1, \dots, \lambda_m)$, and, writing $k(x) := [K(x_1, x), \dots, K(x_n, x)]^\top$, the squared kernel matrices $S := \mathbb{E}_{x\sim\nu}[k(x)k(x)^\top]$ and its tail analogue $S_{>m}$ defined via $K_{>m}$.

First, Theorem 3 provides tight fixed-design bounds for the prediction error.

Theorem 3 (Generalization bound for fixed-design). Let K be a kernel that under a distribution ν decomposes as $K(x, x') = \sum_k \lambda_k \psi_k(x)\psi_k(x')$ with $\mathbb{E}_{x\sim\nu}[\psi_k(x)\psi_{k'}(x)] = \delta_{k,k'}$, and let $\{(x_i, y_i)\}_{i=1}^n$ be a dataset with $y_i = f^\star(x_i) + \epsilon_i$ for zero-mean, $\sigma^2$-variance i.i.d. noise $\epsilon_i$ and ground truth $f^\star$. Define
$$\tau_1 := \min\Big\{\frac{n\lambda_m}{\max\{\lambda,1\}},\, 1\Big\}, \quad \tau_2 := \max\Big\{\frac{n\lambda_{m+1}}{\max\{\lambda,1\}},\, 1\Big\}, \quad r_1 := \frac{\mu_{\min}(K_{>m}) + \lambda}{\max\{\lambda,1\}}, \quad r_2 := \frac{\|K_{>m}\| + \lambda}{\max\{\lambda,1\}}.$$
Then, for any $m \in \mathbb{N}$ such that $r_1 > 0$ and
$$\Big\|\frac{\Psi_{\le m}^\top \Psi_{\le m}}{n} - I_m\Big\| \le \frac{1}{2}, \qquad (5)$$
the KRR estimate $\hat f_\lambda$ in Equation (2) for any λ ≥ 0 has a variance upper and lower bounded by
$$\frac{r_1^2 \tau_1^2}{2 r_2^2 (1.5 + r_1)^2}\left(\frac{m}{n} + \frac{\sum_{i=m+1}^{n} \mu_i(S_{>m})}{r_2^2 \max\{\lambda,1\}^2}\right) \le \mathrm{Variance}(\hat f_\lambda)/\sigma^2 \le 6\,\frac{r_2^2}{r_1^2}\left(\frac{m}{n} + \frac{\mathrm{Tr}(S_{>m})}{r_1^2 \max\{\lambda,1\}^2}\right).$$
Furthermore, for any ground truth that can be expressed as $f^\star(x) = \sum_{k=1}^m a_k \psi_k(x)$ with $a \in \mathbb{R}^m$ and $\psi_k$ as defined in Equation (4), the bias is upper and lower bounded by
$$\Big(\frac{r_1}{1.5 + r_1}\Big)^2 \tau_1^2\, \max\{\lambda,1\}^2\, \frac{\|D_{\le m}^{-1} a\|^2}{n^2} \le \mathrm{Bias}^2(\hat f_\lambda) \le 4\left(r_2^2 + 1.5\,\frac{r_2^3}{r_1^2}\right)\tau_2\, \max\{\lambda,1\}^2\, \frac{\|D_{\le m}^{-1} a\|^2}{n^2}.$$
See Appendix A.1 for the proof. Note that this result holds for any fixed-design dataset. We derive the main ideas from the proofs in Bartlett et al. (2020); Tsigler & Bartlett (2020), where the authors establish tight bounds for the min-ℓ2-norm interpolator with independent sub-Gaussian features. Their upper bound similarly involves the quantity $\|D_{\le m}^{-1} a\|/n$ but also applies to more general ground truths. We improve that upper bound in Theorem 3 and present a matching lower bound.
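The eigendecomposition and truncation underlying Theorem 3 can be sanity-checked numerically. The following sketch builds a finite-rank kernel from synthetic eigenfunction values (an assumption for illustration, not the paper's hypercube eigenbasis) and verifies the empirical-matrix identity K = K_{≤m} + K_{>m}:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 20, 50, 10
Psi = rng.standard_normal((n, p))             # [Psi]_{i,k} = psi_k(x_i), series truncated at p
lam = np.sort(rng.uniform(size=p))[::-1]      # eigenvalues lambda_k in descending order

K = (Psi * lam) @ Psi.T                       # [K]_{i,j} = sum_k lambda_k psi_k(x_i) psi_k(x_j)
K_le = (Psi[:, :m] * lam[:m]) @ Psi[:, :m].T  # empirical K_{<=m}: leading m components
K_gt = (Psi[:, m:] * lam[m:]) @ Psi[:, m:].T  # empirical K_{>m}: tail components
```

The split at m is the central device of the proof: the leading block drives the m/n terms of the variance, while the tail block enters through the spectrum of S_{>m}.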

5.2. PROOF OF THEOREM 1

Throughout the remainder of the proof, all statements hold for the setting in Section 3.1 and under the assumptions of Theorem 1, in particular Assumption 1 on K, $n \in \Theta(d^\ell)$, and $\max\{\lambda, 1\} \in \Theta(d^{\ell_\lambda})$. Furthermore, as in Theorem 1, we use
$$\delta := \frac{\ell - \ell_\lambda - 1}{\beta} - \Big\lfloor\frac{\ell - \ell_\lambda - 1}{\beta}\Big\rfloor \quad\text{and}\quad \bar\delta := \frac{\ell - 1}{\beta} - \Big\lfloor\frac{\ell - 1}{\beta}\Big\rfloor.$$
We show that there exists a particular m for which the conditions in Theorem 3 hold and we can control the terms in the bias and variance bounds.

Step 1: Conditions for the bounds in Theorem 3. We first derive sufficient conditions on m such that the conditions on $\Psi_{\le m}$ and $f^\star$ in Theorem 3 hold with high probability. The following standard concentration result shows that all m ≪ n satisfy Equation (5) with high probability.

Lemma 1 (Corollary of Theorem 5.44 in Vershynin (2012)). For d large enough, with probability at least $1 - c\, d^{-\beta\bar\delta}$, all $m \in O(n \cdot q^{-\bar\delta})$ satisfy Equation (5).

See Appendix C.1 for the proof. Simultaneously, to ensure that $f^\star$ is contained in the span of the first m eigenfunctions, m must be sufficiently large. We formalize this in the following lemma.

Lemma 2 (Bias bound condition). Consider a kernel as in Theorem 1 satisfying Assumption 1, and a ground truth $f^\star(x) = \prod_{j=1}^{L^*} x_j$ with $1 \le L^* \le \frac{\ell - \ell_\lambda - 1}{\beta}$. Then, for any m with $\lambda_m \in o\big(\frac{1}{d\, q^{L^*-1}}\big)$ and d sufficiently large, $f^\star$ is in the span of the first m eigenfunctions and $\|D_{\le m}^{-1} a\| \in \Theta(d\, q^{L^*-1})$.

See Appendix B.3 for the proof. Note that L* ≥ 1 follows from ℓλ < ℓ - 1 in Theorem 1, and allows us to focus on non-trivial ground truth functions.

Step 2: Concentration of the (squared) kernel matrix. In a second step, we show that there exists a set of m that satisfy the sufficient conditions, and for which the spectra of the kernel matrix $K_{>m}$ and the squared kernel matrix $S_{>m}$ concentrate.

Lemma 3 (Tight bound conditions).
With probability at least $1 - c\, d^{-\beta\min\{\bar\delta,\, 1-\bar\delta\}}$ uniformly over all $\ell_\lambda \in [0, \ell-1)$, for d sufficiently large and any λ ∈ ℝ, m ∈ ℕ with $\max\{\lambda,1\} \in \Theta(d^{\ell_\lambda})$, $n\lambda_m \in \Theta(\max\{\lambda,1\})$, and $m \in O\big(\frac{n\, q^{-\delta}}{\max\{\lambda,1\}}\big)$, we have
$$r_1, r_2 \in \Theta(1), \qquad \mathrm{Tr}(S_{>m}) \in O\big(d^{\ell_\lambda} q^{-\delta} + d^{\ell_\lambda} q^{-(1-\delta)}\big), \qquad \sum_{i=m+1}^{n} \mu_i(S_{>m}) \in \Omega\big(d^{\ell_\lambda} q^{-(1-\delta)}\big).$$
We refer to Appendix C.3 for the proof, which heavily relies on the feature distribution. Note that the conditions on m in Lemma 3 also imply $\tau_1, \tau_2 \in \Theta(1)$.

Step 3: Completing the proof. Finally, we complete the proof by showing the existence of a particular m that simultaneously satisfies all conditions of Lemmas 1 to 3.

Lemma 4 (Eigendecay). There exists an m such that $n\lambda_m \in \Theta(\max\{\lambda,1\})$ and $m \in \Theta\big(\frac{n\, q^{-\delta}}{\max\{\lambda,1\}}\big) \subseteq O(n \cdot q^{-\bar\delta})$. Furthermore, assuming $L^* \le \frac{\ell - \ell_\lambda - 1}{\beta}$, we have $\lambda_m \in o\big(\frac{1}{d\, q^{L^*-1}}\big)$.

We refer to Appendix B.4 for the proof. As a result, we can use Lemmas 1 to 4 to instantiate Theorem 3 for the setting of Theorem 1, resulting in the following tight high-probability bounds for variance and bias:
$$d^{-\ell_\lambda} q^{-\delta} + d^{-\ell_\lambda} q^{-(1-\delta)} \lesssim \mathrm{Variance}(\hat f_\lambda)/\sigma^2 \lesssim d^{-\ell_\lambda} q^{-\delta} + d^{-\ell_\lambda} q^{-(1-\delta)},$$
$$d^{-2(\ell - \ell_\lambda - 1 - \beta(L^*-1))} \lesssim \mathrm{Bias}^2(\hat f_\lambda) \lesssim d^{-2(\ell - \ell_\lambda - 1 - \beta(L^*-1))}.$$
Reformulating the bounds in terms of n then concludes the proof of Theorem 1.
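The final reformulation from rates in d to rates in n only uses $n \in \Theta(d^\ell)$, hence $d \in \Theta(n^{1/\ell})$ and $q \in \Theta(d^\beta) \subseteq \Theta(n^{\beta/\ell})$. For instance, for the variance (keeping the dominant of the two tail terms):

```latex
\sigma^2\Big(d^{-\ell_\lambda} q^{-\delta} + d^{-\ell_\lambda} q^{-(1-\delta)}\Big)
  \in \Theta\Big(d^{-\ell_\sigma - \ell_\lambda - \beta \min\{\delta,\, 1-\delta\}}\Big)
  = \Theta\Big(n^{-\frac{\ell_\sigma + \ell_\lambda}{\ell} - \frac{\beta}{\ell}\min\{\delta,\, 1-\delta\}}\Big),
```

and analogously $d^{-2(\ell - \ell_\lambda - 1 - \beta(L^*-1))} \in \Theta\big(n^{-\frac{2}{\ell}(\ell - \ell_\lambda - 1 - \beta(L^*-1))}\big)$ for the squared bias, matching the rates stated in Theorem 1.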

6. SUMMARY AND OUTLOOK

In this paper, we highlight how the strength of an inductive bias impacts generalization. Concretely, we study when the gap in test error between interpolating models and their optimally ridge-regularized or early-stopped counterparts is zero, that is, when interpolation is harmless. In particular, we prove a phase transition for kernel regression using convolutional kernels with different filter sizes: a weak inductive bias (large filter size) yields harmless interpolation, and even requires fitting noise for optimal test performance, whereas a strong inductive bias (small filter size) suffers from suboptimal generalization when interpolating noise. Intuitively, this phenomenon arises from a bias-variance trade-off, captured by our main result in Theorem 1: with increasing inductive bias, the risk on noiseless training samples (bias) decreases, while the sensitivity to noise in the training data (variance) increases. Our empirical results on neural networks suggest that this phenomenon extends to models with feature learning, which opens up an avenue for exciting future work.

A GENERALIZATION BOUND FOR FIXED-DESIGN

We prove Theorem 3 by deriving a closed-form expression for the bias and variance, and bounding them individually. Hence, the proof does not rely on any matrix concentration results.

A.1 PROOF OF THEOREM 3

It is well known that the KRR problem defined in Equation (2) yields the estimator
$$\hat f_\lambda(x) = y^\top H^{-1} k(x),$$
where $x, x_i \sim \nu$, $y := f + \epsilon = [f^\star(x_1) + \epsilon_1, \dots, f^\star(x_n) + \epsilon_n]^\top$, $k(x) := [K(x_1, x), \dots, K(x_n, x)]^\top$, and $H := K + \lambda I_n$. For this estimator, both bias and variance exhibit a closed-form expression:
$$\mathrm{Bias}^2(\hat f_\lambda) = \mathbb{E}_{x\sim\nu}\Big[\big(f^\star(x) - \mathbb{E}_\epsilon[(f+\epsilon)^\top H^{-1}k(x)]\big)^2\Big] = \mathbb{E}_{x\sim\nu}\Big[\big(f^\star(x) - f^\top H^{-1}k(x)\big)^2\Big]$$
$$= \mathbb{E}_{x\sim\nu}\big[f^\star(x)^2\big] - 2 f^\top H^{-1}\mathbb{E}_{x\sim\nu}\big[f^\star(x)k(x)\big] + f^\top H^{-1}\mathbb{E}_{x\sim\nu}\big[k(x)k(x)^\top\big]H^{-1}f$$
$$\overset{(i)}{=} a^\top a - 2\, a^\top \Psi_{\le m}^\top H^{-1}\Psi_{\le m}D_{\le m}a + a^\top\Psi_{\le m}^\top H^{-1} S H^{-1}\Psi_{\le m}a, \qquad (6)$$
$$\mathrm{Variance}(\hat f_\lambda)/\sigma^2 = \frac{1}{\sigma^2}\,\mathbb{E}_{x\sim\nu}\mathbb{E}_\epsilon\Big[\big((f+\epsilon)^\top H^{-1}k(x) - \mathbb{E}_\epsilon[(f+\epsilon)^\top H^{-1}k(x)]\big)^2\Big] = \frac{1}{\sigma^2}\,\mathbb{E}_{x\sim\nu}\mathbb{E}_\epsilon\big[\epsilon^\top H^{-1}k(x)k(x)^\top H^{-1}\epsilon\big]$$
$$= \frac{1}{\sigma^2}\sum_{i,j}\mathbb{E}[\epsilon_i\epsilon_j]\,\Big[H^{-1}\mathbb{E}_{x\sim\nu}\big[k(x)k(x)^\top\big]H^{-1}\Big]_{i,j} \overset{(i)}{=} \mathrm{Tr}\big(H^{-1}SH^{-1}\big). \qquad (7)$$
Step (i) uses the definition of the squared kernel S, the fact that $f^\star(x) = \sum_{k=1}^m a_k\psi_k(x)$ since $f^\star$ lies in the span of the first m eigenfunctions, and the following consequences of the eigenfunctions' orthonormality:
$$\mathbb{E}_{x\sim\nu}\big[f^\star(x)^2\big] = \sum_{k,k'=1}^m a_k a_{k'}\,\mathbb{E}_{x\sim\nu}[\psi_k(x)\psi_{k'}(x)] = \sum_{k,k'=1}^m a_k a_{k'}\,\delta_{k,k'} = \|a\|_2^2,$$
$$\mathbb{E}_{x\sim\nu}\big[[f^\star(x)k(x)]_i\big] = \sum_{k=1}^m\sum_{k'=1}^\infty a_k\lambda_{k'}\,\mathbb{E}_{x\sim\nu}[\psi_k(x)\psi_{k'}(x)]\,\psi_{k'}(x_i) = \sum_{k=1}^m a_k\lambda_k\psi_k(x_i) = [\Psi_{\le m}D_{\le m}a]_i.$$
We now bound the closed-form expressions of bias and variance individually.

Bias: Lemma 5 below yields $S = \Psi_{\le m}D_{\le m}^2\Psi_{\le m}^\top + S_{>m}$. Hence, the bias decomposes into
$$\mathrm{Bias}^2(\hat f_\lambda) = \underbrace{\big\|\big(I_m - D_{\le m}\Psi_{\le m}^\top H^{-1}\Psi_{\le m}\big)a\big\|^2}_{=:B_1} + \underbrace{a^\top\Psi_{\le m}^\top H^{-1}S_{>m}H^{-1}\Psi_{\le m}a}_{=:B_2}.$$
First, we rewrite $B_1$ as
$$B_1 = \big\|\big(I_m - D_{\le m}\Psi_{\le m}^\top H^{-1}\Psi_{\le m}\big)a\big\|^2 \overset{(i)}{=} \Big\|\Big(D_{\le m} - D_{\le m}\Psi_{\le m}^\top\big(\Psi_{\le m}D_{\le m}\Psi_{\le m}^\top + H_{>m}\big)^{-1}\Psi_{\le m}D_{\le m}\Big)D_{\le m}^{-1}a\Big\|^2$$
$$\overset{(ii)}{=} \Big\|\big(D_{\le m}^{-1} + \Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}\big)^{-1}D_{\le m}^{-1}a\Big\|^2, \qquad (8)$$
where (i) uses the decomposition $H = K_{\le m} + H_{>m}$ with $H_{>m} := K_{>m} + \lambda I_n$, and (ii) applies the Woodbury matrix identity.
Second, we can upper-bound $B_2$ as follows:
\[
\begin{aligned}
B_2 &\le \|S_{>m}\|\, a^\top\Psi_{\le m}^\top H^{-2}\Psi_{\le m} a = \|S_{>m}\|\,\big(D_{\le m}^{-1}a\big)^\top D_{\le m}\Psi_{\le m}^\top H^{-2}\Psi_{\le m} D_{\le m}\big(D_{\le m}^{-1}a\big)\\
&\overset{(i)}{=} \|S_{>m}\|\, a^\top D_{\le m}^{-1}\big(D_{\le m}^{-1} + \Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}\big)^{-1}\Psi_{\le m}^\top H_{>m}^{-2}\Psi_{\le m}\big(D_{\le m}^{-1} + \Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}\big)^{-1} D_{\le m}^{-1} a\\
&\le \underbrace{\|S_{>m}\|\,\big\|\Psi_{\le m}^\top H_{>m}^{-2}\Psi_{\le m}\big\|}_{=:C_1}\,\underbrace{\Big\|\big(D_{\le m}^{-1} + \Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}\big)^{-1} D_{\le m}^{-1}a\Big\|^2}_{=B_1},
\end{aligned}
\]
where (i) uses Lemma 6. Thus, the bias can be bounded by $B_1(1 + C_1)$ with $C_1 \ge 0$. We proceed by upper-bounding $C_1$:
\[
\begin{aligned}
1 + C_1 &\le 1 + n\|S_{>m}\|\,\Big\|\frac{\Psi_{\le m}^\top\Psi_{\le m}}{n}\Big\|\,\frac{1}{\mu_{\min}(H_{>m})^2}
\overset{(i)}{\le} 1 + 1.5\,\frac{n\lambda_{m+1}\|K_{>m}\|}{(\mu_{\min}(K_{>m}) + \lambda)^2}
\le 1 + 1.5\,\frac{n\lambda_{m+1}(\|K_{>m}\| + \lambda)}{(\mu_{\min}(K_{>m}) + \lambda)^2}\\
&\le 1 + 1.5\,\frac{n\lambda_{m+1}}{\max\{\lambda,1\}}\cdot\frac{\max\{\lambda,1\}^2}{(\mu_{\min}(K_{>m}) + \lambda)^2}\cdot\frac{\|K_{>m}\| + \lambda}{\max\{\lambda,1\}}
= 1 + 1.5\,\frac{r_2}{r_1^2}\,\frac{n\lambda_{m+1}}{\max\{\lambda,1\}}\\
&\overset{(ii)}{\le} 1 + 1.5\,\frac{r_2}{r_1^2}\,\max\Big\{\frac{n\lambda_{m+1}}{\max\{\lambda,1\}},\,1\Big\} = 1 + 1.5\,\frac{r_2}{r_1^2}\,\tau_2, \qquad (9)
\end{aligned}
\]
where (i) uses Equation (5) to bound $\|\Psi_{\le m}^\top\Psi_{\le m}/n\|$ and Lemma 7 to bound $\|S_{>m}\|$, and (ii) follows from $cx + 1 \le (c+1)\max\{x,1\}$ for $c \ge 0$. Hence, to conclude the bias bound, we need to bound $B_1$ in Equation (8) from above and below.

Upper bound:
\[
B_1 \le \Big\|\Big(\frac{D_{\le m}^{-1}}{n} + \frac{\Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}}{n}\Big)^{-1}\Big\|^2\,\frac{\|D_{\le m}^{-1}a\|^2}{n^2}
\le \frac{\|H_{>m}\|^2}{\mu_{\min}\big(\Psi_{\le m}^\top\Psi_{\le m}/n\big)^2}\,\frac{\|D_{\le m}^{-1}a\|^2}{n^2}
\overset{(i)}{\le} 4\,(\|K_{>m}\| + \lambda)^2\,\frac{\|D_{\le m}^{-1}a\|^2}{n^2}
= 4 r_2^2\max\{\lambda,1\}^2\,\frac{\|D_{\le m}^{-1}a\|^2}{n^2}, \qquad (10)
\]
where (i) follows from Equation (5). Combining Equation (9) in (i) and Equation (10) in (ii) yields the desired upper bound on the bias:
\[
\mathrm{Bias}^2 \le B_1 + B_2 \le (1 + C_1)B_1 \overset{(i)}{\le} \Big(1 + 1.5\,\frac{r_2}{r_1^2}\,\tau_2\Big)B_1 \overset{(ii)}{\le} 4\Big(1 + 1.5\,\frac{r_2}{r_1^2}\,\tau_2\Big) r_2^2\max\{\lambda,1\}^2\,\frac{\|D_{\le m}^{-1}a\|^2}{n^2}.
\]
Lower bound:
\[
\begin{aligned}
B_1 &\ge \mu_{\min}\bigg(\Big(\frac{D_{\le m}^{-1}}{n} + \frac{\Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}}{n}\Big)^{-1}\bigg)^2\,\frac{\|D_{\le m}^{-1}a\|^2}{n^2}
\ge \Bigg(\frac{1}{\frac{1}{n\lambda_m} + \frac{1}{\mu_{\min}(H_{>m})}\big\|\frac{\Psi_{\le m}^\top\Psi_{\le m}}{n}\big\|}\Bigg)^2\frac{\|D_{\le m}^{-1}a\|^2}{n^2}\\
&\overset{(i)}{\ge} \Bigg(\frac{\frac{n\lambda_m}{\max\{\lambda,1\}}}{1.5\,\frac{\max\{\lambda,1\}}{\mu_{\min}(K_{>m})+\lambda}\,\frac{n\lambda_m}{\max\{\lambda,1\}} + 1}\Bigg)^2\,\frac{\max\{\lambda,1\}^2\,\|D_{\le m}^{-1}a\|^2}{n^2}\\
&\overset{(ii)}{\ge} \Big(\frac{r_1}{1.5 + r_1}\Big)^2\min\Big\{\frac{n\lambda_m}{\max\{\lambda,1\}},\,1\Big\}^2\,\frac{\max\{\lambda,1\}^2\,\|D_{\le m}^{-1}a\|^2}{n^2}
= \Big(\frac{r_1}{1.5 + r_1}\Big)^2\tau_1^2\,\frac{\max\{\lambda,1\}^2\,\|D_{\le m}^{-1}a\|^2}{n^2},
\end{aligned}
\]
where (i) follows from Equation (5), and (ii) from the fact that $\frac{x}{cx+1} \ge \frac{1}{1+c}\min\{x,1\}$ for $x, c \ge 0$. Since $\mathrm{Bias}^2 \ge B_1$, this concludes the lower bound for the bias.

Variance. As for the bias bound, we first apply Lemma 5 to write $S = \Psi_{\le m} D_{\le m}^2\Psi_{\le m}^\top + S_{>m}$ and decompose the variance in Equation (7) into
\[
\mathrm{Variance}(\hat f_\lambda)/\sigma^2 = \underbrace{\operatorname{Tr}\big(H^{-1}\Psi_{\le m} D_{\le m}^2\Psi_{\le m}^\top H^{-1}\big)}_{=:V_1} + \underbrace{\operatorname{Tr}\big(H^{-1} S_{>m} H^{-1}\big)}_{=:V_2}.
\]
Next, we rewrite $V_1$ as follows:
\[
\begin{aligned}
V_1 &= \operatorname{Tr}\big(H^{-1}\Psi_{\le m}D_{\le m}^2\Psi_{\le m}^\top H^{-1}\big) = \operatorname{Tr}\big(D_{\le m}\Psi_{\le m}^\top H^{-2}\Psi_{\le m}D_{\le m}\big)\\
&\overset{(i)}{=} \operatorname{Tr}\Big(\big(D_{\le m}^{-1} + \Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}\big)^{-1}\Psi_{\le m}^\top H_{>m}^{-2}\Psi_{\le m}\big(D_{\le m}^{-1} + \Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}\big)^{-1}\Big)\\
&= \frac{1}{n}\operatorname{Tr}\bigg(\Big(\frac{D_{\le m}^{-1}}{n} + \frac{\Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}}{n}\Big)^{-1}\frac{\Psi_{\le m}^\top H_{>m}^{-2}\Psi_{\le m}}{n}\Big(\frac{D_{\le m}^{-1}}{n} + \frac{\Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}}{n}\Big)^{-1}\bigg),
\end{aligned}
\]
where (i) follows from Lemma 6. To bound $V_1$ and $V_2$, we use the fact that the trace is the sum of all eigenvalues; hence, the trace of a matrix is at most its dimension times its largest eigenvalue, and at least its dimension times its smallest eigenvalue. This yields the following bounds for $V_1$:
\[
V_1 \le \frac{m}{n}\,\Big\|\Big(\frac{D_{\le m}^{-1}}{n} + \frac{\Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}}{n}\Big)^{-1}\frac{\Psi_{\le m}^\top H_{>m}^{-2}\Psi_{\le m}}{n}\Big(\frac{D_{\le m}^{-1}}{n} + \frac{\Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}}{n}\Big)^{-1}\Big\|
\overset{(i)}{\le} \frac{m}{n}\,4 r_2^2\max\{\lambda,1\}^2\,\frac{\|\Psi_{\le m}^\top\Psi_{\le m}/n\|}{\mu_{\min}(H_{>m})^2}
\overset{(ii)}{\le} 6\,\frac{m}{n}\,\frac{r_2^2\max\{\lambda,1\}^2}{(\mu_{\min}(K_{>m})+\lambda)^2} = 6\Big(\frac{r_2}{r_1}\Big)^2\frac{m}{n},
\]
and
\[
V_1 \ge \frac{m}{n}\,\mu_{\min}\bigg(\Big(\frac{D_{\le m}^{-1}}{n} + \frac{\Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}}{n}\Big)^{-1}\frac{\Psi_{\le m}^\top H_{>m}^{-2}\Psi_{\le m}}{n}\Big(\frac{D_{\le m}^{-1}}{n} + \frac{\Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}}{n}\Big)^{-1}\bigg)
\overset{(iii)}{\ge} \frac{m}{n}\Big(\frac{r_1}{1.5+r_1}\Big)^2\tau_1^2\max\{\lambda,1\}^2\,\frac{\mu_{\min}\big(\Psi_{\le m}^\top\Psi_{\le m}/n\big)}{(\|K_{>m}\|+\lambda)^2}
\overset{(iv)}{\ge} \frac{1}{2}\,\frac{m}{n}\Big(\frac{r_1}{1.5+r_1}\Big)^2\tau_1^2\,\frac{\max\{\lambda,1\}^2}{(\|K_{>m}\|+\lambda)^2} = \frac{1}{2}\,\frac{m}{n}\Big(\frac{r_1}{r_2(1.5+r_1)}\Big)^2\tau_1^2.
\]
For (i) and (iii), we bound the terms analogously to the upper and lower bound of $B_1$, and use Equation (5) in (ii) and (iv). Next, the upper bound on $V_2$ follows from a special case of Hölder's inequality:
\[
V_2 = \operatorname{Tr}\big(H^{-1}S_{>m}H^{-1}\big) \le \operatorname{Tr}(S_{>m})\,\|H^{-2}\| = \frac{\operatorname{Tr}(S_{>m})}{\mu_{\min}(K_{\le m}+H_{>m})^2} \le \frac{\operatorname{Tr}(S_{>m})}{\mu_{\min}(H_{>m})^2} = \frac{\operatorname{Tr}(S_{>m})}{r_1^2\max\{\lambda,1\}^2}.
\]
For the lower bound, we need a more accurate analysis. First, we apply the identity
\[
H^{-1} = (H_{>m} + K_{\le m})^{-1} = H_{>m}^{-1} - \underbrace{H_{>m}^{-1}K_{\le m}(H_{>m} + K_{\le m})^{-1}}_{=:A}, \qquad (12)
\]
which is valid since $H$ and $H_{>m}$ are full rank. Next, note that the rank of $A$ is bounded by the rank of $K_{\le m}$, which can be written as $\Psi_{\le m}D_{\le m}\Psi_{\le m}^\top$ and therefore has rank at most $m$. Furthermore, Equation (5) implies that $\Psi_{\le m}^\top\Psi_{\le m}$ has full rank, and hence $m \le n$. Now let $\{v_1,\dots,v_{\operatorname{rank}(A)}\}$ be an orthonormal basis of $\operatorname{col}(A)$, and let $\{v_{\operatorname{rank}(A)+1},\dots,v_n\}$ be an orthonormal basis of $\operatorname{col}(A)^\perp$. Hence, $\{v_1,\dots,v_n\}$ is an orthonormal basis of $\mathbb{R}^n$, and similarity invariance of the trace yields
\[
\begin{aligned}
V_2 &= \operatorname{Tr}\big(H^{-1}S_{>m}H^{-1}\big) = \sum_{i=1}^n v_i^\top H^{-1}S_{>m}H^{-1}v_i \ge \sum_{i=\operatorname{rank}(A)+1}^n v_i^\top H^{-1}S_{>m}H^{-1}v_i\\
&\overset{(i)}{=} \sum_{i=\operatorname{rank}(A)+1}^n v_i^\top\big(H_{>m}^{-1}-A\big)S_{>m}\big(H_{>m}^{-1}-A\big)v_i \overset{(ii)}{=} \sum_{i=\operatorname{rank}(A)+1}^n v_i^\top H_{>m}^{-1}S_{>m}H_{>m}^{-1}v_i\\
&\ge \frac{1}{\|H_{>m}\|^2}\sum_{i=\operatorname{rank}(A)+1}^n v_i^\top S_{>m}v_i = \frac{1}{\|H_{>m}\|^2}\operatorname{Tr}\big(P^\top S_{>m}P\big),
\end{aligned}
\]
where $P$ is the projection matrix of $\mathbb{R}^n$ onto $\operatorname{col}(A)^\perp$, and (i) follows from Equation (12). Step (ii) uses that, for all $i > \operatorname{rank}(A)$, $v_i$ is orthogonal to the column space of $A$, and hence
\[
v_i^\top\big(H_{>m}^{-1}-A\big)S_{>m}\big(H_{>m}^{-1}-A\big)v_i = v_i^\top H_{>m}^{-1}S_{>m}H_{>m}^{-1}v_i - \underbrace{v_i^\top A}_{=0}S_{>m}H_{>m}^{-1}v_i - v_i^\top H_{>m}^{-1}S_{>m}\underbrace{Av_i}_{=0} + \underbrace{v_i^\top A S_{>m}A v_i}_{=0}.
\]
Finally, let $\mu_i(\cdot)$ denote the $i$-th eigenvalue of its argument in decreasing order. Then, the Cauchy interlacing theorem yields $\mu_{\operatorname{rank}(A)+i}(S_{>m}) \le \mu_i(P^\top S_{>m}P)$ for all $i = 1,\dots,n-\operatorname{rank}(A)$.
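The interlacing step can be illustrated numerically. The sketch below (a generic random symmetric matrix, not the $S_{>m}$ of the proof) checks that compressing a symmetric matrix to a subspace of codimension $r$ shifts its eigenvalues by at most $r$ positions, i.e., $\mu_{r+i}(S) \le \mu_i(Q^\top S Q)$:

```python
import numpy as np

# Sanity check (our own toy example) of the Cauchy interlacing step: for a
# symmetric matrix S and an orthonormal basis Q of an (n - r)-dimensional
# subspace (playing the role of col(A)^perp), the eigenvalues of the compressed
# matrix Q^T S Q dominate the shifted eigenvalues of S:
#   mu_{r+i}(S) <= mu_i(Q^T S Q),  i = 1, ..., n - r  (decreasing order).
rng = np.random.default_rng(1)
n, r = 10, 3

M = rng.standard_normal((n, n))
S = M @ M.T                                         # symmetric PSD test matrix

Q, _ = np.linalg.qr(rng.standard_normal((n, n - r)))  # random (n-r)-dim subspace
compressed = Q.T @ S @ Q

mu_S = np.sort(np.linalg.eigvalsh(S))[::-1]           # decreasing
mu_C = np.sort(np.linalg.eigvalsh(compressed))[::-1]

assert np.all(mu_S[r:] <= mu_C + 1e-9)
```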
This implies
\[
\operatorname{Tr}\big(P^\top S_{>m}P\big) = \sum_{i=1}^{n-\operatorname{rank}(A)}\mu_i\big(P^\top S_{>m}P\big) \ge \sum_{i=1}^{n-\operatorname{rank}(A)}\mu_{\operatorname{rank}(A)+i}(S_{>m}) = \sum_{i=1+\operatorname{rank}(A)}^{n}\mu_i(S_{>m}) \overset{(i)}{\ge} \sum_{i=1+m}^{n}\mu_i(S_{>m}),
\]
where (i) uses that the rank of $A$ is bounded by $m$. This concludes the lower bound on $V_2$ as follows:
\[
V_2 = \operatorname{Tr}\big(H^{-1}S_{>m}H^{-1}\big) \ge \frac{\sum_{i=m+1}^n\mu_i(S_{>m})}{\|H_{>m}\|^2} \ge \frac{\sum_{i=m+1}^n\mu_i(S_{>m})}{r_2^2\max\{\lambda,1\}^2}.
\]

A.2 TECHNICAL LEMMAS

Lemma 5 (Squared kernel decomposition). Let $K$ be a kernel function that, under a distribution $\nu$, can be decomposed as $K(x,x') = \sum_k \lambda_k\psi_k(x)\psi_k(x')$, where $\mathbb{E}_{x\sim\nu}[\psi_k(x)\psi_{k'}(x)] = \delta_{k,k'}$. Then, the squared kernel $S(x,x') = \mathbb{E}_{z\sim\nu}[K(x,z)K(x',z)]$ can be written as $S(x,x') = \sum_k\lambda_k^2\psi_k(x)\psi_k(x')$, and for any $m > 0$, the corresponding kernel matrix can be written as $S = \Psi_{\le m}D_{\le m}^2\Psi_{\le m}^\top + S_{>m}$.

Proof. The statement follows directly from
\[
S(x,x') = \sum_{k,k'}\lambda_k\lambda_{k'}\psi_k(x)\underbrace{\mathbb{E}_{z\sim\nu}[\psi_k(z)\psi_{k'}(z)]}_{\delta_{k,k'}}\psi_{k'}(x') = \sum_k\lambda_k^2\psi_k(x)\psi_k(x').
\]

Lemma 6 (Corollary of Lemma 20 in Bartlett et al. (2020)).
\[
D_{\le m}\Psi_{\le m}^\top H^{-2}\Psi_{\le m}D_{\le m} = \big(D_{\le m}^{-1} + \Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}\big)^{-1}\Psi_{\le m}^\top H_{>m}^{-2}\Psi_{\le m}\big(D_{\le m}^{-1} + \Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}\big)^{-1}.
\]
Proof.
\[
\begin{aligned}
D_{\le m}\Psi_{\le m}^\top H^{-2}\Psi_{\le m}D_{\le m} &= D_{\le m}^{1/2}\Big(D_{\le m}^{1/2}\Psi_{\le m}^\top H^{-2}\Psi_{\le m}D_{\le m}^{1/2}\Big)D_{\le m}^{1/2}\\
&\overset{(i)}{=} D_{\le m}^{1/2}\Big(I_m + D_{\le m}^{1/2}\Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}D_{\le m}^{1/2}\Big)^{-1}D_{\le m}^{1/2}\Psi_{\le m}^\top H_{>m}^{-2}\Psi_{\le m}D_{\le m}^{1/2}\Big(I_m + D_{\le m}^{1/2}\Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}D_{\le m}^{1/2}\Big)^{-1}D_{\le m}^{1/2}\\
&= \big(D_{\le m}^{-1} + \Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}\big)^{-1}\Psi_{\le m}^\top H_{>m}^{-2}\Psi_{\le m}\big(D_{\le m}^{-1} + \Psi_{\le m}^\top H_{>m}^{-1}\Psi_{\le m}\big)^{-1},
\end{aligned}
\]
where (i) applies Lemma 20 from Bartlett et al. (2020).

Lemma 7 (Squared kernel tail). For $m > 0$, let $S_{>m}$ be the kernel matrix of the truncated squared kernel $S_{>m}(x,x') = \sum_{k>m}\lambda_k^2\psi_k(x)\psi_k(x')$, and let $K_{>m}$ be the kernel matrix of the truncated original kernel $K_{>m}(x,x') = \sum_{k>m}\lambda_k\psi_k(x)\psi_k(x')$. Then, $\|S_{>m}\| \le \lambda_{m+1}\|K_{>m}\|$.

Proof. We show that $v^\top S_{>m}v \le \lambda_{m+1}v^\top K_{>m}v$ for any vector $v$, which implies the claim. To do so, we define $\Psi_k\in\mathbb{R}^{n\times n}$ with $[\Psi_k]_{i,j} = \psi_k(x_i)\psi_k(x_j)$ for all $k > m$.
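Lemma 7's tail bound is easy to validate on synthetic data. The following sketch (random features and eigenvalues of our own choosing) builds $K_{>m}$ and $S_{>m}$ from rank-one terms $\Psi_k$ and checks the spectral-norm inequality:

```python
import numpy as np

# Numerical check (toy construction, our own) of the squared-kernel tail bound:
# with K_>m = sum_{k>m} lambda_k Psi_k and S_>m = sum_{k>m} lambda_k^2 Psi_k for
# rank-one PSD matrices Psi_k = psi_k psi_k^T and decreasing lambda_k, the
# spectral norms satisfy ||S_>m|| <= lambda_{m+1} * ||K_>m||.
rng = np.random.default_rng(2)
n, m, tail = 6, 4, 30

lams = np.sort(rng.uniform(0.0, 1.0, size=m + tail))[::-1]   # decreasing eigenvalues
psis = rng.standard_normal((m + tail, n))                    # psi_k evaluated at the n points

K_tail = sum(lams[k] * np.outer(psis[k], psis[k]) for k in range(m, m + tail))
S_tail = sum(lams[k] ** 2 * np.outer(psis[k], psis[k]) for k in range(m, m + tail))

spec = lambda A: np.linalg.norm(A, ord=2)
assert spec(S_tail) <= lams[m] * spec(K_tail) + 1e-9         # lams[m] is lambda_{m+1}
```

The inequality holds pointwise in the quadratic form, exactly as in the proof: each rank-one term $\Psi_k$ is PSD, so scaling its weight $\lambda_k^2 \le \lambda_{m+1}\lambda_k$ preserves the ordering.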
Then we can write $K_{>m} = \sum_{k>m}\lambda_k\Psi_k$ and $S_{>m} = \sum_{k>m}\lambda_k^2\Psi_k$. Since the eigenvalues are in decreasing order, we have $\lambda_k\le\lambda_{m+1}$ for any $k > m$, and thus
\[
v^\top S_{>m}v = \sum_{k>m}\lambda_k^2\, v^\top\Psi_k v \le \lambda_{m+1}\sum_{k>m}\lambda_k\, v^\top\Psi_k v = \lambda_{m+1}v^\top K_{>m}v.
\]

For functions $f, g:\{-1,1\}^d\to\mathbb{R}$, define the inner product $\langle f, g\rangle := \mathbb{E}_{x\sim U(\{-1,1\}^d)}[f(x)g(x)]$. The functions $Y_S$ play a key role in the remainder of our proof; as it turns out, they are the eigenfunctions of the kernel in Section 3.1. Towards a formal statement, define the polynomials
\[
G_l^{(d)}\Big(\frac{\langle x,x'\rangle}{\sqrt d}\Big) := \frac{1}{B(l,d)}\sum_{|S|=l}Y_S(x)Y_S(x'), \qquad (14)
\]
which only depend on the (Euclidean) inner product of $x$ and $x'$. Furthermore, $\{G_l^{(d)}\}_{l=0}^d$ is a set of orthonormal polynomials with respect to the distribution of $\langle x,x'\rangle/\sqrt d$. The following lemma shows how such polynomials form an eigenbasis for functions that only depend on the inner product between points in the unit hypercube.

Lemma 8 (Local kernel decomposition). Let $\kappa:\mathbb{R}\to\mathbb{R}$ be any function and $d\in\mathbb{N}_{>0}$. Then, for any $x,x'\in\{-1,1\}^d$, we can decompose $\kappa(\langle x,x'\rangle/d)$ as
\[
\kappa\Big(\frac{\langle x,x'\rangle}{d}\Big) = \sum_{l=0}^d\xi_l^{(d)}\,\frac{1}{B(l,d)}\sum_{|S|=l}Y_S(x)Y_S(x'). \qquad (16)
\]

Proof. Note that the decomposition only needs to hold at the evaluations of $\kappa$ at the values that $\langle x,x'\rangle/d$ can take, that is, at the $d+1$ points $\{-1, -1+2/d, \dots, 1\}$. Since the $d+1$ polynomials $\{G_l^{(d)}\}_{l=0}^d$ are orthonormal, and hence linearly independent, with respect to the distribution of $\langle x,x'\rangle/\sqrt d$, they span all functions on these $d+1$ points, and we can write
\[
\kappa\Big(\frac{\langle x,x'\rangle}{d}\Big) = \sum_{l=0}^d c_l\,G_l^{(d)}\Big(\frac{\langle x,x'\rangle}{\sqrt d}\Big)
\]
for some (unknown) coefficients $c_l$. Finally, the proof follows by expanding the definition of $G_l^{(d)}$ in Equation (14) and choosing $\xi_l^{(d)} = c_l$.
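Lemma 8 can be verified exhaustively in small dimension. The sketch below ($d = 4$ and $\kappa = \exp$ are our own choices) expands $\kappa(\langle x,x'\rangle/d)$ over the parity functions $Y_S(x) = \prod_{i\in S}x_i$ and checks that the degree-diagonal expansion reconstructs the kernel exactly:

```python
import itertools
import numpy as np

# Exhaustive check (small toy instance, our own) of the local kernel
# decomposition: on {-1,1}^d, any kernel kappa(<x,x'>/d) decomposes over parity
# functions Y_S(x) = prod_{i in S} x_i with one coefficient per subset S, and by
# symmetry only "diagonal" terms Y_S(x) Y_S(x') appear.
d = 4
cube = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))   # all 2^d points
G = np.exp(cube @ cube.T / d)                                      # kappa = exp, Gram over all pairs

recon = np.zeros_like(G)
for r in range(d + 1):
    for S in itertools.combinations(range(d), r):
        Y = np.prod(cube[:, list(S)], axis=1) if S else np.ones(len(cube))
        # coefficient of Y_S(x) Y_S(x'): projection in the uniform inner product
        c = (Y @ G @ Y) / (2 ** d) ** 2
        recon += c * np.outer(Y, Y)

assert np.allclose(recon, G)
```

Flipping any single coordinate of both $x$ and $x'$ leaves $\langle x,x'\rangle$ invariant, which is why all cross terms $Y_S(x)Y_{S'}(x')$ with $S \ne S'$ have zero coefficient and the diagonal expansion is exact.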

B.2 CONVOLUTIONAL KERNELS ON THE HYPERCUBE

While the previous subsection considers general functions on the hypercube, we now focus on convolutional kernels and their eigenvalues. This yields the tools to prove Lemmas 2 and 4. To characterize the eigenvalues, we first follow existing literature such as Misiakiewicz & Mei (2021) and introduce useful quantities. Let $S\subseteq\{1,\dots,d\}$. The diameter of $S$ is
\[
\gamma(S) := \max_{i,j\in S}\,\min\{\operatorname{mod}(j-i,d)+1,\ \operatorname{mod}(i-j,d)+1\}
\]
for $S\ne\emptyset$, and $\gamma(\emptyset) = 0$. Furthermore, we define
\[
C(l,q,d) := \big|\{S\subseteq\{1,\dots,d\}\mid |S|=l,\ \gamma(S)\le q\}\big|. \qquad (17)
\]
Intuitively, the diameter of $S$ is the smallest number of contiguous feature indices that fully contain $S$. The following lemma yields an explicit formula for $C(l,q,d)$, that is, the number of sets of size $l$ with diameter at most $q$.

Lemma 9 (Number of overlapping sets). Let $l,q,d\in\mathbb{N}$ with $l\le q< d/2$. Then,
\[
C(l,q,d) = \begin{cases} d\binom{q-1}{l-1} & l > 0,\\[2pt] 1 & l = 0.\end{cases}
\]

Proof. Since the result holds trivially for $l = 0$ and $l = 1$, we henceforth focus on $l \ge 2$. Let $\tilde C(l,\gamma,d)$ be the number of subsets $S\subseteq\{1,\dots,d\}$ of cardinality $|S| = l$ with diameter exactly $\gamma(S) = \gamma$. First, consider $\tilde C(2,\gamma,d)$. For each set, we can choose the first element $i$ from $d$ different values, and the second as $\operatorname{mod}((i-1)\pm(\gamma-1), d) + 1$. In this way, since $q < d/2$, we count each set exactly twice. Thus, $\tilde C(2,\gamma,d) = d\cdot 2/2 = d$. Next, consider $l > 2$. We can build all possible sets by starting with one of the $\tilde C(2,\gamma,d) = d$ sets, and adding the remaining $l-2$ elements from the $\gamma-2$ indices strictly between the two endpoints. Hence, every fixed set of size 2 and diameter $\gamma$ yields $\binom{\gamma-2}{l-2}$ different sets of size $l$. Furthermore, by construction, every set of size $l$ and diameter $\gamma$ results from exactly one set of size 2 and diameter $\gamma$. Therefore,
\[
\tilde C(l,\gamma,d) = d\binom{\gamma-2}{l-2}.
\]
The result for $l\ge 2$ then follows from summing $\tilde C(l,\gamma,d)$ over all diameters $\gamma\le q$:
\[
C(l,q,d) = d\sum_{\gamma=l}^q\binom{\gamma-2}{l-2} \overset{(i)}{=} d\binom{q-1}{l-1},
\]
where (i) follows from the hockey-stick identity.
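Lemma 9 can be checked by brute force for small instances. The sketch below (parameters $d = 9$, $q = 4$ are our own) computes the diameter as the smallest contiguous cyclic window containing $S$, which is the interpretation given in the text, and compares the count against $d\binom{q-1}{l-1}$:

```python
import itertools
from math import comb

# Brute-force check (our own) of Lemma 9: for 1 <= l <= q < d/2, the number of
# subsets of {0, ..., d-1} of cardinality l whose smallest contiguous cyclic
# window has length at most q equals d * comb(q-1, l-1).
def diameter(S, d):
    # length of the smallest contiguous window (mod d) that contains S
    return min(max((j - i) % d for j in S) + 1 for i in S)

d, q = 9, 4
for l in range(1, q + 1):
    count = sum(
        1
        for S in itertools.combinations(range(d), l)
        if diameter(S, d) <= q
    )
    assert count == d * comb(q - 1, l - 1)
```

For example, with $d = 9$, $q = 4$, $l = 3$, both the brute-force count and the formula give $9\binom{3}{2} = 27$.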
Now, we focus on cyclic convolutional kernels $K$ as in Equation (1). First, we restate Proposition 1 from Misiakiewicz & Mei (2021). This proposition establishes that the $Y_S$ for $S\subseteq\{1,\dots,d\}$ are indeed the eigenfunctions of $K$, and yields closed-form eigenvalues $\lambda_S$ up to factors $\xi^{(q)}_{|S|}$ that depend on the inner nonlinearity $\kappa$. Next, Lemma 10 uses additional regularity assumptions on $\kappa$ to eliminate the dependency on $\xi^{(q)}_{|S|}$. This characterization of the eigenvalues then enables the proofs of Lemmas 2 and 4.

Proposition 1 (Proposition 1 from Misiakiewicz & Mei (2021)). Let $K$ be a cyclic convolutional kernel over the unit hypercube as defined in Equation (1). Then,
\[
K(x,x') := \frac{1}{d}\sum_{k=1}^d\kappa\Big(\frac{\langle x_{(k,q)}, x'_{(k,q)}\rangle}{q}\Big) = \sum_{l=0}^q\ \sum_{\substack{\gamma(S)\le q\\|S|=l}}\lambda_S\, Y_S(x)Y_S(x'),\qquad\text{with}\qquad \lambda_S = \xi^{(q)}_{|S|}\,\frac{q+1-\gamma(S)}{d\,B(|S|,q)},
\]
where the $\xi^{(q)}_{|S|}$ are the coefficients of the $\kappa$-decomposition (Equation (16)) over $\{-1,1\}^q$. Alternatively, $K(x,x') = \sum_k\lambda_{S_k}Y_{S_k}(x)Y_{S_k}(x')$, where we order all $S_k\subseteq\{1,\dots,d\}$ with $\gamma(S_k)\le q$ such that $\lambda_{S_k}\ge\lambda_{S_{k+1}}$. In particular, $\lambda_{S_k} = \lambda_k$.

We refer to Proposition 1 from Misiakiewicz & Mei (2021) for a formal proof. Intuitively, the result follows from applying Lemma 8 for each subset of $q$ contiguous features. This is possible because, crucially, any subset of $q$ features is again distributed uniformly on the $q$-dimensional unit hypercube. Lastly, the factor $q+1-\gamma(S)$ stems from the fact that each eigenfunction $Y_S$ appears as many times as there are contiguous index sets of size $\gamma(S)$ supported in a fixed contiguous index set of size $q$. In other words, the term is the number of shifted instances of $S$ supported in a contiguous subset of $q$ features. As mentioned before, Proposition 1 characterizes the eigenvalues of cyclic convolutional kernels $K$ only up to the factors $\xi^{(q)}_{|S|}$ that depend on the inner nonlinearity $\kappa$. To avoid these additional factors, we require the following regularity assumptions:

Assumption 1 (Regularity).
Let $T := \lceil 4 + \frac{4\ell}{\beta}\rceil$. A cyclic convolutional kernel $K(x,x') = \frac1d\sum_{k=1}^d\kappa\big(\frac{\langle x_{(k,q)},x'_{(k,q)}\rangle}{q}\big)$ from the setting of Section 3.1 with inner function $\kappa$ satisfies the regularity assumption if there exist constants $c\ge T$, $c', c'' > 0$ and a series of constants $\{c_l > 0\}_{l=0}^T$ such that, for any $q\ge c$, the decomposition
\[
K(x,x') = \sum_{l=0}^q\ \sum_{\substack{\gamma(S)\le q\\|S|=l}}\xi^{(q)}_{|S|}\,\frac{q+1-\gamma(S)}{d\,B(|S|,q)}\,Y_S(x)Y_S(x')
\]
from Proposition 1 over inputs $x,x'\in\{-1,1\}^d$ satisfies
\[
\xi^{(q)}_l \ge c_l\quad\forall l\in\{0,\dots,T\}, \qquad (18)
\]
\[
\xi^{(q)}_l \ge 0\quad\forall l > T, \qquad (19)
\]
\[
\xi^{(q)}_{q-l} \le \frac{c'}{q^{\,T-l+1}}\quad\forall l\in\{0,\dots,T\}, \qquad (20)
\]
\[
\sum_{l\ge 0}\xi^{(q)}_l \le c''. \qquad (21)
\]
For sufficiently high-dimensional inputs $x, x'$, Equations (18) and (19) ensure that the convolutional kernel $K(x,x')$ in Equation (1) is a valid kernel, and that it can learn polynomials of degree up to $T$. Indeed, if $\xi^{(q)}_l = 0$ for some $l$, then there are no polynomials of degree $l$ among the eigenfunctions of $K$. Furthermore, Equations (18), (20) and (21) guarantee that the eigenvalue tail is sufficiently bounded. This allows us to bound $\|K_{>m}\|$ and $\mu_{\min}(K_{>m})$ in Appendix C. Our assumption resembles Assumption 1 of Misiakiewicz & Mei (2021): for one, Equations (20) and (21) are equivalent to Equations 43 and 44 in Misiakiewicz & Mei (2021). Furthermore, Equation (18) above is a slightly stronger version of their Equation 42, where the strengthening is necessary due to the non-asymptotic nature of our results. We still argue that many standard $\kappa$, for example that of the Gaussian kernel, satisfy Assumption 1 with our convolutional kernel $K$. Because such $\kappa$ satisfy Assumption 1 in Misiakiewicz & Mei (2021), we only need to check that they additionally satisfy our Equation (18). If $\kappa$ is a smooth function, we have $\xi^{(q)}_l = \kappa^{(l)}(0) + o(1)$ for all $l\le T$, where $\kappa^{(l)}$ is the $l$-th derivative of $\kappa$. In particular, all derivatives of the exponential function at 0 are strictly positive, implying Equation (18) for the Gaussian kernel if $d$ is large enough.

The final lemma of this section is a corollary that characterizes the eigenvalues solely in terms of $|S|$, and further shows that, for $d$ large enough, the eigenvalues decay as $|S|$ grows.

Lemma 10 (Corollary of Proposition 1). Consider a cyclic convolutional kernel as in Proposition 1 that satisfies Assumption 1 with $q\in\Theta(d^\beta)$ for some $\beta\in(0,1)$. Then, for any $S\subseteq\{1,\dots,d\}$ such that $\gamma(S)\le q$ and $|S| < T$, the eigenvalue $\lambda_S$ corresponding to the eigenfunction $Y_S$ satisfies
\[
\lambda_S\in\Omega\Big(\frac{1}{d\cdot q^{|S|}}\Big)\qquad\text{and}\qquad\lambda_S\in O\Big(\frac{1}{d\cdot q^{|S|-1}}\Big). \qquad (22)
\]
Furthermore,
\[
\max_{\substack{|S|\ge T\\\gamma(S)\le q}}\lambda_S\in O\Big(\frac{1}{d\cdot q^{T-1}}\Big). \qquad (23)
\]

Proof. Without loss of generality, assume $d$ is large enough such that $d > q/2\ge c/2$, where $c$ is the constant from Assumption 1. Let $S\subseteq\{1,\dots,d\}$ with $\gamma(S)\le q$ be arbitrary, and define $l := |S|$ and $r := q+1-\gamma(S)$. Since $l\le\gamma(S)\le q$, we have $1\le r\le q+1-l$. Furthermore, since $B(l,q) = \binom{q}{l}$, we use the following classical bound on the binomial coefficient throughout the proof:
\[
\Big(\frac{q}{l}\Big)^l \le B(l,q) \le \Big(\frac{eq}{l}\Big)^l.
\]
For the first part of the lemma, assume $|S| < T$. Then, using Assumption 1, we have
\[
\lambda_S = \xi^{(q)}_l\,\frac{r}{d\,B(l,q)} \overset{(i)}{\le} \frac{l^l}{q^l}\,\frac{c''\,r}{d} \le c_{l,1}\,\frac{1}{dq^{l-1}} \le c_{T,1}\,\frac{1}{dq^{l-1}},
\qquad
\lambda_S \overset{(ii)}{\ge} \frac{l^l}{e^l q^l}\,\frac{c_l\, r}{d} \ge c_{l,2}\,\frac{1}{dq^{l}} \ge c_{T,2}\,\frac{1}{dq^{l}}
\]
for some positive constants $c_{l,1}, c_{l,2}$ that depend on $l$, $c_{T,1} := \max_{l\in\{0,\dots,T-1\}}c_{l,1}$, and $c_{T,2} := \min_{l\in\{0,\dots,T-1\}}c_{l,2}$. Step (i) follows from the upper bound in Equation (21) together with non-negativity in Equations (18) and (19), and (ii) follows from the lower bound in Equation (18) together with $r\ge 1$. Since $c_{T,1}$ and $c_{T,2}$ do not depend on $l$, this concludes the first part of the proof.

For the second part of the proof, we consider two cases depending on whether $|S|\in[T, q-T]$ or $|S| > q-T$. First, assume $T\le|S|\le q-T$. Then,
\[
\lambda_S \overset{(i)}{=} \xi^{(q)}_{|S|}\,\frac{q+1-\gamma(S)}{d\binom{q}{|S|}} \overset{(ii)}{\le} \xi^{(q)}_{|S|}\,\frac{q+1-\gamma(S)}{d\binom{q}{T}} \overset{(iii)}{\le} \frac{c''\, q}{d}\Big(\frac{T}{q}\Big)^T = \frac{c''\,T^T}{dq^{T-1}}, \qquad (24)
\]
where (i) follows from Proposition 1. In step (ii), we use that $\binom{q}{|S|}$ is minimized over $|S|\in[T,q-T]$ when $|S|$ has the largest distance to $q/2$. Lastly, step (iii) applies the upper bound from Equation (21) together with non-negativity of $\xi^{(q)}_{|S|}$ in Assumption 1, and the classical bound on the binomial coefficient. Now assume $|S| > q-T$. Then,
\[
\lambda_S \overset{(i)}{=} \xi^{(q)}_{|S|}\,\frac{q+1-\gamma(S)}{d\binom{q}{|S|}} \le \xi^{(q)}_{q-(q-|S|)}\,\frac{q}{d\binom{q}{q-(q-|S|)}} \overset{(ii)}{=} \xi^{(q)}_{q-(q-|S|)}\,\frac{q}{d\binom{q}{q-|S|}} \overset{(iii)}{\le} \xi^{(q)}_{q-(q-|S|)}\,\frac{(q-|S|)^{q-|S|}\,q}{d\,q^{q-|S|}} \overset{(iv)}{\le} \frac{c'\,(q-|S|)^{q-|S|}\,q}{d\,q^{T-(q-|S|)+1}\,q^{q-|S|}} \overset{(v)}{\le} \frac{c'\,T^T}{d\,q^{T}}, \qquad (25)
\]
where (i) follows from Proposition 1, (ii) from the fact that $\binom{q}{q-k} = \binom{q}{k}$, (iii) from the classical bound on the binomial coefficient, (iv) from Equation (20) in Assumption 1, and (v) from the fact that $q-|S| < T$ in the current case. Combining Equations (24) and (25) from the two cases finally yields
\[
\max_{\substack{|S|\ge T\\\gamma(S)\le q}}\lambda_S \le \max\Bigg\{\max_{\substack{T\le|S|\le q-T\\\gamma(S)\le q}}\lambda_S,\ \max_{\substack{|S|>q-T\\\gamma(S)\le q}}\lambda_S\Bigg\} \le \max\Big\{\frac{c''\,T^T}{dq^{T-1}},\ \frac{c'\,T^T}{dq^{T}}\Big\} \in O\Big(\frac{1}{dq^{T-1}}\Big),
\]
which concludes the second part of the proof.

B.3 PROOF OF LEMMA 2

First, note that $f^\star = Y_{S^*}$ for $S^* = \{1,\dots,L^*\}$. Since $|S^*| = \gamma(S^*) = L^*$, Proposition 1 yields
\[
\lambda_{S^*} = \xi^{(q)}_{L^*}\,\frac{q+1-L^*}{d\,B(L^*,q)} \overset{(i)}{\in} \Theta\Big(\frac{q}{dq^{L^*}}\Big) = \Theta\Big(\frac{1}{dq^{L^*-1}}\Big) \overset{(ii)}{\subseteq} \omega(\lambda_m),
\]
where (i) uses Equations (18) and (21) in Assumption 1 for $L^*\le T$ and $d$ large enough to get $\xi^{(q)}_{L^*}\in\Theta(1)$, and (ii) uses $\lambda_m\in o\big(\frac{1}{dq^{L^*-1}}\big)$. Hence, for $d$ sufficiently large, $\lambda_{S^*} > \lambda_m$. Since the eigenvalues are in decreasing order, this implies that $f^\star$ is in the span of the first $m$ eigenfunctions. This further yields $\|D_{\le m}^{-1}a\| = \lambda_{S^*}^{-1}\in\Theta\big(dq^{L^*-1}\big)$, since the entry of $a$ corresponding to $Y_{S^*}$ is 1 while all others are 0.
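The classical binomial-coefficient bounds invoked repeatedly in the proofs above can be confirmed directly:

```python
from math import comb, e

# Quick check (ours) of the classical bounds used throughout the appendix:
# (q/l)^l <= comb(q, l) <= (e*q/l)^l for all 1 <= l <= q.
for q in range(1, 60):
    for l in range(1, q + 1):
        assert (q / l) ** l <= comb(q, l) <= (e * q / l) ** l
```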

B.4 PROOF OF LEMMA 4

Before proving Lemma 4, we introduce the following quantity: L := ℓ -ℓ λ -1 β . ( ) Intuitively, L corresponds to the degree of the largest polynomial that a cyclic convolutional kernel as defined in Equation ( 1) can learn. This quantity plays a key role throughout the proof of Lemma 4, and Lemma 3 later. Finally, note that δ as defined in Theorem 1 can be written as δ = ℓ-ℓ λ -1 β -L. Proof of Lemma 4. First, we use Proposition 1 to write the cyclic convolutional kernel as K(x, x ′ ) = q l=0 γ(S)≤q |S|=l λ S Y S (x)Y S (x ′ ) = k λ S k Y S k (x)Y S k (x ′ ) where the λ S k are ordered such that λ S k+1 ≤ λ S k . For the first part of the proof, we need to pick an m ∈ N such that nλ m ∈ Θ (max{λ, 1}) and m ∈ Θ nq -δ max{λ,1} . We will equivalently choose m ∈ Θ(dq L ) with λ m ∈ Θ 1 d•q L+δ ; since n ∈ Θ(d ℓ ) and max{λ, 1} ∈ Θ(d ℓ λ ), we have Θ max{λ, 1} n = Θ d ℓ λ d ℓ = Θ d ℓ λ d 1+ℓ λ +β(L+δ) = Θ 1 d • q L+δ , Θ nq -δ max{λ, 1} = Θ d ℓ q -δ d ℓ λ = Θ d 1+ℓ λ +β(L+δ) q -δ d ℓ λ = Θ(dq L ). The remainder of the proof proceeds in five steps: we first construct a candidate S m ⊆ {1, . . . , d} with γ(S m ) ≤ q, show that the rate of the eigenvalue corresponding to Y Sm satisfies λ Sm = λ m ∈ Θ 1 d•q L+δ , show that the rate of m ∈ Θ(dq L ), establish Θ nq -δ max{λ,1} ⊆ O(n • q -δ ), and finally show that λ m ∈ o 1 d•q L * -1 for appropriate L * . Construction of m We consider two different S m depending on δ: S m = {1, . . . , L, ⌊q + 1 -q 1-δ ⌋} δ ∈ (0, 1) {1, . . . , L, ⌊q/2⌋} δ = 0. For d-and hence q ∈ Θ(d β )-large enough, S m is well-defined, |S m | = L + 1, and the diameter is γ(S m ) = ⌊q + 1 -q 1-δ ⌋ δ ∈ (0, 1) ⌊q/2⌋ δ = 0. For the rest of the proof, assume that d is sufficiently large. Rate of λ Sm Using Proposition 1 and |S m | = L + 1, we can write λ Sm = ξ (q) L+1 q + 1 -γ(S m ) dB(L + 1, q) . First, we show that the numerator is in Θ q 1-δ for both definitions of S m . 
In the case where δ ∈ (0, 1), we have q + 1 -⌊q + 1 -q 1-δ ⌋ = -⌊-q 1-δ ⌋ = ⌈q 1-δ ⌉ (i) ∈ Θ q 1-δ , where (i) follows from δ < 1 and q sufficiently large. In the case where δ = 0, we have q + 1 -⌊q/2⌋ ≤ q + 1 ∈ O(q), q + 1 -⌊q/2⌋ ≥ q/2 ∈ Ω(q). Thus, since δ = 0 in this case, the numerator is in Θ(q) = Θ(q 1-δ ). As the denominator does not depend on δ, we use the same technique for both δ = 0 and δ ∈ (0, 1). The classical bound on B(L + 1, q) = q L+1 yields q L+1 ≲ q L + 1 L+1 ≤ q L + 1 ≤ e L+1 q L + 1 L+1 ≲ q L+1 . Therefore, dB(L + 1, q) ∈ Θ dq L+1 . Finally, since L + 1 ≤ T = ⌈4 + 4ℓ/β⌉, we have ξ (q) L+1 ∈ Θ(1) by Equations ( 18) and ( 21) in Assumption 1 for d sufficiently large. Combining all results then yields the desired rate of λ Sm as follows: λ Sm ∈ Θ q 1-δ dq L+1 = Θ 1 d • q L+δ . ( ) Rate of m To establish m ∈ Θ(dq L ), we bound m individually from above and below. Upper bound: Since the eigenvalues are in decreasing order, we can bound m from above by counting how many eigenvalues are larger than λ m . To do so, we use |S m | = L + 1, and show that for d sufficiently large, all S k with |S k | > L + 1 correspond to eigenvalues λ S k < λ Sm . We first decompose max k:|S k |>L+1 γ(S k )≤q λ S k = max              max k:L+1<|S k |<T γ(S k )≤q λ S k =:M1 , max k:|S k |≥T γ(S k )≤q λ S k =:M2              . For M 1 , let k with L + 1 < |S k | < T be arbitrary. Then, λ S k (i) ∈ O 1 dq |S k |-1 ⊆ O 1 dq (L+2)-1 = O 1 dq L+1 (ii) ⊆ o(λ Sm ), where we apply Equation ( 22) from Lemma 10 in (i), and use Equation ( 28) with δ < 1 in (ii). This implies M 1 ∈ o(λ Sm ). For M 2 , we directly get M 2 = max k:|S k |≥T γ(S k )≤q λ S k (i) ∈ O 1 dq T -1 (ii) ⊆ O 1 dq (L+2)-1 = O 1 dq L+1 (iii) ⊆ o(λ Sm ), where we apply Equation ( 23) from Lemma 10 in (i), (ii) follows from L + 2 ≤ T , and step (iii) uses Equation (28) with δ < 1.

Combined, we have
\[
\max_{\substack{k:\,|S_k|>L+1\\\gamma(S_k)\le q}}\lambda_{S_k} = \max\{M_1, M_2\}\in o(\lambda_{S_m}).
\]
Thus, for $d$ sufficiently large and $|S_k| > L+1$, we have $\lambda_{S_k} < \lambda_{S_m}$. For this reason, $m$ is at most the number of eigenfunctions with degree no larger than $L+1$:
\[
m \le \sum_{l=0}^{L+1}C(l,q,d) \overset{(i)}{\in} O(dq^L),
\]
where (i) uses Lemma 9 for $d$ large enough together with $\binom{q-1}{l-1}\in\Theta(q^{l-1})$.

Lower bound: By construction of $S_m$ in Equation (27), we have $\gamma(S_m) \ge \lfloor q/2\rfloor$. This, combined with Proposition 1, implies that the indices of all polynomials with degree $L+1$ but diameter at most $\lfloor q/2\rfloor - 1$ are smaller than $m$. Hence, for large enough $d$, Lemma 9 yields the following lower bound:
\[
m \ge C\big(L+1, \lfloor q/2\rfloor - 1, d\big) = d\binom{\lfloor q/2\rfloor - 2}{L} \ge d\Big(\frac{\lfloor q/2\rfloor - 2}{L}\Big)^L\in\Omega(dq^L).
\]
The upper and lower bounds together imply $m\in\Theta(dq^L)$. This concludes the existence of an $m\in\mathbb{N}$ such that $\lambda_m$ and $m$ exhibit the desired rates.

Rate of $m$ with respect to $n$: We can write $n$ as
\[
n \in \Theta(d^\ell) = \Theta\big(d\cdot (d^\beta)^{\frac{\ell-1}{\beta}}\big) = \Theta\Big(dq^{\lfloor\frac{\ell-1}{\beta}\rfloor + \big(\frac{\ell-1}{\beta}-\lfloor\frac{\ell-1}{\beta}\rfloor\big)}\Big) = \Theta\Big(dq^{\lfloor\frac{\ell-1}{\beta}\rfloor+\bar\delta}\Big).
\]
Combining this with $L\le\frac{\ell-1}{\beta}$, we directly get $\Theta(dq^L)\subseteq O\big(dq^{\lfloor\frac{\ell-1}{\beta}\rfloor}\big) = O(nq^{-\bar\delta})$.

Rate of $\lambda_m$ for appropriate $L^*$: Since $n\lambda_m\in\Theta(\max\{\lambda,1\})$, we have
\[
\lambda_m\in\Theta\Big(\frac{\max\{\lambda,1\}}{n}\Big) \overset{(i)}{=} \Theta\Big(\frac{d^{\ell_\lambda}}{d\cdot d^{\ell_\lambda} q^{L+\delta}}\Big) = \Theta\Big(\frac{1}{dq^{L+\delta}}\Big),
\]
where (i) uses the identity $\ell = 1 + \ell_\lambda + \beta(L+\delta)$. Assume now $L^*\le\frac{\ell-\ell_\lambda-1}{\beta}$. For the remainder, we consider two cases depending on $\delta$. If $\delta > 0$, then $L^*\le L+1$, and we have
\[
\lambda_m\in O\Big(\frac{q^{-\delta}}{dq^{L^*-1}}\Big)\subseteq o\Big(\frac{1}{dq^{L^*-1}}\Big).
\]
If $\delta = 0$, then $L^*\le L$, and we have
\[
\lambda_m\in O\Big(\frac{q^{-\delta}}{dq^{L}}\Big)\subseteq O\Big(\frac{1}{dq^{L^*}}\Big)\subseteq o\Big(\frac{1}{dq^{L^*-1}}\Big).
\]
In both cases, $\lambda_m\in o\big(\frac{1}{dq^{L^*-1}}\big)$.

In the following proof, we show that the matrix $\Psi_{\le m}^\top\Psi_{\le m}/n$ concentrates around the identity matrix for all $m\in O(nd^{-\beta\bar\delta})$, thereby establishing Equation (5). Let $\bar m$ be the largest $m\in O(nq^{-\bar\delta})$. The proof consists of applying Theorem 5.44 from Vershynin (2012) to the matrix $\Psi_{\le\bar m}$, and extending the result to all suitable choices of $m$ simultaneously.
More precisely, let $c$ be the implicit constant of the $O(nq^{-\bar\delta})$ notation, and define $\bar m$ to be the largest $m\in\mathbb{N}$ with $m\le c\cdot nq^{-\bar\delta}$. Note that $\bar m$ exists, because $d$ is large enough and fixed.

Bound for $\bar m$: To apply Theorem 5.44 from Vershynin (2012), we need to verify the theorem's conditions on the rows of $\Psi_{\le\bar m}$. In particular, we show that the rows are independent, have a common second-moment matrix, and have bounded norm. Let $[\Psi_{\le\bar m}]_{i,:}$ denote the $i$-th row of $\Psi_{\le\bar m}\in\mathbb{R}^{n\times\bar m}$. We may write each row entry-wise as $[\Psi_{\le\bar m}]_{i,:} = [Y_1(x_i)\ Y_2(x_i)\ \cdots\ Y_{\bar m}(x_i)]^\top$. First, the rows of $\Psi_{\le\bar m}$ are independent, since each row depends on a different $x_i$, and we assume the data to be i.i.d. Second, since the eigenfunctions are orthonormal w.r.t. the data distribution, the second moment of the rows is $\mathbb{E}\big[[\Psi_{\le\bar m}]_{i,:}[\Psi_{\le\bar m}]_{i,:}^\top\big] = I_{\bar m}$ for all rows $i\in\{1,\dots,n\}$. Third, to show that each row has bounded norm, we use the fact that the eigenfunctions $Y_k$ in Equation (13) over $\{-1,1\}^d$ satisfy $Y_k(x_i)^2 = 1$ for all $k$. Thus, the norm of each row is
\[
\big\|[\Psi_{\le\bar m}]_{i,:}\big\|_2 = \sqrt{\sum_{k=1}^{\bar m}Y_k(x_i)^2} = \sqrt{\bar m}.
\]
We can now apply Theorem 5.44 from Vershynin (2012). For any $t\ge 0$, this yields the following inequality with probability at least $1 - \bar m\exp\{-ct^2\}$, where $c$ is an absolute constant:
\[
\Big\|\frac{\Psi_{\le\bar m}^\top\Psi_{\le\bar m}}{n} - I_{\bar m}\Big\| \le \max\big\{\|I_{\bar m}\|^{1/2}\Delta,\ \Delta^2\big\},\qquad\text{where}\quad \Delta = t\sqrt{\frac{\bar m}{n}}.
\]
The choice $t = \frac{1}{2}\sqrt{n/\bar m}$ yields $\max\{\|I_{\bar m}\|^{1/2}\Delta,\Delta^2\} = 1/2$, and the following error probability for large enough $d$:
\[
\bar m\exp\Big\{-c\,\frac{n}{4\bar m}\Big\} \overset{(i)}{\lesssim} nq^{-\bar\delta}\exp\big\{-c'\, q^{\bar\delta}\big\} \lesssim q^{-\bar\delta}\cdot d^{\ell}\exp\{-c'q^{\bar\delta}\} \lesssim q^{-\bar\delta},
\]
where (i) follows from $\bar m\in O(nq^{-\bar\delta})$.

Bound for any $m < \bar m$: Note that $\Psi_{\le m}^\top\Psi_{\le m}$ is a submatrix of $\Psi_{\le\bar m}^\top\Psi_{\le\bar m}$. Thus, $\frac{\Psi_{\le m}^\top\Psi_{\le m}}{n} - I_m$ is also a submatrix of $\frac{\Psi_{\le\bar m}^\top\Psi_{\le\bar m}}{n} - I_{\bar m}$. Therefore,
\[
\Big\|\frac{\Psi_{\le m}^\top\Psi_{\le m}}{n} - I_m\Big\| \le \Big\|\frac{\Psi_{\le\bar m}^\top\Psi_{\le\bar m}}{n} - I_{\bar m}\Big\| \le \frac12
\]
with probability at least $1 - cd^{-\beta\bar\delta}$, uniformly over all $m\le\bar m$.
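The concentration argument can be illustrated empirically. The sketch below (dimensions and the degree-$\le 2$ truncation are our own choices) builds $\Psi$ from parity functions evaluated on uniform sign vectors, whose rows have norm exactly $\sqrt m$, and checks that $\|\Psi^\top\Psi/n - I_m\|$ is small for $n \gg m$:

```python
import itertools
import numpy as np

# Empirical illustration (our own toy instance) of the concentration step: the
# columns of Psi are parity functions Y_S(x) = prod_{i in S} x_i, which are
# orthonormal under the uniform distribution on {-1,1}^d, so Psi^T Psi / n
# concentrates around the identity as n grows.
rng = np.random.default_rng(3)
d, n = 10, 20000

X = rng.choice([-1.0, 1.0], size=(n, d))
# the constant, all degree-1, and a few degree-2 parities (arbitrary truncation)
subsets = [()] + [(i,) for i in range(d)] + list(itertools.combinations(range(d), 2))[:9]
Psi = np.column_stack([np.prod(X[:, list(S)], axis=1) for S in subsets])
m = Psi.shape[1]

deviation = np.linalg.norm(Psi.T @ Psi / n - np.eye(m), ord=2)
assert deviation <= 0.5    # far below the 1/2 threshold used in the proof
```

With $m = 20$ and $n = 20000$, the deviation is on the order of $\sqrt{m/n}\approx 0.03$, consistent with the $\Delta = t\sqrt{\bar m/n}$ scaling above.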

C.2 FURTHER DECOMPOSITION OF THE TERMS AFTER m

In this section, we focus on the concentration of the smallest and largest eigenvalue of the kernel matrix K >m to prove Lemma 3. However, this proof is involved, and requires additional tools. In particular, we further decompose K >m into two kernels K 1 and K 2 . In the following, we consider the setting of Theorem 1 with a convolutional kernel K that satisfies Assumption 1. We define the additional notation L := ℓ -ℓ λ -1 β and L := ℓ -1 β . Intuitively, L is the maximum polynomial degree that K can learn with regularization, and L is the analogue without regularization. Finally, note that δ and δ as defined in Theorem 1 can be written as δ = ℓ-ℓ λ -1 β -L and δ = ℓ-1 β -L, respectively. We now introduce the two additional kernels, and then show in Lemma 11 that K >m = K 1 + K 2 . First, applying Proposition 1 to K yields K(x, x ′ ) = k λ S k Y S k (x)Y S k (x ′ ), where {S k } k>0 is a sequence of all subsets S k ⊆ {1, . . . , d} with γ(S k ) ≤ q, ordered such that λ S k ≥ λ S k+1 . Next, let m ∈ N be such that nλ Sm ∈ Θ(max{λ, 1}), and define the index sets I 1 := {k ∈ N | k > m and |S k | ≤ L + 1}, I 2 := {k ∈ N | |S k | ≥ L + 2}. Those sets induce the following kernels: K 1 (x, x ′ ) := k∈I1 λ S k Y S k (x)Y S k (x ′ ), S 1 (x, x ′ ) := k∈I1 λ 2 S k Y S k (x)Y S k (x ′ ), K 2 (x, x ′ ) := k∈I2 λ S k Y S k (x)Y S k (x ′ ), S 2 (x, x ′ ) := k∈I2 λ 2 S k Y S k (x)Y S k (x ′ ), where S 1 and S 2 are the squared kernels corresponding to K 1 and K 2 , respectively. The empirical kernel matrices K 1 , K 2 , S 1 , S 2 ∈ R n×n are [K 1 ] i,j = K 1 (x i , x j ), [K 2 ] i,j = K 2 (x i , x j ), [S 1 ] i,j = S 1 (x i , x j ), and [S 2 ] i,j = S 2 (x i , x j ). Furthermore, as in the original kernel decomposition, we define the matrices Ψ 1 ∈ R n×|I1| , [Ψ 1 ] i,j = Y S k j (x i ), D 1 ∈ R |I1|×|I1| , D 1 = diag(λ S k 1 , . . . , λ S k |I 1 | ), where {k j } |I1| j=1 is a sequence of all indices in I 1 ordered such that λ S k j ≥ λ S k j+1 . 
Intuitively, Ψ 1 , D 1 are the analogue to Ψ ≤m , D ≤m in the original decomposition K = K ≤m + K >m . Lastly, we define m as the largest eigenvalue corresponding to an eigenfunction Y S of degree |S| ≥ L + 2, that is, m := min I 2 . Using the previous definitions, the following lemma establishes that K 1 and K 2 indeed constitute a decomposition of K >m . Lemma 11 (1-2 decomposition). For d sufficiently large, we have K >m (x, x ′ ) = K 1 (x, x ′ ) + K 2 (x, x ′ ) and S >m (x, x ′ ) = S 1 (x, x ′ ) + S 2 (x, x ′ ). Proof. For the decomposition of K >m , we have to show that exactly the eigenfunctions with index larger than m appear in either K 1 or K 2 , that is, I 1 ∪ I 2 = {k > m}, and that no eigenfunction appears in both K 1 or K 2 , that is, I 1 ∩ I 2 = ∅. Furthermore, since we can write S >m (x, x ′ ) = k>m λ 2 S k Y S k (x)Y S k (x ′ ) by Lemma 5, the same argument implies the 1-2 decomposition of S >m . First, from the definition of I 1 and I 2 , it follows directly that I 1 ∩ I 2 = ∅, that I 1 ∪ I 2 ⊇ {k > m}, and that I 1 ⊆ {k > m}. Hence, to conclude the proof, we only need to show that I 2 ⊆ {k > m}. Since the eigenvalues are sorted in decreasing order, we equivalently show that, for d sufficiently large, all eigenvalues λ S k with k ∈ I 2 are smaller than λ Sm ∈ Θ(max{λ, 1}/n). More precisely, we show that max k∈I2 nλ S k ∈ o(nλ Sm ) = o(max{λ, 1}). Using T from Assumption 1, we have max k∈I2 nλ S k = max          max k∈I2:|S k |<T nλ S k =:M1 , max k∈I2:|S k |≥T nλ S k =:M2          . For M 1 , we bound a generic k ∈ I 2 with |S k | < T as follows: nλ S k (i) ∈ O n dq |S k |-1 ⊆ O n dq ( L+2)-1 = O d 1+ℓ λ +β(L+δ) d 1+β( L+δ) d -β(1-δ) = O(max{λ, 1}d β(L-L) d -β(1-δ) ) (ii) ⊆ o(nλ m ), where (i) applies Equation ( 22) from Lemma 10, and (ii) uses L ≥ L and δ < 1. In particular, this implies M 1 ∈ o(nλ m ). 
For M 2 we have max k∈I2:|S k |≥T nλ S k = n max |S k |≥T γ(S k )≤q λ S k (i) ∈ O n dq T -1 (ii) ⊆ O n dq ( L+2)-1 = O d 1+ℓ λ +β(L+δ) d 1+β( L+δ) d -β(1-δ) = O(max{λ, 1}d β(L-L) d -β(1-δ) ) (iii) ⊆ o(nλ m ), where (i) applies Equation ( 23) from Lemma 10, (ii) uses that L + 2 ≤ T , and (iii) follows from L ≥ L and δ < 1. Combining the bounds on M 1 and M 2 , we have max k∈I2 nλ S k ∈ o(nλ Sm ) = o(max{λ, 1}). Hence, for d sufficiently large, all k ∈ I 2 yield λ S k < λ Sm and consequently k > m. Using the 1-2 decomposition, we now prove Lemma 3. We defer the auxiliary Lemmas 12 to 15 to Appendix C.4, and concentration-results to Appendix C.5.

C.3 PROOF OF LEMMA 3

Throughout the proof, we assume d to be large enough such that all quantities are well-defined and all necessary lemmas apply. In particular, we assume the conditions of Lemma 11 to be satisfied, and that c < ⌊q/2⌋ < q < d/2 for c in Assumption 1. Hence, L + 2 ≤ L + 2 < T < ⌊q/2⌋, and we can apply Lemmas 9 to 15, the setting of Appendix C.2, as well as Assumption 1 throughout the proof. We will mention additional implicit lower bounds on d as they arise. The proof proceeds in three steps: we first bound r 1 and r 2 , then bound Tr(S >m ), and finally n i=1+m µ i (S >m ). We do not establish the required matrix concentration results directly, but apply various auxiliary lemmas. All corresponding statements hold with either probability at least 1 -cq - δ or at least 1 -cq - (1-δ) for context-dependent constants c. We hence implicitly choose a c > 0 such that collecting all error probabilities yields the statement of Lemma 3 with probability at least 1 -cd -β min{ δ,1-δ} . To start, let m ∈ N as in the statement of Lemma 3, and instantiate Appendix C.2 with that m. In particular, Lemma 11 yields the 1-2 decomposition K >m = K 1 + K 2 and S >m = S 1 + S 2 , which we will henceforth use. Finally, we define Q (d,q) l (x, x ′ ) := γ(S)≤q |S|=l q + 1 -γ(S) dB(l, q) Y S (x)Y S (x ′ ) with the corresponding kernel matrix Q (d,q) l ∈ R n×n . Bound on r 1 and r 2 Remember the definition of r 1 and r 2 : r 1 = µ min (K >m ) + λ max{λ, 1} , r 2 = ∥K >m ∥ + λ max{λ, 1} To bound those quantities, we have to bound µ min (K >m ) and ∥K >m ∥. For the upper bound on ∥K >m ∥, we use the triangle inequality on ∥K >m ∥ = ∥K 1 + K 2 ∥, and then bound ∥K 1 ∥ and ∥K 2 ∥ individually. Note that we can write K 1 = Ψ 1 D 1 Ψ ⊺ 1 by definition. 
Hence, ∥K 1 ∥ = ∥Ψ 1 D 1 Ψ ⊺ 1 ∥ = n∥D 1 ∥ Ψ ⊺ 1 Ψ 1 n (i) ≤ 1.5n∥D 1 ∥ (ii) ≤ 1.5nλ m (iii) ∈ O (max{λ, 1}) , where (i) follows from Lemma 12 with probability at least 1 -c1 q -δ , (ii) uses that all eigenvalues of K 1 are at most λ m by definition, and (iii) follows from nλ m ∈ Θ(max{λ, 1}). Next, Lemma 13 directly yields with probability at least 1 -c2 q -(1-δ) that ∥K 2 ∥ ∈ Θ( 1) and µ min (K 2 ) ∈ Θ(1). Hence, with probability at least 1 - (c 1 q -δ + c2 q -(1-δ) ) ≥ 1 -cq -min{ δ,1-δ} , we have ∥K 1 ∥, ∥K 2 ∥ ∈ O (max{λ, 1}) , µ min (K 2 ) ∈ Ω(1). This implies ∥K >m ∥ + λ ≤ ∥K 1 ∥ + ∥K 2 ∥ + λ ∈ O(max{λ, 1}), µ min (K >m ) + λ ≥ µ min (K 2 ) + λ ∈ Ω(max{λ, 1}), and subsequently r 2 = ∥K >m ∥ + λ max{λ, 1} ∈ O(1), r 1 = µ min (K >m ) + λ max{λ, 1} ∈ Ω(1). Finally, since r 1 ≤ r 2 , this yields r 1 , r 2 ∈ Θ(1). Bound on Tr(S >m ) We need to show that Tr(S >m ) ≲ d ℓ λ q -δ + d ℓ λ q δ-1 with high probability, where the two terms correspond to the 1-2 decomposition Tr(S >m ) = Tr(S 1 ) + Tr(S 2 ). We differentiate between L = L and L > L. Intuitively, the case L = L corresponds to interpolation or weak regularization, because the maximum degree of learnable polynomials with regularization equals the one without regularization. Conversely, L > L corresponds to strong regularization. Case L = L (interpolation or weak regularization): In this setting, δ = ℓ -ℓ λ -1 β -L = ℓ -1 β -L - ℓ λ β = δ - ℓ λ β . First, Lemma 14 yields Tr(S 2 ) ∈ Θ(q -(1-δ) ) with probability at least 1 -c3 q -(1-δ) . Therefore, Tr(S 2 ) ∈ Θ(q -(1-δ) ) (i) = Θ(q -(1-δ)+ ℓ λ β ) = Θ d ℓ λ q -(1-δ) , where (i) follows from Equation (32). We now consider Tr(S 1 ): Tr(S 1 ) = nTr Ψ ⊺ 1 Ψ 1 n D 2 1 ≤ n Ψ ⊺ 1 Ψ 1 n Tr(D 2 1 ) (i) ≤ 1.5n k∈I1 λ 2 S k , where (i) follows from Lemma 12 with probability at least 1 -c1 q -δ . 
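As a numerical sanity check (not part of the proof), the concentration of Ψ1ᵀΨ1/n around the identity used in step (i) via Lemma 12 can be observed directly for parity features on the hypercube; the dimension d = 8, the degree cutoff 2, the sample size, and the seed below are illustrative choices:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 20000
X = rng.choice([-1.0, 1.0], size=(n, d))

# Parity features Y_S(x) = prod_{i in S} x_i for all subsets S with |S| <= 2.
subsets = [S for r in range(3) for S in itertools.combinations(range(d), r)]
Psi = np.stack([X[:, list(S)].prod(axis=1) for S in subsets], axis=1)

# E[Y_S Y_S'] = delta_{S,S'}, so Psi^T Psi / n concentrates around the identity;
# the deviation shrinks roughly like sqrt(#features / n).
dev = np.linalg.norm(Psi.T @ Psi / n - np.eye(len(subsets)), ord=2)
assert dev < 0.5
```

With these sizes the spectral deviation is well below the 1/2 threshold that the proof needs, so bounds of the form ∥Ψ1ᵀΨ1/n∥ ≤ 1.5 and µ_min ≥ 1/2 hold comfortably.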
The bound continues as Tr(S 1 ) ≲ n k∈I1 λ 2 S k (i) ≤ nλ 2 m |I 1 | (ii) ∈ O max{λ, 1} 2 |I 1 | n (iii) ⊆ O d 2ℓ λ dq L dq L+ δ = O d 2ℓ λ q -δ (iv) = O d ℓ λ q -δ , where (i) uses λ k ≤ λ m for all k ∈ I 1 by definition, (ii) uses nλ m ∈ Θ(max{λ, 1}), and (iv) follows from Equation (32). Furthermore, (iii) uses the following bound of |I 1 |: |I 1 | ≤ L+1 l=0 C(l, q, d) (i) = 1 + d L+1 l=1 q -1 l -1 (ii) ≤ 1 + d L+1 l=1 e q -1 l -1 l-1 (iii) ∈ O(d • q L), where C(l, q, d) is defined in Equation ( 17), (i) follows from Lemma 9, (ii) is a classical bound on the binomial coefficient, and (iii) follows from the fact that the term corresponding to L + 1 dominates the polynomial. Finally, collecting the upper bounds on Tr(S 1 ) and Tr(S 2 ) yields Tr(S ≥m ) = Tr(S 1 ) + Tr(S 2 ) ∈ O d ℓ λ q -δ + d ℓ λ q -(1-δ) with probability at least 1 -(c 3 q -(1-δ) + c1 q -δ ) ≥ 1 -cq -min{ δ,1-δ} . Case L > L (strong regularization): In this setting, the dominating rate will arise from Tr(S 1 ). We start by linking δ and δ in analogy to Equation ( 32): δ = ℓ -ℓ λ -1 β -L = - ℓ λ β + ℓ -1 β -L + L -L = δ - ℓ λ β + L -L. Next, as in the previous case, Lemma 14 yields Tr(S 2 ) ∈ Θ(q -(1-δ) ) with probability at least 1 -c3 q -(1-δ) , and therefore Tr(S 2 ) ∈ Θ(q -(1-δ) ) = Θ(q δ-1 ) (i) = Θ d ℓ λ q δ-( L-L)-1 (ii) ⊆ o d ℓ λ q -(1-δ) , where (i) follows from Equation (33), and (ii) from L > L. To bound Tr(S 1 ), we start as in the previous case: Tr(S 1 ) = nTr Ψ ⊺ 1 Ψ 1 n D 2 1 ≤ n Ψ ⊺ 1 Ψ 1 n Tr(D 2 1 ) (i) ≤ 1.5n k∈I1 λ 2 S k , where (i) follows from Lemma 12 with probability at least 1 -c1 q -δ . We then decompose the sum over all squared eigenvalues with index in I 1 as n k∈I1 λ 2 S k = n k∈I1 |S k |≤L+1 λ 2 S k =:E1 + n k∈I1 |S k |=L+2 λ 2 S k =:E2 + n k∈I1 |S k |≥L+3 λ 2 S k =:E3 , and bound the three terms individually. 
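Step (ii) above relies on the classical bound binom(a, b) ≤ (e·a/b)^b, applied with a = q−1 and b = l−1; it is easy to confirm numerically over a range of parameters:

```python
from math import comb, e

# Classical bound: C(a, b) <= (e*a/b)**b for all 1 <= b <= a.
worst = max(
    comb(a, b) / (e * a / b) ** b
    for a in range(1, 60)
    for b in range(1, a + 1)
)
assert worst <= 1.0  # the bound is never violated
```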
First, we upper-bound E 1 as follows: E 1 = n k∈I1 |S k |≤L+1 λ 2 S k (i) ≤ n 2 λ 2 m n k∈I1 |S k |≤L+1 1 ≤ n 2 λ 2 m n L+1 l=0 C(l, q, d) (ii) ∈ O max{λ, 1} 2 n dq L = O d 2ℓ λ d • d ℓ λ • q L+δ dq L = O d ℓ λ q -δ , where (i) follows from λ k ≤ λ m for all k > m due to the decreasing order of eigenvalues. Step (ii) applies nλ m ∈ Θ(max{λ, 1}), as well as L+1 l=0 C(l, q, d) ∈ O(d • q L ), which follows as in the other case from Lemma 9 and the classical bound on the binomial coefficient. Second, the upper bound of E 2 arises as follows: E 2 = n k∈I1 |S k |=L+2 λ 2 S k (i) = n(ξ (q) L+2 ) 2 k∈I1 |S k |=L+2 q + 1 -γ(S k ) dB(L + 2, q) 2 (ii) ≤ (ξ (q) L+2 ) 2 n • q dB(L + 2, q) γ(S)≤q |S|=L+2 q + 1 -γ(S) dB(L + 2, q) (iii) ≲ n • q d • q L+2 γ(S)≤q |S|=L+2 q + 1 -γ(S) dB(L + 2, q) (iv) = n • q d • q L+2 Q (d,q) L+2 (x, x) (v) ∈ O d • d ℓ λ q L+δ • q d • q L+2 = O d ℓ λ q -(1-δ) , where (i) follows from Proposition 1, and (ii) uses q + 1 -γ(S k ) ≤ q. Next, (iii) uses that Equations ( 18) and ( 21) in Assumption 1 imply ξ (q) L+2 ∈ Θ( 1), and applies the bound B(L + 2, q) = q L+2 ≤ (eq/(L + 2)) L+2 . Step (iv) uses Y S (x)Y S (x) = 1 for all S and x ∈ {-1, 1} d , together with the definition of Q (d,q) L+2 . Lastly, (v) applies Lemma 15 and n ∈ Θ(d ℓ ) = Θ(d • d ℓ λ q L+δ ). Third, we upper-bound E 3 : E 3 = n k∈I1 |S k |≥L+3 λ 2 S k ≤ n max k∈I1,|S k |≥L+3 λ S k∈I1 |S k |≥L+3 λ S k ≲ n max k∈I1,|S k |≥L+3 (λ S k ). The last step follows from k∈I1 |S k |≥L+3 λ S k ≤ L+1 l=L+3 γ(S)≤q |S|=l λ S (i) = L+1 l=L+3 γ(S)≤q |S|=l ξ (q) l q + 1 -γ(S) dB(l, q) (ii) = L+1 l=L+3 ξ (q) l γ(S)≤q |S|=l q + 1 -γ(S) dB(l, q) Y S (x)Y S (x) (iii) = L+1 l=L+3 ξ (q) l (iv) ≲ 1, where (i) follows from Proposition 1, (ii) uses Y S (x)Y S (x) = 1 for all S and x ∈ {-1, 1} d , (iii) applies the definition of Q (d,q) l and Lemma 15, and (iv) follows from Equations ( 18) and ( 21) in Assumption 1 since L + 1 ≤ T . 
For max_{k∈I_1, |S_k| ≥ L+3} λ_{S_k}, we bound each element individually:

λ_{S_k} (i) ∈ O(1/(d q^{|S_k|−1})) ⊆ O(1/(d q^{(L+3)−1})),

where (i) uses Equation (22) in Lemma 10 since |S_k| ≤ L+1 < T by definition of I_1. Hence, we obtain the following bound on E_3:

E_3 ≲ n max_{k∈I_1, |S_k| ≥ L+3} λ_{S_k} ∈ O(n/(d q^{L+2})) = O((d · d^{ℓ_λ} q^{L+δ})/(d q^{L+2})) ⊆ O(d^{ℓ_λ} q^{−(1−δ)}).

Finally, we can bound Tr(S_1) as

Tr(S_1) ≤ 1.5 n Σ_{k∈I_1} λ_{S_k}² = E_1 + E_2 + E_3 ∈ O(d^{ℓ_λ} q^{−δ} + d^{ℓ_λ} q^{−(1−δ)}),

which yields Tr(S_{>m}) = Tr(S_1) + Tr(S_2) ∈ O(d^{ℓ_λ} q^{−δ} + d^{ℓ_λ} q^{−(1−δ)}) as desired with probability at least 1 − (c_3 q^{−(1−δ)} + c_1 q^{−δ}) ≥ 1 − c q^{−min{δ, 1−δ}}.

Bound on Σ_{i=m+1}^n µ_i(S_{>m}): As before, we differentiate between no/weak and strong regularization, that is, between L = L and L > L.

Case L = L (interpolation or weak regularization): In this case, we start by directly bounding

Σ_{i=m+1}^n µ_i(S_{>m}) = Σ_{i=m+1}^n µ_i(S_1 + S_2) ≥ Σ_{i=m+1}^n µ_i(S_2) = Σ_{i=1}^n µ_i(S_2) − Σ_{i=1}^m µ_i(S_2) (i) ≥ Tr(S_2) − m∥S_2∥ (34) (ii) ∈ Ω(q^{−(1−δ)} − m/(d q^{L+1})) (iii) = Ω(d^{ℓ_λ} q^{−(1−δ)} − m/(d q^{L+1})),

where (i) bounds each of the first m eigenvalues of S_2 by the largest one, (ii) follows from Lemma 14 with probability at least 1 − c_3 q^{−(1−δ)}, and (iii) from Equation (32) since L = L. To conclude the lower bound, it suffices to show that m/(d q^{L+1}) ∈ o(d^{ℓ_λ} q^{−(1−δ)}):

m/(d q^{L+1}) (i) ∈ O(n q^{−δ}/(max{λ, 1} d q^{L+1})) = O((q^{−δ} · d q^{δ+L} · d^{ℓ_λ})/(d q^{L+1})) (ii) = O(q^{−δ} d^{ℓ_λ} q^{−(1−δ)}) (iii) ⊆ o(d^{ℓ_λ} q^{−(1−δ)}),

where (i) follows from m ∈ O(n q^{−δ}/max{λ, 1}), and (ii) from Equation (32). For (iii), note that δ = 0 for a sufficiently large c yields a vacuous result. We hence assume without loss of generality that δ > 0, which justifies the step. This concludes the proof for the current case with probability at least 1 − c_3 q^{−(1−δ)} ≥ 1 − c q^{−min{δ, 1−δ}}.
Case L > L (strong regularization): In this case, we define the additional index set I_3 := {k ∈ I_1 | |S_k| = L+2}, with S_3, Ψ_3, D_3 defined analogously to S_1, Ψ_1, D_1 in Appendix C.2, but using I_3 instead of I_1. Since I_3 ⊆ I_1, it follows that Ψ_3ᵀΨ_3 is a submatrix of Ψ_1ᵀΨ_1, and thus Ψ_3ᵀΨ_3/n − I_{|I_3|} is a submatrix of Ψ_1ᵀΨ_1/n − I_{|I_1|}. This particularly implies

∥Ψ_3ᵀΨ_3/n − I_{|I_3|}∥ ≤ ∥Ψ_1ᵀΨ_1/n − I_{|I_1|}∥ (i) ≤ 1/2, (35)

where (i) follows from Lemma 12 with probability at least 1 − c_1 q^{−δ}.

Published as a conference paper at ICLR 2023

We now move our focus back to the lower bound of Σ_{i=m+1}^n µ_i(S_{>m}):

Σ_{i=m+1}^n µ_i(S_{>m}) (i) ≥ Σ_{i=m+1}^n µ_i(S_1) (ii) ≥ Σ_{i=m+1}^n µ_i(S_3) (iii) ≥ Tr(S_3) − m∥S_3∥,

where (i) follows from the 1-2 decomposition S_{>m} = S_1 + S_2, (ii) from the fact that I_3 ⊆ I_1, and (iii) analogously to Equation (34). Similar to the previous case, we conclude the proof by first showing that Tr(S_3) ∈ Ω(d^{ℓ_λ} q^{−(1−δ)}), and then m∥S_3∥ ∈ o(Tr(S_3)). For the lower bound of Tr(S_3), we start with

Tr(S_3) = n Tr((1/n) Ψ_3 D_3² Ψ_3ᵀ) ≥ n Tr(D_3²) µ_min(Ψ_3ᵀΨ_3/n) (i) ≥ 0.5 n Σ_{k∈I_3} λ_k² (ii) = n (ξ^{(q)}_{L+2})² Σ_{|S|=L+2, γ(S)≤q} ((q+1−γ(S))/(d B(L+2, q)))²,

where (i) follows with high probability from Equation (35). Step (ii) applies Proposition 1, and the fact that I_3 = {k ∈ N | |S_k| = L+2 and γ(S_k) ≤ q} for d sufficiently large. To show this, we use

λ_m ∈ Θ(d^{ℓ_λ}/n) = Θ(d^{ℓ_λ}/(d · d^{ℓ_λ} q^{L+δ})) = Θ(1/(d q^{L+δ})).

Since the eigenvalues are in decreasing order and L+2 ≤ L+1 in the current case, we only need to show that λ_S < λ_m for all S ⊆ {1, . . . , d} with |S| = L+2 and γ(S) ≤ q:

λ_S (i) ∈ O(1/(d q^{L+1})) = o(1/(d q^{L+δ})) = o(λ_m),

where (i) applies Equation (22) in Lemma 10 since L+2 < T. Thus, λ_S < λ_m for all S ⊆ {1, . . . , d} with |S| = L+2 and γ(S) ≤ q if d is sufficiently large, which we additionally assume from now on.
The lower bound of Tr(S 3 ) continues as follows: Tr(S 3 ) ≥ n(ξ (q) L+2 ) 2 |S|=L+2 γ(S)≤q q + 1 -γ(S) dB(L + 2, q) 2 ≥ n(ξ (q) L+2 ) 2 |S|=L+2 γ(S)≤⌊q/2⌋ q + 1 -γ(S) dB(L + 2, q) 2 (iii) ≥ (ξ (q) L+2 ) 2 n • q/2 dB(L + 2, q) |S|=L+2 γ(S)≤⌊q/2⌋ ⌊q/2⌋ + 1 -γ(S) dB(L + 2, q) (iv) ≳ (ξ (q) L+2 ) 2 n • q/2 dB(L + 2, q) |S|=L+2 γ(S)≤⌊q/2⌋ ⌊q/2⌋ + 1 -γ(S) dB(L + 2, ⌊q/2⌋) , where (iii) follows from q ≥ ⌊q/2⌋ and q + 1 -γ(S) ≥ q/2. Step (iv) follows from the fact that B(L + 2, q) and B(L + 2, ⌊q/2⌋) are of the same order; this follows from a classical bound on the binomial coefficient: B(L + 2, q) ≤ eq L + 2 L+2 ≲ ⌊q/2⌋ L + 2 L+2 ≤ B(L + 2, ⌊q/2⌋). We conclude the lower bound on Tr(S 3 ) as follows: Tr(S 3 ) ≳ (ξ (q) L+2 ) 2 n • q/2 dB(L + 2, q) |S|=L+2 γ(S)≤⌊q/2⌋ ⌊q/2⌋ + 1 -γ(S) dB(L + 2, ⌊q/2⌋) (v) = (ξ (q) L+2 ) 2 n • q/2 dB(L + 2, q) Q (d,⌊q/2⌋) L+2 (x, x) (vi) = (ξ (q) L+2 ) 2 n • q/2 dB(L + 2, q) (vii) ∈ Ω n • q dB(L + 2, q) (viii) = Ω d • d ℓ λ q L+δ • q d • q L+2 = Ω d ℓ λ q -(1-δ) , where (v) uses Y S (x)Y S (x) = 1 for all S and x ∈ {-1, 1} d with the definition of Q (d,⌊q/2⌋) L+2 in Equation ( 30), (vi) applies Lemma 15 with ⌊q/2⌋ as filter size, (vii) follows from Equation ( 18) in Assumption 1, and (viii) uses the classical bound on the binomial coefficient. Finally, for the upper bound on m∥S 3 ∥, we have m∥S 3 ∥ = m∥Ψ 3 D 3 Ψ ⊺ 3 ∥ ≤ mn∥D 3 ∥ Ψ ⊺ 3 Ψ 3 n (i) ≤ 1.5mn∥D 3 ∥ = mn max k∈I3 λ 2 k (ii) = mn max k∈I3 ξ (q) L+2 q + 1 -γ(S k ) dB(L + 2, q) 2 ≤ m n (ξ (q) L+2 ) 2 n • q dB(L + 2, q) 2 (iii) ∈ O m n dd ℓ λ q L+δ • q dq L+2 2 (iv) ⊆ O q -δ d -ℓ λ d ℓ λ q -(1-δ) 2 = O q -δ q -(1-δ) d ℓ λ q -(1-δ) (v) ⊆ o d ℓ λ q -(1-δ) (vi) ⊆ o(Tr(S 3 )), where (i) follows with high probability from Equation ( 35), and (ii) from Proposition 1. Step (iii) uses that Equations ( 18) and ( 21) in Assumption 1 yield ξ (q) L+2 ∈ Θ(1), and B(L + 2, q) = q L+2 ∈ O(q L+2 ). 
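The same-order claim for B(L+2, q) and B(L+2, ⌊q/2⌋) used above can be verified numerically; in the sketch below, k plays the role of L+2 and (2e)^k is the constant from the displayed chain of inequalities:

```python
from math import comb, e

# Same-order claim: with B(k, q) = C(q, k), we have
# C(q, k) <= (2e)^k * C(q // 2, k) for all q >= 2k and fixed k.
worst = max(
    comb(q, k) / ((2 * e) ** k * comb(q // 2, k))
    for k in range(1, 8)
    for q in range(2 * k, 200)
)
assert worst <= 1.0  # constant factor depends only on k, not on q
```

The ratio C(q, k)/C(⌊q/2⌋, k) in fact approaches 2^k for large q, so the constant (2e)^k is loose but sufficient.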
Furthermore, (iv) follows from m ∈ O(n q^{−δ}/max{λ, 1}), (v) from q^{−δ} q^{−(1−δ)} ∈ o(1), and (vi) from the lower bound on Tr(S_3) in Equation (37).

Note that Î_1 does not depend on m or λ, and I_1 ⊆ Î_1 for any m. Furthermore,

|Î_1| = Σ_{l=0}^{L+1} C(l, q, d) (i) = 1 + d Σ_{l=1}^{L+1} binom(q−1, l−1) (ii) ≤ 1 + d Σ_{l=1}^{L+1} (e(q−1)/(l−1))^{l−1} (iii) ∈ O(d · q^L),

where (i) follows from Lemma 9, (ii) is a classical bound on the binomial coefficient, and (iii) follows from the fact that the largest-degree term dominates the others. Next, we define Ψ̂_1 ∈ R^{n×|Î_1|} with [Ψ̂_1]_{i,j} = Y_{S_{k_j}}(x_i), where k_j is the j-th element in Î_1. Using the same arguments as in the proof of Lemma 1, it follows that the rows of Ψ̂_1 are independent, that their norm is bounded by √|Î_1|, and that they have an expected outer product equal to I_{|Î_1|}. Hence, as in the proof of Lemma 1, we can apply Theorem 5.44 from Vershynin (2012). Choosing t = (1/2)√(n/|Î_1|) yields

∥Ψ̂_1ᵀΨ̂_1/n − I_{|Î_1|}∥ ≤ 1/2

with probability at least 1 − c q^{−δ} for some absolute constant c. Finally, since I_1 ⊆ Î_1 for all choices of λ and m,

∥Ψ_1ᵀΨ_1/n − I_{|I_1|}∥ ≤ ∥Ψ̂_1ᵀΨ̂_1/n − I_{|Î_1|}∥ ≤ 1/2

with probability at least 1 − c q^{−δ} uniformly over all λ and m.

Lemma 13 (Bound on K_2). In the setting of Appendix C.2, if c < ⌊q/2⌋ < q < d/2 for c as in Assumption 1, we have c_1 ≤ µ_min(K_2) ≤ ∥K_2∥ ≤ c_2 for some positive constants c_1, c_2 with probability at least 1 − c q^{−(1−δ)}.

Proof. First, the condition on d implies q > T and ensures that we can apply Lemmas 16 and 17, and Assumption 1. Furthermore, note that T − 1 > L + 2. Proposition 1 and the definition of K_2 yield the following decomposition:

K_2(x, x′) = Σ_{l=L+2}^q ξ^{(q)}_l Σ_{γ(S)≤q, |S|=l} ((q+1−γ(S))/(d B(l, q))) Y_S(x) Y_S(x′) = Σ_{l=L+2}^q ξ^{(q)}_l Q^{(d,q)}_l(x, x′),

where Q^{(d,q)}_l(x, x′) is defined in Equation (30).
Then, using the triangle inequality with non-negativity of the ξ (q) l from Equations ( 18) and ( 19) in Assumption 1, we have ∥K 2 ∥ = q l= L+2 ξ (q) l Q (d,q) l = T -1 l= L+2 ξ (q) l Q (d,q) l + q l=T ξ (q) l Q (d,q) l ≤ T -1 l= L+2 ξ (q) l Q (d,q) l + q l=T ξ (q) l Q (d,q) l (i) ≤ 1.5 T -1 l= L+2 ξ (q) l + c 3 (ii) ≤ c 2 with probability ≥ 1 -cq -(1-δ) . Q (d,q) l is the kernel matrix corresponding to Q (d,q) l (x, x ′ ), (i) uses Lemmas 16 and 17, and (ii) uses non-negativity and additionally Equation ( 21) in Assumption 1. The lower bound follows similarly from µ min (K 2 ) ≥ q l= L+2 ξ (q) l µ min Q (d,q) l ≥ ξ (q) L+2 µ min Q (d,q) L+2 (i) ≥ 1 2 ξ (q) L+2 (ii) ≥ c 1 , where (i) follows from Lemma 16 with probability at least 1 -cq - (1-δ) , and (ii) from Equation ( 18) in Assumption 1. Since Lemma 16 yields both the upper and lower bound for all l uniformly with probability 1cq - (1-δ) , this concludes the proof. Lemma 14 (Bound on Tr(S 2 )). In the setting of Appendix C.2, if c < ⌊q/2⌋ < q < d/2 for c as in Assumption 1, we have with probability at least 1 -cq - (1-δ) that Tr(S 2 ) ∈ Θ(q -(1-δ) ) and ∥S 2 ∥ ∈ O 1 dq L+1 , where S 2 is defined in Appendix C.2. Proof. Throughout the proof, the conditions on d and hence q ∈ Θ(d β ) ensure that we can apply Assumption 1 and Lemma 16, as well as L + 2 ≤ T < ⌊q/2⌋. First, Lemma 13 yields ∥K 2 ∥ ∈ Θ( 1) with probability at least 1 -c ′ q -(1-δ) , which we will use throughout the proof. Next, we bound ∥S 2 ∥ in two steps. For this, remember that λ m is the largest eigenvalue corresponding to an eigenfunction Y S of degree |S| ≥ L + 2. Proof that ∥S 2 ∥ ≤ λ m∥K 2 ∥ Define Ψk ∈ R n×n with [ Ψk ] i,j := Y S k (x i )Y S k (x j ) for all i, j ∈ {1, . . . , n}, k ∈ I 2 , and let v be any vector in R n . Then, ∥S 2 v∥ (i) = k∈I2 λ 2 k Ψk v ≤ max k∈I2 {λ k } k∈I2 λ k Ψk v (ii) = λ m k∈I2 λ k Ψk v = λ m∥K 2 v∥, where (i) follows from the definition of S 2 and (ii) from the definition of m. 
Proof that λ m ∈ Θ 1 dq L+1 We show that λ k ∈ O 1 dq L+1 for all k ∈ I 2 , and that there exists m ∈ I 2 with λ m ∈ Ω 1 dq L+1 . Since λ m = max k∈I2 λ S k , those two facts imply λ m ∈ Θ 1 dq L+1 . Let k ∈ I 2 be arbitrary. Lemma 10 yields λ k ∈    O 1 dq |S k |-1 |S k | < T O 1 dq T -1 |S k | ≥ T (i) ⊆ O 1 dq L+1 , where (i) follows from |S k | ≥ L + 2 and T ≥ L + 2. Now we show that there exists m with λ m ∈ Ω 1 dq L+1 . We choose m with S m = {1, 2, . . . , L+2}. Note that L + 2 = γ(S m) ≤ q and thus m ∈ I 2 . Next, Proposition 1 yields λ S m = ξ (q) L+2 (q + 1 -γ(S m)) dB( L + 2, q) (i) ∈ Θ q dq L+2 ⊆ Θ 1 dq L+1 , where (i) follows from Equations ( 18) and ( 21) in Assumption 1. Finally, combining the previous two results and ∥K 2 ∥ ∈ Θ(1), we have ∥S 2 ∥ ≤ λ m∥K 2 ∥ ∈ O (λ m) = O 1 dq L+1 . Upper bound of Tr(S 2 ) The upper bound also follows directly from the last two results: Tr(S 2 ) ≤ n∥S 2 ∥ ∈ O n dq L+1 = O dq L+ δ dq L+1 = O(q -(1-δ) ). Lower bound of Tr(S 2 ) The lower bound requires a more refined argument. S 2 (x, x ′ ) = k∈I2 λ 2 S k Y S k (x)Y S k (x ′ ) ≥ γ(S)≤q |S|= L+2 λ 2 S Y S (x)Y S (x ′ ) ≥ γ(S)≤⌊q/2⌋ |S|= L+2 λ 2 S Y S (x)Y S (x ′ ) (i) = (ξ (q) L+2 ) 2 γ(S)≤⌊q/2⌋ |S|= L+2 (q + 1 -γ(S)) 2 d 2 B( L + 2, q) 2 Y S (x)Y S (x ′ ) = (ξ (q) L+2 ) 2 dB( L + 2, q) γ(S)≤⌊q/2⌋ |S|= L+2 (q + 1 -γ(S)) q + 1 -γ(S) dB( L + 2, q) Y S (x)Y S (x ′ ) (ii) ≥ (ξ (q) L+2 ) 2 q/2 dB( L + 2, q) γ(S)≤⌊q/2⌋ |S|= L+2 ⌊q/2⌋ + 1 -γ(S) dB( L + 2, q) Y S (x)Y S (x ′ ), where (i) follows from Proposition 1. In (ii), we use that, as long as γ(S) ≤ ⌊q/2⌋, we have q + 1 -γ(S) ≥ q 2 and q ≥ ⌊q/2⌋. Continuing the bound, we have S 2 (x, x ′ ) ≥ (ξ (q) L+2 ) 2 q/2 dB( L + 2, q) γ(S)≤⌊q/2⌋ |S|= L+2 ⌊q/2⌋ + 1 -γ(S) dB( L + 2, q) Y S (x)Y S (x ′ ) (iii) ≥ c ′ L (ξ (q) L+2 ) 2 q/2 dB( L + 2, q) γ(S)≤⌊q/2⌋ |S|= L+2 ⌊q/2⌋ + 1 -γ(S) dB( L + 2, ⌊q/2⌋) Y S (x)Y S (x ′ ) (iv) = c ′ L (ξ (q) L+2 ) 2 q/2 dB( L + 2, q) Q (d,⌊q/2⌋) L+2 (x, x ′ ). 
In (iii), we use the classical bound on the binomial coefficient B(L+2, q) = binom(q, L+2) as follows:

B(L+2, q) ≤ (2e)^{L+2} ((q/2)/(L+2))^{L+2} ≤ (2e)^{L+2} c_L (⌊q/2⌋/(L+2))^{L+2} ≤ c′_L B(L+2, ⌊q/2⌋),

where c′_L is a constant that depends only on L. Finally, (iv) follows from the definition of Q^{(d,⌊q/2⌋)}_{L+2} in Equation (30). The bound on S_2(x, x′) implies

µ_min(S_2) ≥ c′_L (ξ^{(q)}_{L+2})² ((q/2)/(d B(L+2, q))) µ_min(Q^{(d,⌊q/2⌋)}_{L+2}),

and allows us to ultimately lower-bound Tr(S_2) as follows:

Tr(S_2) ≥ n µ_min(S_2) ≥ c′_L n (ξ^{(q)}_{L+2})² ((q/2)/(d B(L+2, q))) µ_min(Q^{(d,⌊q/2⌋)}_{L+2}) (i) ≳ c′_L n (ξ^{(q)}_{L+2})² ((q/2)/(d B(L+2, q))) (ii) ≳ d^ℓ q/(d q^{L+2}) = (d q^{L+δ})/(d q^{L+1}) = q^{−(1−δ)},

where (i) follows from the lower bound on µ_min(Q^{(d,⌊q/2⌋)}_{L+2}) in Lemma 16 with probability at least 1 − c″ ⌊q/2⌋^{−(1−δ)} ≥ 1 − c″ q^{−(1−δ)}, and (ii) follows from Equation (18) in Assumption 1 and the classical lower bound on the binomial coefficient. This yields the desired lower bound Tr(S_2) ∈ Ω(q^{−(1−δ)}). Finally, collecting all error probabilities concludes the proof.

Lemma 15 (Diagonal elements of Q^{(d,q)}_l). Let l, q, d ∈ N with 0 < l ≤ q < d/2 and x ∈ {−1, 1}^d. Then, Q^{(d,q)}_l(x, x) = 1.

Proof. First,

Q^{(d,q)}_l(x, x) = Σ_{γ(S)≤q, |S|=l} ((q+1−γ(S))/(d B(l, q))) Y_S(x)² (i) = (1/(d B(l, q))) Σ_{γ(S)≤q, |S|=l} (q+1−γ(S)) = (1/(d B(l, q))) Σ_{γ=l}^q (q+1−γ) Σ_{γ(S)=γ, |S|=l} 1,

where (i) follows from the fact that Y_S(x)² = 1. Note that Σ_{γ(S)=γ, |S|=l} 1 matches the definition of C(l, γ, d) in the proof of Lemma 9. Next, we use the following recurrence:

Σ_{γ=l}^q (q+1−γ) C(l, γ, d) = Σ_{γ=l}^q [C(l, γ, d) + ((q−1)+1−γ) C(l, γ, d)] = C(l, q, d) + 0 + Σ_{γ=l}^{q−1} ((q−1)+1−γ) C(l, γ, d),

where the last step uses the fact that C(l, q, d) = Σ_{γ=l}^q C(l, γ, d) by definition, and that the term corresponding to γ = q in the second sum is zero. Recursively applying this formula q − l times yields Σ_{γ=l}^q (q+1−γ) C(l, γ, d) = Σ_{γ=l}^q C(l, γ, d).
Using this identity, we finally get Q (d,q) l (x, x) = 1 dB(l, q) q γ=l C(l, γ, d) (i) = 1 dB(l, q) q γ=l d γ -1 l -1 (ii) = 1 dB(l, q) d q l (iii) = 1, where (i) follows from Lemma 9, (ii) from the hockey-stick identity, and (iii) from the definition of B(l, q) in Equation (15).
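Both combinatorial ingredients of this proof can be checked directly: the hockey-stick identity behind step (ii), and the telescoping step behind the recurrence, which for any sequence e_γ reads Σ_γ (q+1−γ) e_γ = Σ_{γ'} Σ_{γ≤γ'} e_γ (below an arbitrary test sequence stands in for the counts):

```python
from math import comb

for l in range(1, 10):
    for q in range(l, 40):
        # Hockey-stick identity: sum_{gamma=l}^{q} C(gamma-1, l-1) = C(q, l).
        assert sum(comb(g - 1, l - 1) for g in range(l, q + 1)) == comb(q, l)

        # Telescoping/Fubini: each pair (gamma, gamma') with gamma <= gamma' <= q
        # is counted once on both sides, for any sequence e_gamma.
        e_vals = {g: g * l + 1 for g in range(l, q + 1)}  # arbitrary test values
        lhs = sum((q + 1 - g) * e_vals[g] for g in range(l, q + 1))
        rhs = sum(sum(e_vals[g] for g in range(l, gp + 1)) for gp in range(l, q + 1))
        assert lhs == rhs
```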

C.5 RANDOM MATRIX THEORY LEMMAS

We use the following lemmas related to random matrix theory in the proof of Lemma 3. The first two results bound the kernel's intermediate and late eigenvalues. Lemma 16 (Bound on the kernel's intermediate tail). In the setting of Appendix C.2, for T -1 ≤ q < d/2, with probability at least 1 -cq - (1-δ) , all l ∈ N with L + 2 ≤ l < T satisfy ∥Q (d,q) l -I n ∥ ≤ 1 2 , where T is defined in Assumption 1 and Q in Equation (30). This lemma particularly implies ∥Q (d,q) l ∥ ≤ 1.5 and µ min Q (d,q) l ≥ 1/2 for all L + 2 ≤ l < T with high probability. Lemma 17 (Bound on the kernel's late tail). In the setting of Appendix C.2, if c ≤ ⌊q/2⌋ -1 < q < d/2, for c and T as in Assumption 1, we have q l=T ξ (q) l Q (d,q) l ≲ 1 with probability at least 1 -cd -1 . Proof of Lemma 16. First, Lemma 15 yields that the diagonal elements of Q (d,q) l are just 1. Hence, we define W (d,q) l := Q (d,q) l -I n , and want to show that, with high probability, ∥W (d,q) l ∥ ≤ 1/2 for all L + 2 ≤ l < T at the same time. The proof makes use of Lemma 18. Therefore, we need to find an appropriate M (l,q) for each considered l, and show that the conditions of the lemma hold. The first condition follows directly from the construction of W (d,q) l and diag(Q (d,q) l ) = I n . To establish the second condition, we have for all i ̸ = j, k ̸ = j E xj [W (d,q) l ] i,j [W (d,q) l ] j,k = 1 (dB(l, q)) 2 γ(S)≤q |S|=l γ(S ′ )≤q |S ′ |=l (q + 1 -γ(S))(q + 1 -γ(S ′ ))Y S (x i )E [Y S (x j )Y S ′ (x j )] Y S ′ (x k ) (i) = 1 (dB(l, q)) 2 γ(S)≤q |S|=l (q + 1 -γ(S)) 2 Y S (x i )Y S (x k ) ≤ q dB(l, q) 1 dB(l, q) γ(S)≤q |S|=l (q + 1 -γ(S))Y S (x i )Y S (x k ) = dB(l, q) q -1 W (d,q) l i,k ≤ dB(l, q) lq -1 W (d,q) l i,k , where (i) follows from orthogonality of the eigenfunctions. Hence, M (l,q) = dB(l, q)/lq satisfies the second condition in Lemma 18 for all L + 2 ≤ l < T . The extra l factor is necessary for the third condition to hold. As in Lemma 18, let p ∈ N, p ≥ 2. 
Then, for all i ≠ j,

E[|[W^{(d,q)}_l]_{i,j}|^p]^{1/p} (i) ≤ (p−1)^{l/2} E[[W^{(d,q)}_l]_{i,j}²] (ii) ≤ p^l Σ_{γ(S)≤q, |S|=l} (q+1−γ(S))²/(d B(l, q))² (iii) ≤ p^l q² C(l, q, d)/(d B(l, q))² = p^l (lq/(d B(l, q))) · (q C(l, q, d)/(l d B(l, q))) (iv) = p^l/M^{(l,q)},

where (i) follows from hypercontractivity in Lemma 19, (ii) from orthogonality of the eigenfunctions, and (iii) from the definition of C(l, q, d) as well as 1 ≤ γ(S) ≤ q. Step (iv) follows from Lemma 9, the definition of B(l, q) in Equation (15), the identity binom(q−1, l−1) · q/l = binom(q, l), and the definition of M^{(l,q)}. Since all conditions are satisfied, Lemma 18 yields, for all p ∈ N, p > 2,

Pr(∥W^{(d,q)}_l∥ > 1/2) ≤ c_{l,1}^p p^{3p} n (n/M^{(l,q)})^p + c_{l,2}^p n/(M^{(l,q)})²,

where c_{l,1}, c_{l,2} are positive constants that depend on l. In particular, if p ≥ 2 + 1/β + (L+δ)/(1−δ), then we get Pr(∥W^{(d,q)}_l∥ > 1/2) ≤ c_l q^{−2(1−δ)}, where c_l is a positive constant that depends on l. To avoid this dependence, we can take the union bound over all l = L+2, . . . , T−1:

Pr(∃ l ∈ {L+2, . . . , T−1} : ∥W^{(d,q)}_l∥ > 1/2) ≤ q^{−2(1−δ)} Σ_{l=L+2}^{T−1} c_l ≤ c′_L q^{−2(1−δ)},

where c′_L only depends on L and T, which are fixed in our setting. Finally, additionally note that neither L nor T depend on ℓ_λ.

Proof of Lemma 17. First, Lemma 15 yields that the diagonal elements of Q^{(d,q)}_l are equal to 1 for all l ∈ {T, . . . , q}. Hence, we define W^{(d,q)}_l := Q^{(d,q)}_l − I_n, and decompose the kernel matrix as

Σ_{l=T}^q ξ^{(q)}_l Q^{(d,q)}_l = Σ_{l=T}^q ξ^{(q)}_l W^{(d,q)}_l + Σ_{l=T}^q ξ^{(q)}_l I_n.
We can hence apply the triangle inequality to bound the norm as follows: q l=T ξ (q) l Q (d,q) l ≤ q l=T ξ (q) l W (d,q) l + q l=T ξ (q) l I n (i) ≤ q l=T ξ (q) l ∥W (d,q) l ∥ + q l=T ξ (q) l ∥I n ∥ (ii) ≤ q l=T ξ (q) l n i̸ =j [W (d,q) l ] 2 i,j + q l=T ξ (q) l (iii) ≤ q l=T ξ (q) l n 2 max i̸ =j [W (d,q) l ] 2 i,j =:ϖ l + c ′′ (iv) ≤ q l=T c ′ q + c ′′ ≲ 1, with probability ≥ 1 -cd -1 , where (i) uses non-negativity of the ξ (q) l from Equations ( 18) and ( 19) in Assumption 1, (ii) bounds the operator norm with the Frobenius norm, and (iii) additionally bounds the sum of the ξ (q) l using Equation ( 21) in Assumption 1. Step (iv) use a bound that we show in the remainder of the proof: with probability at least 1 -cd -1 , we have ϖ l ≤ c ′ /q uniformly over all l ∈ {T, . . . , q}. We first bound ϖ l for a fixed T ≤ l ≤ q as follows: Pr(ϖ l > 1/q) = Pr ξ (q) l n 2 max i̸ =j [W (d,q) l ] 2 i,j > 1/q = Pr max i̸ =j [W (d,q) l ] 2 i,j > 1 n 2 (ξ (q) l ) 2 q 2 = Pr ∃i ̸ = j : [W (d,q) l ] 2 i,j > 1 n 2 (ξ (q) l ) 2 q 2 (i) ≤ n(n -1) Pr [W (d,q) l ] 2 1,2 > 1 n 2 (ξ (q) l ) 2 q 2 (ii) ≤ n 4 (ξ (q) l ) 2 q 2 E [W (d,q) l ] 2 1,2 (iii) = n 4 (ξ (q) l ) 2 q 2 (dB(l, q)) 2 γ(S),γ(S ′ )≤q |S|,|S ′ |=l (q + 1 -γ(S))(q + 1 -γ(S ′ )) E [Y S (x 1 )Y S (x 2 )Y S ′ (x 1 )Y S ′ (x 2 )] δ S,S ′ = n 4 (ξ (q) l ) 2 q 2 (dB(l, q)) 2 γ(S)≤q |S|=l (q + 1 -γ(S)) 2 (iv) ≤ (q + 1 -l)n 4 (ξ (q) l ) 2 q 2 dB(l, q) γ(S)≤q |S|=l (q + 1 -γ(S)) dB(l, q) Y S (x 1 ) 2 ≤ n 4 (ξ (q) l ) 2 q 3 dB(l, q) Q (d,q) l (x 1 , x 1 ) (v) = n 4 q 3 d (ξ (q) l ) 2 B(l, q) , where (i) follows from the union bound and the distribution of the off-diagonal entries in W (d,q) , and (ii) from the Markov inequality. In step (iii), we use orthogonality of the eigenfunctions, as well as the fact that W (d,q) l and Q (d,q) l coincide on off-diagonal entries by construction. Step (iv) follows from Y S (x) 2 = 1 for all S and x ∈ {-1, 1} d . Finally, step (v) applies Lemma 15. 
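The pattern in steps (i) and (ii), a union bound over the off-diagonal entries followed by the Markov inequality, can be checked empirically. In the sketch below, the matrix sizes, the threshold t, and the Rademacher-average entries are illustrative stand-ins, not the W^{(d,q)}_l of the proof:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t, trials = 3, 25, 0.5, 2000

# Z_ij: i.i.d. averages of m Rademacher variables, so E[Z_ij^2] = 1/m.
Z = rng.choice([-1.0, 1.0], size=(trials, n, n, m)).mean(axis=3)
off = ~np.eye(n, dtype=bool)

# Empirical frequency of {max_{i != j} Z_ij^2 > t} versus the
# union-bound-plus-Markov estimate n(n-1) * E[Z^2] / t used in the proof.
emp = np.mean([(Z[s][off] ** 2 > t).any() for s in range(trials)])
bound = n * (n - 1) * (1.0 / m) / t
assert emp <= bound < 1
```

Even though Markov combined with a union bound is loose, it is non-vacuous here (bound < 1), which is all the proof needs after choosing t of order 1/q.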
Next, we use the union bound over all ϖ l of interest: Pr (∃l ∈ {T, . . . , q} : ϖ l > 1/q) ≤ q l=T Pr (ϖ l > 1/q) ≤ q l=T n 4 q 3 d (ξ (q) l ) 2 B(l, q) = n 4 q 3 d   ⌈q/2⌉ l=T (ξ (q) l ) 2 q l + q-T l=⌈q/2⌉+1 (ξ (q) l ) 2 q l + q l=q-T +1 (ξ (q) l ) 2 q l   (i) = n 4 q 3 d   ⌈q/2⌉ l=T (ξ (q) l ) 2 q l + q-⌈q/2⌉-1 l ′ =T (ξ (q) q-l ′ ) 2 q q-l ′ + T -1 l ′ =0 (ξ (q) q-l ′ ) 2 q q-l ′   (ii) ≤ n 4 q 3 d       ⌈q/2⌉ l=T (ξ (q) l ) 2 + (ξ (q) q-l ) 2 q l =:E1 + T -1 l ′ =0 (ξ (q) q-l ′ ) 2 q l ′ =:E2       , where (i) substitutes l ′ = q -l, and (ii) uses q -⌈q/2⌉ -1 ≤ ⌈q/2⌉ as well as the fact that q q-l ′ = q l ′ . We bound both E 1 and E 2 using Assumption 1. For E 1 in particular, Equations ( 18), ( 19) and ( 21) imply that all ξ (q) l ≲ 1. Hence, E 1 = ⌈q/2⌉ l=T (ξ (q) l ) 2 + (ξ (q) q-l ) 2 q l ≲ ⌈q/2⌉ l=T 1 q l (i) ≤ ⌈q/2⌉ l=T 1 q T = ⌈q/2⌉ -T + 1 q T (ii) ≤ T T q q T ≲ 1 q T -1 , where (i) exploits that T is the value in {T, . . . , ⌈q/2⌉} the furthest away from q/2, and thus min l∈{T,...,⌈q/2⌉} q l = q T , and (ii) follows from the classical lower bound on the binomial coefficient. For E 2 , we have E 2 = T -1 l ′ =0 (ξ (q) q-l ′ ) 2 q l ′ (i) ≤ T -1 l ′ =0 l ′l ′ (ξ (q) q-l ′ ) 2 q l ′ (ii) ≤ T -1 l ′ =0 l ′l ′ c ′ q T -l ′ +1 2 q l ′ ≲ T -1 l ′ =0 l ′l ′ q 2T -2l ′ +2+l ′ ≤ T T T q T +2 ≲ 1 q T -1 , where (i) uses the classical bound on the binomial coefficient, and (ii) Equation ( 20) in Assumption 1. Combining the bounds on E 1 and E 2 finally yields Pr (∃l ∈ {T, . . . , q} : ϖ l > 1/q) ≤ q l=T Pr (ϖ l > 1/q) ≤ n 4 q 3 d (E 1 + E 2 ) ≲ n 4 dq T -4 ≲ d 4ℓ-1-β(T -4) (i) ≲ 1 d , where (i) follows from the definition of T = ⌈4 + 4ℓ β ⌉ in Assumption 1. The next statement is a non-asymptotic version of Proposition 3 from Ghorbani et al. (2021) . Lemma 18 (Graph argument). Let W ∈ R n×n be a random matrix that satisfies the following conditions: 1. [W] i,i = 0, ∀i ∈ {1, . . . , n}. 2. There exists M > 0 such that, for all i, j, k ∈ {1, . . 
. , n} with i ̸ = j and j ̸ = k, we have E xj [[W] i,j [W] j,k ] ≤ 1 M |[W] i,k |. 3. There exists l ∈ N such that, for all p ∈ N, p ≥ 2 and all i, j ∈ {1, . . . , n}, i ̸ = j, we have E [|[W] i,j | p ] 1/p ≤ p l M . Then, for all p ∈ N, p > 2, Pr (∥W∥ > 1/2) ≤ c p 1 p 3p n n M p + c p 2 n M 2 , where c 1 and c 2 are positive constants that depend on l. Proof. Repeating the steps in the proof of Proposition 3 from Ghorbani et al. (2021) , we get E[∥W∥ 2p ] ≤ E Tr(W 2p ) ≤ (cp) 3p n p+1 M p + c ′p n M 2 . Note that the proof in Ghorbani et al. (2021) assumes M to be in the order of d l . We get rid of this assumption and keep M explicit. Furthermore, Ghorbani et al. (2021) use their Lemma 4 during their proof, but we use our Lemma 19 instead. Ultimately, we apply the Markov inequality to get a high-probability bound: Pr(∥W∥ ≥ 1/2) = Pr ∥W∥ 2p ≥ (1/2) 2p ≤ E[∥W∥ 2p ] (1/2) 2p = c 3 (1/2) 2 p p 3p n p+1 M p + c ′ (1/2) 2 p n M 2 . Renaming the constants concludes the proof. Lemma 19 (Hypercontractivity). For all l, q, d ∈ N and p ≥ 2, we have E x,x ′ ∼U ({-1,1} d ) |Q (d,q) l (x, x ′ )| p 1/p ≤ (p -1) l/2 E x,x ′ ∼U ({-1,1} d ) (Q (d,q) l (x, x ′ )) 2 , where Q (d,q) l (x, x ′ ) is defined in Equation (30). Proof. Let x, x ′ ∼ U({-1, 1} d ) and let z be the entry-wise product of x and x ′ . Then, for all S ⊆ {1, . . . , d}, Y S (x)Y S (x ′ ) depends only on z: Y S (x)Y S (x ′ ) = i∈S [x] i i∈S [x ′ ] i = i∈S [x] i [x ′ ] i = i∈S [z] i . Hence, Q (d,q) l (x, x ′ ) also only depends on x and x ′ via z. Furthermore, note that z ∼ U({-1, 1} d ). Therefore, we can use hypercontractivity (Beckner, 1975) as for instance in Lemma 4 from Misiakiewicz & Mei (2021) to conclude the proof.
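The reduction at the heart of the proof of Lemma 19, Y_S(x)Y_S(x′) = Y_S(z) for z the entry-wise product of x and x′, can be checked exhaustively in a few lines (d = 5 and the seed are arbitrary):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
d = 5
x = rng.choice([-1.0, 1.0], size=d)
xp = rng.choice([-1.0, 1.0], size=d)
z = x * xp  # entry-wise product; itself uniform on {-1,1}^d when x, x' are

checked = 0
for r in range(d + 1):
    for S in itertools.combinations(range(d), r):
        idx = list(S)
        # The product of parities factors through z: Y_S(x) Y_S(x') = Y_S(z).
        assert np.prod(x[idx]) * np.prod(xp[idx]) == np.prod(z[idx])
        checked += 1
```

Since Q^{(d,q)}_l(x, x′) is a linear combination of such products, it depends on (x, x′) only through z, which is exactly what lets hypercontractivity for a single uniform hypercube variable apply.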

D OPTIMAL REGULARIZATION AND TRAINING ERROR D.1 OPTIMAL REGULARIZATION

In the main text we often refer to the optimal regularization λ_opt, defined as the minimizer of the risk Risk(f̂_λ). While we cannot calculate λ_opt directly, we only need the rate ℓ_λopt such that max{λ_opt, 1} ∈ Θ(d^{ℓ_λopt}). Furthermore, it is not a priori clear that such an ℓ_λopt minimizes the rate exponent of the risk in Theorem 1. The current subsection establishes that this is indeed the case, and provides a way to determine ℓ_λopt. We introduce some shorthand notation for the rate exponents in Theorem 1:

η_v(ℓ_λ; ℓ, ℓ_σ, β) := (−ℓ_σ − ℓ_λ − β min{δ, 1−δ})/ℓ,
η_b(ℓ_λ; ℓ, L*, β) := −2 − (2/ℓ)(−ℓ_λ − 1 − β(L* − 1)),
η(ℓ_λ; ℓ, ℓ_σ, L*, β) := max{η_v(ℓ_λ; ℓ, ℓ_σ, β), η_b(ℓ_λ; ℓ, L*, β)}.

We highlight that η_v and η depend on ℓ_λ also through δ = (ℓ − ℓ_λ − 1)/β − ⌊(ℓ − ℓ_λ − 1)/β⌋. Hence, in the setting of Theorem 1, we have with high probability that

Variance(f̂_λ) ∈ Θ(n^{η_v(ℓ_λ; ℓ, ℓ_σ, β)}), Bias²(f̂_λ) ∈ Θ(n^{η_b(ℓ_λ; ℓ, L*, β)}), Risk²(f̂_λ) ∈ Θ(n^{η(ℓ_λ; ℓ, ℓ_σ, L*, β)}).

In the following, we view those quantities as functions of ℓ_λ, with all other parameters fixed. Next, we additionally define

λ_opt := argmin_{λ ≥ 0 : max{λ, 1} ∈ O(d^{l̄})} Risk(f̂_λ),
ℓ_λmin := argmin_{ℓ_λ ∈ [0, l̄]} η(ℓ_λ; ℓ, ℓ_σ, L*, β),
η_min := min_{ℓ_λ ∈ [0, l̄]} η(ℓ_λ; ℓ, ℓ_σ, L*, β),
l̄ := ℓ − 1 − β(L* − 1).

First, we remark that ℓ_λmin (the set of regularization rates that minimize the risk rate) might have cardinality larger than one. However, it cannot be empty: [0, l̄] is a closed set, and Lemma 21 below shows that η is a continuous function. Second, l̄ defines the scope of the minimization domain, guaranteeing that the constraint on L* in Theorem 1 holds for all candidate ℓ_λ.

Rate of optimal regularization ℓ_λopt vs. optimal rate ℓ_λmin: Let ℓ_λopt be the rate of the optimal regularization strength such that max{λ_opt, 1} ∈ Θ(d^{ℓ_λopt}). It is a priori not clear that ℓ_λopt minimizes η.
However, Lemma 20 bridges the two quantities, and guarantees with high probability that the rate of λ opt minimizes the rate of the risk. Lemma 20 (Optimal regularization and optimal rate). In the setting of Theorem 1, assume ℓ > 0, β ∈ (0, 1), ℓ σ ≥ -l, L * ∈ 1, ⌈ ℓ-1 β ⌉ ∩ N. Then, for d sufficiently large, with probability at least 1 -cd -β min{ δ,1-δ} there exists l ∈ ℓ λmin such that max{λ opt , 1} ∈ Θ d l . Hence, we only need to obtain a minimum rate l ∈ ℓ λmin instead of ℓ λopt . In order to propose a method for this, we first establish properties of η, η b , η v in the following lemma. Lemma 21 (Properties of η). Assume ℓ > 0, β ∈ (0, 1), ℓ σ ≥ -l, L * ∈ 1, ⌈ ℓ-1 β ⌉ ∩ N. 1. Over [0, l], η v (•; ℓ, ℓ σ , β) is continuous and non-increasing, and η b (•; ℓ, L * , β) is continuous and strictly increasing. 2. ℓ λmin := arg min ℓ λ ∈[0, l] η(ℓ λ ; ℓ, ℓ σ , L * , β) is a closed interval. 3. If there exists l ∈ 0, l with η v ( l; ℓ, ℓ σ , β) = η b ( l; ℓ, L * , β), then η min = η( l; ℓ, ℓ σ , L * , β). Otherwise, η min = η(0; ℓ, ℓ σ , L * , β). Finding an optimal rate: Lemma 21 suggests a simple strategy to find a l ∈ ℓ λmin numerically: search the intersection of η v and η b in [0, l]; if found, then the intersection point is optimal, otherwise l = 0 is optimal. Note that, if the intersection point exists, it is unique and easy to numerically approximate, since η v is non-increasing, and η b is strictly increasing. Calculating numerical solutions: However, Lemma 21 also shows that ℓ λmin is an interval and thus might contain multiple values. In that case, the proposed strategy might not necessarily retrieve the rate of λ opt , but a different l ∈ ℓ λmin . Yet, Theorem 1 guarantees that both the optimally regularized estimator and any estimator regularized with max {λ, 1} ∈ d l for any l ∈ ℓ λmin have a risk vanishing with the same rate n ηmin with high probability. 
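The search strategy suggested by Lemma 21 can be sketched as a short bisection. The exponent formulas below are transcribed from this appendix (with δ as the fractional part of (ℓ − ℓ_λ − 1)/β), and the concrete parameter values ℓ = 3, ℓ_σ = 0, L* = 2, β = 0.5 are illustrative assumptions:

```python
import math

def delta(l, ell, beta):
    t = (ell - l - 1) / beta
    return t - math.floor(t)  # fractional part of (ell - l - 1)/beta

def eta_v(l, ell, ell_sigma, beta):
    dlt = delta(l, ell, beta)
    return (-ell_sigma - l - beta * min(dlt, 1 - dlt)) / ell

def eta_b(l, ell, L_star, beta):
    return -2 - (2 / ell) * (-l - 1 - beta * (L_star - 1))

def optimal_rate(ell, ell_sigma, L_star, beta, tol=1e-10):
    l_bar = ell - 1 - beta * (L_star - 1)
    g = lambda l: eta_b(l, ell, L_star, beta) - eta_v(l, ell, ell_sigma, beta)
    if g(0.0) > 0:
        return 0.0          # no intersection in [0, l_bar]: l = 0 is optimal
    lo, hi = 0.0, l_bar     # g(l_bar) >= 0 since eta_b(l_bar) = 0 >= eta_v(l_bar)
    while hi - lo > tol:    # bisection: g is increasing, so the root is unique
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return (lo + hi) / 2

l_hat = optimal_rate(ell=3.0, ell_sigma=0.0, L_star=2, beta=0.5)
```

Since η_v is non-increasing and η_b strictly increasing, g = η_b − η_v is increasing, so the sign change bracketed by the bisection is the unique intersection point whenever it exists; for the parameters above it sits at l = 1.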
In particular, this allows us to exhibit the rate of the optimally regularized estimator in Figure 1a. Finally, because of the multiple descent phenomenon (see, for example, Figure 1), we do not expect either ℓ_λopt or β* to attain an easily readable closed-form expression. Nevertheless, simple optimization procedures allow us to calculate accurate numerical approximations. We conclude this subsection by proving Lemmas 20 and 21.

Proof of Lemma 20. Let ℓ_λopt be such that max{λ_opt, 1} ∈ Θ(d^ℓ_λopt). [...] which establishes Equation (39) for l ∈ ℓ_λmin and thereby concludes this case.

Case ℓ_λopt > max(ℓ_λmin): Let l := max(ℓ_λmin). Then, Lemma 21 yields

η(ℓ_λopt) = -2 - (2/ℓ)(-ℓ_λopt - 1 - β(L* - 1))
= -2 - (2/ℓ)(-ℓ_λopt - l + l - 1 - β(L* - 1))
= -2 - (2/ℓ)(-l - 1 - β(L* - 1)) - (2/ℓ)(l - ℓ_λopt)
(i) = η_min + (2/ℓ)(ℓ_λopt - l),

where (i) uses that l ∈ ℓ_λmin. Applying Equation (40), we get

ℓ_λopt - l = (ℓ/2)(η(ℓ_λopt) - η_min) ≤ ℓ log(c_2/c_1)/(2 log n).

Analogously to the previous case, this implies d^{ℓ_λopt - l} ∈ O(1), and the fact that ℓ_λopt > l implies d^{ℓ_λopt - l} ∈ Ω(1). Together, this yields d^{ℓ_λopt - l} ∈ Θ(1), which establishes Equation (39) for l ∈ ℓ_λmin and thereby concludes this case.

Proof of Lemma 21. Throughout this proof, we drop the dependencies on ℓ, ℓ_σ, L*, β in the notation of η, η_v, η_b and simply write η(l), η_v(l), η_b(l).

Continuity and monotonicity (Item 1)

We first show that η_v(l) and η_b(l) are continuous functions. η_b(l) is an affine function of l, hence continuous. η_v(l), however, additionally depends on l via δ, which is not linear. Hence, to show that η_v(l) is continuous, we need to show that min{δ, 1-δ} is continuous. Consider the triangle wave function ϖ(t) := min{t - ⌊t⌋, 1 - (t - ⌊t⌋)}, which is well-known to be continuous. Because min{δ, 1-δ} = ϖ((ℓ - l - 1)/β), and (ℓ - l - 1)/β is a linear function of l, we get that η_v(l) is also continuous.

For monotonicity, we consider the derivatives of η_v(l) and η_b(l). For η_b(l), we have ∂_l η_b(l) = 2/ℓ. Since η_b is an affine function, and 2/ℓ > 0, this also implies that η_b is strictly increasing. For η_v, we need to distinguish two cases:

∂_l η_v(l) = ∂_l [(-ℓ_σ - l - βδ)/ℓ] = 0 for δ < 1 - δ,
∂_l η_v(l) = ∂_l [(-ℓ_σ - l - β(1 - δ))/ℓ] = -2/ℓ for δ ≥ 1 - δ.

Since η_v is a continuous function with non-positive derivatives, this implies that η_v is non-increasing.

η decomposition (Item 3): First assume that there exists l̃ ∈ [0, l̄] with η_v(l̃) = η_b(l̃). Since η_v is non-increasing and η_b is strictly increasing,

η(l) := max{η_b(l), η_v(l)} = η_v(l) for l < l̃, and η_b(l) otherwise.

In particular, η(l) > η(l̃) for all l > l̃ as η_b is strictly increasing, and η(l) ≥ η(l̃) for all l < l̃ since η_v is non-increasing. Combined, this yields η_min = η(l̃; ℓ, ℓ_σ, L*, β).

Next, assume l̃ does not exist, that is, η_v(l) ≠ η_b(l) for all l ∈ [0, l̄]. Then, due to continuity, either η_b(l) > η_v(l) for all l ∈ [0, l̄], or η_v(l) > η_b(l) for all l ∈ [0, l̄]. However, a closer analysis shows that the latter is not possible: the rates at the boundary l̄ are

η_b(l̄) = -2 - (2/ℓ)(-(ℓ - 1 - β(L* - 1)) - 1 - β(L* - 1)) = 0,
η_v(l̄) = -(ℓ_σ + (ℓ - 1 - β(L* - 1)))/ℓ - (β/ℓ) min{δ, 1-δ} (i)≤ 0,

where (i) follows from the assumption ℓ_σ ≥ -l̄ and the fact that min{δ, 1-δ} ≥ 0. Hence, if l̃ does not exist, η_v(l) < η_b(l) for all l ∈ [0, l̄].
In particular, η_min is the minimum of the strictly increasing function η_b(l) over [0, l̄], and therefore attained only at 0. Lastly, combining both cases of l̃ yields the following convenient expression:

η(l) = η_v(l) if l̃ exists and l < l̃, and η_b(l) otherwise. (41)

Closed interval of solutions (Item 2): We again differentiate whether l̃ as in the previous step exists or not. If l̃ does not exist, then the previous step already yields ℓ_λmin = {0}, which is a closed interval. Next, assume l̃ exists. Then, all l ∈ ℓ_λmin satisfy l ≤ l̃, since l̃ ∈ ℓ_λmin and η_b is strictly increasing. Since further η_v is continuous and non-increasing over [0, l̃] ⊇ ℓ_λmin, ℓ_λmin is an interval. Finally, η(l) = max{η_b(l), η_v(l)} is the maximum of two continuous functions, hence itself also continuous. Therefore, the set ℓ_λmin of minimizers of η is closed. This concludes that ℓ_λmin is a closed interval.

Proof of Items 4 and 5: Item 4 follows straightforwardly from Equation (41): If l̃ does not exist, then η(l) = η_b(l) for all l ∈ [0, l̄]. Similarly, if l̃ exists, η(l) = η_b(l) for all l ≥ l̃, and in particular for every l with l ≥ l′ for all l′ ∈ ℓ_λmin, since l̃ ∈ ℓ_λmin.

Item 5 requires additional considerations. In the case where l̃ does not exist, we have ℓ_λmin = {0}, and the result follows directly. Otherwise, using Equation (41), we have η(l) = η_v(l) for any l ≤ l̃, as l ≤ min(ℓ_λmin) ≤ l̃. As shown in the proof of Item 1, η_v(l) alternates between derivatives 0 and -2/ℓ. We claim that there exists a left neighborhood of min(ℓ_λmin) where the derivative is -2/ℓ. Assume towards a contradiction that no such left neighborhood exists. Then, there must be a left neighborhood of min(ℓ_λmin) where the derivative is 0, since η_v(l) alternates between only two derivatives. However, in that left neighborhood, η_v(l) is constant with all values equal to η_min. Hence, there exists l < min(ℓ_λmin) with η(l) = η_v(l) = η_min and thus l ∈ ℓ_λmin, which is a contradiction.
Thus, there exists a left neighborhood of min(ℓ_λmin) with diameter ε > 0 throughout which the derivative is -2/ℓ. Then, for all l ∈ [min(ℓ_λmin) - ε, min(ℓ_λmin)],

η_v(l) = η_min - (2/ℓ)(l - min(ℓ_λmin)).

Finally, since η_v(l) is non-increasing, as long as η_v(l) ≤ η_min + (2/ℓ)ε we have l ≥ min(ℓ_λmin) - ε. Hence, choosing c = (2/ℓ)ε yields the statement of Item 5.

D.2 PROOF OF THEOREM 2

The informal Theorem 2 in the main text relies on a β*, defined as the intersection of the variance and bias rates from Theorem 1 for the interpolator f̂_0 (setting ℓ_λ = 0). Whenever β* is unique, the fact that Bias²(f̂_0) in Theorem 1 strictly increases as a function of β induces a phase transition: for β > β*, the bias dominates the rate of the risk in Theorem 1, while for β ≤ β*, the variance dominates. In particular, Lemmas 20 and 21 imply that interpolation is harmless whenever the bias dominates, and harmful if the variance dominates. Intuitively, Theorem 2 considers optimally regularized estimators, and varies the inductive bias strength via β. The formal Theorem 4 below presents a different perspective: it considers β fixed, [...]

Combining the upper and lower bounds on Tr H^{-2} yields

(n - rank(K_{-2}))/(c_2 + λ_opt)² ≤ Tr H^{-2} ≤ n/(c_1 + λ_opt)²,

((n - rank(K_{-2}))/n) · λ_opt²/(c_2 + λ_opt)² ≤ (λ_opt²/n) Tr H^{-2} ≤ λ_opt²/(c_1 + λ_opt)²,

-(rank(K_{-2})/n) · λ_opt²/(c_2 + λ_opt)² + λ_opt²/(c_2 + λ_opt)² - 1 ≤ (λ_opt²/n) Tr H^{-2} - 1 ≤ λ_opt²/(c_1 + λ_opt)² - 1,

-(rank(K_{-2})/n) · λ_opt²/(c_2 + λ_opt)² - (2c_2 λ_opt + c_2²)/(c_2 + λ_opt)² ≤ (λ_opt²/n) Tr H^{-2} - 1 ≤ -(2c_1 λ_opt + c_1²)/(c_1 + λ_opt)².

Taking absolute values, a simple case distinction, and c_1 < c_2 yield the following bound:

|(λ_opt²/n) Tr H^{-2} - 1| ≤ (2c_2 λ_opt + c_2²)/(c_2 + λ_opt)² + (rank(K_{-2})/n) · λ_opt²/(c_2 + λ_opt)² (i)≲ d^l/d^{2l} + (n q^{-δ}/n) · d^{2l}/d^{2l} = d^{-l} + q^{-δ} ∈ O(d^{-l}),

where (i) uses the rate of rank(K_{-2}) from Equation (44) and λ_opt ∈ Θ(d^l). Combining this bound on T_1 with the bound on T_2 in Equation (43) and collecting all error probabilities concludes the current case.

Harmless interpolation setting (Item 2): In this setting, the only minimizer of the risk rate is l = 0, and hence λ_opt ≤ max{λ_opt, 1} ∈ O(d^0) = O(1). As for the previous case, let m be the same as in the proof of Theorem 1 for ℓ_λ = 0.
With probability at least 1 - cd^{-β min{δ, 1-δ}}, this m again satisfies the conditions of Lemma 23, which yields

E_ϵ[(1/n) Σ_i (f̂_λopt(x_i) - y_i)²] ≤ (λ_opt² σ²/n) Tr H^{-2} [=: T_3] + 6λ_opt² (r_2²/r_1²) ‖D_{≤m}^{-1} a‖²/n² [=: T_2].

Furthermore, we apply the same steps with l = 0 as in the previous case (Equation (43)) to bound T_2: T_2 ∈ O(Risk²(f̂_λopt)). For T_3, we use the same bound on Tr H^{-2} as in Equation (45) with the same probability as follows:

T_3 = (λ_opt² σ²/n) Tr H^{-2} ≤ (λ_opt² σ²/n) · n/(c_1 + λ_opt)² = σ² λ_opt²/(c_1 + λ_opt)² (i)≤ σ² (c″)²/(c_1 + c″)²,

where (i) follows for d sufficiently large from λ_opt ∈ O(1) with c″ > 0. Since c_1 > 0, we have (c″)²/(c_1 + c″)² < 1. Hence, combining the bounds on T_2 and T_3, as well as collecting all probabilities, we get the desired result for this case.
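The λ_opt² Tr H^{-2} term above stems from the kernel ridge regression training-residual identity underlying Lemma 23: with fitted values KH^{-1}y and H = K + λI_n, the residual equals λH^{-1}y, so the mean squared training error is (λ²/n)‖H^{-1}y‖². Below is our own minimal numerical check of that algebra; the RBF kernel and the problem sizes are arbitrary illustrative choices.

```python
import numpy as np

def krr_training_error_identity(n=20, lam=0.3, seed=1):
    # Check: y - K H^{-1} y = (H - K) H^{-1} y = lam * H^{-1} y, hence the
    # mean squared training error equals lam^2 / n * ||H^{-1} y||^2.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, 5))
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists)                      # PSD RBF Gram matrix
    y = rng.standard_normal(n)
    H = K + lam * np.eye(n)
    preds = K @ np.linalg.solve(H, y)          # fitted values K H^{-1} y
    train_err = np.mean((preds - y) ** 2)
    identity = lam ** 2 / n * np.sum(np.linalg.solve(H, y) ** 2)
    return train_err, identity
```

The two quantities agree to numerical precision for any kernel matrix and regularization strength, which is exactly why the expected training error in Lemma 23 reduces to trace and norm terms in H^{-1}.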

D.3 TECHNICAL LEMMAS

Lemma 22 (Trace of the inverse). Let M, M_1, M_2 ∈ R^{n×n} be symmetric positive semi-definite matrices with M = M_1 + M_2. Furthermore, assume that µ_min(M_2) > 0 and rank(M_1) < n. Then,

Tr(M^{-2}) ≥ (n - rank(M_1))/‖M_2‖².

First, T_2 > 0 since H is positive semi-definite. Therefore, T_1 already yields the desired lower bound on the expected training error. For the upper bound, we bound T_2 as follows:

T_2 = (λ²/n) (D_{≤m}^{-1} a)^⊤ D_{≤m} Ψ_{≤m}^⊤ H^{-2} Ψ_{≤m} D_{≤m} (D_{≤m}^{-1} a)
(i) = (λ²/n) (D_{≤m}^{-1} a)^⊤ (D_{≤m}^{-1} + Ψ_{≤m}^⊤ H_{>m}^{-1} Ψ_{≤m})^{-1} Ψ_{≤m}^⊤ H_{>m}^{-2} Ψ_{≤m} (D_{≤m}^{-1} + Ψ_{≤m}^⊤ H_{>m}^{-1} Ψ_{≤m})^{-1} (D_{≤m}^{-1} a)
= (λ²/n) ‖H_{>m}^{-1} Ψ_{≤m} (D_{≤m}^{-1} + Ψ_{≤m}^⊤ H_{>m}^{-1} Ψ_{≤m})^{-1} D_{≤m}^{-1} a‖²
≤ λ² (‖Ψ_{≤m}^⊤ H_{>m}^{-2} Ψ_{≤m}‖/n) ‖(D_{≤m}^{-1} + Ψ_{≤m}^⊤ H_{>m}^{-1} Ψ_{≤m})^{-1} D_{≤m}^{-1} a‖²
(ii) ≤ (1.5λ²/(µ_min(K_{>m}) + λ)²) B_1
(iii) ≤ (6λ² r_2² max{λ, 1}²/(µ_min(K_{>m}) + λ)²) ‖D_{≤m}^{-1} a‖²/n² = 6λ² (r_2²/r_1²) ‖D_{≤m}^{-1} a‖²/n²,

where H_{>m} := K_{>m} + λI_n. Step (i) follows from Lemma 6, step (ii) uses Equation (5) and matches the term B_1 from the proof of Theorem 3, and step (iii) applies the bound on B_1 achieved in Theorem 3. This upper-bounds the expected training error and thereby concludes the proof.
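Lemma 22 can be sanity-checked numerically. The sketch below is our own construction (not from the paper): it draws a random rank-r PSD matrix M_1 and a strictly positive definite M_2, and verifies Tr(M^{-2}) ≥ (n - rank(M_1))/‖M_2‖².

```python
import numpy as np

def trace_inverse_bound(n=8, r=3, seed=0):
    # Build M = M1 + M2 with M1 PSD of rank r < n and M2 strictly positive
    # definite; dimensions and the construction are arbitrary illustrative choices.
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, r))
    M1 = A @ A.T                          # PSD, rank r (almost surely)
    B = rng.standard_normal((n, n))
    M2 = B @ B.T + 0.5 * np.eye(n)        # mu_min(M2) >= 0.5 > 0
    M_inv = np.linalg.inv(M1 + M2)
    lhs = np.trace(M_inv @ M_inv)         # Tr(M^{-2})
    rhs = (n - r) / np.linalg.norm(M2, 2) ** 2   # spectral norm ||M2||
    return lhs, rhs
```

Running this for several random seeds always returns a left-hand side at least as large as the right-hand side, matching the lemma.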

E EXPERIMENTAL DETAILS

This section describes our experimental setup and includes additional details. We provide the code to replicate all experiments and plots at https://github.com/michaelaerni/iclr23-InductiveBiasesHarmlessInterpolation.

E.1 SETUP FOR FILTER SIZE EXPERIMENTS

The following describes the main filter size experiments presented in Section 4.1.

Network architecture: We use a special CNN architecture that amplifies the role of filter size as an inductive bias. Each model of the main filter size experiments in Figure 2 has the following architecture: [...] epochs, and then reduce the learning rate according to an inverse square-root decay every 20 epochs. For a peak learning rate γ_0 and a decay rate L, the inverse square-root decay schedule at epoch t ≥ 0 is

γ_0 / √(1 + ⌊t/L⌋). (46)

Learning rate warm-up helps to capture the early-stopped test error more precisely. Whenever possible, we use deterministic training algorithms, so that our results are as reproducible as possible. We selected all hyperparameters to minimize the training loss of the strongest inductive bias (filter size 5) on noisy training data, with the constraint that all other settings still converge and interpolate. Note that we do not use data augmentation, dropout, or weight decay.

Evaluation: We observed that all models achieved their minimum test error either at the beginning or very end of training. Hence, our experiments evaluate the test error every 2 epochs during the first 150 epochs, and every 10 epochs afterwards to save computation time. We use an oracle, that is, the true test error, to determine the optimal early stopping epoch in retrospect. The optimal early stopping training error is always computed over the entire training set (including potential noise) for a fixed model, not averaged over mini-batches.

Noise model: In the noisy case, we select 20% of all training samples uniformly at random without replacement, and flip their labels. The noise is deterministic per dataset seed and does not change between different optimization runs. Note that we never apply noise to the test data.
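As a concrete illustration, the warm-up plus inverse square-root decay schedule can be sketched as follows. The peak learning rate 0.2 and the 20-epoch decay interval match this section; the warm-up length is a placeholder of our own, since the excerpt does not state it.

```python
def learning_rate(epoch, peak=0.2, warmup_epochs=10, decay_every=20):
    # Linear warm-up from zero to `peak`, then the inverse square-root decay
    # of Equation (46): peak / sqrt(1 + floor(t / decay_every)).
    # `warmup_epochs` is a placeholder value, not the paper's exact setting.
    if epoch < warmup_epochs:
        return peak * (epoch + 1) / warmup_epochs
    t = epoch - warmup_epochs
    return peak / (1 + t // decay_every) ** 0.5
```

With these placeholder values, the rate ramps to 0.2 over the first 10 epochs, stays there for one decay interval, and then drops to 0.2/√2, 0.2/√3, and so on every 20 epochs.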

E.2 SETUP FOR ROTATIONAL INVARIANCE EXPERIMENTS

The following describes the rotational invariance experiments presented in Section 4.2.

Dataset: We use the EuroSAT (Helber et al., 2018) training split and subsample it into 7680 raw training and 10k raw test samples in a stratified way. For a fixed number of rotations k, we generate a training dataset as follows:
1. In the noisy case, select a random 20% subset of all training samples without replacement; for each, change the label to one of the other 9 classes uniformly at random.
2. For the i-th training sample (i ∈ {1, . . . , 7680}):

To generate the actual test dataset, we apply a random rotation to each raw test sample independently, and crop the rotated images to the same size as the training samples. This procedure rotates every image exactly once, and uses random angle offsets to avoid distribution shift effects from image interpolation. Note that all random rotations are independent of the label noise and the number of training rotations. Hence, all experiments share the same test dataset. Furthermore, since we apply label noise before rotating images, all rotations of an image consistently share the same label.

Network architecture: All experiments use a Wide Residual Network (Zagoruyko & Komodakis, 2016) with 16 layers, widen factor 6, and default PyTorch weight initialization. We chose the width and depth such that all networks are sufficiently overparameterized while still being manageable in terms of computational cost.

Optimization procedure: We use the same training procedure for all settings in Figure 3. Optimization minimizes the softmax cross-entropy loss using mini-batch SGD with momentum 0.9 and batch size 128. Since the training set size grows in the number of rotations, all experiments fix the number of gradient updates to 144k. This corresponds to 200 epochs over a dataset with 12 rotations.
Similar to the filter size experiments, we linearly increase the learning rate from zero to a peak value of 0.15 during the first 4800 steps, and then reduce the learning rate according to an inverse square-root decay (Equation (46)) every 960 steps. Whenever possible, we use deterministic training algorithms, so that our results are as reproducible as possible. We selected all hyperparameters to minimize the training loss of the strongest inductive bias (12 rotations) on noisy training data, with the constraint that all other settings still converge and interpolate. As for all experiments in this paper, we do not use additional data augmentation, dropout, or weight decay.

Evaluation: Similar to the filter size experiments, we evaluate the test error more frequently during early training iterations: every 480 steps for the first 9600 steps, and every 1920 steps afterwards. The experiments again use the actual test error to determine the best step for early stopping, and calculate the corresponding training error over the entire training dataset, including all rotations and potential noise. Due to the larger training set size and increased computational cost, we only sample a single training and test dataset, and report the mean and standard error of all metrics over five training seeds.

E.3 DIFFERENCE TO DOUBLE DESCENT

As mentioned in Section 4.1, our empirical observations resemble the double descent phenomenon. This subsection expands on the discussion and provides additional details on how this paper's phenomenon differs from double descent. While all models in all experiments interpolate the training data, we observe that both noisy labels and stronger inductive biases increase the final training loss of an interpolating model: smaller filter sizes result in a decreasing number of model parameters, and enforcing invariance to more rotations requires a model to interpolate more (correlated) training samples. Thus, in both cases, increasing inductive bias strength decreases a model's overparameterization in relation to the number of training samples, shifting the setting closer to the interpolation threshold. We argue that our choice of architecture and hyperparameter tuning ensures that no model in any experiment is close to the corresponding interpolation threshold. If that is the case, then double descent predicts that increasing the number of model parameters has a negligible effect on whether regularization benefits generalization, and therefore does not explain our observations. In the following, we first describe how our hyperparameter and model selection procedure ensures that all models in all experiments are sufficiently overparameterized, so that double descent predicts negligible effects from increasing the number of parameters. Then, we provide additional experimental evidence that supports our argument: we repeat a subset of the experiments in Section 4 while upscaling the number of parameters in all models. For a fixed model scale and varying inductive bias, we observe that all phenomena in Section 4 persist. For a fixed inductive bias strength, we further see that the test error of interpolating models saturates at a value that matches our hypothesis.
In particular, for strong inductive biases, the gap in test error between interpolating models and their optimally early-stopped versions, i.e., harmful interpolation, persists.

Hyperparameter tuning: We mitigate differences in model complexity for different inductive bias strengths by tuning all hyperparameters on worst-case settings, that is, maximum inductive bias with noisy training samples. To avoid optimizing on test data, we tune on dataset seeds and network initializations that differ from the ones used in actual experiments. Figure 5 displays the final training loss for all empirical settings in this paper. While models with a stronger inductive bias exhibit larger training losses, all values are close to zero, and the numerical differences are small. Finally, we want to stress again that this discussion is only about the training loss; all models in all experiments have zero training error and perfectly fit the corresponding training data.

Increasing model complexity for varying filter size: As additional evidence, we repeat the main filter size experiments from Figure 2 in Section 4.1 using the same setup as before (see Appendix E.1), but increase the convolutional layer width to 256, 512, 1024, and 2048. For computational reasons, we evaluate a reduced number of filter sizes for widths 256 and 512, and only the smallest filter size 5 for widths 1024 and 2048. Since we found the original learning rate 0.2 to be too unstable for the larger model sizes, we use a decreased peak learning rate of 0.13 for widths 256 and 512, and 0.1 for widths 1024 and 2048.

Figures 6a and 6b show the test errors for 20% and 0% training noise, respectively. With noisy training data (Figure 6a), larger interpolating models yield a slightly smaller test error, but the overall trends remain: the gap in test error between converged and optimally early-stopped models increases with inductive bias strength, and the phase transition between harmless and harmful interpolation persists. In particular, Figure 6a shows strong evidence that the number of model parameters does not influence our phenomenon: for example, models with filter size 5 (strong inductive bias) and width 512 (red) have more parameters than models with filter size 27 (weak inductive bias) and width 128 (blue). Nevertheless, models with filter size 5 benefit significantly from early stopping, while interpolation for models with filter size 27 is harmless. In the noiseless case (Figure 6b), increasing model complexity neither harms nor improves generalization, and all models achieve their optimal performance after interpolating the entire training dataset. Similarly, Figure 6c reveals that the fraction of training noise that optimally early-stopped models fit stays the same for larger models. Finally, for a fixed inductive bias strength, the test errors saturate as model size increases, making a different trend for models with more than 2048 filters unlikely. To increase legibility, we present the numerical results for the largest two filter sizes in Table 1.

Increasing model capacity for varying rotational invariance: For completeness, we also repeat the rotation invariance experiments from Figure 3 in Section 4.2 with twice as wide Wide Residual Networks on a reduced number of rotations. More precisely, we increase the network widen factor from 6 to 12, and otherwise use the same setting as the main experiments (see Appendix E.2). Note that this corresponds to a parameter increase from around 6 million to around 24 million parameters.
The results in Figure 7 provide additional evidence that our phenomenon is distinct from double descent: both the test error (Figures 7a and 7b ) and fraction of fitted noise under optimal early stopping (Figure 7c ) exhibit the same trend, despite the significant difference in number of parameters.



THEORETICAL RESULTS

For convolutional kernels, a small filter size induces a strong bias towards estimators that depend nonlinearly on the input features only via small patches. This section analyzes the effect of filter size (as an example inductive bias) on the degree of harmless interpolation for kernel ridge regression. Note that previous works show how early-stopped gradient methods on the square loss behave statistically similarly to kernel ridge regression (Raskutti et al., 2014; Wei et al., 2017).

Footnotes: We hide positive constants that depend at most on ℓ and β (defined in Theorem 1) using the standard Bachmann-Landau notation O(•), Ω(•), Θ(•), as well as ≲, ≳, and use c, c_1, . . . as generic positive constants. See Theorem 4 for a more general statement that does not rely on a unique β*. Data augmentation techniques can efficiently enforce rotational invariance; see, e.g., Yang et al. (2019). We omit an explicit definition of c here for brevity and refer to the proof instead.



Figure 1: Illustration of the rates in Theorem 1 for high-dimensional kernel ridge regression as a function of β, the rate of the filter size q ∈ Θ(d^β). (a) Rate exponent α of the Risk ∈ Θ(n^α) for the interpolator f̂_0 vs. the optimally ridge-regularized estimator f̂_λopt. (b) Rate exponent of the variance and bias for the interpolator f̂_0. For both illustrations, we choose f̂_0 with ℓ = 2, ℓ_σ = 0.6, and the ground truth f^⋆(x) = x_1 x_2. Lastly, β* denotes the threshold where the bias and variance terms in Theorem 1 match, and where we observe a phase transition between harmless and harmful interpolation. See Appendix D.1 for technical details.

Comparison with McRae et al. (2022). The upper bound for the bias in Theorem 1 from McRae et al. (2022) depends on the suboptimal term ‖D^{-1/2}

where B(l, d) := |{S ⊆ {1, . . . , d} | |S| = l}| = C(d, l), the number of size-l subsets of {1, . . . , d}

L* - 1, concluding the proof.

C MATRIX CONCENTRATION

This section considers the random matrix theory part of our main result. First, Appendix C.1 focuses on the large eigenvalues of our kernel, and proves Lemma 1. Next, Appendices C.2 and C.3 focus on the tail of the eigenvalues, culminating in the proof of Lemma 3. Lastly, Appendices C.4 and C.5 establish some technical tools that we use throughout the proofs.

C.1 PROOF OF LEMMA 1

4. Every l ∈ [0, l̄] with l ≥ l′ for all l′ ∈ ℓ_λmin satisfies η(l; ℓ, ℓ_σ, L*, β) = η_b(l; ℓ, L*, β).
5. Let l ∈ [0, l̄] with l ≤ l′ for all l′ ∈ ℓ_λmin. If η(l; ℓ, ℓ_σ, L*, β) - η_min ≤ c, where c > 0 is a constant that depends only on ℓ, ℓ_σ, L*, β, then η(l; ℓ, ℓ_σ, L*, β) = η_min + (2/ℓ)(min(ℓ_λmin) - l).

Figure 4: Example synthetic images used in the filter size experiments.

(a) Determine a random offset angle α_i.
(b) Rotate the original image by each angle in {α_i + j · (360°/k) | j = 0, . . . , k - 1}.
(c) Crop each of the k rotated 64 × 64 images to 44 × 44 so that no network sees black borders from image interpolation.
3. Concatenate all k × 7680 samples into a single training dataset.
4. Shuffle this final dataset (at the beginning of training and every epoch).
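The dataset construction steps above can be sketched as follows. This is our own minimal illustration: the image rotation and cropping operations are abstracted into angle bookkeeping, and all names are ours, not the paper's code.

```python
import random

def make_rotation_dataset(images, labels, k, num_classes=10, noise_frac=0.2, seed=0):
    rng = random.Random(seed)
    labels = list(labels)
    # Noisy case: flip the labels of a random 20% subset before any rotation,
    # so that all k rotations of an image share the same (possibly noisy) label.
    for i in rng.sample(range(len(labels)), int(noise_frac * len(labels))):
        labels[i] = rng.choice([c for c in range(num_classes) if c != labels[i]])
    dataset = []
    for img, y in zip(images, labels):
        alpha = rng.uniform(0.0, 360.0)        # (a) random offset angle
        for j in range(k):                     # (b) k evenly spaced rotations
            dataset.append((img, alpha + j * 360.0 / k, y))
    rng.shuffle(dataset)                       # (4) shuffle the final dataset
    return dataset
```

Because noise is applied before replication, the resulting dataset has k entries per raw image with a consistent label, exactly the consistency property the text emphasizes.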

Figure 5: Training losses of the models in this paper's experiments as a function of (a) filter size and (b) the number of training set rotations. Models with a stronger inductive bias generally exhibit larger losses. However, in all instances, the numerical difference is small. Lines show the mean loss over 5 training set samples in (a) and 5 different optimization runs in (b), shaded areas the corresponding standard error.

Figure 6: An increase in convolutional layer width by factors 2 to 16 does not significantly alter the behavior of (a) the test error when training on 20% label noise, (b) the test error when training on 0% label noise, and (c) the training error when using optimal early stopping on 20% label noise. Despite significantly larger model size, the phase transition between harmless and harmful interpolation persists. Lines show the mean over five random datasets, shaded areas the standard error.

Convolutional neural network experiments with varying filter size on synthetic image data. (a) For noisy data, small filter sizes (strong inductive bias) induce a gap between the generalization performance of interpolating (blue) vs. optimally early-stopped models (yellow). The gap vanishes as the inductive bias decreases

and D_{≤m} := diag(λ_1, . . . , λ_m). We further use the squared kernel S(x, x′) := E_{z∼ν}[K(x, z)K(z, x′)], its truncated versions S_{≤m} and S_{>m}, as well as the corresponding empirical kernel matrices S, S_{≤m}, S_{>m} ∈ R^{n×n}. Next, for a symmetric positive-definite matrix, we write µ_min(•) and µ_max(•) (or ‖•‖) to indicate the minimum and maximum eigenvalue, respectively, and µ_i(•) for the i-th eigenvalue in decreasing order. Finally, we use ⟨•, •⟩ for the Euclidean inner product in R^d.

1 + 2/d, . . . , -2/d + 1, 1}. Since that set has cardinality d + 1, we can write κ as a linear combination of d + 1 uncorrelated functions. In particular, {G

To mitigate randomness in both the training data and optimization procedure, we average over multiple dataset and training seeds. More precisely, we sample 5 different pairs of training and test datasets. For each dataset, we fit 15 randomly initialized models per filter size on the same dataset, and calculate average metrics. The plots then display the mean and standard error over the 5 datasets.

Dataset: All filter size experiments use synthetic images. For a fixed seed, the experiments generate 200 training and 100k test images, both with an equal number of samples from the positive and negative classes. Given a class, the sampling procedure iteratively scatters 10 shapes on a black 32 × 32 image. A single shape is either a circle (negative class) or a cross (positive class), has a uniformly random size in [3, 5], and a uniformly random center such that all shapes end up completely inside the target image. We use a conceptual line width of 0.5 pixels, but discretize the shapes into a pixel grid. See Figure 4 for examples. A single dataset seed fully determines the training data, test data, and all scattered shapes.
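A minimal sketch of such a generator is shown below. It is our own approximation: the 0.5-pixel line width is implemented as simple distance thresholds, and the cross/circle rasterization is deliberately simplified relative to whatever the paper's code does.

```python
import numpy as np

def make_image(label, rng, size=32, n_shapes=10):
    # Scatter 10 shapes on a black 32x32 image: crosses for the positive
    # class (label 1), circles for the negative class (label 0).
    img = np.zeros((size, size), dtype=np.float32)
    yy, xx = np.mgrid[0:size, 0:size]
    for _ in range(n_shapes):
        r = rng.uniform(3, 5) / 2.0                   # radius from size in [3, 5]
        cx, cy = rng.uniform(r, size - r, 2)          # shape fully inside the image
        if label == 1:                                # cross: two diagonal bars
            on = (np.abs((xx - cx) - (yy - cy)) < 0.5) | (np.abs((xx - cx) + (yy - cy)) < 0.5)
            on &= (np.abs(xx - cx) <= r) & (np.abs(yy - cy) <= r)
        else:                                         # circle: thin ring of radius r
            on = np.abs(np.hypot(xx - cx, yy - cy) - r) < 0.5
        img[on] = 1.0
    return img
```

Seeding the generator once per dataset reproduces the property described above: a single seed fully determines every scattered shape.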

Test errors for filter size 5 (strongest inductive bias) and very large width under 20% training noise.


ACKNOWLEDGMENTS K.D. is supported by the ETH AI Center and the ETH Foundations of Data Science. We would further like to thank Afonso Bandeira for insightful discussions.


Finally, combining this result with Equations (36) and (37), we have

Σ_{i=m+1}^{n} µ_i(S_{>m}) ≥ Tr(S_3) - m‖S_3‖ ∈ Ω(Tr(S_3)) ⊆ Ω(d^ℓ λ q^{-(1-δ)})

with probability at least 1 - c_1 q^{-δ} ≥ 1 - cq^{-min{δ, 1-δ}}.

C.4 TECHNICAL LEMMAS

We use the following technical lemmas in the proof of Lemma 3. All results assume the setting of Appendix C.2, particularly a kernel as in Theorem 1 that satisfies Assumption 1.

Lemma 12 (Bound on K_1). In the setting of Appendix C.2, for d/2 > q ≥ L + 1, we have [...] with probability at least 1 - cq^{-δ}, uniformly over all choices of m and λ.

Proof. The proof follows a very similar argument to Lemma 1 with minor modifications. First, define [...]

Combining the bounds on bias and variance from Theorem 1, we have

Risk²(f̂_λ) = Bias²(f̂_λ) + Variance(f̂_λ) ∈ Θ(n^η(ℓ_λ; ℓ, ℓ_σ, L*, β)) with probability ≥ 1 - cd^{-min{δ, 1-δ}} (38)

uniformly for all ℓ_λ ∈ [0, l̄] with max{λ, 1} ∈ Θ(d^ℓ_λ). Throughout the remainder of this proof, we drop the dependencies on ℓ, ℓ_σ, L*, β in the notation of η and simply write η(ℓ_λ). We further omit repeating that each step is true with probability at least 1 - cd^{-min{δ, 1-δ}}, but imply it throughout.

The goal of the proof is to show that there exists l ∈ ℓ_λmin sufficiently close to ℓ_λopt; formally, we need to show that [...] If this is the case, then the definition of ℓ_λopt yields the conclusion as follows: [...] where (i) uses Equation (39).

Towards an auxiliary result, we first apply Equation (38) to ℓ_λopt and l ∈ ℓ_λmin with max{λ, 1} ∈ Θ(d^l): [...] where c_1, c_2 > 0 are the constants hidden by the Θ-notation in Equation (38). Next, the optimality of λ_opt yields c_1 < c_2, and [...] This implies [...] where the second implication uses that η(ℓ_λopt) - η_min ≥ 0, since η_min is the minimum rate.

With this result, we finally focus on establishing Equation (39), which yields the claim of this lemma. Lemma 21 shows that ℓ_λmin is an interval; hence, we distinguish three cases.

Case ℓ_λopt < min(ℓ_λmin): Let l := min(ℓ_λmin). Equation (40) yields η(ℓ_λopt) - η_min ≤ c for any c > 0 if d is large enough. Hence, for d fixed but sufficiently large, Lemma 21 yields [...] Applying Equation (40), we get [...] where (i) follows from n ∈ Θ(d^ℓ).
Since both d^{l-ℓ_λopt} and [...]

[...] and instead differentiates whether the optimal risk rate in Theorem 1 results from ℓ_λ > 0 (harmful interpolation) or ℓ_λ = 0 (harmless interpolation). Whenever β* is well-defined and unique, as for example in the setting of Figure 1, the two perspectives coincide. However, one can construct pathological edge cases where the variance and bias rates intersect on an interval of β values, or where the dominating quantity of the risk rate in Theorem 1 as a function of β alternates between the variance and the bias. While Theorem 2 fails to capture such settings, Theorem 4 still applies. We hence present Theorem 4 as a more general result.

Theorem 4 (Formal version of Theorem 2). In the setting of Theorem 1, using the notation from Appendix D.1, let ℓ > 0, β ∈ (0, 1), ℓ_σ ≥ -l̄, and L* ∈ [1, ⌈(ℓ-1)/β⌉] ∩ N. Let further λ_opt = arg min_{λ ≥ 0 : max{λ, 1} ∈ O(d^l̄)} Risk(f̂_λ). Then, the expected training error behaves as follows: [...]

Intuitively, the two cases in Theorem 4 correspond to harmful and harmless interpolation, where the optimal rate in Theorem 1 is attained for ℓ_λ > 0 and ℓ_λ = 0, respectively. Then, Lemma 20 yields with high probability that also ℓ_λopt > 0 and ℓ_λopt = 0 in the first and second case, respectively. Finally, we remark that Theorem 4 lacks an edge case: if both some ℓ_λ > 0 and ℓ_λ = 0 minimize the risk rate in Theorem 1 simultaneously, Lemma 20 fails to differentiate whether ℓ_λopt is zero or positive. However, that edge case corresponds to either interpolation or very weak regularization. Hence, we conjecture the corresponding model's training error to behave similarly to the second case in Theorem 4.

Proof of Theorem 4. First, Lemma 20 yields with probability at least 1 - cd^{-β min{δ, 1-δ}} that max{λ_opt, 1} ∈ Θ(d^l), where l is a minimizer of η(l; ℓ, ℓ_σ, L*, β).
The condition in Item 1 guarantees that all optimal l are positive, while the condition in Item 2 ensures that the optimal l = 0.

Harmful interpolation setting (Item 1): In this case, l > 0, and thus λ_opt ∈ Θ(d^l) for d sufficiently large. We start by applying Theorem 1 with ℓ_λ = l in this setting. Within the proof of Theorem 1 in Section 5.2, we pick an m ∈ N such that for d sufficiently large, with probability at least 1 - cd^{-β min{δ, 1-δ}}, m satisfies the conditions of Lemmas 1 to 4 and Theorem 3. For the remainder of this proof, let m be the same as in the proof of Theorem 1 for ℓ_λ = l. This m satisfies the conditions of Lemma 23, which hence yields [...] where H := K + λI_n. Then, using that a ≤ b ≤ a + d implies |b - c| ≤ |a - c| + d, we further get [...]

We first bound T_2 as follows: [...] where (i) uses n ∈ Θ(d^ℓ) and λ_opt ∈ Θ(d^l), (ii) applies Lemma 2 for the rate of ‖D_{≤m}^{-1} a‖ and Lemma 3 for r_2², r_1² ∈ Θ(1), and (iii) matches the expression to the rate of the bias in Theorem 1.

For T_1, we will bound Tr H^{-2} from above and below using Lemma 22. We hence introduce the following notation: [...] so that H = H_2 + K_{-2}, and where K_1 and K_2 are defined in Appendix C.2. Furthermore, the rank of [...] where (i) uses the definition of C(l, q, d) in Equation (17), (ii) applies Lemma 9 with d sufficiently large, and (iii) uses the classical bound C(q-1, l-1) ≤ (e(q-1)/(l-1))^{l-1} as well as the fact that the largest monomial dominates the sum. Finally, since n ∈ Θ(d · q^{L+δ}), we have [...]

Therefore, for d, and hence n ∈ Θ(d^ℓ), sufficiently large, rank(K_{-2}) < n, and Lemma 13 yields [...] We can thus instantiate Lemma 22 with M = H, M_1 = K_{-2}, M_2 = H_2. This implies that: [...] The upper bound on Tr H^{-2} simply follows from [...] where (i) uses the previous lower bound on µ_min(K_2) from Lemma 13.

Proof. First, we apply the identity [...] which holds since M_2, and thus M, are full rank. Next, A is a product of matrices including M_1; hence, the rank of A is bounded by rank(M_1) < n. Let now {v_1, . . .
, v_{rank(A)}} be an orthonormal basis of col(A), and let {v_{rank(A)+1}, . . . , v_n} be an orthonormal basis of col(A)^⊥. Thus, {v_1, . . . , v_n} is an orthonormal basis of R^n, and similarity invariance of the trace yields [...] where (i) uses that, for all i > rank(A), v_i is orthogonal to the column space of A, and A is symmetric. Furthermore, (ii) uses that all v_i have norm 1, and (iii) that rank(A) ≤ rank(M_1).

Lemma 23 (Fixed-design training error). In the setting of Theorem 3, let m ∈ N be such that r_1 > 0, Equation (5) holds, and the ground truth satisfies [...] where H := K + λI_n.

Proof. The kernel ridge regression estimator is f̂_λ(x*) = k(x*)^⊤ H^{-1} y, where y := [y_1, . . . , y_n]^⊤ with y_i = f^⋆(x_i) + ϵ_i, and k(x*) := [K(x_1, x*), . . . , K(x_n, x*)]^⊤. Thus, the estimator evaluated at x_1, . . . , x_n is KH^{-1}y, which yields the following training error: [...] where (i) follows from [...]

Next, by the assumptions on the ground truth, we can write y = ϵ + Ψ_{≤m} a. Thus, the expected training error with respect to the noise is [...]

Figure 7: Doubling Wide Residual Network width does not significantly alter the behavior of (a) the test error when training on 20% label noise, (b) the test error when training on 0% label noise, and (c) the training error when using optimal early stopping on 20% label noise. Despite significantly larger model size, the phase transition between harmless and harmful interpolation persists. Lines show the mean over five random network initializations, shaded areas the standard error.

