ON THE SATURATION EFFECT OF KERNEL RIDGE REGRESSION

Abstract

The saturation effect refers to the phenomenon that kernel ridge regression (KRR) fails to achieve the information-theoretic lower bound when the smoothness of the underlying true function exceeds a certain level. The saturation effect has been widely observed in practice, and a saturation lower bound for KRR has been conjectured for decades. In this paper, we provide a proof of this long-standing conjecture.

1. INTRODUCTION

Suppose that we have observed $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{n}$ from an unknown distribution $\rho$ supported on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}$. One of the central problems in statistical learning theory is to find a function $\hat f$ based on these observations such that the generalization error

$$\mathbb{E}_{(x,y)\sim\rho}\left[(\hat f(x) - y)^2\right] \tag{1}$$

is small. It is well known that the conditional mean $f^*_\rho(x) := \mathbb{E}_\rho[\,y \mid x\,] = \int_{\mathcal{Y}} y \, d\rho(y|x)$ minimizes the square loss $\mathcal{E}(f) = \mathbb{E}_\rho (f(x) - y)^2$, where $\rho(y|x)$ is the distribution of $y$ conditional on $x$. Thus, this question is equivalent to looking for an $\hat f$ such that the generalization error

$$\mathbb{E}_{x\sim\mu}\left[(\hat f(x) - f^*_\rho(x))^2\right] \tag{2}$$

is small, where $\mu$ is the marginal distribution of $\rho$ on $\mathcal{X}$. In other words, $\hat f$ can be viewed as an estimator of $f^*_\rho$.

When no explicit parametric assumption is made on the distribution $\rho$ or the function $f^*_\rho$, researchers often assume that $f^*_\rho$ falls into a certain class of functions and have developed many non-parametric methods to estimate $f^*_\rho$ (e.g., Györfi (2002); Tsybakov (2009)). The kernel method, one of the most widely applied non-parametric regression methods (e.g., Kohler & Krzyzak (2001); Cucker & Smale (2001); Caponnetto & De Vito (2007); Steinwart et al. (2009); Fischer & Steinwart (2020)), assumes that $f^*_\rho$ belongs to a certain reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, a separable Hilbert space associated to a kernel function $k$ defined on $\mathcal{X}$. Kernel ridge regression (KRR), also known as Tikhonov regularization or regularized least squares, estimates $f^*_\rho$ by solving the penalized least squares problem

$$\hat f^{\mathrm{KRR}}_\lambda = \arg\min_{f \in \mathcal{H}} \left( \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2 \right),$$

where $\lambda > 0$ is the so-called regularization parameter. By the representer theorem (see e.g., Andreas Christmann (2008)), this estimator has an explicit formula (please see (8) for the exact meaning of the notation):

$$\hat f^{\mathrm{KRR}}_\lambda(x) = K(x, X) \left( K(X, X) + n\lambda I \right)^{-1} y.$$
Theories have been developed for KRR from many aspects over the last decades, especially for the convergence rate of the generalization error. For example, if $f^*_\rho \in \mathcal{H}$ without any further smoothness assumption, Caponnetto & De Vito (2007) and Steinwart et al. (2009) showed that the generalization error of KRR achieves the information-theoretic lower bound $n^{-\frac{1}{1+\beta}}$, where $\beta$ is a characterizing quantity of the RKHS $\mathcal{H}$ (see e.g., the eigenvalue decay rate defined in Condition (A)). Further studies reveal that when more regularity (or smoothness) of $f^*_\rho$ is assumed, KRR fails to achieve the information-theoretic lower bound of the generalization error. More precisely, when $f^*_\rho$ is assumed to belong to some interpolation space $[\mathcal{H}]^\alpha$ of the RKHS $\mathcal{H}$ with $\alpha > 2$, the information-theoretic lower bound of the generalization error is $n^{-\frac{\alpha}{\alpha+\beta}}$ (Rastogi & Sampath, 2017), while the best upper bound of the generalization error of KRR is $n^{-\frac{2}{2+\beta}}$ (Caponnetto & De Vito, 2007). This gap between the best existing KRR upper bounds and the information-theoretic lower bounds of the generalization error has been widely observed in practice (e.g., Bauer et al. (2007); Gerfo et al. (2008)). It has been conjectured for decades that, no matter how carefully one tunes KRR, the rate of the generalization error cannot be faster than $n^{-\frac{2}{2+\beta}}$ (Gerfo et al., 2008; Dicker et al., 2017). This phenomenon is often referred to as the saturation effect (Bauer et al., 2007), and we refer to the conjectured fastest generalization error rate $n^{-\frac{2}{2+\beta}}$ of KRR as the saturation lower bound. The main focus of this paper is to prove this long-standing conjecture.

1.1 RELATED WORK

KRR belongs to the spectral regularization algorithms, a large class of kernel regression algorithms including kernel gradient descent, spectral cut-off, etc.; see e.g., Rosasco et al. (2005); Bauer et al. (2007); Gerfo et al. (2008); Mendelson & Neeman (2010).
The spectral regularization algorithms were originally proposed to solve linear inverse problems (Engl et al., 1996), where the saturation effect was first observed and studied (Neubauer, 1997; Mathé, 2004; Herdman et al., 2010). Since the spectral algorithms were introduced into statistical learning theory, the saturation effect has also been observed in practice and reported in the literature (Bauer et al., 2007; Gerfo et al., 2008).

Research on spectral algorithms shows that their asymptotic performance is mainly determined by two ingredients (Bauer et al., 2007; Rastogi & Sampath, 2017; Blanchard & Mücke, 2018; Lin et al., 2018). One is the relative smoothness (regularity) of the regression function with respect to the kernel, which is also referred to as the source condition (see, e.g., Bauer et al. (2007, Section 2.3)). The other is the qualification of the spectral algorithm, a quantity describing the algorithm's fitting capability (see, e.g., Bauer et al. (2007, Definition 1)). It is widely believed that algorithms with low qualification cannot achieve the information-theoretic lower bound when the regularity of $f^*_\rho$ is high. This is the (conjectural) saturation effect for spectral regularization algorithms (Bauer et al., 2007; Lin & Cevher, 2020; Lian et al., 2021). To the best of our knowledge, most works aim to show that spectral regularization algorithms with high qualification can achieve better generalization error rates, while few works try to address this conjecture directly (Gerfo et al., 2008; Dicker et al., 2017). The main focus of this paper is to provide a rigorous proof of the saturation effect for KRR, chosen for its simplicity and popularity. The technical tools introduced here might also help to establish the saturation effect for other spectral algorithms.

Notation. Let us denote by $X = (x_1, \dots, x_n)$ the sample input matrix and by $y = (y_1, \dots, y_n)'$ the sample output vector.
We denote by $\mu$ the marginal distribution of $\rho$ on $\mathcal{X}$. Let $\epsilon_i := y_i - f^*_\rho(x_i)$ be the noise. We use $L^p(\mathcal{X}, d\mu)$ (sometimes abbreviated as $L^p$) to represent the Lebesgue $L^p$ spaces, where the corresponding norm is denoted by $\|\cdot\|_{L^p}$. Hence, we can express the generalization error as

$$\mathbb{E}_{x\sim\mu}\left[(\hat f(x) - f^*_\rho(x))^2\right] = \left\| \hat f - f^*_\rho \right\|_{L^2}^2.$$

We use the asymptotic notations $O(\cdot)$, $o(\cdot)$, $\Omega(\cdot)$ and $\omega(\cdot)$. We also write $a_n \asymp b_n$ iff $a_n = O(b_n)$ and $a_n = \Omega(b_n)$. We use the asymptotic notations in probability $O_{\mathbb{P}}(\cdot)$ and $\Omega_{\mathbb{P}}(\cdot)$ to state our results. Let $\{a_n\}_{n\ge1}$ be a sequence of positive numbers and $\{\xi_n\}_{n\ge1}$ a sequence of non-negative random variables. If for any $\delta > 0$ there exist $M_\delta > 0$ and $N_\delta > 0$ such that

$$\mathbb{P}\{\xi_n < M_\delta a_n\} \ge 1 - \delta, \quad \forall n \ge N_\delta,$$

we say that $\xi_n$ is bounded (above) by $a_n$ in probability and write $\xi_n = O_{\mathbb{P}}(a_n)$. The definition of $\Omega_{\mathbb{P}}(b_n)$ follows similarly.

2. BRIEF REVIEW OF THE SATURATION EFFECT

2.1 REGRESSION OVER REPRODUCING KERNEL HILBERT SPACE

Throughout the paper, we assume that $\mathcal{X} \subset \mathbb{R}^d$ is compact and $k$ is a continuous positive-definite kernel function defined on $\mathcal{X}$. Let $T : L^2(\mathcal{X}, d\mu) \to L^2(\mathcal{X}, d\mu)$ be the integral operator defined by

$$(Tf)(x) := \int_{\mathcal{X}} k(x, y) f(y) \, d\mu(y).$$

It is well known that $T$ is trace-class (thus compact), positive and self-adjoint (Steinwart & Scovel, 2012). The spectral theorem of compact self-adjoint operators together with Mercer's theorem (see e.g., Steinwart & Scovel (2012)) yields that

$$T = \sum_{i \in N} \lambda_i \langle \cdot, e_i \rangle_{L^2} e_i, \qquad k(x, y) = \sum_{i \in N} \lambda_i e_i(x) e_i(y),$$

where the $\lambda_i$'s are the positive eigenvalues of $T$ in descending order, the $e_i$'s are the corresponding eigenfunctions, and $N \subseteq \mathbb{N}$ is an at most countable index set. Let $\mathcal{H}$ be the separable RKHS associated to the kernel $k$ (see, e.g., Wainwright (2019, Chapter 12)). One may easily verify that $\{\lambda_i^{1/2} e_i\}_{i \in N}$ is an orthonormal basis of $\mathcal{H}$. Since we are interested in the infinite-dimensional case, we may assume that $N = \mathbb{N}$.

Recall that kernel ridge regression estimates the regression function $f^*_\rho$ through the following optimization problem:

$$\hat f^{\mathrm{KRR}}_\lambda = \arg\min_{f \in \mathcal{H}} \left( \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2 \right).$$

The representer theorem (see e.g., Andreas Christmann (2008)) implies that

$$\hat f^{\mathrm{KRR}}_\lambda(x) = K(x, X) \left( K(X, X) + n\lambda I \right)^{-1} y, \tag{8}$$

where $K(x, X) = (k(x, x_1), \dots, k(x, x_n))$ and $K(X, X) = \big(k(x_i, x_j)\big)_{n \times n}$.

The following conditions are commonly adopted when discussing the performance of $\hat f^{\mathrm{KRR}}_\lambda$.

(A) Eigenvalue decay rate: there exist absolute constants $c_1 > 0$, $c_2 > 0$ and $\beta \in (0, 1)$ such that the eigenvalues $\lambda_i$ of $T$, the integral operator associated to the kernel $k$, satisfy

$$c_1 i^{-1/\beta} \le \lambda_i \le c_2 i^{-1/\beta}, \quad \forall i = 1, 2, \dots \tag{9}$$

(B) Smoothness condition: the conditional mean $f^*_\rho = \mathbb{E}_\rho[\,y \mid x\,]$ satisfies $\|f^*_\rho\|_{\mathcal{H}} \le R$ for some $R > 0$.

(C) Moment condition of the noise:

$$\int_{\mathcal{Y}} \left| y - f^*_\rho(x) \right|^m d\rho(y|x) \le \frac{1}{2} m! \, \sigma^2 M^{m-2}, \quad \mu\text{-a.e. } x \in \mathcal{X}, \ \forall m = 2, 3, \dots,$$

where $\sigma, M > 0$ are some constants.
The quantity $\beta$ appearing in condition (A) is often referred to as the eigenvalue decay rate of an RKHS or of the corresponding kernel function $k$. It describes the size (capacity) of the RKHS and depends only on the kernel function $k$ (or equivalently, the RKHS $\mathcal{H}$). This polynomial decay rate condition is quite standard in the literature and is also known as the capacity condition or effective dimension condition (Caponnetto & De Vito, 2007; Steinwart et al., 2009; Blanchard & Mücke, 2018); it is closely related to the covering/entropy number conditions of the RKHS (see, e.g., Steinwart et al. (2009, Theorem 15)). Condition (B) requires that the conditional mean $f^*_\rho$ falls into the RKHS $\mathcal{H}$ with norm at most a given constant $R$. Condition (C) requires that the tail probability of the 'noise' decays fast, which is satisfied if the noise $\varepsilon = y - f^*_\rho(x)$ is bounded or sub-Gaussian.

Proposition 2.1 (Optimality of KRR). Suppose that $\mathcal{H}$ satisfies condition (A) and $\mathcal{P}$ consists of all the distributions satisfying conditions (B) and (C).

i) The minimax rate of estimating $f^*_\rho$ is $n^{-\frac{1}{1+\beta}}$, i.e., we have

$$\inf_{\hat f} \sup_{\rho \in \mathcal{P}} \mathbb{E}_\rho \left\| \hat f - f^*_\rho \right\|_{L^2}^2 = \Omega\left( n^{-\frac{1}{1+\beta}} \right),$$

where $\inf_{\hat f}$ is taken over all estimators and both the expectation $\mathbb{E}_\rho$ and the conditional mean $f^*_\rho$ depend on $\rho$.

ii) If we choose $\lambda \asymp n^{-\frac{1}{1+\beta}}$, then we have

$$\left\| \hat f^{\mathrm{KRR}}_\lambda - f^*_\rho \right\|_{L^2}^2 = O_{\mathbb{P}}\left( n^{-\frac{1}{1+\beta}} \right).$$

This proposition comes from a combination (with a slight modification) of the upper rates and lower rates given in Caponnetto & De Vito (2007). It says that the optimally tuned KRR can achieve the information-theoretic lower bound if $f^*_\rho$ falls into $\mathcal{H}$ and no further regularity condition is imposed.
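The closed-form estimator recalled in this subsection can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: the function names `krr_fit_predict` and `min_kernel`, the synthetic data, and the restriction to one-dimensional inputs are all assumptions made for the example.

```python
import numpy as np

def krr_fit_predict(kernel, X, y, lam, X_new):
    """Evaluate f(x) = K(x, X) (K(X, X) + n*lam*I)^{-1} y at the points X_new.

    Assumes 1-D inputs and a kernel that broadcasts over NumPy arrays.
    """
    n = len(X)
    K = kernel(X[:, None], X[None, :])                 # n x n Gram matrix K(X, X)
    coef = np.linalg.solve(K + n * lam * np.eye(n), y)  # (K(X, X) + n*lam*I)^{-1} y
    return kernel(X_new[:, None], X[None, :]) @ coef    # rows are K(x, X)

def min_kernel(x, y):
    # k(x, y) = min(x, y), the kernel used in the toy example of Section 4
    return np.minimum(x, y)

# Hypothetical synthetic data, chosen only to exercise the formula.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=50)
y = np.sin(1.5 * np.pi * X) + 0.1 * rng.standard_normal(50)
pred = krr_fit_predict(min_kernel, X, y, lam=1e-3, X_new=np.linspace(0.0, 1.0, 5))
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly; the regularization term $n\lambda I$ keeps the system well-posed.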

2.2. THE SATURATION EFFECT

When a further regularity assumption is made on the conditional mean $f^*_\rho$, there is a gap between the information-theoretic lower bound and the upper bound provided by KRR. This phenomenon is now referred to as the saturation effect of KRR. In order to describe the saturation effect explicitly, we need to introduce a family of interpolation spaces of the RKHS $\mathcal{H}$ (see, e.g., Fischer & Steinwart (2020)). For any $s \ge 0$, the operator $T^s : L^2(\mathcal{X}) \to L^2(\mathcal{X})$ is given by

$$T^s(f) = \sum_{i \in N} \lambda_i^s \langle f, e_i \rangle_{L^2} e_i,$$

and the interpolation space $[\mathcal{H}]^\alpha$ for $\alpha \ge 0$ is defined by

$$[\mathcal{H}]^\alpha := \operatorname{Ran} T^{\alpha/2} = \left\{ \sum_{i \in N} a_i \lambda_i^{\alpha/2} e_i \;\middle|\; (a_i)_{i \in N} \in \ell^2(N) \right\},$$

with the inner product defined by

$$\langle f, g \rangle_{[\mathcal{H}]^\alpha} = \left\langle T^{-\alpha/2} f, T^{-\alpha/2} g \right\rangle_{L^2}.$$

From the definition, it is easy to verify that $\{\lambda_i^{\alpha/2} e_i\}_{i \in N}$ is an orthonormal basis of $[\mathcal{H}]^\alpha$.

Example 2.1 (Sobolev RKHS; see Fischer & Steinwart (2020)). Let $\mathcal{X} = \Omega \subseteq \mathbb{R}^d$ be a bounded domain with smooth boundary. We consider the Sobolev space $\mathcal{H} = H^s(\Omega)$, which, roughly speaking, consists of functions with weak derivatives up to order $s$; see, e.g., Adams & Fournier (2003). It is known that if $s > \frac{d}{2}$, we have the Sobolev embedding $H^s(\Omega) \hookrightarrow C^r(\Omega)$, where $C^r(\Omega)$ is the Hölder space of continuously differentiable functions and $r = s - \frac{d}{2}$, and thus $H^s(\Omega)$ is an RKHS (Fischer & Steinwart, 2020). Moreover, by the method of real interpolation (Steinwart & Scovel, 2012, Theorem 4.6), we have $[\mathcal{H}]^\alpha \cong H^{\alpha s}(\Omega)$ for $\alpha > 0$.

This example shows that the larger $\alpha$ is, the "smoother" the functions in $[\mathcal{H}]^\alpha$ are. If one believes that $f^*_\rho$ possesses more regularity (i.e., $f^*_\rho \in [\mathcal{H}]^\alpha$), we may replace condition (B) by the following condition:

(B′) Smoothness condition: the conditional mean $f^*_\rho = \mathbb{E}_\rho[\,y \mid x\,]$ satisfies $\|f^*_\rho\|_{[\mathcal{H}]^\alpha} \le R$ for some $R > 0$.

In this condition, the parameter $\alpha$ describes the smoothness of the regression function with respect to the underlying kernel. The larger $\alpha$ is, the "smoother" the regression function is.
This assumption is also referred to as the source condition in the literature; see, e.g., Bauer et al. (2007); Rastogi & Sampath (2017). With this new regularity assumption, we have the following statement.

Proposition 2.2 (Saturation phenomenon of KRR). Suppose that $\mathcal{H}$ satisfies condition (A) and $\mathcal{P}$ consists of all the distributions satisfying conditions (B′) and (C).

i) The minimax rate of estimating $f^*_\rho$ is $n^{-\frac{\alpha}{\alpha+\beta}}$, i.e., we have

$$\inf_{\hat f} \sup_{\rho \in \mathcal{P}} \mathbb{E}_\rho \left\| \hat f - f^*_\rho \right\|_{L^2}^2 = \Omega\left( n^{-\frac{\alpha}{\alpha+\beta}} \right), \tag{15}$$

where $\inf_{\hat f}$ is taken over all estimators and both the expectation $\mathbb{E}_\rho$ and the regression function $f^*_\rho$ depend on $\rho$.

ii) Let $\tilde\alpha = \min(\alpha, 2)$. Then, by choosing $\lambda \asymp n^{-\frac{1}{\tilde\alpha+\beta}}$, we have

$$\left\| \hat f^{\mathrm{KRR}}_\lambda - f^*_\rho \right\|_{L^2}^2 = O_{\mathbb{P}}\left( n^{-\frac{\tilde\alpha}{\tilde\alpha+\beta}} \right). \tag{16}$$

This proposition comes from a combination (with a slight modification) of the lower rate derived in Rastogi & Sampath (2017, Corollary 3.3) and the upper rate given in Fischer & Steinwart (2020, Theorem 1 (ii)). It says that the optimally tuned KRR can achieve the information-theoretic lower bound if $f^*_\rho \in [\mathcal{H}]^\alpha$ with $\alpha \le 2$. However, when $\alpha > 2$, there is a gap between the information-theoretic lower bound and the upper bound provided by KRR.
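To make the gap in Proposition 2.2 concrete, the two rate exponents can be tabulated for a few values of $\alpha$. The snippet below is a small illustrative sketch; the helper names `minimax_exponent` and `krr_exponent` are hypothetical, and $\beta = 0.5$ is chosen to match the toy example of Section 4.

```python
def minimax_exponent(alpha, beta):
    # Exponent r in the information-theoretic lower bound n^{-r}, r = alpha / (alpha + beta).
    return alpha / (alpha + beta)

def krr_exponent(alpha, beta):
    # Exponent in the KRR upper bound: the smoothness is capped at min(alpha, 2).
    capped = min(alpha, 2.0)
    return capped / (capped + beta)

beta = 0.5
for alpha in (1.0, 2.0, 3.0, 4.0):
    print(alpha, minimax_exponent(alpha, beta), krr_exponent(alpha, beta))
```

For $\alpha \le 2$ the two exponents coincide, while for $\alpha > 2$ the KRR exponent stays at $2/(2+\beta)$ and the minimax exponent keeps growing towards 1.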

3. MAIN RESULTS

We introduce two additional assumptions in order to state our main result.

Assumption 1 (RKHS). We assume that $\mathcal{H}$ is an RKHS over a compact set $\mathcal{X} \subseteq \mathbb{R}^d$ associated with a Hölder-continuous kernel $k$, that is, there exist some $s \in (0, 1]$ and $L > 0$ such that

$$|k(x_1, x_2) - k(y_1, y_2)| \le L \, \|(x_1, x_2) - (y_1, y_2)\|_{\mathbb{R}^{d \times d}}^{s}, \quad \forall x_1, x_2, y_1, y_2 \in \mathcal{X}.$$

Assumption 2 (Noise). The conditional variance satisfies

$$\mathbb{E}_{(x,y)\sim\rho}\left[ (y - f^*_\rho(x))^2 \mid x \right] \ge \bar\sigma^2 > 0, \quad \mu\text{-a.e. } x \in \mathcal{X}.$$

The first assumption is a Hölder condition on the kernel, which is slightly stronger than assuming that $k$ is continuous. It is satisfied, for example, when $k$ is Lipschitz or $C^1$. Kernels that satisfy this assumption include the popular RBF kernel, the Laplace kernel, kernels of the form $(1 - \|x - y\|)_+^p$ (Wendland, 2004, Theorem 6.20), and kernels associated with the Sobolev RKHS introduced in Example 2.1. The second assumption requires that the variance of the noise $\varepsilon = y - f^*_\rho(x)$ is bounded from below. When $y = f^*_\rho(x) + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$ is an independent noise, the second assumption simply requires that $\sigma \ne 0$. In other words, Assumption 2 is a fairly weak assumption which just requires that the noise is non-vanishing almost everywhere. Now we are ready to state our main theorem.

Theorem 3.1 (Saturation effect). Suppose that $\mathcal{H}$ satisfies condition (A), the distribution $\rho$ satisfies $f^*_\rho \ne 0$ and $f^*_\rho \in [\mathcal{H}]^\alpha$ for some $\alpha \ge 2$, and Assumptions 1 and 2 hold. For any $\delta > 0$ and any choice of regularization parameter $\lambda(n)$ satisfying $\lambda(n) \to 0$, we have that, for sufficiently large $n$,

$$\mathbb{E}\left[ \left\| \hat f^{\mathrm{KRR}}_\lambda - f^*_\rho \right\|_{L^2}^2 \,\middle|\, X \right] \ge c \, n^{-\frac{2}{2+\beta}}$$

holds with probability at least $1 - \delta$ for some positive constant $c$. Consequently, we have

$$\mathbb{E}\left[ \left\| \hat f^{\mathrm{KRR}}_\lambda - f^*_\rho \right\|_{L^2}^2 \,\middle|\, X \right] = \Omega_{\mathbb{P}}\left( n^{-\frac{2}{2+\beta}} \right).$$

Remark 3.2. The saturation effect of KRR states that when the regression function is very smooth, i.e., $\alpha \ge 2$, no matter how the regularization parameter is tuned, the generalization error of KRR cannot decay faster than the rate $n^{-\frac{2}{2+\beta}}$.
The saturation lower bound also coincides with the upper bound (16) in Proposition 2.2. Therefore, Theorem 3.1 rigorously proves the saturation effect of KRR. Moreover, we would like to emphasize that the saturation lower bound is established for an arbitrary fixed non-zero $f^*_\rho \in [\mathcal{H}]^\alpha$, and it is essentially different from the information-theoretic lower bound, e.g., (15), in both the statement and the proof technique.

3.1. SKETCH OF THE PROOF

We present a sketch of our proof in this part and defer the complete proof to Section B. Let us introduce the sampling operator $K_x : \mathbb{R} \to \mathcal{H}$ defined by $K_x y = y \, k(x, \cdot)$ and its adjoint operator $K_x^* : \mathcal{H} \to \mathbb{R}$ given by $K_x^* f = f(x)$. We further introduce the sample covariance operator $T_X : \mathcal{H} \to \mathcal{H}$ by

$$T_X := \frac{1}{n} \sum_{i=1}^{n} K_{x_i} K_{x_i}^*, \tag{19}$$

and define the sample basis function

$$g_Z := \frac{1}{n} \sum_{i=1}^{n} K_{x_i} y_i \in \mathcal{H}. \tag{20}$$

The following explicit operator form of the solution of KRR is shown in Caponnetto & De Vito (2007):

$$\hat f^{\mathrm{KRR}}_\lambda = (T_X + \lambda)^{-1} g_Z. \tag{21}$$

The first step of our proof is the bias-variance decomposition, which differs from the commonly used approximation-estimation error decomposition in the literature (e.g., Caponnetto & De Vito (2007); Fischer & Steinwart (2020)). It can be shown that

$$\mathbb{E}\left[ \left\| \hat f^{\mathrm{KRR}}_\lambda - f^*_\rho \right\|_{L^2}^2 \,\middle|\, X \right] = \lambda^2 \left\| (T_X + \lambda)^{-1} f^*_\rho \right\|_{L^2}^2 + \frac{1}{n^2} \sum_{i=1}^{n} \sigma_{x_i}^2 \left\| (T_X + \lambda)^{-1} k(x_i, \cdot) \right\|_{L^2}^2 =: \mathrm{Bias}^2 + \mathrm{Var}. \tag{22}$$

Then, the desired lower bound can be derived by proving the following lower bounds for the two terms respectively:

$$\mathrm{Bias}^2 = \Omega_{\mathbb{P}}\left( \lambda^2 \right), \qquad \mathrm{Var} = \Omega_{\mathbb{P}}\left( \frac{\lambda^{-\beta}}{n} \right).$$

These two lower bounds are consistent with the usual bias-variance trade-off intuition: a smaller $\lambda$ (less regularization) leads to smaller bias but larger variance, while the variance decreases as $n$ increases. They also coincide with the main terms of the upper bounds in the literature; see, e.g., Caponnetto & De Vito (2007); Fischer & Steinwart (2020).

The bias term. First, we establish the approximation

$$\left\| (T_X + \lambda)^{-1} f^*_\rho \right\|_{L^2}^2 \approx \left\| (T + \lambda)^{-1} f^*_\rho \right\|_{L^2}^2,$$

where we refine the concentration result between $(T_X + \lambda)^{-1/2}$ and $(T + \lambda)^{-1/2}$ obtained in the previous literature. Second, we use the eigen-decomposition and the fact that KRR's qualification is limited to show that

$$\left\| \lambda (T + \lambda)^{-1} f^*_\rho \right\|_{L^2} \ge c \lambda$$

for some constant $c > 0$. Consequently, we have

$$\mathrm{Bias}^2 \approx \lambda^2 \left\| (T + \lambda)^{-1} f^*_\rho \right\|_{L^2}^2 \ge c \lambda^2.$$
This lower bound shows that the bias of KRR can only decrease linearly in $\lambda$ no matter how smooth the regression function is, which limits the performance of KRR.

The variance term. We first rewrite the variance term in matrix form and deduce that

$$\mathrm{Var} \ge \frac{\bar\sigma^2}{n} \int_{\mathcal{X}} \left\| (T_X + \lambda)^{-1} k(x, \cdot) \right\|_{L^2, n}^2 d\mu(x), \tag{23}$$

where $\|f\|_{L^2, n}^2 := \frac{1}{n} \sum_{i=1}^{n} f(x_i)^2$. This observation (23) is key and novel in the proof, allowing us to make the following two-step approximation:

$$\left\| (T_X + \lambda)^{-1} k(x, \cdot) \right\|_{L^2, n}^2 \approx \left\| (T + \lambda)^{-1} k(x, \cdot) \right\|_{L^2, n}^2 \approx \left\| (T + \lambda)^{-1} k(x, \cdot) \right\|_{L^2}^2.$$

The main difficulty here is to control the errors in the approximation so that they are infinitesimal compared to the main term, whereas in proofs of upper bounds it suffices for the errors to be of the same order as the main term. To resolve the difficulty, we refine the analysis by combining the integral operator technique (e.g., in Caponnetto & De Vito (2007)) with the empirical process technique (e.g., in Steinwart et al. (2009)), applying tight concentration inequalities and analyzing the covering number of the regularized basis function family $\{(T + \lambda)^{-1} k(x, \cdot)\}_{x \in \mathcal{X}}$. Finally, by Mercer's theorem and the eigenvalue decay rate, we obtain that

$$\int_{\mathcal{X}} \left\| (T + \lambda)^{-1} k(x, \cdot) \right\|_{L^2}^2 d\mu(x) = \int_{\mathcal{X}} \sum_{i=1}^{\infty} \left( \frac{\lambda_i}{\lambda + \lambda_i} \right)^2 e_i(x)^2 \, d\mu(x) = \sum_{i=1}^{\infty} \left( \frac{\lambda_i}{\lambda + \lambda_i} \right)^2 =: N_2(\lambda).$$

It can be shown that $N_2(\lambda) = \operatorname{Tr}\left[ \left( (T + \lambda)^{-1} T \right)^2 \right]$, which is a variant of the effective dimension $N(\lambda) = \operatorname{Tr}\left[ (T + \lambda)^{-1} T \right]$ introduced in the literature (Caponnetto & De Vito, 2007). Condition (A) implies that $N_2(\lambda) \ge c \lambda^{-\beta}$. As a result, we get

$$\mathrm{Var} \ge \frac{\bar\sigma^2}{n} \int_{\mathcal{X}} \left\| (T_X + \lambda)^{-1} k(x, \cdot) \right\|_{L^2, n}^2 d\mu(x) \approx \frac{\bar\sigma^2}{n} \int_{\mathcal{X}} \left\| (T + \lambda)^{-1} k(x, \cdot) \right\|_{L^2}^2 d\mu(x) = \frac{\bar\sigma^2}{n} N_2(\lambda) \ge c \frac{\lambda^{-\beta}}{n}.$$
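The scaling $N_2(\lambda) \asymp \lambda^{-\beta}$ used in the sketch above can be checked numerically. The following is a minimal sketch under stated assumptions: the eigenvalues are taken to be exactly $\lambda_i = i^{-1/\beta}$ (so $c_1 = c_2 = 1$ in condition (A)), the infinite sum is truncated at `n_terms` terms, and the helper name `n2` is hypothetical.

```python
import numpy as np

def n2(lam, beta, n_terms=1_000_000):
    """Truncated sum N_2(lam) = sum_i (lam_i / (lam + lam_i))^2 with lam_i = i^(-1/beta)."""
    i = np.arange(1, n_terms + 1, dtype=float)
    eig = i ** (-1.0 / beta)          # assumed exact polynomial eigenvalue decay
    return np.sum((eig / (lam + eig)) ** 2)

beta = 0.5
lams = np.logspace(-6, -3, 7)
vals = np.array([n2(lam, beta) for lam in lams])

# Slope of log N_2(lam) against log lam; it should be close to -beta.
slope = np.polyfit(np.log(lams), np.log(vals), 1)[0]
print(slope)
```

The fitted slope is only approximately $-\beta$ because of lower-order terms in the sum; the truncation error is negligible for the range of $\lambda$ used here.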

4. NUMERICAL EXPERIMENTS

The saturation effect in KRR has been reported in a number of works (e.g., Gerfo et al. (2008); Dicker et al. (2017)). In this section, we illustrate the saturation effect through a toy example.

Published as a conference paper at ICLR 2023

Suppose that $\mathcal{X} = [0, 1]$ and $\mu$ is the uniform distribution on $[0, 1]$. Let us consider the following first-order Sobolev space containing absolutely continuous functions:

$$H^1 := \left\{ f : [0, 1] \to \mathbb{R} \;\middle|\; f \text{ is A.C.}, \ f(0) = 0, \ \int_0^1 (f'(x))^2 \, dx < \infty \right\}.$$

It is well known that it is the RKHS associated to the kernel $k(x, y) = \min(x, y)$ (Wainwright, 2019). Let $T$ be the integral operator associated to the kernel function $k$. The eigenvalues and eigenfunctions of this operator are known explicitly:

$$\lambda_n = \left( \frac{2n - 1}{2} \pi \right)^{-2}, \qquad e_n(x) = \sqrt{2} \sin\left( \frac{2n - 1}{2} \pi x \right), \qquad n = 1, 2, \dots$$

It is clear that the eigenvalue decay rate $\beta$ of the kernel function $k$ (or the RKHS $H^1$) is $0.5$. To better illustrate the saturation effect, we choose the second eigenfunction $e_2 = \sqrt{2} \sin\left( \frac{3}{2} \pi x \right)$ to be the regression function in our experiment. The significance of this choice is that $e_2 \in [\mathcal{H}]^\alpha$ for any $\alpha > 0$. We further set the noise to be an independent Gaussian noise with standard deviation $\sigma = 0.2$. In other words, we consider the following data generation model:

$$y = f^*(x) + \sigma \varepsilon,$$

where $f^*(x) = e_2(x)$ and $\varepsilon \sim N(0, 1)$ is standard normal.

Since the gradient flow (GF) method with proper early stopping is a spectral algorithm proved to be rate-optimal for any $\alpha \ge 1$ and any $f^*_\rho \in [\mathcal{H}]^\alpha$ (Yao et al., 2007; Lin et al., 2018), we make a comparison between KRR and GF to show the saturation effect. More precisely, we report and compare the decay rates of the generalization errors produced by the KRR and GF methods with different selections of parameters.

For various $\alpha$'s, we choose the regularization parameter in KRR as $\lambda = c n^{-\frac{1}{\alpha+\beta}}$ for a fixed constant $c$, and set the stopping time in the gradient flow to $t = \lambda^{-1}$. It is shown in Lin et al. (2018) that this choice of stopping time $t$ (i.e., $t = \lambda^{-1}$) is optimal for the gradient descent algorithm under the assumption that $f^*_\rho \in [\mathcal{H}]^\alpha$. By choosing different $\alpha$'s, we also evaluate the performance of the algorithms with different selections of parameters. For the generalization error $\|\hat f - f^*_\rho\|_{L^2}^2$, we numerically compute the integral (the $L^2$-norm) by Simpson's formula with $N \gg n$ points. For each $n$, we perform 100 trials and show the average as well as the region within one standard deviation. Finally, we use the logarithmic least-squares fit $\log \mathrm{err} = r \log n + b$ to fit the error with respect to the sample size, and report the slope $r$ as the convergence rate.

The results are reported in Figure 1. We also change the regression function to other eigenfunctions and list the results in Table 4. First, the error curves show that the error indeed converges at the rate $n^{-r}$. Moreover, when we apply the GF method, the convergence rate increases as $\alpha$ increases, confirming that spectral algorithms with high qualification can adapt to the smoothness of the regression function. The convergence rates also match the theoretical value $\frac{\alpha}{\alpha+\beta}$. In contrast, when we apply the KRR method, the convergence rate achieves its best performance at $\alpha = 2$, and the rate decreases as $\alpha$ gets bigger, showing the saturation effect. The resulting best rate also coincides with our theoretical value $\frac{2}{2+\beta} = 0.8$. We also remark that, besides the best rate, the rates for other selections of the regularization parameter also correspond to theoretical lower bounds that can be obtained from the bias-variance decomposition (22). We conduct further experiments with different kernels and report them in Section E; these results also support our theory. In conclusion, our numerical results confirm the saturation effect and are fully explained by our theory.
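A single trial of the KRR part of this experiment can be sketched as follows. This is not the authors' experimental code: the function name `krr_error`, the constant `c = 0.005`, the sample sizes, the random seed, and the use of the trapezoidal rule (instead of Simpson's formula) are all assumptions made for illustration, and the gradient flow baseline is omitted.

```python
import numpy as np

def krr_error(n, lam, sigma=0.2, n_grid=2001, seed=0):
    """One trial: squared L2 error of KRR with k(x, y) = min(x, y) and f* = e_2."""
    rng = np.random.default_rng(seed)
    f_star = lambda x: np.sqrt(2.0) * np.sin(1.5 * np.pi * x)  # second eigenfunction e_2
    X = rng.uniform(0.0, 1.0, size=n)                          # mu is uniform on [0, 1]
    y = f_star(X) + sigma * rng.standard_normal(n)
    K = np.minimum(X[:, None], X[None, :])                     # Gram matrix K(X, X)
    coef = np.linalg.solve(K + n * lam * np.eye(n), y)
    grid = np.linspace(0.0, 1.0, n_grid)
    pred = np.minimum(grid[:, None], X[None, :]) @ coef        # KRR estimate on the grid
    sq = (pred - f_star(grid)) ** 2
    dx = grid[1] - grid[0]
    return dx * (sq.sum() - 0.5 * (sq[0] + sq[-1]))            # trapezoidal rule

beta, alpha, c = 0.5, 2.0, 0.005   # c is a hypothetical constant, not taken from the paper
errs = {n: krr_error(n, c * n ** (-1.0 / (alpha + beta))) for n in (100, 200, 400)}
print(errs)
```

To estimate a convergence rate as described above, one would average such errors over many trials for a range of $n$ and fit $\log \mathrm{err} = r \log n + b$ by least squares.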

5. CONCLUSION

The saturation effect refers to the phenomenon that kernel ridge regression fails to achieve the information-theoretic lower bound when the regression function is too smooth. When the regression function is sufficiently smooth, a saturation lower bound of KRR has been conjectured for decades. In this paper, we provide a rigorous proof of the saturation effect of KRR, i.e., we show that, if $f^*_\rho \in [\mathcal{H}]^\alpha$ for some $\alpha > 2$, the rate of the generalization error of KRR cannot be better than $n^{-\frac{2}{2+\beta}}$, no matter how one tunes the KRR. Our results suggest that the KRR method of regularization may be inferior to some other spectral regularization algorithms, including spectral cut-off and kernel gradient descent, which never saturate and are capable of achieving optimal rates (Bauer et al., 2007; Lin et al., 2018). The technical tools developed here may also help to establish saturation lower bounds for other spectral regularization algorithms.

[Figure caption fragment: The dashed black lines are computed using logarithmic least-squares and the slopes are reported as convergence rates.]

[Table header fragment: $f^* = e_1$, $f^* = e_2$, $f^* = e_3$, $f^* = e_4$; $\alpha$; KRR.]

A BASIC FACTS IN RKHS

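Let $k$ be a continuous positive definite kernel function defined on a compact set $\mathcal{X}$ and $\mathcal{H}$ be the RKHS associated to the kernel $k$. Since $k$ is a continuous function and $\mathcal{X}$ is a compact set, there exists a constant $\kappa$ such that

$$|k(x, y)| \le \kappa^2 \quad \text{for any } x, y \in \mathcal{X}.$$

It is well known that $k(x, \cdot) \in \mathcal{H}$ as a function and that the inner product satisfies

$$\langle k(x, \cdot), f \rangle_{\mathcal{H}} = f(x), \quad \forall f \in \mathcal{H}.$$

In particular, we have $\langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal{H}} = k(x, y)$.

Lemma A.1. Let $\|f\|_\infty := \sup_{x \in \mathcal{X}} |f(x)|$ be the supremum norm of a function $f : \mathcal{X} \to \mathbb{R}$. Then for any $f \in \mathcal{H}$, we have

$$\|f\|_\infty \le \kappa \|f\|_{\mathcal{H}}.$$

Proof. Since $\|k(x, \cdot)\|_{\mathcal{H}}^2 = k(x, x) \le \kappa^2$, for $f \in \mathcal{H}$ we have

$$\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)| = \sup_{x \in \mathcal{X}} |\langle k(x, \cdot), f \rangle_{\mathcal{H}}| \le \sup_{x \in \mathcal{X}} \|k(x, \cdot)\|_{\mathcal{H}} \|f\|_{\mathcal{H}} \le \kappa \|f\|_{\mathcal{H}}.$$

Definition A.2. Let $K \subseteq \mathbb{R}^d$ be a compact set and $\alpha \in [0, 1]$. For a function $f : K \to \mathbb{R}$, we introduce the Hölder semi-norm

$$[f]_{\alpha, K} := \sup_{x, y \in K, \ x \ne y} \frac{|f(x) - f(y)|}{|x - y|^\alpha},$$

where $|\cdot|$ represents the usual Euclidean norm. Then, we define the Hölder space

$$C^\alpha(K) := \left\{ f : K \to \mathbb{R} \;\middle|\; [f]_{\alpha, K} < \infty \right\},$$

which is equipped with the norm $\|f\|_{C^\alpha(K)} := \sup_{x \in K} |f(x)| + [f]_{\alpha, K}$.

Lemma A.3. Assume that $\mathcal{H}$ is an RKHS over a compact set $\mathcal{X} \subseteq \mathbb{R}^d$ associated with a kernel $k \in C^\alpha(\mathcal{X} \times \mathcal{X})$ for $\alpha \in (0, 1]$. Then, we have $\mathcal{H} \subseteq C^{\alpha/2}(\mathcal{X})$ and

$$[f]_{\alpha/2, \mathcal{X}} \le \sqrt{2 [k]_{\alpha, \mathcal{X} \times \mathcal{X}}} \, \|f\|_{\mathcal{H}}.$$

Proof. By the properties of the RKHS, we have

$$f(x) - f(y) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}} - \langle f, k(y, \cdot) \rangle_{\mathcal{H}} = \langle f, k(x, \cdot) - k(y, \cdot) \rangle_{\mathcal{H}} \le \|f\|_{\mathcal{H}} \|k(x, \cdot) - k(y, \cdot)\|_{\mathcal{H}}.$$

Moreover, expanding the squared norm with the reproducing property gives

$$\|k(x, \cdot) - k(y, \cdot)\|_{\mathcal{H}}^2 = k(x, x) - 2k(x, y) + k(y, y) \le |k(x, x) - k(x, y)| + |k(y, y) - k(x, y)| \le 2 [k]_{\alpha, \mathcal{X} \times \mathcal{X}} |x - y|^\alpha.$$

Therefore, we obtain

$$|f(x) - f(y)| \le \|f\|_{\mathcal{H}} \sqrt{2 [k]_{\alpha, \mathcal{X} \times \mathcal{X}}} \, |x - y|^{\alpha/2}.$$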

A.2 SAMPLE SUBSPACE AND SEMI-NORM

It is convenient to introduce the following commonly used notations:

$$K(x, X) := (k(x, x_1), \dots, k(x, x_n)), \qquad K(X, x) := K(x, X)', \tag{32}$$

$$K(X, X) := \big( k(x_i, x_j) \big)_{i,j}, \qquad K := \frac{1}{n} K(X, X),$$

where $K$ is known as the (normalized) kernel matrix. For a function $f$, we also denote by $f[X] = (f(x_1), \dots, f(x_n))'$ the column vector of function values.

Definition A.4. Given $\{x_1, \dots, x_n\} \subset \mathcal{X}$, the subspace

$$\mathcal{H}_n := \operatorname{span}\{k(x_1, \cdot), \dots, k(x_n, \cdot)\} \subset \mathcal{H}$$

of $\mathcal{H}$ is called the sample subspace. We also call the operator $Q_n : \mathcal{H} \to \mathcal{H}_n$ given by

$$Q_n(f)(x) = K(x, X) K(X, X)^{-1} f[X]$$

the sample projection map.

Recall that we have defined the operator $T_X = \frac{1}{n} \sum_i K_{x_i} K_{x_i}^*$ in (19). It is clear that $T_X = T_X Q_n$ and $\operatorname{Ran} T_X = \mathcal{H}_n$. Under the natural basis $\{k(x_1, \cdot), \dots, k(x_n, \cdot)\}$ of $\mathcal{H}_n$, we have

$$T_X k(x_i, \cdot) = \frac{1}{n} \sum_{j=1}^{n} K_{x_j} k(x_i, x_j) = \frac{1}{n} \sum_{j=1}^{n} k(x_i, x_j) k(x_j, \cdot),$$

i.e., $T_X$ can be represented by the matrix $K$ under the natural basis. We can use

$$T_X K(X, \cdot) = K K(X, \cdot)$$

to express this result. Furthermore, for any continuous function $\varphi(x)$, the operator $\varphi(T_X)$ satisfies

$$\varphi(T_X) K(X, \cdot) = \varphi(K) K(X, \cdot).$$

In particular, we have

$$(\varphi(T_X) f)[X] = \varphi(K) f[X].$$

Since $g_Z \in \mathcal{H}_n$ (see (20)), we know that

$$\hat f^{\mathrm{KRR}}_\lambda(x) = \frac{1}{n} K(x, X) (K + \lambda)^{-1} y.$$
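The last formula uses the normalized kernel matrix $K = \frac{1}{n} K(X, X)$, whereas the formula in Section 2.1 uses $K(X, X)$ directly. A quick numerical check that the two expressions agree can be sketched as follows; the data are hypothetical and the kernel $\min(x, y)$ is borrowed from Section 4 purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 30, 1e-2
X = rng.uniform(0.0, 1.0, size=n)
y = rng.standard_normal(n)

G = np.minimum(X[:, None], X[None, :])         # unnormalized Gram matrix K(X, X)
K = G / n                                      # normalized kernel matrix K
x_new = np.linspace(0.0, 1.0, 7)
Kx = np.minimum(x_new[:, None], X[None, :])    # each row is K(x, X) for one new point

# K(x, X) (K(X, X) + n*lam*I)^{-1} y
f_unnormalized = Kx @ np.linalg.solve(G + n * lam * np.eye(n), y)
# (1/n) K(x, X) (K + lam)^{-1} y
f_normalized = (Kx @ np.linalg.solve(K + lam * np.eye(n), y)) / n
print(np.max(np.abs(f_unnormalized - f_normalized)))
```

The agreement follows from $(K(X, X) + n\lambda I)^{-1} = \frac{1}{n} (K + \lambda)^{-1}$.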

A.2.1 SEMI-INNER PRODUCTS IN THE SAMPLE SPACE

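We consider the following sample semi-inner products:

$$\langle f, g \rangle_{L^2, n} := \frac{1}{n} \sum_{i=1}^{n} f(x_i) g(x_i) = \frac{1}{n} f[X]' g[X], \qquad \langle f, g \rangle_{\mathcal{H}, n} := \langle Q_n f, Q_n g \rangle_{\mathcal{H}}.$$

Lemma A.5. $\langle f, g \rangle_{\mathcal{H}, n} = \frac{1}{n} f[X]' K^{-1} g[X]$.

Proof. By the definition (36) of $Q_n$, we have

$$\langle f, g \rangle_{\mathcal{H}, n} = \left\langle K(\cdot, X) K(X, X)^{-1} f[X], \ K(\cdot, X) K(X, X)^{-1} g[X] \right\rangle_{\mathcal{H}} = \left( K(X, X)^{-1} f[X] \right)' K(X, X) \left( K(X, X)^{-1} g[X] \right) = f[X]' K(X, X)^{-1} g[X] = \frac{1}{n} f[X]' K^{-1} g[X].$$

Proposition A.6. For $f, g \in \mathcal{H}$, we have

$$\langle f, g \rangle_{L^2, n} = \langle T_X f, g \rangle_{\mathcal{H}} = \left\langle T_X^{1/2} f, T_X^{1/2} g \right\rangle_{\mathcal{H}}, \qquad \|f\|_{L^2, n} = \left\| T_X^{1/2} f \right\|_{\mathcal{H}}.$$

Proof. Since $\operatorname{Ran} T_X = \mathcal{H}_n$, we have $\langle T_X f, g \rangle_{\mathcal{H}} = \langle Q_n T_X f, Q_n g \rangle_{\mathcal{H}} = \langle T_X f, g \rangle_{\mathcal{H}, n}$. Since $(T_X f)[X] = K f[X]$, we obtain

$$\langle T_X f, g \rangle_{\mathcal{H}, n} = \frac{1}{n} \left[ (T_X f)[X] \right]' K^{-1} g[X] = \frac{1}{n} f[X]' g[X] = \langle f, g \rangle_{L^2, n}.$$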

A.3 COVERING NUMBER AND ENTROPY NUMBER

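Definition A.7. Let $(E, \|\cdot\|_E)$ be a normed space and $A \subset E$ be a subset. For $\varepsilon > 0$, we say that $S \subseteq A$ is an $\varepsilon$-net of $A$ if for every $a \in A$ there exists $s \in S$ such that $\|a - s\|_E \le \varepsilon$. Moreover, we define the $\varepsilon$-covering number of $A$ to be

$$N(A, \|\cdot\|_E, \varepsilon) := \inf\left\{ n \in \mathbb{N}^* : \exists s_1, \dots, s_n \in A \text{ such that } A \subseteq \bigcup_{i=1}^{n} B_E(s_i, \varepsilon) \right\} \tag{44}$$

$$= \inf\left\{ |S| : S \text{ is an } \varepsilon\text{-net of } A \right\}, \tag{45}$$

where $B_E(x_0, \varepsilon) := \{x \in E \mid \|x_0 - x\|_E \le \varepsilon\}$ is the closed ball centered at $x_0 \in E$ with radius $\varepsilon$.

The following result about the covering number of a bounded set in Euclidean space is well known; see, e.g., Vershynin (2018, Section 4.2).

Lemma A.8. Let $A \subseteq \mathbb{R}^d$ be a bounded set. Then there exists a constant $C$ (depending on $A$) such that

$$N(A, \|\cdot\|_{\mathbb{R}^d}, \varepsilon) \le C \varepsilon^{-d}.$$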

B PROOF OF THE MAIN THEOREM

B.1 BIAS-VARIANCE DECOMPOSITION

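The first step of the proof is the traditional bias-variance decomposition. Recalling (21), we have

$$\hat f^{\mathrm{KRR}}_\lambda = \frac{1}{n} (T_X + \lambda)^{-1} \sum_{i=1}^{n} K_{x_i} y_i = \frac{1}{n} (T_X + \lambda)^{-1} \sum_{i=1}^{n} K_{x_i} \left( K_{x_i}^* f^* + \epsilon_i \right) = (T_X + \lambda)^{-1} T_X f^* + \frac{1}{n} \sum_{i=1}^{n} (T_X + \lambda)^{-1} K_{x_i} \epsilon_i,$$

so that

$$\hat f^{\mathrm{KRR}}_\lambda - f^* = -\lambda (T_X + \lambda)^{-1} f^* + \frac{1}{n} \sum_{i=1}^{n} (T_X + \lambda)^{-1} K_{x_i} \epsilon_i.$$

Taking the expectation over the noise $\epsilon$ conditioned on $X$, and noting that the noise terms $\epsilon_i \mid x_i$ are independent with mean $0$ and variance $\sigma_{x_i}^2$, we have

$$\mathbb{E}\left[ \left\| \hat f^{\mathrm{KRR}}_\lambda - f^* \right\|_{L^2}^2 \,\middle|\, X \right] = \mathrm{Bias}^2 + \mathrm{Var},$$

where

$$\mathrm{Bias}^2 := \lambda^2 \left\| (T_X + \lambda)^{-1} f^* \right\|_{L^2}^2, \qquad \mathrm{Var} := \frac{1}{n^2} \sum_{i=1}^{n} \sigma_{x_i}^2 \left\| (T_X + \lambda)^{-1} k(x_i, \cdot) \right\|_{L^2}^2.$$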

B.2 LOWER BOUND FOR THE BIAS TERM

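Proposition B.1. Suppose that $f^* \in [\mathcal{H}]^0$ is a non-zero function. There is some $c > 0$ independent of $\lambda$ such that

$$\left\| (T + \lambda)^{-1} f^* \right\|_{L^2}^2 \ge c \quad \text{as } \lambda \to 0.$$

Proof. Since $f^* \in [\mathcal{H}]^0$ is non-zero, we may write $f^* = \sum_{i=1}^{\infty} a_i e_i$ with at least one $a_i \ne 0$. Because $\lambda \to 0$, we may assume $\lambda \le 1$, and hence

$$\left\| (T + \lambda)^{-1} f^* \right\|_{L^2}^2 = \sum_{i=1}^{\infty} \left( \frac{a_i}{\lambda_i + \lambda} \right)^2 \ge \sum_{i=1}^{\infty} \left( \frac{a_i}{\lambda_i + 1} \right)^2 > 0 \quad (\text{since } f^* \ne 0).$$

Theorem B.2 (Lower bound of the bias term). Suppose that $\alpha \ge 2$, $f^* \in [\mathcal{H}]^\alpha$ is a non-zero function and $\lambda = \lambda(n) = \Omega(n^{-(1-\varepsilon)})$ for some $\varepsilon \in (0, 1)$. Then, for any $\delta > 0$, there exists an integer $n_0$ such that for any $n > n_0$,

$$\mathrm{Bias}^2 \ge c \lambda^2$$

holds with probability at least $1 - \delta$, where $c > 0$ is a constant independent of $\lambda$. As a consequence, $\mathrm{Bias}^2 = \Omega_{\mathbb{P}}(\lambda^2)$.

Proof. Corollary C.8 together with Proposition B.1 yields that, with probability at least $1 - \delta$,

$$\left\| (T_X + \lambda)^{-1} f^* \right\|_{L^2}^2 \ge \left\| (T + \lambda)^{-1} f^* \right\|_{L^2}^2 - O(n^{-q}) \left( \ln \frac{4}{\delta} \right)^2 \ge c - O(n^{-q}) \left( \ln \frac{4}{\delta} \right)^2$$

for some $q > 0$. Therefore, we get

$$\mathrm{Bias}^2 = \lambda^2 \left\| (T_X + \lambda)^{-1} f^* \right\|_{L^2}^2 \ge c \lambda^2 \left( 1 - O(n^{-q}) \left( \ln \frac{4}{\delta} \right)^2 \right) \ge \frac{c}{2} \lambda^2$$

when $n$ is sufficiently large.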

B.3 LOWER BOUND FOR THE VARIANCE TERM

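For the variance term, Assumption 2 yields that $\sigma_{x_i}^2 \ge \bar\sigma^2$ almost surely. Recalling the discussion of sample subspaces in Section A.2, we have

$$\mathrm{Var} \ge \frac{\bar\sigma^2}{n^2} \sum_{i=1}^{n} \left\| (T_X + \lambda)^{-1} k(x_i, \cdot) \right\|_{L^2}^2 = \frac{\bar\sigma^2}{n^2} \left\| (T_X + \lambda)^{-1} K(X, \cdot) \right\|_{L^2(\mathcal{X}, d\mu; \mathbb{R}^n)}^2$$

$$= \frac{\bar\sigma^2}{n^2} \left\| (K + \lambda)^{-1} K(X, \cdot) \right\|_{L^2(\mathcal{X}, d\mu; \mathbb{R}^n)}^2 \quad (\text{by (38)})$$

$$= \frac{\bar\sigma^2}{n^2} \int_{\mathcal{X}} K(x, X) (K + \lambda)^{-2} K(X, x) \, d\mu(x). \tag{51}$$

Let us denote $h_x = k(x, \cdot)$. Then by definition it is obvious that $h_x[X] = K(X, x)$. From (39), we find that

$$\left( (T_X + \lambda)^{-1} h_x \right)[X] = (K + \lambda)^{-1} h_x[X] = (K + \lambda)^{-1} K(X, x),$$

so we obtain

$$\frac{1}{n} K(x, X) (K + \lambda)^{-2} K(X, x) = \frac{1}{n} \left\| (K + \lambda)^{-1} K(X, x) \right\|_{\mathbb{R}^n}^2 = \frac{1}{n} \left\| \left( (T_X + \lambda)^{-1} h_x \right)[X] \right\|_{\mathbb{R}^n}^2 = \left\| (T_X + \lambda)^{-1} h_x \right\|_{L^2, n}^2$$

from the definition (40) of the sample semi-inner product. Consequently, we get

$$\mathrm{Var} \ge \frac{\bar\sigma^2}{n} \int_{\mathcal{X}} \left\| (T_X + \lambda)^{-1} h_x \right\|_{L^2, n}^2 d\mu(x). \tag{52}$$

Combining this with some concentration results, we can obtain the following theorem.

Theorem B.3. Assume that Assumptions 1 and 2 and condition (A) (e.g., the eigenvalue decay rate (9)) hold. Suppose that $\lambda = \lambda(n) \to 0$ satisfies $\lambda = \Omega\left( n^{-\frac{1}{2} + p} \right)$ for some $p \in (0, 1/2)$. Then, for any $\delta > 0$, when $n$ is sufficiently large, the following holds with probability at least $1 - \delta$:

$$\mathrm{Var} \ge c \frac{\lambda^{-\beta}}{n}.$$

As a consequence, we have $\mathrm{Var} = \Omega_{\mathbb{P}}\left( \frac{\lambda^{-\beta}}{n} \right)$.

Proof. First, we assert that the approximation

$$\left\| (T_X + \lambda)^{-1} h_x \right\|_{L^2, n}^2 \ge \frac{1}{2} \left\| (T + \lambda)^{-1} h_x \right\|_{L^2}^2 - o(1) \left( \left\| (T + \lambda)^{-1} h_x \right\|_{L^2} + o(1) \ln \frac{4}{\delta} \right) \ln \frac{4}{\delta} \tag{54}$$

holds with probability at least $1 - \delta$. Then, plugging the approximation into (52) gives

$$\mathrm{Var} \ge \frac{\bar\sigma^2}{n} \int_{\mathcal{X}} \left\| (T_X + \lambda)^{-1} h_x \right\|_{L^2, n}^2 d\mu(x) \ge \frac{\bar\sigma^2}{2n} \int_{\mathcal{X}} \left\| (T + \lambda)^{-1} h_x \right\|_{L^2}^2 d\mu(x) - \frac{o(1)}{n} \ln \frac{4}{\delta} \int_{\mathcal{X}} \left\| (T + \lambda)^{-1} h_x \right\|_{L^2} d\mu(x) - \frac{o(1)}{n} \left( \ln \frac{4}{\delta} \right)^2.$$

For the two integral terms, applying Mercer's theorem, we get

$$\int_{\mathcal{X}} \left\| (T + \lambda)^{-1} h_x \right\|_{L^2}^2 d\mu(x) = \int_{\mathcal{X}} \sum_{i=1}^{\infty} \left( \frac{\lambda_i}{\lambda + \lambda_i} \right)^2 e_i(x)^2 \, d\mu(x) = \sum_{i=1}^{\infty} \left( \frac{\lambda_i}{\lambda + \lambda_i} \right)^2 = N_2(\lambda) \ge c \lambda^{-\beta},$$

and

$$\int_{\mathcal{X}} \left\| (T + \lambda)^{-1} h_x \right\|_{L^2} d\mu(x) \le \left( \int_{\mathcal{X}} \left\| (T + \lambda)^{-1} h_x \right\|_{L^2}^2 d\mu(x) \right)^{1/2} = (N_2(\lambda))^{1/2} \le C \lambda^{-\beta/2},$$

where the estimate of $N_2(\lambda)$ comes from Proposition D.1.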
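Therefore, we obtain that

$$\mathrm{Var} \ge \frac{c\bar{\sigma}^2}{2n}\lambda^{-\beta} - \frac{o(\lambda^{-\beta/2})}{n}\ln\frac{4}{\delta} - \frac{o(1)}{n}\left(\ln\frac{4}{\delta}\right)^2 \ge \frac{c\bar{\sigma}^2}{4n}\lambda^{-\beta}$$

as $n$ goes to infinity.

It remains to establish the approximation (54). Lemma C.11 and Lemma C.12 yield that

$$\left\|(T+\lambda)^{-1}h_x\right\|_{L^2,n}^2 \le \frac{3}{2}\left\|(T+\lambda)^{-1}h_x\right\|_{L^2}^2 + o(1)\ln\frac{4}{\delta}, \tag{56}$$

$$\left\|(T+\lambda)^{-1}h_x\right\|_{L^2,n}^2 \ge \frac{1}{2}\left\|(T+\lambda)^{-1}h_x\right\|_{L^2}^2 - o(1)\ln\frac{4}{\delta}, \tag{57}$$

$$\left|\left\|T_X^{1/2}(T_X+\lambda)^{-1}h_x\right\|_{\mathcal{H}} - \left\|T_X^{1/2}(T+\lambda)^{-1}h_x\right\|_{\mathcal{H}}\right| \le o(1)\ln\frac{4}{\delta} \tag{58}$$

with probability at least $1-\delta$. Consequently, from (43) and (56), we get

$$\left\|T_X^{1/2}(T+\lambda)^{-1}h_x\right\|_{\mathcal{H}} = \left\|(T+\lambda)^{-1}h_x\right\|_{L^2,n} \le C\left\|(T+\lambda)^{-1}h_x\right\|_{L^2} + \left(o(1)\ln\frac{4}{\delta}\right)^{1/2}.$$

Combining it with (58), we find that

$$\left\|T_X^{1/2}(T_X+\lambda)^{-1}h_x\right\|_{\mathcal{H}} + \left\|T_X^{1/2}(T+\lambda)^{-1}h_x\right\|_{\mathcal{H}} \le 2\left\|T_X^{1/2}(T+\lambda)^{-1}h_x\right\|_{\mathcal{H}} + o(1)\ln\frac{4}{\delta} \le C\left(\left\|(T+\lambda)^{-1}h_x\right\|_{L^2} + o(1)\ln\frac{4}{\delta}\right),$$

which gives the approximation of the squared norm

$$\left|\left\|T_X^{1/2}(T_X+\lambda)^{-1}h_x\right\|_{\mathcal{H}}^2 - \left\|T_X^{1/2}(T+\lambda)^{-1}h_x\right\|_{\mathcal{H}}^2\right| \le o(1)\ln\frac{4}{\delta}\cdot\left(\left\|(T+\lambda)^{-1}h_x\right\|_{L^2} + o(1)\ln\frac{4}{\delta}\right). \tag{59}$$

Finally, combining (59) and (57) yields

$$\begin{aligned}
\left\|(T_X+\lambda)^{-1}h_x\right\|_{L^2,n}^2 &= \left\|T_X^{1/2}(T_X+\lambda)^{-1}h_x\right\|_{\mathcal{H}}^2\\
&\ge \left\|T_X^{1/2}(T+\lambda)^{-1}h_x\right\|_{\mathcal{H}}^2 - o(1)\ln\frac{4}{\delta}\cdot\left(\left\|(T+\lambda)^{-1}h_x\right\|_{L^2} + o(1)\ln\frac{4}{\delta}\right)\\
&= \left\|(T+\lambda)^{-1}h_x\right\|_{L^2,n}^2 - o(1)\ln\frac{4}{\delta}\cdot\left(\left\|(T+\lambda)^{-1}h_x\right\|_{L^2} + o(1)\ln\frac{4}{\delta}\right)\\
&\ge \frac{1}{2}\left\|(T+\lambda)^{-1}h_x\right\|_{L^2}^2 - o(1)\ln\frac{4}{\delta} - o(1)\ln\frac{4}{\delta}\cdot\left(\left\|(T+\lambda)^{-1}h_x\right\|_{L^2} + o(1)\ln\frac{4}{\delta}\right)\\
&= \frac{1}{2}\left\|(T+\lambda)^{-1}h_x\right\|_{L^2}^2 - o(1)\left(\left\|(T+\lambda)^{-1}h_x\right\|_{L^2} + o(1)\ln\frac{4}{\delta}\right)\ln\frac{4}{\delta}.
\end{aligned}$$

B.4 PROOF OF THEOREM 3.1

Let $\lambda = \lambda(n)$ be an arbitrary choice of regularization parameter satisfying $\lambda(n) \to 0$. We consider the truncation

$$\tilde{\lambda} := \max\left\{\lambda,\; n^{-\frac{1}{2+\beta}}\right\},$$

which satisfies $\tilde{\lambda} = \Omega\left(n^{-\frac{1}{2+\beta}}\right)$ and $\tilde{\lambda} \to 0$. Applying Theorem B.2 and Theorem B.3 to $\tilde{\lambda}$, we obtain that

$$\mathrm{Bias}^2(\tilde{\lambda}) \ge c_1\tilde{\lambda}^2, \qquad \mathrm{Var}(\tilde{\lambda}) \ge c_2\frac{\tilde{\lambda}^{-\beta}}{n}$$

with probability at least $1-\delta$ for sufficiently large $n$, where we write $\mathrm{Bias}^2(\tilde{\lambda})$ and $\mathrm{Var}(\tilde{\lambda})$ to highlight the choice of regularization parameter. Let us consider two cases.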
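Case 1: $\lambda > n^{-\frac{1}{2+\beta}}$. In this case $\tilde{\lambda} = \lambda$, so

$$\mathbb{E}\left[\left\|f^{\mathrm{KRR}}_{\lambda} - f^*\right\|_{L^2}^2 \,\middle|\, X\right] = \mathrm{Bias}^2(\lambda) + \mathrm{Var}(\lambda) = \mathrm{Bias}^2(\tilde{\lambda}) + \mathrm{Var}(\tilde{\lambda}) \ge c_1\tilde{\lambda}^2 + c_2\frac{\tilde{\lambda}^{-\beta}}{n} \ge c\,n^{-\frac{2}{2+\beta}},$$

where the last inequality is obtained from the elementary inequality in Lemma D.3 with $\frac{1}{p} = \frac{\beta}{2+\beta}$ and $\frac{1}{q} = \frac{2}{2+\beta}$.

Case 2: $\lambda \le n^{-\frac{1}{2+\beta}}$. From the intermediate result (51) in proving the lower bound of the variance, we know that

$$\mathrm{Var}(\tilde{\lambda}) \ge \frac{\bar{\sigma}^2}{n^2}\int_{\mathcal{X}} K(x,X)(K+\tilde{\lambda})^{-2}K(X,x)\,d\mu(x) \ge c_2\frac{\tilde{\lambda}^{-\beta}}{n}$$

and

$$\mathrm{Var}(\lambda) \ge \frac{\bar{\sigma}^2}{n^2}\int_{\mathcal{X}} K(x,X)(K+\lambda)^{-2}K(X,x)\,d\mu(x).$$

Noticing that $(K+\lambda_1)^{-2} \succeq (K+\lambda_2)^{-2}$ if $\lambda_1 \le \lambda_2$, where $\succeq$ represents the partial order induced by positive definite matrices, we get

$$\mathrm{Var}(\lambda) \ge \frac{\bar{\sigma}^2}{n^2}\int_{\mathcal{X}} K(x,X)(K+\lambda)^{-2}K(X,x)\,d\mu(x) \ge \frac{\bar{\sigma}^2}{n^2}\int_{\mathcal{X}} K(x,X)(K+\tilde{\lambda})^{-2}K(X,x)\,d\mu(x) \ge c_2\frac{\tilde{\lambda}^{-\beta}}{n} = c_2\,n^{-\frac{2}{2+\beta}},$$

where we note that $\tilde{\lambda} = n^{-\frac{1}{2+\beta}}$ in this case. Consequently,

$$\mathbb{E}\left[\left\|f^{\mathrm{KRR}}_{\lambda} - f^*\right\|_{L^2}^2 \,\middle|\, X\right] \ge \mathrm{Var}(\lambda) \ge c_2\,n^{-\frac{2}{2+\beta}}.$$

The proof is completed by combining the two cases.

Remark B.4. It is worth noticing that both the requirement that $f^* \ne 0$ and the requirement that the noise is non-vanishing are necessary. If the former does not hold, choosing $\lambda = \infty$ yields the best estimator $f = 0$ with zero loss. If the latter does not hold, interpolation with $\lambda = 0$ is the best estimator since there is no noise.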
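The final step of Case 1 in the proof of Theorem 3.1 can be checked numerically. The sketch below is only an illustration under simplifying assumptions: the constants are set to $c_1 = c_2 = 1$, the helper names are hypothetical, and the minimum over $\lambda$ is taken on a finite grid. It compares $\min_\lambda \left(\lambda^2 + \lambda^{-\beta}/n\right)$ with the bound $p^{1/p} q^{1/q}\, n^{-2/(2+\beta)}$ given by Lemma D.3.

```python
import numpy as np


def young_constant(beta):
    """Constant p^(1/p) * q^(1/q) from Lemma D.3 with 1/p = beta/(2+beta), 1/q = 2/(2+beta)."""
    inv_p = beta / (2.0 + beta)
    inv_q = 2.0 / (2.0 + beta)
    return (1.0 / inv_p) ** inv_p * (1.0 / inv_q) ** inv_q


def bias_plus_variance(lam, n, beta):
    """Lower envelope lam^2 + lam^(-beta)/n (assumes c1 = c2 = 1 for illustration)."""
    return lam**2 + lam ** (-beta) / n


def check(n, beta):
    """Return (grid minimum of the envelope, Young lower bound of order n^(-2/(2+beta)))."""
    lam_grid = np.logspace(-8, 0, 2001)
    best = bias_plus_variance(lam_grid, n, beta).min()
    bound = young_constant(beta) * n ** (-2.0 / (2.0 + beta))
    return best, bound


if __name__ == "__main__":
    for n in (10**3, 10**6):
        best, bound = check(n, beta=0.5)
        print(n, best, bound)
```

No choice of $\lambda$ on the grid brings the envelope below the Young bound, and the grid minimum nearly attains it, which is why the rate $n^{-2/(2+\beta)}$ cannot be improved by tuning $\lambda$.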

C APPROXIMATION LEMMAS

In the following proofs, we always assume that δ ∈ (0, 1). For convenience, we use notations like C, c to represent constants independent of n, δ, which may vary from appearance to appearance.

C.1 CONCENTRATION RESULTS

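Lemma C.1. Suppose that Assumption 1 holds. Let $f \in \mathcal{H}$ be given. Then

$$\left\|(T_X - T)f\right\|_{\mathcal{H}} \le 2\kappa\left(\frac{2\|f\|_{\infty}}{n} + \frac{\|f\|_{L^2}}{\sqrt{n}}\right)\ln\frac{2}{\delta}$$

holds with probability at least $1-\delta$.

Proof of Lemma C.1. Let us define an $\mathcal{H}$-valued random variable $\xi_x = T_x f := K_x K_x^* f = f(x)k(x,\cdot)$. It is easy to verify that

$$\mathbb{E}_{x\sim\mu}\,\xi_x = Tf, \qquad \frac{1}{n}\sum_{i=1}^n \xi_{x_i} = T_X f.$$

Furthermore, we have $\|\xi_x\|_{\mathcal{H}} = \|f(x)k(x,\cdot)\|_{\mathcal{H}} \le \kappa\|f\|_{\infty}$ and

$$\mathbb{E}_{x\sim\mu}\|\xi_x\|_{\mathcal{H}}^2 = \mathbb{E}_{x\sim\mu}\|f(x)k(x,\cdot)\|_{\mathcal{H}}^2 \le \kappa^2\,\mathbb{E}_{x\sim\mu}|f(x)|^2 = \kappa^2\|f\|_{L^2}^2.$$

Therefore, the proof is concluded by applying Lemma D.6 with $L = 2\kappa\|f\|_{\infty}$ and $\sigma = \kappa\|f\|_{L^2}$.

The following lemma shows that $(T_X+\lambda)^{-1}$ approximates $(T+\lambda)^{-1}$. It is similar to Lin & Cevher (2020, Lemma 19), but here we do not require that $\lambda = n^{-\theta}$.

Lemma C.2. Suppose that Assumption 1 holds. If $n, \lambda$ satisfy

$$\frac{\kappa^2}{\lambda n}\ln\frac{4\mathcal{N}(\lambda)}{\delta} \le \frac{1}{16}, \tag{62}$$

where $\mathcal{N}(\lambda) = \operatorname{Tr}\left[(T+\lambda)^{-1}T\right]$, then with probability at least $1-\delta$ we have

$$\left\|(T+\lambda)^{-\frac{1}{2}}(T_X+\lambda)^{\frac{1}{2}}\right\|_{\mathscr{B}(\mathcal{H})}^2,\quad \left\|(T+\lambda)^{\frac{1}{2}}(T_X+\lambda)^{-\frac{1}{2}}\right\|_{\mathscr{B}(\mathcal{H})}^2 \le 2.$$

If Condition (A) (i.e., the eigenvalue decay condition (9)) holds and $\lambda = \Omega(n^{-(1-\varepsilon)})$ for some $\varepsilon \in (0,1)$, then condition (62) holds for sufficiently large $n$.

To prove Lemma C.2, we first prove the following lemma, which is a modified version of Lin & Cevher (2020, Lemma 16).

Lemma C.3. Under Assumption 1 and condition (9), the following holds with probability at least $1-\delta$:

$$\left\|(T+\lambda)^{-\frac{1}{2}}(T - T_X)(T+\lambda)^{-\frac{1}{2}}\right\| \le \frac{4\kappa^2 B}{3\lambda n} + \sqrt{\frac{2\kappa^2 B}{\lambda n}}, \qquad \text{where } B = \ln\frac{4(\|T\|+\lambda)\mathcal{N}(\lambda)}{\delta\|T\|}.$$

Proof. We prove the claim by using Lemma D.7. Let $A(x) = (T+\lambda)^{-\frac{1}{2}}(T_x - T)(T+\lambda)^{-\frac{1}{2}}$ and $A_i = A(x_i)$. Then $\mathbb{E}A_i = 0$ and

$$\frac{1}{n}\sum_{i=1}^n A_i = (T+\lambda)^{-\frac{1}{2}}(T_X - T)(T+\lambda)^{-\frac{1}{2}}.$$

A direct calculation shows that

$$\|A\| \le \left\|(T+\lambda)^{-\frac{1}{2}}\right\|\left(\|T_x\| + \|T\|\right)\left\|(T+\lambda)^{-\frac{1}{2}}\right\| \le 2\kappa^2\lambda^{-1} = L.$$

Using the fact that $\mathbb{E}(B - \mathbb{E}B)^2 \preceq \mathbb{E}B^2$ for a self-adjoint operator $B$, we have

$$\mathbb{E}A^2 \preceq \mathbb{E}\left[(T+\lambda)^{-\frac{1}{2}}T_x(T+\lambda)^{-\frac{1}{2}}\right]^2.$$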
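Moreover, noticing that $T \succeq 0$ and $0 \preceq T_x \preceq \kappa^2$, we have

$$A \preceq (T+\lambda)^{-\frac{1}{2}}\kappa^2(T+\lambda)^{-\frac{1}{2}} \preceq \kappa^2\lambda^{-1}$$

and hence

$$\mathbb{E}A^2 \preceq \kappa^2\lambda^{-1}\,\mathbb{E}\left[(T+\lambda)^{-\frac{1}{2}}T_x(T+\lambda)^{-\frac{1}{2}}\right] = \kappa^2\lambda^{-1}T(T+\lambda)^{-1} =: V.$$

We get

$$\|V\| = \kappa^2\lambda^{-1}\left\|T(T+\lambda)^{-1}\right\| = \kappa^2\lambda^{-1}\frac{\lambda_1}{\lambda+\lambda_1} = \kappa^2\lambda^{-1}\frac{\|T\|}{\|T\|+\lambda}, \qquad \operatorname{Tr}V = \kappa^2\lambda^{-1}\operatorname{Tr}\left[T(T+\lambda)^{-1}\right] = \kappa^2\lambda^{-1}\mathcal{N}(\lambda),$$

implying that

$$B = \ln\frac{4\operatorname{Tr}V}{\delta\|V\|} = \ln\frac{4(\|T\|+\lambda)\mathcal{N}(\lambda)}{\delta\|T\|},$$

and $\|V\| \le \kappa^2\lambda^{-1}$.

Now we are ready to prove Lemma C.2.

Proof of Lemma C.2. Let

$$u = \frac{\kappa^2 B}{\lambda n} = \frac{\kappa^2}{\lambda n}\ln\frac{4(\|T\|+\lambda)\mathcal{N}(\lambda)}{\delta\|T\|}.$$

By Lemma C.3, with probability at least $1-\delta$, we have

$$a = \left\|(T+\lambda)^{-\frac{1}{2}}(T - T_X)(T+\lambda)^{-\frac{1}{2}}\right\| \le \frac{4}{3}u + \sqrt{2u} \le \frac{1}{2},$$

where the last inequality comes from (62), namely $u \le \frac{1}{16}$. Then, for the first term we have

$$\begin{aligned}
\left\|(T+\lambda)^{-\frac{1}{2}}(T_X+\lambda)^{\frac{1}{2}}\right\|^2 &= \left\|(T+\lambda)^{-1/2}(T_X+\lambda)(T+\lambda)^{-1/2}\right\|\\
&= \left\|(T+\lambda)^{-1/2}(T_X - T + T + \lambda)(T+\lambda)^{-1/2}\right\|\\
&= \left\|(T+\lambda)^{-1/2}(T_X - T)(T+\lambda)^{-1/2} + I\right\| \le a + 1 \le 2.
\end{aligned}$$

Similarly, the second term can be bounded by

$$\left\|(T+\lambda)^{\frac{1}{2}}(T_X+\lambda)^{-\frac{1}{2}}\right\|^2 = \left\|(T+\lambda)^{1/2}(T_X+\lambda)^{-1}(T+\lambda)^{1/2}\right\| = \left\|\left[(T+\lambda)^{-1/2}(T_X+\lambda)(T+\lambda)^{-1/2}\right]^{-1}\right\| \le (1-a)^{-1} \le 2.$$

Furthermore, if condition (9) holds and $\lambda = \Omega(n^{-(1-\varepsilon)})$, then from Proposition D.1 we get $\mathcal{N}(\lambda) \le C\lambda^{-\beta} = O\left(n^{\beta(1-\varepsilon)}\right)$. Therefore,

$$\frac{\kappa^2}{\lambda n}\ln\frac{4(\|T\|+\lambda)\mathcal{N}(\lambda)}{\delta\|T\|} \le Cn^{-\varepsilon}\left[(1+\beta)(1-\varepsilon)\ln n + \ln\frac{4}{\delta} + C\right] \to 0 \quad \text{as } n \to \infty.$$

C.2 NORM CONTROL OF REGULARIZED FUNCTIONS

Proposition C.4. Suppose that $f \in [\mathcal{H}]^{\alpha}$. Then, for any $0 \le \gamma \le \alpha$ such that $\alpha - \gamma \le 2$, we have

$$\left\|(T+\lambda)^{-1}f\right\|_{[\mathcal{H}]^{\gamma}} \le \lambda^{\frac{\alpha-\gamma}{2}-1}\|f\|_{[\mathcal{H}]^{\alpha}}.$$

Proof. From the definition of $\|\cdot\|_{[\mathcal{H}]^{\gamma}}$, we have $f = T^{\alpha/2}f_0$ for some $f_0 \in L^2$ such that $\|f_0\|_{L^2} = \|f\|_{[\mathcal{H}]^{\alpha}}$, and

$$\begin{aligned}
\left\|(T+\lambda)^{-1}f\right\|_{[\mathcal{H}]^{\gamma}} &= \left\|T^{-\gamma/2}(T+\lambda)^{-1}T^{\alpha/2}f_0\right\|_{L^2} = \left\|T^{\frac{\alpha-\gamma}{2}}(T+\lambda)^{-1}f_0\right\|_{L^2}\\
&\le \left\|T^{\frac{\alpha-\gamma}{2}}(T+\lambda)^{-1}\right\|_{\mathscr{B}(L^2)}\|f_0\|_{L^2} \le \lambda^{\frac{\alpha-\gamma}{2}-1}\|f\|_{[\mathcal{H}]^{\alpha}},
\end{aligned}$$

where the last inequality comes from applying Proposition D.2 via operator calculus.

The following special cases of Proposition C.4 are useful in our proofs. We present them as corollaries.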
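Notice that we have (29), so from estimations of the RKHS-norm we can also get estimations of the sup-norm.

Corollary C.5. For $f \in [\mathcal{H}]^2$, we have the following estimations:

$$\left\|(T+\lambda)^{-1}f\right\|_{L^2} \le \|f\|_{[\mathcal{H}]^2}, \qquad \left\|(T+\lambda)^{-1}f\right\|_{\infty} \le \kappa\lambda^{-1/2}\|f\|_{[\mathcal{H}]^2}.$$

Proof. Applying Proposition C.4 with $\alpha = 2$ and $\gamma = 0, 1$ respectively, we obtain

$$\left\|(T+\lambda)^{-1}f\right\|_{L^2} \le \|f\|_{[\mathcal{H}]^2}, \qquad \left\|(T+\lambda)^{-1}f\right\|_{\infty} \le \kappa\left\|(T+\lambda)^{-1}f\right\|_{\mathcal{H}} \le \kappa\lambda^{-1/2}\|f\|_{[\mathcal{H}]^2}.$$

Similarly, noticing that $k(x,\cdot) \in [\mathcal{H}]^1$, we also have the following corollary controlling the norms of the regularized kernel basis function $(T+\lambda)^{-1}k(x,\cdot)$:

Corollary C.6. We have the following estimations: for all $x \in \mathcal{X}$,

$$\left\|(T+\lambda)^{-1}k(x,\cdot)\right\|_{L^2} \le \kappa\lambda^{-1/2}, \qquad \left\|(T+\lambda)^{-1}k(x,\cdot)\right\|_{\mathcal{H}} \le \kappa\lambda^{-1}, \qquad \left\|(T+\lambda)^{-1}k(x,\cdot)\right\|_{\infty} \le \kappa^2\lambda^{-1}.$$

C.3 APPROXIMATION OF THE REGULARIZED REGRESSION FUNCTION

Lemma C.7. Suppose that $f^* \in [\mathcal{H}]^2$. If $\lambda = \lambda(n) = \Omega(n^{-(1-\varepsilon)})$ for some $\varepsilon > 0$ and $\lambda(n) \to 0$, then there exists some $q > 0$ such that for sufficiently large $n$, the following holds with probability at least $1-\delta$:

$$\left|\left\|(T+\lambda)^{-1}f^*\right\|_{L^2} - \left\|(T_X+\lambda)^{-1}f^*\right\|_{L^2}\right| = O(n^{-q})\ln\frac{4}{\delta}.$$

Proof. By the triangle inequality and noticing that

$$(T+\lambda)^{-1} - (T_X+\lambda)^{-1} = (T_X+\lambda)^{-1}(T_X - T)(T+\lambda)^{-1},$$

we have

$$\begin{aligned}
\left|\left\|(T+\lambda)^{-1}f^*\right\|_{L^2} - \left\|(T_X+\lambda)^{-1}f^*\right\|_{L^2}\right| &\le \left\|\left[(T+\lambda)^{-1} - (T_X+\lambda)^{-1}\right]f^*\right\|_{L^2}\\
&= \left\|T^{1/2}(T_X+\lambda)^{-1}(T_X - T)(T+\lambda)^{-1}f^*\right\|_{\mathcal{H}}\\
&\le \left\|T^{1/2}(T_X+\lambda)^{-1}\right\|_{\mathscr{B}(\mathcal{H})}\left\|(T_X - T)(T+\lambda)^{-1}f^*\right\|_{\mathcal{H}}.
\end{aligned} \tag{68}$$

For the first term in (68), we have

$$\begin{aligned}
\left\|T^{1/2}(T_X+\lambda)^{-1}\right\|_{\mathscr{B}(\mathcal{H})} &= \left\|T^{1/2}(T+\lambda)^{-1}(T+\lambda)(T_X+\lambda)^{-1}\right\|_{\mathscr{B}(\mathcal{H})}\\
&\le \left\|T^{1/2}(T+\lambda)^{-1}\right\|_{\mathscr{B}(\mathcal{H})}\left\|(T+\lambda)(T_X+\lambda)^{-1}\right\|_{\mathscr{B}(\mathcal{H})}\\
&\le \lambda^{-1/2}\left\|(T+\lambda)(T_X+\lambda)^{-1}\right\|_{\mathscr{B}(\mathcal{H})} \qquad (\text{by Proposition D.2}).
\end{aligned}$$

Since $\lambda = \Omega(n^{-(1-\varepsilon)})$, Lemma C.2 yields that

$$\left\|(T+\lambda)^{1/2}(T_X+\lambda)^{-1/2}\right\|_{\mathscr{B}(\mathcal{H})} \le 2$$

holds with probability at least $1-\delta/2$ for sufficiently large $n$. Because the operator $(T+\lambda)^{1/2}(T_X+\lambda)^{-1/2}$ is invertible, we get

$$\left\|(T_X+\lambda)^{1/2}(T+\lambda)^{-1/2}\right\|_{\mathscr{B}(\mathcal{H})} \ge \frac{1}{2}.$$

By Lemma D.4, we have

$$\left\|(T_X+\lambda)(T+\lambda)^{-1}\right\|_{\mathscr{B}(\mathcal{H})} \ge \left\|(T_X+\lambda)^{1/2}(T+\lambda)^{-1/2}\right\|_{\mathscr{B}(\mathcal{H})}^2 \ge \frac{1}{4},$$

which implies that $\left\|(T+\lambda)(T_X+\lambda)^{-1}\right\|_{\mathscr{B}(\mathcal{H})} \le 4$.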
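Therefore, we obtain the upper bound

$$\left\|T^{1/2}(T_X+\lambda)^{-1}\right\|_{\mathscr{B}(\mathcal{H})} \le 4\lambda^{-1/2}. \tag{69}$$

For the second term in (68), we apply Lemma C.1 with $f = (T+\lambda)^{-1}f^*$ and get that

$$\left\|(T_X - T)(T+\lambda)^{-1}f^*\right\|_{\mathcal{H}} \le 2\kappa\left(\frac{2\left\|(T+\lambda)^{-1}f^*\right\|_{\infty}}{n} + \frac{\left\|(T+\lambda)^{-1}f^*\right\|_{L^2}}{\sqrt{n}}\right)\ln\frac{4}{\delta}$$

holds with probability at least $1-\delta/2$. Plugging in the bounds from Corollary C.5, we obtain that

$$\left\|(T_X - T)(T+\lambda)^{-1}f^*\right\|_{\mathcal{H}} \le C\left(\frac{\lambda^{-1/2}}{n} + \frac{1}{\sqrt{n}}\right)\ln\frac{4}{\delta}. \tag{70}$$

Plugging (69) and (70) back into (68), we finally get

$$\left|\left\|(T+\lambda)^{-1}f^*\right\|_{L^2} - \left\|(T_X+\lambda)^{-1}f^*\right\|_{L^2}\right| \le C\left(\frac{\lambda^{-1}}{n} + \frac{\lambda^{-1/2}}{\sqrt{n}}\right)\ln\frac{4}{\delta} = O(n^{-q})\ln\frac{4}{\delta},$$

where we use the condition $\lambda = \Omega(n^{-(1-\varepsilon)})$ in the last equality.

Combining the previous lemma with the $L^2$-norm control of the regularized regression function, we obtain the following corollary:

Corollary C.8. Suppose that $f^* \in [\mathcal{H}]^2$. If $\lambda = \lambda(n) = \Omega(n^{-(1-\varepsilon)})$ for some $\varepsilon > 0$ and $\lambda(n) \to 0$, then there exists some $q > 0$ such that for sufficiently large $n$, the following holds with probability at least $1-\delta$:

$$\left|\left\|(T+\lambda)^{-1}f^*\right\|_{L^2}^2 - \left\|(T_X+\lambda)^{-1}f^*\right\|_{L^2}^2\right| = O(n^{-q})\left(\ln\frac{4}{\delta}\right)^2.$$

Proof. From Lemma C.7 and the bound $\left\|(T+\lambda)^{-1}f^*\right\|_{L^2} \le \|f^*\|_{[\mathcal{H}]^2}$ in Corollary C.5, we have

$$\left\|(T_X+\lambda)^{-1}f^*\right\|_{L^2} \le \left\|(T+\lambda)^{-1}f^*\right\|_{L^2} + \left|\left\|(T+\lambda)^{-1}f^*\right\|_{L^2} - \left\|(T_X+\lambda)^{-1}f^*\right\|_{L^2}\right| \le \|f^*\|_{[\mathcal{H}]^2} + o(1)\ln\frac{4}{\delta} = O(1)\ln\frac{4}{\delta},$$

and hence

$$\left\|(T+\lambda)^{-1}f^*\right\|_{L^2} + \left\|(T_X+\lambda)^{-1}f^*\right\|_{L^2} = O(1)\ln\frac{4}{\delta}.$$

Therefore, we get

$$\begin{aligned}
\left|\left\|(T+\lambda)^{-1}f^*\right\|_{L^2}^2 - \left\|(T_X+\lambda)^{-1}f^*\right\|_{L^2}^2\right| &= \left|\left\|(T+\lambda)^{-1}f^*\right\|_{L^2} - \left\|(T_X+\lambda)^{-1}f^*\right\|_{L^2}\right| \cdot \left(\left\|(T+\lambda)^{-1}f^*\right\|_{L^2} + \left\|(T_X+\lambda)^{-1}f^*\right\|_{L^2}\right)\\
&= O(n^{-q})\ln\frac{4}{\delta}\cdot O(1)\ln\frac{4}{\delta} = O(n^{-q})\left(\ln\frac{4}{\delta}\right)^2.
\end{aligned}$$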

C.4 APPROXIMATION OF THE REGULARIZED KERNEL BASIS FUNCTION

The following proposition about estimating the L 2 norm with empirical norms is a corollary of Lemma D.5. Proposition C.9. Let µ be a probability measure on X , f ∈ L 2 (X , dµ) and ∥f ∥ ∞ ≤ M . Suppose we have x 1 , . . . , x n sampled i.i.d. from µ. Then, for any α > 0, the following holds with probability at least 1 -δ: ∥f ∥ 2 L 2 ,n -∥f ∥ 2 L 2 ≤ αM 2 ∥f ∥ 2 L 2 + 3 + 4αM 2 6αn ln 2 δ . By choosing α = 1 2M 2 , we have 1 2 ∥f ∥ 2 L 2 - 5M 2 3n ln 2 δ ≤ ∥f ∥ 2 L 2 ,n ≤ 3 2 ∥f ∥ 2 L 2 + 5M 2 3n ln 2 δ . Proof. Defining ξ i = f (x i ) 2 , we have Eξ i = ∥f ∥ 2 L 2 , Eξ 2 i = E x∼µ f (x) 4 ≤ ∥f ∥ 2 ∞ ∥f ∥ 2 L 2 . Therefore, applying Lemma D.5, we get ∥f ∥ 2 L 2 ,n -∥f ∥ 2 L 2 ≤ α∥f ∥ 2 ∞ ∥f ∥ 2 L 2 + 3 + 4αM 2 6αn ln 2 δ . We establish the following lemma about covering numbers of the regularized kernel basis functions. For simplicity, let us denote h x = k(x, •) ∈ H and K λ := (T + λ) -1 h x x∈X . ( ) Lemma C.10. Assuming that X ⊆ R d is bounded and k ∈ C s (X × X ) for some s ∈ (0, 1]. Then, we have N (K λ , ∥•∥ ∞ , ε) ≤ C (λε) -2d s , N (K λ , ∥•∥ H , ε) ≤ C λ 1+ β 2 ε -2d s , ( ) where C is a positive constant not depending on λ or ε. Proof. We first prove (74). By Mercer's theorem, we have (T + λ) -1 h a = i∈N λ i λ + λ i e i (a)e i , and thus (T + λ) -1 h a (x) = i∈N λ i λ + λ i e i (a)e i (x) = (T + λ) -1 h x (a). Therefore, (T + λ) -1 h a -(T + λ) -1 h b ∞ = sup x∈X (T + λ) -1 h a (x) -(T + λ) -1 h b (x) = sup x∈X (T + λ) -1 h x (a) -(T + λ) -1 h x (b) . Since k is Hölder-continuous, by Lemma A.3 we know that (T + λ) -1 h x is also Hölder-continuous. Plugging the bound (T + λ) -1 h x ≤ κλ -1 obtained in Corollary C.6 into (31), we get [(T + λ) -1 h x ] s/2,X ≤ 2κ 2 [k] s,X ×X (T + λ) -1 h x H ≤ κ 2 2[k] s,X ×X λ -1 , which implies that (T + λ) -1 h x (a) -(T + λ) -1 h x (b) ≤ C 0 λ -1 |a -b| s/2 , where C 0 = κ 2 2[k] s,X ×X . Consequently, we have (T + λ) -1 h a -(T + λ) -1 h b ∞ ≤ C 0 λ -1 |a -b| s/2 . 
( ) (76) yields that to find an ε-net of K λ with respect to ∥•∥ ∞ , we only need to find an ε-net of X with respect to the Euclidean norm, where ε = λε C0 2/s . Since the result of the covering number of the latter one is already known in Lemma A.8, we finally obtain that N (K λ , ∥•∥ ∞ , ε) ≤ N (X , ∥•∥ R d , ε) ≤ C ε-d = C λε C 0 -2d s = C (λε) -2d s . Now we prove (75). By Mercer's theorem and the series definition of the RHKS norm, we have (T + λ) -1 h x -(T + λ) -1 h y 2 H = i∈N λ i (e i (x) -e i (y)) λ + λ i e i 2 H = i∈N λ i (e i (x) -e i (y)) 2 (λ + λ i ) 2 . Since ∥e i ∥ H = λ -1/2 i , by Lemma A.3 we have [e i ] s/2,X ≤ κ 2 2[k] s,X ×X λ -1/2 i , and thus (e i (x) -e i (y)) 2 ≤ Cλ -1 i |x -y| s . Therefore, we get 2+β) . (T + λ) -1 h x -(T + λ) -1 h y 2 H ≤ C i∈N |x -y| s (λ + λ i ) 2 = C|x -y| s λ -2 i∈N λ λ + λ i 2 ≤ C|x -y| s λ -( Using a similar covering argument over X gives the desired result. Lemma C.11. Suppose that Assumption 1 holds. Assume that λ = λ(n) = Ω(n -1/2+p ) for some p ∈ (0, 1/2). Then, there exists some q > 0 such that for any δ > 0, the following holds with probability at least 1 -δ: ∀x ∈ X , (T + λ) -1 h x 2 L 2 ,n ≤ 3 2 (T + λ) -1 h x 2 L 2 + O(n -q ) ln 2 δ , (T + λ) -1 h x 2 L 2 ,n ≥ 1 2 (T + λ) -1 h x 2 L 2 -O(n -q ) ln 2 δ , where the constants in O(n -q ) do not depend on x. Proof. By Lemma C.10, we can find an ε-net F ⊆ K λ ⊆ H with respect to sup-norm of K λ such that |F| ≤ C (λε) -2d s , where ε = ε(n) will be determined later. Applying Proposition C.9 to F with the ∥•∥ ∞ -bound in Corollary C.6, with probability at least 1 -δ we have 1 2 ∥f ∥ 2 L 2 - Cλ -2 n ln 2|F| δ ≤ ∥f ∥ 2 L 2 ,n ≤ 3 2 ∥f ∥ 2 L 2 + Cλ -2 n ln 2|F| δ , ∀f ∈ F. Now, since F is an ε-net of K λ with respect to ∥•∥ ∞ , for any x ∈ X , there exists some f ∈ F such that (T + λ) -1 h x -f ∞ ≤ ε, which implies that (T + λ) -1 h x L 2 -∥f ∥ L 2 ≤ ε, (T + λ) -1 h x L 2 ,n -∥f ∥ L 2 ,n ≤ ε. 
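Since $\left\|(T+\lambda)^{-1}h_x\right\|_{\infty} \le C\lambda^{-1}$ and $a^2 - b^2 = (a-b)(2b + (a-b))$, we get

$$\left|\left\|(T+\lambda)^{-1}h_x\right\|_{L^2}^2 - \|f\|_{L^2}^2\right| \le C\varepsilon\lambda^{-1}, \qquad \left|\left\|(T+\lambda)^{-1}h_x\right\|_{L^2,n}^2 - \|f\|_{L^2,n}^2\right| \le C\varepsilon\lambda^{-1}. \tag{79}$$

For the upper bound, we have

$$\begin{aligned}
\left\|(T+\lambda)^{-1}h_x\right\|_{L^2,n}^2 &\le \|f\|_{L^2,n}^2 + C\varepsilon\lambda^{-1} && (\text{by (79)})\\
&\le \frac{3}{2}\|f\|_{L^2}^2 + \frac{C\lambda^{-2}}{n}\ln\frac{2|\mathcal{F}|}{\delta} + C\varepsilon\lambda^{-1} && (\text{by (78)})\\
&\le \frac{3}{2}\left\|(T+\lambda)^{-1}h_x\right\|_{L^2}^2 + \frac{C\lambda^{-2}}{n}\ln\frac{2|\mathcal{F}|}{\delta} + C\varepsilon\lambda^{-1} && (\text{by (79) again}).
\end{aligned}$$

Noticing that $\lambda = \Omega\left(n^{-1/2+p}\right)$ and (77), by choosing $\varepsilon = n^{-1}$, it is easy to verify that

$$\begin{aligned}
\frac{C\lambda^{-2}}{n}\ln\frac{2|\mathcal{F}|}{\delta} + C\varepsilon\lambda^{-1} &= O\left(n^{-2p}\right)\left(\ln|\mathcal{F}| + \ln\frac{2}{\delta}\right) + O\left(n^{-\frac{1}{2}-p}\right)\\
&= O\left(n^{-2p}\right)\left(C_1\ln n + C_2 + \ln\frac{2}{\delta}\right) + O\left(n^{-\frac{1}{2}-p}\right) = O\left(n^{-q}\right)\ln\frac{2}{\delta}
\end{aligned}$$

for some $q > 0$. The lower bound follows similarly.

Lemma C.12. Suppose that Assumption 1 and Condition (A) (i.e., the eigenvalue decay rate (9)) hold. Assume that $\lambda = \Omega\left(n^{-\frac{1}{2}+p}\right)$ for some $p \in (0,1/2)$. Then, there exists some $q > 0$ such that

$$\left|\left\|T_X^{1/2}(T_X+\lambda)^{-1}h_x\right\|_{\mathcal{H}} - \left\|T_X^{1/2}(T+\lambda)^{-1}h_x\right\|_{\mathcal{H}}\right| \le O(n^{-q})\ln\frac{2}{\delta}, \qquad \forall x \in \mathcal{X},$$

with probability at least $1-\delta$, where the constant in $O(n^{-q})$ is independent of $x$.

Proof. We begin with

$$\left|\left\|T_X^{1/2}(T_X+\lambda)^{-1}h_x\right\|_{\mathcal{H}} - \left\|T_X^{1/2}(T+\lambda)^{-1}h_x\right\|_{\mathcal{H}}\right| \le \left\|T_X^{1/2}\left[(T_X+\lambda)^{-1} - (T+\lambda)^{-1}\right]h_x\right\|_{\mathcal{H}}.$$

Noticing that

$$(T_X+\lambda)^{-1} - (T+\lambda)^{-1} = (T_X+\lambda)^{-1}(T - T_X)(T+\lambda)^{-1},$$

we obtain

$$\begin{aligned}
\left\|T_X^{1/2}\left[(T_X+\lambda)^{-1} - (T+\lambda)^{-1}\right]h_x\right\|_{\mathcal{H}} &= \left\|T_X^{1/2}(T_X+\lambda)^{-1}(T - T_X)(T+\lambda)^{-1}h_x\right\|_{\mathcal{H}}\\
&\le \left\|T_X^{1/2}(T_X+\lambda)^{-1}\right\|\left\|(T - T_X)(T+\lambda)^{-1}h_x\right\|_{\mathcal{H}}\\
&\le \lambda^{-1/2}\left\|(T - T_X)(T+\lambda)^{-1}h_x\right\|_{\mathcal{H}},
\end{aligned} \tag{80}$$

where we apply Proposition D.2 in the last inequality.

Now, we deal with the last term in (80). By Lemma C.10, we find an $\varepsilon$-covering $\mathcal{F}$ of $\mathcal{K}_{\lambda}$ with respect to $\|\cdot\|_{\mathcal{H}}$ such that

$$|\mathcal{F}| \le C\left(\lambda^{\frac{2+\beta}{2}}\varepsilon\right)^{-\frac{2d}{s}}. \tag{81}$$

Applying Lemma C.1 to $\mathcal{F}$ and noticing the $L^2$-norm and sup-norm bounds obtained in Corollary C.6, we find that, for all $f \in \mathcal{F}$,

$$\left\|(T - T_X)f\right\|_{\mathcal{H}} \le 2\kappa\left(\frac{2\|f\|_{\infty}}{n} + \frac{\|f\|_{L^2}}{\sqrt{n}}\right)\ln\frac{2|\mathcal{F}|}{\delta} \le C\left(\frac{\lambda^{-1}}{n} + \frac{\lambda^{-1/2}}{\sqrt{n}}\right)\ln\frac{2|\mathcal{F}|}{\delta}.$$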
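Then, for any $x \in \mathcal{X}$, we can find some $f \in \mathcal{F}$ such that $\left\|(T+\lambda)^{-1}h_x - f\right\|_{\mathcal{H}} \le \varepsilon$, and thus

$$\begin{aligned}
\left\|(T - T_X)(T+\lambda)^{-1}h_x\right\|_{\mathcal{H}} &\le \left\|(T - T_X)\left[(T+\lambda)^{-1}h_x - f\right]\right\|_{\mathcal{H}} + \left\|(T - T_X)f\right\|_{\mathcal{H}}\\
&\le \|T - T_X\|_{\mathscr{B}(\mathcal{H})}\left\|(T+\lambda)^{-1}h_x - f\right\|_{\mathcal{H}} + \left\|(T - T_X)f\right\|_{\mathcal{H}}\\
&\le 2\kappa^2\varepsilon + C\left(\frac{\lambda^{-1}}{n} + \frac{\lambda^{-1/2}}{\sqrt{n}}\right)\ln\frac{2|\mathcal{F}|}{\delta}.
\end{aligned} \tag{82}$$

Plugging (82) into (80), we obtain

$$\left\|T_X^{1/2}\left[(T_X+\lambda)^{-1} - (T+\lambda)^{-1}\right]h_x\right\|_{\mathcal{H}} \le C\lambda^{-1/2}\varepsilon + C\left(\frac{\lambda^{-\frac{3}{2}}}{n} + \frac{\lambda^{-1}}{\sqrt{n}}\right)\ln\frac{2|\mathcal{F}|}{\delta}.$$

Finally, noticing $\lambda = \Omega\left(n^{-\frac{1}{2}+p}\right)$ and (81), by letting $\varepsilon = n^{-1}$ we have

$$\left\|T_X^{1/2}\left[(T_X+\lambda)^{-1} - (T+\lambda)^{-1}\right]h_x\right\|_{\mathcal{H}} \le Cn^{-q}\ln\frac{2}{\delta}$$

for some $q > 0$.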

D AUXILIARY RESULTS

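For any $p \ge 1$, let us introduce the $p$-effective dimension

$$\mathcal{N}_p(\lambda) := \operatorname{Tr}\left[\left(T(T+\lambda)^{-1}\right)^p\right] = \sum_{i=1}^{\infty}\left(\frac{\lambda_i}{\lambda+\lambda_i}\right)^p, \qquad p \ge 1.$$

Proposition D.1. If $\lambda_i \asymp i^{-1/\beta}$, we have

$$\mathcal{N}_p(\lambda) \asymp \lambda^{-\beta}.$$

Proof. Since $c\,i^{-1/\beta} \le \lambda_i \le C\,i^{-1/\beta}$, we have

$$\begin{aligned}
\mathcal{N}_p(\lambda) &= \sum_{i=1}^{\infty}\left(\frac{\lambda_i}{\lambda_i+\lambda}\right)^p \le \sum_{i=1}^{\infty}\left(\frac{C\,i^{-1/\beta}}{C\,i^{-1/\beta}+\lambda}\right)^p = \sum_{i=1}^{\infty}\left(\frac{C}{C+\lambda i^{1/\beta}}\right)^p\\
&\le \int_0^{\infty}\left(\frac{C}{\lambda x^{1/\beta}+C}\right)^p dx = \lambda^{-\beta}\int_0^{\infty}\left(\frac{C}{y^{1/\beta}+C}\right)^p dy \le C\lambda^{-\beta}
\end{aligned}$$

for some constant $C$. Similarly, we have $\mathcal{N}_p(\lambda) \ge C'\lambda^{-\beta}$ for some constant $C'$.

Proposition D.2. For $\lambda > 0$ and $s \in [0,1]$, we have

$$\sup_{t \ge 0}\frac{t^s}{t+\lambda} \le \lambda^{s-1}.$$

Proof. It follows from the inequality $a^s \le a + 1$ for any $a \ge 0$ and $s \in [0,1]$.

Lemma D.3 (Young's inequality). Let $a, b > 0$. For $p, q > 1$ satisfying $\frac{1}{p} + \frac{1}{q} = 1$, we have

$$ab \le \frac{1}{p}a^p + \frac{1}{q}b^q,$$

or equivalently

$$a + b \ge \sqrt[p]{p}\,\sqrt[q]{q}\cdot a^{\frac{1}{p}}b^{\frac{1}{q}}.$$

The following operator inequality (Fujii et al., 1993) will be used in our proofs.

Lemma D.4 (Cordes' inequality). Let $A, B$ be two positive semi-definite bounded linear operators on a separable Hilbert space $H$. Then

$$\|A^s B^s\|_{\mathscr{B}(H)} \le \|AB\|_{\mathscr{B}(H)}^s, \qquad \forall s \in [0,1].$$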
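Proposition D.1 can be illustrated numerically. The sketch below is a rough check under stated assumptions, not part of the proof: the eigenvalues are taken to be exactly $\lambda_i = i^{-1/\beta}$, the infinite sum is truncated at a finite number of terms, and the function names are hypothetical. If $\mathcal{N}_p(\lambda) \asymp \lambda^{-\beta}$, the rescaled quantity $\lambda^{\beta}\mathcal{N}_p(\lambda)$ should stay within fixed positive bounds as $\lambda$ decreases.

```python
import numpy as np


def effective_dimension(lam, beta, p=1, n_terms=10**6):
    """Truncated p-effective dimension, assuming eigenvalues lambda_i = i^(-1/beta)."""
    i = np.arange(1, n_terms + 1, dtype=float)
    eig = i ** (-1.0 / beta)
    return np.sum((eig / (eig + lam)) ** p)


def scaled_dimensions(beta=0.5, p=1, lams=(1e-2, 1e-3, 1e-4)):
    """Return lam^beta * N_p(lam) for each lam; bounded values indicate N_p ~ lam^(-beta)."""
    return [effective_dimension(lam, beta, p) * lam**beta for lam in lams]


if __name__ == "__main__":
    print(scaled_dimensions(beta=0.5, p=1))
    print(scaled_dimensions(beta=0.5, p=2))
```

The truncation level only needs to be large enough that the neglected tail is small compared with $\lambda^{-\beta}$ for the values of $\lambda$ considered.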

D.1 CONCENTRATION INEQUALITIES

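The following concentration inequality is adopted from Caponnetto & Yao (2010):

Lemma D.5. Let $\xi_1, \dots, \xi_n$ be $n$ i.i.d. bounded random variables such that $|\xi_i| \le B$, $\mathbb{E}\xi_i = \mu$, and $\mathbb{E}(\xi_i - \mu)^2 \le \sigma^2$. Then for any $\alpha > 0$ and any $\delta \in (0,1)$, we have that

$$\left|\frac{1}{n}\sum_{i=1}^n \xi_i - \mu\right| \le \alpha\sigma^2 + \frac{3 + 4\alpha B}{6\alpha n}\ln\frac{2}{\delta} \tag{87}$$

holds with probability at least $1-\delta$.

Proof of Lemma D.5. The high-probability bound (87) is equivalent to the following probability form:

$$P\left\{\left|\frac{1}{n}\sum_{i=1}^n \xi_i - \mu\right| \ge \alpha\sigma^2 + \varepsilon\right\} \le 2\exp\left(-\frac{6n\alpha\varepsilon}{3 + 4\alpha B}\right),$$

where $\varepsilon = \frac{3+4\alpha B}{6\alpha n}\ln\frac{2}{\delta}$. By symmetry, it suffices to prove the following one-sided inequality:

$$P\left\{\frac{1}{n}\sum_{i=1}^n \xi_i - \mu \ge \alpha\sigma^2 + \varepsilon\right\} \le \exp\left(-\frac{6n\alpha\varepsilon}{3 + 4\alpha B}\right).$$

Taking exponentials with some factor $s > 0$ and applying Markov's inequality, we obtain the bound (88).

Let $X_i = \xi_i - \mu$, $t = \frac{s}{n}$ and $M = 2B$. As long as $t < \frac{3}{M}$, we have

$$\mathbb{E}\exp\left(\frac{s}{n}(\xi_i - \mu)\right) = \mathbb{E}\exp(tX_i) = \sum_{k=0}^{\infty}\frac{t^k}{k!}\mathbb{E}X_i^k \le 1 + \sum_{k=2}^{\infty}\frac{t^k}{k!}M^{k-2}\sigma^2 \le 1 + \frac{t^2\sigma^2}{2}\sum_{k=0}^{\infty}\left(\frac{Mt}{3}\right)^k = 1 + \frac{3t^2\sigma^2}{6 - 2Mt} \le \exp\left(\frac{3t^2\sigma^2}{6 - 2Mt}\right).$$

Therefore,

$$(88) \le \exp\left(-s(\alpha\sigma^2 + \varepsilon) + n\frac{3t^2\sigma^2}{6 - 2Mt}\right) = \exp\left(-s(\alpha\sigma^2 + \varepsilon) + \frac{3s^2\sigma^2}{6n - 4Bs}\right) = \exp\left(-s\varepsilon + s\sigma^2\left(-\alpha + \frac{3s}{6n - 4Bs}\right)\right).$$

Solving $-\alpha + \frac{3s}{6n - 4Bs} = 0$, we get $s = \frac{6\alpha n}{3 + 4\alpha B}$, which satisfies $t = \frac{s}{n} < \frac{3}{M}$. Hence we have

$$(88) \le \exp\left(-\frac{6\alpha n}{3 + 4\alpha B}\varepsilon\right)$$

and the proof is complete.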
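The bound (87) of Lemma D.5 can be sanity-checked by simulation. The following is a minimal sketch under assumptions that are not in the original: the variables are Uniform$[-1,1]$ (so $B = 1$, $\mu = 0$, $\sigma^2 = 1/3$), the free parameter is set to $\alpha = \sqrt{\ln(2/\delta)/(2n\sigma^2)}$, which minimizes the right-hand side of (87) and recovers the familiar Bernstein form $\sqrt{2\sigma^2\ln(2/\delta)/n} + \frac{2B}{3n}\ln\frac{2}{\delta}$, and the function names are hypothetical.

```python
import numpy as np


def d5_bound(alpha, sigma2, B, n, delta):
    """Right-hand side of (87): alpha*sigma^2 + (3 + 4*alpha*B) / (6*alpha*n) * ln(2/delta)."""
    return alpha * sigma2 + (3.0 + 4.0 * alpha * B) / (6.0 * alpha * n) * np.log(2.0 / delta)


def coverage(n=200, delta=0.1, trials=5000, seed=0):
    """Empirical frequency with which the sample mean stays within the bound."""
    rng = np.random.default_rng(seed)
    B, mu, sigma2 = 1.0, 0.0, 1.0 / 3.0  # Uniform[-1, 1], an assumed test distribution
    log_term = np.log(2.0 / delta)
    alpha = np.sqrt(log_term / (2.0 * n * sigma2))  # minimizer of the bound over alpha
    bound = d5_bound(alpha, sigma2, B, n, delta)
    xi = rng.uniform(-1.0, 1.0, size=(trials, n))
    deviation = np.abs(xi.mean(axis=1) - mu)
    return bound, float(np.mean(deviation <= bound))


if __name__ == "__main__":
    print(coverage())
```

The empirical coverage should be at least $1-\delta$; in practice it is noticeably higher, since the bound is not tight for a fixed distribution.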

E.2 EXPERIMENTS ON THE SPHERE

In this part we conduct experiments beyond dimension 1. We consider some inner-product kernels on the sphere $\mathcal{X} = \mathbb{S}^2 \subset \mathbb{R}^3$ with $\mu$ being the uniform distribution. The reason is that it is hard to find an explicit eigen-decomposition of a general kernel, whereas for inner-product kernels on $\mathbb{S}^{d-1}$ we can obtain explicit forms of the eigenfunctions, which are necessary for us to construct smooth regression functions. These eigenfunctions are known as the spherical harmonics, which turn out to be homogeneous polynomials. We refer to Dai & Xu (2013) for a detailed introduction. On $\mathbb{S}^2$, the spherical harmonics are often denoted by $Y_l^m$, $l = 0, 1, 2, \dots$, $m = -l, \dots, l$, and $Y_l^m$ is a homogeneous polynomial of order $l$. We pick some of them to be our underlying truth function $f^*$, which are listed below:

$$Y_1^1(x_1,x_2,x_3) = \sqrt{\frac{3}{4\pi}}\,x_1, \qquad Y_2^{-2}(x_1,x_2,x_3) = \frac{1}{2}\sqrt{\frac{15}{\pi}}\,x_1x_2, \qquad Y_3^2(x_1,x_2,x_3) = \frac{1}{4}\sqrt{\frac{105}{\pi}}\,(x_1^2 - x_2^2)x_3.$$

In terms of kernels, we use the truncated power function $k(x,y) = (1 - \|x-y\|)_+^p$, where $a_+ = \max(a, 0)$. It is known that if $p > \lfloor\frac{d}{2}\rfloor + 1$, this kernel is positive definite on $\mathbb{R}^d$ and thus positive definite on $\mathbb{S}^{d-1}$ (Wendland, 2004, Theorem 6.20). However, we do not know the eigenvalue decay rate $\beta$ for these kernels.

In the following experiment, we basically follow the same procedure as described in Section 4, except that we choose the regularization parameter by $\lambda = n^{-\theta}$ with various $\theta$. The results are collected in Table E.2. We also plot the error curves of one of the experiments in Figure 2. The results show that the convergence rates of KRR increase and then decrease as $\theta$ decreases, while the convergence rates of GF keep increasing, and the best convergence rate of KRR is significantly slower than that of GF. We conclude that this experiment also supports the saturation effect and our theory.
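A single KRR fit in the setting described above can be sketched as follows. This is an illustrative sketch, not the authors' experimental code: the sample size, noise level, value of $\lambda$, test-set size and all function names are assumptions made here, and only the kernel $(1-\|x-y\|)_+^4$, the target $f^* = Y_1^1$ and the closed form $f^{\mathrm{KRR}}_\lambda(x) = K(x,X)(K(X,X)+n\lambda I)^{-1}y$ come from the text.

```python
import numpy as np


def sample_sphere(n, rng):
    """Draw n points uniformly on S^2 by normalizing standard Gaussian vectors."""
    x = rng.standard_normal((n, 3))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def truncated_power_kernel(a, b, p=4):
    """Kernel matrix with entries (1 - ||a_i - b_j||)_+^p."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return np.maximum(1.0 - dist, 0.0) ** p


def y11(x):
    """Spherical harmonic Y_1^1(x) = sqrt(3 / (4 pi)) * x_1."""
    return np.sqrt(3.0 / (4.0 * np.pi)) * x[:, 0]


def krr_fit_predict(x_train, y_train, x_test, lam, p=4):
    """KRR prediction K(x, X) (K(X, X) + n * lam * I)^{-1} y."""
    n = x_train.shape[0]
    gram = truncated_power_kernel(x_train, x_train, p)
    coef = np.linalg.solve(gram + n * lam * np.eye(n), y_train)
    return truncated_power_kernel(x_test, x_train, p) @ coef


def run(n=300, lam=1e-3, noise=0.1, p=4, seed=0):
    """Fit KRR on noisy samples of Y_1^1 and return the test mean squared error."""
    rng = np.random.default_rng(seed)
    x_train = sample_sphere(n, rng)
    y_train = y11(x_train) + noise * rng.standard_normal(n)
    x_test = sample_sphere(2000, rng)
    pred = krr_fit_predict(x_train, y_train, x_test, lam, p)
    return float(np.mean((pred - y11(x_test)) ** 2))


if __name__ == "__main__":
    print(run())
```

To estimate a convergence rate in the spirit of the experiments, one would repeat `run` over a range of `n` with `lam = n ** (-theta)` and fit a least-squares line to the logarithm of the error against the logarithm of `n`.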
Published as a conference paper at ICLR 2023

[Table header: Kernel; $\theta$; KRR and GF rates for each of $f^* = Y_1^1$, $f^* = Y_2^{-2}$, $f^* = Y_3^2$]



Figure 1: Error decay curves of KRR and GF. Both axes are logarithmic. The colored curves show the averaged error over 100 trials and the regions within one standard deviation are shown in green. The dashed black lines are computed using logarithmic least-squares and the slopes are reported as convergence rates.

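$$P\left\{\frac{1}{n}\sum_{i=1}^n \xi_i - \mu \ge \alpha\sigma^2 + \varepsilon\right\} = P\left\{\exp\left(\frac{s}{n}\sum_{i=1}^n(\xi_i - \mu)\right) \ge \exp\left(s(\alpha\sigma^2 + \varepsilon)\right)\right\} \le \exp\left(-s(\alpha\sigma^2 + \varepsilon)\right)\prod_{i=1}^n\mathbb{E}\exp\left(\frac{s}{n}(\xi_i - \mu)\right) \quad (\text{Markov inequality}) \tag{88}$$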

Figure 2: Error decay curves of KRR and GF with kernel $(1 - \|x-y\|)_+^4$ on $\mathbb{S}^2$ and $f^* = Y_1^1$.

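is also a separable Hilbert space. It is obvious that $[\mathcal{H}]^0 \subseteq L^2$ and $[\mathcal{H}]^1 \subseteq \mathcal{H}$. Moreover, we have isometric isomorphisms $T^{s/2}: [\mathcal{H}]^{\alpha} \to [\mathcal{H}]^{\alpha+s}$ and compact embeddings $[\mathcal{H}]^{\alpha_1} \hookrightarrow [\mathcal{H}]^{\alpha_2}$ for all $\alpha_1 > \alpha_2 \ge 0$. The example below illustrates the intuition of $\alpha$: it describes the smoothness of functions with respect to the kernel.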

Convergence rates comparison between KRR and GF with $\lambda = cn^{-\frac{1}{\alpha+\beta}}$ for various $\alpha$'s. Bold numbers represent the max rate over different choices of $\lambda$.

be the integral operator associated with the kernel function $k$. By the spectral decomposition of $T$, it can be easily shown that $\operatorname{Ran} T|_{\mathcal{H}} \subseteq \mathcal{H}$ and $\|Tf\|_{\mathcal{H}} \le \kappa^2\|f\|_{\mathcal{H}}$, which implies that $T$ can also be viewed as a bounded linear operator on $\mathcal{H}$. With a slight abuse of notation, we still use $T$ to denote it. We denote by $\mathscr{B}(\mathcal{H})$ the set of bounded linear operators over $\mathcal{H}$ and by $\|\cdot\|_{\mathscr{B}(\mathcal{H})}$ the corresponding operator norm, where the subscript may be omitted if there is no confusion.

[Table header: Kernel; $\alpha$; KRR, GF and CUT rates for each of $f^* = e_1$, $f^* = e_2$, $f^* = e_3$]

Convergence rates comparison between KRR, GF and CUT with $\lambda = cn^{-\frac{1}{\alpha+\beta}}$ for various $\alpha$'s. Bold numbers represent the max rate over different choices of $\lambda$.

Convergence rates comparison between KRR and GF with $\lambda = cn^{-\theta}$ for various $\theta$'s. Bold numbers represent the max rate over different choices of $\lambda$.

ACKNOWLEDGMENTS

This research was partially supported by the National Natural Science Foundation of China (Grant 11971257), Beijing Natural Science Foundation (Grant Z190001), National Key R&D Program of China (2020AAA0105200), and Beijing Academy of Artificial Intelligence.


The following concentration inequality about vector-valued random variables is commonly used in the literature; see, e.g., Caponnetto & De Vito (2007, Proposition 2) and references therein.

Lemma D.6. Let $H$ be a real separable Hilbert space. Let $\xi, \xi_1, \dots, \xi_n$ be i.i.d. random variables taking values in $H$. Assume that

Then for fixed $\delta \in (0,1)$, one has

Particularly, a sufficient condition for (89) is

The following Bernstein-type concentration inequality about self-adjoint Hilbert-Schmidt operator-valued random variables results from applying the discussion in Minsker (2017, Section 3.2) to Tropp (2012, Theorem 7.3.1). It can be found in, e.g., Lin & Cevher (2020, Lemma 24).

Lemma D.7. Let $H$ be a separable Hilbert space. Let $A_1, \dots, A_n$ be i.i.d. random variables taking values in the self-adjoint Hilbert-Schmidt operators such that $\mathbb{E}A_1 = 0$, $\|A_1\| \le L$ almost surely for some $L > 0$, and $\mathbb{E}A_1^2 \preceq V$ for some positive trace-class operator $V$. Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$ we have

E MORE EXPERIMENTS

In this section we provide more results about the experiments.

E.1 MORE EXPERIMENTS ON THE INTERVAL

In the following experiments, we use the same setting as in Section 4. We consider other commonly used kernels and set $f^*$ as one of their eigenfunctions. We also compare KRR with another regularization algorithm called spectral cut-off (CUT), which, like GF, never saturates (see, e.g., Lin et al. (2018, Example 3.1)). We introduce another kernel with known explicit forms of eigenfunctions, which will be used as the underlying regression function.

Heaviside step kernel. The Heaviside step kernel on $[0, 1]$ is defined by

The associated RKHS is

with inner product $\langle f, g\rangle$

It is known that the eigen-system of this kernel is

and hence $\beta = 0.5$.

We conduct experiments on the two kernels and set the regression function to be one of the eigenfunctions. We report the results in Table E.1. The results are generally the same as those of Section 4: the GF and CUT methods behave similarly and both achieve better performance as $\alpha$ increases, while KRR reaches its best performance at $\alpha = 2$ with a resulting max rate of approximately 0.8, which agrees with our theory. We also notice that there are some numerical fluctuations. We attribute them to randomness, where we find the deviation is large, and to numerical error, since the eigenvalues are small. In conclusion, the numerical results support our theory.

