ON THE SATURATION EFFECT OF KERNEL RIDGE REGRESSION

Abstract

The saturation effect refers to the phenomenon that kernel ridge regression (KRR) fails to achieve the information theoretical lower bound when the smoothness of the underlying truth function exceeds a certain level. The saturation effect has been widely observed in practice, and a saturation lower bound of KRR has been conjectured for decades. In this paper, we provide a proof of this long-standing conjecture.

1. INTRODUCTION

Suppose that we have observed $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{n}$ from an unknown distribution $\rho$ supported on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}$. One of the central problems in statistical learning theory is to find a function $f$ based on these observations such that the generalization error
$$\mathbb{E}_{(x,y)\sim\rho}\big(f(x) - y\big)^2 \qquad (1)$$
is small. It is well known that the conditional mean $f^*_\rho(x) := \mathbb{E}_\rho[\, y \mid x \,] = \int_{\mathcal{Y}} y \, d\rho(y \mid x)$ minimizes the square loss $\mathcal{E}(f) = \mathbb{E}_\rho\big(f(x) - y\big)^2$, where $\rho(y \mid x)$ is the distribution of $y$ conditioned on $x$. Thus, this question is equivalent to looking for an $f$ such that the generalization error
$$\mathbb{E}_{x\sim\mu}\big(f(x) - f^*_\rho(x)\big)^2 \qquad (2)$$
is small, where $\mu$ is the marginal distribution of $\rho$ on $\mathcal{X}$. In other words, $f$ can be viewed as an estimator of $f^*_\rho$. When no explicit parametric assumption is made on the distribution $\rho$ or the function $f^*_\rho$, researchers often assume that $f^*_\rho$ falls into a certain class of functions and have developed many non-parametric methods to estimate $f^*_\rho$ (e.g., Györfi (2002); Tsybakov (2009)).

The kernel method, one of the most widely applied non-parametric regression methods (e.g., Kohler & Krzyzak (2001); Cucker & Smale (2001); Caponnetto & De Vito (2007); Steinwart et al. (2009); Fischer & Steinwart (2020)), assumes that $f^*_\rho$ belongs to a certain reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, a separable Hilbert space associated with a kernel function $k$ defined on $\mathcal{X}$. Kernel ridge regression (KRR), also known as Tikhonov regularization or regularized least squares, estimates $f^*_\rho$ by solving the penalized least squares problem
$$f^{\mathrm{KRR}}_{\lambda} = \arg\min_{f \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 + \lambda \|f\|_{\mathcal{H}}^2,$$
where $\lambda > 0$ is the so-called regularization parameter. By the representer theorem (see, e.g., Steinwart & Christmann (2008)), this estimator has an explicit formula (see (8) for the exact meaning of the notation):
$$f^{\mathrm{KRR}}_{\lambda}(x) = K(x, X)\big(K(X, X) + n\lambda I\big)^{-1}\mathbf{y}.$$

Theories have been developed for KRR from many aspects over the last decades, especially for the convergence rate of the generalization error. For example, if $f^*_\rho \in \mathcal{H}$ without any further smoothness assumption, Caponnetto & De Vito (2007) and Steinwart et al. (2009) showed that the generalization error of KRR achieves the information theoretical lower bound $n^{-\frac{1}{1+\beta}}$, where $\beta$ is a quantity characterizing the RKHS $\mathcal{H}$ (see, e.g., the eigenvalue decay rate defined in Condition (A)). Further studies reveal that when more regularity (or smoothness) of $f^*_\rho$ is assumed, KRR fails to achieve the information theoretical lower bound of the generalization error. More precisely, when $f^*_\rho$ is assumed to belong to some interpolation space $[\mathcal{H}]^{\alpha}$ of the RKHS $\mathcal{H}$ with $\alpha > 2$, the information theoretical lower bound of the generalization error is $n^{-\frac{\alpha}{\alpha+\beta}}$ (Rastogi & Sampath, 2017), while the best existing upper bound on the generalization error of KRR is $n^{-\frac{2}{2+\beta}}$ (Caponnetto & De Vito, 2007). This gap between the best existing KRR upper bounds and the information theoretical lower bounds of the generalization error has been widely observed in practice (e.g., Bauer et al. (2007); Gerfo et al. (2008)). It has been conjectured for decades that no matter how carefully one tunes KRR, the rate of the generalization error cannot be faster than $n^{-\frac{2}{2+\beta}}$ (Gerfo et al., 2008; Dicker et al., 2017). This phenomenon is often referred to as the saturation effect (Bauer et al., 2007), and we refer to the conjectured fastest generalization error rate $n^{-\frac{2}{2+\beta}}$ of KRR as the saturation lower bound.
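For concreteness, the closed-form expression $f^{\mathrm{KRR}}_{\lambda}(x) = K(x, X)\big(K(X, X) + n\lambda I\big)^{-1}\mathbf{y}$ above can be computed directly from the kernel matrix. The following is a minimal NumPy sketch; the Gaussian kernel, the bandwidth, the synthetic data, and the function names are illustrative choices made here for demonstration and are not part of the paper.

import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 * bandwidth^2)); the kernel choice is
    # illustrative -- the theory only assumes a kernel k with RKHS H.
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def krr_fit_predict(X_train, y_train, X_test, lam, bandwidth=1.0):
    # Closed-form KRR prediction: K(x, X) (K(X, X) + n * lam * I)^{-1} y.
    n = X_train.shape[0]
    K = gaussian_kernel(X_train, X_train, bandwidth)            # K(X, X), shape (n, n)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y_train)   # (K(X, X) + n*lam*I)^{-1} y
    K_test = gaussian_kernel(X_test, X_train, bandwidth)        # K(x, X), shape (m, n)
    return K_test @ alpha

if __name__ == "__main__":
    # Synthetic one-dimensional example, purely for illustration.
    rng = np.random.default_rng(0)
    n = 200
    X_train = rng.uniform(0.0, 1.0, size=(n, 1))
    y_train = np.sin(2 * np.pi * X_train[:, 0]) + 0.1 * rng.standard_normal(n)
    X_test = np.linspace(0.0, 1.0, 50)[:, None]
    preds = krr_fit_predict(X_train, y_train, X_test, lam=1e-3, bandwidth=0.2)
    print(preds[:5])

In this sketch, the regularization parameter lam plays the role of $\lambda$; the saturation effect concerns how fast the generalization error of this estimator can decay as $n$ grows, no matter how $\lambda$ is tuned.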
The main focus of this paper is to prove this long-standing conjecture.

RELATED WORK

KRR also belongs to the spectral regularization algorithms, a large class of kernel regression algorithms including kernel gradient descent, spectral cut-off, etc.; see, e.g., Rosasco et al. (2005); Bauer et al. (2007); Gerfo et al. (2008); Mendelson & Neeman (2010). The spectral regularization algorithms were originally proposed to solve linear inverse problems (Engl et al., 1996), where the saturation effect was first observed and studied (Neubauer, 1997; Mathé, 2004; Herdman et al., 2010). Since the spectral algorithms were introduced into statistical learning theory, the saturation effect has also been observed in practice and reported in the literature (Bauer et al., 2007; Gerfo et al., 2008). Research on spectral algorithms shows that the asymptotic performance of spectral algorithms is mainly determined by two ingredients (Bauer et al., 2007; Rastogi & Sampath, 2017; Blanchard & Mücke, 2018; Lin et al., 2018). One is the relative smoothness (regularity) of the regression function with respect to the kernel, which is also referred to as the source condition (see, e.g., Bauer et al. (2007, Section 2.3)). The other is the qualification of the spectral algorithm, a quantity describing the algorithm's fitting capability (see, e.g., Bauer et al. (2007, Definition 1)). It is widely believed that algorithms with low qualification cannot achieve the information theoretical lower bound when the regularity of $f^*_\rho$ is high. This is the (conjectural) saturation effect for spectral regularization algorithms (Bauer et al., 2007; Lin & Cevher, 2020; Lian et al., 2021). To the best of our knowledge, most works focus on showing that spectral regularization algorithms with high qualification can achieve better generalization error rates, while few works try to answer this conjecture directly (Gerfo et al., 2008; Dicker et al., 2017). The main focus of this paper is to provide a rigorous proof of the saturation effect of KRR, chosen for its simplicity and popularity. The technical tools introduced here might also help to resolve the saturation effect of other spectral algorithms.

Notation. Let us denote by $X = (x_1, \dots, x_n)$ the sample input matrix and by $\mathbf{y} = (y_1, \dots, y_n)'$ the sample output vector. We denote by $\mu$ the marginal distribution of $\rho$ on $\mathcal{X}$. Let $\epsilon_i := y_i - f^*_\rho(x_i)$ denote the noise.
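With this notation, the conjectured saturation lower bound discussed above can be written schematically as follows. This is only a loose sketch: constants, the expectation over the training sample, and the precise probabilistic formulation are suppressed, and it is not the paper's formal theorem statement.

% Schematic form of the conjectured saturation lower bound (informal sketch):
% even if f^*_rho lies in an interpolation space [H]^alpha with alpha > 2,
% no choice of the regularization parameter lambda > 0 makes the KRR
% generalization error decay faster than n^{-2/(2+beta)}.
\[
  \inf_{\lambda > 0}\;
  \mathbb{E}_{x \sim \mu}\Big( f^{\mathrm{KRR}}_{\lambda}(x) - f^{*}_{\rho}(x) \Big)^{2}
  \;\gtrsim\; n^{-\frac{2}{2+\beta}} .
\]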

