MINIMAX OPTIMAL KERNEL OPERATOR LEARNING VIA MULTILEVEL TRAINING

Abstract

Learning mappings between infinite-dimensional function spaces has achieved empirical success in many disciplines of machine learning, including generative modeling, functional data analysis, causal inference, and multi-agent reinforcement learning. In this paper, we study the statistical limit of learning a Hilbert-Schmidt operator between two infinite-dimensional Sobolev reproducing kernel Hilbert spaces (RKHSs). We establish an information-theoretic lower bound in terms of the Sobolev Hilbert-Schmidt norm and show that a regularization that learns the spectral components below the bias contour and ignores the ones above the variance contour achieves the optimal learning rate. At the same time, the spectral components between the bias and variance contours give us flexibility in designing computationally feasible machine learning algorithms. Based on this observation, we develop a multilevel kernel operator learning algorithm that is optimal for learning linear operators between infinite-dimensional function spaces.

1. INTRODUCTION

Supervised learning of operators between two infinite-dimensional spaces has attracted attention in several application areas of machine learning, including scientific computing (Lu et al., 2019; Li et al., 2020; de Hoop et al., 2021; Li et al., 2018; 2021b), functional data analysis (Crambes & Mas, 2013; Hörmann & Kidziński, 2015; Wang et al., 2020a), mean-field games (Guo et al., 2019; Wang et al., 2020b), conditional kernel mean embeddings (Song et al., 2009; 2013; Muandet et al., 2017), and econometrics (Singh et al., 2019; Muandet et al., 2020; Dikkala et al., 2020; Singh et al., 2020). Despite the empirical success of operator learning, the statistical limit of learning an infinite-dimensional operator has not been investigated. In this paper, we study the problem of learning Hilbert-Schmidt operators between infinite-dimensional Sobolev RKHSs $H_K^\beta$ and $H_L^\gamma$ with given kernels $k$ and $\ell$, respectively, where $\beta, \gamma \in [0,1)$ (Adams & Fournier, 2003; Christmann & Steinwart, 2008; Fischer & Steinwart, 2020). Our goal is to derive the optimal sample complexity for linear operator learning, i.e., how much data is required to achieve a given performance level.

We first establish an information-theoretic lower bound for learning a Hilbert-Schmidt operator between Sobolev spaces with respect to a general Sobolev norm. The lower bound indicates that the optimal learning rate is determined by the minimum of two polynomial rates: one is decided purely by the input Sobolev reproducing kernel Hilbert space and its evaluation norm, while the other is determined purely by the output space along with its evaluation norm. The rate is novel in that all existing results (Fischer & Steinwart, 2020; Li et al., 2022; de Hoop et al., 2021) only establish rates that depend on the parameters of the input space. The reason is that all previous works (Talwai et al., 2022; Li et al., 2022; de Hoop et al., 2021) only consider output spaces that are subspaces of a trace-bounded reproducing kernel Hilbert space, not general Sobolev spaces. We refer to Remark 2.1 for a detailed comparison.

To design a learning algorithm for approximating an infinite-dimensional operator, we need to learn a finite-dimensional restriction instead of the whole operator, as the latter would result in infinite variance. The finite-dimensional restriction introduces a bias but decreases the variance. A natural task is then to study the shape of regularization that leads to the optimal bias-variance trade-off and achieves the optimal learning rate. In this paper, we consider the bias and variance contours at the scale of the optimal learning rate: once the regularization learns all the spectral components below the bias contour and ignores those above the variance contour, the learning is optimal. Finally, utilizing the region between the bias and variance contours, we develop a multilevel training algorithm (Lye et al., 2021; Li et al., 2021a) that first learns the mapping on low frequencies and then successively fine-tunes the machine learning models to fit the high-frequency components of the output. The intuition behind our algorithm aligns with the original motivation of multilevel Monte Carlo (Giles, 2008; 2015): we use each new level to reduce bias while keeping the variance at the same scale. We demonstrate that such a multilevel algorithm achieves the optimal non-parametric rate for linear operator learning.

1.1. RELATED WORK

Machine Learning Based PDE Solvers. Solving partial differential equations (PDEs) plays a prominent role in many scientific and engineering disciplines, such as physics, chemistry, operations management, and macroeconomics. The recent deep learning breakthrough has drawn attention to solving PDEs via machine learning methods (Raissi et al., 2019; Han et al., 2018; Sirignano & Spiliopoulos, 2018; Yu et al., 2018; Khoo et al., 2019; Chen et al., 2021). The statistical power and computational cost of these methods are well studied in recent papers (Lu et al., 2021; 2022; Nickl et al., 2020; Nickl & Wang, 2020). This paper focuses on operator learning (Chen & Chen, 1995; Long et al., 2018; 2019; Feliu-Faba et al., 2020; Khoo et al., 2021; Lu et al., 2019; Li et al., 2020; Kovachki et al., 2021; Stepaniants, 2021), i.e. learning a map between two infinite-dimensional function spaces. For example, one can learn a PDE solver that maps a boundary condition to the solution, or an inverse problem that maps a boundary measurement to the coefficient field. Regarding the mathematical foundation of operator learning, (Liu et al., 2022) considers the learning rate of non-parametric operator learning. However, non-parametric functional data analysis often suffers from slower-than-polynomial convergence rates (Mas, 2012) due to the small ball probability problem for probability distributions in infinite-dimensional spaces (Delaigle & Hall, 2010). The most relevant works are (Lin et al., 2011; Reimherr, 2015; de Hoop et al., 2021), which consider rates for learning a linear operator. For a comparison between our work and (de Hoop et al., 2021), see Remark 2.1.

Learning with Kernels. Supervised least-squares regression in RKHSs and its generalization capability have been thoroughly studied (Caponnetto & De Vito, 2007; Smale & Zhou, 2007; De Vito et al., 2005; Rosasco et al., 2010; Mendelson & Neeman, 2010). Minimax optimality with respect to the Sobolev norm has been discussed recently in (Fischer & Steinwart, 2020; Liu & Li, 2020; Lu et al., 2022). Our paper is closely related to recent works (Schuster et al., 2020; Mollenhauer & Koltai, 2020; Talwai et al., 2022; Li et al., 2022; Park & Muandet, 2020; Singh et al., 2019; 2020) on identifying the Sobolev-norm learning rate for the kernel mean embedding (Song et al., 2009; 2013; Muandet et al., 2017), which can also be formulated as learning an operator. For the difference between our work and (Talwai et al., 2022; Li et al., 2022), see Remark 2.1. A concurrent paper (Balasubramanian et al., 2022) considers a unified RKHS methodology for functional data analysis; our paper provides a refined analysis and information-theoretically optimal rates for this problem.

Multilevel Monte Carlo. By combining biased estimators with multiple step sizes, multilevel Monte Carlo (MLMC) (Giles, 2008; 2015) dramatically improves the rate of convergence and, in many settings, achieves the canonical square-root convergence rate associated with unbiased Monte Carlo (Rhee & Glynn, 2015; Blanchet & Glynn, 2015). Multilevel Monte Carlo can also be used for random variables with infinite variance (Blanchet & Liu, 2016; Chen et al., 2020). To the best of our knowledge, this is the first paper that provides optimal sample complexity for a multilevel Monte Carlo type algorithm for infinite-variance problems in the non-parametric regime.
Very recently, Lye et al. (2021) and Li et al. (2021a) developed a multilevel machine learning Monte Carlo (ML2MC) / multilevel fine-tuning algorithm for learning solution maps by first learning the map on the coarsest grid and then successively fine-tuning the network on samples generated at finer grids. The authors also showed that, following the telescoping argument in MLMC, the multilevel training procedure can reduce the generalization error without spending more time generating training samples. (Schäfer & Owhadi, 2021; Boullé et al., 2022) consider such multiscale algorithms for learning Green's functions. However, the statistical power of such algorithms is still under investigation. Another difference from (Boullé et al., 2022) is that we consider the Green's function in the $H^{-1}$ norm rather than the $\ell^1$ norm used in (Boullé et al., 2022). In this paper, we identify a specific setting where this multilevel procedure can achieve, and is necessary for achieving, the minimax optimal learning rate.

1.2. CONTRIBUTION

• We derive a novel information-theoretic lower bound for learning a linear operator between two infinite-dimensional Sobolev reproducing kernel Hilbert spaces. The optimal learning rate is the minimum of two polynomial rates, one depending only on the parameters of the input space and the other only on the parameters of the output space. The first rate aligns with previous work (Li et al., 2022), while the second is new to the literature.
• We study the shape of regularization that leads to the optimal learning rate. One should learn all the spectral components under the bias contour at the level of the optimal learning rate, but none of the spectral components above the variance contour at that level. This enables the estimator to enjoy an optimal bias-variance balance.
• We identify a specific setting where a multilevel training procedure (Lye et al., 2021; Li et al., 2021a) is necessary and capable of achieving the minimax optimal learning rate for learning a linear operator. We achieve the optimal learning rate via an ensemble of O(ln ln n) ridge regression models. This differs from finite-dimensional operator learning, where a single-level estimator can be optimal.

2. PROBLEM FORMULATION

2.1. PRELIMINARIES

Let $P_K$ be a distribution over the input space $H_K$ and define the covariance operator $C_{KK} = \mathbb{E}_{u\sim P_K} u \otimes u$. Consider its spectral decomposition $C_{KK} = \sum_{i=1}^{+\infty} \mu_i^2\, e_i \otimes e_i$, where $\{\mu_i^{1/2} e_i\}_{i=1}^{+\infty}$ is an orthonormal eigenbasis and $\{\mu_i\}$ are the corresponding eigenvalues of $C_{KK}$ (here $g \otimes h$ denotes the operator $g \otimes h = g h^*: f \mapsto \langle f, h\rangle g$). In typical machine learning applications the test distribution is the same as the training distribution, so we can assume that $H_K = \{\sum_i a_i \mu_i^{1/2} e_i : \{a_i\}_{i=1}^{\infty} \in \ell^2\}$ without loss of generality. Note that this automatically holds in the context of learning the conditional mean embedding (CME) (Fischer & Steinwart, 2020; Talwai et al., 2022; Li et al., 2022). Following Christmann & Steinwart (2008); Fischer & Steinwart (2020), we define the interpolation Sobolev space $H_K^\beta = \{f = \sum_i a_i (\mu_i^{\beta/2} e_i) : \{a_i\}_{i=1}^{\infty} \in \ell^2\}$ for any $\beta > 0$, equipped with the Sobolev norm induced by the inner product $\langle \sum_i a_i (\mu_i^{\beta/2} e_i), \sum_i b_i (\mu_i^{\beta/2} e_i)\rangle_{H_K^\beta} = \sum_i a_i b_i$. For the output space, we fix a user-specified distribution $Q_L$ and a reproducing kernel Hilbert space $H_L$. We can similarly define the covariance operator $C_{Q_L}$ and the Sobolev space $H_L^\gamma$. Natural choices of $Q_L$ include distributions on the kernel functions $\{\ell(y,\cdot): y \in Y\}$ of $H_L$ induced by a distribution on $Y$, so that $C_{Q_L}$ is a kernel integral operator and $H_L^\gamma$ is an interpolation space between $H_L$ and $L^2(Q_L)$; see Example 2.1 for a specific example. Following Li et al. (2022), in this paper we measure all operators in the Hilbert-Schmidt norm between two Sobolev spaces, defined as follows.

Definition 2.1 ((β, γ)-norm) Let $T: H_K \to H_L$ be a possibly unbounded linear operator. Let $I_{1,\beta,P_K}: H_K \to H_K^\beta$, $\beta \in (0,1)$, be the canonical embedding that takes $u \in H_K$ to the same element $u$ in the larger space $H_K^\beta$, and let $I_{1,\gamma,Q_L}: H_L \to H_L^\gamma$, $\gamma < 1$, be defined similarly. Then the $(\beta,\gamma)$-norm of $T$ is defined as
$$\|T\|_{\beta,\gamma} = \big\| (I_{1,\gamma,Q_L}^*)^{\dagger} \circ T \circ I_{1,\beta,P_K}^* \big\|_{HS(H_K^\beta, H_L^\gamma)} = \big\| C_{Q_L}^{-(1-\gamma)/2} \circ T \circ C_{KK}^{(1-\beta)/2} \big\|_{HS(H_K, H_L)},$$
where we omit the dependence of $\|\cdot\|_{\beta,\gamma}$ on $P_K$ and $Q_L$ since it will always be clear from context.

2.2. PROBLEM FORMULATION

We consider the problem of learning an unknown linear operator $A_0: H_K \to H_L$ between two reproducing kernel Hilbert spaces with kernels $k$ and $\ell$, respectively. We are given $N$ noisy data pairs $(u_i, v_i)$, $1 \leq i \leq N$, related by
$$v_i = A_0 u_i + \varepsilon_i, \qquad (1)$$
where $u_i \overset{\text{i.i.d.}}{\sim} P_K$ for some unknown distribution $P_K$ and $\varepsilon_i$ is noise drawn from some zero-mean distribution that may depend on $u_i$. We write $P_{KL}$ for the joint distribution of $(u_i, v_i)$. Let $C_{KK} = \mathbb{E}_{u\sim P_K} u \otimes u$, $C_{KL} = \mathbb{E}_{(u,v)\sim P_{KL}} u \otimes v$, and its adjoint $C_{LK} = C_{KL}^*$ denote the uncentered (cross-)covariance operators associated with $P_{KL}$. Then the ground-truth operator can be reformulated as $A_0 = C_{LK} C_{KK}^{\dagger}$, where $\dagger$ denotes the pseudo-inverse (Talwai et al., 2022; Li et al., 2022). With the goal of understanding the relative difficulty of learning different types of linear operators, we investigate the sample efficiency of learning $A_0$ under source assumptions imposed on the data model (1). A source condition (Caponnetto & De Vito, 2007; Mendelson & Neeman, 2010; Steinwart et al., 2009; Rosasco et al., 2010; Fischer & Steinwart, 2020) assumes that the learning target lies in a parameterized function class, which allows one to study the learning rate for problems of different hardness. Specifically, the source condition assumes that the learning target is bounded in a certain Sobolev norm. In this paper, we consider learning an operator with bounded $(\beta,\gamma)$-norm, i.e. bounded Hilbert-Schmidt norm as a map from $H_K^\beta$ to $H_L^\gamma$. We measure the generalization error / convergence rate under another $(\beta',\gamma')$-norm, as in Fischer & Steinwart (2020); Lu et al. (2022); Talwai et al. (2022); Li et al. (2022).

Remark 2.1 Although recent works have considered similar problems in the context of conditional mean embeddings (Talwai et al., 2022; Li et al., 2022) and functional data analysis (de Hoop et al., 2021), in all of these papers the output space is a trace-bounded RKHS (de Hoop et al., 2021, Assumption 2.14 (vi)) rather than the general parameterized Sobolev space considered here.

We now list the assumptions imposed on the underlying kernels for our theoretical results. We follow the standard capacity and embedding assumptions used in kernel regression (Fischer & Steinwart, 2020; Talwai et al., 2022; Li et al., 2022).

Assumption 2.1 (Capacity Condition of the Covariance) The eigenvalues $\{\mu_i\}_{i\geq 1}$ of the covariance operator $C_{KK} = \mathbb{E}_{u\sim P_K} u\otimes u$ satisfy $\mu_i \propto i^{-1/p}$ for some $p \in (0,1)$. Similarly, the eigenvalues $\{\rho_i\}_{i\geq 1}$ of the covariance operator $C_{Q_L} = \mathbb{E}_{v\sim Q_L} v\otimes v$ satisfy $\rho_i \propto i^{-1/q}$ for some $q \in (0,1)$.

Assumption 2.2 ($\ell^\infty$ Embedding Property of the Input RKHS) There exists a smallest $\alpha \in (0,1)$ such that $\|(I_{1,\alpha,P_K}^*)^{\dagger} f\|_{H_K^\alpha} \leq A_1$ holds $P_K$-almost surely for some $A_1 < +\infty$.

Assumption 2.3 ($\ell^\infty$ Embedding Property of the Output RKHS) There exists $A_2 < +\infty$ such that $\|g\|_{H_L} \leq A_2$ holds for all $g$ in the range of $A_0$, except on a $Q_L$-null set.

Assumption 2.4 (Moment Condition) There exists an operator $V: H_L \to H_L$ with $\mathrm{tr}(V) \leq \sigma^2$ such that for every $u \in H_K$,
$$\mathbb{E}_{v\sim P_{KL}(\cdot\mid u)}\big((v - A_0 u)\otimes(v - A_0 u)\big)^k \preceq \tfrac12 (2k)!\, R^{2k-2}\, V$$
holds for all $k \geq 2$.

Assumption 2.5 (Source Condition) $A_0$ is bounded in the $(\beta,\gamma)$-norm, i.e. $\|A_0\|_{\beta,\gamma} \leq B$.
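To make the data model (1) and the reformulation $A_0 = C_{LK} C_{KK}^{\dagger}$ concrete, the following minimal NumPy sketch simulates the model in a truncated spectral basis and forms the empirical (cross-)covariances; the dimensions, decay exponents, noise level, and coefficient matrix are illustrative assumptions, not quantities from the paper. The instability of the unregularized plug-in estimate is what motivates the regularization schemes of Section 4.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and spectral decays (assumptions, not values from the paper).
d_in, d_out, N = 200, 200, 500               # truncated input/output bases, sample size
p, q = 0.5, 0.5                               # capacity parameters as in Assumption 2.1
mu = np.arange(1, d_in + 1) ** (-1.0 / p)     # eigenvalues of C_KK
rho = np.arange(1, d_out + 1) ** (-1.0 / q)   # eigenvalues of C_{Q_L}

# A ground-truth coefficient matrix with summable squares (a stand-in for Assumption 2.5).
A0 = rng.standard_normal((d_out, d_in)) / np.outer(np.arange(1, d_out + 1),
                                                   np.arange(1, d_in + 1))

# Data model (1): u_i ~ P_K with covariance diag(mu), v_i = A0 u_i + eps_i.
U = rng.standard_normal((N, d_in)) * np.sqrt(mu)   # coordinates of u_i
noise = 0.1 * rng.standard_normal((N, d_out))
V = U @ A0.T + noise                               # coordinates of v_i

# Empirical (cross-)covariances and the naive plug-in estimate A0 ~ C_LK C_KK^+.
C_KK_hat = U.T @ U / N
C_LK_hat = V.T @ U / N
A_naive = C_LK_hat @ np.linalg.pinv(C_KK_hat)      # unstable without regularization
print("relative error of naive estimate:",
      np.linalg.norm(A_naive - A0) / np.linalg.norm(A0))
```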

2.3. EXAMPLES

In this section, we present two examples covered by our theory. The first concerns learning a differential operator, for example inferring an advection-diffusion model (Portone & Moser, 2022).

Example 2.1 Suppose that the ground-truth operator is $A_0 = \Delta^t$, where $\Delta$ is the Laplacian and $t \in \mathbb{Z}$. Let $H_K = H^{m+2t}([0,1])$ be the Sobolev space with smoothness $m+2t$ on $[0,1]$ and $H_L = H^m([0,1])$; then $A_0$ is a bounded operator from $H_K$ to $H_L$, which corresponds to the case $\beta = \gamma = 1$. However, we will see below that our theory gives a finer characterization of the learning error. Consider, for example, an input with mean zero and Matérn-type covariance operator $C_{KK} = \sigma^2(-\Delta + \tau^2 I)^{-s}$. Its eigenvalues satisfy $\mu_n \propto n^{-2s}$. On the other hand, we choose $Q_L$ to be the distribution supported on $\{\ell(y,\cdot): y \in [0,1]\}$ induced by the uniform distribution on $[0,1]$, where $\ell$ is the kernel function of $H_L$. Then $C_{Q_L}$ is essentially the kernel integral operator on $H_L$ w.r.t. the uniform distribution, and its eigenvalues are $\rho_n \propto n^{-2m}$. The assumption $\|A_0\|_{\beta,\gamma} < +\infty$ is satisfied if and only if $(1-\gamma)m < (1-\beta)s - \tfrac12$ (i.e. $\gamma > 1 - \frac{2(1-\beta)s - 1}{2m}$).

Example 2.2 (Conditional mean embedding) Suppose that we would like to learn the conditional distribution $P(y\mid x)$ from a data set $\{(x_i, y_i): 1 \leq i \leq N\} \subset X \times Y$, where $x_i \overset{\text{i.i.d.}}{\sim} P_K$. Let $H_K$ and $H_L$ be two RKHSs on $X$ and $Y$, respectively, with measurable kernels $k(\cdot,\cdot)$ and $\ell(\cdot,\cdot)$. Then we can define a conditional mean embedding (CME) operator $C_{Y|X}$ satisfying $C_{Y|X} k(x,\cdot) = \mathbb{E}_{Y|x}\,\ell(Y,\cdot) =: \mu_{Y|x}$, so that $\mathbb{E}_{Y|x}\, g(Y) = \langle g, \mu_{Y|x}\rangle$ for all $x \in X$. We choose $A_0 = C_{Y|X}$. In this case, $C_{KK} = \mathbb{E}_{P_K} k(X,\cdot)\otimes k(X,\cdot)$. Assumption 2.2 states that $\sup_{x\in X} k_\alpha(x,x) = A_1$, while Assumption 2.3 is equivalent to $\sup_{x\in X}\|\mu_{Y|x}\| \leq A_2$ (for simplicity we only focus on the case $\zeta = 1$). According to Assumption 2.5, we assume that $\|C_{Y|X}\|_{\beta,\gamma} \leq B$. The mis-specified setting where $\beta < 1$ has been studied in previous work (Fischer & Steinwart, 2020; Talwai et al., 2022; Li et al., 2022), but only for $\gamma = 1$. Our results also cover the case $\gamma < 1$, which yields theoretical guarantees for computing conditional expectations of the larger function class $H_L^\gamma$.
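Example 2.2 can be instantiated in a few lines. The sketch below is a standard sample-based conditional mean embedding estimator with a single ridge level; the Gaussian kernel, bandwidth, ridge parameter, and the toy conditional model are assumptions made only for illustration, whereas our analysis concerns the per-direction regularization introduced in Section 4.

```python
import numpy as np

def gauss_kernel(a, b, s=0.5):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s ** 2))

rng = np.random.default_rng(1)
N = 300
X = rng.uniform(-1.0, 1.0, size=(N, 1))
Y = np.sin(np.pi * X) + 0.1 * rng.standard_normal((N, 1))   # toy conditional model P(y | x)

lam = 1e-3                                                   # single regularization level
K = gauss_kernel(X, X)
x_test = np.array([[0.5]])
w = np.linalg.solve(K + N * lam * np.eye(N), gauss_kernel(X, x_test))  # weights of mu_{Y|x}

# E[g(Y) | x] is approximated by sum_i w_i(x) g(y_i); take g(y) = y:
print("estimated E[Y | x=0.5]:", w[:, 0] @ Y[:, 0], " truth:", np.sin(0.5 * np.pi))
```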

3. INFORMATION THEORETIC LOWER BOUND

In this section, we provide an information-theoretic lower bound, via Fano's method, for the convergence rate of the operator learning problem formulated in Section 2.

Theorem 3.1 Suppose that $P_K$ and $Q_L$ are probability distributions on the Hilbert spaces $H_K$ and $H_L$, respectively, such that Assumptions 2.1 and 2.2 hold. Then for any estimator $L: (H_K \times H_L)^{\otimes N} \to HS(H_K^\beta, H_L^\gamma)$, there exists a linear operator $A_0$ and a joint data distribution $P_{KL}$ with marginal distribution $P_K$ on $H_K$ satisfying Assumptions 2.3 to 2.5, such that with probability $\geq 0.99$ over $(u_i, v_i) \overset{\text{i.i.d.}}{\sim} P_{KL}$ we have
$$\Big\| L\big(\{(u_i, v_i)\}_{i=1}^N\big) - A_0 \Big\|_{\beta',\gamma'}^2 \gtrsim N^{-\min\left\{\frac{\max\{\alpha,\beta\}-\beta'}{\max\{\alpha,\beta\}+p},\ \frac{\gamma'-\gamma}{1-\gamma}\right\}}.$$

Remark 3.1 Our lower bound is the minimum of two parts. The first rate, $N^{-\frac{\max\{\alpha,\beta\}-\beta'}{\max\{\alpha,\beta\}+p}}$, is the minimax optimal Sobolev learning rate for kernel regression (Fischer & Steinwart, 2020; Talwai et al., 2022; Li et al., 2022; Lu et al., 2022) and is fully determined by the parameters of the input Sobolev reproducing kernel Hilbert space. The second rate, $N^{-\frac{\gamma'-\gamma}{1-\gamma}}$, is new to the literature. It shows how the infinite-dimensional problem differs from finite-dimensional regression and is fully determined by the parameters of the output Sobolev reproducing kernel Hilbert space. Our lower bound shows that the hardness of learning a linear operator is determined by the harder of the input and output spaces. We explain why the lower bound has this structure in Remark 4.2 and Figure 2.
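The exponent in Theorem 3.1 is elementary to evaluate. The short helper below (with illustrative parameter values that are assumptions, not values from the paper) returns it and reports whether the input or the output space is the bottleneck.

```python
def minimax_exponent(alpha, beta, beta_p, gamma, gamma_p, p):
    """Exponent of N in the lower bound of Theorem 3.1 (for the squared (beta', gamma')-norm)."""
    input_rate = (max(alpha, beta) - beta_p) / (max(alpha, beta) + p)
    output_rate = (gamma_p - gamma) / (1.0 - gamma)
    return min(input_rate, output_rate), ("input" if input_rate <= output_rate else "output")

# Two illustrative regimes: the bottleneck can sit on either side.
print(minimax_exponent(alpha=0.3, beta=0.5, beta_p=0.0, gamma=0.0, gamma_p=0.9, p=0.5))
print(minimax_exponent(alpha=0.3, beta=0.5, beta_p=0.0, gamma=0.0, gamma_p=0.2, p=0.5))
```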

4. ON THE SHAPE OF REGULARIZATION

In this section, we aim to understand the shape of regularization for which an estimator $\hat A$ built from $N$ i.i.d. data $\{(u_i, v_i)\}_{i=1}^N \sim P_{KL}^{\otimes N}$ enjoys an optimal learning rate. In contrast to existing approaches, where a regularized least-squares estimator achieves statistical optimality under the $(\beta',1)$-norm (Fischer & Steinwart, 2020; Talwai et al., 2022; Li et al., 2022; de Hoop et al., 2021), we study the learning rate under the $(\beta',\gamma')$-norm ($\beta' \in (0,\beta)$, $\gamma' \in (\gamma,1)$), which by Definition 2.1 equals
$$\|\hat A - A_0\|_{\beta',\gamma'} = \big\| C_{Q_L}^{-(1-\gamma')/2}\big(\hat A - A_0\big) C_{KK}^{(1-\beta')/2}\big\|_{HS(H_K, H_L)}.$$
The additional factor $C_{Q_L}^{-(1-\gamma')/2}$ has unbounded norm, which makes our setting harder than convergence in the $(\beta',1)$-norm studied in existing works. Since $C_{Q_L}^{-(1-\gamma')/2}$ is bounded when restricted to the finite-dimensional space $\mathrm{span}\{\rho_i^{1/2} f_i : 1 \leq i \leq n\}$, we should also introduce a bias-variance trade-off via regularization in the output space. As a result, we are interested in the following question: what is the optimal way to combine regularization in the input space with regularization in the output space, i.e. what is the optimal shape of regularization? To answer this question, we work in the spectral domain, i.e. we consider the spectral representation of the operator $A_0 = \sum_{i,j=1}^{+\infty} a_{ij}\,\big(\mu_i^{\beta/2} e_i\big)\otimes\big(\rho_j^{(1-\gamma)/2} f_j\big)$. The problem of estimating $A_0$ then reduces to learning the coefficient "matrix" $(a_{ij})_{i,j=1}^{\infty}$. The source condition Assumption 2.5 enforces $\sum_{i,j=1}^{\infty} a_{ij}^2 \leq B^2$. We show in Appendix B.1.1 that regularizing out the basis element $e_i \otimes f_j$ introduces a bias of order
$$\big\| a_{ij}\,\mu_i^{\beta/2} e_i \otimes \rho_j^{(1-\gamma)/2} f_j\big\|_{\beta',\gamma'}^2 = a_{ij}^2\,\mu_i^{\beta-\beta'}\rho_j^{\gamma'-\gamma} \propto i^{-\frac{\beta-\beta'}{p}}\, j^{-\frac{\gamma'-\gamma}{q}}$$
under the $(\beta',\gamma')$-norm. On the other hand, when $\alpha \leq \beta + p$, we show in Appendix B.1.2 that the variance of learning coordinate $(i,j)$ from noisy data scales as $\frac{1}{N}\mu_i^{-\beta'}\rho_j^{-(1-\gamma')} \propto \frac{1}{N}\, i^{\beta'/p}\, j^{(1-\gamma')/q}$. Since the variance accumulates over $i$ for a fixed $j$, learning $(i,j)$ for all $i \leq i_{\max}$ results in a variance of order $\frac{1}{N}\, i_{\max}^{(\beta'+p)/p}\, j^{(1-\gamma')/q}$. (A similar analysis can be carried out for the $\alpha > \beta + p$ case, but the variance then scales as $\frac{1}{N}\, i_{\max}^{(\beta'+\alpha-\beta)/p}\, j^{(1-\gamma')/q}$; see Appendix B for detailed derivations.) In summary, we need to make a bias-variance trade-off in the $(i,j)$-plane, i.e. decide whether to learn or to regularize out each basis element $e_i \otimes f_j$.
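The trade-off just described can be tabulated directly. The sketch below evaluates the per-coordinate bias and the accumulated variance scalings on a grid of $(i, j)$, following the $\alpha \leq \beta + p$ case above; all parameter values are illustrative assumptions, and the final comparison is only a rough illustration of the trade-off rather than the paper's exact contour construction.

```python
import numpy as np

# Illustrative parameters (assumptions for the sketch, not values from the paper).
p, q = 0.5, 0.5
beta, beta_p = 0.5, 0.0        # beta and beta'
gamma, gamma_p = 0.0, 0.4      # gamma and gamma'
N = 10_000
I_max, J_max = 200, 200

i = np.arange(1, I_max + 1)[:, None]   # input frequency index
j = np.arange(1, J_max + 1)[None, :]   # output frequency index

# Bias of regularizing out the (i, j) component (up to the coefficient a_ij^2):
bias = i ** (-(beta - beta_p) / p) * j ** (-(gamma_p - gamma) / q)
# Accumulated variance of learning all components (i', j) with i' <= i:
variance = (1.0 / N) * i ** ((beta_p + p) / p) * j ** ((1.0 - gamma_p) / q)

# Heuristic: keep a coordinate only while its accumulated variance stays below its bias.
worth_learning = variance <= bias
print("fraction of the grid worth learning:", worth_learning.mean())
```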

4.1. REGULARIZATION VIA VARIANCE CONTOUR

The underlying idea of regularization is that some components are intrinsically hard to learn due to large variance; these components are neglected by adding regularization and are counted as bias, while the remaining components are easy to learn because their variance is controllable. This intuition works well when the estimation error comes from the noise in the data and is well studied in a line of works (Fischer & Steinwart, 2020; de Hoop et al., 2021; Talwai et al., 2022; Li et al., 2022). The idea still works in our setting, but we need to re-evaluate the bias and variance of each component. Since we work with the Hilbert-Schmidt norm, this can be done coordinate-wise: we can look at each $a_{ij}$ separately and decide whether to neglect it (contributing to the bias) or to learn it from data (contributing to the variance). Since the variance term measures the hardness of learning, we introduce the notion of a variance contour, a curve in the plane $\mathbb{R}_+^2$ on which all points induce the same order of variance (here we work with real coordinates for convenience, although we only care about integer points). Formally, we fix an arbitrary constant $C > 0$ and define
$$\ell_{C,\mathrm{var}} = \Big\{ (x, y) \in \mathbb{R}_+^2 : x^{\frac{\beta'+\max\{\alpha-\beta, p\}}{p}}\, y^{\frac{1-\gamma'}{q}} = C \Big\}. \qquad (2)$$
A reasonable regularization scheme is then to learn all coordinates $(i,j) \in \mathbb{Z}_+^2$ below the curve $\ell_{C,\mathrm{var}}$ and regularize out the remaining coordinates, which are difficult to learn due to their large variance. This gives the estimator with the smallest bias at a given variance level. This observation motivates us to construct the estimator
$$\hat A = \sum_{j=1}^{y_N}\big(\rho_j^{1/2} f_j \otimes \rho_j^{1/2} f_j\big)\, \hat C_{LK}\big(\hat C_{KK} + \lambda_j I\big)^{-1}, \qquad (3)$$
where $\hat C_{LK} = \frac{1}{N}\sum_{i=1}^N v_i \otimes u_i$ and $\lambda_j$ ($1 \leq j \leq y_N = C^{\frac{q}{1-\gamma'}}$) are the regularization coefficients imposed on the different dimensions of the output space. According to (2), and noting that $\mu_i \propto i^{-1/p}$, we define
$$\lambda_j = \max\left\{ \Big( j^{-\frac{1-\gamma'}{q}}\, N^{\max\left\{1-\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}},\ \frac{1-\gamma'}{1-\gamma}\right\}}\Big)^{-\frac{1}{\beta'+p}},\ \Big(\frac{c_0 N}{\log N}\Big)^{-\frac{1}{\alpha}} \right\}, \qquad (4)$$
with $C = N^{\max\left\{1-\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}},\ \frac{1-\gamma'}{1-\gamma}\right\}}$ in (2). The additional $N^{-1/\alpha}$ term in (4) is needed to control the error of approximating $C_{KK}$ by $\hat C_{KK}$ (cf. Theorem D.3) and is standard in the Sobolev learning literature (Fischer & Steinwart, 2020; Talwai et al., 2022; Lu et al., 2022). The following theorem describes the convergence rate of the estimator defined by (3) and (4).

Theorem 4.1 Consider the estimator $\hat A$ defined by (3) and (4). Suppose that Assumptions 2.1 to 2.5 hold. Then there exists a universal constant $C$ such that, with probability $\geq 1 - e^{-\tau}$,
$$\|\hat A - A_0\|_{\beta',\gamma'}^2 \leq C\tau^2\Big(\frac{N}{\log N}\Big)^{-\min\left\{\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}},\ \frac{\gamma'-\gamma}{1-\gamma}\right\}}\log^2 N.$$

Remark 4.1 Compared with Theorem 3.1, our upper bound is optimal up to logarithmic factors when $\alpha \leq \beta$. The optimal learning rate in the regime $\alpha > \beta$ has been an open problem for decades, even without the additional problem-dependent parameters $\gamma, \gamma'$ (see e.g. the discussion following (Fischer & Steinwart, 2020, Theorem 2)), and we do not resolve it in this paper either.

Figure 1: An illustration of our proposed regularization scheme. Left: the regularized least-squares estimator studied in previous works (Fischer & Steinwart, 2020; de Hoop et al., 2021; Talwai et al., 2022), which only regularizes the input space. Right: our double regularization scheme via the variance contour, which achieves the optimal convergence rate in our setting.
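The following NumPy sketch implements the estimator (3) with the schedule (4) in a truncated coordinate representation. The problem parameters, the synthetic data, and all constants are illustrative assumptions made so that the schedule can be evaluated; they are not values or experiments from the paper.

```python
import numpy as np

def variance_contour_estimator(U, V, lambdas):
    """Estimator (3) in coordinates: output direction j gets its own ridge level lambda_j;
    output directions beyond len(lambdas) are regularized out entirely."""
    N, d_in = U.shape
    C_KK_hat = U.T @ U / N                 # empirical covariance of the inputs
    C_LK_hat = V.T @ U / N                 # empirical cross-covariance
    A_hat = np.zeros((V.shape[1], d_in))
    for j, lam in enumerate(lambdas):
        A_hat[j] = np.linalg.solve(C_KK_hat + lam * np.eye(d_in), C_LK_hat[j])
    return A_hat

# Illustrative parameters and the schedule (4); constants are assumptions, not tuned values.
p, q, alpha = 0.5, 0.5, 0.3
beta, beta_p, gamma, gamma_p = 0.5, 0.0, 0.0, 0.4
N, d_in = 2_000, 150
eta2 = max(1 - (beta - beta_p) / max(alpha, beta + p), (1 - gamma_p) / (1 - gamma))
y_N = int(N ** (q / (1 - gamma_p) * eta2))
j = np.arange(1, y_N + 1)
floor = (N / np.log(N)) ** (-1.0 / alpha)      # the N^(-1/alpha) floor term in (4)
lambdas = np.maximum((j ** (-(1 - gamma_p) / q) * N ** eta2) ** (-1.0 / (beta_p + p)), floor)

# Synthetic data as in the sketch of Section 2.2 (illustrative assumptions only).
rng = np.random.default_rng(0)
mu = np.arange(1, d_in + 1) ** (-1.0 / p)
A0 = rng.standard_normal((y_N, d_in)) / np.outer(np.arange(1, y_N + 1), np.arange(1, d_in + 1))
U = rng.standard_normal((N, d_in)) * np.sqrt(mu)
V = U @ A0.T + 0.1 * rng.standard_normal((N, y_N))
A_hat = variance_contour_estimator(U, V, lambdas)
print("relative Frobenius error:", np.linalg.norm(A_hat - A0) / np.linalg.norm(A0))
```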

4.2. REGULARIZATION VIA BIAS CONTOUR

We have shown that learning all spectral components under a certain variance contour and regularizing out all other components achieves the optimal rate. In this section, we introduce another scheme that designs the optimal estimator by learning all spectral components under a certain bias contour. Specifically, we choose the regularization strength according to the level of bias that each spectral element induces, i.e. according to the bias contour
$$\ell_{C',\mathrm{bias}} = \Big\{ (x, y) \in \mathbb{R}_+^2 : x^{\frac{\beta-\beta'}{p}}\, y^{\frac{\gamma'-\gamma}{q}} = C' \Big\}.$$
In general, $\ell_{C',\mathrm{bias}}$ does not coincide with $\ell_{C,\mathrm{var}}$ for any $C'$, even up to constant scaling. Thus, there exists a point $(x^*, y^*)$ on the variance contour with maximal contribution to the bias. Naturally, we can also construct our estimator using the bias contour that passes through $(x^*, y^*)$. In this case, we may define
$$\lambda_j = \max\left\{ \Big( j^{-\frac{\gamma'-\gamma}{q}}\, N^{\min\left\{\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}},\ \frac{\gamma'-\gamma}{1-\gamma}\right\}}\Big)^{-\frac{1}{\beta-\beta'}},\ \Big(\frac{c_0 N}{\log N}\Big)^{-\frac{1}{\alpha}} \right\}$$
for reasons similar to Section 4.1, which also yields the optimal rate, as stated in Theorem 4.2 below.

Remark 4.2 (On the optimal shape of regularization) The discussion in Sections 4.1 and 4.2 offers another reading of our information-theoretic lower bound. First, we should learn all spectral components under the bias contour; otherwise the bias exceeds the lower bound. Second, we should not learn any spectral component above the variance contour; otherwise the variance exceeds the lower bound. Thus the bias contour must lie under the variance contour, since otherwise no estimator can be designed. The bias and variance contours at the level of the optimal learning rate are plotted in Figure 2. They meet only at $(x^*, y^*)$ with $x^* = 1$ or $y^* = 1$, which has the largest contribution to the bias (resp. variance) among all points on the variance (resp. bias) contour, thus dominating the estimation error. When the two curves meet at $y^* = 1$, the problem reduces to the original kernel regression case; when they meet at $x^* = 1$, it leads to our new rate, which depends on the output space.

Theorem 4.2 Consider the estimator $\hat A$ defined by (3) with $\lambda_j$ defined above. Suppose that Assumptions 2.1 to 2.5 hold. Then there exists a universal constant $C$ such that
$$\|\hat A - A_0\|_{\beta',\gamma'}^2 \leq C\tau^2\Big(\frac{N}{\log N}\Big)^{-\min\left\{\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}},\ \frac{\gamma'-\gamma}{1-\gamma}\right\}}\log^2 N$$
holds with probability $\geq 1 - e^{-\tau}$.
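For comparison, the short sketch below evaluates both schedules, the variance-contour choice (4) of Section 4.1 and the bias-contour choice of this section, under the same illustrative parameter assumptions; they differ for small $j$ but coincide at the corner $j = y_N$, where the two contours intersect.

```python
import numpy as np

# Illustrative parameters (assumptions, as in the sketch of Section 4.1).
p, q, alpha = 0.5, 0.5, 0.3
beta, beta_p, gamma, gamma_p = 0.5, 0.0, 0.0, 0.4
N = 10_000
eta1 = min((beta - beta_p) / max(alpha, beta + p), (gamma_p - gamma) / (1 - gamma))
eta2 = 1 - eta1
y_N = int(N ** (q / (1 - gamma_p) * eta2))
j = np.arange(1, y_N + 1)

lam_var = (j ** (-(1 - gamma_p) / q) * N ** eta2) ** (-1.0 / (beta_p + p))          # Section 4.1
lam_bias = (j ** (-(gamma_p - gamma) / q) * N ** eta1) ** (-1.0 / (beta - beta_p))  # Section 4.2

print("at j = 1:   ", lam_var[0], lam_bias[0])     # the two schedules differ
print("at j = y_N: ", lam_var[-1], lam_bias[-1])   # and coincide where the contours meet
```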

5. MULTILEVEL KERNEL OPERATOR LEARNING

In this section, we study a multilevel machine learning algorithm (Lye et al., 2021; Li et al., 2021a; Boullé et al., 2022), but at each level we consider a cost-accuracy trade-off (De Hoop et al., 2022) to control the variance at a proper scale. We show that the multilevel algorithm covers all the spectral components below the bias contour and achieves the optimal learning rate. Our idea is similar to multilevel Monte Carlo (Giles, 2008; 2015), which uses successive levels to reduce the bias. Our multilevel estimator differs from DeepONet (Lu et al., 2019) and PCA-Net (Bhattacharya et al., 2020) in that we add a different regularization at each level. Our theory indicates that the multilevel approach outperforms previous ones and achieves the optimal learning rate. The basic idea is to design a minimal number of machine learning estimators that cover all spectral elements under the bias contour without exceeding the variance contour. To achieve this, we choose sequences $\{x_i\}$ and $\{y_i\}$ for $1 \leq i \leq L_N$, where $y_i$ determines the $i$-th level and $x_i$ controls the corresponding regularization via the regularization coefficient $\lambda_i^{(K)} = x_i^{-1/p}$. The sequences are chosen in a staircase manner, as plotted in Figure 3 (for formal definitions see Appendix C). The eigenbasis $\{\rho_j^{1/2} f_j\}$ of the output space is divided into different levels by $\{y_i\}$. The main idea behind our multilevel method is that different levels of the output need to be learned with different regularization. Formally, we define our multilevel estimator as
$$\hat A_{\mathrm{ml}} = \sum_{i=0}^{L_N}\Big(\sum_{y_{i-1}\leq j< y_i}\rho_j^{1/2} f_j \otimes \rho_j^{1/2} f_j\Big)\, \hat C_{LK}\big(\hat C_{KK} + \lambda_i^{(K)} I\big)^{-1}. \qquad (5)$$
The following theorem shows that the estimator (5) achieves the optimal convergence rate with $L_N = O(\ln\ln N)$ when $\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}} \neq \frac{\gamma'-\gamma}{1-\gamma}$. We also show in Appendix C that $O(\ln N)$ estimators are needed in the case $\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}} = \frac{\gamma'-\gamma}{1-\gamma}$ (Figure 3, Right).

Theorem 5.1 Suppose that Assumptions 2.1 to 2.5 hold. Then there exists a sequence $\{y_i\}_{1\leq i\leq L_N}$, with $L_N = O(\ln N)$ when $\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}} = \frac{\gamma'-\gamma}{1-\gamma}$ and $L_N = O(\ln\ln N)$ otherwise, such that the estimator $\hat A_{\mathrm{ml}}$ satisfies
$$\|\hat A_{\mathrm{ml}} - A_0\|_{\beta',\gamma'}^2 \leq C\tau^2\Big(\frac{N}{\log N}\Big)^{-\min\left\{\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}},\ \frac{\gamma'-\gamma}{1-\gamma}\right\}}\log^2 N$$
with probability $\geq 1 - e^{-\tau}$, where $C$ is a universal constant.

Remark 5.1 Our multilevel algorithm first applies the regression algorithm to low-frequency projections of the output samples with small regularization and then successively fine-tunes the regression model on high-frequency projections of the output samples with stronger regularization, which matches the empirical usage (Li et al., 2021a; Lye et al., 2021).
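In coordinates, the multilevel estimator (5) is a sum of per-level ridge regressions applied to bands of output directions. The sketch below is a minimal implementation under synthetic-data assumptions; the three-level staircase is chosen by hand purely for illustration, while Appendix C (and the numerical sketch at the end of it) gives the actual construction of $\{(x_i, y_i)\}$.

```python
import numpy as np

def multilevel_estimator(U, V, x_levels, y_levels, p):
    """Estimator (5): level i fits output directions y_{i-1} <= j < y_i with the single
    ridge parameter lambda_i^(K) = x_i ** (-1/p); later levels use stronger regularization."""
    N, d_in = U.shape
    C_KK_hat = U.T @ U / N
    C_LK_hat = V.T @ U / N
    A_hat = np.zeros((V.shape[1], d_in))
    lo = 0
    for x_i, y_i in zip(x_levels, y_levels):
        lam = x_i ** (-1.0 / p)
        hi = min(int(np.ceil(y_i)), V.shape[1])
        # shared regularization for every output direction in this level
        A_hat[lo:hi] = np.linalg.solve(C_KK_hat + lam * np.eye(d_in), C_LK_hat[lo:hi].T).T
        lo = hi
    return A_hat

# Synthetic data and a hand-chosen 3-level staircase (illustrative assumptions only).
rng = np.random.default_rng(0)
p, N, d_in, d_out = 0.5, 2_000, 150, 60
mu = np.arange(1, d_in + 1) ** (-1.0 / p)
A0 = rng.standard_normal((d_out, d_in)) / np.outer(np.arange(1, d_out + 1), np.arange(1, d_in + 1))
U = rng.standard_normal((N, d_in)) * np.sqrt(mu)
V = U @ A0.T + 0.1 * rng.standard_normal((N, d_out))

x_levels = [1_000.0, 100.0, 10.0]   # decreasing x_i => increasing regularization
y_levels = [5, 20, 60]              # each level covers a band of output frequencies
A_ml = multilevel_estimator(U, V, x_levels, y_levels, p)
print("relative Frobenius error:", np.linalg.norm(A_ml - A0) / np.linalg.norm(A0))
```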

6. CONCLUSION AND DISCUSSION

We considered the sample complexity of learning an operator between two infinite-dimensional Sobolev reproducing kernel Hilbert spaces. We provided an information-theoretic lower bound for this problem along with a multilevel machine learning algorithm. Our lower bound is determined by the harder of two polynomial rates: one is fully determined by the hardness of the input space, while the other is fully controlled by the hardness of the output space. The second rate is new to the literature. We explained our bound from the viewpoint of the bias and variance contours in Remark 4.2 and Figure 2: the optimal estimator should learn all spectral elements under the bias contour but learn no information above the variance contour. To meet this requirement, we combined the idea of multilevel Monte Carlo with kernel operator learning, using successive levels to fit higher-frequency information while keeping the variance at the same scale, in order to reduce the bias. Our paper is the first to establish non-parametric statistical optimality for multilevel algorithms. We leave estimation from discretely observed functional covariates with noise as future work (Zhou et al., 2022).

A PROOF OF THE LOWER BOUND

In this section, we follow the lower bound proof in Fischer & Steinwart (2020) to derive a lower bound on the convergence rate in our operator learning setting.

A.1 PRELIMINARIES ON TOOLS FOR LOWER BOUNDS

In this section, we recall the standard tools that we use to establish the lower bound, namely Fano's inequality and the Varshamov-Gilbert lemma.

Lemma A.1 (Fano's method) Assume that $V$ is a uniform random variable over a set $\mathcal{V}$. Then for any Markov chain $V \to X \to \hat{V}$, we have
$$P(\hat{V} \neq V) \geq 1 - \frac{I(V; X) + \log 2}{\log(|\mathcal{V}|)}.$$
In our proof we will use the following version from Fischer & Steinwart (2020).

Lemma A.2 (Fischer & Steinwart, 2020, Theorem 20) Let $M \geq 2$, $(\Omega, \mathcal{A})$ be a measurable space, $P_0, P_1, \ldots, P_M$ be probability measures on $(\Omega, \mathcal{A})$ with $P_j \ll P_0$ for all $j = 1, \ldots, M$, and $0 < \alpha_* < \infty$ with $\frac{1}{M}\sum_{j=1}^M \mathrm{KL}(P_j \| P_0) \leq \alpha_*$. Then, for all measurable functions $\Psi: \Omega \to \{0, 1, \ldots, M\}$,
$$\max_{j=0,1,\ldots,M} P_j\big(\omega \in \Omega: \Psi(\omega) \neq j\big) \geq \frac{\sqrt{M}}{1+\sqrt{M}}\left(1 - \frac{3\alpha_*}{\log M} - \frac{1}{2\log M}\right).$$

Lemma A.3 (Varshamov-Gilbert lemma, Tsybakov (2008), Theorem 2.9) Let $D \geq 8$. There exists a subset $\mathcal{V} = \{\tau^{(0)}, \ldots, \tau^{(2^{D/8})}\}$ of the $D$-dimensional hypercube $H_D = \{0,1\}^D$ such that $\tau^{(0)} = (0, \ldots, 0)$ and the $\ell_1$ distance between every two elements is at least $D/8$:
$$\|\tau^{(j)} - \tau^{(k)}\|_{\ell_1} \geq \frac{D}{8} \quad \text{for all } 0 \leq j < k \leq 2^{D/8}.$$

A.2 PROOF OF THE LOWER BOUND

To prove our lower bound, we construct a family of linear operators as follows:
$$A_\omega = \sqrt{\frac{32\varepsilon}{m_1 K}} \sum_{i=1}^{m_1}\sum_{j=1}^{K} \omega_{ij}\, \mu_{i+m_1}^{\beta'/2}\, \rho_{j+m_2}^{(1-\gamma')/2}\, f_{j+m_2} \otimes e_{i+m_1}, \qquad \omega_{ij} \in \{0, 1\},$$
where $m_1$ and $m_2$ are hyper-parameters (scaling as $\mathrm{poly}(N)$, to be selected later) and $K$ is a constant that will be specified afterwards. It is easy to check that
$$\|A_\omega - A_{\omega'}\|_{\beta',\gamma'}^2 \leq \frac{32\varepsilon}{m_1 K}\sum_{i=1}^{m_1}\sum_{j=1}^{K}\big(\omega_{ij} - \omega'_{ij}\big)^2.$$
By the Varshamov-Gilbert lemma (Lemma A.3), it is possible to select $M_\varepsilon \geq 2^{m_1 K/8}$ binary strings $\omega^{(1)}, \omega^{(2)}, \ldots, \omega^{(M_\varepsilon)} \in \{0,1\}^{m_1 K}$ such that $\|\omega^{(i)} - \omega^{(j)}\|_2^2 \geq 4\varepsilon$. Let $\Omega$ be the collection of these strings.

We now select the hyper-parameters so that the assumptions made in Section 2 are satisfied. First, we have
$$\|A_\omega\|_{\beta,\gamma}^2 \leq \frac{32\varepsilon}{m_1 K}\sum_{i=1}^{m_1}\sum_{j=1}^{K} \mu_{i+m_1}^{-(\beta-\beta')}\rho_{j+m_2}^{-(\gamma'-\gamma)} \lesssim \varepsilon\, (2m_1)^{\frac{\beta-\beta'}{p}} (2m_2)^{\frac{\gamma'-\gamma}{q}},$$
where the last step follows from Assumption 2.1. Similarly, we have $\|A_\omega\|_{\alpha,1}^2 \lesssim \varepsilon\, (2m_1)^{\frac{\alpha-\beta'}{p}} (2m_2)^{\frac{\gamma'-1}{q}}$. To satisfy the assumptions made in Section 2, we should ensure that
$$(2m_1)^{\frac{\max\{\alpha,\beta\}-\beta'}{p}} (2m_2)^{\frac{\gamma'-\gamma}{q}} \lesssim \varepsilon^{-1}. \qquad (6)$$
To be specific, with this selection of hyper-parameters we have $\|A_\omega\|_{\beta,\gamma} = O(1)$ and
$$\sup_{g\in \mathrm{range}(A_\omega)} \|g\|_{H_L} \leq \sup_f \|A_\omega\|_{\alpha,1}\cdot \big\|(I_{1,\alpha,P_K}^*)^{\dagger} f\big\|_{H_K^\alpha} < +\infty,$$
where the last step follows from our assumption on the input distribution, Assumption 2.2. This verifies that Assumptions 2.3 and 2.5 hold for $A_\omega$ for all $\omega \in \Omega$. We now construct the hypotheses (probability distributions) as follows: for every $\omega \in \{0,1\}^{m_1 K}$, define
$$P_\omega(df, dg) = dN(A_\omega f, \Sigma)(g)\cdot dP_K(f),$$
where the covariance operator $\Sigma = \frac{\sigma^2}{K}\sum_{j=1}^K \rho_{j+m_2}\, f_{j+m_2} \otimes f_{j+m_2}$ for some constant $\sigma > 0$. It is then easy to see that $\mathrm{tr}(\Sigma) = \sigma^2$, which satisfies Assumption 2.4. Note that the range of $A_\omega$ is $\mathrm{span}(f_{j+m_2}: 1 \leq j \leq K)$ and $\Sigma$ is non-degenerate on this subspace.

As a result, we can view $P_\omega$, $\omega \in \Omega$, as distributions on $H_K \times \mathrm{span}(f_{j+m_2}: 1 \leq j \leq K)$, and for all $\omega, \omega' \in \Omega$ we have
$$\mathrm{KL}(P_\omega \| P_{\omega'}) = \mathbb{E}_{f\sim P_K}\big[\mathrm{KL}\big(P_\omega(dg\mid f) \,\|\, P_{\omega'}(dg\mid f)\big)\big] = \mathbb{E}_{f\sim P_K}\big[\mathrm{KL}\big(N(A_\omega f, \Sigma) \,\|\, N(A_{\omega'} f, \Sigma)\big)\big] = \mathbb{E}_{f\sim P_K}\big\langle (A_\omega - A_{\omega'})f,\ \Sigma^{\dagger}(A_\omega - A_{\omega'})f\big\rangle$$
$$\leq \sigma^{-2} K\, \mathbb{E}_{f\sim P_K}\big\langle (A_\omega - A_{\omega'})f,\ (A_\omega - A_{\omega'})f\big\rangle = \frac{32\varepsilon}{m_1 \sigma^2}\, \mathbb{E}_{f\sim P_K}\Big\| \sum_{i=1}^{m_1}\sum_{j=1}^{K} (\omega_{ij} - \omega'_{ij})\, \mu_{i+m_1}^{\beta'/2}\rho_{j+m_2}^{(1-\gamma')/2}\, \langle f, e_{i+m_1}\rangle\, f_{j+m_2}\Big\|_{H_L}^2$$
$$= \frac{32\varepsilon}{m_1 \sigma^2}\, \mathbb{E}_{f\sim P_K}\sum_{j=1}^{K}\rho_{j+m_2}^{1-\gamma'}\Big(\sum_{i=1}^{m_1}(\omega_{ij} - \omega'_{ij})\,\mu_{i+m_1}^{\beta'/2}\langle f, e_{i+m_1}\rangle\Big)^2 = \frac{32\varepsilon}{m_1 \sigma^2}\sum_{i=1}^{m_1}\sum_{j=1}^{K}(\omega_{ij} - \omega'_{ij})^2\,\mu_{i+m_1}^{\beta'}\rho_{j+m_2}^{1-\gamma'} \lesssim \varepsilon\,\sigma^{-2}\, m_1^{-\frac{\beta'}{p}} m_2^{-\frac{1-\gamma'}{q}},$$
where the last step follows from $\mathbb{E}_{P_K} u\otimes u = C_{KK} = \sum_{i=1}^{\infty}\mu_i^2\, e_i\otimes e_i$ and the fact that $K$ is a constant. Hence we deduce that
$$\frac{1}{M_\varepsilon}\sum_{\omega'\in\Omega}\mathrm{KL}\big(P_{\omega'}^n \,\|\, P_{\omega}^n\big) \lesssim \sigma^{-2} n\,\varepsilon\, m_1^{-\frac{\beta'}{p}} m_2^{-\frac{1-\gamma'}{q}} =: \alpha_*.$$
Applying Lemma A.2, we find that when $\alpha_* \lesssim \log M_\varepsilon \Leftrightarrow \varepsilon \lesssim n^{-1} m_1^{\frac{\beta'}{p}} m_2^{\frac{1-\gamma'}{q}}$, there exists a hypothesis $P_{\omega_0}$ such that, for any estimator $\hat A$, the event
$$\Big\{\|\hat A - A_{\omega_0}\|_{\beta',\gamma'}^2 \gtrsim \varepsilon\Big\} \supseteq \Big\{\omega_0 \neq \arg\min_{\omega\in\Omega}\|A_\omega - \hat A\|_{\beta',\gamma'}\Big\}$$
holds with high probability. Finally, we need to choose optimal $m_1$ and $m_2$ under the constraint (6). It turns out that either $m_1 = 1$ or $m_2 = 1$, and the resulting lower bound is
$$\|\hat A - A\|_{\beta',\gamma'} \gtrsim n^{-\min\left\{\frac{\max\{\alpha,\beta\}-\beta'}{2(\max\{\alpha,\beta\}+p)},\ \frac{\gamma'-\gamma}{2(1-\gamma)}\right\}}.$$

B PROOF OF THE UPPER BOUND

In this section, we upper-bound the learning error of estimator (3) which defined as Â = y N j=1 ρ 1 2 j f j ⊗ ρ 1 2 j f j ĈLK ĈKK + λ j I -1 , where λ j , 1 ⩽ j ⩽ y N = N q 1-γ ′ max{1- β-β ′ max{α,β+p} , 1-γ ′ 1-γ } are regularization coefficients that we impose on different dimensions of the output space. In this section, we consider the following two ways to select regularization coefficients in Section 4: • We regularize all spectral component below certain variance contour, i.e. we set regulariza- tion strength λj = max    j -1-γ ′ q N max 1- β-β ′ max{α,β+p} , 1-γ ′ 1-γ -1 β ′ +p , c0 N log N -1 α    (4). • We regularize all spectral component below certain bias contour, i.e. we set regularization strength λj = max    j -γ ′ -γ q N min β-β ′ max{α,β+p} , γ ′ -γ 1-γ -1 β-β ′ , c0 N log N -1 α    (19). To obtain the upper bound for our estimator, we decompose the learning error E( Â) = Â -A 0 β ′ ,γ ′ in to bias and variance via E(A) ⩽ Â -A λ β ′ ,γ ′ variance term + ∥A λ -A 0 ∥ β ′ ,γ ′ bias term , where  A λ = y N j=1 ρ 1 2 j f j ⊗ ρ 1 2 j f j C KL (C KK + λ j I) -1 . Lemma B.1 ∥A 0 -A λ ∥ 2 β ′ ,γ ′ ≲ N -min β-β ′ β+p , γ ′ -γ 1-γ . Proof sketch: Since ∥A 0 ∥ β,γ ⩽ B, we can write A 0 := +∞ i=1 +∞ j=1 a ij µ β 2 i ρ 1-γ 2 j f j ⊗ e i where the coefficient matrix A 0 = (a ij ) 1⩽i,j⩽+∞ satisfies ∥A 0 ∥ 2 F ⩽ B 2 . The definition (8) implies that for 1 ⩽ j ⩽ y N and i ⩾ 1 we have ρ 1 2 j f j , A λ µ 1 2 i e i = ρ 1 2 j f j , C KL (C KK + λ j I) -1 µ 1 2 i e i = ρ 1 2 j f j , A 0 C KK (C KK + λ j I) -1 µ 1 2 i e i = µ 1+β 2 i µ i + λ j ρ 1-γ 2 j a ij . The bias term can be bounded as follows: ∥A 0 -A λ ∥ 2 β ′ ,γ ′ = +∞ i,j=1 ρ 1 2 j f j , C -1-γ ′ 2 Q K (A 0 -A λ ) C 1-β ′ 2 KK µ 1 2 i e i 2 = y N j=1 +∞ i=1 µ β-β ′ i ρ γ ′ -γ j λ 2 j (µ i + λ j ) 2 a 2 ij ⩽ y N j=1 ρ γ ′ -γ j max i⩾1 µ β-β ′ i λ 2 j (µ i + λ j ) 2 • +∞ i=1 a 2 ij ≲ y N j=1 j -γ ′ -γ q λ -(β-β ′ ) j +∞ i=1 a 2 ij ≲ B 2 max 1⩽j⩽y N j -γ ′ -γ q λ -(β-β ′ ) j . ( ) We now prove that j γ ′ -γ q λ β-β ′ j ≳ N min β-β ′ β+p , γ ′ -γ 1-γ , ∀1 ⩽ j ⩽ y N . ( ) Case 1. If λ j = c 0 N log N 1 α , then j γ ′ -γ q λ β-β ′ j ⩾ λ β-β ′ j ≳ N β-β ′ α ⩾ N β-β ′ β+p where we use α ⩽ β + p in the final Case 2. If λ j = N max β ′ +p β+p , 1-γ ′ 1-γ j -1-γ ′ q 1 β ′ +p , we need to consider two sub-cases: • If β ′ +p β+p > 1-γ ′ 1-γ , then we have λ j = N β ′ +p β+p j -1-γ ′ q 1 β ′ +p and thus j γ ′ -γ q λ β-β ′ j = j γ ′ -γ q N β ′ +p β+p j -1-γ ′ q β-β ′ β ′ +p = N β-β ′ β+p j 1-γ ′ q γ ′ -γ 1-γ ′ -β-β ′ β ′ +p ⩾ N β-β ′ β+p . • If β ′ +p β+p < 1-γ ′ 1-γ , then similarly we have λ j = N 1-γ ′ 1-γ j -1-γ ′ q 1 β ′ +p and j γ ′ -γ q λ β-β ′ j = j γ ′ -γ q N 1-γ ′ 1-γ j -1-γ ′ q β-β ′ β ′ +p ⩾ y γ ′ -γ q N N 1-γ ′ 1-γ y -1-γ ′ q N β-β ′ β ′ +p = N γ ′ -γ 1-γ . Hence, in all cases (10) holds and we have that ∥A 0 -A λ ∥ 2 β ′ ,γ ′ ≲ N -min β-β ′ β+p , γ ′ -γ 1-γ . 
( ) □ B.1.2 VARIANCE The variance term can be rewritten in the following way: V = Â -A λ 2 β ′ ,γ ′ = C -1-γ ′ 2 Q K Â -A λ C 1-β ′ 2 KK 2 HS = +∞ i,j=1 ρ 1 2 j f j , C -1-γ ′ 2 Q K Â -A λ C 1-β ′ 2 KK µ 1 2 i e i 2 (12a) = n N j=1 ρ -(1-γ ′ ) j +∞ i=1 ρ 1 2 j f j , ĈLK ĈKK + λ j I -1 -C LK (C KK + λ j I) -1 µ 1-β ′ 2 i e i 2 (12b) = n N j=1 ρ -(1-γ ′ ) j +∞ i=1 (C KK + λ j I) -1 2 ĈKL -ĈKK + λ j I (C KK + λ j I) -1 C KL =:Uj ρ 1 2 j f j , (C KK + λ j I) 1 2 ĈKK + λ j I -1 (C KK + λ j I) 1 2 =:Gj µ 1-β ′ 2 i µ i + λ j e i 2 (12c) = n N j=1 ρ -(1-γ ′ ) j U j ρ 1 2 j f j , G j +∞ i=1 µ 2-β ′ i µ i + λ j e i ⊗ e i G j U j ρ 1 2 j f j ≲ n N j=1 j 1-γ ′ q ∥G j ∥ 2 λ -β ′ j U j ρ 1 2 j f j 2 In ( 12), (12a) uses the definition of the Hilbert-Schmidt norm; (12b) follows from the definition of Â (cf.( 3)) and the fact that for any j ⩾ y N , we have ρ 1 2 j f j , Â -A λ µ 1 2 i e i = 0; (12c) is obtained from re-arranging and (12d) follows from +∞ i=1 µ 2-β ′ i µi+λj e i ⊗ e i = max i⩾1 µ 1-β ′ i µi+λj ≲ λ -β ′ j and ρ j ≲ j -1 q . Note that U j = (C KK + λ j I) -1 2 ĈKL -C KL -ĈKK -C KK (C KK + λ j I) -1 C KL = 1 N N k=1 (C KK + λ j I) -1 2 u k ⊗ v k -E P KL u k ⊗ A 0 u k -(u k ⊗ u k -E P KL u k ⊗ u k ) (C KK + λ j I) -1 C KK A * 0 = 1 N N k=1 (C KK + λ j I) -1 2 (u k ⊗ (v k -A 0 u k )) :=U 1 j + 1 N N k=1 (C KK + λ j I) -1 2 u k ⊗ A 0 u k -E P KL u k ⊗ A 0 u k -(u k ⊗ u k -E P KL u k ⊗ u k ) (C KK + λ j I) -1 C KK A * 0 :=U 2 j =λj 1 N N k=1 (C KK +λj I) -1 2 (uk⊗A0(CKK+λjI) -1 u k -E P KL u k ⊗A0(C KK +λj I) -1 u k ) . The U 1 j term is the variance of observational noise and U 2 j term is the variance of regularized bias. Thus the U 1 j term is the dominating term. Plugging the above decomposition into (12), we deduce that V ⩽ 2 (V 1 + V 2 ) where V 1 ≲ max 1⩽j⩽n N ∥G j ∥ 2 n N j=1 j 1-γ ′ q λ -β ′ j 1 N N k=1 v k -A 0 u k , ρ 1 2 j f j (C KK + λ j I) -1 2 u k 2 :=V 2 1,j V 2 ≲ max 1⩽j⩽n N ∥G j ∥ 2 n N j=1 j 1-γ ′ q λ 2-β ′ j Ê -E A 0 (C KK + λ j I) -1 u k , ρ 1 2 j f j (C KK + λ j I) -1 2 u k 2 :=V 2 2,j (13) where Ê[X] = 1 N N k=1 X k denotes the empirical mean. Define the event E 1,j = G j = P ij (C KK ) 1 2 P ij ĈKK † P ij (C KK ) 1 2 ⩽ 2 √ a 1 . . Recall that m N ⩽ c 0 N log N p α , by Theorem D.3, we know that E 1,j holds with probability ⩾ 1 -2e -a1 . As a result E 1 = ∩ n N j=1 E 1,j holds with probability ⩾ 1 -2n N e -a1 . We assume event E 1 holds in all the following proof. Bounding V 1 . Let X j,k = j 1-γ ′ 2q λ -β ′ 2 j v k -A 0 u k , ρ 1 2 j f j (C KK + λ j I) -1 2 u k ∈ H K and X k = (X j,k : 1 ⩽ j ⩽ n N ) ∈ H y N K . Then we have V 1 ≲ 1 N N k=1 X k 2 where the norm here defined for H ⊗y N K is induced by ⟨a, b⟩ = n N i=1 ⟨a i , b i ⟩ H K . Note that X k , k = 1, 2, • • • , N are i.i.

d. random variables with mean zero, and

E ∥X 1 ∥ 2t = E P KL     n N j=1 ∥X j,k ∥ 2   t   = E P KL     n N j=1 j 1-γ ′ q λ -β ′ j v 1 -A 0 u 1 , ρ 1 2 j f j 2 (C KK + λ j I) -1 2 u 2   t   ⩽ max 1⩽j⩽y N sup u∈supp(P K )     j 1-γ ′ q i β ′ p j (C KK + λ j I) -1 2 u 2 =:G1     t-1 • E (u,v)∼P KL   ∥v -A 0 u∥ 2t-2   n N j=1 j 1-γ ′ q λ -β ′ j v -A 0 u, ρ 1 2 j f j 2 (C KK + λ j I) -1 2 u 2     =:G2 By Lemma D.2 we have G 1 ≲ j 1-γ ′ q λ -(β ′ +α) j . For G 2 , note that for fixed u, Assumption 2.4 implies that E v|u   ∥v -A 0 u∥ 2t-2   n N j=1 j 1-γ ′ q i β ′ p j v -A 0 u, ρ 1 2 j f j 2 (C KK + λ j I) -1 2 u 2     ⩽ 1 2 (2t)!R 2t-2 n N j=1 σ 2 j j 1-γ ′ q λ -β ′ j (C KK + λ j I) -1 2 u 2 . where σ 2 j = ρ 1 2 j f j , V ρ 1 2 f j . As a result, we have G 2 ⩽ E P K   1 2 (2t)!R 2t-2 n N j=1 σ 2 j j 1-γ ′ q i β ′ p j (C KK + λ j I) -1 2 u 2   ⩽ 1 2 (2t)!R 2t-2 σ 2 max 1⩽j⩽n N j 1-γ ′ q λ -(p+β ′ ) j , where in the second step we use +∞ j=1 σ 2 j = tr (V ) = σ 2 and E P K (C KK + λ j I) -1 2 u 2 ⩽ tr E P K (C KK + λ j I) -1 2 u ⊗ (C KK + λ j I) -1 2 u = tr +∞ i=1 µ 2 i µ i + λ j e i ⊗ e i = +∞ i=1 µ i µ i + λ j ≲ λ -p j . We have shown that for some constant c 1 > 0, E∥X 1 ∥ 2t ⩽ 1 2 (2t)!σ 2 max 1⩽j⩽n N j 1-γ ′ q λ -(p+β ′ ) j • c 1 R 2 max 1⩽j⩽n N j 1-γ ′ q λ -(β ′ +p) j t-1 . By Bernstein's inequality, the event E 2 :=    1 N N k=1 X k 2 ⩽ 6a 2   σ 2 max j∈[y N ] j 1-γ ′ q λ -(β ′ +p) j N + c 1 R 2 max 1⩽j⩽n N j 1-γ ′ q λ -(β ′ +α) j N 2      (14) holds with probability ⩾ 1 -2e -a2 . By our definition of λ j , we have max 1⩽j⩽n N j 1-γ ′ q λ -(β ′ +p) j ≲ N max β ′ +p β+p , 1-γ ′ 1-γ and λ j ≳ N -1 α (which implies that the 1 N 2 term is dominated by the 1 N term). Hence, under E 1 ∩E 2 we have V 1 ≲ a 1 a 2 σ 2 N -min β-β ′ β+p , γ ′ -γ 1-γ with probability ⩾ 1 -2n N e -a2 . Bounding V 2 . For any j ∈ Z + we have E u∼P K A 0 (C KK + λ j I) -1 u, ρ 1 2 j f j 2 = E u∼P K ρ 1 2 j f j , E P K A 0 (C KK + λ j I) -1 u ⊗ A 0 (C KK + λ j I) -1 u ρ 1 2 j f j = ρ 1 2 j f j , A 0 (C KK + λ j I) -1 C KK (C KK + λ j I) -1 A * 0 ρ 1 2 j f j (15a) = ρ 1-γ j C -1-γ 2 Q K A 0 C 1-β 2 KK * ρ 1 2 j f j , (C KK + λ j I) -1 C β KK (C KK + λ j I) -1 C -1-γ 2 Q K A 0 C 1-β 2 KK * ρ 1 2 j f j (15b) ≲ j -1-γ q λ -(2-β) j C -1-γ 2 Q K A 0 C 1-β 2 KK * ρ 1 2 j f j 2 =:Dj,2 where (15a) follows from E P K u⊗u = C KK , (15b) uses the fact that C KK and C KK +λ j I commute, and lastly (15c) follows from (C KK + λ j I) -1 C β KK (C KK + λ j I) -1 H K ∝ λ -(2-β) j . Let Y j,k = A 0 (C KK + λ j I) -1 u k , ρ 1 2 j f j (C KK + λ j I) -1 2 u k ∈ H K and Y k = (Y j,k : 1 ⩽ j ⩽ y N ) ∈ H n N K . Then we have V 2 ≲ 1 N N k=1 Y k 2 H y N K . Note that Y k , k = 1, 2, • • • , N are i.i.d. random variables, and E∥Y 1 ∥ 2t = E     n N j=1 ∥Y j,k ∥ 2   t   = E P K     n N j=1 j 1-γ ′ q λ 2-β ′ j A 0 (C KK + λ j I) -1 u 1 , ρ 1 2 j f j 2 C -1 2 KK I ij (u 1 ) 2   t   ⩽ sup u∈supp(P K )   n N j=1 j 1-γ ′ q λ 2-β ′ j A 0 (C KK + λ j I) -1 u, ρ 1 2 j f j 2 (C KK + λ j I) -1 2 u 2   t-1 • n N j=1 j 1-γ ′ q λ 2-β ′ j E A 0 (C KK + λ j I) -1 u, ρ 1 2 j f j 2 sup u∈supp(P K ) (C KK + λ j I) -1 2 u 2 ≲ sup u∈supp(P K )   n N j=1 j 1-γ ′ q λ 2-β ′ -α j A 0 (C KK + λ j I) -1 u, ρ 1 2 j f j 2   t-1 • n N j=1 j 1-γ ′ q λ -(β ′ +α-β) j D j,2 . 
For any j ∈ Z + and u ∈ supp(P K ) we have n N j=1 j 1-γ ′ q λ -(β ′ +α) j A 0 (C KK + λ j I) -1 λ j u, ρ 1 2 j f j 2 ⩽ n N j=1 j 1-γ ′ q λ -(β ′ +α) j 2 A 0 u, ρ 1 2 j f j 2 + 2ρ 1-γ j C -1-γ 2 Q K A 0 (C KK + λ j I) -1 C KK u, ρ 1 2 j f j 2 (17a) ≲ max 1⩽j⩽y N j 1-γ ′ q λ -(β ′ +α) j + n N j=1 λ -(β ′ +α) j λ -max{α-β,0} j C -1-γ 2 Q K A 0 C 1-β 2 KK * ρ 1 2 j f j 2 (17b) ≲ max 1⩽j⩽y N j 1-γ ′ q λ -(β ′ +α) j + max 1⩽j⩽y N λ -(β ′ +α)-max{α-β,0} j . ( ) where (17a) uses the AM-GM inequality, (17b) follows from the assumption that ∥A 0 u∥ ⩽ A 2 is uniformly bounded, and that C -1-β 2 KK (C KK + λ j I) -1 C KK u = C 1-α-β 2 KK (C KK + λ j I) -1 C -1-α 2 KK u ≲ λ -max{α-β,0} j . by Assumption 2.2, and lastly (17c) follows from ∥A 0 ∥ β,γ ⩽ B. Plugging into (16), we deduce that E∥Y 1 ∥ 2t ≲ sup u∈supp(P K ) max 1⩽j⩽n N j 1-γ ′ q λ -(β ′ +α) j + max 1⩽j⩽n N λ -(β ′ +α)-max{α-β,0} j t-1 • n N j=1 j 1-γ ′ q λ -(β ′ +α-β) j D j,2 ≲ sup u∈supp(P K ) max 1⩽j⩽n N j 1-γ ′ q λ -(β ′ +α) j + max 1⩽j⩽n N λ -(β ′ +α)-max{α-β,0} j t-1 max 1⩽j⩽n N j 1-γ ′ q λ -(β ′ +α-β) j where the last step follows from +∞ j=1 D j,2 = ∥A 0 ∥ 2 β,γ . By Bernstein's inequality, there exists a constant C 3 such that the event E 3 =      V 2 ⩽ 6a 3 C 3    j 1-γ ′ q λ -(β ′ +α-β) j N + max 1⩽j⩽n N λ -(β ′ +α) j j 1-γ ′ q + λ -max{α-β.0} j N 2         (18) holds with probability ⩾ 1 -2e -a3 . The definition of λ N implies that the 1 N 2 term is dominated by the 1 N term, so V 2 ≲ a 1 a 3 1 N max 1⩽j⩽y N j 1-γ q λ -(β ′ +α-β) j ≲ N -min β-β ′ β+p , γ ′ -γ 1-γ holds under E 1 ∩ E 3 . To summarize, under E 1 ∩ E 2 ∩ E 3 which holds with probability ⩾ 1 - 2n N e -a1 -2e -a2 -2e -a3 , we have V ⩽ 2a 1 max{a 2 , a 3 } (V 1 + V 2 ) ≲ N -min β-β ′ β+p , γ ′ -γ 1-γ . Recall that the bias term is upper bounded in (11). This gives the final upper bound Â -A 0 β ′ ,γ ′ ≲ N -min β-β ′ 2(β+p) , γ ′ -γ 2(1-γ) .

B.1.3 THE HARD-LEARNING REGIME

In the previous sections, we focus on the case where α ⩽ β + p and establish an upper bound for the convergence rate via an optimal bias-variance trade-off. The opposite case, α > β + p is referred to as the hard-learning regime, for which the optimal rate is not known for several decades even in the case of γ = 1 (cf. the discussion following (Fischer & Steinwart, 2020 , Theorem 2)). In the hard learning regime the V 2 term becomes the leading terms. In this section, we use the technique developed in previous sections to obtain an upper bound in the hard-learning regime. To do this, we need to re-define the truncation set S N as follows: S N = (x, y) ∈ Z 2 x β ′ +α-β p y 1-γ ′ q ⩽ N 1-min β-β ′ α , γ ′ -γ 1-γ and x ⩽ c 0 N log N p α . The definition implies that the variance can be controlled by N -min β-β ′ 2α , γ ′ -γ 2(1-γ) and it remains to focus on the bias term. Similar to the derivations in Appendix B.1.1, we have ∥A 0 -T N (A 0 )∥ 2 β ′ ,γ ′ ≲ max (i,j) / ∈S N i -β-β ′ p j -γ ′ -γ q . The maximum value of the right hand side can be achieved in either of the following two cases: • i = O(1). Then we have j ≳ N q 1-γ ′ 1-min β-β ′ α , γ ′ -γ 1-γ so that i -β-β ′ p j -γ ′ -γ q ≲ N -γ ′ -γ 1-γ ′ 1-min β-β ′ α , γ ′ -γ 1-γ ⩽ N -γ ′ -γ 1-γ . • j = O(1). In this case we must have i ≲ N min p α , p β-β ′ γ ′ -γ 1-γ , otherwise it falls into S N by definition. Hence we have i -β-β ′ p j -γ ′ -γ q ⩽ i -β-β ′ p ≲ N -min β-β ′ α , γ ′ -γ 1-γ . On the other hand, for the variance term we still have V 1 ≲ 1 N max 1⩽j⩽n N j 1-γ ′ q i β ′ +p p j and V 2 ⩽ 1 N max 1⩽j⩽n N j 1-γ ′ q i β ′ +α-β p j , so that V ≲ 1 N max 1⩽j⩽n N j 1-γ ′ q i β ′ +α-β p j ⩽ N -min β-β ′ β+p , γ ′ -γ 1-γ . As a result, we can obtain the following convergence rate: Â -A 0 β ′ ,γ ′ ≲ N -min β-β ′ 2α , γ ′ -γ 2(1-γ) .

B.2 REGULARIZATION VIA BIAS CONTOUR

In this subsection, we analyze the convergence rate of regularization via bias contour (cf. Figure 2 ). Specifically, we consider the estimator (3) with the choice λ j = max j -γ ′ -γ q N min β-β ′ max{α,β+p} , γ ′ -γ 1-γ -1 β-β ′ , c 0 N log N -1 α . ( ) It now remains to plug the above λ j into our bounds for bias and variance derived in the previous subsections. Bounding the bias term. It follows from ( 9) that ∥A 0 -A λ ∥ 2 β,γ ′ ≲ max 1⩽j⩽y N j -γ ′ -γ q λ β-β ′ j ≲ max    N -min β-β ′ max{α,β+p} , γ ′ -γ 1-γ , c 0 N log N -β-β ′ α    ≲ N -min β-β ′ max{α,β+p} , γ ′ -γ 1-γ . Bounding the variance term It follows from ( 14) and ( 18) that the variance is bounded by Â -A λ 2 β ′ ,γ ′ ≲ 1 N max 1⩽j⩽y N j 1-γ ′ q λ -(β ′ +max{α-β,p}) j . As before, we consider the cases α ⩽ β + p and α > β + p separately. • If α ⩽ β + p, then it follows that Â -A λ 2 β ′ ,γ ′ ≲ 1 N max 1⩽j⩽y N j 1-γ ′ q λ -(β ′ +p) j ≲ 1 N max 1⩽j⩽y N j 1-γ ′ q j -γ ′ -γ q N min β-β ′ β+p , γ ′ -γ 1-γ β ′ +p β-β ′ ≲ 1 N max 1⩽j⩽y N j γ ′ -γ q 1-γ ′ γ ′ -γ -β ′ +p β-β ′ N β ′ +p β-β ′ min β-β ′ β+p , γ ′ -γ 1-γ = 1 N max j∈{1,y N } j γ ′ -γ q 1-γ ′ γ ′ -γ -β ′ +p β-β ′ N β ′ +p β-β ′ min β-β ′ β+p , γ ′ -γ 1-γ = N min β-β ′ β+p , γ ′ -γ 1-γ max β ′ +p β-β ′ , 1-γ ′ γ ′ -γ -1 = N -min β-β ′ β+p , γ ′ -γ 1-γ , where we use y γ ′ -γ q N = N min β-β ′ β+p , γ ′ -γ 1-γ by definition. • If α > β + p, then similarly we have Â -A λ 2 β ′ ,γ ′ ≲ 1 N max 1⩽j⩽y N j 1-γ ′ q λ -β ′ +α-β j ≲ 1 N max 1⩽j⩽y N j 1-γ ′ q j -γ ′ -γ q N min β-β ′ α , γ ′ -γ 1-γ β ′ +α-β β-β ′ = 1 N max j∈{1,y N } j 1-γ ′ q j -γ ′ -γ q N min β-β ′ α , γ ′ -γ 1-γ β ′ +α-β β-β ′ ⩽ N -min β-β ′ α , γ ′ -γ 1-γ . Hence we deduce that Â -A λ 2 β ′ ,γ ′ ≲ N -min β-β ′ α , γ ′ -γ 1-γ , as desired.

B.3 IMPLICATION OF THE UPPER BOUND

In this section, we discuss the implications of our upper bounds under the $(\beta',\gamma')$-norm. Note that $\|C_{Q_L}^{-(1-\gamma')/2} v\|_{H_L} = \|v\|_{H_L^{2-\gamma'}}$ for all $v \in L^2(Q_L)$ (if one side of the equation is $+\infty$ then so is the other), so we have
$$\mathbb{E}_{u\sim P_K}\big\| (\hat A - A_0) u\big\|_{H_L^{2-\gamma'}}^2 = \mathbb{E}_{u\sim P_K}\big\| C_{Q_L}^{-(1-\gamma')/2}(\hat A - A_0) u\big\|_{H_L}^2 = \mathrm{tr}\Big( C_{Q_L}^{-(1-\gamma')/2}(\hat A - A_0)\, \mathbb{E}_{u\sim P_K}[u\otimes u]\, \big(C_{Q_L}^{-(1-\gamma')/2}(\hat A - A_0)\big)^* \Big) \lesssim \big\|\hat A - A_0\big\|_{\beta',\gamma'}^2, \qquad (20)$$
where the last step follows from $\mathbb{E}_{u\sim P_K} u\otimes u = C_{KK}$. Note that the above derivation holds for any $0 \leq \beta' < \beta$, so choosing $\beta' = 0$ yields the best upper bound. We see from (20) that our analysis implies an upper bound on the expected error of the learned solution evaluated in the $H_L^{2-\gamma'}$ norm. On the other hand, it is also possible to obtain a uniform convergence rate when $\beta' \geq \alpha$:
$$\big\| (\hat A - A_0) u\big\|_{H_L^{2-\gamma'}} = \big\| C_{Q_L}^{-(1-\gamma')/2}(\hat A - A_0) u\big\|_{H_L} \leq \big\|\hat A - A_0\big\|_{\beta',\gamma'}\cdot \big\| C_{KK}^{-(1-\beta')/2} u\big\|_{H_K} \lesssim \big\|\hat A - A_0\big\|_{\beta',\gamma'}.$$

C PROOFS FOR THE MULTI-LEVEL OPERATOR LEARNING ALGORITHM

In this section, we analyze the convergence rate of the multilevel algorithm described in Section 5. We define
$$\eta_1 = \min\Big\{\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}},\ \frac{\gamma'-\gamma}{1-\gamma}\Big\}, \qquad \eta_2 = \max\Big\{1-\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}},\ \frac{1-\gamma'}{1-\gamma}\Big\} = 1 - \eta_1.$$
We first restrict ourselves to the case $\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}} \neq \frac{\gamma'-\gamma}{1-\gamma}$; the special case where the two terms are equal is treated separately in Appendix C.1. For the optimal bias and variance contours $\ell_{C_1,\mathrm{bias}}$ and $\ell_{C_2,\mathrm{var}}$ with $C_1 = N^{\eta_1}$ and $C_2 = N^{\eta_2}$, we define a sequence $\{x_n\}$ as follows:
$$x_0 = \max\Big\{ \tfrac12 N^{\frac{p}{\beta'+p}\eta_2},\ \big(\tfrac{c_0 N}{\log N}\big)^{-\frac1\alpha} \Big\}, \qquad (21a)$$
$$y_n = \text{the solution of } x_n^{\frac{\beta'+\max\{\alpha-\beta,p\}}{p}}\, y^{\frac{1-\gamma'}{q}} = N^{\eta_2}, \quad n \geq 0, \qquad (21b)$$
$$x_{n+1} = \text{the solution of } x^{\frac{\beta-\beta'}{p}}\, y_n^{\frac{\gamma'-\gamma}{q}} = N^{\eta_1}, \quad n \geq 0. \qquad (21c)$$
We first derive an explicit recursive formula for $\{x_n\}$.

Lemma C.1 Let $u = \frac{\beta'+\max\{\alpha-\beta,p\}}{\beta-\beta'}\cdot\frac{\gamma'-\gamma}{1-\gamma'} > 0$. Then
(1) if $u > 1$, then $N^{-\frac{p}{\beta+p}} x_{n+1} = \big(N^{-\frac{p}{\beta+p}} x_n\big)^u$;
(2) if $u < 1$, then $x_{n+1} = x_n^u$.

Proof: (1) Suppose that $u > 1$. Then $\eta_1 = \frac{\beta-\beta'}{\max\{\alpha,\beta+p\}}$ and $\eta_2 = 1 - \eta_1$. It follows from (21b) and (21c) that
$$x_{n+1} = \Big( N^{\eta_1}\, y_n^{-\frac{\gamma'-\gamma}{q}} \Big)^{\frac{p}{\beta-\beta'}} = N^{\frac{p}{\max\{\alpha,\beta+p\}}}\Big( N^{\eta_2}\, x_n^{-\frac{\beta'+\max\{\alpha-\beta,p\}}{p}} \Big)^{-\frac{\gamma'-\gamma}{1-\gamma'}\frac{p}{\beta-\beta'}} = N^{\frac{p}{\max\{\alpha,\beta+p\}}}\Big( N^{-\frac{p}{\max\{\alpha,\beta+p\}}} x_n \Big)^u.$$
(2) Suppose that $u < 1$. Then $\eta_1 = \frac{\gamma'-\gamma}{1-\gamma}$ and $\eta_2 = \frac{1-\gamma'}{1-\gamma}$, so that $\frac{\eta_1}{\eta_2} = \frac{\gamma'-\gamma}{1-\gamma'}$, and it follows from (21b) and (21c) that $x_n^{\frac{\beta'+\max\{\alpha-\beta,p\}}{p}\cdot\frac{\gamma'-\gamma}{1-\gamma'}} = x_{n+1}^{\frac{\beta-\beta'}{p}}$, hence $x_{n+1} = x_n^u$.

□

Lemma C.1 implies that when u ̸ = 1, the sequence {x n } decreases super-exponentially. Thus, there exists L N = O(log log N ) such that x n ⩽ 2 for all n ⩾ L N . Let λ (K) i = x -1 p i and λ (L) i = y -1 q i , then we construct the following estimator: Âml = L N i=0   yi-1⩽j<yi ρ 1 2 j f j ⊗ ρ 1 2 j f j   ĈY X ĈKK + λ (K) i I -1 where y -1 := 0. Note that each summand in the above equation is essentially a regularized leastsquares estimator and learns a rectangular region. The following theorem states that the estimator Âml can achieve minimax optimal convergence rate. Theorem C.1 Consider the estimator Âml defined by (5). Suppose that Assumptions 2.1 to 2.5 hold, then there exists a universal constant C, such that Âml -A 0 2 β ′ ,γ ′ ⩽ Cτ 2 N log N -min β-β ′ max{α,β+p} , γ ′ -γ 1-γ log 2 N holds with probability ⩾ 1 -e -τ . Proof : The proof of Theorem 5.1 is similar to that of Theorems 4.1 and 4.2. We consider the bias-variance decomposition of the estimation error Âml -A 0 β ′ ,γ ′ ⩽ Âml -Âλ ml β ′ ,γ ′ + Âλ ml -A 0 β ′ ,γ ′ where Âλ ml = L N i=0   yi⩽j<yi+1 ρ 1 2 j f j ⊗ ρ 1 2 j f j   C Y X C KK + λ (K) i I -1 . ( ) Bounding the bias term. Since ∥A 0 ∥ β,γ ⩽ B, we can write A 0 := +∞ i=1 +∞ j=1 a ij µ β 2 i ρ 1-γ 2 j f j ⊗ e i where the coefficient matrix A 0 = (a ij ) 1⩽i,j⩽+∞ satisfies ∥A 0 ∥ 2 F ⩽ B 2 . We fix (i, j) ∈ Z 2 + and assume WLOG that y mj -1 ⩽ j < y mj for some m ⩾ 0, where y L N +1 = +∞. It follows from ( 23) that ρ 1 2 j f j , Âλ ml µ 1 2 i e i = L N k=0   y k-1 ⩽j<y k ρ 1 2 j f j ⊗ ρ 1 2 j f j   ρ 1 2 j f j , C Y X C KK + λ (K) k I -1 µ 1 2 i e i = µ i µ i + λ (K) m ρ 1-γ 2 j µ -1-β 2 i a ij . Thus A 0 -Âλ ml 2 β ′ ,γ ′ = C -1-γ 2 Q L Âλ ml -A 0 C 1-β ′ 2 KK 2 HS = +∞ i,j=1 ρ 1 2 j f j , C -1-γ ′ 2 Q L Âλ ml -A 0 C 1-β ′ 2 KK µ 1 2 i e i 2 = +∞ i,j=1 λ (K) mj µ i + λ (K) mj 2 µ β-β ′ i ρ γ ′ -γ j a 2 ij = +∞ j=1 ρ γ ′ -γ j +∞ i=1 a 2 ij max i⩾1 µ β-β ′ i λ (K) mj µ i + λ (K) mj 2 ≲ +∞ j=1 ρ γ ′ -γ j λ (K) mj β-β ′ +∞ i=1 a 2 ij ≲ B 2 max j⩾1 ρ γ ′ -γ j λ (K) mj β-β ′ ⩽ B 2 max j⩾1 ρ γ ′ -γ j x -β-β ′ p mj ≲ B 2 max j⩾1 j -γ ′ -γ q x -β-β ′ p mj ⩽ B 2 y -γ ′ -γ q mj -1 x -β-β ′ p mj ≲ N -η1 where we recall that η 1 = min β-β ′ max{α,β+p} , γ ′ -γ 1-γ and the last step follows from (21c). Bounding the variance term. The variance term can be rewritten in the following way: V = Âml -A λ ml 2 β ′ ,γ ′ = C -1-γ ′ 2 Q L Âml -A λ ml C 1-β ′ 2 KK 2 HS = +∞ i,j=1 ρ 1 2 j f j , C -1-γ ′ 2 Q L Âml -A λ ml C 1-β ′ 2 KK µ 1 2 i e i 2 = z N j=1 ρ -(1-γ ′ ) j +∞ i=1 ρ 1 2 j f j , ĈY X ĈKK + λ mj I -1 -C Y X C KK + λ mj I -1 µ 1-β ′ 2 i e i 2 = z N j=1 ρ -(1-γ ′ ) j +∞ i=1 C KK + λ mj I -1 2 ĈKL -ĈKK + λ mj I C KK + λ mj I -1 C KL =:Um j ρ 1 2 j f j , C KK + λ mj I 1 2 ĈKK + λ mj I -1 C KK + λ mj I 1 2 =:Gm j µ 1-β ′ 2 i µ i + λ j e i 2 = z N j=1 ρ -(1-γ ′ ) j U mj ρ 1 2 j f j , G mj +∞ i=1 µ 2-β ′ i µ i + λ mj e i ⊗ e i G mj U mj ρ 1 2 j f j ≲ z N j=1 j 1-γ ′ q ∥G mj ∥ 2 λ -β ′ mj U mj ρ 1 2 j f j 2 for reasons similar to (12). It now remains to bound G mj and U mj ρ 1 2 j f j for 1 ⩽ j ⩽ L N . Note that these quantities have already been bounded in Appendix B.1.2 with λ mj replaced with λ j (there we use a different regularization for each j). Hence, those bounds can be directly applied here, so there exists a constant C > 0 such that V ⩽ Ca 2 1 N max 1⩽j⩽L N j 1-γ ′ q λ -(β ′ +max{α-β,p}) mj with probability ⩾ 1 -N e -a . Since j ⩽ y mj , by (21b) we have j 1-γ ′ q λ -(β ′ +max{α-β,p}) mj ≲ y 1-γ ′ q mj x β ′ +max{α-β,p} p mj = N η2 . 
Hence
$$V \lesssim \frac{1}{N}\max_{1\leq j\leq L_N} j^{\frac{1-\gamma'}{q}}\,\lambda_{m_j}^{-(\beta'+\max\{\alpha-\beta,p\})} \leq N^{\eta_2-1} = N^{-\eta_1}.$$
Combining the bias and variance bounds, the conclusion directly follows. □

C.1 SPECIAL CASE: $\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}} = \frac{\gamma'-\gamma}{1-\gamma}$

Note that Lemma C.1 does not cover the case $u = 1$, or equivalently $\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}} = \frac{\gamma'-\gamma}{1-\gamma}$. This case is special because the bias contour coincides with the variance contour, and we need to modify the construction of the multilevel estimator. We define two sequences $\{x_n\}, \{y_n\}$ as follows:
$$x_0 = \max\Big\{\tfrac12 N^{\frac{p}{\beta'+p}\eta_2},\ \big(\tfrac{c_0 N}{\log N}\big)^{-\frac1\alpha}\Big\}, \qquad x_n = \tfrac12 x_{n-1}, \qquad y_n = \text{the solution of } x_n^{\frac{\beta-\beta'}{p}}\, y^{\frac{\gamma'-\gamma}{q}} = N^{\eta_1}, \qquad (25)$$
where we recall that $\eta_1 = \frac{\beta-\beta'}{\max\{\alpha,\beta+p\}} = \frac{\gamma'-\gamma}{1-\gamma}$. In this case, there exists $L_N = O(\ln N)$ such that $x_n < 1$ for all $n \geq L_N$. Let $\lambda_i^{(K)} = x_i^{-1/p}$; then we construct the following estimator:
$$\hat A_{\mathrm{ml}} = \sum_{i=0}^{L_N}\Big(\sum_{y_{i-1}\leq j< y_i}\rho_j^{1/2} f_j \otimes \rho_j^{1/2} f_j\Big)\, \hat C_{LK}\big(\hat C_{KK} + \lambda_i^{(K)} I\big)^{-1}. \qquad (26)$$
Similar to Theorem C.1, we can establish the following result.

Theorem C.2 Consider the estimator $\hat A_{\mathrm{ml}}$ defined by (26). Suppose that Assumptions 2.1 to 2.5 hold. Then there exists a universal constant $C$ such that
$$\|\hat A_{\mathrm{ml}} - A_0\|_{\beta',\gamma'}^2 \leq C\tau^2\Big(\frac{N}{\log N}\Big)^{-\min\left\{\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}},\ \frac{\gamma'-\gamma}{1-\gamma}\right\}}\log^2 N$$
holds with probability $\geq 1-e^{-\tau}$.

Proof: The proof of Theorem C.2 is similar to that of Theorems 4.1 and 4.2. We consider the bias-variance decomposition
$$\|\hat A_{\mathrm{ml}} - A_0\|_{\beta',\gamma'} \leq \|\hat A_{\mathrm{ml}} - A^\lambda_{\mathrm{ml}}\|_{\beta',\gamma'} + \|A^\lambda_{\mathrm{ml}} - A_0\|_{\beta',\gamma'}, \quad\text{where}\quad A^\lambda_{\mathrm{ml}} = \sum_{i=0}^{L_N}\Big(\sum_{y_{i-1}\leq j< y_i}\rho_j^{1/2} f_j \otimes \rho_j^{1/2} f_j\Big)\, C_{LK}\big(C_{KK} + \lambda_i^{(K)} I\big)^{-1}, \qquad (27)$$
with $\lambda_i^{(K)}$ as defined for (26).

Bounding the bias term. Let $A_0 := \sum_{i=1}^{+\infty}\sum_{j=1}^{+\infty} a_{ij}\,\mu_i^{\beta/2}\rho_j^{(1-\gamma)/2}\, f_j\otimes e_i$ with coefficient matrix $\mathbf{A}_0 = (a_{ij})_{i,j=1}^{+\infty}$ such that $\|\mathbf{A}_0\|_F^2 \leq B^2$. We fix $(i,j)\in\mathbb{Z}_+^2$ and assume WLOG that $y_{m_j-1}\leq j< y_{m_j}$ for some $m_j\geq 0$, where $y_{L_N+1} = +\infty$. It follows from (27) that
$$\Big\langle \rho_j^{1/2} f_j,\ A^\lambda_{\mathrm{ml}}\,\mu_i^{1/2} e_i\Big\rangle = \frac{\mu_i}{\mu_i + \lambda^{(K)}_{m_j}}\,\rho_j^{\frac{1-\gamma}{2}}\,\mu_i^{-\frac{1-\beta}{2}}\, a_{ij}.$$
Thus we can proceed as in (24) to deduce that
$$\|A_0 - A^\lambda_{\mathrm{ml}}\|_{\beta',\gamma'}^2 \lesssim \max_{j\geq 1}\ \rho_j^{\gamma'-\gamma}\big(\lambda^{(K)}_{m_j}\big)^{\beta-\beta'} \lesssim \max\Big\{\max_{1\leq j\leq L_N} j^{-\frac{\gamma'-\gamma}{q}}\, x_{m_j}^{-\frac{\beta-\beta'}{p}},\ y_{L_N}^{-\frac{\gamma'-\gamma}{q}}\Big\} \leq \max\Big\{\max_{1\leq j\leq L_N} y_{m_j-1}^{-\frac{\gamma'-\gamma}{q}}\, x_{m_j}^{-\frac{\beta-\beta'}{p}},\ y_{L_N}^{-\frac{\gamma'-\gamma}{q}}\Big\}.$$
The definition (25) implies that $y_{m_j-1}^{-\frac{\gamma'-\gamma}{q}} x_{m_j}^{-\frac{\beta-\beta'}{p}} \leq 2^{\frac{\beta-\beta'}{p}}\, y_{m_j-1}^{-\frac{\gamma'-\gamma}{q}} x_{m_j-1}^{-\frac{\beta-\beta'}{p}} \leq 2^{\frac{\beta-\beta'}{p}} N^{-\eta_1}$. On the other hand, since $x_{L_N} < 1$, (25) implies that $y_{L_N}^{-\frac{\gamma'-\gamma}{q}} \lesssim N^{-\eta_1}$. Therefore, the bias term satisfies $\|A_0 - A^\lambda_{\mathrm{ml}}\|_{\beta',\gamma'}^2 \lesssim N^{-\eta_1}$.

Bounding the variance term. Repeating the arguments in the proof of Theorem C.1, we deduce that there exists a constant $C > 0$ such that
$$V \leq \frac{C a^2}{N}\max_{1\leq j\leq L_N} j^{\frac{1-\gamma'}{q}}\,\lambda_{m_j}^{-(\beta'+\max\{\alpha-\beta,p\})} \leq \frac{C a^2}{N}\max_{1\leq j\leq L_N} y_{m_j}^{\frac{1-\gamma'}{q}}\, x_{m_j}^{\frac{\beta'+\max\{\alpha-\beta,p\}}{p}} \lesssim N^{-\eta_1}$$
with probability $\geq 1 - N e^{-a}$. Combining the bias and variance bounds, we arrive at the desired conclusion. □

The conclusion of Theorem 5.1 then follows from Theorems C.1 and C.2.
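The construction (21a)-(21c) and the double-logarithmic growth of $L_N$ can be checked numerically. The sketch below builds the staircase directly from the defining equations; the parameter values are illustrative assumptions (they place us in the case $u < 1$ of Lemma C.1) and $c_0 = 1$ is a placeholder constant.

```python
import numpy as np

def staircase(N, p, q, alpha, beta, beta_p, gamma, gamma_p, c0=1.0):
    """Builds (x_n, y_n) from (21a)-(21c) and stops once x_n <= 2; returns the levels."""
    eta1 = min((beta - beta_p) / max(alpha, beta + p), (gamma_p - gamma) / (1 - gamma))
    eta2 = 1 - eta1
    a = (beta_p + max(alpha - beta, p)) / p        # x-exponent of the variance contour
    b = (1 - gamma_p) / q                          # y-exponent of the variance contour
    x = max(0.5 * N ** (p / (beta_p + p) * eta2), (c0 * N / np.log(N)) ** (-1 / alpha))  # (21a)
    levels = []
    while x > 2 and len(levels) < 10_000:
        y = (N ** eta2 / x ** a) ** (1.0 / b)                                       # (21b)
        levels.append((x, y))
        x = (N ** eta1 / y ** ((gamma_p - gamma) / q)) ** (p / (beta - beta_p))     # (21c)
    return levels

# With these illustrative parameters (u < 1 in Lemma C.1), the number of levels L_N
# grows roughly like log log N:
for N in [10 ** 4, 10 ** 6, 10 ** 8, 10 ** 12]:
    print(f"N = 1e{int(np.log10(N))}: L_N = {len(staircase(N, 0.5, 0.5, 0.3, 0.5, 0.0, 0.0, 0.4))}")
```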

D AUXILIARY RESULTS

Theorem D.1 (Fischer & Steinwart, 2020, Theorem 27) Let $(\Omega, \mathcal{B}, P)$ be a probability space, $H$ a separable Hilbert space, and $X: \Omega \to HS(H; H)$ a random variable with self-adjoint values. Furthermore, assume that $\|X\|_F \leq B$ $P$-a.s., and let $V$ be positive semi-definite with $\mathbb{E}_P X^2 \preceq V$, i.e. $V - \mathbb{E}_P X^2$ is positive semi-definite. Then, for $g(V) := \log\big(2e\,\mathrm{tr}(V)\,\|V\|^{-1}\big)$, $\tau \geq 1$, and $n \geq 1$, the following concentration inequality is satisfied:
$$P^n\left( (\omega_1, \ldots, \omega_n) \in \Omega^n : \Big\| \frac{1}{n}\sum_{i=1}^n X(\omega_i) - \mathbb{E}_P X(\omega)\Big\| \geq \frac{4\tau B g(V)}{3n} + \sqrt{\frac{2\tau\|V\| g(V)}{n}} \right) \leq 2 e^{-\tau}.$$

Theorem D.2 (Fischer & Steinwart, 2020, Theorem 26) Let $(\Omega, \mathcal{B}, P)$ be a probability space, $H$ a separable Hilbert space, and $\xi: \Omega \to H$ a random variable with $\mathbb{E}_P\|\xi\|_H^m \leq \frac12 m!\,\sigma^2 L^{m-2}$ for all $m \geq 2$. Then, for $\tau \geq 1$ and $n \geq 1$, the following concentration inequality is satisfied:
$$P^n\left( (\omega_1,\ldots,\omega_n) \in \Omega^n : \Big\| \frac{1}{n}\sum_{i=1}^n \xi(\omega_i) - \mathbb{E}_P \xi\Big\|_H^2 \geq 32\,\frac{\tau^2}{n}\Big(\sigma^2 + \frac{L^2}{n}\Big) \right) \leq 2 e^{-\tau}.$$

The following theorem shows that the regularized covariance $C_{KK} + \lambda I$ can be estimated with small error when $\lambda$ is above a certain threshold. Although it is well known (Fischer & Steinwart, 2020; Talwai et al., 2022), we recall it below for completeness.

Theorem D.3 Recall that $C_{KK} = \mathbb{E}_{P_K} u\otimes u$ and $\hat C_{KK} = \frac1N\sum_{i=1}^N u_i\otimes u_i$, where $u_i \overset{\text{i.i.d.}}{\sim} P_K$. Suppose that Assumption 2.2 holds and $N \gtrsim A_1^2\,\tau\, g_\lambda\, \lambda^{-\alpha}$, where $g_\lambda = \log\big(2e\,\mathcal{N}_{P_K}(\lambda)\,\frac{\|C_{KK}\|+\lambda}{\|C_{KK}\|}\big)$ and $\mathcal{N}_{P_K}(\lambda) = \mathrm{tr}\big((C_{KK}+\lambda I)^{-1} C_{KK}\big)$ is the effective dimension. Then, with probability at least $1 - e^{-\tau}$, we have
$$\Big\| (C_{KK}+\lambda I)^{-\frac12}\big(C_{KK} - \hat C_{KK}\big)(C_{KK}+\lambda I)^{-\frac12}\Big\| \lesssim \sqrt{\frac{A_1^2\,\tau\, g_\lambda}{N\lambda^{\alpha}}} \leq 0.1. \qquad (28)$$

Proof: Let $X(u) = (C_{KK}+\lambda I)^{-\frac12}(u\otimes u)(C_{KK}+\lambda I)^{-\frac12}$, where $u \in H_K$; then the LHS of (28) can be expressed as $\big\|\frac1N\sum_{i=1}^N X(u_i) - \mathbb{E}_{u\sim P_K}X(u)\big\|$. We hope to apply Theorem D.1 and start with verifying its assumptions. Under Assumption 2.2 we have $\|(C_{KK}+\lambda I)^{-\frac12} u\| \leq \lambda^{-\frac{\alpha}{2}}\cdot A_1$ $P_K$-a.s., so that there exists $V = O(\lambda^{-\alpha})(C_{KK}+\lambda I)^{-1}C_{KK}$ with $\mathbb{E}_{P_K}X^2 \preceq V$. It is easy to see that $\|V\| \lesssim \lambda^{-\alpha}$ and $\mathrm{tr}(V) \lesssim \mathcal{N}_{P_K}(\lambda)$. The conclusion then follows from Theorem D.1 with $B = O(\lambda^{-\alpha})$ and $g(V) = g_\lambda$. □

Corollary D.1 Under the notations and assumptions of Theorem D.3, there exists a constant $C_1 > 0$ such that with probability $\geq 1 - e^{-\tau}$ we have $(C_{KK}+\lambda I)$

Proof: By Theorem D.3 we have, with probability $\geq 1 - e^{-\tau}$, as desired. □

Figure 2: The bias contour and the variance contour. For simplicity, we only plot the case $\alpha \leq \beta + p$; the variance contour is always above the bias contour. Left: when $\frac{\beta'+p}{\beta+p} \geq \frac{1-\gamma'}{1-\gamma}$, the two contours yield an $O\big(N^{-\frac{\beta-\beta'}{\max\{\alpha,\beta+p\}}}\big)$ convergence rate; this is the same learning rate as kernel regression, and the two curves meet at $y = 1$. Right: when $\frac{\beta'+p}{\beta+p} \leq \frac{1-\gamma'}{1-\gamma}$, the two contours yield the same regularization on the output space, leading to a convergence rate of $O\big(N^{-\frac{\gamma'-\gamma}{1-\gamma}}\big)$.

Figure 3: Construction of the sequence $\{(x_i, y_i)\}$.

ACKNOWLEDGMENTS

Jikai Jin is partially supported by the elite undergraduate training program of the School of Mathematical Sciences at Peking University. Yiping Lu is supported by the Stanford Interdisciplinary Graduate Fellowship (SIGF). Jose Blanchet is supported in part by the Air Force Office of Scientific Research under award number FA9550-20-1-0397. Lexing Ying is supported by the National Science Foundation under award DMS-2208163.



