MINIMAX OPTIMAL KERNEL OPERATOR LEARNING VIA MULTILEVEL TRAINING

Abstract

Learning mappings between infinite-dimensional function spaces has achieved empirical success in many disciplines of machine learning, including generative modeling, functional data analysis, causal inference, and multi-agent reinforcement learning. In this paper, we study the statistical limits of learning a Hilbert-Schmidt operator between two infinite-dimensional Sobolev reproducing kernel Hilbert spaces (RKHSs). We establish an information-theoretic lower bound in terms of the Sobolev Hilbert-Schmidt norm and show that a regularization scheme that learns the spectral components below the bias contour and ignores those above the variance contour achieves the optimal learning rate. At the same time, the spectral components between the bias and variance contours give us flexibility in designing computationally feasible machine learning algorithms. Based on this observation, we develop a multilevel kernel operator learning algorithm that is optimal when learning linear operators between infinite-dimensional function spaces.

1. INTRODUCTION

Supervised learning of operators between two infinite-dimensional spaces has attracted attention in several application areas of machine learning, including scientific computing (Lu et al., 2019; Li et al., 2020; de Hoop et al., 2021; Li et al., 2018; 2021b), functional data analysis (Crambes & Mas, 2013; Hörmann & Kidziński, 2015; Wang et al., 2020a), mean-field games (Guo et al., 2019; Wang et al., 2020b), conditional kernel mean embedding (Song et al., 2009; 2013; Muandet et al., 2017), and econometrics (Singh et al., 2019; Muandet et al., 2020; Dikkala et al., 2020; Singh et al., 2020). Despite the empirical success of operator learning, the statistical limits of learning an infinite-dimensional operator have not been systematically studied. In this paper, we study the problem of learning Hilbert-Schmidt operators between infinite-dimensional Sobolev RKHSs H_K^β and H_L^γ with given kernels k and l, respectively, where β, γ ∈ [0, 1) (Adams & Fournier, 2003; Christmann & Steinwart, 2008; Fischer & Steinwart, 2020). Our goal is to derive the optimal sample complexity for linear operator learning, i.e., how much data is required to achieve a given performance level. We first establish an information-theoretic lower bound for learning a Hilbert-Schmidt operator between Sobolev spaces with respect to a general Sobolev norm. The lower bound indicates that the optimal learning rate is the minimum of two polynomial rates: one is determined purely by the input Sobolev reproducing kernel Hilbert space and its evaluating norm, while the other is determined purely by the output space along with its evaluating norm. This rate is novel in that all existing results (Fischer & Steinwart, 2020; Li et al., 2022; de Hoop et al., 2021) only establish rates that depend on the parameters of the input space.
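For concreteness, a standard way to formalize this setting is regularized least squares over Hilbert-Schmidt operators. This is a generic reference formulation, not necessarily the exact estimator analyzed in this paper:

```latex
% Data model: i.i.d. pairs (x_i, y_i) with y_i = A_* x_i + \varepsilon_i,
% where A_* is a Hilbert-Schmidt operator from H_K to H_L.
\hat{A}_\lambda \;=\; \operatorname*{arg\,min}_{A \in \mathrm{HS}(H_K,\, H_L)}
\;\; \frac{1}{n} \sum_{i=1}^{n} \big\| y_i - A x_i \big\|_{H_L}^{2}
\;+\; \lambda \, \| A \|_{\mathrm{HS}}^{2},
```

with the estimation error then measured in a Sobolev Hilbert-Schmidt norm of the difference between the estimator and the ground-truth operator A_*.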
The reason is that all previous works (Talwai et al., 2022; Li et al., 2022; de Hoop et al., 2021) only consider output spaces that are subspaces of a trace-bounded reproducing kernel Hilbert space, not a general Sobolev space. We refer to Remark 2.1 for detailed comparisons. To design a learning algorithm that approximates an infinite-dimensional operator, we must learn a finite-dimensional restriction rather than the whole operator, as the latter would incur infinite variance. Restricting to finitely many dimensions introduces bias but decreases the variance. A natural task is then to determine which shapes of regularization achieve the optimal bias-variance trade-off and hence the optimal learning rate. In this paper, we consider the bias and variance contours at the scale of the optimal rate: whenever the regularization learns all the spectral parts above the bias contour and below the variance contour, the learning is optimal. Finally, exploiting the region between the bias and variance contours, we develop a multilevel training algorithm (Lye et al., 2021; Li et al., 2021a) that first learns the mapping on low-frequency components and then successively fine-tunes the machine learning models to fit the high-frequency output. The intuition behind our algorithm aligns with the original motivation of multilevel Monte Carlo (Giles, 2008; 2015): each additional level reduces the bias while keeping the variance at the same scale. We demonstrate that such a multilevel algorithm achieves the optimal non-parametric rate for linear operator learning.
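To make the multilevel idea concrete, the following sketch learns a truncated linear operator band by band: low-frequency output components are fit with many samples, and each finer frequency band is fit with fewer samples. This is entirely our own illustration; the dyadic band layout, the sample schedule, and the ridge solver are hypothetical choices, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 200  # truncation dimension and total sample size

# Hypothetical ground-truth operator whose output components decay like a
# Sobolev-type spectrum: output coordinate j has magnitude (1 + j)^(-2).
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
A_star = np.diag(1.0 / (1.0 + np.arange(d)) ** 2) @ Q

X = rng.standard_normal((n, d))
Y = X @ A_star.T + 0.01 * rng.standard_normal((n, d))  # noisy observed outputs

def ridge(X, Y, lam):
    """Tikhonov-regularized least squares for the rows of the operator
    corresponding to the output coordinates present in Y."""
    dim = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(dim), X.T @ Y).T

# Dyadic frequency bands of the output: level l covers rows [2^l - 1, 2^(l+1) - 1).
bands = [(2**l - 1, min(2 ** (l + 1) - 1, d)) for l in range(6)]

A_hat = np.zeros((d, d))
for l, (lo, hi) in enumerate(bands):
    n_l = max(8, n // 2**l)  # fewer samples at higher-frequency levels
    A_hat[lo:hi] = ridge(X[:n_l], Y[:n_l, lo:hi], lam=1e-2)

rel_err = np.linalg.norm(A_hat - A_star) / np.linalg.norm(A_star)
```

Because the true operator's high-frequency output components are small, the cheap, heavily subsampled fits at the fine levels add little error, mirroring the bias-variance trade-off discussed above.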

1.1. RELATED WORK

Machine Learning Based PDE Solvers. Solving partial differential equations (PDEs) plays a prominent role in many scientific and engineering disciplines, such as physics, chemistry, operations management, and macroeconomics. The recent deep learning breakthrough has drawn attention to solving PDEs via machine learning methods (Raissi et al., 2019; Han et al., 2018; Sirignano & Spiliopoulos, 2018; Yu et al., 2018; Khoo et al., 2019; Chen et al., 2021). The statistical power and computational cost of these methods are well studied in recent papers (Lu et al., 2021; 2022; Nickl et al., 2020; Nickl & Wang, 2020). This paper focuses on operator learning (Chen & Chen, 1995; Long et al., 2018; 2019; Feliu-Faba et al., 2020; Khoo et al., 2021; Lu et al., 2019; Li et al., 2020; Kovachki et al., 2021; Stepaniants, 2021), i.e., learning a map between two infinite-dimensional function spaces. For example, one can learn a PDE solver that maps the boundary condition to the solution, or an inverse problem that maps boundary measurements to the coefficient field. Regarding the mathematical foundations of operator learning, (Liu et al., 2022) considers the learning rate of non-parametric operator learning. However, non-parametric functional data analysis often suffers from slower-than-polynomial convergence rates (Mas, 2012) due to the small-ball probability problem for probability distributions in infinite-dimensional spaces (Delaigle & Hall, 2010). The most relevant works are (Lin et al., 2011; Reimherr, 2015; de Hoop et al., 2021), which consider the rates for learning a linear operator. For a comparison between our work and (de Hoop et al., 2021), see Remark 2.1.

Learning with Kernels. Supervised least-squares regression in an RKHS and its generalization capability have been thoroughly studied (Caponnetto & De Vito, 2007; Smale & Zhou, 2007; De Vito et al., 2005; Rosasco et al., 2010; Mendelson & Neeman, 2010).
The minimax optimality with respect to the Sobolev norm has been discussed recently in (Fischer & Steinwart, 2020; Liu & Li, 2020; Lu et al., 2022). Our paper is closely related to recent works (Schuster et al., 2020; Mollenhauer & Koltai, 2020; Talwai et al., 2022; Li et al., 2022; Park & Muandet, 2020; Singh et al., 2019; 2020) on identifying the Sobolev-norm learning rate for the kernel mean embedding (Song et al., 2009; 2013; Muandet et al., 2017), which can also be formulated as learning an operator. For the differences between our work and (Talwai et al., 2022; Li et al., 2022), see Remark 2.1. A concurrent paper (Balasubramanian et al., 2022) considers a unified RKHS methodology for functional data analysis; our paper provides a refined analysis and establishes information-theoretically optimal rates for this problem.

Multilevel Monte Carlo. By combining biased estimators with multiple step sizes, multilevel Monte Carlo (MLMC) (Giles, 2008; 2015) dramatically improves the rate of convergence and, in many settings, achieves the canonical square-root convergence rate associated with unbiased Monte Carlo (Rhee & Glynn, 2015; Blanchet & Glynn, 2015). Multilevel Monte Carlo can also be used for random variables with infinite variance (Blanchet & Liu, 2016; Chen et al., 2020). To the best of our knowledge, this is the first paper that provides the optimal sample complexity for a multilevel Monte Carlo type algorithm for infinite-variance problems in the non-parametric regime. Very recently,
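As a reminder of the MLMC mechanics that the multilevel training algorithm borrows, the following toy sketch telescopes a family of biased estimators across levels, allocating fewer samples to the finer (less biased, but individually costlier) levels. This is entirely our own illustration; the target quantity, the `g_level` approximation family, and the dyadic sample schedule are all hypothetical.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Toy problem: estimate E[exp(X)] for X ~ N(0, 1), whose exact value is
# exp(1/2). The level-l "simulator" g_level(., l) is a truncated Taylor
# series of exp with 2^l + 2 terms, so its bias shrinks as l grows.
def g_level(x, l):
    return sum(x**k / math.factorial(k) for k in range(2**l + 2))

def mlmc_estimate(L, n0):
    """Telescoping MLMC estimator: E[g_0] + sum_{l=1}^{L} E[g_l - g_{l-1}]."""
    total = 0.0
    for l in range(L + 1):
        n_l = max(2, n0 // 2**l)      # fewer samples at finer levels
        x = rng.standard_normal(n_l)  # same draws couple levels l and l-1
        diff = g_level(x, l) if l == 0 else g_level(x, l) - g_level(x, l - 1)
        total += diff.mean()
    return total

est = mlmc_estimate(L=5, n0=200_000)
```

Because the coupled level differences g_l - g_{l-1} have rapidly shrinking variance, most of the sampling budget can go to the cheap coarse level while the fine levels remove the remaining bias, which is exactly the role the successive fine-tuning levels play in the operator-learning algorithm.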

