SAMPLE COMPLEXITY OF NONPARAMETRIC OFF-POLICY EVALUATION ON LOW-DIMENSIONAL MANIFOLDS USING DEEP NETWORKS

Abstract

We consider the off-policy evaluation problem of reinforcement learning using deep convolutional neural networks. We analyze the deep fitted Q-evaluation method for estimating the expected cumulative reward of a target policy, when the data are generated from an unknown behavior policy. We show that, by choosing the network size appropriately, one can leverage any low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of the data's high ambient dimensionality. Specifically, we establish a sharp error bound for fitted Q-evaluation, which depends on the intrinsic dimension of the state-action space, the smoothness of the Bellman operator, and a function class-restricted $\chi^2$-divergence. Notably, the restricted $\chi^2$-divergence measures the behavior and target policies' mismatch in the function space, which can be small even if the two policies are not close to each other in their tabular forms. We also develop a novel approximation result for convolutional neural networks in Q-function estimation. Numerical experiments are provided to support our theoretical analysis.

1. INTRODUCTION

Off-policy Reinforcement Learning (RL) [38, 40] is an important area in decision-making applications where the data cannot be acquired with arbitrary policies. For example, in clinical decision-making problems, experimenting with new treatment policies on patients is risky and may raise ethical concerns. Therefore, we are only allowed to generate data using certain policies (or sampling distributions) that have been approved by medical professionals. These so-called "behavior policies" are unknown but impact our problem of interest, resulting in distribution shift and insufficient data coverage of the problem space. In general, the goal is to design algorithms that need as little data as possible to attain a desired accuracy. A crucial problem in off-policy RL is policy evaluation. The goal of Off-Policy Evaluation (OPE) is to estimate the value of a new target policy based on experience data generated by existing behavior policies. Due to the mismatch between behavior and target policies, the off-policy setting is entirely different from the on-policy one, in which the policy value can easily be estimated via Monte Carlo. A popular algorithm for OPE is the fitted Q-evaluation method (FQE), an off-policy variant of fitted Q-iteration [28, 15, 75]. FQE iteratively estimates Q-functions by supervised regression using various function approximation methods, e.g., linear function approximation, and has achieved great empirical success [65, 20, 21], especially in large-scale Markov decision problems. Complementary to the empirical studies, several works theoretically justify the success of FQE. Under linear function approximation, [31] show that FQE is asymptotically efficient, [15] further provide a minimax-optimal non-asymptotic bound, and [47] provide a variance-aware characterization of the distribution shift via a weighted variant of FQE. [75] analyze FQE with realizable, general differentiable function approximation.
[37, 64] tackle OPE with even more general function approximation, but they require stronger assumptions such as full data coverage. [16] focus on on-policy estimation and study a kernel least-squares temporal difference estimator. Recently, deploying neural networks in FQE has achieved great empirical success, largely due to networks' superior flexibility for modeling high-dimensional complex environments [65, 20, 21]. Nonetheless, the theory of FQE using deep neural networks is not yet fully understood. While there are existing results on FQE with various function approximators [28, 15, 75], many of them are not immediately applicable to neural network approximation. [18] focus on the online policy learning problem and study DQN with a feed-forward ReLU network; a concurrent work [73] studies offline policy learning with realizable, general differentiable function approximation. Notably, a recent study [51] provides an analysis of the estimation error of nonparametric FQE using a feed-forward ReLU network, yet this error bound grows quickly when the data dimension is high. Moreover, their result requires full data coverage, i.e., every state-action pair has to eventually be visited in the experience data. Indeed, beyond universal function approximation, there are other properties that contribute to the success of neural networks in supervised learning, for example, their ability to adapt to intrinsic low-dimensional structure in data. While these properties are actively studied in the deep supervised learning literature, they have not been reflected in RL theory. Hence, it is of interest to examine whether these properties still hold in problems of a sequential nature under standard assumptions, and how neural networks can take advantage of such low-dimensional structures in OPE.

Main results. This paper establishes sample complexity bounds for deep FQE using convolutional neural networks (CNNs).
Different from existing results, our theory exploits the intrinsic geometric structures of the state-action space. This is motivated by the fact that in many practical high-dimensional applications, especially image-based ones [59, 11, 76], the data are actually governed by a much smaller number of intrinsic free parameters [2, 55, 32]. See an example in Figure 1.

Figure 1: An example of a state-action space with low-dimensional structures. The states of the OpenAI Gym Bipedal Walker can be visually displayed in high resolution (e.g., 200 × 300), while they are internally represented by a 24-tuple [29].

Consequently, we model the state-action space as a $d$-dimensional Riemannian manifold embedded in $\mathbb{R}^D$ with $d \ll D$. Under some standard regularity conditions, we show CNNs can efficiently approximate Q-functions and allow for fast-rate policy value estimation, free of the curse of the ambient dimensionality $D$. Moreover, our results do not need strong data coverage assumptions. In particular, we develop a function class-restricted $\chi^2$-divergence to quantify the mismatch between the visitation distributions induced by the behavior and target policies. The function class can be viewed as a smoothing factor of the distribution mismatch, since the function class may be insensitive to certain differences between the two distributions. Our approximation theory and mismatch characterization significantly sharpen the dimension dependence of deep FQE. In detail, our theoretical results are summarized as follows:

(I) Given a target policy $\pi$, we measure the distribution shift between the experience data distributions $\{q^{\mathrm{data}}_h\}_{h=1}^H$ and the visitation distributions of the target policy $\{q^\pi_h\}_{h=1}^H$ by
$$\kappa = \frac{1}{H}\sum_{h=1}^H \sqrt{\chi^2_Q(q^\pi_h, q^{\mathrm{data}}_h) + 1}, \qquad (1)$$
where $\chi^2_Q(q^\pi_h, q^{\mathrm{data}}_h)$ is the restricted $\chi^2$-divergence between $q^\pi_h$ and $q^{\mathrm{data}}_h$, defined as
$$\chi^2_Q(q^\pi_h, q^{\mathrm{data}}_h) = \sup_{f \in Q} \frac{\big(\mathbb{E}_{q^\pi_h}[f]\big)^2}{\mathbb{E}_{q^{\mathrm{data}}_h}[f^2]} - 1,$$
with $Q$ being a function space relevant to our algorithm.
(II) We prove that the value estimation error of a target policy $\pi$ is
$$\mathbb{E}\,|\widehat v^\pi - v^\pi| = \widetilde O\big(\kappa H^2 K^{-\frac{\alpha}{2\alpha+d}}\big),$$
where $K$ is the effective sample size of experience data sampled by the behavior policy (more details in Section 3), $H$ is the length of the horizon, $\alpha$ is the smoothness parameter of the Bellman operator, $\kappa$ is defined in (1), and $\widetilde O(\cdot)$ hides some constant depending on the state-action space and a polynomial factor in $D$.

We compare our results with several related works in Table 1. Both [15] and [75] consider parametric function approximation of the Q-function: [15] study linear function approximation, and [75] assume third-order differentiability, which is not applicable to neural networks with non-smooth activations. On the other hand, [51] use feed-forward neural networks with ReLU activation to estimate nonparametric Q-functions, but they do not take any low-dimensional structure of the state-action space into consideration. Therefore, their result suffers from the curse of the ambient dimension $D$. Moreover, they characterize the distribution shift with the absolute density ratio between the experience data distribution and the visitation distribution of the target policy, which is strictly larger than our characterization in restricted $\chi^2$-divergences. As can be seen from this comparison, our result improves over existing ones. Moreover, since CNNs are widely used in deep RL applications and retain state-of-the-art performance [48, 22], our consideration of CNNs is a further step toward bridging practice and theory.

Table 1: $\kappa_1$ and $\kappa_2$ are measures of distribution shift with respect to their respective regularity spaces; $D_{\mathrm{eff}}$ denotes the effective dimension of the function approximator in [75], which usually suffers from the curse of dimensionality; $\kappa_3$ is the absolute density ratio between the data distribution and the target policy's visitation distribution; $\kappa$ is defined in (1) and is no larger than, and often substantially smaller than, $\kappa_3$.
See in-depth discussions in Section 4.

| | Regularity | Approximation | Estimation Error |
|---|---|---|---|
| [15] | Linear | None | $O\big(H^2 \kappa_1 \sqrt{D/K}\big)$ |
| [75] | Third-order differentiable | None | $O\big(H^2 \kappa_2 \sqrt{D_{\mathrm{eff}}/K}\big)$ |
| [51] | Besov | Feed-forward ReLU net | $O\big(H^{2-\frac{\alpha}{2\alpha+2D}} \kappa_3 K^{-\frac{\alpha}{2\alpha+2D}}\big)$ |
| This work | Besov | CNN | $O\big(H^2 \kappa K^{-\frac{\alpha}{2\alpha+d}}\big)$ |

Additional Related Work. Besides FQE, there are other types of methods in the OPE literature. One popular class uses importance sampling to reweight samples by the distribution-shift ratio [56], but importance sampling suffers from large variance, which is exponential in the length of the horizon in the worst case. To address this issue, variants with reduced variance such as marginal importance sampling (MIS) [68] and doubly robust estimation [35, 61] have been developed. For the tabular setting with complete data coverage, [72] show that MIS is an asymptotically efficient OPE estimator, matching the Cramer-Rao lower bound in [35]. Moreover, a line of work [49, 50, 39, 74] focuses on policy evaluation without function approximation using MIS and linear programming.

Notation. For a scalar $a > 0$, $\lceil a \rceil$ denotes the ceiling function, which gives the smallest integer no less than $a$; $\lfloor a \rfloor$ denotes the floor function, which gives the largest integer no larger than $a$. For any scalars $a$ and $b$, $a \vee b$ denotes $\max(a, b)$ and $a \wedge b$ denotes $\min(a, b)$. For a vector or a matrix, $\|\cdot\|_0$ denotes the number of nonzero entries and $\|\cdot\|_\infty$ denotes the maximum magnitude of the entries. Given a function $f : \mathbb{R}^D \to \mathbb{R}$ and a multi-index $s = [s_1, \cdots, s_D]^\top$, $\partial^s f$ denotes $\frac{\partial^{|s|} f}{\partial x_1^{s_1} \cdots \partial x_D^{s_D}}$. $\|f\|_{L^p}$ denotes the $L^p$ norm of the function $f$. We adopt the convention $0/0 = 0$. Given distributions $p$ and $q$ such that $p$ is absolutely continuous with respect to $q$, the Pearson $\chi^2$-divergence is defined as $\chi^2(p, q) := \mathbb{E}_q\big[\big(\frac{dp}{dq} - 1\big)^2\big]$.
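For concreteness, the Pearson $\chi^2$-divergence above can be computed directly for discrete distributions. The following minimal sketch (our own illustration, using the $0/0 = 0$ convention from the Notation paragraph) does so in NumPy:

```python
import numpy as np

def chi2_divergence(p, q):
    """Pearson chi^2-divergence E_q[(dp/dq - 1)^2] for discrete
    distributions given as probability vectors (p << q assumed)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    # convention 0/0 = 0: where q = 0 the ratio is set to 0
    ratio = np.divide(p, q, out=np.zeros_like(p), where=q > 0)
    return float(np.sum(q * (ratio - 1.0) ** 2))

print(chi2_divergence([0.5, 0.5], [0.25, 0.75]))   # 1/3
print(chi2_divergence([0.25, 0.75], [0.25, 0.75]))  # 0.0
```

The divergence is asymmetric and vanishes exactly when the two distributions coincide.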

2.1. TIME-INHOMOGENEOUS MARKOV DECISION PROCESS

We consider a finite-horizon time-inhomogeneous Markov Decision Process (MDP) $(S, A, \{P_h\}_{h=1}^H, \{R_h\}_{h=1}^H, H, \xi)$, where $\xi$ is the initial state distribution. At time step $h = 1, \cdots, H$, from a state $s$ in the state space $S$, we may choose an action $a$ from the action space $A$ and transition to a random next state $s' \in S$ according to the transition probability distribution $s' \sim P_h(\cdot \mid s, a)$. Then, the system generates a random scalar reward $r \sim R_h(s, a)$ with $r \in [0, 1]$. We denote the mean of $R_h(s, a)$ by $r_h(s, a)$. A policy $\pi = \{\pi_h\}_{h=1}^H$ specifies a set of $H$ distributions $\pi_h(\cdot \mid s)$ for choosing actions at every state $s \in S$ and time step $h$. Given a policy $\pi$, the state-action value function, also known as the Q-function, is defined for $h = 1, 2, \cdots, H$ as
$$Q^\pi_h(s, a) := \mathbb{E}_\pi\Big[\sum_{h'=h}^H r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s, a_h = a\Big],$$
where $a_{h'} \sim \pi_{h'}(\cdot \mid s_{h'})$ and $s_{h'+1} \sim P_{h'}(\cdot \mid s_{h'}, a_{h'})$. Moreover, let $q^\pi_h$ denote the state-action visitation distribution of $\pi$ at step $h$, i.e., $q^\pi_h(s, a) := \mathbb{P}_\pi[s_h = s, a_h = a \mid s_1 \sim \xi]$. For notational ease, we denote $X := S \times A$. Let $P^\pi_h : \mathbb{R}^X \to \mathbb{R}^X$ denote the conditional transition operator at step $h$:
$$P^\pi_h f(s, a) := \mathbb{E}[f(s', a') \mid s, a], \quad \forall f : X \to \mathbb{R},$$
where $a' \sim \pi_h(\cdot \mid s')$ and $s' \sim P_h(\cdot \mid s, a)$. Denote the Bellman operator at time $h$ under policy $\pi$ as $T^\pi_h$:
$$T^\pi_h f(s, a) := r_h(s, a) + P^\pi_h f(s, a), \quad \forall f : X \to \mathbb{R}.$$
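To make these definitions concrete, the following toy sketch (all MDP quantities are randomly generated for illustration) evaluates the Q-functions of a small tabular finite-horizon MDP by applying the Bellman operator backward from the convention $Q^\pi_{H+1} = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 3, 2, 4                                # a toy tabular MDP (hypothetical sizes)
P  = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, s, a] = P_h(. | s, a)
r  = rng.uniform(size=(H, S, A))                 # mean rewards r_h(s, a) in [0, 1]
pi = rng.dirichlet(np.ones(A), size=(H, S))      # pi[h, s] = pi_h(. | s)
xi = np.full(S, 1.0 / S)                         # initial state distribution

# Backward Bellman recursion Q_h = r_h + P_h^pi Q_{h+1}, with Q_{H+1} = 0.
Q = np.zeros((S, A))                             # holds Q_{h+1}; starts as Q_{H+1} = 0
for h in reversed(range(H)):                     # 0-indexed steps h = H-1, ..., 0
    # expected next-step value under the policy at the next time step,
    # matching the Q-function definition (a_{h'} ~ pi_{h'})
    V_next = (pi[h + 1] * Q).sum(-1) if h + 1 < H else np.zeros(S)
    Q = r[h] + P[h] @ V_next                     # (T_h^pi Q_{h+1})(s, a)

v_pi = float(xi @ (pi[0] * Q).sum(-1))           # expected cumulative reward
print(round(v_pi, 4))
```

Since every reward lies in $[0, 1]$, the resulting value is guaranteed to lie in $[0, H]$.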

The Bellman equation may be written as
$$Q^\pi_h = T^\pi_h Q^\pi_{h+1}.$$

2.2. RIEMANNIAN MANIFOLD

Let $M$ be a $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$. A chart for $M$ is a pair $(U, \phi)$ such that $U \subset M$ is open and $\phi : U \to \mathbb{R}^d$ is a homeomorphism, i.e., $\phi$ is a bijection, and both $\phi$ and its inverse are continuous. Two charts $(U, \phi)$ and $(V, \psi)$ are called $C^k$ compatible if and only if
$$\phi \circ \psi^{-1} : \psi(U \cap V) \to \phi(U \cap V) \quad \text{and} \quad \psi \circ \phi^{-1} : \phi(U \cap V) \to \psi(U \cap V)$$
are both $C^k$ functions ($k$-th order continuously differentiable). A $C^k$ atlas of $M$ is a collection of $C^k$ compatible charts $\{(U_i, \phi_i)\}$ such that $\bigcup_i U_i = M$. We introduce the reach [19, 53] of a manifold to characterize the curvature of $M$.

Definition 2 (Reach, Definition 2.1 in [1]). The medial axis of $M$ is defined as $\overline{T(M)}$, the closure of
$$T(M) = \big\{x \in \mathbb{R}^D \mid \exists\, x_1 \neq x_2 \in M \text{ such that } \|x - x_1\|_2 = \|x - x_2\|_2 = \inf_{y \in M} \|x - y\|_2\big\}.$$
The reach $\omega$ of $M$ is the minimum distance between $M$ and $\overline{T(M)}$, i.e., $\omega = \inf_{x \in \overline{T(M)},\, y \in M} \|x - y\|_2$.

Roughly speaking, the reach measures how fast a manifold "bends". A manifold with a large reach "bends" relatively slowly. On the contrary, a small $\omega$ signifies more complicated local geometric structures, which are possibly hard to fully capture.

2.3. BESOV FUNCTIONS ON SMOOTH MANIFOLD

Through the concept of atlas, we are able to define the Besov space on a smooth manifold.

Definition 3 (Modulus of Smoothness [12]). Let $\Omega \subset \mathbb{R}^D$ and let $f : \mathbb{R}^D \to \mathbb{R}$ be in $L^p(\Omega)$ for some $p > 0$. The $r$-th modulus of smoothness of $f$ is defined by
$$w_{r,p}(f, t) = \sup_{\|h\|_2 \le t} \|\Delta^r_h(f)\|_{L^p}, \quad \text{where} \quad \Delta^r_h(f)(x) = \begin{cases} \sum_{j=0}^r \binom{r}{j} (-1)^{r-j} f(x + jh) & \text{if } x \in \Omega \text{ and } x + rh \in \Omega, \\ 0 & \text{otherwise.} \end{cases}$$

Definition 4 (Besov functions on $M$). Let $M$ be a compact manifold of dimension $d$ with a finite atlas $\{(U_i, \phi_i)\}$. A function $f : M \to \mathbb{R}$ belongs to the Besov space $B^\alpha_{p,q}(M)$ if on any chart $U_i$ it holds that
$$\|f\|_{B^\alpha_{p,q}(U_i)} := \|f\|_{L^p(U_i)} + |f|_{B^\alpha_{p,q}(U_i)} < \infty, \quad \text{where} \quad |f|_{B^\alpha_{p,q}(U_i)} := \begin{cases} \Big(\int_0^\infty \big(t^{-\alpha} w_{\lfloor\alpha\rfloor+1,p}(f, t)\big)^q \frac{dt}{t}\Big)^{1/q} & \text{if } q < \infty, \\ \sup_{t > 0} t^{-\alpha} w_{\lfloor\alpha\rfloor+1,p}(f, t) & \text{if } q = \infty. \end{cases}$$
Further, the Besov norm of $f$ is defined as $\|f\|_{B^\alpha_{p,q}(M)} = \sum_i \|f\|_{B^\alpha_{p,q}(U_i)}$. We occasionally omit $M$ in the Besov norm when it is clear from the context.
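As a 1-D numerical illustration of Definition 3 (our own sketch: the supremum and the $L^p$ norm are crudely discretized on a grid, only positive step sizes are scanned, and $\Omega = [a, b]$ is assumed):

```python
import numpy as np
from math import comb

def modulus_of_smoothness(f, r, p, t, a=0.0, b=1.0, n=2001, n_h=200):
    """Approximate w_{r,p}(f, t) = sup_{|h| <= t} ||Delta_h^r f||_{L^p}
    on Omega = [a, b], with the difference set to 0 when x + r*h leaves Omega."""
    x = np.linspace(a, b, n)
    best = 0.0
    for h in np.linspace(1e-9, t, n_h):           # scan step sizes up to t
        inside = x + r * h <= b                   # where Delta_h^r is defined
        delta = np.zeros(n)
        for j in range(r + 1):
            delta[inside] += comb(r, j) * (-1) ** (r - j) * f(x[inside] + j * h)
        if np.isinf(p):
            val = np.abs(delta).max()
        else:                                     # grid approximation of the L^p norm
            val = (np.mean(np.abs(delta) ** p) * (b - a)) ** (1.0 / p)
        best = max(best, val)
    return best

# For f(x) = x^2 the 2nd-order difference is exactly 2h^2, so w_{2,inf}(f, t) = 2t^2.
print(modulus_of_smoothness(lambda x: x * x, r=2, p=np.inf, t=0.1))  # ~ 0.02
```

For an $\alpha$-smooth function and $r > \alpha$, the modulus decays like $t^\alpha$, which is exactly the behavior the Besov seminorm integrates against.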

2.4. CONVOLUTIONAL NEURAL NETWORK

We consider one-sided stride-one convolutional neural networks (CNNs) with the rectified linear unit (ReLU) activation function ($\mathrm{ReLU}(z) = \max(z, 0)$). Specifically, a CNN we consider consists of a padding layer, several convolutional blocks, and finally a fully connected output layer. Given an input vector $x \in \mathbb{R}^D$, the network first applies a padding operator $P : \mathbb{R}^D \to \mathbb{R}^{D \times C}$ for some integer $C \ge 1$ such that
$$Z = P(x) = [x\ \ 0\ \cdots\ 0] \in \mathbb{R}^{D \times C}.$$
Then the matrix $Z$ is passed through $M$ convolutional blocks. We denote the input matrix to the $m$-th block by $Z_m$ and its output by $Z_{m+1}$ (i.e., $Z_1 = Z$). Let us define the convolution operation used in Equation (2). Let $W = \{W_{j,i,l}\} \in \mathbb{R}^{C' \times I \times C}$ be a filter, where $C'$ is the output channel size, $I$ is the filter size, and $C$ is the input channel size. For $Z \in \mathbb{R}^{D \times C}$, the convolution of $W$ with $Z$, denoted by $W * Z$, results in $Y \in \mathbb{R}^{D \times C'}$ with
$$Y_{k,j} = \sum_{i=1}^I \sum_{l=1}^C W_{j,i,l}\, Z_{k+i-1,l},$$
where we set $Z_{k+i-1,l} = 0$ for $k + i - 1 > D$. In the $m$-th convolutional block, let $W_m = \{W^{(l)}_m\}$ and $B_m = \{B^{(l)}_m\}$ be collections of filters and biases of proper sizes. The $m$-th block maps its input matrix $Z_m \in \mathbb{R}^{D \times C}$ to $Z_{m+1} \in \mathbb{R}^{D \times C}$ by
$$Z_{m+1} = \mathrm{ReLU}\Big(W^{(L_m)}_m * \cdots * \mathrm{ReLU}\big(W^{(1)}_m * Z_m + B^{(1)}_m\big) \cdots + B^{(L_m)}_m\Big) \qquad (2)$$
with ReLU applied entrywise. For notational simplicity, we denote this series of operations in the $m$-th block by a single operator $\mathrm{Conv}_{W_m, B_m}$ from $\mathbb{R}^{D \times C}$ to $\mathbb{R}^{D \times C}$, so (2) can be abbreviated as $Z_{m+1} = \mathrm{Conv}_{W_m, B_m}(Z_m)$. Overall, we denote the mapping from the input $x$ to the output of the $M$-th convolutional block by
$$G(x) = (\mathrm{Conv}_{W_M, B_M}) \circ \cdots \circ (\mathrm{Conv}_{W_1, B_1}) \circ P(x). \qquad (3)$$
Given (3), a CNN applies an additional fully connected layer to $G$ and outputs
$$f(x) = W \otimes G(x) + b,$$
where $W \in \mathbb{R}^{D \times C}$ and $b \in \mathbb{R}$ are a weight matrix and a bias, respectively, and $\otimes$ denotes the sum of the entrywise product, i.e., $W \otimes G(x) = \sum_{i,j} W_{i,j} [G(x)]_{i,j}$. Thus, we define a class of CNNs of the same architecture as
$$F(M, L, J, I, \tau_1, \tau_2) = \Big\{f \,\Big|\, f(x) = W \otimes G(x) + b \text{ with } \|W\|_\infty \vee |b| \le \tau_2,\ G(x) \text{ in the form of (3) with } M \text{ blocks; the number of filters per block is bounded by } L; \text{ the filter size is bounded by } I; \text{ the number of channels is bounded by } J;\ \max_{m,l} \|W^{(l)}_m\|_\infty \vee \|B^{(l)}_m\|_\infty \le \tau_1\Big\}. \qquad (4)$$
Furthermore, $F(M, L, J, I, \tau_1, \tau_2, V)$ is defined as
$$F(M, L, J, I, \tau_1, \tau_2, V) = \{f \in F(M, L, J, I, \tau_1, \tau_2) \mid \|f\|_{L^\infty} \le V\}. \qquad (5)$$
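The one-sided stride-one convolution above can be transcribed directly into NumPy (0-indexed; a sketch for checking the index bookkeeping, not an efficient implementation):

```python
import numpy as np

def conv_one_sided(W, Z):
    """One-sided stride-one convolution: Y[k, j] = sum_{i,l} W[j,i,l] * Z[k+i-1, l]
    (0-indexed here), with Z treated as zero beyond its last row."""
    Cout, I, Cin = W.shape
    D, C = Z.shape
    assert C == Cin, "channel mismatch"
    Zpad = np.vstack([Z, np.zeros((I - 1, C))])    # zeros for positions past D
    Y = np.empty((D, Cout))
    for k in range(D):
        # contract the (I, Cin) window against W over filter and channel axes
        Y[k] = np.einsum('jil,il->j', W, Zpad[k:k + I])
    return Y

# Sanity check: a size-1 "identity" filter bank reproduces the input channels.
Z = np.arange(8.0).reshape(4, 2)                   # D = 4 positions, C = 2 channels
W = np.zeros((2, 1, 2)); W[0, 0, 0] = 1.0; W[1, 0, 1] = 1.0
print(np.allclose(conv_one_sided(W, Z), Z))        # True
```

The output keeps the full length $D$ because padding is applied only past the end of $Z$, which is exactly the "one-sided" convention of the definition.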

Universal Approximation of Neural Networks

There is a rich literature on using neural networks to approximate functions supported on compact domains in Euclidean space, from early asymptotic results [34, 23, 10, 33] to later quantitative results [3, 46, 44, 30, 71]. These results suffer from the curse of dimensionality, in that to approximate a function up to a certain error, the network size grows exponentially in the data dimension. Recently, a line of work advances the approximation theory of neural networks to functions supported on domains with intrinsic geometric structures [8, 57, 60, 58, 7, 42]. These works show that neural networks are adaptive to the intrinsic structures in data: to approximate a function up to a certain error, it suffices to choose a network size depending only on the intrinsic dimension, which is often much smaller than the representation dimension of the data. Complementing these existing results, we prove a novel universal approximation theory for CNNs en route to our RL result.

3. NEURAL FITTED Q-EVALUATION

We consider the off-policy evaluation (OPE) problem of a finite-horizon time-inhomogeneous MDP. The transition model $\{P_h\}_{h=1}^H$ and reward functions $\{R_h\}_{h=1}^H$ are unknown, and we are only given access to an unknown behavior policy $\pi_0$ to generate experience data from the MDP. Our objective is to evaluate the value of $\pi$ from a fixed initial distribution $\xi$ over horizon $H$, given by
$$v^\pi := \mathbb{E}_\pi\Big[\sum_{h=1}^H r_h(s_h, a_h) \,\Big|\, s_1 \sim \xi\Big],$$
where $a_h \sim \pi(\cdot \mid s_h)$ and $s_{h+1} \sim P_h(\cdot \mid s_h, a_h)$.

At time step $h$, we generate data $D_h := \{(s_{h,k}, a_{h,k}, s'_{h,k}, r_{h,k})\}_{k=1}^K$. Specifically, $\{s_{h,k}\}_{k=1}^K$ are i.i.d. samples from some state distribution. For each $s_{h,k}$, we use the unknown behavior policy $\pi_0$ to generate $a_{h,k} \sim \pi_0(\cdot \mid s_{h,k})$. More generally, we may view $\{(s_{h,k}, a_{h,k})\}_{k=1}^K$ as i.i.d. samples from a sampling distribution $q^{\pi_0}_h$. For each $(s_{h,k}, a_{h,k})$, we further generate $s'_{h,k} \sim P_h(\cdot \mid s_{h,k}, a_{h,k})$ and $r_{h,k} \sim R_h(s_{h,k}, a_{h,k})$ independently for each $k$. This assumption on data generation is the time-inhomogeneous analog of a standard data assumption in time-homogeneous OPE, which assumes all data are i.i.d. samples from the same sampling distribution [51, 74, 50, 69]. Moreover, our dataset is similar to one that comprises $K$ independent episodes generated by the behavior policy, as [41] introduce a subroutine whereby one can process an episodic dataset and treat it as an i.i.d.-sampled dataset in any downstream algorithm.

To estimate $v^\pi$, neural FQE estimates $Q^\pi_h$ in a backward, recursive fashion. $\widehat Q^\pi_h$, our estimate at step $h$, is taken as $\widehat T^\pi_h \widehat Q^\pi_{h+1}$, whose update rule is based on the Bellman equation:
$$\widehat T^\pi_h \widehat Q^\pi_{h+1} = \operatorname*{arg\,min}_{f \in F} \sum_{k=1}^K \Big(f(s_{h,k}, a_{h,k}) - r_{h,k} - \int_A \widehat Q^\pi_{h+1}(s'_{h,k}, a)\, \pi_h(a \mid s'_{h,k})\, da\Big)^2, \qquad (6)$$
where $\widehat T^\pi_h$ is an intermediate estimate of the Bellman operator $T^\pi_h$, $\widehat Q^\pi_{h+1}$ is an intermediate estimate of $Q^\pi_{h+1}$, and $F$ denotes a class of convolutional neural networks as specified in (5) with proper hyperparameters. The pseudocode for our algorithm is presented in Algorithm 1.
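The backward recursion can be sketched on a toy tabular MDP, with a cell-wise least-squares fit (i.e., per-cell sample means) standing in for the CNN class $F$; all MDP quantities below are randomly generated for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, H, K = 3, 2, 5, 5000                        # toy sizes (hypothetical)
P   = rng.dirichlet(np.ones(S), size=(H, S, A))   # P_h(. | s, a)
r   = rng.uniform(size=(H, S, A))                 # mean rewards r_h(s, a)
pi  = rng.dirichlet(np.ones(A), size=(H, S))      # target policy pi_h(. | s)
pi0 = rng.dirichlet(np.ones(A), size=(H, S))      # unknown behavior policy
xi  = np.full(S, 1.0 / S)                         # initial distribution

def sample_batch(h):
    """K i.i.d. transitions at step h generated by the behavior policy."""
    s = rng.integers(S, size=K)                   # crude state sampling distribution
    a = np.array([rng.choice(A, p=pi0[h, si]) for si in s])
    s2 = np.array([rng.choice(S, p=P[h, si, ai]) for si, ai in zip(s, a)])
    rew = r[h, s, a] + rng.normal(0.0, 0.05, K)   # noisy reward draws
    return s, a, s2, rew

# FQE: regress r + E_{a' ~ pi}[Qhat_{h+1}(s', a')] on (s, a), backward in h.
Qhat = np.zeros((S, A))                           # Qhat_{H+1} = 0
for h in reversed(range(H)):
    s, a, s2, rew = sample_batch(h)
    Vnext = (pi[h + 1] * Qhat).sum(-1) if h + 1 < H else np.zeros(S)
    y = rew + Vnext[s2]                           # regression targets
    for si in range(S):
        for ai in range(A):
            m = (s == si) & (a == ai)
            Qhat[si, ai] = y[m].mean() if m.any() else 0.0

v_hat = float(xi @ (pi[0] * Qhat).sum(-1))
print(round(v_hat, 3))
```

With enough samples per cell this estimate approaches the exact policy value obtained by applying the Bellman operator backward with the true model.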

4. MAIN RESULTS

In this section, we prove an upper bound on the estimation error of $v^\pi$ by Algorithm 1. First, let us state two assumptions on the MDP of interest.

Assumption 1 (Low-dimensional state-action space). The state-action space $X$ is a $d$-dimensional compact Riemannian manifold isometrically embedded in $\mathbb{R}^D$. There exists $B > 0$ such that $\|x\|_\infty \le B$ for any $x \in X$. The reach of $X$ is $\omega > 0$.

Assumption 1 characterizes the low-dimensional structure of an MDP represented in high dimensions. We say that the "intrinsic dimension" of $X$ is $d \ll D$. Such a setting, as mentioned in Section 1, is common in practice, because the representations or features people have access to are often excessive compared to the latent structure of the problem. For instance, images of a dynamical system are widely believed to admit such low-dimensional latent structures [26, 32, 55]. People often take the visual display of a computer game as its state representation, which is in pixels, but the computer only keeps a small number of parameters internally to represent the state of the game.

Algorithm 1 Neural Fitted Q-Evaluation (Neural-FQE)
Input: Initial distribution $\xi$, target policy $\pi$, horizon $H$, effective sample size $K$, function class $F$.
Init: $\widehat Q^\pi_{H+1} := 0$.
for $h = H, H-1, \cdots, 1$ do
    Sample $D_h = \{(s_{h,k}, a_{h,k}, s'_{h,k}, r_{h,k})\}_{k=1}^K$.
    Update $\widehat Q^\pi_h \leftarrow \widehat T^\pi_h \widehat Q^\pi_{h+1}$ by (6).
end for
Output: $\widehat v^\pi := \int_X \widehat Q^\pi_1(s, a)\, \xi(s)\, \pi(a \mid s)\, ds\, da$.

Assumption 2 (Bellman completeness). Under the target policy $\pi$, for any time step $h$ and any $f \in F$, we have $T^\pi_h f \in B^\alpha_{p,q}(X)$, where $0 < p, q \le \infty$ and $d/p + 1 \le \alpha < \infty$. Moreover, there exists a constant $c_0 > 0$ such that $\|T^\pi_h f\|_{B^\alpha_{p,q}(X)} \le c_0$ for any time step $h$.

The Bellman completeness assumption concerns the closure of a function class under the Bellman operator. It has been widely adopted in the RL literature [70, 14, 6, 69]. Some classic MDP settings implicitly possess this property, e.g., linear MDPs [36].
Note that [66] show the necessity of such an assumption on the Bellman operator to regulate the Bellman residual: without it, even in the simple setting of linear function approximation with realizability, solving OPE up to a constant error requires a sample complexity exponential in the horizon. The Besov family contains a large class of smooth functions and has been widely adopted in the nonparametric statistics literature for various problems [24, 62, 63]. For MDPs, Assumption 2 holds for most common smooth dynamics, as long as certain regularity conditions on smoothness are satisfied. For instance, [51] show a simple yet sufficient condition under which, for any time step $h$ and $s' \in S$, the reward function $r_h(s, a)$ and the transition kernel $P_h(s' \mid s, a)$ are functions in $B^\alpha_{p,q}$. This further implies $Q^\pi_1, \cdots, Q^\pi_H \in B^\alpha_{p,q}(X)$. In addition, Assumption 2 may be satisfied even when the transition kernel is not smooth; examples are provided in [18]. Note that while most existing work on function approximation assumes Bellman completeness with respect to the function approximator, which in our work would be deep convolutional neural networks with ReLU activation, we only require closure into the Besov class $B^\alpha_{p,q}(X)$ under the Bellman operator. This assumption is weaker than that of the previous work [75], which considers smooth function approximation (excluding ReLU networks). Our main result is summarized in Theorem 2, which relies on using CNNs to accurately represent $T^\pi_h f$ for any $f \in F$. The following theorem provides a novel quantitative analysis of how to properly choose CNN classes for approximating $T^\pi_h f$, depending on the regularity of $T^\pi_h$.

Theorem 1. Suppose Assumptions 1 and 2 hold.
For any integer $I \in [2, D]$ and any $\widetilde M, \widetilde J > 0$, we set
$$L = O\big(\log(\widetilde M \widetilde J) + D + \log D\big), \quad J = O(D \widetilde J), \quad \tau_1 = \big((8ID)^{-1} \widetilde M^{-1}\big)^{1/L} = O(1),$$
$$\log \tau_2 = O\big(\log^2(\widetilde M \widetilde J) + D \log(\widetilde M \widetilde J)\big), \quad M = O(\widetilde M).$$
Then the CNN class $F(M, L, J, I, \tau_1, \tau_2)$ in (4) can approximate $T^\pi_h f$ for any $f \in F(M, L, J, I, \tau_1, \tau_2)$ and $h = 1, \cdots, H$; i.e., there exists $\widetilde f \in F(M, L, J, I, \tau_1, \tau_2)$ with
$$\|\widetilde f - T^\pi_h f\|_\infty \le (\widetilde M \widetilde J)^{-\frac{\alpha}{d}}.$$
Here $O(\cdot)$ hides a constant depending on $d$, $\alpha$, $\frac{2d}{\alpha p - d}$, $p$, $q$, $\|T^\pi_h f\|_{B^\alpha_{p,q}(X)}$, $B$, $\omega$, and the surface area of $X$.

The proof is provided in Appendix B. As can be seen, the rate of approximation is free of the curse of the ambient dimension $D$. We remark that Theorem 1 allows an arbitrary rescaling of $\widetilde M$ and $\widetilde J$, as only their product is relevant to the approximation error. This is more flexible than conventional approximation theories [7, 54, 51], where the network width and depth have to maintain a fixed ratio in terms of the desired approximation error. In Theorem 2, we choose a configuration of $\widetilde M$ and $\widetilde J$ that leads to the optimal statistical rate via a bias-variance tradeoff argument.

Theorem 2. Suppose Assumptions 1 and 2 hold. Choose
$$L = O(\log K + D + \log D), \quad J = O(D), \quad \tau_1 = O(1), \quad \log \tau_2 = O(\log^2 K + D \log K), \quad M = O\big(K^{\frac{d}{2\alpha+d}}\big), \quad V = H \qquad (9)$$
with any integer $I \in [2, D]$ in Algorithm 1, where $O(\cdot)$ hides factors depending on $d$, $\alpha$, $\frac{2d}{\alpha p - d}$, $p$, $q$, $c_0$, $B$, $\omega$, and the surface area of $X$. Then we have
$$\mathbb{E}\,|\widehat v^\pi - v^\pi| \le C H^2 \kappa K^{-\frac{\alpha}{2\alpha+d}} \log^{\frac{5}{2}} K,$$
where the expectation is taken over the data, and $C$ is a constant depending on $D^{\frac{3\alpha}{2\alpha+d}}$, $d$, $\alpha$, $\frac{d}{\alpha p - d}$, $p$, $q$, $c_0$, $B$, $\omega$, and the surface area of $X$. The distributional mismatch is captured by
$$\kappa = \frac{1}{H} \sum_{h=1}^H \sqrt{\chi^2_Q(q^\pi_h, q^{\pi_0}_h) + 1},$$
where $q^\pi_h$ and $q^{\pi_0}_h$ are the visitation distributions of $\pi$ and $\pi_0$ at step $h$, respectively, and $Q$ is the Minkowski sum of the CNN function class in (5) and the Besov function class, i.e.,
$$Q = \{f + g \mid f \in B^\alpha_{p,q}(X),\ g \in F(M, L, J, I, \tau_1, \tau_2, V)\}.$$
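The choice $M = O(K^{d/(2\alpha+d)})$ in Theorem 2 comes from a bias-variance tradeoff. Under a stylized error model (illustrative only, not the actual bound) with approximation error $M^{-\alpha/d}$ and statistical error $\sqrt{M/K}$, the minimizing network size indeed scales as $K^{d/(2\alpha+d)}$:

```python
import numpy as np

# Stylized version of the tradeoff behind Theorem 2 (illustrative only):
# approximation error ~ M^(-alpha/d), statistical error ~ sqrt(M/K).
alpha, d = 2.0, 4.0

def best_network_size(K):
    M = np.arange(1, K)                        # candidate network sizes
    err = M ** (-alpha / d) + np.sqrt(M / K)   # bias + variance proxy
    return int(M[np.argmin(err)])

# The minimizer scales like K^(d/(2*alpha+d)); here d/(2*alpha+d) = 1/2.
for K in (10**4, 10**6):
    print(K, best_network_size(K), round(K ** (d / (2 * alpha + d))))
```

Plugging the minimizer back into the error model recovers the rate $K^{-\alpha/(2\alpha+d)}$ of Theorem 2 (up to log factors).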
We next compare our Theorem 2 with existing work.

(I) Tight characterization of distributional mismatch. The term $\kappa$ depicts the distributional mismatch between the target policy's visitation distribution and the data coverage via the restricted $\chi^2$-divergence. Note that the restricted $\chi^2$-divergence is always no larger than the commonly used absolute density ratio [51, 6, 67] and can often be substantially smaller. This is because the probability measures $q^\pi_h$ and $q^{\pi_0}_h$ might differ a lot over some small regions of the sample space, while their integrals of a smooth function in $Q$ over the entire sample space could be close to each other. The absolute density ratio measures the former, while the restricted $\chi^2$-divergence measures the latter. More strikingly, when considering function approximation (e.g., when the state-action space is not countably finite), the restricted $\chi^2$-divergence can remain small even when the absolute density ratio becomes unbounded. For example, consider two isotropic multivariate Gaussian distributions with different means. [52] show that the Pearson $\chi^2$-divergence, which is always larger than or equal to the restricted $\chi^2$-divergence, has a finite expression:
$$\chi^2\big(N(\mu_1, I), N(\mu_2, I)\big) = e^{\|\mu_1 - \mu_2\|_2^2} - 1,$$
whereas the absolute density ratio is unbounded: for any $\mu_1 \neq \mu_2$,
$$\Big\|\frac{dN(\mu_1, I)}{dN(\mu_2, I)}\Big\|_\infty = \sup_x \exp\Big(x^\top(\mu_1 - \mu_2) - \frac{1}{2}\|\mu_1\|^2 + \frac{1}{2}\|\mu_2\|^2\Big) = \infty.$$
Such a stark comparison can also be observed for other common distributions whose support has infinite cardinality, e.g., the Poisson distribution. Furthermore, when the state-action space has small intrinsic dimension, i.e., $d \ll D$, the restricted $\chi^2$-divergence adapts to this low-dimensional structure and characterizes the distributional mismatch with respect to $Q$, which is a small function class depending on the intrinsic dimension. In contrast, the absolute density ratio in [51] does not take advantage of the low-dimensional structure.
In summary, though the absolute density ratio is tight in the tabular setting and some other special classes of MDPs, in the general function approximation setting it can easily become intractably vacuous, and the restricted $\chi^2$-divergence is a tighter characterization of distributional mismatch.

(II) Adaptation to intrinsic dimension. Note that our estimation error is dominated by the intrinsic dimension $d$, rather than the representation dimension $D$. Therefore, it is significantly smaller than the error of methods oblivious to the problem's intrinsic dimension, such as [51]. Such fast convergence owes to the adaptability of neural networks to the manifold structure of the state-action space. With properly chosen width and depth, the neural network automatically captures local geometries on the manifold through the empirical risk minimization in Algorithm 1 for approximating Q-functions.
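The Gaussian example can be checked numerically; the function class below is an arbitrary small family of smooth functions standing in for $Q$ (our own illustrative choice), and all expectations are crude Monte Carlo estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2 = 0.0, 1.0
p = rng.normal(mu1, 1.0, 200_000)                 # samples from q1 = N(mu1, 1)
q = rng.normal(mu2, 1.0, 200_000)                 # samples from q2 = N(mu2, 1)

# A small, smooth function class standing in for Q (illustrative choice).
funcs = [lambda x, w=w, b=b: np.sin(w * x + b)
         for w in (0.5, 1.0, 2.0) for b in (0.0, 1.0, 2.0)]

restricted = max(np.mean(f(p)) ** 2 / np.mean(f(q) ** 2) for f in funcs) - 1
pearson = np.exp((mu1 - mu2) ** 2) - 1            # closed form: e - 1 ~ 1.718

def density_ratio(x):                             # dN(mu1,1)/dN(mu2,1) at x
    return np.exp(x * (mu1 - mu2) - 0.5 * mu1**2 + 0.5 * mu2**2)

print(f"restricted chi2 over Q : {restricted:.3f}")
print(f"Pearson chi2 (exact)   : {pearson:.3f}")
print(f"density ratio at x=-10 : {density_ratio(-10.0):.1f}")  # blows up as x -> -inf
```

The restricted divergence stays below the Pearson divergence, while the pointwise density ratio is unbounded in the tail, which is exactly the gap discussed above.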

Sample Complexity

Comparison. Given a pre-specified estimation error $\epsilon$ of the policy value, our algorithm requires a sample complexity of $\widetilde O\big(H^{4+2d/\alpha} \kappa^{2+d/\alpha} \epsilon^{-2-d/\alpha}\big)$. We next compare our result with [51], which among existing work is the most similar to ours. Specifically, we re-prove our result with a feed-forward ReLU network so as to be in the same setting as [51] (details in Theorem 3 of Appendix E). When the experience data are allowed to be reused, they show a sample complexity of $\widetilde O\big(H^{2+2D/\alpha} \kappa_3^{2+2D/\alpha} \epsilon^{-2-2D/\alpha}\big)$. As can be seen, our result is more efficient than theirs as long as $H^{1-\frac{D-d}{\alpha}} \le \epsilon^{-1-\frac{D-d/2}{\alpha}}$. Such a requirement on the horizon can be satisfied in real applications, as $d \ll D$ and $\alpha$ is moderate. Note that even with no consideration of low-dimensional structures, i.e., $d = D$, our result is still more efficient, as $\kappa$ is often substantially smaller than $\kappa_3$. Moreover, when the experience data are used for just one pass, our method is immediately more efficient, as their sample complexity becomes $\widetilde O\big(H^{4+2D/\alpha} \kappa_3^{2+D/\alpha} \epsilon^{-2-D/\alpha}\big)$.
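The gap between the exponents $2 + d/\alpha$ and $2 + 2D/\alpha$ is easy to quantify. The following sketch (dropping constants, log factors, and the $H$ and $\kappa$ dependence; all parameter values are made up for illustration) compares the required sample sizes at intrinsic dimension $d = 4$ versus ambient dimension $D = 100$:

```python
import math

# Stylized comparison of required sample sizes (constants, log factors,
# and horizon/mismatch factors dropped for illustration).
alpha, d, D, eps = 2.0, 4.0, 100.0, 0.01

exp_ours  = 2 + d / alpha          # ours:      K ~ eps^-(2 + d/alpha)
exp_dense = 2 + 2 * D / alpha      # [51]-type: K ~ eps^-(2 + 2D/alpha)

log10_K_ours  = exp_ours * math.log10(1 / eps)
log10_K_dense = exp_dense * math.log10(1 / eps)
print(f"intrinsic d = 4 : K ~ 10^{log10_K_ours:.0f}")   # 10^8
print(f"ambient D = 100 : K ~ 10^{log10_K_dense:.0f}")  # 10^204
```

Even at a moderate target accuracy, the ambient-dimension rate is astronomically larger, which is why adaptation to the intrinsic dimension matters.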

5. EXPERIMENTS

We present numerical experiments evaluating FQE with CNN function approximation on the classic CartPole environment [4]. The CartPole problem has a 4-dimensional continuous intrinsic state space. We consider a finite-horizon MDP with horizon $H = 100$ in this environment. In our experiments, we solve the OPE problem with FQE (Algorithm 1). We take the visual display of the environment as states; these images serve as a high-dimensional representation of CartPole's original 4-dimensional continuous state space. In our algorithm, we use a deep CNN to approximate the Q-functions and solve the regression with SGD (see Appendix F.1 for details). We use a policy trained for 200 iterations with REINFORCE as the target policy. We conduct this experiment in two cases: (A) data are generated from the target policy itself; (B) data are generated from a mixture policy of 0.8 target policy and 0.2 uniform distribution. Case (A) aims to verify the performance's dependence on the data's intrinsic dimension without the influence of distribution shift. We observe that the performance of FQE on high-resolution and low-resolution data is similar, in both the off-policy case and the easier case with no distribution shift. This shows that the estimation error of FQE is affected little by the representation dimension of the data but rather by the intrinsic structure of the environment, which is the same regardless of resolution. We also observe that the estimation becomes increasingly accurate as the sample size $K$ increases. These empirical results confirm our upper bound in Theorem 2, which is dominated only by the data's intrinsic dimension.

6. CONCLUSION

This paper studies nonparametric off-policy evaluation in MDPs. We use CNNs to approximate Q-functions. Our theory proves that when the state-action space exhibits low-dimensional structures, the finite-sample estimation error of FQE converges at a rate depending on the intrinsic dimension. In the estimation error, the distribution mismatch between the data distribution and the target policy's visitation distribution is quantified by a restricted $\chi^2$-divergence term, which is oftentimes much smaller than the absolute density ratio. Our theory also reassures practitioners of the benignity of over-representation in deep RL and provides insights into how to choose network hyperparameters properly in the presence of low intrinsic dimension. We support our theory with experiments. For future directions, it would be of interest to adapt this low-dimensional analysis to time-homogeneous MDPs; it is nontrivial to preserve the error rate with sample reuse in the presence of temporal dependency.

A PROOF OF THEOREM 2

In this section, we provide a proof of the upper bound on the estimation error in Theorem 2. We tackle the sequential dependencies by recursively conditioning on the previous-step estimation and using the fact that the error from previous steps accumulates linearly. The estimation error can be decomposed into a sum of statistical error and approximation error. A tradeoff exists regarding the network size: while a larger network reduces the approximation error, it leads to higher variance in the statistical error. Consequently, we choose the network size and architecture appropriately to balance the two types of error, which in turn minimizes the final estimation error.

Proof of Theorem 2. The goal is to bound
$$\mathbb{E}\,|\widehat v^\pi - v^\pi| = \mathbb{E}\,\Big|\int_X \big(\widehat Q^\pi_1 - Q^\pi_1\big)(s, a)\, dq^\pi_1(s, a)\Big| \le \mathbb{E}\int_X \big|\widehat Q^\pi_1 - Q^\pi_1\big|(s, a)\, dq^\pi_1(s, a).$$
To bound this quantity, we first expand it recursively.
To illustrate the recursive relation, we examine the quantity at step h:
$$\begin{aligned}
\mathbb{E}\int_{\mathcal X} \big|\hat Q^\pi_h - Q^\pi_h\big|(s,a)\, dq^\pi_h(s,a)
&= \mathbb{E}\int_{\mathcal X} \big|\hat T^\pi_h \hat Q^\pi_{h+1} - T^\pi_h Q^\pi_{h+1}\big|(s,a)\, dq^\pi_h(s,a) \\
&\le \mathbb{E}\int_{\mathcal X} \big|T^\pi_h \hat Q^\pi_{h+1} - T^\pi_h Q^\pi_{h+1}\big|(s,a)\, dq^\pi_h(s,a) + \mathbb{E}\int_{\mathcal X} \big|\hat T^\pi_h \hat Q^\pi_{h+1} - T^\pi_h \hat Q^\pi_{h+1}\big|(s,a)\, dq^\pi_h(s,a) \\
&= \mathbb{E}\int_{\mathcal X} \big|\hat Q^\pi_{h+1} - Q^\pi_{h+1}\big|(s,a)\, dq^\pi_{h+1}(s,a) + \mathbb{E}\Big[\mathbb{E}\Big[\int_{\mathcal X} \big|\hat T^\pi_h \hat Q^\pi_{h+1} - T^\pi_h \hat Q^\pi_{h+1}\big|(s,a)\, dq^\pi_h(s,a) \,\Big|\, \mathcal D_{h+1},\cdots,\mathcal D_H\Big]\Big] \\
&\overset{(a)}{\le} \mathbb{E}\int_{\mathcal X} \big|\hat Q^\pi_{h+1} - Q^\pi_{h+1}\big|\, dq^\pi_{h+1} + \mathbb{E}\Big[\mathbb{E}\Big[\sqrt{\int_{\mathcal X} \big(\hat T^\pi_h \hat Q^\pi_{h+1} - T^\pi_h \hat Q^\pi_{h+1}\big)^2(s,a)\, dq^{\pi_0}_h(s,a)\cdot\big(\chi^2_{\mathcal Q}(q^\pi_h, q^{\pi_0}_h)+1\big)} \,\Big|\, \mathcal D_{h+1},\cdots,\mathcal D_H\Big]\Big] \\
&\overset{(b)}{\le} \mathbb{E}\int_{\mathcal X} \big|\hat Q^\pi_{h+1} - Q^\pi_{h+1}\big|\, dq^\pi_{h+1} + \sqrt{\mathbb{E}\Big[\mathbb{E}\Big[\int_{\mathcal X} \big(\hat T^\pi_h \hat Q^\pi_{h+1} - T^\pi_h \hat Q^\pi_{h+1}\big)^2(s,a)\, dq^{\pi_0}_h(s,a) \,\Big|\, \mathcal D_{h+1},\cdots,\mathcal D_H\Big]\Big]}\,\sqrt{\chi^2_{\mathcal Q}(q^\pi_h, q^{\pi_0}_h)+1} \\
&\overset{(c)}{\le} \mathbb{E}\int_{\mathcal X} \big|\hat Q^\pi_{h+1} - Q^\pi_{h+1}\big|\, dq^\pi_{h+1} + \sqrt{C'(5H^2)\,K^{-\frac{2\alpha}{2\alpha+d}}\log^5 K}\,\sqrt{\chi^2_{\mathcal Q}(q^\pi_h, q^{\pi_0}_h)+1} \\
&\le \mathbb{E}\int_{\mathcal X} \big|\hat Q^\pi_{h+1} - Q^\pi_{h+1}\big|\, dq^\pi_{h+1} + C H K^{-\frac{\alpha}{2\alpha+d}}\log^{5/2} K\,\sqrt{\chi^2_{\mathcal Q}(q^\pi_h, q^{\pi_0}_h)+1},
\end{aligned}$$
where C denotes a (possibly varying) constant depending on $D^{\frac{3\alpha}{2\alpha+d}}$, d, α, $\frac{d}{\alpha p - d}$, p, q, $c_0$, B, ω and the surface area of $\mathcal X$. In (a), note that $T^\pi_h \hat Q^\pi_{h+1} \in B^\alpha_{p,q}(\mathcal X)$ by Assumption 2 and $\hat T^\pi_h \hat Q^\pi_{h+1} \in \mathcal F$ by our algorithm, so $\hat T^\pi_h \hat Q^\pi_{h+1} - T^\pi_h \hat Q^\pi_{h+1} \in \mathcal Q$. We then employ a change-of-measure argument and obtain the inequality by invoking the following lemma.

Lemma 1. Given a function class $\mathcal Q$ of functions mapping $\mathcal X$ to $\mathbb R$ and two probability distributions $q_1$ and $q_2$ supported on $\mathcal X$, for any $g \in \mathcal Q$,
$$\mathbb{E}_{x\sim q_1}[g(x)] \le \sqrt{\mathbb{E}_{x\sim q_2}[g^2(x)]\big(1 + \chi^2_{\mathcal Q}(q_1, q_2)\big)}.$$

Proof of Lemma 1.
$$\mathbb{E}_{x\sim q_1}[g(x)] \le \sqrt{\mathbb{E}_{x\sim q_2}[g^2(x)]}\,\sqrt{\frac{\mathbb{E}_{x\sim q_1}[g(x)]^2}{\mathbb{E}_{x\sim q_2}[g^2(x)]}} \le \sqrt{\mathbb{E}_{x\sim q_2}[g^2(x)]}\,\sqrt{\sup_{f\in\mathcal Q}\frac{\mathbb{E}_{x\sim q_1}[f(x)]^2}{\mathbb{E}_{x\sim q_2}[f^2(x)]}} = \sqrt{\mathbb{E}_{x\sim q_2}[g^2(x)]\big(1 + \chi^2_{\mathcal Q}(q_1, q_2)\big)},$$
where the last step is by the definition $\chi^2_{\mathcal Q}(q_1, q_2) := \sup_{f\in\mathcal Q} \frac{\mathbb{E}_{q_1}[f]^2}{\mathbb{E}_{q_2}[f^2]} - 1$.

In (b), we use Jensen's inequality and the fact that the square root is concave.
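The restricted χ²-divergence in Lemma 1 is easy to check numerically on a finite state space. The following sketch, a toy discrete setup of our own (not from the paper's experiments), computes $\chi^2_{\mathcal Q}$ for a small finite function class and lets one verify the change-of-measure inequality for every member of the class.

```python
import numpy as np

def chi2_restricted(Q, q1, q2):
    """Function-class-restricted chi-square divergence on a finite support:
    chi2_Q(q1, q2) = sup_{f in Q} (E_{q1} f)^2 / E_{q2}[f^2] - 1."""
    return max(float(q1 @ f) ** 2 / float(q2 @ (f * f)) for f in Q) - 1.0

rng = np.random.default_rng(0)
n = 5                                   # support {0, ..., 4}
q1 = rng.dirichlet(np.ones(n))          # "target" visitation distribution (toy)
q2 = rng.dirichlet(np.ones(n))          # "behavior" distribution (toy)
Q = [rng.normal(size=n) for _ in range(8)]   # a small finite function class
chi2 = chi2_restricted(Q, q1, q2)
```

Note that Lemma 1 holds for every g in the class by construction, since the supremum in the definition dominates the ratio attained by g itself.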
To obtain (c), we invoke Lemma 10, which provides an upper bound on the error of the nonparametric regression at each step of the FQE algorithm. Specifically, we invoke Lemma 10 conditioned on $\mathcal D_{h+1},\cdots,\mathcal D_H$. To justify this, we cast our problem into the regression problem described in the lemma. Since $\{(s_{h,k}, a_{h,k})\}_{k=1}^K$ are i.i.d. from $q^{\pi_0}_h$, we can view them as the samples $x_i$'s in the lemma, and we can view $T^\pi_h \hat Q^\pi_{h+1}$, which is measurable under our conditioning, as $f_0$ in the lemma. Furthermore, we let
$$\zeta_{h,k} := r_{h,k} + \int_{\mathcal A} \hat Q^\pi_{h+1}(s'_{h,k}, a)\,\pi(a \mid s'_{h,k})\, da - T^\pi_h \hat Q^\pi_{h+1}(s_{h,k}, a_{h,k}).$$
In order to invoke Lemma 10 under the conditioning on $\mathcal D_{h+1},\cdots,\mathcal D_H$, we verify the following three conditions:
1. Samples $\{(s_{h,k}, a_{h,k})\}_{k=1}^K$ are i.i.d.
2. Samples $\{(s_{h,k}, a_{h,k})\}_{k=1}^K$ are uncorrelated with the noise $\{\zeta_{h,k}\}_{k=1}^K$.
3. Noise $\{\zeta_{h,k}\}_{k=1}^K$ are independent, zero-mean, subgaussian random variables.
In our setting, $\{(s_{h,k}, a_{h,k})\}_{k=1}^K$ are i.i.d. from $q^{\pi_0}_h$. Due to the time-inhomogeneous setting, they are independent from $\mathcal D_{h+1},\cdots,\mathcal D_H$, so they remain i.i.d. under our conditioning; thus, Condition 1 is clearly satisfied. Observe that under our conditioning, the transition from $(s_{h,k}, a_{h,k})$ to $s'_{h,k}$ is the only source of randomness in $\zeta_{h,k}$ besides $(s_{h,k}, a_{h,k})$ itself. The tuple $(s_{h,k}, a_{h,k}, s'_{h,k})$ is generated by drawing $(s_{h,k}, a_{h,k}) \sim q^{\pi_0}_h$ and then $s'_{h,k} \sim P_h(\cdot \mid s_{h,k}, a_{h,k})$, so a function of $s'_{h,k}$ that is conditionally centered given $(s_{h,k}, a_{h,k})$ is uncorrelated with $(s_{h,k}, a_{h,k})$. Thus, the $(s_{h,k}, a_{h,k})$'s are uncorrelated with the $\zeta_{h,k}$'s under our conditioning, and Condition 2 is satisfied. Condition 3 can also be easily verified. Under our conditioning, the randomness in $\zeta_{h,k}$ only comes from $(s_{h,k}, a_{h,k}, s'_{h,k}, r_{h,k})$, which is independent from $(s_{h,k'}, a_{h,k'}, s'_{h,k'}, r_{h,k'})$ for any $k' \ne k$, so the $\zeta_{h,k}$'s are independent from each other.
As for the mean of $\zeta_{h,k}$,
$$\begin{aligned}
\mathbb{E}[\zeta_{h,k} \mid \mathcal D_{h+1},\cdots,\mathcal D_H]
&= \mathbb{E}\Big[ r_{h,k} + \int_{\mathcal A} \hat Q^\pi_{h+1}(s'_{h,k}, a)\,\pi(a \mid s'_{h,k})\, da - r_h(s_{h,k}, a_{h,k}) - P^\pi_h \hat Q^\pi_{h+1}(s_{h,k}, a_{h,k}) \,\Big|\, \mathcal D_{h+1},\cdots,\mathcal D_H \Big] \\
&= \mathbb{E}\big[ r_{h,k} - r_h(s_{h,k}, a_{h,k}) \mid \mathcal D_{h+1},\cdots,\mathcal D_H \big] \\
&\quad + \mathbb{E}\Big[ \int_{\mathcal A} \hat Q^\pi_{h+1}(s'_{h,k}, a)\,\pi(a \mid s'_{h,k})\, da - \mathbb{E}_{s'\sim P_h(\cdot\mid s_{h,k}, a_{h,k})}\Big[ \int_{\mathcal A} \hat Q^\pi_{h+1}(s', a)\,\pi(a \mid s')\, da \,\Big|\, s_{h,k}, a_{h,k}, \mathcal D_{h+1},\cdots,\mathcal D_H \Big] \,\Big|\, \mathcal D_{h+1},\cdots,\mathcal D_H \Big] \\
&= 0 + 0 = 0.
\end{aligned}$$
On the other hand, $\|\hat Q^\pi_{h+1}\|_\infty \le H$ almost surely, because it is a function in our CNN class $\mathcal F$. Thus, $\zeta_{h,k}$ is a bounded random variable with $\zeta_{h,k} \in [-2H, 2H]$ almost surely, so its variance is bounded by $4H^2$; boundedness also implies that it is subgaussian. Thus, Condition 3 is also satisfied. Hence, Lemma 10 shows that, for step h of our algorithm,
$$\mathbb{E}\Big[ \int_{\mathcal X} \big(\hat T^\pi_h \hat Q^\pi_{h+1} - T^\pi_h \hat Q^\pi_{h+1}\big)^2(s,a)\, dq^{\pi_0}_h(s,a) \,\Big|\, \mathcal D_{h+1},\cdots,\mathcal D_H \Big] \le C'(H^2 + 4H^2)\, K^{-\frac{2\alpha}{2\alpha+d}} \log^5 K,$$
where $C'$ depends on $D^{\frac{6\alpha}{2\alpha+d}}$, d, α, $\frac{2d}{\alpha p - d}$, p, q, $c_0$, B, ω and the surface area of $\mathcal X$. Note that this upper bound holds for any $\hat Q^\pi_{h+1}$ and any $\mathcal D_{h+1},\cdots,\mathcal D_H$; the sole purpose of the conditioning is to view $\hat Q^\pi_{h+1}$ as a measurable (deterministic) function so that Lemma 10 applies. Therefore,
$$\mathbb{E}\Big[ \mathbb{E}\Big[ \int_{\mathcal X} \big(\hat T^\pi_h \hat Q^\pi_{h+1} - T^\pi_h \hat Q^\pi_{h+1}\big)^2(s,a)\, dq^{\pi_0}_h(s,a) \,\Big|\, \mathcal D_{h+1},\cdots,\mathcal D_H \Big] \Big] \le C'(H^2 + 4H^2)\, K^{-\frac{2\alpha}{2\alpha+d}} \log^5 K.$$
Finally, carrying out the recursion from time step 1 to time step H yields the final result
$$\mathbb{E}\,|\hat v^\pi - v^\pi| \le C H^2 K^{-\frac{\alpha}{2\alpha+d}} \log^{5/2} K \cdot \frac{1}{H} \sum_{h=1}^H \sqrt{\chi^2_{\mathcal Q}(q^\pi_h, q^{\pi_0}_h) + 1}.$$

B PROOF OF THEOREM 1

For simplicity, let us denote $f_0 := T^\pi_h f$ in the theorem statement; note that $f_0 \in B^\alpha_{p,q}(\mathcal X)$. Moreover, let us define a class of single-block CNNs of the form $f(x) = W \otimes \mathrm{Conv}_{\mathcal W, \mathcal B}(x)$:
$$\mathcal F^{\rm SCNN}(L, J, I, \tau_1, \tau_2) = \Big\{ f \;\Big|\; f(x) = W \otimes \mathrm{Conv}_{\mathcal W, \mathcal B}(x),\ \big\| W^{(l)} \big\|_\infty \vee \big\| B^{(l)} \big\|_\infty \le \tau_1 \text{ for all } l,\ \| W \|_\infty \le \tau_2 \Big\}. \tag{11}$$
We will refer to CNNs in this form as "single-block CNNs" and use them as building blocks of our final CNN approximation of the ground-truth Besov function.

B.1 PROOF OVERVIEW OF THEOREM 1

Theorem 1 serves as a building block for Theorem 2 and establishes the relation between network architecture and approximation error. For simplicity, denote $c_0 := \|f_0\|_{B^\alpha_{p,q}(\mathcal X)}$. Theorem 1 is proven in the following steps:

STEP 1: DECOMPOSE f AS A SUM OF LOCALLY SUPPORTED FUNCTIONS OVER THE MANIFOLD

Since the manifold $\mathcal X$ is assumed compact (Assumption 1), we can cover it with a finite set of D-dimensional open Euclidean balls $\{B_\beta(c_i)\}_{i=1}^{C_{\mathcal X}}$, where $c_i$ denotes the center of the i-th ball and β its radius. We choose $\beta < \omega/2$ and define $U_i = B_\beta(c_i) \cap \mathcal X$. Each $U_i$ is diffeomorphic to an open subset of $\mathbb R^d$ (Lemma 5.4 in Niyogi et al. [53]); moreover, $\{U_i\}_{i=1}^{C_{\mathcal X}}$ forms an open cover of $\mathcal X$. There exists a carefully designed open cover with cardinality $C_{\mathcal X} \le \lceil \frac{A(\mathcal X)}{\beta^d} T_d \rceil$, where $A(\mathcal X)$ denotes the surface area of $\mathcal X$ and $T_d$ denotes the thickness of the $U_i$'s, i.e., the average number of $U_i$'s containing a given point on $\mathcal X$; $T_d$ is $O(d \log d)$ (Conway et al. [9]). Moreover, for each $U_i$, we can define a linear transformation $\phi_i(x) = a_i V_i^\top (x - c_i) + b_i$, where $a_i \in \mathbb R$ is a scaling factor and $b_i \in \mathbb R^d$ a translation vector, both chosen to ensure $\phi_i(U_i) \subset [0,1]^d$, and the columns of $V_i \in \mathbb R^{D \times d}$ form an orthonormal basis of the tangent space $T_{c_i}(\mathcal X)$. Overall, the atlas $\{(\phi_i, U_i)\}_{i=1}^{C_{\mathcal X}}$ maps each local neighborhood of the manifold to a d-dimensional cube. Thus, we can decompose $f_0$ using this atlas as $f_0 = \sum_{i=1}^{C_{\mathcal X}} f_i$ with $f_i = f_0 \rho_i$, because there exists a $C^\infty$ partition of unity $\{\rho_i\}_{i=1}^{C_{\mathcal X}}$ with $\mathrm{supp}(\rho_i) \subset U_i$ (Proposition 1 in Liu et al. [42]). Since each $f_i$ is supported only on $U_i$, we can further write
$$f_0 = \sum_{i=1}^{C_{\mathcal X}} \big( (f_i \circ \phi_i^{-1}) \circ \phi_i \big) \times \mathbb{1}_{U_i}, \tag{13}$$
where $\mathbb{1}_{U_i}$ is the indicator of membership in $U_i$. Lastly, we extend $f_i \circ \phi_i^{-1}$ to the entire cube $[0,1]^d$ by zero; the extension is a function in $B^\alpha_{p,q}([0,1]^d)$ with $B^\alpha_{p,q}([0,1]^d)$ Besov norm at most $C c_0$ (Lemma 4 in Liu et al. [42]), where C is a constant depending on α, p, q and d. This extended function is to be approximated with cardinal B-splines in the next step.

STEP 2: APPROXIMATE EACH LOCAL FUNCTION WITH CARDINAL B-SPLINES

Having reduced the problem to the intrinsic dimension d in the last step, we now approximate $f_0$ on the low-dimensional manifold. With $\alpha \ge d/p + 1$ assumed in Assumption 2, we can invoke a classical result on approximating Besov functions with cardinal B-splines (Lemma 5), setting $r = +\infty$ and $m = \lceil \alpha \rceil + 1$ in the lemma. It states that there exists a weighted sum of cardinal B-splines
$$\bar f_i \equiv \sum_{j=1}^N \bar f_{i,j} \approx f_i \circ \phi_i^{-1}, \qquad \bar f_{i,j} = c^{(i)}_{k,j}\, g^d_{k,j,m}, \tag{14}$$
such that $\|\bar f_i - f_i \circ \phi_i^{-1}\|_{L^\infty} \le C c_0 N^{-\alpha/d}$. In (14), $c^{(i)}_{k,j} \in \mathbb R$ is a coefficient and $g^d_{k,j,m} : [0,1]^d \to \mathbb R$ denotes a cardinal B-spline with indices $k \in \mathbb N$, $j \in \mathbb N^d$ and $m \in \mathbb N_+$: k is a scaling factor, j is a shifting vector, and m is the degree of the B-spline. By (13) and (14), we now have a sum of cardinal B-splines
$$\bar f \equiv \sum_{i=1}^{C_{\mathcal X}} (\bar f_i \circ \phi_i) \times \mathbb{1}_{U_i} = \sum_{i=1}^{C_{\mathcal X}} \sum_{j=1}^N (\bar f_{i,j} \circ \phi_i) \times \mathbb{1}_{U_i}, \tag{16}$$
which can approximate our target Besov function $f_0$ with error $\|\bar f - f_0\|_{L^\infty} \le C C_{\mathcal X} c_0 N^{-\alpha/d}$.

STEP 3: APPROXIMATE EACH CARDINAL B-SPLINE WITH A COMPOSITION OF CNNS

Each summand in (16) is a composition of functions, each of which we can implement with a CNN. Specifically, we do so with the special class of CNNs defined in (11), which we refer to as "single-block CNNs".

The multiplication operation × can be approximated by a single-block CNN $\hat\times$ with at most η error in the $L^\infty$ sense (Proposition 1); $\hat\times$ needs $O(\log \frac{1}{\eta})$ layers, and all its weight parameters are bounded by $c_0^2 \vee 1$.

We consider each $\bar f_i \circ \phi_i$ as a whole, which we can approximate with a sum of N CNNs $\hat f^{\rm SCNN}_{i,j} \circ \hat\phi_i$ up to δ error, namely $\big\| \sum_{j=1}^N \hat f^{\rm SCNN}_{i,j} - f_i \circ \phi_i^{-1} \big\|_{L^\infty} \le \delta$. In particular, we can use a single-block CNN $\hat f^{\rm SCNN}_{i,j}$ to approximate the B-spline $\bar f_{i,j}$ up to δ/N error. Moreover, since $\phi_i$ is linear, it can be expressed with a single-layer perceptron $\hat\phi_i$. The architecture and size of $\hat f^{\rm SCNN}_{i,j}$ and $\hat\phi_i$ are characterized in Proposition 2 as functions of δ.

$\mathbb{1}_{U_i}$ is the indicator of membership in $U_i$, so we need $\mathbb{1}_{U_i}(x) = 1$ if $d^2_i(x) = \|x - c_i\|^2 \le \beta^2$ and $\mathbb{1}_{U_i}(x) = 0$ otherwise. By this definition, we can write $\mathbb{1}_{U_i}$ as the composition of a univariate indicator $\mathbb{1}_{[0,\beta^2]}$ and the squared distance function $d^2_i$: $\mathbb{1}_{U_i}(x) = \mathbb{1}_{[0,\beta^2]} \circ d^2_i(x)$ for $x \in \mathcal X$. Given $\theta \in (0,1)$ and $\Delta \ge 8B^2 D \theta$, it turns out that $\mathbb{1}_{[0,\beta^2]}$ and $d^2_i$ can be approximated by two single-block CNNs $\hat{\mathbb{1}}_\Delta$ and $\hat d^2_i$, respectively (Proposition 3), such that
$$\| \hat d^2_i - d^2_i \|_{L^\infty} \le 4B^2 D \theta \tag{19}$$
and
$$\hat{\mathbb{1}}_\Delta \circ \hat d^2_i(x) = \begin{cases} 1, & \text{if } x \in U_i \text{ and } d^2_i(x) \le \beta^2 - \Delta, \\ 0, & \text{if } x \notin U_i, \\ \text{some value in } [0,1], & \text{otherwise.} \end{cases} \tag{20}$$
The architecture and size of $\hat{\mathbb{1}}_\Delta$ and $\hat d^2_i$ are characterized in Proposition 3 as functions of θ and Δ. The above approximations rely on the classical result of approximating cardinal B-splines with CNNs (Lemma 10 in Liu et al. [42]; Lemma 1 in Suzuki [60]). Putting the above together, we can develop a composition of single-block CNNs
$$\tilde f_{i,j} \equiv \hat\times\big( \hat f^{\rm SCNN}_{i,j} \circ \hat\phi_i,\ \hat{\mathbb{1}}_\Delta \circ \hat d^2_i \big) \tag{21}$$
as an approximation of $(\bar f_{i,j} \circ \phi_i) \times \mathbb{1}_{U_i}$.
The overall approximation error of $\tilde f_{i,j}$ can be written as the sum of the three types of approximation error above; details are provided in Appendix B.2. Moreover, by Lemma 6, there exists a single-block CNN $\hat f_{i,j}$ that exactly expresses $\tilde f_{i,j}$.

STEP 4: EXPRESS THE SUM OF CNN COMPOSITIONS WITH A CNN

Finally, we can assemble everything into
$$\hat f \equiv \sum_{i=1}^{C_{\mathcal X}} \sum_{j=1}^N \hat f_{i,j},$$
which serves as an approximation of $f_0$. By choosing the appropriate network size in Lemma 2, which balances the tradeoff between the approximation error of $\hat f_{i,j}$ and its size, we can ensure that $\| \hat f - f_0 \|_{L^\infty} \le N^{-\alpha/d}$. By Lemma 7, for $\bar M, \bar J > 0$, we can write this sum of $N \cdot C_{\mathcal X}$ single-block CNNs as a sum of $\bar M$ single-block CNNs with a common architecture, whose channel-number upper bound J depends on $\bar J$. This allows Theorem 1 to be more flexible with the network architecture. By Lemma 4, this sum of $\bar M$ CNNs can be further expressed as one CNN in the CNN class (5). Finally, N is chosen appropriately as a function of the network architecture parameters, and the approximation theory for CNNs is proven.

When Theorem 1 is applied in our problem setting, we take the target function $f_0$ above to be $T^\pi_h \hat Q^\pi_{h+1}$ at each time step h, which is the ground truth of the regression at each step of Algorithm 1. For more details, please refer to the proof of Theorem 2 in Appendix A.

B.2 PROOF OF THEOREM 1

In the following, we provide the proof details for Theorem 1, which quantifies the tradeoff between the size of a CNN in the class (11) and its approximation error for Besov functions on a low-dimensional manifold. We start from the decomposition of the approximation error of $\hat f$, which is based on the decomposition of the approximation error of $\tilde f_{i,j}$ in (21).

Lemma 2.
$$\| \hat f - f_0 \|_{L^\infty} \le \sum_{i=1}^{C_{\mathcal X}} (A_{i,1} + A_{i,2} + A_{i,3})$$
with
$$\begin{aligned}
A_{i,1} &= \Big\| \sum_{j=1}^N \Big[ \hat\times\big( \hat f^{\rm SCNN}_{i,j} \circ \hat\phi_i,\ \hat{\mathbb 1}_\Delta \circ \hat d^2_i \big) - \big( \hat f^{\rm SCNN}_{i,j} \circ \hat\phi_i \big) \times \big( \hat{\mathbb 1}_\Delta \circ \hat d^2_i \big) \Big] \Big\|_{L^\infty} \le C'' \delta^{-d/\alpha} \eta, \\
A_{i,2} &= \Big\| \Big( \sum_{j=1}^N \hat f^{\rm SCNN}_{i,j} \circ \hat\phi_i \Big) \times \big( \hat{\mathbb 1}_\Delta \circ \hat d^2_i \big) - f_i \times \big( \hat{\mathbb 1}_\Delta \circ \hat d^2_i \big) \Big\|_{L^\infty} \le \delta, \\
A_{i,3} &= \big\| f_i \times \big( \hat{\mathbb 1}_\Delta \circ \hat d^2_i \big) - f_i \times \mathbb{1}_{U_i} \big\|_{L^\infty} \le \frac{c(\pi+1)}{\beta(1 - \beta/\omega)}\, \Delta
\end{aligned}$$
for some constant $C''$ depending on d, α, p, q and some constant c. Furthermore, setting
$$\delta = \frac{N^{-\alpha/d}}{3 C_{\mathcal X}}, \quad \eta = \frac{N^{-1-\alpha/d}}{C'' (3 C_{\mathcal X})^{1+d/\alpha}}, \quad \Delta = \frac{\beta(1-\beta/\omega)\, N^{-\alpha/d}}{3 c (\pi+1) C_{\mathcal X}}, \quad \theta = \frac{\Delta}{16 B^2 D} \tag{24}$$
gives rise to $\| \hat f - f_0 \|_{L^\infty} \le N^{-\frac{\alpha}{d}}$. The choice in (24) satisfies the condition $\Delta > 8B^2 D \theta$ in Proposition 3.

Proof of Lemma 2. As in Proposition 1, $A_{i,1}$ measures the error from $\hat\times$:
$$A_{i,1} \le N \eta \le C'' \delta^{-d/\alpha} \eta,$$
where the last inequality is due to the choice of N in Proposition 2. $A_{i,2}$ measures the error from the CNN approximation of Besov functions; by Proposition 2, $A_{i,2} \le \delta$. $A_{i,3}$ measures the error from the CNN approximation of the chart determination function. The bound on $A_{i,3}$ can be derived following the proof of Lemma 4.5 in Chen et al. [7], since $f_i \circ \phi_i^{-1}$ is a Lipschitz function whose domain is contained in $[0,1]^d$.

In order to attain the error desired in Lemma 2, each network in $\tilde f_{i,j}$ needs an appropriate size. The network sizes of the components of $\tilde f_{i,j}$ can be analyzed as follows:
• $\hat{\mathbb 1}_i$: the weight parameters of the chart determination network $\hat{\mathbb 1}_i = \hat{\mathbb 1}_\Delta \circ \hat d^2_i$ are bounded by $O\big( \delta^{-(\log 2)(\frac{2d}{\alpha p - d} + c_1 d^{-1})} \big) = O\big( N^{(\log 2)\frac{\alpha}{d}(\frac{2d}{\alpha p - d} + c_1 d^{-1})} \big)$.
Next, we show that $\tilde f_{i,j}$, a composition of the aforementioned single-block CNNs, can itself be expressed as a single-block CNN. By Lemma 6, there exists a single-block CNN $g_{i,j}$ with $O(\log N + D)$ layers and width $\lceil 24d(\alpha+1)(\alpha+3) + 9d \rceil$ realizing $\hat f^{\rm SCNN}_{i,j} \circ \hat\phi_i$. All weight parameters in $g_{i,j}$ are of order $O\big( N^{(\log 2)\frac{\alpha}{d}(\frac{2d}{\alpha p-d} + c_1 d^{-1})} \big)$. Moreover, recall that the chart determination network $\hat{\mathbb 1}_i$ is a single-block CNN with $O(\log N + D + \log D)$ layers and width $6D + 2$, whose weight parameters are $O(1)$. By Lemma 14 in Liu et al. [42], one can construct a convolutional block, denoted by $\bar g_{i,j}$, such that
$$\bar g_{i,j}(x) = \big[\, (g_{i,j}(x))_+ \;\; (g_{i,j}(x))_- \;\; (\hat{\mathbb 1}_i(x))_+ \;\; (\hat{\mathbb 1}_i(x))_- \;\; \star \;\; \cdots \;\; \star \,\big]. \tag{25}$$
Here $\bar g_{i,j}$ has $\lceil 24d(\alpha+1)(\alpha+3) + 9d \rceil + 6D + 2$ channels. Since the inputs of $\hat\times$ are $g_{i,j}$ and $\hat{\mathbb 1}_i$, by Lemma 15 in Liu et al. [42], there exists a CNN $\tilde g_{i,j}$ which takes (25) as input and outputs $\hat\times(g_{i,j}, \hat{\mathbb 1}_i)$. Note that $\bar g_{i,j}$ contains only convolutional layers. The composition $\tilde g_{i,j} \circ \bar g_{i,j}$, denoted by $g^{\rm SCNN}_{i,j}$, is a CNN, and for any $x \in \mathcal X$, $g^{\rm SCNN}_{i,j}(x) = \tilde f_{i,j}(x)$. We have $g^{\rm SCNN}_{i,j} \in \mathcal F^{\rm SCNN}(L, J, I, \tau, \tau)$ with
$$L = O(\log N + D + \log D), \quad J = \lceil 48d(\alpha+1)(\alpha+3) + 18d \rceil + 12D + O(1), \quad \tau = O\big( N^{(\log 2)\frac{\alpha}{d}(\frac{2d}{\alpha p-d} + c_1 d^{-1})} \big),$$
and I can be any integer in $[2, D]$. Therefore, we have shown that $g^{\rm SCNN}_{i,j}$ is a single-block CNN that expresses $\tilde f_{i,j}$, as desired.

Furthermore, recall that $\hat f$ can be written as a sum of $C_{\mathcal X} N$ such single-block CNNs. By Lemma 7, for any $\bar M, \bar J$ satisfying $\bar M \bar J = O(N)$, there exists a CNN architecture $\mathcal F^{\rm SCNN}(L, J, I, \tau, \tau)$ that gives rise to a set of single-block CNNs $\{ \tilde g_i \}_{i=1}^{\bar M} \subset \mathcal F^{\rm SCNN}(L, J, I, \tau, \tau)$ with $\hat f = \sum_{i=1}^{\bar M} \tilde g_i$ and
$$L = O(\log N + D + \log D), \quad J = O(D \bar J), \quad \tau = O\big( N^{(\log 2)\frac{\alpha}{d}(\frac{2d}{\alpha p-d} + c_1 d^{-1})} \big). \tag{27}$$
By Lemma 3 below, we slightly adjust the CNN architecture by re-balancing the weight parameter bounds of the convolutional blocks and that of the final fully connected layer.
In particular, we rescale all parameters in the convolutional layers of $\tilde g_i$ to be no larger than 1. While this procedure does not change the approximation power of the CNN, it gives the CNN a smaller covering number, which is conducive to a smaller variance.

Lemma 3 (Lemma 16 in Liu et al. [42]). Let $\gamma \ge 1$. For any $g \in \mathcal F^{\rm SCNN}(L, J, I, \tau_1, \tau_2)$, there exists $f \in \mathcal F^{\rm SCNN}(L, J, I, \gamma^{-1}\tau_1, \gamma^L \tau_2)$ such that $g(x) = f(x)$.

In this case, we set
$$\gamma = c' N^{(\log 2)\frac{\alpha}{d}(\frac{2d}{\alpha p-d}+c_1 d^{-1})}\, (8ID)\, \bar M^{\frac{1}{L}},$$
where $c'$ is a constant such that $\tau \le c' N^{(\log 2)\frac{\alpha}{d}(\frac{2d}{\alpha p-d}+c_1 d^{-1})}$. With this γ, we have $\tilde g_i \in \mathcal F^{\rm SCNN}(L, J, I, \tau_1, \tau_2)$ with
$$L = O(\log N + D + \log D), \quad J = O(D \bar J), \quad \tau_1 = (8ID)^{-1} \bar M^{-1/L} = O(1), \quad \log \tau_2 = O\big( \log \bar M + \log^2 N + D \log N \big).$$

Finally, we prove that it suffices to use one CNN to realize the sum of single-block CNNs in (27).

Lemma 4. Let $\mathcal F^{\rm SCNN}(L, J, I, \tau_1, \tau_2)$ be any CNN architecture from $\mathbb R^D$ to $\mathbb R$. Assume the weight matrix in the fully connected layer of $\mathcal F^{\rm SCNN}(L, J, I, \tau_1, \tau_2)$ has nonzero entries only in the first row. For any positive integer M, there exists a CNN architecture $\mathcal F(M, L, 4+J, I, \tau_1, \tau_2(1 \vee \tau_1^{-1}))$ such that for any $\{ \tilde f_i \}_{i=1}^M \subset \mathcal F^{\rm SCNN}(L, J, I, \tau_1, \tau_2)$, there exists $f \in \mathcal F(M, L, 4+J, I, \tau_1, \tau_2(1 \vee \tau_1^{-1}))$ with $f(x) = \sum_{m=1}^M \tilde f_m(x)$.

Consequently, by Lemma 4, there exists one CNN that expresses our sum of $\bar M$ single-block CNNs, with architecture $\mathcal F(M, L, J, I, \tau_1, \tau_2)$ where
$$L = O(\log N + D + \log D), \quad J = O(D \bar J), \quad \tau_1 = (8ID)^{-1}\bar M^{-1/L} = O(1), \quad \log \tau_2 = O\big( \log \bar M + \log^2 N + D \log N \big), \quad M = O(\bar M),$$
and $\bar J, \bar M$ satisfy $\bar M \bar J = O(N)$, a requirement inherited from Lemma 7. This CNN is our final approximation of $f_0$. Applying the relation $N = O(\bar M \bar J)$ gives
$$\| \hat f - f_0 \|_{L^\infty} \le (\bar M \bar J)^{-\frac{\alpha}{d}} \tag{30}$$
and the network size
$$L = O\big( \log(\bar M \bar J) + D + \log D \big), \quad J = O(D \bar J), \quad \tau_1 = (8ID)^{-1}\bar M^{-1/L} = O(1), \quad \log \tau_2 = O\big( \log^2 (\bar M \bar J) + D \log (\bar M \bar J) \big), \quad M = O(\bar M).$$

B.3 PROOF OF LEMMA 4

Denote the architecture of $\tilde f_m$ by $\tilde f_m(x) = W_m \otimes \mathrm{Conv}_{\mathcal W_m, \mathcal B_m}(x)$, where $\mathcal W_m = \{ W^{(l)}_m \}_{l=1}^L$ and $\mathcal B_m = \{ B^{(l)}_m \}_{l=1}^L$. Furthermore, denote the weight matrix and bias in the fully connected layer of f by $\bar W, \bar b$, and the set of filters and biases in the m-th block of f by $\bar{\mathcal W}_m$ and $\bar{\mathcal B}_m$, respectively. The padding layer P in f pads the input x from $\mathbb R^D$ to $\mathbb R^{D \times 4}$ with zeros; each column denotes a channel.

Let us first show that for each m, there exists some $\mathrm{Conv}_{\bar{\mathcal W}_m, \bar{\mathcal B}_m}$ such that for any Z of the form $Z = [\, (x)_+ \;\; (x)_- \;\; \star \;\; \star \,]$, where $(x)_+$ applies $(\cdot \vee 0)$ to every entry of x and $(x)_-$ applies $-(\cdot \wedge 0)$ to every entry of x (so all entries of Z are non-negative), we have
$$\mathrm{Conv}_{\bar{\mathcal W}_m, \bar{\mathcal B}_m}(Z) = \begin{bmatrix} 0 & 0 & \frac{\tau_1}{\tau_2}\big(\tilde f_m(x) \vee 0\big) & -\frac{\tau_1}{\tau_2}\big(\tilde f_m(x) \wedge 0\big) \\ \vdots & \vdots & \star & \star \\ \vdots & \vdots & \star & \star \end{bmatrix} + Z, \tag{33}$$
where the $\star$'s denote entries that do not affect this result and may take arbitrary values.

For any m, the first layer of $\tilde f_m$ takes input in $\mathbb R^D$, so the filters in $\mathcal W_m$ act on a single channel. We pad these filters with zeros to obtain filters acting on $\mathbb R^{D \times 4}$ and construct $\bar W^{(1)}_m$ such that
$$(\bar W^{(1)}_m)_{1,:,:} = [\,e_1\; 0\; 0\; 0\,], \quad (\bar W^{(1)}_m)_{2,:,:} = [\,0\; e_1\; 0\; 0\,], \quad (\bar W^{(1)}_m)_{3,:,:} = [\,0\; 0\; e_1\; 0\,], \quad (\bar W^{(1)}_m)_{4,:,:} = [\,0\; 0\; 0\; e_1\,],$$
$$(\bar W^{(1)}_m)_{4+j,:,:} = \big[\, (W^{(1)}_m)_{j,:,:} \;\; -(W^{(1)}_m)_{j,:,:} \;\; 0 \;\; 0 \,\big],$$
where we use the fact that $W^{(1)}_m * (x)_+ - W^{(1)}_m * (x)_- = W^{(1)}_m * x$. The first four output channels at the end of this first layer are a copy of Z. For the filters in the later layers of $\tilde f_m$ and all biases, we simply set
$$(\bar W^{(l)}_m)_{1,:,:} = [\,e_1\; 0\; 0\; 0\; \cdots\; 0\,], \quad (\bar W^{(l)}_m)_{2,:,:} = [\,0\; e_1\; 0\; 0\; \cdots\; 0\,] \quad \text{for } l = 2, \ldots, L,$$
$$(\bar W^{(l)}_m)_{3,:,:} = [\,0\; 0\; e_1\; 0\; \cdots\; 0\,], \quad (\bar W^{(l)}_m)_{4,:,:} = [\,0\; 0\; 0\; e_1\; \cdots\; 0\,] \quad \text{for } l = 2, \ldots, L-1,$$
so that the first four channels carry Z through, while the remaining channels run the layers of $\tilde f_m$. In $\mathrm{Conv}_{\bar{\mathcal W}_m, \bar{\mathcal B}_m}$, an additional convolutional layer is constructed to realize the fully connected layer of $\tilde f_m$. By our assumption, only the first row of $W_m$ is nonzero. Furthermore, we set $\bar B^{(L)}_m = 0$ and take $\bar W^{(L)}_m$ to be size-one filters with
$$(\bar W^{(L)}_m)_{3,:,:} = \big[\, 0\;\; 0\;\; e_1\;\; 0\;\; \tfrac{\tau_1}{\tau_2}(W_m)_{1,:} \,\big], \qquad (\bar W^{(L)}_m)_{4,:,:} = \big[\, 0\;\; 0\;\; 0\;\; e_1\;\; -\tfrac{\tau_1}{\tau_2}(W_m)_{1,:} \,\big].$$
Under such choices, (33) is proved, and all parameters in $\bar{\mathcal W}_m, \bar{\mathcal B}_m$ are bounded by $\tau_1$.

By composing all convolutional blocks, we have
$$\big( \mathrm{Conv}_{\bar{\mathcal W}_M, \bar{\mathcal B}_M} \big) \circ \cdots \circ \big( \mathrm{Conv}_{\bar{\mathcal W}_1, \bar{\mathcal B}_1} \big) \circ P(x) = \begin{bmatrix} (x)_+ & (x)_- & \frac{\tau_1}{\tau_2} \sum_{m=1}^M \big(\tilde f_m \vee 0\big) & -\frac{\tau_1}{\tau_2} \sum_{m=1}^M \big(\tilde f_m \wedge 0\big) \\ \vdots & \vdots & \star & \star \end{bmatrix}.$$
Lastly, the fully connected layer can be set as
$$\bar W = \big[\, 0 \;\; 0 \;\; \tfrac{\tau_2}{\tau_1} \;\; -\tfrac{\tau_2}{\tau_1} \;\; 0 \;\cdots\; 0 \,\big], \qquad \bar b = 0.$$
Note that the weights in the fully connected layer are bounded by $\tau_2(1 \vee \tau_1^{-1})$. The above construction gives
$$f(x) = \sum_{m=1}^M \big(\tilde f_m(x) \vee 0\big) + \sum_{m=1}^M \big(\tilde f_m(x) \wedge 0\big) = \sum_{m=1}^M \tilde f_m(x).$$
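The essence of the construction is the bookkeeping identity $\sum_m (\tilde f_m \vee 0) + \sum_m (\tilde f_m \wedge 0) = \sum_m \tilde f_m$, with the two running sums stored in channels that remain non-negative and therefore pass through ReLU unchanged. A minimal scalar sketch of this trick (plain Python, not an actual CNN; the function name is ours):

```python
def accumulate_through_relu(values):
    """Lemma 4's bookkeeping trick, in scalar form: carry a running *signed*
    sum through ReLU layers by storing its positive and negative parts in two
    channels that stay non-negative (so ReLU acts as the identity on them);
    the final fully connected layer recovers the sum as their difference."""
    pos, neg = 0.0, 0.0
    for v in values:
        pos = max(pos + max(v, 0.0), 0.0)   # accumulates sum of (v ∨ 0)
        neg = max(neg + max(-v, 0.0), 0.0)  # accumulates sum of -(v ∧ 0)
    return pos - neg
```

Both accumulators are non-negative at every step, which is exactly why the interleaved ReLU activations in the stacked convolutional blocks leave them intact.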

B.4 SUPPORTING LEMMAS FOR THEOREM 1

Before stating Lemma 5, we provide a brief definition of cardinal B-splines.

Definition 5 (Cardinal B-spline). Let $\psi(x) = \mathbb{1}_{[0,1]}(x)$ be the indicator function of the interval $[0,1]$. The cardinal B-spline of order m is defined by taking the $(m+1)$-fold convolution of ψ:
$$\psi_m(x) = (\underbrace{\psi * \psi * \cdots * \psi}_{m+1 \text{ times}})(x), \qquad \text{where } f * g(x) \equiv \int f(x-t)\, g(t)\, dt.$$
Note that $\psi_m$ is a piecewise polynomial of degree m with support $[0, m+1]$. It can be expressed as [45]
$$\psi_m(x) = \frac{1}{m!} \sum_{j=0}^{m+1} (-1)^j \binom{m+1}{j} (x - j)_+^m.$$
For any $k, j \in \mathbb N$, let $g_{k,j,m}(x) = \psi_m(2^k x - j)$, which is the rescaled and shifted cardinal B-spline with resolution $2^{-k}$ and support $2^{-k}[j, j + (m+1)]$. For $k = (k_1, \ldots, k_d) \in \mathbb N^d$ and $j = (j_1, \ldots, j_d) \in \mathbb N^d$, we define the d-dimensional cardinal B-spline as $g^d_{k,j,m}(x) = \prod_{i=1}^d \psi_m(2^{k_i} x_i - j_i)$. When $k_1 = \cdots = k_d = k \in \mathbb N$, we write $g^d_{k,j,m}(x) = \prod_{i=1}^d \psi_m(2^k x_i - j_i)$.

B.4.1 APPROXIMATING BESOV FUNCTIONS WITH CARDINAL B-SPLINES

For any $m \in \mathbb N$, let $J(k) = \{-m, -m+1, \ldots, 2^k - 1, 2^k\}^d$, and define the quasi-norm of the coefficients $\{c_{k,j}\}$ for $k \in \mathbb N, j \in J(k)$ as
$$\| \{c_{k,j}\} \|_{b^\alpha_{p,q}} = \Bigg( \sum_{k \in \mathbb N} \Big[ 2^{k(\alpha - d/p)} \Big( \sum_{j \in J(k)} |c_{k,j}|^p \Big)^{1/p} \Big]^q \Bigg)^{1/q}.$$
We can now state the following lemma, from DeVore & Popov [13] and Dung [17], which provides an upper bound on the error of approximating functions in $B^\alpha_{p,q}([0,1]^d)$ with cardinal B-splines.

Lemma 5 (Lemma 2 in Suzuki [60]; DeVore & Popov [13], Dung [17]). Assume that $0 < p, q, r \le \infty$ and $0 < \alpha < \infty$ satisfy $\alpha > d(1/p - 1/r)_+$. Let $m \in \mathbb N$ be the order of the cardinal B-spline basis such that $0 < \alpha < \min(m, m - 1 + 1/p)$. For any $f \in B^\alpha_{p,q}([0,1]^d)$, there exists $f_N$ satisfying
$$\| f - f_N \|_{L^r([0,1]^d)} \le C N^{-\alpha/d} \| f \|_{B^\alpha_{p,q}([0,1]^d)}$$
for some constant C, with $N \gg 1$. $f_N$ is of the form
$$f_N(x) = \sum_{k=0}^{H} \sum_{j \in J(k)} c_{k,j}\, g^d_{k,j,m}(x) + \sum_{k=H+1}^{H^*} \sum_{i=1}^{n_k} c_{k,j_i}\, g^d_{k,j_i,m}(x),$$
where $\{j_i\}_{i=1}^{n_k} \subset J(k)$, $H = \lceil c_1 \log(N)/d \rceil$, $H^* = \lceil \nu^{-1} \log(\lambda N) \rceil + H + 1$, $n_k = \lceil \lambda N 2^{-\nu(k-H)} \rceil$ for $k = H+1, \ldots, H^*$, $u = d(1/p - 1/r)_+$ and $\nu = (\alpha - u)/(2u)$. The real numbers $c_1 > 0$ and $\lambda > 0$ are two absolute constants chosen so that the total number of basis functions is at most N, i.e., $\sum_{k=1}^{H} (2^k + m)^d + \sum_{k=H+1}^{H^*} n_k \le N$. Moreover, we can choose the coefficients $\{c_{k,j}\}$ such that $\| \{c_{k,j}\} \|_{b^\alpha_{p,q}} \le C_1 \| f \|_{B^\alpha_{p,q}([0,1]^d)}$ for some constant $C_1$.
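Definition 5's closed form is straightforward to implement and sanity-check. The sketch below (helper names are ours) evaluates $\psi_m$ and the rescaled splines $g_{k,j,m}$:

```python
from math import comb, factorial

def psi(m, x):
    """Cardinal B-spline of order m (Definition 5), via the closed form
    psi_m(x) = (1/m!) * sum_{j=0}^{m+1} (-1)^j C(m+1, j) (x - j)_+^m."""
    return sum((-1) ** j * comb(m + 1, j) * max(x - j, 0.0) ** m
               for j in range(m + 2)) / factorial(m)

def g(k, j, m, x):
    """Rescaled and shifted spline g_{k,j,m}(x) = psi_m(2^k x - j),
    with resolution 2^{-k} and support 2^{-k}[j, j + m + 1]."""
    return psi(m, 2 ** k * x - j)
```

One can check, for example, that the order-1 spline is the familiar hat function, that $\psi_m$ vanishes outside $[0, m+1]$, and that integer shifts of $\psi_m$ sum to 1 in the interior of their common support.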

B.4.2 APPROXIMATING CARDINAL B-SPLINES AND OTHERS WITH SINGLE-BLOCK CNNS

The following Proposition 1 quantifies the tradeoff between the size of a single-block CNN and its approximation error for the multiplication operator.

Proposition 1. Let × be defined as in (13). For any $\eta \in (0,1)$, there exists a single-block CNN $\hat\times(\cdot,\cdot)$ such that
$$\| a \times b - \hat\times(a, b) \|_{L^\infty} \le \eta,$$
where a, b are functions uniformly bounded by $c_0$. $\hat\times$ is a single-block CNN approximation of × in $\mathcal F^{\rm SCNN}(L, J, I, \tau, \tau)$ with $L = O(\log 1/\eta) + D$ layers, $J = 24$ channels and any $2 \le I \le D$. All parameters are bounded by $\tau = (c_0^2 \vee 1)$. Furthermore, the weight matrix in the fully connected layer of $\hat\times$ has nonzero entries only in the first row.

Proof of Proposition 1. First, let us define a particular class of feed-forward ReLU networks of the form
$$f(x) = W_L \cdot \mathrm{ReLU}\big( W_{L-1} \cdots \mathrm{ReLU}(W_1 x + b_1) \cdots + b_{L-1} \big) + b_L, \tag{36}$$
$$\mathcal F(L, J, \tau) = \big\{ f \mid f(x) \text{ in the form (36) with } L \text{ layers and width at most } J,\ \| W_i \|_{\infty,\infty} \le \tau,\ \| b_i \|_\infty \le \tau \text{ for } i = 1, \cdots, L \big\}.$$
By Proposition 3 in Yarotsky, there exists a feed-forward ReLU network that approximates the multiplication of values with magnitude bounded by $c_0$ up to η error. Such a feed-forward network has $O(\log 1/\eta)$ layers, width bounded by 6, and all its parameters bounded by $c_0^2$. Therefore, such a feed-forward network suffices to approximate × with η error in the $L^\infty$-norm, because the arguments of × are uniformly bounded by $c_0$ under Assumption 2. Furthermore, by Lemma 8 in Liu et al. [42], we can express this feed-forward network with a single-block CNN in $\mathcal F^{\rm SCNN}(L, J, I, \tau, \tau)$, where L, J, I, τ are as specified in the statement of the proposition.

Proposition 2 quantifies the tradeoff between the size of a single-block CNN and its approximation error for the cardinal B-splines approximating $f_i \circ \phi_i^{-1}$.

Proposition 2 (Proposition 3 in Liu et al. [42]). Let $f_i \circ \phi_i^{-1}$ be defined as in (13). For any $\delta \in (0,1)$, set $N = C_1 \delta^{-d/\alpha}$. For any $2 \le I \le D$, there exists a set of single-block CNNs $\{ \hat f^{\rm SCNN}_{i,j} \}_{j=1}^N$ such that
$$\Big\| \sum_{j=1}^N \hat f^{\rm SCNN}_{i,j} - f_i \circ \phi_i^{-1} \Big\|_{L^\infty} \le \delta,$$
where $C_1$ is a constant depending on α, p, q and d. Each $\hat f^{\rm SCNN}_{i,j}$ is a single-block CNN approximation of $\bar f_{i,j}$ (defined in (14)) in $\mathcal F^{\rm SCNN}(L, J, I, \tau, \tau)$ with
$$L = O(\log(1/\delta)), \quad J = \lceil 24 d (\alpha+1)(\alpha+3) + 8d \rceil, \quad \tau = O\big( \delta^{-(\log 2)(\frac{2d}{\alpha p - d} + c_1 d^{-1})} \big).$$
The constant hidden in $O(\cdot)$ depends on d, α, $\frac{2d}{\alpha p - d}$, p, q, $c_0$.

Proposition 3 quantifies the tradeoff between the size of the sub-networks of the chart determination network and their approximation errors for the chart determination indicator and the squared distance function $d^2_i$.

Proposition 3 (Lemma 9 in Liu et al. [42]). Let $d^2_i$ and $\mathbb{1}_{[0,\beta^2]}$ be defined as in (18). For any $\theta \in (0,1)$ and $\Delta \ge 8 B^2 D \theta$, there exists a single-block CNN $\hat d^2_i$ approximating $d^2_i$ such that $\| \hat d^2_i - d^2_i \|_{L^\infty} \le 4 B^2 D \theta$, and a CNN $\hat{\mathbb 1}_\Delta$ approximating $\mathbb{1}_{[0,\beta^2]}$ with
$$\hat{\mathbb 1}_\Delta(a) = \begin{cases} 1, & \text{if } a \le (1 - 2^{-k})(\beta^2 - 4 B^2 D \theta), \\ 0, & \text{if } a \ge \beta^2 - 4 B^2 D \theta, \\ 2^k \big( 1 - (\beta^2 - 4 B^2 D \theta)^{-1} a \big), & \text{otherwise.} \end{cases}$$
The single-block CNN $\hat d^2_i$ has $O(\log(1/\theta))$ layers and 6D channels, with all weight parameters bounded by $4B^2$. The single-block CNN $\hat{\mathbb 1}_\Delta$ has $\log(\beta^2/\Delta)$ layers and 2 channels, with all weight parameters bounded by $\max(2, |\beta^2 - 4 B^2 D \theta|)$.
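Proposition 1 rests on Yarotsky's classical construction, which builds multiplication out of an approximate squaring network made from composed "tooth" (triangle-wave) functions. The following numpy sketch emulates that construction numerically; it is not a CNN, the function names are ours, and the polarization identity $ab = ((a+b)^2 - (a-b)^2)/4$ is used to reduce multiplication to squaring.

```python
import numpy as np

def tooth(x):
    """ReLU-expressible triangle wave: 2x on [0, 1/2], 2(1 - x) on [1/2, 1]."""
    return np.where(x < 0.5, 2.0 * x, 2.0 * (1.0 - x))

def sq_approx(x, m):
    """Yarotsky's approximation of x^2 on [0, 1]:
    x - sum_{s=1}^m g_s(x) / 4^s, with g_s the s-fold composition of the tooth.
    The uniform error is at most 2^{-2(m+1)}."""
    x = np.asarray(x, dtype=float)
    out, g = x.copy(), x
    for s in range(1, m + 1):
        g = tooth(g)
        out -= g / 4.0 ** s
    return out

def mul_approx(a, b, c0, m):
    """Multiplication of values in [-c0, c0] via polarization,
    with each square approximated on [0, 1] after rescaling."""
    sq = lambda t: sq_approx(np.abs(t) / (2.0 * c0), m)
    return (c0 ** 2) * (sq(a + b) - sq(a - b))
```

With m compositions of the tooth, the squaring error decays as $2^{-2(m+1)}$, which is the source of the $O(\log 1/\eta)$ depth in Proposition 1.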

C.1 SUPPORTING LEMMAS AND PROOFS

Proposition 4 below provides an upper bound on the $L^\infty$-norm of the output of a series of convolutional blocks in terms of the architecture parameters, e.g., number of layers, number of channels, etc. Let $J^{(i)}_m$ be the number of channels in the i-th layer of the m-th block, and let $I^{(i)}_m$ be the filter size of the i-th layer of the m-th block. $Q_{[i,j]}$ is defined as $Q_{[i,j]}(x) = \mathrm{Conv}_{\mathcal W_j, \mathcal B_j} \circ \cdots \circ \mathrm{Conv}_{\mathcal W_i, \mathcal B_i}(x)$.

Proposition 4. For $m = 1, 2, \cdots, M$ and $x \in [-1,1]^D$, we have
$$\big\| Q_{[1,m]}(x) \big\|_\infty \le (1 \vee \tau_1) \Bigg( \prod_{j=1}^m \prod_{i=1}^{L_j} J^{(i-1)}_j I^{(i)}_j \tau_1 \Bigg) \Bigg( 1 + \sum_{k=1}^m L_k \prod_{i=1}^{L_k} \big( 1 \vee J^{(i-1)}_k I^{(i)}_k \tau_1 \big) \Bigg).$$

Proof.
$$\begin{aligned}
\big\| Q_{[1,m]}(x) \big\|_\infty &= \big\| \mathrm{Conv}_{\mathcal W_m, \mathcal B_m}\big(Q_{[1,m-1]}(x)\big) \big\|_\infty \\
&\le \Bigg( \prod_{i=1}^{L_m} J^{(i-1)}_m I^{(i)}_m \tau_1 \Bigg) \big\| Q_{[1,m-1]}(x) \big\|_\infty + \tau_1 L_m \prod_{i=1}^{L_m} \big( 1 \vee J^{(i-1)}_m I^{(i)}_m \tau_1 \big) \\
&\le \| P(x) \|_\infty \prod_{j=1}^m \prod_{i=1}^{L_j} J^{(i-1)}_j I^{(i)}_j \tau_1 + \tau_1 \sum_{k=1}^m L_k \Bigg( \prod_{i=1}^{L_k} \big( 1 \vee J^{(i-1)}_k I^{(i)}_k \tau_1 \big) \Bigg) \prod_{l=k+1}^m \prod_{i=1}^{L_l} J^{(i-1)}_l I^{(i)}_l \tau_1 \\
&\le \| x \|_\infty \prod_{j=1}^m \prod_{i=1}^{L_j} J^{(i-1)}_j I^{(i)}_j \tau_1 + \tau_1 \sum_{k=1}^m L_k \Bigg( \prod_{i=1}^{L_k} \big( 1 \vee J^{(i-1)}_k I^{(i)}_k \tau_1 \big) \Bigg) \prod_{l=k+1}^m \prod_{i=1}^{L_l} J^{(i-1)}_l I^{(i)}_l \tau_1 \\
&\le (1 \vee \tau_1) \Bigg( \prod_{j=1}^m \prod_{i=1}^{L_j} J^{(i-1)}_j I^{(i)}_j \tau_1 \Bigg) \Bigg( 1 + \sum_{k=1}^m L_k \prod_{i=1}^{L_k} \big( 1 \vee J^{(i-1)}_k I^{(i)}_k \tau_1 \big) \Bigg),
\end{aligned}$$
where the first two inequalities are obtained by applying Proposition 9 of Oono & Suzuki [54] recursively.

Lemma 9 quantifies the sensitivity of a CNN with respect to small changes in its weight parameters. This will be used to create a discrete covering of the CNN class.

Lemma 9. Let $f, f' \in \mathcal F(M, L, J, I, \tau_1, \tau_2, V)$ and $\epsilon > 0$ be such that $\| W - W' \|_\infty \le \epsilon$, $\| b - b' \|_\infty \le \epsilon$, $\| W^{(l)}_m - W^{(l)\prime}_m \|_\infty \le \epsilon$ and $\| B^{(l)}_m - B^{(l)\prime}_m \|_\infty \le \epsilon$ for all m and l, where $(W, b, \{\{(W^{(l)}_m, B^{(l)}_m)\}_{l=1}^{L_m}\}_{m=1}^M)$ and $(W', b', \{\{(W^{(l)\prime}_m, B^{(l)\prime}_m)\}_{l=1}^{L_m}\}_{m=1}^M)$ are the parameters of f and f', respectively. Then $\| f - f' \|_\infty \le \Lambda_1 \epsilon$, where $\Lambda_1$ is defined in Lemma 8.

Proof.
For any $x \in [-1,1]^D$,
$$\begin{aligned}
|f(x) - f'(x)| &= \big| W \otimes Q(x) + b - W' \otimes Q'(x) - b' \big| \\
&= \big| (W - W') \otimes Q(x) + b - b' + W' \otimes \big(Q(x) - Q'(x)\big) \big| \\
&= \Big| (W - W') \otimes Q(x) + b - b' + \sum_{m=1}^M W' \otimes \big( Q_{[m+1,M]} \circ \mathrm{Conv}_{\mathcal W_m, \mathcal B_m} - Q_{[m+1,M]} \circ \mathrm{Conv}_{\mathcal W'_m, \mathcal B'_m} \big) \circ Q'_{[1,m-1]} \Big| \\
&\le \big| (W - W') \otimes Q(x) + b - b' \big| + \sum_{m=1}^M \Big| W' \otimes \big( Q_{[m+1,M]} \circ \mathrm{Conv}_{\mathcal W_m, \mathcal B_m} - Q_{[m+1,M]} \circ \mathrm{Conv}_{\mathcal W'_m, \mathcal B'_m} \big) \circ Q'_{[1,m-1]} \Big| \\
&\overset{(a)}{\le} (3 + M)\, J D\, (1 \vee \tau_1)(1 \vee \tau_2) \Bigg( \prod_{j=1}^M \prod_{i=1}^{L_j} J^{(i-1)}_j I^{(i)}_j \tau_1 \Bigg) \Bigg( 1 + \sum_{k=1}^M L_k \prod_{i=1}^{L_k} \big( 1 \vee J^{(i-1)}_k I^{(i)}_k \tau_1 \big) \Bigg) \epsilon,
\end{aligned}$$
where (a) is obtained through the following reasoning. The first term in (a) can be bounded as
$$\begin{aligned}
\big| (W - W') \otimes Q(x) + b - b' \big| &\le \big( \| W \|_0 + \| W' \|_0 \big) \| W - W' \|_\infty \| Q(x) \|_\infty + \| b - b' \|_\infty \\
&\le 2 J D \epsilon \| Q(x) \|_\infty + \epsilon \le 3 J D \epsilon \| Q(x) \|_\infty \\
&\le 3 J D\, (1 \vee \tau_1) \Bigg( \prod_{j=1}^M \prod_{i=1}^{L_j} J^{(i-1)}_j I^{(i)}_j \tau_1 \Bigg) \Bigg( 1 + \sum_{k=1}^M L_k \prod_{i=1}^{L_k} \big( 1 \vee J^{(i-1)}_k I^{(i)}_k \tau_1 \big) \Bigg) \epsilon,
\end{aligned}$$
where the first inequality uses Proposition 8 of Oono & Suzuki [54] and the last inequality is obtained by invoking Proposition 4. For the second term in (a), for any $m = 1, \cdots, M$ we have
$$\begin{aligned}
& \Big| W' \otimes \big( Q_{[m+1,M]} \circ \mathrm{Conv}_{\mathcal W_m, \mathcal B_m} - Q_{[m+1,M]} \circ \mathrm{Conv}_{\mathcal W'_m, \mathcal B'_m} \big) \circ Q'_{[1,m-1]} \Big| \\
&\overset{(b)}{\le} \| W' \|_0\, \tau_2\, \Big\| \big( Q_{[m+1,M]} \circ \mathrm{Conv}_{\mathcal W_m, \mathcal B_m} - Q_{[m+1,M]} \circ \mathrm{Conv}_{\mathcal W'_m, \mathcal B'_m} \big) \circ Q'_{[1,m-1]} \Big\|_\infty \\
&\overset{(c)}{\le} J D \tau_2 \Bigg( \prod_{j=m+1}^M \prod_{i=1}^{L_j} J^{(i-1)}_j I^{(i)}_j \tau_1 \Bigg) \Big\| \big( \mathrm{Conv}_{\mathcal W_m, \mathcal B_m} - \mathrm{Conv}_{\mathcal W'_m, \mathcal B'_m} \big) \circ Q'_{[1,m-1]} \Big\|_\infty \\
&\overset{(d)}{\le} J D \tau_2 \Bigg( \prod_{j=m+1}^M \prod_{i=1}^{L_j} J^{(i-1)}_j I^{(i)}_j \tau_1 \Bigg) \Bigg( \prod_{i=1}^{L_m} J^{(i-1)}_m I^{(i)}_m \tau_1 \Bigg) \big\| Q'_{[1,m-1]} \big\|_\infty\, \epsilon \\
&\overset{(e)}{\le} J D \tau_2 \Bigg( \prod_{j=m+1}^M \prod_{i=1}^{L_j} J^{(i-1)}_j I^{(i)}_j \tau_1 \Bigg) \Bigg( \prod_{i=1}^{L_m} J^{(i-1)}_m I^{(i)}_m \tau_1 \Bigg) (1 \vee \tau_1) \Bigg( \prod_{j=1}^{m-1} \prod_{i=1}^{L_j} J^{(i-1)}_j I^{(i)}_j \tau_1 \Bigg) \Bigg( 1 + \sum_{k=1}^{m-1} L_k \prod_{i=1}^{L_k} \big( 1 \vee J^{(i-1)}_k I^{(i)}_k \tau_1 \big) \Bigg) \epsilon \\
&\le J D \tau_2\, (1 \vee \tau_1) \Bigg( \prod_{j=1}^M \prod_{i=1}^{L_j} J^{(i-1)}_j I^{(i)}_j \tau_1 \Bigg) \Bigg( 1 + \sum_{k=1}^M L_k \prod_{i=1}^{L_k} \big( 1 \vee J^{(i-1)}_k I^{(i)}_k \tau_1 \big) \Bigg) \epsilon.
\end{aligned}$$
Combining the two bounds proves (a) and hence the lemma.

Proof of Lemma 8. We grid the range of each parameter into subsets of width $\Lambda_1^{-1} \delta$, so there are at most $2(\tau_1 \vee \tau_2) \Lambda_1 \delta^{-1}$ subsets for each parameter and, in total, $\big( 2(\tau_1 \vee \tau_2) \Lambda_1 \delta^{-1} \big)^{\Lambda_2}$ bins in the grid. For any $f, f' \in \mathcal F(M, L, J, I, \tau_1, \tau_2, V)$ within the same bin, Lemma 9 gives $\| f - f' \|_\infty \le \delta$. We can therefore construct a δ-covering of cardinality $\big( 2(\tau_1 \vee \tau_2) \Lambda_1 \delta^{-1} \big)^{\Lambda_2}$ by selecting one neural network from each bin of the grid. Taking the logarithm and plugging in the network architecture parameters, we have
$$\log \mathcal N\big( \delta, \mathcal F(M, L, J, I, \tau_1, \tau_2, V), \| \cdot \|_\infty \big) = O\big( \Lambda_2 \log\big( (\tau_1 \vee \tau_2) \Lambda_1 \delta^{-1} \big) \big),$$
where $\mathcal N(\delta, \mathcal F(M, L, J, I, \tau_1, \tau_2, V), \| \cdot \|_\infty)$ denotes the δ-covering number of $\mathcal F(M, L, J, I, \tau_1, \tau_2, V)$ with respect to the $\ell^\infty$ norm, i.e., there exists a discretization of $\mathcal F(M, L, J, I, \tau_1, \tau_2, V)$ into $\mathcal N(\delta, \mathcal F, \| \cdot \|_\infty)$ distinct elements such that for any $f \in \mathcal F$, there is $\bar f$ in the discretization satisfying $\| f - \bar f \|_\infty \le \delta$.

Lemma 11. $T_1$ satisfies
$$T_1 \le 4 \inf_{f \in \mathcal F(M,L,J,I,\tau_1,\tau_2,V)} \int_{\mathcal X} \big( f(x) - f_0(x) \big)^2 dP_x(x) + 48 \sigma^2\, \frac{\log \mathcal N(\delta, \mathcal F, \| \cdot \|_\infty) + 2}{n} + \Bigg( 8 \sqrt{6} \sqrt{\frac{\log \mathcal N(\delta, \mathcal F, \| \cdot \|_\infty) + 2}{n}} + 8 \Bigg) \sigma \delta.$$

Lemma 12 (Lemma 6 in Chen et al. [7]). For any constant $\delta \in (0, 2R)$, $T_2$ satisfies
$$T_2 \le \frac{104 V^2}{3n} \Big( \log \mathcal N\big( \delta/4V, \mathcal F(M, L, J, I, \tau_1, \tau_2, V), \| \cdot \|_\infty \big) + 4 \Big) + \frac{1}{2V}\, \delta.$$

D.2 PROOF OF LEMMA 10

Proof of Lemma 10. Recall that the bias and variance decomposition of E X f n (x) -f 0 (x) 2 dP x (x) as E X f n (x) -f 0 (x) 2 dP x (x) = E 2 n n i=1 ( f n (x i ) -f 0 (x i )) 2 T1 + E X f n (x) -f 0 (x) 2 dP x (x) -E 2 n n i=1 ( f n (x i ) -f 0 (x i )) 2 T2 . Applying the upper bounds of T 1 and T 2 in Lemmas 11 and 12 respectively, we can derive E X f n (x) -f 0 (x) 2 dP x (x) ≤ 4 inf f ∈F (M,L,J,I,τ1,τ2,V ) X (f (x) -f 0 (x)) 2 dP x (x) + 48σ 2 log N (δ, F(M, L, J, I, τ 1 , τ 2 , V ), ∥•∥ ∞ ) + 2 n + 8 √ 6 log N (δ, F(M, L, J, I, τ 1 , τ 2 , V ), ∥•∥ ∞ ) + 2 n σδ + 104V 2 F 3n log N (δ/4V, F(M, L, J, I, τ 1 , τ 2 , V ), ∥•∥ ∞ ) + 4 + 1 2V F + 8σ δ. We need there to exist a network in F(M, L, J, I, τ 1 , τ 2 , V ) which can yield a function f satisfying ∥f -f 0 ∥ ∞ ≤ ϵ for ϵ ∈ (0, 1). ϵ will be chosen later to balance the bias-variance tradeoff. In order to achieve such ϵ-error, we set M J = ϵ -d/α , so we now have our network architecture as specified in Theorem 1 in terms of ϵ. Then, we can use the parameters in this architecture to invoke the upper bound of the covering number in Lemma 8: log N (δ, F(M, L, J, I, τ 1 , τ 2 , V ), ∥•∥ ∞ ) = O Λ 2 log (τ 1 ∨ τ 2 ) Λ 1 δ -1 ≤ O M J 2 D 3 log 5 ( M J) log 1 δ = O ϵ -d/α D 3 log 5 ϵ log 1 δ , where O(•) hides constant depending on log D, d, α, 2d αp-d , p, q, c 0 , B, ω and the surface area of X . Plugging it in, we have E X f n (x) -f 0 (x) 2 dD x (x) ≤ 4ϵ 2 + 48σ 2 n c ′′ ϵ -d/α D 3 log 5 ϵ log 1 δ + 2 + 8 √ 6c ′′ ϵ -d/α D 3 log 5 ϵ log 1 δ n σδ + 104V 2 3n ϵ -d/α D 3 log 5 ϵ log 1 δ + 4 + 1 2V F + 8σ δ = O ϵ 2 + V 2 F + σ 2 n ϵ -d α D 3 log 5 ϵ log 1 δ + σδ ϵ -d α D 3 log 5 ϵ log 1 δ n + σδ + σ 2 n . Finally we choose ϵ to satisfy ϵ 2 = 1 n D 3 ϵ -d α , which gives ϵ = D 3α 2α+d n -α 2α+d . It suffices to pick δ = 1 n . 
Substituting both $\epsilon$ and $\delta$ into (41), we deduce the desired estimation error bound
$$\mathbb E \int_{\mathcal X} \big(\widehat f_n(x) - f_0(x)\big)^2 dP_x(x) \le c\big(V_{\mathcal F}^2 + \sigma^2\big)\, n^{-\frac{2\alpha}{2\alpha+d}} \log^5 n,$$
where the constant $c$ depends on $D^{\frac{6\alpha}{2\alpha+d}}, d, \alpha, \frac{2d}{\alpha p - d}, p, q, c_0, B, \omega$ and the surface area of $\mathcal X$.

E A RESULT FOR FEED-FORWARD RELU NEURAL NETWORK

E.1 FEED-FORWARD RELU NEURAL NETWORK

We consider multi-layer ReLU (Rectified Linear Unit) neural networks [25]. ReLU activation is popular in computer vision, natural language processing, and related areas because the vanishing gradient issue is less severe with it than with counterparts such as the sigmoid or hyperbolic tangent activations [25, 27]. An L-layer ReLU neural network can be expressed as
$$f(x) = W_L \cdot \mathrm{ReLU}\big(W_{L-1} \cdots \mathrm{ReLU}(W_1 x + b_1) \cdots + b_{L-1}\big) + b_L,$$
and the networks we consider satisfy
$$\|f\|_\infty \le V, \quad \sum_{i=1}^L \|W_i\|_0 + \|b_i\|_0 \le I, \quad \|W_i\|_{\infty,\infty} \le \tau, \quad \|b_i\|_\infty \le \tau \ \text{ for } i = 1, \dots, L\}.$$
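As a concrete illustration of this network class, the NumPy sketch below evaluates an L-layer ReLU network of the form above; the clipping to $[-V, V]$ enforces the sup-norm constraint $\|f\|_\infty \le V$. The weight values at the end are arbitrary, chosen only to make the example checkable.

```python
import numpy as np

def relu_network(x, weights, biases, V):
    """Evaluate f(x) = W_L ReLU(W_{L-1} ... ReLU(W_1 x + b_1) ... + b_{L-1}) + b_L,
    clipped to [-V, V] so that the sup-norm bound ||f||_inf <= V holds."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)      # entrywise ReLU after each hidden layer
    out = weights[-1] @ h + biases[-1]      # final layer has no activation
    return np.clip(out, -V, V)

# A 2-layer example computing f(x) = ReLU(x) + ReLU(-x) = |x|
weights = [np.array([[1.0], [-1.0]]), np.array([[1.0, 1.0]])]
biases = [np.zeros(2), np.zeros(1)]
```

For instance, `relu_network([-3.0], weights, biases, V=10.0)` returns `[3.0]`, while setting `V=1.0` clips the same output to `[1.0]`.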

E.2 POLICY EVALUATION ERROR AND ITS PROOF

From this point on, we denote by $\mathcal F$ the function class $\mathcal F(L, p, I, \tau, V)$ whose parameters $L, p, I, \tau, V$ are chosen according to Theorem 3. In this section, this $\mathcal F$ is used in Algorithm 1 instead of the CNN class in (11).

Theorem 3. Suppose Assumptions 1 and 2 hold. Choose
$$L = O(\log K), \quad p = O\big(K^{\frac{d}{2\alpha+d}}\big), \quad I = O\big(K^{\frac{d}{2\alpha+d}} \log K\big), \quad \tau = \max\{B, H, \sqrt d, \omega^2\}, \quad V = H$$
in Algorithm 1, in which $O(\cdot)$ hides factors depending on $\alpha$, $d$ and $\log D$. Then we have
$$\mathbb E\,|\widehat v^\pi - v^\pi| \le C H^2 \kappa \Big( K^{-\frac{\alpha}{2\alpha+d}} + \sqrt{D/K} \Big) \log^{\frac32} K,$$
in which the expectation is taken over the data, and $C$ is a constant depending on $\log D, \alpha, B, d, \omega$, the surface area of $\mathcal X$ and $c_0$. The distributional mismatch is captured by
$$\kappa = \frac1H \sum_{h=1}^H \sqrt{\chi^2_{\mathcal Q}\big(q^\pi_h, q^{\pi_0}_h\big) + 1},$$
in which $\mathcal Q$ is the Minkowski sum of the Besov function class and the ReLU network class, i.e., $\mathcal Q = \{f + g \mid f \in B^\alpha_{p,q}(\mathcal X),\ g \in \mathcal F\}$.

Proof of Theorem 3. The goal is to bound
$$\mathbb E\,|\widehat v^\pi - v^\pi| = \mathbb E \left| \int_{\mathcal X} \big(\widehat Q^\pi_1 - Q^\pi_1\big)(s,a)\, dq^\pi_1(s,a) \right| \le \mathbb E \int_{\mathcal X} \big|\widehat Q^\pi_1 - Q^\pi_1\big|(s,a)\, dq^\pi_1(s,a).$$
To bound the right-hand side, we first expand it recursively.
To illustrate the recursive relation, we examine the quantity at step $h$:
$$
\begin{aligned}
\mathbb E \int_{\mathcal X} \big|\widehat Q^\pi_h - Q^\pi_h\big|(s,a)\, dq^\pi_h(s,a)
&= \mathbb E \int_{\mathcal X} \big|\widehat T^\pi_h \widehat Q^\pi_{h+1} - T^\pi_h Q^\pi_{h+1}\big|(s,a)\, dq^\pi_h(s,a) \\
&\le \mathbb E \int_{\mathcal X} \big|T^\pi_h \widehat Q^\pi_{h+1} - T^\pi_h Q^\pi_{h+1}\big|(s,a)\, dq^\pi_h(s,a) + \mathbb E \int_{\mathcal X} \big|\widehat T^\pi_h \widehat Q^\pi_{h+1} - T^\pi_h \widehat Q^\pi_{h+1}\big|(s,a)\, dq^\pi_h(s,a) \\
&= \mathbb E \int_{\mathcal X} \big|\widehat Q^\pi_{h+1} - Q^\pi_{h+1}\big|(s,a)\, dq^\pi_{h+1}(s,a) + \mathbb E\Big[ \mathbb E\Big[ \int_{\mathcal X} \big|\widehat T^\pi_h \widehat Q^\pi_{h+1} - T^\pi_h \widehat Q^\pi_{h+1}\big|(s,a)\, dq^\pi_h(s,a) \,\Big|\, \mathcal D_{h+1}, \dots, \mathcal D_H \Big] \Big] \\
&\overset{(a)}{\le} \mathbb E \int_{\mathcal X} \big|\widehat Q^\pi_{h+1} - Q^\pi_{h+1}\big|\, dq^\pi_{h+1} + \mathbb E\Big[ \mathbb E\Big[ \sqrt{\int_{\mathcal X} \big(\widehat T^\pi_h \widehat Q^\pi_{h+1} - T^\pi_h \widehat Q^\pi_{h+1}\big)^2(s,a)\, dq^{\pi_0}_h(s,a)}\; \sqrt{\chi^2_{\mathcal Q}(q^\pi_h, q^{\pi_0}_h) + 1} \,\Big|\, \mathcal D_{h+1}, \dots, \mathcal D_H \Big] \Big] \\
&\overset{(b)}{\le} \mathbb E \int_{\mathcal X} \big|\widehat Q^\pi_{h+1} - Q^\pi_{h+1}\big|\, dq^\pi_{h+1} + \mathbb E\Bigg[ \sqrt{\mathbb E\Big[ \int_{\mathcal X} \big(\widehat T^\pi_h \widehat Q^\pi_{h+1} - T^\pi_h \widehat Q^\pi_{h+1}\big)^2\, dq^{\pi_0}_h \,\Big|\, \mathcal D_{h+1}, \dots, \mathcal D_H \Big]}\, \Bigg] \sqrt{\chi^2_{\mathcal Q}(q^\pi_h, q^{\pi_0}_h) + 1} \\
&\overset{(c)}{\le} \mathbb E \int_{\mathcal X} \big|\widehat Q^\pi_{h+1} - Q^\pi_{h+1}\big|\, dq^\pi_{h+1} + \sqrt{c\,(5H^2)\Big( K^{-\frac{2\alpha}{2\alpha+d}} + \frac{D}{K} \Big) \log^3 K}\; \sqrt{\chi^2_{\mathcal Q}(q^\pi_h, q^{\pi_0}_h) + 1} \\
&\le \mathbb E \int_{\mathcal X} \big|\widehat Q^\pi_{h+1} - Q^\pi_{h+1}\big|\, dq^\pi_{h+1} + C H \Big( K^{-\frac{\alpha}{2\alpha+d}} + \sqrt{D/K} \Big) \log^{3/2} K\, \sqrt{\chi^2_{\mathcal Q}(q^\pi_h, q^{\pi_0}_h) + 1},
\end{aligned}
$$
where $C$ denotes a (possibly varying) constant depending on $\log D, \alpha, B, d, \omega$, the surface area of $\mathcal X$ and $c_0$. In (a), note that $T^\pi_h \widehat Q^\pi_{h+1} \in B^\alpha_{p,q}(\mathcal X)$ by Assumption 2 and $-\widehat T^\pi_h \widehat Q^\pi_{h+1} \in \mathcal F$ by our algorithm, so $T^\pi_h \widehat Q^\pi_{h+1} - \widehat T^\pi_h \widehat Q^\pi_{h+1} \in \mathcal Q$; we then obtain this inequality by invoking the lemma on the function class-restricted $\chi^2$-divergence. In (b), we use Jensen's inequality and the fact that the square root is concave. To obtain (c), we invoke Lemma 13 below, which provides an upper bound on the regression error. Specifically, we use Lemma 13 when conditioning on $\mathcal D_{h+1}, \dots, \mathcal D_H$, i.e., the data from time step $h+1$ to time step $H$. Note that after conditioning, $T^\pi_h \widehat Q^\pi_{h+1}$ becomes measurable and deterministic with respect to $\mathcal D_{h+1}, \dots, \mathcal D_H$. Also, $\mathcal D_{h+1}, \dots, \mathcal D_H$ are independent of $\mathcal D_h$, which we use in the regression at step $h$. To justify our use of Lemma 13, we need to cast our problem into the regression problem described in the lemma. Since $\{(s_{h,k}, a_{h,k})\}_{k=1}^K$ are i.i.d. from $q^{\pi_0}_h$, we can view them as the samples $x_i$'s in the lemma.
We can view $T^\pi_h \widehat Q^\pi_{h+1}$, which is measurable under our conditioning, as $f_0$ in the lemma. Furthermore, we let
$$\zeta_{h,k} := r_{h,k} + \int_{\mathcal A} \widehat Q^\pi_{h+1}(s'_{h,k}, a)\, \pi(a \mid s'_{h,k})\, da - T^\pi_h \widehat Q^\pi_{h+1}(s_{h,k}, a_{h,k}).$$
Each $\zeta_{h,k}$ is a function of $(s_{h,k}, a_{h,k}, s'_{h,k}, r_{h,k})$, which is independent of $(s_{h,k'}, a_{h,k'}, s'_{h,k'}, r_{h,k'})$ for any $k' \ne k$, so the $\zeta_{h,k}$'s are independent of each other. As for the mean of $\zeta_{h,k}$,
$$
\begin{aligned}
\mathbb E[\zeta_{h,k} \mid \mathcal D_{h+1}, \dots, \mathcal D_H]
&= \mathbb E\Big[ r_{h,k} + \int_{\mathcal A} \widehat Q^\pi_{h+1}(s'_{h,k}, a)\, \pi(a \mid s'_{h,k})\, da - r_h(s_{h,k}, a_{h,k}) - P^\pi_h \widehat Q^\pi_{h+1}(s_{h,k}, a_{h,k}) \,\Big|\, \mathcal D_{h+1}, \dots, \mathcal D_H \Big] \\
&= \mathbb E\bigg[ r_{h,k} - r_h(s_{h,k}, a_{h,k}) + \int_{\mathcal A} \widehat Q^\pi_{h+1}(s'_{h,k}, a)\, \pi(a \mid s'_{h,k})\, da - \mathbb E_{s' \sim P_h(\cdot \mid s_{h,k}, a_{h,k})}\Big[ \int_{\mathcal A} \widehat Q^\pi_{h+1}(s', a)\, \pi(a \mid s')\, da \,\Big|\, s_{h,k}, a_{h,k}, \mathcal D_{h+1}, \dots, \mathcal D_H \Big] \,\bigg|\, \mathcal D_{h+1}, \dots, \mathcal D_H \bigg] \\
&= 0 + 0 = 0.
\end{aligned}
$$
Hence, Lemma 13 proves, for step $h$ in our algorithm,
$$\mathbb E\Big[ \int_{\mathcal X} \big(\widehat T^\pi_h \widehat Q^\pi_{h+1} - T^\pi_h \widehat Q^\pi_{h+1}\big)^2(s,a)\, dq^{\pi_0}_h(s,a) \,\Big|\, \mathcal D_{h+1}, \dots, \mathcal D_H \Big] \le c\,(H^2 + 4H^2)\Big( K^{-\frac{2\alpha}{2\alpha+d}} + \frac{D}{K} \Big) \log^3 K.$$
Note that this upper bound holds for any $\widehat Q^\pi_{h+1}$ or $\mathcal D_{h+1}, \dots, \mathcal D_H$. The sole purpose of our conditioning is that we can view $\widehat Q^\pi_{h+1}$ as a measurable, deterministic function under the conditioning and then apply Lemma 13. Therefore,
$$\mathbb E\Big[ \mathbb E\Big[ \int_{\mathcal X} \big(\widehat T^\pi_h \widehat Q^\pi_{h+1} - T^\pi_h \widehat Q^\pi_{h+1}\big)^2(s,a)\, dq^{\pi_0}_h(s,a) \,\Big|\, \mathcal D_{h+1}, \dots, \mathcal D_H \Big] \Big] \le c\,(H^2 + 4H^2)\Big( K^{-\frac{2\alpha}{2\alpha+d}} + \frac{D}{K} \Big) \log^3 K.$$
Finally, we carry out the recursion from time step 1 to time step $H$, and the final result is
$$\mathbb E\,|\widehat v^\pi - v^\pi| \le C H^2 \Big( K^{-\frac{\alpha}{2\alpha+d}} + \sqrt{\frac{D}{K}} \Big) \log^{3/2} K \cdot \frac1H \sum_{h=1}^H \sqrt{\chi^2_{\mathcal Q}\big(q^\pi_h, q^{\pi_0}_h\big) + 1}.$$

E.3 LEMMA 13 AND ITS PROOF

Lemma 13. Let $\mathcal X$ be a $d$-dimensional compact Riemannian manifold isometrically embedded in $\mathbb R^D$ with reach $\omega$. There exists a constant $B > 0$ such that for any $x \in \mathcal X$, $|x_j| \le B$ for all $j = 1, \dots, D$. We are given a function $f_0 \in B^\alpha_{p,q}(\mathcal X)$ and samples $S_n = \{(x_i, y_i)\}_{i=1}^n$, where the $x_i$ are i.i.d. samples from a distribution $P_x$ on $\mathcal X$ and $y_i = f_0(x_i) + \zeta_i$. The $\zeta_i$'s are i.i.d. sub-Gaussian random noise with variance $\sigma^2$, uncorrelated with the $x_i$'s.
If we compute an estimator
$$\widehat f_n = \arg\min_{f \in \mathcal F} \frac1n \sum_{i=1}^n (f(x_i) - y_i)^2,$$
with the neural network class $\mathcal F = \mathcal F(L, p, I, \tau, V)$ such that
$$L = O(\log n), \quad p = O\big(n^{\frac{d}{2\alpha+d}}\big), \quad I = O\big(n^{\frac{d}{2\alpha+d}} \log n\big), \quad \tau = \max\{B, V_{\mathcal F}, \sqrt d, \omega^2\}, \quad V = V_{\mathcal F},$$
then we have
$$\mathbb E \int_{\mathcal X} \big(\widehat f_n(x) - f_0(x)\big)^2 dP_x(x) \le c\big(V_{\mathcal F}^2 + \sigma^2\big)\Big( n^{-\frac{2\alpha}{2\alpha+d}} + \frac{D}{n} \Big) \log^3 n,$$
where $V_{\mathcal F} = \|f_0\|_\infty$, the expectation is taken over the training sample $S_n$, and $c$ is a constant depending on $\log D, \alpha, B, d, \omega$, the surface area of $\mathcal X$ and $c_0$.

Proof of Lemma 13. Recall the bias–variance decomposition
$$
\mathbb E \int_{\mathcal X} \big(\widehat f_n(x) - f_0(x)\big)^2 dP_x(x)
= \underbrace{\mathbb E\, \frac2n \sum_{i=1}^n \big(\widehat f_n(x_i) - f_0(x_i)\big)^2}_{T_1}
+ \underbrace{\mathbb E \int_{\mathcal X} \big(\widehat f_n(x) - f_0(x)\big)^2 dP_x(x) - \mathbb E\, \frac2n \sum_{i=1}^n \big(\widehat f_n(x_i) - f_0(x_i)\big)^2}_{T_2}.
$$
We need there to exist a network in $\mathcal F(L, p, I, \tau, V)$ which can yield a function $f$ satisfying $\|f - f_0\|_\infty \le \epsilon$ for $\epsilon \in (0, 1)$; $\epsilon$ will be chosen later to balance the bias–variance tradeoff. By Lemma 2 of Nguyen-Tang et al. [51], in order to achieve such $\epsilon$-error, it suffices to take $L = O(\log\frac1\epsilon)$, $p = O(\epsilon^{-d/\alpha})$, $I = O(\epsilon^{-d/\alpha} \log\frac1\epsilon)$, $\tau = \max\{B, V_{\mathcal F}, \sqrt d, \omega^2\}$ and $V = V_{\mathcal F}$. Applying the upper bounds on $T_1$ and $T_2$ from Lemmas 11 and 12 (with $\mathcal F(L, p, I, \tau, V)$ in place of the CNN class) together with the covering-number bound $\log N(\delta, \mathcal F(L,p,I,\tau,V), \|\cdot\|_\infty) \le c'' \epsilon^{-d/\alpha} \log^3\frac1\epsilon \log\frac1\delta$ from Lemma 7 of Chen et al. [7], we obtain
$$\mathbb E \int_{\mathcal X} \big(\widehat f_n(x) - f_0(x)\big)^2 dP_x(x) = O\Bigg( \epsilon^2 + \frac{V_{\mathcal F}^2 + \sigma^2}{n}\, \epsilon^{-\frac d\alpha} \log^3\tfrac1\epsilon \log\tfrac1\delta + \sigma\delta \sqrt{\frac{\epsilon^{-\frac d\alpha} \log^3\tfrac1\epsilon \log\tfrac1\delta}{n}} + \sigma\delta + \frac{\sigma^2}{n} \Bigg).$$
Finally, we choose $\epsilon$ to satisfy $\epsilon^2 = \frac1n \epsilon^{-\frac d\alpha}$, which gives $\epsilon = n^{-\frac{\alpha}{2\alpha+d}}$. It suffices to pick $\delta = \frac1n$. Substituting both $\epsilon$ and $\delta$ into (48), we deduce the desired estimation error bound
$$\mathbb E \int_{\mathcal X} \big(\widehat f_n(x) - f_0(x)\big)^2 dP_x(x) \le c\big(V_{\mathcal F}^2 + \sigma^2\big)\Big( n^{-\frac{2\alpha}{2\alpha+d}} + \frac{D}{n} \Big) \log^3 n,$$
where the constant $c$ depends on $\log D, d, \alpha, \frac{2d}{\alpha p - d}, p, q, c_0, B, \omega$ and the surface area of $\mathcal X$.
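The estimator in Lemma 13 is an empirical MSE minimizer over a neural network class. The sketch below shows the form of the regression problem being analyzed: a one-hidden-layer ReLU network trained by full-batch gradient descent on synthetic data. The ground truth $f_0(x) = |x|$, the noise level 0.1, and all training hyperparameters are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data y_i = f0(x_i) + zeta_i (illustrative choices)
n, d = 200, 1
X = rng.uniform(-1.0, 1.0, size=(n, d))
f0 = np.abs(X[:, 0])
y = f0 + 0.1 * rng.standard_normal(n)

# Empirical MSE minimization over a one-hidden-layer ReLU network,
# trained by full-batch gradient descent with manual backpropagation.
p = 32
W1 = 0.5 * rng.standard_normal((p, d)); b1 = np.zeros(p)
W2 = 0.1 * rng.standard_normal(p); b2 = 0.0
lr = 0.1
for _ in range(5000):
    H = np.maximum(X @ W1.T + b1, 0.0)   # hidden ReLU activations, shape (n, p)
    pred = H @ W2 + b2
    g = 2.0 / n * (pred - y)             # gradient of the MSE w.r.t. predictions
    gW2 = H.T @ g; gb2 = g.sum()
    gH = np.outer(g, W2) * (H > 0)       # backpropagate through the ReLU
    gW1 = gH.T @ X; gb1 = gH.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

# Estimation error against the noiseless ground truth (typically small here,
# since |x| is exactly representable by a two-unit ReLU network)
mse = float(np.mean((np.maximum(X @ W1.T + b1, 0.0) @ W2 + b2 - f0) ** 2))
```

In the lemma, the analysis controls exactly this kind of error, $\mathbb E \int (\widehat f_n - f_0)^2\, dP_x$, as a function of the sample size $n$ and the network architecture.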

F SUPPLEMENT FOR EXPERIMENTS

F.1 DETAILS FOR EXPERIMENTS WITH CARTPOLE

We use the CartPole environment from OpenAI Gym. We treat it as a time-inhomogeneous finite-horizon MDP by setting a time limit of 100 steps. We turn the terminal states of the original CartPole into absorbing states, so that if a trajectory terminates before 100 steps, the agent keeps receiving zero reward in its terminal state until the end. The target policy is trained for 200 iterations using REINFORCE, where each iteration samples 100 trajectories truncated after 150 time steps. The target policy value v^π is estimated to be 65.2117, which we obtain by Monte Carlo rollout from the initial state distribution. For a given behavior policy, to obtain the dataset D_h at time step h, we sample K independent episodes under the behavior policy and take only the (s, a, s′, r) tuple from the h-th transition of each episode. This is a conservative way to guarantee independence among these K samples; in practice, one could sample directly from a sampling distribution. We collect D_h for each h = 1, ..., 100. We use the render function in OpenAI Gym for the visual display of CartPole, and downsample images to the desired resolution via cubic interpolation. A high-resolution image (see Figure 3) is represented as a 3 × 40 × 150 RGB array; a low-resolution image (see Figure 4) is represented as a 3 × 20 × 75 RGB array. For the function approximator in FQE, we use a neural network comprising three convolutional layers with output channel sizes 16, 32 and 32, followed by a final linear layer. These layers are interleaved with ReLU activations and batch normalization layers. For high-resolution input, we use kernel size 5 and stride 2; for low-resolution input, we use kernel size 3 and stride 1. For the experiments with high resolution, in each step of FQE we solve the regression by training the network via stochastic gradient descent with batch size 256 for 20 epochs.
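For the high-resolution input, the feature-map size after each convolutional layer can be worked out with the standard formula ⌊(size + 2·padding − kernel)/stride⌋ + 1. The sketch below assumes zero padding, which the text does not state.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length along one spatial dimension of a convolution:
    floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# High-resolution CartPole frames are 3 x 40 x 150; three conv layers with
# output channels 16, 32, 32, kernel size 5, stride 2 (padding 0 is an assumption).
h, w = 40, 150
shapes = []
for channels in (16, 32, 32):
    h, w = conv_out(h, 5, 2), conv_out(w, 5, 2)
    shapes.append((channels, h, w))
# shapes -> [(16, 18, 73), (32, 7, 35), (32, 2, 16)]
```

Under this padding assumption, the final linear layer would see a flattened feature vector of 32 × 2 × 16 = 1024 entries.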
In the high-resolution experiments, we use a learning rate of 0.01; in the low-resolution experiments, we use 0.001. We report the average and standard deviation of FQE's results over 5 random seeds.

F.2 DETAILS FOR EXPERIMENTS WITH LUNARLANDER

We take the visual display of the environment as states. These images serve as a high-dimensional representation of LunarLander's original 8-dimensional continuous state space. In our algorithm, we use a deep CNN to approximate the Q-functions and solve the regression with SGD. The results are reported in Table 3. The experimental setup is almost the same as in our CartPole experiments. We conduct this experiment in two cases: (A) data are generated from the target policy itself; (B) data are generated from a mixture of 0.9 target policy and 0.1 uniform distribution. A high-resolution image (see Figure 5) is represented as a 3 × 40 × 70 RGB array; a low-resolution image (see Figure 6) is represented as a 3 × 20 × 35 RGB array.
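The FQE procedure used in these experiments can be sketched as follows. For brevity, the sketch uses a tiny tabular MDP in which the per-step regression reduces to a conditional average, standing in for the CNN regression described above; the environment, policy, and data are made up for illustration.

```python
import numpy as np

def fitted_q_evaluation(datasets, pi, H, n_states, n_actions):
    """Backward-recursive FQE: at each step h, regress the target
    y = r + E_{a' ~ pi(.|s')} Q_{h+1}(s', a') onto (s, a).
    datasets[h] is a list of (s, a, s_next, r) transitions collected at step h;
    pi[s] is the target policy's action distribution at state s."""
    Q_next = np.zeros((n_states, n_actions))      # Q_{H+1} := 0
    for h in reversed(range(H)):
        sums = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        for s, a, s_next, r in datasets[h]:
            y = r + pi[s_next] @ Q_next[s_next]   # regression target at step h
            sums[s, a] += y
            counts[s, a] += 1
        Q_next = sums / np.maximum(counts, 1)     # tabular least-squares fit
    return Q_next                                 # estimate of Q^pi_1

# Toy check: one state, one action, reward 1 per step, horizon 3 -> Q_1 = 3.
H = 3
datasets = {h: [(0, 0, 0, 1.0)] for h in range(H)}
pi = np.array([[1.0]])
Q1 = fitted_q_evaluation(datasets, pi, H, n_states=1, n_actions=1)
```

The policy value estimate is then obtained by averaging Q_1 over the initial state distribution and the target policy, just as in the image-based experiments where the tabular fit is replaced by CNN regression.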

F.3 ERROR DECAY RATE IN CARTPOLE

We plot the relative error of FQE in our CartPole experiments from Section 5. Figures 7 and 8 show that the estimation error decays polynomially in the sample size. We also show this relative error in log-log plots (Figures 9 and 10). The curves are close to linear, which confirms the power-law form of our theoretical bound in Theorem 2. Moreover, the slope, which represents the decay rate of the estimation error, is essentially the same between the high-resolution and low-resolution experiments. This confirms our theory that the decay rate is largely unaffected by the ambient dimension.
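The slope reading can be made precise with a least-squares fit in log-log space. The sketch below recovers the decay exponent from a synthetic power-law error curve; the rate 0.4, the constant, and the sample sizes are illustrative values, not measurements from the experiments.

```python
import numpy as np

# Synthetic relative-error curve err(K) = c * K^{-rate}; in the experiments,
# err would be the measured relative error of FQE at each sample size K.
rate = 0.4                                    # illustrative decay exponent
K = np.array([1000.0, 2000.0, 5000.0, 10000.0, 20000.0])
err = 3.0 * K ** (-rate)

# In a log-log plot, log err = log c - rate * log K, so the fitted slope
# recovers -rate regardless of the constant c.
slope, intercept = np.polyfit(np.log(K), np.log(err), 1)
```

Comparing the fitted slopes of the high-resolution and low-resolution curves is exactly the check described above: matching slopes indicate the same decay rate despite different ambient dimensions.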



An atlas of M consists of an open cover of M together with a mapping from each open set in the cover to R^d. Definition (Smooth manifold). A manifold M is smooth if it has a C^∞ atlas.

Figure 2: Convolution of W * Z. W_{j,:,:} is an I × C matrix for the j-th output channel.

in the form of (3) with L layers. The number of filters per block is bounded by L; filter size is bounded by I; the number of channels is bounded by J; max m,l

Let η be the approximation error of the multiplication operator ×(•, •) as defined in Step 3 of Appendix B.1 and Proposition 1, δ be defined as in Step 3 of Appendix B.1 and Proposition 2, ∆ and θ be defined as in Step 3 of Appendix B.1 and Proposition 3. Assume N is chosen according to Proposition 2. For any i = 1, ..., C X , we have f

$$(\widetilde W^{(l)}_m)_{\,\cdot,:,:} = [\,0 \;\; 0 \;\; 0 \;\; e_1 \;\; \cdots \;\; 0\,] \quad \text{for } l = 2, \dots, L-1,$$
$$(\widetilde W^{(l)}_m)_{4+j,:,:} = [\,0 \;\; 0 \;\; 0 \;\; 0 \;\; (W^{(l)}_m)_{j,:,:}\,] \quad \text{for } l = 2, \dots, L-1,$$
$$(\widetilde B^{(l)}_m)_{j,:,:} = [\,0 \;\; 0 \;\; 0 \;\; 0 \;\; (B^{(l)}_m)_{j,:,:}\,] \quad \text{for } l = 1, \dots, L-1.$$

where (b) is by Proposition 7 of Oono & Suzuki [54], (c) is by Propositions 2 and 4 of Oono & Suzuki [54], (d) is by Propositions 2 and 5 of Oono & Suzuki [54], and (e) is obtained by invoking Proposition 4.

C.2 PROOF OF LEMMA 8

in which $W_1, \dots, W_L$ and $b_1, \dots, b_L$ are weight matrices and bias vectors, and $\mathrm{ReLU}(\cdot)$ is the entrywise rectified linear unit, i.e., $\mathrm{ReLU}(a) = \max\{0, a\}$. The width of a neural network is defined as the number of neurons in its widest layer. For notational simplicity, we define a class of neural networks $\mathcal F(L, p, I, \tau, V) = \{f \mid f(x)$ in the form (42) with $L$ layers and width at most $p$,

Applying the upper bounds on $T_1$ and $T_2$ from Lemmas 11 and 12 respectively, we can derive
$$\mathbb E \int_{\mathcal X} \big(\widehat f_n(x) - f_0(x)\big)^2 dP_x(x) \le 4 \inf_{f \in \mathcal F(L,p,I,\tau,V)} \int_{\mathcal X} (f(x) - f_0(x))^2 dP_x(x) + \frac{48\sigma^2\big(\log N(\delta, \mathcal F(L,p,I,\tau,V), \|\cdot\|_\infty) + 2\big)}{n} + 8\sqrt{\frac{6\big(\log N(\delta, \mathcal F(L,p,I,\tau,V), \|\cdot\|_\infty) + 2\big)}{n}}\,\sigma\delta + \frac{104 V^2}{3n}\log N\big(\delta/(4V), \mathcal F(L,p,I,\tau,V), \|\cdot\|_\infty\big) + \cdots$$

$$\tau = \max\{B, V_{\mathcal F}, \sqrt d, \omega^2\}, \quad V = V_{\mathcal F},$$
where $O(\cdot)$ hides factors of $\log D$, $\alpha$, $d$ and the surface area of $\mathcal X$; we now have our network architecture as specified in Theorem 1 in terms of $\epsilon$. Then, we can use the architecture parameters in (13) to invoke the upper bound on the covering number from Lemma 7 of Chen et al. [7]:
$$\log N\big(\delta, \mathcal F(L,p,I,\tau,V), \|\cdot\|_\infty\big) = \log \Big( \frac{2L^2 (pB + 2)\, \tau^L p^{L+1}}{\delta} \Big)^{I} \le c'' \epsilon^{-\frac d\alpha} \log^3 \tfrac1\epsilon \log\tfrac1\delta,$$
where $c''$ is a constant depending on $\log B$, $\omega$ and $\log\log n$. Plugging it in, we have
$$\mathbb E \int_{\mathcal X} \big(\widehat f_n(x) - f_0(x)\big)^2 dP_x(x) \le 4\epsilon^2 + \frac{48\sigma^2}{n}\Big( c'' \epsilon^{-d/\alpha} \log^3 \tfrac1\epsilon \log\tfrac1\delta + 2 \Big) + \cdots$$

Figure 3: CartPole in high resolution. Figure 4: CartPole in low resolution.


K        …  ± 3.8    55.5 ± 3.6   55.5 ± 5.5   55.0 ± 5.9
10000   55.1 ± 2.7   55.7 ± 2.9   55.8 ± 4.0   55.4 ± 3.9
20000   55.4 ± 1.9   55.6 ± 2.0   56.3 ± 3.1   56.0 ± 3.5

Figure 5: LunarLander in high resolution. Figure 6: LunarLander in low resolution.


Figure 7: On-policy CartPole.

Figure 8: Off-policy CartPole.

Figure 9: On-policy CartPole (log-log plot).

Figure 10: Off-policy CartPole (log-log plot).

Value estimation v π under high resolution and low resolution. The true v π ≈ 65.2 is computed via Monte Carlo rollout.


the composition of $\widehat d^2_i$ and $\widehat 1_\Delta$. By Proposition 3, $\widehat d^2_i$ is a single-block CNN with $O(\log\frac1\theta) = O(\frac\alpha d \log N + D + \log D)$ layers and width $6D$; $\widehat 1_\Delta$ is a single-block CNN with $O(\log(\beta^2/\Delta)) = O(\frac\alpha d \log N)$ layers and width 2. In both subnetworks, all parameters are of $O(1)$. By Lemma 6, the chart determination network $\widehat 1_i$ is a single-block CNN with $O(\frac\alpha d \log N + D + \log D)$ layers, width $6D + 2$, and all weight parameters of $O(1)$. The projection $\phi_i$ is linear, so it can be expressed as a single-layer perceptron. By Lemma 8 in Liu et al. [42], this single-layer perceptron can be expressed as a single-block CNN with $2 + D$ layers and width $d$. All parameters are of $O(1)$.

In order to invoke Lemma 13 under the conditioning on $\mathcal D_{h+1}, \dots, \mathcal D_H$, we need to verify that three conditions are satisfied (conditioning on $\mathcal D_{h+1}, \dots, \mathcal D_H$):
1. The samples $\{(s_{h,k}, a_{h,k})\}_{k=1}^K$ are i.i.d.;
2. The samples $\{(s_{h,k}, a_{h,k})\}_{k=1}^K$ and the noise $\{\zeta_{h,k}\}_{k=1}^K$ are uncorrelated;
3. The noise $\{\zeta_{h,k}\}_{k=1}^K$ are independent, zero-mean, sub-Gaussian random variables.
In our setting, $\{(s_{h,k}, a_{h,k})\}_{k=1}^K$ are i.i.d. from $q^{\pi_0}_h$. Due to the time-inhomogeneous setting, they are independent of $\mathcal D_{h+1}, \dots, \mathcal D_H$, so $\{(s_{h,k}, a_{h,k})\}_{k=1}^K$ are still i.i.d. under our conditioning. Thus, Condition 1 is clearly satisfied. We may observe that under our conditioning, the transition from $(s_{h,k}, a_{h,k})$ to $s'_{h,k}$ is the only source of randomness in $\zeta_{h,k}$, besides $(s_{h,k}, a_{h,k})$ itself. The distribution of $(s_{h,k}, a_{h,k}, s'_{h,k})$ is the product distribution between $P_h(\cdot \mid s_{h,k}, a_{h,k})$ and $q^{\pi_0}_h$, so a function of $s'_{h,k}$, generated from the transition distribution $P_h(\cdot \mid s_{h,k}, a_{h,k})$, is uncorrelated with $(s_{h,k}, a_{h,k})$. Thus, the $(s_{h,k}, a_{h,k})$'s are uncorrelated with the $\zeta_{h,k}$'s under our conditioning, and Condition 2 is satisfied. Condition 3 can also be easily verified. Under our conditioning, the randomness in $\zeta_{h,k}$ only comes from

Value estimation v π under high resolution and low resolution. The true v π ≈ 55.982 is computed via Monte Carlo rollout.


ACKNOWLEDGEMENT Mengdi Wang acknowledges the support by NSF grants DMS-1953686, IIS-2107304, CMMI-1653435, ONR grant 1006977, and C3.AI. Tuo Zhao acknowledges the support by DMS-2012652.


As a result, for any $x \in \mathcal X$, $\widehat 1_\Delta \circ \widehat d^2_i(x)$ gives an approximation of $1_{U_i}$ satisfying: it equals 1 if $x \in U_i$, and lies between 0 and 1 otherwise.

B.4.3 LEMMAS ABOUT SUMMATION AND COMPOSITION OF CNNS

Lemma 6 states that the composition of two single-block CNNs can be expressed as one single-block CNN with an augmented architecture; furthermore, the weight matrix in the fully connected layer of the resulting $\mathcal F_{\mathrm{SCNN}}(L, J, I, \tau, \tau)$ network has nonzero entries only in the first row. Lemma 7 states that the sum of $n_0$ single-block CNNs with the same architecture can be expressed as the sum of $n$ single-block CNNs with modified width. Lemma 7 (Lemma 7 in Liu et al. [43]). Let $\{f_i\}_{i=1}^{n_0}$ be a set of single-block CNNs with architecture $\mathcal F_{\mathrm{SCNN}}(L_0, J_0, I_0, \tau_0, \tau_0)$. For any integers $1 \le n \le n_0$ and $J$ satisfying $nJ = O(n_0 J_0)$ and $J \ge J_0$, there exists an architecture $\mathcal F_{\mathrm{SCNN}}(L, J, I, \tau, \tau)$ that gives a set of $n$ single-block CNNs $\{\bar f_i\}_{i=1}^{n}$ with $\sum_{i=1}^{n} \bar f_i = \sum_{i=1}^{n_0} f_i$. Furthermore, the fully connected layer of each such $\bar f_i$ has nonzero elements only in the first row.

C PROOF OF CNN CLASS COVERING NUMBER

In this section, we prove a bound on the covering number of the convolutional neural network class used in Algorithm 1.

Lemma 8. Given $\delta > 0$, the $\delta$-covering number of the neural network class $\mathcal F(M, L, J, I, \tau_1, \tau_2, V)$ satisfies
$$\log N\big(\delta, \mathcal F(M,L,J,I,\tau_1,\tau_2,V), \|\cdot\|_\infty\big) = O\big(\Lambda_2 \log\big((\tau_1 \vee \tau_2)\,\Lambda_1\,\delta^{-1}\big)\big),$$
where $\Lambda_1$ and $\Lambda_2$ are architecture-dependent quantities given in the proof. With a network architecture as stated in Theorem 1, we have
$$\log N\big(\delta, \mathcal F(M,L,J,I,\tau_1,\tau_2,V), \|\cdot\|_\infty\big) = O\Big( MJ\, D^3 \log^5(MJ) \log\tfrac1\delta \Big),$$
where $O(\cdot)$ hides constants depending on $d, \alpha, \frac{2d}{\alpha p - d}, p, q, c_0, B, \omega$ and the surface area of $\mathcal X$.

D STATISTICAL RESULT OF CNN-BESOV APPROXIMATION (LEMMA 10)

In this section, we derive the statistical estimation error of using a CNN empirical MSE minimizer to estimate a Besov ground-truth function from an i.i.d. dataset. We need to choose the CNN architecture and size appropriately in order to balance the approximation error from Theorem 1 against the variance. This statistical estimation error can be decomposed into the error of using a CNN to approximate a Besov function (Theorem 1), terms that grow with the covering number of our CNN class, and the error of using the discrete covering to approximate our CNN class. In Theorem 2, we expand the estimation error $\widehat v^\pi - v^\pi$ over time steps and upper-bound the estimation error in each time step with Lemma 10. Details of Theorem 2 are in Appendix A.

Lemma 10. Let $\mathcal X$ be a $d$-dimensional compact Riemannian manifold that satisfies Assumption 1. We are given a function $f_0 \in B^\alpha_{p,q}(\mathcal X)$, where $\alpha, p, q$ satisfy Assumption 2. We are also given samples $S_n = \{(x_i, y_i)\}_{i=1}^n$, where the $x_i$ are i.i.d. samples from a distribution $P_x$ on $\mathcal X$ and $y_i = f_0(x_i) + \zeta_i$; the $\zeta_i$'s are i.i.d. sub-Gaussian random noise with variance $\sigma^2$, uncorrelated with the $x_i$'s. If we compute an estimator
$$\widehat f_n = \arg\min_{f \in \mathcal F(M,L,J,I,\tau_1,\tau_2,V)} \frac1n \sum_{i=1}^n (f(x_i) - y_i)^2,$$
with the neural network class $\mathcal F(M, L, J, I, \tau_1, \tau_2, V)$ with any integer $I \in [2, D]$ and $M, J > 0$ satisfying $MJ = O(n)$, then we have
$$\mathbb E \int_{\mathcal X} \big(\widehat f_n(x) - f_0(x)\big)^2 dP_x(x) \le c\big(V_{\mathcal F}^2 + \sigma^2\big)\, n^{-\frac{2\alpha}{2\alpha+d}} \log^5 n,$$
where $V_{\mathcal F} = \|f_0\|_\infty$, the expectation is taken over the training sample $S_n$, and $c$ is a constant depending on $D^{\frac{6\alpha}{2\alpha+d}}, d, \alpha, \frac{2d}{\alpha p - d}, p, q, c_0, B, \omega$ and the surface area of $\mathcal X$.
$O(\cdot)$ hides constants depending on $d, \alpha, \frac{2d}{\alpha p - d}, p, q, c_0, B, \omega$ and the surface area of $\mathcal X$. First, note that the nonparametric regression error can be decomposed into two terms, $T_1$ and $T_2$, where $T_1$ reflects the squared bias of using neural networks to approximate the ground truth $f_0$, which is related to Theorem 1, and $T_2$ is the variance term.

D.1 SUPPORTING LEMMAS

Lemma 11 (Lemma 5 in Chen et al. [7]). Fix the neural network class $\mathcal F(M, L, J, I, \tau_1, \tau_2, V)$. For any constant $\delta \in (0, 2V)$, we have
$$T_1 \le 4 \inf_{f \in \mathcal F(M,L,J,I,\tau_1,\tau_2,V)} \int_{\mathcal X} (f(x) - f_0(x))^2\, dP_x(x) + \frac{48\sigma^2 \big(\log N(\delta, \mathcal F(M,L,J,I,\tau_1,\tau_2,V), \|\cdot\|_\infty) + 2\big)}{n} + \Bigg( 8\sqrt{\frac{6\big(\log N(\delta, \mathcal F(M,L,J,I,\tau_1,\tau_2,V), \|\cdot\|_\infty) + 2\big)}{n}} + 8 \Bigg) \sigma\delta.$$

