OPTIMISTIC EXPLORATION WITH LEARNED FEATURES PROVABLY SOLVES MARKOV DECISION PROCESSES WITH NEURAL DYNAMICS

Abstract

Incorporating recent advances in deep learning, deep reinforcement learning (DRL) has achieved tremendous empirical success. However, analyzing DRL remains challenging due to the complexity of the neural network class. In this paper, we address this challenge by analyzing Markov decision processes (MDPs) with neural dynamics, which cover several existing models as special cases, including the kernelized nonlinear regulator (KNR) model and the linear MDP. We propose a novel algorithm that designs exploration incentives via learnable representations of the dynamics model by embedding the neural dynamics into a kernel space induced by the system noise. We further establish an upper bound on the sample complexity of the algorithm, which demonstrates its sample efficiency. We highlight that, unlike previous analyses of RL algorithms with function approximation, our sample complexity bound does not depend on the Eluder dimension of the neural network class, which is known to be exponentially large (Dong et al., 2021).

1. INTRODUCTION

Reinforcement learning (RL) aims to accomplish sequential decision-making in an uncertain environment by iteratively interacting with the environment (see Sutton et al. (1998)). Equipped with modern function approximators such as deep neural networks, deep RL algorithms achieve tremendous empirical successes (Mnih et al., 2015; Silver et al., 2017; Hafner et al., 2019). Despite these successes, the theoretical understanding of deep RL is relatively underdeveloped. Several recent works (Abbasi-Yadkori et al., 2019; Wang et al., 2019; Fan et al., 2020) analyze RL algorithms with neural network parameterization, including policy iteration (PI) (Lagoudakis & Parr, 2003), policy gradient (PG) (Williams, 1992), and deep Q-learning (Mnih et al., 2013). However, those works depend on restrictive assumptions: either the agent has access to a simulator, or the MDPs have bounded concentrability coefficients, which in fact implies that the state space is already well explored. Another line of research (Jiang et al., 2017; Jin et al., 2020; Cai et al., 2019; Du et al., 2021) removes such assumptions by conducting provably efficient exploration in RL. Such a direction of research typically hinges on a low-rank MDP assumption: those works either assume that the MDP is linear in a known feature map or propose computationally inefficient algorithms, limiting the ability to explore the environment with neural network parameterization. To explore the environment with neural network parameterization, a recent line of work (Wang et al., 2020; Jin et al., 2021a) analyzes general function approximators in RL, covering neural network parameterization as a special case. Such analyses typically depend on the Eluder dimension (Russo & Van Roy, 2013), which unfortunately can be exponentially large even for a simple neural network class (Dong et al., 2021) and thus makes the results statistically inefficient for neural network parameterization.
Therefore, we raise the following question: Can we design RL algorithms that conduct provably efficient exploration in structured environments with neural network parameterization? Specifically, our goal is to develop computationally efficient algorithms whose sample efficiency does not depend on the Eluder dimension of neural networks for structured environments with neural network parameterization. Our key insight is that, when the transition dynamics are captured by an energy-based model, we can leverage the spectral decomposition of the induced kernel so that the challenge of distribution shift is characterized by the effective dimension of the kernel. To illustrate this insight, we propose a new model, called MDPs with neural dynamics, which allows neural network parameterization and captures various MDP models proposed in previous works, including the KNR model (Kakade et al., 2020) and the linear MDP model (Jin et al., 2020). We then propose an algorithm, namely, Exploration with Learnable Neural Features (ELNF), and show that ELNF is sample efficient. ELNF iteratively fits the transition dynamics and reward functions with neural networks. Upon fitting the models, ELNF conducts exploration based on upper confidence bounds (UCB) (Abbasi-Yadkori et al., 2011), which are obtained from the feature maps that correspond to the fitted model. We remark that the bonus in ELNF can be efficiently computed.

Contributions. Our contribution is threefold. First, we identify a class of models that incorporates NN feature representations and captures nonlinearity in the transition dynamics beyond the KNR and linear MDP models. We also show that our proposed setting generalizes the models in previous works (Kakade et al. (2020); Ren et al. (2021)). Second, we propose a new algorithm, namely ELNF, which tackles our proposed MDPs with neural dynamics. Our algorithm is computationally efficient when we have an optimization oracle for the model estimation.
Third, we analyze the sample complexity of ELNF and show that ELNF is sample efficient. A key feature of ELNF is that its sample complexity depends only on the covering number of the neural network classes and does not depend on the corresponding Eluder dimension. We highlight that our work is the first to cover arbitrary NN classes with bounded log-covering numbers. In contrast, previous research typically depends on the Eluder dimension (Russo & Van Roy, 2013) of the hypothesis class, which is exponentially large for simple neural network classes (Dong et al., 2021).

1.1. RELATED WORK

Our work is closely related to the line of research on provably efficient exploration in the function approximation setting (Jiang et al., 2017; Jin et al., 2020; Cai et al., 2019; Du et al., 2021; Uehara et al., 2021; Zhang et al., 2022a). Such a line of research typically hinges on MDPs with a low-rank structure. For instance, the study of linear MDPs (Jin et al., 2020; Cai et al., 2019) requires that the transition dynamics be linear in a known feature map. In contrast, the feature maps are unknown in our setting and need to be estimated. The study of low-rank MDPs (Jiang et al., 2017; Du et al., 2021; Uehara et al., 2021; Ren et al., 2022) is more closely aligned with our work in the sense that the feature map is unknown and needs to be estimated. Jiang et al. (2017) and Du et al. (2021) require optimistic planning over the confidence set of transition dynamics, which is computationally inefficient. Uehara et al. (2021) and Ren et al. (2022) propose algorithms for low-rank MDPs that are both computationally efficient and sample efficient. Nevertheless, they only consider finite hypothesis classes and require sampling from the stationary distribution of the MDP. Our work is also related to the study of provably efficient exploration with general function approximation (Wang et al., 2020; Jin et al., 2021a). Nevertheless, previous results typically depend on the Eluder dimension (Russo & Van Roy, 2013) of the hypothesis class, which is exponentially large for simple neural network classes (Dong et al., 2021). Yang et al. (2020) achieve sample-efficient exploration using overparameterized neural networks (Simsek et al., 2021) as the function approximator. However, their analysis hinges on the neural tangent kernel (NTK) and cannot handle NNs beyond the NTK regime. In contrast, our analysis adapts to generic neural network classes.
Our work is also related to the analysis of model-based RL (Osband & Van Roy, 2014; Ayoub et al., 2020; Kakade et al., 2020) and representation learning (Ren et al., 2021; Nachum & Yang, 2021; Zhang et al., 2022b). The definition of our MDPs with neural dynamics generalizes those in Kakade et al. (2020) and Ren et al. (2021). In contrast to the KNR model in Kakade et al. (2020), we can handle infinite neural network hypothesis classes and do not require the nonlinear feature map to be known. Ren et al. (2021) require sampling from the posterior distribution over the hypothesis class, which is computationally inefficient when the hypothesis class is large. In addition, the sample complexity bound of Ren et al. (2021) depends on the Eluder dimension of the feature map class, which is exponentially large for simple neural network classes (Dong et al., 2021). In contrast, our sample complexity bound depends on the neural network classes only through their capacity. Our work is also motivated by the complexity analysis of neural network classes. Dong et al. (2021) show that the Eluder dimension of one-layer neural network classes is exponentially large, suggesting that previous analyses of RL algorithms based on the Eluder dimension (Russo & Van Roy, 2013) may not be applicable to neural networks.

1.2. NOTATION

For a vector $v \in \mathbb{R}^d$, we define $\|v\|_2 = (\sum_{i=1}^d v_i^2)^{1/2}$, where $v_i$ is the $i$-th entry of $v$. For a real-valued function $f : \mathcal{X} \to \mathbb{R}$, we define $\|f\|_\infty = \max_{x \in \mathcal{X}} |f(x)|$. For a vector-valued function $f : \mathcal{X} \to \mathbb{R}^d$, we define $\|f\|_{\infty,2} = \max_{x \in \mathcal{X}} \|f(x)\|_2$. For a sequence of real-valued functions $r = \{r_h\}_{h=1}^H$ with $r_h : \mathcal{X} \to \mathbb{R}$, we define $\|r\|_\infty = \sup_{h \in [H], x \in \mathcal{X}} |r_h(x)|$. We denote by $\mathcal{N}(\mathcal{F}, \epsilon, \|\cdot\|)$ the $\epsilon$-covering number of the function class $\mathcal{F}$ with respect to the norm $\|\cdot\|$, define $\mathcal{H}_\infty(\mathcal{F}, \epsilon) = \log \mathcal{N}(\mathcal{F}, \epsilon, \|\cdot\|_\infty)$ for a real-valued function class $\mathcal{F}$, and define $\mathcal{H}_2(\mathcal{F}, \epsilon) = \log \mathcal{N}(\mathcal{F}, \epsilon, \|\cdot\|_{\infty,2})$ for a vector-valued function class $\mathcal{F}$. We further define $[n] = \{1, \ldots, n\}$ for an integer $n$. For a set $\mathcal{C}$, we denote by $\Delta(\mathcal{C})$ the set of distributions over $\mathcal{C}$ and by $\mathcal{U}(\mathcal{C})$ the uniform distribution over $\mathcal{C}$. For $g : \mathcal{X} \to \mathbb{R}$ and $X_n = \{x_1, \ldots, x_n\} \subset \mathcal{X}$, we define $g[X_n] = (g(x_1), \ldots, g(x_n))^\top$.

2. PRELIMINARY

We consider an episodic MDP $(\mathcal{S}, \mathcal{A}, H, P^*, r^*)$ with a state space $\mathcal{S} \subseteq \mathbb{R}^d$, an action space $\mathcal{A}$, a horizon $H$, transition kernels $P^* = \{P^*_h\}_{h=1}^H$, and reward functions $r^* = \{r^*_h\}_{h=1}^H$. We assume that the reward functions are bounded and deterministic, that is, $\|r^*_h\|_\infty \le 1$ for all $h \in [H]$. We also assume that the action space is finite, that is, $|\mathcal{A}| < \infty$. The agent iteratively interacts with the environment as follows. At the beginning of each episode, the agent determines a policy $\pi = \{\pi_h\}_{h=1}^H$, where $\pi_h : \mathcal{S} \to \Delta(\mathcal{A})$ for any $h \in [H]$. Without loss of generality, we assume that the initial state is fixed to $s_{\mathrm{init}} \in \mathcal{S}$ across all episodes. At the $h$-th step, the agent receives a state $s_h$, takes an action $a_h \sim \pi_h(\cdot \mid s_h)$, receives a reward $r^*_h(s_h, a_h)$, and receives the next state $s_{h+1} \sim P^*_h(\cdot \mid s_h, a_h)$. The episode ends after the agent receives the last state $s_{H+1}$. For a given policy $\pi = \{\pi_h\}_{h=1}^H$, we define the value function $V^\pi_h$ and the Q-function $Q^\pi_h$ for any $h \in [H]$ as
$$V^\pi_h(s; r^*, P^*) = \mathbb{E}_{\pi, P^*}\Big[\textstyle\sum_{i=h}^H r^*_i(s_i, a_i) \,\Big|\, s_h = s\Big], \qquad Q^\pi_h(s, a; r^*, P^*) = \mathbb{E}_{\pi, P^*}\Big[\textstyle\sum_{i=h}^H r^*_i(s_i, a_i) \,\Big|\, s_h = s, a_h = a\Big]. \tag{2.1}$$
Here the expectation $\mathbb{E}_{\pi, P^*}[\cdot]$ in (2.1) is taken with respect to $s_{i+1} \sim P^*_i(\cdot \mid s_i, a_i)$ and $a_i \sim \pi_i(\cdot \mid s_i)$ for $i \in \{h, h+1, \ldots, H\}$. For convenience, we define $V^\pi_{H+1}(s; r, P) = 0$ for any state $s \in \mathcal{S}$, reward function $r$, transition kernel $P$, and policy $\pi$. For simplicity, we define the expected total reward as $J(\pi; r^*, P^*) = V^\pi_1(s_{\mathrm{init}}; r^*, P^*)$. The goal of RL is to find a policy that maximizes the expected total reward, and we define $\pi^* \in \operatorname{argmax}_\pi J(\pi; r^*, P^*)$ as an optimal policy.
Correspondingly, we define the optimal Q-function $Q^*_h$ and the optimal value function $V^*_h$ as $Q^*_h(s, a; r^*, P^*) = Q^{\pi^*}_h(s, a; r^*, P^*)$ and $V^*_h(s; r^*, P^*) = V^{\pi^*}_h(s; r^*, P^*)$ for any $(s, a) \in \mathcal{S} \times \mathcal{A}$.
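For finite $\mathcal{S}$ and $\mathcal{A}$, the value functions in (2.1) can be computed exactly by backward induction through the Bellman equation. A minimal sketch (the tabular MDP below is purely illustrative; the paper's setting allows continuous states):

```python
import numpy as np

def evaluate_policy(P, r, pi, H):
    """Backward induction for V^pi_h and Q^pi_h as defined in (2.1).

    P:  (H, S, A, S) array, P[h, s, a, s'] = transition probability.
    r:  (H, S, A) array of deterministic rewards in [0, 1].
    pi: (H, S, A) array, pi[h, s, a] = probability of action a in state s.
    """
    nS, nA = r.shape[1], r.shape[2]
    V = np.zeros((H + 1, nS))            # V^pi_{H+1} = 0 by convention
    Q = np.zeros((H, nS, nA))
    for h in range(H - 1, -1, -1):       # steps are 0-indexed here
        Q[h] = r[h] + P[h] @ V[h + 1]    # Q^pi_h = r_h + E[V^pi_{h+1}]
        V[h] = (pi[h] * Q[h]).sum(axis=1)
    return V, Q

# Tiny two-state, two-action MDP with unit rewards: V^pi_1 = H for any policy.
H, nS, nA = 3, 2, 2
P = np.full((H, nS, nA, nS), 1.0 / nS)
r = np.ones((H, nS, nA))
pi = np.full((H, nS, nA), 1.0 / nA)
V, Q = evaluate_policy(P, r, pi, H)
print(V[0])   # [3. 3.]
```

The same recursion underlies the planning subroutine used later on the learned model.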

3. MARKOV DECISION PROCESSES WITH NEURAL DYNAMICS

In this paper, our goal is to develop a provably efficient algorithm for RL problems equipped with large feature spaces, such as those induced by neural networks (NNs). To this end, we introduce MDPs with neural dynamics, whose reward and transition dynamics are parameterized by NNs.

Motivation. Our definition is motivated by the kernelized nonlinear regulator (KNR). In a KNR model, the transition kernel takes the following form,
$$s_{h+1} = W^*_h \phi^*_h(s_h, a_h) + \epsilon, \qquad \epsilon \sim N(0, I_d), \tag{3.1}$$
where $\phi^*_h$ is a known nonlinear feature map. Although the KNR setting enables sample-efficient RL (Kakade et al. (2020)), it is relatively restrictive in the following two aspects. First, the feature map $\phi^*_h$ and the expected reward $r^*_h$ are known a priori. Second, the model imposes nonlinearity on $(s_h, a_h)$ only via the known feature map, while the conditional expectation of the next state given $(s_h, a_h)$ is a linear function of $\phi^*_h(s_h, a_h)$. In other words, when $\phi^*_h$ is known, the transition dynamics can be recovered via linear system identification methods such as ridge regression. To generalize the KNR model, we interpret (3.1) as an energy-based model. More specifically, we can write the transition of the MDP in (3.1) as
$$P^*_h(s_{h+1} \mid s_h, a_h) \propto \exp\big(-E(s_{h+1}, s_h, a_h)\big), \tag{3.2}$$
where the energy function $E(s_{h+1}, s_h, a_h)$ is defined as
$$E(s_{h+1}, s_h, a_h) = \big\|s_{h+1} - W^*_h \phi^*_h(s_h, a_h)\big\|_2^2 / 2. \tag{3.3}$$
Here (3.2) omits a normalization factor, which is a function of $(s_h, a_h)$. We generalize this model and impose nonlinearity on the next state $s_{h+1}$ by substituting a nonlinear feature map $\psi^*_{h+1}(s_{h+1})$ for $s_{h+1}$ in (3.3). Such a generalization allows us to incorporate the nonlinearity of the next state in the model.
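To make the last point concrete: when $\phi^*_h$ is known, $W^*_h$ in (3.1) admits a closed-form ridge-regression estimate from observed transitions. A hedged numerical sketch (the feature map, dimensions, and regularizer below are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 3, 4, 2000          # state dim, feature dim, number of transitions

def phi(s, a):                # a known nonlinear feature map (illustrative)
    return np.tanh(np.concatenate([s, [a]]))

W_star = rng.normal(size=(d, m))

# Collect transitions s' = W* phi(s, a) + eps, eps ~ N(0, I_d), as in (3.1).
X = np.empty((n, m)); Y = np.empty((n, d))
for t in range(n):
    s, a = rng.normal(size=d), rng.integers(2)
    X[t] = phi(s, a)
    Y[t] = W_star @ X[t] + rng.normal(size=d)

# Ridge estimate: W_hat = Y^T X (X^T X + lam I)^{-1}.
lam = 1.0
W_hat = Y.T @ X @ np.linalg.inv(X.T @ X + lam * np.eye(m))
print(np.abs(W_hat - W_star).max())   # small for large n
```

This recovery is exactly what becomes unavailable once $\phi^*_h$ (and $\psi^*_{h+1}$) must themselves be learned, which is the setting of Definition 3.1.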
In addition, we assume that the nonlinear feature maps $\phi^*_h$ and $\psi^*_{h+1}$ are unknown and need to be estimated from pre-specified feature classes $\Phi$ and $\Psi$, which, for example, can be two classes of NNs. We further assume that the expected reward $r^*$ is unknown and needs to be estimated from a reward function class $\mathcal{R}$. We formalize our generalization in the following definition.

Definition 3.1 (MDPs with Neural Dynamics). An episodic MDP $(\mathcal{S}, \mathcal{A}, H, P^*, r^*)$ is an MDP with neural dynamics if its reward function $r^* = \{r^*_h\}_{h=1}^H$ belongs to a known reward function class $\mathcal{R}$ that consists of NNs, and its transition kernel $P^* = \{P^*_h\}_{h=1}^H$ takes the following form,
$$P^*_h(s_{h+1} \mid s_h, a_h) \propto \exp\big(-\|\psi^*_{h+1}(s_{h+1}) - \phi^*_h(s_h, a_h)\|_2^2 / 2\big). \tag{3.4}$$
Here $\phi^*_h \in \Phi$ and $\psi^*_{h+1} \in \Psi$ are two unknown feature maps, where $\Phi \subseteq \{\mathbb{R}^d \times \mathcal{A} \to \mathbb{R}^m\}$ and $\Psi \subseteq \{\mathbb{R}^d \to \mathbb{R}^m\}$ are two known feature map classes that consist of NNs. We denote by $\mathcal{M}$ the set of all transition kernels that take the form of (3.4), and let $\mathcal{X} \subseteq \mathbb{R}^m$ denote the union of the image spaces of the feature maps, namely, $\mathcal{X} = \{\varphi(s, a) \mid (\varphi, s, a) \in \Phi \times \mathcal{S} \times \mathcal{A}\} \cup \{\varphi(s) \mid (\varphi, s) \in \Psi \times \mathcal{S}\} \subseteq \mathbb{R}^m$.

Generality of Definition 3.1. We remark that Definition 3.1 is a significant generalization of stochastic nonlinear systems beyond KNR. For instance, when $\psi^*_{h+1}(s_{h+1}) = s_{h+1}$, the transition kernel in Definition 3.1 takes the form $P^*_h(s_{h+1} \mid s_h, a_h) \propto \exp(-\|s_{h+1} - \phi^*_h(s_h, a_h)\|_2^2 / 2)$, which is the transition kernel in Ren et al. (2021). Therefore, we recover the model in Ren et al. (2021) when $\psi^*_{h+1}$ is known to be the identity map and the reward function is known. Moreover, the transition kernel defined in (3.4) also includes a class of nonlinear dynamics satisfying $s_{h+1} = (\psi^*_{h+1})^{-1}\big(\phi^*_h(s_h, a_h) + \epsilon_h\big)$, where $\mathcal{S} \subseteq \mathbb{R}^m$, $\psi^*_{h+1} : \mathbb{R}^m \to \mathbb{R}^m$, the determinant of the Jacobian matrix of $\psi^*_{h+1}$ is a constant, and $\epsilon_h$ is a Gaussian noise.
Our model significantly generalizes such a model by allowing a possibly noninvertible feature map $\psi^*_{h+1}$.

Relationship with Kernelized Linear MDP. Recall that $K(x, y) = \exp(-\|x - y\|_2^2 / 2)$ is the Gaussian RBF kernel, which induces a reproducing kernel Hilbert space (RKHS) on $\mathbb{R}^m$ (Rahimi et al., 2007). (See Appendix §D for a brief introduction to RKHSs.) Intuitively, $K(x, y)$ measures the proximity between $x$ and $y$ in the kernel space. From this perspective, the transition kernel in (3.4) specifies the next state $s_{h+1}$ by measuring the proximity of the representations $\phi^*_h(s_h, a_h)$ and $\psi^*_{h+1}(s_{h+1})$. Moreover, $K(x, y)$ can be written as $\langle k(x), k(y) \rangle_{\mathcal{H}}$, where $k(x) = K(x, \cdot)$ is the feature map of the RKHS and $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the inner product of the RKHS. Thus, (3.4) can be equivalently written as
$$P^*_h(s_{h+1} \mid s_h, a_h) = Z^*_h(s_h, a_h) \cdot \big\langle k\big(\phi^*_h(s_h, a_h)\big), k\big(\psi^*_{h+1}(s_{h+1})\big)\big\rangle_{\mathcal{H}},$$
where $Z^*_h(s_h, a_h)$ is the normalization factor in (3.4). Thus, when $Z^*_h$ is known, our model can be regarded as an RKHS extension of the linear MDP model (Jin et al., 2020). This is the case when $\psi^*$ is the identity map but unknown to the learner, in which case $Z^*_h(s_h, a_h)$ becomes a constant (Ren et al., 2021). See Appendix §D for more details on the relationship between the model in (3.4) and RKHSs.

Role of NNs in Our Model. We remark that the model specified in Definition 3.1 is not restricted to NNs. In fact, the definition only requires proper function classes for the reward function and for the representations of $(s_h, a_h)$ and $s_{h+1}$, namely, $\mathcal{R}$, $\Phi$, and $\Psi$. Thus, our model can also be defined for other function approximators, such as polynomial splines (Unser et al., 1993) and classification and regression trees (Syrgkanis & Zampetakis, 2020). Meanwhile, as we will see in the sequel, neither our algorithm nor our theoretical results hinge on NNs, and both can employ general function classes with bounded capacity.
Here we call our model neural dynamics in order to highlight that our work is the first one that covers arbitrary NN classes with bounded log-covering numbers.
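The inner-product view of the Gaussian RBF kernel can be made concrete with random Fourier features (Rahimi et al., 2007), which approximate $K(x, y)$ by a finite-dimensional inner product $z(x)^\top z(y)$. The sketch below is only an illustration of this kernel identity, not a component of the algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
m, D = 5, 20000               # input dim, number of random features

# Random Fourier features for K(x, y) = exp(-||x - y||^2 / 2):
# z(x) = sqrt(2/D) * cos(Omega x + b), with Omega ~ N(0, I), b ~ U[0, 2*pi].
Omega = rng.normal(size=(D, m))
b = rng.uniform(0, 2 * np.pi, size=D)

def z(x):
    return np.sqrt(2.0 / D) * np.cos(Omega @ x + b)

def K(x, y):
    return np.exp(-np.sum((x - y) ** 2) / 2)

x, y = rng.normal(size=m), rng.normal(size=m)
print(K(x, y), z(x) @ z(y))   # the two values agree up to O(1/sqrt(D)) error
```

In this finite-dimensional surrogate, the transition in (3.4) becomes (up to normalization) a linear MDP in the features $z(\phi^*_h(s_h, a_h))$, which is the intuition behind the kernelized-linear-MDP view above.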

4. ALGORITHM

In this section, we introduce an algorithm for solving MDPs with neural dynamics in the online setting. We first introduce the motivation of the algorithm and then describe the procedure in detail.

Motivation. To strike a balance between exploration and exploitation, our algorithm follows the principle of optimism in the face of uncertainty (Lattimore & Szepesvári, 2020). If we knew the true feature maps $\{\phi^*_h\}_{h=1}^H$, we could apply kernel LSVI (Yang et al., 2020) to construct the exploration bonus, since the energy-based transition admits a kernel structure. (See §3 for the details.) However, in MDPs with neural dynamics, we do not know $\{\phi^*_h\}_{h=1}^H$. A straightforward solution for handling the unknown feature maps is to learn them from the collected data and construct the bonus based on the learned features. However, the bonus constructed from the learned features might be invalid since the learned features contain errors. We handle the error in the learned features by purposefully taking uniform actions when exploring the environment. Such a sampling scheme gives us more diverse data for model estimation. Based on this motivation, we design an iterative algorithm that outputs a policy after $N$ iterations. In particular, in each iteration $n \in [N]$, our algorithm performs the following four steps: (i) sampling new data from the environment, (ii) estimating the model via maximum likelihood estimation, (iii) constructing exploration incentives using the features of the learned model, and (iv) updating the online policy for exploration via planning on the learned model.

Sampling Scheme. As we mentioned in §3, the transition of MDPs with neural dynamics can be written as an energy-based model and admits the Gaussian RBF kernel. To exploit the kernel structure in the transition, we explore the environment using the exploration bonus induced by the Gaussian RBF kernel and the feature maps learned from the data, which is motivated by Yang et al. (2020).
However, since the bonus is not induced by the true underlying features, it might fail to indicate the most uncertain state-action pairs for exploration. To mitigate such an issue, we combine the uniform policy, which samples actions from the uniform distribution over the action space, with the optimistic policy during the sampling procedure. Intuitively, such a sampling scheme provides wider coverage over the state-action space and better explores the environment. To simplify the presentation in the main text, we present a sketch of the sampling scheme as follows; the rigorous treatment of the boundary case is deferred to Appendix §B. In the $n$-th iteration of our algorithm, given the previously collected datasets $\mathcal{D}^{n-1}_{h,i}$ for $i \in \{0,1,2\}$ and $h \in [H]$, we interact with the MDP following the policy $\pi^n = \{\pi^n_h\}_{h=1}^H$ and obtain the new datasets $\mathcal{D}^n_{h,i}$ for $i \in \{0,1,2\}$ and $h \in [H]$. Specifically, for each $\bar{h} \in \{-1, \ldots, H\}$, we start from the initial state $s_1$, choose the action $a_h \sim \pi^n_h(\cdot \mid s_h)$ in the $h$-th step when $h \in \{1, \ldots, \bar{h}\}$, and choose the action $a_h \sim \mathcal{U}(\mathcal{A})$ when $h \in \{\bar{h}+1, \bar{h}+2\}$. Here $\mathcal{U}(\mathcal{A})$ is the uniform distribution over the action space $\mathcal{A}$. By following such a procedure, we obtain the trajectory
$$s_1, a_1, r_1, \ldots, s_{\bar{h}+2}, a_{\bar{h}+2}, r_{\bar{h}+2}, s_{\bar{h}+3}, \tag{4.1}$$
where $s_1 = s_{\mathrm{init}}$. We then label the obtained trajectory as
$$s^n_{\bar{h}+i,i} = s_{\bar{h}+i}, \qquad a^n_{\bar{h}+i,i} = a_{\bar{h}+i}, \qquad r^n_{\bar{h}+i,i} = r_{\bar{h}+i}, \qquad \tilde{s}^n_{\bar{h}+i,i} = s_{\bar{h}+i+1} \tag{4.2}$$
for any $i \in \{0,1,2\}$. We then update the datasets by $\mathcal{D}^n_{h,i} = \mathcal{D}^{n-1}_{h,i} \cup \{(s^n_{h,i}, a^n_{h,i}, r^n_{h,i}, \tilde{s}^n_{h,i})\}$ for any $i \in \{0,1,2\}$ and $h \in [H]$. The index $i$ in (4.2) indicates how many steps of the uniform policy are executed before the transition is recorded. Intuitively, a dataset with a larger $i$ has better coverage over the state-action space $\mathcal{S} \times \mathcal{A}$. We summarize the sampling scheme in Algorithm 1.
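The labeling bookkeeping in (4.2) can be sketched as follows. The helper `label_rollout` and its tuple layout are illustrative, not from the paper; it maps a single rollout generated for index $\bar{h}$ to the dataset indices $(h, i)$ it contributes to:

```python
def label_rollout(traj, h_bar):
    """Label one rollout, as in (4.2).

    traj: list of (s, a, r, s_next) tuples for steps 1, 2, ..., where steps
    1..h_bar followed pi^n and the next two steps followed U(A).
    Returns a dict mapping (h, i) -> tuple to be appended to D^n_{h,i}.
    """
    labels = {}
    for i in (0, 1, 2):
        h = h_bar + i                 # the step whose transition we record
        if 1 <= h <= len(traj):       # skip out-of-range boundary indices
            labels[(h, i)] = traj[h - 1]
    return labels

# Rollout for h_bar = 3: steps 1-3 follow pi^n, steps 4-5 are uniform.
traj = [("s%d" % h, "a%d" % h, 0.0, "s%d" % (h + 1)) for h in range(1, 6)]
print(sorted(label_rollout(traj, 3)))   # [(3, 0), (4, 1), (5, 2)]
```

So the transition at step $\bar{h}+i$ lands in $\mathcal{D}^n_{\bar{h}+i,\,i}$: larger $i$ means more preceding uniform steps, hence broader coverage.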
See Figure 1 in Appendix §B for an illustration of the sampling scheme, and see Algorithm 3 in Appendix §B for the formal presentation of Algorithm 1.

Algorithm 1 Sampling Scheme (Informal)
1: Input: Policy $\pi^n = \{\pi^n_h\}_{h=1}^H$, datasets $\mathcal{D}^{n-1}_{h,i}$ for $i \in \{0,1,2\}$ and $h \in [H]$.
2: for $\bar{h} = -1, \ldots, H$ do
3: Interact with the environment to obtain the trajectory in (4.1) by first executing $\pi^n$ from $s_1$ to $s_{\bar{h}}$, and then executing $\mathcal{U}(\mathcal{A})$ for two more steps. Label the obtained trajectory as in (4.2).
4: end for
5: Set $\mathcal{D}^n_{h,i} \leftarrow \{(s^\tau_{h,i}, a^\tau_{h,i}, r^\tau_{h,i}, \tilde{s}^\tau_{h,i})\}_{\tau=1}^n$ for $h \in [H]$ and $i \in \{0,1,2\}$, where $(s^\tau_{h,i}, a^\tau_{h,i}, r^\tau_{h,i}, \tilde{s}^\tau_{h,i})$ is the data labeled in (4.2). &#9655; Updating the datasets
6: Return: Datasets $\{\mathcal{D}^n_{h,i}\}_{h \in [H], i \in \{0,1,2\}}$.

Model Estimation. To estimate the model, we solve the following optimization problems,
$$\hat{r}^n_h = \operatorname{argmin}_{r \in \mathcal{R}} \sum_{i=1}^2 \sum_{(s_h, a_h, r_h, s_{h+1}) \in \mathcal{D}^n_{h,i}} \big(r_h - r(s_h, a_h)\big)^2, \tag{4.3}$$
$$\hat{P}^n_h = \operatorname{argmin}_{P \in \mathcal{M}} \, -\sum_{i=1}^2 \sum_{(s_h, a_h, r_h, s_{h+1}) \in \mathcal{D}^n_{h,i}} \log P(s_{h+1} \mid s_h, a_h). \tag{4.4}$$
Here $\mathcal{R}$ and $\mathcal{M}$ are the reward function class and the transition kernel class defined in Definition 3.1. We denote by $\hat{\phi}^n_h$ and $\hat{\psi}^n_{h+1}$ the feature maps that correspond to the transition kernel $\hat{P}^n_h$ estimated in (4.4). To simplify our analysis, we assume that there exists an oracle that returns the global minima of the optimization problems (4.3) and (4.4). A similar assumption also arises in previous studies of RL (Fan et al., 2020; Kakade et al., 2020; Uehara et al., 2021; Jin et al., 2021a). When the normalization factor in (3.5) is a constant, (4.4) can be easily implemented since it is equivalent to
$$\hat{P}^n_h = \operatorname{argmin}_{\phi \in \Phi, \psi \in \Psi} \sum_{i=1}^2 \sum_{(s_h, a_h, r_h, s_{h+1}) \in \mathcal{D}^n_{h,i}} \big\|\psi(s_{h+1}) - \phi(s_h, a_h)\big\|_2^2.$$
Remark 4.1 (Transition Estimation). We would like to remark that the method for estimating the transition is not restricted to maximum likelihood estimation (MLE).
Methods including the variational autoencoder (Kingma & Welling, 2013) and score matching (Hyvärinen & Dayan, 2005) can also be used for transition estimation. Our sample complexity bound holds for any transition estimator whose total variation error admits an upper bound.

Exploration Bonus. The transition kernel in Definition 3.1 is closely related to the radial basis function (RBF) kernel. In the sequel, we define the bonuses for exploration and update the policy based on such bonuses. Specifically, for a fixed feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^m$, we define the Gram matrix $K^n_h[\phi]$ and the function $k^n_h[\phi] : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^n$ as
$$K^n_h[\phi] = \Big[K\big(\phi(s^\tau_{h,1}, a^\tau_{h,1}), \phi(s^{\tau'}_{h,1}, a^{\tau'}_{h,1})\big)\Big]_{\tau, \tau'=1}^n \in \mathbb{R}^{n \times n},$$
$$k^n_h[\phi](s, a) = \Big(K\big(\phi(s, a), \phi(s^1_{h,1}, a^1_{h,1})\big), \ldots, K\big(\phi(s, a), \phi(s^n_{h,1}, a^n_{h,1})\big)\Big)^\top \in \mathbb{R}^n, \qquad \forall (s, a) \in \mathcal{S} \times \mathcal{A},$$
where $\{(s^\tau_{h,1}, a^\tau_{h,1}, r^\tau_{h,1}, \tilde{s}^\tau_{h,1})\}_{\tau=1}^n$ are the tuples in $\mathcal{D}^n_{h,1}$. We then define the bonus $u^n_h$ as
$$u^n_h(s, a) = \min\big\{2H + 2, \, \beta \bar{u}^n_h(s, a)/\lambda\big\}, \quad \text{where} \quad \bar{u}^n_h(s, a) = 1 - k^n_h[\hat{\phi}^n_h](s, a)^\top \big(\lambda I + K^n_h[\hat{\phi}^n_h]\big)^{-1} k^n_h[\hat{\phi}^n_h](s, a). \tag{4.5}$$
Here $\beta > 0$ and $\lambda > 0$ are tuning parameters. We remark that the form of the bonus in (4.5) aligns with the bonuses in previous works that use kernel functions for function approximation (Srinivas et al., 2009; Yang et al., 2020).

Remark 4.2 (Dependency of Rewards on Features). Here we do not require the reward to depend on the feature in the transition kernel, which differs from the literature on linear MDPs (Cai et al., 2019; Jin et al., 2020). At first glance, it may seem that we cannot characterize the uncertainty using the feature without such a dependency. However, the estimation error of the reward estimators $\hat{r}^n_h$ under the empirical measure can be bounded from above by the property of the least-squares estimator.
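The quantity $\bar{u}^n_h$ in (4.5) is the regularized posterior variance of the Gaussian RBF kernel evaluated at the learned features, so it shrinks near visited feature vectors and stays large far from them. A minimal numerical sketch (feature data and parameter values are synthetic):

```python
import numpy as np

def rbf(X, Y):
    """Gaussian RBF Gram matrix, K(x, y) = exp(-||x - y||^2 / 2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / 2)

def bonus(phi_sa, Phi_data, beta, lam, H):
    """Exploration bonus of (4.5) at a single feature vector phi_sa.

    Phi_data: (n, m) array of features phi(s^tau_{h,1}, a^tau_{h,1}).
    """
    K_n = rbf(Phi_data, Phi_data)                       # Gram matrix K^n_h
    k_n = rbf(phi_sa[None, :], Phi_data)[0]             # kernel vector k^n_h
    u_bar = 1.0 - k_n @ np.linalg.solve(
        lam * np.eye(len(Phi_data)) + K_n, k_n)         # posterior variance
    return min(2 * H + 2, beta * u_bar / lam)

rng = np.random.default_rng(0)
Phi_data = rng.normal(size=(50, 4))
beta, lam, H = 1.0, 1.0, 5
u_seen = bonus(Phi_data[0], Phi_data, beta, lam, H)               # visited
u_new = bonus(10.0 + rng.normal(size=4), Phi_data, beta, lam, H)  # far away
print(u_seen, u_new)   # small at visited features, large far from the data
```

Because only matrix solves over the $n$ collected points are needed, this bonus is efficiently computable, in contrast to optimism over an entire confidence set.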
Therefore, the uncertainty of $r^*_h$ under the distribution induced by any new policy, such as $\pi^*$, can be bounded from above by the distribution shift, which is characterized by $u^n_{h-1}$. Such an observation allows us to characterize the uncertainty of the reward estimator even when the reward does not depend on the feature in the transition kernel.

Policy Update. We update the policy $\pi^{n+1}$ by setting it to the optimal policy of the learned model, which can be efficiently computed by dynamic programming. Due to the space limit, we defer the details of the planning algorithm to Appendix §B. We remark that we can also apply other model-based algorithms, including Dyna (Sutton, 1991) and Gradient-Aware Model-based Policy Search (D'Oro et al., 2020), to compute the optimal policy of the learned model, and the suboptimality of the output of Algorithm 2 remains bounded when the error of the planning oracle is bounded.

Remark 4.3 (Computational Efficiency). Our algorithm is oracle efficient in the sense that it is computationally efficient given an optimization oracle for model estimation, as in previous studies (Fan et al., 2020; Kakade et al., 2020; Uehara et al., 2021). More specifically, the bonus and the policy in each iteration can be efficiently computed by (4.5) and Algorithm 4 in the appendix. The existing literature on general function approximation requires either global optimism over the confidence set (Kakade et al., 2020; Jin et al., 2021a) or posterior sampling over the hypothesis set (Ren et al., 2021), neither of which can be computed efficiently.
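The planning step amounts to backward induction on the learned model with the optimistic reward $\hat{r}^n + u^n$. The formal planner is Algorithm 4 in Appendix §B; the sketch below is only a simplified tabular stand-in for that dynamic program:

```python
import numpy as np

def optimistic_plan(P_hat, r_hat, u, H):
    """Backward induction on the learned model with optimistic reward r_hat + u.

    P_hat: (H, S, A, S) learned transitions; r_hat, u: (H, S, A) estimated
    rewards and bonuses. Returns a greedy deterministic policy and its value.
    """
    nS, nA = r_hat.shape[1], r_hat.shape[2]
    V = np.zeros((H + 1, nS))
    pi = np.zeros((H, nS), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = r_hat[h] + u[h] + P_hat[h] @ V[h + 1]   # optimistic Q-values
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return pi, V

# One-step example: the bonus flips the greedy action toward the less-seen one.
P_hat = np.ones((1, 1, 2, 1))
r_hat = np.array([[[0.5, 0.4]]])
u = np.array([[[0.0, 0.3]]])
pi, V = optimistic_plan(P_hat, r_hat, u, 1)
print(pi[0, 0])   # 1, since 0.4 + 0.3 > 0.5
```

This is the mechanism by which the bonus steers $\pi^{n+1}$ toward under-explored state-action pairs.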

Algorithm 2 Exploration with Learnable Neural Features (ELNF)

1: Input: Failure probability $\delta > 0$, tuning parameters $\beta, \lambda > 0$, number of iterations $N$.
2: Initialize: Set $\pi^1 = \{\pi^1_h\}_{h=1}^H$, where $\pi^1_h(\cdot \mid s) = \mathcal{U}(\mathcal{A})$ for all $s \in \mathcal{S}$ and $h \in [H]$.
3: for $n = 1, \ldots, N$ do
4: Collect data with the sampling scheme (Algorithm 1) using policy $\pi^n$ and datasets $\{\mathcal{D}^{n-1}_{h,i}\}$.
5: Estimate $\{\hat{r}^n_h\}_{h=1}^H$ and $\{\hat{P}^n_h\}_{h=1}^H$ via (4.3) and (4.4).
6: Construct the bonuses $\{u^n_h\}_{h=1}^H$ via (4.5).
7: Update $\pi^{n+1}$ by planning on the learned model with reward $\hat{r}^n + u^n$ (Algorithm 4 in Appendix §B).
8: end for
9: Return: $\hat{\pi} = \mathcal{U}\big(\{\pi^n\}_{n=2}^{N+1}\big)$.

5. THEORY

In this section, we present the analysis of ELNF. We first present the boundedness assumption on the model.

Assumption 5.1 (Boundedness of Model). We assume that the state space $\mathcal{S}$ is a bounded subset of $\mathbb{R}^d$ and that the Lebesgue measure of $\mathcal{S}$ is an absolute constant. We also assume that $\max\{\|\phi(s, a)\|_2, \|\psi(s)\|_2\} \le R$ for all $(s, a, \phi, \psi) \in \mathcal{S} \times \mathcal{A} \times \Phi \times \Psi$. We further assume that $0 \le r(s, a) \le 1$ for any $(s, a, r) \in \mathcal{S} \times \mathcal{A} \times \mathcal{R}$.

Since $\mathcal{S}$ is bounded, Assumption 5.1 is a reasonable regularity condition on the model. Such a regularity assumption is standard and is also made in previous works (Cai et al., 2019; Jin et al., 2020; 2021b). Next, we introduce the following assumption, which characterizes the complexity of the NN classes.

Assumption 5.2 (Decay Rate of Covering Number). There exist constants $C_{\mathrm{net}} > 0$ and $\gamma \ge 0$ that depend only on $(R, \Phi, \Psi)$ such that
$$\mathcal{H}_c(\epsilon) := \max\big\{\mathcal{H}_\infty(\mathcal{R}, \epsilon), \mathcal{H}_2(\Phi, \epsilon), \mathcal{H}_2(\Psi, \epsilon)\big\} \le C_{\mathrm{net}} \cdot \big(1 + \log(1/\epsilon)\big) / \epsilon^\gamma.$$

In Assumption 5.2, $\gamma$ characterizes the complexity of the NN class by quantifying the growth rate of the covering number as the covering radius decays. We remark that previous research bounds the covering number of NN classes from above at the same scale as Assumption 5.2. For example, Schmidt-Hieber (2020) and Chen et al. (2019) show that NN classes with specific structures satisfy Assumption 5.2 with $\gamma = 0$. See Lemmas C.3 and C.6 in Appendix §C for the details.

Theorem 5.3 (Sample Complexity of ELNF). Suppose that Assumption 5.2 holds with $\gamma < 1/2$ and that we can obtain the exact solutions to the optimization problems (4.3) and (4.4). We set the tuning parameters $\lambda$ and $\beta$ in ELNF as
$$\lambda = C' N^{\gamma/(1+\gamma)} m \log(48HRN/\delta), \qquad \beta = C'' H |\mathcal{A}|^{1/2} m^{1/2} N^{3\gamma/(4+4\gamma)} \log(48HRN/\delta),$$
where $m$ is the dimension of the image of the feature maps, and $C', C''$ are constants that depend only on the regularity parameters in Assumption 5.1. Under Assumptions 5.1 and 5.2, for the policy $\hat{\pi}$ returned by ELNF, it holds with probability at least $1 - \delta$ that
$$J(\pi^*; r^*, P^*) - J(\hat{\pi}; r^*, P^*) \le C H^5 \cdot |\mathcal{A}|^2 \cdot \xi \cdot N^{(2\gamma-1)/(2+2\gamma)} (\log N)^{m+1}.$$
Here $C$ is a constant that depends only on the dimension $m$, the bound $R$ of the feature maps, and $C_{\mathrm{net}}$ in Assumption 5.2, and $\xi = (\log(48HRN/\delta))^{5/2}$ is a logarithmic factor.

Proof. See Appendix §E for a detailed proof.

In Theorem 5.3, $\lambda$ is the regularization parameter that trades off bias and variance, and $\beta$ is the uncertainty coefficient, which scales with $\gamma$ and $N$. We remark that our analysis is not restricted to NN classes and can be extended to other bounded function classes with bounded covering numbers. We further remark that $m$ in Theorem 5.3 is the dimension of the image of the feature maps, which can be much smaller than the dimension of the state.
Moreover, in Appendix §E, we show that the suboptimality bound in Theorem 5.3 reduces to $O(d_{\mathrm{eff}} N^{(2\gamma-1)/(2+2\gamma)})$ in terms of $N$, where $d_{\mathrm{eff}}$ is the effective dimension in Definition E.4 in the appendix. Such a bound connects the sample efficiency of ELNF to the effective dimension of the RKHS and the covering number of NN classes. We further remark that when $\gamma = 0$ in Theorem 5.3, the suboptimality bound is sublinear in $N$, which aligns with previous theoretical research. A detailed comparison with the related work can be found in Appendix §A.

6. PROOF SKETCH OF THEOREM 5.3

In this section, we sketch the proof of Theorem 5.3 and highlight the technique that allows us to remove the dependency on the Eluder dimension of NN classes. Since the policy $\hat{\pi}$ returned by Algorithm 2 (ELNF) is the mixture of $\pi^2, \ldots, \pi^{N+1}$, we can decompose the suboptimality as
$$J(\pi^*; r^*, P^*) - J(\hat{\pi}; r^*, P^*) = \underbrace{\frac{1}{N}\sum_{n=1}^N \Big[J(\pi^*; r^*, P^*) - J(\pi^*; \hat{r}^n + u^n, \hat{P}^n)\Big]}_{\text{Term (a)}} + \underbrace{\frac{1}{N}\sum_{n=1}^N \Big[J(\pi^*; \hat{r}^n + u^n, \hat{P}^n) - J(\pi^{n+1}; \hat{r}^n + u^n, \hat{P}^n)\Big]}_{\text{Term (b)}} + \underbrace{\frac{1}{N}\sum_{n=1}^N \Big[J(\pi^{n+1}; \hat{r}^n + u^n, \hat{P}^n) - J(\pi^{n+1}; r^*, P^*)\Big]}_{\text{Term (c)}}.$$
Here Term (a) is the out-of-sample estimation error of the estimated value, Term (b) is the error of the planning algorithm, and Term (c) is the in-sample estimation error of the estimated value. Since the bonus $u^n$ captures the uncertainty of the estimated model with high probability, and the planning algorithm finds the optimal policy of the learned model, Terms (a) and (b) are small with high probability, as shown in the following lemma.

Lemma 6.1 (Informal). We define $\zeta = C_2 \log(20HRN/\delta) N^{\gamma/(1+\gamma)}$, where $C_2$ is a constant that depends on the regularity parameters and the parameters of the NN class. Under the same conditions as Theorem 5.3, $\mathrm{Term\ (a)} \le 8H|\mathcal{A}|\zeta/\sqrt{N}$ holds with probability at least $1 - \delta$. We also have $\mathrm{Term\ (b)} \le 0$ when we use Algorithm 4 for planning.

Proof. This can be directly proved by Lemma E.5 and Lemma E.6.
See Appendix §E for the details. The remaining analysis connects Term (c) with the complexity measure of the model class. Prior research (Wang et al., 2020; Jin et al., 2021a) that allows neural network parameterization quantifies the uncertainty by level sets, and the Eluder dimension naturally appears when telescoping the in-sample error. To remove the dependency on the Eluder dimension, we would like to quantify the uncertainty by the bonus defined via the true feature. However, since the true feature is unknown, we can only quantify the uncertainty using the learned feature. Inspired by Uehara et al. (2021), we obtain the following lemma, which connects the bonus defined by the learned feature with the bonus defined by the true, unseen feature.

Lemma 6.2 (Bonus Equivalence for the True Model, Informal). Under the same conditions as Theorem 5.3, we have

J(π; u^n, P*) ≤ 2 √|A| β d_eff^{1/2} / √n + J(π; u^{*,n}, P*)

for any policy π and n ≥ 2 with high probability, where the bonus u^{*,n} = {u^{*,n}_h}_{h=1}^H defined by the true feature is given in Lemma I.1.

Proof. This is Lemma I.2. See Appendix §I.5 for the details.

We remark that we cannot obtain the above lemma by directly applying the technique in Uehara et al. (2021), since they only consider finite-dimensional inner products. Using the lemma above, we have

Term (c) ≤ 32H^2 √|A| ζ √(d_eff N) + 2 Σ_{n=1}^N J(π^{n+1}; u^{*,n}, P*)

with high probability. Finally, we show that the sum of the bonuses defined by the true feature can be bounded by the effective dimension of the RKHS induced by the noise instead of the Eluder dimension of NN classes. We conclude the proof of Theorem 5.3 by combining the upper bounds of Term (a), Term (b), and Term (c).

Removing Dependency on Eluder Dimension.
The existing literature on RL with general function approximators relies on the Eluder dimension when bounding the regret or the suboptimality (Wang et al., 2020; Jin et al., 2021a; Ren et al., 2021), which is exponentially large even for a simple neural network class (Dong et al., 2021). However, we can remove such a dependency in MDPs with neural dynamics. Our key insight is that we can regard MDPs with neural dynamics as kernel MDPs (Yang et al., 2020) whose feature is the composition of the neural network and the feature map of the RKHS, since the energy-based transition admits a kernel structure, as shown in §3. Therefore, we can characterize the effect of the distribution shifts by the bonus defined via the true feature, whose sum is bounded by the effective dimension of the RKHS instead of the Eluder dimension of the NN class, without knowing the true feature.

Role of Uniform Policy in Proof. The uniform policy in the sampling scheme enables us to bound the influence of the distribution shift and to show that the bonuses defined by the learned feature form a valid uncertainty quantification. In order to show that the estimated value defined by the learned bonus is almost optimistic (Lemma E.5) for any policy, we need to bound the influence of the distribution shift. The data that we obtain from uniform sampling have better coverage over S × A, and the bonuses we construct with these data quantify the uncertainty in the presence of distribution shift. Therefore, we take uniform actions when exploring the environment to obtain a valid uncertainty quantification without knowing the true feature, and the |A|^2 factor in the suboptimality bound is the price paid for the uniform sampling.
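The three-term decomposition above is an exact identity: each pair of added and subtracted terms cancels, and the mixture policy's value is the average over the iterates. A minimal numeric sanity check, with randomly generated placeholder values standing in for the J(·) terms (the values themselves are hypothetical):

```python
import random

random.seed(0)
N = 10

# Hypothetical placeholder values for J(pi*; r*, P*), J(pi*; r^n + u^n, P^n),
# J(pi^{n+1}; r^n + u^n, P^n), and J(pi^{n+1}; r*, P*), respectively.
J_star_true = 5.0
J_star_model = [random.uniform(0, 5) for _ in range(N)]
J_pol_model = [random.uniform(0, 5) for _ in range(N)]
J_pol_true = [random.uniform(0, 5) for _ in range(N)]

term_a = sum(J_star_true - J_star_model[n] for n in range(N)) / N
term_b = sum(J_star_model[n] - J_pol_model[n] for n in range(N)) / N
term_c = sum(J_pol_model[n] - J_pol_true[n] for n in range(N)) / N

# The mixture policy's value is the average over the iterates, so the
# suboptimality equals the sum of the three terms exactly.
subopt = J_star_true - sum(J_pol_true) / N
```

The proof then bounds each of the three terms separately; the identity itself requires no probabilistic argument.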

LIST OF NOTATION

In the sequel, we present a list of notations in the paper.

S, A, X: The state space, the action space, and the feature space, respectively.
H, N: The length of an episode and the total number of iterations of Algorithm 2 (ELNF), respectively.
n, h: The index that iterates from 1 to N and the index that iterates from 1 to H, respectively.
τ, h̄: The index that iterates from 1 to n and the index that iterates from 1 to h, respectively.
Φ, Ψ: The feature map classes of the current state-action pair and of the next state, respectively, which are parameterized by NNs.
R: The reward function class parameterized by NNs.
{P*_h}_{h=1}^H, {r*_h}_{h=1}^H: The transition kernels and the rewards of the MDP.
d, m: The dimension of the state space and the dimension of the image space of the feature maps, respectively.
{D^n_{h,i}}_{h=1,i=0}^{h=H,i=2}, π^n: The datasets and the policy in the n-th iteration of Algorithm 2 (ELNF), respectively.
{P̂^n_h}_{h=1}^H, {r̂^n_h}_{h=1}^H, {u^n_h}_{h=1}^H: The estimated transition, the estimated reward, and the bonus in the n-th iteration of Algorithm 2 (ELNF), respectively.
{φ̂^n_h}_{h=1}^H, {ψ̂^n_h}_{h=2}^{H+1}: The learned features in the n-th iteration of Algorithm 2 (ELNF).
K, H, k: The Gaussian kernel K(x, y) = exp(-‖x - y‖_2^2/2), the RKHS H induced by K, and the feature map k of H.
∆(C), U(C): The set of distributions over C and the uniform distribution over C, respectively.
R: The upper bound on the norm of the features.
C_net, γ: The parameters in Assumption 5.2.
λ, β: The tuning parameters of Algorithm 2 (ELNF).
d_eff: The effective dimension defined in Definition E.4.
c[φ, ψ](s, a): The normalization function defined in (E.1).
N(F, ε, ‖·‖): The ε-covering number of the function class F with respect to the norm ‖·‖.
H_∞(F, ε), H_2(F, ε): H_∞(F, ε) = log N(F, ε, ‖·‖_∞) and H_2(F, ε) = log N(F, ε, ‖·‖_{∞,2}).
c_max, c_min, r_max: The upper and lower bounds of the normalization function, and r_max = c_max/c_min.
ζ: The parameter related to the upper bound on the estimation error in Lemma E.3.

A DETAILED COMPARISON WITH EXISTING LITERATURE

In this section, we provide a detailed comparison with the existing literature.

Comparison with Kakade et al. (2020). Kakade et al. (2020) study the setting of KNRs and propose the algorithm Lower Confidence-based Continuous Control (LC^3). They show that LC^3 is sample efficient by bounding the expectation of the regret from above by O(H^{3/2} N^{1/2} d_eff), which aligns with our results in terms of N^{1/2} and d_eff. However, they assume that the nonlinear feature map is known, while we need to learn it from prespecified NN classes. Thus, our model is significantly more challenging. Moreover, their algorithm requires optimism over the confidence set, which is not computationally efficient.

Comparison with Research on Linear MDPs and Low-rank MDPs. We show in §D.1 that our model generalizes the linear MDP with unknown feature maps and nonlinear rewards as a special case. Jin et al. (2020) show that their algorithm achieves O(√(m^3 H^3 N)) regret. Our bound on the suboptimality aligns with their results in terms of the dimension m and the number of iterations N, without assuming known feature maps and linear reward functions. Uehara et al. (2021) and Ren et al. (2022) propose algorithms for linear MDPs with unknown feature maps and show that the sample complexity of their algorithms is O(m^4/ε^2) in infinite-horizon linear MDPs with unknown feature maps, which also aligns with ours in terms of ε. However, they assume that the transition is a finite-dimensional inner product and that the unknown feature maps belong to a finite set, which greatly reduces the complexity of the problem.

Comparison with Dong et al. (2021). Our sample complexity is O(ε^{-2}) in terms of N when the logarithmic factors are omitted and γ = 0. Meanwhile, Theorem 5.1 in Dong et al. (2021) shows that the minimax sample complexity of solving a nonlinear bandit problem with one-layer NNs and ReLU activation is Ω(ε^{-(d-2)}). To obtain such a lower bound, Dong et al.
(2021) assume that the action space is the unit sphere S^{d-1} in R^d, which is an infinite set, while the action space in our setting is finite. In the case where H = 1, our model reduces to a finite-arm bandit problem whose reward is parameterized by an NN. Although the Eluder dimension of the NN class is large, the agent only needs to explore the arms of the bandit in our model, while the agent needs to explore the unit sphere in theirs. Therefore, their model does not belong to our model, our result does not contradict the lower bound in Dong et al. (2021), and the sample complexity in our model is dominated by the number of arms instead of the Eluder dimension of the NN class when H = 1.

Comparison with Ren et al. (2021). Ren et al. (2021) study a nonlinear model with Gaussian noise. They show that the expectation of the regret of their algorithm is

O( H^2 √( N · log N(Φ, N^{-1/2}, ‖·‖_2) · dim_E(Φ, N^{-1/2}) ) ),

where dim_E(Φ, ·) is the Eluder dimension of Φ. We show in §3 that the model in Ren et al. (2021) is a special case of our model. Our bound on the suboptimality aligns with their result in terms of the number of iterations N when γ = 0. However, they do not fully exploit the kernel structure of the transition in their analysis, and their result depends on the Eluder dimension of Φ. Lemma C.7 in Appendix §C.2 provides an example of an NN class whose ε-Eluder dimension is at least Ω(ε^{-(d-1)}) while its ε-log-covering number is at most O(log(1/ε)). Lemma C.7 shows that removing the dependency of the sample complexity on the Eluder dimension significantly improves the sample complexity. In addition, their algorithm requires sampling from the posterior distribution over the hypothesis class, which is difficult to implement in practice. In contrast, our algorithm only requires planning with respect to the learned model, which can be done efficiently.

Comparison with Yang et al. (2020). Yang et al.
(2020) use overparameterized NNs for function approximation in the algorithm Neural Optimistic Least-Squares Value Iteration (NOVI) and show that NOVI is sample efficient. However, their analysis relies on the connection between overparameterized NNs and the neural tangent kernel (NTK) and cannot handle NNs beyond the NTK regime.

B SUPPLEMENTARY FOR ALGORITHM

As we mentioned in §4, our algorithm performs the following four steps in each iteration: (i) sampling new data from the environment, (ii) estimating the model via maximum likelihood estimation, (iii) constructing exploration incentives using the features of the learned model, and (iv) updating the online policy for exploration via planning on the learned model. We simplify the presentation of the sampling algorithm and omit the details of the planning algorithm in the main text due to the space limit. In this section, we first describe the sampling scheme rigorously and in detail, and then provide the details of the planning algorithm.

Sampling Scheme. As we mentioned in §3, the transition of MDPs with neural dynamics can be written as an energy-based model and admits the Gaussian RBF kernel. To exploit the kernel structure in the transition, we explore the environment using the exploration bonus induced by the Gaussian RBF kernel and the feature maps learned from the data, which is motivated by Yang et al. (2020). However, since the bonus is not induced by the true underlying feature, it might fail to indicate the most uncertain state-action pairs for exploration. To mitigate such an issue, we combine the uniform policy, which samples actions from the uniform distribution over the action space, with the optimistic policy during the sampling procedure. Intuitively, such a sampling scheme provides a wider coverage over the state-action space and better explores the environment. To simplify the presentation of the algorithm, we introduce an extended MDP, where we assign meanings to the steps h = -1, 0, H + 1, and H + 2. In particular, the interaction of an agent with the extended MDP starts with a dummy initial state s_{-1}. During the interaction, all the dummy state and action sequences {s_{-1}, a_{-1}, s_0, a_0} lead to the same initial state s_init.
Moreover, the agent is allowed to interact with the environment for two steps after observing the final state s_{H+1} of an episode. Nevertheless, the agent only collects the rewards r_h(s_h, a_h) at steps h ∈ [H], which leads to the same learning objective as the original MDP. In addition, we denote by [H]^+ = {-1, 0, ..., H + 2} the set of steps in the extended MDP. We remark that the dummy state and action sequences {s_{-1}, a_{-1}, s_0, a_0} do not exist; we introduce them only to simplify the rigorous presentation of the boundary cases of our algorithm. In the sequel, we do not distinguish between an MDP and an extended MDP for simplicity of presentation.

Now we describe the sampling procedure in detail. In the n-th iteration of our algorithm, given the previously collected datasets D^{n-1}_{h,i} for i ∈ {0, 1, 2} and h ∈ [H], we interact with the MDP following the policy π^n = {π^n_h}_{h=-1}^H and obtain the new datasets D^n_{h,i} for i ∈ {0, 1, 2} and h ∈ [H]. Specifically, for any h̄ ∈ {-1, ..., H}, we start from the initial state s_{-1} and choose the action a_h ∼ π^n_h(· | s_h) in the h-th step when h ∈ {-1, ..., h̄}, and choose the action a_h ∼ U(A) when h ∈ {h̄ + 1, h̄ + 2}, where U(A) is the uniform distribution over the action space A. By following such a procedure, we obtain the trajectory

s_{-1}, a_{-1}, s_0, a_0, s_1, a_1, r_1, ..., s_{h̄+2}, a_{h̄+2}, r_{h̄+2}, s_{h̄+3},  (B.1)

where s_1 = s_init. Then, we label the obtained trajectory as follows,

s^n_{h̄+i,i} = s_{h̄+i},  a^n_{h̄+i,i} = a_{h̄+i},  r^n_{h̄+i,i} = r_{h̄+i},  s̃^n_{h̄+i,i} = s_{h̄+i+1},  for any i ∈ {0, 1, 2}.

We then update the datasets as

D^n_{h,i} = D^{n-1}_{h,i} ∪ {(s^n_{h,i}, a^n_{h,i}, r^n_{h,i}, s̃^n_{h,i})} = {(s^τ_{h,i}, a^τ_{h,i}, r^τ_{h,i}, s̃^τ_{h,i})}_{τ=1}^n  (B.2)

for any i ∈ {0, 1, 2} and h ∈ [H]. The index i in (B.2) indicates how many steps of the uniform policy we need to execute to obtain such a dataset.
Intuitively, the dataset with a bigger index i has a better coverage over the state-action space S × A. See Figure 1 for an illustration of the sampling scheme, which is summarized in Algorithm 3.

Algorithm 3 Sampling Scheme (Formal)
1: Input: Policy π^n = {π^n_h}_{h=-1}^H, datasets D^{n-1}_{h,i} for i ∈ {0, 1, 2} and h ∈ [H].
2: for h̄ = -1, ..., H do
3:   Interact with the environment to obtain the trajectory in (B.1) by first executing π^n from s_{-1} to s_{h̄}, and then executing U(A) for two more steps.
4:   Set (s^n_{h̄+i,i}, a^n_{h̄+i,i}, r^n_{h̄+i,i}, s̃^n_{h̄+i,i}) ← (s_{h̄+i}, a_{h̄+i}, r_{h̄+i}, s_{h̄+i+1}) for i ∈ {0, 1, 2}, where (s_{h̄+i}, a_{h̄+i}, r_{h̄+i}, s_{h̄+i+1}) is defined in (B.1).
5:   Set D^n_{h̄+i,i} ← D^{n-1}_{h̄+i,i} ∪ {(s^n_{h̄+i,i}, a^n_{h̄+i,i}, r^n_{h̄+i,i}, s̃^n_{h̄+i,i})} for i ∈ {0, 1, 2}.
6: end for
7: Return: Datasets {D^n_{h,i}}_{h=1,i=0}^{h=H,i=2}.

Figure 1: Sampling procedure in the h̄-th trajectory of the n-th iteration. We first execute the optimistic policy for h̄ steps, then execute the uniform policy for two steps, and finally label the collected data as in (B.2).

Planning Algorithm. The details of the planning algorithm are summarized in Algorithm 4.

Algorithm 4 Planning Algorithm
1: Input: Estimated reward {r̂^n_h}_{h=1}^H, estimated transition {P̂^n_h}_{h=1}^H, bonus {u^n_h}_{h=1}^H.
2: Set Q^n_H(s, a) ← r̂^n_H(s, a) + u^n_H(s, a).
3: Set V^n_H(s) ← max_{a∈A} Q^n_H(s, a) and π^{n+1}_H(· | s) ← argmax_{π∈∆(A)} Σ_{a∈A} Q^n_H(s, a)π(a).
4: for h = H - 1, ..., 1 do
5:   Set Q^n_h(s, a) ← r̂^n_h(s, a) + u^n_h(s, a) + ∫_S P̂^n_h(s′ | s, a) V^n_{h+1}(s′) ds′.
6:   Set V^n_h(s) ← max_{a∈A} Q^n_h(s, a) and π^{n+1}_h(· | s) ← argmax_{π∈∆(A)} Σ_{a∈A} Q^n_h(s, a)π(a).
7: end for
8: Set π^{n+1}_{-1} and π^{n+1}_0 to the uniform policy U(A).
9: Return: π^{n+1} ← {π^{n+1}_h}_{h=-1}^H.
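The roll-out-then-uniform structure of the sampling scheme can be sketched in code. This is a minimal illustration under assumed interfaces: the toy `ChainEnv`, the `policy` callable, and the omission of the dummy steps h = -1, 0 are our simplifications, not part of the paper's algorithm.

```python
import random

class ChainEnv:
    """Toy deterministic chain MDP, used only to exercise the scheme."""
    num_actions = 2
    def reset(self):
        return 0
    def step(self, s, a):
        return s + a, float(a)  # (next state, reward)

def sample_iteration(env, policy, H, datasets, rng):
    """One iteration of the sampling scheme. For each h_bar, roll out the
    optimistic policy for h_bar steps, take two uniform actions, and file
    the transition observed at step h_bar + i into the dataset indexed by
    (h_bar + i, i), where i counts the preceding uniform actions."""
    for h_bar in range(1, H + 1):
        traj, s = [], env.reset()
        for h in range(1, h_bar + 3):
            a = policy(h, s, rng) if h <= h_bar else rng.randrange(env.num_actions)
            s_next, r = env.step(s, a)
            traj.append((s, a, r, s_next))
            s = s_next
        for i in range(3):
            datasets.setdefault((h_bar + i, i), []).append(traj[h_bar + i - 1])

rng = random.Random(0)
D = {}
sample_iteration(ChainEnv(), lambda h, s, rng_: 0, H=3, datasets=D, rng=rng)
# Each h_bar in {1, 2, 3} contributes one transition to the datasets
# indexed by (h_bar, 0), (h_bar + 1, 1), and (h_bar + 2, 2).
```

Datasets with larger index i were collected after more uniform actions and therefore cover the state-action space more broadly, mirroring the coverage property the proofs rely on.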
We remark that we can also apply other model-based algorithms, including Dyna (Sutton, 1991) and Gradient-Aware Model-based Policy Search (D'Oro et al., 2020), to compute the optimal policy of the learned model.
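The backward induction of Algorithm 4 can be sketched for a finite state-action space, where the integral over S becomes a matrix-vector product. This is an illustrative sketch under the assumption of tabular representations, not the paper's implementation:

```python
import numpy as np

def plan(r_hat, u, P_hat):
    """Backward induction on the learned model with an additive bonus.
    r_hat, u: shape (H, S, A); P_hat: shape (H, S, A, S) with each
    P_hat[h, s, a] a distribution over next states."""
    H, S, A = r_hat.shape
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = r_hat[h] + u[h] + P_hat[h] @ V[h + 1]  # optimistic Q-values
        V[h] = Q.max(axis=1)
        pi[h] = Q.argmax(axis=1)
    return pi, V

# Illustrative random model with H = 3 steps, 4 states, 2 actions.
rng = np.random.default_rng(0)
H, S, A = 3, 4, 2
P = rng.random((H, S, A, S))
P /= P.sum(axis=-1, keepdims=True)
r = rng.random((H, S, A))
pi_plain, V_plain = plan(r, np.zeros((H, S, A)), P)
pi_bonus, V_bonus = plan(r, 0.5 * np.ones((H, S, A)), P)
```

A pointwise nonnegative bonus can only raise the backed-up values, which is the mechanism behind the near-optimism established in Lemma E.5.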

C SUPPLEMENTARY FOR NEURAL NETWORKS

In this section, we provide more details on neural networks. In the first subsection, we introduce the definition of the covering number, which measures the complexity of a function class, and provide two examples of neural network classes in detail, showing that they satisfy our assumptions. We remark that our analysis is not restricted to the examples below and extends to other neural network classes as long as Assumption 5.2 is satisfied with γ ≤ 1/2. In the second subsection, we provide a lemma that compares the log-covering number of an NN class with its Eluder dimension. Lemma C.7 in Appendix §C.2 shows that removing the dependency of the sample complexity on the Eluder dimension greatly improves the sample complexity.

C.1 EXAMPLES

We begin with the definition of the covering number.

Definition C.1 (Covering Number). Let (Φ, ‖·‖) be a normed space and Φ_0 ⊂ Φ. The set {φ_1, φ_2, ..., φ_s} is an ε-covering of Φ_0 if sup_{φ∈Φ_0} inf_{i∈[s]} ‖φ - φ_i‖ ≤ ε. We define the covering number N(Φ_0, ε, ‖·‖) as the minimum size of such a covering.

In what follows, we provide two examples of neural networks in detail and show that they satisfy our assumptions.

Example 1. The first example is the s-sparse neural network class. Schmidt-Hieber (2020) shows that the s-sparse neural network class satisfies Assumptions 5.1 and 5.2. We first introduce the definition.

Definition C.2 (The s-Sparse Neural Network Class). We define σ(x) = max{x, 0}. For a vector v ∈ R^r, we define the shifted activation function σ_v: R^r → R^r as σ_v(y_1, ..., y_r) = (σ(y_1 - v_1), ..., σ(y_r - v_r)). A neural network with network architecture (L, p) is a function of the form

f: R^{p_0} → R^{p_{L+1}},  f(x) = W_L σ_{v_L} · · · σ_{v_1} W_0 x,  (C.1)

for x ∈ R^{p_0}. Here L is the depth of the neural network, W_i is a p_{i+1} × p_i weight matrix, v_i ∈ R^{p_i} is a shift vector, and p = (p_0, ..., p_{L+1}). The neural network class F_nn(L, p) is defined as

F_nn(L, p) = { f | f takes the form of (C.1) with max_{j=0,...,L} ‖W_j‖_∞ ∨ ‖v_j‖_∞ ≤ 1 }.

The s-sparse neural network class is defined as

F_nn(L, p, s, M) = { f ∈ F_nn(L, p) | Σ_{j=0}^L ( ‖W_j‖_0 + ‖v_j‖_0 ) ≤ s, sup_{x∈X} ‖f(x)‖_∞ ≤ M }.

The boundedness of the class F_nn(L, p, s, M) follows directly from the definition. Moreover, Schmidt-Hieber (2020) bounds the covering number of F_nn(L, p, s, M) from above in the following lemma.

Lemma C.3 (Lemma 5 in Schmidt-Hieber (2020)). We define V = Π_{l=0}^{L+1} (p_l + 1). Then, for any ε > 0, we have

log N( F_nn(L, p, s, M), ε, ‖·‖_∞ ) ≤ (s + 1) log( 2(L + 1)V^2/ε ).

Lemma C.3 verifies that the s-sparse neural network class satisfies Assumption 5.2 with γ = 0.
Therefore, the s-sparse neural network class defined in Definition C.2 satisfies Assumptions 5.1 and 5.2.

Example 2. The second example is the recurrent neural network class. Chen et al. (2019) show that the recurrent neural network class satisfies Assumptions 5.1 and 5.2. We first introduce the definition.

Definition C.4 (Recurrent Neural Network Class). A recurrent neural network f(X_t, t; U, V, W, h_0) is a mapping that, when parameterized by U ∈ R^{d_h×d_h}, V ∈ R^{d_y×d_h}, W ∈ R^{d_h×d_x}, and h_0 ∈ R^{d_h}, takes X_t = (x_1, ..., x_t) ∈ R^{d_x×t} and t as input and returns Y_t = (y_1, ..., y_t) ∈ R^{d_y×t} via

h_t = σ_h(U h_{t-1} + W x_t),  y_t = σ_y(V h_t),

where σ_h and σ_y are two nonlinear activation functions. We define the function class F_RNN(t, B) as the set of functions of the form f(·, t; U, V, W, h_0), where f is a recurrent neural network that only takes bounded input and the spectral norms of U, V, and W are all bounded by B.

Chen et al. (2019) analyze the recurrent neural network class under the following assumption.

Assumption C.5. We assume that the activation functions σ_h and σ_y are Lipschitz with parameters ρ_h and ρ_y, respectively, with σ_h(0) = σ_y(0) = 0. Additionally, we assume that σ_h is entrywise bounded by M.

Lemma C.6 (Covering Number of RNNs, Chen et al. (2019)). Under Assumption C.5, for any ε > 0, we have

log N( F_RNN(t, B), ε, ‖·‖_∞ ) ≤ 3d^2 log( 1 + 6ct√d ((ρ_h B)^t - 1) / ( ε(ρ_h B - 1) ) ),

where d = d_h(d_x + d_y + d_h) and c = ρ_y ρ_h B^3 max{1, ρ_h B}.

Lemma C.6 verifies that the recurrent neural network class satisfies Assumption 5.2 with γ = 0. Therefore, the recurrent neural network class defined in Definition C.4 satisfies Assumptions 5.1 and 5.2.
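Definition C.2 is straightforward to instantiate. The sketch below evaluates f(x) = W_L σ_{v_L} · · · σ_{v_1} W_0 x and counts the nonzero parameters that the sparsity constraint bounds by s; the concrete matrices are illustrative choices with entries bounded by 1, not taken from the paper.

```python
import numpy as np

def sparse_relu_net(x, weights, shifts):
    """Evaluate f(x) = W_L sigma_{v_L} ... sigma_{v_1} W_0 x from (C.1),
    where sigma_v(y) = max(y - v, 0) entrywise."""
    y = weights[0] @ x
    for W, v in zip(weights[1:], shifts):
        y = W @ np.maximum(y - v, 0.0)
    return y

def sparsity(weights, shifts):
    """sum_j (||W_j||_0 + ||v_j||_0), the quantity bounded by s."""
    return sum(int(np.count_nonzero(M)) for M in weights + shifts)

# A tiny instance with architecture (L, p) = (2, (2, 3, 3, 1)).
weights = [np.array([[1.0, 0.0], [0.0, 0.5], [0.0, 0.0]]),
           np.eye(3),
           np.array([[1.0, -1.0, 0.0]])]
shifts = [np.zeros(3), np.array([0.0, 0.5, 0.0])]
out = sparse_relu_net(np.array([1.0, 1.0]), weights, shifts)  # -> array([1.])
```

Here the sparsity count is 8 nonzero parameters, so this instance belongs to F_nn(2, p, s, M) for any s ≥ 8 and sufficiently large M.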

C.2 COMPARISON OF ELUDER DIMENSION AND LOG-COVERING NUMBER

In this section, we provide a lemma that uses an example to illustrate that the log-covering number of an NN class can grow much more moderately than the Eluder dimension as the covering radius decays.

Lemma C.7. Let R be the one-layer NN class defined as

R = { r_{θ,b}: A → R | r_{θ,b}(a) = σ(θ^⊤ a + b), θ ∈ R^d, ‖θ‖_2 ≤ 1, 0 ≤ b ≤ 1 },

where σ is the activation function in Definition C.2. Suppose that A ⊂ R^d and ‖a‖_2 ≤ 1 for all a ∈ A. Then the ε-Eluder dimension of R is at least Ω(ε^{-(d-1)}), while the ε-log-covering number is at most O(log(1/ε)).

Proof. See Appendix §J.4 for a detailed proof.

Lemma C.7 suggests that the log-covering number of an NN class can be much smaller than its Eluder dimension. Our algorithm is sample efficient because the sample complexity of ELNF depends on the log-covering number of the NN class instead of the Eluder dimension. We remark that the NN class in Lemma C.7 also satisfies Assumption 5.2 with γ = 0.
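The O(log(1/ε)) covering bound reflects the fact that r_{θ,b} is 1-Lipschitz in the parameters (θ, b) with respect to the sup norm over the unit ball, so an ε-net of the (d + 1)-dimensional parameter set, whose log-size is O(log(1/ε)) for fixed d, induces an ε-net of R. A randomized check of this Lipschitz step (our illustration, not the proof in Appendix §J.4):

```python
import math
import random

random.seed(0)
d = 5

def relu_unit(theta, b, a):
    """r_{theta,b}(a) = sigma(theta^T a + b) with sigma(x) = max(x, 0)."""
    return max(sum(t * x for t, x in zip(theta, a)) + b, 0.0)

def rand_unit_ball(d):
    """Rejection-sample a point from the d-dimensional unit ball."""
    while True:
        v = [random.uniform(-1, 1) for _ in range(d)]
        if sum(x * x for x in v) <= 1.0:
            return v

ratios = []
for _ in range(1000):
    th1, th2, a = rand_unit_ball(d), rand_unit_ball(d), rand_unit_ball(d)
    b1, b2 = random.random(), random.random()
    gap = abs(relu_unit(th1, b1, a) - relu_unit(th2, b2, a))
    # |sigma(u) - sigma(v)| <= |u - v| and ||a||_2 <= 1 give
    # gap <= ||th1 - th2||_2 + |b1 - b2|.
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(th1, th2))) + abs(b1 - b2)
    if dist > 1e-9:
        ratios.append(gap / dist)
max_ratio = max(ratios)
```

The large Eluder dimension, by contrast, comes from the geometry of the ReLU's kink on the sphere and is not visible from this parameter-space argument.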

D SUPPLEMENTARY FOR REPRODUCING KERNEL HILBERT SPACES

As shown in Nachum & Yang (2021) and Ren et al. (2021) , the transition kernel in our model is closely related to the RKHS corresponding to the Gaussian kernel. In this section, we provide more details on the RKHS. In the first subsection, we provide the definition of RKHSs and some basic properties of RKHSs. In the second subsection, we lay out several properties of the population operator and the empirical operator defined in (E.4) and (E.2).

D.1 BASIC CONCEPT

We consider the use of an RKHS for designing the exploration bonus. We first introduce the definitions of RKHSs and positive-definite kernels.

Definition (Reproducing Kernel Hilbert Space). Let H be a vector space of functions that take elements of X as input and take values in R, equipped with an inner product ⟨·, ·⟩_H: H × H → R. A function K: X × X → R is a reproducing kernel of H if it satisfies the following properties: (1) K(x, ·) ∈ H for all x ∈ X, and (2) ⟨g, K(x, ·)⟩_H = g(x) for all x ∈ X and g ∈ H. We call H a reproducing kernel Hilbert space (RKHS) if it is a Hilbert space with a reproducing kernel K.

Definition (Positive-Definite Kernel). A kernel function K: X × X → R is positive definite if Σ_{i=1}^n Σ_{j=1}^n a_i a_j K(x_i, x_j) ≥ 0 for all n ≥ 1, (a_1, ..., a_n) ∈ R^n, and (x_1, ..., x_n) ∈ X^n.

The following lemma reveals the connection between MDPs with neural dynamics and RKHSs.

Lemma. There exists an RKHS H such that K(x_1, x_2) = exp(-‖x_1 - x_2‖_2^2/2) is the kernel of H. This kernel is called the Gaussian kernel.

Proof. Rahimi et al. (2007) show that the Gaussian kernel is a positive-definite kernel on X ⊂ R^m. The Moore-Aronszajn theorem (Aronszajn, 1950) shows that for every positive-definite kernel there exists an RKHS H associated with the kernel. Thus, we conclude the proof.

We remark that, by the similarity between different RKHSs, our analysis can be adapted to the case where the right-hand side of (3.4) is another kernel function.

Definition D.1 (Generalization of Definition 3.1). An episodic Markov decision process (S, A, H, P*, r*) is an MDP with neural dynamics if its reward functions r* = {r*_h}_{h=1}^H ⊂ R, where R is a known reward function class that consists of neural networks, and the transition kernels of the MDP P* = {P*_h}_{h=1}^H take the following form,

P*_h(s_{h+1} | s_h, a_h) = K( φ*_h(s_h, a_h), ψ*_{h+1}(s_{h+1}) ).
Here K is a positive-definite kernel, φ*_h ∈ Φ and ψ*_{h+1} ∈ Ψ are two unknown neural networks with φ*_h: R^d × A → R^m and ψ*_{h+1}: R^d → R^m, and Φ and Ψ are two known feature map classes that consist of neural networks. We denote by M the set of all transition kernels that take the form of (3.4), and by X ⊂ R^m the space of the embedded features. We remark that when K(x_1, x_2) = x_1^⊤ x_2 in Definition D.1, we recover the linear MDP with unknown feature maps and nonlinear reward functions.
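The positive definiteness of the Gaussian kernel can be checked numerically: every Gram matrix it generates has nonnegative eigenvalues. A small sanity check (illustrative only; it does not replace the argument via Rahimi et al. (2007)):

```python
import numpy as np

def gaussian_gram(X):
    """Gram matrix K_ij = exp(-||x_i - x_j||_2^2 / 2) for the rows of X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / 2.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
K = gaussian_gram(X)
eigs = np.linalg.eigvalsh(K)
# Positive definiteness of the kernel means every Gram matrix is positive
# semidefinite: all eigenvalues are >= 0 up to floating-point error.
```

The same computation with K(x, y) = x^⊤ y would recover the Gram matrix of the linear-MDP special case mentioned above.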

D.2 OPERATOR PROPERTIES

In this subsection, we provide the properties of the operators defined in (E.2) and (E.4). The following lemma shows that Γ_e[φ, D, λ] and Γ_p[φ, ρ, λ, τ] are both positive-definite operators on H, which guarantees the existence of the corresponding inverses.

Lemma D.2. Let ρ be a probability measure over X, and let L^2_ρ(X) be the set of all functions that are square-integrable on X with respect to ρ. We have H ⊂ L^2_ρ(X). When we define the operator Γ as Γg(x) = ∫_X K(x′, x) g(x′) dρ(x′), we have Γg ∈ H when g ∈ L^2_ρ(X). Moreover, when Γ is regarded as an operator from H to H, it is non-negative definite.

Proof. See Appendix §J.5 for a detailed proof.

The following lemma shows that the bonus induced by the inverse of the empirical operator can be computed efficiently.

Lemma D.3. Let H be an RKHS with kernel K and feature map k. For X_n = {x_1, ..., x_n}, we define Γ[X_n]: H → H as Γ[X_n]g(x) = λ g(x) + Σ_{i=1}^n g(x_i) K(x_i, x). Then we have

‖k(x)‖^2_{Γ^{-1}[X_n]} = ( K(x, x) - k[X_n](x)^⊤ ( λI + K[X_n] )^{-1} k[X_n](x) ) / λ,

where k[X_n](x) = (K(x_1, x), ..., K(x_n, x))^⊤ ∈ R^n and K[X_n] = [K(x_{τ_1}, x_{τ_2})]_{τ_1,τ_2=1}^n ∈ R^{n×n}.

Proof. See Appendix §J.6 for a detailed proof.

We end this subsection by bounding the operators Γ_e[φ, D, λ] and Γ_p[φ, ρ, λ, τ] from above and below.

Lemma D.4. For any g ∈ H, we have

λ ⟨g, g⟩_H ≤ ⟨g, Γ_e[φ, D, λ] g⟩_H ≤ ( λ + Σ_{(s,a,r,s′)∈D} K(φ(s, a), φ(s, a)) ) ⟨g, g⟩_H,
λ ⟨g, g⟩_H ≤ ⟨g, Γ_p[φ, ρ, λ, τ] g⟩_H ≤ ( λ + τ E_{(s,a)∼ρ}[ K(φ(s, a), φ(s, a)) ] ) ⟨g, g⟩_H.

Proof. See Appendix §J.7 for a detailed proof.

E PROOF OF THEOREM 5.3

We first restate the theorem with a clearer description of the choice of the parameters.

Theorem (Theorem 5.3, restated). For an MDP with neural dynamics that satisfies Assumptions 5.1 and 5.2 with γ < 1/2, we set

λ = C N^{γ/(1+γ)} m log(48HRN/δ),  β = 2(H + 1) √( 4λv^2/c^2_min + 10 r^2_max ζ^2 |A| )

in Algorithm 2 (ELNF), where m is the dimension of the image of the feature maps.
Suppose we can obtain the exact solutions to the optimization problems (4.3) and (4.4). Then, for the policy π̂ returned by Algorithm 2 (ELNF),

J(π*; r*, P*) - J(π̂; r*, P*) ≤ C H^5 |A|^2 N^{(2γ-1)/(2+2γ)} ( log(48HRN/δ) )^{5/2} (log N)^{m+1}

holds with probability at least 1 - δ. Here C is a constant that depends only on the dimension m, the bound R on the feature maps, and C_net in Assumption 5.2, and ζ is defined as ζ = C_2 log(20HRN/δ) N^{γ/(1+γ)}, where C_2 is a constant that depends on the regularity parameters and the parameters of the NN class.

Proof. We define the normalization function c[φ*_h, ψ*_{h+1}](s_h, a_h) as

c[φ*_h, ψ*_{h+1}](s_h, a_h) = ∫_S exp( -‖φ*_h(s_h, a_h) - ψ*_{h+1}(s_{h+1})‖_2^2/2 ) ds_{h+1}.  (E.1)

By (3.4) and (E.1), the transition kernel P*_h can be written as

P*_h(s_{h+1} | s_h, a_h) = exp( -‖φ*_h(s_h, a_h) - ψ*_{h+1}(s_{h+1})‖_2^2/2 ) / c[φ*_h, ψ*_{h+1}](s_h, a_h).

We define the bounds on the normalization function as

c_max = sup_{φ∈Φ, ψ∈Ψ, (s,a)∈S×A} c[φ, ψ](s, a),  c_min = inf_{φ∈Φ, ψ∈Ψ, (s,a)∈S×A} c[φ, ψ](s, a),

and set r_max = c_max/c_min. When Assumption 5.1 holds, we have c_max ≤ v, c_min ≥ v exp(-2R^2), and r_max ≤ exp(2R^2), where v is the Lebesgue measure of S. Since the kernel function K(x_1, x_2) = exp(-‖x_1 - x_2‖_2^2/2) is a positive-definite kernel, it induces an RKHS. We denote by H the RKHS induced by K and by k the corresponding feature map, which allows us to incorporate RKHS techniques into our analysis. The following lemma reveals the relationship between the bonus we define and H.

Lemma E.1. For the bonus u^n_h defined in (4.5), we have

u^n_h(s, a) = min{ 2H + 2, β ‖k(φ̂^n_h(s, a))‖_{Γ_e^{-1}[φ̂^n_h, D^n_{h,1}, λ]} }.

Proof. We conclude the proof by directly applying Lemma D.3.

In the sequel, we introduce two good events. We first define the distributions and the population operator that appear in the definitions of the good events. For an index n ∈ [N] and i ∈ {0, 1, 2}, we denote by ϑ^n_{h,i} the distribution of (s_h, a_h) when s_1 = s_init, the state s_{h̄+1} ∼ P*_{h̄}(· | s_{h̄}, a_{h̄}), the action a_{h̄} ∼ π^n_{h̄}(· | s_{h̄}) for h̄ ∈ [h - i], and a_{h̄} ∼ U(A) for h̄ ∈ {h - i + 1, . . .
, h}, where U(A) is the uniform distribution over the action set A. We further define the measure ρ^n_{h,i} as

ρ^n_{h,i}(s, a) = Σ_{τ=1}^n ϑ^τ_{h,i}(s, a)/n,  for all (s, a) ∈ S × A.  (E.3)

For a distribution ρ over S × A, we define the population operator as

Γ_p[φ, ρ, λ, τ] g(x) = λ g(x) + τ E_{(s,a)∼ρ}[ K(x, φ(s, a)) g(φ(s, a)) ],  (E.4)

and denote by Γ_p^{-1}[φ, ρ, λ, τ] the corresponding inverse operator. For the empirical operator and the population operator, we have Γ_e[φ, D, λ]g ∈ H and Γ_p[φ, ρ, λ, τ]g ∈ H when g ∈ H. The operators in (E.2) and (E.4) are also positive definite and self-adjoint for any dataset D when λ, τ > 0, which guarantees the existence of their inverses. We denote by E_1 the event that the bonus defined by the empirical operator is bounded from above and below by the bonus defined by the population operator, that is,

(1/2) ‖k(φ(s, a))‖_{Γ_p^{-1}[φ, ρ^n_{h,i}, λ, n]} ≤ ‖k(φ(s, a))‖_{Γ_e^{-1}[φ, D^n_{h,i}, λ]} ≤ 2 ‖k(φ(s, a))‖_{Γ_p^{-1}[φ, ρ^n_{h,i}, λ, n]}  (E.5)

for any φ ∈ Φ, (s, a) ∈ S × A, i ∈ {0, 1, 2}, and (h, n) ∈ [H] × [N]. When E_1 holds, the collected data contain enough information for us to design the bonus for exploration. We denote by E_2 the event that

E_{(s,a)∼ρ^n_{h,i}}[ TV( P̂^n_h(· | s, a), P*_h(· | s, a) )^2 ] ≤ ζ^2/n,  (E.6)
E_{(s,a)∼ρ^n_{h,i}}[ | r̂^n_h(s, a) - r*_h(s, a) |^2 ] ≤ ζ^2/n  (E.7)

for all (h, n) ∈ [H] × [N] and i ∈ {1, 2}, where TV(p_1, p_2) is the total variation distance between two distributions. When E_2 holds, our estimators of the model are accurate in the sense that their population risks are small. The following lemmas show that E_1 and E_2 hold with high probability.

Lemma E.2 (Concentration of Inverse Covariance). Under Assumption 5.2, when we set λ = C_1 m N^{γ/(1+γ)} log(48HRN/δ), the event E_1 defined in (E.5) holds with probability at least 1 - δ/4. Here C_1 is a constant that depends only on the parameters of the NN class.

Proof. See Appendix §F for a detailed proof.

Lemma E.3 (Estimation Error). We assume that Assumption 5.2 holds with γ < 2.
The event E_2 defined in (E.6) and (E.7) then holds with probability at least 1 - δ/5 when ζ = C_2 log(20HRN/δ) N^{γ/(1+γ)}. Here C_2 is a constant that depends on the regularity parameters and the parameters of the NN class.

Proof. See Appendix §G for a detailed proof.

Lemmas E.2 and E.3 show that the good events E_1 and E_2 hold with high probability. In the remaining part of the proof, we condition on the events E_1 and E_2. We also define the effective dimension as follows.

Definition E.4 (Effective Dimension). Let H be an RKHS with kernel K. For X_n = {x_1, ..., x_n}, we define the matrix K[X_n] = [K(x_{τ_1}, x_{τ_2})]_{τ_1,τ_2=1}^n and

Λ_1(n, λ_0) = sup_{X_n ⊂ X} log det( I_n + K[X_n]/λ_0 ).

Let k be the feature map of H and ∆(X) the set of distributions over X. We define

Λ_2(n, λ_0) = sup_{ρ∈∆(X)} n E_{x∼ρ}[ ‖k(x)‖^2_{Γ_p^{-1}[ρ, λ_0, n]} ],

where the population operator Γ_p: H → H is defined as Γ_p[ρ, λ_0, n] f(x) = λ_0 f(x) + n E_{x_0∼ρ}[ K(x, x_0) f(x_0) ]. We define d_eff = max{ Λ_1(N + 1, λ), Λ_2(N, λ) }.

The quantity Λ_1 is the maximum information gain in Srinivas et al. (2009), which is closely related to Gaussian process regression. The quantity Λ_2 is closely related to the notion of dimension in finite-dimensional RKHSs; see Appendix §H for a brief discussion. The effective dimension d_eff is closely related to the sample complexity.

We are now ready to present the proof of Theorem 5.3. Our goal is to bound J(π*; r*, P*) - J(π̂; r*, P*). Since the policy π̂ returned by Algorithm 2 (ELNF) is the mixture of π^1, ..., π^N, we have

J(π*; r*, P*) - J(π̂; r*, P*) = (1/N) Σ_{n=1}^N [ J(π*; r*, P*) - J(π^{n+1}; r*, P*) ].
(E.8) We decompose the suboptimality as

Σ_{n=1}^N [ J(π*; r*, P*) - J(π^{n+1}; r*, P*) ] = Σ_{n=1}^N z_{n,1} + Σ_{n=1}^N z_{n,2} + Σ_{n=1}^N z_{n,3},  (E.9)

where the three sums are Term (a), Term (b), and Term (c), respectively, u^n = {u^n_h}_{h=1}^H is the bonus defined in (4.5), and the {z_{n,j}}_{(n,j)∈[N]×[3]} are defined as

z_{n,1} = J(π*; r*, P*) - J(π*; r̂^n + u^n, P̂^n),
z_{n,2} = J(π*; r̂^n + u^n, P̂^n) - J(π^{n+1}; r̂^n + u^n, P̂^n),
z_{n,3} = J(π^{n+1}; r̂^n + u^n, P̂^n) - J(π^{n+1}; r*, P*).

In the sequel, we bound the terms in (E.9) from above separately.

Term (a). To bound Term (a), we introduce the following characterization of the bonus.

Lemma E.5 (Almost Optimistic for the Planning Phase). Under the same conditions as Theorem 5.3, conditioned on the good events E_1 and E_2 defined in (E.5), (E.6), and (E.7), we have

J(π; r̂^n + u^n, P̂^n) - J(π; r*, P*) ≥ -(H + 1) √|A| ζ/√n

for any policy π. Here the bonus u^n = {u^n_h}_{h=1}^H is defined in (4.5).

Proof. See Appendix §I.1 for a detailed proof.

By Lemma E.5, we have Term (a) ≤ Σ_{n=1}^N (H + 1) √|A| ζ/√n ≤ 8H √|A| ζ √N.

Term (b). The following lemma bounds Term (b) from above.

Lemma E.6. For the policy π^{n+1} returned by Algorithm 4 (Planning Algorithm), we have J(π*; r̂^n + u^n, P̂^n) - J(π^{n+1}; r̂^n + u^n, P̂^n) ≤ 0.

Proof. See Appendix §I.2 for a detailed proof.

By Lemma E.6, we have Term (b) = Σ_{n=1}^N [ J(π*; r̂^n + u^n, P̂^n) - J(π^{n+1}; r̂^n + u^n, P̂^n) ] ≤ 0.

Term (c). The following lemma bounds Term (c) from above.

Lemma E.7 (Bounded Optimism). Under the same conditions as Theorem 5.3, we have

Σ_{n=1}^N [ J(π^{n+1}; r̂^n + u^n, P̂^n) - J(π^{n+1}; r*, P*) ] ≤ 46H^2 √|A| ζ β_1 √( d_eff N log(10H/δ) )

with probability at least 1 - δ/2. Here β_1 = (4H^2 + 6H + 2) √( 4λv^2/c^2_min + 4 r^2_max |A| β^2 ζ^2 d_eff ) and ζ is defined in Lemma E.3.

Proof. See Appendix §I.3 for a detailed proof.

We denote by E_3 the event defined in Lemma E.7. In the remaining part of the proof, we also condition on the event E_3.
By Lemma E.7, we have Term (c) ≤ 46H 2 |A| ζβ 1 √ d eff N log(10H/δ). Combining the upper bounds of the terms in (E.9), we have  β 1 ≤ 600H 3 ζv 2 c 2 max /c 4 min d eff |A| √ λ + |A| (E.11) Combining (E.10), (E.11) with the value of λ in Theorem 5.3, we have N n=1 J(π * ; r * , P * ) -J(π n+1 ; r * , P * ) ≤ 36000H 5 v 2 c 2 max |A| 2 ζ 2 d eff √ λN log(10H/δ)/c 4 min ≤ C 3 H 5 |A| 2 d eff N (1+4γ)/(2+2γ) log(48HRN/δ) 5/2 (E.12) where C 3 is a constant that only depends on the dimension of the feature m, the bound of the feature maps R, and C net in Assumption 5.2. The following lemma bounds d eff from the above. Lemma E.8. For the Gaussian kernel K(x 1 , x 2 ) = exp(-x 1 -x 2 2 2 /2), we have d eff ≤ C 4 (log N ) m+1 . Here C 4 is a constant that only depends on the dimension m and the radius R. Proof. See Appendix §H.1 for a detailed proof. Combining (E.8), (E.12) with Lemma E.8, we have J(π * ; r * , P * ) -J( π; r * , P * ) ≤ C 5 H 5 |A| 2 N (2γ-1)/(2+2γ) log(48HRN/δ) 5/2 (log N ) m+1 , where C 5 is a constant that only depends on the dimension of the feature m, the bound of the feature maps R, and C net in Assumption 5.2. Thus, we conclude the proof of Theorem 5.3.
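As an aside, the quantity Λ 1 in Definition E.4 is the standard log-determinant information gain. The following Python sketch (purely illustrative and not part of the proof; the sample points, kernel bandwidth, and λ are arbitrary choices) computes the quantity inside the supremum defining Λ 1 for the Gaussian kernel:

```python
import math
import random

def gaussian_kernel(x, y):
    # K(x, y) = exp(-||x - y||_2^2 / 2), the kernel used throughout Appendix H.
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)

def log_det(mat):
    # Log-determinant of a symmetric positive-definite matrix via Cholesky.
    n = len(mat)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][j] = math.sqrt(mat[i][i] - s)
            else:
                chol[i][j] = (mat[i][j] - s) / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))

def information_gain(points, lam):
    # log det(I_n + K[X_n] / lam): the quantity whose sup over X_n defines Lambda_1.
    n = len(points)
    mat = [[(1.0 if i == j else 0.0) + gaussian_kernel(points[i], points[j]) / lam
            for j in range(n)] for i in range(n)]
    return log_det(mat)

random.seed(0)
pts = [(random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)) for _ in range(40)]
gain = information_gain(pts, lam=1.0)
```

Hadamard's inequality gives the crude upper bound log det ≤ n log(1 + 1/λ); the computed value is typically far smaller, in line with the polylogarithmic bound of Lemma E.8.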

F CONCENTRATION OF THE INVERSE COVARIANCE

In this section, we provide the proof of Lemma E.2, which shows that Event E 1 defined in (E.5) holds with high probability. We first show that we can prove Lemma E.2 by the concentration of covariance using Lemma F.2. We then prove the concentration of the covariance for a fixed feature map using the concentration inequality in Lemma F.3. Next, we take a union bound to prove the uniform concentration for a covering of the feature map class, and use the property of the covering to prove the uniform concentration of the covariance for the whole feature map class, which concludes the proof of Lemma F.2. Proof. We first introduce the filtration we use for our analysis. Definition F.1 (Filtration). For any n ∈ [N ], we define F n as the σ-algebra generated by the trajectories in the first n loops of Algorithm 2 (ELNF). By taking a union bound, we only need to show that for a fixed n ∈ [N ] and i ∈ {0, 1, 2}, we have k φ(s, a) 2 Γ -1 p [φ,ρ n h,i ,λ,n] /4 ≤ k φ(s, a) 2 Γ -1 e [φ,D n h,i ,λ] ≤ 4 k φ(s, a) 2 Γ -1 p [φ,ρ n h,i ,λ,n] for any φ ∈ Φ and (s, a) ∈ S × A with probability 1 -δ/(24HN ). Since we have φ(s, a) 2 ≤ R, it remains to show that with probability 1 -δ/(24HN ), we have k(x) 2 Γ -1 p [φ,ρ n h,i ,λ,n] /4 ≤ k(x) 2 Γ -1 e [φ,D n h,i ,λ] ≤ 4 k(x) 2 Γ -1 p [φ,ρ n h,i ,λ,n] for any x ∈ R m with x 2 ≤ R. We first prove that k(x) 2 Γ -1 e [φ,D n h,i ,λ] ≤ 4 k(x) 2 Γ -1 p [φ,ρ n h,i ,λ,n] for any x ∈ R m with x 2 ≤ R with probability at least 1 -δ/(48HN ). The following lemma allows us to prove the concentration of the inverse covariance by the concentration of the covariance. Lemma F.2. Let H be a Hilbert space, and let A, B be two positive-definite and self-adjoint bounded linear operators on H. Suppose that x, Ax H ≥ x, Bx H for all x ∈ D. Then we have x, A -1 x H ≤ x, B -1 x H when B -1/2 C -1 B -1/2 x ∈ D. Here C = (B -1/2 AB -1/2 ) 1/2 . Proof. See Appendix §J.8 for a detailed proof. 
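In finite dimensions, Lemma F.2 reduces to the standard operator-monotonicity fact that A ⪰ B ≻ 0 implies x ⊤ A -1 x ≤ x ⊤ B -1 x. The following sketch checks this on a 2 × 2 example (the matrices are arbitrary illustrative choices, not objects from the proof):

```python
def quad(mat, x):
    # Quadratic form x^T M x for a 2x2 matrix M.
    return (mat[0][0] * x[0] * x[0] + (mat[0][1] + mat[1][0]) * x[0] * x[1]
            + mat[1][1] * x[1] * x[1])

def inv2(mat):
    # Inverse of a 2x2 matrix via the adjugate formula.
    det = mat[0][0] * mat[1][1] - mat[0][1] * mat[1][0]
    return [[mat[1][1] / det, -mat[0][1] / det],
            [-mat[1][0] / det, mat[0][0] / det]]

A = [[3.0, 0.5], [0.5, 2.0]]  # positive definite, with A - B PSD
B = [[1.0, 0.0], [0.0, 1.0]]
invA, invB = inv2(A), inv2(B)

# In 2d, A - B is PSD iff its trace and determinant are nonnegative.
diff = [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]
assert diff[0][0] + diff[1][1] > 0
assert diff[0][0] * diff[1][1] - diff[0][1] * diff[1][0] > 0
```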
For simplicity, we define Γ 1 = Γ p [φ, ρ n h,i , λ, n], Γ 2 = Γ e [φ, D n h,i , λ], H 0 = {k(x) | x ∈ R m , x 2 ≤ R}, Γ = Γ -1/2 1 (Γ -1/2 1 Γ 2 Γ -1/2 1 ) -1/2 Γ -1/2 1 , and H 1 = {Γg | g ∈ H 0 }. (F.1) By Lemma F.2, it remains to show that g 2 Γ1 ≤ 4 g 2 Γ2 for any g ∈ H 1 with probability at least 1 -δ/(48HN ). By the definition of the population operator, we have g 2 Γ1 = g 2 Γp[φ,ρ n h,i ,λ,n] = λ g 2 H + n τ =1 E (s,a)∼ τ h,i g 2 φ(s, a) . Similarly, we have g 2 Γ2 = λ g 2 H + n τ =1 g 2 (φ(s τ h,i , a τ h,i )) . Therefore, we can prove g 2 Γ1 ≤ 4 g 2 Γ2 by the concentration of g 2 . We define N 1 = |C(H 1 , 1 , • ∞ )| , N 2 = |C(Φ, 2 , • ∞,2 )| , (F.2) for simplicity. Here C denotes the covering sets defined in the notation section, and 1 and 2 are tuning parameters. By the Bernstein inequality, we have the following lemma, which can be used to show the concentration of g 2 . Lemma F.3. Suppose that {F τ } n τ =0 is a filtration and {(s τ , a τ )} n τ =1 is a S × A-valued stochastic process adapted to this filtration. We denote by τ the distribution of (s τ , a τ ) conditioned on F τ -1 . For any fixed g ∈ H and φ ∈ Φ, the following inequality holds with probability at least 1 -δ/2. n τ =1 g 2 φ(s τ , a τ ) ≤ 2 n τ =1 E (s,a)∼ τ g 2 φ(s, a) + 2 log(2/δ) g, g H . We also have the following inequality with probability at least 1 -δ/2. n τ =1 E (s,a)∼ τ g 2 φ(s, a) /2 ≤ n τ =1 g 2 φ(s τ , a τ ) + 2 log(2/δ) g, g H . Proof. See Appendix §J.9 for a detailed proof. For any g ∈ C(H 1 , 1 , • ∞ ) and φ ∈ C(Φ, 2 , • ∞,2 ), by Lemma F.3, we have n τ =1 E (s,a)∼ τ h,i g 2 φ(s, a) /2 ≤ n τ =1 g 2 φ(s τ h,i , a τ h,i ) + 2 log(48HN N 1 N 2 /δ) g, g H (F.3) with probability at least 1 -δ/(48HN N 1 N 2 ). Here N 1 and N 2 are defined in (F.2), and τ h,i is the distribution of (s τ h,i , a τ h,i ) conditioned on F τ -1 . By taking a union bound, we have that (F.3) holds for all g ∈ C(H 1 , 1 , • ∞ ) and φ ∈ C(Φ, 2 , • ∞,2 ) with probability at least 1 -δ/(48HN ). 
Therefore, we have P(E n,h,i ) ≥ 1 -δ/(48HN ) when we define E n,h,i as the event that (F.3) holds for all g ∈ C(H 1 , 1 , • ∞ ) and φ ∈ C(Φ, 2 , • ∞,2 ). In the following part of the proof, we condition on Event E n,h,i . For an arbitrary g ∈ H 1 and φ ∈ Φ, we choose g 0 ∈ C(H 1 , 1 , • ∞ ) and φ 0 ∈ C(Φ, 2 , • ∞,2 ) such that sup x∈X |g(x) -g 0 (x)| ≤ 1 and sup (s,a)∈S×A φ(s, a) -φ 0 (s, a) 2 ≤ 2 . We decompose the difference in the expectation as n τ =1 E (s,a)∼ τ h,i g 2 0 φ 0 (s, a) - n τ =1 E (s,a)∼ τ h,i g 2 φ(s, a) (F.4) = n τ =1 E (s,a)∼ τ h,i g 0 φ 0 (s, a) + g φ(s, a) g 0 φ 0 (s, a) -g φ(s, a) . Since sup x∈X |g(x) -g 0 (x)| ≤ 1 , we have g 0 φ 0 (s, a) -g φ(s, a) ≤ 1 + g φ 0 (s, a) -g φ(s, a) (F.5) for any (s, a) ∈ S × A. Combining (F.5) with the reproducing property of H, we have g 0 φ 0 (s, a) -g φ(s, a) ≤ 1 + g, k φ 0 (s, a) -k φ(s, a) H (F.6) ≤ 1 + g H k φ 0 (s, a) -k φ(s, a) H , where the last inequality follows from the Cauchy-Schwarz inequality. For the kernel feature map k, we have k φ 0 (s, a) -k φ(s, a) 2 H = k φ 0 (s, a) 2 H + k φ(s, a) 2 H -2 k φ 0 (s, a) , k φ(s, a) H = 2 1 -exp -φ 0 (s, a) -φ(s, a) 2 /2 ≤ φ 0 (s, a) -φ(s, a) 2 ≤ 2 2 . (F.7) Combining (F.6) with (F.7), we have |g 0 (φ 0 (s, a)) -g(φ(s, a))| ≤ 1 + 2 g H , and g 0 φ 0 (s, a) + g φ(s, a) ≤ 1 + 2 g H + 2 g H . (F.8) We plug (F.8) into (F.4) and have n τ =1 E (s,a)∼ τ h,i g 2 φ(s, a) - n τ =1 E (s,a)∼ τ h,i g 2 0 φ 0 (s, a) (F.9) ≤ n ( 2 2 + 2 2 ) g, g H + (2 1 2 + 2 1 ) g H + 2 1 . The following lemma provides upper and lower bounds on g, g H for g ∈ H 1 . Lemma F.4. For any g ∈ H and the operator Γ defined in (F.1), we have λ g, g H /(λ + n) 3 ≤ Γg, Γg H ≤ (λ + n) g, g H /λ 3 . Proof. See Appendix §J.10 for a detailed proof. Combining the definition of H 1 in (F.1) with Lemma F.4, we have g, g H = Γk(x), Γk(x) H ≥ λ k(x), k(x) H /(λ + n) 3 = λ/(λ + n) 3 , (F.10) where the operator Γ is defined in Lemma F.4. 
Therefore, we plug (F.10) into (F.9) and have n τ =1 E (s,a)∼ τ h,i g 2 φ(s, a) - n τ =1 E (s,a)∼ τ h,i g 2 0 φ 0 (s, a) (F.11) ≤ n ( 2 2 + 2 2 ) + (λ + n) 3/2 (2 1 2 + 2 1 )/λ 1/2 + 2 1 (λ + n) 3 /λ g, g H . By the same argument that yields (F.11), we have n τ =1 g 2 0 φ 0 (s τ h,i , a τ h,i ) - n τ =1 g 2 φ(s τ h,i , a τ h,i ) (F.12) ≤ n ( 2 2 + 2 2 ) + (λ + n) 3/2 (2 1 2 + 2 1 )/λ 1/2 + 2 1 (λ + n) 3 /λ g, g H . By the definition of Event E n,h,i in (F.3), we have n τ =1 E (s,a)∼ τ h,i g 2 0 φ(s, a) 2 ≤ n τ =1 g 2 0 φ(s τ h,i , a τ h,i ) + 2 log(48HN N 1 N 2 /δ) (F.13) for g 0 ∈ C(H 1 , 1 , • ∞ ) when E n,h,i holds. We plug (F.11) and (F.12) into (F.13) and have n τ =1 E (s,a)∼ τ h,i g 2 φ(s, a) 2 (F.14) ≤ n τ =1 g 2 φ(s τ h,i , a τ h,i ) + 2 log(48HN N 1 N 2 /δ) + κ[n, 1 , 2 ] g, g H conditioning on Event E n,h,i defined in (F.3). Here κ[n, 1 , 2 ] is defined as κ[n, 1 , 2 ] = 3n ( 2 2 + 2 2 ) + (λ + n) 3/2 (2 1 2 + 2 1 )/λ 1/2 + 2 1 (λ + n) 3 /λ 2 (F.15) By the definition of κ[n, 1 , 2 ] in (F.15), we have κ[n, 1 , 2 ] ≤ 5n γ/(1+γ) + 6(λ + n) 3/2 /(n 3/2 λ 3/2 ) + 2(λ + n) 3 /(n 4 λ 3 ) ≤ 38n γ/(1+γ) (F.16) when we set 1 = 1/(n 5/2 λ) and 2 = n -1/(1+γ) . Combining (F.14) with (F.16), we have n τ =1 E (s,a)∼ τ h,i g 2 φ(s, a) 2 (F.17) ≤ n τ =1 g 2 φ(s τ h,i , a τ h,i ) + 2 log N 1 + 2 log N 2 + 2 log(48HN/δ) + 38n γ/(1+γ) g, g H . It remains to bound log N 1 and log N 2 from the above. The following lemma bounds N 1 from the above. Lemma F.5. For the set H 1 defined in (F.1), we have N (H 1 , , • ∞ ) ≤ (R 2 (λ + n)/(λ 3 2 )) m/2 . Proof. See Appendix §J.11 for a detailed proof. By the definition of N 1 in (F.2) and Lemma F.5, we have 2 log N 1 ≤ 12m log(nR). By the definition of N 2 in (F.2) and Assumption 5.2, we have 2 log N 2 ≤ C cic,1 n γ/(1+γ) log n, where C cic,1 is a constant that only depends on C net in Assumption 5.2. 
Therefore, by (F.17), we have n τ =1 E (s,a)∼ τ h,i g 2 φ(s, a) /2 ≤ n τ =1 g 2 φ(s τ h,i , a τ h,i ) + 53C cic,1 mN γ/(1+γ) log(48HN R/δ) g, g H conditioning on E n,h,i in (F.3). By choosing λ = 106C cic,1 mN γ/(1+γ) log(48HN R/δ), we have n τ =1 E (s,a)∼ τ h,i g 2 φ(s, a) /2 + λ g, g H /2 ≤ n τ =1 g 2 φ(s τ h,i , a τ h,i ) + λ g, g H (F.18) for any g ∈ H 1 and φ ∈ Φ conditioning on E n,h,i . By the same argument that yields (F.18), we have n τ =1 g 2 φ(s τ h,i , a τ h,i ) + λ g, g H ≤ 2 n τ =1 E (s,a)∼ τ h,i g 2 φ(s, a) + 2λ g, g H for any g ∈ H 1 and φ ∈ Φ conditioning on E n,h,i , which concludes the proof of Lemma E.2.
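A quick numerical check of the Gaussian-feature identity used in (F.7), namely k(x 1 ) -k(x 2 ) 2 H = 2(1 -exp(-x 1 -x 2 2 2 /2)) ≤ x 1 -x 2 2 2 (the test points below are arbitrary; this is an illustration, not part of the proof):

```python
import math

def k(x, y):
    # Gaussian kernel K(x, y) = exp(-||x - y||_2^2 / 2).
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)

def feature_dist_sq(x, y):
    # ||k(x) - k(y)||_H^2 = K(x, x) + K(y, y) - 2 K(x, y), as expanded in (F.7).
    return k(x, x) + k(y, y) - 2.0 * k(x, y)

x, y = (0.3, -0.1), (0.5, 0.2)
d2 = feature_dist_sq(x, y)
sq_euclid = sum((a - b) ** 2 for a, b in zip(x, y))
```

The final inequality is the elementary bound 1 - e^{-t} ≤ t applied with t = ‖x - y‖² / 2.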

G PROOF OF LEMMA E.3

In this section, we provide the proof of Lemma E.3, which shows that Event E 2 defined in (E.6) and (E.7) holds with high probability. We conclude the proof of Lemma E.3 by combining Lemma G.1 and Lemma G.2, which provide upper bounds on the estimation errors of the reward and transition estimators. Lemma G.1 is proven by the standard technique of bounding the generalization error. Lemma G.2 is proven by the same method as Theorem 7.4 in Van de Geer (2000). To prove Lemma G.2, we show that the estimation error in the transition is closely related to the uniform convergence over a function class induced by the transition kernel, and then prove the uniform convergence over the function class using empirical process theory. Proof. The following lemma bounds the population risk of the estimators of the reward from the above. Lemma G.1. Let {F τ } n τ =0 be a filtration and {x τ } n τ =1 be a X -valued stochastic process adapted to this filtration. Suppose r τ is an F τ +1 -measurable random variable with E[r τ | F τ ] = r * (x τ ), where r * ∈ R is an unknown function and R is a known function class. We define the estimator of r * as r = argmin r∈R n τ =1 r τ -r(x τ ) 2 . We also define the population risk as Risk(r) = n τ =1 E[(r(x τ ) -r * (x τ )) 2 | F τ -1 ]/n,

and then have

Risk( r) ≤ 16 log N (R, , • ∞ )/δ /n + 12 with probability at least 1 -δ for any fixed , δ > 0. Proof. See Appendix §G.1 for a detailed proof. We first apply Lemma G.1 with = n -1/(1+γ) ; we then have E (s,a)∼ρ n h,1 | r n h (s, a) -r * h (s, a)| 2 + E (s,a)∼ρ n h,2 | r n h (s, a) -r * h (s, a)| 2 ≤ 24n -1/(1+γ) + 32 log(20HN/δ) + C net n γ/(1+γ) log(n)/(1 + γ) /n ≤ C net,1 n -1/(1+γ) log(20HN/δ) for any (h, n) ∈ [H] × [N ] with probability at least 1 -δ/(20HN ) by Assumption 5.2. Here C net,1 = 56 + 32C net /(1 + γ) is a constant that only depends on the parameters of the neural network classes. By taking a union bound, we have that E (s,a)∼ρ n h,i r n+1 h (s, a) -r * h (s, a) 2 ≤ C net,1 n -1/(1+γ) log(20HN/δ) holds for all (i, h, n) ∈ [2] × [H] × [N ] with probability at least 1 -δ/20. The following lemma bounds the population risk of the estimators of the transition kernel from the above. Lemma G.2. Let {F τ } n τ =0 be a filtration and let {(S τ , A τ , S τ )} n τ =1 be a S ×A×S-valued stochastic process adapted to this filtration. Let ρ τ be the distribution of (S τ , A τ ) conditioned on F τ -1 . We assume that S τ ∼ P * (• | S τ , A τ ) conditioned on (S τ , A τ ) and F τ -1 . We estimate P * by P = argmax P∈M n τ =1 log P(S τ | S τ , A τ ). Let N ( , M, • ∞ ) be the covering number of the transition class M, and define H ∞ ( , M) = log N ( , M, • ∞ ). Let G be a function that satisfies (1) G( )/ 2 is non-increasing, and (2) G( ) ≥ max 8 /2 3 2 /2 17 H 1/2 ∞ (2v) -1/2 u, Ḡ du, , where Ḡ = {ḡ P | ḡP (s, a, s ) = (P(s | s, a) + P * (s | s, a))/2, P ∈ M}. Suppose n satisfies √ n 2 n ≥ C 9 G( n ) , where C 9 is an absolute constant. Then for any ≥ max{ n , 1/ √ n}, we have P 1 n n τ =1 E (s,a)∼ρτ TV 2 P(• | s, a), P * (• | s, a) ≥ 2 ≤ C 9 exp -n 2 /C 2 9 . Proof. See Appendix §G.2 for a detailed proof. The following lemma bounds the covering number of Ḡ in Lemma G.2 from the above. Lemma G.3. 
For two feature map classes Φ and Ψ, we define the density class as M = P : P(s | s, a) = exp -φ(s, a) -ψ(s ) 2 2 /2 c[φ, ψ](s, a) φ ∈ Φ, ψ ∈ Ψ , where the normalization function c[φ, ψ](s, a) is defined in (E.1). We then have H ∞ (δ, Ḡ) ≤ H 2 exp(-R 2 )c 2 min δ R √ c max (v + c max ) , Φ, • ∞,2 + H 2 exp(-R 2 )c 2 min δ R √ c max (v + c max ) , Ψ, • ∞,2 , where Ḡ is defined in Lemma G.2. Proof. See Appendix §G.3 for a detailed proof. By Lemma G.3, when we define Ḡ in the same way as Lemma G.2, we have H ∞ ( , Ḡ) ≤ H 2 (C reg,1 , Φ) + H 2 (C reg,1 , Ψ) ≤ 2C net 1 -log C reg,1 + log(1/ ) /(C γ reg,1 γ ) ≤ C 2 reg,2 1 + log(1/ ) / γ , where C reg,1 = exp(-R 2 )c 2 min / R √ c max (v + c max ) and C reg,2 = 2C net (1 -log C reg,1 )/C γ reg,1 are constants that only depend on the regularity parameters. We then have /8 2 /2 17 H 1/2 ∞ (u, Ḡ)du ≤ /8 2 /2 17 C reg,2 1 + log(1/u)/u γ/2 du ≤ C reg,2 log(2 19 / 2 ) 1-γ/2 /(1 -γ/2) when γ < 2. Therefore, we set G( ) = C reg,2 log(2 15 / 2 ) 1-γ/2 /(1 -γ/2). By such a definition, the solution of √ n 2 n = CG( n ) satisfies n ≤ C reg,3 n -1/(2+γ) log n, where C reg,3 is a constant that only depends on the regularity parameters. Therefore, when we choose = C reg,4 n -1/(2+γ) log(20HN/δ), where C reg,4 is a constant that only depends on the regularity parameters, we have 1 2 2 i=1 E (s,a)∼ρ τ h,i TV 2 P n h (• | s, a), P * h (• | s, a) ≤ C 2 reg,4 n -2/(2+γ) log(20HN/δ) for any (h, n) ∈ [H] × [N ] with probability 1 -δ/(20HN ).

G.1 PROOF OF LEMMA G.1

Proof. We first define τ = r τ -r * (x τ ) and Risk(r) = n τ =1 [r τ -r(x τ )] 2 /n. By the definition, the noise τ is an F τ +1 -measurable random variable with E[ τ | F τ ] = 0. We have r τ -r(x τ ) 2 = 2 τ + 2 τ r * (x τ ) -r(x τ ) + r * (x τ ) -r(x τ ) 2 (G.1) for any fixed r ∈ R. We also have Risk(r * ) = n τ =1 2 τ . 
Since E[ τ | F τ ] = 0, by (G.1), we have E r τ -r(x τ ) 2 -r τ -r * (x τ ) 2 -r * (x τ ) -r(x τ ) 2 | F τ -1 = 0, and Var r τ -r(x τ ) 2 -r τ -r * (x τ ) 2 -r * (x τ ) -r(x τ ) 2 | F τ -1 = Var 2 τ (r * (x τ ) -r(x τ )) | F τ -1 ≤ 4E r * (x τ ) -r(x τ ) 2 | F τ -1 . Applying Lemma J.4 with λ = 1/4, we have Risk(r) -Risk(r * ) -Risk(r) ≥ -(e -2) n τ =1 E r * (x τ ) -r(x τ ) 2 | F τ -1 n -4 log(1/δ)/n with probability at least 1 -δ. By the definition of the population risk Risk(r), we have Risk(r) -Risk(r * ) -(3 -e)Risk(r) ≥ -4 log(1/δ)/n (G.2) with probability at least 1 -δ. Equation (G.2) shows the concentration of the risk for a fixed r ∈ R. For the uniform convergence, we define C as the -covering set of R with the infinity norm, and have Risk(r) -Risk(r * ) -Risk(r)/4 ≥ -4 log(N /δ)/n (G.3) for all r ∈ C with probability at least 1 -δ by taking a union bound. Here the covering number N = |C|. Therefore, when we denote by E R the event that (G.3) holds for all r ∈ C, we have P(E R ) ≥ 1 -δ. In the following part of the proof, we condition on Event E R . For an arbitrary r ∈ R, we choose r ∈ C such that r -r ∞ ≤ . First, we have Risk(r) -Risk(r)/4 -Risk(r ) + Risk(r )/4 = 1 n n τ =1 2r τ -r(x τ ) -r (x τ ) r (x τ ) -r(x τ ) + 1 4n n τ =1 E 2r * (x τ ) -r(x τ ) -r (x τ ) r(x τ ) -r (x τ ) | F τ -1 By the definition of the covering and the boundedness of the reward, we have Risk(r) -Risk(r)/4 -Risk(r ) + Risk(r )/4 ≥ -3 . Therefore, we have Risk(r) -Risk(r)/4 -Risk(r * ) ≥ -4 log(N /δ)/n -3 for all r ∈ R conditioning on E R defined in (G.3). Let r be the minimizer of Risk(r). We then have -Risk( r) ≥ 4Risk( r) -Risk( r) -4Risk(r * ) ≥ -16 log(N /δ)/n -12 conditioning on Event E R . Since P(E R ) ≥ 1 -δ, we conclude the proof of Lemma G.1.
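The decomposition (G.1) underlying this proof is a pure algebraic identity; the following sketch verifies it numerically (the triples (r τ , r(x τ ), r * (x τ )) are arbitrary numbers, purely for illustration):

```python
def lhs(r_obs, r_val, r_star_val):
    # Left-hand side of (G.1): (r_tau - r(x_tau))^2.
    return (r_obs - r_val) ** 2

def rhs(r_obs, r_val, r_star_val):
    # Right-hand side of (G.1): eps^2 + 2 eps (r* - r) + (r* - r)^2,
    # where eps = r_tau - r*(x_tau) is the noise.
    eps = r_obs - r_star_val
    return eps ** 2 + 2.0 * eps * (r_star_val - r_val) + (r_star_val - r_val) ** 2
```

The identity follows by writing r τ - r(x τ ) = (r τ - r * (x τ )) + (r * (x τ ) - r(x τ )) and expanding the square.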

G.2 PROOF OF LEMMA G.2

Proof. We denote by H the Hellinger divergence of two probability measures, which is defined as H 2 P 1 (•), P 2 (•) = 1 2 S P 1 (s) -P 2 (s) 2 ds. Since we have TV 2 (P 1 , P 2 ) ≤ 8H 2 (P 1 , P 2 ), it remains to bound the population risk from the above in terms of the Hellinger divergence. We now require the following lemmas, which connect the Hellinger divergence with the uniform law of large numbers (ULLN). 2 n τ =1 E H 2 P(• | S τ , A τ ), P * (• | S τ , A τ ) F τ -1 ≤ n τ =1 log P(S τ | S τ , A τ ) P * (S τ | S τ , A τ ) - n τ =1 E log P(S τ | S τ , A τ ) P * (S τ | S τ , A τ ) F τ -1 . Proof. We conclude the proof by directly applying Lemma 4.1 in Van de Geer (2000). We define the function g P,1 and the functional ν n,1 as follows, g P,1 (s, a, s ) = 1 2 log P(s | s, a) + P * (s | s, a) / 2P * (s | s, a) , (G.4) ν n,1 (g) = 1 √ n n τ =1 g(s τ , a τ , s τ ) -E g(s τ , a τ , s τ ) | F τ -1 , and the functional ν n,2 (P) is defined in (G.5). We conclude the proof of Lemma G.2 by showing that ν n,2 ( P) ≤ 2 /128 holds with high probability. Therefore, we only need to prove that P sup P∈M,νn,2(P)≥ 2 /128 ν n,1 (g P,1 ) - √ nν n,2 (P) ≥ 0 ≤ C exp -n 2 /C 2 for some absolute constant C. Since the Hellinger distance is bounded from the above by 1, we have P sup P∈M,νn,2(P)≥ 2 /128 ν n,1 (g P,1 ) - √ nν n,2 (P) ≥ 0 (G.6) ≤ S s=1 P sup g∈G(2 s-4 ) ν n,1 (g) ≥ √ n2 2s-10 2 , where S = min{s : 2 s-4 > 1} and G( ) = {g P,1 | P ∈ M, ν n,2 (P) ≤ 2 }. Therefore, we can prove Lemma G.2 by the uniform law of large numbers on the function class G( ). We introduce the definition of the bracketing and a related ULLN for martingale processes as follows. Definition G.6 (δ-entropy with the bracketing). Let {F τ } n τ =0 be a filtration and {X τ } n τ =1 be a X -valued random process adapted to this filtration. 
For 0 < δ ≤ R, let N B,M (δ, G, {X τ } n τ =1 , {F τ } n τ =0 ) be the smallest value of N for which there exists a non-random collection {[g L j , g U j ]} N j=1 , such that (1) for all g ∈ G, there exists a non-random j(g) ∈ [N ], such that g L j(g) (x) ≤ g(x) ≤ g U j(g) (x) for all x ∈ X , and (2) χ M (g U j -g L j , {X τ } n τ =1 , {F τ } n τ =0 ) ≤ δ for all j ∈ [N ]. We define the δ-entropy with the bracketing H B,M as H B,M (δ, G, {X τ } n τ =1 , {F τ } n τ =0 ) = log N B,M (δ, G, {X τ } n τ =1 , {F τ } n τ =0 ). Here χ M (g, {X τ } n τ =1 , {F τ } n τ =0 ) is defined as χ 2 M g, {X τ } n τ =1 , {F τ } n τ =0 = 2M 2 n τ =1 E exp |g(X τ )| /M -1 -|g(X τ )| /M F τ -1 n. Lemma G.7. Let {F τ } n τ =0 be a filtration and {X τ } n τ =1 be a X -valued random process adapted to this filtration. Suppose we have sup g∈G χ M (g) ≤ R for the function class G. We set the values of the parameters M , ς 0 , ς 1 , ς 2 , and C 2 such that the following inequalities hold. ς 0 ≤ ς 2 √ nR 2 /M, ς 0 ≤ 8 √ nR, ς 2 1 ≥ C 2 2 (ς 2 + 1), (G.7) ς 0 ≥ ς 1 max R ς0/(64 √ n) H 1/2 B,M u, G, {X τ } n τ =1 , {F τ } n τ =0 du, R , (G.8) where H B,M is defined in Definition G.6 and C 2 is an absolute constant. We have P sup g∈G 1 √ n n τ =1 g(X τ ) -E g(X τ ) | F τ -1 ≥ ς 0 ≤ C 2 exp - ς 2 0 C 2 2 ς 2 R 2 . Proof. We conclude the proof by directly applying Theorem 8.13 in Van de Geer (2000). By the following lemma, we have χ 1 (g) ≤ 4 for g ∈ G( ). Lemma. For the function g P,1 and the functional ν n,2 (P) defined in (G.4) and (G.5), we have χ 2 1 (g P,1 ) ≤ 16ν n,2 (P). Proof. We conclude the proof by directly applying Lemma 7.2 in Van de Geer (2000). Therefore, we can apply Lemma G.7 on G(2 s-4 ) with ς 0 = √ n2 2s-10 2 , ς 1 = 4C 2 , ς 2 = 15, M = 1, and R = 2 s-2 . Our selection of parameters satisfies (G.7). To validate (G.8), we need to bound the generalized entropy of G( ) from the above, which depends on the distribution of the corresponding stochastic process. 
The following lemma decouples such dependency. Lemma G.8. We assume that 0 ≤ P 1 (s | s, a) ≤ P 2 (s | s, a), and define g i , Pi as Pi (s, a, s ) = P i (s | s, a) + P * (s | s, a) 2 , g i (s, a, s ) = 1 2 log P2 i (s, a, s ) P * (s | s, a) 1 {P * (s |s,a)>0} , for i = 1, 2. Let v be the Lebesgue measure of S. We have χ 1 g 2 -g 1 , (S τ , A τ , S τ ) n τ =1 , F τ n τ =0 ≤ √ 2v sup (s,a,s )∈S×A×S P2 (s, a, s ) -P1 (s, a, s ) . Proof. See Appendix §G.7 for a detailed proof. Combining Lemma G.8 and Definition G.6 with the definition of G( ) in (G.6), we have H B,1 u, G( ), (S τ , A τ , S τ ) n τ =1 , F τ n τ =0 ≤ H ∞ (8v) -1/2 u, Ḡ , (G.9) where H ∞ ( , G) is the -log-covering number of G with respect to the infinity norm, and Ḡ is defined in Lemma G.2. When n satisfies the condition in Lemma G.2 with C 9 = 2 10 C 2 and ≥ n , we have √ n2 2s-10 2 ≥ C 2 G(2 s ) (G.10) ≥ C 2 max 8 2 s-3 2 2s-17 2 H 1/2 ∞ (2v) -1/2 u, Ḡ du, 2 s . Combining (G.9) with (G.10), we have √ n2 2s-10 2 ≥ 4C 2 max 2 s-2 2 2s-16 2 H 1/2 B,1 u, G(2 s-4 ), (S τ , A τ , S τ ) n τ =1 , F τ n τ =0 du, 2 s-2 for all ≥ n and s ≥ 0, which validates (G.8) in Lemma G.7. Therefore, by Lemma G.7, we have P sup g∈G(2 s-4 ) ν n,1 (g) ≥ √ n2 2s-10 2 ≤ C 2 exp -(n2 2s-16 2 )/(15C 2 2 ) . Therefore, by (G.6), when > 1/ √ n and C 3 is an absolute constant that is large enough, we have P sup P∈M,νn,2(P)> 2 /128 ν n,1 (g P,1 ) - √ nν n,2 (P) ≤ C 2 ∞ s=0 exp -(n2 2s-16 2 )/(15C 2 2 ) ≤ C 3 exp -n 2 /C 2 3 . We conclude the proof of Lemma G.2 by setting C 9 in Lemma G.2 to the maximum of C 3 and 2 10 C 2 .
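The reduction at the start of this proof uses TV 2 (P 1 , P 2 ) ≤ 8H 2 (P 1 , P 2 ); the sketch below checks this numerically on discrete distributions (chosen arbitrarily), assuming the unnormalized convention TV(P 1 , P 2 ) = ∫ |P 1 -P 2 |, which is the convention under which the constant 8 follows from Cauchy-Schwarz:

```python
import math

def tv(p, q):
    # Unnormalized total variation: sum of |p_i - q_i|.
    return sum(abs(a - b) for a, b in zip(p, q))

def hellinger_sq(p, q):
    # H^2(p, q) = (1/2) * sum (sqrt(p_i) - sqrt(q_i))^2, as in the proof of Lemma G.2.
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

pairs = [([0.5, 0.3, 0.2], [0.2, 0.5, 0.3]),
         ([0.9, 0.1], [0.1, 0.9])]
```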

G.3 PROOF OF LEMMA G.3

Proof. We use the following lemmas to connect the covering numbers of different function classes and to bound the covering number of Ḡ from the above using the covering numbers of Φ and Ψ. Lemma G.9. We have N (δ, Ḡ, • ∞ ) ≤ N (4δ √ c, M, • ∞ ) , where Ḡ is defined in Lemma G.2 and c = inf P∈M,s ,s∈S,a∈A P(s | s, a). Proof. See Appendix §G.4 for a detailed proof. Lemma G.10. Suppose the function class M is defined as M = P u (s | s, a)/c[P u ](s, a) | P u ∈ M u , where the normalization function c[P u ](s, a) is defined as c[P u ](s, a) = s ∈S P u (s | s, a)d(s ). We assume that c max ≥ c[P u ](s, a) ≥ c min and P u (s | s, a) ≤ 1 for any P u ∈ M u . We have N δ, M, • ∞ ≤ N c 2 min δ/(v + c max ), M u , • ∞ . Proof. See Appendix §G.5 for a detailed proof. Lemma G.11. For two feature map classes Φ and Ψ, we define the function class M u as M u = exp -φ(s, a) -ψ(s ) 2 2 /2 | φ ∈ Φ, ψ ∈ Ψ . We have N (δ, M u , • ∞ ) ≤ N (δ/(4R), Φ, • ∞,2 ) • N (δ/(4R), Ψ, • ∞,2 ), where R is the bound of the feature maps. Proof. See Appendix §G.6 for a detailed proof. For the density class M, we have inf P∈M,s ,s∈S,a∈A P(s | s, a) ≥ exp(-2R 2 )/c max , where R bounds the norm of the feature maps from the above. Combining Lemmas G.9, G.10, and G.11, we have N δ, M , • ∞ ≤ N c cover δ, Φ, • ∞,2 N c cover δ, Ψ, • ∞,2 , (G.11) where c cover = exp(-R 2 )c 2 min /(R √ c max (v + c max )). We conclude the proof of Lemma G.3 by taking logarithms of both sides of (G.11).

G.6 PROOF OF LEMMA G.11

Proof. We set x = φ 1 (s, a) -ψ 1 (s ) 2 / √ 2 and y = φ 2 (s, a) -ψ 2 (s ) 2 / √ 2 in (G.17) and have exp -φ 1 (s, a) -ψ 1 (s ) 2 2 /2 -exp -φ 2 (s, a) -ψ 2 (s ) 2 2 /2 (G.18) ≤ 1 2 φ 1 (s, a) -ψ 1 (s ) 2 2 -φ 1 (s, a) -ψ 2 (s ) 2 2 + 1 2 φ 1 (s, a) -ψ 2 (s ) 2 2 -φ 2 (s, a) -ψ 2 (s ) 2 2 by the triangle inequality. 
For the first term in (G.18), we have φ 1 (s, a) -ψ 1 (s ) 2 2 -φ 1 (s, a) -ψ 2 (s ) 2 2 (G.19) ≤ ψ 2 (s ) -ψ 1 (s ) 2 2φ 1 (s, a) -ψ 1 (s ) -ψ 2 (s ) 2 Combining (G.19) with the boundedness of the feature maps, we have φ 1 (s, a) -ψ 1 (s ) 2 2 -φ 1 (s, a) -ψ 2 (s ) 2 2 ≤ 4R ψ 2 (s ) -ψ 1 (s ) 2 . Similarly, we have | φ 1 (s, a) -ψ 2 (s ) 2 2 -φ 2 (s, a) -ψ 2 (s ) 2 2 | ≤ 4R φ 2 (s, a) -φ 1 (s, a) 2 . By (G.18), we have exp -φ 1 (s, a) -ψ 1 (s ) 2 2 /2 -exp -φ 2 (s, a) -ψ 2 (s ) 2 2 /2 ≤ 4R max ψ 2 (s ) -ψ 1 (s ) 2 , φ 2 (s, a) -φ 1 (s, a) 2 . (G.20) Let C 1 be an /(4R)-covering set of Φ, and C 2 be an /(4R)-covering set of Ψ. We define C by C = exp -φ(s, a) -ψ(s ) 2 2 /2 | φ ∈ C 1 , ψ ∈ C 2 . By (G.20), C is an -covering set of M u , and |C| = N ( /(4R), Φ, • ∞,2 ) • N ( /(4R), Ψ, • ∞,2 ). Thus, we conclude the proof of Lemma G.11.

G.7 PROOF OF LEMMA G.8

Proof. Since 2(exp(x) -1 -x) ≤ (exp(x) -1) 2 when x ≥ 0, we have χ 2 1 g U -g L , (s τ , a τ , s τ ) n τ =1 ≤ 1 n n τ =1 E exp g U (s τ , a τ , s τ ) -g L (s τ , a τ , s τ ) -1 2 F τ -1 = 1 n n τ =1 E S P * (s | s τ , a τ ) + P U (s | s τ , a τ ) P * (s | s τ , a τ ) + P L (s | s τ , a τ ) -1 2 P * (s | s τ , a τ )ds F τ -1 ≤ 2 n n τ =1 E S PU (s τ , a τ , s ) -PL (s τ , a τ , s ) 2 ds F τ -1 . (G.21) We also have E S PU (s τ , a τ , s ) -PL (s τ , a τ , s ) 2 ds F τ -1 (G.22) ≤ v sup (s,a,s )∈S×A×S PU (s, a, s ) -PL (s, a, s ) 2 . Thus, we conclude the proof of Lemma G.8 by combining (G.21) with (G.22).
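The Lipschitz estimate (G.20) at the heart of this covering argument can be checked numerically; in the sketch below the feature vectors are arbitrary points of norm at most R = 1 (an illustration under that assumption, not part of the proof):

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def gauss_of_pair(phi, psi):
    # exp(-||phi - psi||_2^2 / 2), the unnormalized density defining M_u.
    return math.exp(-sum((a - b) ** 2 for a, b in zip(phi, psi)) / 2.0)

R = 1.0  # assumed bound on the feature norms
phi1, phi2 = (0.5, 0.0), (0.4, 0.1)
psi1, psi2 = (-0.3, 0.2), (-0.2, 0.2)

lhs = abs(gauss_of_pair(phi1, psi1) - gauss_of_pair(phi2, psi2))
rhs = 4.0 * R * max(norm([a - b for a, b in zip(psi1, psi2)]),
                    norm([a - b for a, b in zip(phi1, phi2)]))
```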

H DISCUSSION ON THE EFFECTIVE DIMENSION

We first provide the following lemma, which shows the relation between Λ 2 in Definition E.4 and the dimension of the space in the finite-dimensional case. Lemma H.1. We have Λ 2 (n, λ 0 ) ≤ d when X = R d and K(x 1 , x 2 ) = x 1 x 2 . Proof. For any f ∈ H, there exists x f ∈ R d such that f (x) = x f x. We also have k(x) = x. By the definition of Γ p , we have Γ p [ρ, λ 0 , n]f (x) = λ 0 x f x + nx f E x1∼ρ [x 1 x 1 ]x. Therefore, the operator Γ p [ρ, λ 0 , n] can be written as λ 0 I d + nE x1∼ρ [x 1 x 1 ] . By the property of the matrix trace, we have nE x∼ρ k(x), Γ -1 p [ρ, λ 0 , n]k(x) H =nE x∼ρ tr x λ 0 I d + nE x1∼ρ [x 1 x 1 ] -1 x =nE x∼ρ tr λ 0 I d + nE x1∼ρ [x 1 x 1 ] -1 xx for any ρ ∈ ∆(X ). Since the expectation and the trace commute, we have nE x∼ρ k(x), Γ -1 p [ρ, λ 0 , n]k(x) H = n tr λ 0 I d + nE x1∼ρ [x 1 x 1 ] -1 E x∼ρ [xx ] = tr(I d ) -λ 0 tr λ 0 I d + nE x1∼ρ [x 1 x 1 ] -1 ≤ d. Since ρ can be an arbitrary distribution over X , we conclude the proof of Lemma H.1.
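The trace identity in the proof of Lemma H.1, n E x∼ρ k(x) 2 Γ -1 = tr((λ 0 I d + nE[xx ⊤ ]) -1 nE[xx ⊤ ]) ≤ d, can be verified directly for d = 2 (the second-moment matrix below is an arbitrary illustrative choice):

```python
def mat_mul(a, b):
    # Product of two 2x2 matrices.
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(mat):
    # Inverse of a 2x2 matrix via the adjugate formula.
    det = mat[0][0] * mat[1][1] - mat[0][1] * mat[1][0]
    return [[mat[1][1] / det, -mat[0][1] / det],
            [-mat[1][0] / det, mat[0][0] / det]]

n, lam = 100.0, 1.0
S = [[0.5, 0.1], [0.1, 0.3]]  # stands in for E[x x^T] (illustrative)
M = [[lam * (i == j) + n * S[i][j] for j in range(2)] for i in range(2)]
nS = [[n * S[i][j] for j in range(2)] for i in range(2)]
prod = mat_mul(inv2(M), nS)
eff = prod[0][0] + prod[1][1]  # tr((lam*I + n*S)^{-1} n*S)
```

Each eigen-contribution has the form nμ/(λ 0 + nμ) < 1, so the trace is strictly below d = 2 regardless of n.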

H.1 PROOF OF LEMMA E.8

To bound the effective dimension from the above, we construct an upper bound of the effective dimension by the eigenvalues of an operator in the RKHS, and then use the characterization of the eigenvalues in Lemma H.2 to obtain the upper bound. Lemma H.2 can be proven by Theorem A of Belkin (2018), and we provide the proof in Appendix §H.2 for completeness. Proof. In the following part of the proof, we bound Λ 1 and Λ 2 from the above separately. Upper Bound of Λ 1 . Let X N +1 = {x 1 , . . . , x N +1 } be a subset of X . We define K N +1 = [K(x τ1 , x τ2 )] N +1 τ1,τ2=1 . Let λ j be the j-th eigenvalue of K N +1 and let α j = (α 1,j , . . . , α N +1,j ) denote the corresponding eigenvector. We have log det(I N +1 + K N +1 /λ) = N +1 i=1 log(1 + λ i /λ). (H.1) Therefore, we can bound log det(I N +1 + K N +1 /λ) from the above by bounding {λ i } N +1 i=1 from the above. We define the operator Γ 0 as Γ 0 f (x) = N +1 i=1 f (x i )K(x i , x)/(N + 1). The function N +1 i=1 α i,j K(x i , x ) is an eigenfunction of Γ 0 and λ j /(N + 1) is the corresponding eigenvalue. The following lemma bounds the eigenvalue of Γ 0 from the above. Lemma H.2. Suppose X ⊂ X ⊂ R m , and X is a cube with side length l ≥ 1/ √ m, and ρ is a distribution over X , and the operator Γ : L ρ 2 (X ) → L ρ 2 (X ) is defined as Γf (x) = x∈X f (t)K(t, x)dρ(t), where K(t, x) = exp(-t -x 2 2 /2). It holds that λ i (Γ) ≤ C app,5 exp(-C app,4 i 1/m ). Here C app,5 and C app,4 are two constants that only depend on the side length l and the dimension m. Proof. See Appendix §H.2 for a detailed proof. Applying Lemma H.2 on Γ 0 , we have λ j ≤ N C app,5 exp(-C app,4 j 1/m ), where C app,4 and C app,5 are two constants that only depend on m and R. Therefore, by (H.1), we have log det(I N +1 + K N +1 /λ) ≤ N mid j=1 log 1 + N C app,5 exp(-C app,4 j 1/m )/λ (H.2) + N j=N mid +1 log 1 + N C app,5 exp(-C app,4 j 1/m )/λ . 
We have N mid j=1 log(1 + N C app,5 exp(-C app,4 j 1/m )/λ) ≤ N mid log(1 + N C app,5 /λ). The following lemma bounds the second term from the above. Lemma H.3. If m is a positive integer, we have ∞ x t m e -t dt ≤ 2m!x m e -x when x ≥ 1. Proof. When we define b m = ∞ x t m e -t dt ≤ 2x m e -x , we have b m = - ∞ x t m de -t = x m e -x + m ∞ x t m-1 e -t dt = x m e -x + mb m-1 . By induction, we have b m = m!e -x m m0=0 x m0 /m 0 ! ≤ 2m!x m e -x , which concludes the proof of Lemma H.3. Since log(1 + x) ≤ x, by the lemma above, we have N j=N mid +1 log 1 + N C app,5 exp(-C app,4 j 1/m )/λ ≤ N j=N mid +1 N C app,5 exp(-C app,4 j 1/m )/λ ≤ N mC app,5 ∞ Capp,4N 1/m mid j m-1 exp(-j)dj/(C m app,4 λ) ≤ 2N m!C app,5 N (m-1)/m mid exp(-C app,4 N 1/m mid )/(λC app,4 ). (H.3) By (H.2) and (H.3), we have log det( I N +1 + K N +1 /λ) ≤ C 4 (log N/C app,4 ) m+1 when we choose N mid = (log N/C app,4 ) m . Here C 4 is a constant that only depends on the dimension m and the bound of the feature maps R. Since X N +1 can be any subset of X with |X N +1 | = N + 1, we have Λ 1 (N + 1, λ) ≤ C 4 (log N/C app,4 ) m+1 . Upper Bound of Λ 2 . In order to bound Λ 2 from the above, we need to choose an appropriate representation of the RKHS. We define the integral operator Γ 0 : L ρ 2 (X ) → L ρ 2 (X ) as Γ 0 f (x) = E x0∼ρ [f (x 0 )K(x 0 , x)]. Since ρ is a probability measure and the Gaussian kernel K is bounded from the above, Γ 0 is compact and self-adjoint. Therefore, by spectral theorem, there exists {e j } ∞ j=1 such that it is both the eigenfunction of the operator Γ 0 and the orthonormal basis of the space L ρ 2 (X ). We then define H as H = f : f ∈ L ρ 2 (X ), f (x) = ∞ j=1 α j e j (x) ∞ j=1 α 2 j /λ j < ∞ , where λ j is the eigenvalue corresponding to e j . We also know that K(x, y) = ∞ j=1 λ j e j (x)e j (y). For f (x) = ∞ j=1 α j e j (x) and g(x) = ∞ j=1 β j e j (x), we define the inner product on H as f, g H = ∞ j=1 α j β j /λ j . 
Then the space H is an RKHS with the kernel K, e j / √ λ j is an eigenfunction of Γ with eigenvalue λ j , and {e j / √ λ j } ∞ j=1 is an orthonormal basis of H. We represent k(x) using the orthonormal basis and have k(x) 2 Γ -1 = k(x), Γ -1 k(x) H = ∞ j=1 λ j e j (x)e j , ∞ j=1 λ j e j (x)e j /(λ + N λ j ) H = ∞ j=1 λ j e 2 j (x)/(λ + N λ j ) when we define Γf (x) = λf (x) + N Γ 0 f (x). Therefore, using Fubini's Theorem, we have N E x∼ρ k(x) 2 Γ -1 = ∞ j=1 N λ j E x∼ρ e 2 j (x) /(λ + N λ j ) = ∞ j=1 N λ j /(λ + N λ j ). (H.4) By Lemma H.2, we have λ j ≤ C app,5 exp(-C app,4 j 1/m ), where C app,4 and C app,5 are two constants that only depend on the side length l and the dimension m. We combine the bound on λ j with (H.4) and have N E x∼ρ k(x) 2 Γ -1 ≤ ∞ j=1 N C app,5 exp(-C app,4 j 1/m ) N C app,5 exp(-C app,4 j 1/m ) + λ = N mid j=1 N C app,5 exp(-C app,4 j 1/m ) N C app,5 exp(-C app,4 j 1/m ) + λ + ∞ j=N mid +1 N C app,5 exp(-C app,4 j 1/m ) N C app,5 exp(-C app,4 j 1/m ) + λ ≤ N mid + N C app,5 ∞ N mid exp(-C app,4 j 1/m )dj/λ. By Lemma H.3, we have ∞ N mid exp(-C app,4 j 1/m )dj ≤ 2m!N 1-1/m mid exp(-C app,4 N 1/m mid )/C app,4 . Therefore, we choose N mid = (log N/C app,4 ) m and have N E x∼ρ [ k(x) 2 Γ -1 ] ≤(log N/C app,4 ) m + 2m!C app,5 (log N ) m-1 /(C app,4 λ) ≤ C 4 (log N ) m , where C 4 = (1 + 2m!C app,5 C m-1 app,4 )/C m app,5 . By the definition, C 4 is a constant that only depends on the side length l and the dimension m. We conclude the proof of Lemma E.8 by combining the upper bounds of Λ 1 and Λ 2 .
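Lemma H.3, used twice above, admits a direct numerical check via the same recurrence b m = x m e -x + mb m-1 that drives its proof. The proof applies it with x = C app,4 N 1/m mid of order log N , i.e., large relative to m, so the sketch below checks it in that regime (the particular (m, x) pairs are illustrative):

```python
import math

def tail_integral(m, x):
    # b_m = Integral_x^infty t^m e^{-t} dt, computed exactly via the recurrence
    # b_m = x^m e^{-x} + m * b_{m-1}, starting from b_0 = e^{-x}.
    b = math.exp(-x)  # b_0
    for k in range(1, m + 1):
        b = x ** k * math.exp(-x) + k * b
    return b
```

The closed form b m = m! e -x (1 + x + • • • + x m /m!) makes the 2 m! x m e -x bound transparent once x dominates m.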

H.2 PROOF OF LEMMA H.2

Proof. We require the following two lemmas. The first lemma allows us to bound the eigenvalues by the bound on the residuals of the approximation, and the second lemma bounds the residuals. Lemma H.4 (Lemma 1 in Belkin (2018)). Suppose Γ : H → H is a self-adjoint operator on a Hilbert space H, and Γ n is a finite-rank operator with rank n, such that Γ -Γ n op ≤ . Here A op = sup x∈H/{0} Ax H / x H . It holds that all eigenvalues of the operator Γ except for at most n (counting multiplicity) are smaller than . Lemma H.5. Suppose X ⊂ R m is a cube with side length l ≥ 1/ √ m, and Γ : V → H is a (not necessarily linear) map from a Hilbert space V to an RKHS H of functions on R m . There exists a map Γ n from the space V to an n-dimensional linear subspace H n ⊂ H, such that Γ -Γ n V →L ρ 2 (X ) ≤ C app,2 exp(-C app,3 n 1/m ) Γ V →H . Here C app,2 and C app,3 are two positive constants that only depend on the side length l and the dimension m, and Γ V →L ρ 2 (X ) = sup v∈V /{0} Γv L ρ 2 (X ) / v V . Remark. Since H is a subset of L ρ 2 (X ) , we can view Γ as an operator from the space V to the space L ρ 2 (X ) and investigate its operator norm accordingly. Proof. See Appendix §H.3 for a detailed proof. First, by Lemma H.4, we have λ i (Γ) ≤ inf rank(Γi-1)=i-1 Γ -Γ i-1 op . By Lemma H.5, we have inf rank(Γi-1)=i-1 Γ -Γ i-1 op = inf rank(Γi-1)=i-1 Γ -Γ i-1 L ρ 2 (X )→L ρ 2 (X ) (H.5) ≤ C app,2 exp -C app,3 (i -1) 1/m Γ L ρ 2 (X )→H ≤ C app,2 exp -C app,4 i 1/m Γ L ρ 2 (X )→H for i > 1. Here C app,4 = C app,3 2 -1/m . For any e ∈ H, we have e, Γe H ≤ e L ρ 2 (X ) e L ρ 2 (X ) ≤ e H e L ρ 2 (X ) , where the last inequality is derived from (H.10). Therefore, we have Γe H = sup e∈H e, Γe H / e H ≤ e L ρ 2 (X ) , which implies Γ L ρ 2 (X )→H ≤ 1. Therefore, we have λ i (Γ) ≤ C app,2 exp(-C app,4 i 1/m ) for i > 1 by (H.5). For i = 1, we have λ 1 (Γ) ≤ sup x∈L ρ 2 (X )/{0} Γx L ρ 2 (X ) / x L ρ 2 (X ) ≤ sup x∈L ρ 2 (X )/{0} Γx H / x L ρ 2 (X ) = Γ L ρ 2 (X )→H ≤ 1. 
Therefore, we have λ i (Γ) ≤ C app,5 exp(-C app,4 i 1/m ) for every integer i ≥ 1 when we set C app,5 = C app,2 exp(C app,4 ), which concludes the proof of Lemma H.2.

H.3 PROOF OF LEMMA H.5

Proof. We prove Lemma H.5 by constructing an operator that satisfies the condition in this lemma. For n ≥ (3γ m exp(8mγ m + 2)m 3/2 l 2 ) m , we have max x∈X min x ∈Xn x -x 2 = √ mln -1/m ≤ 1/(3γ m exp(8mγ m + 2)ml) when γ m = 4 m m! and X n = (x 1 , . . . , x n ) is an m-dimensional grid of X . By Theorem 6.10 of Wendland (2004), the kernel K(x 1 , x 2 ) = exp(-x 1 -x 2 2 2 /2) is positive-definite. Therefore, the matrix K[X n ] is invertible, and n i=1 α ij1 K(x i , x j2 ) = 1 j1=j2 when K[X n ]α j = e j , where α ij is the i-th element of α j , and K[X n ] = K(x τ1 , x τ2 ) n τ1,τ2=1 , (H.6) k[X n ](x) = K(x, x 1 ), . . . , K(x, x n ) . (H.7) The following lemma allows us to construct an operator that satisfies the condition in Lemma H.5. Lemma H.6. Let H be the RKHS induced by the kernel K, where K : R m × R m → R is defined as K(x, x ) = exp(-x -x 2 2 /2). Suppose X ⊂ R m is a cube with side length l ≥ 2/m, and max x∈X min x ∈Xn x -x 2 = ι ≤ 1/(3 • 4 m m! exp(m2 2m+3 m! + 2)ml) for the set X n ⊂ X with |X n | = n. We define u j (x) = n i=1 α ij K(x i , x), where α j = K -1 [X n ]e j , α ij is the i-th element of α j , and the matrix K[X n ] is defined in (H.6). We define S Xn as S Xn f (x) = n i=1 f (x i )u i (x) for all f ∈ H. For an operator Γ : H → L ρ 2 (X ), where ρ is a probability measure over X , we define Γ H→L ρ 2 (X ) = sup f ∈H/{0} Γf L ρ 2 (X ) / f H . We have S 0 -S Xn H→L ρ 2 (X ) ≤ 4l √ m exp(-C app,1 /(2ι)), where γ m = 4 m • m!, C app,1 = l/(3γ m ), and S 0 : H → L ρ 2 (X ) is defined as S 0 f (x) = f (x). Proof. See Appendix §H.4 for a detailed proof. Let S 0 and S Xn be the operators defined in Lemma H.6. We have Γ -S Xn • Γ = (S 0 -S Xn ) • Γ.
Combining the definition of the norm with Lemma H.6, we have Γ -S Xn • Γ V →L ρ 2 (X ) ≤ Γ V →H S 0 -S Xn H→L ρ 2 (X ) (H.8) ≤ 4l √ m exp -C app,1 n 1/m /(2l) Γ V →H . Here C app,1 = l/(3γ m ). Therefore, we also have Γ -Γ n V →L ρ 2 (X ) ≤ 4l √ m exp -C app,1 n 1/m /(2l) Γ V →H when we define Γ n = S Xn • Γ in the case that n ≥ (3γ m exp(8mγ m + 2)m 3/2 l 2 ) m . Since the rank of S Xn does not exceed n, the rank of Γ n does not exceed n. When n < (3γ m exp(8mγ m + 2)m 3/2 l 2 ) m , we define Γ n as Γ n f (x) = 0 for all x ∈ X and v ∈ V . We then have Γ V →L ρ 2 (X ) = Γ -Γ n V →L ρ 2 (X ) . We first show that Γ V →L ρ 2 (X ) is bounded from the above by Γ V →H . By the reproducing property of the space H, we have g 2 L ρ 2 (X ) = x∈X g 2 (x)dρ(x) = x∈X g, k(x) H 2 dρ(x) (H.9) for any g ∈ H. Combining (H.9) with Cauchy-Schwarz inequality, we have g 2 L ρ 2 (X ) ≤ x∈X g, g H k(x), k(x) H dρ(x) = x∈X g, g H dρ(x) = g 2 H (H.10) for any g ∈ H when ρ is a probability measure. By (H.10), we have Γ V →L ρ 2 (X ) = sup v∈V /{0} Γv L ρ 2 (X ) / v V ≤ sup v∈V /{0} Γv H / v V = Γ V →H . (H.11) Since Γ -Γ n V →L ρ 2 (X ) = Γ V →L ρ 2 (X ) when we define Γ n as Γ n v(x) = 0, we have Γ -Γ n V →L ρ 2 (X ) ≤ exp C app,1 n 1/m /(2l) exp -C app,1 n 1/m /(2l) Γ V →H . (H.12) by (H.11) . Combining (H.12) with n ≤ (3γ m exp(8mγ m + 2)m 3/2 l 2 ) m , we have Γ -Γ n V →L ρ 2 (X ) ≤ exp m 3/2 l 2 exp(8mγ m + 2)/2 exp -n 1/m /(6γ m ) Γ V →H (H.13) since C app,1 = l/(3γ m ). Therefore, by combining (H.8) with (H.13), when we choose C app,2 = max 4l √ m, exp m 3/2 l 2 exp(8mγ m + 2)/2 and C app,3 = C app,1 /(2l) = 1/(6γ m ), we have (1) both C app,2 and C app,3 are constants that only depend on the side-length l and the dimension m, and (2) for any positive integer n, there exists an operator Γ n : V → L ρ 2 (X ) with finite rank n, such that Γ -Γ n V →L ρ 2 (X ) ≤ C app,2 exp(-C app,3 n 1/m ) Γ V →H . Thus, we conclude the proof of Lemma H.5.
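Lemma H.4 has a familiar finite-dimensional analogue: for a symmetric positive semi-definite matrix, the best rank-n approximation in operator norm is the spectral truncation (Eckart-Young), whose error equals the (n+1)-th largest eigenvalue, so all but n eigenvalues are bounded by the approximation error. The following is a small numeric sketch of this analogue, illustrative only and not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 40))
G = A @ A.T                                      # symmetric PSD matrix
eigvals = np.sort(np.linalg.eigvalsh(G))[::-1]   # eigenvalues, descending

n = 5
U, s, Vt = np.linalg.svd(G)
G_n = U[:, :n] @ np.diag(s[:n]) @ Vt[:n, :]      # best rank-n approximation
eps = np.linalg.norm(G - G_n, 2)                 # operator-norm residual

# Finite-dimensional Lemma H.4: all eigenvalues except at most n are <= eps,
# and spectral truncation achieves eps = lambda_{n+1} exactly.
assert np.all(eigvals[n:] <= eps + 1e-8)
assert np.isclose(eps, eigvals[n])
```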

H.4 PROOF OF LEMMA H.6

Proof. By the definition in Lemma H.6 and the reproducing property of the kernel, we have f (x) -S Xn f (x) = k(x), f H - n i=1 k(x i ), f H u i (x) (H.14) = k(x) - n i=1 u i (x)k(x i ), f H . Combining (H.14) with Cauchy-Schwarz inequality, we have f (x) -S Xn f (x) ≤ k(x) - n i=1 u i (x)k(x i ) H f H . Therefore, taking expectation with respect to the probability measure ρ, we have (R -S Xn ) • f L 2 ρ (X ) = X f (x) -S Xn,u f (x) 2 dρ(x) (H.15) ≤ sup x∈X k(x) - n i=1 u i (x)k(x i ) 2 H f 2 H . The following lemma allows us to bound the term sup x∈X k(x) - n i=1 u i (x)k(x i ) 2 H from the above. for all x 0 ∈ X and p ∈ q (R m ). Combining (H.21) with the fact that x -x 2 ≤ √ ml when x, x ∈ X , we have K(x 0 , x 0 ) -k[X n ](x 0 ) K -1 [X n ]k[X n ](x 0 ) ≤ k s -p L ∞ (B(0, √ ml)) (H.22) + 2 n i=1 υ i (x 0 ) k s -p L ∞ (B(0, √ ml)) + k s -p L ∞ (B(0, √ ml)) n i=1 n j=1 υ i (x 0 )υ j (x 0 ) . Here f L ∞ (B(0, √ ml)) = sup x 2 ≤ √ ml |f (x)|. By Lemma H.8, we have n i=1 |υ i (x 0 )| ≤ exp(2mγ m (q + 1)) for all x 0 ∈ X . Therefore, by (H.22), we have K(x 0 , x 0 ) -k[X n ](x 0 ) K -1 [X n ]k[X n ](x 0 ) ≤ 1 + exp 2mγ m (q + 1) 2 ς p (H.23) for all x 0 ∈ X and p ∈ q (R m ). Here ς p = k s -p L ∞ (B(0, √ ml)) = sup x 2 ≤ √ ml |k s (x) -p(x)|. Therefore, we only need to bound inf p∈ q (R m ) k s -p L ∞ (B(0, √ ml)) from the above. We have p( x 2 2 ) ∈ q (R m ) when p ∈ q/2 (R). Since we define k s (x) = exp(-x 2 2 /2), by (H.23), we have K(x 0 , x 0 ) -k[X n ](x 0 ) K -1 [X n ]k[X n ](x 0 ) (H.24) ≤ 4 exp 4mγ m (q + 1) inf p∈ q/2 (R) sup x∈[0,ml 2 ] exp(-x/2) -p(x) . By Taylor's Theorem (with Lagrange Remainder), we have sup x∈[0,ml 2 ] |exp(-x/2) -p t (x)| ≤ (ml 2 /2) [q/2]+1 / [q/2] + 1 ! (H.25) when we choose the polynomial p t as the Taylor polynomial of exp(-x/2) around zero of degree [q/2]. Without loss of generality, we assume ml 2 > 2. Since the degree q is an integer, we have [q/2] ≥ (q -1)/2 and q ≥ 1 when ι ≤ C app,1 . 
Combining (H.25) with Stirling's formula, we have sup x∈[0,ml 2 ] exp(-x/2) -p t (x) ≤ (ml 2 /2) [q/2]+1 1 ([q/2] + 1)! ≤ (eml 2 /2) [q/2]+1 [q/2] + 1 -[q/2]-1 ≤ (ml 2 /2) [q/2]+1 exp (q + 1)/2 (q + 1)/2 -(q+1)/2 . (H.26) We plug (H.26) into (H.24) and have K(x 0 , x 0 ) -k[X n ](x 0 ) K -1 [X n ]k[X n ](x 0 ) ≤ 4l √ m exp(8mγ m + 1)ml 2 /(q + 1) (q+1)/2 . Since we choose q = [C app,1 /ι], we have q + 1 ≥ C app,1 /ι. Therefore, we have K(x 0 , x 0 ) -k[X n ](x 0 ) K -1 [X n ]k[X n ](x 0 ) ≤4l √ m exp(8mγ m + 1)ml 2 ι/C app,1 (q+1)/2 =4l √ m 3γ m exp(8mγ m + 1)mlι (q+1)/2 where the second equation is induced by the definition of C app,1 in Lemma H.8. Therefore, when ι < 1/(3γ m exp(8mγ m + 2)ml), we have K(x 0 , x 0 ) -k[X n ](x 0 ) K -1 [X n ]k[X n ](x 0 ) ≤ 4l √ m exp -(q + 1)/2 (H.27) ≤ 4l √ m exp -C app,1 /(2ι) . Combining (H.15) with (H.27), we have (R -S Xn ) • f L ρ 2 (X ) = X f (x) -S Xn f (x) 2 dρ(x) ≤ 4l √ m exp -C app,1 /(2ι) f H for f ∈ H. By the definition of the operator norm, we have R -S Xn H→L ρ 2 (X ) = sup f ∈H/{0} (R -S Xn ) • f L ρ 2 (X ) / f H ≤ 4l √ m exp -C app,1 /(2ι) , which concludes the proof of Lemma H.6.
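The quantity bounded in (H.27), K(x 0 , x 0 ) -k[X n ](x 0 ) K -1 [X n ]k[X n ](x 0 ), is the squared power function of the Gaussian kernel, and it shrinks rapidly as the fill distance ι of the grid decreases. Below is an illustrative one-dimensional check; the grid sizes, the domain [0, 5], and the jitter term are arbitrary choices of ours.

```python
import numpy as np

def power_function_max(n, l=5.0, probes=101):
    """Max over probe points of K(x, x) - k[Xn](x)^T K[Xn]^{-1} k[Xn](x)
    for the Gaussian kernel exp(-(x - x')^2 / 2) and an n-point grid on [0, l]."""
    grid = np.linspace(0.0, l, n)
    K = np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2)
    xs = np.linspace(0.0, l, probes)
    kx = np.exp(-0.5 * (xs[:, None] - grid[None, :]) ** 2)   # shape (probes, n)
    sol = np.linalg.solve(K + 1e-8 * np.eye(n), kx.T)        # small jitter for stability
    vals = 1.0 - np.sum(kx * sol.T, axis=1)                  # K(x, x) = 1 here
    return float(np.max(vals))

p4, p8, p16 = (power_function_max(n) for n in (4, 8, 16))
assert p4 > p8 > p16          # the residual shrinks as the grid is refined
assert p16 < 1e-2
```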

I PROOF OF LEMMAS IN APPENDIX §E

I.1 PROOF OF LEMMA E.5

Proof. In the following part of the proof, we condition on the good events E 1 and E 2 . First, by Lemma J.1, we decompose the difference in value as follows, J(π; r n + u n , P n ) -J(π; r * , P * ) (I.1) = H h=1 E π,P n E s h+1 ∼P n h (s h ,a h ) V π h+1 (s h+1 ; r * , P * ) -E s h+1 ∼P * h (s h ,a h ) V π h+1 (s h+1 ; r * , P * ) + H h=1 E π,P n [u n h (s h , a h ) + r n h (s h , a h ) -r * h (s h , a h )] . Since the value function in (I.1) is bounded from the above by H, we have J(π; r n + u n , P n ) -J(π; r * , P * ) (I.2) ≥ H h=1 E π,P n [u n h (s h , a h )] - H h=1 E π,P n [f h,r (s h , a h )] - H h=1 E π,P n [Hf h,P (s h , a h )] , where f h,P (s, a) = P n h (• | s, a) -P * h (• | s, a) 1 and f h,r (s, a) = | r n h (s h , a h ) -r * h (s h , a h )|. In the following part of the proof, we bound the expectation of f in (I.2) from the above. For h = 1, by the definition of Event E 1 in (E.5) and ρ n 1,1 in (E.3), we have E π,P n f 1,P (s 1 , a 1 ) ≤ |A| E ρ n 1,1 f 1,P (s, a) ≤ |A| (E ρ n 1,1 [f 2 1,P (s, a)]) 1/2 ≤ |A| ζ/ √ n, (I.3) when conditioning on Event E 1 . Here the second inequality follows from the Cauchy-Schwarz inequality. By the same technique as in (I.3), we have E π,P n f 1,r (s 1 , a 1 ) ≤ |A| ζ/ √ n. For h > 1, we have f h,P ∞ ≤ 2. Since we have (s , a ) ∼ ρ n h,2 when (s, a) ∼ ρ n h-1,1 , s ∼ P * h-1 (• | s, a), a ∼ U(A), by Lemma J.2, we have |E π,P n [f h,P (s h , a h )]| ≤ β n h E π,P n k φ n h-1 (s h-1 , a h-1 ) Γ -1 p [φ n h-1 ,ρ n h-1,1 ,λ,n] , where β n h = 4λv 2 /c 2 min + 2r 2 max (n |A| E ρ n h,1 [f h,P (s, a) 2 ] + 4ζ 2 ) . By the definition of E 1 in (E.5), we have nE ρ n h,1 [f h,P (s, a) 2 ] ≤ ζ 2 when conditioning on E 1 . We also have k φ n h (s h , a h ) Γ -1 p [φ n h ,ρ n h,1 ,λ,n] ≤ 2 k φ n h (s h , a h ) Γ -1 e [φ n h ,D n h,1 ,λ] when conditioning on Event E 2 in Lemma E.2. By the definition of u n h in (4.5), we have E π,P n f h,P (s h , a h ) ≤ E π,P n u n h (s h , a h ) /(H + 1).
By exactly the same method, we have |E π,P n [f h,r (s h , a h )]| ≤ E π,P n [u n h (s h , a h )]/(H + 1). Therefore, we conclude the proof of Lemma E.5 by combining the bound above with (I.2) and (I.3).

I.2 PROOF OF LEMMA E.6

Proof. By the definition of Q n h in Algorithm 4 (Planning Algorithm), we have Q n h (s, a) = Q n h (s, a, π n+1 ; r n + u n , P n ) for all (s, a) ∈ S × A, and π n+1 is the greedy policy with regard to Q n h . By the definition of the value function, we have V π * h (s; r n + u n , P n ) -V n h (s) = Q π * h (s, •; r n + u n , P n ) -Q n h (s, •), π * h (• | s) A (I.4) + Q n h (s, •), π * h (• | s) -π h (• | s) A . Therefore, it remains to bound J(s init , π; u * ,n , P * ). Since the bonus u * ,n is bounded by 4H 2 + 6H + 2, it is a 12H 2 -subGaussian random variable. Therefore, by Hoeffding's inequality and the definition of the value function, we have N n=1 J(π n+1 ; u * ,n , P * ) = H h=1 N n=1 E π n+1 ,P * u * ,n h (s h , a h ) (I.7) ≤ 12H 2 log(10H/δ) √ N + H h=1 N n=1 u * ,n h (s n+1 h,0 , a n+1 h,0 ) with probability at least 1 -δ/10. Combining the definition of the underlying bonus in Lemma I.1 with Cauchy-Schwarz inequality, we have H h=1 N n=1 u * ,n h (s n+1 h,0 , a n+1 h,0 ) ≤ β 1 H h=1 N n=1 k φ * h (s n+1 h,0 , a n+1 h,0 ) Γ -1 e [φ * h ,D n h,0 ,λ] (I.8) ≤ β 1 √ N H h=1 ( N n=1 k φ * h (s n+1 h,0 , a n+1 h,0 ) 2 Γ -1 e [φ * h ,D n h,0 ,λ] ) 1/2 . For simplicity, we define x n h = φ * h (s n h,0 , a n h,0 ), K n h = [K(x τ1 h , x τ2 h )] n τ1,τ2=1 , and k n h = K(x 1 h , x n+1 h ), . . . , K(x n h , x n+1 h ) . By Lemma D.3, we have k φ * h (s n+1 h,0 , a n+1 h,0 ) 2 Γ -1 e [φ * h ,D n h,0 ,λ] = (K(x n+1 h , x n+1 h ) -k n, h (λI + K n h ) -1 k n h )/λ. We also define a n h = (K(x n+1 h , x n+1 h ) -k n, h (λI + K n h ) -1 k n h )/λ. We then have det(I + K n+1 h /λ) = det [ I + K n h /λ, k n h /λ ; k n, h /λ, 1 + K(x n+1 h , x n+1 h )/λ ] = det [ I, 0 ; b , 1 ] det [ I + K n h /λ, 0 ; 0, 1 + a n h ] det [ I, b ; 0, 1 ] = det(I + K n h /λ)(1 + a n h ), where b = (λI + K n h ) -1 k n h . We also have a n h ≤ 1. Combining the bounds above, we have J(π n ; r n + u n , P n ) -J(π n ; r * , P * ) ≤ 32H 2 |A| ζ d eff N + 12H 2 log(10H/δ) √ N + β 1 H eN d eff ≤ 46H 2 |A| ζβ 1 d eff N log(10H/δ), which concludes the proof of Lemma E.7.
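The rank-one determinant update used above is a finite-dimensional Schur-complement identity and is easy to verify numerically; with λ ≥ K(x, x) it also yields a n h ≤ 1. A small illustrative sketch with random data and our own variable names:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 8, 1.0
X = rng.standard_normal((n + 1, 3))              # points x_1, ..., x_{n+1}
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_full = np.exp(-0.5 * d2)                       # Gaussian kernel matrix
K_n = K_full[:-1, :-1]
k_n = K_full[:-1, -1]                            # k_n = (K(x_1, x_{n+1}), ..., K(x_n, x_{n+1}))

# a_n = (K(x_{n+1}, x_{n+1}) - k_n^T (lam*I + K_n)^{-1} k_n) / lam
a_n = (K_full[-1, -1] - k_n @ np.linalg.solve(lam * np.eye(n) + K_n, k_n)) / lam

lhs = np.linalg.det(np.eye(n + 1) + K_full / lam)
rhs = np.linalg.det(np.eye(n) + K_n / lam) * (1.0 + a_n)
assert np.isclose(lhs, rhs)                      # the Schur-complement identity
assert -1e-10 <= a_n <= 1.0 + 1e-10              # 0 <= a_n <= K(x, x)/lam = 1
```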

I.4 PROOF OF LEMMA I.1

Proof. In the following part of the proof, we condition on E 1 and E 2 . First, by Lemma J.1, we decompose the difference in value as follows, J(π; r * + u n + u * ,n , P * ) -J(π; r n + u n , P n ) (I.9) = H h=1 E π,P * E s h+1 ∼P * h (s h ,a h ) V π h+1 (s h+1 ; r n + u n , P n ) -E s h+1 ∼P n h (s h ,a h ) V π h+1 (s h+1 ; r n + u n , P n ) + H h=1 E π,P * u * ,n h (s h , a h ) + r * h (s h , a h ) -r n h (s h , a h ) . Since r n + u n ∞ ≤ (2H + 3), the value function in (I.9) is bounded from the above, and we have J(π; r * + u n + u * ,n , P * ) -J(π; r n + u n , P n ) (I.10) ≥ H h=1 E π,P * u * ,n h (s h , a h ) - H h=1 E π,P * f h,r (s h , a h ) - H h=1 E π,P * (2H 2 + 3H)f h,P (s h , a h ) , where f h,P (s, a) = P n h (• | s, a) -P * h (• | s, a) 1 and f h,r (s, a) = | r n h (s h , a h ) -r * h (s h , a h ) | are defined for simplicity. In the following part of the proof, we bound the expectation of f in (I.10) from the above. For h = 1, by the definition of Event E 1 in (E.5) and ρ n 1,1 in (E.3), we have E π,P * f 1,P (s 1 , a 1 ) ≤ |A| E ρ n 1,1 f 1,P (s, a) ≤ |A| E ρ n 1,1 f 2 1,P (s, a) ≤ |A| ζ/ √ n (I.11) when condition on E 1 . Here the second inequality follows Cauchy-Schwarz inequality. By the same technique in (I.11), we have E π,P * f 1,r (s 1 , a 1 ) ≤ |A| ζ/ √ n. For h > 1, we have f h,P ∞ ≤ 2. Since (s , a ) ∼ ρ n h,1 when (s, a) ∼ ρ n h-1,0 , s ∼ P * h-1 (• | s, a), a ∼ U(A), by Lemma J.3, we have E π,P * f h,P (s h , a h ) ≤ β n, h E π,P * k(φ * h-1 (s h-1 , a h-1 ), •) Γ -1 p [φ * h-1 ,ρ n h-1,0 ,λ,n] , (I.12) where β n, h = 4λv 2 /c 2 min + r 2 max n |A| E (s h ,a h )∼ρ n h,1 [f h,P (s h , a h ) 2 ]. By the definition of E 1 in (E.5), we have nE (s h ,a h )∼ρ n h,1 [f h,P (s h , a h ) 2 ] ≤ ζ 2 when E 1 holds. We also have k φ * h (s h , a h ) Γ -1 p [φ * h ,ρ n h,0 ,λ,n] ≤ 2 k φ * h (s h , a h ) Γ -1 e [φ * h ,D n h,0 ,λ] when condition on Event E 2 defined in (E.6), (E.7). 
By the definition of u * ,n h in Lemma I.1 and (I.12), we have E π,P * f h,P (s h , a h ) ≤ E π,P * u * ,n h-1 (s h-1 , a h-1 ) /(2H 2 + 3H + 1). By the same method, we have |E π,P * [f h,r (s h , a h )]| ≤ E π,P * [u * ,n h-1 (s h-1 , a h-1 )]/(2H 2 + 3H + 1). Therefore, we conclude the proof of Lemma I.1 by combining the bound above with (I.10) and (I.11).

I.5 PROOF OF LEMMA I.2

Proof. In the following part of the proof, we condition on the good events E 1 and E 2 defined in (E.5), (E.6), and (E.7). By the definition of the value function, we have J(π; u n , P * ) = E π u n 1 (s init , a 1 ) (Term (a)) + H h=2 E π,P * u n h (s h , a h ) (Term (b)). (I.13) Now we bound the terms in (I.13) from the above separately. Term (a). By the definition of ρ n 1,1 in (E.3), Algorithm 3 (Sampling Scheme), and Cauchy-Schwarz inequality, we have E π u n 1 (s init , a 1 ) ≤ |A| E ρ n 1,1 u n 1 (s, a) ≤ |A| (E ρ n 1,1 [u n 1 (s, a) 2 ]) 1/2 . (I.14) By the definition of u n h in (4.5) and d eff in Definition E.4, when Event E 2 defined in Lemma E.2 holds, we have E ρ n h,1 u n h (s, a) 2 ≤ β 2 E ρ n h,1 k φ n h (s, a) 2 Γ -1 e [φ n h ,D n h,1 ,λ] ≤ 4β 2 E ρ n h,1 k φ n h (s, a) 2 Γ -1 p [φ n h ,ρ n h,1 ,λ,n] ≤ 4β 2 d eff /n. Term (b). We have (s , a ) ∼ ρ n h,1 when (s, a) ∼ ρ n h-1,0 , s ∼ P * h-1 (• | s, a), a ∼ U(A). Therefore, by Lemma J.3, when we condition on Event E 2 defined in Lemma E.2, we have E π,P * u n h (s h , a h ) ≤ β n, * h E π,P * k φ * h-1 (s h-1 , a h-1 ) Γ -1 p [φ * h-1 ,ρ n h-1,0 ,λ,n] (I.15) ≤ 2β n, * h E π,P * k φ * h-1 (s h-1 , a h-1 ) Γ -1 e [φ * h-1 ,D n h-1,0 ,λ] , where β n, * h = 4λ(H + 1) 2 v 2 /c 2 min + r 2 max n |A| E ρ n h,1 [u n h (s, a) 2 ] . By the definition of u n h in (4.5) and d eff in Definition E.4, when we condition on Event E 2 defined in Lemma E.2, we have E ρ n h,1 [u n h (s, a) 2 ] ≤ β 2 E ρ n h,1 k φ n h (s, a) Γ -1 e [φ n h ,D n h ,λ] ≤ 2β 2 E ρ n h,1 k φ n h (s, a) Γ -1 p [φ n h ,ρ n h,1 ,λ,n] ≤ 2β 2 d eff /n. Therefore, we have β 1 ≥ 2β n, * h , where β 1 is defined in Lemma I.1. By (I.15) and the definition of u * ,n h-1 (s h-1 , a h-1 ) in Lemma I.1, we have E π,P * [u n h (s h , a h )] ≤ E π,P * [u * ,n h-1 (s h-1 , a h-1 )]. We conclude the proof of Lemma I.2 by combining the bounds on Term (a) and Term (b).

J PROOF OF AUXILIARY LEMMAS

J.1 PROOF OF LEMMA J.1

In this subsection, we provide the proof of the simulation lemma.
We first state it below. Lemma J.1 (Simulation Lemma). Let V h (s; r, P, π) be the value function defined in (2.1). For the transition kernels P = {P h } H h=1 , P * = {P * h } H h=1 and the reward functions r = {r h } H h=1 , we have V π h (s; r + u, P) -V π h (s; r, P * ) = H h=h E π,P [uh(sh, ah) | s h = s] + H h=h E π,P E 1, h V π h+1 (s h+1 ; r, P * ) -E 2, h V π h+1 (s h+1 ; r, P * ) | s h = s , where E 1,h [•] = E s h+1 ∼P h (s h ,a h ) [•] and E 2,h [•] = E s h+1 ∼P * h (s h ,a h ) [•]. Here E π,P is the expectation taken over the trajectory induced by the policy π = {π h } H h=1 and the transition kernel P. Proof. By the definition of V h (s; r, P, π) in (2.1), we have V π h (s; r + u, P) -V π h (s; r, P * ) = V π h (s; u, P) + V π h (s; r, P) -V π h (s; r, P * ) (J.1) = H h=h E π,P uh(sh, ah) | s h = s + V π h (s; r, P) -V π h (s; r, P * ). Here E π,P is the expectation taken over the trajectory induced by policy π and the transition kernel P. By the definition of the value function Vh(s; r, P, π) in (2.1), we have V π h (s true h ; r, P) = E π,P rh(sh, ah) + V π h+1 (sh +1 ; r, P) | sh = s true h (J.2) = a∈A rh(s true h , a)πh(a | s true h ) + E π,P V π h+1 (sh +1 ; r, P) | sh = s true h . By (J.2), we have V π h (s true h ; r, P) -V π h (s true h ; r, P * ) (J.3) = E π,P V π h+1 (sh +1 ; r, P) | sh = s true h -E π,P * V π h+1 (sh +1 ; r, P * ) | sh = s true h . Combining the property of the expectation with (J.3), we have V π h (s true h ; r, P) -V π h (s true h ; r, P * ) = E π,P V π h+1 (sh +1 ; r, P) -V π h+1 (sh +1 ; r, P * ) | sh = s true h + E π,P V π h+1 (sh +1 ; r, P * ) | sh = s true h -E π,P * V π h+1 (sh +1 ; r, P * ) | sh = s true h . 
(J.4) Taking expectation in both sides of (J.4) and summing it up from h = h to H, we have V π h (s; r, P) -V π h (s; r, P * ) (J.5) = H h=h E π,P E 1, h V π h+1 (s h+1 ; r, P * ) -E 2, h V π h+1 (s h+1 ; r, P * ) s h = s , where E 1,h [•] = E s h+1 ∼P h (s h ,a h ) [•] and E 2,h [•] = E s h+1 ∼P * h (s h ,a h ) [•]. We conclude the proof of Lemma J.1 by combining (J.1) with (J.5).
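The simulation lemma can be checked exactly on a small tabular MDP by backward induction: the value gap between (r + u, P) and (r, P * ) equals the accumulated bonus plus the one-step transition-shift terms, weighted by the occupancy measure of (π, P). An illustrative sketch (the MDP sizes and all distributions below are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, H = 4, 3, 5

def rand_kernel():
    P = rng.random((H, S, A, S))
    return P / P.sum(-1, keepdims=True)

P, Pstar = rand_kernel(), rand_kernel()          # estimated and true dynamics
r = rng.random((H, S, A))                        # reward
u = rng.random((H, S, A))                        # bonus
pi = rng.random((H, S, A))
pi /= pi.sum(-1, keepdims=True)                  # policy

def values(reward, kernel):
    """Backward induction; V[h] is the value-to-go at step h."""
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        Q = reward[h] + kernel[h] @ V[h + 1]
        V[h] = (pi[h] * Q).sum(-1)
    return V

V_opt = values(r + u, P)                         # value under (r + u, P)
V_true = values(r, Pstar)                        # value under (r, P*)

# Occupancy measures of (pi, P), starting deterministically from state 0.
mu = np.zeros(S)
mu[0] = 1.0
rhs = 0.0
for h in range(H):
    d_h = mu[:, None] * pi[h]                    # state-action occupancy at step h
    shift = (P[h] - Pstar[h]) @ V_true[h + 1]    # (E_1 - E_2) applied to V(.; r, P*)
    rhs += (d_h * (u[h] + shift)).sum()
    mu = np.einsum('sa,sat->t', d_h, P[h])

assert np.isclose(V_opt[0, 0] - V_true[0, 0], rhs)
```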

J.2 PROOF OF LEMMA J.2

In this subsection, we provide the proof of the one-step backward inequality for the learned model. We first state this lemma below. Lemma J.2 (One-Step Backward for the Learned Model). Let ρ n h-1 be a distribution over S × A. We assume that the transition kernels P n = {P n h } H h=1 and P * = {P * h } H h=1 satisfy P n h (s | s, a) = k φ n h (s, a) , k ψ n h+1 (s ) H c[φ n h , ψ n h+1 ](s, a), P * h (s | s, a) = k φ * h (s, a) , k ψ * h+1 (s ) H c[φ * h , ψ * h+1 ](s, a), where c[φ, ψ](s, a) is the normalization function defined in (E.1) and k : X → H is the feature map of the RKHS H. We assume further that nE (s,a) ∼ρ n h-1 [TV 2 (P n h-1 (• | s, a), P * h-1 (• | s, a))] ≤ ζ 2 , and the non-negative function g satisfies g ∞ ≤ B. For any policy π, we have E π,P n g(s h , a h ) ≤ E π,P n β l k φ n h-1 (s h-1 , a h-1 ) Γ -1 p [φ n h-1 ,ρ n h-1 ,λ,n] . Here the operator Γ p [φ n h-1 , ρ n h-1 , λ, n] is defined in (E.4 ), E π,P is the expectation taken over the trajectory induced by the policy π = {π h } H h=1 and the transition kernel P, β is defined as β l = λB 2 v 2 /c 2 min + 2r 2 max n |A| E ρ n h-1 ,P * h ,U (A) g 2 (s h , a h ) + B 2 ζ 2 , and the expectation E ρ n h-1 ,P * h ,U (A) [g(s h , a h )] is defined as E ρ n h-1 ,P * h-1 ,U (A) g(s h , a h ) = E (s h-1 ,a h-1 )∼ρ n h-1 ,s h ∼P * h-1 (•|s h-1 ,a h-1 ),a h ∼U (A) g(s h , a h ) . Proof. By the structure of the transition and the property of conditional expectation, we have E π,P n g(s h , a h ) (J.6) = E π,P n s P n h-1 (s | s h-1 , a h-1 ) a∈A π h (a | s)g(s, a)ds = E π,P n s∈S K φ n h-1 (s h-1 , a h-1 ), ψ n h (s) ḡ(s, π h )ds/c[φ n h-1 , ψ n h ](s h-1 , a h-1 ) , where the last equation follows from the structure of P n h-1 , and ḡ(s, π) is defined as ḡ(s, π) = a∈A π(a | s)g(s, a). 
(J.7) Combining (J.6) with the lower bound of the normalization constant, we have We plug (J.17) into (J.12) and u[ψ n h , π, g]  E π,P n g(s h , a h ) ≤ E π,P n s∈S K φ n h-1 (s h-1 , a h-1 ), ψ n h β l k φ n h-1 (s h-1 , a h-1 ) Γ -1 p [φ n h-1 ,ρ n h-1 ,λ,n] . Here the expectation and β l are defined in Lemma J.2. Thus, we conclude the proof of Lemma J.2.

J.3 PROOF OF LEMMA J.3

In this subsection, we provide the proof of the one-step backward inequality for the true model. We first state this lemma below. Lemma J.3 (One-Step Backward for the True Model). Let ρ n h-1 be a distribution over S × A. We assume that for the transition kernel P n = {P We also assume that the non-negative function g satisfies g ∞ ≤ B. Then for any policy π, we have E π,P * h g(s h , a h ) ≤ E π,P * h β t k(φ * h-1 (s h-1 , a h-1 )) Γ -1 p [φ * h-1 ,ρ n h-1 ,λ,n] . Here the expectation E ρ n h-1 ,P * h ,U (A) [g(s h , a h )] is defined as E ρ n h-1 ,P * h ,U (A) g(s h , a h ) = E (s h-1 ,a h-1 )∼ρ n h-1 ,s h ∼P * h (•|s h-1 ,a h-1 ),a h ∼U (A) g(s h , a h ) , the operator Γ p [φ * h-1 , ρ n h-1 , λ, n] is the operator defined in (E.4), and β t is defined as β t = λB 2 v 2 /c 2 min + r 2 max n |A| E ρ n h-1 ,P * h ,U (A) g 2 (s h , a h ) . Proof. The proof is similar with the proof of Lemma J.2. Similar with (J.8), we have  E β t k(φ * h-1 (s h-1 , a h-1 )) Γ -1 p [φ * h-1 ,ρ n h-1 ,λ,n] , where β t is defined in Lemma J.3. Thus, we conclude the proof of Lemma J.3. Where the last equation holds since x ) = exp(-x -x 2 2 /2).Therefore, we have Γg ∈ L 2 ρ (X ) when g ∈ L 2 ρ (X ), which implies that Γ is a linear operator on L 2 ρ (X ). When we define the functional γ g : H → R as γ g (ḡ) = X g(x)ḡ(x)dρ(x) for g ∈ L 2 ρ (X ), we have γ g (ḡ) ≤ X g 2 (x)dρ(x) X ḡ2 (x)dρ(x) ≤ ḡ, ḡ H X g 2 (x)dρ(x) by (J.28) and (J.29). Therefore, by Riesz's theorem, we know that there exists u ∈ H, such that u, ḡ H = γ g (ḡ), which implies. Γg(x) ∈ H. The operator Γ : H → H is also non-negative definite since we have g, Γg H = X g 2 (x)dρ(x) ≥ 0 when g ∈ H. Thus, we conclude the proof of Lemma D.2. J.6 PROOF OF LEMMA D.3 Proof. We first define the operator Γ[X n ] as  Γ[X n ]g(x) = g(x) -g[X n ] λI + K[X n ] -1 k[X n ](x) [X n ]Γ[X n ]g(x) = Γ[X n ]g(x) -Γ[X n ]g [X n ] λI + K[X n ] -1 k[X n ](x) λ = λg(x) + g[X n ] k[X n ](x) -g[X n ] k[X n ](x) λ = g(x) . 
By the definition in (J.30) and the definition of ( Γ[X n ]g)[X n ] in §1.2, we have Γ[X n ]g [X n ] = (g[X n ] -K[X n ] λI + K[X n ] -1 g[X n ])/λ = λI + K[X n ] -1 g[X n ]. (J.32) Combining (J.32) with (J.30), we have Γ[X n ] Γ[X n ]g(x) = λ Γ[X n ]g(x) + Γ[X n ]g [X n ] k[X n ](x) = g(x) -g[X n ] λI + K[X n ] -1 k[X n ](x) + g[X n ] λI + K[X n ] -1 k[X n ](x) = g(x). Therefore, the operator Γ[X n ] is the inverse operator of Γ[X n ], and k(x) 2 Γ -1 [Xn] = k(x), Γ -1 [X n ]k(x) H = (K(x, x) -k[X n ](x) λI + K[X n ] -1 k[X n ](x))/λ. Thus, we conclude the proof of Lemma D.3. We conclude the proof of Lemma D.4 by combining (J.35) with (J.36).
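The closed form in Lemma D.3 is the kernel-ridge (Woodbury) identity; with an explicit finite-dimensional feature map it can be verified directly, since the empirical operator becomes the matrix λI + Φ Φ. A numeric sketch with our own notation:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, lam = 6, 20, 0.7
Phi = rng.standard_normal((n, d))      # rows are the features of the n data points
phix = rng.standard_normal(d)          # feature of a query point x

K = Phi @ Phi.T                        # kernel matrix K[X_n]
kx = Phi @ phix                        # vector k[X_n](x)
Kxx = phix @ phix                      # K(x, x)

Gamma = lam * np.eye(d) + Phi.T @ Phi  # empirical operator, here a d x d matrix
lhs = phix @ np.linalg.solve(Gamma, phix)                          # ||k(x)||^2 in Gamma^{-1}
rhs = (Kxx - kx @ np.linalg.solve(lam * np.eye(n) + K, kx)) / lam  # Lemma D.3 form
assert np.isclose(lhs, rhs)
```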



Northwestern University · University of Chicago · Yale University · University of Alberta · DeepMind. Corresponding authors: siruizheng2025@u.northwestern.edu



Assumption C.5 can be satisfied by many activation functions. For example, Assumption C.5 is satisfied when we choose σ h (•) = tanh(•) and σ y (•) = max(0, •). We directly obtain the boundedness of the function class F RNN (t, B) under Assumption C.5. Moreover, Chen et al. (2019) bound the covering number of F RNN (t, B) from the above via the following lemma. Lemma C.6 (Lemma 3 in Chen et al. (2019)). Under Assumption C.5, we have

s,a)∈S×A c[φ, ψ](s, a), c min = inf φ∈Φ,ψ∈Ψ (s,a)∈S×A c[φ, ψ](s, a), r max = c max /c min .

For a finite dataset D = {(s τ , a τ , r τ , s τ +1 )} n τ =1 and g ∈ H, we define the empirical operator on H Γ e [φ, D, λ] as Γ e [φ, D, λ]g(x) = λg(x) + (s,a,r,s )∈D g φ(s, a) K φ(s, a), x , (E.2) and denote by Γ -1 e [φ, D, λ] the corresponding inverse operator. Then we have

a)∼ρ g(φ(s, a))k(φ(s, a), x), (E.4)

* ; r * , P * ) -J(π n+1 ; r * , P * ) ≤ H |A| ζ √ N 8 + 46Hβ 1 d eff log(10H/δ) , (E.10) By the definition of β 1 in Lemma E.7, we have β 1 ≤ 24H 2 vc max /c 2 min ( √ λ + ζβ d eff |A|). By the definition of β in Theorem 5.3, we have β ≤ 24Hvc max /c 2 min ( √ λ + ζ |A|). Therefore, we have

and x 2 ≤ R , and H 1 = Γk(x) | x ∈ R m , and x 2 ≤ R , (F.1)

Lemma 4.2 in Van de Geer (2000)). For the Hellinger divergence H and two probability measures P(• | s, a) and P * (• | s, a), we have 16H 2 (( P(• | s, a) + P * (• | s, a))/2, P * (• | s, a)) ≥ H 2 ( P(• | s, a), P * (• | s, a)). Lemma G.5. When we define P(s | s, a) = ( P(s | s, a) + P * (s | s, a))/2, we have

a)∼ρτ ,s ∼P * (•|s,a) g(s, a, s ) (s,a)∼ρτ H 2 P(s | s, a) + P * (s | s, a) /2, P * (• | s, a) . (G.5) By Lemma G.5, we have ν n,1 (g P,1 ) -√ nν n,2 ( P) ≥ 0. By Lemma G.4, we can prove Lemma G.2
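Lemma G.4 is easy to probe numerically on discrete distributions: 16H 2 ((P + P * )/2, P * ) dominates H 2 (P, P * ). An illustrative check over random categorical distributions; the normalization H 2 (p, q) = (1/2) Σ (√p -√q) 2 is one common convention and the choice does not affect the inequality, which is scale-invariant.

```python
import numpy as np

rng = np.random.default_rng(4)

def hellinger_sq(p, q):
    """H^2(p, q) = (1/2) * sum (sqrt(p) - sqrt(q))^2 for discrete distributions."""
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

ratios = []
for _ in range(1000):
    p = rng.random(8)
    p /= p.sum()
    q = rng.random(8)
    q /= q.sum()
    mid = 0.5 * (p + q)
    # Lemma G.4: H^2(P, P*) <= 16 * H^2((P + P*)/2, P*).
    ratios.append(hellinger_sq(p, q) / (16.0 * hellinger_sq(mid, q)))

worst = max(ratios)
assert worst <= 1.0
```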

+ nE (s,a)∼ρ n h-1 u 2 [ψ n h , π, g] φ n h-1 (s, a) ≤λB 2 v 2 + nE (s,a)∼ρ n h-1 u 2 [ψ n h , π, g] φ n h-1 (s, a) . (J.12) Since u[ψ n h , π, g] is the representation of χ[ψ n h , π, g] defined in (J.9), we have u[ψ n h , π, g] φ n h-1 (s, a) = u[ψ n h , π, g], k φ n h-1 (s, a) a | s )g(s , a )ds .Combining (J.13) with the structure of the transition kernel P n h-1 in Lemma J.2, we haveu[ψ n h , π, g] φ n h-1 (s, a) = c[φ n h-1 , ψ n h ](s, a) s∈S P n h-1 (s | s, a) a ∈A π h (a | s )g(s , a )ds ≤ c max E s ∼P n h-1 (•|s,a),π h g(s , a ) (J.14) since c[φ n h-1 , ψ n h ](s, a)is bounded from the above by c max . By the property of the sum of squares, we haveE s ∼P n h-1 (•|s,a),π h g(s , a ) 2 ≤2 E s ∼P * h-1 (•|s,a),π h g(s , a ) -E s ∼P n h-1 (•|s,a),π h g(s , a ) 2 + 2E s ∼P * h-1 (•|s,a),π h g(s , a ) 2 . (J.15)By simple calculation, we haveE s ∼P * h-1 (•|s,a),π h g(s , a ) -E s ∼P n h-1 (•|s,a),π h g(s , a ) s | s, a) -P n h-1 (s | s, a) a ∈A π h (a | s )g(s , a )ds . ≤ B • TV P n h-1 (• | s, a), P * h-1 (• | s, a), where the last inequality follows from the definition of the total variance divergence. Combining (J.14), (J.15) with (J.16), we haveu[ψ n h , π, g] φ n h-1 (s, a) 2 = c 2 [φ n h-1 , ψ n h ](s, a)E s ∼P n h-1 (•|s,a),π h g(s ,a ) 2 (J.17) ≤ 2c 2 max E s ∼P * h-1 (•|s,a),π h g(s , a ) 2 + B 2 TV 2 P n h-1 (• | s, a), P * h-1 (• | s, a) .

n h } H h=1 and the transition kernelP * = {P * h } H h=1 with P n h (s | s, a) = k φ n h (s, a) , k ψ n h+1

PROOF OF LEMMA C.7

Proof. The lower bound of the ε-Eluder dimension has been proven by Theorem 5.2 of Dong et al. (2021). Therefore, we only need to bound the ε-log covering number of R from the above. Let C 1 be an ε/2-covering set of the unit sphere in R d , and C 2 be an ε/2-covering set of [0, 1]. We have |C 1 | ≤ 2 2d ε -d and |C 2 | ≤ 2/ε. We define the set C as C = {r θ,b : A → R | r θ,b (a) = σ(θ a + b), θ ∈ C 1 , b ∈ C 2 }. For any r θ,b ∈ R, there exist θ 0 ∈ C 1 and b 0 ∈ C 2 such that θ -θ 0 2 ≤ ε/2 and |b -b 0 | ≤ ε/2. We have r θ0,b0 ∈ C, and |r θ,b (a) -r θ0,b0 (a)| = |max{θ a + b, 0} -max{θ 0 a + b 0 , 0}| (J.27) ≤ θ -θ 0 2 a 2 + |b -b 0 | ≤ ε for any a ∈ A. Therefore, C is an ε-covering set of R, and log |C| ≤ log |C 1 | + log |C 2 | ≤ (d + 1) log(4/ε), which concludes the proof of Lemma C.7.

J.5 PROOF OF LEMMA D.2

Proof. By the reproducing property of H and Cauchy-Schwarz inequality, we have X K 2 (x , x)dρ(x ) X g 2 (x )dρ(x ) dρ(x) = X g 2 (x )dρ(x ) X X K 2 (x , x)dρ(x )dρ(x) = X g 2 (x )dρ(x ).
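The covering argument in the proof of Lemma C.7 rests on the ReLU being 1-Lipschitz, so perturbing (θ, b) moves r θ,b (a) by at most θ -θ 0 2 a 2 + |b -b 0 |, as in (J.27). A quick numeric check; relu_reward is our own helper, not a function from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)

def relu_reward(theta, b, a):
    """r_{theta, b}(a) = max(theta^T a + b, 0)."""
    return max(float(theta @ a + b), 0.0)

worst = -np.inf
for _ in range(1000):
    d = 4
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)
    theta0 = theta + 1e-2 * rng.standard_normal(d)
    b, b0 = rng.random(), rng.random()
    a = rng.standard_normal(d)
    a /= np.linalg.norm(a)                       # actions on the unit sphere
    gap = abs(relu_reward(theta, b, a) - relu_reward(theta0, b0, a))
    bound = np.linalg.norm(theta - theta0) * np.linalg.norm(a) + abs(b - b0)
    worst = max(worst, gap - bound)

# |max(x, 0) - max(y, 0)| <= |x - y| gives gap <= bound in every trial.
assert worst <= 1e-12
```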

PROOF OFLEMMA D.4    Proof. By the definition of the operator Γ e [φ, D, λ], we haveg, Γ e [φ, D, λ]g H = λ g, g H + have λ g, g H ≤ g, Γ e [φ, D, λ]g H . By the definition of the reproducing kernel, we have (s,a,r,s )∈D g(φ(s, a)) 2 = (s,a,r,s )∈D g, k(φ(s, a)) 2H . Therefore, we have inequality. Combining (J.33) with (J.34), we haveg, Γ e [φ, D, λ]g H ≤ λ + (s,a,r,s )∈D K φ(s, a), φ(s, a) g, g H .Similarly, by the definition of the operator Γ p [φ, ρ, λ, τ ], we haveg, Γ p [φ, ρ, λ, τ ]g H = λ g, g H + τ E (s,a)∼ρ g 2 φ(s, a) . (J.35) Therefore, we have λ g, g H ≤ g, Γ p [φ, ρ, λ, τ ]g H . Similar with (J.34), we have E (s,a)∼ρ g 2 φ(s, a) ≤ g, g H E (s,a)∼ρ K φ(s, a), φ(s, a) . (J.36)


Therefore, by (I.14), we have E π [u n 1 (s init , a 1 )] ≤ 2 |A| βd By the definition of u n h in Lemma E.5, we have u n ∞ ≤ 2H +2. By the definition of ρ n

For a bounded function g, we define the functional χ[ψ n h , π, g] :H → R as χ[ψ n h , π, g]g = Bds = Bv g H ,where v is the Lebesgue measure of S. Therefore, χ[ψ n h , π, g] is bounded by Bv. By Riesz theorem, there exists an element u[ψ n h , π, g] ∈ H with u[ψ n h , π, g] H ≤ Bv, such that χ[ψ n h , π, g]g = u[ψ n h , π, g], g H for any g ∈ H.Therefore, by (J.8) and Cauchy-Schwarz inequality, we haveE π,P n [g(s h , a h )] ≤ E π,P n u[ψ n h , π, g], k φ n h-1 (s h-1 , a h-1 ) ≤ E π,P n k φ n h-1 (s h-1 , a h-1 ) Γ -1 ,λ,n] /c min , where Γ p [φ n h-1 , ρ n h-1 , λ, n] is defined in (E.4), and c min is the lower bound of the normalization constant c[φ n h-1 , ψ n h ](s h-1 , a h-1 ). What remains is to bound the term u[ψ n h , π, g] Γp[φ n h-1 ,ρ n h-1 ,λ,n]from the above. By the definition of Γ p [φ n h-1 , ρ n h-1 , λ, n], we have

,λ,n] ≤ λB 2 v 2 + 2nE (s,a)∼ρ n h-1 c 2 max E s ∼P * h-1 (•|s,a),π h g(s , a ) 2 + 2nB 2 E (s,a)∼ρ n h-1 c 2 max TV 2 P n h-1 (• | s, a), P * h-1 (• | s, a) . (J.18) Therefore, when nE (s,a)∼ρ n h-1 [TV 2 (P n h-1 (• | s, a), P * h-1 (• | s, a))] ≤ ζ 2 holds, we have ,λ,n] ≤ λB 2 v 2 + 2c 2 max nE ρ n h-1 ,P * h-1 ,π h g 2 (s , a ) + B 2 ζ 2 (J.19)by (J.18). Here the expectation is defined in Lemma J.2. By the definition of U(A), we haveE ρ n h-1 ,P * h-1 ,π h g 2 (s , a ) ≤ |A| E ρ n h-1 ,P * h-1 ,U (A) g 2 (s , a) . (J.20) Combining (J.11), (J.19) with (J.20), we have E π,P n h-1 [g(s h , a h )] ≤E π,P n h-1

π,P * g(s h , a h ) ≤ E π,P * (s h-1 , a h-1 ), ψ * h (s) ḡ(s, π h )ds c min , (J.21) where ḡ(s, π h ) is defined in (J.7). By (J.10), the the linear functional χ[ψ * h , π, g] defined in (J.9) is bounded by Bv. Here v is the Lebesgue measure of S. By Riesz's theorem, there exists an elementu[ψ * h , π, g] ∈ H with u[ψ * h , π, g] H ≤ Bv, such that χ[ψ * h , π, g]g = u[ψ * h , π, g], gH holds for all g ∈ H. Combining this property with (J.21), we haveE π,P * h g(s h , a h ) ≤ E π,P * p [φ * h-1 , ρ * h-1 , λ, n] is defined in (E.4), and c min is the lower bound of the normalization constant c[φ n h-1 , ψ n h ](s h-1 , a h-1 ). By the definition of the operator , we have,λ,n] =λ u[ψ * h , π, g] 2 H + nE (s,a)∼ρ * h-1 u 2 [ψ * h , π, g] φ * h-1 (s, a) ≤λB 2 v 2 + nE (s,a)∼ρ * h-1 u 2 [ψ * h , π, g] φ * h-1 (s, a) . ,λ,n] ≤λB 2 v 2 + nc 2 max E (s,a)∼ρ n h-1 E s ∼P * h (•|s,a),π h g(s , a ) 2 ≤λB 2 v 2 + nc 2 max E ρ n h-1 ,P * h (•|s,a),π h g 2 (s , a ) , (J.25)where the last inequality follows from Cauchy-Schwarz inequality. Since g 2 is non-negative, we haveE ρ n h-1 ,P * h (•|s,a),π h g 2 (s h , a h ) ≤ |A| E ρ n h-1 ,P * h (•|s,a),U (A) g 2 (s h , a h ) . ,π h [g(s h , a h )] = E (s h-1 ,a h-1 )∼ρ n h-1 ,s h ∼P * h (•|s h-1 ,a h-1 ),a h ∼π h (•|s h ) [g(s h , a h )]. Combining (J.22), (J.25) and (J.26), we have E π,P * h g(s h , a h ) ≤ E π,P *

λ, (J.30) for any g ∈ H, x ∈ X . Here g[X n ] is defined in §1.2. We show that Γ[X n ] is the inverse of Γ[X n ]. By the definition of Γ[X n ] in Lemma D.3, we have (Γ[X n ]g)[X n ] = (λI + K[X n ])g[X n ], and Γ[X n ]g(x) = λg(x) + g[X n ] k[X n ](x), (J.31) where k[X n ](x) is defined in Lemma D.3. Combining Equations (J.30) and (J.31), we have Γ


G.4 PROOF OF LEMMA G.9

Proof. For P 1 , P 2 ∈ G, we have P i (s | s, a) = (P i (s | s, a) + P * (s | s, a))/2 for i = 1, 2, s , s ∈ S and a ∈ A, for some P 1 , P 2 ∈ M. Therefore, we have

By the definition of the normalization function c[P u ](s, a), we have

where v is the Lebesgue measure of S. Combining (G.15), (G.16) with the fact that P u (s | s, a) ≤ 1 for all P u ∈ M u , we have

The set C is an ε-covering set of M, and

Thus, we conclude the proof of Lemma G.10.

G.6 PROOF OF LEMMA G.11

Proof. First, for any x, y > 0, we have

(G.17)

Lemma H.7. For a fixed x ∈ X , a set X n = {x τ } n τ =1 ⊂ X , and a vector v = (v 1 , . . . , v n ) , we have

Here the matrix K[X n ] and the vector k[X n ](x) are defined in (H.6) and (H.7).

Proof. We conclude the proof by using simple linear algebra.

Combining (H.15) with the lemma above, we can prove Lemma H.6 by bounding the term

We bound the term using tools from approximation theory.

Lemma H.8 (Theorem 11.21 in Wendland (2004)). Suppose that X ⊂ R m is a cube with side length l, and we have max x∈X min x ∈Xn x -x 2 ≤ ι for X n = {x 1 , . . . , x n }. We define γ m = 4 m m!, C app,1 = l/(3γ m ), and q = [C app,1 /ι], where [x] denotes the maximal integer that does not exceed x. Then there exists a sequence of functions {υ i } n i=1 : X → R such that (1) for every p ∈ q (R m ), we have p(x) = n i=1 υ i (x)p(x i ) for all x ∈ X , (2) we have n i=1 |υ i (x)| ≤ exp(2mγ m (q + 1)) for all x ∈ X , and (3) we have υ i (x) = 0 when x -x i 2 > √ ml. Here q (R m ) is the set of all polynomials in R m with total degree no higher than q.

By the lemma above, there exists a sequence of functions {υ i } n i=1 such that (1) we have

for all x ∈ X and p ∈ q (R m ), and (2) we have n i=1 |υ i (x)| ≤ exp(2mγ m (q + 1)) for all x ∈ X . For a fixed x 0 , we have p(x -x 0 ) ∈ q (R m ) when p(x) ∈ q (R m ). By (H.16), we have

(H.17)

Similarly, we have p(x i -x) ∈ q (R m ) when p(x) ∈ q (R m ). We apply (H.16) again and have

for all x 0 ∈ X and p ∈ q (R m ). Combining (H.17) with (H.18), we have

Combining (H.19) with Lemma H.7, we have

for all x 0 ∈ X and p ∈ q (R m ). By the definition of the kernel K, we have

Since π n+1 is the greedy policy with regard to Q n h , we have

Therefore, taking expectation over P n and π * on both sides of (I.4), we have

and we conclude the proof of Lemma E.6.

I.3 PROOF OF LEMMA E.7

Proof. The following lemma shows that the value function defined by the estimated model is bounded from above by the value function defined by the true model.

Lemma I.1 (Bounded Optimism in Each Iteration). Under the same conditions as Theorem 5.3, conditioning on the good events $\mathcal{E}_1$ and $\mathcal{E}_2$, which are defined in (E.5), (E.6), and (E.7), we have the stated inequality for any policy $\pi$ with the associated bonus $u^{*,n}$.

Proof. See Appendix §I.4 for a detailed proof.

By Lemma I.1 and the definition of the expected total reward, we obtain (I.5). However, the bonus defined by the learned feature might vary in each episode, which makes it difficult to bound from above. The following lemma connects the bonus defined by the learned feature with the bonus defined by the true feature.

Lemma I.2 (Bonus Equivalence for the True Model). Under the same conditions as Theorem 5.3, conditioning on $\mathcal{E}_1$ and $\mathcal{E}_2$, which are defined in (E.5), (E.6), and (E.7), we have the stated equivalence for any policy $\pi$ and $n \ge 2$, where the bonus $u^n = \{u^n_h\}_{h=1}^H$ is defined in Lemma E.5 and the underlying bonus $u^{*,n} = \{u^{*,n}_h\}_{h=1}^H$ is defined in Lemma I.1.

Proof. See Appendix §I.5 for a detailed proof.

Combining (I.5) with Lemma I.2, we obtain the desired bound in terms of $J(\pi^{n+1}; u^{*,n}, P^*)$.

J.8 PROOF OF LEMMA F.2

Proof. Since the operators $A$ and $B$ are self-adjoint and positive definite, the operator $B^{-1/2} A B^{-1/2}$ is also self-adjoint and positive definite. Therefore, the operator $C = (B^{-1/2} A B^{-1/2})^{1/2}$ is also self-adjoint and positive definite. Since $B^{-1/2} C^{-1} B^{-1/2} x \in D$, the first inequality holds by the definition of $D$ in Lemma F.2. Since the operators $B$ and $C$ are self-adjoint, we obtain the next identity, and therefore, by (J.37), the bound (J.38). Similarly, we obtain (J.39). We conclude the proof of Lemma F.2 by combining (J.38) with (J.39).

J.9 PROOF OF LEMMA F.3

Proof. Without loss of generality, we assume that $\langle g, g \rangle_{\mathcal{H}} = 1$. We then have the pointwise bound for any $x \in \mathcal{X}$. We also introduce the following version of the Bernstein inequality.

Lemma J.4.
Let $\{\mathcal{F}_\tau\}_{\tau=0}^n$ be a filtration and $\{X_\tau\}_{\tau=1}^n$ an $\mathbb{R}$-valued stochastic process adapted to this filtration. Suppose that $\mathbb{E}[X_\tau \mid \mathcal{F}_{\tau-1}] = 0$ and $|X_\tau| \le c$ almost surely. Then, for any $\lambda \in (0, 1/c]$ and $\delta \in (0, 1)$, we have $\sum_{\tau=1}^n X_\tau \le (e - 2)\lambda \sum_{\tau=1}^n \mathbb{E}[X_\tau^2 \mid \mathcal{F}_{\tau-1}] + \log(1/\delta)/\lambda$ with probability at least $1 - \delta$.

Proof. See Appendix §J.12 for a detailed proof.

By Lemma J.4, the corresponding bound holds with probability at least $1 - \delta$. By the property of the variance, we have (J.41). Combining (J.40) with (J.41), the refined bound holds with probability at least $1 - \delta$. Similarly, the symmetric bound holds with probability at least $1 - \delta$. By simple calculation, the final bound holds with probability at least $1 - \delta$. Thus, we conclude the proof of Lemma F.3.

J.10 PROOF OF LEMMA F.4

Proof. By the definition of $\Gamma$ in (F.1), we have (J.42). The following lemma (Lemma J.5) characterizes this operator; here $\Gamma_p[\phi, \rho^n_{h,i}, \lambda, n]$ and $\Gamma_e[\phi, \mathcal{D}^n_{h,i}, \lambda]$ are defined in (E.2) and (E.4).

Proof. See Appendix §J.13 for a detailed proof.

Combining Lemma J.5 with (J.43), we obtain the upper bound. Similarly, by Lemma D.4, we have $\langle g_1, \Gamma_1^{-1} g_1 \rangle_{\mathcal{H}} \ge \langle g_1, g_1 \rangle_{\mathcal{H}}/(\lambda + n)$ for any $g_1 \in \mathcal{H}$. Therefore, by (J.42), we obtain the lower bound. Combining Lemma J.5 with (J.44), we obtain the corresponding estimate. By Lemma D.4, we have $\langle g_1, \Gamma_1^{-1} g_1 \rangle_{\mathcal{H}} \ge \langle g_1, g_1 \rangle_{\mathcal{H}}/(\lambda + n)$ for any $g_1 \in \mathcal{H}$. Therefore, we have $\langle \Gamma g, \Gamma g \rangle_{\mathcal{H}} \ge \lambda \langle g, \Gamma_1^{-1} g \rangle_{\mathcal{H}}/(\lambda + n)^3$. Thus, we conclude the proof of Lemma F.4.

J.11 PROOF OF LEMMA F.5

Proof. We denote by $\bar{\mathcal{C}}$ an $\bar\epsilon$-covering of the unit ball in $\mathbb{R}^m$, and define $\mathcal{C}$ by $\mathcal{C} = \{\Gamma K(x, \cdot) \mid x \in \bar{\mathcal{C}}\}$. In the remainder of the proof, we show that $\mathcal{C}$ is a $(\lambda + n)\bar\epsilon^2/\lambda^3$-covering of $\mathcal{H}_1$ with respect to the infinity norm.

For any $g \in \mathcal{H}_1$, we have $g(\cdot) = \Gamma K(x, \cdot)$ for some $x$ in the unit ball. By the definition of the set $\bar{\mathcal{C}}$, there exists $x' \in \bar{\mathcal{C}}$ such that $\|x - x'\| \le \bar\epsilon$. By the reproducing property of the kernel, we have (J.45) for any $y \in \mathbb{R}^m$. Combining (J.45) with the Cauchy–Schwarz inequality, we have (J.46) for any $y \in \mathbb{R}^m$, where the second equation follows from the fact that $\langle k(y), k(y) \rangle_{\mathcal{H}} = 1$. By Lemma F.4, we have $\langle \Gamma g_1, \Gamma g_1 \rangle_{\mathcal{H}} \le (\lambda + n)\langle g_1, g_1 \rangle_{\mathcal{H}}/\lambda^3$ for any $g_1 \in \mathcal{H}$. Therefore, by (J.46), we obtain the claimed covering radius. We conclude the proof of Lemma F.5 by setting $\bar\epsilon = \lambda\sqrt{\lambda/(\lambda + n)}$.

J.12 PROOF OF LEMMA J.4

Proof.
Since $e^x \le 1 + x + (e - 2)x^2$ holds for any $x \le 1$, we have $\mathbb{E}[\exp(\lambda X_\tau) \mid \mathcal{F}_{\tau-1}] \le \exp\bigl((e - 2)\lambda^2\, \mathbb{E}[X_\tau^2 \mid \mathcal{F}_{\tau-1}]\bigr)$ when $\lambda \le 1/c$. Therefore, when we define $Y_t = \exp\bigl\{\lambda \sum_{\tau=1}^t X_\tau - (e - 2)\lambda^2 \sum_{\tau=1}^t \mathbb{E}[X_\tau^2 \mid \mathcal{F}_{\tau-1}]\bigr\}$, the process $\{Y_t\}_{t=0}^n$ is a supermartingale with $\mathbb{E}[Y_n] \le 1$, and Markov's inequality yields the claim of Lemma J.4.

J.13 PROOF OF LEMMA J.5

Proof. By the property of the inverse operator, it remains to prove that $\langle \Gamma g, \Gamma_1^{-1} \Gamma g \rangle_{\mathcal{H}} \le (\lambda + n)\langle g, g \rangle_{\mathcal{H}}/\lambda$ for any $g \in \mathcal{H}$. By Lemma D.4, we have $\langle g_1, \Gamma^2 g_1 \rangle_{\mathcal{H}} \le (\lambda + n)\langle g_1, g_1 \rangle_{\mathcal{H}}$. Moreover, by Lemma D.4, we have $\langle g_1, \Gamma_1^{-1} g_1 \rangle_{\mathcal{H}} \le \langle g_1, g_1 \rangle_{\mathcal{H}}/\lambda$ for any $g_1 \in \mathcal{H}$. Therefore, we have $\langle \Gamma g, \Gamma_1^{-1} \Gamma g \rangle_{\mathcal{H}} \le \langle \Gamma g, \Gamma g \rangle_{\mathcal{H}}/\lambda \le (\lambda + n)\langle g, g \rangle_{\mathcal{H}}/\lambda$. Similarly, by Lemma D.4, we have $\langle g_1, \Gamma^2 g_1 \rangle_{\mathcal{H}} \ge \lambda \langle g_1, g_1 \rangle_{\mathcal{H}}$ and $\langle g_1, \Gamma_1^{-1} g_1 \rangle_{\mathcal{H}} \ge \langle g_1, g_1 \rangle_{\mathcal{H}}/(\lambda + n)$ for any $g_1 \in \mathcal{H}$. Therefore, we have $\langle \Gamma g, \Gamma_1^{-1} \Gamma g \rangle_{\mathcal{H}} \ge \langle \Gamma g, \Gamma g \rangle_{\mathcal{H}}/(\lambda + n) \ge \lambda \langle g, g \rangle_{\mathcal{H}}/(\lambda + n)$. Thus, we conclude the proof of Lemma J.5.
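The elementary bound $e^x \le 1 + x + (e - 2)x^2$ for $x \le 1$, used at the start of the proof of Lemma J.4, can be sanity-checked numerically; a quick sketch:

```python
import numpy as np

# Numerical check of e^x <= 1 + x + (e - 2) x^2 on a dense grid of x <= 1;
# equality holds at x = 0 and x = 1, so a tiny tolerance absorbs rounding.
x = np.linspace(-10.0, 1.0, 100001)
print(bool(np.all(np.exp(x) <= 1.0 + x + (np.e - 2.0) * x**2 + 1e-12)))  # prints True
```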

