OFFLINE REINFORCEMENT LEARNING WITH DIFFERENTIABLE FUNCTION APPROXIMATION IS PROVABLY EFFICIENT

Abstract

Offline reinforcement learning, which aims at optimizing sequential decision-making strategies with historical data, has been extensively applied in real-life applications. State-of-the-art algorithms usually leverage powerful function approximators (e.g. neural networks) to alleviate the sample complexity hurdle for better empirical performance. Despite the successes, a more systematic understanding of the statistical complexity for function approximation remains lacking. Towards bridging the gap, we take a step by considering offline reinforcement learning with differentiable function class approximation (DFA). This function class naturally incorporates a wide range of models with nonlinear/nonconvex structures. We show offline RL with differentiable function approximation is provably efficient by analyzing the pessimistic fitted Q-learning (PFQL) algorithm, and our results provide the theoretical basis for understanding a variety of practical heuristics that rely on Fitted Q-Iteration style design. In addition, we further improve our guarantee with a tighter instance-dependent characterization. We hope our work could draw interest in studying reinforcement learning with differentiable function approximation beyond the scope of current research.

1. INTRODUCTION

Offline reinforcement learning (Lange et al., 2012; Levine et al., 2020) refers to the paradigm of learning a policy for sequential decision-making problems, where only logged data are available, collected from an unknown environment (Markov Decision Process / MDP). Inspired by the success of scalable supervised learning methods, modern reinforcement learning algorithms (e.g. Silver et al. (2017)) incorporate high-capacity function approximators to acquire generalization across large state-action spaces and have achieved excellent performance across a wide range of domains. For instance, there is a huge body of deep RL-based algorithms that tackle challenging problems such as the games of Go and chess (Silver et al., 2017; Schrittwieser et al., 2020), robotics (Gu et al., 2017; Levine et al., 2018), energy control (Degrave et al., 2022) and biology (Mahmud et al., 2018; Popova et al., 2018). Nevertheless, practitioners have also noticed that algorithms with general function approximators can be quite data inefficient, especially for deep neural networks, where the models may require millions of steps to tune the large number of parameters they contain.

On the theoretical side, statistical analyses have been conducted under specific model representations (e.g. tabular or linear MDPs); while such structures make the analysis tractable (linear problems are easier to analyze), they are unable to reveal the sample/statistical complexity behaviors of practical algorithms that apply powerful function approximation (which might have complex structures). In addition, there is an excellent line of works tackling provably efficient offline RL with general function approximation (e.g. (Chen and Jiang, 2019; Xie et al., 2021a; Zhan et al., 2022)). Due to the generic function approximation classes considered, those complexity bounds are usually expressed in the standard worst-case fashion $O(V_{\max}\sqrt{1/n})$, which lacks a characterization of individual-instance behavior.
However, as mentioned in Zanette and Brunskill (2019), practical reinforcement learning algorithms often perform far better than what these problem-independent bounds would suggest. These observations motivate us to consider function approximation schemes that can help address the existing limitations. In particular, in this work we consider offline reinforcement learning with differentiable function class approximation, defined below.

Definition 1.1 (Parametric Differentiable Function Class). Let $\mathcal{S}, \mathcal{A}$ be arbitrary state and action spaces, and let $\phi(\cdot,\cdot): \mathcal{S}\times\mathcal{A} \to \Psi \subset \mathbb{R}^m$ be a feature map. The parameter space is $\Theta \subset \mathbb{R}^d$, and both $\Theta$ and $\Psi$ are compact. Then the parametric function class (for a model $f: \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}$) is defined as $\mathcal{F} := \{f(\theta, \phi(\cdot,\cdot)): \mathcal{S}\times\mathcal{A} \to \mathbb{R},\ \theta \in \Theta\}$, where $f$ satisfies the differentiability/smoothness conditions: 1. for any $\phi \in \mathbb{R}^m$, $f(\theta,\phi)$ is three times differentiable with respect to $\theta$; 2. $f, \partial_\theta f, \partial^2_{\theta,\theta} f, \partial^3_{\theta,\theta,\theta} f$ are jointly continuous in $(\theta, \phi)$.

Remark 1.2. The differentiable function class was recently proposed for studying the Off-Policy Evaluation (OPE) problem (Zhang et al., 2022a), and we adopt it here for the policy learning task. Note that by the compactness of $\Theta, \Psi$ and continuity, there exist constants $C_\Theta, B_F, \kappa_1, \kappa_2, \kappa_3 > 0$ such that $\|\theta\|_2 \le C_\Theta$, $|f(\theta, \phi(s,a))| \le B_F$, $\|\nabla_\theta f(\theta, \phi(s,a))\|_2 \le \kappa_1$, $\|\nabla^2_{\theta\theta} f(\theta,\phi(s,a))\|_2 \le \kappa_2$, $\|\nabla^3_{\theta\theta\theta} f(\theta,\phi(s,a))\|_2 \le \kappa_3$ for all $\theta \in \Theta$ and $(s,a) \in \mathcal{S}\times\mathcal{A}$.

Why consider the differentiable function class (Definition 1.1)? There are two main reasons why the differentiable function class is worth studying for reinforcement learning.
• Due to the limitation of statistical tools, existing analyses in reinforcement learning usually favor basic settings such as tabular MDPs (where the state and action spaces are finite (Azar et al., 2013; 2017; Sidford et al., 2018; Jin et al., 2018; Cui and Yang, 2020; Agarwal et al., 2020; Yin et al., 2021a;b; Li et al., 2020; Ren et al., 2021; Xie et al., 2021b; Li et al., 2022; Zhang et al., 2022b; Qiao et al., 2022; Cui and Du, 2022)) or linear MDPs (Yang and Wang, 2020; Jin et al., 2020b; Wang et al., 2020; Jin et al., 2021b; Ding et al., 2021; Wang et al., 2021a; Min et al., 2021) / linear mixture MDPs (Modi et al., 2020; Cai et al., 2020; Zhang et al., 2021a; Zhou et al., 2021b;a) (where the transition dynamics admit linear structures), so that well-established techniques (e.g. from linear regression) can be applied. In addition, subsequent extensions are often based on linear models (e.g. linear Bellman complete models (Zanette et al., 2020) and Eluder dimension (Russo and Van Roy, 2013; Jin et al., 2021a)). The differentiable function class strictly generalizes over these previous popular choices, i.e. by choosing $f(\theta,\phi) = \langle\theta,\phi\rangle$ or specifying $\phi$ to be one-hot representations, and is far more expressive as it encompasses nonlinear approximators.
• Practically speaking, the flexibility of selecting the model $f$ provides the possibility of handling a variety of tasks. For instance, when $f$ is specified to be a neural network, $\theta$ corresponds to the weights of the network layers and $\phi(\cdot,\cdot)$ corresponds to the state-action representation (which is induced by the network architecture). When facing easier tasks, we can deploy a simpler model $f$ such as polynomials. Yet, our statistical guarantee is not affected by the specific choice, as we can plug the model $f$ into Theorem 3.2 to obtain the respective bound (we do not need a separate analysis for different tasks).
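To make Definition 1.1 concrete, the snippet below instantiates one (hypothetical) member of such a class: a sigmoid generalized linear model $f(\theta,\phi) = \sigma(\langle\theta,\phi\rangle)$, which is smooth in $\theta$ so the differentiability conditions hold, and checks its closed-form gradient against central finite differences. All names and values here are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(theta, phi):
    """Model value f(theta, phi(s, a)) for a sigmoid GLM."""
    return sigmoid(theta @ phi)

def grad_f(theta, phi):
    """Closed-form gradient w.r.t. theta: sigma'(z) * phi, z = <theta, phi>."""
    z = theta @ phi
    return sigmoid(z) * (1.0 - sigmoid(z)) * phi

# Compare the analytic gradient with central finite differences.
rng = np.random.default_rng(0)
theta = rng.normal(size=4)
phi = rng.normal(size=4)

eps = 1e-6
num_grad = np.array([
    (f(theta + eps * e, phi) - f(theta - eps * e, phi)) / (2 * eps)
    for e in np.eye(4)
])
grad_err = np.max(np.abs(num_grad - grad_f(theta, phi)))
```

Any model passing such a gradient check (and its second/third-order analogues) fits the smoothness requirements of the class; linear models arise as the special case with an identity link.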

1.1. RELATED WORKS

Reinforcement learning with function approximation. RL with function approximation has a long history that dates back to Bradtke and Barto (1996); Tsitsiklis and Van Roy (1996). Later, it drew significant interest for finite-sample analysis (Jin et al., 2020b; Yang and Wang, 2019). Since then, tremendous effort has been put toward generalizing beyond linear function approximation, with examples including linear Bellman complete models (Zanette et al., 2020), Eluder dimension (Russo and Van Roy, 2013; Jin et al., 2021a), linear deterministic $Q^\star$ (Wen and Van Roy, 2013), and the Bilinear class (Du et al., 2021). While those extensions are valuable, the structural conditions assumed usually make the classes hard to track beyond the linear case. For example, the practical instances of Eluder dimension are based on linear-in-feature (or its transformation) representations (Section 4.1 of Wen and Van Roy (2013)). In comparison, the differentiable function class contains a range of functions that are widely used in practical algorithms (Riedmiller, 2005).

Table 1: Suboptimality gaps for different algorithms with the differentiable function class of Definition 1.1. Here we omit the higher-order terms for clear comparison. With Concentrability, we can only achieve a worst-case bound that does not explicitly depend on the function model $f$. With the stronger uniform coverage (Assumption 2.3), better instance-dependent characterizations become available. Here $C_{\mathrm{eff}}$ is defined in Assumption 2.2, $\Sigma^\star_h$ in Theorem 3.2, $\Lambda^\star_h$ in Theorem 4.1, and $\epsilon_F$ in Assumption 2.1.

| Algorithm | Assumption | Suboptimality gap $v^\star - v^{\hat\pi}$ |
| VFQL, Theorem 3.1 | Concentrability (2.2) | $\sqrt{C_{\mathrm{eff}}}H\cdot\big(\sqrt{\frac{H^2 d+\lambda C_\Theta^2}{K}}+\sqrt[4]{\frac{H^3 d\epsilon_F}{K}}\big)+\sqrt{C_{\mathrm{eff}}H^3\epsilon_F}+H\epsilon_F$ |
| PFQL, Theorem 3.2 | Uniform coverage (2.3) | $\sum_{h=1}^H 16dH\cdot\mathbb{E}_{\pi^\star}\big[\sqrt{\nabla_\theta^\top f(\theta^\star_h,\phi(s_h,a_h))\,\Sigma_h^{\star-1}\,\nabla_\theta f(\theta^\star_h,\phi(s_h,a_h))}\big]$ |
| VAFQL, Theorem 4.1 | Uniform coverage (2.3) | $16d\cdot\sum_{h=1}^H \mathbb{E}_{\pi^\star}\big[\sqrt{\nabla_\theta^\top f(\theta^\star_h,\phi(s_h,a_h))\,\Lambda_h^{\star-1}\,\nabla_\theta f(\theta^\star_h,\phi(s_h,a_h))}\big]$ |

Offline RL with general function approximation (GFA).
Another interesting thread of work considers offline RL with general function approximation (Ernst et al., 2005; Chen and Jiang, 2019; Liu et al., 2020; Xie et al., 2021a), which only imposes realizability and completeness/concentrability assumptions. The major benefit is that the function hypothesis class can be arbitrary with no structural assumptions, and it has been shown that offline RL with GFA is provably efficient. However, the generic form of the functions in GFA makes it hard to go beyond worst-case analysis and obtain fine-grained instance-dependent learning bounds similar to those in the linear case. On the contrary, our results with DFA can be more problem-adaptive by leveraging gradient and higher-order information.

In addition to the above, there are more connected works. Zhang et al. (2022a) first consider differentiable function approximation (DFA) for the off-policy evaluation (OPE) task and build the asymptotic theory, Fan et al. (2020) analyze deep Q-learning with the specific ReLU activation, and Kallus and Uehara (2020) consider semi-parametric/nonparametric methods for offline RL (as opposed to our parametric DFA in Definition 1.1). These are nice complementary studies to our work.

Our contribution. We provide the first instance-dependent offline learning bound under non-linear function approximation. Informally, we show that (up to a lower-order term) the natural complexity measure is proportional to $\sum_{h=1}^H \mathbb{E}_{\pi^\star,h}\big[\sqrt{g_{\theta^\star}(s,a)^\top \Sigma_h^{-1} g_{\theta^\star}(s,a)}\big]$, where $g_\theta(s,a) := \nabla_\theta f(\theta,\phi(s,a))$ is the gradient w.r.t. the parameter (evaluated at $\theta^\star$) at feature $\phi$, and $\Sigma_h = \sum_i g_{\theta^\star}(s_{i,h},a_{i,h})g_{\theta^\star}(s_{i,h},a_{i,h})^\top$ is the Fisher information matrix of the observed data at $\theta^\star$. This is achieved by analyzing the pessimistic fitted Q-learning (PFQL) algorithm (Theorem 3.2). In addition, we further analyze its variance-reweighting variant, which recovers the variance-dependent structure and can yield a faster convergence rate.
Last but not least, existing offline RL results with tabular models, linear models and GLM models can be directly recovered by the appropriate choice of our model class $\mathcal{F}$.

2. PRELIMINARIES

Episodic MDP. We consider a finite-horizon Markov Decision Process with state space $\mathcal{S}$, action space $\mathcal{A}$, horizon $H$, transition kernels $P_h$, rewards $r_h$ and initial distribution $d_1$. A policy $\pi = \{\pi_h\}_{h\in[H]}$ is a collection of mappings $s_h \to \pi_h(\cdot|s_h)$ for all $h\in[H]$ and induces a random trajectory $s_1, a_1, r_1, \ldots, s_H, a_H, r_H, s_{H+1}$ with $s_1\sim d_1$, $a_h\sim\pi_h(\cdot|s_h)$, $s_{h+1}\sim P_h(\cdot|s_h,a_h)$ for all $h\in[H]$. Given a policy $\pi$, the V-value functions and state-action value functions (Q-functions) $Q^\pi_h(\cdot,\cdot)\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ are defined as $V^\pi_h(s) = \mathbb{E}_\pi[\sum_{t=h}^H r_t \mid s_h = s]$ and $Q^\pi_h(s,a) = \mathbb{E}_\pi[\sum_{t=h}^H r_t \mid s_h = s, a_h = a]$ for all $s, a, h$. The Bellman (optimality) equations state that for all $h\in[H]$ and $(s,a)\in\mathcal{S}\times\mathcal{A}$: $Q^\pi_h(s,a) = r_h(s,a) + \int_{\mathcal{S}} V^\pi_{h+1}(s')\,dP_h(s'|s,a)$, $V^\pi_h(s) = \mathbb{E}_{a\sim\pi_h(\cdot|s)}[Q^\pi_h(s,a)]$, $Q^\star_h(s,a) = r_h(s,a) + \int_{\mathcal{S}} V^\star_{h+1}(s')\,dP_h(s'|s,a)$, $V^\star_h(s) = \max_a Q^\star_h(s,a)$. We define the Bellman operator $\mathcal{P}_h$ for any function $V\in\mathbb{R}^{\mathcal{S}}$ as $\mathcal{P}_h(V) = r_h + \int_{\mathcal{S}} V\,dP_h$, so that $\mathcal{P}_h(V^\pi_{h+1}) = Q^\pi_h$ and $\mathcal{P}_h(V^\star_{h+1}) = Q^\star_h$. The performance measure is $v^\pi := \mathbb{E}_{d_1}[V^\pi_1] = \mathbb{E}_{\pi,d_1}[\sum_{t=1}^H r_t]$. Lastly, the induced state-action marginal occupancy measure for any $h\in[H]$ is defined by: for any $E\subseteq\mathcal{S}\times\mathcal{A}$, $d^\pi_h(E) := \Pr[(s_h,a_h)\in E \mid s_1\sim d_1, a_i\sim\pi(\cdot|s_i), s_i\sim P_{i-1}(\cdot|s_{i-1},a_{i-1}), 1\le i\le h]$, and $\mathbb{E}_{\pi,h}[f(s,a)] := \int_{\mathcal{S}\times\mathcal{A}} f(s,a)\,d^\pi_h(s,a)\,ds\,da$.

Offline Reinforcement Learning. The goal of offline RL is to learn the policy $\pi^\star := \arg\max_\pi v^\pi$ using only the historical data $\mathcal{D} = \{(s^\tau_h, a^\tau_h, r^\tau_h, s^\tau_{h+1})\}_{\tau\in[K]}^{h\in[H]}$. The data-generating behavior policy is denoted by $\mu$. In the offline regime, we have neither knowledge of $\mu$ nor access to further exploration with a different policy. The agent is asked to find a policy $\hat\pi$ such that $v^\star - v^{\hat\pi} \le \epsilon$ for the given batch data $\mathcal{D}$ and a specified accuracy $\epsilon > 0$.
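The backward-induction structure of the Bellman equations above can be sketched on a toy tabular MDP: starting from $V_{H+1} = 0$, each $Q^\pi_h$ is the reward plus the expected next-step value, and $V^\pi_h$ averages $Q^\pi_h$ over the policy. The 2-state, 2-action instance below uses made-up numbers purely for illustration.

```python
import numpy as np

H, S, A = 3, 2, 2
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, s, a] = P_h(. | s, a)
r = rng.uniform(size=(H, S, A))                 # rewards r_h(s, a) in [0, 1]
pi = np.full((H, S, A), 0.5)                    # uniform policy pi_h(a | s)

V = np.zeros((H + 2, S))                        # V[H+1] = 0 boundary condition
Q = np.zeros((H + 1, S, A))
for h in range(H, 0, -1):                       # backward induction h = H..1
    Q[h] = r[h - 1] + P[h - 1] @ V[h + 1]       # Q_h = r_h + P_h V_{h+1}
    V[h] = np.sum(pi[h - 1] * Q[h], axis=1)     # V_h = E_{a~pi}[Q_h]
v_pi = float(V[1].mean())                       # v^pi with uniform d_1
```

Since rewards lie in $[0,1]$, the computed values respect $0 \le v^\pi \le H$, matching the truncation range used by the algorithms later in the paper.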

2.1. ASSUMPTIONS

Function approximation in offline RL requires sufficient expressiveness of $\mathcal{F}$. In fact, even under realizability and concentrability conditions, sample-efficient offline RL might not be achievable (Foster et al., 2021). Therefore, under the differentiable function setting (Definition 1.1), we make the following assumptions.

Assumption 2.1 (Realizability + Bellman Completeness). The differentiable function class $\mathcal{F}$ in Definition 1.1 satisfies: • Realizability: for the optimal $Q^\star_h$, there exists $\theta^\star_h\in\Theta$ such that $Q^\star_h(\cdot,\cdot) = f(\theta^\star_h,\phi(\cdot,\cdot))$ for all $h$; • Bellman Completeness: let $\mathcal{G} := \{V(\cdot)\in\mathbb{R}^{\mathcal{S}} : \|V\|_\infty \le H\}$; then $\sup_{V\in\mathcal{G}}\inf_{f\in\mathcal{F}}\|f - \mathcal{P}_h(V)\|_\infty \le \epsilon_F$ for some $\epsilon_F \ge 0$.

Realizability and Bellman completeness are widely adopted in offline RL analyses with general function approximation (Chen and Jiang, 2019; Xie et al., 2021a), and Assumption 2.1 states their differentiable function approximation version. There are other forms of completeness, e.g. the optimistic closure defined in Wang et al. (2021b).

Data coverage assumption. Furthermore, in the offline regime, it is known that function approximation cannot be sample efficient for learning an $\epsilon$-optimal policy without data-coverage assumptions when $\epsilon$ is small (i.e. high accuracy) (Wang et al., 2021a). In particular, we consider two types of coverage assumptions and provide guarantees for them separately.

Assumption 2.2 (Concentrability Coverage). For any fixed policy $\pi$, define the marginal state-action occupancy ratio as $d^\pi_h(s,a)/d^\mu_h(s,a)$ for all $s,a$. Then the concentrability coefficient is defined as $C_{\mathrm{eff}} := \sup_\pi\sup_{h\in[H]}\|d^\pi_h/d^\mu_h\|^2_{2,d^\mu_h}$, where $\|g(\cdot,\cdot)\|_{2,d^\mu} := \sqrt{\mathbb{E}_{d^\mu}[g(\cdot,\cdot)^2]}$, and we assume $C_{\mathrm{eff}} < \infty$. This is the standard coverage assumption that has been widely adopted in (Ernst et al., 2005; Szepesvári and Munos, 2005; Chen and Jiang, 2019; Xie and Jiang, 2020a), and Assumption 2.2 is fully characterized by the MDP itself.
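For finitely many state-action pairs, the quantity inside Assumption 2.2 is just a weighted second moment of the occupancy ratio, $\mathbb{E}_{d^\mu}[(d^\pi/d^\mu)^2]$, which can be computed directly. The two occupancy distributions below are made-up numbers for illustration.

```python
import numpy as np

d_mu = np.array([0.4, 0.3, 0.2, 0.1])        # behavior occupancy d^mu_h
d_pi = np.array([0.25, 0.25, 0.25, 0.25])    # comparator occupancy d^pi_h

ratio = d_pi / d_mu                          # marginal occupancy ratio
# ||d^pi / d^mu||^2_{2, d^mu} = E_{d^mu}[(d^pi / d^mu)^2]
c_eff = float(np.sum(d_mu * ratio ** 2))
```

By Cauchy-Schwarz this quantity is always at least 1 (with equality iff $d^\pi = d^\mu$), and it blows up as $\mu$ places vanishing mass where $\pi$ visits, which is exactly the coverage failure the assumption rules out.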
In addition, we can make an alternative Assumption 2.3 that depends on both the MDP and the function approximation class $\mathcal{F}$. It assumes a curvature condition for $\mathcal{F}$.

Assumption 2.3 (Uniform Coverage). For all $h\in[H]$, there exists $\kappa > 0$ such that
• $\mathbb{E}_{\mu,h}\big[(f(\theta_1,\phi(\cdot,\cdot)) - f(\theta_2,\phi(\cdot,\cdot)))^2\big] \ge \kappa\|\theta_1-\theta_2\|_2^2$ for all $\theta_1,\theta_2\in\Theta$; (⋆)
• $\mathbb{E}_{\mu,h}\big[\nabla f(\theta,\phi(s,a))\cdot\nabla f(\theta,\phi(s,a))^\top\big] \succeq \kappa I$ for all $\theta\in\Theta$. (⋆⋆)

In the linear function approximation regime, Assumption 2.3 reduces to Example 2.4, since (⋆) and (⋆⋆) become identical assumptions. Concretely, if $f(\theta,\phi) = \langle\theta,\phi\rangle$, then (⋆) $\mathbb{E}_{\mu,h}[(f(\theta_1,\phi(\cdot,\cdot)) - f(\theta_2,\phi(\cdot,\cdot)))^2] = (\theta_1-\theta_2)^\top\mathbb{E}_{\mu,h}[\phi(\cdot,\cdot)\phi(\cdot,\cdot)^\top](\theta_1-\theta_2) \ge \kappa\|\theta_1-\theta_2\|_2^2$ for all $\theta_1,\theta_2\in\Theta$ ⇔ Example 2.4 ⇔ (⋆⋆) $\mathbb{E}_{\mu,h}[\nabla f(\theta,\phi(s,a))\cdot\nabla f(\theta,\phi(s,a))^\top] \succeq \kappa I$. Therefore, Assumption 2.3 can be considered a natural extension of Example 2.4 to the differentiable class. We do point out that Assumption 2.3 can be violated for a function class $\mathcal{F}$ that is "not identifiable" under the data distribution $\mu$ (i.e., there exist $f(\theta_1), f(\theta_2)\in\mathcal{F}$ with $\theta_1\ne\theta_2$ such that $\mathbb{E}_{\mu,h}[(f(\theta_1,\phi(\cdot,\cdot)) - f(\theta_2,\phi(\cdot,\cdot)))^2] = 0$). Nevertheless, there are representative non-linear differentiable classes (e.g. the generalized linear model (GLM)) satisfying Assumption 2.3.

Example 2.4 (Linear function coverage assumption (Wang et al., 2021a; Min et al., 2021; Yin et al., 2022; Xiong et al., 2022)). $\Sigma^{\mathrm{feature}}_h := \mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^\top] \succeq \kappa I$ for all $h\in[H]$ with some $\kappa > 0$.

Example 2.5 (Offline generalized linear model (Li et al., 2017; Wang et al., 2021b)). For a known feature map $\phi: \mathcal{S}\times\mathcal{A}\to\mathbb{B}^d$ and link function $f: [-1,1]\to[-1,1]$, the GLM class is $\mathcal{F}_{\mathrm{GLM}} := \{(s,a)\mapsto f(\langle\phi(s,a),\theta\rangle): \theta\in\Theta\}$ satisfying $\mathbb{E}_{\mu,h}[\phi(s,a)\phi(s,a)^\top]\succeq\kappa I$. Furthermore, $f(\cdot)$ is either monotonically increasing or decreasing, and $0 < \kappa \le |f'(z)| \le K < \infty$, $|f''(z)| \le M < \infty$ for all $|z|\le 1$ and some $\kappa, K, M$. Then $\mathcal{F}_{\mathrm{GLM}}$ satisfies Assumption 2.3; see Appendix B.
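A numerical sanity check of condition (⋆⋆) for a GLM in the spirit of Example 2.5: for $f(\theta,\phi) = \sigma(\langle\theta,\phi\rangle)$ the gradient is $f'(z)\,\phi$, so $\mathbb{E}_\mu[\nabla f\,\nabla f^\top]$ inherits a spectral lower bound from $\mathbb{E}_\mu[\phi\phi^\top]$ whenever $|f'|$ is bounded below. The sampling distribution below is a made-up stand-in for $\mu$, not the paper's construction.

```python
import numpy as np

def dsigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)            # sigma'(z), bounded below on bounded z

rng = np.random.default_rng(2)
d, n = 3, 20000
# features with ||phi|| <= 1, drawn from a stand-in behavior distribution mu
phis = rng.uniform(-1, 1, size=(n, d)) / np.sqrt(d)
theta = rng.uniform(-1, 1, size=d)

grads = dsigmoid(phis @ theta)[:, None] * phis      # grad_theta f per sample
fisher = grads.T @ grads / n                        # approx E_mu[grad grad^T]
kappa_hat = float(np.linalg.eigvalsh(fisher).min()) # empirical kappa in (**)
```

A strictly positive `kappa_hat` is the empirical analogue of $\mathbb{E}_{\mu,h}[\nabla f\,\nabla f^\top]\succeq\kappa I$; if some direction of feature space were never excited by $\mu$, the minimum eigenvalue would collapse to zero, which is exactly the identifiability failure discussed above.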

3. DIFFERENTIABLE FUNCTION APPROXIMATION IS PROVABLY EFFICIENT

In this section, we present our solution for offline reinforcement learning with differentiable function approximation. As a warm-up, we first analyze vanilla fitted Q-learning (VFQL, Algorithm 2), which only requires the concentrability Assumption 2.2. The algorithm is presented in Appendix I.

Theorem 3.1. Choose $0 < \lambda \le 1/2C_\Theta^2$ in Algorithm 2 and suppose Assumptions 2.1 and 2.2 hold. Then if $K \ge \max\big\{512\frac{\kappa_1^4}{\kappa^2}\big(\log(\frac{2Hd}{\delta}) + d\log(1 + \frac{4\kappa_1^3\kappa_2 C_\Theta K^3}{\lambda^2})\big), \frac{4\lambda}{\kappa}\big\}$, with probability $1-\delta$, the output $\hat\pi$ of VFQL guarantees:
$v^\star - v^{\hat\pi} \le \sqrt{C_{\mathrm{eff}}}H\cdot\tilde O\Big(\sqrt{\tfrac{H^2 d + \lambda C_\Theta^2}{K}} + \sqrt[4]{\tfrac{H^3 d\epsilon_F}{K}}\Big) + O\big(\sqrt{C_{\mathrm{eff}}H^3\epsilon_F} + H\epsilon_F\big).$

If the model capacity is insufficient, Theorem 3.1 induces extra error due to the large $\epsilon_F$. If $\epsilon_F \to 0$, the parametric rate $1/\sqrt{K}$ is recovered, and similar results have been derived with general function approximation (GFA) (Chen and Jiang, 2019). However, using the concentrability coefficient conceals the problem-dependent structure and omits the specific information of differentiable functions in the complexity measure. Owing to this, we switch to the stronger "uniform" coverage Assumption 2.3 and analyze the pessimistic fitted Q-learning (PFQL, Algorithm 1).

Motivation of PFQL.

The PFQL algorithm mingles two celebrated algorithmic choices: Fitted Q-Iteration (FQI) and pessimism. Before going into the technical details, we provide some insights that motivate our analysis. First of all, the squared-error loss used in FQI (Gordon, 1999; Ernst et al., 2005) naturally couples with the differentiable function class, as the resulting optimization objective is more computationally tractable (since stochastic gradient descent (SGD) can be readily applied) compared to other information-theoretic algorithms derived with general function approximation (e.g. the max-min objective in Xie et al. (2021a), eqn (3.2)). In particular, FQI with differentiable function approximation resembles the theoretical prototype of the neural FQI algorithm (Riedmiller, 2005) and the DQN algorithm (Mnih et al., 2015; Fan et al., 2020) when the model $f$ is instantiated as a deep neural network. Furthermore, plenty of practical algorithms leverage fitted-Q subroutines for the critic update (e.g. (Schulman et al., 2017; Haarnoja et al., 2018)) with different differentiable function choices. In addition, we also incorporate pessimism in the design. Indeed, one of the fundamental challenges in offline RL comes from the distributional shift. When such a mismatch occurs, the estimated/optimized Q-function (using batch data $\mathcal{D}$) may suffer severe overestimation error due to the extrapolation of the model $f$ (Levine et al., 2020). Pessimism mitigates this error/overestimation bias by penalizing the Q-function at state-action locations with high uncertainty (as opposed to the optimism used in the online case), and has been widely adopted (e.g. (Buckman et al., 2020; Kidambi et al., 2020; Jin et al., 2021b)).

Algorithm 1 description. Inside the backward iteration of PFQL, a fitted-Q update is performed to optimize the parameter (Line 4).
$\hat\theta_h$ is the root of the first-order stationarity equation $\sum_{k=1}^K\big[f(\theta,\phi_{h,k}) - r_{h,k} - \hat V_{h+1}(s^k_{h+1})\big]\cdot\nabla_\theta^\top f(\theta,\phi_{h,k}) + \lambda\theta = 0$, and $\Sigma_h$ is the Gram matrix with respect to $\nabla_\theta f|_{\theta=\hat\theta_h}$. Note that for any $(s,a)\in\mathcal{S}\times\mathcal{A}$, $m(s,a) := \big(\nabla_\theta f(\hat\theta_h,\phi(s,a))^\top\Sigma_h^{-1}\nabla_\theta f(\hat\theta_h,\phi(s,a))\big)^{-1}$ measures the effective sample size that explored $(s,a)$ along the gradient direction $\nabla_\theta f|_{\theta=\hat\theta_h}$, and $\beta/\sqrt{m(s,a)}$ is the estimated uncertainty at $(s,a)$. However, the quantity $m(s,a)$ depends on $\hat\theta_h$, and $\hat\theta_h$ needs to be close to the true $\theta^\star_h$ (i.e. $\hat Q_h \approx f(\hat\theta_h,\phi)$ needs to be close to $Q^\star_h$) for the uncertainty quantification $\Gamma_h$ to be valid, since plugging a random $\theta$ into $m(s,a)$ can produce an arbitrary $\Gamma_h$ that is useless (or might even deteriorate the algorithm). Such an "implicit" constraint on $\hat\theta_h$ imposes extra difficulty for the theoretical analysis, because general differentiable functions encode nonlinear structures. Besides, the choice of $\beta$ is set to $\tilde O(dH)$ in Theorem 3.2, and the extra term $\tilde O(\frac{1}{K})$ in $\Gamma_h$ is for theoretical reasons only.

Algorithm 1 Pessimistic Fitted Q-Learning (PFQL)
1: Input: offline dataset $\mathcal{D} = \{(s^k_h, a^k_h, r^k_h, s^k_{h+1})\}_{k,h=1}^{K,H}$. Require $\beta$. Denote $\phi_{h,k} := \phi(s^k_h, a^k_h)$.
2: Initialization: set $\hat V_{H+1}(\cdot) \leftarrow 0$ and $\lambda > 0$.
3: for $h = H, H-1, \ldots, 1$ do
4:   Set $\hat\theta_h \leftarrow \arg\min_{\theta\in\Theta} \sum_{k=1}^K \big[f(\theta,\phi_{h,k}) - r_{h,k} - \hat V_{h+1}(s^k_{h+1})\big]^2 + \lambda\cdot\|\theta\|_2^2$
5:   Set $\Sigma_h \leftarrow \sum_{k=1}^K \nabla_\theta f(\hat\theta_h,\phi_{h,k})\nabla_\theta^\top f(\hat\theta_h,\phi_{h,k}) + \lambda I_d$
6:   Set $\Gamma_h(\cdot,\cdot) \leftarrow \beta\sqrt{\nabla_\theta f(\hat\theta_h,\phi(\cdot,\cdot))^\top\Sigma_h^{-1}\nabla_\theta f(\hat\theta_h,\phi(\cdot,\cdot))} + \tilde O(\frac{1}{K})$
7:   Set $\bar Q_h(\cdot,\cdot) \leftarrow f(\hat\theta_h,\phi(\cdot,\cdot)) - \Gamma_h(\cdot,\cdot)$
8:   Set $\hat Q_h(\cdot,\cdot) \leftarrow \min\{\bar Q_h(\cdot,\cdot),\, H-h+1\}^+$
9:   Set $\hat\pi_h(\cdot|\cdot) \leftarrow \arg\max_{\pi_h}\langle\hat Q_h(\cdot,\cdot), \pi_h(\cdot|\cdot)\rangle_{\mathcal{A}}$, $\hat V_h(\cdot) \leftarrow \max_{\pi_h}\langle\hat Q_h(\cdot,\cdot), \pi_h(\cdot|\cdot)\rangle_{\mathcal{A}}$
10: end for
11: Output: $\{\hat\pi_h\}_{h=1}^H$.

Model-based vs. model-free. PFQL can be viewed as a strict generalization of previous value-iteration-based algorithms, e.g. the PEVI algorithm (Jin et al.
(2021b), linear MDPs) and the VPVI algorithm (Yin and Wang (2021), tabular MDPs). On one hand, approximate value iteration (AVI) algorithms (Munos, 2005) are usually model-based (for instance, the tabular algorithm VPVI uses the empirical model $\hat P$ for planning). On the other hand, FQI has the form of a batch Q-learning update (i.e. Q-learning is the special case with batch size equal to one) and is therefore of a more model-free flavor. Since FQI is a concrete instantiation of the abstract AVI procedure (Munos, 2007), PFQL draws a unified view of model-based and model-free learning. Now we are ready to state our main result for PFQL; the full proof can be found in Appendices D, E, F.

Theorem 3.2. Let $\beta = 8dH\iota$ and choose $0 < \lambda \le 1/2C_\Theta^2$ in Algorithm 1. Suppose Assumptions 2.1 and 2.3 hold with $\epsilon_F = 0$. Then if $K \ge \max\big\{512\frac{\kappa_1^4}{\kappa^2}\big(\log(\frac{2Hd}{\delta}) + d\log(1 + \frac{4\kappa_1^3\kappa_2 C_\Theta K^3}{\lambda^2})\big), \frac{4\lambda}{\kappa}\big\}$, with probability $1-\delta$, for all policies $\pi$ simultaneously, the output of PFQL guarantees
$v^\pi - v^{\hat\pi} \le \sum_{h=1}^H 8dH\cdot\mathbb{E}_\pi\big[\sqrt{\nabla_\theta^\top f(\hat\theta_h,\phi(s_h,a_h))\,\Sigma_h^{-1}\,\nabla_\theta f(\hat\theta_h,\phi(s_h,a_h))}\big]\cdot\iota + O\big(\tfrac{C_{\mathrm{hot}}}{K}\big),$
where $\iota$ is a polylog term and the expectation for $\pi$ is taken over $(s_h,a_h)$. In particular, if further $K \ge \max\big\{\tilde O\big(\frac{(\kappa_1^2+\lambda)^2\kappa_2^2\kappa_1^2 H^4 d^2}{\kappa^6}\big), \frac{128\kappa_1^4\log(2d/\delta)}{\kappa^2}\big\}$, we have
$0 \le v^{\pi^\star} - v^{\hat\pi} \le \sum_{h=1}^H 16dH\cdot\mathbb{E}_{\pi^\star}\big[\sqrt{\nabla_\theta^\top f(\theta^\star_h,\phi(s_h,a_h))\,\Sigma_h^{\star-1}\,\nabla_\theta f(\theta^\star_h,\phi(s_h,a_h))}\big]\cdot\iota + O\big(\tfrac{C'_{\mathrm{hot}}}{K}\big).$
Here $\Sigma^\star_h = \sum_{k=1}^K \nabla_\theta f(\theta^\star_h,\phi(s^k_h,a^k_h))\nabla_\theta^\top f(\theta^\star_h,\phi(s^k_h,a^k_h)) + \lambda I_d$, and the definitions of the higher-order parameters $C_{\mathrm{hot}}, C'_{\mathrm{hot}}$ can be found in List A.

Corollary 3.3 (Offline Generalized Linear Models (GLM)). Consider the GLM function class defined in Example 2.5. Suppose $\beta, \lambda, K$ are as in Theorem 3.2 and $\epsilon_F = 0$. Then with probability $1-\delta$, for all policies $\pi$ simultaneously, PFQL guarantees
$v^\pi - v^{\hat\pi} \le \sum_{h=1}^H 8dH\cdot\mathbb{E}_\pi\big[\sqrt{f'(\langle\hat\theta_h,\phi(s_h,a_h)\rangle)^2\cdot\phi^\top(s_h,a_h)\Sigma_h^{-1}\phi(s_h,a_h)}\big]\cdot\iota + O\big(\tfrac{C_{\mathrm{hot}}}{K}\big).$

PFQL is provably efficient.
Theorem 3.2 verifies that PFQL is statistically efficient. In particular, by Lemma L.5 we have $\|\nabla_\theta f(\theta^\star_h,\phi)\|_{\Sigma_h^{\star-1}} \lesssim \frac{2\kappa_1}{\sqrt{\kappa K}}$, so the main term is bounded by $\frac{32dH^2\kappa_1}{\sqrt{\kappa K}}$, which recovers the standard statistical learning convergence rate $1/\sqrt{K}$.

Comparison to Jin et al. (2021b). Theorem 3.2 strictly subsumes the linear MDP learning bound in Jin et al. (2021b). In fact, in the linear case $\nabla_\theta f(\theta,\phi) = \nabla_\theta\langle\theta,\phi\rangle = \phi$, and Theorem 3.2 reduces to $\tilde O\big(dH\sum_{h=1}^H \mathbb{E}_{\pi^\star}\big[\sqrt{\phi(s_h,a_h)^\top(\Sigma^{\mathrm{linear}}_h)^{-1}\phi(s_h,a_h)}\big]\big)$.

Instance-dependent learning. Previous studies of offline RL with general function approximation (GFA) (Chen and Jiang, 2019; Xie and Jiang, 2020b) are more of a worst-case flavor, as they usually rely on the concentrability coefficient $C$. The resulting learning bounds are expressed in the form $O\big(\sqrt{\frac{V^2_{\max}C}{n}}\big)$, which is unable to depict the behavior of individual instances. In contrast, the guarantee with differentiable function approximation is more adaptive due to the instance-dependent structure $\sum_{h=1}^H \mathbb{E}_{\pi^\star}\big[\sqrt{\nabla_\theta^\top f(\theta^\star_h,\phi)\,\Sigma_h^{\star-1}\,\nabla_\theta f(\theta^\star_h,\phi)}\big]$. This Fisher-information-style quantity characterizes the learning hardness of individual problems explicitly: for different MDP instances $M_1, M_2$, the coupled parameters $\theta^\star_{h,M_1}, \theta^\star_{h,M_2}$ generate different performances via the measure $\sum_{h=1}^H \mathbb{E}_{\pi^\star}\big[\sqrt{\nabla_\theta^\top f(\theta^\star_{h,M_i},\phi)\,\Sigma_h^{\star-1}\,\nabla_\theta f(\theta^\star_{h,M_i},\phi)}\big]$ ($i = 1, 2$). Standard worst-case bounds (e.g. from GFA) cannot explicitly differentiate between problem instances.

Feature representation vs. parameters. One interesting observation from Theorem 3.2 is that the learning complexity does not depend on the feature representation dimension $m$ but only on the parameter dimension $d$, as long as the function class $\mathcal{F}$ satisfies the differentiability Definition 1.1 (not even in the higher-order terms). This seems to suggest that, when endowing the model $f$ with more complex representations, the learning hardness will not grow as long as the number of parameters to be learned does not increase. Note that in the linear MDP analysis this phenomenon is not captured, since the two dimensions are coupled ($d = m$). Therefore, this heuristic might help rethink which is the more essential element (feature representation vs. parameter space) in the representation learning RL regime (e.g. low-rank MDPs (Uehara et al., 2022)).
We leave a concrete understanding of the connection between features and parameters to future work.
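The backward loop of Algorithm 1 can be sketched compactly in the linear special case $f(\theta,\phi) = \langle\theta,\phi\rangle$, where Line 4 has a closed-form ridge solution and $\nabla_\theta f = \phi$. The synthetic data, the toy "next-state" features, and the choice $\beta = 1$ below are placeholders for illustration, not the theorem's constants or an actual MDP simulator.

```python
import numpy as np

rng = np.random.default_rng(3)
H, K, d = 2, 500, 3
lam, beta = 1.0, 1.0
phi_data = rng.normal(size=(H, K, d))   # phi(s_h^k, a_h^k) in the batch
rew = rng.uniform(size=(H, K))          # r_h^k
phi_next = rng.normal(size=(H, K, d))   # toy features used to evaluate V-hat

V_next = np.zeros(K)                    # \hat V_{H+1} = 0
for h in range(H - 1, -1, -1):
    X, y = phi_data[h], rew[h] + V_next                  # Line 4 targets
    Sigma = X.T @ X + lam * np.eye(d)                    # Line 5 Gram matrix
    theta_hat = np.linalg.solve(Sigma, X.T @ y)          # closed-form ridge fit
    Sigma_inv = np.linalg.inv(Sigma)
    # Lines 6-7: pessimism bonus along the gradient direction (= phi here)
    bonus = beta * np.sqrt(
        np.einsum('kd,de,ke->k', phi_next[h], Sigma_inv, phi_next[h]))
    Q_bar = phi_next[h] @ theta_hat - bonus
    V_next = np.clip(Q_bar, 0.0, H - h)                  # Line 8 truncation
```

The bonus term is exactly $\beta/\sqrt{m(s,a)}$ from the discussion above; for a general differentiable $f$, `phi_next[h]` would be replaced by $\nabla_\theta f(\hat\theta_h, \phi(\cdot,\cdot))$ and Line 4 solved by SGD rather than in closed form.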

Technical challenges with differentiable function approximation (DFA). Informally, one key step of the analysis is to bound $|f(\hat\theta_h,\phi) - f(\theta^\star_h,\phi)|$, which can be estimated via the first-order approximation $\nabla f(\hat\theta_h,\phi)^\top(\hat\theta_h - \theta^\star_h)$. However, different from the least-squares value iteration (LSVI) objective (Jin et al., 2020b; 2021b), the fitted-Q update (Line 4, Algorithm 1) no longer admits a closed-form solution for $\hat\theta_h$. Instead, we can only leverage that $\hat\theta_h$ is a stationary point of $Z_h(\theta) := \sum_{k=1}^K\big[f(\theta,\phi_{h,k}) - r_{h,k} - \hat V_{h+1}(s^k_{h+1})\big]\nabla f(\theta,\phi_{h,k}) + \lambda\cdot\theta$ (since $Z_h(\hat\theta_h) = 0$). To measure the difference $\hat\theta_h - \theta^\star_h$, for any $\theta\in\Theta$ we perform the vector Taylor expansion $Z_h(\theta) - Z_h(\hat\theta_h) = \Sigma^s_h(\theta - \hat\theta_h) + R_K(\theta)$ (where $R_K(\theta)$ collects the higher-order residuals) at the point $\hat\theta_h$, with
$\Sigma^s_h := \frac{\partial}{\partial\theta}Z_h(\theta)\big|_{\theta=\hat\theta_h} = \underbrace{\sum_{k=1}^K\big[f(\hat\theta_h,\phi_{h,k}) - r_{h,k} - \hat V_{h+1}(s^k_{h+1})\big]\cdot\nabla^2_{\theta\theta}f(\hat\theta_h,\phi_{h,k})}_{=:\Delta_{\Sigma^s_h}} + \underbrace{\sum_{k=1}^K\nabla_\theta f(\hat\theta_h,\phi_{h,k})\nabla_\theta^\top f(\hat\theta_h,\phi_{h,k}) + \lambda I_d}_{=:\Sigma_h}. \qquad (1)$
The perturbation term $\Delta_{\Sigma^s_h}$ encodes one key challenge for characterizing $\hat\theta_h - \theta^\star_h$, since it breaks the positive definiteness of $\Sigma^s_h$ and, as a result, we cannot invert $\Sigma^s_h$ in the Taylor expansion of $Z_h$. This is because DFA (Definition 1.1) is a rich class that incorporates nonlinear curvature. In the linear function approximation regime this hurdle does not show up, since $\nabla^2_{\theta\theta}f \equiv 0$ and $\Sigma^s_h = \Sigma_h$ is always invertible as long as $\lambda > 0$. Moreover, for the off-policy evaluation (OPE) task, one can overcome this issue by expanding the population counterpart of $Z_h$ at the underlying true parameter of the given target policy (Zhang et al., 2022a). However, for the policy learning task, we can use neither the population quantity nor the true parameter $\theta^\star_h$, since we need a computable/data-based pessimism $\Gamma_h$ to make the algorithm practical.
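The Jacobian decomposition $\partial Z_h/\partial\theta = \Delta_{\Sigma^s_h} + \Sigma_h$ discussed above can be verified numerically on a toy sigmoid GLM: the derivative of the stationarity map splits into residual-weighted second derivatives (the curvature perturbation) plus the PSD Gram part. All data below are synthetic placeholders.

```python
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
d, K, lam = 3, 50, 0.1
Phi = rng.normal(size=(K, d))
y = rng.uniform(size=K)                    # stands in for r + V_next targets
theta = rng.normal(size=d) * 0.1

def Z(th):
    """Stationarity map: sum_k (f_k - y_k) grad f_k + lam * theta."""
    z = Phi @ th
    resid = sig(z) - y
    grads = (sig(z) * (1 - sig(z)))[:, None] * Phi
    return grads.T @ resid + lam * th

# Closed-form pieces of dZ/dtheta at theta (sigma''(z) = sigma'(z)(1 - 2 sigma(z)))
z = Phi @ theta
s, sp = sig(z), sig(z) * (1 - sig(z))
spp = sp * (1 - 2 * s)
grads = sp[:, None] * Phi
Delta = (Phi * ((s - y) * spp)[:, None]).T @ Phi       # residual-weighted Hessians
Sigma = grads.T @ grads + lam * np.eye(d)              # Gram part + lam * I

# Finite-difference Jacobian of Z should match Delta + Sigma
eps = 1e-6
J = np.column_stack([(Z(theta + eps * e) - Z(theta - eps * e)) / (2 * eps)
                     for e in np.eye(d)])
decomp_err = float(np.max(np.abs(J - (Delta + Sigma))))
```

For a linear model the `Delta` term vanishes identically (the second derivative is zero), which is precisely why the linear analysis can invert the Jacobian directly while the nonlinear case cannot.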

3.1. SKETCH OF THE PFQL ANALYSIS

Due to the space constraint, we only overview the key components of the analysis here. To begin with, by following the general MDP result in Jin et al. (2021b), the suboptimality gap can be bounded by $\sum_{h=1}^H 2\mathbb{E}_\pi[\Gamma_h(s_h,a_h)]$ provided $|(\mathcal{P}_h\hat V_{h+1} - f(\hat\theta_h,\phi))(s,a)| \le \Gamma_h(s,a)$ (Appendix D). To deal with $\mathcal{P}_h\hat V_{h+1}$, by Assumption 2.1 we can leverage the parameter Bellman operator $\mathcal{T}$ (Definition D.1) so that $\mathcal{P}_h\hat V_{h+1} = f(\theta_{\mathcal{T}\hat V_{h+1}},\phi)$. Next, we apply the second-order approximation $\mathcal{P}_h\hat V_{h+1} - f(\hat\theta_h,\phi) \approx \nabla f(\hat\theta_h,\phi)^\top(\theta_{\mathcal{T}\hat V_{h+1}} - \hat\theta_h) + \frac{1}{2}(\theta_{\mathcal{T}\hat V_{h+1}} - \hat\theta_h)^\top\nabla^2_{\theta\theta}f(\hat\theta_h,\phi)(\theta_{\mathcal{T}\hat V_{h+1}} - \hat\theta_h)$. Later, we use (1) to write $Z_h(\theta_{\mathcal{T}\hat V_{h+1}}) - Z_h(\hat\theta_h) = \Sigma^s_h(\theta_{\mathcal{T}\hat V_{h+1}} - \hat\theta_h) + R_K(\theta_{\mathcal{T}\hat V_{h+1}}) = \Sigma_h(\theta_{\mathcal{T}\hat V_{h+1}} - \hat\theta_h) + \bar R_K(\theta_{\mathcal{T}\hat V_{h+1}})$, where $\bar R_K(\theta_{\mathcal{T}\hat V_{h+1}}) := \Delta_{\Sigma^s_h}(\theta_{\mathcal{T}\hat V_{h+1}} - \hat\theta_h) + R_K(\theta_{\mathcal{T}\hat V_{h+1}})$. Now $\Sigma_h$ is invertible, which provides the estimate (note $Z_h(\hat\theta_h) = 0$) $\theta_{\mathcal{T}\hat V_{h+1}} - \hat\theta_h = \Sigma_h^{-1}Z_h(\theta_{\mathcal{T}\hat V_{h+1}}) - \Sigma_h^{-1}\bar R_K(\theta_{\mathcal{T}\hat V_{h+1}})$. However, to handle the higher-order terms, we need an explicit finite-sample bound for $\|\theta_{\mathcal{T}\hat V_{h+1}} - \hat\theta_h\|_2$ (or $\|\theta^\star_h - \hat\theta_h\|_2$). In the OPE literature, Zhang et al. (2022a) use asymptotic theory (Prohorov's Theorem) to show the existence of $B(\delta)$ such that $\|\hat\theta_h - \theta^\star_h\| \le B(\delta)/\sqrt{K}$. However, this is insufficient for finite-sample/non-asymptotic guarantees, since the abstraction of $B(\delta)$ might prevent the result from being sample efficient. For example, if $B(\delta)$ has the form $e^H\log(\frac{1}{\delta})$, then $e^H\log(\frac{1}{\delta})/\sqrt{K}$ is an inefficient bound, since $K$ needs to be of order $e^H/\epsilon^2$ to guarantee $\epsilon$ accuracy. To address this, we use a novel reduction to the general function approximation (GFA) learning framework proposed in Chen and Jiang (2019). We first bound the loss objective $\mathbb{E}_\mu[\ell_h(\hat\theta_h)] - \mathbb{E}_\mu[\ell_h(\theta_{\mathcal{T}\hat V_{h+1}})]$ via an "orthogonal" decomposition and by solving a quadratic equation. The resulting bound can be directly used to further bound $\|\theta_{\mathcal{T}\hat V_{h+1}} - \hat\theta_h\|_2$, yielding the efficient guarantee $\tilde O(\frac{dH}{\sqrt{\kappa K}})$.
During the course, a covering technique is applied to extend the finite-function-hypothesis argument to all differentiable functions in Definition 1.1. See Appendix G and Appendices D, E, F for the complete proofs.

4. IMPROVED LEARNING VIA VARIANCE AWARENESS

In addition to knowing the provable efficiency of differentiable function approximation (DFA), it is of great interest to understand the statistical limit of DFA, or equivalently, what "optimal" sample/statistical complexity can be achieved with DFA (measured by minimax criteria). Towards this goal, we further incorporate variance awareness to improve our learning guarantee. Variance awareness was first designed for linear mixture MDPs (Talebi and Maillard, 2018; Zhou et al., 2021a) to achieve near-minimax sample complexity; it uses estimated conditional variances $\mathrm{Var}_{P(\cdot|s,a)}(V^\star_{h+1})$ to reweight each training sample in the LSVI objective. Later, this technique was leveraged by Min et al. (2021); Yin et al. (2022) to obtain instance-dependent results. Intuitively, the conditional variance $\sigma^2(s,a) := \mathrm{Var}_{P(\cdot|s,a)}(V^\star_{h+1})$ serves as an uncertainty measure for the sample $(s,a,r,s')$ drawn from $P(\cdot|s,a)$. If $\sigma^2(s,a)$ is large, then the distribution $P(\cdot|s,a)$ has high variance and we should put less weight on the single sample $(s,a,r,s')$ rather than weighting all samples equally. In the differentiable function approximation regime, the update is modified to $\hat\theta_h \leftarrow \arg\min_{\theta\in\Theta}\sum_{k=1}^K\big[f(\theta,\phi_{h,k}) - r_{h,k} - \hat V_{h+1}(s^k_{h+1})\big]^2/\hat\sigma^2_h(s^k_h,a^k_h) + \lambda\cdot\|\theta\|_2^2$, with $\hat\sigma^2_h(\cdot,\cdot)$ estimated from the offline data. Notably, empirical works have also shown that uncertainty reweighting can improve performance for both online RL (Mai et al., 2022) and offline RL (Wu et al., 2021). These motivate our variance-aware fitted Q-learning (VAFQL) Algorithm 3.

Theorem 4.1. Suppose Assumptions 2.1 and 2.3 hold with $\epsilon_F = 0$. Let $\beta = 8d\iota$ and choose $0 < \lambda \le 1/2C_\Theta^2$ in Algorithm 3.
Then if $K \ge K_0$ and $\sqrt d \ge \widetilde O(\zeta)$, with probability $1-\delta$, for all policies $\pi$ simultaneously, the output of VAFQL guarantees
$$v^\pi - v^{\widehat\pi} \le \sum_{h=1}^H 8\sqrt d\cdot \mathbb{E}_\pi\Big[\sqrt{\nabla_\theta^\top f(\widehat\theta_h,\phi(s_h,a_h))\,\widehat\Lambda_h^{-1}\,\nabla_\theta f(\widehat\theta_h,\phi(s_h,a_h))}\Big]\cdot\iota + \widetilde O\Big(\frac{\widetilde C_{\mathrm{hot}}}{K}\Big),$$
where $\iota$ is a polylog term and the expectation over $\pi$ is taken over $s_h,a_h$. In particular,
$$0 \le v^{\pi^\star} - v^{\widehat\pi} \le 16\sqrt d\cdot \sum_{h=1}^H \mathbb{E}_{\pi^\star}\Big[\sqrt{\nabla_\theta^\top f(\theta_h^\star,\phi(s_h,a_h))\,\Lambda_h^{\star-1}\,\nabla_\theta f(\theta_h^\star,\phi(s_h,a_h))}\Big]\cdot\iota + \widetilde O\Big(\frac{\widetilde C'_{\mathrm{hot}}}{K}\Big).$$
Here $\Lambda_h^\star = \sum_{k=1}^K \nabla_\theta f(\theta_h^\star,\phi_{h,k})\nabla_\theta^\top f(\theta_h^\star,\phi_{h,k})/\sigma_h^\star(s^k_h,a^k_h)^2 + \lambda I_d$ and $\sigma_h^\star(\cdot,\cdot)^2 := \max\{1,\mathrm{Var}_{P_h}V^\star_{h+1}(\cdot,\cdot)\}$. The definitions of $K_0$, $\widetilde C_{\mathrm{hot}}$, $\widetilde C'_{\mathrm{hot}}$, $\zeta$ can be found in List A. In particular, to bound the errors of $u_h$, $v_h$ and $\widehat\sigma^2_h$, we need to define an operator $\mathcal J$ similar to the parameter Bellman operator of Definition D.1. The full proof of Theorem 4.1 can be found in Appendix J.

Compared to Theorem 3.2, VAFQL enjoys a net improvement in the horizon dependence since $\mathrm{Var}_P(V^\star_h) \le H^2$. Moreover, VAFQL provides better instance-dependent characterizations, since the main term is fully depicted by system quantities except for the feature dimension $d$. For instance, when the system is fully deterministic (all transitions $P_h$ are deterministic), $\sigma^\star_h \approx \mathrm{Var}_{P_h}V^\star_{h+1}(\cdot,\cdot) \equiv 0$ (ignoring the truncation) and $\Lambda^{\star-1}\to 0$. This yields a faster convergence rate of $\widetilde O(\frac1K)$. Lastly, when reduced to linear MDPs, Theorem 4.1 recovers the results of Yin et al. (2022) up to an extra $\sqrt d$ factor.

On the statistical limits. To complement the study, we provide a minimax lower bound via a reduction to Zanette et al. (2021). The following theorem reveals that Theorem 4.1 cannot be improved by more than a factor of $\sqrt d$ in the most general cases. The full discussion is in Appendix K.

Theorem 4.2 (Minimax lower bound). Specify the model to have linear representation $f = \langle\theta,\phi\rangle$.
There exists a pair of universal constants $c,c' > 0$ such that given dimension $d$, horizon $H$ and sample size $K > c'd^3$, one can always find a family of MDP instances such that for any algorithm $\widehat\pi$,
$$\inf_{\widehat\pi}\sup_{M\in\mathcal M}\ \mathbb{E}_M\big[v^\star - v^{\widehat\pi}\big] \ge c\sqrt d\cdot\sum_{h=1}^H \mathbb{E}_{\pi^\star}\Big[\sqrt{\nabla_\theta^\top f(\theta^\star_h,\phi(\cdot,\cdot))\,(\Lambda^{\star,p}_h)^{-1}\,\nabla_\theta f(\theta^\star_h,\phi(\cdot,\cdot))}\Big],\qquad(2)$$
where $\Lambda^{\star,p}_h = \mathbb{E}\Big[\sum_{k=1}^K \frac{\nabla_\theta f(\theta^\star_h,\phi(s^k_h,a^k_h))\cdot\nabla_\theta f(\theta^\star_h,\phi(s^k_h,a^k_h))^\top}{\mathrm{Var}_h(V^\star_{h+1})(s^k_h,a^k_h)}\Big]$.
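The variance-weighted regression step at the heart of VAFQL can be illustrated with a small numerical sketch. This is a minimal illustration under stated assumptions, not the paper's algorithm: it uses the linear special case $f(\theta,\phi)=\langle\theta,\phi\rangle$ (so the weighted least squares has a closed form), synthetic data, and illustrative values for the weights $\widehat\sigma^2_h$ and $\lambda$:

```python
import numpy as np

def weighted_ridge_update(Phi, targets, sigma2, lam):
    """Solve argmin_theta sum_k (<theta, phi_k> - y_k)^2 / sigma2_k + lam * ||theta||^2,
    the linear special case of the VAFQL regression update."""
    W = 1.0 / sigma2                                   # per-sample precision weights
    A = (Phi * W[:, None]).T @ Phi + lam * np.eye(Phi.shape[1])
    b = (Phi * W[:, None]).T @ targets
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
K, d = 500, 4
Phi = rng.normal(size=(K, d))                          # features phi_{h,k} (synthetic)
theta_true = np.array([1.0, -0.5, 0.2, 0.0])
# max{1, Var} truncation, as in the definition of sigma_h^star
sigma2 = np.maximum(1.0, rng.uniform(0.5, 4.0, size=K))
# regression targets r + V(s') with heteroscedastic noise of variance sigma2
y = Phi @ theta_true + np.sqrt(sigma2) * rng.normal(size=K)
theta_hat = weighted_ridge_update(Phi, y, sigma2, lam=0.1)
print(np.round(theta_hat, 2))
```

High-variance samples are down-weighted by $1/\widehat\sigma^2$, so the estimator concentrates faster than the unweighted one when noise levels are uneven across samples.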

5. CONCLUSION, LIMITATION AND FUTURE DIRECTIONS

In this work, we study offline RL with differentiable function approximation and establish sample-efficient learning guarantees. We further improve the horizon dependence via a variance-aware variant. However, the dependence on the parameter space still scales with $d$ (whereas for the linear case it is $\sqrt d$), due to the covering argument applied to the rich class of differentiable functions. For large deep models the parameter dimension is huge, so it would be interesting to know whether certain algorithms can further improve the parameter dependence, or whether this $d$ is essential. Also, how to relax the uniform coverage Assumption 2.3 is unknown under the current analysis. In addition, understanding the connections between differentiable function approximation and overparameterized neural network approximation (Nguyen-Tang and Arora, 2023; Xu and Liang, 2022) is important. We leave these open problems as future work. Lastly, the differentiable function approximation setting provides a general framework that is not confined to offline RL. Understanding the sample complexity behaviors of online reinforcement learning (Jin et al., 2020b; Wang et al., 2021b) and reward-free learning (Jin et al., 2020a; Wang et al., 2020) under this setting are interesting future directions.

Appendix

A NOTATION LIST

$\Sigma^p_h(\theta)$ : $\mathbb{E}_{\mu,h}\big[\nabla f(\theta,\phi(s,a))\cdot\nabla f(\theta,\phi(s,a))^\top\big]$
$\kappa$ : $\min_{h,\theta}\lambda_{\min}(\Sigma^p_h(\theta))$
$\sigma_V^2(s,a)$ : $\max\{1,\mathrm{Var}_{P_h}(V)(s,a)\}$ for any $V$
$\delta$ : failure probability
$K_0$ : $\max\Big\{512\frac{\kappa_1^4}{\kappa^2}\big[\log(\frac{2Hd}{\delta}) + d\log(1+\frac{4\kappa_1^3\kappa_2C_\Theta K^3}{\lambda^2})\big],\ \frac{4\lambda}{\kappa}\Big\}$
$\zeta$ : $2\max_{s'\sim P(\cdot|s,a),\,h\in[H]}\Big|\frac{(\mathcal P_hV^\star_{h+1})(s,a)-r-V^\star_{h+1}(s')}{\sigma^\star_h(s,a)}\Big|$
$\widetilde C_{\mathrm{hot}}$ : $\frac{\kappa_1H}{\sqrt\kappa} + \frac{\kappa_1^2H^3d^2}{\kappa} + \sqrt{\frac{d^3H^4\kappa_2^2\kappa_1^2}{\kappa^3}} + \kappa_2\max\big(\frac{\kappa_1}{\kappa},\frac{1}{\sqrt\kappa}\big)d^2H^3 + d^2H^4\frac{\kappa_3+\lambda\kappa_1C_\Theta}{\kappa} + \frac{H^3\kappa_2d^2}{\kappa}$
$\widetilde C'_{\mathrm{hot}}$ : $\widetilde C_{\mathrm{hot}} + \frac{\kappa_1\kappa_2H^4d^2}{\kappa^{3/2}}$

B FURTHER ILLUSTRATION THAT THE GENERALIZED LINEAR MODEL EXAMPLE SATISFIES ASSUMPTION 2.3

Recall the definition in Example 2.5. For (⋆⋆),
$$\mathbb{E}_{\mu,h}\big[\nabla f(\theta,\phi(s,a))\cdot\nabla f(\theta,\phi(s,a))^\top\big] = \mathbb{E}_{\mu,h}\big[f'(\langle\theta,\phi(s,a)\rangle)^2\,\phi(\cdot,\cdot)\phi(\cdot,\cdot)^\top\big] \succeq \kappa_2\,\mathbb{E}_{\mu,h}\big[\phi(\cdot,\cdot)\phi(\cdot,\cdot)^\top\big] \succeq \kappa_3I,\quad\forall\theta\in\Theta.$$
For (⋆), by Taylor's Theorem (with $\theta_{s,a}$ a point on the segment between $\theta_1$ and $\theta_2$),
$$\mathbb{E}_{\mu,h}\big[(f(\theta_1,\phi(\cdot,\cdot))-f(\theta_2,\phi(\cdot,\cdot)))^2\big] = \mathbb{E}_{\mu,h}\big[f'(\langle\theta_{s,a},\phi(\cdot,\cdot)\rangle)^2(\theta_1-\theta_2)^\top\phi(\cdot,\cdot)\phi(\cdot,\cdot)^\top(\theta_1-\theta_2)\big] \ge \kappa_2\,(\theta_1-\theta_2)^\top\mathbb{E}_{\mu,h}[\phi(\cdot,\cdot)\phi(\cdot,\cdot)^\top](\theta_1-\theta_2) \ge \kappa_3\|\theta_1-\theta_2\|_2^2,$$
and we choose $\kappa_3$ as the $\kappa$ in Assumption 2.3.

The space complexity and computational complexity of VAFQL are of the same order as for PFQL, up to larger constant factors.
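As a quick numerical sanity check of the gradient identity used in (⋆⋆) above, the following sketch compares the analytic gradient $\nabla_\theta f = f'(\langle\theta,\phi\rangle)\phi$ of a generalized linear model against finite differences. The sigmoid link $g$ and the dimensions are illustrative choices, not prescribed by the paper:

```python
import numpy as np

def g(u):            # an illustrative smooth link function (sigmoid)
    return 1.0 / (1.0 + np.exp(-u))

def g_prime(u):
    s = g(u)
    return s * (1.0 - s)

def f(theta, phi):   # generalized linear model f(theta, phi) = g(<theta, phi>)
    return g(theta @ phi)

rng = np.random.default_rng(1)
theta, phi = rng.normal(size=3), rng.normal(size=3)

# Analytic gradient used in the (star-star) computation: g'(<theta,phi>) * phi
grad_analytic = g_prime(theta @ phi) * phi

# Central finite-difference gradient as an independent check
eps = 1e-6
grad_fd = np.array([
    (f(theta + eps * e, phi) - f(theta - eps * e, phi)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(grad_analytic - grad_fd)))
```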

D SOME BASIC CONSTRUCTIONS

First of all, recall that by the first-order condition we have
$$\nabla_\theta\Big[\sum_{k=1}^K\big(f(\theta,\phi_{h,k}) - r_{h,k} - \widehat V_{h+1}(s^k_{h+1})\big)^2 + \lambda\cdot\|\theta\|_2^2\Big]\Big|_{\theta=\widehat\theta_h} = 0,\quad\forall h\in[H].$$

Therefore, if we define the quantity $Z_h(\cdot\,|\,\cdot)\in\mathbb R^d$ as
$$Z_h(\theta\,|\,V) = \sum_{k=1}^K\big(f(\theta,\phi_{h,k}) - r_{h,k} - V(s^k_{h+1})\big)\nabla f(\theta,\phi_{h,k}) + \lambda\cdot\theta,\quad\forall\theta\in\Theta,\ \|V\|_\infty\le H,$$
then we have (recall $\widehat\theta_h\in\mathrm{Int}(\Theta)$) $Z_h(\widehat\theta_h\,|\,\widehat V_{h+1}) = 0$.

In addition, according to the Bellman completeness Assumption 2.1, for any bounded $V(\cdot)\in\mathbb R^{\mathcal S}$ with $\|V\|_\infty\le H$, $\inf_{f\in\mathcal F}\|f - \mathcal P_h(V)\|_\infty \le \epsilon_{\mathcal F}$ for all $h$ (recall $\mathcal P_h(V) = r_h + \int_{\mathcal S}V\,dP_h$). Therefore, we can define the parameter Bellman operator $\mathcal T$ as follows.

Definition D.1 (parameter Bellman operator). By the Bellman completeness Assumption 2.1, for any $\|V\|_\infty\le H$, we can define the parameter Bellman operator $\mathcal T: V \mapsto \theta_{\mathcal TV}\in\Theta$ such that $\theta_{\mathcal TV} = \arg\min_{\theta\in\Theta}\|f(\theta,\phi) - \mathcal P_h(V)\|_\infty$. Denote $\delta_V := f(\theta_{\mathcal TV},\phi) - \mathcal P_h(V)$; then $\|f(\theta_{\mathcal TV},\phi) - \mathcal P_h(V)\|_\infty = \|\delta_V\|_\infty \le \epsilon_{\mathcal F}$. In particular, by the realizability in Assumption 2.1 it holds that $\theta_{\mathcal TV^\star_{h+1}} = \theta^\star_h$, since $f(\theta_{\mathcal TV^\star_{h+1}},\phi) = \mathcal P_h(V^\star_{h+1}) = Q^\star_h = f(\theta^\star_h,\phi)$.

D.1 SUBOPTIMALITY DECOMPOSITION

Denote $\iota_h(s,a) := \mathcal P_h\widehat V_{h+1}(s,a) - \widehat Q_h(s,a)$. Following Jin et al. (2021b) we have the following decomposition.

Lemma D.2 (Lemma 3.1 of Jin et al. (2021b)). Let $\widehat\pi = \{\widehat\pi_h\}_{h=1}^H$ be a policy and $\widehat Q_h$ be any estimates with $\widehat V_h = \langle\widehat Q_h(s,\cdot),\widehat\pi_h(\cdot|s)\rangle_{\mathcal A}$. Then for any policy $\pi$, we have
$$v^\pi - v^{\widehat\pi} = -\sum_{h=1}^H\mathbb{E}_{\widehat\pi}[\iota_h(s_h,a_h)] + \sum_{h=1}^H\mathbb{E}_\pi[\iota_h(s_h,a_h)] + \sum_{h=1}^H\mathbb{E}_\pi\big[\langle\widehat Q_h(s_h,\cdot),\pi_h(\cdot|s_h)-\widehat\pi_h(\cdot|s_h)\rangle_{\mathcal A}\big].$$

In particular, if we choose $\widehat\pi_h(\cdot|s) := \arg\max_\pi\langle\widehat Q_h(s,\cdot),\pi(\cdot|s)\rangle_{\mathcal A}$, then the last term is non-positive, so
$$v^\pi - v^{\widehat\pi} \le -\sum_{h=1}^H\mathbb{E}_{\widehat\pi}[\iota_h(s_h,a_h)] + \sum_{h=1}^H\mathbb{E}_\pi[\iota_h(s_h,a_h)].$$

Lemma D.3. Let $\widehat{\mathcal P}_h$ be the general estimated Bellman operator. Suppose with probability $1-\delta$ it holds for all $(h,s,a)\in[H]\times\mathcal S\times\mathcal A$ that $|(\mathcal P_h\widehat V_{h+1} - \widehat{\mathcal P}_h\widehat V_{h+1})(s,a)| \le \Gamma_h(s,a)$. Then this implies $0 \le \iota_h(s,a) \le 2\Gamma_h(s,a)$ for all $(s,a,h)\in\mathcal S\times\mathcal A\times[H]$. Furthermore, it holds for any policy $\pi$ simultaneously, with probability $1-\delta$,
$$V^\pi_1(s) - V^{\widehat\pi}_1(s) \le \sum_{h=1}^H2\cdot\mathbb{E}_\pi\big[\Gamma_h(s_h,a_h)\,\big|\,s_1=s\big].$$

Proof of Lemma D.3. This is a generic result that holds for general MDPs and was first derived in Theorem 4.2 of Jin et al. (2021b). Later, it was summarized in Lemma C.1 of Yin et al. (2022).

With Lemma D.3, we need to bound the term $|\mathcal P_h\widehat V_{h+1}(s,a) - \widehat{\mathcal P}_h\widehat V_{h+1}(s,a)|$.

E ANALYZING $|\mathcal P_h\widehat V_{h+1}(s,a) - \widehat{\mathcal P}_h\widehat V_{h+1}(s,a)|$ FOR PFQL

Throughout this section, we suppose $\epsilon_{\mathcal F}=0$, i.e. $f(\theta_{\mathcal TV},\phi) = \mathcal P_h(V)$. According to the regression oracle (Line 4 of Algorithm 1), the estimated Bellman operator $\widehat{\mathcal P}_h$ maps $\widehat V_{h+1}$ to $\widehat\theta_h$, i.e. $\widehat{\mathcal P}_h\widehat V_{h+1} = f(\widehat\theta_h,\phi)$. Therefore (recall Definition D.1),
$$\mathcal P_h\widehat V_{h+1}(s,a) - \widehat{\mathcal P}_h\widehat V_{h+1}(s,a) = f(\theta_{\mathcal T\widehat V_{h+1}},\phi(s,a)) - f(\widehat\theta_h,\phi(s,a)) = \nabla f(\widehat\theta_h,\phi(s,a))^\top\big(\theta_{\mathcal T\widehat V_{h+1}} - \widehat\theta_h\big) + \mathrm{Hot}_{h,1},\qquad(3)$$
where we applied a first-order Taylor expansion of the differentiable function $f$ at the point $\widehat\theta_h$, and $\mathrm{Hot}_{h,1}$ is a higher-order term. Indeed, the following Lemma E.1 bounds $\mathrm{Hot}_{h,1}$ by $\widetilde O(\frac1K)$.

Lemma E.1. Recall the definition (from the above decomposition) $\mathrm{Hot}_{h,1} := f(\theta_{\mathcal T\widehat V_{h+1}},\phi(s,a)) - f(\widehat\theta_h,\phi(s,a)) - \nabla f(\widehat\theta_h,\phi(s,a))^\top(\theta_{\mathcal T\widehat V_{h+1}} - \widehat\theta_h)$. Then with probability $1-\delta$,
$$|\mathrm{Hot}_{h,1}| \le \frac{18H^2\kappa_2(\log(H/\delta)+C_{d,\log K}) + \kappa_2\lambda C_\Theta^2}{\kappa K},\quad\forall h\in[H].$$

Proof of Lemma E.1.
By the second-order Taylor's Theorem, there exists a point $\xi$ (on the segment between $\widehat\theta_h$ and $\theta_{\mathcal T\widehat V_{h+1}}$) such that
$$f(\theta_{\mathcal T\widehat V_{h+1}},\phi(s,a)) - f(\widehat\theta_h,\phi(s,a)) = \nabla f(\widehat\theta_h,\phi(s,a))^\top(\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h) + \tfrac12(\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h)^\top\nabla^2_{\theta\theta}f(\xi,\phi(s,a))(\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h).$$
Therefore, directly applying Theorem G.2, with probability $1-\delta$, for all $h\in[H]$,
$$|\mathrm{Hot}_{h,1}| = \tfrac12\big|(\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h)^\top\nabla^2_{\theta\theta}f(\xi,\phi(s,a))(\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h)\big| \le \tfrac12\kappa_2\|\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h\|_2^2 \le \frac{18H^2\kappa_2(\log(H/\delta)+C_{d,\log K})+\kappa_2\lambda C_\Theta^2}{\kappa K}.$$

E.1 ANALYZING $\nabla f(\widehat\theta_h,\phi(s,a))^\top(\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h)$ VIA $Z_h$

From (3) and Lemma E.1, the problem further reduces to bounding $\nabla f(\widehat\theta_h,\phi(s,a))^\top(\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h)$. To begin with, we first characterize $\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h$. Indeed, by a first-order vector Taylor expansion (Lemma L.1), we have (note $Z_h(\widehat\theta_h\,|\,\widehat V_{h+1})=0$) for any $\theta\in\Theta$,
$$Z_h(\theta\,|\,\widehat V_{h+1}) - Z_h(\widehat\theta_h\,|\,\widehat V_{h+1}) = \Sigma^s_h(\theta-\widehat\theta_h) + R_K(\theta),\qquad(4)$$
where $R_K(\theta)$ collects the higher-order residuals and
$$\Sigma^s_h := \frac{\partial}{\partial\theta}Z_h(\theta\,|\,\widehat V_{h+1})\Big|_{\theta=\widehat\theta_h} = \underbrace{\sum_{k=1}^K\big(f(\widehat\theta_h,\phi_{h,k}) - r_{h,k} - \widehat V_{h+1}(s^k_{h+1})\big)\cdot\nabla^2_{\theta\theta}f(\widehat\theta_h,\phi_{h,k})}_{:=\Delta_{\Sigma^s_h}} + \underbrace{\sum_{k=1}^K\nabla_\theta f(\widehat\theta_h,\phi_{h,k})\nabla^\top_\theta f(\widehat\theta_h,\phi_{h,k}) + \lambda I_d}_{:=\Sigma_h}.\qquad(5)$$

Published as a conference paper at ICLR 2023

Here $\nabla\cdot\nabla^\top$ denotes the outer product of gradients. Note that $\Delta_{\Sigma^s_h}$ is not desirable since it could prevent $\Sigma^s_h$ from being positive definite (and could cause $\Sigma^s_h$ to be singular). Therefore, we first deal with $\Delta_{\Sigma^s_h}$ below.

Lemma E.2. With probability $1-\delta$, for all $h\in[H]$,
$$\frac1K\|\Delta_{\Sigma^s_h}\|_2 = \frac1K\Big\|\sum_{k=1}^K\big(f(\widehat\theta_h,\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1})\big)\cdot\nabla^2_{\theta\theta}f(\widehat\theta_h,\phi_{h,k})\Big\|_2 \le 9\kappa_2\max\Big(\frac{\kappa_1}{\sqrt\kappa},1\Big)\sqrt{\frac{dH^2(\log(2H/\delta)+d\log(1+2C_\Theta H\kappa_3K)+C_{d,\log K})}{K}} + \frac1K.$$

Proof of Lemma E.2.
Step 1: We first prove that for a fixed $\theta\in\Theta$, with probability $1-\delta$, for all $h\in[H]$,
$$\frac1K\Big\|\sum_{k=1}^K\big(f(\widehat\theta_h,\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1})\big)\cdot\nabla^2_{\theta\theta}f(\theta,\phi_{h,k})\Big\|_2 \le 9\kappa_2\max\Big(\frac{\kappa_1}{\sqrt\kappa},1\Big)\sqrt{\frac{H^2(\log(2H/\delta)+C_{d,\log K})}{K}}.$$
Indeed, by the triangle inequality,
$$\frac1K\Big\|\sum_{k=1}^K\big(f(\widehat\theta_h,\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1})\big)\cdot\nabla^2_{\theta\theta}f(\theta,\phi_{h,k})\Big\|_2 \le \frac1K\Big\|\sum_{k=1}^K\big(f(\widehat\theta_h,\phi_{h,k})-f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})\big)\cdot\nabla^2_{\theta\theta}f(\theta,\phi_{h,k})\Big\|_2 + \frac1K\Big\|\sum_{k=1}^K\big(f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1})\big)\cdot\nabla^2_{\theta\theta}f(\theta,\phi_{h,k})\Big\|_2.\qquad(6)$$
On one hand, by Theorem G.2, with probability $1-\delta/2$, for all $h\in[H]$,
$$\frac1K\Big\|\sum_{k=1}^K\big(f(\widehat\theta_h,\phi_{h,k})-f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})\big)\cdot\nabla^2_{\theta\theta}f(\theta,\phi_{h,k})\Big\|_2 \le \kappa_2\cdot\max_{\theta,s,a}\|\nabla f(\theta,\phi(s,a))\|_2\,\|\widehat\theta_h-\theta_{\mathcal T\widehat V_{h+1}}\|_2 \le \kappa_2\kappa_1\|\widehat\theta_h-\theta_{\mathcal T\widehat V_{h+1}}\|_2 \le \kappa_2\kappa_1\sqrt{\frac{36H^2(\log(H/\delta)+C_{d,\log K})+2\lambda C_\Theta^2}{\kappa K}+\frac{b_{d,K,\epsilon_{\mathcal F}}}{\kappa}+\frac{2H\epsilon_{\mathcal F}}{\kappa}}.\qquad(7)$$
On the other hand, recalling the definition of $\mathcal T$,
$$\mathbb{E}\Big[\big(f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1})\big)\cdot\nabla^2_{\theta\theta}f(\theta,\phi_{h,k})\,\Big|\,s^k_h,a^k_h\Big] = \Big((\mathcal P_h\widehat V_{h+1})(s^k_h,a^k_h)-\mathbb{E}\big[r_{h,k}+\widehat V_{h+1}(s^k_{h+1})\,\big|\,s^k_h,a^k_h\big]\Big)\cdot\nabla^2_{\theta\theta}f(\theta,\phi_{h,k}) = 0.$$
Also, since $\|(f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1}))\cdot\nabla^2_{\theta\theta}f(\theta,\phi_{h,k})\|_2 \le H\kappa_2$, denote $\sigma^2 := K\cdot H^2\kappa_2^2$; then by the vector Hoeffding inequality (Lemma L.2),
$$\mathbb{P}\Big(\frac1K\Big\|\sum_{k=1}^K\big(f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1})\big)\cdot\nabla^2_{\theta\theta}f(\theta,\phi_{h,k})\Big\|_2 \ge \frac tK\,\Big|\,\{s^k_h,a^k_h\}_{k=1}^K\Big) \le d\cdot e^{-t^2/8dKH^2\kappa_2^2} := \delta,$$
which is equivalent to
$$\mathbb{P}\Big(\frac1K\Big\|\sum_{k=1}^K\big(f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1})\big)\cdot\nabla^2_{\theta\theta}f(\theta,\phi_{h,k})\Big\|_2 \le \sqrt{\frac{8dH^2\kappa_2^2\log(d/\delta)}{K}}\,\Big|\,\{s^k_h,a^k_h\}_{k=1}^K\Big) \ge 1-\delta.$$
Define $A$ to be the event inside the last probability. Then by the law of total expectation, $\mathbb{P}(A) = \mathbb{E}[\mathbf 1_A] = \mathbb{E}\big[\mathbb{P}[A\,|\,\{s^k_h,a^k_h\}_{k=1}^K]\big] \ge \mathbb{E}[1-\delta] = 1-\delta$, i.e. the same bound holds unconditionally.
With probability at least $1-\delta/2$ (and a union bound),
$$\frac1K\Big\|\sum_{k=1}^K\big(f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1})\big)\cdot\nabla^2_{\theta\theta}f(\theta,\phi_{h,k})\Big\|_2 \le \sqrt{\frac{8dH^2\kappa_2^2\log(2Hd/\delta)}{K}},\quad\forall h\in[H].$$
Combining the above with equations (6) and (7) and a union bound, w.p. $1-\delta$, for all $h\in[H]$,
$$\frac1K\Big\|\sum_{k=1}^K\big(f(\widehat\theta_h,\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1})\big)\cdot\nabla^2_{\theta\theta}f(\theta,\phi_{h,k})\Big\|_2 \le 6\kappa_2\kappa_1\sqrt{\frac{H^2(\log(2H/\delta)+C_{d,\log K})}{\kappa K}} + \sqrt{\frac{8dH^2\kappa_2^2\log(2Hd/\delta)}{K}} \le 9\kappa_2\max\Big(\frac{\kappa_1}{\sqrt\kappa},1\Big)\sqrt{\frac{dH^2(\log(2H/\delta)+C_{d,\log K})}{K}}.$$
Step 2: We finish the proof of the lemma. Consider the function class
$$\Big\{\tilde\theta \mapsto \frac1K\Big\|\sum_{k=1}^K\big(f(\widehat\theta_h,\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1})\big)\cdot\nabla^2_{\theta\theta}f(\tilde\theta,\phi_{h,k})\Big\|_2\ :\ \tilde\theta\in\Theta\Big\};$$
by the triangle inequality,
$$|f(\tilde\theta_1)-f(\tilde\theta_2)| \le \frac1K\Big\|\sum_{k=1}^K\big(f(\widehat\theta_h,\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1})\big)\cdot\big(\nabla^2_{\theta\theta}f(\tilde\theta_1,\phi_{h,k})-\nabla^2_{\theta\theta}f(\tilde\theta_2,\phi_{h,k})\big)\Big\|_2 \le H\cdot\sup_{s,a}\|\nabla^2_{\theta\theta}f(\tilde\theta_1,\phi)-\nabla^2_{\theta\theta}f(\tilde\theta_2,\phi)\|_2 \le H\kappa_3\|\tilde\theta_1-\tilde\theta_2\|_2.$$
By Lemma L.8, the covering number $\mathcal C$ of an $\epsilon$-net of this function class satisfies $\log\mathcal C \le d\log(1+\frac{2C_\Theta H\kappa_3}{\epsilon})$. Choosing $\epsilon = 1/K$ and taking a union bound over the $\mathcal C$ cases, we obtain for all $h\in[H]$,
$$\frac1K\Big\|\sum_{k=1}^K\big(f(\widehat\theta_h,\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1})\big)\cdot\nabla^2_{\theta\theta}f(\widehat\theta_h,\phi_{h,k})\Big\|_2 \le 9\kappa_2\max\Big(\frac{\kappa_1}{\sqrt\kappa},1\Big)\sqrt{\frac{dH^2(\log(2H/\delta)+d\log(1+2C_\Theta H\kappa_3K)+C_{d,\log K})}{K}} + \frac1K.$$

Combining Lemma E.2 and Theorem G.2 (and a union bound), we directly have:

Corollary E.3. With probability $1-\delta$,
$$\frac1K\|\Delta_{\Sigma^s_h}(\widehat\theta_h-\theta_{\mathcal T\widehat V_{h+1}})\|_2 \le \frac1K\|\Delta_{\Sigma^s_h}\|_2\,\|\widehat\theta_h-\theta_{\mathcal T\widehat V_{h+1}}\|_2 \le \widetilde O\Big(\kappa_2\max\Big(\frac{\kappa_1}{\kappa},\frac{1}{\sqrt\kappa}\Big)\frac{d^2H^2}{K}\Big).$$
Here $\widetilde O$ absorbs all constants and polylog terms.
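The covering-number accounting used in Step 2 is the standard ball-covering bound: an $\epsilon$-net of the radius-$C_\Theta$ ball in $\mathbb R^d$ has at most $(1+2C_\Theta/\epsilon)^d$ points, so an $L$-Lipschitz function class over $\Theta$ inherits $\log\mathcal C \le d\log(1+2C_\Theta L/\epsilon)$. A small numerical sketch with placeholder constants ($d$, $C_\Theta$, $H$, $\kappa_3$, $K$ below are illustrative, not the paper's values):

```python
import math

def log_covering_ball(d, radius, eps):
    """Standard bound: an eps-net of a radius-`radius` ball in R^d
    needs at most (1 + 2*radius/eps)^d points."""
    return d * math.log(1.0 + 2.0 * radius / eps)

# Function class that is L-Lipschitz in theta with L = H * kappa3,
# covered at function-value scale 1/K as in Step 2 of Lemma E.2:
# a theta-net at scale 1/(L*K) suffices.
d, C_Theta, H, kappa3, K = 8, 1.0, 20, 2.0, 10_000
L = H * kappa3
log_C = log_covering_ball(d, C_Theta, eps=1.0 / (L * K))
# log_C equals d * log(1 + 2*C_Theta*H*kappa3*K), matching the lemma's term.
print(round(log_C, 1))
```

The point of the computation: the union-bound cost enters only logarithmically, as an additive $d\log(1+2C_\Theta H\kappa_3K)$ term inside the square root.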
Now select $\theta = \theta_{\mathcal T\widehat V_{h+1}}$ in equation (4), and denote $\widetilde R_K(\theta_{\mathcal T\widehat V_{h+1}}) = \Delta_{\Sigma^s_h}(\widehat\theta_h-\theta_{\mathcal T\widehat V_{h+1}}) + R_K(\theta_{\mathcal T\widehat V_{h+1}})$. Then equation (4) is equivalent to
$$Z_h(\theta_{\mathcal T\widehat V_{h+1}}|\widehat V_{h+1}) - Z_h(\widehat\theta_h|\widehat V_{h+1}) = \Sigma^s_h(\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h) + R_K(\theta_{\mathcal T\widehat V_{h+1}}) = \Sigma_h(\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h) + \widetilde R_K(\theta_{\mathcal T\widehat V_{h+1}}).$$
Note that $\lambda > 0$ implies $\Sigma_h$ is invertible; then we have (recall $Z_h(\widehat\theta_h|\widehat V_{h+1}) = 0$)
$$\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h = \Sigma_h^{-1}\big[Z_h(\theta_{\mathcal T\widehat V_{h+1}}|\widehat V_{h+1})\big] - \Sigma_h^{-1}\widetilde R_K(\theta_{\mathcal T\widehat V_{h+1}}).$$
Plugging this back into equation (3) gives
$$\nabla f(\widehat\theta_h,\phi(s,a))^\top(\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h) = \underbrace{\nabla f(\widehat\theta_h,\phi(s,a))^\top\Sigma_h^{-1}\Big[\sum_{k=1}^K\big(f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1})\big)\cdot\nabla_\theta f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})\Big]}_{:=I} + \underbrace{\nabla f(\widehat\theta_h,\phi(s,a))^\top\Sigma_h^{-1}\big[\lambda\theta_{\mathcal T\widehat V_{h+1}} - \widetilde R_K(\theta_{\mathcal T\widehat V_{h+1}})\big]}_{:=\mathrm{Hot}_2}.\qquad(8)$$
We bound the second term $\mathrm{Hot}_2$, which has higher order $\widetilde O(\frac1K)$, in Section E.5, and focus on the first term. By direct decomposition,
$$I = \underbrace{\nabla f(\widehat\theta_h,\phi(s,a))^\top\Sigma_h^{-1}\Big[\sum_{k=1}^K\big(f(\theta_{\mathcal TV^\star_{h+1}},\phi_{h,k})-r_{h,k}-V^\star_{h+1}(s^k_{h+1})\big)\cdot\nabla^\top_\theta f(\widehat\theta_h,\phi_{h,k})\Big]}_{:=I_1} + \underbrace{\nabla f(\widehat\theta_h,\phi(s,a))^\top\Sigma_h^{-1}\Big[\sum_{k=1}^K\big(f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})-f(\theta_{\mathcal TV^\star_{h+1}},\phi_{h,k})-\widehat V_{h+1}(s^k_{h+1})+V^\star_{h+1}(s^k_{h+1})\big)\cdot\nabla^\top_\theta f(\widehat\theta_h,\phi_{h,k})\Big]}_{:=I_2} + \underbrace{\nabla f(\widehat\theta_h,\phi(s,a))^\top\Sigma_h^{-1}\sum_{k=1}^K\big(f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1})\big)\cdot\big(\nabla^\top_\theta f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})-\nabla^\top_\theta f(\widehat\theta_h,\phi_{h,k})\big)}_{:=I_3}.$$

E.2 BOUNDING THE TERM $I_3$

We first bound the term $I_3$, via the following lemma.

Lemma E.4. Fix any $V(\cdot)\in\mathbb R^{\mathcal S}$ with $\|V\|_\infty\le H$ and any fixed $\theta$ such that $\|\theta_{\mathcal TV}-\theta\|_2 \le \sqrt{\frac{36H^2(\log(H/\delta)+C_{d,\log K})+2\lambda C_\Theta^2}{\kappa K}}$.
Let
$$\widetilde I_3 := \nabla f(\widehat\theta_h,\phi(s,a))^\top\Sigma_h^{-1}\Big[\sum_{k=1}^K\big(f(\theta_{\mathcal TV},\phi_{h,k})-r_{h,k}-V(s^k_{h+1})\big)\cdot\big(\nabla_\theta f(\theta_{\mathcal TV},\phi_{h,k})-\nabla_\theta f(\theta,\phi_{h,k})\big)\Big],$$
and suppose $K \ge \max\big\{512\frac{\kappa_1^4}{\kappa^2}\big[\log(\frac{2d}{\delta})+d\log(1+\frac{4\kappa_1D^2\kappa_2C_\Theta K^3}{\lambda^2})\big],\frac{4\lambda}{\kappa}\big\}$. Then with probability $1-\delta$ (where $D = \max\Big\{\kappa_1,\sqrt{\frac{(144dH^2\kappa_2^2(H^2\log(H/\delta)+C_{d,\log K})+8dH^2\kappa_2^2\lambda C_\Theta^2)\log(d/\delta)}{\kappa}}\Big\}$),
$$|\widetilde I_3| \le 4\kappa_1\sqrt{\frac{(144dH^2\kappa_2^2(H^2\log(H/\delta)+C_{d,\log K})+8dH^2\kappa_2^2\lambda C_\Theta^2)\log(d/\delta)}{\kappa^3}}\cdot\frac1K + \widetilde O\Big(\frac{1}{K^{3/2}}\Big).$$

Proof of Lemma E.4. Indeed, with probability $1-\delta/2$,
$$|\widetilde I_3| \le \|\nabla f(\widehat\theta_h,\phi(s,a))\|_{\Sigma_h^{-1}}\cdot\Big\|\sum_{k=1}^K\big(f(\theta_{\mathcal TV},\phi_{h,k})-r_{h,k}-V(s^k_{h+1})\big)\cdot\big(\nabla_\theta f(\theta_{\mathcal TV},\phi_{h,k})-\nabla_\theta f(\theta,\phi_{h,k})\big)\Big\|_{\Sigma_h^{-1}} \le \Big(\frac{2\kappa_1}{\sqrt{\kappa K}}+\widetilde O\Big(\frac1K\Big)\Big)\cdot\Big\|\sum_{k=1}^K\big(f(\theta_{\mathcal TV},\phi_{h,k})-r_{h,k}-V(s^k_{h+1})\big)\cdot\big(\nabla_\theta f(\theta_{\mathcal TV},\phi_{h,k})-\nabla_\theta f(\theta,\phi_{h,k})\big)\Big\|_{\Sigma_h^{-1}},$$
where, under the condition $K \ge \max\big\{512\frac{\kappa_1^4}{\kappa^2}\big[\log(\frac{2d}{\delta})+d\log(1+\frac{4\kappa_1^3\kappa_2C_\Theta K^3}{\lambda^2})\big],\frac{4\lambda}{\kappa}\big\}$, we applied Lemma L.5. Next, on one hand,
$$\|\nabla_\theta f(\theta_{\mathcal TV},\phi_{h,k})-\nabla_\theta f(\theta,\phi_{h,k})\|_2 \le \kappa_2\cdot\|\theta_{\mathcal TV}-\theta\|_2 \le \kappa_2\sqrt{\frac{36H^2(\log(H/\delta)+C_{d,\log K})+2\lambda C_\Theta^2}{\kappa K}}.$$
On the other hand,
$$\mathbb{E}\Big[\big(f(\theta_{\mathcal TV},\phi_{h,k})-r_{h,k}-V(s^k_{h+1})\big)\cdot\big(\nabla^\top_\theta f(\theta_{\mathcal TV},\phi_{h,k})-\nabla^\top_\theta f(\theta,\phi_{h,k})\big)\,\Big|\,s^k_h,a^k_h\Big] = \big((\mathcal P_hV)(s^k_h,a^k_h)-(\mathcal P_hV)(s^k_h,a^k_h)\big)\cdot\big(\nabla^\top_\theta f(\theta_{\mathcal TV},\phi_{h,k})-\nabla^\top_\theta f(\theta,\phi_{h,k})\big) = 0.$$
Therefore, by the vector Hoeffding inequality (Lemma L.2), also noting the boundedness $\|(f(\theta_{\mathcal TV},\phi_{h,k})-r_{h,k}-V(s^k_{h+1}))\cdot(\nabla_\theta f(\theta_{\mathcal TV},\phi_{h,k})-\nabla_\theta f(\theta,\phi_{h,k}))\|_2 \le H\kappa_2\|\theta_{\mathcal TV}-\theta\|_2 \le H\kappa_2\sqrt{\frac{36H^2(\log(H/\delta)+C_{d,\log K})+2\lambda C_\Theta^2}{\kappa K}}$, with probability $1-\delta/2$,
$$\frac1K\Big\|\sum_{k=1}^K\big(f(\theta_{\mathcal TV},\phi_{h,k})-r_{h,k}-V(s^k_{h+1})\big)\cdot\big(\nabla_\theta f(\theta_{\mathcal TV},\phi_{h,k})-\nabla_\theta f(\theta,\phi_{h,k})\big)\Big\|_2 \le \sqrt{\frac{4d\big(H\kappa_2\sqrt{\frac{36H^2(\log(H/\delta)+C_{d,\log K})+2\lambda C_\Theta^2}{\kappa K}}\big)^2\log(\frac d\delta)}{K}} = \sqrt{\frac{(144dH^2\kappa_2^2(H^2\log(H/\delta)+C_{d,\log K})+8dH^2\kappa_2^2\lambda C_\Theta^2)\log(d/\delta)}{\kappa}}\cdot\frac1K,$$
and this implies, with probability $1-\delta/2$,
$$\Big\|\sum_{k=1}^K\big(f(\theta_{\mathcal TV},\phi_{h,k})-r_{h,k}-V(s^k_{h+1})\big)\cdot\big(\nabla_\theta f(\theta_{\mathcal TV},\phi_{h,k})-\nabla_\theta f(\theta,\phi_{h,k})\big)\Big\|_2 \le \sqrt{\frac{(144dH^2\kappa_2^2(H^2\log(H/\delta)+C_{d,\log K})+8dH^2\kappa_2^2\lambda C_\Theta^2)\log(d/\delta)}{\kappa}}.$$
Choosing $u = \sum_{k=1}^K(f(\theta_{\mathcal TV},\phi_{h,k})-r_{h,k}-V(s^k_{h+1}))\cdot(\nabla_\theta f(\theta_{\mathcal TV},\phi_{h,k})-\nabla_\theta f(\theta,\phi_{h,k}))$ in Lemma L.5, by a union bound we obtain with probability $1-\delta$,
$$|\widetilde I_3| \le 4\kappa_1\sqrt{\frac{(144dH^2\kappa_2^2(H^2\log(H/\delta)+C_{d,\log K})+8dH^2\kappa_2^2\lambda C_\Theta^2)\log(d/\delta)}{\kappa^3}}\cdot\frac1K + \widetilde O\Big(\frac{1}{K^{3/2}}\Big).$$

Lemma E.5. Under the same conditions as Lemma E.4, with probability $1-\delta$,
$$|I_3| \le 4\kappa_1\sqrt{\frac{(144dH^2\kappa_2^2(H^2\log(H/\delta)+D_{d,\log K}+C_{d,\log K})+8dH^2\kappa_2^2\lambda C_\Theta^2)(\log(d/\delta)+D_{d,\log K})}{\kappa^3}}\cdot\frac1K + \widetilde O\Big(\frac{1}{K^{3/2}}\Big).$$
Here $D_{d,\log K} := d\log(1+6C_\Theta(2\kappa_1^2+H\kappa_2)K) + d\log(1+6C_\Theta H\kappa_2K) + d\log\big(1+288C_\Theta\kappa_1^2(\kappa_1\sqrt{C_\Theta}+2\sqrt B\kappa_1\kappa_2)^2K^2\big) + d^2\log\big(1+288\sqrt dB\kappa_1^4K^2\big) = \widetilde O(d^2)$, with $\widetilde O$ absorbing polylog terms.

Proof of Lemma E.5. Define $h(V,\tilde\theta,\theta) = \sum_{k=1}^K\big(f(\tilde\theta,\phi_{h,k})-r_{h,k}-V(s^k_{h+1})\big)\cdot\big(\nabla_\theta f(\tilde\theta,\phi_{h,k})-\nabla_\theta f(\theta,\phi_{h,k})\big)$. Splitting the difference and bounding each factor (using $\|\nabla f\|_2\le\kappa_1$, $\|\nabla^2 f\|_2\le\kappa_2$, and boundedness of the regression residuals), we obtain
$$|h(V_1,\tilde\theta_1,\theta_1)-h(V_2,\tilde\theta_2,\theta_2)| \le (2\kappa_1^2+H\kappa_2)K\|\tilde\theta_1-\tilde\theta_2\|_2 + 2\kappa_1K\|V_1-V_2\|_\infty + HK\kappa_2\|\theta_1-\theta_2\|_2.$$
Let $\mathcal C_a$ be an $\frac{\epsilon/3}{(2\kappa_1^2+H\kappa_2)K}$-covering net of $\{\theta:\|\theta\|_2\le C_\Theta\}$, $\mathcal C_V$ an $\frac{\epsilon}{6\kappa_1K}$-covering net of the class $\mathcal V$ defined in Lemma L.9, and $\mathcal C_b$ an $\frac{\epsilon}{3H\kappa_2K}$-covering net of $\{\theta:\|\theta\|_2\le C_\Theta\}$. Then by Lemma L.8 and Lemma L.9,
$$\log|\mathcal C_a| \le d\log\Big(1+\frac{6C_\Theta(2\kappa_1^2+H\kappa_2)K}{\epsilon}\Big),\quad \log|\mathcal C_b| \le d\log\Big(1+\frac{6C_\Theta H\kappa_2K}{\epsilon}\Big),$$
$$\log\mathcal C_V \le d\log\Big(1+\frac{288C_\Theta\kappa_1^2(\kappa_1\sqrt{C_\Theta}+2\sqrt B\kappa_1\kappa_2)^2K^2}{\epsilon^2}\Big) + d^2\log\Big(1+\frac{288\sqrt dB\kappa_1^4K^2}{\epsilon^2}\Big).$$
Further, with probability $1-\delta/2$ (by Lemma L.5), for all fixed parameter sets $\theta,V$ satisfying $\|\theta_{\mathcal TV}-\theta\|_2 \le \sqrt{\frac{36H^2(\log(2H/\delta)+C_{d,\log K})+2\lambda C_\Theta^2}{\kappa K}}$ simultaneously,
$$|I_3-\widetilde I_3| \le \|\nabla f(\widehat\theta_h,\phi(s,a))\|_{\Sigma_h^{-1}}\cdot\big\|h(\widehat V_{h+1},\theta_{\mathcal T\widehat V_{h+1}},\widehat\theta_h)-h(V,\theta_{\mathcal TV},\theta)\big\|_{\Sigma_h^{-1}} \le \Big(\frac{2\kappa_1}{\sqrt{\kappa K}}+\widetilde O\Big(\frac1K\Big)\Big)\cdot\big\|h(\widehat V_{h+1},\theta_{\mathcal T\widehat V_{h+1}},\widehat\theta_h)-h(V,\theta_{\mathcal TV},\theta)\big\|_{\Sigma_h^{-1}},$$
and $\|\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h\|_2 \le \sqrt{\frac{36H^2(\log(2H/\delta)+C_{d,\log K})+2\lambda C_\Theta^2}{\kappa K}}$ with probability $1-\delta/2$ by Theorem G.2. Now, choosing $\epsilon = \widetilde O(1/K^2)$, applying Lemma E.4 and a union bound over the covering instances, we obtain the bound stated in Lemma E.5 with probability $1-\delta$.

E.3 BOUNDING THE SECOND TERM $I_2$

In this section, we bound the term
$$I_2 := \nabla f(\widehat\theta_h,\phi(s,a))^\top\Sigma_h^{-1}\Big[\sum_{k=1}^K\big(f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})-f(\theta_{\mathcal TV^\star_{h+1}},\phi_{h,k})-\widehat V_{h+1}(s^k_{h+1})+V^\star_{h+1}(s^k_{h+1})\big)\cdot\nabla^\top_\theta f(\widehat\theta_h,\phi_{h,k})\Big].$$
The following lemma shows that $I_2$ is a higher-order error term of rate $\widetilde O(\frac1K)$.

Lemma E.6 (Bounding $I_2$). If $K$ satisfies $K \ge 512\frac{\kappa_1^4}{\kappa^2}\big[\log(\frac{2d}{\delta})+d\log(1+\frac{4\kappa_1^3\kappa_2C_\Theta K}{\lambda^2})\big]$ and $K \ge 4\lambda/\kappa$, then with probability $1-\delta$,
$$|I_2| \le \widetilde O\Big(\frac{\kappa_1^2H^2d^2}{\kappa K}\Big) + \widetilde O\Big(\frac{1}{K^{3/2}}\Big).$$
Here $\widetilde O$ absorbs constants and polylog terms.

Proof of Lemma E.6. Step 1. Define $\eta_k(V) := f(\theta_{\mathcal TV},\phi_{h,k})-f(\theta_{\mathcal TV^\star_{h+1}},\phi_{h,k})-V(s^k_{h+1})+V^\star_{h+1}(s^k_{h+1})$, and let $V(\cdot)$ with $\|V\|_\infty\le H$ be any fixed function such that $\sup_{s^k_h,a^k_h,s^k_{h+1}}|\eta_k(V)| \le \widetilde O\big(\kappa_1\sqrt{\frac{H^2d^2}{\kappa K}}\big)$, i.e. an arbitrary fixed $V$ in the neighborhood (measured by $\eta_k$) of $V^\star_{h+1}$. By the definition of $\mathcal T$ it holds that $\mathbb{E}[\eta_k(V)\,|\,s^k_h,a^k_h] = 0$. Let $\theta\in\Theta$ be arbitrary and fixed, and define $x_k(\theta) = \nabla_\theta f(\theta,\phi_{h,k})$ and $G_h(\theta) = \sum_{k=1}^K\nabla f(\theta,\phi(s^k_h,a^k_h))\cdot\nabla f(\theta,\phi(s^k_h,a^k_h))^\top + \lambda I_d$. Since $\|x_k\|_2\le\kappa_1$ and $|\eta_k| \le \widetilde O\big(\kappa_1\sqrt{\frac{H^2d^2}{\kappa K}}\big)$, by the self-normalized Hoeffding inequality (Lemma L.3), with probability $1-\delta$ (taking $t:=K$ in Lemma L.3),
$$\Big\|\sum_{k=1}^Kx_k(\theta)\eta_k(V)\Big\|_{G_h(\theta)^{-1}} \le \widetilde O\Big(\kappa_1\sqrt{\frac{H^2d^2}{\kappa K}}\Big)\cdot\sqrt{d\log\frac{\lambda+K\kappa_1}{\lambda\delta}}.$$
Step 2. Define $h(V,\theta) := \sum_{k=1}^Kx_k(\theta)\eta_k(V)$ and $H(V,\theta) := \|\sum_{k=1}^Kx_k(\theta)\eta_k(V)\|_{G_h(\theta)^{-1}}$. By definition $|\eta_k(V)|\le 2H$, which implies $\|h(V,\theta)\|_2 \le 2KH\kappa_1$; moreover $|\eta_k(V_1)-\eta_k(V_2)| \le |\mathcal P_hV_1-\mathcal P_hV_2| + \|V_1-V_2\|_\infty \le 2\|V_1-V_2\|_\infty$ and
$$\|h(V_1,\theta_1)-h(V_2,\theta_2)\|_2 \le K\max_k\big(2H\|x_k(\theta_1)-x_k(\theta_2)\|_2 + \kappa_1|\eta_k(V_1)-\eta_k(V_2)|\big) \le K\big(2H\kappa_2\|\theta_1-\theta_2\|_2 + 2\kappa_1\|V_1-V_2\|_\infty\big).$$
Furthermore,
$$\|G_h(\theta_1)^{-1}-G_h(\theta_2)^{-1}\|_2 \le \|G_h(\theta_1)^{-1}\|_2\,\|G_h(\theta_1)-G_h(\theta_2)\|_2\,\|G_h(\theta_2)^{-1}\|_2 \le \frac{1}{\lambda^2}K\sup_k\big\|\nabla f(\theta_1,\phi_{h,k})\cdot\nabla f(\theta_1,\phi_{h,k})^\top-\nabla f(\theta_2,\phi_{h,k})\cdot\nabla f(\theta_2,\phi_{h,k})^\top\big\|_2 \le \frac{1}{\lambda^2}K\sup_k\Big(\big\|(\nabla f(\theta_1,\phi_{h,k})-\nabla f(\theta_2,\phi_{h,k}))\cdot\nabla f(\theta_1,\phi_{h,k})^\top\big\|_2 + \big\|\nabla f(\theta_2,\phi_{h,k})\cdot(\nabla f(\theta_1,\phi_{h,k})^\top-\nabla f(\theta_2,\phi_{h,k})^\top)\big\|_2\Big) \le \frac{2\kappa_1\kappa_2K}{\lambda^2}\|\theta_1-\theta_2\|_2.$$
All of the above imply
$$|H(V_1,\theta_1)-H(V_2,\theta_2)|^2 \le \big|h(V_1,\theta_1)^\top G_h(\theta_1)^{-1}h(V_1,\theta_1)-h(V_2,\theta_2)^\top G_h(\theta_2)^{-1}h(V_2,\theta_2)\big| \le \|h(V_1,\theta_1)-h(V_2,\theta_2)\|_2\cdot\frac1\lambda\cdot2KH\kappa_1 + 2KH\kappa_1\cdot\|G_h(\theta_1)^{-1}-G_h(\theta_2)^{-1}\|_2\cdot2KH\kappa_1 + 2KH\kappa_1\cdot\frac1\lambda\cdot\|h(V_1,\theta_1)-h(V_2,\theta_2)\|_2 \le \Big(4\sqrt{K^3}H^2\kappa_1\kappa_2\frac1\lambda + 8K^3H^2\kappa_1^3\kappa_2\frac{1}{\lambda^2}\Big)\|\theta_1-\theta_2\|_2 + 4\sqrt{K^3}\kappa_1^2H\frac1\lambda\|V_1-V_2\|_\infty.$$

Published as a conference paper at ICLR 2023

Then an $\epsilon$-covering net of $\{H(V,\theta)\}$ can be constructed from the union of an $\frac{\epsilon^2}{4\big(4\sqrt{K^3}H^2\kappa_1\kappa_2\frac1\lambda+8K^3H^2\kappa_1^3\kappa_2\frac{1}{\lambda^2}\big)^2}$-covering net of $\{\theta\in\Theta\}$ and an $\frac{\epsilon^2}{4\big(4\sqrt{K^3}\kappa_1^2H\frac1\lambda\big)^2}$-covering net of the class $\mathcal V$ in Lemma L.9. The covering number $N_\epsilon$ satisfies
$$\log N_\epsilon \le d\log\Big(1+\frac{8C_\Theta\big(4\sqrt{K^3}H^2\kappa_1\kappa_2\frac1\lambda+8K^3H^2\kappa_1^3\kappa_2\frac{1}{\lambda^2}\big)^2}{\epsilon^2}\Big) + d\log\Big(1+\frac{8C_\Theta(\kappa_1\sqrt{C_\Theta}+2\sqrt B\kappa_1\kappa_2)^2\cdot16\big(4\sqrt{K^3}\kappa_1^2H\frac1\lambda\big)^4}{\epsilon^4}\Big) + d^2\log\Big(1+\frac{8\sqrt dB\kappa_1^2\cdot16\big(4\sqrt{K^3}\kappa_1^2H\frac1\lambda\big)^4}{\epsilon^4}\Big).$$
Step 3.
First note that, by the definitions in Step 2,
$$\Big\|\sum_{k=1}^K\big(f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})-f(\theta_{\mathcal TV^\star_{h+1}},\phi_{h,k})-\widehat V_{h+1}(s^k_{h+1})+V^\star_{h+1}(s^k_{h+1})\big)\cdot\nabla^\top_\theta f(\widehat\theta_h,\phi_{h,k})\Big\|_{\Sigma_h^{-1}} = H(\widehat V_{h+1},\widehat\theta_h),$$
and with probability $1-\delta$,
$$|\eta_k(\widehat V_{h+1})| = \big|f(\theta_{\mathcal T\widehat V_{h+1}},\phi_{h,k})-f(\theta_{\mathcal TV^\star_{h+1}},\phi_{h,k})-\widehat V_{h+1}(s^k_{h+1})+V^\star_{h+1}(s^k_{h+1})\big| \le \kappa_1\cdot\|\theta_{\mathcal T\widehat V_{h+1}}-\theta^\star_h\|_2 + \|\widehat V_{h+1}-V^\star_{h+1}\|_\infty \le \kappa_1\sqrt{\frac{36H^2(\log(H/\delta)+C_{d,\log K})+2\lambda C_\Theta^2}{\kappa K}} + C\sqrt{\frac{\kappa_1H^2d^2}{\kappa K}} = \widetilde O\Big(\kappa_1\sqrt{\frac{H^2d^2}{\kappa K}}\Big),$$
and hence
$$H(\widehat V_{h+1},\widehat\theta_h) = \Big\|\sum_{k=1}^Kx_k(\widehat\theta_h)\eta_k(\widehat V_{h+1})\Big\|_{G_h(\widehat\theta_h)^{-1}} \le \widetilde O\Big(\kappa_1\sqrt{\frac{H^2d^2}{\kappa K}}\Big)\cdot\sqrt{d+d^2} = \widetilde O\Big(\frac{\kappa_1H^2d^2}{\sqrt{\kappa K}}\Big),\qquad(10)$$
where we absorb all polylog terms. Meanwhile, by Lemma L.5, with probability $1-\delta$,
$$\|\nabla f(\widehat\theta_h,\phi_{s,a})\|_{\Sigma_h^{-1}} \le \frac{2\kappa_1}{\sqrt{\kappa K}}+\widetilde O\Big(\frac1K\Big).\qquad(11)$$
Finally, by equations (10) and (11) and a union bound, we have with probability $1-\delta$,
$$|I_2| \le \|\nabla f(\widehat\theta_h,\phi_{s,a})\|_{\Sigma_h^{-1}}\cdot H(\widehat V_{h+1},\widehat\theta_h) \le \Big(\frac{2\kappa_1}{\sqrt{\kappa K}}+\widetilde O\Big(\frac1K\Big)\Big)\cdot\widetilde O\Big(\frac{\kappa_1H^2d^2}{\sqrt{\kappa K}}\Big) = \widetilde O\Big(\frac{\kappa_1^2H^2d^2}{\kappa K}\Big)+\widetilde O\Big(\frac{1}{K^{3/2}}\Big),$$
where the first inequality is the Cauchy-Schwarz inequality.

E.4 BOUNDING THE MAIN TERM $I_1$

In this section, we bound the dominant term
$$I_1 := \nabla f(\widehat\theta_h,\phi(s,a))^\top\Sigma_h^{-1}\Big[\sum_{k=1}^K\big(f(\theta_{\mathcal TV^\star_{h+1}},\phi_{h,k})-r_{h,k}-V^\star_{h+1}(s^k_{h+1})\big)\cdot\nabla^\top_\theta f(\widehat\theta_h,\phi_{h,k})\Big].$$
First of all, by the Cauchy-Schwarz inequality, we have
$$|I_1| \le \|\nabla f(\widehat\theta_h,\phi(s,a))\|_{\Sigma_h^{-1}}\cdot\Big\|\sum_{k=1}^K\big(f(\theta_{\mathcal TV^\star_{h+1}},\phi_{h,k})-r_{h,k}-V^\star_{h+1}(s^k_{h+1})\big)\cdot\nabla^\top_\theta f(\widehat\theta_h,\phi_{h,k})\Big\|_{\Sigma_h^{-1}}.\qquad(12)$$
Then we have the following lemma to bound $I_1$.

Lemma E.7. With probability $1-\delta$,
$$|I_1| \le 4Hd\,\|\nabla f(\widehat\theta_h,\phi(s,a))\|_{\Sigma_h^{-1}}\cdot C_{\delta,\log K} + \widetilde O\Big(\frac{\kappa_1}{\sqrt\kappa K}\Big),$$
where $C_{\delta,\log K}$ contains only polylog terms.

Proof of Lemma E.7. Step 1. Let $\theta\in\Theta$ be arbitrary and fixed; define $x_k(\theta) = \nabla_\theta f(\theta,\phi_{h,k})$ and $G_h(\theta) = \sum_{k=1}^K\nabla f(\theta,\phi(s^k_h,a^k_h))\cdot\nabla f(\theta,\phi(s^k_h,a^k_h))^\top+\lambda I_d$; then $\|x_k\|_2\le\kappa_1$. Also denote $\eta_k := f(\theta_{\mathcal TV^\star_{h+1}},\phi_{h,k})-r_{h,k}-V^\star_{h+1}(s^k_{h+1})$; then $\mathbb{E}[\eta_k\,|\,s^k_h,a^k_h] = 0$ and $|\eta_k|\le H$. Now by the self-normalized Hoeffding inequality (Lemma L.3), with probability $1-\delta$ (taking $t:=K$ in Lemma L.3),
$$\Big\|\sum_{k=1}^Kx_k(\theta)\eta_k\Big\|_{G_h(\theta)^{-1}} \le 2H\sqrt{d\log\frac{\lambda+K\kappa_1}{\lambda\delta}}.$$
Step 2. Define $h(\theta) := \sum_{k=1}^Kx_k(\theta)\eta_k$ and $H(\theta) := \|\sum_{k=1}^Kx_k(\theta)\eta_k\|_{G_h(\theta)^{-1}}$. By $|\eta_k|\le H$ it follows that $\|h(\theta)\|_2 \le KH\kappa_1$, and by $x_k(\theta_1)-x_k(\theta_2) = \nabla^2_{\theta\theta}f(\xi,\phi)\cdot(\theta_1-\theta_2)$,
$$\|h(\theta_1)-h(\theta_2)\|_2 \le K\max_k\big(H\|x_k(\theta_1)-x_k(\theta_2)\|_2\big) \le HK\kappa_2\|\theta_1-\theta_2\|_2.$$
Furthermore,
$$\|G_h(\theta_1)^{-1}-G_h(\theta_2)^{-1}\|_2 \le \|G_h(\theta_1)^{-1}\|_2\,\|G_h(\theta_1)-G_h(\theta_2)\|_2\,\|G_h(\theta_2)^{-1}\|_2 \le \frac{1}{\lambda^2}K\sup_k\big\|\nabla f(\theta_1,\phi_{h,k})\cdot\nabla f(\theta_1,\phi_{h,k})^\top-\nabla f(\theta_2,\phi_{h,k})\cdot\nabla f(\theta_2,\phi_{h,k})^\top\big\|_2 \le \frac{2\kappa_1\kappa_2K}{\lambda^2}\|\theta_1-\theta_2\|_2.$$
All of the above imply
$$|H(\theta_1)-H(\theta_2)|^2 \le \big|h(\theta_1)^\top G_h(\theta_1)^{-1}h(\theta_1)-h(\theta_2)^\top G_h(\theta_2)^{-1}h(\theta_2)\big| \le \|h(\theta_1)-h(\theta_2)\|_2\cdot\frac1\lambda\cdot KH\kappa_1 + KH\kappa_1\cdot\|G_h(\theta_1)^{-1}-G_h(\theta_2)^{-1}\|_2\cdot KH\kappa_1 + KH\kappa_1\cdot\frac1\lambda\cdot\|h(\theta_1)-h(\theta_2)\|_2 \le \big(4K^2H^2\kappa_1\kappa_2/\lambda + 2K^3H^2\kappa_1^3\kappa_2/\lambda^2\big)\|\theta_1-\theta_2\|_2.$$
Then an $\epsilon$-covering net of $\{H(\theta)\}$ can be constructed from an $\frac{\epsilon^2}{\big(\sqrt{4K^2H^2\kappa_1\kappa_2/\lambda}+\sqrt{2K^3H^2\kappa_1^3\kappa_2/\lambda^2}\big)^2}$-covering net of $\{\theta\in\Theta\}$. By Lemma L.8, the covering number $N_\epsilon$ satisfies
$$\log N_\epsilon \le d\log\Big(1+\frac{2C_\Theta\big(\sqrt{4K^2H^2\kappa_1\kappa_2/\lambda}+\sqrt{2K^3H^2\kappa_1^3\kappa_2/\lambda^2}\big)^2}{\epsilon^2}\Big) = \widetilde O(d).$$
Step 3. First note that, by the definitions in Step 2,
$$\Big\|\sum_{k=1}^K\big(f(\theta_{\mathcal TV^\star_{h+1}},\phi_{h,k})-r_{h,k}-V^\star_{h+1}(s^k_{h+1})\big)\cdot\nabla^\top_\theta f(\widehat\theta_h,\phi_{h,k})\Big\|_{\Sigma_h^{-1}} = H(\widehat\theta_h).$$
Now choosing $\epsilon = \widetilde O(1/K)$ in Step 2 and applying a union bound over the covering number in Step 2, we obtain with probability $1-\delta$,
$$H(\widehat\theta_h) = \Big\|\sum_{k=1}^Kx_k(\widehat\theta_h)\eta_k\Big\|_{G_h(\widehat\theta_h)^{-1}} \le 2H\sqrt{d\log\frac{\lambda+K\kappa_1}{\lambda\delta}+\widetilde O(d)}+\widetilde O\Big(\frac1K\Big),$$
where we absorb all polylog terms. Combining the above with equation (12), we obtain with probability $1-\delta$,
$$|I_1| \le \|\nabla f(\widehat\theta_h,\phi(s,a))\|_{\Sigma_h^{-1}}\cdot\Big(2H\sqrt{d\log\frac{\lambda+K\kappa_1}{\lambda\delta}+\widetilde O(d)}+\widetilde O\Big(\frac1K\Big)\Big) \le 4Hd\,\|\nabla f(\widehat\theta_h,\phi(s,a))\|_{\Sigma_h^{-1}}\cdot C_{\delta,\log K}+\widetilde O\Big(\frac{\kappa_1}{\sqrt\kappa K}\Big),$$
where $C_{\delta,\log K}$ contains only polylog terms.
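The self-normalized quantity $\|\sum_kx_k\eta_k\|_{G^{-1}}$ controlled in Step 1 can be simulated directly. The sketch below (synthetic $x_k$ and bounded, conditionally zero-mean $\eta_k$; dimensions and noise are illustrative) shows how normalizing by $G$ keeps the statistic at the $O(H\sqrt d)$ scale instead of growing with $K$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, lam, H = 5, 2000, 1.0, 1.0

X = rng.normal(size=(K, d))          # stand-ins for x_k = grad f(theta, phi_{h,k})
eta = rng.uniform(-H, H, size=K)     # zero-mean noise with |eta_k| <= H

G = X.T @ X + lam * np.eye(d)        # G_h(theta) = sum_k x_k x_k^T + lam * I_d
S = X.T @ eta                        # sum_k x_k * eta_k

# Self-normalized norm ||sum_k x_k eta_k||_{G^{-1}} = sqrt(S^T G^{-1} S)
self_norm = float(np.sqrt(S @ np.linalg.solve(G, S)))
unnormalized = float(np.linalg.norm(S))
print(self_norm, unnormalized)
```

The unnormalized sum grows like $\sqrt K$, while the $G^{-1}$-weighted norm stays at a constant scale in $K$, which is exactly why the concentration bound in Step 1 depends on $K$ only through logarithms.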

E.5 ANALYZING $\mathrm{Hot}_2$ IN EQUATION (8)

Lemma E.8. Recall $\mathrm{Hot}_2 := \nabla f(\widehat\theta_h,\phi(s,a))^\top\Sigma_h^{-1}\big[\widetilde R_K(\theta_{\mathcal T\widehat V_{h+1}})+\lambda\theta_{\mathcal T\widehat V_{h+1}}\big]$. If the number of episodes $K$ satisfies $K \ge \max\big\{512\frac{\kappa_1^4}{\kappa^2}\big[\log(\frac{2d}{\delta})+d\log(1+\frac{4\kappa_1^3\kappa_2C_\Theta K^3}{\kappa\lambda^2})\big],\frac{4\lambda}{\kappa}\big\}$, then with probability $1-\delta$,
$$\Big|\nabla f(\widehat\theta_h,\phi(s,a))^\top\Sigma_h^{-1}\big[\widetilde R_K(\theta_{\mathcal T\widehat V_{h+1}})+\lambda\theta_{\mathcal T\widehat V_{h+1}}\big]\Big| \le \widetilde O\Bigg(\frac{\kappa_2\max(\frac{\kappa_1}{\kappa},\frac{1}{\sqrt\kappa})d^2H^2 + d^2H^3\frac{\kappa_3+\lambda\kappa_1C_\Theta}{\kappa}}{K}\Bigg),$$
where $\widetilde O$ absorbs all constants and polylog terms.

Proof of Lemma E.8. Step 1: we first show that with probability $1-\delta$, $|\nabla f(\widehat\theta_h,\phi(s,a))^\top\Sigma_h^{-1}\widetilde R_K(\theta_{\mathcal T\widehat V_{h+1}})| \le \widetilde O(\frac1K)$. Recall that by plugging $\theta_{\mathcal T\widehat V_{h+1}}$ into equation (4), we have
$$Z_h(\theta_{\mathcal T\widehat V_{h+1}}|\widehat V_{h+1})-Z_h(\widehat\theta_h|\widehat V_{h+1}) = \frac{\partial}{\partial\theta}Z_h(\widehat\theta_h|\widehat V_{h+1})(\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h)+R_K(\theta_{\mathcal T\widehat V_{h+1}}),$$
and by the second-order Taylor's Theorem,
$$\|R_K(\theta_{\mathcal T\widehat V_{h+1}})\|_2 = \Big\|Z_h(\theta_{\mathcal T\widehat V_{h+1}}|\widehat V_{h+1})-Z_h(\widehat\theta_h|\widehat V_{h+1})-\frac{\partial}{\partial\theta}Z_h(\widehat\theta_h|\widehat V_{h+1})(\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h)\Big\|_2 = \Big\|\frac12(\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h)^\top\frac{\partial^2}{\partial\theta\partial\theta}Z_h(\xi|\widehat V_{h+1})(\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h)\Big\|_2 \le \frac12\kappa_{z2}\|\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h\|_2^2.\qquad(15)$$
Note that
$$\frac{\partial^2}{\partial\theta\partial\theta}Z_h(\theta|\widehat V_{h+1})\Big|_{\theta=\xi} = \frac{\partial}{\partial\theta}\Sigma^s_h = \sum_{k=1}^K\frac{\partial}{\partial\theta}\Big[\big(f(\xi,\phi_{h,k})-r_{h,k}-\widehat V_{h+1}(s^k_{h+1})\big)\cdot\nabla^2_{\theta\theta}f(\xi,\phi_{h,k})\Big] + \sum_{k=1}^K\frac{\partial}{\partial\theta}\Big[\nabla_\theta f(\xi,\phi_{h,k})\nabla^\top_\theta f(\xi,\phi_{h,k})\Big].\qquad(16)$$
Therefore we can bound $\kappa_{z2}$ by $\kappa_{z2} \le (H\kappa_3+3\kappa_1\kappa_2)K$, and this implies with probability $1-\delta/2$,
$$\|R_K(\theta_{\mathcal T\widehat V_{h+1}})\|_2 \le \frac12(H\kappa_3+3\kappa_1\kappa_2)K\cdot\|\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h\|_2^2 \le \frac12(H\kappa_3+3\kappa_1\kappa_2)K\cdot\frac{36H^2(\log(H/\delta)+C_{d,\log K})+2\lambda C_\Theta^2}{\kappa K} \le \widetilde O\big((H\kappa_3+3\kappa_1\kappa_2)H^2d^2/\kappa\big).$$
And by Corollary E.3, with probability $1-\delta/2$, $\|\Delta_{\Sigma^s_h}(\widehat\theta_h-\theta_{\mathcal T\widehat V_{h+1}})\|_2 \le \widetilde O(1)$. Therefore, by Lemma L.5 and a union bound, with probability $1-\delta$,
$$\big|\nabla f(\widehat\theta_h,\phi(s,a))^\top\Sigma_h^{-1}\widetilde R_K(\theta_{\mathcal T\widehat V_{h+1}})\big| = \big|\nabla f(\widehat\theta_h,\phi(s,a))^\top\Sigma_h^{-1}\big(\Delta_{\Sigma^s_h}(\widehat\theta_h-\theta_{\mathcal T\widehat V_{h+1}})+R_K(\theta_{\mathcal T\widehat V_{h+1}})\big)\big| \le \|\nabla f(\widehat\theta_h,\phi(s,a))\|_{\Sigma_h^{-1}}\cdot\big\|\Delta_{\Sigma^s_h}(\widehat\theta_h-\theta_{\mathcal T\widehat V_{h+1}})+R_K(\theta_{\mathcal T\widehat V_{h+1}})\big\|_{\Sigma_h^{-1}} \le \Big(\frac{2\kappa_1}{\sqrt{\kappa K}}+\widetilde O\Big(\frac1K\Big)\Big)\Big(\frac{C}{\sqrt K}+\widetilde O\Big(\frac1K\Big)\Big) = \widetilde O\Bigg(\frac{\kappa_2\max(\frac{\kappa_1}{\kappa},\frac{1}{\sqrt\kappa})d^2H^2 + d^2H^3\frac{\kappa_3}{\kappa}}{K}\Bigg),$$
where $\widetilde O$ absorbs all constants and polylog terms, and the last inequality uses the bounds for $\|R_K(\theta_{\mathcal T\widehat V_{h+1}})\|_2$ and $\|\Delta_{\Sigma^s_h}(\widehat\theta_h-\theta_{\mathcal T\widehat V_{h+1}})\|_2$.

Step 2: By Lemma L.5, with probability $1-\delta$,
$$\big|\nabla f(\widehat\theta_h,\phi(s,a))^\top\Sigma_h^{-1}\lambda\theta_{\mathcal T\widehat V_{h+1}}\big| \le \lambda\|\nabla f(\widehat\theta_h,\phi(s,a))\|_{\Sigma_h^{-1}}\,\|\theta_{\mathcal T\widehat V_{h+1}}\|_{\Sigma_h^{-1}} \le \lambda\Big(\frac{2\kappa_1}{\sqrt{\kappa K}}+\widetilde O\Big(\frac1K\Big)\Big)\Big(\frac{2C_\Theta}{\sqrt{\kappa K}}+\widetilde O\Big(\frac1K\Big)\Big) = \frac{4\lambda\kappa_1C_\Theta}{\kappa K}+\widetilde O\Big(\frac{1}{K^{3/2}}\Big).$$

F PROOF OF THEOREM 3.2

Now we are ready to prove Theorem 3.2; we begin with the first part. Recall that we consider exact Bellman completeness ($\epsilon_{\mathcal F}=0$).

F.1 THE FIRST PART

Proof of Theorem 3.2 (first part). First of all, from the previous calculations (3) and (8), we have
$$\big|\mathcal P_h\widehat V_{h+1}(s,a)-\widehat{\mathcal P}_h\widehat V_{h+1}(s,a)\big| \le \big|\nabla f(\widehat\theta_h,\phi(s,a))^\top(\theta_{\mathcal T\widehat V_{h+1}}-\widehat\theta_h)\big|+|\mathrm{Hot}_{h,1}| \le |I_1|+|I_2|+|I_3|+|\mathrm{Hot}_{h,2}|+|\mathrm{Hot}_{h,1}|.$$
Now by Lemma E.5, Lemma E.6, Lemma E.7, Lemma E.8 and Lemma E.1 (and a union bound), with probability $1-\delta$,
$$|I_3| \le \widetilde O\Big(\sqrt{\frac{d^3H^2\kappa_2^2\kappa_1^2}{\kappa^3}}\Big)\frac1K,\qquad |I_2| \le \widetilde O\Big(\frac{\kappa_1^2H^2d^2}{\kappa K}\Big)+\widetilde O\Big(\frac{1}{K^{3/2}}\Big),\qquad |I_1| \le 4Hd\,\|\nabla f(\widehat\theta_h,\phi(s,a))\|_{\Sigma_h^{-1}}\cdot C_{\delta,\log K}+\widetilde O\Big(\frac{\kappa_1}{\sqrt\kappa K}\Big),$$
$$|\mathrm{Hot}_{h,2}| \le \widetilde O\Bigg(\frac{\kappa_2\max(\frac{\kappa_1}{\kappa},\frac{1}{\sqrt\kappa})d^2H^2 + d^2H^3\frac{\kappa_3+\lambda\kappa_1C_\Theta}{\kappa}}{K}\Bigg),\qquad |\mathrm{Hot}_{h,1}| \le \widetilde O\Big(\frac{H^2\kappa_2d^2}{\kappa}\Big)\frac1K.$$
Finally, plugging the above into Lemma D.3 and applying a union bound over all $h\in[H]$, we have with probability $1-\delta$, for any policy $\pi$,
$$v^\pi-v^{\widehat\pi} \le \sum_{h=1}^H2\cdot\mathbb{E}_\pi\big[|I_1|+|I_2|+|I_3|+|\mathrm{Hot}_{h,2}|+|\mathrm{Hot}_{h,1}|\big] \le \sum_{h=1}^H8dH\,\mathbb{E}_\pi\Big[\sqrt{\nabla^\top f(\widehat\theta_h,\phi(s_h,a_h))\Sigma_h^{-1}\nabla f(\widehat\theta_h,\phi(s_h,a_h))}\Big]\cdot\iota+\widetilde O\Big(\frac{C_{\mathrm{hot}}}{K}\Big),$$
where $\iota = C_{\delta,\log K}$ contains only polylog terms and
$$C_{\mathrm{hot}} = \frac{\kappa_1H}{\sqrt\kappa}+\frac{\kappa_1^2H^3d^2}{\kappa}+\sqrt{\frac{d^3H^4\kappa_2^2\kappa_1^2}{\kappa^3}}+\kappa_2\max\Big(\frac{\kappa_1}{\kappa},\frac{1}{\sqrt\kappa}\Big)d^2H^3+d^2H^4\frac{\kappa_3+\lambda\kappa_1C_\Theta}{\kappa}+\frac{H^3\kappa_2d^2}{\kappa}.$$

F.2 THE SECOND PART

Next we prove the second part of Theorem 3.2.

Proof of Theorem 3.2 (second part).

Step 1. Choosing $\pi=\pi^\star$ in the first part, we have
$$0 \le v^{\pi^\star}-v^{\widehat\pi} \le \sum_{h=1}^H8dH\cdot\mathbb{E}_{\pi^\star}\Big[\sqrt{\nabla^\top_\theta f(\widehat\theta_h,\phi(s_h,a_h))\Sigma_h^{-1}\nabla_\theta f(\widehat\theta_h,\phi(s_h,a_h))}\Big]\cdot\iota+\widetilde O\Big(\frac{C_{\mathrm{hot}}}{K}\Big).$$
Next, by the triangle inequality for norms,
$$\Big|\|\nabla_\theta f(\widehat\theta_h,\phi(s_h,a_h))\|_{\Sigma_h^{-1}}-\|\nabla_\theta f(\theta^\star_h,\phi(s_h,a_h))\|_{\Sigma_h^{-1}}\Big| \le \|\nabla_\theta f(\widehat\theta_h,\phi(s_h,a_h))-\nabla_\theta f(\theta^\star_h,\phi(s_h,a_h))\|_{\Sigma_h^{-1}} = \|\nabla^2_{\theta\theta}f(\xi,\phi(s_h,a_h))\cdot(\widehat\theta_h-\theta^\star_h)\|_{\Sigma_h^{-1}}.$$
Since with probability $1-\delta$,
$$\|\nabla^2_{\theta\theta}f(\xi,\phi(s_h,a_h))\cdot(\widehat\theta_h-\theta^\star_h)\|_2 \le \kappa_2\|\widehat\theta_h-\theta^\star_h\|_2 \le \widetilde O\Big(\kappa_2\sqrt{\frac{\kappa_1H^2d}{\kappa}}\cdot\sqrt{\frac1K}\Big),$$
where the last inequality uses part three of Theorem G.3, by a union bound and Lemma L.5,
$$\|\nabla^2_{\theta\theta}f(\xi,\phi(s_h,a_h))\cdot(\widehat\theta_h-\theta^\star_h)\|_{\Sigma_h^{-1}} \le \widetilde O\Big(\sqrt{\frac{\kappa_1\kappa_2^2H^2d}{\kappa^3}}\cdot\frac1K\Big).$$
Step 2. Next, we show that with probability $1-\delta$, $\|\nabla_\theta f(\theta^\star_h,\phi(s_h,a_h))\|_{\Sigma_h^{-1}} \le 2\|\nabla_\theta f(\theta^\star_h,\phi(s_h,a_h))\|_{\Sigma_h^{\star-1}}$. First,
$$\Big\|\frac{\Sigma_h}{K}-\frac{\Sigma^\star_h}{K}\Big\|_2 = \frac1K\Big\|\sum_{k=1}^K\nabla f(\widehat\theta_h,\phi(s,a))\nabla f(\widehat\theta_h,\phi(s,a))^\top-\nabla f(\theta^\star_h,\phi(s,a))\nabla f(\theta^\star_h,\phi(s,a))^\top\Big\|_2 \le 2\kappa_2\kappa_1\|\widehat\theta_h-\theta^\star_h\|_2 \le \widetilde O\Big(\kappa_2\kappa_1\sqrt{\frac{H^2d}{\kappa K}}\Big).$$
Second, by Lemma L.6, with probability $1-\delta$,
$$\Big\|\frac{\Sigma^\star_h}{K}-\mathbb{E}_\mu\big[\nabla_\theta f(\theta^\star_h,\phi)\nabla_\theta f(\theta^\star_h,\phi)^\top\big]-\frac{\lambda}{K}I\Big\|_2 \le \frac{4\sqrt2\kappa_1^2}{\sqrt K}\Big(\log\frac{2d}{\delta}\Big)^{1/2}.$$
This implies
$$\Big\|\frac{\Sigma^\star_h}{K}\Big\|_2 \le \big\|\mathbb{E}_\mu\big[\nabla_\theta f(\theta^\star_h,\phi)\nabla_\theta f(\theta^\star_h,\phi)^\top\big]\big\|_2+\frac{\lambda}{K}+\frac{4\sqrt2\kappa_1^2}{\sqrt K}\Big(\log\frac{2d}{\delta}\Big)^{1/2} \le \kappa_1^2+\lambda+4\sqrt2\kappa_1^2\Big(\log\frac{2d}{\delta}\Big)^{1/2},$$
and also, by Weyl's spectral theorem and under the condition $K \ge \frac{128\kappa_1^4\log(2d/\delta)}{\kappa^2}$, with probability $1-\delta$,
$$\lambda_{\min}\Big(\frac{\Sigma^\star_h}{K}\Big) \ge \lambda_{\min}\big(\mathbb{E}_\mu\big[\nabla_\theta f(\theta^\star_h,\phi)\nabla_\theta f(\theta^\star_h,\phi)^\top\big]\big)+\frac{\lambda}{K}-\frac{4\sqrt2\kappa_1^2}{\sqrt K}\Big(\log\frac{2d}{\delta}\Big)^{1/2} \ge \kappa+\frac{\lambda}{K}-\frac{4\sqrt2\kappa_1^2}{\sqrt K}\Big(\log\frac{2d}{\delta}\Big)^{1/2} \ge \frac{\kappa}{2},$$
so $\|(\frac{\Sigma^\star_h}{K})^{-1}\|_2 \le \frac2\kappa$. Similarly, with probability $1-\delta$, $\|(\frac{\Sigma_h}{K})^{-1}\|_2 \le \frac2\kappa$.
Then by Lemma L.7, ∥∇ θ f (θ ⋆ h , ϕ(s, a))∥ KΣ -1 h ≤ 1 + KΣ ⋆-1 h ∥Σ ⋆ h /K∥ • KΣ -1 h • ∥Σ h /K -Σ ⋆ h /K∥ • ∥∇ θ f (θ ⋆ h , ϕ(s, a))∥ KΣ ⋆-1 h ≤   1 + 4 κ 2 O(κ 2 1 + λ) O κ2κ 2 1 H 2 d κ 1 K   • ∥∇ θ f (θ ⋆ h , ϕ(s, a))∥ KΣ ⋆-1 h ≤2 ∥∇ θ f (θ ⋆ h , ϕ(s, a))∥ KΣ ⋆-1 h as long as K ≥ O( (κ 2 1 +λ) 2 κ 2 2 κ 2 1 H 4 d 2 κ 6 ). The above is equivalently to ∇ θ f (θ ⋆ h , ϕ(s h , a h )) Σ -1 h ≤ 2 ∇ θ f (θ ⋆ h , ϕ(s h , a h )) Σ ⋆-1 h . Combining Step1, Step2 and a union bound, we have with probability 1 -δ, 0 ≤v π ⋆ -v π ≤ H h=1 8dH • Eπ⋆ ∇ ⊤ θ f ( θ h , ϕ(s h , a h ))Σ -1 h ∇ θ f ( θ h , ϕ(s h , a h )) • ι + O( C hot K ) ≤ H h=1 8dH • Eπ⋆ ∇ ⊤ θ f (θ ⋆ h , ϕ(s h , a h ))Σ -1 h ∇ θ f (θ ⋆ h , ϕ(s h , a h )) • ι + O( C hot K ) + O κ1κ2H 4 d 2 κ 3/2 • 1 K ≤ H h=1 16dH • Eπ⋆ ∇ ⊤ θ f (θ ⋆ h , ϕ(s h , a h ))Σ ⋆-1 h ∇ θ f (θ ⋆ h , ϕ(s h , a h )) • ι + O( C ′ hot K ) where C ′ hot = C hot + κ1κ2H 4 d 2 κ 3/2 .
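The matrix-perturbation argument of Step 2 can be checked numerically; below is a minimal sketch assuming synthetic gradient features (the uniform feature distribution, dimensions, and λ are hypothetical stand-ins of ours, not the paper's construction): when the Gram matrices built from ∇f(θ̂_h, ·) and ∇f(θ*_h, ·) differ by O(1/√K), the two weighted norms agree up to a factor of 2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, lam = 5, 20000, 1.0

# Hypothetical stand-in for the gradient features grad f(theta*, phi(s,a)):
# bounded random vectors (this synthetic distribution is ours, not the paper's).
X = rng.uniform(-1.0, 1.0, size=(K, d))
Sigma_star = X.T @ X + lam * np.eye(d)                 # Sigma*_h
# Perturb by O(1/sqrt(K)), mimicking grad f(theta_hat_h, .) when
# ||theta_hat_h - theta*_h||_2 = O(1/sqrt(K)) as in Theorem G.3.
X_hat = X + rng.normal(scale=1.0 / np.sqrt(K), size=X.shape)
Sigma_hat = X_hat.T @ X_hat + lam * np.eye(d)          # Sigma_hat_h

v = rng.uniform(-1.0, 1.0, size=d)                     # a query direction
norm_hat = float(np.sqrt(v @ np.linalg.solve(Sigma_hat, v)))
norm_star = float(np.sqrt(v @ np.linalg.solve(Sigma_star, v)))
assert norm_hat <= 2.0 * norm_star                     # Step 2's conclusion
```

In this well-covered regime the two norms are in fact nearly identical; the factor 2 in the proof only needs to absorb the worst case allowed by the K ≥ O((κ₁²+λ)²κ₂²κ₁²H⁴d²/κ⁶) condition.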

G PROVABLE EFFICIENCY BY REDUCTION TO GENERAL FUNCTION APPROXIMATION

In this section, we bound the parameter estimation error θ h -θ T V h+1 2 via a reduction to the general function approximation scheme of Chen and Jiang (2019).

Recall the objective

ℓ h (θ) := 1 K K k=1 f θ, ϕ(s k h , a k h ) -r(s k h , a k h ) -V h+1 s k h+1 2 + λ K • θ 2 2 (17) Then by definition, θ h := arg min θ∈Θ ℓ h (θ) and θ T V h+1 satisfies f (θ T V h+1 , ϕ) = P h V h+1 + δ V h+1 . Therefore, in this case, we have the following lemma: Lemma G.1. Fix h ∈ [H]. With probability 1 -δ, E µ [ℓ h ( θ h )]-E µ [ℓ h (θ T V h+1 )] ≤ 36H 2 (log(1/δ) + C d,log K ) + λC 2 Θ K + 16H 3 ϵ F (log(1/δ) + C d,log K ) K +4Hϵ F . where the expectation over µ is taken w.r.t. (s k h , a k h , s k h+1 ) k = 1, ..., K only (i.e., first compute E µ [ℓ h (θ)] for a fixed θ, then plug-in either θ h+1 or θ T V h+1 ). Here C d,log(K) := d log(1+24C Θ (H + 1)κ 1 K)+d log 1 + 288H 2 C Θ (κ 1 √ C Θ + 2 κ 1 κ 2 /λ) 2 K 2 + d 2 log 1 + 288H 2 √ dκ 2 1 K 2 /λ . Proof of Lemma G.1. Step1: we first prove the case where λ = 0. Indeed, fix h ∈ [H] and any function V (•) ∈ R S . Similarly, define f V (s, a) := f (θ TV , ϕ) = P h V + δ V . For any fixed θ ∈ Θ, denote g(s, a) = f (θ, ϕ(s, a)). Then define 10 X(g, V, f V ) := (g(s, a) -r -V (s ′ )) 2 -(f V (s, a) -r -V (s ′ )) 2 . Since all episodes are independent of each other, X k (g, V, f V ) := X(g(s k h , a k h ), V (s k h+1 ), f V (s k h , a k h )) are independent r.v.s and it holds 1 K K k=1 X k (g, V, f V ) = ℓ(g) -ℓ(f V ). 
( ) Next, the variance of X is bounded by: Var[X(g, V, f V )] ≤ E µ [X(g, f, f V ) 2 ] =E µ (g(s h , a h ) -r h -V (s h+1 )) 2 -(f V (s h , a h ) -r h -V (s h+1 )) 2 2 =E µ (g(s h , a h ) -f V (s h , a h )) 2 (g(s h , a h ) + f V (s h , a h ) -2r h -2V (s h+1 )) 2 ≤4H 2 • E µ [(g(s h , a h ) -f V (s h , a h )) 2 ] ≤4H 2 • E µ (g(s h , a h ) -r h -V (s h+1 )) 2 -(f V (s h , a h ) -r h -V (s h+1 )) 2 + 8H 3 ϵ F ( * ) =4H 2 • E µ [X(g, f, f V )] + 8H 3 ϵ F where the step ( * ) comes from E µ (g(s h , a h ) -r h -V (s h+1 )) 2 -(f V (s h , a h ) -r h -V (s h+1 )) 2 =E µ [(g(s h , a h ) -f V (s h , a h )) • (g(s h , a h ) + f V (s h , a h ) -2r h -2V (s h+1 ))] =E µ [(g(s h , a h ) -f V (s h , a h )) • (g(s h , a h ) -f V (s h , a h ) + 2f V (s h , a h ) -2r h -2V (s h+1 ))] =E µ (g(s h , a h ) -f V (s h , a h )) 2 + E µ [2(g(s h , a h ) -f V (s h , a h ))E P h [f V (s h , a h ) -r h -V (s h+1 ) | s h , a h ]] ≥E µ (g(s h , a h ) -f V (s h , a h )) 2 -2H δ V ∞ ≥ E µ (g(s h , a h ) -f V (s h , a h )) 2 -2Hϵ F (19) where the last step uses law of total expectation and the definition of f V . Therefore, by Bernstein inequality, with probability 1 -δ, E µ [X(g, f, f V )] - 1 K K k=1 X k (g, f, f V ) ≤ 2Var[X(g, f, f V )] log(1/δ) K + 4H 2 log(1/δ) 3K ≤ 8H 2 E µ [X(g, f, f V )] log(1/δ) K + 16H 3 ϵ F log(1/δ) K + 4H 2 log(1/δ) 3K . Now, if we choose g(s, a) := f ( θ h , ϕ(s, a)), then θ h minimizes ℓ h (θ), therefore, it also minimizes 1 K K k=1 X i (θ, V h+1 , f V h+1 ) and this implies 1 K K k=1 X k ( θ h , V h+1 , f V h+1 ) ≤ 1 K K k=1 X k (θ T V h+1 , V h+1 , f V h+1 ) = 0. Therefore, we obtain E µ [X( θ h , V h+1 , f V h+1 )] ≤ 8H 2 • E µ [X( θ h , V h+1 , f V h+1 )] log(1/δ) K + 16H 3 ϵ F log(1/δ) K + 4H 2 log(1/δ) 3K . 
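The Bernstein bound above is self-bounding in the expectation being controlled; the same quadratic trick is used repeatedly in this section. A short derivation, with $x := \mathbb{E}_\mu[X(\hat\theta_h, \hat V_{h+1}, f_{\hat V_{h+1}})]$, $a^2$ the coefficient $8H^2\log(1/\delta)/K$, and $b$ collecting the remaining additive terms (this grouping into $a$ and $b$ is our notation):

```latex
x \le a\sqrt{x} + b
\;\Longrightarrow\;
\Big(\sqrt{x}-\tfrac{a}{2}\Big)^2 \le b+\tfrac{a^2}{4}
\;\Longrightarrow\;
x \le 2\Big(\tfrac{a}{2}\Big)^2 + 2\Big(b+\tfrac{a^2}{4}\Big) = a^2 + 2b ,
```

where the last step uses $(p+q)^2 \le 2p^2 + 2q^2$. After the covering argument replaces $\log(1/\delta)$ by $\log(1/\delta) + C_{d,\log K}$, this is exactly how the $36H^2(\log(1/\delta)+C_{d,\log K})/K$-type rates arise.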
However, the above does not hold with probability 1 -δ since θ h and V h+1 := min{max a f ( θ h+1 , ϕ(•, a))-∇f ( θ h+1 , ϕ(•, a)) ⊤ A • ∇f (θ, ϕ(•, a) ), H} (where A is certain symmetric matrix with bounded norm) depend on θ h and θ h+1 which are data-dependent. Therefore, we need to further apply covering Lemma L.10 and choose ϵ = O(1/K) and a union bound to obtain with probability 1 -δ, . 11 Solving this quadratic equation to obtain with probability 1 -δ, E µ [X( θ h , V h+1 , f V h+1 )] ≤ 8H 2 • E µ [X( θ h , V h+1 , f V h+1 )](log(1/δ) + C d,log K ) K + 7H 2 (log(1/δ) + C d,log K ) 3K + 16H 3 ϵ F (log(1/δ) + C d,log K ) K + 4Hϵ F where C d,log(K) := log(1+24C Θ (H + 1)κ 1 K)+d log 1 + 288H 2 C Θ (κ 1 √ C Θ + 2 κ 1 κ 2 /λ) 2 K 2 + d 2 log 1 + 288H 2 √ dκ 2 1 K 2 /λ E µ [X( θ h , V h+1 , f V h+1 )] ≤ 36H 2 (log(1/δ) + C d,log K ) K + 16H 3 ϵ F (log(1/δ) + C d,log K ) K +4Hϵ F Now according to equation 18, by definition we finally have with probability 1 -δ (recall the expectation over µ is taken w.r.t. (s k h , a k h , s k h+1 ) k = 1, ..., K only) E µ [ℓ h ( θ h+1 )] -E µ [ℓ h (θ T V h+1 )] = E µ [X( θ h , V h+1 , f V h+1 )] ≤ 36H 2 (log(1/δ) + C d,log K ) K + 16H 3 ϵ F (log(1/δ) + C d,log K ) K + 4Hϵ F . ( ) Step2. If λ > 0, there is only extra term λ K θ h 2 -θ T V h+1 2 ≤ λ K θ h 2 ≤ λC 2 Θ K in addition to above. This finishes the proof. Theorem G.2 (Provable efficiency (Part I)). Let C d,log K be the same as Lemma G.1. Then denote b d,K,ϵ F := 16H 3 ϵ F (log(1/δ)+C d,log K ) K + 4Hϵ F , with probability 1 -δ θ h -θ T V h+1 2 ≤ 36H 2 (log(H/δ) + C d,log K ) + 2λC 2 Θ κK + b d,K,ϵ F κ + 2Hϵ F κ , ∀h ∈ [H]. Proof of Theorem G.2. 
Applying a union bound to Lemma G.1, we have with probability 1 -δ, Eµ[ℓ h ( θ h )] -Eµ[ℓ h (θ T V h+1 )] ≤ 36H 2 (log(H/δ) + C d,log K ) + λC 2 Θ K + b d,K,ϵ F , ∀h ∈ [H] ⇒Eµ[ℓ h ( θ h ) - λ K θ h 2 2 ] -Eµ[ℓ h (θ T V h+1 ) - λ K θ T V h+1 2 2 ] ≤ 36H 2 (log(H/δ) + C d,log K ) + 2λC 2 Θ K + b d,K,ϵ F (21) Now we prove for all h ∈ [H], Eµ f ( θ h , ϕ(•, •)) -f (θ T V h+1 , ϕ(•, •)) 2 ≤ Eµ ℓh( θ h ) - λ θ h 2 2 K -Eµ ℓh(θ T V h+1 ) - λ θ T V h+1 2 2 K +2HϵF . (22) Indeed, similar to equation 20, by definition we have E µ ℓh( θ h ) - λ θ h 2 2 K -E µ ℓh(θ T V h+1 ) - λ θ T V h+1 2 2 K = E µ [X( θ h , V h+1 , f V h+1 )] =E µ f θ h , ϕ(s h , a h ) -r h -V h+1 (s h+1 ) 2 -f θ T V h+1 , ϕ(s h , a h ) -r h -V h+1 (s h+1 ) 2 =E µ f ( θ h , ϕ(•, •)) -f (θ T V h+1 , ϕ(•, •)) 2 +E µ f ( θ h , ϕ(s h , a h )) -f (θ T V h+1 , ϕ(s h , a h )) • f θ T V h+1 , ϕ(s h , a h ) -r h -V h+1 (s h+1 ) =E µ f ( θ h , ϕ(•, •)) -f (θ T V h+1 , ϕ(•, •)) 2 +E µ f ( θ h , ϕ(s h , a h )) -f (θ T V h+1 , ϕ(s h , a h )) • E f θ T V h+1 , ϕ(s h , a h ) -r h -V h+1 (s h+1 ) s h , a h ≥E µ f ( θ h , ϕ(•, •)) -f (θ T V h+1 , ϕ(•, •)) 2 -2Hϵ F , where the third identity uses that µ is taken w.r.t. s h , a h , s h+1 (recall Lemma G.1) together with the law of total expectation, and the first inequality uses the definition of θ T V h+1 . (Footnote 11: here in our realization of Lemma L.9, we set B = 1/λ since Σ -1 h 2 ≤ 1/λ.) Now applying Assumption 2.3, we have E µ f ( θ h , ϕ(•, •)) -f (θ T V h+1 , ϕ(•, •)) 2 ≥ κ θ h -θ T V h+1 2 2 . Combining the above with equation 21 and equation 22, we obtain the stated result. Theorem G.3 (Provable efficiency (Part II)). Let C d,log K be the same as Lemma G.1 and suppose ϵ F = 0. Furthermore, suppose λ ≤ 1/2C 2 Θ and K ≥ max 512 κ 4 1 κ 2 log( 2d δ ) + d log(1 + 4κ 3 1 κ2CΘK 3 λ 2 ) , 4λ κ . Then, with probability 1 -δ, ∀h ∈ [H], sup s,a f ( θ h , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) ≤ κ1H 36H 2 (log(H 2 /δ) + C d,log K ) + 2λC 2 Θ κ + 2H 2 dκ1 √ κ 1 K +O( 1 K ). 
Furthermore, we have with probability 1δ, sup h V h -V ⋆ h ∞ ≤ κ1H 36H 2 (log(H 2 /δ) + C d,log K ) + 2λC 2 Θ κ + 2H 2 dκ1 √ κ 1 K + O( 1 K ) = O κ1H 2 d 2 κ 1 K where O absorbs Polylog terms and higher order terms. Lastly, it also holds for all h ∈ [H], w.p. 1 -δ θ h -θ ⋆ h 2 ≤ κ 1 H 72H 2 (log(H 2 /δ) + C d,log K ) + 4λC 2 Θ κ + 4H 2 dκ 1 κ 1 K + O( 1 K ) = O κ 1 H 2 d κ 1 K Proof of Theorem G.3. Step1: we show the first result. We prove this by backward induction. When h = H + 1, by convention f ( θ h , ϕ(s, a)) = f (θ ⋆ h , ϕ(s, a)) = 0 so the base case holds. Suppose for h + 1, with probability 1 -(H -h)δ, it holds true that sup s,a f ( θ h+1 , ϕ(s, a)) -f (θ ⋆ h+1 , ϕ(s, a)) ≤ C h+1 1 K + a(h + 1) , we next consider the case for t = h. On one hand, by Theorem G.2, we have with probability 1 -δ/2, sup s,a f ( θ h , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) ≤ sup s,a f ( θ h , ϕ(s, a)) -f (θ T V h+1 , ϕ(s, a)) + sup s,a f (θ T V h+1 , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) = sup s,a ∇f (ξ, ϕ(s, a)) ⊤ ( θ h -θ T V h+1 ) + sup s,a f (θ T V h+1 , ϕ(s, a)) -f (θ TV ⋆ h+1 , ϕ(s, a)) ≤κ 1 • θ h -θ T V h+1 2 + sup s,a P h,s,a V h+1 -P h,s,a V ⋆ h+1 ≤κ 1 36H 2 (log(H/δ) + C d,log K ) + 2λC 2 Θ κK + V h+1 -V ⋆ h+1 ∞ , Recall V h+1 (•) := min{max a f ( θ h+1 , ϕ(•, a)) -Γ h (•, a), H} and V ⋆ h+1 (•) = max a f (θ ⋆ h+1 , ϕ(•, a)) = min{max a f (θ ⋆ h+1 , ϕ(•, a)), H}, we obtain V h+1 -V ⋆ h+1 ∞ ≤ sup s,a f ( θ h+1 , ϕ(s, a)) -f (θ ⋆ h+1 , ϕ(s, a)) + sup h,s,a Γ h (s, a) Note the above holds true for any generic Γ h (s, a). 
In particular, according to Algorithm 1, we specify Γ h (•, •) = dH ∇ θ f ( θ h , ϕ(•, •)) ⊤ Σ -1 h ∇ θ f ( θ h , ϕ(•, •)) + O( 1 K ) and by Lemma L.5, with probability 1 -δ, Γ h ≤ 2dHκ 1 √ κK + O( 1 K ) and by a union bound this implies with probability 1 - (H -h + 1)δ, sup s,a f ( θ h , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) ≤C h+1 1 K + κ 1 36H 2 (log(H/δ) + C d,log K ) + 2λC 2 Θ κK + 2dHκ 1 √ κK + O( 1 K ) := C h 1 K + O( 1 K ) Solving for C h , we obtain C h ≤ κ 1 H 36H 2 (log(H/δ)+C d,log K )+2λC 2 Θ κ + H 2dHκ1 √ κ for all H. By a union bound (replacing δ by δ/H), we obtain the stated result. Step2: Utilizing the intermediate result equation 23, we directly have with probability 1 -δ, sup h V h -V ⋆ h ∞ ≤ sup s,a f ( θ h , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) + 2dHκ 1 √ κK + O( 1 K ), where sup s,a f ( θ h , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) can be bounded using Step1. Step3: Denote M := κ 1 H 36H 2 (log(H 2 /δ)+C d,log K )+2λC 2 Θ κ + 2H 2 dκ1 √ κ 1 K + O( 1 K ), then by Step1 we have with probability 1 -δ (here ξ is some point between θ h and θ ⋆ h ) for all h ∈ [H] M 2 ≥ sup s,a f ( θ h , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) 2 ≥E µ,h [(f ( θ h , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a))) 2 ] ≥ κ θ h -θ ⋆ h 2 2 where the last inequality is by Assumption 2.3. Solve this to obtain the stated result. H WITH POSITIVE BELLMAN COMPLETENESS COEFFICIENT ϵ F > 0 In Theorem 3.2, we consider the case where ϵ F = 0. If ϵ F > 0, similar guarantee can be achieved with the measurement of model misspecification. For instance, the additional error 16H 3 ϵ F (log(1/δ)+C d,log K ) K + 4Hϵ F will show up in Lemma G.1 (as stated in the current version), b d,K,ϵ F κ + 2Hϵ F κ will show up in Lemma G.2. Then the decomposition in equation 3 will incur the extra δ V h+1 term with δ V h+1 might not be 0. The analysis with positive ϵ F > 0 will make the proofs more intricate but incurs no additional technical challenge. 
Since the inclusion of this quantity is not our major focus, we only provide the proof for the case where ϵ F = 0, so that readers can focus on the more critical components that characterize the hardness of the differentiable function class.
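The pessimism bonus Γ_h used throughout (as specified in the proof of Theorem G.3) is straightforward to compute once the fitted gradients are available; below is a minimal sketch with synthetic gradient vectors (the dimensions and data are hypothetical, and the O(1/K) correction term from the algorithm is dropped):

```python
import numpy as np

def pessimism_bonus(grad_sa, grads_data, d, H, lam=1.0):
    """Gamma_h(s,a) = d * H * sqrt( g(s,a)^T Sigma_h^{-1} g(s,a) ), where
    g = grad f(theta_hat_h, phi) and Sigma_h = sum_k g_k g_k^T + lam * I.
    The O(1/K) additive correction from the algorithm is omitted here."""
    Sigma = grads_data.T @ grads_data + lam * np.eye(grads_data.shape[1])
    return float(d * H * np.sqrt(grad_sa @ np.linalg.solve(Sigma, grad_sa)))

# Hypothetical gradients: K logged points in R^d and one query point.
rng = np.random.default_rng(1)
d, H, K = 4, 10, 1000
grads_data = rng.normal(size=(K, d))
grad_sa = rng.normal(size=d)
gamma = pessimism_bonus(grad_sa, grads_data, d, H)
# Better coverage shrinks the bonus: adding more logged gradients makes
# Sigma_h strictly larger, so the quadratic form strictly decreases.
gamma_big = pessimism_bonus(grad_sa, np.vstack([grads_data, grads_data]), d, H)
assert 0.0 < gamma_big < gamma
```

This monotonicity is exactly why the Γ_h terms are O(1/√K) under the coverage assumption: doubling the effective sample size roughly halves Γ_h².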

I VFQL AND ITS ANALYSIS

We present the vanilla fitted Q-learning (VFQL) Algorithm 2 as follows. For VFQL, no pessimism is used and we assume θ h ∈ Θ without loss of generality. Algorithm 2 Vanilla Fitted Q-Learning (VFQL) 1: Input: Offline Dataset D = s k h , a k h , r k h , s k h+1 K,H k,h=1 . Denote ϕ h,k := ϕ(s k h , a k h ). 2: Initialization: Set V H+1 (•) ← 0 and λ > 0. 3: for h = H, H -1, . . . , 1 do 4: Set θ h ← arg min θ∈Θ K k=1 f (θ, ϕ h,k ) -r h,k -V h+1 (s k h+1 ) 2 + λ • θ 2 2 5: Set Q h (•, •) ← min f ( θ h , ϕ(•, •)), H -h + 1 + 6: Set π h (• | •) ← arg max π h Q h (•, •), π h (• | •) A , V h (•) ← max π h Q h (•, •), π h (• | •) A 7: end for 8: Output: { π h } H h=1 . I.1 ANALYSIS FOR VFQL (THEOREM 3.1) Recall ι h (s, a) := P h V h+1 (s, a) -Q h (s, a) and the definition of Bellman operator D.1. Note min{•, H -h + 1} + is a non-expansive operator, therefore we have |ι h (s, a)| =|P h V h+1 (s, a) -Q h (s, a)| = min P h V h+1 (s, a), H -h + 1 + -min f ( θ h , ϕ(•, •)), H -h + 1 + ≤ P h V h+1 (s, a) -f ( θ h , ϕ(•, •)) ≤ f (θ T V h+1 ) -f ( θ h , ϕ(•, •)) + ϵ F . By Lemma D.2, we have for any π, v π -v π = - H h=1 E π [ι h (s h , a h )] + H h=1 Eπ[ι h (s h , a h )] ≤ H h=1 E π [|ι h (s h , a h )|] + H h=1 Eπ[|ι h (s h , a h )|] ≤ H h=1 E π [|f (θ T V h+1 , ϕ(•, •)) -f ( θ h , ϕ(•, •))|] + H h=1 Eπ[|f (θ T V h+1 , ϕ(•, •)) -f ( θ h , ϕ(•, •))|] + 2HϵF ≤ H h=1 E π [|f (θ T V h+1 , ϕ(•, •)) -f ( θ h , ϕ(•, •))| 2 ] + H h=1 Eπ[|f (θ T V h+1 , ϕ(•, •)) -f ( θ h , ϕ(•, •))| 2 ] + 2HϵF ≤2 √ C eff H h=1 E µ,h [|f (θ T V h+1 , ϕ(•, •)) -f ( θ h , ϕ(•, •))| 2 ] + 2HϵF where the second inequality uses Cauchy inequality and the third one uses the definition of concentrability coefficient 2.2. 
Next, for VFQL, there is no pessimism therefore the quantity B in Lemma L.10 is zero, hence the covering number applied in Lemma G.1 is bounded by C d,log(K) ≤ O(d) and E µ [ℓ h ( θ h )]-E µ [ℓ h (θ T V h+1 )] ≤ 36H 2 (log(1/δ) + C d,log K ) + λC 2 Θ K + 16H 3 ϵ F (log(1/δ) + C d,log K ) K +4Hϵ F . Now leveraging equation 21 and equation 22 in Theorem G.2 to obtain E µ f ( θ h , ϕ(•, •)) -f (θ T V h+1 , ϕ(•, •)) 2 ≤E µ   ℓh( θ h ) - λ θ h 2 2 K    -E µ   ℓh(θ T V h+1 ) - λ θ T V h+1 2 2 K    + 2Hϵ F ≤ 36H 2 (log(H/δ) + C d,log K ) + 2λC 2 Θ K + b d,K,ϵ F + 2Hϵ F Plug the above into equation 24, we obtain with probability 1 -δ, for all policy π, v π -v π ≤ 2 C eff H 36H 2 (log(H/δ) + C d,log K ) + 2λC 2 Θ K + b d,K,ϵ F + 2Hϵ F + 2Hϵ F =2 C eff H 36H 2 (log(H/δ) + C d,log K ) + 2λC 2 Θ K + 16H 3 ϵ F (log(1/δ) + C d,log K ) K + 6Hϵ F + 2Hϵ F = C eff H • O H 2 d + λC 2 Θ K + 1 4 H 3 dϵ F K + O( C eff H 3 ϵ F + Hϵ F ) This finishes the proof of Theorem 3.1.
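Algorithm 2 admits a compact implementation; the following sketch instantiates it for the linear member f(θ, φ) = θᵀφ of the differentiable class (the feature map, synthetic transitions, and dimensions below are hypothetical illustrations of ours, not the paper's experiments):

```python
import numpy as np

def vfql(data, phi, d, num_actions, H, lam=1.0):
    """Vanilla fitted Q-learning (Algorithm 2) for the linear instance
    f(theta, phi(s,a)) = theta @ phi(s,a) of the differentiable class.
    data[h] is a list of (s, a, r, s_next) transitions for step h = 1..H."""
    theta = {}
    V_next = lambda s: 0.0                              # V_hat_{H+1} = 0
    for h in range(H, 0, -1):
        Phi = np.array([phi(s, a) for (s, a, r, s2) in data[h]])
        y = np.array([r + V_next(s2) for (s, a, r, s2) in data[h]])
        # Line 4: ridge regression onto the Bellman targets (no pessimism).
        theta[h] = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
        # Lines 5-6: clip Q_hat_h to [0, H-h+1], then take the greedy value.
        V_next = (lambda s, th=theta[h], h=h:
                  float(np.clip(max(th @ phi(s, a) for a in range(num_actions)),
                                0.0, H - h + 1)))
    def policy(s, h):
        return max(range(num_actions), key=lambda a: float(theta[h] @ phi(s, a)))
    return theta, policy

# Hypothetical usage with synthetic transitions.
rng = np.random.default_rng(0)
phi = lambda s, a: np.array([1.0, float(s), float(a), float(s * a)])
data = {h: [(int(rng.integers(3)), int(rng.integers(2)),
             float(rng.random()), int(rng.integers(3)))
            for _ in range(50)] for h in (1, 2)}
theta, policy = vfql(data, phi, d=4, num_actions=2, H=2)
assert policy(1, 2) in (0, 1)
```

For a general differentiable f, the closed-form solve in line 4 would be replaced by a (possibly nonconvex) minimization over Θ, which is exactly the regression oracle the analysis assumes.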

J PROOFS FOR VAFQL

In this section, we present the analysis for variance-aware fitted Q learning (VAFQL). Throughout the whole section, we assume ϵ F = 0, i.e. the exact Bellman-Completeness holds. The algorithm is presented in the following. Before giving the proofs of Theorem 3, we first prove some useful lemmas. Algorithm 3 Variance-Aware Fitted Q Learning (VAFQL) 1: Input: Split dataset D = s k h , a k h , r k h K,H k,h=1 D ′ = sk h , āk h , rk h K,H k,h=1 . Require β. 2: Initialization: Set VH+1(•) ← 0. Denote ϕ h,k := ϕ(s k h , a k h ), φh,k := ϕ(s k h , āk h ) 3: for h = H, H -1, . . . , 1 do 4: Set u h ← arg min θ∈Θ K k=1 f θ, φh,k -V h+1 (s k h+1 ) 2 + λ • ∥θ∥ 2 2 5: Set v h ← arg min θ∈Θ K k=1 f θ, φh,k -V 2 h+1 (s k h+1 ) 2 + λ • ∥θ∥ 2 2 6: Set Var h V h+1 (•, •) = f (v h , ϕ(•, •)) [0,(H-h+1) 2 ] -f (u h , ϕ(•, •)) [0,H-h+1] 2 7: Set σ h (•, •) 2 ← max{1, VarP h V h+1 (•, •)} 8: Set θ h ← arg min θ∈Θ K k=1 f (θ, ϕ h,k ) -r h,k -V h+1 (s k h+1 ) 2 / σ 2 h (s k h , a k h ) + λ • ∥θ∥ 2 2 9: Set Λ h ← K k=1 ∇f ( θ h , ϕ h,k )∇f ( θ h , ϕ h,k ) ⊤ / σ 2 (s k h , a k h ) + λ • I, 10: Set Γ h (•, •) ← β ∇ θ f ( θ h , ϕ(•, •)) ⊤ Λ -1 h ∇ θ f ( θ h , ϕ(•, •)) + O( 1 K ) 11: Set Qh (•, •) ← f ( θ h , ϕ(•, •)) -Γ h (•, •), Q h (•, •) ← min Qh (•, •), H -h + 1 + 12: Set π h (• | •) ← arg maxπ h Q h (•, •), π h (• | •) A , V h (•) ← maxπ h Q h (•, •), π h (• | •) A 13: end for 14: Output: { π h } H h=1 . J.1 PROVABLE EFFICIENCY FOR VARIANCE-AWARE FITTED Q LEARNING Recall the objective ℓ h (θ) := 1 K K k=1 f θ, ϕ(s k h , a k h ) -r(s k h , a k h ) -V h+1 (s k h+1 ) 2 / σ 2 h (s k h , a k h ) + λ K • θ 2 2 Then by definition, θ h := arg min θ∈Θ ℓ h (θ) and θ T V h+1 satisfies f (θ T V h+1 , ϕ) = P h V h+1 (s k h+1 ) (recall ϵ F = 0). Therefore, in this case, we have the following lemma: Lemma J.1. Fix h ∈ [H]. With probability 1 -δ, E µ [ℓ h ( θ h )] -E µ [ℓ h (θ T V h+1 )] ≤ 36H 2 (log(1/δ) + C d,log K ) + λC 2 Θ K where the expectation over µ is taken w.r.t. 
(s k h , a k h , s k h+1 ) k = 1, ..., K only (i.e., first compute E µ [ℓ h (θ)] for a fixed θ, then plug-in either θ h+1 or θ T V h+1 ). Here C d,log(K) := d log(1+24C Θ (H + 1)κ 1 K)+d log 1 + 288H 2 C Θ (κ 1 √ C Θ + 2 κ 1 κ 2 /λ) 2 K 2 + d 2 log 1 + 288H 2 √ dκ 2 1 K 2 /λ + d log(1 + 16C Θ H 2 κ 1 K) + d log(1 + 32C Θ H 3 κ 1 K). Proof of Lemma J.1. Step1: Consider the case where λ = 0. Indeed, fix h ∈ [H] and any function V (•) ∈ R S . Similarly, define f V (s, a) := f (θ TV , ϕ) = P h V . For any fixed θ ∈ Θ, denote g(s, a) = f (θ, ϕ(s, a)). Moreover, for any u, v ∈ Θ, define σ 2 u,v (•, •) := max{1, f (v, ϕ(•, •)) [0,(H-h+1) 2 ] -f (u, ϕ(•, •)) [0,H-h+1] 2 } Then define (we omit the subscript u, v of σ 2 u,v for the illustration purpose when there is no ambiguity) X(g, V, f V , σ 2 ) := (g(s, a) -r -V (s ′ )) 2 -(f V (s, a) -r -V (s ′ )) 2 σ 2 u,v (s, a) . Since all episodes are independent of each other, X k (g, V, f V ) := X(g(s k h , a k h ), V (s k h+1 ), f V (s k h , a k h ), σ 2 (s k h , a k h ) ) are independent r.v.s and it holds 1 K K k=1 X k (g, V, f V , σ 2 ) = ℓ(g) -ℓ(f V ). Next, the variance of X is bounded by Var[X(g, V, f V , σ 2 )] ≤ E µ [X(g, f, f V , σ 2 ) 2 ] =E µ (g(s h , a h ) -r h -V (s h+1 )) 2 -(f V (s h , a h ) -r h -V (s h+1 )) 2 2 /σ 2 (s h , a h ) 2 =E µ (g(s h , a h ) -f V (s h , a h )) 2 σ 2 (s h , a h ) • (g(s h , a h ) + f V (s h , a h ) -2r h -2V (s h+1 )) 2 σ 2 (s h , a h ) ≤4H 2 • E µ [ (g(s h , a h ) -f V (s h , a h )) 2 σ 2 (s h , a h ) ] =4H 2 • E µ (g(s h , a h ) -r h -V (s h+1 )) 2 -(f V (s h , a h ) -r h -V (s h+1 )) 2 σ 2 (s h , a h ) ( * ) =4H 2 • E µ [X(g, f, f V , σ 2 )] ( * ) follows from that E µ f ( θ h , ϕ(s h , a h )) -f (θ T V h+1 , ϕ(s h , a h )) σ 2 (s h , a h ) • E f θ T V h+1 , ϕ(s h , a h ) -r h -V h+1 (s h+1 ) s h , a h = 0. 
Therefore, by Bernstein inequality, with probability 1 -δ, E µ [X(g, f, f V , σ 2 )] - 1 K K k=1 X k (g, f, f V , σ 2 ) ≤ 2Var[X(g, f, f V , σ 2 )] log(1/δ) K + 4H 2 log(1/δ) 3K ≤ 8H 2 E µ [X(g, f, f V , σ 2 )] log(1/δ) K + 4H 2 log(1/δ) 3K . Now, if we choose g(s, a) := f ( θ h , ϕ(s, a)) and u = u h , v = v h from Algorithm 3, then θ h minimizes ℓ h (θ), therefore, it also minimizes 1 K K k=1 X i (θ, V h+1 , f V h+1 , σ 2 h ) and this implies 1 K K k=1 X k ( θ h , V h+1 , f V h+1 , σ 2 h ) ≤ 1 K K k=1 X k (θ T V h+1 , V h+1 , f V h+1 , σ 2 h ) = 0. Thus, we obtain E µ [X( θ h , V h+1 , f V h+1 , σ 2 h )] ≤ 8H 2 • E µ [X( θ h , V h+1 , f V h+1 , σ 2 h )] log(1/δ) K + 4H 2 log(1/δ) 3K . However, the above does not hold with probability 1 -δ since θ h , σ 2 h and V h+1 := min{max a f ( θ h+1 , ϕ(•, a))-∇f ( θ h+1 , ϕ(•, a)) ⊤ A • ∇f (θ, ϕ(•, a)), H} (where A is certain symmetric matrix with bounded norm) depend on θ h , θ h+1 which are data-dependent. Therefore, we need to further apply covering Lemma L.11 and choose ϵ = O(1/K) and a union bound to obtain with probability 1 -δ, Eµ[X( θ h , V h+1 , f V h+1 , σ 2 h )] ≤ 8H 2 • Eµ[X( θ h , V h+1 , f V h+1 , σ 2 h )](log(1/δ) + C d,log K ) K + 4H 2 (log(1/δ) + C d,log K ) 3K . where C d,log(K) := d log(1+24C Θ (H + 1)κ 1 K)+d log 1 + 288H 2 C Θ (κ 1 √ C Θ + 2 κ 1 κ 2 /λ) 2 K 2 + d 2 log 1 + 288H 2 √ dκ 2 1 K 2 /λ + d log(1 + 16C Θ H 2 κ 1 K) + d log(1 + 32C Θ H 3 κ 1 K) (where we let B = 1/λ since Λ -1 h 2 ≤ 1/λ). Solving this quadratic equation to obtain with probability 1 -δ, E µ [X( θ h , V h+1 , f V h+1 )] ≤ 36H 2 (log(1/δ) + C d,log K ) K . Now according to equation 25, by definition we finally have with probability 1 -δ (recall the expectation over µ is taken w.r.t. (s k h , a k h , s k h+1 ) k = 1, ..., K only) E µ [ℓ h ( θ h+1 )] -E µ [ℓ h (θ T V h+1 )] = E µ [X( θ h , V h+1 , f V h+1 )] ≤ 36H 2 (log(1/δ) + C d,log K ) K ( ) where we used f (θ T V h+1 , ϕ) = P h V h+1 = f V h+1 . Step2. 
If λ > 0, there is only extra term λ K θ h 2 -θ T V h+1 2 ≤ λ K θ h 2 ≤ λC 2 Θ K in addition to above. This finishes the proof. Theorem J.2 (Provable efficiency for VAFQL). Let C d,log K be the same as Lemma J.1. Then, with probability 1 -δ θ h -θ T V h+1 2 ≤ 36H 4 (log(H/δ) + C d,log K ) + 2λC 2 Θ κK , ∀h ∈ [H]. Proof of Theorem J.2. Apply a union bound in Lemma J.1, we have with probability 1 -δ, E µ [ℓ h ( θ h )] -E µ [ℓ h (θ T V h+1 )] ≤ 36H 2 (log(H/δ) + C d,log K ) + λC 2 Θ K , ∀h ∈ [H] ⇒E µ [ℓ h ( θ h ) - λ K θ h 2 2 ] -E µ [ℓ h (θ T V h+1 ) - λ K θ T V h+1 2 2 ] ≤ 36H 2 (log(H/δ) + C d,log K ) + 2λC 2 Θ K (27) Now we prove for all h ∈ [H], E µ f ( θ h , ϕ(•, •)) -f (θ T V h+1 , ϕ(•, •)) 2 = E µ   ℓh( θ h ) - λ θ h 2 2 K   -Eµ   ℓh(θ T V h+1 ) - λ θ T V h+1 2 2 K    . ( ) Indeed, identical to equation 26, Eµ   ℓh( θ h ) - λ θ h 2 2 K    -Eµ   ℓh(θ T V h+1 ) - λ θ T V h+1 2 2 K    = Eµ[X( θ h , V h+1 , f V h+1 )] =Eµ f θ h , ϕ(s h , a h ) -r h -V h+1 (s h+1 ) 2 / σ 2 h (s h , a h ) -f θ T V h+1 , ϕ(s h , a h ) -r h -V h+1 (s h+1 ) 2 / σ 2 h (s h , a h ) =Eµ f ( θ h , ϕ(•, •)) -f (θ T V h+1 , ϕ(•, •)) 2 / σ 2 h (•, •) +Eµ f ( θ h , ϕ(s h , a h )) -f (θ T V h+1 , ϕ(s h , a h )) • f θ T V h+1 , ϕ(s h , a h ) -r h -V h+1 (s h+1 ) / σ 2 h (s h , a h ) =Eµ f ( θ h , ϕ(•, •)) -f (θ T V h+1 , ϕ(•, •)) 2 / σ 2 h (•, •) +Eµ f ( θ h , ϕ(s h , a h )) -f (θ T V h+1 , ϕ(s h , a h )) • E f θ T V h+1 , ϕ(s h , a h ) -r h -V h+1 (s h+1 ) s h , a h / σ 2 h (s h , a h ) =Eµ f ( θ h , ϕ(•, •)) -f (θ T V h+1 , ϕ(•, •)) 2 / σ 2 h (•, •) where the third identity uses law of total expectation and that µ is taken w.r.t. s h , a h , s h+1 only (recall Lemma J.1) so the σ 2 h can be move outside of the conditional expectation. 12 The fourth identity uses the definition of θ T V h+1 since f (θ T V h+1 , ϕ(s, a)) = P h,s,a V h+1 . 
Then we have E µ f ( θ h , ϕ(•, •)) -f (θ T V h+1 , ϕ(•, •)) 2 / σ 2 h (•, •) ≥E µ f ( θ h , ϕ(•, •)) -f (θ T V h+1 , ϕ(•, •)) 2 /H 2 ≥ κ H 2 θ h -θ T V h+1 2 2 , where the third identity uses µ is over s h , a h only and the last one uses σ 2 h (•, •) ≤ H 2 . Combine the above with equation 27 and equation 28, we obtain the stated result. Theorem J.3 (Provable efficiency of VAFQL (Part II)). Let C d,log K be the same as Lemma J.1. Furthermore, suppose λ ≤ 1/2C 2 Θ and K ≥ max 512 κ 4 1 κ 2 log( 2d δ ) + d log(1 + 4κ 3 1 κ2CΘK 3 λ 2 ) , 4λ κ . Then, with probability 1 -δ, ∀h ∈ [H] sup s,a f ( θ h , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) ≤ κ1H 36H 4 (log(H/δ) + C d,log K ) + 2λC 2 Θ κ + 2dH 3 κ1 √ κ 1 K +O( 1 K ), Furthermore, we have with probability 1 -δ, sup h V h -V ⋆ h ∞ ≤ κ 1 H 36H 4 (log(H/δ) + C d,log K ) + 2λC 2 Θ κ + 2dH 3 κ 1 √ κ 1 K + O( 1 K ) = O κ 1 H 3 d 2 κ 1 K where O absorbs Polylog terms and higher order terms. Lastly, it also holds for all h ∈ [H], w.p. 1 -δ θ h -θ ⋆ h 2 ≤ κ 1 H 72H 4 (log(H 2 /δ) + C d,log K ) + 4λC 2 Θ κ + 4H 3 dκ 1 κ 1 K + O( 1 K ) = O κ 1 H 3 d κ 1 K Proof of Theorem J.3. Step1: we show the first result. We prove this by backward induction. When h = H + 1, by convention f ( θ h , ϕ(s, a)) = f (θ ⋆ h , ϕ(s, a)) = 0 so the base case holds. Suppose for h + 1, with probability 1 -(H -h)δ, sup s,a f ( θ h , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) ≤ C h+1 1 K , we next consider the case for t = h. 
On one hand, by Theorem J.2, we have with probability 1 -δ/2, sup s,a f ( θ h , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) ≤ sup s,a f ( θ h , ϕ(s, a)) -f (θ T V h+1 , ϕ(s, a)) + sup s,a f (θ T V h+1 , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) = sup s,a ∇f (ξ, ϕ(s, a)) ⊤ ( θ h -θ T V h+1 ) + sup s,a f (θ T V h+1 , ϕ(s, a)) -f (θ TV ⋆ h+1 , ϕ(s, a)) ≤κ 1 • θ h -θ T V h+1 2 + sup s,a P h,s,a V h+1 -P h,s,a V ⋆ h+1 ≤κ 1 36H 4 (log(H/δ) + C d,log K ) + 2λC 2 Θ κK + V h+1 -V ⋆ h+1 ∞ , Recall we have the form V h+1 (•) := min{max a f ( θ h+1 , ϕ(•, a)) -Γ h (•, a), H} and V ⋆ h+1 (•) = max a f (θ ⋆ h+1 , ϕ(•, a)) = min{max a f (θ ⋆ h+1 , ϕ(•, a)), H}, we obtain V h+1 -V ⋆ h+1 ∞ ≤ sup s,a f ( θ h+1 , ϕ(s, a)) -f (θ ⋆ h+1 , ϕ(s, a)) + sup h,s,a Γ h (s, a) Note the above holds true for any generic Γ h (s, a). In particular, according to Algorithm 3, we specify Γ h (•, •) = d ∇ θ f ( θ h , ϕ(•, •)) ⊤ Λ -1 h ∇ θ f ( θ h , ϕ(•, •)) + O( 1 K ) and by Lemma L.5, with probability 1-δ (note here Σ -1 h is replaced by Λ -1 h and Λ -1 h 2 ≤ H 2 /κ), Γ h ≤ 2dH 2 κ 1 √ κK + O( 1 K ) and by a union bound this implies with probability 1 - (H -h + 1)δ, sup s,a f ( θ h , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) ≤C h+1 1 K + κ 1 36H 4 (log(H/δ) + C d,log K ) + 2λC 2 Θ κK + 2dH 2 κ 1 √ κK + O( 1 K ) := C h 1 K . Solving for C h , we obtain C h ≤ κ 1 H 36H 4 (log(H/δ)+C d,log K )+2λC 2 Θ κ + H 2dH 2 κ1 √ κ for all H. By a union bound (replacing δ by δ/H), we obtain the stated result. Step2: Utilizing the intermediate result equation 29, we directly have with probability 1 -δ, sup h V h -V ⋆ h ∞ ≤ sup s,a f ( θ h , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) + 2dH 2 κ 1 √ κK + O( 1 K ), where sup s,a f ( θ h , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) can be bounded using Step1. 
Step3: Denote M := κ 1 H 36H 4 (log(H 2 /δ)+C d,log K )+2λC 2 Θ κ + 2H 3 dκ1 √ κ 1 K + O( 1 K ), then by Step1 we have with probability 1 -δ (here ξ is some point between θ h and θ ⋆ h ) for all h ∈ [H] M 2 ≥ sup s,a f ( θ h , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) 2 ≥ E µ [ f ( θ h , ϕ(s, a)) -f (θ ⋆ h , ϕ(s, a)) 2 ] ≥ κ θ h -θ ⋆ h 2 2 where the last step is by Assumption 2.3. Solving this yields the stated result.
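Line 8 of Algorithm 3 is a variance-weighted ridge regression; the sketch below instantiates it for the linear member f(θ, φ) = θᵀφ (the synthetic features, true parameter, and variance range are hypothetical), illustrating why down-weighting high-variance samples tightens the fit:

```python
import numpy as np

def variance_weighted_fit(Phi, y, sigma2, lam=1.0):
    """Line 8 of Algorithm 3 for the linear instance f(theta, phi) = theta @ phi:
    theta_hat = argmin sum_k (theta @ Phi_k - y_k)^2 / sigma2_k + lam*||theta||^2,
    a weighted ridge regression with weights w_k = 1/sigma2_k in (0, 1]."""
    w = 1.0 / np.asarray(sigma2)              # sigma2 = max{1, Var_hat} => w <= 1
    A = (Phi * w[:, None]).T @ Phi + lam * np.eye(Phi.shape[1])
    theta = np.linalg.solve(A, (Phi * w[:, None]).T @ y)
    Lambda = A                                # Lambda_h of line 9 (same matrix)
    return theta, Lambda

# Hypothetical targets y_k = r_k + V_hat_{h+1}(s'_k) with heteroscedastic noise
# whose per-sample variance matches the estimates sigma2_k in [1, H^2].
rng = np.random.default_rng(2)
K, d = 2000, 3
Phi = rng.normal(size=(K, d))
theta_true = np.array([1.0, -2.0, 0.5])
sigma2 = rng.uniform(1.0, 9.0, size=K)
y = Phi @ theta_true + rng.normal(scale=np.sqrt(sigma2))
theta_hat, _ = variance_weighted_fit(Phi, y, sigma2)
assert np.linalg.norm(theta_hat - theta_true) < 0.3
```

When the weights equal the inverse of the true conditional variances, this is generalized least squares, whose error covariance (Φᵀ W Φ + λI)⁻¹ is exactly the Λ_h⁻¹ appearing in the VAFQL bonus.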

J.2 BOUNDING | σ 2 h -σ ⋆2 h |

Recall the definition σ ⋆2 h (•, •) = max{1, [Var P h V ⋆ h+1 ](•, •)}. In this section, we bound the term | σ 2 h -σ ⋆2 h | := σ 2 h (•, •) -σ ⋆2 h (•, •) ∞ , where u h = arg min θ∈Θ 1 K K k=1 f θ, φh,k -V h+1 (s k h+1 ) 2 + λ K • θ 2 2 , v h = arg min θ∈Θ 1 K K k=1 f θ, φh,k -V 2 h+1 (s k h+1 ) 2 + λ K • θ 2 2 (30) and σ 2 h (•, •) := max{1, f (v h , ϕ(•, •)) [0,(H-h+1) 2 ] -f (u h , ϕ(•, •)) [0,H-h+1] 2 }, and the true parameters u ⋆ h , v ⋆ h satisfy f (u ⋆ h , ϕ(•, •)) = E P (s ′ |•,•) [V ⋆ h (s ′ )], f (v ⋆ h , ϕ) = E P (s ′ |•,•) [V ⋆2 h (s ′ )]. Furthermore, we define σ 2 V h+1 (•, •) := max{1, [Var P h V h+1 ](•, •)} and the parameter Expectation operator J : V ∈ R S → θ JV ∈ Θ such that f (θ JV , ϕ) = E P h [V (s ′ )], ∀ V 2 ≤ B F . Note θ JV ∈ Θ by Bellman completeness, since the reward r is constant and differentiability (Definition 1.1) is closed under addition. By definition, | σ 2 h -σ 2 V h+1 | ≤|f (v h , ϕ) -f (θ J V 2 h+1 , ϕ)| + |f (u h , ϕ) 2 -f (θ J V h+1 , ϕ) 2 | ≤|f (v h , ϕ) -f (θ J V 2 h+1 , ϕ)| + 2H • |f (u h , ϕ) -f (θ J V h+1 , ϕ)| and |σ ⋆2 h -σ 2 h | ≤|f (v ⋆ h , ϕ) -f (v h , ϕ)| + |f (u ⋆ h , ϕ) 2 -f (u h , ϕ) 2 | ≤|f (v ⋆ h , ϕ) -f (v h , ϕ)| + 2H • |f (u ⋆ h , ϕ) -f (u h , ϕ)| We first give the following result. Lemma J.4. Suppose λ ≤ 1/2C 2 Θ and K ≥ max 512 κ 4 1 κ 2 log( 2d δ ) + d log(1 + 4κ 3 1 κ2CΘK 3 λ 2 ) , 4λ κ . Then, with probability 1 -δ, ∀h ∈ [H], u h -θ J V h+1 2 ≤ 36H 2 (log(H/δ) + O(d 2 )) + 2λC 2 Θ κK , ∀h ∈ [H], v h -θ J V 2 h+1 2 ≤ 36H 4 (log(H/δ) + O(d 2 )) + 2λC 2 Θ κK , ∀h ∈ [H]. and sup s,a |f (u h , ϕ(s, a)) -f (u ⋆ h , ϕ(s, a))| ≤ κ1H 36H 2 (log(H 2 /δ) + O(d 2 )) + 2λC 2 Θ κ + 2H 2 dκ1 √ κ 1 K + O( 1 K ), sup s,a |f (v h , ϕ(s, a)) -f (v ⋆ h , ϕ(s, a))| ≤ κ1H 36H 4 (log(H 2 /δ) + O(d 2 )) + 2λC 2 Θ κ + 2H 3 dκ1 √ κ 1 K + O( 1 K ). 
The above directly implies for all h ∈ [H], with probability 1 -δ, |σ ⋆2 h -σ 2 h | ≤   3κ 1 H 2 36H 4 (log(H 2 /δ) + O(d 2 )) + 2λC 2 Θ κ + 6H 4 dκ 1 √ κ   1 K + O( 1 K ) | σ 2 h -σ 2 V h+1 | ≤3Hκ 1 36H 4 (log(H/δ) + O(d 2 )) + 2λC 2 Θ κK . Proof of Lemma J.4. In fact, the proof follows a reduction from the provable efficiency procedure conducted in Section G. This is due to the regression procedure in equation 30 is the same as the procedure equation 17 except the parameter Bellman operator T is replaced by the parameter Expectation operator J (recall here φh,k uses the independent copy D ′ and O(d 2 ) comes from the covering argument.). Concretely, the X(g, V, f V ) used in Lemma G.1 will be modified to X(g, V, f V ) = (g(s, a) -V (s ′ )) 2 -(f (θ JV , ϕ(s, a)) -V (s ′ )) 2 by removing reward information and the decomposition Eµ (g(s h , a h ) -V (s h+1 )) 2 -(f (θ JV , ϕ(s h , a h )) -V (s h+1 )) 2 = E µ (g(s h , a h ) -f (θ JV , ϕ(s h , a h ))) 2 holds true. Then with probability 1 -δ, |σ ⋆2 h -σ 2 h | ≤|f (v ⋆ h , ϕ) -f (v h , ϕ)| + 2H • |f (u ⋆ h , ϕ) -f (v h , ϕ)| ≤   3κ 1 H 2 36H 4 (log(H 2 /δ) + O(d 2 )) + 2λC 2 Θ κ + 6H 4 dκ 1 √ κ   1 K + O( 1 K ). and | σ 2 h -σ 2 V h+1 | ≤|f (v h , ϕ) -f (θ J V 2 h+1 , ϕ)| + 2H • |f (u h , ϕ) -f (θ J V h+1 , ϕ)| ≤κ 1 v h -θ J V 2 h+1 2 + 2Hκ 1 u h -θ J V h+1 2 ≤3Hκ 1 36H 4 (log(H/δ) + O(d 2 )) + 2λC 2 Θ κK . J.3 PROOF OF THEOREM 4.1 In this section, we sketch the proof of Theorem 4.1 since the most components are identical to Theorem 3.2. We will focus on highlighting the difference for obtaining the tighter bound. First of all, Recall in the first-order condition, we have ∇ θ      K k=1 f (θ, ϕ h,k ) -r h,k -V h+1 s k h+1 2 σ 2 h (s k h , a k h ) + λ • θ 2 2      θ= θ h = 0, ∀h ∈ [H]. 
Therefore, if we define the quantity Z h (•, •) ∈ R d as Z h (θ|V, σ 2 ) = K k=1 f (θ, ϕ h,k ) -r h,k -V s k h+1 σ(s k h , a k h ) ∇f (θ, ϕ h,k ) σ(s k h , a k h ) + λ • θ, ∀θ ∈ Θ, V 2 ≤ H, then we have Z h ( θ h | V h+1 , σ 2 h ) = 0. According to the regression oracle (Line 8 of Algorithm 3), the estimated Bellman operator P h maps V h+1 to θ h , i.e. P h V h+1 = f ( θ h , ϕ). Therefore (recall Definition D.1) P h V h+1 (s, a) -P h V h+1 (s, a) = P h V h+1 (s, a) -f ( θ h , ϕ(s, a)) =f (θ T V h+1 , ϕ(s, a)) -f ( θ h , ϕ(s, a)) =∇f ( θ h , ϕ(s, a)) θ T V h+1 -θ h + Hot h,1 , where we apply the first-order Taylor expansion of the differentiable function f at the point θ h and Hot h,1 is a higher-order term. Indeed, the following Lemma J.5 bounds the Hot h,1 term by O( 1 K ). Lemma J.5. Recall the definition (from the above decomposition) Hot h,1 : = f (θ T V h+1 , ϕ(s, a)) - f ( θ h , ϕ(s, a)) -∇f ( θ h , ϕ(s, a)) θ T V h+1 -θ h , then with probability 1 -δ, |Hot h,1 | ≤ O( 1 K ), ∀h ∈ [H]. Proof. The proof is identical to that of Lemma E.1 but with the help of Theorem J.2. 
Next, according to the expansion of Z h (θ| V h+1 , σ 2 h ), we have ∇f ( θ h , ϕ(s, a)) θ T V h+1 -θ h = I 1 + I 2 + I 3 + Hot 2 , ( ) where Hot 2 :=∇f ( θ h , ϕ(s, a))Λ -1 h R K (θ T V h+1 ) + λθ T V h+1 ∆ Λ s h = K k=1 f ( θ h , ϕ h,k ) -r h,k -V h+1 (s k h+1 ) • ∇ 2 θθ f ( θ h , ϕ h,k ) σ 2 (s k h , a k h ) Λ h = K k=1 ∇ θ f ( θ h , ϕ h,k )∇ ⊤ θ f ( θ h,k , ϕ h,k ) σ 2 (s k h , a k h ) + λI d R K (θ T V h+1 ) =∆ Λ s h ( θ h -θ T V h+1 ) + R K (θ T V h+1 ) where R K (θ T V h+1 ) is the second order residual that is bounded by O(1/K) and I1 =∇f ( θ h , ϕ(s, a))Λ -1 h K k=1 f (θ TV ⋆ h+1 , ϕ h,k ) -r h,k -V ⋆ h+1 (s k h+1 ) • ∇ ⊤ θ f ( θ h , ϕ h,k ) σ 2 h (s k h , a k h ) I2 =∇f ( θ h , ϕ(s, a))Λ -1 h K k=1 f (θ T V h+1 , ϕ h,k ) -f (θ TV ⋆ h+1 , ϕ h,k ) -V h+1 (s k h+1 ) + V ⋆ h+1 (s k h+1 ) • ∇ ⊤ θ f ( θ h , ϕ h,k ) σ 2 h (s k h , a k h ) I3 =∇f ( θ h , ϕ(s, a))Λ -1 h K k=1 f (θ T V h+1 , ϕ h,k ) -r h,k -V h+1 (s k h+1 ) • ∇ ⊤ θ f (θ T V h+1 , ϕ h,k ) -∇ ⊤ θ f ( θ h , ϕ h,k ) σ 2 h (s k h , a k h ) Similar to the PFQL case, I 2 , I 3 , Hot 2 can be bounded to have order O(1/K) via provably efficiency theorems in Section J.1 and in particular, the inclusion of σ 2 u,v will not cause additional order in d.foot_12 Now we prove the result for the dominate term I 1 . Lemma J.6. With probability 1 -δ, |I 1 | ≤ 4Hd ∇f ( θ h , ϕ(s, a)) Σ -1 h • C δ,log K + O( κ 1 √ κK ), where C δ,log K only contains Polylog terms. Proof of Lemma J.6. First of all, by CauchySchwarz inequality, we have |I 1 | ≤ ∇f ( θ h , ϕ(s, a)) Λ -1 h • K k=1 f (θ TV ⋆ h+1 , ϕ h,k ) -r h,k -V ⋆ h+1 (s k h+1 ) • ∇ ⊤ θ f ( θ h , ϕ h,k ) σ 2 h (s k h , a k h ) Λ -1 h . (33) Recall that σ 2 u,v (•, •) := max{1, f (v, ϕ(•, •)) [0,(H-h+1) 2 ] -f (u, ϕ(•, •)) [0,H-h+1] 2 }. Step1. Let the fixed θ ∈ Θ be arbitrary and fixed u, v such that σ 2 u,v (•, •) ≥ 1 2 σ 2 u ⋆ h ,v ⋆ h (•, •) = 1 2 σ ⋆2 h (•, •) and define x k (θ, u, v) = ∇ θ f (θ, ϕ h,k )/σ u,v (s k h , a k h ). 
Next, define $G_{u,v}(\theta) = \sum_{k=1}^K \nabla f(\theta, \phi(s^k_h, a^k_h))\,\nabla f(\theta, \phi(s^k_h, a^k_h))^\top / \sigma^2_{u,v}(s^k_h, a^k_h) + \lambda I_d$; then $\|x_k\|_2 \le \kappa_1$. Also denote $\eta_k := \big[f(\theta_{\mathcal{T}V^\star_{h+1}}, \phi_{h,k}) - r_{h,k} - V^\star_{h+1}(s^k_{h+1})\big]/\sigma_{u,v}(s^k_h, a^k_h)$; then $\mathbb{E}[\eta_k \mid s^k_h, a^k_h] = 0$ and
$$\mathrm{Var}[\eta_k \mid s^k_h, a^k_h] = \frac{\mathrm{Var}\big[f(\theta_{\mathcal{T}V^\star_{h+1}}, \phi_{h,k}) - r_{h,k} - V^\star_{h+1}(s^k_{h+1}) \mid s^k_h, a^k_h\big]}{\sigma^2_{u,v}(s^k_h, a^k_h)} \le \frac{2\,\mathrm{Var}\big[f(\theta_{\mathcal{T}V^\star_{h+1}}, \phi_{h,k}) - r_{h,k} - V^\star_{h+1}(s^k_{h+1}) \mid s^k_h, a^k_h\big]}{\sigma^{\star 2}_h(s^k_h, a^k_h)} = \frac{2\,[\mathrm{Var}_{P_h} V^\star_{h+1}](s^k_h, a^k_h)}{\sigma^{\star 2}_h(s^k_h, a^k_h)} \le 2.$$
Then by the self-normalized Bernstein inequality (Lemma L.4), with probability $1-\delta$,
$$\Big\|\sum_{k=1}^K x_k(\theta, u, v)\,\eta_k\Big\|_{G_{u,v}(\theta)^{-1}} \le 16\sqrt{d\,\log\Big(1 + \frac{K\kappa_1^2}{\lambda d}\Big)\cdot\log\frac{4K^2}{\delta}} + 4\zeta\log\frac{4K^2}{\delta} \le O(\sqrt d),$$
where $|\eta_k| \le \zeta$ with $\zeta = 2\max_{s,a,s'} \frac{|f(\theta_{\mathcal{T}V^\star_{h+1}}, \phi(s,a)) - r - V^\star_{h+1}(s')|}{\sigma^\star_h(s,a)}$, and the last inequality uses $\sqrt d \ge O(\zeta)$.

Step 2. Define $h(\theta, u, v) := \sum_{k=1}^K x_k(\theta, u, v)\,\eta_k(u, v)$ and $H(\theta, u, v) := \|h(\theta, u, v)\|_{G_{u,v}(\theta)^{-1}}$. Then
$$\|h(\theta_1, u_1, v_1) - h(\theta_2, u_2, v_2)\|_2 \le K\max_k \big\|(x_k\eta_k)(\theta_1, u_1, v_1) - (x_k\eta_k)(\theta_2, u_2, v_2)\big\|_2 \le K\max_k\Big[\frac{H\,\|\nabla f(\theta_1, \phi_{h,k}) - \nabla f(\theta_2, \phi_{h,k})\|_2}{\sigma^2_{u_1,v_1}(s^k_h, a^k_h)} + H\kappa_1\,\frac{\big|\sigma^2_{u_1,v_1}(s^k_h, a^k_h) - \sigma^2_{u_2,v_2}(s^k_h, a^k_h)\big|}{\sigma^2_{u_1,v_1}(s^k_h, a^k_h)\,\sigma^2_{u_2,v_2}(s^k_h, a^k_h)}\Big] \le KH\kappa_1\|\theta_1 - \theta_2\|_2 + KH\kappa_1\big\|\sigma^2_{u_1,v_1} - \sigma^2_{u_2,v_2}\big\|_2.$$
Furthermore,
$$\big\|G_{u_1,v_1}(\theta_1)^{-1} - G_{u_2,v_2}(\theta_2)^{-1}\big\|_2 \le \big\|G_{u_1,v_1}(\theta_1)^{-1}\big\|_2\,\big\|G_{u_1,v_1}(\theta_1) - G_{u_2,v_2}(\theta_2)\big\|_2\,\big\|G_{u_2,v_2}(\theta_2)^{-1}\big\|_2 \le \frac{1}{\lambda^2}\,K\sup_k\Big\|\frac{\nabla f(\theta_1, \phi_{h,k})\,\nabla f(\theta_1, \phi_{h,k})^\top}{\sigma^2_{u_1,v_1}(s^k_h, a^k_h)} - \frac{\nabla f(\theta_2, \phi_{h,k})\,\nabla f(\theta_2, \phi_{h,k})^\top}{\sigma^2_{u_2,v_2}(s^k_h, a^k_h)}\Big\|_2 \le \frac{1}{\lambda^2}\Big(K\kappa_2\kappa_1\|\theta_1 - \theta_2\|_2 + K\kappa_1^2\big\|\sigma^2_{u_1,v_1} - \sigma^2_{u_2,v_2}\big\|_2\Big).$$
All the above imply that $H(\theta, u, v)$ is Lipschitz in $(\theta, u, v)$.

Step 3. First note, by the definitions in Step 2 (where we absorb all the Polylog terms),
Combining the above with equation 33, we obtain with probability $1-\delta$,
$$|H(\theta_1, u_1, v_1) - H(\theta_2, u_2, v_2)| \le \big|h(\theta_1, u_1, v_1)^\top G_{u_1,v_1}(\theta_1)^{-1} h(\theta_1, u_1, v_1) - h(\theta_2, u_2, v_2)^\top G_{u_2,v_2}(\theta_2)^{-1} h(\theta_2, u_2, v_2)\big|$$
$$\le \|h(\theta_1, u_1, v_1) - h(\theta_2, u_2, v_2)\|_2 \cdot \frac{1}{\lambda}\cdot KH\kappa_1 + KH\kappa_1\cdot\big\|G_{u_1,v_1}(\theta_1)^{-1} - G_{u_2,v_2}(\theta_2)^{-1}\big\|_2\cdot KH\kappa_1 + \Big(KH\kappa_1\cdot\frac{1}{\lambda}\Big)\cdot\|h(\theta_1, u_1, v_1) - h(\theta_2, u_2, v_2)\|_2$$
$$\le 2KH\kappa_1\big(\|\theta_1 - \theta_2\|_2 + \|\sigma^2_{u_1,v_1} - \sigma^2_{u_2,v_2}\|_2\big)\cdot\frac{1}{\lambda}\cdot KH\kappa_1 + K^2H^2\kappa_1^2\cdot\frac{K\kappa_1}{\lambda^2}\big(\kappa_2\|\theta_1 - \theta_2\|_2 + \kappa_1\|\sigma^2_{u_1,v_1} - \sigma^2_{u_2,v_2}\|_2\big)$$
$$\le \big(4K^2H^2\kappa_1^2/\lambda + K^3H^2\kappa_1^3\kappa_2/\lambda^2\big)\|\theta_1 - \theta_2\|_2 + \big(4K^2H^2\kappa_1^2/\lambda + K^3H^2\kappa_1^4/\lambda^2\big)\|\sigma^2_{u_1,v_1} - \sigma^2_{u_2,v_2}\|_2;$$
note that $|\sigma^2_{u_1,v_1}(s,a) - \sigma^2_{u_2,v_2}(s,a)|$ is Lipschitz in $(u, v)$, and that
$$\Big\|\sum_{k=1}^K \frac{\big[f(\theta_{\mathcal{T}V^\star_{h+1}}, \phi_{h,k}) - r_{h,k} - V^\star_{h+1}(s^k_{h+1})\big]\cdot\nabla^\top_\theta f(\widehat\theta_h, \phi_{h,k})}{\widehat\sigma^2_h(s^k_h, a^k_h)}\Big\|_{\bar\Lambda_h^{-1}} = H(\widehat\theta_h, \widehat u_h, \widehat v_h).$$
Now choosing $\epsilon = O(1/K)$ in
$$|I_1| \le \big\|\nabla f(\widehat\theta_h, \phi(s,a))\big\|_{\bar\Lambda_h^{-1}}\cdot H(\widehat\theta_h, \widehat u_h, \widehat v_h) \le \big\|\nabla f(\widehat\theta_h, \phi(s,a))\big\|_{\bar\Lambda_h^{-1}}\cdot O(d) + O\Big(\frac1K\Big) \le O\big(d\,\|\nabla f(\widehat\theta_h, \phi(s,a))\|_{\bar\Lambda_h^{-1}}\big) + O\Big(\frac{\kappa_1}{\sqrt\kappa\,K}\Big).$$
Combining the dominating term $I_1$ (via Lemma J.6) with all the other higher-order terms, we obtain the first result together with Lemma D.3. The proof of the second result is also very similar to the proofs in Section F.2. Concretely, when picking $\pi = \pi^\star$, we can convert the quantity $\nabla^\top_\theta f(\widehat\theta_h, \phi(s_h, a_h))\,\Lambda_h^{-1}\,\nabla_\theta f(\widehat\theta_h, \phi(s_h, a_h))$ to $\nabla^\top_\theta f(\theta^\star_h, \phi(s_h, a_h))\,\Lambda_h^{-1}\,\nabla_\theta f(\theta^\star_h, \phi(s_h, a_h))$ using Theorem J.3, and convert $\nabla^\top_\theta f(\theta^\star_h, \phi(s_h, a_h))\,\Lambda_h^{-1}\,\nabla_\theta f(\theta^\star_h, \phi(s_h, a_h))$ to $\nabla^\top_\theta f(\theta^\star_h, \phi(s_h, a_h))\,\Lambda_h^{\star-1}\,\nabla_\theta f(\theta^\star_h, \phi(s_h, a_h))$.
$$\mathbb{E}_M\big[v^\star - v^{\widehat\pi}\big] \ge c\sqrt d\cdot\sum_{h=1}^H \mathbb{E}_{\pi^\star}\Big[\nabla^\top_\theta f(\theta^\star_h, \phi(\cdot,\cdot))\,(\Lambda^{\star,p}_h)^{-1}\,\nabla_\theta f(\theta^\star_h, \phi(\cdot,\cdot))\Big],$$
where $\Lambda^{\star,p}_h = \mathbb{E}\Big[\sum_{k=1}^K \frac{\nabla_\theta f(\theta^\star_h, \phi(s^k_h, a^k_h))\,\nabla_\theta f(\theta^\star_h, \phi(s^k_h, a^k_h))^\top}{\mathrm{Var}_h(V^\star_{h+1})(s^k_h, a^k_h)}\Big]$. Then, by Lemma H.5 of Min et al. (2021), as long as
$$K \ge \max\Big\{512\,\kappa_1^4\,\|G^{-1}\|^2\log\Big(\frac{2d}{\delta}\Big),\ 4\lambda\,\|G^{-1}\|^2\Big\}, \qquad (35)$$
then with probability $1-\delta$, for all $u \in \mathbb{R}^d$ simultaneously, $\|u\|_{\bar G^{-1}} \le \frac{2}{\sqrt K}\,\|u\|_{G^{-1}}$.
As a corollary, if we constrain $u$ to the subspace $\|u\|_2 \le B$, then we have: with probability $1-\delta$, for all $\{u \in \mathbb{R}^d : \|u\|_2 \le B\}$ simultaneously,
$$\|u\|_{\bar G^{-1}} \le \frac{2}{\sqrt K}\,\|u\|_{G^{-1}} = \frac{2}{\sqrt K}\sqrt{u^\top G^{-1} u} \le \frac{2B\sqrt{\|G^{-1}\|_2}}{\sqrt K}. \qquad (36)$$
Next, for any $\theta$, define
$$h_u(\theta) := \|u\|_{\bar G(\theta)^{-1}} = \sqrt{u^\top \bar G(\theta)^{-1} u}, \qquad \bar G(\theta) = \sum_{k=1}^K \nabla f(\theta, \phi_{h,k})\,\nabla f(\theta, \phi_{h,k})^\top + \lambda I_d.$$
Then
$$\big|u^\top\big(\bar G(\theta_1)^{-1} - \bar G(\theta_2)^{-1}\big)u\big| \le B^2\cdot\big\|\bar G(\theta_1)^{-1} - \bar G(\theta_2)^{-1}\big\|_2 \le B^2\cdot\big\|\bar G(\theta_1)^{-1}\big\|_2\,\big\|\bar G(\theta_1) - \bar G(\theta_2)\big\|_2\,\big\|\bar G(\theta_2)^{-1}\big\|_2 \le B^2\cdot\frac{1}{\lambda}\cdot 2K\kappa_2\kappa_1\|\theta_1 - \theta_2\|_2\cdot\frac{1}{\lambda} = \frac{2B^2K\kappa_1\kappa_2\|\theta_1 - \theta_2\|_2}{\lambda^2}.$$
Therefore, an $\epsilon$-covering net of $\{h_u(\theta) : \theta \in \Theta\}$ is implied by a $\frac{\lambda^2\epsilon^2}{2KB^2\kappa_1\kappa_2}$-covering net of $\{\theta : \theta \in \Theta\}$, so by Lemma L.8 the covering number $N_\epsilon$ satisfies
$$\log N_\epsilon \le d\log\Big(1 + \frac{4B^2K\kappa_1\kappa_2 C_\Theta}{\lambda^2\epsilon^2}\Big).$$
Select $\theta = \widehat\theta_h$. Choose $\epsilon = O(1/K)$ and take a union bound over equation 36 to get: with probability $1-\delta$, for all $\|u\|_2 \le B$ simultaneously (note by Assumption 2.3, $\|G^{-1}\|_2 \le 1/\kappa$),
$$\|u\|_{\Sigma_h^{-1}} \le \frac{2B}{\sqrt{\kappa K}} + O\Big(\frac1K\Big)$$
if (taking a union bound over the condition in equation 35)
$$K \ge \max\Big\{512\,\frac{\kappa_1^4}{\kappa^2}\Big[\log\Big(\frac{2d}{\delta}\Big) + d\log\Big(1 + \frac{4\kappa_1 B^2\kappa_2 C_\Theta K^3}{\lambda^2}\Big)\Big],\ \frac{4\lambda}{\kappa}\Big\},$$
where this condition is satisfied by the Lemma statement.

Proof of Lemma L.6. See Lemma H.5 of Yin et al. (2022) or Lemma H.4 of Min et al. (2021) for details.

Lemma L.7 (Lemma H.4 in Yin et al. (2022)). Let $\Lambda_1, \Lambda_2 \in \mathbb{R}^{d\times d}$ be two positive semi-definite matrices. Then
$$\|\Lambda_1^{-1}\| \le \|\Lambda_2^{-1}\| + \|\Lambda_1^{-1}\|\cdot\|\Lambda_2^{-1}\|\cdot\|\Lambda_1 - \Lambda_2\|$$
and
$$\|\phi\|_{\Lambda_1^{-1}} \le \big[1 + \|\Lambda_2^{-1}\|\,\|\Lambda_2\|\cdot\|\Lambda_1^{-1}\|\cdot\|\Lambda_1 - \Lambda_2\|\big]\cdot\|\phi\|_{\Lambda_2^{-1}} \quad \text{for all } \phi \in \mathbb{R}^d.$$
Here the parameter spaces are $\{\theta : \|\theta\|_2 \le C_\Theta\}$ and $\{A : \|A\|_2 \le B\}$. Let $N^V_\epsilon$ be the covering number of the $\epsilon$-net with respect to the $\ell_\infty$ distance; then we have
$$\log N^V_\epsilon \le d\log\Big(1 + \frac{8C_\Theta(\kappa_1\sqrt{C_\Theta} + 2\sqrt B\kappa_1\kappa_2)^2}{\epsilon^2}\Big) + d^2\log\Big(1 + \frac{8\sqrt d B\kappa_1^2}{\epsilon^2}\Big).$$

Proof of Lemma L.9.
$$\le \kappa_1\cdot\|\theta_1 - \theta_2\|_2 + 2\kappa_2\cdot\|\theta_1 - \theta_2\|_2\cdot B\cdot\kappa_1 + \kappa_1^2\,\|A_1 - A_2\|_2 \le \big(\kappa_1\sqrt{C_\Theta} + 2\sqrt B\kappa_1\kappa_2\big)\|\theta_1 - \theta_2\|_2 + \kappa_1\|A_1 - A_2\|_2 \le \big(\kappa_1\sqrt{C_\Theta} + 2\sqrt B\kappa_1\kappa_2\big)\|\theta_1 - \theta_2\|_2 + \kappa_1\|A_1 - A_2\|_F.$$
Here $\|\cdot\|_F$ is the Frobenius norm.
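Lemma L.7's two perturbation inequalities follow from the resolvent identity $\Lambda_1^{-1} - \Lambda_2^{-1} = \Lambda_1^{-1}(\Lambda_2 - \Lambda_1)\Lambda_2^{-1}$ together with $\|\phi\|_2^2 \le \|\Lambda_2\|\,\phi^\top\Lambda_2^{-1}\phi$. A quick numerical sanity check on random positive definite matrices (illustrative only, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d)); L1 = A @ A.T + np.eye(d)   # positive definite
B = rng.normal(size=(d, d)); L2 = B @ B.T + np.eye(d)
op = lambda M: np.linalg.norm(M, 2)                     # spectral norm
L1i, L2i = np.linalg.inv(L1), np.linalg.inv(L2)

# First inequality: ||L1^{-1}|| <= ||L2^{-1}|| + ||L1^{-1}|| ||L2^{-1}|| ||L1 - L2||.
assert op(L1i) <= op(L2i) + op(L1i) * op(L2i) * op(L1 - L2) + 1e-10

# Second inequality: ||phi||_{L1^{-1}} <= [1 + ||L2^{-1}|| ||L2|| ||L1^{-1}|| ||L1 - L2||] ||phi||_{L2^{-1}}.
phi = rng.normal(size=d)
lhs = np.sqrt(phi @ L1i @ phi)
rhs = (1 + op(L2i) * op(L2) * op(L1i) * op(L1 - L2)) * np.sqrt(phi @ L2i @ phi)
assert lhs <= rhs + 1e-10
```

Both assertions hold deterministically for any pair of positive definite matrices, which is exactly what the lemma asserts.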
Let $C_\theta$ and $C_A$ be the covering nets of the two parameter spaces, respectively; then by Lemma L.8,
$$|C_\theta| \le \Big(1 + \frac{8C_\Theta(\kappa_1\sqrt{C_\Theta} + 2\sqrt B\kappa_1\kappa_2)^2}{\epsilon^2}\Big)^d, \qquad |C_A| \le \Big(1 + \frac{8\sqrt d B\kappa_1^2}{\epsilon^2}\Big)^{d^2}.$$
Therefore, the covering number of the space $\mathcal V$ satisfies
$$\log N^V_\epsilon \le \log\big(|C_\theta|\cdot|C_A|\big) \le d\log\Big(1 + \frac{8C_\Theta(\kappa_1\sqrt{C_\Theta} + 2\sqrt B\kappa_1\kappa_2)^2}{\epsilon^2}\Big) + d^2\log\Big(1 + \frac{8\sqrt d B\kappa_1^2}{\epsilon^2}\Big).$$

Lemma L.10 (Covering of $\mathbb{E}_\mu(X(g, V, f))$). Define
$$X(\theta, \theta') := \big(f(\theta, \phi(s,a)) - r - V_{\theta'}(s')\big)^2 - \big(f_{V_{\theta'}}(s,a) - r - V_{\theta'}(s')\big)^2,$$
where $f_V := \mathcal P_h V + \delta_V$ and $V(s)$ has the form $V_\theta(s)$ that belongs to $\mathcal V$ (as defined in Lemma L.9). Here $X(\theta, \theta')$ is a function of $s, a, r, s'$ as well; we suppress this dependence for conciseness only. Then the function class $\mathcal H = \{h(\theta, \theta') := \mathbb{E}_\mu[X(\theta, \theta')] \mid \|\theta\|_2 \le C_\Theta,\ V_{\theta'} \in \mathcal V\}$ has the covering number of the $(\epsilon + 4H\epsilon_{\mathcal F})$-net bounded by
$$d\log\Big(1 + \frac{24C_\Theta(H+1)\kappa_1}{\epsilon}\Big) + d\log\Big(1 + \frac{288H^2C_\Theta(\kappa_1\sqrt{C_\Theta} + 2\sqrt B\kappa_1\kappa_2)^2}{\epsilon^2}\Big) + d^2\log\Big(1 + \frac{288H^2\sqrt d B\kappa_1^2}{\epsilon^2}\Big).$$

Proof of Lemma L.10. First of all,
$$X(\theta, \theta') = f(\theta, \phi(s,a))^2 - f_{V_{\theta'}}(s,a)^2 - 2f(\theta, \phi(s,a))\cdot(r + V_{\theta'}(s')) + 2f_{V_{\theta'}}(s,a)\cdot(r + V_{\theta'}(s')),$$



Footnotes:
1. See Arulkumaran et al. (2017) and the references therein for an overview.
2. Here $\|\nabla_{\theta\theta\theta} f(\theta, \phi(s,a))\|_2$ is defined as the 2-norm of the 3-d tensor, and in the finite-horizon setting we simply instantiate $B_{\mathcal F} = H$.
3. Generally speaking, Assumptions 2.2 and 2.3 are not directly comparable. However, for the specific function class $f = \langle\theta, \phi\rangle$ with $\phi = \mathbb{1}(s,a)$ and tabular MDPs, it is easy to check that 2.3 is stronger than 2.2.
4. We mention that Xie et al. (2021a) has a nice practical version, PSPI, but the convergence is slower (the rate $O(n^{-1/3})$).
5. Here we assume the model capacity is sufficient to make the presentation concise. If $\epsilon_{\mathcal F} > 0$, the complexity bound will include the term $\epsilon_{\mathcal F}$. We include more discussion in Appendix H.
6. Here $n$ is the number of samples used in the infinite-horizon discounted setting and is similar to $K$ in the episodic setting.
7. I.e., expanding over $Z^p_h(\theta) := \mathbb{E}_{s,a,s'}\big[(f(\theta, \phi(s,a)) - r - V^\pi_{h+1}(s'))\,\nabla f(\theta, \phi(s,a))\big]$, and the corresponding $\Delta\Sigma^s_h$ in $\frac{\partial}{\partial\theta} Z_h(\theta)\big|_{\theta=\theta^\pi_h}$ is zero by the Bellman equation.
8. We mention that Zhang et al. (2021b) uses variance-aware confidence sets in a slightly different way.
9. Here, without loss of generality, we assume $Q^\star_h$ can be uniquely identified, i.e. there is a unique $\theta^\star_h$ such that $f(\theta^\star_h, \phi) = Q^\star_h$.
10. We abuse the notation here to use either $X(g, V, f_V)$ or $X(\theta, V, f_V)$; they mean the same quantity.
11. Recall $\widehat\sigma^2_h$ computed in Algorithm 3 uses an independent copy $\mathcal D'$.
12. Note that in Lemma L.11 we only have additive terms of the same order as in Lemma L.10.



where the second inequality uses $\theta_{\mathcal{T}V^\star_{h+1}} = \theta^\star_h$ and the third inequality uses Theorem G.2 and Theorem G.3. The last equality is due to $C_{d,\log K} \le O(d^2)$ (recall Lemma G.1). Now, choosing $\epsilon = O(1/K)$ in Step 2 and taking a union bound over both equation 9 and the covering number in Step 2, we obtain with probability $1-\delta$,

$$\big|\sigma^2_{u_1,v_1}(s,a) - \sigma^2_{u_2,v_2}(s,a)\big| \le \big|f(v_1, \phi(s,a)) - f(v_2, \phi(s,a))\big| + 2H\,\big|f(u_1, \phi(s,a)) - f(u_2, \phi(s,a))\big| \le \kappa_1\|v_1 - v_2\|_2 + 2H\kappa_1\|u_1 - u_2\|_2.$$
Then an $\epsilon$-covering net of $\{H(\theta, u, v)\}$ can be constructed from the union of the covering nets for $\theta$, $u$, $v$, and by Lemma L.8 the covering number $N_\epsilon$ satisfies (where $O$ absorbs Polylog terms) $\log N_\epsilon \le O(d)$

(The last conversion uses Lemma J.4.)

K  THE LOWER BOUND

Theorem K.1 (Restatement of Theorem 4.2). Specify the model to have the linear representation $f = \langle\theta, \phi\rangle$. There exists a pair of universal constants $c, c' > 0$ such that, given dimension $d$, horizon $H$ and sample size $K > c'd^3$, one can always find a family of MDP instances such that for any algorithm $\widehat\pi$

Let $\phi : \mathcal S\times\mathcal A \to \mathbb{R}^d$ satisfy $\|\phi(s,a)\| \le C$ for all $(s,a) \in \mathcal S\times\mathcal A$. For any $K > 0$, $\lambda > 0$, define $\bar G_K = \sum_{k=1}^K \phi(s_k, a_k)\,\phi(s_k, a_k)^\top + \lambda I_d$, where the $(s_k, a_k)$'s are i.i.d. samples from some distribution $\nu$. Then with probability $1-\delta$,

COVERING ARGUMENTS

Lemma L.8 (Covering Number of the Euclidean Ball). For any $\epsilon > 0$, the $\epsilon$-covering number of the Euclidean ball in $\mathbb{R}^d$ with radius $R > 0$ is upper bounded by $(1 + 2R/\epsilon)^d$.

Lemma L.9. Define $\mathcal V$ to be the class of mappings from $\mathcal S$ to $\mathbb{R}$ with the parametric form
$$V(\cdot) := \min\Big\{\max_a f(\theta, \phi(\cdot, a)) - \nabla f(\theta, \phi(\cdot, a))^\top A\,\nabla f(\theta, \phi(\cdot, a)),\ H\Big\}$$
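Lemma L.8 is the standard volume argument: a maximal $\epsilon$-separated subset of the radius-$R$ ball is automatically an $\epsilon$-net, and the disjoint $\epsilon/2$-balls around its points fit inside a ball of radius $R + \epsilon/2$, giving the $(1 + 2R/\epsilon)^d$ bound. An illustrative greedy construction in $d = 2$ (not from the paper):

```python
import numpy as np

# Greedily build a maximal eps-separated subset of sampled points in the ball.
# Its size is guaranteed <= (1 + 2R/eps)^d by the packing/volume argument, and
# by maximality every sampled point lies within eps of the net.
rng = np.random.default_rng(2)
d, R, eps = 2, 1.0, 0.5
pts = rng.uniform(-R, R, size=(20000, d))
pts = pts[np.linalg.norm(pts, axis=1) <= R]          # keep points inside the ball

net = []
for p in pts:
    if all(np.linalg.norm(p - q) > eps for q in net):
        net.append(p)
net = np.array(net)

assert len(net) <= (1 + 2 * R / eps) ** d            # volume bound: (1 + 4)^2 = 25
dists = np.min(np.linalg.norm(pts[:, None, :] - net[None, :, :], axis=2), axis=1)
assert dists.max() <= eps                            # net covers all sampled points
```

The first assertion is exactly Lemma L.8's bound; the second is the net property, which holds by construction since any point farther than $\epsilon$ from the net would have been added to it.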

$$\sup_{s,a}\Big|f(\theta_1, \phi(s,a)) - \nabla f(\theta_1, \phi(s,a))^\top A_1\,\nabla f(\theta_1, \phi(s,a)) - f(\theta_2, \phi(s,a)) + \nabla f(\theta_2, \phi(s,a))^\top A_2\,\nabla f(\theta_2, \phi(s,a))\Big|$$
$$= \sup_{s,a}\Big|\nabla f(\xi, \phi(s,a))^\top(\theta_1 - \theta_2) - \nabla f(\theta_1, \phi(s,a))^\top A_1\,\nabla f(\theta_1, \phi(s,a)) + \nabla f(\theta_2, \phi(s,a))^\top A_2\,\nabla f(\theta_2, \phi(s,a))\Big|$$
$$\le \kappa_1\cdot\|\theta_1 - \theta_2\|_2 + \sup_{s,a}\Big|\nabla f(\theta_1, \phi(s,a))^\top A_1\,\nabla f(\theta_1, \phi(s,a)) - \nabla f(\theta_2, \phi(s,a))^\top A_2\,\nabla f(\theta_2, \phi(s,a))\Big|$$
$$\le \kappa_1\cdot\|\theta_1 - \theta_2\|_2 + \sup_{s,a}\Big|\big[\nabla f(\theta_1, \phi(s,a)) - \nabla f(\theta_2, \phi(s,a))\big]^\top A_1\,\nabla f(\theta_1, \phi(s,a))\Big| + \sup_{s,a}\Big|\nabla f(\theta_2, \phi(s,a))^\top(A_1 - A_2)\,\nabla f(\theta_1, \phi(s,a))\Big| + \sup_{s,a}\Big|\nabla f(\theta_2, \phi(s,a))^\top A_2\,\big[\nabla f(\theta_1, \phi(s,a)) - \nabla f(\theta_2, \phi(s,a))\big]\Big|$$
$$\le \kappa_1\cdot\|\theta_1 - \theta_2\|_2 + 2\sup_{s,a}\big\|\nabla f(\theta_1, \phi(s,a)) - \nabla f(\theta_2, \phi(s,a))\big\|_2\cdot B\cdot\kappa_1 + \kappa_1^2\,\|A_1 - A_2\|_2$$

-net of the space $\{\theta : \|\theta\|_2 \le C_\Theta\}$, and $C_A$ be the $\frac{\epsilon^2}{4\kappa_1^2}$-net of the space $\{A : \|A\|_F \le \sqrt d B\}$; then by Lemma L.8,

and representation learning (Uehara et al., 2022) might provide new and unified views over the existing studies.

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604-609, 2020.

John Schulman, Xi Chen, and Pieter Abbeel. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.

Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pages 5186-5196, 2018.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354-359, 2017.

Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pages 2898-2933. PMLR, 2019.

Step 2 and a union bound over the covering number in Step 2, we obtain with probability $1-\delta$ (recall

Let $\{x_t\}_{t=1}^\infty$ be an $\mathbb{R}^d$-valued stochastic process where $x_t$ is $\mathcal F_{t-1}$-measurable and $\|x_t\| \le L$. Let $\Lambda_t = \lambda I_d + \sum_{s=1}^t x_s x_s^\top$. Then for any $\delta > 0$, with probability $1-\delta$, for all $t > 0$,

Lemma L.4 (Bernstein inequality for self-normalized martingales (Zhou et al., 2021a)). Let $\{\eta_t\}_{t=1}^\infty$ be a real-valued stochastic process. Let $\{\mathcal F_t\}_{t=0}^\infty$ be a filtration, such that $\eta_t$ is $\mathcal F_t$-measurable. Assume $\eta_t$ also satisfies $|\eta_t| \le R$, $\mathbb{E}[\eta_t \mid \mathcal F_{t-1}] = 0$, $\mathbb{E}[\eta_t^2 \mid \mathcal F_{t-1}] \le \sigma^2$. Let $\{x_t\}_{t=1}^\infty$ be an $\mathbb{R}^d$-valued stochastic process where $x_t$ is $\mathcal F_{t-1}$-measurable and $\|x_t\| \le L$. Let $\Lambda_t = \lambda I_d + \sum_{s=1}^t x_s x_s^\top$. Then for any $\delta > 0$, with probability $1-\delta$, for all $t > 0$,

Lemma L.5. Let $\nabla f(\theta, \phi(\cdot,\cdot)) : \mathcal S\times\mathcal A \to \mathbb{R}^d$ be a bounded function such that $\sup_{\theta\in\Theta}\|\nabla f(\theta, \phi(\cdot,\cdot))\|_2 \le \kappa_1$. If $K$ satisfies the stated condition, then with probability at least $1-\delta$, for all $\|u\|_2 \le B$ simultaneously, the stated bound holds, where $\bar G = \sum_{k=1}^K \nabla f(\theta, \phi(s^k_h, a^k_h))\,\nabla f(\theta, \phi(s^k_h, a^k_h))^\top + \lambda I_d$ and $G = \mathbb{E}_\mu\big[\nabla f(\theta, \phi(s_h, a_h))\,\nabla f(\theta, \phi(s_h, a_h))^\top\big]$.

$^\top + \lambda I_d$, we have, for any $\theta_1, \theta_2$,
$$\big\|\bar G(\theta_1) - \bar G(\theta_2)\big\|_2 \le 2K\kappa_2\kappa_1\,\|\theta_1 - \theta_2\|_2.$$
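The Lipschitz bound on $\bar G(\theta)$ comes from bounding each rank-one summand: $\|\nabla f(\theta_1)\nabla f(\theta_1)^\top - \nabla f(\theta_2)\nabla f(\theta_2)^\top\| \le 2\kappa_1\kappa_2\|\theta_1 - \theta_2\|$, summed over the $K$ samples. A numerical sketch with the same hypothetical model $f(\theta, \phi) = \sin(\langle\theta, \phi\rangle)$ (so $\kappa_1 = \kappa_2 = 1$ when $\|\phi\| \le 1$; not the paper's function class):

```python
import numpy as np

rng = np.random.default_rng(3)
d, K, lam = 3, 50, 1.0
Phi = rng.normal(size=(K, d))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)   # ||phi_k|| = 1
kappa1 = kappa2 = 1.0   # gradient bound and gradient-Lipschitz constant for this model

def G_bar(theta):
    # rows of g are grad_theta f(theta, phi_k) = cos(<theta, phi_k>) * phi_k
    g = np.cos(Phi @ theta)[:, None] * Phi
    return g.T @ g + lam * np.eye(d)

t1, t2 = rng.normal(size=d), rng.normal(size=d)
lhs = np.linalg.norm(G_bar(t1) - G_bar(t2), 2)
assert lhs <= 2 * K * kappa1 * kappa2 * np.linalg.norm(t1 - t2) + 1e-9
```

The regularizer $\lambda I_d$ cancels in the difference, so only the $K$ rank-one terms contribute, matching the $2K\kappa_2\kappa_1$ constant.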

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their helpful suggestions. Ming Yin would like to thank Chi Jin for the helpful suggestions regarding the assumption for the differentiable function class, and Andrea Zanette and Xuezhou Zhang for the friendly discussions. Mengdi Wang gratefully acknowledges funding from Office of Naval Research (ONR) N00014-21-1-2288, Air Force Office of Scientific Research (AFOSR) FA9550-19-1-0203, and NSF 19-589, CMMI-1653435. Ming Yin and Yu-Xiang Wang are gratefully supported by National Science Foundation (NSF) Awards #2007117 and #2003257.

K.1 REGARDING THE PROOF OF LOWER BOUND

The proof of Theorem 4.2 can be done via a reduction to the linear function approximation lower bound. In fact, it can be directly obtained from Theorem 3.5 of Yin et al. (2022), and the original proof comes from Theorem 2 of Zanette et al. (2021). Concretely, all the proofs of Theorem 3.5 of Yin et al. (2022) follow, and the only modification is to replace the corresponding quantity in their Section E.5.

L  AUXILIARY LEMMAS

Lemma L.1 (k-th Order Mean Value Form of Taylor's Expansion). Let $k \ge 1$ be an integer and let the function $f : \mathbb{R}^d \to \mathbb{R}$ be $k$ times differentiable and continuous over the compact domain $\Theta \subset \mathbb{R}^d$. Then for any $x, \theta \in \Theta$, there exists $\xi$ on the line segment between $x$ and $\theta$ such that
$$f(x) = \sum_{j=0}^{k-1} \frac{1}{j!}\,\nabla^j f(\theta)\big[(x - \theta)^{\otimes j}\big] + \frac{1}{k!}\,\nabla^k f(\xi)\big[(x - \theta)^{\otimes k}\big].$$
Here $\nabla^k f(\theta)$ denotes the $k$-dimensional tensor and $\otimes$ denotes the tensor product.

Lemma L.2 (Vector Hoeffding's Inequality). Let $X = (X_1, \ldots, X_d)$ be a $d$-dimensional vector random variable with $\mathbb{E}[X] = 0$ and $\|X\|_2 \le R$, and let $X^{(1)}, \ldots, X^{(n)}$ be $n$ samples. Then with probability $1-\delta$, the average $\frac1n\sum_{i=1}^n X^{(i)}$ is bounded in $\ell_2$ norm accordingly.

Proof of Lemma L.2. Since $\|X\|_2 \le R$ implies $|X_j| \le R$, by the univariate Hoeffding's inequality, for a fixed $j \in \{1, \ldots, d\}$, the $j$-th coordinate of the average concentrates. By a union bound over the $d$ coordinates, the claim follows.

Lemma L.3 (Hoeffding inequality for self-normalized martingales (Abbasi-Yadkori et al., 2011)). Let $\{\eta_t\}_{t=1}^\infty$ be a real-valued stochastic process. Let $\{\mathcal F_t\}_{t=0}^\infty$ be a filtration, such that $\eta_t$ is $\mathcal F_t$-measurable. Assume $\eta_t$ given $\mathcal F_{t-1}$ is zero-mean and $R$-subgaussian, i.e.
$$\forall\lambda\in\mathbb{R}, \quad \mathbb{E}\big[e^{\lambda\eta_t} \mid \mathcal F_{t-1}\big] \le e^{\lambda^2R^2/2}.$$

where the second inequality comes from $f_V = \mathcal P_h V + \delta_V$. Note the above holds true for all $s, a, r, s'$; therefore it implies the corresponding uniform bound. Let $C_1$ be the corresponding net and $C_2$ be the $\epsilon/6H$-net of $\mathcal V$; applying Lemma L.8 and Lemma L.9 yields the net cardinalities, which imply the stated bound on the covering number of $\mathcal H$. Define $X(\theta, \theta', u, v)$ analogously, where $f_V := \mathcal P_h V$ and $V(s)$ has the form $V_\theta(s)$ that belongs to $\mathcal V$ (as defined in Lemma L.9).
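Lemma L.2's proof strategy (coordinate-wise Hoeffding plus a union bound over the $d$ coordinates) gives a bound of the form $R\sqrt{2d\log(2d/\delta)/n}$ on the $\ell_2$ norm of the sample mean; this form is an assumption here, since the original display is not recoverable. An empirical check that the failure rate stays below $\delta$:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, R, delta, trials = 3, 200, 1.0, 0.05, 500
# Coordinate-wise Hoeffding + union bound over d coordinates (assumed form).
bound = R * np.sqrt(2 * d * np.log(2 * d / delta) / n)

fails = 0
for _ in range(trials):
    X = rng.normal(size=(n, d))
    X = R * X / np.linalg.norm(X, axis=1, keepdims=True)   # uniform on sphere: mean 0, ||X||_2 = R
    if np.linalg.norm(X.mean(axis=0)) > bound:
        fails += 1

assert fails / trials <= delta   # failure probability is at most delta
```

The empirical failure rate is typically far below $\delta$, reflecting the looseness of the union bound.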
Here $X(\theta, \theta', u, v)$ is a function of $s, a, r, s'$ as well; we suppress this dependence for conciseness only. Then, since $\max$ and truncation are non-expansive operations, we can achieve the analogous Lipschitz bound for any $s, a$ for this function class. Note the above holds true for all $s, a, r, s'$; therefore it implies the stated covering bound. Compared to Lemma L.10, the last two terms are incurred by covering the $u, v$ arguments.

