TOWARDS MINIMAX OPTIMAL REWARD-FREE REINFORCEMENT LEARNING IN LINEAR MDPS

Abstract

We study reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In this setting, an agent first interacts with the environment without accessing the reward function in the exploration phase. In the subsequent planning phase, it is given a reward function and asked to output an $\epsilon$-optimal policy. We propose a novel algorithm, LSVI-RFE, under the linear MDP setting, where the transition probability and reward functions are linear in a feature mapping. We prove an $\widetilde{O}(H^4 d^2/\epsilon^2)$ sample complexity upper bound for LSVI-RFE, where $H$ is the episode length and $d$ is the feature dimension. We also establish a sample complexity lower bound of $\Omega(H^3 d^2/\epsilon^2)$. To the best of our knowledge, LSVI-RFE is the first computationally efficient algorithm that achieves the minimax optimal sample complexity in linear MDP settings up to an $H$ factor and logarithmic factors. LSVI-RFE is based on a novel variance-aware exploration mechanism that avoids the overly conservative exploration of prior works. Our sharp bound relies on the decoupling of UCB bonuses during the two phases and on a Bernstein-type self-normalized bound, which remove the extra dependency of the sample complexity on $H$ and $d$, respectively.

1. INTRODUCTION

In reinforcement learning (RL), an agent tries to learn an optimal policy that maximizes cumulative long-term rewards by interacting with an unknown environment. Designing efficient exploration mechanisms is a critical task in RL algorithm design and is of great significance for improving sample efficiency, both theoretically (Azar et al., 2017; Ménard et al., 2021) and empirically (Schwarzer et al., 2020; Ye et al., 2021). In particular, for scenarios where reward signals are sparse and reward functions must be designed manually, e.g., Nair et al. (2018); Riedmiller et al. (2018), or multi-task settings where RL agents are required to accomplish different goals in different stages, e.g., Hessel et al. (2019); Yang et al. (2020), efficient exploration of the environment is crucial: it keeps the agent from learning from scratch under each different reward function, which is sample-inefficient and can even be intractable. However, the theoretical understanding is still limited, especially for MDPs with large (or infinite) state or action spaces. To understand the exploration mechanism in RL, reward-free exploration (RFE) was first proposed in Jin et al. (2020a) to explore the environment without reward signals. RFE contains two phases: exploration and planning. In the exploration phase, the agent interacts with the environment without accessing the reward function. In the subsequent planning phase, the agent is given a reward function and asked to output an $\epsilon$-optimal policy. RFE is significant in a host of reinforcement learning applications, e.g., multi-task RL (Hessel et al., 2019; Yang et al., 2020), RL with sparse rewards (Nair et al., 2018; Riedmiller et al., 2018), and systematic generalization of RL (Jiang et al., 2019; Mutti et al., 2022). The minimax optimal sample complexity $O(H^3 S^2 A/\epsilon^2)$ of RFE is obtained in Ménard et al. (2021) for tabular settings, where $S$ and $A$ are the sizes of the state and action spaces, respectively. However, this bound is intractable when the state and action spaces are large.

RFE with Linear Function Approximation

There are recent works (Wang et al., 2020a; Zanette et al., 2020c; Zhang et al., 2021; Chen et al., 2021; Huang et al., 2022; Wagenmaker et al., 2022) focusing on RFE in RL with linear function approximation. Chen et al. (2021) give a sample complexity bound of $\widetilde{O}(H^4 d^3/\epsilon^2)$, which is the sharpest result in $H$, while Wagenmaker et al. (2022) give a sample complexity bound of $\widetilde{O}(H^5 d^2/\epsilon^2)$, which achieves the optimal dependency on $d$. Our technical framework, i.e., an aggressive variance-aware exploration mechanism, is very different from that in Wagenmaker et al. (2022), and achieves both the optimal dependency on $d$ and a better dependency on $H$. The best known lower bound is $\Omega(\max\{H^3 d, d^2\}/\epsilon^2)$, obtained by combining the results of Zhang et al. (2021) and Wagenmaker et al. (2022). Another line of works (Zhang et al., 2021; Chen et al., 2021) focuses on RFE in linear mixture MDPs, where the minimax optimal sample complexity is obtained in Chen et al. (2021) when $d > H$, but the algorithm is not computationally efficient. A detailed comparison of the most related works is shown in Table 1. Moreover, low-rank MDPs, which subsume linear MDPs, are considered in Modi et al. (2021); Chen et al. (2022). In addition, further related work focuses on block MDPs, where the representation $\phi$ is unknown (Du et al., 2019; Misra et al., 2020; Zhang et al., 2022).

3. PRELIMINARIES

We consider an episodic finite-horizon MDP $M = \{S, A, H, \{P_h\}_h, \{r_h\}_h\}$, where $S$ is the state space, $A$ is the action space, $H \in \mathbb{Z}_+$ is the episode length, and $P_h : S \times A \to \Delta(S)$ and $r_h : S \times A \to [0,1]$ are the time-dependent transition probability and deterministic reward function. We assume that $S$ is a measurable space with a possibly infinite number of elements and that $A$ is a finite set. For a time-inhomogeneous MDP, the policy is time-dependent and is denoted as $\pi = \{\pi_1, \ldots, \pi_H\}$, where $\pi_h(s)$ is the action the agent takes at state $s$ at the $h$-th step. For a specific reward function $r = \{r_h\}_{h=1}^H$, we define the state-action value function (i.e., Q-function) and the value function as
$$Q^\pi_h(s, a; r) = \mathbb{E}\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s, a_h = a, \pi\Big], \qquad V^\pi_h(s; r) = \mathbb{E}\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s, \pi\Big],$$
respectively. For any function $V(\cdot\,; r) : S \to \mathbb{R}$, we further denote $P_h V(s, a; r) = \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)} V(s'; r)$ and the value function variance $[\mathbb{V}_h V](s, a; r) = P_h V^2(s, a; r) - [P_h V(s, a; r)]^2$, where $V^2$ stands for the function whose value at $s$ is $V^2(s; r)$. The Bellman equation associated with a policy $\pi$ for reward function $r$ is
$$Q^\pi_h(s, a; r) = r_h(s, a) + P_h V^\pi_{h+1}(s, a; r), \qquad V^\pi_h(s; r) = Q^\pi_h(s, \pi_h(s); r),$$
for any $(s, a) \in S \times A$ and $h \in [H]$. Since the action space and the episode length are both finite, there always exists an optimal policy $\pi^*$ for the reward function $r = \{r_h\}_{h=1}^H$, such that the associated optimal state-action value function and value function are $Q^*_h(s, a; r) = \sup_\pi Q^\pi_h(s, a; r)$ and $V^*_h(s; r) = \sup_\pi V^\pi_h(s; r)$, respectively.
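The Bellman recursion above can be sketched numerically. The following toy script (an illustrative sketch, not code from the paper; all names are ours) evaluates $Q^\pi_h$ and $V^\pi_h$ by backward induction on a small tabular MDP:

```python
import numpy as np

# Backward induction with the Bellman equation on a toy tabular MDP:
#   Q^pi_h(s,a) = r_h(s,a) + sum_{s'} P_h(s'|s,a) V^pi_{h+1}(s'),
#   V^pi_h(s)   = Q^pi_h(s, pi_h(s)),  with V^pi_{H+1} = 0.
rng = np.random.default_rng(0)
S, A, H = 4, 3, 5
P = rng.random((H, S, A, S)); P /= P.sum(axis=-1, keepdims=True)  # transitions
r = rng.random((H, S, A))                                         # rewards in [0,1]
pi = rng.integers(0, A, size=(H, S))                              # a fixed policy

V = np.zeros((H + 1, S))            # V^pi_{H+1}(.) = 0
Q = np.zeros((H, S, A))
for h in range(H - 1, -1, -1):      # backward in the step index h
    Q[h] = r[h] + P[h] @ V[h + 1]   # Bellman equation for Q^pi_h
    V[h] = Q[h][np.arange(S), pi[h]]

# Since rewards lie in [0, 1], V^pi_1 is bounded by the horizon H.
assert np.all((0 <= V[0]) & (V[0] <= H))
```

The same loop with a `max` over actions in place of the policy lookup computes $Q^*_h$ and $V^*_h$ via the Bellman optimality equation.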
For any $(s, a) \in S \times A$ and $h \in [H]$, the Bellman optimality equation for the reward function $r = \{r_h\}_{h=1}^H$ is
$$Q^*_h(s, a; r) = r_h(s, a) + P_h V^*_{h+1}(s, a; r), \qquad V^*_h(s; r) = \max_{a \in A} Q^*_h(s, a; r).$$
The structural assumption we make in this paper is a linear structure in both the transition and the reward, which has been considered in prior works, e.g., Yang & Wang (2019); Jin et al. (2020b), as below:

Definition 3.1 (Linear MDP). An MDP $M = \{S, A, H, \{P_h\}_h, \{r_h\}_h\}$ is a linear MDP with a known feature mapping $\phi : S \times A \to \mathbb{R}^d$ if, for any $h \in [H]$, there exist unknown $d$-dimensional measures $\mu_h = (\mu_h(s))_{s \in S} \in \mathbb{R}^{d \times |S|}$ over $S$ and an unknown vector $\theta_h \in \mathbb{R}^d$, such that for any $(s, a) \in S \times A$,
$$P_h(s' \mid s, a) = \langle \phi(s, a), \mu_h(s') \rangle, \qquad r_h(s, a) = \langle \phi(s, a), \theta_h \rangle.$$
We make the following norm assumptions: for any $h \in [H]$, (i) $\sup_{s,a} \|\phi(s, a)\|_2 \le 1$; (ii) $\|\mu_h v\|_2 \le \sqrt{d}$ for any vector $v \in \mathbb{R}^{|S|}$ such that $\|v\|_\infty \le 1$; (iii) $\|\theta_h\|_2 \le W$.

We follow the standard RFE paradigm (Jin et al., 2020a). Specifically, there are two phases. (i) Exploration phase: the agent interacts with the environment for up to $K$ episodes without accessing the reward function. (ii) Planning phase: the agent is given a reward function $r = \{r_h\}_{h=1}^H$ with the goal of outputting an $\epsilon$-optimal policy $\pi$ based on the information learned in the exploration phase. We define the sample complexity to be the number of episodes $K$ required in the exploration phase to output an $\epsilon$-optimal policy $\pi$ in the planning phase for any possible reward function $r$, i.e., $\mathbb{E}_{s \sim \mu}[V^*_1(s; r)] - \mathbb{E}_{s \sim \mu}[V^\pi_1(s; r)] \le \epsilon$, where $\mu \in \Delta(S)$ denotes the initial state distribution.
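Definition 3.1 can be made concrete with the standard observation that any tabular MDP is a linear MDP under one-hot features with $d = SA$. The sketch below (our illustration, not the paper's code) builds such a degenerate instance and checks both linearity conditions:

```python
import numpy as np

# A tabular MDP as a linear MDP: phi(s,a) = e_{(s,a)} in R^{S*A},
# mu_h(s') the flattened transition column, theta_h the flattened reward, so
#   P_h(s'|s,a) = <phi(s,a), mu_h(s')>   and   r_h(s,a) = <phi(s,a), theta_h>.
rng = np.random.default_rng(1)
S, A = 3, 2
d = S * A
P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))

def phi(s, a):
    v = np.zeros(d); v[s * A + a] = 1.0   # canonical one-hot feature
    return v

mu = P.reshape(d, S)       # column mu[:, s'] plays the role of mu_h(s')
theta = r.reshape(d)

for s in range(S):
    for a in range(A):
        assert np.allclose(P[s, a], phi(s, a) @ mu)      # linear transitions
        assert np.isclose(r[s, a], phi(s, a) @ theta)    # linear rewards
assert np.isclose((phi(0, 0) @ mu).sum(), 1.0)           # valid distribution
```

The interesting regime, of course, is $d \ll SA$, where the feature map compresses the transition dynamics.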

4. ALGORITHM AND MAIN RESULTS

This section presents our LSVI-RFE algorithm for reward-free reinforcement learning in linear MDPs. It builds upon the optimistic learning procedure of Wang et al. (2020a); Zhang et al. (2021), but with the critical novelty of introducing an aggressive variance-aware exploration mechanism. The mechanism is inspired by Chen et al. (2021); Hu et al. (2022), and LSVI-RFE further makes critical improvements in the variance-aware weights, value function monotonicity, and computational tractability. In addition, the mechanism is implemented via an aggressive exploration bonus $b_{k,h}$ and an aggressive reward function $r_{k,h}$ in the exploration phase: (i) The aggressive exploration bonus guarantees the monotonicity of value functions between the two phases and removes the additional dependency of the sample complexity on the feature dimension $d$ that arises from building a uniform convergence argument via a covering net in prior works, e.g., Jin et al. (2020b); Wang et al. (2020a). (ii) The reward function is more aggressive than those in prior works (Wang et al., 2020a; Zanette et al., 2020c) by an $H$ factor, which avoids overly conservative exploration and removes the extra dependency of the sample complexity on the episode length $H$.

4.1. EXPLORATION PHASE

Overall Exploration Sketch

The observed state-action pairs are collected in each episode for estimating the parameters $\{\mu_h\}_{h \in [H]}$ by weighted linear regression. Then, the optimistic state-action value function $\widehat{Q}_{k,h}$ is constructed (Line 9), and the agent executes the greedy policy $\pi^k_h$ with respect to the optimistic state-action value function (Line 11). The two critical steps in the exploration are the variance-aware exploration mechanism and the weighted linear regression, which are illustrated below.

Aggressive Variance-Aware Exploration Mechanism LSVI-RFE designs a variance-aware weight $\widehat{\sigma}_{k,h}$, and subsequently builds an exploration bonus $b_{k,h}$ and a reward function $r_{k,h}$ to encourage exploration, in Lines 7 and 8 of Algorithm 1, respectively. Our exploration mechanism is variance-aware and aggressive, with the following critical differences from prior works: (i) Variance-aware weights: $\widehat{\sigma}_{k,h}$ (Line 24 of Algorithm 1) contains two terms, $w_{k,h}$ and $W_{k,h}$. The motivation for introducing $\widehat{\sigma}_{k,h}$ is that we utilize a Bernstein-type self-normalized bound (Lemma C.2 in Appendix C.1) to build the confidence set, and the Bernstein bound contains a variance term ($\sigma^2$) and an elliptical potential term ($R$); $w_{k,h}$ is dynamically adjusted to keep the elliptical potential term $\|\phi(\cdot,\cdot)\|/\widehat{\sigma}_{k,h}$ small. This fine-grained control of $w_{k,h}$ is critical for removing the extra dependency on $d$; it is inspired by Hu et al. (2022) for regret minimization in linear MDPs and is detailed in Step 2 of Section 5. However, $\widehat{\sigma}_{k,h}$ in Hu et al. (2022) contains three terms; the difference is that we abandon the variance upper-estimator term concerning the real value function. Because the reward function is unknown in the exploration phase, a uniform upper bound on the variance of the value function (with respect to the actual reward function) is infeasible.
(ii) Aggressive exploration bonus $b_{k,h}$: it is aggressive through a factor-of-2 enlargement, which ensures the monotonicity of the estimated value function between the exploration phase and the planning phase, i.e., $V^*_h(\cdot\,; r) + \widehat{V}_{k,h}(\cdot) \ge \widehat{V}_h(\cdot)$ (Lemma A.15 in the Appendix); this is also necessary for removing the extra dependency of the sample complexity on $d$. This monotonicity is similar to the "over-optimism" of Hu et al. (2022), but ours is built between the exploration phase and the planning phase, whereas Hu et al. (2022) build monotonicity in each episode. The reason is that only the planning-phase bonus determines the sharpness of the sample complexity, as detailed in Section 5, so monotonicity is only required between the exploration phase and the planning phase, instead of in each episode of the exploration phase.

(iii) Aggressive reward function $r_{k,h}$: it is more aggressive, by a factor-of-$H$ enlargement, than in existing works on RFE in linear MDPs, e.g., Wang et al. (2020a), which set the reward function to $\min\{b_{k,h}(\cdot,\cdot)/H, 1\}$ so that it belongs to $[0,1]$. Our $r_{k,h}$ takes the same order as the exploration bonus, i.e., without the $1/H$ factor. Using a factor-of-$H$ enlargement to achieve faster learning rates was first proposed in Chen et al. (2021) for linear mixture MDPs, achieving minimax optimal sample complexity, yet in a computationally inefficient manner. We prove that the $H$ enlargement also works for our variance-aware exploration mechanism in a computationally efficient way, and it saves an $H^2$ factor in the sample complexity.

The exploration phase of Algorithm 1 proceeds as follows:

2: $\widehat{\Lambda}_{1,h}, \widetilde{\Lambda}_{1,h} \leftarrow \lambda I$; $\widehat{\mu}_{1,h} \leftarrow 0$; $\widehat{V}_{0,h}(\cdot) \leftarrow H$
3: end for
4: $\widehat{V}_{1,H+1}(\cdot) \leftarrow 0$
5: for episode $k = 1, \ldots, K$ do
6:   for step $h = H, \ldots, 1$ do // value iteration
7:     $b_{k,h}(\cdot,\cdot) = 2\widehat{\beta}_E \|\phi(\cdot,\cdot)\|_{\widehat{\Lambda}^{-1}_{k,h}}$ // exploration-driven bonus
8:     $r_{k,h}(\cdot,\cdot) = b_{k,h}(\cdot,\cdot)/2$ // exploration-driven reward function
9:     $\widehat{Q}_{k,h}(\cdot,\cdot) = r_{k,h}(\cdot,\cdot) + \langle \widehat{\mu}_{k,h} \widehat{V}_{k,h+1}, \phi(\cdot,\cdot) \rangle + b_{k,h}(\cdot,\cdot)$ // optimistic Q-function
10:    $\widehat{V}_{k,h}(\cdot) \leftarrow \min\{\max_{a \in A} \widehat{Q}_{k,h}(\cdot,a), H\}$
11:    $\pi^k_h(\cdot) \leftarrow \arg\max_{a \in A} \widehat{Q}_{k,h}(\cdot,a)$
12:   end for
13:   Receive the initial state $s^k_1$
14:   for step $h = 1, \ldots, H$ do
15:     $a^k_h \leftarrow \pi^k_h(s^k_h)$, and observe $s^k_{h+1} \sim P_h(\cdot \mid s^k_h, a^k_h)$
16:     $W_{k,h} = \min\{H \cdot (\langle \widehat{\mu}_{k,h} \widehat{V}_{k,h+1}, \phi(s^k_h, a^k_h) \rangle + \widehat{\beta}_E \|\phi(s^k_h, a^k_h)\|_{\widehat{\Lambda}^{-1}_{k,h}} + H\sqrt{\lambda}/(2K\sqrt{d})), H^2\}$
17:     $\widetilde{\sigma}_{k,h} \leftarrow \sqrt{\max\{H, (d^2/H) W_{k,h}\}}$
18:     $\widetilde{\Lambda}_{k+1,h} \leftarrow \widetilde{\Lambda}_{k,h} + \widetilde{\sigma}^{-2}_{k,h} \phi(s^k_h, a^k_h) \phi(s^k_h, a^k_h)^\top$
19:     if $\|\widetilde{\sigma}^{-1}_{k,h} \phi(s^k_h, a^k_h)\|_{\widetilde{\Lambda}^{-1}_{k,h}}$ [...]
24:     $\widehat{\sigma}_{k,h} \leftarrow \max\{w_{k,h}, \widetilde{\sigma}_{k,h}\}$
25:     $\widehat{\Lambda}_{k+1,h} \leftarrow \widehat{\Lambda}_{k,h} + \widehat{\sigma}^{-2}_{k,h} \phi(s^k_h, a^k_h) \phi(s^k_h, a^k_h)^\top$
26:     $\widehat{\mu}_{k+1,h} \leftarrow \widehat{\Lambda}^{-1}_{k+1,h} \sum_{i=1}^k \widehat{\sigma}^{-2}_{i,h} \phi(s^i_h, a^i_h) \delta(s^i_{h+1})^\top$

Weighted Linear Regression To enable aggressive variance-aware exploration, we employ weighted linear regression, which assembles the variance-aware weights into linear regression. Note that weighted ridge regression estimators have been built for regret minimization algorithms in RL with linear function approximation, e.g., Zhou et al. (2021); Hu et al. (2022). Denote by $\delta(s) \in \mathbb{R}^{|S|}$ the one-hot vector that is zero everywhere except that the entry corresponding to state $s$ is one, and define $\epsilon^k_h := P_h(\cdot \mid s^k_h, a^k_h) - \delta(s^k_{h+1})$. Since $\mathbb{E}[\epsilon^k_h \mid s^k_h, a^k_h] = 0$, $\delta(s^k_{h+1})$ is an unbiased estimate of $P_h(\cdot \mid s^k_h, a^k_h) = \mu^\top_h \phi(s^k_h, a^k_h)$. Thus, $\mu_h$ can be learned via regression from $\phi(s^k_h, a^k_h)$ to $\delta(s^k_{h+1})$, with $\{\widehat{\sigma}_{i,h}\}_{i \in [k]}$ serving as the weight sequence. The estimated parameter $\widehat{\mu}_{k,h}$ in Line 26 of Algorithm 1 is the solution to the weighted linear regression
$$\min_{\mu \in \mathbb{R}^{d \times |S|}} \sum_{i=1}^{k-1} \Big\| \big[\mu^\top \phi(s^i_h, a^i_h) - \delta(s^i_{h+1})\big] \,\widehat{\sigma}^{-1}_{i,h} \Big\|_2^2 + \lambda \|\mu\|_F^2,$$
with the solution given in Line 26 and the estimated transition probability $\widehat{P}_{k,h}(\cdot \mid s, a) = \widehat{\mu}^\top_{k,h} \phi(s, a)$.

4.2. PLANNING PHASE

Overall Planning Sketch During the planning phase, the output policy $\pi$ is the greedy policy with respect to an optimistic value iteration under the transition estimate from the exploration phase, i.e., the parameters $\{\widehat{\mu}_{K+1,h}\}_{h \in [H]}$. We introduce a UCB bonus term $b_h(\cdot,\cdot)$, which again takes a variance-aware form through the covariance matrix $\widehat{\Lambda}_{K+1,h}$, to ensure optimism of $\widehat{V}_h(\cdot)$. As in the exploration phase, the planning value iteration sets $\widehat{V}_h(\cdot) \leftarrow \min\{\max_{a \in A} \widehat{Q}_h(\cdot,a), H\}$ and $\pi_h(\cdot) \leftarrow \arg\max_{a \in A} \widehat{Q}_h(\cdot,a)$, and returns $\pi = \{\pi_h\}_{h \in [H]}$. In particular, Step 4 in Section 5 reveals that the sub-optimality gap is upper bounded by the summation of the UCB bonus terms $b_h(\cdot,\cdot)$ in the planning phase, which is small on average and ensures the near-optimality of the returned greedy policy $\pi$.

Remark 4.1. In our analysis, we introduce auxiliary value functions (Definition A.5 in the Appendix). Although inspired by Chen et al. (2021), we make critical changes, i.e., building optimism within our variance-aware exploration mechanism in a computationally efficient manner. With the auxiliary value functions, we can decouple the UCB bonuses $b_{k,h}$ and $b_h$ of the exploration and planning phases, i.e., $b_{k,h}$ and $b_h$ can take different orders, as revealed in Eq. (9) and Eq. (10). In particular, Step 4 in Section 5 further shows that only the planning-phase bonus $b_h$ determines the sharpness of the sample complexity. This is very different from prior works (Wang et al., 2020a; Zhang et al., 2021), which take the same order of bonuses in the two phases.
Our heterogeneous UCB bonuses in the two phases, together with the accompanying auxiliary value functions, reduce an $H^2$ factor and a $d$ factor in the sample complexity compared to the prior work of Wang et al. (2020a).
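The weighted ridge regression at the heart of the estimator admits a compact closed form. The sketch below (our illustration with assumed names and shapes, not the paper's code) implements it and checks that uniform weights recover ordinary ridge regression:

```python
import numpy as np

# Weighted ridge regression for mu_hat (cf. Line 26): given features phi_i in
# R^d, one-hot next-state targets delta_i in R^{|S|}, and weights sigma_i,
#   Lambda = lam*I + sum_i sigma_i^{-2} phi_i phi_i^T,
#   mu_hat = Lambda^{-1} sum_i sigma_i^{-2} phi_i delta_i^T,
# which minimizes sum_i ||(mu^T phi_i - delta_i)/sigma_i||_2^2 + lam*||mu||_F^2.
def weighted_ridge(phis, deltas, sigmas, lam=1.0):
    d = phis.shape[1]
    w = 1.0 / sigmas ** 2                                    # per-sample weights
    Lam = lam * np.eye(d) + (phis * w[:, None]).T @ phis     # weighted Gram matrix
    return np.linalg.solve(Lam, (phis * w[:, None]).T @ deltas)  # d x |S|

rng = np.random.default_rng(2)
n, d, nS = 50, 4, 3
phis = rng.normal(size=(n, d))
deltas = np.eye(nS)[rng.integers(0, nS, size=n)]             # one-hot targets
mu_unif = weighted_ridge(phis, deltas, np.ones(n))
# With uniform weights this reduces to ordinary ridge regression.
ridge = np.linalg.solve(np.eye(d) + phis.T @ phis, phis.T @ deltas)
assert np.allclose(mu_unif, ridge)
```

Down-weighting a sample (a larger $\sigma_i$) shrinks its influence on both the Gram matrix and the regression target, which is how the variance-aware weights enter the estimator.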

4.3. MAIN RESULTS

The sample complexity of the proposed LSVI-RFE algorithm is given in Theorem 4.2 below, with a proof sketch in Section 5. The detailed proof is given in Appendix A.

5. MECHANISM

In this section, we overview the key techniques and ideas used in the analysis of reward-free exploration and the proof of Theorem 4.2. As preliminary steps of optimistic learning, we construct the confidence sets $\widehat{C}^E_{k,h}$ and $\widehat{C}^P_h$ for the exploration phase and the planning phase in Steps 1 and 2, respectively. Subsequently, we bound the exploration error, i.e., the summation of the exploration bonus during the exploration phase, based on the confidence set $\widehat{C}^E_{k,h}$ in Step 3. Finally, we bound the sub-optimality gap of the recovered policy in the planning phase with the confidence set $\widehat{C}^P_h$ and the monotonicity of the exploration bonus in Step 4. The full proof is in Appendix A.

Step 1: Build the Confidence Set $\widehat{C}^E_{k,h}$ in the Exploration Phase LSVI-RFE estimates the parameters $\{\mu_h\}_{h \in [H]}$ of the transition probability matrix in the exploration phase. The confidence set $\widehat{C}^E_{k,h}$ is built with a Hoeffding-type self-normalized bound (Lemma C.1 in Appendix C) and a standard covering net argument (e.g., Lemma B.3 in Jin et al. (2020b)), such that for any $k \in [K]$, $h \in [H]$, with high probability,
$$\mu_h \in \widehat{C}^E_{k,h} := \Big\{ \mu : \big\|(\mu - \widehat{\mu}_{k,h}) \widehat{V}_{k,h+1}\big\|_{\widehat{\Lambda}_{k,h}} \le \widehat{\beta}_E \Big\}.$$
This is detailed in Lemma C.3 with $\widehat{\beta}_E = \widetilde{O}(d\sqrt{H})$. Consequently, we have $(\widehat{P}_{k,h} - P_h) \widehat{V}_{k,h+1} \le \widehat{\beta}_E \|\phi(s^k_h, a^k_h)\|_{\widehat{\Lambda}^{-1}_{k,h}}$ by the Cauchy-Schwarz inequality. Under the confidence set $\widehat{C}^E_{k,h}$, $\widehat{V}_{k,h}(\cdot)$ is an upper confidence estimator of the optimal value function with the exploration-driven reward, i.e., $V^*_h(\cdot\,; r_k)$, where $r_k = \{r_{k,h}\}_{h \in [H]}$; this is detailed in Lemma A.7.

Step 2: Build the Confidence Set $\widehat{C}^P_h$ in the Planning Phase We prove that, with high probability, for any $h \in [H]$ in the planning phase,
$$\mu_h \in \widehat{C}^P_h := \Big\{ \mu : \big\|(\mu - \widehat{\mu}_{K+1,h}) \widehat{V}_{h+1}\big\|_{\widehat{\Lambda}_{K+1,h}} \le \widehat{\beta}_P \Big\},$$
where the bonus radius $\widehat{\beta}_P = \widetilde{O}(\sqrt{dH})$ is sharper than $\widehat{\beta}_E = \widetilde{O}(d\sqrt{H})$ in the exploration phase.
Specifically, we build the confidence set $\widehat{C}^P_h$ as the intersection of two confidence sets,
$$\widehat{C}^{P(1)}_h := \Big\{ \mu : \big\|(\mu - \widehat{\mu}_{K+1,h}) V^*_{h+1}\big\|_{\widehat{\Lambda}_{K+1,h}} \le \widehat{\beta}_{P(1)} \Big\}, \qquad \widehat{C}^{P(2)}_h := \Big\{ \mu : \big\|(\mu - \widehat{\mu}_{K+1,h})\big(\widehat{V}_{h+1} - V^*_{h+1}\big)\big\|_{\widehat{\Lambda}_{K+1,h}} \le \widehat{\beta}_{P(2)} \Big\},$$
where $V^*_{h+1}(\cdot)$ refers to $V^*_{h+1}(\cdot\,; r)$ and $r$ is the real reward function given in the planning phase. A couple of remarks are in order here. (i) Since $V^*_{h+1}(\cdot)$ is a fixed function, we can build the confidence set $\widehat{C}^{P(1)}_h$ with radius $\widehat{\beta}_{P(1)} = \widetilde{O}(\sqrt{Hd})$ by a Hoeffding-type self-normalized bound directly, without a uniform convergence argument, as detailed in Lemma A.11. (ii) When building the confidence set $\widehat{C}^{P(2)}_h$, the Bernstein-type self-normalized bound (Lemma C.2 in Appendix C.1) is applied, where the variance term ($\sigma^2$) and the elliptical potential term ($R$) need to be controlled. In fact, we establish the monotonicity of the estimated value function between the exploration phase and the planning phase in Lemma A.15, i.e., $V^*_h(\cdot\,; r) + \widehat{V}_{k,h}(\cdot) \ge \widehat{V}_h(\cdot)$, so that the variance term ($\sigma^2$) can be bounded by the variance of $\widehat{V}_{k,h}(\cdot)$. Moreover, by dynamically adjusting $w_{k,h}$, we can keep the elliptical potential term ($R$) small, as detailed in Lemma A.9. This gives $\widehat{\beta}_{P(2)} = \widetilde{O}(\sqrt{Hd})$, as detailed in Lemma A.17. Consequently, the overhead from building a uniform convergence argument via a covering net is removed, and $\widehat{\beta}_P = \widehat{\beta}_{P(1)} + \widehat{\beta}_{P(2)} = \widetilde{O}(\sqrt{Hd})$.

Step 3: Bound the Exploration Error The exploration error refers to the summation of the constructed optimistic value functions in the exploration phase, i.e., of $\widehat{V}_{k,1}$, which is upper bounded by the summation of the bonus terms under the confidence sets $\widehat{C}^E_{k,h}$ for any $h \in [H]$. It is called the exploration error because the summation of the bonus terms upper bounds the estimation error $(\widehat{P}_{k,h} - P_h)\widehat{V}_{k,h+1}$.
In particular, the exploration error is upper bounded in Lemma A.20 by
$$\sum_{k=1}^K \widehat{V}_{k,1}(s^k_1) \le 4 \widehat{\beta}_E \underbrace{\sqrt{\sum_{k=1}^K \sum_{h=1}^H \widehat{\sigma}^2_{k,h}}}_{I_1} \cdot \underbrace{\sqrt{\sum_{k=1}^K \sum_{h=1}^H \min\Big\{ \big\|\widehat{\sigma}^{-1}_{k,h} \phi(s^k_h, a^k_h)\big\|^2_{\widehat{\Lambda}^{-1}_{k,h}}, 1 \Big\}}}_{I_2} \le \widetilde{O}\big(\sqrt{d^3 H^4 K}\big), \qquad (8)$$
where the second inequality holds since $I_2$ can be bounded by $\widetilde{O}(\sqrt{Hd})$ via the elliptical potential lemma (Lemma C.6), and $I_1 = \widetilde{O}(\sqrt{HT})$, since the enlargement operation in Line 20 of Algorithm 1 rarely happens, due to the conservatism of elliptical potentials (Lemma C.7), and the summation of $W_{k,h}$ is of order $\sqrt{T}$ by Lemma A.19.

Step 4: Bound the Sub-optimality Gap The sub-optimality gap refers to the expected gap between the optimal value function $V^*_1(\cdot\,; r)$ and the value function $V^\pi_1(\cdot\,; r)$ associated with the recovered policy $\pi$ in the planning phase. Notice that $\|\phi(\cdot,\cdot)\|_{\widehat{\Lambda}^{-1}_{K+1,h}} \le \|\phi(\cdot,\cdot)\|_{\widehat{\Lambda}^{-1}_{k,h}}$, since $\widehat{\Lambda}_{k,h} \preceq \widehat{\Lambda}_{K+1,h}$ for any $k \in [K]$. We thus have
$$r_{k,h}(\cdot,\cdot) = \widehat{\beta}_E \|\phi(\cdot,\cdot)\|_{\widehat{\Lambda}^{-1}_{k,h}} \ge \sqrt{d}\, b_h(\cdot,\cdot) = \sqrt{d}\, \widehat{\beta}_P \|\phi(\cdot,\cdot)\|_{\widehat{\Lambda}^{-1}_{K+1,h}}, \qquad (9)$$
where $b_h(\cdot,\cdot)$ is the exploration bonus in the planning phase. Then, we can apply the standard analysis of Wang et al. (2020a) to bound $\mathbb{E}_{s_1 \sim \mu}[V^*_1(s_1; r) - V^\pi_1(s_1; r)]$ by
$$\mathbb{E}_{s_1 \sim \mu}\big[V^*_1(s_1; r) - V^\pi_1(s_1; r)\big] \le \mathbb{E}_{s_1 \sim \mu}\big[\widehat{V}_1(s_1) - V^\pi_1(s_1; r)\big] \le \mathbb{E}_{s_1 \sim \mu}\big[\widetilde{V}^\pi_1(s_1; b)\big] \le \mathbb{E}_{s_1 \sim \mu}\big[\widetilde{V}^\pi_1(s_1; r_k)\big]\big/\sqrt{d} \le \mathbb{E}_{s_1 \sim \mu}\Big[\sum_{k=1}^K \widehat{V}_{k,1}(s_1)\Big] \Big/ \big(K\sqrt{d}\big), \qquad (10)$$
where the first inequality holds due to optimism under the confidence sets $\widehat{C}^P_h$ for any $h \in [H]$ (Lemma A.15), the second inequality is an application of regret decomposition and $\widetilde{V}^\pi_1$ is an auxiliary value function defined in the Appendix, the third inequality holds by Eq. (9), and the last inequality holds due to optimism under the confidence sets $\widehat{C}^E_{k,h}$ for any $h \in [H]$ (Lemma A.7). Indeed, Eq. (10) establishes the connection between the exploration phase and the planning phase.
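The elliptical potential lemma that controls $I_2$ is easy to verify numerically. The following sketch (our illustration, not from the paper) checks the standard bound $\sum_k \min\{\|\phi_k\|^2_{\Lambda_k^{-1}}, 1\} \le 2d\log(1 + K/(d\lambda))$ for features with $\|\phi_k\|_2 \le 1$ and $\lambda \ge 1$:

```python
import numpy as np

# Numerical check of the elliptical potential lemma used to bound I_2.
rng = np.random.default_rng(3)
d, K, lam = 5, 2000, 1.0
Lam = lam * np.eye(d)
total = 0.0
for _ in range(K):
    phi = rng.normal(size=d)
    phi /= max(1.0, np.linalg.norm(phi))           # enforce ||phi||_2 <= 1
    total += min(phi @ np.linalg.solve(Lam, phi), 1.0)
    Lam += np.outer(phi, phi)                      # rank-one update of Lambda
# The potential sum is logarithmic, not linear, in K.
assert total <= 2 * d * np.log(1 + K / (d * lam))
```

Intuitively, each feature direction can only be "new" (have a large potential) about $d \log K$ times, which is why the bonus sum grows like $\sqrt{dH \cdot K}$ rather than linearly in $K$.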
Thus, if $\pi$ is required to be an $\epsilon$-optimal policy, we obtain $K \ge \widetilde{O}(d^2 H^4/\epsilon^2)$ by combining Eq. (8) and Eq. (10).
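The final step amounts to a one-line calculation; the following is our rendering of the arithmetic that combines Eq. (8) and Eq. (10):

```latex
% Plugging the exploration-error bound of Eq. (8) into Eq. (10):
\mathbb{E}_{s_1 \sim \mu}\bigl[ V^*_1(s_1; r) - V^{\pi}_1(s_1; r) \bigr]
  \;\le\; \frac{1}{K\sqrt{d}} \,
          \mathbb{E}_{s_1 \sim \mu}\Bigl[ \sum_{k=1}^{K} \widehat{V}_{k,1}(s_1) \Bigr]
  \;\le\; \frac{\widetilde{O}\bigl( \sqrt{d^3 H^4 K} \bigr)}{K\sqrt{d}}
  \;=\; \widetilde{O}\Bigl( \sqrt{\tfrac{d^2 H^4}{K}} \Bigr).
% Setting the right-hand side to \epsilon and solving for K gives
% K \ge \widetilde{O}\bigl( d^2 H^4 / \epsilon^2 \bigr).
```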

6. LOWER BOUND AND SUB-OPTIMALITY GAP

We provide a sample complexity lower bound for reward-free RL under the linear MDP setting in Theorem 6.1. We show that there exists an instance of a linear MDP such that any reward-free RL algorithm requires $\Omega(d^2 H^3/\epsilon^2)$ episodes of interaction during the exploration phase to find a near-optimal policy during the planning phase.

Theorem 6.1 (Lower Bound). Suppose $H \ge 4$, $d \ge 2$, $1/32K < \delta < 1/H$. Then there exists a linear MDP instance $M = (S, A, H, \{P_h\}, \{r_h\})$ such that any algorithm ALG that learns an $\epsilon$-optimal policy with probability at least $1 - \delta$ needs to collect at least $K = C d^2 H^3/\epsilon^2$ episodes during the exploration phase, where $C$ is an absolute constant and $\delta$ has no dependence on $\epsilon$, $H$, $d$, $K$.

Proof Sketch. Notice that if the reward function is given in the exploration phase, the RFE setting degrades to the Probably Approximately Correct (PAC) RL setting (Dann et al., 2019). Thus, a lower bound for PAC RL also serves as a lower bound for RFE, since an algorithm for RFE also works for PAC RL by neglecting the reward function in the exploration phase. The proof of Theorem 6.1 is inspired by the lower bound for RFE in linear mixture MDPs in Chen et al. (2021). First, we construct a hard-to-learn MDP $M$ such that any algorithm that runs $K$ episodes incurs regret at least $\Omega(dH\sqrt{HK})$, as shown in Zhou et al. (2021). Then, for any algorithm ALG$_1$ running $K$ episodes to learn an $\epsilon$-optimal policy $\pi(K)$ with probability at least $1 - \delta$, it suffices to prove that under the instance $M$, $K \ge C d^2 H^3/\epsilon^2$. We prove this by constructing a new algorithm ALG$_2$, which runs ALG$_1$ in the first $K$ episodes and executes the generated policy $\pi(K)$ in the remaining $(c-1)K$ episodes, where $c > 1$ is a positive constant. The regret under $M$ in the last $(c-1)K$ episodes from executing $\pi(K)$ satisfies
$$\Omega\big(dH\sqrt{HK}\big) \lesssim \sum_{k=K+1}^{cK} \mathbb{E}_{x_1 \sim \nu}\big[V^*_1(x_1) - V^{\pi(K)}_1(x_1)\big] \lesssim K\epsilon,$$
where $\nu$ is the initial state distribution, the first inequality holds due to the hardness of the constructed MDP, and the second inequality holds because $\pi(K)$ is an $\epsilon$-optimal policy. Thus, we obtain $K \ge \Omega(d^2 H^3/\epsilon^2)$. For the detailed proof, please refer to Appendix B.

Remark 6.2. Theorem 6.1 presents an improved sample complexity lower bound for RFE in linear MDPs over the results of $\Omega(H^2 d/\epsilon^2)$ in Zhang et al. (2021) and $\Omega(d^2/\epsilon^2)$ in Wagenmaker et al. (2022). The lower bound and Theorem 4.2 together show that the sample complexity of LSVI-RFE, i.e., $\widetilde{O}(H^4 d^2/\epsilon^2)$, matches the lower bound $\Omega(H^3 d^2/\epsilon^2)$ up to an $H$ factor and logarithmic factors. To the best of our knowledge, the upper bound in Theorem 4.2 is the first to do so among computationally efficient algorithms.
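The reduction can be made explicit; the following LaTeX sketch is our rendering of the arithmetic (the sum's upper limit $cK$ is inferred from ALG$_2$'s total episode count):

```latex
% Regret-to-sample-complexity conversion (sketch; c > 1 is the constant above).
\Omega\bigl( dH\sqrt{H \cdot cK} \bigr)
  \;\lesssim\; \sum_{k=K+1}^{cK} \mathbb{E}_{x_1 \sim \nu}
      \bigl[ V^*_1(x_1) - V^{\pi(K)}_1(x_1) \bigr]
  \;\lesssim\; (c-1) K \epsilon .
% Hence d H^{3/2} \sqrt{K} \lesssim K \epsilon, i.e.,
% \sqrt{K} \gtrsim d H^{3/2} / \epsilon, which gives
% K \ge \Omega\bigl( d^2 H^3 / \epsilon^2 \bigr).
```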

6.1. TOWARDS MINIMAX OPTIMALITY

The sample complexity of LSVI-RFE is $\widetilde{O}(H^4 d^2/\epsilon^2)$, which matches the lower bound up to an $H$ factor and logarithmic factors. The factor $H$ between the upper and lower bounds is potentially due to utilizing a Hoeffding-type bonus instead of a Bernstein-type one for building the confidence set $\widehat{C}^{P(1)}_h$. Intuitively, a Bernstein-type bonus based on the variance of the value function, combined with the Law of Total Variance (LTV) (Lattimore & Hutter, 2012), can effectively shave a $\sqrt{H}$ factor from the statistical complexity of RL algorithms. This phenomenon has been observed in existing works (Zhou et al., 2021; Hu et al., 2022) on regret minimization in linear MDPs. However, to utilize the Bernstein-type inequality in building $\widehat{C}^{P(1)}_h$, we would have to estimate the variance of the optimal value function $V^*_{h+1}(\cdot\,; r)$ with respect to the real reward function. Unfortunately, under RFE, the agent is unaware of the real reward during the exploration phase, which creates an obstacle to building the variance estimator. In the linear mixture setting, Chen et al. (2021); Zhang et al. (2021) successfully utilize the Bernstein-type self-normalized bound and present nearly minimax optimal algorithms. However, their works still cannot estimate the variance of the optimal value function; instead, they build an upper-bound estimator from a candidate set to avoid estimating the variance, which is computationally inefficient.

Degradation to PAC RL Our sample complexity upper and lower bounds in Theorems 4.2 and 6.1 also apply to the PAC RL setting, i.e., when the agent knows the reward function during the exploration phase; both bounds are sharp, yet an $H$ gap remains. However, a direct adaptation of the Bernstein-type bonus will not improve the sample complexity under PAC RL, due to policy inconsistency. Specifically, when we apply the total variance lemma (Lemma C.5 in Jin et al. (2018)), the exploration policy $\pi^k$ for episode $k$ is inconsistent with the recovered policy $\pi$ in the planning phase, since $\pi^k$ is greedy with respect to the value function constructed from the exploration-driven reward function, instead of the real reward function given in the planning phase. A potential solution may be to bound the distance between the two policies by a policy distance measure, e.g., the KL divergence, which helps tabular RFE reach minimax optimality in Ménard et al. (2021).

7. CONCLUSION

This work studies reward-free reinforcement learning with linear function approximation for episodic MDPs. We propose a novel computationally efficient algorithm, LSVI-RFE, with an $\widetilde{O}(H^4 d^2/\epsilon^2)$ sample complexity upper bound for linear MDPs. We also establish a sample complexity lower bound of $\Omega(H^3 d^2/\epsilon^2)$, showing that LSVI-RFE's sample complexity is optimal up to an $H$ factor and logarithmic factors. LSVI-RFE introduces a novel variance-aware exploration mechanism with weighted linear regression to avoid the overly conservative exploration of prior works. Our sharp bound relies on the decoupling of UCB bonuses during the two phases, a Bernstein-type self-normalized bound, and the conservatism of elliptical potentials. We leave removing the $H$ gap as future work.

REPRODUCIBILITY STATEMENT

The assumptions we make on the MDP structure can be found in Definition 3.1. The complete proofs of our main theoretical results, Theorem 4.2 and Theorem 6.1, can be found in Appendix A and Appendix B, respectively. We also provide the auxiliary lemmas required for the complete proofs in Appendix C.

A PROOF OF UPPER BOUND (THEOREM 4.2)

A.1 NOTATIONS AND PRELIMINARIES

We summarize the key notations used in our analysis in Table 2:
- $\widehat{C}^E_{k,h}$: confidence set in the exploration phase, defined in Eq. (11)
- $\widehat{C}^{P(1)}_h$: part one of the confidence set in the planning phase, defined in Eq. (14)
- $\widehat{C}^{P(2)}_h$: part two of the confidence set in the planning phase, defined in Eq. (15)
- $\widehat{C}^P_h$: confidence set in the planning phase, defined in Eq. (16)
- $\Psi^E_{k,h}$: episodic optimism event in the exploration phase, defined in Eq. (12)
- $\Psi^E_h$: optimism event in the exploration phase, defined in Eq. (13)
- $\Psi^P_h$: optimism event in the planning phase, defined in Eq. (17)
- $\epsilon^k_h$: defined as $P_h(\cdot \mid s^k_h, a^k_h) - \delta(s^k_{h+1})$
- $\widehat{P}_{k,h}(\cdot \mid s, a)$: estimated transition probability, $\widehat{\mu}^\top_{k,h} \phi(s, a)$

Before the formal proof begins, we first give the necessary definitions of the measurable space and filtration required in our proofs.

Measurable Space Note that the stochasticity in the transition probabilities of the MDP is the only source of randomness. Denote by $\mathbb{P}$ the collection of distributions over state-action pair sequences $(S \times A)^{\mathbb{N}}$ induced by the interconnection of the policy obtained from Algorithm 1 and the episodic linear MDP $M$, and denote by $\mathbb{E}$ the corresponding expectation operator. Hence, all random variables can be defined over the sample space $\Omega = (S \times A)^{\mathbb{N}}$. Thus, we work with the probability space given by the triplet $(\Omega, \mathcal{F}, \mathbb{P})$, where $\mathcal{F}$ is the product $\sigma$-algebra generated by the discrete $\sigma$-algebras underlying $S$ and $A$.

Definition A.1 (Filtration). For any $k \in [K]$ and any $h \in [H]$, let $\mathcal{F}_{k,h}$ be the $\sigma$-algebra generated by the random variables representing the state-action pairs up to and including those that appear in stage $h$ of episode $k$.

Measurability $\widetilde{\sigma}_{k,h}$, $\widehat{\sigma}_{k,h}$, $\widetilde{\Lambda}_{k+1,h}$, $\widehat{\Lambda}_{k+1,h}$ are $\mathcal{F}_{k,h}$-measurable; $\widehat{\mu}_{k+1,h}$ is $\mathcal{F}_{k,h+1}$-measurable; $\widehat{Q}_{k,h}$, $\widehat{V}_{k,h}$, $\pi^k_h$ are $\mathcal{F}_{k-1,H}$-measurable, yet not $\mathcal{F}_{k-1,h}$-measurable, due to their backwards construction. We provide the necessary definitions of high probability events in Definition A.2 and Definition A.3 for the exploration and planning phases, respectively. In particular, the high probability events for the exploration phase are established in Appendix A.2, and those for the planning phase in Appendix A.3.

Definition A.2 (High Probability Events in Exploration Phase).

• Confidence Set in Exploration Phase

$$\widehat{C}^E_{k,h} := \Big\{ \mu : \big\|(\mu - \widehat{\mu}_{k,h}) \widehat{V}_{k,h+1}\big\|_{\widehat{\Lambda}_{k,h}} \le \widehat{\beta}_E \Big\}, \quad \forall (k, h) \in [K] \times [H] \qquad (11)$$

• Optimistic Events in Exploration Phase

$$\Psi^E_{k,h} := \big\{\mu_{h'} \in \widehat{C}^E_{k,h'}, \ \forall h \le h' \le H\big\}, \quad \forall (k, h) \in [K] \times [H] \qquad (12)$$
$$\Psi^E_h := \big\{\mu_{h'} \in \widehat{C}^E_{i,h'}, \ \forall i \in [K], \ \forall h \le h' \le H\big\}, \quad \forall h \in [H] \qquad (13)$$

Definition A.3 (High Probability Events in Planning Phase).

• Confidence Sets in Planning Phase

$$\widehat{C}^{P(1)}_h := \Big\{ \mu : \big\|(\mu - \widehat{\mu}_{K+1,h}) V^*_{h+1}\big\|_{\widehat{\Lambda}_{K+1,h}} \le \widehat{\beta}_{P(1)} \Big\}, \quad \forall h \in [H] \qquad (14)$$
$$\widehat{C}^{P(2)}_h := \Big\{ \mu : \big\|(\mu - \widehat{\mu}_{K+1,h})\big(\widehat{V}_{h+1} - V^*_{h+1}\big)\big\|_{\widehat{\Lambda}_{K+1,h}} \le \widehat{\beta}_{P(2)} \Big\}, \quad \forall h \in [H] \qquad (15)$$
$$\widehat{C}^P_h := \Big\{ \mu : \big\|(\mu - \widehat{\mu}_{K+1,h}) \widehat{V}_{h+1}\big\|_{\widehat{\Lambda}_{K+1,h}} \le \widehat{\beta}_P \Big\}, \quad \forall h \in [H] \qquad (16)$$

• Optimistic Event in Planning Phase

$$\Psi^P_h := \big\{\mu_{h'} \in \widehat{C}^P_{h'}, \ \forall h \le h' \le H\big\}, \quad \forall h \in [H] \qquad (17)$$

We also define the optimistic value function class below, which is required to build a uniform convergence argument via the covering net.

Definition A.4 (Optimistic Value Function Class). For any $L, B > 0$, let $\widehat{\mathcal{V}}(L, B)$ denote a class of functions mapping from $S$ to $\mathbb{R}$ with the parametric form
$$V(\cdot) = \min\Big\{ \max_a w^\top \phi(\cdot, a) + \beta \sqrt{\phi(\cdot, a)^\top \Lambda^{-1} \phi(\cdot, a)}, \ H \Big\},$$
where the parameters $(w, \beta, \Lambda)$ satisfy $\|w\|_2 \le L$, $\beta \in [0, B]$, the minimum eigenvalue satisfies $\lambda_{\min}(\Lambda) \ge \lambda$, and $\sup_{s,a} \|\phi(s, a)\|_2 \le 1$.

Moreover, we also utilize the truncated optimal value function in the planning phase, which is defined recursively as below. Compared to the definition of $V^*_h(s; r)$, we take the minimum of the value function and $H$ at each step. We can similarly define $\widetilde{V}^\pi_h(s; r)$ and $\widetilde{Q}^\pi_h(s, a; r)$.

Definition A.5 (Truncated Optimal Value Function). We introduce the value function $\widetilde{V}^*_h(s; r)$, which is defined recursively from step $H + 1$ to step 1:
$$\widetilde{V}^*_{H+1}(s; r) = 0, \quad \forall s \in S,$$
$$\widetilde{Q}^*_h(s, a; r) = r_h(s, a) + P_h \widetilde{V}^*_{h+1}(s, a; r), \quad \forall (s, a) \in S \times A,$$
$$\widetilde{V}^*_h(s; r) = \min\Big\{ \max_{a \in A} r_h(s, a) + P_h \widetilde{V}^*_{h+1}(s, a; r), \ H \Big\}, \quad \forall s \in S, \ h \in [H].$$
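Evaluating a member of the class in Definition A.4 is a simple computation; the sketch below (our illustration with assumed names, not the paper's code) implements the truncated optimistic value at a single state:

```python
import numpy as np

# V(s) = min{ max_a  w^T phi(s,a) + beta * sqrt(phi(s,a)^T Lambda^{-1} phi(s,a)), H }
def optimistic_value(phi_sa, w, beta, Lam_inv, H):
    # phi_sa: (A, d) matrix whose rows are the feature vectors phi(s, a).
    bonus = np.sqrt(np.einsum("ad,de,ae->a", phi_sa, Lam_inv, phi_sa))
    return min(float(np.max(phi_sa @ w + beta * bonus)), H)

rng = np.random.default_rng(4)
A, d, H = 3, 4, 5
phi_sa = rng.normal(size=(A, d))
phi_sa /= np.maximum(1.0, np.linalg.norm(phi_sa, axis=1, keepdims=True))
v = optimistic_value(phi_sa, w=rng.normal(size=d), beta=1.0,
                     Lam_inv=np.eye(d), H=H)
assert v <= H                     # the truncation at H always holds
```

The covering-net argument then discretizes the parameters $(w, \beta, \Lambda)$ of this map to obtain uniform convergence over the whole class.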
Proof Overview. We first build the confidence sets $\hat{\mathcal C}^E_{k,h}$ and $\hat{\mathcal C}^P_h$ for the exploration and planning phases in Appendix A.2 and Appendix A.3, respectively. Subsequently, we bound the exploration error, i.e., the summation of the exploration bonuses during the exploration phase, based on the confidence set $\hat{\mathcal C}^E_{k,h}$ in Appendix A.4. Finally, we bound the sub-optimality gap of the recovered policy in the planning phase using the confidence set $\hat{\mathcal C}^P_h$ and the monotonicity of the exploration bonus in Appendix A.5, which completes the proof of Theorem 4.2.

A.2 HIGH PROBABILITY EVENTS IN EXPLORATION PHASE

In this subsection, we build the high-probability events in the exploration phase, including the confidence set $\hat{\mathcal C}^E_{k,h}$ and optimism, in Lemma A.6 and Lemma A.7, respectively.

A.2.1 CONFIDENCE SET IN EXPLORATION PHASE

Lemma A.6. In Algorithm 1, for any $\delta\in(0,1)$ and fixed $h\in[H]$, with probability at least $1-\delta/H$, we have for all $k\in[K]$ that
$$\mu_h\in\hat{\mathcal C}^E_{k,h}=\Big\{\mu:\big\|(\mu-\hat\mu_{k,h})^\top\hat V_{k,h+1}\big\|_{\hat\Lambda_{k,h}}\le\hat\beta^E\Big\},$$
where
$$\hat\beta^E=\sqrt H\sqrt{d\log\Big(1+\frac K{Hd\lambda}\Big)+\log\Big(\frac H\delta\Big)+d\log\Big(1+\frac{8K^2\sqrt d}{H\lambda^{3/2}}\Big)+d^2\log\Big(1+\frac{32K^2B_E^2}{H^2\lambda^2\sqrt d}\Big)}+H\sqrt{\lambda d}+1\tag{20}$$
with $B_E$ satisfying $3\hat\beta^E\le B_E$.

Proof (Lemma A.6). In Line 10 of Algorithm 1, for any $(k,h)\in[K]\times[H]$, we have
$$\hat V_{k,h}(\cdot)=\min\Big\{\max_a\langle\hat\mu_{k,h}^\top\hat V_{k,h+1},\phi(\cdot,a)\rangle+3\hat\beta^E\|\phi(\cdot,a)\|_{\hat\Lambda^{-1}_{k,h}},\ H\Big\}.$$
Moreover,
$$\big\|\hat\mu_{k,h}^\top\hat V_{k,h+1}\big\|_2=\Big\|\hat\Lambda^{-1}_{k,h}\sum_{i=1}^{k-1}\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\hat V_{k,h+1}(s^i_{h+1})\Big\|_2\le\frac H{(\sqrt H)^2}\Big\|\hat\Lambda^{-1}_{k,h}\sum_{i=1}^{k-1}\phi(s^i_h,a^i_h)\Big\|_2\le\frac K\lambda,$$
where the first inequality holds since $\hat V_{k,h+1}(\cdot)\le H$ and $\hat\sigma_{i,h}\ge\sqrt H$ for any $i\in[k]$, and the second inequality holds since $\lambda_{\min}(\hat\Lambda_{k,h})\ge\lambda$ and $\sup_{s,a}\|\phi(s,a)\|_2\le1$. Thus, $\hat V_{k,h}\in\hat{\mathcal V}(L_E,B_E)$ for any $(k,h)\in[K]\times[H]$, where the function class $\hat{\mathcal V}(\cdot,\cdot)$ is defined in Definition A.4, $L_E=K/\lambda$, and $B_E$ is a constant satisfying $3\hat\beta^E\le B_E$ with $\hat\beta^E$ given in Eq. (20).

For a fixed function $V\in\hat{\mathcal V}(L_E,B_E)$, let $\mathcal G_i=\mathcal F_{i,h}$, $x_i=\hat\sigma^{-1}_{i,h}\phi(s^i_h,a^i_h)$, and $\eta_i=\hat\sigma^{-1}_{i,h}\epsilon_h^{i\top}V=\hat\sigma^{-1}_{i,h}\big[\langle\mu_h^\top V,\phi(s^i_h,a^i_h)\rangle-V(s^i_{h+1})\big]$. By Lemma C.1, with probability at least $1-\delta/H$,
$$\Big\|\sum_{i=1}^{k-1}\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}V\Big\|_{\hat\Lambda^{-1}_{k,h}}\le\sqrt H\sqrt{d\log\big(1+(k-1)/(Hd\lambda)\big)+\log(H/\delta)}\le\sqrt H\sqrt{d\log\big(1+K/(Hd\lambda)\big)+\log(H/\delta)}.$$
Denote the $\varepsilon$-cover of the function class $\hat{\mathcal V}(L_E,B_E)$ by $\hat{\mathcal N}_\varepsilon(L_E,B_E)$. For an arbitrary $f(\cdot)\in\hat{\mathcal V}(L_E,B_E)$, there exists $V(\cdot)\in\hat{\mathcal N}_\varepsilon$ such that $\|f-V\|_\infty\le\varepsilon$. Since $\|\epsilon_h^{i\top}(f-V)\|_2\le2\varepsilon$ and $\big\|\sum_{i=1}^{k-1}\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}\big\|_{\hat\Lambda^{-1}_{k,h}}\le K\sqrt d/(H\sqrt\lambda)$, we have
$$\Big\|\sum_{i=1}^{k-1}\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}(f-V)\Big\|_{\hat\Lambda^{-1}_{k,h}}\le\frac{2\varepsilon K\sqrt d}{H\sqrt\lambda}.\tag{21}$$
Thus,
$$\Big\|\sum_{i=1}^{k-1}\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}f\Big\|_{\hat\Lambda^{-1}_{k,h}}\le\Big\|\sum_{i=1}^{k-1}\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}V\Big\|_{\hat\Lambda^{-1}_{k,h}}+\frac{2\varepsilon K\sqrt d}{H\sqrt\lambda}\le\sqrt H\sqrt{d\log\big(1+K/(Hd\lambda)\big)+\log(H/\delta)+\log|\hat{\mathcal N}_\varepsilon(L_E,B_E)|}+\frac{2\varepsilon K\sqrt d}{H\sqrt\lambda},\tag{22}$$
where the first inequality is due to the triangle inequality together with Eq. (21), and the second inequality follows from a union bound over all functions in $\hat{\mathcal N}_\varepsilon(L_E,B_E)$ with
$$\log|\hat{\mathcal N}_\varepsilon(L_E,B_E)|\le d\log(1+4L_E/\varepsilon)+d^2\log\big(1+8d^{1/2}B_E^2/(\lambda\varepsilon^2)\big),$$
according to Lemma C.5. Moreover, we have
$$\big\|(\hat\mu_{k,h}-\mu_h)^\top\hat V_{k,h+1}\big\|_{\hat\Lambda_{k,h}}=\Big\|\hat\Lambda^{-1}_{k,h}\Big[-\lambda\mu_h+\sum_{i=1}^{k-1}\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}\Big]^\top\hat V_{k,h+1}\Big\|_{\hat\Lambda_{k,h}}\le\big\|\lambda\mu_h^\top\hat V_{k,h+1}\big\|_{\hat\Lambda^{-1}_{k,h}}+\Big\|\sum_{i=1}^{k-1}\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}\hat V_{k,h+1}\Big\|_{\hat\Lambda^{-1}_{k,h}}\le\frac1{\sqrt\lambda}\cdot\lambda H\sqrt d+\Big\|\sum_{i=1}^{k-1}\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}\hat V_{k,h+1}\Big\|_{\hat\Lambda^{-1}_{k,h}},$$
where the equality is due to Lemma C.4, the first inequality is due to the triangle inequality, and the second inequality holds since $\|\mu_h^\top\hat V_{k,h+1}\|_2\le H\sqrt d$ and the minimum eigenvalue of $\hat\Lambda_{k,h}$ is no less than $\lambda$.

Thus, since $\hat V_{k,h+1}\in\hat{\mathcal V}(L_E,B_E)$, we have with probability at least $1-\delta/H$, for any $k\in[K]$ and fixed $h\in[H]$,
$$\big\|(\hat\mu_{k,h}-\mu_h)^\top\hat V_{k,h+1}\big\|_{\hat\Lambda_{k,h}}\le H\sqrt{\lambda d}+\sqrt H\sqrt{d\log\Big(1+\frac K{Hd\lambda}\Big)+\log\Big(\frac H\delta\Big)+d\log\Big(1+\frac{8KL_E\sqrt d}{H\sqrt\lambda}\Big)+d^2\log\Big(1+\frac{32K^2B_E^2}{H^2\lambda^2\sqrt d}\Big)}+1=\hat\beta^E,$$
where the last inequality follows from Eq. (22) by setting $\varepsilon=H\sqrt\lambda/(2K\sqrt d)$.
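The estimator analyzed in Lemma A.6 is a variance-weighted ridge regression, and the quantity controlled by $\hat\beta^E$ is the self-normalized error $\|(\hat\mu_{k,h}-\mu_h)^\top\hat V_{k,h+1}\|_{\hat\Lambda_{k,h}}$, which grows only logarithmically with the number of samples. Below is a minimal numerical sketch of this phenomenon; the dimensions, noise scale, and variance weights are hypothetical stand-ins, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, lam = 4, 500, 1.0

# Hypothetical ground-truth parameter, playing the role of mu_h^T V_{k,h+1}.
theta_star = rng.normal(size=d) / np.sqrt(d)

Lambda = lam * np.eye(d)   # weighted Gram matrix: lam*I + sum_i sigma_i^{-2} phi_i phi_i^T
b = np.zeros(d)            # weighted regression target
for _ in range(K):
    phi = rng.normal(size=d)
    phi /= max(1.0, np.linalg.norm(phi))     # enforce ||phi||_2 <= 1
    sigma2 = 1.0 + rng.random()              # stand-in variance weight (assumption)
    y = phi @ theta_star + 0.1 * np.sqrt(sigma2) * rng.normal()
    Lambda += np.outer(phi, phi) / sigma2
    b += phi * y / sigma2

theta_hat = np.linalg.solve(Lambda, b)       # weighted ridge solution

# Self-normalized error ||theta_hat - theta_star||_Lambda, the quantity the
# confidence radius controls; it stays of order sqrt(d log K), not sqrt(K).
err = np.sqrt((theta_hat - theta_star) @ Lambda @ (theta_hat - theta_star))
print(round(float(err), 3))
```

In the algorithm itself, the weights are $\hat\sigma^{-2}_{i,h}$ and the regression targets are $\hat V_{k,h+1}(s^i_{h+1})$; both are replaced by synthetic stand-ins here to keep the sketch self-contained.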

A.2.2 OPTIMISM IN EXPLORATION PHASE

Lemma A.7 (Optimism in Exploration Phase). In Algorithm 1, for any $k\in[K]$ and any $h\in[H]$, under $\Psi^E_{k,h}$, we have $\tilde V_h(s;r^k)\le\hat V_{k,h}(s)$ for all $s\in\mathcal S$.

Proof (Lemma A.7). We first prove optimism for a fixed episode $k\in[K]$ by induction. Initially, the statement holds for $h=H+1$, since $\hat V_{k,H+1}(\cdot)=\tilde V_{H+1}(\cdot;r^k)=0$ by definition. Assume the statement holds at step $h+1$, i.e., $\hat V_{k,h+1}(\cdot)\ge\tilde V_{h+1}(\cdot;r^k)$ under $\Psi^E_{k,h+1}$. Recall the definitions of $\hat Q_{k,h}(\cdot,\cdot)$ and $\tilde Q_h(\cdot,\cdot;r^k)$:
$$\hat Q_{k,h}(\cdot,\cdot)=r_{k,h}(\cdot,\cdot)+\langle\hat\mu^\top_{k,h}\hat V_{k,h+1},\phi(\cdot,\cdot)\rangle+\hat\beta^E\|\phi(\cdot,\cdot)\|_{\hat\Lambda^{-1}_{k,h}},\qquad\tilde Q_h(\cdot,\cdot;r^k)=\min\big\{r_{k,h}(\cdot,\cdot)+\mathbb P_h\tilde V_{h+1}(\cdot,\cdot;r^k),\ H\big\}.\tag{26}$$
For any $(s,a)\in\mathcal S\times\mathcal A$,
$$\hat Q_{k,h}(s,a)-\tilde Q_h(s,a;r^k)\ge\langle\hat\mu^\top_{k,h}\hat V_{k,h+1},\phi(s,a)\rangle-\langle\mu_h^\top\hat V_{k,h+1},\phi(s,a)\rangle+\hat\beta^E\|\phi(s,a)\|_{\hat\Lambda^{-1}_{k,h}}+\mathbb P_h\hat V_{k,h+1}(s,a)-\mathbb P_h\tilde V_{h+1}(s,a;r^k)$$
$$\ge-\big\|(\hat\mu_{k,h}-\mu_h)^\top\hat V_{k,h+1}\big\|_{\hat\Lambda_{k,h}}\|\phi(s,a)\|_{\hat\Lambda^{-1}_{k,h}}+\hat\beta^E\|\phi(s,a)\|_{\hat\Lambda^{-1}_{k,h}}+\mathbb P_h\hat V_{k,h+1}(s,a)-\mathbb P_h\tilde V_{h+1}(s,a;r^k)$$
$$\ge\mathbb P_h\hat V_{k,h+1}(s,a)-\mathbb P_h\tilde V_{h+1}(s,a;r^k)\ge0.$$
Here the first inequality holds by Eq. (26), the second follows from the Cauchy-Schwarz inequality, the third holds since $\mu_h\in\hat{\mathcal C}^E_{k,h}$ under $\Psi^E_{k,h}$, and the last holds by the induction assumption $\hat V_{k,h+1}(\cdot)\ge\tilde V_{h+1}(\cdot;r^k)$ under $\Psi^E_{k,h+1}$ together with the fact that $\mathbb P_h$ is a valid distribution. Therefore,
$$\hat V_{k,h}(\cdot)=\min\Big\{\max_{a\in\mathcal A}\hat Q_{k,h}(\cdot,a),\ H\Big\}\ge\max_{a\in\mathcal A}\tilde Q_h(\cdot,a;r^k)=\tilde V_h(\cdot;r^k).$$
The lemma follows by applying the above argument to every $k\in[K]$.
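The mechanism behind Lemma A.7, adding a bonus large enough to dominate the estimation error and truncating values at $H$, can be illustrated in a small tabular model. The sketch below uses an empirical transition model with a generic Hoeffding-style bonus (all sizes and the bonus form are made up for illustration; this is not the linear-MDP construction itself) and checks that optimistic backward induction dominates the true optimal values:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, H = 4, 3, 5
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # true transition kernels
r = rng.random((H, S, A)) / H                   # rewards in [0, 1/H]

# True optimal values via backward induction.
V_true = np.zeros((H + 1, S))
for h in range(H - 1, -1, -1):
    Q = r[h] + P[h] @ V_true[h + 1]
    V_true[h] = Q.max(axis=1)

# Empirical model from n samples per (h, s, a), plus a Hoeffding-style bonus
# sized generously so that it dominates the estimation error in this run.
n = 50
counts = np.zeros((H, S, A, S))
for h in range(H):
    for s in range(S):
        for a in range(A):
            draws = rng.choice(S, size=n, p=P[h, s, a])
            counts[h, s, a] = np.bincount(draws, minlength=S)
P_hat = counts / n
bonus = H * np.sqrt(np.log(10 * H * S * A) / (2 * n))

# Optimistic backward induction, truncated at H as in the truncated value function.
V_opt = np.zeros((H + 1, S))
for h in range(H - 1, -1, -1):
    Q = r[h] + P_hat[h] @ V_opt[h + 1] + bonus
    V_opt[h] = np.minimum(Q.max(axis=1), H)

print(bool((V_opt[0] >= V_true[0]).all()))
```

The truncation at $H$ mirrors the $\min\{\cdot,H\}$ step in Definition A.5: it keeps the optimistic values bounded without breaking the dominance over the true values.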

A.3 HIGH PROBABILITY EVENTS IN PLANNING PHASE

In this subsection, we build the high-probability events in the planning phase, including the confidence set $\hat{\mathcal C}^P_h$ and optimism, in Lemma A.13 and Lemma A.14, respectively. In particular, the confidence set $\hat{\mathcal C}^P_h$ in Lemma A.13 is built from the two confidence sets $\hat{\mathcal C}^{P(1)}_h$ and $\hat{\mathcal C}^{P(2)}_h$ in Lemma A.11 and Lemma A.12, respectively. Apart from optimism in Lemma A.14, we also establish over-optimism between the exploration phase and the planning phase in Lemma A.15. In addition, we denote $V_h(\cdot):=V_h(\cdot;r)$ for convenience, where $r=\{r_h\}_{h\in[H]}$ is the true reward function given in the planning phase.

A.3.1 CONFIDENCE SET IN PLANNING PHASE

First, we present a lemma regarding the relation between $\hat\Lambda_{k,h}$ and $\tilde\Lambda_{k,h}$.

Lemma A.8. In Algorithm 1, for any $k\in[K+1]$ and any $h\in[H]$, we have $d^3\hat\Lambda_{k,h}\succeq\tilde\Lambda_{k,h}$.

Proof (Lemma A.8). Since $\hat\sigma_{k,h}\le\sqrt{d^3}\,\tilde\sigma_{k,h}$ by definition,
$$d^3\hat\Lambda_{k,h}-\tilde\Lambda_{k,h}=(d^3-1)\lambda I+\sum_{i=1}^{k-1}\Big(\big(\hat\sigma_{i,h}/\sqrt{d^3}\big)^{-2}-\tilde\sigma^{-2}_{i,h}\Big)\phi(s^i_h,a^i_h)\phi(s^i_h,a^i_h)^\top\succeq0$$
is a positive semi-definite matrix.

Lemma A.9. In Algorithm 1, for any $i\in[K]$ and any $h\in[H]$, we have
$$\hat\sigma^{-1}_{i,h}\min\big\{\|\hat\sigma^{-1}_{i,h}\phi(s^i_h,a^i_h)\|_{\hat\Lambda^{-1}_{i,h}},\,1\big\}\le\frac1{\sqrt{Hd^3}}.$$

Proof. We discuss the two cases considered in Algorithm 1.
1. If $\|\tilde\sigma^{-1}_{i,h}\phi(s^i_h,a^i_h)\|_{\tilde\Lambda^{-1}_{i,h}}\le\frac1{d^3}$, then $w_{i,h}=\sqrt H$, so that $\hat\sigma_{i,h}=\tilde\sigma_{i,h}$. Thus,
$$\big\|\hat\sigma^{-1}_{i,h}\phi(s^i_h,a^i_h)\big\|_{\hat\Lambda^{-1}_{i,h}}=\big\|\tilde\sigma^{-1}_{i,h}\phi(s^i_h,a^i_h)\big\|_{\hat\Lambda^{-1}_{i,h}}\le\sqrt{d^3}\,\big\|\tilde\sigma^{-1}_{i,h}\phi(s^i_h,a^i_h)\big\|_{\tilde\Lambda^{-1}_{i,h}}\le1/\sqrt{d^3},$$
which leads to $\hat\sigma^{-1}_{i,h}\min\{\|\hat\sigma^{-1}_{i,h}\phi(s^i_h,a^i_h)\|_{\hat\Lambda^{-1}_{i,h}},1\}\le\frac1{\sqrt{Hd^3}}$ by $\hat\sigma_{i,h}\ge\sqrt H$.
2. If $\|\tilde\sigma^{-1}_{i,h}\phi(s^i_h,a^i_h)\|_{\tilde\Lambda^{-1}_{i,h}}>\frac1{d^3}$, then $w_{i,h}=\sqrt{Hd^3}$, and hence
$$\hat\sigma^{-1}_{i,h}\min\big\{\|\hat\sigma^{-1}_{i,h}\phi(s^i_h,a^i_h)\|_{\hat\Lambda^{-1}_{i,h}},1\big\}\le\hat\sigma^{-1}_{i,h}\le w^{-1}_{i,h}=\frac1{\sqrt{Hd^3}}.$$

Lemma A.10. For any $i\in[K]$, $h\in[H-1]$, fixed function $V:\mathcal S\to[0,H]$, and $\zeta=H\sqrt\lambda/(2K\sqrt d)$, we have
$$[\mathbb V_h(V-V_{h+1})](s^i_h,a^i_h)\cdot\mathbb 1\big\{V_{h+1}-\zeta\le V\le\hat V_{i,h+1}+V_{h+1}+\zeta\big\}\cdot\mathbb 1\{\Psi^E_{i,h}\}\le W_{i,h},\tag{27}$$
where $W_{i,h}$ is defined in Algorithm 1:
$$W_{i,h}=\min\Big\{H\Big(\big\langle\hat\mu^\top_{i,h}\hat V_{i,h+1},\phi(s^i_h,a^i_h)\big\rangle+\hat\beta^E\|\phi(s^i_h,a^i_h)\|_{\hat\Lambda^{-1}_{i,h}}+\zeta\Big),\ H^2\Big\}.\tag{28}$$

Proof. Define $\tilde V:=V-V_{h+1}$ and $\tilde E:=\{-\zeta\le\tilde V\le\hat V_{i,h+1}+\zeta\}$ for brevity. Then the left-hand side of Eq. (27) equals
$$[\mathbb V_h\tilde V](s^i_h,a^i_h)\cdot\mathbb 1\{\tilde E\cap\Psi^E_{i,h}\}=\Big(\mathbb P_h(\tilde V^2)(s^i_h,a^i_h)-\big(\mathbb P_h\tilde V(s^i_h,a^i_h)\big)^2\Big)\mathbb 1\{\tilde E\cap\Psi^E_{i,h}\}\le H\cdot\big(\mathbb P_h\tilde V(s^i_h,a^i_h)\big)\cdot\mathbb 1\{\tilde E\cap\Psi^E_{i,h}\},$$
where the inequality holds by $|\tilde V|\le H$. Moreover, we may further condition on the event $\tilde E\cap\Psi^E_{i,h}$, since otherwise the left-hand side of Eq. (27) is zero. This leads to
$$H\cdot\big(\mathbb P_h\tilde V(s^i_h,a^i_h)\big)\cdot\mathbb 1\{\tilde E\cap\Psi^E_{i,h}\}\le H\cdot\big(\mathbb P_h\hat V_{i,h+1}(s^i_h,a^i_h)+\zeta\big)\cdot\mathbb 1\{\tilde E\cap\Psi^E_{i,h}\}$$
$$=H\cdot\Big(\big\langle\hat\mu^\top_{i,h}\hat V_{i,h+1},\phi(s^i_h,a^i_h)\big\rangle+\big\langle(\mu_h-\hat\mu_{i,h})^\top\hat V_{i,h+1},\phi(s^i_h,a^i_h)\big\rangle+\zeta\Big)\cdot\mathbb 1\{\tilde E\cap\Psi^E_{i,h}\}$$
$$\le H\cdot\Big(\big\langle\hat\mu^\top_{i,h}\hat V_{i,h+1},\phi(s^i_h,a^i_h)\big\rangle+\big\|(\mu_h-\hat\mu_{i,h})^\top\hat V_{i,h+1}\big\|_{\hat\Lambda_{i,h}}\|\phi(s^i_h,a^i_h)\|_{\hat\Lambda^{-1}_{i,h}}+\zeta\Big)\cdot\mathbb 1\{\tilde E\cap\Psi^E_{i,h}\}$$
$$\le H\cdot\Big(\big\langle\hat\mu^\top_{i,h}\hat V_{i,h+1},\phi(s^i_h,a^i_h)\big\rangle+\hat\beta^E\|\phi(s^i_h,a^i_h)\|_{\hat\Lambda^{-1}_{i,h}}+\zeta\Big)\cdot\mathbb 1\{\tilde E\cap\Psi^E_{i,h}\},$$
where the first inequality holds on the event $\tilde E$, the second by the Cauchy-Schwarz inequality, and the last under the event $\Psi^E_{i,h}$. Moreover, the left-hand side of Eq. (27) is at most $H^2$, since $|V-V_{h+1}|\le H$. Hence Eq. (27) holds.

Lemma A.11. In Algorithm 2, for any $k\in[K]$ and fixed $h\in[H]$, with probability at least $1-\delta/H$,
$$\mu_h\in\hat{\mathcal C}^{P(1)}_h=\Big\{\mu:\big\|(\mu-\hat\mu_{K+1,h})^\top V_{h+1}\big\|_{\hat\Lambda_{K+1,h}}\le\hat\beta^{P(1)}\Big\}.$$

Proof. Let $\mathcal G_i=\mathcal F_{i,h}$, $x_i=\hat\sigma^{-1}_{i,h}\phi(s^i_h,a^i_h)$, and $\eta_i=\hat\sigma^{-1}_{i,h}\epsilon_h^{i\top}V_{h+1}=\hat\sigma^{-1}_{i,h}\big[\langle\mu_h^\top V_{h+1},\phi(s^i_h,a^i_h)\rangle-V_{h+1}(s^i_{h+1})\big]$. Since $V_{h+1}(\cdot)$ is a fixed function, it is clear that $x_i$ is $\mathcal G_i$-measurable and $\eta_i$ is $\mathcal G_{i+1}$-measurable. In addition, $\mathbb E[\eta_i\mid\mathcal G_i]=0$. Since $\hat\sigma_{i,h}\ge\sqrt H$, we see that $|\eta_i|\le\sqrt H$ and $\|x_i\|_2\le1/\sqrt H$. Then, by Lemma C.1, with probability at least $1-\delta/H$, for all $k\in[K]$ and fixed $h\in[H]$,
$$\Big\|\sum_{i=1}^{k-1}\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}V_{h+1}\Big\|_{\hat\Lambda^{-1}_{k,h}}\le\sqrt H\sqrt{d\log\big(1+(k-1)/(Hd\lambda)\big)+\log(H/\delta)}\le\sqrt H\sqrt{d\log\big(1+K/(Hd\lambda)\big)+\log(H/\delta)}.$$
Using a similar argument as in Eq. (24), we have
$$\big\|(\hat\mu_{K+1,h}-\mu_h)^\top V_{h+1}\big\|_{\hat\Lambda_{k,h}}\le H\sqrt{\lambda d}+\Big\|\sum_{i=1}^{k-1}\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}V_{h+1}\Big\|_{\hat\Lambda^{-1}_{k,h}}.$$
Therefore, with probability at least $1-\delta/H$, for any $k\in[K]$ and fixed $h\in[H]$,
$$\big\|(\hat\mu_{K+1,h}-\mu_h)^\top V_{h+1}\big\|_{\hat\Lambda_{k,h}}\le\sqrt H\sqrt{d\log\big(1+K/(Hd\lambda)\big)+\log(H/\delta)}+H\sqrt{\lambda d}=\hat\beta^{P(1)}.$$

Lemma A.12. In Algorithm 2, for any $k\in[K]$ and fixed $h\in[H]$, under $\Psi^E_h\cap\Psi^P_{h+1}$, with probability at least $1-\delta/H$,
$$\mu_h\in\hat{\mathcal C}^{P(2)}_h=\Big\{\mu:\big\|(\mu-\hat\mu_{K+1,h})^\top\big(\hat V_{h+1}-V_{h+1}\big)\big\|_{\hat\Lambda_{K+1,h}}\le\hat\beta^{P(2)}\Big\},$$
where
$$\hat\beta^{P(2)}=8\sqrt{Hd\log\Big(1+\frac K{Hd\lambda}\Big)}\sqrt{\log\Big(\frac{4K^2H}\delta\Big)+d\log\Big(1+\frac{8K\sqrt dL_P}{H\sqrt\lambda}\Big)+d^2\log\Big(1+\frac{32d^{3/2}K^2B_P^2}{H^2\lambda^2}\Big)}+4\sqrt{\frac H{d^3}}\Big[\log\Big(\frac{4K^2H}\delta\Big)+d\log\Big(1+\frac{8K\sqrt dL_P}{H\sqrt\lambda}\Big)+d^2\log\Big(1+\frac{32d^{3/2}K^2B_P^2}{H^2\lambda^2}\Big)\Big]+H\sqrt{\lambda d}+1,$$
with $L_P=W+K/\lambda$ and an arbitrary $B_P\ge\hat\beta^P$.

Proof (Lemma A.12). In Line 6 of Algorithm 2, for any $h\in[H]$, we have
$$\hat V_h(s)=\min\Big\{\max_{a\in\mathcal A}\big\{r_h(s,a)+\langle\hat\mu^\top_{K+1,h}\hat V_{h+1},\phi(s,a)\rangle+b_h(s,a)\big\},\ H\Big\}=\min\Big\{\max_{a\in\mathcal A}\big\{\langle\theta_h+\hat\mu^\top_{K+1,h}\hat V_{h+1},\phi(s,a)\rangle+\hat\beta^P\|\phi(s,a)\|_{\hat\Lambda^{-1}_{K+1,h}}\big\},\ H\Big\}.$$
Moreover,
$$\big\|\theta_h+\hat\mu^\top_{K+1,h}\hat V_{h+1}\big\|_2=\Big\|\theta_h+\hat\Lambda^{-1}_{K+1,h}\sum_{i=1}^K\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\hat V_{h+1}(s^i_{h+1})\Big\|_2\le W+\frac H{(\sqrt H)^2}\Big\|\hat\Lambda^{-1}_{K+1,h}\sum_{i=1}^K\phi(s^i_h,a^i_h)\Big\|_2\le W+K/\lambda,$$
where the first inequality holds by the triangle inequality with $\|\theta_h\|_2\le W$, $\hat V_{h+1}(\cdot)\le H$, and $\hat\sigma_{i,h}\ge\sqrt H$ for any $i\in[K]$, and the second inequality holds since $\lambda_{\min}(\hat\Lambda_{K+1,h})\ge\lambda$ and $\sup_{s,a}\|\phi(s,a)\|_2\le1$. This implies $\hat V_h\in\hat{\mathcal V}(L_P,B_P)$ for any $h\in[H]$ with $L_P=W+K/\lambda$ and an arbitrary $B_P\ge\hat\beta^P$, where the function class $\hat{\mathcal V}(\cdot,\cdot)$ is defined in Definition A.4.

For fixed $h\in[H]$, fixed $V(\cdot)\in\hat{\mathcal V}(L_P,B_P)$, and the constant $\zeta=H\sqrt\lambda/(2K\sqrt d)$, let $\mathcal G_i=\mathcal F_{i,h}$, $x_i=\hat\sigma^{-1}_{i,h}\phi(s^i_h,a^i_h)$, and
$$\eta_i=\hat\sigma^{-1}_{i,h}\epsilon_h^{i\top}(V-V_{h+1})\cdot\mathbb 1\big\{V_{h+1}-\zeta\le V\le\hat V_{i,h+1}+V_{h+1}+\zeta\big\}\cdot\mathbb 1\{\Psi^E_{i,h}\},\qquad i\in[k].$$
Note that $V_{h+1}(\cdot)$ is a fixed function, while $\hat V_{i,h+1}(\cdot)$ and $\Psi^E_{i,h}$ are $\mathcal G_i$-measurable. Thus, $x_i$ is $\mathcal G_i$-measurable and $\eta_i$ is $\mathcal G_{i+1}$-measurable, and $\mathbb E[\eta_i\mid\mathcal G_i]=0$. By Lemma A.9, $\hat\sigma^{-1}_{i,h}\min\{\|\hat\sigma^{-1}_{i,h}\phi(s^i_h,a^i_h)\|_{\hat\Lambda^{-1}_{i,h}},1\}\le1/\sqrt{Hd^3}$. Since $|\epsilon_h^{i\top}(V-V_{h+1})|\le H$, we have $\big|\eta_i\min\{\|x_i\|_{\hat\Lambda^{-1}_{i,h}},1\}\big|\le\sqrt{H/d^3}$. Furthermore, since $\hat\sigma^2_{i,h}\ge(d^2/H)W_{i,h}$, it holds that
$$\mathbb E[\eta^2_i\mid\mathcal G_i]=\hat\sigma^{-2}_{i,h}\cdot[\mathbb V_h(V-V_{h+1})](s^i_h,a^i_h)\cdot\mathbb 1\big\{V_{h+1}-\zeta\le V\le\hat V_{i,h+1}+V_{h+1}+\zeta\big\}\cdot\mathbb 1\{\Psi^E_{i,h}\}\le\hat\sigma^{-2}_{i,h}W_{i,h}\le H/d^2,$$
where the first inequality holds by Lemma A.10. By Lemma C.2, for fixed $h\in[H]$, with probability at least $1-\delta/H$,
$$\Big\|\sum_{i=1}^K\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}(V-V_{h+1})\cdot\mathbb 1\big\{V_{h+1}-\zeta\le V\le\hat V_{i,h+1}+V_{h+1}+\zeta\big\}\cdot\mathbb 1\{\Psi^E_{i,h}\}\Big\|_{\hat\Lambda^{-1}_{K+1,h}}\le8\sqrt{Hd\log\Big(1+\frac K{Hd\lambda}\Big)\log\Big(\frac{4K^2H}\delta\Big)}+4\sqrt{\frac H{d^3}}\log\Big(\frac{4K^2H}\delta\Big).\tag{33}$$
Denote the $\zeta$-cover of the function class $\hat{\mathcal V}(L_P,B_P)$ by $\hat{\mathcal N}_\zeta$; by Lemma C.5, $\log|\hat{\mathcal N}_\zeta|\le d\log(1+4L_P/\zeta)+d^2\log(1+8d^{1/2}B_P^2/(\lambda\zeta^2))$, where $L_P=W+K/\lambda$ and $B_P\ge\hat\beta^P$. Then, for any $k\in[K]$ and fixed $h\in[H]$, with probability at least $1-\delta/H$, for any $V\in\hat{\mathcal N}_\zeta$, conditioned on $\Psi^E_h$ (i.e., $\mathbb 1\{\Psi^E_{i,h}\}=1$ for any $i\in[k]$; we build the argument under $\Psi^P_{h+1}$ below), we have
$$\Big\|\sum_{i=1}^K\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}(V-V_{h+1})\cdot\mathbb 1\{\cdot\}\Big\|_{\hat\Lambda^{-1}_{K+1,h}}\le8\sqrt{Hd\log\Big(1+\frac K{Hd\lambda}\Big)\Big(\log\Big(\frac{4K^2H}\delta\Big)+\log|\hat{\mathcal N}_\zeta|\Big)}+4\sqrt{\frac H{d^3}}\Big(\log\Big(\frac{4K^2H}\delta\Big)+\log|\hat{\mathcal N}_\zeta|\Big),$$
where $\mathbb 1\{\cdot\}$ abbreviates $\mathbb 1\{V_{h+1}-\zeta\le V\le\hat V_{i,h+1}+V_{h+1}+\zeta\}$. For any $f(\cdot)\in\hat{\mathcal V}(L_P,B_P)$, there exists $V(\cdot)\in\hat{\mathcal N}_\zeta$ such that $\|f-V\|_\infty\le\zeta$. Since $\|\epsilon_h^{i\top}(f-V)\|_2\le2\zeta$ and $\big\|\sum_{i=1}^K\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}\mathbb 1\{\cdot\}\big\|_{\hat\Lambda^{-1}_{K+1,h}}\le K\sqrt d/(H\sqrt\lambda)$, we have
$$\Big\|\sum_{i=1}^K\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}(f-V)\cdot\mathbb 1\{\cdot\}\Big\|_{\hat\Lambda^{-1}_{K+1,h}}\le\frac{2\zeta K\sqrt d}{H\sqrt\lambda}.\tag{34}$$
Conditioning on $\Psi^E_h$ and applying the triangle inequality with Eq. (34) and Eq. (33), we obtain
$$\Big\|\sum_{i=1}^K\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}(f-V_{h+1})\cdot\mathbb 1\{\cdot\}\Big\|_{\hat\Lambda^{-1}_{K+1,h}}\le8\sqrt{Hd\log\Big(1+\frac K{Hd\lambda}\Big)\Big(\log\Big(\frac{4K^2H}\delta\Big)+\log|\hat{\mathcal N}_\zeta|\Big)}+4\sqrt{\frac H{d^3}}\Big(\log\Big(\frac{4K^2H}\delta\Big)+\log|\hat{\mathcal N}_\zeta|\Big)+\frac{2\zeta K\sqrt d}{H\sqrt\lambda}.$$
We further assume that $\Psi^P_{h+1}$ holds. For $\hat V_{h+1}(\cdot)$ there exists $V'(\cdot)\in\hat{\mathcal N}_\zeta$ with $\|\hat V_{h+1}-V'\|_\infty\le\zeta$. Since $\zeta=H\sqrt\lambda/(2K\sqrt d)$, we have $V'\le\hat V_{h+1}+\zeta\le V_{h+1}+\hat V_{i,h+1}+\zeta$ for any $i\in[K]$, where the second inequality holds by Lemma A.15 under $\Psi^E_{h+1}\cap\Psi^P_{h+1}$. On the other hand, $V_{h+1}-\zeta\le\hat V_{h+1}-\zeta\le V'$, where the first inequality holds by Lemma A.14 under $\Psi^P_{h+1}$. Thus, $\mathbb 1\{V_{h+1}-\zeta\le V'\le\hat V_{i,h+1}+V_{h+1}+\zeta\}=1$ for any $i\in[K]$ under $\Psi^E_{h+1}\cap\Psi^P_{h+1}$. Therefore, with probability at least $1-\delta/H$, under $\Psi^E_h\cap\Psi^P_{h+1}$, for any $k\in[K]$ and fixed $h\in[H]$,
$$\big\|(\hat\mu_{K+1,h}-\mu_h)^\top\big(\hat V_{h+1}-V_{h+1}\big)\big\|_{\hat\Lambda_{K+1,h}}\le H\sqrt{\lambda d}+\Big\|\sum_{i=1}^K\hat\sigma^{-2}_{i,h}\phi(s^i_h,a^i_h)\epsilon_h^{i\top}\big(\hat V_{h+1}-V_{h+1}\big)\Big\|_{\hat\Lambda^{-1}_{K+1,h}}\le\hat\beta^{P(2)}.$$

Lemma A.15 (Over-Optimism Between Two Phases). In Algorithms 1 and 2, for any $k\in[K]$ and any $h\in[H]$, under $\Psi^E_h\cap\Psi^P_h$, we have
$$V_h(s;r)+\hat V_{k,h}(s)\ge\hat V_h(s),\qquad\forall s\in\mathcal S.$$

Proof (Lemma A.15). We first prove the conclusion by induction for a fixed $k\in[K]$. The statement holds trivially for $h=H+1$, since $V_{H+1}(\cdot;r)+\hat V_{k,H+1}(\cdot)=\hat V_{H+1}(\cdot)=0$ by definition. Assume the statement holds at step $h+1$, i.e., $V_{h+1}(\cdot;r)+\hat V_{k,h+1}(\cdot)\ge\hat V_{h+1}(\cdot)$ under $\Psi^E_{h+1}\cap\Psi^P_{h+1}$. We see that
$$Q_h(\cdot,\cdot;r)=r_h(\cdot,\cdot)+\mathbb P_hV_{h+1}(\cdot,\cdot;r),\qquad\hat Q_{k,h}(\cdot,\cdot)=r_{k,h}(\cdot,\cdot)+\langle\hat\mu^\top_{k,h}\hat V_{k,h+1},\phi(\cdot,\cdot)\rangle+\hat\beta^E\|\phi(\cdot,\cdot)\|_{\hat\Lambda^{-1}_{k,h}},$$
$$\hat Q_h(\cdot,\cdot)=r_h(\cdot,\cdot)+\langle\hat\mu^\top_{K+1,h}\hat V_{h+1},\phi(\cdot,\cdot)\rangle+\hat\beta^P\|\phi(\cdot,\cdot)\|_{\hat\Lambda^{-1}_{K+1,h}}.$$
For any $(s,a)\in\mathcal S\times\mathcal A$,
$$Q_h(s,a;r)+\hat Q_{k,h}(s,a)-\hat Q_h(s,a)\ge\big[\mathbb P_hV_{h+1}(s,a)+\mathbb P_h\hat V_{k,h+1}(s,a)-\mathbb P_h\hat V_{h+1}(s,a)\big]+3\hat\beta^E\|\phi(s,a)\|_{\hat\Lambda^{-1}_{k,h}}-\hat\beta^P\|\phi(s,a)\|_{\hat\Lambda^{-1}_{K+1,h}}-\big\|(\hat\mu_{k,h}-\mu_h)^\top\hat V_{k,h+1}\big\|_{\hat\Lambda_{k,h}}\|\phi(s,a)\|_{\hat\Lambda^{-1}_{k,h}}-\big\|(\hat\mu_{K+1,h}-\mu_h)^\top\hat V_{h+1}\big\|_{\hat\Lambda_{K+1,h}}\|\phi(s,a)\|_{\hat\Lambda^{-1}_{K+1,h}}$$
$$\ge\big[\mathbb P_hV_{h+1}(s,a)+\mathbb P_h\hat V_{k,h+1}(s,a)-\mathbb P_h\hat V_{h+1}(s,a)\big]+3\hat\beta^E\|\phi(s,a)\|_{\hat\Lambda^{-1}_{k,h}}-\hat\beta^P\|\phi(s,a)\|_{\hat\Lambda^{-1}_{K+1,h}}-\hat\beta^E\|\phi(s,a)\|_{\hat\Lambda^{-1}_{k,h}}-\hat\beta^P\|\phi(s,a)\|_{\hat\Lambda^{-1}_{K+1,h}}$$
$$\ge\mathbb P_hV_{h+1}(s,a)+\mathbb P_h\hat V_{k,h+1}(s,a)-\mathbb P_h\hat V_{h+1}(s,a)\ge0,$$
where the first inequality is due to the Cauchy-Schwarz inequality, the second holds by $\mu_h\in\hat{\mathcal C}^E_{k,h}\cap\hat{\mathcal C}^P_h$ under $\Psi^E_h\cap\Psi^P_h$, the third holds since $\hat\Lambda_{K+1,h}\succeq\hat\Lambda_{k,h}$ and $\hat\beta^E\ge\hat\beta^P$, and the last holds by the induction assumption $V_{h+1}(\cdot;r)+\hat V_{k,h+1}(\cdot)\ge\hat V_{h+1}(\cdot)$ under $\Psi^E_{h+1}\cap\Psi^P_{h+1}$ together with the fact that $\mathbb P_h$ is a valid distribution. Therefore,
$$V_h(\cdot;r)+\hat V_{k,h}(\cdot)\ge\max_{a\in\mathcal A}\hat Q_h(\cdot,a)\ge\hat V_h(\cdot),$$
where the first inequality uses the display above together with $\hat V_{k,h}=\min\{\max_a\hat Q_{k,h},H\}$ (if the truncation at $H$ is active, then $V_h(\cdot;r)+\hat V_{k,h}(\cdot)\ge H\ge\hat V_h(\cdot)$ directly). Finally, the same argument extends to every $k\in[K]$.
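The matrix ordering of Lemma A.8, $d^3\hat\Lambda_{k,h}\succeq\tilde\Lambda_{k,h}$, follows purely from the per-sample weight inequality $\hat\sigma_{i,h}\le\sqrt{d^3}\,\tilde\sigma_{i,h}$, and is easy to check numerically. The sketch below builds both weighted Gram matrices with synthetic features and stand-in values for $w_{i,h}$, $W_{i,h}$, and $\tilde\sigma_{i,h}$ (all hypothetical), then verifies positive semi-definiteness of the difference:

```python
import numpy as np

rng = np.random.default_rng(2)
d, H, K, lam = 3, 4, 200, 1.0
d3 = d ** 3

Lam_tilde = lam * np.eye(d)   # Gram matrix weighted by sigma_tilde^{-2}
Lam_hat = lam * np.eye(d)     # Gram matrix weighted by sigma_hat^{-2}
for _ in range(K):
    phi = rng.normal(size=d)
    phi /= max(1.0, np.linalg.norm(phi))          # ||phi||_2 <= 1
    sig_tilde = np.sqrt(H) * (1.0 + rng.random())  # stand-in sigma_tilde >= sqrt(H)
    # Two cases for w_{i,h}, mirroring the algorithm's threshold rule (sketch):
    if np.sqrt(phi @ np.linalg.solve(Lam_tilde, phi)) / sig_tilde <= 1.0 / d3:
        w = np.sqrt(H)
    else:
        w = np.sqrt(H * d3)
    W_var = rng.random() * H ** 2                  # variance proxy W_{i,h} <= H^2
    sig_hat = max(w, np.sqrt(d ** 2 / H * W_var))  # sigma_hat <= sqrt(d^3) * sigma_tilde
    Lam_tilde += np.outer(phi, phi) / sig_tilde ** 2
    Lam_hat += np.outer(phi, phi) / sig_hat ** 2

# Lemma A.8: d^3 * Lam_hat - Lam_tilde should be positive semi-definite.
min_eig = np.linalg.eigvalsh(d3 * Lam_hat - Lam_tilde).min()
print(min_eig >= 0)
```

Since every coefficient $d^3\hat\sigma^{-2}_{i,h}-\tilde\sigma^{-2}_{i,h}$ is nonnegative and the regularizer contributes $(d^3-1)\lambda I$, the minimum eigenvalue of the difference is in fact at least $(d^3-1)\lambda$.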

A.4 UPPER BOUNDING THE EXPLORATION ERROR

In this subsection, we bound the exploration error in Lemma A.20. Before that, we provide a simulation lemma for linear MDPs in Lemma A.17. Besides, we denote by $r^k$ the reward function $\{r_{k,h}\}_{h=1}^H$ used in Algorithm 1 at episode $k$, for all $k\in[K]$.

Definition A.16 (Trajectory Distribution). For a fixed policy $\pi=\{\pi_1,\pi_2,\dots,\pi_H\}$, define $d^\pi_h(s_h)$ as the probability measure over trajectories $\tau_h=(s_h,a_h,s_{h+1},a_{h+1},\dots,s_H,a_H)$ induced by following $\pi$ starting from $s_h$ at stage $h$:
$$d^\pi_h(s_h)(\tau_h):=\prod_{t=h+1}^H\mathbb P^\pi_t(s_t,a_t\mid s_{t-1},a_{t-1}),$$
where $\mathbb P^\pi_t(s_t,a_t\mid s_{t-1},a_{t-1})$ denotes the probability that the state-action pair transitions from $(s_{t-1},a_{t-1})$ to $(s_t,a_t)$ in a trajectory started at $s_h$ at stage $h$ by following policy $\pi$.

Lemma A.17 (Simulation Lemma). In Algorithm 1, for any $k\in[K]$ and any $h\in[H]$, under $\Psi^E_{k,h}$, we have
$$0\le\hat V_{k,h}(s^k_h)\le\mathbb E_{\tau^k_h\sim d^{\pi^k}_h(s^k_h)}\min\Big\{\sum_{h'=h}^H4\hat\beta^E\big\|\phi(s^k_{h'},a^k_{h'})\big\|_{\hat\Lambda^{-1}_{k,h'}},\ H\Big\}.$$

Proof (Lemma A.17). First, recall that
$$\hat Q_{k,h}(\cdot,\cdot)=r_{k,h}(\cdot,\cdot)+\langle\hat\mu^\top_{k,h}\hat V_{k,h+1},\phi(\cdot,\cdot)\rangle+\hat\beta^E\|\phi(\cdot,\cdot)\|_{\hat\Lambda^{-1}_{k,h}}=r_{k,h}(\cdot,\cdot)+\hat{\mathbb P}_{k,h}\hat V_{k,h+1}(\cdot,\cdot)+\hat\beta^E\|\phi(\cdot,\cdot)\|_{\hat\Lambda^{-1}_{k,h}},\qquad\hat V_{k,h}(\cdot)=\min\Big\{\max_{a\in\mathcal A}\hat Q_{k,h}(\cdot,a),\ H\Big\}.$$
Thus, for any $k\in[K]$ and any $h\in[H]$ in Algorithm 1, we have
$$\hat V_{k,h}(s^k_h)\le\hat Q_{k,h}(s^k_h,a^k_h)=r_{k,h}(s^k_h,a^k_h)+\hat{\mathbb P}_{k,h}\hat V_{k,h+1}(s^k_h,a^k_h)+\hat\beta^E\|\phi(s^k_h,a^k_h)\|_{\hat\Lambda^{-1}_{k,h}}=3\hat\beta^E\|\phi(s^k_h,a^k_h)\|_{\hat\Lambda^{-1}_{k,h}}+\big[\hat{\mathbb P}_{k,h}\hat V_{k,h+1}(s^k_h,a^k_h)-\mathbb P_h\hat V_{k,h+1}(s^k_h,a^k_h)\big]+\mathbb P_h\hat V_{k,h+1}(s^k_h,a^k_h)$$
$$\overset{(a)}=\mathbb E_{\tau^k_h\sim d^{\pi^k}_h(s^k_h)}\Big[\sum_{h'=h}^H\hat{\mathbb P}_{k,h'}\hat V_{k,h'+1}(s^k_{h'},a^k_{h'})-\mathbb P_{h'}\hat V_{k,h'+1}(s^k_{h'},a^k_{h'})+3\hat\beta^E\big\|\phi(s^k_{h'},a^k_{h'})\big\|_{\hat\Lambda^{-1}_{k,h'}}\Big]=\mathbb E_{\tau^k_h\sim d^{\pi^k}_h(s^k_h)}\Big[\sum_{h'=h}^H\big\langle(\hat\mu_{k,h'}-\mu_{h'})^\top\hat V_{k,h'+1},\phi(s^k_{h'},a^k_{h'})\big\rangle+3\hat\beta^E\big\|\phi(s^k_{h'},a^k_{h'})\big\|_{\hat\Lambda^{-1}_{k,h'}}\Big]\le\mathbb E_{\tau^k_h\sim d^{\pi^k}_h(s^k_h)}\Big[\sum_{h'=h}^H4\hat\beta^E\big\|\phi(s^k_{h'},a^k_{h'})\big\|_{\hat\Lambda^{-1}_{k,h'}}\Big],$$
where equality $(a)$ holds since we can expand $\hat V_{k,h+1}(\cdot)$ recursively down to $\hat V_{k,H+1}(\cdot)$, taking the expectation over the trajectory distribution $d^{\pi^k}_h(s^k_h)$ of $\tau^k_h=(s^k_h,a^k_h,s^k_{h+1},a^k_{h+1},\dots,s^k_H,a^k_H)$, and the last inequality holds since $\mu_{h'}\in\hat{\mathcal C}^E_{k,h'}$ for any $h\le h'\le H$ under $\Psi^E_{k,h}$.
Since $\hat V_{k,h}(s^k_h)\le H$ and $\hat V_{k,h}(s^k_h)\ge\tilde V_h(s^k_h;r^k)\ge0$ by Lemma A.7 under $\Psi^E_{k,h}$, the conclusion follows.

Lemma A.18. Fix $\delta>0$. In Algorithm 1, under $\Psi^E_1$, with probability at least $1-2\delta$,
$$\sum_{k=1}^K\sum_{h=1}^H\mathbb P_h\hat V_{k,h+1}(s^k_h,a^k_h)\le4H\hat\beta^E\sqrt{\sum_{k=1}^K\sum_{h=1}^H\hat\sigma^2_{k,h}}\sqrt{\sum_{k=1}^K\sum_{h=1}^H\min\big\{\|\phi(s^k_h,a^k_h)\|^2_{\hat\Lambda^{-1}_{k,h}},1\big\}}+(H^2+H)\sqrt{2T\log(H/\delta)}.\tag{36}$$

Proof. First,
$$\sum_{k=1}^K\sum_{h=1}^H\mathbb P_h\hat V_{k,h+1}(s^k_h,a^k_h)=\sum_{k=1}^K\sum_{h=1}^H\hat V_{k,h+1}(s^k_{h+1})+\big[\mathbb P_h\hat V_{k,h+1}(s^k_h,a^k_h)-\hat V_{k,h+1}(s^k_{h+1})\big]\le\sum_{k=1}^K\sum_{h=1}^H\hat V_{k,h+1}(s^k_{h+1})+H\sqrt{2T\log(H/\delta)}$$
holds with probability $1-\delta$ by the Azuma-Hoeffding inequality, as stated in Lemma C.3. Notice that
$$\hat V_{k,h}(s^k_h)=\min\big\{\hat Q_{k,h}(s^k_h,a^k_h),H\big\}=\min\big\{3\hat\beta^E\|\phi(s^k_h,a^k_h)\|_{\hat\Lambda^{-1}_{k,h}}+\langle\hat\mu^\top_{k,h}\hat V_{k,h+1},\phi(s^k_h,a^k_h)\rangle,H\big\}\le\min\big\{4\hat\beta^E\|\phi(s^k_h,a^k_h)\|_{\hat\Lambda^{-1}_{k,h}}+\hat V_{k,h+1}(s^k_{h+1})+\big[\mathbb P_h\hat V_{k,h+1}(s^k_h,a^k_h)-\hat V_{k,h+1}(s^k_{h+1})\big],H\big\},$$
where the inequality holds under $\Psi^E_1$. For fixed $h$, we can apply this recursively and obtain
$$\sum_{k=1}^K\hat V_{k,h}(s^k_h)\le\sum_{k=1}^K\min\Big\{\sum_{h'=h}^H4\hat\beta^E\|\phi(s^k_{h'},a^k_{h'})\|_{\hat\Lambda^{-1}_{k,h'}}+\big[\mathbb P_{h'}\hat V_{k,h'+1}(s^k_{h'},a^k_{h'})-\hat V_{k,h'+1}(s^k_{h'+1})\big],H\Big\}$$
$$\le\sum_{k=1}^K\sum_{h'=h}^H4\hat\beta^E\hat\sigma_{k,h'}\min\big\{\|\phi(s^k_{h'},a^k_{h'})\|_{\hat\Lambda^{-1}_{k,h'}},1\big\}+\sum_{k=1}^K\sum_{h'=h}^H\mathbb P_{h'}\hat V_{k,h'+1}(s^k_{h'},a^k_{h'})-\hat V_{k,h'+1}(s^k_{h'+1})$$
$$\le4\hat\beta^E\sqrt{\sum_{k=1}^K\sum_{h'=h}^H\hat\sigma^2_{k,h'}}\sqrt{\sum_{k=1}^K\sum_{h'=h}^H\min\big\{\|\phi(s^k_{h'},a^k_{h'})\|^2_{\hat\Lambda^{-1}_{k,h'}},1\big\}}+H\sqrt{2T\log(H/\delta)},$$
where the inequalities hold with probability $1-\delta$ by the Cauchy-Schwarz inequality and Lemma C.3. Thus,
$$\sum_{k=1}^K\sum_{h=1}^H\mathbb P_h\hat V_{k,h+1}(s^k_h,a^k_h)\le\sum_{k=1}^K\sum_{h=1}^H\hat V_{k,h+1}(s^k_{h+1})+H\sqrt{2T\log(H/\delta)}\le4H\hat\beta^E\sqrt{\sum_{k,h}\hat\sigma^2_{k,h}}\sqrt{\sum_{k,h}\min\big\{\|\phi(s^k_h,a^k_h)\|^2_{\hat\Lambda^{-1}_{k,h}},1\big\}}+(H^2+H)\sqrt{2T\log(H/\delta)}$$
holds with probability $1-2\delta$.

Lemma A.19. In Algorithm 1, under $\Psi^E_1$, with probability $1-2\delta$, we have
$$\sum_{k=1}^K\sum_{h=1}^H\hat\sigma^2_{k,h}\le2HT+12H^3d^9\log\big(1+2d^7/(\lambda H)\big)+2d^{3/2}H^2\sqrt\lambda+2d^2(H^2+H)\sqrt{2T\log(H/\delta)}+8d^5H(1+2H)^2(\hat\beta^E)^2\log\big(1+K/(Hd\lambda)\big).$$

Proof (Lemma A.19). From Algorithm 1, recall the definition $\hat\sigma_{k,h}:=\max\big\{w_{k,h},\sqrt{(d^2/H)W_{k,h}}\big\}$. Thus we can write
$$\sum_{k=1}^K\sum_{h=1}^H\hat\sigma^2_{k,h}\le\underbrace{\sum_{k=1}^K\sum_{h=1}^Hw^2_{k,h}}_{\text{i}}+\underbrace{\frac{d^2}H\sum_{k=1}^K\sum_{h=1}^HW_{k,h}}_{\text{ii}}.$$
To bound term i, we utilize the argument on elliptical potentials, i.e., Lemma C.7. For fixed $h\in[H]$, set $x_k=\tilde\sigma^{-1}_{k,h}\phi(s^k_h,a^k_h)$ in Lemma C.7. Then, for $C=1/d^3$, during the $K$ episodes there are at most $3d\log\big(1+d/(\lambda H\log(1+C^2))\big)/\log(1+C^2)$ episodes in which $\|\tilde\sigma^{-1}_{k,h}\phi(s^k_h,a^k_h)\|_{\tilde\Lambda^{-1}_{k,h}}\ge1/d^3$ for this fixed $h$. Thus, there are at most $3Hd\log\big(1+d/(\lambda H\log(1+C^2))\big)/\log(1+C^2)$ episodes in which there exists $h'\in[H]$ such that $\|\tilde\sigma^{-1}_{k,h'}\phi(s^k_{h'},a^k_{h'})\|_{\tilde\Lambda^{-1}_{k,h'}}>1/d^3$. In particular, $w^2_{k,h}=Hd^3$ during these episodes. In summary, we obtain
$$\text{i}=\sum_{k=1}^K\sum_{h=1}^Hw^2_{k,h}\le\sum_{k=1}^K\sum_{h=1}^HH+H\cdot Hd^3\cdot\frac{3Hd}{\log(1+1/d^6)}\log\Big(1+\frac d{\lambda H\log(1+1/d^6)}\Big)\le HT+6H^3d^9\log\big(1+2d^7/(\lambda H)\big),$$
where the last inequality holds since $\log(1+x)\ge x/2$ for $x\in(0,1]$, so that $1/\log(1+1/d^6)\le2d^6$.
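The episode-counting step for term i above is an instance of the elliptical potential argument (Lemma C.7): no matter how the feature vectors are chosen, the elliptical norm $\|x_k\|_{\Lambda_k^{-1}}$ can exceed a fixed threshold $C$ only $O(d\log(\cdot))$ times. A numerical sketch with hypothetical dimensions, threshold, and regularizer (not the algorithm's actual values); the counting bound below follows the reconstructed form of Lemma C.7:

```python
import numpy as np

rng = np.random.default_rng(3)
d, K, lam, C = 4, 2000, 0.01, 0.5

Lam = lam * np.eye(d)
big_rounds = 0                 # rounds in which the elliptical norm exceeds C
for _ in range(K):
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))             # ||x||_2 <= 1
    if np.sqrt(x @ np.linalg.solve(Lam, x)) > C:
        big_rounds += 1
    Lam += np.outer(x, x)                        # rank-one update of the Gram matrix

# Counting bound in the spirit of Lemma C.7 (constants as reconstructed above).
bound = 3 * d / np.log(1 + C ** 2) * np.log(1 + d / (lam * np.log(1 + C ** 2)))
print(big_rounds, round(float(bound), 1))
```

The observed count stays far below the bound: once a direction has been observed a few times, the Gram matrix grows along it and the elliptical norm in that direction shrinks.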
To bound term ii, we can write
$$\text{ii}=\frac{d^2}H\sum_{k=1}^K\sum_{h=1}^HW_{k,h}=d^2\sum_{k=1}^K\sum_{h=1}^H\min\Big\{\big\langle\hat\mu^\top_{k,h}\hat V_{k,h+1},\phi(s^k_h,a^k_h)\big\rangle+\hat\beta^E\|\phi(s^k_h,a^k_h)\|_{\hat\Lambda^{-1}_{k,h}}+\zeta,\ H\Big\}$$
$$\le d^2HK\zeta+d^2\sum_{k=1}^K\sum_{h=1}^H\min\big\{\hat\beta^E\|\phi(s^k_h,a^k_h)\|_{\hat\Lambda^{-1}_{k,h}},H\big\}+d^2\sum_{k=1}^K\sum_{h=1}^H\big\langle\mu_h^\top\hat V_{k,h+1},\phi(s^k_h,a^k_h)\big\rangle+\big\langle(\hat\mu_{k,h}-\mu_h)^\top\hat V_{k,h+1},\phi(s^k_h,a^k_h)\big\rangle$$
$$\le d^2HK\zeta+2d^2\sum_{k=1}^K\sum_{h=1}^H\min\big\{\hat\beta^E\|\phi(s^k_h,a^k_h)\|_{\hat\Lambda^{-1}_{k,h}},H\big\}+d^2\sum_{k=1}^K\sum_{h=1}^H\mathbb P_h\hat V_{k,h+1}(s^k_h,a^k_h).$$
For the middle term,
$$2d^2\sum_{k=1}^K\sum_{h=1}^H\min\big\{\hat\beta^E\|\phi(s^k_h,a^k_h)\|_{\hat\Lambda^{-1}_{k,h}},H\big\}\le2d^2\hat\beta^E\sum_{k=1}^K\sum_{h=1}^H\hat\sigma_{k,h}\min\big\{\|\phi(s^k_h,a^k_h)\|_{\hat\Lambda^{-1}_{k,h}},1\big\}\le2d^2\hat\beta^E\sqrt{\sum_{k,h}\hat\sigma^2_{k,h}}\sqrt{\sum_{k,h}\min\big\{\|\phi(s^k_h,a^k_h)\|^2_{\hat\Lambda^{-1}_{k,h}},1\big\}}.$$
By Lemma A.18, we have
$$d^2\sum_{k=1}^K\sum_{h=1}^H\mathbb P_h\hat V_{k,h+1}(s^k_h,a^k_h)\le4d^2H\hat\beta^E\sqrt{\sum_{k,h}\hat\sigma^2_{k,h}}\sqrt{\sum_{k,h}\min\big\{\|\phi(s^k_h,a^k_h)\|^2_{\hat\Lambda^{-1}_{k,h}},1\big\}}+d^2(H^2+H)\sqrt{2T\log(H/\delta)}.\tag{44}$$
In addition, for the summation of exploration bonuses from Lemma A.17, we have
$$\sum_{k=1}^K\min\Big\{\sum_{h=1}^H4\hat\beta^E\|\phi(s^k_h,a^k_h)\|_{\hat\Lambda^{-1}_{k,h}},H\Big\}=\sum_{k=1}^K\min\Big\{\sum_{h=1}^H4\hat\beta^E\hat\sigma_{k,h}\|\hat\sigma^{-1}_{k,h}\phi(s^k_h,a^k_h)\|_{\hat\Lambda^{-1}_{k,h}},H\Big\}\le4\underbrace{\sum_{k=1}^K\sum_{h=1}^H\hat\beta^E\hat\sigma_{k,h}\min\big\{\|\hat\sigma^{-1}_{k,h}\phi(s^k_h,a^k_h)\|_{\hat\Lambda^{-1}_{k,h}},1\big\}}_{\text{I}},$$
where the inequality holds since $\hat\beta^E\hat\sigma_{k,h}\ge\sqrt{Hd}\cdot\sqrt H\ge H$. To further bound I,
$$\text{I}\le\hat\beta^E\sqrt{\sum_{k=1}^K\sum_{h=1}^H\hat\sigma^2_{k,h}}\sqrt{\sum_{k=1}^K\sum_{h=1}^H\min\big\{\|\hat\sigma^{-1}_{k,h}\phi(s^k_h,a^k_h)\|^2_{\hat\Lambda^{-1}_{k,h}},1\big\}}\le\hat\beta^E\sqrt{\sum_{k=1}^K\sum_{h=1}^H\hat\sigma^2_{k,h}}\cdot\sqrt{2Hd\log\big(1+K/(\lambda Hd)\big)},$$
where the first inequality holds by the Cauchy-Schwarz inequality and the second by Lemma C.6 together with the fact that $\|\hat\sigma^{-1}_{k,h}\phi(s^k_h,a^k_h)\|_2\le1/\sqrt H$. Now we denote by $\Xi$ the event that the conclusion of Lemma A.21 holds, which is a high-probability event.

Lemma A.22. Under $\Psi^E_1\cap\Xi\cap\Phi$, we have $\mathbb E_{s\sim\mu}\big[\tilde V_1(s;b)\big]\le\tilde O\big(\sqrt{H^4d^2/K}\big)$, where $b=\{b_h\}_{h\in[H]}$ is the UCB bonus defined in Algorithm 2.
Proof (Lemma A.22). Notice that $\hat\Lambda_{K+1,h}\succeq\hat\Lambda_{k,h}$, $\hat\beta^E=\tilde O(d\sqrt H)$, and $\hat\beta^P=\tilde O(\sqrt{Hd})$ for any $k\in[K]$ and any $h\in[H]$. We therefore have
$$\hat\beta^E\|\phi(\cdot,\cdot)\|_{\hat\Lambda^{-1}_{k,h}}\ge c\sqrt d\,\hat\beta^P\|\phi(\cdot,\cdot)\|_{\hat\Lambda^{-1}_{K+1,h}},$$
where $c>0$ is a constant, which further implies $r_{k,h}(\cdot,\cdot)\ge c\sqrt d\,b_h(\cdot,\cdot)$ in LSVI-RFE. Subsequently, we have
$$c\sqrt d\,\mathbb E_{s\sim\mu}\big[\tilde V_1(s;b)\big]=\mathbb E_{s\sim\mu}\big[\tilde V_1(s;c\sqrt d\cdot b)\big]\le\frac1K\sum_{k=1}^K\mathbb E_{s\sim\mu}\big[\tilde V_1(s;r^k)\big]=\frac1K\Big\{\sum_{k=1}^K\tilde V_1(s^k_1;r^k)+\sum_{k=1}^K\Big(\mathbb E_{s\sim\mu}\big[\tilde V_1(s;r^k)\big]-\tilde V_1(s^k_1;r^k)\Big)\Big\}$$
$$\le\frac1K\sum_{k=1}^K\tilde V_1(s^k_1;r^k)+\frac HK\sqrt{2H\log(1/\delta)}\le\frac1K\sum_{k=1}^K\hat V_{k,1}(s^k_1)+\frac HK\sqrt{2H\log(1/\delta)}\le\tilde O\big(\sqrt{H^4d^3/K}\big),$$
where the second inequality holds by Lemma A.21 under $\Xi$, the third by Lemma A.7 under $\Psi^E_1$, and the last by Lemma A.20 under $\Psi^E_1\cap\Phi$. Thus, $\mathbb E_{s\sim\mu}[\tilde V_1(s;b)]\le\tilde O\big(\sqrt{H^4d^2/K}\big)$.

Now we are ready to prove the main theorem.

Proof of Theorem 4.2. It suffices to prove the conclusion under the event $\Psi^E_1\cap\Psi^P_1\cap\Xi\cap\Phi$, which holds with probability at least $1-7\delta$ by Lemma A.13 and Lemma A.21 together with a union bound. We have
$$\mathbb E_{s_1\sim\mu}\big[V_1(s_1;r)-V^\pi_1(s_1;r)\big]\le\mathbb E_{s_1\sim\mu}\big[\hat V_1(s_1)-V^\pi_1(s_1;r)\big]$$
$$=\mathbb E_{s_1\sim\mu}\Big[\min\big\{r_1(s_1,\pi(s_1))+\langle\hat\mu^\top_{K+1,1}\hat V_2,\phi(s_1,\pi(s_1))\rangle+b_1(s_1,\pi(s_1)),\ H\big\}-r_1(s_1,\pi(s_1))-\mathbb P_1V^\pi_2(s_1,\pi(s_1);r)\Big]$$
$$=\mathbb E_{s_1\sim\mu}\Big[\min\big\{H,\ \langle\hat\mu^\top_{K+1,1}\hat V_2,\phi(s_1,\pi(s_1))\rangle-\langle\mu^\top_1\hat V_2,\phi(s_1,\pi(s_1))\rangle+b_1(s_1,\pi(s_1))+\mathbb P_1\hat V_2(s_1,\pi(s_1))-\mathbb P_1V^\pi_2(s_1,\pi(s_1);r)\big\}\Big]$$
$$\le\mathbb E_{s_1\sim\mu}\Big[\min\big\{H,\ \big\|(\hat\mu_{K+1,1}-\mu_1)^\top\hat V_2\big\|_{\hat\Lambda_{K+1,1}}\|\phi(s_1,\pi(s_1))\|_{\hat\Lambda^{-1}_{K+1,1}}+b_1(s_1,\pi(s_1))+\mathbb P_1\hat V_2(s_1,\pi(s_1))-\mathbb P_1V^\pi_2(s_1,\pi(s_1);r)\big\}\Big]$$
$$\le\mathbb E_{s_1\sim\mu}\Big[\min\big\{H,\ 2b_1(s_1,\pi(s_1))+\mathbb P_1\hat V_2(s_1,\pi(s_1))-\mathbb P_1V^\pi_2(s_1,\pi(s_1);r)\big\}\Big]$$
$$=\mathbb E_{s_1\sim\mu,\ s_2\sim\mathbb P(\cdot\mid s_1,\pi(s_1))}\Big[\min\big\{H,\ 2b_1(s_1,\pi(s_1))+\hat V_2(s_2)-V^\pi_2(s_2;r)\big\}\Big]\le\mathbb E_{\tau\sim d^\pi}\Big[\min\Big\{\sum_{h=1}^H2b_h(s_h,a_h),\ H\Big\}\Big]\le2\,\mathbb E_{s\sim\mu}\big[\tilde V^\pi_1(s;b)\big],$$
where the first inequality holds by Lemma A.14, the second by the Cauchy-Schwarz inequality, the third since $\mu_1\in\hat{\mathcal C}^P_1$ under $\Psi^P_1$, the fourth by recursive decomposition, and the last by the definition of $\tilde V^\pi_1(\cdot;b)$. In addition, we have
$$\mathbb E_{s\sim\mu}\big[\tilde V^\pi_1(s;b)\big]\le\mathbb E_{s\sim\mu}\big[\tilde V_1(s;b)\big]\le\tilde O\big(\sqrt{H^4d^2/K}\big),$$
where the first inequality holds by the definition of $\tilde V_1(s;b)$, and the second by Lemma A.22 under $\Psi^E_1\cap\Xi\cap\Phi$. If we ignore logarithmic terms and take $K=m(H^4d^2/\epsilon^2)$ for a sufficiently large constant $m>0$, we have
$$\mathbb E_{s_1\sim\mu}\big[V_1(s_1;r)-V^\pi_1(s_1;r)\big]\le\tilde O\big(\sqrt{H^4d^2/K}\big)\le\epsilon.$$
Thus, we need $K=\tilde O(H^4d^2/\epsilon^2)$ episodes to output an $\epsilon$-optimal policy $\pi$ when $\epsilon$ is small enough.

B PROOF OF LOWER BOUND (THEOREM 6.1)

In this section, we provide the proof of Theorem 6.1 in the manuscript. Notice that if the reward function is given in the exploration phase, the RFE setting degenerates to the probably approximately correct (PAC) RL setting (Dann et al. (2019)). Thus, it suffices to prove a lower bound for PAC RL, since an algorithm for RFE also works for PAC RL by ignoring the reward function in the exploration phase. Our proof is inspired by the proof of Theorem 3 in Appendix C of Chen et al. (2021), which provides a lower bound for RFE in linear mixture MDPs. In particular, we connect the lower bound for RFE with regret minimization in linear MDPs. First, we construct a hard-to-learn MDP $M=\{\mathcal S,\mathcal A,H,\{\mathbb P_h\}_h,\{r_h\}_h,\nu\}$ in Lemma B.1 such that any algorithm that runs $K$ episodes incurs regret at least $\Omega(dH\sqrt{HK})$, as shown in Lemma B.1.

Lemma B.1 (Remark 23 in Zhou et al. (2021)). Suppose $d\ge4$, $H\ge3$, and $K\ge\max\{(d-1)^2H/2,\ (d-1)/(32H(d-1))\}$. Then there exists an episodic linear MDP $M=\{\mathcal S,\mathcal A,H,\{\mathbb P_h\}_h,\{r_h\}_h,\nu\}$, parameterized by $\{\mu_h\}_{h\in[H]}$ and $\{\theta_h\}_{h\in[H]}$ and satisfying Definition 3.1, such that for any algorithm, the expected regret is lower bounded as
$$\mathbb E_{s^k_1\sim\nu}\Big[\sum_{k=1}^KV^*(s^k_1)-V^{\pi_k}(s^k_1)\Big]\ge\Omega(dH\sqrt T).$$

By the definition of $V^*_1$, we choose the optimal policy at each stage. Since $\max_{a\in\mathcal A}\langle\mu_h,a\rangle=(d-1)\Delta$, we get
$$V^*_1(x_1)=\sum_{h=1}^H(H-h)\big((d-1)\Delta+\iota\big)\big(1-(d-1)\Delta-\iota\big)^{h-1}\le H\cdot H\cdot(d\Delta),\tag{53}$$
where the inequality holds by $0<1-(d-1)\Delta-\iota<1$ and $\Delta>\iota$. Since the inequality holds for every fixed $x_1$, we also have
$$\mathbb E_{x_1\sim\mu}\big[V^*_1(x_1)-V^{\pi_k}_1(x_1)\big]\le\mathbb E_{x_1\sim\mu}\big[V^*_1(x_1)\big]\le dH^2\Delta\le\frac{dH}{4\sqrt2}\sqrt{\frac H{K_2}},$$
where the last inequality holds by $\iota=1/H$. On the one hand, by Lemma B.1, we have
$$\sum_{k=1}^{K_2}\mathbb E_{x_1\sim\nu}\big[V^*_1(x_1)-V^{\pi_k}_1(x_1)\big]\ge c_1dH\sqrt{HK_2},$$
where $\pi_k$ denotes the policy of ALG$_2$ in episode $k$ and $c_1$ is a constant. On the other hand, by Lemma B.2, we have
$$\sum_{k=1}^{K_1}\mathbb E_{x_1\sim\nu}\big[V^*_1(x_1)-V^{\pi_k}_1(x_1)\big]\le K_1\cdot\frac{dH}{4\sqrt2}\sqrt{\frac H{K_2}}=\frac{dH}{4\sqrt2}\sqrt{\frac{HK_1}c}=\frac{dH}{4\sqrt2\,c}\sqrt{HK_2}.$$
By choosing $c=\max\{1/(2\sqrt2c_1),\,2\}$, we know that for ALG$_2$ under $M$,
$$\sum_{k=K_1+1}^{K_2}\mathbb E_{x_1\sim\nu}\big[V^*_1(x_1)-V^{\pi_k}_1(x_1)\big]\ge\Big(c_1-\frac1{4\sqrt2\,c}\Big)dH\sqrt{HK_2}\ge\frac{c_1}2dH\sqrt{HK_2}=\frac{c_1}2dH\sqrt{cHK_1}.$$
Because $\pi_k=\pi(K_1)$ in ALG$_2$ for $K_1+1\le k\le K_2$ and $K_2-(K_1+1)+1=(c-1)K_1$, we have
$$\mathbb E_{x_1\sim\nu}\big[V^*_1(x_1)-V^{\pi(K_1)}_1(x_1)\big]\ge\frac{c_1\sqrt c}{2(c-1)}dH\sqrt{H/K_1}.$$
Since ALG$_1$ outputs an $\epsilon$-optimal policy $\pi(K_1)$ with probability at least $1-\delta$, we also have
$$\mathbb E_{x_1\sim\nu}\big[V^*_1(x_1)-V^{\pi(K_1)}_1(x_1)\big]\le(1-\delta)\epsilon+\delta H.$$
By setting $\delta$ satisfying $0<\delta<\min\{1/H,\ 2\sqrt2cc_1/(c-1)\}$, we obtain $K_1\ge Cd^2H^3/\epsilon^2$ for some positive constant $C$, which completes our proof.
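The gap between the Theorem 4.2 upper bound and the Theorem 6.1 lower bound is a single factor of $H$, which is easy to see by evaluating both rates. A sketch with constants and logarithmic factors suppressed (the values of $H$, $d$, $\epsilon$ below are arbitrary examples):

```python
import math

def episodes_upper(H, d, eps, c=1.0):
    """K = O~(H^4 d^2 / eps^2): the LSVI-RFE upper bound, logs/constants dropped."""
    return math.ceil(c * H ** 4 * d ** 2 / eps ** 2)

def episodes_lower(H, d, eps, c=1.0):
    """Omega(H^3 d^2 / eps^2): the Theorem 6.1 lower bound, constants dropped."""
    return math.ceil(c * H ** 3 * d ** 2 / eps ** 2)

H, d, eps = 10, 8, 0.5
up, lo = episodes_upper(H, d, eps), episodes_lower(H, d, eps)
print(up, lo, up // lo)   # the two rates differ by exactly one factor of H
```

Both rates share the $d^2/\epsilon^2$ dependence, so the remaining open question is only the extra factor of $H$ in the upper bound.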

C AUXILIARY LEMMAS C.1 CONCENTRATION INEQUALITY

This subsection collects the concentration inequalities encountered in our proofs.

Lemma C.1 (Hoeffding inequality for vector-valued martingales, Theorem 1 in Abbasi-Yadkori et al. (2011)). Let $\{\mathcal G_t\}_{t=1}^\infty$ be a filtration and $\{x_t,\eta_t\}_{t\ge1}$ be a stochastic process such that $x_t\in\mathbb R^d$ is $\mathcal G_t$-measurable and $\eta_t\in\mathbb R$ is $\mathcal G_{t+1}$-measurable. Denote $Z_t=\lambda I+\sum_{i=1}^tx_ix_i^\top$ for $t\ge1$ and $Z_0=\lambda I$. If $\|x_t\|_2\le L$ and $\eta_t$ satisfies $\mathbb E[\eta_t\mid\mathcal G_t]=0$ and $|\eta_t|\le R$ for all $t\ge1$, then for any $0<\delta<1$, with probability at least $1-\delta$,
$$\forall t>0,\qquad\Big\|\sum_{i=1}^tx_i\eta_i\Big\|_{Z^{-1}_t}\le R\sqrt{d\log\big(1+tL^2/(d\lambda)\big)+\log(1/\delta)}.$$

Lemma C.2 (Bernstein inequality for vector-valued martingales). In the same setting as Lemma C.1, if $\eta_t$ additionally satisfies $\mathbb E[\eta_t^2\mid\mathcal G_t]\le\sigma^2$ and $\big|\eta_t\min\{\|x_t\|_{Z^{-1}_{t-1}},1\}\big|\le R$ for all $t\ge1$, then for any $0<\delta<1$, with probability at least $1-\delta$,
$$\forall t>0,\qquad\Big\|\sum_{i=1}^tx_i\eta_i\Big\|_{Z^{-1}_t}\le8\sigma\sqrt{d\log\big(1+tL^2/(d\lambda)\big)\log\big(4t^2/\delta\big)}+4R\log\big(4t^2/\delta\big),$$
where $Z_t=\lambda I+\sum_{i=1}^tx_ix_i^\top$ for $t\ge1$ and $Z_0=\lambda I$.

Lemma C.3 (Azuma-Hoeffding inequality). Let $\{x_i\}_{i=1}^n$ be a martingale difference sequence with respect to a filtration $\{\mathcal G_i\}_{i=1}^{n+1}$ such that $|x_i|\le M$ almost surely; that is, $x_i$ is $\mathcal G_{i+1}$-measurable and $\mathbb E[x_i\mid\mathcal G_i]=0$ a.s. Then, for any $0<\delta<1$, with probability at least $1-\delta$,
$$\sum_{i=1}^nx_i\le M\sqrt{2n\log(1/\delta)}.$$
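The Azuma-Hoeffding bound of Lemma C.3 can be sanity-checked by Monte Carlo. The sketch below uses i.i.d. $\pm M$ increments, a simple special case of a bounded martingale difference sequence, and counts how often the deviation bound is violated; sample sizes and the seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
n, M, trials, delta = 1000, 1.0, 500, 0.05

# Deviation bound M * sqrt(2 n log(1/delta)) from Lemma C.3.
bound = M * np.sqrt(2 * n * np.log(1 / delta))

# Bounded martingale differences: symmetric +/- M coin flips.
sums = rng.choice([-M, M], size=(trials, n)).sum(axis=1)
violations = np.mean(sums > bound)
print(float(violations))
```

The empirical violation frequency is far below $\delta$, reflecting that Hoeffding-type bounds are conservative for light-tailed increments; this slack is exactly what the Bernstein-type Lemma C.2 recovers when a variance bound is available.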

C.2 LINEAR MDP PROPERTY

This subsection gives some intermediate results about the estimated parameter $\hat\mu_{k,h}$ in the exploration phase.

Lemma C.4. In Algorithm 1, for any $k \in [K+1]$ and any $h \in [H]$, we have:
$$\hat\mu_{k,h} - \mu_h = \hat\Lambda_{k,h}^{-1}\bigg[-\lambda\mu_h + \sum_{i=1}^{k-1} \hat\sigma_{i,h}^{-2}\,\phi\big(s_h^i, a_h^i\big)\big(\epsilon_h^i\big)^\top\bigg].$$
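The decomposition in Lemma C.4 is an algebraic identity for variance-weighted ridge regression. The sketch below verifies it numerically for a generic weighted regression problem (in the paper, $\mu_h$ maps features to a measure over states; here it is just a $d \times m$ matrix, and all names are illustrative assumptions rather than the paper's algorithm).

```python
import numpy as np

# Numeric check of the Lemma C.4-style decomposition for weighted ridge:
# mu_hat - mu = Lambda^{-1} ( -lambda * mu + sum_i sigma_i^{-2} phi_i eps_i^T ),
# where Lambda = lambda I + sum_i sigma_i^{-2} phi_i phi_i^T.
rng = np.random.default_rng(1)
d, m, n, lam = 4, 3, 50, 0.1

mu = rng.normal(size=(d, m))            # ground-truth parameter
phi = rng.normal(size=(n, d))           # features phi(s_i, a_i)
sigma2 = rng.uniform(0.5, 2.0, size=n)  # per-sample variance estimates
eps = rng.normal(size=(n, m))           # noise eps_i

y = phi @ mu + eps                      # regression targets
w = 1.0 / sigma2                        # weights sigma_i^{-2}

Lam = lam * np.eye(d) + (phi * w[:, None]).T @ phi        # hat Lambda
mu_hat = np.linalg.solve(Lam, (phi * w[:, None]).T @ y)   # weighted ridge solution

# Right-hand side of the identity:
rhs = np.linalg.solve(Lam, -lam * mu + (phi * w[:, None]).T @ eps)
assert np.allclose(mu_hat - mu, rhs)
```

The identity follows because $\sum_i \sigma_i^{-2}\phi_i\phi_i^\top \mu = (\Lambda - \lambda I)\mu$, so the signal term cancels and only the regularization bias and the noise term remain.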



For the time-inhomogeneous case, the bound of UCRL-RFE+ Zhang et al. (2021) degrades by a factor of $H$. The exact forms of $\hat\beta^E$, $\hat\beta^{P(1)}$, $\hat\beta^{P(2)}$ are given in Eq. (20), (30) and (32), respectively, in Appendix A. $\nu$ denotes the initial state distribution.



where $W$ is a constant. These assumptions are mild and are common in the existing literature Jin et al. (2020b).

Reward-Free Exploration (RFE). We consider the following RFE model, which has been considered in previous literature Jin et al. (2020a); Kaufmann et al. (2021); Ménard et al. (2021); Zhang



2 and the lower bound in Theorem 6.1 are both sharper than those in existing works Wang et al. (2020a); Zanette et al. (2020c); Zhang et al. (2021); Chen et al. (2021).

Comparison of reward-free exploration in episodic RL with linear function approximation.
Chen et al. (2021): sample complexity $\tilde O\big(H^3 d(H+d)/\epsilon^2\big)$; minimax optimal: No.
Lower bound, Chen et al. (2021): $\Omega\big(H^3 d^2/\epsilon^2\big)$.

Thus, LSVI-RFE estimates the variance upper bound of the value function $\hat V_{k,h}$ by $W_{k,h}$ in Line 16, and dynamically adjusts $w_{k,h}$ in Lines 19-23 to keep $\|\hat\sigma^{-1}

Algorithm 1 Least-Squares Value Iteration-RFE (LSVI-RFE): Exploration Phase
Require: regularization parameter $\lambda$, exploration radius $\hat\beta^E$
1: for step $h = H, \dots, 1$ do

Solution to the weighted regression

$O(d^2H + d|A|HK)$ and $O(d^2|A|HK^2)$, respectively, where $K$ is the number of episodes that Algorithm 1 has run. By Theorem 4.2, $K$ can be taken of order $\tilde O\big(H^4 d^2/\epsilon^2\big)$ to output an $\epsilon$-optimal policy with high probability.

Zeyu Jia, Lin Yang, Csaba Szepesvari, and Mengdi Wang. Model-based reinforcement learning with value-targeted regression. In Learning for Dynamics and Control, pp. 666-686. PMLR, 2020.
Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pp. 1704-1713. PMLR, 2017.
Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pp. 2010-2020. PMLR, 2020.
Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, and Alekh Agarwal. Model-free representation learning and exploration in low-rank MDPs. arXiv preprint arXiv:2102.07035, 2021.
Mirco Mutti, Riccardo De Santi, Emanuele Rossi, Juan Felipe Calderon, Michael Bronstein, and Marcello Restelli. Provably efficient causal model-based reinforcement learning for systematic generalization. arXiv preprint arXiv:2202.06545, 2022.



Proof. (Lemma A.11) For $i \in [K]$, let $G$

$\phi(\cdot,\cdot)\rangle + \hat\beta^P \|\phi(\cdot,\cdot)\|$. Thus, we have for any $(s,a) \in S \times A$ that
$$Q_h(s,a;r) + \hat Q_{k,h}(s,a) - \hat Q_h(s,a) = r_h(s,a) + P_h V_{h+1}(s,a;r) + r_{k,h}(s,a) + \hat P_{k,h}\hat V_{k,h+1}(s,a) + \hat\beta^E\|\phi(s,a)\|$$

$H$ holds for any $k \in [K]$ and $h \in [H]$, we can write

$\frac{1}{2}\, b_h\big(s_h, \pi(s_h)\big), \quad H$

where the first inequality holds by Lemma B.2, and the second inequality holds by $K_2 = cK_1$. Combining Eq. (54) and Eq. (55) gives

Lemma C.2 (Bernstein inequality for vector-valued martingales, Theorem 7.1 in Hu et al. (2022)). Let $\{G_t\}_{t=1}^{\infty}$ be a filtration and $\{x_t, \eta_t\}_{t \ge 1}$ be a stochastic process such that $x_t \in \mathbb{R}^d$ is $G_t$-measurable and $\eta_t \in \mathbb{R}$ is $G_{t+1}$-measurable. Denote $Z_t = \lambda I + \sum_{i=1}^t x_i x_i^\top$ for $t \ge 1$ and $Z_0 = \lambda I$. If $\|x_t\|_2 \le L$, and $\eta_t$ satisfies $\mathbb{E}[\eta_t \mid G_t] = 0$, $\mathbb{E}[\eta_t^2 \mid G_t] \le \sigma^2$ and $|\eta_t| \le R$ for all $t \ge 1$, then for any $0 < \delta < 1$, with probability at least $1-\delta$ we have:
$$\forall t > 0, \quad \bigg\|\sum_{i=1}^t x_i \eta_i\bigg\|_{Z_t^{-1}} \le 8\sigma\sqrt{d\log\big(1 + tL^2/(d\lambda)\big)\log\big(4t^2/\delta\big)} + 4R\log\big(4t^2/\delta\big).$$

(56)

Proof. (Lemma C.4) We start from the closed-form solution of $\hat\mu_{k,h}$:

ACKNOWLEDGEMENTS

This work is supported by the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grants 2020AAA0108400 and 2020AAA0108403, the Tsinghua University Initiative Scientific Research Program, and the Tsinghua Precision Medicine Foundation under Grant 10001020109.

Supplementary Materials

Here the first inequality holds similarly to Eq. (24), and the second inequality holds by Eq. (35) and $\zeta = H\sqrt{\lambda}/(2K\sqrt{d})$.

Lemma A.13. Set $\hat\beta^P = \hat\beta^{P(1)} + \hat\beta^{P(2)}$ as in Eq. (16). For any $\delta \in (0,1)$, with probability at least $1-3\delta$, $\Psi^E_1 \cap \Psi^P_1$ holds, i.e., for any $h \in [H]$ in the planning phase, $\mu_h \in \hat C^P_h$.

Proof. (Lemma A.13) We first prove the following claim: under $\Psi^E_1$, for fixed $h \in [H]$ and any $k \in [K]$, with probability at least $1 - 2(H-h)\delta/H$, $\Psi^P_h$ holds.

We prove this claim by induction. Firstly, when $h = H$ the result is trivial. Assume the claim holds for $h+1 \le H$. Then, for any $k \in [K]$, under $\Psi^E_1$, with probability $1 - 2(H-h-1)\delta/H$, $\Psi^P_{h+1}$ holds. Subsequently, for any $k \in [K]$, by Lemmas A.11 and A.12, under $\Psi^E_1 \cap \Psi^P_{h+1}$, we have with probability $1 - 2\delta/H$ that $\mu_h \in \hat C^{P(1)}_h \cap \hat C^{P(2)}_h$.

By taking the union bound, for any $k \in [K]$, under $\Psi^E_1$, with probability $1 - 2(H-h)\delta/H$, $\Psi^P_h$ holds, which means the claim holds for $h$. Thus, the claim is proved by induction. Since $\mathbb{P}\{\Psi^E_1\} \ge 1-\delta$ by Lemma A.6, the conclusion is obtained by setting $h = 1$ and taking a union bound.

A.3.2 OPTIMISM IN PLANNING PHASE

Lemma A.14 (Optimism in Planning Phase). In Algorithm 2, for any $h \in [H]$, under $\Psi^P_h$, we have $V_h(s; r) \le \hat V_h(s)$ for all $s \in S$.

Proof. We prove the optimism by induction. Notice that the statement holds trivially when $h = H+1$, since $V_{H+1}(\cdot; r) = \hat V_{H+1}(\cdot) = 0$. Assume the statement holds for $h+1$, which means $V_{h+1}(\cdot; r) \le \hat V_{h+1}(\cdot)$. Since $Q_h(\cdot,\cdot;r) = r_h(\cdot,\cdot) + P_h V_{h+1}(\cdot,\cdot;r)$, we have for any $(s,a) \in S \times A$ that
$$\hat Q_h(s,a) - Q_h(s,a;r) = \hat Q_h(s,a) - \big[r_h(s,a) + P_h V_{h+1}(s,a;r)\big] \ge P_h \hat V_{h+1}(s,a) - P_h V_{h+1}(s,a;r) \ge 0,$$
where the first inequality is due to the Cauchy-Schwarz inequality together with $\mu_h \in \hat C^P_h$ under $\Psi^P_h$, and the last inequality holds by the induction assumption $\hat V_{h+1}(\cdot) \ge V_{h+1}(\cdot; r)$.

Thus, combining Eq. (41), (42), (43), (44) and Lemma C.6, we obtain the stated bound. Since for any $x > 0$, $a > 0$, $b > 0$, $x \le a\sqrt{x} + b$ implies $x \le 2b + a^2$, we obtain the claimed inequality, which finishes the proof.

Lemma A.20. In Algorithm 1, under $\Psi^E_1$, the following holds with probability at least $1-3\delta$.

Proof. On the one hand, by Lemma A.17, we have the first bound under $\Psi^E_1$. On the other hand, substituting Eq. (49) into Eq. (48), and combining Eq. (47), we obtain a bound that holds with probability at least $1-2\delta$. Finally, by the Hoeffding inequality (Lemma C.3), we obtain the stated inequality with probability at least $1-3\delta$. That finishes the proof.

We denote by $\Phi$ the event that the inequality in Lemma A.20 holds, which holds with probability at least $1-3\delta$.

A.5 PROOF OF THEOREM 4.2

In this subsection, we bound the sub-optimality gap of the recovered policy in the planning phase under the event $\Psi^E_1 \cap \Psi^P_1 \cap \Xi \cap \Phi$, which also completes the proof of Theorem 4.2. Lemma A.21.
For any $0 < \delta < 1$, with probability at least $1-\delta$, the following holds, where $r^k = \{r_{k,h}\}_{h\in[H]}$.

Proof. Since $V^*(s; r^k) \le H$ for all $s \in S$, we can apply the Azuma-Hoeffding inequality (Lemma C.3) to this martingale difference sequence and obtain the bound with probability at least $1-\delta$, where $T = KH$ and the expectation is taken over the probability distribution generated by the interconnection of the algorithm and the MDP.

Hard MDP Instance $M$. In this MDP, $S = \{x_1, x_2, \dots, x_{H+2}\}$, $A = \{-1, 1\}^{d-1}$, and the linear parameterization of $M$ is specified as $\phi(s, a) =$, where $\iota = 1/H$, $\Delta = \sqrt{\iota/K}/(4\sqrt{2})$, $\alpha = \sqrt{1/(1+\Delta(d-1))}$, and $\beta = \sqrt{\Delta/(1+\Delta(d-1))}$. Under this parameterization, only the transitions starting at $x_{H+2}$ generate a reward.

Then, for any algorithm ALG$_1$ running $K_1$ episodes to learn an $\epsilon$-optimal policy $\pi(K_1)$ with probability at least $1-\delta$, it suffices to prove that under the instance $M$, $K_1 \ge Cd^2H^3/\epsilon^2$. We prove this by constructing a new algorithm ALG$_2$. ALG$_2$ first runs ALG$_1$ for $K_1$ episodes, after which ALG$_1$ outputs an $\epsilon$-optimal policy $\pi(K_1)$ with probability at least $1-\delta$. ALG$_2$ then executes $\pi(K_1)$ for the following $(c-1)K_1$ episodes, which means ALG$_2$ runs $K_2 = cK_1$ episodes in total. Note that Lemma B.2 also gives an upper bound for the regret of every single step under $M$.

Lemma B.2. Suppose $3(d-1)\Delta \le \iota$. Then, for all $k \in [K_2]$, we have the stated bound.

Proof. (Lemma B.2) We prove this lemma by computing $V_1(x_1)$, following the standard analysis of this hard-to-learn MDP in Zhou et al. (2021). By definition of $M$, we have $(d-1)\Delta = \max_{a \in A}\langle \mu_h, a\rangle$. Since only $r_h(x_{H+2}, \cdot) = 1$, we have $V_1^\pi(x_1) = \sum_{h=1}^{H-1}(H-h)\,\mathbb{P}(N_h)$, where $N_h = \{s_{h+1} = x_{H+2}, s_h = x_h\}$. We also have that $\mathbb{P}(N_h)$ is expressed through the factors $(1 - \iota - \langle \mu_j, a^\pi_j\rangle)$; subsequently, rearranging the terms gives Eq. (56).

C.3 COVERING NET

Lemma C.5 (Lemma D.6 in Jin et al. (2020b)). Let $\hat{\mathcal{N}}_\varepsilon$ be the $\varepsilon$-covering of $\hat{\mathcal{V}}(L, B)$ with respect to the distance $\mathrm{dist}(V, V') = \sup_x |V(x) - V'(x)|$, where $\hat{\mathcal{V}}(L, B)$ is defined in Definition A.4.
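To make the covering idea concrete, the following minimal sketch builds an $\varepsilon$-net for the one-dimensional parameter interval $[-B, B]$ and checks the sup-distance coverage; the same grid construction, applied coordinate-wise to the parameters of the value-function class, is what drives the logarithmic covering-number bounds. The names `B` and `eps` are illustrative and not from the paper.

```python
import numpy as np

# Build an eps-net of [-B, B]: grid points spaced 2*eps apart, so every
# point of the interval is within eps of some net point.
B, eps = 5.0, 0.1
grid = np.arange(-B, B + 2 * eps, 2 * eps)  # the eps-net

# Check coverage on random samples from the interval.
samples = np.random.default_rng(2).uniform(-B, B, size=1000)
dist_to_net = np.abs(samples[:, None] - grid[None, :]).min(axis=1)
assert dist_to_net.max() <= eps + 1e-9

# The net size scales like B/eps, so log|net| ~ log(B/eps),
# which in d dimensions becomes d * log(B/eps).
print(len(grid))
```

This is why covering arguments only cost logarithmic factors in the final bounds: the union bound over the net pays $\log|\hat{\mathcal{N}}_\varepsilon|$, which is polynomial in $d$ and logarithmic in $1/\varepsilon$.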

C.4 ELLIPTICAL POTENTIALS

In this subsection, we present Lemma C.6 from Abbasi-Yadkori et al. (2011), which is an important tool for establishing the $O(\sqrt{T})$ worst-case regret for linear bandits or RL with linear function approximation. Moreover, we also present a property of elliptical potentials in Lemma C.7 from Hu et al. (2022), which states that the elliptical potentials are usually small.

Lemma C.6 (Lemma 11 in Abbasi-Yadkori et al. (2011)). Given $\lambda > 0$ and a sequence $\{x_t\}_{t=1}^T \subset \mathbb{R}^d$ with $\|x_t\|_2 \le L$ for all $t \in [T]$, define $Z_t = \lambda I + \sum_{i=1}^t x_i x_i^\top$ for $t \ge 1$ and $Z_0 = \lambda I$. We have
$$\sum_{t=1}^T \min\Big\{1, \|x_t\|^2_{Z_{t-1}^{-1}}\Big\} \le 2d\log\big(1 + TL^2/(d\lambda)\big).$$

Lemma C.7 (Hu et al. (2022)). Given $\lambda > 0$ and a sequence $\{x_t\}_{t=1}^T \subset \mathbb{R}^d$ with $\|x_t\|_2 \le L$ for all $t \in [T]$, define $Z_t = \lambda I + \sum_{i=1}^t x_i x_i^\top$ for $t \ge 1$ and $Z_0 = \lambda I$. During $[T]$, the number of times that $\|x_t\|_{Z_{t-1}^{-1}} \ge 1$ is at most $cd\log\big(1 + TL^2/(d\lambda)\big)$, where $c > 0$ is a constant.
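The elliptical potential bound of Lemma C.6 can be checked numerically: the following sketch accumulates $\min\{1, \|x_t\|^2_{Z_{t-1}^{-1}}\}$ along a random feature sequence and compares the sum against $2d\log(1 + TL^2/(d\lambda))$. The sequence here is random Gaussian directions normalized to norm $L$, an illustrative choice; the lemma holds for any bounded sequence.

```python
import numpy as np

# Numeric check of the elliptical potential bound (Lemma C.6):
# sum_t min(1, ||x_t||^2_{Z_{t-1}^{-1}}) <= 2 d log(1 + T L^2 / (d lam)).
rng = np.random.default_rng(3)
d, T, lam, L = 5, 500, 1.0, 1.0

Z = lam * np.eye(d)
total = 0.0
for _ in range(T):
    x = rng.normal(size=d)
    x *= L / max(np.linalg.norm(x), 1e-12)        # enforce ||x||_2 <= L
    total += min(1.0, x @ np.linalg.solve(Z, x))  # ||x||^2 in the Z^{-1} norm
    Z += np.outer(x, x)                           # Z_t = Z_{t-1} + x x^T

bound = 2 * d * np.log(1 + T * L**2 / (d * lam))
assert total <= bound
```

The bound grows only logarithmically in $T$, which is exactly why the sum of exploration bonuses over $K$ episodes contributes $\tilde O(\sqrt{T})$ rather than $O(T)$ to the regret.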

D COMPUTATIONAL TRACTABILITY

In this section, we analyze the space and computational complexity of the LSVI-RFE algorithm. Since Algorithm 2 in the planning phase performs only a single run of value iteration to compute the output policy $\{\pi_h\}_{h\in[H]}$, its computational requirement is equivalent to a single-episode run of Algorithm 1. Thus, it suffices to analyze only the space and computational complexity of Algorithm 1 in the exploration phase. Although we consider the linear MDP setting, where the number of states $|S|$ may be infinite, Algorithm 1 is computationally efficient, i.e., its space and computational complexities are polynomial in $d$, $H$, $K$ and $|A|$, and do not depend on $|S|$, where $K$ is the number of episodes that Algorithm 1 has run.

D.1 SPACE COMPLEXITY OF ALGORITHM 1

Though we give the explicit form of $\hat\mu_{k,h} \in \mathbb{R}^{d\times|S|}$ in Algorithm 1, we do not need to store it directly. In fact, we only need to store $\hat\mu_{k,h}\hat V_{k,h+1} \in \mathbb{R}^d$, since only this term is used in computing the Q-function $\hat Q_{k,h}(s,a)$ for fixed $s$ and $a$. Moreover, we only visit the states $\{s_h^k : h \in [H], k \in [K]\}$ in Algorithm 1. Thus, Algorithm 1 only needs to store $\hat\mu_{k,h}\hat V_{k,h+1}$, $\hat\Lambda_{k,h}$, $\hat\sigma_{k,h}$ and $\phi(s_h^k, a)$ for $h \in [H]$, $k \in [K]$, $a \in A$. The total space complexity is $O(d^2H + d|A|HK)$, where $K$ is the number of episodes that Algorithm 1 has run.
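The storage trick above can be sketched in a few lines: instead of the $d \times |S|$ matrix $\hat\mu_{k,h}$, we keep only the $d$-vector $\hat\mu_{k,h}\hat V_{k,h+1}$, which is all that Q-value queries need. The code emulates the matrix on a small finite state set purely to demonstrate the identity; in the algorithm the matrix is never materialized, and all names are illustrative.

```python
import numpy as np

# Storage sketch: the d-vector mu_hat @ V_next suffices for Q-value queries,
# so the d x |S| matrix mu_hat never needs to be stored.
rng = np.random.default_rng(4)
d, n_states = 4, 1000  # n_states stands in for a (possibly huge) |S|

mu_hat = rng.normal(size=(d, n_states))  # conceptual d x |S| parameter
V_next = rng.normal(size=n_states)       # hat-V_{k,h+1} evaluated per state

muV = mu_hat @ V_next                    # the only d-vector we must store

phi_sa = rng.normal(size=d)              # phi(s, a) for a query pair
q_full = phi_sa @ (mu_hat @ V_next)      # Q-term via the full matrix
q_compact = phi_sa @ muV                 # Q-term via the stored d-vector
assert np.allclose(q_full, q_compact)
```

Associativity of the matrix product is the whole point: $\phi^\top(\hat\mu \hat V) = (\phi^\top\hat\mu)\hat V$, so precomputing $\hat\mu\hat V$ removes every $|S|$-dependent object from memory.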

D.2 COMPUTATIONAL COMPLEXITY OF ALGORITHM 1

In Algorithm 1, Lines 6-12 show how to compute the estimated value function $\hat V_{k,h}$ and the policy $\pi^k_h(\cdot)$. Notice that only $\hat\mu_{k,h}\hat V_{k,h+1}$ and $\pi^k_h(s_h^k)$ contribute to the observation steps (Line 15) and to the calculation of $\hat\sigma_{k,h}$, $\hat\Lambda_{k+1,h}$ and $\hat\mu_{k+1,h}$ (Lines 16-26). We consider the computational cost of the following three parts for fixed $k \in [K]$.

1. Calculating the policy $\pi^k_h(s_h^k)$ for $s_h^k$, $h \in [H]$. Assume that we have already observed $s_h^k$ for some $h \in [H]$ and calculated $\hat\mu_{k,h}\hat V_{k,h+1}$. By definition, $\pi^k_h(s_h^k) = \arg\max_{a\in A}\hat Q_{k,h}(s_h^k, a)$, so we can determine $\pi^k_h(s_h^k)$ by calculating $\hat Q_{k,h}(s_h^k, a)$ for all $a \in A$.

2. Calculating $\hat\mu_{k+1,h}\hat V_{k+1,h+1}$. We need to calculate $\hat V_{k+1,h+1}(s_{h+1}^i)$ for $i \in [k]$, which means we have to calculate $\hat Q_{k+1,h+1}(s_{h+1}^i, a)$ for any $i \in [k]$ and $a \in A$. This takes $O(d^2|A|K)$ operations. With known $\hat V_{k+1,h+1}(s_{h+1}^i)$, we need $O(d^2K)$ operations to calculate the remaining term. Thus, calculating $\hat\mu_{k+1,h}\hat V_{k+1,h+1}$ takes $O(d^2|A|HK)$ operations over all $h \in [H]$ for fixed $k \in [K]$.

3. Calculating $\hat\sigma_{k,h}$ and $\hat\Lambda_{k+1,h}$. Notice that we can compute $\tilde\Lambda_{k+1,h}$ and $\hat\Lambda_{k+1,h}$ by the Sherman-Morrison formula, which takes $O(d^2)$ operations. By Lines 16-24, calculating $\hat\sigma_{k,h}$ takes another $O(d^2)$ operations.

By the arguments above, for episode $k$, we need $O(d^2|A|HK)$ operations. Thus, the full algorithm costs $O(d^2|A|HK^2)$ operations, which is polynomial in $d$, $|A|$, $H$ and $K$, and independent of $|S|$.
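The Sherman-Morrison step can be sketched as follows: maintaining $\Lambda^{-1}$ under a weighted rank-one update $\Lambda \leftarrow \Lambda + w\,\phi\phi^\top$ costs $O(d^2)$ instead of the $O(d^3)$ of a fresh inversion. The function name and the random weights (standing in for $\hat\sigma^{-2}$ factors) are illustrative.

```python
import numpy as np

def sherman_morrison_update(Lam_inv, phi, w=1.0):
    """Return (Lam + w * phi phi^T)^{-1} given Lam^{-1}, in O(d^2)."""
    u = Lam_inv @ phi
    return Lam_inv - np.outer(u, u) * (w / (1.0 + w * (phi @ u)))

rng = np.random.default_rng(5)
d, lam = 6, 1.0
Lam = lam * np.eye(d)
Lam_inv = np.linalg.inv(Lam)

for _ in range(20):
    phi = rng.normal(size=d)
    w = rng.uniform(0.5, 2.0)          # e.g. a sigma^{-2}-style weight
    Lam += w * np.outer(phi, phi)      # the O(d^2) covariance update
    Lam_inv = sherman_morrison_update(Lam_inv, phi, w)

# The incrementally maintained inverse matches a direct inversion.
assert np.allclose(Lam_inv, np.linalg.inv(Lam))
```

Since each of the $HK$ covariance updates is rank-one, this is what keeps the per-episode cost quadratic rather than cubic in $d$.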

